unicast

package
v0.32.8 Latest Latest
Warning

This package is not in the latest version of its module.

Go to latest
Published: Nov 10, 2023 License: AGPL-3.0 Imports: 20 Imported by: 3

README

Unicast Manager

Overview

In Flow blockchain, nodes communicate with each other in 3 different ways; unicast, multicast, and publish. The multicast and publish are handled by the pubsub (GossipSub) protocol. The unicast is a protocol that is used to send messages over direct (one-to-one) connections to remote nodes. Each unicast message is sent through a single-used, one-time stream. One can see a stream as a virtual protocol that expands the base direct connection into a full-duplex communication channel. Figure below illustrates the notion of direct connection and streams between nodes A and B. The direct connection is established between the nodes and then the nodes can open multiple streams over the connection. The streams are shown with dashed green lines, while the direct connection is illustrated by blue lines that encapsulates the streams. streams.png

The unicast Manager is responsible for establishing streams between nodes when they need to communicate over unicast protocol. When the manager receives a CreateStream invocation, it will try to establish a stream to the remote peer whose identifier is provided in the invocation (peer.ID). The manager is expanding the libp2p functionalities, hence, it operates on the notion of the peer (rather than Flow node), and peer.ID rather than flow.Identifier. It is the responsibility of the caller to provide the correct peer.ID of the remote node.

If there is existing connection between the local node and the remote node, the manager will try to establish a connection first, and then open a stream over the connection. The connection is assumed persistent, i.e., it will be kept until certain events such as Flow node shutdown, restart, disallow-listing of either ends of the connection by each other, etc. However, a stream is a one-time communication channel, i.e., it is assumed to be closed by the caller once the message is sent. The caller (i.e., the Flow node) does not necessarily re-use a stream, and the Manager creates one stream per request (i.e., CreateStream invocation), which is typically a single message. However, we have certain safeguards in place to prevent nodes from establishing more than one connection to each other. That is why the Manager establishes the connection only when there is no existing connection between the nodes, and otherwise re-uses the existing connection.

Note: pubsub protocol also establishes connections between nodes to exchange gossip messages with each other. The connection type is the same between pubsub and unicast protocols, as they both consult the underlying LibP2P node to establish the connection. However, the level of reliability, life-cycle, and other aspects of the connections are different between the two protocols. For example, pubsub requires some number of connections to some number of peers, which in most cases is regardless of their identity. However, unicast requires a connection to a specific peer, and the connection is assumed to be persistent. Hence, both these protocols have their own notion of connection management; the unicast Manager is responsible for establishing connections when unicast protocol needs to send a message to a remote peer, while the PeerManager is responsible for establishing connections when pubsub. These two work in isolation and independent of each other to satisfy different requirements.

The PeerManager regularly checks the health of the connections and closes the connections to the peers that are not part of the Flow protocol state. One the other hand, the unicast Manager only establishes a connection if there is no existing connection to the remote peer. Currently, Flow nodes operate on a full mesh topology, meaning that every node is connected to every other node through PeerManager. The PeerManager starts connecting to every remote node of the Flow protocol upon startup, and then maintains the connections unless the node is disallow-listed or ejected by the protocol state. Accordingly, it is a rare event that a node does not have a connection to another node. Also, that is the reason behind the unicast Manager not closing the connection after the stream is closed. The unicast Manager assumes that the connection is persistent and will be kept open by the PeerManager.

Backoff and Retry Attempts

The flowchart below explains the abstract logic of the UnicastManager when it receives a CreateStream invocation. One a happy path, the UnicastManager expects a connection to the remote peer exists and hence it can successfully open a stream to the peer. However, there can be cases that the connection does not exist, the remote peer is not reliable for stream creation, or the remote peer acts maliciously and does not respond to connection and stream creation requests. In order to distinguish between the cases that the remote peer is not reliable and the cases that the remote peer is malicious, the UnicastManager uses a backoff and retry mechanism.

retry.png

Addressing Unreliable Remote Peer

To address the unreliability of remote peer, upon an unsuccessful attempt to establish a connection or stream, the UnicastManager will wait for a certain amount of time before it tries to establish (i.e., the backoff mechanism), and will retry a certain number of times before it gives up (i.e., the retry mechanism). The backoff and retry parameters are configurable through runtime flags. If all backoff and retry attempts fail, the UnicastManager will return an error to the caller. The caller can then decide to retry the request or not. By default, UnicastManager retries each connection (dialing) attempt as well as stream creation attempt 3 times. Also, the backoff intervals for dialing and stream creation are initialized to 1 second and progress exponentially with a factor of 2, i.e., the i-th retry attempt is made after t * 2^(i-1), where t is the backoff interval. The formulation is the same for dialing and stream creation. For example, if the backoff interval is 1s, the first attempt is made right-away, the first (retry) attempt is made after 1s * 2^(1 - 1) = 1s, the third (retry) attempt is made after 1s * 2^(2 - 1) = 2s, and so on.

These parameters are configured using the config/default-config.yml file:

  # Unicast create stream retry delay is initial delay used in the exponential backoff for create stream retries
  unicast-create-stream-retry-delay: 1s
  # The backoff delay used in the exponential backoff for consecutive failed unicast dial attempts to a remote peer.
  unicast-dial-backoff-delay: 1s
Addressing Concurrent Stream Creation Attempts

There might be the case that multiple threads attempt to create a stream to the same remote peer concurrently, while there is no existing connection to the remote peer. In such cases, the UnicastManager will let the first attempt to create a stream to the remote peer to proceed with dialing while it will backoff the concurrent attempts for a certain amount of time. The backoff delay is configurable through the config/default-config.yml file:

  # The backoff delay used in the exponential backoff for backing off concurrent create stream attempts to the same remote peer
  # when there is no available connections to that remote peer and a dial is in progress.
  unicast-dial-in-progress-backoff-delay: 1s

This is done for several reasons including:

  • The resource manager of the remote peer may block the concurrent dial attempts if they exceed a certain threshold.
  • As a convention in networking layer, we don't desire more than one connection to a remote peer, and there are hard reactive constraints in place. However, as a soft proactive measure, we backoff concurrent dial attempts to the same remote peer to prevent multiple connections to the same peer.
  • Dialing is a resource-intensive operation, and we don't want to waste resources on concurrent dial attempts to the same remote peer.
Addressing Malicious Remote Peer

The backoff and retry mechanism is used to address the cases that the remote peer is not reliable. However, there can be cases that the remote peer is malicious and does not respond to connection and stream creation requests. Such cases may cause the UnicastManager to wait for a long time before it gives up, resulting in a resource exhaustion and slow-down of the dialing node. To mitigate such cases, the UnicastManager uses a retry budget for the stream creation and dialing. The retry budgets are initialized using the config/default-config.yml file:

  # The maximum number of retry attempts for creating a unicast stream to a remote peer before giving up. If it is set to 3 for example, it means that if a peer fails to create
  # retry a unicast stream to a remote peer 3 times, the peer will give up and will not retry creating a unicast stream to that remote peer.
  # When it is set to zero it means that the peer will not retry creating a unicast stream to a remote peer if it fails.
  unicast-max-stream-creation-retry-attempt-times: 3
  # The maximum number of retry attempts for dialing a remote peer before giving up. If it is set to 3 for example, it means that if a peer fails to dial a remote peer 3 times,
  # the peer will give up and will not retry dialing that remote peer.
  unicast-max-dial-retry-attempt-times: 3

As shown in the above snippet, both retry budgets for dialing and stream creation are set to 3 by default for every remote peer. Each time the UnicastManager is invoked on CreateStream to pid (peer.ID), it loads the retry budgets for pid from the dial config cache. If no dial config record exists for pid, one is created with the default retry budgets. The UnicastManager then uses the retry budgets to decide whether to retry the dialing or stream creation attempt or not. If the retry budget for dialing or stream creation is exhausted, the UnicastManager will not retry the dialing or stream creation attempt, respectively, and returns an error to the caller. The caller can then decide to retry the request or not.

Penalizing Malicious Remote Peer

Each time the UnicastManager fails to dial or create a stream to a remote peer and exhausts the retry budget, it penalizes the remote peer as follows:

  • If the UnicastManager exhausts the retry budget for dialing, it will decrement the dial retry budget as well as the stream creation retry budget for the remote peer.
  • If the UnicastManager exhausts the retry budget for stream creation, it will decrement the stream creation retry budget for the remote peer.
  • If the retry budget reaches zero, the UnicastManager will only attempt once to dial or create a stream to the remote peer, and will not retry the attempt, and rather return an error to the caller.
  • When any of the budgets reaches zero, the UnicastManager will not decrement the budget anymore.

Note: UnicastManager is part of the networking layer of the Flow node, which is a lower-order component than the Flow protocol engines who call the UnicastManager to send messages to remote peers. Hence, the UnicastManager must not outsmart the Flow protocol engines on deciding whether to dial or create stream in the first place. This means that UnicastManager will attempt to dial and create stream even to peers with zero retry budgets. However, UnicastManager does not retry attempts for the peers with zero budgets, and rather returns an error immediately upon a failure. This is the responsibility of the Flow protocol engines to decide whether to send a message to a remote peer or not after a certain number of failures.

Restoring Retry Budgets

The UnicastManager may reset the dial and stream creation budgets for a remote peers from zero to the default values in the following cases:

  • Restoring Stream Creation Retry Budget: To restore the stream creation budget from zero to the default value, the UnicastManager keeps track of the consecutive successful streams created to the remote peer. Everytime a stream is created successfully, the UnicastManager increments a counter for the remote peer. The counter is reset to zero upon the first failure to create a stream to the remote peer. If the counter reaches a certain threshold, the UnicastManager will reset the stream creation budget for the remote peer to the default value. The threshold is configurable through the config/default-config.yml file:
    # The minimum number of consecutive successful streams to reset the unicast stream creation retry budget from zero to the maximum default. If it is set to 100 for example, it
    # means that if a peer has 100 consecutive successful streams to the remote peer, and the remote peer has a zero stream creation budget,
    # the unicast stream creation retry budget for that remote peer will be reset to the maximum default.
    unicast-stream-zero-retry-reset-threshold: 100
    
    Reaching the threshold means that the remote peer is reliable enough to regain the default retry budget for stream creation.
  • Restoring Dial Retry Budget: To restore the dial retry budget from zero to the default value, the UnicastManager keeps track of the last successful dial time to the remote peer. Every failed dialing attempt will reset the last successful dial time to zero. If the time since the last successful dialing attempt reaches a certain threshold, the UnicastManager will reset the dial budget for the remote peer to the default value. The threshold is configurable through the config/default-config.yml file:
      # The number of seconds that the local peer waits since the last successful dial to a remote peer before resetting the unicast dial retry budget from zero to the maximum default.
    # If it is set to 3600s (1h) for example, it means that if it has passed at least one hour since the last successful dial, and the remote peer has a zero dial retry budget,
    # the unicast dial retry budget for that remote peer will be reset to the maximum default.
    unicast-dial-zero-retry-reset-threshold: 3600s
    
    Reaching the threshold means that either the UnicastManager has not dialed the remote peer for a long time, and the peer deserves a chance to regain its dial retry budget, or the remote peer maintains a persistent connection to the local peer, for a long time, and deserves a chance to regain its dial retry budget. Note that the networking layer enforces a maximum number of one connection to a remote peer, hence the remote peer cannot have multiple connections to the local peer. Also, connection establishment is assumed a more resource-intensive operation than the stream creation,
    hence, in contrast to the stream reliability that is measured by the number of consecutive successful streams, the dial reliability is measured by the time since the last successful dial.

Documentation

Index

Constants

View Source
const (
	// MaxRetryJitter is the maximum number of milliseconds to wait between attempts for a 1-1 direct connection
	MaxRetryJitter = 5
)

Variables

This section is empty.

Functions

func IsErrDialInProgress added in v0.30.0

func IsErrDialInProgress(err error) bool

IsErrDialInProgress returns whether an error is ErrDialInProgress

func IsErrMaxRetries added in v0.30.0

func IsErrMaxRetries(err error) bool

IsErrMaxRetries returns whether an error is ErrMaxRetries.

Types

type DialConfig added in v0.32.2

type DialConfig struct {
	DialRetryAttemptBudget           uint64    // number of times we have to try to dial the peer before we give up.
	StreamCreationRetryAttemptBudget uint64    // number of times we have to try to open a stream to the peer before we give up.
	LastSuccessfulDial               time.Time // timestamp of the last successful dial to the peer.
	ConsecutiveSuccessfulStream      uint64    // consecutive number of successful streams to the peer since the last time stream creation failed.
}

DialConfig is a struct that represents the dial config for a peer.

type DialConfigAdjustFunc added in v0.32.2

type DialConfigAdjustFunc func(DialConfig) (DialConfig, error)

DialConfigAdjustFunc is a function that is used to adjust the fields of a DialConfigEntity. The function is called with the current config and should return the adjusted record. Returned error indicates that the adjustment is not applied, and the config should not be updated. In BFT setup, the returned error should be treated as a fatal error.

type DialConfigCache added in v0.32.2

type DialConfigCache interface {
	// GetOrInit returns the dial config for the given peer id. If the config does not exist, it creates a new config
	// using the factory function and stores it in the cache.
	// Args:
	// - peerID: the peer id of the dial config.
	// Returns:
	//   - *DialConfig, the dial config for the given peer id.
	//   - error if the factory function returns an error. Any error should be treated as an irrecoverable error and indicates a bug.
	GetOrInit(peerID peer.ID) (*DialConfig, error)

	// Adjust adjusts the dial config for the given peer id using the given adjustFunc.
	// It returns an error if the adjustFunc returns an error.
	// Args:
	// - peerID: the peer id of the dial config.
	// - adjustFunc: the function that adjusts the dial config.
	// Returns:
	//   - error if the adjustFunc returns an error. Any error should be treated as an irrecoverable error and indicates a bug.
	Adjust(peerID peer.ID, adjustFunc DialConfigAdjustFunc) (*DialConfig, error)

	// Size returns the number of dial configs in the cache.
	Size() uint
}

DialConfigCache is a thread-safe cache for dial configs. It is used by the unicast service to store the dial configs for peers.

type DialConfigCacheFactory added in v0.32.2

type DialConfigCacheFactory func(configFactory func() DialConfig) DialConfigCache

type ErrDialInProgress added in v0.30.0

type ErrDialInProgress struct {
	// contains filtered or unexported fields
}

ErrDialInProgress indicates that the libp2p node is currently dialing the peer.

func NewDialInProgressErr added in v0.30.0

func NewDialInProgressErr(pid peer.ID) ErrDialInProgress

NewDialInProgressErr returns a new ErrDialInProgress.

func (ErrDialInProgress) Error added in v0.30.0

func (e ErrDialInProgress) Error() string

type ErrMaxRetries added in v0.30.0

type ErrMaxRetries struct {
	// contains filtered or unexported fields
}

ErrMaxRetries indicates retries completed with max retries without a successful attempt.

func NewMaxRetriesErr added in v0.30.0

func NewMaxRetriesErr(attempts uint64, err error) ErrMaxRetries

NewMaxRetriesErr returns a new ErrMaxRetries.

func (ErrMaxRetries) Error added in v0.30.0

func (e ErrMaxRetries) Error() string

type Manager

type Manager struct {
	// contains filtered or unexported fields
}

Manager manages libp2p stream negotiation and creation, which is utilized for unicast dispatches.

func NewUnicastManager

func NewUnicastManager(cfg *ManagerConfig) (*Manager, error)

NewUnicastManager creates a new unicast manager. Args:

  • cfg: configuration for the unicast manager.

Returns:

  • a new unicast manager.
  • an error if the configuration is invalid, any error is irrecoverable.

func (*Manager) CreateStream

func (m *Manager) CreateStream(ctx context.Context, peerID peer.ID) (libp2pnet.Stream, error)

CreateStream tries establishing a libp2p stream to the remote peer id. It tries creating streams in the descending order of preference until it either creates a successful stream or runs out of options. Args:

  • ctx: context for the stream creation.
  • peerID: peer ID of the remote peer.

Returns:

  • a new libp2p stream.
  • error if the stream creation fails; the error is benign and can be retried.

func (*Manager) Register

func (m *Manager) Register(protocol protocols.ProtocolName) error

Register registers given protocol name as preferred unicast. Each invocation of register prioritizes the current protocol over previously registered ones.

func (*Manager) SetDefaultHandler added in v0.32.0

func (m *Manager) SetDefaultHandler(defaultHandler libp2pnet.StreamHandler)

SetDefaultHandler sets the default stream handler for this unicast manager. The default handler is utilized as the core handler for other unicast protocols, e.g., compressions.

type ManagerConfig added in v0.32.2

type ManagerConfig struct {
	Logger        zerolog.Logger               `validate:"required"`
	StreamFactory p2p.StreamFactory            `validate:"required"`
	SporkId       flow.Identifier              `validate:"required"`
	ConnStatus    p2p.PeerConnections          `validate:"required"`
	Metrics       module.UnicastManagerMetrics `validate:"required"`

	// CreateStreamBackoffDelay is the backoff delay between retrying stream creations to the same peer.
	CreateStreamBackoffDelay time.Duration `validate:"gt=0"`

	// DialInProgressBackoffDelay is the backoff delay for parallel attempts on dialing to the same peer.
	// When the unicast manager is invoked to create stream to the same peer concurrently while there is
	// already an ongoing dialing attempt to the same peer, the unicast manager will wait for this backoff delay
	// and retry creating the stream after the backoff delay has elapsed. This is to prevent the unicast manager
	// from creating too many parallel dialing attempts to the same peer.
	DialInProgressBackoffDelay time.Duration `validate:"gt=0"`

	// DialBackoffDelay is the backoff delay between retrying connection to the same peer.
	DialBackoffDelay time.Duration `validate:"gt=0"`

	// StreamZeroRetryResetThreshold is the threshold that determines when to reset the stream creation retry budget to the default value.
	//
	// For example the default value of 100 means that if the stream creation retry budget is decreased to 0, then it will be reset to default value
	// when the number of consecutive successful streams reaches 100.
	//
	// This is to prevent the retry budget from being reset too frequently, as the retry budget is used to gauge the reliability of the stream creation.
	// When the stream creation retry budget is reset to the default value, it means that the stream creation is reliable enough to be trusted again.
	// This parameter mandates when the stream creation is reliable enough to be trusted again; i.e., when the number of consecutive successful streams reaches this threshold.
	// Note that the counter is reset to 0 when the stream creation fails, so the value of for example 100 means that the stream creation is reliable enough that the recent
	// 100 stream creations are all successful.
	StreamZeroRetryResetThreshold uint64 `validate:"gt=0"`

	// DialZeroRetryResetThreshold is the threshold that determines when to reset the dial retry budget to the default value.
	// For example the threshold of 1 hour means that if the dial retry budget is decreased to 0, then it will be reset to default value
	// when it has been 1 hour since the last successful dial.
	//
	// This is to prevent the retry budget from being reset too frequently, as the retry budget is used to gauge the reliability of the dialing a remote peer.
	// When the dial retry budget is reset to the default value, it means that the dialing is reliable enough to be trusted again.
	// This parameter mandates when the dialing is reliable enough to be trusted again; i.e., when it has been 1 hour since the last successful dial.
	// Note that the last dial attempt timestamp is reset to zero when the dial fails, so the value of for example 1 hour means that the dialing to the remote peer is reliable enough that the last
	// successful dial attempt was 1 hour ago.
	DialZeroRetryResetThreshold time.Duration `validate:"gt=0"`

	// MaxDialRetryAttemptTimes is the maximum number of attempts to be made to connect to a remote node to establish a unicast (1:1) connection before we give up.
	MaxDialRetryAttemptTimes uint64 `validate:"gt=0"`

	// MaxStreamCreationRetryAttemptTimes is the maximum number of attempts to be made to create a stream to a remote node over a direct unicast (1:1) connection before we give up.
	MaxStreamCreationRetryAttemptTimes uint64 `validate:"gt=0"`

	// DialConfigCacheFactory is a factory function to create a new dial config cache.
	DialConfigCacheFactory DialConfigCacheFactory `validate:"required"`
}

Directories

Path Synopsis

Jump to

Keyboard shortcuts

? : This menu
/ : Search site
f or F : Jump to
y or Y : Canonical URL