Documentation ¶
Index ¶
Constants ¶
This section is empty.
Variables ¶
var DefaultTimeout = 10 * time.Minute
DefaultTimeout is the default timeout for all gather.
var ErrAllGatherTimeoutExceeded = fmt.Errorf(
"some ranks are taking a long time to connect to master " +
"during all gather; when running on kubernetes this may happen " +
"because only some of the pods have been scheduled; it is possible " +
"that some pods will never be scheduled without adding compute " +
"resources or pausing / killing other experiments in the cluster",
)
ErrAllGatherTimeoutExceeded indicates that we not halt within the expected deadline.
var ErrClosed = fmt.Errorf("left or closed")
ErrClosed is returned from a closed and incomplete allgather.
var ErrReconnected = fmt.Errorf("another watcher with the same ID connected")
ErrReconnected indicates another watcher connected with the same ID. Only one watcher should connect per ID. Anyone attempted to synchronize more things should use more `numPeers` and different IDs.
Functions ¶
Types ¶
type Watcher ¶
type Watcher struct {
C <-chan Result
}
Watcher signals all gather completion via a channel which is closed upon said completion.
func Join ¶
func Join( groupID string, id uuid.UUID, numPeers int, data any, ready func(), timeout func(error), ) Watcher
Join adds the member with `id` to the allgather group `groupID`. The allgather waits until `numPeers` members are waiting then gives all members all the submitted `data` and fires the `ready` callback. If the `DefaultTimeout` is exceeded before `ready`, the `timeout` callback fires. Note, the data is not a copy, it should not be mutated.