Egalitarian Paxos
A pluggable implementation of the Egalitarian Paxos Consensus Protocol
Paxos is a protocol for solving consensus through state machine replication in
an asynchronous environment with unreliable processes. It can tolerate up to F
concurrent replica failures with 2F+1 total replicas. This consensus protocol is
then extended with a stable leader optimization to a replication protocol
(commonly referred to as Multi-Paxos) to assign global, persistent, total order
to a sequence of client updates. The protocol works by having multiple replicas
work in parallel to maintain the same state. This state is updated on each
request from a client by each replica, allowing it to be automatically
replicated and preserved even in the case of failures. The basic algorithm was
famously described by Leslie Lamport in his 1998 paper, The Part-Time
Parliament.
It was later clarified in his follow-up paper from 2001, Paxos Made
Simple.
Egalitarian Paxos is an efficient, leaderless variation of this protocol,
proposed by Iulian Moraru in his 2013 paper, There Is More Consensus in
Egalitarian
Parliaments.
It provides strong consistency with optimal wide-area latency, perfect
load-balancing across replicas (both in the local and the wide area), and
constant availability for up to F failures. Concretely, it provides the
following properties:
- High throughput, low latency
- Constant availability
- Load distributed evenly across all replicas (no leader)
- Limited by fastest replicas, not slowest
- Can always use closest replicas (low latency)
- 1 round-trip fast path
It does so by breaking the global command slot space into subspaces, each owned
by a single replica. Replicas then attach ordering constraints to each command
while voting on them to allow for proper ordering during command execution. For
more intuition on how this works, check out the presentation given at SOSP
'13, and for a full technical
report and proof of correctness of the protocol, check out A Proof of
Correctness for Egalitarian
Paxos.
This library is implemented with a minimalistic philosophy. The main epaxos
package implements only the core EPaxos algorithm, with storage handling,
network transport, and physical clocks left to the clients of the library. This
minimalism buys flexibility, determinism, and performance. The design was
heavily inspired by CoreOS's raft
library.
Features
The epaxos implementation is a full implementation of the Egalitarian Paxos
replication protocol. Features include:
- Command replication
- Command compaction
- Persistence
- Failure Recovery
Features not yet implemented:
- Explicit Prepare Phase
- Optimized Egalitarian Paxos (smaller fast path quorum)
- Membership changes
- Batched commands
- Thrifty operation (see paper)
- Quorum leases
- Snapshots
Building
Run make
or make test
to run all tests against the library
Run make clean
to clean all build artifacts
Run make check
to perform linting and static analysis
Testing
The project comes with an automated test suite which contains both direct unit
tests to test pieces of functionality within the EPaxos state machine, and larger
network tests that test a network of EPaxos nodes. The unit tests are scattered
throughout the epaxos/*_test.go
files, while the network tests are located in
the epaxos/epaxos_test.go
file.
To run all tests, run the command make test
Library Interface
The library is designed around the the epaxos
type, which is a single-threaded
state machine implementing the Egalitarian Paxos consensus protocol. The state
machine can be interacted with only through a Node
instance, which is a
thread-safe handle to a epaxos
state machine.
Because the library pushes tasks like storage handling and network transport up
to the users of the library, these users have a few responsibilities. In a loop,
the user should read from the Node.Ready
channel and process the updates it
contains. These Ready
struct will contain any updates to the persistent state
of the node that should be synced to disk, and messages that need to be
delivered to other nodes, and any commands that have been successfully committed
and that are ready to be executed. The user should also periodically call
Node.Tick
in regular interval (probably via a time.Ticker
).
Together, the state machine handling loop will look something like:
for {
select {
case <-ticker.C:
node.Tick()
case rd := <-node.Ready():
for _, msg := range rd.Messages {
send(msg)
}
for _, cmd := range rd.ExecutableCommands {
execute(cmd)
}
case <-ctx.Done():
return
}
}
To propose a change to the state machine, first construct a pb.Command
message. The pb.Command
message contains both an arbitrary byte slice to hold
client updates and additional metadata fields to related to command
interference. Use of these metadata fields in pb.Command
is the mechanism in
which clients of the library express application-specific command interference
semantics. pb.Commands
operate in a virtual keyspace, and each command
operates over a subset of this keyspace, which is expressed in the Span
field.
pb.Commands
can also be specified as reads or writes using the Writing
field. Interference between commands is then defined as two commands whose
Spans
overlap, where at-least one of the commands is Writing
.
After a pb.Command
is constructed with the desired update and the necessary
interference constrains, call:
node.Propose(ctx, command)
Once executable, the pb.Command
will appear in the rd.ExecutableCommands
slice.