Distributed Consensus: Paxos & Raft
How do five servers scattered across data-centres agree on which command to execute next — even if two of them crash and the network drops some messages? The answer is consensus, and it is much harder than it looks.
1. The Consensus Problem
Formally, consensus requires that a set of processes each propose a value, and they all eventually agree on exactly one of the proposed values. Three properties must hold:
- Agreement: No two correct processes decide different values.
- Validity: The decided value was proposed by some process (no value pulled from thin air).
- Termination: Every correct process eventually decides.
In practice we use replicated state machines: every server stores an append-only log of commands. Consensus ensures all servers execute the same commands in the same order. This is the foundation of databases like CockroachDB, etcd (Kubernetes), and TiKV.
2. FLP & CAP — Impossibility Results
FLP Impossibility (1985)
Fischer, Lynch, and Paterson proved that in a purely asynchronous distributed system with even one crashable process, no deterministic consensus algorithm can always terminate — it might run forever in bad cases. In practice this means we need timeouts and randomisation (or synchrony assumptions) to make progress.
CAP Theorem
Brewer's CAP theorem states a distributed system can guarantee at most two of:
Consistency (C)
Every read returns the most recent write or an error. All nodes see the same data at the same time.
Availability (A)
Every request receives a response (not necessarily the latest data). The system is always up.
Partition Tolerance (P)
The system keeps working despite arbitrary network partitions between nodes.
In practice
Networks do partition, so P is required. Systems choose: CP (strong consistency, may reject requests) or AP (eventual consistency, always responds).
Raft and Paxos are CP systems. During a partition, the minority side refuses requests to preserve consistency.
3. Paxos: The Classic Algorithm
Paxos was described by Lamport in 1989 (published 1998). It runs in two phases per consensus round (instance):
Phase 1 — Prepare
A proposer picks a ballot number n (higher than
any it has seen) and broadcasts Prepare(n) to a majority
of acceptors. Acceptors reply with:
- Promise: "I will not accept ballots < n."
- Highest accepted value (if any): the proposer must preserve it.
Phase 2 — Accept
The proposer picks a value (its own, or the highest previously
accepted value from phase 1), sends Accept(n, v) to
acceptors. Acceptors accept if they have not promised a higher ballot.
Once a majority accepts, the value is chosen.
then any value chosen at ballot n' > n must also be v.
Proof sketch: majority (quorum) overlaps ensure at least
one acceptor in the new quorum saw the old accepted value.
4. Raft: Understandability First
Raft was designed by Ongaro & Ousterhout (2014) with an explicit goal: be easier to understand than Paxos. Key design choices:
- Strong leader: all log entries flow through a single leader.
- Randomised timeouts: simple, probabilistic leader election.
- Membership changes: joint consensus for safe cluster reconfiguration.
- Log matching: if two logs have the same entry at the same index, all preceding entries are identical.
Raft decomposes consensus into three independent sub-problems: leader election, log replication, and safety.
5. Leader Election
Each server is in one of three states: Follower, Candidate, or Leader. Time is divided into terms (monotonically increasing integers).
- Followers wait for a heartbeat from the leader. Each follower has a randomised election timeout (150–300 ms).
-
If the timeout expires without a heartbeat, the follower becomes a
Candidate and starts an election for term
T+1. -
The candidate votes for itself and sends
RequestVote(term, lastLogIndex, lastLogTerm)to all peers. - A server grants its vote if: it has not voted in this term AND the candidate's log is at least as up-to-date as its own.
- A candidate that receives a majority of votes becomes leader. It immediately sends heartbeats to stop other elections.
AND self.votedFor ∈ {null, candidate.id}
AND candidate.log is at-least-as-up-to-date as self.log
"At least as up-to-date": candidate's lastLogTerm > self.lastLogTerm
OR (equal term AND
candidate.lastLogIndex ≥ self.lastLogIndex)
6. Log Replication
Clients send all commands to the leader. The leader appends the
command to its log (with the current term), then replicates it in
parallel to all followers via AppendEntries RPCs.
| Index | Term | Command | Status |
|---|---|---|---|
| 1 | 1 | SET x=5 | Committed |
| 2 | 1 | SET y=3 | Committed |
| 3 | 2 | DEL x | Committed |
| 4 | 3 | SET z=9 | Pending (not yet majority) |
An entry is committed once the leader has replicated
it to a majority of servers. The leader then applies the command to
its state machine and responds to the client. It notifies followers of
the commit index in future AppendEntries calls.
Log Matching Property
When the leader sends
AppendEntries(prevLogIndex, prevLogTerm, entries), the
follower checks that its entry at prevLogIndex has term
prevLogTerm. If not, it rejects and the leader retries
from an earlier index. This consistency check ensures logs never
diverge silently.
7. Safety and Liveness
Safety: No Two Committed Values at the Same Index
Raft guarantees that if an entry at index i is committed in term t, no different entry at index i will ever be committed in any term. This follows because:
- Committing requires majority acknowledgement.
- Any future leader must have received votes from a majority, so at least one voter has the committed entry.
- The election restriction (vote only for up-to-date candidates) ensures the new leader's log includes all committed entries.
Liveness: Progress Under Partial Failures
Raft tolerates up to ⌊(n-1)/2⌋ server failures in a cluster of n servers (a 5-node cluster survives 2 failures). As long as a majority is alive and connected, elections complete and replication proceeds.
8. Real-World Usage
Consensus algorithms power the coordination layer of many critical systems:
- etcd (Raft) — configuration store for Kubernetes; every pod schedule, service discovery record, and secret flows through it.
- CockroachDB, TiKV (Raft) — distributed SQL; each range (64 MiB shard) has its own Raft group.
- Chubby, Zookeeper (Paxos / ZAB) — Google/Yahoo distributed lock services; used by HDFS, HBase.
- Apache Kafka — KRaft mode (Raft) replaces ZooKeeper for metadata management since 2022.
- Consul (Raft) — service mesh consensus and health checking.
Choosing cluster size is a trade-off: 3 nodes (1 failure tolerance) for cost-sensitive deployments; 5 nodes (2 failures) for high availability; 7+ nodes very rare (diminishing returns on write latency vs. fault tolerance).