Distributed Consensus: Paxos and Raft Explained

1. The Consensus Problem

Formally, consensus requires that a set of processes each propose a value, and they all eventually agree on exactly one of the proposed values. Three properties must hold:

Agreement: No two correct processes decide different values.
Validity: The decided value was proposed by some process (no value pulled from thin air).
Termination: Every correct process eventually decides.

In practice we use replicated state machines: every server stores an append-only log of commands. Consensus ensures all servers execute the same commands in the same order. This is the foundation of databases like CockroachDB, etcd (Kubernetes), and TiKV.

Why is it hard? Processes crash at any time. Messages are delayed, reordered, or dropped. You can't distinguish a dead machine from a very slow one (asynchronous network model). You only know a server crashed once it stops responding — but that takes time.

2. FLP & CAP — Impossibility Results

FLP Impossibility (1985)

Fischer, Lynch, and Paterson proved that in a purely asynchronous distributed system with even one crashable process, no deterministic consensus algorithm can always terminate — it might run forever in bad cases. In practice this means we need timeouts and randomisation (or synchrony assumptions) to make progress.

CAP Theorem

Brewer's CAP theorem states a distributed system can guarantee at most two of:

Consistency (C)

Every read returns the most recent write or an error. All nodes see the same data at the same time.

Availability (A)

Every request receives a response (not necessarily the latest data). The system is always up.

Partition Tolerance (P)

The system keeps working despite arbitrary network partitions between nodes.

In practice

Networks do partition, so P is required. Systems choose: CP (strong consistency, may reject requests) or AP (eventual consistency, always responds).

Raft and Paxos are CP systems. During a partition, the minority side refuses requests to preserve consistency.

3. Paxos: The Classic Algorithm

Paxos was described by Lamport in 1989 (published 1998). It runs in two phases per consensus round (instance):

Phase 1 — Prepare

A proposer picks a ballot number n (higher than any it has seen) and broadcasts Prepare(n) to a majority of acceptors. Acceptors reply with:

Promise: "I will not accept ballots < n."
Highest accepted value (if any): the proposer must preserve it.

Phase 2 — Accept

The proposer picks a value (its own, or the highest previously accepted value from phase 1), sends Accept(n, v) to acceptors. Acceptors accept if they have not promised a higher ballot. Once a majority accepts, the value is chosen.

Paxos safety invariant If value v is chosen at ballot n, then any value chosen at ballot n' > n must also be v. Proof sketch: majority (quorum) overlaps ensure at least one acceptor in the new quorum saw the old accepted value.

Multi-Paxos: Running one Paxos instance per log entry is expensive. Multi-Paxos elects a stable leader who skips Phase 1 for subsequent entries, reducing the common-case to one round-trip (Phase 2 only).

4. Raft: Understandability First

Raft was designed by Ongaro & Ousterhout (2014) with an explicit goal: be easier to understand than Paxos. Key design choices:

Strong leader: all log entries flow through a single leader.
Randomised timeouts: simple, probabilistic leader election.
Membership changes: joint consensus for safe cluster reconfiguration.
Log matching: if two logs have the same entry at the same index, all preceding entries are identical.

Raft decomposes consensus into three independent sub-problems: leader election, log replication, and safety.

5. Leader Election

Each server is in one of three states: Follower, Candidate, or Leader. Time is divided into terms (monotonically increasing integers).

Followers wait for a heartbeat from the leader. Each follower has a randomised election timeout (150–300 ms).
If the timeout expires without a heartbeat, the follower becomes a Candidate and starts an election for term T+1.
The candidate votes for itself and sends RequestVote(term, lastLogIndex, lastLogTerm) to all peers.
A server grants its vote if: it has not voted in this term AND the candidate's log is at least as up-to-date as its own.
A candidate that receives a majority of votes becomes leader. It immediately sends heartbeats to stop other elections.

Split vote: If two candidates start simultaneously, neither may get a majority. Both time out, increment the term, and try again. Randomised timeouts make repeated splits rare — in practice elections complete in one round.

Vote grant condition grant \leftarrow candidate.term \geq self.currentTerm AND self.votedFor \in {null, candidate.id} AND candidate.log is at-least-as-up-to-date as self.log "At least as up-to-date": candidate's lastLogTerm > self.lastLogTerm OR (equal term AND candidate.lastLogIndex \geq self.lastLogIndex)

6. Log Replication

Clients send all commands to the leader. The leader appends the command to its log (with the current term), then replicates it in parallel to all followers via AppendEntries RPCs.

Index	Term	Command	Status
1	1	SET x=5	Committed
2	1	SET y=3	Committed
3	2	DEL x	Committed
4	3	SET z=9	Pending (not yet majority)

An entry is committed once the leader has replicated it to a majority of servers. The leader then applies the command to its state machine and responds to the client. It notifies followers of the commit index in future AppendEntries calls.

Log Matching Property

When the leader sends AppendEntries(prevLogIndex, prevLogTerm, entries), the follower checks that its entry at prevLogIndex has term prevLogTerm. If not, it rejects and the leader retries from an earlier index. This consistency check ensures logs never diverge silently.

7. Safety and Liveness

Safety: No Two Committed Values at the Same Index

Raft guarantees that if an entry at index i is committed in term t, no different entry at index i will ever be committed in any term. This follows because:

Committing requires majority acknowledgement.
Any future leader must have received votes from a majority, so at least one voter has the committed entry.
The election restriction (vote only for up-to-date candidates) ensures the new leader's log includes all committed entries.

Liveness: Progress Under Partial Failures

Raft tolerates up to ⌊(n-1)/2⌋ server failures in a cluster of n servers (a 5-node cluster survives 2 failures). As long as a majority is alive and connected, elections complete and replication proceeds.

Byzantine faults: Raft and Paxos assume crash-stop failures — servers either run correctly or stop. They don't handle Byzantine faults (malicious/corrupted servers). For that, you need BFT algorithms (PBFT, HotStuff) which require 3f+1 replicas to tolerate f Byzantine faults — much more expensive.

8. Real-World Usage

Consensus algorithms power the coordination layer of many critical systems:

etcd (Raft) — configuration store for Kubernetes; every pod schedule, service discovery record, and secret flows through it.
CockroachDB, TiKV (Raft) — distributed SQL; each range (64 MiB shard) has its own Raft group.
Chubby, Zookeeper (Paxos / ZAB) — Google/Yahoo distributed lock services; used by HDFS, HBase.
Apache Kafka — KRaft mode (Raft) replaces ZooKeeper for metadata management since 2022.
Consul (Raft) — service mesh consensus and health checking.

Choosing cluster size is a trade-off: 3 nodes (1 failure tolerance) for cost-sensitive deployments; 5 nodes (2 failures) for high availability; 7+ nodes very rare (diminishing returns on write latency vs. fault tolerance).

Performance numbers: A well-tuned 5-node Raft cluster on a LAN achieves ~50 000 writes/sec at <1 ms median latency. Cross-datacenter deployments add RTT (30–150 ms) directly to commit latency — this is unavoidable; it is the cost of durability across failure domains.