System Design Fundamentals  /  Consensus & Coordination
Fundamental 04~14 min readAdvanced
Deep Dive

Nodes crash.
The cluster still agrees.

Machines die, messages vanish, clocks drift — and yet a distributed system must still settle on exactly one answer: one leader, one next log entry, one commit decision. That's the consensus problem, the hardest in the field, and the algorithms that solve it underpin almost everything reliable.

01

Why agreement is hard

Consensus means getting a group of nodes to agree on a single value, even when some fail. It sounds trivial until you add reality: messages arrive late or never, nodes crash and recover, and you can't tell a slow node from a dead one. The FLP impossibility result proves that in a fully asynchronous network, no algorithm can guarantee consensus if even one node may fail — so real systems use timeouts and randomization to make progress in practice while staying safe always.

Many everyday problems are consensus in disguise: electing a leader, agreeing the next entry in a replicated log, deciding whether a distributed transaction committed, or maintaining a distributed lock. Solve consensus once and you get all of them.

02

Quorums: agreement by majority

The core trick is the majority quorum. If every decision needs votes from a strict majority, then any two decisions must share at least one node — because two majorities of the same set always overlap. That single overlapping node can't vote for two conflicting values, so two leaders or two conflicting commits can never both succeed, even during a partition.

3 of 5 is a majority — and any two majorities overlap
·
·
A 5-node cluster tolerates 2 failures: 3 survivors still form a majority and keep deciding. This is why clusters are sized at odd numbers.
03

Raft & Paxos

The two famous consensus algorithms. Same guarantee, different ergonomics.

Raft was designed to be understandable. It elects a single leader for a term; the leader takes all writes, appends them to a replicated log, and an entry is committed once a majority has stored it. If the leader goes silent, followers time out and a new election starts. It powers etcd, Consul, and CockroachDB.

Raft — elect, then replicate
Follower
times out
Candidate
requests votes
Majority?
→ Leader
Replicate log
to followers
A higher term always wins, so a recovered old leader steps down — no split brain.

Paxos reaches the same safety with proposers, acceptors, and learners, but is famously hard to reason about and implement; Multi-Paxos adds a stable leader for efficiency. In interviews you rarely need the internals — what matters is knowing that a majority-quorum algorithm gives you safe agreement, and reaching for an existing implementation rather than writing your own.

04

Coordination services

You almost never implement consensus yourself. Instead you lean on a coordination serviceZooKeeper, etcd, or a cloud equivalent — which runs the consensus protocol internally and exposes simple primitives: leader election, distributed locks, configuration storage, and membership/failure detection. The rule of thumb: keep the coordination service for the small, critical state that must be globally agreed, and keep your bulk data out of it.

→ Mental model

The senior move is knowing where you genuinely need consensus (one leader, one source of truth) versus where you can avoid it entirely with idempotency and eventual consistency. Consensus is correct but expensive — every decision costs a majority round-trip.

05

Distributed transactions: 2PC vs saga

When one logical operation spans multiple services or shards, you need them to agree on commit-or-abort.

Two-Phase Commit (2PC)

  • Coordinator: "prepare?" → all vote
  • All yes → "commit"; any no → "abort"
  • Strong atomicity across services
  • Blocks if coordinator dies mid-flight
  • Holds locks for the whole round-trip

Saga

  • A sequence of local transactions
  • Each step has a compensating undo
  • Failure → run compensations backwards
  • Non-blocking, eventually consistent
  • You design the rollbacks

2PC buys you atomicity at the cost of availability — a stalled coordinator leaves participants holding locks. Sagas trade atomicity for availability: each step commits independently and a failure triggers compensating actions (refund the charge, release the seat). Most modern microservice systems prefer sagas precisely because they don't block.

06

Idempotency: the exactly-once illusion

Networks deliver at least once — retries mean the same request can arrive twice. True "exactly-once delivery" is impossible, but you can get exactly-once effects by making operations idempotent: applying them twice has the same result as applying them once.

The standard mechanism is an idempotency key — a unique ID the client attaches to a request. The server records processed keys and, on a duplicate, returns the original result instead of doing the work again. This is how a retried payment charges the card once, and how a redelivered queue message isn't processed twice.

Idempotency key deduplicates the retry
Pay req
key=ab12
Charged ✓
store ab12
Retry
key=ab12
Seen ab12 →
return original
→ Key insight

Idempotency is how you avoid needing consensus. If every operation is safe to retry, much of a system can run on cheap at-least-once delivery and eventual consistency — reserving expensive agreement for the few places that truly need one source of truth. Pair it with the fencing tokens from concurrency control to stay correct under retries and pauses.

Frequently asked

Quick answers

Why do we need Raft or Paxos?

Nodes that can crash or be partitioned still need to agree on one value — the leader, the next log entry, a commit decision. These algorithms let a majority quorum agree safely even when a minority fails.

2PC vs saga?

2PC is an atomic prepare-then-commit across services, but blocks if the coordinator fails. A saga is a chain of local transactions with compensating undos — non-blocking and eventually consistent, but you write the rollbacks.

What is idempotency?

An operation that has the same effect applied once or many times. Using an idempotency key the server deduplicates on, retries give exactly-once effects on top of at-least-once delivery.

Why does consensus need a majority?

Any two majorities of the same set overlap on at least one node, so two conflicting decisions can't both win — even during a partition. That quorum intersection is what keeps consensus safe.