Fundamental 01

CAP & consistency models

When the network is healthy you can have it all. CAP is about the moment it isn't.

The CAP theorem states that during a network partition — when some nodes can't reach others — a distributed store must choose between Consistency (every read returns the latest write) and Availability (every request still gets a non-error response). Partition tolerance isn't optional in a real network, so the real choice is CP or AP.

But CAP is binary and only speaks about partitions. PACELC extends it: else, when the system is running normally, you still trade Latency against Consistency. And "consistency" itself is a spectrum — from linearizable (behaves like a single copy) down through causal to eventual (replicas converge, given time).

During a partition — pick one
CP

Consistency

Reject or block requests that can't be made consistent. The system may go unavailable, but never returns stale data. e.g. a payment ledger.

AP

Availability

Always answer, even if some replicas are behind. Reconcile later. e.g. a shopping cart, social feed.

ModelGuaranteeCostTypical use
LinearizableReads see the most recent write, globally orderedHighest latency, needs consensusLocks, leader election, balances
CausalOperations that depend on each other are seen in orderModerateComments, messaging threads
Read-your-writesYou always see your own latest changeLow (session pinning)Profile edits, settings
EventualReplicas converge if writes stopLowest latency, can read staleCaches, view counts, DNS
→ Mental model

"Consistency vs availability" is not a property you set once — it's a decision you make per operation. The same system can require linearizable writes for an account balance and serve eventually-consistent reads for that account's activity feed.

Deep dive

CAP Theorem & Consistency Models, explained →

Read the full guide
Fundamental 02

Concurrency & locking

Two users, one last concert ticket, the same millisecond. Correctness under concurrency is the whole game.

When operations overlap, you risk race conditions: lost updates, dirty reads, and double-spends. Locking serializes access so only safe interleavings happen. The strategy you pick depends on how often operations actually collide.

Pessimistic locking assumes conflict is likely and takes the lock before touching data — others wait. Optimistic locking assumes conflict is rare: read a version number, do the work, and only at commit check whether the version changed, retrying if so. Modern databases also use MVCC to give readers a consistent snapshot without blocking writers.

Pessimistic vs optimistic

Pessimistic

  • Lock → read → write → unlock
  • Others block and wait
  • No wasted work
  • Risk: deadlocks, contention
  • Best for high-conflict writes

Optimistic

  • Read version → work → CAS commit
  • No blocking
  • Retry on conflict
  • Risk: wasted work if hot
  • Best for read-heavy, low-conflict
Isolation levelPreventsStill allows
Read UncommittedDirty reads
Read CommittedDirty readsNon-repeatable reads
Repeatable ReadNon-repeatable readsPhantoms (in classic SQL)
SerializableEverything — behaves as if serialNothing (highest cost)
→ Interview tip

When asked "how do you stop two people booking the same seat?", name the trade-off out loud: a pessimistic row lock or SELECT … FOR UPDATE is simplest at low scale; an optimistic version check or a short-lived distributed lock scales better but needs a retry path. The wrong answer is to not mention the race at all.

Deep dive

Concurrency, Locks & Isolation Levels →

Read the full guide
Fundamental 03

Partitioning & replication

One machine can't hold the internet. So you split the data — then copy it so a dead disk doesn't take it with you.

Partitioning (sharding) splits one dataset across many machines so each holds a slice — this scales capacity and throughput. The hard part is the key: hash the key for even spread, or range-partition it for efficient scans. Naïve hashing breaks when you add a node, so production systems use consistent hashing, which moves only a small fraction of keys when the cluster changes.

Replication copies the same data onto several machines — this gives fault tolerance and read scaling. A common pattern is one leader taking writes and streaming them to followers. Quorum systems generalize this: with N replicas, require W writes and R reads such that W + R > N to guarantee overlap. Real systems do both — shard for scale, replicate each shard for durability.

Consistent hashing ring
Node A
Node B
Node C

A key hashes to a point on the ring and is owned by the next node clockwise. Add a node and only its neighbour's keys move.

Leader / follower replication
Client write
Leader
Follower 1
Follower 2

Synchronous replication = durable but slower. Asynchronous = fast but risks losing the tail on leader failure.

→ Key insight

Partitioning and replication answer different questions. Sharding asks "where does this key live?"; replication asks "how many copies survive a failure?" You almost always need both, and a good shard key (high cardinality, even access) is the single decision that makes or breaks the design.

Deep dive

Partitioning, Sharding & Replication →

Read the full guide
Fundamental 04

Consensus & coordination

Machines crash and messages get lost — yet the cluster must still agree on exactly one answer. That's the hardest problem in distributed systems, and it has a name.

Many problems reduce to consensus: who is the leader, what's the next entry in the replicated log, did this transaction commit? Raft and Paxos let a majority quorum agree safely even when a minority of nodes fail. Raft makes this approachable with an explicit leader election plus log replication — it's what powers etcd, Consul, and CockroachDB.

Above the cluster, distributed transactions coordinate work across services. Two-phase commit (2PC) gives atomicity but blocks if the coordinator dies; the saga pattern trades atomicity for availability using compensating actions. And because networks retry, every operation that mutates state should be idempotent — applying it twice has the same effect as once.

2PC vs Saga

Two-Phase Commit

  • Prepare → all vote → commit
  • Strong atomicity
  • Coordinator is a bottleneck
  • Blocks on failure

Saga

  • Local commits + compensations
  • No global lock
  • Eventually consistent
  • You write the rollbacks
Raft — agree by majority quorum
Candidate
Request votes
Majority? → Leader
Replicate log

A node needs votes from a strict majority to lead — so two leaders can never co-exist, even during a partition.

→ Mental model

You rarely implement consensus yourself — you reach for a system that already did (ZooKeeper, etcd, a managed database). The senior move is knowing where you need it (one source of truth, leader election) versus where you can avoid it entirely with idempotency and eventual consistency.

Deep dive

Consensus, Transactions & Coordination →

Read the full guide
Now apply it
29

worked system designs.

Fundamentals stick when you watch them assemble into something real. Every design on Vibe Engines builds step by step through an interactive diagram — spot the CAP trade-off, the shard key, the lock, the quorum, as each one appears.

Frequently asked

Quick answers

What are the fundamentals of system design?

The load-bearing fundamentals are CAP & consistency, concurrency & locking, partitioning & replication, and consensus & coordination. Almost every design decision in a distributed system reduces to a trade-off among these four areas.

What is the CAP theorem in simple terms?

During a network partition, a distributed system must choose between Consistency (every read sees the latest write) and Availability (every request still gets an answer) — it cannot have both. Systems are classified CP or AP accordingly.

Optimistic vs pessimistic locking — what's the difference?

Pessimistic locking takes a lock before touching data (best for high contention). Optimistic locking takes no lock, then checks a version at commit time and retries on conflict (best for read-heavy, low-contention workloads).

Partitioning vs replication?

Partitioning (sharding) splits one dataset across machines to scale capacity. Replication copies the same data onto several machines for fault tolerance and read scaling. Real systems use both: shard for scale, replicate each shard for durability.

Why do distributed systems need Raft or Paxos?

Independent nodes that can crash or be partitioned still need to agree on one value — the leader, the next log entry, whether a transaction committed. Consensus algorithms let a majority quorum agree safely even when some nodes fail.

The System Design Fundamentals Handbook · Vibe Engines · 2026
Learn the four ideas once — then design anything.
Finished this one? 0 / 12 Handbooks done