Reference & Foundations ~40 min read Intermediate

The System Design Fundamentals Handbook

The ideas behind every system you'll design.

Q: What are the fundamentals of system design?

The load-bearing fundamentals are: the CAP theorem and consistency models (what guarantees a system gives during a network partition), concurrency and locking (how simultaneous operations stay correct), partitioning and replication (how data is split and copied across machines), and consensus and coordination (how independent nodes agree on a single value). Almost every design decision in a distributed system reduces to a trade-off among these four areas.

Q: What is the CAP theorem in simple terms?

The CAP theorem says that when a network partition happens — some nodes cannot talk to others — a distributed system must choose between Consistency (every read sees the latest write) and Availability (every request still gets an answer). It cannot have both during the partition. Systems are therefore classified as CP (favour consistency) or AP (favour availability).

Q: When is a stateful service actually the right choice?

When the state is the problem itself: WebSocket gateways holding live connections, game servers and collaborative editors running an authoritative in-memory simulation at 30 to 60 ticks per second, and stream processors such as Flink or Kafka Streams that keep a local store per partition. What makes these correct is that the state is declared — recovered through checkpoints, replication, or partition reassignment — rather than an accidental in-memory session map with no recovery story.

Uber, Twitter, a key-value store — strip away the product and the same four questions remain. What happens during a network partition? How do simultaneous writes stay correct? How is data split and copied? How do independent machines agree? Master these and every design becomes a recombination of things you already understand.


CAP & Consistency

What a system can promise when the network splits — and the spectrum of consistency between "always latest" and "eventually right".

CAP / PACELC · Linearizability · Eventual

Concurrency & Locking

How overlapping operations stay correct — locks, isolation levels, optimistic vs pessimistic, MVCC, and the races they prevent.

Locks · Isolation · MVCC

Partitioning & Replication

How one dataset becomes many — sharding, consistent hashing, leader/follower replication, and quorum reads and writes.

Sharding · Hashing · Quorum

Consensus & Coordination

How nodes that can crash still agree on one truth — Raft and Paxos, leader election, distributed transactions, and idempotency.

Raft / Paxos · 2PC / Saga · Idempotency

Fundamental 01

CAP & consistency models

When the network is healthy you can have it all. CAP is about the moment it isn't.

The CAP theorem states that during a network partition — when some nodes can't reach others — a distributed store must choose between Consistency (every read returns the latest write) and Availability (every request still gets a non-error response). Partition tolerance isn't optional in a real network, so the real choice is CP or AP.

But CAP is binary and only speaks about partitions. PACELC extends it: if there is a Partition, choose Availability or Consistency; Else, when the system is running normally, you still trade Latency against Consistency. And "consistency" itself is a spectrum, from linearizable (behaves like a single copy) down through causal to eventual (replicas converge, given time).

During a partition — pick one

Consistency

Reject or block requests that can't be made consistent. The system may go unavailable, but never returns stale data. e.g. a payment ledger.

Availability

Always answer, even if some replicas are behind. Reconcile later. e.g. a shopping cart, social feed.

Model	Guarantee	Cost	Typical use
Linearizable	Reads see the most recent write, globally ordered	Highest latency, needs consensus	Locks, leader election, balances
Causal	Operations that depend on each other are seen in order	Moderate	Comments, messaging threads
Read-your-writes	You always see your own latest change	Low (session pinning)	Profile edits, settings
Eventual	Replicas converge if writes stop	Lowest latency, can read stale	Caches, view counts, DNS

→ Mental model

"Consistency vs availability" is not a property you set once — it's a decision you make per operation. The same system can require linearizable writes for an account balance and serve eventually-consistent reads for that account's activity feed.

Deep dive

CAP Theorem & Consistency Models, explained →

See it in practice

Key-Value Store: tunable C/A
Distributed Cache: eventual reads
Payment System: strong consistency
Google Docs: causal + convergence

Fundamental 02

Concurrency & locking

Two users, one last concert ticket, the same millisecond. Correctness under concurrency is the whole game.

When operations overlap, you risk race conditions: lost updates, dirty reads, and double-spends. Locking serializes access so only safe interleavings happen. The strategy you pick depends on how often operations actually collide.

Pessimistic locking assumes conflict is likely and takes the lock before touching data — others wait. Optimistic locking assumes conflict is rare: read a version number, do the work, and only at commit check whether the version changed, retrying if so. Modern databases also use MVCC to give readers a consistent snapshot without blocking writers.

Pessimistic vs optimistic

Pessimistic

Lock → read → write → unlock
Others block and wait
No wasted work
Risk: deadlocks, contention
Best for high-conflict writes

Optimistic

Read version → work → CAS commit
No blocking
Retry on conflict
Risk: wasted work if hot
Best for read-heavy, low-conflict
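The optimistic path above (read version, work, CAS commit, retry) can be sketched in a few lines. This is a minimal illustration, not a database: `VersionedStore` and `book_seat` are hypothetical names, and the internal `threading.Lock` only stands in for the atomicity a real store gives a single conditional update such as a versioned `UPDATE`.

```python
import threading


class VersionedStore:
    """In-memory stand-in for one row with a version column (hypothetical)."""

    def __init__(self, seats):
        self._lock = threading.Lock()  # models the store's own atomicity, not a client-held lock
        self._value = seats
        self._version = 0

    def read(self):
        with self._lock:
            return self._value, self._version

    def compare_and_set(self, expected_version, new_value):
        """Commit only if nobody else has committed since we read."""
        with self._lock:
            if self._version != expected_version:
                return False
            self._value = new_value
            self._version += 1
            return True


def book_seat(store, max_retries=50):
    """Optimistic booking: read, compute, try to commit, retry on conflict."""
    for _ in range(max_retries):
        seats, version = store.read()
        if seats == 0:
            return False  # sold out
        if store.compare_and_set(version, seats - 1):
            return True
    return False  # gave up after too many conflicts


store = VersionedStore(seats=1)
results = []
threads = [
    threading.Thread(target=lambda: results.append(book_seat(store)))
    for _ in range(8)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print("successful bookings:", results.count(True))
print("final (seats, version):", store.read())
```

Eight buyers race for one seat and exactly one commit succeeds; the losers re-read, see zero seats, and stop. Nobody waited on a lock held by another buyer, which is the point, and the retries are the wasted work you pay when the row is hot.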

Isolation level	Prevents	Still allows
Read Uncommitted	—	Dirty reads
Read Committed	Dirty reads	Non-repeatable reads
Repeatable Read	Non-repeatable reads	Phantoms (in classic SQL)
Serializable	Everything — behaves as if serial	Nothing (highest cost)

→ Interview tip

When asked "how do you stop two people booking the same seat?", name the trade-off out loud: a pessimistic row lock (SELECT … FOR UPDATE) is simplest at low scale; an optimistic version check or a short-lived distributed lock scales better but needs a retry path. The wrong answer is to not mention the race at all.

Deep dive

Concurrency, Locks & Isolation Levels →

See it in practice

Ticketmaster: seat locking
Stock Exchange: order matching
Payment System: no double-spend
Google Docs: concurrent edits

Fundamental 03

Partitioning & replication

One machine can't hold the internet. So you split the data, then copy it so a dead disk doesn't take the data with it.

Partitioning (sharding) splits one dataset across many machines so each holds a slice, which scales capacity and throughput. The hard part is the key: hash the key for even spread, or range-partition it for efficient scans. Naïve modulo hashing (hash(key) mod N) remaps most keys when you add a node, because N changes for every key at once, so production systems use consistent hashing, which moves only roughly 1/N of the keys when the cluster changes.

Replication copies the same data onto several machines, which gives fault tolerance and read scaling. A common pattern is one leader taking writes and streaming them to followers. Quorum systems generalize this: with N replicas, wait for W replicas to acknowledge each write and query R replicas on each read, choosing W + R > N so that every read quorum overlaps the latest write quorum. Real systems do both: shard for scale, replicate each shard for durability.
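The overlap claim is small enough to check by brute force. The sketch below enumerates every possible read quorum and write quorum for a given N and confirms they always share a replica exactly when W + R > N. The function name `quorums_always_overlap` is made up for this illustration.

```python
from itertools import combinations


def quorums_always_overlap(n, r, w):
    """True if every size-r read set shares a replica with every size-w write set."""
    replicas = range(n)
    return all(
        set(read_q) & set(write_q)  # an empty intersection is falsy
        for read_q in combinations(replicas, r)
        for write_q in combinations(replicas, w)
    )


for n, r, w in [(3, 2, 2), (3, 1, 1), (5, 3, 3), (5, 1, 5)]:
    overlap = quorums_always_overlap(n, r, w)
    print(f"N={n} R={r} W={w}  overlap guaranteed: {overlap}  (R + W > N: {r + w > n})")
```

The last case is the useful one to remember: R=1, W=N still satisfies the rule, so you can buy cheap reads by making writes expensive, or the reverse.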

Consistent hashing ring

Node A → Node B → Node C → (back to Node A)

A key hashes to a point on the ring and is owned by the next node clockwise. Add a node and the only keys that move are the ones that now fall to it, taken from its clockwise neighbour.
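Here is a minimal sketch of the ring, compared against modulo hashing when a fourth node joins. Two things here are assumptions rather than anything the text prescribes: the class `HashRing` and helper `ring_position` are hypothetical names, and each node is placed at 100 virtual points (a common refinement for smoothing load), which means a new node takes keys from several neighbours rather than one.

```python
import bisect
import hashlib


def ring_position(label):
    """Map a string to a point on the ring (first 8 bytes of SHA-256)."""
    return int.from_bytes(hashlib.sha256(label.encode()).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, vnodes=100):
        self._vnodes = vnodes
        self._points = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    def add(self, node):
        for i in range(self._vnodes):
            bisect.insort(self._points, (ring_position(f"{node}#{i}"), node))

    def owner(self, key):
        # First point at or after the key's position, wrapping past the end.
        idx = bisect.bisect_right(self._points, (ring_position(key), ""))
        return self._points[idx % len(self._points)][1]


keys = [f"user:{i}" for i in range(10_000)]

ring = HashRing(["A", "B", "C"])
before = {k: ring.owner(k) for k in keys}
ring.add("D")
after = {k: ring.owner(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]

# Naive scheme for comparison: owner = hash mod node count.
naive_moved = sum(ring_position(k) % 3 != ring_position(k) % 4 for k in keys)

print(f"consistent hashing: {len(moved) / len(keys):.0%} of keys moved")
print(f"modulo hashing    : {naive_moved / len(keys):.0%} of keys moved")
```

Every key that moves on the ring moves to the new node and nowhere else, so existing nodes never trade keys with each other. Under modulo hashing most keys change owner, which in a cache means most of the fleet goes cold at once.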

Leader / follower replication

Client write → Leader → Follower 1 → Follower 2

Synchronous replication = durable but slower. Asynchronous = fast but risks losing the tail on leader failure.

→ Key insight

Partitioning and replication answer different questions. Sharding asks "where does this key live?"; replication asks "how many copies survive a failure?" You almost always need both, and a good shard key (high cardinality, even access) is the single decision that makes or breaks the design.

Deep dive

Partitioning, Sharding & Replication →

See it in practice

Twitter: fan-out & sharding
Uber: geo-sharding
Key-Value Store: consistent hashing
ID Generator: partitioned IDs
Search Engine: index shards

Fundamental 04

Statelessness & horizontal scaling

Sharding decides where data lives. This decides where it doesn't — because a service only scales sideways once no single machine holds something a request depends on.

A stateful server remembers you. After you log in it keeps your session — who you are, what's in your cart, which step of checkout you reached — in its own process memory, like a coat-check attendant who memorises which hook your coat is on. A stateless server keeps nothing between requests. It hands you a numbered ticket instead, and any attendant at the counter can redeem it.

"Stateless" does not mean the system forgets. It means the server forgets, on purpose. Every request arrives self-sufficient: it either carries the state with it — a signed token such as a JWT, whose claims are verified rather than looked up — or it carries a key into a shared store that every instance can reach equally. The state still exists. It simply does not live in one particular machine's RAM, which is the only property that matters when you go to add the second machine.

Where does the session live?

Stateful

Session in the server's own memory
Requests must return to that server
Sticky sessions at the load balancer
Server dies → the session dies with it
Simpler and faster — for exactly one box

Stateless

State in a token, or in a shared store
Any instance can serve any request
Load balancer routes on capacity alone
Server dies → next request lands elsewhere
Costs a round trip, or bytes per request

The second instance that breaks everything

Before. One application process, sessions held by the default in-memory store. A user logs in, the process writes sessions["abc123"] = {userId: 42, cart: […]} and sets a cookie. Every later request arrives with that cookie, finds the entry, and works. Traffic doubles, so you put a second instance behind the load balancer and change nothing else. The login lands on instance A and writes the session into A's heap. The next request round-robins to B. B looks up abc123, finds nothing, and treats the visitor as anonymous — a redirect to the login page, or a bare 401 from the API. The user signs in again, lands on A, works for a click or two, hits B, and is logged out again.

Half of everything now fails, non-deterministically, and only under the load you added capacity to survive. The same silent breakage hits everything else quietly parked in memory: CSRF tokens minted on A and rejected by B, multi-step wizards that lose step two, in-process rate-limit counters that suddenly allow twice the intended budget because each instance only counts its own half of the traffic, and a warm local cache whose hit rate collapses the moment a cold instance joins.

Before: session in one instance's memory

Instance A (session in RAM) → Next request → Instance B ("who are you?")

Two instances round-robining means roughly half of every user's requests land on a machine that has never heard of them.
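The rate-limit failure mentioned above is easy to reproduce. In this sketch a limiter counts requests in whatever dictionary it is handed: give each instance its own dictionary and the budget silently doubles, hand both the same one and it holds. `Limiter` and `allowed_through` are hypothetical names, the shared dictionary only stands in for a counter in a shared store, and time windows are left out to keep it short.

```python
class Limiter:
    def __init__(self, counters, limit):
        self.counters = counters  # client id -> requests seen in this window
        self.limit = limit

    def allow(self, client):
        if self.counters.get(client, 0) >= self.limit:
            return False
        self.counters[client] = self.counters.get(client, 0) + 1
        return True


def allowed_through(instances, requests=100):
    """Round-robin one client's requests across instances; count the ones allowed."""
    return sum(
        instances[i % len(instances)].allow("client-1") for i in range(requests)
    )


# Each instance keeps its own in-process counters.
local = [Limiter({}, limit=10), Limiter({}, limit=10)]

# Both instances count in the same place.
shared_counters = {}
shared = [Limiter(shared_counters, limit=10), Limiter(shared_counters, limit=10)]

print("in-process counters:", allowed_through(local))   # 20
print("shared counter     :", allowed_through(shared))  # 10
```

Nothing errors in the first case, which is what makes it dangerous: the limiter reports success on both machines while enforcing twice the intended budget.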

After. The session moves to Redis. The cookie still carries only abc123; both instances read and write the same key, under a TTL that expires it. Nothing on the request path is local any more. A can serve the login and B can serve the next click, and neither knows or cares. A rolling deploy can kill A mid-session, an autoscaler can add a fifth instance during a spike and remove it at 3am, and a crashed process costs exactly the requests it was holding — not the sessions of everyone who happened to be pinned to it. The bill is one network round trip, sub-millisecond inside an availability zone, plus a new dependency that now has to be at least as available as the service in front of it. That is the honest trade: a small, predictable latency cost in exchange for instances that are disposable.

After: session in a shared store

Any request → Instance A / B / C (holds nothing) → Shared store (Redis) → Same answer every time

This is the precondition for horizontal scaling: a new instance is immediately as capable as an old one, because there is no history to catch up on.
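The before and after can be put side by side in a few lines. `SharedSessionStore` below is a dict-backed stand-in so the example runs without a server; it is an assumption for illustration, not the Redis client API, and `Instance` is likewise hypothetical. The only thing that changes between the two scenarios is whether the instances are handed the same store.

```python
import time


class SharedSessionStore:
    """Dict-backed stand-in for a shared store with TTLs (not the Redis API)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def set(self, key, value, ttl_s):
        self._data[key] = (value, self._clock() + ttl_s)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: behave as if it was never there
            return None
        return value


class Instance:
    def __init__(self, name, store):
        self.name = name
        self.store = store

    def login(self, session_id, user_id):
        self.store.set(session_id, {"userId": user_id}, ttl_s=1800)

    def handle(self, session_id):
        return 200 if self.store.get(session_id) is not None else 401


# Before: each instance has a private store, as with in-process memory.
a, b = Instance("A", SharedSessionStore()), Instance("B", SharedSessionStore())
a.login("abc123", 42)
print("before:", b.handle("abc123"))  # 401

# After: both instances point at the same store.
store = SharedSessionStore()
a, b = Instance("A", store), Instance("B", store)
a.login("abc123", 42)
print("after :", b.handle("abc123"))  # 200
```

Instance B never saw the login in either scenario. In the second it does not need to, because the session was never A's to begin with.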

The sticky-session trap

There is a cheaper-looking fix. Tell the load balancer to pin each client to one backend — by hashing the source IP (nginx ip_hash) or by issuing its own cookie (ALB sticky sessions). No application code changes, the logouts stop, and it reads like a solution. It is not: it is the original coupling relocated into infrastructure, where it is harder to see and harder to test.

Deploys break first. A rolling restart replaces instances one at a time, and every user pinned to a replaced instance loses their session — so routine deploys start logging people out, and the team learns to ship at midnight rather than fix the cause. Scale-out breaks second. New instances only receive new sessions, so adding capacity during a spike does almost nothing for the users already pinned to a saturated box; the fleet stays lopsided for as long as sessions live, and the dashboard shows idle instances next to one that is on fire. Scale-in breaks third. Terminating an instance is no longer free, so the autoscaler either refuses to shrink or drops live sessions when it does. Graceful draining, canary releases, and shifting traffic between availability zones all inherit the same defect. Keep stickiness if you want it for cache locality — never for correctness. If the system stops working when stickiness is switched off, the state is still in the wrong place.

→ The test

Can you terminate any single instance, at random, in the middle of the working day, and lose nothing but the requests it was actively holding? If yes, the tier is stateless. If the answer involves draining, waiting, or a maintenance window, something is being remembered that shouldn't be.

Where state is supposed to live

Statelessness is not the absence of state — it is a deliberate decision about which systems are allowed to hold it. The application tier is not one of them; the systems below are built for it, and each answers a different question.

Home	What belongs there	Why there
Relational database	Data of record — users, orders, balances	Durable, transactional, queryable; the only correct home for a business fact
Redis / Memcached	Sessions, rate-limit counters, short-lived locks, hot lookups	Shared and fast, with TTLs that expire state for you — see caching patterns
Object storage	Uploads, exports, generated media	Cheap and effectively unbounded; never one instance's local disk
The token itself	Identity and coarse-grained claims	No lookup at all — the cost is bytes per request and awkward revocation (OAuth & auth)
Queue or log	In-flight work	A crashed worker's job is redelivered instead of lost (backpressure lab)

When stateful is the right answer

Plenty of services are stateful because the problem is. A WebSocket gateway holds a real TCP connection, and that connection is state living on exactly one machine — you cannot make it stateless, only recoverable. The working pattern splits the two apart: the socket stays pinned, the meaning does not. A shared pub/sub layer fans messages out to whichever gateway currently holds the socket, and clients reconnect and resume from a sequence number, so losing a gateway costs a reconnect rather than a conversation.
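The "reconnect and resume from a sequence number" part can be sketched as a bounded replay buffer. `ResumableChannel` is a hypothetical name and the buffer lives in one process here; in the pattern described above it would sit behind the shared pub/sub layer so any gateway can serve the replay.

```python
from collections import deque


class ResumableChannel:
    def __init__(self, buffer_size=1000):
        self._next_seq = 1
        self._buffer = deque(maxlen=buffer_size)  # (seq, message), oldest dropped first

    def publish(self, message):
        seq = self._next_seq
        self._buffer.append((seq, message))
        self._next_seq += 1
        return seq

    def replay_after(self, last_seen_seq):
        """Messages the client missed, or None if some have aged out of the buffer."""
        if self._buffer and self._buffer[0][0] > last_seen_seq + 1:
            return None  # gap: the client must do a full resync instead
        return [msg for seq, msg in self._buffer if seq > last_seen_seq]


channel = ResumableChannel(buffer_size=3)
for text in ["a", "b", "c", "d"]:
    channel.publish(text)

print(channel.replay_after(2))  # ['c', 'd']: the client resumes where it left off
print(channel.replay_after(0))  # None: message 1 has aged out, so resync
```

The buffer size is the declared blast radius: it states how long a client may be disconnected before a reconnect turns into a full reload.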

Realtime game servers and collaborative editors run an authoritative simulation in memory at thirty to sixty ticks a second; a network round trip per tick is not physically available to them, so they snapshot and journal instead. Stream processors such as Flink and Kafka Streams keep local state deliberately: a windowed aggregation lives in an embedded store on the instance that owns that partition, and correctness comes from the partition-to-instance assignment plus periodic checkpoints, not from having no state at all. Databases are the original case of the same idea.

What separates these from the accidental kind is that the state was declared. Someone chose it, wrote down how it is recovered, and sized the blast radius of losing one node. An in-memory session map has none of that — no checkpoint, no reassignment, no owner — just an assumption made once for one machine that quietly stops being true the day a second one appears. The rule the industry converged on, and the one the twelve-factor guidance states outright, is simply that: keep the application tier stateless and disposable, and let only the systems designed to hold state hold it.

→ Interview tip

When you say "we'll scale horizontally", expect the follow-up: what makes that possible? The answer is that no request depends on which instance serves it. Name where the session goes (a signed token, or Redis), name the cost you accepted (bytes per request, or a round trip), and name the one thing you deliberately left stateful — that last part is what separates a rehearsed answer from a real one.

Deep dive

Designing the stateless request itself — The API Design Handbook →

See it in practice

Load Balancer: routing without stickiness
Queue Backpressure: state in the queue
Networking: connections & load balancing
Monolith vs Microservices: service boundaries
Caching Patterns: the shared store

Fundamental 05

Consensus & coordination

Machines crash and messages get lost — yet the cluster must still agree on exactly one answer. That's the hardest problem in distributed systems, and it has a name.

Many problems reduce to consensus: who is the leader, what's the next entry in the replicated log, did this transaction commit? Raft and Paxos let a majority quorum agree safely even when a minority of nodes fail. Raft makes this approachable with an explicit leader election plus log replication — it's what powers etcd, Consul, and CockroachDB.

Above the cluster, distributed transactions coordinate work across services. Two-phase commit (2PC) gives atomicity but blocks if the coordinator dies; the saga pattern trades atomicity for availability using compensating actions. And because networks retry, every operation that mutates state should be idempotent — applying it twice has the same effect as once.
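One common way to make a mutation idempotent is to have the caller attach a unique key and have the server remember the result for that key. The sketch below assumes that approach. `PaymentService` and its in-memory `_seen` map are hypothetical; a real service would keep the key and the result in durable storage, written in the same transaction as the charge.

```python
class PaymentService:
    def __init__(self, balance=100):
        self.balance = balance
        self._seen = {}  # idempotency key -> result already returned

    def charge(self, idempotency_key, amount):
        # A retry of a request we already handled: return the stored answer.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]

        if amount > self.balance:
            result = {"status": "declined", "balance": self.balance}
        else:
            self.balance -= amount
            result = {"status": "charged", "balance": self.balance}

        self._seen[idempotency_key] = result
        return result


service = PaymentService(balance=100)
first = service.charge("order-17", 30)
retry = service.charge("order-17", 30)  # the network timed out, so the client retried

print(first, retry, service.balance)
```

The retry gets the same response as the first call and the balance is debited once. A different key is a different operation and is charged normally.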

2PC vs Saga

Two-Phase Commit

Prepare → all vote → commit
Strong atomicity
Coordinator is a bottleneck
Blocks on failure

Saga

Local commits + compensations
No global lock
Eventually consistent
You write the rollbacks
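"You write the rollbacks" looks like this in miniature. Each step is a local commit paired with a compensation, and a failure runs the compensations for the completed steps in reverse order. `run_saga` and the step functions are hypothetical, and the sketch ignores what a real orchestrator must also handle: persisting progress, and retrying compensations that themselves fail.

```python
def run_saga(steps):
    """steps: list of (name, action, compensation). Returns a status string."""
    completed = []
    for name, action, compensate in steps:
        try:
            action()
        except Exception as exc:
            # Undo what already committed, most recent first.
            for _, undo in reversed(completed):
                undo()
            return f"rolled back after '{name}' failed: {exc}"
        completed.append((name, compensate))
    return "committed"


state = {"stock": 5, "charged": 0}
log = []


def reserve_stock():
    state["stock"] -= 1
    log.append("reserve")


def release_stock():
    state["stock"] += 1
    log.append("release")


def charge_card():
    raise RuntimeError("card declined")


def refund_card():
    state["charged"] = 0
    log.append("refund")


outcome = run_saga([
    ("reserve stock", reserve_stock, release_stock),
    ("charge card", charge_card, refund_card),
])
print(outcome)
print(state, log)
```

Between the reservation and its release, other readers could see stock at 4. That window is the "eventually consistent" line in the comparison: a saga restores the end state, not isolation.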

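Raft: agree by majority quorum

Candidate → Request votes → Majority? → Leader → Replicate log

A node needs votes from a strict majority to lead, and each node votes at most once per term, so two leaders can never be elected in the same term. A stale leader cut off by a partition may linger, but it cannot reach a majority and so cannot commit new entries.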
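The majority rule can be reduced to a counting argument, sketched below for a single term. This is far from a Raft implementation: it omits terms, timeouts, and the log up-to-date check, and `election_winners` is a hypothetical name. It shows only that if each node grants one vote, two candidates cannot both collect a strict majority.

```python
def election_winners(cluster_size, vote_requests):
    """vote_requests: (voter, candidate) pairs in arrival order, all for one term."""
    voted_for = {}
    for voter, candidate in vote_requests:
        voted_for.setdefault(voter, candidate)  # one vote per node; later requests are refused

    tally = {}
    for candidate in voted_for.values():
        tally[candidate] = tally.get(candidate, 0) + 1

    needed = cluster_size // 2 + 1  # strict majority of the full cluster, not of reachable nodes
    return [c for c, votes in tally.items() if votes >= needed]


# Five nodes split 3 / 2 by a partition; one candidate stands on each side.
partitioned = [("n1", "X"), ("n2", "X"), ("n3", "X"), ("n4", "Y"), ("n5", "Y")]
print(election_winners(5, partitioned))  # ['X']: only the majority side can elect

# A 2 / 2 split with one node down: nobody wins, and the election is retried.
split = [("n1", "X"), ("n2", "X"), ("n3", "Y"), ("n4", "Y")]
print(election_winners(5, split))  # []
```

The threshold is computed from the full cluster size, which is why the minority side of a partition can never elect a leader however healthy it looks from the inside.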

→ Mental model

You rarely implement consensus yourself — you reach for a system that already did (ZooKeeper, etcd, a managed database). The senior move is knowing where you need it (one source of truth, leader election) versus where you can avoid it entirely with idempotency and eventual consistency.

Deep dive

Consensus, Transactions & Coordination →

See it in practice

Message Queue: ordering & replication
Job Scheduler: leader election
Key-Value Store: quorum writes
Notifications: idempotent delivery

Fundamental 06

Back-of-the-envelope

Before the four questions, one habit: put numbers on it. Load, concurrency, storage, quorum — a napkin's worth of arithmetic decides the whole shape of a design.

Two formulas carry most interviews. Little's Law turns a request rate and a latency into the number of requests in flight — which is the count of threads, connections, or servers you must provision. And the quorum rule decides whether replicated reads are even correct: a read only sees the newest write when the read and write quorums are forced to overlap.

peak QPS = (DAU · req/user) / 86,400 · peak
L = λ · W   (in-flight = QPS · latency)
strong ⟺ R + W > N

Little's Law holds for any system in steady state, whatever the arrival pattern or internal scheduling: if 185k requests arrive each second and each takes 30 ms on average, then ~5,600 are in flight on average. And R+W>N is the entire reason a key-value store with N=3 uses R=2, W=2: any read quorum and any write quorum share at least one node, so the read can't miss the write. The estimator below is these formulas; feed it your product's numbers.


# The four numbers every system-design answer starts with -- all napkin arithmetic.

def peak_qps(dau, req_per_user_per_day, peak_factor=4):
    avg = dau * req_per_user_per_day / 86_400      # seconds in a day
    return avg * peak_factor                        # traffic clusters, so scale the average

def concurrency(qps, latency_s):
    return qps * latency_s                          # Little's Law: L = lambda * W

def storage_bytes(writes_per_day, bytes_each, days):
    return writes_per_day * bytes_each * days

def strongly_consistent(N, R, W):
    return R + W > N                                # read quorum must overlap the last write

def majority(N):
    return N // 2 + 1                               # smallest W that can't split-brain

# a Twitter-scale read path
dau, reads = 200_000_000, 20
qps = peak_qps(dau, reads)
print(f"peak read QPS        : {qps:>12,.0f}")
print(f"in-flight @ 30ms     : {concurrency(qps, 0.030):>12,.0f}   (servers x threads to buy)")

tb = storage_bytes(50_000_000, 300, 365 * 5) / 1e12   # 50M tweets/day, 300 B, 5 years
print(f"5-year tweet storage : {tb:>12,.1f} TB\n")

for N, R, W in [(3, 2, 2), (3, 1, 1), (5, majority(5), majority(5))]:
    tag = "STRONG" if strongly_consistent(N, R, W) else "stale reads possible"
    print(f"N={N} R={R} W={W}  ->  {tag}")

Now apply it

worked system designs.

Fundamentals stick when you watch them assemble into something real. Every design on Vibe Engines builds step by step through an interactive diagram — spot the CAP trade-off, the shard key, the lock, the quorum, as each one appears.


Frequently asked

Quick answers

What are the fundamentals of system design?

The load-bearing fundamentals are CAP & consistency, concurrency & locking, partitioning & replication, and consensus & coordination. Almost every design decision in a distributed system reduces to a trade-off among these four areas.

What is the CAP theorem in simple terms?

During a network partition, a distributed system must choose between Consistency (every read sees the latest write) and Availability (every request still gets an answer) — it cannot have both. Systems are classified CP or AP accordingly.

Optimistic vs pessimistic locking — what's the difference?

Pessimistic locking takes a lock before touching data (best for high contention). Optimistic locking takes no lock, then checks a version at commit time and retries on conflict (best for read-heavy, low-contention workloads).

Partitioning vs replication?

Partitioning (sharding) splits one dataset across machines to scale capacity. Replication copies the same data onto several machines for fault tolerance and read scaling. Real systems use both: shard for scale, replicate each shard for durability.

Why do distributed systems need Raft or Paxos?

Independent nodes that can crash or be partitioned still need to agree on one value — the leader, the next log entry, whether a transaction committed. Consensus algorithms let a majority quorum agree safely even when some nodes fail.

What does it mean for a server to be stateless?

It means the server keeps nothing about you between requests. Each request arrives self-sufficient — it either carries a signed token whose claims the server verifies, or a key into a shared store (Redis, the database) that every instance can reach. State still exists; it just doesn't live in one machine's memory, so any instance can serve any request.

Why do sticky sessions break autoscaling and deploys?

Stickiness pins a user to one instance, which makes that instance non-disposable. A rolling deploy replaces it and logs those users out; scaling in drops their sessions; and scaling out barely helps, because new instances only receive new sessions while the saturated box keeps its old ones. Keep stickiness for cache locality if you like — never for correctness.

When is a stateful service actually the right choice?

When the state is the problem: WebSocket gateways holding live connections, game servers and collaborative editors running an in-memory simulation at 30–60 ticks a second, and stream processors like Flink that keep a local store per partition. The difference is that this state is declared — with checkpoints, replication, or partition reassignment — instead of an accidental session map with no recovery story.
