System Design Fundamentals  /  Partitioning & Replication
Fundamental 03~13 min readIntermediate
Deep Dive

Split the data to scale.
Copy it to survive.

One machine can't hold the internet, and one disk failure shouldn't erase your product. Partitioning answers "where does this key live?" and replication answers "how many copies survive a failure?" — two different questions you almost always answer together.

01

Two questions, not one

Partitioning (sharding) splits one dataset across many machines so each holds a slice — it scales capacity and throughput. Replication copies the same data onto several machines — it provides fault tolerance and read scaling. They're orthogonal, and production systems combine them: shard the data into N partitions, then keep R replicas of each partition.

Partitioning (sharding)Replication
GoalScale capacity & write throughputFault tolerance & read scaling
Each node holdsA different slice of the dataA full copy of its data
Lose a node →Lose that slice (unless replicated)Lose nothing; a copy survives
Key decisionThe shard keyTopology & sync vs async
02

How to partition

The partitioning strategy decides which node owns a given key.

StrategyHowStrengthWeakness
RangeContiguous key ranges per nodeEfficient range scansHotspots on sequential keys
Hashhash(key) → nodeEven spreadRange scans hit every node
DirectoryA lookup service maps key → nodeFlexible, easy rebalancingThe directory is a dependency

Plain hashing — node = hash(key) % N — has a fatal flaw: change N (add or lose a machine) and almost every key remaps, forcing a full reshuffle. That's what consistent hashing fixes.

03

Consistent hashing

Add a node and move a few keys — not all of them.

Map both keys and nodes onto a circular hash space (a ring). A key is owned by the first node found clockwise from its position. When a node joins, it takes over only the keys between it and its predecessor; everyone else is untouched. When a node leaves, only its keys move to the next node.

Keys land on the ring, owned by the next node clockwise
Node A
Node B
Node C
Adding Node D between B and C moves only B→C's keys onto D — roughly 1/N of the data, not all of it.

One refinement matters in practice: virtual nodes. Each physical machine is placed at many points on the ring, so load is even and removing a node spreads its keys across all survivors rather than dumping them on one neighbour.

04

Hot partitions & the shard key

Even spread depends entirely on the shard key. A bad key creates a hot partition — one shard drowning in traffic while others idle. A celebrity user on a "shard by user_id" scheme is the canonical example.

Bad shard key

  • Low cardinality (e.g. country)
  • Skewed access (celebrities)
  • Sequential (timestamps)
  • → hotspots, uneven load

Good shard key

  • High cardinality
  • Even access distribution
  • Matches your query pattern
  • → balanced, scalable

When a single key is unavoidably hot, the fixes are: add a random suffix to fan it across sub-partitions, cache it separately, or give it dedicated capacity. The shard key is the one decision that most often makes or breaks the design.

05

Replication topologies

TopologyWrites go toTrade-off
Single-leaderOne leader, streamed to followersSimple, no write conflicts; leader is a bottleneck & SPOF
Multi-leaderSeveral leaders (e.g. per region)Low-latency local writes; must resolve write conflicts
LeaderlessAny replica; client uses quorumsHighly available; app handles read repair & conflicts
Single-leader replication
Client write
Leader
Follower 1
Follower 2
Synchronous = durable but slower (wait for followers). Asynchronous = fast but the un-replicated tail is lost if the leader dies.

Asynchronous replication introduces replication lag: a follower can be behind the leader, so a read from it may be stale. That silently breaks read-your-writes — a user edits their profile, then sees the old value. The fix is to route a user's reads to the leader (or a caught-up follower) for a short window after they write.

06

Quorums tie it together

Leaderless systems make the consistency knob explicit. With N replicas, require W acknowledgements per write and R replicas per read. If W + R > N, the read and write sets overlap on at least one current replica — strong consistency. Drop below that and you trade freshness for latency and availability.

→ Key insight

Partitioning and replication are different axes: shard for scale, replicate for survival, and use quorums to choose how consistent each operation must be. When you need one agreed value rather than a tunable one — a single leader, a committed transaction — you've crossed into consensus. The consistency you're tuning here is defined in CAP & consistency models.

Frequently asked

Quick answers

Partitioning vs replication?

Partitioning splits a dataset across machines to scale capacity. Replication copies the same data onto several machines for fault tolerance and read scaling. Real systems shard for scale and replicate each shard for durability.

What is consistent hashing?

Keys and nodes map onto a ring; a key is owned by the next node clockwise. Adding/removing a node moves only a small fraction of keys instead of remapping everything. Virtual nodes even out the load.

What is a hot partition?

A shard receiving disproportionate traffic due to a skewed shard key (e.g. a celebrity). Fixes: a higher-cardinality key, a random suffix to spread the hot key, or separate caching.

What is replication lag?

The delay between a write committing on the leader and appearing on a follower. With async replication, reads from a lagging follower can be stale, breaking read-your-writes unless you route reads to a caught-up replica.