Design a Vector Database — A Guided System Design

Q: Why a database just for vectors?

A vector database is built around approximate nearest-neighbour (ANN) search: specialized in-memory indexes (IVF, HNSW) that return the closest vectors in milliseconds by being approximately right instead of exhaustively exact.

Q: A query vector in, k neighbours out

A Query API takes a query vector, k, and optional filters, plans the search, and returns the k nearest ids + scores. Everything else — indexes, segments, shards — hangs off this one contract.

Q: Ingest, segments, and the index

An Ingest / Upsert path writes each vector twice: the full vector into an immutable Vector Segment (for exact re-ranking and index rebuilds), and a pointer into the ANN Index (graph links or a cell assignment). Segments merge in the background, so live search never blocks on writes.

Q: Exact kNN and the distance metric

Similarity is a distance metric — usually cosine (angle) or dot product for embeddings, L2 for some. Exact kNN computes it against every vector and keeps the top k: correct, and O(N·d) per query. That’s the baseline — and why everything past here is about avoiding it.

Q: IVF — search a few cells, not all

IVF (inverted file) clusters the vectors with k-means into many cells, each with a centroid. A query measures distance to the centroids, picks the closest nprobe cells, and scans only those. Visit more cells → higher recall, slower; fewer → faster, lower recall.

Q: HNSW — navigable small worlds

HNSW (Hierarchical Navigable Small World) links each vector to a handful of near neighbours across layers: a sparse top layer for long jumps, denser lower layers for fine approach. Search enters at the top, greedily hops to closer nodes, and descends — reaching the neighbourhood in a few hops. efSearch controls how many candidates it keeps in flight: the speed/recall dial.

Q: Product quantization compresses the vectors

Product Quantization (PQ) splits each vector into m sub-vectors, learns a small codebook per subspace, and stores only the codebook ids — turning a 3 KB vector into ~64–96 bytes. Search estimates distances from the codes (fast, approximate), then re-ranks the top candidates against the full vectors in the segments for precision.

Q: Metadata filters and the recall trap

The Query API consults a Metadata Store and picks a strategy by selectivity: pre-filter when the predicate is selective (search only matching rows), over-fetch then post-filter when it’s loose (grab k′ ≫ k so enough survive), or push the predicate into the traversal. Naive post-filtering is the classic way to silently return too few, wrong results.

Q: Shard, replicate, scale

A Shard Router partitions vectors across shards and runs scatter/gather: every shard searches its slice and returns a local top-k; the coordinator merges them into the global top-k. Replicate each shard for QPS and failover. Because nearest-neighbour is mergeable, the global answer is just the best of the locals.

System Design · step by stepDesign a Vector Database

Step 1 / 9

Design a Vector Database — the walkthrough in full

A written version of the interactive walkthrough above — the same steps, decisions and trade-offs, laid out for reading, reference and search.

The big idea

Why a database just for vectors?

Embeddings turn text, images and users into points in a few-hundred-dimensional space, where nearest = most similar. The whole game is one query: "given this vector, find the k closest out of a billion." A normal database indexes values you can sort and equal-match — but "closest in 768-D space" isn’t a B-tree lookup. So how do you build an index for proximity?

A vector database is built around approximate nearest-neighbour (ANN) search: specialized in-memory indexes (IVF, HNSW) that return the closest vectors in milliseconds by being approximately right instead of exhaustively exact.

How to read this: Each step opens with a real design decision — you make the call before I show you what ships. Watch the diagram grow, hover any box, replay the flow. At the end, starve the search to feel ANN’s quiet failure. Hit Begin.

Step 1 · The skeleton

A query vector in, k neighbours out

A service has a query vector and wants the k most similar stored vectors. What sits between the request and the answer?

Design decision: What’s the minimal shape of a vector-search request?

The call: Send {query vector, k, filters} to a Query API that ranks by distance. — A coordinator accepts the vector, k, and any metadata filters, searches the index, and returns the k nearest ids with their similarity scores.

A Query API takes a query vector, k, and optional filters, plans the search, and returns the k nearest ids + scores. Everything else — indexes, segments, shards — hangs off this one contract.

Search is the product: A vector DB is read-optimized: writes feed an index whose only job is to make "k nearest" fast. Keep that contract in mind — every later piece exists to serve it.

Step 2 · Store what you search

Ingest, segments, and the index

Vectors arrive from an embedding pipeline. You need to both keep them (to return and re-rank) and index them (to search). How does the write path work?

Design decision: A new batch of vectors arrives. Where do they go?

The call: Append the full vectors to a segment, and insert them into the index. — The ingest path does both: the raw vectors land in an immutable segment (for exact re-rank and rebuilds), and each vector is inserted into the ANN index (a cell assignment or graph links).

An Ingest / Upsert path writes each vector twice: the full vector into an immutable Vector Segment (for exact re-ranking and index rebuilds), and a pointer into the ANN Index (graph links or a cell assignment). Segments merge in the background, so live search never blocks on writes.

Index vs. store: Two jobs, two structures: the segment is the durable source of truth (full precision); the index is a fast, rebuildable accelerator. You can always rebuild the index from segments — never the other way around.

Step 3 · The naive baseline

Exact kNN and the distance metric

To rank by similarity you need a number for "how close." And the obvious algorithm — compare the query to every vector — is correct. Why can’t you just ship that?

Design decision: What kills brute-force exact nearest-neighbour at scale?

The call: It’s O(N·d) per query — a billion 768-D dot products every search. — Exact search compares the query to every vector. At a billion vectors × hundreds of dims, that’s billions of multiply-adds per query — far too slow and CPU-hungry to serve online.

Similarity is a distance metric — usually cosine (angle) or dot product for embeddings, L2 for some. Exact kNN computes it against every vector and keeps the top k: correct, and O(N·d) per query. That’s the baseline — and why everything past here is about avoiding it.

Pick one metric, normalize: The metric must match how the embeddings were trained (cosine vs dot vs L2). Normalize vectors so cosine becomes a dot product, and so every later shortcut measures the same "close."

Step 4 · Partition the space

IVF — search a few cells, not all

Exact search looks at all N vectors. But the answer is almost always near the query. How do you skip the 99% of vectors that are obviously far away?

Design decision: How do you avoid scanning vectors that can’t be the answer?

The call: Cluster vectors into cells; at query time, scan only the nearest few cells. — IVF runs k-means to make ~√N centroids ("cells"). A query finds its closest centroids and scans only the vectors in those nprobe cells — a fraction of N.

IVF (inverted file) clusters the vectors with k-means into many cells, each with a centroid. A query measures distance to the centroids, picks the closest nprobe cells, and scans only those. Visit more cells → higher recall, slower; fewer → faster, lower recall.

nprobe is the speed/recall dial: IVF turns "scan N" into "scan N·(nprobe/ncells)." The trade-off is explicit: nprobe is a knob you tune against a measured recall target — the same knob the chaos button will yank.

Step 5 · Walk a graph instead

HNSW — navigable small worlds

IVF is great, but its recall plateaus and it needs retraining as data shifts. What if, instead of partitioning space, you could walk straight to the neighbourhood from any starting point?

Design decision: How can you reach a query’s neighbours in a few hops, from anywhere?

The call: Link each vector to its near neighbours in a multi-layer graph, then greedily hop closer. — HNSW builds a layered "small-world" graph: sparse long-range links up top for big jumps, dense local links below for precision. Search greedily hops to ever-closer nodes.

HNSW (Hierarchical Navigable Small World) links each vector to a handful of near neighbours across layers: a sparse top layer for long jumps, denser lower layers for fine approach. Search enters at the top, greedily hops to closer nodes, and descends — reaching the neighbourhood in a few hops. efSearch controls how many candidates it keeps in flight: the speed/recall dial.

Six degrees, for vectors: Small-world graphs give short paths between any two points. HNSW exploits that: O(log N)-ish hops to the neighbourhood, high recall, no retraining — at the cost of more RAM for the edges.

Step 6 · Make a billion fit

Product quantization compresses the vectors

A billion 768-D float32 vectors is ~3 TB — it won’t fit in RAM, and the index needs RAM to be fast. How do you shrink the vectors without losing the ability to rank them?

Design decision: How do you fit billions of vectors in memory and still rank by distance?

The call: Split each vector into sub-vectors and replace each with a learned codebook id (PQ). — Product quantization chops the vector into m sub-vectors, k-means-clusters each subspace, and stores tiny codebook ids. A 768-D float32 vector (3 KB) becomes ~64–96 bytes — distances estimated from the codes.

Product Quantization (PQ) splits each vector into m sub-vectors, learns a small codebook per subspace, and stores only the codebook ids — turning a 3 KB vector into ~64–96 bytes. Search estimates distances from the codes (fast, approximate), then re-ranks the top candidates against the full vectors in the segments for precision.

Compress to search, store to verify: Two tiers: compressed codes (in RAM) make the index huge-but-fast and approximate; the full vectors (in segments) re-rank the shortlist exactly. It’s the recall/precision split again — applied to memory.

Step 7 · Nearest, but filtered

Metadata filters and the recall trap

Real queries aren’t just "nearest" — they’re "nearest where tenant = X and recency < 30 days." Bolt a filter onto ANN search naively and recall quietly tanks. Why?

Design decision: You want the nearest vectors that also match a metadata filter. What goes wrong?

The call: Combine them — pre-filter selective queries, over-fetch + post-filter loose ones, or filter inside the index. — There’s no single winner: the planner picks a strategy by selectivity, often over-fetching (k′ > k) before filtering, or evaluating the predicate during graph/cell traversal.

The Query API consults a Metadata Store and picks a strategy by selectivity: pre-filter when the predicate is selective (search only matching rows), over-fetch then post-filter when it’s loose (grab k′ ≫ k so enough survive), or push the predicate into the traversal. Naive post-filtering is the classic way to silently return too few, wrong results.

Filtering is a planner problem: The right plan depends on how many rows the filter keeps. That’s why production vector DBs ship a little query planner — "nearest + filtered" is two problems that fight, and selectivity decides the winner.

Step 8 · Past one machine

Shard, replicate, scale

One billion-vector index outgrows a single box’s RAM, and one replica can’t serve the QPS. How do you grow past one machine without breaking "k nearest"?

Design decision: How do you scale a vector index beyond one machine’s memory?

The call: Partition vectors across shards; scatter the query to all, gather and merge top-k. — A Shard Router fans the query out to every shard, each returns its local top-k, and the coordinator merges them into the global top-k. Add read replicas per shard for QPS and availability.

A Shard Router partitions vectors across shards and runs scatter/gather: every shard searches its slice and returns a local top-k; the coordinator merges them into the global top-k. Replicate each shard for QPS and failover. Because nearest-neighbour is mergeable, the global answer is just the best of the locals.

Top-k merges cleanly: Unlike a global SQL sort, k-NN distributes beautifully: the global top-k is always contained in the union of per-shard top-ks. Scatter wide, gather, merge — done.

The payoff

You built a vector database

From "find the closest of a billion vectors" to a system that does it in milliseconds: a read-optimized Query API, an ingest path feeding immutable segments and an ANN index, IVF cells or an HNSW graph for sub-linear search, product quantization to fit RAM, a filter-aware planner, and scatter/gather sharding.

Now starve the search — drop nprobe/efSearch to the floor — and watch recall collapse with no error at all: the index returns k vectors, just not the nearest ones. That’s why you measure recall@k against an exact baseline instead of trusting that "it returned results."

Query API — one contract: query vector + k + filters → nearest ids
Segments vs index — full vectors are truth; the ANN index is a rebuildable accelerator
Distance metric — cosine/dot/L2 — normalize, and match the embeddings
Exact kNN — correct but O(N·d) — the baseline you approximate
IVF — k-means cells; nprobe trades recall for speed
HNSW — layered small-world graph; ~log N hops, efSearch dial
Product quantization — compress to fit RAM, re-rank exactly on the shortlist
Filters + sharding — a selectivity-aware planner; scatter/gather mergeable top-k