Design Semantic Search — A Guided System Design

Design Semantic Search — the walkthrough in full

A written version of the interactive walkthrough above — the same steps, decisions and trade-offs, laid out for reading, reference and search.

The big idea

Why search by meaning, not words?

Keyword search matches tokens: the query "cancel my plan" finds only documents that contain those words. It misses "end your subscription," "close account," and "stop billing," which express the same intent in different words, and it can rank a page that merely mentions "plan" ten times above the one that answers the question. The user thinks in meaning; the index thinks in strings. How do you close that gap?

Semantic search embeds text into vectors where nearby = similar meaning, then retrieves by vector distance instead of word overlap. The whole system is an embeddings pipeline: encode the corpus once, encode each query, and find the closest documents in that shared meaning-space.
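To make "nearby = similar meaning" concrete, here is a minimal sketch of the distance computation. The three-dimensional vectors are hand-made stand-ins for real embeddings (an assumption for illustration only), and cosine similarity is one common choice of metric rather than the only one.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 = same direction, 0.0 = orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hand-made vectors standing in for model output (illustrative values, not real embeddings).
query = [0.9, 0.1, 0.0]            # "cancel my plan"
doc_same_intent = [0.8, 0.2, 0.1]  # "end your subscription"
doc_other_topic = [0.0, 0.2, 0.9]  # "plan a hiking trip"

sim_same = cosine_similarity(query, doc_same_intent)
sim_other = cosine_similarity(query, doc_other_topic)
```

With these values the same-intent document scores far higher than the one that only shares the word "plan", which is the behaviour the rest of the system is built to deliver at scale.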

How to read this: Each step opens with a real design decision; make the call yourself before reading what ships. The final section skews the model to show semantic search’s quietest failure.

Step 1 · The skeleton

Text query in, ranked documents out

A client sends a plain-text query and wants the most relevant documents back, ranked. What sits between the words and the answer?

Design decision: What’s the minimal shape of a semantic-search request?

The call: Send the text to a Search API that embeds it and ranks documents by vector distance. — A coordinator embeds the query into the same space as the documents, searches an index for the closest vectors, and returns ranked ids with scores and snippets.

A Search API takes a text query, turns it into a vector, searches for the nearest document vectors, and returns ranked ids + scores + snippets. Everything else — the embedder, the indexes, the store — hangs off this one contract.

Retrieval is the product: Semantic search stops at "here are the most relevant documents." It’s the retrieval layer that retrieval-augmented generation (RAG) later builds on, but on its own it’s already a product: site search, help centers, doc search, e-commerce.
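The contract above could be written down as a few small types. This is a sketch only: the names SearchRequest, SearchHit, SearchResponse and build_response are hypothetical, and a real API would likely carry filters, pagination and metadata as well.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SearchRequest:
    query: str       # plain text from the client
    top_k: int = 10  # how many ranked results to return

@dataclass(frozen=True)
class SearchHit:
    chunk_id: str
    score: float
    snippet: str

@dataclass(frozen=True)
class SearchResponse:
    hits: list[SearchHit]

def build_response(request: SearchRequest, candidates: list[SearchHit]) -> SearchResponse:
    """Order scored candidates best-first and cut the list to the requested size."""
    ranked = sorted(candidates, key=lambda hit: hit.score, reverse=True)
    return SearchResponse(hits=ranked[: request.top_k])
```

Everything added in later steps (the embedder, the indexes, the reranker) only changes how the candidates and their scores are produced; the shape of the request and response can stay the same.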

Step 2 · The shared model

One embedding model, two callers

Both documents and queries have to become vectors, and their vectors are only comparable if they live in the same space. Where does that mapping come from — and who is allowed to use it?

Design decision: How do documents and queries end up in a comparable vector space?

The call: Use one shared embedding model (same weights + version) for both ingest and query. — A single model maps text to a fixed-length vector where nearby = similar meaning. Both the ingest pipeline and the query path call the same model+version, so their vectors are directly comparable.

A single Embedding Model — one set of weights, one version — is the heart of the system. The ingest pipeline calls it to embed documents; the Search API calls it to embed queries. Same model, same space, comparable vectors. Pin that version: it’s the contract every stored vector depends on.

One space or nothing: Embeddings only have meaning relative to the model that made them. The moment query and document vectors come from different models — or different versions of the same model — distance becomes noise. This coupling is subtle, and it’s the trap the chaos button springs.
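One way to picture "one model, two callers" is a single embedder instance whose version travels with every vector it produces. The ToyEmbedder below is a hypothetical stand-in (hashed bag-of-words, so it captures no meaning at all); only the wiring is the point.

```python
import hashlib
import math
from dataclasses import dataclass

@dataclass(frozen=True)
class Embedding:
    vector: tuple[float, ...]
    model_version: str  # stamped on every vector so mismatches can be detected later

class ToyEmbedder:
    """Stand-in for a real model: hashed bag-of-words, L2-normalized. Not semantic."""

    def __init__(self, version: str, dim: int = 16):
        self.version = version
        self.dim = dim

    def embed(self, text: str) -> Embedding:
        vec = [0.0] * self.dim
        for token in text.lower().split():
            # Salting the hash with the version mimics how a new model defines a new space.
            digest = hashlib.sha256(f"{self.version}:{token}".encode()).digest()
            vec[digest[0] % self.dim] += 1.0
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return Embedding(tuple(x / norm for x in vec), self.version)

# One pinned instance, shared by both paths.
EMBEDDER = ToyEmbedder(version="toy-v1")

def embed_document_chunk(text: str) -> Embedding:
    """Called by the ingest pipeline."""
    return EMBEDDER.embed(text)

def embed_query(text: str) -> Embedding:
    """Called by the Search API."""
    return EMBEDDER.embed(text)
```

Because both functions go through the same instance, identical text always lands on the identical vector regardless of which path embedded it.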

Step 3 · Build the index

Chunk → embed → index the corpus

A document can be a 40-page manual — too big to embed as one vector or return as one result. And the same chunk has to be findable, rankable, and displayable. How does the write path work?

Design decision: A new batch of documents arrives. What does ingest do with them?

The call: Split into chunks, embed each chunk, and write it to the index, store, and (later) lexical index under one id. — The pipeline chunks and cleans, embeds each chunk with the shared model, and the indexer upserts it everywhere: vector → ANN index, text+metadata → document store, tokens → lexical index — all keyed by one id.

The ingest pipeline chunks and cleans each document, embeds every chunk with the shared model, and the Indexer upserts it under one id into three places: the Vector Index (for semantic search), the Document Store (text + metadata to display), and — soon — the Lexical Index. Do it once, offline; serve it forever.

One id, three homes: The vector finds it, the lexical index also finds it, and the document store shows it. Keeping them under a single chunk id is what lets you fuse and hydrate results later without a join nightmare.
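The write path might look roughly like this. It is a sketch under several assumptions: fixed-size word windows with overlap as the chunking strategy, plain dicts standing in for the three stores, a doc_id#n scheme for chunk ids, and a hashed bag-of-words toy_embed in place of the shared model.

```python
import hashlib

def chunk_text(text: str, max_words: int = 50, overlap: int = 10) -> list[str]:
    """Split text into overlapping word windows (requires max_words > overlap)."""
    words = text.split()
    if not words:
        return []
    step = max_words - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + max_words]) for i in range(0, last_start, step)]

def toy_embed(text: str, dim: int = 8) -> list[float]:
    """Placeholder for the shared embedding model (hashed bag-of-words)."""
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[hashlib.sha256(token.encode()).digest()[0] % dim] += 1.0
    return vec

# In-memory stand-ins for the three homes, all keyed by chunk id.
vector_index: dict[str, list[float]] = {}
document_store: dict[str, dict] = {}
lexical_index: dict[str, list[str]] = {}

def ingest(doc_id: str, text: str) -> list[str]:
    """Chunk, embed and upsert one document; returns the chunk ids written."""
    chunk_ids = []
    for n, chunk in enumerate(chunk_text(text)):
        chunk_id = f"{doc_id}#{n}"  # hypothetical id scheme
        vector_index[chunk_id] = toy_embed(chunk)
        document_store[chunk_id] = {"doc_id": doc_id, "text": chunk}
        lexical_index[chunk_id] = chunk.lower().split()
        chunk_ids.append(chunk_id)
    return chunk_ids
```

Since the id is derived from the document id and chunk position, re-ingesting the same document overwrites its entries in all three stores instead of duplicating them.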

Step 4 · The query path

Embed the query, search the space

The index is built. A query arrives as text. Walk the path from words to ranked documents — where does the query become a vector, and what does it hit?

Design decision: What’s the read path for a semantic query?

The call: Embed the query with the shared model, ANN-search the vector index, hydrate ids from the store. — The Search API embeds the query into the same space as the documents, asks the ANN index for the nearest chunk vectors, then pulls their text from the document store to return.

The Search API embeds the query with the shared model, sends the vector to the Vector Index for approximate-nearest-neighbour search, gets back the closest chunk ids + scores, and hydrates them from the Document Store. That’s semantic retrieval end to end — meaning in, ranked documents out.

Symmetry is the whole trick: Query and documents take the same road through the same model into the same space. Break that symmetry anywhere — different model, different version, different normalization — and distances stop meaning anything.
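The read path can be sketched in a few lines. Note the substitution: this uses an exact brute-force scan where a production system would use an approximate-nearest-neighbour index, and the two-dimensional vectors, store contents and embed callable are all made up for illustration.

```python
import heapq
import math
from typing import Callable

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Tiny in-memory stand-ins for the Vector Index and Document Store.
VECTOR_INDEX = {"c1": [1.0, 0.0], "c2": [0.7, 0.7], "c3": [0.0, 1.0]}
DOCUMENT_STORE = {
    "c1": "How to end your subscription",
    "c2": "Billing and account settings",
    "c3": "Planning a hiking trip",
}

def nearest_chunks(query_vec: list[float], k: int) -> list[tuple[str, float]]:
    """Exact k-nearest-neighbour scan; an ANN index trades a little recall for speed."""
    scored = ((cosine(query_vec, vec), cid) for cid, vec in VECTOR_INDEX.items())
    return [(cid, score) for score, cid in heapq.nlargest(k, scored)]

def search(query: str, embed: Callable[[str], list[float]], k: int = 2) -> list[dict]:
    """Embed the query, find the closest chunks, hydrate them from the store."""
    query_vec = embed(query)  # must be the same model that embedded the corpus
    return [
        {"id": cid, "score": score, "snippet": DOCUMENT_STORE[cid][:80]}
        for cid, score in nearest_chunks(query_vec, k)
    ]
```

Usage would be something like search("cancel my plan", embed=shared_model_embed), where shared_model_embed is whatever function wraps the pinned embedding model.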

Step 5 · Cover the blind spot

Hybrid search — add lexical BM25

Pure semantic search has a weakness: it blurs exact tokens. Search "error ORA-00942" or the product name "Zephyr-9" and embeddings may return things that are about errors or breezes — semantically near, literally wrong. How do you keep meaning and exact matches?

Design decision: How do you fix embeddings missing exact terms and rare tokens?

The call: Run BM25 lexical search alongside vector search and fuse the two ranked lists. — Hybrid search queries a lexical (BM25) index and the vector index in parallel, then merges their rankings — often with Reciprocal Rank Fusion — so exact tokens and semantic matches both surface.

Add a Lexical Index (BM25) fed by the same indexer. Every query now runs both: vector search for meaning, BM25 for exact terms. The Search API fuses the two ranked lists — typically Reciprocal Rank Fusion (RRF), which needs no score calibration — into one list that has the best of both.

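Two recall sources, one ranking: Semantic recall and lexical recall fail in different places, so their union is stronger than either. RRF scores each document by summing 1/(k + rank) over the lists it appears in, so it rewards documents that rank high in either list and most of all those that rank high in both. It works on ranks alone, which makes it a robust way to combine the two with little tuning (k = 60 is a common default).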
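Reciprocal Rank Fusion is small enough to show whole. In this sketch each document's fused score is the sum of 1 / (k + rank) over the lists it appears in; k = 60 is a commonly used constant, and the chunk ids are invented for the example.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked id lists using ranks only, so raw scores never need calibrating."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

vector_hits = ["c7", "c2", "c9"]  # best-first from the Vector Index
bm25_hits = ["c2", "c4", "c7"]    # best-first from the Lexical Index

fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
fused_ids = [doc_id for doc_id, _ in fused]
```

Here c2 and c7 appear in both lists and so outrank c4 and c9, which each appear in only one. Because only ranks are used, BM25 scores and cosine similarities never have to be put on a common scale.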

Step 6 · Precision at the top

Rerank the shortlist with a cross-encoder

Retrieval gives you ~50 decent candidates fast, but the order of the top 5 is what the user sees. Bi-encoder vector scores are cheap but coarse. How do you sharpen just the top of the list without reranking the whole corpus?

Design decision: How do you get the ordering of the top results right?

The call: Rerank the fused shortlist with a cross-encoder that reads query + candidate together. — A cross-encoder scores true relevance by attending over the query and document jointly — far more accurate than independent embeddings. It’s slow, so you run it only on the ~50 fused candidates.

A Reranker (cross-encoder) takes the fused shortlist and scores each candidate by reading the query and document together, then reorders the top results. It’s expensive, so it runs on tens of candidates — not the index. Retrieve wide and cheap; rerank narrow and precise.

Two-stage retrieval: Bi-encoders (embeddings) are fast and coarse; cross-encoders are slow and sharp. The winning pattern is both: ANN + BM25 recall the candidates, the cross-encoder gets the final ordering right. The cross-encoder reads text, not vectors, so hydrate the candidates from the Document Store before reranking.
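The rerank stage is mostly plumbing around a pairwise scorer. In the sketch below, score_pair is where a real cross-encoder would be called; toy_pair_score is a plain word-overlap stand-in (not a cross-encoder) so the example runs without a model, and the function names are hypothetical.

```python
from typing import Callable

def rerank(
    query: str,
    candidate_ids: list[str],
    fetch_text: Callable[[str], str],
    score_pair: Callable[[str, str], float],
    shortlist_size: int = 50,
    top_n: int = 5,
) -> list[str]:
    """Score only the shortlist with the expensive pairwise scorer, then reorder it."""
    shortlist = candidate_ids[:shortlist_size]
    scored = [(score_pair(query, fetch_text(cid)), cid) for cid in shortlist]
    # Stable sort: candidates with equal scores keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cid for _, cid in scored[:top_n]]

def toy_pair_score(query: str, text: str) -> float:
    """Stand-in scorer (word-set overlap). A cross-encoder would read both texts jointly."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0
```

The cost of this stage is shortlist_size scorer calls per query regardless of corpus size, which is why the shortlist stays in the tens.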

Step 7 · The silent trap

Model versioning & reindexing

Six months in, a better embedding model ships. You point the query embedder at it and deploy. Search quality craters — but nothing errors, no logs, no empty results. Every stored vector was built by the old model. What just happened?

Design decision: You upgrade the query embedder but not the corpus. What breaks?

The call: Query and document vectors land in different spaces, so results go quietly random. — Embeddings only compare within one model+version. Change the query model without re-embedding the corpus and distances become noise — k results returned, ranked, with no error, but irrelevant.

Query and document vectors are only comparable if they come from the same model+version. Upgrading a model means re-embedding the entire corpus into a new index, then cutting over atomically — often building the new index alongside the old and swapping. Until then, pin the query path to the version the index was built with. Never mix.

The version is part of the contract: Treat the embedding model version like a schema: it’s baked into every vector you’ve stored. A "harmless" model bump is a full migration. This is the failure the chaos button triggers — silent, no exception, just wrong.
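One way to turn the silent failure into a loud one is to record the model version on the index and refuse queries embedded with anything else. The sketch below assumes hypothetical VectorIndex and IndexAlias classes and uses a dot-product scan in place of a real index. If the new model had a different output dimension, many vector indexes would reject the query outright; the dangerous case is when the dimensions happen to match.

```python
from dataclasses import dataclass, field

class ModelVersionMismatch(Exception):
    """Raised when a query vector comes from a different model than the index."""

@dataclass
class VectorIndex:
    model_version: str  # the version every stored vector was built with
    vectors: dict[str, list[float]] = field(default_factory=dict)

    def search(self, query_vec: list[float], query_model_version: str, k: int = 10) -> list[str]:
        if query_model_version != self.model_version:
            raise ModelVersionMismatch(
                f"index built with {self.model_version!r}, "
                f"query embedded with {query_model_version!r}"
            )
        scored = [
            (sum(q * v for q, v in zip(query_vec, vec)), cid)
            for cid, vec in self.vectors.items()
        ]
        scored.sort(reverse=True)
        return [cid for _, cid in scored[:k]]

class IndexAlias:
    """The name the query path reads through; cutover is a single reassignment."""

    def __init__(self, live: VectorIndex):
        self.live = live

    def cut_over(self, rebuilt: VectorIndex) -> None:
        # Call only after the whole corpus has been re-embedded into `rebuilt`.
        self.live = rebuilt
```

In this arrangement the new index is built alongside the old one, and the query embedder is switched to the new version in the same release that calls cut_over.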

Step 8 · Serve it at scale

Cache, batch, and stay fresh

Popular queries repeat constantly, the embedding model is the most expensive hop per query, and the corpus keeps changing. How do you serve high QPS cheaply without going stale?

Design decision: What’s the cheapest safe way to cut per-query cost?

The call: Cache query embeddings (and hot result sets), and batch embed calls under load. — The same queries recur, so caching their embeddings — and even fused results with a short TTL — skips the model’s cost. Batching concurrent embeds keeps the model’s throughput high.

Put a cache in front of the embedding model (keyed by query text + model version) and optionally on fused result sets with a short TTL. Batch concurrent embed calls to keep the model’s throughput up. Invalidate result caches on index updates so fresh documents surface. Now the model runs mostly on new text, and QPS scales.

The model is the bottleneck: On the retrieval path, embedding the query is typically the most expensive hop, so every repeat query you can serve from cache is pure win, as long as the cache key includes the model version. Without it, a model upgrade keeps serving vectors cached from the old space, which is the step 7 skew all over again.
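A query-embedding cache with batching can be sketched as a thin wrapper around the model call. CachedEmbedder and its embed_batch parameter are hypothetical names, and the cache here is an unbounded dict; a real deployment would more likely use an LRU or TTL cache shared across processes.

```python
from typing import Callable

class CachedEmbedder:
    """Caches query embeddings by (model_version, text) and batches the misses."""

    def __init__(self, embed_batch: Callable[[list[str]], list[list[float]]], model_version: str):
        self._embed_batch = embed_batch      # one model call for many texts
        self._model_version = model_version
        self._cache: dict[tuple[str, str], list[float]] = {}
        self.model_calls = 0                 # counts batched calls, for observability

    def embed_many(self, queries: list[str]) -> list[list[float]]:
        keys = [(self._model_version, q) for q in queries]
        # De-duplicate misses while preserving order, so each new text is embedded once.
        missing = list(dict.fromkeys(key for key in keys if key not in self._cache))
        if missing:
            self.model_calls += 1
            vectors = self._embed_batch([text for _, text in missing])
            self._cache.update(zip(missing, vectors))
        return [self._cache[key] for key in keys]
```

Because the version is part of the key, constructing the wrapper with a new model_version starts from a cold cache instead of reusing vectors from the old space.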

The payoff

You built semantic search

From "keywords miss meaning" to a system that finds documents by intent: a shared embedding model, a chunk→embed→index ingest path, a query path that embeds and ANN-searches, hybrid BM25 fusion for exact terms, a cross-encoder reranker for precision, and version discipline so the space never fractures.

Now skew the model — upgrade the query embedder without reindexing — and watch relevance collapse with no error at all: k ranked results that are quietly random, because query and document vectors no longer share a space. That’s why the embedding model version is part of the contract, stamped on every vector, and an upgrade is a full reindex.