Why retrieval at all?

Retrieval-Augmented Generation (RAG) means that before generating, you fetch the most relevant passages from your knowledge base and put them in the prompt. The model reasons over real, current text instead of fuzzy memory, and it can cite its sources.

Retrieve, then generate

A RAG Gateway orchestrates two phases for every question: retrieve the most relevant passages, then generate an answer grounded in them. Knowledge lives outside the model, so it can be updated without retraining and every answer can point back to its sources.
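The two phases can be sketched as a single function that takes a retriever and a generator as plug-in parts. This is a minimal illustration, not a real gateway: `answer`, `toy_retrieve`, `toy_generate`, and the `KNOWLEDGE` entries are all invented for the example, and the keyword lookup and passage echo stand in for a real similarity search and a real LLM call.

```python
from typing import Callable, List

def answer(question: str,
           retrieve: Callable[[str], List[str]],
           generate: Callable[[str, List[str]], str]) -> str:
    """Phase 1: fetch passages. Phase 2: generate an answer grounded in them."""
    passages = retrieve(question)           # retrieve
    return generate(question, passages)     # then generate

# Toy knowledge base (made-up text, for illustration only).
KNOWLEDGE = {
    "refund": "Refunds are handled from the orders page.",
    "cancel": "You can cancel your plan from the billing page.",
}

def toy_retrieve(question: str) -> List[str]:
    # Keyword lookup standing in for real similarity search.
    lowered = question.lower()
    return [text for key, text in KNOWLEDGE.items() if key in lowered]

def toy_generate(question: str, passages: List[str]) -> str:
    # Echoes the passages; a real system would call an LLM here.
    return " ".join(passages) if passages else "I don't know."

print(answer("How do I cancel?", toy_retrieve, toy_generate))
print(answer("What is the weather?", toy_retrieve, toy_generate))
```

Passing the two phases in as functions keeps the orchestration independent of whichever store or model sits behind them.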

Chunk and embed your documents

An offline Ingestion pipeline loads each document, splits it into chunks, runs every chunk through an Embedding Model to get a vector, and upserts the vectors into a Vector Store (plus the raw text into a Document Store). This is the prep work that makes retrieval possible.
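A rough sketch of that ingestion loop, under loud assumptions: chunks are fixed-size word windows with a small overlap (one common choice among many), `toy_embed` is a keyword counter standing in for a real Embedding Model, and both stores are plain dictionaries. All names here are hypothetical.

```python
from typing import Dict, List

def chunk_text(text: str, size: int = 8, overlap: int = 2) -> List[str]:
    """Split text into windows of `size` words; consecutive windows share `overlap` words."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):   # this window reached the end of the text
            break
    return chunks

def toy_embed(chunk: str) -> List[float]:
    # Placeholder for an embedding model: counts of three keywords.
    lowered = chunk.lower()
    return [float(lowered.count(w)) for w in ("price", "support", "bill")]

def ingest(doc_id: str, text: str,
           vector_store: Dict[str, List[float]],
           document_store: Dict[str, str]) -> None:
    for i, chunk in enumerate(chunk_text(text)):
        chunk_id = f"{doc_id}#{i}"
        vector_store[chunk_id] = toy_embed(chunk)   # upsert the vector
        document_store[chunk_id] = chunk            # keep the raw text alongside

vectors: Dict[str, List[float]] = {}
texts: Dict[str, str] = {}
ingest("faq", "Your bill is issued monthly and support can adjust the price on request", vectors, texts)
print(sorted(texts))
```

Keying both stores by the same chunk id is what lets the retriever turn a vector hit back into text later.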

Embed the query, search by similarity

The Retriever embeds the query with the same model, then asks the Vector Store for the top-k nearest chunks via an ANN index. The Document Store returns their text. Semantic match means "cancel my plan" finds "terminating your subscription."

Top-k and the recall/precision dial

Retrieve a modest top-k candidate set from the Vector + Document stores — wide enough that the answer is almost certainly in there (recall), but not so wide it drowns the prompt. The next step tightens it to the best few (precision).

A Reranker (a cross-encoder) re-scores the top-k candidates by reading the query and each chunk together, then keeps only the best 3–5. It is too slow to run over the whole store, but fast enough over a shortlist of around 20 candidates.
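The shape of the rerank step can be sketched as below. The scorer is the assumption to watch: `toy_score` is a word-overlap stand-in, not a cross-encoder, and exists only so the example runs without a model. `rerank` and its `keep` parameter are likewise hypothetical names.

```python
from typing import Callable, List

def rerank(query: str, candidates: List[str],
           score_pair: Callable[[str, str], float], keep: int = 3) -> List[str]:
    """Score each (query, chunk) pair jointly and keep the `keep` best chunks."""
    scored = [(score_pair(query, chunk), chunk) for chunk in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)   # highest score first
    return [chunk for _, chunk in scored[:keep]]

def toy_score(query: str, chunk: str) -> float:
    # Stand-in for a cross-encoder: fraction of query words that appear in the chunk.
    q = set(query.lower().split())
    c = set(chunk.lower().split())
    return len(q & c) / len(q) if q else 0.0

shortlist = [
    "contact support by email",
    "our plan tiers and pricing",
    "cancel your plan from the billing page",
]
print(rerank("cancel my plan", shortlist, toy_score, keep=2))
```

The cost scales with the number of candidates, since every pair is scored separately, which is why this step runs on the shortlist rather than the whole store.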

Build the prompt and generate

The Gateway assembles a grounded prompt — system instructions + the reranked chunks + the question — and tells the LLM to answer only from the passages, cite them, and admit when they don’t cover the question. The chunks carry their source metadata, so citations point at real documents.
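One way the assembly might look, as a sketch only: the instruction wording, the `[n]` citation format, and the `source` and `text` keys on each chunk are assumptions made for this example, not a fixed template.

```python
from typing import Dict, List

def build_prompt(question: str, chunks: List[Dict[str, str]]) -> str:
    """Assemble system instructions + numbered passages + the question."""
    system = (
        "Answer using only the numbered passages below. "
        "Cite passages like [1]. "
        "If the passages do not cover the question, say \"I don't know\"."
    )
    passages = "\n".join(
        f"[{i}] (source: {chunk['source']}) {chunk['text']}"
        for i, chunk in enumerate(chunks, start=1)
    )
    return f"{system}\n\nPassages:\n{passages}\n\nQuestion: {question}\nAnswer:"

reranked = [
    {"source": "billing.md", "text": "Cancel from the billing page."},
    {"source": "plans.md", "text": "Plans renew monthly."},
]
print(build_prompt("How do I cancel?", reranked))
```

Numbering the passages and printing each source next to its text gives the model something concrete to cite, and gives you a way to map `[2]` back to a real document.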

Citations, "I don’t know", and evals

The answer ships with citations back to the source chunks, the model is allowed to say "I don’t know" when retrieval comes up thin, and an eval harness tracks retrieval quality (did we fetch the right chunks?) and answer faithfulness (did the answer stick to them?). That’s how you catch a silently drifting index before users do.
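The retrieval half of such a harness can be approximated with recall@k over a small labelled set. The sketch below assumes each test case pairs the chunk ids the retriever returned with the ids a human marked as relevant; the function names and the sample ids are illustrative. Answer faithfulness needs a separate check and is not covered here.

```python
from typing import List, Sequence, Tuple

def recall_at_k(retrieved: Sequence[str], relevant: Sequence[str], k: int) -> float:
    """Fraction of the relevant chunk ids that appear in the top-k retrieved ids."""
    wanted = set(relevant)
    if not wanted:
        return 0.0
    return len(set(retrieved[:k]) & wanted) / len(wanted)

def mean_recall_at_k(cases: List[Tuple[Sequence[str], Sequence[str]]], k: int) -> float:
    """Average recall@k across all labelled cases."""
    if not cases:
        return 0.0
    return sum(recall_at_k(got, want, k) for got, want in cases) / len(cases)

# (retrieved ids in rank order, ids labelled relevant) - made-up sample data
cases = [
    (["a", "b", "c"], ["a"]),
    (["x", "y", "z"], ["z", "q"]),
]
print(mean_recall_at_k(cases, k=2))   # 0.5
print(mean_recall_at_k(cases, k=3))   # 0.75
```

Tracking this number on a fixed question set over time is one way to notice an index drifting: the questions stay the same, so a falling score points at retrieval.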

Design a RAG Pipeline — A Guided System Design

RUN IT YOURSELF

Retrieval by cosine similarity

RAG finds the most relevant chunks for a query by comparing embedding vectors with cosine similarity, then feeds the top-k to the LLM. Here is that retrieval core in plain Python. Read the comments, then try editing the vectors and running it again.

HOW TO READ THE CODE — 4 IDEAS

Text becomes a vector (embedding); similar meaning → similar direction.
Cosine similarity measures the angle between two vectors, ignoring length (steps 1–2).
Score every document against the query, then take the top-k (step 3).
Those k chunks are what actually get stuffed into the LLM prompt.

# A RAG pipeline retrieves the most relevant chunks for a query by
# comparing embedding vectors with COSINE SIMILARITY, then feeds the
# top-k to the LLM. Here's the retrieval core (toy 3-D embeddings).
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))            # STEP 1 - dot product
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0      # STEP 2 - normalise by lengths

def retrieve(query_vec, docs, k):
    # STEP 3 - score every doc, then take the top-k by similarity.
    scored = [(cosine(query_vec, vec), name) for name, vec in docs]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# toy embeddings on axes [pricing, support, billing]
docs = [
    ("refunds_policy", [0.1, 0.2, 0.9]),
    ("pricing_tiers",  [0.9, 0.1, 0.2]),
    ("how_to_contact", [0.2, 0.9, 0.1]),
]
query = [0.05, 0.1, 0.95]   # a billing-ish question
print("top 2:", retrieve(query, docs, 2))
