Design an LLM Cache — A Guided System Design

Q: Why cache LLM calls at all?

Put a caching layer in front of the model: check for a stored answer to this (or a near-identical) prompt before calling the LLM. A hit costs almost nothing and returns in milliseconds. It’s the single highest-leverage optimization for read-heavy LLM workloads — the cheapest call is the one you never make.

Q: Cache-aside around the model

The Cache API runs cache-aside: normalize the request, look it up, return on a hit, and on a miss call the LLM Provider and write the result into the store before returning. Apps call the same completion contract — they never know whether an answer was cached or freshly generated.

Q: Exact-match cache on a normalized key

Key the Exact Cache on a hash of the normalized request: canonicalized prompt/messages plus model, temperature, max-tokens, and tool definitions. A hit is byte-identical and always safe. This tier is cheap, O(1), and catches the surprisingly large share of traffic that repeats exactly — especially deterministic (temperature 0) calls.

Q: Semantic cache with a similarity threshold

Add a semantic tier: the Embedder vectorizes the prompt, the Semantic Index (ANN) finds the nearest cached prompt, and if similarity clears a tuned threshold its stored answer is reused. Now near-duplicates hit too. That threshold is the whole game: too tight and you miss real duplicates; too loose and you serve wrong answers — the failure this page’s chaos button triggers.

Q: Prompt-prefix (KV) caching on a miss

Layer prompt-prefix caching: the provider (or your inference server) caches the computed KV state for a shared prefix, so a request reusing that prefix skips prefill on those tokens and only processes the new suffix. It cuts cost and time-to-first-token even when the response cache misses — because the expensive part of a long prompt is the prefix everyone shares.

Q: TTL, invalidation, and versioning

Give every entry a TTL, invalidate entries whose inputs change (a document edited, a config updated), and version the key by model so a model upgrade doesn’t serve pre-upgrade answers. Scope keys per tenant. Now the cache stays fast and honest — freshness is a policy, not an accident.

Q: Hit-rate, savings, and false-hit alarms

Emit hit-rate per tier, cost and latency saved, and a false-hit signal (sample semantic hits and verify) — attributed per tenant. This proves the cache’s value and alarms when a loosened threshold starts serving wrong answers. Warm the cache for known-hot prompts so hit-rate is high from the first request.

System Design · step by stepDesign an LLM Cache

Step 1 / 9

Design an LLM Cache — the walkthrough in full

A written version of the interactive walkthrough above — the same steps, decisions and trade-offs, laid out for reading, reference and search.

The big idea

Why cache LLM calls at all?

Every LLM call costs real money and hundreds of milliseconds to seconds. And a huge fraction of production traffic is repetitive: the same FAQ, the same system-prompt-plus-document, a thousand slightly-reworded versions of the same question. Paying full price and full latency for an answer you’ve already computed is pure waste. How do you stop re-generating what you already know?

Put a caching layer in front of the model: check for a stored answer to this (or a near-identical) prompt before calling the LLM. A hit costs almost nothing and returns in milliseconds. It’s the single highest-leverage optimization for read-heavy LLM workloads — the cheapest call is the one you never make.

How to read this: Each step opens with a real design decision — you make the call before I show you what ships. Watch the diagram grow, hover any box, replay the flow. At the end, loosen the match to feel the semantic cache’s quiet failure. Hit Begin.

Step 1 · The skeleton

Cache-aside around the model

A prompt arrives. You want to return a cached answer if you have one, and otherwise call the model — and remember the result. What’s the control flow?

Design decision: What’s the right caching pattern in front of an LLM?

The call: Check the cache; on a hit return it, on a miss call the model then write the result back. — Classic cache-aside (read-through): look up first, serve on hit, and on a miss call the model and populate the cache so the next identical request is free.

The Cache API runs cache-aside: normalize the request, look it up, return on a hit, and on a miss call the LLM Provider and write the result into the store before returning. Apps call the same completion contract — they never know whether an answer was cached or freshly generated.

Read-through, populate on miss: The cache sits transparently between app and model. Correctness rule: only cache what’s safe to reuse, and always key on enough of the request that a hit really is the same question.

Step 2 · The trivial win

Exact-match cache on a normalized key

The safest hit is an identical request. But "identical" is subtle — whitespace, key order, and default params vary. What do you hash, and when is a hit safe to return verbatim?

Design decision: What makes a good exact-cache key for an LLM call?

The call: A hash of the normalized prompt + model + params (temperature, max tokens, tools). — Normalize (trim, canonical JSON, sorted keys), then hash prompt + model + all sampling params. Identical inputs → same key → a byte-identical, safe-to-reuse answer.

Key the Exact Cache on a hash of the normalized request: canonicalized prompt/messages plus model, temperature, max-tokens, and tool definitions. A hit is byte-identical and always safe. This tier is cheap, O(1), and catches the surprisingly large share of traffic that repeats exactly — especially deterministic (temperature 0) calls.

Determinism makes caching safe: Cache aggressively where output is (near-)deterministic; be cautious where sampling is high. The normalized key is what makes "the same request" a precise, hashable notion instead of a fuzzy one.

Step 3 · Catch the near-duplicates

Semantic cache with a similarity threshold

Exact matching misses "How do I cancel?" vs "How can I cancel my plan?" — same intent, different bytes, no hit. Most repetition is semantic, not literal. How do you reuse an answer for a prompt that means the same thing?

Design decision: How do you get a cache hit on a reworded-but-equivalent prompt?

The call: Embed the prompt, ANN-search cached-prompt embeddings, and reuse a hit above a similarity threshold. — The Embedder maps the prompt to a vector; the Semantic Index finds the nearest cached prompt; if similarity clears a tuned threshold, its stored answer is reused. Meaning-level matching.

Add a semantic tier: the Embedder vectorizes the prompt, the Semantic Index (ANN) finds the nearest cached prompt, and if similarity clears a tuned threshold its stored answer is reused. Now near-duplicates hit too. That threshold is the whole game: too tight and you miss real duplicates; too loose and you serve wrong answers — the failure this page’s chaos button triggers.

The threshold is a correctness dial: Semantic caching trades exactness for reach. Set the threshold conservatively, measure false hits, and optionally re-verify borderline matches — because a confidently-wrong cached answer is worse than a clean miss.

Step 4 · Reuse the prefix too

Prompt-prefix (KV) caching on a miss

Even on a cache miss, most requests share a huge, identical prefix — a long system prompt, a few-shot block, a retrieved document — followed by a short unique question. The model re-processes that whole prefix every time. Can a miss be cheaper?

Design decision: Two requests share a 4k-token system prompt but differ in the last line. How do you save?

The call: Cache the model’s computed prefix (KV) so the shared prefix isn’t recomputed each call. — Server-side prompt-prefix caching stores the transformer’s key/value tensors for the shared prefix; a new request with the same prefix skips prefill on those tokens and only computes the new suffix — a big latency and cost cut even on a response-cache miss.

Layer prompt-prefix caching: the provider (or your inference server) caches the computed KV state for a shared prefix, so a request reusing that prefix skips prefill on those tokens and only processes the new suffix. It cuts cost and time-to-first-token even when the response cache misses — because the expensive part of a long prompt is the prefix everyone shares.

Two caches, two layers: The response cache reuses answers; the prefix cache reuses computation. Structure prompts with the stable content first (system, examples, context) and the variable question last, so the shared prefix is as long as possible.

Step 5 · Keep it fresh

TTL, invalidation, and versioning

A cache that never forgets eventually lies: the underlying document changed, a policy updated, the model was upgraded — but the old answer sits there, served forever. How do you keep a cache from going stale?

Design decision: What keeps cached answers from becoming quietly wrong over time?

The call: TTL every entry, invalidate on input change, and version the key by model. — Each entry gets a TTL; when its source (a retrieved doc, a policy) changes you invalidate the affected entries; and the cache key includes the model version so an upgrade naturally partitions old from new.

Give every entry a TTL, invalidate entries whose inputs change (a document edited, a config updated), and version the key by model so a model upgrade doesn’t serve pre-upgrade answers. Scope keys per tenant. Now the cache stays fast and honest — freshness is a policy, not an accident.

Staleness is the tax on caching: Every cache trades freshness for speed. Make the trade explicit: short TTLs for volatile content, invalidation hooks on the sources you cache, and model-version in the key so upgrades don’t leak old outputs.

Step 6 · Prove it helps

Hit-rate, savings, and false-hit alarms

You’ve added two cache tiers, a prefix cache, TTLs and a threshold. Is any of it actually helping — and is the semantic tier quietly hurting? You can’t tell without measurement. What do you track?

Design decision: What metrics tell you the cache is working (and not lying)?

The call: Hit-rate (per tier), cost + latency saved, and a false-hit / mismatch signal — per tenant. — Hit-rate per tier shows reach, cost/latency saved shows value, and sampling semantic hits for correctness catches a threshold that’s gone too loose — all attributed per tenant.

Emit hit-rate per tier, cost and latency saved, and a false-hit signal (sample semantic hits and verify) — attributed per tenant. This proves the cache’s value and alarms when a loosened threshold starts serving wrong answers. Warm the cache for known-hot prompts so hit-rate is high from the first request.

Measure the whole trade: A cache’s value is hits × savings; its risk is false hits. Track both, or you’ll optimize hit-rate straight into the failure mode. The metrics are how you keep speed from silently costing correctness.

The payoff

You built an LLM cache

From "pay full price for every answer" to a layered cache: cache-aside control flow, an exact-match tier on a normalized key, a semantic tier with a tuned threshold, prompt-prefix (KV) reuse to cheapen even misses, TTL + invalidation + model-versioning for freshness, and hit-rate/savings metrics that keep it honest.

Now loosen the match — drop the semantic threshold — and watch the cache answer the wrong question at a great hit-rate: "cancel" served the "upgrade" answer, instantly, as a valid response, no error. That’s why the threshold is a correctness dial, why you sample semantic hits for false positives, and why a clean miss beats a confident wrong hit.

Cache-aside — look up first; on a miss call the model and populate
Exact cache — hash the normalized prompt+model+params — byte-identical, always safe
Semantic cache — embed + ANN + threshold catches near-duplicates
Threshold — the dial between hit-rate and correctness — set it conservatively
Prefix (KV) cache — reuse the shared prompt prefix to cheapen even misses
Freshness — TTL, invalidate on input change, version the key by model
Metrics — hit-rate + savings + false-hit alarm, per tenant
The failure — a loose semantic threshold serves wrong answers silently