Design an LLM Gateway — A Guided System Design

Design an LLM Gateway — the walkthrough in full

A written version of the interactive walkthrough: the same steps, decisions and trade-offs, laid out for reading, reference and search.

The big idea

Why put a gateway in front of the models?

Your product calls an LLM. Easy — until you have a dozen services each hard-coding one vendor’s SDK, key, and quirks. Then that vendor has an outage, or triples its price, or a cheaper model ships, or one feature floods your shared rate limit at 2am. Every one of those is now a code change in a dozen places. How do you make "which model, from whom, under what limits" a runtime decision instead of a compile-time one?

Put an LLM Gateway in front of every provider: one stable, provider-agnostic API your apps code against, with a model router, fallback, rate limits, caching, and cost accounting behind it. Swap models, add vendors, and enforce budgets centrally — without touching a single app.

How to read this: Each step opens with a real design decision; make the call yourself before reading what ships. The payoff at the end routes by price alone, to show how quietly that router fails.

Step 1 · The skeleton

One API for every model

Apps need completions from many models over time. What should they actually call, so that changing the underlying model never changes their code?

Design decision: What’s the right contract between your apps and the models?

The call: Apps call one gateway endpoint with a provider-agnostic request; the gateway decides the rest. — A single unified API (often OpenAI-compatible) lets apps say "complete this" without naming a vendor. Routing, keys, limits and fallback all live behind it.

Apps call one LLM Gateway endpoint with a provider-agnostic request (messages, params, maybe a task hint). The gateway authenticates the caller and owns everything downstream. Because it’s a network boundary, you change routing, keys, limits and providers centrally — the apps never notice.

A control plane, not a proxy: The gateway isn’t just a pass-through: it’s where cross-cutting policy lives — auth, routing, limits, caching, cost. One place to reason about "how we use models," for the whole org.
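
To make the contract concrete, here is a minimal sketch of what a provider-agnostic request could look like. The field names (`task_hint`, `max_tokens`, `stream`) and the `authenticate` helper are illustrative assumptions, not a schema this walkthrough defines. The point is only that nothing in the request names a vendor or a model.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Message:
    role: str      # "system", "user" or "assistant"
    content: str


@dataclass(frozen=True)
class CompletionRequest:
    """Hypothetical canonical request: no vendor and no model name."""
    messages: tuple[Message, ...]
    max_tokens: int = 512
    temperature: float = 0.0
    task_hint: Optional[str] = None   # e.g. "summarize", "extract-json"
    stream: bool = False


def authenticate(api_key: str, keys_to_tenants: dict[str, str]) -> str:
    """Map a caller's gateway key to a tenant id, or refuse the request."""
    tenant = keys_to_tenants.get(api_key)
    if tenant is None:
        raise PermissionError("unknown gateway API key")
    return tenant
```

Everything downstream (routing, limits, cost attribution) can then hang off the tenant id returned here rather than off anything the app hard-codes.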

Step 2 · Speak every dialect

Provider adapters normalize the vendors

Provider A, Provider B and your self-hosted model each have different request shapes, auth, streaming formats, error codes, and token accounting. The gateway promises one contract. How does it talk to all of them?

Design decision: How does one API front three incompatible provider APIs?

The call: Give each provider an adapter that maps the unified request/response to its native API. — A thin per-provider adapter translates the gateway’s canonical request into the vendor’s format (and its response, streaming chunks, errors and token counts back). Add a vendor = add an adapter.

Each provider gets an adapter that maps the gateway’s canonical request ↔ that vendor’s native API — including auth, streaming format, error codes, and token accounting. The core stays vendor-neutral; onboarding a new model is writing one adapter, not rewriting the gateway.

Normalize at the edge: Push all the vendor-specific mess into adapters so everything inside the gateway — routing, caching, metering — works on one clean shape. It’s the anti-corruption layer pattern, applied to model providers.
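
A rough sketch of the adapter idea follows. The two vendor payload shapes are invented for illustration (they do not describe any real provider's API), and the class and method names are assumptions. What matters is that both adapters return the same `CanonicalResponse`.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass


@dataclass
class CanonicalResponse:
    text: str
    tokens_in: int
    tokens_out: int


class ProviderAdapter(ABC):
    """Translates between the gateway's canonical shape and one vendor's API."""

    @abstractmethod
    def to_native(self, messages: list[dict], max_tokens: int) -> dict: ...

    @abstractmethod
    def from_native(self, payload: dict) -> CanonicalResponse: ...


class ProviderAAdapter(ProviderAdapter):
    # Hypothetical vendor: chat-style payload, usage nested under "usage".
    def to_native(self, messages, max_tokens):
        return {"messages": messages, "max_tokens": max_tokens}

    def from_native(self, payload):
        usage = payload["usage"]
        return CanonicalResponse(payload["output"], usage["in"], usage["out"])


class ProviderBAdapter(ProviderAdapter):
    # Hypothetical vendor: one prompt string, flat token counters.
    def to_native(self, messages, max_tokens):
        prompt = "\n".join(f'{m["role"]}: {m["content"]}' for m in messages)
        return {"prompt": prompt, "limit": max_tokens}

    def from_native(self, payload):
        return CanonicalResponse(
            payload["completion"],
            payload["prompt_tokens"],
            payload["completion_tokens"],
        )
```

A real adapter would also translate auth headers, streaming chunks and error codes; those are left out here to keep the sketch short.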

Step 3 · Pick the right model

The router chooses per request

Now several models are reachable through one API. But which one should a given request use? "Always the biggest" is slow and expensive; "always the cheapest" is the trap you’ll feel later. How does the gateway decide, per request?

Design decision: What should the router optimize when choosing a model?

The call: Capability first — match task difficulty to a model that can do it — then cost/latency as tie-breakers. — The router reads a policy: route by required capability (reasoning, tools, context length, JSON strictness), then let cost and latency choose among the models that can actually do the job.

A Model Router chooses per request from a policy in the Model Registry: match the required capability first (reasoning depth, tool use, context length, structured-output reliability), then let cost and latency break ties within a capability tier. Pins and per-tenant overrides sit on top. The registry updates without a redeploy.

Capability first, cost second: Cost is a tie-breaker, not the objective. The winning policy is "cheapest model that can actually do this task" — which needs a notion of task difficulty and model capability, not just a price list.
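
One way to express "capability first, cost second" in code is to filter on capability and context length, then take the cheapest survivor. This is a sketch under assumptions: the registry entries, capability tags, prices and latencies below are made up, and a real registry would live in an external store rather than a module-level list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelEntry:
    name: str
    provider: str
    capabilities: frozenset   # e.g. {"json", "tools", "reasoning"}
    context_tokens: int
    cost_per_1k: float        # illustrative blended price
    p50_latency_ms: int


# Hypothetical registry contents.
REGISTRY = [
    ModelEntry("small", "provider-a", frozenset({"json"}), 16_000, 0.2, 300),
    ModelEntry("medium", "provider-b", frozenset({"json", "tools"}), 128_000, 1.0, 600),
    ModelEntry("large", "provider-a",
               frozenset({"json", "tools", "reasoning"}), 200_000, 5.0, 1500),
]


def choose_model(registry, required, prompt_tokens, pin=None):
    """Capability first; cost, then latency, break ties among eligible models."""
    if pin is not None:
        # A pin (or per-tenant override) bypasses the policy entirely.
        for m in registry:
            if m.name == pin:
                return m
        raise LookupError(f"pinned model {pin!r} is not in the registry")
    eligible = [
        m for m in registry
        if required <= m.capabilities and m.context_tokens >= prompt_tokens
    ]
    if not eligible:
        raise LookupError("no registered model meets the required capabilities")
    return min(eligible, key=lambda m: (m.cost_per_1k, m.p50_latency_ms))
```

With this registry, a request that needs `{"reasoning"}` lands on `large`, while a price-only `min` over the whole registry would send it to `small`, which is the failure the payoff section describes.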

Step 4 · When a provider fails

Fallback, retries, and timeouts

Providers rate-limit you, time out, 500, or go fully down — regularly. Your gateway is now a single point every app depends on. How do you keep serving when the chosen model can’t?

Design decision: The chosen provider returns 429/503. What should the gateway do?

The call: Retry with backoff, then fail over to an equivalent model on another provider; trip a circuit breaker on sustained failure. — Bounded retries with jittered backoff handle blips; on repeated failure the router fails over to a capability-equivalent model elsewhere, and a circuit breaker stops sending to a provider that’s down.

On failure the gateway does bounded, jittered retries, then fails over to a capability-equivalent model on another provider. A circuit breaker stops routing to a provider that’s persistently failing (and probes for recovery). Hard timeouts bound how long any one attempt can take, and hedging (sending a duplicate request to a second model when the first is slow) trims tail latency at the price of some extra spend. Multi-provider isn’t just for cost: it’s your redundancy.

Failover needs equivalence: You can only fail over between models the router considers interchangeable for the task — which is exactly the capability tiers from step 3. Resilience and routing are the same policy, viewed twice.
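
The retry, failover and circuit-breaker loop can be sketched roughly as below. The thresholds, the `ProviderError` type and the function names are assumptions chosen for illustration; timeouts and hedging are omitted. The clock, sleep and random source are injected so the behaviour is easy to test.

```python
import random
import time


class ProviderError(Exception):
    """Stands in for a retryable failure such as a 429 or 503."""


class CircuitBreaker:
    def __init__(self, threshold=3, cooldown_s=30.0, clock=time.monotonic):
        self.threshold, self.cooldown_s, self.clock = threshold, cooldown_s, clock
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # After the cooldown, let a probe request through (half-open).
        return self.clock() - self.opened_at >= self.cooldown_s

    def record(self, ok):
        if ok:
            self.failures, self.opened_at = 0, None
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()


def call_with_fallback(candidates, breakers, max_retries=2, base_delay_s=0.2,
                       sleep=time.sleep, rng=random.random):
    """candidates: ordered (provider, callable) pairs from one capability tier."""
    for name, call in candidates:
        breaker = breakers[name]
        if not breaker.allow():
            continue  # circuit open: skip this provider
        for attempt in range(max_retries + 1):
            try:
                result = call()
                breaker.record(ok=True)
                return name, result
            except ProviderError:
                breaker.record(ok=False)
                if attempt < max_retries and breaker.allow():
                    # Full jitter: a random fraction of the exponential cap.
                    sleep(rng() * base_delay_s * (2 ** attempt))
                else:
                    break
    raise ProviderError("every provider in this tier failed or is open")
```

Because `candidates` comes from a single capability tier, failing over never silently downgrades the request to a model that cannot do the task.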

Step 5 · Protect the shared resource

Rate limits, quotas, and budgets

Every app shares your provider quotas and your bill. One buggy loop or one heavy tenant can exhaust the rate limit for everyone — or spend a month’s budget by lunch. How do you isolate them?

Design decision: How do you stop one caller from starving the rest (or blowing the bill)?

The call: Per-key / per-tenant token buckets and spend budgets, enforced at the gateway. — Each key gets its own rate (requests + tokens) and a spend budget. Over the limit → throttle, queue, or reject — so one tenant’s spike can’t exhaust the shared provider quota or the budget.

Enforce per-key and per-tenant token buckets (requests and tokens/min) plus spend budgets at the gateway. Over the limit → throttle, queue, or reject with a clear 429 + Retry-After. Now a runaway tenant hits their ceiling, not everyone’s — and no key can silently overrun the bill.

Fairness is isolation: Provider quota and budget are shared, finite resources. Per-tenant limits turn "whoever spikes first wins" into predictable, fair allocation — the same reason multi-tenant systems isolate any shared resource.
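
A token bucket is small enough to sketch in full. The version below is a single-process illustration with made-up rates; a real gateway would keep one bucket per key and per tenant in an external store so that every replica sees the same level. `admit` is a hypothetical helper that turns the bucket's answer into a status code and a `Retry-After` header.

```python
import math
import time


class TokenBucket:
    """Refills at `rate` units per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate, self.capacity, self.clock = rate, capacity, clock
        self.level = capacity
        self.updated = clock()

    def try_take(self, amount=1.0):
        """Return 0.0 if admitted, else the seconds until `amount` is available."""
        if amount > self.capacity:
            raise ValueError("request is larger than the bucket can ever hold")
        now = self.clock()
        self.level = min(self.capacity, self.level + (now - self.updated) * self.rate)
        self.updated = now
        if amount <= self.level:
            self.level -= amount
            return 0.0
        return (amount - self.level) / self.rate


def admit(bucket, estimated_tokens):
    """Translate the bucket decision into an HTTP status and headers."""
    wait = bucket.try_take(estimated_tokens)
    if wait == 0.0:
        return 200, {}
    return 429, {"Retry-After": str(math.ceil(wait))}
```

The same class works for requests per minute (take 1 per call) and tokens per minute (take the estimated token count); a spend budget is the same idea with currency as the unit and a much slower refill.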

Step 6 · Stop paying twice

Response caching cuts cost and latency

Many prompts repeat — the same FAQ, the same system-prompt+doc, near-identical questions. Each one is a full-price, full-latency model call. Where’s the cheapest token? The one you never send.

Design decision: How do you avoid re-calling the model for repeated work?

The call: Cache by exact prompt, and optionally by semantic similarity, keyed on model + params. An exact-match cache returns the stored completion for an identical request instantly. A semantic cache embeds the prompt and reuses an answer for a near-duplicate, gated by a similarity threshold. Both key on model + params, so an entry is only ever reused for a request that would have gone to the same model with the same settings.

Put a Response Cache in the gateway: an exact-match layer (identical prompt + model + params → stored completion) and optionally a semantic layer (embed the prompt, reuse a near-duplicate above a similarity threshold). Scope keys per tenant, set a TTL, and skip the provider on a hit — often the single biggest win on both cost and p50 latency.

The cheapest call is none: Caching is the highest-leverage optimization a gateway offers: a hit costs ~nothing and returns in milliseconds. Semantic caching extends it to "close enough" — powerful, but gated by a threshold you tune against quality.
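
Both cache layers can be sketched in a few lines. The key layout, the TTL handling and the 0.95 threshold are assumptions for illustration, and the embeddings are passed in as plain vectors because the choice of embedding model is outside this walkthrough.

```python
import hashlib
import json
import math
import time


def cache_key(tenant, model, params, messages):
    """Stable key: same tenant + model + params + prompt gives the same digest."""
    blob = json.dumps(
        {"tenant": tenant, "model": model, "params": params, "messages": messages},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(blob.encode("utf-8")).hexdigest()


class ExactCache:
    def __init__(self, ttl_s, clock=time.monotonic):
        self.ttl_s, self.clock, self.store = ttl_s, clock, {}

    def get(self, key):
        hit = self.store.get(key)
        if hit is None or self.clock() - hit[0] > self.ttl_s:
            return None          # miss, or entry older than the TTL
        return hit[1]

    def put(self, key, completion):
        self.store[key] = (self.clock(), completion)


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def semantic_lookup(query_vec, entries, threshold=0.95):
    """entries: (embedding, completion) pairs already scoped to tenant + model + params."""
    best = max(entries, key=lambda e: cosine(query_vec, e[0]), default=None)
    if best is not None and cosine(query_vec, best[0]) >= threshold:
        return best[1]
    return None
```

Including the tenant in the key is what keeps one tenant's completions from being served to another; the threshold is the knob to tune against answer quality.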

Step 7 · See what it costs

Observability, cost, and attribution

You’re routing across providers, failing over, caching, and throttling. Now finance asks "what did each team spend?", an app owner asks "why did quality drop?", and you ask "is the router’s policy actually working?" You can’t answer any of it without data. What do you record?

Design decision: What must the gateway log on every request to stay operable?

The call: Per-request metadata: model, tokens in/out, cost, latency, cache-hit, fallback — attributed to a tenant. — Structured records of model, token counts, cost, latency, cache-hit and any fallback, keyed to the tenant, power billing, budgets, alerting, and the quality investigations the router’s decisions demand.

Emit a structured record per request: tenant, model chosen (and why), tokens in/out, cost, latency, cache-hit, retries/fallback, and outcome — into a Cost & Traces store. That one ledger drives billing, budget enforcement, alerting, and quality debugging. Log metadata always; sample or redact prompt/response content deliberately.

You can’t govern what you can’t see: The gateway’s whole value — routing, budgets, reliability — is only real if it’s measured. Per-tenant cost attribution is what turns "we use LLMs" into a system you can actually run and bill.
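
As a sketch of what one ledger row might contain, the record below mirrors the fields listed above. The price table is invented for illustration and the function names are assumptions; the prompt and response text are deliberately not part of the record.

```python
import json
from collections import defaultdict
from dataclasses import asdict, dataclass


@dataclass
class RequestRecord:
    tenant: str
    model: str
    route_reason: str
    tokens_in: int
    tokens_out: int
    latency_ms: int
    cache_hit: bool
    retries: int
    fell_back: bool
    outcome: str           # "ok", "error" or "throttled"
    cost_usd: float = 0.0


# Hypothetical prices: USD per 1,000 tokens as (input, output).
PRICES = {"small": (0.0005, 0.0015), "large": (0.01, 0.03)}


def price(record):
    """Fill in cost from token counts; a cache hit makes no provider call."""
    if record.cache_hit:
        record.cost_usd = 0.0
    else:
        p_in, p_out = PRICES[record.model]
        record.cost_usd = (record.tokens_in / 1000 * p_in
                           + record.tokens_out / 1000 * p_out)
    return record


def spend_by_tenant(records):
    totals = defaultdict(float)
    for r in records:
        totals[r.tenant] += r.cost_usd
    return dict(totals)


def to_log_line(record):
    # Metadata only: content is sampled or redacted elsewhere, if logged at all.
    return json.dumps(asdict(record), sort_keys=True)
```

`spend_by_tenant` is the query finance asks for; the same rows, grouped by model and `route_reason`, are what you would read to check whether the routing policy is doing its job.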

Step 8 · Serve it at scale

Streaming, statelessness, and multi-region

Users expect tokens to stream as they’re generated, traffic is spiky, and the gateway now sits in the hot path of every LLM call — a tempting single point of failure. How do you make it fast and highly available?

Design decision: How do you scale a gateway that’s in every request’s hot path?

The call: Stream tokens through, keep the gateway stateless, and run many replicas across regions. The gateway proxies the provider’s SSE/streaming chunks to the client with minimal added latency and keeps no state between requests (cache, registry and limits are external), so you can scale horizontally and run multi-region for HA.

Make the gateway stateless — cache, registry, limits and traces live in external stores — so you run many replicas behind a load balancer, multi-region for availability. Stream provider tokens straight through (SSE) so time-to-first-token stays low. The control plane must be more available than any single provider, because everything depends on it.

Thin, stateless, streaming: The gateway adds policy, not latency: pass tokens through, keep no local state, scale out. Its own reliability is the ceiling on every app’s reliability — engineer it accordingly.
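
Pass-through streaming is naturally expressed as a generator: each provider chunk is framed and forwarded as soon as it arrives, rather than buffered until the completion is finished. This is a minimal sketch; the function names are assumptions, the `[DONE]` sentinel is borrowed from a common convention rather than required by SSE, and the chunk counter stands in for real token metering.

```python
from typing import Iterable, Iterator


def to_sse(payload: str) -> str:
    """Frame one payload as a server-sent event (one `data:` line per text line)."""
    return "".join(f"data: {line}\n" for line in payload.split("\n")) + "\n"


def relay(provider_chunks: Iterable[str], usage: dict) -> Iterator[str]:
    """Forward chunks lazily, counting them for the per-request record."""
    for chunk in provider_chunks:
        usage["chunks"] = usage.get("chunks", 0) + 1
        yield to_sse(chunk)
    yield to_sse("[DONE]")   # assumed end-of-stream sentinel
```

Because `relay` holds nothing beyond the chunk in flight, any replica can serve any request, which is what makes the horizontal scale-out above straightforward.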

The payoff

You built an LLM gateway

From "every app hard-codes one vendor" to a control plane: one unified API, provider adapters, a capability-first router with cost/latency tie-breakers, retries + failover + circuit breaking, per-tenant limits and budgets, exact + semantic caching, per-request cost attribution, and a thin stateless streaming edge.

Now route by price alone — collapse the policy to "cheapest token wins" — and watch quality quietly fall apart: hard prompts sent to weak models return confident, wrong, un-parseable answers as clean 200s with no error. That’s why you route on capability first and let cost break ties only within a tier — and why the cost-and-quality ledger from step 7 exists to catch it.

Unified API — apps call one provider-agnostic endpoint; routing changes never touch them
Adapters — one per provider — normalize the vendor mess at the edge
Router — capability first, cost/latency as tie-breakers, policy in a hot-reloadable registry
Fallback — bounded retries → failover to an equivalent model → circuit breaker
Limits & budgets — per-tenant token buckets + spend caps isolate the shared quota
Caching — exact + semantic — the cheapest call is the one you never send
Observability — per-request cost/latency attributed to a tenant drives billing + debugging
Scale — stateless, streaming, multi-region — the control plane outlives any one provider