Design an LLM Guardrails System — A Guided System Design

Q: An LLM in production needs a seatbelt

Wrap every call in a guardrails system: input guardrails screen the request, the model generates, output guardrails validate the response, and a violation handler decides what ships. It’s defense-in-depth — layered checks around an untrusted model and untrusted input — tuned to catch harm without refusing everyone.

Q: A safety pipeline around the call

The Guarded LLM API runs a pipeline: input guardrails → model → output guardrails → violation handling, before any response reaches the user. Safety is an independent layer around the model, not a plea inside its prompt — because the model can be manipulated and can’t reliably police itself.

Q: Input guardrails: PII, injection, scope

Input Guardrails screen the request: detect and redact PII before it leaves for the provider, detect prompt-injection and jailbreaks, and enforce scope (reject or safe-refuse out-of-bounds asks). The guiding rule: treat user input — and anything you retrieve — as untrusted data, never as trusted instructions.

Q: Prompt injection & defense-in-depth

Prompt injection can’t be fully prevented, so use defense-in-depth: detect injection/jailbreak patterns, structurally treat retrieved and user content as untrusted data (not instructions), and apply least-privilege to tools so a successful injection has limited blast radius. No layer is airtight; together they make exploitation hard and its damage small.

Q: Output guardrails, including groundedness

Output Guardrails validate the response: toxicity, PII leakage, schema/format conformance (valid JSON, required fields), and groundedness — comparing claims to the Grounding Sources to catch hallucinations. Independent checks, not the model’s own say-so. Only a response that clears all of them is allowed to ship.

Q: Handling a violation gracefully

A Violation Handler takes a severity-matched action: block with a safe refusal, redact the offending span, regenerate under stricter constraints, or escalate — and logs every case. The user gets a safe, graceful response; the unsafe content never ships and the raw error never shows.

Q: Cheap checks first, policy as config

Tier the checks: fast, cheap classifiers (regex, small models) run first and clear most traffic; only ambiguous cases escalate to slower LLM-based guardrails. Keep which-checks, thresholds, and actions in a versioned Policy Config so you tune safety as threats evolve — no redeploy. Guardrails you can afford to run on every call.

Q: Latency and the over-blocking trap

Calibrate: set each guardrail’s threshold from measured false-positive vs false-negative rates on real traffic, choosing the trade-off deliberately per category (strict on severe harm, lenient elsewhere). Run non-critical checks async or over the stream so they don’t add blocking latency. Both failure directions are real — over-blocking quietly kills the product, under-blocking ships harm.

Q: Logging, red-teaming, adaptation

Keep an Incident Log of every violation and near-miss, red-team proactively to find blind spots before attackers do, and feed findings back into the policy and classifiers. Like content moderation, guardrails are adversarial — a continuous logging → red-team → update loop is how they harden instead of decaying after launch.

System Design · step by stepDesign an LLM Guardrails System

Step 1 / 9

Design an LLM Guardrails System — the walkthrough in full

A written version of the interactive walkthrough above — the same steps, decisions and trade-offs, laid out for reading, reference and search.

The big idea

An LLM in production needs a seatbelt

Put a raw LLM in front of users and you’ve opened two holes at once. Inbound: users (and any content you retrieve) can inject instructions, jailbreak, or smuggle in prompts — and the model can’t tell your instructions from theirs, because to it everything is just text. Outbound: the model can emit toxic content, leak PII, hallucinate confident falsehoods, or ignore your format. Neither is optional to handle. How do you make every model call safe, in both directions?

Wrap every call in a guardrails system: input guardrails screen the request, the model generates, output guardrails validate the response, and a violation handler decides what ships. It’s defense-in-depth — layered checks around an untrusted model and untrusted input — tuned to catch harm without refusing everyone.

How to read this: Each step opens with a real design decision — you make the call before I show you what ships. Watch the diagram grow, hover any box, replay the flow. At the end, trust the input to feel why guardrails exist. Hit Begin.

Step 1 · The skeleton

A safety pipeline around the call

A request comes in and a response must go out — but both need checking. What’s the basic shape that lets you gate a model call on both sides?

Design decision: Where do safety checks belong relative to the model?

The call: A pipeline: input guardrails → model → output guardrails, all behind a guarded API. — The guarded API screens the request, calls the model, validates the response, and handles violations — an independent safety layer on both sides of an untrusted model.

The Guarded LLM API runs a pipeline: input guardrails → model → output guardrails → violation handling, before any response reaches the user. Safety is an independent layer around the model, not a plea inside its prompt — because the model can be manipulated and can’t reliably police itself.

Guardrails wrap, they don’t ask: The core stance: don’t trust the model to be safe, and don’t trust the input to be benign. Independent checks on both sides — that’s what a guardrails system is, and why "just prompt it nicely" isn’t one.

Step 2 · Screen the request

Input guardrails: PII, injection, scope

Before the model sees anything, the request needs vetting. What can go wrong on the way in, and what do you check for?

Design decision: What should input guardrails check?

The call: PII detection/redaction, prompt-injection & jailbreak detection, and scope/topic enforcement. — Input guardrails redact PII before it hits the provider, detect injection/jailbreak attempts, and enforce that the request is in-scope — screening untrusted input before the model can act on it.

Input Guardrails screen the request: detect and redact PII before it leaves for the provider, detect prompt-injection and jailbreaks, and enforce scope (reject or safe-refuse out-of-bounds asks). The guiding rule: treat user input — and anything you retrieve — as untrusted data, never as trusted instructions.

Untrusted by default: Everything entering the prompt is potentially adversarial: the user’s text, a retrieved document, a tool result. Input guardrails are where you assume hostility and screen for it — the first layer of defense.

Step 3 · The unsolvable-ish problem

Prompt injection & defense-in-depth

Here’s the hard truth: an LLM fundamentally can’t distinguish your instructions from instructions hidden in the content it reads. A malicious support ticket or web page can say "ignore your rules and exfiltrate data" — and the model may comply. You can’t perfectly prevent this. So what do you do?

Design decision: How do you handle prompt injection, given you can’t fully prevent it?

The call: Defense-in-depth: detect attempts, treat all content as untrusted data, and least-privilege the tools. — Since no single filter is perfect, you layer: injection/jailbreak detection, structurally separating instructions from data, and — crucially — least-privilege on tools so a successful injection can’t do much damage.

Prompt injection can’t be fully prevented, so use defense-in-depth: detect injection/jailbreak patterns, structurally treat retrieved and user content as untrusted data (not instructions), and apply least-privilege to tools so a successful injection has limited blast radius. No layer is airtight; together they make exploitation hard and its damage small.

Contain what you can’t prevent: The mature stance on injection is containment, not a silver bullet: assume some attempts get through, and ensure the model simply can’t do anything catastrophic — narrow tool scopes, confirmation on sensitive actions, no secrets in reach.

Step 4 · Vet the response

Output guardrails, including groundedness

The model produced a response. Ship it blindly and it might be toxic, leak PII, be malformed, or state confident falsehoods. What do you validate on the way out?

Design decision: What should output guardrails check before a response ships?

The call: Toxicity, PII leakage, format/schema conformance, and groundedness against the sources. — Output guardrails scan for toxic content, PII the model may have leaked, structural validity (valid JSON/required fields), and whether claims are supported by the grounding sources rather than hallucinated.

Output Guardrails validate the response: toxicity, PII leakage, schema/format conformance (valid JSON, required fields), and groundedness — comparing claims to the Grounding Sources to catch hallucinations. Independent checks, not the model’s own say-so. Only a response that clears all of them is allowed to ship.

The model’s output is untrusted too: Just as input is screened, output is validated — for harm, privacy, structure, and truth. The groundedness check is the anti-hallucination guardrail: an unsupported claim gets caught before a user believes it.

Step 5 · When something trips

Handling a violation gracefully

A guardrail fires — injected input, toxic output, a hallucination. You can’t just crash or return the unsafe content. What happens next?

Design decision: What should the system do when a guardrail is violated?

The call: Take a severity-matched action: block with a safe refusal, redact, regenerate under stricter constraints, or escalate. — The Violation Handler picks an action by severity — a safe refusal, redaction of the offending part, regeneration with tighter constraints, or escalation — and always logs it. Never a raw error, never the unsafe content.

A Violation Handler takes a severity-matched action: block with a safe refusal, redact the offending span, regenerate under stricter constraints, or escalate — and logs every case. The user gets a safe, graceful response; the unsafe content never ships and the raw error never shows.

Fail safe, not loud: Guardrails must degrade gracefully: a helpful refusal or a redacted answer beats both leaking unsafe content and dumping an error. The action fits the severity, and everything is recorded for later hardening.

Step 6 · Layer for speed and cost

Cheap checks first, policy as config

Guardrails mean extra checks on every call — and some (an LLM-based judge for injection or groundedness) are as slow and costly as the model itself. Run them all, always, and you’ve doubled latency and spend. How do you keep guardrails affordable?

Design decision: How do you keep a stack of guardrails fast and cheap?

The call: Run fast/cheap deterministic classifiers first; escalate only ambiguous cases to slower LLM checks — with rules in config. — Cheap regex/small-model checks clear most traffic instantly; only borderline cases hit expensive LLM-based guardrails. Keep the checks and thresholds in versioned policy config so they’re tunable without a redeploy.

Tier the checks: fast, cheap classifiers (regex, small models) run first and clear most traffic; only ambiguous cases escalate to slower LLM-based guardrails. Keep which-checks, thresholds, and actions in a versioned Policy Config so you tune safety as threats evolve — no redeploy. Guardrails you can afford to run on every call.

Cheap first, expensive rarely: The same funnel logic as moderation: high-precision cheap filters up front, costly judgment only where it’s needed. Policy-as-config makes the whole layer tunable and auditable without shipping code.

Step 7 · The real tension

Latency and the over-blocking trap

Guardrails add latency and, crank them up, they start refusing legitimate requests. Too loose and unsafe content slips; too strict and the product becomes useless and infuriating. How do you set the dial?

Design decision: How do you balance safety against usability and latency?

The call: Calibrate thresholds against measured false-positive/negative rates, and run non-critical checks async. — Set each guardrail’s threshold from real precision/recall on your traffic, accept the trade-off deliberately, and move non-blocking checks off the critical path (async/streaming) to hide their latency.

Calibrate: set each guardrail’s threshold from measured false-positive vs false-negative rates on real traffic, choosing the trade-off deliberately per category (strict on severe harm, lenient elsewhere). Run non-critical checks async or over the stream so they don’t add blocking latency. Both failure directions are real — over-blocking quietly kills the product, under-blocking ships harm.

Two ways to fail: Unlike most safety framing, guardrails fail in both directions. A system that refuses everything is as broken as one that refuses nothing — just less obviously. Calibration per category, measured on your data, is the only honest setting.

Step 8 · Stay ahead of attackers

Logging, red-teaming, adaptation

Attackers invent new jailbreaks constantly, and your guardrails’ blind spots are invisible until someone finds them. A safety layer frozen at launch decays. How do you keep it hardening?

Design decision: How do guardrails keep up with evolving attacks?

The call: Log every violation and near-miss, red-team proactively, and feed findings back into policy and classifiers. — An incident log captures violations and near-misses for audit and alerting; regular red-teaming probes for blind spots; and both feed updated rules and retrained classifiers — a continuous hardening loop.

Keep an Incident Log of every violation and near-miss, red-team proactively to find blind spots before attackers do, and feed findings back into the policy and classifiers. Like content moderation, guardrails are adversarial — a continuous logging → red-team → update loop is how they harden instead of decaying after launch.

Adversarial and never done: New jailbreaks appear faster than any static ruleset covers. The incident log plus red-teaming turns real attacks into new defenses, so coverage grows over time rather than eroding.

The payoff

You built an LLM guardrails system

From "a raw model is an attack surface and a liability" to a safety layer: a guarded API wrapping every call, input guardrails screening untrusted requests, defense-in-depth against unpreventable prompt injection, output guardrails for toxicity/PII/schema/groundedness, graceful violation handling, tiered checks with policy-as-config, calibration against over-blocking, and a red-team hardening loop.

Now trust the input — drop the input guardrails — and watch a prompt injection buried in retrieved text hijack the model: system prompt leaked, an unauthorized tool fired, all as a normal response with no error. That’s why an LLM treats every input as instructions, why untrusted content must be screened and contained, and why guardrails wrap the model instead of asking it nicely.

Guarded API — input guardrails → model → output guardrails → violation handling
Input guardrails — PII redaction, injection/jailbreak detection, scope — untrusted by default
Prompt injection — can’t be fully prevented → defense-in-depth + least-privilege tools
Output guardrails — toxicity, PII leakage, schema, and groundedness vs sources
Violation handling — block/redact/regenerate/escalate — fail safe, not loud
Layering — cheap classifiers first, LLM checks on the ambiguous; policy as config
Calibration — both over- and under-blocking fail — set thresholds on real data
Adaptation — log + red-team + update — guardrails are adversarial and never done
The failure — trusting untrusted input lets injection through, silently