Design Content Moderation — the walkthrough in full
A written version of the interactive walkthrough above — the same steps, decisions and trade-offs, laid out for reading, reference and search.
The big idea
Moderating content at platform scale
A platform accepts millions of posts, comments, and images a day. Some tiny fraction is harmful: abuse, hate, violence, spam, child sexual abuse material (CSAM). You can’t have humans read all of it (impossible at scale), and you can’t fully trust a model (it’s wrong in both directions: censoring the innocent and passing the harmful). Users actively try to evade you. How do you catch harm at scale without either drowning in review or silently over-censoring?
Build a staged moderation funnel: cheap high-precision filters first, ML classifiers next, and humans on the uncertain middle. A policy engine turns scores into actions; an appeals + retraining loop keeps up with adversaries. The core discipline: automate the confident extremes, send the ambiguous to people.
How to read this: Each step opens with a real design decision, so make your own call before reading what ships. The final section drops human review to show moderation’s core failure.
Step 1 · The skeleton
Content in, action out
A piece of content arrives and needs a decision — allow, limit, block, or escalate — often before it’s shown. What sits between the upload and the verdict?
Design decision: What’s the minimal shape of a moderation request?
The call: A Moderation API that runs content through checks and returns an action, recording a case. — An orchestrator takes the content, runs the funnel (fast filters → classifiers → policy), records the decision for audit/appeal, and returns allow/limit/block/escalate.
A Moderation API orchestrates the decision: run content through the funnel, record a case (scores, action, reviewer) for audit and appeals, and return an action. Screening is largely proactive — decided at or before publish for anything severe — with latency budgets tighter for pre-publish checks.
A decision with a paper trail: Every action must be explainable and reversible: the case store behind the API is what makes appeals, transparency reports, and retraining possible. Moderation without an audit trail is unaccountable.
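The skeleton above can be sketched in a few lines. This is a minimal illustration, not a production design: `ModerationAPI`, `Case`, `toy_scores`, and `toy_policy` are hypothetical names invented here, and the in-memory list stands in for a durable case store.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Dict, List, Optional
import uuid


class Action(Enum):
    ALLOW = "allow"
    LIMIT = "limit"
    BLOCK = "block"
    ESCALATE = "escalate"


@dataclass
class Case:
    """One audit record: what was scored, what was decided, and by whom."""
    case_id: str
    content_id: str
    scores: Dict[str, float]
    action: Action
    reviewer: Optional[str] = None  # set later if a human rules on the case


class ModerationAPI:
    def __init__(self,
                 score_fn: Callable[[str], Dict[str, float]],
                 policy_fn: Callable[[Dict[str, float]], Action]) -> None:
        self.score_fn = score_fn      # stand-in for filters + classifiers
        self.policy_fn = policy_fn    # stand-in for the policy engine
        self.cases: List[Case] = []   # assumption: in-memory, not durable

    def moderate(self, content_id: str, text: str) -> Action:
        scores = self.score_fn(text)
        action = self.policy_fn(scores)
        # Record the case before returning so every action has a paper trail.
        self.cases.append(Case(str(uuid.uuid4()), content_id, scores, action))
        return action


# Toy stand-ins so the sketch runs end to end (invented for illustration).
def toy_scores(text: str) -> Dict[str, float]:
    return {"spam": 0.95 if "buy now" in text.lower() else 0.02}


def toy_policy(scores: Dict[str, float]) -> Action:
    return Action.BLOCK if scores["spam"] >= 0.9 else Action.ALLOW


api = ModerationAPI(toy_scores, toy_policy)
result_spam = api.moderate("c1", "BUY NOW!!!")
result_ok = api.moderate("c2", "hello there")
```

Both calls leave a `Case` behind, including the allowed one, which is what later makes appeals and transparency reporting possible.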
Step 2 · Cheapest check first
The funnel: hash-match known-bad
Running heavy ML on every item is expensive, and a lot of harmful content is re-uploaded, previously-seen material. Why pay a classifier to rediscover what you already know is banned?
Design decision: What should run before the ML classifiers?
The call: A fast hash filter matching known-bad content (exact + perceptual hashes). — Hash-matching against a database of known violating material (exact and perceptual/PhotoDNA-style) is fast and has near-perfect precision, so it catches re-uploads instantly, before any model runs.
Front the funnel with a Hash Filter: match content against hashes of known-violating material, using exact hashes for identical files and perceptual hashes for slightly altered images. It’s cheap, has near-perfect precision, and catches re-uploads (including legally mandated categories like CSAM) before spending a cent on ML. The funnel’s principle: cheap and certain first, expensive and fuzzy later.
Precision at the top, recall below: Hash-matching is high-precision but only catches known content. It handles the easy, certain fraction so classifiers and humans can focus on the novel and ambiguous. Each later stage costs more per item and is less certain, but covers content the stage above cannot.
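To make the exact-versus-perceptual distinction concrete, here is a small sketch. It uses SHA-256 for exact matching and a simple average hash as a stand-in for a perceptual hash; it is not PhotoDNA or any production algorithm. The pixel grids, the `max_distance` tolerance, and all function names are assumptions made for illustration.

```python
import hashlib
from typing import List, Set

Pixels = List[List[int]]  # grayscale 0..255, assumed already downscaled


def exact_hash(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def average_hash(pixels: Pixels) -> int:
    """One bit per pixel: 1 if the pixel is brighter than the image mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    return bin(a ^ b).count("1")


def is_known_bad(data: bytes, pixels: Pixels,
                 bad_exact: Set[str], bad_perceptual: Set[int],
                 max_distance: int = 1) -> bool:
    # Cheapest check first: an identical file matches on its exact hash.
    if exact_hash(data) in bad_exact:
        return True
    # Otherwise tolerate small visual changes via Hamming distance.
    h = average_hash(pixels)
    return any(hamming(h, known) <= max_distance for known in bad_perceptual)


# Toy 2x2 "images" (invented data).
banned_bytes, banned_pixels = b"banned-file", [[10, 200], [220, 15]]
altered_bytes, altered_pixels = b"banned-file-recompressed", [[12, 198], [225, 20]]
other_bytes, other_pixels = b"holiday-photo", [[200, 10], [15, 220]]

bad_exact = {exact_hash(banned_bytes)}
bad_perceptual = {average_hash(banned_pixels)}

hit_same = is_known_bad(banned_bytes, banned_pixels, bad_exact, bad_perceptual)
hit_altered = is_known_bad(altered_bytes, altered_pixels, bad_exact, bad_perceptual)
hit_other = is_known_bad(other_bytes, other_pixels, bad_exact, bad_perceptual)
```

The altered copy has different bytes, so its exact hash misses, but its brightness pattern is unchanged and the perceptual check still catches it. The unrelated image differs in every bit and passes through to the classifiers.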
Step 3 · Judge novel text
Text classifiers, multi-label, per-category
Most content is new, so hashing won’t catch it. You need to judge unseen text across many kinds of harm — and "is it bad?" is the wrong question. What does a text classifier actually output?
Design decision: What should a text moderation classifier produce?
The call: Per-category probabilities (harassment, hate, violence, spam, self-harm…), multi-label. — The model scores each policy category independently, so a post can trigger several, each with its own probability — feeding per-category thresholds and severity in the policy engine.
The Text Classifier emits per-category probabilities (harassment, hate, violence, spam, self-harm, …). It is multi-label, since content can violate several policies at once. Crucially it outputs scores, not verdicts: the policy engine (step 5) turns those probabilities into actions using category-specific thresholds. Keep the model’s judgment and the policy separate.
Scores, not verdicts: A classifier gives calibrated probabilities per policy; a verdict is a business decision layered on top. Separating them lets you tune thresholds and severity without retraining, and route the uncertain band to humans.
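One common way to get independent per-category scores is to apply a sigmoid to each category's logit rather than a softmax across categories. The sketch below assumes the model already produced raw logits; the category names and logit values are made up for illustration.

```python
import math
from typing import Dict


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def multilabel_scores(logits: Dict[str, float]) -> Dict[str, float]:
    """Score each category independently; results need not sum to 1."""
    return {category: sigmoid(z) for category, z in logits.items()}


# Hypothetical logits for a post that is both harassing and hateful.
logits = {"harassment": 2.0, "hate": 1.5, "spam": -3.0}
scores = multilabel_scores(logits)
total = sum(scores.values())
```

Here both `harassment` and `hate` score above 0.8 while `spam` stays below 0.1, and the total exceeds 1. A softmax would force the categories to compete, which is the wrong model when one post can break several policies at once.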
Step 4 · See the other modalities
Image + multimodal, and text-in-image
Plenty of abuse hides where a text model can’t look: graphic images, and — cleverly — harmful text baked into an image to dodge text classifiers. How do you cover non-text content?
Design decision: How do you catch harm in images (including text hidden in them)?
The call: Run image/multimodal classifiers and OCR the embedded text into the text pipeline. — Vision models score images/video frames for policy violations, and OCR extracts text-in-image so it’s judged by the text classifier too — closing the "hide the words in a picture" loophole.
Add an Image / Multimodal Model that scores images and video frames (nudity, violence, weapons) and OCRs embedded text back into the text pipeline — so abuse hidden inside an image is still judged. Multimodal coverage closes the biggest evasion gap; each modality feeds the same policy engine.
Cover every channel: Adversaries move to whatever you’re not checking — text-in-image, audio, links. Multimodal analysis (and OCR bridging image→text) is table stakes; a single-modality moderator is trivially evaded.
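The image-to-text bridge can be sketched as a small composition of three callables. Everything here is a stand-in: `stub_image_model`, `stub_ocr`, and `stub_text_model` are hypothetical placeholders for a real vision model, OCR engine, and text classifier, and merging by per-category maximum is one reasonable assumption, not the only option.

```python
from typing import Callable, Dict

Scores = Dict[str, float]


def merge_max(a: Scores, b: Scores) -> Scores:
    """Keep the highest score seen for each category across modalities."""
    return {c: max(a.get(c, 0.0), b.get(c, 0.0)) for c in set(a) | set(b)}


def score_image(image: dict,
                image_model: Callable[[dict], Scores],
                ocr: Callable[[dict], str],
                text_model: Callable[[str], Scores]) -> Scores:
    visual = image_model(image)
    embedded = ocr(image).strip()
    if not embedded:
        return visual
    # Text found inside the image is judged by the same text classifier.
    return merge_max(visual, text_model(embedded))


# Hypothetical stubs so the sketch runs without any ML dependencies.
def stub_image_model(image: dict) -> Scores:
    return {"violence": 0.03, "nudity": 0.01}


def stub_ocr(image: dict) -> str:
    return image.get("embedded_text", "")


def stub_text_model(text: str) -> Scores:
    return {"harassment": 0.92 if "worthless" in text.lower() else 0.04}


meme = {"embedded_text": "You are WORTHLESS"}
photo = {}  # no embedded text

meme_scores = score_image(meme, stub_image_model, stub_ocr, stub_text_model)
photo_scores = score_image(photo, stub_image_model, stub_ocr, stub_text_model)
```

The meme looks visually harmless, yet it ends up with a high `harassment` score because its embedded text went through the text pipeline. Both modalities produce the same `Scores` shape, so one policy engine can consume either.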
Step 5 · Turn scores into decisions
The policy engine: thresholds + severity
You have hash hits and per-category probabilities. But a 0.7 "hate" score isn’t self-executing — is that block, limit, or review? And a 0.7 on self-harm should behave very differently from 0.7 on spam. Where does that logic live?
Design decision: What converts classifier scores into an actual action?
The call: A policy engine with per-category thresholds and severity tiers → allow / limit / block / escalate. — The policy engine maps scores + hash hits to actions using category-specific thresholds and severity: high-confidence severe → block, medium → age-gate/limit, uncertain → human review. Policy lives here, separate from the models.
A Policy Engine maps scores + hash hits to an action via per-category thresholds and severity tiers: high-confidence + severe → block; medium → limit/age-gate; low-confidence or high-stakes → escalate to human review; else allow. Keeping policy separate from the models lets you retune enforcement without retraining — and defines the uncertain band that goes to people.
Policy is a dial, not a weight: Thresholds encode risk tolerance and law, and they change often. A separate policy engine makes enforcement tunable and auditable, and — critically — carves out the confidence band where humans, not the model, must decide.
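A threshold table makes the "0.7 on self-harm versus 0.7 on spam" point concrete. The numbers below are invented for illustration and carry no recommendation; `Rule`, `POLICY`, and `decide` are hypothetical names. A threshold above 1.0 is used here as a simple way to say "never take this action automatically".

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Rule:
    block_at: float   # at or above: block automatically
    limit_at: float   # at or above: limit / age-gate
    review_at: float  # at or above: escalate to human review


# Assumed thresholds; real values encode risk tolerance and law.
POLICY: Dict[str, Rule] = {
    "spam": Rule(block_at=0.95, limit_at=0.65, review_at=0.50),
    "hate": Rule(block_at=0.95, limit_at=0.85, review_at=0.50),
    # High-stakes category: never auto-actioned, reviewed from a low score.
    "self_harm": Rule(block_at=1.01, limit_at=1.01, review_at=0.30),
}

RANK = {"allow": 0, "limit": 1, "escalate": 2, "block": 3}


def action_for(rule: Rule, score: float) -> str:
    if score >= rule.block_at:
        return "block"
    if score >= rule.limit_at:
        return "limit"
    if score >= rule.review_at:
        return "escalate"
    return "allow"


def decide(scores: Dict[str, float], hash_hit: bool = False) -> str:
    if hash_hit:
        return "block"  # known-bad content needs no model opinion
    best = "allow"
    for category, score in scores.items():
        rule = POLICY.get(category)
        if rule is None:
            continue  # categories without a rule are ignored in this sketch
        action = action_for(rule, score)
        if RANK[action] > RANK[best]:
            best = action
    return best


spam_07 = decide({"spam": 0.7})
self_harm_07 = decide({"self_harm": 0.7})
```

The same 0.7 score limits a likely-spam post but sends a possible self-harm post to a person. Changing that behavior means editing `POLICY`, not retraining a model.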
Step 6 · People on the hard cases
The human-in-the-loop review queue
Classifiers are confident at the extremes and unsure in the middle — sarcasm, reclaimed slurs, context-dependent threats, novel evasions. Auto-actioning that middle at scale means confident wrong decisions on real people. Who decides the ambiguous cases?
Design decision: What happens to low-confidence and high-severity cases?
The call: Route them to a prioritized human review queue whose decisions also become training data. — Low-confidence and high-severity cases go to human moderators via a prioritized queue; their rulings resolve the case and feed back as labeled data to improve the classifiers.
Route low-confidence and high-severity cases to a prioritized Human Review queue. Humans resolve the ambiguity the model can’t, and their decisions become labeled training data. Automate only the confident extremes; the uncertain middle is where people add irreplaceable judgment, and where removing them breaks the system in both directions at once.
Humans on the margin: The model handles the obvious; humans handle the ambiguous and the appeals. That division is the whole design — it’s why you keep classifier outputs as probabilities and reserve people for the band where being wrong is costly.
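A prioritized queue can be sketched with a heap. The priority formula below (category severity plus model uncertainty) is an assumption chosen for illustration, as are the severity weights and the `ReviewQueue` name; real queues typically also weigh reach, age of the case, and reviewer specialization.

```python
import heapq
import itertools
from typing import List, Tuple

# Hypothetical severity weights per category.
SEVERITY = {"spam": 1, "harassment": 2, "hate": 3, "self_harm": 4}


class ReviewQueue:
    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str, str]] = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order
        self.labels: List[Tuple[str, str, bool]] = []

    def push(self, case_id: str, category: str, score: float) -> None:
        # Uncertainty is 1.0 at score 0.5 and 0.0 at score 0 or 1.
        uncertainty = 1.0 - abs(score - 0.5) * 2
        priority = SEVERITY.get(category, 0) + uncertainty
        # heapq is a min-heap, so negate to pop the highest priority first.
        heapq.heappush(self._heap,
                       (-priority, next(self._counter), case_id, category))

    def pop(self) -> Tuple[str, str]:
        _, _, case_id, category = heapq.heappop(self._heap)
        return case_id, category

    def resolve(self, case_id: str, category: str, violates: bool) -> None:
        # Each human ruling doubles as a labeled training example.
        self.labels.append((case_id, category, violates))


queue = ReviewQueue()
queue.push("a", "spam", 0.50)        # very uncertain, low severity
queue.push("b", "self_harm", 0.90)   # fairly confident, highest severity
queue.push("c", "hate", 0.55)        # uncertain, high severity

order = [queue.pop()[0] for _ in range(3)]
queue.resolve("b", "self_harm", True)
```

With these weights the self-harm case is reviewed first even though the model is fairly confident about it, which matches the rule that high-severity cases go to people regardless of score.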
Step 7 · Enforce, and let users appeal
Actions, appeals, and accountability
A decision is made — now act on it. But moderation systems make mistakes in both directions, and users deserve recourse. How do you enforce while staying accountable?
Design decision: What does responsible enforcement require beyond taking the action?
The call: Take the graduated action, notify the user, and offer an appeal that re-routes to human review. — Enforcement applies a severity-matched action (limit, block, account action), notifies the user, and provides an appeal that sends the case (back) to human review — catching false positives and feeding the loop.
Enforcement applies a severity-matched action (limit, block, account-level action), notifies the user, and offers an appeal that routes the case to human review. Appeals are a feature, not an annoyance: they surface false positives, correct them, and generate the highest-value labeled data — cases the system got wrong.
Reversibility builds trust: Because the system is wrong in both directions, every action must be appealable and auditable. Appeals close the loop on false positives the same way review closes it on false negatives.
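The enforce, notify, appeal cycle is essentially a small state machine. The sketch below assumes four states (`enforced`, `appealed`, `upheld`, `reversed`) and uses plain lists as stand-ins for a notification service and the review queue; all names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class EnforcedCase:
    case_id: str
    action: str                   # e.g. "limit" or "block"
    status: str = "enforced"      # enforced -> appealed -> upheld | reversed
    history: List[str] = field(default_factory=list)


def enforce(case: EnforcedCase, notify: Callable[[str, str], None]) -> None:
    case.history.append(f"enforced:{case.action}")
    notify(case.case_id, f"Action taken: {case.action}. You can appeal.")


def appeal(case: EnforcedCase, review_queue: List[str]) -> None:
    if case.status != "enforced":
        raise ValueError("only an enforced, unappealed case can be appealed")
    case.status = "appealed"
    case.history.append("appealed")
    review_queue.append(case.case_id)  # appeals always go to a human


def rule_on_appeal(case: EnforcedCase, violation_confirmed: bool) -> Tuple[str, bool]:
    if case.status != "appealed":
        raise ValueError("case is not under appeal")
    if violation_confirmed:
        case.status = "upheld"
    else:
        case.status = "reversed"
        case.action = "allow"  # undo the enforcement
    case.history.append(case.status)
    # The ruling is also a label: (case_id, content actually violates).
    return case.case_id, violation_confirmed


sent: List[Tuple[str, str]] = []
review_queue: List[str] = []

case = EnforcedCase("c42", "block")
enforce(case, lambda cid, msg: sent.append((cid, msg)))
appeal(case, review_queue)
label = rule_on_appeal(case, violation_confirmed=False)
```

A reversed appeal restores the content and yields a `(case_id, False)` label, which is exactly the kind of example the classifier got wrong. The `history` list keeps the sequence auditable.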
Step 8 · Keep up with adversaries
The retraining loop vs. evasion
Moderation is adversarial: the moment you block a pattern, users mutate it — leet-speak, new slang, coded language, adversarial images. A model frozen at launch decays fast. How do you stay current?
Design decision: How does the system keep up as evasion tactics evolve?
The call: Feed reviewer decisions and appeal outcomes back as labels to retrain classifiers continuously. — The feedback loop turns human rulings and appeal reversals into fresh labeled data, retraining classifiers to catch new evasions and correct past errors — a continuous arms race, not a one-time launch.
A Feedback Loop turns reviewer decisions and appeal outcomes into fresh labeled data, continuously retraining the classifiers to catch new evasions and fix past mistakes. Moderation is a permanent arms race — the loop (plus updated hashes and policies) is how the system tracks a moving adversary instead of decaying after launch.
A moving target needs a moving model: Adversarial drift means yesterday’s classifier misses today’s evasion. The human-labeled feedback loop is the engine that keeps recall from silently eroding: a data flywheel aimed at an adversary.
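One way to make the loop observable is to merge human rulings into a label set and measure the deployed model's recall against it. This is a sketch under stated assumptions: appeal outcomes override earlier reviewer decisions, the 0.8 threshold and retrain trigger are invented, and an empty positive set is treated as recall 1.0 by convention.

```python
from typing import Dict, List, Tuple

Label = Tuple[str, bool]  # (case_id, content violates policy)


def build_labels(review_decisions: List[Label],
                 appeal_outcomes: List[Label]) -> Dict[str, bool]:
    labels = dict(review_decisions)
    # Assumption: an appeal ruling is the later, more careful judgment.
    labels.update(dict(appeal_outcomes))
    return labels


def recall(model_scores: Dict[str, float], labels: Dict[str, bool],
           threshold: float) -> float:
    """Share of human-confirmed violations the model scored at or above threshold."""
    positives = [cid for cid, bad in labels.items()
                 if bad and cid in model_scores]
    if not positives:
        return 1.0  # convention for this sketch: nothing to miss
    caught = sum(1 for cid in positives if model_scores[cid] >= threshold)
    return caught / len(positives)


# Invented data: "b" is an obfuscated post the model missed,
# "c" is a false positive that an appeal overturned.
reviews = [("a", True), ("b", True), ("c", True), ("d", False)]
appeals = [("c", False)]
model_scores = {"a": 0.95, "b": 0.35, "c": 0.90, "d": 0.10}

labels = build_labels(reviews, appeals)
current_recall = recall(model_scores, labels, threshold=0.8)
needs_retrain = current_recall < 0.8  # hypothetical trigger
```

Here the model catches one of two confirmed violations, so recall is 0.5 and the trigger fires. The merged `labels` dictionary is also the fresh training set for the next classifier version.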
The payoff
You built content moderation
From "millions of posts, some harmful, users evading" to a staged system: a Moderation API with an audit trail, a hash filter for known-bad, text and multimodal classifiers emitting per-category scores, a policy engine mapping scores to graduated actions, a human review queue on the uncertain middle, appeals for recourse, and a retraining loop against adversarial drift.
Now drop human review — auto-action every classifier call — and watch it fail in both directions at once: legitimate users banned on sarcasm and reclaimed speech, obfuscated abuse passing as clean, no appeal ever seen. That’s why a classifier is a probability, not a verdict; why you automate only the confident extremes; and why the ambiguous middle belongs to people.
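The two-directional failure can be counted on a toy sample. The items below are invented `(score, truly_violating)` pairs, the thresholds are arbitrary, and the review-band version assumes reviewers rule correctly on everything they see, which real teams only approximate.

```python
from typing import List, Tuple

Item = Tuple[float, bool]  # (classifier score, content truly violates)

ITEMS: List[Item] = [
    (0.97, True), (0.93, True),
    (0.75, False),  # sarcasm
    (0.60, False),  # reclaimed speech
    (0.55, True),
    (0.40, True),   # obfuscated abuse
    (0.30, True),
    (0.10, False), (0.05, False),
]


def auto_only(items: List[Item], threshold: float = 0.5) -> Tuple[int, int]:
    """Every score becomes a verdict: block at or above threshold, else allow."""
    wrong_blocks = sum(1 for s, bad in items if s >= threshold and not bad)
    missed = sum(1 for s, bad in items if s < threshold and bad)
    return wrong_blocks, missed


def with_review(items: List[Item], low: float = 0.2,
                high: float = 0.9) -> Tuple[int, int, int]:
    """Automate only the extremes; the band [low, high) goes to people."""
    wrong_blocks = sum(1 for s, bad in items if s >= high and not bad)
    missed = sum(1 for s, bad in items if s < low and bad)
    reviewed = sum(1 for s, _ in items if low <= s < high)
    return wrong_blocks, missed, reviewed


auto_result = auto_only(ITEMS)
review_result = with_review(ITEMS)
```

On this sample, auto-actioning produces two wrongful blocks and two missed violations at once. Keeping the review band removes both kinds of automated error at the cost of five human reviews out of nine items.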
- Moderation API — orchestrate the funnel; record every case for audit and appeal
- Hash filter — known-bad by exact + perceptual hash — cheap, high-precision, first
- Text classifier — per-category, multi-label scores — not a single verdict
- Multimodal — image/video models + OCR close the text-in-image evasion
- Policy engine — per-category thresholds + severity → allow/limit/block/escalate, tunable
- Human review — low-confidence + high-severity cases go to people; their calls retrain
- Appeals — graduated, notified, reversible — recourse surfaces false positives
- Retraining loop — reviewer + appeal labels track adversarial drift
- The failure — automating the uncertain middle = confident wrong actions, silently