Design a Feature Store — the walkthrough in full

A written version of the interactive walkthrough above — the same steps, decisions and trade-offs, laid out for reading, reference and search.

The big idea

Why models need a feature store

A model is only as good as the features it’s fed. In training, a data scientist computes features in a notebook over historical data. In production, an engineer recomputes "the same" features in low-latency serving code. These are two different code paths — and the moment they diverge, the model sees inputs it never trained on and quietly gets worse. Every ML team reinvents feature plumbing, and every one hits this. How do you make features consistent, reusable, and fresh?

A feature store is the shared system for defining, computing, storing, and serving features. One definition produces both the historical values that train the model and the current values that serve it — so training and serving can’t drift apart, and features become reusable assets across models.

How to read this: Each step opens with a real design decision, followed by the call that ships and the reasoning behind it. The final section looks at what happens when training and serving are skewed, ML’s most common quiet failure.

Step 1 · The core problem

Training/serving skew, and reuse

Two teams compute "user’s 30-day average spend": one in a training notebook, one in serving code. What goes wrong, and what does the right structure prevent?

Design decision: What’s the fundamental thing a feature store must guarantee?

The call: A feature is defined once and used identically in training and serving. A single definition, materialized to both an offline (training) store and an online (serving) store, means the model can’t receive a differently computed feature in production than the one it learned from. That removes skew and makes features reusable across models.

The feature store’s job is consistency and reuse: define a feature once, then supply it identically to training (historical values) and to serving (current values). That single-definition rule eliminates training/serving skew, the silent killer, and turns features into shared assets that many models can reuse.

One definition, two consumers: Training needs history; serving needs the latest value fast. A feature store satisfies both from one definition, so the numbers can never disagree. Everything else — offline store, online store, materialization — implements this promise.
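To make the skew concrete, here is a minimal sketch of the "two teams" scenario. The order data, function names, and the specific misreading of "30-day" are all hypothetical; the point is only that two plausible implementations of one feature name can return different numbers for the same user at the same moment.

```python
from datetime import datetime, timedelta

# Hypothetical order history for one user: (timestamp, amount).
ORDERS = [
    (datetime(2024, 1, 1), 100.0),
    (datetime(2024, 1, 20), 40.0),
    (datetime(2024, 2, 10), 60.0),
]

def avg_spend_notebook(orders, as_of):
    """Training-notebook version: orders in the 30 days ending at as_of."""
    cutoff = as_of - timedelta(days=30)
    amounts = [amt for ts, amt in orders if cutoff < ts <= as_of]
    return sum(amounts) / len(amounts) if amounts else 0.0

def avg_spend_serving(orders, as_of):
    """Serving version: here "30-day" was read as "calendar month to date"."""
    month_start = as_of.replace(day=1)
    amounts = [amt for ts, amt in orders if month_start <= ts <= as_of]
    return sum(amounts) / len(amounts) if amounts else 0.0

as_of = datetime(2024, 2, 15)
trained_on = avg_spend_notebook(ORDERS, as_of)  # window: Jan 16 to Feb 15
served = avg_spend_serving(ORDERS, as_of)       # window: Feb 1 to Feb 15
```

Neither function raises an error, and each looks reasonable in isolation. The disagreement only shows up if someone compares the two outputs, which is why a single shared definition is the safer structure.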

Step 2 · Compute the features

The pipeline + raw data

Features come from raw events and tables that a model can’t use directly. Something has to transform "a stream of orders" into "30-day average order value per user." Where does that logic live?

Design decision: Where should feature transformation logic live?

The call: In a shared Feature Pipeline that computes from raw data using registered definitions. A single pipeline reads raw events and applies the registered transformation to produce feature values. It is the one place the logic lives, and it feeds both stores.

A shared Feature Pipeline reads Raw Events and applies transformations to produce feature values. This is the one place feature logic lives, so both stores (and every model) get identically-computed values. The pipeline is the engine; the registry (Step 5) holds the definitions it runs.

Compute once, consume many: Centralizing transformation is what makes a feature reusable and skew-free. The pipeline doesn’t belong to a model — it belongs to the platform, and models subscribe to its outputs.
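As an illustration of "compute once, feed both stores", the sketch below turns a list of raw order events into a 30-day average order value per user and writes the same values to two in-memory stand-ins. The event keys (`user_id`, `ts`, `amount`), the function names, and the list/dict stores are assumptions made for the example, not a real feature store API.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def avg_order_value_30d(events, as_of):
    """Average order amount per user over the 30 days ending at as_of."""
    cutoff = as_of - timedelta(days=30)
    totals = defaultdict(lambda: [0.0, 0])  # user_id -> [sum, count]
    for e in events:
        if cutoff < e["ts"] <= as_of:
            acc = totals[e["user_id"]]
            acc[0] += e["amount"]
            acc[1] += 1
    return {user: s / n for user, (s, n) in totals.items()}

def run_pipeline(events, as_of, offline_store, online_store):
    """Compute once, then write the same values to both stores."""
    values = avg_order_value_30d(events, as_of)
    for user, value in values.items():
        offline_store.append((user, as_of, value))  # history: append-only
        online_store[user] = value                  # serving: latest only
    return values

raw_events = [
    {"user_id": "u1", "ts": datetime(2023, 12, 1), "amount": 1000.0},  # too old
    {"user_id": "u1", "ts": datetime(2024, 1, 5), "amount": 20.0},
    {"user_id": "u1", "ts": datetime(2024, 1, 10), "amount": 40.0},
    {"user_id": "u2", "ts": datetime(2024, 1, 8), "amount": 10.0},
]
offline, online = [], {}
run_pipeline(raw_events, datetime(2024, 1, 15), offline, online)
```

Because both writes come from the single `values` dict, the offline row and the online value for a user cannot differ for a given run.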

Step 3 · History for training

The offline store + point-in-time joins

To train, you need each example’s features as they were at that moment — not today’s values. Join naively and you leak the future into the past. How do you build a correct training set?

Design decision: How do you assemble features for a historical training set correctly?

The call: Point-in-time joins against an offline store: for each labeled event, use the feature values known at its timestamp. The offline store keeps feature values over time; a point-in-time join attaches to each training example only the feature values that existed at that example’s moment, so nothing leaks in from the future.

The Offline Store holds historical feature values over time; Training builds datasets with point-in-time-correct joins, so each example gets only the feature values known at its own timestamp. That prevents feature leakage (using future information) and makes training sets reproducible. The offline store is throughput-optimized (a warehouse), not latency-optimized.

The past, as it actually was: Point-in-time correctness is the offline store’s reason to exist: it reconstructs history without letting the future bleed in. Skip it and your evals lie in the optimistic direction — the model looks great until it ships.
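A point-in-time join can be sketched in a few lines. This version assumes the feature history for each entity is already sorted by timestamp, uses small integers as timestamps to keep the example readable, and treats a value as "known" when its timestamp is less than or equal to the label's timestamp. All names here are hypothetical.

```python
from bisect import bisect_right

# Offline store stand-in: per-entity history as (valid_from, value), sorted.
FEATURE_HISTORY = {
    "u1": [(1, 10.0), (5, 20.0), (9, 35.0)],
}

def point_in_time_value(history, entity, ts):
    """Return the latest feature value with valid_from <= ts, or None."""
    rows = history.get(entity, [])
    idx = bisect_right([valid_from for valid_from, _ in rows], ts)
    return rows[idx - 1][1] if idx > 0 else None

def build_training_set(labels, history):
    """labels: list of (entity, event_ts, label). Attach as-of feature values."""
    return [
        (entity, ts, point_in_time_value(history, entity, ts), label)
        for entity, ts, label in labels
    ]

labels = [("u1", 0, 0), ("u1", 5, 1), ("u1", 7, 0)]
training_set = build_training_set(labels, FEATURE_HISTORY)
```

A naive join on entity alone would attach the newest value (35.0) to all three examples. Here the example at time 0 gets no value, and the examples at times 5 and 7 both get 20.0, the value that was actually available then.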

Step 4 · Freshness for serving

The online store, at millisecond latency

At serving time a prediction must return in milliseconds, and re-running the batch pipeline per request is impossible. But the features must be the same ones training used. How do you serve them fast without recomputing?

Design decision: How do you serve features in milliseconds without skew?

The call: Materialize the pipeline’s output into a low-latency online store; serving looks up by entity key. The same pipeline writes the latest feature values into a key-value online store, and serving does a fast lookup by entity id: same definition as offline, millisecond latency, no recompute.

The Online Store is a low-latency key-value store holding the latest feature values per entity, materialized by the same pipeline that fills the offline store. Model Serving looks up features by entity key in milliseconds and scores. Because both stores come from one definition, the serving features match the training features exactly — no recompute, no skew.

Same feature, two shapes: Offline = all history, high throughput, for training. Online = latest value, low latency, for serving. One definition, materialized two ways — that’s how a feature store serves fast and stays consistent.
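The serving path can be sketched as nothing more than keyed reads. The class below is an in-memory stand-in for a key-value store, and the feature names, weights, and toy linear model are made up for illustration; a real deployment would use a networked store and a real model.

```python
class OnlineStore:
    """In-memory stand-in for a low-latency key-value store."""

    def __init__(self):
        self._rows = {}  # (entity_id, feature_name) -> latest value

    def write(self, entity_id, feature_name, value):
        # Materialization overwrites: only the latest value is kept.
        self._rows[(entity_id, feature_name)] = value

    def get_features(self, entity_id, feature_names):
        # Serving path: key lookups only, no transformation logic.
        return [self._rows.get((entity_id, name)) for name in feature_names]

def score(weights, features):
    """Toy linear model; a missing feature is treated as 0.0."""
    return sum(w * (f if f is not None else 0.0) for w, f in zip(weights, features))

store = OnlineStore()
store.write("u1", "avg_order_value_30d", 30.0)
store.write("u1", "order_count_lifetime", 12)
store.write("u1", "avg_order_value_30d", 32.0)  # a newer run replaces the value

features = store.get_features("u1", ["avg_order_value_30d", "order_count_lifetime"])
prediction = score([0.5, 1.0], features)
```

Note that `get_features` contains no feature logic at all. That absence is the design choice: serving cannot drift from training if it never computes anything.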

Step 5 · One source of truth

The feature registry + versioning

If the pipeline holds the logic but each run hard-codes it, definitions drift, nobody knows who owns a feature, and changing one silently breaks a model. What makes a feature a governed, discoverable asset?

Design decision: What turns a computed value into a reusable, governed feature?

The call: A versioned definition in a registry (logic, types, owner) that both stores derive from. The Feature Registry holds each feature’s definition, data type, owner, and version. The pipeline runs these definitions and both stores derive from them, so a feature is discoverable and reusable, and changes to it are versioned.

The Feature Registry is the single source of truth: each feature’s transformation logic, type, owner, and version. The pipeline executes these definitions and both stores derive from them, so there is exactly one meaning of a feature. Versioning makes changes explicit — a new version instead of a silent redefinition that breaks dependents.

Features as versioned assets: The registry is what makes "define once, use everywhere" real and governable: discoverable so teams reuse instead of reinventing, versioned so a change can’t silently skew a live model.
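One way to picture the registry is as a map from (name, version) to an immutable definition. The sketch below is an assumption about shape, not any particular product's API: the field names, the `growth-team` owner, and the rule that a published version cannot be overwritten are all illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass(frozen=True)
class FeatureDefinition:
    name: str
    version: int
    dtype: str
    owner: str
    transform: Callable[[List[float]], float]  # the feature's logic

class FeatureRegistry:
    def __init__(self) -> None:
        self._defs: Dict[Tuple[str, int], FeatureDefinition] = {}

    def register(self, definition: FeatureDefinition) -> None:
        key = (definition.name, definition.version)
        if key in self._defs:
            # Published versions are immutable; a change needs a new version.
            raise ValueError(f"{definition.name} v{definition.version} already exists")
        self._defs[key] = definition

    def get(self, name: str, version: int) -> FeatureDefinition:
        return self._defs[(name, version)]

    def latest_version(self, name: str) -> int:
        return max(v for n, v in self._defs if n == name)

def _mean(xs: List[float]) -> float:
    return sum(xs) / len(xs) if xs else 0.0

registry = FeatureRegistry()
avg_spend_v1 = FeatureDefinition("avg_spend", 1, "float", "growth-team", _mean)
# v2 changes the meaning (refunds excluded), so it is a new version.
avg_spend_v2 = FeatureDefinition(
    "avg_spend", 2, "float", "growth-team",
    lambda amounts: _mean([a for a in amounts if a > 0]),
)
registry.register(avg_spend_v1)
registry.register(avg_spend_v2)

amounts = [10.0, 30.0, -10.0]
pinned = registry.get("avg_spend", 1).transform(amounts)
newest = registry.get("avg_spend", registry.latest_version("avg_spend")).transform(amounts)
```

A model trained against version 1 keeps requesting version 1 and keeps getting the old semantics, while new models can adopt version 2. The change is visible in the version number rather than hidden inside a redefinition.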

Step 6 · Keep it fresh

Batch + streaming materialization

Some features are slow-moving (lifetime order count); some are fast (activity in the last 5 minutes). A nightly batch keeps the first fresh but leaves the second hours stale. How do you keep the online store current for both?

Design decision: How do you keep fast-moving features fresh in the online store?

The call: Batch-materialize slow features; stream-update fast ones into the online store in near real time. Use both: scheduled batch jobs for features that change slowly, and streaming ingestion from event streams for features that must reflect the last few minutes, so each feature stays as fresh as it needs to be.

Materialize with the right cadence per feature: batch jobs for slow-moving features and streaming ingestion for fast-moving ones, both writing to the online store. Freshness becomes a per-feature property (its TTL/SLA), so the online store reflects reality at the speed each feature demands, without recomputing everything constantly.

Freshness is per-feature: Not all features age at the same rate. Batch + stream lets each feature meet its own freshness SLA cheaply. Stale features are their own silent failure; a monitor (Step 7) watches for them.
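The two write paths and the per-feature TTL can be sketched together. In this example timestamps are plain seconds, the TTL numbers are invented, and the streaming counter omits window expiry for brevity, so it is a running count rather than a true 5-minute window. All names are hypothetical.

```python
class FreshOnlineStore:
    """Online store stand-in that records when each value was written."""

    def __init__(self):
        self._rows = {}  # (entity, feature) -> (value, written_at)

    def write(self, entity, feature, value, now):
        self._rows[(entity, feature)] = (value, now)

    def read(self, entity, feature, now, ttl_seconds):
        """Return the value, or None if missing or older than the TTL."""
        row = self._rows.get((entity, feature))
        if row is None:
            return None
        value, written_at = row
        return value if now - written_at <= ttl_seconds else None

# Per-feature freshness budget in seconds (illustrative numbers).
TTL = {"order_count_lifetime": 24 * 3600, "clicks_last_5m": 60}

def batch_materialize(store, snapshot, now):
    """Scheduled job: bulk-write a slow-moving feature."""
    for entity, count in snapshot.items():
        store.write(entity, "order_count_lifetime", count, now)

def on_click_event(store, entity, counts, now):
    """Stream handler: update a fast-moving counter as each event arrives."""
    counts[entity] = counts.get(entity, 0) + 1  # window expiry omitted
    store.write(entity, "clicks_last_5m", counts[entity], now)

store = FreshOnlineStore()
counts = {}
batch_materialize(store, {"u1": 12}, now=0)
on_click_event(store, "u1", counts, now=1000)
on_click_event(store, "u1", counts, now=1010)

fresh_clicks = store.read("u1", "clicks_last_5m", 1030, TTL["clicks_last_5m"])
stale_clicks = store.read("u1", "clicks_last_5m", 1200, TTL["clicks_last_5m"])
lifetime = store.read("u1", "order_count_lifetime", 1200, TTL["order_count_lifetime"])
```

At time 1200 the batch-written feature is still within its one-day budget, while the streamed feature has exceeded its 60-second budget and is reported as missing rather than served stale. Whether to return `None`, a default, or the stale value is a policy choice this sketch does not settle.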

Step 7 · Watch for rot

Drift and staleness monitoring

Everything works on launch day. Months later the model quietly underperforms — the world shifted, or a materialization job silently stalled and features went stale. Nothing errored. How do you catch decay before users do?

Design decision: What tells you features are silently degrading the model?

The call: Monitor feature distributions (drift) and freshness (staleness), alerting on both. Track each feature’s distribution over time to catch drift (the world changed) and its update recency to catch staleness (materialization stalled). These are the two silent ways features degrade a model.

A Drift Monitor watches feature distributions (has the data shifted from what the model trained on?) and freshness (did a materialization job stall?), alerting on both. These are the two silent ways a healthy-looking system decays — catching them at the feature level surfaces the cause before aggregate model metrics even move.

Decay is silent by default: Drift and staleness don’t throw errors; they just quietly widen the gap between training and reality. Feature-level monitoring is the smoke detector — it names the failing feature, not just "accuracy is down."
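A feature-level check for both failure modes might look like the sketch below. The drift statistic here is deliberately simple: the shift of the serving mean measured in training standard deviations. It is a stand-in chosen for brevity, not the method the walkthrough prescribes; the threshold of 3.0, the sample values, and the function names are assumptions.

```python
from statistics import mean, pstdev

def drift_score(training_values, serving_values):
    """Shift of the serving mean, in units of the training standard deviation."""
    sigma = pstdev(training_values)
    shift = abs(mean(serving_values) - mean(training_values))
    if sigma == 0:
        return 0.0 if shift == 0 else float("inf")
    return shift / sigma

def check_feature(name, training_values, serving_values,
                  last_written_at, now, max_age_seconds, drift_threshold=3.0):
    """Return a list of alert strings for one feature (empty if healthy)."""
    alerts = []
    if drift_score(training_values, serving_values) > drift_threshold:
        alerts.append(f"{name}: drift")
    if now - last_written_at > max_age_seconds:
        alerts.append(f"{name}: stale")
    return alerts

train = [10, 12, 11, 9, 13, 11, 10, 12]  # values the model was trained on
name = "avg_order_value_30d"

healthy = check_feature(name, train, [11, 10, 12, 11],
                        last_written_at=900, now=1000, max_age_seconds=3600)
drifted = check_feature(name, train, [20, 22, 21, 19],
                        last_written_at=900, now=1000, max_age_seconds=3600)
stalled = check_feature(name, train, [11, 10, 12, 11],
                        last_written_at=0, now=10000, max_age_seconds=3600)
```

Each alert names the feature and the failure mode, which is the property the text asks for: the monitor points at a cause instead of reporting only that accuracy moved.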

The payoff

You built a feature store

From "two teams compute the same feature two ways" to a shared platform: one versioned registry of definitions, a pipeline that computes once, an offline store with point-in-time-correct history for training, a low-latency online store for serving, batch + streaming materialization for freshness, and drift/staleness monitoring.

Now skew train vs serve — recompute a feature at serving instead of reading the store — and watch production accuracy fall while every offline metric stays green, because the model is now fed a distribution it never trained on. That’s why a feature is defined once and read everywhere, and why training/serving skew is the silent failure a feature store exists to kill.

  • One definition — define a feature once; training and serving read the same values — no skew
  • Feature pipeline — the single place transformation logic lives; computes once, feeds both stores
  • Offline store — historical values + point-in-time joins → correct, leak-free training sets
  • Online store — latest values, millisecond lookup → fast serving, same features as training
  • Registry — versioned, owned, discoverable definitions — features as reusable assets
  • Materialization — batch for slow features, streaming for fast — freshness per feature
  • Monitoring — drift + staleness alerts catch silent decay before users do
  • The failure — training/serving skew degrades the model silently — read the store, never re-derive