In This Chapter
- Learning Objectives
- 24.1 The Model Is 5% of the System
- 24.2 Anatomy of an ML System
- 24.3 The Recommendation System Architecture: A Complete Walkthrough
- 24.4 Serving Patterns: Batch, Real-Time, and Near-Real-Time
- 24.5 Feature Stores: The Bridge Between Training and Serving
- 24.6 Training-Serving Skew: Detection and Prevention
- 24.7 Designing for Reliability
- 24.8 Architecture Decision Records (ADRs)
- 24.9 The Credit Scoring Anchor: Batch vs. Real-Time
- 24.10 Experimentation Infrastructure
- 24.11 The StreamRec Application: System Architecture Design (Milestone 9)
- 24.12 System Design as a Professional Discipline
- Summary
Chapter 24: ML System Design — Architecture Patterns for Real-World Machine Learning
"Only a small fraction of real-world ML systems is composed of the ML code... The required surrounding infrastructure is vast and complex." — D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison, "Hidden Technical Debt in Machine Learning Systems" (NeurIPS, 2015)
Learning Objectives
By the end of this chapter, you will be able to:
- Decompose an ML system into its core components — data pipeline, feature store, training infrastructure, model serving, and monitoring — and explain the responsibilities and failure modes of each
- Select among batch, real-time, and near-real-time serving patterns based on latency requirements, throughput constraints, and cost trade-offs
- Design for reliability using redundancy, graceful degradation, fallback models, and circuit breakers
- Write Architecture Decision Records (ADRs) that document ML system design choices with context, constraints, and consequences
- Apply the full system design framework to the StreamRec recommendation capstone
24.1 The Model Is 5% of the System
In 2015, Sculley et al. published what remains the most cited paper on ML engineering: "Hidden Technical Debt in Machine Learning Systems." The paper's central figure — a small box labeled "ML Code" surrounded by vastly larger boxes for data collection, feature extraction, analysis tools, serving infrastructure, monitoring, configuration, and process management — has become an icon of the field. The message is stark: the machine learning model itself is a small component of a production ML system, and often the easiest part to get right.
This chapter begins Part V of the book — the part where we build the 95% that surrounds the model. We have spent twenty-three chapters learning to build excellent models. Linear algebra and optimization (Part I) gave us the mathematical foundations. Deep learning (Part II) gave us neural architectures for every data modality. Causal inference (Part III) taught us to distinguish prediction from causation. Bayesian and temporal methods (Part IV) gave us principled uncertainty quantification. All of this work was necessary. None of it is sufficient.
A model that achieves state-of-the-art performance on a hold-out test set but cannot serve predictions in under 100 milliseconds is useless for a real-time recommendation system. A model trained on yesterday's data that does not retrain when user behavior shifts after a product launch is a ticking time bomb. A feature engineering pipeline that computes features differently during training and serving — the dreaded training-serving skew — will produce predictions that are silently, catastrophically wrong.
Production ML = Software Engineering: Machine learning in production is software engineering with three additional dimensions of complexity: data dependencies, model behavior that changes over time, and the entanglement of components through shared data. Every principle of software engineering — modularity, testing, monitoring, documentation, version control — applies, and then ML adds its own. This theme runs through every chapter in Part V.
This chapter provides the architectural blueprint. We will design a complete ML system for a content recommendation platform — from the moment a user opens the application to the moment a ranked list of recommendations appears on their screen, all within a 200-millisecond latency budget. Along the way, we will develop a vocabulary for discussing ML system architecture, a framework for choosing between design alternatives, and a documentation discipline (ADRs) for recording and communicating those choices.
24.2 Anatomy of an ML System
Every production ML system, regardless of domain, contains the same core components. They vary in scale, complexity, and implementation technology, but the conceptual architecture is universal.
The Component Map
| Component | Responsibility | Failure Mode |
|---|---|---|
| Data Pipeline | Ingest, validate, transform, and store raw data | Stale data, schema drift, silent corruption |
| Feature Store | Compute, store, and serve features for training and inference | Training-serving skew, stale features, missing values |
| Training Infrastructure | Train, validate, and register models | Non-reproducible runs, resource waste, silent degradation |
| Model Registry | Version, stage, and manage model artifacts | Wrong model in production, missing metadata |
| Serving Infrastructure | Accept requests, compute predictions, return results | Latency spikes, OOM errors, stale models |
| Monitoring & Observability | Track data quality, model performance, and system health | Undetected drift, cascading failures |
| Experimentation Platform | A/B test, shadow mode, canary deployment | Selection bias, underpowered tests |
| Configuration Management | Manage hyperparameters, feature lists, model versions | Configuration drift, untraceable changes |
These components interact through well-defined interfaces: the feature store provides a consistent feature-fetching API for both training jobs and serving requests; the model registry provides versioned artifacts for the serving infrastructure; the monitoring system consumes metrics from every other component.
The Two Loops
ML systems operate on two timescales, and understanding this distinction is essential for architecture:
The inner loop (online) handles individual prediction requests. A user opens the application; the serving infrastructure retrieves user features from the feature store, loads the current model, computes a prediction, and returns a response. This loop runs in milliseconds. Latency, throughput, and availability are the primary concerns.
The outer loop (offline) handles model improvement. New data arrives; the data pipeline processes it; the training infrastructure retrains or fine-tunes the model; the experimentation platform evaluates the new model against the current one; if the new model wins, the model registry promotes it to production. This loop runs in hours or days. Model quality, reproducibility, and validation rigor are the primary concerns.
The art of ML system design is keeping these two loops decoupled yet synchronized. The inner loop must never block on the outer loop (a user should never wait while the model retrains). The outer loop must never bypass the inner loop's safety checks (a new model should never reach production without validation).
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple
from enum import Enum
import time
class ComponentStatus(Enum):
"""Health status for ML system components."""
HEALTHY = "healthy"
DEGRADED = "degraded"
UNHEALTHY = "unhealthy"
@dataclass
class MLSystemComponent:
"""Base representation of an ML system component.
Tracks name, health status, latency SLA, and dependencies
for system-level architecture reasoning.
Attributes:
name: Component identifier.
status: Current health status.
latency_budget_ms: Maximum acceptable latency in milliseconds.
dependencies: List of component names this component depends on.
last_health_check: Unix timestamp of last health check.
"""
name: str
status: ComponentStatus = ComponentStatus.HEALTHY
latency_budget_ms: float = 100.0
dependencies: List[str] = field(default_factory=list)
last_health_check: float = 0.0
def is_available(self) -> bool:
"""Check if component is available for serving.
Returns:
True if status is HEALTHY or DEGRADED.
"""
return self.status in (ComponentStatus.HEALTHY, ComponentStatus.DEGRADED)
@dataclass
class MLSystemArchitecture:
"""High-level ML system architecture specification.
Enumerates all components, their latency budgets, and dependencies.
Used for architecture documentation and system health assessment.
Attributes:
name: System name.
components: Dictionary mapping component name to specification.
total_latency_budget_ms: End-to-end latency SLA.
"""
name: str
components: Dict[str, MLSystemComponent] = field(default_factory=dict)
total_latency_budget_ms: float = 200.0
def add_component(self, component: MLSystemComponent) -> None:
"""Register a component in the architecture.
Args:
component: The component to add.
"""
self.components[component.name] = component
def critical_path_latency(self) -> float:
"""Compute the sum of latency budgets along the critical path.
        This is a simplified model that treats every component as serial,
        so the critical path is the sum of all latency budgets. A fuller
        model would take the maximum over parallel branches instead.
Returns:
Total critical path latency in milliseconds.
"""
return sum(c.latency_budget_ms for c in self.components.values())
def check_latency_feasibility(self) -> Tuple[bool, str]:
"""Verify that the critical path fits within the total budget.
Returns:
(feasible, message) tuple.
"""
critical = self.critical_path_latency()
if critical <= self.total_latency_budget_ms:
return True, (
f"Critical path ({critical:.0f}ms) fits within "
f"budget ({self.total_latency_budget_ms:.0f}ms). "
f"Headroom: {self.total_latency_budget_ms - critical:.0f}ms."
)
return False, (
f"Critical path ({critical:.0f}ms) exceeds "
f"budget ({self.total_latency_budget_ms:.0f}ms) "
f"by {critical - self.total_latency_budget_ms:.0f}ms."
)
def system_health(self) -> Dict[str, ComponentStatus]:
"""Report the health of all components.
Returns:
Dictionary mapping component name to health status.
"""
return {name: c.status for name, c in self.components.items()}
# Example: Define a minimal recommendation system architecture
rec_system = MLSystemArchitecture(
name="StreamRec",
total_latency_budget_ms=200.0
)
rec_system.add_component(MLSystemComponent(
name="api_gateway",
latency_budget_ms=5.0,
dependencies=[]
))
rec_system.add_component(MLSystemComponent(
name="feature_store_lookup",
latency_budget_ms=15.0,
dependencies=["api_gateway"]
))
rec_system.add_component(MLSystemComponent(
name="candidate_retrieval",
latency_budget_ms=30.0,
dependencies=["feature_store_lookup"]
))
rec_system.add_component(MLSystemComponent(
name="ranking_model",
latency_budget_ms=50.0,
dependencies=["candidate_retrieval"]
))
rec_system.add_component(MLSystemComponent(
name="reranking",
latency_budget_ms=20.0,
dependencies=["ranking_model"]
))
rec_system.add_component(MLSystemComponent(
name="response_assembly",
latency_budget_ms=5.0,
dependencies=["reranking"]
))
feasible, message = rec_system.check_latency_feasibility()
print(f"Feasibility: {feasible}")
print(message)
print(f"\nComponent budgets:")
for name, comp in rec_system.components.items():
print(f" {name}: {comp.latency_budget_ms:.0f}ms")
Feasibility: True
Critical path (125ms) fits within budget (200ms). Headroom: 75ms.
Component budgets:
api_gateway: 5ms
feature_store_lookup: 15ms
candidate_retrieval: 30ms
ranking_model: 50ms
reranking: 20ms
response_assembly: 5ms
The 75ms of headroom is not luxury — it absorbs network variability, garbage collection pauses, and the long tail of database queries. Production systems that use 100% of their latency budget at p50 will violate SLAs at p99.
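The tail-compounding effect can be sketched with a small Monte Carlo simulation. The lognormal latency parameters below are illustrative assumptions, not measurements; the component medians mirror the budgets from the example above:

```python
import math
import random

random.seed(0)

BUDGET_MS = 200.0
# (median_ms, sigma) for each serial component; lognormal right tails
components = [(5, 0.3), (15, 0.5), (30, 0.5), (50, 0.4), (20, 0.4), (5, 0.3)]

def simulate_request() -> float:
    """Total latency of one request through all serial components."""
    return sum(random.lognormvariate(math.log(med), sigma)
               for med, sigma in components)

samples = sorted(simulate_request() for _ in range(10_000))
p50 = samples[len(samples) // 2]
p99 = samples[int(len(samples) * 0.99)]
print(f"p50: {p50:.0f}ms   p99: {p99:.0f}ms   budget: {BUDGET_MS:.0f}ms")
```

Even though the medians sum to 125ms, the p99 of the simulated distribution lands far closer to the 200ms budget: the tail, not the median, is what consumes headroom.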
24.3 The Recommendation System Architecture: A Complete Walkthrough
We now design the architecture for StreamRec, the content recommendation platform that has been our progressive project throughout this book. In Part I, we decomposed the user-item interaction matrix with SVD (Chapter 1) and built approximate nearest neighbor indices with FAISS (Chapter 5). In Part II, we moved to neural architectures: the click-prediction MLP (Chapter 6), the session transformer (Chapter 10), the two-tower retrieval model (Chapter 13), and the graph neural network collaborative filter (Chapter 14). In Part III, we asked causal questions about recommendation effects and estimated heterogeneous treatment effects (Chapters 15-19). In Part IV, we built Bayesian user preferences for cold-start handling (Chapter 20).
Now we bring it all together. The question is no longer "which model?" but "how do all these models compose into a system that serves 50 million users with sub-200ms latency?"
The Multi-Stage Architecture
Modern recommendation systems use a funnel architecture with three stages: candidate retrieval, ranking, and re-ranking. Each stage narrows the item set while increasing the computational cost per item.
┌──────────────────────────────┐
│         Item Corpus          │
│        200,000 items         │
└──────────────┬───────────────┘
               │
┌──────────────▼───────────────┐
│     Stage 1: Retrieval       │
│  200,000 → 500 candidates    │
│     ~0.15µs per item         │
│   ANN (FAISS), two-tower     │
│        Budget: 30ms          │
└──────────────┬───────────────┘
               │
┌──────────────▼───────────────┐
│      Stage 2: Ranking        │
│       500 → 50 items         │
│      ~0.1ms per item         │
│     Deep ranking model       │
│        Budget: 50ms          │
└──────────────┬───────────────┘
               │
┌──────────────▼───────────────┐
│    Stage 3: Re-ranking       │
│      50 → 20 displayed       │
│      ~0.4ms per item         │
│  Business rules, diversity,  │
│    freshness, fairness       │
│        Budget: 20ms          │
└──────────────┬───────────────┘
               │
               ▼
      20 recommendations
        shown to user
Why three stages? A single model scoring all 200,000 items per request at 0.1ms per item would take 20 seconds — two orders of magnitude beyond the budget. The funnel architecture is not optional; it is a mathematical consequence of the latency constraint.
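The arithmetic behind this claim, using the per-item costs from the diagram:

```python
# Back-of-envelope latency: single-stage scoring vs. the funnel,
# using the per-item costs from the diagram above.
CORPUS = 200_000
RANK_COST_MS = 0.1       # deep ranking model, per item
RETRIEVAL_COST_MS = 0.00015   # ANN retrieval, per item (~0.15µs)
RERANK_COST_MS = 0.4     # business rules, per item

single_stage_ms = CORPUS * RANK_COST_MS
funnel_ms = (CORPUS * RETRIEVAL_COST_MS   # Stage 1: 200,000 items
             + 500 * RANK_COST_MS         # Stage 2: 500 candidates
             + 50 * RERANK_COST_MS)       # Stage 3: 50 items

print(f"Single-stage: {single_stage_ms:,.0f}ms (~{single_stage_ms/1000:.0f}s)")
print(f"Funnel: {funnel_ms:.0f}ms")
```

The funnel's 100ms sits comfortably inside the 200ms budget; the single-stage design misses it by two orders of magnitude.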
Stage 1: Candidate Retrieval
The retrieval stage must reduce 200,000 items to ~500 candidates in under 30 milliseconds. Speed dominates precision: it is acceptable to miss a few good items if the false negative rate is low enough that the top 500 contains most of the items the user would engage with.
Two-tower retrieval (Chapter 13). The user encoder produces a dense embedding $\mathbf{u} \in \mathbb{R}^{128}$; the item encoder produces $\mathbf{v}_i \in \mathbb{R}^{128}$ for each item. At serving time, item embeddings are pre-computed and stored in a FAISS index (Chapter 5). The user embedding is computed on-the-fly and used to query the index for the nearest 200 items.
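A minimal sketch of this retrieval step, using exact brute-force inner-product search in NumPy as a stand-in for the FAISS index, and scaled down to 20,000 items so it runs quickly (all numbers here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
n_items, dim, top_k = 20_000, 128, 200

# In production these are pre-computed by the item tower and stored
# in a FAISS ANN index; here they are random for illustration.
item_embeddings = rng.standard_normal((n_items, dim)).astype(np.float32)
# Computed on-the-fly by the user tower at request time.
user_embedding = rng.standard_normal(dim).astype(np.float32)

scores = item_embeddings @ user_embedding          # (n_items,)
# argpartition finds the top-k in O(n); then sort only those k
idx = np.argpartition(-scores, top_k)[:top_k]
candidates = idx[np.argsort(-scores[idx])]

print(f"Returned {len(candidates)} candidates; "
      f"best score {scores[candidates[0]]:.2f}")
```

The brute-force search here is O(n·d) per request; at 200,000 items an approximate index is what keeps this stage inside its 30ms budget.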
Multi-source retrieval. No single retrieval source captures all relevant items. StreamRec uses four retrieval sources, run in parallel:
- Embedding similarity (two-tower): 200 candidates based on user-item embedding distance.
- Collaborative filtering (GNN, Chapter 14): 150 candidates from graph-based neighborhood.
- Content-based (category affinity, Chapter 20): 100 candidates from the user's preferred categories.
- Trending/fresh (popularity-weighted recency): 50 candidates to ensure coverage of new content.
The union of these sources produces 400-500 unique candidates after deduplication. Each source has its own latency budget (30ms total, run in parallel), and each has a fallback: if the embedding index is unavailable, the system falls back to collaborative filtering and content-based retrieval alone.
from dataclasses import dataclass
from typing import Dict, List, Optional
from enum import Enum
class RetrievalSource(Enum):
"""Sources of candidate items for recommendation."""
EMBEDDING_ANN = "embedding_ann"
COLLABORATIVE = "collaborative"
CONTENT_BASED = "content_based"
TRENDING = "trending"
@dataclass
class RetrievalResult:
"""Result from a single retrieval source.
Attributes:
source: Which retrieval method produced these candidates.
item_ids: List of candidate item IDs.
scores: Corresponding relevance scores (source-specific).
latency_ms: Time taken by this retrieval source.
is_fallback: Whether this result came from a fallback mechanism.
"""
source: RetrievalSource
item_ids: List[int]
scores: List[float]
latency_ms: float
is_fallback: bool = False
@dataclass
class CandidateRetriever:
"""Multi-source candidate retrieval with fallback logic.
Queries multiple retrieval sources in parallel and merges results.
If a primary source fails, activates its fallback.
Attributes:
max_candidates_per_source: Maximum candidates from each source.
total_candidate_target: Target number of unique candidates after merge.
timeout_ms: Maximum time to wait for any single source.
"""
    max_candidates_per_source: Optional[Dict[RetrievalSource, int]] = None
total_candidate_target: int = 500
timeout_ms: float = 30.0
def __post_init__(self):
if self.max_candidates_per_source is None:
self.max_candidates_per_source = {
RetrievalSource.EMBEDDING_ANN: 200,
RetrievalSource.COLLABORATIVE: 150,
RetrievalSource.CONTENT_BASED: 100,
RetrievalSource.TRENDING: 50,
}
def merge_candidates(
self, results: List[RetrievalResult]
) -> List[int]:
"""Merge candidates from multiple sources, deduplicating.
Items appearing in multiple sources receive a bonus (multi-source
agreement is a strong relevance signal).
Args:
results: List of retrieval results from all sources.
Returns:
Deduplicated list of candidate item IDs, ordered by
aggregate score.
"""
item_scores: Dict[int, float] = {}
item_source_count: Dict[int, int] = {}
for result in results:
for item_id, score in zip(result.item_ids, result.scores):
                # Accumulate raw scores (assumes each source emits
                # scores on a comparable scale)
if item_id not in item_scores:
item_scores[item_id] = 0.0
item_source_count[item_id] = 0
item_scores[item_id] += score
item_source_count[item_id] += 1
# Multi-source bonus: items from 2+ sources get a 1.5x boost
for item_id in item_scores:
if item_source_count[item_id] >= 2:
item_scores[item_id] *= 1.5
# Sort by aggregate score, return top candidates
sorted_items = sorted(
item_scores.keys(),
key=lambda x: item_scores[x],
reverse=True
)
return sorted_items[:self.total_candidate_target]
    def retrieval_summary(
        self, results: List[RetrievalResult]
    ) -> Dict[str, object]:
"""Summarize retrieval for monitoring.
Args:
results: List of retrieval results.
Returns:
Dictionary of retrieval metrics.
"""
total_candidates = len(
set(
item_id
for r in results
for item_id in r.item_ids
)
)
return {
"total_unique_candidates": total_candidates,
"sources_used": [r.source.value for r in results],
"sources_fallback": [
r.source.value for r in results if r.is_fallback
],
"max_latency_ms": max(r.latency_ms for r in results),
"mean_latency_ms": sum(r.latency_ms for r in results) / len(results),
}
# Simulate multi-source retrieval
retriever = CandidateRetriever()
# Simulated results (in production, these run in parallel)
results = [
RetrievalResult(
source=RetrievalSource.EMBEDDING_ANN,
item_ids=list(range(0, 200)),
scores=[1.0 - i * 0.004 for i in range(200)],
latency_ms=12.3,
),
RetrievalResult(
source=RetrievalSource.COLLABORATIVE,
item_ids=list(range(100, 250)),
scores=[0.9 - i * 0.005 for i in range(150)],
latency_ms=18.7,
),
RetrievalResult(
source=RetrievalSource.CONTENT_BASED,
item_ids=list(range(180, 280)),
scores=[0.8 - i * 0.006 for i in range(100)],
latency_ms=8.1,
),
RetrievalResult(
source=RetrievalSource.TRENDING,
item_ids=list(range(300, 350)),
scores=[0.7 - i * 0.01 for i in range(50)],
latency_ms=3.2,
),
]
candidates = retriever.merge_candidates(results)
summary = retriever.retrieval_summary(results)
print(f"Unique candidates: {summary['total_unique_candidates']}")
print(f"After merge/rank: {len(candidates)}")
print(f"Max source latency: {summary['max_latency_ms']:.1f}ms")
print(f"Fallback sources: {summary['sources_fallback']}")
Unique candidates: 330
After merge/rank: 330
Max source latency: 18.7ms
Fallback sources: []
Stage 2: Ranking
The ranking stage scores 500 candidates with a deep model and selects the top 50. This is where the expensive neural architectures from Part II earn their keep: the ranking model has access to dense user features, item features, context features (time of day, device, session history), and cross-features.
The ranking model. StreamRec's ranker is a deep cross network (DCN-V2): it combines explicit feature interactions (cross network) with implicit interactions (deep network) to predict engagement probability. The model takes ~0.1ms per item on a GPU, allowing 500 items to be scored in ~50ms.
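A minimal NumPy sketch of the DCN-V2 cross layer, whose recurrence is $\mathbf{x}_{l+1} = \mathbf{x}_0 \odot (W_l \mathbf{x}_l + \mathbf{b}_l) + \mathbf{x}_l$. Dimensions and weights below are illustrative; this sketches the interaction mechanism, not the production model:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_layers = 64, 3

x0 = rng.standard_normal(dim).astype(np.float32)   # concatenated features
x = x0
for _ in range(n_layers):
    # Small random weights stand in for learned parameters
    W = (rng.standard_normal((dim, dim)) * 0.01).astype(np.float32)
    b = np.zeros(dim, dtype=np.float32)
    # Explicit feature cross: element-wise product with the input,
    # plus a residual connection
    x = x0 * (W @ x + b) + x

# A real ranker combines the cross output with a deep MLP tower
# and a sigmoid head; a random projection stands in here.
logit = float(rng.standard_normal(dim) @ x)
engagement_prob = 1.0 / (1.0 + np.exp(-logit))
print(f"Cross output dim: {x.shape[0]}, p(engage) = {engagement_prob:.3f}")
```

Each cross layer adds one degree of explicit feature interaction, which is why a handful of layers suffices.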
Feature requirements. The ranker needs features from multiple sources:
| Feature Group | Source | Latency |
|---|---|---|
| User profile (demographics, tenure) | Feature store (batch) | ~2ms |
| User real-time activity (last 10 interactions) | Feature store (streaming) | ~5ms |
| Item embeddings (from two-tower model) | Feature store (batch) | ~2ms |
| Item statistics (CTR, completion rate) | Feature store (batch) | ~2ms |
| Context (time, device, session length) | Request metadata | ~0ms |
Feature retrieval runs in parallel with candidate retrieval — this is a key architectural optimization. By the time candidates are ready for scoring, their features are already fetched.
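This parallelism can be sketched with a thread pool; the fetch functions, their return values, and their latencies below are simulated stand-ins:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch_features(user_id: int) -> dict:
    time.sleep(0.011)   # simulate ~11ms of feature store round trips
    return {"user_id": user_id, "tenure_days": 420}

def retrieve_candidates(user_id: int) -> list:
    time.sleep(0.025)   # simulate ~25ms of multi-source retrieval
    return list(range(500))

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=2) as pool:
    feat_future = pool.submit(fetch_features, 7)
    cand_future = pool.submit(retrieve_candidates, 7)
    features, candidates = feat_future.result(), cand_future.result()
elapsed_ms = (time.perf_counter() - start) * 1000

# Wall time ≈ max(11, 25)ms, not the 36ms a serial pipeline would take
print(f"Fetched {len(candidates)} candidates + features in {elapsed_ms:.0f}ms")
```

The saving is the cheaper of the two branches; the slower branch defines the stage's latency.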
Stage 3: Re-ranking
The re-ranking stage takes the top 50 scored items and applies non-ML business logic to produce the final 20 displayed recommendations. Re-ranking is where the system enforces constraints that the ML model does not — and should not — encode:
- Diversity. No more than 3 items from the same category in the top 10.
- Freshness. At least 2 items published in the last 24 hours.
- Content policy. Remove items flagged by the trust & safety system.
- Business rules. Boost promoted content; suppress items the user has already seen.
- Fairness. Ensure minimum exposure for content from underrepresented creators (Chapter 31).
Re-ranking is deliberately separate from ranking because business rules change faster than ML models, and entangling them creates a maintenance nightmare.
from dataclasses import dataclass
from typing import Dict, List, Optional, Set
from datetime import datetime, timedelta
@dataclass
class RankedItem:
"""An item with its ranking model score and metadata.
Attributes:
item_id: Unique item identifier.
score: Ranking model predicted engagement probability.
category: Content category.
creator_id: Content creator identifier.
published_at: Publication timestamp.
is_promoted: Whether item is promoted content.
is_flagged: Whether item is flagged by trust & safety.
"""
item_id: int
score: float
category: str
creator_id: str
published_at: datetime
is_promoted: bool = False
is_flagged: bool = False
@dataclass
class ReRankingConfig:
"""Configuration for re-ranking business rules.
Attributes:
max_per_category: Maximum items from a single category in output.
min_fresh_items: Minimum items published within freshness_window.
freshness_window_hours: How recent an item must be to count as fresh.
output_size: Number of items to return.
promoted_boost: Score multiplier for promoted content.
"""
max_per_category: int = 3
min_fresh_items: int = 2
freshness_window_hours: int = 24
output_size: int = 20
promoted_boost: float = 1.2
def rerank(
items: List[RankedItem],
config: ReRankingConfig,
seen_item_ids: Set[int],
now: Optional[datetime] = None,
) -> List[RankedItem]:
"""Apply business rules to produce final recommendation list.
Enforces diversity, freshness, content policy, and business rules.
Items are selected greedily: at each position, pick the highest-scoring
item that does not violate any constraint.
Args:
items: Ranked items from the scoring stage (descending by score).
config: Re-ranking configuration.
seen_item_ids: Items the user has already seen (to suppress).
now: Current timestamp (for freshness calculation).
Returns:
Final list of items to display.
"""
if now is None:
now = datetime.now()
freshness_cutoff = now - timedelta(hours=config.freshness_window_hours)
# Apply score adjustments
adjusted_items = []
for item in items:
if item.is_flagged:
continue # Hard filter: never show flagged items
if item.item_id in seen_item_ids:
continue # Suppress already-seen items
adjusted_score = item.score
if item.is_promoted:
adjusted_score *= config.promoted_boost
adjusted_items.append((item, adjusted_score))
# Sort by adjusted score
adjusted_items.sort(key=lambda x: x[1], reverse=True)
# Greedy selection with constraints
selected: List[RankedItem] = []
category_counts: Dict[str, int] = {}
fresh_count = 0
for item, adj_score in adjusted_items:
if len(selected) >= config.output_size:
break
# Check category diversity constraint
cat_count = category_counts.get(item.category, 0)
if cat_count >= config.max_per_category:
continue
selected.append(item)
category_counts[item.category] = cat_count + 1
if item.published_at >= freshness_cutoff:
fresh_count += 1
# Post-check: if freshness requirement not met, swap in fresh items
if fresh_count < config.min_fresh_items:
# Find fresh items not yet selected
selected_ids = {item.item_id for item in selected}
fresh_pool = [
(item, score)
for item, score in adjusted_items
if item.published_at >= freshness_cutoff
and item.item_id not in selected_ids
]
# Replace lowest-scored non-fresh items
needed = config.min_fresh_items - fresh_count
non_fresh_indices = [
i for i, item in enumerate(selected)
if item.published_at < freshness_cutoff
]
for idx in reversed(non_fresh_indices[-needed:]):
if fresh_pool:
fresh_item, _ = fresh_pool.pop(0)
selected[idx] = fresh_item
return selected
# Example: re-rank 50 items into 20
now = datetime(2026, 3, 25, 14, 0, 0)
sample_items = [
RankedItem(
item_id=i,
score=1.0 - i * 0.015,
category=f"cat_{i % 5}",
creator_id=f"creator_{i % 20}",
published_at=now - timedelta(hours=i * 2),
is_promoted=(i == 3),
is_flagged=(i == 7),
)
for i in range(50)
]
config = ReRankingConfig()
final = rerank(sample_items, config, seen_item_ids={1, 15, 30}, now=now)
print(f"Final list: {len(final)} items")
print(f"Categories: {[item.category for item in final]}")
print(f"Fresh items: {sum(1 for item in final if item.published_at >= now - timedelta(hours=24))}")
print(f"Flagged items: {sum(1 for item in final if item.is_flagged)}")
Final list: 15 items
Categories: ['cat_3', 'cat_0', 'cat_2', 'cat_4', 'cat_0', 'cat_1', 'cat_3', 'cat_4', 'cat_0', 'cat_1', 'cat_2', 'cat_3', 'cat_4', 'cat_1', 'cat_2']
Fresh items: 11
Flagged items: 0
Note that the output stops at 15 items, short of the configured output_size of 20: with only five categories and a cap of three per category, the diversity constraint saturates the list at 15. Constraint interactions like this are exactly why re-ranking logic needs its own tests, independent of the ML model.
24.4 Serving Patterns: Batch, Real-Time, and Near-Real-Time
The choice of serving pattern — how and when predictions are computed — is one of the most consequential architectural decisions in an ML system. It determines latency, cost, freshness, and the set of features available at prediction time.
Batch Serving
In batch serving, predictions are pre-computed for all users (or all relevant entities) on a schedule — typically daily or hourly. Results are stored in a key-value store and looked up at serving time.
Mechanics. A batch job iterates over all active users, retrieves their features, scores all candidate items (or a relevant subset), and writes the top-$k$ recommendations to a database:
from dataclasses import dataclass
from typing import List, Dict
from datetime import datetime
@dataclass
class BatchPredictionJob:
"""Specification for a batch prediction run.
Attributes:
model_version: Identifier for the model artifact.
feature_snapshot_time: Timestamp of the feature snapshot used.
user_segment: Which user segment to process (or "all").
top_k: Number of recommendations to pre-compute per user.
output_table: Destination for pre-computed recommendations.
"""
model_version: str
feature_snapshot_time: datetime
user_segment: str = "all"
top_k: int = 50
output_table: str = "recommendations.batch_predictions"
def compute_staleness(self, request_time: datetime) -> float:
"""Compute how stale the pre-computed predictions are.
Args:
request_time: When the user makes the request.
Returns:
Staleness in hours.
"""
delta = request_time - self.feature_snapshot_time
return delta.total_seconds() / 3600.0
# If the batch runs at 2am and a user requests at 8pm, predictions
# are 18 hours stale.
batch_job = BatchPredictionJob(
model_version="dcn_v2_20260324",
feature_snapshot_time=datetime(2026, 3, 24, 2, 0, 0),
)
staleness = batch_job.compute_staleness(datetime(2026, 3, 24, 20, 0, 0))
print(f"Prediction staleness: {staleness:.0f} hours")
Prediction staleness: 18 hours
Advantages. Simple infrastructure (no model serving endpoint needed). Can use arbitrarily expensive models (no latency constraint). Easy to validate predictions before they reach users.
Disadvantages. Stale predictions that do not reflect recent behavior. Cannot personalize based on session context (time of day, device, session history). Wasteful: computes predictions for users who may never log in.
When to use batch serving. When the prediction does not change based on real-time context (e.g., daily email recommendations), when the item corpus changes slowly, or when the computation is too expensive for real-time serving (e.g., a model that takes 10 seconds per user).
Real-Time Serving
In real-time serving, predictions are computed on-the-fly for each request. The model runs as a service, accepting feature vectors and returning predictions.
Mechanics. A model serving framework (TensorFlow Serving, Triton Inference Server, or a custom service) loads the model into memory, receives prediction requests over gRPC or HTTP, runs inference on GPU or CPU, and returns results.
Latency anatomy. Every millisecond matters. Here is a typical breakdown for a real-time recommendation request:
| Step | p50 | p95 | p99 |
|---|---|---|---|
| Network (client → API gateway) | 5ms | 10ms | 25ms |
| Feature store lookup | 8ms | 15ms | 35ms |
| Candidate retrieval (FAISS ANN) | 12ms | 22ms | 40ms |
| Ranking model inference (500 items) | 35ms | 55ms | 80ms |
| Re-ranking | 5ms | 8ms | 12ms |
| Network (API gateway → client) | 5ms | 10ms | 25ms |
| Total | 70ms | 120ms | 217ms |
The p99 latency (217ms) slightly exceeds the 200ms budget. This is typical: the p50 is comfortable, but the tail distribution is what breaks SLAs. Managing p99 latency requires:
- Hedged requests. Send the feature store lookup to two replicas; use whichever responds first.
- Timeouts. If candidate retrieval exceeds 30ms, return results from whichever sources have responded.
- Caching. Cache feature store results for recently active users (reduces p99 from 35ms to 15ms).
- Model optimization. Quantize the ranking model (Chapter 26) to reduce inference time by 2-3x.
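The first of these tactics, hedged requests, can be sketched as follows; the replica lookup function and its latencies are simulated stand-ins:

```python
import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

random.seed(1)

def lookup_replica(replica: str, user_id: int) -> dict:
    """Simulated feature store replica with 5-40ms of latency."""
    time.sleep(random.uniform(0.005, 0.040))
    return {"replica": replica, "user_id": user_id}

# Send the same lookup to two replicas; use whichever responds first
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [
        pool.submit(lookup_replica, name, 7)
        for name in ("replica_a", "replica_b")
    ]
    done, pending = wait(futures, return_when=FIRST_COMPLETED)
    winner = next(iter(done)).result()
    for f in pending:
        f.cancel()   # best effort: abandon the slower call

print(f"Served from {winner['replica']}")
```

Hedging trades extra load (up to 2x on the feature store) for a p99 that tracks the faster replica; production systems often hedge only after a delay threshold to cap the overhead.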
Advantages. Fresh predictions using the latest user behavior. Can incorporate real-time context. No wasted computation.
Disadvantages. Requires model serving infrastructure with GPU/CPU provisioning. Latency constraints limit model complexity. Requires a feature store that serves features in real-time.
Near-Real-Time Serving
Near-real-time serving is a hybrid: a streaming system processes events (user clicks, item uploads, feature updates) with low latency (seconds to minutes) and updates a cache or feature store that the serving layer reads from.
Example: session-aware recommendations. When a user starts a session, the system computes fresh embeddings from their last 10 interactions (near-real-time feature update). When the user requests recommendations, the serving layer uses these fresh embeddings with a pre-built candidate index.
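A minimal sketch of this pattern: a rolling embedding over the user's last 10 interactions, updated as events stream in. The dimensions and the mean-pooling update rule are illustrative assumptions:

```python
from collections import deque

import numpy as np

class SessionEmbedding:
    """Near-real-time session feature: rolling mean of recent
    interaction embeddings, updated by the stream processor and
    read by the serving layer."""

    def __init__(self, dim: int = 128, window: int = 10):
        self.window = deque(maxlen=window)  # keeps only the last N
        self.dim = dim

    def update(self, item_embedding: np.ndarray) -> None:
        """Consume one interaction event from the stream."""
        self.window.append(item_embedding)

    def current(self) -> np.ndarray:
        """Fresh session embedding for the serving layer to read."""
        if not self.window:
            return np.zeros(self.dim)
        return np.mean(self.window, axis=0)

rng = np.random.default_rng(3)
session = SessionEmbedding()
for _ in range(25):          # 25 events arrive; only the last 10 are kept
    session.update(rng.standard_normal(128))
print(f"Session embedding dim: {session.current().shape[0]}, "
      f"window size: {len(session.window)}")
```

The key property is that the expensive part (the update) happens off the request path, so the serving layer pays only a cache read.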
from typing import Dict, List
class ServingPattern:
"""Enumeration of serving patterns with their characteristics."""
@staticmethod
def comparison_table() -> List[Dict[str, str]]:
"""Return a comparison of serving patterns.
Returns:
List of dictionaries describing each pattern.
"""
return [
{
"pattern": "Batch",
"latency": "N/A (pre-computed)",
"freshness": "Hours to days",
"cost_per_prediction": "Low (amortized)",
"infrastructure": "Batch compute (Spark, etc.)",
"best_for": "Stable preferences, email, notifications",
},
{
"pattern": "Real-time",
"latency": "50-200ms",
"freshness": "Instant",
"cost_per_prediction": "High (GPU serving)",
"infrastructure": "Model server, feature store, GPU cluster",
"best_for": "Interactive UX, session-dependent ranking",
},
{
"pattern": "Near-real-time",
"latency": "1-60 seconds (feature update)",
"freshness": "Seconds to minutes",
"cost_per_prediction": "Medium",
"infrastructure": "Stream processor, feature store",
"best_for": "Session features, trending signals, fraud",
},
]
for pattern in ServingPattern.comparison_table():
print(f"\n{pattern['pattern']} Serving:")
print(f" Latency: {pattern['latency']}")
print(f" Freshness: {pattern['freshness']}")
print(f" Best for: {pattern['best_for']}")
Batch Serving:
Latency: N/A (pre-computed)
Freshness: Hours to days
Best for: Stable preferences, email, notifications
Real-time Serving:
Latency: 50-200ms
Freshness: Instant
Best for: Interactive UX, session-dependent ranking
Near-real-time Serving:
Latency: 1-60 seconds (feature update)
Freshness: Seconds to minutes
Best for: Session features, trending signals, fraud
Simplest Model That Works: The correct serving pattern is the simplest one that meets the product requirements. If batch serving with a daily update provides 95% of the engagement value of real-time serving, the engineering cost of real-time infrastructure may not be justified. Netflix's home page uses batch-computed recommendations for most rows, with a single real-time "Continue Watching" row. The sophisticated pattern is reserved for the cases where it matters.
24.5 Feature Stores: The Bridge Between Training and Serving
The feature store is the component that ensures the features used during model training are identical to the features available during model serving. This consistency requirement sounds trivial. Violating it is, in practice, the single most common source of production ML failures.
The Training-Serving Skew Problem
Training-serving skew occurs when the feature values a model sees at prediction time differ systematically from the feature values it saw during training. Common causes:
- Implementation mismatch. The training pipeline computes user_avg_session_length using a SQL query on a data warehouse; the serving pipeline computes it using an in-memory cache that handles null values differently.
- Temporal leakage. During training, features include information from the future (e.g., the feature "number of items user watched this week" includes data from after the prediction point). At serving time, this future information is unavailable.
- Stale features. Training uses the latest feature values, but serving uses a cache that has not been refreshed.
- Missing features. A new user has no historical features. The training pipeline drops these rows; the serving pipeline must handle them.
The insidious aspect of training-serving skew is that it produces no errors — the model still outputs predictions, and those predictions look plausible. The model simply performs worse than expected, and the gap between offline evaluation (which uses correctly computed features) and online performance (which uses skewed features) grows.
Feature Store Architecture
A feature store has two planes:
The offline store serves the training pipeline. It stores historical feature values keyed by (entity_id, timestamp), enabling point-in-time correct feature retrieval. When you train a model on data from six months ago, the offline store provides the feature values as they existed at that time, preventing temporal leakage.
The online store serves the prediction pipeline. It stores the current feature values for each entity in a low-latency key-value store (Redis, DynamoDB, Bigtable). When a serving request arrives, the online store returns the latest features in single-digit milliseconds.
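The online read path can be sketched as follows. A plain dict stands in for the key-value store, and the default-filling behavior is what keeps serving robust for entities (e.g., brand-new users) with no stored features:

```python
from typing import Any, Dict

def get_online_features(
    online_store: Dict[str, Dict[str, Any]],  # entity_id -> {feature: value}
    defaults: Dict[str, Any],                 # feature -> registered default
    entity_id: str,
) -> Dict[str, Any]:
    """Read one entity's features, filling any gaps with registered defaults."""
    row = online_store.get(entity_id, {})
    return {name: row.get(name, default) for name, default in defaults.items()}

store = {"user_1": {"user_avg_session_length_min": 8.5}}
defaults = {"user_avg_session_length_min": 12.0, "user_last_10_interactions": []}
print(get_online_features(store, defaults, "user_1"))
print(get_online_features(store, defaults, "new_user"))  # new user: all defaults
```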
from dataclasses import dataclass, field
from typing import Any, Dict, List
from enum import Enum
class FeatureType(Enum):
"""Classification of features by computation pattern."""
BATCH = "batch" # Computed in batch pipelines (daily/hourly)
STREAMING = "streaming" # Updated by stream processors (seconds)
ON_DEMAND = "on_demand" # Computed at request time
@dataclass
class FeatureDefinition:
"""Schema for a single feature in the feature store.
Attributes:
name: Feature name (must be unique within entity type).
entity_type: The entity this feature describes (user, item, etc.).
dtype: Data type (float, int, string, list[float], etc.).
feature_type: How the feature is computed and updated.
default_value: Value to use when the feature is missing.
description: Human-readable description.
owner: Team or person responsible.
ttl_hours: Time-to-live in the online store (0 = no expiry).
"""
name: str
entity_type: str
dtype: str
feature_type: FeatureType
default_value: Any = None
description: str = ""
owner: str = ""
ttl_hours: int = 0
@dataclass
class FeatureStoreSchema:
"""Schema registry for all features in the feature store.
Provides a single source of truth for feature definitions,
ensuring training and serving use the same feature semantics.
Attributes:
features: Dictionary mapping feature name to definition.
"""
features: Dict[str, FeatureDefinition] = field(default_factory=dict)
def register(self, feature: FeatureDefinition) -> None:
"""Register a feature definition.
Args:
feature: Feature to register.
Raises:
ValueError: If feature name is already registered.
"""
if feature.name in self.features:
raise ValueError(
f"Feature '{feature.name}' is already registered. "
f"Use update() to modify existing features."
)
self.features[feature.name] = feature
def get_features_by_entity(
self, entity_type: str
) -> List[FeatureDefinition]:
"""Get all features for an entity type.
Args:
entity_type: Entity type to filter by.
Returns:
List of feature definitions for the entity type.
"""
return [
f for f in self.features.values()
if f.entity_type == entity_type
]
def get_features_by_type(
self, feature_type: FeatureType
) -> List[FeatureDefinition]:
"""Get all features of a given computation type.
Args:
feature_type: Computation pattern to filter by.
Returns:
List of matching feature definitions.
"""
return [
f for f in self.features.values()
if f.feature_type == feature_type
]
def consistency_report(self) -> Dict[str, Any]:
"""Report potential consistency issues.
Checks for features without defaults (risky in serving),
features without owners (maintenance risk), and streaming
features without TTL (staleness risk).
Returns:
Dictionary of potential issues.
"""
issues = {
"no_default": [
f.name for f in self.features.values()
if f.default_value is None
],
"no_owner": [
f.name for f in self.features.values()
if not f.owner
],
"streaming_no_ttl": [
f.name for f in self.features.values()
if f.feature_type == FeatureType.STREAMING
and f.ttl_hours == 0
],
}
return issues
# Define StreamRec's feature store schema
schema = FeatureStoreSchema()
# User features
schema.register(FeatureDefinition(
name="user_embedding_128d",
entity_type="user",
dtype="list[float]",
feature_type=FeatureType.BATCH,
default_value=[0.0] * 128,
description="128-dim user embedding from two-tower model",
owner="rec-ml",
ttl_hours=48,
))
schema.register(FeatureDefinition(
name="user_avg_session_length_min",
entity_type="user",
dtype="float",
feature_type=FeatureType.BATCH,
default_value=12.0, # Population median
description="Average session length in minutes (30-day window)",
owner="rec-ml",
ttl_hours=48,
))
schema.register(FeatureDefinition(
name="user_last_10_interactions",
entity_type="user",
dtype="list[int]",
feature_type=FeatureType.STREAMING,
default_value=[],
description="Last 10 item IDs interacted with (real-time)",
owner="rec-platform",
ttl_hours=24,
))
schema.register(FeatureDefinition(
name="user_category_preferences",
entity_type="user",
dtype="dict[str, float]",
feature_type=FeatureType.BATCH,
default_value={},
description="Bayesian posterior means per category (Ch. 20)",
owner="rec-ml",
ttl_hours=48,
))
# Item features
schema.register(FeatureDefinition(
name="item_embedding_128d",
entity_type="item",
dtype="list[float]",
feature_type=FeatureType.BATCH,
default_value=[0.0] * 128,
description="128-dim item embedding from two-tower model",
owner="rec-ml",
ttl_hours=168,
))
schema.register(FeatureDefinition(
name="item_ctr_7d",
entity_type="item",
dtype="float",
feature_type=FeatureType.BATCH,
default_value=0.02, # Global average CTR
description="Click-through rate over the last 7 days",
owner="rec-platform",
ttl_hours=24,
))
schema.register(FeatureDefinition(
name="item_trending_score",
entity_type="item",
dtype="float",
feature_type=FeatureType.STREAMING,
default_value=0.0,
description="Real-time trending score (engagement velocity)",
owner="rec-platform",
ttl_hours=6,
))
# Report
print("Feature store schema:")
print(f" Total features: {len(schema.features)}")
for entity in ["user", "item"]:
features = schema.get_features_by_entity(entity)
print(f" {entity} features: {len(features)}")
print(f"\nBy computation type:")
for ft in FeatureType:
features = schema.get_features_by_type(ft)
print(f" {ft.value}: {len(features)}")
issues = schema.consistency_report()
print(f"\nConsistency issues:")
for issue_type, names in issues.items():
print(f" {issue_type}: {names if names else 'None'}")
Feature store schema:
Total features: 7
user features: 4
item features: 3
By computation type:
batch: 5
streaming: 2
on_demand: 0
Consistency issues:
no_default: None
no_owner: None
streaming_no_ttl: None
Online-Offline Consistency
The critical guarantee of a feature store is online-offline consistency: for any entity at any point in time, the online store and the offline store must return the same feature values (or as close as the system's latency allows). This is achieved by:
- Single feature computation path. The batch pipeline that writes to the offline store also writes to the online store (or a shared intermediate layer writes to both).
- Backfill. When a new feature is added, it is backfilled in the offline store so that historical training data has the feature computed correctly.
- Point-in-time joins. The training pipeline performs temporal joins that retrieve feature values as of the training example's timestamp — never using future data.
Without these guarantees, the model trains on one version of reality and serves in another. The result is training-serving skew.
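The point-in-time retrieval primitive behind such joins can be sketched as a binary search over a timestamp-sorted write history (a simplification of what a real offline store does at scale):

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

def point_in_time_value(
    history: List[Tuple[int, float]],  # (write_timestamp, value), sorted ascending
    as_of: int,
) -> Optional[float]:
    """Return the latest feature value written at or before `as_of`.

    Restricting the lookup to writes with timestamp <= as_of is exactly what
    prevents temporal leakage: the training example sees only what the
    online store would have held at that moment.
    """
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    if idx == 0:
        return None  # feature did not exist yet; fall back to the default
    return history[idx - 1][1]

history = [(100, 0.2), (200, 0.5)]  # feature written at t=100 and t=200
print(point_in_time_value(history, as_of=150))  # 0.2 -- not the future 0.5
print(point_in_time_value(history, as_of=50))   # None
```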
24.6 Training-Serving Skew: Detection and Prevention
Training-serving skew deserves its own section because it is both the most common production ML failure and the hardest to detect. Unlike a crashed server or a timeout error, skew produces plausible predictions — just slightly worse ones.
Categories of Skew
| Category | Cause | Detection | Prevention |
|---|---|---|---|
| Feature skew | Different code computes features in training vs. serving | Compare feature distributions offline vs. online | Use the feature store for both |
| Data skew | Training data distribution differs from serving distribution | Monitor input feature distributions over time | Retrain on recent data; detect drift (Ch. 30) |
| Label skew | Labels computed differently in training vs. evaluation | Audit label pipelines; compare online/offline metrics | Single label definition source |
| Temporal skew | Training includes future information unavailable at serve time | Point-in-time correctness checks | Strict point-in-time joins in the feature store |
Skew Detection in Practice
The most effective skew detector is simple: compute summary statistics of each feature during training and compare them to the same statistics during serving.
from dataclasses import dataclass
from typing import Any, Dict
@dataclass
class FeatureStatistics:
"""Summary statistics for a single feature.
Attributes:
name: Feature name.
mean: Mean value.
std: Standard deviation.
p1: 1st percentile.
p50: Median.
p99: 99th percentile.
null_fraction: Fraction of null/missing values.
n_samples: Number of samples used to compute statistics.
"""
name: str
mean: float
std: float
p1: float
p50: float
p99: float
null_fraction: float
n_samples: int
def detect_skew(
training_stats: FeatureStatistics,
serving_stats: FeatureStatistics,
mean_threshold: float = 0.5,
std_threshold: float = 0.5,
null_threshold: float = 0.05,
) -> Dict[str, Any]:
"""Detect training-serving skew for a single feature.
Compares summary statistics from the training distribution and the
serving distribution. Flags features where the relative difference
exceeds configurable thresholds.
Args:
training_stats: Statistics from the training dataset.
serving_stats: Statistics from recent serving requests.
mean_threshold: Relative mean difference threshold (in std units).
std_threshold: Relative std difference threshold.
null_threshold: Absolute null fraction difference threshold.
Returns:
Dictionary with skew detection results.
"""
results = {"feature": training_stats.name, "alerts": []}
# Mean shift (in units of training std)
if training_stats.std > 1e-8:
mean_shift = abs(
training_stats.mean - serving_stats.mean
) / training_stats.std
if mean_shift > mean_threshold:
results["alerts"].append(
f"MEAN_SHIFT: {mean_shift:.2f} std "
f"(train={training_stats.mean:.4f}, "
f"serve={serving_stats.mean:.4f})"
)
# Std change
if training_stats.std > 1e-8:
std_ratio = serving_stats.std / training_stats.std
if abs(std_ratio - 1.0) > std_threshold:
results["alerts"].append(
f"STD_CHANGE: ratio={std_ratio:.2f} "
f"(train={training_stats.std:.4f}, "
f"serve={serving_stats.std:.4f})"
)
# Null fraction change
null_diff = abs(
training_stats.null_fraction - serving_stats.null_fraction
)
if null_diff > null_threshold:
results["alerts"].append(
f"NULL_RATE_CHANGE: diff={null_diff:.3f} "
f"(train={training_stats.null_fraction:.3f}, "
f"serve={serving_stats.null_fraction:.3f})"
)
results["is_skewed"] = len(results["alerts"]) > 0
return results
# Example: detect skew in StreamRec's user_avg_session_length feature
train_stats = FeatureStatistics(
name="user_avg_session_length_min",
mean=12.3, std=5.1, p1=1.2, p50=11.0, p99=32.0,
null_fraction=0.001, n_samples=5_000_000,
)
serve_stats = FeatureStatistics(
name="user_avg_session_length_min",
mean=14.8, std=6.3, p1=1.5, p50=13.2, p99=38.0,
null_fraction=0.015, n_samples=100_000,
)
# Pass tighter-than-default thresholds so all three alert types fire
skew_result = detect_skew(
    train_stats, serve_stats,
    mean_threshold=0.25, std_threshold=0.2, null_threshold=0.01,
)
print(f"Feature: {skew_result['feature']}")
print(f"Skewed: {skew_result['is_skewed']}")
for alert in skew_result["alerts"]:
print(f" {alert}")
Feature: user_avg_session_length_min
Skewed: True
MEAN_SHIFT: 0.49 std (train=12.3000, serve=14.8000)
STD_CHANGE: ratio=1.24 (train=5.1000, serve=6.3000)
NULL_RATE_CHANGE: diff=0.014 (train=0.001, serve=0.015)
In this example, the mean has shifted by 0.49 standard deviations (borderline), the standard deviation has increased by 24% (notable), and the null rate has increased 15x from 0.1% to 1.5% (a clear issue). The null rate increase might indicate a data pipeline problem: perhaps a new user cohort is missing session length data because their sessions are tracked by a different logging system.
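The training-side statistics that feed this comparison might be computed as in the following sketch, which assumes feature values arrive as a NumPy array with NaN marking missing values and returns a plain dict mirroring the FeatureStatistics fields:

```python
import numpy as np

def compute_feature_statistics(name: str, values: np.ndarray) -> dict:
    """Summarize one feature column; NaN marks missing values."""
    present = values[~np.isnan(values)]
    return {
        "name": name,
        "mean": float(present.mean()),
        "std": float(present.std()),
        "p1": float(np.percentile(present, 1)),
        "p50": float(np.percentile(present, 50)),
        "p99": float(np.percentile(present, 99)),
        "null_fraction": float(np.isnan(values).mean()),
        "n_samples": int(values.size),
    }

rng = np.random.default_rng(0)
values = rng.normal(loc=12.0, scale=5.0, size=10_000)
values[:100] = np.nan  # simulate 1% missing
stats = compute_feature_statistics("user_avg_session_length_min", values)
```

In practice the serving-side statistics would be computed over a sliding window of recent requests by the monitoring pipeline, using exactly this function so that the two sides are comparable.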
24.7 Designing for Reliability
ML systems fail. The question is not whether but how — and whether the failure is graceful or catastrophic.
Failure Taxonomy
| Failure Type | Example | Impact | Mitigation |
|---|---|---|---|
| Model server crash | OOM on a large batch | No predictions | Load balancer routes to healthy replicas |
| Feature store timeout | Redis cluster slow | Missing features → garbage predictions | Default values; fallback to batch features |
| Stale model | Retraining pipeline fails for 3 days | Predictions degrade as data drifts | Monitor model age; alert if beyond threshold |
| Data pipeline failure | Schema change in upstream source | Features computed incorrectly or not at all | Data contracts; schema validation (Ch. 28) |
| Cascading failure | Retrieval timeout → ranking overloaded | Entire system unavailable | Circuit breakers; backpressure |
| Silent degradation | Feature drift; no alerts triggered | Model performance degrades without detection | Comprehensive monitoring (Ch. 30) |
Graceful Degradation
The principle of graceful degradation says: when a component fails, the system should fall back to a simpler but still useful behavior, not fail entirely.
For StreamRec, the degradation hierarchy is:
- Full system healthy. Multi-source retrieval → deep ranking → re-ranking → personalized recommendations.
- Real-time features unavailable. Fall back to batch features. Recommendations are slightly stale but still personalized.
- Ranking model unavailable. Use retrieval scores directly (two-tower embedding similarity). Quality drops but the system serves.
- Feature store unavailable. Use a popularity-based fallback: return the globally most popular items. Not personalized, but better than an error page.
- Complete outage. Return a static cached response from the last successful serving.
from dataclasses import dataclass
from typing import Dict, List, Optional
from enum import Enum
class DegradationLevel(Enum):
"""Levels of graceful degradation, from best to worst."""
FULL = "full" # All systems nominal
STALE_FEATURES = "stale_features" # Real-time features unavailable
SIMPLE_RANKING = "simple_ranking" # Ranking model down
POPULARITY = "popularity" # Feature store down
CACHED = "cached" # Complete outage, serve cache
ERROR = "error" # Nothing works
@dataclass
class FallbackChain:
"""Ordered chain of fallback strategies for recommendation serving.
Each level is tried in order. If a level's health check passes,
its serving strategy is used. If not, the system falls to the next level.
Attributes:
levels: Ordered list of (DegradationLevel, health_check, strategy).
current_level: The currently active degradation level.
"""
    levels: Optional[List[tuple]] = None
    current_level: DegradationLevel = DegradationLevel.FULL
def evaluate(
self, health_checks: Dict[str, bool]
) -> DegradationLevel:
"""Determine the current degradation level based on health checks.
Args:
health_checks: Dictionary mapping component name to health status.
Returns:
The highest functioning degradation level.
"""
        # Check the most severe failures first so they take precedence
        # when multiple components are down at once.
        if all(health_checks.values()):
            return DegradationLevel.FULL
        if not health_checks.get("retrieval", True):
            return DegradationLevel.CACHED
        if not health_checks.get("feature_store", True):
            return DegradationLevel.POPULARITY
        if not health_checks.get("ranking_model", True):
            return DegradationLevel.SIMPLE_RANKING
        return DegradationLevel.STALE_FEATURES
def log_degradation(
self, previous: DegradationLevel, current: DegradationLevel
) -> Optional[str]:
"""Generate alert message when degradation level changes.
Args:
previous: Previous degradation level.
current: New degradation level.
Returns:
Alert message if degraded, None if improved or unchanged.
"""
level_order = list(DegradationLevel)
prev_idx = level_order.index(previous)
curr_idx = level_order.index(current)
if curr_idx > prev_idx:
return (
f"DEGRADATION: {previous.value} -> {current.value}. "
f"Service quality reduced."
)
elif curr_idx < prev_idx:
return (
f"RECOVERY: {previous.value} -> {current.value}. "
f"Service quality improved."
)
return None
# Example: streaming features go down
fallback = FallbackChain()
health_all_good = {
"streaming_features": True,
"batch_features": True,
"ranking_model": True,
"feature_store": True,
"retrieval": True,
}
health_streaming_down = {
"streaming_features": False,
"batch_features": True,
"ranking_model": True,
"feature_store": True,
"retrieval": True,
}
health_ranking_down = {
"streaming_features": False,
"batch_features": True,
"ranking_model": False,
"feature_store": True,
"retrieval": True,
}
for scenario_name, health in [
("All healthy", health_all_good),
("Streaming down", health_streaming_down),
("Streaming + ranking down", health_ranking_down),
]:
level = fallback.evaluate(health)
print(f"{scenario_name}: {level.value}")
All healthy: full
Streaming down: stale_features
Streaming + ranking down: simple_ranking
Circuit Breakers
A circuit breaker is a pattern borrowed from electrical engineering: when a downstream service fails repeatedly, the circuit "opens" and stops sending requests, giving the failing service time to recover instead of being overwhelmed with requests that will fail anyway.
For ML systems, circuit breakers are essential between the serving layer and the feature store, between the serving layer and the model server, and between the API gateway and the serving layer. Without circuit breakers, a slow feature store can cause the serving layer to accumulate pending requests, exhaust its thread pool, and crash — turning a feature store slowdown into a complete system outage.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Any, Dict, Optional
@dataclass
class CircuitBreaker:
"""Circuit breaker for a downstream ML system component.
States:
CLOSED: Normal operation. Requests pass through.
OPEN: Circuit tripped. Requests fail fast (fallback used).
HALF_OPEN: Testing recovery. Limited requests pass through.
Attributes:
name: Component name this circuit breaker protects.
failure_threshold: Failures before circuit opens.
recovery_timeout_seconds: How long to wait before testing recovery.
half_open_max_requests: Requests allowed in HALF_OPEN state.
failure_count: Current consecutive failure count.
state: Current circuit state.
last_failure_time: Timestamp of last failure.
"""
name: str
failure_threshold: int = 5
recovery_timeout_seconds: float = 30.0
half_open_max_requests: int = 3
failure_count: int = 0
state: str = "CLOSED"
last_failure_time: Optional[datetime] = None
_half_open_attempts: int = 0
def record_success(self) -> None:
"""Record a successful request. Resets failure count."""
self.failure_count = 0
if self.state == "HALF_OPEN":
self._half_open_attempts += 1
if self._half_open_attempts >= self.half_open_max_requests:
self.state = "CLOSED"
self._half_open_attempts = 0
    def record_failure(self, now: Optional[datetime] = None) -> None:
        """Record a failed request. May trip the circuit.

        Args:
            now: Current timestamp.
        """
        if now is None:
            now = datetime.now()
        self.failure_count += 1
        self.last_failure_time = now
        # A failure while probing recovery reopens the circuit immediately.
        if self.state == "HALF_OPEN":
            self.state = "OPEN"
        elif self.failure_count >= self.failure_threshold:
            self.state = "OPEN"
def should_allow_request(
self, now: Optional[datetime] = None
) -> bool:
"""Check whether a request should be allowed through.
Args:
now: Current timestamp.
Returns:
True if the request should proceed, False if it should
fail fast and use the fallback.
"""
if now is None:
now = datetime.now()
if self.state == "CLOSED":
return True
if self.state == "OPEN":
# Check if recovery timeout has elapsed
if self.last_failure_time is not None:
elapsed = (now - self.last_failure_time).total_seconds()
if elapsed >= self.recovery_timeout_seconds:
self.state = "HALF_OPEN"
self._half_open_attempts = 0
return True
return False
if self.state == "HALF_OPEN":
return self._half_open_attempts < self.half_open_max_requests
return False
    def status_report(self) -> Dict[str, Any]:
"""Report circuit breaker status for monitoring.
Returns:
Dictionary with circuit breaker state.
"""
return {
"component": self.name,
"state": self.state,
"failure_count": self.failure_count,
"threshold": self.failure_threshold,
}
# Simulate feature store degradation
cb = CircuitBreaker(name="feature_store", failure_threshold=3)
now = datetime(2026, 3, 25, 14, 0, 0)
print("Simulating feature store failures:")
for i in range(5):
allowed = cb.should_allow_request(now)
if allowed:
# Simulate failure
cb.record_failure(now)
print(f" Request {i+1}: allowed={allowed}, "
f"state={cb.state}, failures={cb.failure_count}")
else:
print(f" Request {i+1}: allowed={allowed}, "
f"state={cb.state} (fail fast, use fallback)")
now += timedelta(seconds=1)
# Wait for recovery timeout
now += timedelta(seconds=30)
allowed = cb.should_allow_request(now)
print(f"\nAfter 30s recovery timeout:")
print(f" Request: allowed={allowed}, state={cb.state}")
Simulating feature store failures:
Request 1: allowed=True, state=CLOSED, failures=1
Request 2: allowed=True, state=CLOSED, failures=2
Request 3: allowed=True, state=OPEN, failures=3
Request 4: allowed=False, state=OPEN (fail fast, use fallback)
Request 5: allowed=False, state=OPEN (fail fast, use fallback)
After 30s recovery timeout:
Request: allowed=True, state=HALF_OPEN
24.8 Architecture Decision Records (ADRs)
Every ML system involves dozens of design decisions: which serving pattern, which feature store technology, how many retrieval sources, what latency budget per stage, when to retrain, whether to use shadow mode before production. These decisions are made once, but their consequences persist for years. Six months later, when a new team member asks "why do we use batch serving for this model?" the answer is often "nobody remembers."
An Architecture Decision Record (ADR) is a short document that captures a single architectural decision: the context, the options considered, the decision made, and the consequences. ADRs create an institutional memory that prevents re-arguing settled questions, explains constraints to new team members, and surfaces assumptions that may have changed since the decision was made.
ADR Template
# ADR-{number}: {title}
## Status
{Proposed | Accepted | Deprecated | Superseded by ADR-XXX}
## Context
What is the issue? What forces are at play? What constraints exist?
## Decision
What is the decision that was made?
## Options Considered
1. Option A: [brief description]
- Pros: ...
- Cons: ...
2. Option B: [brief description]
- Pros: ...
- Cons: ...
3. Option C (chosen): [brief description]
- Pros: ...
- Cons: ...
## Consequences
What are the results of this decision? What trade-offs are accepted?
What new constraints does this create?
## Review Date
When should this decision be revisited? What conditions would trigger
a reconsideration?
Example ADR: StreamRec Serving Pattern
# ADR-007: Serving Pattern for StreamRec Home Page Recommendations
## Status
Accepted (2026-01-15)
## Context
StreamRec's home page requires personalized recommendations for
50 million active users. The product team requires:
- Sub-200ms end-to-end latency for the recommendation API.
- Recommendations that reflect the user's current session context
(e.g., recent interactions within the current session).
- Support for 10,000 requests per second at peak.
The team has 3 ML engineers and 2 backend engineers. Infrastructure
budget: $150K/month for serving.
## Decision
Use real-time serving with a multi-stage architecture:
1. Multi-source candidate retrieval (30ms budget)
2. Deep ranking model on GPU (50ms budget)
3. Rule-based re-ranking (20ms budget)
Batch fallback: pre-compute top-50 per user daily for use when
real-time serving is degraded.
## Options Considered
1. Pure batch serving
- Pros: Simple infrastructure, no GPU serving cost, easy validation
- Cons: Cannot use session features, 18-hour staleness, wastes
compute for inactive users
- Rejected: Session context is critical for engagement; A/B test
showed +12% engagement for session-aware vs. batch recommendations
2. Pure real-time serving (no batch fallback)
- Pros: Maximum freshness, no pre-computation waste
- Cons: No fallback during outages; error page shown to users
- Rejected: Availability requirement (99.9%) mandates a fallback
3. Hybrid real-time + batch fallback (chosen)
- Pros: Session-aware when healthy, graceful degradation when not
- Cons: Dual infrastructure cost; must maintain batch pipeline
- Accepted: Best balance of quality, availability, and cost
## Consequences
- Must maintain both real-time and batch serving infrastructure.
- GPU serving cluster adds ~$80K/month to infrastructure costs.
- Batch pipeline runs daily at 2am; must complete before 6am.
- Monitoring must track which serving mode (real-time vs. batch
fallback) is used per request for accurate A/B testing.
- Feature store must support both online (real-time) and offline
(batch) access patterns.
## Review Date
2026-07-15 (6 months). Reconsider if: (a) batch A/B test gap
narrows below +5%, (b) infrastructure costs exceed $200K/month,
(c) system availability drops below 99.5%.
Why ADRs Matter for ML Systems
ADRs are particularly important in ML because:
- ML decisions have non-obvious dependencies. The choice to use real-time serving implies the need for an online feature store, which implies the need for a streaming feature pipeline, which implies stream processing infrastructure. The ADR makes these downstream implications explicit.
- ML systems evolve through experimentation. A decision made based on an A/B test result ("+12% engagement for session-aware recommendations") should be re-evaluated if the test's conditions change (new model architecture, different user population, product redesign).
- ML teams rotate. The engineer who designed the serving architecture may leave the team. Without ADRs, the institutional knowledge leaves with them.
Write ADRs for every decision that would take more than one hour to reverse. If reversal is trivial, the decision does not need an ADR.
24.9 The Credit Scoring Anchor: Batch vs. Real-Time
The Content Platform Recommender demands real-time serving because session context drives engagement. But not all ML systems need real-time inference. The Credit Scoring anchor example illustrates the opposite end of the spectrum.
The Decision Context
A bank receives 10,000 credit applications per day. Each application requires a creditworthiness score. The score determines the interest rate offered to the applicant. Two serving patterns are viable:
Batch serving. Applications received during the day are scored overnight. Applicants receive their decision the next morning. The model has access to the full data warehouse: credit bureau reports, transaction history, employment verification, and alternative data sources (rent payments, utility bills). No latency constraint; the model can be a large ensemble of gradient-boosted trees, a neural network, and a logistic regression model, with model disagreement flagged for human review.
Real-time serving. Applications are scored within 30 seconds. Applicants receive an instant decision. The model has access to a subset of features: credit bureau score (via API call), application data (income, employment, address), and cached historical data. The latency constraint limits model complexity and feature breadth.
from dataclasses import dataclass
from typing import List
@dataclass
class ServingTradeoff:
"""Analysis of serving pattern trade-offs for a specific use case.
Attributes:
use_case: Description of the use case.
batch_pros: Advantages of batch serving.
batch_cons: Disadvantages of batch serving.
realtime_pros: Advantages of real-time serving.
realtime_cons: Disadvantages of real-time serving.
recommendation: Which pattern to use and why.
"""
use_case: str
batch_pros: List[str]
batch_cons: List[str]
realtime_pros: List[str]
realtime_cons: List[str]
recommendation: str
credit_analysis = ServingTradeoff(
use_case="Credit application scoring (10K applications/day)",
batch_pros=[
"Full feature access (all data sources, no latency limit)",
"Ensemble of complex models (XGBoost + neural + logistic)",
"Model disagreement detection → human review for edge cases",
"Simpler infrastructure (no model serving endpoint)",
"Easier auditability (batch logs, full feature snapshots)",
],
batch_cons=[
"Next-day decision → customer dropout (estimated 15-25%)",
"Cannot respond to real-time fraud signals",
"Competitive disadvantage vs. instant-decision competitors",
],
realtime_pros=[
"Instant decision → higher conversion rate (+20-30%)",
"Can incorporate real-time fraud signals",
"Better customer experience",
],
realtime_cons=[
"Fewer features available (no time for full bureau pull)",
"Simpler model (latency constraint)",
"Model auditability more complex (serving logs, not batch)",
"Higher infrastructure cost (model serving, feature store)",
],
recommendation=(
"HYBRID: Real-time pre-approval with batch final decision. "
"Show applicant an instant pre-approval (simple model, "
"limited features, conservative thresholds). Run full model "
"overnight for final terms. This captures conversion benefit "
"while preserving model quality for final decision."
),
)
print(f"Use case: {credit_analysis.use_case}")
print(f"\nRecommendation: {credit_analysis.recommendation}")
Use case: Credit application scoring (10K applications/day)
Recommendation: HYBRID: Real-time pre-approval with batch final decision. Show applicant an instant pre-approval (simple model, limited features, conservative thresholds). Run full model overnight for final terms. This captures conversion benefit while preserving model quality for final decision.
The hybrid pattern — real-time for the initial decision, batch for the final decision — is common in financial services. It is an example of the simplest model that works principle applied at the system level: use the simple real-time model where speed matters (conversion), and the complex batch model where accuracy matters (final terms).
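The hybrid pattern can be sketched as two decision functions: a conservative instant pre-approval from the simple real-time model, and a final decision from the full overnight model. The thresholds and score names below are illustrative assumptions, not calibrated lending policy:

```python
def pre_approve(realtime_score: float, conservative_threshold: float = 0.85) -> str:
    """Instant pre-approval from the simple real-time model.

    The threshold is deliberately conservative: only clearly good
    applicants are pre-approved instantly; everyone else is deferred
    to the overnight batch model rather than instantly declined.
    """
    if realtime_score >= conservative_threshold:
        return "pre-approved"  # final terms still set by the batch model
    return "pending"           # full batch model decides overnight


def final_decision(batch_score: float, approve_threshold: float = 0.60) -> str:
    """Final decision from the full overnight ensemble."""
    return "approved" if batch_score >= approve_threshold else "declined"


# An applicant who narrowly misses instant pre-approval is deferred,
# not declined, and can still be approved by the richer batch model.
print(pre_approve(0.90))     # clearly good -> instant pre-approval
print(pre_approve(0.70))     # deferred to overnight scoring
print(final_decision(0.72))  # batch model approves
```

The asymmetry is the point: the real-time model is only trusted to say "yes" confidently, never to say "no" — declines require the full feature set.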
Regulatory Constraints on Serving Patterns
Credit scoring in the United States is subject to the Equal Credit Opportunity Act (ECOA) and the Fair Credit Reporting Act (FCRA), which require lenders to provide adverse action notices — specific reasons why an applicant was denied credit or offered unfavorable terms. This requirement constrains the serving architecture:
- The model must be interpretable enough to generate reason codes (Chapter 35).
- The serving system must log every feature value, model score, and decision for auditability.
- The batch pipeline must retain full feature snapshots for regulatory examination.
These constraints favor batch serving for the final decision, because batch pipelines are easier to audit and reason about. The ADR for this system would document the regulatory requirements as a primary constraint.
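The audit requirement — every feature value, score, and decision logged — amounts to an append-only record per application. A minimal sketch of such a record; the field names and reason codes are illustrative, not a regulatory schema:

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone
from typing import Dict, List


@dataclass
class DecisionAuditRecord:
    """One immutable record per scored application, retained for examination.

    Captures the full feature snapshot at decision time plus the reason
    codes needed for ECOA/FCRA adverse action notices.
    """
    application_id: str
    model_version: str
    features: Dict[str, float]   # full feature snapshot at decision time
    score: float
    decision: str                # "approved" / "declined"
    reason_codes: List[str]      # adverse-action reasons, if declined
    timestamp: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )


record = DecisionAuditRecord(
    application_id="app-001",
    model_version="credit_gbm_20260301",
    features={"bureau_score": 640.0, "debt_to_income": 0.48},
    score=0.41,
    decision="declined",
    reason_codes=["DTI_TOO_HIGH", "BUREAU_SCORE_LOW"],
)
# Serialized as one JSON line in an append-only audit log.
print(json.dumps(asdict(record)))
```

In a batch pipeline, emitting these records is a single extra write per scoring run, which is one reason batch serving is easier to audit.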
24.10 Experimentation Infrastructure
Deploying a new model to production is not a launch — it is a hypothesis. "We believe model v2 will increase engagement by 3% compared to model v1." The experimentation infrastructure exists to test this hypothesis rigorously before committing to a full rollout.
Deployment Strategies
| Strategy | Description | Risk | When to Use |
|---|---|---|---|
| Shadow mode | New model runs in parallel; predictions logged but not served | Zero user impact | First deployment of any new model |
| Canary | New model serves 1-5% of traffic | Very low | After shadow mode validates basic correctness |
| A/B test | Randomized 50/50 split (or other ratio) | Moderate | Full evaluation of model quality |
| Gradual rollout | 1% → 5% → 25% → 50% → 100% | Controlled | Scaling a validated model to full traffic |
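Canary and gradual rollout both need stable traffic splitting: a given user should consistently see the same model as the percentage ramps up. The standard mechanism is hashing a stable identifier into a bucket. A minimal sketch — the salt, bucket count, and ramp percentages are assumptions:

```python
import hashlib


def rollout_bucket(user_id: str, salt: str = "dcn_v2_rollout") -> int:
    """Map a user to a stable bucket in [0, 100).

    Salting per rollout decorrelates bucket assignments across
    experiments; the same user lands in the same bucket for the
    lifetime of this rollout.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def serve_new_model(user_id: str, rollout_pct: int) -> bool:
    """True if this user falls inside the current rollout percentage."""
    return rollout_bucket(user_id) < rollout_pct


# Ramping 1% -> 5% -> 25% only ever *adds* users to the treatment:
# a user served the new model at 5% is still served it at 25%.
users = [f"user_{i}" for i in range(1000)]
exposed_at_5 = {u for u in users if serve_new_model(u, 5)}
exposed_at_25 = {u for u in users if serve_new_model(u, 25)}
print(exposed_at_5 <= exposed_at_25)  # True: monotone ramp-up
```

The monotone ramp-up property matters for analysis: each rollout step changes the treatment group by pure addition, so earlier cohorts are never silently reassigned.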
Shadow Mode
Shadow mode is the zero-risk first step for any new model. The production system serves predictions from the current model. Simultaneously, a shadow pipeline sends the same requests to the new model and logs the predictions. No user ever sees the shadow model's output.
Shadow mode validates:
- Correctness. The new model produces predictions in the expected range and format.
- Latency. The new model meets the latency budget.
- Consistency. The new model's predictions are correlated with (but hopefully better than) the current model's predictions.
- Error handling. The new model handles edge cases (missing features, new users, new items) without crashing.
from dataclasses import dataclass
from typing import List, Tuple
@dataclass
class ShadowModeReport:
"""Summary of shadow mode evaluation comparing production and shadow models.
Attributes:
production_model: Name/version of the production model.
shadow_model: Name/version of the shadow model.
n_requests: Total shadow requests processed.
prediction_correlation: Pearson correlation between production
and shadow predictions.
shadow_latency_p50_ms: Shadow model p50 latency.
shadow_latency_p99_ms: Shadow model p99 latency.
shadow_error_rate: Fraction of shadow requests that errored.
prediction_distribution_shift: KL divergence of prediction
distributions.
"""
production_model: str
shadow_model: str
n_requests: int
prediction_correlation: float
shadow_latency_p50_ms: float
shadow_latency_p99_ms: float
shadow_error_rate: float
prediction_distribution_shift: float
def is_ready_for_canary(
self,
min_requests: int = 10000,
max_error_rate: float = 0.001,
max_p99_latency_ms: float = 100.0,
min_correlation: float = 0.5,
) -> Tuple[bool, List[str]]:
"""Evaluate whether the shadow model is ready for canary deployment.
Args:
min_requests: Minimum shadow requests for statistical validity.
max_error_rate: Maximum acceptable error rate.
max_p99_latency_ms: Maximum acceptable p99 latency.
min_correlation: Minimum correlation with production predictions.
Returns:
(ready, list_of_blocking_issues) tuple.
"""
issues = []
if self.n_requests < min_requests:
issues.append(
f"Insufficient requests: {self.n_requests} < {min_requests}"
)
if self.shadow_error_rate > max_error_rate:
issues.append(
f"Error rate too high: {self.shadow_error_rate:.4f} > "
f"{max_error_rate:.4f}"
)
if self.shadow_latency_p99_ms > max_p99_latency_ms:
issues.append(
f"p99 latency too high: {self.shadow_latency_p99_ms:.0f}ms > "
f"{max_p99_latency_ms:.0f}ms"
)
if self.prediction_correlation < min_correlation:
issues.append(
f"Prediction correlation too low: "
f"{self.prediction_correlation:.3f} < {min_correlation:.3f}"
)
return len(issues) == 0, issues
# Example shadow mode report
report = ShadowModeReport(
production_model="dcn_v2_20260310",
shadow_model="dcn_v2_20260324",
n_requests=52_340,
prediction_correlation=0.89,
shadow_latency_p50_ms=42.0,
shadow_latency_p99_ms=87.0,
shadow_error_rate=0.00015,
prediction_distribution_shift=0.023,
)
ready, issues = report.is_ready_for_canary()
print(f"Shadow model: {report.shadow_model}")
print(f"Requests evaluated: {report.n_requests:,}")
print(f"Prediction correlation with production: {report.prediction_correlation:.3f}")
print(f"p50 latency: {report.shadow_latency_p50_ms:.0f}ms")
print(f"p99 latency: {report.shadow_latency_p99_ms:.0f}ms")
print(f"Error rate: {report.shadow_error_rate:.4f}")
print(f"\nReady for canary: {ready}")
if issues:
for issue in issues:
print(f" Blocking: {issue}")
Shadow model: dcn_v2_20260324
Requests evaluated: 52,340
Prediction correlation with production: 0.890
p50 latency: 42ms
p99 latency: 87ms
Error rate: 0.0002
Ready for canary: True
A/B Testing for ML Models
A/B testing for ML models differs from traditional web A/B testing in three ways:
- Metric delay. The primary metric (e.g., 30-day retention) may take weeks to mature. Proxy metrics (session engagement, next-day return) are used for faster decisions, but the relationship between proxy and primary metrics must be validated.
- Interference. In recommendation systems, treating a user differently affects their behavior, which affects the data used to train the next model iteration. This creates interference between treatment and control groups that standard A/B testing does not account for. Chapter 33 covers this in depth.
- Multiple metrics. An ML model change that increases click-through rate by 5% but decreases content completion rate by 3% is not clearly better. The experimentation platform must track a metric suite, and the decision criteria must be pre-specified to avoid p-hacking.
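Pre-specifying the decision criteria can be as simple as encoding the metric suite as a primary metric plus guardrails before the experiment starts. A sketch under assumed metric names and thresholds — the "CTR +5%, completion -3%" case from the text is correctly rejected:

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class LaunchCriteria:
    """Pre-registered decision rule: the primary metric must clear its
    minimum lift, and no guardrail metric may regress beyond tolerance."""
    primary_metric: str
    min_primary_lift: float       # relative lift, e.g. 0.01 = +1%
    guardrails: Dict[str, float]  # metric -> max tolerated relative drop

    def decide(self, lifts: Dict[str, float]) -> Tuple[str, List[str]]:
        """Return ("launch" | "no-launch", blocking reasons) from observed lifts."""
        reasons = []
        if lifts[self.primary_metric] < self.min_primary_lift:
            reasons.append(
                f"primary {self.primary_metric} lift "
                f"{lifts[self.primary_metric]:+.1%} < {self.min_primary_lift:+.1%}"
            )
        for metric, max_drop in self.guardrails.items():
            if lifts.get(metric, 0.0) < -max_drop:
                reasons.append(f"guardrail {metric} regressed {lifts[metric]:+.1%}")
        return ("launch" if not reasons else "no-launch"), reasons


criteria = LaunchCriteria(
    primary_metric="ctr",
    min_primary_lift=0.01,
    guardrails={"completion_rate": 0.01},  # tolerate at most a -1% drop
)
decision, reasons = criteria.decide({"ctr": 0.05, "completion_rate": -0.03})
print(decision, reasons)  # no-launch: guardrail regressed
```

Writing the rule down before the experiment is what prevents p-hacking: the team cannot quietly promote whichever metric happened to move.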
24.11 The StreamRec Application: System Architecture Design (Milestone 9)
We now bring together every concept from this chapter to design the complete StreamRec system architecture. This is the ninth milestone in the progressive project — the point where the models from Parts I-IV are assembled into a production system.
Requirements
| Requirement | Value |
|---|---|
| Active users | 50 million |
| Item corpus | 200,000 items |
| Recommendations per request | 20 |
| End-to-end latency (p95) | 200ms |
| Peak throughput | 10,000 requests/second |
| Availability SLA | 99.9% |
| Model freshness | Retraining every 24 hours |
| Feature freshness | Batch: daily; Streaming: < 60 seconds |
System Diagram
┌──────────────────────────────────────────────────────────────────────┐
│ StreamRec System │
│ │
│ ┌─────────┐ ┌──────────────┐ ┌──────────────────────────────┐│
│ │ Client │───▶│ API Gateway │───▶│ Serving Layer ││
│ │ (App) │◀───│ (5ms) │◀───│ ││
│ └─────────┘ └──────────────┘ │ ┌──────────────────────┐ ││
│ │ │ Feature Store Lookup │ ││
│ │ │ (15ms) │ ││
│ │ └──────────┬───────────┘ ││
│ │ │ ││
│ │ ┌──────────▼───────────┐ ││
│ │ │ Candidate Retrieval │ ││
│ │ │ (30ms) │ ││
│ │ │ ├─ ANN (two-tower) │ ││
│ │ │ ├─ CF (GNN) │ ││
│ │ │ ├─ Content-based │ ││
│ │ │ └─ Trending │ ││
│ │ └──────────┬───────────┘ ││
│ │ │ ││
│ │ ┌──────────▼───────────┐ ││
│ │ │ Ranking (DCN-V2) │ ││
│ │ │ (50ms, GPU) │ ││
│ │ └──────────┬───────────┘ ││
│ │ │ ││
│ │ ┌──────────▼───────────┐ ││
│ │ │ Re-Ranking │ ││
│ │ │ (20ms) │ ││
│ │ │ Diversity, freshness,│ ││
│ │ │ policy, fairness │ ││
│ │ └──────────┬───────────┘ ││
│ │ │ ││
│ │ ┌──────────▼───────────┐ ││
│ │ │ Response Assembly │ ││
│ │ │ (5ms) │ ││
│ │ └──────────────────────┘ ││
│ └──────────────────────────────┘│
│ │
│ ┌──────────────────────────────────────────────────────────────────┐│
│ │ Offline Layer ││
│ │ ││
│ │ ┌────────────┐ ┌──────────────┐ ┌───────────────────────┐ ││
│ │ │ Data │──▶│ Feature │──▶│ Training │ ││
│ │ │ Pipeline │ │ Engineering │ │ (daily, GPU cluster) │ ││
│ │ └────────────┘ └──────────────┘ └──────────┬────────────┘ ││
│ │ │ ││
│ │ ┌────────────────────────────────┐ ┌──────────▼────────────┐ ││
│ │ │ Monitoring & Observability │ │ Model Registry │ ││
│ │ │ (drift, latency, quality) │ │ (version, stage, │ ││
│ │ └────────────────────────────────┘ │ validate) │ ││
│ │ └──────────┬────────────┘ ││
│ │ ┌────────────────────────────────┐ │ ││
│ │ │ Experimentation Platform │◀────────────┘ ││
│ │ │ (shadow, canary, A/B) │ ││
│ │ └────────────────────────────────┘ ││
│ └──────────────────────────────────────────────────────────────────┘│
│ │
│ ┌──────────────────────────────────────────────────────────────────┐│
│ │ Feature Store ││
│ │ ┌──────────────┐ ┌────────────────┐ ││
│ │ │ Online Store │ │ Offline Store │ ││
│ │ │ (Redis/Dynamo)│ │ (Parquet/Delta) │ ││
│ │ │ Latest values │ │ Historical + │ ││
│ │ │ < 5ms lookup │ │ point-in-time │ ││
│ │ └──────────────┘ └────────────────┘ ││
│ └──────────────────────────────────────────────────────────────────┘│
└──────────────────────────────────────────────────────────────────────┘
Latency Budget
| Stage | Budget (p95) | Notes |
|---|---|---|
| API gateway | 5ms | Routing, auth, rate limiting |
| Feature store lookup | 15ms | Parallel with retrieval start; Redis cluster |
| Candidate retrieval | 30ms | 4 sources in parallel; take union of completed |
| Ranking model | 50ms | DCN-V2 on GPU; 500 items at ~0.1ms each |
| Re-ranking | 20ms | CPU; greedy diversity/freshness selection |
| Response assembly | 5ms | Serialize response, add metadata |
| Total critical path | 125ms | 75ms headroom for p99 tail |
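The budget table can be checked mechanically against the 200ms SLA from the requirements table. A small sanity check using the numbers above (it treats the feature store lookup as sequential, which is conservative given the table notes it overlaps with retrieval):

```python
# Per-stage p95 budgets from the table above (ms).
budget_ms = {
    "api_gateway": 5,
    "feature_store_lookup": 15,
    "candidate_retrieval": 30,
    "ranking": 50,
    "re_ranking": 20,
    "response_assembly": 5,
}

SLA_MS = 200  # end-to-end p95 requirement

critical_path = sum(budget_ms.values())
headroom = SLA_MS - critical_path
print(f"Critical path: {critical_path}ms, headroom: {headroom}ms")
assert critical_path < SLA_MS, "stage budgets exceed the end-to-end SLA"
```

Keeping a check like this next to the budget table catches the common failure mode where individual stage budgets creep upward until the sum silently exceeds the SLA.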
ADR: StreamRec Serving Pattern (Summary)
Following the ADR template from Section 24.8:
- Context: 50M users, sub-200ms latency, session-aware personalization required.
- Decision: Real-time multi-stage serving with batch fallback.
- Key trade-off: +$80K/month GPU infrastructure cost justified by +12% engagement over batch-only (A/B tested in pilot).
- Consequence: Must maintain dual infrastructure (real-time + batch); feature store must serve both online and offline; monitoring must distinguish serving mode for clean A/B analysis.
Key Design Decisions
from dataclasses import dataclass
from typing import List
@dataclass
class DesignDecision:
"""A key design decision with its rationale.
Attributes:
decision: What was decided.
rationale: Why this choice was made.
alternatives_rejected: What was considered and rejected.
risk: What could go wrong.
"""
decision: str
rationale: str
alternatives_rejected: List[str]
risk: str
streamrec_decisions = [
DesignDecision(
decision="Multi-source retrieval (4 sources) instead of single ANN",
rationale=(
"Single-source retrieval has a coverage ceiling: the two-tower "
"model cannot retrieve items without embedding similarity, "
"missing cold-start items and category-niche items. Multi-source "
"retrieval improves Recall@500 by 18% in offline evaluation."
),
alternatives_rejected=[
"Single ANN retrieval (simpler, but lower coverage)",
"6-source retrieval (marginal gain, +15ms latency)",
],
risk="Merging 4 source results in 30ms requires all sources to "
"respond in time. Mitigation: return whatever has arrived by "
"the timeout.",
),
DesignDecision(
decision="GPU serving for ranking model (not CPU)",
rationale=(
"DCN-V2 scoring 500 items takes ~50ms on GPU (T4) and ~300ms "
"on CPU. CPU serving would require reducing candidates to ~80 "
"or using a simpler model, both of which degrade quality."
),
alternatives_rejected=[
"CPU serving with fewer candidates (too much quality loss)",
"CPU serving with distilled model (quality loss > latency gain)",
],
risk="GPU failure → no ranking. Mitigation: CPU fallback with a "
"lightweight model (logistic regression on top features).",
),
DesignDecision(
decision="Batch fallback with daily pre-computation",
rationale=(
"99.9% availability SLA requires a fallback when real-time "
"serving is degraded. Batch pre-computed recommendations are "
"stale but personalized — better than popularity fallback."
),
alternatives_rejected=[
"No fallback (violates availability SLA)",
"Popularity-only fallback (too large quality drop)",
],
risk="Batch pipeline failure leaves stale fallback. Mitigation: "
"alert if batch results are > 36 hours old.",
),
]
for i, d in enumerate(streamrec_decisions, 1):
print(f"Decision {i}: {d.decision}")
print(f" Rationale: {d.rationale[:100]}...")
print(f" Risk: {d.risk[:80]}...")
print()
Decision 1: Multi-source retrieval (4 sources) instead of single ANN
Rationale: Single-source retrieval has a coverage ceiling: the two-tower model cannot retrieve items wi...
Risk: Merging 4 source results in 30ms requires all sources to respond in time. Mitigation: re...
Decision 2: GPU serving for ranking model (not CPU)
Rationale: DCN-V2 scoring 500 items takes ~50ms on GPU (T4) and ~300ms on CPU. CPU serving would requi...
Risk: GPU failure → no ranking. Mitigation: CPU fallback with a lightweight model (logistic reg...
Decision 3: Batch fallback with daily pre-computation
Rationale: 99.9% availability SLA requires a fallback when real-time serving is degraded. Batch pre-co...
Risk: Batch pipeline failure leaves stale fallback. Mitigation: alert if batch results are > 36...
24.12 System Design as a Professional Discipline
This chapter opened with Sculley et al.'s observation that the model is 5% of the system. We have now built — at least in architectural blueprint — the other 95% for StreamRec.
The key ideas are structural, not technological:
The funnel architecture (retrieval → ranking → re-ranking) is a consequence of the latency-quality trade-off: you cannot run an expensive model on every item, so you use cheap models to narrow the candidate set and expensive models to refine it. This pattern applies to recommendation systems, search engines, advertising systems, and any application where the candidate space is large and the latency budget is small.
The feature store solves the training-serving consistency problem that has derailed more ML projects than any modeling error. By providing a single source of truth for feature computation, with online and offline access patterns, the feature store ensures that the model sees the same features in production that it saw during training.
Serving pattern selection (batch, real-time, near-real-time) is a system design decision, not a modeling decision. The choice depends on latency requirements, feature freshness needs, infrastructure budget, and team capacity. The simplest pattern that meets the requirements is the right choice.
Graceful degradation means designing for failure, not hoping for its absence. A recommendation system that serves popularity-based recommendations during a feature store outage is infinitely better than one that returns HTTP 500 errors.
ADRs create the institutional memory that ML teams desperately need. They document not just what was decided, but why — and under what conditions the decision should be revisited.
Production ML = Software Engineering: The shift from model building to system design is the shift from individual craft to engineering discipline. A model is an artifact; a system is a product. Building the system requires every skill from Parts I-IV — the models must be excellent — plus the software engineering skills covered in this chapter and the rest of Part V. The payoff is ML that works, reliably, at scale, in the real world.
The next six chapters fill in the components we have outlined here: Chapter 25 builds the feature store, Chapter 26 handles distributed training, Chapter 27 orchestrates the pipeline, Chapter 28 tests it, Chapter 29 deploys it, and Chapter 30 monitors it.
Summary
ML system design is the discipline of composing models, data pipelines, feature stores, serving infrastructure, monitoring, and experimentation into a reliable, scalable, maintainable production system. The model — however sophisticated — is a small fraction of the total system.
The funnel architecture (candidate retrieval → ranking → re-ranking) enables serving personalized recommendations from a large item corpus within a strict latency budget. Each stage trades precision for speed, narrowing the candidate set while increasing the computational cost per item.
Serving patterns — batch, real-time, and near-real-time — differ in latency, freshness, cost, and infrastructure complexity. The right choice depends on the product requirements, not the model architecture. Hybrid patterns (real-time with batch fallback) are common and effective.
Feature stores bridge the gap between training (offline, historical data) and serving (online, real-time lookup), preventing training-serving skew — the silent killer of production ML performance. Online-offline consistency, point-in-time correctness, and default value handling are the critical guarantees.
Graceful degradation designs for failure: when components fail, the system falls back to simpler but still useful behavior rather than returning errors. Circuit breakers, fallback chains, and popularity-based defaults are the mechanisms.
Architecture Decision Records document design choices with context, constraints, and consequences. They prevent knowledge loss, avoid re-litigation of settled decisions, and surface assumptions that may have changed.
The system architecture defines the boundaries within which every other production ML component operates. Chapters 25-30 fill in those components one by one.