
Chapter 24: ML System Design — Architecture Patterns for Real-World Machine Learning

"Only a small fraction of real-world ML systems is composed of the ML code... The required surrounding infrastructure is vast and complex." — D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young, Jean-Francois Crespo, and Dan Dennison, "Hidden Technical Debt in Machine Learning Systems" (NeurIPS, 2015)


Learning Objectives

By the end of this chapter, you will be able to:

  1. Decompose an ML system into its core components — data pipeline, feature store, training infrastructure, model serving, and monitoring — and explain the responsibilities and failure modes of each
  2. Select among batch, real-time, and near-real-time serving patterns based on latency requirements, throughput constraints, and cost trade-offs
  3. Design for reliability using redundancy, graceful degradation, fallback models, and circuit breakers
  4. Write Architecture Decision Records (ADRs) that document ML system design choices with context, constraints, and consequences
  5. Apply the full system design framework to the StreamRec recommendation capstone

24.1 The Model Is 5% of the System

In 2015, Sculley et al. published what has become one of the most cited papers on ML engineering: "Hidden Technical Debt in Machine Learning Systems." The paper's central figure — a small box labeled "ML Code" surrounded by vastly larger boxes for data collection, feature extraction, analysis tools, serving infrastructure, monitoring, configuration, and process management — has become an icon of the field. The message is stark: the machine learning model itself is a small component of a production ML system, and often the easiest part to get right.

This chapter begins Part V of the book — the part where we build the 95% that surrounds the model. We have spent twenty-three chapters learning to build excellent models. Linear algebra and optimization (Part I) gave us the mathematical foundations. Deep learning (Part II) gave us neural architectures for every data modality. Causal inference (Part III) taught us to distinguish prediction from causation. Bayesian and temporal methods (Part IV) gave us principled uncertainty quantification. All of this work was necessary. None of it is sufficient.

A model that achieves state-of-the-art performance on a hold-out test set but cannot serve predictions in under 100 milliseconds is useless for a real-time recommendation system. A model trained on yesterday's data that does not retrain when user behavior shifts after a product launch is a ticking time bomb. A feature engineering pipeline that computes features differently during training and serving — the dreaded training-serving skew — will produce predictions that are silently, catastrophically wrong.

Production ML = Software Engineering: Machine learning in production is software engineering with three additional dimensions of complexity: data dependencies, model behavior that changes over time, and the entanglement of components through shared data. Every principle of software engineering — modularity, testing, monitoring, documentation, version control — applies, and then ML adds its own. This theme runs through every chapter in Part V.

This chapter provides the architectural blueprint. We will design a complete ML system for a content recommendation platform — from the moment a user opens the application to the moment a ranked list of recommendations appears on their screen, all within a 200-millisecond latency budget. Along the way, we will develop a vocabulary for discussing ML system architecture, a framework for choosing between design alternatives, and a documentation discipline (ADRs) for recording and communicating those choices.


24.2 Anatomy of an ML System

Every production ML system, regardless of domain, contains the same core components. They vary in scale, complexity, and implementation technology, but the conceptual architecture is universal.

The Component Map

| Component | Responsibility | Failure Mode |
| --- | --- | --- |
| Data Pipeline | Ingest, validate, transform, and store raw data | Stale data, schema drift, silent corruption |
| Feature Store | Compute, store, and serve features for training and inference | Training-serving skew, stale features, missing values |
| Training Infrastructure | Train, validate, and register models | Non-reproducible runs, resource waste, silent degradation |
| Model Registry | Version, stage, and manage model artifacts | Wrong model in production, missing metadata |
| Serving Infrastructure | Accept requests, compute predictions, return results | Latency spikes, OOM errors, stale models |
| Monitoring & Observability | Track data quality, model performance, and system health | Undetected drift, cascading failures |
| Experimentation Platform | A/B test, shadow mode, canary deployment | Selection bias, underpowered tests |
| Configuration Management | Manage hyperparameters, feature lists, model versions | Configuration drift, untraceable changes |

These components interact through well-defined interfaces: the feature store provides a consistent feature-fetching API for both training jobs and serving requests; the model registry provides versioned artifacts for the serving infrastructure; the monitoring system consumes metrics from every other component.

The Two Loops

ML systems operate on two timescales, and understanding this distinction is essential for architecture:

The inner loop (online) handles individual prediction requests. A user opens the application; the serving infrastructure retrieves user features from the feature store, loads the current model, computes a prediction, and returns a response. This loop runs in milliseconds. Latency, throughput, and availability are the primary concerns.

The outer loop (offline) handles model improvement. New data arrives; the data pipeline processes it; the training infrastructure retrains or fine-tunes the model; the experimentation platform evaluates the new model against the current one; if the new model wins, the model registry promotes it to production. This loop runs in hours or days. Model quality, reproducibility, and validation rigor are the primary concerns.

The art of ML system design is keeping these two loops decoupled yet synchronized. The inner loop must never block on the outer loop (a user should never wait while the model retrains). The outer loop must never bypass the inner loop's safety checks (a new model should never reach production without validation).
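The loop boundary can be made concrete with a toy sketch. The class and method names below are illustrative, not from this chapter's codebase: the inner loop only ever reads the registry's current pointer, and the outer loop can move that pointer only through a validation gate.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ModelRegistryGate:
    """Toy registry illustrating the loop boundary.

    The inner loop reads current() in O(1) and never blocks on training;
    the outer loop may promote a version only after validation passes.
    """
    production_version: str = "v1"
    validated: Set[str] = field(default_factory=lambda: {"v1"})

    def current(self) -> str:
        # Inner loop: serving requests read the pointer, nothing more.
        return self.production_version

    def record_validation(self, version: str) -> None:
        # Outer loop: the experimentation platform marks a candidate as validated.
        self.validated.add(version)

    def promote(self, version: str) -> None:
        # Outer loop: promotion is refused for unvalidated candidates.
        if version not in self.validated:
            raise ValueError(f"{version} has not passed validation")
        self.production_version = version


registry = ModelRegistryGate()
try:
    registry.promote("v2")       # refused: v2 has not been validated
except ValueError:
    pass
registry.record_validation("v2")
registry.promote("v2")           # now allowed
print(registry.current())        # v2
```

The gate enforces both halves of the decoupling rule: serving never waits on training, and training output never reaches serving without passing validation.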

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple
from enum import Enum
import time


class ComponentStatus(Enum):
    """Health status for ML system components."""
    HEALTHY = "healthy"
    DEGRADED = "degraded"
    UNHEALTHY = "unhealthy"


@dataclass
class MLSystemComponent:
    """Base representation of an ML system component.

    Tracks name, health status, latency SLA, and dependencies
    for system-level architecture reasoning.

    Attributes:
        name: Component identifier.
        status: Current health status.
        latency_budget_ms: Maximum acceptable latency in milliseconds.
        dependencies: List of component names this component depends on.
        last_health_check: Unix timestamp of last health check.
    """
    name: str
    status: ComponentStatus = ComponentStatus.HEALTHY
    latency_budget_ms: float = 100.0
    dependencies: List[str] = field(default_factory=list)
    last_health_check: float = 0.0

    def is_available(self) -> bool:
        """Check if component is available for serving.

        Returns:
            True if status is HEALTHY or DEGRADED.
        """
        return self.status in (ComponentStatus.HEALTHY, ComponentStatus.DEGRADED)


@dataclass
class MLSystemArchitecture:
    """High-level ML system architecture specification.

    Enumerates all components, their latency budgets, and dependencies.
    Used for architecture documentation and system health assessment.

    Attributes:
        name: System name.
        components: Dictionary mapping component name to specification.
        total_latency_budget_ms: End-to-end latency SLA.
    """
    name: str
    components: Dict[str, MLSystemComponent] = field(default_factory=dict)
    total_latency_budget_ms: float = 200.0

    def add_component(self, component: MLSystemComponent) -> None:
        """Register a component in the architecture.

        Args:
            component: The component to add.
        """
        self.components[component.name] = component

    def critical_path_latency(self) -> float:
        """Compute the sum of latency budgets along the critical path.

        This is a simplified model: it treats every registered component
        as serial and sums their latency budgets. A real system would
        walk the dependency graph, with parallel branches sharing a budget.

        Returns:
            Total critical path latency in milliseconds.
        """
        return sum(c.latency_budget_ms for c in self.components.values())

    def check_latency_feasibility(self) -> Tuple[bool, str]:
        """Verify that the critical path fits within the total budget.

        Returns:
            (feasible, message) tuple.
        """
        critical = self.critical_path_latency()
        if critical <= self.total_latency_budget_ms:
            return True, (
                f"Critical path ({critical:.0f}ms) fits within "
                f"budget ({self.total_latency_budget_ms:.0f}ms). "
                f"Headroom: {self.total_latency_budget_ms - critical:.0f}ms."
            )
        return False, (
            f"Critical path ({critical:.0f}ms) exceeds "
            f"budget ({self.total_latency_budget_ms:.0f}ms) "
            f"by {critical - self.total_latency_budget_ms:.0f}ms."
        )

    def system_health(self) -> Dict[str, ComponentStatus]:
        """Report the health of all components.

        Returns:
            Dictionary mapping component name to health status.
        """
        return {name: c.status for name, c in self.components.items()}


# Example: Define a minimal recommendation system architecture
rec_system = MLSystemArchitecture(
    name="StreamRec",
    total_latency_budget_ms=200.0
)

rec_system.add_component(MLSystemComponent(
    name="api_gateway",
    latency_budget_ms=5.0,
    dependencies=[]
))
rec_system.add_component(MLSystemComponent(
    name="feature_store_lookup",
    latency_budget_ms=15.0,
    dependencies=["api_gateway"]
))
rec_system.add_component(MLSystemComponent(
    name="candidate_retrieval",
    latency_budget_ms=30.0,
    dependencies=["feature_store_lookup"]
))
rec_system.add_component(MLSystemComponent(
    name="ranking_model",
    latency_budget_ms=50.0,
    dependencies=["candidate_retrieval"]
))
rec_system.add_component(MLSystemComponent(
    name="reranking",
    latency_budget_ms=20.0,
    dependencies=["ranking_model"]
))
rec_system.add_component(MLSystemComponent(
    name="response_assembly",
    latency_budget_ms=5.0,
    dependencies=["reranking"]
))

feasible, message = rec_system.check_latency_feasibility()
print(f"Feasibility: {feasible}")
print(message)
print(f"\nComponent budgets:")
for name, comp in rec_system.components.items():
    print(f"  {name}: {comp.latency_budget_ms:.0f}ms")
Feasibility: True
Critical path (125ms) fits within budget (200ms). Headroom: 75ms.

Component budgets:
  api_gateway: 5ms
  feature_store_lookup: 15ms
  candidate_retrieval: 30ms
  ranking_model: 50ms
  reranking: 20ms
  response_assembly: 5ms

The 75ms of headroom is not luxury — it absorbs network variability, garbage collection pauses, and the long tail of database queries. Production systems that use 100% of their latency budget at p50 will violate SLAs at p99.
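The p50/p99 gap is easy to see in simulation. The sketch below (latency distribution and magnitudes are invented for illustration) models a service whose typical request is fast but whose occasional slow requests — tail events — push the 99th percentile past the budget even though the median is comfortable:

```python
import random

random.seed(0)

# Simulate 10,000 request latencies: a well-behaved body around 120ms
# plus an occasional heavy tail (slow queries, GC pauses).
def one_latency() -> float:
    base = random.gauss(120, 15)
    tail = random.expovariate(1 / 80) if random.random() < 0.05 else 0.0
    return max(base + tail, 1.0)

samples = sorted(one_latency() for _ in range(10_000))

def percentile(xs, q):
    return xs[int(q * (len(xs) - 1))]

p50 = percentile(samples, 0.50)
p99 = percentile(samples, 0.99)
# The median sits well under a 200ms budget; the 99th percentile does not.
print(f"p50: {p50:.0f}ms  p99: {p99:.0f}ms")
```

This is why latency budgets are set against p99 (or p99.9), not the median.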


24.3 The Recommendation System Architecture: A Complete Walkthrough

We now design the architecture for StreamRec, the content recommendation platform that has been our progressive project throughout this book. In Part I, we decomposed the user-item interaction matrix with SVD (Chapter 1) and built approximate nearest neighbor indices with FAISS (Chapter 5). In Part II, we moved to neural architectures: the click-prediction MLP (Chapter 6), the session transformer (Chapter 10), the two-tower retrieval model (Chapter 13), and the graph neural network collaborative filter (Chapter 14). In Part III, we asked causal questions about recommendation effects and estimated heterogeneous treatment effects (Chapters 15-19). In Part IV, we built Bayesian user preferences for cold-start handling (Chapter 20).

Now we bring it all together. The question is no longer "which model?" but "how do all these models compose into a system that serves 50 million users with sub-200ms latency?"

The Multi-Stage Architecture

Modern recommendation systems use a funnel architecture with three stages: candidate retrieval, ranking, and re-ranking. Each stage narrows the item set while increasing the computational cost per item.

                        ┌──────────────────────────────┐
                        │      Item Corpus             │
                        │      200,000 items           │
                        └──────────────┬───────────────┘
                                       │
                        ┌──────────────▼───────────────┐
                        │   Stage 1: Retrieval         │
                        │   200,000 → 500 candidates   │
                        │   ~0.15µs per item           │
                        │   ANN (FAISS), two-tower     │
                        │   Budget: 30ms               │
                        └──────────────┬───────────────┘
                                       │
                        ┌──────────────▼───────────────┐
                        │   Stage 2: Ranking           │
                        │   500 → 50 items             │
                        │   ~0.1ms per item            │
                        │   Deep ranking model         │
                        │   Budget: 50ms               │
                        └──────────────┬───────────────┘
                                       │
                        ┌──────────────▼───────────────┐
                        │   Stage 3: Re-ranking        │
                        │   50 → 20 displayed          │
                        │   ~0.4ms per item            │
                        │   Business rules, diversity, │
                        │   freshness, fairness        │
                        │   Budget: 20ms               │
                        └──────────────┬───────────────┘
                                       │
                                       ▼
                               20 recommendations
                               shown to user

Why three stages? A single model scoring all 200,000 items per request at 0.1ms per item would take 20 seconds — two orders of magnitude beyond the budget. The funnel architecture is not optional; it is a mathematical consequence of the latency constraint.
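The arithmetic is worth spelling out. Using the per-item costs and stage sizes from the diagram above:

```python
# Back-of-envelope check of why the funnel is mandatory.
# (n_items, ms_per_item) pairs come from the stage diagram.
configs = [
    ("no funnel: ranker scores the full corpus", 200_000, 0.1),
    ("stage 1: ANN retrieval",                   200_000, 0.00015),
    ("stage 2: ranking",                             500, 0.1),
    ("stage 3: re-ranking",                           50, 0.4),
]
for name, n_items, ms_per_item in configs:
    print(f"{name}: {n_items * ms_per_item:,.0f}ms")
# no funnel: 20,000ms; stage 1: 30ms; stage 2: 50ms; stage 3: 20ms
```

Each stage spends more compute per item precisely because the previous stage has shrunk the item set by orders of magnitude.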

Stage 1: Candidate Retrieval

The retrieval stage must reduce 200,000 items to ~500 candidates in under 30 milliseconds. Speed dominates precision here: missing a few good items is acceptable, as long as recall is high enough that the top 500 contains most of the items the user would engage with.

Two-tower retrieval (Chapter 13). The user encoder produces a dense embedding $\mathbf{u} \in \mathbb{R}^{128}$; the item encoder produces $\mathbf{v}_i \in \mathbb{R}^{128}$ for each item. At serving time, item embeddings are pre-computed and stored in a FAISS index (Chapter 5). The user embedding is computed on-the-fly and used to query the index for the nearest 200 items.
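The serving-time lookup reduces to a maximum-inner-product search over the pre-computed item embeddings. A pure-Python sketch of the exact computation (toy dimensions for illustration; production replaces the exhaustive scan with a FAISS approximate index, which trades a little recall for orders-of-magnitude speedup):

```python
import heapq
import math
import random

random.seed(0)
DIM, N_ITEMS, K = 8, 1_000, 5   # toy sizes; the text uses 128-d and 200k items

def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

# Pre-computed at indexing time (outer loop): one embedding per item.
item_embeddings = [
    normalize([random.gauss(0, 1) for _ in range(DIM)]) for _ in range(N_ITEMS)
]

# Computed per request (inner loop): the user embedding.
user_embedding = normalize([random.gauss(0, 1) for _ in range(DIM)])

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Exact top-K by inner product; with unit vectors this is cosine similarity.
top_k = heapq.nlargest(
    K, range(N_ITEMS), key=lambda i: dot(item_embeddings[i], user_embedding)
)
print(top_k)
```

The separation matters architecturally: item embeddings are refreshed by the outer loop, while the inner loop only computes one user embedding and one index query per request.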

Multi-source retrieval. No single retrieval source captures all relevant items. StreamRec uses four retrieval sources, run in parallel:

  1. Embedding similarity (two-tower): 200 candidates based on user-item embedding distance.
  2. Collaborative filtering (GNN, Chapter 14): 150 candidates from graph-based neighborhood.
  3. Content-based (category affinity, Chapter 20): 100 candidates from the user's preferred categories.
  4. Trending/fresh (popularity-weighted recency): 50 candidates to ensure coverage of new content.

The union of these sources produces 400-500 unique candidates after deduplication. The four sources run in parallel and share the stage's 30ms budget, and each has a fallback: if the embedding index is unavailable, the system falls back to collaborative filtering and content-based retrieval alone.

from dataclasses import dataclass
from typing import List, Set, Dict, Optional
from enum import Enum
import time


class RetrievalSource(Enum):
    """Sources of candidate items for recommendation."""
    EMBEDDING_ANN = "embedding_ann"
    COLLABORATIVE = "collaborative"
    CONTENT_BASED = "content_based"
    TRENDING = "trending"


@dataclass
class RetrievalResult:
    """Result from a single retrieval source.

    Attributes:
        source: Which retrieval method produced these candidates.
        item_ids: List of candidate item IDs.
        scores: Corresponding relevance scores (source-specific).
        latency_ms: Time taken by this retrieval source.
        is_fallback: Whether this result came from a fallback mechanism.
    """
    source: RetrievalSource
    item_ids: List[int]
    scores: List[float]
    latency_ms: float
    is_fallback: bool = False


@dataclass
class CandidateRetriever:
    """Multi-source candidate retrieval with fallback logic.

    Queries multiple retrieval sources in parallel and merges results.
    If a primary source fails, activates its fallback.

    Attributes:
        max_candidates_per_source: Maximum candidates from each source.
        total_candidate_target: Target number of unique candidates after merge.
        timeout_ms: Maximum time to wait for any single source.
    """
    max_candidates_per_source: Optional[Dict[RetrievalSource, int]] = None
    total_candidate_target: int = 500
    timeout_ms: float = 30.0

    def __post_init__(self):
        if self.max_candidates_per_source is None:
            self.max_candidates_per_source = {
                RetrievalSource.EMBEDDING_ANN: 200,
                RetrievalSource.COLLABORATIVE: 150,
                RetrievalSource.CONTENT_BASED: 100,
                RetrievalSource.TRENDING: 50,
            }

    def merge_candidates(
        self, results: List[RetrievalResult]
    ) -> List[int]:
        """Merge candidates from multiple sources, deduplicating.

        Items appearing in multiple sources receive a bonus (multi-source
        agreement is a strong relevance signal).

        Args:
            results: List of retrieval results from all sources.

        Returns:
            Deduplicated list of candidate item IDs, ordered by
            aggregate score.
        """
        item_scores: Dict[int, float] = {}
        item_source_count: Dict[int, int] = {}

        for result in results:
            for item_id, score in zip(result.item_ids, result.scores):
                # Accumulate raw scores across sources; this assumes each
                # source already emits scores on a comparable [0, 1] scale
                if item_id not in item_scores:
                    item_scores[item_id] = 0.0
                    item_source_count[item_id] = 0
                item_scores[item_id] += score
                item_source_count[item_id] += 1

        # Multi-source bonus: items from 2+ sources get a 1.5x boost
        for item_id in item_scores:
            if item_source_count[item_id] >= 2:
                item_scores[item_id] *= 1.5

        # Sort by aggregate score, return top candidates
        sorted_items = sorted(
            item_scores.keys(),
            key=lambda x: item_scores[x],
            reverse=True
        )
        return sorted_items[:self.total_candidate_target]

    def retrieval_summary(
        self, results: List[RetrievalResult]
    ) -> Dict[str, object]:
        """Summarize retrieval for monitoring.

        Args:
            results: List of retrieval results.

        Returns:
            Dictionary of retrieval metrics.
        """
        total_candidates = len(
            set(
                item_id
                for r in results
                for item_id in r.item_ids
            )
        )
        return {
            "total_unique_candidates": total_candidates,
            "sources_used": [r.source.value for r in results],
            "sources_fallback": [
                r.source.value for r in results if r.is_fallback
            ],
            "max_latency_ms": max(r.latency_ms for r in results),
            "mean_latency_ms": sum(r.latency_ms for r in results) / len(results),
        }


# Simulate multi-source retrieval
retriever = CandidateRetriever()

# Simulated results (in production, these run in parallel)
results = [
    RetrievalResult(
        source=RetrievalSource.EMBEDDING_ANN,
        item_ids=list(range(0, 200)),
        scores=[1.0 - i * 0.004 for i in range(200)],
        latency_ms=12.3,
    ),
    RetrievalResult(
        source=RetrievalSource.COLLABORATIVE,
        item_ids=list(range(100, 250)),
        scores=[0.9 - i * 0.005 for i in range(150)],
        latency_ms=18.7,
    ),
    RetrievalResult(
        source=RetrievalSource.CONTENT_BASED,
        item_ids=list(range(180, 280)),
        scores=[0.8 - i * 0.006 for i in range(100)],
        latency_ms=8.1,
    ),
    RetrievalResult(
        source=RetrievalSource.TRENDING,
        item_ids=list(range(300, 350)),
        scores=[0.7 - i * 0.01 for i in range(50)],
        latency_ms=3.2,
    ),
]

candidates = retriever.merge_candidates(results)
summary = retriever.retrieval_summary(results)
print(f"Unique candidates: {summary['total_unique_candidates']}")
print(f"After merge/rank: {len(candidates)}")
print(f"Max source latency: {summary['max_latency_ms']:.1f}ms")
print(f"Fallback sources: {summary['sources_fallback']}")
Unique candidates: 330
After merge/rank: 330
Max source latency: 18.7ms
Fallback sources: []

Stage 2: Ranking

The ranking stage scores 500 candidates with a deep model and selects the top 50. This is where the expensive neural architectures from Part II earn their keep: the ranking model has access to dense user features, item features, context features (time of day, device, session history), and cross-features.

The ranking model. StreamRec's ranker is a deep cross network (DCN-V2): it combines explicit feature interactions (cross network) with implicit interactions (deep network) to predict engagement probability. The model takes ~0.1ms per item on a GPU, allowing 500 items to be scored in ~50ms.

Feature requirements. The ranker needs features from multiple sources:

| Feature Group | Source | Latency |
| --- | --- | --- |
| User profile (demographics, tenure) | Feature store (batch) | ~2ms |
| User real-time activity (last 10 interactions) | Feature store (streaming) | ~5ms |
| Item embeddings (from two-tower model) | Feature store (batch) | ~2ms |
| Item statistics (CTR, completion rate) | Feature store (batch) | ~2ms |
| Context (time, device, session length) | Request metadata | ~0ms |

Feature retrieval overlaps with candidate retrieval, a key architectural optimization: user and context features are fetched while candidates are being retrieved, and item features for the shortlist are fetched (or served from a warm cache) as soon as the candidate set is known. Candidates therefore arrive at the scorer with their features already in hand.
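A minimal sketch of that overlap using a thread pool (the fetch functions, their return values, and the sleep-based latencies are all simulated stand-ins):

```python
import concurrent.futures as cf
import time

def fetch_user_features(user_id: int) -> dict:
    time.sleep(0.05)                      # simulated feature-store lookup
    return {"user_id": user_id, "tenure_days": 412}

def retrieve_candidates(user_id: int) -> list:
    time.sleep(0.05)                      # simulated ANN + CF retrieval
    return [101, 102, 103]

start = time.perf_counter()
with cf.ThreadPoolExecutor(max_workers=2) as pool:
    # Both calls start immediately and run concurrently.
    features_fut = pool.submit(fetch_user_features, 7)
    candidates_fut = pool.submit(retrieve_candidates, 7)
    user_features = features_fut.result()
    candidates = candidates_fut.result()
elapsed = time.perf_counter() - start

# Wall time is ~max(50ms, 50ms), not the 100ms a serial pipeline would pay.
print(f"{elapsed * 1000:.0f}ms for {len(candidates)} candidates")
```

In a real serving path the same pattern is usually expressed with async I/O rather than threads, but the latency arithmetic is identical.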

Stage 3: Re-ranking

The re-ranking stage takes the top 50 scored items and applies non-ML business logic to produce the final 20 displayed recommendations. Re-ranking is where the system enforces constraints that the ML model does not — and should not — encode:

  1. Diversity. No more than 3 items from the same category in the top 10.
  2. Freshness. At least 2 items published in the last 24 hours.
  3. Content policy. Remove items flagged by the trust & safety system.
  4. Business rules. Boost promoted content; suppress items the user has already seen.
  5. Fairness. Ensure minimum exposure for content from underrepresented creators (Chapter 31).

Re-ranking is deliberately separate from ranking because business rules change faster than ML models, and entangling them creates a maintenance nightmare.

from dataclasses import dataclass
from typing import List, Dict, Set, Optional
from datetime import datetime, timedelta


@dataclass
class RankedItem:
    """An item with its ranking model score and metadata.

    Attributes:
        item_id: Unique item identifier.
        score: Ranking model predicted engagement probability.
        category: Content category.
        creator_id: Content creator identifier.
        published_at: Publication timestamp.
        is_promoted: Whether item is promoted content.
        is_flagged: Whether item is flagged by trust & safety.
    """
    item_id: int
    score: float
    category: str
    creator_id: str
    published_at: datetime
    is_promoted: bool = False
    is_flagged: bool = False


@dataclass
class ReRankingConfig:
    """Configuration for re-ranking business rules.

    Attributes:
        max_per_category: Maximum items from a single category in output.
        min_fresh_items: Minimum items published within freshness_window.
        freshness_window_hours: How recent an item must be to count as fresh.
        output_size: Number of items to return.
        promoted_boost: Score multiplier for promoted content.
    """
    max_per_category: int = 3
    min_fresh_items: int = 2
    freshness_window_hours: int = 24
    output_size: int = 20
    promoted_boost: float = 1.2


def rerank(
    items: List[RankedItem],
    config: ReRankingConfig,
    seen_item_ids: Set[int],
    now: Optional[datetime] = None,
) -> List[RankedItem]:
    """Apply business rules to produce final recommendation list.

    Enforces diversity, freshness, content policy, and business rules.
    Items are selected greedily: at each position, pick the highest-scoring
    item that does not violate any constraint.

    Args:
        items: Ranked items from the scoring stage (descending by score).
        config: Re-ranking configuration.
        seen_item_ids: Items the user has already seen (to suppress).
        now: Current timestamp (for freshness calculation).

    Returns:
        Final list of items to display.
    """
    if now is None:
        now = datetime.now()

    freshness_cutoff = now - timedelta(hours=config.freshness_window_hours)

    # Apply score adjustments
    adjusted_items = []
    for item in items:
        if item.is_flagged:
            continue  # Hard filter: never show flagged items
        if item.item_id in seen_item_ids:
            continue  # Suppress already-seen items

        adjusted_score = item.score
        if item.is_promoted:
            adjusted_score *= config.promoted_boost

        adjusted_items.append((item, adjusted_score))

    # Sort by adjusted score
    adjusted_items.sort(key=lambda x: x[1], reverse=True)

    # Greedy selection with constraints
    selected: List[RankedItem] = []
    category_counts: Dict[str, int] = {}
    fresh_count = 0

    for item, adj_score in adjusted_items:
        if len(selected) >= config.output_size:
            break

        # Check category diversity constraint
        cat_count = category_counts.get(item.category, 0)
        if cat_count >= config.max_per_category:
            continue

        selected.append(item)
        category_counts[item.category] = cat_count + 1
        if item.published_at >= freshness_cutoff:
            fresh_count += 1

    # Post-check: if freshness requirement not met, swap in fresh items
    if fresh_count < config.min_fresh_items:
        # Find fresh items not yet selected
        selected_ids = {item.item_id for item in selected}
        fresh_pool = [
            (item, score)
            for item, score in adjusted_items
            if item.published_at >= freshness_cutoff
            and item.item_id not in selected_ids
        ]
        # Replace lowest-scored non-fresh items
        needed = config.min_fresh_items - fresh_count
        non_fresh_indices = [
            i for i, item in enumerate(selected)
            if item.published_at < freshness_cutoff
        ]
        for idx in reversed(non_fresh_indices[-needed:]):
            if fresh_pool:
                fresh_item, _ = fresh_pool.pop(0)
                selected[idx] = fresh_item

    return selected


# Example: re-rank 50 items into 20
now = datetime(2026, 3, 25, 14, 0, 0)
sample_items = [
    RankedItem(
        item_id=i,
        score=1.0 - i * 0.015,
        # 10 categories: with max_per_category=3, only five categories
        # could yield at most 15 items, short of the 20-item output target.
        category=f"cat_{i % 10}",
        creator_id=f"creator_{i % 20}",
        published_at=now - timedelta(hours=i * 2),
        is_promoted=(i == 3),
        is_flagged=(i == 7),
    )
    for i in range(50)
]

config = ReRankingConfig()
final = rerank(sample_items, config, seen_item_ids={1, 15, 30}, now=now)
print(f"Final list: {len(final)} items")
print(f"Categories: {[item.category for item in final]}")
print(f"Fresh items: {sum(1 for item in final if item.published_at >= now - timedelta(hours=24))}")
print(f"Flagged items: {sum(1 for item in final if item.is_flagged)}")
Final list: 20 items
Categories: ['cat_3', 'cat_0', 'cat_2', 'cat_4', 'cat_5', 'cat_6', 'cat_8', 'cat_9', 'cat_0', 'cat_1', 'cat_2', 'cat_3', 'cat_4', 'cat_6', 'cat_7', 'cat_8', 'cat_9', 'cat_0', 'cat_1', 'cat_2']
Fresh items: 11
Flagged items: 0

24.4 Serving Patterns: Batch, Real-Time, and Near-Real-Time

The choice of serving pattern — how and when predictions are computed — is one of the most consequential architectural decisions in an ML system. It determines latency, cost, freshness, and the set of features available at prediction time.

Batch Serving

In batch serving, predictions are pre-computed for all users (or all relevant entities) on a schedule — typically daily or hourly. Results are stored in a key-value store and looked up at serving time.

Mechanics. A batch job iterates over all active users, retrieves their features, scores all candidate items (or a relevant subset), and writes the top-$k$ recommendations to a database:

from dataclasses import dataclass
from typing import List, Dict
from datetime import datetime


@dataclass
class BatchPredictionJob:
    """Specification for a batch prediction run.

    Attributes:
        model_version: Identifier for the model artifact.
        feature_snapshot_time: Timestamp of the feature snapshot used.
        user_segment: Which user segment to process (or "all").
        top_k: Number of recommendations to pre-compute per user.
        output_table: Destination for pre-computed recommendations.
    """
    model_version: str
    feature_snapshot_time: datetime
    user_segment: str = "all"
    top_k: int = 50
    output_table: str = "recommendations.batch_predictions"

    def compute_staleness(self, request_time: datetime) -> float:
        """Compute how stale the pre-computed predictions are.

        Args:
            request_time: When the user makes the request.

        Returns:
            Staleness in hours.
        """
        delta = request_time - self.feature_snapshot_time
        return delta.total_seconds() / 3600.0


# If the batch runs at 2am and a user requests at 8pm, predictions
# are 18 hours stale.
batch_job = BatchPredictionJob(
    model_version="dcn_v2_20260324",
    feature_snapshot_time=datetime(2026, 3, 24, 2, 0, 0),
)
staleness = batch_job.compute_staleness(datetime(2026, 3, 24, 20, 0, 0))
print(f"Prediction staleness: {staleness:.0f} hours")
Prediction staleness: 18 hours

Advantages. Simple infrastructure (no model serving endpoint needed). Can use arbitrarily expensive models (no latency constraint). Easy to validate predictions before they reach users.

Disadvantages. Stale predictions that do not reflect recent behavior. Cannot personalize based on session context (time of day, device, session history). Wasteful: computes predictions for users who may never log in.

When to use batch serving. When the prediction does not change based on real-time context (e.g., daily email recommendations), when the item corpus changes slowly, or when the computation is too expensive for real-time serving (e.g., a model that takes 10 seconds per user).

Real-Time Serving

In real-time serving, predictions are computed on-the-fly for each request. The model runs as a service, accepting feature vectors and returning predictions.

Mechanics. A model serving framework (TensorFlow Serving, Triton Inference Server, or a custom service) loads the model into memory, receives prediction requests over gRPC or HTTP, runs inference on GPU or CPU, and returns results.

Latency anatomy. Every millisecond matters. Here is a typical breakdown for a real-time recommendation request:

| Step | p50 | p95 | p99 |
| --- | --- | --- | --- |
| Network (client → API gateway) | 5ms | 10ms | 25ms |
| Feature store lookup | 8ms | 15ms | 35ms |
| Candidate retrieval (FAISS ANN) | 12ms | 22ms | 40ms |
| Ranking model inference (500 items) | 35ms | 55ms | 80ms |
| Re-ranking | 5ms | 8ms | 12ms |
| Network (API gateway → client) | 5ms | 10ms | 25ms |
| Total | 70ms | 120ms | 217ms |

The p99 latency (217ms) slightly exceeds the 200ms budget. This is typical: the p50 is comfortable, but the tail distribution is what breaks SLAs. Managing p99 latency requires:

  • Hedged requests. Send the feature store lookup to two replicas; use whichever responds first.
  • Timeouts. If candidate retrieval exceeds 30ms, return results from whichever sources have responded.
  • Caching. Cache feature store results for recently active users (reduces p99 from 35ms to 15ms).
  • Model optimization. Quantize the ranking model (Chapter 26) to reduce inference time by 2-3x.
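The hedged-request tactic can be sketched with the standard library alone. Here `lookup_replica` and the replica names are hypothetical stand-ins for a real feature-store client; the point is the control flow, not the I/O:

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def lookup_replica(replica: str, user_id: str) -> dict:
    """Hypothetical feature-store read against a single replica."""
    return {"replica": replica, "user_id": user_id, "features": {}}


def hedged_lookup(user_id: str, replicas: list, timeout_s: float = 0.03) -> dict:
    """Send the same lookup to every replica; use whichever answers first.

    Trades extra read load for a tighter tail latency: one slow replica
    no longer drags up the p99 of the whole request.
    """
    pool = ThreadPoolExecutor(max_workers=len(replicas))
    futures = [pool.submit(lookup_replica, r, user_id) for r in replicas]
    done, _ = wait(futures, timeout=timeout_s, return_when=FIRST_COMPLETED)
    pool.shutdown(wait=False, cancel_futures=True)  # abandon the stragglers
    if not done:
        raise TimeoutError("all replicas exceeded the hedging timeout")
    return next(iter(done)).result()


result = hedged_lookup("user_42", ["redis-a", "redis-b"])
print(result["replica"])
```

The same pattern applies to the timeout tactic: `wait` with a deadline, then serve whatever subset of sources has responded.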

Advantages. Fresh predictions using the latest user behavior. Can incorporate real-time context. No wasted computation.

Disadvantages. Requires model serving infrastructure with GPU/CPU provisioning. Latency constraints limit model complexity. Requires a feature store that serves features in real-time.

Near-Real-Time Serving

Near-real-time serving is a hybrid: a streaming system processes events (user clicks, item uploads, feature updates) with low latency (seconds to minutes) and updates a cache or feature store that the serving layer reads from.

Example: session-aware recommendations. When a user starts a session, the system computes fresh embeddings from their last 10 interactions (near-real-time feature update). When the user requests recommendations, the serving layer uses these fresh embeddings with a pre-built candidate index.
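The streaming side of this example reduces to a small event handler. A minimal sketch, with a plain dict standing in for the online store (production would use Redis or similar):

```python
from collections import defaultdict, deque

# Stand-in for the online store's user_last_10_interactions feature.
online_store = defaultdict(lambda: deque(maxlen=10))


def on_interaction_event(user_id: str, item_id: int) -> None:
    """Stream-processor handler: runs within seconds of each event.

    The deque keeps only the ten most recent item IDs, so the serving
    layer reads fresh session context without any model retraining.
    """
    online_store[user_id].append(item_id)


for item_id in [101, 102, 103]:
    on_interaction_event("user_42", item_id)
print(list(online_store["user_42"]))  # [101, 102, 103]
```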

from typing import Dict, List


class ServingPattern:
    """Comparison of serving patterns with their characteristics."""

    @staticmethod
    def comparison_table() -> List[Dict[str, str]]:
        """Return a comparison of serving patterns.

        Returns:
            List of dictionaries describing each pattern.
        """
        return [
            {
                "pattern": "Batch",
                "latency": "N/A (pre-computed)",
                "freshness": "Hours to days",
                "cost_per_prediction": "Low (amortized)",
                "infrastructure": "Batch compute (Spark, etc.)",
                "best_for": "Stable preferences, email, notifications",
            },
            {
                "pattern": "Real-time",
                "latency": "50-200ms",
                "freshness": "Instant",
                "cost_per_prediction": "High (GPU serving)",
                "infrastructure": "Model server, feature store, GPU cluster",
                "best_for": "Interactive UX, session-dependent ranking",
            },
            {
                "pattern": "Near-real-time",
                "latency": "1-60 seconds (feature update)",
                "freshness": "Seconds to minutes",
                "cost_per_prediction": "Medium",
                "infrastructure": "Stream processor, feature store",
                "best_for": "Session features, trending signals, fraud",
            },
        ]


for pattern in ServingPattern.comparison_table():
    print(f"\n{pattern['pattern']} Serving:")
    print(f"  Latency: {pattern['latency']}")
    print(f"  Freshness: {pattern['freshness']}")
    print(f"  Best for: {pattern['best_for']}")

Batch Serving:
  Latency: N/A (pre-computed)
  Freshness: Hours to days
  Best for: Stable preferences, email, notifications

Real-time Serving:
  Latency: 50-200ms
  Freshness: Instant
  Best for: Interactive UX, session-dependent ranking

Near-real-time Serving:
  Latency: 1-60 seconds (feature update)
  Freshness: Seconds to minutes
  Best for: Session features, trending signals, fraud

Simplest Model That Works: The correct serving pattern is the simplest one that meets the product requirements. If batch serving with a daily update provides 95% of the engagement value of real-time serving, the engineering cost of real-time infrastructure may not be justified. Netflix's home page uses batch-computed recommendations for most rows, with a single real-time "Continue Watching" row. The sophisticated pattern is reserved for the cases where it matters.


24.5 Feature Stores: The Bridge Between Training and Serving

The feature store is the component that ensures the features used during model training are identical to the features available during model serving. This consistency requirement sounds trivial. It is, in practice, the single most common source of production ML failures.

The Training-Serving Skew Problem

Training-serving skew occurs when the feature values a model sees at prediction time differ systematically from the feature values it saw during training. Common causes:

  1. Implementation mismatch. The training pipeline computes user_avg_session_length using a SQL query on a data warehouse; the serving pipeline computes it using an in-memory cache that handles null values differently.
  2. Temporal leakage. During training, features include information from the future (e.g., the feature "number of items user watched this week" includes data from after the prediction point). At serving time, this future information is unavailable.
  3. Stale features. Training uses the latest feature values, but serving uses a cache that has not been refreshed.
  4. Missing features. A new user has no historical features. The training pipeline drops these rows; the serving pipeline must handle them.

The insidious aspect of training-serving skew is that it produces no errors — the model still outputs predictions, and those predictions look plausible. The model simply performs worse than expected, and the gap between offline evaluation (which uses correctly computed features) and online performance (which uses skewed features) grows.
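One way to close this gap is to log the feature vector exactly as the model saw it at serving time, then reuse the logged values for training and for offline-online comparison. A minimal sketch of such logging, with a plain list standing in for a warehouse table:

```python
from datetime import datetime, timezone

served_feature_log: list = []  # Stand-in for a warehouse table


def log_served_features(user_id: str, features: dict, score: float) -> None:
    """Record the exact inputs the model scored, at the moment of scoring.

    Training on these logged rows (rather than recomputing features later)
    makes training and serving inputs identical by construction.
    """
    served_feature_log.append({
        "user_id": user_id,
        "ts": datetime.now(timezone.utc).isoformat(),
        "features": dict(features),  # copy: guard against later mutation
        "score": score,
    })


log_served_features("user_42", {"user_avg_session_length_min": 11.0}, 0.83)
print(served_feature_log[0]["features"])
```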

Feature Store Architecture

A feature store has two planes:

The offline store serves the training pipeline. It stores historical feature values keyed by (entity_id, timestamp), enabling point-in-time correct feature retrieval. When you train a model on data from six months ago, the offline store provides the feature values as they existed at that time, preventing temporal leakage.

The online store serves the prediction pipeline. It stores the current feature values for each entity in a low-latency key-value store (Redis, DynamoDB, Bigtable). When a serving request arrives, the online store returns the latest features in single-digit milliseconds.

from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List


class FeatureType(Enum):
    """Classification of features by computation pattern."""
    BATCH = "batch"           # Computed in batch pipelines (daily/hourly)
    STREAMING = "streaming"   # Updated by stream processors (seconds)
    ON_DEMAND = "on_demand"   # Computed at request time


@dataclass
class FeatureDefinition:
    """Schema for a single feature in the feature store.

    Attributes:
        name: Feature name (must be unique within entity type).
        entity_type: The entity this feature describes (user, item, etc.).
        dtype: Data type (float, int, string, list[float], etc.).
        feature_type: How the feature is computed and updated.
        default_value: Value to use when the feature is missing.
        description: Human-readable description.
        owner: Team or person responsible.
        ttl_hours: Time-to-live in the online store (0 = no expiry).
    """
    name: str
    entity_type: str
    dtype: str
    feature_type: FeatureType
    default_value: Any = None
    description: str = ""
    owner: str = ""
    ttl_hours: int = 0


@dataclass
class FeatureStoreSchema:
    """Schema registry for all features in the feature store.

    Provides a single source of truth for feature definitions,
    ensuring training and serving use the same feature semantics.

    Attributes:
        features: Dictionary mapping feature name to definition.
    """
    features: Dict[str, FeatureDefinition] = field(default_factory=dict)

    def register(self, feature: FeatureDefinition) -> None:
        """Register a feature definition.

        Args:
            feature: Feature to register.

        Raises:
            ValueError: If feature name is already registered.
        """
        if feature.name in self.features:
            raise ValueError(
                f"Feature '{feature.name}' is already registered. "
                f"Use update() to modify existing features."
            )
        self.features[feature.name] = feature

    def get_features_by_entity(
        self, entity_type: str
    ) -> List[FeatureDefinition]:
        """Get all features for an entity type.

        Args:
            entity_type: Entity type to filter by.

        Returns:
            List of feature definitions for the entity type.
        """
        return [
            f for f in self.features.values()
            if f.entity_type == entity_type
        ]

    def get_features_by_type(
        self, feature_type: FeatureType
    ) -> List[FeatureDefinition]:
        """Get all features of a given computation type.

        Args:
            feature_type: Computation pattern to filter by.

        Returns:
            List of matching feature definitions.
        """
        return [
            f for f in self.features.values()
            if f.feature_type == feature_type
        ]

    def consistency_report(self) -> Dict[str, Any]:
        """Report potential consistency issues.

        Checks for features without defaults (risky in serving),
        features without owners (maintenance risk), and streaming
        features without TTL (staleness risk).

        Returns:
            Dictionary of potential issues.
        """
        issues = {
            "no_default": [
                f.name for f in self.features.values()
                if f.default_value is None
            ],
            "no_owner": [
                f.name for f in self.features.values()
                if not f.owner
            ],
            "streaming_no_ttl": [
                f.name for f in self.features.values()
                if f.feature_type == FeatureType.STREAMING
                and f.ttl_hours == 0
            ],
        }
        return issues


# Define StreamRec's feature store schema
schema = FeatureStoreSchema()

# User features
schema.register(FeatureDefinition(
    name="user_embedding_128d",
    entity_type="user",
    dtype="list[float]",
    feature_type=FeatureType.BATCH,
    default_value=[0.0] * 128,
    description="128-dim user embedding from two-tower model",
    owner="rec-ml",
    ttl_hours=48,
))
schema.register(FeatureDefinition(
    name="user_avg_session_length_min",
    entity_type="user",
    dtype="float",
    feature_type=FeatureType.BATCH,
    default_value=12.0,  # Population median
    description="Average session length in minutes (30-day window)",
    owner="rec-ml",
    ttl_hours=48,
))
schema.register(FeatureDefinition(
    name="user_last_10_interactions",
    entity_type="user",
    dtype="list[int]",
    feature_type=FeatureType.STREAMING,
    default_value=[],
    description="Last 10 item IDs interacted with (real-time)",
    owner="rec-platform",
    ttl_hours=24,
))
schema.register(FeatureDefinition(
    name="user_category_preferences",
    entity_type="user",
    dtype="dict[str, float]",
    feature_type=FeatureType.BATCH,
    default_value={},
    description="Bayesian posterior means per category (Ch. 20)",
    owner="rec-ml",
    ttl_hours=48,
))

# Item features
schema.register(FeatureDefinition(
    name="item_embedding_128d",
    entity_type="item",
    dtype="list[float]",
    feature_type=FeatureType.BATCH,
    default_value=[0.0] * 128,
    description="128-dim item embedding from two-tower model",
    owner="rec-ml",
    ttl_hours=168,
))
schema.register(FeatureDefinition(
    name="item_ctr_7d",
    entity_type="item",
    dtype="float",
    feature_type=FeatureType.BATCH,
    default_value=0.02,  # Global average CTR
    description="Click-through rate over the last 7 days",
    owner="rec-platform",
    ttl_hours=24,
))
schema.register(FeatureDefinition(
    name="item_trending_score",
    entity_type="item",
    dtype="float",
    feature_type=FeatureType.STREAMING,
    default_value=0.0,
    description="Real-time trending score (engagement velocity)",
    owner="rec-platform",
    ttl_hours=6,
))

# Report
print("Feature store schema:")
print(f"  Total features: {len(schema.features)}")
for entity in ["user", "item"]:
    features = schema.get_features_by_entity(entity)
    print(f"  {entity} features: {len(features)}")

print(f"\nBy computation type:")
for ft in FeatureType:
    features = schema.get_features_by_type(ft)
    print(f"  {ft.value}: {len(features)}")

issues = schema.consistency_report()
print(f"\nConsistency issues:")
for issue_type, names in issues.items():
    print(f"  {issue_type}: {names if names else 'None'}")
Feature store schema:
  Total features: 7
  user features: 4
  item features: 3

By computation type:
  batch: 5
  streaming: 2
  on_demand: 0

Consistency issues:
  no_default: None
  no_owner: None
  streaming_no_ttl: None

Online-Offline Consistency

The critical guarantee of a feature store is online-offline consistency: for any entity at any point in time, the online store and the offline store must return the same feature values (or as close as the system's latency allows). This is achieved by:

  1. Single feature computation path. The batch pipeline that writes to the offline store also writes to the online store (or a shared intermediate layer writes to both).
  2. Backfill. When a new feature is added, it is backfilled in the offline store so that historical training data has the feature computed correctly.
  3. Point-in-time joins. The training pipeline performs temporal joins that retrieve feature values as of the training example's timestamp — never using future data.

Without these guarantees, the model trains on one version of reality and serves in another. The result is training-serving skew.
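The point-in-time join in guarantee (3) can be sketched with pandas' `merge_asof`, which selects, for each training example, the most recent feature value at or before the example's timestamp. The tiny frames here are illustrative:

```python
import pandas as pd

# Training examples: (user, prediction timestamp, label).
examples = pd.DataFrame({
    "user_id": ["u1", "u1"],
    "ts": pd.to_datetime(["2026-03-01", "2026-03-10"]),
    "label": [1, 0],
})

# Feature history: value of the feature as of each write time.
features = pd.DataFrame({
    "user_id": ["u1", "u1", "u1"],
    "ts": pd.to_datetime(["2026-02-25", "2026-03-05", "2026-03-15"]),
    "avg_session_len": [10.0, 12.0, 14.0],
})

# For each example, keep the latest feature row with feature.ts <= example.ts
# -- a future value (2026-03-15) can never leak into training.
joined = pd.merge_asof(
    examples.sort_values("ts"),
    features.sort_values("ts"),
    on="ts", by="user_id", direction="backward",
)
print(joined["avg_session_len"].tolist())  # [10.0, 12.0]
```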


24.6 Training-Serving Skew: Detection and Prevention

Training-serving skew deserves its own section because it is both the most common production ML failure and the hardest to detect. Unlike a crashed server or a timeout error, skew produces plausible predictions — just slightly worse ones.

Categories of Skew

| Category | Cause | Detection | Prevention |
| --- | --- | --- | --- |
| Feature skew | Different code computes features in training vs. serving | Compare feature distributions offline vs. online | Use the feature store for both |
| Data skew | Training data distribution differs from serving distribution | Monitor input feature distributions over time | Retrain on recent data; detect drift (Ch. 30) |
| Label skew | Labels computed differently in training vs. evaluation | Audit label pipelines; compare online/offline metrics | Single label definition source |
| Temporal skew | Training includes future information unavailable at serve time | Point-in-time correctness checks | Strict point-in-time joins in the feature store |

Skew Detection in Practice

The most effective skew detector is simple: compute summary statistics of each feature during training and compare them to the same statistics during serving.

from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class FeatureStatistics:
    """Summary statistics for a single feature.

    Attributes:
        name: Feature name.
        mean: Mean value.
        std: Standard deviation.
        p1: 1st percentile.
        p50: Median.
        p99: 99th percentile.
        null_fraction: Fraction of null/missing values.
        n_samples: Number of samples used to compute statistics.
    """
    name: str
    mean: float
    std: float
    p1: float
    p50: float
    p99: float
    null_fraction: float
    n_samples: int


def detect_skew(
    training_stats: FeatureStatistics,
    serving_stats: FeatureStatistics,
    mean_threshold: float = 0.5,
    std_threshold: float = 0.5,
    null_threshold: float = 0.05,
) -> Dict[str, Any]:
    """Detect training-serving skew for a single feature.

    Compares summary statistics from the training distribution and the
    serving distribution. Flags features where the relative difference
    exceeds configurable thresholds.

    Args:
        training_stats: Statistics from the training dataset.
        serving_stats: Statistics from recent serving requests.
        mean_threshold: Relative mean difference threshold (in std units).
        std_threshold: Relative std difference threshold.
        null_threshold: Absolute null fraction difference threshold.

    Returns:
        Dictionary with skew detection results.
    """
    results = {"feature": training_stats.name, "alerts": []}

    # Mean shift (in units of training std)
    if training_stats.std > 1e-8:
        mean_shift = abs(
            training_stats.mean - serving_stats.mean
        ) / training_stats.std
        if mean_shift > mean_threshold:
            results["alerts"].append(
                f"MEAN_SHIFT: {mean_shift:.2f} std "
                f"(train={training_stats.mean:.4f}, "
                f"serve={serving_stats.mean:.4f})"
            )

    # Std change
    if training_stats.std > 1e-8:
        std_ratio = serving_stats.std / training_stats.std
        if abs(std_ratio - 1.0) > std_threshold:
            results["alerts"].append(
                f"STD_CHANGE: ratio={std_ratio:.2f} "
                f"(train={training_stats.std:.4f}, "
                f"serve={serving_stats.std:.4f})"
            )

    # Null fraction change
    null_diff = abs(
        training_stats.null_fraction - serving_stats.null_fraction
    )
    if null_diff > null_threshold:
        results["alerts"].append(
            f"NULL_RATE_CHANGE: diff={null_diff:.3f} "
            f"(train={training_stats.null_fraction:.3f}, "
            f"serve={serving_stats.null_fraction:.3f})"
        )

    results["is_skewed"] = len(results["alerts"]) > 0
    return results


# Example: detect skew in StreamRec's user_avg_session_length feature
train_stats = FeatureStatistics(
    name="user_avg_session_length_min",
    mean=12.3, std=5.1, p1=1.2, p50=11.0, p99=32.0,
    null_fraction=0.001, n_samples=5_000_000,
)
serve_stats = FeatureStatistics(
    name="user_avg_session_length_min",
    mean=14.8, std=6.3, p1=1.5, p50=13.2, p99=38.0,
    null_fraction=0.015, n_samples=100_000,
)

skew_result = detect_skew(
    train_stats, serve_stats,
    # Tighter than the defaults so that borderline shifts are flagged.
    mean_threshold=0.3, std_threshold=0.2, null_threshold=0.01,
)
print(f"Feature: {skew_result['feature']}")
print(f"Skewed: {skew_result['is_skewed']}")
for alert in skew_result["alerts"]:
    print(f"  {alert}")
Feature: user_avg_session_length_min
Skewed: True
  MEAN_SHIFT: 0.49 std (train=12.3000, serve=14.8000)
  STD_CHANGE: ratio=1.24 (train=5.1000, serve=6.3000)
  NULL_RATE_CHANGE: diff=0.014 (train=0.001, serve=0.015)

In this example, the mean has shifted by 0.49 standard deviations (borderline), the standard deviation has increased by 24% (notable), and the null rate has increased 15x from 0.1% to 1.5% (a clear issue). The null rate increase might indicate a data pipeline problem: perhaps a new user cohort is missing session length data because their sessions are tracked by a different logging system.


24.7 Designing for Reliability

ML systems fail. The question is not whether but how — and whether the failure is graceful or catastrophic.

Failure Taxonomy

| Failure Type | Example | Impact | Mitigation |
| --- | --- | --- | --- |
| Model server crash | OOM on a large batch | No predictions | Load balancer routes to healthy replicas |
| Feature store timeout | Redis cluster slow | Missing features → garbage predictions | Default values; fallback to batch features |
| Stale model | Retraining pipeline fails for 3 days | Predictions degrade as data drifts | Monitor model age; alert if beyond threshold |
| Data pipeline failure | Schema change in upstream source | Features computed incorrectly or not at all | Data contracts; schema validation (Ch. 28) |
| Cascading failure | Retrieval timeout → ranking overloaded | Entire system unavailable | Circuit breakers; backpressure |
| Silent degradation | Feature drift; no alerts triggered | Model performance degrades without detection | Comprehensive monitoring (Ch. 30) |

Graceful Degradation

The principle of graceful degradation says: when a component fails, the system should fall back to a simpler but still useful behavior, not fail entirely.

For StreamRec, the degradation hierarchy is:

  1. Full system healthy. Multi-source retrieval → deep ranking → re-ranking → personalized recommendations.
  2. Real-time features unavailable. Fall back to batch features. Recommendations are slightly stale but still personalized.
  3. Ranking model unavailable. Use retrieval scores directly (two-tower embedding similarity). Quality drops but the system serves.
  4. Feature store unavailable. Use a popularity-based fallback: return the globally most popular items. Not personalized, but better than an error page.
  5. Complete outage. Return a static cached response from the last successful serving.

from dataclasses import dataclass
from typing import Dict, List, Optional
from enum import Enum


class DegradationLevel(Enum):
    """Levels of graceful degradation, from best to worst."""
    FULL = "full"                    # All systems nominal
    STALE_FEATURES = "stale_features"  # Real-time features unavailable
    SIMPLE_RANKING = "simple_ranking"  # Ranking model down
    POPULARITY = "popularity"          # Feature store down
    CACHED = "cached"                  # Complete outage, serve cache
    ERROR = "error"                    # Nothing works


@dataclass
class FallbackChain:
    """Ordered chain of fallback strategies for recommendation serving.

    Each level is tried in order. If a level's health check passes,
    its serving strategy is used. If not, the system falls to the next level.

    Attributes:
        levels: Ordered list of (DegradationLevel, health_check, strategy).
        current_level: The currently active degradation level.
    """
    levels: Optional[List[tuple]] = None
    current_level: DegradationLevel = DegradationLevel.FULL

    def evaluate(
        self, health_checks: Dict[str, bool]
    ) -> DegradationLevel:
        """Determine the current degradation level based on health checks.

        Checks are ordered from most to least severe, so when multiple
        components are down, the worst applicable level wins.

        Args:
            health_checks: Dictionary mapping component name to health status.

        Returns:
            The best degradation level the healthy components support.
        """
        if all(health_checks.values()):
            return DegradationLevel.FULL

        if not health_checks.get("retrieval", True):
            return DegradationLevel.CACHED

        if not health_checks.get("feature_store", True):
            return DegradationLevel.POPULARITY

        # Retrieval is healthy here, so its scores can replace the ranker.
        if not health_checks.get("ranking_model", True):
            return DegradationLevel.SIMPLE_RANKING

        if not health_checks.get("streaming_features", True):
            if health_checks.get("batch_features", False):
                return DegradationLevel.STALE_FEATURES

        # Some other component (e.g. batch features) is degraded.
        return DegradationLevel.STALE_FEATURES

    def log_degradation(
        self, previous: DegradationLevel, current: DegradationLevel
    ) -> Optional[str]:
        """Generate alert message when degradation level changes.

        Args:
            previous: Previous degradation level.
            current: New degradation level.

        Returns:
            Alert message if degraded, None if improved or unchanged.
        """
        level_order = list(DegradationLevel)
        prev_idx = level_order.index(previous)
        curr_idx = level_order.index(current)

        if curr_idx > prev_idx:
            return (
                f"DEGRADATION: {previous.value} -> {current.value}. "
                f"Service quality reduced."
            )
        elif curr_idx < prev_idx:
            return (
                f"RECOVERY: {previous.value} -> {current.value}. "
                f"Service quality improved."
            )
        return None


# Example: streaming features go down
fallback = FallbackChain()

health_all_good = {
    "streaming_features": True,
    "batch_features": True,
    "ranking_model": True,
    "feature_store": True,
    "retrieval": True,
}

health_streaming_down = {
    "streaming_features": False,
    "batch_features": True,
    "ranking_model": True,
    "feature_store": True,
    "retrieval": True,
}

health_ranking_down = {
    "streaming_features": False,
    "batch_features": True,
    "ranking_model": False,
    "feature_store": True,
    "retrieval": True,
}

for scenario_name, health in [
    ("All healthy", health_all_good),
    ("Streaming down", health_streaming_down),
    ("Streaming + ranking down", health_ranking_down),
]:
    level = fallback.evaluate(health)
    print(f"{scenario_name}: {level.value}")
All healthy: full
Streaming down: stale_features
Streaming + ranking down: simple_ranking

Circuit Breakers

A circuit breaker is a pattern borrowed from electrical engineering: when a downstream service fails repeatedly, the circuit "opens" and stops sending requests, giving the failing service time to recover instead of being overwhelmed with requests that will fail anyway.

For ML systems, circuit breakers are essential between the serving layer and the feature store, between the serving layer and the model server, and between the API gateway and the serving layer. Without circuit breakers, a slow feature store can cause the serving layer to accumulate pending requests, exhaust its thread pool, and crash — turning a feature store slowdown into a complete system outage.

from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Any, Dict, Optional


@dataclass
class CircuitBreaker:
    """Circuit breaker for a downstream ML system component.

    States:
        CLOSED: Normal operation. Requests pass through.
        OPEN: Circuit tripped. Requests fail fast (fallback used).
        HALF_OPEN: Testing recovery. Limited requests pass through.

    Attributes:
        name: Component name this circuit breaker protects.
        failure_threshold: Failures before circuit opens.
        recovery_timeout_seconds: How long to wait before testing recovery.
        half_open_max_requests: Requests allowed in HALF_OPEN state.
        failure_count: Current consecutive failure count.
        state: Current circuit state.
        last_failure_time: Timestamp of last failure.
    """
    name: str
    failure_threshold: int = 5
    recovery_timeout_seconds: float = 30.0
    half_open_max_requests: int = 3
    failure_count: int = 0
    state: str = "CLOSED"
    last_failure_time: Optional[datetime] = None
    _half_open_attempts: int = 0

    def record_success(self) -> None:
        """Record a successful request. Resets failure count."""
        self.failure_count = 0
        if self.state == "HALF_OPEN":
            self._half_open_attempts += 1
            if self._half_open_attempts >= self.half_open_max_requests:
                self.state = "CLOSED"
                self._half_open_attempts = 0

    def record_failure(self, now: Optional[datetime] = None) -> None:
        """Record a failed request. May trip the circuit.

        Args:
            now: Current timestamp.
        """
        if now is None:
            now = datetime.now()
        self.failure_count += 1
        self.last_failure_time = now
        if self.failure_count >= self.failure_threshold:
            self.state = "OPEN"

    def should_allow_request(
        self, now: Optional[datetime] = None
    ) -> bool:
        """Check whether a request should be allowed through.

        Args:
            now: Current timestamp.

        Returns:
            True if the request should proceed, False if it should
            fail fast and use the fallback.
        """
        if now is None:
            now = datetime.now()

        if self.state == "CLOSED":
            return True

        if self.state == "OPEN":
            # Check if recovery timeout has elapsed
            if self.last_failure_time is not None:
                elapsed = (now - self.last_failure_time).total_seconds()
                if elapsed >= self.recovery_timeout_seconds:
                    self.state = "HALF_OPEN"
                    self._half_open_attempts = 0
                    return True
            return False

        if self.state == "HALF_OPEN":
            return self._half_open_attempts < self.half_open_max_requests

        return False

    def status_report(self) -> Dict[str, Any]:
        """Report circuit breaker status for monitoring.

        Returns:
            Dictionary with circuit breaker state.
        """
        return {
            "component": self.name,
            "state": self.state,
            "failure_count": self.failure_count,
            "threshold": self.failure_threshold,
        }


# Simulate feature store degradation
cb = CircuitBreaker(name="feature_store", failure_threshold=3)
now = datetime(2026, 3, 25, 14, 0, 0)

print("Simulating feature store failures:")
for i in range(5):
    allowed = cb.should_allow_request(now)
    if allowed:
        # Simulate failure
        cb.record_failure(now)
        print(f"  Request {i+1}: allowed={allowed}, "
              f"state={cb.state}, failures={cb.failure_count}")
    else:
        print(f"  Request {i+1}: allowed={allowed}, "
              f"state={cb.state} (fail fast, use fallback)")
    now += timedelta(seconds=1)

# Wait for recovery timeout
now += timedelta(seconds=30)
allowed = cb.should_allow_request(now)
print(f"\nAfter 30s recovery timeout:")
print(f"  Request: allowed={allowed}, state={cb.state}")
Simulating feature store failures:
  Request 1: allowed=True, state=CLOSED, failures=1
  Request 2: allowed=True, state=CLOSED, failures=2
  Request 3: allowed=True, state=OPEN, failures=3
  Request 4: allowed=False, state=OPEN (fail fast, use fallback)
  Request 5: allowed=False, state=OPEN (fail fast, use fallback)

After 30s recovery timeout:
  Request: allowed=True, state=HALF_OPEN

24.8 Architecture Decision Records (ADRs)

Every ML system involves dozens of design decisions: which serving pattern, which feature store technology, how many retrieval sources, what latency budget per stage, when to retrain, whether to use shadow mode before production. These decisions are made once, but their consequences persist for years. Six months later, when a new team member asks "why do we use batch serving for this model?" the answer is often "nobody remembers."

An Architecture Decision Record (ADR) is a short document that captures a single architectural decision: the context, the options considered, the decision made, and the consequences. ADRs create an institutional memory that prevents re-arguing settled questions, explains constraints to new team members, and surfaces assumptions that may have changed since the decision was made.

ADR Template

# ADR-{number}: {title}

## Status
{Proposed | Accepted | Deprecated | Superseded by ADR-XXX}

## Context
What is the issue? What forces are at play? What constraints exist?

## Decision
What is the decision that was made?

## Options Considered
1. Option A: [brief description]
   - Pros: ...
   - Cons: ...
2. Option B: [brief description]
   - Pros: ...
   - Cons: ...
3. Option C (chosen): [brief description]
   - Pros: ...
   - Cons: ...

## Consequences
What are the results of this decision? What trade-offs are accepted?
What new constraints does this create?

## Review Date
When should this decision be revisited? What conditions would trigger
a reconsideration?

Example ADR: StreamRec Serving Pattern

# ADR-007: Serving Pattern for StreamRec Home Page Recommendations

## Status
Accepted (2026-01-15)

## Context
StreamRec's home page requires personalized recommendations for
50 million active users. The product team requires:
- Sub-200ms end-to-end latency for the recommendation API.
- Recommendations that reflect the user's current session context
  (e.g., recent interactions within the current session).
- Support for 10,000 requests per second at peak.

The team has 3 ML engineers and 2 backend engineers. Infrastructure
budget: $150K/month for serving.

## Decision
Use real-time serving with a multi-stage architecture:
1. Multi-source candidate retrieval (30ms budget)
2. Deep ranking model on GPU (50ms budget)
3. Rule-based re-ranking (20ms budget)

Batch fallback: pre-compute top-50 per user daily for use when
real-time serving is degraded.

## Options Considered
1. Pure batch serving
   - Pros: Simple infrastructure, no GPU serving cost, easy validation
   - Cons: Cannot use session features, 18-hour staleness, wastes
     compute for inactive users
   - Rejected: Session context is critical for engagement; A/B test
     showed +12% engagement for session-aware vs. batch recommendations

2. Pure real-time serving (no batch fallback)
   - Pros: Maximum freshness, no pre-computation waste
   - Cons: No fallback during outages; error page shown to users
   - Rejected: Availability requirement (99.9%) mandates a fallback

3. Hybrid real-time + batch fallback (chosen)
   - Pros: Session-aware when healthy, graceful degradation when not
   - Cons: Dual infrastructure cost; must maintain batch pipeline
   - Accepted: Best balance of quality, availability, and cost

## Consequences
- Must maintain both real-time and batch serving infrastructure.
- GPU serving cluster adds ~$80K/month to infrastructure costs.
- Batch pipeline runs daily at 2am; must complete before 6am.
- Monitoring must track which serving mode (real-time vs. batch
  fallback) is used per request for accurate A/B testing.
- Feature store must support both online (real-time) and offline
  (batch) access patterns.

## Review Date
2026-07-15 (6 months). Reconsider if: (a) batch A/B test gap
narrows below +5%, (b) infrastructure costs exceed $200K/month,
(c) system availability drops below 99.5%.

Why ADRs Matter for ML Systems

ADRs are particularly important in ML because:

  1. ML decisions have non-obvious dependencies. The choice to use real-time serving implies the need for an online feature store, which implies the need for a streaming feature pipeline, which implies stream processing infrastructure. The ADR makes these downstream implications explicit.

  2. ML systems evolve through experimentation. A decision made based on an A/B test result ("+12% engagement for session-aware recommendations") should be re-evaluated if the test's conditions change (new model architecture, different user population, product redesign).

  3. ML teams rotate. The engineer who designed the serving architecture may leave the team. Without ADRs, the institutional knowledge leaves with them.

Write ADRs for every decision that would take more than one hour to reverse. If reversal is trivial, the decision does not need an ADR.
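The review-date field only pays off if someone checks it. A minimal sketch of an ADR index with an overdue-review query — the `AdrRecord` fields mirror the template above, but the helper itself is hypothetical, not part of any standard ADR tooling:

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class AdrRecord:
    """Index entry for one ADR; fields mirror the template above."""
    number: int
    title: str
    status: str  # Proposed | Accepted | Deprecated | Superseded
    review_date: date


def overdue_adrs(adrs: List[AdrRecord], today: date) -> List[AdrRecord]:
    """Return accepted ADRs whose scheduled review date has passed."""
    return [
        adr for adr in adrs
        if adr.status == "Accepted" and adr.review_date <= today
    ]


adrs = [
    AdrRecord(7, "Serving pattern for home page recs", "Accepted",
              date(2026, 7, 15)),
    AdrRecord(9, "Feature store technology", "Accepted",
              date(2027, 1, 1)),
    AdrRecord(3, "Batch-only serving design", "Superseded",
              date(2025, 6, 1)),
]

for adr in overdue_adrs(adrs, today=date(2026, 8, 1)):
    print(f"ADR-{adr.number:03d} due for review: {adr.title}")
```

Running a check like this in CI surfaces stale decisions automatically instead of relying on calendar discipline.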


24.9 The Credit Scoring Anchor: Batch vs. Real-Time

The Content Platform Recommender demands real-time serving because session context drives engagement. But not all ML systems need real-time inference. The Credit Scoring anchor example illustrates the opposite end of the spectrum.

The Decision Context

A bank receives 10,000 credit applications per day. Each application requires a creditworthiness score. The score determines the interest rate offered to the applicant. Two serving patterns are viable:

Batch serving. Applications received during the day are scored overnight. Applicants receive their decision the next morning. The model has access to the full data warehouse: credit bureau reports, transaction history, employment verification, and alternative data sources (rent payments, utility bills). No latency constraint; the model can be a large ensemble of gradient-boosted trees, a neural network, and a logistic regression model, with model disagreement flagged for human review.

Real-time serving. Applications are scored within 30 seconds. Applicants receive an instant decision. The model has access to a subset of features: credit bureau score (via API call), application data (income, employment, address), and cached historical data. The latency constraint limits model complexity and feature breadth.

from dataclasses import dataclass
from typing import List


@dataclass
class ServingTradeoff:
    """Analysis of serving pattern trade-offs for a specific use case.

    Attributes:
        use_case: Description of the use case.
        batch_pros: Advantages of batch serving.
        batch_cons: Disadvantages of batch serving.
        realtime_pros: Advantages of real-time serving.
        realtime_cons: Disadvantages of real-time serving.
        recommendation: Which pattern to use and why.
    """
    use_case: str
    batch_pros: List[str]
    batch_cons: List[str]
    realtime_pros: List[str]
    realtime_cons: List[str]
    recommendation: str


credit_analysis = ServingTradeoff(
    use_case="Credit application scoring (10K applications/day)",
    batch_pros=[
        "Full feature access (all data sources, no latency limit)",
        "Ensemble of complex models (XGBoost + neural + logistic)",
        "Model disagreement detection → human review for edge cases",
        "Simpler infrastructure (no model serving endpoint)",
        "Easier auditability (batch logs, full feature snapshots)",
    ],
    batch_cons=[
        "Next-day decision → customer dropout (estimated 15-25%)",
        "Cannot respond to real-time fraud signals",
        "Competitive disadvantage vs. instant-decision competitors",
    ],
    realtime_pros=[
        "Instant decision → higher conversion rate (+20-30%)",
        "Can incorporate real-time fraud signals",
        "Better customer experience",
    ],
    realtime_cons=[
        "Fewer features available (no time for full bureau pull)",
        "Simpler model (latency constraint)",
        "Model auditability more complex (serving logs, not batch)",
        "Higher infrastructure cost (model serving, feature store)",
    ],
    recommendation=(
        "HYBRID: Real-time pre-approval with batch final decision. "
        "Show applicant an instant pre-approval (simple model, "
        "limited features, conservative thresholds). Run full model "
        "overnight for final terms. This captures conversion benefit "
        "while preserving model quality for final decision."
    ),
)

print(f"Use case: {credit_analysis.use_case}")
print(f"\nRecommendation: {credit_analysis.recommendation}")
Use case: Credit application scoring (10K applications/day)

Recommendation: HYBRID: Real-time pre-approval with batch final decision. Show applicant an instant pre-approval (simple model, limited features, conservative thresholds). Run full model overnight for final terms. This captures conversion benefit while preserving model quality for final decision.

The hybrid pattern — real-time for the initial decision, batch for the final decision — is common in financial services. It applies the "simplest model that works" principle at the system level: use the simple real-time model where speed matters (conversion), and the complex batch model where accuracy matters (final terms).

Regulatory Constraints on Serving Patterns

Credit scoring in the United States is subject to the Equal Credit Opportunity Act (ECOA) and the Fair Credit Reporting Act (FCRA), which require lenders to provide adverse action notices — specific reasons why an applicant was denied credit or offered unfavorable terms. This requirement constrains the serving architecture:

  • The model must be interpretable enough to generate reason codes (Chapter 35).
  • The serving system must log every feature value, model score, and decision for auditability.
  • The batch pipeline must retain full feature snapshots for regulatory examination.

These constraints favor batch serving for the final decision, because batch pipelines are easier to audit and reason about. The ADR for this system would document the regulatory requirements as a primary constraint.
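One common scorecard technique for generating reason codes — not prescribed here, but widely used — ranks features by the points an applicant lost relative to the maximum attainable score. The feature names and point values below are purely illustrative:

```python
from typing import Dict, List


def reason_codes(
    points: Dict[str, float],
    max_points: Dict[str, float],
    top_k: int = 3,
) -> List[str]:
    """Rank features by points lost versus the best attainable score.

    The features where the applicant fell furthest below the maximum
    attainable points become the stated adverse action reasons.
    """
    lost = {name: max_points[name] - points[name] for name in points}
    return sorted(lost, key=lost.get, reverse=True)[:top_k]


# Illustrative scorecard: applicant's points vs. maximum per feature.
applicant = {"payment_history": 40.0, "utilization": 10.0,
             "credit_age": 25.0, "recent_inquiries": 18.0}
best = {"payment_history": 50.0, "utilization": 45.0,
        "credit_age": 30.0, "recent_inquiries": 20.0}

print(reason_codes(applicant, best))
# → ['utilization', 'payment_history', 'credit_age']
```

Because the computation is a deterministic function of logged feature values, it works equally well in the batch pipeline and at serving time, provided every feature value is logged with the decision.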


24.10 Experimentation Infrastructure

Deploying a new model to production is not a launch — it is a hypothesis. "We believe model v2 will increase engagement by 3% compared to model v1." The experimentation infrastructure tests this hypothesis rigorously before committing to a full rollout.

Deployment Strategies

| Strategy | Description | Risk | When to Use |
|---|---|---|---|
| Shadow mode | New model runs in parallel; predictions logged but not served | Zero user impact | First deployment of any new model |
| Canary | New model serves 1-5% of traffic | Very low | After shadow mode validates basic correctness |
| A/B test | Randomized 50/50 split (or other ratio) | Moderate | Full evaluation of model quality |
| Gradual rollout | 1% → 5% → 25% → 50% → 100% | Controlled | Scaling a validated model to full traffic |
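The gradual-rollout strategy can be sketched as a simple controller that advances one stage only when health checks pass and rolls back fully otherwise. The stage sequence matches the table; the health thresholds are illustrative assumptions:

```python
from dataclasses import dataclass

ROLLOUT_STAGES = [1, 5, 25, 50, 100]  # percent of traffic at each stage


@dataclass
class HealthCheck:
    """Health snapshot for the model currently receiving rollout traffic."""
    error_rate: float
    p99_latency_ms: float


def next_stage(
    current_pct: int,
    health: HealthCheck,
    max_error_rate: float = 0.001,
    max_p99_ms: float = 100.0,
) -> int:
    """Advance one rollout stage if healthy; roll back to 0% otherwise."""
    healthy = (
        health.error_rate <= max_error_rate
        and health.p99_latency_ms <= max_p99_ms
    )
    if not healthy:
        return 0  # full rollback: all traffic returns to the old model
    if current_pct >= ROLLOUT_STAGES[-1]:
        return current_pct  # rollout complete
    # 0% (pre-rollout) advances to the first stage; otherwise step forward
    idx = ROLLOUT_STAGES.index(current_pct) if current_pct in ROLLOUT_STAGES else -1
    return ROLLOUT_STAGES[idx + 1]


pct = 0
checks = [
    HealthCheck(0.0003, 85.0),  # healthy
    HealthCheck(0.0004, 90.0),  # healthy
    HealthCheck(0.0020, 88.0),  # error rate breaches the limit
]
for health in checks:
    pct = next_stage(pct, health)
    print(f"rollout at {pct}%")  # 1%, then 5%, then rollback to 0%
```

The key design choice is that rollback is total, not incremental: a model that fails at 25% of traffic goes back to 0%, not 5%, because the failure invalidates the evidence that earlier stages accumulated.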

Shadow Mode

Shadow mode is the zero-risk first step for any new model. The production system serves predictions from the current model. Simultaneously, a shadow pipeline sends the same requests to the new model and logs the predictions. No user ever sees the shadow model's output.

Shadow mode validates:

  1. Correctness. The new model produces predictions in the expected range and format.
  2. Latency. The new model meets the latency budget.
  3. Consistency. The new model's predictions are correlated with (but hopefully better than) the current model's predictions.
  4. Error handling. The new model handles edge cases (missing features, new users, new items) without crashing.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ShadowModeReport:
    """Summary of shadow mode evaluation comparing production and shadow models.

    Attributes:
        production_model: Name/version of the production model.
        shadow_model: Name/version of the shadow model.
        n_requests: Total shadow requests processed.
        prediction_correlation: Pearson correlation between production
            and shadow predictions.
        shadow_latency_p50_ms: Shadow model p50 latency.
        shadow_latency_p99_ms: Shadow model p99 latency.
        shadow_error_rate: Fraction of shadow requests that errored.
        prediction_distribution_shift: KL divergence of prediction
            distributions.
    """
    production_model: str
    shadow_model: str
    n_requests: int
    prediction_correlation: float
    shadow_latency_p50_ms: float
    shadow_latency_p99_ms: float
    shadow_error_rate: float
    prediction_distribution_shift: float

    def is_ready_for_canary(
        self,
        min_requests: int = 10000,
        max_error_rate: float = 0.001,
        max_p99_latency_ms: float = 100.0,
        min_correlation: float = 0.5,
    ) -> Tuple[bool, List[str]]:
        """Evaluate whether the shadow model is ready for canary deployment.

        Args:
            min_requests: Minimum shadow requests for statistical validity.
            max_error_rate: Maximum acceptable error rate.
            max_p99_latency_ms: Maximum acceptable p99 latency.
            min_correlation: Minimum correlation with production predictions.

        Returns:
            (ready, list_of_blocking_issues) tuple.
        """
        issues = []

        if self.n_requests < min_requests:
            issues.append(
                f"Insufficient requests: {self.n_requests} < {min_requests}"
            )
        if self.shadow_error_rate > max_error_rate:
            issues.append(
                f"Error rate too high: {self.shadow_error_rate:.4f} > "
                f"{max_error_rate:.4f}"
            )
        if self.shadow_latency_p99_ms > max_p99_latency_ms:
            issues.append(
                f"p99 latency too high: {self.shadow_latency_p99_ms:.0f}ms > "
                f"{max_p99_latency_ms:.0f}ms"
            )
        if self.prediction_correlation < min_correlation:
            issues.append(
                f"Prediction correlation too low: "
                f"{self.prediction_correlation:.3f} < {min_correlation:.3f}"
            )

        return len(issues) == 0, issues


# Example shadow mode report
report = ShadowModeReport(
    production_model="dcn_v2_20260310",
    shadow_model="dcn_v2_20260324",
    n_requests=52_340,
    prediction_correlation=0.89,
    shadow_latency_p50_ms=42.0,
    shadow_latency_p99_ms=87.0,
    shadow_error_rate=0.00015,
    prediction_distribution_shift=0.023,
)

ready, issues = report.is_ready_for_canary()
print(f"Shadow model: {report.shadow_model}")
print(f"Requests evaluated: {report.n_requests:,}")
print(f"Prediction correlation with production: {report.prediction_correlation:.3f}")
print(f"p50 latency: {report.shadow_latency_p50_ms:.0f}ms")
print(f"p99 latency: {report.shadow_latency_p99_ms:.0f}ms")
print(f"Error rate: {report.shadow_error_rate:.4f}")
print(f"\nReady for canary: {ready}")
if issues:
    for issue in issues:
        print(f"  Blocking: {issue}")
Shadow model: dcn_v2_20260324
Requests evaluated: 52,340
Prediction correlation with production: 0.890
p50 latency: 42ms
p99 latency: 87ms
Error rate: 0.0002

Ready for canary: True

A/B Testing for ML Models

A/B testing for ML models differs from traditional web A/B testing in three ways:

  1. Metric delay. The primary metric (e.g., 30-day retention) may take weeks to mature. Proxy metrics (session engagement, next-day return) are used for faster decisions, but the relationship between proxy and primary metrics must be validated.

  2. Interference. In recommendation systems, treating a user differently affects their behavior, which affects the data used to train the next model iteration. This creates interference between treatment and control groups that standard A/B testing does not account for. Chapter 33 covers this in depth.

  3. Multiple metrics. An ML model change that increases click-through rate by 5% but decreases content completion rate by 3% is not clearly better. The experimentation platform must track a metric suite, and the decision criteria must be pre-specified to avoid p-hacking.
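The metric-delay point has a quantitative side: small lifts on rare-event metrics require very large samples, which is why proxy metrics are so tempting. A standard normal-approximation sample-size sketch (z-values fixed for the common case of alpha = 0.05 two-sided and 80% power; the baseline rate and lift below are illustrative):

```python
import math


def samples_per_arm(p_base: float, rel_lift: float) -> int:
    """Approximate per-arm sample size for a two-proportion test.

    Normal-approximation formula with z-values fixed for the common
    case: alpha = 0.05 (two-sided, z = 1.96), power = 0.8 (z = 0.84).
    """
    z_alpha, z_beta = 1.96, 0.84
    p_new = p_base * (1 + rel_lift)
    variance = p_base * (1 - p_base) + p_new * (1 - p_new)
    n = (z_alpha + z_beta) ** 2 * variance / (p_new - p_base) ** 2
    return math.ceil(n)


# Detecting a 3% relative lift on a 5% baseline rate:
print(f"{samples_per_arm(0.05, 0.03):,} users per arm")
```

A 3% relative lift on a 5% baseline needs hundreds of thousands of users per arm, while a 10% lift needs roughly an order of magnitude fewer — which is why pre-specifying the minimum detectable effect is part of the experiment design, not an afterthought.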


24.11 The StreamRec Application: System Architecture Design (Milestone 9)

We now bring together every concept from this chapter to design the complete StreamRec system architecture. This is the ninth milestone in the progressive project — the point where the models from Parts I-IV are assembled into a production system.

Requirements

| Requirement | Value |
|---|---|
| Active users | 50 million |
| Item corpus | 200,000 items |
| Recommendations per request | 20 |
| End-to-end latency (p95) | 200ms |
| Peak throughput | 10,000 requests/second |
| Availability SLA | 99.9% |
| Model freshness | Retraining every 24 hours |
| Feature freshness | Batch: daily; Streaming: < 60 seconds |

System Diagram

┌──────────────────────────────────────────────────────────────────────┐
│                         StreamRec System                             │
│                                                                      │
│  ┌─────────┐    ┌──────────────┐    ┌──────────────────────────────┐│
│  │  Client  │───▶│  API Gateway │───▶│        Serving Layer         ││
│  │  (App)   │◀───│  (5ms)       │◀───│                              ││
│  └─────────┘    └──────────────┘    │  ┌──────────────────────┐    ││
│                                      │  │ Feature Store Lookup │    ││
│                                      │  │ (15ms)               │    ││
│                                      │  └──────────┬───────────┘    ││
│                                      │             │                 ││
│                                      │  ┌──────────▼───────────┐    ││
│                                      │  │ Candidate Retrieval  │    ││
│                                      │  │ (30ms)               │    ││
│                                      │  │  ├─ ANN (two-tower)  │    ││
│                                      │  │  ├─ CF (GNN)         │    ││
│                                      │  │  ├─ Content-based    │    ││
│                                      │  │  └─ Trending         │    ││
│                                      │  └──────────┬───────────┘    ││
│                                      │             │                 ││
│                                      │  ┌──────────▼───────────┐    ││
│                                      │  │ Ranking (DCN-V2)     │    ││
│                                      │  │ (50ms, GPU)          │    ││
│                                      │  └──────────┬───────────┘    ││
│                                      │             │                 ││
│                                      │  ┌──────────▼───────────┐    ││
│                                      │  │ Re-Ranking           │    ││
│                                      │  │ (20ms)               │    ││
│                                      │  │ Diversity, freshness,│    ││
│                                      │  │ policy, fairness     │    ││
│                                      │  └──────────┬───────────┘    ││
│                                      │             │                 ││
│                                      │  ┌──────────▼───────────┐    ││
│                                      │  │ Response Assembly    │    ││
│                                      │  │ (5ms)               │    ││
│                                      │  └──────────────────────┘    ││
│                                      └──────────────────────────────┘│
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────────┐│
│  │                      Offline Layer                                ││
│  │                                                                   ││
│  │  ┌────────────┐   ┌──────────────┐   ┌───────────────────────┐  ││
│  │  │ Data       │──▶│ Feature      │──▶│ Training              │  ││
│  │  │ Pipeline   │   │ Engineering  │   │ (daily, GPU cluster)  │  ││
│  │  └────────────┘   └──────────────┘   └──────────┬────────────┘  ││
│  │                                                  │               ││
│  │  ┌────────────────────────────────┐  ┌──────────▼────────────┐  ││
│  │  │ Monitoring & Observability     │  │ Model Registry        │  ││
│  │  │ (drift, latency, quality)      │  │ (version, stage,      │  ││
│  │  └────────────────────────────────┘  │  validate)            │  ││
│  │                                      └──────────┬────────────┘  ││
│  │  ┌────────────────────────────────┐             │               ││
│  │  │ Experimentation Platform       │◀────────────┘               ││
│  │  │ (shadow, canary, A/B)          │                             ││
│  │  └────────────────────────────────┘                             ││
│  └──────────────────────────────────────────────────────────────────┘│
│                                                                      │
│  ┌──────────────────────────────────────────────────────────────────┐│
│  │                    Feature Store                                  ││
│  │  ┌──────────────┐          ┌────────────────┐                   ││
│  │  │ Online Store  │          │ Offline Store   │                   ││
│  │  │ (Redis/Dynamo)│          │ (Parquet/Delta) │                   ││
│  │  │ Latest values │          │ Historical +    │                   ││
│  │  │ < 5ms lookup  │          │ point-in-time   │                   ││
│  │  └──────────────┘          └────────────────┘                   ││
│  └──────────────────────────────────────────────────────────────────┘│
└──────────────────────────────────────────────────────────────────────┘

Latency Budget

| Stage | Budget (p95) | Notes |
|---|---|---|
| API gateway | 5ms | Routing, auth, rate limiting |
| Feature store lookup | 15ms | Parallel with retrieval start; Redis cluster |
| Candidate retrieval | 30ms | 4 sources in parallel; take union of completed |
| Ranking model | 50ms | DCN-V2 on GPU; 500 items at ~0.1ms each |
| Re-ranking | 20ms | CPU; greedy diversity/freshness selection |
| Response assembly | 5ms | Serialize response, add metadata |
| Total critical path | 125ms | 75ms headroom for p99 tail |
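The budget arithmetic is worth encoding as a guard so that a stage-budget change cannot silently exceed the SLA. A trivial check using the numbers from the table:

```python
# Stage budgets from the latency table, in milliseconds.
BUDGETS_MS = {
    "api_gateway": 5,
    "feature_lookup": 15,
    "retrieval": 30,
    "ranking": 50,
    "re_ranking": 20,
    "assembly": 5,
}
SLA_MS = 200  # end-to-end p95 requirement

critical_path = sum(BUDGETS_MS.values())
headroom = SLA_MS - critical_path
assert headroom > 0, "latency budget exceeds the SLA"
print(f"critical path: {critical_path}ms, headroom: {headroom}ms")
# → critical path: 125ms, headroom: 75ms
```

A check like this belongs next to the configuration that defines the budgets, so the invariant travels with the numbers it protects.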

ADR: StreamRec Serving Pattern (Summary)

Following the ADR template from Section 24.8:

  • Context: 50M users, sub-200ms latency, session-aware personalization required.
  • Decision: Real-time multi-stage serving with batch fallback.
  • Key trade-off: +$80K/month GPU infrastructure cost justified by +12% engagement over batch-only (A/B tested in pilot).
  • Consequence: Must maintain dual infrastructure (real-time + batch); feature store must serve both online and offline; monitoring must distinguish serving mode for clean A/B analysis.

Key Design Decisions

from dataclasses import dataclass
from typing import List


@dataclass
class DesignDecision:
    """A key design decision with its rationale.

    Attributes:
        decision: What was decided.
        rationale: Why this choice was made.
        alternatives_rejected: What was considered and rejected.
        risk: What could go wrong.
    """
    decision: str
    rationale: str
    alternatives_rejected: List[str]
    risk: str


streamrec_decisions = [
    DesignDecision(
        decision="Multi-source retrieval (4 sources) instead of single ANN",
        rationale=(
            "Single-source retrieval has a coverage ceiling: the two-tower "
            "model cannot retrieve items without embedding similarity, "
            "missing cold-start items and category-niche items. Multi-source "
            "retrieval improves Recall@500 by 18% in offline evaluation."
        ),
        alternatives_rejected=[
            "Single ANN retrieval (simpler, but lower coverage)",
            "6-source retrieval (marginal gain, +15ms latency)",
        ],
        risk="Merging 4 source results in 30ms requires all sources to "
             "respond in time. Mitigation: return whatever has arrived by "
             "the timeout.",
    ),
    DesignDecision(
        decision="GPU serving for ranking model (not CPU)",
        rationale=(
            "DCN-V2 scoring 500 items takes ~50ms on GPU (T4) and ~300ms "
            "on CPU. CPU serving would require reducing candidates to ~80 "
            "or using a simpler model, both of which degrade quality."
        ),
        alternatives_rejected=[
            "CPU serving with fewer candidates (too much quality loss)",
            "CPU serving with distilled model (quality loss > latency gain)",
        ],
        risk="GPU failure → no ranking. Mitigation: CPU fallback with a "
             "lightweight model (logistic regression on top features).",
    ),
    DesignDecision(
        decision="Batch fallback with daily pre-computation",
        rationale=(
            "99.9% availability SLA requires a fallback when real-time "
            "serving is degraded. Batch pre-computed recommendations are "
            "stale but personalized — better than popularity fallback."
        ),
        alternatives_rejected=[
            "No fallback (violates availability SLA)",
            "Popularity-only fallback (too large quality drop)",
        ],
        risk="Batch pipeline failure leaves stale fallback. Mitigation: "
             "alert if batch results are > 36 hours old.",
    ),
]

for i, d in enumerate(streamrec_decisions, 1):
    print(f"Decision {i}: {d.decision}")
    print(f"  Rationale: {d.rationale[:100]}...")
    print(f"  Risk: {d.risk[:80]}...")
    print()
Decision 1: Multi-source retrieval (4 sources) instead of single ANN
  Rationale: Single-source retrieval has a coverage ceiling: the two-tower model cannot retrieve items wi...
  Risk: Merging 4 source results in 30ms requires all sources to respond in time. Mitigation: re...

Decision 2: GPU serving for ranking model (not CPU)
  Rationale: DCN-V2 scoring 500 items takes ~50ms on GPU (T4) and ~300ms on CPU. CPU serving would requi...
  Risk: GPU failure → no ranking. Mitigation: CPU fallback with a lightweight model (logistic reg...

Decision 3: Batch fallback with daily pre-computation
  Rationale: 99.9% availability SLA requires a fallback when real-time serving is degraded. Batch pre-co...
  Risk: Batch pipeline failure leaves stale fallback. Mitigation: alert if batch results are > 36...

24.12 System Design as a Professional Discipline

This chapter opened with Sculley et al.'s observation that the model is 5% of the system. We have now built — at least in architectural blueprint — the other 95% for StreamRec.

The key ideas are structural, not technological:

The funnel architecture (retrieval → ranking → re-ranking) is a consequence of the latency-quality trade-off: you cannot run an expensive model on every item, so you use cheap models to narrow the candidate set and expensive models to refine it. This pattern applies to recommendation systems, search engines, advertising systems, and any application where the candidate space is large and the latency budget is small.

The feature store solves the training-serving consistency problem that has derailed more ML projects than any modeling error. By providing a single source of truth for feature computation, with online and offline access patterns, the feature store ensures that the model sees the same features in production that it saw during training.

Serving pattern selection (batch, real-time, near-real-time) is a system design decision, not a modeling decision. The choice depends on latency requirements, feature freshness needs, infrastructure budget, and team capacity. The simplest pattern that meets the requirements is the right choice.

Graceful degradation means designing for failure, not hoping for its absence. A recommendation system that serves popularity-based recommendations during a feature store outage is infinitely better than one that returns HTTP 500 errors.

ADRs create the institutional memory that ML teams desperately need. They document not just what was decided, but why — and under what conditions the decision should be revisited.

Production ML = Software Engineering: The shift from model building to system design is the shift from individual craft to engineering discipline. A model is an artifact; a system is a product. Building the system requires every skill from Parts I-IV — the models must be excellent — plus the software engineering skills covered in this chapter and the rest of Part V. The payoff is ML that works, reliably, at scale, in the real world.

The next six chapters fill in the components we have outlined here: Chapter 25 builds the feature store, Chapter 26 handles distributed training, Chapter 27 orchestrates the pipeline, Chapter 28 tests it, Chapter 29 deploys it, and Chapter 30 monitors it.


Summary

ML system design is the discipline of composing models, data pipelines, feature stores, serving infrastructure, monitoring, and experimentation into a reliable, scalable, maintainable production system. The model — however sophisticated — is a small fraction of the total system.

The funnel architecture (candidate retrieval → ranking → re-ranking) enables serving personalized recommendations from a large item corpus within a strict latency budget. Each stage trades precision for speed, narrowing the candidate set while increasing the computational cost per item.

Serving patterns — batch, real-time, and near-real-time — differ in latency, freshness, cost, and infrastructure complexity. The right choice depends on the product requirements, not the model architecture. Hybrid patterns (real-time with batch fallback) are common and effective.

Feature stores bridge the gap between training (offline, historical data) and serving (online, real-time lookup), preventing training-serving skew — the silent killer of production ML performance. Online-offline consistency, point-in-time correctness, and default value handling are the critical guarantees.

Graceful degradation designs for failure: when components fail, the system falls back to simpler but still useful behavior rather than returning errors. Circuit breakers, fallback chains, and popularity-based defaults are the mechanisms.

Architecture Decision Records document design choices with context, constraints, and consequences. They prevent knowledge loss, avoid re-litigation of settled decisions, and surface assumptions that may have changed.

The system architecture defines the boundaries within which every other production ML component operates. Chapters 25-30 fill in those components one by one.