Appendix H: ML System Design Patterns

Reference Architectures, Design Templates, and Decision Frameworks for Production Machine Learning


This appendix collects the system design patterns referenced throughout Part V (Chapters 24-30) into a single reference. Each pattern includes the architecture diagram, trade-off analysis, when to use it, and when to avoid it. The goal is not to replace the detailed treatment in the chapters, but to give you a quick-reference catalogue when you are designing a new system or conducting a design review.

Production ML systems are software systems first and ML systems second. The patterns here borrow heavily from software engineering (microservices, event-driven architecture, circuit breakers) and adapt them to the unique challenges of ML: data-dependent behavior, training-serving skew, model staleness, and the probabilistic nature of outputs.


H.1 — Serving Patterns

The serving pattern determines how predictions reach users. This is the most consequential architectural decision in an ML system because it constrains latency, throughput, cost, and the freshness of predictions.

H.1.1 — Batch Serving

graph LR
    A[Data Warehouse] --> B[Feature Pipeline]
    B --> C[Batch Inference Job]
    C --> D[Prediction Store / Cache]
    D --> E[Application API]
    E --> F[User]

How it works: Pre-compute predictions for all entities (users, items, queries) on a schedule (hourly, daily). Store predictions in a fast key-value store (Redis, DynamoDB). The application looks up pre-computed predictions at serving time.

Trade-offs:

| Dimension | Assessment |
| --- | --- |
| Latency | Excellent (cache lookup: <5ms) |
| Freshness | Poor (predictions are stale by up to one schedule period) |
| Cost | Low per-prediction (batch amortizes GPU cost) |
| Complexity | Low (no real-time model serving infrastructure) |
| Feature requirements | Offline features only (no real-time signals) |

When to use:
- The entity space is finite and enumerable (all users, all products)
- Prediction freshness of hours is acceptable
- Real-time features do not significantly improve quality
- The team lacks real-time serving infrastructure
- StreamRec example: pre-compute the top-100 recommendations for each user daily; serve from cache

When to avoid:
- The query space is too large to enumerate (e.g., arbitrary text queries in search)
- Real-time context matters (e.g., current session behavior, time-of-day)
- The application requires sub-second freshness
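The precompute-and-cache flow above can be sketched in a few lines. This is a minimal illustration, not a production job: a plain dict stands in for Redis/DynamoDB, and the `predict_top_k` model is a toy placeholder.

```python
from typing import Callable

def batch_precompute(
    user_ids: list[str],
    predict_top_k: Callable[[str], list[str]],
    store: dict[str, list[str]],
) -> None:
    """Scheduled job: score every enumerable entity and cache the result."""
    for user_id in user_ids:
        store[user_id] = predict_top_k(user_id)

def serve(user_id: str, store: dict[str, list[str]], default: list[str]) -> list[str]:
    """Request path: a pure cache lookup, with a fallback for unseen users."""
    return store.get(user_id, default)

# Toy model: recommend catalog items the user has not interacted with yet.
catalog = ["a", "b", "c", "d"]
history = {"u1": {"a"}, "u2": {"b", "c"}}
model = lambda uid: [i for i in catalog if i not in history.get(uid, set())][:3]

cache: dict[str, list[str]] = {}
batch_precompute(["u1", "u2"], model, cache)  # the nightly batch run
recs = serve("u1", cache, default=catalog[:3])  # <5ms lookup at request time
```

The request path never touches the model, which is what makes the latency profile so favorable; the cost is that `cache` only reflects the world as of the last batch run.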

H.1.2 — Real-Time Serving (Online Inference)

graph LR
    A[User Request] --> B[Application]
    B --> C[Feature Store: Online]
    C --> D[Model Server]
    D --> B
    B --> E[User Response]

How it works: Compute predictions on-the-fly for each request. The application sends features to a model server (TorchServe, TensorFlow Serving, Triton, BentoML, or a custom FastAPI service), which runs inference and returns the prediction.

Trade-offs:

| Dimension | Assessment |
| --- | --- |
| Latency | Variable (10ms-500ms depending on model complexity) |
| Freshness | Excellent (predictions use latest features) |
| Cost | High (GPU/CPU for every request) |
| Complexity | High (model server, feature store, load balancing) |
| Feature requirements | Can use real-time features (session, context) |

When to use:
- Predictions depend on real-time context (current query, session state, time)
- The query space is too large to enumerate
- Low-latency personalization is critical (fraud detection at point-of-sale, real-time bidding)
- Meridian Financial example: credit scoring at application time, incorporating the latest credit bureau data

When to avoid:
- The model is too large/slow for the latency budget (consider model distillation or batch serving)
- Traffic volume makes per-request GPU inference cost-prohibitive
- Offline features are sufficient (batch serving is simpler and cheaper)

Latency budget allocation (typical recommendation system):

| Stage | Budget | Operation |
| --- | --- | --- |
| Feature retrieval | 10-20ms | Online feature store lookup |
| Candidate retrieval | 10-30ms | ANN search (FAISS/ScaNN) |
| Ranking | 20-50ms | Neural ranker inference |
| Re-ranking | 5-10ms | Business rules, diversity |
| Total | 45-110ms | End-to-end |
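One way to keep a pipeline honest against a budget like this is to time each stage and flag violations. The sketch below is illustrative only; the stage names and budgets mirror the table, but a real system would emit these timings as metrics rather than return them.

```python
import time
from typing import Callable

# Per-stage budgets (ms), taken from the allocation table above.
BUDGET_MS = {
    "feature_retrieval": 20,
    "candidate_retrieval": 30,
    "ranking": 50,
    "re_ranking": 10,
}

def timed_stage(name: str, fn: Callable[[], object], timings: dict[str, float]):
    """Run one pipeline stage and record its wall-clock duration in ms."""
    start = time.perf_counter()
    result = fn()
    timings[name] = (time.perf_counter() - start) * 1000
    return result

def over_budget(timings: dict[str, float]) -> list[str]:
    """Stages that exceeded their individual budget (candidates for alerting)."""
    return [s for s, ms in timings.items() if ms > BUDGET_MS.get(s, float("inf"))]
```

In practice the p99 of these per-stage timings, not single samples, is what should be compared against the budget.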

H.1.3 — Lambda Architecture

graph TD
    A[Event Stream] --> B[Batch Layer]
    A --> C[Speed Layer]
    B --> D[Batch View]
    C --> E[Real-Time View]
    D --> F[Serving Layer]
    E --> F
    F --> G[Application]

How it works: Maintain two parallel processing paths. The batch layer processes the complete dataset on a schedule (high accuracy, high latency). The speed layer processes recent events in real-time (lower accuracy, low latency). The serving layer merges both views.

Trade-offs:

| Dimension | Assessment |
| --- | --- |
| Latency | Good (real-time path handles recent data) |
| Freshness | Good (real-time path captures recent events) |
| Cost | High (dual infrastructure) |
| Complexity | Very high (maintain two codepaths, reconcile views) |
| Correctness | Good (batch layer eventually corrects speed layer) |

When to use:
- You need both historical accuracy and real-time responsiveness
- The batch layer provides the "ground truth" model, and the speed layer provides recent adjustments
- StreamRec example: batch layer computes user preference profiles daily from full history; speed layer adjusts for current session interests

When to avoid:
- In almost all cases. Lambda architecture's dual-codebase problem (maintaining logic in both batch and streaming) is a maintenance nightmare. Prefer the Kappa architecture unless you have a compelling reason.

Production Reality: Nathan Marz introduced the Lambda architecture in 2011. By 2014, Jay Kreps had published "Questioning the Lambda Architecture," arguing that a single streaming system is simpler and often sufficient. Most modern ML systems that appear to use Lambda are actually using Kappa with a batch backfill mechanism.

H.1.4 — Kappa Architecture

graph LR
    A[Event Stream] --> B[Stream Processing Engine]
    B --> C[Serving Layer]
    C --> D[Application]
    A --> E[Event Log / Lake]
    E -->|Replay for retraining| B

How it works: A single stream processing engine handles all computation. Historical reprocessing is done by replaying the event log through the same streaming pipeline. There is no separate batch layer.

Trade-offs:

| Dimension | Assessment |
| --- | --- |
| Latency | Good (stream processing) |
| Freshness | Excellent (single real-time path) |
| Cost | Moderate (single infrastructure) |
| Complexity | Moderate (one codebase, but streaming is inherently harder) |
| Correctness | Depends on replay capability |

When to use:
- Real-time processing is the primary requirement
- The event log is the source of truth (Kafka, Kinesis)
- Historical reprocessing can be done by replaying events
- Most modern ML systems at companies with mature streaming infrastructure

When to avoid:
- The batch computation is fundamentally different from the streaming computation (e.g., batch requires full-dataset aggregations that streaming cannot efficiently approximate)
- The team lacks streaming expertise
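The defining property of Kappa is that live consumption and historical backfill share one codebase. A toy sketch (an in-memory list stands in for the Kafka/Kinesis log, and the per-user click count is a placeholder computation):

```python
from typing import Iterable

def process(event: dict, state: dict[str, int]) -> None:
    """The single stream computation: maintain per-user click counts."""
    if event["type"] == "click":
        state[event["user"]] = state.get(event["user"], 0) + 1

def run_pipeline(events: Iterable[dict]) -> dict[str, int]:
    """Used for BOTH live traffic and historical replay -- one codebase."""
    state: dict[str, int] = {}
    for event in events:
        process(event, state)
    return state

event_log = [
    {"type": "click", "user": "u1"},
    {"type": "view", "user": "u1"},
    {"type": "click", "user": "u1"},
]
live_state = run_pipeline(event_log)      # live consumption from the stream
replayed_state = run_pipeline(event_log)  # backfill by replaying the log
```

Because replay runs the same `process` function, a logic fix is deployed once and then backfilled by replaying the log; in Lambda, the same fix must be ported to two codebases.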


H.2 — Design Templates for Common ML Systems

H.2.1 — Recommendation System

This is the architecture built progressively throughout the book for the StreamRec platform.

graph TD
    subgraph Candidate Retrieval
        A1[User Features] --> B1[User Tower]
        A2[Item Features] --> B2[Item Tower]
        B1 --> C1[ANN Index - FAISS]
        B2 --> C1
        C1 --> D1["Top-K Candidates (K=500)"]
    end

    subgraph Ranking
        D1 --> E1[Feature Assembly]
        E1 --> F1[Neural Ranker]
        F1 --> G1["Scored Candidates (K=500)"]
    end

    subgraph Re-Ranking
        G1 --> H1[Diversity Filter]
        H1 --> I1[Fairness Constraints]
        I1 --> J1[Business Rules]
        J1 --> K1["Final Recommendations (N=20)"]
    end

Key design decisions:

| Decision | Options | StreamRec Choice | Rationale |
| --- | --- | --- | --- |
| Retrieval model | Two-tower, GNN, collaborative filtering | Two-tower (Ch. 13) | Scalable to 200K items, supports real-time user encoding |
| Ranking model | Gradient-boosted trees, deep neural network, transformer | Transformer (Ch. 10) | Captures sequential user behavior, attention weights are interpretable |
| ANN algorithm | FAISS (IVF-PQ), ScaNN, HNSW | FAISS IVF-PQ | Sub-millisecond retrieval for 200K items, good recall-latency trade-off |
| Serving pattern | Batch, real-time, hybrid | Hybrid (batch candidates + real-time ranking) | Balance freshness and cost |
| Exploration | Epsilon-greedy, UCB, Thompson sampling | Thompson sampling (Ch. 22) | Principled exploration with convergence guarantees |
| Fairness | Post-processing, in-processing | Post-processing re-ranking (Ch. 31) | Non-invasive, auditable, adjustable without retraining |

H.2.2 — Fraud Detection System

graph TD
    A[Transaction Event] --> B[Real-Time Feature Engine]
    B --> C{Rule Engine}
    C -->|High Risk| D[Block]
    C -->|Low Risk| E[Approve]
    C -->|Medium Risk| F[ML Scoring]
    F --> G{Score > Threshold?}
    G -->|Yes| H[Review Queue]
    G -->|No| E
    H --> I[Human Analyst]
    I --> J[Feedback Loop]
    J --> B

Key design principles:
- Two-stage architecture: Rules handle obvious cases (known fraud patterns, velocity checks); ML handles the ambiguous middle. Rules provide explainability and fast response; ML provides generalization.
- Latency constraint: The entire pipeline must complete within 100-200ms to avoid degrading user experience. The ML model must run inference in <50ms.
- High-recall orientation: Missing a fraud transaction (false negative) costs the company directly; flagging a legitimate transaction (false positive) causes friction but is not catastrophic. Optimize for recall at the cost of precision, and use the human review queue to handle false positives.
- Feedback loop: Every human decision feeds back into the training pipeline. Monitor for label bias: analysts may have systematic biases that propagate into training data.
- Concept drift: Fraud patterns evolve rapidly. Retrain frequently (daily or weekly) and monitor feature distributions for anomalies.

Feature engineering patterns for fraud:

| Feature Type | Examples | Computation |
| --- | --- | --- |
| Transaction-level | Amount, merchant category, time of day | Direct from event |
| Velocity features | Transactions in last 1h/24h/7d, unique merchants | Streaming aggregation |
| Behavioral deviation | Z-score of amount vs. user's history | Streaming + historical |
| Graph features | Shared device/address/phone with known fraud | Graph database lookup |
| Geo features | Distance from last transaction, country mismatch | Geospatial computation |
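A velocity feature such as "transactions in the last hour" reduces to a sliding-window count per entity. A minimal single-entity sketch (a production system would key one such window per user, typically inside a stream processor like Flink):

```python
from collections import deque

class VelocityFeature:
    """Sliding-window event count, e.g. 'transactions in the last 1h'."""

    def __init__(self, window_seconds: float):
        self.window = window_seconds
        self.events: deque = deque()  # event timestamps, oldest first

    def update(self, ts: float) -> int:
        """Record one event at time ts; return the count inside the window."""
        self.events.append(ts)
        # Evict everything older than the window boundary.
        while self.events and ts - self.events[0] > self.window:
            self.events.popleft()
        return len(self.events)

feat = VelocityFeature(window_seconds=3600)
```

Each `update` call is O(evictions), so the amortized cost per event is constant; memory is bounded by the event rate times the window length.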

H.2.3 — Search Ranking System

graph TD
    A[Query] --> B[Query Understanding]
    B --> C[Retrieval: BM25 + Semantic]
    C --> D["Candidate Pool (1000)"]
    D --> E[L1 Ranker: Lightweight]
    E --> F["Refined Pool (100)"]
    F --> G[L2 Ranker: Heavy Model]
    G --> H["Final Results (10)"]
    H --> I[Blending + Business Rules]
    I --> J[Search Results Page]

Key design decisions:
- Multi-stage ranking: The funnel architecture (1000 -> 100 -> 10) allows using progressively more expensive models at each stage, keeping total latency within budget.
- Hybrid retrieval: Combine lexical (BM25) and semantic (dense retrieval with bi-encoders) to capture both exact-match and meaning-based relevance.
- Learning to rank: The L2 ranker uses features from query-document interaction (attention scores, semantic similarity, click-through rate history) that are too expensive to compute for all candidates.
- Metrics: NDCG@10 (primary), MRR (for navigational queries), Recall@1000 (for retrieval stage).
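The funnel logic itself is simple; the engineering lives inside the scoring functions. A schematic sketch (the scorer signatures and the k1/k2 defaults are illustrative):

```python
from typing import Callable

def funnel(
    query: str,
    retrieve: Callable[[str], list[str]],    # cheap: ~1000 candidates
    l1_score: Callable[[str, str], float],   # lightweight ranker
    l2_score: Callable[[str, str], float],   # heavy ranker
    k1: int = 100,
    k2: int = 10,
) -> list[str]:
    """Apply progressively more expensive models to progressively fewer docs."""
    candidates = retrieve(query)
    refined = sorted(candidates, key=lambda d: l1_score(query, d), reverse=True)[:k1]
    return sorted(refined, key=lambda d: l2_score(query, d), reverse=True)[:k2]
```

With 1000 candidates, k1=100, and k2=10, the heavy L2 model is evaluated on only 10% of what the L1 model sees, which is what keeps the total latency within budget.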


H.3 — Decision Matrices

H.3.1 — Build vs. Buy Decision Matrix

Use this framework when deciding whether to build a component in-house or adopt a vendor solution.

| Factor | Build In-House | Buy / Use SaaS | Criterion |
| --- | --- | --- | --- |
| Differentiation | Component creates competitive advantage | Commodity capability (logging, monitoring) | Is this what makes your product special? |
| Customization | Requirements are unique to your domain | Standard requirements across industries | Do off-the-shelf solutions meet 80%+ of needs? |
| Control | Need full control over data, model, deployment | Acceptable to delegate to vendor | Data sensitivity, regulatory requirements? |
| Team expertise | Team has (or should develop) this capability | Capability is outside core competency | Does building this advance the team's mission? |
| Time-to-value | Long-term investment horizon (6+ months) | Need results in weeks | Can you afford the build timeline? |
| Maintenance | Willing to own long-term maintenance burden | Vendor handles upgrades, patches, scaling | Who carries the operational pager? |
| Cost | High upfront, lower marginal cost at scale | Lower upfront, higher marginal cost at scale | What is the 3-year total cost of ownership? |

Decision heuristic: Build the components that differentiate your product. Buy everything else. When in doubt, start with buy and migrate to build when you understand the requirements well enough to build something better.

StreamRec application:

| Component | Decision | Reasoning |
| --- | --- | --- |
| Recommendation models | Build | Core differentiator; unique to platform |
| Feature store | Build (on open-source: Feast) | Need customization for real-time features |
| Experiment platform | Buy (Statsig, Eppo) initially; build later | Standard A/B testing needs; build when scale demands |
| Monitoring | Buy (Datadog, Grafana Cloud) + custom ML metrics | Monitoring is commodity; ML-specific dashboards are custom |
| Vector database | Use managed (Pinecone, Weaviate Cloud) or self-host (FAISS) | Depends on scale and operational burden tolerance |
| Orchestration | Adopt open-source (Dagster, Airflow) | Mature OSS tools; no vendor lock-in benefit |

H.3.2 — Model Serving Infrastructure Decision Matrix

| Factor | TorchServe | Triton Inference Server | BentoML | Custom FastAPI |
| --- | --- | --- | --- | --- |
| Framework support | PyTorch native | Multi-framework (PyTorch, TF, ONNX) | Multi-framework | Any |
| Dynamic batching | Yes | Yes (advanced) | Yes | Manual |
| Model ensemble | Limited | Native support | Via pipeline | Manual |
| GPU optimization | Basic | TensorRT integration | Via runtime | Manual |
| Kubernetes-native | Yes | Yes | Yes (Yatai) | Manual |
| Learning curve | Moderate | Steep | Low | Depends |
| Best for | Pure PyTorch shops | Multi-model, high-throughput | Fast prototyping, mixed frameworks | Full control, simple models |

H.3.3 — Feature Store Decision Matrix

| Factor | Feast | Tecton | Hopsworks | Custom |
| --- | --- | --- | --- | --- |
| Cost | Free (OSS) | Commercial | Free (OSS) + Commercial | Engineering time |
| Online store | Redis, DynamoDB, etc. | Managed (DynamoDB) | RonDB (built-in) | Your choice |
| Offline store | BigQuery, Snowflake, etc. | S3/Snowflake/etc. | Hive/S3 | Your choice |
| Streaming features | Limited (push-based) | Native (Kafka/Kinesis) | Spark Streaming | Your choice |
| Point-in-time joins | Yes | Yes (optimized) | Yes | Must implement |
| Monitoring | Basic | Advanced | Moderate | Must implement |
| Best for | Startups, teams learning feature stores | Enterprise, complex real-time features | Teams wanting all-in-one platform | Teams with very specific requirements |

H.4 — Architecture Decision Record (ADR) Template

Every significant architectural decision should be documented in an ADR. This template is used in Chapter 24 (ML System Design) and Chapter 36 (Capstone).

# ADR-{number}: {Title}

## Status
[Proposed | Accepted | Deprecated | Superseded by ADR-{n}]

## Date
{YYYY-MM-DD}

## Context
What is the technical or business situation that requires a decision?
What constraints exist (latency, cost, team expertise, regulatory)?
What problem are we trying to solve?

## Decision
What is the architectural decision? State it clearly in one or two sentences.

## Alternatives Considered

### Alternative 1: {Name}
- Description: {How it works}
- Pros: {Advantages}
- Cons: {Disadvantages}
- Rejected because: {Specific reason}

### Alternative 2: {Name}
- Description: {How it works}
- Pros: {Advantages}
- Cons: {Disadvantages}
- Rejected because: {Specific reason}

## Consequences

### Positive
- {Benefit 1}
- {Benefit 2}

### Negative
- {Trade-off 1}
- {Trade-off 2}

### Risks
- {Risk 1}: Mitigation: {How we address it}
- {Risk 2}: Mitigation: {How we address it}

## Review Date
{When should this decision be re-evaluated?}

Example ADR for StreamRec:

# ADR-003: Hybrid Serving Architecture for Recommendations

## Status
Accepted

## Date
2025-06-15

## Context
StreamRec needs to serve personalized recommendations to 5M users.
Latency requirement: p99 < 200ms. The recommendation model uses both
historical user features (updated daily) and real-time session features
(updated per click). Pure batch serving cannot incorporate session
signals. Pure real-time serving is too expensive at our current scale
(estimated $45K/month in GPU costs for full real-time inference).

## Decision
Use a hybrid serving architecture: batch-compute candidate sets daily
(top-500 per user), store in Redis. At request time, retrieve the
pre-computed candidates and re-rank using a lightweight model that
incorporates real-time session features.

## Alternatives Considered

### Alternative 1: Pure Batch Serving
- Pros: Simple, cheap ($3K/month)
- Cons: Cannot use session features; recommendations are stale for
  up to 24 hours
- Rejected because: A/B tests show session features improve NDCG@10
  by 12%; 24h staleness is unacceptable for trending content

### Alternative 2: Pure Real-Time Serving
- Pros: Freshest predictions, uses all features
- Cons: $45K/month GPU cost, complex infrastructure, cold-start
  latency issues
- Rejected because: Cost is 15x batch; latency budget is tight for
  the full transformer model on every request

## Consequences

### Positive
- Incorporates session signals (12% NDCG improvement)
- Manageable cost ($8K/month: $3K batch + $5K real-time re-ranking)
- p99 latency ~120ms (Redis lookup + lightweight re-rank)

### Negative
- Candidate set is stale (up to 24h); new items are not in candidates
  until next batch run. Mitigation: inject trending items into
  candidate set via a real-time "boost" list.
- Two serving paths to maintain (batch generation + real-time
  re-ranking)

## Review Date
2025-12-15 (re-evaluate when user base exceeds 10M or GPU costs drop)

H.5 — Service Mesh Patterns for ML

When an ML system comprises multiple services (feature store, model server, candidate retrieval, re-ranking), a service mesh pattern manages their interactions.

H.5.1 — Circuit Breaker Pattern

import time
from enum import Enum
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitState(Enum):
    CLOSED = "closed"  # Normal operation
    OPEN = "open"  # Failing; reject requests immediately
    HALF_OPEN = "half_open"  # Testing if service has recovered


class CircuitBreaker:
    """Circuit breaker for ML model serving.

    Prevents cascading failures when a model service is unhealthy.
    Falls back to a simpler model or cached predictions.
    """

    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        half_open_max_calls: int = 3,
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.half_open_calls = 0

    def call(
        self,
        primary: Callable[[], T],
        fallback: Callable[[], T],
    ) -> T:
        """Execute primary function with circuit breaker protection.

        Args:
            primary: The main model serving function.
            fallback: Fallback (e.g., cached predictions, popularity baseline).

        Returns:
            Result from primary or fallback.
        """
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                self.half_open_calls = 0
            else:
                return fallback()

        try:
            result = primary()
            if self.state == CircuitState.HALF_OPEN:
                self.half_open_calls += 1
                if self.half_open_calls >= self.half_open_max_calls:
                    self.state = CircuitState.CLOSED
                    self.failure_count = 0
            else:
                self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN
            return fallback()

ML-specific considerations:
- The fallback for a recommendation model might be a popularity-based ranker (no personalization, but always available and fast)
- The fallback for a fraud model might be a rule-based system (higher false positive rate, but never misses known patterns)
- Monitor fallback activation rate as a system health metric

H.5.2 — Graceful Degradation Hierarchy

Design ML systems with a degradation hierarchy — a sequence of increasingly simpler models that maintain service availability at reduced quality.

| Level | Model | Latency | Quality | When Activated |
| --- | --- | --- | --- | --- |
| L0 (Full) | Transformer ranker + real-time features | 120ms | Best | Normal operation |
| L1 (Reduced) | Lightweight MLP ranker + cached features | 40ms | Good | GPU quota exceeded, feature store latency spike |
| L2 (Minimal) | Pre-computed batch recommendations | 5ms | Acceptable | Model server down, feature store down |
| L3 (Emergency) | Global popularity baseline | 2ms | Poor | Complete infrastructure failure |

Each level should be tested regularly. The StreamRec team runs monthly "chaos engineering" exercises where they deliberately disable components and verify that the system degrades gracefully through each level.
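The cascade through the hierarchy can be sketched as a first-success walk over the levels. This is a minimal illustration; the level functions below simulate failures, and in production each attempt would also emit a metric so that degraded operation is visible.

```python
from typing import Callable

def degrade(levels: list) -> tuple:
    """Walk the hierarchy (L0 -> L3), returning the first level that succeeds.

    Each entry is (level_name, serve_fn); a level signals failure by raising.
    """
    last_exc = None
    for name, serve_fn in levels:
        try:
            return name, serve_fn()
        except Exception as exc:  # in production: log + increment a fallback metric
            last_exc = exc
    raise RuntimeError("all degradation levels failed") from last_exc

def full_model() -> list:
    raise TimeoutError("GPU quota exceeded")  # simulate an L0 failure

def batch_cache() -> list:
    return ["cached_1", "cached_2"]  # pre-computed recommendations (L2)

level_used, recs = degrade([
    ("L0", full_model),
    ("L2", batch_cache),
    ("L3", lambda: ["global_top_1"]),
])
```

The `level_used` value is itself a monitoring signal: a rising share of L2/L3 responses means the system is quietly running in degraded mode.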

H.5.3 — Shadow Mode Deployment

graph TD
    A[User Request] --> B[Load Balancer]
    B --> C[Champion Model]
    B --> D[Challenger Model: Shadow]
    C --> E[User Response]
    D --> F[Shadow Predictions Log]
    F --> G[Offline Comparison]

Implementation considerations:
- Shadow predictions must not affect user experience (asynchronous evaluation)
- Log both champion and challenger predictions with identical features to ensure a fair comparison
- Monitor challenger latency independently — a slow challenger should not affect champion performance
- Run shadow mode for at least one full business cycle (typically 1-2 weeks) to capture temporal patterns
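The "asynchronous, off the request path" requirement can be sketched as follows. This is illustrative only: a `ThreadPoolExecutor` and an in-memory list stand in for a proper async logging pipeline, and the two models are stub functions.

```python
import concurrent.futures

shadow_log: list = []  # stand-in for a durable predictions log

def champion(features: dict) -> float:
    return 0.9  # serves the user

def challenger(features: dict) -> float:
    return 0.7  # shadow only; its output never reaches the user

executor = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def _shadow_score(features: dict, champion_pred: float) -> None:
    try:
        # Identical features for both models ensures a fair offline comparison.
        shadow_log.append({
            "features": features,
            "champion": champion_pred,
            "challenger": challenger(features),
        })
    except Exception:
        pass  # shadow failures must never affect the champion path

def handle_request(features: dict) -> float:
    prediction = champion(features)  # user-facing path, latency unchanged
    executor.submit(_shadow_score, features, prediction)  # fire-and-forget
    return prediction

result = handle_request({"user": "u1"})
executor.shutdown(wait=True)  # demo only; a server keeps the pool alive
```

The bare `except` in the shadow path is deliberate: any exception from the challenger is swallowed so it cannot propagate into the user-facing response.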


H.6 — Feature Store Architecture Patterns

H.6.1 — Dual-Store Pattern

The most common feature store pattern separates online serving from offline training.

graph TD
    subgraph Offline Path
        A[Raw Data Lake] --> B[Feature Pipeline: Spark/Dagster]
        B --> C[Offline Store: Parquet on S3]
        C --> D[Training Job: Point-in-Time Join]
    end

    subgraph Online Path
        B --> E[Online Store: Redis/DynamoDB]
        E --> F[Model Server: Feature Lookup]
    end

    subgraph Streaming Path
        G[Event Stream: Kafka] --> H[Streaming Pipeline: Flink/Spark]
        H --> E
    end

The critical invariant: Features computed for training (from the offline store, with point-in-time joins to prevent leakage) must produce the same values as features served in production (from the online store). This is the online-offline consistency problem — the single most common source of production ML bugs (Chapter 25).

Strategies for consistency:
1. Single computation: Compute features once (in the batch pipeline) and write to both stores simultaneously
2. Shared transformation logic: Define feature transformations in a single specification (e.g., Feast feature definitions) that generates both batch and streaming computation
3. Consistency monitoring: Regularly compare online feature values against offline feature values for the same entity and timestamp; alert on divergence
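The consistency-monitoring strategy reduces to a per-entity, per-feature comparison with a tolerance. A minimal sketch, assuming both stores can be sampled into plain dicts keyed by entity; in production the divergent pairs would feed an alert rather than a return value:

```python
def consistency_check(
    online: dict,
    offline: dict,
    tolerance: float = 1e-6,
) -> list:
    """Compare online vs offline feature values for the same entities.

    Returns (entity, feature) pairs that are missing online or diverge
    beyond the tolerance.
    """
    divergent = []
    for entity, features in offline.items():
        online_features = online.get(entity, {})
        for name, offline_value in features.items():
            online_value = online_features.get(name)
            if online_value is None or abs(online_value - offline_value) > tolerance:
                divergent.append((entity, name))
    return divergent
```

The tolerance matters: exact equality checks produce false alarms from floating-point differences between batch and streaming engines, while too loose a tolerance hides real transformation bugs.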

H.6.2 — Feature Computation Patterns

| Pattern | Description | Use Case | Example |
| --- | --- | --- | --- |
| Batch transform | Compute from full historical data on a schedule | Slowly-changing features | User lifetime engagement rate |
| Streaming aggregate | Maintain running aggregates from event stream | Recent activity features | Clicks in last 1 hour |
| On-demand compute | Compute at serving time from raw inputs | Context-dependent features | Distance between user location and item location |
| Pre-joined | Join multiple tables in the batch pipeline, store the result | Cross-entity features | Average rating of items in user's watch history |
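The on-demand pattern is the one that cannot be pre-computed: the user-to-item distance, for instance, only exists once both locations are known at request time. A standard haversine implementation as a sketch of such a feature:

```python
import math

def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """On-demand feature: great-circle distance (km), computed at serving time."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))
```

Because on-demand features run inside the request, their compute cost counts directly against the serving latency budget (Section H.1.2), which is why they are reserved for features that genuinely cannot be computed ahead of time.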

H.7 — Anti-Patterns to Avoid

H.7.1 — The God Model

Anti-pattern: One monolithic model that handles all use cases (recommendations, search, notifications, emails).

Why it fails: Different use cases have different latency requirements, different training data, and different optimization objectives. A single model cannot satisfy all constraints.

Fix: Decompose into specialized models with a shared feature platform. The StreamRec system uses separate models for candidate retrieval (two-tower), ranking (transformer), and notification triggering (lightweight classifier), all consuming from the same feature store.

H.7.2 — Training-Serving Skew Denial

Anti-pattern: Using different feature computation code for training (pandas in a notebook) and serving (SQL in a production pipeline), assuming they produce the same results.

Why it fails: Subtle differences in null handling, timestamp parsing, aggregation boundaries, and floating-point precision cause features to differ between training and serving. The model performs well in offline evaluation and poorly in production, with no obvious error.

Fix: Feature stores with shared transformation logic (Section H.6). Integration tests that compare training features against serving features for the same entity-timestamp pairs. Monitoring for feature distribution divergence.

H.7.3 — The Premature Microservice

Anti-pattern: Decomposing the ML system into 15 microservices before validating that the ML approach works at all.

Why it fails: The first version should answer: "Does this ML approach solve the business problem?" Microservice complexity makes iteration slow and debugging hard.

Fix: Start with a monolithic prototype (batch inference, simple serving). Decompose into services only after validating the approach and identifying scaling bottlenecks. The StreamRec team started with a single Python script that ran matrix factorization on a cron job and stored recommendations in PostgreSQL. They decomposed into the current architecture only after proving that personalization increased engagement by 20%.

H.7.4 — Monitoring by Dashboard Staring

Anti-pattern: Building beautiful dashboards but no automated alerts. Relying on humans to notice when metrics drift.

Why it fails: Humans check dashboards when they remember to, which is not at 3 AM on Sunday when the feature pipeline silently produces null values.

Fix: Every metric on the dashboard should have an associated alert with a threshold, escalation path, and runbook (Chapter 30). Dashboards are for investigation; alerts are for detection.

H.7.5 — The Reproducibility Illusion

Anti-pattern: Logging the model version and code commit but not the exact training data, feature definitions, hyperparameters, and random seeds.

Why it fails: When a model degrades in production, you need to reproduce the previous version exactly — including the data snapshot and the full configuration. Logging only the model artifact is insufficient because you cannot retrain or debug without the full provenance chain.

Fix: Version everything: code (git), data (DVC or snapshot IDs), features (feature store versions), hyperparameters (experiment tracker: MLflow, W&B), and environment (Docker image SHA). The StreamRec team's deployment pipeline records all of these in the model registry and refuses to deploy a model without complete provenance.


H.8 — Putting It All Together: StreamRec System Architecture

The complete StreamRec recommendation system, as built across Chapters 24-30 and integrated in Chapter 36, combines these patterns into a coherent architecture.

┌─────────────────────────────────────────────────────────────────────────┐
│                          StreamRec Architecture                         │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                         │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────────────┐    │
│  │ Event Stream  │────▶│ Streaming    │────▶│ Online Feature Store  │    │
│  │ (Kafka)       │     │ Pipeline     │     │ (Redis)               │    │
│  └──────┬───────┘     └──────────────┘     └──────────┬───────────┘    │
│         │                                              │                │
│         ▼                                              │                │
│  ┌──────────────┐     ┌──────────────┐                │                │
│  │ Data Lake     │────▶│ Batch Feature│────▶┌──────────┴───────────┐    │
│  │ (S3/Delta)    │     │ Pipeline     │     │ Offline Feature Store │    │
│  └──────────────┘     │ (Dagster)    │────▶│ (Parquet on S3)       │    │
│                        └──────────────┘     └──────────┬───────────┘    │
│                                                        │                │
│  ┌─────────────────────────────────────┐               │                │
│  │        Training Pipeline            │◀──────────────┘                │
│  │  Data → Features → Train → Eval    │                                │
│  │  → Register → Shadow → Canary      │                                │
│  └──────────────┬──────────────────────┘                                │
│                 │                                                        │
│                 ▼                                                        │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │                    Serving Layer                                  │   │
│  │  Request → Feature Lookup → Candidate Retrieval (Two-Tower+ANN) │   │
│  │  → Ranking (Transformer) → Re-Ranking (Diversity+Fairness)      │   │
│  │  → Response                                                      │   │
│  └──────────────────────────────────────────────────────────────────┘   │
│                                                                         │
│  ┌──────────────────────────────────────────────────────────────────┐   │
│  │                    Monitoring Layer                               │   │
│  │  Data Quality │ Feature Drift │ Model Health │ Business Metrics  │   │
│  │  Latency/SLOs │ Fairness     │ Alerts       │ Incident Response │   │
│  └──────────────────────────────────────────────────────────────────┘   │
└─────────────────────────────────────────────────────────────────────────┘

Design principles embodied:
- Dual-store feature architecture (H.6.1): online Redis for serving, offline Parquet for training
- Hybrid serving (H.1.1 + H.1.2): batch candidate generation + real-time re-ranking
- Graceful degradation (H.5.2): four levels from full model to popularity baseline
- Circuit breaker (H.5.1): protects against cascading failures in model serving
- Shadow mode (H.5.3): every model update spends one week in shadow before canary rollout
- Complete monitoring (Chapter 30): data quality through business metrics with automated alerting
- Full provenance (H.7.5): every model deployment records code, data, features, hyperparameters, and evaluation results


The patterns in this appendix are starting points, not prescriptions. Every system has unique constraints that require adaptation. The mark of a senior ML engineer is not knowing these patterns but knowing when to deviate from them — and documenting why in an ADR.