Appendix H: ML System Design Patterns
Reference Architectures, Design Templates, and Decision Frameworks for Production Machine Learning
This appendix collects the system design patterns referenced throughout Part V (Chapters 24-30) into a single reference. Each pattern includes the architecture diagram, trade-off analysis, when to use it, and when to avoid it. The goal is not to replace the detailed treatment in the chapters, but to give you a quick-reference catalogue when you are designing a new system or conducting a design review.
Production ML systems are software systems first and ML systems second. The patterns here borrow heavily from software engineering (microservices, event-driven architecture, circuit breakers) and adapt them to the unique challenges of ML: data-dependent behavior, training-serving skew, model staleness, and the probabilistic nature of outputs.
H.1 — Serving Patterns
The serving pattern determines how predictions reach users. This is the most consequential architectural decision in an ML system because it constrains latency, throughput, cost, and the freshness of predictions.
H.1.1 — Batch Serving
graph LR
A[Data Warehouse] --> B[Feature Pipeline]
B --> C[Batch Inference Job]
C --> D[Prediction Store / Cache]
D --> E[Application API]
E --> F[User]
How it works: Pre-compute predictions for all entities (users, items, queries) on a schedule (hourly, daily). Store predictions in a fast key-value store (Redis, DynamoDB). The application looks up pre-computed predictions at serving time.
Trade-offs:
| Dimension | Assessment |
|---|---|
| Latency | Excellent (cache lookup: <5ms) |
| Freshness | Poor (predictions are stale by up to one schedule period) |
| Cost | Low per-prediction (batch amortizes GPU cost) |
| Complexity | Low (no real-time model serving infrastructure) |
| Feature requirements | Offline features only (no real-time signals) |
When to use:
- The entity space is finite and enumerable (all users, all products)
- Prediction freshness of hours is acceptable
- Real-time features do not significantly improve quality
- The team lacks real-time serving infrastructure
- StreamRec example: pre-compute the top-100 recommendations for each user daily; serve from cache
When to avoid:
- The query space is too large to enumerate (e.g., arbitrary text queries in search)
- Real-time context matters (e.g., current session behavior, time-of-day)
- The application requires sub-second freshness
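The batch-serving flow can be sketched in a few lines. This is a minimal illustration, not a production implementation: the in-memory dict stands in for Redis/DynamoDB, and `score_user` is a hypothetical stand-in for the real model.

```python
from typing import Dict, List

def score_user(user_id: str, items: List[str]) -> List[str]:
    # Hypothetical model stand-in: rank items deterministically per user.
    return sorted(items, key=lambda item: hash((user_id, item)))

def run_batch_job(users: List[str], items: List[str],
                  store: Dict[str, List[str]], top_k: int = 100) -> None:
    # The scheduled job pre-computes top-K predictions for every
    # enumerable entity and writes them to the prediction store.
    for user_id in users:
        store[user_id] = score_user(user_id, items)[:top_k]

def serve(user_id: str, store: Dict[str, List[str]]) -> List[str]:
    # Serving is just a cache lookup; missing entities get a safe default.
    return store.get(user_id, [])

prediction_store: Dict[str, List[str]] = {}
run_batch_job(["u1", "u2"], [f"item{i}" for i in range(10)],
              prediction_store, top_k=3)
```

The serving path never touches the model, which is what makes the <5ms lookup latency in the table above achievable.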
H.1.2 — Real-Time Serving (Online Inference)
graph LR
A[User Request] --> B[Application]
B --> C[Feature Store: Online]
C --> D[Model Server]
D --> B
B --> E[User Response]
How it works: Compute predictions on-the-fly for each request. The application sends features to a model server (TorchServe, TFServing, Triton, BentoML, or a custom FastAPI service), which runs inference and returns the prediction.
Trade-offs:
| Dimension | Assessment |
|---|---|
| Latency | Variable (10ms-500ms depending on model complexity) |
| Freshness | Excellent (predictions use latest features) |
| Cost | High (GPU/CPU for every request) |
| Complexity | High (model server, feature store, load balancing) |
| Feature requirements | Can use real-time features (session, context) |
When to use:
- Predictions depend on real-time context (current query, session state, time)
- The query space is too large to enumerate
- Low-latency personalization is critical (fraud detection at point-of-sale, real-time bidding)
- Meridian Financial example: credit scoring at application time, incorporating the latest credit bureau data
When to avoid:
- The model is too large/slow for the latency budget (consider model distillation or batch serving)
- Traffic volume makes per-request GPU inference cost-prohibitive
- Offline features are sufficient (batch serving is simpler and cheaper)
Latency budget allocation (typical recommendation system):
| Stage | Budget | Operation |
|---|---|---|
| Feature retrieval | 10-20ms | Online feature store lookup |
| Candidate retrieval | 10-30ms | ANN search (FAISS/ScaNN) |
| Ranking | 20-50ms | Neural ranker inference |
| Re-ranking | 5-10ms | Business rules, diversity |
| Total | 45-110ms | End-to-end |
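A budget table like this is most useful when it is enforced in code. The sketch below encodes the table above and flags stages that exceed their upper bound; the stage names and the `check_budget` helper are illustrative, not a standard API.

```python
# Per-stage budgets (min_ms, max_ms), mirroring the table above.
LATENCY_BUDGET_MS = {
    "feature_retrieval": (10, 20),
    "candidate_retrieval": (10, 30),
    "ranking": (20, 50),
    "re_ranking": (5, 10),
}

def check_budget(measured_ms: dict) -> list:
    """Return the stages whose measured latency exceeds the upper budget."""
    return [stage for stage, ms in measured_ms.items()
            if ms > LATENCY_BUDGET_MS[stage][1]]

def total_budget_ms() -> tuple:
    """Sum the per-stage bounds to get the end-to-end envelope."""
    lows = sum(low for low, _ in LATENCY_BUDGET_MS.values())
    highs = sum(high for _, high in LATENCY_BUDGET_MS.values())
    return lows, highs
```

Running `check_budget` on per-request traces (e.g., in a monitoring job) turns the budget from documentation into an alertable invariant.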
H.1.3 — Lambda Architecture
graph TD
A[Event Stream] --> B[Batch Layer]
A --> C[Speed Layer]
B --> D[Batch View]
C --> E[Real-Time View]
D --> F[Serving Layer]
E --> F
F --> G[Application]
How it works: Maintain two parallel processing paths. The batch layer processes the complete dataset on a schedule (high accuracy, high latency). The speed layer processes recent events in real-time (lower accuracy, low latency). The serving layer merges both views.
Trade-offs:
| Dimension | Assessment |
|---|---|
| Latency | Good (real-time path handles recent data) |
| Freshness | Good (real-time path captures recent events) |
| Cost | High (dual infrastructure) |
| Complexity | Very high (maintain two codepaths, reconcile views) |
| Correctness | Good (batch layer eventually corrects speed layer) |
When to use:
- You need both historical accuracy and real-time responsiveness
- The batch layer provides the "ground truth" model, and the speed layer provides recent adjustments
- StreamRec example: batch layer computes user preference profiles daily from full history; speed layer adjusts for current session interests
When to avoid: In almost all cases. Lambda architecture's dual-codebase problem (maintaining logic in both batch and streaming) is a maintenance nightmare. Prefer the Kappa architecture unless you have a compelling reason.
Production Reality: Nathan Marz introduced the Lambda architecture in 2011. By 2014, Jay Kreps had published "Questioning the Lambda Architecture," arguing that a single streaming system is simpler and often sufficient. Most modern ML systems that appear to use Lambda are actually using Kappa with a batch backfill mechanism.
H.1.4 — Kappa Architecture
graph LR
A[Event Stream] --> B[Stream Processing Engine]
B --> C[Serving Layer]
C --> D[Application]
A --> E[Event Log / Lake]
E -->|Replay for retraining| B
How it works: A single stream processing engine handles all computation. Historical reprocessing is done by replaying the event log through the same streaming pipeline. There is no separate batch layer.
Trade-offs:
| Dimension | Assessment |
|---|---|
| Latency | Good (stream processing) |
| Freshness | Excellent (single real-time path) |
| Cost | Moderate (single infrastructure) |
| Complexity | Moderate (one codebase, but streaming is inherently harder) |
| Correctness | Depends on replay capability |
When to use:
- Real-time processing is the primary requirement
- The event log is the source of truth (Kafka, Kinesis)
- Historical reprocessing can be done by replaying events
- Most modern ML systems at companies with mature streaming infrastructure
When to avoid:
- The batch computation is fundamentally different from the streaming computation (e.g., batch requires full-dataset aggregations that streaming cannot efficiently approximate)
- The team lacks streaming expertise
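Kappa's defining property is that one codepath handles both live processing and historical backfill. A toy sketch (the click-count "feature" is illustrative) shows the invariant: replaying the event log through the same function must reproduce the live state exactly.

```python
from typing import Iterable

def process_event(state: dict, event: dict) -> None:
    # The single stream-processing codepath: here, a per-user click count.
    user = event["user_id"]
    state[user] = state.get(user, 0) + 1

def run_stream(events: Iterable[dict], state: dict) -> dict:
    # Used identically for live traffic and for replay from the log.
    for event in events:
        process_event(state, event)
    return state

event_log = [{"user_id": "u1"}, {"user_id": "u2"}, {"user_id": "u1"}]

live_state = run_stream(event_log, {})            # live processing
replayed_state = run_stream(iter(event_log), {})  # backfill via log replay
```

In a real system the log is Kafka and the processor is Flink or Spark Structured Streaming, but the contract is the same: replay equals recompute.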
H.2 — Design Templates for Common ML Systems
H.2.1 — Recommendation System
This is the architecture built progressively throughout the book for the StreamRec platform.
graph TD
subgraph Candidate Retrieval
A1[User Features] --> B1[User Tower]
A2[Item Features] --> B2[Item Tower]
B1 --> C1[ANN Index - FAISS]
B2 --> C1
C1 --> D1["Top-K Candidates (K=500)"]
end
subgraph Ranking
D1 --> E1[Feature Assembly]
E1 --> F1[Neural Ranker]
F1 --> G1["Scored Candidates (K=500)"]
end
subgraph Re-Ranking
G1 --> H1[Diversity Filter]
H1 --> I1[Fairness Constraints]
I1 --> J1[Business Rules]
J1 --> K1["Final Recommendations (N=20)"]
end
Key design decisions:
| Decision | Options | StreamRec Choice | Rationale |
|---|---|---|---|
| Retrieval model | Two-tower, GNN, collaborative filtering | Two-tower (Ch. 13) | Scalable to 200K items, supports real-time user encoding |
| Ranking model | Gradient-boosted trees, deep neural network, transformer | Transformer (Ch. 10) | Captures sequential user behavior, attention weights are interpretable |
| ANN algorithm | FAISS (IVF-PQ), ScaNN, HNSW | FAISS IVF-PQ | Sub-millisecond retrieval for 200K items, good recall-latency trade-off |
| Serving pattern | Batch, real-time, hybrid | Hybrid (batch candidates + real-time ranking) | Balance freshness and cost |
| Exploration | Epsilon-greedy, UCB, Thompson sampling | Thompson sampling (Ch. 22) | Principled exploration with convergence guarantees |
| Fairness | Post-processing, in-processing | Post-processing re-ranking (Ch. 31) | Non-invasive, auditable, adjustable without retraining |
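The retrieval stage's contract can be shown with a brute-force sketch. In production the dot-product scan is replaced by an ANN lookup (FAISS IVF-PQ, per the table), but the interface is the same: a user embedding in, the top-K highest-scoring item IDs out. The toy embeddings below are illustrative.

```python
import heapq

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(user_emb, item_embs: dict, k: int) -> list:
    # Brute-force stand-in for the ANN index: score every item by
    # dot product against the user-tower embedding, keep the top k.
    return heapq.nlargest(k, item_embs,
                          key=lambda item: dot(user_emb, item_embs[item]))

item_embs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
candidates = retrieve_top_k([1.0, 0.2], item_embs, k=2)
```

The two-tower design matters precisely because this scoring is a dot product: item embeddings can be pre-computed and indexed, and only the user tower runs at request time.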
H.2.2 — Fraud Detection System
graph TD
A[Transaction Event] --> B[Real-Time Feature Engine]
B --> C{Rule Engine}
C -->|High Risk| D[Block]
C -->|Low Risk| E[Approve]
C -->|Medium Risk| F[ML Scoring]
F --> G{Score > Threshold?}
G -->|Yes| H[Review Queue]
G -->|No| E
H --> I[Human Analyst]
I --> J[Feedback Loop]
J --> B
Key design principles:
- Two-stage architecture: Rules handle obvious cases (known fraud patterns, velocity checks); ML handles the ambiguous middle. Rules provide explainability and fast response; ML provides generalization.
- Latency constraint: The entire pipeline must complete within 100-200ms to avoid degrading user experience. The ML model must run inference in <50ms.
- High-recall orientation: Missing a fraudulent transaction (false negative) costs the company directly; flagging a legitimate transaction (false positive) causes friction but is not catastrophic. Optimize for recall at the cost of precision, and use the human review queue to absorb false positives.
- Feedback loop: Every human decision feeds back into the training pipeline. Monitor for label bias: analysts may have systematic biases that propagate into training data.
- Concept drift: Fraud patterns evolve rapidly. Retrain frequently (daily or weekly) and monitor feature distributions for anomalies.
Feature engineering patterns for fraud:
| Feature Type | Examples | Computation |
|---|---|---|
| Transaction-level | Amount, merchant category, time of day | Direct from event |
| Velocity features | Transactions in last 1h/24h/7d, unique merchants | Streaming aggregation |
| Behavioral deviation | Z-score of amount vs. user's history | Streaming + historical |
| Graph features | Shared device/address/phone with known fraud | Graph database lookup |
| Geo features | Distance from last transaction, country mismatch | Geospatial computation |
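Velocity features are the workhorse of the table above. A minimal sliding-window sketch (timestamps in seconds; the class name and API are illustrative, not from a specific library):

```python
from collections import deque

class VelocityFeature:
    """Count of events per user inside a sliding time window."""

    def __init__(self, window_s: float = 3600.0):
        self.window_s = window_s
        self.events = {}  # user_id -> deque of event timestamps

    def update(self, user_id: str, ts: float) -> int:
        # Append the new event, evict everything older than the window,
        # and return the current count (the feature value).
        q = self.events.setdefault(user_id, deque())
        q.append(ts)
        while q and ts - q[0] > self.window_s:
            q.popleft()
        return len(q)

vf = VelocityFeature(window_s=3600)
```

A streaming engine (Flink, Spark Streaming) maintains the same state at scale; the deque-per-key structure is the in-memory essence of that computation.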
H.2.3 — Search Ranking System
graph TD
A[Query] --> B[Query Understanding]
B --> C[Retrieval: BM25 + Semantic]
C --> D["Candidate Pool (1000)"]
D --> E[L1 Ranker: Lightweight]
E --> F["Refined Pool (100)"]
F --> G[L2 Ranker: Heavy Model]
G --> H["Final Results (10)"]
H --> I[Blending + Business Rules]
I --> J[Search Results Page]
Key design decisions:
- Multi-stage ranking: The funnel architecture (1000 -> 100 -> 10) allows using progressively more expensive models at each stage, keeping total latency within budget.
- Hybrid retrieval: Combine lexical (BM25) and semantic (dense retrieval with bi-encoders) to capture both exact-match and meaning-based relevance.
- Learning to rank: The L2 ranker uses features from query-document interaction (attention scores, semantic similarity, click-through rate history) that are too expensive to compute for all candidates.
- Metrics: NDCG@10 (primary), MRR (for navigational queries), Recall@1000 (for retrieval stage).
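The funnel's economics can be demonstrated directly: the expensive L2 scorer runs on only the L1-refined pool, never the full candidate set. The scorers below are deterministic toy functions standing in for a real lightweight ranker and heavy model.

```python
def funnel(candidates, l1, l2, k1=100, k2=10):
    # Stage 1: cheap scorer over the full pool; keep the top k1.
    pool = sorted(candidates, key=l1, reverse=True)[:k1]
    # Stage 2: expensive scorer over the refined pool only; keep the top k2.
    return sorted(pool, key=l2, reverse=True)[:k2]

docs = [f"doc{i}" for i in range(1000)]
l2_calls = {"n": 0}

def cheap(d: str) -> int:
    return int(d[3:]) % 97   # toy L1 (lightweight) scorer

def heavy(d: str) -> int:
    l2_calls["n"] += 1       # count invocations of the "expensive" model
    return int(d[3:]) % 89   # toy L2 (heavy) scorer

top10 = funnel(docs, cheap, heavy, k1=100, k2=10)
```

With a 1000-candidate pool, the heavy scorer runs 100 times instead of 1000, which is exactly the 10x cost reduction the funnel buys at each stage.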
H.3 — Decision Matrices
H.3.1 — Build vs. Buy Decision Matrix
Use this framework when deciding whether to build a component in-house or adopt a vendor solution.
| Factor | Build In-House | Buy / Use SaaS | Criterion |
|---|---|---|---|
| Differentiation | Component creates competitive advantage | Commodity capability (logging, monitoring) | Is this what makes your product special? |
| Customization | Requirements are unique to your domain | Standard requirements across industries | Do off-the-shelf solutions meet 80%+ of needs? |
| Control | Need full control over data, model, deployment | Acceptable to delegate to vendor | Data sensitivity, regulatory requirements? |
| Team expertise | Team has (or should develop) this capability | Capability is outside core competency | Does building this advance the team's mission? |
| Time-to-value | Long-term investment horizon (6+ months) | Need results in weeks | Can you afford the build timeline? |
| Maintenance | Willing to own long-term maintenance burden | Vendor handles upgrades, patches, scaling | Who carries the operational pager? |
| Cost | High upfront, lower marginal cost at scale | Lower upfront, higher marginal cost at scale | What is the 3-year total cost of ownership? |
Decision heuristic: Build the components that differentiate your product. Buy everything else. When in doubt, start with buy and migrate to build when you understand the requirements well enough to build something better.
StreamRec application:
| Component | Decision | Reasoning |
|---|---|---|
| Recommendation models | Build | Core differentiator; unique to platform |
| Feature store | Build (on open-source: Feast) | Need customization for real-time features |
| Experiment platform | Buy (Statsig, Eppo) initially; build later | Standard A/B testing needs; build when scale demands |
| Monitoring | Buy (Datadog, Grafana Cloud) + custom ML metrics | Monitoring is commodity; ML-specific dashboards are custom |
| Vector database | Use managed (Pinecone, Weaviate Cloud) or self-host (FAISS) | Depends on scale and operational burden tolerance |
| Orchestration | Adopt open-source (Dagster, Airflow) | Mature OSS tools; no vendor lock-in benefit |
H.3.2 — Model Serving Infrastructure Decision Matrix
| Factor | TorchServe | Triton Inference Server | BentoML | Custom FastAPI |
|---|---|---|---|---|
| Framework support | PyTorch native | Multi-framework (PyTorch, TF, ONNX) | Multi-framework | Any |
| Dynamic batching | Yes | Yes (advanced) | Yes | Manual |
| Model ensemble | Limited | Native support | Via pipeline | Manual |
| GPU optimization | Basic | TensorRT integration | Via runtime | Manual |
| Kubernetes-native | Yes | Yes | Yes (Yatai) | Manual |
| Learning curve | Moderate | Steep | Low | Depends |
| Best for | Pure PyTorch shops | Multi-model, high-throughput | Fast prototyping, mixed frameworks | Full control, simple models |
H.3.3 — Feature Store Decision Matrix
| Factor | Feast | Tecton | Hopsworks | Custom |
|---|---|---|---|---|
| Cost | Free (OSS) | Commercial | Free (OSS) + Commercial | Engineering time |
| Online store | Redis, DynamoDB, etc. | Managed (DynamoDB) | RonDB (built-in) | Your choice |
| Offline store | BigQuery, Snowflake, etc. | S3/Snowflake/etc. | Hive/S3 | Your choice |
| Streaming features | Limited (push-based) | Native (Kafka/Kinesis) | Spark Streaming | Your choice |
| Point-in-time joins | Yes | Yes (optimized) | Yes | Must implement |
| Monitoring | Basic | Advanced | Moderate | Must implement |
| Best for | Startups, teams learning feature stores | Enterprise, complex real-time features | Teams wanting all-in-one platform | Teams with very specific requirements |
H.4 — Architecture Decision Record (ADR) Template
Every significant architectural decision should be documented in an ADR. This template is used in Chapter 24 (ML System Design) and Chapter 36 (Capstone).
# ADR-{number}: {Title}
## Status
[Proposed | Accepted | Deprecated | Superseded by ADR-{n}]
## Date
{YYYY-MM-DD}
## Context
What is the technical or business situation that requires a decision?
What constraints exist (latency, cost, team expertise, regulatory)?
What problem are we trying to solve?
## Decision
What is the architectural decision? State it clearly in one or two sentences.
## Alternatives Considered
### Alternative 1: {Name}
- Description: {How it works}
- Pros: {Advantages}
- Cons: {Disadvantages}
- Rejected because: {Specific reason}
### Alternative 2: {Name}
- Description: {How it works}
- Pros: {Advantages}
- Cons: {Disadvantages}
- Rejected because: {Specific reason}
## Consequences
### Positive
- {Benefit 1}
- {Benefit 2}
### Negative
- {Trade-off 1}
- {Trade-off 2}
### Risks
- {Risk 1}: Mitigation: {How we address it}
- {Risk 2}: Mitigation: {How we address it}
## Review Date
{When should this decision be re-evaluated?}
Example ADR for StreamRec:
# ADR-003: Hybrid Serving Architecture for Recommendations
## Status
Accepted
## Date
2025-06-15
## Context
StreamRec needs to serve personalized recommendations to 5M users.
Latency requirement: p99 < 200ms. The recommendation model uses both
historical user features (updated daily) and real-time session features
(updated per click). Pure batch serving cannot incorporate session
signals. Pure real-time serving is too expensive at our current scale
(estimated $45K/month in GPU costs for full real-time inference).
## Decision
Use a hybrid serving architecture: batch-compute candidate sets daily
(top-500 per user), store in Redis. At request time, retrieve the
pre-computed candidates and re-rank using a lightweight model that
incorporates real-time session features.
## Alternatives Considered
### Alternative 1: Pure Batch Serving
- Pros: Simple, cheap ($3K/month)
- Cons: Cannot use session features; recommendations are stale for
up to 24 hours
- Rejected because: A/B tests show session features improve NDCG@10
by 12%; 24h staleness is unacceptable for trending content
### Alternative 2: Pure Real-Time Serving
- Pros: Freshest predictions, uses all features
- Cons: $45K/month GPU cost, complex infrastructure, cold-start
latency issues
- Rejected because: Cost is 15x batch; latency budget is tight for
the full transformer model on every request
## Consequences
### Positive
- Incorporates session signals (12% NDCG improvement)
- Manageable cost ($8K/month: $3K batch + $5K real-time re-ranking)
- p99 latency ~120ms (Redis lookup + lightweight re-rank)
### Negative
- Candidate set is stale (up to 24h); new items are not in candidates
until next batch run. Mitigation: inject trending items into
candidate set via a real-time "boost" list.
- Two serving paths to maintain (batch generation + real-time
re-ranking)
## Review Date
2025-12-15 (re-evaluate when user base exceeds 10M or GPU costs drop)
H.5 — Service Mesh Patterns for ML
When an ML system comprises multiple services (feature store, model server, candidate retrieval, re-ranking), a service mesh pattern manages their interactions.
H.5.1 — Circuit Breaker Pattern
```python
import time
from enum import Enum
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitState(Enum):
    CLOSED = "closed"          # Normal operation
    OPEN = "open"              # Failing; reject requests immediately
    HALF_OPEN = "half_open"    # Testing if service has recovered


class CircuitBreaker:
    """Circuit breaker for ML model serving.

    Prevents cascading failures when a model service is unhealthy.
    Falls back to a simpler model or cached predictions.
    """

    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        half_open_max_calls: int = 3,
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.half_open_calls = 0

    def call(
        self,
        primary: Callable[[], T],
        fallback: Callable[[], T],
    ) -> T:
        """Execute primary function with circuit breaker protection.

        Args:
            primary: The main model serving function.
            fallback: Fallback (e.g., cached predictions, popularity baseline).

        Returns:
            Result from primary or fallback.
        """
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                self.half_open_calls = 0
            else:
                return fallback()
        try:
            result = primary()
            if self.state == CircuitState.HALF_OPEN:
                self.half_open_calls += 1
                if self.half_open_calls >= self.half_open_max_calls:
                    self.state = CircuitState.CLOSED
                    self.failure_count = 0
            else:
                self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN
            return fallback()
```
ML-specific considerations:
- The fallback for a recommendation model might be a popularity-based ranker (no personalization, but always available and fast)
- The fallback for a fraud model might be a rule-based system (higher false positive rate, but never misses known patterns)
- Monitor fallback activation rate as a system health metric
H.5.2 — Graceful Degradation Hierarchy
Design ML systems with a degradation hierarchy — a sequence of increasingly simpler models that maintain service availability at reduced quality.
| Level | Model | Latency | Quality | When Activated |
|---|---|---|---|---|
| L0 (Full) | Transformer ranker + real-time features | 120ms | Best | Normal operation |
| L1 (Reduced) | Lightweight MLP ranker + cached features | 40ms | Good | GPU quota exceeded, feature store latency spike |
| L2 (Minimal) | Pre-computed batch recommendations | 5ms | Acceptable | Model server down, feature store down |
| L3 (Emergency) | Global popularity baseline | 2ms | Poor | Complete infrastructure failure |
Each level should be tested regularly. The StreamRec team runs monthly "chaos engineering" exercises where they deliberately disable components and verify that the system degrades gracefully through each level.
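The level-selection logic implied by the table can be made explicit. This is a simplified sketch: the health signals (`gpu_ok`, `feature_store_ok`, `model_server_ok`, `cache_ok`) are hypothetical inputs that a real system would derive from health checks and quota monitors.

```python
def select_level(gpu_ok: bool, feature_store_ok: bool,
                 model_server_ok: bool, cache_ok: bool = True) -> str:
    """Walk down the degradation hierarchy to the best viable level."""
    if gpu_ok and feature_store_ok and model_server_ok:
        return "L0"  # full transformer ranker + real-time features
    if model_server_ok:
        return "L1"  # lightweight ranker + cached features
    if cache_ok:
        return "L2"  # pre-computed batch recommendations
    return "L3"      # global popularity baseline
```

Encoding the hierarchy as a function makes it testable in exactly the chaos-engineering exercises described above: disable a dependency, assert the selected level.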
H.5.3 — Shadow Mode Deployment
graph TD
A[User Request] --> B[Load Balancer]
B --> C[Champion Model]
B --> D[Challenger Model: Shadow]
C --> E[User Response]
D --> F[Shadow Predictions Log]
F --> G[Offline Comparison]
Implementation considerations:
- Shadow predictions must not affect user experience (asynchronous evaluation)
- Log both champion and challenger predictions with identical features to ensure a fair comparison
- Monitor challenger latency independently; a slow challenger should not affect champion performance
- Run shadow mode for at least one full business cycle (typically 1-2 weeks) to capture temporal patterns
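The offline-comparison step reduces to logging paired predictions and computing a divergence metric over them. A minimal sketch (the async logging machinery is omitted; mean absolute divergence is one of several reasonable comparison metrics):

```python
def log_shadow(log: list, features: dict,
               champion_pred: float, challenger_pred: float) -> None:
    # Both models scored the *same* features, so the pair is comparable.
    log.append({"features": features,
                "champion": champion_pred,
                "challenger": challenger_pred})

def mean_abs_divergence(log: list) -> float:
    """Average absolute disagreement between champion and challenger."""
    if not log:
        return 0.0
    return sum(abs(r["champion"] - r["challenger"]) for r in log) / len(log)

shadow_log = []
log_shadow(shadow_log, {"user": "u1"}, champion_pred=0.80, challenger_pred=0.70)
log_shadow(shadow_log, {"user": "u2"}, champion_pred=0.40, challenger_pred=0.50)
```

Tracking this divergence over the full shadow period, segmented by traffic slice, is what surfaces the temporal patterns the last bullet warns about.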
H.6 — Feature Store Architecture Patterns
H.6.1 — Dual-Store Pattern
The most common feature store pattern separates online serving from offline training.
graph TD
subgraph Offline Path
A[Raw Data Lake] --> B[Feature Pipeline: Spark/Dagster]
B --> C[Offline Store: Parquet on S3]
C --> D[Training Job: Point-in-Time Join]
end
subgraph Online Path
B --> E[Online Store: Redis/DynamoDB]
E --> F[Model Server: Feature Lookup]
end
subgraph Streaming Path
G[Event Stream: Kafka] --> H[Streaming Pipeline: Flink/Spark]
H --> E
end
The critical invariant: Features computed for training (from the offline store, with point-in-time joins to prevent leakage) must produce the same values as features served in production (from the online store). This is the online-offline consistency problem — the single most common source of production ML bugs (Chapter 25).
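The point-in-time join mentioned above is worth making concrete: for each training label, take the latest feature value at or before the label's timestamp, never after. A brute-force sketch (feature stores implement this with optimized as-of joins; the data layout here is illustrative):

```python
def point_in_time_join(labels, feature_log):
    """For each (entity, label_ts), pick the latest feature value with
    feature_ts <= label_ts. Values after label_ts would leak the future."""
    out = []
    for entity, label_ts in labels:
        eligible = [(ts, value) for e, ts, value in feature_log
                    if e == entity and ts <= label_ts]
        out.append(max(eligible)[1] if eligible else None)
    return out

# (entity, feature_ts, value)
feature_log = [("u1", 1, 0.2), ("u1", 5, 0.9), ("u2", 3, 0.5)]
joined = point_in_time_join([("u1", 4), ("u1", 6), ("u2", 2)], feature_log)
```

Note the third label: the entity's only feature value arrives after the label timestamp, so the join correctly yields no value rather than leaking it.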
Strategies for consistency:
1. Single computation: Compute features once (in the batch pipeline) and write to both stores simultaneously.
2. Shared transformation logic: Define feature transformations in a single specification (e.g., Feast feature definitions) that generates both batch and streaming computation.
3. Consistency monitoring: Regularly compare online feature values against offline feature values for the same entity and timestamp; alert on divergence.
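The third strategy is straightforward to sketch: sample entities, fetch the same feature from both stores, and flag divergence beyond a tolerance. The key and value shapes below are illustrative.

```python
def consistency_report(online: dict, offline: dict, tol: float = 1e-6) -> list:
    """Keys whose online and offline feature values diverge beyond tol,
    including keys missing from the online store entirely."""
    diverged = []
    for key, offline_val in offline.items():
        online_val = online.get(key)
        if online_val is None or abs(online_val - offline_val) > tol:
            diverged.append(key)
    return sorted(diverged)

# Keys are (entity_id, feature_name) pairs.
online = {("u1", "clicks_1h"): 3.0, ("u2", "clicks_1h"): 7.0}
offline = {("u1", "clicks_1h"): 3.0, ("u2", "clicks_1h"): 9.0}
```

Run as a scheduled job over a random sample of entities, with the divergence count wired to an alert, this check catches online-offline drift before the model's metrics do.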
H.6.2 — Feature Computation Patterns
| Pattern | Description | Use Case | Example |
|---|---|---|---|
| Batch transform | Compute from full historical data on a schedule | Slowly-changing features | User lifetime engagement rate |
| Streaming aggregate | Maintain running aggregates from event stream | Recent activity features | Clicks in last 1 hour |
| On-demand compute | Compute at serving time from raw inputs | Context-dependent features | Distance between user location and item location |
| Pre-joined | Join multiple tables in the batch pipeline, store the result | Cross-entity features | Average rating of items in user's watch history |
H.7 — Anti-Patterns to Avoid
H.7.1 — The God Model
Anti-pattern: One monolithic model that handles all use cases (recommendations, search, notifications, emails).
Why it fails: Different use cases have different latency requirements, different training data, and different optimization objectives. A single model cannot satisfy all constraints.
Fix: Decompose into specialized models with a shared feature platform. The StreamRec system uses separate models for candidate retrieval (two-tower), ranking (transformer), and notification triggering (lightweight classifier), all consuming from the same feature store.
H.7.2 — Training-Serving Skew Denial
Anti-pattern: Using different feature computation code for training (pandas in a notebook) and serving (SQL in a production pipeline), assuming they produce the same results.
Why it fails: Subtle differences in null handling, timestamp parsing, aggregation boundaries, and floating-point precision cause features to differ between training and serving. The model performs well in offline evaluation and poorly in production, with no obvious error.
Fix: Feature stores with shared transformation logic (Section H.6). Integration tests that compare training features against serving features for the same entity-timestamp pairs. Monitoring for feature distribution divergence.
H.7.3 — The Premature Microservice
Anti-pattern: Decomposing the ML system into 15 microservices before validating that the ML approach works at all.
Why it fails: The first version should answer: "Does this ML approach solve the business problem?" Microservice complexity makes iteration slow and debugging hard.
Fix: Start with a monolithic prototype (batch inference, simple serving). Decompose into services only after validating the approach and identifying scaling bottlenecks. The StreamRec team started with a single Python script that ran matrix factorization on a cron job and stored recommendations in PostgreSQL. They decomposed into the current architecture only after proving that personalization increased engagement by 20%.
H.7.4 — Monitoring by Dashboard Staring
Anti-pattern: Building beautiful dashboards but no automated alerts. Relying on humans to notice when metrics drift.
Why it fails: Humans check dashboards only when they remember to, which is never at 3 AM on Sunday when the feature pipeline silently starts producing null values.
Fix: Every metric on the dashboard should have an associated alert with a threshold, escalation path, and runbook (Chapter 30). Dashboards are for investigation; alerts are for detection.
H.7.5 — The Reproducibility Illusion
Anti-pattern: Logging the model version and code commit but not the exact training data, feature definitions, hyperparameters, and random seeds.
Why it fails: When a model degrades in production, you need to reproduce the previous version exactly — including the data snapshot and the full configuration. Logging only the model artifact is insufficient because you cannot retrain or debug without the full provenance chain.
Fix: Version everything: code (git), data (DVC or snapshot IDs), features (feature store versions), hyperparameters (experiment tracker: MLflow, W&B), and environment (Docker image SHA). The StreamRec team's deployment pipeline records all of these in the model registry and refuses to deploy a model without complete provenance.
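A deployment gate like StreamRec's can be sketched as a simple completeness check over the provenance record. The field names below are illustrative; a real registry (MLflow, W&B) would enforce its own schema.

```python
# Provenance fields a deployment gate demands before allowing rollout.
REQUIRED_PROVENANCE = {"code_commit", "data_snapshot", "feature_version",
                       "hyperparameters", "docker_image_sha"}

def missing_provenance(record: dict) -> set:
    """Return the required fields that are absent or empty."""
    return {field for field in REQUIRED_PROVENANCE
            if not record.get(field)}

complete = {"code_commit": "abc123", "data_snapshot": "dvc:snap-42",
            "feature_version": "v7", "hyperparameters": {"lr": 1e-3},
            "docker_image_sha": "sha256:deadbeef"}
partial = {"code_commit": "abc123"}
```

Refusing deployment when `missing_provenance` is non-empty is the whole mechanism: the check is trivial, and its value is that it runs on every deploy, not just when someone remembers.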
H.8 — Putting It All Together: StreamRec System Architecture
The complete StreamRec recommendation system, as built across Chapters 24-30 and integrated in Chapter 36, combines these patterns into a coherent architecture.
┌─────────────────────────────────────────────────────────────────────────┐
│ StreamRec Architecture │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ Event Stream │────▶│ Streaming │────▶│ Online Feature Store │ │
│ │ (Kafka) │ │ Pipeline │ │ (Redis) │ │
│ └──────┬───────┘ └──────────────┘ └──────────┬───────────┘ │
│ │ │ │
│ ▼ │ │
│ ┌──────────────┐ ┌──────────────┐ │ │
│ │ Data Lake │────▶│ Batch Feature│────▶┌──────────┴───────────┐ │
│ │ (S3/Delta) │ │ Pipeline │ │ Offline Feature Store │ │
│ └──────────────┘ │ (Dagster) │────▶│ (Parquet on S3) │ │
│ └──────────────┘ └──────────┬───────────┘ │
│ │ │
│ ┌─────────────────────────────────────┐ │ │
│ │ Training Pipeline │◀──────────────┘ │
│ │ Data → Features → Train → Eval │ │
│ │ → Register → Shadow → Canary │ │
│ └──────────────┬──────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Serving Layer │ │
│ │ Request → Feature Lookup → Candidate Retrieval (Two-Tower+ANN) │ │
│ │ → Ranking (Transformer) → Re-Ranking (Diversity+Fairness) │ │
│ │ → Response │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Monitoring Layer │ │
│ │ Data Quality │ Feature Drift │ Model Health │ Business Metrics │ │
│ │ Latency/SLOs │ Fairness │ Alerts │ Incident Response │ │
│ └──────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
Design principles embodied:
- Dual-store feature architecture (H.6.1): online Redis for serving, offline Parquet for training
- Hybrid serving (H.1.1 + H.1.2): batch candidate generation + real-time re-ranking
- Graceful degradation (H.5.2): four levels from full model to popularity baseline
- Circuit breaker (H.5.1): protects against cascading failures in model serving
- Shadow mode (H.5.3): every model update spends one week in shadow before canary rollout
- Complete monitoring (Chapter 30): data quality through business metrics with automated alerting
- Full provenance (H.7.5): every model deployment records code, data, features, hyperparameters, and evaluation results
The patterns in this appendix are starting points, not prescriptions. Every system has unique constraints that require adaptation. The mark of a senior ML engineer is not knowing these patterns but knowing when to deviate from them — and documenting why in an ADR.