Appendix H: ML System Design Patterns
Reference Architectures, Design Templates, and Decision Frameworks for Production Machine Learning
This appendix collects the system design patterns referenced throughout Part V (Chapters 24-30) into a single reference. Each pattern includes the architecture diagram, trade-off analysis, when to use it, and when to avoid it. The goal is not to replace the detailed treatment in the chapters, but to give you a quick-reference catalogue when you are designing a new system or conducting a design review.
Production ML systems are software systems first and ML systems second. The patterns here borrow heavily from software engineering (microservices, event-driven architecture, circuit breakers) and adapt them to the unique challenges of ML: data-dependent behavior, training-serving skew, model staleness, and the probabilistic nature of outputs.
H.1 — Serving Patterns
The serving pattern determines how predictions reach users. This is the most consequential architectural decision in an ML system because it constrains latency, throughput, cost, and the freshness of predictions.
H.1.1 — Batch Serving
graph LR
A[Data Warehouse] --> B[Feature Pipeline]
B --> C[Batch Inference Job]
C --> D[Prediction Store / Cache]
D --> E[Application API]
E --> F[User]
How it works: Pre-compute predictions for all entities (users, items, queries) on a schedule (hourly, daily). Store predictions in a fast key-value store (Redis, DynamoDB). The application looks up pre-computed predictions at serving time.
Trade-offs:
| Dimension | Assessment |
|---|---|
| Latency | Excellent (cache lookup: <5ms) |
| Freshness | Poor (predictions are stale by up to one schedule period) |
| Cost | Low per-prediction (batch amortizes GPU cost) |
| Complexity | Low (no real-time model serving infrastructure) |
| Feature requirements | Offline features only (no real-time signals) |
When to use:
- The entity space is finite and enumerable (all users, all products)
- Prediction freshness of hours is acceptable
- Real-time features do not significantly improve quality
- The team lacks real-time serving infrastructure
- StreamRec example: pre-compute the top-100 recommendations for each user daily; serve from cache
When to avoid:
- The query space is too large to enumerate (e.g., arbitrary text queries in search)
- Real-time context matters (e.g., current session behavior, time-of-day)
- The application requires sub-second freshness
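The batch-serving flow can be sketched in a few lines. This is a minimal illustration, not a production implementation: the in-memory dict stands in for Redis/DynamoDB, and `score_user` is a hypothetical stand-in for the real model.

```python
from typing import Dict, List

def score_user(user_id: str, items: List[str]) -> List[str]:
    # Hypothetical model stand-in: rank items deterministically per user.
    return sorted(items, key=lambda item: hash((user_id, item)))

def run_batch_job(users: List[str], items: List[str],
                  store: Dict[str, List[str]], top_k: int = 100) -> None:
    # The scheduled job pre-computes top-K predictions for every
    # enumerable entity and writes them to the prediction store.
    for user_id in users:
        store[user_id] = score_user(user_id, items)[:top_k]

def serve(user_id: str, store: Dict[str, List[str]]) -> List[str]:
    # Serving is just a cache lookup; missing entities get a safe default.
    return store.get(user_id, [])

prediction_store: Dict[str, List[str]] = {}
run_batch_job(["u1", "u2"], [f"item{i}" for i in range(10)],
              prediction_store, top_k=3)
```

The serving path never touches the model, which is what makes the <5ms lookup latency in the table above achievable.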
H.1.2 — Real-Time Serving (Online Inference)
graph LR
A[User Request] --> B[Application]
B --> C[Feature Store: Online]
C --> D[Model Server]
D --> B
B --> E[User Response]
How it works: Compute predictions on-the-fly for each request. The application sends features to a model server (TorchServe, TFServing, Triton, BentoML, or a custom FastAPI service), which runs inference and returns the prediction.
Trade-offs:
| Dimension | Assessment |
|---|---|
| Latency | Variable (10ms-500ms depending on model complexity) |
| Freshness | Excellent (predictions use latest features) |
| Cost | High (GPU/CPU for every request) |
| Complexity | High (model server, feature store, load balancing) |
| Feature requirements | Can use real-time features (session, context) |
When to use:
- Predictions depend on real-time context (current query, session state, time)
- The query space is too large to enumerate
- Low-latency personalization is critical (fraud detection at point-of-sale, real-time bidding)
- Meridian Financial example: credit scoring at application time, incorporating the latest credit bureau data
When to avoid:
- The model is too large/slow for the latency budget (consider model distillation or batch serving)
- Traffic volume makes per-request GPU inference cost-prohibitive
- Offline features are sufficient (batch serving is simpler and cheaper)
Latency budget allocation (typical recommendation system):
| Stage | Budget | Operation |
|---|---|---|
| Feature retrieval | 10-20ms | Online feature store lookup |
| Candidate retrieval | 10-30ms | ANN search (FAISS/ScaNN) |
| Ranking | 20-50ms | Neural ranker inference |
| Re-ranking | 5-10ms | Business rules, diversity |
| Total | 45-110ms | End-to-end |
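A budget table like this is most useful when it is enforced in code. The sketch below encodes the table above and flags stages that exceed their upper bound; the stage names and the `check_budget` helper are illustrative, not a standard API.

```python
# Per-stage budgets (min_ms, max_ms), mirroring the table above.
LATENCY_BUDGET_MS = {
    "feature_retrieval": (10, 20),
    "candidate_retrieval": (10, 30),
    "ranking": (20, 50),
    "re_ranking": (5, 10),
}

def check_budget(measured_ms: dict) -> list:
    """Return the stages whose measured latency exceeds the upper budget."""
    return [stage for stage, ms in measured_ms.items()
            if ms > LATENCY_BUDGET_MS[stage][1]]

def total_budget_ms() -> tuple:
    """Sum the per-stage bounds to get the end-to-end envelope."""
    lows = sum(low for low, _ in LATENCY_BUDGET_MS.values())
    highs = sum(high for _, high in LATENCY_BUDGET_MS.values())
    return lows, highs
```

Running `check_budget` on per-request traces (e.g., in a monitoring job) turns the budget from documentation into an alertable invariant.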
H.1.3 — Lambda Architecture
graph TD
A[Event Stream] --> B[Batch Layer]
A --> C[Speed Layer]
B --> D[Batch View]
C --> E[Real-Time View]
D --> F[Serving Layer]
E --> F
F --> G[Application]
How it works: Maintain two parallel processing paths. The batch layer processes the complete dataset on a schedule (high accuracy, high latency). The speed layer processes recent events in real-time (lower accuracy, low latency). The serving layer merges both views.
Trade-offs:
| Dimension | Assessment |
|---|---|
| Latency | Good (real-time path handles recent data) |
| Freshness | Good (real-time path captures recent events) |
| Cost | High (dual infrastructure) |
| Complexity | Very high (maintain two codepaths, reconcile views) |
| Correctness | Good (batch layer eventually corrects speed layer) |
When to use:
- You need both historical accuracy and real-time responsiveness
- The batch layer provides the "ground truth" model, and the speed layer provides recent adjustments
- StreamRec example: batch layer computes user preference profiles daily from full history; speed layer adjusts for current session interests
When to avoid: In almost all cases. Lambda architecture's dual-codebase problem (maintaining logic in both batch and streaming) is a maintenance nightmare. Prefer the Kappa architecture unless you have a compelling reason.
Production Reality: Nathan Marz introduced the Lambda architecture in 2011. By 2014, Jay Kreps had published "Questioning the Lambda Architecture," arguing that a single streaming system is simpler and often sufficient. Most modern ML systems that appear to use Lambda are actually using Kappa with a batch backfill mechanism.
H.1.4 — Kappa Architecture
graph LR
A[Event Stream] --> B[Stream Processing Engine]
B --> C[Serving Layer]
C --> D[Application]
A --> E[Event Log / Lake]
E -->|Replay for retraining| B
How it works: A single stream processing engine handles all computation. Historical reprocessing is done by replaying the event log through the same streaming pipeline. There is no separate batch layer.
Trade-offs:
| Dimension | Assessment |
|---|---|
| Latency | Good (stream processing) |
| Freshness | Excellent (single real-time path) |
| Cost | Moderate (single infrastructure) |
| Complexity | Moderate (one codebase, but streaming is inherently harder) |
| Correctness | Depends on replay capability |
When to use:
- Real-time processing is the primary requirement
- The event log is the source of truth (Kafka, Kinesis)
- Historical reprocessing can be done by replaying events
- Most modern ML systems at companies with mature streaming infrastructure
When to avoid:
- The batch computation is fundamentally different from the streaming computation (e.g., batch requires full-dataset aggregations that streaming cannot efficiently approximate)
- The team lacks streaming expertise
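Kappa's defining property is that one codepath handles both live processing and historical backfill. A toy sketch (the click-count "feature" is illustrative) shows the invariant: replaying the event log through the same function must reproduce the live state exactly.

```python
from typing import Iterable

def process_event(state: dict, event: dict) -> None:
    # The single stream-processing codepath: here, a per-user click count.
    user = event["user_id"]
    state[user] = state.get(user, 0) + 1

def run_stream(events: Iterable[dict], state: dict) -> dict:
    # Used identically for live traffic and for replay from the log.
    for event in events:
        process_event(state, event)
    return state

event_log = [{"user_id": "u1"}, {"user_id": "u2"}, {"user_id": "u1"}]

live_state = run_stream(event_log, {})            # live processing
replayed_state = run_stream(iter(event_log), {})  # backfill via log replay
```

In a real system the log is Kafka and the processor is Flink or Spark Structured Streaming, but the contract is the same: replay equals recompute.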
H.2 — Design Templates for Common ML Systems
H.2.1 — Recommendation System
This is the architecture built progressively throughout the book for the StreamRec platform.
graph TD
subgraph Candidate Retrieval
A1[User Features] --> B1[User Tower]
A2[Item Features] --> B2[Item Tower]
B1 --> C1[ANN Index - FAISS]
B2 --> C1
C1 --> D1["Top-K Candidates (K=500)"]
end
subgraph Ranking
D1 --> E1[Feature Assembly]
E1 --> F1[Neural Ranker]
F1 --> G1["Scored Candidates (K=500)"]
end
subgraph Re-Ranking
G1 --> H1[Diversity Filter]
H1 --> I1[Fairness Constraints]
I1 --> J1[Business Rules]
J1 --> K1["Final Recommendations (N=20)"]
end
Key design decisions:
| Decision | Options | StreamRec Choice | Rationale |
|---|---|---|---|
| Retrieval model | Two-tower, GNN, collaborative filtering | Two-tower (Ch. 13) | Scalable to 200K items, supports real-time user encoding |
| Ranking model | Gradient-boosted trees, deep neural network, transformer | Transformer (Ch. 10) | Captures sequential user behavior, attention weights are interpretable |
| ANN algorithm | FAISS (IVF-PQ), ScaNN, HNSW | FAISS IVF-PQ | Sub-millisecond retrieval for 200K items, good recall-latency trade-off |
| Serving pattern | Batch, real-time, hybrid | Hybrid (batch candidates + real-time ranking) | Balance freshness and cost |
| Exploration | Epsilon-greedy, UCB, Thompson sampling | Thompson sampling (Ch. 22) | Principled exploration with convergence guarantees |
| Fairness | Post-processing, in-processing | Post-processing re-ranking (Ch. 31) | Non-invasive, auditable, adjustable without retraining |
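The retrieval stage's contract can be shown with a brute-force sketch. In production the dot-product scan is replaced by an ANN lookup (FAISS IVF-PQ, per the table), but the interface is the same: a user embedding in, the top-K highest-scoring item IDs out. The toy embeddings below are illustrative.

```python
import heapq

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(user_emb, item_embs: dict, k: int) -> list:
    # Brute-force stand-in for the ANN index: score every item by
    # dot product against the user-tower embedding, keep the top k.
    return heapq.nlargest(k, item_embs,
                          key=lambda item: dot(user_emb, item_embs[item]))

item_embs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
candidates = retrieve_top_k([1.0, 0.2], item_embs, k=2)
```

The two-tower design matters precisely because this scoring is a dot product: item embeddings can be pre-computed and indexed, and only the user tower runs at request time.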
H.2.2 — Fraud Detection System
graph TD
A[Transaction Event] --> B[Real-Time Feature Engine]
B --> C{Rule Engine}
C -->|High Risk| D[Block]
C -->|Low Risk| E[Approve]
C -->|Medium Risk| F[ML Scoring]
F --> G{Score > Threshold?}
G -->|Yes| H[Review Queue]
G -->|No| E
H --> I[Human Analyst]
I --> J[Feedback Loop]
J --> B
Key design principles:
- Two-stage architecture: Rules handle obvious cases (known fraud patterns, velocity checks); ML handles the ambiguous middle. Rules provide explainability and fast response; ML provides generalization.
- Latency constraint: The entire pipeline must complete within 100-200ms to avoid degrading user experience. The ML model must run inference in <50ms.
- High-recall orientation: Missing a fraudulent transaction (false negative) costs the company directly; flagging a legitimate transaction (false positive) causes friction but is not catastrophic. Optimize for recall at the cost of precision, and use the human review queue to absorb false positives.
- Feedback loop: Every human decision feeds back into the training pipeline. Monitor for label bias: analysts may have systematic biases that propagate into training data.
- Concept drift: Fraud patterns evolve rapidly. Retrain frequently (daily or weekly) and monitor feature distributions for anomalies.
Feature engineering patterns for fraud:
| Feature Type | Examples | Computation |
|---|---|---|
| Transaction-level | Amount, merchant category, time of day | Direct from event |
| Velocity features | Transactions in last 1h/24h/7d, unique merchants | Streaming aggregation |
| Behavioral deviation | Z-score of amount vs. user's history | Streaming + historical |
| Graph features | Shared device/address/phone with known fraud | Graph database lookup |
| Geo features | Distance from last transaction, country mismatch | Geospatial computation |
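Velocity features are the workhorse of the table above. A minimal sliding-window sketch (timestamps in seconds; the class name and API are illustrative, not from a specific library):

```python
from collections import deque

class VelocityFeature:
    """Count of events per user inside a sliding time window."""

    def __init__(self, window_s: float = 3600.0):
        self.window_s = window_s
        self.events = {}  # user_id -> deque of event timestamps

    def update(self, user_id: str, ts: float) -> int:
        # Append the new event, evict everything older than the window,
        # and return the current count (the feature value).
        q = self.events.setdefault(user_id, deque())
        q.append(ts)
        while q and ts - q[0] > self.window_s:
            q.popleft()
        return len(q)

vf = VelocityFeature(window_s=3600)
```

A streaming engine (Flink, Spark Streaming) maintains the same state at scale; the deque-per-key structure is the in-memory essence of that computation.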
H.2.3 — Search Ranking System
graph TD
A[Query] --> B[Query Understanding]
B --> C[Retrieval: BM25 + Semantic]
C --> D["Candidate Pool (1000)"]
D --> E[L1 Ranker: Lightweight]
E --> F["Refined Pool (100)"]
F --> G[L2 Ranker: Heavy Model]
G --> H["Final Results (10)"]
H --> I[Blending + Business Rules]
I --> J[Search Results Page]
Key design decisions:
- Multi-stage ranking: The funnel architecture (1000 -> 100 -> 10) allows using progressively more expensive models at each stage, keeping total latency within budget.
- Hybrid retrieval: Combine lexical (BM25) and semantic (dense retrieval with bi-encoders) to capture both exact-match and meaning-based relevance.
- Learning to rank: The L2 ranker uses features from query-document interaction (attention scores, semantic similarity, click-through rate history) that are too expensive to compute for all candidates.
- Metrics: NDCG@10 (primary), MRR (for navigational queries), Recall@1000 (for retrieval stage).
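The funnel's economics can be demonstrated directly: the expensive L2 scorer runs on only the L1-refined pool, never the full candidate set. The scorers below are deterministic toy functions standing in for a real lightweight ranker and heavy model.

```python
def funnel(candidates, l1, l2, k1=100, k2=10):
    # Stage 1: cheap scorer over the full pool; keep the top k1.
    pool = sorted(candidates, key=l1, reverse=True)[:k1]
    # Stage 2: expensive scorer over the refined pool only; keep the top k2.
    return sorted(pool, key=l2, reverse=True)[:k2]

docs = [f"doc{i}" for i in range(1000)]
l2_calls = {"n": 0}

def cheap(d: str) -> int:
    return int(d[3:]) % 97   # toy L1 (lightweight) scorer

def heavy(d: str) -> int:
    l2_calls["n"] += 1       # count invocations of the "expensive" model
    return int(d[3:]) % 89   # toy L2 (heavy) scorer

top10 = funnel(docs, cheap, heavy, k1=100, k2=10)
```

With a 1000-candidate pool, the heavy scorer runs 100 times instead of 1000, which is exactly the 10x cost reduction the funnel buys at each stage.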
H.3 — Decision Matrices
H.3.1 — Build vs. Buy Decision Matrix
Use this framework when deciding whether to build a component in-house or adopt a vendor solution.
| Factor | Build In-House | Buy / Use SaaS | Criterion |
|---|---|---|---|
| Differentiation | Component creates competitive advantage | Commodity capability (logging, monitoring) | Is this what makes your product special? |
| Customization | Requirements are unique to your domain | Standard requirements across industries | Do off-the-shelf solutions meet 80%+ of needs? |
| Control | Need full control over data, model, deployment | Acceptable to delegate to vendor | Data sensitivity, regulatory requirements? |
| Team expertise | Team has (or should develop) this capability | Capability is outside core competency | Does building this advance the team's mission? |
| Time-to-value | Long-term investment horizon (6+ months) | Need results in weeks | Can you afford the build timeline? |
| Maintenance | Willing to own long-term maintenance burden | Vendor handles upgrades, patches, scaling | Who carries the operational pager? |
| Cost | High upfront, lower marginal cost at scale | Lower upfront, higher marginal cost at scale | What is the 3-year total cost of ownership? |
Decision heuristic: Build the components that differentiate your product. Buy everything else. When in doubt, start with buy and migrate to build when you understand the requirements well enough to build something better.
StreamRec application:
| Component | Decision | Reasoning |
|---|---|---|
| Recommendation models | Build | Core differentiator; unique to platform |
| Feature store | Build (on open-source: Feast) | Need customization for real-time features |
| Experiment platform | Buy (Statsig, Eppo) initially; build later | Standard A/B testing needs; build when scale demands |
| Monitoring | Buy (Datadog, Grafana Cloud) + custom ML metrics | Monitoring is commodity; ML-specific dashboards are custom |
| Vector database | Use managed (Pinecone, Weaviate Cloud) or self-host (FAISS) | Depends on scale and operational burden tolerance |
| Orchestration | Adopt open-source (Dagster, Airflow) | Mature OSS tools; no vendor lock-in benefit |
H.3.2 — Model Serving Infrastructure Decision Matrix
| Factor | TorchServe | Triton Inference Server | BentoML | Custom FastAPI |
|---|---|---|---|---|
| Framework support | PyTorch native | Multi-framework (PyTorch, TF, ONNX) | Multi-framework | Any |
| Dynamic batching | Yes | Yes (advanced) | Yes | Manual |
| Model ensemble | Limited | Native support | Via pipeline | Manual |
| GPU optimization | Basic | TensorRT integration | Via runtime | Manual |
| Kubernetes-native | Yes | Yes | Yes (Yatai) | Manual |
| Learning curve | Moderate | Steep | Low | Depends |
| Best for | Pure PyTorch shops | Multi-model, high-throughput | Fast prototyping, mixed frameworks | Full control, simple models |
H.3.3 — Feature Store Decision Matrix
| Factor | Feast | Tecton | Hopsworks | Custom |
|---|---|---|---|---|
| Cost | Free (OSS) | Commercial | Free (OSS) + Commercial | Engineering time |
| Online store | Redis, DynamoDB, etc. | Managed (DynamoDB) | RonDB (built-in) | Your choice |
| Offline store | BigQuery, Snowflake, etc. | S3/Snowflake/etc. | Hive/S3 | Your choice |
| Streaming features | Limited (push-based) | Native (Kafka/Kinesis) | Spark Streaming | Your choice |
| Point-in-time joins | Yes | Yes (optimized) | Yes | Must implement |
| Monitoring | Basic | Advanced | Moderate | Must implement |
| Best for | Startups, teams learning feature stores | Enterprise, complex real-time features | Teams wanting all-in-one platform | Teams with very specific requirements |
H.4 — Architecture Decision Record (ADR) Template
Every significant architectural decision should be documented in an ADR. This template is used in Chapter 24 (ML System Design) and Chapter 36 (Capstone).
# ADR-{number}: {Title}
## Status
[Proposed | Accepted | Deprecated | Superseded by ADR-{n}]
## Date
{YYYY-MM-DD}
## Context
What is the technical or business situation that requires a decision?
What constraints exist (latency, cost, team expertise, regulatory)?
What problem are we trying to solve?
## Decision
What is the architectural decision? State it clearly in one or two sentences.
## Alternatives Considered
### Alternative 1: {Name}
- Description: {How it works}
- Pros: {Advantages}
- Cons: {Disadvantages}
- Rejected because: {Specific reason}
### Alternative 2: {Name}
- Description: {How it works}
- Pros: {Advantages}
- Cons: {Disadvantages}
- Rejected because: {Specific reason}
## Consequences
### Positive
- {Benefit 1}
- {Benefit 2}
### Negative
- {Trade-off 1}
- {Trade-off 2}
### Risks
- {Risk 1}: Mitigation: {How we address it}
- {Risk 2}: Mitigation: {How we address it}
## Review Date
{When should this decision be re-evaluated?}
Example ADR for StreamRec:
# ADR-003: Hybrid Serving Architecture for Recommendations
## Status
Accepted
## Date
2025-06-15
## Context
StreamRec needs to serve personalized recommendations to 5M users.
Latency requirement: p99 < 200ms. The recommendation model uses both
historical user features (updated daily) and real-time session features
(updated per click). Pure batch serving cannot incorporate session
signals. Pure real-time serving is too expensive at our current scale
(estimated $45K/month in GPU costs for full real-time inference).
## Decision
Use a hybrid serving architecture: batch-compute candidate sets daily
(top-500 per user), store in Redis. At request time, retrieve the
pre-computed candidates and re-rank using a lightweight model that
incorporates real-time session features.
## Alternatives Considered
### Alternative 1: Pure Batch Serving
- Pros: Simple, cheap ($3K/month)
- Cons: Cannot use session features; recommendations are stale for
up to 24 hours
- Rejected because: A/B tests show session features improve NDCG@10
by 12%; 24h staleness is unacceptable for trending content
### Alternative 2: Pure Real-Time Serving
- Pros: Freshest predictions, uses all features
- Cons: $45K/month GPU cost, complex infrastructure, cold-start
latency issues
- Rejected because: Cost is 15x batch; latency budget is tight for
the full transformer model on every request
## Consequences
### Positive
- Incorporates session signals (12% NDCG improvement)
- Manageable cost ($8K/month: $3K batch + $5K real-time re-ranking)
- p99 latency ~120ms (Redis lookup + lightweight re-rank)
### Negative
- Candidate set is stale (up to 24h); new items are not in candidates
until next batch run. Mitigation: inject trending items into
candidate set via a real-time "boost" list.
- Two serving paths to maintain (batch generation + real-time
re-ranking)
## Review Date
2025-12-15 (re-evaluate when user base exceeds 10M or GPU costs drop)
H.5 — Service Mesh Patterns for ML
When an ML system comprises multiple services (feature store, model server, candidate retrieval, re-ranking), a service mesh pattern manages their interactions.
H.5.1 — Circuit Breaker Pattern
```python
import time
from enum import Enum
from typing import Callable, TypeVar

T = TypeVar("T")


class CircuitState(Enum):
    CLOSED = "closed"          # Normal operation
    OPEN = "open"              # Failing; reject requests immediately
    HALF_OPEN = "half_open"    # Testing if service has recovered


class CircuitBreaker:
    """Circuit breaker for ML model serving.

    Prevents cascading failures when a model service is unhealthy.
    Falls back to a simpler model or cached predictions.
    """

    def __init__(
        self,
        failure_threshold: int = 5,
        recovery_timeout: float = 30.0,
        half_open_max_calls: int = 3,
    ):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.half_open_calls = 0

    def call(
        self,
        primary: Callable[[], T],
        fallback: Callable[[], T],
    ) -> T:
        """Execute primary function with circuit breaker protection.

        Args:
            primary: The main model serving function.
            fallback: Fallback (e.g., cached predictions, popularity baseline).

        Returns:
            Result from primary or fallback.
        """
        if self.state == CircuitState.OPEN:
            if time.time() - self.last_failure_time > self.recovery_timeout:
                self.state = CircuitState.HALF_OPEN
                self.half_open_calls = 0
            else:
                return fallback()
        try:
            result = primary()
            if self.state == CircuitState.HALF_OPEN:
                self.half_open_calls += 1
                if self.half_open_calls >= self.half_open_max_calls:
                    self.state = CircuitState.CLOSED
                    self.failure_count = 0
            else:
                self.failure_count = 0
            return result
        except Exception:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN
            return fallback()
```
ML-specific considerations:
- The fallback for a recommendation model might be a popularity-based ranker (no personalization, but always available and fast)
- The fallback for a fraud model might be a rule-based system (higher false positive rate, but never misses known patterns)
- Monitor fallback activation rate as a system health metric
H.5.2 — Graceful Degradation Hierarchy
Design ML systems with a degradation hierarchy — a sequence of increasingly simpler models that maintain service availability at reduced quality.
| Level | Model | Latency | Quality | When Activated |
|---|---|---|---|---|
| L0 (Full) | Transformer ranker + real-time features | 120ms | Best | Normal operation |
| L1 (Reduced) | Lightweight MLP ranker + cached features | 40ms | Good | GPU quota exceeded, feature store latency spike |
| L2 (Minimal) | Pre-computed batch recommendations | 5ms | Acceptable | Model server down, feature store down |
| L3 (Emergency) | Global popularity baseline | 2ms | Poor | Complete infrastructure failure |
Each level should be tested regularly. The StreamRec team runs monthly "chaos engineering" exercises where they deliberately disable components and verify that the system degrades gracefully through each level.
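The level-selection logic implied by the table can be made explicit. This is a simplified sketch: the health signals (`gpu_ok`, `feature_store_ok`, `model_server_ok`, `cache_ok`) are hypothetical inputs that a real system would derive from health checks and quota monitors.

```python
def select_level(gpu_ok: bool, feature_store_ok: bool,
                 model_server_ok: bool, cache_ok: bool = True) -> str:
    """Walk down the degradation hierarchy to the best viable level."""
    if gpu_ok and feature_store_ok and model_server_ok:
        return "L0"  # full transformer ranker + real-time features
    if model_server_ok:
        return "L1"  # lightweight ranker + cached features
    if cache_ok:
        return "L2"  # pre-computed batch recommendations
    return "L3"      # global popularity baseline
```

Encoding the hierarchy as a function makes it testable in exactly the chaos-engineering exercises described above: disable a dependency, assert the selected level.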
H.5.3 — Shadow Mode Deployment
graph TD
A[User Request] --> B[Load Balancer]
B --> C[Champion Model]
B --> D[Challenger Model: Shadow]
C --> E[User Response]
D --> F[Shadow Predictions Log]
F --> G[Offline Comparison]
Implementation considerations:
- Shadow predictions must not affect user experience (asynchronous evaluation)
- Log both champion and challenger predictions with identical features to ensure a fair comparison
- Monitor challenger latency independently; a slow challenger should not affect champion performance
- Run shadow mode for at least one full business cycle (typically 1-2 weeks) to capture temporal patterns
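The offline-comparison step reduces to logging paired predictions and computing a divergence metric over them. A minimal sketch (the async logging machinery is omitted; mean absolute divergence is one of several reasonable comparison metrics):

```python
def log_shadow(log: list, features: dict,
               champion_pred: float, challenger_pred: float) -> None:
    # Both models scored the *same* features, so the pair is comparable.
    log.append({"features": features,
                "champion": champion_pred,
                "challenger": challenger_pred})

def mean_abs_divergence(log: list) -> float:
    """Average absolute disagreement between champion and challenger."""
    if not log:
        return 0.0
    return sum(abs(r["champion"] - r["challenger"]) for r in log) / len(log)

shadow_log = []
log_shadow(shadow_log, {"user": "u1"}, champion_pred=0.80, challenger_pred=0.70)
log_shadow(shadow_log, {"user": "u2"}, champion_pred=0.40, challenger_pred=0.50)
```

Tracking this divergence over the full shadow period, segmented by traffic slice, is what surfaces the temporal patterns the last bullet warns about.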
H.6 — Feature Store Architecture Patterns
H.6.1 — Dual-Store Pattern
The most common feature store pattern separates online serving from offline training.
graph TD
subgraph Offline Path
A[Raw Data Lake] --> B[Feature Pipeline: Spark/Dagster]
B --> C[Offline Store: Parquet on S3]
C --> D[Training Job: Point-in-Time Join]
end
subgraph Online Path
B --> E[Online Store: Redis/DynamoDB]
E --> F[Model Server: Feature Lookup]
end
subgraph Streaming Path
G[Event Stream: Kafka] --> H[Streaming Pipeline: Flink/Spark]
H --> E
end
The critical invariant: Features computed for training (from the offline store, with point-in-time joins to prevent leakage) must produce the same values as features served in production (from the online store). This is the online-offline consistency problem — the single most common source of production ML bugs (Chapter 25).
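The point-in-time join mentioned above is worth making concrete: for each training label, take the latest feature value at or before the label's timestamp, never after. A brute-force sketch (feature stores implement this with optimized as-of joins; the data layout here is illustrative):

```python
def point_in_time_join(labels, feature_log):
    """For each (entity, label_ts), pick the latest feature value with
    feature_ts <= label_ts. Values after label_ts would leak the future."""
    out = []
    for entity, label_ts in labels:
        eligible = [(ts, value) for e, ts, value in feature_log
                    if e == entity and ts <= label_ts]
        out.append(max(eligible)[1] if eligible else None)
    return out

# (entity, feature_ts, value)
feature_log = [("u1", 1, 0.2), ("u1", 5, 0.9), ("u2", 3, 0.5)]
joined = point_in_time_join([("u1", 4), ("u1", 6), ("u2", 2)], feature_log)
```

Note the third label: the entity's only feature value arrives after the label timestamp, so the join correctly yields no value rather than leaking it.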
Strategies for consistency:
1. Single computation: Compute features once (in the batch pipeline) and write to both stores simultaneously.
2. Shared transformation logic: Define feature transformations in a single specification (e.g., Feast feature definitions) that generates both batch and streaming computation.
3. Consistency monitoring: Regularly compare online feature values against offline feature values for the same entity and timestamp; alert on divergence.
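The third strategy is straightforward to sketch: sample entities, fetch the same feature from both stores, and flag divergence beyond a tolerance. The key and value shapes below are illustrative.

```python
def consistency_report(online: dict, offline: dict, tol: float = 1e-6) -> list:
    """Keys whose online and offline feature values diverge beyond tol,
    including keys missing from the online store entirely."""
    diverged = []
    for key, offline_val in offline.items():
        online_val = online.get(key)
        if online_val is None or abs(online_val - offline_val) > tol:
            diverged.append(key)
    return sorted(diverged)

# Keys are (entity_id, feature_name) pairs.
online = {("u1", "clicks_1h"): 3.0, ("u2", "clicks_1h"): 7.0}
offline = {("u1", "clicks_1h"): 3.0, ("u2", "clicks_1h"): 9.0}
```

Run as a scheduled job over a random sample of entities, with the divergence count wired to an alert, this check catches online-offline drift before the model's metrics do.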
H.6.2 — Feature Computation Patterns
| Pattern | Description | Use Case | Example |
|---|---|---|---|
| Batch transform | Compute from full historical data on a schedule | Slowly-changing features | User lifetime engagement rate |
| Streaming aggregate | Maintain running aggregates from event stream | Recent activity features | Clicks in last 1 hour |
| On-demand compute | Compute at serving time from raw inputs | Context-dependent features | Distance between user location and item location |
| Pre-joined | Join multiple tables in the batch pipeline, store the result | Cross-entity features | Average rating of items in user's watch history |
H.7 — Anti-Patterns to Avoid
H.7.1 — The God Model
Anti-pattern: One monolithic model that handles all use cases (recommendations, search, notifications, emails).
Why it fails: Different use cases have different latency requirements, different training data, and different optimization objectives. A single model cannot satisfy all constraints.
Fix: Decompose into specialized models with a shared feature platform. The StreamRec system uses separate models for candidate retrieval (two-tower), ranking (transformer), and notification triggering (lightweight classifier), all consuming from the same feature store.
H.7.2 — Training-Serving Skew Denial
Anti-pattern: Using different feature computation code for training (pandas in a notebook) and serving (SQL in a production pipeline), assuming they produce the same results.
Why it fails: Subtle differences in null handling, timestamp parsing, aggregation boundaries, and floating-point precision cause features to differ between training and serving. The model performs well in offline evaluation and poorly in production, with no obvious error.
Fix: Feature stores with shared transformation logic (Section H.6). Integration tests that compare training features against serving features for the same entity-timestamp pairs. Monitoring for feature distribution divergence.
H.7.3 — The Premature Microservice
Anti-pattern: Decomposing the ML system into 15 microservices before validating that the ML approach works at all.
Why it fails: The first version should answer: "Does this ML approach solve the business problem?" Microservice complexity makes iteration slow and debugging hard.
Fix: Start with a monolithic prototype (batch inference, simple serving). Decompose into services only after validating the approach and identifying scaling bottlenecks. The StreamRec team started with a single Python script that ran matrix factorization on a cron job and stored recommendations in PostgreSQL. They decomposed into the current architecture only after proving that personalization increased engagement by 20%.
H.7.4 — Monitoring by Dashboard Staring
Anti-pattern: Building beautiful dashboards but no automated alerts. Relying on humans to notice when metrics drift.
Why it fails: Humans check dashboards only when they remember to, which is never at 3 AM on Sunday when the feature pipeline silently starts producing null values.
Fix: Every metric on the dashboard should have an associated alert with a threshold, escalation path, and runbook (Chapter 30). Dashboards are for investigation; alerts are for detection.
H.7.5 — The Reproducibility Illusion
Anti-pattern: Logging the model version and code commit but not the exact training data, feature definitions, hyperparameters, and random seeds.
Why it fails: When a model degrades in production, you need to reproduce the previous version exactly — including the data snapshot and the full configuration. Logging only the model artifact is insufficient because you cannot retrain or debug without the full provenance chain.
Fix: Version everything: code (git), data (DVC or snapshot IDs), features (feature store versions), hyperparameters (experiment tracker: MLflow, W&B), and environment (Docker image SHA). The StreamRec team's deployment pipeline records all of these in the model registry and refuses to deploy a model without complete provenance.
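A deployment gate like StreamRec's can be sketched as a simple completeness check over the provenance record. The field names below are illustrative; a real registry (MLflow, W&B) would enforce its own schema.

```python
# Provenance fields a deployment gate demands before allowing rollout.
REQUIRED_PROVENANCE = {"code_commit", "data_snapshot", "feature_version",
                       "hyperparameters", "docker_image_sha"}

def missing_provenance(record: dict) -> set:
    """Return the required fields that are absent or empty."""
    return {field for field in REQUIRED_PROVENANCE
            if not record.get(field)}

complete = {"code_commit": "abc123", "data_snapshot": "dvc:snap-42",
            "feature_version": "v7", "hyperparameters": {"lr": 1e-3},
            "docker_image_sha": "sha256:deadbeef"}
partial = {"code_commit": "abc123"}
```

Refusing deployment when `missing_provenance` is non-empty is the whole mechanism: the check is trivial, and its value is that it runs on every deploy, not just when someone remembers.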
H.8 — Putting It All Together: StreamRec System Architecture
The complete StreamRec recommendation system, as built across Chapters 24-30 and integrated in Chapter 36, combines these patterns into a coherent architecture.
┌─────────────────────────────────────────────────────────────────────────┐
│ StreamRec Architecture │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────────────┐ │
│ │ Event Stream │────▶│ Streaming │────▶│ Online Feature Store │ │
│ │ (Kafka) │ │ Pipeline │ │ (Redis) │ │
│ └──────┬───────┘ └──────────────┘ └──────────┬───────────┘ │
│ │ │ │
│ ▼ │ │
│ ┌──────────────┐ ┌──────────────┐ │ │
│ │ Data Lake │────▶│ Batch Feature│────▶┌──────────┴───────────┐ │
│ │ (S3/Delta) │ │ Pipeline │ │ Offline Feature Store │ │
│ └──────────────┘ │ (Dagster) │────▶│ (Parquet on S3) │ │
│ └──────────────┘ └──────────┬───────────┘ │
│ │ │
│ ┌─────────────────────────────────────┐ │ │
│ │ Training Pipeline │◀──────────────┘ │
│ │ Data → Features → Train → Eval │ │
│ │ → Register → Shadow → Canary │ │
│ └──────────────┬──────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Serving Layer │ │
│ │ Request → Feature Lookup → Candidate Retrieval (Two-Tower+ANN) │ │
│ │ → Ranking (Transformer) → Re-Ranking (Diversity+Fairness) │ │
│ │ → Response │ │
│ └──────────────────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────────────────┐ │
│ │ Monitoring Layer │ │
│ │ Data Quality │ Feature Drift │ Model Health │ Business Metrics │ │
│ │ Latency/SLOs │ Fairness │ Alerts │ Incident Response │ │
│ └──────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────┘
Design principles embodied:
- Dual-store feature architecture (H.6.1): online Redis for serving, offline Parquet for training
- Hybrid serving (H.1.1 + H.1.2): batch candidate generation + real-time re-ranking
- Graceful degradation (H.5.2): four levels from full model to popularity baseline
- Circuit breaker (H.5.1): protects against cascading failures in model serving
- Shadow mode (H.5.3): every model update spends one week in shadow before canary rollout
- Complete monitoring (Chapter 30): data quality through business metrics with automated alerting
- Full provenance (H.7.5): every model deployment records code, data, features, hyperparameters, and evaluation results
The patterns in this appendix are starting points, not prescriptions. Every system has unique constraints that require adaptation. The mark of a senior ML engineer is not knowing these patterns but knowing when to deviate from them — and documenting why in an ADR.