# Part V: Production ML Systems

> "The model is 5% of the system. The other 95% — data pipelines, feature stores, monitoring, deployment, governance — determines whether the system succeeds or fails."

## Why This Part Exists
A brilliant model that runs only in a Jupyter notebook is not a product. It is a prototype.
The gap between "the model works on my laptop" and "the model runs reliably in production, serving millions of requests, retraining on fresh data, and recovering gracefully from failures" is the defining challenge of applied machine learning. Most ML courses skip it entirely. Most bootcamps hand-wave it away. The result: data scientists who can build excellent models but cannot ship them.
This part covers the full production ML stack:

- **System design:** how to architect a recommendation system with candidate retrieval, ranking, and re-ranking stages, each with its own latency budget.
- **Data infrastructure:** feature stores that ensure the features used during training match the features available during serving.
- **Distributed training:** scaling from one GPU to a cluster without losing model quality.
- **Pipeline orchestration:** building robust data workflows that handle failures gracefully.
- **Testing:** data validation, behavioral testing, and model validation gates that prevent bad models from reaching production.
- **Deployment:** CI/CD for ML, canary deployments, shadow mode, and automatic rollback.
- **Monitoring:** detecting data drift, model degradation, and system failures before they affect users.
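The retrieval → ranking → re-ranking pattern can be sketched in a few lines. This is a minimal illustration of the control flow only: every name, item count, scoring rule, and latency number below is an assumption for the sketch, not the StreamRec design developed in Chapter 24.

```python
# Illustrative three-stage serving sketch. The catalog, the scoring
# heuristics, and the per-stage budgets are all invented for this example.
CATALOG = list(range(10_000))                        # hypothetical item ids
LATENCY_BUDGET_MS = {"retrieve": 20, "rank": 30, "rerank": 10}  # example budgets

def retrieve(user_id: int, k: int = 500) -> list[int]:
    """Cheap candidate generation: narrow 10k items to a few hundred."""
    return CATALOG[user_id % 100 :: 100][:k]

def rank(user_id: int, candidates: list[int], k: int = 50) -> list[int]:
    """A heavier model scores only the retrieved candidates, never the full catalog."""
    scored = sorted(candidates, key=lambda item: -((item * user_id) % 97))
    return scored[:k]

def rerank(candidates: list[int], k: int = 10) -> list[int]:
    """Business logic on the top slate, e.g. a toy diversity constraint."""
    seen_buckets, slate = set(), []
    for item in candidates:
        bucket = item % 7          # stand-in for a category/genre id
        if bucket not in seen_buckets:
            seen_buckets.add(bucket)
            slate.append(item)
        if len(slate) == k:
            break
    return slate

def recommend(user_id: int) -> list[int]:
    return rerank(rank(user_id, retrieve(user_id)))

slate = recommend(user_id=42)
```

The design point the stages encode: each step sees fewer items than the last, so the expensive model only ever runs on a small candidate set, which is what makes the per-stage latency budgets achievable.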
Seven chapters, each addressing a component that a senior data scientist must understand — and often build — themselves.
## Chapters in This Part
| Chapter | Focus |
|---|---|
| 24. ML System Design | Architecture patterns, serving strategies, ADRs |
| 25. Data Infrastructure | Feature stores, lakehouses, data contracts, lineage |
| 26. Training at Scale | DDP, model parallelism, mixed precision, GPU optimization |
| 27. ML Pipeline Orchestration | Airflow, Dagster, Prefect, idempotency, backfill |
| 28. ML Testing and Validation | Great Expectations, behavioral testing, model validation gates |
| 29. Continuous Training and Deployment | CI/CD for ML, canary, shadow mode, retraining triggers |
| 30. Monitoring and Observability | Data drift, concept drift, alerting, incident response |
## Progressive Project Milestones
- M9 (Chapter 24): Design the complete StreamRec system architecture.
- M10 (Chapter 25): Build the feature store for real-time and batch features.
- M11 (Chapter 27): Orchestrate the training pipeline with Dagster.
- M12 (Chapter 28): Build data validation and behavioral testing infrastructure.
- M13 (Chapter 29): Implement the CI/CD pipeline with canary deployment.
- M14 (Chapter 30): Build the monitoring dashboard with drift detection and alerting.
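To make the M14 drift-detection milestone concrete, here is one common statistic for it: the Population Stability Index (PSI), which compares a feature's training-time distribution against its serving-time distribution. The binning scheme and the 0.1/0.2 thresholds below are widely used rules of thumb, not prescriptions from Chapter 30.

```python
import math
import random

def psi(expected: list[float], actual: list[float], bins: int = 10) -> float:
    """Population Stability Index between a reference and a live sample."""
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / bins for i in range(1, bins)]

    def frac(sample: list[float]) -> list[float]:
        counts = [0] * bins
        for x in sample:
            counts[sum(x > e for e in edges)] += 1   # clamp into [0, bins-1]
        return [(c + 1e-4) / len(sample) for c in counts]  # smooth empty bins

    e, a = frac(expected), frac(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

random.seed(0)
train   = [random.gauss(0.0, 1.0) for _ in range(5_000)]  # reference window
stable  = [random.gauss(0.0, 1.0) for _ in range(5_000)]  # same distribution
shifted = [random.gauss(1.0, 1.0) for _ in range(5_000)]  # mean drifted by 1 sigma

psi_stable = psi(train, stable)    # small: no alert
psi_shifted = psi(train, shifted)  # large: would trigger a drift alert
```

A monitoring job would compute this per feature on a schedule and page (or trigger retraining) when the index crosses the chosen threshold.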
## Prerequisites
Practical ML experience (Parts I-II). Chapter 5 (Computational Complexity) provides useful background. No prior production ML experience is assumed — this part builds the full stack from first principles.
## Full Chapter Titles
- Chapter 24: ML System Design — Architecture Patterns for Real-World Machine Learning
- Chapter 25: Data Infrastructure — Feature Stores, Data Warehouses, Lakehouses, and the Plumbing Nobody Teaches
- Chapter 26: Training at Scale — Distributed Training, GPU Optimization, and Managing Compute Costs
- Chapter 27: ML Pipeline Orchestration — Airflow, Dagster, Prefect, and Designing Robust Data Workflows
- Chapter 28: ML Testing and Validation Infrastructure — Data Contracts, Behavioral Testing, and Great Expectations
- Chapter 29: Continuous Training and Deployment — CI/CD for ML, Canary Deployments, Shadow Mode, and Progressive Rollout
- Chapter 30: Monitoring, Observability, and Incident Response — Keeping ML Systems Healthy in Production