Part V: Production ML Systems

"The model is 5% of the system. The other 95% — data pipelines, feature stores, monitoring, deployment, governance — determines whether the system succeeds or fails."


Why This Part Exists

A brilliant model that runs only in a Jupyter notebook is not a product. It is a prototype.

The gap between "the model works on my laptop" and "the model runs reliably in production, serving millions of requests, retraining on fresh data, and recovering gracefully from failures" is the defining challenge of applied machine learning. Most ML courses skip it entirely. Most bootcamps hand-wave it away. The result: data scientists who can build excellent models but cannot ship them.

This part covers the full production ML stack:

  • System design: how to architect a recommendation system with candidate retrieval, ranking, and re-ranking stages, each with its own latency budget.
  • Data infrastructure: feature stores that ensure the features used during training match the features available during serving.
  • Distributed training: scaling from one GPU to a cluster without losing model quality.
  • Pipeline orchestration: building robust data workflows that handle failures gracefully.
  • Testing: data validation, behavioral testing, and model validation gates that prevent bad models from reaching production.
  • Deployment: CI/CD for ML, canary deployments, shadow mode, and automatic rollback.
  • Monitoring: detecting data drift, model degradation, and system failures before they affect users.
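To make one of these concerns concrete: drift detection often reduces to comparing a live feature distribution against the one the model was trained on. Here is a minimal sketch using the Population Stability Index; the function, bin count, and 0.2 threshold are common illustrative conventions, not the specific tooling the chapters build.

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a reference (training)
    distribution and a live (serving) distribution of one feature.
    A common rule of thumb flags PSI > 0.2 as significant drift."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.histogram(expected, bins=edges)[0] / len(expected)
    a_pct = np.histogram(actual, bins=edges)[0] / len(actual)
    # Floor empty bins so the log term is defined.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))

rng = np.random.default_rng(0)
train_dist = rng.normal(0.0, 1.0, 10_000)  # reference: training-time feature
live_dist = rng.normal(1.0, 1.0, 10_000)   # serving-time feature, mean shifted

print(f"no drift:   {psi(train_dist, train_dist):.3f}")
print(f"mean shift: {psi(train_dist, live_dist):.3f}")
```

In production this check runs per feature on a schedule, with the reference distribution snapshotted at training time; Chapter 30 covers when to alert on it.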

Seven chapters, each addressing a component that a senior data scientist must understand — and often build — themselves.

Chapters in This Part

Chapter                                | Focus
24. ML System Design                   | Architecture patterns, serving strategies, ADRs
25. Data Infrastructure                | Feature stores, lakehouses, data contracts, lineage
26. Training at Scale                  | DDP, model parallelism, mixed precision, GPU optimization
27. ML Pipeline Orchestration          | Airflow, Dagster, Prefect, idempotency, backfill
28. ML Testing and Validation          | Great Expectations, behavioral testing, model validation gates
29. Continuous Training and Deployment | CI/CD for ML, canary, shadow mode, retraining triggers
30. Monitoring and Observability       | Data drift, concept drift, alerting, incident response

Progressive Project Milestones

  • M9 (Chapter 24): Design the complete StreamRec system architecture.
  • M10 (Chapter 25): Build the feature store for real-time and batch features.
  • M11 (Chapter 27): Orchestrate the training pipeline with Dagster.
  • M12 (Chapter 28): Build data validation and behavioral testing infrastructure.
  • M13 (Chapter 29): Implement the CI/CD pipeline with canary deployment.
  • M14 (Chapter 30): Build the monitoring dashboard with drift detection and alerting.

Prerequisites

Practical ML experience (Parts I-II). Chapter 5 (Computational Complexity) provides useful background. No prior production ML experience is assumed — this part builds the full stack from first principles.