Chapter 29: Key Takeaways

  1. ML deployment manages three artifacts — code, data, and model — that change on different timelines and require different versioning, testing, and promotion strategies. Code is versioned in Git and tested with unit and integration tests. Data is versioned by partition date or DVC hash and tested with schema validation and statistical tests (PSI). Models are versioned in the model registry (MLflow) and tested with behavioral tests and champion-challenger validation gates. The CI/CD pipeline must handle all three artifact types and enforce their co-versioning through artifact lineage, which connects every production model to the exact code commit, data version, and hyperparameters that produced it. Without lineage, production regressions cannot be diagnosed, regulatory audits cannot be satisfied, and reproducibility is impossible.
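The lineage record described above can be sketched as a small data structure. This is an illustrative sketch, not the chapter's implementation: the field names and the fingerprint scheme are assumptions, standing in for what a registry like MLflow would store as run parameters and tags.

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class ModelLineage:
    """Links a production model to the exact inputs that produced it.
    Field names are hypothetical; a real system might store these as
    MLflow run params/tags or registry metadata."""
    model_version: str
    code_commit: str        # Git SHA of the training code
    data_version: str       # partition date or DVC content hash
    hyperparameters: dict

    def fingerprint(self) -> str:
        """Deterministic hash over all lineage fields: two runs with
        identical code, data, and hyperparameters get the same ID,
        which is what makes reproducibility checkable."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()[:12]

lineage = ModelLineage(
    model_version="streamrec-v42",
    code_commit="a1b2c3d",
    data_version="2024-06-01",
    hyperparameters={"lr": 0.01, "depth": 8},
)
fp = lineage.fingerprint()
```

With a record like this attached to every promoted model, diagnosing a regression starts from the fingerprint rather than from guesswork about which commit and data window trained the serving model.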

  2. Staged deployment — shadow mode, canary, progressive rollout — provides increasing confidence with increasing risk exposure. Shadow mode (0% traffic, challenger runs in parallel) tests latency, rank correlation, and prediction distributions on real production data with zero user impact. Canary deployment (5-10% traffic) tests actual user behavior (CTR, completion rate, engagement) with limited blast radius. Progressive rollout (25% → 50% → 100%) scales exposure with monitoring at each stage. Each stage catches failure modes the previous stage cannot: shadow catches training-serving skew and latency issues; canary catches user-facing quality regressions; progressive rollout catches interaction effects that only appear at scale. Skipping a stage trades safety for speed — a tradeoff that is rarely justified for production ML systems.
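The shadow → canary → progressive sequence can be sketched as a gate function over traffic percentages. This is a minimal sketch under assumed thresholds: the metric names and limits are illustrative, not values from the chapter, and a real gate would use statistical tests rather than point comparisons.

```python
# Traffic percentages for the staged rollout; 0% is shadow mode,
# where the challenger scores real requests but serves no users.
STAGES = [0, 5, 25, 50, 100]

def next_stage(current_pct: int, metrics: dict) -> int:
    """Promote the challenger to the next traffic percentage only if
    the current stage's gate passes; on any failure, fall back to 0%
    (i.e., route all traffic back to the champion)."""
    if current_pct == 0:
        # Shadow gate: system behavior vs. the champion, no user impact.
        # Thresholds here are hypothetical examples.
        ok = (metrics["p99_latency_ms"] <= 150
              and metrics["rank_correlation"] >= 0.90)
    else:
        # Canary/progressive gate: user-facing quality vs. the champion.
        ok = metrics["ctr_delta"] >= -0.01  # tolerate <= 1 pt CTR drop
    if not ok:
        return 0
    idx = STAGES.index(current_pct)
    return STAGES[min(idx + 1, len(STAGES) - 1)]
```

Note how the two gate types mirror the text: the shadow gate can only see system-level signals (latency, agreement with the champion), while user-behavior signals like CTR only become observable once the canary stage puts real traffic on the challenger.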

  3. Continuous training requires three complementary triggers: scheduled (guaranteed freshness), drift-based (reactive adaptation), and performance-based (quality safety net). Scheduled retraining provides a predictable baseline cadence (weekly for StreamRec, quarterly for credit scoring). Drift-based retraining responds to distribution shifts between scheduled windows, using PSI or other statistical tests to detect when the data has changed enough to warrant a new model. Performance-based retraining catches quality degradation that drift detection may miss — because the degradation affects features not monitored by PSI, or because the model is sensitive to small shifts that fall below the drift threshold. The combination of all three, with deduplication to prevent redundant retraining, balances model freshness, compute cost, and deployment risk.
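The PSI test mentioned above as the drift-based trigger can be sketched in a few lines. This follows the standard PSI formulation (quantile-bucketed reference vs. live sample); the bucket count and the common 0.2 alert threshold are conventions, not values prescribed by the chapter.

```python
import numpy as np

def psi(expected: np.ndarray, actual: np.ndarray, bins: int = 10) -> float:
    """Population Stability Index between a reference (training-time)
    sample and a live sample of one feature. A common rule of thumb
    treats PSI > 0.2 as drift worth triggering a retrain."""
    # Bucket edges from the reference distribution's quantiles.
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    # Assign each value to a bucket; clip so out-of-range live values
    # land in the outermost buckets instead of being dropped.
    e_idx = np.clip(np.searchsorted(edges, expected, side="right") - 1, 0, bins - 1)
    a_idx = np.clip(np.searchsorted(edges, actual, side="right") - 1, 0, bins - 1)
    e_pct = np.bincount(e_idx, minlength=bins) / len(expected)
    a_pct = np.bincount(a_idx, minlength=bins) / len(actual)
    # Floor the proportions to avoid log(0) on empty buckets.
    e_pct = np.clip(e_pct, 1e-6, None)
    a_pct = np.clip(a_pct, 1e-6, None)
    return float(np.sum((a_pct - e_pct) * np.log(a_pct / e_pct)))
```

A trigger built on this would run per monitored feature; the takeaway's caveat applies directly, since a feature left out of this loop can drift without raising PSI, which is exactly the gap the performance-based trigger covers.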

  4. Rollback is infrastructure, not an emergency procedure — the previous champion must be warm, healthy, tested, and restorable in seconds. A rollback procedure that takes minutes or requires human intervention is insufficient for high-traffic systems where every second of degraded service affects thousands of requests. Warm standby (the previous champion remains deployed but receives no traffic), automated trigger conditions (error rate, latency, metric thresholds), and regular rollback drills (monthly, tested end-to-end) transform rollback from a hope into a guarantee. The cost of warm standby is real (approximately 50% of serving infrastructure for the standby pods) but is insurance against the much larger cost of a prolonged production incident.
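The automated trigger conditions above reduce to a continuously evaluated predicate. A hypothetical sketch follows; the metric names and thresholds are illustrative assumptions, and in a real system a breach would flip routing back to the warm standby champion in a single traffic change, with no redeploy in the critical path.

```python
# Illustrative trigger thresholds; real values come from the
# service's SLOs and the champion's baseline metrics.
ROLLBACK_TRIGGERS = {
    "error_rate": 0.01,       # > 1% failed requests
    "p99_latency_ms": 300.0,  # latency SLO breach
    "ctr_drop": 0.05,         # > 5% relative CTR drop vs. champion
}

def should_rollback(metrics: dict) -> bool:
    """True if any trigger threshold is breached. Missing metrics are
    treated as healthy here; a production check would more likely
    treat a missing signal as a failure in its own right."""
    return any(metrics.get(name, 0.0) > limit
               for name, limit in ROLLBACK_TRIGGERS.items())
```

The point of the takeaway is that this predicate is worthless unless the target of the rollback (the warm standby) is continuously health-checked and drilled; automation of the decision only pays off when the execution path is guaranteed to work.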

  5. Regulated environments add human approval gates, independent validation, and documentation requirements that lengthen the deployment timeline but reduce regulatory risk. The deployment pipeline for a credit scoring model (Meridian Financial, 17-19 business days) is fundamentally different from the pipeline for a recommendation model (StreamRec, 12 days) — not because the engineering is different, but because the regulatory constraints add mandatory human review stages. Independent MRM validation, automated model change documents, fairness checks at every deployment stage, and a kill switch that any compliance officer can activate are regulatory requirements that shape the pipeline design. The key insight is that automation and regulation are complementary: automated pipelines produce better documentation faster, reduce deployment failures, and satisfy examiner expectations more consistently than manual processes.
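The human approval gates and kill switch described above can be encoded as a promotion precondition. This is a structural sketch only: the role names are assumptions loosely mirroring the stages in the text (independent MRM validation, compliance sign-off), not Meridian Financial's actual process.

```python
# Hypothetical mandatory sign-offs for promoting a regulated model.
REQUIRED_APPROVALS = {"mrm_validator", "model_owner", "compliance_officer"}

def can_promote(approvals: set, kill_switch_active: bool) -> bool:
    """Promotion requires every mandatory human sign-off AND an
    inactive kill switch. The kill switch is deliberately a single
    boolean that any compliance officer can set, overriding all
    other state in the pipeline."""
    return REQUIRED_APPROVALS <= approvals and not kill_switch_active
```

Encoding the gates this way is one reason automation and regulation are complementary: the pipeline can block promotion mechanically, record who approved what and when, and emit that record into the model change document instead of relying on email threads.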

  6. The MLOps maturity progression from Level 0 (manual) to Level 2 (CI/CD automation) is the most impactful engineering investment for production ML teams. Level 0 to Level 1 (automating the training pipeline) saves the most engineering hours by eliminating manual notebook execution and ad hoc deployment. Level 1 to Level 2 (automating deployment with staged rollout) saves the most production incidents by replacing manual promotion decisions with statistically evaluated canary deployments and automated rollback. Level 3 (fully automated self-healing) is achievable for high-traffic, time-sensitive models but is not cost-effective for most production systems. The target for any model serving production traffic is Level 2.