Case Study 1: StreamRec Deployment Pipeline — From Weekly Retraining to Continuous Deployment
Context
Eight months into production, the StreamRec recommendation system has a mature training pipeline (Chapter 27), comprehensive testing infrastructure (Chapter 28), and a model validation gate that prevents bad models from reaching production. The team retrains weekly, evaluates offline metrics, and — when the new model passes the validation gate — a senior engineer manually promotes the model by updating a Kubernetes deployment YAML and rolling it out.
The manual deployment process takes 45 minutes of engineer time per deployment: 10 minutes to review the validation gate results, 5 minutes to update the deployment configuration, 15 minutes to monitor the rollout, and 15 minutes to verify that the new model is serving correctly. But the time cost is not the real problem. The real problem is what happens between deployments.
Incident 4 (Week 26): Stale model during trending event. A major content creator released a 10-part series on the platform. Within 48 hours, the series dominated organic engagement. But the recommendation model, trained 5 days earlier, did not know the series existed. The model continued recommending older content while users were actively seeking the new series. The product team manually boosted the series in the recommendation carousel, but the model's irrelevant suggestions in other positions reduced overall engagement by 4% for the 3 days before the weekly retraining captured the new data.
Incident 5 (Week 31): Delayed deployment. The weekly retraining completed on Sunday as usual, and the validation gate passed. On Monday, the senior engineer was on PTO; on Tuesday, the backup engineer was in meetings until 4pm. By the time the deployment started at 4:30pm on Tuesday, the model was 2.5 days old. The rollout completed at 5:15pm, and the engineer went home. At 11pm, the monitoring dashboard showed a 6% CTR drop for Android users. The on-call engineer, unfamiliar with the deployment, spent 90 minutes diagnosing the issue before manually rolling back by reverting the Kubernetes deployment. The root cause: a client update on Monday had changed the Android event format, and the model trained on Sunday's data produced degraded predictions for the new format. Had the deployment occurred Monday morning, the Android data issue would have been detected before the model was promoted. The 2.5-day delay between training and deployment was the critical failure.
Incident 6 (Week 35): Quiet degradation. Between weeks 30 and 35, the model's NDCG@20 on the weekly offline evaluation declined from 0.186 to 0.171, an 8% drop. Each weekly model was slightly worse than the last, but each passed the validation gate (which compared the new model against the current production model, not against a historical baseline). Because the current production model was also degrading, the relative comparison never triggered a rejection. The root cause was a gradual shift in the user population: a marketing campaign had attracted a new user demographic with different content preferences, and the model's features did not capture the new demographic's behavior well. No trigger fired because the decline was gradual and each weekly comparison showed only a 1-2% relative change.
The VP of Engineering mandated an investment: automated deployment with continuous training, canary evaluation, and automatic rollback.
The Solution
Phase 1: Automated Deployment Pipeline (Weeks 1-3)
The team implemented the full deployment pipeline from Section 29.12:
CI/CD for code changes. Every pull request triggers the CI pipeline (lint, type check, unit tests, integration tests, Docker build). Merges to main push the serving image to the container registry. Code changes do not automatically deploy a new model — they update the serving infrastructure, and the next retraining picks up the new code.
Automated model promotion. When the Dagster training pipeline completes and the model passes the validation gate (Chapter 28), the pipeline automatically registers the model in MLflow and transitions it to Staging. A Dagster sensor detects the new Staging model and initiates the deployment pipeline.
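The sensor's core decision can be sketched as a plain polling function over a duck-typed registry client. The real pipeline would wrap this in a Dagster `@sensor` and use `mlflow.tracking.MlflowClient`; the function name and shape here are illustrative, not the team's actual code.

```python
def find_new_staging_version(client, model_name, last_deployed_version):
    """Return the newest Staging version not yet deployed, or None.

    `client` is any object exposing get_latest_versions(name, stages=[...])
    returning objects with a `.version` attribute -- the shape of
    mlflow.tracking.MlflowClient. Sketch only; names are illustrative.
    """
    staging = client.get_latest_versions(model_name, stages=["Staging"])
    if not staging:
        return None
    newest = max(int(v.version) for v in staging)
    if last_deployed_version is None or newest > int(last_deployed_version):
        return newest  # a Dagster sensor would yield a RunRequest here
    return None
```

A sensor would call this on each tick and key the resulting run request on the version number, so the deployment pipeline runs exactly once per new model.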
Deployment stages. The model progresses through:
| Stage | Traffic | Duration | Automated |
|---|---|---|---|
| Shadow | 0% (parallel) | 7 days | Yes |
| Canary | 10% | 3 days | Yes (auto-evaluate) |
| Stage 1 | 25% | 1 day | Yes |
| Stage 2 | 50% | 1 day | Yes |
| Full | 100% | — | Yes |
The total deployment time from training completion to full rollout: 12 days for scheduled retraining.
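The stage table can be encoded directly as data, with a small helper deriving when each stage begins and the 12-day total. A sketch; the names are taken from the table above, not from the team's code.

```python
# The stage table as data: (name, traffic_percent, dwell_days).
# The full stage has no dwell time once reached.
STAGES = [
    ("shadow", 0, 7),
    ("canary", 10, 3),
    ("stage1", 25, 1),
    ("stage2", 50, 1),
    ("full", 100, 0),
]

def rollout_schedule(stages=STAGES):
    """Return ([(stage, traffic, day_reached), ...], total_days)."""
    schedule, day = [], 0
    for name, traffic, dwell in stages:
        schedule.append((name, traffic, day))
        day += dwell
    return schedule, day
```

Keeping the stage definitions as data rather than hard-coded steps makes the dwell times and traffic splits reviewable in one place.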
Phase 2: Continuous Training Triggers (Weeks 4-5)
The team implemented three retraining triggers:
Scheduled (weekly, Sunday 2am UTC). The existing weekly cadence, now fully automated. The model enters the deployment pipeline immediately after training and validation.
Drift-based (PSI > 0.25). The monitoring pipeline computes daily PSI for 8 key features against a 30-day reference distribution. When any feature exceeds a PSI of 0.25, a retraining trigger fires. The team configured human approval before canary deployment for drift-triggered retraining, after learning from Incident 5 that drift may indicate a data quality issue rather than a legitimate distribution shift.
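A minimal PSI implementation, for concreteness. This sketch bins by deciles of the reference sample and floors empty bins to avoid log(0); the production monitoring pipeline's exact binning is not specified in this chapter.

```python
import math

def psi(reference, current, bins=10):
    """Population Stability Index between two numeric samples.

    Bin edges are quantiles of the reference distribution; zero
    fractions are floored at 1e-4 so the log term is defined.
    Sketch only -- binning details are an assumption.
    """
    ref = sorted(reference)
    edges = [ref[int(i * len(ref) / bins)] for i in range(1, bins)]

    def fractions(sample):
        counts = [0] * bins
        for x in sample:
            counts[sum(1 for e in edges if x > e)] += 1
        return [max(c / len(sample), 1e-4) for c in counts]

    p, q = fractions(reference), fractions(current)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

Identical distributions give a PSI near zero; a hard shift like the one in Incident 4 pushes it well past the 0.25 trigger threshold.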
Performance-based (absolute floor or relative decline). The monitoring pipeline computes daily production metrics (CTR, completion rate, NDCG@20 on a held-out evaluation set). Two thresholds: absolute floor (NDCG@20 < 0.15) for catastrophic degradation, and relative decline (>10% from the 7-day rolling average baseline) for gradual degradation. The relative decline threshold was specifically designed to catch the Incident 6 pattern — gradual degradation that the champion-challenger gate missed because both models were degrading together.
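Both thresholds fit in a few lines. A sketch; the function name is illustrative, and the defaults mirror the two rules described above.

```python
def performance_trigger(ndcg_today, ndcg_history_7d,
                        absolute_floor=0.15, relative_decline=0.10):
    """Return which performance trigger fired, or None.

    `ndcg_history_7d` holds the last 7 daily NDCG@20 values; the
    relative rule compares today against their mean. Illustrative
    sketch of the rules in the text, not production code.
    """
    if ndcg_today < absolute_floor:
        return "absolute_floor"
    baseline = sum(ndcg_history_7d) / len(ndcg_history_7d)
    if ndcg_today < baseline * (1 - relative_decline):
        return "relative_decline"
    return None
```

The absolute floor is checked first: catastrophic degradation should fire regardless of what the rolling baseline has drifted down to.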
Deduplication. Minimum 24-hour interval between retraining runs. Critical triggers (PSI > 0.50 or metric below absolute floor) bypass the deduplication interval.
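The deduplication rule reduces to a small predicate applied after any trigger fires. A sketch; the signature and constants are illustrative.

```python
CRITICAL_PSI = 0.50
MIN_INTERVAL_HOURS = 24

def should_retrain(hours_since_last_run, psi_max, metric,
                   absolute_floor=0.15):
    """Suppress a fired trigger within 24h of the last retraining run,
    unless it is critical (PSI > 0.50 or metric below the absolute
    floor), which bypasses the interval. Illustrative sketch.
    """
    critical = psi_max > CRITICAL_PSI or metric < absolute_floor
    return critical or hours_since_last_run >= MIN_INTERVAL_HOURS
```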
Phase 3: Rollback Infrastructure (Week 6)
Warm standby. The previous champion model remains deployed on a dedicated pod set (2 pods, vs. 8 for the active champion). The warm standby receives no traffic but is health-checked every 5 minutes. Rollback switches traffic from the active champion to the warm standby via an Istio VirtualService weight update — a single API call that propagates in under 3 seconds.
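The rollback switch amounts to rewriting the route weights on the VirtualService. A sketch of the patch body follows; the host and subset names are illustrative, and applying the patch (e.g. via `kubernetes.client.CustomObjectsApi`) is left to the pipeline.

```python
def rollback_patch(service_host="streamrec-ranker",
                   champion_subset="champion",
                   standby_subset="standby"):
    """Build an Istio VirtualService patch sending 100% of traffic to
    the warm standby. Host/subset names are hypothetical examples,
    not the team's actual resource names.
    """
    return {
        "spec": {
            "http": [{
                "route": [
                    {"destination": {"host": service_host,
                                     "subset": champion_subset},
                     "weight": 0},
                    {"destination": {"host": service_host,
                                     "subset": standby_subset},
                     "weight": 100},
                ]
            }]
        }
    }
```

Because this is a single declarative update to one resource, the switch propagates as fast as the mesh pushes configuration, which is what makes the sub-3-second rollback possible.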
Automated rollback triggers. Two immediate rollback triggers:
- Error rate > 0.5% (any stage)
- P99 latency > 60ms (any stage)

Two evaluation-based rollback triggers:
- Canary CTR < -2% relative to champion (statistically significant, p < 0.05)
- Shadow rank correlation < 0.70
Monthly rollback drills. First Tuesday of every month, 2am UTC (low traffic). The on-call engineer executes a planned rollback: switch traffic to the warm standby, verify predictions are served correctly, switch back. Drill results are logged and reviewed in the weekly team meeting.
Results
Quantitative Impact
| Metric | Before (Manual) | After (Automated) | Change |
|---|---|---|---|
| Deployment frequency | 1x/week (manual) | 1x/week (auto) + event-driven | +event-driven |
| Time from training to full rollout | 2-5 days (variable) | 12 days (predictable) | More predictable |
| Time from training to first traffic | 2-5 days | 7 days (shadow start) | Standardized |
| Mean time to rollback | 94 minutes | 3 seconds | 99.9% reduction |
| Monthly engineer hours on deployment | 12 hours | 1 hour (monitoring) | 92% reduction |
| Bad model deployments (6-month) | 2 (Incidents 5, 6) | 0 | 100% reduction |
| Drift-triggered retrains (6-month) | 0 (no triggers existed) | 4 | New capability |
| CTR impact from stale models | -2.1% (estimated, Incident 4) | Near zero | Significant |
Retroactive Analysis
The team retroactively applied the new trigger system to historical data from Incidents 4, 5, and 6:
Incident 4 (stale model during trending event): The drift trigger would have fired within 24 hours of the trending event — the item_popularity_score feature's PSI jumped to 0.38 as the new series dominated engagement. A drift-triggered retraining on day 2 (instead of waiting for the weekly schedule on day 5) would have reduced the staleness window from 5 days to approximately 2 days plus the deployment pipeline time.
Incident 5 (delayed deployment): Automated deployment eliminates the human bottleneck entirely. The model would have entered shadow mode immediately after validation, with no dependency on engineer availability. The Android data format change would have been caught during shadow mode (the shadow model's predictions on Android traffic would show anomalous score distributions), preventing the bad model from reaching canary.
Incident 6 (quiet degradation): The performance trigger's relative decline threshold (>10% from the 7-day rolling average) would have fired in week 33, when NDCG@20 fell from a 7-day average of 0.182 to 0.171 (roughly a 6% decline per week, which cumulatively crossed the 10% threshold after two weeks). This would have triggered an investigation 2 weeks earlier than the manual discovery in week 35.
Organizational Impact
The automated deployment pipeline changed the team's relationship with production models. Before automation, deployment was an event — a high-stakes, high-attention activity that required coordination, availability, and vigilance. After automation, deployment became a process — a predictable, monitored, and self-correcting flow that required attention only when anomalies occurred.
The shift from event to process freed 11 hours per month of senior engineer time (from 12 hours of deployment toil to 1 hour of deployment monitoring). More importantly, it eliminated the deployment bottleneck: the model's freshness was no longer constrained by human availability. The weekly model was deployed within 12 days, every week, regardless of holidays, PTO, or meeting schedules.
The drift-triggered retraining provided a new capability that the team had never had: the ability to respond to distribution shifts between scheduled retraining windows. In the first 6 months, 4 drift-triggered retraining events occurred — each producing a model that was measurably better than the scheduled model would have been, because it was trained on more recent data that reflected the current distribution.