Chapter 29: Quiz

Test your understanding of continuous training and deployment. Answers follow each question.


Question 1

Why does ML deployment require managing three artifact types (code, data, model) while traditional software deployment requires only one (code)?

**Answer:** Traditional software behavior is fully determined by code: the same code produces the same binary, and the same binary produces the same output. ML behavior is determined by code (serving infrastructure, feature engineering), data (the training dataset, feature schemas, preprocessing parameters), and the model artifact (serialized weights and architecture). These three artifacts change on different timelines — code changes with engineer commits (daily), data changes with new events (hourly), and models change with retraining (weekly). A code change that does not touch the model can break serving. A data change that does not touch the code can degrade predictions. A model retrained on the same code and data can produce different results due to non-deterministic training. The CI/CD pipeline must version, test, and promote all three artifacts together to ensure that any production model can be fully reproduced and any regression can be traced to a specific change in code, data, or model.

Question 2

What are the four MLOps maturity levels, and what is the most impactful transition between adjacent levels?

**Answer:** **Level 0 (Manual):** Training in notebooks, manual deployment, no monitoring, no retraining schedule. **Level 1 (ML Pipeline Automation):** Automated training pipeline, manual deployment promotion, basic monitoring, scheduled retraining. **Level 2 (CI/CD Pipeline Automation):** Automated CI/CD with staged rollout (shadow, canary, progressive), model performance and drift monitoring, event-driven retraining. **Level 3 (Automated ML System):** Fully automated self-healing loop (trigger → train → validate → deploy → monitor → trigger), automated anomaly detection, zero-touch deployment. The most impactful transition is Level 0 to Level 1 — automating the training pipeline saves the most engineering hours per unit of investment by eliminating manual notebook execution, ad hoc deployment, and "someone forgot to retrain" incidents. The Level 1 to Level 2 transition saves the most production incidents by replacing manual deployment decisions with automated staged rollouts and monitoring-driven rollback.

Question 3

Explain the difference between the code path, data path, and model path in an ML CI/CD pipeline. When is each path triggered?

**Answer:** The **code path** is triggered by a git push or pull request merge. It runs linting, type checking, unit tests, integration tests, and builds a Docker image. It tests the serving infrastructure, feature engineering logic, and API contracts. The **data path** is triggered by the arrival of a new data partition or a sensor detecting new data. It runs schema validation (Great Expectations), statistical tests (PSI), and data contract checks. If drift thresholds are exceeded, it triggers retraining. The **model path** is triggered by the completion of a training run. It runs offline evaluation, behavioral tests (CheckList), the champion-challenger validation gate, and registers the model in the registry. These three paths operate on different timelines (code: daily; data: hourly; model: weekly) and converge at the deployment stage, where the registered model enters the shadow → canary → progressive rollout pipeline.

Question 4

Why does the ModelRegistry enforce a fixed stage transition order (None → Staging → Shadow → Canary → Production)?

**Answer:** The fixed stage transition order encodes the deployment process as a state machine, ensuring that no model can bypass evaluation stages. A model cannot jump from `None` directly to `Production` — it must pass through `Staging` (validation gate), `Shadow` (production traffic evaluation without user impact), and `Canary` (limited production traffic with user impact). This prevents several failure modes: an engineer manually promoting an untested model, a pipeline bug skipping the shadow evaluation, or an automated process deploying a model that passed offline evaluation but would fail under production conditions. The `Archived` state is reachable from any stage (for rollback or retirement) but is terminal — an archived model cannot be reactivated without going through the full promotion pipeline again. This enforcement is particularly important in regulated environments where auditors must verify that every production model completed the required evaluation stages.
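The enforcement can be pictured as a small transition table. This is a sketch: the stage names follow the question, but the table and function are illustrative, not a specific registry's API.

```python
# Allowed next stages for each current stage. `Archived` is reachable
# from anywhere but terminal: re-promotion restarts the pipeline.
ALLOWED_TRANSITIONS = {
    "None": {"Staging", "Archived"},
    "Staging": {"Shadow", "Archived"},
    "Shadow": {"Canary", "Archived"},
    "Canary": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),
}

def transition(current: str, target: str) -> str:
    """Allow only adjacent-stage promotions (or archival from any stage)."""
    if target not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {target}")
    return target

stage = "None"
for nxt in ("Staging", "Shadow", "Canary", "Production"):
    stage = transition(stage, nxt)   # legal path: no stage skipped

try:
    transition("None", "Production")  # attempting to skip evaluation stages
except ValueError as e:
    print(e)
```

Because every promotion goes through one function, the state machine is also the natural place to emit the audit trail that regulated environments require.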

Question 5

What is artifact lineage, and why is it critical for production ML systems?

**Answer:** Artifact lineage is the complete record connecting a deployed model to every input that produced it: the exact git commit of the training code, the data version or partition date, the feature set version, the hyperparameters, the random seed, the Docker image, and the evaluation metrics. Lineage is critical for three reasons. **Reproducibility:** given a model version number, an engineer must be able to reconstruct the exact training environment that produced it — required by regulation in financial services and healthcare. **Root cause analysis:** when a production model regresses, comparing the lineage of the current and previous versions identifies what changed (code diff, data shift, hyperparameter change). **Compliance:** in regulated industries (SR 11-7 for banking, FDA for medical devices), auditors require documentation of every model's provenance. The `ArtifactLineage.diff` method enables rapid comparison between any two model versions, turning "why did the model get worse?" from an investigation into a query.
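A minimal sketch of what such a lineage record and its `diff` could look like. The field set, values, and implementation here are illustrative, not the chapter's exact class.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class ArtifactLineage:
    """Sketch of a lineage record; fields mirror the list above."""
    code_commit: str
    data_partition: str
    feature_set_version: str
    hyperparameters: tuple   # e.g. sorted (name, value) pairs
    random_seed: int
    docker_image: str

    def diff(self, other: "ArtifactLineage") -> dict:
        """Return the fields that differ between two model versions."""
        a, b = asdict(self), asdict(other)
        return {k: (b[k], a[k]) for k in a if a[k] != b[k]}

v41 = ArtifactLineage("a1b2c3", "2024-05-01", "fs-v7", (("lr", 0.01),), 42, "serve:1.4")
v42 = ArtifactLineage("a1b2c3", "2024-05-08", "fs-v7", (("lr", 0.01),), 42, "serve:1.4")
print(v42.diff(v41))   # only the data partition changed between v41 and v42
```

Here the diff immediately localizes the regression hunt to the data: code, features, seed, and image are identical between the two versions.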

Question 6

What are the three requirements of a shadow mode deployment, and why is each necessary?

**Answer:** **Zero user impact:** Shadow predictions are never returned to users, even if the shadow model appears to perform better. This eliminates risk — a broken shadow model affects monitoring dashboards, not users. **Identical inputs:** The shadow model must receive the exact same features as the champion for every request. Any difference in inputs (due to caching, feature versioning, or code path differences) would confound the comparison, making it impossible to determine whether performance differences are due to the model or the inputs. **Bounded resource cost:** Shadow mode doubles inference compute (every request is scored twice). The shadow model must run on a separate resource pool so that its compute load cannot degrade the champion's latency. If the shadow model is slow, it should degrade its own predictions (by timing out), not the champion's serving quality.
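The zero-impact and bounded-cost requirements can be sketched as fire-and-forget scoring on a separate worker pool. This is a threading toy with stand-in models; in production the shadow typically runs as a separate service, and all names here are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor

# A separate pool for the shadow model, so shadow load cannot consume
# champion serving capacity.
shadow_pool = ThreadPoolExecutor(max_workers=4)
shadow_log = []

def serve(features, champion, shadow):
    response = champion(features)   # user-facing path: returned immediately
    # Fire-and-forget: the *same* features go to the shadow model, and the
    # serving path never waits on (or returns) the shadow result.
    shadow_pool.submit(lambda: shadow_log.append((response, shadow(features))))
    return response

champion = lambda f: sum(f)         # stand-ins for real models
shadow = lambda f: sum(f) * 1.1
out = serve([0.2, 0.3], champion, shadow)
shadow_pool.shutdown(wait=True)     # demo only: flush pending shadow scores
print(out, shadow_log)
```

Passing the already-extracted `features` to both models enforces the identical-inputs requirement: neither model re-fetches or re-computes its own inputs.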

Question 7

Why is shadow mode insufficient as the only pre-production evaluation, even when the shadow model passes all latency, rank correlation, and score divergence checks?

**Answer:** Shadow mode evaluates the model on production traffic but does not measure actual user behavior in response to the model's predictions. The shadow model's predictions are never served, so there is no click-through rate, no completion rate, no engagement signal. Shadow mode can verify that the model is fast enough (latency), that it produces reasonable predictions (rank correlation with champion), and that its output distribution is not anomalous (score divergence). But it cannot verify that users prefer the new model's recommendations. A model could have high rank correlation with the champion (producing similar rankings) yet consistently place slightly worse items at position 1 — a difference invisible to rank correlation but visible to CTR. Canary deployment, where the model's predictions are actually served to users and outcomes are measured, is necessary to evaluate whether the model improves the user experience.

Question 8

In a canary deployment, why must the traffic split be consistent per user rather than random per request?

**Answer:** If the traffic split is random per request, a single user may see recommendations from the champion model on one page load and the canary model on the next. This creates two problems. First, **inconsistent user experience:** users may notice that recommendations change erratically, which degrades perceived quality regardless of which model is better. Second, **confounded metrics:** if a user clicks on a champion recommendation and then completes content recommended by the canary, the completion is attributed to the canary even though the champion initiated the journey. Consistent per-user splitting ensures that each user's entire session is served by one model, making metric attribution clean and the user experience consistent. This is the same principle as randomization units in A/B testing (Chapter 33).
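Consistent per-user assignment is typically implemented by hashing a stable user identifier. A sketch, with an illustrative salt (rotating the salt per experiment re-randomizes the buckets):

```python
import hashlib

def canary_bucket(user_id: str, canary_fraction: float,
                  salt: str = "canary-2024") -> str:
    """Deterministically assign a user to champion or canary.

    Hashing (salt + user_id) yields a stable pseudo-uniform value per user,
    so every request in a user's session routes to the same model.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).hexdigest()
    u = int(digest[:8], 16) / 0x100000000   # uniform in [0, 1)
    return "canary" if u < canary_fraction else "champion"

# The same user always lands in the same bucket:
assert canary_bucket("user-123", 0.10) == canary_bucket("user-123", 0.10)
```

No per-user state needs to be stored: any serving replica computes the same assignment independently, which is why hash-based bucketing scales better than a lookup table.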

Question 9

What is the difference between a canary deployment and a blue-green deployment? When would you choose each?

**Answer:** In **blue-green** deployment, two identical environments exist, and all traffic switches instantaneously from one to the other (0% → 100%). In **canary** deployment, traffic is gradually shifted (0% → 10% → 25% → 50% → 100%) with evaluation at each stage. Blue-green provides instantaneous rollback (switch back to blue) but exposes all users to the new version immediately, with no statistical evaluation during the transition. Canary limits blast radius (only canary users see the new model) and enables statistical comparison during deployment, but the deployment takes longer (days vs. seconds). Choose blue-green for infrastructure changes where behavior is deterministic and testable (e.g., upgrading the serving framework, changing the load balancer). Choose canary for ML model deployments where quality is uncertain and depends on user behavior that can only be measured in production.

Question 10

Explain the three types of retraining triggers and the tradeoff each optimizes.

**Answer:** **Scheduled triggers** retrain at fixed intervals (weekly, daily, hourly). They are simple, predictable, and guarantee a maximum model staleness. The tradeoff is between freshness and cost: more frequent retraining keeps the model fresher but costs more compute. **Drift-based triggers** retrain when input data distributions shift (PSI exceeds a threshold). They are reactive — retraining only when the data has actually changed — avoiding unnecessary retraining when the world is stable. The tradeoff is detection latency: PSI requires accumulating enough data for a reliable estimate (typically one day), during which the model serves on shifted data. **Performance-based triggers** retrain when production metrics (CTR, AUC) drop below thresholds. They are the most direct — retraining only when the model is actually underperforming. The tradeoff is that performance metrics are lagging indicators: click outcomes arrive in hours, but default rates take months. A comprehensive retraining strategy uses all three triggers in combination: scheduled for guaranteed freshness, drift for rapid adaptation, and performance as a safety net.
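Combining the three trigger types can be sketched as a single evaluation function. All thresholds and numbers below are illustrative.

```python
from datetime import datetime

def evaluate_triggers(now, last_trained, psi, ctr, ctr_baseline,
                      schedule_days=7, psi_threshold=0.2, max_ctr_drop=0.05):
    """Return which of the three trigger types fire (thresholds illustrative)."""
    reasons = []
    if (now - last_trained).days >= schedule_days:
        reasons.append("scheduled")      # guaranteed maximum staleness
    if psi > psi_threshold:
        reasons.append("drift")          # input distribution has shifted
    if ctr < ctr_baseline * (1 - max_ctr_drop):
        reasons.append("performance")    # lagging but most direct signal
    return reasons

print(evaluate_triggers(
    now=datetime(2024, 5, 8), last_trained=datetime(2024, 5, 6),
    psi=0.31, ctr=0.041, ctr_baseline=0.042,
))  # only the drift trigger fires: the model is fresh and CTR is in bounds
```

Returning the list of reasons, rather than a bare boolean, preserves why retraining was initiated, which matters for the human-review and deduplication policies discussed below.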

Question 11

The RetrainingTriggerManager has a min_retrain_interval_hours parameter set to 24 hours. Why is this deduplication necessary?

**Answer:** Without deduplication, multiple triggers can fire simultaneously and initiate redundant retraining runs. For example, a data drift event on Monday could trigger a drift-based retraining. If the scheduled trigger fires on Tuesday (within 24 hours), it would initiate a second retraining run on nearly identical data — wasting compute and potentially causing deployment pipeline congestion (two models in shadow mode simultaneously). The 24-hour minimum interval ensures that at most one retraining run occurs per day. Critical triggers (priority 1, such as severe performance degradation or extreme drift with PSI > 0.5) can bypass this interval, ensuring that genuine emergencies are not suppressed by the deduplication logic.
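The deduplication-with-bypass logic can be sketched as follows; the API and priority scheme are illustrative, not the chapter's exact class.

```python
from datetime import datetime, timedelta
from typing import Optional

class RetrainingTriggerManager:
    """Sketch of trigger deduplication with a critical-priority bypass."""

    def __init__(self, min_retrain_interval_hours: int = 24):
        self.min_interval = timedelta(hours=min_retrain_interval_hours)
        self.last_retrain: Optional[datetime] = None

    def should_retrain(self, now: datetime, priority: int) -> bool:
        if priority == 1:                 # critical: bypass deduplication
            self.last_retrain = now
            return True
        if self.last_retrain is not None and now - self.last_retrain < self.min_interval:
            return False                  # suppressed: a recent run covers this
        self.last_retrain = now
        return True

mgr = RetrainingTriggerManager()
monday = datetime(2024, 5, 6, 9, 0)
assert mgr.should_retrain(monday, priority=2)                            # drift fires
assert not mgr.should_retrain(monday + timedelta(hours=12), priority=3)  # deduped
assert mgr.should_retrain(monday + timedelta(hours=12), priority=1)      # bypass
```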

Question 12

Why does the StreamRec pipeline require human approval for drift-triggered retraining before canary deployment, but not for scheduled retraining?

**Answer:** Scheduled retraining occurs on a predictable cadence (weekly) with data that is incrementally newer but structurally similar to previous training data. The model architecture, feature set, and hyperparameters are unchanged. The risk of a bad model is low, and the validation gate (Chapter 28) catches quality regressions automatically. Drift-triggered retraining, however, indicates that the data distribution has changed significantly. The drift may signal a legitimate change in user behavior (the model should adapt) or a data quality issue (an upstream schema change, a logging bug, a pipeline failure). Retraining on corrupted data would produce a corrupted model. Human review before canary deployment allows an engineer to examine the drift — which features shifted, by how much, and whether the shift is consistent with a known business event — before exposing real users to a model trained on potentially anomalous data.

Question 13

What are the three requirements for a model rollback to be effective?

**Answer:** **Fast:** The previous model must be restored within seconds. At high request volumes (20M/day for StreamRec), even a 10-minute rollback window affects 140,000 requests. Speed requires keeping the previous champion deployed in warm standby — loaded, ready, and reachable by the load balancer with a configuration change. **Safe:** The rollback itself must not introduce additional risk. The previous model must be verified healthy (via periodic health checks on the warm standby), its serving infrastructure must be allocated, and the rollback mechanism must be tested regularly through drills. **Automated:** Rollback should not require a human to diagnose, identify the previous version, and manually switch traffic. The monitoring system or canary evaluator should trigger rollback automatically when guardrail metrics (error rate, latency, CTR) violate thresholds. Manual rollback is too slow for incidents that occur outside business hours.

Question 14

Why does the ProgressiveRolloutController use a "two strikes" rollback policy instead of immediately aborting the deployment on the first rollback?

**Answer:** A single metric violation may be transient — caused by a temporary traffic spike, a brief upstream data quality issue, or statistical noise in a small sample. Immediately aborting the deployment on every transient violation would make it extremely difficult to deploy any model: the deployment would fail frequently on noise, wasting the shadow and canary evaluation time already invested. The one-stage rollback gives the model a second chance: if the violation was transient, the model will pass at the previous stage (with lower traffic). If the violation recurs at the lower stage, it is likely a genuine issue with the model, and full rollback is appropriate. This policy balances deployment reliability (not aborting on noise) with safety (not persisting with a genuinely degraded model).
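The policy can be sketched as a small controller; the class name, stage percentages, and return strings are illustrative.

```python
class TwoStrikesRollout:
    """Sketch of the 'two strikes' rollback policy for a staged rollout."""
    STAGES = [10, 25, 50, 100]   # percent of traffic at each rollout stage

    def __init__(self):
        self.stage = 0           # index into STAGES
        self.strikes = 0

    def on_evaluation(self, guardrails_ok: bool) -> str:
        if guardrails_ok:
            if self.stage == len(self.STAGES) - 1:
                return "complete"
            self.stage += 1
            return f"advance to {self.STAGES[self.stage]}%"
        self.strikes += 1
        if self.strikes >= 2:    # second strike: treat as a genuine regression
            return "abort: full rollback"
        self.stage = max(0, self.stage - 1)
        return f"step back to {self.STAGES[self.stage]}%"   # second chance

rollout = TwoStrikesRollout()
print(rollout.on_evaluation(True))    # clean stage: advance
print(rollout.on_evaluation(False))   # first strike: retry at lower traffic
print(rollout.on_evaluation(False))   # second strike: full rollback
```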

Question 15

What is a kill switch, and why is it distinct from automated rollback?

**Answer:** A kill switch is a feature flag that, when activated, immediately disables a model and routes all traffic to a known-safe fallback — typically the previous approved model or a simple rule-based system. It is distinct from automated rollback in three ways. **Trigger:** Automated rollback is triggered by metric thresholds (CTR decline, error rate); the kill switch is triggered by a human decision, often in response to an issue that metrics have not yet detected (e.g., a compliance violation, a PR crisis, an executive decision). **Scope:** Automated rollback restores the previous model version; the kill switch can route to any fallback, including a non-ML fallback. **Speed:** The kill switch is a single configuration change with no evaluation logic — it is faster than automated rollback because it skips the health check and verification steps. In regulated environments, the kill switch is a compliance requirement: regulators expect the ability to instantly disable any model.
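As a sketch, the kill switch reduces to a single flag check on the serving path. The flag store, flag name, and fallback here are all illustrative.

```python
# In production the flag would live in a dynamic config service so that
# flipping it requires no deploy; a dict stands in for that store here.
FLAGS = {"recommender.kill_switch": False}

def get_recommendations(user_id, model, fallback):
    if FLAGS["recommender.kill_switch"]:
        return fallback(user_id)     # known-safe path, possibly non-ML
    return model(user_id)

model = lambda u: ["personalized-item"]   # stand-in for the ML model
fallback = lambda u: ["popular-item"]     # rule-based fallback

print(get_recommendations("u1", model, fallback))
FLAGS["recommender.kill_switch"] = True   # the single configuration change
print(get_recommendations("u1", model, fallback))
```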

Question 16

How does model deployment in a regulated environment (e.g., credit scoring under SR 11-7) differ from deployment in an unregulated environment (e.g., content recommendations)?

**Answer:** Regulated deployment adds three constraints. **Independent validation:** The model validation gate must be executed by a team independent of the development team (the MRM team), not by the same data scientist who trained the model. This adds 2-5 business days to the deployment timeline. **Documentation:** Every deployment produces auditable evidence — the model's purpose, methodology, assumptions, limitations, validation results, and approval signatures — stored for 7 years. Lineage must be complete enough for an examiner to reproduce any historical model. **Change management:** Model changes are classified by materiality. Material changes (new architecture, new feature source) require full re-validation and risk committee approval. Non-material changes (retraining on fresh data) require abbreviated validation. The net effect is that a deployment taking 12 days at StreamRec takes 17-19 days at a financial institution. Continuous training triggers must account for this timeline: drift-triggered retraining cannot complete the full pipeline in 24 hours if MRM review takes 5 days.

Question 17

The shadow evaluator computes Spearman rank correlation between champion and challenger predictions. What does a rank correlation of 0.92 mean, and is it sufficient to proceed to canary?

**Answer:** A Spearman rank correlation of 0.92 means that the challenger model's item rankings are highly correlated with the champion's — they largely agree on which items should be ranked higher and lower, but there are some disagreements. This is generally a positive signal: the challenger is not producing wildly different recommendations. However, rank correlation alone is insufficient for the canary decision. A correlation of 0.92 still leaves substantial ranking disagreement, and that disagreement could be concentrated in critical positions (e.g., the top-3 items that users actually see) or in specific user segments. High rank correlation also does not guarantee that the challenger is better — the models could agree almost everywhere, with the challenger systematically worse precisely where they differ. The canary decision should also consider latency (is the challenger within p99 bounds?), score divergence (are prediction magnitudes similar?), and segment-level analysis (is rank correlation uniformly high or does it drop for specific user segments?).
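To make the number concrete, here is Spearman computed as the Pearson correlation of ranks, on made-up scores: swapping just one adjacent pair among six items already gives a correlation of about 0.94, so a handful of position changes can hide behind a "high" coefficient.

```python
def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks.
    Assumes no tied scores, for brevity."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

champion   = [0.91, 0.84, 0.77, 0.60, 0.52, 0.33]   # hypothetical scores
challenger = [0.88, 0.80, 0.61, 0.72, 0.48, 0.30]   # items 3 and 4 swapped
print(round(spearman(champion, challenger), 3))      # 0.943
```

In a real evaluator this would be computed per request over the candidate set (e.g., with `scipy.stats.spearmanr`, which also handles ties) and then aggregated across requests and user segments.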

Question 18

Why should the previous champion model remain in "warm standby" rather than being terminated after the new champion completes full rollout?

**Answer:** The warm standby serves as the rollback target. If the new champion develops issues after full rollout — due to a delayed-onset problem like concept drift, a long-tail edge case that only appears at 100% traffic, or an interaction with another system change — the team needs the ability to instantly restore the previous champion. If the previous champion was terminated after full rollout, rollback would require loading the model from the registry, starting new serving pods, and warming up the model — a process that takes minutes rather than seconds. The cost of warm standby (approximately 50% of serving cost, since the GPUs are allocated but mostly idle) is justified by the insurance it provides against post-rollout issues. The warm standby is typically maintained for 1-2 weeks after full rollout; once the new champion has been stable for that period, the warm standby can be safely deallocated.

Question 19

A team is deploying a model using BentoML on Kubernetes with Istio for traffic splitting. How does an Istio VirtualService enable canary deployment, and why is it preferable to application-level traffic splitting?

**Answer:** An Istio VirtualService is a network-level configuration that routes traffic between Kubernetes services based on configurable weights. For canary deployment, the VirtualService defines two destinations (champion and canary services) with traffic weights (e.g., 90/10). Updating the weights requires a single Kubernetes API call to modify the VirtualService resource; the change propagates to all Envoy sidecar proxies within seconds. This is preferable to application-level traffic splitting for three reasons. **Separation of concerns:** The model serving code does not need to know about traffic splitting — it serves every request it receives. Traffic routing is the infrastructure's responsibility. **Language/framework agnostic:** Istio works with any serving framework (BentoML, TorchServe, Seldon, custom gRPC) without modification. **Observability:** Istio automatically collects per-service metrics (latency, error rate, throughput) that the canary evaluator needs, without requiring instrumentation in the serving code.

Question 20

Design a deployment pipeline for a model that serves two markets (US and EU) with different regulatory requirements. The US market allows automated canary deployment; the EU market requires human approval at every stage due to GDPR concerns about automated decision-making. How would you structure the pipeline?

**Answer:** The pipeline should share the training and validation stages (one model, one training run, one offline evaluation) but fork at the deployment stage. The **US branch** follows the standard automated pipeline: shadow (7 days, automated) → canary (10%, 3 days, automated evaluation) → progressive rollout (25% → 50% → 100%, automated). The **EU branch** adds approval gates: shadow (7 days, automated) → [human review of shadow results] → canary (5%, 7 days, conservative thresholds) → [human review of canary results, GDPR impact assessment] → progressive rollout (10% → 25% → 50% → 100%, human approval at each stage). The EU branch also requires additional documentation: a Data Protection Impact Assessment (DPIA) for any model change, evidence that the model does not rely on prohibited features (Article 22 considerations), and evidence that affected individuals can request human review of automated decisions. The two branches operate independently but share the model registry and artifact lineage — ensuring that the same model version is deployed in both markets, even though the deployment timelines differ.