Chapter 29: Exercises
Exercises are graded by difficulty:

- One star (*): Apply the technique from the chapter to a new dataset or scenario
- Two stars (**): Extend the technique or combine it with a previous chapter's methods
- Three stars (***): Derive a result, implement from scratch, or design a system component
- Four stars (****): Research-level problems that connect to open questions in the field
CI/CD for ML
Exercise 29.1 (*)
A fraud detection team at an e-commerce company deploys an XGBoost model that scores every transaction in real time. The model uses 80 features derived from transaction history, device fingerprints, and behavioral signals. Retraining happens manually every 6 weeks.
(a) Classify this team's MLOps maturity level (0-3). Identify the two highest-impact improvements that would advance them to the next level.
(b) The team wants to move to weekly automated retraining. List the three artifact types that their CI/CD pipeline must version, and for each artifact, specify the versioning tool and testing strategy.
(c) The fraud detection model produces a binary decision (block/allow) with immediate user impact. Does this affect your choice of deployment strategy (blue-green vs. canary vs. shadow)? Explain your reasoning.
Exercise 29.2 (*)
Consider the MLCIPipeline class from Section 29.3. The current pipeline runs integration tests only under the "not gpu" pytest marker expression, skipping GPU-dependent tests in CI.
(a) Why might a team skip GPU tests in CI? List two practical reasons.
(b) Design a CI step that runs GPU-dependent tests in a separate pipeline triggered nightly (not on every commit). Write the CIStep dataclass instance.
(c) The team discovers that a model serving bug only manifests when the model runs on GPU (a precision difference between CPU and GPU inference). How would you modify the CI pipeline to catch this class of bug without running GPU tests on every commit?
Exercise 29.3 (*)
The ModelRegistry class (Section 29.3) enforces valid stage transitions. Trace the stage transitions for the following scenario and identify which transition would raise a ValueError:
- Model fraud-detector v12 is registered (stage: None)
- v12 transitions to Staging
- v12 transitions to Shadow
- v12 transitions to Canary
- v12 transitions to Production
- Previously, fraud-detector v11 was in Production
- Someone attempts to transition v11 from Archived back to Production
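For tracing exercises like this one, it can help to encode the stage machine explicitly. The transition map below is a hypothetical reconstruction (the chapter's ModelRegistry may define different allowed transitions), but it is enough to replay the scenario:

```python
# Minimal sketch of stage-transition validation. The ALLOWED map is a
# hypothetical reconstruction, not the chapter's exact ModelRegistry rules.
ALLOWED = {
    None: {"Staging"},
    "Staging": {"Shadow", "Archived"},
    "Shadow": {"Canary", "Archived"},
    "Canary": {"Production", "Archived"},
    "Production": {"Archived"},
    "Archived": set(),  # terminal: no path back to Production
}

def transition(current, target):
    """Return the new stage, or raise ValueError on an invalid move."""
    if target not in ALLOWED[current]:
        raise ValueError(f"invalid transition: {current} -> {target}")
    return target
```

Under these rules, v12's path (None → Staging → Shadow → Canary → Production) is valid, while v11's attempted Archived → Production move is the one that raises ValueError.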
Exercise 29.4 (**)
Extend the ArtifactLineage class to include feature importance lineage: for each feature, store its importance score (from the model) and its data source. When comparing two lineage records, flag any feature whose importance changed by more than 50% or whose data source changed.
Write the extended class and a test that demonstrates the comparison on two lineage records where one feature's data source changed from "credit_bureau_v2" to "credit_bureau_v3" and another feature's importance changed from 0.08 to 0.15.
Shadow Mode
Exercise 29.5 (*)
A search engine serves 50 million queries per day. The team wants to evaluate a new ranking model in shadow mode before canary deployment.
(a) At 50M queries/day, how long would shadow mode need to run to accumulate 100,000 predictions? Is the minimum prediction threshold the binding constraint or the temporal coverage requirement (covering weekday/weekend patterns)?
(b) Shadow mode doubles the inference compute per request. If the current serving cost is $12,000/day, what is the additional daily cost of shadow mode? Propose an alternative to full shadow mode that reduces cost while preserving evaluation quality.
(c) The search engine's primary metric is NDCG@10, but this requires knowing which results users clicked (outcome data). During shadow mode, users see only the champion's results, so outcome data is only available for the champion's ranking. How would you evaluate the shadow model's quality without outcome data? (Hint: consider rank correlation, model agreement, and interleaving.)
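For part (c), outcome-free evaluation typically compares the two models' rankings directly. A stdlib-only sketch of two such comparisons (Spearman rank correlation and top-k overlap; the helper names are my own, not the chapter's):

```python
# Sketch: outcome-free shadow evaluation by comparing champion and shadow
# rankings of the same result set. Helper names are illustrative only.

def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation for two tie-free rankings of n items."""
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def overlap_at_k(list_a, list_b, k=10):
    """Fraction of documents shared between the two top-k result lists."""
    return len(set(list_a[:k]) & set(list_b[:k])) / k
```

High rank correlation with low top-10 overlap (or vice versa) tells you different things: the first measures ordering agreement over shared candidates, the second measures whether the shadow model surfaces different documents at all.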
Exercise 29.6 (**)
The ShadowEvaluator.compute_rank_correlation method computes a global rank correlation across all predictions. This can mask important segment-level differences.
(a) Modify the evaluator to compute rank correlation separately for new users (< 7 days old), medium users (7-90 days), and power users (> 90 days). Write the segmented evaluation method.
(b) The shadow evaluation shows: global rank correlation = 0.92, new users = 0.65, medium users = 0.94, power users = 0.97. What does this pattern suggest about the challenger model? Should you proceed to canary?
(c) Design a shadow evaluation criterion that incorporates segment-level analysis into the promotion decision. Specify the thresholds and the decision logic.
Exercise 29.7 (**)
Shadow mode assumes that the champion and challenger receive identical inputs. In practice, this can fail if the feature store returns different values for the two models (e.g., due to caching, race conditions, or feature versioning). Design a shadow input validation check that detects feature divergence between champion and challenger requests. Write the validation function.
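One possible shape for the validation function, assuming each request carries a flat dict of feature name to value (the request format is an assumption, not the chapter's schema):

```python
# Sketch of a shadow-input validation check. Assumes each request is a
# dict of feature name -> value; adapt to the real feature-store payload.

def validate_shadow_inputs(champion_features, challenger_features,
                           rel_tol=1e-6):
    """Return a list of (feature, champion_value, challenger_value)
    tuples for every feature that diverges between the two requests."""
    divergent = []
    for key in sorted(set(champion_features) | set(challenger_features)):
        a = champion_features.get(key)
        b = challenger_features.get(key)
        if a is None or b is None:
            divergent.append((key, a, b))      # missing on one side
        elif isinstance(a, float) and isinstance(b, float):
            if abs(a - b) > rel_tol * max(abs(a), abs(b), 1.0):
                divergent.append((key, a, b))  # numeric drift past tolerance
        elif a != b:
            divergent.append((key, a, b))      # categorical mismatch
    return divergent
```

In production this check would typically run on a sampled subset of shadow requests, with divergence counts emitted as a metric and alerted on.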
Canary Deployments
Exercise 29.8 (*)
StreamRec serves 20 million requests per day. A canary deployment routes 10% of traffic to the new model.
(a) How many canary impressions accumulate per day? If the baseline CTR is 5.2%, how many days are needed to detect a 0.5% relative CTR change (from 5.2% to 5.174%) with 80% power at significance level 0.05?
(b) The product team wants to reduce the canary percentage to 5% to limit user exposure. How does this affect the required canary duration? Is the tradeoff worth it for a recommendation system?
(c) The canary shows CTR = 5.35% (canary) vs. 5.20% (champion) after 2 million canary impressions. Use the _two_proportion_z_test function to compute the z-statistic and p-value. Should the canary be promoted?
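Part (c) can be checked with a standard pooled two-proportion z-test. The stand-alone sketch below may differ from the chapter's _two_proportion_z_test in signature, and the champion impression count (18 million, i.e. 90% of one day's 20M requests) is an assumption for illustration:

```python
import math

# Pooled two-proportion z-test sketch. Click counts correspond to
# CTR 5.35% on 2M canary impressions and 5.20% on an assumed 18M
# champion impressions.

def two_proportion_z_test(x1, n1, x2, n2):
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided
    return z, p_value

z, p = two_proportion_z_test(
    x1=107_000, n1=2_000_000,     # canary: 5.35% CTR
    x2=936_000, n2=18_000_000,    # champion: 5.20% CTR (assumed count)
)
```

With these numbers the z-statistic is large (roughly 9) and the p-value vanishingly small, so the statistical criterion for promotion is clearly met; whether to promote also depends on the pipeline's other gates (latency, error rate, guardrail metrics).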
Exercise 29.9 (*)
The canary evaluation in Section 29.6 uses a two-proportion z-test. This test assumes independent observations (each impression is independent). However, in a recommendation system, a single user generates multiple impressions per session.
(a) Explain why user-level clustering violates the independence assumption and how it affects the z-test.
(b) Propose an alternative testing approach that accounts for user-level clustering. (Hint: consider randomization at the user level and a user-level metric like "CTR per user.")
(c) If the effective sample size (accounting for clustering) is 40% of the nominal sample size, how does this affect the required canary duration?
Exercise 29.10 (**)
The CanaryConfig class allows a min_ctr_lift of -0.01, meaning the canary can be promoted even with a 1% CTR regression. This is a non-inferiority design.
(a) Why would a team accept a small regression? Give two concrete business reasons.
(b) Formalize the non-inferiority test. Given champion CTR $p_c$ and canary CTR $p_k$, the null hypothesis is $H_0: p_k - p_c \leq -\delta$ (the canary is inferior by at least $\delta$) and the alternative is $H_1: p_k - p_c > -\delta$. Derive the one-sided z-test statistic for this formulation.
(c) What is the risk of setting the non-inferiority margin $\delta$ too large? How would you choose $\delta$ in practice?
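As a concrete reference for part (b), the one-sided test statistic shifts the difference by the margin. The sketch below uses the unpooled standard error (a pooled variant is also defensible; this is one common formulation, not necessarily the chapter's):

```python
import math

# One-sided non-inferiority z-test sketch for
#   H0: p_k - p_c <= -delta   vs   H1: p_k - p_c > -delta.

def noninferiority_z(p_k, n_k, p_c, n_c, delta):
    """Reject H0 (declare non-inferiority) if z > z_alpha, e.g. 1.645."""
    se = math.sqrt(p_k * (1 - p_k) / n_k + p_c * (1 - p_c) / n_c)
    return (p_k - p_c + delta) / se
```

Note that even identical CTRs can establish non-inferiority given enough data, because the +delta shift moves the statistic away from the rejection boundary.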
Exercise 29.11 (**)
The canary evaluation checks CTR as the primary metric. But CTR is a short-term engagement metric. The product team also cares about long-term metrics: user retention (7-day return rate) and subscription conversion (30-day).
(a) The canary runs for 3 days. Can 7-day retention be evaluated within the canary window? If not, how would you handle long-term metrics in the deployment pipeline?
(b) Design a "delayed outcome reconciliation" process that, after the canary is promoted to full traffic, continues to compare retention and conversion between users who were in the canary group and users who were in the champion group. Write the data schema and the comparison logic.
(c) If the delayed reconciliation reveals that the canary model hurt 7-day retention by 2% (detected 10 days after full rollout), what should the team do? Is rollback appropriate? Discuss the tradeoffs.
Progressive Rollout
Exercise 29.12 (*)
The ProgressiveRolloutController uses a fixed stage schedule: 0% → 10% → 25% → 50% → 100%.
(a) A healthcare AI model that assists radiologists in detecting lung nodules requires extreme caution. Design a rollout schedule for this model. Justify your choice of stages, durations, and metrics.
(b) A real-time bidding model for advertising must adapt quickly to market changes. Design a rollout schedule that prioritizes speed. What additional safety mechanisms would you add to compensate for the faster rollout?
Exercise 29.13 (**)
The rollback logic in ProgressiveRolloutController.rollback uses a "two strikes" policy: first rollback goes one stage back, second rollback aborts the deployment.
(a) Trace the behavior for the following scenario: canary passes, advance to 25%. At 25%, CTR drops 3%. Rollback to canary. At canary (second time), CTR is normal. Advance to 25% again. At 25%, CTR drops 3% again. What happens?
(b) The current implementation tracks rollback_count as a global counter. Propose a modification that distinguishes between rollbacks at different stages and allows re-advancement after a single-stage rollback recovery. Write the modified rollback method.
(c) Under what circumstances should the system immediately abort (full rollback) rather than stepping back one stage? List at least three conditions.
Exercise 29.14 (***)
Design a dynamic progressive rollout controller that adjusts the advancement speed based on observed metrics. When metrics are strongly positive (CTR improvement > 2%), the controller should advance faster (reduce the minimum duration). When metrics are neutral (no significant difference), the controller should advance at the standard pace. When metrics are weakly negative but within tolerance, the controller should slow down (increase the minimum duration and reduce the traffic increment).
Write the DynamicRolloutController class with adaptive stage durations and traffic increments.
Continuous Training and Retraining Triggers
Exercise 29.15 (*)
A ride-sharing platform has three models in production: surge pricing (retrained hourly), demand forecasting (retrained daily), and driver churn prediction (retrained weekly). For each model:
(a) Identify the primary retraining trigger type (scheduled, drift, or performance) and justify your choice.
(b) Specify the retraining window (how much historical data to include in each retraining run) and explain why.
(c) The surge pricing model's hourly retraining costs $15 per run ($10,800/month). The engineering manager asks whether the model could be retrained every 4 hours instead. How would you evaluate this tradeoff? What metrics would you monitor to detect staleness?
Exercise 29.16 (*)
The DriftTrigger uses PSI to detect feature drift. PSI has known limitations: it requires binning continuous features, it is symmetric (does not distinguish between the reference and current distributions), and it can miss localized shifts in the tails.
(a) Compute PSI for the following distributions. Reference bins: [0.30, 0.25, 0.25, 0.20]. Current bins: [0.22, 0.28, 0.30, 0.20]. Does this exceed the 0.25 threshold?
(b) Propose an alternative drift detection method for continuous features that does not require binning. (Hint: consider the Kolmogorov-Smirnov test, Maximum Mean Discrepancy, or the Wasserstein distance.)
(c) Design a DriftTrigger variant that uses the KS test instead of PSI. Write the implementation using scipy.stats.ks_2samp.
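Part (a) can be verified numerically with the usual PSI formula, sum over bins of (current − reference) × ln(current / reference):

```python
import math

# Numerical check for part (a) using the standard PSI definition.

def psi(reference, current):
    return sum((c - r) * math.log(c / r)
               for r, c in zip(reference, current)
               if r > 0 and c > 0)

value = psi([0.30, 0.25, 0.25, 0.20],
            [0.22, 0.28, 0.30, 0.20])
```

The result comes out to roughly 0.037, well below the 0.25 threshold, so this shift would not fire the trigger despite the visible 8-point movement in the first bin.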
Exercise 29.17 (**)
The RetrainingTriggerManager suppresses non-critical triggers if retraining occurred within the last 24 hours. This deduplication prevents redundant retraining, but it can delay response to a genuine issue.
(a) Scenario: Sunday 2am — scheduled retraining runs. Monday 3am — PSI for user_engagement_rate jumps to 0.35. The drift trigger fires but is suppressed because the 24-hour interval has not elapsed. When will the drift-triggered retraining actually execute?
(b) Modify the evaluate method so that drift triggers with PSI > 0.50 (critical drift) bypass the deduplication interval. Write the modified code.
(c) The performance trigger fires at the same time as the drift trigger. Both are priority 1. Which should take precedence? Design a priority tiebreaker that considers the root cause (data issue vs. model issue).
Exercise 29.18 (***)
Design a cost-aware retraining scheduler that optimizes the tradeoff between model freshness and compute cost. The scheduler takes as input: (1) the model's performance degradation curve (how quickly quality declines after training), (2) the retraining cost (compute + deployment), and (3) the business cost of degraded predictions (revenue impact per unit of metric decline). The scheduler outputs the optimal retraining interval that minimizes total cost.
Formalize the optimization problem. Define the cost function. Sketch the solution for a model whose performance degrades linearly at 0.5% per day, retraining costs $200, and each 1% quality decline costs $1,000/day in lost revenue.
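One simple formalization of the worked numbers (an assumption; other cost models are possible): with linear degradation rate d (% quality lost per day), business cost b dollars per day per 1% decline, and retraining cost C, the average daily cost over an interval T is C/T + b·d·T/2, minimized at T* = sqrt(2C / (b·d)):

```python
import math

# Numerical sketch of the cost tradeoff under the stated assumptions:
# quality degrades linearly, so the degradation cost accrued over one
# cycle of length T is the integral of b*d*t, i.e. b*d*T^2/2, giving an
# average daily cost of  cost(T) = C/T + b*d*T/2.

C = 200.0    # retraining cost per run, $
b = 1000.0   # $ per day per 1% quality decline
d = 0.5      # quality decline in % per day

t_star = math.sqrt(2 * C / (b * d))          # optimal interval, days
daily_cost = C / t_star + b * d * t_star / 2  # cost at the optimum
```

Under these numbers the optimal interval is a bit under one day, with the retraining and degradation terms balanced at the optimum, which is the characteristic square-root tradeoff of this cost model.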
Rollback
Exercise 29.19 (*)
A model serving 5 million requests per hour experiences a latency spike: p99 increases from 35ms to 120ms after a canary promotion to 25% traffic.
(a) If the automated rollback takes 5 seconds (load balancer config change), how many requests are affected during the rollback? Assume the latency spike affects only the 25% of traffic served by the canary.
(b) If the automated rollback takes 5 minutes (requires pod restart), how many requests are affected? Compare with part (a).
(c) What infrastructure investment would reduce rollback time from 5 minutes to 5 seconds? Is the investment justified given the numbers from parts (a) and (b)?
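The exposure numbers in parts (a) and (b) follow from request rate × rollback time × affected traffic fraction, which the sketch below makes explicit:

```python
# Back-of-the-envelope exposure calculation for parts (a) and (b).

RATE_PER_SEC = 5_000_000 / 3600  # ~1,389 requests/second
AFFECTED = 0.25                  # only canary-served traffic is degraded

fast = RATE_PER_SEC * 5 * AFFECTED    # 5-second rollback (~1.7K requests)
slow = RATE_PER_SEC * 300 * AFFECTED  # 5-minute rollback (~104K requests)
```

The two scenarios differ by a factor of 60, which is the quantity to weigh against the infrastructure cost in part (c).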
Exercise 29.20 (**)
The rollback architecture keeps the previous champion in "warm standby," consuming resources but not serving traffic. For a model served on 8 GPU pods, warm standby costs 50% of the serving cost (GPUs are allocated but mostly idle).
(a) Calculate the monthly warm standby cost if each GPU pod costs $3,000/month. Is this cost justified for a model that generates $2 million/month in attributed revenue?
(b) Propose an alternative to warm standby that reduces cost while maintaining rollback capability within 60 seconds. (Hint: consider pre-loaded container images, model caching, or serverless inference.)
(c) The team proposes eliminating warm standby entirely and relying on "cold rollback" — loading the previous model from the registry on demand. Estimate the cold rollback time given: Docker image pull (30s), model loading (45s), warm-up inference (15s). Is this acceptable for a model serving 5 million requests/hour?
Exercise 29.21 (**)
Design a rollback drill playbook for the StreamRec team. The playbook should include:
(a) The rollback drill schedule (frequency, timing, participants).
(b) Step-by-step instructions for executing the drill, including what to verify at each step.
(c) Success criteria: how do you determine if the drill passed or failed?
(d) A post-drill review template that captures lessons learned and improvement actions.
Regulated Deployment
Exercise 29.22 (*)
Meridian Financial's credit scoring model deployment requires MRM approval, which adds 2-5 business days to the deployment timeline.
(a) The MRM team reviews 15 models per quarter. Each review takes an average of 4 hours of analyst time. If the team has 3 analysts and each works 160 hours per quarter, what fraction of their capacity is consumed by model reviews? Is this sustainable if the number of models grows to 30 per quarter?
(b) Distinguish between "material" and "non-material" model changes. For each category, specify the required review depth and the expected review duration. Give two examples of each category.
(c) Design an automated pre-screening step that triages model changes into material and non-material categories based on the ArtifactLineage.diff output. What changes in the lineage diff would trigger a material review?
Exercise 29.23 (**)
The kill switch feature flag provides a one-click mechanism to disable a model. However, a kill switch that disables the model without a fallback leaves users with no predictions.
(a) Design a kill switch for the Meridian Financial credit scoring model that routes traffic to a fallback model (the previous approved version). What happens if the previous version is also degraded?
(b) Design a "graceful degradation" fallback hierarchy: primary model → previous model → rule-based model → manual review. Write the routing logic.
(c) The kill switch is triggered 3 times in one month. The MRM team asks whether this frequency indicates a systemic issue. What analysis would you perform to determine the root cause?
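For part (b), the routing logic is essentially an ordered fall-through over health-checked tiers. The sketch below assumes a hypothetical interface where each tier is a (name, predict_fn, is_healthy_fn) tuple; the real system's health checks and prediction API will differ:

```python
# Sketch of a graceful-degradation fallback hierarchy. The tier tuple
# format (name, predict_fn, is_healthy_fn) is an illustrative assumption.

def route(request, hierarchy):
    """Try each tier in order; skip unhealthy tiers and treat runtime
    failures as unhealthy. The final fallback is manual review."""
    for name, predict, is_healthy in hierarchy:
        if not is_healthy():
            continue
        try:
            return name, predict(request)
        except Exception:
            continue  # a failing tier falls through, same as unhealthy
    return "manual_review", None  # queue for a human decision
```

Each fall-through should also be logged and counted: a sustained rate of requests landing below the primary tier is exactly the kind of signal the MRM analysis in part (c) needs.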
System Design
Exercise 29.24 (***)
Design the complete CI/CD pipeline for a medical imaging model that classifies chest X-rays for pneumonia detection. The model is used as a clinical decision support tool (not autonomous diagnosis). The deployment must comply with FDA 510(k) requirements for software as a medical device (SaMD).
(a) Define the three artifacts (code, data, model) and their versioning requirements. How do FDA documentation requirements affect artifact lineage?
(b) Design the deployment stages. Can shadow mode be used for medical imaging? What are the ethical considerations of canary deployment (some patients get the new model, others get the old model)?
(c) Design the retraining strategy. Regulatory guidance requires that any model change undergoes re-validation. How does this affect the continuous training trigger design?
(d) Design the rollback procedure. For a clinical decision support tool, what additional safeguards are needed beyond those in the StreamRec pipeline?
Exercise 29.25 (***)
A large online retailer operates 200 ML models in production, ranging from search ranking to dynamic pricing to fraud detection. Each model has its own deployment pipeline, and the ML platform team supports all of them.
(a) Classify the 200 models into tiers (Tier 1: business-critical, Tier 2: important, Tier 3: experimental) and design different deployment pipelines for each tier. Specify the deployment stages, duration, and approval requirements for each tier.
(b) The platform team has 5 engineers. Design a deployment automation strategy that enables 200 models to be deployed without 200 custom pipelines. What should be standardized? What should be configurable?
(c) Two models have conflicting deployment schedules: the search ranking model (Tier 1, deployed Tuesday) and the product recommendation model (Tier 1, deployed Tuesday). Both models affect the same user-facing page. Design a deployment coordination strategy that prevents both models from deploying simultaneously.
Exercise 29.26 (****)
The standard progressive rollout (shadow → canary → 25% → 50% → 100%) assumes a single model serving a single population. In practice, many production systems serve multi-armed bandit models or contextual bandit models (Chapter 22) where the "model" is actually an exploration-exploitation policy.
(a) How does canary deployment interact with the exploration-exploitation tradeoff? If the canary model has a different exploration rate than the champion, the CTR comparison is confounded by exploration (lower immediate CTR but potentially better long-term learning). How would you evaluate a canary that explores more?
(b) Design a deployment pipeline for a contextual bandit system that replaces the bandit policy (not just the model weights). What additional evaluation criteria are needed?
(c) The contextual bandit literature uses "regret" as the primary evaluation metric. Can regret be computed during a canary deployment? If not, what proxy metrics would you use?
Exercise 29.27 (****)
Current deployment pipelines assume a single model version is promoted or rolled back. Emerging research on continuous deployment of ML models (Karimov et al., 2023) proposes deploying models as continuous streams: instead of discrete model versions with staged rollout, the model weights are updated incrementally as new data arrives, with no discrete "deployment" event.
(a) How would shadow mode, canary deployment, and rollback work in a continuous deployment paradigm? What is the "rollback target" when there is no discrete previous version?
(b) Design a monitoring system for continuous deployment that detects when incremental updates are degrading quality. What is the reference distribution when the model changes continuously?
(c) Under what conditions is continuous deployment preferable to discrete retraining-and-deployment? Under what conditions is it dangerous? Consider the fraud detection, recommendation, and credit scoring domains.