Chapter 30: Exercises

Exercises are graded by difficulty:

  - One star (*): Apply the technique from the chapter to a new dataset or scenario
  - Two stars (**): Extend the technique or combine it with a previous chapter's methods
  - Three stars (***): Derive a result, implement from scratch, or design a system component
  - Four stars (****): Research-level problems that connect to open questions in the field


Monitoring vs. Observability

Exercise 30.1 (*)

A fraud detection model at a payments company serves 10 million transactions per day. The model classifies each transaction as legitimate or fraudulent (probability score 0-1). The SRE team has standard software monitoring (latency, error rate, throughput). The ML team has no ML-specific monitoring.

(a) Classify each of the following failure modes as "detectable by software monitoring," "detectable by ML monitoring," or "detectable by neither without business metric feedback":

  1. The model's inference latency increases from 15ms to 200ms
  2. A feature pipeline starts returning null for the merchant_category field
  3. The model begins scoring all transactions from a new payment processor at 0.01 (low fraud risk), regardless of actual risk
  4. A concept drift event increases real-world fraud rates from 0.5% to 1.2%, but the model's predicted fraud rate stays at 0.5%
  5. The model serving container runs out of GPU memory during a traffic spike

(b) For each failure mode that software monitoring cannot detect, describe a specific ML monitoring signal that would detect it. Include the metric name, collection method, and threshold.

(c) Which of these failure modes could be diagnosed (not just detected) purely from metrics? Which require observability (logs, traces, or ad-hoc investigation)?


Exercise 30.2 (*)

Design a TelemetryConfig (using the class from Section 30.2) for a real-time pricing model that serves dynamic prices for an e-commerce platform. The model receives product features, user features, and context features, and returns a price multiplier (0.8 to 1.5). The system serves 500,000 pricing requests per day.

(a) Define at least 8 signals across all four types (metrics, logs, traces, predictions).

(b) Set the prediction_sample_rate and trace_sample_rate. Justify your choices based on storage cost and diagnostic utility.

(c) Estimate the daily storage requirement using estimated_daily_storage_gb().
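If you do not have the Section 30.2 implementation at hand, a minimal stand-in is sketched below. The field names, per-record byte sizes, and the body of estimated_daily_storage_gb() are assumptions for this exercise; substitute the chapter's actual class where it differs.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class TelemetryConfig:
    # signal name -> signal type: "metric", "log", "trace", or "prediction"
    signals: Dict[str, str] = field(default_factory=dict)
    prediction_sample_rate: float = 1.0    # fraction of predictions logged
    trace_sample_rate: float = 0.01        # fraction of requests traced
    requests_per_day: int = 500_000
    bytes_per_prediction: int = 2_048      # assumed average record sizes
    bytes_per_trace: int = 10_240

    def estimated_daily_storage_gb(self) -> float:
        pred_bytes = (self.requests_per_day * self.prediction_sample_rate
                      * self.bytes_per_prediction)
        trace_bytes = (self.requests_per_day * self.trace_sample_rate
                       * self.bytes_per_trace)
        return (pred_bytes + trace_bytes) / 1e9
```

Under these assumed record sizes, full prediction logging plus 1% tracing comes to roughly 1.1 GB/day, which is the scale of number part (c) asks you to justify.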


Data Quality Monitoring

Exercise 30.3 (*)

The StreamRec feature store serves 34 features at inference time. Five of these features have experienced quality issues in the past 6 months:

Feature                Issue                                 Duration  Impact
user_engagement_rate   Null rate jumped from 2% to 45%       3 days    Recall@20 dropped 0.03
item_popularity_score  Stale (48h old instead of 1h)         12 hours  Minor; popularity is slow-changing
user_session_count_7d  Values 10x too large (counting bug)   5 days    Severe; model over-recommended to power users
content_category       New category "podcast" appeared       Ongoing   Model treats unknown category as missing
days_since_last_login  Negative values for new users         2 weeks   Model scored new users as extremely inactive

(a) For each issue, identify which monitoring signal from Section 30.4 (null rate, mean/median, variance, min/max, cardinality, PSI) would have detected it earliest. Specify the threshold.

(b) Rank the five features by monitoring priority (impact × likelihood of recurrence). Justify your ranking.

(c) The team can afford to run full PSI computation (expensive) on only 10 of the 34 features per hour. Which 10 should they prioritize, and what simpler check should they run on the remaining 24?


Exercise 30.4 (**)

Design a training-serving skew detection system for the StreamRec model.

(a) The feature user_avg_session_minutes is computed in Spark for training (mean of all sessions in the training window) and in Redis for serving (exponential moving average of recent sessions with decay factor 0.95). These two computations will naturally produce different values for the same user. How would you distinguish expected computation skew from problematic computation skew?

(b) Implement a TrainingServingSkewDetector class that:

  - Takes a sample of serving-time feature vectors and the corresponding training-time feature vectors for the same users
  - Computes per-feature PSI between training and serving distributions
  - Classifies each feature as "aligned" (PSI < 0.05), "expected_skew" (0.05 ≤ PSI < 0.15 with known computation differences), or "anomalous_skew" (PSI ≥ 0.15)

(c) How would you detect training-serving skew for categorical features where PSI is not well-defined? Propose a metric and threshold.
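As a starting point for part (b), here is a skeleton. The quantile-binned PSI helper and the decision to treat a 0.05-0.15 PSI as anomalous when the feature has no known computation difference are design assumptions, not the chapter's reference implementation.

```python
import numpy as np

def psi(ref: np.ndarray, cur: np.ndarray, bins: int = 10) -> float:
    """PSI between two samples, using quantile bins from the reference."""
    edges = np.quantile(ref, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # cover out-of-range serving values
    r = np.clip(np.histogram(ref, edges)[0] / len(ref), 1e-6, None)
    c = np.clip(np.histogram(cur, edges)[0] / len(cur), 1e-6, None)
    return float(np.sum((c - r) * np.log(c / r)))

class TrainingServingSkewDetector:
    def __init__(self, expected_skew_features=()):
        # Features with known, documented computation differences (part a)
        self.expected = set(expected_skew_features)

    def classify(self, train: dict, serve: dict) -> dict:
        """train/serve: feature name -> array of values for the same users."""
        out = {}
        for name in train:
            p = psi(np.asarray(train[name]), np.asarray(serve[name]))
            if p < 0.05:
                out[name] = "aligned"
            elif p < 0.15 and name in self.expected:
                out[name] = "expected_skew"
            else:
                out[name] = "anomalous_skew"
        return out
```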


Exercise 30.5 (**)

A streaming feature pipeline computes user_click_rate_1h (clicks in the last hour / impressions in the last hour) in real time. The pipeline has an intermittent bug that occasionally double-counts impressions, causing the click rate to be halved.

(a) The bug affects 3-5% of computation windows and self-resolves within minutes. Would PSI-based monitoring detect this? Why or why not?

(b) Design a monitoring approach that detects intermittent, self-resolving computation errors. Consider: anomaly detection on the feature's time series, percentile monitoring (p1/p5 drops), or individual-level consistency checks.

(c) The model's prediction is moderately sensitive to user_click_rate_1h (SHAP importance rank #7 out of 34 features). At what error rate (percentage of requests with a corrupted feature) would you expect the model's Recall@20 to degrade measurably? Design a simulation to estimate this.


Drift Detection Methods

Exercise 30.6 (*)

A feature income has the following reference and current distributions (1000 samples each):

Reference:

Bin         [0, 30K)   [30K, 50K)   [50K, 75K)   [75K, 100K)   [100K, +)
Proportion  0.15       0.25         0.30         0.20          0.10

Current:

Bin         [0, 30K)   [30K, 50K)   [50K, 75K)   [75K, 100K)   [100K, +)
Proportion  0.08       0.18         0.28         0.26          0.20

(a) Compute the per-bin PSI contributions and the total PSI. Show your work.

(b) Which bins contribute most to the PSI? What does this tell you about the nature of the income shift?

(c) Interpret the result: is this a "no action," "investigate," or "alert" scenario? What real-world event might have caused this shift?
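To check your arithmetic in part (a), the standard PSI formula, PSI = Σᵢ (cᵢ − rᵢ) ln(cᵢ / rᵢ), can be evaluated directly on the binned proportions:

```python
import math

def psi_from_proportions(ref, cur):
    """Per-bin PSI contributions and total, from pre-binned proportions."""
    contributions = [(c - r) * math.log(c / r) for r, c in zip(ref, cur)]
    return contributions, sum(contributions)

ref = [0.15, 0.25, 0.30, 0.20, 0.10]  # reference bin proportions
cur = [0.08, 0.18, 0.28, 0.26, 0.20]  # current bin proportions
per_bin, total = psi_from_proportions(ref, cur)
```

Compare per_bin against your hand computation; bins whose proportion roughly doubled or halved dominate the total.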


Exercise 30.7 (*)

Using the same income distributions from Exercise 30.6:

(a) Describe how the KS test would compare these two distributions. What is the KS statistic conceptually (without computing it precisely)?

(b) Would the KS test reject the null hypothesis at $\alpha = 0.01$ with 1,000 samples in each group? What about with 10,000 samples? Why does sample size matter?

(c) A colleague argues: "Just use the KS test — it gives you a p-value, which PSI does not." Provide a counterargument using the overpowering problem discussed in Section 30.9.


Exercise 30.8 (**)

Compare PSI, KS, and JS divergence on synthetic data:

(a) Generate two normal distributions: $P \sim N(0, 1)$ and $Q \sim N(\delta, 1)$ for $\delta \in \{0.0, 0.1, 0.2, 0.5, 1.0, 2.0\}$. For each $\delta$, compute PSI (10 bins), KS statistic, and JS divergence (10 bins) using 10,000 samples. Plot all three metrics as a function of $\delta$.

(b) Repeat for a variance shift: $P \sim N(0, 1)$ and $Q \sim N(0, \sigma)$ for $\sigma \in \{0.5, 0.8, 1.0, 1.2, 1.5, 2.0, 3.0\}$. Which metric is most sensitive to variance changes?

(c) Repeat for a tail shift: $P \sim N(0, 1)$ and $Q$ is a mixture $0.95 \cdot N(0, 1) + 0.05 \cdot N(5, 0.5)$. Which metric detects the tail contamination? What is the minimum contamination fraction (replacing 0.05 with $\epsilon$) that each metric detects?
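A sketch of the experimental harness for part (a), using numpy-only implementations of the three metrics (shared histogram bins for PSI and JS divergence, and the empirical-CDF definition of the KS statistic); extend the loop for parts (b) and (c):

```python
import numpy as np

def ks_statistic(p, q):
    """Two-sample KS statistic: max gap between the empirical CDFs."""
    grid = np.sort(np.concatenate([p, q]))
    cdf_p = np.searchsorted(np.sort(p), grid, side="right") / len(p)
    cdf_q = np.searchsorted(np.sort(q), grid, side="right") / len(q)
    return float(np.max(np.abs(cdf_p - cdf_q)))

def binned_metrics(p, q, bins=10):
    """PSI and JS divergence on histogram bins shared by both samples."""
    edges = np.histogram_bin_edges(np.concatenate([p, q]), bins=bins)
    hp = np.clip(np.histogram(p, edges)[0] / len(p), 1e-6, None)
    hq = np.clip(np.histogram(q, edges)[0] / len(q), 1e-6, None)
    psi = float(np.sum((hq - hp) * np.log(hq / hp)))
    m = (hp + hq) / 2
    js = float(0.5 * np.sum(hp * np.log(hp / m)) + 0.5 * np.sum(hq * np.log(hq / m)))
    return psi, js

rng = np.random.default_rng(42)
p = rng.normal(0.0, 1.0, 10_000)
for delta in [0.0, 0.1, 0.2, 0.5, 1.0, 2.0]:
    q = rng.normal(delta, 1.0, 10_000)
    psi, js = binned_metrics(p, q)
    print(f"delta={delta:.1f}  PSI={psi:.3f}  KS={ks_statistic(p, q):.3f}  JS={js:.4f}")
```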


Exercise 30.9 (**)

PSI is sensitive to the number of bins. Design and evaluate a robust PSI estimator.

(a) For the distributions in Exercise 30.8(a) with $\delta = 0.5$, compute PSI with 3, 5, 10, 20, 50, and 100 bins. Plot PSI as a function of bin count. How stable is the result?

(b) Implement a "quantile-adaptive PSI" that uses quantile-based bins from the reference distribution (as in ReferenceDistribution.from_array). Repeat the analysis from (a). Is the result more stable?

(c) Propose a bootstrap confidence interval for PSI: resample the current distribution 1000 times, compute PSI for each resample, and report the 95% confidence interval. Implement this and evaluate on the $\delta = 0.5$ case.
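One way to structure part (c) is sketched below; the percentile bootstrap and the quantile-based reference bins are assumptions you are free to vary:

```python
import numpy as np

def psi_quantile(ref, cur, bins=10):
    """PSI with quantile bins taken from the reference distribution."""
    edges = np.quantile(ref, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    r = np.clip(np.histogram(ref, edges)[0] / len(ref), 1e-6, None)
    c = np.clip(np.histogram(cur, edges)[0] / len(cur), 1e-6, None)
    return float(np.sum((c - r) * np.log(c / r)))

def psi_bootstrap_ci(ref, cur, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample the current window, recompute PSI."""
    rng = np.random.default_rng(seed)
    stats = [psi_quantile(ref, rng.choice(cur, size=len(cur), replace=True))
             for _ in range(n_boot)]
    lo, hi = np.quantile(stats, [alpha / 2, 1 - alpha / 2])
    return psi_quantile(ref, cur), (float(lo), float(hi))
```

If the lower bound of the interval sits above your alert threshold, the drift signal is unlikely to be a binning or sampling artifact.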


Exercise 30.10 (***)

The chapter presents drift detection for individual features. In practice, features are correlated, and multivariate drift may be missed by marginal monitoring.

(a) Construct an example where two features are individually stable (PSI < 0.05 for each) but their joint distribution has shifted significantly (e.g., the correlation between them has changed from 0.2 to 0.8).

(b) Propose a multivariate drift detection method. Options include: (i) PCA on the feature space and monitor the first $k$ principal components, (ii) train an autoencoder on reference data and monitor reconstruction error on current data, (iii) train a classifier to distinguish reference from current data (the "classifier two-sample test"). Implement one of these.

(c) Discuss the scalability of your method. For a model with 200 features and 10 million daily predictions, what is the computational cost? How does it compare to running 200 independent PSI computations?
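For option (iii) in part (b), here is a self-contained sketch of the classifier two-sample test. The square/interaction feature augmentation (so that a covariance change becomes visible to a linear model) and the plain gradient-descent logistic regression are implementation choices for this sketch, not the only way to do it.

```python
import numpy as np

def classifier_two_sample_test(ref, cur, epochs=300, lr=0.5, seed=0):
    """Held-out accuracy of a logistic regression trained to tell
    reference (label 0) from current (label 1) rows.
    Accuracy near 0.5 => no detectable joint drift."""
    rng = np.random.default_rng(seed)
    X = np.vstack([ref, cur]).astype(float)
    d = X.shape[1]
    # Add squares and pairwise products: a correlation shift with stable
    # marginals shows up only in these interaction terms.
    X = np.hstack([X] + [X[:, [i]] * X[:, [j]]
                         for i in range(d) for j in range(i, d)])
    X = (X - X.mean(0)) / (X.std(0) + 1e-9)
    y = np.concatenate([np.zeros(len(ref)), np.ones(len(cur))])
    idx = rng.permutation(len(X))
    tr, te = idx[: len(X) // 2], idx[len(X) // 2:]
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):                       # full-batch gradient descent
        p = 1.0 / (1.0 + np.exp(-(X[tr] @ w + b)))
        g = p - y[tr]
        w -= lr * (X[tr].T @ g) / len(tr)
        b -= lr * g.mean()
    pred = 1.0 / (1.0 + np.exp(-(X[te] @ w + b))) > 0.5
    return float(np.mean(pred == y[te]))

# Part (a)'s construction: marginals stable, correlation 0.2 -> 0.8.
rng = np.random.default_rng(1)
ref = rng.multivariate_normal([0, 0], [[1, 0.2], [0.2, 1]], 5000)
cur = rng.multivariate_normal([0, 0], [[1, 0.8], [0.8, 1]], 5000)
acc = classifier_two_sample_test(ref, cur)  # clearly above 0.5 despite stable marginals
```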


Alerting and Incident Response

Exercise 30.11 (*)

Design alert rules for a search ranking model at an e-commerce company. The model ranks products for each user query. Key metrics include: NDCG@10, click-through rate, add-to-cart rate, query latency, and null result rate (queries that return zero results).

(a) Define 6 alert rules covering data quality, model performance, and system health. For each rule, specify the metric, condition, severity, escalation target, and cooldown.

(b) The alert ndcg_drop fires every Monday morning because weekend search patterns differ from weekday patterns (lower NDCG on weekends is expected). Redesign the alert to avoid this false alarm while still detecting genuine degradation.

(c) The alerts feature_drift_critical and prediction_drift_warning fire simultaneously. Design a correlation rule that groups these into a single incident and suppresses the lower-severity alert.


Exercise 30.12 (**)

Write a runbook for the following scenario: the StreamRec prediction latency p99 has spiked from 35ms to 120ms, exceeding the 50ms SLO.

(a) Write the impact assessment section: who is affected, what is the business impact, and what is the severity classification?

(b) Write the diagnostic steps section. The latency breakdown metrics show: feature retrieval (5ms → 80ms), model inference (25ms → 30ms), re-ranking (5ms → 10ms). Where is the bottleneck? What specific commands or queries would you run to diagnose the feature retrieval latency spike?

(c) Write the mitigation section. Include at least 3 options ordered by speed of implementation: (1) immediate (< 5 minutes), (2) short-term (< 1 hour), (3) long-term (< 1 day).


Exercise 30.13 (**)

Design an on-call rotation for an ML team of 6 engineers (3 senior, 3 junior) supporting 2 production models (StreamRec recommendation model and a content moderation classifier).

(a) Define the rotation schedule. Consider: minimum rotation period, handoff procedures, weekend coverage, and burnout prevention.

(b) What training would a junior ML engineer need before joining the on-call rotation? Design a 2-week onboarding program that includes: runbook familiarization, shadow on-call shifts, simulated incidents, and model rollback practice.

(c) The team is considering a "follow-the-sun" on-call model by partnering with a team in a different timezone. What are the benefits and risks? What information must be included in the handoff?


Exercise 30.14 (**)

Write a blameless post-mortem using the PostMortem class for the following incident:

On Tuesday at 14:00, a new version of the item embedding model was deployed to the StreamRec feature store. The new embeddings had 256 dimensions instead of the expected 128. The recommendation model, which expects 128-dimensional embeddings, silently truncated the input and produced degraded recommendations. The issue was detected at 18:30 when a product manager noticed a 15% CTR drop. The model was rolled back at 19:15. Root cause was identified at 20:00: the embedding model's configuration was updated but the downstream model's input schema was not.

(a) Populate all fields of the PostMortem dataclass, including a detailed timeline with at least 8 events.

(b) Identify at least 3 root causes and 3 contributing factors. Remember: root causes are systemic, not personal.

(c) Define at least 5 action items, each with an owner and deadline. At least 2 action items should be monitoring improvements that would have detected this incident earlier.


Business Metric Monitoring

Exercise 30.15 (*)

The StreamRec product team tracks 5 business metrics: CTR, completion rate, session length, recommendation coverage, and revenue per recommendation. The ML team tracks 3 model metrics: Recall@20, NDCG@20, and prediction latency p99.

(a) Draw a causal diagram showing how model metric changes propagate to business metric changes. For example: Recall@20 decreases → users click fewer recommendations → CTR drops. Not all relationships are monotonic — identify at least one case where improving a model metric could worsen a business metric.

(b) The recommendation coverage dropped from 0.45 to 0.20 while CTR remained stable. Explain how this is possible and why it is a problem despite stable CTR.

(c) Design a composite health score that combines model metrics and business metrics into a single number (0-100). Specify the formula, the weights, and the normalization method. What are the risks of reducing multi-dimensional health to a single number?


Exercise 30.16 (**)

The feedback loop for StreamRec recommendations creates a monitoring challenge: the model's own predictions influence the data it will be retrained on.

(a) Describe the feedback loop precisely: how do the current model's recommendation decisions affect the future training data? What specific bias does this introduce?

(b) Design a monitoring signal that detects when the feedback loop is narrowing recommendation diversity. Consider tracking: item exposure entropy over time, the Gini coefficient of recommendation frequency, or the fraction of the catalog that receives fewer than $k$ impressions per week.

(c) At Meridian Financial, a similar feedback loop exists: the model's credit decisions determine who receives loans, and only loan recipients generate default/non-default labels. How does this create selection bias in the monitoring data? How would you design monitoring that accounts for this bias? (Connect to the potential outcomes framework from Chapter 16.)


Comprehensive Monitoring Design

Exercise 30.17 (**)

Adapt the monitoring architecture from this chapter for a medical image classification system that classifies chest X-rays as normal or showing pneumonia.

(a) Design the data quality monitoring layer. What data quality signals are specific to medical images (vs. tabular data)? Consider: image resolution, bit depth, DICOM metadata completeness, scanner model distribution, patient demographic distribution.

(b) Design the model performance monitoring layer. Ground truth is available within 48 hours (radiologist review). What metrics should be tracked in real time (before ground truth) and what metrics should be tracked after ground truth arrives?

(c) Design the fairness monitoring layer. The system must not have disparate performance across patient demographics (age, sex, race). What SLOs would you define? How would you handle the fact that disease prevalence varies by demographic group (older patients have higher pneumonia rates)?


Exercise 30.18 (***)

Implement a complete MonitoringDashboard class that integrates all four monitoring pillars:

(a) Define the class with methods for:

  - update_data_quality(features: Dict[str, np.ndarray]) — compute PSI for all features and update drift metrics
  - update_model_performance(predictions: np.ndarray, outcomes: Optional[np.ndarray]) — track the prediction distribution and, when available, accuracy
  - update_system_health(latency: float, error_count: int, request_count: int) — track SLI values
  - update_business_metrics(ctr: float, coverage: float, session_length: float) — track business KPIs

(b) Add a generate_health_report() method that produces a structured report summarizing the current state of all four pillars, any active alerts, and the SLO error budget status.

(c) Add a detect_correlated_anomalies() method that identifies when multiple signals are anomalous simultaneously (e.g., feature drift + prediction drift + business metric drop), suggesting a common root cause.
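A starter skeleton for parts (a) and (b); the placeholder thresholds, the PSI helper, and the report structure are assumptions to refine, and detect_correlated_anomalies() from part (c) is deliberately left for you to add:

```python
import numpy as np

class MonitoringDashboard:
    """Skeleton for Exercise 30.18 (a)-(b); thresholds are placeholders."""

    def __init__(self, reference_features, psi_alert_threshold=0.10):
        self.reference = reference_features      # feature name -> reference sample
        self.psi_alert_threshold = psi_alert_threshold
        self.state = {"data_quality": {}, "performance": {},
                      "system": {}, "business": {}}
        self.alerts = []

    @staticmethod
    def _psi(ref, cur, bins=10):
        edges = np.quantile(ref, np.linspace(0, 1, bins + 1))
        edges[0], edges[-1] = -np.inf, np.inf    # cover out-of-range values
        r = np.clip(np.histogram(ref, edges)[0] / len(ref), 1e-6, None)
        c = np.clip(np.histogram(cur, edges)[0] / len(cur), 1e-6, None)
        return float(np.sum((c - r) * np.log(c / r)))

    def update_data_quality(self, features):
        for name, values in features.items():
            drift = self._psi(self.reference[name], values)
            self.state["data_quality"][name] = drift
            if drift > self.psi_alert_threshold:
                self.alerts.append(f"feature_drift:{name}")

    def update_model_performance(self, predictions, outcomes=None):
        self.state["performance"]["prediction_mean"] = float(np.mean(predictions))
        if outcomes is not None:                 # ground truth has arrived
            self.state["performance"]["accuracy"] = float(
                np.mean((np.asarray(predictions) > 0.5) == np.asarray(outcomes)))

    def update_system_health(self, latency_ms, error_count, request_count):
        self.state["system"] = {"latency_ms": latency_ms,
                                "error_rate": error_count / max(request_count, 1)}

    def update_business_metrics(self, ctr, coverage, session_length):
        self.state["business"] = {"ctr": ctr, "coverage": coverage,
                                  "session_length": session_length}

    def generate_health_report(self):
        return {"pillars": self.state, "active_alerts": list(self.alerts)}
```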


Exercise 30.19 (***)

The chapter discusses monitoring for individual features and the prediction distribution. Design a monitoring system for a multi-model pipeline (like StreamRec's retrieval → ranking → re-ranking architecture from Chapter 24).

(a) Each stage in the pipeline has its own inputs, outputs, and failure modes. Design per-stage monitoring signals for all three stages. What signals are specific to each stage (e.g., candidate set size for retrieval, score distribution for ranking)?

(b) A degradation in the retrieval stage (returning fewer relevant candidates) may or may not affect the final recommendations (the ranking stage may compensate). Design an end-to-end monitoring strategy that detects whether a stage-level degradation propagates to the final output.

(c) The re-ranking stage applies business rules (content diversity, recency boost, safety filtering). These rules are not learned — they are deterministic. Should they be monitored? If so, what signals would you track? (Consider: safety filter activation rate, diversity score before/after re-ranking, items removed by rules.)


Exercise 30.20 (***)

Design a cost-effective monitoring strategy for a startup with limited engineering resources.

(a) You have one ML engineer, one production model, and a $500/month budget for monitoring infrastructure. Prioritize the monitoring signals from this chapter by impact-per-dollar. Which 5 signals would you implement first? What open-source tools would you use?

(b) As the company scales from 1 to 5 to 20 models, what monitoring infrastructure investments should be made at each stage? Define the monitoring maturity levels and the trigger for moving from one level to the next.

(c) Compare build-vs-buy for ML monitoring: open-source (Prometheus + Grafana + custom drift detection) vs. managed platforms (Evidently, Whylabs, Arize, Fiddler). For each approach, estimate the cost in engineering hours and dollars for a 5-model deployment.


Exercise 30.21 (****)

The chapter treats monitoring and observability as complementary approaches. A fundamental open question is: how much telemetry is enough?

(a) Formalize the trade-off between telemetry granularity and diagnostic capability. Define "diagnostic capability" as the probability that a novel failure mode can be diagnosed from the available telemetry. How does this probability change as a function of the prediction sample rate (1% vs. 10% vs. 100% of predictions logged)?

(b) Propose a framework for adaptive telemetry: when the system is healthy, collect minimal telemetry (low cost). When an anomaly is detected, automatically increase the telemetry granularity (high cost but high diagnostic capability). Implement the adaptive logic.
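A minimal sketch of the adaptive logic for part (b), assuming a scalar anomaly score is already available upstream; the escalate-immediately, decay-geometrically policy is one possible design among many:

```python
import random

class AdaptiveTelemetry:
    """Raise the prediction sample rate while the system looks anomalous,
    then decay back to the cheap baseline as it recovers."""

    def __init__(self, base_rate=0.01, max_rate=1.0, decay=0.9):
        self.base_rate = base_rate    # healthy-state sampling (low cost)
        self.max_rate = max_rate      # anomalous-state sampling (full fidelity)
        self.decay = decay            # geometric decay back toward baseline
        self.rate = base_rate

    def observe(self, anomaly_score: float, threshold: float = 1.0):
        if anomaly_score > threshold:
            self.rate = self.max_rate                  # escalate immediately
        else:
            self.rate = max(self.base_rate, self.rate * self.decay)

    def should_log(self) -> bool:
        return random.random() < self.rate
```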

(c) How does the monitoring-observability spectrum interact with privacy requirements? In a healthcare or financial system, full-fidelity prediction logging may violate privacy regulations. Design a monitoring system that provides observability while respecting differential privacy constraints: the system can detect and diagnose problems without exposing individual predictions. (Connect to Chapter 32, Privacy-Preserving Data Science.)


Exercise 30.22 (****)

A fundamental challenge in ML monitoring is alert fatigue: as more signals are monitored and more rules are configured, the number of alerts increases until the on-call team ignores them.

(a) Quantify the alert fatigue problem. If a team monitors 50 features with PSI (threshold 0.10), 50 features with KS test (threshold 0.01), and 10 business metrics (with 2 thresholds each), how many alerts per day would you expect under the null hypothesis (no real drift, but statistical false alarms)?

(b) The Bonferroni correction (divide the significance level by the number of tests) is a standard solution for multiple testing. Apply it to the monitoring scenario in (a). What happens to the effective threshold for each test? Is the correction too conservative for practical monitoring?

(c) Propose an alternative to Bonferroni for ML monitoring. Consider: (i) False Discovery Rate (FDR) control (Benjamini-Hochberg), (ii) hierarchical alerting (feature-level alerts only escalate if the prediction distribution also shifts), (iii) anomaly scoring (rank all signals by anomaly score and alert on the top-$k$). Implement one approach and evaluate it on simulated data.
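If you choose option (i), the Benjamini-Hochberg step-up procedure is short enough to implement directly. This sketch returns a boolean rejection mask over the input p-values:

```python
import numpy as np

def benjamini_hochberg(p_values, fdr=0.05):
    """BH step-up procedure: reject the hypotheses whose sorted p-values
    fall at or below the line i * fdr / m, controlling the FDR at `fdr`."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    below = p[order] <= fdr * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = int(np.max(np.nonzero(below)[0]))  # largest i with p_(i) <= i*q/m
        reject[order[: k + 1]] = True
    return reject
```

Unlike Bonferroni, the effective threshold adapts to how many signals genuinely look drifted, which is usually a better fit for monitoring dozens of correlated features.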


Exercise 30.23 (**)

Design a monitoring system for model fairness over time.

(a) Extend the FairLendingMonitor class to support time-series analysis of AIR: track AIR monthly and detect significant trends (e.g., AIR declining for 3 consecutive months even if each month's AIR is above threshold).

(b) Define SLOs for fairness: "The adverse impact ratio for each protected group shall not fall below 0.85 for more than 1 month in any 12-month period." Implement the error budget computation for this SLO.

(c) How would you adapt fairness monitoring for the StreamRec recommendation system, where there are no regulatory requirements but the product team has committed to equitable recommendations across user demographics?


Exercise 30.24 (***)

The PostMortem class captures incident data. Design a system for learning from post-mortems across incidents.

(a) Implement a PostMortemDatabase class that stores all post-mortems and provides queries: most common root causes, mean time to detect by incident type, most frequently recommended action items, action items that are never completed.

(b) Compute "monitoring coverage": what fraction of past incidents would have been detected by the current monitoring configuration? For each undetected incident, what monitoring rule would have caught it?

(c) Use the post-mortem database to prioritize monitoring investments. If 60% of incidents have root cause "data pipeline change" and only 20% have root cause "model drift," which monitoring capability should receive more investment? Formalize this as a cost-benefit optimization problem.


Exercise 30.25 (**)

Implement concept drift detection for the StreamRec model using the prediction-outcome gap approach described in Section 30.11.

(a) Define a ConceptDriftDetector class that:

  - Accepts a stream of (prediction, outcome) pairs
  - Computes the rolling mean absolute error (MAE) between predictions and outcomes in a sliding window
  - Uses CUSUM (Cumulative Sum) change-point detection to identify when the MAE shifts upward

(b) Test the detector on simulated data: generate 10,000 predictions where the first 7,000 have MAE ~0.05 and the last 3,000 have MAE ~0.12 (simulating concept drift at sample 7,000). Does your detector identify the change point?

(c) The CUSUM detector has a sensitivity parameter $k$ (the allowable slack). How does $k$ affect the trade-off between detection speed and false alarm rate? Evaluate for $k \in \{0.5\sigma, 1.0\sigma, 1.5\sigma, 2.0\sigma\}$ where $\sigma$ is the standard deviation of the pre-drift MAE.
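A starting point for part (a). This sketch deviates from the prompt in two assumed ways: it uses non-overlapping windows rather than a strictly sliding one, so the CUSUM increments are approximately independent, and it calibrates the pre-drift MAE mean and spread on the first ten windows, which are assumed drift-free. Both choices are worth revisiting when you tune $k$ in part (c).

```python
import numpy as np

class ConceptDriftDetector:
    """One-sided CUSUM on per-window MAE between predictions and outcomes."""

    def __init__(self, window=200, k=0.5, h=8.0, calibration_windows=10):
        self.window = window                  # samples per non-overlapping window
        self.k = k                            # slack, in units of pre-drift MAE std
        self.h = h                            # decision threshold, same units
        self.calibration_windows = calibration_windows
        self.block = []
        self.window_maes = []
        self.mu = self.sigma = None           # estimated on the calibration windows
        self.cusum = 0.0

    def update(self, prediction, outcome):
        """Feed one (prediction, outcome) pair; True when an upward MAE shift is flagged."""
        self.block.append(abs(prediction - outcome))
        if len(self.block) < self.window:
            return False
        mae = float(np.mean(self.block))
        self.block = []
        self.window_maes.append(mae)
        if len(self.window_maes) <= self.calibration_windows:
            # Calibrate the pre-drift MAE level on the first windows.
            self.mu = float(np.mean(self.window_maes))
            self.sigma = float(np.std(self.window_maes)) + 1e-12
            return False
        z = (mae - self.mu) / self.sigma      # standardized window MAE
        self.cusum = max(0.0, self.cusum + z - self.k)
        return self.cusum > self.h
```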