Exercises: Chapter 32

Monitoring Models in Production


Exercise 1: PSI Computation by Hand (Conceptual + Code)

A feature called plan_price has the following distribution in the training data and in this week's production data:

  Bin (decile)   Training Proportion   Production Proportion
  1              0.10                  0.08
  2              0.10                  0.07
  3              0.10                  0.09
  4              0.10                  0.11
  5              0.10                  0.12
  6              0.10                  0.11
  7              0.10                  0.13
  8              0.10                  0.11
  9              0.10                  0.10
  10             0.10                  0.08

a) Compute the PSI by hand (show the per-bin calculations). Classify the result as stable, investigate, or retrain.

b) Now verify your answer using the compute_psi function from this chapter. Generate synthetic data that approximately matches these bin proportions and confirm the PSI value.

c) Suppose the production proportion in bin 7 changes from 0.13 to 0.35 (with proportional reductions in other bins). Recompute PSI. What happens to the overall score when a single bin has a large shift?
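For parts (a) and (b), PSI can be computed directly from binned proportions. The chapter's compute_psi may expect raw samples rather than bins; the helper below is a minimal proportion-based sketch (not the chapter's implementation) that is enough to check a hand calculation:

```python
import numpy as np

def psi_from_proportions(expected, actual, eps=1e-6):
    """PSI from pre-binned proportions: sum over bins of (a - e) * ln(a / e)."""
    expected = np.clip(np.asarray(expected, dtype=float), eps, None)
    actual = np.clip(np.asarray(actual, dtype=float), eps, None)
    return float(np.sum((actual - expected) * np.log(actual / expected)))

train = [0.10] * 10
prod = [0.08, 0.07, 0.09, 0.11, 0.12, 0.11, 0.13, 0.11, 0.10, 0.08]
psi = psi_from_proportions(train, prod)
```

The eps clip guards against empty bins, where the logarithm would otherwise be undefined.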


Exercise 2: Multi-Feature Drift Detection (Code)

StreamFlow has 10 features. Simulate training data (n=50,000) and production data (n=5,000) where:

  • 7 features are stable (same distribution in training and production)
  • 2 features have moderate drift (PSI between 0.10 and 0.25)
  • 1 feature has significant drift (PSI > 0.50)

import numpy as np
import pandas as pd

np.random.seed(42)

# Your task: create reference_df and production_df
# with the drift pattern described above.
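One way to hit the target PSI bands is to keep the stable features identically distributed and shift the mean of the drifted ones. The sketch below uses standard-normal features with illustrative shift sizes (roughly, a mean shift of delta standard deviations yields a PSI near delta squared, so the exact values may need tuning against compute_psi):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n_ref, n_prod = 50_000, 5_000

def make_frame(n, shifts):
    """10 standard-normal features; `shifts` maps feature name -> mean shift."""
    return pd.DataFrame(
        {f"f{i}": rng.normal(shifts.get(f"f{i}", 0.0), 1.0, n) for i in range(10)}
    )

reference_df = make_frame(n_ref, shifts={})
# f7, f8: moderate mean shifts; f9: a large shift intended to push PSI > 0.50.
production_df = make_frame(n_prod, shifts={"f7": 0.35, "f8": 0.45, "f9": 1.2})
```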

a) Use the compute_drift_report function from this chapter to generate a drift report. Verify that the PSI values match your intended drift levels.

b) Sort the report by PSI descending. Which feature has the highest PSI? Does this match your simulation?

c) Add a categorical feature device_type with categories ["mobile", "tablet", "desktop", "tv"]. In training, the distribution is [0.45, 0.15, 0.25, 0.15]. In production, it shifts to [0.55, 0.10, 0.20, 0.15]. Use the chi2_drift_test function to test for drift. Is the shift statistically significant?
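If chi2_drift_test is not at hand, scipy's goodness-of-fit test gives the same answer when the training proportions are treated as the expected distribution. This is a sketch: the observed counts below come from one simulated production sample, not real data.

```python
import numpy as np
from scipy.stats import chisquare

rng = np.random.default_rng(42)
categories = ["mobile", "tablet", "desktop", "tv"]
train_props = np.array([0.45, 0.15, 0.25, 0.15])
prod_props = np.array([0.55, 0.10, 0.20, 0.15])

n_prod = 5_000
observed = rng.multinomial(n_prod, prod_props)  # simulated production counts
expected = train_props * n_prod                 # counts expected under training dist
stat, p_value = chisquare(f_obs=observed, f_exp=expected)
```

With 5,000 samples and a 10-percentage-point shift in the largest category, the test statistic is large and the p-value far below any conventional alpha.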

d) If you were setting up alerts for StreamFlow, would you alert on the moderately drifted features or only on the significantly drifted one? Justify your answer in terms of alert fatigue vs. early warning.


Exercise 3: Concept Drift vs. Data Drift (Conceptual)

For each scenario below, identify whether the situation describes data drift, concept drift, prior probability shift, or some combination. Explain your reasoning.

a) A fraud detection model was trained on data where 2% of transactions were fraudulent. After a major data breach at a partner company, the fraud rate jumps to 8%. The fraudulent transactions have the same patterns as before --- just more of them.

b) An e-commerce recommendation model was trained on pre-pandemic shopping behavior. During lockdowns, users who previously bought office supplies start buying home fitness equipment. The features (browsing history, purchase frequency) have similar distributions, but the products they map to have changed entirely.

c) A credit scoring model uses income, employment length, and debt-to-income ratio to predict default. After a recession, all three features shift downward (lower incomes, shorter employment, higher debt ratios), but the relationship between these features and default probability remains the same.

d) TurbineTech's vibration model was trained on data from summer operations. In winter, ambient temperature drops cause thermal contraction in the turbine housing, reducing baseline vibration by 0.4 mm/s. Failures still occur at the same absolute vibration level, but the model's baseline has shifted.


Exercise 4: Building a Monitoring Pipeline (Code)

Using the ModelMonitor class from this chapter, build a complete monitoring pipeline for the following scenario:

StreamFlow deploys a churn model on January 1. Simulate 12 weeks of production data with the following pattern:

  • Weeks 1--4: Stable (same distribution as training data)
  • Weeks 5--8: Gradual drift in sessions_last_30d (mean increases from 14 to 20 linearly)
  • Weeks 9--12: Drift stabilizes at the new level; additionally, support_tickets_90d mean increases from 1.2 to 2.5

np.random.seed(42)

# Reference data
n_ref = 50000
reference = pd.DataFrame({
    "sessions_last_30d": np.random.poisson(14, n_ref),
    "avg_session_minutes": np.random.exponential(28, n_ref).round(1),
    "content_completion_rate": np.random.beta(3, 2, n_ref).round(3),
    "hours_change_pct": np.random.normal(0, 30, n_ref).round(1),
    "months_active": np.random.randint(1, 60, n_ref),
    "devices_used": np.random.randint(1, 6, n_ref),
    "support_tickets_90d": np.random.poisson(1.2, n_ref),
})
ref_predictions = np.random.beta(2, 15, n_ref)

# Your task:
# 1. Generate 12 weeks of production data following the pattern above.
# 2. Initialize a ModelMonitor with appropriate alert rules.
# 3. Run monitor.run_check() for each week.
# 4. Plot the max_feature_psi over the 12 weeks.
# 5. Identify the week when the first warning alert fires
#    and the week when the first critical alert fires.
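A sketch of step 1 (the helper name and the step change in ticket volume at week 9 are one reading of the schedule, not prescribed by the exercise): interpolate the Poisson mean linearly over weeks 5 to 8, then hold it at the new level.

```python
import numpy as np
import pandas as pd

def weekly_production(week, n=5_000, seed=None):
    """Generate one week of production data following the drift schedule."""
    rng = np.random.default_rng(seed if seed is not None else week)
    if week <= 4:
        sessions_mean = 14.0
    elif week <= 8:
        sessions_mean = 14.0 + (week - 4) / 4 * (20.0 - 14.0)  # linear ramp
    else:
        sessions_mean = 20.0                                   # stabilized
    tickets_mean = 1.2 if week <= 8 else 2.5
    return pd.DataFrame({
        "sessions_last_30d": rng.poisson(sessions_mean, n),
        "avg_session_minutes": rng.exponential(28, n).round(1),
        "content_completion_rate": rng.beta(3, 2, n).round(3),
        "hours_change_pct": rng.normal(0, 30, n).round(1),
        "months_active": rng.integers(1, 60, n),
        "devices_used": rng.integers(1, 6, n),
        "support_tickets_90d": rng.poisson(tickets_mean, n),
    })
```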

a) At which week does the first warning alert trigger?

b) At which week does the first critical alert trigger?

c) Plot max_feature_psi over the 12 weeks. Describe the shape of the curve. Why does it not increase linearly even though the drift is introduced linearly?

d) If StreamFlow used scheduled weekly retraining, would the drift in weeks 5--8 cause a problem? Why or why not?


Exercise 5: KS Test Sensitivity Analysis (Code)

The KS test is known to become overly sensitive at large sample sizes, flagging even trivially small shifts as significant. Demonstrate this empirically.

a) Generate two samples from the exact same distribution: N(0, 1) with sizes 100, 1,000, 10,000, and 100,000. For each sample size, run the KS test 1,000 times and record the proportion of times the test rejects the null hypothesis at alpha = 0.05.

np.random.seed(42)

sample_sizes = [100, 1_000, 10_000, 100_000]
n_trials = 1000

# Your task: for each sample size, compute the false positive rate.
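A sketch of the inner loop for one sample size (the trial count is reduced here for speed; the exercise asks for 1,000 trials per size):

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(42)

def ks_false_positive_rate(n, trials=200, alpha=0.05):
    """Fraction of trials where KS rejects H0 for two same-distribution samples."""
    rejections = sum(
        ks_2samp(rng.normal(0, 1, n), rng.normal(0, 1, n)).pvalue < alpha
        for _ in range(trials)
    )
    return rejections / trials

fpr = ks_false_positive_rate(1_000)
```

For part (c), replace one of the two samples with rng.normal(0.01, 1, n) and watch the rejection rate climb with n.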

b) The false positive rate should be approximately 5% (the alpha level) if the test is properly calibrated. What do you observe? Does the false positive rate increase, decrease, or stay constant with sample size?

c) Now introduce a tiny shift: the production distribution is N(0.01, 1) --- a shift of 0.01 standard deviations. Repeat the experiment. At what sample size does the KS test reliably detect this trivially small shift?

d) Explain why PSI is often preferred over the KS test for production drift monitoring, using your results from parts (a) through (c) as evidence.


Exercise 6: Retraining Strategy Design (Conceptual)

You are the ML engineer for each of the following systems. Design a retraining strategy for each one. Specify: (1) scheduled vs. triggered vs. hybrid, (2) the retraining frequency or trigger conditions, (3) the deployment strategy (shadow, canary, or direct), and (4) your rationale.

a) Ad click-through rate model. Labels arrive within seconds. Training data volume: 50 million events per day. Model training time: 2 hours on a GPU cluster. The ad marketplace is highly competitive, and staleness directly impacts revenue.

b) Loan default prediction model. Labels arrive after 6--24 months. Training data volume: 100,000 loans per year. Model training time: 15 minutes on a laptop. The model is used for regulatory-compliant lending decisions, and every model change requires compliance review.

c) Manufacturing quality control model (TurbineTech). Labels arrive after 30 days (the turbine either fails or does not). The factory has strong seasonal patterns (summer vs. winter). Training data volume: 5,000 turbine-months per year. Model training time: 45 minutes.

d) Real-time fraud detection model. Labels arrive after 1--14 days (investigation completion). Training data is highly imbalanced (0.1% fraud). New fraud patterns emerge every few weeks as attackers adapt. Model training time: 4 hours.


Exercise 7: Shadow Deployment Comparison (Code)

Implement a shadow deployment comparison for StreamFlow's churn model.

a) Train two models on the same StreamFlow data:

  • Champion: RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42)
  • Challenger: GradientBoostingClassifier(n_estimators=300, learning_rate=0.05, max_depth=5, random_state=42)
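A sketch of part (a) on synthetic stand-in data (the real exercise uses the StreamFlow features; make_classification here is only a placeholder for them):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Imbalanced binary problem as a stand-in for churn data.
X, y = make_classification(n_samples=10_000, n_features=7, weights=[0.85],
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=42)

champion = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=42)
challenger = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05,
                                        max_depth=5, random_state=42)
champion.fit(X_train, y_train)
challenger.fit(X_train, y_train)

auc_champion = roc_auc_score(y_test, champion.predict_proba(X_test)[:, 1])
auc_challenger = roc_auc_score(y_test, challenger.predict_proba(X_test)[:, 1])
```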

b) Use the ShadowDeployment class from this chapter to run both models on 10 batches of 500 production samples each. Log the comparison metrics for each batch.

c) After all 10 batches, call evaluate_challenger with known labels. Which model has higher AUC? By how much?

d) Based on the results, would you promote the challenger to production? What additional information would you want before making that decision?


Exercise 8: Progressive Project M11 --- StreamFlow Monitoring

This exercise extends the StreamFlow progressive project. You will add monitoring to the deployed churn model from M10.

a) Compute weekly PSI. Using the reference data from M10 (or the training data from earlier milestones), compute PSI for all features on each week's production data. Store results in a DataFrame with columns week, feature, psi.

b) Track prediction distribution. For each week, compute the mean, median, and standard deviation of the model's predicted probabilities. Plot these over time.

c) Set alert thresholds. Define alert rules using the AlertRule class:

  • Warning: any feature PSI > 0.10
  • Warning: prediction mean shift > 0.03
  • Critical: 2+ features with PSI > 0.25
  • Critical: AUC < 0.78 (when labels are available)

d) Simulate drift. Create a function that gradually shifts the sessions_last_30d distribution over 8 weeks (from Poisson(14) to Poisson(24)). Run your monitoring pipeline on this simulated drift. At which week does each alert threshold trigger?
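One way to implement the drift function for part (d) is to interpolate the Poisson mean linearly across the 8 weeks (a sketch; the helper name and defaults are illustrative):

```python
import numpy as np

def drifted_sessions(week, n=5_000, lam_start=14.0, lam_end=24.0, n_weeks=8,
                     seed=None):
    """Sessions drawn from a Poisson whose mean ramps linearly over n_weeks."""
    rng = np.random.default_rng(seed if seed is not None else week)
    frac = min(max(week, 0), n_weeks) / n_weeks   # 0.0 at week 0, 1.0 at week 8+
    lam = lam_start + frac * (lam_end - lam_start)
    return rng.poisson(lam, n)
```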

e) Write a monitoring report. In 200--300 words, describe what you observed, when alerts fired, and what retraining action you would recommend. Include the PSI trend plot.


These exercises support Chapter 32: Monitoring Models in Production. Return to the chapter for reference.