Quiz: Chapter 32
Monitoring Models in Production
Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.
Question 1 (Multiple Choice)
A model's PSI for the feature sessions_last_30d is 0.18. According to standard PSI thresholds, what action should you take?
- A) No action needed; the model is stable
- B) Investigate the drift; retraining may be needed
- C) Immediately retrain the model
- D) Remove the feature from the model
Answer: B) Investigate the drift; retraining may be needed. A PSI between 0.10 and 0.25 falls in the "investigate" zone. The distribution has shifted enough to warrant attention but not enough to automatically trigger retraining. You should examine why the feature shifted (product change, seasonal effect, data pipeline issue) and assess whether the shift is affecting model performance before deciding on retraining.
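The threshold zones described above can be captured in a small helper. This is a sketch using the common convention (0.10 and 0.25 cutoffs); the function name and zone labels are illustrative, not a standard API.

```python
# Hypothetical helper mapping a PSI value to the conventional action zones.
# The 0.10 / 0.25 cutoffs are the widely used convention, not a formal standard.

def psi_action(psi: float) -> str:
    """Map a PSI value to a monitoring action."""
    if psi < 0.10:
        return "stable"        # no action needed
    elif psi < 0.25:
        return "investigate"   # shift worth examining; retrain only if it hurts performance
    else:
        return "significant"   # strong shift; retraining likely needed

print(psi_action(0.18))  # → investigate
```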
Question 2 (Multiple Choice)
What is the fundamental difference between data drift and concept drift?
- A) Data drift requires labels to detect; concept drift does not
- B) Data drift is a change in input distributions; concept drift is a change in the input-output relationship
- C) Data drift is gradual; concept drift is always sudden
- D) Data drift affects classification models; concept drift affects regression models
Answer: B) Data drift is a change in input distributions; concept drift is a change in the input-output relationship. Data drift means $P(X)$ has changed but $P(Y|X)$ may be the same. Concept drift means $P(Y|X)$ itself has changed, even if the input distributions look unchanged. Concept drift is harder to detect because the inputs appear normal --- you need ground truth labels to confirm that the mapping from features to outcomes has shifted.
Question 3 (Short Answer)
Explain why a model with a high AUC on the training holdout set can still produce poor business outcomes in production three months later, even if the AUC on production data has only dropped slightly (e.g., from 0.87 to 0.84).
Answer: AUC measures discrimination (the model's ability to rank positive cases above negative cases) but says nothing about calibration or the decision threshold. A small drop in AUC can mask a large shift in the prediction distribution. If the model's predicted probabilities are no longer calibrated --- for instance, predicting 0.12 for customers who now have a 0.19 actual churn rate --- then business rules built on specific probability thresholds (e.g., "contact customers above 0.20") will systematically miss at-risk customers. The business experiences this as "the model is not finding churners anymore," even though the AUC looks almost unchanged.
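A toy numeric sketch of this failure mode, with entirely hypothetical score bands: the model's ranking is preserved, but the observed churn rates have drifted upward, so a fixed 0.20 contact threshold silently misses a band whose true risk now exceeds it.

```python
# Illustrative numbers only: predicted probabilities keep their rank order,
# but actual churn rates per score band have shifted upward.

predictions = [0.12, 0.18, 0.25, 0.40]   # model's predicted churn probabilities
actual_rate = [0.19, 0.28, 0.37, 0.55]   # observed churn rates for those bands

threshold = 0.20
contacted = [p for p in predictions if p > threshold]
missed = [(p, a) for p, a in zip(predictions, actual_rate)
          if p <= threshold and a > threshold]

print(contacted)  # only the top two bands are acted on
print(missed)     # the 0.18 band now has a true churn rate of 0.28 and is missed
```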
Question 4 (Multiple Choice)
Which drift detection method does not require ground truth labels from production?
- A) Tracking AUC on production data over time
- B) Computing PSI on input features
- C) Monitoring precision and recall weekly
- D) Comparing predicted label rates to actual label rates
Answer: B) Computing PSI on input features. PSI compares the distribution of input features between training and production data. It requires only the feature values, not the labels. Options A, C, and D all require ground truth labels to compute. This makes PSI (and the KS test for continuous features, and the chi-squared test for categorical features) especially valuable when labels are delayed.
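A minimal, label-free PSI computation might look like the sketch below. It uses equal-width bins from the reference distribution for brevity (quantile-based edges are usually preferred in practice), and the small floor constant is an implementation choice to avoid taking the log of zero.

```python
import numpy as np

# Minimal PSI sketch: compare a production feature sample against the
# training reference. Note that no labels appear anywhere.

def psi(reference, production, n_bins=10):
    # Bin edges derived from the reference distribution.
    edges = np.histogram_bin_edges(reference, bins=n_bins)
    ref_counts = np.histogram(reference, bins=edges)[0].astype(float)
    prod_counts = np.histogram(production, bins=edges)[0].astype(float)
    # Convert to proportions, flooring at a small value to avoid log(0).
    ref_pct = np.clip(ref_counts / ref_counts.sum(), 1e-6, None)
    prod_pct = np.clip(prod_counts / prod_counts.sum(), 1e-6, None)
    return float(np.sum((prod_pct - ref_pct) * np.log(prod_pct / ref_pct)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 10_000)
prod_same = rng.normal(0.0, 1.0, 10_000)
prod_shifted = rng.normal(0.5, 1.0, 10_000)
print(psi(train, prod_same))     # small, near zero
print(psi(train, prod_shifted))  # noticeably larger
```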
Question 5 (Multiple Choice)
A model was trained with a 12% positive class rate. In production, the positive class rate has increased to 19%. Which type of drift does this primarily represent?
- A) Data drift (covariate shift)
- B) Concept drift
- C) Prior probability shift
- D) Feature engineering failure
Answer: C) Prior probability shift. The base rate of the outcome has changed ($P(Y)$ is different), which means the model's calibrated probabilities are no longer aligned with reality. A predicted probability of 0.15 used to be slightly above the base rate and is now below it. The model may still rank-order customers correctly (preserved discrimination), but the absolute probabilities and any threshold-based decision rules are now miscalibrated.
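One common remedy for prior probability shift, sketched below under the assumption that $P(X|Y)$ is unchanged, is to rescale the predicted probabilities from the old base rate to the new one. The rescaling is monotonic, so rank order is preserved while the absolute probabilities are realigned.

```python
# Sketch of a standard base-rate (prior) correction. Assumes the
# class-conditional distributions P(X|Y) did not change, only P(Y).

def adjust_for_prior_shift(p: float, old_rate: float, new_rate: float) -> float:
    """Rescale probability p, calibrated under old_rate, to new_rate."""
    num = p * (new_rate / old_rate)
    den = num + (1 - p) * ((1 - new_rate) / (1 - old_rate))
    return num / den

# A prediction of 0.15 under a 12% base rate maps to roughly 0.23 under 19%.
p_adj = adjust_for_prior_shift(0.15, old_rate=0.12, new_rate=0.19)
print(round(p_adj, 3))
```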
Question 6 (Short Answer)
A colleague suggests running the KS test on every feature with 100,000 production samples per week and alerting whenever the p-value is below 0.05. Explain two problems with this approach.
Answer: First, the KS test is extremely sensitive at large sample sizes. With 100,000 samples, even a trivially small distributional difference that has no practical impact on model performance will produce a statistically significant p-value. This leads to constant false alarms. Second, testing multiple features simultaneously without correction inflates the overall false positive rate --- with 20 features at alpha = 0.05, you expect about one false positive per check by chance alone, and there is a roughly 64% probability of at least one. PSI is often preferred because its thresholds are calibrated for practical significance (business impact) rather than statistical significance.
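Both problems can be demonstrated directly. The sketch below uses `scipy.stats.ks_2samp` on a deliberately tiny mean shift (0.05 standard deviations, a magnitude chosen here purely for illustration) and then computes the family-wise false alarm rate for 20 uncorrected tests.

```python
import numpy as np
from scipy import stats

# Problem 1: at n = 100,000 per side, the two-sample KS test flags even a
# practically negligible 0.05-sigma mean shift as highly significant.
rng = np.random.default_rng(42)
reference = rng.normal(0.00, 1.0, 100_000)
production = rng.normal(0.05, 1.0, 100_000)

stat, p_value = stats.ks_2samp(reference, production)
print(f"KS statistic = {stat:.4f}, p-value = {p_value:.2e}")  # p far below 0.05

# Problem 2: family-wise false positive rate with 20 uncorrected tests.
fwer = 1 - (1 - 0.05) ** 20
print(f"P(at least one false alarm across 20 features) = {fwer:.2f}")  # ~0.64
```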
Question 7 (Multiple Choice)
In a shadow deployment, what happens to the challenger model's predictions?
- A) They are served to a random subset of users
- B) They replace the champion model's predictions for all users
- C) They are computed and logged but not served to any users
- D) They are averaged with the champion model's predictions
Answer: C) They are computed and logged but not served to any users. In a shadow deployment, the champion model continues to serve all production traffic. The challenger model receives the same inputs and generates predictions, but those predictions are only logged for comparison purposes. This allows you to evaluate the challenger's real-world performance without any risk to users or business outcomes.
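The serving logic can be sketched in a few lines. The `predict` interface and logger name below are hypothetical placeholders, not a real framework API; the point is only that the challenger's output is logged, never returned.

```python
import logging

logger = logging.getLogger("shadow")

# Hypothetical serving function: the champion's prediction is returned to
# the caller; the challenger's is only recorded for offline comparison.

def serve(features, champion, challenger):
    served = champion.predict(features)    # this is what users receive
    shadow = challenger.predict(features)  # same inputs, computed in parallel
    logger.info("shadow_compare champion=%s challenger=%s", served, shadow)
    return served  # the challenger's output never reaches the user
```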
Question 8 (Multiple Choice)
TurbineTech's predictive maintenance model is trained on summer data. In winter, ambient temperature drops cause a 0.4 mm/s reduction in baseline vibration readings. The failure threshold has not changed. What is the most appropriate response?
- A) Retrain the model on winter data only
- B) Retrain the model on data that includes both summer and winter conditions
- C) Add a temperature normalization step to the feature pipeline
- D) Both B and C are reasonable approaches
Answer: D) Both B and C are reasonable approaches. Retraining on data that spans both seasons ensures the model learns the seasonal pattern. Adding temperature normalization removes the seasonal effect from the features, making the model more robust to future temperature-related shifts. In practice, the best approach is often both: normalize the features to reduce unnecessary variation, and ensure the training data covers the full range of operating conditions.
Question 9 (Short Answer)
Describe the difference between scheduled retraining and triggered retraining. When would you choose each, and when would you use both?
Answer: Scheduled retraining runs on a fixed cadence (daily, weekly, monthly) regardless of monitoring signals. Triggered retraining fires only when a monitoring metric crosses a threshold (e.g., PSI > 0.25 or AUC < 0.80). Use scheduled retraining when labels arrive quickly and retraining is cheap --- it guarantees a maximum staleness window. Use triggered retraining when retraining is expensive or labels are delayed. Most production systems use a hybrid: scheduled retraining provides a baseline guarantee, and triggered retraining responds to sudden drift events between scheduled windows.
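A hybrid policy of this kind reduces to a short predicate. The threshold values below (30-day maximum age, PSI > 0.25, AUC < 0.80) are illustrative defaults taken from the examples in this quiz, not universal settings.

```python
from datetime import datetime, timedelta

# Sketch of a hybrid retraining policy: retrain on a fixed schedule, or
# sooner if a monitoring metric crosses its limit. Thresholds are illustrative.

def should_retrain(last_trained: datetime, now: datetime,
                   psi: float, auc: float,
                   max_age_days: int = 30,
                   psi_limit: float = 0.25, auc_floor: float = 0.80) -> bool:
    scheduled = (now - last_trained) >= timedelta(days=max_age_days)  # staleness guarantee
    triggered = psi > psi_limit or auc < auc_floor                    # drift response
    return scheduled or triggered
```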
Question 10 (Multiple Choice)
Which of the following is the strongest signal that concept drift (not just data drift) has occurred?
- A) PSI for three features exceeds 0.25
- B) The KS test rejects the null for the most important feature
- C) The model's AUC on newly labeled production data drops from 0.87 to 0.72 while input feature distributions remain stable
- D) The mean predicted probability shifts from 0.12 to 0.18
Answer: C) The model's AUC on newly labeled production data drops from 0.87 to 0.72 while input feature distributions remain stable. This is the hallmark of concept drift: the inputs look the same ($P(X)$ unchanged), but the relationship between inputs and outputs has changed ($P(Y|X)$ different). Options A and B indicate data drift (input distribution changes). Option D could indicate either data drift or concept drift but is not definitive on its own.
Question 11 (Short Answer)
Your monitoring pipeline fires a critical alert: three features have PSI > 0.25, and the prediction distribution PSI is 0.31. However, you do not have ground truth labels yet (they arrive with a 90-day delay). Describe the steps you would take before triggering a retrain.
Answer: First, investigate the root cause of the drift: check whether a product change, data pipeline update, or external event (holiday, competitor action) explains the feature shifts. Second, check if the drift is in features that the model weights heavily --- drift in an unimportant feature is less concerning. Third, examine whether the drift is temporary (a one-week spike) or persistent (trending over multiple weeks). If the drift is caused by a data pipeline bug, fix the pipeline rather than retraining. If it is a genuine, persistent behavioral shift in important features, trigger a retrain using the most recent available labeled data (even if it is 90 days old) and deploy the retrained model as a shadow to validate.
Question 12 (Multiple Choice)
What is the primary purpose of alert threshold hysteresis (using separate arm and disarm thresholds)?
- A) To make the monitoring system run faster
- B) To prevent alert flapping when a metric oscillates around the threshold
- C) To ensure only critical alerts are sent
- D) To reduce the computational cost of PSI calculation
Answer: B) To prevent alert flapping when a metric oscillates around the threshold. If the alert triggers at PSI = 0.25 and resolves at PSI = 0.25, a metric that oscillates between 0.24 and 0.26 will trigger and resolve repeatedly, creating alert fatigue. Using separate thresholds --- trigger at 0.25, resolve at 0.20 --- means the alert fires once and stays active until the metric drops meaningfully below the threshold, reducing noise.
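The arm/disarm behavior is a two-state machine. The sketch below uses the thresholds from the answer above (arm at 0.25, disarm at 0.20); the class and method names are illustrative.

```python
# Sketch of alert hysteresis: fire at 0.25, resolve only below 0.20, so a
# metric oscillating between 0.24 and 0.26 does not flap.

class DriftAlert:
    def __init__(self, arm_at: float = 0.25, disarm_at: float = 0.20):
        self.arm_at, self.disarm_at = arm_at, disarm_at
        self.active = False

    def update(self, psi: float) -> bool:
        if not self.active and psi >= self.arm_at:
            self.active = True       # fire once
        elif self.active and psi < self.disarm_at:
            self.active = False      # resolve only on a meaningful drop
        return self.active

alert = DriftAlert()
readings = [0.24, 0.26, 0.24, 0.26, 0.19]
print([alert.update(r) for r in readings])  # [False, True, True, True, False]
```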
Question 13 (Multiple Choice)
A data scientist argues: "We should retrain our model every day because more recent data is always better." Which of the following is the strongest counterargument?
- A) Daily retraining is computationally expensive
- B) Frequent retraining can introduce instability if daily data has high variance or label noise
- C) Daily retraining violates the PSI threshold framework
- D) Models should only be retrained when the business requests it
Answer: B) Frequent retraining can introduce instability if daily data has high variance or label noise. If yesterday's data happens to be anomalous (a holiday, a system outage, a data collection error), the retrained model may overfit to an unrepresentative sample. Each retraining cycle also risks introducing subtle bugs through data pipeline changes that have not been fully validated. The correct retraining frequency balances freshness against stability, and that balance depends on the domain, the label delay, and the cost of a bad model update.
Question 14 (Short Answer)
Explain what a canary release is in the context of model deployment and how it differs from a shadow deployment.
Answer: A canary release routes a small percentage of live production traffic (e.g., 5%) to the new model, with the remaining traffic still served by the existing model. Users in the canary group receive predictions from the new model, which means it affects real outcomes. A shadow deployment, by contrast, runs the new model on all traffic but does not serve its predictions --- only the existing model's predictions are used. The key difference is risk: canary releases carry some risk (the canary group experiences the new model), while shadow deployments carry none (the new model's predictions are never served).
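Canary routing is often implemented with a stable hash of the user ID, so the same user consistently sees the same model across requests. The sketch below is one common pattern, not a prescribed implementation; the 5% split matches the example above.

```python
import hashlib

# Sketch of deterministic canary routing: a stable hash of the user ID
# assigns ~5% of users to the challenger, and each user always gets the
# same assignment.

def route(user_id: str, canary_pct: int = 5) -> str:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "challenger" if bucket < canary_pct else "champion"

print(route("user-12345"))
```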
Question 15 (Multiple Choice)
When computing PSI, why are quantile-based bins from the reference distribution preferred over equal-width bins?
- A) Quantile bins are faster to compute
- B) Quantile bins ensure each bin has approximately equal representation in the reference data, making the comparison sensitive across the entire distribution
- C) Equal-width bins always produce higher PSI values
- D) Quantile bins are required by the PSI mathematical formula
Answer: B) Quantile bins ensure each bin has approximately equal representation in the reference data, making the comparison sensitive across the entire distribution. With equal-width bins, bins in the tails of the distribution may contain very few observations, making the PSI estimate noisy and unreliable in those regions. Quantile-based bins guarantee that each bin contributes equally to the reference distribution, so drift anywhere in the feature range --- center or tails --- is detected with equal sensitivity.
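The contrast is easy to see on a skewed feature. In the sketch below (synthetic lognormal data), equal-width bins pile nearly all of the reference mass into the first bins and leave the tail bins almost empty, while decile edges put roughly 10% of the reference in every bin.

```python
import numpy as np

# Equal-width vs. quantile bins on a right-skewed reference distribution.
rng = np.random.default_rng(1)
reference = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)

equal_width = np.histogram(reference, bins=10)[0]
quantile_edges = np.quantile(reference, np.linspace(0, 1, 11))
quantile = np.histogram(reference, bins=quantile_edges)[0]

print("equal-width counts:", equal_width)  # mass piles into the first bins
print("quantile counts:   ", quantile)     # ~1,000 observations per bin
```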
These quiz questions support Chapter 32: Monitoring Models in Production. Return to the chapter for review.