Key Takeaways: Chapter 32

Monitoring Models in Production


  1. Every model you deploy starts dying the moment it hits production. The world that generated your training data is not the world your model will encounter next month. Customer behavior shifts. Sensor calibrations drift. Product redesigns change feature distributions. Seasonal effects introduce cyclical variation. The question is never whether your model will degrade --- it is when, how fast, and whether you will notice before the business does.

  2. Data drift and concept drift are different problems with different solutions. Data drift is a change in input distributions ($P(X)$ shifts); detect it with PSI, KS tests, and chi-squared tests on your features. No labels required. Concept drift is a change in the input-output relationship ($P(Y|X)$ shifts); detect it by tracking model performance on labeled production data. Labels required. Prior probability shift ($P(Y)$ changes) is a third category that affects calibration even when the model's discrimination is intact.
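As a rough sketch of the chi-squared check mentioned above: the test compares category counts for a categorical feature between training time and a recent production window. The plan-mix counts below are hypothetical, and the helper is written with the standard library only (in practice you would likely use `scipy.stats.chi2_contingency`).

```python
# Sketch: chi-squared test of homogeneity for one categorical feature.
# The counts below are hypothetical illustration data, not from the chapter.

def chi_squared_stat(train_counts, prod_counts):
    """Chi-squared statistic comparing two sets of category counts."""
    n_train, n_prod = sum(train_counts), sum(prod_counts)
    total = n_train + n_prod
    stat = 0.0
    for t, p in zip(train_counts, prod_counts):
        col = t + p  # total for this category across both samples
        for observed, n in ((t, n_train), (p, n_prod)):
            expected = col * n / total
            stat += (observed - expected) ** 2 / expected
    return stat

train = [5200, 3100, 1700]   # e.g. basic / premium / enterprise at training time
prod = [4100, 3600, 2300]    # same categories in a recent production window

stat = chi_squared_stat(train, prod)
# Critical value for dof = (2-1)*(3-1) = 2 at alpha = 0.01 is about 9.21.
print(f"chi2 = {stat:.1f}, drifted = {stat > 9.21}")
```

No labels are needed here: the test looks only at the input feature's category mix, which is exactly what makes it a data drift (not concept drift) detector.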

  3. PSI is the industry-standard drift detection metric, and its thresholds are a practical starting point. PSI < 0.10 means stable (no action needed). PSI between 0.10 and 0.25 means investigate (something changed --- determine if it matters). PSI > 0.25 means retrain (the distribution has shifted enough to compromise predictions). These thresholds are conventions, not physical constants. Adjust them based on the cost of missed drift in your domain.
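A minimal PSI sketch, with the threshold bands above wired in. The bin proportions are hypothetical; in practice you would bin a numeric feature into roughly ten quantile bins fixed at training time and compare the production shares against the training shares.

```python
# Sketch: Population Stability Index between training ("expected") and
# production ("actual") bin shares. Proportions below are illustrative.
import math

def psi(expected_props, actual_props, eps=1e-4):
    """PSI = sum over bins of (actual - expected) * ln(actual / expected)."""
    total = 0.0
    for e, a in zip(expected_props, actual_props):
        e, a = max(e, eps), max(a, eps)  # guard against empty bins
        total += (a - e) * math.log(a / e)
    return total

expected = [0.10, 0.20, 0.40, 0.20, 0.10]  # training-time bin shares
actual = [0.05, 0.15, 0.35, 0.25, 0.20]    # production bin shares

value = psi(expected, actual)
status = ("stable" if value < 0.10
          else "investigate" if value <= 0.25
          else "retrain")
print(f"PSI = {value:.3f} -> {status}")
```

The `eps` floor is worth noting: a production bin that goes completely empty would otherwise send the logarithm to infinity, which is usually a data pipeline problem rather than drift.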

  4. The KS test is statistically rigorous but operationally dangerous at scale. With large production sample sizes (tens of thousands of observations per week), the KS test will flag trivially small distributional differences as statistically significant. Statistical significance is not practical significance. Use PSI for primary alerting (calibrated for business impact) and the KS test as a secondary diagnostic tool.
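The scale problem can be made concrete with the approximate critical value for the two-sample KS statistic, $c(\alpha)\sqrt{(n+m)/(nm)}$ with $c(0.05) \approx 1.358$. The helper below is an illustrative sketch of that formula, not production code:

```python
# Sketch: why the KS test over-alerts at production sample sizes.
# Approximate two-sample critical value at alpha = 0.05.
import math

def ks_critical(n, m, c_alpha=1.358):
    """Reject equality of distributions if the KS statistic D exceeds this."""
    return c_alpha * math.sqrt((n + m) / (n * m))

for n in (500, 50_000, 5_000_000):
    print(f"n = m = {n:>9,}: reject if D > {ks_critical(n, n):.5f}")
```

At five hundred samples per side the test needs a KS statistic near 0.086 to fire; at five million per side, anything above roughly 0.0009 is "significant", so a distributional difference far too small to matter in practice will trip the alarm. That is the gap between statistical and practical significance.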

  5. Monitor the prediction distribution, not just the input features. If your model was well-calibrated at deployment time, a shift in the distribution of predicted probabilities --- without a corresponding known change in inputs --- is a strong signal that something has changed. Track the mean, median, and standard deviation of predictions over time. Compute PSI on the prediction distribution itself.
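A small sketch of the summary-statistic tracking described above. The weekly score lists are hypothetical; in practice each list would be the model's predicted probabilities for one monitoring window.

```python
# Sketch: summary statistics of the prediction distribution per window.
# The score lists below are hypothetical illustration data.
import statistics

def summarize(scores):
    return {
        "mean": statistics.mean(scores),
        "median": statistics.median(scores),
        "stdev": statistics.stdev(scores),
    }

baseline = summarize([0.05, 0.10, 0.12, 0.20, 0.35, 0.40, 0.55, 0.70])
this_week = summarize([0.02, 0.03, 0.05, 0.06, 0.08, 0.10, 0.12, 0.15])

# A large drop in the mean score with no known input change is a red flag:
# here the model has suddenly stopped scoring anyone as high risk.
drift = abs(this_week["mean"] - baseline["mean"])
print(f"mean shift = {drift:.4f}")
```

The same PSI calculation used for features applies unchanged to these scores: treat the deployment-time prediction distribution as "expected" and each window's predictions as "actual".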

  6. The label delay problem is the central challenge of concept drift detection. If you are predicting 90-day churn, you will not know whether today's predictions were correct for 90 days. During that delay, you depend on data drift detection (no labels needed) and prediction distribution monitoring (no labels needed) as indirect signals. When labels arrive quickly (ad clicks, real-time fraud), you can monitor concept drift directly.

  7. Not all drift requires retraining --- diagnose before you react. Seasonal drift is predictable and can be addressed with feature engineering (temperature normalization). Sensor calibration drift is a data quality issue that should be fixed in the data pipeline. Product-launch drift may be temporary or permanent. A data pipeline bug that corrupts feature values is not model drift --- it is a bug. Always ask why the drift occurred before deciding how to respond.

  8. Three retraining strategies: scheduled, triggered, and hybrid. Scheduled retraining runs on a fixed cadence and guarantees a maximum staleness window. Triggered retraining fires only when monitoring signals cross a threshold. Hybrid (recommended for most systems) combines both: scheduled retraining as a baseline with triggered retraining for sudden drift events. The right choice depends on label delay, retraining cost, and your tolerance for model staleness.
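The hybrid policy reduces to a simple disjunction, sketched below with hypothetical parameter choices (a 90-day staleness cap and the PSI retrain threshold from earlier in the chapter):

```python
# Sketch of a hybrid retraining decision: retrain on a fixed cadence
# OR when drift crosses a trigger. Thresholds here are illustrative.
from datetime import date, timedelta

MAX_STALENESS = timedelta(days=90)  # scheduled baseline
PSI_TRIGGER = 0.25                  # triggered path

def should_retrain(last_trained: date, today: date,
                   worst_feature_psi: float) -> bool:
    stale = today - last_trained >= MAX_STALENESS
    drifted = worst_feature_psi > PSI_TRIGGER
    return stale or drifted

print(should_retrain(date(2024, 1, 1), date(2024, 2, 1), 0.08))  # False: fresh, stable
print(should_retrain(date(2024, 1, 1), date(2024, 2, 1), 0.31))  # True: drift trigger
print(should_retrain(date(2024, 5, 1), date(2024, 8, 15), 0.08)) # True: staleness cap
```

The two branches encode the two failure modes: the staleness cap bounds slow, undetected degradation, while the PSI trigger catches sudden drift between scheduled runs.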

  9. Safe deployment of retrained models is non-negotiable. Shadow deployment runs the new model alongside the existing one without serving its predictions --- zero risk, full comparison. Canary releases route a small percentage of traffic to the new model. A/B tests measure business outcomes, not just ML metrics. Always validate a retrained model in production before promoting it to serve all traffic.
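Shadow deployment can be sketched in a few lines. The `current_model` and `candidate_model` objects below are hypothetical stand-ins with a `.predict(features)` method; the key properties are that the candidate's output is logged but never served, and that a candidate failure cannot affect serving.

```python
# Sketch: shadow deployment. Hypothetical model objects; the only
# contract assumed is a .predict(features) method returning a score.
import logging

logger = logging.getLogger("shadow")

def serve(features, current_model, candidate_model):
    served = current_model.predict(features)        # what the caller gets
    try:
        shadow = candidate_model.predict(features)  # logged, never served
        logger.info("shadow_diff=%.4f", abs(served - shadow))
    except Exception:
        logger.exception("shadow model failed")     # must not break serving
    return served

class Stub:  # illustrative stand-in for a real model
    def __init__(self, score):
        self.score = score
    def predict(self, features):
        return self.score

print(serve({"tenure_months": 12}, Stub(0.8), Stub(0.6)))  # -> 0.8
```

In a real system the shadow call would usually run asynchronously so the candidate's latency is also invisible to callers; the logged diffs are what you analyze before promoting.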

  10. Alerting systems fail through noise, not silence. If your monitoring fires alerts that nobody reads, you do not have a monitoring system --- you have a log file. Design alerts with severity levels (warning vs. critical), threshold hysteresis (to prevent flapping), aggregation (alert on count of drifted features, not individual features), and context (include the metric value, the threshold, and what to do about it).
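Threshold hysteresis is the least familiar of these ideas, so here is a minimal sketch (thresholds are illustrative): the alert fires when PSI crosses 0.25 but only clears once PSI falls back below 0.20, so a value hovering around the trigger does not flap on every monitoring run.

```python
# Sketch: alert state machine with hysteresis. Fire above 0.25,
# clear only below 0.20. Thresholds here are illustrative.
def update_alert(alerting: bool, psi: float,
                 fire_at: float = 0.25, clear_at: float = 0.20) -> bool:
    if not alerting:
        return psi > fire_at
    return psi >= clear_at  # stay alerting until clearly back in band

state = False
for psi in [0.24, 0.26, 0.24, 0.26, 0.19]:
    state = update_alert(state, psi)
    print(f"psi={psi:.2f} alerting={state}")
```

With a single threshold this sequence would toggle the alert four times; with hysteresis it fires once at 0.26, holds through the dip to 0.24, and clears only at 0.19, which is exactly the flapping behavior the takeaway warns against.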


If You Remember One Thing

Monitor your models or your stakeholders will monitor them for you --- and they will not be as polite about it. A monitoring pipeline that computes PSI weekly, tracks prediction distributions, and fires alerts when thresholds are breached costs a few hours to set up and runs automatically. The alternative is finding out your model has degraded when the VP of Customer Success asks why the "high-risk churn" list is empty, or when the third turbine bearing failure in two months triggers an emergency meeting. Detection is cheap. Surprise is expensive.


These takeaways summarize Chapter 32: Monitoring Models in Production. Return to the chapter for full context.