Further Reading: Chapter 32
Monitoring Models in Production
Official Documentation and Tools
1. Evidently AI --- evidentlyai.com
An open-source Python library purpose-built for ML monitoring. Evidently computes data drift, prediction drift, and model quality reports with a single function call. It generates interactive HTML dashboards and integrates with monitoring platforms (Grafana, Prometheus, Airflow). The DataDriftPreset and TargetDriftPreset are the fastest way to get a production-quality drift report. Apache 2.0 licensed. Start with the "Quickstart" guide and the "Data Drift" tutorial.
2. NannyML Documentation --- nannyml.readthedocs.io
NannyML specializes in monitoring model performance without ground truth labels --- exactly the label-delay problem discussed in this chapter. It uses confidence-based performance estimation (CBPE) and direct loss estimation (DLE) to estimate AUC, F1, and other metrics before labels arrive. The library also includes univariate and multivariate drift detection. Open-source (Apache 2.0). Particularly useful for domains with long label delays (loan default, churn).
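The central trick behind CBPE can be illustrated without the library: for a well-calibrated binary classifier, a prediction with score p is correct with probability max(p, 1 − p), so expected accuracy is computable from scores alone, before any labels arrive. A simplified pure-Python sketch (this is the idea only, not NannyML's API; the real implementation also calibrates scores and estimates richer metrics such as AUC):

```python
def estimated_accuracy(probs):
    """Confidence-based performance estimation, simplified.

    For a *calibrated* binary classifier, each prediction with score p
    is correct with probability max(p, 1 - p), so the expected accuracy
    on unlabeled data is the mean of that quantity.
    """
    return sum(max(p, 1.0 - p) for p in probs) / len(probs)

# Confident, calibrated scores imply high expected accuracy ...
print(estimated_accuracy([0.95, 0.05, 0.90, 0.10]))  # 0.925
# ... while scores drifting toward 0.5 imply accuracy near chance,
# an early warning that arrives long before the labels do.
print(estimated_accuracy([0.55, 0.45, 0.60, 0.40]))  # 0.575
```

The calibration assumption is doing all the work here: if production scores are miscalibrated, the estimate is biased, which is why NannyML calibrates on reference data first.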
3. WhyLabs / whylogs --- whylabs.ai
An open-source data logging library that generates lightweight statistical profiles of your data at each pipeline stage. The profiles can be compared over time to detect drift without storing raw data. Designed for high-volume production environments where storing full datasets for comparison is impractical. The "Data Profiling" and "Drift Detection" tutorials cover the core workflow.
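The profiling idea is easy to sketch without the library: reduce each batch to a constant-size statistical fingerprint, then compare fingerprints instead of raw rows. A pure-Python illustration of the concept (not the whylogs API, which tracks many more statistics, including approximate quantiles and cardinality):

```python
import math

def profile(values):
    """Constant-size statistical profile of one feature in one batch.

    Storing this instead of the raw data is what makes comparison
    cheap at high volume.
    """
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return {"count": n, "mean": mean, "std": math.sqrt(var),
            "min": min(values), "max": max(values)}

def mean_shift_zscore(ref, cur):
    """Compare two profiles: difference of means in reference std units."""
    return abs(cur["mean"] - ref["mean"]) / max(ref["std"], 1e-12)

ref = profile([10.0, 12.0, 11.0, 13.0, 9.0])
cur = profile([15.0, 17.0, 16.0, 18.0, 14.0])
print(round(mean_shift_zscore(ref, cur), 2))  # 3.54 -> clear mean shift
```

Two five-row batches are enough to show the mechanics; in production each profile would summarize millions of rows while staying the same size.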
4. Great Expectations --- greatexpectations.io
While primarily a data validation tool (not a drift detector), Great Expectations complements model monitoring by catching data quality issues before they become drift issues. Expectations like "column mean should be between X and Y" and "column values should be in set S" catch pipeline bugs, schema changes, and upstream data issues that masquerade as model drift. Open-source.
Foundational Concepts
5. "Failing Loudly: An Empirical Study of Methods for Detecting Dataset Shift" --- Rabanser et al. (2019)
A systematic comparison of drift detection methods, including the KS test, Maximum Mean Discrepancy (MMD), and classifier-based approaches. The authors evaluate each method on real-world dataset shift scenarios and provide practical guidance on when each method succeeds or fails. The key finding: univariate tests (like KS) are effective for individual features but miss multivariate interactions; dimensionality reduction followed by a multivariate test is more reliable. Available on arXiv (1810.11953).
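The univariate baseline Rabanser et al. evaluate takes only a few lines with scipy (assumed available here; the function name is illustrative). Note the Bonferroni correction, without which testing many features simultaneously inflates the false-alarm rate:

```python
import numpy as np
from scipy.stats import ks_2samp

def flag_drifted_features(reference, current, alpha=0.05):
    """Two-sample KS test per column, with a Bonferroni-corrected
    significance level. Returns the indices of flagged features."""
    k = reference.shape[1]
    flagged = []
    for j in range(k):
        res = ks_2samp(reference[:, j], current[:, j])
        if res.pvalue < alpha / k:
            flagged.append(j)
    return flagged

rng = np.random.default_rng(0)
reference = rng.normal(size=(1000, 3))
current = rng.normal(size=(1000, 3))
current[:, 1] += 0.5  # inject a mean shift into feature 1 only
print(flag_drifted_features(reference, current))  # feature 1 should be flagged
```

This is exactly the kind of check the paper shows can miss shifts that only appear in feature *interactions*, which is why it recommends pairing it with a multivariate test after dimensionality reduction.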
6. "A Survey on Concept Drift Adaptation" --- Gama et al. (2014)
The most comprehensive survey of concept drift detection and adaptation methods. Covers drift types (sudden, gradual, incremental, recurring), detection methods (DDM, EDDM, ADWIN, Page-Hinkley), and adaptation strategies (ensemble methods, sliding windows, instance weighting). Published in ACM Computing Surveys. Dense but authoritative --- the reference taxonomy used throughout the research community.
7. Designing Machine Learning Systems --- Chip Huyen (2022)
Chapter 8 ("Data Distribution Shifts and Monitoring") is the most accessible practitioner treatment of drift detection and monitoring. Huyen covers covariate shift, label shift, and concept drift with production examples, and provides a framework for designing monitoring systems. Chapter 9 ("Continual Learning and Test in Production") covers retraining strategies, shadow deployments, and A/B testing. O'Reilly.
8. "Hidden Technical Debt in Machine Learning Systems" --- Sculley et al. (Google, 2015)
The paper that identified monitoring as a critical but underinvested component of ML systems. Section 4 ("Feedback Loops") and Section 5 ("ML-System Anti-Patterns") are directly relevant to this chapter. The authors argue that the monitoring problem in ML is harder than in traditional software because model behavior degrades silently --- the system continues to produce outputs, but those outputs are quietly wrong. Published in NeurIPS 2015.
Drift Detection Methods
9. "Detecting and Correcting for Label Shift with Black Box Predictors" --- Lipton et al. (2018)
A practical method for detecting and correcting prior probability shift (when $P(Y)$ changes but $P(X|Y)$ does not). The authors provide a correction algorithm that adjusts model predictions to account for the shifted class balance. Relevant when your production churn rate, fraud rate, or failure rate differs from the training distribution. Published at ICML 2018. Available on arXiv (1802.03916).
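The estimation step of the paper's method (Black Box Shift Estimation) reduces to one linear solve: with $C[i,j] = P(\hat{y}=i, y=j)$ measured on labeled validation data and $\mu[i] = P(\hat{y}=i)$ measured on unlabeled production data, the importance weights $w = q(y)/p(y)$ satisfy $Cw = \mu$. A numpy sketch under those definitions (function name and synthetic data are illustrative):

```python
import numpy as np

def estimate_label_shift(y_val, yhat_val, yhat_target, n_classes):
    """Black Box Shift Estimation (Lipton et al., 2018), sketched.

    Solves C w = mu for w = q(y) / p(y), where C is the joint
    (prediction, label) distribution on validation data and mu is the
    prediction distribution on unlabeled target data.
    """
    C = np.zeros((n_classes, n_classes))
    for yh, y in zip(yhat_val, y_val):
        C[yh, y] += 1
    C /= len(y_val)
    mu = np.bincount(yhat_target, minlength=n_classes) / len(yhat_target)
    w = np.linalg.solve(C, mu)
    return np.clip(w, 0.0, None)  # importance weights must be non-negative

# Synthetic check: a 90%-accurate classifier, source prior [0.5, 0.5],
# true target prior [0.8, 0.2].
y_val = [0] * 500 + [1] * 500
yhat_val = [0] * 450 + [1] * 50 + [1] * 450 + [0] * 50
yhat_target = [0] * 740 + [1] * 260  # that classifier's outputs on target data
w = estimate_label_shift(y_val, yhat_val, yhat_target, 2)
print(w * np.array([0.5, 0.5]))  # recovers the target prior: [0.8 0.2]
```

The recovered prior (or the weights directly) can then be used to reweight predictions or retrain, which is the paper's correction step.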
10. Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring --- Siddiqi (2006)
The original credit scoring reference for the Population Stability Index. This Wiley book is the source for the PSI thresholds used throughout the industry: < 0.10 stable, 0.10--0.25 moderate shift, > 0.25 significant shift. The book is focused on credit risk, but the PSI methodology is domain-independent.
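PSI itself fits in a few lines. A pure-Python sketch over pre-binned proportions, using the thresholds above (binning is up to you; ten quantile bins computed on the training distribution is a common choice):

```python
import math

def psi(expected, actual, eps=1e-4):
    """Population Stability Index over pre-binned proportions.

    expected / actual are lists of bin proportions (each summing to 1);
    eps guards against log(0) when a bin is empty.
    """
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

def interpret(value):
    # Thresholds from Siddiqi (2006)
    if value < 0.10:
        return "stable"
    if value <= 0.25:
        return "moderate shift"
    return "significant shift"

train_bins = [0.10, 0.20, 0.40, 0.20, 0.10]
prod_bins = [0.05, 0.15, 0.35, 0.25, 0.20]
value = psi(train_bins, prod_bins)
print(round(value, 3), interpret(value))  # 0.136 moderate shift
```

Note that PSI is symmetric-ish but not a formal statistical test: unlike KS, it carries no p-value, which is exactly why the fixed industry thresholds above exist.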
11. "Learning from Time-Changing Data with Adaptive Windowing" --- Bifet and Gavaldà (2007)
ADWIN is an online drift detection algorithm that automatically adjusts its window size to detect distribution changes. Unlike fixed-window PSI, ADWIN adapts: it uses a small window when the data is stable (for efficiency) and expands when it detects a potential change (for confidence). Relevant for streaming data applications where batch-based PSI is too slow. Available in the river Python library.
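For production use, reach for river's implementation; the toy detector below is not ADWIN itself, only a sketch of its central idea (compare the means of an older and a newer sub-window, and drop stale data when they differ by more than a Hoeffding-style bound):

```python
import math

class SimpleAdaptiveWindow:
    """Toy change detector in the spirit of ADWIN (not the full algorithm).

    Grows a window of values; after each update it compares the means of
    the older and newer halves. If they differ by more than a
    Hoeffding-style bound, it reports a change and drops the older half,
    shrinking the window -- the adaptive behavior ADWIN formalizes over
    all possible split points, not just the midpoint.
    """
    def __init__(self, delta=0.002):
        self.delta = delta
        self.window = []

    def update(self, x):
        self.window.append(x)
        n = len(self.window)
        if n < 10:
            return False
        half = n // 2
        old, new = self.window[:half], self.window[half:]
        m = 1.0 / (1.0 / len(old) + 1.0 / len(new))  # harmonic sample size
        eps = math.sqrt(math.log(2.0 / self.delta) / (2.0 * m))
        if abs(sum(old) / len(old) - sum(new) / len(new)) > eps:
            self.window = new  # drop stale data after a detected change
            return True
        return False

detector = SimpleAdaptiveWindow()
stream = [0.0] * 100 + [1.0] * 100  # abrupt mean shift at t = 100
change_points = [t for t, x in enumerate(stream) if detector.update(x)]
print(change_points[0])  # first detection shortly after t = 100
```

The detection lag you see here (the bound needs some post-shift samples before it can fire) is inherent to any change detector with a controlled false-alarm rate; delta trades that lag against false positives.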
Retraining Strategies
12. "Continual Learning for Production ML" --- Google Cloud Architecture Center
A production architecture guide covering triggered retraining, continuous training, and the infrastructure required for each. Includes decision frameworks for choosing between scheduled and triggered retraining based on data velocity, label availability, and model complexity. The architecture diagrams for Cloud Composer (Airflow) + Vertex AI pipelines show how these concepts map to real infrastructure. Available at cloud.google.com/architecture.
13. Practical MLOps --- Noah Gift and Alfredo Deza (2021)
Chapters 5--7 cover model monitoring, retraining pipelines, and deployment strategies (canary, blue-green, shadow) from a DevOps perspective. The book emphasizes infrastructure automation and CI/CD for ML, which complements this chapter's focus on the data science side of monitoring. O'Reilly.
14. "Challenges in Deploying Machine Learning: A Survey of Case Studies" --- Paleyes et al. (2022)
A survey of 91 real-world ML deployment case studies, identifying monitoring and maintenance as the most commonly cited post-deployment challenge. The paper categorizes failure modes (data drift, concept drift, infrastructure failures, feedback loops) and maps them to mitigation strategies. Published in ACM Computing Surveys. An excellent source of real-world examples beyond the two anchors in this chapter.
Production Monitoring Systems
15. "Monitoring Machine Learning Models in Production" --- Breck et al. (Google, 2017)
Companion paper to the ML Test Score rubric (Chapter 30 further reading). Describes Google's internal approach to model monitoring, including data validation, prediction monitoring, and alerting. The authors propose specific tests for production ML: "feature distribution should match training distribution," "prediction distribution should be stable over time," and "model performance should exceed baseline on every evaluation." Published in IEEE Big Data 2017.
16. "Monitoring ML Models with Prometheus and Grafana" --- Neptune.ai Blog
A practical tutorial on building a real-time monitoring dashboard using Prometheus (time-series metrics collection) and Grafana (visualization). Covers exporting custom ML metrics (PSI, AUC, prediction mean) from Python to Prometheus, and building Grafana dashboards with alert rules. The tutorial bridges the gap between this chapter's Python monitoring code and production infrastructure.
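To give a flavor of where those exported metrics end up, a Prometheus alerting rule over a hypothetical `model_feature_psi` gauge might look like the following (the metric name, label, and file name are assumptions for illustration, not taken from the tutorial):

```yaml
# prometheus_rules.yml -- fire when any feature's PSI stays above the
# "significant shift" threshold for 30 minutes straight.
groups:
  - name: ml-monitoring
    rules:
      - alert: FeatureDriftSignificant
        expr: model_feature_psi > 0.25   # hypothetical gauge exported from Python
        for: 30m
        labels:
          severity: page
        annotations:
          summary: "PSI above 0.25 on feature {{ $labels.feature }} for 30m"
```

The `for: 30m` clause is the important design choice: it suppresses alerts on transient spikes, so on-call engineers are paged only for sustained drift.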
17. MLflow Model Evaluation Documentation --- mlflow.org
MLflow's built-in model evaluation module (mlflow.evaluate()) computes standard metrics and generates visual reports. When combined with the MLflow Model Registry (Chapter 30), it provides a complete lifecycle: track experiments, register models, evaluate on production data, and trigger alerts. The "Model Evaluation" quickstart shows how to integrate evaluation into a monitoring pipeline.
Domain-Specific Monitoring
18. "Predictive Maintenance for Industry 4.0" --- Carvalho et al. (2019)
A survey of predictive maintenance approaches, including monitoring and drift detection for sensor-based models. Covers the specific challenges of manufacturing ML: sensor calibration drift, seasonal effects, fleet heterogeneity, and the high cost of missed failures. Published in Computers in Industry. Directly relevant to the TurbineTech case study.
19. "Monitoring Machine Learning Models in the Wild" --- Klaise et al. (2020)
An end-to-end framework for model monitoring that covers drift detection, outlier detection, and adversarial detection. The authors implement the framework using Seldon Deploy and Alibi Detect, an open-source Python library for drift and outlier detection. The Alibi Detect library includes implementations of KS, chi-squared, Maximum Mean Discrepancy, and learned drift detectors. Available on arXiv (2011.01314).
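Alibi Detect wraps the MMD statistic in a full detector with permutation testing; the statistic itself is compact enough to sketch in numpy (a simplified illustration, not the library's API; it uses the biased estimator and the common median heuristic for the kernel bandwidth):

```python
import numpy as np

def mmd2_rbf(X, Y, gamma=None):
    """Biased estimate of squared Maximum Mean Discrepancy with an RBF
    kernel k(a, b) = exp(-gamma * ||a - b||^2).

    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)]: near zero when the
    two samples come from the same distribution, positive otherwise.
    """
    Z = np.vstack([X, Y])
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T  # pairwise squared distances
    if gamma is None:
        gamma = 1.0 / np.median(d2[d2 > 0])  # median heuristic for bandwidth
    K = np.exp(-gamma * np.clip(d2, 0.0, None))
    n = len(X)
    return K[:n, :n].mean() + K[n:, n:].mean() - 2.0 * K[:n, n:].mean()

rng = np.random.default_rng(1)
same = mmd2_rbf(rng.normal(size=(300, 2)), rng.normal(size=(300, 2)))
shifted = mmd2_rbf(rng.normal(size=(300, 2)), rng.normal(1.0, 1.0, size=(300, 2)))
print(same < shifted)  # the shifted pair shows the larger discrepancy
```

Because MMD compares samples through a kernel on the full feature vector, it catches the multivariate interaction shifts that per-feature tests miss, at the price of needing a permutation test (as in Alibi Detect) to turn the raw statistic into a p-value.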
20. "The ML Test Score: A Rubric for ML Production Readiness" --- Breck et al. (2017)
A scoring rubric for assessing ML system maturity. The monitoring section awards points for: "data invariants are monitored in production," "training-serving skew is measured," "model staleness is tracked," and "performance is monitored on a regular cadence." Useful as a checklist for evaluating whether your monitoring system is complete. Published in IEEE Big Data 2017.
How to Use This List
If you are setting up monitoring for the first time, start with Evidently (item 1) and Huyen's Chapter 8 (item 7). Evidently will give you a working drift report in under an hour. Huyen will give you the conceptual framework to understand what the report means.
If you need to monitor without labels, read about NannyML (item 2) and Lipton et al. (item 9). NannyML estimates performance without ground truth. Lipton provides correction methods for label shift.
If you want to understand the research foundations, Gama et al. (item 6) is the comprehensive survey, and Rabanser et al. (item 5) is the empirical comparison of detection methods.
If you are building production infrastructure, the Google Cloud guide (item 12) and the Prometheus/Grafana tutorial (item 16) bridge the gap between Python prototypes and production monitoring systems.
If you work in manufacturing or other high-cost-of-failure domains, Carvalho et al. (item 18) covers the domain-specific challenges, and Klaise et al. (item 19) provides a framework that includes outlier detection alongside drift detection.
This reading list supports Chapter 32: Monitoring Models in Production. Return to the chapter to review concepts before diving in.