Chapter 30: Further Reading
Essential Sources
1. Betsy Beyer, Chris Jones, Jennifer Petoff, and Niall Richard Murphy, eds., Site Reliability Engineering: How Google Runs Production Systems (O'Reilly Media, 2016)
The foundational text for production system monitoring, alerting, incident response, and post-mortems. Written by Google SREs, it defines the concepts — SLOs, SLIs, error budgets, blameless post-mortems, on-call rotation, escalation — that this chapter adapts for ML systems. The book's core argument is that reliability is a feature that must be engineered, measured, and budgeted, not a vague aspiration.
Reading guidance: Chapter 6 (Monitoring Distributed Systems) establishes the four golden signals (latency, traffic, errors, saturation) that form the system health pillar of ML monitoring. Chapter 15 (Postmortem Culture: Learning from Failure) provides the blameless post-mortem methodology adapted in Section 30.15, including the critical insight that post-mortems must produce concrete, assigned, deadlined action items — not vague commitments to "be more careful." Chapter 28 (Accelerating SREs to On-Call and Beyond) covers on-call training and shadow shifts. The ML-specific extensions in this chapter — data quality monitoring, drift detection, prediction distribution tracking — sit on top of the SRE foundation, and understanding that foundation makes the extensions more natural. For an updated perspective, see the companion volume: Beyer et al., The Site Reliability Workbook (O'Reilly, 2018), which provides practical exercises and case studies.
2. Janis Klaise, Arnaud Van Looveren, Giovanni Vacanti, and Alexandru Coca, "Monitoring Machine Learning Models in Production" (arXiv, 2020)
A systematic survey of ML monitoring challenges, drift detection methods, and monitoring architectures. The paper identifies the monitoring gap between traditional software monitoring and ML-specific needs and proposes a layered monitoring architecture similar to the four-pillar framework in this chapter. It covers covariate shift, prior probability shift (label drift), and concept drift detection methods with mathematical rigor.
Reading guidance: Section 2 provides a clean taxonomy of drift types that maps directly to Section 30.11 of this chapter: covariate shift, prior probability shift, and concept drift, with formal definitions using the data-generating distribution $P(X, Y) = P(Y|X) \cdot P(X)$. Section 3 surveys detection methods — including PSI, KS test, maximum mean discrepancy (MMD), and learned drift detectors — providing the mathematical foundations for the implementations in Sections 30.8-30.10. Section 4 discusses monitoring architecture design, including the trade-off between detection sensitivity and false alarm rate that Exercise 30.22 explores. For a deeper treatment of concept drift specifically, see Lu et al., "Learning under Concept Drift: A Review" (IEEE TKDE, 2019), which covers adaptive learning algorithms that automatically adjust to concept drift — a direction beyond this chapter's focus on detection and alerting.
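To make the PSI metric surveyed there concrete, here is a minimal from-scratch sketch of the Population Stability Index for a numeric feature, in the spirit of the chapter's own implementations. The binning strategy (reference-quantile bins) and the epsilon floor are common conventions, not prescriptions from the Klaise et al. paper; the function name and defaults are illustrative.

```python
import math

def psi(reference, current, n_bins=10):
    """Population Stability Index between two samples of a numeric feature.

    Bins come from the reference sample's quantiles; an epsilon floor guards
    against empty bins. A common rule of thumb: PSI < 0.1 stable,
    0.1-0.25 moderate shift, > 0.25 significant shift.
    """
    ref_sorted = sorted(reference)
    # Quantile-based bin edges taken from the reference distribution.
    edges = [ref_sorted[int(i * (len(ref_sorted) - 1) / n_bins)]
             for i in range(1, n_bins)]

    def bin_fractions(sample):
        counts = [0] * n_bins
        for x in sample:
            # Bin index = number of edges the value exceeds.
            counts[sum(1 for e in edges if x > e)] += 1
        eps = 1e-4  # avoid log(0) / division by zero for empty bins
        return [max(c / len(sample), eps) for c in counts]

    p = bin_fractions(reference)
    q = bin_fractions(current)
    return sum((pi - qi) * math.log(pi / qi) for pi, qi in zip(p, q))
```

Two samples drawn from the same distribution should score near zero, while a one-standard-deviation mean shift pushes PSI well past the 0.25 alarm threshold.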
3. Jie Lu, Anjin Liu, Fan Dong, Feng Gu, Joao Gama, and Guangquan Zhang, "Learning under Concept Drift: A Review" (IEEE Transactions on Knowledge and Data Engineering 31, no. 12, 2019)
The most comprehensive survey of concept drift in machine learning, covering detection methods, adaptation strategies, and the theoretical foundations of learning from non-stationary data. While this chapter focuses on detecting drift and alerting humans, Lu et al. cover the full spectrum from detection through automated model adaptation.
Reading guidance: Section II defines the concept drift taxonomy (sudden, gradual, incremental, recurring) with formal mathematical definitions that extend the treatment in Section 30.11. Section III covers detection methods in depth: sequential analysis (Page-Hinkley, CUSUM), error-rate-based detectors (DDM, EDDM, ADWIN), data-distribution-based tests, and multiple-hypothesis approaches. The CUSUM method referenced in Exercise 30.25 is treated formally here. Section IV covers adaptation strategies — sliding windows, instance weighting, ensemble methods — that go beyond this chapter's monitoring-and-alert focus into automated response. For practitioners implementing concept drift detection, start with Section III.B (distribution-based detectors), which provides the most practical algorithms. For the connection between drift detection and causal inference, see Section V.C, which discusses confounding between genuine concept drift and sampling bias — a concern especially relevant to the Meridian Financial case study.
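As a bridge to the formal treatment in Lu et al., here is a minimal one-sided CUSUM detector for an upward shift in a stream's mean. The class name, parameter names, and defaults are illustrative choices for this sketch, not taken from the survey or from Exercise 30.25.

```python
class CusumDetector:
    """One-sided CUSUM for detecting an upward shift in a stream's mean.

    Accumulates deviations above (target + drift_allowance) and signals when
    the cumulative sum exceeds the alarm threshold h.
    """

    def __init__(self, target_mean, drift_allowance=0.5, threshold=5.0):
        self.target = target_mean
        self.k = drift_allowance  # slack: deviations smaller than k are ignored
        self.h = threshold        # alarm threshold on the cumulative sum
        self.s = 0.0              # running cumulative sum, floored at zero

    def update(self, x):
        """Feed one observation; return True if drift is signalled."""
        self.s = max(0.0, self.s + (x - self.target - self.k))
        if self.s > self.h:
            self.s = 0.0  # reset after an alarm so detection can repeat
            return True
        return False
```

With these defaults, observations near the target mean never accumulate (the slack `k` absorbs them), while a sustained shift of 2.0 triggers an alarm within a handful of observations — a small illustration of the sensitivity-versus-delay trade-off the survey formalizes.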
4. Prometheus Authors, "Prometheus Documentation" (https://prometheus.io/docs/) and Grafana Labs, "Grafana Documentation" (https://grafana.com/docs/)
The official documentation for the two tools that form the backbone of the monitoring architecture in this chapter. Prometheus provides the time-series metric collection and storage layer; Grafana provides the visualization and alerting layer. Together, they are the industry standard for production system monitoring.
Reading guidance for Prometheus: Start with "Getting Started" to set up a local Prometheus instance and instrument a Python application using the prometheus_client library (used throughout Section 30.3). The "Metric Types" page defines counters, gauges, histograms, and summaries — the four types used in the StreamRec instrumentation code. The "Querying" page covers PromQL, the query language used in Grafana dashboard panels and alert rules: rate() for computing request rates from counters, histogram_quantile() for computing latency percentiles, and increase() for counting events in time windows.
Reading guidance for Grafana: The "Dashboard" documentation covers panel types (time series, gauge, heatmap, table) used in the four-layer dashboard design from Section 30.16. The "Alerting" documentation covers alert rule definition, notification channels (Slack, PagerDuty, email), and alert grouping — the Grafana-native equivalent of the AlertManager class in Section 30.12. For production deployment, the "Loki" documentation (Grafana's log aggregation system) covers structured logging and log querying, providing the observability layer that complements Prometheus's monitoring layer.
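The three PromQL functions named above can be sketched as dashboard-panel queries. The metric names (http_requests_total, request_latency_seconds_bucket, prediction_errors_total) are illustrative placeholders, not names from the chapter's StreamRec code; the functions and their signatures are standard PromQL.

```promql
# Per-second request rate over the last 5 minutes, from a counter.
rate(http_requests_total[5m])

# 95th-percentile latency, computed from a histogram's per-bucket counters.
histogram_quantile(0.95, sum(rate(request_latency_seconds_bucket[5m])) by (le))

# Total prediction errors over the last hour, from a counter.
increase(prediction_errors_total[1h])
```

Note that histogram_quantile() operates on bucket rates aggregated by the le (less-than-or-equal) label, which is why the inner sum(rate(...)) by (le) wrapper is the standard idiom.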
5. Evidently AI, "Evidently Documentation" (https://docs.evidentlyai.com/)
Evidently is an open-source Python library for ML model monitoring, data drift detection, and model performance analysis. It provides pre-built drift detection (PSI, KS, Wasserstein, Jensen-Shannon), data quality reports, and model performance dashboards — implementing many of the techniques from this chapter in a ready-to-use library.
Reading guidance: The "Data Drift" documentation covers Evidently's drift detection suite, which implements PSI, KS test, Wasserstein distance, Jensen-Shannon divergence, and several other metrics out of the box — a convenient alternative to the from-scratch implementations in Sections 30.8-30.10 for production use. The "ML Monitoring" tutorial shows how to integrate Evidently with Prometheus and Grafana, bridging Evidently's ML-specific drift computation with the general-purpose monitoring stack described in Section 30.3. The "Test Suites" feature allows defining threshold-based tests on drift metrics and integrating them into CI/CD pipelines — connecting Chapter 28's testing approach with Chapter 30's monitoring approach. For teams building monitoring infrastructure rather than implementing from scratch, Evidently provides a practical starting point that covers drift detection, data quality, and performance monitoring. Alternatives in the same space include WhyLabs (https://whylabs.ai/), which focuses on data profiling and drift detection at scale, and Arize AI (https://arize.com/), which emphasizes observability and root cause analysis for production ML.