Chapter 30: Quiz
Test your understanding of monitoring, observability, and incident response for ML systems. Answers follow each question.
Question 1
What is the fundamental difference between monitoring and observability, and why does an ML system require both?
Answer
**Monitoring** detects **known unknowns** — failure modes you anticipated by defining metrics and thresholds (e.g., "alert if feature PSI > 0.25"). It answers "is something wrong?" **Observability** enables diagnosis of **unknown unknowns** — failure modes you did not anticipate, by providing enough telemetry (metrics, logs, traces, prediction artifacts) to investigate novel problems. It answers "why is something wrong?" An ML system requires both because ML failures often manifest in novel ways: a timezone conversion bug producing future timestamps, a feedback loop narrowing diversity, a proxy variable amplifying bias in a new population. Monitoring catches the failure modes you have seen before; observability provides the data to diagnose the ones you have not. A system with only monitoring will detect known problems but be blind to novel ones. A system with only observability produces data but no alerts — problems are diagnosed only when someone manually investigates.
Question 2
What are the four signal types in ML system telemetry, and what role does each play?
Answer
**(1) Metrics** are numeric measurements sampled at regular intervals (e.g., latency p99, feature null rate, prediction mean). They are the foundation of monitoring: cheap to collect, efficient to store, fast to query, but aggregated and lossy. **(2) Logs** are timestamped, structured records of discrete events (e.g., individual prediction requests with full feature vectors). They are the foundation of observability: rich and granular but expensive to store and slow to query at scale. **(3) Traces** are causally linked sequences of operations representing a single request's path through the system (e.g., API gateway → feature retrieval → inference → re-ranking). They allow pinpointing which component is slow or failing. **(4) Prediction artifacts** are the ML-specific inputs and outputs of the model — feature vectors, predicted scores, ranking orders. They have no analogue in traditional software monitoring and are essential for diagnosing model behavior issues (e.g., a score that is anomalously high for a specific item category).
Question 3
Why is the ground truth delay problem significant for ML monitoring, and how do monitoring strategies differ between short and long feedback loops?
Answer
The ground truth delay problem means that the actual outcome (did the user engage? did the borrower default?) is often unavailable when the prediction is made. For **short feedback loops** (minutes to hours, like StreamRec recommendations), you can monitor actual model performance directly — compute real-time accuracy, precision, or custom metrics using incoming ground truth. For **long feedback loops** (months to years, like credit scoring), you cannot measure actual performance for months. Instead, you monitor **proxy signals** (prediction distribution shift, feature distribution drift, early warning indicators like 30-day delinquency) and **distributional properties** (prediction entropy, calibration against expected rates). The key insight is that even without ground truth, changes in the model's prediction distribution signal that something has changed — either the input data, the model-outcome relationship, or the model itself.
Question 4
What is the Population Stability Index (PSI), and what do the standard threshold values of 0.10 and 0.25 represent?
Answer
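The PSI computation can be sketched in a few lines of NumPy. This is a hedged sketch, not a library API: the `psi` function name, the quantile binning, and the `1e-6` clipping floor are our own illustrative choices.

```python
import numpy as np

def psi(reference, current, n_bins=10):
    """Population Stability Index with quantile bin edges taken from the
    reference sample; clipping avoids log(0) on empty bins (illustrative)."""
    inner_edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))[1:-1]
    p = np.bincount(np.digitize(reference, inner_edges), minlength=n_bins) / len(reference)
    q = np.bincount(np.digitize(current, inner_edges), minlength=n_bins) / len(current)
    p, q = np.clip(p, 1e-6, None), np.clip(q, 1e-6, None)
    return float(np.sum((q - p) * np.log(q / p)))

rng = np.random.default_rng(0)
reference = rng.normal(0.0, 1.0, 10_000)   # training-time feature sample
stable = rng.normal(0.0, 1.0, 10_000)      # serving sample, same distribution
shifted = rng.normal(1.0, 1.0, 10_000)     # serving sample, 1-sigma mean shift

assert psi(reference, stable) < 0.10       # no significant shift
assert psi(reference, shifted) > 0.25      # significant shift: alert
```

Using quantile bins from the reference guarantees the reference proportions are roughly equal, which keeps the per-bin terms comparable.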
PSI measures the shift between a reference distribution $P$ and a current distribution $Q$: $\text{PSI} = \sum_i (q_i - p_i) \cdot \ln(q_i / p_i)$, where $p_i$ and $q_i$ are bin proportions. It is a symmetrized form of KL divergence. PSI is non-negative (zero means identical distributions), symmetric, decomposable by bin (allowing localization of the shift), and has industry-standard thresholds from credit risk regulation. PSI **< 0.10** indicates no significant shift — continue monitoring. PSI **0.10-0.25** indicates moderate shift — investigate the root cause and consider retraining. PSI **> 0.25** indicates significant shift — alert, investigate immediately, and consider model rollback. These thresholds are empirical conventions from credit scoring, where regulatory guidance expects ongoing stability monitoring, and have been widely adopted across ML monitoring more broadly.
Question 5
What are the three key limitations of PSI for drift detection?
Answer
**(1) Bin sensitivity.** PSI depends on the number and placement of bins. Too few bins miss localized shifts; too many introduce noise. Quantile-based bins from the reference distribution mitigate but do not eliminate this. **(2) Sample size sensitivity.** With small samples, PSI can be noisy. A rough guideline is at least 100 observations per bin (100 × B total observations). **(3) No statistical significance.** PSI is a descriptive measure, not a hypothesis test — it does not produce a p-value. A PSI of 0.12 means "moderate shift" but does not tell you whether this shift is statistically significant given the sample sizes. Additionally, PSI is insensitive to distribution shape within bins — two distributions with identical bin proportions but very different within-bin shapes will produce PSI = 0.
Question 6
How does the Kolmogorov-Smirnov (KS) test complement PSI for drift detection, and what is the "overpowering problem"?
Answer
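The overpowering problem can be shown directly with a NumPy-only sketch. The `ks_two_sample` helper is our own, and its p-value uses only the leading term of the asymptotic Kolmogorov distribution, so it is approximate; in practice `scipy.stats.ks_2samp` would be the usual choice.

```python
import numpy as np

def ks_two_sample(ref, cur):
    """Two-sample KS statistic D and an approximate asymptotic p-value
    (leading term 2*exp(-2*n_eff*D^2) of the Kolmogorov distribution)."""
    ref, cur = np.sort(ref), np.sort(cur)
    grid = np.concatenate([ref, cur])
    d = np.max(np.abs(np.searchsorted(ref, grid, side="right") / len(ref)
                      - np.searchsorted(cur, grid, side="right") / len(cur)))
    n_eff = len(ref) * len(cur) / (len(ref) + len(cur))
    p = min(1.0, 2.0 * np.exp(-2.0 * n_eff * d * d))
    return float(d), float(p)

rng = np.random.default_rng(1)
ref = rng.normal(0.00, 1.0, 200_000)
cur = rng.normal(0.05, 1.0, 200_000)   # operationally negligible mean shift

d, p = ks_two_sample(ref, cur)
assert p < 1e-6    # overpowered: the p-value screams at this sample size
assert d < 0.05    # but D itself stays in the "negligible" band: no alert
```

This is exactly why the chapter recommends alerting on D rather than on the p-value.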
The KS test is a nonparametric hypothesis test that computes the maximum absolute difference between two empirical CDFs ($D = \max_x |F_\text{ref}(x) - F_\text{cur}(x)|$). It complements PSI by providing statistical rigor (a p-value), requiring no binning (avoiding PSI's bin sensitivity), and being sensitive to any distributional difference (location, scale, or shape). The **overpowering problem** is that with large sample sizes (millions of serving predictions), the KS test will detect trivially small shifts with astronomically small p-values (e.g., PSI = 0.01 but KS p-value = $10^{-15}$). The shift is statistically significant but operationally meaningless. The solution for production monitoring is to use the KS statistic D (which is sample-size independent) rather than the p-value as the alerting signal, with thresholds like: D < 0.05 (negligible), 0.05-0.10 (small), 0.10-0.20 (moderate, investigate), > 0.20 (large, alert).
Question 7
What is Jensen-Shannon divergence, and what advantage does it have over KL divergence (which underlies PSI)?
Answer
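A direct NumPy implementation of the definition. The `js_divergence` helper and the category proportions are illustrative; it uses the natural logarithm, so the bound is $\ln 2$.

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions
    (natural log, so the result lies in [0, ln 2])."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0                      # 0 * log(0/...) contributes 0
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))

# A zero-probability category in 'cur' would make KL(ref || cur) infinite,
# but JSD stays finite because the midpoint m is nonzero wherever ref is.
ref = [0.5, 0.3, 0.2]
cur = [0.6, 0.4, 0.0]
jsd = js_divergence(ref, cur)
assert 0.0 < jsd < np.log(2)
```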
Jensen-Shannon divergence is defined as $\text{JSD}(P \| Q) = \frac{1}{2} D_{KL}(P \| M) + \frac{1}{2} D_{KL}(Q \| M)$ where $M = \frac{1}{2}(P + Q)$. Its key advantage over KL divergence is that it is **always defined**: KL divergence is undefined (infinite) when the current distribution assigns zero probability to a bin where the reference has nonzero probability, which requires clipping and produces unstable results. JS divergence naturally handles zero-probability bins through the midpoint distribution $M$. Additionally, JS divergence is **bounded** (between 0 and $\ln 2 \approx 0.693$), making it easier to set thresholds and visualize on dashboards, and its square root is a proper metric (satisfies the triangle inequality). It is particularly well-suited for categorical features and multi-modal distributions.
Question 8
Define covariate shift, concept drift, label drift, and prediction drift. How do you detect each?
Answer
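Concept drift detection via control charts can be sketched as a one-sided EWMA chart on a prediction-error stream. Everything here is illustrative: the function name, the smoothing constant, and the 3-sigma control limit are conventional defaults, not from a specific library, and a CUSUM chart would be a common alternative.

```python
import numpy as np

def ewma_drift_alerts(errors, baseline_mean, baseline_std, alpha=0.1, k=3.0):
    """Flag time steps where the EWMA-smoothed error exceeds the upper
    k-sigma control limit (one-sided sketch for rising error rates)."""
    sigma = baseline_std * np.sqrt(alpha / (2.0 - alpha))  # asymptotic EWMA std
    upper = baseline_mean + k * sigma
    smoothed, alerts = baseline_mean, []
    for t, err in enumerate(errors):
        smoothed = alpha * err + (1 - alpha) * smoothed
        if smoothed > upper:
            alerts.append(t)
    return alerts

rng = np.random.default_rng(2)
stream = np.concatenate([
    rng.normal(0.20, 0.02, 500),   # stable error rate: P(Y|X) unchanged
    rng.normal(0.30, 0.02, 200),   # concept drift: error rate jumps
])
alerts = ewma_drift_alerts(stream, baseline_mean=0.20, baseline_std=0.02)
assert any(500 <= t <= 520 for t in alerts)   # drift flagged within ~20 steps
```

Because ground truth arrives with delay, the "error" series here would in practice be computed on whatever labeled slice is available (e.g., short-feedback-loop outcomes).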
**Covariate shift:** $P(X)$ changes but $P(Y|X)$ stays the same — the population changed but the relationship is stable. Detected by feature-level PSI/KS/JS monitoring. **Concept drift:** $P(Y|X)$ changes — the relationship between inputs and outcomes has changed (e.g., a pandemic changes viewing patterns). Detected by monitoring the prediction-outcome gap over time; statistical process control charts (CUSUM, EWMA) on prediction error. **Label drift:** $P(Y)$ changes — the outcome distribution shifts (e.g., default rate rises due to recession). Detected by monitoring outcome rates and comparing to model-predicted rates. **Prediction drift:** $P(\hat{Y})$ changes — the model's output distribution shifts. Detected by monitoring prediction score distribution PSI/KS/JS. Prediction drift is the most general signal — it indicates something changed but root cause analysis is needed to determine whether the cause is covariate shift, concept drift, a deployment error, or a data pipeline change.
Question 9
What are the four pillars of ML monitoring, and how do they differ from standard software monitoring?
Answer
The four pillars are: **(1) Data quality** — feature null rates, freshness, distribution drift, training-serving skew, schema compliance. This has no software analogue because software inputs are deterministic; ML inputs are statistical. **(2) Model performance** — prediction distribution, accuracy proxies, calibration, business metric alignment. This has no software analogue because software correctness is binary; ML correctness is probabilistic. **(3) System health** — latency, throughput, errors, resource utilization, SLOs. This is the same as standard software monitoring, extended with ML-specific breakdowns (inference latency vs. feature retrieval latency). **(4) Business metrics** — CTR, engagement, revenue, coverage. This has a software analogue (conversion rate, user satisfaction) but the relationship between model behavior and business outcomes is more indirect, delayed, and confounded in ML systems. Standard software monitoring covers only pillar 3; ML monitoring requires all four.
Question 10
What is an SLO error budget, and how does it apply to ML systems?
Answer
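The arithmetic behind the "roughly 43 minutes" figure, as a quick worked example (the burn-down variables are illustrative):

```python
# 99.9% availability SLO over a 30-day window:
slo = 0.999
window_minutes = 30 * 24 * 60                    # 43,200 minutes
budget_minutes = (1.0 - slo) * window_minutes    # allowed "bad" minutes
assert round(budget_minutes, 1) == 43.2

# Burn-down: a 25-minute feature store outage consumes most of the budget.
consumed_minutes = 25.0
remaining_minutes = budget_minutes - consumed_minutes
freeze_deployments = remaining_minutes <= 0.0    # budget exhausted -> freeze
assert not freeze_deployments                    # ~18 minutes of budget left
```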
An **error budget** is the inverse of an SLO: if the SLO is 99.9% availability, the error budget is 0.1% — roughly 43 minutes of downtime per 30-day period. It creates a quantitative framework for balancing reliability against velocity: when error budget is available, the team can take risks (deploy new models, experiment); when it is exhausted, the team freezes deployments and focuses on reliability. For ML systems, error budgets apply to both **system reliability** (uptime, latency) and **model reliability** (drift compliance, prediction quality). A model deployment that triggers a drift alert consumes model reliability budget. A feature store outage consumes system reliability budget. The key insight is that error budgets make the cost of unreliability explicit and create a shared vocabulary between ML teams (who want to ship models) and platform teams (who want stability).
Question 11
What are the three characteristics of a good alert?
Answer
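Cooldown-based deduplication can be sketched in a few lines. The class and method names are our own; in a real system the boolean return would be replaced by delivery to the escalation target for the severity tier (log, Slack, PagerDuty).

```python
import time

class DedupedAlert:
    """Fire at most once per cooldown window per (alert, severity) key
    (illustrative sketch of the deduplication property)."""
    def __init__(self, cooldown_s=3600):
        self.cooldown_s = cooldown_s
        self.last_fired = {}              # (name, severity) -> last fire time

    def fire(self, name, severity, now=None):
        now = time.time() if now is None else now
        key = (name, severity)
        last = self.last_fired.get(key)
        if last is not None and now - last < self.cooldown_s:
            return False                  # suppressed: still in cooldown
        self.last_fired[key] = now
        return True                       # deliver to the tier's target here

alerts = DedupedAlert(cooldown_s=3600)
assert alerts.fire("feature_psi", "warning", now=0) is True
assert alerts.fire("feature_psi", "warning", now=600) is False    # deduplicated
assert alerts.fire("feature_psi", "critical", now=600) is True    # new tier fires
assert alerts.fire("feature_psi", "warning", now=4000) is True    # cooldown over
```

Keying on (name, severity) lets an escalation from warning to critical page immediately even while the warning is in cooldown.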
**(1) Actionable.** Every alert must have a clear next step — not just "metric X exceeded threshold" but a description of what the metric means, why it matters, and a link to the runbook with diagnostic and mitigation steps. **(2) Tiered.** Not every anomaly is an emergency. Alerts should use severity levels (info, warning, critical) with different escalation targets (log only, Slack channel, PagerDuty). A PSI of 0.12 is a warning to investigate; a PSI of 0.35 with a business metric drop is a critical page. **(3) Deduplicated.** If a feature is drifting continuously, the alert should fire once and not repeat until the cooldown period expires or the alert is resolved. Additionally, good alerts are **correlated**: multiple simultaneous alerts should be grouped into a single incident suggesting a common root cause, rather than flooding the on-call with individual notifications.
Question 12
What are the five stages of the ML incident lifecycle?
Answer
**(1) Detection** — the incident is identified through automated alerts, business metric anomalies, or user reports. **(2) Triage** — the incident is classified by severity (SEV-1 through SEV-4) and type (system, data, model, integration), determining the response urgency and the team that responds. **(3) Mitigation** — immediate actions to reduce user impact before the root cause is fixed: model rollback, feature fallback, traffic shift to a rule-based system, or traffic throttling. The priority is impact reduction, not root cause analysis. **(4) Resolution** — the root cause is identified and fixed: a pipeline bug is patched, a model is retrained, a feature store is repaired. **(5) Post-mortem** — a structured, blameless analysis of the incident that identifies root causes, contributing factors, and preventive action items. The post-mortem is the most important step because it converts individual incidents into systemic improvements.
Question 13
Why are blameless post-mortems an engineering decision, not a cultural nicety?
Answer
Blamelessness is an engineering decision because it optimizes for organizational learning. When blame is assigned, people hide mistakes, underreport incidents, and avoid transparent documentation — which prevents the organization from identifying systemic weaknesses. When post-mortems are blameless, people report problems early, document failures thoroughly, and share root causes openly — which allows the organization to fix the systems, processes, and monitoring gaps that enabled the incident. The framing shifts from "who made the error?" to "what about our system allowed this error to happen and reach production?" This surfaces the real root causes: missing monitors, missing tests, unclear runbooks, absent data contracts, inadequate training-serving skew detection. These systemic causes are fixable; individual human error is not reliably preventable.
Question 14
List three ML-specific root cause categories that have no analogue in traditional software incidents.
Answer
**(1) Concept drift** — the relationship between inputs and outcomes has changed (e.g., a pandemic changes user behavior), but the system infrastructure is functioning correctly. In traditional software, the relationship between inputs and outputs is determined by code, not learned from data, so it does not drift. **(2) Feedback loop amplification** — the model's own predictions influence the training data, creating a self-reinforcing bias (e.g., the model recommends popular items, which get more clicks, which makes them appear more popular). Traditional software does not have this circular dependency between output and future input. **(3) Training-serving skew** — different code paths compute the same feature differently in training (batch) and serving (real-time), producing silent prediction errors. Traditional software typically has a single code path per function. Other ML-specific categories include label leakage, data pipeline semantic changes (schema unchanged but meaning changed), and stale model (retraining pipeline silently broken).
Question 15
In the StreamRec degradation case study (Case Study 1), why did the validation gate fail to prevent the three-month degradation?
Answer
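The dual gate described in the case study's fix can be sketched as follows. Function and variable names are illustrative; the metric values mirror the case study's Recall@20 numbers, and the cumulative threshold here is an assumed value for the sketch.

```python
def validation_gate(candidate, champion, golden_baseline,
                    step_limit=0.02, cumulative_limit=0.02):
    """Reject if the candidate regresses more than step_limit vs. the
    current champion OR more than cumulative_limit vs. a fixed golden
    baseline (illustrative sketch of the dual gate)."""
    step_regression = champion - candidate
    cumulative_regression = golden_baseline - candidate
    return step_regression <= step_limit and cumulative_regression <= cumulative_limit

golden = 0.220                     # best Recall@20 from the quarterly review
champion, all_accepted = golden, True
for week in range(12):
    candidate = champion - 0.005   # each weekly retrain is slightly worse
    if validation_gate(candidate, champion, golden):
        champion = candidate       # the single-step check alone accepts all 12
    else:
        all_accepted = False       # the golden-baseline check halts the slide
        break

assert not all_accepted            # cumulative regression eventually trips the gate
```

With only the single-step check, every weekly 0.005 regression passes and the ratchet effect proceeds unchecked; the fixed-baseline comparison is what stops it.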
The validation gate compared each new model only to the **current champion** with a maximum regression limit of 0.02. Each weekly retraining produced a model slightly worse than the previous champion — typically around 0.005, well within the 0.02 threshold. But over 12 weeks, the cumulative degradation reached 0.028 (from 0.220 to 0.192). The gate never triggered because it only looked at single-step regression, not cumulative regression from a fixed baseline. This is the "ratchet effect": a gradually degrading champion becomes the comparison baseline for an even more degraded challenger. The fix was dual: (1) compare each model not just to the current champion but also to a **fixed golden baseline** (the best model from the last quarterly review), and (2) track cumulative regression across consecutive deployments and alert when the cumulative drift exceeds a separate threshold.
Question 16
How does monitoring for a regulated credit scoring model (Meridian Financial) differ from monitoring for a recommendation system (StreamRec)?
Answer
Three key differences. **(1) Fair lending monitoring layer.** Credit models must continuously track adverse impact ratios (AIR) for all ECOA-protected groups and alert when any group falls below the four-fifths threshold. Recommendation systems have no regulatory equivalent (though ethical practice suggests similar monitoring). **(2) Documentation and retention requirements.** Every monitoring report, trigger violation, and re-validation must be documented in a format that an OCC examiner can review, and retained for 7 years. StreamRec monitoring is operational — it serves the engineering team, not regulators. **(3) Trigger-based re-validation.** When monitoring triggers fire (PSI > 0.20, default rate deviation, AIR violation), the model must undergo full re-validation within a specified timeframe. StreamRec monitoring triggers investigation but not a formal re-validation process with regulatory consequences. Despite these differences, the underlying monitoring infrastructure is identical: PSI, drift detection, alerting, and dashboards serve the same technical purpose in both contexts.
Question 17
What is recommendation coverage, and why is it a critical monitoring metric?
Answer
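Coverage and the Gini coefficient of recommendation frequency can be computed in a few lines. This is a sketch: the helper names, catalog size, and item counts below are made up for illustration.

```python
import numpy as np

def coverage(recommended_items, catalog_size):
    """Fraction of the catalog that appeared in recommendations."""
    return len(set(recommended_items)) / catalog_size

def gini(counts):
    """Gini coefficient of recommendation frequency:
    0 = perfectly uniform, -> 1 = all impressions on one item."""
    x = np.sort(np.asarray(counts, float))
    n = len(x)
    return float((2 * np.arange(1, n + 1) - n - 1).dot(x) / (n * x.sum()))

# Healthy: impressions spread across 450 of 1,000 catalog items.
broad = list(range(450)) * 10
# Feedback loop: impressions concentrated on three popular items.
narrow = [1, 2, 3] * 1500

assert coverage(broad, 1000) == 0.45
assert coverage(narrow, 1000) == 0.003
assert gini([10] * 450) == 0.0             # uniform impressions: Gini 0
assert gini([0] * 449 + [1000]) > 0.99     # one item takes everything
```

A coverage drop with stable CTR, as in the case study, is exactly the signature these metrics are designed to surface.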
Recommendation coverage is the fraction of the item catalog that appears in recommendations: unique items recommended / total catalog size. It is critical because it detects a failure mode invisible to accuracy metrics: the model concentrating all recommendations on a narrow set of popular items. In the StreamRec case study, Recall@20 was within tolerance at each weekly deployment, CTR was stable (popular items have high baseline engagement), but coverage collapsed from 0.45 to 0.18 — meaning 82% of the catalog was never recommended. This signals a feedback loop (popular items get recommended, get more clicks, appear more popular) that degrades user experience (less diverse recommendations), creator experience (most content never shown), and platform health (reduced catalog utilization, stale content). Coverage, along with prediction entropy and the Gini coefficient of recommendation frequency, is an early warning system for feedback loop amplification.
Question 18
What is the purpose of a runbook, and what six sections should every ML runbook contain?
Answer
A runbook is a documented procedure for responding to a specific alert, designed to reduce mean time to resolution (MTTR) by converting on-call response from improvisation into a checklist. The six essential sections are: **(1) Alert description** — what triggered the alert and what it means in plain language. **(2) Impact assessment** — who is affected, how severely, and what business impact is expected. **(3) Diagnostic steps** — step-by-step commands, queries, and dashboard links to identify the root cause. **(4) Mitigation options** — immediate actions to reduce impact (model rollback, feature fallback, traffic shift), ordered by speed of implementation. **(5) Root cause investigation** — deeper analysis steps to perform once mitigation is in place. **(6) Escalation criteria** — when to escalate to the next on-call level and what information to include in the escalation.
Question 19
The chapter recommends using PSI for alerting, KS for confirmation, and JS divergence for categorical features. Why use all three instead of choosing one?
Answer
Each method has strengths that compensate for the others' weaknesses. **PSI** has well-calibrated thresholds (from decades of credit risk practice), is decomposable by bin (enabling root cause localization), and is interpretable to stakeholders and regulators — but it requires binning and has no p-value. **KS test** provides statistical rigor (a p-value), requires no binning, and is sensitive to any distributional difference — but it is overpowered at large sample sizes (detecting trivially small shifts with tiny p-values) and its single-number summary is less informative for root cause analysis. **JS divergence** naturally handles zero-probability categories (unlike KL/PSI which require clipping), is bounded (simplifying dashboard visualization), and works well for categorical features — but it also requires binning for continuous features and has less established threshold conventions. Using all three provides a multi-perspective view: PSI for the primary alert, KS for statistical confirmation ("is this shift real?"), and JS for categorical features and bounded comparisons.
Question 20
Explain the relationship between Chapter 28 (testing) and Chapter 30 (monitoring). How do they complement each other?