Chapter 28: Key Takeaways

  1. ML systems require testing at four layers: data, features, model behavior, and model validation. Schema validation (Great Expectations, Pandera) catches structural data problems — wrong types, missing columns, null values, out-of-range values. Statistical validation (PSI, distribution shift detection) catches distributional problems — features with the right schema but wrong statistics. Behavioral tests (CheckList framework: MFT, INV, DIR) catch model pathologies — sensitivity to irrelevant inputs, insensitivity to meaningful inputs, poor performance on specific segments. Model validation gates (champion-challenger comparison) catch deployment regressions — models that pass all other tests but are still worse than the current production model. Each layer catches failure modes that the others miss, and skipping any layer leaves a dangerous blind spot.

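As a concrete sketch of the statistical-validation layer mentioned above, here is a minimal PSI computation over pre-binned distributions. The bins, example numbers, and the 0.2 alert threshold are illustrative assumptions, not values from the chapter:

```python
import math

def psi(expected_fracs, actual_fracs, eps=1e-6):
    """Population Stability Index between two binned distributions.

    Each argument is a list of per-bin fractions summing to ~1.0.
    eps guards against empty bins (log of zero).
    """
    total = 0.0
    for e, a in zip(expected_fracs, actual_fracs):
        e = max(e, eps)
        a = max(a, eps)
        total += (a - e) * math.log(a / e)
    return total

# Identical distributions give a PSI of 0; a common rule of thumb
# (an assumption here, not the chapter's) flags PSI > 0.2 as shift.
baseline = [0.25, 0.25, 0.25, 0.25]
shifted  = [0.10, 0.20, 0.30, 0.40]
print(round(psi(baseline, baseline), 4))  # 0.0
print(round(psi(baseline, shifted), 4))   # ~0.2282, above the 0.2 flag
```

Note that PSI is computed per feature, so a feature can pass schema validation (every value in range) while failing this check — exactly the "right schema, wrong statistics" failure mode.
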
  2. Data validation catches the most production incidents per unit of engineering effort. Most production ML failures are data failures, not model failures: a schema change in an upstream service, a dropped logging pipeline, a normalization step applied in the wrong order. Great Expectations (for batch checkpoint validation) and Pandera (for inline function-boundary validation) are complementary tools that together cover data quality at ingestion and processing. The mostly parameter, volume expectations, and freshness checks are simple but powerful — the StreamRec case study showed that a 22-expectation suite, calibrated against 90 days of historical data, would have caught two of the three production incidents before they affected users.

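The mostly semantics can be sketched in a few lines of plain Python. This is a toy illustration of the concept, not the Great Expectations API (which wraps it in suite and checkpoint machinery), and the example column and values are invented:

```python
def expect_values_between(values, low, high, mostly=1.0):
    """Pass if at least `mostly` fraction of non-null values lie in [low, high].

    A pure-Python sketch of the mostly semantics described above.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        return False  # an entirely empty column is itself a failure
    in_range = sum(1 for v in non_null if low <= v <= high)
    return in_range / len(non_null) >= mostly

# Hypothetical watch-fraction feature; 1.7 is a corrupt value.
watch_fracs = [0.0, 0.5, 0.93, 1.0, 1.7, None]
print(expect_values_between(watch_fracs, 0.0, 1.0))               # strict: False
print(expect_values_between(watch_fracs, 0.0, 1.0, mostly=0.75))  # tolerant: True
```

The tolerance is the point: real pipelines always carry a trickle of bad rows, so a strict check alerts constantly, while a calibrated mostly threshold alerts only when the bad fraction jumps.
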
  3. Data contracts make implicit data dependencies explicit and enforceable. In any ML system with multiple upstream data sources, the consuming pipeline implicitly depends on the exact schema, semantics, and delivery cadence of every source. Without contracts, changes to upstream data are silent: the producer does not know the consumer depends on a specific format, and the consumer does not know the format has changed until the model degrades. Data contracts — specifying columns, types, value ranges, delivery SLA, freshness requirements, and breaking change policy — transform these implicit dependencies into explicit, testable, and enforceable agreements between teams.

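A contract can be as small as a checked-in structure that a CI job enforces against each delivered batch. A minimal hypothetical sketch — the ColumnSpec/DataContract names, the watch_frac column, and the 6-hour SLA are all invented for illustration:

```python
from dataclasses import dataclass

@dataclass
class ColumnSpec:
    dtype: type
    min_value: float = float("-inf")
    max_value: float = float("inf")

@dataclass
class DataContract:
    """Minimal contract: column schema plus a freshness SLA."""
    columns: dict            # column name -> ColumnSpec
    max_staleness_hours: float

    def violations(self, rows, staleness_hours):
        """Return human-readable violations for a batch of row dicts."""
        problems = []
        if staleness_hours > self.max_staleness_hours:
            problems.append(f"stale: {staleness_hours}h > SLA {self.max_staleness_hours}h")
        for i, row in enumerate(rows):
            for name, spec in self.columns.items():
                if name not in row:
                    problems.append(f"row {i}: missing column {name!r}")
                elif not isinstance(row[name], spec.dtype):
                    problems.append(f"row {i}: {name!r} has type {type(row[name]).__name__}")
                elif not (spec.min_value <= row[name] <= spec.max_value):
                    problems.append(f"row {i}: {name!r}={row[name]} out of range")
        return problems

contract = DataContract(
    columns={"user_id": ColumnSpec(int), "watch_frac": ColumnSpec(float, 0.0, 1.0)},
    max_staleness_hours=6,
)
bad_batch = [{"user_id": 7, "watch_frac": 1.7}, {"user_id": 8}]
for v in contract.violations(bad_batch, staleness_hours=2):
    print(v)
```

A real contract adds semantics and a breaking-change policy, but even this skeleton makes the dependency testable on both sides: the producer can run it before publishing, the consumer before training.
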
  4. Behavioral tests check what aggregate metrics cannot: whether the model has learned the right things and not learned the wrong things. A model with excellent Recall@20 may still be sensitive to user names (an invariance failure), insensitive to strong engagement signals (a directional failure), or catastrophically poor on a specific user segment (a minimum functionality failure). The CheckList framework — minimum functionality tests (MFT), invariance tests (INV), and directional expectation tests (DIR) — provides a systematic vocabulary for expressing these domain-specific expectations as executable tests. For credit scoring, behavioral tests directly encode regulatory requirements (fair lending invariance, directional economic relationships), making the test suite both a quality tool and a compliance artifact.

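The three CheckList test types can be sketched against a toy model. Everything here — the score function, the perturbations, the 0.1 floor — is an illustrative stand-in, not the chapter's recommender:

```python
def score(user):
    """Toy engagement model: ignores the name, rewards watch-history length."""
    return min(1.0, 0.1 + 0.02 * len(user["history"]))

def mft_minimum_functionality(users, floor=0.1):
    # MFT: even cold-start users must receive at least a floor score.
    return all(score(u) >= floor for u in users)

def inv_name_invariance(user):
    # INV: changing the user's name must not change the score at all.
    renamed = {**user, "name": "Alex"}
    return score(user) == score(renamed)

def dir_more_history_helps(user):
    # DIR: adding an engagement signal should never lower the score.
    richer = {**user, "history": user["history"] + ["item_x"]}
    return score(richer) >= score(user)

u = {"name": "Sam", "history": ["a", "b"]}
print(mft_minimum_functionality([{"name": "New", "history": []}]),
      inv_name_invariance(u),
      dir_more_history_helps(u))
```

The pattern is the same for a real model: MFTs run on hand-built segments, INVs compare outputs across perturbations that should not matter, DIRs assert the sign of the change for perturbations that should.
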
  5. Model validation gates must enforce dual thresholds: absolute floors and regression limits from the champion. Absolute floors prevent catastrophically poor models from reaching production, even when the champion has degraded. Regression limits prevent gradual degradation, where each new model is slightly worse than the last but still above the absolute floor. The champion-challenger pattern, combined with staged validation (offline evaluation, behavioral tests, shadow evaluation, canary deployment), minimizes both the risk of deploying a bad model and the cost of evaluating a good one. Each stage is progressively more expensive and more realistic, so bad models are rejected early and cheaply.

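Once the metrics are computed, the dual-threshold gate itself is a few lines of code. A sketch with invented metric names and thresholds, assuming higher is better for every metric:

```python
def validation_gate(challenger, champion, floors, max_regression):
    """Dual-threshold gate: absolute floors AND bounded regression vs champion.

    floors maps metric name -> minimum acceptable value; max_regression is
    the largest tolerated drop from the champion on any gated metric.
    Returns (approved, reasons).
    """
    reasons = []
    for metric, floor in floors.items():
        if challenger[metric] < floor:
            reasons.append(f"{metric}={challenger[metric]:.3f} below floor {floor}")
        drop = champion[metric] - challenger[metric]
        if drop > max_regression:
            reasons.append(f"{metric} regressed by {drop:.3f} (limit {max_regression})")
    return (not reasons), reasons

floors     = {"recall_at_20": 0.30, "ndcg_at_20": 0.25}
champion   = {"recall_at_20": 0.42, "ndcg_at_20": 0.33}
challenger = {"recall_at_20": 0.41, "ndcg_at_20": 0.29}  # above floor, but a big drop

approved, reasons = validation_gate(challenger, champion, floors, max_regression=0.02)
print(approved, reasons)
```

The example shows why both thresholds are needed: the challenger clears every absolute floor, yet the gate still rejects it for regressing too far from the champion.
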
  6. The ML Test Score (Breck et al., 2017) transforms "how mature is our testing?" from a subjective question into a quantifiable metric. The four-category rubric — tests for features and data, tests for model development, tests for ML infrastructure, and monitoring tests — provides a systematic checklist for evaluating testing maturity and identifying the highest-priority gaps. For StreamRec, the score improved from 4.5 (critical gaps) to 14.0 (mature) after M12, with the remaining gaps (hyperparameter sensitivity, load testing, fairness monitoring) providing a concrete roadmap for the next investment cycle.

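The rubric bookkeeping is simple to automate. The sketch below assumes the summed-points convention implied by the 4.5 and 14.0 figures above (0.5 points per test run manually, 1.0 per automated test, summed across categories); Breck et al.'s paper also describes aggregating by the minimum per-category score, so check which convention your team adopts. The statuses below are invented, not StreamRec's actual rubric:

```python
POINTS = {"none": 0.0, "manual": 0.5, "automated": 1.0}

def ml_test_score(rubric):
    """rubric: category name -> list of per-test statuses.

    Returns (per-category scores, total score) under the summed-points
    convention assumed above.
    """
    per_category = {
        cat: sum(POINTS[status] for status in statuses)
        for cat, statuses in rubric.items()
    }
    return per_category, sum(per_category.values())

rubric = {
    "features_and_data": ["automated", "automated", "manual", "none"],
    "model_development": ["automated", "manual", "none", "none"],
    "ml_infrastructure": ["automated", "automated", "automated", "manual"],
    "monitoring":        ["automated", "manual", "manual", "none"],
}
per_category, total = ml_test_score(rubric)
print(per_category, total)
```

Scoring this way makes the "highest-priority gaps" mechanical to find: the lowest-scoring category, and within it the tests still marked none or manual.
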
  7. Testing infrastructure is an investment that compounds over time. The StreamRec team's 6-week investment in testing infrastructure reduced incident-response engineering hours from 40 per month to 4, prevented an estimated 1.4 bad model deployments per month, and cut mean time to detect data quality issues from 11 days to 45 minutes. At Meridian Financial, the same investment produced zero regulatory examination findings — a first in three examination cycles. The upfront cost is real (6-8 weeks of engineering), but the ongoing return — in prevented incidents, reduced toil, faster iteration, and regulatory compliance — grows with every model trained and every data change processed.