Chapter 28: Quiz

Test your understanding of ML testing and validation infrastructure. Answers follow each question.


Question 1

Why does Michael Feathers' definition of legacy code — "code without tests" — apply with amplified force to ML systems compared to traditional software?

**Answer:** ML systems introduce three testing dimensions absent from traditional software. First, **data testing**: the input is a dataset that changes daily and can degrade silently — wrong types, shifted distributions, or missing values produce no exceptions but corrupt model predictions. Second, **model behavioral testing**: ML outputs are stochastic and approximate, so exact-output assertions (`assert f(x) == y`) are impossible; behavioral properties (invariance, directionality, minimum functionality) must be tested instead. Third, **model validation**: a new model version may pass all code tests yet perform worse than the current production model due to subtle data or training differences. Together, these dimensions mean that untested ML code is not just potentially buggy — it can be silently, catastrophically wrong while appearing to function normally.

Question 2

What is the difference between an expectation, an expectation suite, and a checkpoint in Great Expectations?

**Answer:** An **expectation** is a single declarative assertion about data — for example, "column `user_id` is never null" or "row count is between 500,000 and 50,000,000." An **expectation suite** is a named collection of expectations that together define the quality contract for a specific data asset. A **checkpoint** is the orchestration unit that runs a validator (which evaluates the suite against a batch of data) and triggers actions on success or failure — such as storing results, updating Data Docs, or sending Slack alerts. Expectations define *what* to check; suites group the checks; checkpoints define *when* and *how* to run them and *what to do* with the results.

Question 3

What does the `mostly` parameter control in a Great Expectations expectation, and why is it critical for production data validation?

**Answer:** The `mostly` parameter specifies the fraction of rows that must satisfy the expectation for it to pass. For example, `mostly=0.999` means 99.9% of values must be within the specified range. It is critical because real-world data is never perfectly clean — mobile apps report outliers, clock synchronization issues produce impossible timestamps, and edge cases generate nulls in nominally non-null columns. Setting `mostly=1.0` on every expectation guarantees false alarms on every batch, rendering the validation useless. Setting it too low (e.g., `mostly=0.5`) defeats the purpose of validation. The correct value is calibrated empirically: run the suite against 30 days of historical data, observe the natural failure rate, and set `mostly` to be stricter than the historical worst case but tolerant of normal variance.
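The semantics of `mostly` reduce to a fraction check. A plain-Python sketch of the idea (this is not Great Expectations' actual implementation, and `check_mostly` is a hypothetical name):

```python
def check_mostly(values, predicate, mostly=1.0):
    """Pass if at least a `mostly` fraction of values satisfies the predicate."""
    if not values:
        return True  # vacuously true on an empty batch
    passed = sum(1 for v in values if predicate(v))
    return passed / len(values) >= mostly

# 999 in-range values and 1 outlier: passes at mostly=0.999, fails at mostly=1.0
batch = [0.5] * 999 + [12.0]
in_range = lambda v: 0.0 <= v <= 1.0
print(check_mostly(batch, in_range, mostly=0.999))  # True
print(check_mostly(batch, in_range, mostly=1.0))    # False
```

A single bad row out of a thousand is exactly the situation where `mostly=1.0` would page someone for nothing.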

Question 4

How does Pandera differ from Great Expectations in its validation approach, and when would you use each?

**Answer:** Great Expectations performs **batch validation** — it checks a complete dataset after it has been produced, typically at a pipeline checkpoint. Pandera performs **inline validation** — it checks DataFrames at Python function boundaries, raising an exception the moment invalid data enters a function. Great Expectations integrates with orchestration systems (Airflow, Dagster) and supports multiple backends (Pandas, Spark, SQL); Pandera integrates with Python type annotations and function decorators, primarily for Pandas DataFrames. Great Expectations excels at audit trails and documentation (Data Docs); Pandera excels at pinpointing the exact function where a contract violation occurs. The two are complementary: Great Expectations validates raw data at ingestion; Pandera validates DataFrames within feature engineering code.
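The fail-at-the-boundary idea can be illustrated without Pandera itself. Below is a minimal decorator sketch over lists of row dicts; all names are hypothetical (Pandera's real API works on DataFrames via `DataFrameSchema` and decorators), but the failure mode is the same: the exception fires at the function boundary, not downstream.

```python
import functools

def validate_input(schema):
    """Reject a call the moment a row violates the schema, Pandera-style."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(rows, *args, **kwargs):
            for i, row in enumerate(rows):
                for col, col_type in schema.items():
                    if col not in row:
                        raise ValueError(f"row {i}: missing column {col!r}")
                    if not isinstance(row[col], col_type):
                        raise TypeError(f"row {i}: column {col!r} is not {col_type.__name__}")
            return fn(rows, *args, **kwargs)
        return wrapper
    return decorator

@validate_input({"user_id": int, "watch_hours": float})
def build_features(rows):
    # the contract is enforced before this body ever runs
    return [{"user_id": r["user_id"], "root_hours": r["watch_hours"] ** 0.5} for r in rows]

print(build_features([{"user_id": 1, "watch_hours": 4.0}]))  # [{'user_id': 1, 'root_hours': 2.0}]
# build_features([{"user_id": "1", "watch_hours": 4.0}])  # would raise TypeError here
```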

Question 5

What is the Population Stability Index (PSI), and what do the standard threshold values of 0.1 and 0.25 represent?

**Answer:** PSI measures how much a feature's distribution has shifted between a reference period and the current period: $\text{PSI} = \sum_{i=1}^{B} (p_i - q_i) \cdot \ln(p_i / q_i)$, where $p_i$ and $q_i$ are bin proportions for the current and reference distributions. It is a symmetrized form of KL divergence. A PSI below 0.1 indicates **no significant shift** — the distribution is stable. A PSI between 0.1 and 0.25 indicates **moderate shift** — the distribution has changed enough to warrant investigation but may not require immediate action. A PSI above 0.25 indicates **significant shift** — the distribution has changed substantially and action is required (investigate the root cause, retrain the model, or alert the team). PSI is widely used in credit scoring, where regulatory guidance expects ongoing distribution monitoring.
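The formula maps directly to code. A minimal sketch over pre-binned proportions (the epsilon floor for empty bins is a common practical adjustment, not part of the formula itself):

```python
import math

def psi(current, reference, eps=1e-6):
    """Population Stability Index over aligned bin proportions p (current) vs q (reference)."""
    total = 0.0
    for p, q in zip(current, reference):
        p = max(p, eps)  # avoid log(0) on empty bins
        q = max(q, eps)
        total += (p - q) * math.log(p / q)
    return total

stable  = psi([0.25, 0.25, 0.25, 0.25], [0.25, 0.25, 0.25, 0.25])
shifted = psi([0.70, 0.20, 0.05, 0.05], [0.25, 0.25, 0.25, 0.25])
print(round(stable, 6))  # 0.0 -- identical distributions, no shift
print(shifted > 0.25)    # True -- mass piled into one bin: significant shift
```

Note that each term is non-negative (when $p_i > q_i$ the log is positive, when $p_i < q_i$ both factors are negative), so PSI is zero only when the binned distributions match.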

Question 6

What is a data contract, and how does it differ from a database schema?

**Answer:** A **data contract** is a formal agreement between a data producer and a data consumer that specifies the structure, semantics, quality, and service level of a data asset. It goes far beyond a database schema: a schema specifies column names and types; a data contract also specifies nullability semantics, valid value ranges, delivery SLA, freshness requirements, volume bounds, the owner responsible for the data, the breaking change policy, and the list of consumers who depend on it. A schema tells you the data's shape; a contract tells you its meaning, quality guarantees, and the organizational commitments around it. The critical difference is that a contract is *owned* and *enforced* — the producer commits to honoring it, and violations are treated as incidents, not accidents.
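The extra fields a contract carries beyond a schema can be made concrete with a dataclass sketch. All field names here are illustrative, not a specific tool's API; the point is how much sits outside the column list:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class ColumnSpec:
    name: str
    dtype: str
    nullable: bool = False
    valid_range: tuple | None = None  # (min, max) for numeric columns

@dataclass
class DataContract:
    asset: str
    owner: str                            # team accountable when the contract is violated
    columns: list[ColumnSpec]             # the part a plain schema would cover
    freshness_hours: int                  # delivery SLA: maximum allowed data age
    min_rows: int                         # volume lower bound
    breaking_change_notice_days: int = 30
    consumers: list[str] = field(default_factory=list)

contract = DataContract(
    asset="events.user_watch",
    owner="data-platform",
    columns=[ColumnSpec("user_id", "int64"),
             ColumnSpec("watch_hours", "float64", valid_range=(0.0, 24.0))],
    freshness_hours=6,
    min_rows=500_000,
    consumers=["recs-training", "analytics"],
)
print(contract.owner, len(contract.consumers))  # data-platform 2
```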

Question 7

Why is backward compatibility important for data contract evolution, and which types of schema changes are backward-compatible?

**Answer:** Backward compatibility means that existing consumers continue to work without modification after a schema change. It is important because in a large data organization, a single data asset may have dozens of consumers, and forcing all consumers to update simultaneously for every schema change creates coordination nightmares and frequent pipeline failures. **Backward-compatible changes** include: adding a nullable column (consumers that do not use it are unaffected), and expanding a categorical value set (consumers that handle unknown categories gracefully are unaffected). **Non-backward-compatible (breaking) changes** include: removing a column, adding a non-nullable column, narrowing a value set, changing a column's type, and changing a column's semantics. Breaking changes require advance notice (typically 30 days), consumer acknowledgment, and a transition period.
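The compatibility rules above can be encoded as a small classifier. A hypothetical sketch (the change-description keys are made up for illustration):

```python
def is_backward_compatible(change):
    """Classify a proposed schema change per the rules above."""
    kind = change["kind"]
    if kind == "add_column":
        return change.get("nullable", False)  # only nullable additions are safe
    if kind == "expand_values":
        return True  # wider category set: existing consumers unaffected
    # remove_column, narrow_values, change_type, change_semantics all break consumers
    return False

print(is_backward_compatible({"kind": "add_column", "nullable": True}))   # True
print(is_backward_compatible({"kind": "add_column", "nullable": False}))  # False
print(is_backward_compatible({"kind": "remove_column"}))                  # False
```

In practice a gate like this would run in CI on every proposed contract change, routing breaking changes into the 30-day notice process instead of merging them.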

Question 8

What are the three types of behavioral tests in the CheckList framework, and what does each test for?

**Answer:** **Minimum Functionality Tests (MFT)** are the ML equivalent of unit tests: they check that the model achieves a minimum performance threshold on specific, focused test sets. Example: the model must outperform a popularity baseline on every user segment. **Invariance Tests (INV)** check that the model's predictions do not change when the input is perturbed in a way that should be irrelevant. Example: changing a user's display name should not change their recommendations. **Directional Expectation Tests (DIR)** check that the model's predictions change in a predictable direction when the input is perturbed in a meaningful way. Example: adding science fiction completions to a user's history should increase science fiction recommendation scores. Together, MFT, INV, and DIR tests check that the model has learned the right things (MFT), has not learned the wrong things (INV), and responds correctly to meaningful input changes (DIR).
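The three test types can be sketched against a stub scoring function; the model, perturbations, and thresholds below are all illustrative:

```python
def recommend_scores(user):
    """Stub model: scores depend on genre history only; display_name is ignored by construction."""
    return {"sci_fi": 0.2 + 0.1 * user["sci_fi_completions"], "drama": 0.5}

user = {"display_name": "alice", "sci_fi_completions": 1}

# MFT: a minimum-functionality floor on a focused case
assert recommend_scores(user)["drama"] >= 0.1

# INV: renaming the user must not change predictions
renamed = {**user, "display_name": "bob"}
assert recommend_scores(renamed) == recommend_scores(user)

# DIR: more sci-fi completions must raise the sci-fi score
more_sci_fi = {**user, "sci_fi_completions": 3}
assert recommend_scores(more_sci_fi)["sci_fi"] > recommend_scores(user)["sci_fi"]

print("MFT, INV, and DIR checks passed")
```

A real suite would run each check over many generated perturbations and report pass rates, but the assertion shapes stay the same.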

Question 9

Give an example of a model that achieves high accuracy on a standard holdout set but would fail behavioral tests. Why does this happen?

**Answer:** A recommendation model with Recall@20 = 0.22 might achieve this by performing excellently on high-activity users (who dominate the holdout set) while performing at random on new users or on a specific platform. It might also be sensitive to user display names (failing invariance tests) or insensitive to strong engagement signals like completion vs. skip (failing directional tests). This happens because aggregate holdout metrics average over the entire test population, masking pathologies on subgroups. A model can compensate for catastrophic failure on 10% of users by excelling on the other 90%. Ribeiro et al. (2020) demonstrated this systematically: commercial NLP models with >90% accuracy on standard benchmarks failed 30-50% of behavioral tests, revealing systematic weaknesses invisible to aggregate metrics.

Question 10

How do behavioral tests for a credit scoring model differ from those for a recommendation model?

**Answer:** Three key differences. First, **invariance tolerances are much higher** — credit scoring invariance tests require 0.98-0.99 overlap (near-perfect invariance to protected attributes like gender and zip code), while recommendation invariance tests tolerate 0.70-0.95 because some platform-specific effects are legitimate. Second, **directional tests encode economic relationships** that regulators expect — higher income should reduce predicted default risk, higher debt-to-income should increase it — and failing these tests is not just a quality issue but a regulatory finding. Third, **the test suite serves as regulatory documentation** — when an examiner asks "how do you verify your model does not discriminate?", the answer is the invariance test suite, its results, and its execution history. For recommendations, behavioral tests are quality assurance; for credit scoring, they are compliance infrastructure.

Question 11

What is the champion-challenger pattern, and why is it necessary for ML model deployment?

**Answer:** The **champion-challenger pattern** compares a newly trained model (the challenger) against the current production model (the champion) before permitting deployment. It is necessary because, unlike traditional software where a new version either works or does not, a new ML model can pass all unit tests, behavioral tests, and code reviews yet still be worse than the current model — due to training data differences, hyperparameter sensitivity, random seed effects, or subtle feature changes. The validation gate evaluates the challenger against the champion on multiple dimensions: holdout metrics (absolute floors and maximum regression), behavioral tests (MFT, INV, DIR), operational constraints (latency, model size), and sliced performance (per-segment metrics). Only if the challenger passes all checks is it promoted — first to shadow evaluation, then to canary deployment, then to full production.

Question 12

Explain the dual-threshold design in model validation gates: absolute floors and regression limits. Why are both needed?

**Answer:** **Absolute floors** set a minimum acceptable performance regardless of the champion (e.g., Recall@20 $\geq$ 0.15). **Regression limits** set the maximum allowed degradation from the champion (e.g., no more than 0.02 below the champion's Recall@20). Both are needed because each catches a failure mode the other misses. Without absolute floors, a gradually degrading champion could serve as the baseline for an even worse challenger — a "race to the bottom" where each model is slightly worse than the last, all within the regression limit. Without regression limits, a model that barely beats the absolute floor would be promoted even when it represents a significant regression from a strong champion. Together, the dual thresholds prevent both catastrophic failure (below the absolute floor) and gradual degradation (exceeding the regression limit from a strong baseline).
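The dual-threshold gate is a few lines of code; the floor (0.15) and regression limit (0.02) below follow the example values above:

```python
def passes_gate(challenger, champion, floor=0.15, max_regression=0.02):
    """Challenger must clear the absolute floor AND stay within the regression limit."""
    return challenger >= floor and challenger >= champion - max_regression

print(passes_gate(challenger=0.21, champion=0.22))  # True: above floor, regression 0.01
print(passes_gate(challenger=0.16, champion=0.22))  # False: above floor, but regresses 0.06
print(passes_gate(challenger=0.14, champion=0.15))  # False: within regression limit, below floor
```

The second and third calls are exactly the two failure modes the answer describes: each threshold alone would have let one of them through.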

Question 13

What is shadow evaluation, and what is its key advantage over holdout evaluation?

**Answer:** In **shadow evaluation**, both the champion and the challenger receive every production request. The champion's response is served to users; the challenger's response is logged but discarded. After a shadow period (typically 3-7 days), the two models' responses are compared on every request. The key advantage over holdout evaluation is that shadow evaluation tests the model on **live production traffic** — with its real distribution of users, items, contexts, and edge cases — rather than on a static, retrospective holdout set. This catches issues that holdout data misses: traffic pattern changes, seasonal effects, new user segments, and the interaction between model predictions and the serving infrastructure. Shadow evaluation answers the question "how will this model actually perform in production?" rather than "how would this model have performed on past data?"

Question 14

What is the four-stage validation pipeline for StreamRec, and why is the ordering important?

**Answer:** The four stages are: (1) **Offline evaluation** — holdout metrics and sliced metrics, taking minutes; (2) **Behavioral tests** — MFT, INV, DIR suites, taking minutes; (3) **Shadow evaluation** — live traffic comparison, taking 3-7 days; (4) **Canary deployment** — 5-10% real traffic, taking 3 days. The ordering is important because each stage is progressively more expensive and more realistic. A model that fails offline evaluation (stage 1) is blocked in minutes, before consuming resources for behavioral testing. A model that fails behavioral tests (stage 2) is blocked before the 3-7 day shadow evaluation period. This staged approach minimizes the cost of evaluating bad models while maximizing the rigor applied to good ones. Running the stages in reverse order would waste days of shadow evaluation on models that would have been caught by a 5-minute behavioral test.
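The cheap-to-expensive ordering amounts to short-circuiting at the first failing stage. A schematic sketch with placeholder stage checks (the metric keys and thresholds are illustrative):

```python
def run_validation_pipeline(model, stages):
    """Run stages in order of increasing cost; block at the first failure."""
    for name, check in stages:
        if not check(model):
            return f"blocked at {name}"
    return "promoted"

stages = [
    ("offline_eval",     lambda m: m["recall_at_20"] >= 0.15),          # minutes
    ("behavioral_tests", lambda m: m["behavioral_pass_rate"] == 1.0),   # minutes
    ("shadow_eval",      lambda m: m["shadow_agreement_ok"]),           # 3-7 days
    ("canary",           lambda m: m["canary_metrics_ok"]),             # ~3 days
]

bad_model = {"recall_at_20": 0.13}
print(run_validation_pipeline(bad_model, stages))  # blocked at offline_eval -- in minutes, not days
```

Note that `bad_model` never reaches the day-long stages, which is the entire point of the ordering.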

Question 15

What are the four categories of the ML Test Score (Breck et al., 2017)?

**Answer:** The four categories are: (1) **Tests for features and data** — schema validation, feature importance correlation, data pipeline unit tests, feature coverage, and data monitoring; (2) **Tests for model development** — training reproducibility, holdout quality, sliced quality, staleness detection, and proxy metrics; (3) **Tests for ML infrastructure** — training and serving infrastructure tests, API integration tests, model artifact reproducibility, and pipeline latency tests; (4) **Monitoring tests for ML** — model quality tracking, feature distribution monitoring, model age tracking, inference latency monitoring, and prediction bias tracking. Each category awards 0.5 points per test implemented. A total score below 5 indicates critical gaps; 5-10 indicates functional but immature; 10-15 indicates mature; above 15 indicates advanced.
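Under the scoring scheme described above (0.5 points per implemented test, banded totals), the score is a simple sum. A sketch with illustrative category counts (the banding at exactly 10 points is interpreted here as "mature"):

```python
def ml_test_score(implemented_counts, points_per_test=0.5):
    """Sum points per implemented test across the four categories, then band the total."""
    total = points_per_test * sum(implemented_counts.values())
    if total < 5:
        band = "critical gaps"
    elif total < 10:
        band = "functional but immature"
    elif total <= 15:
        band = "mature"
    else:
        band = "advanced"
    return total, band

counts = {"features_and_data": 5, "model_development": 4,
          "ml_infrastructure": 3, "monitoring": 4}
print(ml_test_score(counts))  # (8.0, 'functional but immature')
```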

Question 16

A feature normally has mean 0.5 and the model works well. After an upstream pipeline change, the feature's mean shifts to 0.05 but its schema (type, nullability, value range) is unchanged. Which validation layer catches this?

**Answer:** **Statistical validation** — specifically, PSI-based distribution shift detection or a statistical test on the feature's mean. Schema validation (Great Expectations or Pandera) would not catch this because the schema is unchanged: the column exists, has the correct type, is non-null, and values may still be within the valid range (e.g., 0 to 1). This is precisely why statistical validation is needed in addition to schema validation. The feature has the right structure but the wrong statistics. The PSI of this feature would likely exceed 0.25 (significant shift), triggering an alert. This is also why volume and freshness checks alone are insufficient — the data looks correct in every structural sense but has been silently corrupted.

Question 17

What is feature importance stability, and why does it complement PSI for drift detection?

**Answer:** **Feature importance stability** measures whether the model's reliance on features has changed across model versions — typically using Spearman rank correlation of importance rankings or Jaccard similarity of top-$k$ important features. It complements PSI because they measure different things: PSI measures **input distribution shift** (has the data changed?), while importance stability measures **model reliance shift** (has the model changed which features it depends on?). A feature can shift in distribution without affecting the model (the model is robust to that feature's drift). Conversely, the model's feature importance can change without any single feature shifting in distribution (e.g., if a correlated feature is added or removed). A comprehensive monitoring strategy tracks both: PSI for data health, importance stability for model health.
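The top-$k$ Jaccard variant takes only a few lines; a sketch with illustrative feature importances (Spearman rank correlation over the full ranking is the other common choice):

```python
def topk_jaccard(importances_a, importances_b, k=5):
    """Jaccard similarity of the top-k most important features of two model versions."""
    def top(imp):
        return set(sorted(imp, key=imp.get, reverse=True)[:k])
    a, b = top(importances_a), top(importances_b)
    return len(a & b) / len(a | b)

v1 = {"watch_hours": 0.9, "genre_affinity": 0.7, "recency": 0.5, "device": 0.2, "region": 0.1}
v2 = {"watch_hours": 0.8, "genre_affinity": 0.6, "recency": 0.5, "device": 0.3, "dayofweek": 0.25}
print(topk_jaccard(v1, v2, k=3))  # 1.0 -- the top-3 features are unchanged across versions
```

A value near 1.0 means the new model leans on the same features as the old one; a sharp drop flags a reliance shift worth investigating even if every feature's PSI is stable.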

Question 18

The `to_great_expectations_suite` method on `DataContract` (Section 28.5) auto-generates a GE suite from the contract definition. What is the advantage of deriving expectations from contracts rather than writing them independently?

**Answer:** Deriving expectations from contracts ensures **single-source-of-truth consistency**: the contract is the authoritative definition, and the validation rules are mechanically derived from it. This eliminates the risk of the contract and the validation suite drifting out of sync — a real problem when both are maintained independently. If the contract changes (e.g., a new column is added or a value range is tightened), the generated expectations automatically reflect the change. It also enforces completeness: every column in the contract gets non-null checks, type checks, and range checks by default, rather than relying on a human to remember to add expectations for new columns. The trade-off is that auto-generated expectations are necessarily generic; domain-specific checks (e.g., cross-column relationships, distributional properties) must still be added manually.
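The derivation idea, mechanically mapping contract fields to generic checks, can be sketched with plain dicts. The chapter's actual `to_great_expectations_suite` returns Great Expectations objects; everything below is a simplified stand-in with made-up check names:

```python
def derive_expectations(contract):
    """Mechanically derive generic checks from a contract-like dict: one source of truth."""
    checks = [{"check": "row_count_between",
               "min": contract["min_rows"], "max": contract["max_rows"]}]
    for col in contract["columns"]:
        checks.append({"check": "column_exists", "column": col["name"]})
        checks.append({"check": "column_type", "column": col["name"], "type": col["dtype"]})
        if not col.get("nullable", False):
            checks.append({"check": "not_null", "column": col["name"]})
        if "valid_range" in col:
            lo, hi = col["valid_range"]
            checks.append({"check": "values_between", "column": col["name"], "min": lo, "max": hi})
    return checks

contract = {
    "min_rows": 500_000, "max_rows": 50_000_000,
    "columns": [{"name": "user_id", "dtype": "int64"},
                {"name": "watch_hours", "dtype": "float64", "valid_range": (0.0, 24.0)}],
}
suite = derive_expectations(contract)
print(len(suite))  # 8: one volume check plus 3-4 generated checks per column
```

Adding a column to the contract regenerates its existence, type, and null checks automatically; only domain-specific checks still need a human.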

Question 19

A recommendation model passes all 12 behavioral tests but achieves Recall@20 = 0.13, below the 0.15 threshold. Another model achieves Recall@20 = 0.22 but fails 4 of 12 behavioral tests. Which model should be deployed? Why?

**Answer:** **Neither model should be deployed** — the validation gate should block both. The first model passes behavioral tests but does not meet the absolute performance floor, meaning it provides worse recommendations than the minimum acceptable standard. The second model has good aggregate performance but fails behavioral tests, meaning it has systematic pathologies (e.g., name sensitivity, lack of directional response, poor performance on specific segments) that aggregate metrics mask. This scenario illustrates why behavioral tests and aggregate metrics are **complementary, not substitutes**. Aggregate metrics catch models that are globally poor; behavioral tests catch models that are globally adequate but locally pathological. A deployable model must pass both. The correct action is to investigate why each model fails and iterate.

Question 20

Explain the relationship between data contracts, data validation, and data testing. How do they form a hierarchy?

**Answer:** The three form a hierarchy of increasing specificity. **Data contracts** are the highest-level agreements: they define what data should look like (schema, semantics, quality, SLA) and who is responsible for it. They are organizational documents that codify agreements between teams. **Data validation** is the enforcement mechanism: Great Expectations and Pandera check whether actual data conforms to the expectations derived from contracts. Validation runs at specific points in the pipeline (ingestion checkpoints, function boundaries) and produces pass/fail results. **Data testing** is the broadest category, encompassing not just contract-derived validation but also statistical tests (PSI, distribution shift), temporal checks (day-over-day consistency), and integration tests (end-to-end pipeline correctness). Contracts define the rules; validation enforces them; testing encompasses the full range of checks that ensure data quality throughout the ML lifecycle. A mature system has all three layers, with contracts generating validation rules and testing covering the gaps that contracts do not address.