Chapter 28: Exercises
Exercises are graded by difficulty:

- One star (*): Apply the technique from the chapter to a new dataset or scenario
- Two stars (**): Extend the technique or combine it with a previous chapter's methods
- Three stars (***): Derive a result, implement from scratch, or design a system component
- Four stars (****): Research-level problems that connect to open questions in the field
Data Validation with Great Expectations
Exercise 28.1 (*)
A ride-sharing platform ingests trip records with the following schema:
| Column | Type | Description |
|---|---|---|
| trip_id | string | Unique trip identifier |
| driver_id | string | Driver identifier |
| rider_id | string | Rider identifier |
| pickup_lat | float | Pickup latitude |
| pickup_lon | float | Pickup longitude |
| dropoff_lat | float | Dropoff latitude |
| dropoff_lon | float | Dropoff longitude |
| distance_miles | float | Trip distance |
| duration_minutes | float | Trip duration |
| fare_usd | float | Trip fare |
| timestamp | datetime | Trip start time |
(a) Write a Great Expectations expectation suite for this data. Include at least 10 expectations covering schema, non-null constraints, value ranges, and volume.
(b) What value of the mostly parameter would you set for the distance_miles range check (0 to 200 miles)? Justify your choice.
(c) The platform operates in 3 cities. Design an expectation that checks whether the distribution of trips across cities has shifted relative to last month. Which GE expectation type would you use?
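As a starting point for part (a), expectations can be written in the dict-based configuration form that Great Expectations serializes into suite JSON; the fragment below shows three of the required ten, including a mostly value that tolerates a small fraction of outliers (the specific 0.999 is an illustrative assumption, not a prescribed answer).

```python
# A sketch of three expectations in Great Expectations' dict configuration
# form. The `mostly` kwarg lets a check pass if at least that fraction of
# rows satisfies it.
suite_fragment = [
    {
        "expectation_type": "expect_column_values_to_not_be_null",
        "kwargs": {"column": "trip_id"},
    },
    {
        "expectation_type": "expect_column_values_to_be_between",
        "kwargs": {"column": "pickup_lat", "min_value": -90, "max_value": 90},
    },
    {
        "expectation_type": "expect_column_values_to_be_between",
        # mostly=0.999 tolerates up to 0.1% of trips outside 0-200 miles
        "kwargs": {"column": "distance_miles", "min_value": 0,
                   "max_value": 200, "mostly": 0.999},
    },
]
```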
Exercise 28.2 (*)
The StreamRec event data (Section 28.2) normally has a duration_seconds null rate of 25% (because non-view events have null duration). On March 15, the null rate jumps to 68%.
(a) Which Great Expectations expectation would catch this? Write the expectation configuration.
(b) The root cause is that the iOS client (version 4.2.1) stopped logging duration for view events after an update. The Android and web clients are unaffected. Design a set of expectations that would distinguish this failure mode from a general logging outage.
(c) What downstream impact would this undetected failure have on the recommendation model? Trace the impact through feature engineering, model training, and serving.
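Part (b) hinges on slicing the null rate by client and version rather than checking it globally. A minimal pandas sketch, using hypothetical column names and a tiny hand-made sample:

```python
import pandas as pd

# Hypothetical events: the iOS 4.2.1 client drops duration on view events,
# while Android, web, and the older iOS build are unaffected.
events = pd.DataFrame({
    "client":  ["ios", "ios", "android", "web", "ios", "android"],
    "version": ["4.2.1", "4.2.1", "9.0", "1.0", "4.2.0", "9.0"],
    "event":   ["view", "view", "view", "view", "view", "click"],
    "duration_seconds": [None, None, 120.0, 95.0, 88.0, None],
})

# Null rate per (client, version): a spike confined to one cell points to a
# client-side regression; a spike across all cells points to a logging outage.
null_rate = (events.groupby(["client", "version"])["duration_seconds"]
                   .apply(lambda s: s.isna().mean()))
print(null_rate)
```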
Exercise 28.3 (*)
Write a Pandera schema for a feature table with the following columns:
| Column | Type | Constraints |
|---|---|---|
| user_id | string | Non-null, unique |
| engagement_score | float | Between 0 and 1, non-null |
| days_since_signup | int | Non-negative, non-null |
| preferred_category | string | One of 12 categories, non-null |
| avg_session_minutes | float | Non-negative, nullable |
Include at least one cross-column check (e.g., users with days_since_signup < 7 should have engagement_score < 0.95 because new users rarely reach maximum engagement immediately).
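The Pandera schema itself is left to the reader, but the cross-column rule is the part that usually trips people up; in Pandera it becomes a dataframe-level Check, and its core logic is just a boolean mask, sketched here in plain pandas:

```python
import pandas as pd

# Toy feature rows; u2 is a new user with implausibly high engagement.
features = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "engagement_score": [0.40, 0.97, 0.50],
    "days_since_signup": [3, 5, 30],
})

# Cross-column rule: users signed up < 7 days ago should not already sit at
# near-maximum engagement.
violations = features[(features["days_since_signup"] < 7) &
                      (features["engagement_score"] >= 0.95)]
print(violations["user_id"].tolist())  # ['u2']
```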
Exercise 28.4 (**)
You inherit a feature engineering pipeline with no data validation. The pipeline computes 50 features from 3 source tables and serves a gradient-boosted model. You have access to 90 days of historical feature data.
(a) Design a bootstrapping strategy to create a Great Expectations suite from the historical data. How would you use great_expectations profile to generate initial expectations, and how would you refine them?
(b) For each of the 50 features, you compute PSI relative to the 90-day baseline. 5 features show PSI > 0.25. Before raising an alarm, what additional analysis would you perform? (Consider: are these features important to the model? Is the shift gradual or sudden? Is the shift in the direction of a known seasonal pattern?)
(c) The model owner says "we've been running without validation for 2 years and nothing bad has happened." Write a 3-paragraph memo explaining why this is survivorship bias and why validation should still be implemented.
Data Contracts
Exercise 28.5 (*)
Define a DataContract (using the class from Section 28.5) for the following scenario: the marketing team produces a daily export of campaign metadata (campaign ID, start date, end date, budget, target audience segment, channel). The ML team uses this data to train an ad click prediction model.
(a) Define the column-level contracts, including types, nullability, and value ranges.
(b) Set the delivery_sla, freshness_requirement, and breaking_change_policy appropriate for this use case.
(c) The marketing team wants to add a new column campaign_objective (one of: awareness, engagement, conversion). Is this a backward-compatible change? Should it require consumer notification?
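A minimal sketch of the contract shape this exercise asks for. Only delivery_sla, freshness_requirement, and breaking_change_policy come from the exercise text; the column-contract structure, example values, and names like budget_usd are illustrative assumptions, not the Section 28.5 class verbatim.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ColumnContract:
    name: str
    dtype: str
    nullable: bool = False
    allowed_values: Optional[list] = None  # enumerated values, if any

@dataclass
class CampaignContract:
    columns: list = field(default_factory=list)
    delivery_sla: str = "daily export complete by 06:00 UTC"
    freshness_requirement: str = "no record older than 24 hours at delivery"
    breaking_change_policy: str = "30-day notice with a dual-publish transition"

contract = CampaignContract(columns=[
    ColumnContract("campaign_id", "string"),
    ColumnContract("budget_usd", "float"),
    ColumnContract("channel", "string",
                   allowed_values=["search", "social", "display", "email"]),
])
```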
Exercise 28.6 (**)
Two teams at an e-commerce company are in conflict. Team A (data engineering) produces a user activity table. Team B (ML) consumes it for a churn prediction model. Team A recently renamed last_login_date to most_recent_login_ts and changed its type from DATE to TIMESTAMP. Team B's pipeline silently produced NaN features for 3 days before anyone noticed.
(a) Write a contract test that would have caught this failure at the pipeline boundary.
(b) Design a schema evolution policy that permits this change while giving Team B time to adapt. Include: deprecation notice period, dual-column transition period, and final removal.
(c) What organizational process would prevent this from happening in the future? Consider: contract registry, CI-based contract validation, change approval workflows.
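For part (a), the essential property of a boundary contract test is that a rename or type change fails loudly instead of flowing through as NaNs. One possible sketch (the expected-schema dict is an assumption about Team B's requirements):

```python
import pandas as pd

# The columns and dtypes Team B's pipeline depends on.
EXPECTED_COLUMNS = {"user_id": "object", "last_login_date": "datetime64[ns]"}

def check_contract(df: pd.DataFrame) -> list:
    """Return human-readable contract violations at the pipeline boundary."""
    errors = []
    for col, dtype in EXPECTED_COLUMNS.items():
        if col not in df.columns:
            errors.append(f"missing column: {col}")
        elif str(df[col].dtype) != dtype:
            errors.append(f"{col}: expected {dtype}, got {df[col].dtype}")
    return errors

# After Team A's rename, the check fails immediately rather than after
# 3 days of silent NaN features.
renamed = pd.DataFrame({"user_id": ["u1"],
                        "most_recent_login_ts": pd.to_datetime(["2024-01-01"])})
print(check_contract(renamed))  # ['missing column: last_login_date']
```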
Exercise 28.7 (**)
A large ML platform has 200 data contracts between 15 producer teams and 30 consumer teams. On average, each contract has 20 columns. The platform processes 500 million rows per day.
(a) Calculate the total number of column-level validations per day (assuming each column has 3 expectations).
(b) Great Expectations validation on this volume takes 4 hours. The data must be available to consumers within 6 hours of production. Design a sampling strategy that reduces validation time to under 30 minutes while maintaining statistical power for detecting shifts affecting $\geq 1\%$ of rows.
(c) Some contracts involve multiple consumers with different quality requirements (one consumer tolerates 5% nulls; another requires $< 0.1\%$). How would you model tiered SLAs within the DataContract class?
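For part (b), the sample size needed to detect a shift affecting a given fraction of rows can be estimated with the standard two-proportion normal approximation; the z-values below assume a one-sided test at alpha = 0.05 with 80% power, and the specific rates are illustrative.

```python
import math

def sample_size(p0: float, p1: float,
                z_alpha: float = 1.645, z_beta: float = 0.84) -> int:
    """Normal-approximation sample size to detect a proportion shift from
    p0 to p1 (one-sided test; defaults: alpha=0.05, power=0.80)."""
    num = (z_alpha * math.sqrt(p0 * (1 - p0)) +
           z_beta * math.sqrt(p1 * (1 - p1)))
    return math.ceil((num / (p1 - p0)) ** 2)

# Detecting a violation-rate jump from 0.1% to 1.1% (a 1-percentage-point
# shift) needs only a few hundred sampled rows -- far fewer than 500M.
n = sample_size(0.001, 0.011)
print(n)
```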
Behavioral Testing
Exercise 28.8 (*)
Design 3 invariance tests, 3 directional tests, and 3 minimum functionality tests for a sentiment analysis model that classifies product reviews as positive, negative, or neutral.
For each test, specify: the perturbation (or test set), the expected behavior, and the failure mode the test is designed to catch.
Exercise 28.9 (*)
The StreamRec invariance test inv_user_name_change (Section 28.8) requires 95% Jaccard similarity between recommendations before and after changing the user's display name.
(a) A colleague argues that 95% is too strict because the model uses character-level embeddings of user-generated content titles, and changing a user's name could affect collaborative filtering through name-content correlations. Evaluate this argument. Under what conditions would you relax the threshold?
(b) Another colleague proposes testing invariance to user age. Design this test. What tolerance would you set, and why? (Consider: age is legitimately relevant to content preferences — this is not the same as name invariance.)
(c) Is it possible for a model to pass all invariance tests and still be unfair? Give a concrete example.
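The Jaccard threshold in this exercise compares recommendation slates as sets; a minimal implementation shows how sharply a single swapped item moves the score on a short slate:

```python
def jaccard(a, b) -> float:
    """Jaccard similarity between two recommendation slates, treated as sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

before = ["m1", "m2", "m3", "m4", "m5"]
after  = ["m1", "m2", "m3", "m4", "m6"]  # one item changed after the rename
# 4 shared items over a union of 6: 4/6 = 0.667, well below a 0.95 threshold,
# so on top-5 slates even one swap fails the test.
print(jaccard(before, after))
```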
Exercise 28.10 (**)
The directional test dir_scifi_affinity (Section 28.8) checks that adding science fiction completions increases sci-fi recommendation scores. But it does not check the magnitude of the effect.
(a) Design a quantitative directional test that checks not just direction but also magnitude. Specifically: adding 10 sci-fi completions should increase the average sci-fi recommendation score by at least 15%.
(b) A model passes the directional test but with a suspiciously large effect: adding 10 sci-fi completions causes the model to recommend only sci-fi content, suppressing all other genres. Design a test that catches this "over-steering" failure mode.
(c) Formalize the concept of "directional sensitivity" as a derivative: for a model $f$ and a perturbation $\Delta$, define $\text{sensitivity}(\Delta) = \frac{\partial \bar{s}_{\text{genre}}}{\partial |\Delta|}$ where $\bar{s}_{\text{genre}}$ is the mean score for items in the target genre. Implement this numerically and discuss what "healthy" sensitivity values look like.
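For part (c), the derivative can be approximated by finite differences: add the perturbation items one at a time and difference the resulting mean genre scores. The toy model below (linear gain of 0.02 per added sci-fi item, saturating at 0.9) is purely an assumption for illustration.

```python
import numpy as np

def sensitivity(model_score, history, delta_items, genre, step=1):
    """Finite-difference estimate of d(mean genre score)/d|Delta|:
    add `step` perturbation items at a time and difference the scores."""
    scores = [model_score(history + delta_items[:k], genre)
              for k in range(0, len(delta_items) + 1, step)]
    return np.diff(scores) / step  # one slope estimate per step

def toy_model(history, genre):
    # Hypothetical model: +0.02 mean sci-fi score per sci-fi completion,
    # capped at 0.9.
    n = sum(1 for item in history if item == "scifi")
    return min(0.5 + 0.02 * n, 0.9)

sens = sensitivity(toy_model, [], ["scifi"] * 5, "scifi")
print(sens)  # roughly constant 0.02 per item in the linear regime
```

A "healthy" curve is positive and bounded: flat near zero suggests the model ignores the signal, while a slope that keeps growing suggests the over-steering failure of part (b).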
Exercise 28.11 (**)
Ribeiro et al. (2020) applied CheckList to commercial NLP APIs and found that models with >90% accuracy on standard benchmarks failed 30-50% of behavioral tests. Replicate a simplified version of this finding:
(a) Using the StreamRec behavioral test framework, construct a scenario where a recommendation model achieves Recall@20 = 0.22 (above the M12 threshold) but fails 4 out of 12 behavioral tests. Describe the model's pathology.
(b) Construct the reverse scenario: a model that passes all 12 behavioral tests but achieves Recall@20 = 0.13 (below the threshold). What does this tell you about the relationship between aggregate metrics and behavioral tests?
(c) Argue that behavioral tests and aggregate metrics are complementary, not substitutes. A good testing strategy requires both. What does each catch that the other misses?
Exercise 28.12 (***)
Extend the BehavioralTestSuite class to support metamorphic testing: tests where the relationship between multiple inputs and outputs is specified, rather than testing a single input-output pair.
Example metamorphic relation for a recommendation model: if user A and user B have identical interaction histories except that user A has watched 5 additional comedy items, then user A's comedy recommendation scores should be higher than user B's comedy scores.
(a) Design the MetamorphicTestCase dataclass and implement the run_metamorphic_test method.
(b) Write 3 metamorphic test cases for the StreamRec model that cannot be expressed as INV or DIR tests.
(c) Discuss the relationship between metamorphic testing and the concept of monotonicity in machine learning (i.e., monotonic constraints in gradient boosting). When are they equivalent?
Model Validation Gates
Exercise 28.13 (*)
The StreamRec validation gate (Section 28.10) uses maximum_regression=0.02 for Recall@20. A new model achieves Recall@20 = 0.195, while the champion has Recall@20 = 0.210. The gate blocks the model.
(a) Is this the correct decision? The regression is 0.015, which is below the threshold of 0.02. Re-read the MetricComparison.__post_init__ logic and verify.
(b) The model developer argues that the new model has significantly better NDCG@20 (0.16 vs. 0.14) and should be promoted despite the Recall@20 regression. How should the gate handle multi-metric trade-offs?
(c) Design a gate configuration that permits trade-offs between Recall@20 and NDCG@20 by using a composite metric: $\text{score} = 0.6 \times \text{Recall@20} + 0.4 \times \text{NDCG@20}$.
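Working the composite metric through on the numbers given in this exercise is instructive: the weighting, not just the raw metrics, decides the outcome.

```python
def composite_score(recall_at_20: float, ndcg_at_20: float) -> float:
    """Composite gate metric from part (c): 0.6 * Recall@20 + 0.4 * NDCG@20."""
    return 0.6 * recall_at_20 + 0.4 * ndcg_at_20

champion   = composite_score(0.210, 0.14)  # 0.182
challenger = composite_score(0.195, 0.16)  # 0.181
# Under this weighting the challenger's NDCG gain does not quite offset its
# Recall regression -- it still loses, by 0.001.
print(champion, challenger)
```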
Exercise 28.14 (**)
Meridian Financial's credit scoring validation gate must satisfy regulatory requirements that go beyond accuracy metrics.
(a) Add the following checks to the ValidationGateConfig:
- Adverse impact ratio (approval rate for protected group / approval rate for control group) must be $\geq 0.80$ (the four-fifths rule)
- Model must be stable: PSI of the score distribution between development and validation samples must be $< 0.10$
- Top 5 reason codes must cover $\geq 90\%$ of adverse actions
(b) Implement these checks as additional methods on the ModelValidationGate class.
(c) A model passes all accuracy and fairness checks but was trained on a dataset that excluded applicants under age 21 (due to a data pipeline bug). The model's predictions for 18-20 year olds are unreliable. Which gate check (existing or new) would catch this? If none, design one.
Exercise 28.15 (**)
The champion-challenger pattern assumes a single champion. In practice, some systems run multiple models simultaneously (e.g., a general model and a cold-start model).
(a) Extend the ModelValidationGate to support a multi-champion configuration where different champions serve different user segments.
(b) A challenger model is better than Champion A on Segment 1 but worse on Segment 2, and better than Champion B on Segment 2 but worse on Segment 1. Design a gate decision logic for this scenario.
(c) How does the multi-champion pattern interact with A/B testing? If two models serve different segments, what is the correct experimental design for comparing a new model against the current multi-model system?
Exercise 28.16 (***)
The shadow evaluation (Section 28.11) compares champion and challenger recommendations on the same requests. But the champion's recommendations were served to users, while the challenger's were not — meaning we only observe user feedback for the champion's recommendations.
(a) Explain why this creates a counterfactual evaluation problem. Connect to the potential outcomes framework (Chapter 16): what are the potential outcomes, and which are observed?
(b) Inverse propensity scoring (Chapter 18) can correct this bias. Design a shadow evaluation that uses IPS-weighted metrics. What propensity score is needed, and how is it estimated?
(c) Under what conditions does the naive shadow evaluation (ignoring counterfactual bias) still provide a useful comparison? When does it fail catastrophically?
Statistical Validation and Drift Detection
Exercise 28.17 (*)
Compute the PSI between the following two distributions:
Reference (training data, 1000 observations):
| Bin | [0, 0.2) | [0.2, 0.4) | [0.4, 0.6) | [0.6, 0.8) | [0.8, 1.0] |
|---|---|---|---|---|---|
| Count | 200 | 250 | 300 | 150 | 100 |
Current (today's serving data, 1200 observations):
| Bin | [0, 0.2) | [0.2, 0.4) | [0.4, 0.6) | [0.6, 0.8) | [0.8, 1.0] |
|---|---|---|---|---|---|
| Count | 120 | 180 | 360 | 300 | 240 |
(a) Compute the per-bin PSI contributions and the total PSI.
(b) Interpret the result using the standard thresholds (< 0.1, 0.1-0.25, > 0.25).
(c) Which bins contribute most to the PSI? What does this suggest about the nature of the distribution shift?
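A small helper makes parts (a) through (c) mechanical; the sketch below implements PSI from per-bin counts and is sanity-checked on toy data rather than the exercise's tables, so the answer is left to the reader.

```python
import math

def psi(ref_counts, cur_counts, eps=1e-6):
    """Population Stability Index from per-bin counts:
    PSI = sum over bins of (cur% - ref%) * ln(cur% / ref%)."""
    ref_total, cur_total = sum(ref_counts), sum(cur_counts)
    total = 0.0
    for r, c in zip(ref_counts, cur_counts):
        p = max(r / ref_total, eps)  # guard against empty bins
        q = max(c / cur_total, eps)
        total += (q - p) * math.log(q / p)
    return total

# Sanity checks on toy data (not the Exercise 28.17 tables):
print(psi([100, 100, 100], [100, 100, 100]))  # 0.0 -- identical shapes
print(psi([300, 100], [100, 300]) > 0.25)     # True -- a large shift
```

Returning per-bin terms instead of the sum is a one-line change and directly answers part (c).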
Exercise 28.18 (**)
PSI uses equal-width bins, which can be sensitive to the choice of bin count.
(a) Compute PSI for the distributions in Exercise 28.17 using 3, 5, 10, and 20 bins. How does the value change?
(b) Propose an alternative binning strategy that is more robust to the number of bins. (Hint: consider quantile-based bins from the reference distribution.)
(c) The Kolmogorov-Smirnov (KS) test is an alternative to PSI for drift detection. Compare PSI and KS on the following dimensions: sensitivity to bin count, sensitivity to sample size, interpretability, and ability to localize the shift.
Exercise 28.19 (***)
Feature importance stability (Section 28.4) uses Spearman rank correlation. Design and implement a more comprehensive feature stability analysis:
(a) Implement a FeatureStabilityDashboard class that tracks feature importance across $N$ model versions and flags features whose importance rank changes by more than $k$ positions in any consecutive pair of versions.
(b) A feature that was ranked #3 in version 1, #4 in version 2, #5 in version 3, and #15 in version 4 shows a gradual decline followed by a sudden drop. Design a change-point detection algorithm for feature importance time series.
(c) Feature importance stability and PSI measure different things: PSI measures input distribution shift; importance stability measures model reliance shift. A feature can shift in distribution without changing importance (the model is robust to the shift), and importance can change without distribution shift (a correlated feature was added). Design a 2x2 analysis that considers both dimensions and prescribes different responses for each quadrant.
Comprehensive Testing Strategy
Exercise 28.20 (**)
Evaluate the ML Test Score for a system you have built or worked with (or use the StreamRec system from the progressive project). For each of the 20 rubric items:
(a) Mark whether the test is implemented (yes/no).
(b) For each unimplemented test, estimate the effort (in person-days) to implement it and the risk reduction it would provide.
(c) Rank the unimplemented tests by risk-reduction-per-effort ratio. Which 3 tests should be implemented first?
Exercise 28.21 (**)
Design a complete testing strategy for a medical image classification system that classifies chest X-rays as normal or showing pneumonia.
(a) List the data validation checks (schema, completeness, distributional) that are specific to medical imaging data. Consider: image dimensions, bit depth, DICOM metadata, patient demographics.
(b) Design 5 behavioral tests for the model. Consider: invariance to image rotation (< 5 degrees), invariance to brightness adjustment, directional (larger pneumonia opacity should increase confidence), minimum functionality (AUC $\geq 0.90$ on every hospital site).
(c) Design the model validation gate. What additional checks are required for a medical device (compared to StreamRec)? Consider: FDA regulatory requirements, sensitivity/specificity trade-offs, site-specific performance.
Exercise 28.22 (***)
The chapter presents data validation, behavioral testing, and model validation gates as separate components. Design an integrated system that:
(a) Automatically generates behavioral tests from data contract specifications. For example, if a contract specifies that event_type must be in {view, click, complete, skip}, generate an MFT that the model handles all four event types correctly.
(b) Uses model validation gate failures to automatically update the behavioral test suite. If the gate blocks a model because Recall@20 on new Android users is below threshold, automatically add a per-segment MFT for that slice.
(c) Discuss the risks of this automation. Can automatically generated tests create a false sense of security? When is human judgment essential in test design?
Exercise 28.23 (***)
The StreamRec testing infrastructure (M12) was built for a recommendation model. Adapt the complete strategy for a large language model (LLM) used for customer support chat:
(a) Data validation: what does "data quality" mean for conversational data? Design expectations for chat transcripts (message length, response time, language detection, PII detection).
(b) Behavioral testing: design MFT, INV, and DIR tests for a customer support chatbot. Consider: invariance to customer name, directional (more context should improve answer quality), MFT (must correctly answer the 50 most common questions).
(c) Model validation gates: the LLM is updated via fine-tuning. Design a gate that evaluates the fine-tuned model against the base model on: helpfulness, harmlessness, honesty, and response latency. What metrics would you use for each dimension?
Exercise 28.24 (****)
The ML Test Score (Breck et al., 2017) treats all tests equally (0.5 points each). In practice, some tests are more important than others — a missing data validation test in a medical system is far more dangerous than a missing ablation study.
(a) Design a risk-weighted ML Test Score that assigns weights to tests based on the severity and likelihood of the failure mode each test prevents. Define a risk matrix with at least 4 severity levels and 3 likelihood levels.
(b) Apply your risk-weighted rubric to 3 different systems: StreamRec (recommendations), Meridian Financial (credit scoring), and a medical image classifier. Show that the same set of implemented tests produces different risk-weighted scores depending on the domain.
(c) Propose a method for calibrating the risk weights empirically using incident data. If an organization has a history of production ML incidents, how would you use incident post-mortems to set risk weights?
Exercise 28.25 (****)
A fundamental challenge in ML testing is the oracle problem: there is no ground truth to test against. In software testing, assert add(2, 3) == 5 works because we know the correct answer. In ML, we do not know the "correct" recommendation for a specific user.
(a) Categorize the testing approaches in this chapter by their relationship to the oracle problem. Which approaches require a ground truth oracle (e.g., holdout evaluation)? Which approaches avoid it (e.g., invariance tests)?
(b) Metamorphic testing (Exercise 28.12) is one approach to the oracle problem. Property-based testing (testing invariants rather than specific outputs) is another. Compare these approaches and discuss their relative strengths for ML systems.
(c) Propose a research direction for ML testing that addresses a gap not covered by current approaches. Consider: testing for model uncertainty calibration, testing for robustness to adversarial inputs, testing for out-of-distribution detection, or testing for causal correctness.
Exercise 28.26 (**)
The PipelineContractEnforcer (Section 28.6) validates data at pipeline boundaries. Extend it to support temporal contracts: assertions about how data changes over time.
(a) Design a TemporalContract class that specifies: maximum day-over-day row count change (e.g., $\pm 20\%$), maximum day-over-day null rate change per column, and maximum day-over-day mean/variance shift per numeric column.
(b) Implement the temporal validation logic. The enforcer must maintain a rolling 7-day history of validation results to detect gradual shifts.
(c) A temporal contract detects a gradual increase in null rate for duration_seconds over 10 days: 25%, 26%, 27%, ..., 35%. No single day's change exceeds 2%, so a simple day-over-day check would not flag it. Design a cumulative shift detection check that catches this pattern.
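One simple shape for part (c) is to compare each day against a fixed baseline rather than against yesterday, so small daily steps accumulate; the 5-percentage-point tolerance below is an illustrative assumption.

```python
def cumulative_shift(null_rates, baseline, cum_threshold=0.05):
    """Return the first day (1-indexed) on which the null rate has drifted
    more than `cum_threshold` from `baseline`, or None if it never does.
    Catches slow creep that a day-over-day delta check misses."""
    for day, rate in enumerate(null_rates, start=1):
        if abs(rate - baseline) > cum_threshold:
            return day
    return None

# The 10-day creep from part (c): 26%, 27%, ..., 35%. Each daily step is
# only 1pp, but cumulative drift from the 25% baseline crosses 5pp on
# day 6 (31%).
rates = [0.25 + 0.01 * d for d in range(1, 11)]
print(cumulative_shift(rates, baseline=0.25))  # 6
```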
Exercise 28.27 (***)
The StreamRec pipeline runs data validation, behavioral testing, and model validation gates sequentially. This adds 30-45 minutes to the training pipeline.
(a) Analyze which tests can be parallelized. Draw a dependency graph showing the execution order. What is the critical path?
(b) Design a tiered testing strategy where fast tests (schema validation, smoke tests) run first and expensive tests (behavioral tests, shadow evaluation) run only if fast tests pass. Estimate the time savings for the common case (everything passes) and the worst case (first test fails).
(c) Some organizations run expensive tests asynchronously — the model is deployed after passing fast tests, and behavioral tests run in parallel. If a behavioral test fails after deployment, the model is automatically rolled back. Analyze the risk-benefit trade-off of this approach compared to synchronous gating.