Case Study 1: StreamRec Testing Infrastructure — From "Works on My Laptop" to Production-Grade Validation

Context

Six months after launching the production recommendation system (Case Study, Chapter 24), the StreamRec ML team has experienced three incidents that share a common root cause: the absence of systematic testing infrastructure.

Incident 1 (Week 8): Silent Feature Corruption. A refactoring of the feature engineering pipeline changed the order of operations for the engagement_rate feature. The old pipeline computed completions / views; the new pipeline computed completions / (views + clicks), producing lower values. No tests existed to detect this change. The model retrained on corrupted features, and Recall@20 dropped from 0.21 to 0.16 — a 24% degradation. The team discovered the issue 11 days later through a routine offline evaluation, by which time the degraded model had served 140 million recommendation requests.
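A single pinned unit test on the feature formula would have caught this refactoring change at commit time. The sketch below is illustrative; the function name and signature are assumptions, not the StreamRec codebase:

```python
# Hypothetical sketch: a unit test pinning the engagement_rate definition.
# A refactor that silently changed the denominator to (views + clicks)
# would fail the first assertion.

def engagement_rate(completions: int, views: int) -> float:
    """Engagement rate as originally defined: completions / views."""
    return completions / views if views else 0.0

def test_engagement_rate_definition():
    assert engagement_rate(completions=30, views=100) == 0.3
    # Edge case: zero views must not divide by zero.
    assert engagement_rate(completions=0, views=0) == 0.0

test_engagement_rate_definition()
```

The value of the test is not the arithmetic; it is that the formula's semantics are written down somewhere a refactor cannot silently change them.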

Incident 2 (Week 14): Schema Change. The client engineering team added a new event type, bookmark, to the click stream. The feature engineering code, which at the time had no Pandera validation, assumed exactly four event types (view, click, complete, skip) and computed click_rate = n_clicks / n_events. With bookmark events included in n_events but absent from every numerator, all engagement rates dropped by 5-8%. The retrained model began underweighting high-engagement users. The team discovered the issue only after a product manager noticed a decline in engagement metrics for the top 10% of users.
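The failure mode is a closed-world assumption: the code treated the observed event types as the complete set. A minimal allowlist check at the pipeline boundary makes that assumption explicit and fails fast. Field names here are illustrative, not the real event schema:

```python
# Illustrative allowlist check that would have surfaced the unexpected
# `bookmark` event type before any feature was computed.

ALLOWED_EVENT_TYPES = {"view", "click", "complete", "skip"}

def check_event_types(events: list[dict]) -> set[str]:
    """Return event types outside the allowed set (empty set = pass)."""
    seen = {e["event_type"] for e in events}
    return seen - ALLOWED_EVENT_TYPES

events = [
    {"event_type": "view"},
    {"event_type": "click"},
    {"event_type": "bookmark"},  # the new, uncontracted event type
]
unexpected = check_event_types(events)
# `unexpected` is non-empty here, so the pipeline can halt before retraining
```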

Incident 3 (Week 20): Regression Deployment. A new model trained on 90 days of data (instead of the usual 60 days) showed improved Recall@20 on the holdout set (0.23 vs. 0.21). The team deployed it directly to production. Within 24 hours, the ML on-call received complaints: the model was recommending stale content from 2+ months ago with disproportionate frequency. The 90-day training window included a holiday period with atypical viewing patterns, and the model had learned to overweight content popular during that period. No behavioral test existed to check content freshness in recommendations.

The VP of Engineering mandated a testing infrastructure investment: no model reaches production without passing automated validation. The team had 6 weeks.

The Strategy

The team adopted the four-layer testing strategy from this chapter and implemented it in three phases.

Phase 1: Data Validation (Weeks 1-2)

The team built a Great Expectations suite for the click stream data with 22 expectations:

| Category | Count | Examples |
| --- | --- | --- |
| Schema | 5 | Column set match, column types, primary key uniqueness |
| Completeness | 4 | Non-null checks on required columns, `mostly` thresholds calibrated from 90 days of history |
| Value ranges | 5 | event_type in set, platform in set, duration_seconds in [0, 14400] |
| Volume | 2 | Row count between 1M and 40M, unique user count between 500K and 10M |
| Statistical | 4 | PSI < 0.20 for event_type distribution, duration_seconds mean, platform distribution, null rate for duration_seconds |
| Freshness | 2 | Most recent timestamp within 2 hours, data partition date matches expected date |
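The statistical checks rely on the Population Stability Index. A minimal hand-rolled PSI for categorical distributions looks like the following; the smoothing constant and the example distributions are illustrative, not the team's calibrated baselines:

```python
import math

# Minimal PSI (Population Stability Index) over categorical distributions.
# PSI = sum over categories of (actual - expected) * ln(actual / expected).
# eps guards against log(0) for categories absent from one side.

def psi(expected: dict[str, float], actual: dict[str, float],
        eps: float = 1e-6) -> float:
    total = 0.0
    for cat in set(expected) | set(actual):
        e = expected.get(cat, 0.0) + eps
        a = actual.get(cat, 0.0) + eps
        total += (a - e) * math.log(a / e)
    return total

baseline = {"view": 0.60, "click": 0.25, "complete": 0.10, "skip": 0.05}
today    = {"view": 0.58, "click": 0.26, "complete": 0.11, "skip": 0.05}
assert psi(baseline, today) < 0.20  # a stable distribution passes the check
```

Every term of the sum is non-negative, so PSI is zero only when the two distributions match; the 0.20 threshold above flags a meaningful shift.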

The team also added a Pandera schema to the feature engineering functions. The UserEventSchema class (Section 28.3) validated the click stream DataFrame at the function boundary, and an EngagementFeatureSchema validated the output of the feature engineering step.
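Conceptually, a boundary schema behaves like the following hand-rolled sketch; this is pure Python standing in for the Pandera API, and the column names and bounds are assumptions mirroring the expectations above:

```python
from functools import wraps

# Hand-rolled stand-in for a Pandera-style boundary check (illustrative,
# NOT the Pandera API): validate rows on the way into a feature function.

EVENT_SCHEMA = {"user_id": str, "event_type": str, "duration_seconds": float}
ALLOWED_EVENT_TYPES = {"view", "click", "complete", "skip"}

def validate_events(fn):
    @wraps(fn)
    def wrapper(rows):
        for i, row in enumerate(rows):
            for col, typ in EVENT_SCHEMA.items():
                if not isinstance(row.get(col), typ):
                    raise TypeError(f"row {i}: {col!r} is not {typ.__name__}")
            if row["event_type"] not in ALLOWED_EVENT_TYPES:
                raise ValueError(f"row {i}: unexpected event_type {row['event_type']!r}")
            if not 0 <= row["duration_seconds"] <= 14400:
                raise ValueError(f"row {i}: duration_seconds out of range")
        return fn(rows)
    return wrapper

@validate_events
def compute_engagement_features(rows):
    n = len(rows)
    completes = sum(r["event_type"] == "complete" for r in rows)
    return {"completion_rate": completes / n if n else 0.0}
```

The point of validating at the function boundary, rather than once at ingestion, is that every caller of the feature function gets the guarantee, including backfills and ad hoc notebook runs.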

Retroactive test: The team ran the suite against data from Incidents 1 and 2. The schema check would have caught Incident 2 immediately: the bookmark event type was not in the allowed set. The statistical checks would have flagged Incident 1 within one day: the PSI of engagement_rate jumped to 0.41 when computed on the corrupted data.

Phase 2: Behavioral Tests and Data Contracts (Weeks 3-4)

The team implemented the behavioral test suite from Section 28.8 with 14 tests:

  • 6 MFT tests: Recall@20 floor overall (0.15), per-platform floors (iOS: 0.12, Android: 0.12, web: 0.10), new user floor (0.05), NDCG@20 floor (0.12)
  • 4 INV tests: Name invariance (0.95), timestamp jitter (0.85), platform switch (0.70), session ID randomization (0.98)
  • 4 DIR tests: Genre affinity (+sci-fi completions increase sci-fi scores), recency effect (recent interactions score higher), completion vs. skip (completion boosts similar content), engagement depth (+sessions increase engagement-type recommendations)
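The shape of an INV test can be sketched as follows, using the timestamp-jitter case: perturb an attribute the model should ignore and require the top-k lists to stay mostly stable. The scorer below is a stub, and all names are illustrative; the 0.85 pass criterion mirrors the timestamp-jitter threshold above:

```python
import random

# Sketch of an INV (invariance) test: jitter timestamps, which the model
# should ignore, and require high overlap between the top-20 lists.

def top_k(score_fn, user, items, k=20):
    return [i for i, _ in sorted(((i, score_fn(user, i)) for i in items),
                                 key=lambda p: -p[1])[:k]]

def jitter_timestamps(user, seconds=300):
    jittered = dict(user)
    jittered["events"] = [{**e, "ts": e["ts"] + random.uniform(-seconds, seconds)}
                          for e in user["events"]]
    return jittered

def overlap_at_k(a, b, k=20):
    return len(set(a[:k]) & set(b[:k])) / k

# Stub scorer that ignores timestamps entirely, so invariance should hold.
def score(user, item):
    return (hash((user["user_id"], item)) % 1000) / 1000

user = {"user_id": "u1",
        "events": [{"item": i, "ts": 1_700_000_000 + i} for i in range(10)]}
items = list(range(200))
base = top_k(score, user, items)
perturbed = top_k(score, jitter_timestamps(user), items)
assert overlap_at_k(base, perturbed) >= 0.85  # INV pass criterion
```

A DIR test has the same skeleton but asserts a direction instead of stability: after adding, say, sci-fi completions to the user history, the scores of sci-fi items should go up, not merely stay put.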

The team also formalized data contracts with three upstream teams:

  1. Client engineering → ML recommendations: Click stream events (22 expectations, critical SLA, 2-hour freshness)
  2. Content engineering → ML recommendations: Item metadata (15 expectations, standard SLA, 24-hour freshness)
  3. User engineering → ML recommendations: User profiles (12 expectations, standard SLA, 24-hour freshness)

Each contract was registered in a central contract registry (a YAML file in the data platform repository), and contract tests ran at every pipeline stage boundary.
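A registry entry reduces to a small structured record plus a well-formedness check. The shape below is an illustration; the field names and SLA labels are assumptions, not the team's actual registry schema:

```python
# Illustrative shape of one contract-registry entry and a minimal
# well-formedness check run when the registry is loaded.

contract = {
    "name": "client_eng__click_stream",
    "producer": "client-engineering",
    "consumer": "ml-recommendations",
    "sla": "critical",
    "freshness_hours": 2,
    "expectations": 22,
}

REQUIRED_FIELDS = {"name", "producer", "consumer", "sla", "freshness_hours"}

def validate_contract(entry: dict) -> list[str]:
    """Return a list of problems; an empty list means the entry is well-formed."""
    problems = [f"missing field: {f}" for f in REQUIRED_FIELDS - entry.keys()]
    if entry.get("sla") not in {"critical", "standard"}:
        problems.append(f"unknown SLA: {entry.get('sla')!r}")
    return problems

assert validate_contract(contract) == []
```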

Retroactive test: The team ran the behavioral tests against the Incident 3 model (trained on 90 days). The model passed all MFT and INV tests — it was genuinely a better model on aggregate metrics and was correctly invariant to names and timestamps. But it failed the DIR recency test: recent interactions did not produce proportionally higher scores, because the model had learned to weight the holiday period heavily. This was exactly the behavioral pathology that the product manager had identified through engagement metrics — but the behavioral test would have caught it before deployment, not after.

Phase 3: Model Validation Gate (Weeks 5-6)

The team implemented the full validation gate:

streamrec_gate = ValidationGateConfig(
    gate_name="streamrec_v2_gate",
    metrics={
        "recall_at_20": (0.15, 0.02),   # (absolute min, max regression)
        "ndcg_at_20": (0.12, 0.015),
        "hit_rate_at_10": (0.30, 0.03),
    },
    behavioral_test_suite="streamrec_recommendation_v2",
    require_all_behavioral_tests=True,
    max_latency_p99_ms=45.0,
    max_model_size_mb=250.0,
    sliced_metrics={
        "recall_new_users": ("recall_at_20", 0.05),
        "recall_ios": ("recall_at_20", 0.12),
        "recall_android": ("recall_at_20", 0.12),
        "recall_web": ("recall_at_20", 0.10),
        "recall_top10pct_users": ("recall_at_20", 0.20),
    },
)

The gate was integrated into the Dagster training pipeline as the final asset before model registration. A model that failed the gate was logged with full comparison details and the pipeline sent an alert to the ML team's Slack channel. The gate decision, all metric comparisons, and behavioral test results were stored in MLflow as metadata on the model run.
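One plausible reading of the gate's metric pairs, sketched with hypothetical names rather than the real ValidationGateConfig internals: a candidate must clear the absolute floor and must not regress from the current production model by more than the allowance.

```python
# Minimal sketch of how the gate interprets its (absolute_min, max_regression)
# pairs. Illustrative only; not the real ValidationGateConfig implementation.

def evaluate_gate(candidate: dict, production: dict,
                  thresholds: dict) -> tuple[bool, list[str]]:
    failures = []
    for metric, (floor, max_regression) in thresholds.items():
        cand = candidate[metric]
        if cand < floor:
            failures.append(f"{metric}: {cand:.3f} below floor {floor}")
        if production[metric] - cand > max_regression:
            failures.append(f"{metric}: regressed "
                            f"{production[metric] - cand:.3f} "
                            f"vs allowance {max_regression}")
    return (not failures, failures)

thresholds = {"recall_at_20": (0.15, 0.02), "ndcg_at_20": (0.12, 0.015)}
production = {"recall_at_20": 0.21, "ndcg_at_20": 0.14}
candidate  = {"recall_at_20": 0.16, "ndcg_at_20": 0.13}
passed, why = evaluate_gate(candidate, production, thresholds)
# 0.16 clears the 0.15 floor but regresses by 0.05 > 0.02, so the gate blocks
```

The two-part check matters: a floor alone would have admitted the Incident 1 model (0.16 is above 0.15), while the regression allowance catches the drop from 0.21.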

Outcome

In the 12 weeks following deployment of the testing infrastructure:

| Metric | Before | After |
| --- | --- | --- |
| Models reaching production per month | 4.2 | 3.8 |
| Models blocked by validation gate | 0 | 1.4 per month |
| User-facing incidents caused by model changes | 1.0 per month | 0 |
| Mean time to detect data quality issues | 11 days | 45 minutes |
| Engineering hours spent on incident response | 40 per month | 4 per month |

The testing infrastructure did not reduce the number of models trained — the team continued its weekly retraining cadence. But it blocked approximately 1.4 models per month that would have reached production with quality issues. In the most dramatic case, a model trained on a day when the Android logging pipeline dropped 60% of events was blocked by the data validation checkpoint before training even began. Without validation, this model would have trained on iOS-biased data and served biased recommendations to the entire user base for a week.

The ML Test Score improved from 4.5 (critical gaps) to 14.0 (mature). The three remaining gaps — hyperparameter sensitivity testing, load/stress testing, and fairness monitoring — were prioritized for the next quarter.

Lessons Learned

  1. Retroactive testing is the most convincing argument for testing infrastructure. Running the new suite against data from past incidents showed the team exactly what each test would have caught. This turned skeptics ("we've been running without tests for months") into advocates.

  2. Data validation catches the most incidents per unit of effort. The Great Expectations suite and Pandera schemas, which took 2 weeks to implement, prevented more incidents than the behavioral tests and validation gate combined — because most production ML failures are data failures, not model failures.

  3. Behavioral tests catch what metrics miss — but metrics catch what behavioral tests miss. The Incident 3 model passed MFT tests (aggregate metrics were good) but failed DIR tests (recency behavior was wrong). A model could equally fail MFT tests (poor aggregate performance) while passing all behavioral tests. Both layers are essential.

  4. Data contracts change organizational behavior. Before contracts, upstream teams changed their data output without considering downstream impact. After contracts, the client engineering team's sprint planning included a line item: "check data contract impact for any event schema changes." The contract made the implicit dependency explicit and the accountability clear.