Chapter 28: Further Reading

Essential Sources

1. Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh, "Beyond Accuracy: Behavioral Testing of NLP Models with CheckList" (ACL, 2020)

The foundational paper for behavioral testing of ML models. Ribeiro et al. introduce the CheckList framework — minimum functionality tests (MFT), invariance tests (INV), and directional expectation tests (DIR) — and apply it to commercial NLP APIs from Google, Microsoft, and Amazon. The paper's central finding is striking: models with >90% accuracy on standard benchmarks failed 30-50% of behavioral tests, revealing systematic weaknesses (negation handling, temporal reasoning, named entity robustness) that aggregate accuracy metrics completely masked.
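The three test types can be made concrete with a toy example. The sketch below uses a deliberately crude keyword-based sentiment classifier as a stand-in for a real model; the function and test cases are illustrative inventions, not CheckList's actual API.

```python
# Toy sentiment classifier standing in for a real model (hypothetical).
def predict(text: str) -> str:
    return "negative" if any(w in text.lower() for w in ("bad", "not", "worst")) else "positive"

# MFT (minimum functionality test): known inputs with known expected labels.
mft_cases = [("This movie was bad", "negative"), ("A wonderful film", "positive")]
mft_pass = all(predict(x) == y for x, y in mft_cases)

# INV (invariance test): a label-preserving perturbation -- here, swapping a
# person's name -- should not change the prediction.
inv_pass = predict("Sam loved the plot") == predict("Alex loved the plot")

# DIR (directional expectation test): appending a clearly negative clause
# should not move the prediction toward "positive".
before = predict("The acting was fine")
after = predict("The acting was fine, but the ending was the worst")
dir_pass = not (before == "negative" and after == "positive")

print(mft_pass, inv_pass, dir_pass)
```

A real suite would run thousands of such cases per test type and report failure rates per capability, which is exactly the per-capability breakdown that aggregate accuracy hides.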

Reading guidance: Section 3 defines the three test types with concrete examples; this maps directly to the BehavioralTestSuite class in Section 28.7 of this chapter. Section 4 applies CheckList to sentiment analysis, demonstrating how to construct test cases from templates with vocabulary substitution — a technique that extends naturally to recommendation systems (substitute user IDs, item categories, interaction types). Section 5's evaluation of commercial APIs is sobering and motivates the chapter's argument that behavioral tests and aggregate metrics are complementary, not substitutes. The paper's supplementary material includes the full test suite, which is an excellent model for designing your own. For an extension to generative models, see Ribeiro and Lundberg, "Adaptive Testing and Debugging of NLP Models" (ACL, 2022), which automates test generation using language models. For a survey of the broader ML testing landscape, see Zhang et al., "Machine Learning Testing: Survey, Landscapes and Horizons" (IEEE TSE, 2022).
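The template-with-vocabulary-substitution technique from Section 4 of the paper can be sketched in a few lines. The template and lexicons below are invented, recommendation-flavored examples in the spirit of CheckList's test generation, not its actual API:

```python
from itertools import product

# Hypothetical template with slots; each slot has a small lexicon of fillers.
template = "{user} {action} items in the {category} category"
lexicons = {
    "user": ["new user", "power user"],
    "action": ["clicked", "purchased"],
    "category": ["books", "electronics"],
}

# Expand the cross-product of lexicon entries into concrete test inputs.
cases = [
    template.format(**dict(zip(lexicons, combo)))
    for combo in product(*lexicons.values())
]

print(len(cases))   # 2 * 2 * 2 = 8 generated cases
print(cases[0])     # "new user clicked items in the books category"
```

Small lexicons multiply quickly: a template with four slots of ten entries each yields 10,000 cases, which is why template expansion is the standard way to get behavioral test coverage without hand-writing every example.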

2. Great Expectations, "Great Expectations Documentation" (https://docs.greatexpectations.io/)

The official documentation for the Great Expectations framework. GE is the de facto standard for data validation in ML pipelines, and the documentation is exceptionally well-organized: the "Getting Started" tutorial builds a complete validation pipeline in 30 minutes, the "Expectation Gallery" catalogs all 300+ built-in expectations with examples, and the "How-To Guides" cover integration with every major orchestration framework (Airflow, Dagster, Prefect) and data backend (Pandas, Spark, SQL, Databricks).

Reading guidance: Start with the "Quickstart" guide to build a working expectation suite and checkpoint on a local CSV file. The "How to create Expectations" guide explains the mostly parameter, custom expectations, and the profiler (which auto-generates expectations from data — useful for bootstrapping a suite on a new dataset). The "How to validate data" guide covers checkpoints and actions (Slack alerts, PagerDuty, custom webhooks). For production deployment, the "How to use Great Expectations with Dagster" integration guide maps directly to the StreamRec pipeline architecture from Chapter 27. The Data Docs feature (auto-generated HTML documentation of validation results) is covered in "How to host and share Data Docs" and is essential for audit trails in regulated environments, as demonstrated in the Meridian Financial case study. For the companion inline validation tool, see the Pandera documentation at https://pandera.readthedocs.io/, which covers schema definition, decorator-based validation, and integration with type checkers.
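The mostly parameter relaxes an expectation so that it passes when at least that fraction of values satisfy it, rather than requiring all of them to. The function below is a minimal plain-Python reimplementation of that semantics for illustration, not Great Expectations' actual code:

```python
def expect_values_between(values, min_value, max_value, mostly=1.0):
    """Pass if at least `mostly` of the values fall in [min_value, max_value].

    Mirrors the semantics of Great Expectations' `mostly` argument; the
    function itself is an illustrative stand-in, not the GE implementation.
    """
    in_range = sum(min_value <= v <= max_value for v in values)
    fraction = in_range / len(values)
    return {"success": fraction >= mostly, "observed_fraction": fraction}

ratings = [1, 2, 5, 5, 4, 3, 9, 2, 4, 5]   # one out-of-range value (9)
print(expect_values_between(ratings, 1, 5))              # strict: fails
print(expect_values_between(ratings, 1, 5, mostly=0.9))  # tolerant: passes
```

Tolerances like mostly=0.99 are what make validation practical on real data, where a handful of malformed rows is normal and a hard all-or-nothing check would page someone every night.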

3. Eric Breck, Shanqing Cai, Eric Nielsen, Michael Salib, and D. Sculley, "The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction" (IEEE Big Data, 2017)

A systematic rubric for evaluating the testing maturity of production ML systems. Breck et al. — from Google — define 28 specific tests organized into four categories: tests for features and data, tests for model development, tests for ML infrastructure, and monitoring tests for ML. Each test earns half a point if performed manually with documented results and a full point if automated; the final score is the minimum of the four section totals, so a system is only as mature as its weakest category. The paper interprets a score of 0 as more research prototype than production system and scores above 5 as exceptional levels of automated testing and monitoring. The rubric is designed to be applied retrospectively to existing systems to identify gaps and prioritize investments.

Reading guidance: Table 1 (the complete rubric) is the core artifact — print it and score your own system. Section 3 provides the rationale for each test, which is often more valuable than the score itself: understanding why "the model is tested for quality on important data slices" is a distinct test from "the model is tested for quality on a holdout set" clarifies the difference between aggregate and sliced evaluation (Section 28.10 of this chapter). Section 4 discusses the relationship between the ML Test Score and technical debt, connecting to Sculley et al.'s foundational work on hidden technical debt (Chapter 24, Further Reading). For a practitioner's perspective on applying the rubric, see Sato, Wider, and Windheuser, "Continuous Delivery for Machine Learning" (martinfowler.com, 2019), which translates the rubric into a CI/CD pipeline architecture. For an updated perspective incorporating LLM-era concerns, see Shankar et al., "Operationalizing Machine Learning: An Interview Study" (arXiv, 2022), which surveys how ML engineers at 18 organizations implement (or fail to implement) the tests in the rubric.
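The paper's aggregation rule (0.5 points for a manually executed test, 1.0 for an automated one, final score equal to the minimum over the four sections) is easy to automate for your own system. The section contents below are abbreviated, hypothetical scores, not the real rubric's 7 tests per section:

```python
# Points per test: 0.0 not done, 0.5 manual with documented results, 1.0 automated.
# Lists are abbreviated placeholders; the real rubric has 7 tests per section.
scores = {
    "features_and_data": [1.0, 0.5, 0.5, 0.0],
    "model_development": [0.5, 0.5, 0.0, 0.0],
    "ml_infrastructure": [1.0, 1.0, 0.5, 0.5],
    "monitoring": [0.5, 0.0, 0.0, 0.0],
}

section_totals = {name: sum(pts) for name, pts in scores.items()}
ml_test_score = min(section_totals.values())   # the weakest section caps the score

print(section_totals)
print(ml_test_score)   # 0.5 -- monitoring is the bottleneck here
```

The min-of-sections rule is the rubric's key design choice: strong infrastructure testing cannot compensate for absent monitoring, which forces balanced investment rather than polishing the easiest category.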

4. Andrew Chad, Niall Murphy, and Laine Campbell, "Engineering Data Quality" (O'Reilly, 2024)

A comprehensive treatment of data quality as an engineering discipline, covering the full stack from data profiling and anomaly detection to data contracts, observability, and incident response. The book bridges the gap between data engineering (Chapter 25 of this textbook) and ML testing (this chapter) by treating data quality as a first-class concern that spans the entire data lifecycle, not just the ML pipeline.

Reading guidance: Part II (Data Quality Fundamentals) covers profiling, validation, and monitoring — the same topics as Sections 28.2-28.4 of this chapter, but with deeper treatment of anomaly detection algorithms and statistical testing methodology. Part III (Data Contracts) provides the most thorough published treatment of data contracts in practice, including contract specification languages, schema evolution strategies, and organizational patterns for contract governance — extending the DataContract class from Section 28.5 with production-tested patterns from industry. Part IV (Data Observability) covers the monitoring and alerting infrastructure that connects data validation (this chapter) to incident response (Chapter 30).
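To fix the pattern Part III elaborates, here is a hypothetical minimal data contract: a versioned schema that a consumer can validate producer records against. All field names and rules are invented for illustration and this is far simpler than the production-tested patterns the book describes:

```python
from dataclasses import dataclass, field

@dataclass
class FieldSpec:
    name: str
    dtype: type
    nullable: bool = False

@dataclass
class Contract:
    # Hypothetical minimal data contract: a schema plus a version string,
    # so schema evolution can be negotiated between producer and consumer.
    version: str
    fields: list[FieldSpec] = field(default_factory=list)

    def validate(self, record: dict) -> list[str]:
        errors = []
        for spec in self.fields:
            if spec.name not in record:
                errors.append(f"missing field: {spec.name}")
            elif record[spec.name] is None:
                if not spec.nullable:
                    errors.append(f"null in non-nullable field: {spec.name}")
            elif not isinstance(record[spec.name], spec.dtype):
                errors.append(f"wrong type for {spec.name}")
        return errors

contract = Contract("1.2.0", [FieldSpec("user_id", str),
                              FieldSpec("rating", int, nullable=True)])
print(contract.validate({"user_id": "u42", "rating": None}))  # [] -- conforms
print(contract.validate({"rating": "five"}))                  # two violations
```

The version string is doing real work: schema evolution under a contract means bumping the version and negotiating the change with consumers, rather than silently altering a field and letting downstream pipelines discover it at 3 a.m.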

5. Sasu Tarkoma, "Validation and Verification of Machine Learning Models" (Springer, 2024)

An academic treatment of ML model validation, covering formal verification, testing theory, and the relationship between software testing methodologies and ML-specific testing needs. The book provides the theoretical foundations for the practical techniques in this chapter: why invariance testing is a special case of metamorphic testing, how property-based testing relates to behavioral testing, and what the oracle problem means for ML system validation.

Reading guidance: Chapter 3 (Testing ML Models) provides a formal taxonomy of ML testing approaches that maps to the CheckList framework: MFT corresponds to "adequacy testing," INV corresponds to "metamorphic testing," and DIR corresponds to "property-based testing." Chapter 5 (Model Verification) covers formal methods for verifying model properties (monotonicity, Lipschitz continuity, fairness constraints) — techniques that complement the empirical testing approach of this chapter with mathematical guarantees. Chapter 7 (Testing in Practice) surveys industry practices and connects the academic framework to production realities. For readers interested in the connection between testing and fairness specifically, Chapters 31 and 35 of this textbook extend the fairness behavioral tests from Section 28.9 into full fairness auditing and interpretability frameworks.
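The invariance-as-metamorphic-testing point can be made concrete. A metamorphic relation asserts that a known transformation of the input produces a predictable transformation of the output, which sidesteps the oracle problem: no ground truth is needed for any individual output. The scoring function below is a hypothetical toy, not a technique from the book:

```python
# Toy model: averages a user's interaction strengths (hypothetical stand-in).
def score(interactions: list[float]) -> float:
    return sum(interactions) / len(interactions)

x = [1.0, 4.0, 2.0, 5.0]

# Metamorphic relation 1 (invariance): permuting the input order must not
# change the output. We never need to know what the "correct" score is.
assert score(x) == score(list(reversed(x)))

# Metamorphic relation 2 (directional): doubling every interaction should
# double the score -- a predictable output change, again with no oracle.
assert score([2 * v for v in x]) == 2 * score(x)

print("metamorphic relations hold")
```

Property-based testing frameworks generalize this by generating the inputs randomly and checking the relation on every draw, which is the bridge between the book's formal treatment and the behavioral test suites of this chapter.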