Chapter 25: Further Reading

Essential Sources

1. Neelesh Salian, Willem Pienaar, and the Feast Community, Feast: An Open Source Feature Store for Machine Learning (feast.dev, 2021–present)

Feast is the open-source feature store used throughout this chapter. The official documentation provides the most authoritative treatment of the core abstractions: entities, feature views, data sources, online/offline stores, and point-in-time joins. The architecture documentation explains how Feast separates feature definition (the Python SDK), offline retrieval (get_historical_features), and online serving (get_online_features), and how the materialization process bridges the two.

Reading guidance: Start with the "Quickstart" guide to run Feast locally with a file-based offline store and an in-memory online store — this takes 15 minutes and demonstrates the full workflow without cloud infrastructure. Then read the "Architecture" page, which explains the offline-online separation and the materialization pipeline. The "Feature Views" reference documents the full API for defining batch and stream feature views, including TTL, tags, and default values. For production deployments, the "Running Feast in Production" guide covers Redis, DynamoDB, and BigQuery backends. The Feast RFC repository (github.com/feast-dev/feast/tree/master/docs/rfcs) contains design documents for features like on-demand transforms, stream feature views, and the push API — these are valuable for understanding the why behind Feast's design decisions, not just the how. For a comparison with commercial alternatives (Tecton, Hopsworks, Databricks Feature Store), see the "Feature Store Comparison Matrix" maintained by the Feast community.
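The point-in-time correctness that get_historical_features enforces can be sketched in plain Python. This is a simplified illustration of the join semantics, not Feast's actual implementation; the driver entity, timestamps, and TTL below are hypothetical:

```python
from datetime import datetime, timedelta

def point_in_time_join(entity_rows, feature_rows, ttl=timedelta(days=1)):
    """For each (entity, event_timestamp) pair, pick the latest feature
    value observed at or before the event, within the TTL window.
    This is the property that prevents training-time label leakage."""
    enriched = []
    for entity_id, event_ts in entity_rows:
        candidates = [
            (feat_ts, value)
            for eid, feat_ts, value in feature_rows
            if eid == entity_id
            and feat_ts <= event_ts            # no values from the future
            and event_ts - feat_ts <= ttl      # value has not expired
        ]
        value = max(candidates)[1] if candidates else None
        enriched.append((entity_id, event_ts, value))
    return enriched

# Two feature values exist for driver_1; the 12:00 value is in the
# future relative to the 10:00 training event, so 0.45 is chosen.
feature_rows = [
    ("driver_1", datetime(2024, 1, 1, 8, 0), 0.45),
    ("driver_1", datetime(2024, 1, 1, 12, 0), 0.52),
]
entity_rows = [("driver_1", datetime(2024, 1, 1, 10, 0))]
print(point_in_time_join(entity_rows, feature_rows))
```

Feast performs this join at offline-store scale (e.g., as a BigQuery or Spark query), but the as-of semantics are the same.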

2. Zhamak Dehghani, Data Mesh: Delivering Data-Driven Value at Scale (O'Reilly, 2022)

The definitive reference on the data mesh paradigm — the organizational architecture that treats data as a product, with decentralized ownership by domain teams. Dehghani, who originated the concept at ThoughtWorks, presents the four principles (domain ownership, data as a product, self-serve platform, federated governance) and provides detailed guidance for implementing them. The book is essential reading for understanding the organizational context in which feature stores, data contracts, and lineage systems operate.

Reading guidance: Part I (The "Why") establishes the problem: centralized data teams become bottlenecks as organizations scale, and the solution is not more centralization but a new topology. Chapter 4 (Data as a Product) is the most directly relevant to this chapter: it defines the properties of a data product (discoverable, addressable, trustworthy, self-describing, interoperable, secure) and explains how data contracts formalize these properties. Chapter 9 (The Self-Serve Data Platform) describes the infrastructure layer that enables domain teams to publish data products — this maps directly to the feature store platform discussed in Section 25.15. For ML practitioners, the key takeaway is the reframing of the ML team's relationship with data: instead of building pipelines to extract data from source systems, the ML team consumes published data products with explicit contracts and SLAs. This reframing changes the incentive structure and reduces the coordination overhead that causes most data quality incidents.

3. Tomer Shiran, Jason Hughes, Alex Merced, and Dipankar Mazumdar, Apache Iceberg: The Definitive Guide (O'Reilly, 2024)

Apache Iceberg is the table format that provides ACID transactions, schema evolution, time-travel, and hidden partitioning on top of object storage. This book provides a comprehensive treatment of the format's internal architecture and its practical application to data lake and lakehouse workloads. (The format itself was created by Ryan Blue and Daniel Weeks at Netflix before being donated to the Apache Software Foundation.)

Reading guidance: Chapter 3 (Architecture) explains the metadata layer — catalog, metadata files, manifest lists, manifest files, and data files — that gives Iceberg its transactional properties. Understanding this architecture is essential for grasping why time-travel is a metadata operation (reading an old snapshot) rather than a data operation (maintaining multiple copies of the data). Chapter 5 (Schema Evolution) covers Iceberg's ID-based column tracking, which makes renames and reorderings metadata-only operations — the key advantage over Delta Lake discussed in Section 25.11. Chapter 7 (Time Travel and Rollback) explains snapshot isolation, the FOR SYSTEM_TIME AS OF syntax, and the rollback_to_snapshot operation, all of which support the point-in-time feature construction pattern from Section 25.7. Chapter 9 (Partitioning) introduces hidden partitioning and partition evolution — the ability to change partitioning schemes without rewriting data — which is particularly valuable for ML workloads where query patterns evolve as new features are added. For readers using Delta Lake instead of Iceberg, the "Delta Lake: The Definitive Guide" (O'Reilly, 2024) by Denny Lee et al. provides equivalent coverage of the Delta Log, time-travel, and schema enforcement mechanisms.
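The reason time-travel is a metadata operation rather than a data operation can be seen in a toy model of Iceberg's snapshot layer. This is a deliberately simplified sketch; real Iceberg tracks snapshots through manifest lists and manifest files in a metadata tree, not Python lists:

```python
class ToySnapshotTable:
    """Each commit appends an immutable snapshot: a list of data-file
    references. Data files are never rewritten in place, so reading an
    old snapshot is a metadata lookup, and rollback just points the
    table back at old metadata."""

    def __init__(self):
        self.snapshots = []              # index = snapshot id

    def commit(self, new_files):
        current = self.snapshots[-1] if self.snapshots else []
        self.snapshots.append(current + list(new_files))
        return len(self.snapshots) - 1   # new snapshot id

    def read(self, snapshot_id=None):
        """Read the latest snapshot, or time-travel to an older one."""
        if not self.snapshots:
            return []
        if snapshot_id is None:
            snapshot_id = len(self.snapshots) - 1
        return self.snapshots[snapshot_id]

    def rollback_to_snapshot(self, snapshot_id):
        # Rollback re-publishes old metadata; no data files are copied
        # or rewritten, which is why it is cheap and instantaneous.
        self.snapshots.append(list(self.snapshots[snapshot_id]))

table = ToySnapshotTable()
s0 = table.commit(["data-00.parquet"])
s1 = table.commit(["data-01.parquet"])
print(table.read(s0))    # time-travel: only the first file is visible
table.rollback_to_snapshot(s0)
print(table.read())      # latest state now matches snapshot s0
```

In real Iceberg the same two reads correspond to a FOR SYSTEM_TIME AS OF query and to the rollback_to_snapshot stored procedure.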

4. Neoklis Polyzotis, Sudip Roy, Steven Euijong Whang, and Martin Zinkevich, "Data Lifecycle Challenges in Production Machine Learning: A Survey" (ACM SIGMOD Record, 2018)

A systematic survey of data management challenges in production ML systems, authored by researchers from Google. The paper identifies six lifecycle stages (data understanding, data validation, data cleaning, data enrichment, data integration, and data serving) and catalogs the problems that arise at each stage. It is the most rigorous academic treatment of the "plumbing" problems that this chapter addresses.

Reading guidance: Section 3 (Data Understanding and Validation) covers the data quality challenges that motivate data contracts: schema drift, distribution drift, and missing value patterns. The paper introduces the concept of a "data schema" for ML that goes beyond structural schema to include distributional expectations — the intellectual ancestor of the quality expectations in Section 25.10's data contracts. Section 4 (Data Cleaning and Enrichment) discusses feature engineering at scale, including the challenges of computing features consistently across training and serving — the problem that feature stores solve. Section 5 (Data Serving) covers the online-offline consistency problem and the point-in-time join requirement, with references to Google's internal systems (Sibyl, TFX) that influenced the design of external feature stores. For a more recent perspective from the same research group, see Polyzotis et al., "Data Management Challenges in Production Machine Learning" (SIGMOD, 2017), which provides the foundational framework, and Breck et al., "Data Validation for Machine Learning" (SysML, 2019), which describes Google's data validation system (TFDV) — a production implementation of the data contract concept.
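The paper's notion of a data schema that goes beyond structure to include distributional expectations can be illustrated with a small batch check. This is a hypothetical sketch of the idea; systems like TFDV express these expectations as declarative schema protos rather than hand-written functions, and the thresholds below are made up:

```python
def validate_batch(values, *, min_val, max_val,
                   max_null_frac, expected_mean, mean_tol):
    """Validate one batch of a numeric column against a schema that
    covers both structure (value range) and distribution (null
    fraction, mean drift). Returns a list of anomaly descriptions."""
    anomalies = []
    non_null = [v for v in values if v is not None]
    null_frac = 1 - len(non_null) / len(values)
    if null_frac > max_null_frac:
        anomalies.append(f"null fraction {null_frac:.2f} exceeds {max_null_frac}")
    if any(v < min_val or v > max_val for v in non_null):
        anomalies.append("value outside expected range")
    mean = sum(non_null) / len(non_null)
    if abs(mean - expected_mean) > mean_tol:
        anomalies.append(f"mean {mean:.2f} drifted from {expected_mean}")
    return anomalies

# All values are structurally valid, but the mean has drifted upward,
# so only the distributional check fires.
batch = [10.0, 12.0, 11.0, None, 13.0]
print(validate_batch(batch, min_val=0, max_val=100,
                     max_null_frac=0.3, expected_mean=5.0, mean_tol=2.0))
```

A purely structural schema would pass this batch; catching the drift requires the distributional expectations the paper argues for.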

5. Andrew Jones, Driving Data Quality with Data Contracts (Packt, 2023)

The first book-length treatment of data contracts as a practice. Jones, who pioneered data contracts at GoCardless, presents them as the mechanism for making data dependencies explicit, testable, and enforceable across organizational boundaries. (Chad Sanderson's writing on data contracts at Convoy popularized the same ideas and is a useful companion to the book.)

Reading guidance: Part I establishes the problem: in most organizations, data consumers discover data quality issues after they have already caused production failures, because there is no explicit agreement about what the data should look like. Chapter 3 (Anatomy of a Data Contract) defines the components of a contract — schema, semantics, quality expectations, SLAs, ownership — that map directly to the DataContract class in Section 25.10. Chapter 5 (Schema Compatibility) covers the compatibility modes (backward, forward, full, breaking) in the context of data contracts, extending the schema registry concepts from the Confluent documentation. Part III (Implementation) provides practical guidance for introducing data contracts in an organization, including the political challenges of getting data producers to accept contracts. For ML practitioners, the key insight is that data contracts are not just a governance tool — they are a safety mechanism that prevents the most common class of production ML incidents: silent data quality degradation from upstream changes.
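The contract anatomy and the compatibility modes discussed above can be sketched together in a few lines. The field names and the shape of this DataContract are illustrative, not the book's exact code or the class from Section 25.10:

```python
from dataclasses import dataclass

@dataclass
class DataContract:
    """A minimal contract: schema plus ownership and a freshness SLA.
    Semantics and quality expectations are elided for brevity."""
    name: str
    owner: str
    schema: dict                     # column name -> declared type
    freshness_sla_hours: int = 24

def is_backward_compatible(old: DataContract, new: DataContract) -> bool:
    """Backward compatibility: consumers written against the old
    contract keep working. Removing or retyping a column they may
    read is breaking; adding a new column is not."""
    return all(new.schema.get(col) == col_type
               for col, col_type in old.schema.items())

v1 = DataContract("orders", "payments-team",
                  {"order_id": "string", "amount": "double"})
v2 = DataContract("orders", "payments-team",
                  {"order_id": "string", "amount": "double",
                   "currency": "string"})       # additive change
v3 = DataContract("orders", "payments-team",
                  {"order_id": "string"})       # drops "amount"

print(is_backward_compatible(v1, v2))   # additive: safe
print(is_backward_compatible(v1, v3))   # removed column: breaking
```

In practice the check runs in CI on the producer's side, so a breaking schema change is rejected before it ever reaches downstream training pipelines — the safety mechanism the book argues for.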