Chapter 25: Key Takeaways

  1. Feature stores solve the online-offline consistency problem — the most common source of production ML quality degradation. Training-serving skew occurs when the features a model trains on differ from the features it receives in production. This happens whenever training and serving compute features using different code paths — different SQL queries, different programming languages, different time window semantics. Feature stores prevent skew by providing a single source of truth for feature definitions: one computation pipeline populates both the offline store (for training) and the online store (for serving). The StreamRec case study demonstrated this concretely: before the feature store, three incidents in 12 weeks were caused by training-serving skew; after deployment, zero incidents occurred, and the online-offline NDCG gap collapsed from 0.07 to 0.01.
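
The single-source-of-truth idea can be sketched in a few lines. The feature name and event data below are hypothetical; a real store would materialize the offline batch to the offline store and push the same values to a low-latency online store, but the key property is that both paths call one definition:

```python
from datetime import datetime, timedelta

# Hypothetical raw watch events per user: (watch_seconds, timestamp).
EVENTS = {
    1: [(120, datetime(2024, 3, 1)), (300, datetime(2024, 3, 2))],
    2: [(60, datetime(2024, 3, 2))],
}

def avg_watch_time_7d(user_id: int, as_of: datetime) -> float:
    """One feature definition -- the single source of truth used by
    both the offline (training) and online (serving) paths."""
    cutoff = as_of - timedelta(days=7)
    vals = [w for w, ts in EVENTS.get(user_id, []) if cutoff <= ts <= as_of]
    return sum(vals) / len(vals) if vals else 0.0

as_of = datetime(2024, 3, 3)
# Offline path: materialize the feature for a training batch.
offline = {uid: avg_watch_time_7d(uid, as_of) for uid in EVENTS}
# Online path: the same computation at serving time.
online = avg_watch_time_7d(1, as_of)
assert online == offline[1]  # no training-serving skew, by construction
```

Skew enters precisely when the serving path reimplements this logic in another language or with different window semantics; routing both through one definition removes that failure mode.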

  2. Point-in-time joins are the mechanism that prevents temporal data leakage in training data construction. A naive join between event tables and feature tables retrieves the latest feature values, which may include data generated after the training example — data that would not have been available at prediction time. Point-in-time joins retrieve the most recent feature value at or before the event timestamp, enforcing the filtration condition $\hat{f}_k(u, t) = f_k(u, t^*)$ where $t^* = \max\{t' \leq t : f_k(u, t') \text{ was recorded}\}$. This is not an optimization; it is a correctness requirement. Models trained on leaked data show inflated offline metrics that do not transfer to production.
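
A point-in-time join is directly expressible with pandas' `merge_asof`, whose default backward direction picks, for each event, the latest feature row at or before the event timestamp. The column names below are illustrative:

```python
import pandas as pd

# Training events: (user, event timestamp, label).
events = pd.DataFrame({
    "user_id": [1, 1, 2],
    "event_ts": pd.to_datetime(["2024-03-01", "2024-03-10", "2024-03-05"]),
    "label": [1, 0, 1],
})

# Feature snapshots: (user, snapshot timestamp, feature value).
features = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "feature_ts": pd.to_datetime(["2024-02-20", "2024-03-05",
                                  "2024-03-01", "2024-03-08"]),
    "avg_watch_time": [12.0, 15.0, 8.0, 9.5],
})

# merge_asof keeps, per event, the most recent feature row with
# feature_ts <= event_ts -- the point-in-time join. Both frames must
# be sorted on the time key.
train = pd.merge_asof(
    events.sort_values("event_ts"),
    features.sort_values("feature_ts"),
    left_on="event_ts",
    right_on="feature_ts",
    by="user_id",
)
```

Note that the user-2 event on 2024-03-05 picks up the 2024-03-01 snapshot (8.0), not the later 2024-03-08 one (9.5) that a naive latest-value join would leak in.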

  3. The lakehouse architecture (Delta Lake, Iceberg, Hudi) is the pragmatic default for ML data infrastructure because it serves both SQL analytics and direct file access. Data warehouses excel at structured SQL queries but are expensive at scale and do not support direct file access for ML frameworks. Data lakes are cheap and format-agnostic but lack ACID transactions, schema enforcement, and time-travel. Lakehouses add these warehouse-grade features to lake-grade storage, enabling the same data to support feature engineering (SQL), model training (direct Parquet access), and reproducibility (time-travel). The choice between warehouse and lakehouse depends on the workload: pure structured analytics with strong consistency requirements (Meridian Financial's credit scoring) favors the warehouse; mixed workloads with large scale and ML training (StreamRec's recommendation system) favor the lakehouse.
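
Time-travel semantics can be illustrated with a toy copy-on-write table: every commit produces an immutable snapshot, and a reader can pin any historical version. This is a conceptual sketch, not the Delta Lake or Iceberg API:

```python
class ToyTable:
    """Toy model of snapshot-based time travel."""

    def __init__(self):
        self._snapshots = [[]]  # version 0: empty table

    def commit(self, rows) -> int:
        # Copy-on-write: new version = previous rows + appended rows.
        self._snapshots.append(self._snapshots[-1] + list(rows))
        return len(self._snapshots) - 1  # new version number

    def read(self, version=None):
        v = len(self._snapshots) - 1 if version is None else version
        return list(self._snapshots[v])

t = ToyTable()
v1 = t.commit([("u1", 0.5)])
v2 = t.commit([("u2", 0.9)])
assert t.read(version=v1) == [("u1", 0.5)]      # reproduce training data as-of v1
assert t.read() == [("u1", 0.5), ("u2", 0.9)]   # latest snapshot for new runs
```

The reproducibility claim in the takeaway is exactly the first assertion: a training run pinned to a version sees the same rows no matter how many commits land afterward.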

  4. Data contracts are the organizational mechanism that makes feature stores reliable by making cross-team data dependencies explicit, testable, and enforceable. A data contract specifies the schema (field names, types, nullability), quality expectations (no null IDs, valid value ranges), semantic definitions (what each field means), and SLAs (freshness, completeness) for a data asset. Without contracts, upstream teams can change data schemas, rename fields, or degrade data quality without knowing that an ML model depends on their output. The StreamRec case study showed that a single field rename (duration_seconds to watch_time_seconds) would have silently broken the session feature pipeline. Contracts catch these changes before they reach production.
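
A minimal contract check might look like the following sketch. The contract fields mirror the StreamRec example; the `validate` helper and its error format are hypothetical, standing in for tools like schema registries or expectation suites:

```python
# Hypothetical contract for the watch-events table: required fields,
# types, nullability, and a value-range expectation.
CONTRACT = {
    "user_id": {"type": int, "nullable": False},
    "duration_seconds": {"type": float, "nullable": False, "min": 0.0},
}

def validate(rows: list[dict]) -> list[str]:
    """Return contract violations; an upstream rename such as
    duration_seconds -> watch_time_seconds fails loudly here."""
    errors = []
    for i, row in enumerate(rows):
        for field, spec in CONTRACT.items():
            if field not in row:
                errors.append(f"row {i}: missing field {field!r}")
                continue
            val = row[field]
            if val is None:
                if not spec["nullable"]:
                    errors.append(f"row {i}: null {field!r}")
                continue
            if not isinstance(val, spec["type"]):
                errors.append(f"row {i}: {field!r} has type {type(val).__name__}")
            elif "min" in spec and val < spec["min"]:
                errors.append(f"row {i}: {field!r} below {spec['min']}")
    return errors

ok = [{"user_id": 1, "duration_seconds": 42.0}]
renamed = [{"user_id": 1, "watch_time_seconds": 42.0}]  # breaking upstream rename
assert validate(ok) == []
assert validate(renamed) != []  # rename caught before it reaches production
```

Running this check in the upstream team's CI is what turns the contract from documentation into enforcement.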

  5. Schema evolution in ML systems has a larger blast radius than in traditional software because every change propagates through features, models, and monitoring. Adding a nullable column is safe. Renaming a column or narrowing a type is breaking. Changing a feature's semantic definition (e.g., the time window) requires retraining every model that consumes that feature. The migration checklist — pause pipelines, notify consumers, apply changes, backfill, retrain, validate, monitor — is the discipline that prevents cascading failures. Apache Iceberg's ID-based column tracking makes renames a metadata-only operation, reducing the cost of schema evolution for large tables.
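
The safe-versus-breaking distinction can be encoded as a small compatibility check. The schema representation below (field name to type and nullability) is a simplification, not any particular catalog's format; note that a name-based diff sees a rename as a removal plus an addition, which is why Iceberg's ID-based tracking helps:

```python
# Toy backward-compatibility check between two schema versions.
# Schemas map field -> (type, nullable). Additive nullable columns
# are safe; removals, renames, and type changes break consumers.
def breaking_changes(old: dict, new: dict) -> list[str]:
    problems = []
    for field, (ftype, _nullable) in old.items():
        if field not in new:
            problems.append(f"removed/renamed: {field}")
        elif new[field][0] != ftype:
            problems.append(f"type changed: {field} {ftype} -> {new[field][0]}")
    for field, (_ftype, nullable) in new.items():
        if field not in old and not nullable:
            problems.append(f"added non-nullable: {field}")
    return problems

v1 = {"user_id": ("bigint", False), "duration_seconds": ("double", False)}
v2 = {"user_id": ("bigint", False), "watch_time_seconds": ("double", False)}
assert breaking_changes(v1, v2)  # rename surfaces as remove + add

v3 = {**v1, "device_type": ("string", True)}  # additive nullable column
assert breaking_changes(v1, v3) == []         # safe: no consumer breaks
```

A gate like this, run before any table migration, is the automated front half of the pause/notify/apply/backfill/retrain/validate/monitor checklist.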

  6. Data lineage is not optional for regulated ML systems and is invaluable for unregulated ones. Lineage tracks the flow of data from source to model to prediction: which raw events computed which features, which features trained which models, which models serve which endpoints. For regulated systems (credit scoring under ECOA/FCRA), lineage provides the audit trail that regulators require. For unregulated systems, lineage provides the debugging trail that engineers need: when the model degrades, lineage traces the root cause from predictions back through features to raw data. The Meridian Financial case study showed that lineage reduced audit response time from weeks to hours and eliminated documentation deficiency findings entirely.
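
Tracing a degraded prediction back to raw data is, mechanically, a graph walk over recorded edges. A toy sketch with hypothetical asset names:

```python
# Toy lineage graph: each asset records its direct upstream parents.
LINEAGE = {
    "endpoint:recs_v3": ["model:ranker_v3"],
    "model:ranker_v3": ["feature:avg_watch_time_7d", "feature:session_count"],
    "feature:avg_watch_time_7d": ["table:watch_events"],
    "feature:session_count": ["table:sessions"],
    "table:watch_events": [],
    "table:sessions": [],
}

def upstream(asset: str) -> set[str]:
    """All transitive upstream dependencies of an asset."""
    seen, stack = set(), [asset]
    while stack:
        for parent in LINEAGE.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

# Debugging trail: everything a degraded endpoint could have inherited.
assert "table:watch_events" in upstream("endpoint:recs_v3")
```

The audit trail regulators ask for is the same graph walked in the other direction: from a raw table forward to every model and prediction it influenced.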

  7. The infrastructure cost of a production feature store is small relative to the value it protects. StreamRec's feature store costs approximately $11,550 per month — about $139,000 per year, or roughly 0.035% of the company's $400M annual revenue. Meridian Financial's lineage-enabled store costs $14,200 per year in additional storage for 7-year retention. These costs are trivial compared to the revenue impact of recommendation quality degradation (StreamRec) or the regulatory risk of audit failures (Meridian). The plumbing nobody teaches is the plumbing that keeps the system running.
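
The cost arithmetic is worth making explicit, since the ratio depends on whether the monthly bill is measured against annual revenue or first annualized:

```python
monthly_cost = 11_550
annual_revenue = 400_000_000

# Monthly bill as a share of annual revenue: ~0.003%.
monthly_vs_annual = monthly_cost / annual_revenue * 100

# Annualized cost as a share of annual revenue: ~0.035%.
annualized = monthly_cost * 12 / annual_revenue * 100
```

Either way, the feature store costs a few hundredths of a percent of revenue, orders of magnitude below the value of the recommendation quality it protects.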