Chapter 25: Quiz
Test your understanding of data infrastructure for production ML. Answers follow each question.
Question 1
What is the fundamental problem that feature stores are designed to solve?
Answer
Feature stores solve the **online-offline consistency problem** (also called training-serving skew). When an ML model is trained using features computed offline (e.g., SQL queries on a data warehouse) and served using features computed online (e.g., from a Redis cache updated by a streaming pipeline), the two code paths can diverge — producing different feature values for the same entity at the same time. This divergence causes the model to see different input distributions in production than in training, silently degrading prediction quality. Feature stores solve this by providing a single source of truth for feature definitions, ensuring that the same computation logic populates both the offline store (for training) and the online store (for serving).
Question 2
Explain the difference between the online store and the offline store in a feature store architecture. What is each used for?
Answer
The **online store** (e.g., Redis, DynamoDB, Bigtable) stores the latest feature values per entity key and is optimized for low-latency point lookups (typically 5-15 ms). It is used during model serving — when a prediction request arrives, the serving infrastructure retrieves the user's and item's current features from the online store. The **offline store** (e.g., Delta Lake on S3, BigQuery, Redshift) stores historical feature values with timestamps and is optimized for batch reads and analytical queries. It is used during model training — to construct training datasets with point-in-time correct features — and for batch scoring, feature analysis, and debugging. Both stores are populated by the same computation pipeline, which is the key mechanism for ensuring consistency.
Question 3
What is a point-in-time join, and why is it essential for constructing ML training datasets?
Answer
A **point-in-time join** retrieves, for each training example at timestamp $t$, the most recent feature value that was available *before* $t$ (i.e., `feature_timestamp <= event_timestamp`). This prevents **temporal data leakage** — the inclusion of future information in the training data. Without a point-in-time join, a naive join retrieves the latest feature value, which may include data generated after the training example's event. The model then learns from signals that are not available at serving time, producing inflated offline metrics that do not translate to production performance. Point-in-time joins are implemented using `ASOF JOIN` (DuckDB, ClickHouse) or window functions with `ROW_NUMBER()` partitioned by entity and event timestamp, ordered by feature timestamp descending.
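A minimal sketch of the `ROW_NUMBER()` approach, using SQLite's window functions on toy tables (the `labels` and `features` names and values are illustrative):

```python
import sqlite3

# Toy tables: "labels" holds training events, "features" holds timestamped
# feature values. The join below is the ROW_NUMBER() point-in-time pattern.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE labels   (user_id TEXT, event_ts INTEGER, label INTEGER);
CREATE TABLE features (user_id TEXT, feature_ts INTEGER, clicks_7d REAL);
INSERT INTO labels   VALUES ('u1', 100, 1), ('u1', 200, 0);
INSERT INTO features VALUES ('u1',  90, 3.0), ('u1', 150, 5.0), ('u1', 250, 9.0);
""")

rows = conn.execute("""
SELECT user_id, event_ts, label, clicks_7d FROM (
  SELECT l.user_id, l.event_ts, l.label, f.clicks_7d,
         ROW_NUMBER() OVER (
           PARTITION BY l.user_id, l.event_ts
           ORDER BY f.feature_ts DESC          -- most recent value first
         ) AS rn
  FROM labels l
  JOIN features f
    ON f.user_id = l.user_id
   AND f.feature_ts <= l.event_ts              -- never look into the future
) WHERE rn = 1
""").fetchall()

print(rows)  # the event at t=100 sees 3.0; the event at t=200 sees 5.0, never 9.0
```

Note that the value written at `t=250` is invisible to both training examples — exactly the leakage the naive "latest value" join would introduce.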
Question 4
What is the difference between a data warehouse, a data lake, and a data lakehouse? When would you choose each for an ML workload?
Answer
A **data warehouse** (BigQuery, Snowflake, Redshift) is a schema-on-write system with ACID transactions, columnar storage, and SQL-native interfaces. Choose it when all data is structured, you need strong consistency (e.g., financial/regulatory data), and feature engineering is expressible in SQL. A **data lake** (Parquet/JSON files on S3/GCS) is a schema-on-read system that stores any data format on cheap object storage. Choose it when data is large and diverse (images, text, raw logs), ML frameworks need direct file access, and cost is the primary concern. A **data lakehouse** (Delta Lake, Iceberg, Hudi) adds ACID transactions, schema enforcement, time-travel, and partition evolution to a data lake. Choose it when you need both SQL analytics and direct file access, when time-travel is required for reproducibility or compliance, or when concurrent pipelines (batch + streaming) write to the same tables and need transaction isolation.
Question 5
What does "time-travel" mean in the context of a data lakehouse, and how does it support ML reproducibility?
Answer
**Time-travel** is the ability to query a table as it existed at any historical point in time — by timestamp (`TIMESTAMP AS OF`) or by version number (`VERSION AS OF`). It is implemented through a metadata layer (Delta Log, Iceberg snapshots) that records the state of the table at each transaction. For ML reproducibility, time-travel enables: (1) reconstructing the exact features that were available at training time, even months later; (2) debugging production model degradation by inspecting features at the time the degradation began; (3) regulatory compliance by demonstrating which data was available when a decision was made; and (4) safe feature backfills — computing historical values for a new feature by querying the source data at historical timestamps, ensuring no future data leakage.
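As an illustration of the mechanism only — not any specific lakehouse implementation — a toy version log that resolves reads "as of" a version or a timestamp; real systems store snapshot metadata and file manifests, not full table copies:

```python
# Append-only version log: each commit records (version, commit_ts, state).
log = []

def commit(ts: int, state: dict) -> None:
    # Record a full snapshot of the table at this transaction (toy model).
    log.append((len(log), ts, dict(state)))

def as_of_version(v: int) -> dict:
    return log[v][2]                     # VERSION AS OF v

def as_of_timestamp(ts: int) -> dict:
    # TIMESTAMP AS OF ts: latest commit at or before ts.
    return max((e for e in log if e[1] <= ts), key=lambda e: e[1])[2]

commit(100, {"u1": {"clicks_7d": 3.0}})
commit(200, {"u1": {"clicks_7d": 5.0}})

print(as_of_version(0))       # the table as it was at version 0
print(as_of_timestamp(150))   # the snapshot that was current at t=150
```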
Question 6
Why does columnar storage (Parquet) outperform row-oriented storage for ML feature engineering queries?
Answer
ML feature engineering queries typically read a subset of columns (e.g., 5-10 features out of 100+ columns) and aggregate over many rows. Columnar storage like Parquet stores each column contiguously, enabling: (1) **column pruning** — only the columns used in the query are read from disk, reducing I/O by the fraction of unused columns (e.g., reading 5 of 100 columns reduces I/O by ~95%); (2) **predicate pushdown** — min/max statistics per row group allow entire row groups to be skipped when they cannot match a filter condition; and (3) **compression efficiency** — values within a column tend to be similar, enabling highly effective dictionary encoding, run-length encoding, and delta encoding. Row-oriented storage would need to read every column in every row even when only a few columns are needed.
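The column-pruning arithmetic can be checked back-of-envelope; the row count and per-value width here are illustrative, and compression and predicate pushdown would improve the columnar side further:

```python
# I/O comparison for a 100-column table where a query touches 5 columns.
total_cols, used_cols = 100, 5
bytes_per_value = 8                # assume fixed-width values for simplicity
n_rows = 10_000_000

row_oriented_io = n_rows * total_cols * bytes_per_value  # reads every column
columnar_io     = n_rows * used_cols  * bytes_per_value  # column pruning

print(f"row-oriented: {row_oriented_io / 1e9:.1f} GB")   # 8.0 GB
print(f"columnar:     {columnar_io / 1e9:.1f} GB")       # 0.4 GB
print(f"I/O saved:    {1 - columnar_io / row_oriented_io:.0%}")  # 95%
```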
Question 7
What is a data contract, and why is it particularly important for ML systems?
Answer
A **data contract** is a formal agreement between a data producer (e.g., the mobile engineering team) and a data consumer (e.g., the ML platform team) that specifies: the schema (field names, types, nullability), quality expectations (no null IDs, valid value ranges, freshness SLAs), semantic definitions (what each field means), and compatibility rules (how the schema can evolve). Data contracts are particularly important for ML systems because: (1) ML models are sensitive to data distribution changes that would be invisible to traditional applications; (2) the blast radius of a data quality issue in ML is larger — a single upstream change can degrade model quality across all predictions; (3) the failure mode is silent — the model produces plausible but degraded predictions rather than errors; and (4) ML teams are typically downstream consumers who have no control over upstream data producers. Contracts make the dependency explicit, testable, and enforceable.
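A minimal hand-rolled sketch of contract enforcement, assuming a contract expressed as field name to (type, nullability). The field names are illustrative; a production system would use a schema registry and a validation tool rather than ad-hoc checks:

```python
# Hypothetical contract for an interaction event.
CONTRACT = {
    "user_id":  (str,   False),   # no null IDs
    "event_ts": (int,   False),
    "duration": (float, True),    # nullable
}

def violations(record: dict) -> list[str]:
    """Return a list of contract violations for one record (empty if valid)."""
    errs = []
    for field, (ftype, nullable) in CONTRACT.items():
        if field not in record:
            errs.append(f"missing field: {field}")
        elif record[field] is None:
            if not nullable:
                errs.append(f"null not allowed: {field}")
        elif not isinstance(record[field], ftype):
            errs.append(f"wrong type for {field}: {type(record[field]).__name__}")
    return errs

print(violations({"user_id": "u1", "event_ts": 100, "duration": None}))  # []
print(violations({"user_id": None, "event_ts": "late"}))  # three violations
```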
Question 8
Classify each of the following schema changes as backward compatible, forward compatible, or breaking: (a) adding a nullable column, (b) removing a column, (c) renaming a column, (d) widening a type from INT32 to INT64.
Answer
**(a)** Adding a nullable column: **backward and forward compatible** (fully compatible). New readers can read old data (the new column is NULL). Old readers can read new data (they ignore the extra column). **(b)** Removing a column: **backward compatible** only. New readers, which no longer expect the column, can read old data — they simply ignore the extra column. Old readers, which do expect the column, fail on new data unless they tolerate missing columns. In practice, many frameworks treat removal as breaking. **(c)** Renaming a column: **breaking**. Old and new readers look for different column names, so neither direction works without explicit mapping. Exception: Iceberg tracks columns by ID rather than name, so renames in Iceberg are non-breaking, metadata-only operations. **(d)** Widening a type from INT32 to INT64: **backward compatible**. New readers can read old data (INT32 values fit in INT64). Old readers may fail on new data if INT64 values exceed the INT32 range.
Question 9
What is the role of feature TTL (time-to-live) in an online feature store, and what happens when a feature value exceeds its TTL?
Answer
**TTL (time-to-live)** specifies how long a cached feature value in the online store remains valid. If a feature value has not been refreshed within the TTL period, the feature store treats it as expired and returns the configured **default value** (e.g., zero, global mean, or a sentinel value indicating missing data) rather than the stale value. The rationale: stale features are more dangerous than missing features. A stale feature silently degrades model quality because the model receives a value that looks normal but is outdated. A missing feature (default value) triggers explicit handling — the model was trained with default values for missing features and has learned appropriate behavior. TTL thus acts as a safety mechanism: if the materialization pipeline is delayed or fails, the online store degrades gracefully to defaults rather than serving increasingly stale values.
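A sketch of the read path with TTL fallback, with an in-process dict standing in for the online store; the key format, TTL, and default values are illustrative:

```python
import time

TTL_SECONDS = 3600
DEFAULTS = {"clicks_7d": 0.0}   # what the model was trained to expect when missing

store = {}  # key -> (value, write_time); stands in for Redis

def write_feature(key: str, value: float) -> None:
    store[key] = (value, time.time())

def read_feature(key: str, feature: str) -> float:
    if key in store:
        value, written_at = store[key]
        if time.time() - written_at <= TTL_SECONDS:
            return value                # fresh: serve the cached value
    return DEFAULTS[feature]            # expired or missing: degrade to default

write_feature("u1:clicks_7d", 4.2)
print(read_feature("u1:clicks_7d", "clicks_7d"))   # 4.2 (fresh write)
print(read_feature("u2:clicks_7d", "clicks_7d"))   # 0.0 (never materialized)
```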
Question 10
Explain the difference between ETL and ELT. Which is more common in modern ML feature engineering pipelines, and why?
Answer
**ETL (Extract, Transform, Load)**: data is extracted from sources, transformed in a separate processing layer (Spark, Python scripts), and then loaded into the target storage. The transformation happens outside the target system. **ELT (Extract, Load, Transform)**: data is extracted and loaded raw into the target system (data warehouse/lakehouse), then transformed in place using SQL (tools like dbt). The transformation happens inside the target system. Modern ML feature engineering typically uses **both**. ELT handles structured feature engineering — aggregations, joins, window functions — because SQL is expressive, testable (dbt tests), and runs efficiently on warehouse compute (BigQuery, Snowflake). ETL handles ML-specific transformations — embedding computation, tokenization, feature crossing, model inference — that cannot be expressed in SQL and require Python/Spark. The feature store sits downstream of both and provides a unified serving layer.
Question 11
What is a feature view in Feast, and how does it differ from a feature table?
Answer
A **feature view** is the primary abstraction in Feast for organizing features. It binds a set of features to: (1) one or more entities (the keys by which features are looked up, e.g., `user_id`), (2) a data source (the table or stream that provides the raw data), and (3) a TTL (how long cached values are valid). Feature views also specify whether features should be materialized to the online store. A **feature table** is a related but distinct concept: it is the physical storage table in the offline store that contains the timestamped feature values. A feature view is a logical concept that maps to one or more physical feature tables. Multiple feature views can read from the same source table but expose different subsets of columns or apply different transformations. The distinction matters because it allows logical feature organization (user batch features vs. user stream features) independent of physical storage layout.
Question 12
A feature store uses Redis as the online store with asynchronous replication across 3 replicas. Under what circumstances could two concurrent requests for the same user return different feature values?
Answer
With asynchronous replication, a write to the primary Redis node is acknowledged before it propagates to the replicas. If request A is routed to replica 1 (which has received the latest write) and request B is routed to replica 2 (which has not yet received the write), they will see different values for the same feature. This is the **eventual consistency** property of async replication. This can occur: (1) immediately after a streaming feature update (the write has reached the primary but not all replicas), (2) during a network partition between replicas, or (3) when replicas have different replication lag. The typical replication lag for Redis async replication is on the order of milliseconds (< 10ms in a well-provisioned cluster), so the inconsistency window is short. For most ML systems, this is acceptable — the model quality impact of a few milliseconds of feature staleness is negligible.
Question 13
How does Apache Iceberg's approach to schema evolution differ from Delta Lake's, specifically for column renames?
Answer
Apache Iceberg tracks columns by **internal column IDs** rather than by column names. When a column is renamed, the column ID remains the same, so old data files (which reference the column by ID) can still be read correctly under the new name — **no data rewrite is required**. This makes renames a metadata-only operation and backward compatible. Delta Lake, by contrast, references columns by **name and position**. Renaming a column in Delta Lake requires either: (1) adding a column mapping mode (available since Delta Lake 2.0 with column mapping enabled), which adds an ID-based mapping similar to Iceberg, or (2) rewriting the data files with the new column name. Without column mapping mode, a rename is a breaking change. This difference is significant for ML systems with large historical datasets: Iceberg allows schema evolution without the cost of rewriting terabytes of Parquet files.
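A toy illustration of why ID-based tracking makes renames metadata-only. The data structures are simplified stand-ins, not Iceberg's actual file formats: the point is that data files key values by column ID, and only the name-to-ID mapping in the schema changes:

```python
# Data file stores values by column ID; it is never rewritten.
data_file = {1: "u1", 2: 42}               # column_id -> stored value

schema_v1 = {"user_id": 1, "age": 2}       # name -> column_id
schema_v2 = {"user_id": 1, "user_age": 2}  # "age" renamed; same ID

def read(schema: dict, name: str):
    # Resolve name -> ID -> stored value, so old files read under any name.
    return data_file[schema[name]]

print(read(schema_v1, "age"))       # 42
print(read(schema_v2, "user_age"))  # 42 - same bytes, new name, no rewrite
```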
Question 14
What is a slowly changing dimension (SCD Type 2), and why is it important for ML feature stores?
Answer
A **slowly changing dimension (SCD Type 2)** tracks the full history of attribute changes for an entity by maintaining multiple rows per entity, each with `valid_from` and `valid_to` timestamps. When an attribute changes, the current row is closed (`valid_to` set to the change date, `is_current = FALSE`) and a new row is opened (`valid_from` set to the change date, `is_current = TRUE`). For ML feature stores, SCD Type 2 is essential for **point-in-time correct joins** on slowly changing attributes like subscription tier, geographic region, or account status. Without SCD Type 2, a point-in-time join for a training example from 6 months ago would use the *current* attribute values, not the values that were valid at that time. For example, if a user upgraded from "free" to "premium" in January, a training example from December should use "free" — not "premium." SCD Type 2 enables this by providing a historical record that the point-in-time join can query with `valid_from <= event_timestamp AND (valid_to > event_timestamp OR valid_to IS NULL)`.
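The validity-window predicate can be exercised on a toy SCD Type 2 table with SQLite; table and column names (and the integer timestamps) are illustrative:

```python
import sqlite3

# Toy dimension table: the user upgraded from 'free' to 'premium' at t=200.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE user_dim (user_id TEXT, tier TEXT, valid_from INTEGER, valid_to INTEGER);
INSERT INTO user_dim VALUES
  ('u1', 'free',    0,   200),   -- closed when the user upgraded
  ('u1', 'premium', 200, NULL);  -- current row
""")

def tier_as_of(user_id: str, event_ts: int) -> str:
    # The SCD Type 2 validity predicate from the text.
    row = conn.execute("""
        SELECT tier FROM user_dim
        WHERE user_id = ?
          AND valid_from <= ?
          AND (valid_to > ? OR valid_to IS NULL)
    """, (user_id, event_ts, event_ts)).fetchone()
    return row[0]

print(tier_as_of("u1", 120))  # free    - the value valid at that time
print(tier_as_of("u1", 300))  # premium - the current value
```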
Question 15
What is data lineage, and how does it support three distinct use cases: debugging, impact analysis, and regulatory compliance?
Answer
**Data lineage** is a directed graph that tracks how data flows from source to destination — which raw events are used to compute which features, which features train which models, and which models serve which endpoints. **Debugging:** When a model's production quality degrades, lineage traces the root cause backward: from the degraded predictions to the model to the features to the feature pipeline to the raw data source. This narrows the investigation from "something is wrong" to "the user session features are stale because the Flink pipeline failed." **Impact analysis:** Before changing a data source (e.g., renaming a column, changing a data format), lineage traces forward to identify all downstream features, models, and endpoints that would be affected. This prevents the common failure where an upstream change silently breaks downstream ML systems. **Regulatory compliance:** Regulations like ECOA and FCRA for credit scoring require demonstrating which data sources, feature values, and model version produced each decision. Lineage provides this audit trail — for any specific decision, the system can enumerate the complete provenance chain from raw data to prediction.
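A minimal sketch of lineage traversal over a hypothetical graph: backward reachability answers the debugging question ("what feeds this endpoint?"), forward reachability answers the impact-analysis question ("what breaks if this source changes?"). Node names are illustrative:

```python
from collections import defaultdict

edges = {  # upstream -> downstream (the lineage graph)
    "raw_events": ["user_session_features"],
    "user_session_features": ["ranking_model_v3"],
    "ranking_model_v3": ["recs_endpoint"],
}
reverse = defaultdict(list)          # downstream -> upstream, for debugging
for up, downs in edges.items():
    for down in downs:
        reverse[down].append(up)

def reach(start: str, graph) -> set[str]:
    """All nodes reachable from `start` (depth-first)."""
    seen, stack = set(), [start]
    while stack:
        for nxt in graph.get(stack.pop(), []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

print(reach("recs_endpoint", reverse))  # backward: trace degradation to sources
print(reach("raw_events", edges))       # forward: everything this change affects
```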
Question 16
You are designing a feature store for a system with 50 million users and 500,000 items. Each user has 15 features, and each item has 10 features. The online store must serve feature lookups for 1 user and 200 items per request, at 10,000 requests per second. Estimate the total Redis memory required, assuming an average of 200 bytes per feature vector.
Answer
**User features:** 50,000,000 users $\times$ 200 bytes = 10 GB. **Item features:** 500,000 items $\times$ 200 bytes = 0.1 GB. **Total raw data:** ~10.1 GB. With Redis overhead (data structures, pointers, hash table entries), typical overhead is 2-3x the raw data size, so approximately **20-30 GB**. **Throughput check:** Each request requires 1 user lookup + 1 batch item lookup (200 items, but fetched in a single `MGET` or pipeline command, approximately 2-3 Redis operations). At 10,000 req/s, that is ~20,000-30,000 Redis operations per second — well within a single Redis node's capacity (~100,000 ops/s). With 3 replicas for availability, the cluster needs 3 nodes with at least 30 GB each. Total memory: **~90 GB across 3 replicas** (30 GB per replica). This is modest — a single `r6g.xlarge` instance provides 32 GB, so 3 instances suffice.
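The sizing arithmetic can be reproduced directly (using the 3x overhead upper bound from the estimate):

```python
users, items = 50_000_000, 500_000
bytes_per_vector = 200

user_gb = users * bytes_per_vector / 1e9      # 10.0 GB of raw user features
item_gb = items * bytes_per_vector / 1e9      # 0.1 GB of raw item features
raw_gb = user_gb + item_gb                    # ~10.1 GB raw

per_replica_gb = raw_gb * 3                   # ~3x Redis overhead (upper bound)
cluster_gb = per_replica_gb * 3               # 3 replicas for availability

ops_per_sec = 10_000 * 3                      # ~2-3 Redis ops per request

print(f"raw: {raw_gb:.1f} GB, per replica: {per_replica_gb:.1f} GB, "
      f"cluster: {cluster_gb:.1f} GB, throughput: {ops_per_sec} ops/s")
```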
Question 17
What is DVC (Data Version Control), and how does it complement a feature store and lakehouse in the ML data infrastructure stack?
Answer
**DVC (Data Version Control)** extends Git to handle large data files and ML artifacts. It stores lightweight pointer files (`.dvc` files) in Git while the actual data is stored in remote storage (S3, GCS, Azure Blob). `git checkout` plus `dvc checkout` restores the exact data files for any historical commit. DVC complements the feature store and lakehouse by covering a different layer of versioning:
- The **lakehouse** versions the tables themselves (time-travel queries on the offline store).
- The **feature store** versions feature definitions and materialized feature values.
- **DVC** versions the training artifacts: the specific training dataset snapshot, the model checkpoint, and the evaluation results.

DVC connects these artifacts via a reproducible pipeline (`dvc.yaml`), so `dvc repro` re-runs exactly the steps whose inputs have changed. Together, they provide end-to-end reproducibility: the lakehouse stores the data, the feature store computes the features with point-in-time correctness, and DVC tracks which data version, feature version, and hyperparameters produced each model version.
Question 18
A streaming feature pipeline (Flink) and a batch feature pipeline (Spark) both compute user_session_completion_rate. The Flink pipeline produces values with mean 0.18, while the Spark pipeline produces values with mean 0.52 for the same users and time period. What are the most likely causes, and how would you debug this?
Answer
The most likely causes are:
1. **Different denominators:** The Flink pipeline may count all events (views + clicks + completions) in the denominator, while the Spark pipeline counts only views. Completions / all_events < completions / views.
2. **Different time windows:** The Flink pipeline may use a tumbling window while the Spark pipeline uses a trailing window, or the window boundaries differ (e.g., session-scoped vs. time-scoped).
3. **Different event filtering:** The streaming pipeline may include test/bot traffic that the batch pipeline filters out.
4. **Late event handling:** The Flink pipeline may not wait for late events, while the Spark pipeline processes all events for a completed time window.

**Debugging approach:** (1) Pick 10-20 specific users and time windows. (2) Compute the feature from both pipelines for those users. (3) Inspect the intermediate values: numerator, denominator, event counts, timestamps. (4) Identify the first point of divergence. (5) Once the root cause is found, add a consistency test that runs both computations on a common subset and alerts on divergence > threshold.
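A sketch of the consistency test suggested in step (5), with two stand-in computations that reproduce a denominator mismatch on a shared event sample; the event data, function names, and alert threshold are all illustrative:

```python
events = [  # (user, event_type) sample for one fixed window
    ("u1", "view"), ("u1", "view"), ("u1", "click"), ("u1", "completion"),
]

def batch_rate(evts):
    # Intended definition: completions / views.
    views = sum(1 for _, t in evts if t == "view")
    done = sum(1 for _, t in evts if t == "completion")
    return done / views

def stream_rate(evts):
    # The bug: completions / ALL events inflates the denominator.
    done = sum(1 for _, t in evts if t == "completion")
    return done / len(evts)

b, s = batch_rate(events), stream_rate(events)
alert = abs(b - s) > 0.05      # consistency test: alert on divergence > threshold
print(b, s, alert)             # 0.5 0.25 True -> the two definitions disagree
```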
Question 19
What is the data mesh paradigm, and how does it change the relationship between ML teams and data producers?
Answer
**Data mesh** (Dehghani, 2022) is an organizational architecture built on four principles: (1) **domain ownership** — the team that produces data owns it as a product; (2) **data as a product** — each data asset has a discoverable interface, documented schema, quality guarantees, and an owner; (3) **self-serve data platform** — shared infrastructure enables domain teams to publish data products without building their own infrastructure; (4) **federated computational governance** — global policies are enforced by the platform while domain-specific decisions remain with domain teams. For ML teams, data mesh changes the relationship from *dependency* to *product consumption*. Instead of the ML team asking the data engineering team to build a pipeline (centralized), the domain team (e.g., mobile engineering) publishes interaction events as a data product with a contract, SLA, and documentation. The ML team consumes this product through the contract interface. This makes dependencies explicit, reduces coordination overhead, and aligns incentives: the domain team's job is to provide high-quality data, not just to "send events somewhere."
Question 20
Describe the complete lifecycle of a feature value in the StreamRec feature store, from the moment a user interaction occurs to the moment that feature value influences a recommendation for another user.