Chapter 25: Exercises
Exercises are graded by difficulty:

- One star (*): Apply the technique from the chapter to a new dataset or scenario
- Two stars (**): Extend the technique or combine it with a previous chapter's methods
- Three stars (***): Derive a result, implement from scratch, or design a system component
- Four stars (****): Research-level problems that connect to open questions in the field
Feature Stores and Online-Offline Consistency
Exercise 25.1 (*)
A ride-sharing company builds a surge pricing model that uses the feature driver_availability_ratio — the ratio of available drivers to pending ride requests in a geographic zone. During training, this feature is computed from a data warehouse using a SQL query that aggregates events over 15-minute windows. During serving, it is computed in real-time from a Kafka stream using a Flink sliding window.
(a) Identify three specific ways the training and serving computations could diverge, producing training-serving skew.
(b) For each source of skew, describe a concrete consequence for the surge pricing model's predictions.
(c) Propose a feature store design that eliminates each source of skew.
Exercise 25.2 (*)
For each of the following features, classify whether it should be served from the batch (offline) store or the streaming (online) store, and justify your choice. Specify the TTL you would set for each.
- A user's average order value over the last 90 days (e-commerce)
- The number of items currently in a user's shopping cart (e-commerce)
- A product's average rating over its lifetime (e-commerce)
- The number of fraud reports filed against a seller in the last hour (marketplace)
- A patient's most recent blood pressure reading (healthcare)
- A user's preferred language (social media)
- The current temperature at a weather station (climate monitoring)
Exercise 25.3 (*)
A feature store uses Redis as its online store with a TTL of 24 hours for batch features. At 3:00 AM, the daily Spark pipeline fails due to a misconfigured S3 bucket policy. The on-call engineer fixes the issue at 8:00 AM and restarts the pipeline, which completes at 9:30 AM.
(a) During the period from 3:00 AM to 9:30 AM, what happens to feature values in the online store? Consider features that were last materialized at 2:00 AM the previous day versus features materialized at 2:00 AM today (before the failure).
(b) Some users who were active between 3:00 AM and 9:30 AM received recommendations based on stale features. Quantify the staleness for each group of features.
(c) Design a monitoring alert that would detect this failure within 30 minutes of the pipeline's expected completion time.
Exercise 25.4 (**)
The StreamRec feature store serves 10,000 requests per second at peak. Each request requires looking up 7 user features and 20 item features (for the top-20 candidates from retrieval).
(a) Calculate the total Redis operations per second, assuming each feature view requires one HGETALL command per entity key.
(b) A single Redis node can handle approximately 100,000 operations per second. How many Redis nodes are needed for the read workload alone? Include a 2x headroom factor for spikes.
(c) The team is considering switching to batch feature pre-computation: at serving time, instead of looking up individual features, the system retrieves a single pre-computed feature vector per user (stored as a serialized Protobuf). Compare the Redis operations per second under this design. What are the trade-offs?
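The arithmetic in parts (a) and (b) can be sanity-checked in a few lines. The sketch below assumes all 7 user features live in a single feature view (so one HGETALL against one user key per request) plus one HGETALL per item candidate; adjust the key count if your feature views map to entity keys differently.

```python
# Back-of-envelope Redis capacity check for Exercise 25.4 (a)-(b).
requests_per_sec = 10_000
keys_per_request = 1 + 20  # one user entity key + 20 item entity keys (assumption)

ops_per_sec = requests_per_sec * keys_per_request  # 210,000 ops/s
node_capacity = 100_000
headroom = 2

# Ceiling division: nodes needed to absorb the workload with 2x headroom.
nodes_needed = -(-ops_per_sec * headroom // node_capacity)

print(f"{ops_per_sec:,} ops/s -> {nodes_needed} nodes with {headroom}x headroom")
```

Under the pre-computed-vector design in part (c), the same arithmetic drops to one GET per request for user features, which is the core of the trade-off the exercise asks you to analyze.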
Exercise 25.5 (**)
Implement a FeatureStalenessMonitor class in Python that:
- Accepts a dictionary of feature names and their expected refresh intervals.
- Tracks the last update timestamp for each feature.
- Returns a report of all features that are stale (i.e., last update was more than the expected interval ago).
- Computes a staleness severity score: $\text{severity} = \frac{t_{\text{now}} - t_{\text{last\_update}}}{\text{expected\_interval}}$, where severity $> 1.0$ means the feature is stale and severity $> 2.0$ is critical.
Test your implementation with a scenario where 3 out of 10 features are stale.
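One possible skeleton is sketched below (not a reference solution); it takes explicit timestamps so the stale/fresh scenario is deterministic in tests, and treats intervals and timestamps as seconds.

```python
import time

class FeatureStalenessMonitor:
    """Flags features whose last update exceeds their expected refresh interval."""

    def __init__(self, expected_intervals):
        # feature name -> expected refresh interval in seconds
        self.expected_intervals = dict(expected_intervals)
        self.last_updates = {}

    def record_update(self, feature, timestamp=None):
        self.last_updates[feature] = time.time() if timestamp is None else timestamp

    def staleness_report(self, now=None):
        """Return {feature: {"severity": float, "critical": bool}} for stale features."""
        now = time.time() if now is None else now
        report = {}
        for feature, interval in self.expected_intervals.items():
            last = self.last_updates.get(feature)
            if last is None:
                continue  # never updated; a stricter policy could flag this too
            severity = (now - last) / interval
            if severity > 1.0:
                report[feature] = {"severity": severity, "critical": severity > 2.0}
        return report
```

For the required test, register 10 features with a 60-second interval, update 3 of them 200 seconds ago and the rest 30 seconds ago, and verify that exactly those 3 appear in the report with severity 200/60 ≈ 3.3 (critical).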
Point-in-Time Joins
Exercise 25.6 (*)
A credit scoring model uses the feature applicant_credit_utilization — the ratio of total credit balance to total credit limit. The feature is updated daily at midnight from a credit bureau pull. An applicant submits their application at 2:00 PM on March 15.
(a) What value should the point-in-time join retrieve for this feature? The midnight March 15 value or the midnight March 14 value? Explain.
(b) The applicant paid off a large credit card balance at 10:00 AM on March 15. The midnight March 15 snapshot does not reflect this payment (it will appear in the midnight March 16 snapshot). How does this affect the model's prediction? Is this a bug or a correct behavior?
(c) Write the SQL for a point-in-time join that retrieves the credit utilization for each application in a batch, using the window function pattern from Section 25.4.
Exercise 25.7 (**)
Consider a time series of feature values for user 42:
| feature_timestamp | user_7d_completion_rate |
|---|---|
| 2025-03-10 00:00:00 | 0.65 |
| 2025-03-11 00:00:00 | 0.68 |
| 2025-03-12 00:00:00 | 0.72 |
| 2025-03-13 00:00:00 | 0.71 |
| 2025-03-14 00:00:00 | 0.75 |
And a set of training events for user 42:
| event_timestamp | item_id | label |
|---|---|---|
| 2025-03-11 14:30:00 | 789 | 1 |
| 2025-03-12 09:00:00 | 456 | 0 |
| 2025-03-13 22:15:00 | 1023 | 1 |
(a) For each training event, determine which user_7d_completion_rate value the point-in-time join should return. Show your reasoning.
(b) If you used a naive join (latest feature value), what values would be returned? For which events does this introduce data leakage?
(c) Compute the absolute error between the point-in-time correct values and the naive join values. In what scenario would this error be largest?
Exercise 25.8 (***)
Implement a PointInTimeJoiner class in Python (using pandas) that:
- Accepts an entity DataFrame with columns [entity_id, event_timestamp, label].
- Accepts a features DataFrame with columns [entity_id, feature_timestamp, feature_1, feature_2, ...].
- Returns a merged DataFrame where each event row is joined with the most recent feature row where feature_timestamp <= event_timestamp.
- Handles the edge case where no feature row exists before the event timestamp (returns NaN).
Test with the data from Exercise 25.7 and verify the correct values.
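A minimal sketch of one possible implementation leans on pandas.merge_asof instead of a hand-rolled join (column names as specified in the exercise):

```python
import pandas as pd

class PointInTimeJoiner:
    """Joins each event with the most recent feature row at or before it."""

    def join(self, events: pd.DataFrame, features: pd.DataFrame) -> pd.DataFrame:
        # merge_asof requires both frames sorted on the join key.
        events = events.sort_values("event_timestamp")
        features = features.sort_values("feature_timestamp")
        # direction="backward" selects the latest row with
        # feature_timestamp <= event_timestamp; events with no earlier
        # feature row get NaN, covering the required edge case.
        return pd.merge_asof(
            events,
            features,
            left_on="event_timestamp",
            right_on="feature_timestamp",
            by="entity_id",
            direction="backward",
        )
```

On the Exercise 25.7 data this should return 0.68, 0.72, and 0.71 for the three events, matching the point-in-time correct answers.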
Storage Architectures
Exercise 25.9 (*)
For each of the following ML workloads, recommend a storage architecture (data warehouse, data lake, or lakehouse) and justify your choice:
- A financial fraud detection system that processes 10 million structured transactions per day and must support regulatory audits.
- A computer vision system that stores 500 TB of medical images with associated DICOM metadata.
- A recommendation system that needs SQL-based feature engineering on 5 TB of structured user events and direct Parquet file access for PyTorch training.
- A small startup with 50 GB of data, one data scientist, and a limited budget.
- A genomics research lab that processes 200 TB of sequencing data with custom bioinformatics pipelines.
Exercise 25.10 (**)
A company is migrating from a data lake (Parquet files on S3) to a Delta Lake lakehouse. The data lake contains 50 TB of user interaction events partitioned by event_date.
(a) Describe the migration strategy. Can you convert the existing Parquet files to Delta Lake in place, or must you rewrite them?
(b) After migration, the team discovers that some Parquet files have inconsistent schemas (an older pipeline wrote user_id as INT32 while the newer pipeline writes it as INT64). How does Delta Lake handle this? What happens when a query reads both old and new files?
(c) The team wants to add time-travel support for the last 30 days. What is the storage overhead of maintaining 30 days of Delta Log history? How does the VACUUM command interact with time-travel?
Exercise 25.11 (**)
A Parquet file contains 100 columns, each storing 1 million rows. The file has 10 row groups of 100,000 rows each.
(a) A query reads 5 columns with a filter on column A (WHERE A > 100). Column A's row group statistics show: RG1 min=0, max=50; RG2 min=40, max=120; RG3 min=110, max=200; ... (remaining row groups have max > 100). How many row groups can be skipped by predicate pushdown?
(b) Calculate the I/O reduction from column pruning alone (reading 5 of 100 columns), assuming all columns are the same size.
(c) Calculate the combined I/O reduction from column pruning and row group pruning for the query above, assuming 3 out of 10 row groups can be skipped.
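The I/O arithmetic in parts (b) and (c) can be checked mechanically; the fractions below come straight from the exercise's stated assumptions (equal column sizes, equal row-group sizes, 3 of 10 row groups skipped).

```python
# Sanity-check arithmetic for Exercise 25.11 (b)-(c).
total_columns, read_columns = 100, 5
total_row_groups, skipped_row_groups = 10, 3

column_fraction = read_columns / total_columns  # fraction of bytes after column pruning
row_group_fraction = (total_row_groups - skipped_row_groups) / total_row_groups
combined_fraction = column_fraction * row_group_fraction

print(f"column pruning alone: {column_fraction:.0%} of the file read")
print(f"column + row-group pruning: {combined_fraction:.1%} of the file read")
```

The two prunings compose multiplicatively because column pruning selects column chunks and row-group pruning selects which row groups' chunks are read at all.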
Data Contracts and Schema Evolution
Exercise 25.12 (*)
Write a data contract (using the DataContract class from Section 25.10) for the following data asset:
A table called orders produced by the e-commerce backend team. It contains:
- order_id (STRING, not nullable): unique order identifier
- customer_id (INT64, not nullable, PII): customer identifier
- order_total_usd (FLOAT64, not nullable): total order amount, must be positive
- order_timestamp (TIMESTAMP, not nullable): when the order was placed
- shipping_country (STRING, nullable): two-letter ISO country code
- items_count (INT64, not nullable): number of items, must be >= 1
Include at least three quality expectations. Specify the compatibility mode and freshness SLA.
Exercise 25.13 (*)
Classify each of the following schema changes as backward compatible, forward compatible, full compatible, or breaking. Explain your reasoning.
- Adding a new nullable column user_referral_source to the user features table.
- Removing the column user_legacy_score that no active model uses.
- Renaming click_count to interaction_count.
- Changing user_id from INT32 to INT64.
- Changing completion_rate from FLOAT64 to FLOAT32.
- Adding a new non-nullable column user_region with default value 'unknown'.
- Changing the semantics of session_length from seconds to milliseconds (without changing the type).
Exercise 25.14 (**)
A production ML system has the following dependency chain:
raw_events → user_features → ranking_model → recommendation_api
The data engineering team wants to rename event_type to interaction_type in the raw_events table.
(a) Using the schema evolution plan pattern from Section 25.11, create a migration checklist for this change.
(b) Estimate the total time to complete the migration, assuming: schema change (1 hour), pipeline updates (4 hours), feature backfill (8 hours), model retraining (6 hours), A/B testing (7 days), and monitoring (48 hours). What is the critical path?
(c) Propose an alternative approach that avoids the breaking change entirely. What are the trade-offs?
Exercise 25.15 (**)
The Confluent Schema Registry enforces compatibility rules for Avro schemas on Kafka topics. Given the following Avro schema for StreamRec interaction events:
```json
{
  "type": "record",
  "name": "InteractionEvent",
  "fields": [
    {"name": "user_id", "type": "long"},
    {"name": "item_id", "type": "long"},
    {"name": "event_type", "type": "string"},
    {"name": "timestamp", "type": "long"},
    {"name": "duration_seconds", "type": ["null", "double"], "default": null}
  ]
}
```
(a) Write an evolved schema that adds a device_type field. Is this backward compatible? Forward compatible?
(b) Write an evolved schema that removes the duration_seconds field. Under what compatibility mode is this allowed?
(c) The team wants to change event_type from a string to an enum with values ["view", "click", "completion", "rating"]. Is this backward compatible in Avro? What is the safe migration strategy?
Feature Engineering at Scale
Exercise 25.16 (*)
Write a SQL query (PostgreSQL or Spark SQL syntax) that computes the following item-level features for each item in the StreamRec catalog:
- item_7d_impressions: total impressions in the last 7 days
- item_7d_completion_rate: fraction of views that resulted in completions in the last 7 days
- item_7d_unique_users: number of distinct users who interacted in the last 7 days
- item_velocity: ratio of 7-day impressions to 30-day impressions (a measure of trending momentum)
Use the streamrec.events.interactions table with columns item_id, user_id, event_type, event_date.
Exercise 25.17 (**)
Write a SQL query that computes a user-item affinity score for every (user, category) pair based on the user's interaction history. The affinity score should be a weighted sum:
$$\text{affinity}(u, c) = \frac{\sum_{i \in I_{u,c}} w(e_i) \cdot \text{recency}(t_i)}{\sum_{i \in I_u} w(e_i) \cdot \text{recency}(t_i)}$$
where:
- $I_{u,c}$ is the set of user $u$'s interactions with items in category $c$
- $I_u$ is the set of all of user $u$'s interactions
- $w(e_i)$ is an event weight: view=1, click=2, completion=5, rating=3
- $\text{recency}(t_i) = \exp(-\lambda \cdot \text{days\_ago}(t_i))$ with $\lambda = 0.1$
Exercise 25.18 (**)
Implement a slowly changing dimension (SCD Type 2) update pipeline in Python. Given:
- A current dimension table with columns [entity_id, attribute, valid_from, valid_to, is_current]
- A batch of new attribute values with columns [entity_id, attribute, effective_date]
Your pipeline should:
- For entities whose attribute has changed: close the current record (set valid_to = effective_date, is_current = False) and insert a new current record.
- For entities whose attribute has not changed: do nothing.
- For new entities: insert a new current record.
Test with a user subscription tier dimension where 3 out of 10 users have changed tiers.
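One way to sketch the pipeline in pandas is shown below. This is a simple row-at-a-time version for clarity; a production variant would use vectorized merges. Column names follow the exercise; the open-ended valid_to is represented as None.

```python
import pandas as pd

def apply_scd2(dim: pd.DataFrame, updates: pd.DataFrame) -> pd.DataFrame:
    """SCD Type 2: close changed records and insert new current versions."""
    dim = dim.copy()
    current = dim[dim["is_current"]].set_index("entity_id")
    new_rows = []
    for row in updates.itertuples(index=False):
        if row.entity_id in current.index:
            if current.loc[row.entity_id, "attribute"] == row.attribute:
                continue  # unchanged: do nothing
            # changed: close the current record
            mask = (dim["entity_id"] == row.entity_id) & dim["is_current"]
            dim.loc[mask, "valid_to"] = row.effective_date
            dim.loc[mask, "is_current"] = False
        # changed attribute or brand-new entity: insert a new current record
        new_rows.append({
            "entity_id": row.entity_id,
            "attribute": row.attribute,
            "valid_from": row.effective_date,
            "valid_to": None,
            "is_current": True,
        })
    return pd.concat([dim, pd.DataFrame(new_rows)], ignore_index=True)
```

With a 10-user tier dimension where 3 users change tier, the result should contain 13 rows (10 originals plus 3 new versions), with exactly one current row per pre-existing entity.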
Exercise 25.19 (***)
Design and implement a feature computation framework in Python that:
- Defines features declaratively (name, entity key, aggregation type, window, source table).
- Generates the SQL for each feature automatically.
- Handles multiple aggregation types: count, sum, avg, min, max, count_distinct, rate (count of condition / total count).
- Generates point-in-time correct SQL using the window function pattern.
Demonstrate with at least 5 features for StreamRec.
Data Lineage and Versioning
Exercise 25.20 (*)
Draw (or describe in text) the data lineage graph for the Meridian Financial credit scoring system. Include:

- Source nodes: credit bureau API, application form, bank transaction history, employment verification
- Feature nodes: credit utilization, income ratio, delinquency score, employment stability
- Model nodes: real-time scoring model (fast, fewer features), batch scoring model (full, all features)
- Endpoint nodes: instant decision API, overnight batch scoring
For each node, specify the owner team and the data contract between connected nodes.
Exercise 25.21 (**)
Using DVC, design a versioning strategy for the StreamRec ML project that tracks:
- Training data (Parquet files, ~5 GB per snapshot)
- Feature definitions (Python/YAML files)
- Model artifacts (PyTorch checkpoints, ~500 MB each)
- Evaluation results (JSON/CSV files, ~10 MB each)
(a) Write the DVC pipeline definition (dvc.yaml) that connects data → features → training → evaluation.
(b) Explain how dvc repro ensures reproducibility when a feature definition changes but the raw data does not.
(c) The team wants to compare the performance of two different feature sets (v1 and v2) using the same model architecture and training data. Describe the DVC workflow for this experiment.
Exercise 25.22 (**)
A regulatory audit for Meridian Financial requires answering the question: "For loan application #A-2025-03-15-7842, which data sources, feature values, model version, and decision logic were used to produce the denial?"
(a) Design the lineage schema (tables and relationships) that would store this information.
(b) Write the SQL query that retrieves the complete audit trail for a given application ID.
(c) Estimate the storage cost of maintaining this audit trail for 5 years at 15,000 applications per day, assuming 2 KB per application record.
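A back-of-envelope check for part (c), assuming 2 KB per record and ignoring indexes, replication overhead, and leap days:

```python
# Rough audit-trail storage estimate for Exercise 25.22 (c).
apps_per_day = 15_000
days = 5 * 365                      # 5-year retention, leap days ignored
record_bytes = 2 * 1024             # 2 KB per application record

total_records = apps_per_day * days
total_bytes = total_records * record_bytes

print(f"{total_records:,} records, {total_bytes / 1024**3:.1f} GiB")
```

At this scale the raw records are on the order of tens of GiB, so the dominant cost question is usually indexing and replication rather than base storage.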
System Design
Exercise 25.23 (***)
Design the complete data infrastructure for a real-time ad serving system that:

- Serves 500,000 ad requests per second
- Uses 200 features per ad candidate (50 user features, 50 ad features, 100 contextual/cross features)
- Requires feature freshness: user features within 60 seconds, ad features within 1 hour, contextual features real-time
- Has a total feature lookup budget of 10ms
Include: feature store architecture, storage backend selection, caching strategy, and capacity planning.
Exercise 25.24 (***)
Implement a data contract enforcement system in Python that:
- Reads contract definitions from YAML files.
- Validates incoming data batches against the contract (schema check, quality expectations, freshness).
- Blocks downstream pipelines if the contract is violated (severity = "error").
- Sends alerts for warning-level violations.
- Logs all validation results for auditability.
Demonstrate with the StreamRec interaction events contract from Section 25.10.
Exercise 25.25 (***)
The StreamRec team discovers that the streaming feature user_session_completion_rate has been computing a different value than the batch version for 3 weeks. The streaming version divides completions by total events (including clicks), while the batch version divides completions by views only.
(a) Quantify the expected impact on model quality. If the batch version produces values in [0, 1] with mean 0.52, and the streaming version produces values in [0, 0.3] with mean 0.18, what is the distributional shift?
(b) Design a test suite that would have caught this inconsistency before it reached production.
(c) Write a remediation plan: how do you fix the streaming computation, backfill the corrected feature values, retrain affected models, and prevent recurrence?
Exercise 25.26 (***)
A healthcare ML system predicts patient readmission risk using features from electronic health records (EHR). The system must comply with HIPAA, which requires:

- All PII fields are encrypted at rest and in transit
- Access to patient data is logged and auditable
- Data is retained for a minimum of 6 years
- Patients can request deletion of their data
Design a feature store architecture that satisfies all four requirements. Address:

(a) How PII fields are stored in the online and offline stores.
(b) How access logging is implemented without adding latency to the serving path.
(c) How patient data deletion is handled when the feature store contains pre-aggregated features that include the patient's data.
Exercise 25.27 (****)
The feature freshness-quality trade-off is a fundamental tension in feature store design. Formally:
Let $q(s)$ be the model quality as a function of feature staleness $s$ (the time since the feature was last updated). Empirically, $q(s)$ is often well-approximated by:
$$q(s) = q_0 - \alpha \cdot s^\beta$$
where $q_0$ is the quality with perfectly fresh features, $\alpha > 0$ is the sensitivity coefficient, and $\beta > 0$ controls the shape of the decay (linear for $\beta = 1$, concave for $\beta < 1$, convex for $\beta > 1$).
Let $c(f)$ be the cost of maintaining feature freshness $f = 1/s$ (inversely proportional to staleness). Empirically, cost grows super-linearly with freshness:
$$c(f) = c_0 + \gamma \cdot f^\delta$$
where $c_0$ is the base infrastructure cost, $\gamma$ is the cost coefficient, and $\delta > 1$ captures the super-linear cost scaling.
(a) Formulate the optimization problem: maximize quality subject to a budget constraint, i.e., find the optimal staleness $s^*$ that maximizes $q(s)$ subject to $c(1/s) \leq B$.
(b) Solve for $s^*$ analytically when $\beta = 1$ and $\delta = 2$.
(c) For a system with 10 feature groups, each with different $(\alpha, \beta)$ parameters, formulate the optimization that allocates the budget across feature groups to maximize total quality. Is this problem convex?
(d) Implement a numerical solver for part (c) using scipy.optimize.minimize and demonstrate with realistic parameters for StreamRec.
Exercise 25.28 (****)
The data mesh paradigm (Dehghani, 2022) argues that centralized data teams do not scale and that domain teams should own their data products. Critics argue that decentralization leads to inconsistency, duplication, and governance gaps.
(a) For an organization with 50 ML models across 8 domain teams, design a federated feature store architecture that balances domain ownership with cross-domain feature sharing. Address: how does a model in the recommendations domain use features owned by the user-profile domain?
(b) Identify three failure modes specific to a federated feature store that do not exist in a centralized feature store. For each, propose a mitigation.
(c) Propose a governance model that enforces global data quality standards while preserving domain autonomy. How do you handle disagreements between a domain team and a consumer team about feature quality?
Exercise 25.29 (****)
Feature store consistency models. The online feature store uses eventual consistency (Redis with async replication). This means that immediately after a write, different replicas may return different values for the same feature.
(a) Define the consistency guarantee formally: for a feature $f$ updated at time $t_w$, any read at time $t_r > t_w + \Delta$ returns the updated value, where $\Delta$ is the replication lag. What is a typical value of $\Delta$ for Redis with async replication?
(b) Consider two concurrent requests for the same user: request A reads from replica 1 (which has the latest feature), and request B reads from replica 2 (which has a stale feature). The requests arrive 100ms apart. If both requests trigger ranking model inference, the two responses will use different feature values for the same user. Is this a problem? Under what circumstances?
(c) Propose a consistency protocol that guarantees that all requests for the same user within a session read the same feature values. What is the latency cost?
(d) Derive an upper bound on the model quality degradation caused by eventual consistency, as a function of the replication lag $\Delta$, the feature update frequency $f$, and the Lipschitz constant $L$ of the model with respect to the feature (i.e., $|y(x) - y(x')| \leq L \|x - x'\|$).