Chapter 24: Exercises

Exercises are graded by difficulty:

  • One star (*): Apply the technique from the chapter to a new dataset or scenario
  • Two stars (**): Extend the technique or combine it with a previous chapter's methods
  • Three stars (***): Derive a result, implement from scratch, or design a system component
  • Four stars (****): Research-level problems that connect to open questions in the field


ML System Components

Exercise 24.1 (*)

Consider a fraud detection system for an e-commerce platform that processes 500 transactions per second and must flag suspicious transactions before payment processing completes (within 200ms).

(a) List the core components of this ML system (data pipeline, feature store, training, serving, monitoring). For each component, specify one key requirement specific to the fraud detection domain.

(b) Which of the two loops (inner loop / outer loop) is more critical for this system? Justify your answer with reference to the consequences of failure in each loop.

(c) Draw a dependency graph showing which components depend on which. Identify the single point of failure that would have the greatest impact.


Exercise 24.2 (*)

For each of the following ML applications, identify which system component (from the component map in Section 24.2) is the most likely bottleneck and explain why:

  1. A search engine that serves 100,000 queries per second
  2. A medical image classifier deployed to a rural hospital with intermittent internet
  3. A dynamic pricing model for ride-sharing that must incorporate real-time supply/demand
  4. A language translation model used in a mobile app with offline capability

Exercise 24.3 (*)

The chapter distinguishes between the inner loop (online, milliseconds) and the outer loop (offline, hours/days). For each of the following events, identify which loop is affected and what the immediate consequence is:

  1. The feature store's Redis cluster experiences a 30-second network partition.
  2. A bug in the data pipeline causes 10% of user interaction events to be dropped.
  3. The GPU serving cluster's utilization hits 95%.
  4. The training pipeline produces a model with AUC 0.02 lower than the current production model.
  5. A new content category is added to the platform.

Serving Patterns

Exercise 24.4 (*)

A weather forecasting service generates predictions for 50,000 zip codes. The model takes 2 minutes per zip code on a single GPU.

(a) Is real-time serving feasible? Calculate the total compute time for a single complete forecast.

(b) Design a batch serving architecture for this system. Specify: (1) how often forecasts are updated, (2) how they are stored, and (3) how they are served to end users.

(c) A severe weather alert system requires updates within 15 minutes of new radar data. Can the batch architecture support this? If not, propose a hybrid architecture.


Exercise 24.5 (**)

An online advertising platform must select and serve a display ad within 50ms of a page load request. The ad selection model considers 10,000 candidate ads and user features.

(a) Decompose the 50ms budget across system components (feature lookup, candidate retrieval, scoring, auction logic, response). Justify each allocation.

(b) What is the maximum model complexity (in terms of FLOPs per candidate) given your latency budget for the scoring stage? Assume a T4 GPU with 65 TFLOPS (FP16) throughput.

(c) The system currently scores all 10,000 candidates with a lightweight model. Propose a two-stage architecture (like StreamRec's) that uses a complex model on a subset. What is the quality-latency trade-off?


Exercise 24.6 (**)

A content moderation system for a social media platform must classify user-uploaded images as safe or unsafe.

(a) The platform receives 10 million image uploads per day. Compare the cost and latency trade-offs of batch serving (scan all images hourly) vs. real-time serving (scan each image before it becomes visible). Consider both compute cost and user experience.

(b) Propose a near-real-time architecture that balances the two. How long is it acceptable for a potentially unsafe image to be visible before being reviewed?

(c) What graceful degradation strategy would you use when the image classification model is temporarily unavailable? Consider the asymmetry between false positives (blocking a safe image) and false negatives (showing an unsafe image).


Exercise 24.7 (*)

The chapter presents three serving patterns. For each scenario below, select the most appropriate pattern and write a one-paragraph justification:

  1. Personalized email digest sent every morning at 7am
  2. Autocomplete suggestions as a user types a search query
  3. Fraud score for insurance claims submitted via a web form (decision needed within 48 hours)
  4. Real-time sports commentary translation for live broadcasts
  5. Product recommendations on an e-commerce checkout page

Feature Stores and Training-Serving Skew

Exercise 24.8 (**)

The FeatureStoreSchema class in Section 24.5 tracks feature metadata. Extend it with the following capabilities:

(a) Add a method validate_feature_value(name: str, value: Any) -> Tuple[bool, str] that checks whether a feature value is within the expected range (based on dtype and any registered constraints). Handle at least: float range checks, list length checks for embedding features, and null value detection.

(b) Add a method dependency_graph() -> Dict[str, List[str]] that tracks which features depend on other features (e.g., user_category_preferences depends on user_last_10_interactions). Build the dependency graph for StreamRec's features.

(c) Use the dependency graph to compute a topological ordering of feature computation. Explain why this ordering matters for the feature engineering pipeline.


Exercise 24.9 (**)

You discover that StreamRec's user_avg_session_length_min feature has a training-serving skew: the training pipeline computes it using a SQL query AVG(session_length_min) WHERE session_length_min > 0, while the serving pipeline computes it using a Python function that includes zero-length sessions (sessions where the user opened the app but immediately closed it).
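The divergence can be reproduced in a few lines. The session values below are hypothetical, and the SQL WHERE filter is mimicked with a list comprehension:

```python
# Hypothetical session lengths (minutes); zeros are open-and-close sessions.
sessions = [0.0, 0.0, 5.2, 12.7, 3.1]

# Training pipeline: AVG(session_length_min) WHERE session_length_min > 0
nonzero = [s for s in sessions if s > 0]
train_value = sum(nonzero) / len(nonzero)    # 7.0

# Serving pipeline: Python mean over all sessions, zeros included.
serve_value = sum(sessions) / len(sessions)  # 4.2

assert serve_value < train_value
```

Any user with at least one zero-length session receives a systematically different feature value at serving time than the one the model saw in training.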

(a) What is the expected direction of the skew? Will the serving feature be systematically higher or lower than the training feature?

(b) How would this skew affect a model that uses session length as a predictor of engagement? Would the model over-predict or under-predict engagement for the affected users?

(c) Propose a specific fix that prevents this class of skew. Your fix should be structural (i.e., a system design change), not just a code fix.

(d) Write a detect_skew check (following the pattern in Section 24.6) that would have caught this specific skew. What threshold settings would you use?


Exercise 24.10 (***)

Implement a minimal in-memory feature store with both online and offline interfaces.

from dataclasses import dataclass, field
from typing import Dict, Any, Optional, List, Tuple
from datetime import datetime


@dataclass
class InMemoryFeatureStore:
    """Minimal feature store with online and offline interfaces.

    The online store returns the latest feature values.
    The offline store supports point-in-time lookups.

    Your implementation should:
    1. Store feature values with timestamps.
    2. Support writing new feature values.
    3. Support online lookup (latest value for an entity).
    4. Support offline lookup (value at a specific timestamp).
    5. Detect potential training-serving skew by comparing online
       and offline values.
    """

    # Maps (entity_id, feature_name) to a chronological list of
    # (timestamp, value) pairs. Uses default_factory so each instance
    # gets its own dict (a bare {} default is illegal in a dataclass).
    _store: Dict[Tuple[str, str], List[Tuple[datetime, Any]]] = field(
        default_factory=dict
    )

    def write(
        self,
        entity_id: str,
        feature_name: str,
        value: Any,
        timestamp: datetime,
    ) -> None:
        """Write a feature value with timestamp."""
        # Your implementation here
        ...

    def online_lookup(
        self,
        entity_id: str,
        feature_names: List[str],
    ) -> Dict[str, Any]:
        """Get latest feature values for an entity."""
        # Your implementation here
        ...

    def offline_lookup(
        self,
        entity_id: str,
        feature_names: List[str],
        as_of: datetime,
    ) -> Dict[str, Any]:
        """Get feature values as of a specific timestamp."""
        # Your implementation here
        ...

(a) Implement the three methods. The offline store must support point-in-time lookups: given a timestamp $t$, return the most recent feature value with timestamp $\leq t$.

(b) Add a method check_consistency(entity_id: str, feature_names: List[str]) -> Dict[str, bool] that compares the online value to the latest offline value and flags discrepancies.

(c) Write test cases that demonstrate: (1) correct point-in-time lookup, (2) online-offline consistency, and (3) detection of a consistency violation caused by a delayed offline write.


Exercise 24.11 (**)

Temporal leakage is a form of training-serving skew where the training data includes features computed from future data that would not be available at prediction time.
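To make the definition concrete, the sketch below (with a hypothetical login log) contrasts a point-in-time-correct computation of a days-since-last-login feature with a leaky one that scans the full log, including events after the prediction date:

```python
from datetime import datetime

# Hypothetical login log for one user.
logins = [datetime(2024, 3, d) for d in (1, 5, 9, 20)]
prediction_date = datetime(2024, 3, 10)

# Point-in-time correct: only events at or before the prediction date
# are visible, as a feature store's point-in-time join would enforce.
visible = [t for t in logins if t <= prediction_date]
days_since_last_login = (prediction_date - max(visible)).days  # 1

# Leaky: computed over the full log with a query run today, so the
# March 20 login (future relative to the prediction date) leaks in.
leaky = (prediction_date - max(logins)).days  # -10
```

The negative value is the telltale sign: the "most recent" login used by the leaky computation has not happened yet at prediction time.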

(a) For each of the following features used to predict user churn, identify whether it contains temporal leakage and explain why:

  1. avg_sessions_per_week_last_30d: Average sessions per week over the 30 days before the prediction date.
  2. total_sessions_this_month: Total sessions in the calendar month containing the prediction date.
  3. days_since_last_login: Days between the prediction date and the most recent login.
  4. will_be_active_next_week: Whether the user had any sessions in the 7 days after the prediction date.
  5. avg_session_length_last_90d: Average session length over the 90 days before the prediction date, computed using a SQL query run today (not at the historical prediction time).

(b) For the features that contain leakage, explain how point-in-time joins in a feature store would prevent the issue.


Reliability and Degradation

Exercise 24.12 (**)

Extend the CircuitBreaker class from Section 24.7 with the following enhancements:

(a) Add an error_rate_threshold parameter that opens the circuit when the error rate over the last $N$ requests exceeds a threshold (rather than requiring $N$ consecutive failures). This is more robust to intermittent failures.

(b) Add an exponential backoff to the recovery timeout: the first recovery attempt happens after 30 seconds, the second after 60, the third after 120, and so on, up to a maximum of 10 minutes. This prevents the circuit from rapidly oscillating between OPEN and HALF_OPEN when the downstream service is slow to recover.
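The schedule described in (b) reduces to a one-line function; the parameter names below are illustrative, not part of the chapter's CircuitBreaker API:

```python
def recovery_timeout(attempt: int, base_s: int = 30, cap_s: int = 600) -> int:
    """Exponential backoff: 30s, 60s, 120s, ... capped at 10 minutes."""
    return min(base_s * 2 ** attempt, cap_s)

# Attempts 0..5 yield 30, 60, 120, 240, 480, 600 seconds.
```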

(c) Add monitoring hooks: the circuit breaker should maintain counters for total requests, total failures, total circuit-open rejections, and total successful recoveries. These counters are essential for operational dashboards (Chapter 30).


Exercise 24.13 (**)

Design a graceful degradation chain for a real-time fraud detection system. The system has three models:

  1. A full ensemble model (XGBoost + neural network + logistic regression) requiring 200 features.
  2. A lightweight model (logistic regression) requiring 20 features.
  3. A rule-based system using 5 hand-crafted rules (no ML).

(a) Define the degradation levels (analogous to StreamRec's hierarchy in Section 24.7). What triggers the transition between each level?

(b) The fraud domain has a critical asymmetry: false negatives (missing fraud) are far more costly than false positives (flagging legitimate transactions). How does this asymmetry affect the design of the fallback chain? Should fallback models be more or less conservative than the primary model?

(c) Write an ADR documenting the decision to use a three-level fallback chain, including the consequences for fraud detection rate at each level.


Exercise 24.14 (*)

For each of the following system failures, determine: (1) which degradation level StreamRec would operate at, (2) the expected impact on recommendation quality, and (3) the maximum acceptable duration before an engineer must be paged:

  1. The FAISS ANN index is temporarily unavailable (data corruption during an update).
  2. The GPU cluster for the ranking model has a 50% capacity reduction (hardware failure).
  3. The feature store's streaming pipeline is delayed by 10 minutes (Kafka consumer lag).
  4. The batch training pipeline has not completed for 72 hours (infrastructure issue).
  5. The A/B testing framework is reporting incorrect group assignments.

Architecture Decision Records

Exercise 24.15 (**)

Write a complete ADR for the following decision: StreamRec's re-ranking stage should use a rule-based system (not an ML model) for diversity and freshness enforcement.

Include: Status, Context (why this is a decision point), at least three options considered (one of which is an ML-based re-ranker), the decision with full rationale, consequences, and a review date with reconsideration triggers.


Exercise 24.16 (**)

You are reviewing an existing ADR for StreamRec that was written 6 months ago. The ADR chose batch retraining (every 24 hours) over continuous retraining (every hour) because: (1) the team had only 2 ML engineers, (2) the data distribution was stable, and (3) A/B tests showed no statistically significant difference between 24-hour and 1-hour retraining cadences.

Since then: (1) the team has grown to 5 ML engineers, (2) the platform launched a "live events" feature that causes rapid shifts in user behavior, and (3) a new A/B test shows +4% engagement for hourly retraining during live events.

Write an updated ADR that supersedes the original. What is the new recommendation?


Exercise 24.17 (*)

For each of the following decisions, determine whether an ADR is warranted (using the "more than one hour to reverse" criterion from Section 24.8) and explain why:

  1. Choosing Python 3.11 as the primary language for the feature engineering pipeline.
  2. Naming a feature user_avg_session_length_min instead of avg_session_len_user.
  3. Selecting Redis over DynamoDB for the online feature store.
  4. Setting the re-ranking diversity constraint to "max 3 items per category."
  5. Choosing NDJSON as the log format for serving requests.

System Design Exercises

Exercise 24.18 (***)

Design the ML system architecture for a ride-sharing platform's dynamic pricing model (surge pricing). The system must:

  • Update prices every 30 seconds based on real-time supply (available drivers) and demand (ride requests).
  • Cover 500 geographic zones in a city.
  • Serve price quotes to riders within 100ms.
  • Handle 5,000 concurrent ride requests.

(a) Draw the system diagram (like the StreamRec diagram in Section 24.11). Label all components and their latency budgets.

(b) Which serving pattern (batch, real-time, near-real-time) is appropriate for each component? (Hint: different components may use different patterns.)

(c) What is the graceful degradation strategy when the real-time demand signal is unavailable? What price does the system serve?

(d) Write an ADR for the decision between zone-level pricing (one price per zone) and continuous pricing (price varies continuously with location). Consider computational cost, user fairness, and regulatory implications.


Exercise 24.19 (***)

Design the ML system architecture for a clinical decision support system that assists emergency room physicians in diagnosing chest pain patients. The system uses patient vitals, lab results, ECG data, and medical history to estimate the probability of acute myocardial infarction (AMI).

(a) What are the latency, reliability, and explainability requirements for this system? How do they differ from StreamRec's requirements?

(b) Should this system use real-time or batch serving? Consider that some features (lab results) may take 30-60 minutes to become available after the patient arrives.

(c) What is the appropriate graceful degradation strategy? In the medical domain, should the system fall back to a simpler model or refuse to serve predictions entirely? Justify your answer.

(d) How does the regulatory environment (FDA software as a medical device) constrain the system architecture? What components require additional documentation and validation?


Exercise 24.20 (***)

The Credit Scoring anchor example in Section 24.9 uses a hybrid architecture: real-time pre-approval with batch final decision.

(a) Design the feature set for each stage. The real-time stage has access to features available within 30 seconds (credit bureau API, application form data). The batch stage has access to the full data warehouse (transaction history, employment verification, alternative data). List at least 10 features per stage.

(b) The real-time model must err on the side of caution: it should not pre-approve applicants who will be denied by the batch model (false positive pre-approvals damage customer trust). Define a quantitative criterion for setting the real-time model's threshold. What error rate is acceptable?

(c) A rejected applicant disputes the decision. The adverse action notice requires listing the top 3 factors that contributed to the denial. How does the serving architecture need to change to support this regulatory requirement? What must be logged?


Exercise 24.21 (***)

Latency budgeting under constraints. StreamRec's latency budget allocates 125ms across 6 stages with 75ms headroom. The product team now wants to add a personalized notification to the recommendation response: a short message like "Because you watched Documentary X, you might enjoy..."

The notification requires a language model call (estimated 40ms p95 on a GPU with a pre-prompt and short generation).

(a) Can this fit within the existing 200ms budget? If not, what stages could be optimized to create room?

(b) Propose an architecture that generates the notification without increasing the end-to-end latency. (Hint: does the notification need to be generated synchronously?)

(c) Write an ADR documenting the decision to generate notifications asynchronously vs. synchronously.


Exercise 24.22 (***)

Multi-model serving. StreamRec currently uses a single ranking model (DCN-V2). The team wants to experiment with three ranking models simultaneously: DCN-V2, a transformer-based model, and a gradient-boosted tree ensemble.

(a) Design an ensemble serving architecture that combines predictions from all three models. How do you combine the scores? What is the latency implication?

(b) Design a model selection architecture that routes each request to one of the three models based on user segment (new users, casual users, power users). What determines which model serves which segment?

(c) Compare the two architectures on: latency, serving cost, experimentation capability, and failure modes. Which would you recommend for StreamRec and why?


Integration Exercises

Exercise 24.23 (**)

This exercise connects ML system design (this chapter) with the causal inference methods from Part III.

StreamRec wants to measure the causal effect of switching from batch serving to real-time serving. They cannot run a randomized A/B test because the infrastructure change affects all users simultaneously.

(a) Using the potential outcomes framework (Chapter 16), define the treatment, the outcome, and the estimand of interest.

(b) Propose a quasi-experimental design to estimate the causal effect. Consider: staggered rollout across geographic regions, regression discontinuity (users just above/below an activity threshold), or interrupted time series.

(c) What confounders might bias the estimate? How would you address them?


Exercise 24.24 (***)

This exercise connects ML system design with Bayesian methods from Part IV.

StreamRec's Bayesian user preference model (Chapter 20) updates in microseconds per interaction. But the model is per-user and per-category — with 50 million users and 20 categories, there are 1 billion parameter pairs.

(a) Design the feature store architecture for serving these Bayesian posteriors. Where are the posterior parameters stored? How are they updated? What is the lookup latency?

(b) The Bayesian model is used in the content-based retrieval source (Stage 1). When a new user arrives with no interactions, the posterior equals the prior. How does the system diagram change for cold-start users vs. established users?

(c) Thompson sampling (Chapter 22) requires drawing a sample from each category's posterior at request time. The sampling is fast (microseconds per Beta draw) but introduces stochasticity into the recommendations. How does this affect A/B testing? Can you still compute a valid treatment effect estimate if the recommendations are partially randomized?
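For concreteness, per-category Thompson sampling with Beta posteriors can be sketched as follows. The parameter values are hypothetical; note that the new category carries the uniform Beta(1, 1) prior, matching the cold-start situation in (b):

```python
import random

# Hypothetical per-category Beta posterior parameters (alpha, beta).
posteriors = {
    "documentary": (42.0, 8.0),
    "comedy": (15.0, 35.0),
    "drama": (1.0, 1.0),  # new category: posterior equals the prior
}

def thompson_select(posteriors):
    """Draw one sample from each category's Beta posterior and
    return the category with the highest sampled preference."""
    draws = {c: random.betavariate(a, b) for c, (a, b) in posteriors.items()}
    return max(draws, key=draws.get)
```

Each request re-draws the samples, which is the source of the stochasticity that part (c) asks you to reconcile with A/B testing.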


Exercise 24.25 (***)

Capstone design exercise. Choose one of the following domains and design a complete ML system architecture, following the structure of Section 24.11:

  1. Music streaming: Personalized playlist generation for 100 million users, 80 million tracks.
  2. Job matching: Candidate-job matching for a hiring platform with 10 million active job seekers and 2 million open positions.
  3. Autonomous driving: Real-time object detection and path planning for a self-driving vehicle.

For your chosen domain, provide:

(a) A system diagram with all major components.

(b) A latency budget for the serving path.

(c) Three ADRs documenting your most important design decisions.

(d) A graceful degradation hierarchy with at least four levels.

(e) A feature store schema with at least 10 features, classified by computation type (batch, streaming, on-demand).


Research-Level Exercises

Exercise 24.26 (****)

Sculley et al. (2015) identified several forms of hidden technical debt in ML systems, including: entanglement (CACE — Changing Anything Changes Everything), correction cascades (Model B corrects Model A's errors), undeclared consumers (models that silently consume another model's output), and data dependency debt.

(a) For the StreamRec architecture designed in this chapter, identify at least two instances of each debt type and explain how they could manifest.

(b) Propose architectural patterns that mitigate each debt type. Are there trade-offs between mitigating debt and system complexity?

(c) CACE is particularly acute in recommendation systems because the model influences user behavior, which becomes the training data for the next model iteration. This creates a feedback loop. How does the StreamRec architecture address (or fail to address) this feedback loop? What would a "feedback-aware" architecture look like?


Exercise 24.27 (****)

The serving latency analysis in Section 24.4 treats each component's latency as a fixed budget. In reality, latencies are random variables with distributions that depend on load, cache hit rates, and downstream service health.

(a) Model each component's latency as a log-normal distribution with parameters estimated from the p50 and p99 values in the chapter. Compute the end-to-end p99 latency by simulation (draw $10^6$ samples from each component's latency distribution and sum them along the critical path).
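A starting point for part (a): for a log-normal, the median is $e^{\mu}$ and the 99th percentile is $e^{\mu + 2.326\sigma}$, so both parameters can be recovered from the two quantiles. The component latencies below are placeholders; substitute the chapter's actual p50/p99 figures:

```python
import math
import random

Z99 = 2.326  # standard-normal 99th-percentile quantile

def lognormal_params(p50_ms: float, p99_ms: float) -> tuple:
    """Recover (mu, sigma) from the p50 and p99 of a log-normal."""
    mu = math.log(p50_ms)
    sigma = (math.log(p99_ms) - math.log(p50_ms)) / Z99
    return mu, sigma

# Placeholder (p50, p99) pairs in milliseconds for three components.
components = [(5, 20), (10, 40), (30, 90)]
params = [lognormal_params(p50, p99) for p50, p99 in components]

# End-to-end p99 by Monte Carlo: sum one draw per component, repeat.
n = 100_000
totals = sorted(
    sum(random.lognormvariate(mu, sigma) for mu, sigma in params)
    for _ in range(n)
)
sim_p99 = totals[int(0.99 * n)]
sum_of_p99s = sum(p99 for _, p99 in components)
```

With independent components, `sim_p99` comes out below `sum_of_p99s`; explaining why is part (b).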

(b) How does the simulated p99 compare to the sum of individual p99s? Why is the sum of p99s pessimistic?

(c) Implement a latency budget optimizer that, given a total p99 budget and component latency distributions, allocates per-component budgets to minimize the probability of exceeding the total budget. Formulate this as an optimization problem and solve it numerically.


Exercise 24.28 (****)

Online-offline consistency in feature stores is a distributed systems problem. In the CAP theorem framework (Consistency, Availability, Partition tolerance), feature stores typically sacrifice strong consistency for availability.

(a) Explain why eventual consistency (rather than strong consistency) is the pragmatic choice for an online feature store serving ML predictions. What is the worst-case impact of serving a slightly stale feature value?

(b) Design a consistency monitor that measures the lag between the offline store and the online store. How would you define "consistency violation" in quantitative terms? What alerting threshold would you set?

(c) Discuss the connection between feature store consistency and the training-serving skew framework from Section 24.6. Is all training-serving skew equivalent to a consistency violation? Are there forms of skew that a perfectly consistent feature store would not prevent?