Chapter 24: Quiz

Test your understanding of ML system design. Answers follow each question.


Question 1

According to Sculley et al. (2015), what fraction of a real-world ML system is typically ML code? What makes up the rest?

**Answer:** The ML code is a **small fraction** — often cited as roughly 5% — of the total system. The rest consists of data collection and verification, feature extraction and management, serving infrastructure, monitoring and alerting, configuration management, process management, machine resource management, and analysis tools. The surrounding infrastructure is "vast and complex" and represents the majority of the engineering effort, maintenance burden, and failure surface.

Question 2

What are the two timescale loops in an ML system, and what are the primary concerns for each?

**Answer:** The **inner loop (online)** handles individual prediction requests in milliseconds. Its primary concerns are latency, throughput, and availability. The **outer loop (offline)** handles model improvement — data processing, retraining, evaluation, and promotion — on a timescale of hours to days. Its primary concerns are model quality, reproducibility, and validation rigor. The key architectural principle is keeping these loops decoupled yet synchronized: the inner loop must never block on the outer loop, and the outer loop must never bypass the inner loop's safety checks.

Question 3

Why do modern recommendation systems use a multi-stage funnel architecture (retrieval → ranking → re-ranking) rather than scoring all items with a single model?

**Answer:** It is a mathematical consequence of the latency constraint. If a platform has 200,000 items and the ranking model takes 0.1ms per item, scoring all items would take 20 seconds — two orders of magnitude beyond a typical 200ms budget. The funnel architecture resolves this by using computationally cheap models (e.g., approximate nearest neighbor search) to reduce the candidate set to hundreds, then applying expensive models to the smaller set. Each successive stage narrows the candidate set while increasing both the per-item cost and the precision of the scoring.
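The arithmetic can be checked directly. The catalog size, per-item cost, and latency budget below follow the numbers in the answer; the 500-item candidate set is an illustrative funnel output, not a figure from the chapter:

```python
# Back-of-envelope latency check: full-catalog scoring vs. a retrieval funnel.
CATALOG_SIZE = 200_000
RANK_COST_US = 100            # 0.1 ms per item, expressed in microseconds
BUDGET_MS = 200               # end-to-end latency budget
CANDIDATES = 500              # candidate set left after cheap retrieval (assumed)

full_scan_ms = CATALOG_SIZE * RANK_COST_US / 1000    # score every item
funnel_rank_ms = CANDIDATES * RANK_COST_US / 1000    # score only the candidates

print(full_scan_ms / BUDGET_MS)   # 100.0 -> two orders of magnitude over budget
print(funnel_rank_ms)             # 50.0  -> fits within the 200 ms budget
```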

Question 4

What is the role of the re-ranking stage, and why is it separated from the ranking model?

**Answer:** The re-ranking stage applies non-ML business logic to the top-scored items: diversity constraints (limit items per category), freshness requirements (include recent content), content policy enforcement (remove flagged items), promotional boosts, suppression of already-seen items, and fairness constraints. It is separated from the ranking model because business rules change faster than ML models, and entangling them creates a maintenance nightmare. A product manager should be able to change "max 3 items per category" without retraining the ranking model.

Question 5

Compare batch, real-time, and near-real-time serving patterns. When is each most appropriate?

**Answer:** **Batch serving** pre-computes predictions on a schedule. Best for: stable predictions that do not depend on real-time context (daily email digests, nightly risk scores), very expensive models, or systems where validation before serving is critical. **Real-time serving** computes predictions on-the-fly per request. Best for: interactive UX where session context matters (search, recommendations during browsing), time-sensitive decisions (fraud detection, ad serving). **Near-real-time serving** uses a streaming system to update features or pre-computed results with low latency (seconds to minutes). Best for: session-aware features, trending signals, and cases where full real-time inference is too expensive but batch staleness is unacceptable.

Question 6

What is training-serving skew, and why is it called a "silent killer" of production ML performance?

**Answer:** Training-serving skew occurs when the feature values a model sees at prediction time differ systematically from the feature values it saw during training. It is "silent" because it produces no errors — the model outputs predictions that look plausible but are degraded. There are no exceptions, no crashes, no obvious signals. The model simply performs worse than its offline evaluation suggested. Common causes include implementation mismatches (different code computes features in training vs. serving), temporal leakage (training features include future information), stale features, and missing feature handling differences.

Question 7

What is the difference between the online store and the offline store in a feature store architecture?

**Answer:** The **online store** serves the prediction pipeline with the *current* feature values for each entity, stored in a low-latency key-value store (Redis, DynamoDB) for single-digit millisecond lookups. The **offline store** serves the training pipeline with *historical* feature values keyed by (entity_id, timestamp), stored in columnar formats (Parquet, Delta Lake). The offline store supports point-in-time correct retrieval: given a historical timestamp, it returns feature values as they existed at that time, preventing temporal leakage. The critical guarantee is **online-offline consistency**: both stores must agree on feature values for the same entity at the same time.
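Point-in-time correct retrieval can be sketched in a few lines. This is a toy, assuming the offline store keeps a sorted `(timestamp, value)` history per `(entity, feature)`; the `offline_store` layout and entity names are illustrative, not a real feature-store API:

```python
from bisect import bisect_right

# Toy offline store: sorted (timestamp, value) rows per (entity, feature).
offline_store = {
    ("user_42", "clicks_7d"): [(100, 3), (200, 5), (300, 9)],
}

def get_feature_as_of(entity_id, feature, ts):
    """Return the latest value written at or before `ts` (no future leakage)."""
    history = offline_store[(entity_id, feature)]
    idx = bisect_right(history, (ts, float("inf"))) - 1
    if idx < 0:
        return None  # the feature did not exist yet at this timestamp
    return history[idx][1]

# A training example stamped at t=250 must see 5, never the future value 9.
print(get_feature_as_of("user_42", "clicks_7d", 250))  # 5
```

Retrieving by the training example's own timestamp is exactly what prevents the temporal leakage described in the answer.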

Question 8

What are the four categories of training-serving skew described in the chapter?

**Answer:** (1) **Feature skew**: Different code computes features in training vs. serving (e.g., different null handling). (2) **Data skew**: The training data distribution differs from the serving data distribution (e.g., seasonal changes, new user demographics). (3) **Label skew**: Labels are computed differently during training and evaluation (e.g., different attribution windows). (4) **Temporal skew**: Training data includes future information that is unavailable at serving time (e.g., a feature computed using data from after the prediction timestamp). Feature stores with point-in-time joins prevent temporal skew; shared feature computation code prevents feature skew.

Question 9

Explain the principle of graceful degradation in the context of ML systems. What is StreamRec's five-level degradation hierarchy?

**Answer:** Graceful degradation means that when a component fails, the system falls back to simpler but still useful behavior rather than failing entirely. StreamRec's hierarchy: (1) **Full system healthy** — multi-source retrieval, deep ranking, re-ranking; (2) **Real-time features unavailable** — fall back to batch features (stale but personalized); (3) **Ranking model unavailable** — use retrieval scores directly (lower quality); (4) **Feature store unavailable** — return globally popular items (not personalized but functional); (5) **Complete outage** — serve a static cached response from the last successful run. Each level provides worse recommendations but avoids the worst outcome: showing an error page.
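The hierarchy is naturally expressed as an ordered fallback chain. This is a minimal sketch; the level functions below are placeholders (two of them simulate outages) rather than StreamRec's actual components:

```python
def full_pipeline(user_id):
    raise TimeoutError("real-time feature store down")   # simulated outage

def batch_features(user_id):
    raise TimeoutError("batch feature read failed")      # simulated outage

def retrieval_scores(user_id):
    return ["item_a", "item_b"]                          # lower quality, still useful

def global_popular(user_id):
    return ["popular_1", "popular_2"]

def static_cache(user_id):
    return ["cached_1"]

LEVELS = [("full_pipeline", full_pipeline),
          ("batch_features", batch_features),
          ("retrieval_scores", retrieval_scores),
          ("global_popular", global_popular),
          ("static_cache", static_cache)]

def recommend(user_id):
    """Walk the degradation hierarchy; serve the first level that succeeds."""
    for name, serve in LEVELS:
        try:
            return name, serve(user_id)
        except Exception:
            continue  # this level is unhealthy: degrade to the next one
    raise RuntimeError("all degradation levels failed")

print(recommend("user_42"))  # ('retrieval_scores', ['item_a', 'item_b'])
```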

Question 10

What is a circuit breaker in the context of ML serving, and what three states does it have?

**Answer:** A circuit breaker prevents cascading failures by stopping requests to a failing downstream component. It has three states: **CLOSED** (normal operation — requests pass through), **OPEN** (circuit has tripped after repeated failures — requests fail fast and a fallback is used, giving the failing service time to recover), and **HALF_OPEN** (after a recovery timeout, a limited number of test requests are allowed through; if they succeed, the circuit closes; if they fail, it reopens). Without circuit breakers, a slow feature store can cause the serving layer to accumulate pending requests, exhaust resources, and crash — turning a partial degradation into a complete outage.
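A minimal single-threaded sketch of the three-state machine, assuming illustrative thresholds (the chapter does not prescribe specific values); the injectable `clock` parameter exists only to make the state transitions easy to exercise:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=3, recovery_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.clock = clock
        self.state = "CLOSED"
        self.failures = 0
        self.opened_at = None

    def call(self, fn, fallback):
        if self.state == "OPEN":
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = "HALF_OPEN"     # allow a probe request through
            else:
                return fallback()            # fail fast while the circuit is open
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.failure_threshold:
                self.state = "OPEN"          # trip (or re-trip) the circuit
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        self.state = "CLOSED"                # success: (re)close the circuit
        return result
```

A production version would add thread safety and limit the number of HALF_OPEN probes; the skeleton shows only the CLOSED → OPEN → HALF_OPEN transitions described above.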

Question 11

What is an Architecture Decision Record (ADR), and what are its key sections?

**Answer:** An ADR is a short document that captures a single architectural decision. Key sections: **Status** (proposed, accepted, deprecated, or superseded), **Context** (the issue, forces at play, and constraints), **Decision** (the choice made), **Options Considered** (alternatives with pros and cons), **Consequences** (trade-offs accepted, new constraints created), and **Review Date** (when and under what conditions to reconsider). ADRs prevent re-arguing settled decisions, explain constraints to new team members, and surface assumptions that may have become outdated.

Question 12

Why are ADRs particularly important for ML systems compared to traditional software systems?

**Answer:** Three reasons: (1) **Non-obvious dependencies** — an ML decision like choosing real-time serving implies downstream requirements (online feature store, streaming pipeline, GPU infrastructure) that should be documented explicitly. (2) **Evolution through experimentation** — decisions based on A/B test results (e.g., "+12% engagement for session-aware recommendations") should be re-evaluated if the experimental conditions change. (3) **Team rotation** — ML teams grow and change; without ADRs, institutional knowledge leaves with departing engineers. The criterion is: write an ADR for any decision that would take more than one hour to reverse.

Question 13

What is the difference between p50, p95, and p99 latency? Why is p99 more important than p50 for SLA compliance?

**Answer:** **p50** (median) is the latency at the 50th percentile — half of requests are faster, half are slower. **p95** is the 95th percentile — 95% of requests are faster. **p99** is the 99th percentile — 99% of requests are faster. p99 is more important for SLA compliance because: (1) at scale, 1% of 10,000 requests per second means 100 users per second experience p99-or-worse latency; (2) tail latency is often caused by specific conditions (cache misses, GC pauses, slow replicas) that affect real users; and (3) SLAs are typically defined on p95 or p99, not p50. A system with a comfortable p50 but a terrible p99 will violate its SLAs and degrade the experience for a significant absolute number of users.
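For intuition, here is a toy percentile computation over a synthetic latency sample. It uses the nearest-rank definition (real monitoring systems use histograms or sketches); the sample values are invented to show a healthy median hiding an SLA-violating tail:

```python
def percentile(latencies_ms, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    s = sorted(latencies_ms)
    rank = max(1, -(-len(s) * p // 100))   # ceil(len * p / 100), at least 1
    return s[rank - 1]

# 95 fast requests plus a handful of slow tail requests (synthetic data)
latencies = [20] * 95 + [150, 180, 400, 800, 1200]
print(percentile(latencies, 50))   # 20  -> the median looks healthy
print(percentile(latencies, 99))   # 800 -> the tail blows through a 200 ms SLA
```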

Question 14

In the credit scoring example, why does the chapter recommend a hybrid architecture (real-time pre-approval + batch final decision) rather than pure real-time or pure batch?

**Answer:** The hybrid captures the benefits of both patterns. **Real-time pre-approval** uses a simpler model on limited features to provide an instant decision, capturing the conversion benefit — customers who receive instant pre-approval are 20-30% more likely to complete the application. **Batch final decision** uses a complex ensemble on the full feature set (including data sources that take hours to process, like employment verification) for the final terms, preserving model quality and auditability. Pure batch loses the conversion benefit (15-25% applicant dropout from next-day decisions). Pure real-time sacrifices model quality (fewer features available) and auditability (harder to log and inspect). The hybrid applies the "simplest model that works" principle at the system level.

Question 15

What is shadow mode deployment, and what does it validate before a model goes to canary?

**Answer:** Shadow mode runs the new model in parallel with the production model, logging predictions but never serving them to users. It validates four things with zero user risk: (1) **Correctness** — predictions are in the expected range and format. (2) **Latency** — the new model meets the latency budget. (3) **Consistency** — predictions are correlated with the production model's predictions (large discrepancies suggest a bug, not a model improvement). (4) **Error handling** — the model handles edge cases (missing features, new users, unusual inputs) without crashing. Shadow mode is the mandatory first step before any canary or A/B test.
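The serving-side pattern is small: score with both models, serve only production, log the pair. This sketch is an assumption about how such a harness might look — the model stubs, log schema, and the 0.1 discrepancy threshold are illustrative, not from the chapter:

```python
import statistics

def serve_with_shadow(request, prod_model, shadow_model, log):
    prod = prod_model(request)
    try:
        shadow = shadow_model(request)                  # never served to the user
        log.append({"prod": prod, "shadow": shadow})
    except Exception as exc:
        log.append({"prod": prod, "shadow_error": repr(exc)})  # edge-case crashes
    return prod                                         # users only ever see prod

def shadow_report(log, max_mean_abs_diff=0.1):
    """Offline check of the logged pairs: consistency and shadow error rate."""
    diffs = [abs(e["prod"] - e["shadow"]) for e in log if "shadow" in e]
    errors = sum("shadow_error" in e for e in log)
    mean_diff = statistics.mean(diffs) if diffs else float("nan")
    return {"mean_abs_diff": mean_diff,
            "error_rate": errors / len(log),
            "consistent": bool(diffs) and mean_diff <= max_mean_abs_diff}

log = []
for req in [0.1, 0.5, 0.9]:
    serve_with_shadow(req, lambda r: r, lambda r: r + 0.02, log)
print(shadow_report(log))
```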

Question 16

StreamRec's candidate retrieval uses four sources (embedding ANN, collaborative filtering, content-based, trending). Why use multiple sources instead of a single, better retrieval model?

**Answer:** No single retrieval source captures all relevant items. The embedding ANN model (two-tower) finds items similar to the user's profile but cannot retrieve items without embedding similarity (cold-start items). Collaborative filtering captures structural neighborhood patterns but misses items outside the user's graph neighborhood. Content-based retrieval serves category preferences but ignores collaborative signals. Trending captures new/popular items regardless of personalization. The union of sources improves Recall@500 by 18% (in offline evaluation) compared to any single source. Multi-source retrieval also provides redundancy: if one source fails, the others still produce candidates.

Question 17

What is online-offline consistency in a feature store, and how is it achieved?

**Answer:** Online-offline consistency means that for any entity at any point in time, the online store and the offline store return the same feature values (or as close as system latency allows). It is achieved through three mechanisms: (1) **Single feature computation path** — the same code (or pipeline) writes to both the online and offline stores, eliminating implementation mismatches. (2) **Backfill** — when a new feature is added, it is retroactively computed for historical data in the offline store. (3) **Point-in-time joins** — the training pipeline retrieves feature values as of each training example's timestamp, never using future data. Without these guarantees, the model trains on one version of reality and serves in another.
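One way teams verify the guarantee in practice is a periodic audit that samples entities and compares the two stores. This sketch assumes toy dictionary-backed stores and simple getter callables; none of it is a real feature-store API:

```python
def consistency_audit(entities, feature, online_get, offline_get_latest,
                      tolerance=0.0):
    """Return entities whose online value diverges from the latest offline value."""
    mismatches = []
    for entity in entities:
        online = online_get(entity, feature)
        offline = offline_get_latest(entity, feature)
        if online is None or offline is None or abs(online - offline) > tolerance:
            mismatches.append((entity, online, offline))
    return mismatches

# Toy stores: user_2's online value has fallen behind the offline pipeline.
online = {("user_1", "clicks_7d"): 5.0, ("user_2", "clicks_7d"): 3.0}
offline = {("user_1", "clicks_7d"): 5.0, ("user_2", "clicks_7d"): 7.0}

bad = consistency_audit(["user_1", "user_2"], "clicks_7d",
                        lambda e, f: online.get((e, f)),
                        lambda e, f: offline.get((e, f)))
print(bad)  # [('user_2', 3.0, 7.0)]
```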

Question 18

How does multi-source retrieval handle the case where one retrieval source times out?

**Answer:** Sources run in parallel with a shared timeout (30ms for StreamRec). If a source has not responded by the timeout, the system proceeds with the results from whichever sources have responded. This is a graceful degradation pattern at the retrieval level: the union of 3 sources is worse than 4, but still produces a useful candidate set. The retrieval summary tracks which sources contributed and which timed out, enabling monitoring to detect chronic timeout patterns. If a specific source times out frequently, it may be removed or its latency budget increased, documented in an ADR.
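The pattern maps directly onto `concurrent.futures.wait` with a timeout. A minimal sketch, assuming callable source stubs (the "trending" stub below deliberately sleeps past the budget to simulate a timeout); the 30ms default follows the StreamRec figure, and the summary schema is illustrative:

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait

def retrieve_candidates(user_id, sources, budget_s=0.030):
    """Run all retrieval sources in parallel; keep whatever finished in budget."""
    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        futures = {pool.submit(fn, user_id): name for name, fn in sources.items()}
        done, not_done = wait(futures, timeout=budget_s)
        candidates = set()
        summary = {"responded": [], "timed_out": []}
        for fut in done:
            candidates |= set(fut.result())     # union of all on-time sources
            summary["responded"].append(futures[fut])
        for fut in not_done:
            fut.cancel()                        # best-effort; result is discarded
            summary["timed_out"].append(futures[fut])   # feed this to monitoring
        return candidates, summary

sources = {
    "embedding_ann": lambda u: ["item_a", "item_b"],
    "collab_filter": lambda u: ["item_b", "item_c"],
    "trending":      lambda u: time.sleep(0.5) or ["item_z"],  # chronically slow
}
candidates, summary = retrieve_candidates("user_42", sources, budget_s=0.05)
print(sorted(candidates), summary["timed_out"])
```

Production code would also catch per-source exceptions and export the summary as metrics; the sketch shows only the shared-timeout union.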

Question 19

What is the "simplest model that works" principle, and how does it apply at the system design level?

**Answer:** The principle states that the correct approach is the simplest one that meets the product requirements. At the model level, it means preferring a logistic regression over a deep neural network if the simpler model achieves acceptable performance. At the **system design level**, it means choosing the simplest serving pattern (batch > near-real-time > real-time), the simplest architecture (single model > ensemble > multi-stage funnel), and the simplest infrastructure (no feature store > batch feature store > streaming feature store) that satisfies the latency, freshness, and quality requirements. Netflix uses batch recommendations for most of its home page rows, reserving real-time serving only for the "Continue Watching" row where session context matters. The sophisticated pattern is reserved for the cases where it is necessary.

Question 20

A new ML engineer joins StreamRec and asks: "Why don't we just score all 200,000 items with the ranking model?" Calculate the latency of this approach and explain why the funnel architecture is not optional.

**Answer:** The ranking model (DCN-V2) takes approximately 0.1ms per item on a GPU. Scoring all 200,000 items would take $200{,}000 \times 0.1\text{ms} = 20{,}000\text{ms} = 20\text{ seconds}$. The SLA requires end-to-end latency under 200ms. Scoring all items exceeds the budget by a factor of 100. Even with 10x model optimization (quantization, batching), the result (2 seconds) still exceeds the budget by 10x. The only way to serve personalized recommendations from a large corpus within a strict latency budget is to progressively narrow the candidate set: cheap retrieval (sublinear in corpus size, via ANN) produces hundreds of candidates, and the expensive ranking model scores only those candidates. The funnel architecture is not an optimization — it is a mathematical necessity.