Chapter 36: Quiz

Test your understanding of system integration, architecture decisions, evaluation strategy, and stakeholder communication. Answers follow each question.


Question 1

What is the fundamental difference between evaluating a model and evaluating a system?

Answer **Model evaluation** measures the quality of a single component on held-out data — Hit@10, NDCG@10, AUC, RMSE. It answers "does this model make good predictions?" **System evaluation** measures the end-to-end behavior of all components working together under production conditions. It answers "does this system work reliably, quickly, and fairly?" System evaluation includes latency (p50, p99), throughput (QPS), data quality (feature store hit rate, training-serving skew), causal impact (ATE of recommendations on engagement), and fairness (creator exposure equity, user quality disparity). A model with excellent offline metrics can be part of a system that fails — because of latency violations, stale features, silent drift, or unfair outcomes. The three-level evaluation framework (model, system, business) ensures that model quality is necessary but not sufficient for a passing grade.

Question 2

Why does Track C of the capstone use exactly 200ms of the latency budget, and what does this imply about adding new components?

Answer Track C uses the full 200ms budget because it includes the maximum number of components: two retrieval sources, two ranking models, re-ranking, feature store, API gateway, response assembly, uncertainty estimation, and exploration policy. Using the full budget implies that **any new component added to Track C requires either removing an existing component, optimizing an existing component to reduce its latency, increasing the total latency SLA, or running the new component in parallel with an existing one**. This is a fundamental tension in ML system design: more components provide more capability (better accuracy, fairness, uncertainty) but consume more of a finite resource (latency). The latency budget is a forcing function that disciplines architectural decisions — you cannot add complexity without paying for it somewhere. This is why Track A (8 components, 110ms) is often the right starting point: it leaves 90ms of headroom for future additions.

Question 3

What is an Architecture Decision Record (ADR), and why is the "Alternatives Considered" section the most important part?

Answer An Architecture Decision Record (ADR) documents a significant architectural choice: the context that motivated it, the decision itself, the alternatives considered, and the consequences. The "Alternatives Considered" section is the most important because it demonstrates that the decision was *deliberate* — the result of analyzing multiple options against explicit criteria — rather than *accidental* — the result of defaulting to the first thing that worked. Without documented alternatives, future engineers cannot evaluate whether the original decision still makes sense when the context changes. If the ADR for choosing the two-tower model documents that LightGCN was rejected due to operational complexity, a future engineer who has since built the graph infrastructure can recognize that the rejection reason no longer applies and revisit the decision. An ADR without alternatives is just a description of the status quo — it documents *what* was chosen but not *why*, which is the information that actually matters.

Question 4

Explain the difference between the naive CTR improvement and the causal ATE estimate for a recommendation system. Why can the naive estimate be misleading?

Answer The **naive CTR improvement** compares the click-through rate of the recommendation system to a baseline (random or popularity) using observational data. The **causal ATE** (average treatment effect) estimates the causal impact of showing a recommendation on user engagement, using methods from causal inference (IPW, doubly robust estimation, A/B testing) to account for confounding. The naive estimate can be misleading because it conflates *prediction* with *causation*. A recommendation system that shows users content they would have found anyway appears to "improve" CTR without actually causing additional engagement — the system is predicting existing behavior, not causing new behavior. Confounders include user intent (users who are already engaged are more likely to click on anything), item popularity (popular items get clicked regardless of recommendation), and temporal effects (engagement varies by time of day). In the StreamRec capstone, the naive CTR improvement was 8.7% while the causal ATE was 4.1% — meaning more than half of the apparent improvement was the model reflecting existing behavior rather than causing new engagement. The causal estimate is the correct measure of the system's *value*.
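The confounding correction can be sketched numerically. The snippet below uses simulated data (illustrative, not the StreamRec logs) in which a confounder drives both treatment assignment and engagement, then compares the naive difference in means against an inverse-propensity-weighted ATE:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated logs: p is the propensity P(recommendation shown | user), which
# also drives baseline engagement, so engaged users are both more likely to
# be treated and more likely to click. The true treatment effect is 0.04.
n = 100_000
p = rng.uniform(0.2, 0.8, n)                  # propensity scores
t = (rng.uniform(size=n) < p).astype(float)   # treatment: shown or not
y = (rng.uniform(size=n) < 0.1 + 0.3 * p + 0.04 * t).astype(float)

# Naive estimate: raw difference in observed means, confounded by p.
naive = y[t == 1].mean() - y[t == 0].mean()

# IPW: reweight each outcome by the inverse probability of its observed treatment.
ipw = np.mean(t * y / p) - np.mean((1 - t) * y / (1 - p))

print(f"naive lift: {naive:.3f}   IPW ATE: {ipw:.3f}")
```

The naive estimate lands well above the true effect because treated users were already more likely to engage; the IPW estimate recovers a value near 0.04, mirroring the naive-vs-causal gap discussed above.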

Question 5

Why should the fairness audit be completed before the technical roadmap is finalized?

Answer The fairness audit can fundamentally reorder roadmap priorities by revealing that what appears to be a performance improvement is actually a fairness-critical infrastructure fix. In the StreamRec retrospective example, the fairness audit revealed that the FAISS index rebuild delay (24-hour window where new items are invisible) was not just a latency optimization problem but a fairness problem: new creators are disproportionately non-English-speaking, so the cold-start bias introduced by the rebuild delay amplifies language-based exposure inequity. This reframed the incremental FAISS update roadmap item from a P2 "nice-to-have" to a P0 "fairness-critical" fix. Without the fairness audit, the roadmap would have prioritized this item incorrectly. More broadly, the fairness audit reveals which system behaviors disproportionately affect which populations, which directly informs where engineering effort should be invested. A roadmap that does not incorporate fairness findings risks investing in improvements that help already-well-served users while neglecting underserved populations.

Question 6

What is the "three-slide rule" for executive presentations, and what should each slide contain?

Answer The three-slide rule structures an executive presentation into three focused slides: **Slide 1 — What and why:** One sentence on what the system does, one sentence on the business problem, one number for the business impact (revenue, retention, cost savings). No architecture diagrams. **Slide 2 — How we know it works:** The key metric compared to the previous system, with confidence interval or statistical significance. One sentence on fairness. One sentence on risk and monitoring. **Slide 3 — What comes next:** The top 3 roadmap items with expected impact and timeline. A cost estimate for the next phase. The specific ask (continued funding, headcount, infrastructure). Everything else — architecture diagrams, model details, evaluation methodology, causal analysis — goes in an appendix, available if asked but not presented proactively. The rule reflects a fundamental communication principle: executives make resource allocation decisions based on impact, confidence, and cost. They do not need to understand the transformer architecture to decide whether to fund the system. Presenting technical details to executives signals either that you do not understand your audience or that you do not have a clear business case.

Question 7

What is technical debt in the context of ML systems, and why is "unconscious" debt more dangerous than "conscious" debt?

Answer Technical debt in ML systems encompasses accumulated compromises across six categories: data debt (data quality and pipeline issues), model debt (modeling shortcuts), infrastructure debt (system design compromises), testing debt (missing test coverage), documentation debt (missing or outdated docs), and fairness debt (known fairness issues not yet mitigated). **Conscious debt** is debt that has been identified, documented (in a debt register), assessed for severity, and either scheduled for remediation or explicitly accepted. **Unconscious debt** is debt that exists but has not been identified — the team does not know it is there. Unconscious debt is more dangerous because it compounds silently and surfaces only when it causes a production incident. The training-serving skew example (TD-005) illustrates this: without monitoring, skew accumulates undetected, manifesting as a slow decline in business metrics that is easy to misattribute to other causes. By the time the true cause is identified, weeks or months of degraded service have occurred. The debt register's primary value is converting unconscious debt to conscious debt, where it can be managed.
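A minimal version of the missing monitor behind TD-005 can be sketched with a Population Stability Index check comparing training-time feature values against serving logs. The feature values and the 0.2 alert threshold here are illustrative assumptions, not the StreamRec configuration:

```python
import numpy as np

def psi(expected, actual, bins=10):
    """Population Stability Index between a training sample and a serving sample.
    Bin edges are derived from the expected (training-time) distribution."""
    edges = np.quantile(expected, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf
    e = np.histogram(expected, edges)[0] / len(expected)
    a = np.histogram(actual, edges)[0] / len(actual)
    e, a = np.clip(e, 1e-6, None), np.clip(a, 1e-6, None)
    return float(np.sum((a - e) * np.log(a / e)))

rng = np.random.default_rng(1)
train = rng.normal(0.0, 1.0, 50_000)   # feature values seen at training time
serve = rng.normal(0.3, 1.0, 50_000)   # same feature as logged at serving time

score = psi(train, serve)
# A common rule of thumb alerts on PSI > 0.2; smaller shifts get logged as warnings.
print(f"PSI = {score:.3f}")
```

Run per feature on a schedule, a check like this converts TD-005 from unconscious to conscious debt: skew surfaces as an alert rather than as an unexplained decline in business metrics.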

Question 8

In the build-vs-buy framework, why does the rule "build what differentiates, buy everything else" apply to ML systems?

Answer The rule applies because ML systems have two categories of components: **differentiating** components that encode the organization's unique understanding of its domain and users (the retrieval model, ranking model, and fairness criteria — which reflect the platform's specific user base, content catalog, and values) and **commodity** components that provide essential infrastructure but do not create competitive advantage (the feature store, pipeline orchestration, monitoring, serving framework). Building differentiating components in-house is justified because they require domain-specific data, encode proprietary knowledge, and their quality directly impacts business outcomes. Building commodity components in-house is usually not justified because open-source or commercial alternatives (Feast, Dagster, Grafana, BentoML) are mature, well-maintained, and used by thousands of organizations — meaning they are more reliable and cheaper to maintain than custom implementations. The cost of building commodity infrastructure is not just the initial development but the ongoing maintenance: every bug fix, security patch, and feature enhancement must be done by your team rather than the vendor's. The StreamRec TCO analysis showed that the largest cost category is personnel, not infrastructure — so engineering time spent maintaining commodity infrastructure is engineering time not spent improving the differentiating model.

Question 9

Why does the integration checklist mark some components as "replaced" rather than "not used"? What is the significance?

Answer Components marked "replaced" (such as M0: SVD matrix factorization and M3: 1D CNN content embeddings) were built in earlier milestones, evaluated, and then superseded by better alternatives in later milestones (two-tower retrieval and pretrained encoders, respectively). They are marked "replaced" rather than "not used" because **they served a critical role in the development process even though they are not in the final system**. The SVD baseline established the minimum performance bar against which all subsequent models were evaluated. The 1D CNN established that content features improve recommendation quality, motivating the pretrained content encoder in the two-tower model. Replacement is the expected outcome of iterative development aligned with Theme 6 (Simplest Model That Works): you build the simple version first, validate the approach, and upgrade when the data justifies the complexity. A checklist that shows all components as "integrated" with no replacements would suggest that the team never iterated — they got everything right the first time, which is either miraculous or a sign that they never tested simpler alternatives.

Question 10

What are the three levels of the evaluation framework, and why must all three pass for the system to be considered successful?

Answer The three levels are: **Model evaluation** (does each component perform well in isolation — Hit@10, NDCG@10, AUC on held-out data), **System evaluation** (do the components work together under production conditions — latency, throughput, feature store reliability, training-serving skew, causal impact), and **Business evaluation** (does the system achieve its intended purpose — CTR improvement, revenue impact, creator exposure equity, user quality disparity). All three must pass because each catches failure modes invisible to the others. A model with excellent Hit@10 can be part of a system with p99 latency violations (system failure). A system with excellent latency and throughput can be recommending popular content that provides no incremental value (business failure). A system with good CTR improvement can be systematically underexposing non-English creators (fairness failure, visible only at the business level). The levels form a hierarchy: model evaluation is necessary for system evaluation (a bad model cannot produce a good system), and system evaluation is necessary for business evaluation (an unreliable system cannot deliver business value). But sufficiency flows in only one direction — each level adds constraints that the previous level cannot capture.

Question 11

How does the TCO analysis reveal that the bottleneck for recommendation system quality is engineering time, not compute budget?

Answer The StreamRec TCO analysis showed that **personnel costs ($300,000/year for 1.0 FTE equivalent) dominate the annual cost structure**, accounting for roughly 61% of total annual costs. The combined compute costs for training ($38,400), serving ($102,000), feature store ($33,600), and storage ($14,400) total $188,400 — less than the cost of one senior ML engineer. This means that the marginal value of an additional engineer (who can build the transformer ranker upgrade, implement incremental FAISS updates, or add fairness monitoring) far exceeds the marginal value of additional compute (faster training, more serving replicas). The implication for resource allocation is that **headcount is the constraining resource, not infrastructure budget**. A request for two additional GPUs ($5,000/year) should almost always be approved. A request for one additional engineer ($200,000/year) requires careful justification — but that engineer's output (measured in roadmap items completed, debt remediated, and incidents prevented) will typically deliver far more ROI than the same dollar amount spent on compute.
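The arithmetic behind the claim, assuming the five categories quoted above are the complete cost structure:

```python
# Annual cost figures quoted above (USD/year).
costs = {
    "personnel": 300_000,        # 1.0 FTE equivalent
    "training_compute": 38_400,
    "serving_compute": 102_000,
    "feature_store": 33_600,
    "storage": 14_400,
}

compute = sum(v for k, v in costs.items() if k != "personnel")
total = sum(costs.values())
print(f"compute: ${compute:,}  total: ${total:,}  "
      f"personnel share: {costs['personnel'] / total:.0%}")
# compute: $188,400  total: $488,400  personnel share: 61%
```

Every compute line item combined costs less than the single FTE, which is the quantitative basis for treating headcount, not infrastructure budget, as the constraining resource.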

Question 12

Why is it important to present both the naive CTR improvement and the causal ATE to stakeholders, rather than choosing one?

Answer Presenting both numbers — and the gap between them — provides three important signals. First, it **demonstrates intellectual honesty**: the team acknowledges that the naive estimate overstates the system's causal impact, which builds trust with stakeholders who will eventually encounter the discrepancy themselves. Second, it **quantifies the confounding problem**: the ratio of naive to causal (8.7% / 4.1% = 2.12 in the StreamRec example) tells stakeholders that approximately 53% of the observed engagement would have occurred without the recommendation system — the system is amplifying existing behavior patterns, not purely creating new ones. This informs how much credit the system deserves for engagement metrics. Third, it **frames the opportunity**: the gap between naive and causal represents engagement the system is *predicting but not causing*, which is the target for future improvements. If the system could convert even a fraction of that predicted-but-not-caused engagement into genuinely new engagement (via better exploration, novelty injection, or serendipity), the causal ATE would increase without changing the model architecture.
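The two headline figures convert directly into the ratio and share quoted above:

```python
naive_ctr_lift = 0.087   # 8.7% naive CTR improvement
causal_ate = 0.041       # 4.1% causal ATE

ratio = naive_ctr_lift / causal_ate
# Share of the observed lift that the system predicted but did not cause:
predicted_not_caused = (naive_ctr_lift - causal_ate) / naive_ctr_lift

print(f"naive/causal ratio: {ratio:.2f}")                             # 2.12
print(f"predicted-but-not-caused share: {predicted_not_caused:.0%}")  # 53%
```

That 53% share is the "framing the opportunity" number: it bounds how much engagement better exploration or novelty injection could in principle convert from predicted to caused.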

Question 13

Explain why the retrospective finding "the cold-start problem is a fairness problem" is an example of cross-cutting concerns in ML systems.

Answer The cold-start problem — new items and users have insufficient interaction data for accurate predictions — is traditionally treated as a **model quality** issue, addressed through techniques like Bayesian priors (Chapter 20) and content-based fallbacks. The fairness audit revealed that it is simultaneously a **fairness** issue, because the population of new creators is not demographically identical to established creators: new creators are disproportionately non-English-speaking and from emerging markets. The FAISS index rebuild delay (a 24-hour window where new items are invisible) therefore disproportionately affects non-English creators, contributing to the exposure inequity documented in Chapter 31. This means a single infrastructure problem (the rebuild delay) manifests in three different evaluation levels: model (lower recall for new items), system (stale index), and business (creator exposure inequity). Cross-cutting concerns are problems that span multiple components and evaluation dimensions, and they are among the hardest to diagnose because each team sees only their local manifestation. The integration inherent in the capstone project is what makes these cross-cutting concerns visible.

Question 14

What is the purpose of the parallel group concept in the architecture, and why is it critical for Track C feasibility?

Answer The parallel group concept accounts for the fact that some components run concurrently rather than sequentially. In the StreamRec architecture, the two-tower retrieval and LightGCN retrieval can execute simultaneously: both take user features as input and produce candidate lists as output, with no dependency between them. Without parallel groups, their latencies would sum (30ms + 25ms = 55ms), but with parallel groups, the pair contributes only max(30ms, 25ms) = 30ms — saving 25ms. This is critical for Track C because its 14 components have a naive sequential sum of 225ms, which exceeds the 200ms SLA; running the two retrieval sources in parallel brings the end-to-end total back to 200ms. More broadly, parallel execution is the primary mechanism for adding components to a latency-constrained system without increasing end-to-end latency. It requires that the parallelized components have no data dependencies between them (they cannot read each other's outputs) and that the infrastructure supports concurrent execution (async I/O, thread pools, or separate service instances).
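The budget arithmetic can be sketched as a latency model in which each parallel group costs only its slowest member. Only the two retrieval latencies (30ms and 25ms) come from the text; the remaining stage figures below are placeholders standing in for the other Track C components, chosen to total 225ms sequential and 200ms parallel:

```python
def sequential_ms(stages):
    """Latency if every component ran one after another."""
    return sum(sum(group.values()) for group in stages)

def end_to_end_ms(stages):
    """Latency with parallel groups: each group costs only its slowest member."""
    return sum(max(group.values()) for group in stages)

stages = [
    {"two_tower_retrieval": 30, "lightgcn_retrieval": 25},  # parallel group
    {"feature_store": 15},
    {"rankers_and_reranking": 110},
    {"gateway_and_assembly": 45},
]

print(end_to_end_ms(stages), sequential_ms(stages))  # 200 225
```

A model like this doubles as a design tool: before adding a component, add its latency to the appropriate stage and check whether the end-to-end total still fits the SLA.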

Question 15

In the roadmap, why is R-001 (streaming feature pipeline) a dependency for both R-003 (incremental FAISS updates) and R-005 (transformer ranker upgrade)?

Answer **R-003 (incremental FAISS updates)** depends on streaming features because incremental index updates require knowing about new items and changed item embeddings in near-real-time. Without the streaming pipeline, item updates arrive only through daily batch processing, and there is no mechanism to trigger incremental index additions between batch runs. The streaming pipeline provides the event stream (new item published, item embedding updated) that triggers the incremental FAISS insert. **R-005 (transformer ranker upgrade)** depends on streaming features because the transformer ranker's primary advantage over the MLP ranker is its ability to model the user's *current session* — the last N interactions within the current visit. These recent interactions are available only through streaming features (pushed to Redis via Kafka as they occur). Without streaming, the transformer must rely on batch features that are up to 24 hours stale, which eliminates most of its advantage over the MLP. The dependency encoding in the roadmap prevents scheduling either R-003 or R-005 before R-001, ensuring that the infrastructure prerequisite is in place before the components that depend on it are built.
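The dependency encoding can be enforced mechanically. A sketch using the standard-library graphlib (the roadmap IDs come from the text; treating R-001 as having no prerequisites is an assumption):

```python
from graphlib import TopologicalSorter

# Map each roadmap item to its prerequisites.
deps = {
    "R-001": set(),        # streaming feature pipeline
    "R-003": {"R-001"},    # incremental FAISS updates
    "R-005": {"R-001"},    # transformer ranker upgrade
}

# static_order yields items with every prerequisite before its dependents;
# it raises CycleError if the dependency graph is inconsistent.
order = list(TopologicalSorter(deps).static_order())
print(order)  # R-001 always precedes R-003 and R-005
```

Encoding the roadmap this way makes mis-scheduling a build failure rather than a planning oversight.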

Question 16

What does the debt register's distinction between "scheduled" and "unscheduled" debt items reveal about project priorities?

Answer The distinction reveals which known problems the team has decided to address (scheduled, linked to a roadmap item with a target quarter and resource estimate) versus which problems the team has identified but not yet committed to fixing (unscheduled, with no timeline or resource allocation). In the StreamRec debt register, three of five items are scheduled (TD-001 → R-003, TD-003 → R-004, TD-004 → R-005) and two are unscheduled (TD-002: no behavioral tests for re-ranking, TD-005: no training-serving skew monitoring). The unscheduled items are concerning because they represent **acknowledged risk without a mitigation plan**. TD-005 (training-serving skew monitoring) is particularly concerning because it is rated "high severity" — the failure mode is silent quality degradation — yet has no remediation timeline. This gap between severity and scheduling priority may indicate resource constraints (the team cannot address everything) or a misjudgment of risk (the team underestimates the probability of skew). The debt register makes this gap visible, enabling stakeholders to question priorities: "Why is a high-severity item unscheduled when a medium-severity item (TD-004) is scheduled?"
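A debt register kept in a structured form makes the severity-versus-scheduling gap queryable. In this sketch the IDs, roadmap links, and the severities of TD-004 and TD-005 come from the text; the remaining severities and the field names are assumptions:

```python
register = [
    {"id": "TD-001", "severity": "medium", "roadmap": "R-003"},
    {"id": "TD-002", "severity": "medium", "roadmap": None},  # no behavioral tests
    {"id": "TD-003", "severity": "medium", "roadmap": "R-004"},
    {"id": "TD-004", "severity": "medium", "roadmap": "R-005"},
    {"id": "TD-005", "severity": "high", "roadmap": None},    # no skew monitoring
]

# Acknowledged risk with no mitigation plan: high severity and unscheduled.
unmitigated = [d["id"] for d in register
               if d["severity"] == "high" and d["roadmap"] is None]
print(unmitigated)  # ['TD-005']
```

A query like this is exactly the question stakeholders should be able to ask of the register: which high-severity items have no remediation timeline?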

Question 17

How would the capstone architecture differ for a credit scoring system compared to the StreamRec recommendation system?

Answer The fundamental architectural differences stem from domain constraints. **Latency:** Credit scoring typically operates at human decision timescales (seconds) or batch (overnight), not the sub-200ms real-time constraint of recommendations. **Regulatory requirements:** Credit scoring under ECOA and SR 11-7 requires independent model validation (MRM team review), fairness audits against protected attributes (race, gender, age), model documentation for regulatory examiners, and human approval gates before deployment — none of which are required for a recommendation system. **Retrieval/ranking:** Credit scoring has no retrieval stage — every applicant is scored. The "ranker" is a single model (typically XGBoost or logistic regression) rather than a multi-stage funnel. **Interpretability:** Credit scoring requires adverse action reasons (why an applicant was denied), mandating feature-level explanations (SHAP, Chapter 35) for every decision — not just aggregate explainability. **Retraining cadence:** Credit scoring models retrain quarterly (not weekly) due to regulatory revalidation requirements at each model change. **Deployment:** No canary deployment in the traditional sense — instead, a parallel run period where both old and new models score all applicants, and a human review board compares outcomes before cutover.

Question 18

Why does the retrospective format include a "surprised_by" category separate from "learned"?

Answer The "surprised_by" category captures observations that violated expectations — things the team did not anticipate despite having the knowledge to anticipate them. The "learned" category captures conclusions drawn from the experience. The distinction matters because **surprises are diagnostic** — they reveal gaps in the team's mental model of the system, gaps that may cause future surprises if not addressed. In the StreamRec retrospective, the "surprised_by" item (cold-start bias is fairness bias) revealed a gap in the team's mental model: they understood cold-start as a model quality problem and fairness as a separate concern, and did not anticipate the intersection. The corresponding "learned" item (causal ATE is less than half the naive estimate) is a conclusion, not a gap — the team expected some gap between naive and causal but was surprised by the magnitude. Tracking surprises separately enables the team to ask: "What kind of things surprise us?" — which is a question about the team's blind spots, not just the system's behavior. If surprises consistently involve cross-cutting concerns (fairness × cold-start, drift × deployment, privacy × training), the team may need to add cross-cutting review processes to their workflow.

Question 19

What is the ROI analysis's role in the capstone, and why is the payback period a more useful metric than ROI percentage for executive decision-making?

Answer The ROI analysis translates the technical system into financial terms that executive stakeholders use for resource allocation decisions. The StreamRec ROI of 707% sounds impressive, but it is a ratio that can be manipulated by how costs are allocated (include or exclude the team's full salary? amortize the one-time setup cost?) and does not convey *when* the investment pays for itself. The **payback period** (1.5 months for StreamRec) is more useful because it directly answers the executive's implicit question: "How quickly will this investment generate returns that exceed its cost?" A 1.5-month payback period means the system pays for itself within a single quarter — a strong argument for continued investment. By contrast, a system with 50% ROI but a 14-month payback period might be financially justified but competes for capital with faster-returning investments. The payback period also naturally accounts for opportunity cost: the executive is comparing this investment not against zero, but against other projects the same engineers could work on. A short payback period reduces the risk of the investment, which is important for systems (like ML) where the actual impact is uncertain until launch.
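The distinction between the two metrics can be made concrete with hypothetical figures (made up for illustration; only the 707% and 1.5-month numbers echo the text):

```python
def roi_pct(net_benefit, cost):
    """ROI as a percentage: net benefit relative to cost."""
    return 100 * net_benefit / cost

def payback_months(investment, monthly_net_benefit):
    """Months until cumulative net benefit covers the investment."""
    return investment / monthly_net_benefit

print(f"ROI: {roi_pct(net_benefit=70_700, cost=10_000):.0f}%")  # ROI: 707%
print(f"payback: {payback_months(10_000, 6_667):.1f} months")   # payback: 1.5 months
```

Note that ROI is scale-free (it says nothing about timing), while payback depends on the monthly rate of return, which is why it better answers the executive's "how soon" question.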

Question 20

The chapter closes with: "The hardest part is not building any individual component but making them all work together reliably." In what specific ways does integration difficulty exceed individual component difficulty?

Answer Integration difficulty exceeds individual component difficulty in five specific ways. **Interface mismatches:** Each component was developed with simplifying assumptions about its inputs and outputs. The feature store schema assumed a fixed feature set; the ranking model assumed features arrive within 5ms; the fairness audit assumed access to demographic metadata that the serving path does not carry. Resolving these mismatches requires modifying multiple components simultaneously. **Emergent failure modes:** Individual components pass their unit tests, but the system fails in ways that no individual test covers — training-serving skew (feature store + model), feedback loops (recommendations + training data + future recommendations), and cascading latency (feature store delay + model timeout + fallback degradation). **Configuration explosion:** Each component has its own hyperparameters, thresholds, and settings. The system's behavior depends on the *combination* of all settings, which grows combinatorially. A PSI threshold that works for the current model may cause false positives after the transformer upgrade. **Competing objectives:** Hit@10 optimization, latency constraint, fairness constraint, diversity requirement, and freshness boost all interact. Improving one often degrades another, and the tradeoffs are not visible until the system is integrated. **Operational complexity:** Each component requires its own monitoring, alerting, runbooks, and on-call expertise. The integrated system requires all of this plus cross-component diagnosis: is the CTR drop caused by a model quality regression, a feature store stale cache, a deployment configuration error, or an external factor? Only an integrated system surfaces these diagnostic challenges.