Chapter 24: Key Takeaways
- The model is 5% of the system; the other 95% determines whether it works in production. Sculley et al.'s insight from 2015 remains the defining observation of production ML: the surrounding infrastructure — data pipelines, feature stores, serving infrastructure, monitoring, experimentation, and configuration management — is vastly larger, harder to build, and more likely to fail than the model itself. A state-of-the-art model that cannot serve predictions within a latency budget, retrain on fresh data, or recover from component failures is a research artifact, not a product. System design is the discipline that bridges this gap.
- The funnel architecture (retrieval, ranking, re-ranking) is a mathematical necessity, not an optimization. When the item corpus is large (200,000 items for StreamRec) and the latency budget is small (200ms), scoring every item with the ranking model is computationally infeasible. The funnel narrows the candidate set at each stage — from the full corpus to hundreds (retrieval), from hundreds to tens (ranking), from tens to the displayed set (re-ranking) — with each stage using more compute per item. This pattern is universal across recommendation systems, search engines, and advertising platforms.
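The narrowing described above can be sketched in a few lines. This is an illustrative skeleton, not StreamRec's implementation: the function names, the stage sizes (500 and 50 are plausible intermediate widths), and the stand-in scoring heuristics are all assumptions; in a real system the stages would be an ANN index, a learned ranker, and a business-rules layer.

```python
import random

# Hypothetical 200,000-item corpus, matching the scale in the text.
CORPUS = list(range(200_000))

def retrieve(user_id, k=500):
    # Cheapest stage: runs against the full corpus (e.g. approximate
    # nearest-neighbor lookup). Here, a deterministic per-user sample.
    rng = random.Random(user_id)
    return rng.sample(CORPUS, k)

def rank(user_id, candidates, k=50):
    # Mid-cost stage: a learned model scoring hundreds of candidates.
    # A hash-based stand-in score keeps this sketch self-contained.
    scored = sorted(candidates, key=lambda item: (item * 2654435761) % 997,
                    reverse=True)
    return scored[:k]

def rerank(user_id, candidates, k=10):
    # Most expensive per-item stage: diversity, freshness, business rules.
    return candidates[:k]

def recommend(user_id):
    candidates = retrieve(user_id)          # 200,000 -> 500
    candidates = rank(user_id, candidates)  # 500 -> 50
    return rerank(user_id, candidates)      # 50 -> 10
```

Note how compute per item rises as the candidate set shrinks: the total cost stays bounded because only a few dozen items ever reach the expensive stages.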
- The correct serving pattern is the simplest one that meets the product requirements. Batch serving is simpler, cheaper, and easier to audit than real-time serving. If batch predictions provide acceptable freshness and the product does not require session context, the engineering cost of real-time infrastructure is not justified. The hybrid pattern (real-time for speed-sensitive decisions, batch for quality-sensitive decisions) often captures the benefits of both, as demonstrated by Meridian Financial's credit decisioning system (Case Study 2) and StreamRec's batch fallback.
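The hybrid pattern reduces to a small amount of control flow: try the real-time path, and fall back to a pre-computed batch table when it fails or is unavailable. The sketch below is a minimal illustration under assumed names (`realtime_score`, `BATCH_PREDICTIONS`); a production version would add timeouts, metrics, and a circuit breaker rather than a bare `try/except`.

```python
def realtime_score(user_id, context):
    # Stand-in for a call to a live model service; raises when the
    # service is unavailable (simulated via a context flag here).
    if context.get("service_down"):
        raise TimeoutError("realtime scorer unavailable")
    return {"source": "realtime", "score": 0.9}

# Output of a nightly batch job, keyed by user. Illustrative values.
BATCH_PREDICTIONS = {42: {"source": "batch", "score": 0.7}}

def predict(user_id, context):
    # Speed-sensitive requests take the real-time path; on any failure,
    # degrade to the batch table, then to a safe default.
    try:
        return realtime_score(user_id, context)
    except Exception:
        return BATCH_PREDICTIONS.get(user_id, {"source": "default", "score": 0.0})
```

The asymmetry is deliberate: the batch table is always available because it was computed offline, so the real-time path can fail without the product failing.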
- The feature store is the single most important infrastructure component for production ML quality. Training-serving skew — the silent killer where the model sees different features in production than in training — is the most common cause of production ML degradation. Feature stores prevent skew by providing a single computation path for both online and offline features, point-in-time correct retrieval for training data, and default value handling for missing features. Building the feature store is harder than building the model, but the payoff (Case Study 1: eliminating all feature-skew incidents) justifies the investment.
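Point-in-time correct retrieval is the subtle part: training reads must return the feature value as it was at the label's timestamp, not the latest value. A toy sketch of the idea, with an assumed in-memory layout (per entity/feature, a time-sorted log of values); real feature stores back this with an online store plus an offline log, but the read semantics are the same.

```python
from bisect import bisect_right

class FeatureStore:
    """Toy point-in-time store. One read path serves both training
    (as_of = label timestamp) and serving (as_of = now), which is
    exactly what prevents training-serving skew."""

    def __init__(self):
        self._log = {}  # (entity, feature) -> sorted [(event_time, value)]

    def write(self, entity, feature, event_time, value):
        rows = self._log.setdefault((entity, feature), [])
        rows.append((event_time, value))
        rows.sort(key=lambda r: r[0])

    def read_as_of(self, entity, feature, as_of, default=None):
        # Return the latest value written at or before `as_of`;
        # fall back to `default` for missing features.
        rows = self._log.get((entity, feature), [])
        times = [t for t, _ in rows]
        i = bisect_right(times, as_of)
        return rows[i - 1][1] if i else default
```

Because training and serving call the same `read_as_of`, a feature computed one way offline and another way online simply cannot happen in this design.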
- Design for failure with graceful degradation, not optimistic availability. ML systems fail — feature stores time out, GPU servers crash, models go stale. The question is whether the failure is graceful (fall back to batch features, then to retrieval scores, then to popularity) or catastrophic (show the user an error page). Circuit breakers, fallback chains, and batch pre-computed recommendations are the mechanisms. StreamRec's five-level degradation hierarchy (Section 24.7) ensured that two real incidents in the first quarter caused zero user-visible impact (Case Study 1).
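A fallback chain is the simplest of these mechanisms to express in code: try each level in order and return the first one that succeeds. The sketch below assumes a list of named stages; the stage names echo the hierarchy in the text but are otherwise illustrative, and a production version would add circuit breakers and per-stage timeouts rather than catching every exception.

```python
def recommend_with_fallbacks(user_id, stages):
    """stages: ordered list of (name, fn) from best to worst quality.
    Returns (stage_name, result) for the first stage that yields a
    non-empty result without raising. A deliberately simple sketch."""
    for name, fn in stages:
        try:
            result = fn(user_id)
            if result:
                return name, result
        except Exception:
            # In production: record the failure and trip a circuit
            # breaker so a dead stage is skipped without paying its
            # timeout on every request.
            continue
    # Last resort: an empty result, never an error page.
    return "empty", []
```

A usage sketch: `recommend_with_fallbacks(uid, [("realtime", rt_model), ("batch", batch_table), ("retrieval", retrieval_scores), ("popularity", top_items)])`. The user sees popularity-ranked items instead of an error, which is the whole point of designing the hierarchy in advance.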
- Architecture Decision Records (ADRs) are institutional memory for ML systems. ML design decisions — serving pattern, feature store technology, retraining cadence, model architecture — have non-obvious dependencies, evolve through experimentation, and lose their rationale when team members rotate. ADRs document the context, options considered, decision made, and consequences, preventing re-litigation of settled questions and surfacing assumptions that may have changed. Write an ADR for any decision that would take more than one hour to reverse.
- Latency budgets must account for the tail distribution, not just the median. A system with 70ms p50 latency and 217ms p99 latency will violate a 200ms SLA for 1% of requests — which at 10,000 requests per second means 100 users every second experience degraded performance. Managing p99 requires hedged requests, timeouts, caching, and model optimization. The latency budget should allocate headroom (75ms for StreamRec) to absorb network jitter, garbage collection pauses, and downstream service variability.
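Of the tail-taming techniques listed, hedged requests are the least familiar: if the first attempt has not answered within a hedge delay (typically set near the current p95), fire a duplicate and take whichever replies first. A minimal threaded sketch, assuming the backend call is idempotent; a production version would also cancel the losing attempt and cap the hedge rate.

```python
import concurrent.futures as cf

def hedged_call(fn, arg, hedge_after_s=0.05, max_attempts=2):
    """Call fn(arg); if it has not returned within hedge_after_s,
    issue a duplicate call and return the first completed result.
    Sketch only: assumes fn is idempotent, and does not cancel or
    rate-limit the hedge as a production client would."""
    with cf.ThreadPoolExecutor(max_workers=max_attempts) as pool:
        futures = [pool.submit(fn, arg)]
        try:
            # Fast path: the first attempt beats the hedge delay.
            return futures[0].result(timeout=hedge_after_s)
        except cf.TimeoutError:
            # Slow path: hedge with a duplicate request and race them.
            futures.append(pool.submit(fn, arg))
            done, _ = cf.wait(futures, return_when=cf.FIRST_COMPLETED)
            return next(iter(done)).result()
```

The effect on the budget: the extra request costs a small amount of duplicate load but replaces the backend's p99 with roughly its p50 plus the hedge delay, which is how a 217ms tail is pulled back under a 200ms SLA.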