Chapter 36: Key Takeaways
- Integration is the bottleneck, not individual component quality. Every component of the StreamRec recommendation system — retrieval, ranking, feature store, deployment, monitoring, fairness — was built and validated individually in Chapters 1-35. The capstone revealed that 20-25% of total project effort was consumed by integration debugging: feature name mismatches, embedding version conflicts, schema incompatibilities, and cross-cutting concerns invisible in isolation. The feature contract (specifying every feature's name, type, unit, and valid range) was the single highest-leverage artifact for reducing integration bugs. Data contracts prevent the most common failure mode in ML system integration: silent data errors that produce wrong predictions without any exception, crash, or log message.
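A feature contract of the kind described above can be sketched as a small validation layer that declares each feature's name, type, unit, and valid range once and checks every row against it. The feature names and ranges here (`watch_time_sec`, `user_age_days`) are illustrative placeholders, not from the chapter:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FeatureSpec:
    """One entry in the feature contract: name, type, unit, valid range."""
    name: str
    dtype: type
    unit: str
    low: float
    high: float

# Hypothetical contract; real systems would load this from a shared registry.
CONTRACT = {
    "watch_time_sec": FeatureSpec("watch_time_sec", float, "seconds", 0.0, 86_400.0),
    "user_age_days": FeatureSpec("user_age_days", int, "days", 0, 36_500),
}

def validate(row: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the row passes.

    Turning silent data errors into explicit violations is the point: a
    mismatched name, type, or out-of-range value fails loudly here instead of
    producing a wrong prediction downstream.
    """
    errors = []
    for name, spec in CONTRACT.items():
        if name not in row:
            errors.append(f"missing feature: {name}")
            continue
        value = row[name]
        if not isinstance(value, spec.dtype):
            errors.append(f"{name}: expected {spec.dtype.__name__}, got {type(value).__name__}")
        elif not (spec.low <= value <= spec.high):
            errors.append(f"{name}: {value} outside [{spec.low}, {spec.high}] {spec.unit}")
    return errors
```

Checking rows at pipeline boundaries (feature store write, model input) is what converts the "silent wrong prediction" failure mode into an immediate, debuggable error.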
- Evaluation must operate at three levels — model, system, and business — because each catches failure modes invisible to the others. Model evaluation (Hit@10, NDCG@10) confirms component quality but misses latency violations, training-serving skew, and fairness regressions. System evaluation (latency, throughput, feature store reliability, causal impact) confirms operational health but misses business relevance. Business evaluation (CTR improvement, revenue impact, creator exposure equity, user quality disparity) confirms the system achieves its purpose. A model with excellent Hit@10 can be part of a system with unacceptable latency. A system with excellent latency can be recommending content that provides no incremental value. Only the combination of all three levels provides sufficient confidence.
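The model-level metrics named above reduce to a few lines. This is a generic binary-relevance formulation of Hit@k and NDCG@k, not the book's exact implementation:

```python
import math

def hit_at_k(ranked_ids, relevant_ids, k=10):
    """1.0 if any relevant item appears in the top-k recommendations, else 0.0."""
    return 1.0 if set(ranked_ids[:k]) & set(relevant_ids) else 0.0

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    """NDCG@k with binary relevance: DCG of the ranking over the ideal DCG.

    Position i (0-based) contributes 1/log2(i + 2) if the item is relevant,
    so relevant items ranked higher earn more credit than the same items
    ranked lower -- the property Hit@k cannot capture.
    """
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked_ids[:k]) if item in relevant_ids)
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal else 0.0
```

These numbers say nothing about serving latency or causal business impact, which is exactly why the system and business levels exist alongside them.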
- The gap between naive and causal metrics quantifies how much the system reinforces existing behavior versus creating new value. The StreamRec naive CTR improvement (8.7%) was more than double the causal ATE (4.1%), meaning more than half of the observed engagement would have occurred without the recommendation system. Reporting only the naive estimate overstates the system's value; reporting only the causal estimate understates the system's predictive capability. Reporting both — and the gap between them — provides an honest assessment of the system's causal contribution and identifies the opportunity for improvement: converting predicted-but-not-caused engagement into genuinely new engagement through better exploration, novelty, and serendipity.
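With the chapter's figures (8.7% naive lift, 4.1% causal ATE), the non-causal share of observed engagement falls out of simple arithmetic. This helper treats the two numbers as comparable relative lifts, which is how the chapter presents them:

```python
def non_causal_share(naive_lift: float, causal_ate: float) -> float:
    """Fraction of the observed (naive) lift that is not causally attributable
    to the system, i.e. engagement that would have occurred anyway."""
    return 1.0 - causal_ate / naive_lift

# Chapter figures: naive CTR improvement 8.7%, causal ATE 4.1%.
share = non_causal_share(0.087, 0.041)  # roughly 0.53: "more than half"
```

Tracking this share over time is one way to measure whether exploration, novelty, and serendipity work is actually converting predicted-but-not-caused engagement into new engagement.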
- Architecture Decision Records transform accidental architecture into deliberate architecture. An ADR documents the context, decision, alternatives considered, and consequences of each significant architectural choice. The "Alternatives Considered" section is the most valuable because it enables future engineers to evaluate whether a decision still makes sense when the context changes. An ADR marked "Superseded" is a sign of healthy engineering culture — the team revisited a decision when conditions evolved. An ADR collection with no superseded entries indicates either a remarkably stable context or a team that does not re-evaluate past decisions. The capstone demonstrated that ADRs also serve as the institutional memory of integration lessons: the feature naming convention ADR and the two-phase canary ADR both encoded hard-won integration wisdom.
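An ADR skeleton in this spirit might look like the following; the numbering and title are placeholders, and the section names follow the four elements listed above:

```markdown
# ADR-NNN: <decision title>

Status: Accepted   <!-- Proposed | Accepted | Superseded by ADR-MMM -->

## Context
The forces and constraints that made this decision necessary.

## Decision
The choice that was made, stated in one or two sentences.

## Alternatives Considered
Each rejected option, with the reason it lost.

## Consequences
What becomes easier, what becomes harder, and what should be
revisited if the context changes.
```

Keeping the Status line machine-greppable makes it trivial to audit how many decisions have ever been superseded, which is the health signal the takeaway describes.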
- The fairness audit must precede the technical roadmap because it can fundamentally reorder priorities. The StreamRec fairness audit revealed that the FAISS index rebuild delay — initially classified as a P2 latency optimization — was also a fairness problem: new creators are disproportionately non-English-speaking, so the 24-hour window where new items are invisible amplifies language-based exposure inequity. This cross-cutting insight reframed the roadmap item from "nice-to-have" to "fairness-critical," changing its priority from P2 to P0. Without the fairness audit, the team would have allocated engineering resources based on an incomplete understanding of the system's impact.
- Build the components that differentiate your system; buy everything else. The TCO analysis revealed that personnel costs dominate the cost structure (55% of annual costs), not infrastructure. Engineering time spent building and maintaining commodity infrastructure (feature store, monitoring, pipeline orchestration) is engineering time not spent improving differentiating components (the retrieval model, the ranking model, the fairness intervention). The retrieval and ranking models encode StreamRec's unique understanding of its users and content — they create competitive advantage. The feature store, monitoring stack, and pipeline orchestration need to work reliably but do not differentiate the platform from competitors. Build the former; buy (or adopt open-source) the latter.
- Stakeholder communication is an engineering deliverable, not a soft skill. A system that cannot be explained to its stakeholders — executive, product, legal, engineering — is a system that will not be funded, maintained, or trusted. The same system requires fundamentally different presentations depending on the audience: executives need business impact in three slides; data science peers need causal ATE with confidence intervals; legal needs fairness metrics and privacy practices. The three-slide rule for executives (what and why, how we know it works, what comes next) is a forcing function for clarity — if you cannot compress your system to three slides, you do not yet understand what matters about it.