

Chapter 36: Capstone — Designing, Building, and Governing a Production Recommendation System

"Integration is not a phase of work. It is the work." — Grady Booch, Object-Oriented Analysis and Design with Applications (1994)


Learning Objectives

By the end of this chapter, you will be able to:

  1. Synthesize the preceding parts of this textbook — mathematical foundations, deep learning, causal inference, Bayesian and temporal methods, production systems, and responsible AI — into a single, coherent production recommendation system
  2. Make and justify architectural tradeoffs using Architecture Decision Records (ADRs), with explicit documentation of context, constraints, alternatives, and consequences
  3. Evaluate a production ML system on both predictive metrics (Hit@10, NDCG@10) and causal impact (average treatment effect of recommendation on engagement), synthesizing offline evaluation, online experimentation, and business metrics
  4. Present a technical system to both technical and non-technical stakeholders, adapting depth, vocabulary, and framing to the audience
  5. Identify concrete improvements, estimate their expected impact, and organize them into a prioritized technical roadmap with sequencing, dependencies, and resource estimates

36.1 The Craft of Integration

You have built every component of a production recommendation system. Linear algebra gave you matrix factorization and SVD (Chapter 1). Optimization gave you gradient descent and its variants (Chapter 2). Probability theory gave you the statistical foundations for evaluation (Chapter 3). Information theory gave you mutual information for feature selection (Chapter 4). Computational complexity gave you FAISS for approximate nearest neighbor retrieval (Chapter 5).

Deep learning gave you neural architectures: the click-prediction MLP (Chapter 6), training stability with batch normalization and dropout (Chapter 7), 1D CNN content embeddings (Chapter 8), session LSTMs (Chapter 9), transformer session models (Chapter 10), RAG pipelines for catalog search (Chapter 11), VAE latent embeddings (Chapter 12), two-tower retrieval with pretrained encoders (Chapter 13), and LightGCN graph-based collaborative filtering (Chapter 14).

Causal inference taught you to ask whether your recommendations cause engagement rather than merely predict it (Chapter 15), to define potential outcomes (Chapter 16), to build causal DAGs (Chapter 17), to estimate causal effects with IPW and doubly robust methods (Chapter 18), and to personalize with causal forests and uplift modeling (Chapter 19).

Bayesian methods gave you principled cold-start handling (Chapter 20), hierarchical engagement models (Chapter 21), Thompson sampling for explore-exploit (Chapter 22), and temporal engagement forecasting (Chapter 23).

Production systems engineering gave you the architecture blueprint (Chapter 24), the feature store (Chapter 25), distributed training (Chapter 26), pipeline orchestration (Chapter 27), testing infrastructure (Chapter 28), continuous deployment (Chapter 29), and monitoring (Chapter 30).

Responsible AI gave you fairness auditing (Chapter 31), differential privacy (Chapter 32), rigorous experimentation (Chapter 33), uncertainty quantification (Chapter 34), and interpretability (Chapter 35).

Each of these components, taken individually, is a solved problem. You know how to build a two-tower retrieval model. You know how to deploy it with canary rollout. You know how to monitor it for drift. You know how to audit it for fairness.

The unsolved problem is making them all work together.

Integration is where every simplifying assumption you made in isolation collides with every simplifying assumption made by every other component. The feature store schema designed in Chapter 25 must serve both the two-tower model from Chapter 13 and the fairness audit from Chapter 31. The drift detection thresholds calibrated in Chapter 30 must account for the intentional distribution shifts introduced by the Thompson sampling exploration policy from Chapter 22. The canary deployment criteria from Chapter 29 must include the fairness metrics from Chapter 31 and the uncertainty estimates from Chapter 34 — not just Hit@10.

All Six Themes Converge: This chapter is where every recurring theme of the book meets reality simultaneously.

  1. Understanding Why — you must explain why the system makes each recommendation, not just that it does.
  2. Prediction ≠ Causation — your evaluation must include causal impact estimates, not just predictive accuracy.
  3. Production ML = Software Engineering — the deliverable is a system, not a model.
  4. Know How Your Model Is Wrong — your monitoring must detect when the system fails, and your fairness audit must document for whom it fails worst.
  5. Fundamentals > Frontier — you will discover that simpler components often outperform complex ones when integration costs are included.
  6. Simplest Model That Works — your Track A implementation should be remarkably effective, and you should be able to articulate why before adding complexity in Track B or C.

This chapter defines three project tracks at increasing levels of ambition, provides a technical design document template, offers guidance on stakeholder presentation, and closes with a retrospective framework. The goal is not to introduce new techniques — you have all the techniques you need — but to teach you the craft of putting them together.


36.2 Three Tracks, Three Levels of Ambition

The capstone project comes in three tracks. Track A is achievable in 40-60 hours of focused work. Track B adds depth in production engineering, causal evaluation, and fairness. Track C is a portfolio-grade system suitable for a staff-level technical interview. All three tracks use the same StreamRec platform — 50 million monthly active users, 200,000 items, and a 200ms end-to-end latency budget.

Track A: Minimal Viable Recommendation System

Track A proves that you can build a recommendation system that works. It prioritizes simplicity over sophistication, demonstrating Theme 6 (Simplest Model That Works) by showing how far you can get with well-chosen baselines.

Components:

| Component | Chapter Origin | Track A Implementation |
|---|---|---|
| Retrieval | 1, 5, 13 | Two-tower model (Ch. 13) with FAISS index |
| Ranking | 6, 7, 10 | MLP ranker with user/item/context features |
| Feature Store | 25 | Batch features only (no streaming). Offline: Parquet. Online: Redis |
| Training | 26, 27 | Single-GPU training. Manual pipeline (Dagster optional) |
| Deployment | 29 | BentoML serving. Single-stage deployment (no canary) |
| Monitoring | 30 | PSI drift detection on top 10 features. Prediction distribution histogram |
| Evaluation | 3, 33 | Offline: Hit@10, NDCG@10. No A/B test |
| Fairness | 31 | Single-attribute creator exposure equity audit |
| Documentation | — | ADR for retrieval model choice. 2-page design summary |

Deliverables:

  1. Working recommendation endpoint (HTTP API, <200ms p99 latency)
  2. Offline evaluation report (Hit@10, NDCG@10, comparison to random and popularity baselines)
  3. Creator fairness audit (exposure equity by language)
  4. One ADR (retrieval model choice: two-tower vs. matrix factorization vs. LightGCN)
  5. 2-page system design summary

Expected Performance: Hit@10 ≈ 0.15-0.18, NDCG@10 ≈ 0.08-0.10.
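These metrics are worth being able to compute from scratch before trusting any evaluation report. A minimal sketch of Hit@K and binary-relevance NDCG@K (the helper names and toy ranking are illustrative, not from the chapter's codebase):

```python
import math


def hit_at_k(ranked_items, relevant, k=10):
    """1.0 if any relevant item appears in the top-k, else 0.0."""
    return 1.0 if any(item in relevant for item in ranked_items[:k]) else 0.0


def ndcg_at_k(ranked_items, relevant, k=10):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(rank + 2)          # rank is 0-indexed, so +2
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant
    )
    ideal = sum(1.0 / math.log2(r + 2) for r in range(min(len(relevant), k)))
    return dcg / ideal if ideal > 0 else 0.0


# Toy example: item 7 is the only relevant item and sits at rank 3 (index 2).
ranking = [4, 9, 7, 1, 0, 3, 8, 2, 6, 5]
print(hit_at_k(ranking, {7}))              # 1.0
print(ndcg_at_k(ranking, {7}))             # 1/log2(4) = 0.5
```

Averaging these per-user scores over a held-out set gives the Hit@10 and NDCG@10 figures reported above.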

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple
from enum import Enum
import time
import json


class Track(Enum):
    """Capstone project track levels."""
    A_MINIMAL = "A"
    B_STANDARD = "B"
    C_FULL = "C"


class ComponentMaturity(Enum):
    """Maturity level of a system component."""
    PROTOTYPE = "prototype"        # Works on test data, not hardened
    PRODUCTION_READY = "prod"      # Tested, monitored, documented
    BATTLE_TESTED = "battle"       # Survived production incidents


@dataclass
class CapstoneComponent:
    """A single component of the capstone recommendation system.

    Tracks the component's name, which chapters it draws from,
    its maturity level, and key configuration parameters.

    Attributes:
        name: Component identifier (e.g., "retrieval", "ranking").
        chapter_origins: List of chapter numbers this component draws from.
        maturity: Current maturity level.
        track: Minimum track level that includes this component.
        config: Key configuration parameters.
        latency_budget_ms: Allocated latency budget.
        dependencies: Other components this one depends on.
    """
    name: str
    chapter_origins: List[int]
    maturity: ComponentMaturity = ComponentMaturity.PROTOTYPE
    track: Track = Track.A_MINIMAL
    config: Dict[str, str] = field(default_factory=dict)
    latency_budget_ms: float = 0.0
    dependencies: List[str] = field(default_factory=list)

    def is_included_in_track(self, target_track: Track) -> bool:
        """Check if this component is included in the given track.

        Track ordering: A < B < C. A component included in Track A
        is also included in Track B and C.

        Args:
            target_track: The track to check against.

        Returns:
            True if this component is included in the target track.
        """
        ordering = {Track.A_MINIMAL: 0, Track.B_STANDARD: 1, Track.C_FULL: 2}
        return ordering[target_track] >= ordering[self.track]


@dataclass
class CapstoneArchitecture:
    """Full capstone system architecture across all three tracks.

    Manages component registration, track filtering, and latency
    budget validation for the StreamRec recommendation system.

    Attributes:
        components: Registry of all system components.
        total_latency_budget_ms: End-to-end latency SLA.
        track: Selected project track.
    """
    components: Dict[str, CapstoneComponent] = field(default_factory=dict)
    total_latency_budget_ms: float = 200.0
    track: Track = Track.A_MINIMAL

    def register(self, component: CapstoneComponent) -> None:
        """Register a component in the architecture.

        Args:
            component: Component to register.
        """
        self.components[component.name] = component

    def active_components(self) -> Dict[str, CapstoneComponent]:
        """Return only components included in the current track.

        Returns:
            Dictionary of component name to component, filtered by track.
        """
        return {
            name: comp
            for name, comp in self.components.items()
            if comp.is_included_in_track(self.track)
        }

    def latency_budget_check(self) -> Tuple[bool, float, float]:
        """Check if active components fit within latency budget.

        Returns:
            (feasible, total_allocated, headroom) tuple.
        """
        active = self.active_components()
        total = sum(c.latency_budget_ms for c in active.values())
        headroom = self.total_latency_budget_ms - total
        return headroom >= 0, total, headroom

    def dependency_check(self) -> List[str]:
        """Find missing dependencies among active components.

        Returns:
            List of error messages for missing dependencies.
        """
        active = self.active_components()
        errors = []
        for name, comp in active.items():
            for dep in comp.dependencies:
                if dep not in active:
                    errors.append(
                        f"Component '{name}' depends on '{dep}', "
                        f"which is not active in Track {self.track.value}."
                    )
        return errors

    def summary(self) -> str:
        """Generate a human-readable architecture summary.

        Returns:
            Formatted string summarizing the architecture.
        """
        active = self.active_components()
        feasible, total_lat, headroom = self.latency_budget_check()
        dep_errors = self.dependency_check()

        lines = [
            f"StreamRec Capstone Architecture — Track {self.track.value}",
            f"{'=' * 55}",
            f"Active components: {len(active)} / {len(self.components)}",
            f"Latency budget: {total_lat:.0f}ms / "
            f"{self.total_latency_budget_ms:.0f}ms "
            f"({'OK' if feasible else 'EXCEEDED'})",
            "",
        ]

        for name, comp in active.items():
            chapters = ", ".join(str(c) for c in comp.chapter_origins)
            lines.append(
                f"  [{comp.maturity.value:>9s}] {name:<25s} "
                f"({comp.latency_budget_ms:>5.0f}ms) "
                f"Chapters: {chapters}"
            )

        if dep_errors:
            lines.append("\nDependency Errors:")
            for err in dep_errors:
                lines.append(f"  - {err}")

        return "\n".join(lines)


def build_streamrec_architecture(track: Track) -> CapstoneArchitecture:
    """Build the StreamRec capstone architecture for a given track.

    Registers all components with their chapter origins, latency
    budgets, dependencies, and minimum track levels.

    Args:
        track: The project track (A, B, or C).

    Returns:
        Fully configured CapstoneArchitecture.
    """
    arch = CapstoneArchitecture(
        total_latency_budget_ms=200.0,
        track=track,
    )

    # --- Retrieval ---
    arch.register(CapstoneComponent(
        name="two_tower_retrieval",
        chapter_origins=[1, 5, 13],
        maturity=ComponentMaturity.PRODUCTION_READY,
        track=Track.A_MINIMAL,
        latency_budget_ms=30.0,
        dependencies=[],
        config={
            "embedding_dim": "128",
            "faiss_index": "IVF4096,PQ32",
            "top_k": "500",
        },
    ))

    arch.register(CapstoneComponent(
        name="lightgcn_retrieval",
        chapter_origins=[14],
        maturity=ComponentMaturity.PROTOTYPE,
        track=Track.C_FULL,
        latency_budget_ms=25.0,
        dependencies=[],
        config={
            "n_layers": "3",
            "embedding_dim": "128",
            "top_k": "200",
        },
    ))

    # --- Ranking ---
    arch.register(CapstoneComponent(
        name="mlp_ranker",
        chapter_origins=[6, 7],
        maturity=ComponentMaturity.PRODUCTION_READY,
        track=Track.A_MINIMAL,
        latency_budget_ms=40.0,
        dependencies=["two_tower_retrieval"],
        config={
            "hidden_dims": "[256, 128, 64]",
            "dropout": "0.2",
            "batch_norm": "true",
        },
    ))

    arch.register(CapstoneComponent(
        name="transformer_ranker",
        chapter_origins=[10],
        maturity=ComponentMaturity.PRODUCTION_READY,
        track=Track.B_STANDARD,
        latency_budget_ms=50.0,
        dependencies=["two_tower_retrieval"],
        config={
            "n_heads": "4",
            "n_layers": "2",
            "d_model": "128",
            "max_session_length": "50",
        },
    ))

    # --- Re-ranking ---
    arch.register(CapstoneComponent(
        name="reranker",
        chapter_origins=[24, 31],
        maturity=ComponentMaturity.PROTOTYPE,
        track=Track.A_MINIMAL,
        latency_budget_ms=15.0,
        dependencies=["mlp_ranker"],
        config={
            "diversity_weight": "0.1",
            "freshness_boost": "1.2",
            "fairness_constraint": "none" if track == Track.A_MINIMAL
            else "exposure_equity",
        },
    ))

    # --- Feature Store ---
    arch.register(CapstoneComponent(
        name="feature_store",
        chapter_origins=[25],
        maturity=ComponentMaturity.PRODUCTION_READY,
        track=Track.A_MINIMAL,
        latency_budget_ms=15.0,
        dependencies=[],
        config={
            "offline_store": "parquet",
            "online_store": "redis",
            "streaming": "false" if track == Track.A_MINIMAL else "true",
        },
    ))

    # --- Serving ---
    arch.register(CapstoneComponent(
        name="api_gateway",
        chapter_origins=[24],
        maturity=ComponentMaturity.PRODUCTION_READY,
        track=Track.A_MINIMAL,
        latency_budget_ms=5.0,
        dependencies=[],
    ))

    arch.register(CapstoneComponent(
        name="response_assembly",
        chapter_origins=[24],
        maturity=ComponentMaturity.PRODUCTION_READY,
        track=Track.A_MINIMAL,
        latency_budget_ms=5.0,
        dependencies=["reranker"],
    ))

    # --- Deployment ---
    arch.register(CapstoneComponent(
        name="deployment_pipeline",
        chapter_origins=[27, 29],
        maturity=ComponentMaturity.PROTOTYPE,
        track=Track.A_MINIMAL,
        latency_budget_ms=0.0,
        dependencies=[],
        config={
            "serving_framework": "bentoml",
            "canary": "false" if track == Track.A_MINIMAL else "true",
            "shadow_mode": "false" if track != Track.C_FULL else "true",
        },
    ))

    # --- Monitoring ---
    arch.register(CapstoneComponent(
        name="monitoring",
        chapter_origins=[30],
        maturity=ComponentMaturity.PROTOTYPE,
        track=Track.A_MINIMAL,
        latency_budget_ms=0.0,
        dependencies=[],
        config={
            "drift_method": "psi",
            "n_monitored_features": "10" if track == Track.A_MINIMAL
            else "30+",
            "alerting": "log" if track == Track.A_MINIMAL
            else "slack+pagerduty",
        },
    ))

    # --- Causal Evaluation (Track B+) ---
    arch.register(CapstoneComponent(
        name="causal_evaluation",
        chapter_origins=[15, 16, 17, 18, 19, 33],
        maturity=ComponentMaturity.PROTOTYPE,
        track=Track.B_STANDARD,
        latency_budget_ms=0.0,
        dependencies=[],
        config={
            "method": "doubly_robust",
            "experiment_design": "interleaving + A/B",
        },
    ))

    # --- Fairness Module (Track B+) ---
    arch.register(CapstoneComponent(
        name="fairness_module",
        chapter_origins=[31, 35],
        maturity=ComponentMaturity.PROTOTYPE,
        track=Track.B_STANDARD,
        latency_budget_ms=0.0,
        dependencies=["monitoring"],
        config={
            "creator_metric": "exposure_equity_ratio",
            "user_metric": "hit10_disparity",
            "intersectional": "false" if track == Track.B_STANDARD
            else "true",
        },
    ))

    # --- Uncertainty Quantification (Track C) ---
    arch.register(CapstoneComponent(
        name="uncertainty_module",
        chapter_origins=[20, 34],
        maturity=ComponentMaturity.PROTOTYPE,
        track=Track.C_FULL,
        latency_budget_ms=10.0,
        dependencies=["transformer_ranker"],
        config={
            "method": "mc_dropout",
            "n_forward_passes": "10",
            "conformal_coverage": "0.90",
        },
    ))

    # --- Explore/Exploit (Track C) ---
    arch.register(CapstoneComponent(
        name="exploration_policy",
        chapter_origins=[22],
        maturity=ComponentMaturity.PROTOTYPE,
        track=Track.C_FULL,
        latency_budget_ms=5.0,
        dependencies=["uncertainty_module"],
        config={
            "method": "thompson_sampling",
            "exploration_fraction": "0.05",
        },
    ))

    return arch


# Demonstrate all three tracks
for t in [Track.A_MINIMAL, Track.B_STANDARD, Track.C_FULL]:
    arch = build_streamrec_architecture(t)
    print(arch.summary())
    print()
StreamRec Capstone Architecture — Track A
=======================================================
Active components: 8 / 14
Latency budget: 110ms / 200ms (OK)

  [     prod] two_tower_retrieval       (   30ms) Chapters: 1, 5, 13
  [     prod] mlp_ranker                (   40ms) Chapters: 6, 7
  [prototype] reranker                  (   15ms) Chapters: 24, 31
  [     prod] feature_store             (   15ms) Chapters: 25
  [     prod] api_gateway               (    5ms) Chapters: 24
  [     prod] response_assembly         (    5ms) Chapters: 24
  [prototype] deployment_pipeline       (    0ms) Chapters: 27, 29
  [prototype] monitoring                (    0ms) Chapters: 30

StreamRec Capstone Architecture — Track B
=======================================================
Active components: 11 / 14
Latency budget: 160ms / 200ms (OK)

  [     prod] two_tower_retrieval       (   30ms) Chapters: 1, 5, 13
  [     prod] mlp_ranker                (   40ms) Chapters: 6, 7
  [     prod] transformer_ranker        (   50ms) Chapters: 10
  [prototype] reranker                  (   15ms) Chapters: 24, 31
  [     prod] feature_store             (   15ms) Chapters: 25
  [     prod] api_gateway               (    5ms) Chapters: 24
  [     prod] response_assembly         (    5ms) Chapters: 24
  [prototype] deployment_pipeline       (    0ms) Chapters: 27, 29
  [prototype] monitoring                (    0ms) Chapters: 30
  [prototype] causal_evaluation         (    0ms) Chapters: 15, 16, 17, 18, 19, 33
  [prototype] fairness_module           (    0ms) Chapters: 31, 35

StreamRec Capstone Architecture — Track C
=======================================================
Active components: 14 / 14
Latency budget: 200ms / 200ms (OK)

  [     prod] two_tower_retrieval       (   30ms) Chapters: 1, 5, 13
  [prototype] lightgcn_retrieval        (   25ms) Chapters: 14
  [     prod] mlp_ranker                (   40ms) Chapters: 6, 7
  [     prod] transformer_ranker        (   50ms) Chapters: 10
  [prototype] reranker                  (   15ms) Chapters: 24, 31
  [     prod] feature_store             (   15ms) Chapters: 25
  [     prod] api_gateway               (    5ms) Chapters: 24
  [     prod] response_assembly         (    5ms) Chapters: 24
  [prototype] deployment_pipeline       (    0ms) Chapters: 27, 29
  [prototype] monitoring                (    0ms) Chapters: 30
  [prototype] causal_evaluation         (    0ms) Chapters: 15, 16, 17, 18, 19, 33
  [prototype] fairness_module           (    0ms) Chapters: 31, 35
  [prototype] uncertainty_module        (   10ms) Chapters: 20, 34
  [prototype] exploration_policy        (    5ms) Chapters: 22

Notice that Track C allocates exactly 200ms — the full latency budget. This is not an accident. It reflects a fundamental lesson of systems design: more components consume more latency. Note also that the simple sum computed by `latency_budget_check` treats every component as sequential; in practice the LightGCN retrieval runs in parallel with the two-tower retrieval, so the true critical path is shorter than the sequential sum, which serves as a conservative upper bound. If you are at the budget boundary, your options are to optimize a component (reduce individual latency), run components in parallel (shorten the critical path), remove a component (simplify the architecture), or increase the budget (negotiate with product).
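The effect of running the two retrieval sources in parallel can be sketched with the Track C budget numbers (a toy calculation; the component names and milliseconds mirror the architecture above):

```python
# Per-component latency budgets for Track C (ms), as allocated above.
budgets = {
    "two_tower_retrieval": 30.0, "lightgcn_retrieval": 25.0,
    "mlp_ranker": 40.0, "transformer_ranker": 50.0, "reranker": 15.0,
    "feature_store": 15.0, "api_gateway": 5.0, "response_assembly": 5.0,
    "uncertainty_module": 10.0, "exploration_policy": 5.0,
}

# Conservative view: every component runs one after another.
sequential = sum(budgets.values())

# Parallel retrieval: the slower of the two sources bounds that stage,
# so the faster one drops out of the critical path.
parallel = sequential - min(budgets["two_tower_retrieval"],
                            budgets["lightgcn_retrieval"])

print(sequential, parallel)  # 200.0 175.0
```

The 25ms difference is the headroom that the sequential sum hides — often the difference between a feasible architecture and an infeasible one.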

Track B: Standard Integration

Track B adds three capabilities that distinguish a competent production system from a prototype: causal evaluation, fairness governance, and staged deployment.

Additional Components Beyond Track A:

| Component | Chapter Origin | Track B Addition |
|---|---|---|
| Ranking | 10 | Transformer ranker replacing or ensembled with MLP |
| Streaming Features | 25 | Real-time user features (last 5 interactions) via Kafka/Redis |
| Causal Evaluation | 15-19, 33 | Offline: doubly robust ATE. Online: interleaved comparison |
| Fairness | 31, 35 | Creator exposure equity + user quality disparity. SHAP explanations |
| Deployment | 29 | Canary deployment (10% → 50% → 100%) with automated rollback |
| Monitoring | 30 | Full PSI on 30+ features. Grafana dashboard. Slack alerting |
| Documentation | — | 3+ ADRs. 5-page technical design document |

Additional Deliverables Beyond Track A:

  1. Causal impact estimate: ATE of recommendation vs. random baseline
  2. Two-sided fairness audit: creator exposure equity (by language, tenure, intersectional) + user quality disparity (by age, region)
  3. Canary deployment runbook with rollback criteria
  4. SHAP-based recommendation explanations for top-3 items
  5. 5-page technical design document with architecture diagram

Expected Performance: Hit@10 ≈ 0.20-0.24, NDCG@10 ≈ 0.11-0.14.
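The doubly robust ATE estimate called for in Track B can be sketched on synthetic data where the true effect is known. Everything here is illustrative: the sketch plugs in the true outcome and propensity models, whereas in practice both are fitted, and the estimator remains consistent if either one is correctly specified:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50_000
x = rng.normal(size=n)                     # confounder (e.g., user activity level)
p = 1 / (1 + np.exp(-x))                   # true propensity of being recommended
t = rng.binomial(1, p)                     # treatment: item was recommended
y = 0.5 * t + 0.3 * x + rng.normal(scale=0.1, size=n)  # engagement; true ATE = 0.5

# Outcome-model predictions (taken as known for this sketch).
mu1 = 0.5 + 0.3 * x                        # E[Y | T=1, X]
mu0 = 0.3 * x                              # E[Y | T=0, X]

# AIPW / doubly robust score per observation.
dr = mu1 - mu0 + t * (y - mu1) / p - (1 - t) * (y - mu0) / (1 - p)
print(f"{dr.mean():.2f}")                  # ≈ 0.50, recovering the true ATE
```

A naive difference in means over the same data would be confounded by `x`; the doubly robust score removes that bias.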

Track C: Full Production System

Track C adds uncertainty-aware serving, exploration policy, multi-source retrieval, and intersectional fairness — the components that distinguish a staff-level system from a senior-level one.

Additional Components Beyond Track B:

| Component | Chapter Origin | Track C Addition |
|---|---|---|
| Retrieval | 14 | LightGCN second retrieval source, merged with two-tower |
| Uncertainty | 20, 34 | MC dropout uncertainty estimates. Conformal prediction sets |
| Exploration | 22 | Thompson sampling for 5% of traffic |
| Privacy | 32 | DP-SGD training (ε=8). Privacy budget tracking |
| Fairness | 31 | Intersectional audit (language × tenure). Continuous monitoring |
| Documentation | — | 5+ ADRs. 10-page technical design document. Stakeholder deck |

Additional Deliverables Beyond Track B:

  1. Uncertainty-calibrated recommendations with conformal coverage guarantee
  2. Thompson sampling exploration with regret analysis
  3. DP-SGD training report: privacy-accuracy tradeoff at ε ∈ {1, 3, 8, ∞}
  4. Intersectional fairness audit with disaggregated exposure equity
  5. 10-page technical design document
  6. Stakeholder presentation deck (technical and executive versions)
  7. Technical roadmap for next 6 months
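The conformal coverage guarantee among the deliverables can be illustrated with split conformal prediction on synthetic residuals (a toy sketch; the 90% level matches the `conformal_coverage` setting in the uncertainty module, and the Gaussian residuals are invented):

```python
import math
import random

random.seed(1)

# Split conformal: absolute residuals on a held-out calibration set
# determine the half-width of a 90% prediction interval.
n_cal = 999
alpha = 0.10
residuals = sorted(abs(random.gauss(0, 1)) for _ in range(n_cal))

# Conformal quantile: the ceil((n+1)(1-alpha))-th smallest calibration residual.
q_idx = math.ceil((n_cal + 1) * (1 - alpha)) - 1
q = residuals[q_idx]

# Empirical coverage on fresh points drawn from the same distribution.
covered = sum(abs(random.gauss(0, 1)) <= q for _ in range(10_000)) / 10_000
print(round(covered, 2))  # typically close to 0.90
```

The guarantee is distribution-free: it requires only that calibration and serving data are exchangeable, which is exactly the assumption that drift monitoring exists to check.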

Expected Performance: Hit@10 ≈ 0.22-0.26, NDCG@10 ≈ 0.12-0.16.
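A Beta-Bernoulli Thompson sampling loop of the kind the Track C exploration policy uses (Chapter 22) fits in a few lines. The click rates, horizon, and arm count here are invented for illustration:

```python
import random

random.seed(7)

# Three candidate items with true (unknown to the policy) click rates.
true_ctr = [0.04, 0.06, 0.09]
alpha = [1.0, 1.0, 1.0]   # Beta posterior: 1 + observed clicks
beta = [1.0, 1.0, 1.0]    # Beta posterior: 1 + observed non-clicks
pulls = [0, 0, 0]

for _ in range(20_000):
    # Sample a plausible CTR from each posterior; recommend the argmax.
    samples = [random.betavariate(alpha[i], beta[i]) for i in range(3)]
    arm = max(range(3), key=lambda i: samples[i])
    click = random.random() < true_ctr[arm]
    alpha[arm] += click
    beta[arm] += 1 - click
    pulls[arm] += 1

print(pulls)  # traffic concentrates on the 0.09-CTR item over time
```

Unlike epsilon-greedy, exploration here shrinks automatically as the posteriors sharpen, which is why the drift monitors from Chapter 30 must treat this intentional distribution shift differently from an incident.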


36.3 The Technical Design Document

Every track produces a technical design document (TDD). The TDD is the primary artifact — it documents not just what you built, but why you built it that way, what you chose not to build, and what you would build next. A TDD that describes only the system you built is a user manual. A TDD that also explains the decisions, tradeoffs, and alternatives is an engineering document.

TDD Template

The following template scales across all three tracks. Track A addresses sections 1-6 in 2 pages. Track B addresses all sections in 5 pages. Track C addresses all sections in 10 pages with deeper analysis.

@dataclass
class TechnicalDesignDocument:
    """Template for the capstone technical design document.

    Structures the document into sections with guidance on content
    and expected length for each project track.

    Attributes:
        title: System name.
        author: Author name(s).
        date: Document creation date.
        track: Project track (A, B, or C).
        sections: Ordered list of document sections.
    """
    title: str
    author: str
    date: str
    track: Track
    sections: List["TDDSection"] = field(default_factory=list)

    def __post_init__(self):
        if not self.sections:
            self.sections = self._default_sections()

    def _default_sections(self) -> List["TDDSection"]:
        """Generate default sections based on track level."""
        all_sections = [
            TDDSection(
                number=1,
                title="Problem Statement and Business Context",
                guidance=(
                    "What problem does this system solve? Who are the users? "
                    "What is the business impact of getting it right vs. wrong? "
                    "What are the key constraints (latency, cost, regulatory)?"
                ),
                min_track=Track.A_MINIMAL,
            ),
            TDDSection(
                number=2,
                title="System Architecture",
                guidance=(
                    "Architecture diagram. Component inventory. Data flow. "
                    "Latency budget allocation. The Two Loops (inner/outer). "
                    "What happens when each component fails?"
                ),
                min_track=Track.A_MINIMAL,
            ),
            TDDSection(
                number=3,
                title="Model Architecture and Training",
                guidance=(
                    "Which models, why, and how they interact. Training data. "
                    "Hyperparameter selection. Offline evaluation results. "
                    "Comparison to baselines."
                ),
                min_track=Track.A_MINIMAL,
            ),
            TDDSection(
                number=4,
                title="Data Pipeline and Feature Engineering",
                guidance=(
                    "Feature store design. Batch vs. streaming features. "
                    "Point-in-time correctness. Schema and data contracts. "
                    "Training-serving skew prevention."
                ),
                min_track=Track.A_MINIMAL,
            ),
            TDDSection(
                number=5,
                title="Evaluation Strategy",
                guidance=(
                    "Offline metrics. Online experiment design. Causal impact "
                    "estimation (Track B+). Business metric mapping. How you "
                    "know the system is working vs. just running."
                ),
                min_track=Track.A_MINIMAL,
            ),
            TDDSection(
                number=6,
                title="Architecture Decision Records",
                guidance=(
                    "At least 1 ADR (Track A), 3 (Track B), or 5 (Track C). "
                    "Each ADR: context, decision, alternatives considered, "
                    "consequences, and status."
                ),
                min_track=Track.A_MINIMAL,
            ),
            TDDSection(
                number=7,
                title="Fairness and Responsible AI",
                guidance=(
                    "Fairness criteria selection (and why). Audit results. "
                    "Mitigation strategy. Monitoring plan. Intersectional "
                    "analysis (Track C). Who made the ethical choices?"
                ),
                min_track=Track.B_STANDARD,
            ),
            TDDSection(
                number=8,
                title="Deployment and Operations",
                guidance=(
                    "Deployment pipeline. Canary strategy. Rollback procedure. "
                    "Monitoring dashboard design. Alerting and escalation. "
                    "Runbooks for top 3 failure modes."
                ),
                min_track=Track.B_STANDARD,
            ),
            TDDSection(
                number=9,
                title="Uncertainty and Exploration",
                guidance=(
                    "Uncertainty quantification method. Calibration results. "
                    "Conformal prediction coverage. Thompson sampling design. "
                    "Exploration-exploitation tradeoff analysis."
                ),
                min_track=Track.C_FULL,
            ),
            TDDSection(
                number=10,
                title="Technical Roadmap",
                guidance=(
                    "Prioritized improvements for next 6 months. Sequencing "
                    "and dependencies. Resource estimates. Expected impact. "
                    "Build-vs-buy decisions for each improvement."
                ),
                min_track=Track.B_STANDARD,
            ),
            TDDSection(
                number=11,
                title="Cost Analysis and TCO",
                guidance=(
                    "Infrastructure cost breakdown. Training cost. Serving "
                    "cost. ROI analysis. Cost optimization opportunities. "
                    "Build-vs-buy economic comparison."
                ),
                min_track=Track.C_FULL,
            ),
        ]

        return [
            s for s in all_sections
            if s.is_included_in_track(self.track)
        ]

    def table_of_contents(self) -> str:
        """Generate a table of contents.

        Returns:
            Formatted table of contents string.
        """
        lines = [f"Technical Design Document: {self.title}", ""]
        for section in self.sections:
            lines.append(
                f"  {section.number}. {section.title}"
            )
        return "\n".join(lines)


@dataclass
class TDDSection:
    """A section of the technical design document.

    Attributes:
        number: Section number.
        title: Section title.
        guidance: Writing guidance for this section.
        min_track: Minimum track that includes this section.
    """
    number: int
    title: str
    guidance: str
    min_track: Track = Track.A_MINIMAL

    def is_included_in_track(self, target_track: Track) -> bool:
        """Check if this section is included in the given track."""
        ordering = {Track.A_MINIMAL: 0, Track.B_STANDARD: 1, Track.C_FULL: 2}
        return ordering[target_track] >= ordering[self.min_track]


# Generate TDD outlines for each track
for t in [Track.A_MINIMAL, Track.B_STANDARD, Track.C_FULL]:
    tdd = TechnicalDesignDocument(
        title="StreamRec Recommendation System",
        author="Data Science Team",
        date="2026-03-25",
        track=t,
    )
    print(tdd.table_of_contents())
    print(f"  [{len(tdd.sections)} sections]\n")
Technical Design Document: StreamRec Recommendation System
  1. Problem Statement and Business Context
  2. System Architecture
  3. Model Architecture and Training
  4. Data Pipeline and Feature Engineering
  5. Evaluation Strategy
  6. Architecture Decision Records
  [6 sections]

Technical Design Document: StreamRec Recommendation System
  1. Problem Statement and Business Context
  2. System Architecture
  3. Model Architecture and Training
  4. Data Pipeline and Feature Engineering
  5. Evaluation Strategy
  6. Architecture Decision Records
  7. Fairness and Responsible AI
  8. Deployment and Operations
  10. Technical Roadmap
  [9 sections]

Technical Design Document: StreamRec Recommendation System
  1. Problem Statement and Business Context
  2. System Architecture
  3. Model Architecture and Training
  4. Data Pipeline and Feature Engineering
  5. Evaluation Strategy
  6. Architecture Decision Records
  7. Fairness and Responsible AI
  8. Deployment and Operations
  9. Uncertainty and Exploration
  10. Technical Roadmap
  11. Cost Analysis and TCO
  [11 sections]

36.4 Architecture Decision Records

An Architecture Decision Record (ADR) documents a significant architectural choice: the context that motivated it, the decision itself, the alternatives considered, and the consequences — including technical debt created or deferred. ADRs are the institutional memory of engineering decisions. Without them, future engineers (including your future self) will look at the architecture, wonder "why is it this way?", and either accept it without understanding or change it without understanding — both of which are dangerous.

The ADR format used in this capstone follows Nygard (2011):

@dataclass
class ADR:
    """Architecture Decision Record following Nygard format.

    Documents a single architectural decision with its context,
    alternatives, and consequences.

    Attributes:
        number: ADR sequence number.
        title: Short descriptive title.
        status: Current status (proposed/accepted/deprecated/superseded).
        context: The situation that motivates the decision.
        decision: The architectural choice made.
        alternatives: Other options that were considered.
        consequences: What follows from this decision (positive and negative).
        date: When the decision was made.
    """
    number: int
    title: str
    status: str
    context: str
    decision: str
    alternatives: List["ADRAlternative"]
    consequences: List[str]
    date: str

    def render(self) -> str:
        """Render the ADR as a formatted string.

        Returns:
            Human-readable ADR document.
        """
        lines = [
            f"# ADR-{self.number:03d}: {self.title}",
            f"**Status:** {self.status}",
            f"**Date:** {self.date}",
            "",
            "## Context",
            self.context,
            "",
            "## Decision",
            self.decision,
            "",
            "## Alternatives Considered",
        ]

        for alt in self.alternatives:
            lines.append(f"\n### {alt.name}")
            lines.append(f"**Pros:** {alt.pros}")
            lines.append(f"**Cons:** {alt.cons}")
            lines.append(f"**Why not chosen:** {alt.rejection_reason}")

        lines.append("\n## Consequences")
        for consequence in self.consequences:
            lines.append(f"- {consequence}")

        return "\n".join(lines)


@dataclass
class ADRAlternative:
    """An alternative considered in an ADR.

    Attributes:
        name: Name of the alternative approach.
        pros: Advantages of this alternative.
        cons: Disadvantages of this alternative.
        rejection_reason: Why this alternative was not chosen.
    """
    name: str
    pros: str
    cons: str
    rejection_reason: str


# Example ADR: Retrieval model choice
adr_001 = ADR(
    number=1,
    title="Two-Tower Model as Primary Retrieval",
    status="Accepted",
    context=(
        "StreamRec requires a retrieval stage that narrows 200,000 items to "
        "~500 candidates within a 30ms latency budget. The retrieval model "
        "must support real-time user embedding updates (streaming features) "
        "and serve 50M MAU at peak load (~10K QPS). Three candidate approaches "
        "exist from previous chapters: matrix factorization with ALS (Ch. 1), "
        "two-tower neural model with FAISS (Ch. 13), and LightGCN (Ch. 14)."
    ),
    decision=(
        "Use the two-tower neural model (Ch. 13) as the primary retrieval "
        "source. User and item towers produce 128-dimensional embeddings. "
        "Item embeddings are pre-computed and indexed in a FAISS IVF4096,PQ32 "
        "index. User embeddings are computed at request time from batch and "
        "streaming features. ANN search returns top-500 candidates."
    ),
    alternatives=[
        ADRAlternative(
            name="Matrix Factorization with ALS (Ch. 1)",
            pros=(
                "Simple, well-understood, fast training, minimal "
                "infrastructure. Strong cold-start with ALS imputation."
            ),
            cons=(
                "Cannot incorporate side features (content, context). "
                "User embedding requires full matrix re-factorization "
                "to update — no real-time adaptation."
            ),
            rejection_reason=(
                "Inability to incorporate streaming user features "
                "(last 5 interactions) is disqualifying for real-time "
                "personalization. MF serves as the offline evaluation "
                "baseline, not the production model."
            ),
        ),
        ADRAlternative(
            name="LightGCN (Ch. 14)",
            pros=(
                "Captures higher-order collaborative signals via graph "
                "propagation. Better recall on long-tail items in offline "
                "evaluation (Recall@20: 0.14 vs. 0.12 for two-tower)."
            ),
            cons=(
                "Graph construction requires the full interaction graph "
                "to be materialized. Embedding updates require graph "
                "re-propagation — latency-incompatible with real-time "
                "feature updates. Operationally complex (graph storage, "
                "incremental graph updates)."
            ),
            rejection_reason=(
                "Operational complexity of maintaining the interaction "
                "graph and re-computing embeddings exceeds infrastructure "
                "maturity for Track A and B. Adopted as secondary "
                "retrieval source in Track C, where the infrastructure "
                "investment is justified by the long-tail recall gain."
            ),
        ),
    ],
    consequences=[
        "User embeddings can be updated in real-time by re-computing the "
        "user tower with streaming features — enabling session-aware retrieval.",
        "FAISS index requires periodic rebuilding (daily) as new items are "
        "added. Items published within the last rebuild window have no "
        "embedding and must be handled by a fallback (popularity or "
        "content-based retrieval).",
        "The two-tower architecture decouples user and item representations, "
        "preventing the model from learning fine-grained user-item "
        "interactions at the retrieval stage. This limitation is acceptable "
        "because the ranking stage (MLP or transformer) captures interactions.",
        "Technical debt: FAISS index rebuild latency (~45 minutes for 200K "
        "items) creates a window where new items are invisible. Tracked as "
        "roadmap item R-003 (incremental index updates).",
    ],
    date="2026-03-25",
)

print(adr_001.render())
# ADR-001: Two-Tower Model as Primary Retrieval
**Status:** Accepted
**Date:** 2026-03-25

## Context
StreamRec requires a retrieval stage that narrows 200,000 items to ~500 candidates within a 30ms latency budget. The retrieval model must support real-time user embedding updates (streaming features) and serve 50M MAU at peak load (~10K QPS). Three candidate approaches exist from previous chapters: matrix factorization with ALS (Ch. 1), two-tower neural model with FAISS (Ch. 13), and LightGCN (Ch. 14).

## Decision
Use the two-tower neural model (Ch. 13) as the primary retrieval source. User and item towers produce 128-dimensional embeddings. Item embeddings are pre-computed and indexed in a FAISS IVF4096,PQ32 index. User embeddings are computed at request time from batch and streaming features. ANN search returns top-500 candidates.

## Alternatives Considered

### Matrix Factorization with ALS (Ch. 1)
**Pros:** Simple, well-understood, fast training, minimal infrastructure. Strong cold-start with ALS imputation.
**Cons:** Cannot incorporate side features (content, context). User embedding requires full matrix re-factorization to update — no real-time adaptation.
**Why not chosen:** Inability to incorporate streaming user features (last 5 interactions) is disqualifying for real-time personalization. MF serves as the offline evaluation baseline, not the production model.

### LightGCN (Ch. 14)
**Pros:** Captures higher-order collaborative signals via graph propagation. Better recall on long-tail items in offline evaluation (Recall@20: 0.14 vs. 0.12 for two-tower).
**Cons:** Graph construction requires the full interaction graph to be materialized. Embedding updates require graph re-propagation — latency-incompatible with real-time feature updates. Operationally complex (graph storage, incremental graph updates).
**Why not chosen:** Operational complexity of maintaining the interaction graph and re-computing embeddings exceeds infrastructure maturity for Track A and B. Adopted as secondary retrieval source in Track C, where the infrastructure investment is justified by the long-tail recall gain.

## Consequences
- User embeddings can be updated in real-time by re-computing the user tower with streaming features — enabling session-aware retrieval.
- FAISS index requires periodic rebuilding (daily) as new items are added. Items published within the last rebuild window have no embedding and must be handled by a fallback (popularity or content-based retrieval).
- The two-tower architecture decouples user and item representations, preventing the model from learning fine-grained user-item interactions at the retrieval stage. This limitation is acceptable because the ranking stage (MLP or transformer) captures interactions.
- Technical debt: FAISS index rebuild latency (~45 minutes for 200K items) creates a window where new items are invisible. Tracked as roadmap item R-003 (incremental index updates).

The strongest ADRs are the ones you revisit. An ADR marked "Superseded by ADR-007" is a sign of a healthy engineering culture — it means the team re-evaluated a decision when the context changed. An ADR collection where nothing has been superseded is a sign that either the context never changed (unlikely) or the team does not revisit past decisions (concerning).
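Superseding an ADR is a one-line status change, not a rewrite: the old record stays in the collection as history. A minimal sketch of that lifecycle — the `_ADR` stand-in and `supersede` helper below are illustrative, not part of the ADR class defined above:

```python
from dataclasses import dataclass


@dataclass
class _ADR:
    """Minimal stand-in for the full ADR class (number and status only)."""
    number: int
    status: str = "Proposed"


def supersede(old: _ADR, new: _ADR) -> None:
    """Mark `old` as superseded by `new`, per the Nygard status lifecycle."""
    old.status = f"Superseded by ADR-{new.number:03d}"
    new.status = "Accepted"


retrieval_v1 = _ADR(number=1, status="Accepted")
retrieval_v2 = _ADR(number=7)
supersede(retrieval_v1, retrieval_v2)
print(retrieval_v1.status)  # Superseded by ADR-007
```

The superseded record is never deleted: a reader who finds ADR-001 sees both why the decision was made and where its replacement lives.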

ADR Topics for Each Track

| Track | Required ADRs |
| --- | --- |
| A | 1: Retrieval model choice |
| B | 1: Retrieval model choice. 2: Ranking model (MLP vs. transformer). 3: Evaluation strategy (offline-only vs. causal) |
| C | 1–3, plus 4: Exploration policy (Thompson sampling vs. epsilon-greedy). 5: Privacy budget (ε selection and tradeoff). 6 (optional): Build-vs-buy for feature store |

36.5 End-to-End Evaluation: Beyond Hit@10

The most common failure mode in capstone projects is evaluating the system on the same metrics used to evaluate the model. Hit@10 and NDCG@10 tell you whether the retrieval and ranking components produce good candidates. They do not tell you whether the system works.

A complete evaluation strategy operates at three levels: model evaluation (does each component perform well?), system evaluation (do the components work together?), and business evaluation (does the system achieve its intended purpose?).

from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple
import numpy as np

# Metrics for which smaller values are better. For these, the threshold
# is a maximum; for every other metric it is a minimum.
LOWER_IS_BETTER = frozenset({
    "p50_latency_ms",
    "p99_latency_ms",
    "training_serving_skew_psi",
    "user_hit10_disparity_max",
})


@dataclass
class EvaluationLevel:
    """One level of the three-level evaluation framework.

    Attributes:
        name: Level name (model, system, business).
        metrics: Dictionary of metric name to measured value.
        thresholds: Dictionary of metric name to threshold value
            (a maximum for metrics in LOWER_IS_BETTER, a minimum otherwise).
        description: What this level measures.
    """
    name: str
    metrics: Dict[str, float] = field(default_factory=dict)
    thresholds: Dict[str, float] = field(default_factory=dict)
    description: str = ""

    def passing_metrics(self) -> Dict[str, bool]:
        """Check which metrics pass their thresholds.

        Returns:
            Dictionary of metric name to pass/fail boolean.
        """
        result: Dict[str, bool] = {}
        for metric, threshold in self.thresholds.items():
            value = self.metrics.get(metric, 0.0)
            if metric in LOWER_IS_BETTER:
                result[metric] = value <= threshold
            else:
                result[metric] = value >= threshold
        return result

    def all_passing(self) -> bool:
        """Check if all metrics pass.

        Returns:
            True if every metric meets its threshold.
        """
        return all(self.passing_metrics().values())


@dataclass
class CapstoneEvaluation:
    """Three-level evaluation framework for the capstone system.

    Organizes model-level, system-level, and business-level metrics
    with pass/fail thresholds and a summary report.

    Attributes:
        levels: Dictionary of level name to EvaluationLevel.
        track: Project track (determines which levels are required).
    """
    levels: Dict[str, EvaluationLevel] = field(default_factory=dict)
    track: Track = Track.A_MINIMAL

    def overall_status(self) -> str:
        """Compute overall evaluation status.

        Returns:
            'PASS' if all levels pass, 'PARTIAL' if some pass,
            'FAIL' if none pass.
        """
        passing = [
            level.all_passing()
            for level in self.levels.values()
        ]
        if all(passing):
            return "PASS"
        elif any(passing):
            return "PARTIAL"
        return "FAIL"

    def report(self) -> str:
        """Generate a human-readable evaluation report.

        Returns:
            Formatted evaluation report string.
        """
        lines = [
            f"StreamRec Evaluation Report — Track {self.track.value}",
            f"{'=' * 55}",
            f"Overall Status: {self.overall_status()}",
            "",
        ]

        for level in self.levels.values():
            lines.append(f"--- {level.name} ---")
            lines.append(f"  {level.description}")
            passing = level.passing_metrics()

            for metric, value in level.metrics.items():
                threshold = level.thresholds.get(metric, None)
                status = "PASS" if passing.get(metric, True) else "FAIL"
                threshold_str = (
                    f" (threshold: {threshold:.4f})"
                    if threshold is not None else ""
                )
                lines.append(
                    f"  [{status}] {metric}: {value:.4f}{threshold_str}"
                )
            lines.append("")

        return "\n".join(lines)


def build_track_b_evaluation() -> CapstoneEvaluation:
    """Build an example Track B evaluation with realistic metrics.

    Returns:
        CapstoneEvaluation populated with Track B metrics.
    """
    evaluation = CapstoneEvaluation(track=Track.B_STANDARD)

    # Level 1: Model Evaluation
    evaluation.levels["model"] = EvaluationLevel(
        name="Model Evaluation",
        description="Individual component quality on held-out data.",
        metrics={
            "retrieval_recall_500": 0.62,
            "ranking_hit_10": 0.22,
            "ranking_ndcg_10": 0.13,
            "ranking_auc": 0.78,
            "mf_baseline_hit_10": 0.15,
            "popularity_baseline_hit_10": 0.08,
        },
        thresholds={
            "retrieval_recall_500": 0.50,
            "ranking_hit_10": 0.18,
            "ranking_ndcg_10": 0.10,
            "ranking_auc": 0.70,
        },
    )

    # Level 2: System Evaluation
    evaluation.levels["system"] = EvaluationLevel(
        name="System Evaluation",
        description="End-to-end system behavior under production conditions.",
        metrics={
            "p50_latency_ms": 82.0,
            "p99_latency_ms": 178.0,
            "throughput_qps": 12500.0,
            "feature_store_hit_rate": 0.994,
            "training_serving_skew_psi": 0.03,
            "causal_ate_engagement": 0.041,
            "ate_95ci_lower": 0.028,
            "ate_95ci_upper": 0.054,
        },
        thresholds={
            "p99_latency_ms": 200.0,
            "feature_store_hit_rate": 0.99,
            "training_serving_skew_psi": 0.10,
        },
    )

    # Level 3: Business Evaluation
    evaluation.levels["business"] = EvaluationLevel(
        name="Business Evaluation",
        description="Business impact and responsible AI metrics.",
        metrics={
            "ctr_improvement_vs_baseline": 0.034,
            "completion_rate_improvement": 0.021,
            "creator_exposure_equity_ratio": 0.71,
            "user_hit10_disparity_max": 0.06,
            "dp_sgd_accuracy_loss": 0.0,  # Not applicable for Track B
        },
        thresholds={
            "ctr_improvement_vs_baseline": 0.01,
            "creator_exposure_equity_ratio": 0.60,
            "user_hit10_disparity_max": 0.10,
        },
    )

    return evaluation


eval_report = build_track_b_evaluation()
print(eval_report.report())
StreamRec Evaluation Report — Track B
=======================================================
Overall Status: PASS

--- Model Evaluation ---
  Individual component quality on held-out data.
  [PASS] retrieval_recall_500: 0.6200 (threshold: 0.5000)
  [PASS] ranking_hit_10: 0.2200 (threshold: 0.1800)
  [PASS] ranking_ndcg_10: 0.1300 (threshold: 0.1000)
  [PASS] ranking_auc: 0.7800 (threshold: 0.7000)
  [PASS] mf_baseline_hit_10: 0.1500
  [PASS] popularity_baseline_hit_10: 0.0800

--- System Evaluation ---
  End-to-end system behavior under production conditions.
  [PASS] p50_latency_ms: 82.0000
  [PASS] p99_latency_ms: 178.0000 (threshold: 200.0000)
  [PASS] throughput_qps: 12500.0000
  [PASS] feature_store_hit_rate: 0.9940 (threshold: 0.9900)
  [PASS] training_serving_skew_psi: 0.0300 (threshold: 0.1000)
  [PASS] causal_ate_engagement: 0.0410
  [PASS] ate_95ci_lower: 0.0280
  [PASS] ate_95ci_upper: 0.0540

--- Business Evaluation ---
  Business impact and responsible AI metrics.
  [PASS] ctr_improvement_vs_baseline: 0.0340 (threshold: 0.0100)
  [PASS] completion_rate_improvement: 0.0210
  [PASS] creator_exposure_equity_ratio: 0.7100 (threshold: 0.6000)
  [PASS] user_hit10_disparity_max: 0.0600 (threshold: 0.1000)
  [PASS] dp_sgd_accuracy_loss: 0.0000

The key insight is the causal ATE in the system evaluation. The system produces recommendations that cause a 4.1 percentage point increase in engagement (95% CI: [2.8%, 5.4%]) compared to random recommendations. This is the metric that matters most — it tells you whether the system is actually helping users, not just predicting what they would have clicked on anyway. (Recall from Chapter 15 that offline predictive metrics can be misleading when the model reinforces existing patterns rather than causing new behavior.)
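Under a randomized holdout (recommended slates vs. random slates), the ATE reduces to a difference in engagement means with a normal-approximation confidence interval. A minimal sketch on simulated data — the engagement rates and sample sizes are illustrative, and the production estimate above uses a doubly robust estimator rather than this simple contrast:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated randomized experiment: per-impression engagement indicator
# under recommended slates (treated) vs. random slates (control).
treated = rng.binomial(1, 0.30, size=20_000)
control = rng.binomial(1, 0.26, size=20_000)

# Difference-in-means ATE with a normal-approximation 95% CI.
ate = treated.mean() - control.mean()
se = np.sqrt(
    treated.var(ddof=1) / len(treated)
    + control.var(ddof=1) / len(control)
)
lo, hi = ate - 1.96 * se, ate + 1.96 * se
print(f"ATE: {ate:.3f}, 95% CI: [{lo:.3f}, {hi:.3f}]")
```

When exposure is not randomized, this contrast is confounded and the doubly robust machinery of Chapter 15 (outcome model plus clipped propensities) is required.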


36.6 Stakeholder Presentation

A system that cannot be explained to its stakeholders is a system that will not be funded, maintained, or trusted. Stakeholder presentation is not a soft skill appended to the technical work — it is a core engineering deliverable.

Audience-Adapted Communication

The same system requires fundamentally different presentations depending on the audience:

| Audience | Focus | Vocabulary | Key Question |
| --- | --- | --- | --- |
| Engineering leadership | Architecture, latency, reliability, cost | System design terms, SLO/SLI | "Can this system run reliably at scale?" |
| Product management | User impact, experiment results, roadmap | Business metrics, CTR, conversion | "Does this improve user experience?" |
| Executive team | Business ROI, competitive advantage, risk | Revenue, retention, strategic value | "What is the return on this investment?" |
| Legal/compliance | Fairness, privacy, regulatory exposure | Disparate impact, ECOA, GDPR, DP | "What is our legal and reputational risk?" |
| Data science peers | Model architecture, evaluation rigor, causal impact | NDCG, ATE, CATE, conformal coverage | "Is the methodology sound?" |

A common mistake is presenting the model architecture to executives ("we use a transformer with 4 attention heads") or the business ROI to data science peers ("CTR increased 3.4%"). Both audiences leave unsatisfied — the executive because they do not know what a transformer is, the peer because CTR alone is insufficient evidence of causal impact.

The Three-Slide Rule

For executive audiences, structure your presentation around three slides:

Slide 1: What the system does and why it matters. One sentence on what it does. One sentence on the business problem it solves. One number for the business impact ($X revenue, Y% retention improvement). No architecture diagrams.

Slide 2: How we know it works. The key metric, compared to the previous system or no system. Confidence interval or statistical significance. One sentence on fairness ("we audited for creator exposure equity and user quality disparity — both within acceptable thresholds"). One sentence on risk ("we deploy with automated rollback and 24/7 monitoring").

Slide 3: What comes next. The top 3 roadmap items, each with expected impact and timeline. A cost estimate for the next phase. The ask: continued funding, additional headcount, infrastructure investment.

Everything else goes in the appendix.
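The three-slide structure can be sketched as a simple template. The helper and the numbers below are illustrative, not part of the StreamRec codebase:

```python
from typing import List


def three_slide_outline(
    impact: str,
    evidence: str,
    next_steps: List[str],
    ask: str,
) -> str:
    """Render the executive three-slide outline as plain text."""
    return "\n".join([
        f"Slide 1. What it does and why it matters: {impact}",
        f"Slide 2. How we know it works: {evidence}",
        f"Slide 3. What comes next: {'; '.join(next_steps)}. Ask: {ask}",
    ])


print(three_slide_outline(
    impact="StreamRec improves CTR by 3.4%, ~$4.2M annual revenue impact.",
    evidence="+3.4% CTR vs. baseline, fairness audited, automated rollback.",
    next_steps=[
        "streaming features",
        "canary deployment",
        "incremental index updates",
    ],
    ask="continued funding for two engineers next quarter.",
))
```

Forcing the narrative through this template before the meeting is a useful discipline: anything that does not fit one of the three slides belongs in the appendix.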

@dataclass
class StakeholderPresentation:
    """Framework for audience-adapted system presentation.

    Generates tailored talking points, metrics, and framing
    for different stakeholder audiences.

    Attributes:
        system_name: Name of the system being presented.
        metrics: Dictionary of all measured metrics.
        audience: Target audience for this presentation.
    """
    system_name: str
    metrics: Dict[str, float]
    audience: str

    def key_metrics(self) -> Dict[str, float]:
        """Select the metrics most relevant to this audience.

        Returns:
            Filtered dictionary of metrics for the target audience.
        """
        audience_metrics = {
            "executive": [
                "ctr_improvement_vs_baseline",
                "estimated_annual_revenue_impact",
                "creator_exposure_equity_ratio",
                "system_uptime",
            ],
            "engineering": [
                "p99_latency_ms",
                "throughput_qps",
                "training_serving_skew_psi",
                "feature_store_hit_rate",
                "rollback_time_seconds",
            ],
            "product": [
                "ctr_improvement_vs_baseline",
                "completion_rate_improvement",
                "user_hit10_disparity_max",
                "cold_start_coverage",
            ],
            "data_science": [
                "ranking_hit_10",
                "ranking_ndcg_10",
                "causal_ate_engagement",
                "ate_95ci_lower",
                "ate_95ci_upper",
                "creator_exposure_equity_ratio",
            ],
            "legal": [
                "creator_exposure_equity_ratio",
                "user_hit10_disparity_max",
                "dp_epsilon",
                "data_retention_days",
            ],
        }

        relevant = audience_metrics.get(self.audience, [])
        return {
            k: v for k, v in self.metrics.items()
            if k in relevant
        }

    def generate_talking_points(self) -> List[str]:
        """Generate audience-appropriate talking points.

        Returns:
            List of talking point strings.
        """
        km = self.key_metrics()

        if self.audience == "executive":
            return [
                f"{self.system_name} improves click-through rate by "
                f"{km.get('ctr_improvement_vs_baseline', 0):.1%}, "
                f"estimated at ${km.get('estimated_annual_revenue_impact', 0):,.0f} "
                f"annual revenue impact.",
                f"Creator exposure equity is {km.get('creator_exposure_equity_ratio', 0):.0%} "
                f"— above our 60% threshold, reducing platform risk from "
                f"creator attrition.",
                f"System uptime is {km.get('system_uptime', 0):.2%} with "
                f"automated rollback and 24/7 monitoring.",
            ]
        elif self.audience == "data_science":
            ate = km.get("causal_ate_engagement", 0)
            ci_lo = km.get("ate_95ci_lower", 0)
            ci_hi = km.get("ate_95ci_upper", 0)
            return [
                f"Hit@10: {km.get('ranking_hit_10', 0):.3f}, "
                f"NDCG@10: {km.get('ranking_ndcg_10', 0):.3f} "
                f"on held-out test set.",
                f"Causal ATE of recommendation on engagement: "
                f"{ate:.3f} (95% CI: [{ci_lo:.3f}, {ci_hi:.3f}]), "
                f"estimated via doubly robust method with "
                f"propensity clipping at [0.05, 0.95].",
                f"Creator exposure equity ratio: "
                f"{km.get('creator_exposure_equity_ratio', 0):.2f}. "
                f"Worst-served group: new Arabic-language creators at 0.11.",
            ]
        return [f"Metrics for {self.audience}: {km}"]


# Example: Generate talking points for two audiences
all_metrics = {
    "ctr_improvement_vs_baseline": 0.034,
    "estimated_annual_revenue_impact": 4_200_000,
    "creator_exposure_equity_ratio": 0.71,
    "system_uptime": 0.9997,
    "ranking_hit_10": 0.22,
    "ranking_ndcg_10": 0.13,
    "causal_ate_engagement": 0.041,
    "ate_95ci_lower": 0.028,
    "ate_95ci_upper": 0.054,
    "p99_latency_ms": 178.0,
    "throughput_qps": 12500.0,
}

for audience in ["executive", "data_science"]:
    pres = StakeholderPresentation(
        system_name="StreamRec",
        metrics=all_metrics,
        audience=audience,
    )
    print(f"--- {audience.upper()} ---")
    for point in pres.generate_talking_points():
        print(f"  - {point}")
    print()
--- EXECUTIVE ---
  - StreamRec improves click-through rate by 3.4%, estimated at $4,200,000 annual revenue impact.
  - Creator exposure equity is 71% — above our 60% threshold, reducing platform risk from creator attrition.
  - System uptime is 99.97% with automated rollback and 24/7 monitoring.

--- DATA_SCIENCE ---
  - Hit@10: 0.220, NDCG@10: 0.130 on held-out test set.
  - Causal ATE of recommendation on engagement: 0.041 (95% CI: [0.028, 0.054]), estimated via doubly robust method with propensity clipping at [0.05, 0.95].
  - Creator exposure equity ratio: 0.71. Worst-served group: new Arabic-language creators at 0.11.

36.7 The Technical Roadmap

A system without a roadmap is a system that will stagnate. The roadmap documents what you would build next, why, in what order, and at what cost. It transforms your capstone from a one-time deliverable into the foundation for ongoing improvement.

Roadmap Item Structure

Each roadmap item specifies the improvement, the expected impact, the estimated effort, the dependencies, and the build-vs-buy decision.

@dataclass
class RoadmapItem:
    """A single item on the technical roadmap.

    Attributes:
        id: Unique identifier (e.g., "R-001").
        title: Short descriptive title.
        description: What the improvement is and why it matters.
        expected_impact: Quantified expected improvement.
        effort_weeks: Estimated engineering effort in person-weeks.
        priority: Priority level (P0 critical, P1 high, P2 medium, P3 low).
        dependencies: List of roadmap item IDs that must complete first.
        build_vs_buy: Whether to build in-house or use an external solution.
        quarter: Target quarter for completion.
        category: Technical category (model, infra, fairness, monitoring, etc.).
    """
    id: str
    title: str
    description: str
    expected_impact: str
    effort_weeks: float
    priority: str
    dependencies: List[str] = field(default_factory=list)
    build_vs_buy: str = "build"
    quarter: str = ""
    category: str = ""


@dataclass
class TechnicalRoadmap:
    """Six-month technical roadmap for the recommendation system.

    Organizes improvement items by priority and quarter, tracks
    dependencies, and computes resource requirements.

    Attributes:
        items: List of roadmap items.
        total_engineering_weeks: Available person-weeks per quarter.
    """
    items: List[RoadmapItem] = field(default_factory=list)
    total_engineering_weeks: float = 26.0  # 2 engineers x 13 weeks/quarter

    def add(self, item: RoadmapItem) -> None:
        """Add a roadmap item."""
        self.items.append(item)

    def by_quarter(self) -> Dict[str, List[RoadmapItem]]:
        """Group items by target quarter.

        Returns:
            Dictionary mapping quarter to its roadmap items.
        """
        grouped: Dict[str, List[RoadmapItem]] = {}
        for item in self.items:
            q = item.quarter or "Unscheduled"
            grouped.setdefault(q, []).append(item)
        return grouped

    def feasibility_check(self) -> Dict[str, Tuple[float, float, bool]]:
        """Check if each quarter's items fit within resource budget.

        Returns:
            Dictionary mapping quarter to (total_weeks, budget, feasible).
        """
        result = {}
        for quarter, items in self.by_quarter().items():
            total = sum(i.effort_weeks for i in items)
            result[quarter] = (
                total,
                self.total_engineering_weeks,
                total <= self.total_engineering_weeks,
            )
        return result

    def dependency_order(self) -> List[RoadmapItem]:
        """Return items in dependency-respecting order.

        Returns:
            Topologically sorted list of roadmap items.
        """
        id_to_item = {item.id: item for item in self.items}
        visited = set()
        order = []

        def visit(item_id: str) -> None:
            if item_id in visited:
                return
            visited.add(item_id)
            item = id_to_item.get(item_id)
            if item is None:
                return
            for dep in item.dependencies:
                visit(dep)
            order.append(id_to_item[item_id])

        for item in self.items:
            visit(item.id)

        return order

    def summary(self) -> str:
        """Generate a formatted roadmap summary.

        Returns:
            Human-readable roadmap string.
        """
        lines = [
            "StreamRec Technical Roadmap (6 Months)",
            "=" * 50,
            "",
        ]

        feasibility = self.feasibility_check()
        for quarter, items in self.by_quarter().items():
            total, budget, feasible = feasibility[quarter]
            status = "OK" if feasible else "OVER CAPACITY"
            lines.append(
                f"--- {quarter} ({total:.0f} / {budget:.0f} "
                f"person-weeks) [{status}] ---"
            )
            for item in sorted(items, key=lambda x: x.priority):
                deps = (
                    f" (after: {', '.join(item.dependencies)})"
                    if item.dependencies else ""
                )
                lines.append(
                    f"  [{item.priority}] {item.id}: {item.title} "
                    f"({item.effort_weeks:.0f}w, {item.build_vs_buy})"
                    f"{deps}"
                )
            lines.append("")

        return "\n".join(lines)


# Build example roadmap
roadmap = TechnicalRoadmap(total_engineering_weeks=26.0)

roadmap.add(RoadmapItem(
    id="R-001",
    title="Streaming feature pipeline",
    description=(
        "Replace batch-only feature updates with Kafka-based streaming "
        "features for user activity (last N interactions). Enables "
        "session-aware retrieval without waiting for daily batch jobs."
    ),
    expected_impact="Hit@10 +0.02 from real-time session features",
    effort_weeks=8.0,
    priority="P0",
    quarter="Q3 2026",
    category="infra",
    build_vs_buy="build",
))

roadmap.add(RoadmapItem(
    id="R-002",
    title="Canary deployment with automated rollback",
    description=(
        "Implement staged deployment (10% canary, 3-day evaluation, "
        "automated rollback on metric degradation). Currently deploying "
        "with manual validation."
    ),
    expected_impact="Reduce deployment incident rate from ~1/month to ~1/quarter",
    effort_weeks=6.0,
    priority="P0",
    quarter="Q3 2026",
    category="infra",
    build_vs_buy="build",
))

roadmap.add(RoadmapItem(
    id="R-003",
    title="Incremental FAISS index updates",
    description=(
        "Replace daily full index rebuilds with incremental updates. "
        "New items become retrievable within minutes of publication."
    ),
    expected_impact="New item visibility: 24h → 15min. Long-tail Recall@500 +0.03",
    effort_weeks=4.0,
    priority="P1",
    quarter="Q3 2026",
    category="model",
    build_vs_buy="build",
    dependencies=["R-001"],
))

roadmap.add(RoadmapItem(
    id="R-004",
    title="Continuous fairness monitoring",
    description=(
        "Automate the fairness audit (creator exposure equity, user "
        "quality disparity) at every retraining cycle. Alert on "
        "threshold violations."
    ),
    expected_impact="Detect fairness regression within 1 retraining cycle vs. quarterly manual audit",
    effort_weeks=5.0,
    priority="P1",
    quarter="Q3 2026",
    category="fairness",
    build_vs_buy="build",
    dependencies=["R-002"],
))

roadmap.add(RoadmapItem(
    id="R-005",
    title="Transformer ranker upgrade",
    description=(
        "Replace MLP ranker with session-aware transformer ranker "
        "(Chapter 10). Requires streaming features (R-001) for full "
        "benefit."
    ),
    expected_impact="Hit@10 +0.03, NDCG@10 +0.02",
    effort_weeks=10.0,
    priority="P1",
    quarter="Q4 2026",
    category="model",
    build_vs_buy="build",
    dependencies=["R-001"],
))

roadmap.add(RoadmapItem(
    id="R-006",
    title="LightGCN secondary retrieval source",
    description=(
        "Add LightGCN as a second retrieval source for long-tail "
        "item coverage. Merge candidates with the two-tower source "
        "using multi-source agreement scoring."
    ),
    expected_impact="Long-tail Recall@20 +0.04, exposure equity +0.08",
    effort_weeks=8.0,
    priority="P2",
    quarter="Q4 2026",
    category="model",
    build_vs_buy="build",
    dependencies=["R-003", "R-005"],
))

roadmap.add(RoadmapItem(
    id="R-007",
    title="Managed feature store evaluation",
    description=(
        "Evaluate Feast or Tecton as replacement for custom feature "
        "store. Custom store works but maintenance burden is high."
    ),
    expected_impact="Reduce feature store maintenance from 0.5 FTE to ~0.1 FTE",
    effort_weeks=3.0,
    priority="P2",
    quarter="Q4 2026",
    category="infra",
    build_vs_buy="buy (evaluate)",
))

print(roadmap.summary())
StreamRec Technical Roadmap (6 Months)
==================================================

--- Q3 2026 (23 / 26 person-weeks) [OK] ---
  [P0] R-001: Streaming feature pipeline (8w, build)
  [P0] R-002: Canary deployment with automated rollback (6w, build)
  [P1] R-003: Incremental FAISS index updates (4w, build) (after: R-001)
  [P1] R-004: Continuous fairness monitoring (5w, build) (after: R-002)

--- Q4 2026 (21 / 26 person-weeks) [OK] ---
  [P1] R-005: Transformer ranker upgrade (10w, build) (after: R-001)
  [P2] R-006: LightGCN secondary retrieval source (8w, build) (after: R-003, R-005)
  [P2] R-007: Managed feature store evaluation (3w, buy (evaluate))

Both quarters are within the 26 person-week capacity. The dependency ordering ensures that streaming features (R-001) are complete before the transformer ranker (R-005) and incremental index updates (R-003), both of which depend on streaming data availability.
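The ordering logic in `dependency_order` can be isolated into a minimal standalone sketch. The `(id, dependencies)` pairs below are copied from the example roadmap; everything else is a simplified depth-first topological sort:

```python
# Minimal standalone sketch of TechnicalRoadmap.dependency_order():
# a depth-first topological sort. The (id, dependencies) pairs are
# copied from the example roadmap above.
deps = {
    "R-001": [], "R-002": [], "R-003": ["R-001"], "R-004": ["R-002"],
    "R-005": ["R-001"], "R-006": ["R-003", "R-005"], "R-007": [],
}

def topo_order(dep_map):
    """Return ids with every dependency listed before its dependents."""
    visited, order = set(), []

    def visit(item_id):
        if item_id in visited:
            return
        visited.add(item_id)
        for dep in dep_map.get(item_id, []):
            visit(dep)  # dependencies are appended before the item itself
        order.append(item_id)

    for item_id in dep_map:
        visit(item_id)
    return order

order = topo_order(deps)
# Every item appears after all of its dependencies.
assert all(order.index(d) < order.index(i)
           for i, ds in deps.items() for d in ds)
print(order)
```

Note that, like the class method, this sketch does not detect cycles; a cyclic dependency would silently produce a partial ordering rather than raise an error.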

Build vs. Buy

The build-vs-buy decision deserves its own ADR for any component where a commercial or open-source solution exists. The framework:

| Factor | Build | Buy |
| --- | --- | --- |
| Control | Full control over features, performance, and roadmap | Constrained by vendor roadmap |
| Time to value | Weeks to months | Days to weeks |
| Maintenance | Ongoing engineering investment | Vendor maintains; you pay |
| Cost (Year 1) | Engineering time (salary) | License or usage fees |
| Cost (Year 3) | Engineering time + maintenance debt | License fees (usually increasing) |
| Differentiation | Can become competitive advantage | Commodity — competitors use same tool |
| Risk | Scope creep, understaffed | Vendor lock-in, pricing changes |

The rule of thumb: build the components that differentiate your system; buy everything else. For StreamRec, the retrieval and ranking models are differentiating — they encode the platform's unique understanding of users and content. The feature store, pipeline orchestration, monitoring, and serving infrastructure are not differentiating — they need to work reliably, but they do not create competitive advantage. The former justify build; the latter justify buy (or open-source adoption).
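One way to make the framework operational is a simple weighted score per factor. The factor names below mirror the table; the 1-to-5 ratings and the weights are hypothetical numbers for a ranking-model decision, not values prescribed by this chapter:

```python
# Illustrative scoring helper for the build-vs-buy framework above.
# Ratings and weights are hypothetical, chosen to show how a
# differentiating component (the ranking model) tips toward "build".
def build_vs_buy(ratings, weights):
    """Return ('build' or 'buy', score margin) from weighted ratings.

    ratings: factor -> (build_score, buy_score), each 1 (poor) to 5 (good).
    weights: factor -> importance of that factor for this component.
    """
    build = sum(weights[f] * b for f, (b, _) in ratings.items())
    buy = sum(weights[f] * y for f, (_, y) in ratings.items())
    return ("build" if build >= buy else "buy"), build - buy

# For the ranking model, differentiation dominates the weighting.
ratings = {
    "control": (5, 2), "time_to_value": (2, 5),
    "maintenance": (2, 4), "differentiation": (5, 1),
}
weights = {"control": 2, "time_to_value": 1,
           "maintenance": 1, "differentiation": 3}
decision, margin = build_vs_buy(ratings, weights)
print(decision, margin)  # build 13
```

For a commodity component such as the feature store, flipping the differentiation rating and weight would push the same calculation toward "buy".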


36.8 The Retrospective

A capstone that ends with deployment is incomplete. A complete capstone ends with a retrospective — a structured reflection on what worked, what did not, and what you would do differently.

Retrospective Framework

@dataclass
class RetrospectiveItem:
    """A single retrospective observation.

    Attributes:
        category: One of 'worked_well', 'didnt_work', 'would_change',
                  'surprised_by', 'learned'.
        observation: The observation itself.
        impact: How this affected the project (time, quality, scope).
        action: What to do differently next time.
    """
    category: str
    observation: str
    impact: str
    action: str


@dataclass
class CapstoneRetrospective:
    """Structured retrospective for the capstone project.

    Organizes reflections into categories and generates a summary
    suitable for a post-launch review.

    Attributes:
        items: List of retrospective observations.
        total_hours: Total hours spent on the project.
        track: Project track completed.
    """
    items: List[RetrospectiveItem] = field(default_factory=list)
    total_hours: float = 0.0
    track: Track = Track.A_MINIMAL

    def add(self, item: RetrospectiveItem) -> None:
        """Add a retrospective item."""
        self.items.append(item)

    def by_category(self) -> Dict[str, List[RetrospectiveItem]]:
        """Group items by category.

        Returns:
            Dictionary mapping category to its items.
        """
        grouped: Dict[str, List[RetrospectiveItem]] = {}
        for item in self.items:
            grouped.setdefault(item.category, []).append(item)
        return grouped

    def summary(self) -> str:
        """Generate a formatted retrospective summary.

        Returns:
            Human-readable retrospective string.
        """
        lines = [
            f"Capstone Retrospective — Track {self.track.value}",
            f"Total Hours: {self.total_hours:.0f}",
            "=" * 50,
            "",
        ]

        category_labels = {
            "worked_well": "What Worked Well",
            "didnt_work": "What Didn't Work",
            "would_change": "What I Would Change",
            "surprised_by": "What Surprised Me",
            "learned": "Key Learnings",
        }

        for cat, label in category_labels.items():
            items = self.by_category().get(cat, [])
            if items:
                lines.append(f"--- {label} ---")
                for item in items:
                    lines.append(f"  {item.observation}")
                    lines.append(f"    Impact: {item.impact}")
                    lines.append(f"    Action: {item.action}")
                    lines.append("")

        return "\n".join(lines)


# Example retrospective
retro = CapstoneRetrospective(
    total_hours=120,
    track=Track.B_STANDARD,
)

retro.add(RetrospectiveItem(
    category="worked_well",
    observation=(
        "Starting with the MLP ranker (Ch. 6) before attempting the "
        "transformer (Ch. 10) established a reliable baseline and "
        "made debugging the transformer integration much easier."
    ),
    impact="Saved ~15 hours of debugging by having a known-good baseline.",
    action="Always deploy the simplest model first, even if you intend to upgrade.",
))

retro.add(RetrospectiveItem(
    category="didnt_work",
    observation=(
        "Attempting to build the feature store and the training pipeline "
        "simultaneously led to integration conflicts — schema changes in "
        "one broke the other."
    ),
    impact="Lost ~8 hours to integration bugs that would not have occurred sequentially.",
    action=(
        "Build and stabilize the feature store first. Define the schema "
        "as a data contract (Ch. 28). Then build the training pipeline "
        "against that contract."
    ),
))

retro.add(RetrospectiveItem(
    category="surprised_by",
    observation=(
        "The fairness audit (Ch. 31) revealed that the two-tower model's "
        "cold-start problem was also a fairness problem: new creators are "
        "disproportionately non-English-speaking, so cold-start bias "
        "amplifies language bias."
    ),
    impact=(
        "Reframed the FAISS incremental update roadmap item from "
        "'nice-to-have latency improvement' to 'fairness-critical "
        "infrastructure.'"
    ),
    action="Run the fairness audit before finalizing the roadmap.",
))

retro.add(RetrospectiveItem(
    category="learned",
    observation=(
        "The causal ATE estimate (4.1%) was less than half the naive "
        "CTR improvement (8.7%). More than half of the apparent 'improvement' "
        "was the model reinforcing existing behavior, not causing new engagement."
    ),
    impact=(
        "Changed how I would present results to stakeholders — the honest "
        "causal estimate is more credible than the inflated naive estimate."
    ),
    action=(
        "Always report both the naive and causal estimates. The gap between "
        "them quantifies how much the model is amplifying existing patterns."
    ),
))

print(retro.summary())
Capstone Retrospective — Track B
Total Hours: 120
==================================================

--- What Worked Well ---
  Starting with the MLP ranker (Ch. 6) before attempting the transformer (Ch. 10) established a reliable baseline and made debugging the transformer integration much easier.
    Impact: Saved ~15 hours of debugging by having a known-good baseline.
    Action: Always deploy the simplest model first, even if you intend to upgrade.

--- What Didn't Work ---
  Attempting to build the feature store and the training pipeline simultaneously led to integration conflicts — schema changes in one broke the other.
    Impact: Lost ~8 hours to integration bugs that would not have occurred sequentially.
    Action: Build and stabilize the feature store first. Define the schema as a data contract (Ch. 28). Then build the training pipeline against that contract.

--- What Surprised Me ---
  The fairness audit (Ch. 31) revealed that the two-tower model's cold-start problem was also a fairness problem: new creators are disproportionately non-English-speaking, so cold-start bias amplifies language bias.
    Impact: Reframed the FAISS incremental update roadmap item from 'nice-to-have latency improvement' to 'fairness-critical infrastructure.'
    Action: Run the fairness audit before finalizing the roadmap.

--- Key Learnings ---
  The causal ATE estimate (4.1%) was less than half the naive CTR improvement (8.7%). More than half of the apparent 'improvement' was the model reinforcing existing behavior, not causing new engagement.
    Impact: Changed how I would present results to stakeholders — the honest causal estimate is more credible than the inflated naive estimate.
    Action: Always report both the naive and causal estimates. The gap between them quantifies how much the model is amplifying existing patterns.

The retrospective serves two purposes. For the individual, it consolidates learning. For the organization, it creates institutional memory. The observations above — "always deploy the simplest model first," "build the feature store before the training pipeline," "the cold-start problem is a fairness problem" — are precisely the kind of hard-won wisdom that separates a senior data scientist from a junior one. They cannot be learned from any single chapter. They emerge only from the experience of integration.


36.9 The TCO and ROI Analysis (Track C)

Track C includes a total cost of ownership (TCO) and return on investment (ROI) analysis. This is the document that justifies the system's existence to financial decision-makers.

@dataclass
class CostLineItem:
    """A single line item in the cost analysis.

    Attributes:
        category: Cost category (compute, storage, personnel, etc.).
        description: What this cost covers.
        monthly_cost: Monthly cost in dollars.
        is_recurring: Whether this cost recurs monthly.
    """
    category: str
    description: str
    monthly_cost: float
    is_recurring: bool = True


@dataclass
class TCOAnalysis:
    """Total Cost of Ownership analysis for the recommendation system.

    Breaks down costs by category and computes annual TCO. Compares
    build vs. buy alternatives.

    Attributes:
        items: List of cost line items.
        time_horizon_months: Analysis time horizon.
    """
    items: List[CostLineItem] = field(default_factory=list)
    time_horizon_months: int = 12

    def add(self, item: CostLineItem) -> None:
        """Add a cost line item."""
        self.items.append(item)

    def annual_cost(self) -> float:
        """Compute total annual cost.

        Returns:
            Total cost over 12 months. Recurring items are multiplied
            by 12; one-time items are counted once.
        """
        total = 0.0
        for item in self.items:
            if item.is_recurring:
                total += item.monthly_cost * 12
            else:
                total += item.monthly_cost
        return total

    def by_category(self) -> Dict[str, float]:
        """Break down annual cost by category.

        Returns:
            Dictionary mapping category to annual cost.
        """
        cats: Dict[str, float] = {}
        for item in self.items:
            cost = (
                item.monthly_cost * 12 if item.is_recurring
                else item.monthly_cost
            )
            cats[item.category] = cats.get(item.category, 0.0) + cost
        return cats

    def summary(self) -> str:
        """Generate cost summary.

        Returns:
            Formatted cost breakdown string.
        """
        lines = [
            "StreamRec TCO Analysis (Annual)",
            "=" * 45,
        ]
        for cat, cost in sorted(
            self.by_category().items(),
            key=lambda x: -x[1],
        ):
            lines.append(f"  {cat:<25s} ${cost:>12,.0f}")

        lines.append(f"  {'─' * 39}")
        lines.append(f"  {'TOTAL':<25s} ${self.annual_cost():>12,.0f}")
        return "\n".join(lines)


@dataclass
class ROIAnalysis:
    """Return on Investment analysis.

    Computes ROI by comparing system cost to estimated business impact.

    Attributes:
        annual_cost: Total annual cost of the system.
        annual_revenue_impact: Estimated annual revenue uplift.
        annual_cost_savings: Estimated annual cost savings.
    """
    annual_cost: float
    annual_revenue_impact: float
    annual_cost_savings: float

    @property
    def total_annual_benefit(self) -> float:
        """Total annual benefit (revenue + savings)."""
        return self.annual_revenue_impact + self.annual_cost_savings

    @property
    def roi_percentage(self) -> float:
        """ROI as a percentage: (benefit - cost) / cost * 100."""
        if self.annual_cost == 0:
            return 0.0
        return (
            (self.total_annual_benefit - self.annual_cost)
            / self.annual_cost * 100
        )

    @property
    def payback_months(self) -> float:
        """Months to pay back the investment."""
        monthly_benefit = self.total_annual_benefit / 12
        if monthly_benefit <= 0:
            return float("inf")
        return self.annual_cost / monthly_benefit

    def summary(self) -> str:
        """Generate ROI summary.

        Returns:
            Formatted ROI analysis string.
        """
        return "\n".join([
            "StreamRec ROI Analysis",
            "=" * 45,
            f"  Annual system cost:       ${self.annual_cost:>12,.0f}",
            f"  Annual revenue impact:    ${self.annual_revenue_impact:>12,.0f}",
            f"  Annual cost savings:      ${self.annual_cost_savings:>12,.0f}",
            f"  Total annual benefit:     ${self.total_annual_benefit:>12,.0f}",
            f"  {'─' * 41}",
            f"  ROI:                      {self.roi_percentage:>11.0f}%",
            f"  Payback period:           {self.payback_months:>10.1f} months",
        ])


# StreamRec TCO
tco = TCOAnalysis()

# Monthly cost line items (setup is one-time)
tco.add(CostLineItem("Compute (training)", "Weekly retraining, 4x A100 GPUs, ~6h/run", 3_200.0))
tco.add(CostLineItem("Compute (serving)", "8x inference servers, CPU + T4 GPU", 8_500.0))
tco.add(CostLineItem("Compute (feature store)", "Redis cluster (online) + Spark jobs (batch)", 2_800.0))
tco.add(CostLineItem("Storage", "S3 (training data, artifacts), Delta Lake, model registry", 1_200.0))
tco.add(CostLineItem("Monitoring", "Grafana Cloud, PagerDuty, log aggregation", 800.0))
tco.add(CostLineItem("Personnel", "2 ML engineers (partial allocation, 0.5 FTE each)", 25_000.0))
tco.add(CostLineItem("Setup (one-time)", "Initial infrastructure provisioning, CI/CD setup", 45_000.0, is_recurring=False))

print(tco.summary())
print()

# ROI
roi = ROIAnalysis(
    annual_cost=tco.annual_cost(),
    annual_revenue_impact=4_200_000.0,  # 3.4% CTR uplift on $123M ad revenue
    annual_cost_savings=180_000.0,       # Reduced manual curation labor
)
print(roi.summary())
StreamRec TCO Analysis (Annual)
=============================================
  Personnel                 $  300,000
  Compute (serving)         $  102,000
  Setup (one-time)          $   45,000
  Compute (training)        $   38,400
  Compute (feature store)   $   33,600
  Storage                   $   14,400
  Monitoring                $    9,600
  ───────────────────────────────────────
  TOTAL                     $  543,000

StreamRec ROI Analysis
=============================================
  Annual system cost:       $    543,000
  Annual revenue impact:    $  4,200,000
  Annual cost savings:      $    180,000
  Total annual benefit:     $  4,380,000
  ─────────────────────────────────────────
  ROI:                             707%
  Payback period:                1.5 months

The ROI is strongly positive, as is typical for well-implemented recommendation systems at scale. The key insight is that most of the cost is personnel, not infrastructure: the four largest recurring infrastructure line items (training, serving, feature store, storage) total $188,400/year, less than one engineer's salary. The bottleneck for recommendation system quality is engineering time, not compute budget.
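That cost-structure claim can be reproduced directly from the example line items. A quick standalone check, with the monthly figures copied from the TCO example above:

```python
# Quick check of the cost-structure claim: annualize the recurring
# infrastructure line items and compare them against personnel.
# Monthly figures are copied from the TCO example above.
infra_monthly = {
    "Compute (training)": 3_200.0,
    "Compute (serving)": 8_500.0,
    "Compute (feature store)": 2_800.0,
    "Storage": 1_200.0,
}
infra_annual = sum(infra_monthly.values()) * 12  # 188,400
personnel_annual = 25_000.0 * 12                 # 300,000

share = infra_annual / (infra_annual + personnel_annual)
print(f"Infrastructure: ${infra_annual:,.0f}/yr ({share:.0%} of the two)")
print(f"Personnel:      ${personnel_annual:,.0f}/yr")
```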


36.10 Technical Debt Assessment

Every system accumulates technical debt. The capstone retrospective should identify the debt explicitly, classify it, and either schedule repayment or accept it consciously. Unconscious technical debt — debt you do not know you carry — is the most dangerous kind, because it compounds silently until it causes a production incident.

Debt Categories for ML Systems

| Category | Description | Example |
| --- | --- | --- |
| Data debt | Accumulated data quality or pipeline issues | Training-serving skew not continuously monitored |
| Model debt | Modeling shortcuts that limit performance | MLP ranker when transformer would be superior |
| Infrastructure debt | System design compromises | Daily FAISS rebuild instead of incremental |
| Testing debt | Missing or inadequate test coverage | No behavioral tests for re-ranking business rules |
| Documentation debt | Missing or outdated documentation | ADRs not updated after architecture changes |
| Fairness debt | Known fairness issues not yet mitigated | Intersectional audit not yet implemented |

@dataclass
class TechnicalDebtItem:
    """A single item of technical debt.

    Attributes:
        id: Unique identifier.
        category: Debt category.
        description: What the debt is.
        risk: What can go wrong if not addressed.
        severity: How bad the failure would be (low/medium/high/critical).
        remediation: How to fix it.
        estimated_effort_weeks: Effort to remediate.
        roadmap_item: Linked roadmap item ID, if scheduled.
    """
    id: str
    category: str
    description: str
    risk: str
    severity: str
    remediation: str
    estimated_effort_weeks: float
    roadmap_item: Optional[str] = None


@dataclass
class TechnicalDebtRegister:
    """Registry of known technical debt.

    Tracks all identified technical debt items and provides
    summary statistics for planning.

    Attributes:
        items: List of debt items.
    """
    items: List[TechnicalDebtItem] = field(default_factory=list)

    def add(self, item: TechnicalDebtItem) -> None:
        """Add a debt item."""
        self.items.append(item)

    def total_remediation_weeks(self) -> float:
        """Total effort to remediate all debt."""
        return sum(i.estimated_effort_weeks for i in self.items)

    def unscheduled(self) -> List[TechnicalDebtItem]:
        """Return debt items not linked to a roadmap item."""
        return [i for i in self.items if i.roadmap_item is None]

    def by_severity(self) -> Dict[str, List[TechnicalDebtItem]]:
        """Group debt by severity.

        Returns:
            Dictionary mapping severity to list of debt items.
        """
        grouped: Dict[str, List[TechnicalDebtItem]] = {}
        for item in self.items:
            grouped.setdefault(item.severity, []).append(item)
        return grouped

    def summary(self) -> str:
        """Generate debt summary.

        Returns:
            Formatted summary string.
        """
        lines = [
            "Technical Debt Register",
            "=" * 50,
            f"Total items: {len(self.items)}",
            f"Total remediation: {self.total_remediation_weeks():.0f} "
            f"person-weeks",
            f"Unscheduled: {len(self.unscheduled())}",
            "",
        ]

        severity_order = ["critical", "high", "medium", "low"]
        for sev in severity_order:
            items = self.by_severity().get(sev, [])
            if items:
                lines.append(f"--- {sev.upper()} ---")
                for item in items:
                    scheduled = (
                        f" -> {item.roadmap_item}"
                        if item.roadmap_item else " [UNSCHEDULED]"
                    )
                    lines.append(
                        f"  {item.id}: {item.description} "
                        f"({item.estimated_effort_weeks:.0f}w)"
                        f"{scheduled}"
                    )
                lines.append("")

        return "\n".join(lines)


# Example debt register for Track B
debt = TechnicalDebtRegister()

debt.add(TechnicalDebtItem(
    id="TD-001",
    category="infrastructure",
    description="FAISS index rebuilt daily; new items invisible for up to 24h",
    risk="New items from active creators receive zero impressions for hours, creating creator churn risk",
    severity="high",
    remediation="Implement incremental FAISS updates (R-003)",
    estimated_effort_weeks=4.0,
    roadmap_item="R-003",
))

debt.add(TechnicalDebtItem(
    id="TD-002",
    category="testing",
    description="Re-ranking business rules have no behavioral tests",
    risk="Diversity and freshness constraints could silently break during refactoring",
    severity="medium",
    remediation="Add MFT/INV/DIR behavioral tests for re-ranking (Ch. 28 pattern)",
    estimated_effort_weeks=2.0,
    roadmap_item=None,
))

debt.add(TechnicalDebtItem(
    id="TD-003",
    category="fairness",
    description="Intersectional fairness audit not yet implemented",
    risk="Compound disparities (e.g., new Arabic-language creators at 0.11 equity) undetected",
    severity="high",
    remediation="Extend fairness module to compute intersectional metrics",
    estimated_effort_weeks=3.0,
    roadmap_item="R-004",
))

debt.add(TechnicalDebtItem(
    id="TD-004",
    category="model",
    description="MLP ranker does not use session context",
    risk="Ranking quality plateaus; cannot personalize within session",
    severity="medium",
    remediation="Upgrade to transformer ranker (R-005)",
    estimated_effort_weeks=10.0,
    roadmap_item="R-005",
))

debt.add(TechnicalDebtItem(
    id="TD-005",
    category="data",
    description="No continuous training-serving skew monitoring",
    risk="Skew accumulates between batch feature updates, causing silent quality degradation",
    severity="high",
    remediation="Add PSI monitoring between training and serving feature distributions (Ch. 30 pattern)",
    estimated_effort_weeks=3.0,
    roadmap_item=None,
))

print(debt.summary())
Technical Debt Register
==================================================
Total items: 5
Total remediation: 22 person-weeks
Unscheduled: 2

--- HIGH ---
  TD-001: FAISS index rebuilt daily; new items invisible for up to 24h (4w) -> R-003
  TD-003: Intersectional fairness audit not yet implemented (3w) -> R-004
  TD-005: No continuous training-serving skew monitoring (3w) [UNSCHEDULED]

--- MEDIUM ---
  TD-002: Re-ranking business rules have no behavioral tests (2w) [UNSCHEDULED]
  TD-004: MLP ranker does not use session context (10w) -> R-005

TD-005 (training-serving skew monitoring) is the most concerning unscheduled item. It is high severity because the failure mode is silent — the system continues serving, the latency metrics look fine, and the degradation appears only as a slow decline in business metrics that is easy to attribute to other causes. This is exactly the pattern described in Chapter 30 (Section 30.4). The remediation — adding PSI monitoring between training and serving feature distributions — should be prioritized into the next planning cycle.
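A minimal version of that remediation can be sketched in a few lines. The quantile binning computed from the training sample, the epsilon smoothing, and the conventional 0.1 / 0.2 PSI bands below are all illustrative choices, not values fixed by this chapter:

```python
import numpy as np

# Sketch of the TD-005 remediation: Population Stability Index (PSI)
# between the training (reference) and serving (current) distribution
# of a single feature. Bin count, smoothing, and thresholds are
# illustrative conventions.
def psi(reference, current, n_bins=10, eps=1e-6):
    """Population Stability Index between two 1-D feature samples."""
    # Bin edges come from the training distribution's quantiles.
    edges = np.quantile(reference, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # catch out-of-range values
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    ref_frac = np.clip(ref_frac, eps, None)  # avoid log(0)
    cur_frac = np.clip(cur_frac, eps, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, 50_000)        # training-time feature sample
serve_ok = rng.normal(0.0, 1.0, 50_000)     # serving matches training
serve_skew = rng.normal(0.75, 1.0, 50_000)  # silent mean shift at serving

print(f"PSI (no skew):   {psi(train, serve_ok):.3f}")
print(f"PSI (with skew): {psi(train, serve_skew):.3f}")
```

Run at every feature refresh, a check like this turns the silent failure mode into an alert: PSI near zero when serving matches training, and well above the common 0.2 alert threshold under the shifted distribution.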


36.11 Progressive Project: Milestone M16 (Final)

M16: Complete StreamRec System Integration

This is the final milestone of the progressive project. It integrates all fifteen previous milestones into a complete, documented, evaluated system.

Milestone Requirements by Track

| Requirement | Track A | Track B | Track C |
| --- | --- | --- | --- |
| Working recommendation endpoint | Required | Required | Required |
| Offline evaluation (Hit@10, NDCG@10) | Required | Required | Required |
| Comparison to baselines (random, popularity, MF) | Required | Required | Required |
| Creator fairness audit | Single-attribute | Multi-attribute | Intersectional |
| Causal impact estimate | Not required | Doubly robust ATE | ATE + CATE |
| Deployment pipeline | Manual | Canary | Shadow + canary + progressive |
| Monitoring dashboard | PSI on 10 features | Full 4-layer | Grafana + uncertainty + fairness panels |
| Uncertainty quantification | Not required | Not required | MC dropout + conformal |
| ADRs | 1 | 3+ | 5+ |
| Technical design document | 2 pages | 5 pages | 10 pages |
| Stakeholder presentation | Not required | Technical peers | Technical + executive |
| Technical roadmap | Not required | 3-month | 6-month + TCO/ROI |
| Retrospective | Informal notes | Structured retrospective | Full post-launch review |

Integration Checklist

The following checklist ensures that every component from the progressive project is accounted for. Each item corresponds to a previous milestone.

@dataclass
class IntegrationChecklistItem:
    """A single item on the integration checklist.

    Attributes:
        milestone: Original milestone (e.g., "M2").
        chapter: Chapter number.
        component: What was built.
        integration_status: 'integrated', 'replaced', or 'not_applicable'.
        notes: Integration notes.
    """
    milestone: str
    chapter: int
    component: str
    integration_status: str = "pending"
    notes: str = ""


@dataclass
class IntegrationChecklist:
    """Full integration checklist for the capstone project.

    Tracks which previous milestones have been integrated into
    the final system.

    Attributes:
        items: List of checklist items.
        track: Project track.
    """
    items: List[IntegrationChecklistItem] = field(default_factory=list)
    track: Track = Track.B_STANDARD

    def completion_rate(self) -> float:
        """Fraction of items that are integrated or replaced.

        Returns:
            Completion rate as a float in [0, 1].
        """
        done = sum(
            1 for i in self.items
            if i.integration_status in ("integrated", "replaced",
                                         "not_applicable")
        )
        return done / len(self.items) if self.items else 0.0

    def pending(self) -> List[IntegrationChecklistItem]:
        """Return items not yet integrated.

        Returns:
            List of pending checklist items.
        """
        return [
            i for i in self.items
            if i.integration_status == "pending"
        ]

    def summary(self) -> str:
        """Generate checklist summary.

        Returns:
            Formatted checklist string.
        """
        lines = [
            f"Integration Checklist — Track {self.track.value}",
            f"Completion: {self.completion_rate():.0%} "
            f"({len(self.items) - len(self.pending())} / {len(self.items)})",
            "=" * 55,
        ]

        status_symbols = {
            "integrated": "[+]",
            "replaced": "[~]",
            "not_applicable": "[–]",
            "pending": "[ ]",
        }

        for item in self.items:
            symbol = status_symbols.get(item.integration_status, "[ ]")
            notes = f" — {item.notes}" if item.notes else ""
            lines.append(
                f"  {symbol} {item.milestone} (Ch. {item.chapter}): "
                f"{item.component}{notes}"
            )

        return "\n".join(lines)


# Build Track B checklist (all milestones)
checklist = IntegrationChecklist(track=Track.B_STANDARD)

milestones = [
    ("M0", 1, "SVD matrix factorization baseline", "replaced",
     "Replaced by two-tower (M5) but kept as evaluation baseline"),
    ("M1", 5, "FAISS ANN index, latency profiling", "integrated",
     "FAISS IVF4096,PQ32 index serves production retrieval"),
    ("M2", 6, "Click-prediction MLP", "integrated",
     "MLP ranker is the Track A/B ranking model"),
    ("M3", 8, "1D CNN content embeddings", "replaced",
     "Replaced by pretrained encoder in two-tower model (M5)"),
    ("M4", 10, "Transformer session model", "integrated",
     "Available as Track B ranking upgrade; used in Track C"),
    ("M5", 13, "Two-tower retrieval with FAISS", "integrated",
     "Primary retrieval source in all tracks"),
    ("M6", 14, "LightGCN collaborative filtering", "integrated",
     "Track C secondary retrieval source"),
    ("M7", 19, "Causal forests, uplift targeting", "integrated",
     "CATE estimates inform re-ranking in Track C"),
    ("M8", 22, "Thompson sampling explore/exploit", "integrated",
     "Track C exploration policy (5% traffic)"),
    ("M8b", 23, "Temporal engagement forecaster", "integrated",
     "Forecasts feed monitoring anomaly detection"),
    ("M9", 24, "System architecture, latency budgets", "integrated",
     "Architecture blueprint used for all tracks"),
    ("M10", 25, "Feature store (batch + stream)", "integrated",
     "Parquet offline + Redis online; streaming in Track B+"),
    ("M11", 27, "Dagster training pipeline", "integrated",
     "Orchestrates weekly retraining"),
    ("M12", 28, "Testing and validation infrastructure", "integrated",
     "GE expectations, Pandera schemas, behavioral tests"),
    ("M13", 29, "Deployment pipeline", "integrated",
     "Canary deployment in Track B; shadow+progressive in Track C"),
    ("M14", 30, "Monitoring dashboard", "integrated",
     "Four-layer Grafana dashboard with PSI drift detection"),
    ("M15", 31, "Fairness audit", "integrated",
     "Creator exposure equity + user quality disparity"),
]

for m, ch, comp, status, notes in milestones:
    checklist.items.append(IntegrationChecklistItem(
        milestone=m, chapter=ch, component=comp,
        integration_status=status, notes=notes,
    ))

print(checklist.summary())
Integration Checklist — Track B
Completion: 100% (17 / 17)
=======================================================
  [~] M0 (Ch. 1): SVD matrix factorization baseline — Replaced by two-tower (M5) but kept as evaluation baseline
  [+] M1 (Ch. 5): FAISS ANN index, latency profiling — FAISS IVF4096,PQ32 index serves production retrieval
  [+] M2 (Ch. 6): Click-prediction MLP — MLP ranker is the Track A/B ranking model
  [~] M3 (Ch. 8): 1D CNN content embeddings — Replaced by pretrained encoder in two-tower model (M5)
  [+] M4 (Ch. 10): Transformer session model — Available as Track B ranking upgrade; used in Track C
  [+] M5 (Ch. 13): Two-tower retrieval with FAISS — Primary retrieval source in all tracks
  [+] M6 (Ch. 14): LightGCN collaborative filtering — Track C secondary retrieval source
  [+] M7 (Ch. 19): Causal forests, uplift targeting — CATE estimates inform re-ranking in Track C
  [+] M8 (Ch. 22): Thompson sampling explore/exploit — Track C exploration policy (5% traffic)
  [+] M8b (Ch. 23): Temporal engagement forecaster — Forecasts feed monitoring anomaly detection
  [+] M9 (Ch. 24): System architecture, latency budgets — Architecture blueprint used for all tracks
  [+] M10 (Ch. 25): Feature store (batch + stream) — Parquet offline + Redis online; streaming in Track B+
  [+] M11 (Ch. 27): Dagster training pipeline — Orchestrates weekly retraining
  [+] M12 (Ch. 28): Testing and validation infrastructure — GE expectations, Pandera schemas, behavioral tests
  [+] M13 (Ch. 29): Deployment pipeline — Canary deployment in Track B; shadow+progressive in Track C
  [+] M14 (Ch. 30): Monitoring dashboard — Four-layer Grafana dashboard with PSI drift detection
  [+] M15 (Ch. 31): Fairness audit — Creator exposure equity + user quality disparity

The checklist tells a story of progressive integration. Two components (M0, M3) were replaced by better alternatives — the SVD baseline by the two-tower model, the 1D CNN embeddings by the pretrained encoder. Replacement is not failure. It is the natural outcome of iterative development: you build the simple thing first (Theme 6), evaluate it, and upgrade when the data justifies the complexity.


36.12 Synthesis: What This Chapter Is Really About

This chapter has no new algorithm, no new loss function, no new mathematical result. Everything you need to build the capstone system was taught in Chapters 1-35. The new content is entirely structural: how to compose components into a system, how to document decisions, how to evaluate holistically, how to communicate to stakeholders, and how to plan for the future.

This is not a coincidence. It reflects a truth about the practice of data science at the senior level: the bottleneck is not knowledge of any individual technique but the ability to compose techniques into systems that work reliably, fairly, and demonstrably.

A data scientist who can build a transformer model but cannot explain to a product manager why it is better than the MLP it replaces is a data scientist who will not get the transformer deployed. A data scientist who can compute SHAP values but cannot connect them to a regulatory requirement (Chapter 35) is a data scientist whose interpretability work will be ignored by the compliance team. A data scientist who can estimate a causal effect but cannot compare the causal ATE to the naive CTR uplift and explain the difference to an executive is a data scientist whose causal inference skills are wasted.
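The gap between the causal ATE and the naive CTR uplift can be made concrete with a small simulation. Everything here is illustrative, not the chapter's actual data: the `engaged` confounder, the probabilities, and the +0.05 true effect are assumptions chosen so that the naive comparison visibly overstates the causal effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hidden confounder: highly engaged users are both more likely to
# receive a recommendation and more likely to click regardless.
engaged = rng.random(n) < 0.3
treated = rng.random(n) < np.where(engaged, 0.8, 0.2)

# True effect of the recommendation is +0.05 for everyone.
base_ctr = np.where(engaged, 0.30, 0.10)
clicks = rng.random(n) < base_ctr + 0.05 * treated

# Naive uplift: difference in observed CTR between exposed groups.
naive = clicks[treated].mean() - clicks[~treated].mean()

# Backdoor adjustment: estimate the effect within each engagement
# stratum, then average the strata weighted by their population share.
ate = 0.0
for g in (True, False):
    mask = engaged == g
    effect = (clicks[mask & treated].mean()
              - clicks[mask & ~treated].mean())
    ate += effect * mask.mean()

print(f"naive uplift: {naive:.3f}, adjusted ATE: {ate:.3f}")
```

With these assumed parameters the naive uplift lands near triple the true effect, because the treated group is disproportionately drawn from users who would have clicked anyway. That tripling, explained in one sentence, is the executive conversation the paragraph above describes.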

The capstone forces all of these connections. It is intentionally difficult, not because any single component is hard on its own, but because of the real lesson of this book:

The hardest part is not building any individual component but making them all work together reliably.

This insight applies far beyond recommendation systems. A credit scoring system at Meridian Financial faces the same integration challenge: the XGBoost model (Chapter 6), the ECOA fairness audit (Chapter 31), the SR 11-7 validation gate (Chapter 28), the quarterly retraining pipeline (Chapter 29), and the model risk management documentation all must compose into a coherent system with clear interfaces and explicit tradeoffs. A climate forecasting system faces the same challenge with different components: the temporal fusion transformer (Chapter 23), the spatial CNN (Chapter 8), the uncertainty quantification (Chapter 34), and the ensemble integration (Chapter 26). The components change; the craft of integration does not.

This is the craft of the senior data scientist. It is the craft this entire book has been preparing you to practice.


Chapter Summary

This chapter defined the capstone project: a complete production recommendation system built from the components developed across all 35 preceding chapters. Three tracks (A Minimal, B Standard, C Full) provide increasing levels of ambition, each with explicit component inventories, deliverable requirements, and expected performance targets. The technical design document template structures the engineering narrative. Architecture Decision Records document the why behind every architectural choice. The three-level evaluation framework (model, system, business) ensures that evaluation goes beyond predictive accuracy to include system reliability, causal impact, and fairness. Stakeholder presentation guidance adapts the same system to different audiences. The technical roadmap transforms the capstone from a one-time project into a foundation for ongoing improvement. The TCO and ROI analysis justifies the system's existence in financial terms. The technical debt register makes accumulated compromises explicit. The retrospective framework consolidates the hard-won lessons of integration into reusable wisdom.

The capstone is the most difficult chapter in this book — not because it introduces new techniques, but because it requires you to use all of them simultaneously. Every simplifying assumption you made in isolation collides with every other simplifying assumption. Every component's output format must match the next component's input format. Every latency budget must sum to less than 200 milliseconds. Every fairness constraint must be compatible with every optimization objective.
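The latency-budget constraint is the kind of integration check worth automating. A minimal sketch follows; the component names and per-component millisecond figures are illustrative assumptions, not the chapter's actual budget table.

```python
# End-to-end p99 budget for a recommendation request (assumed figures).
BUDGET_MS = 200
component_budgets_ms = {
    "feature_lookup": 20,           # online feature store reads
    "two_tower_retrieval": 40,      # FAISS ANN candidate generation
    "candidate_merge": 10,          # dedupe and union of sources
    "ranking_model": 60,            # MLP/transformer scoring
    "re_ranking_and_filters": 30,   # fairness, diversity, business rules
    "response_assembly": 10,        # serialization and logging
}

total_ms = sum(component_budgets_ms.values())
assert total_ms <= BUDGET_MS, (
    f"latency budget exceeded: {total_ms} ms > {BUDGET_MS} ms"
)
print(f"total p99 budget: {total_ms} ms "
      f"({BUDGET_MS - total_ms} ms headroom)")
```

Run as a CI assertion, this turns "every latency budget must sum to less than 200 milliseconds" from a design-document sentence into a failing build whenever a component's budget creeps upward.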

The hardest part is not building any individual component but making them all work together reliably.


Notes

The three-track structure is inspired by the tiered project designs in Chip Huyen's Designing Machine Learning Systems (O'Reilly, 2022) and the MLOps maturity model from Google's "MLOps: Continuous delivery and automation pipelines in machine learning" (2020). The ADR format follows Michael Nygard's "Documenting Architecture Decisions" (2011). The technical design document template is adapted from design doc practices at Google, Meta, and Stripe, as described in Will Larson's Staff Engineer: Leadership Beyond the Management Track (2021). The TCO framework draws on cloud cost optimization literature, particularly Corey Quinn's "Last Week in AWS" analyses. The retrospective structure follows Esther Derby and Diana Larsen's Agile Retrospectives: Making Good Teams Great (Pragmatic Bookshelf, 2006). The three-slide executive presentation rule is a simplification of Barbara Minto's pyramid principle, adapted for ML system presentations.