Case Study 1: StreamRec Full System Architecture — From Prototype to Production

Context

StreamRec has operated for 18 months with a prototype recommendation system: a collaborative filtering model (matrix factorization from Chapter 1) trained weekly on a data scientist's laptop and deployed as a batch job that pre-computes the top-50 items per user. The results are written to a PostgreSQL database. When a user opens the app, the backend reads the pre-computed list and displays the top 20.

This system launched the product and proved the value of personalized recommendations. Monthly active users grew from 2 million to 12 million. But the prototype is now a liability:

  1. Staleness. The weekly retrain means recommendations do not reflect a user's behavior from the last 6 days. A user who binges three cooking shows on Monday still sees action movies recommended on Tuesday through Sunday.
  2. Cold start. New users (15% of daily active users) receive globally popular items because the matrix factorization model has no embeddings for them. Engagement for new users is 40% lower than for established users.
  3. Scale. Pre-computing recommendations for 12 million users takes 14 hours. At 50 million users, it would take roughly 58 hours, more than two full days of each weekly cycle spent on pre-computation alone, with no slack for failures or reruns.
  4. Feature limitations. The batch job cannot incorporate session context (time of day, device, current session history), which A/B tests on a subset of users showed would increase engagement by 12%.

The VP of Engineering has approved a production ML system to replace the prototype. The team — 3 ML engineers, 2 backend engineers, 1 data engineer — has 4 months to build it.

The Design Process

Step 1: Requirements

The team begins by specifying the system requirements, derived from product goals and infrastructure constraints.

| Requirement | Target | Source |
| --- | --- | --- |
| End-to-end latency (p95) | < 200ms | Product: users abandon after 300ms |
| Throughput (peak) | 10,000 req/s | Traffic analysis: peak at 8pm Sunday |
| Availability | 99.9% | SLA commitment to advertisers |
| Model freshness | Daily retraining | Data science: weekly retraining showed measurable decay vs. daily; hourly showed no further gain |
| Feature freshness | Batch: daily; streaming: < 60s | A/B test: session features → +12% engagement |
| Cold-start handling | Personalized after 5 interactions | Bayesian model (Ch. 20) meets this threshold |
| Item coverage | 95% of catalog served at least once/week | Business: long-tail content is a differentiator |

Step 2: Component Design

The team maps each requirement to a system component.

Data pipeline. Event logs (user interactions, item metadata, content signals) flow from the app through Kafka into a data lake (Delta Lake on S3). A Spark pipeline runs nightly to compute batch features. A Flink pipeline runs continuously to compute streaming features (last 10 interactions, session duration).
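
The streaming-feature logic the Flink job implements can be sketched in plain Python. The state layout and the 30-minute session gap below are illustrative assumptions, not values from the StreamRec design:

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Optional

SESSION_GAP_S = 30 * 60  # assumed inactivity gap that starts a new session


@dataclass
class SessionState:
    """Per-user state, analogous to Flink keyed state."""
    last_items: deque = field(default_factory=lambda: deque(maxlen=10))
    session_start_ts: Optional[float] = None
    last_event_ts: Optional[float] = None

    def update(self, item_id: str, ts: float) -> dict:
        """Fold one interaction event into the state and emit fresh features."""
        if self.last_event_ts is None or ts - self.last_event_ts > SESSION_GAP_S:
            self.session_start_ts = ts  # inactivity gap exceeded: new session
        self.last_items.append(item_id)
        self.last_event_ts = ts
        return {
            "last_10_items": list(self.last_items),
            "session_duration_s": ts - self.session_start_ts,
        }
```

Each emitted feature dict would then be written to the online store within the 60-second freshness target.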

Feature store. Feast manages both the online store (Redis cluster, 3 replicas for availability) and the offline store (Delta Lake tables with temporal partitioning). All features are defined once in the feature store schema and computed by a single pipeline that writes to both stores.
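
A toy in-memory stand-in (not the Feast API) illustrates the contract: one write path feeding both stores, a latest-value lookup for serving, and a point-in-time lookup for training:

```python
from typing import Any, Dict, List, Optional, Tuple


class DualStore:
    """Toy stand-in for the online (Redis-like) and offline (Delta-like) stores."""

    def __init__(self) -> None:
        self.online: Dict[Tuple[str, str], Any] = {}          # latest value per key
        self.offline: List[Tuple[float, str, str, Any]] = []  # append-only log

    def write(self, entity_id: str, features: Dict[str, Any], ts: float) -> None:
        """Single write path: every feature lands in both stores."""
        for name, value in features.items():
            self.online[(entity_id, name)] = value
            self.offline.append((ts, entity_id, name, value))

    def get_online(self, entity_id: str, name: str, default: Any = None) -> Any:
        """Serving-time lookup: latest value, or a default on a miss."""
        return self.online.get((entity_id, name), default)

    def get_as_of(self, entity_id: str, name: str, ts: float) -> Optional[Any]:
        """Training-time lookup: latest value written at or before ts."""
        rows = [(t, v) for (t, e, n, v) in self.offline
                if e == entity_id and n == name and t <= ts]
        return max(rows, key=lambda r: r[0])[1] if rows else None
```

The point-in-time lookup is what keeps training features consistent with what serving would have seen at that timestamp.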

Training infrastructure. A daily training job runs on a 4-GPU node (A100s). The pipeline: (1) construct training dataset with point-in-time correct features from the offline store, (2) train the two-tower retrieval model and the DCN-V2 ranking model, (3) evaluate on a held-out set with Recall@20 and NDCG@20 metrics, (4) register the model in MLflow with metrics and feature schema, (5) run shadow mode comparison against the current production model.
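
The five stages can be written as an explicit pipeline. Everything below is a hypothetical sketch: the helper names, metric values, and quality-gate threshold are illustrative stand-ins for the real Spark, trainer, and MLflow calls:

```python
def build_dataset(as_of_date: str) -> dict:
    return {"as_of": as_of_date}  # stand-in for the point-in-time training set


def train_models(dataset: dict) -> tuple:
    return "two_tower", "dcn_v2"  # stand-ins for the trained model artifacts


def evaluate(model: str, dataset: dict) -> dict:
    return {"recall@20": 0.41, "ndcg@20": 0.37}  # stand-in offline metrics


def register(model: str, metrics: dict) -> str:
    return model + "-v1"  # stand-in for an MLflow registry version


def start_shadow_comparison(version: str) -> None:
    pass  # stand-in: route shadow traffic to the new version


def run_daily_training(as_of_date: str, min_recall: float = 0.30) -> str:
    dataset = build_dataset(as_of_date)          # (1) point-in-time dataset
    _retrieval, ranking = train_models(dataset)  # (2) two-tower + DCN-V2
    metrics = evaluate(ranking, dataset)         # (3) Recall@20 / NDCG@20
    if metrics["recall@20"] < min_recall:        # assumed quality gate
        raise RuntimeError("quality gate failed; keep serving previous model")
    version = register(ranking, metrics)         # (4) registry entry
    start_shadow_comparison(version)             # (5) shadow mode
    return version
```

The explicit gate between evaluation and registration ensures a bad training run never reaches shadow mode, let alone production.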

Serving infrastructure. The serving layer runs on Kubernetes with autoscaling. The ranking model is served on Triton Inference Server (GPU pods). The retrieval index (FAISS) is served on CPU pods. The re-ranking logic runs in the application code.
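
A sketch of the request path, with the retrieval sources fanned out in parallel; the retriever, scorer, and re-ranker callables are hypothetical stand-ins for the FAISS, CF, content-based, and popularity sources and the Triton-served ranker:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def recommend(user_id: str,
              retrievers: List[Callable[[str], List[str]]],
              score: Callable[[str, str], float],
              rerank: Callable[[List[str]], List[str]],
              k: int = 20) -> List[str]:
    """Retrieval (parallel) -> ranking -> re-ranking -> top-k."""
    with ThreadPoolExecutor(max_workers=len(retrievers)) as pool:
        candidate_lists = list(pool.map(lambda r: r(user_id), retrievers))
    # Deduplicate while keeping first-seen order across sources.
    candidates = list(dict.fromkeys(c for lst in candidate_lists for c in lst))
    ranked = sorted(candidates, key=lambda item: score(user_id, item),
                    reverse=True)
    return rerank(ranked)[:k]
```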

Monitoring. Prometheus collects system metrics (latency, throughput, error rates). A custom pipeline compares serving feature distributions against training distributions daily (skew detection). An engagement tracking pipeline computes online metrics (CTR, completion rate) with a 4-hour delay.
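
One standard way to implement the daily skew check is the Population Stability Index (PSI) over shared feature bins. The 0.2 alert threshold is a common rule of thumb, not a value from the StreamRec design:

```python
import math
from typing import Sequence


def psi(expected: Sequence[float], actual: Sequence[float],
        eps: float = 1e-6) -> float:
    """PSI between training (expected) and serving (actual) bin counts.

    Rough interpretation: < 0.1 stable, 0.1-0.2 moderate shift, > 0.2 investigate.
    """
    e_total, a_total = sum(expected), sum(actual)
    score = 0.0
    for e, a in zip(expected, actual):
        e_frac = max(e / e_total, eps)  # clamp so empty bins don't blow up the log
        a_frac = max(a / a_total, eps)
        score += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return score
```

A per-feature drift alert then reduces to checking `psi(train_bins, serve_bins) > 0.2` in the daily comparison job.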

Step 3: Latency Budget Allocation

The team builds a latency model by benchmarking each component on representative hardware.

from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class LatencyBudget:
    """Latency budget for the StreamRec serving path.

    Attributes:
        stages: Dictionary mapping stage name to (p50, p95, p99) in ms.
        total_budget_p95_ms: Target end-to-end p95 latency.
    """
    stages: Dict[str, Tuple[float, float, float]]
    total_budget_p95_ms: float = 200.0

    def report(self) -> None:
        """Print the latency budget and check feasibility."""
        print(f"{'Stage':<30} {'p50':>6} {'p95':>6} {'p99':>6}")
        print("-" * 52)
        total_p50, total_p95, total_p99 = 0.0, 0.0, 0.0
        for stage, (p50, p95, p99) in self.stages.items():
            print(f"{stage:<30} {p50:>5.0f}ms {p95:>5.0f}ms {p99:>5.0f}ms")
            total_p50 += p50
            total_p95 += p95
            total_p99 += p99
        print("-" * 52)
        print(f"{'Total (serial sum)':<30} {total_p50:>5.0f}ms "
              f"{total_p95:>5.0f}ms {total_p99:>5.0f}ms")
        print(f"\np95 budget: {self.total_budget_p95_ms:.0f}ms")
        headroom = self.total_budget_p95_ms - total_p95
        if headroom > 0:
            print(f"p95 headroom: {headroom:.0f}ms (OK)")
        else:
            print(f"p95 OVER BUDGET by {-headroom:.0f}ms")


budget = LatencyBudget(stages={
    "API gateway + auth":          (3,   5,  12),
    "Feature store lookup":        (8,  15,  35),
    "Candidate retrieval (4 src)": (12, 22,  40),
    "Ranking model (GPU, 500)":    (35, 55,  80),
    "Re-ranking":                  (5,   8,  12),
    "Response serialization":      (2,   5,  10),
})
budget.report()
Stage                           p50   p95   p99
----------------------------------------------------
API gateway + auth               3ms   5ms  12ms
Feature store lookup             8ms  15ms  35ms
Candidate retrieval (4 src)     12ms  22ms  40ms
Ranking model (GPU, 500)        35ms  55ms  80ms
Re-ranking                       5ms   8ms  12ms
Response serialization           2ms   5ms  10ms
----------------------------------------------------
Total (serial sum)              65ms 110ms 189ms

p95 budget: 200ms
p95 headroom: 90ms (OK)

The p95 sum (110ms) has 90ms of headroom. The p99 sum (189ms) is within budget. However, the serial sum overstates the true critical path because some stages overlap:

  • Feature store lookup starts immediately and runs in parallel with the first part of candidate retrieval.
  • The 4 retrieval sources run in parallel with each other.

With parallelism, the effective p95 critical path drops to approximately 95ms, providing 105ms of headroom — comfortable margin for network jitter and garbage collection pauses.
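
The parallelism adjustment can be made explicit: stages grouped together run concurrently, so each group contributes its slowest member rather than the sum. The p95 numbers are those from the table above; the overlap structure (and the simplification of taking a max of per-stage p95s, which ignores how tails correlate) is the modeling assumption:

```python
# Each inner list is a set of stages that overlap; values are p95 in ms.
parallel_groups_p95 = [
    [5],        # API gateway + auth
    [15, 22],   # feature store lookup overlaps candidate retrieval
    [55],       # ranking model (the 4 retrieval sources already ran in parallel)
    [8],        # re-ranking
    [5],        # response serialization
]
critical_path_p95 = sum(max(group) for group in parallel_groups_p95)
print(f"Effective p95 critical path: {critical_path_p95}ms")  # 95ms
```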

Step 4: Failure Mode Analysis

The team conducts a failure mode analysis for each component, identifying the failure, its probability, impact, detection mechanism, and mitigation.

from dataclasses import dataclass
from typing import List


@dataclass
class FailureMode:
    """A single failure mode in the system.

    Attributes:
        component: Which component fails.
        failure: Description of the failure.
        probability: Estimated frequency (low/medium/high, with events per year).
        impact: Impact on user experience.
        detection: How the failure is detected.
        mitigation: How the system recovers.
        degradation_level: Which degradation level activates.
    """
    component: str
    failure: str
    probability: str
    impact: str
    detection: str
    mitigation: str
    degradation_level: str


failure_modes = [
    FailureMode(
        component="FAISS index",
        failure="Index corruption during update",
        probability="Low (1/year)",
        impact="Embedding retrieval returns wrong candidates",
        detection="Recall@500 monitoring on shadow traffic",
        mitigation="Roll back to previous index version; CF + "
                   "content-based retrieval as fallback",
        degradation_level="STALE_FEATURES",
    ),
    FailureMode(
        component="Redis cluster",
        failure="Primary node failure",
        probability="Medium (2/year)",
        impact="Feature lookup returns stale/default values",
        detection="Redis sentinel failover; latency spike alert",
        mitigation="Sentinel promotes replica; batch features as "
                   "fallback during promotion (5-15 seconds)",
        degradation_level="STALE_FEATURES",
    ),
    FailureMode(
        component="GPU serving",
        failure="OOM on large batch",
        probability="Low (1/year)",
        impact="Ranking model returns error for batch",
        detection="Triton error counter; p99 latency spike",
        mitigation="Retry with smaller batch; CPU logistic regression "
                   "fallback if GPU fleet is unhealthy",
        degradation_level="SIMPLE_RANKING",
    ),
    FailureMode(
        component="Kafka pipeline",
        failure="Consumer group lag > 10 minutes",
        probability="Medium (4/year)",
        impact="Streaming features are stale",
        detection="Kafka consumer lag monitoring",
        mitigation="Batch features used; alert at 10-min lag, "
                   "page at 30-min lag",
        degradation_level="STALE_FEATURES",
    ),
    FailureMode(
        component="Training pipeline",
        failure="Daily training fails for 72+ hours",
        probability="Low (1/year)",
        impact="Model drifts from current data distribution",
        detection="Model age monitoring; offline metric degradation",
        mitigation="Serve the last good model; page on-call engineer; "
                   "manual retraining fallback",
        degradation_level="FULL (stale model)",
    ),
]

print(f"Failure mode analysis: {len(failure_modes)} modes identified\n")
for fm in failure_modes:
    print(f"Component: {fm.component}")
    print(f"  Failure: {fm.failure}")
    print(f"  Probability: {fm.probability}")
    print(f"  Degradation: {fm.degradation_level}")
    print(f"  Mitigation: {fm.mitigation[:75]}...")
    print()
Failure mode analysis: 5 modes identified

Component: FAISS index
  Failure: Index corruption during update
  Probability: Low (1/year)
  Degradation: STALE_FEATURES
  Mitigation: Roll back to previous index version; CF + content-based retrieval as fallba...

Component: Redis cluster
  Failure: Primary node failure
  Probability: Medium (2/year)
  Degradation: STALE_FEATURES
  Mitigation: Sentinel promotes replica; batch features as fallback during promotion (5-1...

Component: GPU serving
  Failure: OOM on large batch
  Probability: Low (1/year)
  Degradation: SIMPLE_RANKING
  Mitigation: Retry with smaller batch; CPU logistic regression fallback if GPU fleet is ...

Component: Kafka pipeline
  Failure: Consumer group lag > 10 minutes
  Probability: Medium (4/year)
  Degradation: STALE_FEATURES
  Mitigation: Batch features used; alert at 10-min lag, page at 30-min lag...

Component: Training pipeline
  Failure: Daily training fails for 72+ hours
  Probability: Low (1/year)
  Degradation: FULL (stale model)
  Mitigation: Serve the last good model; page on-call engineer; manual retraining fallbac...
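
The degradation levels referenced above can be wired into a single routing decision. The level names come from the analysis; the health-check inputs and the BATCH_FALLBACK tier (serving the pre-computed recommendations from the batch fallback pipeline) are illustrative:

```python
from enum import IntEnum


class Degradation(IntEnum):
    FULL = 0            # all components healthy
    STALE_FEATURES = 1  # streaming features unavailable; batch features only
    SIMPLE_RANKING = 2  # GPU ranker unhealthy; CPU logistic regression
    BATCH_FALLBACK = 3  # serve pre-computed recommendations


def degradation_level(streaming_ok: bool, gpu_ok: bool,
                      retrieval_ok: bool) -> Degradation:
    """Pick the mildest degradation level consistent with component health."""
    if not retrieval_ok and not gpu_ok:
        return Degradation.BATCH_FALLBACK
    if not gpu_ok:
        return Degradation.SIMPLE_RANKING
    if not streaming_ok:
        return Degradation.STALE_FEATURES
    return Degradation.FULL
```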

Step 5: Deployment Plan

The team plans a phased rollout over 4 months:

Month 1: Infrastructure. Deploy Kafka, Redis cluster, and Feast feature store. Migrate the batch feature pipeline from the data scientist's laptop to Spark on the data lake. Establish feature definitions in the feature store schema. Deliverable: feature store serving both online and offline features for the existing matrix factorization model.

Month 2: Models. Train the two-tower retrieval model (Chapter 13) and DCN-V2 ranking model. Deploy to Triton Inference Server. Build the FAISS index update pipeline. Deliverable: new models running in shadow mode alongside the production MF model.

Month 3: Serving. Build the multi-stage serving pipeline (retrieval → ranking → re-ranking). Implement the fallback chain and circuit breakers. Deploy the batch fallback pipeline. Deliverable: new serving system handling 5% canary traffic.

Month 4: Rollout. Gradual rollout from 5% to 100%. Build the monitoring dashboard (latency, throughput, feature distributions, engagement metrics). Write ADRs for all major design decisions. Deliverable: full production system serving 100% of traffic.
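
A sketch of the traffic split used for the ramp. The hashing scheme is an assumption; the key property is that each user's assignment is stable as the percentage grows, so nobody flips back and forth between systems:

```python
import hashlib


def bucket(user_id: str) -> int:
    """Deterministically map a user id to a bucket in [0, 100)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def use_new_system(user_id: str, rollout_pct: int) -> bool:
    """Users in buckets below the rollout percentage get the new system."""
    return bucket(user_id) < rollout_pct
```

Because a user who enters at 5% stays in at every higher percentage, engagement comparisons during the ramp are not confounded by users switching systems.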

Outcome

After the 4-month migration, StreamRec's production system achieves:

  • p95 latency: 108ms (within the 200ms budget, with 92ms headroom).
  • Engagement: +18% over the batch MF baseline: +12% from session-aware features, +4% from improved retrieval models, and +2% from diversity re-ranking.
  • Cold-start engagement: +35% from the Bayesian preference model (Chapter 20), which provides meaningful recommendations after just 3-5 interactions.
  • Availability: 99.95% over the first quarter, exceeding the 99.9% SLA.
  • Incidents: 2 in the first quarter. One Redis failover (5 seconds of batch-feature degradation, no user-visible impact). One FAISS index corruption (15 minutes of reduced retrieval, caught by monitoring before quality degradation was measurable).

The ADR archive contains 14 decisions, including serving pattern (hybrid real-time + batch), feature store technology (Feast + Redis), ranking model architecture (DCN-V2 over transformer due to latency), re-ranking strategy (rule-based over ML-based), and retraining cadence (daily over weekly/hourly).

Lessons Learned

  1. The feature store was the hardest component to build and the most valuable. Achieving online-offline consistency required more engineering effort than the models themselves. But once the feature store was operational, feature skew — which had caused two production incidents in the prototype era — disappeared entirely.

  2. Shadow mode caught three bugs that would have degraded production. One: the new ranking model produced NaN scores for items with zero impressions (a feature the training data excluded). Two: the FAISS index returned item IDs that had been removed from the catalog. Three: the re-ranking diversity constraint was applied before promotional boosts, causing promoted items to sometimes be filtered.

  3. The batch fallback pipeline was used exactly twice in the first quarter — and both times it prevented user-visible impact. The engineering cost of maintaining dual serving infrastructure was debated extensively. In hindsight, the investment was justified by the two incidents where it preserved the user experience.

  4. ADRs saved weeks of re-discussion. When a new ML engineer joined in month 3 and asked "why not use a transformer for ranking?", the ADR documenting the latency analysis (transformer: 120ms p95 for 500 items vs. DCN-V2: 55ms p95) ended the conversation in 5 minutes instead of 5 meetings.