Case Study 1: StreamRec Track B Implementation — Standard Integration
Context
A two-person data science team at StreamRec — one ML engineer (focused on models and training) and one ML platform engineer (focused on infrastructure and deployment) — has 12 weeks to deliver a Track B capstone system. They have completed all 15 progressive project milestones individually, but the components have never been integrated. Each milestone was developed in its own notebook or repository, with its own data format assumptions, feature definitions, and evaluation scripts.
The goal is to integrate these components into a single production recommendation system that serves 50 million monthly active users with sub-200ms latency, provides causal impact estimates, passes a two-sided fairness audit, and is documented with a 5-page technical design document and three ADRs.
This case study traces the integration process week by week, highlighting the integration challenges that consumed the largest share of engineering time.
Week 1-2: Foundation — Feature Store and Data Contracts
The team began by stabilizing the feature store (M10, Chapter 25), reasoning that every downstream component depends on features and that schema disagreements are the most common source of integration bugs.
The first integration bug appeared within hours. The two-tower retrieval model (M5, Chapter 13) expected a feature called user_avg_session_length_7d as a float in seconds. The feature store produced avg_session_length as an integer in minutes. The model loaded successfully, consumed the feature without error, and produced embeddings — but the embeddings were wrong. There was no exception, no crash, no log message. The only signal was that Recall@500 dropped from 0.62 to 0.41.
The team traced the problem using the PSI-based drift detection from Chapter 30: the serving distribution of user_avg_session_length_7d had a PSI of 3.2 against the training distribution — well above the 0.25 alert threshold. This confirmed that the feature values were fundamentally different between training and serving.
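The PSI check that surfaced the bug can be sketched as follows. This is a minimal illustration, not the production drift detector from Chapter 30: the bucketing scheme, epsilon handling, and the synthetic training/serving samples are all assumptions.

```python
import random
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between two samples.

    Bucket edges come from quantiles of the expected (training)
    sample; a small epsilon avoids log(0) for empty buckets.
    """
    s = sorted(expected)
    edges = [s[min(len(s) - 1, int(i * len(s) / bins))]
             for i in range(1, bins)]

    def bucket(v):
        for j, e in enumerate(edges):
            if v < e:
                return j
        return bins - 1

    def proportions(sample):
        counts = [0] * bins
        for v in sample:
            counts[bucket(v)] += 1
        eps = 1e-6
        return [max(c / len(sample), eps) for c in counts]

    e_pct, a_pct = proportions(expected), proportions(actual)
    return sum((a - e) * math.log(a / e) for a, e in zip(a_pct, e_pct))

rng = random.Random(0)
train = [rng.gauss(480, 120) for _ in range(10_000)]  # seconds
serve = [rng.gauss(8, 2) for _ in range(10_000)]      # minutes misread as seconds
print(psi(train, train[:5000]) < 0.25)  # True: same distribution, below threshold
print(psi(train, serve) > 0.25)         # True: unit mismatch, far above threshold
```

Because PSI compares whole distributions, it flags the minutes-vs-seconds shift even though every individual value is in range.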
Resolution: The team defined a data contract (Chapter 28) specifying every feature's name, type, unit, and valid range. Both the training pipeline and the serving path validate against this contract. The contract became the first section of the technical design document.
```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class FeatureContract:
    """Data contract for a single feature.

    Specifies the name, type, unit, and valid range for a feature
    used in the StreamRec recommendation pipeline.

    Attributes:
        name: Canonical feature name.
        dtype: Expected data type ('float32', 'int64', 'str', etc.).
        unit: Physical unit (e.g., 'seconds', 'count', 'ratio').
        min_value: Minimum valid value (None for unbounded).
        max_value: Maximum valid value (None for unbounded).
        nullable: Whether null/missing values are permitted.
        description: Human-readable description.
    """

    name: str
    dtype: str
    unit: str
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    nullable: bool = False
    description: str = ""

    def validate(self, value) -> Tuple[bool, str]:
        """Validate a single feature value against the contract.

        Args:
            value: The value to validate.

        Returns:
            (valid, message) tuple.
        """
        if value is None:
            if self.nullable:
                return True, "null (permitted)"
            return False, f"Feature '{self.name}' is null but not nullable"
        if self.min_value is not None and value < self.min_value:
            return False, (
                f"Feature '{self.name}' value {value} below "
                f"minimum {self.min_value}"
            )
        if self.max_value is not None and value > self.max_value:
            return False, (
                f"Feature '{self.name}' value {value} above "
                f"maximum {self.max_value}"
            )
        return True, "valid"


@dataclass
class PipelineDataContract:
    """Collection of feature contracts for the full pipeline.

    Validates feature vectors against all contracts and reports
    violations.

    Attributes:
        features: Dictionary of feature name to contract.
    """

    features: Dict[str, FeatureContract] = field(default_factory=dict)

    def add(self, contract: FeatureContract) -> None:
        """Register a feature contract."""
        self.features[contract.name] = contract

    def validate_vector(
        self, feature_vector: Dict[str, Optional[float]]
    ) -> Tuple[bool, List[str]]:
        """Validate a full feature vector.

        Args:
            feature_vector: Dictionary of feature name to value.

        Returns:
            (all_valid, list_of_violation_messages) tuple.
        """
        violations = []
        # Check for missing features
        for name in self.features:
            if name not in feature_vector:
                violations.append(f"Missing feature: '{name}'")
        # Check for unexpected features
        for name in feature_vector:
            if name not in self.features:
                violations.append(f"Unexpected feature: '{name}'")
        # Validate present features
        for name, value in feature_vector.items():
            if name in self.features:
                valid, msg = self.features[name].validate(value)
                if not valid:
                    violations.append(msg)
        return len(violations) == 0, violations


# StreamRec feature contract (subset)
contract = PipelineDataContract()
contract.add(FeatureContract(
    name="user_avg_session_length_7d",
    dtype="float32",
    unit="seconds",
    min_value=0.0,
    max_value=7200.0,
    nullable=False,
    description="Average session length over last 7 days, in seconds.",
))
contract.add(FeatureContract(
    name="user_interaction_count_30d",
    dtype="int64",
    unit="count",
    min_value=0,
    max_value=10000,
    nullable=False,
    description="Total interactions in last 30 days.",
))
contract.add(FeatureContract(
    name="item_ctr_7d",
    dtype="float32",
    unit="ratio",
    min_value=0.0,
    max_value=1.0,
    nullable=True,
    description="Item CTR over last 7 days. Null for items < 7 days old.",
))

# Test: the bug that cost 4 hours
buggy_vector = {
    "user_avg_session_length_7d": 12,  # 12 minutes, not 12 seconds
    "user_interaction_count_30d": 45,
    "item_ctr_7d": 0.034,
}
valid, violations = contract.validate_vector(buggy_vector)
print(f"Valid: {valid}")
# The contract does NOT catch this value: 12 is in range whether it
# means 12 seconds or 12 minutes (720 seconds). The real fix was
# standardizing the unit convention, not just range validation.
```

```
Valid: True
```
The output reveals an important lesson: data contracts with range validation catch obvious errors but miss unit mismatches when values happen to fall within range. The value 12 is valid as seconds (12 seconds is plausible) but wrong when it actually represents 12 minutes (720 seconds). The deeper fix was to include the unit in the feature name (_seconds, _minutes) and add statistical validation: the training distribution's mean is ~480 seconds, so a serving distribution with mean ~8 (minutes, misinterpreted as seconds) would be caught by PSI even when range validation passes.
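A batch-level statistical check of the kind described above might look like the following sketch. The `DistributionCheck` class, its 3-sigma rule, and the training statistics are illustrative assumptions, not the team's implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DistributionCheck:
    """Batch-level sanity check complementing per-value range validation.

    Flags a serving batch whose mean deviates from the training mean
    by more than `max_sigma` training standard deviations. The
    threshold and statistics here are hypothetical.
    """

    feature: str
    train_mean: float
    train_std: float
    max_sigma: float = 3.0

    def check(self, values: List[float]) -> Tuple[bool, str]:
        batch_mean = sum(values) / len(values)
        z = abs(batch_mean - self.train_mean) / self.train_std
        if z > self.max_sigma:
            return False, (
                f"'{self.feature}' batch mean {batch_mean:.1f} is "
                f"{z:.1f} sigma from training mean {self.train_mean:.1f}"
            )
        return True, "ok"


# Training distribution: mean ~480 s. A batch of minute-valued data
# (mean ~8.6) passes per-value range checks but fails the batch check.
check = DistributionCheck("user_avg_session_length_7d", 480.0, 120.0)
ok, msg = check.check([12.0, 7.5, 9.0, 6.0])
print(ok)  # False: batch mean ~8.6 is ~3.9 sigma below the training mean
```

A production version would also account for batch size (standard error rather than raw sigma), but the principle is the same: validate distributions, not just individual values.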
Week 3-5: Model Integration — Retrieval and Ranking
With the feature store stabilized, the team integrated the two-tower retrieval model (M5) and MLP ranker (M2).
Integration challenge: embedding versioning. The two-tower model produces 128-dimensional item embeddings. These embeddings are pre-computed and stored in a FAISS index. When the model is retrained (weekly), the embedding space changes — item 42's embedding in model v1 is not comparable to item 42's embedding in model v2. If the FAISS index contains v1 embeddings but the user tower computes v2 user embeddings, retrieval quality degrades catastrophically.
Resolution: The team implemented co-versioned artifacts — the FAISS index and the model share a version tag. The deployment pipeline rebuilds the FAISS index whenever a new model is promoted, and both artifacts are deployed atomically. The model registry (MLflow, Chapter 29) tracks the correspondence.
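Co-versioning can be enforced with a simple deployment gate. The `Artifact` dataclass, version tags, and URIs below are hypothetical illustrations, not the team's MLflow setup; in the real pipeline this check would query the model registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    """A deployable artifact carrying a shared version tag."""

    name: str
    version: str  # shared tag across co-produced artifacts
    uri: str


def deployable(model: Artifact, index: Artifact) -> bool:
    """Gate: the user tower and its FAISS index must be co-versioned.

    Rejects any pairing where the embedding spaces could differ.
    """
    return model.version == index.version


model_v2 = Artifact("two_tower_user", "v2", "s3://models/two_tower/v2")
index_v1 = Artifact("faiss_items", "v1", "s3://indexes/faiss/v1")
index_v2 = Artifact("faiss_items", "v2", "s3://indexes/faiss/v2")

print(deployable(model_v2, index_v1))  # False: v2 embeddings vs. v1 index
print(deployable(model_v2, index_v2))  # True: atomically promoted pair
```

The gate is trivial by design; the hard part is the pipeline discipline of always producing the model and index together so the tags can never diverge.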
Integration challenge: ranking feature mismatch. The MLP ranker (M2, Chapter 6) was trained with 47 features. The feature store provided 52 features: the additional 5 were streaming features added in M10 (Chapter 25) after the MLP was trained. The model consumed the first 47 values of the feature vector without complaint, but the feature ordering differed between training and serving. Features were indexed by position, not by name, so feature 12 (which the model expected to be user_avg_ctr) was actually user_region_id, a categorical feature being consumed as a float.
Resolution: The team switched from positional to named feature access. The model's forward() method now accepts a dictionary of feature names to values, selecting the features it needs by name. This added approximately 0.5ms of overhead but eliminated an entire class of silent integration bugs.
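The switch from positional to named access can be illustrated with a toy ranker. `NamedFeatureRanker` and its linear weights are hypothetical stand-ins for the MLP's forward() change, not the team's code.

```python
from typing import Dict, List


class NamedFeatureRanker:
    """Toy ranker illustrating named (not positional) feature access.

    `feature_names` is the ordered feature list the model was trained
    with; the serving feature store may return a superset in any order.
    """

    def __init__(self, feature_names: List[str], weights: List[float]):
        self.feature_names = feature_names
        self.weights = weights  # one weight per named feature

    def forward(self, features: Dict[str, float]) -> float:
        # Select by name, in training order. A missing feature raises
        # KeyError instead of silently shifting every later position.
        x = [features[name] for name in self.feature_names]
        return sum(w * v for w, v in zip(self.weights, x))


model = NamedFeatureRanker(["user_avg_ctr", "item_ctr_7d"], [1.0, 2.0])
# Extra streaming features are ignored; dict ordering is irrelevant.
score = model.forward({"item_ctr_7d": 0.03, "user_region_id": 7.0,
                       "user_avg_ctr": 0.12})
print(round(score, 2))  # 0.18
```

The failure mode also inverts usefully: a renamed or dropped feature now fails loudly at serving time rather than silently corrupting predictions.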
Week 6-8: Evaluation Infrastructure
The team built the three-level evaluation framework (Section 36.5) and ran the first full system evaluation.
Results:
| Metric | Value | Threshold | Status |
|---|---|---|---|
| Retrieval Recall@500 | 0.62 | 0.50 | PASS |
| Ranking Hit@10 | 0.22 | 0.18 | PASS |
| Ranking NDCG@10 | 0.13 | 0.10 | PASS |
| p99 latency | 178ms | 200ms | PASS |
| Feature store hit rate | 0.994 | 0.99 | PASS |
| Training-serving skew (PSI) | 0.03 | 0.10 | PASS |
| Causal ATE (engagement) | 0.041 | 0.01 | PASS |
| CTR improvement vs. baseline | 0.034 | 0.01 | PASS |
| Creator exposure equity | 0.71 | 0.60 | PASS |
| User Hit@10 disparity | 0.06 | 0.10 | PASS |
All metrics passed. But the causal evaluation revealed that the naive CTR improvement (8.7%) was more than double the causal ATE (4.1%), confirming that the recommendation system was partially reinforcing existing behavior rather than causing new engagement.
The fairness audit revealed a cross-cutting concern. Creator exposure equity was 0.71 overall, above the 0.60 threshold. But the intersectional analysis showed that new Arabic-language creators had an exposure equity ratio of 0.11 — nearly invisible. The root causes (identified in Chapter 31, Case Study 2) were retrieval bias, ranking signal poverty, and language mismatch. The team added a re-ranking intervention: a minimum exposure guarantee that reserved 5% of recommendation slots for items from the most underserved creator groups. This improved the worst-case intersectional equity from 0.11 to 0.23 while reducing overall Hit@10 by only 1.1% relative (from 0.222 to 0.220).
Week 9-10: Deployment and Monitoring
The team implemented the canary deployment pipeline (M13, Chapter 29) and monitoring dashboard (M14, Chapter 30).
Integration challenge: canary metric conflicts. The canary deployment evaluates the new model against the current champion using online metrics (CTR, completion rate). But the fairness module evaluates using exposure equity, which requires aggregating impressions over days rather than hours. The canary's 3-day evaluation window was long enough for CTR but too short for reliable exposure equity estimation (small sample sizes for intersectional groups). The team faced a tradeoff: extend the canary window (slower deployment) or relax the fairness threshold for canary evaluation (lower fairness confidence).
Resolution (ADR-003): The team implemented a two-phase canary. Phase 1 (3 days, 10% traffic) evaluates predictive metrics (Hit@10, CTR, latency). Phase 2 (4 days, 25% traffic) evaluates fairness metrics (exposure equity, user disparity) with the larger traffic share providing sufficient sample sizes for intersectional analysis. This extended the total deployment timeline from 3 to 7 days but ensured that fairness was not sacrificed for deployment speed.
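The two-phase gate from ADR-003 might be expressed as follows. `PhaseGate` is a hypothetical sketch restricted to higher-is-better metrics; the intersectional equity threshold is an illustrative assumption, while the other thresholds mirror the evaluation table.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PhaseGate:
    """One canary phase: traffic share, duration, and metric thresholds."""

    name: str
    traffic_pct: int
    days: int
    thresholds: Dict[str, float]

    def passes(self, observed: Dict[str, float]) -> bool:
        # Missing metrics fail closed (treated as -inf).
        return all(observed.get(m, float("-inf")) >= t
                   for m, t in self.thresholds.items())


phase1 = PhaseGate("predictive", traffic_pct=10, days=3,
                   thresholds={"hit_at_10": 0.18, "ctr_lift": 0.01})
phase2 = PhaseGate("fairness", traffic_pct=25, days=4,
                   thresholds={"exposure_equity": 0.60,
                               "intersectional_equity": 0.20})


def promote(p1_metrics: Dict[str, float], p2_metrics: Dict[str, float]) -> bool:
    # Phase 2 runs only if phase 1 passed; both must pass to promote.
    return phase1.passes(p1_metrics) and phase2.passes(p2_metrics)


print(promote({"hit_at_10": 0.22, "ctr_lift": 0.034},
              {"exposure_equity": 0.71, "intersectional_equity": 0.23}))
# True
```

Failing closed on missing metrics encodes the ADR's intent: a canary that has not yet accumulated enough fairness data cannot be promoted early.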
Week 11-12: Documentation and Presentation
The final two weeks produced the technical design document, three ADRs, the stakeholder presentation, and the retrospective.
ADRs written:

1. ADR-001: Two-tower model as primary retrieval (Section 36.4 example)
2. ADR-002: Named feature access vs. positional indexing (motivated by the Week 3-5 integration bug)
3. ADR-003: Two-phase canary deployment for concurrent predictive and fairness evaluation
Time allocation (actual):
| Activity | Planned Hours | Actual Hours | Delta |
|---|---|---|---|
| Feature store and contracts | 16 | 22 | +6 |
| Model integration | 24 | 30 | +6 |
| Evaluation infrastructure | 16 | 18 | +2 |
| Deployment and monitoring | 20 | 24 | +4 |
| Documentation and presentation | 16 | 14 | -2 |
| Integration debugging (unplanned) | 0 | 28 | +28 |
| Total | 92 | 136 | +44 |
The 28 hours of unplanned integration debugging — 21% of total effort — dominated the schedule overrun. Every hour was spent on bugs that could not have been found by testing components in isolation: feature name mismatches, embedding version conflicts, canary metric timing issues, and the cold-start/fairness cross-cutting concern.
Lessons
- The feature contract is the most important artifact after the model. Every downstream integration bug traced back to an assumption about feature names, types, units, or availability. Define the contract first. Validate against it everywhere.
- Co-version all coupled artifacts. The model and its FAISS index must have the same version. The model and its feature schema must have the same version. Any artifacts that are consumed together must be produced together.
- Named access eliminates positional bugs. The 0.5ms cost of dictionary-based feature lookup is negligible compared to the debugging cost of a single positional mismatch.
- Fairness evaluation requires more data than predictive evaluation. Intersectional groups are small by definition. Canary deployments must allocate sufficient traffic and time for reliable fairness measurement. This is a system design constraint, not just an analysis requirement.
- Integration debugging is predictable in aggregate but unpredictable in specifics. Every capstone project should budget 20-25% of total effort for integration debugging that cannot be anticipated. The specific bugs will differ; the total time will not.