Case Study 1: StreamRec Feature Store — From Ad Hoc Features to Production Infrastructure

Context

Six months ago, the StreamRec data science team trained the ranking model (Chapter 24's DCN-V2) using a training dataset built by a single data scientist on a laptop. The process:

  1. Query the PostgreSQL production database for user interaction events.
  2. Compute features in a Jupyter notebook using pandas: trailing window aggregations, category distributions, session statistics.
  3. Save the feature matrix as a CSV file.
  4. Train the model in PyTorch.
  5. For serving, rewrite the feature computation in the backend's Go codebase, reading from the same PostgreSQL database.

This worked for the prototype. It stopped working at scale.

The Incidents

Incident 1: The completion rate mismatch. In week 3 of production, the ML team noticed that the ranking model's online performance was 18% lower than offline evaluation predicted (NDCG@20: 0.38 offline vs. 0.31 observed online). After two days of debugging, they discovered that the pandas feature computation used event_type == 'completion' while the Go serving code used event_type == 'complete' — a string literal difference. The feature user_7d_completion_rate was therefore zero for every user in production: the model was making predictions from a feature that was always 0.0, when it had trained on values in [0, 1].
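The skew in Incident 1 is easy to reproduce in miniature. A minimal sketch (with hypothetical event data) of how a single string literal zeroes out a feature:

```python
import pandas as pd

# Hypothetical interaction events; 'completion' is the value actually logged.
events = pd.DataFrame({
    "user_id": [1, 1, 1, 2, 2],
    "event_type": ["completion", "click", "completion", "click", "completion"],
})

def completion_rate(df: pd.DataFrame, completion_value: str) -> pd.Series:
    """Fraction of each user's events matching the given completion label."""
    is_completion = df["event_type"] == completion_value
    return is_completion.groupby(df["user_id"]).mean()

# Training path (pandas) matched on 'completion'; serving path on 'complete'.
train_values = completion_rate(events, "completion")  # user 1: 2/3, user 2: 0.5
serve_values = completion_rate(events, "complete")    # 0.0 for every user

print(train_values.tolist())
print(serve_values.tolist())  # [0.0, 0.0]
```

A unit test comparing the two paths on shared fixture data would have caught this before deployment.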

Incident 2: The timezone leak. In week 7, a data scientist joined the team and rebuilt the training dataset from scratch. The new dataset produced a model with suspiciously high offline NDCG (0.52, up from 0.38). Investigation revealed that the new computation used event_timestamp directly from PostgreSQL, which stores timestamps in UTC. The Go serving code, however, converted to the user's local timezone before computing time-of-day features. The training data included hour_of_day in UTC; the serving path computed it in local time. The model learned that UTC hour 14 (2 PM) predicted high engagement — but at serving time, "hour 14" meant 2 PM local time, which could be midnight UTC. The feature was informative but in the wrong reference frame.
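Incident 2's reference-frame mismatch is equally mechanical. A sketch with a hypothetical user in Los Angeles (the timestamp and timezone are illustrative):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Event stored in UTC, as PostgreSQL does.
event_utc = datetime(2024, 3, 1, 14, 0, tzinfo=timezone.utc)

# Training path: hour taken directly from the UTC timestamp.
train_hour = event_utc.hour  # 14

# Serving path: timestamp converted to the user's local timezone first.
serve_hour = event_utc.astimezone(ZoneInfo("America/Los_Angeles")).hour  # 6

print(train_hour, serve_hour)
```

The same instant yields hour 14 in one path and hour 6 in the other; the feature is informative in both frames, but the model only ever saw one of them.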

Incident 3: The backfill disaster. In week 12, the team added a new feature: user_category_entropy — a measure of how diverse the user's category preferences are. To compute historical values for training, a data scientist wrote a SQL query against the production database. The query had no time boundary, so it computed entropy using the user's entire interaction history — including events after the training example's timestamp. The feature contained future information. The retrained model appeared to improve by 8% on NDCG@20, but the improvement vanished in the A/B test. The team spent three weeks investigating before identifying the data leakage.
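The leakage in Incident 3 comes down to a missing time boundary in the aggregation. A sketch with a hypothetical single-user history, computing the entropy feature both ways:

```python
import numpy as np
import pandas as pd

# Hypothetical interaction log for one user.
events = pd.DataFrame({
    "category": ["news", "sports", "news", "music", "music", "music"],
    "ts": pd.to_datetime([
        "2024-01-01", "2024-01-05", "2024-01-10",
        "2024-02-01", "2024-02-05", "2024-02-10",
    ]),
})

def category_entropy(df: pd.DataFrame) -> float:
    """Shannon entropy (bits) of the user's category distribution."""
    p = df["category"].value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())

example_ts = pd.Timestamp("2024-01-15")  # timestamp of a training example

leaky = category_entropy(events)  # full history, including future events
correct = category_entropy(events[events["ts"] < example_ts])  # point-in-time

print(round(leaky, 3), round(correct, 3))
```

The leaky value reflects the music binge that happened *after* the training example's timestamp; the point-in-time value does not. The two differ whenever the user's behavior drifts, which is exactly when the feature matters.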

Each incident had the same root cause: training and serving used different code paths to compute features. The feature store was the solution.

The Design

Requirements

The team defined five requirements:

Requirement                  | Specification                                    | Source
Online-offline consistency   | Zero feature skew incidents                      | Incidents 1 and 2
Point-in-time correctness    | All training features respect event timestamps   | Incident 3
Latency                      | Online feature lookup p95 < 15ms                 | Chapter 24 latency budget
Freshness                    | Batch features: < 24h; streaming features: < 60s | A/B test: session features improve engagement by 12%
Scale                        | 50M users, 200K items, 10K req/s peak            | Product growth forecast

Technology Selection

The team evaluated three options:

from dataclasses import dataclass


@dataclass
class FeatureStoreOption:
    """Evaluation of a candidate feature store technology.

    Attributes:
        name: Technology name.
        type: Open source or commercial.
        online_store: Backend for low-latency serving.
        offline_store: Backend for historical features.
        pit_join_support: Point-in-time join support quality.
        streaming_support: Native streaming feature support.
        estimated_monthly_cost: Estimated infrastructure cost.
        engineering_effort_weeks: Estimated setup and integration time.
        verdict: Team's assessment.
    """
    name: str
    type: str
    online_store: str
    offline_store: str
    pit_join_support: str
    streaming_support: str
    estimated_monthly_cost: float
    engineering_effort_weeks: int
    verdict: str


options = [
    FeatureStoreOption(
        name="Feast (open source)",
        type="open source",
        online_store="Redis (self-managed)",
        offline_store="Delta Lake on S3",
        pit_join_support="Built-in via get_historical_features",
        streaming_support="StreamFeatureView + Flink integration",
        estimated_monthly_cost=11550,
        engineering_effort_weeks=8,
        verdict="SELECTED. Best control, lowest cost, team has Redis expertise.",
    ),
    FeatureStoreOption(
        name="Tecton",
        type="commercial",
        online_store="DynamoDB (managed)",
        offline_store="S3 (managed)",
        pit_join_support="Built-in, optimized",
        streaming_support="Native Spark Structured Streaming",
        estimated_monthly_cost=28000,
        engineering_effort_weeks=4,
        verdict="Rejected. Cost too high for current stage. Revisit at 100M users.",
    ),
    FeatureStoreOption(
        name="SageMaker Feature Store",
        type="commercial (AWS)",
        online_store="DynamoDB (auto-provisioned)",
        offline_store="S3 (auto-provisioned)",
        pit_join_support="Via FeatureGroup API",
        streaming_support="Kinesis integration",
        estimated_monthly_cost=18000,
        engineering_effort_weeks=5,
        verdict="Rejected. AWS lock-in; team prefers multi-cloud flexibility.",
    ),
]

for opt in options:
    print(f"{opt.name} ({opt.type})")
    print(f"  Monthly cost: ${opt.estimated_monthly_cost:,.0f}")
    print(f"  Setup effort: {opt.engineering_effort_weeks} weeks")
    print(f"  Verdict: {opt.verdict}")
    print()
Feast (open source) (open source)
  Monthly cost: $11,550
  Setup effort: 8 weeks
  Verdict: SELECTED. Best control, lowest cost, team has Redis expertise.

Tecton (commercial)
  Monthly cost: $28,000
  Setup effort: 4 weeks
  Verdict: Rejected. Cost too high for current stage. Revisit at 100M users.

SageMaker Feature Store (commercial (AWS))
  Monthly cost: $18,000
  Setup effort: 5 weeks
  Verdict: Rejected. AWS lock-in; team prefers multi-cloud flexibility.

Implementation

The team implemented the feature store in three phases over 8 weeks:

Phase 1 (Weeks 1-3): Batch features. Define 12 batch feature views (7 user features, 5 item features) in Feast. Migrate the Spark batch pipeline to write to Delta Lake in the format Feast expects. Materialize to Redis. Modify the serving code to call Feast's online retrieval API instead of computing features in Go. Result: Incidents 1 and 2 are structurally eliminated — there is no separate Go code path.
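Phase 1's feature views live in Feast's declarative Python DSL. A sketch of what one batch feature view might look like (the entity, source path, TTL, and field names are illustrative assumptions, not the team's actual repository):

```python
# feature_views.py -- illustrative Feast definitions (names and paths assumed)
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

user = Entity(name="user", join_keys=["user_id"])

user_stats_source = FileSource(
    path="s3://streamrec-features/user_stats",  # output of the Spark batch job
    timestamp_field="event_timestamp",
)

user_stats = FeatureView(
    name="user_stats",
    entities=[user],
    ttl=timedelta(days=2),
    schema=[
        Field(name="user_7d_completion_rate", dtype=Float32),
        Field(name="user_7d_click_count", dtype=Int64),
    ],
    source=user_stats_source,
)
```

Because both training (via get_historical_features) and serving (via the online retrieval API) read from this single definition, there is no second implementation to drift.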

Phase 2 (Weeks 4-6): Streaming features. Deploy Flink to compute 5 real-time user features from the Kafka click stream. Define stream feature views in Feast backed by the Flink output. Write to Redis with a 2-hour TTL. Result: Session features are now available at serving time with < 60s latency.
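The Phase 2 streaming features are incremental aggregates over the click stream. A toy per-event reducer (hypothetical event schema; the real pipeline is Flink, with windowing and state management this sketch omits) shows the shape of the computation:

```python
from dataclasses import dataclass


@dataclass
class SessionStats:
    """Running session aggregate, updated once per event (illustrative)."""
    clicks: int = 0
    completions: int = 0

    def update(self, event_type: str) -> None:
        self.clicks += event_type == "click"
        self.completions += event_type == "completion"

    @property
    def completion_rate(self) -> float:
        n = self.clicks + self.completions
        return self.completions / n if n else 0.0


stats = SessionStats()
for event_type in ["click", "completion", "click"]:
    stats.update(event_type)
print(stats.clicks, stats.completions, round(stats.completion_rate, 2))
```

The Flink job maintains equivalent per-user state and writes the resulting feature values to Redis as events arrive.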

Phase 3 (Weeks 7-8): Point-in-time joins and training integration. Integrate get_historical_features into the training pipeline. Rebuild the training dataset using point-in-time correct features. Retrain the ranking model. Result: Incident 3 is structurally eliminated — the feature store enforces temporal correctness.
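Under the hood, a point-in-time join pairs each training example with the most recent feature value at or before the example's timestamp. A minimal pandas sketch of the semantics (hypothetical data; Feast's get_historical_features performs this join at scale):

```python
import pandas as pd

# Feature values as they were materialized over time.
features = pd.DataFrame({
    "user_id": [1, 1, 1],
    "ts": pd.to_datetime(["2024-01-01", "2024-01-08", "2024-01-15"]),
    "user_7d_completion_rate": [0.2, 0.5, 0.9],
}).sort_values("ts")

# Training examples with their event timestamps.
examples = pd.DataFrame({
    "user_id": [1, 1],
    "ts": pd.to_datetime(["2024-01-05", "2024-01-20"]),
}).sort_values("ts")

# As-of join: each example gets the latest feature value at or before its ts,
# so no example can see a feature value from its own future.
training = pd.merge_asof(examples, features, on="ts", by="user_id")
print(training["user_7d_completion_rate"].tolist())  # [0.2, 0.9]
```

The Jan 5 example receives the Jan 1 value, never the Jan 8 one: this is the temporal boundary that the Incident 3 backfill query lacked.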

Validation

Before deploying the new system, the team ran a consistency validation:

from dataclasses import dataclass

import numpy as np


@dataclass
class ConsistencyValidationResult:
    """Result of validating batch-stream consistency for a feature.

    Compares feature values computed by the batch pipeline (Spark)
    and the streaming pipeline (Flink) for the same entities and
    time windows.

    Attributes:
        feature_name: Name of the feature being validated.
        sample_size: Number of entity-time pairs compared.
        mean_absolute_error: Mean |batch_value - stream_value|.
        max_absolute_error: Maximum |batch_value - stream_value|.
        correlation: Pearson correlation between batch and stream values.
        percent_within_tolerance: Fraction of pairs within tolerance.
        tolerance: Absolute tolerance threshold.
        passed: Whether the validation passed.
    """
    feature_name: str
    sample_size: int
    mean_absolute_error: float
    max_absolute_error: float
    correlation: float
    percent_within_tolerance: float
    tolerance: float = 0.01
    passed: bool = True

    def __post_init__(self):
        self.passed = (
            self.percent_within_tolerance >= 0.99
            and self.correlation >= 0.999
        )


# Simulated validation results for StreamRec features
np.random.seed(42)
validation_results = [
    ConsistencyValidationResult(
        feature_name="user_session_click_count",
        sample_size=50000,
        mean_absolute_error=0.002,
        max_absolute_error=1.0,
        correlation=0.9998,
        percent_within_tolerance=0.998,
    ),
    ConsistencyValidationResult(
        feature_name="user_session_completion_rate",
        sample_size=50000,
        mean_absolute_error=0.0008,
        max_absolute_error=0.05,
        correlation=0.9999,
        percent_within_tolerance=0.999,
    ),
    ConsistencyValidationResult(
        feature_name="user_session_duration_sec",
        sample_size=50000,
        mean_absolute_error=0.15,
        max_absolute_error=3.2,
        correlation=0.9997,
        percent_within_tolerance=0.994,
        tolerance=0.5,
    ),
]

print("Feature Consistency Validation Report")
print("=" * 60)
for result in validation_results:
    status = "PASS" if result.passed else "FAIL"
    print(f"\n{result.feature_name}: [{status}]")
    print(f"  Sample size: {result.sample_size:,}")
    print(f"  Mean absolute error: {result.mean_absolute_error:.4f}")
    print(f"  Max absolute error: {result.max_absolute_error:.4f}")
    print(f"  Correlation: {result.correlation:.4f}")
    print(f"  Within tolerance ({result.tolerance}): "
          f"{result.percent_within_tolerance:.1%}")
Feature Consistency Validation Report
============================================================

user_session_click_count: [PASS]
  Sample size: 50,000
  Mean absolute error: 0.0020
  Max absolute error: 1.0000
  Correlation: 0.9998
  Within tolerance (0.01): 99.8%

user_session_completion_rate: [PASS]
  Sample size: 50,000
  Mean absolute error: 0.0008
  Max absolute error: 0.0500
  Correlation: 0.9999
  Within tolerance (0.01): 99.9%

user_session_duration_sec: [PASS]
  Sample size: 50,000
  Mean absolute error: 0.1500
  Max absolute error: 3.2000
  Correlation: 0.9997
  Within tolerance (0.5): 99.4%

The max absolute error for user_session_click_count (1.0) represents a single click difference caused by Flink's late event handling — events that arrive after the window closes are counted in the next window. The team accepted this as within tolerance.
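The late-event behavior can be illustrated with a toy tumbling-window counter (hypothetical timestamps; real Flink windowing involves watermarks and allowed lateness, which this sketch omits):

```python
from collections import Counter

WINDOW_SEC = 60

def assign_window(arrival_time_sec: int) -> int:
    """Start of the tumbling window an event is counted in."""
    return (arrival_time_sec // WINDOW_SEC) * WINDOW_SEC

# An event that occurred at t=59 but arrived at t=61, after the [0, 60)
# window closed, is counted in the [60, 120) window instead.
arrival_times = [10, 30, 59, 61]
counts = Counter(assign_window(t) for t in arrival_times)
print(dict(counts))  # {0: 3, 60: 1}
```

The batch pipeline, which sees the full log after the fact, would count all four events in the first window; hence the occasional off-by-one the validation observed.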

Results

After deploying the feature store and retraining the ranking model on point-in-time correct features:

Metric                                   | Before Feature Store | After Feature Store    | Change
NDCG@20 (offline)                        | 0.38                 | 0.41                   | +7.9%
NDCG@20 (online, observed)               | 0.31                 | 0.40                   | +29.0%
Online-offline gap                       | 0.07                 | 0.01                   | -85.7%
Feature skew incidents (per quarter)     | 3                    | 0                      | -100%
Feature computation time (training data) | 4.5 hours (manual)   | 35 minutes (automated) | -87.1%
New feature deployment time              | 2-3 weeks            | 2-3 days               | -85.7%

The most striking result is the collapse of the online-offline gap from 0.07 to 0.01. The offline NDCG rose only modestly (from 0.38 to 0.41, and the original 0.38 was itself inflated by residual data leakage), while the online NDCG increased dramatically (from 0.31 to 0.40). The feature store did not make the model better on paper — it made it honest, and honesty translated to production performance.

Lessons Learned

  1. The feature store's primary value is not performance — it is correctness. The team expected the feature store to enable real-time features (which it did). The larger impact was eliminating training-serving skew, which they had not fully appreciated as a problem.

  2. Batch-stream consistency testing is non-negotiable. The validation step caught a subtle difference in late-event handling that would have created a new source of skew. This test now runs weekly.

  3. Point-in-time joins are expensive but essential. The initial get_historical_features call took 45 minutes for 10M training examples. After optimizing with partitioned Delta Lake tables and predicate pushdown, this dropped to 12 minutes. The time is well spent: the alternative — manual joins — took 4.5 hours and produced leaky data.

  4. Default values matter more than model architecture for cold-start users. Switching from zero defaults to global-mean defaults for new users improved their engagement by 8% — a larger lift than the last three model architecture changes combined.

  5. The feature store is infrastructure, not a project. It requires ongoing maintenance: monitoring materialization latency, upgrading Redis, adding new features, deprecating old ones. The team allocated one ML engineer (20% of their time) to feature store operations.
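The default-value policy from lesson 4 can be sketched as a fallback at lookup time (the feature table, values, and function name here are hypothetical):

```python
# Hypothetical online lookup with a fallback policy for unseen users.
feature_table = {"u1": 0.62, "u2": 0.48}  # user_7d_completion_rate by user
global_mean = sum(feature_table.values()) / len(feature_table)

def get_feature(user_id: str, default: float) -> float:
    """Return the stored feature value, or the fallback for unseen users."""
    return feature_table.get(user_id, default)

# A zero default tells the model "this user never completes anything";
# a global-mean default says "assume an average user until we know more".
print(get_feature("new_user", default=0.0))          # 0.0
print(get_feature("new_user", default=global_mean))  # 0.55
```

The model was trained almost entirely on users with values in a realistic range, so the global-mean fallback keeps cold-start inputs inside the distribution the model knows.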