Case Study 2: StreamRec DP Training — Privacy-Utility Tradeoff at Scale

Context

The StreamRec recommendation system (Progressive Project, Chapters 6-28) serves 12 million monthly active users across three markets: the United States, the European Union, and Japan. The click-prediction model — an MLP with 40 input features (20 user features, 20 item features) — trains daily on the previous 30 days of user interaction data: approximately 50 million interaction records per training run.

In Q3, the European Data Protection Board issues updated guidance on the use of behavioral data for algorithmic personalization under GDPR Article 22. The guidance does not mandate differential privacy, but it establishes that "technical measures providing formal, quantifiable privacy protection" are considered best practice for automated decision-making that "significantly affects" individuals. StreamRec's EU counsel interprets this as a strong signal: deploying DP training would materially reduce regulatory risk for the EU market, which represents 35% of revenue.

The ML platform team is tasked with answering three questions:

  1. What is the privacy-utility tradeoff for the StreamRec click predictor at $\varepsilon \in \{1, 3, 8, \infty\}$?
  2. At what $\varepsilon$ does model quality degrade below the business-acceptable threshold (Recall@20 $\geq$ 0.18, a 10% relative degradation from the 0.20 baseline)?
  3. What is the engineering cost of integrating Opacus into the existing training pipeline?

Experimental Setup

The team designs a controlled experiment: four models trained on identical data with identical hyperparameters, differing only in $\varepsilon$. The baseline ($\varepsilon = \infty$) is the existing production model.

from dataclasses import dataclass, field
from typing import List, Dict
import numpy as np


@dataclass
class StreamRecDPConfig:
    """Configuration for one DP training experiment."""
    epsilon: float
    delta: float
    max_grad_norm: float
    batch_size: int
    epochs: int
    learning_rate: float
    noise_multiplier: float  # Computed from epsilon, delta, epochs, n, batch_size


@dataclass
class StreamRecDPResult:
    """Full results for one DP training experiment."""
    config: StreamRecDPConfig
    # Aggregate metrics
    accuracy: float
    auc_roc: float
    f1: float
    recall_at_20: float
    ndcg_at_20: float
    # Per-segment metrics
    segment_recall_at_20: Dict[str, float]
    # Training metrics
    final_train_loss: float
    convergence_epoch: int  # Epoch where loss stabilized
    wall_clock_hours: float
    # Privacy accounting
    actual_epsilon: float  # From RDP accountant
    noise_multiplier: float


# Experimental configurations
configs = [
    StreamRecDPConfig(
        epsilon=1.0, delta=1e-7, max_grad_norm=1.0,
        batch_size=4096, epochs=15, learning_rate=5e-4,
        noise_multiplier=2.41,
    ),
    StreamRecDPConfig(
        epsilon=3.0, delta=1e-7, max_grad_norm=1.0,
        batch_size=4096, epochs=15, learning_rate=5e-4,
        noise_multiplier=0.92,
    ),
    StreamRecDPConfig(
        epsilon=8.0, delta=1e-7, max_grad_norm=1.0,
        batch_size=4096, epochs=15, learning_rate=5e-4,
        noise_multiplier=0.44,
    ),
    StreamRecDPConfig(
        epsilon=float("inf"), delta=0.0, max_grad_norm=0.0,
        batch_size=4096, epochs=15, learning_rate=5e-4,
        noise_multiplier=0.0,
    ),
]

print("StreamRec DP Experiment Configurations")
print("=" * 70)
print(f"{'ε':>6s}  {'δ':>10s}  {'σ (noise)':>10s}  {'C (clip)':>9s}  "
      f"{'Batch':>6s}  {'Epochs':>7s}")
print("-" * 70)
for c in configs:
    eps_str = f"{c.epsilon:.1f}" if c.epsilon != float("inf") else "∞"
    delta_str = f"{c.delta:.0e}" if c.delta > 0 else "N/A"
    clip_str = f"{c.max_grad_norm:.1f}" if c.max_grad_norm > 0 else "N/A"
    print(f"{eps_str:>6s}  {delta_str:>10s}  {c.noise_multiplier:>10.2f}  "
          f"{clip_str:>9s}  {c.batch_size:>6d}  {c.epochs:>7d}")
StreamRec DP Experiment Configurations
======================================================================
     ε           δ   σ (noise)   C (clip)   Batch   Epochs
----------------------------------------------------------------------
   1.0       1e-07        2.41        1.0    4096       15
   3.0       1e-07        0.92        1.0    4096       15
   8.0       1e-07        0.44        1.0    4096       15
     ∞         N/A        0.00        N/A    4096       15

Hyperparameter Decisions

Batch size: 4096. Larger batches improve DP-SGD efficiency because the noise is added to the sum of gradients in the batch, and the signal (sum of true gradients) grows linearly with batch size while the noise standard deviation is fixed. Increasing from the production default of 512 to 4096 is the single highest-impact change for DP training. The team verified that the non-DP model's quality is insensitive to batch size in the 512-4096 range (Recall@20 varies by $<$0.003).
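The arithmetic behind this can be sketched directly. The values below are illustrative (σ from the $\varepsilon = 3$ configuration, the median per-example gradient norm from the team's profiling): the noise added to the summed gradient has a fixed standard deviation of $\sigma C$ per step, so the noise on the averaged gradient falls as $1/B$ while the signal does not.

```python
# Back-of-the-envelope sketch of the batch-size effect (illustrative values,
# not the team's measurements). DP-SGD adds Gaussian noise with standard
# deviation sigma * C to the *sum* of clipped per-example gradients, so the
# noise on the *mean* gradient shrinks as 1/B while the signal stays put.
sigma = 0.92   # noise multiplier from the epsilon = 3 configuration
C = 1.0        # clip norm
signal = 0.72  # median per-example gradient norm from profiling

for B in [512, 1024, 2048, 4096]:
    noise_on_mean = sigma * C / B    # std of the noise on the averaged gradient
    snr = signal / noise_on_mean     # rough per-step signal-to-noise ratio
    print(f"B={B:>5d}  noise std on mean gradient={noise_on_mean:.2e}  SNR~{snr:,.0f}")
```

Going from 512 to 4096 improves the per-step SNR by 8x. Note that a larger batch also raises the sampling rate, so the accountant must recalibrate σ for the same ε; in the team's experiments the net effect remained favorable.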

Max gradient norm $C = 1.0$. The team profiled gradient norms during non-DP training and found that the median per-example gradient norm is 0.72 and the 95th percentile is 1.83. Setting $C = 1.0$ clips approximately 30% of gradients — a moderate rate. Setting $C$ too low (e.g., 0.1) clips >99% of gradients, destroying useful signal. Setting $C$ too high (e.g., 10.0) adds 10x more noise per step. The team tested $C \in \{0.5, 1.0, 2.0, 5.0\}$ and found $C = 1.0$ optimal for all $\varepsilon$ levels.
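The clip rate can be sanity-checked against a gradient-norm profile. The sketch below uses a hypothetical lognormal distribution matched to the profiled median of 0.72 (the real distribution will differ); DP-SGD rescales each per-example gradient by $\min(1, C/\lVert g \rVert)$, so any norm above $C$ counts as clipped.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical per-example gradient norms: lognormal centered on the profiled
# median of 0.72 (illustrative shape, not the team's measured distribution)
norms = rng.lognormal(mean=np.log(0.72), sigma=0.5, size=100_000)

for C in [0.1, 0.5, 1.0, 2.0, 5.0]:
    # DP-SGD rescales each gradient by min(1, C / ||g||); norms above C are clipped
    clipped_frac = float(np.mean(norms > C))
    print(f"C={C:>4.1f}  fraction clipped = {clipped_frac:6.1%}")
```

At $C = 1.0$ this toy distribution clips roughly a quarter of gradients, in the same moderate range the team targeted; at $C = 0.1$ essentially everything is clipped.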

Epochs: 15 (up from 10). DP-SGD converges more slowly because noisy gradients provide a lower-fidelity optimization signal. The team increased the epoch count by 50% to give the DP models more optimization steps. Privacy accounting confirmed that 15 epochs with $\sigma = 0.92$ stays within $\varepsilon = 3.0$ (actual $\varepsilon = 2.87$ from the RDP accountant).
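The accounting inputs for the $\varepsilon = 3$ run are easy to reproduce from the figures above. The sketch below derives the sampling rate and step count; the commented-out call shows how σ would be calibrated with Opacus (recent versions expose this helper in opacus.accountants.utils).

```python
# Privacy-accounting inputs for the epsilon = 3 configuration (figures from the text)
n = 50_000_000    # interaction records per training run
batch_size = 4096
epochs = 15

sample_rate = batch_size / n            # Poisson sampling probability q
steps = int(epochs * n / batch_size)    # total noisy gradient steps

print(f"sample rate q = {sample_rate:.3e}")
print(f"total steps   = {steps:,}")

# With Opacus installed, sigma can be calibrated for a target budget directly:
# from opacus.accountants.utils import get_noise_multiplier
# sigma = get_noise_multiplier(target_epsilon=3.0, target_delta=1e-7,
#                              sample_rate=sample_rate, epochs=epochs)
```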

Results

# Full experimental results
results = [
    StreamRecDPResult(
        config=configs[0],
        accuracy=0.7234, auc_roc=0.7891, f1=0.7089,
        recall_at_20=0.1542, ndcg_at_20=0.1234,
        segment_recall_at_20={
            "power_users": 0.1812, "casual": 0.1423,
            "new_users": 0.0987, "eu_market": 0.1498,
            "us_market": 0.1589, "jp_market": 0.1478,
        },
        final_train_loss=0.6812, convergence_epoch=15,
        wall_clock_hours=8.2,
        actual_epsilon=0.97, noise_multiplier=2.41,
    ),
    StreamRecDPResult(
        config=configs[1],
        accuracy=0.7891, auc_roc=0.8412, f1=0.7812,
        recall_at_20=0.1876, ndcg_at_20=0.1567,
        segment_recall_at_20={
            "power_users": 0.2134, "casual": 0.1756,
            "new_users": 0.1234, "eu_market": 0.1834,
            "us_market": 0.1912, "jp_market": 0.1798,
        },
        final_train_loss=0.6298, convergence_epoch=12,
        wall_clock_hours=7.8,
        actual_epsilon=2.87, noise_multiplier=0.92,
    ),
    StreamRecDPResult(
        config=configs[2],
        accuracy=0.8156, auc_roc=0.8678, f1=0.8098,
        recall_at_20=0.2034, ndcg_at_20=0.1734,
        segment_recall_at_20={
            "power_users": 0.2312, "casual": 0.1923,
            "new_users": 0.1412, "eu_market": 0.1989,
            "us_market": 0.2067, "jp_market": 0.1967,
        },
        final_train_loss=0.5987, convergence_epoch=10,
        wall_clock_hours=7.4,
        actual_epsilon=7.82, noise_multiplier=0.44,
    ),
    StreamRecDPResult(
        config=configs[3],
        accuracy=0.8312, auc_roc=0.8823, f1=0.8267,
        recall_at_20=0.2187, ndcg_at_20=0.1876,
        segment_recall_at_20={
            "power_users": 0.2487, "casual": 0.2067,
            "new_users": 0.1534, "eu_market": 0.2134,
            "us_market": 0.2223, "jp_market": 0.2098,
        },
        final_train_loss=0.5634, convergence_epoch=8,
        wall_clock_hours=2.1,
        actual_epsilon=float("inf"), noise_multiplier=0.0,
    ),
]

# Aggregate metrics table
print("StreamRec DP Training: Aggregate Metrics")
print("=" * 88)
print(f"{'ε':>6s}  {'σ':>6s}  {'Accuracy':>10s}  {'AUC':>8s}  {'F1':>8s}  "
      f"{'R@20':>8s}  {'NDCG@20':>9s}  {'Hours':>7s}")
print("-" * 88)

baseline = results[-1]
for r in results:
    eps_str = f"{r.actual_epsilon:.2f}" if r.actual_epsilon != float("inf") else "∞"
    print(f"{eps_str:>6s}  {r.noise_multiplier:>6.2f}  {r.accuracy:>10.4f}  "
          f"{r.auc_roc:>8.4f}  {r.f1:>8.4f}  {r.recall_at_20:>8.4f}  "
          f"{r.ndcg_at_20:>9.4f}  {r.wall_clock_hours:>7.1f}")

# Relative degradation
print("\nRelative Degradation from Baseline (ε=∞)")
print("-" * 65)
for r in results[:-1]:
    acc_d = (1 - r.accuracy / baseline.accuracy) * 100
    auc_d = (1 - r.auc_roc / baseline.auc_roc) * 100
    r20_d = (1 - r.recall_at_20 / baseline.recall_at_20) * 100
    ndcg_d = (1 - r.ndcg_at_20 / baseline.ndcg_at_20) * 100
    print(f"  ε={r.actual_epsilon:>5.2f}: Acc -{acc_d:>5.1f}%, AUC -{auc_d:>5.1f}%, "
          f"R@20 -{r20_d:>5.1f}%, NDCG@20 -{ndcg_d:>5.1f}%")
StreamRec DP Training: Aggregate Metrics
========================================================================================
     ε       σ    Accuracy       AUC        F1      R@20    NDCG@20    Hours
----------------------------------------------------------------------------------------
  0.97    2.41      0.7234    0.7891    0.7089    0.1542     0.1234      8.2
  2.87    0.92      0.7891    0.8412    0.7812    0.1876     0.1567      7.8
  7.82    0.44      0.8156    0.8678    0.8098    0.2034     0.1734      7.4
     ∞    0.00      0.8312    0.8823    0.8267    0.2187     0.1876      2.1

Relative Degradation from Baseline (ε=∞)
-----------------------------------------------------------------
  ε= 0.97: Acc - 13.0%, AUC - 10.6%, R@20 - 29.5%, NDCG@20 - 34.2%
  ε= 2.87: Acc -  5.1%, AUC -  4.7%, R@20 - 14.2%, NDCG@20 - 16.5%
  ε= 7.82: Acc -  1.9%, AUC -  1.6%, R@20 -  7.0%, NDCG@20 -  7.6%

Segment-Level Analysis

Aggregate metrics mask disparate impacts on user segments. The team analyzes Recall@20 across six segments:

print("Segment-Level Recall@20")
print("=" * 82)

segments = ["power_users", "casual", "new_users", "eu_market", "us_market", "jp_market"]
header = f"{'Segment':>14s}"
for r in results:
    eps_label = "ε=∞" if r.actual_epsilon == float("inf") else f"ε={r.actual_epsilon}"
    header += f"  {eps_label:>10s}"
print(header)
print("-" * 82)

for seg in segments:
    row = f"{seg:>14s}"
    for r in results:
        val = r.segment_recall_at_20[seg]
        row += f"  {val:>10.4f}"
    print(row)

# Identify segments that fall below the 0.18 threshold
print("\nSegments Below R@20 ≥ 0.18 Threshold")
print("-" * 55)
for r in results:
    eps_str = f"{r.actual_epsilon:.2f}" if r.actual_epsilon != float("inf") else "∞"
    below = [s for s in segments if r.segment_recall_at_20[s] < 0.18]
    if below:
        print(f"  ε={eps_str}: {', '.join(below)}")
    else:
        print(f"  ε={eps_str}: None (all segments above threshold)")
Segment-Level Recall@20
==================================================================================
       Segment      ε=0.97      ε=2.87      ε=7.82         ε=∞
----------------------------------------------------------------------------------
   power_users      0.1812      0.2134      0.2312      0.2487
        casual      0.1423      0.1756      0.1923      0.2067
     new_users      0.0987      0.1234      0.1412      0.1534
     eu_market      0.1498      0.1834      0.1989      0.2134
     us_market      0.1589      0.1912      0.2067      0.2223
     jp_market      0.1478      0.1798      0.1967      0.2098

Segments Below R@20 ≥ 0.18 Threshold
-------------------------------------------------------
  ε=0.97: casual, new_users, eu_market, us_market, jp_market
  ε=2.87: casual, new_users, jp_market
  ε=7.82: new_users
  ε=∞: new_users

Key Findings

Finding 1: Ranking metrics are more sensitive to DP noise than classification metrics. Recall@20 degrades 29.5% at $\varepsilon = 1$, while accuracy degrades only 13.0%. This makes sense: accuracy measures whether the model gets the binary "click or not" decision right, which requires only that the score land on the correct side of a threshold. Recall@20 measures whether the correct items appear in the top 20, which requires accurate relative ordering across all items. Noise perturbs relative ordering far more than it flips threshold decisions.
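A toy simulation makes the asymmetry concrete (synthetic scores, not StreamRec data): adding the same noise to a set of item scores disturbs top-20 membership much more than it flips above/below-threshold decisions.

```python
import numpy as np

rng = np.random.default_rng(1)
n_items = 1000
true_scores = rng.normal(size=n_items)                       # hypothetical relevance scores
noisy_scores = true_scores + rng.normal(scale=0.3, size=n_items)

# Threshold decision (score > 0): analogous to binary click/no-click accuracy
binary_agreement = float(np.mean((true_scores > 0) == (noisy_scores > 0)))

# Top-20 membership: analogous to Recall@20's dependence on relative ordering
top20_true = set(np.argsort(true_scores)[-20:])
top20_noisy = set(np.argsort(noisy_scores)[-20:])
top20_overlap = len(top20_true & top20_noisy) / 20

print(f"threshold agreement: {binary_agreement:.2f}")
print(f"top-20 overlap:      {top20_overlap:.2f}")
```

The threshold decision survives the noise far better than top-20 membership does, mirroring the accuracy-vs-Recall@20 gap in the results table.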

Finding 2: New users are disproportionately harmed by DP. The new_users segment (users with fewer than 30 days of activity) has Recall@20 below 0.15 even without DP ($\varepsilon = \infty$: 0.1534), and DP drives it below 0.10 at $\varepsilon = 1$. This is because new users have sparse interaction histories, so their feature vectors carry less signal. DP noise has a larger relative impact on low-signal inputs.

Finding 3: $\varepsilon = 3$ is the inflection point for StreamRec. At $\varepsilon = 3$, three segments (casual, new_users, jp_market) fall below the 0.18 threshold. At $\varepsilon = 8$, only new_users falls below — and new_users is below the threshold even without DP. The business-acceptable operating point is therefore $\varepsilon = 8$ for the full user base, or $\varepsilon = 3$ if the team accepts additional investment in cold-start mitigation for new users.

Engineering Integration

The team documents the engineering changes required to integrate Opacus into the production training pipeline:

@dataclass
class EngineeringCostEstimate:
    """Engineering cost for DP integration."""
    component: str
    effort_weeks: float
    description: str
    risk: str  # "low", "medium", "high"


integration_costs = [
    EngineeringCostEstimate(
        "Model architecture changes", 1.0,
        "Replace BatchNorm with LayerNorm in all model variants; "
        "verify Opacus ModuleValidator.is_valid() passes; "
        "confirm non-DP quality is unchanged after architecture swap",
        "low",
    ),
    EngineeringCostEstimate(
        "Opacus training pipeline", 2.0,
        "Integrate PrivacyEngine into Dagster training DAG; "
        "add noise_multiplier auto-calibration; "
        "implement privacy budget tracking in MLflow metadata",
        "medium",
    ),
    EngineeringCostEstimate(
        "Privacy accounting dashboard", 1.5,
        "Real-time epsilon consumption tracking; "
        "alert when budget reaches 80% threshold; "
        "audit trail for regulatory compliance",
        "medium",
    ),
    EngineeringCostEstimate(
        "Gradient clipping tuning", 1.0,
        "Gradient norm profiling across model variants and data periods; "
        "automated clip norm selection based on gradient statistics; "
        "integration with hyperparameter sweep infrastructure",
        "low",
    ),
    EngineeringCostEstimate(
        "Batch size optimization", 0.5,
        "Scale training batch size from 512 to 4096; "
        "verify distributed training (DDP) handles larger batches; "
        "memory profiling on A100 GPUs",
        "low",
    ),
    EngineeringCostEstimate(
        "Behavioral testing updates", 1.0,
        "Update MFT thresholds to account for DP quality loss; "
        "add DP-specific behavioral tests (noise sensitivity, segment fairness); "
        "integration with Chapter 28 validation gate",
        "low",
    ),
    EngineeringCostEstimate(
        "A/B testing and shadow deployment", 2.0,
        "Shadow-serve DP model alongside production model for 2 weeks; "
        "online metrics comparison; "
        "gradual traffic ramp from 5% to 100% for EU market",
        "medium",
    ),
]

total_weeks = sum(c.effort_weeks for c in integration_costs)

print("Engineering Integration Cost Estimate")
print("=" * 80)
print(f"{'Component':>35s}  {'Weeks':>6s}  {'Risk':>8s}")
print("-" * 80)
for c in integration_costs:
    print(f"{c.component:>35s}  {c.effort_weeks:>6.1f}  {c.risk:>8s}")
print("-" * 80)
print(f"{'TOTAL':>35s}  {total_weeks:>6.1f}")
print(f"\nEstimated calendar time: {total_weeks * 1.5:.0f} weeks "
      f"(accounting for parallelization and review cycles)")
Engineering Integration Cost Estimate
================================================================================
                          Component   Weeks      Risk
--------------------------------------------------------------------------------
         Model architecture changes     1.0       low
           Opacus training pipeline     2.0    medium
       Privacy accounting dashboard     1.5    medium
           Gradient clipping tuning     1.0       low
            Batch size optimization     0.5       low
         Behavioral testing updates     1.0       low
  A/B testing and shadow deployment     2.0    medium
--------------------------------------------------------------------------------
                              TOTAL     9.0

Estimated calendar time: 14 weeks (accounting for parallelization and review cycles)

Decision

The StreamRec team recommends deploying with $\varepsilon = 8$ for the EU market as an initial step, with a roadmap to tighten to $\varepsilon = 3$ within 6 months as the team builds experience with DP-SGD tuning and invests in cold-start improvements for new users.

Justification for $\varepsilon = 8$ (initial):

  - Only 1.9% accuracy degradation and 7.0% Recall@20 degradation from baseline
  - All user segments except new_users (already underperforming) meet the 0.18 threshold
  - Formal DP guarantee satisfies the EDPB's "quantifiable privacy protection" guidance
  - 3.5x training time overhead is manageable within the daily training window (2.1h $\to$ 7.4h vs. 10h window)

Justification for $\varepsilon = 3$ (target):

  - 5.1% accuracy degradation and 14.2% Recall@20 degradation is borderline acceptable
  - Casual and Japanese market segments require quality mitigation before deployment
  - $\varepsilon = 3$ is within the range used by Apple (1-8 per data type per day) and represents a defensible privacy standard

Engineering timeline: 9 person-weeks of development, 14 calendar weeks from approval to EU deployment. The DP training infrastructure is reusable across all StreamRec model variants (click predictor, completion predictor, session model) and across the Credit Scoring anchor example where DP is also under regulatory consideration.

Lessons Learned

  1. Batch size is the most important hyperparameter for DP-SGD quality. Increasing batch size from 512 to 4096 improved Recall@20 by 0.015 at $\varepsilon = 3$ — equivalent to moving from $\varepsilon = 2.5$ to $\varepsilon = 3.5$ in privacy budget. This is the cheapest quality improvement available.

  2. Gradient norm profiling should precede clip norm selection. The team initially used $C = 1.0$ based on common practice. Profiling revealed that 30% of gradients were clipped, which was near-optimal. A team that had set $C = 0.1$ (from a different codebase) saw 60% quality degradation before diagnosing the issue.

  3. Privacy accounting method matters. Switching from basic composition to RDP accounting (Opacus default) reduced the reported $\varepsilon$ from 4.2 to 2.87 for identical training — the model was more private than basic accounting suggested. This is free privacy: no additional noise, no quality loss, just tighter analysis.

  4. Segment-level analysis is non-negotiable. The aggregate Recall@20 at $\varepsilon = 3$ (0.1876) is above the 0.18 threshold, but the segment-level view reveals three segments below threshold. Deploying based on aggregate metrics alone would have degraded the experience for specific user populations without the team's knowledge.

  5. DP training infrastructure is reusable. The Opacus integration, privacy accounting dashboard, and gradient profiling tools built for StreamRec are directly applicable to the credit scoring model (Chapter 31), reducing future DP integration cost from 9 weeks to approximately 3 weeks.
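Lesson 3's accounting point can be illustrated without reproducing Opacus's subsampled-Gaussian RDP analysis: the same phenomenon (identical mechanism, smaller reported ε under a tighter theorem) already appears with generic composition bounds. The numbers below are illustrative, not StreamRec's.

```python
import math

# Compose k identical eps_step-DP steps two ways (illustrative values)
k = 1000
eps_step = 0.01
delta_prime = 1e-7   # slack delta permitted by advanced composition

# Basic composition: per-step epsilons simply add
basic = k * eps_step

# Advanced composition (Dwork-Rothblum-Vadhan): sublinear in k for small eps_step
advanced = (math.sqrt(2 * k * math.log(1 / delta_prime)) * eps_step
            + k * eps_step * (math.exp(eps_step) - 1))

print(f"basic composition:    eps = {basic:.2f}")
print(f"advanced composition: eps = {advanced:.2f}  (at extra delta = {delta_prime:g})")
```

The tighter theorem reports a much smaller ε for the exact same training. RDP accounting tightens this further for the subsampled Gaussian, which is how Opacus reported 2.87 where basic composition gave 4.2.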