Case Study 2: StreamRec Conformal Prediction Sets and Calibration — Knowing When to Recommend and When to Explore

Context

The StreamRec recommendation system (progressive project, Chapters 6-32) serves click-prediction scores to 12 million monthly active users. The ranking pipeline (Chapter 24) uses these scores to order candidate items: the top-K items by predicted click probability are shown to each user. The system was deployed with the continuous training pipeline (Chapter 29), monitored by the four-layer Grafana dashboard (Chapter 30), and audited for fairness (Chapter 31) and privacy (Chapter 32).

In the Q2 review, the product team raises a concern that the ML platform team had not anticipated. The recommendation quality, measured by Recall@20 and NDCG@20, has been stable at 0.20 and 0.14 respectively — well above the deployment threshold. But user satisfaction surveys reveal a pattern: users who receive recommendations dominated by a single genre (e.g., all action movies, or all true-crime podcasts) report lower satisfaction than users who receive diverse recommendations, even when the click-through rate is comparable. The product hypothesis: the model is overconfident about user preferences, leading it to recommend the same type of content repeatedly instead of exploring adjacent interests.

The ML platform team proposes an uncertainty-aware recommendation strategy: use calibrated prediction probabilities and conformal prediction sets to (1) identify which users the model genuinely understands (confident and well-calibrated) versus which users the model is guessing about (overconfident), and (2) inject diversity into recommendations for users where the model is most uncertain. This case study walks through the implementation.

Step 1: Calibration Diagnosis

The first step is to assess how well the model's predicted click probabilities match reality.

from dataclasses import dataclass, field
from typing import List, Dict, Tuple, Optional
import numpy as np


@dataclass
class StreamRecCalibrationAudit:
    """Calibration audit for the StreamRec click-prediction model.

    Evaluates calibration overall and across user segments.
    """
    user_ids: np.ndarray
    predicted_probs: np.ndarray
    true_clicks: np.ndarray
    user_segments: Dict[str, np.ndarray]  # segment_name -> boolean mask

    def run_audit(self, n_bins: int = 15) -> Dict[str, Dict]:
        """Run calibration audit overall and per segment."""
        results = {}

        # Overall calibration
        diag_overall = CalibrationDiagnostics(
            self.true_clicks, self.predicted_probs, n_bins=n_bins
        )
        results["overall"] = diag_overall.summary()

        # Per-segment calibration
        for seg_name, seg_mask in self.user_segments.items():
            if seg_mask.sum() < 100:  # Skip tiny segments
                continue
            diag_seg = CalibrationDiagnostics(
                self.true_clicks[seg_mask],
                self.predicted_probs[seg_mask],
                n_bins=n_bins,
            )
            results[seg_name] = diag_seg.summary()

        return results


# Simulated audit results
rng = np.random.RandomState(42)
n_test = 100_000

# Simulate overconfident model predictions
true_click_probs = rng.beta(2, 8, n_test)  # True CTR ~ 0.2 avg
model_logits = np.log(true_click_probs / (1 - true_click_probs)) * 1.6  # Overconfident
model_probs = 1.0 / (1.0 + np.exp(-model_logits))
true_clicks = rng.binomial(1, true_click_probs)

# Segment masks
user_segments = {
    "new_users_lt30d": rng.random(n_test) < 0.15,
    "power_users": rng.random(n_test) < 0.10,
    "casual_users": rng.random(n_test) < 0.45,
    "genre_diverse": rng.random(n_test) < 0.30,
    "genre_narrow": rng.random(n_test) < 0.20,
}

audit = StreamRecCalibrationAudit(
    user_ids=np.arange(n_test),
    predicted_probs=model_probs,
    true_clicks=true_clicks,
    user_segments=user_segments,
)
audit_results = audit.run_audit()

print("Calibration Audit Results:")
print(f"{'Segment':<25} {'ECE':>8} {'MCE':>8} {'Mean Pred':>10} {'Prevalence':>10}")
print("-" * 65)
for seg, metrics in audit_results.items():
    print(f"{seg:<25} {metrics['ece']:>8.4f} {metrics['mce']:>8.4f} "
          f"{metrics['mean_predicted']:>10.4f} {metrics['prevalence']:>10.4f}")

The audit reveals the expected pattern:

Segment              ECE      MCE      Mean Predicted   Prevalence
Overall              0.071    0.142    0.234            0.198
New users (<30d)     0.118    0.203    0.241            0.189
Power users          0.039    0.087    0.267            0.252
Casual users         0.065    0.131    0.219            0.186
Genre-diverse        0.054    0.112    0.228            0.201
Genre-narrow         0.094    0.178    0.253            0.203

Two findings stand out. First, the model is systematically overconfident across all segments (mean predicted probability exceeds prevalence). Second, the miscalibration is worst for new users (ECE = 0.118) and genre-narrow users (ECE = 0.094) — precisely the segments where the product team reported the lowest satisfaction. The model has high epistemic uncertainty for these users (few interactions to learn from), but the predicted probabilities do not reflect this uncertainty.
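The ECE figures above come from the CalibrationDiagnostics helper built earlier in the chapter. For reference, the core computation can be sketched in a few lines — a minimal equal-width-bin version, not the full class:

```python
import numpy as np

def expected_calibration_error(labels, probs, n_bins=15):
    """Equal-width-bin ECE: weighted mean |empirical click rate -
    mean predicted probability| across bins of predicted probability."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Last bin is closed on the right so probs == 1.0 are counted
        mask = (probs >= lo) & (probs < hi) if hi < 1.0 else (probs >= lo)
        if not mask.any():
            continue
        ece += mask.mean() * abs(labels[mask].mean() - probs[mask].mean())
    return ece

# Sanity check: a perfectly calibrated simulator scores near zero
rng = np.random.RandomState(0)
p = rng.uniform(0.05, 0.95, 200_000)
y = rng.binomial(1, p)
print(f"ECE of calibrated probabilities: {expected_calibration_error(y, p):.4f}")
```

Overconfident probabilities (e.g., the same labels scored with sharpened logits) produce a markedly larger value, which is exactly the pattern the audit surfaced.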

Step 2: Temperature Scaling

# Split test data: 50% calibration, 50% evaluation
n_cal = n_test // 2
cal_logits, eval_logits = model_logits[:n_cal], model_logits[n_cal:]
cal_labels, eval_labels = true_clicks[:n_cal], true_clicks[n_cal:]

# Fit temperature scaling
temp_scaler = TemperatureScaler().fit(cal_logits, cal_labels)
print(f"Learned temperature: T = {temp_scaler.temperature:.4f}")
# Result: T ≈ 1.6 (>1 confirms overconfidence; matches the 1.6x logit scaling in the simulation)

# Evaluate post-calibration
cal_probs = temp_scaler.calibrate(eval_logits)
diag_post = CalibrationDiagnostics(eval_labels, cal_probs, n_bins=15)
print(f"Post-temperature ECE: {diag_post.ece:.4f} (was {audit_results['overall']['ece']:.4f})")
# Result: ECE drops from ~0.071 to ~0.018

Temperature scaling reduces global ECE from 0.071 to 0.018 — a 75% improvement with a single parameter. The ML platform team integrates this into the serving pipeline: the production model outputs raw logits, and the serving layer applies temperature scaling before converting to probabilities. The temperature parameter is re-fitted weekly on a rolling calibration set from the previous 7 days of production traffic.
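TemperatureScaler is the class built earlier in the chapter; at its core it is a one-parameter optimization. A minimal standalone sketch (grid search over T rather than the class's optimizer) shows how the overconfidence factor is recovered:

```python
import numpy as np

def fit_temperature(logits, labels, grid=np.linspace(0.25, 5.0, 96)):
    """Grid-search the temperature T > 0 that minimizes the NLL of
    sigmoid(logits / T) against the observed binary labels."""
    def nll(t):
        p = 1.0 / (1.0 + np.exp(-logits / t))
        p = np.clip(p, 1e-12, 1 - 1e-12)  # Numerical safety
        return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return float(grid[np.argmin([nll(t) for t in grid])])

# Recover the 1.6x overconfidence factor baked into Step 1's simulation
rng = np.random.RandomState(42)
true_p = rng.beta(2, 8, 50_000)
logits = np.log(true_p / (1 - true_p)) * 1.6
labels = rng.binomial(1, true_p)
print(f"Fitted T = {fit_temperature(logits, labels):.2f}")
```

Because the simulation scaled the true logits by 1.6, the fitted temperature lands close to 1.6 — dividing by T undoes the sharpening.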

Step 3: Conformal Prediction Sets

Next, the team constructs binary conformal prediction sets (click / no-click) at 90% coverage for each user-item pair.

@dataclass
class StreamRecConformalDeployment:
    """Conformal prediction deployment for StreamRec.

    For each user-item pair, produces a prediction set:
    - {click}: model is confident the user will click
    - {no-click}: model is confident the user will not click
    - {click, no-click}: model is uncertain — cannot distinguish

    Set size is a natural measure of model confidence per prediction.
    """
    alpha: float = 0.10
    conformal: SplitConformalClassifier = field(init=False)

    def __post_init__(self):
        self.conformal = SplitConformalClassifier(alpha=self.alpha)

    def calibrate(
        self,
        cal_probs: np.ndarray,
        cal_labels: np.ndarray,
    ):
        """Calibrate on held-out data."""
        # Convert binary probs to 2-class format
        probs_2class = np.column_stack([1 - cal_probs, cal_probs])
        self.conformal.calibrate(probs_2class, cal_labels)

    def predict(self, test_probs: np.ndarray) -> Dict[str, np.ndarray]:
        """Produce prediction sets and classification."""
        probs_2class = np.column_stack([1 - test_probs, test_probs])
        pred_sets = self.conformal.predict_sets(probs_2class)

        # Classify each prediction
        set_sizes = np.array([len(s) for s in pred_sets])
        # Membership tests rather than list equality, so this works
        # whether predict_sets returns lists or arrays per prediction
        confident_click = np.array([len(s) == 1 and 1 in s for s in pred_sets])
        confident_no_click = np.array([len(s) == 1 and 0 in s for s in pred_sets])
        uncertain = set_sizes == 2

        return {
            "set_sizes": set_sizes,
            "confident_click": confident_click,
            "confident_no_click": confident_no_click,
            "uncertain": uncertain,
            "frac_uncertain": float(uncertain.mean()),
            "frac_confident_click": float(confident_click.mean()),
            "frac_confident_no_click": float(confident_no_click.mean()),
        }


# Deploy conformal prediction
deployment = StreamRecConformalDeployment(alpha=0.10)
# cal_probs holds the temperature-scaled probabilities for the *evaluation*
# split, so pair them with eval_labels (not cal_labels)
deployment.calibrate(cal_probs[:n_cal // 2], eval_labels[:n_cal // 2])

results = deployment.predict(cal_probs[n_cal // 2:])
print(f"Prediction set composition:")
print(f"  Confident click:    {results['frac_confident_click']:.1%}")
print(f"  Confident no-click: {results['frac_confident_no_click']:.1%}")
print(f"  Uncertain (both):   {results['frac_uncertain']:.1%}")

The results: approximately 12% of user-item pairs produce uncertain prediction sets (both click and no-click are plausible). These are the predictions where the model cannot distinguish between a click and a non-click with 90% confidence — the "honest I-don't-know" cases.
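SplitConformalClassifier wraps the standard split-conformal recipe from earlier in the chapter. As a reference, the binary case reduces to a few lines — a minimal sketch using the common 1 − p(true class) nonconformity score:

```python
import numpy as np

def binary_conformal_sets(cal_probs, cal_labels, test_probs, alpha=0.10):
    """Split conformal prediction for a binary click model.

    Score = 1 - probability assigned to the true class. qhat is the
    finite-sample-corrected (1 - alpha) quantile of calibration scores;
    a label joins the set when its predicted probability >= 1 - qhat.
    """
    scores = np.where(cal_labels == 1, 1 - cal_probs, cal_probs)
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))   # Conformal quantile rank
    qhat = np.sort(scores)[min(k, n) - 1]
    include_click = test_probs >= 1 - qhat
    include_no_click = (1 - test_probs) >= 1 - qhat
    return include_click, include_no_click

# Coverage check on well-calibrated synthetic probabilities
rng = np.random.RandomState(0)
p = rng.uniform(0, 1, 20_000)
y = rng.binomial(1, p)
click, no_click = binary_conformal_sets(p[:10_000], y[:10_000],
                                        p[10_000:], alpha=0.10)
covered = np.where(y[10_000:] == 1, click, no_click)
print(f"Empirical coverage: {covered.mean():.3f}")
```

The coverage guarantee is distribution-free: as long as calibration and test points are exchangeable, the true label lands in the set at least 90% of the time on average, regardless of how good the underlying probabilities are.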

Step 4: MC Dropout Uncertainty and User-Level Analysis

@dataclass
class StreamRecUncertaintyProfile:
    """Per-user uncertainty profile aggregating MC dropout results."""
    user_id: int
    n_predictions: int
    mean_epistemic_uncertainty: float
    mean_aleatoric_uncertainty: float
    frac_uncertain_conformal: float
    mean_predicted_prob: float
    actual_click_rate: float

    @property
    def uncertainty_tier(self) -> str:
        """Classify user into uncertainty tiers for routing."""
        if self.mean_epistemic_uncertainty > 0.15:
            return "high_epistemic"
        elif self.frac_uncertain_conformal > 0.25:
            return "moderate_uncertain"
        else:
            return "confident"


def build_user_profiles(
    user_ids: np.ndarray,
    epistemic: np.ndarray,
    aleatoric: np.ndarray,
    conformal_uncertain: np.ndarray,
    predicted_probs: np.ndarray,
    true_clicks: np.ndarray,
) -> List[StreamRecUncertaintyProfile]:
    """Aggregate prediction-level uncertainty to user-level profiles."""
    unique_users = np.unique(user_ids)
    profiles = []

    for uid in unique_users:
        mask = user_ids == uid
        if mask.sum() < 5:  # Skip users with too few predictions
            continue

        profiles.append(StreamRecUncertaintyProfile(
            user_id=int(uid),
            n_predictions=int(mask.sum()),
            mean_epistemic_uncertainty=float(epistemic[mask].mean()),
            mean_aleatoric_uncertainty=float(aleatoric[mask].mean()),
            frac_uncertain_conformal=float(conformal_uncertain[mask].mean()),
            mean_predicted_prob=float(predicted_probs[mask].mean()),
            actual_click_rate=float(true_clicks[mask].mean()),
        ))

    return profiles


# Simulate user-level MC dropout results
n_users = 5000
user_ids_sim = rng.randint(0, n_users, n_test)

# Simulate epistemic uncertainty (higher for new/rare users)
user_activity = np.bincount(user_ids_sim, minlength=n_users)
user_epistemic_base = 0.3 / (1 + np.log1p(user_activity))  # Less activity -> more epistemic
epistemic_per_pred = user_epistemic_base[user_ids_sim] + rng.normal(0, 0.02, n_test)
epistemic_per_pred = np.clip(epistemic_per_pred, 0, 1)

# Aleatoric uncertainty (roughly constant, inherent to the task)
aleatoric_per_pred = 0.08 + rng.normal(0, 0.01, n_test)
aleatoric_per_pred = np.clip(aleatoric_per_pred, 0, 1)

# Conformal uncertain flags
# The conformal results cover only the evaluation split (n_cal // 2 points),
# so for this simulation resample per-prediction flags at the observed rate
conformal_uncertain = rng.random(n_test) < results["frac_uncertain"]

profiles = build_user_profiles(
    user_ids_sim,
    epistemic_per_pred,
    aleatoric_per_pred,
    conformal_uncertain,
    model_probs,
    true_clicks,
)

# Tier distribution
tiers = [p.uncertainty_tier for p in profiles]
for tier in ["confident", "moderate_uncertain", "high_epistemic"]:
    count = tiers.count(tier)
    pct = count / len(tiers) * 100
    print(f"  {tier}: {count} users ({pct:.1f}%)")

The tier distribution reveals the user landscape:

Tier                 Users     %      Recommendation Strategy
Confident            ~3,200    64%    Standard ranking by predicted probability
Moderate uncertain   ~1,100    22%    Increase diversity in top-K; wider genre mix
High epistemic       ~700      14%    Active exploration; Thompson sampling (Ch. 22)
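The per-prediction epistemic and aleatoric values above are simulated stand-ins; with a real MC dropout model they come from T stochastic forward passes per input. A minimal sketch of the standard entropy decomposition (total = aleatoric + epistemic), shown here on synthetic samples rather than a real network:

```python
import numpy as np

def decompose_uncertainty(prob_samples):
    """Entropy decomposition over MC dropout samples (binary case).

    prob_samples: shape (T, N) -- click probabilities from T stochastic
    forward passes for N predictions. Predictive entropy H[mean p]
    splits into expected entropy (aleatoric) plus mutual information
    (epistemic, the BALD score). Units are nats.
    """
    eps = 1e-12

    def entropy(p):
        p = np.clip(p, eps, 1 - eps)
        return -(p * np.log(p) + (1 - p) * np.log(1 - p))

    total = entropy(prob_samples.mean(axis=0))      # Predictive entropy
    aleatoric = entropy(prob_samples).mean(axis=0)  # Expected entropy
    return total - aleatoric, aleatoric             # (epistemic, aleatoric)

rng = np.random.RandomState(0)
# "Known" user: the 50 passes agree; "unknown" user: the passes disagree
known = np.clip(rng.normal(0.7, 0.01, size=(50, 1)), 0.01, 0.99)
unknown = np.clip(rng.normal(0.5, 0.20, size=(50, 1)), 0.01, 0.99)
epi, ale = decompose_uncertainty(np.hstack([known, unknown]))
print(f"Epistemic, known user:   {epi[0]:.4f}")
print(f"Epistemic, unknown user: {epi[1]:.4f}")
```

Disagreement across passes — the model's weights pulling predictions in different directions — is what registers as epistemic uncertainty; a confident prediction far from 0 or 1 but consistent across passes registers as aleatoric instead.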

Step 5: Identifying Least-Confident Users for Active Exploration

The 700 high-epistemic users are the model's biggest knowledge gaps. For these users, the model has not observed enough interactions to learn their preferences — every prediction is a guess. The ML platform team routes these users to the Thompson sampling exploration policy from Chapter 22, which explicitly balances exploration and exploitation.
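The Thompson sampling policy itself lives in Chapter 22; as a reminder of the mechanics, here is a minimal per-item Beta-Bernoulli version (the uniform prior and the counts are illustrative, not the production configuration):

```python
import numpy as np

def thompson_rank(clicks, impressions, k, rng, prior=(1.0, 1.0)):
    """Rank items by one draw from each item's Beta posterior over CTR.

    Items with few impressions have wide posteriors, so they sometimes
    win top-K slots -- exploration falls out of the sampling itself.
    """
    a = prior[0] + clicks
    b = prior[1] + (impressions - clicks)
    sampled_ctr = rng.beta(a, b)           # One posterior sample per item
    return np.argsort(-sampled_ctr)[:k]    # Top-k by sampled CTR

rng = np.random.RandomState(7)
clicks = np.array([40, 2, 0, 9])
impressions = np.array([200, 10, 0, 50])  # Item 2 has never been shown
# Across repeated draws, the unexplored item 2 still reaches the top-2
hits = sum(2 in thompson_rank(clicks, impressions, k=2, rng=rng)
           for _ in range(1000))
print(f"Item 2 in top-2: {hits} of 1000 draws")
```

A greedy ranker would never surface item 2 (its empirical CTR is undefined or zero), whereas the posterior draw gives it a real chance — which is exactly the behavior wanted for high-epistemic users.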

# Identify least-confident users
high_epistemic_profiles = [
    p for p in profiles if p.uncertainty_tier == "high_epistemic"
]

# Sort by epistemic uncertainty (highest first)
high_epistemic_profiles.sort(
    key=lambda p: p.mean_epistemic_uncertainty, reverse=True
)

print(f"\nTop 10 least-confident users:")
print(f"{'User':>8} {'Epistemic':>10} {'Aleatoric':>10} "
      f"{'Conformal %':>12} {'Pred CTR':>9} {'Actual CTR':>11}")
print("-" * 65)
for p in high_epistemic_profiles[:10]:
    print(f"{p.user_id:>8} {p.mean_epistemic_uncertainty:>10.4f} "
          f"{p.mean_aleatoric_uncertainty:>10.4f} "
          f"{p.frac_uncertain_conformal:>12.1%} "
          f"{p.mean_predicted_prob:>9.4f} {p.actual_click_rate:>11.4f}")

The analysis reveals that the least-confident users share a common profile: they are either new users (fewer than 30 days on the platform) or users with highly eclectic tastes (their historical clicks span many genres with no clear pattern). For these users, the model's predictions are essentially random — the predicted CTR does not correlate with actual CTR. This confirms that the model's uncertainty estimates are meaningful: high epistemic uncertainty genuinely identifies users where the model is ignorant, not just users with inherently noisy preferences.

Impact

The uncertainty-aware recommendation strategy is A/B tested over 4 weeks against the standard ranking pipeline:

Metric                       Control (standard)   Treatment (uncertainty-aware)   Change
Recall@20                    0.200                0.198                           -1.0% (within noise)
NDCG@20                      0.140                0.138                           -1.4% (within noise)
User satisfaction (survey)   3.8 / 5.0            4.1 / 5.0                       +7.9%
Genre diversity (Shannon)    1.42                 1.71                            +20.4%
30-day retention             72.1%                74.8%                           +2.7 pp
New user 7-day retention     58.3%                65.1%                           +6.8 pp

The headline result: the uncertainty-aware system achieves statistically indistinguishable click metrics (the small decreases are within the confidence interval) while substantially improving user satisfaction, genre diversity, and retention. The largest gain is for new users, where the Thompson sampling exploration policy — activated by high epistemic uncertainty — helps the system learn preferences faster. Within 14 days, the model's epistemic uncertainty for these users drops by 40% as exploration interactions provide signal, and they transition from the "high epistemic" tier to "moderate uncertain" or "confident."

Lessons

  1. Uncertainty enables better product decisions, not just better models. The click-prediction model did not become more accurate. What changed was the system's ability to distinguish "I know this user will click" from "I'm guessing." This distinction — invisible without uncertainty quantification — enabled a routing strategy that improved the user experience without sacrificing engagement metrics.

  2. Calibration is a prerequisite for uncertainty-aware systems. The temperature scaling step (reducing ECE from 0.071 to 0.018) was essential. Without it, the conformal prediction sets and the uncertainty tiers would have been based on overconfident probabilities, leading to smaller-than-warranted prediction sets and fewer users classified as uncertain. Calibrating first ensures that downstream uncertainty signals are trustworthy.

  3. Epistemic uncertainty identifies where to invest. The 14% of users in the "high epistemic" tier are not a permanent category. They are a signal of where the model needs more data. By routing these users to exploration, the system actively reduces its own uncertainty — a virtuous cycle where uncertainty quantification drives data collection, which reduces uncertainty, which improves predictions.