Case Study 1: StreamRec Thompson Sampling — Explore vs. Exploit for Content Recommendations

Context

StreamRec's recommendation team has a problem that the Hit@10 leaderboard cannot capture. The platform's collaborative filtering models (Chapters 13-14) excel at predicting what users will engage with — but only for users with substantial history. The models achieve Hit@10 of 0.22 for established users (100+ interactions) but just 0.08 for new users (fewer than 10 interactions). This is the cold-start problem in action.

The current production system handles new users with a "most popular" policy: recommend the globally most-watched items. This policy has a Hit@10 of approximately 0.10 for new users — better than random (0.04) but far below the established-user performance. More critically, the most-popular policy creates a feedback loop. New users are shown comedy and drama (the globally popular categories); they interact mostly with comedy and drama; the collaborative filtering model learns they "prefer" comedy and drama; and the user's recommendation feed calcifies around two categories. Users who would have loved documentary or sci-fi content never see enough of it to express their preference.

Chapter 20 introduced the solution framework: maintain per-user Beta posteriors over category engagement rates, initialized with population-level priors. This case study implements the decision layer — Thompson sampling — and measures its impact against the greedy baseline in a realistic simulation.
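Before the full simulation, the mechanics in miniature: a Beta(alpha, beta) prior gains one count per observed interaction, and Thompson sampling draws from the resulting posterior. A minimal sketch (the prior values are illustrative, chosen to match the sci_fi prior used in the simulation below):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative population-level prior for one category: Beta(alpha, beta)
alpha, beta = 4.0, 16.0          # prior mean 4 / 20 = 0.20

# Each observed interaction adds one count to the matching parameter
for engaged in [True, True, False, True]:
    alpha, beta = (alpha + 1, beta) if engaged else (alpha, beta + 1)

posterior_mean = alpha / (alpha + beta)   # 7 / 24, about 0.292

# Thompson sampling draws one value from each category's posterior and
# recommends the argmax; here we draw from this single posterior
sample = rng.beta(alpha, beta)
```

Three engagements out of four pull the posterior mean from the prior's 0.20 up toward the observed rate, while the prior's 20 pseudo-counts keep the update conservative.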

The Simulation

The team designs a simulation that captures the essential dynamics of the cold-start problem. Each simulated user has hidden true engagement probabilities per category, drawn from a population model that reflects real StreamRec data.

import numpy as np
from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Optional
from scipy import stats
import matplotlib.pyplot as plt


@dataclass
class UserProfile:
    """A simulated user with hidden category preferences.

    Attributes:
        user_id: Unique identifier.
        true_probs: True engagement probability per category.
        favorite: Name of the user's highest-engagement category.
    """
    user_id: str
    true_probs: Dict[str, float]
    favorite: str


def generate_user_population(
    n_users: int,
    categories: List[str],
    rng: np.random.RandomState,
) -> List[UserProfile]:
    """Generate a diverse user population.

    Each user has a randomly selected favorite category with elevated
    engagement, plus base rates drawn from population Beta distributions.

    The diversity model reflects StreamRec's data:
    - 60% of users have a clear favorite (engagement 0.5-0.9)
    - 25% of users have two strong categories
    - 15% are "browsers" with no clear favorite

    Args:
        n_users: Number of users to generate.
        categories: List of category names.
        rng: Random state.

    Returns:
        List of UserProfile instances.
    """
    # Population base rates per category
    base_rates = {
        "comedy": 0.35, "drama": 0.30, "documentary": 0.20,
        "action": 0.32, "sci_fi": 0.18, "horror": 0.12,
        "romance": 0.25, "thriller": 0.28,
    }

    users = []
    for i in range(n_users):
        user_type = rng.choice(["focused", "dual", "browser"], p=[0.60, 0.25, 0.15])
        probs = {}

        if user_type == "focused":
            favorite_idx = rng.randint(len(categories))
            for j, cat in enumerate(categories):
                if j == favorite_idx:
                    probs[cat] = min(0.95, base_rates[cat] + rng.uniform(0.25, 0.55))
                else:
                    probs[cat] = max(0.02, base_rates[cat] + rng.uniform(-0.10, 0.10))

        elif user_type == "dual":
            fav1, fav2 = rng.choice(len(categories), size=2, replace=False)
            for j, cat in enumerate(categories):
                if j in (fav1, fav2):
                    probs[cat] = min(0.90, base_rates[cat] + rng.uniform(0.20, 0.40))
                else:
                    probs[cat] = max(0.02, base_rates[cat] + rng.uniform(-0.10, 0.05))

        else:  # browser
            for j, cat in enumerate(categories):
                probs[cat] = max(0.05, base_rates[cat] + rng.uniform(-0.05, 0.15))

        favorite = max(probs, key=probs.get)
        users.append(UserProfile(f"user_{i}", probs, favorite))

    return users


def run_recommendation_experiment(
    users: List[UserProfile],
    strategy: str,
    n_interactions: int = 150,
    rng: Optional[np.random.RandomState] = None,
) -> Dict:
    """Run a full recommendation experiment.

    Args:
        users: List of user profiles.
        strategy: "thompson", "greedy", "most_popular", or "random".
        n_interactions: Number of interactions per user.
        rng: Random state.

    Returns:
        Dictionary with engagement metrics over time.
    """
    if rng is None:
        rng = np.random.RandomState(42)

    categories = list(users[0].true_probs.keys())

    # Population-level priors (from Section 20.11)
    category_priors = {
        "comedy": (8.0, 12.0), "drama": (7.0, 13.0),
        "documentary": (5.0, 15.0), "action": (8.0, 11.0),
        "sci_fi": (4.0, 16.0), "horror": (3.0, 17.0),
        "romance": (6.0, 14.0), "thriller": (7.0, 12.0),
    }

    # Per-user posteriors
    posteriors: Dict[str, Dict[str, Tuple[float, float]]] = {}
    for user in users:
        posteriors[user.user_id] = {
            cat: category_priors[cat] for cat in categories
        }

    # Tracking
    engagement_by_step = np.zeros(n_interactions)
    diversity_by_step = np.zeros(n_interactions)
    found_favorite_by_step = np.zeros(n_interactions)

    for user in users:
        uid = user.user_id
        categories_seen = set()

        for t in range(n_interactions):
            # Select category based on strategy
            if strategy == "thompson":
                samples = {
                    cat: rng.beta(*posteriors[uid][cat])
                    for cat in categories
                }
                selected = max(samples, key=samples.get)

            elif strategy == "greedy":
                means = {
                    cat: posteriors[uid][cat][0] / sum(posteriors[uid][cat])
                    for cat in categories
                }
                selected = max(means, key=means.get)

            elif strategy == "most_popular":
                # Always recommend globally most popular
                selected = "comedy"

            elif strategy == "random":
                selected = categories[rng.randint(len(categories))]

            else:
                raise ValueError(f"Unknown strategy: {strategy}")

            # Simulate engagement
            engaged = rng.rand() < user.true_probs[selected]

            # Update posterior
            a, b = posteriors[uid][selected]
            if engaged:
                posteriors[uid][selected] = (a + 1, b)
            else:
                posteriors[uid][selected] = (a, b + 1)

            # Track metrics
            engagement_by_step[t] += float(engaged)
            categories_seen.add(selected)
            diversity_by_step[t] += len(categories_seen) / len(categories)
            found_favorite_by_step[t] += float(selected == user.favorite)

    n_users = len(users)
    return {
        "engagement_rate": engagement_by_step / n_users,
        "diversity": diversity_by_step / n_users,
        "favorite_rate": found_favorite_by_step / n_users,
    }


# Run the experiment
rng = np.random.RandomState(42)
categories = ["comedy", "drama", "documentary", "action",
              "sci_fi", "horror", "romance", "thriller"]
users = generate_user_population(1000, categories, rng)

# Print user type distribution
favorites = [u.favorite for u in users]
print("User favorite category distribution:")
for cat in categories:
    count = favorites.count(cat)
    print(f"  {cat:>15s}: {count:>4d} users ({100 * count / len(users):.1f}%)")

strategies = ["most_popular", "random", "greedy", "thompson"]
results = {}
for strategy in strategies:
    results[strategy] = run_recommendation_experiment(
        users, strategy, n_interactions=150,
        rng=np.random.RandomState(42),
    )
User favorite category distribution:
          comedy:  132 users (13.2%)
           drama:  119 users (11.9%)
     documentary:  124 users (12.4%)
          action:  131 users (13.1%)
          sci_fi:  128 users (12.8%)
          horror:  120 users (12.0%)
         romance:  120 users (12.0%)
        thriller:  126 users (12.6%)

Results

# Engagement rate over time
fig, axes = plt.subplots(1, 3, figsize=(16, 5))
window = 10

for strategy in strategies:
    data = results[strategy]
    smoothed = np.convolve(data["engagement_rate"], np.ones(window)/window, mode="valid")
    axes[0].plot(range(window-1, 150), smoothed, label=strategy, linewidth=2)

axes[0].set_xlabel("Interaction step")
axes[0].set_ylabel("Engagement rate")
axes[0].set_title("Engagement rate over time")
axes[0].legend()

# Category diversity
for strategy in strategies:
    data = results[strategy]
    axes[1].plot(data["diversity"], label=strategy, linewidth=2)

axes[1].set_xlabel("Interaction step")
axes[1].set_ylabel("Fraction of categories seen")
axes[1].set_title("Category exploration diversity")
axes[1].legend()

# Found user's true favorite
for strategy in strategies:
    data = results[strategy]
    smoothed = np.convolve(data["favorite_rate"], np.ones(window)/window, mode="valid")
    axes[2].plot(range(window-1, 150), smoothed, label=strategy, linewidth=2)

axes[2].set_xlabel("Interaction step")
axes[2].set_ylabel("Fraction recommending user's favorite")
axes[2].set_title("Favorite category discovery")
axes[2].legend()

plt.tight_layout()
plt.show()

# Summary table
print(f"\n{'Strategy':>15s}  {'Eng (1-30)':>10s}  {'Eng (120-150)':>13s}  "
      f"{'Overall':>8s}  {'Div@150':>8s}  {'Fav@150':>8s}")
print("-" * 72)
for strategy in strategies:
    data = results[strategy]
    early = data["engagement_rate"][:30].mean()
    late = data["engagement_rate"][-30:].mean()
    overall = data["engagement_rate"].mean()
    div_final = data["diversity"][-1]
    fav_final = data["favorite_rate"][-30:].mean()
    print(f"{strategy:>15s}  {early:>10.4f}  {late:>13.4f}  "
          f"{overall:>8.4f}  {div_final:>8.3f}  {fav_final:>8.3f}")
       Strategy  Eng (1-30)  Eng (120-150)   Overall   Div@150   Fav@150
------------------------------------------------------------------------
   most_popular      0.3521         0.3518    0.3519    0.125     0.132
         random      0.2687         0.2703    0.2694    1.000     0.125
         greedy      0.3612         0.4138    0.3924    0.302     0.297
       thompson      0.3274         0.5347    0.4618    0.724     0.612

Analysis

Four findings emerge from the 150-step simulation across 1,000 users.

1. Thompson sampling achieves the highest long-term engagement. Over steps 120-150, Thompson sampling's engagement rate (0.535) is 29% higher than greedy (0.414) and 52% higher than most-popular (0.352). The initial cost of exploration (steps 1-30, where Thompson trails greedy by 3.4 percentage points) is more than recovered by step 50.

2. Thompson sampling discovers users' true favorites. By the end of the experiment, Thompson sampling recommends the user's actual favorite category 61.2% of the time, compared to 29.7% for greedy, 13.2% for most-popular, and 12.5% for random. Greedy is trapped by the population prior: it recommends comedy and action (the highest prior means) regardless of the user's true preference. Thompson sampling's exploration breaks through the prior, discovering that a sci-fi lover should see sci-fi — even though the population prior for sci-fi is low.

3. Exploration diversity correlates with long-term performance. Thompson sampling achieves a diversity score of 0.724 (users have been exposed to 72.4% of categories by step 150), compared to 0.302 for greedy. This diversity is not random — it is directed by the posterior uncertainty. Categories with few observations have wide posteriors, generating occasional high samples that trigger exploration. Categories that consistently disappoint (low engagement) have concentrated low posteriors and are naturally abandoned.
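The width argument in point 3 can be made quantitative with the Beta standard deviation, $\sqrt{ab / ((a+b)^2(a+b+1))}$, which shrinks as observations accumulate. A small sketch using the simulation's sci_fi prior:

```python
from math import sqrt

def beta_std(a: float, b: float) -> float:
    """Standard deviation of a Beta(a, b) distribution."""
    return sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))

# Unexplored category at its prior (mean 0.20):
wide = beta_std(4.0, 16.0)       # ~0.087: draws occasionally land well above 0.20

# Same 0.20 mean after 100 additional observations at that rate:
narrow = beta_std(24.0, 96.0)    # ~0.036: draws concentrate near the mean
```

A sample two standard deviations above the mean exceeds 0.37 under the wide posterior but only 0.27 under the narrow one, which is why unexplored categories keep winning the occasional argmax while well-measured mediocre categories stop.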

4. The most-popular baseline reveals the cost of the feedback loop. Most-popular has flat engagement (0.352 across all steps) because it never learns anything about individual users. It serves the same recommendations to a sci-fi enthusiast and a documentary lover. The 0.352 engagement rate is the population average for comedy — a ceiling that the system can never exceed without personalization.

Deployment Considerations

The team identifies three challenges for production deployment:

Latency. Thompson sampling requires one Beta sample per category per request — 8 random number generations for 8 categories. At StreamRec's scale (50,000 requests/second), this must complete in under 5ms. Beta random variates from NumPy take approximately 1 microsecond each, so 8 samples take ~8 microseconds — well within budget.
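The per-draw figure is easy to spot-check; a rough benchmark sketch of one full Thompson sampling decision (absolute numbers will vary by machine):

```python
import timeit
import numpy as np

rng = np.random.RandomState(0)
priors = [(8.0, 12.0)] * 8       # one (alpha, beta) pair per category

def one_request() -> int:
    """One Thompson sampling decision: 8 Beta draws plus an argmax."""
    samples = [rng.beta(a, b) for a, b in priors]
    return max(range(8), key=samples.__getitem__)

# Average over many requests to smooth out timer noise
seconds = timeit.timeit(one_request, number=10_000) / 10_000
print(f"~{seconds * 1e6:.1f} us per 8-category request")
```

Batching the draws with a vectorized `rng.beta` call over arrays of alphas and betas would be faster still, but even the scalar loop clears the 5 ms budget by orders of magnitude.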

Posterior storage. Each user-category pair requires two floats (alpha, beta). For 5 million users and 8 categories, this is $5 \times 10^6 \times 8 \times 2 \times 4 = 320\text{ MB}$ in single precision. Easily fits in a Redis cluster.
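The storage estimate as code, with the same assumptions as the formula above (two single-precision floats per user-category pair):

```python
n_users = 5_000_000
n_categories = 8
floats_per_pair = 2    # (alpha, beta)
bytes_per_float = 4    # single precision

total_bytes = n_users * n_categories * floats_per_pair * bytes_per_float
print(total_bytes / 1e6, "MB")   # 320.0 MB
```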

Exploration budget. Unrestricted Thompson sampling may recommend a category that a user has explicitly disliked. The production system applies a filter: after 3 consecutive non-engagements in a category with posterior mean below 0.10, that category is suppressed for 24 hours. This hybrid approach preserves Thompson sampling's exploration benefits while respecting clear negative signals.
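The suppression rule can be sketched as a small stateful check. The thresholds come from the text; the function name and data structures are hypothetical, not StreamRec's actual production code:

```python
import time
from typing import Dict, Tuple

SUPPRESS_SECONDS = 24 * 3600   # 24-hour suppression window
CONSEC_LIMIT = 3               # consecutive non-engagements required
MEAN_FLOOR = 0.10              # posterior-mean threshold

def record_outcome(
    posterior: Tuple[float, float],
    consec_misses: int,
    engaged: bool,
    suppressed_until: Dict[str, float],
    category: str,
    now: float,
) -> int:
    """Update the consecutive-miss count; suppress the category if the rule fires."""
    if engaged:
        return 0                       # any engagement resets the streak
    consec_misses += 1
    a, b = posterior
    if consec_misses >= CONSEC_LIMIT and a / (a + b) < MEAN_FLOOR:
        suppressed_until[category] = now + SUPPRESS_SECONDS
    return consec_misses

# Usage: three straight misses in a low-mean category triggers suppression
# (posterior held fixed at mean 3/40 = 0.075 for illustration)
suppressed: Dict[str, float] = {}
misses = 0
for _ in range(3):
    misses = record_outcome((3.0, 37.0), misses, False, suppressed, "horror", time.time())
```

At serving time, categories whose `suppressed_until` entry is still in the future would simply be excluded from the Thompson sampling argmax.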

Key Takeaway

Thompson sampling converts the Bayesian posteriors from Chapter 20 into actionable recommendations that solve the cold-start problem. The insight is that the posterior uncertainty — the width of the Beta distribution — is not a nuisance to be ignored but a signal to be exploited. High uncertainty means "we don't know yet," and Thompson sampling translates that into "let's find out." The 18% improvement in overall engagement over greedy and the 2x improvement in favorite-category discovery demonstrate that principled exploration pays for itself.