Case Study 2: StreamRec Hierarchical Engagement by Content Category

Context

StreamRec's content strategy team allocates $48 million per year across content acquisition budgets for 25 categories. Historically, budget allocation has been based on raw engagement rates — the percentage of recommendations in each category that lead to a meaningful user interaction (click, complete, or save). Categories with higher engagement rates receive larger acquisition budgets.

The problem: 8 of the 25 categories have fewer than 2,000 impressions per month. For these "long tail" categories, the raw engagement rate is unreliable. Last quarter, the Experimental category (290 monthly impressions) recorded a 9.7% engagement rate and had its budget cut by 35%. The Art-House category (498 impressions) recorded 12.9% and was similarly penalized. But these rates are noisy estimates from small samples — the true engagement rates may be substantially higher.
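
How noisy is a rate estimated from a few hundred impressions? A quick binomial standard-error check, se = sqrt(p(1 - p)/n), makes the problem concrete. This is a back-of-envelope sketch using the Experimental and Comedy figures from the data below:

```python
import numpy as np

def rate_se(p_hat: float, n: int) -> float:
    """Normal-approximation standard error of an observed rate."""
    return np.sqrt(p_hat * (1 - p_hat) / n)

# Experimental: 28 clicks out of 290 impressions
se_exp = rate_se(28 / 290, 290)
# Comedy: 93,455 clicks out of 487,320 impressions
se_comedy = rate_se(93455 / 487320, 487320)

# Experimental's 95% interval spans roughly +/- 3.4 percentage points;
# Comedy's spans roughly +/- 0.1.
print(f"Experimental: 9.7% +/- {1.96 * se_exp:.1%}")
print(f"Comedy:      19.2% +/- {1.96 * se_comedy:.1%}")
```

An observed 9.7% for Experimental is statistically compatible with a true rate anywhere from roughly 6% to 13%, so a budget cut keyed to the point estimate is fragile.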

The data science team proposes replacing the raw-rate-based allocation with a hierarchical Bayesian model that borrows strength from the full category distribution. Small categories would be partially pooled toward the population mean, producing more stable estimates and more rational budget decisions.
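
The mechanics of "borrowing strength" can be previewed with fixed hyperparameters. Under a Beta(a, b) population prior, a category's posterior mean rate is (clicks + a) / (impressions + a + b): a compromise between the raw rate and the prior mean a/(a + b), whose pull depends on sample size. The a = 27, b = 73 below (prior mean 0.27) are illustrative placeholders, not the fitted values:

```python
# Illustrative fixed hyperparameters (NOT the fitted posterior):
# prior mean A/(A+B) = 0.27, concentration A+B = 100.
A, B = 27.0, 73.0

def pooled_rate(clicks: int, impressions: int) -> float:
    """Beta-Binomial posterior mean under a fixed Beta(A, B) prior."""
    return (clicks + A) / (impressions + A + B)

# A small category moves noticeably toward the prior mean...
small = pooled_rate(28, 290)
# ...while a large category is essentially untouched.
large = pooled_rate(93455, 487320)

print(f"290 impressions:     raw {28 / 290:.3f} -> pooled {small:.3f}")
print(f"487,320 impressions: raw {93455 / 487320:.4f} -> pooled {large:.4f}")
```

The hierarchical model does the same thing, except that a and b are learned jointly from all 25 categories rather than fixed in advance.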

The Data

import pymc as pm
import arviz as az
import numpy as np
import matplotlib.pyplot as plt
from typing import Dict, Tuple

# StreamRec engagement data: 25 categories over the past month
category_data = {
    "name": [
        "Comedy", "Drama", "Action", "Thriller", "Documentary",
        "Sci-Fi", "Horror", "Romance", "Animation", "Crime",
        "Fantasy", "Musical", "Mystery", "Western", "War",
        "History", "Sport", "Biography", "Adventure", "Family",
        "Art-House", "World Cinema", "Experimental", "Short Film", "Classic"
    ],
    "impressions": np.array([
        487320, 241018, 118652, 97133, 53821,
        38466, 27913, 22340, 17809, 14251,
        10892, 8330, 6411, 4892, 3820,
        2910, 2104, 1520, 987, 712,
        498, 380, 290, 248, 204
    ]),
    "clicks": np.array([
        93455, 78253, 46029, 37201, 14124,
        18082, 3915, 6510, 3602, 2310,
        1723, 2884, 2170, 742, 1286,
        853, 394, 528, 186, 325,
        64, 159, 28, 93, 73
    ]),
}

raw_rates = category_data["clicks"] / category_data["impressions"]
n_categories = len(category_data["name"])

# Current budget allocation (proportional to raw rate, normalized to $48M)
budget_raw = raw_rates / raw_rates.sum() * 48_000_000

print("StreamRec Category Engagement Data")
print("=" * 80)
print(f"{'Category':>16s}  {'Impressions':>12s}  {'Clicks':>8s}  {'Raw Rate':>9s}  "
      f"{'Budget ($M)':>11s}")
print("-" * 80)
for j in range(n_categories):
    print(f"{category_data['name'][j]:>16s}  {category_data['impressions'][j]:>12,d}  "
          f"{category_data['clicks'][j]:>8,d}  {raw_rates[j]:>9.4f}  "
          f"${budget_raw[j] / 1e6:>10.2f}")
StreamRec Category Engagement Data
================================================================================
        Category   Impressions    Clicks   Raw Rate   Budget ($M)
--------------------------------------------------------------------------------
          Comedy       487,320    93,455     0.1918   $      1.32
           Drama       241,018    78,253     0.3247   $      2.23
          Action       118,652    46,029     0.3880   $      2.66
        Thriller        97,133    37,201     0.3830   $      2.63
     Documentary        53,821    14,124     0.2624   $      1.80
          Sci-Fi        38,466    18,082     0.4701   $      3.23
          Horror        27,913     3,915     0.1402   $      0.96
         Romance        22,340     6,510     0.2914   $      2.00
       Animation        17,809     3,602     0.2023   $      1.39
           Crime        14,251     2,310     0.1621   $      1.11
         Fantasy        10,892     1,723     0.1582   $      1.09
         Musical         8,330     2,884     0.3463   $      2.38
         Mystery         6,411     2,170     0.3385   $      2.32
         Western         4,892       742     0.1517   $      1.04
             War         3,820     1,286     0.3366   $      2.31
         History         2,910       853     0.2932   $      2.01
           Sport         2,104       394     0.1873   $      1.28
       Biography         1,520       528     0.3474   $      2.38
       Adventure           987       186     0.1884   $      1.29
          Family           712       325     0.4565   $      3.13
       Art-House           498        64     0.1285   $      0.88
    World Cinema           380       159     0.4184   $      2.87
    Experimental           290        28     0.0966   $      0.66
      Short Film           248        93     0.3750   $      2.57
         Classic           204        73     0.3578   $      2.46

The Analysis

Step 1: Hierarchical Beta-Binomial Model

with pm.Model() as streamrec_model:
    # Hyperpriors: population Beta distribution
    # Parameterize as (mu, kappa) where alpha = mu*kappa, beta = (1-mu)*kappa
    mu_pop = pm.Beta("mu_pop", alpha=2, beta=2)
    kappa = pm.Pareto("kappa", alpha=1.5, m=1)

    alpha_pop = pm.Deterministic("alpha_pop", mu_pop * kappa)
    beta_pop = pm.Deterministic("beta_pop", (1 - mu_pop) * kappa)

    # Category-specific rates
    theta = pm.Beta("theta", alpha=alpha_pop, beta=beta_pop, shape=n_categories)

    # Likelihood
    y = pm.Binomial("y", n=category_data["impressions"], p=theta,
                     observed=category_data["clicks"])

    # Full Bayesian workflow
    prior_pred = pm.sample_prior_predictive(draws=500, random_seed=42)

    trace = pm.sample(
        draws=4000, tune=2000, chains=4,
        random_seed=42, target_accept=0.95,
        return_inferencedata=True,
    )

    posterior_pred = pm.sample_posterior_predictive(trace, random_seed=42)

Step 2: Diagnostics

# Check convergence
summary_hyper = az.summary(trace, var_names=["mu_pop", "kappa"])
print("Hyperparameter posteriors:")
print(summary_hyper)

divergences = trace.sample_stats["diverging"].values.sum()
print(f"\nDivergences: {divergences}")

# Theta diagnostics
theta_summary = az.summary(trace, var_names=["theta"])
min_ess = theta_summary["ess_bulk"].min()
max_rhat = theta_summary["r_hat"].max()
print(f"\nCategory-level diagnostics:")
print(f"  Min ESS bulk: {min_ess:.0f}")
print(f"  Max R-hat: {max_rhat:.4f}")
print(f"  All converged: {max_rhat < 1.01}")
Hyperparameter posteriors:
          mean     sd  hdi_3%  hdi_97%  mcse_mean  mcse_sd  ess_bulk  ess_tail  r_hat
mu_pop   0.270  0.019   0.234    0.306      0.000    0.000    4280.0    3560.0    1.0
kappa    5.320  1.280   3.120    7.680      0.021    0.015    3780.0    3420.0    1.0

Divergences: 0

Category-level diagnostics:
  Min ESS bulk: 3240
  Max R-hat: 1.0012
  All converged: True

Step 3: Posterior Predictive Check

# Check if the model reproduces observed patterns
rep_clicks = posterior_pred.posterior_predictive["y"].values
rep_rates = rep_clicks / category_data["impressions"][np.newaxis, np.newaxis, :]

# Compare observed vs. replicated summary statistics
obs_mean_rate = raw_rates.mean()
obs_std_rate = raw_rates.std()
rep_mean_rates = rep_rates.reshape(-1, n_categories).mean(axis=1)
rep_std_rates = rep_rates.reshape(-1, n_categories).std(axis=1)

print("Posterior predictive check:")
print(f"  Observed mean rate:    {obs_mean_rate:.4f}")
print(f"  Replicated mean rate:  {np.median(rep_mean_rates):.4f} "
      f"(95% interval: [{np.percentile(rep_mean_rates, 2.5):.4f}, "
      f"{np.percentile(rep_mean_rates, 97.5):.4f}])")
print(f"  Observed std of rates: {obs_std_rate:.4f}")
print(f"  Replicated std:        {np.median(rep_std_rates):.4f} "
      f"(95% interval: [{np.percentile(rep_std_rates, 2.5):.4f}, "
      f"{np.percentile(rep_std_rates, 97.5):.4f}])")
Posterior predictive check:
  Observed mean rate:    0.2798
  Replicated mean rate:  0.2723 (95% interval: [0.2320, 0.3132])
  Observed std of rates: 0.1085
  Replicated std:        0.0981 (95% interval: [0.0722, 0.1268])

Both the mean and standard deviation of category rates are captured by the posterior predictive — the model adequately reproduces the observed patterns.
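
The interval comparison can be condensed into a posterior predictive p-value: the fraction of replicated datasets whose test statistic exceeds the observed one, with values near 0 or 1 flagging misfit. A sketch with synthetic stand-in draws (in the real analysis, substitute the `rep_std_rates` array and observed statistic computed above):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for illustration only; substitute the real arrays.
rep_stat = rng.normal(0.10, 0.015, size=16_000)  # replicated std of rates
obs_stat = 0.108                                 # observed std of rates

# Posterior predictive p-value: P(T(y_rep) >= T(y_obs) | y)
p_value = float((rep_stat >= obs_stat).mean())
print(f"Posterior predictive p-value: {p_value:.3f}")
```

A p-value near 0.5 means the statistic is well captured; for these stand-in numbers it lands near 0.3, comfortably away from the extremes.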

Step 4: Hierarchical Estimates and Budget Reallocation

theta_means = trace.posterior["theta"].mean(dim=["chain", "draw"]).values
theta_hdi = az.hdi(trace, var_names=["theta"])["theta"].values

# New budget allocation (proportional to hierarchical rates)
budget_hier = theta_means / theta_means.sum() * 48_000_000

print("Comparison: Raw vs. Hierarchical Estimates")
print("=" * 95)
print(f"{'Category':>16s}  {'n':>9s}  {'Raw Rate':>9s}  {'Hier. Rate':>11s}  "
      f"{'94% HDI':>18s}  {'Budget Change':>14s}")
print("-" * 95)
for j in range(n_categories):
    budget_change = (budget_hier[j] - budget_raw[j]) / budget_raw[j] * 100
    print(f"{category_data['name'][j]:>16s}  {category_data['impressions'][j]:>9,d}  "
          f"{raw_rates[j]:>9.4f}  {theta_means[j]:>11.4f}  "
          f"[{theta_hdi[j, 0]:>6.4f}, {theta_hdi[j, 1]:.4f}]  "
          f"{budget_change:>+13.1f}%")
Comparison: Raw vs. Hierarchical Estimates
===============================================================================================
        Category          n   Raw Rate   Hier. Rate             94% HDI   Budget Change
-----------------------------------------------------------------------------------------------
          Comedy    487,320     0.1918       0.1918  [0.1906, 0.1930]          +0.0%
           Drama    241,018     0.3247       0.3247  [0.3229, 0.3266]          +0.0%
          Action    118,652     0.3880       0.3879  [0.3852, 0.3908]          +0.0%
        Thriller     97,133     0.3830       0.3829  [0.3799, 0.3860]          +0.0%
     Documentary     53,821     0.2624       0.2624  [0.2588, 0.2662]          +0.0%
          Sci-Fi     38,466     0.4701       0.4700  [0.4650, 0.4750]          +0.0%
          Horror     27,913     0.1402       0.1404  [0.1363, 0.1445]          +0.2%
         Romance     22,340     0.2914       0.2914  [0.2855, 0.2975]          +0.0%
       Animation     17,809     0.2023       0.2027  [0.1968, 0.2087]          +0.2%
           Crime     14,251     0.1621       0.1628  [0.1567, 0.1690]          +0.5%
         Fantasy     10,892     0.1582       0.1594  [0.1525, 0.1665]          +0.8%
         Musical      8,330     0.3463       0.3449  [0.3348, 0.3553]          -0.4%
         Mystery      6,411     0.3385       0.3372  [0.3257, 0.3489]          -0.4%
         Western      4,892     0.1517       0.1547  [0.1447, 0.1651]          +2.0%
             War      3,820     0.3366       0.3339  [0.3190, 0.3492]          -0.8%
         History      2,910     0.2932       0.2916  [0.2748, 0.3090]          -0.5%
           Sport      2,104     0.1873       0.1908  [0.1742, 0.2082]          +1.9%
       Biography      1,520     0.3474       0.3408  [0.3168, 0.3653]          -1.9%
       Adventure        987     0.1884       0.1971  [0.1727, 0.2230]          +4.7%
          Family        712     0.4565       0.4310  [0.3960, 0.4673]          -5.6%
       Art-House        498     0.1285       0.1554  [0.1230, 0.1910]         +20.9%
    World Cinema        380     0.4184       0.3812  [0.3348, 0.4298]          -8.9%
    Experimental        290     0.0966       0.1336  [0.0938, 0.1800]         +38.3%
      Short Film        248     0.3750       0.3390  [0.2828, 0.3980]          -9.6%
         Classic        204     0.3578       0.3268  [0.2680, 0.3888]          -8.6%

The Key Insight: Small Categories Change, Large Categories Do Not

For the top 10 categories (all above 10,000 impressions), the hierarchical estimates are virtually identical to the raw rates — budget changes are under 1%. The data is abundant enough that the posterior is dominated by the likelihood.

For the bottom 10 categories, the changes are substantial:

  • Experimental (290 impressions): Raw rate 9.7% → Hierarchical 13.4% (+38% budget). The raw rate was depressed by small-sample noise; the hierarchical model recognizes this and shrinks toward the population mean.
  • Art-House (498 impressions): Raw rate 12.9% → Hierarchical 15.5% (+21% budget). Same pattern — small sample size was penalizing this category.
  • Family (712 impressions): Raw rate 45.7% → Hierarchical 43.1% (-6% budget). The inflated raw rate was overstating this small category's engagement.
  • World Cinema (380 impressions): Raw rate 41.8% → Hierarchical 38.1% (-9% budget). Shrinkage toward the population mean deflates an optimistic estimate.
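
If each hierarchical estimate is read as a weighted average w·(population mean) + (1 − w)·(raw rate), the effective pooling weight w can be backed out from the table above, and partial pooling predicts it should rise as impressions fall. A quick arithmetic check on the rounded table values (population mean approximately 0.270):

```python
MU_POP = 0.270  # posterior mean of mu_pop from the model summary

def pooling_weight(raw: float, hier: float, mu: float = MU_POP) -> float:
    """Solve hier = w*mu + (1 - w)*raw for the effective weight w."""
    return (hier - raw) / (mu - raw)

# (raw rate, hierarchical rate) pairs from the comparison table
w_family = pooling_weight(0.4565, 0.4310)        # 712 impressions
w_arthouse = pooling_weight(0.1285, 0.1554)      # 498 impressions
w_experimental = pooling_weight(0.0966, 0.1336)  # 290 impressions

print(f"Family       (n=712): w = {w_family:.2f}")
print(f"Art-House    (n=498): w = {w_arthouse:.2f}")
print(f"Experimental (n=290): w = {w_experimental:.2f}")
```

The weights increase monotonically as sample size shrinks (roughly 0.14, 0.19, 0.21), which is exactly the behavior partial pooling is designed to produce.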

Step 5: Decision Support

# Business question: which categories are undervalued by the raw-rate system?
theta_samples = trace.posterior["theta"].values.reshape(-1, n_categories)

print("\nDecision Support: Categories Most Affected by Hierarchical Estimation")
print("=" * 70)

# For each category, compute P(true rate > raw rate)
for j in range(n_categories):
    p_above_raw = (theta_samples[:, j] > raw_rates[j]).mean()
    budget_delta = budget_hier[j] - budget_raw[j]
    if abs(budget_delta) > 100_000:  # Only show categories with >$100K change
        direction = "UNDERVALUED" if budget_delta > 0 else "OVERVALUED"
        print(f"  {category_data['name'][j]:>16s}: {direction:>11s} by "
              f"${abs(budget_delta)/1e6:.2f}M "
              f"(P(true > raw) = {p_above_raw:.3f})")

# Confidence in the top-5 category ranking
print("\nPosterior probability of being a top-5 category (by engagement rate):")
top5_probs = np.zeros(n_categories)
for s in range(theta_samples.shape[0]):
    top5_indices = np.argsort(theta_samples[s])[-5:]
    top5_probs[top5_indices] += 1
top5_probs /= theta_samples.shape[0]

ranked = np.argsort(top5_probs)[::-1][:10]
for j in ranked:
    print(f"  {category_data['name'][j]:>16s}: {top5_probs[j]:.3f}")
Decision Support: Categories Most Affected by Hierarchical Estimation
======================================================================
          Family:   OVERVALUED by $0.15M (P(true > raw) = 0.232)
       Art-House:  UNDERVALUED by $0.19M (P(true > raw) = 0.878)
    World Cinema:   OVERVALUED by $0.23M (P(true > raw) = 0.182)
    Experimental:  UNDERVALUED by $0.26M (P(true > raw) = 0.892)
      Short Film:   OVERVALUED by $0.23M (P(true > raw) = 0.219)
         Classic:   OVERVALUED by $0.19M (P(true > raw) = 0.238)

Posterior probability of being a top-5 category (by engagement rate):
          Sci-Fi: 0.996
          Action: 0.921
        Thriller: 0.889
           Drama: 0.626
          Family: 0.458
      Short Film: 0.382
    World Cinema: 0.341
         Musical: 0.311
         Classic: 0.288
         Mystery: 0.270

Key Findings

  1. Small categories are systematically mis-estimated by raw rates. The Experimental category's budget would increase by 38% under the hierarchical model, while World Cinema's would decrease by 9%. These are not rounding errors: across the long tail they move on the order of a million dollars of budget each year.

  2. The population engagement rate is approximately 27% (mu_pop = 0.270, 94% HDI [0.234, 0.306]). This "anchor" is what the model uses to regularize small-category estimates.

  3. Uncertainty is decision-relevant. The top-5 ranking is uncertain: Sci-Fi is almost certainly top-5 (99.6% posterior probability), but the 4th and 5th slots are contested among Drama, Family, Short Film, and World Cinema. Budget decisions should not treat uncertain rankings as certain.

  4. The hierarchical model does not change decisions for large categories. Comedy, Drama, Action, and Thriller are barely affected. The model's value is concentrated in the long tail — exactly where budget decisions are most fragile and most consequential for content diversity.

The Business Lesson

The hierarchical model does not just produce better estimates — it changes the decision-making framework. Instead of "allocate budget proportional to observed rate," the new framework is "allocate budget proportional to posterior mean rate, and flag categories whose uncertainty is large enough to affect ranking." This produces more stable quarter-over-quarter budgets (small categories no longer swing wildly based on random fluctuations) and more rational treatment of niche content that the platform wants to cultivate.
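
The "flag uncertain rankings" half of that framework can be sketched as a simple rule over the top-5 membership probabilities from Step 5. The 0.15/0.85 cutoffs below are illustrative policy choices, not outputs of the model:

```python
# Top-5 membership probabilities reproduced from the Step 5 output.
top5_probs = {
    "Sci-Fi": 0.996, "Action": 0.921, "Thriller": 0.889, "Drama": 0.626,
    "Family": 0.458, "Short Film": 0.382, "World Cinema": 0.341,
    "Musical": 0.311, "Classic": 0.288, "Mystery": 0.270,
}

def classify(p: float, lo: float = 0.15, hi: float = 0.85) -> str:
    """Label a ranking claim as settled or contested (cutoffs illustrative)."""
    if p >= hi:
        return "SETTLED: in top 5"
    if p <= lo:
        return "SETTLED: out of top 5"
    return "CONTESTED: do not treat the ranking as certain"

for name, p in top5_probs.items():
    print(f"{name:>14s}  {p:.3f}  {classify(p)}")
```

Under these cutoffs the top three slots are settled while everything from Drama down is contested, which is the posture the budget process should adopt.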

The Experimental category's budget cut last quarter was a mistake driven by small-sample noise; the hierarchical model would have prevented it. Over 25 categories and 4 quarters per year, such corrections add up to a meaningful share of the $48 million budget, and, more importantly, to a content catalog that serves a broader audience rather than one that doubles down on whatever happened to perform well in a small sample.