Case Study 2: StreamRec Recommendation Effect — Y(1) vs. Y(0) for Recommended Items
Context
StreamRec's recommendation algorithm drives 62% of all content engagement on the platform. The product team reports this metric proudly: "Our algorithm is responsible for nearly two-thirds of user engagement." But the causal inference team asks a sharper question: How much of that engagement would have occurred without the recommendation?
The distinction matters for business decisions. If the algorithm merely predicts what users would have consumed anyway (high $Y(0)$, small $Y(1) - Y(0)$), then the 62% figure overstates the algorithm's value — users would have found most of this content organically. If the algorithm genuinely drives discovery of content users would not have found (low $Y(0)$, large $Y(1) - Y(0)$), then the algorithm creates substantial incremental value.
The difference between these two scenarios has direct implications: it determines whether StreamRec's \$12M annual recommendation infrastructure investment is generating incremental revenue or merely measuring organic behavior.
Defining the Causal Framework
Unit: A user-item pair $(i, j)$. We ask: what is the effect of recommending item $j$ to user $i$?
Treatment: $D_{ij} = 1$ if the recommendation algorithm places item $j$ in user $i$'s homepage carousel (positions 1-10); $D_{ij} = 0$ if the item is not shown in the carousel (but remains discoverable through search and browse).
Outcome: $Y_{ij}$ = engagement measured as watch time in minutes. Alternative outcomes could include binary engagement (watched vs. did not watch) or completion rate.
Potential outcomes:
- $Y_{ij}(1)$: Minutes user $i$ would spend on item $j$ if it appears in their carousel.
- $Y_{ij}(0)$: Minutes user $i$ would spend on item $j$ if it does not appear in their carousel.
Causal effect: $\tau_{ij} = Y_{ij}(1) - Y_{ij}(0)$ — the incremental engagement caused by the recommendation.
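Before simulating at scale, the notation is easy to ground with a tiny worked example. The numbers below are hypothetical (not StreamRec data), chosen so that both potential outcomes are visible for every pair, which is exactly what real data never shows us.

```python
import numpy as np

# Toy numbers (hypothetical): three user-item pairs for which we
# pretend to know BOTH potential outcomes.
y1 = np.array([10.0, 6.0, 4.0])   # minutes if recommended
y0 = np.array([8.0, 2.0, 3.5])    # minutes if not recommended
d = np.array([1, 1, 0])           # treatment actually assigned

tau = y1 - y0                     # unit-level effects: 2.0, 4.0, 0.5
ate = tau.mean()                  # average treatment effect
att = tau[d == 1].mean()          # average effect on the treated

# In real data only one potential outcome per pair is ever observed:
y_obs = np.where(d == 1, y1, y0)  # 10.0, 6.0, 3.5
```

The last line is the fundamental problem of causal inference: once `d` is realized, the counterfactual column is missing for every row.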
SUTVA Evaluation
No interference — problematic. StreamRec's carousel has 10 positions. If item $A$ is recommended in position 3, it displaces whatever would have been in position 3. User $i$'s engagement with item $B$ may depend on whether item $A$ is also recommended (competition for attention). Strictly, $Y_{ij}(D_{ij})$ should be $Y_{ij}(\mathbf{D}_i)$ — the full vector of recommendations shown to user $i$.
The team addresses this by defining treatment at the item level while holding the carousel structure constant: "the effect of including item $j$ in the carousel vs. replacing it with the next-best item according to the algorithm." This comparison is more realistic than "item in carousel vs. empty slot."
Consistency — manageable. The treatment "appears in carousel positions 1-10" is well-defined within the current UI. However, position within the carousel matters: position 1 has 3x the click-through rate of position 10. The team initially defines treatment as "appears anywhere in top 10" and notes that position effects could be studied separately.
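If position effects were modeled explicitly, the treatment would become multi-valued. A sketch of the extended notation (an assumed formalization, not part of the team's definition): let $P_{ij} \in \{0, 1, \dots, 10\}$ denote carousel position, with $P_{ij} = 0$ meaning not shown, and define potential outcomes $Y_{ij}(p)$ with position-specific effects

$$\tau_{ij}(p) = Y_{ij}(p) - Y_{ij}(0), \qquad p = 1, \dots, 10.$$

The binary effect $\tau_{ij}$ studied here is then a weighted average of the $\tau_{ij}(p)$ over the positions the algorithm actually assigns.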
Simulating the Observational Challenge
```python
import numpy as np
import pandas as pd
import statsmodels.api as sm


def simulate_streamrec_engagement(
    n_pairs: int = 50000,
    seed: int = 42,
) -> pd.DataFrame:
    """Simulate user-item engagement data with realistic confounding.

    The recommendation algorithm selects items based on predicted
    engagement (collaborative filtering scores), which are correlated
    with organic engagement. This creates confounding.

    Args:
        n_pairs: Number of user-item pairs.
        seed: Random seed.

    Returns:
        DataFrame with user features, item features, treatment,
        potential outcomes, and observed outcomes.
    """
    rng = np.random.RandomState(seed)

    # User features
    user_activity = rng.exponential(1.0, n_pairs)  # Sessions per day
    user_tenure = rng.poisson(18, n_pairs).clip(1, 96)  # Months on platform
    user_diversity = rng.beta(2, 5, n_pairs)  # Content diversity preference

    # Item features
    item_quality = rng.beta(5, 3, n_pairs)  # Production quality score
    item_popularity = rng.beta(2, 8, n_pairs)  # Popularity score (higher = more popular)
    item_length = rng.lognormal(2.5, 0.8, n_pairs).clip(3, 120)  # Duration in minutes

    # User-item affinity (the collaborative filtering signal)
    # This drives BOTH the recommendation decision AND organic engagement
    affinity = (
        0.3 * user_activity
        + 0.2 * item_quality
        + 0.4 * item_popularity
        + 0.3 * user_diversity * (1 - item_popularity)  # Diverse users like niche items
        + rng.normal(0, 0.3, n_pairs)
    )

    # Recommendation decision (confounded by affinity)
    rec_logit = -0.5 + 1.5 * affinity + rng.normal(0, 0.3, n_pairs)
    rec_prob = 1 / (1 + np.exp(-rec_logit))
    treatment = rng.binomial(1, rec_prob)

    # Potential outcomes
    # Y(0): organic engagement (user finds and watches on their own)
    y0 = (
        2.0 * affinity
        + 1.5 * item_quality * item_length / 30
        + 0.5 * user_activity
        + rng.normal(0, 2, n_pairs)
    ).clip(0, None)

    # Y(1): engagement with recommendation
    # Heterogeneous treatment effect:
    # - Larger for low-activity users (discovery effect)
    # - Larger for niche items (wouldn't find otherwise)
    # - Smaller for items user would watch anyway
    tau_i = (
        2.5  # Base recommendation effect
        + 1.5 * (1 - np.tanh(user_activity))  # Stronger for low-activity users
        + 2.0 * (1 - item_popularity)  # Stronger for niche items
        - 0.3 * affinity  # Weaker when affinity is already high
        + rng.normal(0, 0.5, n_pairs)
    ).clip(0, None)
    y1 = y0 + tau_i

    y_obs = treatment * y1 + (1 - treatment) * y0

    return pd.DataFrame({
        "user_activity": user_activity,
        "user_tenure": user_tenure,
        "user_diversity": user_diversity,
        "item_quality": item_quality,
        "item_popularity": item_popularity,
        "item_length": item_length,
        "affinity": affinity,
        "rec_prob": rec_prob,
        "treatment": treatment,
        "y0": y0,
        "y1": y1,
        "true_ite": tau_i,
        "y_obs": y_obs,
    })
```
```python
data = simulate_streamrec_engagement()

print("StreamRec Engagement Data Summary")
print("=" * 55)
print(f"Total user-item pairs: {len(data):,}")
print(f"Recommended (D=1): {data['treatment'].sum():,} "
      f"({data['treatment'].mean():.1%})")
print(f"Mean engagement (D=1): {data.loc[data['treatment']==1, 'y_obs'].mean():.2f} min")
print(f"Mean engagement (D=0): {data.loc[data['treatment']==0, 'y_obs'].mean():.2f} min")
print()

# Causal estimands
print("True Causal Effects")
print("-" * 55)
print(f"True ATE: {data['true_ite'].mean():.3f} min")
print(f"True ATT: {data.loc[data['treatment']==1, 'true_ite'].mean():.3f} min")
print(f"True ATU: {data.loc[data['treatment']==0, 'true_ite'].mean():.3f} min")
print()

# Naive vs. adjusted
naive = (data.loc[data["treatment"]==1, "y_obs"].mean()
         - data.loc[data["treatment"]==0, "y_obs"].mean())
sel_bias = (data.loc[data["treatment"]==1, "y0"].mean()
            - data.loc[data["treatment"]==0, "y0"].mean())
print("Estimation")
print("-" * 55)
print(f"Naive estimate: {naive:.3f} min")
print(f"Selection bias: {sel_bias:.3f} min")
print(f"Overstatement factor: {naive / data['true_ite'].mean():.1f}x")
```
```
StreamRec Engagement Data Summary
=======================================================
Total user-item pairs: 50,000
Recommended (D=1): 30,876 (61.8%)
Mean engagement (D=1): 7.24 min
Mean engagement (D=0): 2.73 min

True Causal Effects
-------------------------------------------------------
True ATE: 3.488 min
True ATT: 3.052 min
True ATU: 4.201 min

Estimation
-------------------------------------------------------
Naive estimate: 4.516 min
Selection bias: 1.464 min
Overstatement factor: 1.3x
```
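The gap between the naive estimate and the ATT is not a coincidence: by construction, naive = ATT + selection bias, because the treated units reveal $Y(1)$ and the untreated units reveal $Y(0)$. A self-contained toy check of this identity (an assumed data-generating process, deliberately simpler than the StreamRec simulation):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
u = rng.normal(size=n)                     # confounder, like affinity
d = rng.binomial(1, 1 / (1 + np.exp(-u)))  # treatment leans toward high u
y0 = 2.0 * u + rng.normal(size=n)          # organic outcome
y1 = y0 + 1.0                              # constant causal effect of +1 min
y = np.where(d == 1, y1, y0)               # observed outcome

naive = y[d == 1].mean() - y[d == 0].mean()
att = (y1 - y0)[d == 1].mean()                    # exactly 1.0 here
sel_bias = y0[d == 1].mean() - y0[d == 0].mean()  # E[Y0|D=1] - E[Y0|D=0] > 0

# Identity: naive = ATT + selection bias (up to floating point)
```

Whenever treatment is targeted toward units with high organic outcomes, the selection-bias term is positive and the naive comparison overstates the effect.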
The Business Impact of Getting This Wrong
The naive analysis attributes 4.52 minutes of engagement per recommendation to the algorithm. The true causal effect is 3.49 minutes. The overstatement is 30% — not as dramatic as sign reversal (cf. Case Study 1), but significant for business decisions.
With 30,876 recommendations, the naive analysis credits the algorithm with $30{,}876 \times 4.52 = 139{,}559$ minutes of incremental engagement. The true causal impact is $30{,}876 \times 3.05 = 94{,}172$ minutes (using the ATT, since we are evaluating the recommendations that were actually made). The naive analysis inflates the recommendation system's impact by approximately 48%.
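These back-of-the-envelope figures can be reproduced directly from the rounded outputs above:

```python
# Attribution arithmetic using the rounded figures from the simulation output
n_recs = 30_876
naive_minutes = n_recs * 4.52    # naive credit: ~139,560 min
causal_minutes = n_recs * 3.05   # ATT-based credit: ~94,172 min
inflation = naive_minutes / causal_minutes - 1
print(f"{inflation:.0%}")        # prints "48%"
```

The ~48% inflation is relative to the ATT, the right benchmark for recommendations that were actually made; relative to the ATE the overstatement is the ~30% quoted above.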
This has concrete business consequences: if StreamRec prices advertising based on recommendation-driven engagement, it is overcharging advertisers by nearly 50% for the incremental reach attributable to the algorithm.
Regression Adjustment
```python
cov_cols = ["user_activity", "user_tenure", "user_diversity",
            "item_quality", "item_popularity", "item_length"]
X = data[cov_cols].values
y = data["y_obs"].values
d = data["treatment"].values

# Standardize covariates for numerical stability
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_std = scaler.fit_transform(X)

design = sm.add_constant(np.column_stack([d, X_std]))
model = sm.OLS(y, design).fit(cov_type="HC1")
reg_est = model.params[1]
reg_se = model.bse[1]

print("Regression Adjustment Results")
print("=" * 55)
print(f"Adjusted estimate: {reg_est:.3f} min (SE: {reg_se:.3f})")
print(f"95% CI: [{reg_est - 1.96*reg_se:.3f}, "
      f"{reg_est + 1.96*reg_se:.3f}]")
print(f"True ATE: {data['true_ite'].mean():.3f} min")
print(f"Residual bias: {reg_est - data['true_ite'].mean():+.3f} min")
print()
print("Affinity is an unobserved composite score. When we omit it")
print("and control only for its components, the linear regression")
print("captures most but not all of the confounding.")
```
```
Regression Adjustment Results
=======================================================
Adjusted estimate: 3.564 min (SE: 0.052)
95% CI: [3.462, 3.666]
True ATE: 3.488 min
Residual bias: +0.076 min

Affinity is an unobserved composite score. When we omit it
and control only for its components, the linear regression
captures most but not all of the confounding.
```
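The residual bias is classic omitted-variable bias. In the simulation, `affinity` adds its own noise term and a `user_diversity * (1 - item_popularity)` interaction, so linear controls for the individual covariates cannot absorb the confounder completely. A minimal sketch of the mechanism on toy data (an assumed data-generating process, separate from the simulation above):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200_000
x1, x2 = rng.normal(size=n), rng.normal(size=n)
# Composite confounder with a nonlinear (interaction) component
c = x1 + 0.5 * x1 * x2
d = rng.binomial(1, 1 / (1 + np.exp(-c)))      # treatment depends on c
y = 1.0 * d + 2.0 * c + rng.normal(size=n)     # true effect = 1.0

def slope_on_first(y, cols):
    """OLS coefficient on cols[0], fitting with an intercept."""
    X = np.column_stack([np.ones(len(y))] + cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

biased = slope_on_first(y, [d, x1, x2])  # linear controls miss the x1*x2 part
clean = slope_on_first(y, [d, c])        # controlling the composite recovers ~1.0
```

The interaction term `x1 * x2` is uncorrelated with `x1` and `x2` individually, so linear controls leave a residual confounding channel open, just as the omitted pieces of `affinity` do in the case study.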
Heterogeneous Effects: Who Benefits Most?
```python
# Stratify by user activity level and item popularity
activity_bins = pd.qcut(data["user_activity"], q=3, labels=["Low", "Medium", "High"])
popularity_bins = pd.qcut(data["item_popularity"], q=3, labels=["Niche", "Mid", "Popular"])

print("Treatment Effect by User Activity Level")
print("-" * 55)
for level in ["Low", "Medium", "High"]:
    mask = activity_bins == level
    true_effect = data.loc[mask, "true_ite"].mean()
    n_in_group = mask.sum()
    pct_treated = data.loc[mask, "treatment"].mean()
    print(f"  {level:8s}: True ATE = {true_effect:.2f} min "
          f"(N={n_in_group:,}, {pct_treated:.0%} treated)")
print()

print("Treatment Effect by Item Popularity")
print("-" * 55)
for level in ["Niche", "Mid", "Popular"]:
    mask = popularity_bins == level
    true_effect = data.loc[mask, "true_ite"].mean()
    n_in_group = mask.sum()
    pct_treated = data.loc[mask, "treatment"].mean()
    print(f"  {level:8s}: True ATE = {true_effect:.2f} min "
          f"(N={n_in_group:,}, {pct_treated:.0%} treated)")
```
```
Treatment Effect by User Activity Level
-------------------------------------------------------
  Low     : True ATE = 4.34 min (N=16,667, 45% treated)
  Medium  : True ATE = 3.46 min (N=16,667, 63% treated)
  High    : True ATE = 2.67 min (N=16,666, 78% treated)

Treatment Effect by Item Popularity
-------------------------------------------------------
  Niche   : True ATE = 4.58 min (N=16,667, 41% treated)
  Mid     : True ATE = 3.46 min (N=16,667, 62% treated)
  Popular : True ATE = 2.45 min (N=16,666, 82% treated)
```
The heterogeneity reveals a fundamental misalignment in the recommendation algorithm:
- Low-activity users benefit most from recommendations (4.34 min) but are least likely to receive them (45% treated).
- Niche items benefit most from recommendations (4.58 min) but are least likely to be recommended (41% treated).
- The algorithm targets its recommendations where they have the smallest marginal causal effect.
Implications for StreamRec
- The algorithm creates real value, but less than the naive metrics suggest. The 30% overstatement matters for ROI calculations, ad pricing, and investment decisions. If value scales with the causal share of measured engagement (ATT / naive $\approx 3.05 / 4.52 \approx 68\%$), roughly \$8M of the \$12M annual infrastructure investment is justified by actual incremental engagement; the remaining \$4M or so is "taking credit" for organic behavior.
- The targeting is causally suboptimal. The algorithm recommends where it predicts high engagement, not where it causes high engagement. A causally informed targeting policy (Chapter 19) would redirect recommendations toward low-activity users and niche items — the groups where recommendations make the biggest difference.
- The estimand choice matters. The ATT (3.05 min) evaluates the current policy; the ATU (4.20 min) tells us there is substantial untapped value in recommending to users who currently do not receive recommendations. The difference between ATT and ATU is the opportunity cost of the current targeting strategy.
- Evaluation without causal reasoning is misleading. Standard engagement dashboards (click-through rate, engagement time for recommended items) measure $\mathbb{E}[Y \mid D=1]$, not $\mathbb{E}[Y(1) - Y(0)]$. The former includes organic engagement; the latter isolates the recommendation's causal contribution. Building a causal evaluation pipeline (Chapters 17-19) is essential for understanding the true value of the recommendation system.
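The targeting implication above can be sketched in code. This is a hypothetical illustration (the function and variable names are invented; in practice `predicted_ite` would come from a fitted uplift model):

```python
import numpy as np

def causal_top_k(predicted_ite: np.ndarray, k: int = 10) -> np.ndarray:
    """Indices of the k candidates with the largest predicted uplift.

    Contrast with engagement-based ranking, which sorts on predicted
    E[Y | D=1] and therefore favors items users would watch anyway.
    """
    return np.argsort(predicted_ite)[::-1][:k]

# Toy predicted uplifts for four candidate items (hypothetical values)
uplift = np.array([0.2, 3.1, 1.4, 0.9])
slots = causal_top_k(uplift, k=2)   # items 1 and 2 fill the carousel
```

The only change from the current system is the sort key: predicted uplift $\widehat{\tau}$ rather than predicted engagement, which is exactly the redirection toward low-activity users and niche items that the heterogeneity analysis recommends.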