Case Study 2: StreamRec Uplift Targeting — Recommend Only When Causally Beneficial
Context
StreamRec's recommendation system serves 5 million users. In Chapters 16 and 18, the causal inference team established that the average causal effect of showing an item in the homepage carousel is 2.8 minutes of additional engagement per user-item pair — substantially less than the 8.3 minutes of engagement the algorithm "takes credit for" in standard evaluation (the difference is organic engagement that would have occurred without the recommendation).
The product team has a new question: "Can we increase the total causal impact of recommendations by being selective about when we recommend?"
Currently, the homepage carousel has 10 positions. The ranking algorithm fills all 10 positions for every user on every session. The hypothesis: some of these recommendation slots are wasted on users who would have found the content organically (low causal effect), while other users, particularly newer users who haven't yet discovered content in niche categories, benefit substantially from recommendations (high causal effect). If the system could identify which users benefit most, it could reallocate recommendation slots: replacing low-uplift recommendations with exploratory or promotional content, or simply showing fewer recommendations for a cleaner interface.
Data
The team has data from a recent A/B test ($n = 200,000$ user-sessions, 50/50 split). In the treatment group, items were recommended using the standard algorithm. In the control group, users saw a popularity-based (non-personalized) default. The outcome is total session engagement (minutes).
| Feature | Description |
|---|---|
| `tenure_months` | Account age in months |
| `daily_sessions` | Average daily sessions (trailing 30 days) |
| `category_diversity` | Diversity of content categories consumed (one minus the normalized Herfindahl index; low = niche) |
| `subscription_tier` | Free (0), Standard (1), Premium (2) |
| `age` | User age |
| `hour_of_day` | Session start hour (0-23) |
| `device_type` | Mobile (0) or Desktop (1) |
| `content_completion_rate` | Fraction of started content completed (trailing 30 days) |
| `discovery_rate` | Fraction of consumed content that was "new to user" (trailing 30 days) |
| `social_connections` | Number of platform friends |
The A/B test provides randomized treatment assignment — the propensity is exactly $e = 0.5$ for all users, which eliminates confounding and simplifies CATE estimation.
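Randomization can still fail in practice (bucketing bugs, logging skew), so a quick covariate balance check is cheap insurance before any estimation. A minimal sketch, using two of the column names from the feature table and a simulated stand-in for the real data:

```python
import numpy as np
import pandas as pd


def standardized_mean_differences(
    df: pd.DataFrame, feature_cols: list, treatment_col: str
) -> pd.Series:
    """Standardized mean difference per feature; |SMD| < 0.05 indicates good balance."""
    t = df[df[treatment_col] == 1]
    c = df[df[treatment_col] == 0]
    smd = {}
    for col in feature_cols:
        pooled_sd = np.sqrt((t[col].var() + c[col].var()) / 2)
        smd[col] = (t[col].mean() - c[col].mean()) / pooled_sd
    return pd.Series(smd)


# Simulated randomized assignment: SMDs should be near zero
rng = np.random.default_rng(0)
n = 10_000
demo = pd.DataFrame({
    "tenure_months": rng.exponential(12, n),
    "daily_sessions": rng.poisson(3, n).astype(float),
    "personalized_rec": rng.integers(0, 2, n),
})
smds = standardized_mean_differences(
    demo, ["tenure_months", "daily_sessions"], "personalized_rec"
)
print(smds.abs().max())  # small under randomization
```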
Analytical Approach
Step 1: Verify the ATE
Before estimating heterogeneity, the team confirms the average treatment effect:
```python
import numpy as np
import pandas as pd
from scipy import stats


def verify_ate(
    df: pd.DataFrame,
    outcome_col: str = "session_minutes",
    treatment_col: str = "personalized_rec",
) -> dict:
    """Verify the ATE from the A/B test.

    Uses the simple difference in means, which is unbiased under
    randomization. (A regression-adjusted estimator can be layered on
    for variance reduction.)

    Args:
        df: A/B test DataFrame.
        outcome_col: Session engagement in minutes.
        treatment_col: Binary treatment indicator (1 = personalized, 0 = default).

    Returns:
        ATE estimate, CI, and p-value.
    """
    treated = df[df[treatment_col] == 1][outcome_col]
    control = df[df[treatment_col] == 0][outcome_col]
    ate = treated.mean() - control.mean()
    # Welch standard error (no equal-variance assumption)
    se = np.sqrt(treated.var() / len(treated) + control.var() / len(control))
    ci = (ate - 1.96 * se, ate + 1.96 * se)
    # Welch's t-test, consistent with the SE above
    t_stat, p_value = stats.ttest_ind(treated, control, equal_var=False)
    return {
        "ate": round(ate, 2),
        "se": round(se, 3),
        "ci_95": (round(ci[0], 2), round(ci[1], 2)),
        "p_value": f"{p_value:.2e}",
    }
```
Result: ATE = 2.83 minutes, 95% CI $[2.41, 3.25]$, $p < 10^{-30}$. The personalized recommendation system causally increases engagement by approximately 2.8 minutes per session on average.
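The regression-adjusted estimator mentioned in the docstring can be sketched as a Lin-style interacted regression: regress the outcome on treatment, centered covariates, and their interactions, and read the ATE off the treatment coefficient. This is an illustrative implementation on simulated data, not the team's production code:

```python
import numpy as np


def regression_adjusted_ate(Y: np.ndarray, D: np.ndarray, X: np.ndarray) -> float:
    """Lin-style estimator: regress Y on D, centered X, and D * centered X.

    Unbiased for the ATE under randomization; typically tighter than the
    raw difference in means when X predicts Y.
    """
    Xc = X - X.mean(axis=0)  # center covariates so the D coefficient is the ATE
    Z = np.column_stack([np.ones(len(Y)), D, Xc, D[:, None] * Xc])
    beta, *_ = np.linalg.lstsq(Z, Y, rcond=None)
    return beta[1]  # coefficient on D


# Simulated check: true effect is 2.8, covariate explains much of Y
rng = np.random.default_rng(1)
n = 20_000
x = rng.normal(size=(n, 1))
d = rng.integers(0, 2, n)
y = 10 + 5 * x[:, 0] + 2.8 * d + rng.normal(size=n)
print(regression_adjusted_ate(y, d, x))  # close to 2.8
```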
Step 2: Estimate CATEs with Multiple Methods
Because treatment assignment is randomized, several CATE estimators are valid here; the team fits two and compares them as a robustness check:
```python
from econml.dml import CausalForestDML
from econml.metalearners import XLearner as EconXLearner
from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier


def estimate_streamrec_cates(
    df: pd.DataFrame,
    feature_cols: list,
    treatment_col: str = "personalized_rec",
    outcome_col: str = "session_minutes",
) -> dict:
    """Estimate CATEs using causal forest and X-learner.

    With RCT data, both methods are valid. Comparing them provides
    a robustness check.

    Args:
        df: A/B test DataFrame.
        feature_cols: User features for CATE estimation.
        treatment_col: Treatment indicator.
        outcome_col: Session engagement.

    Returns:
        Dictionary with fitted models and CATE estimates.
    """
    X = df[feature_cols].values
    D = df[treatment_col].values
    Y = df[outcome_col].values

    # Causal Forest (discrete_treatment=True so the classifier model_t
    # is used correctly for the binary treatment)
    cf = CausalForestDML(
        model_y=GradientBoostingRegressor(
            n_estimators=200, max_depth=4, random_state=42
        ),
        model_t=GradientBoostingClassifier(
            n_estimators=100, max_depth=3, random_state=42
        ),
        discrete_treatment=True,
        n_estimators=500,
        min_samples_leaf=30,
        random_state=42,
    )
    cf.fit(Y, D, X=X)
    tau_cf = cf.effect(X).flatten()

    # X-Learner (well-suited for balanced treatment)
    xl = EconXLearner(
        models=GradientBoostingRegressor(
            n_estimators=200, max_depth=4, random_state=42
        ),
        propensity_model=GradientBoostingClassifier(
            n_estimators=100, max_depth=3, random_state=42
        ),
    )
    xl.fit(Y, D, X=X)
    tau_xl = xl.effect(X).flatten()

    return {
        "causal_forest": cf,
        "x_learner": xl,
        "tau_cf": tau_cf,
        "tau_xl": tau_xl,
        "correlation": np.corrcoef(tau_cf, tau_xl)[0, 1],
    }
```
The correlation between causal forest and X-learner CATEs is 0.89 — strong agreement. Both methods identify similar patterns of heterogeneity, which increases confidence in the findings.
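Pointwise correlation can be supplemented with a decile-level check, since targeting policies act on ranks rather than raw CATE values. A small sketch (the `cate_agreement` helper is illustrative, not part of the team's codebase):

```python
import numpy as np
import pandas as pd


def cate_agreement(tau_a: np.ndarray, tau_b: np.ndarray, n_bins: int = 10):
    """Compare two CATE estimates: Pearson r plus decile-level means.

    Decile-level agreement matters more than pointwise agreement for
    targeting, because policies threshold on rank, not on raw values.
    """
    r = np.corrcoef(tau_a, tau_b)[0, 1]
    deciles = pd.qcut(tau_a, n_bins, labels=False, duplicates="drop")
    by_decile = (
        pd.DataFrame({"a": tau_a, "b": tau_b, "decile": deciles})
        .groupby("decile")
        .mean()
    )
    return r, by_decile


# Toy check with two noisy versions of the same underlying CATE
rng = np.random.default_rng(2)
base = rng.normal(2.8, 1.5, 5_000)
r, by_decile = cate_agreement(
    base + rng.normal(0, 0.5, 5_000),
    base + rng.normal(0, 0.5, 5_000),
)
print(round(r, 2))
```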
Step 3: Identify User Segments
```python
def segment_analysis(
    df: pd.DataFrame,
    tau_hat: np.ndarray,
    feature_cols: list,
) -> pd.DataFrame:
    """Analyze CATE variation across user segments.

    Args:
        df: User DataFrame.
        tau_hat: Estimated CATEs (n,).
        feature_cols: Feature names for segmentation.

    Returns:
        Segment-level summary.
    """
    df_analysis = df[feature_cols].copy()
    df_analysis["tau_hat"] = tau_hat

    # Define user segments
    df_analysis["user_type"] = pd.cut(
        df_analysis["tenure_months"],
        bins=[0, 3, 12, 36, np.inf],
        labels=["New (<3mo)", "Growing (3-12mo)", "Established (1-3yr)", "Veteran (3yr+)"],
    )
    df_analysis["content_profile"] = pd.cut(
        df_analysis["category_diversity"],
        bins=[0, 0.3, 0.6, 1.0],
        include_lowest=True,  # keep users with diversity exactly 0
        labels=["Niche", "Moderate", "Broad"],
    )

    # observed=True: report only segment combinations that occur
    summary = df_analysis.groupby(
        ["user_type", "content_profile"], observed=True
    ).agg(
        mean_cate=("tau_hat", "mean"),
        median_cate=("tau_hat", "median"),
        std_cate=("tau_hat", "std"),
        count=("tau_hat", "count"),
        pct_positive=("tau_hat", lambda x: (x > 0).mean()),
    ).round(2)
    return summary
```
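The confidence intervals reported in the segment tables that follow can be approximated, for illustration, with a normal approximation over the estimated CATEs. Note that this treats $\hat{\tau}$ as data and ignores first-stage estimation error, so the intervals are optimistic; honest forests or a bootstrap over the whole pipeline give more defensible CIs:

```python
import numpy as np
import pandas as pd


def segment_cis(df_analysis: pd.DataFrame, group_col: str = "user_type") -> pd.DataFrame:
    """Normal-approximation 95% CI for the mean CATE within each segment."""
    g = df_analysis.groupby(group_col, observed=True)["tau_hat"]
    out = g.agg(["mean", "std", "count"])
    half = 1.96 * out["std"] / np.sqrt(out["count"])
    out["ci_low"] = out["mean"] - half
    out["ci_high"] = out["mean"] + half
    return out


# Toy example with two segments
rng = np.random.default_rng(3)
toy = pd.DataFrame({
    "user_type": ["New"] * 1000 + ["Veteran"] * 1000,
    "tau_hat": np.concatenate([
        rng.normal(6.8, 2.0, 1000), rng.normal(0.6, 2.0, 1000)
    ]),
})
print(segment_cis(toy))
```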
Key Findings
Finding 1: Tenure Is the Strongest Driver of Recommendation Uplift
| User segment | Mean CATE (minutes) | 95% CI | N |
|---|---|---|---|
| New users (< 3 months), niche interests | 6.8 | [5.9, 7.7] | 12,400 |
| New users (< 3 months), broad interests | 4.2 | [3.5, 4.9] | 18,200 |
| Growing users (3-12 months), niche interests | 4.1 | [3.5, 4.7] | 22,800 |
| Growing users (3-12 months), broad interests | 2.5 | [2.0, 3.0] | 31,600 |
| Established users (1-3 years), any | 1.4 | [1.0, 1.8] | 68,000 |
| Veteran users (3+ years), any | 0.6 | [0.1, 1.1] | 47,000 |
New users with niche content preferences benefit most: recommendations add 6.8 minutes per session because these users have not yet discovered the platform's long-tail content. Without personalized recommendations, they would see only popular content (the default), missing niche items that match their preferences.
Veteran users with broad tastes benefit least: they have already found most content they like through organic browsing. Personalized recommendations are largely redundant for them — they add only 0.6 minutes per session, most of which is marginal reordering of content they would have found in the first few scrolls.
Finding 2: Causal Feature Importance vs. Predictive Feature Importance
| Feature | Causal importance (drives CATE variation) | Predictive importance (drives engagement $Y$) |
|---|---|---|
| `tenure_months` | 1st (0.28) | 3rd (0.12) |
| `category_diversity` | 2nd (0.22) | 7th (0.04) |
| `discovery_rate` | 3rd (0.16) | 5th (0.08) |
| `content_completion_rate` | 4th (0.10) | 1st (0.25) |
| `daily_sessions` | 5th (0.09) | 2nd (0.21) |
| `subscription_tier` | 6th (0.07) | 4th (0.10) |
The top causal feature (tenure) is only the third most important predictive feature. The top predictive feature (content completion rate) is only fourth for causal importance. This is the pattern described in Section 19.3: the features that predict who engages are different from the features that predict who benefits from recommendations.
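One way such a comparison can be produced: take the outcome model's feature importances as "predictive" importance, and the importances of a model fit to the estimated CATE surface as a proxy for "causal" importance. The helper below is an illustrative sketch on toy data, not necessarily the exact importance measures behind the table:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor


def importance_comparison(X, Y, tau_hat, feature_names) -> dict:
    """Contrast (causal-proxy, predictive) importance per feature.

    Predictive: importance for the outcome Y.
    Causal proxy: importance for the estimated CATE surface tau_hat.
    """
    pred_model = GradientBoostingRegressor(
        n_estimators=100, max_depth=3, random_state=42
    ).fit(X, Y)
    cate_model = GradientBoostingRegressor(
        n_estimators=100, max_depth=3, random_state=42
    ).fit(X, tau_hat)
    return {
        name: (c, p)
        for name, c, p in zip(
            feature_names,
            cate_model.feature_importances_,
            pred_model.feature_importances_,
        )
    }


# Toy data: x0 drives the effect, x1 drives the baseline outcome
rng = np.random.default_rng(4)
n = 5_000
X = rng.normal(size=(n, 2))
tau = 2.0 * X[:, 0]                       # effect heterogeneity from x0 only
Y = 5.0 * X[:, 1] + tau * rng.integers(0, 2, n)
imp = importance_comparison(X, Y, tau, ["x0", "x1"])
```

On this toy data, `x0` dominates the causal-proxy ranking while `x1` dominates the predictive ranking, mirroring the divergence in the table.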
Finding 3: Sleeping Dogs Exist
Approximately 3.2% of user-sessions ($\approx 6,400$ in the test set) have negative estimated CATEs — personalized recommendations decrease their engagement. The team investigates and identifies a "recommendation fatigue" pattern: veteran users with high daily session counts who see highly similar recommendations across sessions show reduced engagement compared to the default (which includes more randomness). The mechanism: algorithmic homogeneity causes boredom, while the default's randomness accidentally provides beneficial diversity.
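A sketch of how such a pattern can be surfaced: compare feature means between negative- and positive-CATE user-sessions. The helper and the simulated data below are illustrative only:

```python
import numpy as np
import pandas as pd


def profile_negative_cate(
    df: pd.DataFrame, tau_hat: np.ndarray, feature_cols: list
) -> pd.DataFrame:
    """Compare feature means for negative- vs. positive-CATE user-sessions.

    Large ratios flag the characteristics of potential sleeping dogs
    (e.g., high tenure and high daily sessions in this case study).
    """
    mask = tau_hat < 0
    prof = pd.DataFrame({
        "negative_cate": df.loc[mask, feature_cols].mean(),
        "positive_cate": df.loc[~mask, feature_cols].mean(),
    })
    prof["ratio"] = prof["negative_cate"] / prof["positive_cate"]
    return prof


# Toy data mimicking the fatigue pattern: harm concentrates in
# high-tenure, high-frequency users
rng = np.random.default_rng(5)
n = 10_000
demo = pd.DataFrame({
    "tenure_months": rng.exponential(18, n),
    "daily_sessions": rng.poisson(3, n).astype(float) + 0.1,
})
tau = 3.0 - 0.06 * demo["tenure_months"] - 0.3 * demo["daily_sessions"]
prof = profile_negative_cate(
    demo, tau.to_numpy(), ["tenure_months", "daily_sessions"]
)
print(prof)
```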
Targeting Policy
The team builds three candidate policies and evaluates them on the held-out A/B test data:
```python
def evaluate_streamrec_policies(
    tau_hat: np.ndarray,
    Y: np.ndarray,
    D: np.ndarray,
) -> pd.DataFrame:
    """Evaluate targeting policies using the A/B test data.

    Scores each policy with AIPW uplift scores computed on the held-out
    A/B test, so the evaluation does not rely solely on the training-set
    CATEs that define the policies.

    Args:
        tau_hat: Estimated CATEs from training data (defines the policies).
        Y: Outcomes from held-out A/B test.
        D: Treatment assignments from held-out A/B test.

    Returns:
        Policy comparison table.
    """
    n = len(Y)
    e = 0.5  # Known propensity from RCT
    # AIPW uplift scores with constant outcome models (valid under
    # randomization; in practice, use fitted models for mu_1 and mu_0)
    mu_1 = Y[D == 1].mean()
    mu_0 = Y[D == 0].mean()
    gamma = (
        mu_1 - mu_0
        + D * (Y - mu_1) / e
        - (1 - D) * (Y - mu_0) / (1 - e)
    )
    policies = {
        "Treat all": np.ones(n, dtype=int),
        "Treat none": np.zeros(n, dtype=int),
        "Top 80%": (tau_hat >= np.percentile(tau_hat, 20)).astype(int),
        "Top 50%": (tau_hat >= np.percentile(tau_hat, 50)).astype(int),
        "Positive CATE only": (tau_hat > 0).astype(int),
    }
    results = []
    for name, pi in policies.items():
        treated = pi == 1
        results.append({
            "Policy": name,
            "Fraction treated": f"{pi.mean():.1%}",
            "Mean CATE (treated)": (
                f"{tau_hat[treated].mean():.2f} min" if treated.any() else "—"
            ),
            "Est. total uplift (min)": f"{gamma[treated].sum():,.0f}",
        })
    return pd.DataFrame(results)
```
| Policy | Fraction treated | Mean CATE (treated) | Total uplift relative to treat-all |
|---|---|---|---|
| Treat all | 100% | 2.83 min | Baseline |
| Top 80% | 80% | 3.41 min | 96% of total uplift with 80% of slots |
| Top 50% | 50% | 4.52 min | 80% of total uplift with 50% of slots |
| Positive CATE only | 96.8% | 2.93 min | 100.4% of total uplift (avoids Sleeping Dogs) |
| Treat none | 0% | — | 0% |
The "Positive CATE only" policy is the clear winner for overall engagement: it captures more total uplift than treat-all by avoiding the 3.2% of users whose engagement decreases with personalized recommendations. For those users, the system would fall back to the popularity-based default.
The "Top 50%" policy is the most efficient if recommendation slots carry a cost (computational, UI real estate, or content diversity opportunity cost): it captures 80% of the causal benefit with half the interventions, and the mean effect per treated user nearly doubles.
Production Deployment Recommendation
The team recommends a phased rollout:
Phase 1 — Immediate: Implement the "Positive CATE only" policy. Suppress personalized recommendations for the ~3% of user-sessions identified as Sleeping Dogs. Replace with curated/editorial content or a "surprise me" random selection. Expected impact: +0.4% total engagement minutes (by eliminating negative-effect recommendations), at zero marginal cost.
Phase 2 — Experiment: Run an A/B test of the "Top 50%" policy against treat-all. In the Top 50% condition, the bottom 50% of users by $\hat{\tau}$ see a hybrid: 5 personalized slots + 5 diverse/exploratory slots (instead of 10 personalized). Measure not just session engagement but also 30-day retention, content diversity consumed, and creator fairness metrics.
Phase 3 — Monitor and retrain: The CATE model degrades as user behavior evolves. Schedule monthly retraining using the latest A/B test data. Monitor the Qini coefficient on each new A/B test cohort to detect degradation. Alert if the Qini coefficient drops below 0.6 (the current value is 0.78).
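Qini conventions differ across libraries, so the monitored value should be pinned to one definition. A minimal sketch of one common normalization (area between the Qini curve and the random-targeting diagonal, scaled by the curve's endpoint); this is illustrative and not necessarily the exact definition behind the 0.78 figure:

```python
import numpy as np


def qini_coefficient(tau_hat: np.ndarray, Y: np.ndarray, D: np.ndarray) -> float:
    """Qini coefficient for a CATE model on randomized holdout data.

    Ranks units by predicted uplift, accumulates the Qini curve
    (incremental treated-vs-control outcome mass), and reports the mean
    gap between that curve and the random-targeting diagonal, normalized
    by the curve's endpoint.
    """
    order = np.argsort(-tau_hat)          # best predicted uplift first
    Y, D = Y[order], D[order]
    cum_t = np.cumsum(Y * D)              # treated outcome mass so far
    cum_c = np.cumsum(Y * (1 - D))        # control outcome mass so far
    n_t = np.cumsum(D)
    n_c = np.cumsum(1 - D)
    # Qini curve: incremental outcomes if everyone ranked so far were treated
    qini = cum_t - cum_c * np.where(n_c > 0, n_t / np.maximum(n_c, 1), 0)
    diag = qini[-1] * np.arange(1, len(Y) + 1) / len(Y)
    return (qini - diag).mean() / abs(qini[-1])


# Oracle ranking vs. random ranking on simulated RCT data
rng = np.random.default_rng(6)
n = 20_000
x = rng.normal(size=n)
tau = 1.0 + x                             # heterogeneous true effect
d = rng.integers(0, 2, n).astype(float)
y = 5.0 + tau * d + 0.5 * rng.normal(size=n)
q_good = qini_coefficient(tau, y, d)      # oracle CATE: clearly positive
q_rand = qini_coefficient(rng.normal(size=n), y, d)  # random: near zero
```

A drop of `q_good` toward `q_rand` on a fresh A/B cohort is exactly the degradation signal Phase 3 is designed to catch.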
Lessons
This case study demonstrates that even a well-performing recommendation system can be improved by thinking causally. The shift from "predict what users will engage with" to "predict which users will engage because of the recommendation" surfaces three actionable insights: (1) new niche users are the highest-value targets for personalized recommendations, (2) veteran broad-taste users receive almost no causal benefit, and (3) a small but measurable fraction of users are actively harmed by recommendation homogeneity. A targeting policy based on causal forests captures nearly all the system's causal benefit while treating substantially fewer users, freeing recommendation slots for exploration, diversity, and serendipity.