Case Study 2: Causal Inference at StreamFlow --- Did the Retention Offer Work?
Background
StreamFlow's churn prediction model has been in production for six months. The model identifies high-risk subscribers (predicted churn probability > 0.20), and the customer success team sends them a retention offer: a 20% discount for three months plus a personalized email from the account manager.
The numbers look good. Before the model, StreamFlow's monthly churn rate was 12%. Six months after deploying the retention program, the overall churn rate is 9.2%. Rachel Torres, VP of Customer Success, reports this as a 23% reduction in churn and attributes it to the model.
The CFO, James Park, is less convinced. "The economy improved. We launched three new features. A competitor raised their prices. How do we know the retention offer is what caused the improvement?" He is asking a causal question, and he is right to ask it.
This case study walks through four approaches to estimating the causal effect of the retention offer, from naive to rigorous.
Phase 1: The Naive Comparison (And Why It Fails)
The simplest approach: compare churn rates between subscribers who received the retention offer (treatment) and those who did not (control).
import numpy as np
import pandas as pd
from scipy import stats
np.random.seed(42)
# Simulate 6 months of StreamFlow data
n_subscribers = 10000
# Subscriber characteristics
tenure_months = np.random.exponential(24, n_subscribers).astype(int) + 1
monthly_usage_hours = np.random.lognormal(2.5, 0.8, n_subscribers)
support_tickets = np.random.poisson(1.2, n_subscribers)
plan_price = np.random.choice([9.99, 14.99, 24.99], n_subscribers,
                              p=[0.4, 0.35, 0.25])
# True churn probability (function of features)
churn_logit = (
    -2.0
    - 0.02 * tenure_months
    - 0.05 * monthly_usage_hours
    + 0.15 * support_tickets
    - 0.03 * plan_price
    + np.random.normal(0, 0.5, n_subscribers)
)
true_churn_prob = 1 / (1 + np.exp(-churn_logit))
# Model scores (correlated with true churn probability but imperfect)
model_score = true_churn_prob + np.random.normal(0, 0.1, n_subscribers)
model_score = np.clip(model_score, 0, 1)
# Treatment assignment: offer sent to subscribers with model_score > 0.20
treated = (model_score > 0.20).astype(int)
# True treatment effect: the offer reduces churn by 6 percentage points
# (on average, for those who receive it)
treatment_effect = -0.06
# Observed outcome
final_churn_prob = true_churn_prob + treatment_effect * treated
final_churn_prob = np.clip(final_churn_prob, 0.01, 0.99)
churned = np.random.binomial(1, final_churn_prob)
df = pd.DataFrame({
    "tenure_months": tenure_months,
    "monthly_usage_hours": monthly_usage_hours,
    "support_tickets": support_tickets,
    "plan_price": plan_price,
    "model_score": model_score,
    "treated": treated,
    "churned": churned,
    "true_churn_prob": true_churn_prob,
})
# Naive comparison
treated_churn = df[df["treated"] == 1]["churned"].mean()
control_churn = df[df["treated"] == 0]["churned"].mean()
naive_effect = treated_churn - control_churn
print("Naive Comparison")
print("=" * 50)
print(f"Treated group (received offer):")
print(f" N: {df['treated'].sum():>6,}")
print(f" Churn rate: {treated_churn:.3f}")
print(f"\nControl group (no offer):")
print(f" N: {(1 - df['treated']).sum():>6,}")
print(f" Churn rate: {control_churn:.3f}")
print(f"\nNaive estimate: {naive_effect:+.3f}")
print(f"True treatment effect: {treatment_effect:+.3f}")
Key Insight --- The naive comparison gives a misleading answer. The treated group has a higher churn rate than the control group, even though the treatment reduces churn. Why? Selection bias. The offer was sent to high-risk subscribers (model score > 0.20). These subscribers were already more likely to churn. The naive comparison conflates "being high-risk" with "receiving the treatment." The treated group churns more not because the offer is harmful, but because the offer was given to people who were already at higher risk.
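The paradox can be made concrete with a decomposition: the naive difference equals the true effect plus the gap in baseline risk between the two groups. A minimal self-contained sketch (separate from the case data; the distribution choices are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy population: baseline churn risk varies across subscribers
n = 200_000
baseline_risk = rng.beta(2, 8, n)          # right-skewed, mean ~0.20

# Offers go to the high-risk half, mimicking score-based targeting
treated = baseline_risk > np.median(baseline_risk)

# True effect: the offer cuts churn probability by 6 points
effect = -0.06
churned = rng.random(n) < np.clip(baseline_risk + effect * treated, 0, 1)

naive = churned[treated].mean() - churned[~treated].mean()
baseline_gap = baseline_risk[treated].mean() - baseline_risk[~treated].mean()

# Naive difference ~= true effect + selection bias (baseline-risk gap)
print(f"naive estimate: {naive:+.3f}")
print(f"true effect:    {effect:+.3f}")
print(f"baseline gap:   {baseline_gap:+.3f}")
print(f"effect + gap:   {effect + baseline_gap:+.3f}")
```

The baseline-risk gap between the halves swamps the 6-point benefit, so the naive estimate comes out positive even though the treatment helps.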
The Selection Bias Diagnosis
# Compare the two groups on baseline characteristics
print("Baseline Characteristics by Treatment Group")
print("=" * 50)
for col in ["tenure_months", "monthly_usage_hours",
            "support_tickets", "plan_price", "true_churn_prob"]:
    treated_mean = df[df["treated"] == 1][col].mean()
    control_mean = df[df["treated"] == 0][col].mean()
    print(f"{col:>25s}: Treated={treated_mean:.2f} "
          f"Control={control_mean:.2f}")
The treated group has shorter tenure, lower usage, more support tickets, and higher baseline churn probability. These groups are not comparable. Any comparison between them is confounded.
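A standard way to quantify "not comparable" is the standardized mean difference (SMD): the gap in group means divided by the pooled standard deviation, with absolute values above 0.1 conventionally read as meaningful imbalance. A minimal sketch on synthetic draws (parameters are illustrative, chosen to mimic the imbalance above):

```python
import numpy as np

rng = np.random.default_rng(1)

def smd(x_treated, x_control):
    """Standardized mean difference: mean gap over pooled std."""
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

# Illustrative draws: treated subscribers have shorter tenure, more tickets
tenure_treated = rng.exponential(12, 3000)
tenure_control = rng.exponential(30, 7000)
tickets_treated = rng.poisson(2.0, 3000)
tickets_control = rng.poisson(1.0, 7000)

for name, t, c in [("tenure_months", tenure_treated, tenure_control),
                   ("support_tickets", tickets_treated, tickets_control)]:
    print(f"{name:>16s}: SMD = {smd(t, c):+.2f}")
```

Both SMDs come out far above the 0.1 rule of thumb, which is the quantitative version of "these groups are not comparable."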
Phase 2: Propensity Score Matching
One approach to fixing selection bias: match each treated subscriber with a control subscriber who has similar characteristics. If the matches are good, the groups become comparable, and the difference in outcomes estimates the causal effect.
The propensity score is the probability of receiving treatment given the covariates. Subscribers with similar propensity scores are similar on all observed characteristics, even if they differ on individual features.
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors
np.random.seed(42)
# Step 1: Estimate propensity scores
features = ["tenure_months", "monthly_usage_hours",
            "support_tickets", "plan_price"]
X = df[features].values
y_treatment = df["treated"].values
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
ps_model = LogisticRegression(random_state=42, max_iter=1000)
ps_model.fit(X_scaled, y_treatment)
df["propensity_score"] = ps_model.predict_proba(X_scaled)[:, 1]
print("Propensity Score Distribution")
print("=" * 50)
print(f"Treated: mean={df[df['treated']==1]['propensity_score'].mean():.3f}, "
      f"std={df[df['treated']==1]['propensity_score'].std():.3f}")
print(f"Control: mean={df[df['treated']==0]['propensity_score'].mean():.3f}, "
      f"std={df[df['treated']==0]['propensity_score'].std():.3f}")
# Step 2: Match treated to control on propensity score
treated_df = df[df["treated"] == 1].copy()
control_df = df[df["treated"] == 0].copy()
nn = NearestNeighbors(n_neighbors=1, metric="euclidean")
nn.fit(control_df[["propensity_score"]].values)
distances, indices = nn.kneighbors(
    treated_df[["propensity_score"]].values
)
# Get matched control subscribers
matched_control = control_df.iloc[indices.flatten()]
# Step 3: Estimate the treatment effect on matched pairs
matched_treated_churn = treated_df["churned"].mean()
matched_control_churn = matched_control["churned"].mean()
psm_effect = matched_treated_churn - matched_control_churn
print(f"\nPropensity Score Matching Results")
print("=" * 50)
print(f"Matched pairs: {len(treated_df):>6,}")
print(f"Treated churn rate: {matched_treated_churn:.3f}")
print(f"Matched control rate:{matched_control_churn:.3f}")
print(f"PSM estimate: {psm_effect:+.3f}")
print(f"True effect: {treatment_effect:+.3f}")
Why This Is Better --- Propensity score matching creates a control group that looks like the treatment group on observed characteristics. The matched comparison removes selection bias from observed confounders. The estimate is closer to the true effect.
Why This Is Not Perfect --- PSM can only control for observed variables. If there is an unobserved confounder (say, subscribers who received the offer also happened to see a new feature release), PSM cannot account for it. This is why randomized experiments are the gold standard.
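A common alternative to matching is inverse propensity weighting (IPW): weight each subscriber by the inverse probability of the treatment they actually received, so the reweighted treated and control groups both resemble the full population. A self-contained sketch with a single observed confounder (all names and parameters are illustrative; the true propensity is used directly to isolate the weighting step, whereas in practice it would be estimated, e.g. with logistic regression):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000

# One observed confounder drives both treatment and outcome
risk = rng.beta(2, 8, n)
p_treat = 0.05 + 0.9 * risk                 # true propensity, known here
treated = rng.random(n) < p_treat

effect = -0.06                              # true average treatment effect
churn_p = 0.10 + 0.5 * risk + effect * treated
churned = (rng.random(n) < churn_p).astype(int)

# Weight each unit by 1 / P(treatment actually received)
w1 = treated / p_treat                      # treated weights
w0 = (1 - treated) / (1 - p_treat)          # control weights

# Hajek (normalized) IPW estimate of the average treatment effect
ate = (np.sum(w1 * churned) / np.sum(w1)
       - np.sum(w0 * churned) / np.sum(w0))
print(f"IPW estimate: {ate:+.3f}   (true effect: {effect:+.3f})")
```

Like matching, IPW only removes bias from *observed* confounders; unlike 1-to-1 matching, it uses every subscriber, at the cost of instability when propensities approach 0 or 1.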
Phase 3: Difference-in-Differences
StreamFlow launched the retention program for Business plan subscribers in January. Professional plan subscribers did not receive offers until April. This creates a natural experiment: we can compare the change in churn for Business subscribers (treatment) vs. Professional subscribers (control) before and after the launch.
np.random.seed(42)
# Monthly churn rates (simulated to match realistic patterns)
months = ["Oct", "Nov", "Dec", "Jan", "Feb", "Mar"]
month_idx = list(range(6))
# Pre-treatment: both groups have a slight downward trend
business_churn = [0.145, 0.140, 0.135, 0.080, 0.074, 0.068]  # drop after Jan launch
professional_churn = [0.110, 0.107, 0.103, 0.098, 0.094, 0.090]
# Pre-treatment period: months 0-2 (Oct-Dec)
# Post-treatment period: months 3-5 (Jan-Mar)
pre_business = np.mean(business_churn[:3])
post_business = np.mean(business_churn[3:])
pre_professional = np.mean(professional_churn[:3])
post_professional = np.mean(professional_churn[3:])
# DiD calculation
change_business = post_business - pre_business
change_professional = post_professional - pre_professional
did_estimate = change_business - change_professional
print("Difference-in-Differences Analysis")
print("=" * 50)
print(f"\nBusiness Plan (Treatment):")
print(f" Pre-treatment avg churn: {pre_business:.3f}")
print(f" Post-treatment avg churn: {post_business:.3f}")
print(f" Change: {change_business:+.3f}")
print(f"\nProfessional Plan (Control):")
print(f" Pre-treatment avg churn: {pre_professional:.3f}")
print(f" Post-treatment avg churn: {post_professional:.3f}")
print(f" Change: {change_professional:+.3f}")
print(f"\nDiD estimate: {did_estimate:+.3f}")
print(f"True treatment effect: {treatment_effect:+.3f}")
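The cell-mean arithmetic above is equivalent to a regression with a group-by-period interaction, churn = b0 + b1·business + b2·post + b3·(business × post), where b3 is the DiD estimate; the regression form also makes it easy to add standard errors and covariates later. A sketch with illustrative monthly rates (three pre and three post months per plan):

```python
import numpy as np

# Illustrative monthly churn rates: Oct-Dec (pre) and Jan-Mar (post)
business     = np.array([0.145, 0.140, 0.135, 0.080, 0.074, 0.068])
professional = np.array([0.110, 0.107, 0.103, 0.098, 0.094, 0.090])

y = np.concatenate([business, professional])
group = np.r_[np.ones(6), np.zeros(6)]              # 1 = Business (treated)
post = np.tile(np.r_[np.zeros(3), np.ones(3)], 2)   # 1 = Jan-Mar

# Design matrix: intercept, group, post, group x post interaction
X = np.column_stack([np.ones(12), group, post, group * post])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# The interaction coefficient is the DiD estimate
print(f"DiD (interaction coefficient): {beta[3]:+.4f}")
```

Because the model is saturated and the cells are balanced, the interaction coefficient reproduces the difference of cell-mean differences exactly.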
Checking the Parallel Trends Assumption
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(10, 6))
ax.plot(month_idx, business_churn, "o-", color="#2563eb",
        linewidth=2, markersize=8, label="Business (Treatment)")
ax.plot(month_idx, professional_churn, "s-", color="#dc2626",
        linewidth=2, markersize=8, label="Professional (Control)")
# Treatment line
ax.axvline(x=2.5, color="gray", linestyle="--", linewidth=1.5,
           label="Treatment Start (Jan)")
ax.set_xticks(month_idx)
ax.set_xticklabels(months)
ax.set_xlabel("Month", fontsize=12)
ax.set_ylabel("Monthly Churn Rate", fontsize=12)
ax.set_title("Parallel Trends Check: Business vs. Professional Plan",
             fontsize=14)
ax.legend(fontsize=11)
ax.set_ylim(0.05, 0.18)
ax.grid(axis="y", alpha=0.3)
plt.tight_layout()
plt.savefig("did_parallel_trends.png", dpi=150, bbox_inches="tight")
plt.close()
print("Saved: did_parallel_trends.png")
Checking the Assumption --- In the pre-treatment period (Oct--Dec), both groups show a slight downward trend in churn, and the trends are approximately parallel. This supports the parallel trends assumption. If the Business plan had been trending downward faster than the Professional plan before the intervention, the DiD estimate would be biased (it would attribute part of the pre-existing trend to the treatment).
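One practical check is a placebo test run entirely inside the pre-treatment window: pretend the program launched a month early and re-estimate. If trends really are parallel, the placebo estimate should be near zero. A sketch on illustrative pre-period rates (Oct--Dec):

```python
import numpy as np

# Pre-treatment monthly churn rates only (Oct, Nov, Dec)
business_pre     = np.array([0.145, 0.140, 0.135])
professional_pre = np.array([0.110, 0.107, 0.103])

# Placebo: pretend the program launched in Dec instead of Jan
placebo_b = business_pre[2] - business_pre[:2].mean()
placebo_p = professional_pre[2] - professional_pre[:2].mean()
placebo_did = placebo_b - placebo_p

# A placebo estimate near zero is consistent with parallel trends
print(f"Placebo DiD (fake Dec launch): {placebo_did:+.4f}")
```

A placebo estimate comparable in size to the real one would be a red flag that the "effect" is just diverging trends.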
Phase 4: Regression Discontinuity Design
The most elegant approach for StreamFlow's specific setup. The retention offer was sent to subscribers with a model score above 0.20. Subscribers just above the threshold (score 0.20--0.25) are nearly identical to subscribers just below it (score 0.15--0.20). The sharp cutoff creates a natural experiment around the threshold.
np.random.seed(42)
# Focus on subscribers near the threshold
bandwidth = 0.10 # Look at scores within 0.10 of the threshold
near_threshold = df[
    (df["model_score"] >= 0.20 - bandwidth) &
    (df["model_score"] <= 0.20 + bandwidth)
].copy()
# Within this bandwidth, treatment assignment is nearly random
# (subscribers just above and just below 0.20 are very similar)
above = near_threshold[near_threshold["model_score"] > 0.20]   # treated (matches the > 0.20 rule)
below = near_threshold[near_threshold["model_score"] <= 0.20]  # control
rd_effect = above["churned"].mean() - below["churned"].mean()
print("Regression Discontinuity Design")
print("=" * 50)
print(f"Bandwidth: [{0.20 - bandwidth:.2f}, {0.20 + bandwidth:.2f}]")
print(f"Subscribers above threshold: {len(above):>5,}")
print(f"Subscribers below threshold: {len(below):>5,}")
print(f"\nChurn rate (above, treated): {above['churned'].mean():.3f}")
print(f"Churn rate (below, control): {below['churned'].mean():.3f}")
print(f"RD estimate: {rd_effect:+.3f}")
print(f"True effect: {treatment_effect:+.3f}")
# Check covariate balance near the threshold
print(f"\nCovariate Balance (near threshold)")
print("-" * 50)
for col in ["tenure_months", "monthly_usage_hours",
            "support_tickets", "plan_price"]:
    above_mean = above[col].mean()
    below_mean = below[col].mean()
    diff = above_mean - below_mean
    print(f"{col:>25s}: Above={above_mean:.2f} Below={below_mean:.2f} "
          f"Diff={diff:+.2f}")
Why RD Works Here --- Subscribers with a model score of 0.21 and subscribers with a score of 0.19 are essentially the same --- the 0.02 difference in score reflects noise in the model, not a real difference in risk. But only the 0.21 subscriber received the offer. This creates a clean comparison. The key assumption: there is no other discontinuity at the 0.20 threshold (no other policy or behavior change happens at exactly that score).
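The difference in means above is a simplified RD estimate: it ignores that churn still varies with the score inside the bandwidth. The standard refinement is a local linear fit, regressing the outcome on (score − cutoff) separately on each side and taking the gap between the two intercepts at the cutoff. A self-contained sketch on synthetic data (all parameters illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
cutoff, bandwidth, effect = 0.20, 0.10, -0.06

# Synthetic data: churn risk rises with the score, and crossing the
# cutoff triggers the offer, which shifts churn down by 6 points
score = rng.uniform(cutoff - bandwidth, cutoff + bandwidth, 100_000)
treated = score > cutoff
churn_p = 0.40 + 1.5 * (score - cutoff) + effect * treated
churned = (rng.random(100_000) < churn_p).astype(float)

# Local linear fit on each side; the gap between the two intercepts
# at the cutoff is the RD estimate
x = score - cutoff
slope_above, intercept_above = np.polyfit(x[treated], churned[treated], 1)
slope_below, intercept_below = np.polyfit(x[~treated], churned[~treated], 1)
rd = intercept_above - intercept_below
print(f"Local linear RD estimate: {rd:+.3f}   (true effect: {effect:+.3f})")
```

Fitting the slope on each side absorbs the within-bandwidth risk trend that the raw difference in means would otherwise fold into the estimate.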
Phase 5: Bringing It Together --- The Memo to the CFO
James Park asked: "How do we know the retention offer works?" Here is the analysis, translated into business language.
Summary of Causal Estimates
| Method | Estimate | Interpretation |
|---|---|---|
| Naive comparison | Misleading (positive) | Confounded by selection bias --- we sent offers to the highest-risk subscribers |
| Propensity score matching | approx. -5 to -7 pp | Controls for observed subscriber characteristics; closer to truth |
| Difference-in-differences | approx. -5 to -6 pp | Uses the Professional plan as a control group; accounts for time trends |
| Regression discontinuity | approx. -5 to -7 pp | Uses the model score threshold as a natural experiment; cleanest identification |
Three independent methods, resting on different assumptions, produce estimates in the range of a 5--7 percentage point churn reduction attributable to the retention offer.
Translating to Dollars
np.random.seed(42)
# StreamFlow economics
total_subscribers = 85000
monthly_churn_without_offer = 0.12
estimated_causal_effect = -0.06 # 6pp reduction
subscribers_treated = int(total_subscribers * 0.30) # 30% flagged as high-risk
avg_revenue_per_subscriber = 14.99 # Monthly
cost_per_offer = 3.00 # Discount cost per treated subscriber per month
# Subscribers retained due to the offer
retained = int(subscribers_treated * abs(estimated_causal_effect))
revenue_saved = retained * avg_revenue_per_subscriber * 12 # Annualized
offer_cost = subscribers_treated * cost_per_offer * 3  # Discount runs 3 months per offer
net_value = revenue_saved - offer_cost
print("Retention Offer: Causal ROI Analysis")
print("=" * 50)
print(f"Subscribers receiving offers: {subscribers_treated:>10,}")
print(f"Estimated causal effect: {estimated_causal_effect:>+10.0%}")
print(f"Subscribers retained (caused): {retained:>10,}")
print(f"Annual revenue saved: ${revenue_saved:>10,.0f}")
print(f"Annual offer cost: ${offer_cost:>10,.0f}")
print(f"Net annual value: ${net_value:>10,.0f}")
print(f"ROI: {net_value / offer_cost:>10.1f}x")
print(f"\n--- Key Caveat ---")
print(f"This analysis estimates the effect for the AVERAGE treated")
print(f"subscriber. Some subscribers respond strongly to the offer;")
print(f"others would have stayed anyway. Heterogeneous treatment")
print(f"effect analysis (next step) could identify which segments")
print(f"benefit most, further optimizing spend.")
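Since the three methods bracket the effect rather than pin it down exactly, it helps to show the CFO how the economics move across the plausible range. A sketch under the same assumptions as above, with the discount cost charged for the three months of the offer (per the offer design in the background) and the effect swept from 4 to 7 points as a stress test:

```python
# ROI sensitivity: vary the causal effect across the estimated range.
# Assumes the $3/month discount is paid for the 3 months of the offer.
subscribers_treated = 25_500
arpu = 14.99                                  # monthly revenue per subscriber
offer_cost = subscribers_treated * 3.00 * 3   # annual discount cost

print(f"{'effect':>8s} {'retained':>9s} {'net value':>12s}")
for effect_pp in [0.04, 0.05, 0.06, 0.07]:
    retained = int(subscribers_treated * effect_pp)
    revenue_saved = retained * arpu * 12      # a year of retained revenue
    net = revenue_saved - offer_cost
    print(f"{-effect_pp:>+8.0%} {retained:>9,} ${net:>11,.0f}")
```

At a 4-point effect the program loses money and near 5 points it roughly breaks even, which is exactly the nuance a single point estimate hides.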
What the CFO Heard
Rachel Torres presents the findings to James Park with a single slide:
"Did the retention offer work?"
- Three independent causal analyses say YES.
- The offer reduces churn by approximately 6 percentage points for treated subscribers.
- That translates to roughly 1,530 retained subscribers who would otherwise have canceled.
- Net annual value after accounting for discount costs: approximately $46,000.
- The naive before/after comparison (a 23% drop in churn) overstated the effect: new features, economic conditions, and a competitor's price increase also contributed. The retention offer accounts for roughly two-thirds of the overall improvement (about 1.8 of the 2.8 percentage points).
- Next step: analyze which subscriber segments respond most strongly, to optimize targeting.
James approves the budget. But more importantly, he trusts the number --- because the analysis honestly separated the causal effect of the offer from the confounding factors.
Discussion Questions
- The naive comparison showed the treated group churning more than the control group, even though the treatment reduced churn. Explain this paradox to a non-technical stakeholder in 2--3 sentences.
- Propensity score matching controls for observed confounders but not unobserved ones. Name one plausible unobserved confounder in the StreamFlow scenario that could bias the PSM estimate.
- The parallel trends assumption is untestable in the post-treatment period. You can check it in the pre-treatment period, but this is not a guarantee. Why? What could go wrong?
- Regression discontinuity is often called the most credible quasi-experimental design. Why is it more convincing than propensity score matching? What is its main limitation?
- Triangulation. The memo presents three causal estimates from three methods. Why is convergent evidence from multiple methods more convincing than a single precise estimate?
- The honest caveat. The analysis attributes only part of the total churn improvement to the retention offer. A less careful analyst might have attributed all of it. What organizational incentives make this kind of honesty difficult? How do you build a culture that rewards it?
- Next steps. The memo mentions "heterogeneous treatment effects." What does this mean, and why would it help StreamFlow spend its retention budget more efficiently?
This case study supports Chapter 36: The Road to Advanced. Return to the chapter for the full discussion of causal inference methods.