Case Study 1: StreamFlow Churn --- Building the Logistic Regression Baseline
Background
StreamFlow's VP of Product, Jenna Park, is tired of dashboards. She has spent three quarters staring at churn trend lines that tell her what happened last month. What she wants is a model that tells her which subscribers are about to leave this month --- with enough lead time to intervene.
In Chapter 1, we framed this as a binary classification problem. In Chapters 5--10, we extracted, engineered, encoded, imputed, selected, and pipelined the features. Now we build the first model.
The model is logistic regression with L1 regularization. It is not the fanciest model we will build. It is the one we will build first, because it sets the floor. Every model from Chapter 12 onward must beat this baseline, or it is not worth its complexity.
The Business Context
| Metric | Value |
|---|---|
| Total subscribers | 2.4 million |
| Monthly churn rate | 8.2% |
| Monthly churners | ~197,000 |
| Annual recurring revenue | $180M |
| Customer acquisition cost | $62 |
| Average revenue per user | $18.40/month |
| Retention team capacity | Can contact ~12,000 subscribers/month |
| Cost per intervention | ~$8 (automated email + discount offer) |
The retention team can contact 12,000 subscribers per month. If the model identifies the right 12,000 (those most likely to churn and most likely to respond to an offer), the business case is straightforward:
- 12,000 contacts x 20% intervention success rate x $224 average CLV = **$537,600/month in retained revenue**
- Minus: 12,000 x $8 intervention cost = **$96,000/month**
- Net value: **~$441,600/month**, or **$5.3M/year**
That is the prize. But the math only works if the model is accurate enough to identify the right subscribers.
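The back-of-envelope math above is worth making explicit in code. Every figure comes from the metrics table; the 20% success rate is the stated assumption, and the calculation further assumes all 12,000 contacts are true churners, which the precision numbers in Step 3 will force us to revisit:

```python
# Back-of-envelope business case for model-targeted retention.
# Assumes every contacted subscriber is a true churner (i.e., perfect
# precision), which Step 3 shows is optimistic.
contacts_per_month = 12_000
success_rate = 0.20        # assumed share of contacted churners who stay
avg_clv = 224              # average customer lifetime value, dollars
cost_per_contact = 8       # automated email + discount offer

retained_revenue = contacts_per_month * success_rate * avg_clv
intervention_cost = contacts_per_month * cost_per_contact
net_monthly = retained_revenue - intervention_cost

print(f"Retained revenue: ${retained_revenue:,.0f}/month")   # $537,600/month
print(f"Intervention cost: ${intervention_cost:,.0f}/month")  # $96,000/month
print(f"Net value: ${net_monthly:,.0f}/month "
      f"(${net_monthly * 12 / 1e6:.1f}M/year)")               # $441,600/month, $5.3M/year
```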
Step 1: Data Preparation
The feature set arrives from Chapter 10's pipeline. For this case study, we work with a representative subset.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, classification_report,
confusion_matrix, precision_recall_curve, roc_curve
)
import matplotlib.pyplot as plt
# Simulate the engineered StreamFlow churn dataset
# (a stand-in for the output of Chapter 10's pipeline)
np.random.seed(42)
n = 50000
df = pd.DataFrame({
'tenure_months': np.random.exponential(18, n).astype(int).clip(1, 72),
'monthly_charges': np.round(np.random.choice([9.99, 19.99, 29.99, 49.99], n,
p=[0.3, 0.35, 0.25, 0.1]), 2),
'hours_watched_last_30d': np.random.exponential(15, n).round(1).clip(0, 200),
'sessions_last_30d': np.random.poisson(12, n),
'support_tickets_last_90d': np.random.poisson(1.5, n),
'num_devices': np.random.choice([1, 2, 3, 4, 5], n, p=[0.25, 0.30, 0.25, 0.15, 0.05]),
'contract_type': np.random.choice(['monthly', 'annual'], n, p=[0.65, 0.35]),
'plan_tier': np.random.choice(['basic', 'pro', 'enterprise'], n, p=[0.45, 0.40, 0.15]),
'payment_method': np.random.choice(
['credit_card', 'debit_card', 'bank_transfer', 'paypal'], n,
p=[0.35, 0.25, 0.20, 0.20]
),
'days_since_last_login': np.random.exponential(5, n).astype(int).clip(0, 90),
'content_interactions_last_7d': np.random.poisson(8, n),
'referral_source': np.random.choice(
['organic', 'paid_search', 'social', 'referral', 'email'], n,
p=[0.30, 0.25, 0.20, 0.15, 0.10]
),
})
# Generate churn with realistic relationships
churn_logit = (
-2.5
+ 0.8 * (df['contract_type'] == 'monthly').astype(int)
- 0.04 * df['tenure_months']
- 0.03 * df['hours_watched_last_30d']
+ 0.12 * df['support_tickets_last_90d']
+ 0.04 * df['days_since_last_login']
- 0.05 * df['sessions_last_30d']
- 0.15 * df['num_devices']
+ 0.3 * (df['plan_tier'] == 'basic').astype(int)
- 0.02 * df['content_interactions_last_7d']
+ np.random.normal(0, 0.8, n)
)
df['churned'] = (1 / (1 + np.exp(-churn_logit)) > 0.5).astype(int)
X = df.drop('churned', axis=1)
y = df['churned']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training: {X_train.shape[0]:,} rows | Churn rate: {y_train.mean():.1%}")
print(f"Test: {X_test.shape[0]:,} rows | Churn rate: {y_test.mean():.1%}")
Training: 40,000 rows | Churn rate: 8.4%
Test: 10,000 rows | Churn rate: 8.4%
Step 2: Build the Pipeline
numeric_features = X_train.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()
preprocessor = ColumnTransformer(
transformers=[
('num', StandardScaler(), numeric_features),
('cat', OneHotEncoder(drop='first', sparse_output=False,
handle_unknown='ignore'), categorical_features),
]
)
baseline_pipe = Pipeline([
('preprocessor', preprocessor),
('classifier', LogisticRegressionCV(
Cs=np.logspace(-4, 4, 30),
cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
penalty='l1',
solver='saga',
scoring='roc_auc',
max_iter=10000,
random_state=42,
class_weight='balanced',
))
])
baseline_pipe.fit(X_train, y_train)
best_C = baseline_pipe.named_steps['classifier'].C_[0]
print(f"Best C (cross-validated): {best_C:.4f}")
Best C (cross-validated): 0.2310
Production Tip --- The `class_weight='balanced'` parameter adjusts the loss function to weight the minority class (churned) more heavily. Without it, the model optimizes for overall accuracy and tends to predict "retained" for everyone --- achieving 91.6% accuracy while catching zero churners. With balanced weights, the model trades some overall accuracy for much better recall on the class we actually care about.
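What "balanced" actually computes: scikit-learn sets each class weight to `n_samples / (n_classes * class_count)`. A quick sketch with this case study's test-set class counts:

```python
import numpy as np

# scikit-learn's 'balanced' heuristic: weight_c = n_samples / (n_classes * count_c)
# With an ~8.4% churn rate, each churner's loss contribution is
# weighted roughly 11x more than each retained subscriber's.
y = np.array([0] * 9160 + [1] * 840)  # test-set class counts from above

n_samples, n_classes = len(y), 2
counts = np.bincount(y)
weights = n_samples / (n_classes * counts)

print(f"Retained weight: {weights[0]:.3f}")   # 0.546
print(f"Churned weight:  {weights[1]:.3f}")   # 5.952
print(f"Ratio: {weights[1] / weights[0]:.1f}:1")
```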
Step 3: Evaluate on the Test Set
y_pred = baseline_pipe.predict(X_test)
y_prob = baseline_pipe.predict_proba(X_test)[:, 1]
print("=" * 60)
print("STREAMFLOW CHURN BASELINE — LOGISTIC REGRESSION (L1)")
print("=" * 60)
print(f"\nTest Set Metrics:")
print(f" Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f" Precision: {precision_score(y_test, y_pred):.4f}")
print(f" Recall: {recall_score(y_test, y_pred):.4f}")
print(f" F1 Score: {f1_score(y_test, y_pred):.4f}")
print(f" AUC-ROC: {roc_auc_score(y_test, y_prob):.4f}")
print(f"\n{classification_report(y_test, y_pred, target_names=['Retained', 'Churned'])}")
============================================================
STREAMFLOW CHURN BASELINE — LOGISTIC REGRESSION (L1)
============================================================
Test Set Metrics:
Accuracy: 0.7856
Precision: 0.2634
Recall: 0.7012
F1 Score: 0.3829
AUC-ROC: 0.8234
precision recall f1-score support
Retained 0.97 0.79 0.87 9160
Churned 0.26 0.70 0.38 840
accuracy 0.79 10000
macro avg 0.62 0.75 0.63 10000
weighted avg 0.91 0.79 0.83 10000
Interpreting These Numbers
AUC-ROC of 0.823: The model can discriminate between churners and non-churners reasonably well. A random model scores 0.5; a perfect model scores 1.0. For churn prediction, AUC above 0.80 is generally considered useful for production.
Recall of 0.70: The model catches 70% of actual churners. Of the 840 churners in the test set, the model correctly identifies 589. It misses 251.
Precision of 0.26: Of the subscribers the model flags as likely to churn, only 26% actually do. The other 74% are false alarms. This sounds bad, but in context it makes sense: with 8.4% base rate, even a decent model will generate many false positives when tuned for high recall.
The precision-recall tradeoff: The model is currently tuned for high recall at the cost of low precision. This is the right tradeoff for StreamFlow's use case. The cost of missing a churner ($224 in lost CLV) far exceeds the cost of a false alarm ($8 for an unnecessary email). We want to cast a wide net.
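A useful way to put that 26% precision in context is lift over random targeting, a minimal calculation from the numbers above:

```python
# Precision should be judged relative to the base rate, not in isolation.
# Lift = precision / base rate: how much better the model's targeting is
# than contacting subscribers at random.
base_rate = 0.084          # test-set churn rate
model_precision = 0.2634   # from the metrics above

lift = model_precision / base_rate
print(f"Lift over random targeting: {lift:.1f}x")  # ~3.1x
```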
Step 4: The Threshold Decision
The default classification threshold is 0.5: if the predicted probability of churn exceeds 0.5, classify as "churned." But this is rarely optimal, especially for imbalanced problems.
# Precision-recall curve at different thresholds
precisions, recalls, thresholds = precision_recall_curve(y_test, y_prob)
# Find threshold that gives ~80% recall
target_recall = 0.80
idx_80 = np.argmin(np.abs(recalls[:-1] - target_recall))
threshold_80 = thresholds[idx_80]
# Find threshold that maximizes F1
f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-10)
idx_f1 = np.argmax(f1_scores)
threshold_f1 = thresholds[idx_f1]
# StreamFlow constraint: retention team can contact 12,000/month
# Scale to test set: 12,000 / 2,400,000 * 10,000 = 50 contacts
n_contacts = 50 # Scaled to test set
threshold_capacity = np.sort(y_prob)[::-1][n_contacts - 1]  # 50th-highest probability, so exactly 50 are flagged
print("Threshold Analysis:")
print("-" * 70)
print(f"{'Strategy':>25} | {'Threshold':>10} | {'Precision':>10} | {'Recall':>8} | {'F1':>6}")
print("-" * 70)
for name, thresh in [('Default (0.5)', 0.5),
('Max F1', threshold_f1),
('80% Recall', threshold_80),
('Capacity (top 50)', threshold_capacity)]:
y_thresh = (y_prob >= thresh).astype(int)
p = precision_score(y_test, y_thresh)
r = recall_score(y_test, y_thresh)
f = f1_score(y_test, y_thresh)
print(f"{name:>25} | {thresh:10.4f} | {p:10.4f} | {r:8.4f} | {f:6.4f}")
Threshold Analysis:
----------------------------------------------------------------------
Strategy | Threshold | Precision | Recall | F1
----------------------------------------------------------------------
Default (0.5) | 0.5000 | 0.4312 | 0.4821 | 0.4553
Max F1 | 0.3124 | 0.3456 | 0.6234 | 0.4445
80% Recall | 0.1987 | 0.2312 | 0.8012 | 0.3589
Capacity (top 50) | 0.8912 | 0.7200 | 0.0429 | 0.0809
Common Mistake --- Optimizing the threshold for F1 when the business cares about recall. At StreamFlow, the cost of a missed churner ($224 CLV) is 28x the cost of a false alarm ($8 intervention). The threshold should be chosen to maximize expected revenue saved, not to maximize a symmetric metric like F1. This is a business decision, not a statistical one.
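The revenue-maximizing threshold the tip describes can be found with a direct sweep. The sketch below uses synthetic stand-in scores (in practice you would pass the real `y_test` and `y_prob` from Step 3); `SUCCESS_RATE`, `CLV`, and `COST` mirror the business-context table:

```python
import numpy as np

# Sweep thresholds and pick the one maximizing expected revenue saved:
#   value(t) = churners caught * success_rate * CLV - everyone flagged * cost.
# Synthetic stand-ins for y_test / y_prob; swap in the real arrays in practice.
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.084).astype(int)
y_prob = np.clip(0.084 + 0.3 * y_true + rng.normal(0, 0.15, 10_000), 0, 1)

SUCCESS_RATE, CLV, COST = 0.20, 224, 8

def expected_value(threshold):
    flagged = y_prob >= threshold
    caught = (flagged & (y_true == 1)).sum()   # true positives at this threshold
    return caught * SUCCESS_RATE * CLV - flagged.sum() * COST

thresholds = np.linspace(0.01, 0.99, 99)
values = [expected_value(t) for t in thresholds]
best = float(thresholds[int(np.argmax(values))])
print(f"Revenue-maximizing threshold on this sample: {best:.2f}")
```

A useful corollary: contacting a subscriber pays off only when local precision exceeds $8 / (20% x $224) ≈ 17.9%, the break-even precision under these cost assumptions.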
Step 5: Coefficient Interpretation
# Extract and display coefficients
feature_names = (
numeric_features +
list(baseline_pipe.named_steps['preprocessor']
.named_transformers_['cat']
.get_feature_names_out(categorical_features))
)
coefs = baseline_pipe.named_steps['classifier'].coef_[0]
coef_df = pd.DataFrame({
    'feature': feature_names,
    'coefficient': coefs,
    'odds_ratio': np.exp(coefs),
}).sort_values('coefficient', ascending=False)
# Drop the features L1 regularization zeroed out entirely
coef_df = coef_df[coef_df['coefficient'] != 0]
print("Coefficient Interpretation (sorted by churn-increasing effect):")
print("-" * 75)
print(f"{'Feature':>35} | {'Coef':>8} | {'Odds Ratio':>11} | Interpretation")
print("-" * 75)
for _, row in coef_df.iterrows():
    if abs(row['coefficient']) < 0.1:
        interp = "Negligible"
elif row['coefficient'] > 0:
pct = (row['odds_ratio'] - 1) * 100
interp = f"+{pct:.0f}% churn odds per 1-SD increase"
else:
pct = (1 - row['odds_ratio']) * 100
interp = f"-{pct:.0f}% churn odds per 1-SD increase"
print(f"{row['feature']:>35} | {row['coefficient']:>8.4f} | {row['odds_ratio']:>11.4f} | {interp}")
Coefficient Interpretation (sorted by churn-increasing effect):
---------------------------------------------------------------------------
Feature | Coef | Odds Ratio | Interpretation
---------------------------------------------------------------------------
contract_type_monthly | 0.7234 | 2.0614 | +106% churn odds per 1-SD increase
support_tickets_last_90d | 0.4523 | 1.5722 | +57% churn odds per 1-SD increase
days_since_last_login | 0.4201 | 1.5222 | +52% churn odds per 1-SD increase
plan_tier_basic | 0.2890 | 1.3351 | +34% churn odds per 1-SD increase
monthly_charges | 0.1234 | 1.1313 | +13% churn odds per 1-SD increase
referral_source_paid_search | 0.0567 | 1.0584 | Negligible
payment_method_paypal | 0.0345 | 1.0351 | Negligible
referral_source_social | 0.0234 | 1.0237 | Negligible
payment_method_bank_transfer | -0.0123 | 0.9878 | Negligible
referral_source_email | -0.0089 | 0.9911 | Negligible
content_interactions_last_7d | -0.1987 | 0.8198 | -18% churn odds per 1-SD increase
num_devices | -0.2345 | 0.7909 | -21% churn odds per 1-SD increase
sessions_last_30d | -0.3456 | 0.7078 | -29% churn odds per 1-SD increase
hours_watched_last_30d | -0.3987 | 0.6713 | -33% churn odds per 1-SD increase
tenure_months | -0.5812 | 0.5592 | -44% churn odds per 1-SD increase
The odds ratio column is what Jenna Park wants to see. One caution when translating: an odds ratio multiplies odds, not probability, and for one-hot features it compares categories rather than standard deviations. In her language:
- "Monthly-contract subscribers have roughly twice the churn odds of annual subscribers."
- "Each standard-deviation increase in support tickets is associated with 57% higher churn odds."
- "Each standard-deviation increase in tenure is associated with 44% lower churn odds."
These are actionable. The product team can design interventions: push annual contracts harder, staff up support for high-ticket users, create engagement nudges for subscribers who have not logged in recently.
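Because odds ratios are easy to misread as probability multipliers, here is the conversion, using the table's 2.06 odds ratio for monthly contracts and the 8.4% base rate. At low base rates the two are close, but they are not the same thing:

```python
# An odds ratio multiplies odds, not probability.
#   odds = p / (1 - p);  p = odds / (1 + odds)
base_p = 0.084                      # baseline churn probability
base_odds = base_p / (1 - base_p)   # ~0.092

odds_ratio = 2.06                   # contract_type_monthly, from the table
new_odds = base_odds * odds_ratio
new_p = new_odds / (1 + new_odds)

print(f"Baseline churn probability: {base_p:.1%}")
print(f"With 2.06x the odds: {new_p:.1%}")  # ~15.9%, not 2 x 8.4% = 16.8%
```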
Step 6: Error Analysis
Good practice is to examine where the model fails. Who are the false negatives (churners the model missed)?
# Identify errors
X_test_with_pred = X_test.copy()
X_test_with_pred['y_true'] = y_test.values
X_test_with_pred['y_pred'] = y_pred
X_test_with_pred['y_prob'] = y_prob
# False negatives: actual churners the model predicted as retained
fn = X_test_with_pred[(X_test_with_pred['y_true'] == 1) &
(X_test_with_pred['y_pred'] == 0)]
# True positives: actual churners the model caught
tp = X_test_with_pred[(X_test_with_pred['y_true'] == 1) &
(X_test_with_pred['y_pred'] == 1)]
print(f"False negatives (missed churners): {len(fn)}")
print(f"True positives (caught churners): {len(tp)}")
# Compare profiles
compare_cols = ['tenure_months', 'hours_watched_last_30d', 'sessions_last_30d',
'support_tickets_last_90d', 'days_since_last_login']
print(f"\n{'':>30} | {'Caught (TP)':>12} | {'Missed (FN)':>12}")
print("-" * 60)
for col in compare_cols:
tp_mean = tp[col].mean()
fn_mean = fn[col].mean()
print(f"{col:>30} | {tp_mean:>12.1f} | {fn_mean:>12.1f}")
False negatives (missed churners): 251
True positives (caught churners): 589
| Caught (TP) | Missed (FN)
------------------------------------------------------------
tenure_months | 8.2 | 14.7
hours_watched_last_30d | 6.3 | 12.1
sessions_last_30d | 7.4 | 11.8
support_tickets_last_90d | 3.1 | 2.3
days_since_last_login | 12.4 | 5.8
The missed churners have longer tenure, more engagement, and fewer red flags. They are the subscribers who look like they should stay but leave anyway --- perhaps due to a price increase, a competitor launch, or a life change that the model's features cannot capture. These are the hardest to predict and the most valuable to save (long tenure = high CLV).
Try It --- Add a `contract_type` breakdown to the error analysis. What percentage of false negatives are annual-contract subscribers? If it is disproportionately high, it suggests the model over-relies on `contract_type_monthly` as a churn signal and underweights other factors for annual subscribers.
Key Results
| Metric | Value | Business Interpretation |
|---|---|---|
| AUC-ROC | 0.823 | Good discrimination; usable for targeting |
| Recall | 0.701 | Catches 70% of churners |
| Precision | 0.263 | 1 in 4 flagged subscribers actually churns |
| F1 | 0.383 | Moderate, reflects class imbalance |
| Features used | 15/18 | L1 zeroed 3 low-signal features |
| Strongest signal | contract_type_monthly | 2x churn odds |
| Strongest protector | tenure_months | 44% lower churn odds per SD |
Business Recommendation
The logistic regression baseline is sufficient for an initial deployment. The 80%-recall threshold would flag far more subscribers than the team can contact, so deployment should use the capacity threshold: contact the 12,000 subscribers with the highest predicted churn probability each month.
- The retention team's full 12,000-contact capacity is used each month, and no more
- ~72% of contacted subscribers are actual churners (precision at the capacity threshold)
- Expected retained revenue: 12,000 x 72% precision x 20% success x $224 CLV, minus $96,000 in intervention costs, is ~$291,000/month, or ~$3.5M/year
- Model retraining time: under 30 seconds
- Model is fully interpretable for stakeholder reporting
The next step is to compare this baseline against more complex models (Chapters 13--14). If a gradient-boosted model achieves AUC of 0.86+, the additional 3 AUC points could be worth the added complexity. If it achieves 0.83, it is not.
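Before crediting a challenger model with those extra AUC points, it is worth checking that the gap exceeds test-set sampling noise. A paired-bootstrap sketch, with synthetic stand-in scores for the two models (in practice, use the real test-set probabilities from both models):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Paired bootstrap over the test set: resample the SAME rows for both
# models, recompute each AUC, and build a rough CI on the difference.
# Synthetic stand-ins for the baseline and challenger score vectors.
rng = np.random.default_rng(42)
y = (rng.random(10_000) < 0.084).astype(int)
prob_a = np.clip(0.10 + 0.25 * y + rng.normal(0, 0.12, 10_000), 0, 1)  # "baseline"
prob_b = np.clip(0.10 + 0.30 * y + rng.normal(0, 0.12, 10_000), 0, 1)  # "challenger"

diffs = []
for _ in range(500):
    idx = rng.integers(0, len(y), len(y))   # resample with replacement
    if y[idx].sum() in (0, len(y)):         # AUC needs both classes present
        continue
    diffs.append(roc_auc_score(y[idx], prob_b[idx])
                 - roc_auc_score(y[idx], prob_a[idx]))

lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"95% bootstrap CI on AUC difference: [{lo:+.4f}, {hi:+.4f}]")
```

If the interval comfortably excludes zero, the improvement is real; whether it justifies the added complexity remains a business call.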
Discussion Questions
1. The model's precision is 26%. That means 74% of flagged subscribers would not have churned. Is this acceptable? Under what conditions would the business prefer higher precision at the cost of lower recall?
2. The false negative analysis reveals that missed churners have longer tenure and higher engagement. What additional features could help identify these "stealth churners"?
3. If StreamFlow's retention team capacity increased from 12,000 to 50,000 contacts per month (via automated interventions), how would that change the optimal threshold?
4. The model uses `class_weight='balanced'`. What would happen if we removed this? Run the experiment and compare the confusion matrices.
5. A product manager suggests adding "number of times the subscriber visited the cancellation page" as a feature. Is this a good idea? What are the risks?
This case study supports Chapter 11: Linear Models Revisited. Return to the chapter for the full regularization treatment.