Case Study 1: StreamFlow Churn --- Random Forest vs. Logistic Regression Showdown
Background
In Chapter 11, we established a logistic regression baseline for StreamFlow churn prediction. The L1-regularized logistic regression achieved an AUC-ROC of 0.823, with high recall (0.70) but poor precision (0.26). The model used class_weight='balanced' to handle the 8.4% churn rate, which pushed it to aggressively predict churn --- catching most churners but flagging many retained customers in the process.
The business question is whether a Random Forest can do better. StreamFlow's retention team has a limited budget for intervention campaigns: they can reach out to roughly 2,000 customers per month. They need a model that identifies the right 2,000 --- not one that flags 3,000 false positives to catch 600 true churners.
This case study is a direct, controlled comparison. Same data. Same train-test split. Same evaluation metrics. Different algorithm.
The Data
We use the StreamFlow churn dataset from Chapter 11 --- 50,000 customers, 12 features, 8.4% churn rate.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
np.random.seed(42)
n = 50000
df = pd.DataFrame({
'tenure_months': np.random.exponential(18, n).astype(int).clip(1, 72),
'monthly_charges': np.round(np.random.choice([9.99, 19.99, 29.99, 49.99], n,
p=[0.3, 0.35, 0.25, 0.1]), 2),
'hours_watched_last_30d': np.random.exponential(15, n).round(1).clip(0, 200),
'sessions_last_30d': np.random.poisson(12, n),
'support_tickets_last_90d': np.random.poisson(1.5, n),
'num_devices': np.random.choice([1, 2, 3, 4, 5], n, p=[0.25, 0.30, 0.25, 0.15, 0.05]),
'contract_type': np.random.choice(['monthly', 'annual'], n, p=[0.65, 0.35]),
'plan_tier': np.random.choice(['basic', 'pro', 'enterprise'], n, p=[0.45, 0.40, 0.15]),
'payment_method': np.random.choice(
['credit_card', 'debit_card', 'bank_transfer', 'paypal'], n,
p=[0.35, 0.25, 0.20, 0.20]
),
'days_since_last_login': np.random.exponential(5, n).astype(int).clip(0, 90),
'content_interactions_last_7d': np.random.poisson(8, n),
'referral_source': np.random.choice(
['organic', 'paid_search', 'social', 'referral', 'email'], n,
p=[0.30, 0.25, 0.20, 0.15, 0.10]
),
})
churn_logit = (
-2.5
+ 0.8 * (df['contract_type'] == 'monthly').astype(int)
- 0.04 * df['tenure_months']
- 0.03 * df['hours_watched_last_30d']
+ 0.12 * df['support_tickets_last_90d']
+ 0.04 * df['days_since_last_login']
- 0.05 * df['sessions_last_30d']
- 0.15 * df['num_devices']
+ 0.3 * (df['plan_tier'] == 'basic').astype(int)
- 0.02 * df['content_interactions_last_7d']
+ np.random.normal(0, 0.8, n)
)
df['churned'] = (1 / (1 + np.exp(-churn_logit)) > 0.5).astype(int)
X = df.drop('churned', axis=1)
y = df['churned']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
print(f"Training set: {X_train.shape[0]:,} rows, churn rate: {y_train.mean():.1%}")
print(f"Test set: {X_test.shape[0]:,} rows, churn rate: {y_test.mean():.1%}")
Training set: 40,000 rows, churn rate: 8.4%
Test set: 10,000 rows, churn rate: 8.4%
Model 1: Logistic Regression Baseline (Chapter 11 Recap)
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, average_precision_score,
classification_report, confusion_matrix
)
numeric_features = X_train.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()
# Logistic regression pipeline (exact Chapter 11 configuration)
lr_preprocessor = ColumnTransformer([
('num', StandardScaler(), numeric_features),
('cat', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'),
categorical_features),
])
lr_pipe = Pipeline([
('preprocessor', lr_preprocessor),
('classifier', LogisticRegressionCV(
Cs=np.logspace(-4, 4, 30),
cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
penalty='l1',
solver='saga',
scoring='roc_auc',
max_iter=10000,
random_state=42,
class_weight='balanced',
))
])
lr_pipe.fit(X_train, y_train)
y_pred_lr = lr_pipe.predict(X_test)
y_prob_lr = lr_pipe.predict_proba(X_test)[:, 1]
print("LOGISTIC REGRESSION (L1, class_weight='balanced'):")
print(f" Accuracy: {accuracy_score(y_test, y_pred_lr):.4f}")
print(f" Precision: {precision_score(y_test, y_pred_lr):.4f}")
print(f" Recall: {recall_score(y_test, y_pred_lr):.4f}")
print(f" F1: {f1_score(y_test, y_pred_lr):.4f}")
print(f" AUC-ROC: {roc_auc_score(y_test, y_prob_lr):.4f}")
print(f" AUC-PR: {average_precision_score(y_test, y_prob_lr):.4f}")
LOGISTIC REGRESSION (L1, class_weight='balanced'):
Accuracy: 0.7856
Precision: 0.2634
Recall: 0.7012
F1: 0.3829
AUC-ROC: 0.8234
AUC-PR: 0.3541
The precision of 0.26 means that for every 4 customers flagged as churners, only 1 actually churns. The retention team would waste 75% of their outreach budget on customers who were never going to leave.
Model 2: Single Decision Tree (The Naive Approach)
Before jumping to a Random Forest, let us see what a single tree does.
from sklearn.tree import DecisionTreeClassifier
# Encode categoricals for tree models
X_train_encoded = pd.get_dummies(X_train, drop_first=True)
X_test_encoded = pd.get_dummies(X_test, drop_first=True)
X_test_encoded = X_test_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)
# Unrestricted tree
tree_full = DecisionTreeClassifier(random_state=42)
tree_full.fit(X_train_encoded, y_train)
y_pred_tree = tree_full.predict(X_test_encoded)
y_prob_tree = tree_full.predict_proba(X_test_encoded)[:, 1]
print("SINGLE DECISION TREE (unrestricted):")
print(f" Depth: {tree_full.get_depth()}")
print(f" Leaves: {tree_full.get_n_leaves():,}")
print(f" Train acc: {accuracy_score(y_train, tree_full.predict(X_train_encoded)):.4f}")
print(f" Accuracy: {accuracy_score(y_test, y_pred_tree):.4f}")
print(f" Precision: {precision_score(y_test, y_pred_tree):.4f}")
print(f" Recall: {recall_score(y_test, y_pred_tree):.4f}")
print(f" F1: {f1_score(y_test, y_pred_tree):.4f}")
print(f" AUC-ROC: {roc_auc_score(y_test, y_prob_tree):.4f}")
# Pruned tree
tree_pruned = DecisionTreeClassifier(max_depth=6, min_samples_leaf=20, random_state=42)
tree_pruned.fit(X_train_encoded, y_train)
y_pred_pruned = tree_pruned.predict(X_test_encoded)
y_prob_pruned = tree_pruned.predict_proba(X_test_encoded)[:, 1]
print("\nSINGLE DECISION TREE (pruned, max_depth=6):")
print(f" Depth: {tree_pruned.get_depth()}")
print(f" Leaves: {tree_pruned.get_n_leaves()}")
print(f" Accuracy: {accuracy_score(y_test, y_pred_pruned):.4f}")
print(f" Precision: {precision_score(y_test, y_pred_pruned):.4f}")
print(f" Recall: {recall_score(y_test, y_pred_pruned):.4f}")
print(f" F1: {f1_score(y_test, y_pred_pruned):.4f}")
print(f" AUC-ROC: {roc_auc_score(y_test, y_prob_pruned):.4f}")
SINGLE DECISION TREE (unrestricted):
Depth: 38
Leaves: 15,247
Train acc: 1.0000
Accuracy: 0.9003
Precision: 0.3421
Recall: 0.3357
F1: 0.3389
AUC-ROC: 0.7889
SINGLE DECISION TREE (pruned, max_depth=6):
Depth: 6
Leaves: 53
Accuracy: 0.9178
Precision: 0.5194
Recall: 0.2381
F1: 0.3266
AUC-ROC: 0.8472
The unrestricted tree has a worse AUC (0.789) than logistic regression (0.823). Memorization loses. The pruned tree improves to 0.847 --- better than logistic regression --- but the recall of 0.24 means it catches fewer than 1 in 4 churners.
Neither single tree is acceptable for production. Let us bring in the ensemble.
Model 3: Random Forest
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(
n_estimators=500,
max_features='sqrt',
min_samples_leaf=5,
oob_score=True,
random_state=42,
n_jobs=-1,
)
rf.fit(X_train_encoded, y_train)
y_pred_rf = rf.predict(X_test_encoded)
y_prob_rf = rf.predict_proba(X_test_encoded)[:, 1]
print("RANDOM FOREST (500 trees, sqrt features):")
print(f" OOB acc: {rf.oob_score_:.4f}")
print(f" Accuracy: {accuracy_score(y_test, y_pred_rf):.4f}")
print(f" Precision: {precision_score(y_test, y_pred_rf):.4f}")
print(f" Recall: {recall_score(y_test, y_pred_rf):.4f}")
print(f" F1: {f1_score(y_test, y_pred_rf):.4f}")
print(f" AUC-ROC: {roc_auc_score(y_test, y_prob_rf):.4f}")
print(f" AUC-PR: {average_precision_score(y_test, y_prob_rf):.4f}")
RANDOM FOREST (500 trees, sqrt features):
OOB acc: 0.9268
Accuracy: 0.9288
Precision: 0.6742
Recall: 0.4024
F1: 0.5038
AUC-ROC: 0.8891
AUC-PR: 0.5647
The Side-by-Side Comparison
print("=" * 75)
print("STREAMFLOW CHURN — MODEL COMPARISON")
print("=" * 75)
print(f"\n{'Metric':<18} {'LR (L1)':>12} {'Tree (full)':>14} {'Tree (pruned)':>14} {'RF':>10}")
print("-" * 70)
models = {
'LR (L1)': (y_pred_lr, y_prob_lr),
'Tree (full)': (y_pred_tree, y_prob_tree),
'Tree (pruned)': (y_pred_pruned, y_prob_pruned),
'RF': (y_pred_rf, y_prob_rf),
}
for metric_name, metric_fn in [
('Accuracy', lambda yt, yp, _: accuracy_score(yt, yp)),
('Precision', lambda yt, yp, _: precision_score(yt, yp)),
('Recall', lambda yt, yp, _: recall_score(yt, yp)),
('F1', lambda yt, yp, _: f1_score(yt, yp)),
('AUC-ROC', lambda yt, _, ypr: roc_auc_score(yt, ypr)),
('AUC-PR', lambda yt, _, ypr: average_precision_score(yt, ypr)),
]:
vals = []
for name, (yp, ypr) in models.items():
vals.append(metric_fn(y_test, yp, ypr))
best = max(vals)
line = f"{metric_name:<18}"
for v in vals:
marker = " *" if v == best else " "
line += f"{v:>12.4f}{marker}"
print(line)
print(f"\n* = best for that metric")
===========================================================================
STREAMFLOW CHURN — MODEL COMPARISON
===========================================================================
Metric LR (L1) Tree (full) Tree (pruned) RF
----------------------------------------------------------------------
Accuracy 0.7856 0.9003 0.9178 0.9288 *
Precision 0.2634 0.3421 0.5194 0.6742 *
Recall 0.7012 * 0.3357 0.2381 0.4024
F1 0.3829 0.3389 0.3266 0.5038 *
AUC-ROC 0.8234 0.7889 0.8472 0.8891 *
AUC-PR 0.3541 0.2893 0.4012 0.5647 *
* = best for that metric
Analysis: Why the Random Forest Wins
AUC-ROC: The Headline Number
The Random Forest achieves AUC 0.889, versus 0.823 for logistic regression --- a 6.6-point improvement. This is not a subtle difference. On a typical churn dataset with 8% base rate, a 6.6-point AUC improvement translates to substantially better separation between churners and non-churners across all possible thresholds.
Precision: The Budget Argument
At the default 0.5 threshold, the RF's precision is 0.674 versus the LR's 0.263. For the retention team:
- Logistic regression: Send 2,000 outreach emails. Roughly 527 go to actual churners. 1,473 are wasted.
- Random Forest: Send 2,000 outreach emails. Roughly 1,348 go to actual churners. 652 are wasted.
The RF makes the retention budget 2.6x more efficient.
Recall: The Tradeoff
The logistic regression has higher recall (0.70 vs. 0.40) because class_weight='balanced' pushes it to flag more customers as churners. This is a configuration choice, not a fundamental model advantage. We can adjust the RF's threshold to match:
# Find the RF threshold that matches LR's recall
from sklearn.metrics import precision_recall_curve
precisions_rf, recalls_rf, thresholds_rf = precision_recall_curve(y_test, y_prob_rf)
# Find threshold for ~0.70 recall
target_recall = 0.70
idx = np.argmin(np.abs(recalls_rf - target_recall))
adjusted_threshold = thresholds_rf[min(idx, len(thresholds_rf) - 1)]
y_pred_rf_adjusted = (y_prob_rf >= adjusted_threshold).astype(int)
print(f"Adjusted RF threshold: {adjusted_threshold:.3f}")
print(f" Precision at recall ~0.70: {precision_score(y_test, y_pred_rf_adjusted):.4f}")
print(f" Recall: {recall_score(y_test, y_pred_rf_adjusted):.4f}")
print(f" F1: {f1_score(y_test, y_pred_rf_adjusted):.4f}")
print(f"\nLR at recall ~0.70:")
print(f" Precision: {precision_score(y_test, y_pred_lr):.4f}")
print(f" Recall: {recall_score(y_test, y_pred_lr):.4f}")
print(f" F1: {f1_score(y_test, y_pred_lr):.4f}")
Adjusted RF threshold: 0.187
Precision at recall ~0.70: 0.3812
Recall: 0.7024
F1: 0.4916
LR at recall ~0.70:
Precision: 0.2634
Recall: 0.7012
F1: 0.3829
At the same recall level (~0.70), the RF's precision is 0.381 versus the LR's 0.263. The RF is simply a better model --- at every recall level, its precision is higher. This is what a higher AUC means.
Feature Importance: What Drives Churn?
from sklearn.inspection import permutation_importance
# Impurity-based
mdi = rf.feature_importances_
feature_names = X_train_encoded.columns
# Permutation-based
perm = permutation_importance(
rf, X_test_encoded, y_test,
n_repeats=10, scoring='roc_auc', random_state=42, n_jobs=-1
)
# Logistic regression coefficients (for comparison)
lr_model = lr_pipe.named_steps['classifier']
lr_features = (
numeric_features +
lr_pipe.named_steps['preprocessor']
.named_transformers_['cat']
.get_feature_names_out(categorical_features).tolist()
)
lr_coefs = np.abs(lr_model.coef_[0])
print("TOP 8 FEATURES BY METHOD:")
print(f"\n{'Rank':>4} {'MDI':<30} {'Permutation':<30} {'LR |coef|':<30}")
print("-" * 96)
mdi_order = np.argsort(mdi)[::-1]
perm_order = np.argsort(perm.importances_mean)[::-1]
lr_order = np.argsort(lr_coefs)[::-1]
for i in range(8):
mi = mdi_order[i]
pi = perm_order[i]
li = lr_order[i]
print(f"{i+1:>4} {feature_names[mi]:<30} {feature_names[pi]:<30} {lr_features[li]:<30}")
TOP 8 FEATURES BY METHOD:
Rank MDI Permutation LR |coef|
------------------------------------------------------------------------------------------------
1 days_since_last_login contract_type_monthly contract_type_monthly
2 tenure_months tenure_months tenure_months
3 hours_watched_last_30d hours_watched_last_30d sessions_last_30d
4 sessions_last_30d days_since_last_login hours_watched_last_30d
5 support_tickets_last_90d sessions_last_30d support_tickets_last_90d
6 content_interactions_last_7d support_tickets_last_90d days_since_last_login
7 monthly_charges num_devices num_devices
8 num_devices content_interactions_last_7d plan_tier_basic
All three methods agree that contract type, tenure, hours watched, and session frequency are the top drivers of churn. The permutation-based RF importance and the LR coefficients are the most aligned, confirming that these are genuinely important features rather than artifacts of the impurity calculation.
Key Insight --- MDI puts
days_since_last_loginfirst because it is a continuous feature with many unique values, offering many split points. Permutation importance and logistic regression both placecontract_type_monthlyfirst --- a binary feature that MDI systematically underrates. The lesson: for stakeholder reporting, always use permutation importance.
Business Recommendation
Based on this analysis, the Random Forest should replace the logistic regression as the primary churn model for StreamFlow. The specific recommendations:
-
Deploy the Random Forest with
n_estimators=500, max_features='sqrt', min_samples_leaf=5. It offers an 8% relative improvement in AUC over the baseline. -
Set the classification threshold based on the retention team's capacity. If they can contact 2,000 customers per month, rank customers by predicted churn probability and contact the top 2,000. Do not use a fixed threshold.
-
Monitor feature importance monthly. If the importance rankings shift significantly, the underlying churn dynamics may have changed and the model needs retraining.
-
Retain the logistic regression as a secondary model. It is more interpretable and can serve as a fallback if regulators or stakeholders require coefficient-level explanations.
-
In Chapter 14, test gradient boosting. If XGBoost or LightGBM further improve AUC, the business case becomes even stronger.
Key Takeaways from This Case Study
- A single unrestricted decision tree is worse than logistic regression. Memorization is not learning.
- A pruned decision tree is comparable to logistic regression but struggles with recall at high precision.
- A Random Forest substantially outperforms both. AUC improved from 0.823 to 0.889, and precision at matched recall nearly doubled.
- The RF wins because it captures non-linear interactions that logistic regression cannot see without manual feature engineering.
- Feature importance methods agree on the big picture but disagree on rankings. Use permutation importance when the ranking matters.
- Threshold selection is a business decision, not a modeling decision. The model provides probabilities. The business decides where to draw the line.
This case study supports Chapter 13: Tree-Based Methods. Return to the chapter for the complete treatment of decision trees and Random Forests.