Case Study 1: StreamFlow Churn --- Random Forest vs. Logistic Regression Showdown


Background

In Chapter 11, we established a logistic regression baseline for StreamFlow churn prediction. The L1-regularized logistic regression achieved an AUC-ROC of 0.823, with high recall (0.70) but poor precision (0.26). The model used class_weight='balanced' to handle the 8.4% churn rate, which pushed it to aggressively predict churn --- catching most churners but flagging many retained customers in the process.

The business question is whether a Random Forest can do better. StreamFlow's retention team has a limited budget for intervention campaigns: they can reach out to roughly 2,000 customers per month. They need a model that identifies the right 2,000 --- not one that flags 3,000 false positives to catch 600 true churners.

This case study is a direct, controlled comparison. Same data. Same train-test split. Same evaluation metrics. Different algorithm.


The Data

We use the StreamFlow churn dataset from Chapter 11 --- 50,000 customers, 12 features, 8.4% churn rate.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

np.random.seed(42)
n = 50000

df = pd.DataFrame({
    'tenure_months': np.random.exponential(18, n).astype(int).clip(1, 72),
    'monthly_charges': np.round(np.random.choice([9.99, 19.99, 29.99, 49.99], n,
                                                   p=[0.3, 0.35, 0.25, 0.1]), 2),
    'hours_watched_last_30d': np.random.exponential(15, n).round(1).clip(0, 200),
    'sessions_last_30d': np.random.poisson(12, n),
    'support_tickets_last_90d': np.random.poisson(1.5, n),
    'num_devices': np.random.choice([1, 2, 3, 4, 5], n, p=[0.25, 0.30, 0.25, 0.15, 0.05]),
    'contract_type': np.random.choice(['monthly', 'annual'], n, p=[0.65, 0.35]),
    'plan_tier': np.random.choice(['basic', 'pro', 'enterprise'], n, p=[0.45, 0.40, 0.15]),
    'payment_method': np.random.choice(
        ['credit_card', 'debit_card', 'bank_transfer', 'paypal'], n,
        p=[0.35, 0.25, 0.20, 0.20]
    ),
    'days_since_last_login': np.random.exponential(5, n).astype(int).clip(0, 90),
    'content_interactions_last_7d': np.random.poisson(8, n),
    'referral_source': np.random.choice(
        ['organic', 'paid_search', 'social', 'referral', 'email'], n,
        p=[0.30, 0.25, 0.20, 0.15, 0.10]
    ),
})

churn_logit = (
    -2.5
    + 0.8 * (df['contract_type'] == 'monthly').astype(int)
    - 0.04 * df['tenure_months']
    - 0.03 * df['hours_watched_last_30d']
    + 0.12 * df['support_tickets_last_90d']
    + 0.04 * df['days_since_last_login']
    - 0.05 * df['sessions_last_30d']
    - 0.15 * df['num_devices']
    + 0.3 * (df['plan_tier'] == 'basic').astype(int)
    - 0.02 * df['content_interactions_last_7d']
    + np.random.normal(0, 0.8, n)
)
# Sigmoid of the logit > 0.5 is equivalent to churn_logit > 0; the Gaussian noise
# term above is what keeps the label from being a deterministic function of the features.
df['churned'] = (1 / (1 + np.exp(-churn_logit)) > 0.5).astype(int)

X = df.drop('churned', axis=1)
y = df['churned']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set: {X_train.shape[0]:,} rows, churn rate: {y_train.mean():.1%}")
print(f"Test set:     {X_test.shape[0]:,} rows, churn rate: {y_test.mean():.1%}")
Training set: 40,000 rows, churn rate: 8.4%
Test set:     10,000 rows, churn rate: 8.4%

Model 1: Logistic Regression Baseline (Chapter 11 Recap)

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, average_precision_score,
    classification_report, confusion_matrix
)

numeric_features = X_train.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()

# Logistic regression pipeline (exact Chapter 11 configuration)
lr_preprocessor = ColumnTransformer([
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'),
     categorical_features),
])

lr_pipe = Pipeline([
    ('preprocessor', lr_preprocessor),
    ('classifier', LogisticRegressionCV(
        Cs=np.logspace(-4, 4, 30),
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
        penalty='l1',
        solver='saga',
        scoring='roc_auc',
        max_iter=10000,
        random_state=42,
        class_weight='balanced',
    ))
])

lr_pipe.fit(X_train, y_train)
y_pred_lr = lr_pipe.predict(X_test)
y_prob_lr = lr_pipe.predict_proba(X_test)[:, 1]

print("LOGISTIC REGRESSION (L1, class_weight='balanced'):")
print(f"  Accuracy:    {accuracy_score(y_test, y_pred_lr):.4f}")
print(f"  Precision:   {precision_score(y_test, y_pred_lr):.4f}")
print(f"  Recall:      {recall_score(y_test, y_pred_lr):.4f}")
print(f"  F1:          {f1_score(y_test, y_pred_lr):.4f}")
print(f"  AUC-ROC:     {roc_auc_score(y_test, y_prob_lr):.4f}")
print(f"  AUC-PR:      {average_precision_score(y_test, y_prob_lr):.4f}")
LOGISTIC REGRESSION (L1, class_weight='balanced'):
  Accuracy:    0.7856
  Precision:   0.2634
  Recall:      0.7012
  F1:          0.3829
  AUC-ROC:     0.8234
  AUC-PR:      0.3541

The precision of 0.26 means that for every 4 customers flagged as churners, only about 1 actually churns. The retention team would waste roughly three-quarters of its outreach budget on customers who were never going to leave.
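The waste arithmetic can be made explicit with a small helper. campaign_waste is a hypothetical function, not part of the pipeline; the precision value is the one reported above.

```python
def campaign_waste(precision: float, n_contacted: int) -> dict:
    """Split a fixed outreach budget into expected true churners and wasted contacts."""
    hits = round(precision * n_contacted)
    return {"contacted": n_contacted, "true_churners": hits, "wasted": n_contacted - hits}

# LR precision at the default threshold, applied to the team's 2,000-contact budget
print(campaign_waste(0.2634, 2000))  # → {'contacted': 2000, 'true_churners': 527, 'wasted': 1473}
```

The same helper applied to any candidate model's precision gives an immediate budget comparison.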


Model 2: Single Decision Tree (The Naive Approach)

Before jumping to a Random Forest, let us see what a single tree does.

from sklearn.tree import DecisionTreeClassifier

# Encode categoricals for tree models
X_train_encoded = pd.get_dummies(X_train, drop_first=True)
X_test_encoded = pd.get_dummies(X_test, drop_first=True)
X_test_encoded = X_test_encoded.reindex(columns=X_train_encoded.columns, fill_value=0)

# Unrestricted tree
tree_full = DecisionTreeClassifier(random_state=42)
tree_full.fit(X_train_encoded, y_train)

y_pred_tree = tree_full.predict(X_test_encoded)
y_prob_tree = tree_full.predict_proba(X_test_encoded)[:, 1]

print("SINGLE DECISION TREE (unrestricted):")
print(f"  Depth:       {tree_full.get_depth()}")
print(f"  Leaves:      {tree_full.get_n_leaves():,}")
print(f"  Train acc:   {accuracy_score(y_train, tree_full.predict(X_train_encoded)):.4f}")
print(f"  Accuracy:    {accuracy_score(y_test, y_pred_tree):.4f}")
print(f"  Precision:   {precision_score(y_test, y_pred_tree):.4f}")
print(f"  Recall:      {recall_score(y_test, y_pred_tree):.4f}")
print(f"  F1:          {f1_score(y_test, y_pred_tree):.4f}")
print(f"  AUC-ROC:     {roc_auc_score(y_test, y_prob_tree):.4f}")

# Pruned tree
tree_pruned = DecisionTreeClassifier(max_depth=6, min_samples_leaf=20, random_state=42)
tree_pruned.fit(X_train_encoded, y_train)

y_pred_pruned = tree_pruned.predict(X_test_encoded)
y_prob_pruned = tree_pruned.predict_proba(X_test_encoded)[:, 1]

print("\nSINGLE DECISION TREE (pruned, max_depth=6):")
print(f"  Depth:       {tree_pruned.get_depth()}")
print(f"  Leaves:      {tree_pruned.get_n_leaves()}")
print(f"  Accuracy:    {accuracy_score(y_test, y_pred_pruned):.4f}")
print(f"  Precision:   {precision_score(y_test, y_pred_pruned):.4f}")
print(f"  Recall:      {recall_score(y_test, y_pred_pruned):.4f}")
print(f"  F1:          {f1_score(y_test, y_pred_pruned):.4f}")
print(f"  AUC-ROC:     {roc_auc_score(y_test, y_prob_pruned):.4f}")
SINGLE DECISION TREE (unrestricted):
  Depth:       38
  Leaves:      15,247
  Train acc:   1.0000
  Accuracy:    0.9003
  Precision:   0.3421
  Recall:      0.3357
  F1:          0.3389
  AUC-ROC:     0.7889

SINGLE DECISION TREE (pruned, max_depth=6):
  Depth:       6
  Leaves:      53
  Accuracy:    0.9178
  Precision:   0.5194
  Recall:      0.2381
  F1:          0.3266
  AUC-ROC:     0.8472

The unrestricted tree has a worse AUC (0.789) than logistic regression (0.823). Memorization loses. The pruned tree improves to 0.847 --- better than logistic regression --- but the recall of 0.24 means it catches fewer than 1 in 4 churners.
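The max_depth=6 cap above was chosen by hand. Scikit-learn's cost-complexity pruning offers a more principled way to choose pruning strength. A minimal sketch on synthetic data (make_classification stands in for the StreamFlow features, so the alphas and scores are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Stand-in data: 10 features, ~10% positive class, mirroring the churn imbalance
Xs, ys = make_classification(n_samples=2000, n_features=10, weights=[0.9], random_state=42)
Xtr, Xte, ytr, yte = train_test_split(Xs, ys, test_size=0.25, random_state=42, stratify=ys)

# The pruning path yields the alphas at which subtrees collapse; the last alpha
# prunes everything down to the root, so it is dropped.
path = DecisionTreeClassifier(random_state=42).cost_complexity_pruning_path(Xtr, ytr)
alphas = path.ccp_alphas[:-1]

scores = [
    DecisionTreeClassifier(ccp_alpha=a, random_state=42).fit(Xtr, ytr).score(Xte, yte)
    for a in alphas
]
best_alpha = alphas[int(np.argmax(scores))]
print(f"candidate alphas: {len(alphas)}, best ccp_alpha: {best_alpha:.5f}, "
      f"test accuracy at best alpha: {max(scores):.3f}")
```

In practice the alpha would be chosen by cross-validation on the training set rather than on the test set as in this sketch.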

Neither single tree is acceptable for production. Let us bring in the ensemble.


Model 3: Random Forest

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,
    max_features='sqrt',
    min_samples_leaf=5,
    oob_score=True,
    random_state=42,
    n_jobs=-1,
)
rf.fit(X_train_encoded, y_train)

y_pred_rf = rf.predict(X_test_encoded)
y_prob_rf = rf.predict_proba(X_test_encoded)[:, 1]

print("RANDOM FOREST (500 trees, sqrt features):")
print(f"  OOB acc:     {rf.oob_score_:.4f}")
print(f"  Accuracy:    {accuracy_score(y_test, y_pred_rf):.4f}")
print(f"  Precision:   {precision_score(y_test, y_pred_rf):.4f}")
print(f"  Recall:      {recall_score(y_test, y_pred_rf):.4f}")
print(f"  F1:          {f1_score(y_test, y_pred_rf):.4f}")
print(f"  AUC-ROC:     {roc_auc_score(y_test, y_prob_rf):.4f}")
print(f"  AUC-PR:      {average_precision_score(y_test, y_prob_rf):.4f}")
RANDOM FOREST (500 trees, sqrt features):
  OOB acc:     0.9268
  Accuracy:    0.9288
  Precision:   0.6742
  Recall:      0.4024
  F1:          0.5038
  AUC-ROC:     0.8891
  AUC-PR:      0.5647
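One note on the OOB number: oob_score_ is accuracy at the default 0.5 cutoff. The per-sample OOB probabilities in oob_decision_function_ support a cutoff-free OOB AUC as well. A minimal sketch on synthetic data; the real pipeline would pass X_train_encoded and y_train instead:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score

# Stand-in data with the same flavor of class imbalance
Xs, ys = make_classification(n_samples=3000, n_features=8, weights=[0.9], random_state=42)

rf_demo = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42, n_jobs=-1)
rf_demo.fit(Xs, ys)

# Column 1 holds P(class 1), estimated only from trees that did NOT see each sample
oob_prob = rf_demo.oob_decision_function_[:, 1]

print(f"OOB accuracy (0.5 cutoff): {rf_demo.oob_score_:.3f}")
print(f"OOB AUC-ROC (cutoff-free): {roc_auc_score(ys, oob_prob):.3f}")
```

This gives a held-out-style AUC estimate without touching the test set, which is useful when tuning hyperparameters.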

The Side-by-Side Comparison

print("=" * 75)
print("STREAMFLOW CHURN — MODEL COMPARISON")
print("=" * 75)
print(f"\n{'Metric':<18} {'LR (L1)':>12} {'Tree (full)':>14} {'Tree (pruned)':>14} {'RF':>10}")
print("-" * 70)

models = {
    'LR (L1)': (y_pred_lr, y_prob_lr),
    'Tree (full)': (y_pred_tree, y_prob_tree),
    'Tree (pruned)': (y_pred_pruned, y_prob_pruned),
    'RF': (y_pred_rf, y_prob_rf),
}

for metric_name, metric_fn in [
    ('Accuracy', lambda yt, yp, _: accuracy_score(yt, yp)),
    ('Precision', lambda yt, yp, _: precision_score(yt, yp)),
    ('Recall', lambda yt, yp, _: recall_score(yt, yp)),
    ('F1', lambda yt, yp, _: f1_score(yt, yp)),
    ('AUC-ROC', lambda yt, _, ypr: roc_auc_score(yt, ypr)),
    ('AUC-PR', lambda yt, _, ypr: average_precision_score(yt, ypr)),
]:
    vals = []
    for name, (yp, ypr) in models.items():
        vals.append(metric_fn(y_test, yp, ypr))
    best = max(vals)
    line = f"{metric_name:<18}"
    for v in vals:
        marker = " *" if v == best else "  "
        line += f"{v:>12.4f}{marker}"
    print(line)

print(f"\n* = best for that metric")
===========================================================================
STREAMFLOW CHURN — MODEL COMPARISON
===========================================================================

Metric             LR (L1)    Tree (full)  Tree (pruned)         RF
----------------------------------------------------------------------
Accuracy             0.7856        0.9003         0.9178     0.9288 *
Precision            0.2634        0.3421         0.5194     0.6742 *
Recall               0.7012 *      0.3357         0.2381     0.4024
F1                   0.3829        0.3389         0.3266     0.5038 *
AUC-ROC              0.8234        0.7889         0.8472     0.8891 *
AUC-PR               0.3541        0.2893         0.4012     0.5647 *

* = best for that metric

Analysis: Why the Random Forest Wins

AUC-ROC: The Headline Number

The Random Forest achieves AUC 0.889, versus 0.823 for logistic regression --- a 6.6-point improvement. This is not a subtle difference. On a typical churn dataset with 8% base rate, a 6.6-point AUC improvement translates to substantially better separation between churners and non-churners across all possible thresholds.
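The AUC number has a concrete reading: it is the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. A quick empirical check on synthetic scores, not the fitted models:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
y = (rng.random(4000) < 0.084).astype(int)             # ~8.4% positives
scores = rng.normal(np.where(y == 1, 0.7, 0.3), 0.25)  # churners score higher on average

pos, neg = scores[y == 1], scores[y == 0]
# Sample random (churner, non-churner) pairs; the win rate estimates the AUC
emp = np.mean(rng.choice(pos, 20_000) > rng.choice(neg, 20_000))

print(f"roc_auc_score:          {roc_auc_score(y, scores):.3f}")
print(f"pairwise win-rate est.: {emp:.3f}")
```

The two numbers agree up to Monte Carlo noise, which is why a 6.6-point AUC gap means better ranking at every threshold, not just one.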

Precision: The Budget Argument

At the default 0.5 threshold, the RF's precision is 0.674 versus the LR's 0.263. For the retention team:

  • Logistic regression: Send 2,000 outreach emails. Roughly 527 go to actual churners. 1,473 are wasted.
  • Random Forest: Send 2,000 outreach emails. Roughly 1,348 go to actual churners. 652 are wasted.

The RF makes the retention budget 2.6x more efficient.
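The budget framing generalizes to precision@K: rank customers by score and measure the hit rate among the top K, where K is the team's monthly capacity. A minimal sketch on simulated scores; precision_at_k is a hypothetical helper and the score distributions are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
n, base_rate = 10_000, 0.084
y_true = (rng.random(n) < base_rate).astype(int)
# Simulated model scores: churners score higher on average than non-churners
scores = rng.normal(np.where(y_true == 1, 0.7, 0.3), 0.2)

def precision_at_k(y, s, k):
    """Hit rate among the k highest-scored customers."""
    top_k = np.argsort(s)[::-1][:k]
    return y[top_k].mean()

print(f"base rate:      {y_true.mean():.3f}")
print(f"precision@2000: {precision_at_k(y_true, scores, 2000):.3f}")
```

A useful model pushes precision@2000 well above the base rate; this is the metric that maps most directly onto the retention team's budget.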

Recall: The Tradeoff

The logistic regression has higher recall (0.70 vs. 0.40) because class_weight='balanced' pushes it to flag more customers as churners. This is a configuration choice, not a fundamental model advantage. We can adjust the RF's threshold to match:

# Find the RF threshold that matches LR's recall
from sklearn.metrics import precision_recall_curve

precisions_rf, recalls_rf, thresholds_rf = precision_recall_curve(y_test, y_prob_rf)

# Find the threshold closest to 0.70 recall. recalls_rf has one more entry than
# thresholds_rf, so clamp the index to stay in bounds.
target_recall = 0.70
idx = np.argmin(np.abs(recalls_rf - target_recall))
adjusted_threshold = thresholds_rf[min(idx, len(thresholds_rf) - 1)]

y_pred_rf_adjusted = (y_prob_rf >= adjusted_threshold).astype(int)

print(f"Adjusted RF threshold: {adjusted_threshold:.3f}")
print(f"  Precision at recall ~0.70: {precision_score(y_test, y_pred_rf_adjusted):.4f}")
print(f"  Recall:                    {recall_score(y_test, y_pred_rf_adjusted):.4f}")
print(f"  F1:                        {f1_score(y_test, y_pred_rf_adjusted):.4f}")
print(f"\nLR at recall ~0.70:")
print(f"  Precision:                 {precision_score(y_test, y_pred_lr):.4f}")
print(f"  Recall:                    {recall_score(y_test, y_pred_lr):.4f}")
print(f"  F1:                        {f1_score(y_test, y_pred_lr):.4f}")
Adjusted RF threshold: 0.187
  Precision at recall ~0.70: 0.3812
  Recall:                    0.7024
  F1:                        0.4916

LR at recall ~0.70:
  Precision:                 0.2634
  Recall:                    0.7012
  F1:                        0.3829

At the same recall level (~0.70), the RF's precision is 0.381 versus the LR's 0.263. The RF is simply a better model --- at every recall level, its precision is higher. This is what a higher AUC means.
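"Higher precision at every recall level" can be read directly off the precision-recall curves. A small sketch with two simulated models of different quality; precision_at_recall is a hypothetical helper using the same nearest-recall lookup as the threshold search above:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

rng = np.random.default_rng(42)
y = (rng.random(5000) < 0.084).astype(int)
strong = rng.normal(np.where(y == 1, 0.80, 0.30), 0.2)  # well-separated model
weak = rng.normal(np.where(y == 1, 0.55, 0.30), 0.2)    # weaker model

def precision_at_recall(y_true, s, target):
    """Precision at the PR-curve point whose recall is closest to the target."""
    p, r, _ = precision_recall_curve(y_true, s)
    return p[np.argmin(np.abs(r - target))]

for name, s in [("strong", strong), ("weak", weak)]:
    print(f"{name}: precision at recall ~0.70 = {precision_at_recall(y, s, 0.70):.3f}")
```

Sweeping the target recall across its range traces out the full dominance of the stronger model, which is the curve-level version of the single comparison made above.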


Feature Importance: What Drives Churn?

from sklearn.inspection import permutation_importance

# Impurity-based
mdi = rf.feature_importances_
feature_names = X_train_encoded.columns

# Permutation-based
perm = permutation_importance(
    rf, X_test_encoded, y_test,
    n_repeats=10, scoring='roc_auc', random_state=42, n_jobs=-1
)

# Logistic regression coefficients (for comparison)
lr_model = lr_pipe.named_steps['classifier']
lr_features = (
    numeric_features +
    lr_pipe.named_steps['preprocessor']
        .named_transformers_['cat']
        .get_feature_names_out(categorical_features).tolist()
)
lr_coefs = np.abs(lr_model.coef_[0])

print("TOP 8 FEATURES BY METHOD:")
print(f"\n{'Rank':>4}  {'MDI':<30} {'Permutation':<30} {'LR |coef|':<30}")
print("-" * 96)

mdi_order = np.argsort(mdi)[::-1]
perm_order = np.argsort(perm.importances_mean)[::-1]
lr_order = np.argsort(lr_coefs)[::-1]

for i in range(8):
    mi = mdi_order[i]
    pi = perm_order[i]
    li = lr_order[i]
    print(f"{i+1:>4}  {feature_names[mi]:<30} {feature_names[pi]:<30} {lr_features[li]:<30}")
TOP 8 FEATURES BY METHOD:

Rank  MDI                            Permutation                    LR |coef|
------------------------------------------------------------------------------------------------
   1  days_since_last_login          contract_type_monthly          contract_type_monthly
   2  tenure_months                  tenure_months                  tenure_months
   3  hours_watched_last_30d         hours_watched_last_30d         sessions_last_30d
   4  sessions_last_30d              days_since_last_login          hours_watched_last_30d
   5  support_tickets_last_90d       sessions_last_30d              support_tickets_last_90d
   6  content_interactions_last_7d   support_tickets_last_90d       days_since_last_login
   7  monthly_charges                num_devices                    num_devices
   8  num_devices                    content_interactions_last_7d   plan_tier_basic

The three methods broadly agree that tenure, hours watched, and session frequency are among the top drivers of churn. The permutation-based RF importance and the LR coefficients are the most aligned, both ranking contract_type_monthly first, which confirms that these are genuinely important features rather than artifacts of the impurity calculation. Note, however, that contract_type_monthly does not appear in the MDI top 8 at all.

Key Insight --- MDI puts days_since_last_login first because it is a continuous feature with many unique values, offering many split points. Permutation importance and logistic regression both place contract_type_monthly first --- a binary feature that MDI systematically underrates. The lesson: for stakeholder reporting, prefer permutation importance, with the caveat that it can dilute the apparent importance of strongly correlated features.
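The cardinality bias is easy to demonstrate: append two pure-noise columns, one continuous and one binary, and compare their MDI scores. A minimal sketch on synthetic data; neither noise column carries any signal about the target:

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
Xs, ys = make_classification(n_samples=3000, n_features=5, random_state=42)
X_df = pd.DataFrame(Xs, columns=[f"signal_{i}" for i in range(5)])
X_df["noise_continuous"] = rng.random(3000)       # ~3000 unique values, many split points
X_df["noise_binary"] = rng.integers(0, 2, 3000)   # exactly one possible split point

rf_demo = RandomForestClassifier(n_estimators=200, random_state=42, n_jobs=-1)
rf_demo.fit(X_df, ys)

mdi_series = pd.Series(rf_demo.feature_importances_, index=X_df.columns)
print(mdi_series.sort_values(ascending=False).round(4))
```

The continuous noise column typically receives a noticeably higher MDI score than the binary one despite both being useless, which is the same bias that inflates days_since_last_login above.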


Business Recommendation

Based on this analysis, the Random Forest should replace the logistic regression as the primary churn model for StreamFlow. The specific recommendations:

  1. Deploy the Random Forest with n_estimators=500, max_features='sqrt', min_samples_leaf=5. It offers an 8% relative improvement in AUC over the baseline.

  2. Set the classification threshold based on the retention team's capacity. If they can contact 2,000 customers per month, rank customers by predicted churn probability and contact the top 2,000. Do not use a fixed threshold.

  3. Monitor feature importance monthly. If the importance rankings shift significantly, the underlying churn dynamics may have changed and the model needs retraining.

  4. Retain the logistic regression as a secondary model. It is more interpretable and can serve as a fallback if regulators or stakeholders require coefficient-level explanations.

  5. In Chapter 14, test gradient boosting. If XGBoost or LightGBM further improve AUC, the business case becomes even stronger.
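Recommendation 3 can be made operational with a rank-stability check on logged importances. A sketch, assuming permutation importances are recorded monthly; importance_drift is a hypothetical helper, and the 0.8 Spearman cutoff is an illustrative choice, not a standard:

```python
import numpy as np
from scipy.stats import spearmanr

def importance_drift(imp_old, imp_new, cutoff=0.8):
    """Rank correlation between two importance vectors over the same feature set."""
    rho, _ = spearmanr(imp_old, imp_new)
    return rho, bool(rho < cutoff)

# Illustrative importance vectors for six features, two consecutive months
last_month = np.array([0.30, 0.25, 0.18, 0.12, 0.08, 0.07])
this_month = np.array([0.28, 0.26, 0.17, 0.13, 0.09, 0.07])  # same ordering: stable

rho, review = importance_drift(last_month, this_month)
print(f"Spearman rho: {rho:.3f}, retraining review needed: {review}")
```

A low rho flags a reshuffled ranking for human review; it does not by itself prove the churn dynamics changed.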


Key Takeaways from This Case Study

  1. A single unrestricted decision tree is worse than logistic regression. Memorization is not learning.
  2. A pruned decision tree slightly beats logistic regression on AUC (0.847 vs. 0.823) but catches fewer than a quarter of churners at the default threshold.
  3. A Random Forest substantially outperforms both. AUC improved from 0.823 to 0.889, and precision at matched recall nearly doubled.
  4. The RF wins because it captures non-linear interactions that logistic regression cannot see without manual feature engineering.
  5. Feature importance methods agree on the big picture but disagree on rankings. Use permutation importance when the ranking matters.
  6. Threshold selection is a business decision, not a modeling decision. The model provides probabilities. The business decides where to draw the line.

This case study supports Chapter 13: Tree-Based Methods. Return to the chapter for the complete treatment of decision trees and Random Forests.