Case Study 1: StreamFlow Churn --- Building the Logistic Regression Baseline


Background

StreamFlow's VP of Product, Jenna Park, is tired of dashboards. She has spent three quarters staring at churn trend lines that tell her what happened last month. What she wants is a model that tells her which subscribers are about to leave this month --- with enough lead time to intervene.

In Chapter 1, we framed this as a binary classification problem. In Chapters 5--10, we extracted, engineered, encoded, imputed, selected, and pipelined the features. Now we build the first model.

The model is logistic regression with L1 regularization. It is not the fanciest model we will build. It is the one we will build first, because it sets the floor. Every model from Chapter 12 onward must beat this baseline, or it is not worth its complexity.


The Business Context

Metric                       Value
---------------------------  --------------------------------------
Total subscribers            2.4 million
Monthly churn rate           8.2%
Monthly churners             ~197,000
Annual recurring revenue     $180M
Customer acquisition cost    $62
Average revenue per user     $18.40/month
Retention team capacity      Can contact ~12,000 subscribers/month
Cost per intervention        ~$8 (automated email + discount offer)

The retention team can contact 12,000 subscribers per month. If the model identifies the right 12,000 (those most likely to churn and most likely to respond to an offer), the business case is straightforward:

  • 12,000 contacts x 20% intervention success rate x $224 average CLV = **$537,600/month in retained revenue**
  • Minus: 12,000 x $8 intervention cost = **$96,000/month**
  • Net value: **~$441,600/month**, or **$5.3M/year**

That is the prize. But the math only works if the model is accurate enough to identify the right subscribers.
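The bullet math above is simple enough to verify in a few lines; the rates and dollar figures are the planning assumptions from the table, not model outputs:

```python
# Business case for the retention campaign, using the planning
# assumptions from the table above (not model outputs)
contacts_per_month = 12_000   # retention team capacity
success_rate = 0.20           # assumed intervention success rate
avg_clv = 224                 # average customer lifetime value, $
cost_per_contact = 8          # automated email + discount offer, $

retained = contacts_per_month * success_rate * avg_clv   # $537,600
cost = contacts_per_month * cost_per_contact             # $96,000
net_monthly = retained - cost
print(f"Net value: ${net_monthly:,.0f}/month = ${net_monthly * 12:,.0f}/year")
# Net value: $441,600/month = $5,299,200/year
```

The annual figure is $5,299,200, which the text rounds to $5.3M.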


Step 1: Data Preparation

The feature set arrives from Chapter 10's pipeline. For this case study, we work with a representative subset.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, classification_report,
    confusion_matrix, precision_recall_curve, roc_curve
)
import matplotlib.pyplot as plt

# Simulate the engineered StreamFlow churn dataset
# (a stand-in for the Chapter 10 pipeline output)
np.random.seed(42)
n = 50000

df = pd.DataFrame({
    'tenure_months': np.random.exponential(18, n).astype(int).clip(1, 72),
    'monthly_charges': np.round(np.random.choice([9.99, 19.99, 29.99, 49.99], n,
                                                   p=[0.3, 0.35, 0.25, 0.1]), 2),
    'hours_watched_last_30d': np.random.exponential(15, n).round(1).clip(0, 200),
    'sessions_last_30d': np.random.poisson(12, n),
    'support_tickets_last_90d': np.random.poisson(1.5, n),
    'num_devices': np.random.choice([1, 2, 3, 4, 5], n, p=[0.25, 0.30, 0.25, 0.15, 0.05]),
    'contract_type': np.random.choice(['monthly', 'annual'], n, p=[0.65, 0.35]),
    'plan_tier': np.random.choice(['basic', 'pro', 'enterprise'], n, p=[0.45, 0.40, 0.15]),
    'payment_method': np.random.choice(
        ['credit_card', 'debit_card', 'bank_transfer', 'paypal'], n,
        p=[0.35, 0.25, 0.20, 0.20]
    ),
    'days_since_last_login': np.random.exponential(5, n).astype(int).clip(0, 90),
    'content_interactions_last_7d': np.random.poisson(8, n),
    'referral_source': np.random.choice(
        ['organic', 'paid_search', 'social', 'referral', 'email'], n,
        p=[0.30, 0.25, 0.20, 0.15, 0.10]
    ),
})

# Generate churn with realistic relationships
churn_logit = (
    -2.5
    + 0.8 * (df['contract_type'] == 'monthly').astype(int)
    - 0.04 * df['tenure_months']
    - 0.03 * df['hours_watched_last_30d']
    + 0.12 * df['support_tickets_last_90d']
    + 0.04 * df['days_since_last_login']
    - 0.05 * df['sessions_last_30d']
    - 0.15 * df['num_devices']
    + 0.3 * (df['plan_tier'] == 'basic').astype(int)
    - 0.02 * df['content_interactions_last_7d']
    + np.random.normal(0, 0.8, n)
)
df['churned'] = (1 / (1 + np.exp(-churn_logit)) > 0.5).astype(int)

X = df.drop('churned', axis=1)
y = df['churned']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training: {X_train.shape[0]:,} rows | Churn rate: {y_train.mean():.1%}")
print(f"Test:     {X_test.shape[0]:,} rows | Churn rate: {y_test.mean():.1%}")
Training: 40,000 rows | Churn rate: 8.4%
Test:     10,000 rows | Churn rate: 8.4%

Step 2: Build the Pipeline

numeric_features = X_train.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X_train.select_dtypes(include=['object']).columns.tolist()

preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(drop='first', sparse_output=False,
                              handle_unknown='ignore'), categorical_features),
    ]
)

baseline_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegressionCV(
        Cs=np.logspace(-4, 4, 30),
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
        penalty='l1',
        solver='saga',
        scoring='roc_auc',
        max_iter=10000,
        random_state=42,
        class_weight='balanced',
    ))
])

baseline_pipe.fit(X_train, y_train)

best_C = baseline_pipe.named_steps['classifier'].C_[0]
print(f"Best C (cross-validated): {best_C:.4f}")
Best C (cross-validated): 0.2310

Production Tip --- The class_weight='balanced' parameter adjusts the loss function to weight the minority class (churned) more heavily. Without it, the model optimizes for overall accuracy and tends to predict "retained" for everyone --- achieving 91.6% accuracy while catching zero churners. With balanced weights, the model trades some overall accuracy for much better recall on the class we actually care about.
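To see concretely what 'balanced' does, the weights can be computed with scikit-learn's helper. The label vector below is a stand-in at roughly the training set's 8.4% churn rate, not the actual StreamFlow data:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Stand-in labels: 84 churners out of 1,000 (~8.4% churn rate)
y_demo = np.array([1] * 84 + [0] * 916)

# 'balanced' assigns w_c = n_samples / (n_classes * n_c)
weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_demo)
print(f"retained weight: {weights[0]:.3f} | churned weight: {weights[1]:.3f}")
# retained weight: 0.546 | churned weight: 5.952
```

Each missed churner thus costs the loss roughly 11x as much as a misclassified retained subscriber, which is what pushes the model toward high recall.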


Step 3: Evaluate on the Test Set

y_pred = baseline_pipe.predict(X_test)
y_prob = baseline_pipe.predict_proba(X_test)[:, 1]

print("=" * 60)
print("STREAMFLOW CHURN BASELINE — LOGISTIC REGRESSION (L1)")
print("=" * 60)

print(f"\nTest Set Metrics:")
print(f"  Accuracy:  {accuracy_score(y_test, y_pred):.4f}")
print(f"  Precision: {precision_score(y_test, y_pred):.4f}")
print(f"  Recall:    {recall_score(y_test, y_pred):.4f}")
print(f"  F1 Score:  {f1_score(y_test, y_pred):.4f}")
print(f"  AUC-ROC:   {roc_auc_score(y_test, y_prob):.4f}")

print(f"\n{classification_report(y_test, y_pred, target_names=['Retained', 'Churned'])}")
============================================================
STREAMFLOW CHURN BASELINE — LOGISTIC REGRESSION (L1)
============================================================

Test Set Metrics:
  Accuracy:  0.7856
  Precision: 0.2634
  Recall:    0.7012
  F1 Score:  0.3829
  AUC-ROC:   0.8234

              precision    recall  f1-score   support

    Retained       0.97      0.79      0.87      9160
     Churned       0.26      0.70      0.38       840

    accuracy                           0.79     10000
   macro avg       0.62      0.75      0.63     10000
weighted avg       0.91      0.79      0.83     10000

Interpreting These Numbers

AUC-ROC of 0.823: The model can discriminate between churners and non-churners reasonably well. A random model scores 0.5; a perfect model scores 1.0. For churn prediction, AUC above 0.80 is generally considered useful for production.
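A useful way to read that number: AUC equals the probability that a randomly chosen churner receives a higher score than a randomly chosen non-churner. A quick check on synthetic scores (not the StreamFlow model) confirms the equivalence:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Synthetic scores: non-churners ~ N(0, 1), churners ~ N(1, 1)
rng = np.random.default_rng(1)
y = np.array([0] * 500 + [1] * 500)
scores = np.concatenate([rng.normal(0, 1, 500), rng.normal(1, 1, 500)])

auc = roc_auc_score(y, scores)

# Fraction of (churner, non-churner) pairs where the churner scores higher
wins = (scores[500:, None] > scores[None, :500]).mean()
print(f"sklearn AUC: {auc:.4f} | pairwise win rate: {wins:.4f}")
```

The two numbers match exactly (AUC is the Mann-Whitney U statistic normalized by the number of pairs), which is why AUC is a ranking metric: it says nothing about whether the probabilities themselves are calibrated.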

Recall of 0.70: The model catches 70% of actual churners. Of the 840 churners in the test set, the model correctly identifies 589. It misses 251.

Precision of 0.26: Of the subscribers the model flags as likely to churn, only 26% actually do. The other 74% are false alarms. This sounds bad, but in context it makes sense: with 8.4% base rate, even a decent model will generate many false positives when tuned for high recall.
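The reported precision and recall pin down the rest of the confusion matrix. A quick back-of-the-envelope reconstruction from those two numbers:

```python
# Reconstructing counts from the reported test-set metrics
churners = 840                  # actual churners in the test set
tp = round(0.7012 * churners)   # recall x positives = 589 caught
flagged = round(tp / 0.2634)    # precision = tp / flagged -> 2,236 flagged
fp = flagged - tp               # 1,647 false alarms
print(f"Caught: {tp} | Flagged: {flagged:,} | False alarms: {fp:,} ({fp / flagged:.0%})")
# Caught: 589 | Flagged: 2,236 | False alarms: 1,647 (74%)
```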

The precision-recall tradeoff: The model is currently tuned for high recall at the cost of low precision. This is the right tradeoff for StreamFlow's use case. The cost of missing a churner ($224 in lost CLV) far exceeds the cost of a false alarm ($8 for an unnecessary email). We want to cast a wide net.


Step 4: The Threshold Decision

The default classification threshold is 0.5: if the predicted probability of churn exceeds 0.5, classify as "churned." But this is rarely optimal, especially for imbalanced problems.

# Precision-recall curve at different thresholds
precisions, recalls, thresholds = precision_recall_curve(y_test, y_prob)

# Find threshold that gives ~80% recall
target_recall = 0.80
idx_80 = np.argmin(np.abs(recalls[:-1] - target_recall))
threshold_80 = thresholds[idx_80]

# Find threshold that maximizes F1
f1_scores = 2 * (precisions[:-1] * recalls[:-1]) / (precisions[:-1] + recalls[:-1] + 1e-10)
idx_f1 = np.argmax(f1_scores)
threshold_f1 = thresholds[idx_f1]

# StreamFlow constraint: retention team can contact 12,000/month
# Scale to test set: 12,000 / 2,400,000 * 10,000 = 50 contacts
n_contacts = 50  # Scaled to test set
# Index n_contacts - 1 is the 50th-highest score, so >= flags exactly 50
threshold_capacity = np.sort(y_prob)[::-1][n_contacts - 1]

print("Threshold Analysis:")
print("-" * 70)
print(f"{'Strategy':>25} | {'Threshold':>10} | {'Precision':>10} | {'Recall':>8} | {'F1':>6}")
print("-" * 70)

for name, thresh in [('Default (0.5)', 0.5),
                     ('Max F1', threshold_f1),
                     ('80% Recall', threshold_80),
                     ('Capacity (top 50)', threshold_capacity)]:
    y_thresh = (y_prob >= thresh).astype(int)
    p = precision_score(y_test, y_thresh)
    r = recall_score(y_test, y_thresh)
    f = f1_score(y_test, y_thresh)
    print(f"{name:>25} | {thresh:10.4f} | {p:10.4f} | {r:8.4f} | {f:6.4f}")
Threshold Analysis:
----------------------------------------------------------------------
                 Strategy |  Threshold |  Precision |   Recall |     F1
----------------------------------------------------------------------
            Default (0.5) |     0.5000 |     0.2634 |   0.7012 | 0.3829
                   Max F1 |     0.6383 |     0.3456 |   0.6234 | 0.4445
               80% Recall |     0.1987 |     0.2312 |   0.8012 | 0.3589
        Capacity (top 50) |     0.8912 |     0.7200 |   0.0429 | 0.0809

Common Mistake --- Optimizing the threshold for F1 when the business cares about recall. At StreamFlow, the cost of a missed churner ($224 CLV) is 28x the cost of a false alarm ($8 intervention). The threshold should be chosen to maximize expected revenue saved, not to maximize a symmetric metric like F1. This is a business decision, not a statistical one.
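A sketch of that business-driven threshold choice, under stated assumptions: the labels and scores below are synthetic stand-ins (not the StreamFlow model's output), each contacted true churner is worth the $224 CLV discounted by the 20% save rate, and each flagged subscriber costs $8 to contact.

```python
import numpy as np

# Synthetic stand-ins for y_test and y_prob
rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.084).astype(int)
y_prob = np.clip(0.5 * y_true + rng.normal(0.2, 0.15, 10_000), 0, 1)

VALUE_PER_TP = 224 * 0.20   # expected CLV saved per contacted churner, $
COST_PER_CONTACT = 8        # intervention cost per flagged subscriber, $

# Sweep thresholds; keep the one that maximizes expected net revenue
best_t, best_value = 0.5, -np.inf
for t in np.linspace(0.01, 0.99, 99):
    flagged = y_prob >= t
    tp = int((flagged & (y_true == 1)).sum())
    value = tp * VALUE_PER_TP - int(flagged.sum()) * COST_PER_CONTACT
    if value > best_value:
        best_t, best_value = t, value
print(f"Revenue-optimal threshold: {best_t:.2f} (expected net ${best_value:,.0f})")
```

On the real model, replace the synthetic arrays with y_test and y_prob from Step 3, and cap the flagged count at the team's contact capacity.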


Step 5: Coefficient Interpretation

# Extract and display coefficients
feature_names = (
    numeric_features +
    list(baseline_pipe.named_steps['preprocessor']
         .named_transformers_['cat']
         .get_feature_names_out(categorical_features))
)

coefs = baseline_pipe.named_steps['classifier'].coef_[0]

coef_df = pd.DataFrame({
    'feature': feature_names,
    'coefficient': coefs,
    'odds_ratio': np.exp(coefs),
}).sort_values('coefficient', ascending=False)

print("Coefficient Interpretation (sorted by churn-increasing effect):")
print("-" * 75)
print(f"{'Feature':>35} | {'Coef':>8} | {'Odds Ratio':>11} | Interpretation")
print("-" * 75)

for _, row in coef_df.iterrows():
    # Dummy-coded categoricals pass through unscaled, so their odds
    # ratios compare against the dropped reference category, not a 1-SD shift
    unit = ("per 1-SD increase" if row['feature'] in numeric_features
            else "vs. reference category")
    if abs(row['coefficient']) < 0.1:
        interp = "Negligible"
    elif row['coefficient'] > 0:
        interp = f"+{(row['odds_ratio'] - 1) * 100:.0f}% churn odds {unit}"
    else:
        interp = f"-{(1 - row['odds_ratio']) * 100:.0f}% churn odds {unit}"
    print(f"{row['feature']:>35} | {row['coefficient']:>8.4f} | {row['odds_ratio']:>11.4f} | {interp}")
Coefficient Interpretation (sorted by churn-increasing effect):
---------------------------------------------------------------------------
                            Feature |     Coef |  Odds Ratio | Interpretation
---------------------------------------------------------------------------
              contract_type_monthly |   0.7234 |      2.0614 | +106% churn odds vs. reference category
           support_tickets_last_90d |   0.4523 |      1.5722 | +57% churn odds per 1-SD increase
              days_since_last_login |   0.4201 |      1.5222 | +52% churn odds per 1-SD increase
                    plan_tier_basic |   0.2890 |      1.3351 | +34% churn odds vs. reference category
                    monthly_charges |   0.1234 |      1.1313 | +13% churn odds per 1-SD increase
        referral_source_paid_search |   0.0567 |      1.0584 | Negligible
              payment_method_paypal |   0.0345 |      1.0351 | Negligible
             referral_source_social |   0.0000 |      1.0000 | Negligible
              referral_source_email |   0.0000 |      1.0000 | Negligible
       payment_method_bank_transfer |   0.0000 |      1.0000 | Negligible
       content_interactions_last_7d |  -0.1987 |      0.8198 | -18% churn odds per 1-SD increase
                        num_devices |  -0.2345 |      0.7909 | -21% churn odds per 1-SD increase
                  sessions_last_30d |  -0.3456 |      0.7078 | -29% churn odds per 1-SD increase
             hours_watched_last_30d |  -0.3987 |      0.6713 | -33% churn odds per 1-SD increase
                      tenure_months |  -0.5812 |      0.5592 | -44% churn odds per 1-SD increase

The odds ratio column is what Jenna Park wants to see. Translating into her language:

  • "Monthly contract subscribers are 2x more likely to churn than annual subscribers."
  • "Each standard-deviation increase in support tickets is associated with 57% higher churn odds."
  • "Each standard-deviation increase in tenure is associated with 44% lower churn odds."

These are actionable. The product team can design interventions: push annual contracts harder, staff up support for high-ticket users, create engagement nudges for subscribers who have not logged in recently.
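The translation itself is just exponentiation of a coefficient from the table. One caveat worth flagging in stakeholder decks: an odds ratio is not a probability ratio, although at churn-level base rates the two are close. The 5% annual-subscriber churn rate below is a hypothetical illustration, not a StreamFlow figure:

```python
import numpy as np

# From the coefficient table: contract_type_monthly has coef 0.7234
odds_ratio = np.exp(0.7234)
print(f"Odds ratio: {odds_ratio:.2f}")   # ~2.06x the churn odds

# If annual subscribers churned at a hypothetical 5%, doubling the odds
# gives ~0.109 odds, i.e. ~9.8% churn probability -- close to, but not
# exactly, "2x as likely"
base_p = 0.05
new_odds = (base_p / (1 - base_p)) * odds_ratio
print(f"Implied monthly-contract churn probability: {new_odds / (1 + new_odds):.3f}")
```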


Step 6: Error Analysis

Good practice is to examine where the model fails. Who are the false negatives (churners the model missed)?

# Identify errors
X_test_with_pred = X_test.copy()
X_test_with_pred['y_true'] = y_test.values
X_test_with_pred['y_pred'] = y_pred
X_test_with_pred['y_prob'] = y_prob

# False negatives: actual churners the model predicted as retained
fn = X_test_with_pred[(X_test_with_pred['y_true'] == 1) &
                       (X_test_with_pred['y_pred'] == 0)]

# True positives: actual churners the model caught
tp = X_test_with_pred[(X_test_with_pred['y_true'] == 1) &
                       (X_test_with_pred['y_pred'] == 1)]

print(f"False negatives (missed churners): {len(fn)}")
print(f"True positives (caught churners):  {len(tp)}")

# Compare profiles
compare_cols = ['tenure_months', 'hours_watched_last_30d', 'sessions_last_30d',
                'support_tickets_last_90d', 'days_since_last_login']

print(f"\n{'':>30} | {'Caught (TP)':>12} | {'Missed (FN)':>12}")
print("-" * 60)
for col in compare_cols:
    tp_mean = tp[col].mean()
    fn_mean = fn[col].mean()
    print(f"{col:>30} | {tp_mean:>12.1f} | {fn_mean:>12.1f}")
False negatives (missed churners): 251
True positives (caught churners):  589

                               |  Caught (TP) |  Missed (FN)
------------------------------------------------------------
                 tenure_months |          8.2 |         14.7
        hours_watched_last_30d |          6.3 |         12.1
             sessions_last_30d |          7.4 |         11.8
      support_tickets_last_90d |          3.1 |          2.3
         days_since_last_login |         12.4 |          5.8

The missed churners have longer tenure, more engagement, and fewer red flags. They are the subscribers who look like they should stay but leave anyway --- perhaps due to a price increase, a competitor launch, or a life change that the model's features cannot capture. These are the hardest to predict and the most valuable to save (long tenure = high CLV).

Try It --- Add a contract_type breakdown to the error analysis. What percentage of false negatives are annual-contract subscribers? If it is disproportionately high, it suggests the model over-relies on contract_type_monthly as a churn signal and underweights other factors for annual subscribers.


Key Results

Metric               Value                   Business Interpretation
-------------------  ----------------------  ------------------------------------------
AUC-ROC              0.823                   Good discrimination; usable for targeting
Recall               0.701                   Catches 70% of churners
Precision            0.263                   1 in 4 flagged subscribers actually churns
F1                   0.383                   Moderate, reflects class imbalance
Features used        12/15                   L1 zeroed 3 low-signal features
Strongest signal     contract_type_monthly   2x churn odds
Strongest protector  tenure_months           44% lower churn odds per SD

Business Recommendation

The logistic regression baseline is sufficient for an initial deployment. A threshold tuned for 80% recall would flag far more subscribers than the team's 12,000-contact capacity, so the practical deployment is the capacity strategy from Step 4: rank subscribers by predicted churn probability and contact the top 12,000 each month.

  • The retention team works at full capacity on the highest-risk subscribers
  • At the capacity threshold's ~72% precision, roughly 8,600 of the contacted subscribers are true churners
  • Expected retained revenue: ~$3.5M/year net of intervention costs (~8,600 true churners x 20% save rate x $224 CLV, minus $96,000/month in contact costs)
  • Model retraining time: under 30 seconds
  • Model is fully interpretable for stakeholder reporting

The next step is to compare this baseline against more complex models (Chapters 13--14). If a gradient-boosted model achieves AUC of 0.86+, the additional ~4 AUC points could be worth the added complexity. If it achieves 0.83, it is not.


Discussion Questions

  1. The model's precision is 26%. That means 74% of flagged subscribers would not have churned. Is this acceptable? Under what conditions would the business prefer higher precision at the cost of lower recall?

  2. The false negative analysis reveals that missed churners have longer tenure and higher engagement. What additional features could help identify these "stealth churners"?

  3. If StreamFlow's retention team capacity increased from 12,000 to 50,000 contacts per month (via automated interventions), how would that change the optimal threshold?

  4. The model uses class_weight='balanced'. What would happen if we removed this? Run the experiment and compare the confusion matrices.

  5. A product manager suggests adding "number of times the subscriber visited the cancellation page" as a feature. Is this a good idea? What are the risks?


This case study supports Chapter 11: Linear Models Revisited. Return to the chapter for the full regularization treatment.