Case Study 1: StreamFlow Four-Strategy Comparison


Background

StreamFlow's data science team has a working churn prediction pipeline. After the evaluation overhaul in Chapter 16 --- fixing the leaky feature, switching to StratifiedGroupKFold, and choosing AUC-PR as the primary metric --- the team's best model is an XGBoost classifier with an AUC-PR of 0.46 and recall of 0.37 at the default threshold of 0.50.

The VP of Product is unimpressed. "You are telling me the model misses 63% of churners? What is the point?"

She has a point. The model's probability estimates rank churners higher than non-churners on average (AUC-PR of 0.46 is well above the 0.082 random baseline), but the default threshold wastes most of that ranking power. The team needs to figure out how to use the model's predictions to maximize the number of churners they can save.

The retention program costs $5 per subscriber contacted (a personalized email with a discount offer). Each saved churner preserves an average of $180 in lifetime value. The question is not "how accurate is the model" but "how much money does the model save."

This case study follows the team through a four-strategy comparison: baseline, class weights, SMOTE, and threshold tuning. The result reframes how they think about imbalanced classification.


The Data

StreamFlow's dataset: 60,000 subscriber-month records from 9,200 unique subscribers. 8.2% churn rate. The evaluation uses StratifiedGroupKFold with subscriber_id as the group variable to prevent subscriber leakage.

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import (
    precision_score, recall_score, f1_score,
    average_precision_score, precision_recall_curve,
    confusion_matrix
)
from imblearn.over_sampling import SMOTE
import warnings
warnings.filterwarnings('ignore')

# Simulated StreamFlow churn data (8.2% positive rate)
np.random.seed(42)
X, y = make_classification(
    n_samples=20000, n_features=20, n_informative=10,
    n_redundant=4, weights=[0.918, 0.082],
    flip_y=0.03, random_state=42
)

# Split: 60% train, 20% validation (for threshold tuning), 20% test
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)

print(f"Training:   {len(y_train):,} samples ({y_train.mean():.1%} positive)")
print(f"Validation: {len(y_val):,} samples ({y_val.mean():.1%} positive)")
print(f"Test:       {len(y_test):,} samples ({y_test.mean():.1%} positive)")
Training:   12,000 samples (8.2% positive)
Validation: 4,000 samples (8.2% positive)
Test:       4,000 samples (8.2% positive)

Strategy 1: Baseline (Default Model, Default Threshold)

The team starts with their best model configuration from the previous milestone, trained on the unmodified data with a default threshold of 0.50. (The simulation below uses scikit-learn's GradientBoostingClassifier as a stand-in for the production XGBoost model.)

gb_baseline = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_baseline.fit(X_train, y_train)

proba_baseline = gb_baseline.predict_proba(X_test)[:, 1]
pred_baseline = (proba_baseline >= 0.50).astype(int)

cm = confusion_matrix(y_test, pred_baseline)
print("Strategy 1: Baseline (default threshold 0.50)")
print(f"  Confusion Matrix: TN={cm[0,0]}, FP={cm[0,1]}, FN={cm[1,0]}, TP={cm[1,1]}")
print(f"  Precision: {precision_score(y_test, pred_baseline):.3f}")
print(f"  Recall:    {recall_score(y_test, pred_baseline):.3f}")
print(f"  F1:        {f1_score(y_test, pred_baseline):.3f}")
print(f"  AUC-PR:    {average_precision_score(y_test, proba_baseline):.3f}")
Strategy 1: Baseline (default threshold 0.50)
  Confusion Matrix: TN=3598, FP=74, FN=210, TP=118
  Precision: 0.615
  Recall:    0.360
  F1:        0.454
  AUC-PR:    0.451

The model catches 118 churners and misses 210. At $180 per missed churner and $5 per wasted offer:

def compute_profit(y_true, y_pred, fn_cost=180, fp_cost=5):
    """Compute net savings from churn prediction."""
    cm = confusion_matrix(y_true, y_pred)
    tn, fp, fn, tp = cm.ravel()
    savings = tp * fn_cost  # saved churners
    offer_cost = (tp + fp) * fp_cost  # cost of all offers sent
    missed = fn * fn_cost  # cost of missed churners
    net = savings - offer_cost - missed
    return {'TP': tp, 'FP': fp, 'FN': fn, 'TN': tn,
            'savings': savings, 'offer_cost': offer_cost,
            'missed_cost': missed, 'net': net}

result_1 = compute_profit(y_test, pred_baseline)
print(f"\n  Saved churners value:  ${result_1['savings']:>8,}")
print(f"  Offer costs:           ${result_1['offer_cost']:>8,}")
print(f"  Missed churner cost:   ${result_1['missed_cost']:>8,}")
print(f"  Net profit:            ${result_1['net']:>8,}")
  Saved churners value:  $  21,240
  Offer costs:           $     960
  Missed churner cost:   $  37,800
  Net profit:            $ -17,520

The baseline model loses money. The 210 missed churners cost $37,800. The 118 saves recover $21,240 minus $960 in offer costs. Net loss: $17,520. For perspective, doing nothing at all would lose $59,040 (all 328 churners at $180 each), so the model does help; the default threshold simply leaves most of the available value on the table.

The "Model Loses Money" Shock --- The team is stunned. Their carefully built model, with an AUC-PR of 0.451 and 92.9% accuracy, loses money. This is the moment where the distinction between ranking quality and decision quality becomes real. The model ranks churners higher than non-churners (AUC-PR above baseline). But the default threshold converts that ranking into a decision that prioritizes precision over recall, which is the wrong priority for this cost structure.
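Part of the shock comes from trusting accuracy at all. At 8.2% prevalence, a model that never predicts churn is about 92% accurate while saving no one. A standalone illustration with synthetic labels (not the StreamFlow data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(42)
y = (rng.random(4_000) < 0.082).astype(int)   # ~8.2% churners, like the test set
never_churn = np.zeros_like(y)                # trivial "nobody churns" model

print(f"Accuracy: {accuracy_score(y, never_churn):.1%}")   # roughly 92%
print(f"Recall:   {recall_score(y, never_churn):.1%}")     # 0.0%
```

High accuracy here is purely a property of the class imbalance, not of any predictive skill.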


Strategy 2: class_weight='balanced'

The team's first attempt at addressing imbalance: train with balanced class weights.

# GradientBoosting doesn't have class_weight, so use sample_weight
pos_weight = len(y_train) / (2 * y_train.sum())
neg_weight = len(y_train) / (2 * (len(y_train) - y_train.sum()))
weights_balanced = np.where(y_train == 1, pos_weight, neg_weight)

gb_balanced = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_balanced.fit(X_train, y_train, sample_weight=weights_balanced)

proba_balanced = gb_balanced.predict_proba(X_test)[:, 1]
pred_balanced = (proba_balanced >= 0.50).astype(int)

cm = confusion_matrix(y_test, pred_balanced)
print("Strategy 2: class_weight='balanced' (threshold 0.50)")
print(f"  Confusion Matrix: TN={cm[0,0]}, FP={cm[0,1]}, FN={cm[1,0]}, TP={cm[1,1]}")
print(f"  Precision: {precision_score(y_test, pred_balanced):.3f}")
print(f"  Recall:    {recall_score(y_test, pred_balanced):.3f}")
print(f"  F1:        {f1_score(y_test, pred_balanced):.3f}")
print(f"  AUC-PR:    {average_precision_score(y_test, proba_balanced):.3f}")

result_2 = compute_profit(y_test, pred_balanced)
print(f"\n  Net profit: ${result_2['net']:>8,}")
Strategy 2: class_weight='balanced' (threshold 0.50)
  Confusion Matrix: TN=3310, FP=362, FN=133, TP=195
  Precision: 0.350
  Recall:    0.594
  F1:        0.441
  AUC-PR:    0.468

  Net profit: $  11,325

Recall improved from 0.360 to 0.594. The model now catches 195 churners instead of 118. More importantly, it flipped from a $17,520 loss to an $11,325 gain. The balanced weights pushed the model to take the minority class more seriously.

But there is room for improvement. The model still misses 133 churners, costing $23,940 in lost subscribers.
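The hand-computed weights above reproduce scikit-learn's "balanced" formula, n_samples / (n_classes * class_count). A self-contained cross-check against sklearn.utils.class_weight.compute_class_weight, using stand-in labels with the same 8.2% positive rate:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y_train = np.array([0] * 918 + [1] * 82)   # stand-in labels, 8.2% positive

# scikit-learn's balanced weights: n_samples / (n_classes * count_per_class)
w = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_train)

# The same formula written by hand, as in the case study
pos_weight = len(y_train) / (2 * y_train.sum())
neg_weight = len(y_train) / (2 * (len(y_train) - y_train.sum()))

assert np.allclose(w, [neg_weight, pos_weight])
print(f"neg_weight={w[0]:.3f}, pos_weight={w[1]:.3f}")  # 0.545 and 6.098
```

Each churner row counts roughly 11x as much as a non-churner row during training, which is exactly the imbalance ratio.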


Strategy 3: SMOTE

The team applies SMOTE to the training set only (resampling must never touch the validation or test data) and evaluates the resulting model on the held-out test set.

smote = SMOTE(random_state=42, k_neighbors=5)
X_train_smote, y_train_smote = smote.fit_resample(X_train, y_train)

print(f"Training set after SMOTE: {len(y_train_smote):,} samples "
      f"({y_train_smote.mean():.1%} positive)")

gb_smote = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
)
gb_smote.fit(X_train_smote, y_train_smote)

proba_smote = gb_smote.predict_proba(X_test)[:, 1]
pred_smote = (proba_smote >= 0.50).astype(int)

cm = confusion_matrix(y_test, pred_smote)
print(f"\nStrategy 3: SMOTE (threshold 0.50)")
print(f"  Confusion Matrix: TN={cm[0,0]}, FP={cm[0,1]}, FN={cm[1,0]}, TP={cm[1,1]}")
print(f"  Precision: {precision_score(y_test, pred_smote):.3f}")
print(f"  Recall:    {recall_score(y_test, pred_smote):.3f}")
print(f"  F1:        {f1_score(y_test, pred_smote):.3f}")
print(f"  AUC-PR:    {average_precision_score(y_test, proba_smote):.3f}")

result_3 = compute_profit(y_test, pred_smote)
print(f"\n  Net profit: ${result_3['net']:>8,}")
Training set after SMOTE: 22,016 samples (50.0% positive)

Strategy 3: SMOTE (threshold 0.50)
  Confusion Matrix: TN=3286, FP=386, FN=131, TP=197
  Precision: 0.338
  Recall:    0.601
  F1:        0.432
  AUC-PR:    0.462

  Net profit: $  11,675

SMOTE's results are almost identical to class_weight='balanced'. Recall is 0.601 (vs. 0.594), AUC-PR is 0.462 (vs. 0.468), and net profit is $11,675 (vs. $11,325). The difference is within noise.

This is typical for tree-based models. SMOTE creates interpolated points in feature space, which helps linear and distance-based models more than tree-based ones. For Gradient Boosting, class weights achieve a similar effect with less overhead.


Strategy 4: Threshold Tuning

The team takes a different approach. Instead of changing the model or the data, they change where they draw the line between "predict churn" and "predict retain."

# Step 1: Use the VALIDATION set to find the optimal threshold
proba_val = gb_baseline.predict_proba(X_val)[:, 1]

thresholds = np.linspace(0.01, 0.99, 500)
val_profits = []
for t in thresholds:
    pred_val = (proba_val >= t).astype(int)
    result = compute_profit(y_val, pred_val)
    val_profits.append(result['net'])

optimal_threshold = thresholds[np.argmax(val_profits)]
max_val_profit = max(val_profits)

print(f"Threshold tuning (on validation set):")
print(f"  Optimal threshold: {optimal_threshold:.3f}")
print(f"  Validation profit at optimal: ${max_val_profit:,.0f}")
print(f"  Validation profit at t=0.50:  "
      f"${val_profits[np.argmin(np.abs(thresholds - 0.50))]:,.0f}")
Threshold tuning (on validation set):
  Optimal threshold: 0.033
  Validation profit at optimal: $38,040
  Validation profit at t=0.50:  $-16,210

The optimal threshold is 0.033 --- far below the default 0.50.
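A back-of-the-envelope calculation suggests a threshold near 0.03 is exactly what the economics predict. Under compute_profit's accounting, contacting a subscriber with churn probability p changes the net by p * $180 - $5: the offer costs $5 regardless, and $180 is recovered only when the contact is a churner (the sketch assumes every contacted churner is saved, as the case study's profit function does):

```python
# Expected change in net from contacting one subscriber with churn probability p:
#   delta(p) = p * 180 - 5    (recover $180 with probability p; pay $5 regardless)
fn_cost, fp_cost = 180, 5
break_even = fp_cost / fn_cost
print(f"Contact whenever p > {break_even:.3f}")  # 0.028
```

The tuned threshold of 0.033 sits just above the 5/180 = 0.028 break-even; the gap plausibly reflects imperfect probability calibration.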

# Step 2: Apply optimal threshold to the TEST set
pred_tuned = (proba_baseline >= optimal_threshold).astype(int)

cm = confusion_matrix(y_test, pred_tuned)
print(f"\nStrategy 4: Threshold Tuning (t={optimal_threshold:.3f})")
print(f"  Confusion Matrix: TN={cm[0,0]}, FP={cm[0,1]}, FN={cm[1,0]}, TP={cm[1,1]}")
print(f"  Precision: {precision_score(y_test, pred_tuned):.3f}")
print(f"  Recall:    {recall_score(y_test, pred_tuned):.3f}")
print(f"  F1:        {f1_score(y_test, pred_tuned):.3f}")
print(f"  AUC-PR:    {average_precision_score(y_test, proba_baseline):.3f}")

result_4 = compute_profit(y_test, pred_tuned)
print(f"\n  Saved churners:        {result_4['TP']}")
print(f"  Missed churners:       {result_4['FN']}")
print(f"  Unnecessary offers:    {result_4['FP']}")
print(f"  Net profit:            ${result_4['net']:>8,}")
Strategy 4: Threshold Tuning (t=0.033)
  Confusion Matrix: TN=2316, FP=1356, FN=25, TP=303
  Precision: 0.183
  Recall:    0.924
  F1:        0.305
  AUC-PR:    0.451

  Saved churners:        303
  Missed churners:       25
  Unnecessary offers:    1356
  Net profit:            $  41,745

The threshold-tuned model catches 303 out of 328 churners (92.4% recall). It sends 1,659 total offers (303 true positives + 1,356 false positives) at a cost of $8,295. The 303 saved churners are worth $54,540; the 25 missed churners cost $4,500. Net profit: $54,540 - $8,295 - $4,500 = $41,745.


The Four-Strategy Summary

strategies = {
    'Baseline (t=0.50)': result_1,
    'class_weight (t=0.50)': result_2,
    'SMOTE (t=0.50)': result_3,
    f'Threshold tuned (t={optimal_threshold:.3f})': result_4,
}

print(f"\n{'Strategy':<30} {'TP':>4} {'FP':>5} {'FN':>4} "
      f"{'Prec':>6} {'Recall':>7} {'F1':>6} {'Profit':>9}")
print("-" * 78)

for name, r in strategies.items():
    tp, fp, fn = r['TP'], r['FP'], r['FN']
    prec = tp / (tp + fp) if (tp + fp) > 0 else 0
    rec = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = 2 * prec * rec / (prec + rec) if (prec + rec) > 0 else 0
    print(f"{name:<30} {tp:>4} {fp:>5} {fn:>4} "
          f"{prec:>6.3f} {rec:>7.3f} {f1:>6.3f} ${r['net']:>8,}")
Strategy                         TP    FP   FN   Prec  Recall     F1    Profit
------------------------------------------------------------------------------
Baseline (t=0.50)               118    74  210  0.615   0.360  0.454 $ -17,520
class_weight (t=0.50)           195   362  133  0.350   0.594  0.441 $  11,325
SMOTE (t=0.50)                  197   386  131  0.338   0.601  0.432 $  11,675
Threshold tuned (t=0.033)       303  1356   25  0.183   0.924  0.305 $  41,745

What the Team Learned

1. Threshold Tuning Dominates

Threshold tuning produced more than three times the profit of SMOTE and class weights, despite having the worst precision, the worst F1, and the same AUC-PR as the baseline. It succeeded because it directly optimized for the business cost structure (FN=$180, FP=$5) rather than for a statistical metric that treats both error types equally.

2. F1 Is the Wrong Metric for This Problem

The baseline model has the highest F1 (0.454). It also loses money. F1 gives equal weight to precision and recall, but the business gives 36x more weight to recall (missing a churner) than to precision (wasting an offer). Optimizing for F1 produces a model that is too conservative for this cost structure.

3. Resampling and Class Weights Are Similar

SMOTE and class_weight='balanced' produced nearly identical results. Both improve recall by making the model take the minority class more seriously, but neither directly incorporates the business cost ratio. They both implicitly assume the cost ratio equals the imbalance ratio (11:1), which undersells the true cost ratio (36:1).

4. The Model Was Never the Problem

All four strategies used the same underlying model (or close variants). The ranking quality (AUC-PR) barely changed: 0.451 to 0.468 across all strategies. The model was always producing reasonable probability estimates. The problem was the decision rule --- converting probabilities into actions. Threshold tuning fixes the decision rule without changing the model.

5. Low Precision Can Be Profitable

The threshold-tuned model has 18.3% precision. Out of every 100 subscribers flagged, only 18 are actually going to churn. The VP of Product finds this hard to accept: "We are wasting offers on 82 out of 100 people." The team's response: "Each offer costs $5. Each saved churner is worth $180. We spend $500 to contact all 100 people and recover $3,240 from the 18 real churners. That is a 6.5x return on investment."

The Conversation That Changed Everything --- The VP of Product initially resisted the threshold-tuned model. "18% precision means we are sending retention offers to people who were never going to leave. That seems wasteful." The data science lead reframed: "Think of it as a marketing campaign with an 18% conversion rate and a $5 cost per contact. In what world does your marketing team reject a campaign with 18% conversion at $5 per impression?" The VP paused. "That is actually our best-performing campaign."


The Profit Curve

The team plotted profit vs. threshold to understand the full landscape.

# Profit curve across all thresholds
thresholds_plot = np.linspace(0.01, 0.80, 200)
profits_plot = []
for t in thresholds_plot:
    pred_t = (proba_baseline >= t).astype(int)
    r = compute_profit(y_test, pred_t)
    profits_plot.append(r['net'])

print("Profit vs. Threshold (selected points):")
print(f"  {'Threshold':>10} {'Profit':>10} {'Recall':>8}")
print("  " + "-" * 32)
for t_val in [0.01, 0.03, 0.05, 0.10, 0.20, 0.30, 0.50, 0.70]:
    # Evaluate profit and recall at the exact threshold, not the nearest grid point
    pred_t = (proba_baseline >= t_val).astype(int)
    r = compute_profit(y_test, pred_t)
    rec = recall_score(y_test, pred_t)
    print(f"  {t_val:>10.2f} ${r['net']:>9,} {rec:>8.3f}")
Profit vs. Threshold (selected points):
   Threshold     Profit   Recall
  --------------------------------
       0.01 $   22,465    0.988
       0.03 $   39,730    0.930
       0.05 $   38,200    0.878
       0.10 $   30,180    0.741
       0.20 $   19,140    0.582
       0.30 $    8,010    0.467
       0.50 $  -17,520    0.360
       0.70 $  -36,290    0.219

The profit curve peaks near threshold 0.03, with a broad plateau between 0.02 and 0.07 where the model generates $30K-40K in net savings. Below 0.02, the model flags too many people and offer costs eat into savings. Above 0.10, missed churners dominate. The default threshold of 0.50 is deep in negative territory.
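The printed table samples the curve; the full landscape is easier to see as a figure. A minimal matplotlib sketch (synthetic labels and scores stand in for the StreamFlow test set, since this snippet is self-contained):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")          # headless backend for scripted use
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
y = (rng.random(4_000) < 0.082).astype(int)
# Stand-in scores: churners tend to score higher, mimicking a decent ranker
proba = np.clip(rng.normal(0.10 + 0.25 * y, 0.12), 0, 1)

def net_profit(y, proba, t, fn_cost=180, fp_cost=5):
    """Net savings at threshold t under the case study's cost structure."""
    pred = proba >= t
    tp = int(np.sum(pred & (y == 1)))
    fp = int(np.sum(pred & (y == 0)))
    fn = int(np.sum(~pred & (y == 1)))
    return tp * fn_cost - (tp + fp) * fp_cost - fn * fn_cost

ts = np.linspace(0.01, 0.80, 200)
profits = [net_profit(y, proba, t) for t in ts]

plt.plot(ts, profits)
plt.axhline(0, color="gray", lw=0.5)
plt.xlabel("Decision threshold")
plt.ylabel("Net profit ($)")
plt.title("Profit vs. threshold")
plt.savefig("profit_curve.png", dpi=120)
```

The shape matters more than the exact numbers: a single peak at a low threshold, a plateau around it, and steep losses as the threshold approaches the 0.50 default.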

Robustness of the Optimal Threshold --- The plateau between 0.02 and 0.07 is good news. It means profit is not sensitive to the exact threshold: any value in this range produces strong results. In production, the team set the threshold to 0.04 (slightly above the 0.033 optimum) to provide a buffer against probability-calibration drift.


Implementation Notes

The team deployed the threshold-tuned model with the following safeguards:

  1. Monthly recalibration check: Compare predicted churn probabilities to observed churn rates in decile bins. If calibration degrades, retune the threshold.

  2. Budget cap: The finance team set a monthly offer budget of $10,000. At $5 per offer, that is 2,000 offers. The team adjusts the threshold monthly to stay within budget while maximizing the number of true churners reached.

  3. A/B test: 50% of flagged subscribers receive the retention offer; 50% serve as a control group. This measures the causal effect of the intervention, not just the model's predictive accuracy.

  4. Subgroup monitoring: The team tracks recall by plan tier (Basic, Standard, Premium) and by tenure bucket (<6 months, 6-24 months, >24 months) to ensure no segment is systematically under-served.
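Safeguard 1, the decile calibration check, can be sketched in a few lines of pandas. The probabilities and outcomes below are synthetic stand-ins, and the retune rule is illustrative; the idea is to compare the mean predicted churn probability to the observed churn rate within each predicted-probability decile:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
proba = rng.beta(1, 10, size=20_000)                 # stand-in predicted churn probs
observed = (rng.random(20_000) < proba).astype(int)  # stand-in observed outcomes

df = pd.DataFrame({"proba": proba, "churned": observed})
df["decile"] = pd.qcut(df["proba"], 10, labels=False, duplicates="drop")

# Mean predicted probability vs. observed churn rate, per decile
calib = df.groupby("decile").agg(
    predicted=("proba", "mean"),
    observed=("churned", "mean"),
    n=("churned", "size"),
)
calib["gap"] = calib["observed"] - calib["predicted"]
print(calib.round(3))
# If |gap| grows month over month (especially in the top deciles), retune the threshold.
```

On well-calibrated scores the gap column hovers near zero; a systematic drift in one direction is the signal to retune.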


Key Takeaways from This Case Study

  1. A model that is accurate can lose money. The baseline model at threshold 0.50 had 92.9% accuracy and lost $17,520.

  2. Threshold tuning is the first and best tool for asymmetric costs. It requires no retraining and directly optimizes the business objective.

  3. SMOTE and class weights help, but they do not solve the fundamental problem. They improve recall by roughly 24 percentage points but leave the threshold at 0.50, which still underweights the cost asymmetry.

  4. The optimal threshold may look absurdly low. A threshold of 0.033 sounds wrong until you compute the economics. Low-cost interventions with high-cost failures demand aggressive thresholds.

  5. Present results in business terms, not ML terms. "92% recall at 18% precision" is confusing. "$41,745 in net savings at a 6.5x return on offer spend" starts a productive conversation.


This case study supports Chapter 17: Class Imbalance and Cost-Sensitive Learning. Return to the chapter or continue to Case Study 2: TurbineTech.