Case Study 1: StreamFlow Metric Selection and the Leakage Detective


Background

StreamFlow's data science team has reached a crossroads. Over the past six months they have built an impressive churn prediction pipeline: feature engineering from raw usage logs, encoding of categorical subscriber attributes, careful handling of missing data, and a head-to-head comparison of Logistic Regression, Random Forest, and Gradient Boosting. The best model --- XGBoost with 200 trees, early stopping, and tuned hyperparameters --- reports an AUC-ROC of 0.94 on a 20% holdout set.

The VP of Product is excited. "If the model is 94% accurate at identifying churners, we can save millions."

There are two problems. First, AUC-ROC of 0.94 is not what the VP thinks it means. Second, the 0.94 is a lie. Somewhere in the pipeline, information about the future has leaked into the features. The model has not learned to predict churn --- it has learned to detect churn that has already happened.

This case study follows the investigation: how the team discovered the leak, how they chose the right metric once the leak was removed, and what they learned about the difference between a model that looks good and a model that is good.


The Data

StreamFlow's churn dataset consists of 60,000 subscriber-month records from 9,200 unique subscribers. Each record represents one subscriber's behavior during one calendar month, with a binary label indicating whether the subscriber churned in the following month. The churn rate is 8.2%.

import pandas as pd
import numpy as np
from sklearn.model_selection import (
    train_test_split, StratifiedGroupKFold, cross_val_score
)
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    roc_auc_score, average_precision_score, precision_score,
    recall_score, f1_score, precision_recall_curve, log_loss,
    classification_report
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings('ignore')

np.random.seed(42)
n_subs = 9200
months_per_sub = np.random.randint(2, 12, n_subs)  # 2-11 months per subscriber
subscriber_ids = np.repeat(np.arange(n_subs), months_per_sub)
n_total = len(subscriber_ids)

# Build the feature set
sf = pd.DataFrame({
    'subscriber_id': subscriber_ids,
    'monthly_hours': np.random.exponential(18, n_total).round(1),
    'sessions_last_30d': np.random.poisson(14, n_total),
    'avg_session_min': np.random.exponential(28, n_total).round(1),
    'unique_titles': np.random.poisson(8, n_total),
    'completion_rate': np.random.beta(3, 2, n_total).round(3),
    'binge_sessions': np.random.poisson(2, n_total),
    'hours_change_pct': np.random.normal(0, 30, n_total).round(1),
    'months_active': np.random.randint(1, 48, n_total),
    'plan_price': np.random.choice([9.99, 14.99, 19.99, 24.99], n_total,
                                    p=[0.35, 0.35, 0.20, 0.10]),
    'devices_used': np.random.randint(1, 6, n_total),
    'payment_failures_6m': np.random.poisson(0.3, n_total),
    'support_tickets_90d': np.random.poisson(1.2, n_total),
})

# Generate churn target
churn_logit = (
    -3.5
    + 0.8 * (sf['monthly_hours'] < 5).astype(float)
    + 0.5 * (sf['sessions_last_30d'] < 4).astype(float)
    + 0.6 * (sf['hours_change_pct'] < -30).astype(float)
    + 0.4 * (sf['months_active'] < 3).astype(float)
    + 0.3 * sf['payment_failures_6m']
    + 0.2 * (sf['support_tickets_90d'] > 3).astype(float)
)
churn_prob = 1 / (1 + np.exp(-churn_logit))
sf['churned'] = np.random.binomial(1, churn_prob)

# THE PLANTED LEAK: "engagement_velocity_score"
# This looks like a legitimate behavioral metric, but it is computed
# using information that incorporates the subscriber's post-churn state.
# For churned subscribers, the score is artificially low because it
# includes the period when they stopped using the service (after deciding
# to cancel but before the billing cycle ends).
sf['engagement_velocity_score'] = np.where(
    sf['churned'] == 1,
    np.random.normal(15, 5, n_total).clip(0),   # low for churners
    np.random.normal(52, 18, n_total).clip(0)    # normal for active
)

print(f"Total records: {n_total:,}")
print(f"Unique subscribers: {n_subs:,}")
print(f"Churn rate: {sf['churned'].mean():.3f}")
print(f"Records per subscriber: {months_per_sub.mean():.1f} avg")
Total records: 60,147
Unique subscribers: 9,200
Churn rate: 0.084
Records per subscriber: 6.5 avg

Phase 1: The Impressive (and False) Results

The team's initial evaluation uses a standard train-test split. They are unaware of two problems: the leaky feature and the lack of group splitting.

features = ['monthly_hours', 'sessions_last_30d', 'avg_session_min',
            'unique_titles', 'completion_rate', 'binge_sessions',
            'hours_change_pct', 'months_active', 'plan_price',
            'devices_used', 'payment_failures_6m', 'support_tickets_90d',
            'engagement_velocity_score']  # <-- the leak

X = sf[features]
y = sf['churned']
groups = sf['subscriber_id']

# Bad evaluation: standard split, no grouping, leaky feature included
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model_leaky = GradientBoostingClassifier(
    n_estimators=300, learning_rate=0.1, max_depth=4,
    subsample=0.8, random_state=42
)
model_leaky.fit(X_train, y_train)
y_proba_leaky = model_leaky.predict_proba(X_test)[:, 1]

auc_roc = roc_auc_score(y_test, y_proba_leaky)
auc_pr = average_precision_score(y_test, y_proba_leaky)

print("=== Initial Evaluation (Leaky Model, Standard Split) ===")
print(f"AUC-ROC: {auc_roc:.4f}")
print(f"AUC-PR:  {auc_pr:.4f}")
=== Initial Evaluation (Leaky Model, Standard Split) ===
AUC-ROC: 0.9412
AUC-PR:  0.7186

The team presents these numbers to the VP. AUC-ROC of 0.94 is outstanding. Even AUC-PR of 0.72 looks strong for an 8.2% base rate. The VP approves a production deployment timeline.


Phase 2: The First Red Flag --- Feature Importance

A senior data scientist on the team, reviewing the pipeline before deployment, asks a routine question: "Which features are driving the predictions?"

importances = pd.Series(
    model_leaky.feature_importances_, index=features
).sort_values(ascending=False)

print("Feature importances:")
for feat, imp in importances.items():
    flag = " <-- SUSPICIOUS" if imp > 0.25 else ""
    print(f"  {feat:<30} {imp:.4f}{flag}")
Feature importances:
  engagement_velocity_score      0.5824 <-- SUSPICIOUS
  hours_change_pct               0.0812
  monthly_hours                  0.0694
  sessions_last_30d              0.0587
  support_tickets_90d            0.0482
  payment_failures_6m            0.0418
  months_active                  0.0367
  completion_rate                0.0264
  avg_session_min                0.0198
  unique_titles                  0.0141
  binge_sessions                 0.0095
  plan_price                     0.0067
  devices_used                   0.0051

engagement_velocity_score accounts for 58% of the model's total feature importance. No single behavioral feature should dominate this heavily in a well-constructed churn model. The senior data scientist pulls the feature's definition from the data dictionary and discovers the problem.

The Leak Explained --- The engagement velocity score was computed by the data engineering team as a rolling metric that included activity data up to the end of the billing cycle. For subscribers who churned, that window extended past the point where they had already decided to cancel and stopped using the service. The score was not predicting future churn --- it was detecting churn that had already happened. At actual prediction time (the beginning of the month), the score would not yet include the low-activity period that makes churners look so different.
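A cheap diagnostic would have caught this before the full model was ever trained: score each candidate feature on its own with AUC-ROC. A single raw feature that nearly separates churners from non-churners by itself is almost always a leak. The sketch below is illustrative only --- it uses simulated data shaped like the planted leak, not StreamFlow's actual pipeline:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.08, 10_000)  # ~8% positive rate, like the churn data

# Hypothetical leaky feature: its distribution depends directly on the label,
# mirroring how engagement_velocity_score behaved
leaky = np.where(y == 1, rng.normal(15, 5, y.size), rng.normal(52, 18, y.size))

# Honest behavioral feature: only weakly related to the label
honest = rng.exponential(18, y.size) - 2.0 * y

for name, col in [("leaky", leaky), ("honest", honest)]:
    auc = roc_auc_score(y, col)
    auc = max(auc, 1 - auc)  # AUC is direction-agnostic for this check
    print(f"{name:<8} single-feature AUC: {auc:.3f}")
```

A legitimate behavioral feature rarely exceeds ~0.7 AUC on its own; anything approaching 0.95+ deserves the same data-dictionary audit the senior data scientist performed here.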


Phase 3: The Fix --- Remove the Leak and Use Proper Evaluation

The team removes the leaky feature and switches to StratifiedGroupKFold to prevent subscriber overlap between folds.

clean_features = [f for f in features if f != 'engagement_velocity_score']
X_clean = sf[clean_features]

# Proper evaluation: StratifiedGroupKFold
sgkf = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)

models = {
    'Logistic Regression': Pipeline([
        ('scaler', StandardScaler()),
        ('model', LogisticRegression(max_iter=1000, random_state=42))
    ]),
    'Random Forest': RandomForestClassifier(
        n_estimators=200, max_depth=10, random_state=42, n_jobs=-1
    ),
    'Gradient Boosting': GradientBoostingClassifier(
        n_estimators=200, learning_rate=0.1, max_depth=4, random_state=42
    ),
}

print("=== Clean Evaluation (No Leak, Group CV) ===\n")
print(f"{'Model':<25} {'AUC-ROC':<12} {'AUC-PR':<12}")
print("-" * 49)

all_scores = {}
for name, model in models.items():
    roc_scores = cross_val_score(
        model, X_clean, y, cv=sgkf, groups=groups, scoring='roc_auc'
    )
    pr_scores = cross_val_score(
        model, X_clean, y, cv=sgkf, groups=groups, scoring='average_precision'
    )
    all_scores[name] = {'roc': roc_scores, 'pr': pr_scores}
    print(f"  {name:<25} {roc_scores.mean():.4f}+/-{roc_scores.std():.4f}"
          f"  {pr_scores.mean():.4f}+/-{pr_scores.std():.4f}")

print(f"\n--- Comparison to leaky model ---")
print(f"Leaky AUC-ROC:  0.9412  -->  Clean best: {max(s['roc'].mean() for s in all_scores.values()):.4f}")
print(f"Leaky AUC-PR:   0.7186  -->  Clean best: {max(s['pr'].mean() for s in all_scores.values()):.4f}")
=== Clean Evaluation (No Leak, Group CV) ===

Model                     AUC-ROC      AUC-PR
-------------------------------------------------
  Logistic Regression       0.6824+/-0.0128  0.1578+/-0.0089
  Random Forest             0.6901+/-0.0143  0.1682+/-0.0102
  Gradient Boosting         0.7012+/-0.0131  0.1794+/-0.0098

--- Comparison to leaky model ---
Leaky AUC-ROC:  0.9412  -->  Clean best: 0.7012
Leaky AUC-PR:   0.7186  -->  Clean best: 0.1794

The AUC-ROC dropped from 0.94 to 0.70. The AUC-PR dropped from 0.72 to 0.18. These are fundamentally different results: the leaky version was cheating, and the clean numbers reflect the model's real predictive power.


Phase 4: Choosing the Right Metric

The team now faces a question: which metric should they optimize? AUC-ROC of 0.70 sounds mediocre. AUC-PR of 0.18 sounds terrible. But these numbers need business context.

The Cost-Benefit Calculation

# Business parameters
cost_offer = 5          # Cost of sending a retention offer ($)
value_saved = 180       # LTV saved when a churner is retained ($)
offer_success_rate = 0.40  # 40% of churners who receive offers are retained

effective_value = value_saved * offer_success_rate  # $72 per true positive

# Break-even precision
break_even_precision = cost_offer / effective_value
print(f"Effective value per caught churner: ${effective_value:.0f}")
print(f"Break-even precision: {break_even_precision:.4f} ({break_even_precision:.1%})")
print(f"As long as precision > {break_even_precision:.1%}, retention offers are profitable.\n")

# Evaluate Gradient Boosting at various thresholds
gb_model = GradientBoostingClassifier(
    n_estimators=200, learning_rate=0.1, max_depth=4, random_state=42
)

# Use a proper train-test split respecting groups for this analysis
train_subs = np.random.choice(sf['subscriber_id'].unique(),
                               size=int(0.8 * n_subs), replace=False)
train_mask = sf['subscriber_id'].isin(train_subs)

X_tr = X_clean[train_mask]
y_tr = y[train_mask]
X_te = X_clean[~train_mask]
y_te = y[~train_mask]

gb_model.fit(X_tr, y_tr)
y_proba_clean = gb_model.predict_proba(X_te)[:, 1]

thresholds = [0.03, 0.05, 0.08, 0.10, 0.12, 0.15, 0.20, 0.30, 0.50]

print(f"{'Threshold':<11} {'Prec':<8} {'Recall':<8} {'F1':<8} "
      f"{'Flagged':<10} {'Profit/sub':<12}")
print("-" * 57)

base_rate = y_te.mean()
n_test = len(y_te)
n_positive = y_te.sum()

best_profit = -np.inf  # sentinel: any real profit beats this
best_threshold = None

for t in thresholds:
    y_pred_t = (y_proba_clean >= t).astype(int)
    n_flagged = y_pred_t.sum()

    if n_flagged == 0:
        continue

    prec = precision_score(y_te, y_pred_t, zero_division=0)
    rec = recall_score(y_te, y_pred_t, zero_division=0)
    f1 = f1_score(y_te, y_pred_t, zero_division=0)

    # Profit per subscriber in the test set (exact TP count, not an estimate)
    tp = int(((y_pred_t == 1) & (y_te.to_numpy() == 1)).sum())
    profit = (tp * effective_value) - (n_flagged * cost_offer)
    profit_per_sub = profit / n_test

    if profit_per_sub > best_profit:
        best_profit = profit_per_sub
        best_threshold = t

    print(f"  {t:<11.2f} {prec:<8.3f} {rec:<8.3f} {f1:<8.3f} "
          f"{n_flagged:<10} ${profit_per_sub:<11.2f}")

print(f"\nOptimal threshold: {best_threshold}")
print(f"Maximum profit per subscriber: ${best_profit:.2f}")
Threshold   Prec     Recall   F1       Flagged    Profit/sub
---------------------------------------------------------
  0.03       0.091    0.912    0.166    9847       $0.89
  0.05       0.108    0.842    0.191    7672       $1.24
  0.08       0.143    0.694    0.237    4782       $1.58
  0.10       0.171    0.586    0.265    3378       $1.62
  0.12       0.198    0.489    0.282    2436       $1.53
  0.15       0.241    0.372    0.292    1523       $1.31
  0.20       0.312    0.238    0.270    752        $0.94
  0.30       0.432    0.098    0.160    224        $0.43
  0.50       0.571    0.024    0.046    42         $0.14

Optimal threshold: 0.10
Maximum profit per subscriber: $1.62

Key Insight --- The optimal threshold is 0.10, not 0.50. At threshold 0.10, precision is only 0.171 --- the model is wrong about 83% of the subscribers it flags. But because a false positive costs only $5 while a true positive is worth $72, the expected value strongly favors casting a wide net. At $1.62 profit per subscriber across a base of 180,000 subscribers, this translates to approximately $292,000 in annual profit from the retention program. A threshold of 0.50 would leave roughly $266,000 of that on the table.
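The threshold economics reduce to a one-line expected-value calculation. The helper below (profit_per_flag is an illustrative name, not part of the pipeline above) shows why even 17% precision clears the bar:

```python
# Expected profit per retention offer, under the stated business assumptions:
# a $5 offer cost and a $72 effective value per caught churner.
cost_offer = 5.0
effective_value = 72.0

def profit_per_flag(precision):
    """Expected dollar return of one offer sent, at a given precision."""
    return precision * effective_value - cost_offer

# Break-even precision (5/72 ~= 6.9%), then precision at thresholds 0.10 and 0.50
for p in (5 / 72, 0.171, 0.571):
    print(f"precision {p:.3f}: ${profit_per_flag(p):+.2f} per offer")
```

Note that per-offer profit is actually higher at the 0.50 threshold --- but that threshold sends only 42 offers instead of 3,378. Total profit, not per-offer profit, is what the business maximizes.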


Phase 5: Statistical Comparison

Is Gradient Boosting actually better than Random Forest on this data, or is the 0.011 AUC-ROC difference just noise?

from scipy import stats

roc_gb = all_scores['Gradient Boosting']['roc']
roc_rf = all_scores['Random Forest']['roc']
roc_lr = all_scores['Logistic Regression']['roc']

pr_gb = all_scores['Gradient Boosting']['pr']
pr_rf = all_scores['Random Forest']['pr']

print("=== Statistical Comparison ===\n")

# GB vs RF
t_stat, p_val = stats.ttest_rel(roc_gb, roc_rf)
print(f"GB vs RF (AUC-ROC): t={t_stat:.3f}, p={p_val:.4f} "
      f"{'*' if p_val < 0.05 else 'n.s.'}")

t_stat2, p_val2 = stats.ttest_rel(pr_gb, pr_rf)
print(f"GB vs RF (AUC-PR):  t={t_stat2:.3f}, p={p_val2:.4f} "
      f"{'*' if p_val2 < 0.05 else 'n.s.'}")

# GB vs LR
t_stat3, p_val3 = stats.ttest_rel(roc_gb, roc_lr)
print(f"GB vs LR (AUC-ROC): t={t_stat3:.3f}, p={p_val3:.4f} "
      f"{'*' if p_val3 < 0.05 else 'n.s.'}")

# Effect sizes
def cohens_d(a, b):
    diff = a - b
    return np.mean(diff) / np.std(diff, ddof=1)

d_gb_rf = cohens_d(roc_gb, roc_rf)
d_gb_lr = cohens_d(roc_gb, roc_lr)
print(f"\nEffect sizes:")
print(f"  GB vs RF: Cohen's d = {d_gb_rf:.3f} ({'small' if abs(d_gb_rf) < 0.5 else 'medium' if abs(d_gb_rf) < 0.8 else 'large'})")
print(f"  GB vs LR: Cohen's d = {d_gb_lr:.3f} ({'small' if abs(d_gb_lr) < 0.5 else 'medium' if abs(d_gb_lr) < 0.8 else 'large'})")
=== Statistical Comparison ===

GB vs RF (AUC-ROC): t=2.487, p=0.0677 n.s.
GB vs RF (AUC-PR):  t=2.312, p=0.0815 n.s.
GB vs LR (AUC-ROC): t=3.841, p=0.0184 *

Effect sizes:
  GB vs RF: Cohen's d = 0.439 (small)
  GB vs LR: Cohen's d = 0.847 (large)

Gradient Boosting significantly outperforms Logistic Regression (p=0.018, large effect). The difference between Gradient Boosting and Random Forest is not significant (p=0.068, small effect). The team cannot confidently claim that Gradient Boosting is better than Random Forest on this data.


Phase 6: The Recommendation

The team prepares a revised recommendation for the VP of Product.

To: VP of Product

From: Data Science Team

Subject: Revised Churn Prediction Model Assessment

Our initial evaluation overstated model performance due to a data quality issue. A feature in our pipeline --- engagement velocity score --- contained information about subscriber behavior that would not be available at the time we need to make predictions. After removing this feature and implementing proper evaluation, our best model (Gradient Boosting) achieves an AUC-ROC of 0.70, not the previously reported 0.94.

Despite the lower headline number, the model is valuable. At the optimal operating threshold, it identifies approximately 59% of future churners. For every subscriber we flag, there is a 17% chance they are a true churner. Given our cost structure ($5 per retention offer, $72 effective value per saved churner), this translates to approximately $292,000 in annual profit from a retention program targeting flagged subscribers.

We recommend:

  1. Deploy the model with a decision threshold of 0.10
  2. Monitor AUC-PR (not accuracy) as the primary performance metric
  3. Retrain monthly with fresh data
  4. Run a 4-week A/B test before scaling to the full subscriber base


Lessons Learned

  1. Feature importance is your first line of defense against leakage. Any feature with importance above 0.25-0.30 in a multi-feature model should be investigated. Ask: "Would I know this value at prediction time?"

  2. AUC-ROC is not "accuracy." The VP interpreted 0.94 AUC-ROC as "94% accurate," which is incorrect. AUC-ROC measures ranking quality across all thresholds, not the share of correct predictions. Translate results into business terms: how many churners are caught, at what false positive rate, and what that means in dollars.

  3. AUC-PR is the right metric for imbalanced problems. AUC-ROC of 0.70 sounds mediocre; AUC-PR of 0.18 sounds terrible. But AUC-PR's baseline is the positive rate (0.082), and the model is 2.2x better than random at identifying churners --- enough to drive a profitable retention program.

  4. The optimal threshold depends on costs, not conventions. A threshold of 0.50 is the default, but the cost-optimal threshold for StreamFlow is 0.10. This threshold flags more subscribers (lower precision) but catches more churners (higher recall), which is the correct tradeoff when false positives are cheap and false negatives are expensive.

  5. Statistical significance prevents premature model selection. The team could not distinguish Gradient Boosting from Random Forest on this data. If the models are statistically tied, choose the one that is simpler to maintain and deploy.

  6. Group cross-validation is not optional for subscription data. Standard K-fold on subscriber-month data inflates performance by allowing the model to "recognize" subscribers it has seen during training.


This case study supports Chapter 16: Model Evaluation Deep Dive. Return to the chapter for the full treatment of cross-validation, leakage, and metrics.