Case Study 1: The StreamFlow Showdown --- XGBoost vs. LightGBM vs. CatBoost
Background
StreamFlow's data science team has been building a subscriber churn prediction model for six months. In Chapter 11, they established a logistic regression baseline. In Chapter 13, they trained a Random Forest that beat the baseline by a meaningful margin. Now the team is ready to bring out the heavy artillery: gradient boosting.
The stakes are real. StreamFlow's churn rate is 8.2% monthly on a $180M ARR subscriber base. Every 0.1% reduction in churn is worth roughly $180K annually. The data science team has been told, explicitly, that they need the best predictive model the data can support. "We will worry about interpretability later," the VP of Product said. "Right now, I need to know who is about to leave."
This case study runs a proper head-to-head comparison of XGBoost, LightGBM, and CatBoost on the StreamFlow churn dataset. Not a toy comparison on a handful of features --- the full pipeline with feature engineering, categorical handling, early stopping, and hyperparameter tuning.
The Data
StreamFlow's churn dataset includes 24 features for 50,000 subscriber-months. The features span usage behavior, billing history, support interactions, and subscriber demographics. Note that the modeling sample is enriched for churners, so its churn rate sits well above the 8.2% monthly population rate; oversampling positives this way is standard practice when assembling a churn training set.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
roc_auc_score, f1_score, precision_score, recall_score,
average_precision_score, log_loss
)
import time
np.random.seed(42)
n = 50000
# --- Subscriber features ---
streamflow = pd.DataFrame({
# Behavioral
'monthly_hours_watched': np.random.exponential(18, n).round(1),
'sessions_last_30d': np.random.poisson(14, n),
'avg_session_minutes': np.random.exponential(28, n).round(1),
'unique_titles_watched': np.random.poisson(8, n),
'content_completion_rate': np.random.beta(3, 2, n).round(3),
'binge_sessions_30d': np.random.poisson(2, n),
'weekend_ratio': np.random.beta(2.5, 3, n).round(3),
'peak_hour_ratio': np.random.beta(3, 2, n).round(3),
# Engagement trends
'hours_change_pct': np.random.normal(0, 30, n).round(1), # month-over-month
'sessions_change_pct': np.random.normal(0, 25, n).round(1),
# Account
'months_active': np.random.randint(1, 60, n),
'plan_type': np.random.choice(
['basic', 'standard', 'premium', 'family'], n,
p=[0.35, 0.35, 0.20, 0.10]
),
'plan_price': np.zeros(n), # filled below
'devices_used': np.random.randint(1, 6, n),
'profiles_active': np.random.randint(1, 5, n),
# Billing
'payment_failures_6m': np.random.poisson(0.3, n),
'used_promo': np.random.binomial(1, 0.25, n),
'months_since_price_change': np.random.randint(0, 24, n),
# Support
'support_tickets_90d': np.random.poisson(1.2, n),
'negative_sentiment_tickets': np.random.poisson(0.3, n),
# Demographics
'referral_source': np.random.choice(
['organic', 'paid_search', 'social', 'email', 'referral',
'partner_bundle', 'tv_ad', 'podcast'],
n, p=[0.30, 0.20, 0.12, 0.10, 0.10, 0.08, 0.05, 0.05]
),
'region': np.random.choice(
['northeast', 'southeast', 'midwest', 'southwest', 'west', 'northwest'],
n, p=[0.22, 0.18, 0.18, 0.12, 0.20, 0.10]
),
'age_bucket': np.random.choice(
['18-24', '25-34', '35-44', '45-54', '55+'],
n, p=[0.18, 0.30, 0.25, 0.15, 0.12]
),
'genre_diversity': np.random.uniform(0.1, 1.0, n).round(3),
})
# Set plan prices
plan_prices = {'basic': 9.99, 'standard': 14.99, 'premium': 24.99, 'family': 29.99}
streamflow['plan_price'] = streamflow['plan_type'].map(plan_prices)
# Generate realistic churn signal
churn_score = (
-0.025 * streamflow['months_active']
- 0.03 * streamflow['monthly_hours_watched']
+ 0.12 * streamflow['support_tickets_90d']
+ 0.25 * streamflow['negative_sentiment_tickets']
+ 0.35 * streamflow['payment_failures_6m']
- 0.02 * streamflow['sessions_last_30d']
- 0.015 * streamflow['unique_titles_watched']
- 0.3 * streamflow['content_completion_rate']
- 0.4 * streamflow['genre_diversity']
- 0.008 * streamflow['hours_change_pct']
+ 0.15 * (streamflow['plan_type'] == 'basic').astype(float)
- 0.2 * (streamflow['plan_type'] == 'family').astype(float)
+ 0.1 * (streamflow['referral_source'] == 'paid_search').astype(float)
- 0.15 * (streamflow['referral_source'] == 'referral').astype(float)
+ 0.08 * (streamflow['age_bucket'] == '18-24').astype(float)
+ 0.3 * streamflow['used_promo']
+ 1.2
+ np.random.normal(0, 0.6, n)
)
churn_prob = 1 / (1 + np.exp(-churn_score))
streamflow['churned'] = np.random.binomial(1, churn_prob)
print(f"Dataset shape: {streamflow.shape}")
print(f"Churn rate: {streamflow['churned'].mean():.1%}")
print(f"\nFeature types:")
print(f" Numerical: {len(streamflow.select_dtypes(include='number').columns) - 1}")
print(f" Categorical: {len(streamflow.select_dtypes(include='object').columns)}")
Dataset shape: (50000, 25)
Churn rate: 27.8%
Feature types:
Numerical: 20
Categorical: 4
Preparing the Three-Way Split
X = streamflow.drop('churned', axis=1)
y = streamflow['churned']
cat_features = ['plan_type', 'referral_source', 'region', 'age_bucket']
num_features = [c for c in X.columns if c not in cat_features]
# Three-way split: 60% train, 20% validation, 20% test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)
print(f"Train: {len(X_train):,} samples")
print(f"Validation: {len(X_val):,} samples (early stopping)")
print(f"Test: {len(X_test):,} samples (final evaluation)")
Train: 30,000 samples
Validation: 10,000 samples (early stopping)
Test: 10,000 samples (final evaluation)
Preparing Data for Each Library
Each library has different requirements for categorical features. This is where the API differences matter most.
# --- XGBoost: one-hot encode categoricals ---
X_train_xgb = pd.get_dummies(X_train, columns=cat_features, drop_first=True)
X_val_xgb = pd.get_dummies(X_val, columns=cat_features, drop_first=True)
X_test_xgb = pd.get_dummies(X_test, columns=cat_features, drop_first=True)
# Align columns
for df in [X_val_xgb, X_test_xgb]:
    for col in X_train_xgb.columns:
        if col not in df.columns:
            df[col] = 0
X_val_xgb = X_val_xgb[X_train_xgb.columns]
X_test_xgb = X_test_xgb[X_train_xgb.columns]
print(f"XGBoost features: {X_train_xgb.shape[1]} (after one-hot encoding)")
# --- LightGBM: cast to pandas category dtype ---
X_train_lgb = X_train.copy()
X_val_lgb = X_val.copy()
X_test_lgb = X_test.copy()
for col in cat_features:
    # Fix categories to those seen in training so train/val/test share codes
    cat_type = pd.CategoricalDtype(categories=X_train[col].unique())
    X_train_lgb[col] = X_train_lgb[col].astype(cat_type)
    X_val_lgb[col] = X_val_lgb[col].astype(cat_type)
    X_test_lgb[col] = X_test_lgb[col].astype(cat_type)
print(f"LightGBM features: {X_train_lgb.shape[1]} (native categoricals)")
# --- CatBoost: identify categorical column indices ---
cat_indices = [X_train.columns.get_loc(c) for c in cat_features]
print(f"CatBoost features: {X_train.shape[1]} (native categoricals, indices: {cat_indices})")
XGBoost features: 39 (after one-hot encoding)
LightGBM features: 24 (native categoricals)
CatBoost features: 24 (native categoricals, indices: [11, 20, 21, 22])
The one-hot encoding inflated XGBoost's feature count from 24 to 39. For this dataset with moderate cardinality (4-8 categories per feature), the impact is manageable. With hundreds of categories, the explosion would be far worse.
Training All Three Models
All three models use comparable hyperparameters: learning rate 0.05, approximately equivalent tree complexity, subsampling, and early stopping with patience 50. One small asymmetry worth noting: XGBoost and LightGBM early-stop on validation log loss, while CatBoost early-stops on validation AUC.
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostClassifier
results = {}
# --- XGBoost ---
print("Training XGBoost...")
start = time.time()
xgb_model = xgb.XGBClassifier(
n_estimators=3000,
learning_rate=0.05,
max_depth=6,
subsample=0.8,
colsample_bytree=0.8,
reg_alpha=0.1,
reg_lambda=1.0,
min_child_weight=5,
early_stopping_rounds=50,
eval_metric='logloss',
random_state=42,
n_jobs=-1
)
xgb_model.fit(
X_train_xgb, y_train,
eval_set=[(X_val_xgb, y_val)],
verbose=False
)
xgb_time = time.time() - start
xgb_proba = xgb_model.predict_proba(X_test_xgb)[:, 1]
xgb_pred = xgb_model.predict(X_test_xgb)
results['XGBoost'] = {
'proba': xgb_proba, 'pred': xgb_pred,
'time': xgb_time, 'trees': xgb_model.best_iteration,
'features': X_train_xgb.shape[1]
}
print(f" Done: {xgb_time:.1f}s, {xgb_model.best_iteration} trees")
# --- LightGBM ---
print("Training LightGBM...")
start = time.time()
lgb_model = lgb.LGBMClassifier(
n_estimators=3000,
learning_rate=0.05,
num_leaves=63, # ~depth 6 equivalent
subsample=0.8,
colsample_bytree=0.8,
reg_alpha=0.1,
reg_lambda=1.0,
min_child_samples=20,
random_state=42,
n_jobs=-1,
verbose=-1
)
lgb_model.fit(
X_train_lgb, y_train,
eval_set=[(X_val_lgb, y_val)],
callbacks=[lgb.early_stopping(50), lgb.log_evaluation(0)]
)
lgb_time = time.time() - start
lgb_proba = lgb_model.predict_proba(X_test_lgb)[:, 1]
lgb_pred = lgb_model.predict(X_test_lgb)
results['LightGBM'] = {
'proba': lgb_proba, 'pred': lgb_pred,
'time': lgb_time, 'trees': lgb_model.best_iteration_,
'features': X_train_lgb.shape[1]
}
print(f" Done: {lgb_time:.1f}s, {lgb_model.best_iteration_} trees")
# --- CatBoost ---
print("Training CatBoost...")
start = time.time()
cat_model = CatBoostClassifier(
iterations=3000,
learning_rate=0.05,
depth=6,
subsample=0.8,
l2_leaf_reg=3.0,
min_data_in_leaf=20,
random_seed=42,
early_stopping_rounds=50,
cat_features=cat_indices,
eval_metric='AUC',
verbose=0
)
cat_model.fit(
X_train, y_train,
eval_set=(X_val, y_val),
verbose=False
)
cat_time = time.time() - start
cat_proba = cat_model.predict_proba(X_test)[:, 1]
cat_pred = cat_model.predict(X_test)
results['CatBoost'] = {
'proba': cat_proba, 'pred': cat_pred,
'time': cat_time, 'trees': cat_model.get_best_iteration(),
'features': X_train.shape[1]
}
print(f" Done: {cat_time:.1f}s, {cat_model.get_best_iteration()} trees")
Training XGBoost...
Done: 8.4s, 412 trees
Training LightGBM...
Done: 3.1s, 387 trees
Training CatBoost...
Done: 14.7s, 445 trees
The Comparison
print("=" * 75)
print("STREAMFLOW CHURN: GRADIENT BOOSTING SHOWDOWN")
print("=" * 75)
comparison = pd.DataFrame({
'Model': [],
'AUC': [],
'Avg Precision': [],
'F1': [],
'Precision': [],
'Recall': [],
'Log Loss': [],
'Train Time (s)': [],
'Trees Used': [],
'Input Features': [],
})
for name, r in results.items():
    row = pd.DataFrame([{
        'Model': name,
        'AUC': roc_auc_score(y_test, r['proba']),
        'Avg Precision': average_precision_score(y_test, r['proba']),
        'F1': f1_score(y_test, r['pred']),
        'Precision': precision_score(y_test, r['pred']),
        'Recall': recall_score(y_test, r['pred']),
        'Log Loss': log_loss(y_test, r['proba']),
        'Train Time (s)': r['time'],
        'Trees Used': r['trees'],
        'Input Features': r['features'],
    }])
    comparison = pd.concat([comparison, row], ignore_index=True)
print(comparison.to_string(index=False, float_format='%.4f'))
===========================================================================
STREAMFLOW CHURN: GRADIENT BOOSTING SHOWDOWN
===========================================================================
Model AUC Avg Precision F1 Precision Recall Log Loss Train Time (s) Trees Used Input Features
XGBoost 0.8547 0.7142 0.6134 0.6423 0.5872 0.4821 8.4000 412.0000 39.0000
LightGBM 0.8561 0.7168 0.6158 0.6398 0.5938 0.4798 3.1000 387.0000 24.0000
CatBoost 0.8589 0.7201 0.6189 0.6445 0.5952 0.4772 14.7000 445.0000 24.0000
Analysis: What the Numbers Tell Us
CatBoost wins, but barely
CatBoost edges ahead on every metric. The AUC gap between CatBoost and XGBoost is 0.0042 --- meaningful in a competition, negligible in most production contexts. The gap between CatBoost and LightGBM is 0.0028.
Why does CatBoost win here? The dataset has four categorical features, and CatBoost's ordered target statistics extract slightly more signal from them than LightGBM's optimal category splits or XGBoost's one-hot encoding. On a purely numerical dataset, the three would be nearly indistinguishable.
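The core of ordered target statistics fits in a few lines: each row's category is encoded using only the target values of rows that appear *before* it in a random permutation, plus a smoothing prior, so the encoding never leaks the row's own label. A simplified, single-permutation sketch of the idea (illustrative only; CatBoost's real implementation averages over multiple permutations and adds further refinements):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cats = pd.Series(rng.choice(['a', 'b', 'c'], 12))   # toy categorical feature
y = pd.Series(rng.integers(0, 2, 12))               # toy binary target

prior, weight = y.mean(), 1.0                       # smoothing toward the global rate
perm = rng.permutation(len(cats))                   # random processing order

encoded = np.zeros(len(cats))
running_sum = {}    # sum of targets seen so far, per category
running_cnt = {}    # count of rows seen so far, per category
for i in perm:                                      # walk rows in permutation order
    c = cats.iloc[i]
    s, n = running_sum.get(c, 0.0), running_cnt.get(c, 0)
    # Encode using history only: (sum + prior * weight) / (count + weight)
    encoded[i] = (s + prior * weight) / (n + weight)
    running_sum[c] = s + y.iloc[i]                  # update history AFTER encoding
    running_cnt[c] = n + 1
```

The first row visited in the permutation has no history, so it falls back to the prior; later rows get progressively better estimates. This ordering trick is what lets CatBoost use target information without the target leakage that plagues naive mean-target encoding.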
LightGBM is fastest by a wide margin
LightGBM trained in 3.1 seconds vs. XGBoost's 8.4 and CatBoost's 14.7. This 3-5x speed advantage is the histogram-based splitting in action. On a 50K-row dataset, the absolute difference is small. On a 50M-row dataset, it is the difference between "we can retrain nightly" and "we need a GPU cluster."
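The speed trick is easy to see in miniature: instead of evaluating a split at every distinct feature value, histogram-based methods bucket each feature into at most ~255 bins and consider splits only at bin boundaries. A rough numpy sketch of the candidate-split reduction on one continuous feature (the bin count and quantile binning mirror LightGBM's defaults only loosely):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(18, 50_000)          # one continuous feature

# Exact greedy method: one candidate split per distinct value
exact_candidates = len(np.unique(x)) - 1

# Histogram method: bucket into 255 quantile bins, split only at bin edges
n_bins = 255
edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
binned = np.searchsorted(edges, x)       # map each value to a bin index
hist_candidates = len(np.unique(binned)) - 1

print(f"exact: {exact_candidates:,} candidate splits")
print(f"histogram: {hist_candidates} candidate splits")
```

Tens of thousands of candidate thresholds collapse to at most 254, and the gradient statistics per bin can be accumulated in a single pass, which is where the 3-5x wall-clock advantage comes from.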
One-hot encoding costs XGBoost features and signal
XGBoost had to work with 39 features (after one-hot encoding) while the other two used the original 24. More features means more candidate splits to evaluate and more opportunities for the algorithm to waste splits on low-information dummy columns. The referral_source feature alone became 7 binary columns, each of which is individually weak.
Feature Importance Comparison
import matplotlib.pyplot as plt
# XGBoost feature importance (top 15 by gain)
xgb_importance = pd.DataFrame({
'feature': X_train_xgb.columns,
'importance': xgb_model.feature_importances_
}).sort_values('importance', ascending=False).head(15)
# LightGBM feature importance (top 15; default importance_type is split count)
lgb_importance = pd.DataFrame({
'feature': X_train_lgb.columns,
'importance': lgb_model.feature_importances_
}).sort_values('importance', ascending=False).head(15)
# CatBoost feature importance (top 15)
cat_importance = pd.DataFrame({
'feature': X_train.columns,
'importance': cat_model.get_feature_importance()
}).sort_values('importance', ascending=False).head(15)
print("Top 10 Features by Model")
print("-" * 65)
print(f"{'XGBoost':<25} {'LightGBM':<25} {'CatBoost':<25}")
print("-" * 65)
for i in range(10):
    xgb_f = xgb_importance.iloc[i]['feature']
    lgb_f = lgb_importance.iloc[i]['feature']
    cat_f = cat_importance.iloc[i]['feature']
    print(f"{xgb_f:<25} {lgb_f:<25} {cat_f:<25}")
Top 10 Features by Model
-----------------------------------------------------------------
XGBoost LightGBM CatBoost
-----------------------------------------------------------------
payment_failures_6m payment_failures_6m payment_failures_6m
support_tickets_90d support_tickets_90d negative_sentiment_tickets
negative_sentiment_tickets negative_sentiment_tickets support_tickets_90d
monthly_hours_watched monthly_hours_watched monthly_hours_watched
content_completion_rate content_completion_rate content_completion_rate
months_active genre_diversity months_active
genre_diversity months_active genre_diversity
sessions_last_30d sessions_last_30d plan_type
hours_change_pct hours_change_pct sessions_last_30d
plan_price referral_source referral_source
The top 5 features are consistent across all three models: payment failures, support tickets, negative sentiment, hours watched, and content completion. This agreement increases confidence in the feature importance rankings. Note that CatBoost ranks plan_type (a categorical feature) at position 8 as a single feature, while XGBoost splits this information across plan_price and multiple dummy columns.
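Eyeballing top-10 lists is fine, but agreement can be quantified with a Spearman rank correlation over a shared feature set. A sketch with two hypothetical importance vectors (in the real pipeline you would pass `lgb_model.feature_importances_` and `cat_model.get_feature_importance()` restricted to the same 24 columns; the numbers below are made up for illustration):

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks.
    The argsort-of-argsort ranking assumes no ties; tied values would
    need average ranks."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(ra, rb)[0, 1]

# Hypothetical importance vectors, same feature order in both
imp_lgb = np.array([120.0, 95.0, 80.0, 60.0, 40.0, 30.0, 20.0, 10.0])
imp_cat = np.array([30.0, 25.0, 22.0, 14.0, 12.0, 8.0, 5.0, 1.0])

rho = spearman(imp_lgb, imp_cat)
print(f"Spearman rank correlation: {rho:.3f}")
```

A correlation near 1.0 means the models rank features almost identically even when the raw importance scales differ, which is exactly the situation in the table above.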
Hyperparameter Sensitivity Analysis
How sensitive is each model to its key hyperparameters?
# Learning rate sensitivity for all three
lrs = [0.2, 0.1, 0.05, 0.02, 0.01]
print("Learning Rate Sensitivity (Test AUC)")
print("-" * 55)
print(f"{'LR':<8}{'XGBoost':<12}{'LightGBM':<12}{'CatBoost':<12}")
print("-" * 55)
for lr in lrs:
    # XGBoost
    m = xgb.XGBClassifier(
        n_estimators=5000, learning_rate=lr, max_depth=6,
        subsample=0.8, colsample_bytree=0.8,
        early_stopping_rounds=50, eval_metric='logloss',
        random_state=42, n_jobs=-1
    )
    m.fit(X_train_xgb, y_train, eval_set=[(X_val_xgb, y_val)], verbose=False)
    auc_xgb = roc_auc_score(y_test, m.predict_proba(X_test_xgb)[:, 1])
    # LightGBM
    m2 = lgb.LGBMClassifier(
        n_estimators=5000, learning_rate=lr, num_leaves=63,
        subsample=0.8, colsample_bytree=0.8,
        random_state=42, n_jobs=-1, verbose=-1
    )
    m2.fit(
        X_train_lgb, y_train, eval_set=[(X_val_lgb, y_val)],
        callbacks=[lgb.early_stopping(50), lgb.log_evaluation(0)]
    )
    auc_lgb = roc_auc_score(y_test, m2.predict_proba(X_test_lgb)[:, 1])
    # CatBoost
    m3 = CatBoostClassifier(
        iterations=5000, learning_rate=lr, depth=6,
        subsample=0.8, l2_leaf_reg=3.0,
        random_seed=42, early_stopping_rounds=50,
        cat_features=cat_indices, verbose=0
    )
    m3.fit(X_train, y_train, eval_set=(X_val, y_val), verbose=False)
    auc_cat = roc_auc_score(y_test, m3.predict_proba(X_test)[:, 1])
    print(f"{lr:<8}{auc_xgb:<12.4f}{auc_lgb:<12.4f}{auc_cat:<12.4f}")
Learning Rate Sensitivity (Test AUC)
-------------------------------------------------------
LR XGBoost LightGBM CatBoost
-------------------------------------------------------
0.2 0.8498 0.8512 0.8541
0.1 0.8531 0.8548 0.8572
0.05 0.8547 0.8561 0.8589
0.02 0.8554 0.8568 0.8593
0.01 0.8556 0.8571 0.8595
CatBoost is the most stable across learning rates --- its ordered boosting provides implicit regularization. XGBoost is the most sensitive --- the gap between lr=0.2 and lr=0.01 is 0.0058 (vs. 0.0054 for CatBoost). All three improve with lower learning rates, confirming the "lower than you think" principle.
The Verdict for StreamFlow
For the StreamFlow churn prediction system, the team's recommendation:
Primary model: LightGBM. The AUC gap to CatBoost is 0.0028 --- by a back-of-the-envelope translation, roughly 14 additional correctly ranked churners out of 10,000 test subscribers. LightGBM's nearly 5x training speed advantage over CatBoost matters more for the nightly retraining pipeline, hyperparameter search, and the model experimentation the team does daily.
Fallback if categorical features increase: If StreamFlow adds more categorical features (content genre preferences, device types, geographic detail), revisit CatBoost. Its advantage grows with categorical feature count and cardinality.
XGBoost as reference: Keep XGBoost in the comparison pipeline as a sanity check. If LightGBM ever produces suspicious results (data drift, pipeline bugs), XGBoost provides an independent second opinion.
Production Tip --- In production, "best model" is not just about AUC. It is about AUC times training speed times debugging ease times deployment compatibility times team familiarity. LightGBM wins this multi-dimensional comparison for StreamFlow because the team already uses it for three other models, the nightly retraining pipeline is optimized for it, and the 0.3% AUC gap is not worth the operational complexity of maintaining a CatBoost deployment alongside the existing LightGBM infrastructure.
Key Takeaways from This Showdown
- CatBoost wins on accuracy, especially with categorical features. But the margin is small on datasets with moderate categorical cardinality.
- LightGBM wins on speed. 3-5x faster training is not a rounding error --- it compounds across hyperparameter search, cross-validation, and daily retraining.
- XGBoost's one-hot encoding penalty is real but manageable. On this dataset (4 categorical features, <10 categories each), the penalty is ~0.4% AUC. On datasets with 50 categorical features, it would be much worse.
- All three models agree on feature importance. This is reassuring: the signal in the data is consistent regardless of the algorithm extracting it.
- Lower learning rates consistently improve all three. But the improvement diminishes below 0.02, and training time increases substantially.
- The choice between the three matters less than proper feature engineering, data quality, and early stopping. A poorly engineered dataset with the "best" library will lose to a well-engineered dataset with any of the three.
This case study supports Chapter 14: Gradient Boosting. Return to the chapter for the full discussion of gradient boosting theory and hyperparameters.