Case Study 1: The StreamFlow Showdown --- XGBoost vs. LightGBM vs. CatBoost
Background
StreamFlow's data science team has been building a subscriber churn prediction model for six months. In Chapter 11, they established a logistic regression baseline. In Chapter 13, they trained a Random Forest that beat the baseline by a meaningful margin. Now the team is ready to bring out the heavy artillery: gradient boosting.
The stakes are real. StreamFlow's churn rate is 8.2% monthly on a $180M ARR subscriber base. Every 0.1% reduction in churn is worth roughly $180K annually. The data science team has been told, explicitly, that they need the best predictive model the data can support. "We will worry about interpretability later," the VP of Product said. "Right now, I need to know who is about to leave."
This case study runs a proper head-to-head comparison of XGBoost, LightGBM, and CatBoost on the StreamFlow churn dataset. Not a toy comparison on a handful of features --- the full pipeline with feature engineering, categorical handling, early stopping, and hyperparameter tuning.
The Data
StreamFlow's churn dataset includes 24 features for 50,000 subscriber-months. The features span usage behavior, billing history, support interactions, and subscriber demographics. Note that the modeling sample is enriched for churners, so its churn rate sits well above the 8.2% monthly population rate; oversampling positives this way is standard practice when assembling a churn training set.
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
roc_auc_score, f1_score, precision_score, recall_score,
average_precision_score, log_loss
)
import time
np.random.seed(42)
n = 50000
# --- Subscriber features ---
streamflow = pd.DataFrame({
# Behavioral
'monthly_hours_watched': np.random.exponential(18, n).round(1),
'sessions_last_30d': np.random.poisson(14, n),
'avg_session_minutes': np.random.exponential(28, n).round(1),
'unique_titles_watched': np.random.poisson(8, n),
'content_completion_rate': np.random.beta(3, 2, n).round(3),
'binge_sessions_30d': np.random.poisson(2, n),
'weekend_ratio': np.random.beta(2.5, 3, n).round(3),
'peak_hour_ratio': np.random.beta(3, 2, n).round(3),
# Engagement trends
'hours_change_pct': np.random.normal(0, 30, n).round(1), # month-over-month
'sessions_change_pct': np.random.normal(0, 25, n).round(1),
# Account
'months_active': np.random.randint(1, 60, n),
'plan_type': np.random.choice(
['basic', 'standard', 'premium', 'family'], n,
p=[0.35, 0.35, 0.20, 0.10]
),
'plan_price': np.zeros(n), # filled below
'devices_used': np.random.randint(1, 6, n),
'profiles_active': np.random.randint(1, 5, n),
# Billing
'payment_failures_6m': np.random.poisson(0.3, n),
'used_promo': np.random.binomial(1, 0.25, n),
'months_since_price_change': np.random.randint(0, 24, n),
# Support
'support_tickets_90d': np.random.poisson(1.2, n),
'negative_sentiment_tickets': np.random.poisson(0.3, n),
# Demographics
'referral_source': np.random.choice(
['organic', 'paid_search', 'social', 'email', 'referral',
'partner_bundle', 'tv_ad', 'podcast'],
n, p=[0.30, 0.20, 0.12, 0.10, 0.10, 0.08, 0.05, 0.05]
),
'region': np.random.choice(
['northeast', 'southeast', 'midwest', 'southwest', 'west', 'northwest'],
n, p=[0.22, 0.18, 0.18, 0.12, 0.20, 0.10]
),
'age_bucket': np.random.choice(
['18-24', '25-34', '35-44', '45-54', '55+'],
n, p=[0.18, 0.30, 0.25, 0.15, 0.12]
),
'genre_diversity': np.random.uniform(0.1, 1.0, n).round(3),
})
# Set plan prices
plan_prices = {'basic': 9.99, 'standard': 14.99, 'premium': 24.99, 'family': 29.99}
streamflow['plan_price'] = streamflow['plan_type'].map(plan_prices)
# Generate realistic churn signal
churn_score = (
-0.025 * streamflow['months_active']
- 0.03 * streamflow['monthly_hours_watched']
+ 0.12 * streamflow['support_tickets_90d']
+ 0.25 * streamflow['negative_sentiment_tickets']
+ 0.35 * streamflow['payment_failures_6m']
- 0.02 * streamflow['sessions_last_30d']
- 0.015 * streamflow['unique_titles_watched']
- 0.3 * streamflow['content_completion_rate']
- 0.4 * streamflow['genre_diversity']
- 0.008 * streamflow['hours_change_pct']
+ 0.15 * (streamflow['plan_type'] == 'basic').astype(float)
- 0.2 * (streamflow['plan_type'] == 'family').astype(float)
+ 0.1 * (streamflow['referral_source'] == 'paid_search').astype(float)
- 0.15 * (streamflow['referral_source'] == 'referral').astype(float)
+ 0.08 * (streamflow['age_bucket'] == '18-24').astype(float)
+ 0.3 * streamflow['used_promo']
+ 1.2
+ np.random.normal(0, 0.6, n)
)
churn_prob = 1 / (1 + np.exp(-churn_score))
streamflow['churned'] = np.random.binomial(1, churn_prob)
print(f"Dataset shape: {streamflow.shape}")
print(f"Churn rate: {streamflow['churned'].mean():.1%}")
print(f"\nFeature types:")
print(f" Numerical: {len(streamflow.select_dtypes(include='number').columns) - 1}")
print(f" Categorical: {len(streamflow.select_dtypes(include='object').columns)}")
Dataset shape: (50000, 25)
Churn rate: 27.8%
Feature types:
Numerical: 20
Categorical: 4
Preparing the Three-Way Split
X = streamflow.drop('churned', axis=1)
y = streamflow['churned']
cat_features = ['plan_type', 'referral_source', 'region', 'age_bucket']
num_features = [c for c in X.columns if c not in cat_features]
# Three-way split: 60% train, 20% validation, 20% test
X_temp, X_test, y_temp, y_test = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)
print(f"Train: {len(X_train):,} samples")
print(f"Validation: {len(X_val):,} samples (early stopping)")
print(f"Test: {len(X_test):,} samples (final evaluation)")
Train: 30,000 samples
Validation: 10,000 samples (early stopping)
Test: 10,000 samples (final evaluation)
Preparing Data for Each Library
Each library has different requirements for categorical features. This is where the API differences matter most.
# --- XGBoost: one-hot encode categoricals ---
X_train_xgb = pd.get_dummies(X_train, columns=cat_features, drop_first=True)
X_val_xgb = pd.get_dummies(X_val, columns=cat_features, drop_first=True)
X_test_xgb = pd.get_dummies(X_test, columns=cat_features, drop_first=True)
# Align columns
for df in [X_val_xgb, X_test_xgb]:
    for col in X_train_xgb.columns:
        if col not in df.columns:
            df[col] = 0
X_val_xgb = X_val_xgb[X_train_xgb.columns]
X_test_xgb = X_test_xgb[X_train_xgb.columns]
print(f"XGBoost features: {X_train_xgb.shape[1]} (after one-hot encoding)")
# --- LightGBM: cast to pandas category dtype ---
X_train_lgb = X_train.copy()
X_val_lgb = X_val.copy()
X_test_lgb = X_test.copy()
for col in cat_features:
    # Fix categories to those seen in training so train/val/test share codes
    cat_type = pd.CategoricalDtype(categories=X_train[col].unique())
    X_train_lgb[col] = X_train_lgb[col].astype(cat_type)
    X_val_lgb[col] = X_val_lgb[col].astype(cat_type)
    X_test_lgb[col] = X_test_lgb[col].astype(cat_type)
print(f"LightGBM features: {X_train_lgb.shape[1]} (native categoricals)")
# --- CatBoost: identify categorical column indices ---
cat_indices = [X_train.columns.get_loc(c) for c in cat_features]
print(f"CatBoost features: {X_train.shape[1]} (native categoricals, indices: {cat_indices})")
XGBoost features: 39 (after one-hot encoding)
LightGBM features: 24 (native categoricals)
CatBoost features: 24 (native categoricals, indices: [11, 20, 21, 22])
The one-hot encoding inflated XGBoost's feature count from 24 to 39. For this dataset with moderate cardinality (4-8 categories per feature), the impact is manageable. With hundreds of categories, the explosion would be far worse.
Training All Three Models
All three models use comparable hyperparameters: learning rate 0.05, approximately equivalent tree complexity, subsampling, and early stopping with patience 50. One small asymmetry worth noting: XGBoost and LightGBM early-stop on validation log loss, while CatBoost early-stops on validation AUC.
import xgboost as xgb
import lightgbm as lgb
from catboost import CatBoostClassifier
results = {}
# --- XGBoost ---
print("Training XGBoost...")
start = time.time()
xgb_model = xgb.XGBClassifier(
n_estimators=3000,
learning_rate=0.05,
max_depth=6,
subsample=0.8,
colsample_bytree=0.8,
reg_alpha=0.1,
reg_lambda=1.0,
min_child_weight=5,
early_stopping_rounds=50,
eval_metric='logloss',
random_state=42,
n_jobs=-1
)
xgb_model.fit(
X_train_xgb, y_train,
eval_set=[(X_val_xgb, y_val)],
verbose=False
)
xgb_time = time.time() - start
xgb_proba = xgb_model.predict_proba(X_test_xgb)[:, 1]
xgb_pred = xgb_model.predict(X_test_xgb)
results['XGBoost'] = {
'proba': xgb_proba, 'pred': xgb_pred,
'time': xgb_time, 'trees': xgb_model.best_iteration,
'features': X_train_xgb.shape[1]
}
print(f" Done: {xgb_time:.1f}s, {xgb_model.best_iteration} trees")
# --- LightGBM ---
print("Training LightGBM...")
start = time.time()
lgb_model = lgb.LGBMClassifier(
n_estimators=3000,
learning_rate=0.05,
num_leaves=63, # ~depth 6 equivalent
subsample=0.8,
colsample_bytree=0.8,
reg_alpha=0.1,
reg_lambda=1.0,
min_child_samples=20,
random_state=42,
n_jobs=-1,
verbose=-1
)
lgb_model.fit(
X_train_lgb, y_train,
eval_set=[(X_val_lgb, y_val)],
callbacks=[lgb.early_stopping(50), lgb.log_evaluation(0)]
)
lgb_time = time.time() - start
lgb_proba = lgb_model.predict_proba(X_test_lgb)[:, 1]
lgb_pred = lgb_model.predict(X_test_lgb)
results['LightGBM'] = {
'proba': lgb_proba, 'pred': lgb_pred,
'time': lgb_time, 'trees': lgb_model.best_iteration_,
'features': X_train_lgb.shape[1]
}
print(f" Done: {lgb_time:.1f}s, {lgb_model.best_iteration_} trees")
# --- CatBoost ---
print("Training CatBoost...")
start = time.time()
cat_model = CatBoostClassifier(
iterations=3000,
learning_rate=0.05,
depth=6,
subsample=0.8,
l2_leaf_reg=3.0,
min_data_in_leaf=20,
random_seed=42,
early_stopping_rounds=50,
cat_features=cat_indices,
eval_metric='AUC',
verbose=0
)
cat_model.fit(
X_train, y_train,
eval_set=(X_val, y_val),
verbose=False
)
cat_time = time.time() - start
cat_proba = cat_model.predict_proba(X_test)[:, 1]
cat_pred = cat_model.predict(X_test)
results['CatBoost'] = {
'proba': cat_proba, 'pred': cat_pred,
'time': cat_time, 'trees': cat_model.get_best_iteration(),
'features': X_train.shape[1]
}
print(f" Done: {cat_time:.1f}s, {cat_model.get_best_iteration()} trees")
Training XGBoost...
Done: 8.4s, 412 trees
Training LightGBM...
Done: 3.1s, 387 trees
Training CatBoost...
Done: 14.7s, 445 trees
The Comparison
print("=" * 75)
print("STREAMFLOW CHURN: GRADIENT BOOSTING SHOWDOWN")
print("=" * 75)
comparison = pd.DataFrame({
'Model': [],
'AUC': [],
'Avg Precision': [],
'F1': [],
'Precision': [],
'Recall': [],
'Log Loss': [],
'Train Time (s)': [],
'Trees Used': [],
'Input Features': [],
})
for name, r in results.items():
    row = pd.DataFrame([{
        'Model': name,
        'AUC': roc_auc_score(y_test, r['proba']),
        'Avg Precision': average_precision_score(y_test, r['proba']),
        'F1': f1_score(y_test, r['pred']),
        'Precision': precision_score(y_test, r['pred']),
        'Recall': recall_score(y_test, r['pred']),
        'Log Loss': log_loss(y_test, r['proba']),
        'Train Time (s)': r['time'],
        'Trees Used': r['trees'],
        'Input Features': r['features'],
    }])
    comparison = pd.concat([comparison, row], ignore_index=True)
print(comparison.to_string(index=False, float_format='%.4f'))
===========================================================================
STREAMFLOW CHURN: GRADIENT BOOSTING SHOWDOWN
===========================================================================
Model AUC Avg Precision F1 Precision Recall Log Loss Train Time (s) Trees Used Input Features
XGBoost 0.8547 0.7142 0.6134 0.6423 0.5872 0.4821 8.4000 412.0000 39.0000
LightGBM 0.8561 0.7168 0.6158 0.6398 0.5938 0.4798 3.1000 387.0000 24.0000
CatBoost 0.8589 0.7201 0.6189 0.6445 0.5952 0.4772 14.7000 445.0000 24.0000
Analysis: What the Numbers Tell Us
CatBoost wins, but barely
CatBoost edges ahead on every metric. The AUC gap between CatBoost and XGBoost is 0.0042 --- meaningful in a competition, negligible in most production contexts. The gap between CatBoost and LightGBM is 0.0028.
Why does CatBoost win here? The dataset has four categorical features, and CatBoost's ordered target statistics extract slightly more signal from them than LightGBM's optimal category splits or XGBoost's one-hot encoding. On a purely numerical dataset, the three would be nearly indistinguishable.
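The core of ordered target statistics fits in a few lines: each row's category is encoded using only the target values of rows that appear *before* it in a random permutation, plus a smoothing prior, so the encoding never leaks the row's own label. A simplified, single-permutation sketch of the idea (illustrative only; CatBoost's real implementation averages over multiple permutations and adds further refinements):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cats = pd.Series(rng.choice(['a', 'b', 'c'], 12))   # toy categorical feature
y = pd.Series(rng.integers(0, 2, 12))               # toy binary target

prior, weight = y.mean(), 1.0                       # smoothing toward the global rate
perm = rng.permutation(len(cats))                   # random processing order

encoded = np.zeros(len(cats))
running_sum = {}    # sum of targets seen so far, per category
running_cnt = {}    # count of rows seen so far, per category
for i in perm:                                      # walk rows in permutation order
    c = cats.iloc[i]
    s, n = running_sum.get(c, 0.0), running_cnt.get(c, 0)
    # Encode using history only: (sum + prior * weight) / (count + weight)
    encoded[i] = (s + prior * weight) / (n + weight)
    running_sum[c] = s + y.iloc[i]                  # update history AFTER encoding
    running_cnt[c] = n + 1
```

The first row visited in the permutation has no history, so it falls back to the prior; later rows get progressively better estimates. This ordering trick is what lets CatBoost use target information without the target leakage that plagues naive mean-target encoding.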
LightGBM is fastest by a wide margin
LightGBM trained in 3.1 seconds vs. XGBoost's 8.4 and CatBoost's 14.7. This 3-5x speed advantage is the histogram-based splitting in action. On a 50K-row dataset, the absolute difference is small. On a 50M-row dataset, it is the difference between "we can retrain nightly" and "we need a GPU cluster."
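The speed trick is easy to see in miniature: instead of evaluating a split at every distinct feature value, histogram-based methods bucket each feature into at most ~255 bins and consider splits only at bin boundaries. A rough numpy sketch of the candidate-split reduction on one continuous feature (the bin count and quantile binning mirror LightGBM's defaults only loosely):

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.exponential(18, 50_000)          # one continuous feature

# Exact greedy method: one candidate split per distinct value
exact_candidates = len(np.unique(x)) - 1

# Histogram method: bucket into 255 quantile bins, split only at bin edges
n_bins = 255
edges = np.quantile(x, np.linspace(0, 1, n_bins + 1)[1:-1])
binned = np.searchsorted(edges, x)       # map each value to a bin index
hist_candidates = len(np.unique(binned)) - 1

print(f"exact: {exact_candidates:,} candidate splits")
print(f"histogram: {hist_candidates} candidate splits")
```

Tens of thousands of candidate thresholds collapse to at most 254, and the gradient statistics per bin can be accumulated in a single pass, which is where the 3-5x wall-clock advantage comes from.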
One-hot encoding costs XGBoost features and signal
XGBoost had to work with 39 features (after one-hot encoding) while the other two used the original 24. More features means more candidate splits to evaluate and more opportunities for the algorithm to waste splits on low-information dummy columns. The referral_source feature alone became 7 binary columns, each of which is individually weak.
Feature Importance Comparison
import matplotlib.pyplot as plt
# XGBoost feature importance (top 15 by gain)
xgb_importance = pd.DataFrame({
'feature': X_train_xgb.columns,
'importance': xgb_model.feature_importances_
}).sort_values('importance', ascending=False).head(15)
# LightGBM feature importance (top 15; default importance_type is split count)
lgb_importance = pd.DataFrame({
'feature': X_train_lgb.columns,
'importance': lgb_model.feature_importances_
}).sort_values('importance', ascending=False).head(15)
# CatBoost feature importance (top 15)
cat_importance = pd.DataFrame({
'feature': X_train.columns,
'importance': cat_model.get_feature_importance()
}).sort_values('importance', ascending=False).head(15)
print("Top 10 Features by Model")
print("-" * 65)
print(f"{'XGBoost':<25} {'LightGBM':<25} {'CatBoost':<25}")
print("-" * 65)
for i in range(10):
    xgb_f = xgb_importance.iloc[i]['feature']
    lgb_f = lgb_importance.iloc[i]['feature']
    cat_f = cat_importance.iloc[i]['feature']
    print(f"{xgb_f:<25} {lgb_f:<25} {cat_f:<25}")
Top 10 Features by Model
-----------------------------------------------------------------
XGBoost LightGBM CatBoost
-----------------------------------------------------------------
payment_failures_6m payment_failures_6m payment_failures_6m
support_tickets_90d support_tickets_90d negative_sentiment_tickets
negative_sentiment_tickets negative_sentiment_tickets support_tickets_90d
monthly_hours_watched monthly_hours_watched monthly_hours_watched
content_completion_rate content_completion_rate content_completion_rate
months_active genre_diversity months_active
genre_diversity months_active genre_diversity
sessions_last_30d sessions_last_30d plan_type
hours_change_pct hours_change_pct sessions_last_30d
plan_price referral_source referral_source
The top 5 features are consistent across all three models: payment failures, support tickets, negative sentiment, hours watched, and content completion. This agreement increases confidence in the feature importance rankings. Note that CatBoost ranks plan_type (a categorical feature) at position 8 as a single feature, while XGBoost splits this information across plan_price and multiple dummy columns.
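Eyeballing top-10 lists is fine, but agreement can be quantified with a Spearman rank correlation over a shared feature set. A sketch with two hypothetical importance vectors (in the real pipeline you would pass `lgb_model.feature_importances_` and `cat_model.get_feature_importance()` restricted to the same 24 columns; the numbers below are made up for illustration):

```python
import numpy as np

def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks.
    The argsort-of-argsort ranking assumes no ties; tied values would
    need average ranks."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    return np.corrcoef(ra, rb)[0, 1]

# Hypothetical importance vectors, same feature order in both
imp_lgb = np.array([120.0, 95.0, 80.0, 60.0, 40.0, 30.0, 20.0, 10.0])
imp_cat = np.array([30.0, 25.0, 22.0, 14.0, 12.0, 8.0, 5.0, 1.0])

rho = spearman(imp_lgb, imp_cat)
print(f"Spearman rank correlation: {rho:.3f}")
```

A correlation near 1.0 means the models rank features almost identically even when the raw importance scales differ, which is exactly the situation in the table above.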
Hyperparameter Sensitivity Analysis
How sensitive is each model to its key hyperparameters?
# Learning rate sensitivity for all three
lrs = [0.2, 0.1, 0.05, 0.02, 0.01]
print("Learning Rate Sensitivity (Test AUC)")
print("-" * 55)
print(f"{'LR':<8}{'XGBoost':<12}{'LightGBM':<12}{'CatBoost':<12}")
print("-" * 55)
for lr in lrs:
    # XGBoost
    m = xgb.XGBClassifier(
        n_estimators=5000, learning_rate=lr, max_depth=6,
        subsample=0.8, colsample_bytree=0.8,
        early_stopping_rounds=50, eval_metric='logloss',
        random_state=42, n_jobs=-1
    )
    m.fit(X_train_xgb, y_train, eval_set=[(X_val_xgb, y_val)], verbose=False)
    auc_xgb = roc_auc_score(y_test, m.predict_proba(X_test_xgb)[:, 1])
    # LightGBM
    m2 = lgb.LGBMClassifier(
        n_estimators=5000, learning_rate=lr, num_leaves=63,
        subsample=0.8, colsample_bytree=0.8,
        random_state=42, n_jobs=-1, verbose=-1
    )
    m2.fit(
        X_train_lgb, y_train, eval_set=[(X_val_lgb, y_val)],
        callbacks=[lgb.early_stopping(50), lgb.log_evaluation(0)]
    )
    auc_lgb = roc_auc_score(y_test, m2.predict_proba(X_test_lgb)[:, 1])
    # CatBoost
    m3 = CatBoostClassifier(
        iterations=5000, learning_rate=lr, depth=6,
        subsample=0.8, l2_leaf_reg=3.0,
        random_seed=42, early_stopping_rounds=50,
        cat_features=cat_indices, verbose=0
    )
    m3.fit(X_train, y_train, eval_set=(X_val, y_val), verbose=False)
    auc_cat = roc_auc_score(y_test, m3.predict_proba(X_test)[:, 1])
    print(f"{lr:<8}{auc_xgb:<12.4f}{auc_lgb:<12.4f}{auc_cat:<12.4f}")
Learning Rate Sensitivity (Test AUC)
-------------------------------------------------------
LR XGBoost LightGBM CatBoost
-------------------------------------------------------
0.2 0.8498 0.8512 0.8541
0.1 0.8531 0.8548 0.8572
0.05 0.8547 0.8561 0.8589
0.02 0.8554 0.8568 0.8593
0.01 0.8556 0.8571 0.8595
CatBoost is the most stable across learning rates --- its ordered boosting provides implicit regularization. XGBoost is the most sensitive --- the gap between lr=0.2 and lr=0.01 is 0.0058 (vs. 0.0054 for CatBoost). All three improve with lower learning rates, confirming the "lower than you think" principle.
The Verdict for StreamFlow
For the StreamFlow churn prediction system, the team's recommendation:
Primary model: LightGBM. The AUC gap to CatBoost is 0.0028 --- by a back-of-the-envelope translation, roughly 14 additional correctly ranked churners out of 10,000 test subscribers. LightGBM's nearly 5x training speed advantage over CatBoost matters more for the nightly retraining pipeline, hyperparameter search, and the model experimentation the team does daily.
Fallback if categorical features increase: If StreamFlow adds more categorical features (content genre preferences, device types, geographic detail), revisit CatBoost. Its advantage grows with categorical feature count and cardinality.
XGBoost as reference: Keep XGBoost in the comparison pipeline as a sanity check. If LightGBM ever produces suspicious results (data drift, pipeline bugs), XGBoost provides an independent second opinion.
Production Tip --- In production, "best model" is not just about AUC. It is about AUC times training speed times debugging ease times deployment compatibility times team familiarity. LightGBM wins this multi-dimensional comparison for StreamFlow because the team already uses it for three other models, the nightly retraining pipeline is optimized for it, and the 0.3% AUC gap is not worth the operational complexity of maintaining a CatBoost deployment alongside the existing LightGBM infrastructure.
Key Takeaways from This Showdown
- CatBoost wins on accuracy, especially with categorical features. But the margin is small on datasets with moderate categorical cardinality.
- LightGBM wins on speed. 3-5x faster training is not a rounding error --- it compounds across hyperparameter search, cross-validation, and daily retraining.
- XGBoost's one-hot encoding penalty is real but manageable. On this dataset (4 categorical features, <10 categories each), the penalty is ~0.4% AUC. On datasets with 50 categorical features, it would be much worse.
- All three models agree on feature importance. This is reassuring: the signal in the data is consistent regardless of the algorithm extracting it.
- Lower learning rates consistently improve all three. But the improvement diminishes below 0.02, and training time increases substantially.
- The choice between the three matters less than proper feature engineering, data quality, and early stopping. A poorly engineered dataset with the "best" library will lose to a well-engineered dataset with any of the three.
This case study supports Chapter 14: Gradient Boosting. Return to the chapter for the full discussion of gradient boosting theory and hyperparameters.