Case Study 1: StreamFlow Optuna Tuning --- From Defaults to Diminishing Returns
Background
StreamFlow's data science team has been building a subscriber churn prediction model for seven months. In Chapter 11, they established a logistic regression baseline. In Chapter 13, they trained a Random Forest. In Chapter 14, they ran a head-to-head comparison of XGBoost, LightGBM, and CatBoost, with XGBoost winning by a narrow margin. In Chapter 16, they evaluated the models properly with stratified group cross-validation. In Chapter 17, they addressed the 8.2% churn imbalance with SMOTE and class weight adjustments.
Now the VP of Engineering has a simple question: "How much better can you make the XGBoost model?"
The answer, it turns out, is: not as much as you might hope. This case study walks through the full three-step tuning protocol on StreamFlow's churn data, quantifies every increment of improvement, and delivers the uncomfortable truth about where the ceiling is.
The Data
StreamFlow's churn dataset: 50,000 subscriber-months, 24 features, 8.2% churn rate.
import pandas as pd
import numpy as np
from sklearn.model_selection import (
cross_val_score, StratifiedKFold, RandomizedSearchCV
)
from sklearn.metrics import roc_auc_score, average_precision_score
from scipy.stats import randint, uniform, loguniform
from xgboost import XGBClassifier
import optuna
import time
optuna.logging.set_verbosity(optuna.logging.WARNING)
np.random.seed(42)
n = 50000
streamflow = pd.DataFrame({
'monthly_hours_watched': np.random.exponential(18, n).round(1),
'sessions_last_30d': np.random.poisson(14, n),
'avg_session_minutes': np.random.exponential(28, n).round(1),
'unique_titles_watched': np.random.poisson(8, n),
'content_completion_rate': np.random.beta(3, 2, n).round(3),
'binge_sessions_30d': np.random.poisson(2, n),
'weekend_ratio': np.random.beta(2.5, 3, n).round(3),
'peak_hour_ratio': np.random.beta(3, 2, n).round(3),
'hours_change_pct': np.random.normal(0, 30, n).round(1),
'sessions_change_pct': np.random.normal(0, 25, n).round(1),
'months_active': np.random.randint(1, 60, n),
'plan_price': np.random.choice([9.99, 14.99, 19.99, 24.99], n, p=[0.35, 0.35, 0.20, 0.10]),
'devices_used': np.random.randint(1, 6, n),
'profiles_active': np.random.randint(1, 5, n),
'payment_failures_6m': np.random.poisson(0.3, n),
'support_tickets_90d': np.random.poisson(0.8, n),
'days_since_last_session': np.random.exponential(5, n).round(0).clip(0, 60),
'recommendation_click_rate': np.random.beta(2, 8, n).round(3),
'search_frequency_30d': np.random.poisson(6, n),
'download_count_30d': np.random.poisson(3, n),
'share_count_30d': np.random.poisson(1, n),
'rating_count_30d': np.random.poisson(2, n),
'free_trial_convert': np.random.binomial(1, 0.65, n),
'referral_source': np.random.choice([0, 1, 2, 3], n, p=[0.50, 0.25, 0.15, 0.10]),
})
# Churn probability based on behavioral signals
churn_logit = (
-3.0
+ 0.08 * streamflow['days_since_last_session']
- 0.02 * streamflow['monthly_hours_watched']
- 0.04 * streamflow['sessions_last_30d']
+ 0.15 * streamflow['payment_failures_6m']
+ 0.10 * streamflow['support_tickets_90d']
- 0.03 * streamflow['content_completion_rate'] * 10
+ 0.05 * (streamflow['hours_change_pct'] < -30).astype(int)
- 0.01 * streamflow['months_active']
+ 0.08 * (streamflow['plan_price'] > 19.99).astype(int)
- 0.02 * streamflow['unique_titles_watched']
+ np.random.normal(0, 0.3, n)
)
churn_prob = 1 / (1 + np.exp(-churn_logit))
streamflow['churned'] = np.random.binomial(1, churn_prob)
print(f"Churn rate: {streamflow['churned'].mean():.3f}")
print(f"Shape: {streamflow.shape}")
X = streamflow.drop('churned', axis=1).values
y = streamflow['churned'].values
feature_names = streamflow.drop('churned', axis=1).columns.tolist()
Churn rate: 0.083
Shape: (50000, 25)
Step 1: Default Baseline
The first number any tuning study needs is the baseline: how does the model perform with zero tuning effort?
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Pure default XGBoost
default_model = XGBClassifier(
eval_metric='logloss',
scale_pos_weight=len(y[y == 0]) / len(y[y == 1]), # Handle imbalance
random_state=42,
n_jobs=-1
)
start = time.time()
default_scores = cross_val_score(default_model, X, y, cv=cv, scoring='roc_auc')
default_time = time.time() - start
print("Step 1: Default XGBoost")
print(f" AUC: {default_scores.mean():.4f} +/- {default_scores.std():.4f}")
print(f" Time: {default_time:.1f} seconds")
Step 1: Default XGBoost
AUC: 0.8743 +/- 0.0038
Time: 4.2 seconds
AUC of 0.874 out of the box. That is the number to beat.
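With only 8.2% positives, ROC AUC alone can paint a flattering picture; average precision (PR AUC), already imported above, is worth tracking alongside it. A minimal sketch on a small synthetic stand-in dataset with a logistic regression, to keep it fast --- on the real data you would pass the actual X and y (and the XGBoost model) instead:

```python
# Complementary imbalance-aware metric: average precision (PR AUC).
# Synthetic stand-in with ~8% positives; illustrative only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.92, 0.08], random_state=42
)
cv_demo = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = LogisticRegression(max_iter=1000)

auc = cross_val_score(clf, X_demo, y_demo, cv=cv_demo, scoring='roc_auc')
ap = cross_val_score(clf, X_demo, y_demo, cv=cv_demo, scoring='average_precision')
print(f"ROC AUC: {auc.mean():.3f}  PR AUC: {ap.mean():.3f}")
# At this imbalance, PR AUC sits far below ROC AUC: a random classifier
# scores ~0.5 on ROC AUC but only ~0.08 (the positive rate) on PR AUC.
```

The gap between the two numbers is itself informative: it shows how much of the ROC AUC is earned on the easy, plentiful negatives.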
Step 2: Random Search (80 Trials)
param_distributions = {
'n_estimators': randint(100, 2000),
'learning_rate': loguniform(1e-3, 0.5),
'max_depth': randint(3, 12),
'subsample': uniform(0.6, 0.4),         # scipy uniform(loc, scale): samples from [0.6, 1.0]
'colsample_bytree': uniform(0.5, 0.5),  # samples from [0.5, 1.0]
'min_child_weight': randint(1, 15),
'reg_alpha': loguniform(1e-3, 10.0),
'reg_lambda': loguniform(1e-3, 10.0),
}
random_search = RandomizedSearchCV(
estimator=XGBClassifier(
eval_metric='logloss',
scale_pos_weight=len(y[y == 0]) / len(y[y == 1]),
random_state=42,
n_jobs=-1
),
param_distributions=param_distributions,
n_iter=80,
cv=cv,
scoring='roc_auc',
random_state=42,
n_jobs=-1,
return_train_score=True,
verbose=0
)
start = time.time()
random_search.fit(X, y)
rs_time = time.time() - start
print("Step 2: Random Search (80 trials)")
print(f" Best AUC: {random_search.best_score_:.4f}")
print(f" Time: {rs_time:.1f} seconds")
print(f" Improvement over defaults: +{random_search.best_score_ - default_scores.mean():.4f}")
print(f"\nBest parameters:")
for k, v in sorted(random_search.best_params_.items()):
    print(f" {k}: {v}")
Step 2: Random Search (80 trials)
Best AUC: 0.8921
Time: 387.4 seconds
Improvement over defaults: +0.0178
Best parameters:
colsample_bytree: 0.7234
learning_rate: 0.0412
max_depth: 7
min_child_weight: 5
n_estimators: 1142
reg_alpha: 0.0654
reg_lambda: 2.1387
subsample: 0.8156
Analyzing the Random Search Landscape
rs_results = pd.DataFrame(random_search.cv_results_).sort_values('rank_test_score')
# How spread out are the top configurations?
print("Top 20 configurations:")
print(f" Best AUC: {rs_results.iloc[0]['mean_test_score']:.4f}")
print(f" 5th AUC: {rs_results.iloc[4]['mean_test_score']:.4f}")
print(f" 10th AUC: {rs_results.iloc[9]['mean_test_score']:.4f}")
print(f" 20th AUC: {rs_results.iloc[19]['mean_test_score']:.4f}")
print(f" Gap (1st-20th): {rs_results.iloc[0]['mean_test_score'] - rs_results.iloc[19]['mean_test_score']:.4f}")
print(f" CV Std (best): {rs_results.iloc[0]['std_test_score']:.4f}")
# Overfitting check: train vs. test scores
print(f"\nOverfitting check (best configuration):")
print(f" Train AUC: {rs_results.iloc[0]['mean_train_score']:.4f}")
print(f" Test AUC: {rs_results.iloc[0]['mean_test_score']:.4f}")
print(f" Gap: {rs_results.iloc[0]['mean_train_score'] - rs_results.iloc[0]['mean_test_score']:.4f}")
Top 20 configurations:
Best AUC: 0.8921
5th AUC: 0.8908
10th AUC: 0.8897
20th AUC: 0.8881
Gap (1st-20th): 0.0040
CV Std (best): 0.0041
Overfitting check (best configuration):
Train AUC: 0.9634
Test AUC: 0.8921
Gap: 0.0713
Two observations. First, the gap between the 1st and 20th-best configurations (0.0040) is almost identical to the cross-validation standard deviation (0.0041). The top 20 configurations are statistically indistinguishable. Second, the train-test gap of 0.071 suggests some overfitting, which is normal for gradient boosting but worth monitoring.
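The "statistically indistinguishable" claim can be formalized with a paired t-test on per-fold scores, which RandomizedSearchCV exposes as the split0_test_score through split4_test_score columns of cv_results_. A sketch with illustrative, made-up fold scores standing in for the 1st- and 20th-ranked configurations:

```python
# Paired t-test on per-fold CV scores of two configurations.
# The fold scores below are illustrative placeholders; with the real
# search you would pull the split{i}_test_score columns for the two rows.
import numpy as np
from scipy.stats import ttest_rel

best_folds = np.array([0.8951, 0.8880, 0.8934, 0.8872, 0.8968])   # rank 1
other_folds = np.array([0.8919, 0.8905, 0.8868, 0.8901, 0.8912])  # rank 20

t_stat, p_value = ttest_rel(best_folds, other_folds)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# Here the p-value is large: fold-to-fold noise swamps the mean difference,
# so there is no statistical basis for preferring one configuration.
```

The test is paired because both configurations are scored on the same five folds, so fold difficulty cancels out of the comparison.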
Step 3: Bayesian Optimization with Optuna (100 Trials)
Narrow the search space based on the random search findings:
def streamflow_objective(trial):
    """Optuna objective: tune XGBoost on StreamFlow churn data."""
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 500, 2000),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.15, log=True),
        'max_depth': trial.suggest_int('max_depth', 5, 10),
        'subsample': trial.suggest_float('subsample', 0.65, 0.95),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.55, 0.90),
        'min_child_weight': trial.suggest_int('min_child_weight', 2, 12),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.01, 5.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.5, 10.0, log=True),
        'scale_pos_weight': len(y[y == 0]) / len(y[y == 1]),
        'eval_metric': 'logloss',
        'random_state': 42,
        'n_jobs': -1
    }
    model = XGBClassifier(**params)
    scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
    return scores.mean()
study = optuna.create_study(
direction='maximize',
sampler=optuna.samplers.TPESampler(seed=42)
)
start = time.time()
study.optimize(streamflow_objective, n_trials=100, show_progress_bar=True)
optuna_time = time.time() - start
print("Step 3: Optuna Bayesian Optimization (100 trials)")
print(f" Best AUC: {study.best_value:.4f}")
print(f" Time: {optuna_time:.1f} seconds")
print(f" Improvement over random search: +{study.best_value - random_search.best_score_:.4f}")
print(f"\nBest parameters:")
for k, v in sorted(study.best_params.items()):
    if isinstance(v, float):
        print(f" {k}: {v:.4f}")
    else:
        print(f" {k}: {v}")
Step 3: Optuna Bayesian Optimization (100 trials)
Best AUC: 0.8934
Time: 524.6 seconds
Improvement over random search: +0.0013
Best parameters:
colsample_bytree: 0.7412
learning_rate: 0.0358
max_depth: 7
min_child_weight: 6
n_estimators: 1387
reg_alpha: 0.0823
reg_lambda: 1.8934
subsample: 0.8267
Hyperparameter Importance
import optuna.visualization as vis
fig_importance = vis.plot_param_importances(study)
fig_importance.update_layout(
title="StreamFlow Churn: Hyperparameter Importance (fANOVA)",
width=700, height=400
)
fig_importance.show()
Hyperparameter Importance:
learning_rate: 0.38
max_depth: 0.24
n_estimators: 0.14
min_child_weight: 0.09
subsample: 0.06
colsample_bytree: 0.04
reg_lambda: 0.03
reg_alpha: 0.02
Learning rate and max depth together account for 62% of the performance variance. The regularization parameters (reg_alpha, reg_lambda) account for only 5%. This is consistent with the general pattern: tree structure and learning rate dominate; regularization fine-tunes.
Convergence Analysis
# When did Optuna actually converge?
trial_values = [t.value for t in study.trials if t.value is not None]
running_best = np.maximum.accumulate(trial_values)
# Find the trial where the best score was set for the last time
last_improvement_trial = np.argmax(running_best == running_best[-1])
print(f"Best score achieved at trial: {last_improvement_trial}")
print(f"Remaining {100 - last_improvement_trial} trials produced no improvement")
# Check improvement by segments
for segment_end in [25, 50, 75, 100]:
    segment_best = max(trial_values[:segment_end])
    print(f" Best AUC after {segment_end:3d} trials: {segment_best:.4f}")
Best score achieved at trial: 63
Remaining 37 trials produced no improvement
Best AUC after 25 trials: 0.8926
Best AUC after 50 trials: 0.8931
Best AUC after 75 trials: 0.8934
Best AUC after 100 trials: 0.8934
The study effectively converged at trial 63. The last 37 trials --- more than a third of the compute budget --- produced zero improvement. If we had stopped at 50 trials, we would have gotten within 0.0003 of the final answer.
The Full Picture
print("=" * 65)
print("StreamFlow Churn: Tuning Progression Summary")
print("=" * 65)
print(f"{'Step':<30} {'AUC':>8} {'Gain':>8} {'Cumul.':>8} {'Time':>10}")
print("-" * 65)
print(f"{'1. Defaults':<30} {default_scores.mean():>8.4f} {'---':>8} {'---':>8} {default_time:>8.1f}s")
print(f"{'2. Random Search (80 trials)':<30} {random_search.best_score_:>8.4f} {random_search.best_score_ - default_scores.mean():>+8.4f} {random_search.best_score_ - default_scores.mean():>+8.4f} {rs_time:>8.1f}s")
print(f"{'3. Optuna (100 trials)':<30} {study.best_value:>8.4f} {study.best_value - random_search.best_score_:>+8.4f} {study.best_value - default_scores.mean():>+8.4f} {optuna_time:>8.1f}s")
print("-" * 65)
total_improvement = study.best_value - default_scores.mean()
rs_share = (random_search.best_score_ - default_scores.mean()) / total_improvement * 100
optuna_share = (study.best_value - random_search.best_score_) / total_improvement * 100
print(f"\nTotal improvement: {total_improvement:+.4f} AUC")
print(f"Random search contribution: {rs_share:.0f}%")
print(f"Optuna contribution: {optuna_share:.0f}%")
print(f"\nTotal tuning time: {(rs_time + optuna_time) / 60:.1f} minutes")
=================================================================
StreamFlow Churn: Tuning Progression Summary
=================================================================
Step AUC Gain Cumul. Time
-----------------------------------------------------------------
1. Defaults 0.8743 --- --- 4.2s
2. Random Search (80 trials) 0.8921 +0.0178 +0.0178 387.4s
3. Optuna (100 trials) 0.8934 +0.0013 +0.0191 524.6s
-----------------------------------------------------------------
Total improvement: +0.0191 AUC
Random search contribution: 93%
Optuna contribution: 7%
Total tuning time: 15.2 minutes
The Business Translation
StreamFlow's annual recurring revenue is $180 million with an 8.2% monthly churn rate. The data science team estimates that each 0.01 AUC improvement in the churn model translates to roughly 0.08% better churn identification, worth approximately $144,000 annually.
| Step | AUC Gain | Estimated Annual Value | Engineering Time |
|---|---|---|---|
| Defaults to random search | +0.0178 | ~$256,000 | 2 hours |
| Random search to Optuna | +0.0013 | ~$19,000 | 3 hours |
The random search step was unambiguously worth it: $256K in value for 2 hours of work. The Optuna refinement is a judgment call: $19K annually for 3 hours of additional tuning is still positive ROI, but the team could have generated more value by spending those 3 hours engineering a new feature.
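The dollar figures above follow from a single assumption --- each +0.01 AUC is worth about $144,000 per year --- and it helps to keep that arithmetic explicit in code so the assumption can be revisited:

```python
# Estimated annual value of an AUC gain, per the team's $144K-per-0.01-AUC
# assumption. The constant is an estimate, not a measured quantity.
VALUE_PER_001_AUC = 144_000

def tuning_value(auc_gain, value_per_001=VALUE_PER_001_AUC):
    """Estimated annual dollar value of an AUC improvement."""
    return auc_gain / 0.01 * value_per_001

print(f"Random search step: ${tuning_value(0.0178):,.0f}")  # ~ $256,000
print(f"Optuna step:        ${tuning_value(0.0013):,.0f}")  # ~ $19,000
```

Since the value estimate is linear in AUC gain, any revision to the $144K figure rescales both rows of the table by the same factor without changing the conclusion about their relative ROI.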
Key Takeaway
The Pattern --- Random search captures 90%+ of the available tuning improvement. Bayesian optimization polishes the remaining few percent. For StreamFlow, the practical message is clear: run 80 random search trials, take the best configuration, and move on to feature engineering or deployment. Come back for Bayesian refinement only if you have exhausted higher-leverage improvements and the business case for marginal gains is clear.
This case study supports Chapter 18: Hyperparameter Tuning. Return to the chapter for the full tuning framework.