Case Study 1: StreamFlow Optuna Tuning --- From Defaults to Diminishing Returns


Background

StreamFlow's data science team has been building a subscriber churn prediction model for seven months. In Chapter 11, they established a logistic regression baseline. In Chapter 13, they trained a Random Forest. In Chapter 14, they ran a head-to-head comparison of XGBoost, LightGBM, and CatBoost, with XGBoost winning by a narrow margin. In Chapter 16, they evaluated the models properly with stratified group cross-validation. In Chapter 17, they addressed the 8.2% churn imbalance with SMOTE and class weight adjustments.

Now the VP of Engineering has a simple question: "How much better can you make the XGBoost model?"

The answer, it turns out, is: not as much as you might hope. This case study walks through the full three-step tuning protocol on StreamFlow's churn data, quantifies every increment of improvement, and delivers the uncomfortable truth about where the ceiling is.


The Data

StreamFlow's churn dataset: 50,000 subscriber-months, 24 features, 8.2% churn rate.

import pandas as pd
import numpy as np
from sklearn.model_selection import (
    cross_val_score, StratifiedKFold, RandomizedSearchCV
)
from sklearn.metrics import roc_auc_score, average_precision_score
from scipy.stats import randint, uniform, loguniform
from xgboost import XGBClassifier
import optuna
import time

optuna.logging.set_verbosity(optuna.logging.WARNING)

np.random.seed(42)
n = 50000

streamflow = pd.DataFrame({
    'monthly_hours_watched': np.random.exponential(18, n).round(1),
    'sessions_last_30d': np.random.poisson(14, n),
    'avg_session_minutes': np.random.exponential(28, n).round(1),
    'unique_titles_watched': np.random.poisson(8, n),
    'content_completion_rate': np.random.beta(3, 2, n).round(3),
    'binge_sessions_30d': np.random.poisson(2, n),
    'weekend_ratio': np.random.beta(2.5, 3, n).round(3),
    'peak_hour_ratio': np.random.beta(3, 2, n).round(3),
    'hours_change_pct': np.random.normal(0, 30, n).round(1),
    'sessions_change_pct': np.random.normal(0, 25, n).round(1),
    'months_active': np.random.randint(1, 60, n),
    'plan_price': np.random.choice([9.99, 14.99, 19.99, 24.99], n, p=[0.35, 0.35, 0.20, 0.10]),
    'devices_used': np.random.randint(1, 6, n),
    'profiles_active': np.random.randint(1, 5, n),
    'payment_failures_6m': np.random.poisson(0.3, n),
    'support_tickets_90d': np.random.poisson(0.8, n),
    'days_since_last_session': np.random.exponential(5, n).round(0).clip(0, 60),
    'recommendation_click_rate': np.random.beta(2, 8, n).round(3),
    'search_frequency_30d': np.random.poisson(6, n),
    'download_count_30d': np.random.poisson(3, n),
    'share_count_30d': np.random.poisson(1, n),
    'rating_count_30d': np.random.poisson(2, n),
    'free_trial_convert': np.random.binomial(1, 0.65, n),
    'referral_source': np.random.choice([0, 1, 2, 3], n, p=[0.50, 0.25, 0.15, 0.10]),
})

# Churn probability based on behavioral signals
churn_logit = (
    -3.0
    + 0.08 * streamflow['days_since_last_session']
    - 0.02 * streamflow['monthly_hours_watched']
    - 0.04 * streamflow['sessions_last_30d']
    + 0.15 * streamflow['payment_failures_6m']
    + 0.10 * streamflow['support_tickets_90d']
    - 0.03 * streamflow['content_completion_rate'] * 10
    + 0.05 * (streamflow['hours_change_pct'] < -30).astype(int)
    - 0.01 * streamflow['months_active']
    + 0.08 * (streamflow['plan_price'] > 19.99).astype(int)
    - 0.02 * streamflow['unique_titles_watched']
    + np.random.normal(0, 0.3, n)
)
churn_prob = 1 / (1 + np.exp(-churn_logit))
streamflow['churned'] = np.random.binomial(1, churn_prob)

print(f"Churn rate: {streamflow['churned'].mean():.3f}")
print(f"Shape: {streamflow.shape}")

X = streamflow.drop('churned', axis=1).values
y = streamflow['churned'].values
feature_names = streamflow.drop('churned', axis=1).columns.tolist()
Churn rate: 0.083
Shape: (50000, 25)

Step 1: Default Baseline

The first number any tuning study needs is the baseline: how does the model perform with zero tuning effort?

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Pure default XGBoost
default_model = XGBClassifier(
    eval_metric='logloss',
    scale_pos_weight=len(y[y == 0]) / len(y[y == 1]),  # Handle imbalance
    random_state=42,
    n_jobs=-1
)

start = time.time()
default_scores = cross_val_score(default_model, X, y, cv=cv, scoring='roc_auc')
default_time = time.time() - start

print("Step 1: Default XGBoost")
print(f"  AUC:  {default_scores.mean():.4f} +/- {default_scores.std():.4f}")
print(f"  Time: {default_time:.1f} seconds")
Step 1: Default XGBoost
  AUC:  0.8743 +/- 0.0038
  Time: 4.2 seconds

AUC of 0.874 out of the box. That is the number to beat.


Step 2: Random Search (80 Trials)

param_distributions = {
    'n_estimators': randint(100, 2000),
    'learning_rate': loguniform(1e-3, 0.5),
    'max_depth': randint(3, 12),
    'subsample': uniform(0.6, 0.4),
    'colsample_bytree': uniform(0.5, 0.5),
    'min_child_weight': randint(1, 15),
    'reg_alpha': loguniform(1e-3, 10.0),
    'reg_lambda': loguniform(1e-3, 10.0),
}

random_search = RandomizedSearchCV(
    estimator=XGBClassifier(
        eval_metric='logloss',
        scale_pos_weight=len(y[y == 0]) / len(y[y == 1]),
        random_state=42,
        n_jobs=-1
    ),
    param_distributions=param_distributions,
    n_iter=80,
    cv=cv,
    scoring='roc_auc',
    random_state=42,
    n_jobs=-1,
    return_train_score=True,
    verbose=0
)

start = time.time()
random_search.fit(X, y)
rs_time = time.time() - start

print("Step 2: Random Search (80 trials)")
print(f"  Best AUC: {random_search.best_score_:.4f}")
print(f"  Time:     {rs_time:.1f} seconds")
print(f"  Improvement over defaults: +{random_search.best_score_ - default_scores.mean():.4f}")
print(f"\nBest parameters:")
for k, v in sorted(random_search.best_params_.items()):
    print(f"  {k}: {v}")
Step 2: Random Search (80 trials)
  Best AUC: 0.8921
  Time:     387.4 seconds
  Improvement over defaults: +0.0178

Best parameters:
  colsample_bytree: 0.7234
  learning_rate: 0.0412
  max_depth: 7
  min_child_weight: 5
  n_estimators: 1142
  reg_alpha: 0.0654
  reg_lambda: 2.1387
  subsample: 0.8156

Analyzing the Random Search Landscape

rs_results = pd.DataFrame(random_search.cv_results_).sort_values('rank_test_score')

# How spread out are the top configurations?
print("Top 20 configurations:")
print(f"  Best AUC:  {rs_results.iloc[0]['mean_test_score']:.4f}")
print(f"  5th AUC:   {rs_results.iloc[4]['mean_test_score']:.4f}")
print(f"  10th AUC:  {rs_results.iloc[9]['mean_test_score']:.4f}")
print(f"  20th AUC:  {rs_results.iloc[19]['mean_test_score']:.4f}")
print(f"  Gap (1st-20th): {rs_results.iloc[0]['mean_test_score'] - rs_results.iloc[19]['mean_test_score']:.4f}")
print(f"  CV Std (best):  {rs_results.iloc[0]['std_test_score']:.4f}")

# Overfitting check: train vs. test scores
print(f"\nOverfitting check (best configuration):")
print(f"  Train AUC: {rs_results.iloc[0]['mean_train_score']:.4f}")
print(f"  Test AUC:  {rs_results.iloc[0]['mean_test_score']:.4f}")
print(f"  Gap:       {rs_results.iloc[0]['mean_train_score'] - rs_results.iloc[0]['mean_test_score']:.4f}")
Top 20 configurations:
  Best AUC:  0.8921
  5th AUC:   0.8908
  10th AUC:  0.8897
  20th AUC:  0.8881
  Gap (1st-20th): 0.0040
  CV Std (best):  0.0041

Overfitting check (best configuration):
  Train AUC: 0.9634
  Test AUC:  0.8921
  Gap:       0.0713

Two observations. First, the gap between the 1st and 20th-best configurations (0.0040) is almost identical to the cross-validation standard deviation (0.0041). The top 20 configurations are statistically indistinguishable. Second, the train-test gap of 0.071 suggests some overfitting, which is normal for gradient boosting but worth monitoring.


Step 3: Bayesian Optimization with Optuna (100 Trials)

Narrow the search space based on the random search findings:

def streamflow_objective(trial):
    """Optuna objective: tune XGBoost on StreamFlow churn data."""
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 500, 2000),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.15, log=True),
        'max_depth': trial.suggest_int('max_depth', 5, 10),
        'subsample': trial.suggest_float('subsample', 0.65, 0.95),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.55, 0.90),
        'min_child_weight': trial.suggest_int('min_child_weight', 2, 12),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.01, 5.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.5, 10.0, log=True),
        'scale_pos_weight': len(y[y == 0]) / len(y[y == 1]),
        'eval_metric': 'logloss',
        'random_state': 42,
        'n_jobs': -1
    }

    model = XGBClassifier(**params)
    scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
    return scores.mean()


study = optuna.create_study(
    direction='maximize',
    sampler=optuna.samplers.TPESampler(seed=42)
)

start = time.time()
study.optimize(streamflow_objective, n_trials=100, show_progress_bar=True)
optuna_time = time.time() - start

print("Step 3: Optuna Bayesian Optimization (100 trials)")
print(f"  Best AUC: {study.best_value:.4f}")
print(f"  Time:     {optuna_time:.1f} seconds")
print(f"  Improvement over random search: +{study.best_value - random_search.best_score_:.4f}")
print(f"\nBest parameters:")
for k, v in sorted(study.best_params.items()):
    if isinstance(v, float):
        print(f"  {k}: {v:.4f}")
    else:
        print(f"  {k}: {v}")
Step 3: Optuna Bayesian Optimization (100 trials)
  Best AUC: 0.8934
  Time:     524.6 seconds
  Improvement over random search: +0.0013

Best parameters:
  colsample_bytree: 0.7412
  learning_rate: 0.0358
  max_depth: 7
  min_child_weight: 6
  n_estimators: 1387
  reg_alpha: 0.0823
  reg_lambda: 1.8934
  subsample: 0.8267

Hyperparameter Importance

import optuna.visualization as vis

fig_importance = vis.plot_param_importances(study)
fig_importance.update_layout(
    title="StreamFlow Churn: Hyperparameter Importance (fANOVA)",
    width=700, height=400
)
fig_importance.show()
Hyperparameter Importance:
  learning_rate:     0.38
  max_depth:         0.24
  n_estimators:      0.14
  min_child_weight:  0.09
  subsample:         0.06
  colsample_bytree:  0.04
  reg_lambda:        0.03
  reg_alpha:         0.02

Learning rate and max depth together account for 62% of the performance variance. The regularization parameters (reg_alpha, reg_lambda) account for only 5%. This is consistent with the general pattern: tree structure and learning rate dominate; regularization fine-tunes.

Convergence Analysis

# When did Optuna actually converge?
trial_values = [t.value for t in study.trials if t.value is not None]
running_best = np.maximum.accumulate(trial_values)

# Find the trial where the best score was set for the last time
last_improvement_trial = np.argmax(running_best == running_best[-1])

print(f"Best score achieved at trial: {last_improvement_trial}")
print(f"Remaining {100 - last_improvement_trial} trials produced no improvement")

# Check improvement by segments
for segment_end in [25, 50, 75, 100]:
    segment_best = max(trial_values[:segment_end])
    print(f"  Best AUC after {segment_end:3d} trials: {segment_best:.4f}")
Best score achieved at trial: 63
Remaining 37 trials produced no improvement

  Best AUC after  25 trials: 0.8926
  Best AUC after  50 trials: 0.8931
  Best AUC after  75 trials: 0.8934
  Best AUC after 100 trials: 0.8934

The study effectively converged at trial 63. The last 37 trials --- more than a third of the compute budget --- produced zero improvement. If we had stopped at 50 trials, we would have gotten within 0.0003 of the final answer.


The Full Picture

print("=" * 65)
print("StreamFlow Churn: Tuning Progression Summary")
print("=" * 65)
print(f"{'Step':<30} {'AUC':>8} {'Gain':>8} {'Cumul.':>8} {'Time':>10}")
print("-" * 65)
print(f"{'1. Defaults':<30} {default_scores.mean():>8.4f} {'---':>8} {'---':>8} {default_time:>8.1f}s")
print(f"{'2. Random Search (80 trial)':<30} {random_search.best_score_:>8.4f} {random_search.best_score_ - default_scores.mean():>+8.4f} {random_search.best_score_ - default_scores.mean():>+8.4f} {rs_time:>8.1f}s")
print(f"{'3. Optuna (100 trials)':<30} {study.best_value:>8.4f} {study.best_value - random_search.best_score_:>+8.4f} {study.best_value - default_scores.mean():>+8.4f} {optuna_time:>8.1f}s")
print("-" * 65)

total_improvement = study.best_value - default_scores.mean()
rs_share = (random_search.best_score_ - default_scores.mean()) / total_improvement * 100
optuna_share = (study.best_value - random_search.best_score_) / total_improvement * 100

print(f"\nTotal improvement: {total_improvement:+.4f} AUC")
print(f"Random search contribution: {rs_share:.0f}%")
print(f"Optuna contribution:        {optuna_share:.0f}%")
print(f"\nTotal tuning time: {(rs_time + optuna_time) / 60:.1f} minutes")
=================================================================
StreamFlow Churn: Tuning Progression Summary
=================================================================
Step                                AUC     Gain   Cumul.       Time
-----------------------------------------------------------------
1. Defaults                      0.8743      ---      ---       4.2s
2. Random Search (80 trial)      0.8921   +0.0178   +0.0178    387.4s
3. Optuna (100 trials)           0.8934   +0.0013   +0.0191    524.6s
-----------------------------------------------------------------

Total improvement: +0.0191 AUC
Random search contribution: 93%
Optuna contribution:        7%

Total tuning time: 15.2 minutes

The Business Translation

StreamFlow's annual recurring revenue is $180 million with an 8.2% monthly churn rate. The data science team estimates that each 0.01 AUC improvement in the churn model translates to roughly 0.08% better churn identification, worth approximately $144,000 annually.

Step AUC Gain Estimated Annual Value Engineering Time
Defaults to random search +0.0178 ~$256,000 2 hours
Random search to Optuna +0.0013 ~$19,000 3 hours

The random search step was unambiguously worth it: $256K in value for 2 hours of work. The Optuna refinement is a judgment call: $19K annually for 3 hours of additional tuning is still positive ROI, but the team could have generated more value by spending those 3 hours engineering a new feature.


Key Takeaway

The Pattern --- Random search captures 90%+ of the available tuning improvement. Bayesian optimization polishes the remaining few percent. For StreamFlow, the practical message is clear: run 80 random search trials, take the best configuration, and move on to feature engineering or deployment. Come back for Bayesian refinement only if you have exhausted higher-leverage improvements and the business case for marginal gains is clear.


This case study supports Chapter 18: Hyperparameter Tuning. Return to the chapter for the full tuning framework.