

Chapter 18: Hyperparameter Tuning

Grid Search, Random Search, Bayesian Optimization, and When to Stop


Learning Objectives

By the end of this chapter, you will be able to:

  1. Distinguish hyperparameters from parameters and explain why tuning matters
  2. Implement grid search and random search with cross-validation
  3. Apply Bayesian optimization (Optuna) for efficient hyperparameter search
  4. Use Halving search for resource-efficient tuning
  5. Know when tuning provides diminishing returns and when to stop

The Most Over-Invested Activity in Amateur Data Science

Core Principle --- Hyperparameter tuning is important. It is also the single most over-invested activity in amateur data science. I have watched teams spend three weeks tuning XGBoost's max_depth from 6 to 7 and back again, chasing a 0.002 AUC improvement, while their training data had a feature leakage bug that was worth 0.08 AUC. I have seen a data scientist run 10,000-trial Bayesian optimization on a logistic regression --- a model with two meaningful hyperparameters --- because "more trials is better." I have reviewed production models where the tuning log showed 48 hours of GPU time and the final model was 0.3% better than the default.

Here is the uncomfortable truth, expressed as a rough hierarchy of impact:

Activity                                   Typical AUC improvement
Better features                            5--20%
Fixing data quality issues                 3--15%
Choosing the right model family            2--8%
Default hyperparameters --> rough tuning   2--5%
Rough tuning --> perfect tuning            0.1--0.5%

The last row is where most of the tuning time gets spent. The first row is where most of the performance improvement lives. This does not mean tuning is worthless --- 2--5% from a rough tune is real, and on a production model where small improvements translate to millions of dollars, even 0.1% matters. But it means you should tune after you have exhausted higher-leverage activities, and you should know when to stop.

This chapter teaches the correct sequence: start with defaults, do a rough random search, refine with Bayesian optimization if the stakes justify it, and stop when the returns diminish. By the end, you will be a disciplined, efficient tuner --- not a hyperparameter gambler.


Part 1: Parameters vs. Hyperparameters

What Gets Learned vs. What You Set

The distinction is fundamental but often muddy in practice. Parameters are learned from data during training. Hyperparameters are set before training begins and control how learning happens.

               Parameters                                 Hyperparameters
Set by         The training algorithm                     The practitioner
When           During .fit()                              Before .fit()
Examples       Weights, coefficients, split thresholds    Learning rate, max_depth, C, n_estimators
Optimized by   Gradient descent, impurity reduction       Grid search, random search, Bayesian opt.
Stored in      The trained model                          The model constructor

For a Random Forest, the parameters are the split features, split thresholds, and leaf values in every tree --- thousands of values that the algorithm discovers from the data. The hyperparameters are n_estimators, max_depth, min_samples_split, max_features, and a handful of others that you specify before training.

For an XGBoost model, the parameters are the leaf weights in each boosted tree. The hyperparameters are learning_rate, max_depth, n_estimators, subsample, colsample_bytree, reg_alpha, reg_lambda, and min_child_weight.
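The distinction is easy to see in code. A minimal sketch using scikit-learn's LogisticRegression (the toy dataset here is illustrative): the hyperparameter C lives in the constructor and exists before fitting; the learned coefficients exist only after .fit().

```python
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)

# Hyperparameter: set by the practitioner, stored in the constructor
model = LogisticRegression(C=0.5, max_iter=1000)
print(model.get_params()['C'])  # 0.5 -- readable before .fit()

# Parameters: learned from the data, available only after .fit()
model.fit(X_demo, y_demo)
print(model.coef_.shape)  # (1, 4) -- one learned weight per feature
```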

Why Hyperparameters Matter (and When They Do Not)

Hyperparameters control model complexity, regularization, and optimization behavior. Setting them poorly can cause underfitting (too much regularization, too few estimators) or overfitting (too little regularization, trees too deep). The default values in scikit-learn, XGBoost, and LightGBM are competent --- they represent the library authors' best guess at a reasonable starting point for a wide range of problems. But defaults are generic. Your data is specific.

The question is: how much improvement can tuning buy you?

import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold
from xgboost import XGBClassifier

# Generate a moderately challenging classification problem
X, y = make_classification(
    n_samples=15000, n_features=25, n_informative=15,
    n_redundant=5, flip_y=0.08, class_sep=0.8,
    weights=[0.85, 0.15], random_state=42
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# XGBoost with pure defaults
default_model = XGBClassifier(
    eval_metric='logloss', random_state=42, n_jobs=-1
)
default_scores = cross_val_score(
    default_model, X, y, cv=cv, scoring='roc_auc'
)

# XGBoost with reasonable manual tuning
tuned_model = XGBClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=5,
    subsample=0.8,
    colsample_bytree=0.8,
    min_child_weight=3,
    reg_alpha=0.1,
    reg_lambda=1.0,
    eval_metric='logloss',
    random_state=42,
    n_jobs=-1
)
tuned_scores = cross_val_score(
    tuned_model, X, y, cv=cv, scoring='roc_auc'
)

print("Default XGBoost:")
print(f"  AUC: {default_scores.mean():.4f} +/- {default_scores.std():.4f}")
print(f"\nRough-tuned XGBoost:")
print(f"  AUC: {tuned_scores.mean():.4f} +/- {tuned_scores.std():.4f}")
print(f"\nImprovement: {(tuned_scores.mean() - default_scores.mean()):.4f}")
Default XGBoost:
  AUC: 0.9247 +/- 0.0052

Rough-tuned XGBoost:
  AUC: 0.9389 +/- 0.0038

Improvement: 0.0142

That 0.014 improvement is real and meaningful --- about 1.4 percentage points of AUC. But notice: the defaults already gave us 0.925. A 1.4-point improvement on top of an already-good model is worth pursuing if the business impact justifies the engineering time. It is not worth three weeks of someone's salary.


Part 2: Grid Search --- The Brute-Force Baseline

How GridSearchCV Works

Grid search is the simplest tuning method: define a set of values for each hyperparameter, try every combination, and keep the best one. Scikit-learn's GridSearchCV wraps this in cross-validation, so each combination is evaluated on multiple folds.

from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier

param_grid = {
    'n_estimators': [100, 200, 500],
    'max_depth': [5, 10, 15, None],
    'min_samples_split': [2, 5, 10],
    'max_features': ['sqrt', 'log2']
}

# Total combinations: 3 * 4 * 3 * 2 = 72
# With 5-fold CV: 72 * 5 = 360 model fits
grid_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42, n_jobs=-1),
    param_grid=param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1,
    return_train_score=True
)

grid_search.fit(X, y)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best AUC: {grid_search.best_score_:.4f}")
Fitting 5 folds for each of 72 candidates, totalling 360 fits
Best parameters: {'max_depth': 15, 'max_features': 'sqrt', 'min_samples_split': 5, 'n_estimators': 500}
Best AUC: 0.9318

When Grid Search Works

Grid search works when:

  1. You have 2--3 hyperparameters to tune. The total number of combinations is the product of grid sizes. Two parameters with 5 values each: 25 combinations. Three parameters with 5 values each: 125 combinations. Five parameters with 5 values each: 3,125 combinations. The combinatorial explosion is real.

  2. Each model trains fast. 360 fits of a Random Forest on 15,000 samples takes under a minute. 360 fits of an XGBoost with 2,000 estimators on 500,000 samples takes hours.

  3. You know the approximate range. Grid search is good at refining a ballpark, not at exploring a wide search space. If max_depth should be between 3 and 15, a grid of [3, 5, 7, 10, 15] is reasonable. If you have no idea whether it should be 2 or 200, grid search will waste most of its budget on bad regions.

When Grid Search Fails

Warning

--- Grid search is mostly obsolete for high-dimensional hyperparameter spaces. If you are tuning 5 or more hyperparameters, skip ahead to random search or Bayesian optimization. Grid search will either take too long or force you to use a coarse grid that misses the good regions.

The fundamental problem: grid search distributes its budget evenly across the entire grid, including regions that are obviously bad. If learning_rate=1.0 is terrible for your problem (it almost always is), grid search does not care --- it will train 5 cross-validation folds at learning_rate=1.0 for every combination of the other hyperparameters before moving on.

# This grid has 432,000 combinations (5-fold CV = 2,160,000 fits)
# Most of these combinations are awful
bloated_grid = {
    'n_estimators': [100, 200, 500, 1000, 2000],
    'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.3, 1.0],
    'max_depth': [3, 5, 7, 10, 15, None],
    'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
    'colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    'min_child_weight': [1, 3, 5, 7, 10],
    'reg_alpha': [0, 0.01, 0.1, 1.0],
    'reg_lambda': [0, 0.1, 1.0, 10.0]
}
total = 1
for v in bloated_grid.values():
    total *= len(v)
print(f"Total combinations: {total:,}")  # 432,000
print(f"With 5-fold CV: {total * 5:,} fits")  # 2,160,000
Total combinations: 432,000
With 5-fold CV: 2,160,000 fits

Do not do this.


Part 3: Random Search --- The Better Default

The Bergstra-Bengio Insight

In 2012, James Bergstra and Yoshua Bengio published a paper with a title that should have ended grid search forever: "Random Search for Hyper-Parameter Optimization." Their key insight was elegant: in most problems, only a few hyperparameters matter, and you do not know which ones in advance. Grid search wastes budget exploring every combination of unimportant hyperparameters. Random search samples the entire space and, by chance, explores more unique values of the important hyperparameters.

The visual intuition is powerful. Imagine a 2D grid where only the x-axis matters (say, learning rate) and the y-axis is irrelevant (say, min_child_weight). A 5x5 grid search evaluates 25 combinations but only 5 unique values of learning rate. A random search with 25 samples evaluates 25 different values of learning rate. It covers the important dimension far more efficiently.
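That intuition can be checked with a toy simulation (a sketch, not from the paper --- the quadratic objective and its optimum at 0.37 are invented for illustration). Only x matters; a 5x5 grid sees 5 unique x values, while 25 random samples see 25.

```python
import numpy as np

rng = np.random.default_rng(42)

def score(x, y):
    # Toy objective: only x (the "important" hyperparameter) affects the score
    return -(x - 0.37) ** 2

# 5x5 grid: 25 evaluations, but only 5 unique x values
grid_vals = np.linspace(0, 1, 5)
grid_best = max(score(x, y) for x in grid_vals for y in grid_vals)

# Random search: 25 evaluations, 25 unique x values
rand_pts = rng.uniform(0, 1, size=(25, 2))
rand_best = max(score(x, y) for x, y in rand_pts)

print(f"Grid best:   {grid_best:.5f}")
print(f"Random best: {rand_best:.5f}")
```

The grid's best possible result is capped by its closest x value to the optimum (here 0.25); random search almost always lands closer.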

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint, loguniform

param_distributions = {
    'n_estimators': randint(100, 2000),
    'learning_rate': loguniform(1e-3, 1.0),  # log-uniform: 0.001 to 1.0
    'max_depth': randint(3, 15),
    'subsample': uniform(0.6, 0.4),  # uniform: 0.6 to 1.0
    'colsample_bytree': uniform(0.5, 0.5),  # uniform: 0.5 to 1.0
    'min_child_weight': randint(1, 10),
    'reg_alpha': loguniform(1e-3, 10.0),
    'reg_lambda': loguniform(1e-3, 10.0),
}

random_search = RandomizedSearchCV(
    estimator=XGBClassifier(eval_metric='logloss', random_state=42, n_jobs=-1),
    param_distributions=param_distributions,
    n_iter=100,  # 100 random combinations
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='roc_auc',
    n_jobs=-1,
    verbose=1,
    random_state=42,
    return_train_score=True
)

random_search.fit(X, y)

print(f"Best parameters: {random_search.best_params_}")
print(f"Best AUC: {random_search.best_score_:.4f}")
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Best parameters: {'colsample_bytree': 0.7834, 'learning_rate': 0.0412, 'max_depth': 6, 'min_child_weight': 3, 'n_estimators': 847, 'reg_alpha': 0.0821, 'reg_lambda': 1.4523, 'subsample': 0.8219}
Best AUC: 0.9401

Key Design Decisions

1. Use loguniform for learning rates and regularization parameters. These operate on a logarithmic scale --- the difference between 0.001 and 0.01 matters far more than the difference between 0.9 and 1.0. A uniform distribution would waste most samples in the upper range where differences are negligible.

from scipy.stats import loguniform, uniform
import matplotlib.pyplot as plt

# Compare distributions
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

samples_uniform = uniform(0.001, 0.999).rvs(10000)
samples_loguniform = loguniform(1e-3, 1.0).rvs(10000)

axes[0].hist(samples_uniform, bins=50, edgecolor='black', alpha=0.7)
axes[0].set_title('Uniform(0.001, 1.0)\nMost samples near 1.0')
axes[0].set_xlabel('learning_rate')

axes[1].hist(samples_loguniform, bins=50, edgecolor='black', alpha=0.7)
axes[1].set_title('LogUniform(0.001, 1.0)\nEqual coverage per order of magnitude')
axes[1].set_xlabel('learning_rate')

plt.tight_layout()
plt.savefig('distribution_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

2. Set n_iter based on your budget, not your ambition. Bergstra and Bengio showed that 60 random trials find a configuration in the top 5% of the search space with 95% probability --- assuming 1--2 important hyperparameters. For problems with 3--4 important hyperparameters, 100--200 trials is a reasonable starting point. Beyond 200, random search starts to show diminishing returns and Bayesian optimization becomes worthwhile.
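The 60-trial figure is simple probability: if "good" configurations occupy the top 5% of the space, each independent random trial hits that region with probability 0.05, so n trials miss it with probability 0.95^n.

```python
# P(at least one of n random trials lands in the top 5% of the space)
for n in [10, 30, 60, 100]:
    p = 1 - 0.95 ** n
    print(f"n = {n:3d}: P(top-5% hit) = {p:.3f}")
# At n = 60, the probability crosses 95%
```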

3. Always use cross-validation, never a single validation split. With 100 random trials, you are looking at 100 different performance estimates. If each estimate comes from a single random split, noise in the split can make a mediocre configuration look great. Five-fold cross-validation averages out the noise and gives you more reliable rankings.
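A quick synthetic demonstration of that split noise (a sketch; the dataset and model are stand-ins for whatever you are tuning): the same configuration scored on twenty different single 80/20 splits swings noticeably, while the cross-validated mean is one stable number.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

X_d, y_d = make_classification(n_samples=2000, n_features=20, random_state=0)
model = LogisticRegression(max_iter=1000)

# Same model and hyperparameters, twenty different single splits
single_split_scores = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_d, y_d, test_size=0.2, random_state=seed
    )
    model.fit(X_tr, y_tr)
    single_split_scores.append(
        roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
    )

print(f"Single-split AUC range: {min(single_split_scores):.4f} -- "
      f"{max(single_split_scores):.4f}")
print(f"5-fold CV mean AUC:     "
      f"{cross_val_score(model, X_d, y_d, cv=5, scoring='roc_auc').mean():.4f}")
```

The spread across single splits is exactly the noise that can crown a mediocre configuration during a 100-trial search.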

Inspecting Random Search Results

The cv_results_ attribute contains everything you need for post-hoc analysis:

results = pd.DataFrame(random_search.cv_results_)
results = results.sort_values('rank_test_score')

# Top 10 configurations
print(results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']].head(10))

# How much does the best differ from 10th-best?
best = results.iloc[0]['mean_test_score']
tenth = results.iloc[9]['mean_test_score']
print(f"\nBest AUC:  {best:.4f}")
print(f"10th AUC:  {tenth:.4f}")
print(f"Gap:       {best - tenth:.4f}")
Best AUC:  0.9401
10th AUC:  0.9386
Gap:       0.0015

That 0.0015 gap between the best and 10th-best configuration is smaller than the cross-validation standard deviation. This is a strong signal that you are in the diminishing-returns zone and further tuning is unlikely to yield meaningful improvement.


Part 4: Bayesian Optimization with Optuna

Why Bayesian Optimization Exists

Random search treats every trial independently. Trial 50 has no idea what trials 1--49 discovered. This is wasteful: if the first 30 trials show that learning_rate below 0.01 always performs poorly, trial 31 should not sample learning_rate=0.003.

Bayesian optimization solves this by maintaining a surrogate model --- a probabilistic model of the relationship between hyperparameters and performance. After each trial, the surrogate model updates its beliefs about which regions of the search space are promising. An acquisition function then decides where to sample next, balancing exploration (trying under-explored regions) and exploitation (sampling near the current best).

The typical loop:

  1. Evaluate a few random configurations to initialize the surrogate model
  2. Fit the surrogate model to all observed (hyperparameters, score) pairs
  3. Use the acquisition function to pick the next configuration to evaluate
  4. Evaluate that configuration with cross-validation
  5. Update the surrogate model
  6. Repeat until budget exhausted

Optuna: The Modern Choice

Optuna is the current standard for Bayesian hyperparameter optimization in Python. It uses a Tree-structured Parzen Estimator (TPE) as its default surrogate model, supports pruning (early stopping of unpromising trials), and has an excellent visualization API.

import optuna
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Suppress Optuna's INFO logs for cleaner output
optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective(trial):
    """Optuna objective function for XGBoost tuning."""
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 2000),
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 1.0, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10.0, log=True),
        'eval_metric': 'logloss',
        'random_state': 42,
        'n_jobs': -1
    }

    model = XGBClassifier(**params)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
    return scores.mean()

# Create and run the study
study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=100, show_progress_bar=True)

print(f"Best AUC: {study.best_value:.4f}")
print(f"Best parameters:")
for key, value in study.best_params.items():
    print(f"  {key}: {value}")
Best AUC: 0.9412
Best parameters:
  n_estimators: 1243
  learning_rate: 0.0387
  max_depth: 6
  subsample: 0.8341
  colsample_bytree: 0.7612
  min_child_weight: 4
  reg_alpha: 0.0534
  reg_lambda: 1.2876

Optuna Pruning: Stop Wasting Time on Bad Trials

One of Optuna's most powerful features is pruning --- early stopping of trials that are performing poorly partway through training. Instead of training 2,000 boosting rounds and then discovering the configuration is bad, Optuna can stop after 200 rounds if the intermediate scores are unpromising.

For XGBoost and LightGBM, pruning composes naturally with the libraries' built-in early stopping: early stopping picks the number of boosting rounds within each fold, while Optuna prunes unpromising trials across folds:

from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold
import optuna

optuna.logging.set_verbosity(optuna.logging.WARNING)

def objective_with_pruning(trial):
    """Optuna objective with fold-level pruning and XGBoost early stopping."""
    params = {
        'n_estimators': 2000,  # Set high; early stopping will find the right number
        'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'subsample': trial.suggest_float('subsample', 0.6, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10.0, log=True),
        'eval_metric': 'logloss',
        'random_state': 42,
        'n_jobs': -1,
        'early_stopping_rounds': 50,
    }

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    scores = []

    for fold_idx, (train_idx, val_idx) in enumerate(cv.split(X, y)):
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]

        model = XGBClassifier(**params)
        model.fit(
            X_train, y_train,
            eval_set=[(X_val, y_val)],
            verbose=False
        )

        y_pred = model.predict_proba(X_val)[:, 1]
        score = roc_auc_score(y_val, y_pred)
        scores.append(score)

        # Report the running mean AUC so the pruner can stop bad trials early
        trial.report(np.mean(scores), fold_idx)
        if trial.should_prune():
            raise optuna.TrialPruned()

    return np.mean(scores)

study_pruned = optuna.create_study(
    direction='maximize',
    sampler=optuna.samplers.TPESampler(seed=42),
    pruner=optuna.pruners.MedianPruner(n_startup_trials=10, n_warmup_steps=1)
)
study_pruned.optimize(objective_with_pruning, n_trials=100, show_progress_bar=True)

n_pruned = len([t for t in study_pruned.trials if t.state == optuna.trial.TrialState.PRUNED])
n_complete = len([t for t in study_pruned.trials if t.state == optuna.trial.TrialState.COMPLETE])

print(f"Completed trials: {n_complete}")
print(f"Pruned trials:    {n_pruned}")
print(f"Best AUC:         {study_pruned.best_value:.4f}")
Completed trials: 67
Pruned trials:    33
Best AUC:         0.9415

One third of the trials were pruned early, each skipping its remaining cross-validation folds and saving a substantial share of the total computation time. The best score is comparable to the unpruned study --- pruning cuts cost, not quality.

Visualizing Optuna Results

Optuna ships with a visualization module that answers the questions you should be asking after a tuning study:

import optuna.visualization as vis

# 1. Which hyperparameters matter most?
fig_importance = vis.plot_param_importances(study)
fig_importance.update_layout(title="Hyperparameter Importance")
fig_importance.show()

# 2. How did the optimization progress over time?
fig_history = vis.plot_optimization_history(study)
fig_history.update_layout(title="Optimization History")
fig_history.show()

# 3. What is the relationship between each hyperparameter and performance?
fig_slice = vis.plot_slice(study, params=[
    'learning_rate', 'max_depth', 'subsample', 'colsample_bytree'
])
fig_slice.update_layout(title="Hyperparameter Slice Plots")
fig_slice.show()

# 4. How do pairs of hyperparameters interact?
fig_contour = vis.plot_contour(study, params=['learning_rate', 'max_depth'])
fig_contour.update_layout(title="Contour: learning_rate vs max_depth")
fig_contour.show()

Practitioner Note --- The plot_param_importances visualization is the single most useful output from a tuning study. If it shows that learning_rate and max_depth account for 80% of the variance in performance, and reg_alpha accounts for 1%, you know not to waste time fine-tuning reg_alpha. Focus your effort on the parameters that matter.

Reading the Importance Plot

A typical plot_param_importances result for an XGBoost tuning study looks like:

Hyperparameter Importance (fANOVA):
  learning_rate:     0.42
  max_depth:         0.28
  n_estimators:      0.12
  subsample:         0.07
  colsample_bytree:  0.05
  min_child_weight:  0.03
  reg_lambda:        0.02
  reg_alpha:         0.01

Learning rate dominates. This is consistent across almost every gradient boosting tuning study I have seen. If you can tune only one hyperparameter, tune the learning rate. If you can tune two, add max_depth (or num_leaves for LightGBM). Everything else is refinement.


Part 5: HalvingGridSearchCV and HalvingRandomSearchCV

Successive Halving: Tournament-Style Tuning

Scikit-learn provides HalvingGridSearchCV and HalvingRandomSearchCV --- resource-efficient alternatives that use a tournament metaphor. Start with many candidate configurations and a small resource budget (few training samples or few estimators). Evaluate all candidates. Eliminate the bottom half. Double the resource budget. Repeat until one candidate remains.
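Before running the search, it helps to write out the tournament arithmetic. A back-of-envelope sketch, assuming the configuration used in this section (128 candidates, factor 2, 500 starting samples, 15,000 samples total):

```python
n_candidates, resources = 128, 500
factor, max_resources = 2, 15000

total_cost = 0  # samples trained on per fit, summed over candidates
while n_candidates > 1 and resources < max_resources:
    print(f"{n_candidates:3d} candidates x {resources:6,d} samples")
    total_cost += n_candidates * resources
    n_candidates = max(1, n_candidates // factor)
    resources = min(resources * factor, max_resources)
print(f"{n_candidates:3d} candidates x {resources:6,d} samples (final round)")
total_cost += n_candidates * resources

# Compare with evaluating all 128 candidates on the full dataset
full_cost = 128 * max_resources
print(f"Halving cost / full cost: {total_cost / full_cost:.2f}")
```

The schedule this sketch prints roughly matches the rounds reported by the real search, and the total cost comes out to about a fifth of training all 128 candidates on the full dataset.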

from sklearn.experimental import enable_halving_search_cv  # noqa
from sklearn.model_selection import HalvingRandomSearchCV
from scipy.stats import randint, uniform, loguniform

param_distributions_halving = {
    'n_estimators': randint(100, 1500),
    'learning_rate': loguniform(1e-3, 0.5),
    'max_depth': randint(3, 12),
    'subsample': uniform(0.6, 0.4),
    'colsample_bytree': uniform(0.5, 0.5),
    'min_child_weight': randint(1, 10),
}

halving_search = HalvingRandomSearchCV(
    estimator=XGBClassifier(
        eval_metric='logloss', random_state=42, n_jobs=-1
    ),
    param_distributions=param_distributions_halving,
    n_candidates=128,  # Start with 128 candidates
    factor=2,          # Halve each round
    resource='n_samples',  # Double training data each round
    min_resources=500,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring='roc_auc',
    random_state=42,
    verbose=1
)

halving_search.fit(X, y)

print(f"Best AUC: {halving_search.best_score_:.4f}")
print(f"Best parameters: {halving_search.best_params_}")
print(f"Number of iterations: {halving_search.n_iterations_}")
n_candidates=128, n_resources=500, n_candidates_to_select=64
n_candidates=64, n_resources=1000, n_candidates_to_select=32
n_candidates=32, n_resources=2000, n_candidates_to_select=16
n_candidates=16, n_resources=4000, n_candidates_to_select=8
n_candidates=8, n_resources=8000, n_candidates_to_select=4
n_candidates=4, n_resources=15000, n_candidates_to_select=1
Best AUC: 0.9394
Best parameters: {'colsample_bytree': 0.7456, 'learning_rate': 0.0523, 'max_depth': 6, 'min_child_weight': 4, 'n_estimators': 1187, 'subsample': 0.8234}
Number of iterations: 6

Key Advantage --- Halving search evaluated 128 candidates but only trained the final few on the full dataset. The total computation is roughly equivalent to training 30--40 full models, compared to 128 for standard random search. On large datasets, this speedup is substantial.

When to Use Halving

Halving search shines when:

  • Your dataset is large (100K+ rows) and training is the bottleneck
  • You have many candidate configurations to evaluate
  • You are tuning a model that benefits from more data (gradient boosting, not KNN)

It struggles when:

  • The model's performance ranking changes dramatically between small and large datasets (a configuration that is best on 500 samples might not be best on 15,000)
  • The dataset is already small --- there is no resource to halve

Part 6: The Correct Tuning Sequence

The Three-Step Protocol

After years of watching teams tune models, I have settled on a protocol that balances thoroughness with efficiency:

Step 1: Establish the Default Baseline

from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Pure defaults
baseline = XGBClassifier(eval_metric='logloss', random_state=42, n_jobs=-1)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
baseline_scores = cross_val_score(baseline, X, y, cv=cv, scoring='roc_auc')
print(f"Default baseline AUC: {baseline_scores.mean():.4f} +/- {baseline_scores.std():.4f}")

Step 2: Random Search for the Ballpark (50--100 trials)

Random search finds the right region of the search space. You are not looking for the optimal configuration; you are eliminating vast swaths of bad parameter space.

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform, loguniform

rough_params = {
    'n_estimators': randint(100, 2000),
    'learning_rate': loguniform(1e-3, 0.5),
    'max_depth': randint(3, 12),
    'subsample': uniform(0.6, 0.4),
    'colsample_bytree': uniform(0.5, 0.5),
    'min_child_weight': randint(1, 10),
    'reg_alpha': loguniform(1e-3, 10.0),
    'reg_lambda': loguniform(1e-3, 10.0),
}

rough_search = RandomizedSearchCV(
    XGBClassifier(eval_metric='logloss', random_state=42, n_jobs=-1),
    rough_params, n_iter=80, cv=cv, scoring='roc_auc',
    random_state=42, n_jobs=-1
)
rough_search.fit(X, y)
print(f"Random search AUC: {rough_search.best_score_:.4f}")

Step 3: Bayesian Optimization to Refine (50--200 trials, if warranted)

Use the random search results to narrow the search space, then let Optuna refine:

import optuna

optuna.logging.set_verbosity(optuna.logging.WARNING)

# Narrow the search space based on random search results
best_rs = rough_search.best_params_

def objective_refined(trial):
    params = {
        # Center ranges around random search best
        'n_estimators': trial.suggest_int('n_estimators', 500, 2000),
        'learning_rate': trial.suggest_float(
            'learning_rate',
            best_rs['learning_rate'] * 0.3,
            min(best_rs['learning_rate'] * 3.0, 0.5),
            log=True
        ),
        'max_depth': trial.suggest_int(
            'max_depth',
            max(3, best_rs['max_depth'] - 2),
            min(12, best_rs['max_depth'] + 2)
        ),
        'subsample': trial.suggest_float('subsample', 0.65, 0.95),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.55, 0.95),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 8),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 5.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.1, 5.0, log=True),
        'eval_metric': 'logloss',
        'random_state': 42,
        'n_jobs': -1,
    }
    model = XGBClassifier(**params)
    scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
    return scores.mean()

study_refined = optuna.create_study(
    direction='maximize',
    sampler=optuna.samplers.TPESampler(seed=42)
)
study_refined.optimize(objective_refined, n_trials=100)
print(f"Refined Optuna AUC: {study_refined.best_value:.4f}")

Comparing the Steps

print("Tuning progression:")
print(f"  Step 1 (defaults):       {baseline_scores.mean():.4f}")
print(f"  Step 2 (random search):  {rough_search.best_score_:.4f}  (+{rough_search.best_score_ - baseline_scores.mean():.4f})")
print(f"  Step 3 (Bayesian opt):   {study_refined.best_value:.4f}  (+{study_refined.best_value - rough_search.best_score_:.4f})")
print(f"\n  Total improvement:       {study_refined.best_value - baseline_scores.mean():.4f}")
print(f"  Step 2 share:            {((rough_search.best_score_ - baseline_scores.mean()) / (study_refined.best_value - baseline_scores.mean()) * 100):.0f}%")
print(f"  Step 3 share:            {((study_refined.best_value - rough_search.best_score_) / (study_refined.best_value - baseline_scores.mean()) * 100):.0f}%")
Tuning progression:
  Step 1 (defaults):       0.9247
  Step 2 (random search):  0.9401  (+0.0154)
  Step 3 (Bayesian opt):   0.9416  (+0.0015)

  Total improvement:       0.0169
  Step 2 share:            91%
  Step 3 share:            9%

Random search captured 91% of the total tuning improvement. Bayesian optimization added 9%. This pattern is typical. The random search does the heavy lifting; Bayesian optimization polishes.


Part 7: When to Stop Tuning

The Diminishing Returns Signal

How do you know when to stop? Watch for these signals:

1. The top-N configurations have similar performance.

# After any tuning study
results = pd.DataFrame(random_search.cv_results_).sort_values('rank_test_score')
top_20 = results.head(20)

print(f"Best AUC:  {top_20.iloc[0]['mean_test_score']:.4f}")
print(f"20th AUC:  {top_20.iloc[19]['mean_test_score']:.4f}")
print(f"Gap:       {top_20.iloc[0]['mean_test_score'] - top_20.iloc[19]['mean_test_score']:.4f}")
print(f"CV Std:    {top_20.iloc[0]['std_test_score']:.4f}")

gap = top_20.iloc[0]['mean_test_score'] - top_20.iloc[19]['mean_test_score']
std = top_20.iloc[0]['std_test_score']
print(f"\nGap < CV Std?  {gap < std}  --> {'Stop tuning' if gap < std else 'Continue tuning'}")
Best AUC:  0.9401
20th AUC:  0.9378
Gap:       0.0023
CV Std:    0.0042

Gap < CV Std?  True  --> Stop tuning

When the gap between the best and 20th-best configuration is smaller than the cross-validation standard deviation, you are in noise territory. Further tuning is moving the decimal point, not improving the model.

2. The optimization history has plateaued.

# After an Optuna study
import optuna.visualization as vis

fig = vis.plot_optimization_history(study)
fig.show()

# Programmatic check: compare first half vs. second half of trials
trial_values = [t.value for t in study.trials if t.value is not None]
first_half = np.max(trial_values[:len(trial_values)//2])
second_half = np.max(trial_values[len(trial_values)//2:])
improvement = second_half - first_half
print(f"Best in first 50 trials:  {first_half:.4f}")
print(f"Best in last 50 trials:   {second_half:.4f}")
print(f"Improvement:              {improvement:.4f}")

3. The tuning improvement is small relative to the feature engineering improvement.

If adding one new feature improved AUC by 0.02 and three weeks of tuning improved AUC by 0.002, go find another feature.

The Decision Framework

Should I keep tuning?
    |
    v
Is the gap between best and 20th-best > the CV standard deviation?
    |           |
    YES         NO --> Stop. You are in noise territory.
    |
    v
Has the Optuna optimization history improved in the last 30 trials?
    |           |
    YES         NO --> Stop. The search has converged.
    |
    v
Is the marginal AUC improvement worth the compute cost?
    |           |
    YES         NO --> Stop. Diminishing returns.
    |
    v
Continue tuning. Increase n_trials by 50.
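The flowchart condenses into a small helper function. This is an illustrative sketch, not a library API; the function name, inputs, and the way "worth the compute cost" is quantified are all assumptions you would adapt to your own setting:

```python
def should_keep_tuning(gap_to_20th, cv_std, recent_improvement,
                       marginal_value, compute_cost):
    """Illustrative stop-tuning decision, mirroring the flowchart.
    Returns a (keep_tuning, reason) pair."""
    if gap_to_20th <= cv_std:
        return False, "Stop: gap within CV noise"
    if recent_improvement <= 0:
        return False, "Stop: the search has converged"
    if marginal_value < compute_cost:
        return False, "Stop: diminishing returns"
    return True, "Continue: increase n_trials by 50"

# Using the numbers from the top-20 check earlier in this part
print(should_keep_tuning(gap_to_20th=0.0023, cv_std=0.0042,
                         recent_improvement=0.0001,
                         marginal_value=1.0, compute_cost=5.0))
```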

What to Do Instead of Tuning

When tuning has plateaued, redirect your effort to higher-leverage activities:

  1. Engineer new features. Domain-specific features almost always beat tuning. Interaction terms, time-since-event features, aggregation features, and ratio features are reliable sources of improvement.

  2. Get more data. If your learning curve shows the validation score still climbing, more training data will help more than more tuning.

  3. Fix data quality. Missing data patterns, label noise, and feature leakage are more impactful than any hyperparameter.

  4. Ensemble different model families. A blend of XGBoost and LightGBM with different feature subsets often beats either one alone by 0.5--1.0%, which is more than refined tuning typically provides.

  5. Move to deployment. A model that is 0.5% worse but deployed three weeks earlier creates three weeks of business value that the "perfect" model never will.
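Item 4, blending model families, can be as simple as averaging predicted probabilities. The sketch below uses two scikit-learn models on synthetic data as stand-ins for XGBoost and LightGBM; the 50/50 blend weight is an assumption you would validate on held-out data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

gbm = GradientBoostingClassifier(random_state=42).fit(X_tr, y_tr)
rf = RandomForestClassifier(n_estimators=300, random_state=42).fit(X_tr, y_tr)

p_gbm = gbm.predict_proba(X_te)[:, 1]
p_rf = rf.predict_proba(X_te)[:, 1]
p_blend = 0.5 * (p_gbm + p_rf)  # simple average of predicted probabilities

print(f"GBM AUC:   {roc_auc_score(y_te, p_gbm):.4f}")
print(f"RF AUC:    {roc_auc_score(y_te, p_rf):.4f}")
print(f"Blend AUC: {roc_auc_score(y_te, p_blend):.4f}")
```

The blend helps most when the two models make different errors, which is why mixing model families (and feature subsets) works better than blending two tunes of the same model.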


Part 8: Tuning Specific Model Families

XGBoost / LightGBM / CatBoost

For gradient boosting, focus tuning effort here (in order of importance):

| Hyperparameter | XGBoost | LightGBM | Typical range |
|---|---|---|---|
| Learning rate | learning_rate | learning_rate | 0.01--0.3 (log) |
| Tree complexity | max_depth | num_leaves | 3--10 / 15--127 |
| Number of trees | n_estimators | n_estimators | Use early stopping |
| Row sampling | subsample | bagging_fraction | 0.6--1.0 |
| Column sampling | colsample_bytree | feature_fraction | 0.5--1.0 |
| Min leaf weight | min_child_weight | min_child_samples | 1--10 / 5--100 |
| L1 regularization | reg_alpha | lambda_l1 | 1e-3--10 (log) |
| L2 regularization | reg_lambda | lambda_l2 | 1e-3--10 (log) |

Practitioner Note --- LightGBM uses num_leaves instead of max_depth to control tree complexity. A tree with max_depth=6 has up to 64 leaves (2^6). Setting num_leaves=40 creates a tree with similar capacity but a more flexible structure. The rule of thumb: num_leaves should be less than 2^max_depth to avoid overfitting. Start with num_leaves around 31 and tune from there.
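The ranges in the table translate directly into a search space for RandomizedSearchCV, using scipy's distributions for the log-scaled ranges. This sketch follows the XGBoost column; `xgb_space` is an illustrative name:

```python
from scipy.stats import loguniform, randint, uniform

# XGBoost-style search space mirroring the table above
xgb_space = {
    'learning_rate': loguniform(0.01, 0.3),   # log scale
    'max_depth': randint(3, 11),              # 3--10 inclusive
    'subsample': uniform(0.6, 0.4),           # 0.6--1.0 (loc, width)
    'colsample_bytree': uniform(0.5, 0.5),    # 0.5--1.0
    'min_child_weight': randint(1, 11),       # 1--10
    'reg_alpha': loguniform(1e-3, 10),        # log scale
    'reg_lambda': loguniform(1e-3, 10),       # log scale
}

# Spot-check: draw one sample from each distribution
sample = {name: dist.rvs(random_state=0) for name, dist in xgb_space.items()}
print(sample)
```

Note that `scipy.stats.uniform(loc, scale)` samples from [loc, loc + scale], so `uniform(0.6, 0.4)` covers 0.6--1.0; n_estimators is deliberately absent because it is handled by early stopping, as described below.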

For n_estimators, do not grid-search it. Set it high (1,000--5,000) and use early stopping to find the right number:

from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = XGBClassifier(
    n_estimators=5000,
    learning_rate=0.05,
    max_depth=6,
    early_stopping_rounds=50,
    eval_metric='logloss',
    random_state=42,
    n_jobs=-1
)

model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(f"Best iteration: {model.best_iteration}")
print(f"Actual trees used: {model.best_iteration + 1}")
Best iteration: 487
Actual trees used: 488

The model stopped at 488 trees, far short of the 5,000 maximum. Early stopping found the right number automatically.

Random Forest

Random Forests are far less sensitive to hyperparameters than gradient boosting. The defaults in scikit-learn are usually close to optimal. If you tune, focus on:

| Hyperparameter | Default | Typical range | Notes |
|---|---|---|---|
| n_estimators | 100 | 200--1000 | More is almost always better (no overfitting) |
| max_depth | None | 10--30 or None | None = grow until pure leaves |
| max_features | 'sqrt' | 'sqrt', 'log2', 0.3--0.8 | Controls decorrelation |
| min_samples_leaf | 1 | 1--10 | Regularization |
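A light tune over these ranges rarely needs more than a handful of random draws. The sketch below runs a small RandomizedSearchCV on synthetic data; the grid values and trial count are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=600, n_features=20, random_state=42)

rf_space = {
    'n_estimators': [200, 500],
    'max_depth': [10, 20, 30, None],
    'max_features': ['sqrt', 'log2', 0.5],
    'min_samples_leaf': [1, 2, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42, n_jobs=-1),
    rf_space, n_iter=8, cv=3, scoring='roc_auc',
    random_state=42, n_jobs=-1
)
search.fit(X, y)
print(f"Best AUC:    {search.best_score_:.4f}")
print(f"Best params: {search.best_params_}")
```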

Logistic Regression and SVM

These have fewer hyperparameters, and grid search is perfectly adequate:

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Logistic regression: tune C and penalty
lr_grid = {
    'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0],
    'penalty': ['l1', 'l2'],
    'solver': ['saga']
}

# SVM: tune C and gamma
svm_grid = {
    'C': [0.1, 1.0, 10.0, 100.0],
    'gamma': ['scale', 'auto', 0.001, 0.01, 0.1],
    'kernel': ['rbf']
}

For simple models with 1--2 hyperparameters, use grid search. Save Bayesian optimization for models with 5+ hyperparameters.
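Running either grid is straightforward with GridSearchCV. The sketch below fits the logistic regression grid on synthetic data, with scaling inside a Pipeline (both logistic regression with saga and SVMs need scaled inputs); the `clf__` prefix routes parameters to the pipeline step:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=800, n_features=15, random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(solver='saga', max_iter=5000, random_state=42)),
])

# Pipeline parameters are addressed as <step_name>__<param_name>
lr_grid = {
    'clf__C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0],
    'clf__penalty': ['l1', 'l2'],
}

search = GridSearchCV(pipe, lr_grid, cv=5, scoring='roc_auc', n_jobs=-1)
search.fit(X, y)
print(f"Best AUC:    {search.best_score_:.4f}")
print(f"Best params: {search.best_params_}")
```

This grid is 6 x 2 = 12 combinations, so 5-fold CV costs 60 fits, cheap enough that random or Bayesian search would add nothing.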


Part 9: Reproducibility and Logging

Recording Your Tuning Results

Tuning without recording is gambling without keeping score. At minimum, log:

import json
from datetime import datetime

tuning_log = {
    'timestamp': datetime.now().isoformat(),
    'model': 'XGBClassifier',
    'dataset': 'streamflow_churn_v3',
    'n_samples': len(X),
    'n_features': X.shape[1],
    'method': 'Optuna TPE',
    'n_trials': 100,
    'cv_strategy': 'StratifiedKFold(n_splits=5)',
    'scoring': 'roc_auc',
    'baseline_score': float(baseline_scores.mean()),
    'best_score': float(study.best_value),
    'improvement': float(study.best_value - baseline_scores.mean()),
    'best_params': study.best_params,
    'random_state': 42,
    'wall_time_seconds': sum(
        t.duration.total_seconds()
        for t in study.trials if t.duration is not None
    ),
}

with open('tuning_log.json', 'w') as f:
    json.dump(tuning_log, f, indent=2, default=str)

Reproducibility --- Always set random_state in both the model and the cross-validation splitter. Set seed in the Optuna sampler. Record the library versions. Tuning is stochastic; without fixed seeds, you cannot reproduce your results. This is discussed further in Chapter 10 (Reproducible Data Pipelines) and Chapter 30 (Experiment Tracking).
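Recording library versions takes a few lines with the standard library. The package list below is an assumption matching the tools this chapter uses; adjust it to your environment and merge the result into the tuning log before writing it out:

```python
from importlib.metadata import version, PackageNotFoundError

packages = ['scikit-learn', 'xgboost', 'lightgbm', 'optuna', 'numpy', 'pandas']
versions = {}
for pkg in packages:
    try:
        versions[pkg] = version(pkg)
    except PackageNotFoundError:
        versions[pkg] = 'not installed'

print(versions)  # e.g. add as tuning_log['library_versions'] = versions
```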


Progressive Project M8: StreamFlow Churn Tuning

Your Task

Apply the three-step tuning protocol to StreamFlow's best-performing churn model from Chapters 14 and 17.

Deliverables:

  1. Default baseline AUC from the untuned model
  2. Random search results (80 trials) with the best AUC and parameters
  3. Optuna Bayesian optimization (100 trials) with pruning
  4. plot_param_importances visualization showing which hyperparameters matter most
  5. Diminishing returns analysis: compare improvement from Step 1-->2 vs. Step 2-->3
  6. Final tuned model saved as streamflow_tuned_model.json

Key question to answer: Was the Bayesian optimization step worth the additional compute? At what trial count did the optimization effectively plateau?

Getting Started

# Start from your Chapter 17 pipeline
# Load the StreamFlow dataset with SMOTE or class_weight adjustments
# Use the model that won the Chapter 14 comparison (likely XGBoost or LightGBM)

# Step 1: Cross-validate with defaults
# Step 2: RandomizedSearchCV with 80 trials
# Step 3: Optuna with 100 trials and pruning
# Step 4: Compare all three and generate the importance plot
# Step 5: Write a 1-paragraph summary of whether the tuning was worth it

Summary

Hyperparameter tuning is the art of extracting the last few percentage points of performance from a model that is already good. The correct sequence is: defaults first, random search for the ballpark, Bayesian optimization to refine. Grid search is mostly obsolete for high-dimensional spaces. Optuna with pruning is the current best tool for efficient Bayesian optimization. But the most important skill is knowing when to stop: when the gap between the best and 20th-best configuration is smaller than the cross-validation standard deviation, further tuning is chasing noise, not signal. Spend your time on features, data quality, and deployment instead.


Next chapter: Chapter 19: Model Interpretation --- opening the black box.