In This Chapter
- Grid Search, Random Search, Bayesian Optimization, and When to Stop
- The Most Over-Invested Activity in Amateur Data Science
- Part 1: Parameters vs. Hyperparameters
- Part 2: Grid Search --- The Brute-Force Baseline
- Part 3: Random Search --- The Better Default
- Part 4: Bayesian Optimization with Optuna
- Part 5: HalvingGridSearchCV and HalvingRandomSearchCV
- Part 6: The Correct Tuning Sequence
- Part 7: When to Stop Tuning
- Part 8: Tuning Specific Model Families
- Part 9: Reproducibility and Logging
- Progressive Project M8: StreamFlow Churn Tuning
- Summary
Chapter 18: Hyperparameter Tuning
Grid Search, Random Search, Bayesian Optimization, and When to Stop
Learning Objectives
By the end of this chapter, you will be able to:
- Distinguish hyperparameters from parameters and explain why tuning matters
- Implement grid search and random search with cross-validation
- Apply Bayesian optimization (Optuna) for efficient hyperparameter search
- Use Halving search for resource-efficient tuning
- Know when tuning provides diminishing returns and when to stop
The Most Over-Invested Activity in Amateur Data Science
Core Principle --- Hyperparameter tuning is important. It is also the single most over-invested activity in amateur data science. I have watched teams spend three weeks tuning XGBoost's
max_depth from 6 to 7 and back again, chasing a 0.002 AUC improvement, while their training data had a feature leakage bug that was worth 0.08 AUC. I have seen a data scientist run 10,000-trial Bayesian optimization on a logistic regression --- a model with two meaningful hyperparameters --- because "more trials is better." I have reviewed production models where the tuning log showed 48 hours of GPU time and the final model was 0.3% better than the default.
Here is the uncomfortable truth, expressed as a rough hierarchy of impact:
| Activity | Typical AUC improvement |
|---|---|
| Better features | 5--20% |
| Fixing data quality issues | 3--15% |
| Choosing the right model family | 2--8% |
| Default hyperparameters --> rough tuning | 2--5% |
| Rough tuning --> perfect tuning | 0.1--0.5% |
The last row is where most of the tuning time gets spent. The first row is where most of the performance improvement lives. This does not mean tuning is worthless --- 2--5% from a rough tune is real, and on a production model where small improvements translate to millions of dollars, even 0.1% matters. But it means you should tune after you have exhausted higher-leverage activities, and you should know when to stop.
This chapter teaches the correct sequence: start with defaults, do a rough random search, refine with Bayesian optimization if the stakes justify it, and stop when the returns diminish. By the end, you will be a disciplined, efficient tuner --- not a hyperparameter gambler.
Part 1: Parameters vs. Hyperparameters
What Gets Learned vs. What You Set
The distinction is fundamental but often muddy in practice. Parameters are learned from data during training. Hyperparameters are set before training begins and control how learning happens.
| | Parameters | Hyperparameters |
|---|---|---|
| Set by | The training algorithm | The practitioner |
| When | During .fit() | Before .fit() |
| Examples | Weights, coefficients, split thresholds | Learning rate, max_depth, C, n_estimators |
| Optimized by | Gradient descent, impurity reduction | Grid search, random search, Bayesian opt. |
| Stored in | The trained model | The model constructor |
For a Random Forest, the parameters are the split points and thresholds in every tree --- thousands of values that the algorithm discovers from the data. The hyperparameters are n_estimators, max_depth, min_samples_split, max_features, and a handful of others that you specify before training.
For an XGBoost model, the parameters are the leaf weights in each boosted tree. The hyperparameters are learning_rate, max_depth, n_estimators, subsample, colsample_bytree, reg_alpha, reg_lambda, and min_child_weight.
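The split is visible directly in code. A minimal sketch with scikit-learn's LogisticRegression (the toy dataset is purely illustrative): hyperparameters are readable from the constructor before training, while parameters such as coef_ exist only after .fit().

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy data purely for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# Hyperparameters: chosen by you, visible before any training
model = LogisticRegression(C=0.5, max_iter=1000)
print(model.get_params()['C'])   # 0.5

# Parameters: learned from data, exist only after .fit()
model.fit(X_demo, y_demo)
print(model.coef_.shape)         # (1, 5) -- one learned weight per feature
```

Tuning searches over the first kind of value; the training algorithm handles the second.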
Why Hyperparameters Matter (and When They Do Not)
Hyperparameters control model complexity, regularization, and optimization behavior. Setting them poorly can cause underfitting (too much regularization, too few estimators) or overfitting (too little regularization, trees too deep). The default values in scikit-learn, XGBoost, and LightGBM are competent --- they represent the library authors' best guess at a reasonable starting point for a wide range of problems. But defaults are generic. Your data is specific.
The question is: how much improvement can tuning buy you?
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, StratifiedKFold
from xgboost import XGBClassifier
# Generate a moderately challenging classification problem
X, y = make_classification(
n_samples=15000, n_features=25, n_informative=15,
n_redundant=5, flip_y=0.08, class_sep=0.8,
weights=[0.85, 0.15], random_state=42
)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# XGBoost with pure defaults
default_model = XGBClassifier(
eval_metric='logloss', random_state=42, n_jobs=-1
)
default_scores = cross_val_score(
default_model, X, y, cv=cv, scoring='roc_auc'
)
# XGBoost with reasonable manual tuning
tuned_model = XGBClassifier(
n_estimators=500,
learning_rate=0.05,
max_depth=5,
subsample=0.8,
colsample_bytree=0.8,
min_child_weight=3,
reg_alpha=0.1,
reg_lambda=1.0,
eval_metric='logloss',
random_state=42,
n_jobs=-1
)
tuned_scores = cross_val_score(
tuned_model, X, y, cv=cv, scoring='roc_auc'
)
print("Default XGBoost:")
print(f" AUC: {default_scores.mean():.4f} +/- {default_scores.std():.4f}")
print(f"\nRough-tuned XGBoost:")
print(f" AUC: {tuned_scores.mean():.4f} +/- {tuned_scores.std():.4f}")
print(f"\nImprovement: {(tuned_scores.mean() - default_scores.mean()):.4f}")
Default XGBoost:
AUC: 0.9247 +/- 0.0052
Rough-tuned XGBoost:
AUC: 0.9389 +/- 0.0038
Improvement: 0.0142
That 0.014 improvement is real and meaningful --- about 1.4 percentage points of AUC. But notice: the defaults already gave us 0.925. A 1.4-point improvement on top of an already-good model is worth pursuing if the business impact justifies the engineering time. It is not worth three weeks of someone's salary.
Part 2: Grid Search --- The Brute-Force Baseline
How GridSearchCV Works
Grid search is the simplest tuning method: define a set of values for each hyperparameter, try every combination, and keep the best one. Scikit-learn's GridSearchCV wraps this in cross-validation, so each combination is evaluated on multiple folds.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
param_grid = {
'n_estimators': [100, 200, 500],
'max_depth': [5, 10, 15, None],
'min_samples_split': [2, 5, 10],
'max_features': ['sqrt', 'log2']
}
# Total combinations: 3 * 4 * 3 * 2 = 72
# With 5-fold CV: 72 * 5 = 360 model fits
grid_search = GridSearchCV(
estimator=RandomForestClassifier(random_state=42, n_jobs=-1),
param_grid=param_grid,
cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
scoring='roc_auc',
n_jobs=-1,
verbose=1,
return_train_score=True
)
grid_search.fit(X, y)
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best AUC: {grid_search.best_score_:.4f}")
Fitting 5 folds for each of 72 candidates, totalling 360 fits
Best parameters: {'max_depth': 15, 'max_features': 'sqrt', 'min_samples_split': 5, 'n_estimators': 500}
Best AUC: 0.9318
When Grid Search Works
Grid search works when:
- You have 2--3 hyperparameters to tune. The total number of combinations is the product of grid sizes. Two parameters with 5 values each: 25 combinations. Three parameters with 5 values each: 125 combinations. Five parameters with 5 values each: 3,125 combinations. The combinatorial explosion is real.
- Each model trains fast. 360 fits of a Random Forest on 15,000 samples takes under a minute. 360 fits of an XGBoost with 2,000 estimators on 500,000 samples takes hours.
- You know the approximate range. Grid search is good at refining a ballpark, not at exploring a wide search space. If max_depth should be between 3 and 15, a grid of [3, 5, 7, 10, 15] is reasonable. If you have no idea whether it should be 2 or 200, grid search will waste most of its budget on bad regions.
When Grid Search Fails
Warning --- Grid search is mostly obsolete for high-dimensional hyperparameter spaces. If you are tuning 5 or more hyperparameters, skip ahead to random search or Bayesian optimization. Grid search will either take too long or force you to use a coarse grid that misses the good regions.
The fundamental problem: grid search distributes its budget evenly across the entire grid, including regions that are obviously bad. If learning_rate=1.0 is terrible for your problem (it almost always is), grid search does not care --- it will train 5 cross-validation folds at learning_rate=1.0 for every combination of the other hyperparameters before moving on.
# This grid has 432,000 combinations (5-fold CV = 2,160,000 fits)
# Most of these combinations are awful
bloated_grid = {
'n_estimators': [100, 200, 500, 1000, 2000],
'learning_rate': [0.001, 0.01, 0.05, 0.1, 0.3, 1.0],
'max_depth': [3, 5, 7, 10, 15, None],
'subsample': [0.6, 0.7, 0.8, 0.9, 1.0],
'colsample_bytree': [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
'min_child_weight': [1, 3, 5, 7, 10],
'reg_alpha': [0, 0.01, 0.1, 1.0],
'reg_lambda': [0, 0.1, 1.0, 10.0]
}
total = 1
for v in bloated_grid.values():
total *= len(v)
print(f"Total combinations: {total:,}") # 432,000
print(f"With 5-fold CV: {total * 5:,} fits") # 2,160,000
Total combinations: 432,000
With 5-fold CV: 2,160,000 fits
Do not do this.
Part 3: Random Search --- The Better Default
The Bergstra-Bengio Insight
In 2012, James Bergstra and Yoshua Bengio published a paper with a title that should have ended grid search forever: "Random Search for Hyper-Parameter Optimization." Their key insight was elegant: in most problems, only a few hyperparameters matter, and you do not know which ones in advance. Grid search wastes budget exploring every combination of unimportant hyperparameters. Random search samples the entire space and, by chance, explores more unique values of the important hyperparameters.
The visual intuition is powerful. Imagine a 2D grid where only the x-axis matters (say, learning rate) and the y-axis is irrelevant (say, min_child_weight). A 5x5 grid search evaluates 25 combinations but only 5 unique values of learning rate. A random search with 25 samples evaluates 25 different values of learning rate. It covers the important dimension far more efficiently.
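This counting argument is easy to verify: a 5x5 grid spends 25 trials on only 5 distinct values of the axis that matters, while 25 random trials probe 25 distinct values. A minimal sketch (the log-uniform range is illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)

# 5x5 grid over (learning_rate, min_child_weight): 25 trials,
# but only 5 distinct values along the axis that matters
grid_lr = np.repeat(np.logspace(-3, 0, 5), 5)
print(len(np.unique(grid_lr)))    # 5

# 25 random trials: every trial probes a new learning rate
random_lr = 10 ** rng.uniform(-3, 0, size=25)
print(len(np.unique(random_lr)))  # 25
```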
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint, loguniform
param_distributions = {
'n_estimators': randint(100, 2000),
'learning_rate': loguniform(1e-3, 1.0), # log-uniform: 0.001 to 1.0
'max_depth': randint(3, 15),
'subsample': uniform(0.6, 0.4), # uniform: 0.6 to 1.0
'colsample_bytree': uniform(0.5, 0.5), # uniform: 0.5 to 1.0
'min_child_weight': randint(1, 10),
'reg_alpha': loguniform(1e-3, 10.0),
'reg_lambda': loguniform(1e-3, 10.0),
}
random_search = RandomizedSearchCV(
estimator=XGBClassifier(eval_metric='logloss', random_state=42, n_jobs=-1),
param_distributions=param_distributions,
n_iter=100, # 100 random combinations
cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
scoring='roc_auc',
n_jobs=-1,
verbose=1,
random_state=42,
return_train_score=True
)
random_search.fit(X, y)
print(f"Best parameters: {random_search.best_params_}")
print(f"Best AUC: {random_search.best_score_:.4f}")
Fitting 5 folds for each of 100 candidates, totalling 500 fits
Best parameters: {'colsample_bytree': 0.7834, 'learning_rate': 0.0412, 'max_depth': 6, 'min_child_weight': 3, 'n_estimators': 847, 'reg_alpha': 0.0821, 'reg_lambda': 1.4523, 'subsample': 0.8219}
Best AUC: 0.9401
Key Design Decisions
1. Use loguniform for learning rates and regularization parameters. These operate on a logarithmic scale --- the difference between 0.001 and 0.01 matters far more than the difference between 0.9 and 1.0. A uniform distribution would waste most samples in the upper range where differences are negligible.
from scipy.stats import loguniform, uniform
import matplotlib.pyplot as plt
# Compare distributions
fig, axes = plt.subplots(1, 2, figsize=(12, 4))
samples_uniform = uniform(0.001, 0.999).rvs(10000)
samples_loguniform = loguniform(1e-3, 1.0).rvs(10000)
axes[0].hist(samples_uniform, bins=50, edgecolor='black', alpha=0.7)
axes[0].set_title('Uniform(0.001, 1.0)\n90% of samples above 0.1')
axes[0].set_xlabel('learning_rate')
axes[1].hist(samples_loguniform, bins=50, edgecolor='black', alpha=0.7)
axes[1].set_title('LogUniform(0.001, 1.0)\nEqual coverage per order of magnitude')
axes[1].set_xlabel('learning_rate')
plt.tight_layout()
plt.savefig('distribution_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
2. Set n_iter based on your budget, not your ambition. Bergstra and Bengio showed that 60 random trials finds a configuration in the top 5% of the search space with 95% probability --- assuming 1--2 important hyperparameters. For problems with 3--4 important hyperparameters, 100--200 trials is a reasonable starting point. Beyond 200, random search starts to show diminishing returns and Bayesian optimization becomes worthwhile.
3. Always use cross-validation, never a single validation split. With 100 random trials, you are looking at 100 different performance estimates. If each estimate comes from a single random split, noise in the split can make a mediocre configuration look great. Five-fold cross-validation averages out the noise and gives you more reliable rankings.
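The 60-trial rule of thumb is just geometric probability: if "good" configurations occupy the top 5% of the search space, each random trial misses them with probability 0.95, so n independent trials all miss with probability 0.95^n. A quick check:

```python
# Chance that at least one of n random trials lands in the top 5%
# of the search space: 1 - 0.95**n
for n in (20, 60, 100):
    print(f"{n} trials: {1 - 0.95 ** n:.3f}")
# 20 trials: 0.642
# 60 trials: 0.954
# 100 trials: 0.994
```

Note how quickly the curve flattens: going from 60 to 100 trials buys only four more percentage points of coverage probability.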
Inspecting Random Search Results
The cv_results_ attribute contains everything you need for post-hoc analysis:
results = pd.DataFrame(random_search.cv_results_)
results = results.sort_values('rank_test_score')
# Top 10 configurations
print(results[['params', 'mean_test_score', 'std_test_score', 'rank_test_score']].head(10))
# How much does the best differ from 10th-best?
best = results.iloc[0]['mean_test_score']
tenth = results.iloc[9]['mean_test_score']
print(f"\nBest AUC: {best:.4f}")
print(f"10th AUC: {tenth:.4f}")
print(f"Gap: {best - tenth:.4f}")
Best AUC: 0.9401
10th AUC: 0.9386
Gap: 0.0015
That 0.0015 gap between the best and 10th-best configuration is smaller than the cross-validation standard deviation. This is a strong signal that you are in the diminishing-returns zone and further tuning is unlikely to yield meaningful improvement.
Part 4: Bayesian Optimization with Optuna
Why Bayesian Optimization Exists
Random search treats every trial independently. Trial 50 has no idea what trials 1--49 discovered. This is wasteful: if the first 30 trials show that learning_rate below 0.01 always performs poorly, trial 31 should not sample learning_rate=0.003.
Bayesian optimization solves this by maintaining a surrogate model --- a probabilistic model of the relationship between hyperparameters and performance. After each trial, the surrogate model updates its beliefs about which regions of the search space are promising. An acquisition function then decides where to sample next, balancing exploration (trying under-explored regions) and exploitation (sampling near the current best).
The typical loop:
- Evaluate a few random configurations to initialize the surrogate model
- Fit the surrogate model to all observed (hyperparameters, score) pairs
- Use the acquisition function to pick the next configuration to evaluate
- Evaluate that configuration with cross-validation
- Update the surrogate model
- Repeat until budget exhausted
Optuna: The Modern Choice
Optuna is the current standard for Bayesian hyperparameter optimization in Python. It uses a Tree-structured Parzen Estimator (TPE) as its default surrogate model, supports pruning (early stopping of unpromising trials), and has an excellent visualization API.
import optuna
from sklearn.model_selection import cross_val_score, StratifiedKFold
# Suppress Optuna's INFO logs for cleaner output
optuna.logging.set_verbosity(optuna.logging.WARNING)
def objective(trial):
"""Optuna objective function for XGBoost tuning."""
params = {
'n_estimators': trial.suggest_int('n_estimators', 100, 2000),
'learning_rate': trial.suggest_float('learning_rate', 1e-3, 1.0, log=True),
'max_depth': trial.suggest_int('max_depth', 3, 12),
'subsample': trial.suggest_float('subsample', 0.6, 1.0),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10.0, log=True),
'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10.0, log=True),
'eval_metric': 'logloss',
'random_state': 42,
'n_jobs': -1
}
model = XGBClassifier(**params)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
return scores.mean()
# Create and run the study
study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler(seed=42))
study.optimize(objective, n_trials=100, show_progress_bar=True)
print(f"Best AUC: {study.best_value:.4f}")
print(f"Best parameters:")
for key, value in study.best_params.items():
print(f" {key}: {value}")
Best AUC: 0.9412
Best parameters:
n_estimators: 1243
learning_rate: 0.0387
max_depth: 6
subsample: 0.8341
colsample_bytree: 0.7612
min_child_weight: 4
reg_alpha: 0.0534
reg_lambda: 1.2876
Optuna Pruning: Stop Wasting Time on Bad Trials
One of Optuna's most powerful features is pruning --- early stopping of trials that are performing poorly partway through training. Instead of training 2,000 boosting rounds and then discovering the configuration is bad, Optuna can stop after 200 rounds if the intermediate scores are unpromising.
For XGBoost and LightGBM, this integrates naturally with the built-in early stopping callbacks:
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)
def objective_with_pruning(trial):
"""Optuna objective with XGBoost pruning callback."""
params = {
'n_estimators': 2000, # Set high; early stopping will find the right number
'learning_rate': trial.suggest_float('learning_rate', 1e-3, 0.3, log=True),
'max_depth': trial.suggest_int('max_depth', 3, 10),
'subsample': trial.suggest_float('subsample', 0.6, 1.0),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10.0, log=True),
'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10.0, log=True),
'eval_metric': 'logloss',
'random_state': 42,
'n_jobs': -1,
'early_stopping_rounds': 50,
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = []
for fold_idx, (train_idx, val_idx) in enumerate(cv.split(X, y)):
X_train, X_val = X[train_idx], X[val_idx]
y_train, y_val = y[train_idx], y[val_idx]
model = XGBClassifier(**params)
model.fit(
X_train, y_train,
eval_set=[(X_val, y_val)],
verbose=False
)
from sklearn.metrics import roc_auc_score
y_pred = model.predict_proba(X_val)[:, 1]
score = roc_auc_score(y_val, y_pred)
scores.append(score)
# Report intermediate score for pruning
trial.report(np.mean(scores), fold_idx)
if trial.should_prune():
raise optuna.TrialPruned()
return np.mean(scores)
study_pruned = optuna.create_study(
direction='maximize',
sampler=optuna.samplers.TPESampler(seed=42),
pruner=optuna.pruners.MedianPruner(n_startup_trials=10, n_warmup_steps=1)
)
study_pruned.optimize(objective_with_pruning, n_trials=100, show_progress_bar=True)
n_pruned = len([t for t in study_pruned.trials if t.state == optuna.trial.TrialState.PRUNED])
n_complete = len([t for t in study_pruned.trials if t.state == optuna.trial.TrialState.COMPLETE])
print(f"Completed trials: {n_complete}")
print(f"Pruned trials: {n_pruned}")
print(f"Best AUC: {study_pruned.best_value:.4f}")
Completed trials: 67
Pruned trials: 33
Best AUC: 0.9415
One third of trials were pruned early, saving roughly 30% of the total computation time. The best score is comparable to the unpruned study --- pruning cuts cost, not quality.
Visualizing Optuna Results
Optuna ships with a visualization module that answers the questions you should be asking after a tuning study:
import optuna.visualization as vis
# 1. Which hyperparameters matter most?
fig_importance = vis.plot_param_importances(study)
fig_importance.update_layout(title="Hyperparameter Importance")
fig_importance.show()
# 2. How did the optimization progress over time?
fig_history = vis.plot_optimization_history(study)
fig_history.update_layout(title="Optimization History")
fig_history.show()
# 3. What is the relationship between each hyperparameter and performance?
fig_slice = vis.plot_slice(study, params=[
'learning_rate', 'max_depth', 'subsample', 'colsample_bytree'
])
fig_slice.update_layout(title="Hyperparameter Slice Plots")
fig_slice.show()
# 4. How do pairs of hyperparameters interact?
fig_contour = vis.plot_contour(study, params=['learning_rate', 'max_depth'])
fig_contour.update_layout(title="Contour: learning_rate vs max_depth")
fig_contour.show()
Practitioner Note --- The plot_param_importances visualization is the single most useful output from a tuning study. If it shows that learning_rate and max_depth account for 80% of the variance in performance, and reg_alpha accounts for 1%, you know not to waste time fine-tuning reg_alpha. Focus your effort on the parameters that matter.
Reading the Importance Plot
A typical plot_param_importances result for an XGBoost tuning study looks like:
Hyperparameter Importance (fANOVA):
learning_rate: 0.42
max_depth: 0.28
n_estimators: 0.12
subsample: 0.07
colsample_bytree: 0.05
min_child_weight: 0.03
reg_lambda: 0.02
reg_alpha: 0.01
Learning rate dominates. This is consistent across almost every gradient boosting tuning study I have seen. If you can tune only one hyperparameter, tune the learning rate. If you can tune two, add max_depth (or num_leaves for LightGBM). Everything else is refinement.
Part 5: HalvingGridSearchCV and HalvingRandomSearchCV
Successive Halving: Tournament-Style Tuning
Scikit-learn provides HalvingGridSearchCV and HalvingRandomSearchCV --- resource-efficient alternatives that use a tournament metaphor. Start with many candidate configurations and a small resource budget (few training samples or few estimators). Evaluate all candidates. Eliminate the bottom half. Double the resource budget. Repeat until one candidate remains.
from sklearn.experimental import enable_halving_search_cv # noqa
from sklearn.model_selection import HalvingRandomSearchCV
from scipy.stats import randint, uniform, loguniform
param_distributions_halving = {
'n_estimators': randint(100, 1500),
'learning_rate': loguniform(1e-3, 0.5),
'max_depth': randint(3, 12),
'subsample': uniform(0.6, 0.4),
'colsample_bytree': uniform(0.5, 0.5),
'min_child_weight': randint(1, 10),
}
halving_search = HalvingRandomSearchCV(
estimator=XGBClassifier(
eval_metric='logloss', random_state=42, n_jobs=-1
),
param_distributions=param_distributions_halving,
n_candidates=128, # Start with 128 candidates
factor=2, # Halve each round
resource='n_samples', # Double training data each round
min_resources=500,
cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
scoring='roc_auc',
random_state=42,
verbose=1
)
halving_search.fit(X, y)
print(f"Best AUC: {halving_search.best_score_:.4f}")
print(f"Best parameters: {halving_search.best_params_}")
print(f"Number of iterations: {halving_search.n_iterations_}")
n_candidates=128, n_resources=500, n_candidates_to_select=64
n_candidates=64, n_resources=1000, n_candidates_to_select=32
n_candidates=32, n_resources=2000, n_candidates_to_select=16
n_candidates=16, n_resources=4000, n_candidates_to_select=8
n_candidates=8, n_resources=8000, n_candidates_to_select=4
n_candidates=4, n_resources=15000, n_candidates_to_select=1
Best AUC: 0.9394
Best parameters: {'colsample_bytree': 0.7456, 'learning_rate': 0.0523, 'max_depth': 6, 'min_child_weight': 4, 'n_estimators': 1187, 'subsample': 0.8234}
Number of iterations: 6
Key Advantage --- Halving search evaluated 128 candidates but only trained the final few on the full dataset. The total computation is roughly equivalent to training 30--40 full models, compared to 128 for standard random search. On large datasets, this speedup is substantial.
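That equivalence can be sanity-checked with back-of-the-envelope arithmetic. Assuming training cost scales roughly linearly with sample count, sum the samples each round trains on and divide by the full dataset size; per-fit fixed overhead pushes the effective cost toward the 30--40 figure.

```python
# (n_candidates, n_samples) per round, read off the schedule above
rounds = [(128, 500), (64, 1000), (32, 2000),
          (16, 4000), (8, 8000), (4, 15000)]

total_samples = sum(c * r for c, r in rounds)
print(f"Sample-fits per CV fold: {total_samples:,}")           # 380,000
print(f"Full-model equivalents: {total_samples / 15000:.0f}")  # 25
```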
When to Use Halving
Halving search shines when:
- Your dataset is large (100K+ rows) and training is the bottleneck
- You have many candidate configurations to evaluate
- You are tuning a model that benefits from more data (gradient boosting, not KNN)
It struggles when:
- The model's performance ranking changes dramatically between small and large datasets (a configuration that is best on 500 samples might not be best on 15,000)
- The dataset is already small --- there is no resource to halve
Part 6: The Correct Tuning Sequence
The Three-Step Protocol
After years of watching teams tune models, I have settled on a protocol that balances thoroughness with efficiency:
Step 1: Establish the Default Baseline
from xgboost import XGBClassifier
from sklearn.model_selection import cross_val_score, StratifiedKFold
# Pure defaults
baseline = XGBClassifier(eval_metric='logloss', random_state=42, n_jobs=-1)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
baseline_scores = cross_val_score(baseline, X, y, cv=cv, scoring='roc_auc')
print(f"Default baseline AUC: {baseline_scores.mean():.4f} +/- {baseline_scores.std():.4f}")
Step 2: Random Search for the Ballpark (50--100 trials)
Random search finds the right region of the search space. You are not looking for the optimal configuration; you are eliminating vast swaths of bad parameter space.
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform, loguniform
rough_params = {
'n_estimators': randint(100, 2000),
'learning_rate': loguniform(1e-3, 0.5),
'max_depth': randint(3, 12),
'subsample': uniform(0.6, 0.4),
'colsample_bytree': uniform(0.5, 0.5),
'min_child_weight': randint(1, 10),
'reg_alpha': loguniform(1e-3, 10.0),
'reg_lambda': loguniform(1e-3, 10.0),
}
rough_search = RandomizedSearchCV(
XGBClassifier(eval_metric='logloss', random_state=42, n_jobs=-1),
rough_params, n_iter=80, cv=cv, scoring='roc_auc',
random_state=42, n_jobs=-1
)
rough_search.fit(X, y)
print(f"Random search AUC: {rough_search.best_score_:.4f}")
Step 3: Bayesian Optimization to Refine (50--200 trials, if warranted)
Use the random search results to narrow the search space, then let Optuna refine:
import optuna
optuna.logging.set_verbosity(optuna.logging.WARNING)
# Narrow the search space based on random search results
best_rs = rough_search.best_params_
def objective_refined(trial):
params = {
# Center ranges around random search best
'n_estimators': trial.suggest_int('n_estimators', 500, 2000),
'learning_rate': trial.suggest_float(
'learning_rate',
best_rs['learning_rate'] * 0.3,
min(best_rs['learning_rate'] * 3.0, 0.5),
log=True
),
'max_depth': trial.suggest_int(
'max_depth',
max(3, best_rs['max_depth'] - 2),
min(12, best_rs['max_depth'] + 2)
),
'subsample': trial.suggest_float('subsample', 0.65, 0.95),
'colsample_bytree': trial.suggest_float('colsample_bytree', 0.55, 0.95),
'min_child_weight': trial.suggest_int('min_child_weight', 1, 8),
'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 5.0, log=True),
'reg_lambda': trial.suggest_float('reg_lambda', 0.1, 5.0, log=True),
'eval_metric': 'logloss',
'random_state': 42,
'n_jobs': -1,
}
model = XGBClassifier(**params)
scores = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
return scores.mean()
study_refined = optuna.create_study(
direction='maximize',
sampler=optuna.samplers.TPESampler(seed=42)
)
study_refined.optimize(objective_refined, n_trials=100)
print(f"Refined Optuna AUC: {study_refined.best_value:.4f}")
Comparing the Steps
print("Tuning progression:")
print(f" Step 1 (defaults): {baseline_scores.mean():.4f}")
print(f" Step 2 (random search): {rough_search.best_score_:.4f} (+{rough_search.best_score_ - baseline_scores.mean():.4f})")
print(f" Step 3 (Bayesian opt): {study_refined.best_value:.4f} (+{study_refined.best_value - rough_search.best_score_:.4f})")
print(f"\n Total improvement: {study_refined.best_value - baseline_scores.mean():.4f}")
print(f" Step 2 share: {((rough_search.best_score_ - baseline_scores.mean()) / (study_refined.best_value - baseline_scores.mean()) * 100):.0f}%")
print(f" Step 3 share: {((study_refined.best_value - rough_search.best_score_) / (study_refined.best_value - baseline_scores.mean()) * 100):.0f}%")
Tuning progression:
Step 1 (defaults): 0.9247
Step 2 (random search): 0.9401 (+0.0154)
Step 3 (Bayesian opt): 0.9416 (+0.0015)
Total improvement: 0.0169
Step 2 share: 91%
Step 3 share: 9%
Random search captured 91% of the total tuning improvement. Bayesian optimization added 9%. This pattern is typical. The random search does the heavy lifting; Bayesian optimization polishes.
Part 7: When to Stop Tuning
The Diminishing Returns Signal
How do you know when to stop? Watch for these signals:
1. The top-N configurations have similar performance.
# After any tuning study
results = pd.DataFrame(random_search.cv_results_).sort_values('rank_test_score')
top_20 = results.head(20)
print(f"Best AUC: {top_20.iloc[0]['mean_test_score']:.4f}")
print(f"20th AUC: {top_20.iloc[19]['mean_test_score']:.4f}")
print(f"Gap: {top_20.iloc[0]['mean_test_score'] - top_20.iloc[19]['mean_test_score']:.4f}")
print(f"CV Std: {top_20.iloc[0]['std_test_score']:.4f}")
gap = top_20.iloc[0]['mean_test_score'] - top_20.iloc[19]['mean_test_score']
std = top_20.iloc[0]['std_test_score']
print(f"\nGap < CV Std? {gap < std} --> {'Stop tuning' if gap < std else 'Continue tuning'}")
Best AUC: 0.9401
20th AUC: 0.9378
Gap: 0.0023
CV Std: 0.0042
Gap < CV Std? True --> Stop tuning
When the gap between the best and 20th-best configuration is smaller than the cross-validation standard deviation, you are in noise territory. Further tuning is moving the decimal point, not improving the model.
2. The optimization history has plateaued.
# After an Optuna study
import optuna.visualization as vis
fig = vis.plot_optimization_history(study)
fig.show()
# Programmatic check: compare first half vs. second half of trials
trial_values = [t.value for t in study.trials if t.value is not None]
first_half = np.max(trial_values[:len(trial_values)//2])
second_half = np.max(trial_values[len(trial_values)//2:])
improvement = second_half - first_half
print(f"Best in first 50 trials: {first_half:.4f}")
print(f"Best in last 50 trials: {second_half:.4f}")
print(f"Improvement: {improvement:.4f}")
3. The tuning improvement is small relative to the feature engineering improvement.
If adding one new feature improved AUC by 0.02 and three weeks of tuning improved AUC by 0.002, go find another feature.
The Decision Framework
Should I keep tuning?
|
v
Is the gap between best and 10th-best > 2x CV standard deviation?
| |
YES NO --> Stop. You are in noise territory.
|
v
Has the Optuna optimization history improved in the last 30 trials?
| |
YES NO --> Stop. The search has converged.
|
v
Is the marginal AUC improvement worth the compute cost?
| |
YES NO --> Stop. Diminishing returns.
|
v
Continue tuning. Increase n_trials by 50.
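The flowchart can be sketched as a small helper to run after each batch of trials (the function name, arguments, and thresholds are illustrative, mirroring the checks above):

```python
def should_keep_tuning(best_score, tenth_best, cv_std,
                       recent_improvement, marginal_value_ok):
    """Mirror the decision flowchart; returns (decision, reason)."""
    if (best_score - tenth_best) <= 2 * cv_std:
        return False, "Gap within noise -- stop."
    if recent_improvement <= 0:
        return False, "Search has converged -- stop."
    if not marginal_value_ok:
        return False, "Not worth the compute -- stop."
    return True, "Continue; increase n_trials by 50."

# Numbers from the random search earlier in the chapter
print(should_keep_tuning(0.9401, 0.9386, 0.0042,
                         recent_improvement=0.0003,
                         marginal_value_ok=True))
# (False, 'Gap within noise -- stop.')
```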
What to Do Instead of Tuning
When tuning has plateaued, redirect your effort to higher-leverage activities:
- Engineer new features. Domain-specific features almost always beat tuning. Interaction terms, time-since-event features, aggregation features, and ratio features are reliable sources of improvement.
- Get more data. If your learning curve shows the validation score still climbing, more training data will help more than more tuning.
- Fix data quality. Missing data patterns, label noise, and feature leakage are more impactful than any hyperparameter.
- Ensemble different model families. A blend of XGBoost and LightGBM with different feature subsets often beats either one alone by 0.5--1.0%, which is more than refined tuning typically provides.
- Move to deployment. A model that is 0.5% worse but deployed three weeks earlier creates three weeks of business value that the "perfect" model never will.
Part 8: Tuning Specific Model Families
XGBoost / LightGBM / CatBoost
For gradient boosting, focus tuning effort here (in order of importance):
| Hyperparameter | XGBoost | LightGBM | Typical range |
|---|---|---|---|
| Learning rate | `learning_rate` | `learning_rate` | 0.01--0.3 (log) |
| Tree complexity | `max_depth` | `num_leaves` | 3--10 / 15--127 |
| Number of trees | `n_estimators` | `n_estimators` | Use early stopping |
| Row sampling | `subsample` | `bagging_fraction` | 0.6--1.0 |
| Column sampling | `colsample_bytree` | `feature_fraction` | 0.5--1.0 |
| Min leaf weight | `min_child_weight` | `min_child_samples` | 1--10 / 5--100 |
| L1 regularization | `reg_alpha` | `lambda_l1` | 1e-3--10 (log) |
| L2 regularization | `reg_lambda` | `lambda_l2` | 1e-3--10 (log) |
Practitioner Note --- LightGBM grows trees leaf-wise and controls complexity with `num_leaves` rather than `max_depth`. A tree with `max_depth=6` has up to 64 leaves (2^6). Setting `num_leaves=40` creates a tree with similar capacity but a more flexible structure. The rule of thumb: keep `num_leaves` below 2^`max_depth` to avoid overfitting. Start with `num_leaves` around 31 and tune from there.
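The rule of thumb can be baked into the search space itself. Here is a hypothetical sampler (the function name and ranges are illustrative) that keeps `num_leaves` strictly below 2^`max_depth`:

```python
import random

def sample_lgbm_params(rng=random):
    """Sample a LightGBM tree-complexity configuration that honours
    the rule of thumb: num_leaves strictly below 2**max_depth."""
    max_depth = rng.randint(4, 10)
    # cap at 127 (the table's upper bound) and at 2**max_depth - 1
    num_leaves = rng.randint(15, min(127, 2 ** max_depth - 1))
    return {"max_depth": max_depth, "num_leaves": num_leaves}

print(sample_lgbm_params(random.Random(42)))
```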
Do not grid-search `n_estimators`. Set it high (1,000--5,000) and let early stopping find the right number:
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
X_train, X_val, y_train, y_val = train_test_split(
X, y, test_size=0.2, stratify=y, random_state=42
)
model = XGBClassifier(
n_estimators=5000,
learning_rate=0.05,
max_depth=6,
early_stopping_rounds=50,
eval_metric='logloss',
random_state=42,
n_jobs=-1
)
model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
print(f"Best iteration: {model.best_iteration}")
print(f"Actual trees used: {model.best_iteration + 1}")
Best iteration: 487
Actual trees used: 488
The model stopped at 488 trees, far short of the 5,000 maximum. Early stopping found the right number automatically.
Random Forest
Random Forests are far less sensitive to hyperparameters than gradient boosting. The defaults in scikit-learn are usually close to optimal. If you tune, focus on:
| Hyperparameter | Default | Typical range | Notes |
|---|---|---|---|
| `n_estimators` | 100 | 200--1000 | More is almost always better (no overfitting) |
| `max_depth` | None | 10--30 or None | None = grow until pure leaves |
| `max_features` | 'sqrt' | 'sqrt', 'log2', 0.3--0.8 | Controls decorrelation |
| `min_samples_leaf` | 1 | 1--10 | Regularization |
Logistic Regression and SVM
These have fewer hyperparameters, and grid search is perfectly adequate:
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
# Logistic regression: tune C and penalty
lr_grid = {
'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0],
'penalty': ['l1', 'l2'],
'solver': ['saga']
}
# SVM: tune C and gamma
svm_grid = {
'C': [0.1, 1.0, 10.0, 100.0],
'gamma': ['scale', 'auto', 0.001, 0.01, 0.1],
'kernel': ['rbf']
}
For simple models with 1--2 hyperparameters, use grid search. Save Bayesian optimization for models with 5+ hyperparameters.
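Running the logistic-regression grid is cheap: 6 values of C times 2 penalties is only 12 configurations. A self-contained sketch on synthetic data (note that `saga` converges much faster on standardized features, hence the scaler and generous `max_iter`):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=10, random_state=42)

# 6 C values x 2 penalties = 12 configurations: cheap enough for a full grid
grid = GridSearchCV(
    LogisticRegression(solver="saga", max_iter=5000, random_state=42),
    {"C": [0.001, 0.01, 0.1, 1.0, 10.0, 100.0], "penalty": ["l1", "l2"]},
    cv=5, scoring="roc_auc",
)
grid.fit(StandardScaler().fit_transform(X), y)
print(grid.best_params_, round(grid.best_score_, 3))
```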
Part 9: Reproducibility and Logging
Recording Your Tuning Results
Tuning without recording is gambling without keeping score. At minimum, log:
import json
from datetime import datetime
tuning_log = {
'timestamp': datetime.now().isoformat(),
'model': 'XGBClassifier',
'dataset': 'streamflow_churn_v3',
'n_samples': len(X),
'n_features': X.shape[1],
'method': 'Optuna TPE',
'n_trials': 100,
'cv_strategy': 'StratifiedKFold(n_splits=5)',
'scoring': 'roc_auc',
'baseline_score': float(baseline_scores.mean()),
'best_score': float(study.best_value),
'improvement': float(study.best_value - baseline_scores.mean()),
'best_params': study.best_params,
'random_state': 42,
'wall_time_seconds': sum(
t.duration.total_seconds()
for t in study.trials if t.duration is not None
),
}
with open('tuning_log.json', 'w') as f:
json.dump(tuning_log, f, indent=2, default=str)
Reproducibility --- Always set `random_state` in both the model and the cross-validation splitter. Set `seed` in the Optuna sampler. Record the library versions. Tuning is stochastic; without fixed seeds, you cannot reproduce your results. This is discussed further in Chapter 10 (Reproducible Data Pipelines) and Chapter 30 (Experiment Tracking).
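Recording library versions takes only the standard library. The helper below is an illustrative sketch; the default package list is an assumption about your environment:

```python
import sys
from importlib.metadata import PackageNotFoundError, version

def library_versions(packages=("scikit-learn", "xgboost", "optuna", "numpy")):
    """Collect interpreter and package versions for the tuning log.
    The default package list is an assumption -- adjust to your stack."""
    out = {"python": sys.version.split()[0]}
    for pkg in packages:
        try:
            out[pkg] = version(pkg)
        except PackageNotFoundError:
            out[pkg] = "not installed"
    return out

# Merge into the tuning log before writing it to disk, e.g.:
# tuning_log["versions"] = library_versions()
print(library_versions())
```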
Progressive Project M8: StreamFlow Churn Tuning
Your Task
Apply the three-step tuning protocol to StreamFlow's best-performing churn model from Chapters 14 and 17.
Deliverables:
- Default baseline AUC from the untuned model
- Random search results (80 trials) with the best AUC and parameters
- Optuna Bayesian optimization (100 trials) with pruning
- `plot_param_importances` visualization showing which hyperparameters matter most
- Diminishing returns analysis: compare improvement from Step 1-->2 vs. Step 2-->3
- Final tuned model saved as `streamflow_tuned_model.json`
Key question to answer: Was the Bayesian optimization step worth the additional compute? At what trial count did the optimization effectively plateau?
Getting Started
# Start from your Chapter 17 pipeline
# Load the StreamFlow dataset with SMOTE or class_weight adjustments
# Use the model that won the Chapter 14 comparison (likely XGBoost or LightGBM)
# Step 1: Cross-validate with defaults
# Step 2: RandomizedSearchCV with 80 trials
# Step 3: Optuna with 100 trials and pruning
# Step 4: Compare all three and generate the importance plot
# Step 5: Write a 1-paragraph summary of whether the tuning was worth it
Summary
Hyperparameter tuning is the art of extracting the last few percentage points of performance from a model that is already good. The correct sequence is: defaults first, random search for the ballpark, Bayesian optimization to refine. Grid search is mostly obsolete for high-dimensional spaces. Optuna with pruning is the current best tool for efficient Bayesian optimization. But the most important skill is knowing when to stop --- when the gap between the best and 10th-best configuration is smaller than the cross-validation noise, further tuning is chasing noise, not signal. Spend your time on features, data quality, and deployment instead.
Next chapter: Chapter 19: Model Interpretation --- opening the black box.