Key Takeaways: Chapter 18
Hyperparameter Tuning
- Hyperparameter tuning is the most over-invested activity in amateur data science. The typical hierarchy of impact is: better features (5-20% improvement), fixing data quality (3-15%), choosing the right model family (2-8%), moving from defaults to rough tuning (2-5%), and moving from rough to perfect tuning (0.1-0.5%). Teams that spend more time on tuning than on feature engineering are investing their effort in the wrong place. Tune after you have exhausted the higher-leverage activities, not before.
- Hyperparameters are set before training; parameters are learned during training. Hyperparameters control how the model learns (learning rate, tree depth, regularization strength). Parameters are what the model learns (coefficients, split thresholds, leaf weights). This distinction determines what you tune (hyperparameters, using cross-validated search) versus what you leave to the algorithm (parameters, via gradient descent or impurity reduction).
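The distinction is easiest to see in a toy training loop. In this stdlib-only sketch (the data and names are invented for illustration), `learning_rate` is a hyperparameter fixed before training, while the coefficient `w` is a parameter the algorithm learns:

```python
# Hyperparameters: chosen BEFORE training begins.
learning_rate = 0.05
n_epochs = 200

# Tiny made-up dataset with true slope 2.0 (y = 2x).
xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.0, 2.0, 3.0, 4.0]

# Parameter: LEARNED during training, starting from an arbitrary value.
w = 0.0
for _ in range(n_epochs):
    # Gradient of mean squared error with respect to w.
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad

print(round(w, 3))  # converges to the true slope 2.0
```

Cross-validated search would vary `learning_rate` and `n_epochs`; it would never touch `w` directly.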
- Grid search is mostly obsolete for high-dimensional hyperparameter spaces. Grid search evaluates every combination, distributing its budget evenly across good and bad regions. For 2-3 hyperparameters, it works fine. For 5+ hyperparameters, the combinatorial explosion makes it impractical. A grid of 5 values across 8 hyperparameters produces 390,625 combinations. Use random search or Bayesian optimization instead.
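The arithmetic behind that explosion is just a product of per-dimension grid sizes (the sizes below are illustrative, not tied to any specific model):

```python
# Total grid-search fits = product of the per-hyperparameter grid sizes.
def n_grid_fits(grid_sizes):
    total = 1
    for size in grid_sizes:
        total *= size
    return total

print(n_grid_fits([5, 5]))    # 2 hyperparameters: 25 fits, manageable
print(n_grid_fits([5] * 8))   # 8 hyperparameters: 390625 fits
```

Multiply by 5-fold cross-validation and the 8-dimensional grid costs nearly two million model fits.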
- Random search is the correct default tuning method. Bergstra and Bengio showed that, for the same budget, random search explores more unique values of the important hyperparameters than grid search, because it does not waste trials exhaustively covering unimportant dimensions. Use `loguniform` for learning rates and regularization parameters, `uniform` for sampling fractions, and `randint` for integer parameters. Sixty random trials find a top-5% configuration with 95% probability when 1-2 hyperparameters dominate.
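A stdlib-only sketch of those sampling distributions, with an illustrative gradient-boosting search space (the parameter names and ranges are assumptions, not the chapter's exact space), plus the arithmetic behind the 60-trial rule:

```python
import math
import random

rng = random.Random(0)

def loguniform(low, high):
    # Sample uniformly in log space: appropriate for learning rates
    # and regularization strengths that span orders of magnitude.
    return math.exp(rng.uniform(math.log(low), math.log(high)))

def sample_config():
    return {
        "learning_rate": loguniform(1e-3, 0.3),   # log-uniform
        "subsample": rng.uniform(0.5, 1.0),       # sampling fraction: uniform
        "max_depth": rng.randint(3, 10),          # integer parameter: randint
    }

configs = [sample_config() for _ in range(60)]

# Why 60 trials: if the "good" region covers the top 5% of the space,
# the chance that at least one of 60 independent draws lands in it is
# 1 - 0.95**60.
print(round(1 - 0.95 ** 60, 3))  # → 0.954
```

Each configuration is drawn independently, so the budget can be raised or cut at any time without redesigning a grid.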
- Bayesian optimization (Optuna) refines what random search finds. Optuna maintains a surrogate model (TPE) of the hyperparameter-performance relationship and uses an acquisition function to sample promising regions. It is more efficient than random search but requires 10-30 initial trials to build a useful surrogate. Use Optuna after random search has established the ballpark, not as a replacement for it. Enable pruning to save 20-40% of compute by early-stopping unpromising trials.
- The correct tuning sequence is: defaults, random search, Bayesian optimization. Step 1: establish the default baseline. Step 2: random search (50-100 trials) to find the right region. Step 3: Bayesian optimization (50-200 trials) to refine, if warranted. Random search typically captures 85-95% of the total tuning improvement. Bayesian optimization adds the last 5-15%. Skip Step 3 if the stakes do not justify the additional compute.
- For gradient boosting, learning rate and tree depth account for most of the tuning benefit. Optuna's `plot_param_importances` consistently shows that `learning_rate` and `max_depth` (or `num_leaves` for LightGBM) explain 50-70% of performance variance across hyperparameter configurations. Regularization parameters (`reg_alpha`, `reg_lambda`) typically explain less than 5%. If you can only tune two hyperparameters, tune these two.
- Use early stopping for `n_estimators`, not grid search. Set `n_estimators` high (1,000-5,000) and use `early_stopping_rounds` with a validation set. Training stops when the validation loss has not improved for the specified number of rounds, finding the optimal number of trees automatically and precisely. Grid searching `n_estimators` over a predefined list wastes compute and gives a coarser answer.
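The patience logic behind `early_stopping_rounds` is simple enough to sketch in plain Python (the validation curve below is simulated, not from a real model):

```python
def early_stop_round(val_losses, patience=50):
    """Return the boosting round to keep: stop once the validation loss
    has not improved for `patience` consecutive rounds."""
    best, best_round, wait = float("inf"), 0, 0
    for i, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, best_round, wait = loss, i, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return best_round

# Simulated curve: improves through round 300, then plateaus.
losses = [1.0 / (r + 1) for r in range(300)] + [1.0 / 300] * 200
print(early_stop_round(losses, patience=50))  # → 300
```

A grid over, say, [100, 500, 1000] could never land on 300; the patience rule finds it exactly and stops paying for the plateau.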
- Stop tuning when the top-N configurations are indistinguishable from noise. When the gap between the best and 20th-best configuration is smaller than the cross-validation standard deviation, further tuning is moving noise, not signal. When the Optuna optimization history has not improved in 30+ trials, the search has converged. When the Bayesian optimization step adds less than 10% of what random search found, the returns have diminished. Stop and invest elsewhere.
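The first stopping check can be made concrete with a few lines of stdlib Python (the scores below are invented for illustration):

```python
def tuning_converged(cv_mean_scores, cv_std_of_best, top_n=20):
    """Stop when the gap between the best and the top_n-th best mean CV
    score is smaller than the CV standard deviation of the best config."""
    ranked = sorted(cv_mean_scores, reverse=True)  # higher = better
    if len(ranked) < top_n:
        return False
    return (ranked[0] - ranked[top_n - 1]) < cv_std_of_best

# Invented example: 100 trial scores tightly clustered around 0.85 AUC.
scores = [0.85 - 0.0001 * i for i in range(100)]
print(tuning_converged(scores, cv_std_of_best=0.005))  # → True
```

Here the best-to-20th gap (0.0019) is well inside the CV noise (0.005), so continuing to tune would be ranking noise.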
- Feature engineering beats hyperparameter tuning almost every time. In the chapter's e-commerce case study, 250 trials of tuning on the original features added 0.018 AUC. Thirteen engineered and external features with default hyperparameters added 0.059 AUC --- more than three times the tuning gain. Tuning adjusts how the model uses existing information. Feature engineering adds new information. When the model is stuck, the answer is usually more information, not better hyperparameters.
If You Remember One Thing
The three-step tuning protocol --- defaults, random search, Bayesian refinement --- captures nearly all available improvement in under 200 total trials. Random search does the heavy lifting. Bayesian optimization polishes. But the single most common tuning mistake is not choosing the wrong method or running too few trials. It is tuning before the features are right. Before you invest an hour in hyperparameter search, invest fifteen minutes asking: "What information does this model not have that would help it predict better?" If you can answer that question, go get that information first. It will be worth more than any tuning study.
These takeaways summarize Chapter 18: Hyperparameter Tuning. Return to the chapter for full context.