Key Takeaways: Chapter 18

Hyperparameter Tuning


  1. Hyperparameter tuning is the most over-invested activity in amateur data science. The typical hierarchy of impact is: better features (5-20% improvement), fixing data quality (3-15%), choosing the right model family (2-8%), tuning from defaults to a rough optimum (2-5%), and tuning from rough to perfect (0.1-0.5%). Teams that spend more time on tuning than on feature engineering are investing their effort in the wrong place. Tune after you have exhausted higher-leverage activities, not before.

  2. Hyperparameters are set before training; parameters are learned during training. Hyperparameters control how the model learns (learning rate, tree depth, regularization strength). Parameters are what the model learns (coefficients, split thresholds, leaf weights). This distinction determines what you tune (hyperparameters, using cross-validated search) versus what you leave to the algorithm (parameters, via gradient descent or impurity reduction).
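The distinction can be made concrete with a minimal sketch (hypothetical data, pure Python): the learning rate is fixed before training begins, while the slope is what training produces.

```python
# Minimal sketch of the hyperparameter/parameter split: the learning rate
# is chosen BEFORE training (hyperparameter); the slope w is LEARNED during
# training (parameter). Data below is hypothetical.

def fit_slope(xs, ys, learning_rate, n_steps=500):
    """Fit y = w * x by gradient descent on mean squared error."""
    w = 0.0  # parameter: updated by the learning algorithm
    for _ in range(n_steps):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
        w -= learning_rate * grad  # hyperparameter controls the update size
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]          # roughly y = 2x
w = fit_slope(xs, ys, learning_rate=0.01)
print(round(w, 2))                 # close to 2.0
```

Tuning means searching over `learning_rate`; you never search over `w` directly, because the algorithm finds it for you.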

  3. Grid search is mostly obsolete for high-dimensional hyperparameter spaces. Grid search evaluates every combination, distributing its budget evenly across good and bad regions. For 2-3 hyperparameters, it works fine. For 5+ hyperparameters, the combinatorial explosion makes it impractical. A grid of 5 values across 8 hyperparameters produces 390,625 combinations. Use random search or Bayesian optimization instead.
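The combinatorial explosion is easy to verify: at 5 values per hyperparameter, the grid size is 5 raised to the number of hyperparameters.

```python
# Grid size at 5 candidate values per hyperparameter.
for n_hyperparams in (2, 3, 5, 8):
    print(n_hyperparams, 5 ** n_hyperparams)
# 2 -> 25, 3 -> 125, 5 -> 3125, 8 -> 390625
```

At even one minute per trial, the 8-hyperparameter grid would take over nine months of compute.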

  4. Random search is the correct default tuning method. Bergstra and Bengio showed that random search explores more unique values of the important hyperparameters than grid search for the same budget, because it does not waste trials exhaustively covering unimportant dimensions. Use loguniform for learning rates and regularization parameters, uniform for sampling fractions, and randint for integer parameters. Sixty trials find a top-5% configuration with 95% probability when 1-2 hyperparameters dominate.
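The distributions and the 60-trial claim can both be sketched with the standard library alone (in practice you would pass the equivalent scipy.stats distributions to a tool like RandomizedSearchCV; the ranges below are illustrative assumptions):

```python
import math
import random

def sample_config(rng):
    """One random-search trial: each hyperparameter drawn independently."""
    return {
        # loguniform: uniform in log-space, so 0.001-0.01 is as likely as 0.03-0.3
        "learning_rate": 10 ** rng.uniform(math.log10(1e-3), math.log10(0.3)),
        "reg_lambda":    10 ** rng.uniform(math.log10(1e-3), math.log10(10.0)),
        "subsample":     rng.uniform(0.5, 1.0),   # uniform: sampling fraction
        "max_depth":     rng.randint(3, 10),      # randint: integer parameter
    }

rng = random.Random(0)
configs = [sample_config(rng) for _ in range(60)]

# Why 60 trials: if each independent trial misses the top 5% of the space
# with probability 0.95, then P(at least one hit) = 1 - 0.95^60.
p_hit = 1 - 0.95 ** 60
print(round(p_hit, 3))  # ~0.954
```

The probability argument is what makes random search budget-friendly: the 60-trial guarantee does not depend on how many hyperparameters you search over.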

  5. Bayesian optimization (Optuna) refines what random search finds. Optuna maintains a surrogate model (TPE) of the hyperparameter-performance relationship and uses an acquisition function to sample promising regions. It is more efficient than random search but requires 10-30 initial trials to build a useful surrogate. Use Optuna after random search has established the ballpark, not as a replacement for it. Enable pruning to save 20-40% of compute by early-stopping unpromising trials.

  6. The correct tuning sequence is: defaults, random search, Bayesian optimization. Step 1: establish the default baseline. Step 2: random search (50-100 trials) to find the right region. Step 3: Bayesian optimization (50-200 trials) to refine, if warranted. Random search typically captures 85-95% of the total tuning improvement. Bayesian optimization adds the last 5-15%. Skip Step 3 if the stakes do not justify the additional compute.

  7. For gradient boosting, learning rate and tree depth account for most of the tuning benefit. Optuna's plot_param_importances consistently shows that learning_rate and max_depth (or num_leaves for LightGBM) explain 50-70% of performance variance across hyperparameter configurations. Regularization parameters (reg_alpha, reg_lambda) typically explain less than 5%. If you can only tune two hyperparameters, tune these two.

  8. Use early stopping for n_estimators, not grid search. Set n_estimators high (1,000-5,000) and use early_stopping_rounds with a validation set. The model stops training when the validation loss has not improved for a specified number of rounds, finding the optimal number of trees automatically and precisely. Grid searching n_estimators over a predefined list wastes compute and gives a coarser answer.
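The mechanism behind `early_stopping_rounds` is a simple patience loop; the sketch below shows it over a hypothetical per-round validation curve (in a real pipeline the library, e.g. LightGBM or XGBoost, tracks this for you during training).

```python
def best_n_estimators(val_losses, patience):
    """Return the tree count at the best validation loss, stopping once the
    loss has not improved for `patience` consecutive rounds."""
    best_round, best_loss = 0, float("inf")
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_round, best_loss = round_idx, loss
        elif round_idx - best_round >= patience:
            break                      # no improvement for `patience` rounds
    return best_round + 1              # number of trees to keep

# Hypothetical validation curve: improves, plateaus, then overfits.
losses = [0.70, 0.55, 0.48, 0.45, 0.44, 0.44, 0.45, 0.46, 0.47, 0.48]
print(best_n_estimators(losses, patience=3))  # 5
```

This is why early stopping beats grid search here: it identifies the exact round where validation loss bottoms out, rather than the nearest value on a predefined list.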

  9. Stop tuning when the top-N configurations are indistinguishable from noise. When the gap between the best and 20th-best configuration is smaller than the cross-validation standard deviation, further tuning is moving noise, not signal. When the Optuna optimization history has not improved in 30+ trials, the search has converged. When the Bayesian optimization step adds less than 10% of what random search found, the returns have diminished. Stop and invest elsewhere.
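The noise check is mechanical enough to automate; a sketch with hypothetical trial scores (assuming higher is better):

```python
def tuning_converged(scores, cv_std, top_n=20):
    """Stop when the best-vs-Nth-best gap is smaller than CV noise."""
    ranked = sorted(scores, reverse=True)              # higher = better
    gap = ranked[0] - ranked[min(top_n, len(ranked)) - 1]
    return gap < cv_std

# Hypothetical trial scores clustered near 0.86, with CV std of 0.010.
scores = [0.862, 0.861, 0.860, 0.859, 0.858] * 5       # 25 near-identical trials
print(tuning_converged(scores, cv_std=0.010))          # True: gap 0.003 < 0.010
```

When this returns True, the leaderboard ordering among the top configurations is no longer meaningful, and further trials are re-ranking noise.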

  10. Feature engineering beats hyperparameter tuning almost every time. In the chapter's e-commerce case study, 250 trials of tuning on the original features added 0.018 AUC. Thirteen engineered and external features with default hyperparameters added 0.059 AUC --- more than three times the tuning gain. Tuning adjusts how the model uses existing information. Feature engineering adds new information. When the model is stuck, the answer is usually more information, not better hyperparameters.


If You Remember One Thing

The three-step tuning protocol --- defaults, random search, Bayesian refinement --- captures nearly all available improvement in under 200 total trials. Random search does the heavy lifting. Bayesian optimization polishes. But the single most common tuning mistake is not choosing the wrong method or running too few trials. It is tuning before the features are right. Before you invest an hour in hyperparameter search, invest fifteen minutes asking: "What information does this model not have that would help it predict better?" If you can answer that question, go get that information first. It will be worth more than any tuning study.


These takeaways summarize Chapter 18: Hyperparameter Tuning. Return to the chapter for full context.