Quiz: Chapter 18

Hyperparameter Tuning


Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.


Question 1 (Multiple Choice)

Which of the following is a hyperparameter of a Random Forest classifier?

  • A) The split thresholds at each node
  • B) The feature importances
  • C) The max_depth limit
  • D) The predicted probabilities

Answer: C) The max_depth limit. Hyperparameters are set before training and control how the algorithm learns. Split thresholds (A) and feature importances (B) are derived from the data during training, making them parameters (or derived quantities). Predicted probabilities (D) are model outputs, not parameters of any kind. max_depth is specified in the constructor before .fit() is called.
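
The distinction can be seen directly in code. This is a minimal sketch (the dataset and model settings are illustrative, not from the chapter): the hyperparameter exists before training, while the fitted quantities only exist after .fit().

```python
# max_depth is fixed in the constructor before training; split thresholds
# and feature importances are derived from the data during .fit().
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

clf = RandomForestClassifier(max_depth=3, n_estimators=10, random_state=0)
print(clf.max_depth)  # hyperparameter: already set, prints 3

clf.fit(X, y)
print(clf.feature_importances_)  # only available after training
```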


Question 2 (Multiple Choice)

A data scientist runs a grid search with the following configuration:

param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [3, 5, 7, 9],
    'learning_rate': [0.01, 0.05, 0.1],
}

Using 5-fold cross-validation, how many individual model fits will this require?

  • A) 36
  • B) 60
  • C) 180
  • D) 900

Answer: C) 180. The grid has 3 x 4 x 3 = 36 unique combinations. Each combination is evaluated with 5-fold cross-validation, so 36 x 5 = 180 total fits.
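
The arithmetic can be sanity-checked with a few lines of Python, using the param_grid from the question:

```python
# Count grid-search fits: (product of grid sizes) x (number of CV folds).
from itertools import product

param_grid = {
    'n_estimators': [100, 300, 500],
    'max_depth': [3, 5, 7, 9],
    'learning_rate': [0.01, 0.05, 0.1],
}

combinations = list(product(*param_grid.values()))
n_folds = 5
total_fits = len(combinations) * n_folds
print(len(combinations), total_fits)  # 36 180
```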


Question 3 (Multiple Choice)

Why is loguniform preferred over uniform for sampling learning_rate in random search?

  • A) loguniform always produces smaller values
  • B) loguniform gives equal probability to each order of magnitude, matching how learning rates affect performance
  • C) loguniform is faster to compute
  • D) loguniform guarantees finding the optimal value

Answer: B) loguniform gives equal probability to each order of magnitude, matching how learning rates affect performance. The difference between learning_rate=0.01 and learning_rate=0.02 matters much more than the difference between learning_rate=0.9 and learning_rate=1.0. A uniform distribution would place most samples in the high range where differences are negligible, wasting the search budget.
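
A small sketch makes the difference concrete (the range [0.001, 1.0] and sample count are illustrative): a uniform distribution almost never visits the low end of the range, while loguniform spends a third of its budget on each order of magnitude.

```python
# Compare where uniform vs loguniform place samples over [0.001, 1.0].
import numpy as np
from scipy.stats import loguniform, uniform

lo, hi = 1e-3, 1.0
u = uniform(loc=lo, scale=hi - lo).rvs(10_000, random_state=0)
lg = loguniform(lo, hi).rvs(10_000, random_state=1)

# Two of the three orders of magnitude in this range lie below 0.01,
# yet uniform sampling barely visits them.
print(f"uniform    < 0.01: {np.mean(u < 0.01):.3f}")   # ~0.009
print(f"loguniform < 0.01: {np.mean(lg < 0.01):.3f}")  # ~0.333
```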


Question 4 (Short Answer)

Explain the Bergstra-Bengio insight about why random search is more efficient than grid search for most hyperparameter tuning problems.

Answer: Bergstra and Bengio observed that in most problems, only a small number of hyperparameters have a large effect on performance, but you do not know which ones in advance. Grid search evaluates every combination of all hyperparameters, meaning it explores only a few unique values of each parameter (determined by the grid resolution). Random search samples each hyperparameter independently, so it explores many more unique values of the important parameters for the same total budget. In a search space where only 1-2 of 6 hyperparameters matter, random search covers those critical dimensions far more thoroughly.
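
A toy two-parameter example illustrates the argument (the budget of 36 trials and the parameter ranges are made up for illustration): with the same budget, grid search tries only a handful of distinct values of each parameter, while random search tries a fresh value on nearly every trial.

```python
# Budget of 36 trials over two hyperparameters: a 6x6 grid vs 36 random draws.
import random
from itertools import product

grid_axis = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
grid_trials = list(product(grid_axis, grid_axis))  # 36 trials

rng = random.Random(0)
random_trials = [(rng.uniform(0, 1), rng.uniform(0, 1)) for _ in range(36)]

# If only the first hyperparameter actually matters, what counts is how
# many distinct values of it each strategy explores:
print(len({a for a, b in grid_trials}))    # 6 unique values
print(len({a for a, b in random_trials}))  # 36 unique values
```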


Question 5 (Multiple Choice)

In Bayesian optimization with Optuna, what is the role of the surrogate model?

  • A) It replaces the ML model being tuned and makes predictions directly
  • B) It approximates the relationship between hyperparameters and performance to guide the search
  • C) It performs cross-validation faster than scikit-learn
  • D) It generates the training data for the ML model

Answer: B) It approximates the relationship between hyperparameters and performance to guide the search. The surrogate model (TPE in Optuna's default implementation) builds a probabilistic model of the objective function based on previously evaluated trials. This model predicts which regions of the hyperparameter space are likely to yield good results, allowing the search to focus on promising areas rather than sampling randomly.


Question 6 (Multiple Choice)

What is the purpose of the acquisition function in Bayesian optimization?

  • A) To acquire more training data for the ML model
  • B) To decide which hyperparameter configuration to evaluate next, balancing exploration and exploitation
  • C) To compute the cross-validation score
  • D) To prune low-performing trials

Answer: B) To decide which hyperparameter configuration to evaluate next, balancing exploration and exploitation. The acquisition function takes the surrogate model's predictions (including uncertainty estimates) and selects the next point to evaluate. It balances exploitation (sampling near the current best) with exploration (sampling regions where the surrogate model is uncertain), ensuring the search does not get trapped in local optima.


Question 7 (Short Answer)

A data scientist reports the following tuning results:

  Method           Best AUC   Trials   Wall time
  Defaults         0.8821       1      5 sec
  Random search    0.9034     100      12 min
  Optuna           0.9041     200      28 min

Should they continue tuning? Justify your answer.

Answer: No, they should stop tuning. The jump from defaults to random search was 0.0213 AUC --- substantial and clearly worthwhile. The jump from random search to Optuna was only 0.0007 AUC despite doubling the trials and more than doubling the wall time. This 0.0007 improvement is almost certainly within the cross-validation noise (typical CV standard deviations are 0.003-0.010). The optimization has converged, and further trials will chase noise rather than signal. Time would be better spent on feature engineering, data quality, or deployment.


Question 8 (Multiple Choice)

When using HalvingRandomSearchCV with resource='n_samples', what happens at each iteration?

  • A) The number of hyperparameter candidates is doubled and the dataset size is halved
  • B) The number of hyperparameter candidates is halved and the dataset size is increased
  • C) Both the number of candidates and the dataset size are halved
  • D) Both the number of candidates and the dataset size are doubled

Answer: B) The number of hyperparameter candidates is halved and the dataset size is increased. Successive halving works like a tournament: many candidates start on small data subsets, the worst half is eliminated, and the survivors are evaluated on larger data. Only the final few candidates are trained on the full dataset, saving substantial computation.
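
The tournament schedule is easy to simulate (the starting counts below are illustrative, and a halving factor of 2 is assumed; scikit-learn's default factor is 3, but the shape of the schedule is the same):

```python
# Successive halving: candidates are cut in half while the per-candidate
# sample budget doubles, so only the final survivor sees the full dataset.
n_candidates = 64
n_samples = 1_000        # samples per candidate in round 1
full_dataset = 64_000

rounds = []
while n_candidates >= 1 and n_samples <= full_dataset:
    rounds.append((n_candidates, n_samples))
    n_candidates //= 2
    n_samples *= 2

for cand, samples in rounds:
    print(f"{cand:3d} candidates x {samples:6d} samples")
```

Note that the total work per round stays roughly constant (64,000 candidate-samples each round here), rather than growing with the number of candidates.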


Question 9 (Multiple Choice)

For XGBoost, what is the best way to determine the optimal n_estimators?

  • A) Grid search over [100, 200, 500, 1000, 2000]
  • B) Set it to 1000 as a default
  • C) Set it high (e.g., 5000) and use early_stopping_rounds with a validation set
  • D) Use the smallest value that fits within your time budget

Answer: C) Set it high (e.g., 5000) and use early_stopping_rounds with a validation set. Early stopping monitors the validation loss during training and stops when it has not improved for a specified number of rounds. This automatically finds the precise optimal number of trees rather than choosing from a small predefined set. It also prevents overfitting without requiring a separate search. Grid searching n_estimators wastes compute by training 5 completely independent models when early stopping accomplishes the same goal in a single training run.


Question 10 (Short Answer)

A colleague asks: "If Bayesian optimization is smarter than random search, why not always use Bayesian optimization?" Give two reasons why random search is still a valid first step.

Answer: First, Bayesian optimization's surrogate model needs 10-30 initial random trials to build a useful approximation of the search space, so the first trials are effectively random anyway. Starting with a dedicated random search phase gives you a reliable baseline and a clear picture of the search space before investing in the more complex Bayesian approach. Second, random search is trivially parallelizable (all trials are independent), while Bayesian optimization is inherently sequential (each trial depends on the results of previous trials). On a cluster with many cores, 200 random search trials can run simultaneously, while 200 Bayesian trials must run mostly in sequence.


Question 11 (Multiple Choice)

Which of the following is the correct tuning sequence recommended in this chapter?

  • A) Grid search --> Random search --> Bayesian optimization
  • B) Bayesian optimization --> Random search --> Grid search
  • C) Default baseline --> Random search for ballpark --> Bayesian optimization to refine
  • D) Default baseline --> Grid search --> Deploy

Answer: C) Default baseline --> Random search for ballpark --> Bayesian optimization to refine. The chapter's three-step protocol starts with defaults to establish a baseline, uses random search (50-100 trials) to find the right region of the search space, then uses Bayesian optimization (50-200 trials) to refine if the stakes justify it. Grid search is mostly obsolete for high-dimensional spaces. This sequence captures most of the tuning benefit (random search) efficiently before investing in the more computationally intensive refinement.


Question 12 (Short Answer)

After running 100 trials of Optuna, you observe that the gap between the best configuration and the 20th-best configuration is 0.0018 AUC, and the cross-validation standard deviation of the best configuration is 0.0045. What does this tell you, and what should you do next?

Answer: The gap between the 1st and 20th-best configurations (0.0018) is smaller than the cross-validation standard deviation (0.0045). This means the difference between these configurations is indistinguishable from random noise in the evaluation. You are in the diminishing-returns zone where further tuning will not produce meaningful improvement. You should stop tuning and redirect effort to higher-leverage activities: feature engineering, data quality improvements, or moving to deployment.


Question 13 (Multiple Choice)

In an Optuna plot_param_importances result, learning_rate has importance 0.45 and reg_alpha has importance 0.02. What is the practical implication?

  • A) reg_alpha should be removed from the model entirely
  • B) learning_rate should be set to 0.45
  • C) Tuning effort should focus on learning_rate; reg_alpha can be left at its default or roughly tuned
  • D) Both parameters are equally important because the model uses both

Answer: C) Tuning effort should focus on learning_rate; reg_alpha can be left at its default or roughly tuned. Hyperparameter importance measures how much of the variance in model performance is explained by each hyperparameter. An importance of 0.02 means reg_alpha has almost no effect on the final score --- spending time refining it produces negligible returns. An importance of 0.45 means learning_rate is the single most influential hyperparameter and deserves careful tuning.


Question 14 (Multiple Choice)

Which statement best summarizes the typical impact of hyperparameter tuning?

  • A) Tuning is the most important step in the ML pipeline and should receive the most time
  • B) Default to rough tune gives 2-5% improvement; rough to perfect gives 0.1-0.5%; better features give 5-20%
  • C) Tuning always provides at least 10% improvement over defaults
  • D) Tuning is unnecessary because defaults are always optimal

Answer: B) Default to rough tune gives 2-5% improvement; rough to perfect gives 0.1-0.5%; better features give 5-20%. This hierarchy of impact means that feature engineering and data quality work typically provide far more improvement than hyperparameter tuning. Rough tuning is worthwhile (2-5% is real), but the returns diminish sharply after the initial sweep. Teams that spend more time on tuning than on features are investing their effort backwards.


Question 15 (Short Answer)

You are tuning a production churn model where a 0.1% AUC improvement translates to approximately $50,000 in annual revenue. You have already completed random search (best AUC: 0.8934) and 100 trials of Optuna (best AUC: 0.8942). Should you run another 200 Optuna trials? Justify your answer with both the statistical and business perspectives.

Answer: Statistically, the 0.0008 improvement from 100 Optuna trials over random search is very small and likely close to the noise floor. Another 200 trials might yield an additional 0.0002-0.0005 improvement at best. From a business perspective, however, even 0.1% AUC translates to $50,000 annually. If the 200 additional trials take a few hours of compute time (costing perhaps $20-50 in cloud resources plus a few hours of data scientist time), the expected $10,000-25,000 in annual value could justify the investment --- provided the improvement is real and not just overfitting to the cross-validation folds. The decision hinges on whether the marginal improvement will hold in production, which requires an A/B test regardless.
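
The expected-value arithmetic behind this answer can be laid out explicitly (the $50 compute cost is an illustrative assumption from the answer's stated range):

```python
# 0.1% AUC (0.001) is worth $50,000/year, i.e. $5,000 per 0.0001 AUC.
dollars_per_0001_auc = 5_000
gain_low, gain_high = 2, 5           # plausible gain, in units of 0.0001 AUC
value_low = gain_low * dollars_per_0001_auc    # $10,000/year
value_high = gain_high * dollars_per_0001_auc  # $25,000/year
trial_cost = 50                                # assumed cloud cost of 200 trials
print(value_low, value_high, value_low // trial_cost)  # 10000 25000 200
```

Even the low end of the range dwarfs the compute cost, which is why the business case flips the usual "stop tuning" advice --- conditional on the gain being real, which only an A/B test can confirm.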


This quiz covers Chapter 18: Hyperparameter Tuning. Return to the chapter for full context.