Exercises: Chapter 18
Hyperparameter Tuning
Exercise 1: Parameter vs. Hyperparameter Identification (Conceptual)
For each of the following, identify whether it is a parameter (learned during training) or a hyperparameter (set before training). Briefly explain your reasoning.
a) The coefficients in a logistic regression model.
b) The C value in a logistic regression model.
c) The split thresholds in a decision tree.
d) The max_depth of a decision tree.
e) The weights connecting two layers of a neural network.
f) The learning rate of an optimizer.
g) The cluster centroids in K-means.
h) The value of K in K-means.
i) The leaf weights in a gradient boosted tree.
j) The subsample fraction in XGBoost.
Exercise 2: Grid Search Combinatorics (Short Answer + Code)
a) A team defines the following grid for a Random Forest:
```python
param_grid = {
    'n_estimators': [100, 200, 300, 500, 1000],
    'max_depth': [5, 10, 15, 20, None],
    'min_samples_split': [2, 5, 10, 20],
    'max_features': ['sqrt', 'log2', 0.5, 0.8],
    'min_samples_leaf': [1, 2, 5, 10],
}
```
How many total combinations does this grid contain? With 5-fold cross-validation, how many individual model fits will this require?
b) Suppose each model fit takes 8 seconds on average. How long will the full grid search take? Express your answer in hours.
c) The team has a 2-hour compute budget. Using RandomizedSearchCV, how many trials (n_iter) can they afford with the same per-fit time and 5-fold CV? How does this compare to the coverage of grid search?
d) Write the RandomizedSearchCV configuration using appropriate continuous distributions for the float parameters and randint for the integer parameters.
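After working out part (a) by hand, you can verify your count programmatically. A minimal check: the number of combinations is the product of the list lengths, and each combination is fit once per CV fold.

```python
# Verify the grid-size arithmetic for part (a): total combinations are the
# product of each list's length; total fits multiply by the number of folds.
from math import prod

param_grid = {
    'n_estimators': [100, 200, 300, 500, 1000],
    'max_depth': [5, 10, 15, 20, None],
    'min_samples_split': [2, 5, 10, 20],
    'max_features': ['sqrt', 'log2', 0.5, 0.8],
    'min_samples_leaf': [1, 2, 5, 10],
}

n_combinations = prod(len(values) for values in param_grid.values())
n_fits = n_combinations * 5  # 5-fold cross-validation
print(n_combinations, n_fits)
```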
Exercise 3: Random Search Distribution Design (Code)
For an XGBoost classifier, design appropriate probability distributions for the following hyperparameters using scipy.stats:
a) learning_rate: should range from 0.005 to 0.3, with equal probability per order of magnitude.
b) max_depth: integers from 3 to 10.
c) subsample: continuous values from 0.6 to 1.0, uniformly distributed.
d) reg_alpha: should range from 0.0001 to 100, with equal probability per order of magnitude.
e) n_estimators: integers from 100 to 3000.
Explain why loguniform is appropriate for (a) and (d) but uniform is appropriate for (c). What would go wrong if you used uniform for learning_rate?
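A minimal sketch contrasting the two distributions may help with the explanation. With uniform sampling over [0.005, 0.3], values below 0.05 are rare; loguniform spreads samples evenly across orders of magnitude.

```python
# Compare how often each distribution samples small learning rates.
from scipy.stats import loguniform, uniform

n = 100_000
uni = uniform(loc=0.005, scale=0.3 - 0.005).rvs(n, random_state=0)
logu = loguniform(0.005, 0.3).rvs(n, random_state=0)

print((uni < 0.05).mean())   # about 0.15: small learning rates undersampled
print((logu < 0.05).mean())  # about 0.56: small learning rates well covered
```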
Exercise 4: Optuna Objective Function (Code)
Write an Optuna objective function to tune a LightGBM classifier. Your function should:
a) Define the following search space:
- num_leaves: integer, 15 to 127
- learning_rate: float, 0.005 to 0.3, log scale
- feature_fraction: float, 0.4 to 1.0
- bagging_fraction: float, 0.4 to 1.0
- min_child_samples: integer, 5 to 100
- lambda_l1: float, 1e-3 to 10.0, log scale
- lambda_l2: float, 1e-3 to 10.0, log scale
b) Use 5-fold stratified cross-validation with roc_auc scoring.
c) Return the mean cross-validation AUC.
d) Run the study for 80 trials and print the best parameters and score.
e) After the study, call optuna.visualization.plot_param_importances and interpret the result. Which hyperparameters matter most?
Exercise 5: Diminishing Returns Analysis (Code + Analysis)
Using the synthetic dataset from the chapter:
```python
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=15000, n_features=25, n_informative=15,
    n_redundant=5, flip_y=0.08, class_sep=0.8,
    weights=[0.85, 0.15], random_state=42
)
```
a) Train an XGBoost model with pure defaults and record the 5-fold cross-validated AUC.
b) Run RandomizedSearchCV with n_iter=20 and record the best AUC. Run again with n_iter=50, n_iter=100, and n_iter=200. Plot the best AUC as a function of n_iter.
c) At what value of n_iter does the improvement per additional trial become negligible? Define "negligible" as less than 0.001 AUC improvement from doubling n_iter.
d) Now add a new feature to X that is genuinely predictive:
```python
import numpy as np

np.random.seed(42)
new_feature = y * 0.5 + np.random.normal(0, 1, len(y))
X_enhanced = np.column_stack([X, new_feature])
```
Train the default XGBoost model on X_enhanced and compare its AUC to the best-tuned model on X. Which gives a larger improvement: 200 trials of tuning on the original features, or one new feature with default hyperparameters?
e) Write a 3-sentence conclusion about the relative value of feature engineering vs. hyperparameter tuning.
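The measurement loop in part (b) follows a simple pattern. Here is a sketch using a fast LogisticRegression stand-in (so it runs in seconds); substitute XGBoost and the chapter's parameter distributions for the actual exercise.

```python
# Sketch of the part (b) loop: rerun the random search at increasing
# budgets and record the best cross-validated score at each budget.
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_classification(n_samples=1000, random_state=42)

best_scores = {}
for n_iter in [5, 10, 20]:
    search = RandomizedSearchCV(
        LogisticRegression(max_iter=1000),
        param_distributions={'C': loguniform(1e-3, 1e2)},
        n_iter=n_iter,
        scoring='roc_auc',
        cv=5,
        random_state=42,  # same seed: larger budgets extend the same sample sequence
    )
    search.fit(X_demo, y_demo)
    best_scores[n_iter] = search.best_score_

print(best_scores)  # best AUC is non-decreasing as the budget grows
```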
Exercise 6: Early Stopping as a Tuning Strategy (Code)
This exercise demonstrates that for n_estimators, early stopping is better than grid search.
a) Grid-search n_estimators over [100, 200, 500, 1000, 2000] for an XGBoost model with learning_rate=0.05, using 5-fold cross-validation and roc_auc scoring. Record the best n_estimators and the wall time.
b) Now train a single XGBoost model with n_estimators=5000 and early_stopping_rounds=50, using a held-out 20% validation set. Record the best iteration and the wall time.
c) Compare:
- Which approach found a better (or equivalent) n_estimators?
- Which was faster?
- Which gives a more precise answer (grid search can only find one of 5 values; early stopping finds the exact number)?
d) Explain why you should never include n_estimators in a grid or random search when using XGBoost or LightGBM. What should you do instead?
Exercise 7: HalvingRandomSearchCV vs. RandomizedSearchCV (Code)
Compare the two approaches head-to-head on the chapter's synthetic dataset.
a) Run RandomizedSearchCV with n_iter=100 and the XGBoost parameter distributions from the chapter. Record the best AUC and the wall time (use %%time or time.time()).
b) Run HalvingRandomSearchCV with n_candidates=100, factor=3, and resource='n_samples'. Record the best AUC and wall time.
c) Create a table comparing:
- Best AUC
- Wall time
- Total model fits
d) Under what conditions would you prefer HalvingRandomSearchCV? Under what conditions is it a bad choice?
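One practical hint for part (b): HalvingRandomSearchCV is still experimental in scikit-learn and must be enabled explicitly before import. The sketch below shows the required imports and constructor shape with a small RandomForest stand-in; the exercise itself uses XGBoost and the chapter's distributions.

```python
# HalvingRandomSearchCV requires the experimental enable-import first.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.model_selection import HalvingRandomSearchCV

X_demo, y_demo = make_classification(n_samples=1000, random_state=42)

search = HalvingRandomSearchCV(
    estimator=RandomForestClassifier(n_estimators=50, random_state=42),
    param_distributions={'max_depth': randint(3, 15)},
    n_candidates=15,       # starting candidates, pruned by `factor` each rung
    factor=3,
    resource='n_samples',  # early rungs train on subsets of the data
    random_state=42,
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```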
Exercise 8: Tuning a Complete Pipeline (Code)
Real-world tuning must handle preprocessing. Build a scikit-learn Pipeline that includes:
- StandardScaler for numerical features
- OneHotEncoder for categorical features (use ColumnTransformer)
- An XGBoost classifier
Then tune the pipeline using RandomizedSearchCV. The hyperparameter names in the grid must use the Pipeline prefix notation (e.g., classifier__max_depth).
```python
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from xgboost import XGBClassifier

# Create a dataset with mixed types
np.random.seed(42)
n = 10000
df = pd.DataFrame({
    'age': np.random.normal(45, 15, n).clip(18, 80).round(0),
    'income': np.random.lognormal(10.5, 0.8, n).round(0),
    'credit_score': np.random.normal(680, 80, n).clip(300, 850).round(0),
    'education': np.random.choice(['high_school', 'bachelors', 'masters', 'phd'], n),
    'employment': np.random.choice(['employed', 'self_employed', 'retired', 'unemployed'], n),
    'region': np.random.choice(['northeast', 'southeast', 'midwest', 'west'], n),
})

# Target
y = (
    (df['income'] > 50000).astype(int) * 0.3
    + (df['credit_score'] > 700).astype(int) * 0.3
    + np.random.normal(0, 0.3, n)
)
y = (y > 0.5).astype(int)
```
a) Build the Pipeline with ColumnTransformer.
b) Define a param_distributions dictionary with appropriate distributions for 5 XGBoost hyperparameters, using the classifier__ prefix.
c) Run RandomizedSearchCV with 50 trials. Print the best parameters and AUC.
d) Explain why it is critical that the scaler is inside the Pipeline and not applied before the cross-validation split.
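A skeleton for part (a) follows, with LogisticRegression standing in for XGBoost so the sketch stays self-contained and runnable; swap in XGBClassifier for the exercise itself.

```python
# Preprocessing branches go inside a ColumnTransformer; the whole thing
# nests inside a Pipeline so scaling happens within each CV fold.
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_cols = ['age', 'income', 'credit_score']
cat_cols = ['education', 'employment', 'region']

preprocessor = ColumnTransformer([
    ('num', StandardScaler(), num_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols),
])

pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# Tunable names carry the step prefix: 'classifier__C' here, or
# 'classifier__max_depth' once XGBClassifier is swapped in.
print('classifier__C' in pipe.get_params())
```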
Exercise 9: Multi-Metric Tuning (Code + Analysis)
Sometimes you care about more than one metric. Use RandomizedSearchCV with scoring set to multiple metrics and refit set to a primary metric.
a) Run a random search that evaluates roc_auc, average_precision, f1, and recall simultaneously:
```python
from sklearn.model_selection import RandomizedSearchCV
from xgboost import XGBClassifier

scoring = {
    'auc': 'roc_auc',
    'avg_precision': 'average_precision',
    'f1': 'f1',
    'recall': 'recall',
}

search = RandomizedSearchCV(
    estimator=XGBClassifier(eval_metric='logloss', random_state=42, n_jobs=-1),
    param_distributions=param_distributions,  # from the chapter
    n_iter=60,
    cv=cv,  # the chapter's 5-fold stratified splitter
    scoring=scoring,
    refit='auc',  # primary metric for selecting best model
    random_state=42,
    n_jobs=-1,
    return_train_score=True,
)
```
b) After fitting, extract the results and create a table showing the top 5 configurations ranked by AUC, with all four metrics displayed. Does the best-AUC configuration also have the best F1? The best recall?
c) Identify a configuration that has lower AUC but higher recall than the best-AUC configuration. Under what business circumstances would you prefer this configuration?
d) Why does using refit='auc' matter? What happens if you set refit=False?
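For part (b), the multi-metric results live in cv_results_, with one mean_test_<name> column per scoring key. A quick runnable sketch of the extraction pattern, using a small LogisticRegression stand-in search so it finishes in seconds; the same column names apply to the XGBoost search above.

```python
# Fit a tiny multi-metric search, then rank configurations by AUC while
# displaying the other metrics alongside.
import pandas as pd
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X_demo, y_demo = make_classification(n_samples=1000, random_state=42)

search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={'C': loguniform(1e-3, 1e2)},
    n_iter=8,
    scoring={'auc': 'roc_auc', 'f1': 'f1', 'recall': 'recall'},
    refit='auc',
    random_state=42,
)
search.fit(X_demo, y_demo)

# Each metric gets its own mean_test_<name> and rank_test_<name> columns.
results = pd.DataFrame(search.cv_results_)
top5 = results.sort_values('rank_test_auc')[
    ['mean_test_auc', 'mean_test_f1', 'mean_test_recall']
].head()
print(top5)
```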
Exercise 10: Full Tuning Report (Portfolio Exercise)
This exercise synthesizes the chapter into a complete tuning workflow. Using a dataset of your choice (the chapter's synthetic data, StreamFlow churn, or a public dataset from scikit-learn or Kaggle), produce a tuning report that includes:
- Baseline performance with default hyperparameters (5-fold CV)
- Random search with 80 trials: best score, best parameters, top-10 configuration spread
- Optuna Bayesian optimization with 100 trials: best score, best parameters, hyperparameter importance plot
- Diminishing returns analysis: plot of best score vs. trial number for both random search and Optuna
- Comparison table: defaults vs. random search vs. Optuna (score, improvement, wall time)
- 1-paragraph conclusion: Was the tuning worth it? Where would you invest next?
Submit the report as a Jupyter notebook with markdown cells explaining each step. This exercise directly prepares you for the StreamFlow progressive project milestone M8.
These exercises support Chapter 18: Hyperparameter Tuning. Return to the chapter for reference.