Exercises: Chapter 14

Gradient Boosting


Exercise 1: Residual Fitting by Hand (Conceptual)

Consider a regression problem with five data points:

x    y
1    3.0
2    5.0
3    4.0
4    8.0
5    7.0

a) The initial prediction F_0 is the mean of y. Calculate F_0.

b) Calculate the residuals r_i = y_i - F_0 for each data point.

c) Suppose the first tree (a decision stump) splits at x <= 2.5 vs. x > 2.5. The left leaf predicts the mean residual of points in the left group, and the right leaf predicts the mean residual of points in the right group. Calculate the predictions of this stump.

d) Using a learning rate of eta = 0.3, calculate the updated predictions F_1(x_i) = F_0 + eta * h_1(x_i) for each data point.

e) Calculate the new residuals after round 1. Are they smaller in magnitude than the original residuals for every point? Show that the total squared error must decrease after the update (for 0 < eta <= 1 and a well-fitted tree), even though an individual residual can grow when its sign disagrees with its leaf's mean residual.
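After working parts (a)-(e) by hand, you can check your arithmetic with a short NumPy sketch:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([3.0, 5.0, 4.0, 8.0, 7.0])

# (a) initial prediction: the mean of y
F0 = y.mean()

# (b) residuals against the initial prediction
r = y - F0

# (c) stump split at x <= 2.5; each leaf predicts its group's mean residual
left = x <= 2.5
h1 = np.where(left, r[left].mean(), r[~left].mean())

# (d) update with learning rate eta
eta = 0.3
F1 = F0 + eta * h1

# (e) new residuals after round 1
r_new = y - F1
print("F0:", F0)
print("residuals:", r)
print("updated predictions:", F1)
print("new residuals:", r_new)
```

Compare each printed array with your hand calculation before moving on; pay particular attention to whether every entry of the new residual vector shrank.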


Exercise 2: Learning Rate and Number of Estimators (Code)

Run the following experiment and answer the questions below.

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(
    n_samples=8000, n_features=20, n_informative=12,
    n_redundant=4, flip_y=0.08, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

configs = [
    (0.3,  5000, 30),
    (0.1,  5000, 50),
    (0.05, 5000, 50),
    (0.01, 5000, 100),
    (0.005, 10000, 100),
]

for lr, n_est, patience in configs:
    model = xgb.XGBClassifier(
        n_estimators=n_est,
        learning_rate=lr,
        max_depth=6,
        subsample=0.8,
        colsample_bytree=0.8,
        early_stopping_rounds=patience,
        eval_metric='logloss',
        random_state=42,
        n_jobs=-1
    )
    model.fit(X_train, y_train, eval_set=[(X_val, y_val)], verbose=False)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])
    # best_iteration is 0-indexed, so the number of trees used is best_iteration + 1
    print(f"LR={lr:<6} Trees={model.best_iteration + 1:<5} AUC={auc:.4f}")

a) What is the relationship between learning rate and number of trees used (as determined by early stopping)?

b) Do lower learning rates always produce better AUC? Is there a point of diminishing returns?

c) A colleague argues: "Just use learning_rate=0.001 and n_estimators=50000 to get the best model." What are the practical problems with this approach?

d) For a production model that retrains nightly, which learning rate would you recommend and why?


Exercise 3: Max Depth --- Boosting vs. Random Forest (Code)

Run the following comparison and answer the questions.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

X, y = make_classification(
    n_samples=5000, n_features=15, n_informative=10,
    n_redundant=3, flip_y=0.05, random_state=42
)

depths = [3, 5, 7, 10, 15, 20, None]

print("Random Forest:")
for d in depths:
    rf = RandomForestClassifier(
        n_estimators=200, max_depth=d, random_state=42, n_jobs=-1
    )
    scores = cross_val_score(rf, X, y, cv=5, scoring='roc_auc')
    d_str = str(d) if d is not None else "None"
    print(f"  depth={d_str:<5} AUC={scores.mean():.4f} +/- {scores.std():.4f}")

print("\nXGBoost:")
for d in [d for d in depths if d is not None]:
    xgb_m = xgb.XGBClassifier(
        n_estimators=200, max_depth=d, learning_rate=0.1,
        random_state=42, n_jobs=-1, eval_metric='logloss'
    )
    scores = cross_val_score(xgb_m, X, y, cv=5, scoring='roc_auc')
    print(f"  depth={d:<5} AUC={scores.mean():.4f} +/- {scores.std():.4f}")

a) What is the optimal max_depth for Random Forest? For XGBoost?

b) Why does Random Forest perform well with deep trees while XGBoost performs poorly with deep trees?

c) What happens to the variance (standard deviation) of cross-validation scores as depth increases for each algorithm? Explain the difference.

d) A colleague trained an XGBoost model with max_depth=20 and says "it works great --- 99.5% train accuracy." What question should you immediately ask?


Exercise 4: Early Stopping (Code)

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(
    n_samples=5000, n_features=20, n_informative=10,
    n_redundant=5, flip_y=0.1, random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Train without early stopping
model_no_es = xgb.XGBClassifier(
    n_estimators=3000, learning_rate=0.1, max_depth=6,
    random_state=42, n_jobs=-1, eval_metric='logloss'
)
model_no_es.fit(
    X_train, y_train,
    eval_set=[(X_train, y_train), (X_test, y_test)],
    verbose=False
)

results = model_no_es.evals_result()

a) Plot the training and validation log-loss curves across all 3000 iterations. At what iteration does validation loss reach its minimum? At what iteration does the gap between training and validation loss exceed 0.05?

b) Re-train with early_stopping_rounds=25, then with early_stopping_rounds=100. Compare the stopping points and final AUC. Which patience value do you prefer and why?

c) Explain why using the test set for both early stopping and final evaluation produces an optimistically biased performance estimate. Design a proper three-way split pipeline that avoids this bias.
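For part (c), one way to shape the three-way split (the 60/20/20 proportions are illustrative, not prescribed):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=5000, n_features=20, random_state=42)

# First carve off a test set that nothing upstream ever sees...
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# ...then split the remainder into train (for fitting) and validation
# (for early stopping only). The test set is used once, at the very end.
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, stratify=y_tmp, random_state=42
)
print(len(X_train), len(X_val), len(X_test))  # 60% / 20% / 20%
```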

d) In a production pipeline with automated retraining, you cannot manually inspect learning curves. Write a function that trains an XGBoost model with early stopping, logs the training curve, and raises a warning if the model stopped at fewer than 20 iterations (likely too few) or used more than 90% of the maximum iterations (likely needs more).


Exercise 5: The Three Libraries Head-to-Head (Code)

Train XGBoost, LightGBM, and CatBoost on the same dataset and compare them across multiple dimensions.

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
import time

# Adult Income dataset (mix of numerical and categorical features)
adult = fetch_openml(name='adult', version=2, as_frame=True, parser='auto')
X = adult.data
y = (adult.target == '>50K').astype(int)

# Identify categorical and numerical columns
cat_cols = X.select_dtypes(include=['category', 'object']).columns.tolist()
num_cols = X.select_dtypes(include=['number']).columns.tolist()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

a) Train all three models with comparable hyperparameters (learning_rate=0.05, max_depth around 6, early stopping with patience 50). For XGBoost, one-hot encode the categoricals. For LightGBM and CatBoost, use their native categorical support. Report AUC, training time, and number of trees used for each model.

b) Which model handles the categorical features most gracefully from a code complexity standpoint?

c) Now increase the dataset size artificially by resampling with replacement to 200K rows. Re-run the comparison. Which model benefits most from the larger dataset? Which is fastest?

d) Write a recommendation memo (3-5 sentences) for a colleague choosing between these three libraries for a production system with mixed numerical/categorical features and nightly retraining.


Exercise 6: Hyperparameter Tuning Pipeline (Code)

Build a complete hyperparameter tuning pipeline for LightGBM on a classification problem.

a) Implement the following tuning stages in order:

- Stage 1: Fix learning_rate=0.05, tune num_leaves (20 to 100, step 10) and max_depth (3 to 8)
- Stage 2: Using the best from Stage 1, tune subsample (0.6 to 1.0, step 0.1) and colsample_bytree (0.6 to 1.0, step 0.1)
- Stage 3: Using the best from Stage 2, tune reg_alpha and reg_lambda (0, 0.1, 1.0, 5.0)
- Stage 4: Lower learning_rate to 0.01, increase n_estimators to 5000, use early stopping

b) Compare the final tuned model to a LightGBM trained with all defaults. What is the AUC improvement?

c) How long did the full tuning pipeline take? Is the improvement worth the compute cost for a weekly retraining pipeline?

d) An alternative to sequential grid search is Bayesian optimization (e.g., Optuna). Research and describe in 2-3 sentences how Bayesian optimization differs from grid search and why it is often more efficient for gradient boosting hyperparameters.


Exercise 7: Feature Importance Comparison (Code)

Train an XGBoost model and extract feature importance using three different methods.

import xgboost as xgb
from sklearn.inspection import permutation_importance
import shap  # pip install shap

# Use your preferred dataset (StreamFlow churn, Adult Income, or a make_classification dataset)
# Train an XGBoost model with early stopping

a) Extract and plot the "weight" (number of times a feature is used in splits), "gain" (average loss reduction from splits on that feature), and "cover" (average number of samples affected by splits on that feature) importance rankings. Do they agree?

b) Compute permutation importance on the test set. How does it compare to the built-in importance measures?

c) Compute SHAP values and create a SHAP summary plot. Does the SHAP ranking match any of the three built-in rankings?

d) A stakeholder asks: "What are the top 3 features driving churn?" Based on your analysis, which importance method would you use to answer this question, and why?


Exercise 8: Gradient Boosting for Regression (Code)

Gradient boosting is not only for classification. Train a gradient boosting regressor on the California Housing dataset.

from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
import xgboost as xgb
import lightgbm as lgb

housing = fetch_california_housing(as_frame=True)
X = housing.data
y = housing.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

a) Train an XGBRegressor and an LGBMRegressor with early stopping. Report RMSE, MAE, and R-squared on the test set.

b) The default loss function for XGBRegressor is squared error (reg:squarederror). Train a second model with Huber loss (reg:pseudohubererror). Compare the two on the test set. Which is more robust to outliers?

c) Train a quantile regression model that predicts the 10th, 50th, and 90th percentiles of house prices. Use LightGBM with objective='quantile' and alpha=0.1, 0.5, and 0.9. Plot the prediction intervals for a sample of 50 test points.

d) In what business scenario would quantile regression (prediction intervals) be more useful than point predictions?


Exercise 9: Debugging a Poorly Performing Gradient Boosting Model (Debugging Challenge)

A colleague sends you the following model and says "I can't get above 0.82 AUC, but I know this data should support 0.87+."

import xgboost as xgb
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(
    n_samples=10000, n_features=30, n_informative=15,
    n_redundant=5, flip_y=0.05, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

model = xgb.XGBClassifier(
    n_estimators=50,
    learning_rate=0.5,
    max_depth=15,
    subsample=1.0,
    colsample_bytree=1.0,
    reg_alpha=0,
    reg_lambda=0,
    random_state=42
)
model.fit(X_train, y_train)

auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"AUC: {auc:.4f}")

a) List every problem you can identify with this model configuration. For each, explain what is wrong and what the fix should be.

b) Fix the configuration and demonstrate an AUC above 0.87. Show your reasoning for each parameter change.

c) Which single parameter change made the biggest difference? How did you determine this?


Exercise 10: StreamFlow Churn --- The Full Pipeline (Progressive Project)

Build the complete StreamFlow churn prediction pipeline using gradient boosting. This pulls together everything from Chapters 6-14.

a) Load the StreamFlow dataset (from the progressive project). Apply the feature engineering pipeline from Chapter 6, the categorical encoding from Chapter 7, the missing data handling from Chapter 8, and the feature selection from Chapter 9.

b) Establish baselines: logistic regression (Chapter 11), SVM (Chapter 12, if applicable), Random Forest (Chapter 13).

c) Train XGBoost, LightGBM, and CatBoost with early stopping. Use a proper three-way split.

d) Create a comparison table showing all models with: AUC, F1, Precision, Recall, training time, and number of parameters/trees.

e) Select the best model and justify your choice. Consider not just accuracy, but training time, interpretability, and ease of maintenance.

f) Write a one-paragraph summary that a product manager could understand: which model did you choose, how well does it predict churn, and what are the top 3 features driving predictions?


These exercises support Chapter 14: Gradient Boosting. Return to the chapter for reference.