Quiz: Chapter 14

Gradient Boosting


Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.


Question 1 (Multiple Choice)

In gradient boosting, each new tree is trained to predict:

  • A) The original target variable
  • B) The residuals (errors) from all previous trees combined
  • C) A random bootstrap sample of the target variable
  • D) The average prediction of all previous trees

Answer: B) The residuals (errors) from all previous trees combined. Each new tree learns the pattern in the current errors (more precisely, the negative gradient of the loss, which equals the residuals under squared error), not the original target. The final prediction is the sum of all trees' predictions (scaled by the learning rate) plus the initial prediction. This sequential error correction is the defining characteristic of gradient boosting.
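The error-correcting loop can be sketched in a few lines. This is a toy illustration (synthetic data, scikit-learn's DecisionTreeRegressor as the weak learner), not a production implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y, y.mean())   # initial prediction: the mean
trees = []

for _ in range(100):
    residuals = y - prediction           # errors of the current ensemble
    tree = DecisionTreeRegressor(max_depth=3)
    tree.fit(X, residuals)               # each new tree fits the residuals
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

# Final prediction = initial mean + sum of learning-rate-scaled tree outputs
print(np.mean((y - prediction) ** 2))   # training MSE shrinks as trees are added
```

Note that no tree after the first ever sees the original target y; each sees only what the ensemble so far has failed to explain.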


Question 2 (Multiple Choice)

A data scientist trains an XGBoost model with learning_rate=0.01 and n_estimators=100. A colleague trains the same model with learning_rate=0.1 and n_estimators=100. Assuming both use early stopping and neither stops early, which statement is most likely true?

  • A) The model with learning_rate=0.01 will have better test performance
  • B) The model with learning_rate=0.1 will have better test performance
  • C) The model with learning_rate=0.01 is underfitting --- it needs more trees
  • D) Both models will perform identically

Answer: C) The model with learning_rate=0.01 is underfitting --- it needs more trees. A learning rate of 0.01 with only 100 trees means the ensemble has barely moved from its initial prediction: each tree's contribution is scaled down to 1%, so 100 such small steps are far too few to fit the data. With a learning rate this low, the model typically needs 1000+ trees. Setting n_estimators high enough and letting early stopping pick the round count would solve this.
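A quick experiment on synthetic data (a toy setup, not taken from the chapter) makes the underfitting visible:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=1000, n_features=10, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Same number of trees, two learning rates
slow = GradientBoostingRegressor(learning_rate=0.01, n_estimators=100,
                                 random_state=0).fit(X_tr, y_tr)
fast = GradientBoostingRegressor(learning_rate=0.1, n_estimators=100,
                                 random_state=0).fit(X_tr, y_tr)

print(f"lr=0.01: test R^2 = {slow.score(X_te, y_te):.3f}")  # far from the data: underfit
print(f"lr=0.1:  test R^2 = {fast.score(X_te, y_te):.3f}")  # much closer
```

With the tree budget fixed at 100, the low-learning-rate model simply has not taken enough steps toward the target.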


Question 3 (Short Answer)

Explain why max_depth should be much smaller in gradient boosting (typically 3-8) than in Random Forests (often unlimited). What goes wrong if you use max_depth=20 in gradient boosting?

Answer: In Random Forests, each tree is grown deep to reduce bias, and the averaging of many independent trees controls variance. In gradient boosting, each tree is a weak learner that should capture a small piece of the residual pattern. A tree with max_depth=20 is powerful enough to memorize residuals (including noise), and since gradient boosting adds trees sequentially, this memorization compounds across rounds. The result is severe overfitting: near-perfect training accuracy but poor test performance.


Question 4 (Multiple Choice)

Early stopping in gradient boosting works by:

  • A) Removing the worst-performing trees from the ensemble after training
  • B) Stopping training when the training loss reaches zero
  • C) Monitoring validation loss and stopping when it has not improved for N consecutive rounds
  • D) Randomly stopping at a different number of rounds each time

Answer: C) Monitoring validation loss and stopping when it has not improved for N consecutive rounds. Early stopping uses a held-out validation set to detect when adding more trees starts hurting generalization. The parameter N (early_stopping_rounds) determines how patient the algorithm is before declaring that improvement has stopped. The model from the best round is returned.
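The patience logic can be sketched as a standalone loop; `train_with_early_stopping` is a hypothetical helper, and the loss curve below is simulated:

```python
def train_with_early_stopping(val_losses, patience=5):
    """val_losses: validation loss observed after each boosting round.

    Returns the best round and its loss, stopping once `patience`
    consecutive rounds pass without improvement.
    """
    best_loss = float("inf")
    best_round = 0
    for round_idx, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_round = loss, round_idx
        elif round_idx - best_round >= patience:
            break   # no improvement for `patience` rounds: stop training
    return best_round, best_loss

# Simulated curve: loss falls, bottoms out at round 4, then overfitting sets in
losses = [1.0, 0.8, 0.6, 0.5, 0.45, 0.46, 0.47, 0.5, 0.55, 0.6, 0.7]
print(train_with_early_stopping(losses, patience=3))   # → (4, 0.45)
```

The real libraries (xgboost, LightGBM, CatBoost) implement this same idea internally and return the model from the best round.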


Question 5 (Multiple Choice)

Which of the following is the primary advantage of LightGBM's histogram-based splitting over XGBoost's default exact splitting?

  • A) Higher accuracy on small datasets
  • B) Better handling of missing values
  • C) Dramatically faster training on large datasets
  • D) Automatic feature selection

Answer: C) Dramatically faster training on large datasets. Histogram-based splitting bins continuous features into discrete buckets (typically 255), reducing the number of candidate split points from O(n) to O(bins). This makes split-finding much faster, especially on large datasets. The accuracy difference is typically negligible because 255 bins provides sufficient granularity for most problems.
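The reduction in candidate split points is easy to see with NumPy (illustrative numbers; real implementations build the histograms per node during training):

```python
import numpy as np

rng = np.random.default_rng(0)
feature = rng.normal(size=1_000_000)   # one continuous feature, 1M rows

# Exact splitting: every boundary between distinct values is a candidate
exact_candidates = np.unique(feature).size - 1

# Histogram splitting: bin into 255 buckets; only bin edges are candidates
n_bins = 255
edges = np.quantile(feature, np.linspace(0, 1, n_bins + 1))
binned = np.digitize(feature, edges[1:-1])   # each row mapped to a bin index
hist_candidates = n_bins - 1

print(exact_candidates, hist_candidates)   # ~1,000,000 vs 254
```

Split-finding cost drops from scanning roughly one candidate per row to scanning a few hundred bin boundaries, which is where the speedup comes from.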


Question 6 (Short Answer)

A colleague says: "I used the test set for early stopping and got 0.94 AUC. That is my model's real performance." What is wrong with this claim, and how should the evaluation be structured?

Answer: The test set was used to make a training decision (when to stop), so the 0.94 AUC estimate is biased upward. The model was indirectly fitted to the test set through the stopping point. The correct approach is a three-way split: training set (fit trees), validation set (early stopping), and a held-out test set (final, unbiased evaluation). The test set must never influence any training decision.
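A three-way split can be built with two calls to scikit-learn's train_test_split (toy data; the 60/20/20 ratio is just one reasonable choice):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# First carve off the final test set --- it never influences training
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Then split the remainder into training and validation (for early stopping)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))   # 600 200 200
```

Early stopping watches the validation set; the test set is touched exactly once, at the very end.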


Question 7 (Multiple Choice)

Which library provides the best native handling of categorical features without requiring manual encoding?

  • A) XGBoost
  • B) LightGBM
  • C) CatBoost
  • D) scikit-learn's GradientBoostingClassifier

Answer: C) CatBoost. CatBoost uses ordered target statistics with a permutation-based scheme that prevents target leakage, providing the most sophisticated native categorical handling. LightGBM also supports native categoricals (via optimal splits on category groupings), but CatBoost's implementation is more mature and generally produces better results on high-cardinality categorical features. XGBoost historically required manual encoding, though recent versions have added native categorical support.


Question 8 (Multiple Choice)

In the context of gradient boosting, the subsample parameter set to 0.8 means:

  • A) 80% of features are used for each tree
  • B) 80% of training rows are randomly sampled (without replacement) for each tree
  • C) 80% of the learning rate is applied
  • D) 80% of the trees are kept in the final ensemble

Answer: B) 80% of training rows are randomly sampled (without replacement) for each tree. This is stochastic gradient boosting, analogous to the bootstrap sampling in Random Forests. It introduces randomness that reduces overfitting and can actually improve generalization. The colsample_bytree parameter controls the fraction of features, not subsample.
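One boosting round's row sampling can be sketched with NumPy (illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(42)
n_rows = 1000
subsample = 0.8

# One round's sample: 80% of rows, drawn without replacement
idx = rng.choice(n_rows, size=int(subsample * n_rows), replace=False)

print(len(idx), len(np.unique(idx)))   # 800 800 --- no row appears twice
```

Contrast this with Random Forest's bootstrap, which samples n_rows rows *with* replacement, so some rows repeat and others are left out entirely.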


Question 9 (Short Answer)

You train a gradient boosting model with n_estimators=5000 and early_stopping_rounds=50. Early stopping halts training, reporting iteration 47 as the best round. What does this tell you about your model, and what should you investigate?

Answer: A best round of 47 out of a possible 5000 means the model overfit extremely quickly --- validation loss stopped improving after fewer than 50 trees. This suggests the learning rate may be too high (each tree is too aggressive), max_depth may be too large (trees are too complex), or the data may have very little learnable signal beyond what the initial prediction captures. Investigate by lowering the learning rate (e.g., from 0.1 to 0.01), reducing max_depth, and checking whether a simpler baseline (logistic regression) achieves similar performance.


Question 10 (Multiple Choice)

Which statement about gradient boosting vs. Random Forest is FALSE?

  • A) Random Forest is harder to overfit than gradient boosting
  • B) Gradient boosting typically achieves higher accuracy on tabular data
  • C) Random Forest trees are trained independently; gradient boosting trees are trained sequentially
  • D) Random Forest always trains slower than gradient boosting

Answer: D) Random Forest always trains slower than gradient boosting. In practice, Random Forest often trains faster because its trees can be built in parallel (each tree is independent), while gradient boosting is inherently sequential (each tree depends on the errors of all previous trees). The other three statements are true: RF is more resistant to overfitting, GB typically achieves slightly higher accuracy on tabular data, and RF trees are independent while GB trees are sequential.


Question 11 (Short Answer)

Explain what DART (Dropouts meet Multiple Additive Regression Trees) does and when you might use it instead of standard gradient boosting.

Answer: DART randomly drops existing trees from the ensemble when computing residuals for the next tree. This prevents early trees from dominating the ensemble, which is a known issue in standard gradient boosting where later trees make increasingly marginal corrections. DART can improve generalization when you observe that standard regularization (learning rate, subsampling) is not sufficient. However, DART is slower (no early stopping shortcut) and not always beneficial, so standard gradient boosting with early stopping remains the default choice.


Question 12 (Multiple Choice)

The recommended order for tuning gradient boosting hyperparameters is:

  • A) regularization, then learning rate, then max_depth, then subsample
  • B) learning rate and n_estimators first, then max_depth, then subsample, then regularization
  • C) max_depth first, then everything else simultaneously with grid search
  • D) Use defaults for everything and only tune learning rate

Answer: B) Learning rate and n_estimators first, then max_depth, then subsample, then regularization. Fix a moderate learning rate (0.05-0.1) with high n_estimators and early stopping. Then tune tree structure (max_depth or num_leaves). Then tune subsampling rates. Then fine-tune regularization. Finally, lower the learning rate and re-run for the best generalization. This order moves from the most impactful parameters to the least.


Question 13 (Short Answer)

A team trains LightGBM with num_leaves=31 (the default) and max_depth=-1 (unlimited). Their colleague is confused: "If depth is unlimited, won't the tree grow forever?" Explain how num_leaves controls tree complexity in LightGBM's leaf-wise growth strategy.

Answer: LightGBM grows trees leaf-wise, always splitting the leaf that produces the largest loss reduction. The num_leaves parameter limits the total number of leaves in the tree, regardless of depth. With num_leaves=31, the tree can have at most 31 leaves, which constrains its complexity even without a depth limit. A balanced tree with 31 leaves would have depth ~5, but leaf-wise growth produces asymmetric trees that may be deeper in some branches and shallower in others. The num_leaves parameter is the primary complexity control for LightGBM, replacing max_depth in this role.
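The depth range implied by num_leaves=31 can be checked with a back-of-the-envelope calculation:

```python
import math

num_leaves = 31

# Minimum possible depth: a perfectly balanced tree, ceil(log2(num_leaves))
min_depth = math.ceil(math.log2(num_leaves))

# Maximum possible depth: a fully unbalanced chain, one new leaf per split
max_depth = num_leaves - 1

print(min_depth, max_depth)   # 5 30
```

Leaf-wise growth lands somewhere between these extremes, going deep only where the loss reduction justifies it; the leaf count, not the depth, is what bounds the tree.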


Question 14 (Multiple Choice)

You are choosing between XGBoost, LightGBM, and CatBoost for a production system. The dataset has 50 million rows, 200 features, and 30 high-cardinality categorical features. Training must complete within 2 hours nightly. Which combination is most appropriate?

  • A) XGBoost with one-hot encoding on CPU
  • B) LightGBM with native categorical support on CPU
  • C) CatBoost with native categorical support on GPU
  • D) All three would work equally well

Answer: C) CatBoost with native categorical support on GPU. With 30 high-cardinality categoricals, CatBoost's ordered target statistics will handle encoding most effectively. One-hot encoding 30 high-cardinality features would massively expand the feature space, hurting XGBoost. LightGBM's native categoricals are good but CatBoost's are generally better for high cardinality. The GPU acceleration is important for the 50M-row dataset within the 2-hour time constraint. LightGBM with native categoricals on GPU would be the second-best choice.


Question 15 (Short Answer)

Why is gradient boosting described as "gradient descent in function space"? How does this relate to the choice of loss function?

Answer: In traditional gradient descent, you update model parameters by stepping in the direction of the negative gradient of the loss. In gradient boosting, you update the prediction function itself by adding a new tree that approximates the negative gradient of the loss with respect to the current predictions. This means gradient boosting works with any differentiable loss function --- squared error, log-loss, Huber loss, quantile loss, ranking losses --- because it only needs the gradient. The loss function determines what "error" means, and the trees learn to correct those errors regardless of which loss is chosen.
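For the two most common losses, the negative gradient is easy to compute by hand (toy numbers):

```python
import numpy as np

# Regression with squared error: L = 0.5 * (y - pred)^2
y = np.array([3.0, -1.0, 2.0])
pred = np.array([2.5, 0.0, 2.0])
# Negative gradient w.r.t. pred is y - pred → the plain residual
neg_grad_sq = y - pred

# Binary classification with log-loss, probabilities p = sigmoid(raw score):
y_bin = np.array([1.0, 0.0, 1.0])
p = np.array([0.7, 0.4, 0.9])
# Negative gradient of log-loss w.r.t. the raw score is y - p
neg_grad_log = y_bin - p

print(neg_grad_sq)    # residuals the next tree would fit
print(neg_grad_log)   # "probability residuals" for classification
```

Swapping the loss only changes which pseudo-residuals the next tree is asked to fit; the boosting machinery stays identical.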


This quiz covers Chapter 14: Gradient Boosting. Return to the chapter to review concepts.