Quiz: Chapter 13
Tree-Based Methods
Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.
Question 1 (Multiple Choice)
A decision tree selects the best feature and threshold at each node by:
- A) Randomly choosing a feature and finding the median split point
- B) Choosing the split that maximizes the reduction in impurity (information gain)
- C) Selecting the feature with the highest correlation with the target
- D) Minimizing the number of resulting leaf nodes
Answer: B) Choosing the split that maximizes the reduction in impurity (information gain). At each node, the tree evaluates all possible features and thresholds, computes the weighted average impurity of the resulting child nodes, and selects the split that produces the largest decrease from the parent node's impurity. This greedy approach optimizes locally at each step.
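This greedy selection rule can be sketched in a few lines of plain Python (toy one-feature data, Gini impurity; purely illustrative):

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 1 - p**2 - (1 - p)**2

xs = [1, 2, 3, 10, 11, 12]   # one feature
ys = [0, 0, 0, 1, 1, 1]      # class labels

parent = gini(ys)
best = None                   # (gain, threshold)
for t in sorted(set(xs))[:-1]:
    left = [y for x, y in zip(xs, ys) if x <= t]
    right = [y for x, y in zip(xs, ys) if x > t]
    # Weighted average impurity of the two children.
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
    gain = parent - weighted
    if best is None or gain > best[0]:
        best = (gain, t)

print(best)   # the split at x <= 3 separates the classes perfectly
```

With multiple features, the same loop runs over every (feature, threshold) pair and keeps the overall best.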
Question 2 (Multiple Choice)
The Gini impurity of a node where 80% of samples belong to class A and 20% belong to class B is:
- A) 0.16
- B) 0.32
- C) 0.50
- D) 0.80
Answer: B) 0.32. Gini = 1 - (0.8^2 + 0.2^2) = 1 - (0.64 + 0.04) = 1 - 0.68 = 0.32. A pure node would have Gini = 0, and a perfectly mixed 50/50 node would have Gini = 0.50.
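The arithmetic can be checked directly:

```python
# Gini impurity for an 80/20 two-class node.
p_a, p_b = 0.8, 0.2
gini = 1 - (p_a**2 + p_b**2)
print(round(gini, 4))   # 0.32
```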
Question 3 (Short Answer)
Explain why an unrestricted decision tree achieves 100% training accuracy on most datasets but performs poorly on test data.
Answer: An unrestricted tree keeps splitting until every leaf is pure, effectively creating a unique path for each training sample. This means it memorizes the training data --- including noise and random fluctuations --- rather than learning generalizable patterns. On test data, those noise-specific rules do not apply, so predictions are unreliable. This is the classic overfitting pattern: the model mistakes memorization for learning.
Question 4 (Multiple Choice)
Which hyperparameter is the single most effective way to prevent overfitting in a single decision tree?
- A) n_estimators
- B) max_features
- C) max_depth
- D) criterion
Answer: C) max_depth. Limiting tree depth directly controls how complex the tree can become. A tree with max_depth=6 can ask at most 6 questions before making a prediction, which prevents it from memorizing noise in the training data. n_estimators and max_features apply to ensembles, not single trees. criterion (Gini vs. entropy) has minimal effect on overfitting.
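Why depth is such a strong lever: each additional level can at most double the number of leaf regions, so model capacity grows exponentially with depth. A tiny sketch:

```python
# An upper bound on the number of leaf regions a binary tree of a given
# depth can carve out (the actual count is usually smaller).
def max_leaves(max_depth: int) -> int:
    return 2 ** max_depth

for depth in (3, 6, 12):
    print(depth, max_leaves(depth))
```

A depth-6 tree tops out at 64 regions; a depth-12 tree can carve out 4096, enough to isolate individual noisy samples in a modest dataset.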
Question 5 (Multiple Choice)
In a Random Forest, "double randomization" refers to:
- A) Random initialization of split thresholds and random selection of the loss function
- B) Bootstrap sampling of training data and random selection of features at each split
- C) Random pruning of branches and random assignment of class labels
- D) Random selection of tree depth and random weighting of leaf predictions
Answer: B) Bootstrap sampling of training data and random selection of features at each split. Each tree is trained on a bootstrap sample (which contains ~63.2% of the unique training samples, with duplicates making up the rest), and at each split, only a random subset of features is considered (controlled by max_features). This double randomization forces trees to explore different patterns, decorrelating their predictions and making the ensemble more powerful than any individual tree.
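Both sources of randomness are easy to simulate with the standard library (illustrative values; a real forest repeats this for every tree and every split):

```python
import random

rng = random.Random(0)
n_samples, n_features = 10_000, 20
max_features = 4   # illustrative: features considered per split

# Randomization 1: bootstrap sample for one tree (draw n with replacement).
bootstrap = [rng.randrange(n_samples) for _ in range(n_samples)]
unique_fraction = len(set(bootstrap)) / n_samples
print(round(unique_fraction, 3))   # close to 0.632

# Randomization 2: a fresh random feature subset at each split.
split_candidates = rng.sample(range(n_features), max_features)
print(split_candidates)
```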
Question 6 (Short Answer)
Explain the purpose of the max_features parameter in a Random Forest. What happens if you set max_features to the total number of features?
Answer: max_features controls how many features are randomly sampled as candidates at each split. Setting it to the total number of features removes feature randomization entirely, reducing the Random Forest to a bagged ensemble. The resulting trees are more correlated because they all tend to split on the same dominant features first. This correlation means the averaging provides less variance reduction, and the ensemble is weaker than one with proper feature randomization.
Question 7 (Multiple Choice)
A Random Forest trained with n_estimators=500 and oob_score=True reports an OOB accuracy of 0.923. What does this mean?
- A) 92.3% of the training data was used in each bootstrap sample
- B) Each tree achieved 92.3% accuracy on the training set
- C) When each training sample is classified only by trees that did NOT include it in their bootstrap, the overall accuracy is 92.3%
- D) 92.3% of the 500 trees agree on the majority class
Answer: C) When each training sample is classified only by trees that did NOT include it in their bootstrap, the overall accuracy is 92.3%. The OOB score uses the ~36.8% of trees that excluded each sample as a natural validation set. It is approximately equivalent to cross-validation accuracy and serves as a free estimate of generalization performance.
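The mechanics can be verified empirically: for each sample, count how many trees excluded it from their bootstrap. A toy sketch (stand-in bootstrap bookkeeping only, no actual trees):

```python
import random

rng = random.Random(0)
n, n_trees = 500, 100

# Record which sample indices each tree's bootstrap contains.
in_bag = [{rng.randrange(n) for _ in range(n)} for _ in range(n_trees)]

# For each sample, count the trees that never saw it (its OOB trees).
oob_counts = [sum(i not in bag for bag in in_bag) for i in range(n)]
avg_oob = sum(oob_counts) / (n * n_trees)
print(round(avg_oob, 3))   # close to 0.368
```

Each sample is out-of-bag for roughly 36.8% of the trees, so with enough trees every sample gets a meaningful "validation" vote.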
Question 8 (Multiple Choice)
Which statement about decision trees and feature scaling is TRUE?
- A) Decision trees require StandardScaler to perform well
- B) Decision trees produce different predictions after MinMaxScaler is applied
- C) Decision trees are invariant to monotonic feature transformations
- D) Decision trees require features to be normally distributed
Answer: C) Decision trees are invariant to monotonic feature transformations. Trees split on threshold comparisons (is feature X <= threshold T?), and monotonic transformations preserve the ordering of values. Scaling, log transforms, and other monotonic transforms change the threshold values but not which samples end up on each side of the split. The resulting predictions are identical.
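A quick demonstration with a log transform (any strictly increasing transform behaves the same way):

```python
import math

values = [0.5, 2.0, 8.0, 32.0, 128.0]
threshold = 5.0

# Which samples go left under the raw feature?
left_raw = [v <= threshold for v in values]

# Apply the same monotonic transform to the feature AND the threshold.
left_log = [math.log(v) <= math.log(threshold) for v in values]

print(left_raw == left_log)   # True: the partition is identical
```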
Question 9 (Short Answer)
Compare impurity-based (MDI) feature importance and permutation-based feature importance. When do they disagree, and which should you trust?
Answer: Impurity-based importance sums the Gini impurity reductions across all splits on a given feature. It is fast but biased toward high-cardinality and continuous features because they offer more possible split points. Permutation importance shuffles each feature and measures the drop in model performance --- it is slower but far less biased. The two methods tend to disagree when features have different cardinalities or when correlated features are present. When the ranking matters for decision-making, trust permutation importance.
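The permutation procedure itself is simple. A toy sketch with a hand-built "model" that uses only the first of two features (the second is pure noise), so the expected importance ranking is known in advance:

```python
import random

random.seed(1)
n = 1000
x1 = [random.gauss(0, 1) for _ in range(n)]   # informative feature
x2 = [random.gauss(0, 1) for _ in range(n)]   # pure noise
y = [1 if a > 0 else 0 for a in x1]

def model(a, b):
    """Stand-in for a fitted model: depends only on the first feature."""
    return 1 if a > 0 else 0

def accuracy(xs1, xs2):
    return sum(model(a, b) == t for a, b, t in zip(xs1, xs2, y)) / n

base = accuracy(x1, x2)   # 1.0 by construction

# Shuffle one feature at a time and measure the accuracy drop.
x1_shuf = x1[:]; random.shuffle(x1_shuf)
x2_shuf = x2[:]; random.shuffle(x2_shuf)
drop1 = base - accuracy(x1_shuf, x2)   # large: model relied on x1
drop2 = base - accuracy(x1, x2_shuf)   # zero: model ignores x2

print(round(drop1, 2), round(drop2, 2))
```

In practice you would use a fitted estimator and average the drop over several shuffles, but the logic is exactly this.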
Question 10 (Multiple Choice)
On the StreamFlow churn dataset, the Random Forest achieved an AUC-ROC of 0.889 compared to 0.823 for the logistic regression baseline. The primary reason for this improvement is:
- A) The Random Forest uses more training data
- B) The Random Forest captures non-linear interactions without manual feature engineering
- C) The Random Forest is immune to class imbalance
- D) The Random Forest uses a better loss function
Answer: B) The Random Forest captures non-linear interactions without manual feature engineering. The churn-generating process involves interactions between features (e.g., contract type interacts with tenure), which trees capture naturally through hierarchical splits. Logistic regression can only model these interactions if they are explicitly engineered as new features. Both models see the same data and face the same class imbalance.
Question 11 (Multiple Choice)
Which of the following is a known limitation of Random Forests?
- A) They cannot handle categorical features
- B) They cannot extrapolate beyond the range of training data
- C) They require feature scaling to work properly
- D) They always overfit on large datasets
Answer: B) They cannot extrapolate beyond the range of training data. Trees partition the feature space into regions and assign predictions based on training values within each region. If a test point has a target value outside the range seen in training (e.g., predicting a stock price higher than any historical value), the Random Forest caps the prediction at the most extreme training value. Linear models extrapolate naturally.
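A one-split regression "tree" makes the capping behavior concrete (toy data):

```python
# A toy regression tree with a single split at x <= 3: each side
# predicts the mean target of its training samples.
train_x = [1, 2, 3, 4, 5]
train_y = [10, 20, 30, 40, 50]
split = 3

left_mean = sum(y for x, y in zip(train_x, train_y) if x <= split) / 3
right_mean = sum(y for x, y in zip(train_x, train_y) if x > split) / 2

def predict(x):
    return left_mean if x <= split else right_mean

print(predict(100))   # 45.0 -- never exceeds the training targets
```

Even though the training data follows a perfect linear trend, an input far outside the training range still maps to the rightmost leaf's mean. A linear fit would return 1000 here.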
Question 12 (Short Answer)
A colleague shows you a Random Forest with n_estimators=50 and says, "Adding more trees will cause overfitting, just like adding more depth to a single tree." Is your colleague correct? Explain.
Answer: No, your colleague is incorrect. Adding more trees to a Random Forest does not increase overfitting. Each tree is trained independently on a bootstrap sample, and the ensemble prediction is the average (or majority vote) of all trees. Adding more trees reduces the variance of this average, making predictions more stable. In contrast, increasing the depth of a single tree allows it to memorize more noise. The ensemble's averaging mechanism is fundamentally different from a single tree's memorization mechanism.
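The variance-reduction argument can be simulated directly. Treating each tree as an independent noisy estimate of the true value (an idealization; real trees are partially correlated), the spread of the ensemble average shrinks as trees are added:

```python
import random
import statistics

random.seed(0)

def tree_pred():
    # Idealized tree: unbiased but noisy estimate of the true value 1.0.
    return 1.0 + random.gauss(0, 1)

def ensemble(n_trees):
    return sum(tree_pred() for _ in range(n_trees)) / n_trees

# Spread of the ensemble prediction across 500 repetitions.
spread_small = statistics.stdev(ensemble(10) for _ in range(500))
spread_large = statistics.stdev(ensemble(200) for _ in range(500))
print(round(spread_small, 3), round(spread_large, 3))
```

More trees never increase this spread; the cost of a large forest is compute time and memory, not overfitting.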
Question 13 (Multiple Choice)
In a bootstrap sample of n items drawn with replacement from n items, approximately what fraction of the original items are NOT included?
- A) 10%
- B) 25%
- C) 36.8%
- D) 50%
Answer: C) 36.8%. The probability that a specific item is NOT drawn in any single draw is (1 - 1/n). Over n draws, the probability it is never drawn is (1 - 1/n)^n, which converges to 1/e ≈ 0.368 as n grows. This means roughly 36.8% of items are excluded from each bootstrap sample, and these form the out-of-bag (OOB) samples used for validation.
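The convergence is fast and easy to check numerically:

```python
import math

# (1 - 1/n)^n approaches 1/e as n grows.
for n in (10, 100, 10_000):
    print(n, round((1 - 1/n) ** n, 4))
print("1/e ~", round(1 / math.e, 4))
```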
Question 14 (Short Answer)
You train a Random Forest and find that impurity-based importance ranks a binary feature (is_enterprise_customer, 0/1) at position 12 out of 20 features, but permutation importance ranks it at position 2. Explain what is likely happening.
Answer: The binary feature has only one possible split point, so it produces fewer Gini impurity reductions across the forest compared to continuous features with many unique values. Impurity-based importance systematically underrates low-cardinality features. Permutation importance, which measures the drop in model performance when the feature is shuffled, correctly identifies that the binary feature carries substantial predictive information. The permutation ranking is more trustworthy here.
Question 15 (Multiple Choice)
You are building a churn prediction model. Stakeholders require that every prediction can be explained by following a single decision path. Which model should you use?
- A) Random Forest with 500 trees
- B) A single pruned decision tree
- C) Gradient Boosted Trees
- D) A Random Forest with feature importance analysis
Answer: B) A single pruned decision tree. Only a single tree provides a traceable decision path from root to leaf for each prediction. A Random Forest aggregates predictions from hundreds of trees, and while you can extract feature importance, you cannot point to one coherent decision path. When explainability of individual predictions is a hard requirement, a single (well-pruned) tree or logistic regression is the appropriate choice.
These quiz questions cover Chapter 13: Tree-Based Methods. Return to the chapter to review concepts.