Chapter 28 Quiz: Decision Trees and Random Forests

Instructions: This quiz tests your understanding of Chapter 28. Answer all questions before checking the solutions. For multiple choice, select the best answer — some options may be partially correct. For short answer questions, aim for 2-4 clear sentences. Total points: 90.


Section 1: Multiple Choice (10 questions, 4 points each)


Question 1. In a decision tree, which component makes the final prediction?

  • (A) The root node
  • (B) Internal nodes
  • (C) Leaf nodes
  • (D) Branches
Answer **Correct: (C)** Leaf nodes are the endpoints of the tree where predictions are made. The root node asks the first question, internal nodes ask subsequent questions, and branches represent the answers. A sample "falls through" the tree from the root until it reaches a leaf, which provides the prediction (a class label for classification or a number for regression).

Question 2. A node contains 60 samples of class A and 40 samples of class B. What is its Gini impurity?

  • (A) 0.24
  • (B) 0.48
  • (C) 0.50
  • (D) 0.60
Answer **Correct: (B)** Gini = 1 - (60/100)² - (40/100)² = 1 - 0.36 - 0.16 = 0.48. For two classes this simplifies to 2p(1-p) = 2 × 0.6 × 0.4 = 0.48.
  • **(A)** 0.24 is p(1-p) = 0.6 × 0.4, the result of dropping the factor of 2 in the binary shortcut.
  • **(C)** 0.50 is the maximum Gini impurity for two classes (a 50/50 split), which this is not.
  • **(D)** 0.60 is the proportion of class A, not the Gini impurity.
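The calculation is easy to check with a few lines of Python (a minimal sketch; the `gini` helper below is ours, not a scikit-learn function):

```python
def gini(counts):
    """Gini impurity for a node, given per-class sample counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(round(gini([60, 40]), 2))  # 0.48, the node in this question
print(round(gini([50, 50]), 2))  # 0.5, the two-class maximum
```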

Question 3. Why don't decision trees require feature scaling (e.g., standardization)?

  • (A) Decision trees are immune to all data preprocessing issues
  • (B) Trees make splits based on threshold comparisons within each feature, not on distances or weighted sums across features
  • (C) Scikit-learn automatically scales features inside DecisionTreeClassifier
  • (D) Feature scaling has no effect on any machine learning algorithm
Answer **Correct: (B)** Decision trees split data by asking "Is feature X above or below threshold T?" This comparison happens within a single feature, so the scale of that feature relative to other features doesn't matter. Whether GDP is measured in dollars ($50,000) or millions ($0.05M), the tree finds the same optimal split point. This contrasts with algorithms like logistic regression or k-nearest neighbors, which compute weighted sums or distances across features and are therefore affected by scale.
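A quick sketch (toy numbers, ours) illustrates the point: rescaling a feature rescales the threshold by the same factor, so the resulting partition of the samples is identical:

```python
feature = [30_000, 45_000, 52_000, 80_000]  # e.g. incomes in dollars

def partition(values, threshold):
    """Which side of the split each sample falls on."""
    return [v > threshold for v in values]

scaled = [v / 1_000_000 for v in feature]   # the same data, in millions
# a split at $48,000 and a split at 0.048M separate the samples identically
assert partition(feature, 48_000) == partition(scaled, 0.048)
```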

Question 4. A decision tree trained with default settings achieves 100% accuracy on training data and 72% on test data. What is the most likely problem?

  • (A) Underfitting
  • (B) Overfitting
  • (C) The training and test sets are from different distributions
  • (D) The features are not scaled
Answer **Correct: (B)** The large gap between training accuracy (100%) and test accuracy (72%) is the classic signature of overfitting. The tree has memorized the training data, creating highly specific rules that don't generalize. The fix is pruning: setting `max_depth`, `min_samples_leaf`, or `min_samples_split` to constrain the tree's complexity.
  • **(A)** Underfitting would show low accuracy on both training and test data.
  • **(C)** While possible, the most common explanation for 100% train / low test is overfitting.
  • **(D)** Decision trees don't require feature scaling.
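Such a tree can be reined in with pre-pruning parameters. The values below are illustrative starting points, not tuned settings; in practice you would choose them via cross-validation:

```python
from sklearn.tree import DecisionTreeClassifier

# a pre-pruned tree: each parameter caps the tree's complexity
dt = DecisionTreeClassifier(
    max_depth=5,           # limit the number of question levels
    min_samples_leaf=10,   # every leaf must cover at least 10 samples
    min_samples_split=20,  # don't split nodes with fewer than 20 samples
    random_state=42,
)
```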

Question 5. What is the primary purpose of bagging in a random forest?

  • (A) To speed up training by distributing work across CPUs
  • (B) To create diverse training sets for each tree, reducing the ensemble's variance
  • (C) To select the best features before training
  • (D) To prune individual trees after training
Answer **Correct: (B)** Bagging (Bootstrap Aggregating) creates different bootstrap samples — random samples drawn with replacement — for each tree. This ensures that each tree sees a slightly different version of the data, leading to different splits and different learned patterns. When these diverse trees vote together, their individual errors tend to cancel out, reducing variance.
  • **(A)** Parallelization is a practical benefit but not the purpose of bagging.
  • **(C)** Feature selection at each split is a separate mechanism (feature randomness), not bagging itself.
  • **(D)** Bagging doesn't involve post-training pruning.
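A small simulation (ours, stdlib only) reproduces the textbook figure that a bootstrap sample contains roughly 63% of the original rows, leaving about 37% out-of-bag:

```python
import random

random.seed(0)
n = 10_000
# draw n row indices with replacement, as bagging does for one tree
sample = [random.randrange(n) for _ in range(n)]
unique_frac = len(set(sample)) / n
print(round(unique_frac, 2))  # close to 1 - 1/e, about 0.63
```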

Question 6. In a random forest with 25 features and max_features='sqrt', how many features are considered at each split?

  • (A) 25
  • (B) 12 or 13
  • (C) 5
  • (D) 1
Answer **Correct: (C)** sqrt(25) = 5. At each split, the algorithm randomly selects 5 of the 25 features and finds the best split among only those 5. This forces different trees to rely on different features, increasing ensemble diversity.
  • **(A)** 25 would mean considering all features, which is the default for a single decision tree, not a random forest.
  • **(B)** 12 or 13 is roughly half the features, which is not what `'sqrt'` computes.
  • **(D)** Considering only 1 feature per split would create extremely random, weak trees.
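The arithmetic in a couple of lines (our sketch; as we read it, scikit-learn takes the integer part of the square root, with a floor of 1):

```python
import math

n_features = 25
# 'sqrt' considers the integer part of the square root (at least 1 feature)
m = max(1, int(math.sqrt(n_features)))
print(m)  # 5
```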

Question 7. Feature importance in a random forest tells you:

  • (A) Which features cause the target variable to change
  • (B) Which features the model relies on most to make accurate predictions
  • (C) Which features are most correlated with the target variable
  • (D) Which features should be removed from the dataset
Answer **Correct: (B)** Feature importance measures how much each feature contributes to the model's predictions, typically measured by the average reduction in impurity across all trees. It does NOT tell you about causation (A), simple correlation (C), or that features should be removed (D). A feature with low importance might still be causally relevant, and a feature with high importance might be a proxy for something else.

Question 8. Which of the following is a disadvantage of random forests compared to single decision trees?

  • (A) Random forests are always less accurate
  • (B) Random forests require feature scaling
  • (C) Random forests are harder to interpret because you can't visualize 200 trees as a single flowchart
  • (D) Random forests can't handle classification problems
Answer **Correct: (C)** The primary disadvantage of random forests over single decision trees is reduced interpretability. A single tree can be drawn, read, and explained to non-technical stakeholders. A forest of hundreds of trees cannot. This matters in settings where model transparency is required (e.g., regulated industries, medical decisions).
  • **(A)** Random forests are typically more accurate, not less.
  • **(B)** Neither single trees nor random forests require feature scaling.
  • **(D)** Random forests handle both classification and regression.

Question 9. You increase n_estimators in a random forest from 100 to 1000. What is the most likely effect?

  • (A) Training time increases significantly; test accuracy improves dramatically
  • (B) Training time increases significantly; test accuracy improves slightly or not at all
  • (C) Training time stays the same; test accuracy improves dramatically
  • (D) Both training time and test accuracy decrease
Answer **Correct: (B)** More trees always increase training time (linearly: 1000 trees take roughly 10x longer than 100 trees). However, test accuracy typically plateaus after a certain number of trees. Going from 10 to 100 trees usually helps a lot; going from 100 to 1000 rarely provides significant improvement. Adding trees rarely hurts accuracy (the averaging only stabilizes the predictions), but the marginal benefit diminishes rapidly.

Question 10. A decision tree trained on house prices can only predict prices between $80,000 and $1,200,000 (the range in the training data). A new house has features that suggest it should be worth $1,500,000. What will the tree predict?

  • (A) $1,500,000, by extrapolating the learned pattern
  • (B) A value at or near $1,200,000, the highest price in the training data
  • (C) An error, because the input is out of range
  • (D) $0, because the tree can't handle this input
Answer **Correct: (B)** Decision trees (and random forests) cannot extrapolate beyond the range of values seen in training. Each leaf predicts the average of the training samples that landed in that leaf. For an unusually high-value house, the tree will route it to the leaf containing the most expensive training houses and predict their average — which will be at or near (but never above) the training maximum. This is a fundamental limitation of tree-based models compared to linear models, which can extrapolate along the learned line.
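A short sketch (ours) makes the ceiling visible: even on a perfectly linear target, a fitted tree cannot predict above the maximum it saw in training:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.arange(100, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel()                    # a perfectly linear target
tree = DecisionTreeRegressor(random_state=0).fit(X, y)

pred = tree.predict([[1000.0]])[0]     # far outside the training range
print(pred <= y.max())                 # True: capped at the training maximum
```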

Section 2: True/False (4 questions, 4 points each)


Question 11. True or False: A decision tree with max_depth=1 (a "stump") can only use one feature to make its prediction.

Answer **True.** A tree with `max_depth=1` has exactly one split (the root node) that divides data into two leaf nodes. That single split uses one feature and one threshold. The tree's entire decision-making is based on whether one feature is above or below that threshold. This is why stumps are useful as building blocks in boosting but too simple for most real problems.

Question 12. True or False: In a random forest, each tree is trained on the entire training dataset.

Answer **False.** Each tree in a random forest is trained on a bootstrap sample — a random sample drawn with replacement from the training data. A bootstrap sample has the same size as the original training set, but some original samples are duplicated and about 37% are left out. This is the "bagging" mechanism that creates diversity among the trees.

Question 13. True or False: If a random forest's most important feature is age, it means that age causes the outcome.

Answer **False.** Feature importance measures predictive contribution, not causation. A feature can be important for prediction because it's correlated with the true cause, or because it's a proxy for unmeasured variables. For example, `age` might be the most important predictor of health outcomes not because age directly causes illness, but because age correlates with lifestyle, exposure history, and genetic factors. Establishing causation requires experimental or quasi-experimental methods, not feature importance scores.

Question 14. True or False: Increasing max_depth in a decision tree always improves test accuracy.

Answer **False.** Increasing `max_depth` always improves (or maintains) *training* accuracy, because a deeper tree can fit the training data more precisely. But test accuracy typically rises to a peak at moderate depths and then declines as the tree begins overfitting — memorizing noise in the training data rather than learning generalizable patterns. This is the bias-variance trade-off in action.

Section 3: Short Answer (3 questions, 6 points each)


Question 15. Explain the difference between pre-pruning and post-pruning in decision trees. Give one example of a pre-pruning parameter in scikit-learn.

Answer **Pre-pruning** (early stopping) constrains the tree *during* training by setting limits on how complex it can grow. Examples in scikit-learn include `max_depth` (limits the number of levels), `min_samples_split` (requires a minimum number of samples to create a split), and `min_samples_leaf` (requires a minimum number of samples in each leaf). **Post-pruning** grows the full tree first and then removes branches that don't improve performance on a validation set. In scikit-learn, the `ccp_alpha` parameter implements cost-complexity pruning, a form of post-pruning. Pre-pruning is more common in practice because it's simpler and computationally cheaper — you avoid growing unnecessary branches in the first place.
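A hedged sketch of post-pruning with `ccp_alpha` (synthetic data of our own making; in practice you would pick the alpha on a validation set rather than by index):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# grow the full tree, then list the candidate pruning strengths
full = DecisionTreeClassifier(random_state=0).fit(X, y)
path = full.cost_complexity_pruning_path(X, y)

# refit with a mid-range alpha; stronger alphas prune more aggressively
alpha = path.ccp_alphas[len(path.ccp_alphas) // 2]
pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X, y)
print(pruned.get_depth() <= full.get_depth())  # True: pruning never deepens
```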

Question 16. Why does a random forest use feature randomness (considering only a subset of features at each split) in addition to bagging? What problem does this solve?

Answer Feature randomness solves the problem of tree correlation. If one feature is much stronger than all others (e.g., GDP per capita dominates when predicting health outcomes), every tree in the bagged ensemble will split on that feature first, producing very similar trees. Similar trees make similar errors, and averaging similar predictions provides little benefit. By forcing each split to consider only a random subset of features, the algorithm ensures that some trees will split on their second-best or third-best feature first, producing genuinely diverse trees. The ensemble's power comes from this diversity — diverse trees make different errors that cancel out when averaged.

Question 17. A colleague says: "I trained a decision tree with max_depth=3 for our presentation and a random forest with 500 trees for production. The decision tree shows that income is the root split, and the random forest ranks income as the most important feature. So our story is consistent." Is this a valid approach? What are its strengths and potential weaknesses?

Answer This is a valid and common approach — using a simple model for communication and a complex model for production. The strength is that it gives stakeholders an intuitive understanding (the tree) while deploying the most accurate model (the forest). The consistency between the tree's root split and the forest's top feature adds confidence. Potential weaknesses: (1) The shallow tree may oversimplify, hiding important interactions that only appear deeper in the tree. (2) The tree and forest may disagree on specific predictions for individual cases, which could be confusing if stakeholders try to verify the forest's predictions against the tree's logic. (3) Feature importance rankings can differ between single trees and forests, especially for lower-ranked features. The approach works best when the two models agree on the big picture, even if they disagree on details.

Section 4: Applied Scenarios (2 questions, 5 points each)


Question 18. You're building a model to predict which customers will cancel their gym memberships. You train a random forest and find:

  • Feature importances: months_since_last_visit (0.38), monthly_visits (0.25), contract_length (0.18), age (0.12), distance_from_gym (0.07)
  • Test accuracy: 0.84

Your gym manager asks: "So what should we do about it?" Write 3-4 sentences translating the model's findings into actionable business recommendations. Be specific about how you'd use the feature importance information.

Answer A strong answer connects the feature importance scores to concrete actions: "The two strongest predictors of cancellation are how recently a member visited and how often they visit each month. This suggests our primary retention strategy should focus on re-engaging members who haven't visited recently — perhaps a personal check-in call or email after 2-3 weeks of inactivity, before they drift further away. Members on shorter contracts are also at higher risk, so we might offer incentives for longer-term commitments or create milestone rewards for consistent attendance. Distance from the gym is the weakest predictor, suggesting that location alone isn't driving cancellations — it's engagement habits that matter most."

Question 19. Look at this code and its output:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
import numpy as np

np.random.seed(42)
X = np.random.randn(100, 5)
y = np.random.randint(0, 2, 100)  # Random labels!

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
dt = DecisionTreeClassifier(random_state=42)
dt.fit(X_train, y_train)
print(f"Train accuracy: {dt.score(X_train, y_train):.2f}")  # 1.00
print(f"Test accuracy: {dt.score(X_test, y_test):.2f}")     # 0.47
```

The labels are completely random — there's no real pattern. Yet the tree achieves 100% training accuracy. Explain why this happens and what it tells us about the importance of proper model evaluation.

Answer This is a powerful demonstration of overfitting. The labels are random, meaning there is no true relationship between the features and the target. Yet the decision tree, with no depth limit, can always find a way to perfectly separate the training data — it just creates increasingly specific, meaningless rules until every leaf contains samples of only one class. The test accuracy of ~47% (close to random guessing at 50%) reveals the truth: the model has learned nothing useful. This example shows why training accuracy alone is meaningless. You must always evaluate on held-out test data. A model that looks perfect on training data might be memorizing noise rather than learning signal. This is also why cross-validation (Chapter 29) is so important — it gives you a more reliable estimate of how well your model actually generalizes.
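A follow-up sketch (ours, with a fixed seed for reproducibility): repeating the experiment with a depth limit shows how pre-pruning blocks the memorization — the exact accuracies depend on the seed:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

np.random.seed(42)
X = np.random.randn(100, 5)
y = np.random.randint(0, 2, 100)  # random labels again

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# with at most 8 leaves, the tree can no longer memorize 70 random labels
dt = DecisionTreeClassifier(max_depth=3, random_state=42)
dt.fit(X_train, y_train)
print(dt.score(X_train, y_train) < 1.0)  # True: memorization is blocked
```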

Section 5: Code Analysis (1 question, 6 points)


Question 20. The following code trains a random forest and plots feature importance. Identify and fix the bug that would cause the feature importance plot to display incorrectly.

```python
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
import pandas as pd

features = ['income', 'age', 'credit_score', 'months_employed']
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Bug is here:
importance_series = pd.Series(rf.feature_importances_, index=['a', 'b', 'c', 'd'])
importance_series.sort_values().plot(kind='barh')
plt.xlabel('Importance')
plt.title('Feature Importance')
plt.show()
```
Answer The bug is in the `index` parameter when creating the Series. The feature importances are being labeled with `['a', 'b', 'c', 'd']` instead of the actual feature names. The correct line is:

```python
importance_series = pd.Series(rf.feature_importances_, index=features)
```

This is a common and insidious bug because the code runs without errors: it produces a plot with bars labeled 'a', 'b', 'c', 'd' instead of the actual feature names. You might even interpret the results incorrectly if you assume the order matches your expectation. Always use the actual feature names when creating importance plots, and double-check that labels match by printing the importance values alongside their names.