Chapter 28 Exercises: Decision Trees and Random Forests
How to use these exercises: Work through the sections in order. Conceptual questions solidify your understanding of how trees work; applied exercises get you building and tuning models in scikit-learn; real-world and synthesis exercises push you to think critically about when trees are the right tool. Code exercises assume you have a Jupyter notebook open with pandas, scikit-learn, and matplotlib imported.
Difficulty key: ⭐ Foundational | ⭐⭐ Intermediate | ⭐⭐⭐ Advanced | ⭐⭐⭐⭐ Extension
Part A: Conceptual Understanding ⭐
These questions check whether you internalized how decision trees make decisions and how random forests improve on them.
Exercise 28.1 — Tree anatomy
Draw (on paper or digitally) a decision tree for the following problem: predicting whether a student passes an exam based on three features — hours_studied, attendance_rate, and previous_gpa. Your tree should have exactly 3 internal nodes (including the root) and 4 leaf nodes. Label each node with a feature and a threshold, and label each leaf with "Pass" or "Fail."
Guidance
There are many valid answers. One example: root splits on `hours_studied > 5`, left child splits on `previous_gpa > 2.5`, right child splits on `attendance_rate > 80%`. The key is that your tree makes intuitive sense — students who study more AND attend class regularly are more likely to pass. Make sure every path from the root to a leaf tells a coherent story.

Exercise 28.2 — Gini by hand
Compute the Gini impurity for each of the following nodes:
- A node with 40 samples of class A and 10 samples of class B
- A node with 25 samples of class A and 25 samples of class B
- A node with 48 samples of class A and 2 samples of class B
- A node with 20 samples of class A, 20 of class B, and 20 of class C
Guidance
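Before reading the worked answers below, you can check your arithmetic with a small helper function. This is a quick sketch of my own, not code from the chapter:

```python
def gini(counts):
    """Gini impurity from a list of class counts: 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

# The four nodes from the exercise
for counts in [[40, 10], [25, 25], [48, 2], [20, 20, 20]]:
    print(counts, round(gini(counts), 4))
```

You should get 0.32, 0.5, 0.0768, and roughly 0.667, matching the hand calculations.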
Use the formula: Gini = 1 - sum of (proportion of each class)².

1. Gini = 1 - (40/50)² - (10/50)² = 1 - 0.64 - 0.04 = 0.32
2. Gini = 1 - (25/50)² - (25/50)² = 1 - 0.25 - 0.25 = 0.50 (the maximum for 2 classes)
3. Gini = 1 - (48/50)² - (2/50)² = 1 - 0.9216 - 0.0016 = 0.0768 (very pure)
4. Gini = 1 - (20/60)² - (20/60)² - (20/60)² = 1 - 3(1/9) = 1 - 0.333 = 0.667 (the maximum for 3 classes)

Exercise 28.3 — Information gain calculation
A node contains 80 samples: 50 of class "Yes" and 30 of class "No." A proposed split divides it into:

- Left child: 40 samples (35 Yes, 5 No)
- Right child: 40 samples (15 Yes, 25 No)
Calculate the information gain of this split using Gini impurity. Is this likely to be a good split?
Guidance
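The calculation below can be verified in a few lines of Python. This is my own sketch, reusing the Gini formula from the chapter:

```python
def gini(counts):
    """Gini impurity: 1 - sum of squared class proportions."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

parent = gini([50, 30])
left, right = gini([35, 5]), gini([15, 25])
weighted = (40 / 80) * left + (40 / 80) * right
print(f"gain = {parent - weighted:.4f}")  # prints gain = 0.1250
```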
Parent Gini = 1 - (50/80)² - (30/80)² = 1 - 0.3906 - 0.1406 = 0.4688
Left Gini = 1 - (35/40)² - (5/40)² = 1 - 0.7656 - 0.0156 = 0.2188
Right Gini = 1 - (15/40)² - (25/40)² = 1 - 0.1406 - 0.3906 = 0.4688
Weighted child Gini = (40/80)(0.2188) + (40/80)(0.4688) = 0.1094 + 0.2344 = 0.3438
Information gain = 0.4688 - 0.3438 = 0.125

This is a moderately good split. The left child became much purer (mostly Yes), but the right child is still fairly impure. The algorithm would compare this to all other possible splits and choose the best one.

Exercise 28.4 — Pruning intuition
Explain in your own words why an unpruned decision tree almost always overfits. Use the analogy of writing an exam answer that's "too specific" — so specific that it only applies to one particular question wording and would fail if the question were rephrased slightly.
Guidance
An unpruned tree keeps splitting until every leaf is pure (or nearly pure). This means it creates very specific rules that apply to tiny subsets of the training data — rules like "if feature A is between 3.47 and 3.52 AND feature B is above 7.83, predict class 1." These rules are memorizing noise in the training data, not capturing real patterns. Just as an exam answer that's so specific it only works for one exact phrasing of a question would fail a slightly rephrased version, the tree's hyper-specific rules fail on new data that doesn't match the training data exactly. Pruning forces the tree to make broader, more generalizable rules.

Exercise 28.5 — Bagging and bootstrap
Explain the bagging process step by step, as if you were teaching it to a friend who understands basic machine learning but hasn't encountered ensembles. Include: what a bootstrap sample is, why sampling with replacement creates diversity, and how the final prediction is made.
Guidance
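The process you're describing can be sketched directly in code. This is a simplified bagging loop of my own (using Iris as a stand-in dataset), not a production implementation:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.default_rng(42)

# 1) Draw bootstrap samples (n draws WITH replacement) and fit one tree per sample
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))
    trees.append(DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx]))

# 2) Final prediction: majority vote across the trees
all_preds = np.array([t.predict(X) for t in trees])
majority = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, all_preds)
print("ensemble accuracy on training data:", (majority == y).mean())
```

Each `idx` contains duplicates and omits roughly a third of the rows, which is exactly the diversity the exercise asks you to explain.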
A strong answer covers: (1) Bootstrap sampling means drawing n samples with replacement from the original n samples — some duplicated, some left out. (2) This creates diversity because each tree sees a slightly different version of the data, so it learns slightly different patterns. (3) For classification, the final prediction is majority vote across all trees; for regression, it's the average prediction. The key insight is that individual trees' errors tend to be uncorrelated, so they cancel out when averaged.

Exercise 28.6 — Feature randomness
In a random forest with 20 features and max_features='sqrt', how many features are considered at each split? Why does this additional randomness help, even though it might make individual trees weaker?
Guidance
sqrt(20) ≈ 4.47, so the algorithm considers 4 features at each split (scikit-learn rounds down). This makes individual trees weaker because they might miss the best possible split if the optimal feature isn't in their random subset. But it makes the ensemble stronger because it forces different trees to rely on different features, increasing diversity. Without this, if one feature dominates (like GDP in a health dataset), every tree would split on that feature first, producing very similar trees — and a forest of identical trees is no better than one tree.

Exercise 28.7 — Trees vs. linear models
A dataset has two features (x1 and x2) and a binary target. The true decision boundary is a straight diagonal line: if x1 + x2 > 10, the class is 1; otherwise it's 0. Would a decision tree or logistic regression be better suited for this problem? Why?
Guidance
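To see the argument concretely, here is a small experiment of my own (not from the chapter) that pits both models against exactly this diagonal boundary:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(1000, 2))
y = (X[:, 0] + X[:, 1] > 10).astype(int)  # true boundary: the diagonal x1 + x2 = 10

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
logit = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_tr, y_tr)
print("logistic regression:", logit.score(X_te, y_te))
print("depth-3 tree:", tree.score(X_te, y_te))
```

The logistic model should be near-perfect, while the shallow tree's axis-aligned staircase leaves a visible accuracy gap; try deeper trees to watch the gap shrink at the cost of complexity.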
Logistic regression would be better here because the true boundary is linear — exactly the kind of pattern logistic regression is designed to capture. A decision tree can only make axis-aligned splits (horizontal and vertical lines), so it would need many splits to approximate a diagonal boundary, creating a staircase-like pattern. This doesn't mean decision trees can't work — they'll eventually approximate the diagonal — but they'll need more data and more complexity to match what logistic regression captures in a single equation. This illustrates why choosing the right model matters.

Part B: Applied Exercises ⭐⭐
These exercises require writing and running code. Use a Jupyter notebook.
Exercise 28.8 — Your first tree
Load the Iris dataset from scikit-learn (from sklearn.datasets import load_iris). Train a DecisionTreeClassifier with default settings. Report the training and test accuracy (use a 70/30 split with random_state=42). Then retrain with max_depth=3 and compare.
Guidance
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=42
)

# Default tree
dt_default = DecisionTreeClassifier(random_state=42)
dt_default.fit(X_train, y_train)

# Pruned tree
dt_pruned = DecisionTreeClassifier(max_depth=3, random_state=42)
dt_pruned.fit(X_train, y_train)

# Report the accuracies the exercise asks for
for name, model in [("default", dt_default), ("max_depth=3", dt_pruned)]:
    print(f"{name}: train {model.score(X_train, y_train):.3f}, "
          f"test {model.score(X_test, y_test):.3f}")
```

The default tree likely achieves 100% training accuracy and ~95-97% test accuracy. The pruned tree sacrifices some training accuracy but may match or improve test accuracy while being much simpler.
Exercise 28.9 — Visualize and interpret
Using the pruned tree from Exercise 28.8, visualize it using plot_tree. Then use export_text to print the decision rules. In plain English, describe the first two splits: what feature does the tree ask about first, and what does the answer determine?
Guidance
```python
from sklearn.tree import plot_tree, export_text
import matplotlib.pyplot as plt

plt.figure(figsize=(14, 8))
plot_tree(dt_pruned, feature_names=iris.feature_names,
          class_names=iris.target_names, filled=True, rounded=True)
plt.title('Iris Decision Tree')
plt.tight_layout()
plt.show()

print(export_text(dt_pruned, feature_names=list(iris.feature_names)))
```
You should find that the tree first splits on petal width or petal length. Describe what you see in conversational terms: "The tree first asks whether petal length is less than about 2.45 cm. If yes, it immediately classifies the flower as setosa — no further questions needed."
Exercise 28.10 — Depth vs. accuracy curve
Using the Iris dataset, train decision trees with max_depth from 1 to 15. Plot the training accuracy and test accuracy on the same chart. At what depth does overfitting begin? What depth would you choose for deployment?
Guidance
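One way to set up the loop (a sketch of my own; the variable names are not from the chapter):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Same 70/30 split as Exercise 28.8
X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), test_size=0.3, random_state=42
)

depths = range(1, 16)
train_acc, test_acc = [], []
for d in depths:
    dt = DecisionTreeClassifier(max_depth=d, random_state=42).fit(X_train, y_train)
    train_acc.append(dt.score(X_train, y_train))
    test_acc.append(dt.score(X_test, y_test))

plt.plot(depths, train_acc, marker="o", label="train accuracy")
plt.plot(depths, test_acc, marker="o", label="test accuracy")
plt.xlabel("max_depth")
plt.ylabel("accuracy")
plt.legend()
plt.show()
```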
You should see training accuracy rise steadily toward 1.0 and test accuracy peak around depth 3-5 before leveling off or slightly declining. The optimal depth is where test accuracy is highest and the gap between training and test accuracy is reasonable.

Exercise 28.11 — Random forest comparison
Using the same Iris data split, train a RandomForestClassifier with 200 trees and compare its test accuracy to your best single tree. Also extract and plot feature importances from both models side by side.
Guidance
```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=42)
rf.fit(X_train, y_train)
print(f"forest test accuracy: {rf.score(X_test, y_test):.3f}")
print("feature importances:", rf.feature_importances_)
```
The random forest should achieve similar or slightly better test accuracy. The feature importance comparison is more interesting — both should agree that petal-related features are most important, but the exact rankings might differ.
Exercise 28.12 — The effect of n_estimators
Train random forests on the Iris data with n_estimators = [1, 5, 10, 25, 50, 100, 200, 500]. For each, record the test accuracy. Plot the results. At what point does adding more trees stop helping?
Guidance
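A minimal version of that loop might look like this (my own sketch, reusing the Exercise 28.8 split):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    *load_iris(return_X_y=True), test_size=0.3, random_state=42
)

scores = {}
for n in [1, 5, 10, 25, 50, 100, 200, 500]:
    rf = RandomForestClassifier(n_estimators=n, random_state=42).fit(X_train, y_train)
    scores[n] = rf.score(X_test, y_test)

for n, acc in scores.items():
    print(f"{n:>4} trees: {acc:.3f}")
```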
Create a loop that trains a new `RandomForestClassifier` for each value of `n_estimators`. You should see accuracy improve rapidly from 1 to ~50 trees and then plateau. This demonstrates that more trees rarely hurt, but there are diminishing returns after a certain point.

Exercise 28.13 — Class imbalance
Create a synthetic imbalanced dataset using make_classification from scikit-learn with weights=[0.95, 0.05] and 1000 samples. Train a default decision tree and report its accuracy. Then train with class_weight='balanced' and compare. What changed, and why?
Guidance
```python
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           random_state=42, n_informative=3)
```
The default tree will likely achieve ~95% accuracy by mostly predicting the majority class. With `class_weight='balanced'`, accuracy might decrease slightly, but the model will be much better at catching the minority class. Check the classification report to see the difference in recall for the minority class.
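A side-by-side comparison might look like this (my own sketch; the stratified split is an assumption, added so the rare class appears in both halves):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, weights=[0.95, 0.05],
                           n_informative=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

for cw in [None, "balanced"]:
    dt = DecisionTreeClassifier(class_weight=cw, random_state=42).fit(X_tr, y_tr)
    print(f"--- class_weight={cw} ---")
    print(classification_report(y_te, dt.predict(X_te)))
```

Focus on the minority class's recall row in each report rather than the overall accuracy.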
Exercise 28.14 — Permutation importance
Using the Iris random forest from Exercise 28.11, compute permutation importance using permutation_importance from sklearn.inspection. Compare the results to the default impurity-based importance. Do they agree? If not, which do you trust more for this dataset?
Guidance
```python
from sklearn.inspection import permutation_importance

perm_imp = permutation_importance(rf, X_test, y_test, n_repeats=30,
                                  random_state=42)
```
For the Iris dataset, both methods should broadly agree that petal features are more important than sepal features. Permutation importance is generally more trustworthy because it measures actual impact on predictions rather than an internal impurity-based proxy.
Part C: Real-World Applications ⭐⭐⭐
Exercise 28.15 — Marcus's bakery decisions
Marcus wants to predict which days will have above-average sales at his bakery. His features include: day_of_week, is_holiday, temperature, rain, and nearby_event. Design a decision tree (on paper or in pseudocode) that makes intuitive sense for this problem. What would you choose as the root node and why?
Guidance
The root node should be the feature that creates the most informative split. Candidates: `day_of_week` (weekends vs. weekdays) or `is_holiday` (holidays likely spike demand). A reasonable tree might split first on `day_of_week` (Saturday/Sunday vs. others), then on `nearby_event` for weekends and `temperature` for weekdays. The key is making your reasoning explicit — the best root node is the one that separates high-sales and low-sales days most cleanly.

Exercise 28.16 — The interpretability trade-off
You're a data scientist at a hospital. Your team built two models to predict patient readmission risk:
- Model A: Decision tree with max_depth=3, accuracy = 0.78
- Model B: Random forest with 500 trees, accuracy = 0.85
The hospital's chief medical officer says: "I need to explain to doctors why we're flagging specific patients." Which model would you recommend deploying, and how would you handle the interpretability gap? Write 2-3 paragraphs defending your recommendation.
Guidance
There's no single right answer, but a strong response acknowledges the tension. One approach: deploy Model B for predictions but use Model A (or a simplified version) for explanations. Another approach: use the random forest's feature importance to identify the top 3-4 risk factors and present those to doctors, even if the full model is more complex. You might also discuss SHAP values or partial dependence plots as tools for explaining complex models — we'll cover these in later courses, but awareness of the problem is the important thing here.

Exercise 28.17 — Priya's NBA predictor
Priya wants to predict whether the home team wins an NBA game using features like home_win_pct, away_win_pct, home_rest_days, away_rest_days, home_avg_points, and away_avg_points. She trains a random forest and finds that home_win_pct has importance 0.42, far above all other features. What does this tell her? What doesn't it tell her? How might she write this up for her sports article?
Guidance
It tells her that a team's overall winning percentage is the strongest predictor of any single game outcome — which makes sense (good teams win more). It doesn't tell her that winning percentage *causes* victories, and it doesn't tell her about the marginal effect of rest days or specific matchups. For her article, she might write something like: "While our model considers six factors, a team's season-long record dominates the prediction — confirming what fans already suspect: the best predictor of tonight's game is who's been better all season." She should avoid language suggesting the model reveals *why* teams win.

Exercise 28.18 — Jordan's grade predictor
Jordan builds a decision tree to predict whether students get an A in a course. The tree's first split is on professor_name. Jordan excitedly concludes: "The professor you get is the most important factor in whether you get an A!" Is this conclusion valid? What concerns should Jordan have?
Guidance
Jordan's conclusion conflates predictive importance with causal importance. The tree is saying that professor identity is the most informative feature for *predicting* the grade, not that professors *cause* grade differences. Several confounds: self-selection (harder courses attract different students), grading standards (some professors grade more generously), course content, and time-of-day effects. Also, if `professor_name` is a high-cardinality categorical variable, impurity-based importance may be biased in its favor. Jordan should be careful about claiming causation and should consider controlling for confounding factors.

Exercise 28.19 — Elena's policy tree
Elena wants to build a model to identify communities at risk of low vaccination coverage. She has good data and trains a random forest with high accuracy. But her director asks: "Can you give me a simple rule we can use in the field — something our outreach workers can check without a computer?" What approach would you recommend?
Guidance
Train a shallow decision tree (max_depth=2 or 3) alongside the random forest. The shallow tree won't be as accurate, but it will produce a simple rule like: "If a community has fewer than 2 physicians per 1000 people AND median income is below $30,000, it's high-risk." Validate this simple rule against the random forest's predictions to check that the tree captures the most important patterns. This is the "interpretable model for communication, complex model for prediction" strategy discussed in the chapter.

Part D: Synthesis and Critical Thinking ⭐⭐⭐
Exercise 28.20 — Model comparison table
You now have four types of models in your toolbox: linear regression, logistic regression, decision trees, and random forests. Create a comparison table with at least six dimensions (e.g., handles non-linearity, requires scaling, interpretability, speed, etc.). For each cell, provide a brief justification, not just "yes" or "no."
Guidance
Your table should include dimensions like: interpretability, accuracy ceiling, handles non-linearity, requires feature scaling, handles categorical features, sensitivity to outliers, training speed, prediction speed, extrapolation ability, and stability. A strong table acknowledges nuance — for example, logistic regression is interpretable through coefficients, while decision trees are interpretable through visual structure. Neither is simply "more interpretable" in all situations.

Exercise 28.21 — The bias-variance trade-off in trees
Explain the bias-variance trade-off as it applies to decision tree depth. Specifically: what happens to bias and variance as you increase max_depth from 1 to 20? How does a random forest address this trade-off differently than a single tree?
Guidance
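You can also watch the trade-off happen numerically. The simulation below is a rough sketch of my own (not from the chapter): it fits regression trees of different depths on repeatedly re-noised sine data and estimates bias² and variance of the predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
x = np.linspace(0, 6, 200).reshape(-1, 1)
y_true = np.sin(x).ravel()

def bias_variance(max_depth, n_rounds=50, noise=0.3):
    """Fit trees on repeatedly re-noised data; estimate bias^2 and variance."""
    preds = []
    for _ in range(n_rounds):
        y_noisy = y_true + rng.normal(0, noise, size=len(y_true))
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(x, y_noisy)
        preds.append(tree.predict(x))
    preds = np.array(preds)
    bias_sq = ((preds.mean(axis=0) - y_true) ** 2).mean()   # how far off on average
    variance = preds.var(axis=0).mean()                     # how unstable across fits
    return bias_sq, variance

for depth in [1, 20]:
    b2, var = bias_variance(depth)
    print(f"depth {depth}: bias^2 = {b2:.4f}, variance = {var:.4f}")
```

The stump should show high bias and low variance; the depth-20 tree the reverse — the pattern the exercise asks you to explain.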
Depth 1 (a stump): high bias (too simple to capture patterns), low variance (stable across different training sets). Depth 20: low bias (captures complex patterns), high variance (different training data produces very different trees). The optimal depth balances both. A random forest addresses this by keeping individual trees relatively complex (low bias) but averaging many of them (reducing variance). The averaging effect of the ensemble tames the variance without reintroducing bias.

Exercise 28.22 — When trees fail
Describe a type of dataset or problem where decision trees would perform poorly compared to logistic regression. Create a small synthetic example (even just 10 data points on paper) that illustrates the problem. Explain why the tree struggles and the linear model succeeds.
Guidance
Decision trees struggle with simple linear boundaries. Consider: x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]. The boundary is at x = 5.5. A decision tree handles this fine (one split). But now make it two-dimensional: the boundary is x1 + x2 > 10. The tree needs many axis-aligned splits to approximate this diagonal, creating a staircase pattern. Logistic regression captures it with a single equation: log-odds = b0 + b1*x1 + b2*x2.

Exercise 28.23 — Design a model selection strategy
You've been hired as a data scientist at a non-profit that wants to predict which at-risk youth are most likely to drop out of high school. The model needs to be: (a) accurate enough to be useful, (b) interpretable enough to be trusted by school counselors, and (c) fair across demographic groups. Describe your modeling strategy, including which models you'd try, how you'd evaluate them, and how you'd handle the interpretability and fairness requirements.
Guidance
A strong answer includes: (1) Start with logistic regression as a baseline for interpretability. (2) Train a random forest for accuracy. (3) Build a shallow decision tree for communication. (4) Evaluate using metrics beyond accuracy — precision and recall matter because false negatives (missing at-risk students) are costly. (5) Check fairness by comparing model performance across demographic groups. (6) Present the shallow tree to counselors while using the forest for the actual predictions. (7) Document limitations and monitor the model over time.

Part E: Extension Challenges ⭐⭐⭐⭐
Exercise 28.24 — Out-of-bag evaluation
Research how out-of-bag (OOB) scoring works in random forests. Train a RandomForestClassifier on the Iris dataset with oob_score=True and compare the OOB score to the test score from a standard train/test split. Write a paragraph explaining why OOB scoring is useful and when it might be preferred over a held-out test set.
Guidance
```python
rf_oob = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
rf_oob.fit(X_train, y_train)
print(f"OOB score: {rf_oob.oob_score_:.3f}")
print(f"Test score: {rf_oob.score(X_test, y_test):.3f}")
```
OOB scoring uses the ~37% of samples left out of each tree's bootstrap sample as a validation set for that tree. Averaged across all trees, this provides an estimate of generalization performance without needing a separate test set. It's useful when data is scarce and you can't afford to hold out a test set, but it's specific to bagging-based ensembles.
Exercise 28.25 — Beyond random forests: gradient boosting preview
Research the difference between bagging (random forests) and boosting (gradient boosted trees like XGBoost). Write a 1-page comparison covering: (1) how each approach builds its ensemble, (2) whether trees are built independently or sequentially, (3) typical accuracy comparisons, and (4) when you'd choose one over the other. We'll cover boosting in more depth in later courses, but understanding the landscape now will help you make better model choices.
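To ground your write-up in a concrete comparison, you could run both ensemble styles side by side. The sketch below is my own starting point (not part of the chapter); it uses scikit-learn's built-in `GradientBoostingClassifier` as the boosting example, since XGBoost is a separate library:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

# Bagging: 200 deep trees built independently, predictions averaged
rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_tr, y_tr)
# Boosting: 200 shallow trees built sequentially, each correcting the last
gb = GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                random_state=42).fit(X_tr, y_tr)

print("random forest:     ", rf.score(X_te, y_te))
print("gradient boosting: ", gb.score(X_te, y_te))
```

Which one wins depends on the dataset and tuning — that dependence is itself worth a paragraph in your comparison.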