Learning Objectives
- Explain how a decision tree makes predictions by splitting data at each node based on feature thresholds
- Define Gini impurity and information gain and describe their role in choosing splits
- Train a DecisionTreeClassifier in scikit-learn and interpret the resulting tree structure
- Identify overfitting in decision trees and apply pruning techniques including max_depth, min_samples_leaf, and min_samples_split
- Explain the ensemble concept and describe how bagging produces a random forest from many trees
- Train a RandomForestClassifier and extract feature importance scores
- Compare decision trees and random forests in terms of interpretability, accuracy, and computational cost
In This Chapter
- Chapter Overview
- 28.1 Thinking in Questions: What Is a Decision Tree?
- 28.2 How Does a Tree Decide Where to Split? Gini Impurity and Information Gain
- 28.3 Building Your First Decision Tree in scikit-learn
- 28.4 The Overfitting Problem: Why Trees Need Pruning
- 28.5 From One Tree to a Forest: The Ensemble Idea
- 28.6 Random Forests: Bagging + Feature Randomness
- 28.7 Feature Importance: What the Forest Learned
- 28.8 Decision Trees vs. Random Forests: When to Use Which
- 28.9 Regression Trees: Trees That Predict Numbers
- 28.10 Progressive Project: Building a Decision Tree for Vaccination Coverage
- 28.11 Common Pitfalls and How to Avoid Them
- 28.12 The Big Picture: Where Trees Fit in Your Toolbox
- Summary
Chapter 28: Decision Trees and Random Forests — Models You Can Explain to Your Boss
"The best model is the one the decision-maker actually trusts." — Overheard in a data science team meeting
Chapter Overview
Picture this. You've just spent three weeks building a logistic regression model that predicts which customers are likely to cancel their subscriptions. The model works — 87% accuracy on the test set. You're proud. You walk into your manager's meeting, pull up your slides, and start explaining the coefficients.
"So the log-odds of churn increase by 0.34 for each unit decrease in monthly usage, holding other variables constant, and the sigmoid function transforms..."
Your manager's eyes glaze over. The VP of Marketing is checking her phone. The CEO asks: "Can you just tell me why they're leaving?"
Now imagine a different scenario. You walk into the same meeting, pull up a diagram that looks like a flowchart, and say:
"Customers who haven't logged in for 30 days AND whose monthly spend dropped below $15 — those are your highest-risk group. Eighty percent of them cancel within two months."
Everyone nods. The VP puts down her phone. The CEO says: "Let's target those customers with a retention campaign this week."
That flowchart? It was a decision tree. And it just did something that your technically superior logistic regression couldn't: it communicated.
Welcome to the chapter where machine learning meets human communication. Decision trees are one of the oldest and most intuitive algorithms in the field, and they remain one of the most widely used — not because they're always the most accurate, but because they produce models that people can actually understand, explain, and act on.
But we won't stop at single trees. We'll also learn about random forests, an ensemble method that combines hundreds of decision trees into something much more powerful. The trade-off? You gain accuracy but lose some of that beautiful interpretability. Understanding when to make that trade-off is one of the key skills of a practicing data scientist.
In this chapter, you will learn to:
- Explain how a decision tree makes predictions by splitting data at each node (all paths)
- Define Gini impurity and information gain and describe their role in choosing splits (standard + deep dive paths)
- Train a `DecisionTreeClassifier` in scikit-learn and interpret the resulting tree (all paths)
- Identify overfitting in decision trees and apply pruning techniques (all paths)
- Explain the ensemble concept and how bagging produces a random forest (standard + deep dive paths)
- Train a `RandomForestClassifier` and extract feature importance scores (all paths)
- Compare decision trees and random forests in terms of interpretability, accuracy, and cost (all paths)
28.1 Thinking in Questions: What Is a Decision Tree?
You already know how to think like a decision tree. You do it every day.
When you decide what to wear in the morning, your brain runs something like this:
- Is it raining? Yes -> Grab a jacket. No -> Keep going.
- Is the temperature above 70°F? Yes -> Short sleeves. No -> Long sleeves.
- Am I going somewhere fancy? Yes -> The nice shirt. No -> Whatever's clean.
That's a decision tree. A series of yes/no questions, asked in a specific order, that leads to a decision at the end. Each question narrows down the possibilities until you reach a conclusion.
A decision tree in machine learning works exactly the same way — except instead of asking about weather and dress codes, it asks about features in your data. And instead of you deciding which questions to ask, the algorithm figures out the best questions automatically by looking at the training data.
The Anatomy of a Tree
Let's get the vocabulary down. Every decision tree has three types of components:
- Root node: The very first question at the top of the tree. This is the single most informative question the algorithm can ask — the one that splits your data into the most useful groups.
- Internal nodes: Every subsequent question after the root. Each internal node takes the data that flowed into it and splits it further based on another feature.
- Leaf nodes (or just leaves): The endpoints where no more questions are asked. Each leaf contains a prediction — a class label for classification, or a number for regression.
The connections between nodes are called branches, and they represent the answers to each question (typically "yes/go left" and "no/go right" for binary splits).
Here's a simple tree that a bank might use to decide whether to approve a loan:
                 [Income > $50K?]                 <-- Root node
                 /              \
               Yes               No
               /                  \
  [Credit Score > 700?]     [Has Collateral?]     <-- Internal nodes
       /       \                /       \
     Yes       No             Yes       No
      |         |              |         |
   APPROVE   REVIEW         REVIEW     DENY       <-- Leaf nodes
When a new loan application comes in, the tree asks the first question: "Is the applicant's income above $50,000?" If yes, go left and ask about the credit score. If the credit score is above 700, approve the loan. If not, flag it for review. If the income was below $50,000, go right and ask about collateral instead.
This is a classification tree — it predicts a category (approve, review, or deny). Decision trees can also do regression — predicting a continuous number, like the loan amount to offer — but we'll focus on classification in this chapter because it connects directly to the logistic regression work you did in Chapter 27.
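To see that traversal as code, here is the loan tree from the diagram hand-written as plain if/else logic (a sketch for illustration; in practice the algorithm learns these questions and thresholds from training data):

```python
def loan_decision(income, credit_score, has_collateral):
    """Traverse the loan-approval tree from the diagram above."""
    if income > 50_000:             # root node
        if credit_score > 700:      # left internal node
            return "APPROVE"
        return "REVIEW"
    if has_collateral:              # right internal node
        return "REVIEW"
    return "DENY"

print(loan_decision(income=60_000, credit_score=720, has_collateral=False))  # APPROVE
print(loan_decision(income=45_000, credit_score=680, has_collateral=True))   # REVIEW
```

Each call follows exactly one root-to-leaf path, which is why tree predictions are so fast and so easy to audit.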
Check Your Understanding
- In the loan approval tree above, how many leaf nodes are there? How many internal nodes (including the root)?
- If an applicant has an income of $45,000 and offers their car as collateral, what does the tree predict?
- Why do you think the algorithm chose "Income > $50K" as the root node rather than "Has Collateral?" What might that tell us about the training data?
28.2 How Does a Tree Decide Where to Split? Gini Impurity and Information Gain
Here's the big question: how does the algorithm figure out which questions to ask, and in what order?
The answer involves a concept called impurity. Think of it this way: if you have a bowl of fruit that's 100% apples, it's perfectly "pure" — there's no uncertainty about what you'll grab. But if the bowl is 50% apples and 50% oranges, it's maximally "impure" — you genuinely don't know what you'll get.
A decision tree wants each split to create groups that are as pure as possible. If you're trying to classify loan applications as "approve" or "deny," the best question is one that puts most of the "approve" cases on one side and most of the "deny" cases on the other.
Gini Impurity
The most common measure of impurity in scikit-learn (and the default for DecisionTreeClassifier) is Gini impurity. Here's the intuition:
Imagine you pick a random sample from a node and then randomly assign it a label based on the distribution of labels in that node. Gini impurity measures the probability that you'd assign the wrong label. Lower Gini = purer node = better.
For a node with two classes, the formula is:
Gini = 1 - p₁² - p₂²
Where p₁ and p₂ are the proportions of each class. Let's see some examples:
- Pure node (all one class): p₁ = 1.0, p₂ = 0.0. Gini = 1 - 1.0² - 0.0² = 0.0. Perfect purity.
- Maximally impure (50/50 split): p₁ = 0.5, p₂ = 0.5. Gini = 1 - 0.25 - 0.25 = 0.5. Maximum uncertainty.
- Mostly one class (90/10): p₁ = 0.9, p₂ = 0.1. Gini = 1 - 0.81 - 0.01 = 0.18. Pretty pure.
You can compute Gini impurity by hand:
from collections import Counter

def gini_impurity(labels):
    """Probability of mislabeling a random sample drawn from this node."""
    counts = Counter(labels)
    total = len(labels)
    return 1 - sum((c / total) ** 2 for c in counts.values())

# A pure node
print(gini_impurity(['approve'] * 10))                # 0.0
# A 50/50 node
print(gini_impurity(['approve'] * 5 + ['deny'] * 5))  # 0.5
Information Gain
Now that we can measure how impure a node is, we can evaluate potential splits. Information gain measures how much a split reduces impurity. The algorithm tries every possible split on every feature and picks the one with the highest information gain.
Here's the idea: before the split, you have one node with some Gini impurity. After the split, you have two child nodes, each with their own Gini impurity. The information gain is the difference — how much purer the children are compared to the parent, weighted by the number of samples in each child.
Information Gain = Gini(parent) - [weighted average of Gini(children)]
Let's make this concrete. Suppose you have 100 loan applications: 60 approved, 40 denied. The parent node has:
Gini(parent) = 1 - (0.6)² - (0.4)² = 1 - 0.36 - 0.16 = 0.48
Now you split on "Income > $50K." The left child gets 50 samples (45 approved, 5 denied), and the right child gets 50 samples (15 approved, 35 denied):
Gini(left) = 1 - (45/50)² - (5/50)² = 1 - 0.81 - 0.01 = 0.18
Gini(right) = 1 - (15/50)² - (35/50)² = 1 - 0.09 - 0.49 = 0.42
The weighted average Gini of the children:
Weighted Gini = (50/100) × 0.18 + (50/100) × 0.42 = 0.09 + 0.21 = 0.30
So the information gain from this split is:
Gain = 0.48 - 0.30 = 0.18
The algorithm would compare this gain to every other possible split (income > $40K, credit score > 650, credit score > 700, etc.) and pick the split with the highest gain. That's it. That's the whole algorithm at the split level.
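The arithmetic above is easy to verify in a few lines of Python (a direct transcription of the worked example, not library code):

```python
def gini(p1, p2):
    """Gini impurity for a two-class node."""
    return 1 - p1**2 - p2**2

gini_parent = gini(0.6, 0.4)       # 0.48
gini_left = gini(45/50, 5/50)      # 0.18
gini_right = gini(15/50, 35/50)    # 0.42

# Children weighted by their share of the parent's 100 samples
weighted = (50/100) * gini_left + (50/100) * gini_right
gain = gini_parent - weighted

print(f"Information gain: {gain:.2f}")  # Information gain: 0.18
```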
Entropy: An Alternative Measure
You might also encounter entropy as a measure of impurity, especially in textbooks rooted in information theory. Entropy uses logarithms instead of squared proportions:
Entropy = -p₁ × log₂(p₁) - p₂ × log₂(p₂)
Entropy-based information gain is associated with the ID3 and C4.5 algorithms (C4.5 actually uses a normalized variant called gain ratio). Gini impurity is the choice of CART (Classification and Regression Trees), the algorithm scikit-learn implements. In practice, the two measures produce very similar trees. Scikit-learn defaults to Gini, and that's what we'll use.
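If you want to compute entropy yourself, here's a minimal version (using the standard convention that 0 × log₂(0) counts as 0):

```python
import math

def entropy(p1, p2):
    """Entropy of a two-class node; 0 * log2(0) is treated as 0."""
    return 0.0 - sum(p * math.log2(p) for p in (p1, p2) if p > 0)

print(entropy(0.5, 0.5))            # 1.0 (maximum uncertainty)
print(entropy(1.0, 0.0))            # 0.0 (pure node)
print(round(entropy(0.9, 0.1), 3))  # 0.469
```

Note that entropy for a 50/50 node is 1.0 while Gini is 0.5; the two are on different scales but rank splits almost identically.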
Why This Matters: You might be thinking, "Do I really need to understand the math?" For most practical work, you don't need to compute Gini by hand. But understanding what the algorithm is optimizing helps you diagnose problems. When your tree makes a weird split, knowing that it's chasing the highest information gain helps you understand why — and what to do about it.
Check Your Understanding
- A node contains 30 samples of class A and 10 samples of class B. What is its Gini impurity?
- If a split produces one child that's perfectly pure and one child that's 50/50, is that necessarily a good split? (Hint: think about the sizes of the children.)
- Why does the algorithm weight the children's Gini values by the number of samples in each child?
28.3 Building Your First Decision Tree in scikit-learn
Enough theory — let's build a tree. We'll use the same approach we've been using throughout Part V: load data, split into training and test sets, fit a model, and evaluate it.
For our running example, let's use a health-related dataset that connects to Elena's vaccination work. We'll predict whether a country has "high" (above median) vaccination coverage based on economic and health indicators.
Step 1: Prepare the Data
import pandas as pd
from sklearn.model_selection import train_test_split
# Load the health indicators dataset
health = pd.read_csv('global_health_indicators.csv')
# Create binary target: high vs low vaccination coverage
median_rate = health['vaccination_rate'].median()
health['high_coverage'] = (health['vaccination_rate'] >= median_rate).astype(int)
# Select features
features = ['gdp_per_capita', 'health_spending_pct', 'physicians_per_1000',
'literacy_rate', 'urban_population_pct']
X = health[features]
y = health['high_coverage']
# Split: 70% train, 30% test
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
Notice something interesting: we didn't need to scale the features. Unlike linear and logistic regression, decision trees don't care about feature scale. Whether GDP is measured in dollars or millions of dollars, the tree will find the same split point. This is a genuine practical advantage.
Step 2: Train the Tree
from sklearn.tree import DecisionTreeClassifier
tree_model = DecisionTreeClassifier(random_state=42)
tree_model.fit(X_train, y_train)
print(f"Training accuracy: {tree_model.score(X_train, y_train):.3f}")
print(f"Test accuracy: {tree_model.score(X_test, y_test):.3f}")
If you run this code with the defaults, you'll likely see something alarming:
Training accuracy: 1.000
Test accuracy: 0.783
A perfect training score and a much lower test score. The tree has memorized the training data. It grew so deep and so specific that it created a unique path for nearly every training example. This is the classic sign of overfitting, and it's the single biggest challenge with decision trees.
Step 3: Visualize the Tree
One of the best things about decision trees is that you can actually look at the model. Try this:
from sklearn.tree import plot_tree
import matplotlib.pyplot as plt
plt.figure(figsize=(20, 10))
plot_tree(tree_model, feature_names=features, class_names=['Low', 'High'],
filled=True, rounded=True, max_depth=3, fontsize=10)
plt.title('Decision Tree for Vaccination Coverage (first 3 levels)')
plt.tight_layout()
plt.savefig('decision_tree_visualization.png', dpi=150)
plt.show()
The max_depth=3 parameter in plot_tree doesn't change the model — it just limits how many levels of the tree are displayed. The filled=True option colors nodes by their dominant class, making it easy to see the decision landscape at a glance.
You can also export the tree as text:
from sklearn.tree import export_text
tree_rules = export_text(tree_model, feature_names=features, max_depth=4)
print(tree_rules)
This produces something like:
|--- gdp_per_capita <= 4523.50
| |--- literacy_rate <= 67.45
| | |--- class: Low
| |--- literacy_rate > 67.45
| | |--- health_spending_pct <= 3.82
| | | |--- class: Low
| | |--- health_spending_pct > 3.82
| | | |--- class: High
|--- gdp_per_capita > 4523.50
| |--- physicians_per_1000 <= 0.85
| | |--- class: Low
| |--- physicians_per_1000 > 0.85
| | |--- class: High
Read it from top to bottom: "If GDP per capita is at most $4,523.50, look at literacy rate. If literacy rate is at most 67.45%, predict Low coverage." This is a model you can explain to anyone — a policymaker, a journalist, a community health worker. That's the power of decision trees.
Real-World Application: Elena could use this exact tree to brief her county health director: "Countries — or in our case, neighborhoods — with lower economic indicators AND lower literacy rates tend to have the lowest vaccination coverage. But here's the interesting part: among lower-income areas, those with higher health spending as a percentage of their budget actually achieve high coverage. That suggests targeted health investment can overcome economic disadvantages."
28.4 The Overfitting Problem: Why Trees Need Pruning
Let's address that elephant in the room — the 100% training accuracy and the 78% test accuracy. An unrestricted decision tree will keep splitting until every leaf contains samples from only one class. The resulting tree is enormous, brittle, and useless for new data.
Think of it this way: if your tree has a leaf that says "approve the loan if the applicant's income is exactly $67,423 AND their credit score is exactly 714 AND they applied on a Tuesday," that rule is memorizing one specific training example, not learning a general pattern.
What Is Pruning?
Pruning is the process of simplifying a tree by removing branches that don't contribute to generalization. There are two approaches:
- Pre-pruning (also called "early stopping"): Stop growing the tree before it gets too deep. Set limits on how complex the tree can become.
- Post-pruning: Grow the full tree first, then cut back branches that don't improve performance on a validation set. Scikit-learn's `DecisionTreeClassifier` supports a form of this through the `ccp_alpha` parameter (cost-complexity pruning).
Pre-pruning is more common in practice because it's simpler and faster. Here are the key hyperparameters:
Key Pruning Hyperparameters
max_depth: The maximum number of levels the tree can have. A depth of 1 means the tree asks only one question (a "stump"). A depth of 5 means up to five questions in sequence. This is the single most important hyperparameter for controlling overfitting.
tree_pruned = DecisionTreeClassifier(max_depth=4, random_state=42)
tree_pruned.fit(X_train, y_train)
print(f"Training accuracy: {tree_pruned.score(X_train, y_train):.3f}")
print(f"Test accuracy: {tree_pruned.score(X_test, y_test):.3f}")
Training accuracy: 0.891
Test accuracy: 0.854
The training accuracy dropped (the tree can't memorize as well), but the test accuracy improved significantly. That's the pruning trade-off working in your favor.
min_samples_split: The minimum number of samples a node must have before it can be split further. If a node has fewer samples than this threshold, it becomes a leaf. Default is 2 (meaning every node with at least 2 samples can be split — very permissive).
tree_min_split = DecisionTreeClassifier(min_samples_split=20, random_state=42)
tree_min_split.fit(X_train, y_train)
min_samples_leaf: The minimum number of samples that must end up in each leaf node. If a split would create a leaf with fewer samples than this threshold, the split is rejected. This prevents the tree from creating tiny, overly specific leaves.
tree_min_leaf = DecisionTreeClassifier(min_samples_leaf=10, random_state=42)
tree_min_leaf.fit(X_train, y_train)
max_features: The maximum number of features to consider when looking for the best split. By default, the tree considers all features at every split. Limiting this introduces randomness and can reduce overfitting. (This parameter is even more important in random forests, as we'll see.)
Finding the Right Depth
How do you choose the right max_depth? Plot the training and test accuracy for different depths:
import numpy as np
depths = range(1, 21)
train_scores = []
test_scores = []
for d in depths:
    dt = DecisionTreeClassifier(max_depth=d, random_state=42)
    dt.fit(X_train, y_train)
    train_scores.append(dt.score(X_train, y_train))
    test_scores.append(dt.score(X_test, y_test))
plt.figure(figsize=(10, 6))
plt.plot(depths, train_scores, 'o-', label='Training accuracy')
plt.plot(depths, test_scores, 's-', label='Test accuracy')
plt.xlabel('Max Depth')
plt.ylabel('Accuracy')
plt.title('Decision Tree: Training vs. Test Accuracy by Depth')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
You'll see a characteristic pattern: training accuracy climbs steadily toward 1.0 as depth increases, while test accuracy rises to a peak (often around depth 4-8) and then starts to decline. The optimal depth is at the peak of the test accuracy curve — the sweet spot between underfitting and overfitting.
This is the bias-variance trade-off in action. A shallow tree has high bias (it's too simple to capture the real patterns) but low variance (it gives consistent predictions on different samples). A deep tree has low bias but high variance (its predictions change wildly depending on which training data it sees). The best model balances the two.
Check Your Understanding
- If your decision tree has 100% training accuracy but 65% test accuracy, what's happening? What would you try first to fix it?
- What's the conceptual difference between `min_samples_split=20` and `min_samples_leaf=10`?
- Why does limiting `max_depth` reduce overfitting? What's the downside of setting `max_depth` too low?
28.5 From One Tree to a Forest: The Ensemble Idea
Decision trees are interpretable and fast, but they have a fundamental weakness: they're unstable. Small changes in the training data can produce completely different trees. Remove five samples, and the root node might split on a different feature entirely, producing a totally different model.
This instability isn't just an academic concern — it means that the specific tree you get depends heavily on which data happened to land in your training set. That's not ideal when you're making decisions that matter.
What if, instead of relying on one tree, you could build hundreds of trees and let them vote?
That's the ensemble idea: combine multiple weak models into one strong model. It's the same principle behind the "wisdom of crowds" — a crowd's average guess is often more accurate than any individual's guess, as long as the individuals are making independent, somewhat informed estimates.
How Ensembles Work
Imagine you're a judge at a baking competition. You taste a pie and rate it 7 out of 10. That's one opinion. But if 50 judges each taste the pie and their average rating is 7.3, that's much more reliable — individual biases and quirks cancel out.
The same principle applies to decision trees. A single tree might have learned some quirky rules from the specific training data it saw. But if you train 500 trees, each on slightly different data, their individual quirks tend to cancel out, and the consensus prediction is more robust.
For classification, the ensemble uses majority voting: each tree predicts a class, and the class with the most votes wins. For regression, the ensemble uses averaging: each tree predicts a number, and the final prediction is the mean.
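Both combination rules fit in a few lines. This sketch uses hypothetical per-tree predictions rather than real fitted trees:

```python
from collections import Counter

def majority_vote(predictions):
    """Classification: the class predicted by the most trees wins."""
    return Counter(predictions).most_common(1)[0][0]

def average_prediction(predictions):
    """Regression: the final prediction is the mean of the trees' outputs."""
    return sum(predictions) / len(predictions)

# Five hypothetical trees vote on one sample
print(majority_vote(['High', 'Low', 'High', 'High', 'Low']))  # High

# Three hypothetical regression trees predict a rate
print(average_prediction([71.2, 74.0, 72.4]))
```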
The Key Insight: Diversity Matters
Here's the critical point: an ensemble of identical trees is useless. If all 500 trees are the same, 500 votes are no better than one. The power of the ensemble comes from diversity — each tree needs to be different enough to capture different aspects of the data.
How do you make the trees different? That's where bagging comes in.
28.6 Random Forests: Bagging + Feature Randomness
A random forest is an ensemble of decision trees that achieves diversity through two mechanisms:
Mechanism 1: Bagging (Bootstrap Aggregating)
Bagging was invented by Leo Breiman in 1996. The idea is beautifully simple:
- Start with your training dataset of n samples.
- Create a new dataset of the same size n by sampling with replacement from the original. This is called a bootstrap sample. Some original samples will appear multiple times, and some won't appear at all (on average, about 63% of the original samples appear in each bootstrap sample).
- Train a decision tree on this bootstrap sample.
- Repeat steps 2-3 hundreds of times, each time creating a new bootstrap sample and training a new tree.
- To make a prediction, have all trees vote.
Because each tree is trained on a different bootstrap sample, each tree learns slightly different patterns. The samples that are "left out" of each bootstrap (about 37%) can even be used for internal validation — they're called out-of-bag (OOB) samples.
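The 63%/37% figures can be checked empirically. This sketch draws one bootstrap sample and counts how many distinct original indices it contains:

```python
import random

random.seed(42)
n = 100_000

# One bootstrap sample: n draws with replacement from indices 0..n-1
in_bag = {random.randrange(n) for _ in range(n)}
in_bag_fraction = len(in_bag) / n

print(f"In-bag fraction:     {in_bag_fraction:.3f}")      # ~0.632
print(f"Out-of-bag fraction: {1 - in_bag_fraction:.3f}")  # ~0.368
```

The in-bag fraction converges to 1 − 1/e ≈ 0.632 as n grows, which is where the "about 63%" figure comes from.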
Mechanism 2: Feature Randomness
Bagging alone isn't enough. If one feature is overwhelmingly strong (say, GDP per capita dominates all others), every tree will split on that feature first, and the trees will be too similar.
Random forests add a second layer of diversity: at each split, the algorithm considers only a random subset of features. If you have 10 features, the tree might only consider 3 randomly chosen features at each split. This forces trees to find splits using features they might not have considered otherwise, creating more diverse trees.
The default in scikit-learn is max_features='sqrt', meaning the algorithm considers the square root of the total number of features at each split. For 16 features, that's 4 features per split.
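Per-split feature subsampling is simple to sketch (illustrative only; scikit-learn performs this internally when you set `max_features='sqrt'`):

```python
import math
import random

random.seed(42)
features = ['gdp_per_capita', 'health_spending_pct', 'physicians_per_1000',
            'literacy_rate', 'urban_population_pct']

# 'sqrt' rule: consider floor(sqrt(n_features)) features at each split
k = max(1, int(math.sqrt(len(features))))
candidates = random.sample(features, k)  # a fresh draw happens at every split

print(f"Considering {k} of {len(features)} features: {candidates}")
```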
Putting It Together
from sklearn.ensemble import RandomForestClassifier
rf_model = RandomForestClassifier(
n_estimators=200, # Number of trees
max_depth=8, # Max depth per tree
max_features='sqrt', # Features to consider at each split
random_state=42,
n_jobs=-1 # Use all CPU cores
)
rf_model.fit(X_train, y_train)
print(f"Training accuracy: {rf_model.score(X_train, y_train):.3f}")
print(f"Test accuracy: {rf_model.score(X_test, y_test):.3f}")
Training accuracy: 0.953
Test accuracy: 0.891
The test accuracy jumped from 0.854 (pruned single tree) to 0.891 (random forest). The gap between training and test accuracy also narrowed, suggesting better generalization.
Let's break down the key hyperparameters:
| Parameter | What It Controls | Typical Starting Values |
|---|---|---|
| `n_estimators` | Number of trees in the forest | 100-500 |
| `max_depth` | Maximum depth per tree | 5-15, or None (no limit) |
| `max_features` | Features considered per split | `'sqrt'` (classification) |
| `min_samples_leaf` | Minimum samples per leaf | 1-10 |
| `n_jobs` | Number of CPU cores to use | -1 (use all) |
| `random_state` | Seed for reproducibility | Any integer |
Why This Matters — Priya's Perspective: Priya is building a model to predict NBA game outcomes. She started with a single decision tree and found that it was unstable — retrain on slightly different data and the predictions changed dramatically. Switching to a random forest with 300 trees gave her consistent predictions. Now she can confidently write: "Based on 12 pre-game factors, our model gives the home team a 72% chance of winning tonight." The forest's stability makes that kind of statement defensible.
28.7 Feature Importance: What the Forest Learned
One of the most valuable outputs of a random forest is feature importance — a score for each feature indicating how much it contributed to the model's predictions. This is computed by measuring how much each feature reduces impurity across all the trees in the forest.
importances = rf_model.feature_importances_
feature_importance = pd.Series(importances, index=features).sort_values()
plt.figure(figsize=(8, 5))
feature_importance.plot(kind='barh')
plt.xlabel('Feature Importance (mean impurity decrease)')
plt.title('Random Forest: Feature Importance for Vaccination Coverage')
plt.tight_layout()
plt.show()
This produces a horizontal bar chart showing, for example, that gdp_per_capita has the highest importance (0.35), followed by physicians_per_1000 (0.24), literacy_rate (0.19), health_spending_pct (0.14), and urban_population_pct (0.08).
Feature importance scores sum to 1.0 (or close to it due to rounding). They tell you which features the model relies on most, which is incredibly useful for:
- Understanding the model: "GDP per capita is the strongest predictor of vaccination coverage" is a meaningful insight for policymakers.
- Feature selection: If a feature has near-zero importance, you might drop it to simplify the model.
- Debugging: If a feature you expected to be important isn't, something might be wrong with the data.
A Caution About Feature Importance
Default feature importance (based on impurity reduction) has a known bias: it tends to favor features with more unique values (continuous features over binary ones). If this matters for your application, consider using permutation importance instead:
from sklearn.inspection import permutation_importance
perm_importance = permutation_importance(
rf_model, X_test, y_test, n_repeats=10, random_state=42
)
perm_imp = pd.Series(perm_importance.importances_mean, index=features)
print(perm_imp.sort_values(ascending=False))
Permutation importance works by randomly shuffling each feature's values and measuring how much the model's accuracy drops. If shuffling a feature causes a big drop, that feature is important. This method is slower but unbiased.
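Under the hood, permutation importance is just "shuffle one column, re-score, repeat." Here's a minimal hand-rolled version for intuition. It assumes any model object with a `score(X, y)` method and a NumPy feature matrix; the real `sklearn.inspection.permutation_importance` handles more cases:

```python
import numpy as np

def permutation_importance_manual(model, X, y, n_repeats=10, seed=42):
    """Mean drop in model.score(X, y) after shuffling each column."""
    rng = np.random.default_rng(seed)
    baseline = model.score(X, y)
    importances = []
    for j in range(X.shape[1]):
        drops = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # destroy column j
            drops.append(baseline - model.score(X_perm, y))
        importances.append(float(np.mean(drops)))
    return importances
```

A feature whose shuffling barely moves the score gets an importance near zero, matching the description above.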
Check Your Understanding
- A random forest reports that `income` has importance 0.45 and `zip_code` has importance 0.02. What does this tell you? What doesn't it tell you?
- Why does the random forest consider only a subset of features at each split? What would happen if it considered all features?
- What's the difference between impurity-based importance and permutation importance? When might you prefer one over the other?
28.8 Decision Trees vs. Random Forests: When to Use Which
Now you have two models. How do you choose between them? Here's the honest comparison:
| Factor | Decision Tree | Random Forest |
|---|---|---|
| Interpretability | Excellent — you can draw it, read it, explain it to anyone | Limited — 200 trees can't be visualized as a single flowchart |
| Accuracy | Moderate — prone to overfitting even with pruning | High — ensembling reduces overfitting |
| Speed (training) | Very fast — one tree | Slower — hundreds of trees (but parallelizable) |
| Speed (prediction) | Very fast — traverse one tree | Slightly slower — traverse hundreds of trees |
| Stability | Low — small data changes can produce different trees | High — the ensemble smooths out individual quirks |
| Feature scaling required? | No | No |
| Handles missing values? | Not natively in scikit-learn | Not natively in scikit-learn |
| Feature importance | Available | Available (and more reliable due to averaging) |
Rules of Thumb
Use a decision tree when:
- Interpretability is the top priority — your audience needs to understand why the model makes each prediction
- You're building a preliminary model to explore which features matter
- The dataset is small and a complex model would overfit
- You need to explain the model to non-technical stakeholders (this is the "explain to your boss" scenario)
- You're required to provide transparent decision logic (e.g., regulated industries like lending or healthcare)

Use a random forest when:
- Accuracy is the top priority and you can sacrifice some interpretability
- You have a medium-to-large dataset
- You want a model that's robust to small changes in the data
- You want reliable feature importance scores
- You're in a competitive setting (random forests are strong default models for many problems)

Use something else when:
- You have a very large dataset (>100K samples) and need fast training — consider gradient boosting (XGBoost, LightGBM)
- You need a linear model for inference or theoretical reasons — stick with logistic regression
- Your data has a strong linear structure — tree-based models can struggle with simple linear relationships
The Practical Compromise: Small Tree + Forest
Many practitioners use both:
1. Build a shallow decision tree (`max_depth=3` or 4) for communication. This tree won't be the most accurate, but it tells a story: "Here are the three most important factors driving this outcome."
2. Build a random forest for actual predictions. This model is the one you deploy, the one that makes decisions in production.
3. Use the feature importance from the random forest to validate the story told by the shallow tree. If the tree's top splits align with the forest's most important features, you can be more confident in both.
This combination gives you the best of both worlds: an interpretable story for stakeholders and a high-performing model for production.
28.9 Regression Trees: Trees That Predict Numbers
Everything we've discussed so far applies to classification. But decision trees can also predict continuous numbers; these are called regression trees. The mechanics are similar, but instead of reducing Gini impurity, the algorithm chooses splits that minimize mean squared error (MSE).
In a regression tree, each leaf predicts the average value of the training samples that ended up in that leaf.
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Predict actual vaccination rate (continuous) instead of high/low
X = health[features]
y_continuous = health['vaccination_rate']
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(
    X, y_continuous, test_size=0.3, random_state=42
)

# Single regression tree
reg_tree = DecisionTreeRegressor(max_depth=5, random_state=42)
reg_tree.fit(X_train_r, y_train_r)

# Random forest for regression
rf_reg = RandomForestRegressor(n_estimators=200, max_depth=8, random_state=42)
rf_reg.fit(X_train_r, y_train_r)

for name, model in [('Single Tree', reg_tree), ('Random Forest', rf_reg)]:
    pred = model.predict(X_test_r)
    print(f"{name}: RMSE={mean_squared_error(y_test_r, pred)**0.5:.2f}, "
          f"R²={r2_score(y_test_r, pred):.3f}")
The same general pattern holds: the random forest will typically outperform the single tree, especially on test data.
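You can see this pattern without the chapter's dataset. The sketch below uses `make_regression` as a stand-in for the vaccination data, so the scores are illustrative only:

```python
# Sketch: single regression tree vs. random forest on synthetic data.
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

X, y = make_regression(n_samples=400, n_features=5, noise=15.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeRegressor(max_depth=5, random_state=0).fit(X_tr, y_tr)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)

# Held-out R² for each model
tree_r2 = r2_score(y_te, tree.predict(X_te))
forest_r2 = r2_score(y_te, forest.predict(X_te))
print(f"Tree   R²: {tree_r2:.3f}")
print(f"Forest R²: {forest_r2:.3f}")
```

On most runs the forest's test R² comes out ahead of the single tree's, which is the general pattern described above.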
28.10 Progressive Project: Building a Decision Tree for Vaccination Coverage
Let's connect this to our ongoing project. In Chapter 26, you built a linear regression predicting vaccination rates. In Chapter 27, you built a logistic regression classifying countries as high or low coverage. Now let's add a decision tree and a random forest to the comparison.
Project Task: Tree-Based Models for Vaccination Coverage
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt
# 1. Load and prepare data (same as previous chapters)
health = pd.read_csv('global_health_indicators.csv')
median_rate = health['vaccination_rate'].median()
health['high_coverage'] = (health['vaccination_rate'] >= median_rate).astype(int)
features = ['gdp_per_capita', 'health_spending_pct', 'physicians_per_1000',
            'literacy_rate', 'urban_population_pct']
X = health[features].dropna()
y = health.loc[X.index, 'high_coverage']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
# 2. Fit a pruned decision tree
dt = DecisionTreeClassifier(max_depth=4, min_samples_leaf=5, random_state=42)
dt.fit(X_train, y_train)
# 3. Visualize the tree
plt.figure(figsize=(16, 8))
plot_tree(dt, feature_names=features, class_names=['Low', 'High'],
          filled=True, rounded=True, fontsize=9)
plt.title('Decision Tree: Vaccination Coverage Prediction')
plt.tight_layout()
plt.savefig('project_decision_tree.png', dpi=150)
plt.show()
# 4. Print the rules
print(export_text(dt, feature_names=features))
# 5. Fit a random forest
rf = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42)
rf.fit(X_train, y_train)
# 6. Compare models
print("Decision Tree:")
print(classification_report(y_test, dt.predict(X_test), target_names=['Low', 'High']))
print("\nRandom Forest:")
print(classification_report(y_test, rf.predict(X_test), target_names=['Low', 'High']))
# 7. Feature importance comparison
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Decision tree importance
dt_imp = pd.Series(dt.feature_importances_, index=features).sort_values()
dt_imp.plot(kind='barh', ax=axes[0], color='steelblue')
axes[0].set_title('Decision Tree Feature Importance')
# Random forest importance
rf_imp = pd.Series(rf.feature_importances_, index=features).sort_values()
rf_imp.plot(kind='barh', ax=axes[1], color='darkgreen')
axes[1].set_title('Random Forest Feature Importance')
plt.tight_layout()
plt.savefig('project_feature_importance_comparison.png', dpi=150)
plt.show()
What to write in your project notebook:
- Document the tree's decision rules in plain English. What's the story it tells about vaccination coverage?
- Compare the decision tree's accuracy to the logistic regression from Chapter 27. Which performed better? Why?
- Note which features the random forest considers most important. Do they align with what the decision tree splits on?
- Reflect: if you were presenting to a health policy audience, which model would you show? Why?
Milestone Check: Your project notebook should now contain three classification models (logistic regression, decision tree, random forest) predicting vaccination coverage, with accuracy scores and feature importance plots for each. In Chapter 29, you'll learn how to evaluate these models properly — because accuracy alone isn't enough.
28.11 Common Pitfalls and How to Avoid Them
Let's close with the mistakes that trip up nearly everyone when they first use tree-based models.
Pitfall 1: Not Pruning the Tree
An unrestricted DecisionTreeClassifier() will almost always overfit. Always set at least max_depth or min_samples_leaf. Start with max_depth=5 and adjust from there.
Pitfall 2: Too Few Trees in the Forest
The default n_estimators=100 in scikit-learn is a reasonable starting point, but more trees almost never hurt accuracy (they just take longer to train and predict). If your dataset is large, try 200-500 trees. Accuracy usually plateaus after a certain point, so plot validation (or out-of-bag) accuracy against the number of trees to see where it levels off.
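One convenient way to watch for that plateau is to grow a forest incrementally with warm_start=True and track the out-of-bag score. The data here is synthetic (an assumption); substitute your own X and y:

```python
# Sketch: OOB accuracy as trees are added. warm_start=True keeps the
# already-fitted trees, so each fit() only trains the new ones.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           random_state=1)

rf = RandomForestClassifier(warm_start=True, oob_score=True, random_state=1)
oob_scores = []
for n in [25, 50, 100, 200]:
    rf.set_params(n_estimators=n)
    rf.fit(X, y)
    oob_scores.append(rf.oob_score_)
    print(f"{n:>3} trees -> OOB accuracy: {rf.oob_score_:.3f}")
```

When the OOB score stops improving from one row to the next, adding more trees is costing you training time without buying accuracy.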
Pitfall 3: Confusing Feature Importance with Causation
Feature importance tells you what the model uses to make predictions, not what causes the outcome. Just because GDP per capita has the highest importance doesn't mean increasing GDP would increase vaccination rates. Correlation and predictive power are not the same as causation — we covered this thoroughly in Chapter 24.
Pitfall 4: Ignoring Class Imbalance
If your dataset has 95% of one class and 5% of another, a decision tree can achieve 95% accuracy by always predicting the majority class. Use class_weight='balanced' to tell the tree to pay more attention to the minority class:
dt_balanced = DecisionTreeClassifier(max_depth=5, class_weight='balanced',
                                     random_state=42)
Pitfall 5: Interpreting Deep Trees as Simple
A decision tree with max_depth=15 can have up to 2^15 = 32,768 leaves, each the end of its own root-to-leaf path. That's not interpretable. If interpretability is your goal, keep the depth at 3-5 levels. Deeper trees are for accuracy, not explanation.
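You can check the leaf count directly with get_n_leaves(). On a synthetic dataset (an assumption; swap in your own), the count grows rapidly with depth even when the 2^depth ceiling isn't reached:

```python
# Sketch: leaf counts explode as max_depth grows.
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=0)

leaf_counts = []
for depth in [3, 5, 10, 15]:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X, y)
    leaf_counts.append(dt.get_n_leaves())
    print(f"max_depth={depth:>2} -> {dt.get_n_leaves():>4} leaves")
```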
Pitfall 6: Forgetting That Trees Can't Extrapolate
Decision trees and random forests can only predict values they've seen in training. If the training data has GDP values ranging from $500 to $60,000, the model has no idea what to predict for a country with GDP of $100,000. It will simply assign the prediction from the nearest leaf — which might be wildly wrong. Linear models, by contrast, can extrapolate along the learned line (for better or worse). Keep this limitation in mind when your test data might fall outside the training range.
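A quick sketch makes the flat-lining visible. The toy data below (an assumption) follows y ≈ 2x over a GDP-like range; the tree's prediction at $100,000 is capped near its largest training value, while the linear model continues along its fitted line:

```python
# Sketch: trees flat-line outside the training range; linear models extrapolate.
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.uniform(500, 60_000, size=(300, 1))   # GDP-like training range
y_train = 2.0 * X_train.ravel() + rng.normal(0, 1_000, 300)

tree = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X_train, y_train)
lin = LinearRegression().fit(X_train, y_train)

X_out = np.array([[100_000.0]])                     # far outside the range
tree_pred = tree.predict(X_out)[0]                  # capped at a leaf average
lin_pred = lin.predict(X_out)[0]                    # continues the slope
print(f"Tree prediction at $100k:   {tree_pred:,.0f}")
print(f"Linear prediction at $100k: {lin_pred:,.0f}")
```

The tree can never predict above the largest target it saw in training; the linear model can (for better or worse).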
28.12 The Big Picture: Where Trees Fit in Your Toolbox
Let's step back and see where decision trees and random forests fit in the landscape of models you've learned:
| Model | Type | Interpretability | Handles Non-Linear? | Feature Scaling Needed? |
|---|---|---|---|---|
| Linear Regression (Ch. 26) | Regression | High | No | Yes |
| Logistic Regression (Ch. 27) | Classification | High | No | Yes |
| Decision Tree (this chapter) | Both | Very High | Yes | No |
| Random Forest (this chapter) | Both | Moderate | Yes | No |
Trees naturally handle non-linear relationships — the if/else structure can capture complex patterns that linear models miss entirely. They also handle interactions between features automatically (a split on feature A followed by a split on feature B captures the interaction). And they don't require feature scaling, which simplifies preprocessing.
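To illustrate the interaction point, here is a sketch with an XOR-style target: pure interaction between two features with no linear signal. A linear model is stuck near chance, while an unrestricted tree fits it (a greedy tree needs a few levels, since no single split helps on its own). Synthetic data and training accuracy only, so illustrative rather than a benchmark:

```python
# Sketch: a tree captures an XOR-style interaction that a linear model cannot.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(1000, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)   # pure interaction, no main effects

tree_acc = DecisionTreeClassifier(random_state=0).fit(X, y).score(X, y)
linear_acc = LogisticRegression().fit(X, y).score(X, y)
print(f"Unrestricted tree (training accuracy): {tree_acc:.2f}")
print(f"Logistic regression:                   {linear_acc:.2f}")
```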
The trade-off is that trees are more prone to overfitting and can't extrapolate beyond the training data. Understanding these strengths and weaknesses will help you choose the right tool for each problem.
In Chapter 29, you'll learn how to rigorously evaluate all the models you've built — because "the model with the highest accuracy wins" is a dangerously oversimplified way to think about model selection. See you there.
Summary
A decision tree splits data recursively, asking yes/no questions about features at each node and making predictions at the leaves. The algorithm chooses splits by maximizing information gain — the reduction in Gini impurity (or entropy). Unrestricted trees overfit badly, so pruning parameters like max_depth and min_samples_leaf are essential.
Random forests address trees' instability by building hundreds of trees on different bootstrap samples and having them vote. Feature randomness at each split ensures the trees are diverse. The resulting ensemble is more accurate and stable than any single tree, at the cost of reduced interpretability.
Feature importance scores — from either single trees or forests — reveal which features drive predictions and are one of the most practically useful outputs of tree-based models. The choice between a single tree (for communication) and a forest (for accuracy) depends on your audience and your goal.
Coming up in Chapter 29: You've built four models now — linear regression, logistic regression, a decision tree, and a random forest. But how do you know which one is actually good? Accuracy is just the beginning. We'll learn about precision, recall, F1, ROC curves, and why a model that's 99% accurate might be completely useless.