# Key Takeaways: Decision Trees and Random Forests
This is your reference card for Chapter 28. The core idea: decision trees are among the most interpretable models in machine learning, and random forests show how combining many simple models can outperform any single one.
## Key Concepts
- **A decision tree is a flowchart of yes/no questions.** Each internal node asks a question about a feature ("Is income > $50K?"), each branch represents an answer, and each leaf node makes a prediction. The tree is built by the algorithm, not by hand — it learns the best questions and thresholds from training data.
- **Gini impurity measures how "mixed" a node is.** A pure node (all one class) has Gini = 0. A maximally mixed node (equal proportions of all classes) has the highest Gini. The tree chooses splits that reduce Gini impurity the most — that is, splits with the highest information gain.
- **Unpruned trees overfit.** Without constraints, a decision tree will keep splitting until every leaf is pure, memorizing the training data. Always use pruning parameters: `max_depth`, `min_samples_leaf`, or `min_samples_split`. Start with `max_depth=4` or `5` and adjust based on the training vs. test accuracy gap.
- **A random forest is many trees voting together.** Each tree is trained on a different bootstrap sample (random sampling with replacement) and considers only a random subset of features at each split. This diversity helps individual trees' errors cancel out, producing a more accurate and stable ensemble.
- **Feature importance reveals what drives predictions.** Both single trees and random forests can report how much each feature contributes to the model. This is one of the most practically valuable outputs — it tells you what the model learned, not just how well it predicts.
- **Interpretability and accuracy trade off.** A single tree with `max_depth=3` is easy to explain to anyone. A random forest with 500 trees is more accurate but impossible to visualize. Many practitioners use both: a shallow tree for communication and a forest for production.
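The Gini calculation described above is simple enough to verify by hand. A minimal sketch in plain Python (no scikit-learn needed), with a hypothetical two-class split for illustration:

```python
def gini_impurity(counts):
    """Gini impurity of a node given its class counts: 1 - sum(p_k^2)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

# A pure node (all one class) has Gini = 0
print(gini_impurity([10, 0]))   # 0.0
# A maximally mixed binary node has the highest Gini
print(gini_impurity([5, 5]))    # 0.5

# Weighted Gini after a candidate split: left leaf [8, 1], right leaf [1, 8]
left, right = [8, 1], [1, 8]
n_left, n_right = sum(left), sum(right)
n = n_left + n_right
weighted = (n_left / n) * gini_impurity(left) + (n_right / n) * gini_impurity(right)
print(round(weighted, 4))       # 0.1975 -- far below the parent's 0.5, so this split helps
```

The tree builder evaluates many candidate thresholds this way and keeps the one with the largest drop from the parent's impurity.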
## Decision Tree vs. Random Forest — At a Glance

| DECISION TREE | RANDOM FOREST |
|---|---|
| One tree | Many trees (100-500+) |
| Prone to overfitting | Resistant to overfitting |
| Interpretable (draw it, read it) | Hard to interpret (too many trees) |
| Fast to train | Slower to train (but parallelizable) |
| Unstable (small data changes = different tree) | Stable (individual quirks cancel out in the vote) |
| No feature scaling needed | No feature scaling needed |
| Can't extrapolate beyond training range | Can't extrapolate beyond training range |
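A toy illustration of why the vote stabilizes predictions. Assume three hypothetical trees that each make one mistake, but on different examples; the majority vote outvotes each individual error:

```python
# Each "tree" is a list of predictions for the same five test examples.
true_labels = [0, 1, 1, 0, 1]
tree_a =      [1, 1, 1, 0, 1]   # wrong on example 0
tree_b =      [0, 0, 1, 0, 1]   # wrong on example 1
tree_c =      [0, 1, 1, 1, 1]   # wrong on example 3

# Majority vote: predict 1 if at least 2 of the 3 trees say 1
ensemble = [1 if a + b + c >= 2 else 0
            for a, b, c in zip(tree_a, tree_b, tree_c)]

print(ensemble == true_labels)  # True: no single tree is perfect, the vote is
```

This only works when the trees' errors are not all on the same examples, which is exactly what bootstrap sampling and random feature subsets are designed to encourage.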
## The Key Hyperparameters

| Parameter | What It Controls | Start Here |
|---|---|---|
| `max_depth` | How deep the tree grows | 4-8 for a single tree; 8-15 for forest trees |
| `min_samples_leaf` | Minimum samples per leaf node | 5-20 |
| `min_samples_split` | Minimum samples to split a node | 10-30 |
| `n_estimators` | Number of trees in the forest | 200-500 |
| `max_features` | Features considered per split | `'sqrt'` for classification |
| `class_weight` | How to handle class imbalance | `'balanced'` if classes are unequal |
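The "Start Here" values are starting points, not final answers; they can be tuned systematically with cross-validation. A minimal sketch using scikit-learn's `GridSearchCV` over two of the parameters from the table (the synthetic dataset and the grid values are just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for your real training data
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

param_grid = {
    "max_depth": [8, 12, 15],
    "min_samples_leaf": [5, 10, 20],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100,  # kept small for speed
                           random_state=42, n_jobs=-1),
    param_grid,
    cv=5,  # 5-fold cross-validation per combination
)
search.fit(X, y)
print(search.best_params_)              # best combination found
print(round(search.best_score_, 3))     # its mean cross-validated accuracy
```

Grids grow multiplicatively (3 × 3 × 5 folds = 45 fits here), so keep the grid small or switch to `RandomizedSearchCV` for larger searches.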
## Scikit-learn Quick Reference

```python
# Decision tree (classification)
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text

dt = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                            random_state=42)
dt.fit(X_train, y_train)
dt.predict(X_test)
dt.score(X_test, y_test)

# Visualize the tree
plot_tree(dt, feature_names=features, class_names=class_names,
          filled=True, rounded=True)

# Print rules as text
print(export_text(dt, feature_names=features))

# Random forest (classification)
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, max_depth=8,
                            random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
rf.predict(X_test)
rf.score(X_test, y_test)

# Feature importance
rf.feature_importances_  # array of importance scores (sum to ~1.0)

# Permutation importance (more reliable)
from sklearn.inspection import permutation_importance
perm_imp = permutation_importance(rf, X_test, y_test, n_repeats=10)
```
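The raw `feature_importances_` array is unlabeled; pairing it with feature names and sorting makes it readable. A self-contained sketch (the synthetic data and the `feat_i` names are hypothetical stand-ins for your own):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 5 features, 3 of them actually informative
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=42)
features = [f"feat_{i}" for i in range(X.shape[1])]  # hypothetical names

rf = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42)
rf.fit(X, y)

# Pair each feature with its score and sort, highest importance first
ranked = sorted(zip(features, rf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

The scores are normalized to sum to 1.0, so each value reads as that feature's share of the model's total split improvement.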
## Common Mistakes to Avoid

| Mistake | Why It's Wrong | What to Do Instead |
|---|---|---|
| Using default `DecisionTreeClassifier()` without pruning | Almost always overfits — 100% training accuracy, poor test accuracy | Set `max_depth`, `min_samples_leaf`, or both |
| Interpreting feature importance as causation | Importance measures prediction contribution, not causal effect | Say "X is the strongest predictor" not "X causes Y" |
| Ignoring class imbalance | With 95% majority class, the tree just predicts the majority | Use `class_weight='balanced'` |
| Using too few trees in the forest | Accuracy may not stabilize with too few trees | Use at least 100-200 trees; check the learning curve |
| Trying to visualize a random forest | 200 trees can't be drawn as one flowchart | Use feature importance plots instead; build a shallow tree for communication |
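The class-imbalance row is easy to explore on synthetic data. A sketch, assuming a 95/5 split; the key habit it shows is checking minority-class recall rather than overall accuracy, which an always-predict-majority model would already score around 0.95 on:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with a 95% majority class
X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

plain = RandomForestClassifier(n_estimators=200,
                               random_state=42).fit(X_tr, y_tr)
weighted = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                  random_state=42).fit(X_tr, y_tr)

# Minority-class recall is the number to watch, not accuracy
print(recall_score(y_te, plain.predict(X_te)))
print(recall_score(y_te, weighted.predict(X_te)))
```

How much `class_weight='balanced'` helps depends on how separable the classes are; on your own data, compare the two recall numbers before deciding.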
## What You Should Be Able to Do Now

- [ ] Explain how a decision tree makes predictions by splitting data at each node
- [ ] Compute Gini impurity for a simple node (by hand or in code)
- [ ] Train a `DecisionTreeClassifier` in scikit-learn and adjust pruning parameters
- [ ] Visualize a decision tree using `plot_tree` and `export_text`
- [ ] Explain why unrestricted trees overfit and how pruning addresses this
- [ ] Describe how bagging and feature randomness create a diverse random forest
- [ ] Train a `RandomForestClassifier` and extract feature importance scores
- [ ] Compare single trees and forests and articulate when each is appropriate
- [ ] Connect these models to the linear and logistic regressions from Chapters 26-27
If you checked every box, you're ready for Chapter 29, where you'll learn to evaluate all your models properly — because accuracy alone doesn't tell the whole story.