# Key Takeaways: Decision Trees and Random Forests
This is your reference card for Chapter 28. The core idea: decision trees are among the most interpretable models in machine learning, and random forests show how combining many simple models can outperform any single one.
## Key Concepts
- **A decision tree is a flowchart of yes/no questions.** Each internal node asks a question about a feature ("Is income > $50K?"), each branch represents an answer, and each leaf node makes a prediction. The tree is built by the algorithm, not by hand — it learns the best questions and thresholds from training data.
- **Gini impurity measures how "mixed" a node is.** A pure node (all one class) has Gini = 0. A maximally mixed node (equal proportions of all classes) has the highest Gini. The tree chooses splits that reduce Gini impurity the most — that is, splits with the highest information gain.
- **Unpruned trees overfit.** Without constraints, a decision tree will keep splitting until every leaf is pure, memorizing the training data. Always use pruning parameters: `max_depth`, `min_samples_leaf`, or `min_samples_split`. Start with `max_depth=4` or `5` and adjust based on the training vs. test accuracy gap.
- **A random forest is many trees voting together.** Each tree is trained on a different bootstrap sample (random sampling with replacement) and considers only a random subset of features at each split. This diversity helps individual trees' errors cancel out, producing a more accurate and stable ensemble.
- **Feature importance reveals what drives predictions.** Both single trees and random forests can report how much each feature contributes to the model. This is one of the most practically valuable outputs — it tells you what the model learned, not just how well it predicts.
- **Interpretability and accuracy trade off.** A single tree with `max_depth=3` is easy to explain to anyone. A random forest with 500 trees is more accurate but impossible to visualize. Many practitioners use both: a shallow tree for communication and a forest for production.
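The Gini calculation described above is simple enough to verify by hand. A minimal sketch in plain Python (no scikit-learn needed), with a hypothetical two-class split for illustration:

```python
def gini_impurity(counts):
    """Gini impurity of a node given its class counts: 1 - sum(p_k^2)."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)

# A pure node (all one class) has Gini = 0
print(gini_impurity([10, 0]))   # 0.0
# A maximally mixed binary node has the highest Gini
print(gini_impurity([5, 5]))    # 0.5

# Weighted Gini after a candidate split: left leaf [8, 1], right leaf [1, 8]
left, right = [8, 1], [1, 8]
n_left, n_right = sum(left), sum(right)
n = n_left + n_right
weighted = (n_left / n) * gini_impurity(left) + (n_right / n) * gini_impurity(right)
print(round(weighted, 4))       # 0.1975 -- far below the parent's 0.5, so this split helps
```

The tree builder evaluates many candidate thresholds this way and keeps the one with the largest drop from the parent's impurity.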
## Decision Tree vs. Random Forest — At a Glance

| DECISION TREE | RANDOM FOREST |
|---|---|
| One tree | Many trees (100-500+) |
| Prone to overfitting | Resistant to overfitting |
| Interpretable (draw it, read it) | Hard to interpret (too many trees) |
| Fast to train | Slower to train (but parallelizable) |
| Unstable (small data changes = different tree) | Stable (individual quirks cancel out in the vote) |
| No feature scaling needed | No feature scaling needed |
| Can't extrapolate beyond training range | Can't extrapolate beyond training range |
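A toy illustration of why the vote stabilizes predictions. Assume three hypothetical trees that each make one mistake, but on different examples; the majority vote outvotes each individual error:

```python
# Each "tree" is a list of predictions for the same five test examples.
true_labels = [0, 1, 1, 0, 1]
tree_a =      [1, 1, 1, 0, 1]   # wrong on example 0
tree_b =      [0, 0, 1, 0, 1]   # wrong on example 1
tree_c =      [0, 1, 1, 1, 1]   # wrong on example 3

# Majority vote: predict 1 if at least 2 of the 3 trees say 1
ensemble = [1 if a + b + c >= 2 else 0
            for a, b, c in zip(tree_a, tree_b, tree_c)]

print(ensemble == true_labels)  # True: no single tree is perfect, the vote is
```

This only works when the trees' errors are not all on the same examples, which is exactly what bootstrap sampling and random feature subsets are designed to encourage.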
## The Key Hyperparameters

| Parameter | What It Controls | Start Here |
|---|---|---|
| `max_depth` | How deep the tree grows | 4-8 for a single tree; 8-15 for forest trees |
| `min_samples_leaf` | Minimum samples per leaf node | 5-20 |
| `min_samples_split` | Minimum samples to split a node | 10-30 |
| `n_estimators` | Number of trees in the forest | 200-500 |
| `max_features` | Features considered per split | `'sqrt'` for classification |
| `class_weight` | How to handle class imbalance | `'balanced'` if classes are unequal |
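The "Start Here" values are starting points, not final answers; they can be tuned systematically with cross-validation. A minimal sketch using scikit-learn's `GridSearchCV` over two of the parameters from the table (the synthetic dataset and the grid values are just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for your real training data
X, y = make_classification(n_samples=500, n_features=10, random_state=42)

param_grid = {
    "max_depth": [8, 12, 15],
    "min_samples_leaf": [5, 10, 20],
}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=100,  # kept small for speed
                           random_state=42, n_jobs=-1),
    param_grid,
    cv=5,  # 5-fold cross-validation per combination
)
search.fit(X, y)
print(search.best_params_)              # best combination found
print(round(search.best_score_, 3))     # its mean cross-validated accuracy
```

Grids grow multiplicatively (3 × 3 × 5 folds = 45 fits here), so keep the grid small or switch to `RandomizedSearchCV` for larger searches.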
## Scikit-learn Quick Reference

```python
# Decision tree (classification)
from sklearn.tree import DecisionTreeClassifier, plot_tree, export_text

dt = DecisionTreeClassifier(max_depth=4, min_samples_leaf=10,
                            random_state=42)
dt.fit(X_train, y_train)
dt.predict(X_test)
dt.score(X_test, y_test)

# Visualize the tree
plot_tree(dt, feature_names=features, class_names=class_names,
          filled=True, rounded=True)

# Print rules as text
print(export_text(dt, feature_names=features))

# Random forest (classification)
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, max_depth=8,
                            random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
rf.predict(X_test)
rf.score(X_test, y_test)

# Feature importance
rf.feature_importances_  # array of importance scores (sum to ~1.0)

# Permutation importance (more reliable)
from sklearn.inspection import permutation_importance
perm_imp = permutation_importance(rf, X_test, y_test, n_repeats=10)
```
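The raw `feature_importances_` array is unlabeled; pairing it with feature names and sorting makes it readable. A self-contained sketch (the synthetic data and the `feat_i` names are hypothetical stand-ins for your own):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in: 5 features, 3 of them actually informative
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           random_state=42)
features = [f"feat_{i}" for i in range(X.shape[1])]  # hypothetical names

rf = RandomForestClassifier(n_estimators=200, max_depth=8, random_state=42)
rf.fit(X, y)

# Pair each feature with its score and sort, highest importance first
ranked = sorted(zip(features, rf.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

The scores are normalized to sum to 1.0, so each value reads as that feature's share of the model's total split improvement.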
## Common Mistakes to Avoid

| Mistake | Why It's Wrong | What to Do Instead |
|---|---|---|
| Using default `DecisionTreeClassifier()` without pruning | Almost always overfits — 100% training accuracy, poor test accuracy | Set `max_depth`, `min_samples_leaf`, or both |
| Interpreting feature importance as causation | Importance measures prediction contribution, not causal effect | Say "X is the strongest predictor" not "X causes Y" |
| Ignoring class imbalance | With 95% majority class, the tree just predicts the majority | Use `class_weight='balanced'` |
| Using too few trees in the forest | Accuracy may not stabilize with too few trees | Use at least 100-200 trees; check the learning curve |
| Trying to visualize a random forest | 200 trees can't be drawn as one flowchart | Use feature importance plots instead; build a shallow tree for communication |
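The class-imbalance row is easy to explore on synthetic data. A sketch, assuming a 95/5 split; the key habit it shows is checking minority-class recall rather than overall accuracy, which an always-predict-majority model would already score around 0.95 on:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with a 95% majority class
X, y = make_classification(n_samples=2000, weights=[0.95], flip_y=0,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

plain = RandomForestClassifier(n_estimators=200,
                               random_state=42).fit(X_tr, y_tr)
weighted = RandomForestClassifier(n_estimators=200, class_weight="balanced",
                                  random_state=42).fit(X_tr, y_tr)

# Minority-class recall is the number to watch, not accuracy
print(recall_score(y_te, plain.predict(X_te)))
print(recall_score(y_te, weighted.predict(X_te)))
```

How much `class_weight='balanced'` helps depends on how separable the classes are; on your own data, compare the two recall numbers before deciding.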
## What You Should Be Able to Do Now

- [ ] Explain how a decision tree makes predictions by splitting data at each node
- [ ] Compute Gini impurity for a simple node (by hand or in code)
- [ ] Train a `DecisionTreeClassifier` in scikit-learn and adjust pruning parameters
- [ ] Visualize a decision tree using `plot_tree` and `export_text`
- [ ] Explain why unrestricted trees overfit and how pruning addresses this
- [ ] Describe how bagging and feature randomness create a diverse random forest
- [ ] Train a `RandomForestClassifier` and extract feature importance scores
- [ ] Compare single trees and forests and articulate when each is appropriate
- [ ] Connect these models to the linear and logistic regressions from Chapters 26-27
If you checked every box, you're ready for Chapter 29, where you'll learn to evaluate all your models properly — because accuracy alone doesn't tell the whole story.