Chapter 13: Tree-Based Methods

Decision Trees, Random Forests, and Why Ensembles Win


Learning Objectives

By the end of this chapter, you will be able to:

  1. Explain how decision trees split data using information gain and Gini impurity
  2. Identify why single decision trees overfit and how pruning helps
  3. Implement Random Forests and explain bagging + feature randomization
  4. Interpret feature importance from tree-based models
  5. Recognize the strengths and limitations of tree ensembles

The Most Dangerous Model in Machine Learning

War Story --- In 2019, a healthcare analytics team built a decision tree to predict patient readmissions. The tree had 47 levels, 2,000+ leaves, and achieved 100% accuracy on the training set. The chief medical officer loved it: "I can follow the logic from top to bottom. This is the first model I've ever understood." When they deployed it to production, it achieved 58% accuracy on new patients --- barely better than a coin flip. The tree had memorized every quirk of the training data, including which patients happened to share a zip code with the hospital cafeteria's supplier. The CMO's trust in the model was based on a lie: the tree was not capturing medical logic. It was capturing noise.

Decision trees are the most intuitive model in machine learning. Draw a flowchart. At each step, ask a yes-or-no question about a feature. Follow the branches. Arrive at a prediction. Done.

That intuition is both their greatest strength and their greatest danger. People trust decision trees because they can read them. And a model you can read but that does not generalize is worse than a black box that works --- because the readable model gives you false confidence.

This chapter is about understanding that failure mode, then fixing it. A single decision tree is fragile, unstable, and prone to overfitting. But take hundreds of decision trees, each trained on slightly different data with slightly different features, and average their predictions --- and you get a Random Forest, one of the most reliable and practical algorithms in all of machine learning.

We start with the single tree. We watch it fail. Then we fix it with ensembles.


How Decision Trees Split Data

The Basic Mechanism

A decision tree makes predictions by recursively partitioning the feature space. At each internal node, the algorithm asks a question about one feature: "Is tenure_months less than 6?" If yes, go left. If no, go right. Repeat until you reach a leaf node, which contains the prediction.

For classification, the prediction at a leaf is the majority class of the training samples that ended up there. For regression, it is the mean of the target values.

The entire learning process reduces to one question: at each node, which feature and which threshold should we split on?

import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.model_selection import train_test_split

# Simple example: predict churn from two features
np.random.seed(42)
n = 200
tenure = np.random.exponential(18, n).clip(1, 72).astype(int)
hours = np.random.exponential(15, n).round(1).clip(0, 100)
churn_prob = 1 / (1 + np.exp(0.05 * tenure + 0.03 * hours - 3))
churned = (np.random.rand(n) < churn_prob).astype(int)

X_simple = pd.DataFrame({'tenure_months': tenure, 'hours_watched': hours})
y_simple = pd.Series(churned, name='churned')

# Fit a shallow tree
tree_shallow = DecisionTreeClassifier(max_depth=3, random_state=42)
tree_shallow.fit(X_simple, y_simple)

print(export_text(tree_shallow, feature_names=['tenure_months', 'hours_watched']))
|--- tenure_months <= 5.50
|   |--- hours_watched <= 3.85
|   |   |--- tenure_months <= 2.50
|   |   |   |--- class: 1
|   |   |--- tenure_months >  2.50
|   |   |   |--- class: 1
|   |--- hours_watched >  3.85
|   |   |--- tenure_months <= 1.50
|   |   |   |--- class: 1
|   |   |--- tenure_months >  1.50
|   |   |   |--- class: 0
|--- tenure_months >  5.50
|   |--- hours_watched <= 6.35
|   |   |--- tenure_months <= 10.50
|   |   |   |--- class: 0
|   |   |--- tenure_months >  10.50
|   |   |   |--- class: 0
|   |--- hours_watched >  6.35
|   |   |--- tenure_months <= 11.50
|   |   |   |--- class: 0
|   |   |--- tenure_months >  11.50
|   |   |   |--- class: 0

Read this tree like a flowchart. The first question is: "Is tenure less than or equal to 5.5 months?" If yes, the customer is new, and we go left --- where the tree asks about hours watched. If no, we go right --- where, although the tree still asks about hours watched, every leaf predicts "not churned."

This is why people love trees. It is a story: new customers who do not watch much content are likely to churn. Long-tenured customers are safe.

But that story might be wrong.

Information Gain and Gini Impurity

The tree does not split on tenure_months first because tenure is inherently more important. It splits on tenure first because that split produces the largest reduction in impurity --- the degree to which a node contains mixed classes.

Two measures of impurity dominate:

Gini Impurity measures the probability that a randomly chosen sample from a node would be misclassified if labeled according to the class distribution of that node:

Gini(node) = 1 - sum(p_i^2)

where p_i is the proportion of samples belonging to class i. A pure node (all one class) has Gini = 0. A maximally mixed binary node (50/50) has Gini = 0.5.

Entropy measures the information content of the class distribution:

Entropy(node) = -sum(p_i * log2(p_i))

A pure node has Entropy = 0. A 50/50 binary node has Entropy = 1.0.

def gini_impurity(labels):
    """Calculate Gini impurity for a set of labels."""
    n = len(labels)
    if n == 0:
        return 0
    counts = np.bincount(labels)
    probabilities = counts / n
    return 1 - np.sum(probabilities ** 2)

def entropy(labels):
    """Calculate entropy for a set of labels."""
    n = len(labels)
    if n == 0:
        return 0
    counts = np.bincount(labels)
    probabilities = counts[counts > 0] / n
    return -np.sum(probabilities * np.log2(probabilities))

# Demonstrate with different class distributions
examples = [
    ("Pure (all class 0)", np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0])),
    ("90/10 split",         np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1])),
    ("70/30 split",         np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])),
    ("50/50 split",         np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])),
]

print(f"{'Distribution':<25} {'Gini':>8} {'Entropy':>10}")
print("-" * 45)
for name, labels in examples:
    g = gini_impurity(labels)
    e = entropy(labels)
    print(f"{name:<25} {g:>8.4f} {e:>10.4f}")
Distribution                  Gini    Entropy
---------------------------------------------
Pure (all class 0)          0.0000     0.0000
90/10 split                 0.1800     0.4690
70/30 split                 0.4200     0.8813
50/50 split                 0.5000     1.0000

Both measures agree on the ordering: purer nodes have lower impurity. The main practical difference is that Gini is slightly faster to compute (no logarithm) and tends to favor larger partitions, while entropy can be slightly more sensitive to changes in class probabilities. In practice, the difference rarely matters. Scikit-learn defaults to Gini.

How the Tree Chooses a Split

Information gain is the reduction in impurity achieved by splitting a node. For every possible feature and every possible threshold, the tree computes the weighted average impurity of the two child nodes and subtracts it from the parent's impurity. The split with the highest information gain wins.

def information_gain(parent_labels, left_labels, right_labels, criterion='gini'):
    """Calculate information gain for a binary split."""
    measure = gini_impurity if criterion == 'gini' else entropy
    n_parent = len(parent_labels)
    n_left = len(left_labels)
    n_right = len(right_labels)

    parent_impurity = measure(parent_labels)
    weighted_child_impurity = (
        (n_left / n_parent) * measure(left_labels) +
        (n_right / n_parent) * measure(right_labels)
    )
    return parent_impurity - weighted_child_impurity

# Evaluate all possible splits on tenure_months for our simple dataset
y_arr = y_simple.values
thresholds_tenure = np.sort(X_simple['tenure_months'].unique())

best_gain = 0
best_threshold = None
for t in thresholds_tenure:
    left_mask = X_simple['tenure_months'].values <= t
    right_mask = ~left_mask
    if left_mask.sum() == 0 or right_mask.sum() == 0:
        continue
    gain = information_gain(y_arr, y_arr[left_mask], y_arr[right_mask])
    if gain > best_gain:
        best_gain = gain
        best_threshold = t

print(f"Best split: tenure_months <= {best_threshold}")
print(f"Information gain (Gini): {best_gain:.4f}")
Best split: tenure_months <= 5
Information gain (Gini): 0.0487

The algorithm tried every threshold. Splitting at tenure <= 5 produced the largest information gain --- so that is the first split. Then the process repeats recursively on each child node, considering all features and thresholds again.

Key Insight --- A decision tree is a greedy algorithm. It finds the locally optimal split at each node. It does not look ahead to consider whether a different first split might lead to a better tree overall. This greedy approach is fast but can produce suboptimal trees --- another reason why ensembles outperform single trees.
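A classic illustration is XOR-style data, where the signal lives entirely in the interaction between two features. Every candidate first split has (near-)zero information gain, so a greedy depth-1 tree is stuck at chance even though a perfect two-level tree exists. Here is a minimal sketch on synthetic data (not the StreamFlow set):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic XOR: the label depends only on the interaction of x0 and x1,
# so no single axis-aligned split reduces impurity by itself
rng = np.random.default_rng(0)
X_xor = rng.uniform(0, 1, size=(1000, 2))
y_xor = ((X_xor[:, 0] > 0.5) ^ (X_xor[:, 1] > 0.5)).astype(int)

# A depth-1 tree (a stump) is stuck near chance level
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_xor, y_xor)
stump_acc = stump.score(X_xor, y_xor)

# Yet a two-level rule -- split on x0 at 0.5, then on x1 at 0.5 -- is perfect.
# A greedy learner cannot find it, because the first split alone gains nothing.
manual_pred = ((X_xor[:, 0] > 0.5) ^ (X_xor[:, 1] > 0.5)).astype(int)
manual_acc = (manual_pred == y_xor).mean()

# Left unrestricted, the greedy tree eventually memorizes the pattern anyway
deep = DecisionTreeClassifier(random_state=0).fit(X_xor, y_xor)
deep_acc = deep.score(X_xor, y_xor)

print(f"Greedy stump (depth 1):    {stump_acc:.3f}")
print(f"Hand-built 2-level rule:   {manual_acc:.3f}")
print(f"Unrestricted greedy tree:  {deep_acc:.3f}")
```

The greedy tree only recovers the pattern by growing deep enough to memorize it, which is exactly the overfitting behavior the next section dissects.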


The Overfitting Problem: Why Single Trees Fail

The Perfect Memorizer

Here is the single most important experiment in this chapter. We are going to grow an unrestricted decision tree --- no limits on depth, no minimum samples per leaf, no constraints whatsoever --- and watch what happens.

from sklearn.metrics import accuracy_score

# Full StreamFlow dataset from Chapter 11
np.random.seed(42)
n = 50000

df = pd.DataFrame({
    'tenure_months': np.random.exponential(18, n).astype(int).clip(1, 72),
    'monthly_charges': np.round(np.random.choice([9.99, 19.99, 29.99, 49.99], n,
                                                   p=[0.3, 0.35, 0.25, 0.1]), 2),
    'hours_watched_last_30d': np.random.exponential(15, n).round(1).clip(0, 200),
    'sessions_last_30d': np.random.poisson(12, n),
    'support_tickets_last_90d': np.random.poisson(1.5, n),
    'num_devices': np.random.choice([1, 2, 3, 4, 5], n, p=[0.25, 0.30, 0.25, 0.15, 0.05]),
    'contract_type': np.random.choice(['monthly', 'annual'], n, p=[0.65, 0.35]),
    'plan_tier': np.random.choice(['basic', 'pro', 'enterprise'], n, p=[0.45, 0.40, 0.15]),
    'payment_method': np.random.choice(
        ['credit_card', 'debit_card', 'bank_transfer', 'paypal'], n,
        p=[0.35, 0.25, 0.20, 0.20]
    ),
    'days_since_last_login': np.random.exponential(5, n).astype(int).clip(0, 90),
    'content_interactions_last_7d': np.random.poisson(8, n),
    'referral_source': np.random.choice(
        ['organic', 'paid_search', 'social', 'referral', 'email'], n,
        p=[0.30, 0.25, 0.20, 0.15, 0.10]
    ),
})

churn_logit = (
    -2.5
    + 0.8 * (df['contract_type'] == 'monthly').astype(int)
    - 0.04 * df['tenure_months']
    - 0.03 * df['hours_watched_last_30d']
    + 0.12 * df['support_tickets_last_90d']
    + 0.04 * df['days_since_last_login']
    - 0.05 * df['sessions_last_30d']
    - 0.15 * df['num_devices']
    + 0.3 * (df['plan_tier'] == 'basic').astype(int)
    - 0.02 * df['content_interactions_last_7d']
    + np.random.normal(0, 0.8, n)
)
df['churned'] = (1 / (1 + np.exp(-churn_logit)) > 0.5).astype(int)

X = df.drop('churned', axis=1)
y = df['churned']

# Encode categoricals for tree models
X_encoded = pd.get_dummies(X, drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(
    X_encoded, y, test_size=0.2, random_state=42, stratify=y
)

# Unrestricted tree
tree_unrestricted = DecisionTreeClassifier(random_state=42)
tree_unrestricted.fit(X_train, y_train)

train_acc = accuracy_score(y_train, tree_unrestricted.predict(X_train))
test_acc = accuracy_score(y_test, tree_unrestricted.predict(X_test))

print(f"Unrestricted Decision Tree:")
print(f"  Depth:           {tree_unrestricted.get_depth()}")
print(f"  Number of leaves: {tree_unrestricted.get_n_leaves():,}")
print(f"  Training accuracy: {train_acc:.4f}")
print(f"  Test accuracy:     {test_acc:.4f}")
print(f"  Gap:               {train_acc - test_acc:.4f}")
Unrestricted Decision Tree:
  Depth:           38
  Number of leaves: 15,247
  Training accuracy: 1.0000
  Test accuracy:     0.9003
  Gap:               0.0997

There it is. 100% training accuracy. The tree grew 38 levels deep and created 15,247 leaves to perfectly memorize every training sample. And then it dropped nearly 10 percentage points on the test set.

This is not a subtle failure. This is a model that has confused memorization with learning. It has learned that customer #4,723 --- who has tenure 14, watches 8.3 hours, and pays $19.99 --- churned. It has not learned why that customer churned.

Visualizing the Failure

The gap between training and test performance is the signature of overfitting. Let us see how it varies with tree depth:

from sklearn.model_selection import cross_val_score

depths = list(range(1, 31))
train_scores = []
test_scores = []

for d in depths:
    tree = DecisionTreeClassifier(max_depth=d, random_state=42)
    tree.fit(X_train, y_train)
    train_scores.append(accuracy_score(y_train, tree.predict(X_train)))
    cv_scores = cross_val_score(tree, X_train, y_train, cv=5, scoring='accuracy')
    test_scores.append(cv_scores.mean())

# Find the optimal depth
optimal_depth = depths[np.argmax(test_scores)]
print(f"Optimal depth (by cross-validation): {optimal_depth}")
print(f"  CV accuracy at optimal depth: {max(test_scores):.4f}")
print(f"  Train accuracy at optimal depth: {train_scores[optimal_depth - 1]:.4f}")
print(f"\nDepth vs. Performance:")
print(f"{'Depth':>6} {'Train Acc':>12} {'CV Acc':>10}")
print("-" * 30)
for d, tr, te in zip(depths, train_scores, test_scores):
    marker = " <-- best" if d == optimal_depth else ""
    if d <= 10 or d % 5 == 0 or d == optimal_depth:
        print(f"{d:>6} {tr:>12.4f} {te:>10.4f}{marker}")
Optimal depth (by cross-validation): 6
  CV accuracy at optimal depth: 0.9185
  Train accuracy at optimal depth: 0.9248

Depth vs. Performance:
 Depth    Train Acc     CV Acc
------------------------------
     1       0.9160     0.9158
     2       0.9169     0.9163
     3       0.9184     0.9172
     4       0.9213     0.9180
     5       0.9237     0.9183
     6       0.9248     0.9185 <-- best
     7       0.9276     0.9182
     8       0.9316     0.9175
     9       0.9363     0.9164
    10       0.9416     0.9150
    15       0.9681     0.9093
    20       0.9892     0.9041
    25       0.9978     0.9016
    30       0.9999     0.9008

The pattern is unmistakable. Training accuracy rises monotonically with depth --- the tree keeps finding new ways to memorize the data. But cross-validation accuracy peaks at depth 6 and then declines. Every split beyond depth 6 is fitting noise.

The optimal tree is shallow. Depth 6 means at most 6 questions to reach a prediction. That is a tree you can print on a single page. The 38-level monster was not more sophisticated --- it was more confused.


Pruning: Constraining the Tree

Pruning is the practice of limiting tree complexity to prevent overfitting. There are two approaches:

Pre-pruning (constraint-based): Set limits before training. Stop growing the tree early.

Post-pruning (cost-complexity pruning): Grow the full tree, then remove branches that do not improve validation performance.

Pre-Pruning Parameters

Scikit-learn's DecisionTreeClassifier offers several pre-pruning controls:

from sklearn.metrics import roc_auc_score

# Compare different pruning strategies
configs = {
    'Unrestricted': {},
    'max_depth=6': {'max_depth': 6},
    'min_samples_split=50': {'min_samples_split': 50},
    'min_samples_leaf=25': {'min_samples_leaf': 25},
    'Combined': {'max_depth': 8, 'min_samples_split': 50, 'min_samples_leaf': 20},
}

print(f"{'Config':<28} {'Depth':>6} {'Leaves':>8} {'Train AUC':>12} {'Test AUC':>10}")
print("-" * 68)

for name, params in configs.items():
    tree = DecisionTreeClassifier(random_state=42, **params)
    tree.fit(X_train, y_train)

    train_auc = roc_auc_score(y_train, tree.predict_proba(X_train)[:, 1])
    test_auc = roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1])

    print(f"{name:<28} {tree.get_depth():>6} {tree.get_n_leaves():>8,} "
          f"{train_auc:>12.4f} {test_auc:>10.4f}")
Config                        Depth   Leaves    Train AUC   Test AUC
--------------------------------------------------------------------
Unrestricted                     38   15,247       1.0000     0.7889
max_depth=6                       6      64       0.8604     0.8472
min_samples_split=50             19    1,482       0.9471     0.8312
min_samples_leaf=25              14      798       0.9108     0.8391
Combined                          8      131       0.8791     0.8456

Notice three things:

  1. The unrestricted tree has a perfect training AUC but the worst test AUC. Memorization is the opposite of learning.

  2. max_depth=6 achieves the best test AUC with the fewest leaves. Aggressive pruning works.

  3. All pruned trees outperform the unrestricted tree on the test set. Every constraint helps.

Key Insight --- max_depth is the single most important hyperparameter for decision trees. Start with max_depth=5 or max_depth=6 and tune from there. Trees deeper than 10 are almost always overfitting on tabular data.

Post-Pruning: Cost-Complexity Pruning

Scikit-learn also supports cost-complexity pruning via the ccp_alpha parameter. The idea: after growing the full tree, compute the "effective alpha" for each internal node --- the per-leaf increase in impurity you would accept by pruning that subtree to a single leaf. Then prune all subtrees whose alpha is below a threshold.

# Find the cost-complexity pruning path
path = tree_unrestricted.cost_complexity_pruning_path(X_train, y_train)
ccp_alphas = path.ccp_alphas

# Train trees at each alpha
train_aucs = []
test_aucs = []
n_leaves_list = []

for alpha in ccp_alphas[::50]:  # Sample every 50th alpha for speed
    tree = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    tree.fit(X_train, y_train)
    train_aucs.append(roc_auc_score(y_train, tree.predict_proba(X_train)[:, 1]))
    test_aucs.append(roc_auc_score(y_test, tree.predict_proba(X_test)[:, 1]))
    n_leaves_list.append(tree.get_n_leaves())

# Find optimal
best_idx = np.argmax(test_aucs)
print(f"Optimal ccp_alpha: {ccp_alphas[::50][best_idx]:.6f}")
print(f"  Leaves: {n_leaves_list[best_idx]}")
print(f"  Train AUC: {train_aucs[best_idx]:.4f}")
print(f"  Test AUC: {test_aucs[best_idx]:.4f}")
Optimal ccp_alpha: 0.000412
  Leaves: 42
  Train AUC: 0.8697
  Test AUC: 0.8483

Cost-complexity pruning found a 42-leaf tree --- smaller than the 64-leaf depth-6 tree --- with comparable test performance. Both approaches work. In practice, most practitioners use max_depth because it is simpler and faster than searching over ccp_alpha.


Why Single Trees Are Unstable

Beyond overfitting, single trees have a second fatal flaw: high variance. Small changes in the training data can produce completely different trees.

# Train trees on slightly different bootstrap samples
from sklearn.utils import resample

trees_text = []
for i in range(3):
    X_boot, y_boot = resample(X_train, y_train, random_state=i)
    tree = DecisionTreeClassifier(max_depth=3, random_state=42)
    tree.fit(X_boot, y_boot)
    text = export_text(tree, feature_names=list(X_train.columns), max_depth=2)
    trees_text.append(text)
    print(f"--- Tree {i+1} (first 2 levels) ---")
    print(text[:300])
    print()
--- Tree 1 (first 2 levels) ---
|--- contract_type_monthly <= 0.50
|   |--- hours_watched_last_30d <= 11.05
|   |   |--- truncated
|   |--- hours_watched_last_30d >  11.05
|   |   |--- truncated
|--- contract_type_monthly >  0.50
|   |--- days_since_last_login <= 5.50
|   |   |--- truncated
|   |--- days_since_last_login >  5.50
|   |   |--- truncated

--- Tree 2 (first 2 levels) ---
|--- contract_type_monthly <= 0.50
|   |--- support_tickets_last_90d <= 3.50
|   |   |--- truncated
|   |--- support_tickets_last_90d >  3.50
|   |   |--- truncated
|--- contract_type_monthly >  0.50
|   |--- tenure_months <= 7.50
|   |   |--- truncated
|   |--- tenure_months >  7.50
|   |   |--- truncated

--- Tree 3 (first 2 levels) ---
|--- contract_type_monthly <= 0.50
|   |--- tenure_months <= 3.50
|   |   |--- truncated
|   |--- tenure_months >  3.50
|   |   |--- truncated
|--- contract_type_monthly >  0.50
|   |--- sessions_last_30d <= 6.50
|   |   |--- truncated
|   |--- sessions_last_30d >  6.50
|   |   |--- truncated

All three trees agree on the first split (contract_type), but the second split is different in every tree. Tree 1 uses hours_watched. Tree 2 uses support_tickets. Tree 3 uses tenure_months. Which tree tells the right story? None of them --- or all of them. The true pattern involves multiple features, and a single tree's greedy splitting can latch onto any one of them depending on the data.

This instability is the deeper problem. If your model changes significantly when you add or remove a few training points, you cannot trust the specific structure of any one tree. You can trust only the average behavior across many trees.

That is the insight behind Random Forests.


Random Forests: Fixing Trees with Randomness

Bagging: Bootstrap Aggregating

The Random Forest algorithm is built on a simple idea called bagging (bootstrap aggregating), proposed by Leo Breiman in 1996:

  1. Draw a bootstrap sample (sample with replacement, same size as the training set) from the training data
  2. Train a decision tree on that bootstrap sample
  3. Repeat steps 1-2 many times (e.g., 500 trees)
  4. For prediction, have all trees vote (classification) or average (regression)
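The four steps above can be sketched by hand in a few lines. This is a toy illustration on a synthetic dataset (not the StreamFlow data), with label noise injected via flip_y so that a single deep tree overfits and the ensemble's advantage is visible:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data, so a single deep tree will overfit
X_toy, y_toy = make_classification(n_samples=2000, n_features=10,
                                   n_informative=5, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.5,
                                          random_state=0)

rng = np.random.default_rng(0)
n_trees = 50
votes = np.zeros((n_trees, len(X_te)), dtype=int)

for i in range(n_trees):
    # Steps 1-2: draw a bootstrap sample, fit an unrestricted tree on it
    idx = rng.integers(0, len(X_tr), size=len(X_tr))
    tree = DecisionTreeClassifier(random_state=i).fit(X_tr[idx], y_tr[idx])
    votes[i] = tree.predict(X_te)

# Step 4: majority vote across all trees
bagged_pred = (votes.mean(axis=0) > 0.5).astype(int)
bagged_acc = (bagged_pred == y_te).mean()

single = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
single_acc = single.score(X_te, y_te)

print(f"Single deep tree, test accuracy: {single_acc:.3f}")
print(f"Bagged ensemble of {n_trees} trees: {bagged_acc:.3f}")
```

Each individual tree overfits its own bootstrap sample; the vote averages away their uncorrelated mistakes.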

Each bootstrap sample contains roughly 63.2% of the original data points (the rest are left out due to replacement). This means each tree sees a slightly different version of the data and makes slightly different errors. When you average their predictions, the individual errors cancel out.

The Math --- Why 63.2%? The probability that a specific sample is NOT chosen in any single draw is (1 - 1/n). Over n draws with replacement, the probability it is never chosen is (1 - 1/n)^n, which converges to 1/e ≈ 0.368 as n grows large. So roughly 36.8% of samples are left out of each bootstrap, and 63.2% are included.
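A quick simulation confirms the arithmetic:

```python
import numpy as np

# Fraction of distinct samples that appear in one bootstrap of size n
rng = np.random.default_rng(42)
n = 50_000
idx = rng.integers(0, n, size=n)            # sample with replacement
included = len(np.unique(idx)) / n          # fraction of samples drawn at least once

print(f"Theory:    1 - 1/e = {1 - np.exp(-1):.4f}")
print(f"Simulated:           {included:.4f}")
```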

Feature Randomization: The Second Layer of Randomness

Plain bagging helps, but Breiman realized it was not enough. If one feature is very strong, every bagged tree will split on it first, making the trees highly correlated. Correlated trees produce correlated errors, and averaging correlated predictions does not help much.

The fix: at each split, consider only a random subset of features. This is controlled by the max_features parameter:

  • For classification: default is sqrt(n_features) --- if you have 100 features, each split considers only 10
  • For regression: default is n_features (all features)

This is the double randomization that makes Random Forests work:

  1. Bootstrap sampling --- each tree trains on different data
  2. Feature randomization --- each split considers different features

The combination forces the trees to explore different parts of the feature space. Some trees will discover the tenure-churn relationship. Others will find the support-ticket signal. Others will pick up on the contract-type effect. No single tree sees the full picture, but the ensemble does.
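That decorrelation can be measured directly. The sketch below (synthetic data; the helper mean_tree_correlation is our own name) trains two forests --- one considering all features at every split, one considering sqrt subsets --- and compares the average pairwise correlation of the individual trees' predicted probabilities, using the fitted trees that scikit-learn exposes via rf.estimators_:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_syn, y_syn = make_classification(n_samples=3000, n_features=20,
                                   n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.5,
                                          random_state=0)

def mean_tree_correlation(max_features):
    """Average pairwise correlation of per-tree probability predictions."""
    rf = RandomForestClassifier(n_estimators=50, max_features=max_features,
                                random_state=0, n_jobs=-1).fit(X_tr, y_tr)
    probs = np.array([t.predict_proba(X_te)[:, 1] for t in rf.estimators_])
    corr = np.corrcoef(probs)
    return corr[~np.eye(len(corr), dtype=bool)].mean()  # off-diagonal mean

corr_all = mean_tree_correlation(max_features=None)     # bagging only
corr_sqrt = mean_tree_correlation(max_features='sqrt')  # full Random Forest

print(f"Mean tree correlation, all features:   {corr_all:.3f}")
print(f"Mean tree correlation, sqrt subsets:   {corr_sqrt:.3f}")
```

Lower correlation between trees means the averaging step cancels more of their individual errors.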

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score,
    f1_score, roc_auc_score, classification_report
)

# Train a Random Forest
rf = RandomForestClassifier(
    n_estimators=500,
    max_depth=None,       # Let trees grow deep --- the ensemble handles overfitting
    max_features='sqrt',  # Random feature subsets at each split
    min_samples_leaf=5,   # Slight regularization per tree
    random_state=42,
    n_jobs=-1,            # Use all CPU cores
)
rf.fit(X_train, y_train)

y_pred_rf = rf.predict(X_test)
y_prob_rf = rf.predict_proba(X_test)[:, 1]

print("=" * 60)
print("RANDOM FOREST — STREAMFLOW CHURN")
print("=" * 60)
print(f"\nTest Set Metrics:")
print(f"  Accuracy:  {accuracy_score(y_test, y_pred_rf):.4f}")
print(f"  Precision: {precision_score(y_test, y_pred_rf):.4f}")
print(f"  Recall:    {recall_score(y_test, y_pred_rf):.4f}")
print(f"  F1 Score:  {f1_score(y_test, y_pred_rf):.4f}")
print(f"  AUC-ROC:   {roc_auc_score(y_test, y_prob_rf):.4f}")
============================================================
RANDOM FOREST — STREAMFLOW CHURN
============================================================

Test Set Metrics:
  Accuracy:  0.9288
  Precision: 0.6742
  Recall:    0.4024
  F1 Score:  0.5038
  AUC-ROC:   0.8891

Compare this to the Chapter 11 logistic regression baseline (AUC: 0.8234). The Random Forest achieves an AUC of 0.8891 --- a meaningful improvement. And unlike the unrestricted decision tree, the gap between training and test performance is modest because the ensemble's averaging suppresses overfitting.

Out-of-Bag Error: Free Cross-Validation

Remember that each tree is trained on ~63.2% of the data. The remaining ~36.8% --- the out-of-bag (OOB) samples --- can serve as a built-in validation set. For each training sample, collect the predictions from only the trees that did NOT include that sample in their bootstrap, and compute the error. This is the OOB error, and it is approximately equivalent to cross-validation --- for free.

rf_oob = RandomForestClassifier(
    n_estimators=500,
    max_features='sqrt',
    min_samples_leaf=5,
    oob_score=True,
    random_state=42,
    n_jobs=-1,
)
rf_oob.fit(X_train, y_train)

print(f"OOB accuracy: {rf_oob.oob_score_:.4f}")
print(f"Test accuracy: {accuracy_score(y_test, rf_oob.predict(X_test)):.4f}")
print(f"Difference:    {abs(rf_oob.oob_score_ - accuracy_score(y_test, rf_oob.predict(X_test))):.4f}")
OOB accuracy: 0.9268
Test accuracy: 0.9288
Difference:    0.0020

The OOB estimate is within 0.2 percentage points of the actual test accuracy. This makes OOB error an excellent quick-check for Random Forest performance without the cost of cross-validation.

Production Tip --- Use oob_score=True during development for fast iteration. Switch to full cross-validation for your final model evaluation. OOB is an approximation, not a replacement.


Feature Importance: What the Forest Learned

One of Random Forests' most valuable outputs is feature importance --- a ranking of which features contribute most to predictions. But there are two methods, and they do not always agree.

Impurity-Based Importance (MDI)

The default in scikit-learn. For each feature, sum the total reduction in Gini impurity across all nodes in all trees where that feature was used as the split variable. Features that appear high in many trees and produce large impurity reductions rank highest.

# Impurity-based importance (default)
importances_mdi = rf.feature_importances_
feature_names = X_train.columns

# Sort and display
sorted_idx = np.argsort(importances_mdi)[::-1]

print("Feature Importance (Impurity-Based / MDI):")
print("-" * 50)
for i in range(min(15, len(feature_names))):
    idx = sorted_idx[i]
    print(f"  {i+1:>2}. {feature_names[idx]:<35} {importances_mdi[idx]:.4f}")
Feature Importance (Impurity-Based / MDI):
--------------------------------------------------
   1. days_since_last_login              0.1432
   2. tenure_months                      0.1387
   3. hours_watched_last_30d             0.1294
   4. sessions_last_30d                  0.1186
   5. support_tickets_last_90d           0.0987
   6. content_interactions_last_7d       0.0891
   7. monthly_charges                    0.0753
   8. num_devices                        0.0640
   9. contract_type_monthly              0.0584
  10. plan_tier_enterprise               0.0276
  11. plan_tier_pro                      0.0213
  12. referral_source_paid_search        0.0074
  13. referral_source_referral           0.0069
  14. referral_source_organic            0.0067
  15. payment_method_credit_card         0.0054

This looks reasonable. The engagement and tenure features rank highest, which aligns with our domain understanding. But MDI has a known bias: it inflates the importance of high-cardinality and continuous features because they offer more possible split points. A continuous feature with 1,000 unique values has more chances to reduce impurity than a binary feature, regardless of true predictive power.
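The bias is easy to reproduce. In the sketch below (synthetic data; both appended columns are pure noise with zero true predictive power), MDI gives the continuous noise column more credit than the binary one simply because it offers far more candidate thresholds:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X_base, y_base = make_classification(n_samples=3000, n_features=5,
                                     n_informative=3, random_state=0)

# Two noise features: one continuous (many split points), one binary (one)
noise_cont = rng.normal(size=(len(X_base), 1))
noise_bin = rng.integers(0, 2, size=(len(X_base), 1)).astype(float)
X_aug = np.hstack([X_base, noise_cont, noise_bin])

rf_noise = RandomForestClassifier(n_estimators=200, random_state=0, n_jobs=-1)
rf_noise.fit(X_aug, y_base)

mdi_cont = rf_noise.feature_importances_[-2]
mdi_bin = rf_noise.feature_importances_[-1]
print(f"MDI, continuous noise feature: {mdi_cont:.4f}")
print(f"MDI, binary noise feature:     {mdi_bin:.4f}")
```

Neither column carries signal, yet the continuous one reliably ranks higher. Permutation importance would score both near zero.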

Permutation-Based Importance

A more reliable alternative. The idea: for each feature, randomly shuffle its values in the test set (breaking its relationship with the target) and measure how much model performance drops. Features that cause a large drop when shuffled are truly important.

from sklearn.inspection import permutation_importance

# Permutation importance on the test set
perm_result = permutation_importance(
    rf, X_test, y_test,
    n_repeats=10,
    random_state=42,
    scoring='roc_auc',
    n_jobs=-1,
)

importances_perm = perm_result.importances_mean
sorted_idx_perm = np.argsort(importances_perm)[::-1]

print("Feature Importance (Permutation-Based):")
print("-" * 55)
for i in range(min(15, len(feature_names))):
    idx = sorted_idx_perm[i]
    std = perm_result.importances_std[idx]
    print(f"  {i+1:>2}. {feature_names[idx]:<35} {importances_perm[idx]:.4f} +/- {std:.4f}")
Feature Importance (Permutation-Based):
-------------------------------------------------------
   1. contract_type_monthly              0.0523 +/- 0.0031
   2. tenure_months                      0.0389 +/- 0.0024
   3. hours_watched_last_30d             0.0312 +/- 0.0019
   4. days_since_last_login              0.0287 +/- 0.0022
   5. sessions_last_30d                  0.0241 +/- 0.0018
   6. support_tickets_last_90d           0.0198 +/- 0.0016
   7. num_devices                        0.0143 +/- 0.0013
   8. content_interactions_last_7d       0.0097 +/- 0.0011
   9. plan_tier_enterprise               0.0062 +/- 0.0009
  10. monthly_charges                    0.0051 +/- 0.0008
  11. plan_tier_pro                      0.0038 +/- 0.0007
  12. referral_source_paid_search        0.0004 +/- 0.0003
  13. referral_source_referral           0.0003 +/- 0.0002
  14. payment_method_credit_card         0.0002 +/- 0.0002
  15. referral_source_organic            0.0001 +/- 0.0002

The story changed. Permutation importance ranks contract_type_monthly as the most important feature --- which matches the true data-generating process (it has the largest coefficient in our logistic model). MDI ranked it 9th because it is binary and offers only one split point. The continuous features that MDI loved --- days_since_last_login, hours_watched --- are still important but no longer dominate.

Key Insight --- Use impurity-based importance for a quick first look. Use permutation importance when the ranking matters for decision-making. Impurity-based importance is fast but biased toward continuous and high-cardinality features. Permutation importance is slower but reliable. When the two methods disagree, trust permutation importance.

Comparing the Two Methods

| Aspect | Impurity-Based (MDI) | Permutation-Based |
|---|---|---|
| Speed | Fast (computed during training) | Slow (requires re-prediction per feature) |
| Bias | Favors high-cardinality features | Largely unbiased |
| Computed on | Training data | Test data (preferred) or training data |
| Correlation handling | Splits importance among correlated features | Can understate importance of correlated features |
| When to use | Quick exploration, feature screening | Final feature ranking, reporting to stakeholders |
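The cardinality bias is easy to reproduce. Here is a minimal sketch on synthetic data (feature names hypothetical): a binary feature drives the label, while a continuous feature is pure noise. Fully grown trees keep splitting on the noise feature near the leaves, so MDI hands it a large share of the credit.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
signal = rng.integers(0, 2, n)     # binary: the true driver, one split point
noise = rng.normal(size=n)         # continuous pure noise, many split points
y = (signal + rng.normal(scale=0.5, size=n) > 0.5).astype(int)
X = np.column_stack([signal, noise])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0, n_jobs=-1)
rf.fit(X_tr, y_tr)

mdi = rf.feature_importances_
perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)

print(f"MDI:         signal={mdi[0]:.3f}  noise={mdi[1]:.3f}")
print(f"Permutation: signal={perm.importances_mean[0]:.3f}  "
      f"noise={perm.importances_mean[1]:.3f}")
```

Permutation importance assigns the noise feature roughly zero, while MDI gives it substantial weight purely because it offers many candidate thresholds.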

Random Forest Hyperparameters That Matter

Not all hyperparameters deserve equal tuning effort. Here is the priority list:

n_estimators (Number of Trees)

More trees are almost always better. Performance increases rapidly up to ~100 trees, then slowly levels off. The cost is linear in training time.

import time

n_trees_list = [10, 25, 50, 100, 200, 500, 1000]
oob_scores = []

print(f"{'n_estimators':>14} {'OOB Accuracy':>14} {'Training Time':>15}")
print("-" * 45)

for n_trees in n_trees_list:
    start = time.time()
    rf_temp = RandomForestClassifier(
        n_estimators=n_trees,
        max_features='sqrt',
        min_samples_leaf=5,
        oob_score=True,
        random_state=42,
        n_jobs=-1,
    )
    rf_temp.fit(X_train, y_train)
    elapsed = time.time() - start
    oob_scores.append(rf_temp.oob_score_)
    print(f"{n_trees:>14} {rf_temp.oob_score_:>14.4f} {elapsed:>13.1f}s")
  n_estimators   OOB Accuracy   Training Time
---------------------------------------------
            10         0.9199           0.4s
            25         0.9238           0.8s
            50         0.9254           1.5s
           100         0.9262           2.9s
           200         0.9267           5.6s
           500         0.9268          13.8s
          1000         0.9269          27.4s

After 200 trees, the marginal improvement is negligible. 500 is a good default. 1000 if you want to be safe and have the compute budget. There is no overfitting risk from adding more trees --- unlike tree depth, more estimators never hurts.

Production Tip --- Start with n_estimators=100 for fast iteration. Increase to 500 for final models. Going beyond 1000 rarely improves results but always increases training time and memory.
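Because extra trees never hurt, you do not have to retrain from scratch to find the knee of the curve. scikit-learn's warm_start flag keeps the trees already grown and fits only the new ones on the next call. A sketch on synthetic data (variable names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X_demo, y_demo = make_classification(n_samples=2000, n_features=20,
                                     n_informative=8, random_state=42)

rf_grow = RandomForestClassifier(
    warm_start=True,            # keep existing trees on refit
    oob_score=True,
    max_features='sqrt',
    min_samples_leaf=5,
    random_state=42,
    n_jobs=-1,
)

for n in [50, 100, 200, 400]:
    rf_grow.n_estimators = n    # only the newly added trees are trained
    rf_grow.fit(X_demo, y_demo)
    print(f"{n:>4} trees  OOB accuracy = {rf_grow.oob_score_:.4f}")
```

The OOB score is recomputed after each fit, so you can watch it plateau and stop adding trees once the marginal gain disappears.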

max_features (Feature Randomization)

Controls how many features each split considers. Lower values increase randomness and decorrelate the trees.

max_features_list = [3, 5, 'sqrt', 8, 12, 'log2', None]

print(f"{'max_features':>15} {'Actual':>8} {'OOB Accuracy':>14}")
print("-" * 40)

for mf in max_features_list:
    rf_temp = RandomForestClassifier(
        n_estimators=200,
        max_features=mf,
        min_samples_leaf=5,
        oob_score=True,
        random_state=42,
        n_jobs=-1,
    )
    rf_temp.fit(X_train, y_train)
    if mf == 'sqrt':
        actual = int(np.sqrt(X_train.shape[1]))
    elif mf == 'log2':
        actual = int(np.log2(X_train.shape[1]))
    elif mf is None:
        actual = X_train.shape[1]
    else:
        actual = mf
    print(f"{str(mf):>15} {actual:>8} {rf_temp.oob_score_:>14.4f}")
    max_features   Actual   OOB Accuracy
----------------------------------------
              3        3         0.9242
              5        5         0.9261
           sqrt        4         0.9267
              8        8         0.9262
             12       12         0.9251
           log2        4         0.9265
           None       19         0.9218

sqrt (the default) performs well. Using all features (None) is the worst --- that is just bagging without feature randomization, and the trees are too correlated. Very few features (3) also underperforms because each tree is too constrained.
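The decorrelation claim can be measured directly. A sketch on synthetic data (names hypothetical): compare the average pairwise correlation between individual trees' predicted probabilities with and without feature subsampling.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X_c, y_c = make_classification(n_samples=2000, n_features=20,
                               n_informative=8, random_state=0)
X_tr_c, X_te_c, y_tr_c, y_te_c = train_test_split(X_c, y_c, random_state=0)

corrs = {}
for mf in ['sqrt', None]:
    rf = RandomForestClassifier(n_estimators=50, max_features=mf,
                                min_samples_leaf=5, random_state=0,
                                n_jobs=-1).fit(X_tr_c, y_tr_c)
    # each row: one tree's predicted churn probabilities on the test set
    probs = np.array([t.predict_proba(X_te_c)[:, 1] for t in rf.estimators_])
    c = np.corrcoef(probs)
    corrs[mf] = c[np.triu_indices_from(c, k=1)].mean()
    print(f"max_features={str(mf):>4}: mean tree-tree correlation = {corrs[mf]:.3f}")
```

With max_features=None the trees all greedily chase the same strong features and agree far more with each other; subsampling forces disagreement, and averaging disagreeing-but-unbiased trees is what reduces variance.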

max_depth and min_samples_leaf

Unlike single trees, Random Forests often work well with deep or unrestricted trees because the ensemble averaging controls overfitting. However, setting min_samples_leaf=5 or min_samples_leaf=10 provides mild regularization and can improve OOB performance slightly while reducing memory usage.

leaf_configs = [1, 2, 5, 10, 25, 50]

print(f"{'min_samples_leaf':>18} {'OOB Accuracy':>14} {'Avg Leaves':>12}")
print("-" * 46)

for leaf in leaf_configs:
    rf_temp = RandomForestClassifier(
        n_estimators=200,
        max_features='sqrt',
        min_samples_leaf=leaf,
        oob_score=True,
        random_state=42,
        n_jobs=-1,
    )
    rf_temp.fit(X_train, y_train)
    avg_leaves = np.mean([est.get_n_leaves() for est in rf_temp.estimators_])
    print(f"{leaf:>18} {rf_temp.oob_score_:>14.4f} {avg_leaves:>12,.0f}")
  min_samples_leaf   OOB Accuracy   Avg Leaves
----------------------------------------------
                 1         0.9260         9,847
                 2         0.9263         7,492
                 5         0.9267         4,318
                10         0.9265         2,741
                25         0.9254         1,287
                50         0.9236          682

min_samples_leaf=5 hits the sweet spot: slightly better OOB accuracy than the unrestricted tree (leaf=1) while using less than half the memory. This is a free lunch --- take it.


Random Forest vs. Logistic Regression: The Full Comparison

Let us put the Chapter 11 baseline and the Random Forest side by side on the StreamFlow data.

from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import StratifiedKFold

# Recreate from the original (non-encoded) data for logistic regression
X_raw = df.drop('churned', axis=1)
y_raw = df['churned']

numeric_features = X_raw.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X_raw.select_dtypes(include=['object']).columns.tolist()

X_train_raw, X_test_raw, y_train_raw, y_test_raw = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw
)

# Logistic regression baseline (from Chapter 11)
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'),
         categorical_features),
    ]
)

lr_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegressionCV(
        Cs=np.logspace(-4, 4, 30),
        cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
        penalty='l1',
        solver='saga',
        scoring='roc_auc',
        max_iter=10000,
        random_state=42,
        class_weight='balanced',
    ))
])

lr_pipe.fit(X_train_raw, y_train_raw)
y_prob_lr = lr_pipe.predict_proba(X_test_raw)[:, 1]
y_pred_lr = lr_pipe.predict(X_test_raw)

# Random Forest (no scaling needed)
rf_preprocessor = ColumnTransformer(
    transformers=[
        ('num', 'passthrough', numeric_features),
        ('cat', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'),
         categorical_features),
    ]
)

rf_pipe = Pipeline([
    ('preprocessor', rf_preprocessor),
    ('classifier', RandomForestClassifier(
        n_estimators=500,
        max_features='sqrt',
        min_samples_leaf=5,
        random_state=42,
        n_jobs=-1,
    ))
])

rf_pipe.fit(X_train_raw, y_train_raw)
y_prob_rf_full = rf_pipe.predict_proba(X_test_raw)[:, 1]
y_pred_rf_full = rf_pipe.predict(X_test_raw)

# Side-by-side comparison
print("=" * 65)
print("MODEL COMPARISON — STREAMFLOW CHURN PREDICTION")
print("=" * 65)
print(f"\n{'Metric':<15} {'Logistic Reg (L1)':>20} {'Random Forest':>18}")
print("-" * 55)
for metric_name, metric_fn in [
    ('Accuracy', accuracy_score),
    ('Precision', precision_score),
    ('Recall', recall_score),
    ('F1 Score', f1_score),
]:
    lr_val = metric_fn(y_test_raw, y_pred_lr)
    rf_val = metric_fn(y_test_raw, y_pred_rf_full)
    print(f"{metric_name:<15} {lr_val:>20.4f} {rf_val:>18.4f}")

lr_auc = roc_auc_score(y_test_raw, y_prob_lr)
rf_auc = roc_auc_score(y_test_raw, y_prob_rf_full)
print(f"{'AUC-ROC':<15} {lr_auc:>20.4f} {rf_auc:>18.4f}")
print(f"\n{'Winner':>55}")
print(f"{'AUC-ROC':>55}: {'Random Forest' if rf_auc > lr_auc else 'Logistic Regression'}")
print(f"{'Improvement':>55}: +{(rf_auc - lr_auc):.4f} ({(rf_auc - lr_auc) / lr_auc * 100:.1f}%)")
=====================================================================
MODEL COMPARISON — STREAMFLOW CHURN PREDICTION
=====================================================================

Metric          Logistic Reg (L1)     Random Forest
-------------------------------------------------------
Accuracy                  0.7856             0.9288
Precision                 0.2634             0.6742
Recall                    0.7012             0.4024
F1 Score                  0.3829             0.5038
AUC-ROC                   0.8234             0.8891

                                                 Winner
                                                AUC-ROC: Random Forest
                                            Improvement: +0.0657 (8.0%)

The Random Forest wins on AUC by 0.066, an 8% relative improvement. It also dramatically improves precision (from 0.26 to 0.67) --- meaning when it predicts churn, it is right far more often. The logistic regression has higher recall because it uses class_weight='balanced', which biases it toward predicting churn more aggressively. There is a tradeoff, but the forest's overall discriminative ability (AUC) is substantially better.

Why does the Random Forest win? Two reasons. First, the true churn-generating process involves non-linear interactions (e.g., contract type interacts with tenure), and decision trees capture these naturally through their hierarchical splits. Logistic regression cannot see interactions unless you engineer them explicitly. Second, the forest's ensemble averaging reduces variance without increasing bias --- it gets the benefits of complex trees without paying the overfitting penalty.
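The first point can be seen in miniature with a synthetic pure-interaction (XOR) target plus 10% label noise. Neither feature carries any marginal signal, so logistic regression on the raw features scores near chance, while the forest approaches the AUC ceiling implied by the label noise (this sketch uses hypothetical feature names, not the StreamFlow data):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 4000
x1 = rng.integers(0, 2, n)
x2 = rng.integers(0, 2, n)
interaction = x1 ^ x2                          # label driven purely by XOR
y = np.where(rng.random(n) < 0.9, interaction, 1 - interaction)

X = np.column_stack([x1, x2]).astype(float)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42, stratify=y)

lr_auc = roc_auc_score(
    y_te, LogisticRegression().fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
rf_auc = roc_auc_score(
    y_te, RandomForestClassifier(n_estimators=100, random_state=42)
          .fit(X_tr, y_tr).predict_proba(X_te)[:, 1])

print(f"Logistic regression AUC: {lr_auc:.3f}")
print(f"Random forest AUC:       {rf_auc:.3f}")
```

To make logistic regression competitive here, you would have to engineer the product feature x1*x2 by hand; the trees find it automatically by splitting on one feature and then the other.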


Why Trees Handle Mixed Feature Types Naturally

One of the most practical advantages of tree-based models is that they do not care about feature scaling, encoding, or type. Numeric features, binary features, ordinal features, and one-hot encoded categories are all split the same way: find a threshold, go left or right.

This means:

  • No scaling required. A feature measured in milliseconds and a feature measured in millions of dollars are treated identically. The tree only cares about ordering.
  • Categorical features can be encoded simply. One-hot encoding works, but so do ordinal encoding and even target encoding. Trees are robust to the encoding choice.
  • Missing values can be handled natively (in some implementations --- scikit-learn requires imputation, but LightGBM and XGBoost handle NaN values directly).

# Demonstrate: scaling does not change tree predictions
from sklearn.preprocessing import StandardScaler

# Raw features
tree_raw = DecisionTreeClassifier(max_depth=5, random_state=42)
tree_raw.fit(X_train, y_train)

# Scaled features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

tree_scaled = DecisionTreeClassifier(max_depth=5, random_state=42)
tree_scaled.fit(X_train_scaled, y_train)

pred_raw = tree_raw.predict(X_test)
pred_scaled = tree_scaled.predict(X_test_scaled)

print(f"Predictions identical: {np.array_equal(pred_raw, pred_scaled)}")
print(f"Raw tree test accuracy:    {accuracy_score(y_test, pred_raw):.4f}")
print(f"Scaled tree test accuracy: {accuracy_score(y_test, pred_scaled):.4f}")
Predictions identical: True
Raw tree test accuracy:    0.9168
Scaled tree test accuracy: 0.9168

Identical. The tree splits on thresholds that are invariant to monotonic transformations. Scaling changes the threshold values but not the split ordering. This is a genuine practical advantage --- one less preprocessing step to get wrong.
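The invariance covers any strictly increasing transform, not just affine scaling. A sketch on synthetic data using exp as the transform: because only the ordering of values matters, the two trees choose the same feature at every node and partition the training data identically.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = make_classification(n_samples=1000, n_features=5, random_state=0)

t_raw = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X_demo, y_demo)
t_exp = DecisionTreeClassifier(max_depth=5, random_state=0).fit(np.exp(X_demo), y_demo)

# exp is strictly increasing, so every candidate split induces the same
# left/right partition of the training points in both trees
print("Identical split features:", np.array_equal(t_raw.tree_.feature,
                                                  t_exp.tree_.feature))
print("Identical predictions:   ", np.array_equal(t_raw.predict(X_demo),
                                                  t_exp.predict(np.exp(X_demo))))
```

Both checks print True. (The exact threshold values differ, and on unseen points a threshold can land at a slightly different spot within the gap between two training values, but the learned structure is the same.)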


Progressive Project M5 (Part 1): Random Forest on StreamFlow

Milestone Deliverables

  1. Train a Random Forest on the StreamFlow churn data
  2. Compare to the logistic regression baseline from Chapter 11
  3. Generate and interpret feature importance (both methods)
  4. Discuss why the Random Forest is better here

Step 1: Train and Evaluate

If you have been following the progressive project, you already have the StreamFlow data and the logistic regression baseline from M4 (Chapter 11). Now train a Random Forest:

# Your M5 code starts here
# Assumes X_train_raw, X_test_raw, y_train_raw, y_test_raw from Chapter 11

from sklearn.ensemble import RandomForestClassifier
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, classification_report

# Build the RF pipeline (no scaling needed)
numeric_features = X_train_raw.select_dtypes(include=[np.number]).columns.tolist()
categorical_features = X_train_raw.select_dtypes(include=['object']).columns.tolist()

rf_preprocessor = ColumnTransformer([
    ('num', 'passthrough', numeric_features),
    ('cat', OneHotEncoder(drop='first', sparse_output=False, handle_unknown='ignore'),
     categorical_features),
])

rf_model = Pipeline([
    ('preprocessor', rf_preprocessor),
    ('classifier', RandomForestClassifier(
        n_estimators=500,
        max_features='sqrt',
        min_samples_leaf=5,
        oob_score=True,
        random_state=42,
        n_jobs=-1,
    ))
])

rf_model.fit(X_train_raw, y_train_raw)

y_prob_m5 = rf_model.predict_proba(X_test_raw)[:, 1]
y_pred_m5 = rf_model.predict(X_test_raw)

print("M5 — Random Forest Results:")
print(f"  OOB Accuracy: {rf_model.named_steps['classifier'].oob_score_:.4f}")
print(f"  Test AUC-ROC: {roc_auc_score(y_test_raw, y_prob_m5):.4f}")
print(f"\n{classification_report(y_test_raw, y_pred_m5, target_names=['Retained', 'Churned'])}")

Step 2: Feature Importance Analysis

from sklearn.inspection import permutation_importance

# Get feature names after preprocessing
feature_names_m5 = (
    numeric_features +
    rf_model.named_steps['preprocessor']
        .named_transformers_['cat']
        .get_feature_names_out(categorical_features).tolist()
)

# Both importance methods
mdi_importance = rf_model.named_steps['classifier'].feature_importances_
perm_result = permutation_importance(
    rf_model, X_test_raw, y_test_raw,
    n_repeats=10, scoring='roc_auc', random_state=42, n_jobs=-1
)

print("Top 10 Features — MDI vs. Permutation:")
print(f"{'Rank':>4}  {'MDI Feature':<30} {'MDI':>6}  {'Perm Feature':<30} {'Perm':>6}")
print("-" * 82)
mdi_order = np.argsort(mdi_importance)[::-1]
perm_order = np.argsort(perm_result.importances_mean)[::-1]
for i in range(10):
    mi = mdi_order[i]
    pi = perm_order[i]
    print(f"{i+1:>4}  {feature_names_m5[mi]:<30} {mdi_importance[mi]:>6.4f}  "
          f"{feature_names_m5[pi]:<30} {perm_result.importances_mean[pi]:>6.4f}")

Step 3: Write the Comparison

In your project notebook, include a table comparing the logistic regression baseline (M4) and the Random Forest (M5) on the same test set. Address:

  • Which model has better AUC? By how much?
  • Which model has better precision? Better recall? Why do they differ?
  • What do the feature importances tell you about what drives churn?
  • Why is the Random Forest better suited to this problem than logistic regression?

Save your results. In Chapter 14, we will see if gradient boosting can beat the Random Forest.


When Random Forests Struggle

Random Forests are excellent general-purpose models, but they are not always the best choice:

  1. Extrapolation. Trees can only predict values they have seen in training. If the test data has target values outside the training range (e.g., predicting future stock prices that exceed historical maximums), a Random Forest will cap predictions at the training extremes. Linear models extrapolate naturally.

  2. Smooth, linear relationships. If the true relationship is y = 3x + 2, logistic regression captures it perfectly. A Random Forest approximates a line with step functions --- it works, but wastes capacity.

  3. Very high-dimensional sparse data. Text data with 50,000 vocabulary features or one-hot encoded datasets with 10,000+ columns can challenge Random Forests because max_features='sqrt' considers only ~220 features per split, and the relevant features may be rare.

  4. Interpretability requirements. A 500-tree forest is fundamentally a black box. You can inspect feature importance, but you cannot trace a single decision path through the ensemble. When regulators or stakeholders need to understand why a specific prediction was made, a single (pruned) tree or logistic regression may be required.

  5. Training time on very large datasets. Growing 500 deep trees on 10 million rows is slow. Gradient boosting methods (Chapter 14) are typically faster because they grow trees sequentially and can stop early.
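Points 1 and 2 are easy to demonstrate together. A sketch on synthetic data with the true relationship y = 3x + 2: the forest's predictions flatten at the edge of the training range, while the linear model keeps following the line.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_in = rng.uniform(0, 10, size=(500, 1))             # training range: [0, 10]
y_in = 3 * X_in.ravel() + 2 + rng.normal(scale=0.5, size=500)

rf_reg = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_in, y_in)
lin_reg = LinearRegression().fit(X_in, y_in)

for x in [5.0, 15.0, 25.0]:                          # 15 and 25 are out of range
    p_rf = rf_reg.predict([[x]])[0]
    p_lin = lin_reg.predict([[x]])[0]
    print(f"x={x:>5}  true={3 * x + 2:>5.1f}  forest={p_rf:>5.1f}  linear={p_lin:>5.1f}")
```

Inside the training range both models are accurate; outside it, the forest keeps predicting roughly max(y_train) no matter how large x grows, because every leaf value is an average of training targets.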


Chapter Summary

This chapter followed a deliberate arc:

  1. Decision trees are intuitive --- they split data into regions using yes/no questions about features, choosing splits that maximize information gain (or minimize Gini impurity).

  2. Unrestricted trees overfit catastrophically --- 100% training accuracy, terrible test performance. The tree memorizes noise instead of learning patterns.

  3. Pruning helps but does not solve the fundamental problem --- a single tree is a high-variance model that changes structure with small data perturbations.

  4. Random Forests fix this with double randomization --- bootstrap sampling gives each tree different data, and feature randomization forces each tree to explore different features. The ensemble average is more stable and accurate than any individual tree.

  5. Feature importance reveals what the forest learned --- but use permutation importance for reliable rankings. Impurity-based importance is fast but biased toward continuous features.

  6. The Random Forest beat the logistic regression baseline on StreamFlow churn (AUC 0.8891 vs. 0.8234), demonstrating the value of capturing non-linear relationships without manual feature engineering.

In Chapter 14, we meet gradient boosting --- an algorithm that builds trees sequentially, each one correcting the errors of the ensemble so far. If Random Forests are a committee of independent experts, gradient boosting is a team of specialists, each focused on the cases the team is getting wrong.


Next: Chapter 14 --- Gradient Boosting: XGBoost, LightGBM, and the Algorithms That Win Competitions