Key Takeaways: Chapter 13
Tree-Based Methods
- Decision trees split data by asking yes/no questions about features, choosing each split to maximize information gain. At every node, the algorithm evaluates all features and all thresholds, then selects the one that most reduces impurity (measured by Gini or entropy). The result is a flowchart that partitions the feature space into rectangular regions, each with a prediction.
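The yes/no questions a fitted tree asks can be inspected directly. A minimal sketch on synthetic data; the feature names ("tenure", "hours") are illustrative, not the chapter's dataset:

```python
# Minimal sketch on synthetic data: fit a shallow tree and print the yes/no
# questions it learned. Feature names ("tenure", "hours") are illustrative.
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))              # columns: "tenure", "hours"
y = (X[:, 0] > 0).astype(int)              # label depends on tenure only

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["tenure", "hours"]))
```

`export_text` prints the flowchart as indented threshold comparisons, one rectangular region per leaf.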
- Gini impurity and entropy measure the same thing --- how mixed a node's class distribution is --- and produce nearly identical trees in practice. Gini = 1 - sum(p_i^2), Entropy = -sum(p_i * log2(p_i)). Both equal zero for a pure node and are maximized for a 50/50 split. Scikit-learn defaults to Gini because it is slightly faster (no logarithm).
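The two formulas can be checked directly in plain Python, with no assumptions beyond the definitions in the text:

```python
# The impurity formulas from the text: Gini = 1 - sum(p_i^2),
# entropy = -sum(p_i * log2(p_i)). Zero for a pure node, maximal at 50/50.
import math

def gini(ps):
    return 1.0 - sum(p * p for p in ps)

def entropy(ps):
    return -sum(p * math.log2(p) for p in ps if p > 0)

print(gini([0.5, 0.5]), entropy([0.5, 0.5]))  # two-class maximum: 0.5 and 1.0
print(gini([1.0, 0.0]), entropy([1.0, 0.0]))  # pure node: both zero
```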
- An unrestricted decision tree will achieve 100% training accuracy and fail on test data. This is the most important lesson of the chapter. The tree memorizes every training sample by growing dozens of levels and thousands of leaves. This is not learning --- it is memorization. The gap between training and test performance is the signature of overfitting.
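A sketch of that gap on synthetic data (not the chapter's StreamFlow dataset; the noise level is arbitrary). The unrestricted tree hits 100% on training data; a depth-limited tree usually generalizes better:

```python
# Unrestricted tree vs. a depth-limited one on noisy synthetic data.
# The unrestricted tree memorizes the training set; its test score drops.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] + rng.normal(size=1000) > 0).astype(int)   # label with built-in noise
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

full = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(max_depth=5, random_state=42).fit(X_tr, y_tr)

print("full  ", full.score(X_tr, y_tr), full.score(X_te, y_te))
print("pruned", pruned.score(X_tr, y_tr), pruned.score(X_te, y_te))
```

Because the labels contain irreducible noise, the full tree's perfect training score is memorization by construction.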
- Pruning limits tree complexity to prevent overfitting. max_depth is the single most important parameter --- start at 5 or 6 and tune from there. min_samples_split and min_samples_leaf provide additional control. Post-pruning via ccp_alpha grows the full tree first, then removes branches that do not help. Any form of pruning outperforms an unrestricted tree on test data.
- Single trees are unstable: small changes in training data produce different tree structures. This high variance means you cannot trust the specific logic of any one tree. The first split might use tenure in one bootstrap sample and hours watched in another. This instability is a deeper problem than overfitting and is the fundamental motivation for ensembles.
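The instability is easy to reproduce. A sketch on synthetic data: two features carry the same signal, so the root split typically flips between bootstrap samples:

```python
# Refit the same unrestricted tree on different bootstrap samples and
# inspect which feature each tree splits on first. With two equally
# predictive features (0 and 1), the root choice usually varies.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3))                    # features 0 and 1 matter, 2 is noise
y = (X[:, 0] + X[:, 1] > 0).astype(int)

roots = []
for seed in range(20):
    idx = np.random.default_rng(seed).integers(0, len(X), len(X))  # bootstrap sample
    tree = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
    roots.append(int(tree.tree_.feature[0]))     # feature index at the root node
print(roots)
```

Each entry is one tree's "first question"; any variation across the list is the high variance the text describes.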
- Random Forests fix instability with double randomization: bootstrap sampling and feature randomization. Each tree trains on a ~63.2% bootstrap sample (different data). At each split, only a random subset of features is considered (different features). This forces trees to explore different patterns, decorrelating their predictions. The ensemble average is more stable and accurate than any individual tree.
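In scikit-learn the two sources of randomness map directly to constructor arguments. A sketch on synthetic data, with 500 trees as in the text:

```python
# Double randomization: bootstrap=True gives each tree different data,
# max_features="sqrt" gives each split a random feature subset.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

forest = RandomForestClassifier(
    n_estimators=500,       # the text's safe default
    max_features="sqrt",    # random feature subset per split
    bootstrap=True,         # ~63.2% of unique samples per tree
    random_state=0,
).fit(X, y)
print(len(forest.estimators_))
```

`forest.estimators_` holds the 500 decorrelated trees whose votes are averaged at prediction time.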
- Adding more trees to a Random Forest never causes overfitting. Unlike increasing tree depth (which memorizes noise), adding more trees reduces the variance of the ensemble average. Performance rises with more trees and then plateaus. 500 trees is a safe default. The only cost of more trees is training time and memory.
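One convenient way to add trees without refitting from scratch is `warm_start`, a standard scikit-learn pattern (sketched here on synthetic data):

```python
# Grow an existing forest: with warm_start=True, raising n_estimators and
# calling fit() again adds trees while keeping the ones already trained.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=50, warm_start=True, random_state=0)
forest.fit(X, y)
forest.n_estimators = 300        # request 250 additional trees
forest.fit(X, y)                 # only the new trees are trained
print(len(forest.estimators_))
```

This makes it cheap to trace the rise-then-plateau curve the text describes: score the forest at several sizes as it grows.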
- Out-of-bag (OOB) error provides a free estimate of generalization performance. Each training sample is excluded from ~36.8% of trees. Predicting each sample using only the trees that excluded it gives an accuracy estimate similar to cross-validation --- without the computational cost of retraining.
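In scikit-learn this is a single flag (synthetic data again):

```python
# oob_score=True: each sample is predicted by the ~36.8% of trees that did
# not see it during training, giving a free generalization estimate.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(600, 6))
y = (X[:, 0] - X[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=200, oob_score=True,
                                random_state=1).fit(X, y)
print(forest.oob_score_)   # accuracy estimate without a held-out set
```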
- Impurity-based feature importance is fast but biased toward high-cardinality features. A continuous feature with 1,000 unique values has more chances to split than a binary feature, inflating its apparent importance regardless of true predictive power. Use impurity-based importance for quick exploration only.
- Permutation-based feature importance is slower but reliable. It measures the actual drop in model performance when a feature's relationship to the target is destroyed by shuffling. When impurity-based and permutation-based rankings disagree, trust permutation importance --- especially for stakeholder-facing reports.
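Both measures side by side, on synthetic data where only the first feature carries signal, so the two methods should agree on the top rank here:

```python
# Impurity-based importances come for free from training; permutation
# importances measure the real score drop when a feature is shuffled.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] > 0).astype(int)        # only feature 0 carries signal

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
perm = permutation_importance(forest, X, y, n_repeats=10, random_state=0)

print(forest.feature_importances_)   # fast, impurity-based, cardinality-biased
print(perm.importances_mean)         # slower, reliable; trust this on conflict
```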
- Trees do not require feature scaling. Splits are threshold comparisons that depend on ordering, not magnitude. A feature measured in milliseconds and a feature measured in millions of dollars are treated identically. Applying StandardScaler, MinMaxScaler, or even log transforms does not change a tree's predictions. This is a genuine practical advantage --- one less preprocessing step to get wrong.
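A quick sanity check of that claim (a sketch; standardization is one monotone rescaling, and any other would behave the same):

```python
# Scale-invariance check: identical predictions before and after
# standardization, because splits compare ordering, not magnitude.
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 3)) * np.array([1e-3, 1.0, 1e6])  # wildly mixed scales
y = (X[:, 1] > 0).astype(int)

raw = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)
X_scaled = StandardScaler().fit_transform(X)
scaled = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_scaled, y)

print((raw.predict(X) == scaled.predict(X_scaled)).all())  # identical predictions
```

The split gains depend only on how samples partition, and monotone rescaling preserves every ordering, so the two trees make the same decisions.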
- The Random Forest beat the logistic regression baseline on StreamFlow churn by 6.6 AUC points (0.889 vs. 0.823). The forest captures non-linear interactions --- contract type interacting with tenure, sessions interacting with days since login --- that logistic regression cannot model without manual feature engineering. This is the core advantage of tree-based methods on tabular data.
If You Remember One Thing
A single decision tree memorizes. A Random Forest learns. The double randomization of Random Forests --- bootstrap sampling for different data, feature randomization for different splits --- forces each tree to discover different patterns, and the ensemble average suppresses the noise that any one tree would overfit to. Trees are intuitive, but never trust a single tree's story. Trust the forest's consensus.
These takeaways summarize Chapter 13: Tree-Based Methods. Return to the chapter for full context.