Further Reading: Chapter 13
Tree-Based Methods
Foundational Papers
1. "Classification and Regression Trees" --- Leo Breiman, Jerome Friedman, Charles Stone, Richard Olshen (1984) The book that started it all, universally known as CART. Breiman and colleagues formalized the decision tree algorithm, introduced Gini impurity as a splitting criterion, and developed cost-complexity pruning. Dense and mathematical by modern standards, but the first four chapters remain the clearest explanation of how and why trees work. Out of print but available through academic libraries.
2. "Bagging Predictors" --- Leo Breiman (1996) The paper that introduced bootstrap aggregating. Breiman showed that averaging predictions from models trained on bootstrap samples dramatically reduces variance for unstable estimators like decision trees. The key insight --- that instability is not a flaw to be fixed but a property to be exploited --- laid the groundwork for Random Forests. Published in Machine Learning, Vol. 24, No. 2.
3. "Random Forests" --- Leo Breiman (2001) The paper that introduced the Random Forest algorithm by adding feature randomization to bagging. Breiman demonstrated that the combination of bootstrap sampling and random feature subsets decorrelates trees, reducing generalization error. Also introduced OOB error as a free validation estimate and variable importance measures. Published in Machine Learning, Vol. 45, No. 1. One of the most cited papers in machine learning.
Textbooks
4. An Introduction to Statistical Learning (ISLR) --- James, Witten, Hastie, Tibshirani (2nd edition, 2021) Chapter 8 covers tree-based methods: decision trees, bagging, Random Forests, and boosting. The explanations are exceptionally clear, with geometric intuitions and accessible mathematics. The Python edition (ISLP, 2023) includes implementation labs. Free PDF at statlearning.com. This is the single best starting point for understanding tree methods at the intermediate level.
5. The Elements of Statistical Learning (ESL) --- Hastie, Tibshirani, Friedman (2nd edition, 2009) Chapters 9 and 15 provide deeper mathematical treatments of trees and Random Forests. Chapter 15 includes the variance reduction proof for Random Forests and analysis of the decorrelation effect of feature randomization; the key identity is reproduced after this section. More rigorous than ISLR --- read this if you want to understand the theoretical guarantees behind ensemble methods. Free PDF at the authors' website.
6. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow --- Aurélien Géron (3rd edition, 2022) Chapter 6 covers decision trees and Chapter 7 covers ensemble methods including Random Forests. Géron's strength is practical implementation: the code examples are production-ready and the visualizations of tree splits, decision boundaries, and feature importance are excellent. Recommended as a companion to this chapter for hands-on practice.
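The variance-reduction result item 5 refers to comes down to one identity, stated here as in ESL Chapter 15: for B identically distributed trees T_b(x), each with variance sigma^2 and pairwise correlation rho, the variance of their average is

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{B}\,\sigma^{2}.
```

As B grows the second term vanishes, so the floor on ensemble variance is set by rho; feature randomization attacks rho directly, which is why the forest improves on plain bagging.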
Practical Guides
7. scikit-learn User Guide --- "Decision Trees" and "Ensemble Methods"
The official scikit-learn documentation for DecisionTreeClassifier, RandomForestClassifier, and related classes. Includes mathematical formulations, parameter descriptions, practical tips, and complexity analysis. The "Tips on Practical Use" sections are particularly valuable; a short sketch applying a few of those tips follows this section. The page on feature importance includes a clear warning about the bias of impurity-based importance. Available at scikit-learn.org.
8. "Understanding Random Forests: From Theory to Practice" --- Gilles Louppe (PhD thesis, 2014) A thorough and readable treatment of Random Forest theory, including the bias-variance decomposition of ensemble predictions, the effect of randomization on tree correlation, and the properties of variable importance measures. Chapter 6 on variable importances is the best available analysis of when and why impurity-based importance is biased. Freely available online.
Feature Importance Deep Dives
9. "Beware Default Random Forest Importances" --- Terence Parr, Kerem Turgutlu, Christopher Csiszar, Jeremy Howard (2018) A landmark blog post and companion paper demonstrating that scikit-learn's default (impurity-based) feature importance is unreliable. The authors show that random noise features can rank higher than true predictors if they have more unique values. They recommend permutation importance as the standard method. Available at explained.ai.
10. "Permutation Importance vs. Random Forest Feature Importance (MDI)" --- scikit-learn documentation example A worked example in the scikit-learn documentation comparing impurity-based and permutation-based importance on a dataset with correlated features and a random noise feature. Shows that MDI assigns non-zero importance to the random feature while permutation importance correctly assigns it zero. The clearest practical demonstration of why permutation importance is preferred. Available in the scikit-learn examples gallery.
11. "Conditional Variable Importance for Random Forests" --- Strobl, Boulesteix, Kneib, Augustin, Zeileis (2008) A paper addressing the bias of both impurity-based and standard permutation importance when features are correlated. The authors propose conditional permutation importance, which permutes a feature's values conditional on the values of correlated features. Important for understanding the limitations of standard importance measures on real-world datasets with correlated predictors. Published in BMC Bioinformatics, Vol. 9, No. 1.
Extensions and Related Methods
12. "Extremely Randomized Trees" --- Geurts, Ernst, Wehenkel (2006)
Introduces Extra-Trees, which add even more randomization than Random Forests: instead of searching for the optimal threshold for each candidate feature at each split, Extra-Trees pick thresholds at random. This further decorrelates the trees and can improve generalization, especially on noisy data. Scikit-learn implements this as ExtraTreesClassifier (and ExtraTreesRegressor); the swap is shown in the sketch after this section. Published in Machine Learning, Vol. 63, No. 1.
13. "A Unified Approach to Interpreting Model Predictions" --- Lundberg and Lee (2017) The paper introducing SHAP (SHapley Additive exPlanations), which provides consistent, locally faithful feature attribution for any model. TreeSHAP is a fast algorithm for computing exact SHAP values for tree-based models. If you want to explain individual Random Forest predictions (not just global importance), SHAP is the standard tool. Published in NeurIPS 2017.
Video and Multimedia
14. StatQuest with Josh Starmer --- "Decision Trees" and "Random Forests" A series of short (10-15 minute) videos covering decision tree splitting, Gini impurity, entropy, information gain, bagging, and Random Forests. Starmer builds each concept step by step with clear visuals and avoids unnecessary mathematical formalism. Recommended as a first-pass explanation or review. Available on YouTube.
15. MIT OpenCourseWare --- 6.034 Artificial Intelligence, Lecture 11: "Learning: Identification Trees, Disorder" Patrick Winston's lecture on decision trees builds the algorithm from the ground up using information-theoretic foundations. The lecture style is Socratic, and the pacing lets the ideas breathe. Particularly good on the intuition behind entropy as a measure of "disorder" in the data. Available on YouTube.
How to Use This List
If you read nothing else, read ISLR Chapter 8 (item 4) and the Parr et al. blog post on feature importance (item 9). Together they take about 3 hours: ISLR gives you the theory of trees and ensembles with clear visuals, and the Parr et al. post shows you why the default importance ranking in scikit-learn can mislead you.
If you want the mathematical foundations, start with Breiman's 2001 Random Forests paper (item 3) and then read Louppe's thesis Chapter 6 (item 8) for the importance analysis.
If you want practical implementation guidance, the scikit-learn documentation (item 7) and Geron's textbook (item 6) are the best starting points. Both include runnable code examples.
If you want to explain individual predictions (not just global importance), read the SHAP paper (item 13) and install the shap library. We will cover SHAP in detail in Chapter 19.
This reading list supports Chapter 13: Tree-Based Methods. Return to the chapter to review concepts before diving in.