Further Reading: Chapter 14

Gradient Boosting


Foundational Papers

1. "Greedy Function Approximation: A Gradient Boosting Machine" --- Jerome Friedman (2001) The paper that formalized gradient boosting. Friedman showed that boosting could be understood as gradient descent in function space, generalizing AdaBoost to arbitrary differentiable loss functions. Dense but essential: this is where the theory comes from. Published in The Annals of Statistics, Vol. 29, No. 5. The companion paper "Stochastic Gradient Boosting" (1999) introduced subsampling, which significantly improved generalization.
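For squared-error loss, Friedman's "gradient descent in function space" reduces to something concrete: each stage fits a weak learner to the residuals of the current ensemble, because the residual is exactly the negative gradient of the loss. A minimal pure-Python sketch of this loop on toy 1-D data (all function names and values here are illustrative, not from the paper):

```python
# Minimal gradient boosting for squared-error loss on 1-D data.
# With L(y, F) = (y - F)^2 / 2, the negative gradient is the residual
# y - F, so each stage fits a weak learner (here a stump) to residuals.

def fit_stump(x, residuals):
    """Best single-split regression stump (threshold + two leaf means)."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def gradient_boost(x, y, n_stages=50, learning_rate=0.1):
    f0 = sum(y) / len(y)                 # initial constant model
    stumps, preds = [], [f0] * len(y)
    for _ in range(n_stages):
        residuals = [yi - p for yi, p in zip(y, preds)]  # negative gradient
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(xi) for p, xi in zip(preds, x)]
    return lambda xi: f0 + learning_rate * sum(s(xi) for s in stumps)

x = [1, 2, 3, 4, 5, 6]
y = [1.0, 1.2, 0.9, 3.1, 3.0, 3.2]
model = gradient_boost(x, y)
```

The shrinkage factor (learning_rate) makes each stump a small correction, mirroring Friedman's recommendation of small step sizes over many stages.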

2. "XGBoost: A Scalable Tree Boosting System" --- Tianqi Chen and Carlos Guestrin (2016) The paper that launched XGBoost. Chen and Guestrin describe the regularized objective, the weighted quantile sketch for approximate split finding, sparsity-aware splitting, and the system engineering that made XGBoost an order of magnitude faster than previous implementations. Published in KDD 2016. This is one of the most-cited machine learning papers of the decade --- read it to understand why XGBoost won everything.
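The regularized objective yields closed-form formulas worth internalizing before reading the paper: with G and H the sums of first- and second-order gradients of the loss over a node's instances, the optimal leaf weight is -G/(H + lambda), and a split's gain compares the children's scores to the parent's, minus the per-leaf penalty gamma. Sketched directly (function and variable names are mine):

```python
def leaf_weight(G, H, lam):
    """Optimal leaf weight w* = -G / (H + lambda), from the
    second-order Taylor expansion of the regularized objective."""
    return -G / (H + lam)

def split_gain(G_left, H_left, G_right, H_right, lam, gamma):
    """Gain of splitting a node into (left, right); gamma is the
    per-leaf complexity penalty, so gain <= 0 means 'do not split'."""
    def score(G, H):
        return G * G / (H + lam)
    G, H = G_left + G_right, H_left + H_right
    return 0.5 * (score(G_left, H_left) + score(G_right, H_right)
                  - score(G, H)) - gamma
```

The split-finding algorithms in the paper (exact greedy and the quantile sketch) are all searches for the threshold maximizing this gain.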

3. "LightGBM: A Highly Efficient Gradient Boosting Decision Tree" --- Ke et al. (2017) The Microsoft paper introducing histogram-based splitting, Gradient-based One-Side Sampling (GOSS), and Exclusive Feature Bundling (EFB). These three innovations made LightGBM 5-10x faster than XGBoost on large datasets with comparable accuracy. Published in NeurIPS 2017. The paper is well-written and includes clear algorithmic descriptions.
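GOSS is simple to state: keep the instances with the largest gradients, randomly sample a fraction of the rest, and up-weight the sampled ones so gradient sums stay approximately unbiased. A minimal sketch (my own simplification of the paper's algorithm; names are mine):

```python
import random

def goss_sample(gradients, top_rate=0.2, other_rate=0.1, rng=None):
    """Gradient-based One-Side Sampling, simplified. Keep the top_rate
    fraction with the largest |gradient|, sample other_rate of the
    rest, and up-weight the sampled small-gradient instances by
    (1 - top_rate) / other_rate to compensate for the sampling."""
    rng = rng or random.Random(0)
    n = len(gradients)
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)
    n_top = int(n * top_rate)
    top, rest = order[:n_top], order[n_top:]
    sampled = rng.sample(rest, int(n * other_rate))
    amplify = (1.0 - top_rate) / other_rate
    weights = {i: 1.0 for i in top}
    weights.update({i: amplify for i in sampled})
    return weights  # instance index -> weight used when building the tree
```

The intuition: large-gradient instances are under-trained and must be kept; small-gradient instances are well-fit and can be subsampled cheaply.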

4. "CatBoost: Unbiased Boosting with Categorical Features" --- Prokhorenkova et al. (2018) The Yandex paper introducing ordered boosting and ordered target statistics for categorical features. The key insight is that standard gradient boosting suffers from "prediction shift" (using the same data for computing residuals and fitting trees creates a subtle form of overfitting), and ordered boosting eliminates it. Published in NeurIPS 2018. The categorical handling methodology is the most novel contribution.
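A simplified sketch of ordered target statistics (my own minimal version, not CatBoost's implementation): each example's category is encoded using only the targets of examples that precede it in a random permutation, so the encoding never sees the example's own target, which is what removes the leakage behind prediction shift:

```python
import random

def ordered_target_stats(categories, targets, prior=0.5, seed=0):
    """Ordered target statistics, simplified: encode each example's
    category using only examples that appear earlier in a random
    permutation. encoding = (sum_of_prior_targets + prior) / (count + 1)."""
    n = len(categories)
    perm = list(range(n))
    random.Random(seed).shuffle(perm)
    sums, counts = {}, {}
    encoded = [0.0] * n
    for i in perm:
        c = categories[i]
        encoded[i] = (sums.get(c, 0.0) + prior) / (counts.get(c, 0) + 1)
        sums[c] = sums.get(c, 0.0) + targets[i]
        counts[c] = counts.get(c, 0) + 1
    return encoded
```

A naive target encoding would use the full dataset, including each example's own label; the ordered version above trades some variance (early examples get noisier encodings) for unbiasedness, and CatBoost averages over several permutations to recover it.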


Practical Guides and Tutorials

5. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow --- Aurélien Géron (3rd edition, 2022) Chapter 7 covers ensemble methods including gradient boosting, with clear explanations of AdaBoost and gradient boosting from first principles, followed by practical XGBoost examples. Géron's visual explanation of residual fitting --- showing how each tree corrects the previous tree's errors --- is one of the best available. The code examples use scikit-learn's API throughout.

6. XGBoost Official Documentation --- "Introduction to Boosted Trees" The XGBoost docs include a tutorial that walks through the mathematics of the regularized objective, the connection between gradient boosting and Newton's method, and the split-finding algorithm. More mathematical than most documentation but extremely clear. Available at xgboost.readthedocs.io. The "Parameters" and "XGBoost Tutorials" sections are essential for practitioners.

7. LightGBM Official Documentation --- "Parameters" and "Advanced Topics" LightGBM's documentation covers the critical differences from XGBoost: leaf-wise growth, histogram-based splitting, and native categorical handling. The "Parameters Tuning" guide is particularly useful --- it provides a recommended tuning order that matches production best practices. Available at lightgbm.readthedocs.io.
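To make the leaf-wise point concrete, here is an illustrative starting configuration; the parameter names are real LightGBM parameters, but the values are my own rough defaults, not prescriptions from the documentation:

```python
# Illustrative LightGBM starting parameters. Values are placeholders
# for discussion, not tuned recommendations.
params = {
    # Leaf-wise growth: num_leaves is the primary complexity control.
    "num_leaves": 31,
    # Safety cap; leaf-wise trees can otherwise grow very deep.
    "max_depth": 8,
    "learning_rate": 0.05,
    # Regularization knobs, usually tuned after the tree structure.
    "min_data_in_leaf": 20,
    "feature_fraction": 0.8,
    "bagging_fraction": 0.8,
    "bagging_freq": 1,
    "lambda_l1": 0.0,
    "lambda_l2": 0.0,
}

# Rule of thumb from the docs: keep num_leaves well below
# 2**max_depth, the leaf count of a full depth-wise tree.
assert params["num_leaves"] < 2 ** params["max_depth"]
```

This is the key mental shift coming from XGBoost: max_depth is the main lever there, while in LightGBM it is only a backstop for num_leaves.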

8. CatBoost Official Documentation --- "Training Parameters" and "Categorical Features" CatBoost's docs include the most thorough explanation of ordered target statistics and ordered boosting available outside the original paper. The "Tutorials" section includes Jupyter notebooks comparing CatBoost to XGBoost and LightGBM on standard benchmarks. Available at catboost.ai.


Deeper Theory

9. The Elements of Statistical Learning --- Hastie, Tibshirani, Friedman (2nd edition, 2009) Chapter 10 ("Boosting and Additive Trees") provides the rigorous mathematical treatment of boosting, from AdaBoost through gradient boosting to stochastic gradient boosting. This is where you go to understand why gradient boosting works, not just how. Includes the connection to forward stagewise additive modeling and the statistical interpretation of boosting as fitting an additive model in a greedy fashion. Free PDF at the authors' website.
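The forward stagewise view ESL develops can be stated in one line: at each stage, a new basis function is added greedily, without re-fitting the earlier terms (sketched here in ESL's notation):

```latex
% Forward stagewise additive modeling (ESL Ch. 10 notation):
% stage m adds the single best basis function, holding f_{m-1} fixed.
(\beta_m, \gamma_m) = \arg\min_{\beta,\, \gamma}
    \sum_{i=1}^{N} L\bigl(y_i,\; f_{m-1}(x_i) + \beta\, b(x_i; \gamma)\bigr),
\qquad
f_m(x) = f_{m-1}(x) + \beta_m\, b(x; \gamma_m)
```

Gradient boosting is the special case where the inner minimization is approximated by fitting a tree to the negative gradient of L at the current predictions.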

10. An Introduction to Statistical Learning (ISLR) --- James, Witten, Hastie, Tibshirani (2nd edition, 2021) Chapter 8.2 covers boosting at a more accessible level than ESL, with good intuitive explanations and lab exercises. The discussion of the interaction depth parameter (number of splits per tree) and its connection to variable interaction order is particularly illuminating. The Python edition (ISLP, 2023) includes updated code. Free at statlearning.com.

11. "A Short Introduction to Boosting" --- Freund and Schapire (1999) An accessible overview of AdaBoost by its inventors (the algorithm itself was introduced in their 1997 paper "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting"). While gradient boosting has superseded AdaBoost in practice, understanding AdaBoost's reweighting scheme provides essential intuition for why sequential ensembles work. Published in the Journal of Japanese Society for Artificial Intelligence, Vol. 14, No. 5.
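That reweighting scheme is compact enough to sketch: with weak-learner error err, AdaBoost gives the learner a vote weight alpha = 0.5 * ln((1 - err) / err), multiplies each misclassified example's weight by e^alpha (and correct ones by e^-alpha), then renormalizes. A simplified sketch, with names of my choosing:

```python
import math

def adaboost_reweight(weights, correct, eps=1e-12):
    """One AdaBoost round of weight updates (sketch).
    weights: current example weights summing to 1; correct: whether
    the weak learner classified each example correctly."""
    err = sum(w for w, c in zip(weights, correct) if not c)
    err = min(max(err, eps), 1 - eps)          # guard degenerate learners
    alpha = 0.5 * math.log((1 - err) / err)    # weak learner's vote weight
    new = [w * math.exp(-alpha if c else alpha)
           for w, c in zip(weights, correct)]
    z = sum(new)                               # renormalize to sum to 1
    return [w / z for w in new], alpha
```

A classic consequence: after the update, the misclassified examples always hold exactly half the total weight, which is why the next weak learner is forced to attend to them.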


Hyperparameter Tuning and Practical Advice

12. "Complete Guide to Parameter Tuning in XGBoost" --- Aarshay Jain (Analytics Vidhya) A widely referenced blog post that provides a step-by-step tuning procedure for XGBoost: fix a moderately high learning rate, tune the tree parameters, tune regularization, then lower the learning rate while increasing the number of trees. Despite being a blog post, the methodology is sound and matches what experienced practitioners do. Available on Analytics Vidhya.
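The staged procedure amounts to a greedy stage-wise search: tune each group of parameters in order, holding earlier winners fixed. Everything below (the tune_stagewise helper, the toy fake_eval objective, the grids) is my own illustration of that methodology, not code from the guide:

```python
from itertools import product

def tune_stagewise(evaluate, base, stages):
    """Greedy stage-wise tuning: exhaust each stage's grid in order,
    keeping the best combination before moving to the next stage.
    `evaluate` maps a full param dict to a score (higher is better)."""
    params = dict(base)
    for grid in stages:
        names = list(grid)
        best_score, best_combo = None, None
        for combo in product(*(grid[n] for n in names)):
            trial = {**params, **dict(zip(names, combo))}
            score = evaluate(trial)
            if best_score is None or score > best_score:
                best_score, best_combo = score, combo
        params.update(dict(zip(names, best_combo)))
    return params

# Toy stand-in for cross-validated model quality (peaks at
# max_depth=5, min_child_weight=3, reg_lambda=1.0 by construction).
def fake_eval(p):
    return (-(p["max_depth"] - 5) ** 2
            - (p["min_child_weight"] - 3) ** 2
            - 10 * (p["reg_lambda"] - 1.0) ** 2)

base = {"learning_rate": 0.1, "max_depth": 6,
        "min_child_weight": 1, "reg_lambda": 0.0}
stages = [
    {"max_depth": [3, 5, 7], "min_child_weight": [1, 3, 5]},  # tree shape
    {"reg_lambda": [0.0, 1.0, 5.0]},                          # regularization
]
best = tune_stagewise(fake_eval, base, stages)
```

In practice the final step is a separate pass: drop learning_rate and raise the number of trees, re-validating with early stopping rather than a grid.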

13. "Laurae's XGBoost/LightGBM Parameters Mapping" A community-maintained cross-reference table mapping equivalent parameters between XGBoost, LightGBM, and CatBoost. Essential when switching between libraries or comparing configurations. The table covers not just parameter names but default values and semantic differences. Available on GitHub (search "xgboost lightgbm parameters").

14. Optuna Documentation --- "LightGBM Tuner" Optuna (a Bayesian hyperparameter optimization framework) includes a built-in LightGBM tuner that automatically searches over the most impactful hyperparameters using the tree-structured Parzen estimator (TPE). The documentation explains both the algorithm and the practical usage. A good introduction to Bayesian optimization for gradient boosting. Available at optuna.readthedocs.io.


Benchmarks and Comparisons

15. "An Empirical Evaluation of Gradient Boosting on Tabular Data" --- Various benchmark papers Several papers and blog posts systematically compare XGBoost, LightGBM, CatBoost, and (more recently) deep learning approaches on tabular data. The consistent finding: gradient boosting dominates tabular data, the three libraries perform within 0.5% of each other on most datasets, and deep learning alternatives (TabNet, FT-Transformer) occasionally match but do not consistently beat gradient boosting. Search for "tabular data benchmark" on arXiv for the latest comparisons.

16. "Why Do Tree-Based Models Still Outperform Deep Learning on Tabular Data?" --- Grinsztajn, Oyallon, and Varoquaux (2022) A rigorous benchmark paper from NeurIPS 2022 that demonstrates why gradient boosting trees remain competitive with (and often superior to) deep learning on tabular data. The paper identifies specific data characteristics (irregular feature distributions, uninformative features, feature interactions) that favor tree-based methods. Essential reading for anyone considering deep learning as an alternative to gradient boosting for tabular problems.


Advanced Topics

17. "DART: Dropouts meet Multiple Additive Regression Trees" --- Rashmi and Gilad-Bachrach (2015) The paper introducing dropout for gradient boosting. DART addresses the "over-specialization" problem where later trees in the ensemble make increasingly marginal contributions because early trees dominate. Available in AISTATS 2015 proceedings.
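A sketch of the dropout step, under my reading of the paper: drop a random subset of existing trees, fit the next tree to the residuals of what remains, then rescale so the ensemble's total contribution is preserved. The helper below and its normalization factors follow the paper's tree normalization; all names are mine:

```python
import random

def dart_round(tree_preds, y, drop_rate=0.1, rng=None):
    """One DART round, simplified: pick trees to drop, compute the
    residual target while ignoring them, and return the scale factors
    applied afterwards (with k dropped trees, the new tree is scaled
    by 1/(k+1) and each dropped tree by k/(k+1))."""
    rng = rng or random.Random(0)
    dropped = [i for i in range(len(tree_preds)) if rng.random() < drop_rate]
    if not dropped:
        # Force at least one drop (a simplification; implementations
        # expose this as an option, e.g. XGBoost's one_drop).
        dropped = [rng.randrange(len(tree_preds))]
    kept = [p for i, p in enumerate(tree_preds) if i not in dropped]
    partial = [sum(col) for col in zip(*kept)] if kept else [0.0] * len(y)
    residuals = [yi - pi for yi, pi in zip(y, partial)]
    k = len(dropped)
    return dropped, residuals, 1.0 / (k + 1), k / (k + 1)
```

Because the new tree is fit against an ensemble missing k trees, it overshoots; the 1/(k+1) and k/(k+1) factors shrink the new and dropped trees so their combined contribution matches what a single ordinary boosting step would add.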

18. "NGBoost: Natural Gradient Boosting for Probabilistic Prediction" --- Duan et al. (2020) NGBoost uses the natural gradient instead of the ordinary gradient, enabling probabilistic predictions (uncertainty quantification) from gradient boosting. Useful when you need not just a prediction but a confidence interval. Published in ICML 2020.


Video Resources

19. StatQuest with Josh Starmer --- "Gradient Boost" (Parts 1-4) A four-part video series covering gradient boosting from the mathematical foundations through regression and classification to XGBoost's specific innovations. Starmer's visual, step-by-step approach is the best entry point for learners who find the papers dense. Each video is 10-15 minutes. Available on YouTube.

20. Machine Learning University (MLU) --- "Gradient Boosting" (Amazon) Amazon's MLU course includes a free gradient boosting module with video lectures, slides, and Jupyter notebooks. The treatment is more production-oriented than academic, covering topics like feature importance reliability and deployment considerations. Available on YouTube and the MLU-Explain website.


How to Use This List

If you read nothing else, read the Friedman paper (item 1) for theory and the XGBoost paper (item 2) for the system that made gradient boosting practical. Together they take about 4 hours and give you a complete understanding of why gradient boosting works and how it is implemented.

If you prefer video, start with StatQuest (item 19) for intuition, then read the library documentation (items 6-8) for your chosen implementation.

If you want to tune better, read the Analytics Vidhya tuning guide (item 12) for the methodology and the Optuna docs (item 14) for automated search.

If you are deciding whether to invest in deep learning for tabular data, read Grinsztajn et al. (item 16) first. The answer, in 2026, is almost always "not yet."


This reading list supports Chapter 14: Gradient Boosting. Return to the chapter to review concepts before diving in.