Chapter 6: Further Reading
Foundational Texts
Linear Models and Regression
- Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Chapters 3 (Linear Methods for Regression) and 4 (Linear Methods for Classification) are the definitive reference for the mathematical treatment of linear and logistic regression, Ridge, Lasso, and Elastic Net. Freely available at https://hastie.su.domains/ElemStatLearn/.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 4 (Linear Models for Classification) provides an excellent probabilistic perspective on logistic regression and connects it to generative classifiers. Chapter 3 covers Bayesian linear regression.
- James, G., Witten, D., Hastie, T., and Tibshirani, R. (2023). An Introduction to Statistical Learning, 2nd ed. Springer. A more accessible treatment of regression and classification, with practical R and Python labs. Freely available at https://www.statlearning.com/.
Support Vector Machines
- Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels. MIT Press. The comprehensive treatment of SVMs, kernel methods, and their mathematical foundations, including the reproducing kernel Hilbert space (RKHS) framework.
- Burges, C. J. C. (1998). "A Tutorial on Support Vector Machines for Pattern Recognition." Data Mining and Knowledge Discovery, 2(2), 121--167. A clear, well-written tutorial that covers the SVM formulation, the kernel trick, and soft margins.
Tree-Based and Ensemble Methods
- Breiman, L. (2001). "Random Forests." Machine Learning, 45(1), 5--32. The seminal paper introducing random forests, including the out-of-bag error estimate and variable importance measures.
- Friedman, J. H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine." Annals of Statistics, 29(5), 1189--1232. The foundational paper on gradient boosting. Dense but essential for understanding the algorithm's theoretical underpinnings.
- Chen, T. and Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. The paper describing XGBoost's algorithmic innovations, including regularized objectives, sparsity-aware split finding, and cache-aware access patterns.
- Ke, G., Meng, Q., Finley, T., et al. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." Advances in Neural Information Processing Systems, 30. Introduces histogram-based split finding and Gradient-based One-Side Sampling (GOSS) for faster training.
- Prokhorenkova, L., Gusev, G., Vorobev, A., et al. (2018). "CatBoost: Unbiased Boosting with Categorical Features." Advances in Neural Information Processing Systems, 31. Introduces ordered boosting and an efficient method for handling categorical features natively.
Key Papers
Bias-Variance Tradeoff
- Geman, S., Bienenstock, E., and Doursat, R. (1992). "Neural Networks and the Bias/Variance Dilemma." Neural Computation, 4(1), 1--58. The classic formalization of the bias-variance decomposition and its implications for model selection.
- Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). "Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off." Proceedings of the National Academy of Sciences, 116(32), 15849--15854. Introduces the double descent phenomenon, showing that the classical U-shaped test error curve is incomplete for overparameterized models.
Regularization
- Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society, Series B, 58(1), 267--288. The original Lasso paper, introducing L1 regularization for simultaneous estimation and feature selection.
- Hoerl, A. E. and Kennard, R. W. (1970). "Ridge Regression: Biased Estimation for Nonorthogonal Problems." Technometrics, 12(1), 55--67. The foundational Ridge regression paper.
- Zou, H. and Hastie, T. (2005). "Regularization and Variable Selection via the Elastic Net." Journal of the Royal Statistical Society, Series B, 67(2), 301--320. Introduces the Elastic Net, combining L1 and L2 penalties.
Model Evaluation
- Raschka, S. (2018). "Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning." arXiv:1811.12808. A comprehensive survey of evaluation methodologies, including nested cross-validation and statistical comparison tests.
Online Resources and Tutorials
- scikit-learn User Guide: https://scikit-learn.org/stable/supervised_learning.html --- Excellent documentation for all supervised learning algorithms, including mathematical descriptions, code examples, and practical tips.
- StatQuest (YouTube): Josh Starmer's videos on linear regression, logistic regression, SVMs, decision trees, random forests, and gradient boosting are exceptionally clear and intuitive. Recommended for building visual intuition.
- Google's Machine Learning Crash Course: https://developers.google.com/machine-learning/crash-course --- A free, practical introduction to supervised learning concepts with interactive exercises.
- Kaggle Learn: https://www.kaggle.com/learn --- Hands-on micro-courses on machine learning, including "Intro to Machine Learning" and "Intermediate Machine Learning" that cover many topics from this chapter.
- Sebastian Raschka's "Machine Learning FAQ": https://sebastianraschka.com/faq/ --- Concise answers to common questions about supervised learning algorithms, evaluation, and best practices.
Software Libraries
- scikit-learn (sklearn): The primary library used throughout this chapter. Provides LinearRegression, Ridge, Lasso, ElasticNet, LogisticRegression, SVC, DecisionTreeClassifier, RandomForestClassifier, GradientBoostingClassifier, and all evaluation metrics.
- XGBoost (xgboost): The most popular gradient boosting library. Install with pip install xgboost. Provides XGBClassifier and XGBRegressor with a scikit-learn-compatible API; a minimal sketch of this shared interface appears after this list.
- LightGBM (lightgbm): Microsoft's gradient boosting library, often faster than XGBoost for large datasets. Install with pip install lightgbm.
- CatBoost (catboost): Yandex's gradient boosting library with native categorical feature handling. Install with pip install catboost.
- statsmodels (statsmodels): Provides OLS regression with detailed statistical summaries (p-values, confidence intervals, diagnostic tests). Useful for statistical analysis alongside scikit-learn's machine learning focus. Install with pip install statsmodels; a short usage sketch also follows this list.
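The estimators listed above all expose the same fit/predict interface. As a rough illustration, the sketch below trains a scikit-learn random forest and an XGBoost classifier interchangeably; it assumes scikit-learn and xgboost are installed, and the synthetic data and hyperparameters are placeholders rather than recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# Synthetic binary classification data stand in for a real dataset.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Both estimators share the fit/predict interface, so they can be swapped
# inside pipelines, cross-validation, and grid searches without other changes.
for model in (
    RandomForestClassifier(n_estimators=200, random_state=0),
    XGBClassifier(n_estimators=200, learning_rate=0.1),
):
    model.fit(X_train, y_train)
    print(type(model).__name__, accuracy_score(y_test, model.predict(X_test)))
```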
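For the statsmodels entry, the following minimal sketch (again on synthetic data with arbitrary coefficients) shows the inference-oriented output that scikit-learn's LinearRegression does not report.

```python
import numpy as np
import statsmodels.api as sm

# Synthetic data with known coefficients, purely for illustration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = 3.0 + 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

X_design = sm.add_constant(X)        # statsmodels does not add an intercept automatically
results = sm.OLS(y, X_design).fit()  # ordinary least squares
print(results.summary())             # coefficient table with p-values, confidence intervals, R^2
```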
Advanced Topics for Further Study
- Kernel Methods Beyond SVMs: Kernel PCA, kernel regression, and Gaussian processes (Chapter 10) all use the kernel trick in different contexts. Understanding the kernel framework unifies many seemingly disparate methods.
- Online Learning: Stochastic gradient descent can be applied in an online setting where data arrives one example at a time. See sklearn.linear_model.SGDClassifier and SGDRegressor; a short incremental-training sketch follows this list.
- Multi-Label and Multi-Output Learning: Problems where each instance can belong to multiple classes simultaneously (e.g., image tagging). See sklearn.multioutput and sklearn.multiclass; a small wrapper example also appears after this list.
- Ordinal Regression: When the target is ordinal (e.g., ratings from 1 to 5), standard classification ignores the ordering. Specialized methods like cumulative link models preserve this structure.
- Conformal Prediction: A distribution-free framework for constructing prediction intervals with guaranteed coverage, complementing the point predictions from this chapter. An increasingly important topic in production ML; a minimal split-conformal sketch closes this section.
- Automated Machine Learning (AutoML): Tools like Auto-sklearn, FLAML, and H2O AutoML automate the model selection and hyperparameter tuning pipeline described in Section 6.8. We discuss AutoML further in Part VI.
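To make the online-learning entry concrete, here is a minimal sketch of incremental training with SGDClassifier's partial_fit on synthetic data treated as a stream; the batch size and hyperparameters are illustrative assumptions, and the "log_loss" loss name assumes a recent scikit-learn release.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Synthetic data treated as a stream of 500-example batches.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
clf = SGDClassifier(loss="log_loss", random_state=0)

classes = np.unique(y)  # every class must be declared on the first partial_fit call
for start in range(0, len(X), 500):
    clf.partial_fit(X[start:start + 500], y[start:start + 500], classes=classes)

print("training accuracy:", clf.score(X, y))
```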
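For the multi-label entry, a small sketch of wrapping a base estimator with sklearn.multioutput.MultiOutputClassifier, using scikit-learn's synthetic multi-label generator; the base estimator and label counts are arbitrary choices for illustration.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Y is an (n_samples, n_labels) 0/1 indicator matrix.
X, Y = make_multilabel_classification(n_samples=500, n_classes=4, random_state=0)

# One logistic regression is fit per label column behind a single estimator interface.
clf = MultiOutputClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)
print(clf.predict(X[:3]))  # one 0/1 prediction per label for each of the first three rows
```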
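Finally, a minimal sketch of split conformal prediction for regression, assuming exchangeable data and using absolute residuals on a held-out calibration set; the base model, split sizes, and coverage level are illustrative assumptions rather than part of this chapter's code.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic regression data split into a training set and a calibration set.
X, y = make_regression(n_samples=2000, n_features=10, noise=10.0, random_state=0)
X_train, X_cal, y_train, y_cal = train_test_split(X, y, test_size=0.25, random_state=0)

model = RandomForestRegressor(random_state=0).fit(X_train, y_train)

alpha = 0.1                                    # target 90% coverage
scores = np.abs(y_cal - model.predict(X_cal))  # nonconformity scores on the calibration set
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample-corrected quantile

preds = model.predict(X[:5])
print(np.column_stack([preds - q, preds, preds + q]))  # lower bound, point prediction, upper bound
```

Under the exchangeability assumption, intervals of the form prediction ± q cover the true response with probability at least 1 - alpha, regardless of how well the underlying model is specified.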