Chapter 6: Further Reading

Foundational Texts

Linear Models and Regression

  • Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Chapters 3 (Linear Methods for Regression) and 4 (Linear Methods for Classification) are the definitive reference for the mathematical treatment of linear and logistic regression, Ridge, Lasso, and Elastic Net. Freely available at https://hastie.su.domains/ElemStatLearn/.

  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 4 (Linear Models for Classification) provides an excellent probabilistic perspective on logistic regression and connects it to generative classifiers. Chapter 3 covers Bayesian linear regression.

  • James, G., Witten, D., Hastie, T., and Tibshirani, R. (2021). An Introduction to Statistical Learning, 2nd ed. Springer. A more accessible treatment of regression and classification, with practical labs; both the R and Python editions are freely available at https://www.statlearning.com/.

Support Vector Machines

  • Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels. MIT Press. The standard reference on SVMs, kernel methods, and their mathematical foundations, including the reproducing kernel Hilbert space (RKHS) framework.

  • Burges, C. J. C. (1998). "A Tutorial on Support Vector Machines for Pattern Recognition." Data Mining and Knowledge Discovery, 2(2), 121--167. A clear, well-written tutorial that covers the SVM formulation, the kernel trick, and soft margins.

Tree-Based and Ensemble Methods

  • Breiman, L. (2001). "Random Forests." Machine Learning, 45(1), 5--32. The seminal paper introducing random forests, including the out-of-bag error estimate and variable importance measures.

  • Friedman, J. H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine." Annals of Statistics, 29(5), 1189--1232. The foundational paper on gradient boosting. Dense but essential for understanding the algorithm's theoretical underpinnings.

  • Chen, T. and Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. The paper describing XGBoost's algorithmic innovations, including regularized objectives, sparsity-aware split finding, and cache-aware access patterns.

  • Ke, G., Meng, Q., Finley, T., et al. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." Advances in Neural Information Processing Systems, 30. Introduces histogram-based split finding and Gradient-based One-Side Sampling (GOSS) for faster training.

  • Prokhorenkova, L., Gusev, G., Vorobev, A., et al. (2018). "CatBoost: Unbiased Boosting with Categorical Features." Advances in Neural Information Processing Systems, 31. Introduces ordered boosting and an efficient method for handling categorical features natively.

Key Papers

Bias-Variance Tradeoff

  • Geman, S., Bienenstock, E., and Doursat, R. (1992). "Neural Networks and the Bias/Variance Dilemma." Neural Computation, 4(1), 1--58. The classic formalization of the bias-variance decomposition and its implications for model selection.

  • Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). "Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off." Proceedings of the National Academy of Sciences, 116(32), 15849--15854. Introduces the double descent phenomenon, showing that the classical U-shaped test error curve is incomplete for overparameterized models.

Regularization

  • Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society, Series B, 58(1), 267--288. The original Lasso paper, introducing L1 regularization for simultaneous estimation and feature selection.

  • Hoerl, A. E. and Kennard, R. W. (1970). "Ridge Regression: Biased Estimation for Nonorthogonal Problems." Technometrics, 12(1), 55--67. The foundational Ridge regression paper.

  • Zou, H. and Hastie, T. (2005). "Regularization and Variable Selection via the Elastic Net." Journal of the Royal Statistical Society, Series B, 67(2), 301--320. Introduces the Elastic Net, combining L1 and L2 penalties.

Model Evaluation

  • Raschka, S. (2018). "Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning." arXiv:1811.12808. A comprehensive survey of evaluation methodologies, including nested cross-validation and statistical comparison tests.

Online Resources and Tutorials

  • scikit-learn User Guide: https://scikit-learn.org/stable/supervised_learning.html --- Excellent documentation for all supervised learning algorithms, including mathematical descriptions, code examples, and practical tips.

  • StatQuest (YouTube): Josh Starmer's videos on linear regression, logistic regression, SVMs, decision trees, random forests, and gradient boosting are exceptionally clear and intuitive. Recommended for building visual intuition.

  • Google's Machine Learning Crash Course: https://developers.google.com/machine-learning/crash-course --- A free, practical introduction to supervised learning concepts with interactive exercises.

  • Kaggle Learn: https://www.kaggle.com/learn --- Hands-on micro-courses on machine learning, including "Intro to Machine Learning" and "Intermediate Machine Learning" that cover many topics from this chapter.

  • Sebastian Raschka's "Machine Learning FAQ": https://sebastianraschka.com/faq/ --- Concise answers to common questions about supervised learning algorithms, evaluation, and best practices.

Software Libraries

  • scikit-learn (sklearn): The primary library used throughout this chapter. Provides LinearRegression, Ridge, Lasso, ElasticNet, LogisticRegression, SVC, DecisionTreeClassifier, RandomForestClassifier, GradientBoostingClassifier, and all evaluation metrics.

  • XGBoost (xgboost): The most popular gradient boosting library. Install with pip install xgboost. Provides XGBClassifier and XGBRegressor with a scikit-learn-compatible API (see the sketch following this list).

  • LightGBM (lightgbm): Microsoft's gradient boosting library, often faster than XGBoost for large datasets. Install with pip install lightgbm.

  • CatBoost (catboost): Yandex's gradient boosting library with native categorical feature handling. Install with pip install catboost.

  • statsmodels (statsmodels): Provides OLS regression with detailed statistical summaries (p-values, confidence intervals, diagnostic tests). Useful for statistical analysis alongside scikit-learn's machine learning focus. Install with pip install statsmodels.
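  The gradient boosting libraries above all expose scikit-learn-compatible estimators, so they can be dropped into the same fit/predict workflow as the scikit-learn models used throughout this chapter. A minimal sketch, assuming the packages are installed as noted and using a synthetic dataset purely for illustration:

      # Minimal sketch: the shared scikit-learn estimator API across boosting libraries.
      # Assumes xgboost, lightgbm, and catboost are installed; remove any model line
      # whose library is unavailable.
      from sklearn.datasets import make_classification
      from sklearn.model_selection import train_test_split
      from sklearn.ensemble import GradientBoostingClassifier
      from sklearn.metrics import accuracy_score
      from xgboost import XGBClassifier
      from lightgbm import LGBMClassifier
      from catboost import CatBoostClassifier

      X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

      models = {
          "sklearn GBM": GradientBoostingClassifier(random_state=0),
          "XGBoost": XGBClassifier(n_estimators=200, random_state=0),
          "LightGBM": LGBMClassifier(n_estimators=200, random_state=0),
          "CatBoost": CatBoostClassifier(n_estimators=200, verbose=0, random_state=0),
      }

      for name, model in models.items():
          model.fit(X_train, y_train)          # same call regardless of library
          preds = model.predict(X_test)
          print(f"{name}: accuracy = {accuracy_score(y_test, preds):.3f}")

  statsmodels, by contrast, follows its own pattern (build a model, call .fit(), then .summary()), which is what produces the detailed statistical tables mentioned above.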

Advanced Topics for Further Study

  • Kernel Methods Beyond SVMs: Kernel PCA, kernel regression, and Gaussian processes (Chapter 10) all use the kernel trick in different contexts. Understanding the kernel framework unifies many seemingly disparate methods.

  • Online Learning: Stochastic gradient descent can be applied in an online setting where data arrives one example at a time. See sklearn.linear_model.SGDClassifier and SGDRegressor (a minimal sketch appears at the end of this list).

  • Multi-Label and Multi-Output Learning: Problems where each instance can belong to multiple classes simultaneously (e.g., image tagging). See sklearn.multioutput and sklearn.multiclass.

  • Ordinal Regression: When the target is ordinal (e.g., ratings from 1 to 5), standard classification ignores the ordering. Specialized methods like cumulative link models preserve this structure.

  • Conformal Prediction: A distribution-free framework for constructing prediction intervals with guaranteed coverage, complementing the point predictions from this chapter. An increasingly important topic in production ML.

  • Automated Machine Learning (AutoML): Tools like Auto-sklearn, FLAML, and H2O AutoML automate the model selection and hyperparameter tuning pipeline described in Section 6.8. We discuss AutoML further in Part VI.
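  As a brief illustration of the online-learning setting mentioned above, scikit-learn's SGDClassifier can be updated incrementally with partial_fit. A minimal sketch, with a synthetic dataset split into mini-batches purely to simulate a data stream:

      # Minimal sketch of incremental (online) learning with SGDClassifier.partial_fit.
      # The mini-batch "stream" is simulated; in practice batches arrive over time.
      import numpy as np
      from sklearn.datasets import make_classification
      from sklearn.linear_model import SGDClassifier

      X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
      classes = np.unique(y)  # partial_fit needs the full set of classes up front

      clf = SGDClassifier(random_state=0)
      for X_batch, y_batch in zip(np.array_split(X, 50), np.array_split(y, 50)):
          clf.partial_fit(X_batch, y_batch, classes=classes)  # one update per batch

      print("accuracy on seen data:", clf.score(X, y))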