Chapter 6: Further Reading

Foundational Texts

Linear Models and Regression

  • Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Chapters 3 (Linear Methods for Regression) and 4 (Linear Methods for Classification) are the definitive reference for the mathematical treatment of linear and logistic regression, Ridge, Lasso, and Elastic Net. Freely available at https://hastie.su.domains/ElemStatLearn/.

  • Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 4 (Linear Models for Classification) provides an excellent probabilistic perspective on logistic regression and connects it to generative classifiers. Chapter 3 covers Bayesian linear regression.

  • James, G., Witten, D., Hastie, T., and Tibshirani, R. (2021). An Introduction to Statistical Learning, 2nd ed. Springer. A more accessible treatment of regression and classification, with practical labs; both the R and Python editions are freely available at https://www.statlearning.com/.

Support Vector Machines

  • Schölkopf, B. and Smola, A. J. (2002). Learning with Kernels. MIT Press. The standard reference on SVMs, kernel methods, and their mathematical foundations, including the reproducing kernel Hilbert space (RKHS) framework.

  • Burges, C. J. C. (1998). "A Tutorial on Support Vector Machines for Pattern Recognition." Data Mining and Knowledge Discovery, 2(2), 121--167. A clear, well-written tutorial that covers the SVM formulation, the kernel trick, and soft margins.

Tree-Based and Ensemble Methods

  • Breiman, L. (2001). "Random Forests." Machine Learning, 45(1), 5--32. The seminal paper introducing random forests, including the out-of-bag error estimate and variable importance measures.

  • Friedman, J. H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine." Annals of Statistics, 29(5), 1189--1232. The foundational paper on gradient boosting. Dense but essential for understanding the algorithm's theoretical underpinnings.

  • Chen, T. and Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System." Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. The paper describing XGBoost's algorithmic innovations, including regularized objectives, sparsity-aware split finding, and cache-aware access patterns.

  • Ke, G., Meng, Q., Finley, T., et al. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree." Advances in Neural Information Processing Systems, 30. Introduces histogram-based split finding and Gradient-based One-Side Sampling (GOSS) for faster training.

  • Prokhorenkova, L., Gusev, G., Vorobev, A., et al. (2018). "CatBoost: Unbiased Boosting with Categorical Features." Advances in Neural Information Processing Systems, 31. Introduces ordered boosting and an efficient method for handling categorical features natively.

Key Papers

Bias-Variance Tradeoff

  • Geman, S., Bienenstock, E., and Doursat, R. (1992). "Neural Networks and the Bias/Variance Dilemma." Neural Computation, 4(1), 1--58. The classic formalization of the bias-variance decomposition and its implications for model selection.

  • Belkin, M., Hsu, D., Ma, S., and Mandal, S. (2019). "Reconciling Modern Machine-Learning Practice and the Classical Bias-Variance Trade-Off." Proceedings of the National Academy of Sciences, 116(32), 15849--15854. Introduces the double descent phenomenon, showing that the classical U-shaped test error curve is incomplete for overparameterized models.

Regularization

  • Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society, Series B, 58(1), 267--288. The original Lasso paper, introducing L1 regularization for simultaneous estimation and feature selection.

  • Hoerl, A. E. and Kennard, R. W. (1970). "Ridge Regression: Biased Estimation for Nonorthogonal Problems." Technometrics, 12(1), 55--67. The foundational Ridge regression paper.

  • Zou, H. and Hastie, T. (2005). "Regularization and Variable Selection via the Elastic Net." Journal of the Royal Statistical Society, Series B, 67(2), 301--320. Introduces the Elastic Net, combining L1 and L2 penalties.

Model Evaluation

  • Raschka, S. (2018). "Model Evaluation, Model Selection, and Algorithm Selection in Machine Learning." arXiv:1811.12808. A comprehensive survey of evaluation methodologies, including nested cross-validation and statistical comparison tests.

Online Resources and Tutorials

  • scikit-learn User Guide: https://scikit-learn.org/stable/supervised_learning.html --- Excellent documentation for all supervised learning algorithms, including mathematical descriptions, code examples, and practical tips.

  • StatQuest (YouTube): Josh Starmer's videos on linear regression, logistic regression, SVMs, decision trees, random forests, and gradient boosting are exceptionally clear and intuitive. Recommended for building visual intuition.

  • Google's Machine Learning Crash Course: https://developers.google.com/machine-learning/crash-course --- A free, practical introduction to supervised learning concepts with interactive exercises.

  • Kaggle Learn: https://www.kaggle.com/learn --- Hands-on micro-courses on machine learning, including "Intro to Machine Learning" and "Intermediate Machine Learning" that cover many topics from this chapter.

  • Sebastian Raschka's "Machine Learning FAQ": https://sebastianraschka.com/faq/ --- Concise answers to common questions about supervised learning algorithms, evaluation, and best practices.

Software Libraries

  • scikit-learn (sklearn): The primary library used throughout this chapter. Provides LinearRegression, Ridge, Lasso, ElasticNet, LogisticRegression, SVC, DecisionTreeClassifier, RandomForestClassifier, GradientBoostingClassifier, and all evaluation metrics.

  • XGBoost (xgboost): The most popular gradient boosting library. Install with pip install xgboost. Provides XGBClassifier and XGBRegressor with a scikit-learn-compatible API (see the sketch following this list).

  • LightGBM (lightgbm): Microsoft's gradient boosting library, often faster than XGBoost for large datasets. Install with pip install lightgbm.

  • CatBoost (catboost): Yandex's gradient boosting library with native categorical feature handling. Install with pip install catboost.

  • statsmodels (statsmodels): Provides OLS regression with detailed statistical summaries (p-values, confidence intervals, diagnostic tests). Useful for statistical analysis alongside scikit-learn's machine learning focus. Install with pip install statsmodels.
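  The gradient boosting libraries above all expose scikit-learn-compatible estimators, so they can be dropped into the same fit/predict workflow as the scikit-learn models used throughout this chapter. A minimal sketch, assuming the packages are installed as noted and using a synthetic dataset purely for illustration:

      # Minimal sketch: the shared scikit-learn estimator API across boosting libraries.
      # Assumes xgboost, lightgbm, and catboost are installed; remove any model line
      # whose library is unavailable.
      from sklearn.datasets import make_classification
      from sklearn.model_selection import train_test_split
      from sklearn.ensemble import GradientBoostingClassifier
      from sklearn.metrics import accuracy_score
      from xgboost import XGBClassifier
      from lightgbm import LGBMClassifier
      from catboost import CatBoostClassifier

      X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

      models = {
          "sklearn GBM": GradientBoostingClassifier(random_state=0),
          "XGBoost": XGBClassifier(n_estimators=200, random_state=0),
          "LightGBM": LGBMClassifier(n_estimators=200, random_state=0),
          "CatBoost": CatBoostClassifier(n_estimators=200, verbose=0, random_state=0),
      }

      for name, model in models.items():
          model.fit(X_train, y_train)          # same call regardless of library
          preds = model.predict(X_test)
          print(f"{name}: accuracy = {accuracy_score(y_test, preds):.3f}")

  statsmodels, by contrast, follows its own pattern (build a model, call .fit(), then .summary()), which is what produces the detailed statistical tables mentioned above.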

Advanced Topics for Further Study

  • Kernel Methods Beyond SVMs: Kernel PCA, kernel regression, and Gaussian processes (Chapter 10) all use the kernel trick in different contexts. Understanding the kernel framework unifies many seemingly disparate methods.

  • Online Learning: Stochastic gradient descent can be applied in an online setting where data arrives one example at a time. See sklearn.linear_model.SGDClassifier and SGDRegressor (a minimal sketch appears at the end of this list).

  • Multi-Label and Multi-Output Learning: Problems where each instance can belong to multiple classes simultaneously (e.g., image tagging). See sklearn.multioutput and sklearn.multiclass.

  • Ordinal Regression: When the target is ordinal (e.g., ratings from 1 to 5), standard classification ignores the ordering. Specialized methods like cumulative link models preserve this structure.

  • Conformal Prediction: A distribution-free framework for constructing prediction intervals with guaranteed coverage, complementing the point predictions from this chapter. An increasingly important topic in production ML.

  • Automated Machine Learning (AutoML): Tools like Auto-sklearn, FLAML, and H2O AutoML automate the model selection and hyperparameter tuning pipeline described in Section 6.8. We discuss AutoML further in Part VI.
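  As a brief illustration of the online-learning setting mentioned above, scikit-learn's SGDClassifier can be updated incrementally with partial_fit. A minimal sketch, with a synthetic dataset split into mini-batches purely to simulate a data stream:

      # Minimal sketch of incremental (online) learning with SGDClassifier.partial_fit.
      # The mini-batch "stream" is simulated; in practice batches arrive over time.
      import numpy as np
      from sklearn.datasets import make_classification
      from sklearn.linear_model import SGDClassifier

      X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
      classes = np.unique(y)  # partial_fit needs the full set of classes up front

      clf = SGDClassifier(random_state=0)
      for X_batch, y_batch in zip(np.array_split(X, 50), np.array_split(y, 50)):
          clf.partial_fit(X_batch, y_batch, classes=classes)  # one update per batch

      print("accuracy on seen data:", clf.score(X, y))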