Further Reading: Chapter 9

Feature Selection: Reducing Dimensionality Without Losing Signal


Foundational References

1. "An Introduction to Variable and Feature Selection" --- Isabelle Guyon and Andre Elisseeff (2003) Journal of Machine Learning Research, 3, 1157-1182. The single most cited paper on feature selection in machine learning. Guyon and Elisseeff provide a clear taxonomy of filter, wrapper, and embedded methods, discuss evaluation criteria, and outline practical guidelines for choosing among methods. This paper defined the vocabulary that the field still uses. Essential reading for anyone who wants to understand feature selection at a conceptual level. Freely available at jmlr.org.

2. "Feature Engineering and Selection: A Practical Approach for Predictive Models" --- Max Kuhn and Kjell Johnson (2019) CRC Press. The best single book on the joint problem of feature engineering and feature selection. Chapters 10-12 cover filter methods (correlation, information gain), wrapper methods (RFE, genetic algorithms), and embedded methods (LASSO, elastic net, tree importance) with extensive R examples. The discussion of feature selection stability (Chapter 12) is particularly relevant to production deployment. Kuhn is the creator of the caret and tidymodels packages in R and brings deep practical experience.

3. "Regularization Paths for Generalized Linear Models via Coordinate Descent" --- Friedman, Hastie, and Tibshirani (2010) Journal of Statistical Software, 33(1), 1-22. The paper behind glmnet, the most widely used LASSO/elastic net implementation. Explains why L1 regularization produces sparse solutions (exact zeros) while L2 does not, and how the coordinate descent algorithm efficiently computes the entire regularization path. If you want to understand why LASSO works as a feature selector, start here.
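The sparsity the paper explains is easy to see empirically. The sketch below uses scikit-learn's Lasso and Ridge (not glmnet itself) on synthetic data; the alpha values and dataset sizes are illustrative choices, not recommendations.

```python
# L1 vs. L2 sparsity: Lasso produces exact zero coefficients, Ridge does not.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# 50 features, only 5 carry signal
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=10.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

n_zero_lasso = int(np.sum(lasso.coef_ == 0))  # many exact zeros
n_zero_ridge = int(np.sum(ridge.coef_ == 0))  # L2 only shrinks, never zeros
print(n_zero_lasso, n_zero_ridge)
```

The surviving nonzero Lasso coefficients are, in effect, the selected feature set.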


Practical Guides and Tutorials

4. scikit-learn --- Feature Selection User Guide. The official documentation covering VarianceThreshold, SelectKBest, SelectFromModel, RFECV, and SequentialFeatureSelector. Includes examples of integrating feature selection into Pipeline objects (the correct approach emphasized in this chapter). Pay particular attention to the section on using Pipeline to avoid data leakage. Available at scikit-learn.org.
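The leakage-safe pattern the guide emphasizes looks like this: the selector lives inside the Pipeline, so during cross-validation it is re-fit on each training fold and never sees the validation fold. The specific dataset, k, and classifier here are illustrative.

```python
# Feature selection inside a Pipeline: no leakage into CV validation folds.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = make_classification(n_samples=300, n_features=100, n_informative=10,
                           random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif, k=10)),  # fit on training folds only
    ("clf", LogisticRegression(max_iter=1000)),
])

scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```

Running SelectKBest on the full dataset before cross-validating would instead leak validation-fold information into the selection step and inflate the scores.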

5. "Feature Selection with scikit-learn" --- scikit-learn Examples Gallery. A collection of worked examples demonstrating filter, wrapper, and embedded methods on real and synthetic datasets. The "Comparison of F-test and Mutual Information" example is particularly useful for understanding when different filter methods diverge. Available at scikit-learn.org/stable/auto_examples/.

6. "Permutation Importance vs. Random Forest Feature Importance (MDI)" --- scikit-learn Examples. A single example that demonstrates why impurity-based importance is biased and how permutation importance corrects the bias. Includes the canonical demonstration using random features that receive artificially high impurity importance but zero permutation importance. Essential for anyone using tree-based importance for feature selection.
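A compressed version of that demonstration, assuming a synthetic dataset with one pure-noise column appended; the forest size and split ratios are illustrative.

```python
# MDI bias vs. permutation importance: a noise feature gets nonzero impurity
# importance (trees split on it) but ~zero permutation importance on held-out data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           random_state=0)
rng = np.random.default_rng(0)
X = np.column_stack([X, rng.normal(size=len(X))])  # pure-noise column (index 5)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)

mdi_noise = rf.feature_importances_[5]           # > 0: impurity credit for noise
perm = permutation_importance(rf, X_te, y_te, n_repeats=20, random_state=0)
perm_noise = perm.importances_mean[5]            # ~0: shuffling noise costs nothing
print(mdi_noise, perm_noise)
```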


Multicollinearity and VIF

7. "Applied Linear Statistical Models" --- Kutner, Nachtsheim, Neter, and Li (5th edition, 2004) Chapter 10 covers multicollinearity detection (VIF, condition numbers, eigenvalue analysis) and remediation (ridge regression, variable selection, principal components). The most thorough textbook treatment of multicollinearity at the applied level. The VIF thresholds cited in this chapter (2.5, 5, 10) originate from this tradition.

8. "Variance Inflation Factors in the Analysis of Complex Survey Data" --- Allison (2012) Statistical Horizons blog post. A concise, practitioner-oriented explanation of VIF that covers what it measures, how to interpret it, and when the standard thresholds (5 and 10) do and do not apply. Allison argues persuasively that VIF thresholds should be context-dependent, not universal. Recommended for readers who want a deeper understanding of when high VIF matters and when it does not.

9. statsmodels --- variance_inflation_factor documentation. The Python implementation used in this chapter. Note that statsmodels expects you to include a constant (intercept) column in the design matrix for correct VIF computation. The documentation is sparse, but the source code is readable. Always standardize features before computing VIF to avoid numerical issues with features on very different scales.


Advanced Feature Selection

10. "Stability Selection" --- Nicolai Meinshausen and Peter Buhlmann (2010) Journal of the Royal Statistical Society, Series B, 72(4), 417-473. Stability selection addresses a critical weakness of standard feature selection: instability. The paper proposes running feature selection on many random subsamples of the data and keeping only features that are consistently selected. This approach produces a more reliable feature set at the cost of being more conservative (it may miss weak but real features). Highly recommended for production settings where feature set stability matters.
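A hand-rolled sketch of the subsample-and-count idea, not the authors' full procedure (which also randomizes the penalty): run a LASSO selector on many random half-samples and keep only features selected in most runs. The alpha, run count, and 80% threshold are illustrative.

```python
# Stability-selection sketch: selection frequency across random half-samples.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=200, n_features=30, n_informative=5,
                       noise=5.0, random_state=0)

rng = np.random.default_rng(0)
n_runs = 50
counts = np.zeros(X.shape[1])
for _ in range(n_runs):
    idx = rng.choice(len(X), size=len(X) // 2, replace=False)
    coef = Lasso(alpha=1.0).fit(X[idx], y[idx]).coef_
    counts += coef != 0                      # tally which features survive

stable = np.where(counts / n_runs >= 0.8)[0]  # kept in >= 80% of runs
print(stable)
```

Features that ride along in only a few subsamples are dropped, which is exactly the conservatism the paper describes.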

11. "Boruta: A System for Feature Selection" --- Kursa, Jankowski, and Rudnicki (2010) Fundamenta Informaticae, 101(4), 271-285. Boruta is an elegant feature selection algorithm built on random forests. It creates "shadow" features by shuffling each real feature, trains a random forest on both real and shadow features, and marks a feature as important only if it consistently outperforms the best shadow feature. This provides a principled statistical test for feature relevance. Available as the Boruta package in R and BorutaPy in Python.
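The shadow-feature idea fits in a few lines. This is one round only, without the iterative statistical test real Boruta applies, so treat it as a sketch of the mechanism rather than the algorithm. With shuffle=False, make_classification places the informative features in the first columns.

```python
# Shadow features: shuffle each column to destroy its signal, then keep only
# real features whose importance beats the strongest shadow importance.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=400, n_features=10, n_informative=4,
                           n_redundant=0, shuffle=False, random_state=0)

rng = np.random.default_rng(0)
shadows = rng.permuted(X, axis=0)          # each column shuffled independently
X_aug = np.hstack([X, shadows])

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_aug, y)
imp = rf.feature_importances_
best_shadow = imp[X.shape[1]:].max()       # importance floor set by pure noise
keep = np.where(imp[:X.shape[1]] > best_shadow)[0]
print(keep)
```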

12. "Regularization and Variable Selection via the Elastic Net" --- Zou and Hastie (2005) Journal of the Royal Statistical Society, Series B, 67(2), 301-320. The elastic net combines L1 and L2 regularization, offering a compromise: L1 drives some coefficients to zero (feature selection), while L2 encourages correlated features to share their coefficients rather than one dominating and the other being zeroed out. Useful when you have groups of correlated features and want to select the group rather than a single representative.
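The grouping effect is visible with two near-duplicate features. This sketch uses scikit-learn's ElasticNet; the alphas, the l1_ratio, and the noise scale are illustrative.

```python
# Grouping effect: elastic net splits weight across near-duplicate features
# instead of arbitrarily favoring one of them.
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(0)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)    # near-duplicate of x1
noise = rng.normal(size=(n, 3))             # irrelevant columns
X = np.column_stack([x1, x2, noise])
y = 3 * x1 + 3 * x2 + rng.normal(size=n)

lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(lasso.coef_[:2], enet.coef_[:2])
```

Because the L2 term makes the objective strictly convex, the elastic net keeps both correlated features with similar coefficients, which is what you want when the group, not an arbitrary member, is the meaningful unit.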


The Curse of Dimensionality

13. "The Elements of Statistical Learning" --- Hastie, Tibshirani, and Friedman (2nd edition, 2009) Chapter 2, Sections 2.5 and 2.9. The clearest mathematical treatment of why high-dimensional spaces behave counterintuitively: distances concentrate, neighborhoods become large relative to the feature space, and local methods (KNN, kernel methods) break down. Free PDF available at hastie.su.domains/ElemStatLearn/. Dense but rewarding.
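The distance-concentration phenomenon the chapter describes can be checked numerically in a few lines; the dimensions and sample size below are arbitrary illustrative choices.

```python
# Distance concentration: in high dimensions, the nearest and farthest
# neighbors of a query point are almost equally far away.
import numpy as np

rng = np.random.default_rng(0)
ratios = {}
for d in (2, 1000):
    X = rng.uniform(size=(500, d))          # 500 points in the unit cube
    q = rng.uniform(size=d)                 # query point
    dists = np.linalg.norm(X - q, axis=1)
    # relative contrast: (max - min) / min; near 0 means "everyone is far"
    ratios[d] = (dists.max() - dists.min()) / dists.min()
print(ratios)
```

In 2 dimensions the contrast is large (some points are genuinely close); in 1000 dimensions it collapses toward zero, which is why nearest-neighbor methods lose their discriminating power.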

14. "A Few Useful Things to Know About Machine Learning" --- Pedro Domingos (2012) Communications of the ACM, 55(10), 78-87. A widely read survey paper that discusses the curse of dimensionality, overfitting, and feature engineering in practical terms. Section 6 ("More Data Beats a Cleverer Algorithm") and Section 8 ("Feature Engineering Is the Key") are directly relevant to this chapter's themes. Accessible and highly recommended for all practitioners.


Feature Selection in Production

15. "Reliable Machine Learning" --- Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley, and Todd Underwood (2022) O'Reilly Media. Chapter 7 covers feature management in production, including monitoring feature drift, managing feature dependencies, and the operational cost of maintaining large feature sets. The discussion of "feature stores" and their role in standardizing feature computation is relevant to understanding why feature selection matters for deployment.

16. "Rules of Machine Learning: Best Practices for ML Engineering" --- Martin Zinkevich (Google). A Google internal document made public. Rule 16 ("Plan to launch and iterate") and Rule 21 ("The number of feature weights you can learn in a linear model is roughly proportional to the amount of data you have") are directly relevant. Rule 43 ("Your friends tend to be the same across different products") discusses feature reuse and stability. Available freely online; search for the title.


How to Use This List

If you read nothing else, read Guyon and Elisseeff (item 1). It is the single best overview of the feature selection landscape, and it is free.

If you are implementing feature selection in production, start with the scikit-learn user guide (item 4) for the API, and read Meinshausen and Buhlmann (item 10) on stability selection to ensure your feature set does not change with every data refresh.

If you are debugging multicollinearity, Allison's blog post (item 8) is the most practical reference. If you need the mathematical foundations, Kutner et al. (item 7) is the definitive source.

If you want to understand why LASSO selects features (drives coefficients to zero), Friedman, Hastie, and Tibshirani (item 3) explains the geometry clearly. If you want the elastic net generalization, read Zou and Hastie (item 12).

If you are concerned about the operational cost of large feature sets in production, Chen et al. (item 15) and Zinkevich (item 16) provide the engineering perspective that complements the statistical perspective of this chapter.


This reading list supports Chapter 9: Feature Selection. Return to the chapter to review concepts before diving in.