Chapter 9: Further Reading
Foundational Texts
- Zheng, A. and Casari, A. (2018). Feature Engineering for Machine Learning. O'Reilly Media. A practical, hands-on guide to feature engineering covering numerical, categorical, text, and image features. Well-suited for engineers transitioning from theory to practice.
- Kuhn, M. and Johnson, K. (2019). Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press. Comprehensive treatment of feature engineering from a statistical modeling perspective, with excellent coverage of feature selection. Freely available at https://bookdown.org/max/FES/.
- Müller, A. C. and Guido, S. (2016). Introduction to Machine Learning with Python. O'Reilly Media. Chapter 4 provides an accessible, code-driven introduction to feature engineering with scikit-learn, including pipelines and ColumnTransformers.
- VanderPlas, J. (2016). Python Data Science Handbook. O'Reilly Media. Excellent treatment of pandas-based data manipulation and feature extraction. Freely available at https://jakevdp.github.io/PythonDataScienceHandbook/.
Key Papers
Feature Scaling and Transformation
- Box, G. E. P. and Cox, D. R. (1964). "An Analysis of Transformations." Journal of the Royal Statistical Society, Series B, 26(2), 211--252. The foundational paper on power transformations for normalizing distributions, the basis for PowerTransformer in scikit-learn (see the sketch after this list).
- Yeo, I. and Johnson, R. A. (2000). "A New Family of Power Transformations to Improve Normality or Symmetry." Biometrika, 87(4), 954--959. Extends Box-Cox to handle zero and negative values, implemented as the Yeo-Johnson transform.
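For readers who want to connect these transformations to code, here is a minimal sketch applying scikit-learn's PowerTransformer with both methods; the skewed input data is synthetic and purely illustrative.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Skewed, strictly positive data (synthetic, for illustration only).
rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=(1000, 1))

# Box-Cox (1964) requires strictly positive inputs.
boxcox = PowerTransformer(method="box-cox")
x_bc = boxcox.fit_transform(x)

# Yeo-Johnson (2000) also handles zeros and negative values.
yeojohnson = PowerTransformer(method="yeo-johnson")
x_yj = yeojohnson.fit_transform(x - 1.0)  # shifted data includes negatives

print("fitted Box-Cox lambda:", boxcox.lambdas_)
print("fitted Yeo-Johnson lambda:", yeojohnson.lambdas_)
```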
Categorical Encoding
- Micci-Barreca, D. (2001). "A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems." ACM SIGKDD Explorations, 3(1), 27--32. Introduces target encoding with regularization, a key technique for high-cardinality features (see the sketch after this list).
- Pargent, F., Pfisterer, F., Thomas, J., and Bischl, B. (2022). "Regularized Target Encoding Outperforms Traditional Methods in Supervised Machine Learning with High-Cardinality Features." Computational Statistics, 37, 2671--2692. Systematic comparison of encoding strategies, demonstrating regularized target encoding's advantages.
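As a rough illustration of regularized target encoding in the spirit of Micci-Barreca (2001), the sketch below uses the category_encoders library listed later in this chapter; the column name, toy data, and smoothing value are arbitrary choices, not recommendations.

```python
import pandas as pd
from category_encoders import TargetEncoder  # pip install category-encoders

# Toy high-cardinality feature; the data are invented for illustration.
X = pd.DataFrame({"city": ["a", "a", "b", "b", "c", "d", "d", "d"]})
y = pd.Series([1, 0, 1, 1, 0, 0, 1, 0])

# The smoothing parameter blends each category's target mean with the
# global mean, regularizing the estimates for rare categories.
encoder = TargetEncoder(cols=["city"], smoothing=10.0)
X_encoded = encoder.fit_transform(X, y)
print(X_encoded)
```

In practice the encoder should be fitted inside a pipeline on training folds only; fitting it on the full dataset leaks target information (see the Data Leakage papers below).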
Feature Selection
- Guyon, I. and Elisseeff, A. (2003). "An Introduction to Variable and Feature Selection." Journal of Machine Learning Research, 3, 1157--1182. The definitive survey on feature selection methods (filter, wrapper, embedded), widely cited and still highly relevant.
- Chandrashekar, G. and Sahin, F. (2014). "A Survey on Feature Selection Methods." Computers & Electrical Engineering, 40(1), 16--28. A more recent survey covering both classical and modern feature selection techniques.
- Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society, Series B, 58(1), 267--288. Introduces L1 regularization for simultaneous feature selection and regression, foundational for embedded methods (see the sketch after this list).
- Breiman, L. (2001). "Random Forests." Machine Learning, 45(1), 5--32. Introduces permutation importance and Gini importance for tree-based feature ranking.
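To make the embedded and tree-based approaches above concrete, here is a small sketch on synthetic data: lasso-based selection via SelectFromModel and permutation importance from scikit-learn. Hyperparameters such as alpha and n_repeats are placeholders.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn.inspection import permutation_importance
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=20, n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Embedded selection: the L1 penalty (Tibshirani, 1996) zeroes out weak coefficients.
lasso_selector = SelectFromModel(Lasso(alpha=1.0)).fit(X_train, y_train)
print("features kept by the lasso:", lasso_selector.get_support().sum())

# Model-agnostic ranking: permutation importance (Breiman, 2001) on held-out data.
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
result = permutation_importance(forest, X_test, y_test, n_repeats=5, random_state=0)
print("top feature indices:", result.importances_mean.argsort()[::-1][:5])
```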
Pipelines and Reproducibility
- Buitinck, L. et al. (2013). "API Design for Machine Learning Software: Experiences from the scikit-learn Project." arXiv:1309.0238. Describes the design principles behind scikit-learn's Pipeline, Transformer, and Estimator APIs.
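A minimal sketch of the Pipeline and ColumnTransformer composition built on those APIs is shown below; the column names and estimator are placeholders to adapt to your own dataset.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Column names are placeholders; substitute the columns of your own DataFrame.
numeric_cols = ["age", "income"]
categorical_cols = ["city"]

preprocess = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])

# Coupling preprocessing and the estimator in one object means cross-validation
# and model persistence treat the whole workflow as a single estimator.
model = Pipeline([("preprocess", preprocess),
                  ("classifier", LogisticRegression(max_iter=1000))])
```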
Data Leakage
- Kaufman, S., Rosset, S., and Perlich, C. (2012). "Leakage in Data Mining: Formulation, Detection, and Avoidance." ACM Transactions on Knowledge Discovery from Data, 6(4), 15. Formalizes data leakage, classifies its types, and provides detection and prevention strategies.
- Kapoor, S. and Narayanan, A. (2023). "Leakage and the Reproducibility Crisis in Machine-Learning-Based Science." Patterns, 4(9). Demonstrates how data leakage has affected thousands of published studies, with recommendations for prevention.
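The preprocessing form of leakage discussed in these papers can be illustrated in a few lines: scaling before cross-validation lets validation-fold statistics influence training, while wrapping the scaler in a pipeline keeps each fold self-contained. The data here are synthetic and the score gap may be small, but the structural point holds.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# Leaky pattern: the scaler sees the full dataset before cross-validation,
# so statistics from the validation folds leak into training.
X_scaled = StandardScaler().fit_transform(X)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_scaled, y, cv=5)

# Safe pattern: the pipeline refits the scaler on each training fold only.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
safe_scores = cross_val_score(pipe, X, y, cv=5)

print("leaky:", leaky_scores.mean(), "safe:", safe_scores.mean())
```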
Online Resources and Tutorials
- scikit-learn Preprocessing Guide (https://scikit-learn.org/stable/modules/preprocessing.html): Comprehensive documentation for all scikit-learn transformers, including scalers, encoders, and imputers.
- scikit-learn Pipeline Guide (https://scikit-learn.org/stable/modules/compose.html): Official guide for Pipeline, ColumnTransformer, and FeatureUnion.
- scikit-learn Feature Selection Guide (https://scikit-learn.org/stable/modules/feature_selection.html): Documentation for VarianceThreshold, SelectKBest, RFE, and SelectFromModel.
- Category Encoders Library Documentation (https://contrib.scikit-learn.org/category_encoders/): Documentation for the category_encoders package, providing 20+ encoding strategies with scikit-learn API compatibility.
- Kaggle Feature Engineering Course (https://www.kaggle.com/learn/feature-engineering): Free interactive course with practical notebooks covering target encoding, feature creation, and mutual information.
Software Libraries
- scikit-learn (sklearn): Core library for pipelines (Pipeline, ColumnTransformer), preprocessing (StandardScaler, OneHotEncoder, OrdinalEncoder), feature selection (SelectKBest, RFE, SelectFromModel), and imputation (SimpleImputer, IterativeImputer, KNNImputer).
- category_encoders: Provides target encoding, leave-one-out encoding, binary encoding, hashing encoding, and more. Integrates seamlessly with scikit-learn pipelines. Install with pip install category-encoders.
- feature-engine (feature_engine): Specialized library for feature engineering with scikit-learn-compatible transformers for encoding, discretization, outlier handling, and missing data. Install with pip install feature-engine.
- pandas: Essential for data manipulation, datetime feature extraction, and exploratory analysis before pipeline construction.
- imbalanced-learn (imblearn), a scikit-learn contrib project: Provides pipeline-compatible samplers (SMOTE, ADASYN) for handling class imbalance. Install with pip install imbalanced-learn.
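As a sketch of how imbalanced-learn's samplers compose with estimators, the example below builds an imblearn Pipeline around SMOTE; the dataset, class weights, and scoring choice are illustrative assumptions.

```python
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn
from imblearn.pipeline import Pipeline    # sampler-aware Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic imbalanced binary problem (90% / 10%), for illustration only.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# imblearn's Pipeline applies SMOTE only during fitting, never during prediction,
# so resampling stays out of the validation folds.
pipe = Pipeline([("smote", SMOTE(random_state=0)),
                 ("clf", RandomForestClassifier(random_state=0))])
print(cross_val_score(pipe, X, y, cv=5, scoring="f1").mean())
```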
Advanced Topics for Further Study
- Automated Feature Engineering: Libraries like featuretools generate features automatically from relational datasets using deep feature synthesis. See Kanter and Veeramachaneni (2015), "Deep Feature Synthesis: Towards Automating Data Science Endeavors."
- Feature Stores: Production systems for managing, sharing, and serving features at scale. See Feast (https://feast.dev/) and Tecton for open-source and managed solutions.
- Neural Feature Learning: Deep learning models (Chapter 11+) learn features automatically from raw data. Understanding manual feature engineering helps interpret what deep models learn and when manual engineering still outperforms learned representations.
- Causal Feature Selection: Using causal reasoning to select features that represent true causes rather than mere correlations. See Peters, Janzing, and Schölkopf (2017), Elements of Causal Inference.
- Missing Data Theory: Little and Rubin (2019), Statistical Analysis with Missing Data, 3rd ed. Wiley. The authoritative reference on MCAR, MAR, and MNAR missing data mechanisms and their implications for imputation.
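To connect those missing-data mechanisms to practice, the sketch below contrasts scikit-learn's univariate SimpleImputer with the multivariate IterativeImputer on a tiny synthetic matrix; which strategy is appropriate depends on whether the MCAR or MAR assumption is plausible for your data.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, enables IterativeImputer
from sklearn.impute import IterativeImputer, SimpleImputer

# Tiny matrix with missing entries, invented for illustration only.
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [7.0, 8.0, 9.0],
              [np.nan, 5.0, 4.0]])

# Univariate imputation: column means, a reasonable default under MCAR.
print(SimpleImputer(strategy="mean").fit_transform(X))

# Multivariate imputation: models each feature from the others,
# which can exploit MAR structure among the observed features.
print(IterativeImputer(random_state=0).fit_transform(X))
```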