Chapter 9: Further Reading

Foundational Texts

  • Zheng, A. and Casari, A. (2018). Feature Engineering for Machine Learning. O'Reilly Media. A practical, hands-on guide to feature engineering covering numerical, categorical, text, and image features. Well-suited for engineers transitioning from theory to practice.

  • Kuhn, M. and Johnson, K. (2019). Feature Engineering and Selection: A Practical Approach for Predictive Models. CRC Press. Comprehensive treatment of feature engineering from a statistical modeling perspective, with excellent coverage of feature selection. Freely available at https://bookdown.org/max/FES/.

  • Müller, A. C. and Guido, S. (2016). Introduction to Machine Learning with Python. O'Reilly Media. Chapter 4 provides an accessible, code-driven introduction to feature engineering with scikit-learn, including pipelines and ColumnTransformers.

  • VanderPlas, J. (2016). Python Data Science Handbook. O'Reilly Media. Excellent treatment of pandas-based data manipulation and feature extraction. Freely available at https://jakevdp.github.io/PythonDataScienceHandbook/.

Key Papers

Feature Scaling and Transformation

  • Box, G. E. P. and Cox, D. R. (1964). "An Analysis of Transformations." Journal of the Royal Statistical Society, Series B, 26(2), 211--252. The foundational paper on power transformations for normalizing distributions, the basis for PowerTransformer in scikit-learn.

  • Yeo, I. and Johnson, R. A. (2000). "A New Family of Power Transformations to Improve Normality or Symmetry." Biometrika, 87(4), 954--959. Extends Box-Cox to handle zero and negative values; implemented in scikit-learn as the 'yeo-johnson' option of PowerTransformer.
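
For readers who want to see these transforms in code, the sketch below applies scikit-learn's PowerTransformer, which implements both Box-Cox and Yeo-Johnson, to a synthetic skewed feature; the data and settings are illustrative rather than prescriptive.

    # Minimal sketch: compare skew before and after a Yeo-Johnson transform.
    # PowerTransformer estimates the power parameter (lambda) by maximum likelihood.
    import numpy as np
    from scipy.stats import skew
    from sklearn.preprocessing import PowerTransformer

    rng = np.random.default_rng(0)
    # Right-skewed feature shifted to include negative values (Box-Cox would reject these).
    X = rng.exponential(scale=2.0, size=(1000, 1)) - 1.0

    pt = PowerTransformer(method="yeo-johnson", standardize=True)
    X_t = pt.fit_transform(X)

    print("estimated lambda:", pt.lambdas_)
    print("skew before:", skew(X.ravel()), "after:", skew(X_t.ravel()))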

Categorical Encoding

  • Micci-Barreca, D. (2001). "A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems." ACM SIGKDD Explorations, 3(1), 27--32. Introduces target encoding with regularization, a key technique for high-cardinality features.

  • Pargent, F., Pfisterer, F., Thomas, J., and Bischl, B. (2022). "Regularized Target Encoding Outperforms Traditional Methods in Supervised Machine Learning with High-Cardinality Features." Computational Statistics, 37, 2671--2692. Systematic comparison of encoding strategies, demonstrating regularized target encoding's advantages.
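
As a bridge from these papers to code, the sketch below uses scikit-learn's TargetEncoder (available in scikit-learn 1.3 and later), which smooths category means toward the global mean and cross-fits inside fit_transform to limit target leakage; the synthetic data is illustrative. The category_encoders library listed under Software Libraries provides a similar TargetEncoder with an explicit smoothing parameter.

    # Minimal sketch: regularized (smoothed, cross-fitted) target encoding of one
    # categorical column against a binary target.
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import TargetEncoder  # requires scikit-learn >= 1.3

    rng = np.random.default_rng(0)
    cities = rng.choice(["paris", "lyon", "nice", "lille"], size=200)
    rates = pd.Series(cities).map({"paris": 0.7, "lyon": 0.5, "nice": 0.3, "lille": 0.4})
    y = (rng.random(200) < rates.to_numpy()).astype(int)

    enc = TargetEncoder(smooth="auto", random_state=0)
    X_enc = enc.fit_transform(cities.reshape(-1, 1), y)  # cross-fitted on the training data
    print(X_enc[:5].ravel())
    print(enc.encodings_)  # per-category encodings used by transform() on new data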

Feature Selection

  • Guyon, I. and Elisseeff, A. (2003). "An Introduction to Variable and Feature Selection." Journal of Machine Learning Research, 3, 1157--1182. The definitive survey on feature selection methods (filter, wrapper, embedded), widely cited and still highly relevant.

  • Chandrashekar, G. and Sahin, F. (2014). "A Survey on Feature Selection Methods." Computers & Electrical Engineering, 40(1), 16--28. A more recent survey covering both classical and modern feature selection techniques.

  • Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso." Journal of the Royal Statistical Society, Series B, 58(1), 267--288. Introduces L1 regularization for simultaneous feature selection and regression, foundational for embedded methods.

  • Breiman, L. (2001). "Random Forests." Machine Learning, 45(1), 5--32. Introduces permutation importance and Gini importance for tree-based feature ranking.
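
The sketch below ties two of these entries to scikit-learn: embedded selection with the Lasso via SelectFromModel, and permutation importance as introduced by Breiman. The synthetic dataset and hyperparameters are illustrative.

    # Minimal sketch: embedded feature selection (Lasso) and permutation importance.
    from sklearn.datasets import make_regression
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.feature_selection import SelectFromModel
    from sklearn.inspection import permutation_importance
    from sklearn.linear_model import Lasso
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=300, n_features=20, n_informative=5,
                           noise=10.0, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    # Embedded method (Tibshirani, 1996): keep features with non-zero Lasso coefficients.
    selector = SelectFromModel(Lasso(alpha=1.0), threshold=1e-5).fit(X_tr, y_tr)
    print("Lasso keeps", selector.get_support().sum(), "of", X.shape[1], "features")

    # Permutation importance (Breiman, 2001): score drop when one feature is shuffled.
    rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    result = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
    print("Most important feature index:", result.importances_mean.argmax())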

Pipelines and Reproducibility

  • Buitinck, L. et al. (2013). "API Design for Machine Learning Software: Experiences from the scikit-learn Project." arXiv:1309.0238. Describes the design principles behind scikit-learn's Pipeline, Transformer, and Estimator APIs.
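
The estimator/transformer conventions described in that paper are what make composed preprocessing possible; the sketch below shows the typical Pipeline-plus-ColumnTransformer pattern, with hypothetical column names.

    # Minimal sketch: every step exposes fit/transform, so the whole composition
    # can be fit, cross-validated, and persisted as a single estimator.
    from sklearn.compose import ColumnTransformer
    from sklearn.impute import SimpleImputer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    numeric = ["age", "income"]        # hypothetical numeric columns
    categorical = ["city", "plan"]     # hypothetical categorical columns

    preprocess = ColumnTransformer([
        ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                          ("scale", StandardScaler())]), numeric),
        ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
    ])

    model = Pipeline([("prep", preprocess), ("clf", LogisticRegression(max_iter=1000))])
    # model.fit(X_train, y_train)  # X_train: a DataFrame with the columns above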

Data Leakage

  • Kaufman, S., Rosset, S., and Perlich, C. (2012). "Leakage in Data Mining: Formulation, Detection, and Avoidance." ACM Transactions on Knowledge Discovery from Data, 6(4), Article 15. Formalizes data leakage, classifies its types, and provides detection and prevention strategies.

  • Kapoor, S. and Narayanan, A. (2023). "Leakage and the Reproducibility Crisis in Machine-Learning-Based Science." Patterns, 4(9). Demonstrates how data leakage has affected thousands of published studies, with recommendations for prevention.
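
A common instance of the leakage these papers formalize is preprocessing fit on the full dataset before cross-validation; the sketch below contrasts that with refitting the preprocessing inside each fold via a Pipeline. With a plain scaler the numerical effect is usually small, but the same pattern applied to target encoding or feature selection can inflate scores substantially.

    # Minimal sketch: leaky vs. leakage-free preprocessing under cross-validation.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=300, n_features=20, random_state=0)

    # Leaky: the scaler sees every row, including rows later used for validation.
    X_leaky = StandardScaler().fit_transform(X)
    leaky_scores = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5)

    # Leakage-free: the scaler is refit on the training portion of each fold.
    pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
    safe_scores = cross_val_score(pipe, X, y, cv=5)

    print("leaky:", leaky_scores.mean(), "safe:", safe_scores.mean())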

Online Resources and Tutorials

  • scikit-learn Preprocessing Guide: https://scikit-learn.org/stable/modules/preprocessing.html --- Comprehensive documentation for all scikit-learn transformers including scalers, encoders, and imputers.

  • scikit-learn Pipeline Guide: https://scikit-learn.org/stable/modules/compose.html --- Official guide for Pipeline, ColumnTransformer, and FeatureUnion.

  • scikit-learn Feature Selection Guide: https://scikit-learn.org/stable/modules/feature_selection.html --- Documentation for VarianceThreshold, SelectKBest, RFE, and SelectFromModel.

  • Category Encoders Library Documentation: https://contrib.scikit-learn.org/category_encoders/ --- Documentation for the category_encoders package providing 20+ encoding strategies with scikit-learn API compatibility.

  • Kaggle Feature Engineering Course: https://www.kaggle.com/learn/feature-engineering --- Free interactive course with practical notebooks covering target encoding, feature creation, and mutual information.

Software Libraries

  • scikit-learn (sklearn): Core library for pipelines (Pipeline, ColumnTransformer), preprocessing (StandardScaler, OneHotEncoder, OrdinalEncoder), feature selection (SelectKBest, RFE, SelectFromModel), and imputation (SimpleImputer, IterativeImputer, KNNImputer).

  • category_encoders (category_encoders): Provides target encoding, leave-one-out encoding, binary encoding, hashing encoding, and more. Integrates seamlessly with scikit-learn pipelines. Install with pip install category-encoders.

  • feature-engine (feature_engine): Specialized library for feature engineering with scikit-learn-compatible transformers for encoding, discretization, outlier handling, and missing data. Install with pip install feature-engine.

  • pandas (pandas): Essential for data manipulation, datetime feature extraction, and exploratory analysis before pipeline construction.

  • imbalanced-learn (imblearn): A scikit-learn contrib project providing pipeline-compatible samplers (SMOTE, ADASYN) for handling class imbalance; samplers require imblearn's own Pipeline rather than scikit-learn's. Install with pip install imbalanced-learn.
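
Because resampling must happen only on training folds, imbalanced-learn ships its own Pipeline that accepts samplers; the sketch below shows the usual pattern, with illustrative parameters.

    # Minimal sketch: SMOTE applied inside cross-validation via imblearn's pipeline.
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from imblearn.over_sampling import SMOTE
    from imblearn.pipeline import make_pipeline  # imblearn's pipeline, not sklearn's

    X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)

    pipe = make_pipeline(SMOTE(random_state=0), LogisticRegression(max_iter=1000))
    scores = cross_val_score(pipe, X, y, cv=5, scoring="f1")
    print("mean F1:", scores.mean())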

Advanced Topics for Further Study

  • Automated Feature Engineering: Libraries like featuretools generate features automatically from relational datasets using deep feature synthesis. See Kanter and Veeramachaneni (2015), "Deep Feature Synthesis: Towards Automating Data Science Endeavors."

  • Feature Stores: Production systems for managing, sharing, and serving features at scale. See Feast (https://feast.dev/) and Tecton for open-source and managed solutions, respectively.

  • Neural Feature Learning: Deep learning models (Chapter 11+) learn features automatically from raw data. Understanding manual feature engineering helps interpret what deep models learn and when manual engineering still outperforms learned representations.

  • Causal Feature Selection: Using causal reasoning to select features that represent true causes rather than mere correlations. See Peters, Janzing, and Schölkopf (2017), Elements of Causal Inference.

  • Missing Data Theory: Little and Rubin (2019), Statistical Analysis with Missing Data, 3rd ed. Wiley. The authoritative reference on MCAR, MAR, and MNAR missing data mechanisms and their implications for imputation.