Further Reading: Chapter 7

Handling Categorical Data


Foundational Papers

1. "A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems" --- Daniele Micci-Barreca (2001) ACM SIGKDD Explorations, Vol. 3, No. 1. The original paper introducing smoothed target encoding (called "impact coding" in the paper). Micci-Barreca formalizes the Bayesian shrinkage approach: blending category-level statistics with global statistics, weighted by sample size. The smoothing parameter m controls how aggressively small categories are regularized toward the global mean. This is the theoretical foundation for every target encoding implementation in use today, including the category_encoders library's TargetEncoder. If you use target encoding in production, this short paper is essential reading.
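The shrinkage formula at the heart of the paper can be sketched in a few lines of plain Python. This is an illustrative implementation, not code from the paper or from category_encoders; the function and variable names are invented here:

```python
from collections import defaultdict

def smoothed_target_encode(categories, targets, m=10.0):
    """Blend each category's target mean with the global mean,
    weighted by sample size n: (sum_c + m * global_mean) / (n_c + m).
    Larger m pulls small categories harder toward the global mean."""
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for c, t in zip(categories, targets):
        sums[c] += t
        counts[c] += 1
    return {
        c: (sums[c] + m * global_mean) / (counts[c] + m)
        for c in counts
    }

cats = ["a", "a", "a", "b"]
ys = [1, 1, 0, 1]
enc = smoothed_target_encode(cats, ys, m=2.0)
# "b" has a raw mean of 1.0 but only one observation, so its
# encoding is pulled strongly toward the global mean of 0.75.
```

With m=2, category "a" (3 observations, mean 2/3) encodes to 0.7, while the single-observation "b" lands at about 0.83 rather than 1.0.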

2. "CatBoost: Unbiased Boosting with Categorical Features" --- Prokhorenkova et al. (2018) NeurIPS 2018. The paper behind Yandex's CatBoost library, which introduces an ordered target encoding that avoids the leakage problem entirely. The key insight is to compute target statistics using only "previously seen" observations (in a random permutation), creating a natural leave-one-out scheme without explicit cross-validation. The paper also demonstrates that naive target encoding causes prediction shift and proposes a principled solution. Highly relevant if you use CatBoost or want to understand why target encoding leakage is a deeper problem than most tutorials acknowledge.
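The ordered scheme described above can be sketched as follows. This is a simplified illustration of the idea, not CatBoost's actual implementation (which averages over multiple permutations); all names are invented here:

```python
import random

def ordered_target_encode(categories, targets, prior=0.5, seed=0):
    """CatBoost-style ordered target statistics (simplified sketch):
    each row is encoded using only the targets of rows that appear
    earlier in a random permutation, so no row ever sees its own label."""
    idx = list(range(len(categories)))
    random.Random(seed).shuffle(idx)
    sums, counts = {}, {}
    encoded = [0.0] * len(categories)
    for i in idx:
        c = categories[i]
        s = sums.get(c, 0.0)
        n = counts.get(c, 0)
        # The prior acts like one pseudo-observation at the prior value,
        # so the first occurrence of a category encodes to the prior.
        encoded[i] = (s + prior) / (n + 1)
        # Only now reveal this row's target to later rows.
        sums[c] = s + targets[i]
        counts[c] = n + 1
    return encoded

enc = ordered_target_encode(["a", "a", "b", "a"], [1, 0, 1, 1])
```

Because each row's statistic excludes its own target, the encoding behaves like leave-one-out without the leakage that naive leave-one-out introduces.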

3. "Entity Embeddings of Categorical Variables" --- Cheng Guo and Felix Berkhahn (2016) arXiv:1604.06737. Introduces the idea of learning dense vector representations (embeddings) for categorical features using neural networks, similar to word embeddings in NLP. The authors demonstrate that entity embeddings capture semantic relationships between categories (e.g., similar days of the week, nearby geographic regions) and transfer well across tasks. A preview of the embedding approach mentioned in this chapter and covered in depth in Chapters 26 and 27. Particularly relevant for very high cardinality features where target encoding is unstable.
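The mechanics of an embedding lookup can be shown with NumPy alone. This sketch is illustrative only: in practice the embedding matrix is learned jointly with the network (e.g. an embedding layer in a deep learning framework), whereas here it is just randomly initialized to show how a categorical column becomes a block of dense columns:

```python
import numpy as np

categories = ["mon", "tue", "wed", "mon", "wed"]
vocab = {c: i for i, c in enumerate(sorted(set(categories)))}

rng = np.random.default_rng(0)
embedding_dim = 3
emb = rng.normal(size=(len(vocab), embedding_dim))  # one row per category

# Replace each categorical value with its dense vector.
dense = emb[[vocab[c] for c in categories]]
print(dense.shape)  # (5, 3): five rows, each now a 3-d dense vector
```

After training, rows of the embedding matrix for semantically similar categories (e.g. adjacent weekdays) tend to end up close together, which is the property Guo and Berkhahn exploit.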


Books

4. Feature Engineering and Selection: A Practical Approach for Predictive Models --- Max Kuhn and Kjell Johnson (2019) Chapter 5, "Encoding Categorical Predictors," is the most thorough treatment of categorical encoding in any textbook. Covers one-hot, dummy, effect, ordinal, target, and hash encoding with mathematical detail and empirical comparisons. The discussion of the relationship between encoding choice and model type directly supports the decision framework introduced in this chapter. The code is in R, but the concepts are language-agnostic. Available as a free online book at bookdown.org/max/FES.

5. Feature Engineering for Machine Learning: Principles and Techniques for Data Scientists --- Alice Zheng and Amanda Casari (O'Reilly, 2018) Chapters 5 ("Categorical Variables: Counting Eggs in the Age of Robotic Chickens") and 6 ("Dimensionality Reduction: Squashing the Data Pancake") are directly relevant. The treatment of feature hashing is the clearest available in any introductory text. The authors demonstrate hashing for both text and categorical features, with practical guidance on choosing the number of hash bins. Shorter and more accessible than Kuhn and Johnson, with Python code throughout.


Practical Guides

6. category_encoders Library Documentation The Python library implementing target encoding, binary encoding, hash encoding, leave-one-out encoding, ordinal encoding, and 15+ other schemes. The API is scikit-learn compatible, so all encoders work inside Pipeline and ColumnTransformer. The TargetEncoder documentation includes the smoothing formula and explains the cross-validation behavior. The library source code is also well-commented and worth reading for implementation details. Documentation at contrib.scikit-learn.org/category_encoders or the GitHub repository.

7. scikit-learn User Guide --- "Encoding Categorical Features" The official scikit-learn documentation for OneHotEncoder, OrdinalEncoder, and the newer TargetEncoder (added in scikit-learn 1.3). The ColumnTransformer documentation is essential for building pipelines that apply different encodings to different features. The examples are minimal but precise. Pay particular attention to the handle_unknown parameter documentation, which describes the behavior for unseen categories. Available at scikit-learn.org.

8. "Encoding Categorical Features" --- Kaggle Learn Module (Free) A hands-on Kaggle course covering ordinal encoding, one-hot encoding, and target encoding with interactive notebooks. The target encoding lesson includes a clear demonstration of the leakage problem and the cross-validation fix. The examples use the Ames Housing dataset, which has multiple categorical features with varying cardinality. Good for reinforcing the concepts from this chapter with practice on different data. Registration required but free.


Technical Blog Posts

9. "Target Encoding Done the Right Way" --- Maxim Milakov (2019) A detailed blog post from the NVIDIA RAPIDS team explaining the correct implementation of target encoding, including the smoothing formula, cross-validation protocol, and the relationship between smoothing and regularization. Includes performance benchmarks showing that smoothed target encoding with cross-validation consistently outperforms naive target encoding on out-of-sample data. The post also covers the connection between target encoding and Bayesian hierarchical models.
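The cross-validation protocol the post advocates can be sketched in plain Python: compute each row's encoding from the other folds only, so a row's own label never leaks into its feature value. This is an illustrative out-of-fold implementation with invented names, not code from the post:

```python
def oof_target_encode(categories, targets, n_folds=2, m=10.0):
    """Out-of-fold smoothed target encoding. Rows in each fold are
    encoded using statistics computed on the remaining folds."""
    n = len(categories)
    global_mean = sum(targets) / n
    encoded = [global_mean] * n
    folds = [list(range(f, n, n_folds)) for f in range(n_folds)]
    for fold in folds:
        held_out = set(fold)
        sums, counts = {}, {}
        for i in range(n):
            if i in held_out:
                continue
            c = categories[i]
            sums[c] = sums.get(c, 0.0) + targets[i]
            counts[c] = counts.get(c, 0) + 1
        for i in fold:
            c = categories[i]
            s, k = sums.get(c, 0.0), counts.get(c, 0)
            # Categories unseen in the training folds fall back
            # to the global mean because k is 0.
            encoded[i] = (s + m * global_mean) / (k + m)
    return encoded

enc = oof_target_encode(["a", "b", "a", "b"], [1, 0, 1, 0])
```

In production the same idea is usually delegated to KFold from scikit-learn rather than the round-robin fold assignment sketched here.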

10. "Categorical Features and Encoding in Decision Trees" --- Ben Reiniger (2020) A Stack Overflow canonical answer (and expanded blog post) explaining why tree-based models can handle ordinal encoding of nominal features. The key insight: a decision tree can reconstruct any partition of category values through a series of binary splits, regardless of the integer assignment. This is why LightGBM and CatBoost use integer encoding internally for categorical features. Clarifies a common source of confusion about when ordinal encoding is "safe."
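The key insight can be demonstrated directly: give a tree arbitrary integer codes whose positive set {0, 2} cannot be isolated by any single threshold, and it still fits the partition perfectly through stacked binary splits. A small sketch with invented data:

```python
from sklearn.tree import DecisionTreeClassifier

# Arbitrary integer codes for four nominal categories; the target is
# positive for codes {0, 2}, a set no single threshold can separate.
X = [[0], [1], [2], [3]] * 10  # repeated so the tree has enough samples
y = [1, 0, 1, 0] * 10

tree = DecisionTreeClassifier(random_state=0).fit(X, y)
print(tree.score(X, y))  # 1.0: thresholds at 0.5, 1.5, 2.5 carve out {0, 2}
```

A linear model given the same integer column could not recover this pattern, which is precisely why "ordinal encoding is safe" is a statement about tree-based models, not about encoding in general.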

11. "Hash Encoding: Handling High-Cardinality Categorical Features" --- Various Sources The feature hashing trick was introduced by Weinberger et al. (2009) in "Feature Hashing for Large Scale Multitask Learning" (ICML 2009). The original paper addresses text classification, but the technique applies directly to categorical encoding. For a practical treatment, see the scikit-learn HashingVectorizer documentation and the category_encoders HashingEncoder documentation. The key parameter is the number of hash bins (n_components in category_encoders' HashingEncoder, n_features in scikit-learn's HashingVectorizer), which controls the tradeoff between compression and collision rate.
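The trick itself fits in a few lines of standard-library Python. This sketch uses md5 for a stable hash (Python's built-in hash() is randomized per process, which would make encodings irreproducible across runs); the function name and bin count are illustrative:

```python
import hashlib

def hash_encode(value, n_bins=8):
    """Map a category string to a fixed-size indicator vector via a
    stable hash. Collisions between categories are possible by design;
    n_bins trades memory against collision rate."""
    h = int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)
    vec = [0] * n_bins
    vec[h % n_bins] = 1
    return vec

# Any category, including one never seen during training, lands in
# exactly one of n_bins columns -- no fitted vocabulary required.
print(hash_encode("ICD10:E11.9"))
```

The absence of a fitted vocabulary is what makes hashing attractive for very high cardinality and streaming settings, at the cost of collisions and uninterpretable columns.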


Domain-Specific Resources

12. "ICD-10-CM Official Guidelines for Coding and Reporting" --- CDC/NCHS (Updated Annually) The official coding guidelines for ICD-10 diagnosis codes used in the Metro General case study. Understanding the hierarchical structure of ICD-10 (chapters, blocks, categories, subcategories) is essential for designing domain-informed encoding strategies for medical data. The document is dense and clinical, but the introductory sections on code structure (Section I.A) are accessible to non-clinicians and directly relevant to the grouping strategies discussed in this chapter.
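One simple domain-informed strategy enabled by this hierarchy is truncation: the first three characters of an ICD-10 code identify its category, so truncating collapses detailed codes into coarse clinical groups. A sketch, using real code structure but invented function names (this is an illustration of the hierarchy, not an official grouping algorithm):

```python
def icd10_category(code):
    """Truncate an ICD-10 code to its 3-character category,
    e.g. 'E11.9' and 'E11.65' both map to 'E11' (type 2 diabetes)."""
    return code.split(".")[0][:3].upper()

codes = ["E11.9", "E11.65", "I10", "J45.909"]
groups = [icd10_category(c) for c in codes]
print(groups)  # ['E11', 'E11', 'I10', 'J45']
```

Both diabetes subcodes collapse into one group while hypertension (I10) and asthma (J45) stay distinct, which is exactly the cardinality reduction with preserved clinical meaning that the guidelines' hierarchy makes possible.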

13. "Clinical Classifications Software (CCS) for ICD-10" --- AHRQ/HCUP The Agency for Healthcare Research and Quality provides a mapping from ICD-10 codes to 283 clinically meaningful categories. This is a domain-expert-designed grouping that reduces cardinality from 14,000+ to 283, with each group representing a coherent clinical concept. For healthcare data science, CCS groupings are often a better starting point than statistical encoding methods because they encode clinical domain knowledge directly. Available free at hcup-us.ahrq.gov.


Advanced Topics (Preview)

14. "Neural Oblivious Decision Ensembles for Deep Learning on Tabular Data" --- Popov et al. (2019) arXiv:1909.06312. Introduces NODE, a neural network architecture for tabular data that learns categorical embeddings jointly with the prediction task. Relevant as a preview of how deep learning handles categorical features differently from classical ML. The learned embeddings can be extracted and used as features in other models --- a technique gaining traction in industry for very high cardinality features.

15. "Tabular Data: Deep Learning Is Not All You Need" --- Shwartz-Ziv and Armon (2022) arXiv:2106.03253. A systematic comparison of deep learning and gradient boosting on tabular datasets, with analysis of how each handles categorical features. The conclusion: gradient boosting with proper categorical encoding outperforms deep learning on most tabular datasets. Reinforces the importance of the encoding strategies covered in this chapter.


How to Use This List

If you read one thing, read the category_encoders documentation (item 6). It is the practical tool you will use most often, and understanding its API is immediately applicable to your projects.

If you want to understand the theory behind target encoding, read Micci-Barreca (item 1) for the original formulation and Prokhorenkova et al. (item 2) for the CatBoost refinement.

If you work with medical data, the CCS resource (item 13) will save you weeks of work --- it provides a pre-built domain-expert grouping for ICD-10 codes that eliminates most of the encoding challenge.

If you are curious about where categorical encoding is heading, read Guo and Berkhahn (item 3) on entity embeddings. This is the bridge between the classical encoding methods in this chapter and the deep learning approaches in later chapters.


This reading list supports Chapter 7: Handling Categorical Data. Return to the chapter to review the core concepts before diving into these readings.