Key Takeaways: Chapter 7
Handling Categorical Data
- There is no single best encoding strategy. The right encoding depends on four factors: whether the feature is nominal or ordinal, its cardinality, the strength of its relationship with the target, and whether the model is tree-based or linear. One-hot encoding is the default for low cardinality, ordinal encoding is correct for ordered features, target encoding is powerful for high cardinality, and frequency encoding is a safe fallback that requires no target variable. Choosing an encoding is a decision, not a default.
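The four-factor decision above can be sketched as a toy helper function. This is purely illustrative — the function name and the `50`-category threshold are assumptions for the sketch, not part of any library:

```python
def choose_encoding(is_ordinal: bool, n_categories: int,
                    has_target: bool, tree_based: bool) -> str:
    """Toy decision helper mirroring the chapter's four factors."""
    if is_ordinal:
        return "ordinal"      # the order is real; integer codes preserve it
    if n_categories <= 50:
        return "one-hot"      # low cardinality: OHE is the safe default
    if has_target:
        return "target"       # high cardinality with a supervised signal
    if tree_based:
        return "ordinal"      # trees split on values, so codes are harmless
    return "frequency"        # no target, linear model: safe fallback

print(choose_encoding(False, 14_000, True, True))  # -> target
```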
- One-hot encoding does not scale past ~50 categories. For a feature with k categories, OHE creates k binary columns. At 47 genre levels, this is borderline. At 14,000 ICD-10 codes, it is disastrous --- both computationally (2.6 billion values in the matrix) and statistically (most columns have too few observations for the model to learn stable patterns). Recognize the cardinality wall and choose an encoding that compresses high-cardinality features.
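The k-columns-for-k-categories blow-up is easy to see with a tiny sketch (toy data, using pandas `get_dummies` as a stand-in for any OHE implementation):

```python
import pandas as pd

genres = pd.Series(["rock", "jazz", "pop", "rock"], name="genre")
ohe = pd.get_dummies(genres)

# One binary column per category: fine at k=3, borderline at k=47,
# and a 14,000-column matrix for raw ICD-10 codes.
print(ohe.shape[1])  # -> 3
```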
- Ordinal encoding is dangerous for nominal features in linear models, but safe for tree-based models. A linear model interprets ordinal-encoded values as magnitudes: country=3 has "three times the effect" of country=1. This is meaningless for nominal features. Tree-based models split on individual values and are not affected by the numeric assignment. If you use ordinal encoding for a nominal feature, verify that the downstream model handles it correctly.
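A minimal sketch of the hazard, using scikit-learn's `OrdinalEncoder` on made-up country labels:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

countries = np.array([["US"], ["FR"], ["JP"], ["US"]])
enc = OrdinalEncoder()
codes = enc.fit_transform(countries)

# Categories are coded alphabetically: FR=0, JP=1, US=2. A linear model
# would read US as having "twice the effect" of JP -- meaningless for
# nominal labels. A tree would simply split on individual code values.
print(codes.ravel().tolist())  # -> [2.0, 0.0, 1.0, 2.0]
```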
- Target encoding is the most powerful encoding for high-cardinality features --- and the most dangerous. It compresses any number of categories into a single column by encoding the relationship between category and target. But naive target encoding (computing means on the full training set) causes data leakage: each row's encoding is computed using its own target value. The encoding "knows" the answer because it was computed from the answer. This inflates training metrics and degrades generalization.
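The leakage mechanism is visible in four rows of toy data — each row's encoded value is a mean that includes that row's own label:

```python
import pandas as pd

df = pd.DataFrame({"city":    ["A", "A", "B", "B"],
                   "churned": [1,   0,   1,   1]})

# Naive target encoding: each row's feature is computed USING its own label.
# Row 0's encoding of 0.5 already "contains" its churned=1 answer.
df["city_te"] = df.groupby("city")["churned"].transform("mean")
print(df["city_te"].tolist())  # -> [0.5, 0.5, 1.0, 1.0]
```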
- The fix for target encoding leakage is cross-validation and smoothing. Always compute target encodings from out-of-fold data. In scikit-learn, this means putting the `TargetEncoder` inside a `Pipeline` and using `cross_val_score`, which automatically handles fold-wise fitting. Smoothing (Bayesian shrinkage) regularizes categories with few observations by blending the category mean with the global mean, preventing the model from memorizing small-sample noise.
- Frequency encoding is underrated. It replaces each category with its proportion in the training data. It has zero leakage risk, requires no target variable (works for unsupervised learning), and captures the signal that common categories often behave differently from rare ones. It is not as powerful as target encoding when the category-to-target relationship is strong, but it is a strong baseline that requires no special handling.
- Domain knowledge beats mechanical encoding. The ICD-10 case study demonstrated that domain-informed grouping (chapter-level OHE with 12 columns) outperformed brute-force OHE (2,296 columns) on both AUC and training time. Before encoding a high-cardinality feature, ask whether the domain provides a natural hierarchy, grouping, or taxonomy. A 10-minute conversation with a subject matter expert can eliminate thousands of noisy categories.
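A minimal sketch of domain-informed grouping on a hypothetical handful of codes. Real ICD-10 chapters are defined by code ranges (A00-B99, C00-D49, and so on); the leading letter is used here only as a rough stand-in:

```python
import pandas as pd

# Hypothetical mini-sample of ICD-10 codes.
codes = pd.Series(["E11.9", "E10.1", "I10", "I25.10", "J45.909"])
chapters = codes.str[0]  # crude chapter proxy: leading letter

# 3 chapter-level columns instead of 5 (or, at scale, thousands of)
# code-level columns -- the grouping does the compression.
print(pd.get_dummies(chapters).shape[1])  # -> 3
```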
- Multi-resolution encoding captures information at different granularities. For features with natural hierarchies (ICD-10 codes, geographic regions, product categories), encode at multiple levels simultaneously: coarse grouping via OHE plus fine-grained target encoding. The model gets both a stable, broad signal and an informative, granular signal. This consistently outperformed any single-level encoding in the Metro General case study.
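A pandas sketch of the two-resolution idea on hypothetical codes. Frequency encoding stands in for the fine-grained column here to keep the sketch leak-free; in practice a cross-fitted target encoding would slot into that same column:

```python
import pandas as pd

codes = pd.Series(["E11.9", "E10.1", "I10", "I25.10", "J45.909", "E11.9"],
                  name="icd10")

# Coarse resolution: chapter-level one-hot (stable, broad signal).
coarse = pd.get_dummies(codes.str[0], prefix="chap")
# Fine resolution: a per-code statistic (frequency as a leak-free stand-in).
fine = codes.map(codes.value_counts(normalize=True)).rename("code_freq")

X = pd.concat([coarse, fine], axis=1)
print(X.shape)  # one column per chapter plus one fine-grained column
```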
- Always plan for unseen categories in production. New genres, new diagnosis codes, new product categories. Set `handle_unknown='ignore'` for one-hot encoders (unseen categories become all-zero vectors). Use global-mean fallbacks for target encoders. Monitor the rate of unknown categories in production --- a sudden spike indicates a data pipeline change. Models that crash on unseen categories are a common source of production outages.
- Tree-based models are more robust to encoding choice, but not immune. The StreamFlow case study showed that gradient boosting AUCs varied by less than 1 percentage point across encoding strategies, while logistic regression AUCs varied by over 5 percentage points. Trees can compensate for suboptimal encoding by finding good splits, but they still benefit from encoding that compresses high-cardinality features and reduces noise.
If You Remember One Thing
Categorical encoding is a compression problem, not a formatting problem. The goal is not to convert strings to numbers. The goal is to represent the information in a categorical feature as compactly and accurately as possible, without leaking the target, without blowing up dimensionality, and without losing the signal that domain knowledge tells you is there. The decision tree in this chapter --- type, cardinality, target relationship, model type --- gives you the framework. The case studies give you the intuition. The `category_encoders` library gives you the implementation. Use all three.
These takeaways summarize Chapter 7: Handling Categorical Data. Return to the chapter for full context.