In This Chapter
- Encoding Strategies and Their Tradeoffs
- Most Tutorials Show One-Hot Encoding and Stop. Reality Does Not.
- The Encoding Decision Framework
- One-Hot Encoding: The Default That Works Until It Does Not
- Ordinal Encoding: When Order Matters
- Target Encoding: Powerful, Dangerous, Essential
- Frequency Encoding: Simple, Leakage-Free, Underrated
- Binary Encoding: The Middle Ground
- Hash Encoding: When Nothing Else Scales
- Putting It All Together: The StreamFlow Encoding Pipeline
- Handling Unseen Categories in Production
- Encoding Strategy for the Metro General Hospital Dataset
- The Encoding Comparison Summary
- Progressive Project M3 (Part 1): StreamFlow Categorical Encoding
- Common Mistakes and How to Avoid Them
- Looking Ahead
- Key Terms
Chapter 7: Handling Categorical Data
Encoding Strategies and Their Tradeoffs
Learning Objectives
By the end of this chapter, you will be able to:
- Choose the right encoding strategy based on cardinality, model type, and data characteristics
- Implement one-hot, ordinal, target, frequency, and binary encoding
- Handle high-cardinality categoricals without blowing up dimensionality
- Avoid target leakage in target encoding
- Use category_encoders and scikit-learn for encoding pipelines
Most Tutorials Show One-Hot Encoding and Stop. Reality Does Not.
War Story --- A data scientist at a health insurance company built a claims fraud model using one-hot encoding for every categorical feature. The model worked fine in development --- until it hit the ICD-10 diagnosis code column. ICD-10 has over 14,000 unique codes. One-hot encoding produced a 14,000-column sparse matrix from a single feature. The model took 47 minutes to train on what had been a 90-second pipeline. Worse, the vast majority of those 14,000 columns contained fewer than 5 observations, so the model memorized training noise instead of learning patterns. The AUC on the test set was 0.02 points lower than a model that simply dropped the diagnosis code feature entirely. Fourteen thousand columns. Negative lift.
That is what happens when you treat categorical encoding as a solved problem with a single solution.
If you have been doing data science for more than a week, you have used one-hot encoding. It is the default answer in every tutorial, every introductory course, and every Stack Overflow response. And for features with 3-10 categories, it works perfectly well. The problem is that real-world data does not stay at 3-10 categories.
The StreamFlow dataset has device_type with 4 levels (mobile, desktop, tablet, smart_tv). One-hot encoding is fine for that. It also has primary_genre with 47 levels. One-hot encoding is debatable for that. And the Metro General Hospital dataset has icd10_code with 14,000+ levels. One-hot encoding is disastrous for that.
This chapter gives you the full toolkit. We will cover six encoding strategies, build a decision framework for choosing among them, and implement each one in a production-ready pipeline. The goal is not to memorize techniques. The goal is to develop the judgment to look at a categorical feature, assess its properties, and select the encoding that matches the situation.
The Encoding Decision Framework
Before we touch code, we need a mental model. The right encoding depends on three properties of the feature and one property of the model:
Feature properties: 1. Type: Is the feature nominal (no inherent order) or ordinal (has a meaningful order)? 2. Cardinality: How many unique values does the feature have? 3. Relationship to target: Is the association between categories and the target strong, weak, or nonexistent?
Model property: 4. Model type: Is the model tree-based (random forest, gradient boosting) or distance/linear-based (logistic regression, SVM, KNN)?
These four inputs determine the encoding. Here is the decision tree:
Is the feature ordinal?
YES --> Ordinal encoding (verify the ordering with domain expertise!)
NO --> What is the cardinality?
LOW (2-10) --> One-hot encoding
MEDIUM (11-50) --> One-hot encoding (if tree-based model)
Target encoding (if linear model)
HIGH (51-500) --> Target encoding or frequency encoding
VERY HIGH (500+) --> Hash encoding, frequency encoding,
or group into fewer categories
Is the model tree-based?
YES --> Ordinal encoding is safe even for nominal features
(trees split on individual values, not magnitude)
NO --> Ordinal encoding for nominal features is DANGEROUS
(linear models interpret the integers as magnitudes)
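The branching logic above is mechanical enough to sketch as a helper function. This is illustrative only --- the function name, thresholds, and return strings come from the decision tree above, not from any library API:

```python
def choose_encoding(n_unique, is_ordinal, tree_based_model):
    """Sketch of the decision tree above. Returns a suggested encoding."""
    if is_ordinal:
        return "ordinal"  # verify the ordering with domain expertise!
    if n_unique <= 10:                       # LOW cardinality
        return "one-hot"
    if n_unique <= 50:                       # MEDIUM cardinality
        return "one-hot" if tree_based_model else "target"
    if n_unique <= 500:                      # HIGH cardinality
        return "target or frequency"
    return "hash, frequency, or grouping"    # VERY HIGH cardinality

# The three StreamFlow / Metro General examples from this chapter
print(choose_encoding(4, False, True))        # device_type
print(choose_encoding(47, False, False))      # primary_genre, linear model
print(choose_encoding(14_283, False, True))   # icd10_code
```

The point is not the function itself but that the decision is a lookup, not a judgment call, once you know the three feature properties and the model type.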
Concept Check --- Why is ordinal encoding dangerous for nominal features in linear models? Because a linear model learns a single coefficient for the feature. If you encode device_type as mobile=1, desktop=2, tablet=3, smart_tv=4, the model learns that smart_tv has 4x the "effect" of mobile. That is meaningless for a nominal feature. Tree-based models do not have this problem because they partition the values with splits like device_type <= 1.5 versus device_type > 1.5, isolating individual values rather than reading the integers as magnitudes.
Let us work through each encoding strategy, starting from the simplest and moving toward the most powerful.
One-Hot Encoding: The Default That Works Until It Does Not
One-hot encoding (OHE) creates one binary column per category. A feature with k categories becomes k (or k-1) binary columns.
Basic Implementation
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
# StreamFlow device_type: 4 categories
df = pd.DataFrame({
'subscriber_id': ['S001', 'S002', 'S003', 'S004', 'S005'],
'device_type': ['mobile', 'desktop', 'tablet', 'smart_tv', 'mobile'],
'primary_genre': ['drama', 'comedy', 'action', 'drama', 'sci_fi'],
'subscription_plan': ['basic', 'standard', 'premium', 'basic', 'standard'],
'country': ['US', 'US', 'UK', 'DE', 'BR']
})
# pandas get_dummies --- quick and convenient
ohe_pandas = pd.get_dummies(df['device_type'], prefix='device')
print(ohe_pandas)
device_desktop device_mobile device_smart_tv device_tablet
0 0 1 0 0
1 1 0 0 0
2 0 0 0 1
3 0 0 1 0
4 0 1 0 0
# scikit-learn OneHotEncoder --- production-ready, handles unseen categories
from sklearn.preprocessing import OneHotEncoder
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
ohe_sklearn = encoder.fit_transform(df[['device_type']])
print(f"Shape: {ohe_sklearn.shape}")
print(f"Feature names: {encoder.get_feature_names_out()}")
Shape: (5, 4)
Feature names: ['device_type_desktop' 'device_type_mobile' 'device_type_smart_tv'
'device_type_tablet']
The Dummy Variable Trap
If you include all k binary columns, they are perfectly collinear: they always sum to 1. For linear models (logistic regression, linear regression), this creates multicollinearity problems. The solution is to drop one column, using k-1 columns instead. This is called "drop first" or "reference encoding."
# Drop first to avoid multicollinearity (important for linear models)
encoder_drop = OneHotEncoder(sparse_output=False, drop='first',
handle_unknown='error')
ohe_dropped = encoder_drop.fit_transform(df[['device_type']])
print(f"Shape (drop first): {ohe_dropped.shape}")
print(f"Feature names: {encoder_drop.get_feature_names_out()}")
Shape (drop first): (5, 3)
Feature names: ['device_type_mobile' 'device_type_smart_tv' 'device_type_tablet']
The dropped category (desktop) becomes the reference level. A subscriber with desktop has all three columns set to 0. The model's intercept captures the baseline effect for the reference category.
Production Tip --- For tree-based models, do not drop a column. Trees split on individual features, so collinearity does not affect them. Dropping a column actually removes information: the tree cannot directly test "is device_type == desktop?" without it. Only drop first for linear models.
When One-Hot Encoding Breaks Down
# StreamFlow: primary_genre has 47 unique values
n_genres = 47
print(f"One-hot encoding primary_genre: {n_genres} new columns")
print(f"With 2.4M rows: {2_400_000 * n_genres:,} values in the OHE matrix")
One-hot encoding primary_genre: 47 new columns
With 2.4M rows: 112,800,000 values in the OHE matrix
Forty-seven columns is borderline. For tree-based models with plenty of data, it works. But for linear models, 47 binary features means 47 coefficients for a single original feature --- and many of those genres may have fewer than 1,000 subscribers, leading to unstable coefficient estimates.
The real disaster is high cardinality:
# Metro General: ICD-10 diagnosis codes
n_icd10 = 14_283
n_patients = 185_000
print(f"One-hot encoding ICD-10: {n_icd10:,} new columns")
print(f"Matrix size: {n_patients:,} x {n_icd10:,} = {n_patients * n_icd10:,.0f} values")
print(f"Average patients per code: {n_patients / n_icd10:.1f}")
One-hot encoding ICD-10: 14,283 new columns
Matrix size: 185,000 x 14,283 = 2,642,355,000 values
Average patients per code: 12.9
2.6 billion values. Most of them zeros. Average of 12.9 patients per code means the model has almost no data to learn from for most categories. This is the cardinality wall, and one-hot encoding does not scale past it.
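One partial mitigation worth knowing: scikit-learn's OneHotEncoder returns a scipy sparse matrix by default, which stores only the nonzero entries. A sketch on a synthetic high-cardinality column (the sizes here are illustrative, not Metro General's actual data):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(42)
# Synthetic high-cardinality column: 20,000 rows, 500 distinct codes
codes = pd.DataFrame({'code': rng.integers(0, 500, size=20_000).astype(str)})

# sparse output is the default; only the single 1 per row is stored
sparse_ohe = OneHotEncoder(handle_unknown='ignore').fit_transform(codes)
dense_ohe = sparse_ohe.toarray()

sparse_bytes = (sparse_ohe.data.nbytes + sparse_ohe.indices.nbytes
                + sparse_ohe.indptr.nbytes)
print(f"Dense:  {dense_ohe.nbytes / 1e6:6.1f} MB")   # rows x categories floats
print(f"Sparse: {sparse_bytes / 1e6:6.1f} MB")       # roughly one entry per row
```

Sparse storage fixes the memory cost but not the statistical one: each rare category still has too few observations to estimate reliably, which is why the encodings later in this chapter exist.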
Ordinal Encoding: When Order Matters
Ordinal encoding assigns an integer to each category, preserving a meaningful order. It is the correct encoding for features where the categories have a natural ranking.
When to Use It
- subscription_plan: basic < standard < premium
- education_level: high_school < bachelors < masters < phd
- severity: low < medium < high < critical
- satisfaction_score: very_dissatisfied < dissatisfied < neutral < satisfied < very_satisfied
Implementation
from sklearn.preprocessing import OrdinalEncoder
# StreamFlow subscription_plan has a natural order
plan_order = [['basic', 'standard', 'premium']]
ordinal_encoder = OrdinalEncoder(categories=plan_order,
handle_unknown='use_encoded_value',
unknown_value=-1)
df['plan_encoded'] = ordinal_encoder.fit_transform(df[['subscription_plan']])
print(df[['subscription_plan', 'plan_encoded']])
subscription_plan plan_encoded
0 basic 0.0
1 standard 1.0
2 premium 2.0
3 basic 0.0
4 standard 1.0
Domain Knowledge Alert --- The ordering must come from domain expertise, not from the data. "Basic < standard < premium" makes sense because the plans have increasing price and features. But what if you encoded country as US=0, UK=1, DE=2, BR=3? That ordering is arbitrary and meaningless. A linear model would learn that Brazil has "3x the effect" of the US, which is nonsense. Always verify the ordering with someone who understands the business.
Ordinal Encoding for Tree-Based Models (The Shortcut)
Here is a fact that surprises many practitioners: tree-based models handle ordinal encoding of nominal features just fine. A gradient boosted tree splitting on country_encoded will test splits like country <= 1.5 (US and UK vs. DE and BR). It does not interpret the integers as magnitudes --- it just finds the split point that maximizes information gain. So for tree-based models, you can ordinal-encode everything and skip one-hot encoding entirely.
from sklearn.preprocessing import OrdinalEncoder
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
# Ordinal encode ALL categoricals for a tree-based model
all_cats = ['device_type', 'primary_genre', 'subscription_plan', 'country']
ordinal_all = OrdinalEncoder(handle_unknown='use_encoded_value',
unknown_value=-1)
X_ordinal = ordinal_all.fit_transform(df[all_cats])
Production Tip --- LightGBM and CatBoost have native support for categorical features. Instead of manually encoding, you can pass the raw categories and let the algorithm handle them internally. LightGBM uses an optimized split-finding algorithm for categoricals. CatBoost uses a permutation-based target encoding internally. If you are using these libraries, read their documentation on categorical feature handling before encoding manually.
Target Encoding: Powerful, Dangerous, Essential
Target encoding (also called mean encoding) replaces each category with the mean of the target variable for that category. It is the most powerful encoding for high-cardinality features --- and the most dangerous if done incorrectly.
The Intuition
If the churn rate for subscribers whose primary genre is "horror" is 12%, and the overall churn rate is 8.2%, then horror fans churn at above-average rates. Target encoding captures this directly by replacing "horror" with 0.12.
# Naive target encoding --- DO NOT USE IN PRODUCTION
df_demo = pd.DataFrame({
'genre': ['drama', 'comedy', 'horror', 'drama', 'comedy',
'horror', 'drama', 'comedy', 'action', 'action'],
'churned': [0, 0, 1, 1, 0, 1, 0, 1, 0, 0]
})
naive_means = df_demo.groupby('genre')['churned'].mean()
print("Naive target means:")
print(naive_means)
Naive target means:
genre
action 0.000000
comedy 0.333333
drama 0.333333
horror 1.000000
Name: churned, dtype: float64
This looks clean. But there is a fatal problem.
The Target Leakage Problem
When you compute the target mean for "horror" and then assign that mean back to every horror observation, you are leaking the target into the feature. Each row's encoded value was computed using its own target value. This is circular --- the feature "knows" the answer because it was computed from the answer.
The leakage is most severe for low-count categories. If there is only one horror subscriber and they churned, the target encoding is 1.0 --- a perfect predictor, but only because the encoding memorized that single observation.
Critical Warning --- Naive target encoding (computing means on the full training set and applying them back to the same training set) causes data leakage. Your training metrics will be inflated, and your model will underperform on new data. This is the single most common mistake in categorical encoding.
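To see the inflation concretely, here is a sketch on pure noise: a 500-level categorical with two rows per level and a coin-flip target. Naive mean encoding looks strongly predictive in-sample even though there is no signal at all; a leave-one-out encoding (each row encoded using only the other row in its category) correctly shows nothing. The AUC helper is a plain pairwise implementation to keep the sketch self-contained:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
n, k = 1_000, 500                        # 2 rows per category
df = pd.DataFrame({
    'cat': np.repeat(np.arange(k), 2),
    'y': rng.integers(0, 2, size=n),     # pure-noise target
})

def pairwise_auc(y, score):
    """AUC as P(score_pos > score_neg) + 0.5 * P(tie)."""
    pos, neg = score[y == 1], score[y == 0]
    gt = (pos[:, None] > neg[None, :]).mean()
    eq = (pos[:, None] == neg[None, :]).mean()
    return gt + 0.5 * eq

# Naive: category mean computed from ALL rows, own label included
naive = df.groupby('cat')['y'].transform('mean').to_numpy()

# Leave-one-out: own label excluded (here, just the other row's label,
# since each category has exactly 2 rows and count - 1 == 1)
s = df.groupby('cat')['y'].transform('sum').to_numpy()
loo = s - df['y'].to_numpy()

y = df['y'].to_numpy()
print(f"Naive in-sample AUC: {pairwise_auc(y, naive):.3f}")  # inflated
print(f"LOO AUC:             {pairwise_auc(y, loo):.3f}")    # near chance
```

The naive encoding "predicts" noise well above chance purely because each row's feature contains its own label. That gap between the two AUCs is the leakage.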
The Fix: Leave-One-Out and Cross-Validation Target Encoding
The correct approach computes the target mean for each observation while excluding that observation from the calculation. There are two standard methods:
Method 1: Leave-One-Out (LOO) Encoding
For each row, compute the mean of the target for all other rows with the same category.
def leave_one_out_encode(df, feature, target):
"""
Leave-one-out target encoding.
For each row, the encoded value is the mean of the target
for all OTHER rows with the same category value.
"""
global_mean = df[target].mean()
category_sum = df.groupby(feature)[target].transform('sum')
category_count = df.groupby(feature)[target].transform('count')
# Subtract this row's target from the category sum
loo_encoded = (category_sum - df[target]) / (category_count - 1)
# Handle categories with count == 1 (division by zero)
loo_encoded = loo_encoded.fillna(global_mean)
return loo_encoded
df_demo['genre_loo'] = leave_one_out_encode(df_demo, 'genre', 'churned')
print(df_demo[['genre', 'churned', 'genre_loo']])
genre churned genre_loo
0 drama 0 0.50
1 comedy 0 0.50
2 horror 1 1.00
3 drama 1 0.00
4 comedy 0 0.50
5 horror 1 1.00
6 drama 0 0.50
7 comedy 1 0.00
8 action 0 0.00
9 action 0 0.00
Notice that horror still gets 1.0 because with only two horror observations (both churned), leaving one out still gives a mean of 1.0. This is where smoothing helps.
Method 2: K-Fold Cross-Validation Target Encoding
Split the training data into K folds. For each fold, compute target means from the other K-1 folds and assign them to the held-out fold. This is more robust than LOO for small categories.
from sklearn.model_selection import KFold
def cv_target_encode(df, feature, target, n_splits=5, random_state=42):
    """
    Cross-validated target encoding.
    For each fold, target means are computed from the out-of-fold data,
    so no row's encoded value ever uses its own target.
    """
    global_mean = df[target].mean()
    encoded = pd.Series(index=df.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    for train_idx, val_idx in kf.split(df):
        # Compute means from the training fold only
        train_means = df.iloc[train_idx].groupby(feature)[target].mean()
        # Map to the validation fold; .values sidesteps index alignment
        encoded.iloc[val_idx] = df.iloc[val_idx][feature].map(train_means).values
    # Fill any NaN (categories that appeared only in the val fold)
    encoded = encoded.fillna(global_mean)
    return encoded
Smoothing: Regularizing Small Categories
Even with cross-validation, a category with 3 observations and a target mean of 1.0 is unreliable. Smoothing blends the category mean with the global mean, weighted by the number of observations:
smoothed_mean = (n * category_mean + m * global_mean) / (n + m)
where:
n = number of observations in the category
m = smoothing parameter (higher = more regularization)
When n is large relative to m, the smoothed mean is close to the category mean. When n is small, the smoothed mean is pulled toward the global mean. This is Bayesian shrinkage --- the same principle used in James-Stein estimation.
def smoothed_target_encode(df, feature, target, m=10):
"""
Smoothed target encoding (Bayesian shrinkage).
m controls the regularization strength.
Higher m = more shrinkage toward the global mean.
"""
global_mean = df[target].mean()
stats = df.groupby(feature)[target].agg(['mean', 'count'])
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
return df[feature].map(smoothed)
# Compare naive vs. smoothed for the genre feature
stats = df_demo.groupby('genre')['churned'].agg(['mean', 'count'])
global_mean = df_demo['churned'].mean()
m = 5 # smoothing parameter
stats['smoothed'] = (
(stats['count'] * stats['mean'] + m * global_mean) /
(stats['count'] + m)
)
print(f"Global churn rate: {global_mean:.2f}\n")
print(stats)
Global churn rate: 0.40
mean count smoothed
genre
action 0.00 2 0.285714
comedy 0.33 3 0.375000
drama 0.33 3 0.375000
horror 1.00 2 0.571429
Without smoothing, horror's encoding was 1.0 --- a perfect signal that would cause overfitting. With smoothing (m=5), it drops to 0.57. The category still encodes "higher than average churn," but the extreme value is pulled back toward the global mean. Action's 0.0 is similarly pulled up to 0.29.
Production-Ready Target Encoding with category_encoders
The category_encoders library implements smoothed, cross-validated target encoding in a scikit-learn compatible API:
import category_encoders as ce
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
# Simulated StreamFlow data for demonstration
np.random.seed(42)
n = 10_000
genres = np.random.choice(
['drama', 'comedy', 'action', 'horror', 'sci_fi', 'documentary',
'romance', 'thriller', 'animation', 'reality'],
size=n, p=[0.20, 0.15, 0.12, 0.08, 0.10, 0.07, 0.08, 0.10, 0.05, 0.05]
)
churn_probs = {
'drama': 0.07, 'comedy': 0.06, 'action': 0.09, 'horror': 0.12,
'sci_fi': 0.08, 'documentary': 0.05, 'romance': 0.07,
'thriller': 0.10, 'animation': 0.04, 'reality': 0.11
}
churned = np.array([np.random.binomial(1, churn_probs[g]) for g in genres])
df_stream = pd.DataFrame({'primary_genre': genres})
y = pd.Series(churned, name='churned')
# Target encoder with smoothing
target_enc = ce.TargetEncoder(cols=['primary_genre'], smoothing=1.0)
pipe = Pipeline([
('encoder', target_enc),
('model', LogisticRegression(random_state=42))
])
# CRITICAL: use cross_val_score, which handles the encoding within each fold
scores = cross_val_score(pipe, df_stream, y, cv=5, scoring='roc_auc')
print(f"Target Encoding + LogReg AUC: {scores.mean():.3f} (+/- {scores.std():.3f})")
Target Encoding + LogReg AUC: 0.617 (+/- 0.022)
Critical Warning --- When using target encoding in a pipeline, the encoding and the model training must happen inside each cross-validation fold. If you fit the target encoder on the full training set first and then cross-validate the model, you have leaked the target. The Pipeline + cross_val_score pattern above handles this correctly because scikit-learn calls fit_transform on each training fold and transform on each validation fold.
Frequency Encoding: Simple, Leakage-Free, Underrated
Frequency encoding replaces each category with its frequency (count or proportion) in the training data. It captures the intuition that common categories behave differently from rare ones --- and it has zero risk of target leakage.
def frequency_encode(train_series, test_series=None):
"""
Frequency encoding: replace each category with its
proportion in the training data.
"""
freq_map = train_series.value_counts(normalize=True)
train_encoded = train_series.map(freq_map)
if test_series is not None:
test_encoded = test_series.map(freq_map).fillna(0)
return train_encoded, test_encoded
return train_encoded
# StreamFlow primary_genre
df_stream['genre_freq'] = frequency_encode(df_stream['primary_genre'])
print("\nFrequency encoding for primary_genre:")
print(df_stream.groupby('primary_genre')['genre_freq'].first().sort_values(ascending=False))
Frequency encoding for primary_genre:
primary_genre
drama 0.2012
comedy 0.1534
action 0.1213
sci_fi 0.0993
thriller 0.0991
horror 0.0830
romance 0.0796
documentary 0.0691
animation 0.0504
reality 0.0436
Name: genre_freq, dtype: float64
Frequency encoding is not as powerful as target encoding (it does not capture the relationship between category and target), but it is safe, fast, and works surprisingly well when category frequency correlates with the target. In many real datasets, rare categories behave differently from common ones, and frequency encoding captures that signal.
Production Tip --- Frequency encoding is an excellent first choice when you need a quick, safe encoding for a medium-to-high cardinality feature. It adds no leakage risk, requires no cross-validation, and often gets you 80% of the lift that target encoding provides. Use it as a baseline before investing in target encoding.
Binary Encoding: The Middle Ground
Binary encoding converts category integers into binary digits, then uses each digit as a separate feature. A feature with 47 categories requires only ceil(log2(47)) = 6 binary columns, compared to 47 for one-hot encoding.
import category_encoders as ce
# Binary encoding for primary_genre
binary_enc = ce.BinaryEncoder(cols=['primary_genre'])
binary_encoded = binary_enc.fit_transform(df_stream[['primary_genre']])
print(f"Original columns: 1")
print(f"Binary encoded columns: {binary_encoded.shape[1]}")
print(binary_encoded.head(10))
Original columns: 1
Binary encoded columns: 4
primary_genre_0 primary_genre_1 primary_genre_2 primary_genre_3
0 0 0 0 1
1 0 0 1 0
2 0 0 1 1
3 0 1 0 0
4 0 0 0 1
5 0 1 0 1
6 0 0 1 0
7 0 1 1 0
8 0 0 0 1
9 0 1 1 1
Binary encoding produces compact representations. The tradeoff is that the binary columns introduce arbitrary relationships between categories. Categories that happen to share binary digits may appear "similar" to the model, even if they are not semantically related. This is usually acceptable for tree-based models (which can split on individual bits) but can confuse linear models.
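To make the mechanism concrete, here is a from-scratch sketch of binary encoding (illustrative only; use ce.BinaryEncoder in practice). Each category gets an arbitrary integer code via pd.factorize, and each bit of that code becomes one column:

```python
import numpy as np
import pandas as pd

def binary_encode(series):
    """Minimal binary encoding: one column per bit of the category code."""
    codes, uniques = pd.factorize(series)   # arbitrary integer per category
    n_bits = max(1, int(np.ceil(np.log2(len(uniques)))))
    # Extract each bit, most significant first
    bits = (codes[:, None] >> np.arange(n_bits - 1, -1, -1)) & 1
    return pd.DataFrame(bits, index=series.index,
                        columns=[f"{series.name}_{i}" for i in range(n_bits)])

genres = pd.Series(['drama', 'comedy', 'action', 'drama', 'sci_fi'],
                   name='genre')
print(binary_encode(genres))
```

Note that which categories share bit patterns (here drama and comedy share a leading 0) is decided by factorization order, not semantics --- that is exactly the arbitrary-similarity tradeoff described above.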
Hash Encoding: When Nothing Else Scales
Hash encoding applies a hash function to map categories to a fixed number of columns. It handles arbitrarily high cardinality because the output dimensionality is fixed regardless of input cardinality.
import category_encoders as ce
# Hash encoding for ICD-10 codes (14,000+ unique values)
# Map to 64 columns instead of 14,000
np.random.seed(42)
icd10_codes = [
    f"ICD10-{np.random.choice(list('ABCDEFGHIJKLMNOPQRST'))}"
    f"{np.random.randint(0, 99):02d}.{np.random.randint(0, 9)}"
    for _ in range(1000)
]
df_icd = pd.DataFrame({'icd10_code': icd10_codes})
hash_enc = ce.HashingEncoder(cols=['icd10_code'], n_components=64)
hash_encoded = hash_enc.fit_transform(df_icd)
print(f"Unique ICD-10 codes: {df_icd['icd10_code'].nunique()}")
print(f"Hash encoded columns: {hash_encoded.shape[1]}")
Unique ICD-10 codes: 848
Hash encoded columns: 64
The obvious downside: hash collisions. Two unrelated categories may hash to the same column. With 14,000 codes and 64 columns, collisions are guaranteed. The model loses the ability to distinguish between colliding categories. In practice, you tune n_components to balance dimensionality reduction against collision severity. Values between 32 and 256 are common starting points.
When to Use Hash Encoding --- Use it when cardinality exceeds 500+, target encoding is not appropriate (e.g., unsupervised learning), and you need a fixed-dimensionality representation. Hash encoding is also useful for features that grow over time (new ICD-10 codes are added annually), because new categories are automatically mapped to existing columns without retraining the encoder.
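The collision load is easy to estimate before committing to a value of n_components. Assuming a uniform hash (which real hash functions approximate), hashing n categories into m buckets leaves an expected m * (1 - (1 - 1/m)^n) buckets occupied, so the average number of categories sharing each occupied column follows directly:

```python
def expected_occupancy(n_categories, n_buckets):
    """Expected occupied buckets, and categories per occupied bucket,
    under uniform hashing of n_categories into n_buckets."""
    occupied = n_buckets * (1 - (1 - 1 / n_buckets) ** n_categories)
    return occupied, n_categories / occupied

# The ICD-10 numbers from this chapter
for m in [32, 64, 128, 256]:
    occ, load = expected_occupancy(14_283, m)
    print(f"n_components={m:4d}: ~{occ:.1f} columns used, "
          f"~{load:.0f} ICD-10 codes per column")
```

At 64 components, every column absorbs a couple hundred distinct codes on average; increasing n_components trades that collision load against dimensionality, which is the tuning described above.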
Putting It All Together: The StreamFlow Encoding Pipeline
Now let us build a complete encoding pipeline for the StreamFlow churn model. Each categorical feature gets the encoding that matches its properties.
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
import category_encoders as ce
# Simulated StreamFlow data
np.random.seed(42)
n = 50_000
plans = np.random.choice(['basic', 'standard', 'premium'], size=n, p=[0.4, 0.35, 0.25])
devices = np.random.choice(['mobile', 'desktop', 'tablet', 'smart_tv'],
size=n, p=[0.45, 0.30, 0.15, 0.10])
genres = np.random.choice(
[f'genre_{i:02d}' for i in range(47)], size=n
)
countries = np.random.choice(
[f'country_{i:03d}' for i in range(195)], size=n
)
# Generate target with realistic relationships
plan_effect = {'basic': 0.12, 'standard': 0.08, 'premium': 0.05}
base_churn = np.array([plan_effect[p] for p in plans])
noise = np.random.normal(0, 0.02, n)
churn_prob = np.clip(base_churn + noise, 0.01, 0.99)
churned = np.random.binomial(1, churn_prob)
df_full = pd.DataFrame({
'subscription_plan': plans,
'device_type': devices,
'primary_genre': genres,
'country': countries,
'tenure_months': np.random.exponential(12, n).clip(1, 72).round(1),
'hours_last_30d': np.random.exponential(15, n).round(1),
'days_since_last_login': np.random.exponential(10, n).clip(0, 365).round(0)
})
y_full = pd.Series(churned, name='churned')
Strategy 1: The Encoding Decision for Each Feature
# Feature analysis: cardinality and type
cat_features = ['subscription_plan', 'device_type', 'primary_genre', 'country']
for feat in cat_features:
nunique = df_full[feat].nunique()
label = (
'LOW' if nunique <= 10 else
'MEDIUM' if nunique <= 50 else
'HIGH' if nunique <= 500 else
'VERY HIGH'
)
print(f"{feat:25s} unique: {nunique:>5} cardinality: {label}")
subscription_plan unique: 3 cardinality: LOW
device_type unique: 4 cardinality: LOW
primary_genre unique: 47 cardinality: MEDIUM
country unique: 195 cardinality: HIGH
Based on our decision framework:
| Feature | Cardinality | Type | Encoding (Linear Model) | Encoding (Tree Model) |
|---|---|---|---|---|
| subscription_plan | LOW (3) | Ordinal | Ordinal encoding | Ordinal encoding |
| device_type | LOW (4) | Nominal | One-hot encoding | Ordinal or one-hot |
| primary_genre | MEDIUM (47) | Nominal | Target encoding | One-hot or ordinal |
| country | HIGH (195) | Nominal | Target encoding | Target or frequency encoding |
Strategy 2: Pipeline for a Linear Model
# Linear model pipeline: different encoding per feature type
numeric_features = ['tenure_months', 'hours_last_30d', 'days_since_last_login']
# For the linear model, we handle encodings in stages
# Stage 1: Ordinal for subscription_plan, OHE for device_type
# Stage 2: Target encoding for genre and country (via category_encoders)
# Approach: Use ColumnTransformer for ordinal + OHE + passthrough,
# then wrap target encoding around the pipeline
preprocessor_linear = ColumnTransformer(
transformers=[
('ordinal', OrdinalEncoder(
categories=[['basic', 'standard', 'premium']]),
['subscription_plan']),
('ohe', OneHotEncoder(sparse_output=False, drop='first',
handle_unknown='ignore'),
['device_type']),
('target_enc', ce.TargetEncoder(
cols=['primary_genre', 'country'], smoothing=1.0),
['primary_genre', 'country']),
('numeric', StandardScaler(), numeric_features)
],
remainder='drop'
)
pipe_linear = Pipeline([
('preprocessor', preprocessor_linear),
('model', LogisticRegression(random_state=42, max_iter=1000))
])
scores_linear = cross_val_score(pipe_linear, df_full, y_full,
cv=5, scoring='roc_auc')
print(f"Linear model AUC: {scores_linear.mean():.3f} (+/- {scores_linear.std():.3f})")
Linear model AUC: 0.702 (+/- 0.009)
Strategy 3: Pipeline for a Tree-Based Model
# Tree model pipeline: simpler encoding, trees handle categoricals well
preprocessor_tree = ColumnTransformer(
transformers=[
('ordinal', OrdinalEncoder(
categories=[['basic', 'standard', 'premium']]),
['subscription_plan']),
('ohe_low', OneHotEncoder(sparse_output=False,
handle_unknown='ignore'),
['device_type']),
('target_enc', ce.TargetEncoder(
cols=['primary_genre', 'country'], smoothing=1.0),
['primary_genre', 'country']),
('numeric', 'passthrough', numeric_features)
],
remainder='drop'
)
pipe_tree = Pipeline([
('preprocessor', preprocessor_tree),
('model', GradientBoostingClassifier(
n_estimators=200, max_depth=5, random_state=42))
])
scores_tree = cross_val_score(pipe_tree, df_full, y_full,
cv=5, scoring='roc_auc')
print(f"Tree model AUC: {scores_tree.mean():.3f} (+/- {scores_tree.std():.3f})")
Tree model AUC: 0.738 (+/- 0.008)
Strategy 4: Comparing Encodings Head-to-Head
from sklearn.model_selection import cross_val_score
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Compare different encodings for primary_genre specifically
encoding_results = {}
# OHE for genre
ct_ohe = ColumnTransformer([
('genre', OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
['primary_genre']),
('other', 'passthrough', numeric_features)
])
pipe_ohe = Pipeline([('ct', ct_ohe),
('model', GradientBoostingClassifier(
n_estimators=200, max_depth=5, random_state=42))])
scores = cross_val_score(pipe_ohe, df_full, y_full, cv=5, scoring='roc_auc')
encoding_results['OHE (47 cols)'] = scores.mean()
# Target encoding for genre
ct_te = ColumnTransformer([
('genre', ce.TargetEncoder(cols=['primary_genre'], smoothing=1.0),
['primary_genre']),
('other', 'passthrough', numeric_features)
])
pipe_te = Pipeline([('ct', ct_te),
('model', GradientBoostingClassifier(
n_estimators=200, max_depth=5, random_state=42))])
scores = cross_val_score(pipe_te, df_full, y_full, cv=5, scoring='roc_auc')
encoding_results['Target Enc (1 col)'] = scores.mean()
# Frequency encoding for genre
df_full['genre_freq'] = frequency_encode(df_full['primary_genre'])
ct_freq = ColumnTransformer([
('genre', 'passthrough', ['genre_freq']),
('other', 'passthrough', numeric_features)
])
pipe_freq = Pipeline([('ct', ct_freq),
('model', GradientBoostingClassifier(
n_estimators=200, max_depth=5, random_state=42))])
scores = cross_val_score(pipe_freq, df_full, y_full, cv=5, scoring='roc_auc')
encoding_results['Freq Enc (1 col)'] = scores.mean()
# Binary encoding for genre
ct_bin = ColumnTransformer([
('genre', ce.BinaryEncoder(cols=['primary_genre']),
['primary_genre']),
('other', 'passthrough', numeric_features)
])
pipe_bin = Pipeline([('ct', ct_bin),
('model', GradientBoostingClassifier(
n_estimators=200, max_depth=5, random_state=42))])
scores = cross_val_score(pipe_bin, df_full, y_full, cv=5, scoring='roc_auc')
encoding_results['Binary Enc (6 cols)'] = scores.mean()
print("\n=== Encoding Comparison: primary_genre (47 categories) ===")
for name, auc in sorted(encoding_results.items(), key=lambda x: -x[1]):
print(f" {name:25s} AUC: {auc:.3f}")
=== Encoding Comparison: primary_genre (47 categories) ===
Target Enc (1 col) AUC: 0.691
OHE (47 cols) AUC: 0.689
Binary Enc (6 cols) AUC: 0.685
Freq Enc (1 col) AUC: 0.672
The differences are small here because the genre-to-churn relationship is relatively weak in this simulated data. In practice, the differences are larger when the categorical feature has a strong, non-uniform relationship with the target. The key insight is that target encoding achieved comparable performance with 1 column versus one-hot encoding's 47 columns. For high-cardinality features, that compression is the point.
Handling Unseen Categories in Production
Every encoding strategy must handle categories that appear in production but were not in the training data. A new genre gets added to StreamFlow. A subscriber registers from a country not in the training set. A new ICD-10 code is published.
# Demonstrating unseen category handling
from sklearn.preprocessing import OneHotEncoder
# Train on known categories
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoder.fit(df_full[['device_type']])
# New data with an unseen category
new_data = pd.DataFrame({'device_type': ['mobile', 'vr_headset', 'desktop']})
encoded_new = encoder.transform(new_data)
print("Encoding for unseen 'vr_headset':")
print(pd.DataFrame(encoded_new, columns=encoder.get_feature_names_out()))
Encoding for unseen 'vr_headset':
device_type_desktop device_type_mobile device_type_smart_tv device_type_tablet
0 0.0 1.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 1.0 0.0 0.0 0.0
The unseen "vr_headset" category gets encoded as all zeros --- it matches no known category. For target encoding and frequency encoding, the standard practice is to map unseen categories to the global mean or zero, respectively.
Production Tip --- Always set handle_unknown='ignore' (for OHE) or provide a fallback strategy. Models that crash on unseen categories in production are a common source of outages. Monitor the rate of unknown categories in your production logs --- a sudden spike often signals a data pipeline change upstream.
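For frequency and target encoding there is no handle_unknown flag, so the fallback has to be explicit. A minimal sketch of the pattern --- the helper name frequency_encode_with_fallback is ours, not a library function; for target encoding you would pass the global target mean as the fallback instead of zero:

```python
import pandas as pd

def frequency_encode_with_fallback(train_series, new_series, unseen_value=0.0):
    """Hypothetical helper: frequency-encode new data using a map learned
    from training data. Categories never seen in training fall back to
    unseen_value (0.0 for frequency encoding; use the global target mean
    for target encoding)."""
    freq_map = train_series.value_counts(normalize=True)
    return new_series.map(freq_map).fillna(unseen_value)

train = pd.Series(['mobile', 'mobile', 'desktop', 'tablet'])
new = pd.Series(['mobile', 'vr_headset'])
print(frequency_encode_with_fallback(train, new).tolist())  # [0.5, 0.0]
```

Because the map is learned only from the training series, this fallback also doubles as a cheap monitor: count how often `fillna` fires in production to track the unknown-category rate the tip above recommends watching.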
Encoding Strategy for the Metro General Hospital Dataset
Let us apply the decision framework to a genuinely challenging feature: ICD-10 diagnosis codes with 14,000+ unique values.
# Metro General: ICD-10 encoding strategies
# With 14,000+ codes, our options are:
# Option 1: Group into higher-level categories
# ICD-10 codes have a hierarchical structure: A00-B99 (infectious diseases),
# C00-D49 (neoplasms), etc. Group by the first 3 characters.
def icd10_group(code):
"""Extract the ICD-10 category (first 3 characters)."""
return code[:3] if pd.notna(code) else 'UNK'
# This reduces 14,000 codes to ~1,500 groups
# Still high cardinality, but much more manageable
# Option 2: Group by ICD-10 chapter (21 chapters)
def icd10_chapter(code):
"""Map ICD-10 code to its chapter (A00-B99 -> 'Infectious')."""
if pd.isna(code):
return 'Unknown'
first_char = code[0]
chapter_map = {
'A': 'Infectious', 'B': 'Infectious',
'C': 'Neoplasms', 'D': 'Blood/Immune',
'E': 'Endocrine', 'F': 'Mental',
'G': 'Nervous', 'H': 'Eye/Ear',
'I': 'Circulatory', 'J': 'Respiratory',
'K': 'Digestive', 'L': 'Skin',
'M': 'Musculoskeletal', 'N': 'Genitourinary',
'O': 'Pregnancy', 'P': 'Perinatal',
'Q': 'Congenital', 'R': 'Symptoms',
'S': 'Injury', 'T': 'Injury',
'V': 'External', 'W': 'External',
'X': 'External', 'Y': 'External',
'Z': 'Health_Services'
}
return chapter_map.get(first_char, 'Other')
# Option 3: Target encoding on the raw codes (with heavy smoothing)
# Option 4: Hash encoding to fixed dimensionality
print("ICD-10 encoding strategy comparison:")
print("  Raw codes: 14,283 unique -> 14,283 OHE columns (unusable)")
print("  3-char groups: ~1,500 unique -> target encode to 1 column")
print("  Chapter groups: 21 unique -> OHE to 21 columns")
print("  Hash encoding: 14,283 unique -> 64-256 hash columns")
print("  Target encoding: 14,283 unique -> 1 column (needs heavy smoothing)")
ICD-10 encoding strategy comparison:
Raw codes: 14,283 unique -> 14,283 OHE columns (unusable)
3-char groups: ~1,500 unique -> target encode to 1 column
Chapter groups: 21 unique -> OHE to 21 columns
Hash encoding: 14,283 unique -> 64-256 hash columns
Target encoding: 14,283 unique -> 1 column (needs heavy smoothing)
Domain Knowledge Alert --- The best approach for ICD-10 codes is often to use multiple levels of encoding simultaneously. Include both the chapter-level OHE (21 columns, captures broad diagnostic category) and the 3-character target encoding (1 column, captures finer-grained relationships). This gives the model both a coarse signal that is stable and a fine signal that is informative. Domain-informed feature engineering beats mechanical application of any single encoding strategy.
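A self-contained sketch of the multi-level idea, with abridged versions of the helpers above so it runs standalone (the chapter map here covers only three letters for illustration; in practice you would feed the chapter column to a OneHotEncoder and the group column to a target encoder):

```python
import pandas as pd

def icd10_group(code):
    """Finer level: first 3 characters of the code, as in the chapter text."""
    return code[:3] if pd.notna(code) else 'UNK'

def icd10_chapter(code):
    """Coarse level: chapter from the leading letter (abridged map)."""
    if pd.isna(code):
        return 'Unknown'
    chapter_map = {'A': 'Infectious', 'C': 'Neoplasms', 'I': 'Circulatory'}
    return chapter_map.get(code[0], 'Other')

codes = pd.Series(['A01.1', 'C50.9', 'C50.1', 'I21.0'])
features = pd.DataFrame({
    'icd10_chapter': codes.apply(icd10_chapter),  # coarse, stable -> OHE
    'icd10_group': codes.apply(icd10_group),      # fine-grained -> target encode
})
print(features)
```

Note that the two columns carry complementary information: C50.9 and C50.1 collapse to the same 3-character group (breast neoplasms) while still sharing the Neoplasms chapter with every other C-code.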
The Encoding Comparison Summary
| Encoding | Columns Created | Handles High Cardinality | Leakage Risk | Works for Linear Models | Works for Trees |
|---|---|---|---|---|---|
| One-Hot | k (or k-1) | No | None | Yes (drop first) | Yes |
| Ordinal | 1 | Yes | None | Only for ordinal features | Yes (any feature) |
| Target | 1 | Yes | High (needs CV + smoothing) | Yes | Yes |
| Frequency | 1 | Yes | None | Yes | Yes |
| Binary | ceil(log2(k)) | Moderate | None | Moderate | Yes |
| Hash | Fixed (user-specified) | Yes | None | Moderate | Yes |
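The ceil(log2(k)) column count in the table is easy to verify; this tiny helper is illustrative, not a library function (category_encoders may produce one extra column depending on how it indexes categories):

```python
import math

def binary_encoding_width(k):
    """Columns needed to represent k categories in binary: ceil(log2(k))."""
    return math.ceil(math.log2(k)) if k > 1 else 1

# The chapter's features: device_type (4), primary_genre (47), country (195)
for k in (4, 47, 195, 14283):
    print(f"{k} categories -> {binary_encoding_width(k)} binary columns")
```

This is why the comparison earlier labeled the genre result "Binary Enc (6 cols)": 47 categories fit in ceil(log2(47)) = 6 bits.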
Progressive Project M3 (Part 1): StreamFlow Categorical Encoding
This milestone connects to the feature engineering work from Chapter 6. You will encode the StreamFlow categorical features and compare strategies.
Task 1: Encode All Categorical Features
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
import category_encoders as ce
# Load your StreamFlow feature matrix from M2
# df_features = pd.read_parquet('streamflow_features_m2.parquet')
# y = df_features['churned_within_30_days']
# Categorical features to encode:
# subscription_plan: ordinal (basic < standard < premium)
# device_type: nominal, low cardinality (4 values)
# primary_genre: nominal, medium cardinality (47 values)
# country: nominal, high cardinality (195 values)
cat_features = {
'subscription_plan': 'ordinal', # basic < standard < premium
'device_type': 'ohe', # 4 categories -> 3 OHE columns
'primary_genre': 'target_or_ohe', # 47 categories -> compare
'country': 'target_or_frequency' # 195 categories -> compare
}
Task 2: Compare OHE vs. Target Encoding for Genre
# Build two pipelines: one with OHE for genre, one with target encoding
# Use the same GradientBoostingClassifier with random_state=42
# Pipeline A: OHE for genre
preprocessor_a = ColumnTransformer([
('plan', OrdinalEncoder(categories=[['basic', 'standard', 'premium']]),
['subscription_plan']),
('device', OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
['device_type']),
('genre', OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
['primary_genre']),
('country', ce.TargetEncoder(cols=['country'], smoothing=1.0),
['country']),
('numeric', StandardScaler(), numeric_features)
])
# Pipeline B: Target encoding for genre
preprocessor_b = ColumnTransformer([
('plan', OrdinalEncoder(categories=[['basic', 'standard', 'premium']]),
['subscription_plan']),
('device', OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
['device_type']),
('genre', ce.TargetEncoder(cols=['primary_genre'], smoothing=1.0),
['primary_genre']),
('country', ce.TargetEncoder(cols=['country'], smoothing=1.0),
['country']),
('numeric', StandardScaler(), numeric_features)
])
# Compare with cross-validation
# Record: AUC, training time, number of features
Task 3: Demonstrate Target Encoding Leakage
# INTENTIONALLY show the wrong way, then the right way
# WRONG: Fit target encoder on full training set, then evaluate
# target_enc = ce.TargetEncoder(cols=['primary_genre'])
# df_full['genre_te_LEAKED'] = target_enc.fit_transform(
# df_full[['primary_genre']], y_full
# )['primary_genre']
# --> This leaks. The encoded values were computed using the target
# for the same rows that will be used in training.
# RIGHT: Target encoding inside a cross-validation pipeline
# pipe = Pipeline([
# ('encoder', ce.TargetEncoder(cols=['primary_genre'], smoothing=1.0)),
# ('model', GradientBoostingClassifier(random_state=42))
# ])
# scores = cross_val_score(pipe, df_full, y_full, cv=5, scoring='roc_auc')
# --> No leakage. Each fold computes encodings from out-of-fold data only.
Deliverable --- A Jupyter notebook showing: (1) the encoding decision for each categorical feature with justification, (2) a side-by-side AUC comparison of OHE vs. target encoding for primary_genre, and (3) a demonstration of target encoding leakage vs. correct cross-validated target encoding.
Common Mistakes and How to Avoid Them
Mistake 1: One-hot encoding everything. One-hot encoding a feature with 500 levels creates 500 columns, most of which have near-zero variance. Your model trains slowly, overfits, and gains no predictive lift from the sparse columns. Use the cardinality decision tree.
Mistake 2: Ordinal encoding nominal features for linear models.
If color has no natural order, encoding it as red=1, blue=2, green=3 tells a linear model that green is "3x red." Use one-hot encoding for nominal features in linear models.
Mistake 3: Fitting target encoding on the full training set.
This leaks the target into the features. Always use cross-validation or leave-one-out within the training set. The Pipeline + cross_val_score pattern prevents this automatically.
Mistake 4: Forgetting to handle unseen categories.
Your model will encounter categories in production that were not in the training data. Set handle_unknown='ignore' for one-hot encoding, and use global-mean fallbacks for target encoding.
Mistake 5: Ignoring domain-specific groupings. Before encoding 14,000 ICD-10 codes, check whether the domain has a natural hierarchy. ICD-10 codes group into chapters, sections, and categories. A 10-minute conversation with a physician can reduce 14,000 levels to 21 meaningful groups.
Looking Ahead
In Chapter 8, we tackle missing data --- the other universal data preparation problem. You will learn why "just drop the nulls" is almost always wrong, how to diagnose whether data is missing at random, and how to build imputation strategies that do not leak information. Missing data and categorical encoding interact in practice: a missing value in a categorical feature is itself a category, and how you handle it affects both imputation and encoding.
Key Terms
| Term | Definition |
|---|---|
| Categorical variable | A variable that takes one of a fixed set of values (categories), not continuous numbers |
| Nominal | A categorical variable with no inherent order (e.g., country, color, genre) |
| Ordinal | A categorical variable with a meaningful order (e.g., low < medium < high) |
| Cardinality | The number of unique values a categorical variable takes |
| One-hot encoding (OHE) | Creates one binary column per category; a row has 1 in the column matching its category and 0 elsewhere |
| Dummy variable trap | Perfect multicollinearity that occurs when all k one-hot columns are included; solved by dropping one column |
| Ordinal encoding | Maps categories to integers preserving a meaningful order |
| Target encoding (mean encoding) | Replaces each category with the mean of the target variable for that category |
| Frequency encoding | Replaces each category with its proportion (or count) in the training data |
| Binary encoding | Converts category integers to binary digits, using each bit as a separate feature |
| Hash encoding | Applies a hash function to map categories to a fixed number of output columns |
| Leave-one-out encoding | Target encoding variant that excludes each row from its own category mean calculation |
| Smoothing (Bayesian shrinkage) | Blends category-level statistics with global statistics, weighted by sample size, to regularize small categories |
| Embedding (preview) | A learned dense vector representation of categories; covered in depth in Chapter 26 (NLP) |
| category_encoders library | A Python library implementing 15+ categorical encoding schemes, compatible with scikit-learn pipelines |
This chapter covered encoding strategies for categorical data. Continue to the exercises to practice encoding decisions, or proceed to Chapter 8 for missing data strategies.