Exercises: Chapter 7

Handling Categorical Data


Exercise 1: The Encoding Decision Tree (Conceptual)

For each of the following features, specify the encoding you would use for (a) a logistic regression model and (b) a gradient boosted tree. Justify each choice by referencing cardinality, feature type (nominal vs. ordinal), and model type.

Feature            Unique Values   Example Values
payment_method     4               credit_card, paypal, bank_transfer, crypto
education_level    5               high_school, some_college, bachelors, masters, phd
zip_code           8,200           10001, 90210, 60614, ...
browser            12              chrome, firefox, safari, edge, opera, ...
product_category   340             electronics, clothing, home_garden, ...
job_title          4,800           software_engineer, nurse, teacher, ...

For zip_code and job_title, also describe a domain-informed grouping strategy that would reduce cardinality before encoding.


Exercise 2: One-Hot Encoding by Hand (Applied)

Given the following data:

import pandas as pd

df = pd.DataFrame({
    'subscriber_id': ['S1', 'S2', 'S3', 'S4'],
    'device_type': ['mobile', 'desktop', 'tablet', 'mobile'],
    'churned': [1, 0, 0, 1]
})

a) Write out the one-hot encoded matrix for device_type (without dropping any column). Show the full 4x3 matrix.

b) Now write the matrix with drop='first'. Which category is the reference level? How many columns remain?

c) If a new subscriber appears with device_type = 'vr_headset' (not in the training data), what does the one-hot encoded row look like under handle_unknown='ignore'? Why?

d) Explain why dropping a column matters for logistic regression but not for a random forest.


Exercise 3: Target Encoding Leakage (Conceptual + Applied)

Consider the following toy dataset:

import pandas as pd

df = pd.DataFrame({
    'genre': ['horror', 'horror', 'comedy', 'comedy', 'comedy',
              'drama', 'drama', 'drama', 'drama', 'drama'],
    'churned': [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
})

a) Compute the naive target encoding for each genre (mean of churned per genre, applied to the same rows). Show the encoded values.

b) For the horror genre (2 observations, both churned), the naive encoding is 1.0. Explain why this is problematic. What would a model learn from this feature?

c) Compute the leave-one-out encoding for each row. Show your work for at least one row from each genre.

d) Now apply smoothed target encoding with m=5 and a global mean of 0.3. Compute the smoothed encoding for each genre. Show the formula and the result.

e) Compare the naive, LOO, and smoothed encodings for the horror genre. Which is most reliable for a model? Why?
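A pandas sketch for checking parts (a), (c), and (d). The groupby approach here is one of several ways to compute these; the variable names are assumptions, not from the chapter:

```python
import pandas as pd

df = pd.DataFrame({
    'genre': ['horror', 'horror', 'comedy', 'comedy', 'comedy',
              'drama', 'drama', 'drama', 'drama', 'drama'],
    'churned': [1, 1, 0, 0, 1, 0, 0, 1, 0, 0]
})

# (a) Naive: per-genre mean mapped back onto the same rows
naive = df.groupby('genre')['churned'].transform('mean')

# (c) Leave-one-out: exclude each row's own target from its genre mean
grp = df.groupby('genre')['churned']
loo = (grp.transform('sum') - df['churned']) / (grp.transform('count') - 1)

# (d) Smoothed with m=5 and a global mean of 0.3
m, global_mean = 5, 0.3
stats = df.groupby('genre')['churned'].agg(['mean', 'count'])
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
```

Note that the LOO encoding is undefined for categories with a single observation (division by zero), which is one reason production implementations combine it with smoothing.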


Exercise 4: Frequency Encoding (Applied)

Using the StreamFlow genre distribution below:

genre_counts = pd.Series({
    'drama': 480_000,
    'comedy': 360_000,
    'action': 288_000,
    'thriller': 240_000,
    'sci_fi': 240_000,
    'horror': 192_000,
    'romance': 168_000,
    'documentary': 120_000,
    'animation': 72_000,
    'reality': 48_000,
    'true_crime': 48_000,
    'foreign_language': 36_000,
    'classic': 24_000,
    'experimental': 12_000,
    'silent_era': 12_000
})
total = genre_counts.sum()

a) Compute the frequency encoding (proportion) for each genre. Round to 4 decimal places.
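Your hand computation for part (a) can be cross-checked in pandas (the `genre_counts` Series from above is repeated here so the snippet runs standalone):

```python
import pandas as pd

genre_counts = pd.Series({
    'drama': 480_000, 'comedy': 360_000, 'action': 288_000,
    'thriller': 240_000, 'sci_fi': 240_000, 'horror': 192_000,
    'romance': 168_000, 'documentary': 120_000, 'animation': 72_000,
    'reality': 48_000, 'true_crime': 48_000, 'foreign_language': 36_000,
    'classic': 24_000, 'experimental': 12_000, 'silent_era': 12_000
})

# Frequency encoding: each category's share of the training rows
freq = (genre_counts / genre_counts.sum()).round(4)
```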

b) A new genre, ai_generated, appears in production with 500 subscribers. What frequency encoding value should it receive? Why?

c) Two genres share the same frequency encoding (0.0205 each). Name them from the data above. Does this cause a problem for the model? Under what conditions might same-frequency categories behave differently?

d) Write a Python function frequency_encode(train_col, test_col) that computes frequency encoding from the training column and applies it to both training and test columns, handling unseen categories in the test set by mapping them to 0.


Exercise 5: The Smoothing Parameter (Applied)

The smoothed target encoding formula is:

smoothed = (n * category_mean + m * global_mean) / (n + m)
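Before plugging in numbers, it can help to wrap the formula in a small helper (the function and variable names are assumptions, not from the chapter):

```python
def smoothed_encoding(n, category_mean, global_mean, m):
    """Shrink a category mean toward the global mean by pseudo-count m."""
    return (n * category_mean + m * global_mean) / (n + m)

# Example: horror with m=10 is (15 * 0.142 + 10 * 0.082) / 25 = 0.118
horror_m10 = smoothed_encoding(15, 0.142, 0.082, 10)
```

Notice that when n >> m the result is dominated by the category mean, and when n << m it collapses toward the global mean; that asymmetry is what parts (c) and (d) probe.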

Given a global churn rate of 0.082 and the following category statistics:

Genre          Category Mean   Count (n)
horror         0.142           15
drama          0.071           480,000
experimental   0.200           8
classic        0.050           24

a) Compute the smoothed encoding for each genre with m=10. Show your work.

b) Compute the smoothed encoding for each genre with m=100. Show your work.

c) Which genre's encoding changes the most between m=10 and m=100? Why?

d) For the drama genre (n=480,000), does the smoothing parameter matter? Calculate the difference between m=10 and m=100 for drama. What does this tell you about when smoothing has the most impact?

e) A colleague argues that m should always be set to 1 for maximum signal. Write a 2-3 sentence response explaining why this is wrong for features with rare categories.


Exercise 6: Building a Complete Encoding Pipeline (Applied)

Build a scikit-learn pipeline for the following dataset:

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

np.random.seed(42)
n = 20_000

df = pd.DataFrame({
    'plan': np.random.choice(['free', 'basic', 'pro', 'enterprise'], n,
                             p=[0.30, 0.35, 0.25, 0.10]),
    'device': np.random.choice(['mobile', 'desktop', 'tablet'], n,
                               p=[0.50, 0.35, 0.15]),
    'genre': np.random.choice([f'genre_{i}' for i in range(50)], n),
    'region': np.random.choice([f'region_{i}' for i in range(200)], n),
    'tenure_months': np.random.exponential(12, n).round(1),
    'hours_last_30d': np.random.exponential(15, n).round(1),
})
y = np.random.binomial(1, 0.08, n)

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=42, stratify=y
)

a) Classify each feature by type (ordinal/nominal) and cardinality (low/medium/high).

b) Write a ColumnTransformer that applies the appropriate encoding for each feature. Use OrdinalEncoder for plan (free < basic < pro < enterprise), OneHotEncoder for device, TargetEncoder for genre and region, and StandardScaler for numeric features.

c) Wrap the transformer in a Pipeline with a GradientBoostingClassifier(n_estimators=200, random_state=42).

d) Evaluate the pipeline using 5-fold cross-validation with roc_auc scoring. Report the mean and standard deviation.

e) Now modify the pipeline to use frequency encoding for region instead of target encoding. Compare the results. Which encoding performed better for this dataset?


Exercise 7: ICD-10 Grouping Strategy (Conceptual + Applied)

Metro General Hospital's dataset contains the icd10_primary_diagnosis column with 14,283 unique codes. Examples: I21.0 (acute myocardial infarction), J18.9 (pneumonia), K80.10 (gallstones), S72.001A (femoral neck fracture).

a) ICD-10 codes have a hierarchical structure. Describe three levels of grouping:

  • Level 1: First character (letter) -- what does this represent?
  • Level 2: First three characters -- what does this represent?
  • Level 3: Full code -- what does this represent?

b) For each level, estimate the cardinality and suggest an encoding strategy.

c) A data scientist proposes using all three levels simultaneously as features (chapter OHE + 3-char target encoding + full-code target encoding). Explain why this multi-resolution approach might outperform any single level. What risk does it introduce?

d) Write a Python function that takes an ICD-10 code string and returns a dictionary with all three levels extracted:

def extract_icd10_levels(code):
    """
    Extract hierarchical levels from an ICD-10 code.

    Returns:
        dict with keys: 'chapter_letter', 'category_3char', 'full_code'
    """
    # Your implementation here
    pass

# Example:
# extract_icd10_levels('I21.0')
# -> {'chapter_letter': 'I', 'category_3char': 'I21', 'full_code': 'I21.0'}

Exercise 8: Encoding and Model Interaction (Conceptual)

A data scientist runs the following experiment on the StreamFlow dataset:

Encoding for primary_genre (47 categories)        Logistic Regression AUC   Gradient Boosting AUC
One-hot encoding (47 columns)                     0.741                     0.783
Ordinal encoding (1 column, alphabetical order)   0.698                     0.781
Target encoding (1 column, smoothed)              0.752                     0.786
Frequency encoding (1 column)                     0.729                     0.779

a) Why does ordinal encoding perform much worse for logistic regression (0.698) but nearly the same for gradient boosting (0.781)?

b) Why does target encoding outperform one-hot encoding for logistic regression (0.752 vs. 0.741)?

c) The gradient boosting AUCs are much closer together across encodings (0.779-0.786). Explain why tree-based models are more robust to encoding choice.

d) Given these results, which encoding would you recommend for a production pipeline that might need to swap between a linear model and a tree model? Justify your answer.


Exercise 9: Handling New Categories in Production (Applied)

You have trained a churn model with the following encoding pipeline:

from sklearn.preprocessing import OneHotEncoder
import category_encoders as ce

# Trained on these known categories
known_devices = ['mobile', 'desktop', 'tablet', 'smart_tv']
known_genres = ['drama', 'comedy', 'action', 'horror', 'sci_fi',
                'documentary', 'romance', 'thriller']

In production, you encounter:

  • A subscriber with device_type = 'gaming_console' (new device)
  • A subscriber with primary_genre = 'true_crime' (new genre)
  • A subscriber with device_type = None (missing value)

For each case:

a) Describe what happens with OneHotEncoder(handle_unknown='ignore') for the device encoding.

b) Describe what happens with TargetEncoder for the genre encoding (assume it falls back to the global mean).

c) Write a monitoring function that logs the rate of unknown categories per feature per day. Include a threshold that triggers an alert when the unknown rate exceeds 5%.

def monitor_unknown_categories(df_batch, known_categories_dict, date):
    """
    Monitor rate of unknown categories in a production batch.

    Parameters:
        df_batch: DataFrame of new predictions
        known_categories_dict: dict mapping feature name -> set of known values
        date: date string for logging

    Returns:
        dict of alert messages (empty if no alerts)
    """
    # Your implementation here
    pass

Exercise 10: Progressive Project M3 --- Full Encoding Comparison (Project)

Using your StreamFlow feature matrix from M2 (Chapter 6):

a) For each categorical feature, document:

  • Feature name
  • Number of unique values
  • Whether it is nominal or ordinal
  • Your chosen encoding strategy (with justification)

b) Build two complete pipelines:

  • Pipeline A: One-hot encoding for all nominal features (regardless of cardinality)
  • Pipeline B: Mixed encoding (OHE for low cardinality, target encoding for medium/high)

c) Compare Pipeline A and Pipeline B on:

  • 5-fold cross-validated AUC (with GradientBoostingClassifier(n_estimators=200, random_state=42))
  • Total number of features after encoding
  • Training time per fold (use %%timeit or time.time())

d) For primary_genre specifically, show the target encoding leakage experiment:

  • Fit TargetEncoder on the full training set, then evaluate on the test set (leaked)
  • Use TargetEncoder inside a Pipeline with cross_val_score (correct)
  • Report the difference in AUC between the leaked and correct approaches

e) Write a summary paragraph (4-6 sentences) recommending the encoding strategy for the StreamFlow production pipeline. Address both model performance and engineering concerns (training time, handling unseen categories, maintainability).


Solutions to selected exercises are available in the appendix. Return to the chapter for reference.