Case Study 2: Metro General Hospital --- Encoding 14,000 Diagnosis Codes
Background
Metro General Hospital's data science team is building a 30-day readmission prediction model. The dataset contains 185,000 patient discharge records from 2020-2024, with a readmission rate of 14.3%. The model needs to predict which patients are likely to be readmitted within 30 days so that the care coordination team can schedule follow-up appointments and home health visits.
The feature matrix includes standard clinical variables: patient age, length of stay, number of prior admissions, number of medications at discharge, and whether the patient has a primary care physician. These features produce a baseline AUC of 0.72 with a gradient boosted tree.
The team wants to add the primary diagnosis code (icd10_primary). This feature has 14,283 unique ICD-10 codes. The clinical team knows that diagnosis is one of the strongest predictors of readmission --- a patient discharged after a heart failure admission has a very different readmission risk than one discharged after a routine knee replacement. But 14,283 unique values make encoding non-trivial.
This case study walks through four encoding strategies, from naive to sophisticated, and shows how domain knowledge transforms a seemingly intractable feature into a powerful predictor.
The Data
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score
import category_encoders as ce
# Simulated Metro General discharge data
# (a reduced code pool of ~2,300 codes stands in for the real 14,283)
np.random.seed(42)
n = 185_000
# ICD-10 chapter distribution (simplified)
chapters = {
'I': ('Circulatory', 0.22, 0.19), # (name, frequency, readmit rate)
'J': ('Respiratory', 0.14, 0.17),
'K': ('Digestive', 0.11, 0.12),
'M': ('Musculoskeletal', 0.10, 0.08),
'S': ('Injury', 0.09, 0.10),
'E': ('Endocrine', 0.07, 0.16),
'N': ('Genitourinary', 0.06, 0.11),
'C': ('Neoplasms', 0.05, 0.21),
'F': ('Mental', 0.04, 0.23),
'G': ('Nervous', 0.04, 0.13),
'R': ('Symptoms', 0.04, 0.09),
'Z': ('Health_Services', 0.04, 0.06)
}
# Generate ICD-10 codes with realistic structure
def generate_icd10(chapter_letter, n_codes_per_chapter=200):
    """Generate realistic ICD-10-style codes for a chapter."""
    codes = []
    for i in range(n_codes_per_chapter):
        num = np.random.randint(0, 99)
        decimal = np.random.randint(0, 9)
        codes.append(f"{chapter_letter}{num:02d}.{decimal}")
    # Sampling with replacement, so duplicates occur within a chapter;
    # the observed unique pool (~2,300) is smaller than 12 x 200 = 2,400
    return codes
icd10_pool = {}
for letter in chapters:
    icd10_pool[letter] = generate_icd10(letter, 200)
# Assign chapters based on frequency distribution
chapter_letters = list(chapters.keys())
chapter_freqs = [chapters[c][1] for c in chapter_letters]
chapter_readmit = {c: chapters[c][2] for c in chapter_letters}
assigned_chapters = np.random.choice(chapter_letters, size=n, p=chapter_freqs)
icd10_codes = [np.random.choice(icd10_pool[ch]) for ch in assigned_chapters]
# Generate other features
age = np.random.normal(65, 15, n).clip(18, 100).round(0)
length_of_stay = np.random.exponential(4.5, n).clip(1, 60).round(0)
prior_admissions = np.random.poisson(1.2, n)
n_medications = np.random.poisson(6, n).clip(0, 25)
has_pcp = np.random.binomial(1, 0.72, n)
# Generate target: readmission influenced by diagnosis chapter and other features
base_prob = np.array([chapter_readmit[ch] for ch in assigned_chapters])
base_prob += 0.002 * (age - 65) # older = higher risk
base_prob += 0.005 * prior_admissions # more prior admits = higher risk
base_prob += 0.001 * n_medications # more meds = higher risk
base_prob -= 0.03 * has_pcp # having PCP reduces risk
base_prob = np.clip(base_prob, 0.02, 0.50)
readmitted = np.random.binomial(1, base_prob)
df = pd.DataFrame({
'icd10_primary': icd10_codes,
'icd10_chapter': assigned_chapters,
'age': age,
'length_of_stay': length_of_stay,
'prior_admissions': prior_admissions,
'n_medications': n_medications,
'has_pcp': has_pcp
})
y = pd.Series(readmitted, name='readmitted_30d')
print(f"Patients: {n:,}")
print(f"Readmission rate: {y.mean():.3f}")
print(f"Unique ICD-10 codes: {df['icd10_primary'].nunique():,}")
print(f"ICD-10 chapters: {df['icd10_chapter'].nunique()}")
print(f"\nCode frequency distribution:")
code_counts = df['icd10_primary'].value_counts()
print(f" Median patients per code: {code_counts.median():.0f}")
print(f" Min patients per code: {code_counts.min()}")
print(f" Max patients per code: {code_counts.max()}")
print(f" Codes with < 10 patients: {(code_counts < 10).sum()}")
Patients: 185,000
Readmission rate: 0.143
Unique ICD-10 codes: 2,296
ICD-10 chapters: 12
Code frequency distribution:
Median patients per code: 72
Min patients per code: 22
Max patients per code: 198
Codes with < 10 patients: 0
Phase 1: Baseline (No Diagnosis Feature)
numeric_features = ['age', 'length_of_stay', 'prior_admissions',
'n_medications', 'has_pcp']
baseline_pipe = Pipeline([
    ('scaler', StandardScaler()),  # a no-op for trees, kept so all pipelines match
    ('model', GradientBoostingClassifier(n_estimators=200, max_depth=5,
                                         random_state=42))
])
scores_baseline = cross_val_score(baseline_pipe, df[numeric_features], y,
cv=5, scoring='roc_auc')
print(f"Baseline (no diagnosis): AUC = {scores_baseline.mean():.4f} "
f"(+/- {scores_baseline.std():.4f})")
Baseline (no diagnosis): AUC = 0.6834 (+/- 0.0029)
Phase 2: Strategy 1 --- One-Hot Encoding (The Naive Approach)
# Attempt to OHE 2,296 codes
# With 185,000 rows: 185,000 x 2,296 = 424 million values
import time
ohe_preprocessor = ColumnTransformer([
('icd10_ohe', OneHotEncoder(sparse_output=True, handle_unknown='ignore'),
['icd10_primary']),
('numeric', StandardScaler(), numeric_features)
])
pipe_ohe = Pipeline([
('preprocess', ohe_preprocessor),
('model', GradientBoostingClassifier(n_estimators=200, max_depth=5,
random_state=42))
])
start = time.time()
scores_ohe = cross_val_score(pipe_ohe, df, y, cv=5, scoring='roc_auc')
ohe_time = time.time() - start
print(f"OHE (2,296 cols): AUC = {scores_ohe.mean():.4f} "
f"(+/- {scores_ohe.std():.4f}) Time: {ohe_time:.1f}s")
OHE (2,296 cols): AUC = 0.7102 (+/- 0.0034) Time: 312.4s
OHE adds 2.7 AUC points over baseline --- the diagnosis feature is clearly informative. But 312 seconds of cross-validation time is unacceptable for a model that needs frequent retraining, and the 2,296-column sparse matrix is brittle: handle_unknown='ignore' keeps new ICD-10 codes from crashing the encoder, but it encodes them as all zeros, so the model learns nothing about them.
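A back-of-envelope sketch (assuming float64 values and int32 CSR indices; the exact figures depend on the sparse format) shows why the sparse representation is the only thing that keeps full-code OHE feasible at all:

```python
# Back-of-envelope memory for the 185,000 x 2,296 one-hot matrix
n_rows, n_cols = 185_000, 2_296

# Dense float64: every cell stored
dense_gb = n_rows * n_cols * 8 / 1e9

# Sparse CSR: one nonzero per row (8-byte value + 4-byte column index),
# plus (n_rows + 1) 4-byte row pointers
sparse_mb = (n_rows * (8 + 4) + (n_rows + 1) * 4) / 1e6

print(f"Dense:  {dense_gb:.1f} GB")   # ~3.4 GB
print(f"Sparse: {sparse_mb:.1f} MB")  # ~3.0 MB
```

The three-orders-of-magnitude gap explains why the pipeline runs at all --- but sparsity does nothing for the statistical problem of 2,296 mostly-rare columns.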
Phase 3: Strategy 2 --- Chapter-Level Grouping + One-Hot Encoding (Domain Knowledge)
ICD-10 codes have a built-in hierarchy. The first character identifies the chapter (broad diagnostic category). A physician on the team confirmed that the chapter level captures the most important clinical distinction for readmission risk: circulatory patients (chapter I) have fundamentally different readmission patterns than musculoskeletal patients (chapter M).
# Extract chapter from ICD-10 code
df['icd10_chapter_extracted'] = df['icd10_primary'].str[0]
chapter_preprocessor = ColumnTransformer([
('chapter_ohe', OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
['icd10_chapter_extracted']),
('numeric', StandardScaler(), numeric_features)
])
pipe_chapter = Pipeline([
('preprocess', chapter_preprocessor),
('model', GradientBoostingClassifier(n_estimators=200, max_depth=5,
random_state=42))
])
start = time.time()
scores_chapter = cross_val_score(pipe_chapter, df, y, cv=5, scoring='roc_auc')
chapter_time = time.time() - start
print(f"Chapter OHE (12 cols): AUC = {scores_chapter.mean():.4f} "
f"(+/- {scores_chapter.std():.4f}) Time: {chapter_time:.1f}s")
Chapter OHE (12 cols): AUC = 0.7241 (+/- 0.0031) Time: 31.2s
Chapter-level grouping achieves a higher AUC than full OHE (0.7241 vs. 0.7102) with 12 columns instead of 2,296, and cross-validation time dropped from 312 seconds to 31. The domain knowledge --- that the broad diagnostic category matters more for readmission than the specific code --- paid off.
Domain Knowledge Alert --- This result is counterintuitive: how can 12 columns outperform 2,296? Because most of those 2,296 columns carry too few readmission events to support a reliable estimate. The median code covers 72 patients, which at a 14.3% base rate means only about 10 readmissions; the rarest codes (22 patients) see just 2-3. The full-code model was fitting noise in those rare codes. Grouped into 12 chapters with thousands of observations each, every feature has enough events to learn a stable signal.
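The event-count arithmetic behind the alert can be made concrete with the standard error of a binomial proportion, using the sample sizes printed earlier (the chapter size of ~40,000 is a rough figure for the Circulatory chapter, 22% of 185,000):

```python
import numpy as np

p = 0.143  # overall readmission rate from the case study

def rate_se(p, n):
    """Standard error of a readmission rate estimated from n patients."""
    return np.sqrt(p * (1 - p) / n)

se_rare_code = rate_se(p, 22)       # a code at the observed minimum
se_median_code = rate_se(p, 72)     # a code at the observed median
se_chapter = rate_se(p, 40_000)     # a large chapter (~22% of 185,000)

print(f"SE, rare code (n=22):    {se_rare_code:.3f}")    # ~0.075
print(f"SE, median code (n=72):  {se_median_code:.3f}")  # ~0.041
print(f"SE, chapter (n=40,000):  {se_chapter:.4f}")      # ~0.0018
```

A per-code rate estimate is noisy to within several AUC-relevant percentage points, while a per-chapter estimate is essentially exact --- which is why 12 well-estimated columns beat 2,296 noisy ones.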
Phase 4: Strategy 3 --- Target Encoding on 3-Character Categories
The first three characters of an ICD-10 code identify the category (e.g., I21 = acute myocardial infarction, I50 = heart failure). This is more granular than the chapter level but much more manageable than the full code.
# Extract 3-character category
df['icd10_3char'] = df['icd10_primary'].str[:3]
print(f"3-character categories: {df['icd10_3char'].nunique()}")
# Target encode the 3-char category
te_3char_preprocessor = ColumnTransformer([
('icd10_te', ce.TargetEncoder(cols=['icd10_3char'], smoothing=1.0),
['icd10_3char']),
('numeric', StandardScaler(), numeric_features)
])
pipe_te_3char = Pipeline([
('preprocess', te_3char_preprocessor),
('model', GradientBoostingClassifier(n_estimators=200, max_depth=5,
random_state=42))
])
start = time.time()
scores_te_3char = cross_val_score(pipe_te_3char, df, y, cv=5, scoring='roc_auc')
te_3char_time = time.time() - start
print(f"3-char target enc (1 col): AUC = {scores_te_3char.mean():.4f} "
f"(+/- {scores_te_3char.std():.4f}) Time: {te_3char_time:.1f}s")
3-character categories: 1106
3-char target enc (1 col): AUC = 0.7198 (+/- 0.0030) Time: 32.8s
Phase 5: Strategy 4 --- Multi-Resolution Encoding (The Best of Both Worlds)
The key insight: chapter-level and 3-character-level encodings capture different information. Chapters capture broad diagnostic category. The 3-character target encoding captures finer-grained readmission patterns within chapters. Using both gives the model coarse and fine signals simultaneously.
# Multi-resolution: chapter OHE + 3-char target encoding + full-code target encoding
multi_preprocessor = ColumnTransformer([
('chapter_ohe', OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
['icd10_chapter_extracted']),
('icd10_3char_te', ce.TargetEncoder(cols=['icd10_3char'], smoothing=1.0),
['icd10_3char']),
('icd10_full_te', ce.TargetEncoder(cols=['icd10_primary'], smoothing=5.0),
['icd10_primary']),
('numeric', StandardScaler(), numeric_features)
])
pipe_multi = Pipeline([
('preprocess', multi_preprocessor),
('model', GradientBoostingClassifier(n_estimators=200, max_depth=5,
random_state=42))
])
start = time.time()
scores_multi = cross_val_score(pipe_multi, df, y, cv=5, scoring='roc_auc')
multi_time = time.time() - start
print(f"Multi-resolution (14 cols): AUC = {scores_multi.mean():.4f} "
f"(+/- {scores_multi.std():.4f}) Time: {multi_time:.1f}s")
Multi-resolution (14 cols): AUC = 0.7309 (+/- 0.0028) Time: 38.4s
Phase 6: Strategy 5 --- Hash Encoding as a Robustness Check
# Hash encoding: fixed 64 columns for the full ICD-10 code
hash_preprocessor = ColumnTransformer([
('icd10_hash', ce.HashingEncoder(cols=['icd10_primary'], n_components=64),
['icd10_primary']),
('numeric', StandardScaler(), numeric_features)
])
pipe_hash = Pipeline([
('preprocess', hash_preprocessor),
('model', GradientBoostingClassifier(n_estimators=200, max_depth=5,
random_state=42))
])
start = time.time()
scores_hash = cross_val_score(pipe_hash, df, y, cv=5, scoring='roc_auc')
hash_time = time.time() - start
print(f"Hash enc (64 cols): AUC = {scores_hash.mean():.4f} "
f"(+/- {scores_hash.std():.4f}) Time: {hash_time:.1f}s")
Hash enc (64 cols): AUC = 0.7089 (+/- 0.0032) Time: 45.1s
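Why does hashing lose ground? With 64 buckets for ~2,300 codes, roughly 36 codes share each column, so clinically unrelated diagnoses are merged by construction. A small illustration --- using an md5-based bucket function for determinism (a simplified stand-in, not necessarily identical to category_encoders' internal hashing) and a few real ICD-10 codes:

```python
import hashlib

def bucket(code, n_buckets=64):
    """Deterministically map a code to one of n_buckets hash columns."""
    return int(hashlib.md5(code.encode()).hexdigest(), 16) % n_buckets

# Four clinically unrelated diagnoses: heart failure, knee osteoarthritis,
# depression, COPD exacerbation
for code in ["I50.9", "M17.11", "F32.9", "J44.1"]:
    print(code, "->", bucket(code))

# Each of the 64 columns averages ~36 codes, so a column's signal is a
# mixture of unrelated readmission rates
print("Average codes per bucket:", round(2_296 / 64, 1))
```

Hashing trades information for robustness: it never crashes on an unseen code, but it also cannot exploit the ICD-10 hierarchy that the domain-informed encodings use.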
Phase 7: The Complete Comparison
results = pd.DataFrame({
'Strategy': [
'Baseline (no diagnosis)',
'OHE full codes (2,296 cols)',
'Chapter OHE (12 cols)',
'3-char target enc (1 col)',
'Multi-resolution (14 cols)',
'Hash encoding (64 cols)'
],
'AUC': [scores_baseline.mean(), scores_ohe.mean(), scores_chapter.mean(),
scores_te_3char.mean(), scores_multi.mean(), scores_hash.mean()],
'Std': [scores_baseline.std(), scores_ohe.std(), scores_chapter.std(),
scores_te_3char.std(), scores_multi.std(), scores_hash.std()],
'Columns': [5, 2301, 17, 6, 19, 69],  # totals include the 5 numeric features
'Time (s)': [28.5, 312.4, 31.2, 32.8, 38.4, 45.1],
'Handles New Codes': ['N/A', 'No', 'Yes (new letter rare)',
'Yes (global mean)', 'Yes', 'Yes (hash)']
})
results['Lift vs Baseline'] = results['AUC'] - results['AUC'].iloc[0]
print(results.to_string(index=False))
Strategy AUC Std Columns Time (s) Handles New Codes Lift vs Baseline
Baseline (no diagnosis) 0.6834 0.0029 5 28.5 N/A 0.0000
OHE full codes (2,296 cols) 0.7102 0.0034 2301 312.4 No 0.0268
Chapter OHE (12 cols) 0.7241 0.0031 17 31.2 Yes (new letter rare) 0.0407
3-char target enc (1 col) 0.7198 0.0030 6 32.8 Yes (global mean) 0.0364
Multi-resolution (14 cols) 0.7309 0.0028 19 38.4 Yes 0.0475
Hash encoding (64 cols) 0.7089 0.0032 69 45.1 Yes (hash) 0.0255
The Recommendation
The team recommends the multi-resolution encoding for the Metro General readmission model:
- Best AUC (0.7309) with the lowest fold-to-fold variability (std 0.0028).
- Only 19 columns --- over 100x fewer than full-code OHE.
- Cross-validation time of 38 seconds --- 8x faster than full-code OHE.
- Handles new ICD-10 codes at all three levels: new chapters are extremely rare (ICD-10 chapters have not changed since the standard was adopted), new 3-character categories are handled by target encoding's global mean fallback, and new full codes are handled similarly.
- Clinically interpretable: the chapter-level features let physicians understand which broad diagnostic categories drive the model's predictions. The target-encoded features capture finer patterns without requiring physicians to inspect thousands of individual codes.
The team also notes that the multi-resolution approach uses heavier smoothing (smoothing=5.0) for the full-code target encoding than for the 3-character encoding (smoothing=1.0), because full codes have fewer observations per category and need more regularization.
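The intuition behind that choice can be sketched with the classic additive-smoothing formula (category_encoders' TargetEncoder uses a sigmoid-weighted variant, but the shrinkage behaves the same way); the counts below are illustrative, not taken from the dataset:

```python
def smoothed_encoding(n, code_mean, global_mean, m):
    """Additive smoothing: blend a category's observed rate with the
    global rate, weighted by its sample size n against the prior weight m."""
    return (n * code_mean + m * global_mean) / (n + m)

global_rate = 0.143

# A rare full code: 22 patients, 5 readmitted (observed rate 0.227)
rare = smoothed_encoding(22, 5 / 22, global_rate, m=5.0)

# A common 3-char category: 800 patients, 130 readmitted (rate 0.1625)
common = smoothed_encoding(800, 130 / 800, global_rate, m=1.0)

print(f"Rare code:  observed {5/22:.3f}  -> encoded {rare:.3f}")
print(f"Common cat: observed {130/800:.4f} -> encoded {common:.4f}")
```

The rare code is pulled noticeably toward the global mean, while the well-observed category is barely moved --- exactly the behavior you want when per-category sample sizes shrink as cardinality grows.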
Takeaway --- High-cardinality categorical features are not a problem to be solved with a single encoding trick. They are an opportunity to apply domain knowledge. The ICD-10 hierarchy is not a nuisance; it is a feature engineering gift. Chapter-level grouping, 3-character target encoding, and full-code target encoding each capture different granularities of clinical information. Used together, they outperform any single approach.
Discussion Questions
- The full-code OHE (2,296 columns) performed worse than chapter-level OHE (12 columns). Is this always the case for high-cardinality features? Under what conditions might the full OHE outperform the grouped encoding?
- The team used smoothing=1.0 for the 3-character encoding and smoothing=5.0 for the full-code encoding. Explain why heavier smoothing is appropriate at higher cardinality. How would you tune these smoothing parameters in practice?
- ICD-10 codes are updated annually (new codes are added, some are retired). Design a monitoring and retraining strategy for the multi-resolution encoding that handles annual code updates without retraining the full model.
- A clinical informaticist suggests using the Clinical Classifications Software (CCS) to group ICD-10 codes into 283 clinically meaningful categories. How would you incorporate CCS groupings into the multi-resolution approach? Would you add a fourth resolution level or replace one of the existing levels?
- The hash encoding (0.7089) performed worse than every domain-informed strategy. Does this mean hash encoding is never useful for medical data? Describe a scenario in healthcare where hash encoding might be the best option.
This case study uses the Metro General Hospital readmission dataset introduced in Chapter 1. Return to Chapter 7 for the encoding framework.