Case Study 1: StreamFlow Genre Encoding Showdown


Background

StreamFlow's data science team has a working churn model from the Chapter 6 milestone. The feature matrix includes numeric features (tenure, recency, frequency, trends) and three categorical features that still need encoding: subscription_plan (3 levels), device_type (4 levels), and primary_genre (47 levels).

The first two are straightforward. subscription_plan is ordinal. device_type is low-cardinality nominal. The debate is about primary_genre. The team's senior data scientist wants to one-hot encode it ("47 columns is fine for gradient boosting"). The junior data scientist just read a blog post about target encoding and wants to try it ("why use 47 columns when 1 will do?"). The engineering lead is concerned about production stability ("what happens when we add a 48th genre next quarter?").

The VP of Data asks for a rigorous comparison. Not opinions. Numbers.


The Data

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score
import category_encoders as ce
import time

# Load the StreamFlow feature matrix (simulated for reproducibility)
np.random.seed(42)
n = 200_000

# Genre distribution: some genres are much more popular than others
genre_names = [f'genre_{i:02d}' for i in range(47)]
genre_probs = np.random.dirichlet(np.ones(47) * 2)  # uneven distribution

# Genre-specific churn rates (some genres have higher churn)
genre_churn_rates = {}
base_rate = 0.082
for i, g in enumerate(genre_names):
    # Churn rate varies roughly between 0.03 and 0.13 by genre
    genre_churn_rates[g] = base_rate + (np.random.random() - 0.5) * 0.10

genres = np.random.choice(genre_names, size=n, p=genre_probs)
plans = np.random.choice(['basic', 'standard', 'premium'], n, p=[0.40, 0.35, 0.25])
devices = np.random.choice(['mobile', 'desktop', 'tablet', 'smart_tv'], n,
                           p=[0.45, 0.30, 0.15, 0.10])

# Target: churn influenced by genre, plan, and numeric features
plan_effect = {'basic': 0.04, 'standard': 0.00, 'premium': -0.03}
tenure = np.random.exponential(12, n).clip(1, 72)
days_since_login = np.random.exponential(10, n).clip(0, 365)
hours_last_30d = np.random.exponential(15, n).clip(0, 200)

churn_prob = np.array([genre_churn_rates[g] for g in genres])
churn_prob += np.array([plan_effect[p] for p in plans])
churn_prob -= 0.001 * hours_last_30d  # more hours = less churn
churn_prob += 0.0005 * days_since_login  # more days away = more churn
churn_prob = np.clip(churn_prob, 0.01, 0.40)
churned = np.random.binomial(1, churn_prob)

df = pd.DataFrame({
    'subscription_plan': plans,
    'device_type': devices,
    'primary_genre': genres,
    'tenure_months': tenure.round(1),
    'days_since_last_login': days_since_login.round(0),
    'hours_last_30d': hours_last_30d.round(1)
})
y = pd.Series(churned, name='churned')

print(f"Subscribers: {n:,}")
print(f"Churn rate: {y.mean():.3f}")
print(f"Genre cardinality: {df['primary_genre'].nunique()}")
print(f"Smallest genre: {df['primary_genre'].value_counts().min():,} subscribers")
print(f"Largest genre: {df['primary_genre'].value_counts().max():,} subscribers")
Subscribers: 200,000
Churn rate: 0.079
Genre cardinality: 47
Smallest genre: 1,214 subscribers
Largest genre: 10,847 subscribers
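With 47 levels, a practical first question is whether every genre has enough subscribers to support a per-category estimate at all. A hypothetical audit helper (the name and the `min_count` threshold are illustrative, not part of the StreamFlow pipeline):

```python
import pandas as pd

def rare_categories(series: pd.Series, min_count: int = 500) -> list:
    # Categories with fewer than min_count rows yield noisy per-category
    # statistics; these are candidates for an 'other' bucket.
    counts = series.value_counts()
    return sorted(counts[counts < min_count].index)

genres = pd.Series(["drama"] * 600 + ["horror"] * 30 + ["noir"] * 5)
print(rare_categories(genres))  # ['horror', 'noir']
```

Here the smallest genre has 1,214 subscribers, comfortably above any reasonable threshold, so no bucketing is needed before comparing encoders.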

Phase 1: Establishing the Baseline (No Genre Feature)

Before comparing genre encodings, we establish a baseline without the genre feature entirely. If none of our encodings improve over this baseline, the genre feature is not contributing signal.

numeric_features = ['tenure_months', 'days_since_last_login', 'hours_last_30d']

# Baseline: plan (ordinal) + device (OHE) + numerics, NO genre
baseline_preprocessor = ColumnTransformer([
    ('plan', OrdinalEncoder(categories=[['basic', 'standard', 'premium']]),
     ['subscription_plan']),
    ('device', OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
     ['device_type']),
    ('numeric', StandardScaler(), numeric_features)
])

pipe_baseline = Pipeline([
    ('preprocess', baseline_preprocessor),
    ('model', GradientBoostingClassifier(n_estimators=200, max_depth=5,
                                         random_state=42))
])

start = time.time()
scores_baseline = cross_val_score(pipe_baseline, df, y, cv=5, scoring='roc_auc')
baseline_time = time.time() - start

print(f"Baseline (no genre):  AUC = {scores_baseline.mean():.4f} "
      f"(+/- {scores_baseline.std():.4f})  Time: {baseline_time:.1f}s")
Baseline (no genre):  AUC = 0.7312 (+/- 0.0042)  Time: 28.3s

Phase 2: One-Hot Encoding for Genre

# OHE: 47 genre columns added
ohe_preprocessor = ColumnTransformer([
    ('plan', OrdinalEncoder(categories=[['basic', 'standard', 'premium']]),
     ['subscription_plan']),
    ('device', OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
     ['device_type']),
    ('genre_ohe', OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
     ['primary_genre']),
    ('numeric', StandardScaler(), numeric_features)
])

pipe_ohe = Pipeline([
    ('preprocess', ohe_preprocessor),
    ('model', GradientBoostingClassifier(n_estimators=200, max_depth=5,
                                         random_state=42))
])

start = time.time()
scores_ohe = cross_val_score(pipe_ohe, df, y, cv=5, scoring='roc_auc')
ohe_time = time.time() - start

print(f"OHE genre (47 cols):  AUC = {scores_ohe.mean():.4f} "
      f"(+/- {scores_ohe.std():.4f})  Time: {ohe_time:.1f}s")
OHE genre (47 cols):  AUC = 0.7589 (+/- 0.0038)  Time: 42.7s

The genre feature adds 2.8 percentage points of AUC over baseline. It is informative.


Phase 3: Target Encoding for Genre

# Target encoding: 1 column for genre
te_preprocessor = ColumnTransformer([
    ('plan', OrdinalEncoder(categories=[['basic', 'standard', 'premium']]),
     ['subscription_plan']),
    ('device', OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
     ['device_type']),
    ('genre_te', ce.TargetEncoder(cols=['primary_genre'], smoothing=1.0),
     ['primary_genre']),
    ('numeric', StandardScaler(), numeric_features)
])

pipe_te = Pipeline([
    ('preprocess', te_preprocessor),
    ('model', GradientBoostingClassifier(n_estimators=200, max_depth=5,
                                         random_state=42))
])

start = time.time()
scores_te = cross_val_score(pipe_te, df, y, cv=5, scoring='roc_auc')
te_time = time.time() - start

print(f"Target enc (1 col):   AUC = {scores_te.mean():.4f} "
      f"(+/- {scores_te.std():.4f})  Time: {te_time:.1f}s")
Target enc (1 col):   AUC = 0.7601 (+/- 0.0035)  Time: 30.1s
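What does that single column contain? Conceptually, a smoothed per-genre churn rate. The sketch below uses classic additive smoothing toward the global mean; category_encoders' exact smoothing formula differs (it uses a sigmoid weight), but the idea is the same.

```python
import pandas as pd

def smoothed_target_mean(categories, target, smoothing=1.0):
    # Blend each category's observed rate with the global rate; the more
    # rows a category has, the more its own rate dominates.
    frame = pd.DataFrame({"cat": categories, "y": target})
    global_mean = frame["y"].mean()
    stats = frame.groupby("cat")["y"].agg(["count", "mean"])
    enc = (stats["count"] * stats["mean"] + smoothing * global_mean) / (
        stats["count"] + smoothing)
    # Categories unseen at fit time fall back to the global mean.
    return frame["cat"].map(enc), global_mean

codes, fallback = smoothed_target_mean(list("aaab"), [1, 0, 1, 1])
print(codes.tolist(), fallback)
```

With smoothing=1.0 and every genre holding 1,000+ subscribers, the blend is dominated by the per-genre rate; smoothing matters most for rare categories.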

Phase 4: Frequency and Binary Encoding

# Frequency encoding: 1 column for genre
freq_map = df['primary_genre'].value_counts(normalize=True)
df['genre_freq'] = df['primary_genre'].map(freq_map)

freq_preprocessor = ColumnTransformer([
    ('plan', OrdinalEncoder(categories=[['basic', 'standard', 'premium']]),
     ['subscription_plan']),
    ('device', OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
     ['device_type']),
    ('genre_freq', 'passthrough', ['genre_freq']),
    ('numeric', StandardScaler(), numeric_features)
])

pipe_freq = Pipeline([
    ('preprocess', freq_preprocessor),
    ('model', GradientBoostingClassifier(n_estimators=200, max_depth=5,
                                         random_state=42))
])

start = time.time()
scores_freq = cross_val_score(pipe_freq, df, y, cv=5, scoring='roc_auc')
freq_time = time.time() - start

print(f"Freq enc (1 col):     AUC = {scores_freq.mean():.4f} "
      f"(+/- {scores_freq.std():.4f})  Time: {freq_time:.1f}s")
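One caution: the frequency map above is built from the full frame before cross-validation. That leaks only the category distribution, never the target, so it is usually tolerated; a production-leaning version would fit the map on training data alone and give unseen categories an explicit fallback. A hypothetical sketch:

```python
import pandas as pd

def fit_frequency_encoder(train_series: pd.Series):
    # Learn category -> share of training rows; return a transform that
    # maps categories unseen at fit time to 0.0 instead of NaN.
    freq_map = train_series.value_counts(normalize=True)
    return lambda s: s.map(freq_map).fillna(0.0)

encode = fit_frequency_encoder(pd.Series(["a", "a", "b", "c"]))
print(encode(pd.Series(["a", "c", "zzz"])).tolist())  # [0.5, 0.25, 0.0]
```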

# Binary encoding: 6 columns for genre
binary_preprocessor = ColumnTransformer([
    ('plan', OrdinalEncoder(categories=[['basic', 'standard', 'premium']]),
     ['subscription_plan']),
    ('device', OneHotEncoder(sparse_output=False, handle_unknown='ignore'),
     ['device_type']),
    ('genre_bin', ce.BinaryEncoder(cols=['primary_genre']),
     ['primary_genre']),
    ('numeric', StandardScaler(), numeric_features)
])

pipe_bin = Pipeline([
    ('preprocess', binary_preprocessor),
    ('model', GradientBoostingClassifier(n_estimators=200, max_depth=5,
                                         random_state=42))
])

start = time.time()
scores_bin = cross_val_score(pipe_bin, df, y, cv=5, scoring='roc_auc')
bin_time = time.time() - start

print(f"Binary enc (6 cols):  AUC = {scores_bin.mean():.4f} "
      f"(+/- {scores_bin.std():.4f})  Time: {bin_time:.1f}s")
Freq enc (1 col):     AUC = 0.7349 (+/- 0.0040)  Time: 28.8s
Binary enc (6 cols):  AUC = 0.7482 (+/- 0.0039)  Time: 30.5s
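The 6 columns are not arbitrary: BinaryEncoder assigns each category an ordinal code and writes it in base 2, so the width grows with the log of the cardinality. A sketch of the expected width (actual column counts can differ by one depending on encoder version and unknown-value handling):

```python
import math

def binary_encoding_width(n_categories: int) -> int:
    # Enough binary digits to give each category (plus a reserved code)
    # a distinct bit pattern.
    return max(1, math.ceil(math.log2(n_categories + 1)))

print(binary_encoding_width(47))  # 6 columns for the 47 genres
print(binary_encoding_width(4))   # low cardinality saves little over OHE
```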

Phase 5: The Summary Table

results = pd.DataFrame({
    'Encoding': ['No genre (baseline)', 'One-Hot (47 cols)',
                 'Target (1 col)', 'Frequency (1 col)', 'Binary (6 cols)'],
    'AUC': [scores_baseline.mean(), scores_ohe.mean(),
            scores_te.mean(), scores_freq.mean(), scores_bin.mean()],
    'Std': [scores_baseline.std(), scores_ohe.std(),
            scores_te.std(), scores_freq.std(), scores_bin.std()],
    'Columns': [8, 55, 9, 9, 14],
    'Time (s)': [28.3, 42.7, 30.1, 28.8, 30.5]
})

results['Lift vs Baseline'] = results['AUC'] - results['AUC'].iloc[0]
print(results.to_string(index=False))
           Encoding     AUC     Std  Columns  Time (s)  Lift vs Baseline
No genre (baseline)  0.7312  0.0042        8      28.3            0.0000
  One-Hot (47 cols)  0.7589  0.0038       55      42.7            0.0277
     Target (1 col)  0.7601  0.0035        9      30.1            0.0289
  Frequency (1 col)  0.7349  0.0040        9      28.8            0.0037
    Binary (6 cols)  0.7482  0.0039       14      30.5            0.0170
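Target encoding's 0.0012 AUC edge over OHE is smaller than either fold std, so it should be read as a tie unless a paired comparison says otherwise. Because cross_val_score with cv=5 uses the same deterministic StratifiedKFold splits for every pipeline, per-fold scores are paired and can be differenced directly. A sketch on hypothetical per-fold numbers (in practice you would difference the real scores_te and scores_ohe arrays):

```python
import numpy as np

# Hypothetical per-fold AUCs, for illustration only; means match the
# reported 0.7589 (OHE) and 0.7601 (target encoding).
fold_auc_ohe = np.array([0.7551, 0.7612, 0.7589, 0.7563, 0.7630])
fold_auc_te = np.array([0.7581, 0.7602, 0.7609, 0.7558, 0.7655])

diffs = fold_auc_te - fold_auc_ohe
print(f"mean diff {diffs.mean():+.4f}, std of diffs {diffs.std(ddof=1):.4f}")
# A mean difference well inside the std of differences is a tie in practice.
```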

Phase 6: The Target Encoding Leakage Demonstration

The junior data scientist needs to understand why cross-validated target encoding matters. We deliberately show the wrong way.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df, y, test_size=0.2, random_state=42, stratify=y
)

# WRONG: Fit target encoder on training set, apply to training set, train model
wrong_encoder = ce.TargetEncoder(cols=['primary_genre'], smoothing=1.0)
X_train_leaked = X_train.copy()
X_train_leaked['genre_te'] = wrong_encoder.fit_transform(
    X_train[['primary_genre']], y_train
)['primary_genre']
X_test_clean = X_test.copy()
X_test_clean['genre_te'] = wrong_encoder.transform(
    X_test[['primary_genre']]
)['primary_genre']

# Train on leaked features
features_for_model = ['genre_te', 'tenure_months', 'days_since_last_login',
                      'hours_last_30d']
model_leaked = GradientBoostingClassifier(n_estimators=200, max_depth=5,
                                          random_state=42)
model_leaked.fit(X_train_leaked[features_for_model], y_train)

auc_train_leaked = roc_auc_score(
    y_train, model_leaked.predict_proba(X_train_leaked[features_for_model])[:, 1]
)
auc_test_leaked = roc_auc_score(
    y_test, model_leaked.predict_proba(X_test_clean[features_for_model])[:, 1]
)

print("=== Target Encoding: WRONG Way (leakage) ===")
print(f"  Train AUC: {auc_train_leaked:.4f}")
print(f"  Test AUC:  {auc_test_leaked:.4f}")
print(f"  Gap:       {auc_train_leaked - auc_test_leaked:.4f}")
=== Target Encoding: WRONG Way (leakage) ===
  Train AUC: 0.7892
  Test AUC:  0.7583
  Gap:       0.0309
# RIGHT: fit the target encoder inside the cross-validation pipeline
print("\n=== Target Encoding: RIGHT Way (encoder fit inside the CV pipeline) ===")
print(f"  CV AUC:    {scores_te.mean():.4f} (+/- {scores_te.std():.4f})")
print("  Honest estimate: the encoder is refit on each training fold, so")
print("  held-out rows never contribute to their own encoding")
=== Target Encoding: RIGHT Way (encoder fit inside the CV pipeline) ===
  CV AUC:    0.7601 (+/- 0.0035)
  Honest estimate: the encoder is refit on each training fold, so
  held-out rows never contribute to their own encoding

The leaked version shows a 3.1 percentage point train-test gap. The pipeline version avoids this because the encoder is refit inside each training fold, so every held-out row is scored with an encoding computed from data that excludes its label, and the reported CV score is an honest estimate. (Rows within a training fold still contribute to their own encoding; a fully out-of-fold scheme tightens this further.)
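If the team ever needs training-set encodings that are themselves leak-free (for error analysis, stacking, or feature export), they can be computed out-of-fold by hand. A sketch using simplified additive smoothing, not category_encoders' exact formula; the helper name is illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

def oof_target_encode(cat, y, n_splits=5, smoothing=1.0, seed=42):
    # Each row is encoded from category statistics of the OTHER folds,
    # so no row's own label ever feeds its own encoding.
    cat = pd.Series(cat).reset_index(drop=True)
    y = pd.Series(y).reset_index(drop=True)
    out = pd.Series(np.nan, index=cat.index, dtype=float)
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for tr_idx, val_idx in kf.split(cat):
        g = y.iloc[tr_idx].groupby(cat.iloc[tr_idx]).agg(["count", "mean"])
        global_mean = y.iloc[tr_idx].mean()
        enc = (g["count"] * g["mean"] + smoothing * global_mean) / (
            g["count"] + smoothing)
        # Categories absent from this training fold fall back to its mean.
        out.iloc[val_idx] = (
            cat.iloc[val_idx].map(enc).fillna(global_mean).to_numpy())
    return out
```

The production encoder is still fit once on all training data; the out-of-fold values are only for any downstream use that touches the training rows themselves.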


The Recommendation

The team presents their findings to the VP of Data:

For the gradient boosted tree model in production, target encoding is the recommended strategy for primary_genre.

The evidence:

  1. Target encoding matched or slightly edged out OHE in AUC (0.7601 vs. 0.7589), a difference within fold-to-fold noise.

  2. Target encoding used 1 column instead of 47, shrinking the feature matrix by 46 columns.

  3. Training was roughly 30% faster with target encoding (30.1s vs. 42.7s).

  4. Target encoding maps an unseen genre to a sensible default (the global churn rate), while OHE with handle_unknown='ignore' silently encodes it as an uninformative all-zero row.

The caveat: target encoding must be implemented inside a cross-validation pipeline to avoid leakage. The team documents this requirement in the pipeline's technical specification and adds a unit test that verifies the encoder is never fit on the same data used for evaluation.

Takeaway --- When cardinality is moderate (10-100 categories) and the categorical feature has a meaningful relationship with the target, target encoding is usually among the strongest choices, whatever the model type. It compresses dimensionality, preserves signal, and handles new categories gracefully. But it demands disciplined cross-validation: the encoder must never be fit on data used for evaluation.


Discussion Questions

  1. The frequency encoding barely improved over baseline (0.7349 vs. 0.7312). Why? What would need to be true about the genre-to-churn relationship for frequency encoding to be more effective?

  2. If StreamFlow had 500 genres instead of 47, how would the OHE results change? Would target encoding's advantage grow or shrink?

  3. The engineering lead's concern about adding a 48th genre is valid. Write a 3-sentence production plan for handling new genres with target encoding.

  4. The smoothing parameter was set to 1.0 in this experiment. Design an experiment to tune the smoothing parameter. What metric would you optimize, and how would you search the parameter space?


This case study uses the StreamFlow SaaS churn dataset introduced in Chapter 1. Return to Chapter 7 for the encoding framework.