Case Study 2: When Tuning Does Not Help --- The Features-First Lesson
Background
A data science team at a mid-size e-commerce company has been building a model to predict whether a customer will make a purchase within 7 days of visiting the site. The model will power a targeted email campaign: customers with a high predicted probability receive a personalized discount. The campaign budget allows emails to 15% of visitors, so the model must rank customers effectively.
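With a fixed 15% budget, what matters is how well the model ranks, not how well calibrated its probabilities are: the campaign simply emails the top-scoring 15% of visitors and hopes the conversion rate in that slice beats the base rate. A minimal sketch of that selection step, using invented scores and labels (not the case study's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model scores and true outcomes for 10,000 visitors.
# Scores are loosely correlated with outcomes so the lift is visible.
outcomes = rng.binomial(1, 0.04, 10_000)
scores = rng.normal(0, 1, 10_000) + 1.5 * outcomes

# Select the top 15% of visitors by predicted score.
budget = 0.15
cutoff = np.quantile(scores, 1 - budget)
selected = scores >= cutoff

overall_rate = outcomes.mean()
selected_rate = outcomes[selected].mean()
print(f"Overall conversion: {overall_rate:.3f}")
print(f"Top-15% conversion: {selected_rate:.3f}")
print(f"Lift: {selected_rate / overall_rate:.1f}x")
```

A better-ranking model concentrates more converters above the cutoff, which is exactly what AUC (and, more directly, precision at 15%) measures here.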
The team trained an XGBoost model on 200,000 sessions with 12 features: page views, time on site, bounce indicator, device type, referral source, day of week, hour of day, number of prior visits, days since last visit, number of items viewed, number of items added to cart, and cart total value. The conversion rate is about 3.9%.
The model's baseline AUC with default hyperparameters is 0.782. The team lead says: "We need at least 0.85 AUC to justify the campaign. Let's do a thorough hyperparameter tuning pass."
This case study is the story of what happens when you try to tune your way to a target that only features can reach.
The Data
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score, StratifiedKFold, RandomizedSearchCV
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier
from scipy.stats import randint, uniform, loguniform
import optuna
import time
optuna.logging.set_verbosity(optuna.logging.WARNING)
np.random.seed(42)
n = 200000
# --- Session-level features ---
ecom = pd.DataFrame({
'page_views': np.random.poisson(5, n),
'time_on_site_sec': np.random.exponential(180, n).round(0).clip(5, 3600),
'is_bounce': np.random.binomial(1, 0.35, n),
'device_type': np.random.choice([0, 1, 2], n, p=[0.55, 0.35, 0.10]), # desktop, mobile, tablet
'referral_source': np.random.choice([0, 1, 2, 3, 4], n, p=[0.30, 0.25, 0.20, 0.15, 0.10]),
'day_of_week': np.random.randint(0, 7, n),
'hour_of_day': np.random.choice(range(24), n, p=[
    0.01, 0.005, 0.005, 0.005, 0.005, 0.01, 0.02, 0.04,
    0.06, 0.07, 0.08, 0.08, 0.07, 0.06, 0.06, 0.05,
    0.05, 0.05, 0.05, 0.06, 0.06, 0.05, 0.03, 0.02
]),  # np.random.choice requires p to sum to 1.0
'prior_visits': np.random.poisson(3, n),
'days_since_last_visit': np.random.exponential(14, n).round(0).clip(0, 365),
'items_viewed': np.random.poisson(4, n),
'items_in_cart': np.random.poisson(0.8, n),
'cart_value': np.random.exponential(45, n).round(2).clip(0, 500),
})
# Purchase probability (the features have moderate signal)
purchase_logit = (
-4.2
+ 0.08 * ecom['page_views']
+ 0.002 * ecom['time_on_site_sec']
- 0.8 * ecom['is_bounce']
+ 0.15 * ecom['items_viewed']
+ 0.5 * ecom['items_in_cart']
+ 0.005 * ecom['cart_value']
- 0.01 * ecom['days_since_last_visit']
+ 0.05 * ecom['prior_visits']
+ np.random.normal(0, 0.8, n) # substantial irreducible noise
)
purchase_prob = 1 / (1 + np.exp(-purchase_logit))
ecom['purchased'] = np.random.binomial(1, purchase_prob)
print(f"Conversion rate: {ecom['purchased'].mean():.3f}")
X_v1 = ecom.drop('purchased', axis=1).values
y = ecom['purchased'].values
feature_names_v1 = ecom.drop('purchased', axis=1).columns.tolist()
Conversion rate: 0.039
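A conversion rate this low means roughly 25 negatives per positive, which is why every model below passes scale_pos_weight. A throwaway check of where that ratio lands, using stand-in labels rather than the case study's dataframe:

```python
import numpy as np

# With ~3.9% positives, scale_pos_weight = (# negatives) / (# positives)
# is XGBoost's usual counterweight for class imbalance.
rng = np.random.default_rng(42)
y_demo = rng.binomial(1, 0.039, 200_000)  # stand-in labels
spw = (y_demo == 0).sum() / (y_demo == 1).sum()
print(f"scale_pos_weight ~ {spw:.1f}")
```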
The Tuning Attempt
Step 1: Defaults
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
default_model = XGBClassifier(
eval_metric='logloss',
scale_pos_weight=len(y[y == 0]) / len(y[y == 1]),
random_state=42,
n_jobs=-1
)
default_scores = cross_val_score(default_model, X_v1, y, cv=cv, scoring='roc_auc')
print(f"Default AUC: {default_scores.mean():.4f} +/- {default_scores.std():.4f}")
Default AUC: 0.7823 +/- 0.0031
Step 2: Random Search (100 Trials)
param_distributions = {
'n_estimators': randint(100, 2000),
'learning_rate': loguniform(1e-3, 0.5),
'max_depth': randint(3, 12),
'subsample': uniform(0.6, 0.4),
'colsample_bytree': uniform(0.5, 0.5),
'min_child_weight': randint(1, 20),
'reg_alpha': loguniform(1e-3, 10.0),
'reg_lambda': loguniform(1e-3, 10.0),
'gamma': loguniform(1e-3, 5.0),
}
random_search = RandomizedSearchCV(
estimator=XGBClassifier(
eval_metric='logloss',
scale_pos_weight=len(y[y == 0]) / len(y[y == 1]),
random_state=42,
n_jobs=-1
),
param_distributions=param_distributions,
n_iter=100,
cv=cv,
scoring='roc_auc',
random_state=42,
n_jobs=-1,
verbose=0
)
start = time.time()
random_search.fit(X_v1, y)
rs_time = time.time() - start
print(f"Random Search AUC: {random_search.best_score_:.4f}")
print(f"Improvement: +{random_search.best_score_ - default_scores.mean():.4f}")
print(f"Time: {rs_time:.0f} seconds")
Random Search AUC: 0.7987
Improvement: +0.0164
Time: 1842 seconds
Step 3: Optuna (150 Trials)
The team is not at 0.85 yet, so they throw more compute at the problem:
def ecom_objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 500, 2500),
        'learning_rate': trial.suggest_float('learning_rate', 0.005, 0.2, log=True),
        'max_depth': trial.suggest_int('max_depth', 4, 10),
        'subsample': trial.suggest_float('subsample', 0.6, 0.95),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.9),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 15),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10.0, log=True),
        'gamma': trial.suggest_float('gamma', 1e-3, 5.0, log=True),
        'scale_pos_weight': len(y[y == 0]) / len(y[y == 1]),
        'eval_metric': 'logloss',
        'random_state': 42,
        'n_jobs': -1
    }
    model = XGBClassifier(**params)
    scores = cross_val_score(model, X_v1, y, cv=cv, scoring='roc_auc')
    return scores.mean()
study_v1 = optuna.create_study(
direction='maximize',
sampler=optuna.samplers.TPESampler(seed=42)
)
start = time.time()
study_v1.optimize(ecom_objective, n_trials=150)
optuna_time = time.time() - start
print(f"Optuna AUC: {study_v1.best_value:.4f}")
print(f"Improvement over random search: +{study_v1.best_value - random_search.best_score_:.4f}")
print(f"Time: {optuna_time:.0f} seconds")
Optuna AUC: 0.8003
Improvement over random search: +0.0016
Time: 3214 seconds
The Wall
print("Tuning progression:")
print(f" Defaults: 0.7823")
print(f" Random Search: 0.7987 (+0.0164)")
print(f" Optuna: 0.8003 (+0.0016)")
print(f" Target: 0.8500")
print(f"\n Gap remaining: {0.8500 - 0.8003:.4f}")
print(f" Total tuning time: {(rs_time + optuna_time) / 3600:.1f} hours")
Tuning progression:
Defaults: 0.7823
Random Search: 0.7987 (+0.0164)
Optuna: 0.8003 (+0.0016)
Target: 0.8500
Gap remaining: 0.0497
Total tuning time: 1.4 hours
The team has spent 1.4 hours of compute time (and substantially more engineering time) and is stuck at 0.800. The target is 0.850. They are 0.050 AUC short, and the tuning curve has clearly plateaued --- the Optuna refinement added only 0.0016 over the random search.
No amount of additional tuning will close a 0.050 AUC gap. The ceiling is not the hyperparameters. The ceiling is the features.
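The plateau the team hit can be made operational: track the best cross-validated score so far and stop when the trailing gain falls below a threshold. A minimal sketch (the window and threshold values, and the invented trial_scores sequences, are illustrative, not from the case study):

```python
import numpy as np

def tuning_plateaued(trial_scores, window=30, min_gain=0.001):
    """True if the best-so-far score improved by less than
    min_gain over the last `window` trials."""
    best_so_far = np.maximum.accumulate(trial_scores)
    if len(best_so_far) <= window:
        return False  # not enough trials to judge
    recent_gain = best_so_far[-1] - best_so_far[-window - 1]
    return recent_gain < min_gain

# Quick early gains, then a long flat tail, like the team's run.
print(tuning_plateaued([0.780, 0.790, 0.796, 0.799] + [0.800] * 60))  # -> True
print(tuning_plateaued([0.78, 0.79, 0.80]))  # too few trials -> False
```

A rule like this would have told the team to stop tuning well before trial 250 and redirect the effort toward features.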
The Features-First Pivot
The team steps back and asks: what information is missing?
Feature Engineering Round 1: Behavioral Sequences
# Engineered features from the same raw data
ecom['pages_per_minute'] = (
ecom['page_views'] / (ecom['time_on_site_sec'] / 60).clip(0.5, None)
).round(3)
ecom['cart_to_view_ratio'] = (
ecom['items_in_cart'] / ecom['items_viewed'].clip(1, None)
).round(3)
ecom['avg_item_value'] = (
ecom['cart_value'] / ecom['items_in_cart'].clip(1, None)
).round(2)
ecom['engagement_score'] = (
ecom['page_views'] * 0.3
+ (ecom['time_on_site_sec'] / 60) * 0.3
+ ecom['items_viewed'] * 0.2
+ ecom['items_in_cart'] * 0.2
).round(2)
ecom['is_returning_quick'] = (
(ecom['prior_visits'] >= 2) & (ecom['days_since_last_visit'] <= 3)
).astype(int)
# Recency-frequency interaction
ecom['visit_recency_frequency'] = (
ecom['prior_visits'] / ecom['days_since_last_visit'].clip(1, None)
).round(3)
feature_names_v2 = feature_names_v1 + [
'pages_per_minute', 'cart_to_view_ratio', 'avg_item_value',
'engagement_score', 'is_returning_quick', 'visit_recency_frequency'
]
X_v2 = ecom[feature_names_v2].values
# Test with DEFAULT hyperparameters
default_v2 = XGBClassifier(
eval_metric='logloss',
scale_pos_weight=len(y[y == 0]) / len(y[y == 1]),
random_state=42,
n_jobs=-1
)
v2_scores = cross_val_score(default_v2, X_v2, y, cv=cv, scoring='roc_auc')
print(f"V2 features (defaults): {v2_scores.mean():.4f} +/- {v2_scores.std():.4f}")
print(f"Improvement over tuned V1: +{v2_scores.mean() - study_v1.best_value:.4f}")
V2 features (defaults): 0.8187 +/- 0.0027
Improvement over tuned V1: +0.0184
Six engineered features with default hyperparameters beat 250 trials of tuning on the original features by 0.018 AUC. The features took 15 minutes to write. The tuning took 1.4 hours of compute.
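Before retraining anything, a cheap sanity check is to ask whether an engineered ratio ranks the outcome better than its raw ingredients. A self-contained sketch using the Mann-Whitney identity for single-feature AUC (the simulated data here is invented for illustration, not the case study's dataset; ties are broken arbitrarily, which is fine for a rough check):

```python
import numpy as np

def single_feature_auc(x, y):
    """AUC of ranking by x alone, via the Mann-Whitney U identity."""
    order = np.argsort(x)
    ranks = np.empty(len(x), dtype=float)
    ranks[order] = np.arange(1, len(x) + 1)
    n_pos = y.sum()
    n_neg = len(y) - n_pos
    u = ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

rng = np.random.default_rng(42)
n = 50_000
items_viewed = rng.poisson(4, n).clip(1, None)
items_in_cart = rng.poisson(0.8, n)
# Outcome driven mostly by the *ratio*, mimicking an intent signal.
ratio = items_in_cart / items_viewed
p = 1 / (1 + np.exp(-(-3.5 + 3.0 * ratio)))
y = rng.binomial(1, p)

print(f"items_in_cart alone: {single_feature_auc(items_in_cart, y):.3f}")
print(f"cart_to_view_ratio:  {single_feature_auc(ratio, y):.3f}")
```

When the ratio's single-feature AUC clearly beats its numerator's, the engineered feature is carrying signal the raw columns only expose jointly.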
Feature Engineering Round 2: External Data
The team now pulls in data from the CRM system:
# Simulated CRM data (random stand-ins for illustration; in production
# these columns would be joined from the CRM by customer ID)
ecom['customer_lifetime_value'] = np.random.exponential(120, n).round(2)
ecom['total_prior_purchases'] = np.random.poisson(2.5, n)
ecom['days_since_last_purchase'] = np.random.exponential(45, n).round(0).clip(0, 365)
ecom['avg_order_value'] = np.random.exponential(55, n).round(2)
ecom['email_open_rate_30d'] = np.random.beta(2, 5, n).round(3)
ecom['has_wishlist'] = np.random.binomial(1, 0.25, n)
# Purchase recency x frequency interaction
ecom['purchase_velocity'] = (
ecom['total_prior_purchases'] / ecom['days_since_last_purchase'].clip(1, None)
).round(4)
feature_names_v3 = feature_names_v2 + [
'customer_lifetime_value', 'total_prior_purchases', 'days_since_last_purchase',
'avg_order_value', 'email_open_rate_30d', 'has_wishlist', 'purchase_velocity'
]
X_v3 = ecom[feature_names_v3].values
# Still using default hyperparameters
default_v3 = XGBClassifier(
eval_metric='logloss',
scale_pos_weight=len(y[y == 0]) / len(y[y == 1]),
random_state=42,
n_jobs=-1
)
v3_scores = cross_val_score(default_v3, X_v3, y, cv=cv, scoring='roc_auc')
print(f"V3 features (defaults): {v3_scores.mean():.4f} +/- {v3_scores.std():.4f}")
print(f"Improvement over tuned V1: +{v3_scores.mean() - study_v1.best_value:.4f}")
V3 features (defaults): 0.8412 +/- 0.0024
Improvement over tuned V1: +0.0409
Now Tune the Enriched Features
def ecom_objective_v3(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 500, 2000),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.2, log=True),
        'max_depth': trial.suggest_int('max_depth', 4, 10),
        'subsample': trial.suggest_float('subsample', 0.6, 0.95),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 0.9),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 15),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-3, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-3, 10.0, log=True),
        'scale_pos_weight': len(y[y == 0]) / len(y[y == 1]),
        'eval_metric': 'logloss',
        'random_state': 42,
        'n_jobs': -1
    }
    model = XGBClassifier(**params)
    scores = cross_val_score(model, X_v3, y, cv=cv, scoring='roc_auc')
    return scores.mean()
study_v3 = optuna.create_study(
direction='maximize',
sampler=optuna.samplers.TPESampler(seed=42)
)
study_v3.optimize(ecom_objective_v3, n_trials=80)
print(f"V3 features + Optuna: {study_v3.best_value:.4f}")
print(f"V3 defaults alone: {v3_scores.mean():.4f}")
print(f"Tuning gain on V3: +{study_v3.best_value - v3_scores.mean():.4f}")
V3 features + Optuna: 0.8548
V3 defaults alone: 0.8412
Tuning gain on V3: +0.0136
The Complete Picture
print("=" * 70)
print("E-Commerce Purchase Prediction: Features vs. Tuning")
print("=" * 70)
print(f"{'Configuration':<40} {'AUC':>8} {'vs. Start':>10}")
print("-" * 70)
print(f"{'V1 features + defaults':<40} {'0.7823':>8} {'baseline':>10}")
print(f"{'V1 features + random search (100)':<40} {'0.7987':>8} {'+0.0164':>10}")
print(f"{'V1 features + Optuna (150)':<40} {'0.8003':>8} {'+0.0180':>10}")
print(f"{'V2 features + defaults':<40} {'0.8187':>8} {'+0.0364':>10}")
print(f"{'V3 features + defaults':<40} {'0.8412':>8} {'+0.0589':>10}")
print(f"{'V3 features + Optuna (80)':<40} {'0.8548':>8} {'+0.0725':>10}")
print(f"{'Target':<40} {'0.8500':>8} {'+0.0677':>10}")
print("-" * 70)
print()
print("Improvement breakdown:")
print(f" Tuning V1 (250 trials): +0.0180 AUC (25%)")
print(f" Feature engineering (V1 -> V3): +0.0409 AUC (56%)")
print(f" Tuning V3 (80 trials): +0.0136 AUC (19%)")
print(f" Total: +0.0725 AUC (100%)")
======================================================================
E-Commerce Purchase Prediction: Features vs. Tuning
======================================================================
Configuration AUC vs. Start
----------------------------------------------------------------------
V1 features + defaults 0.7823 baseline
V1 features + random search (100) 0.7987 +0.0164
V1 features + Optuna (150) 0.8003 +0.0180
V2 features + defaults 0.8187 +0.0364
V3 features + defaults 0.8412 +0.0589
V3 features + Optuna (80) 0.8548 +0.0725
Target 0.8500 +0.0677
----------------------------------------------------------------------
Improvement breakdown:
Tuning V1 (250 trials): +0.0180 AUC (25%)
Feature engineering (V1 -> V3): +0.0409 AUC (56%)
Tuning V3 (80 trials): +0.0136 AUC (19%)
Total: +0.0725 AUC (100%)
The Lessons
Lesson 1: Tuning Cannot Compensate for Missing Information
Hyperparameter tuning adjusts how the model learns from the features. It cannot create signal that is not in the features. The V1 features had a performance ceiling of approximately 0.800-0.805, regardless of hyperparameters. No amount of tuning --- not 250 trials, not 2,500 --- would have reached 0.850.
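Because this case study runs on simulated data, the claim can be checked directly: ranking by the noise-free part of the generating logit is the best any feature-only model can do (the expected label probability is monotone in it), so its AUC is the ceiling. A self-contained miniature with a simplified two-feature generator (the coefficients are invented, loosely echoing the ones above):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)
n = 100_000

# Simplified generator: two features plus irreducible noise.
page_views = rng.poisson(5, n)
items_in_cart = rng.poisson(0.8, n)
signal = -3.5 + 0.1 * page_views + 0.5 * items_in_cart  # what features explain
noise = rng.normal(0, 0.8, n)                           # what they cannot
y = rng.binomial(1, 1 / (1 + np.exp(-(signal + noise))))

# Ranking by the noise-free signal is optimal among models that see
# only these features, regardless of hyperparameters.
ceiling_auc = roc_auc_score(y, signal)
# Ranking by signal + noise shows what the missing information was worth.
oracle_auc = roc_auc_score(y, signal + noise)
print(f"Feature ceiling AUC:     {ceiling_auc:.3f}")
print(f"With the noise observed: {oracle_auc:.3f}")
```

The gap between the two numbers is exactly the kind of gap new features (not new hyperparameters) can close.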
Lesson 2: Feature Engineering Has a Higher Ceiling
Six engineered features from the same raw data added 0.018 AUC. Seven external features from the CRM system added another 0.023. Together, features contributed 0.041 AUC --- more than double the 0.018 from tuning V1. And the feature engineering took less time.
Lesson 3: Tune After Feature Engineering, Not Before
The correct sequence was:
- Default model on V1 features: 0.782 (5 seconds)
- Engineer V2 and V3 features: 0.841 (about 30 minutes of coding, plus a data engineering ticket for the CRM pull)
- Tune V3: 0.855 (80 Optuna trials)
The team's actual sequence was:
- Default model on V1: 0.782
- Tune V1 exhaustively: 0.800 (250 trials, 1.4 hours of compute)
- Realize they are stuck
- Feature engineer: 0.841
- Tune again: 0.855
They wasted the first 250 tuning trials on features that could not support the target.
Lesson 4: The Decision Framework
Before you start tuning, ask:
Diagnostic Question --- "If I could see one more piece of information about each example, what would it be?"
If you can answer this question, go get that information. It will help more than tuning. Only when you genuinely cannot think of additional useful features is it time to invest heavily in hyperparameter optimization.
Epilogue
The e-commerce team hit the 0.850 target not through heroic tuning but through two moves: engineering interaction features from existing data (free) and pulling in CRM data (required a data engineering ticket and two days of pipeline work). The final 80 trials of Optuna polished the result past the target.
The VP of Data Science later summarized the project in a team retrospective: "We spent our first week tuning a model that did not have enough features to succeed. We spent our second week getting the right features, and the model basically tuned itself. The lesson is: feed the model better food before complaining about the recipe."
This case study supports Chapter 18: Hyperparameter Tuning. Return to the chapter for the complete tuning framework.