Case Study 2: Hyperparameter Optimization for a Production Model
Building a Customer Churn Prediction System at Scale
Background
DataDrive Inc. is an e-commerce platform with 2 million active customers. Their data science team has been tasked with building a customer churn prediction system. A customer is considered "churned" if they have not made a purchase in 90 days. Early identification of at-risk customers allows the marketing team to launch targeted retention campaigns (discount codes, personalized emails, loyalty rewards).
The dataset consists of:
- Features: 45 engineered features including purchase frequency, average order value, days since last purchase, customer support interactions, browsing behavior, and demographic attributes.
- Samples: 500,000 customers with labeled outcomes (churned vs. active).
- Class distribution: 18% churned, 82% active.
- Constraint: The model must retrain weekly and make predictions for all active customers daily. Inference latency must be under 50ms per customer.
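For concreteness, here is a minimal sketch of the 90-day labeling rule. The column names (customer_id, purchase_date) are hypothetical illustrations, not DataDrive's actual schema:

import pandas as pd

def label_churn(purchases: pd.DataFrame, snapshot: pd.Timestamp) -> pd.Series:
    """Hypothetical labeling sketch: a customer is 'churned' if their most recent
    purchase is more than 90 days before the snapshot date."""
    last_purchase = purchases.groupby("customer_id")["purchase_date"].max()
    return (snapshot - last_purchase).dt.days > 90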
The Challenge
The team has already established that a Gradient Boosted Tree (GBT) model outperforms logistic regression and random forests in preliminary experiments. Now they must:
- Find the optimal hyperparameters for the GBT model.
- Ensure the tuning process is statistically rigorous.
- Meet production constraints on latency and model size.
- Document the process for reproducibility and auditing.
Step 1: Defining the Search Space
The team identified the most impactful hyperparameters based on domain expertise and the scikit-learn documentation:
from scipy.stats import uniform, randint, loguniform
# Hyperparameter search space
SEARCH_SPACE = {
# Tree structure
"n_estimators": randint(100, 1000),
"max_depth": randint(3, 12),
"min_samples_split": randint(2, 50),
"min_samples_leaf": randint(1, 30),
# Learning parameters
"learning_rate": loguniform(1e-3, 1.0),
"subsample": uniform(0.5, 0.5),
# Regularization
"max_features": uniform(0.3, 0.7),
}
Key decisions in defining the search space:
- Learning rate uses a log-uniform distribution because the relevant range spans multiple orders of magnitude (0.001 to 1.0). A uniform distribution would concentrate nearly all samples in the upper decade, where differences matter less (see the sampling sketch after this list).
- n_estimators has a wide range because it interacts strongly with learning_rate (lower learning rates need more trees).
- subsample (row subsampling per boosting iteration) implements stochastic gradient boosting, and max_features (feature subsampling per split) adds further randomization; both reduce variance (see Chapter 7, Section 7.4).
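The log-uniform point can be checked directly. This quick comparison is illustrative only, not part of the team's pipeline; it counts how often each distribution proposes a learning rate below 0.01:

import numpy as np
from scipy.stats import uniform, loguniform

uni_samples = uniform(1e-3, 1.0 - 1e-3).rvs(10_000, random_state=0)  # uniform on [0.001, 1.0]
log_samples = loguniform(1e-3, 1.0).rvs(10_000, random_state=1)      # log-uniform on [0.001, 1.0]
print(f"uniform:     {np.mean(uni_samples < 0.01):.1%} of samples below 0.01")  # roughly 1%
print(f"log-uniform: {np.mean(log_samples < 0.01):.1%} of samples below 0.01")  # roughly 33%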
Step 2: Phase 1 -- Random Search Exploration
The team allocated a computational budget of 200 model evaluations. They started with random search to broadly explore the space.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.metrics import make_scorer, f1_score
import time
def phase1_random_search(
X_train: np.ndarray,
y_train: np.ndarray,
n_iter: int = 200,
cv_folds: int = 5,
random_state: int = 42
) -> dict:
"""Phase 1: Broad exploration with random search.
Args:
X_train: Training feature matrix.
y_train: Training target vector.
n_iter: Number of random configurations to evaluate.
cv_folds: Number of cross-validation folds.
random_state: Random seed for reproducibility.
Returns:
Dictionary with search results and timing information.
"""
cv = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=random_state)
search = RandomizedSearchCV(
estimator=GradientBoostingClassifier(random_state=random_state),
param_distributions=SEARCH_SPACE,
n_iter=n_iter,
cv=cv,
scoring="f1",
n_jobs=-1,
random_state=random_state,
verbose=1,
return_train_score=True,
)
start_time = time.time()
search.fit(X_train, y_train)
elapsed = time.time() - start_time
# Analyze results
results = {
"best_params": search.best_params_,
"best_score": search.best_score_,
"elapsed_minutes": elapsed / 60,
"all_results": search.cv_results_,
}
# Print top 5 configurations
scores = search.cv_results_["mean_test_score"]
top_indices = np.argsort(scores)[-5:][::-1]
print("\nTop 5 configurations:")
for rank, idx in enumerate(top_indices, 1):
params = {k: search.cv_results_[f"param_{k}"][idx]
for k in SEARCH_SPACE.keys()}
print(f" #{rank}: F1={scores[idx]:.4f} | {params}")
return results
The random search results revealed several important patterns:
| Insight | Finding |
|---|---|
| Most important hyperparameter | learning_rate (low values consistently better) |
| Second most important | n_estimators (strong interaction with learning_rate) |
| Least sensitive | min_samples_leaf (performance stable across range) |
| Overfitting region | max_depth > 8 with high n_estimators |
The best configuration from Phase 1 achieved F1 = 0.723 with learning_rate=0.052, n_estimators=487, max_depth=5.
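Insights like those in the table are typically read off the cv_results_ object. One simple way to approximate them is sketched below; this is not the team's exact analysis, and it is blind to interactions such as the learning_rate x n_estimators coupling, but it ranks how strongly each sampled hyperparameter tracks the validation score:

import numpy as np
from scipy.stats import spearmanr

def rank_hyperparameter_sensitivity(cv_results: dict, param_names: list[str]) -> None:
    """Rough sensitivity ranking: rank-correlate each sampled hyperparameter value
    with the mean validation score across all random-search candidates."""
    scores = np.asarray(cv_results["mean_test_score"], dtype=float)
    for name in param_names:
        values = np.asarray(cv_results[f"param_{name}"], dtype=float)
        rho, _ = spearmanr(values, scores)
        print(f"{name:>20s}: Spearman rho = {rho:+.2f}")

# Usage with the Phase 1 output:
# rank_hyperparameter_sensitivity(results["all_results"], list(SEARCH_SPACE))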
Step 3: Phase 2 -- Focused Grid Search
Based on Phase 1 insights, the team defined a refined grid around the promising region:
from sklearn.model_selection import GridSearchCV
def phase2_grid_search(
X_train: np.ndarray,
y_train: np.ndarray,
cv_folds: int = 5,
random_state: int = 42
) -> dict:
"""Phase 2: Focused refinement with grid search.
Args:
X_train: Training feature matrix.
y_train: Training target vector.
cv_folds: Number of cross-validation folds.
random_state: Random seed for reproducibility.
Returns:
Dictionary with search results.
"""
# Refined grid based on Phase 1 findings
refined_grid = {
"n_estimators": [400, 450, 500, 550, 600],
"max_depth": [4, 5, 6],
"learning_rate": [0.03, 0.04, 0.05, 0.06, 0.07],
"subsample": [0.7, 0.8, 0.9],
"min_samples_split": [10, 20, 30],
"min_samples_leaf": [5, 10], # Fixed near best from Phase 1
"max_features": [0.5, 0.6, 0.7],
}
# Total configurations: 5 * 3 * 5 * 3 * 3 * 2 * 3 = 4050
# With 5-fold CV: 20,250 model fits
# This is expensive -- consider reducing the grid or using Bayesian optimization
cv = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=random_state)
grid = GridSearchCV(
estimator=GradientBoostingClassifier(random_state=random_state),
param_grid=refined_grid,
cv=cv,
scoring="f1",
n_jobs=-1,
verbose=1,
return_train_score=True,
)
grid.fit(X_train, y_train)
# Check for overfitting
best_idx = grid.best_index_
train_score = grid.cv_results_["mean_train_score"][best_idx]
test_score = grid.cv_results_["mean_test_score"][best_idx]
print(f"Best train F1: {train_score:.4f}")
print(f"Best val F1: {test_score:.4f}")
print(f"Overfit gap: {train_score - test_score:.4f}")
return {
"best_params": grid.best_params_,
"best_score": grid.best_score_,
"train_score": train_score,
"overfit_gap": train_score - test_score,
}
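The comment inside phase2_grid_search flags the cost of a 4,050-point grid. One cheaper alternative worth knowing about, shown here as a sketch rather than something the case study ran, is successive halving via scikit-learn's experimental HalvingGridSearchCV: every candidate is first scored on a small subsample, and only the best survive to rounds with more data.

from sklearn.experimental import enable_halving_search_cv  # noqa: F401  (required opt-in)
from sklearn.model_selection import HalvingGridSearchCV, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier

def halving_grid_search(X_train, y_train, refined_grid: dict, random_state: int = 42):
    """Successive halving over the same refined grid defined in phase2_grid_search."""
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=random_state)
    search = HalvingGridSearchCV(
        estimator=GradientBoostingClassifier(random_state=random_state),
        param_grid=refined_grid,
        factor=3,               # keep roughly the best third of candidates each round
        resource="n_samples",   # the resource grown across rounds is training-set size
        cv=cv,
        scoring="f1",
        n_jobs=-1,
        random_state=random_state,
    )
    search.fit(X_train, y_train)
    return search.best_params_, search.best_score_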
Phase 2 improved the best F1 from 0.723 to 0.741. The final best parameters:
| Hyperparameter | Value |
|---|---|
| n_estimators | 550 |
| max_depth | 5 |
| learning_rate | 0.05 |
| subsample | 0.8 |
| min_samples_split | 20 |
| min_samples_leaf | 5 |
| max_features | 0.6 |
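For reference, this configuration can be spelled out as an estimator. The sketch below assumes X_train and y_train from the earlier steps; the model validated in Step 5 uses the same 550-tree, depth-5 structure.

from sklearn.ensemble import GradientBoostingClassifier

final_model = GradientBoostingClassifier(
    n_estimators=550,
    max_depth=5,
    learning_rate=0.05,
    subsample=0.8,
    min_samples_split=20,
    min_samples_leaf=5,
    max_features=0.6,
    random_state=42,
)
final_model.fit(X_train, y_train)  # X_train, y_train assumed from earlier steps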
Step 4: Phase 3 -- Bayesian Optimization for Fine-Tuning
For the final refinement, the team used Bayesian optimization to squeeze out the last bits of performance:
def phase3_bayesian_optimization(
X_train: np.ndarray,
y_train: np.ndarray,
n_calls: int = 50,
cv_folds: int = 5,
random_state: int = 42
) -> dict:
"""Phase 3: Fine-tuning with Bayesian optimization.
Args:
X_train: Training feature matrix.
y_train: Training target vector.
n_calls: Number of Bayesian optimization iterations.
cv_folds: Number of cross-validation folds.
random_state: Random seed for reproducibility.
Returns:
Dictionary with optimization results.
"""
from sklearn.model_selection import cross_val_score
# Define the objective function
def objective(params: list) -> float:
"""Objective function for Bayesian optimization.
Args:
params: List of hyperparameter values.
Returns:
Negative mean F1 score (we minimize).
"""
n_estimators, max_depth, learning_rate, subsample = params
model = GradientBoostingClassifier(
n_estimators=int(n_estimators),
max_depth=int(max_depth),
learning_rate=learning_rate,
subsample=subsample,
min_samples_split=20,
min_samples_leaf=5,
max_features=0.6,
random_state=random_state,
)
cv = StratifiedKFold(
n_splits=cv_folds, shuffle=True, random_state=random_state
)
scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1")
return -np.mean(scores) # Negative because we minimize
try:
from skopt import gp_minimize
from skopt.space import Integer, Real
space = [
Integer(450, 650, name="n_estimators"),
Integer(4, 7, name="max_depth"),
Real(0.03, 0.08, prior="log-uniform", name="learning_rate"),
Real(0.7, 0.95, name="subsample"),
]
result = gp_minimize(
objective,
space,
n_calls=n_calls,
random_state=random_state,
verbose=True,
)
return {
"best_score": -result.fun,
"best_params": {
"n_estimators": result.x[0],
"max_depth": result.x[1],
"learning_rate": result.x[2],
"subsample": result.x[3],
},
"convergence": [-v for v in result.func_vals],
}
except ImportError:
print("scikit-optimize not installed. Skipping Bayesian optimization.")
return {}
Bayesian optimization improved F1 marginally to 0.748 -- a small but meaningful improvement given the scale of the customer base (each 0.1% improvement in churn prediction translates to approximately $50,000 in annual retention revenue).
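To see those diminishing returns directly, a small plotting helper (assuming matplotlib is available) can chart the running best F1 over the "convergence" list returned by phase3_bayesian_optimization:

import numpy as np
import matplotlib.pyplot as plt

def plot_convergence(convergence: list[float]) -> None:
    """Plot the best cross-validated F1 found so far at each optimization call."""
    running_best = np.maximum.accumulate(convergence)
    plt.plot(range(1, len(convergence) + 1), running_best, marker="o")
    plt.xlabel("Bayesian optimization call")
    plt.ylabel("Best F1 so far")
    plt.title("Phase 3 convergence")
    plt.tight_layout()
    plt.show()

# Usage: plot_convergence(phase3_results["convergence"])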
Step 5: Production Constraints Validation
Before deployment, the team validated that the final model met production requirements:
import time
def validate_production_constraints(
model,
X_single: np.ndarray,
max_latency_ms: float = 50.0,
max_model_size_mb: float = 100.0,
n_timing_runs: int = 1000
) -> dict[str, float | bool]:
"""Validate that the model meets production constraints.
Args:
model: Trained scikit-learn estimator.
X_single: Single sample for latency testing, shape (1, n_features).
max_latency_ms: Maximum allowable inference latency in milliseconds.
max_model_size_mb: Maximum allowable model size in megabytes.
n_timing_runs: Number of runs for latency measurement.
Returns:
Dictionary with constraint satisfaction results.
"""
    import pickle
# Measure inference latency
latencies = []
for _ in range(n_timing_runs):
start = time.perf_counter()
model.predict_proba(X_single)
end = time.perf_counter()
latencies.append((end - start) * 1000)
p50_latency = np.percentile(latencies, 50)
p99_latency = np.percentile(latencies, 99)
# Measure model size
    model_bytes = len(pickle.dumps(model))  # size of the serialized model in bytes
model_size_mb = model_bytes / (1024 * 1024)
results = {
"p50_latency_ms": round(p50_latency, 2),
"p99_latency_ms": round(p99_latency, 2),
"latency_ok": p99_latency < max_latency_ms,
"model_size_mb": round(model_size_mb, 2),
"size_ok": model_size_mb < max_model_size_mb,
}
print(f"Latency: p50={results['p50_latency_ms']}ms, "
f"p99={results['p99_latency_ms']}ms "
f"({'PASS' if results['latency_ok'] else 'FAIL'})")
print(f"Model size: {results['model_size_mb']}MB "
f"({'PASS' if results['size_ok'] else 'FAIL'})")
return results
The model with 550 trees and max_depth=5 passed both constraints:
- p99 latency: 12ms (well under 50ms)
- Model size: 23MB (well under 100MB)
However, the team noted that a much larger ensemble (on the order of 2,000+ trees, well beyond the ranges searched here) would likely have exceeded the latency budget. This illustrates the importance of treating production requirements as hard constraints during hyperparameter search rather than checking them only after the fact; one way to wire the check into the search is sketched below.
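A sketch of what that could look like, not the team's actual code: a constraint-aware version of the Phase 3 objective that reuses validate_production_constraints from above (X_train and y_train are assumed to be in scope) and returns a heavy penalty whenever a candidate fails the latency or size check, steering the optimizer away from infeasible models.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def constrained_objective(params: list, latency_budget_ms: float = 50.0) -> float:
    """Constraint-aware variant of the Phase 3 objective (illustrative sketch)."""
    n_estimators, max_depth, learning_rate, subsample = params
    model = GradientBoostingClassifier(
        n_estimators=int(n_estimators),
        max_depth=int(max_depth),
        learning_rate=learning_rate,
        subsample=subsample,
        min_samples_split=20,
        min_samples_leaf=5,
        max_features=0.6,
        random_state=42,
    )
    # Fit once to measure latency and size; reject infeasible configurations outright.
    model.fit(X_train, y_train)
    checks = validate_production_constraints(model, X_train[:1], max_latency_ms=latency_budget_ms)
    if not (checks["latency_ok"] and checks["size_ok"]):
        return 1.0  # large penalty: feasible values are in [-1, 0] since we minimize -F1
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    return -np.mean(cross_val_score(model, X_train, y_train, cv=cv, scoring="f1"))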
Step 6: Final Evaluation with Nested Cross-Validation
To estimate the tuned model's out-of-sample performance, the team ran an outer cross-validation with the selected hyperparameters held fixed. Strictly speaking, this is not fully nested: a fully nested procedure (Section 8.3.5) would also repeat the hyperparameter search inside each outer training fold to remove selection bias. A sketch of that variant follows the results below.
from sklearn.model_selection import cross_val_score
def final_nested_evaluation(
X: np.ndarray,
y: np.ndarray,
best_params: dict,
outer_k: int = 5,
random_state: int = 42
) -> dict:
"""Final unbiased evaluation using the tuned hyperparameters.
Args:
X: Full feature matrix.
y: Full target vector.
best_params: Best hyperparameters from tuning.
outer_k: Number of outer cross-validation folds.
random_state: Random seed.
Returns:
Dictionary with performance estimates and confidence intervals.
"""
model = GradientBoostingClassifier(**best_params, random_state=random_state)
cv = StratifiedKFold(n_splits=outer_k, shuffle=True, random_state=random_state)
scores = cross_val_score(model, X, y, cv=cv, scoring="f1")
mean_f1 = np.mean(scores)
std_f1 = np.std(scores)
ci_lower = mean_f1 - 1.96 * std_f1 / np.sqrt(outer_k)
ci_upper = mean_f1 + 1.96 * std_f1 / np.sqrt(outer_k)
print(f"F1 Score: {mean_f1:.4f} +/- {std_f1:.4f}")
print(f"95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
print(f"Fold scores: {[round(s, 4) for s in scores]}")
return {
"mean_f1": mean_f1,
"std_f1": std_f1,
"ci_lower": ci_lower,
"ci_upper": ci_upper,
"fold_scores": scores.tolist(),
}
Final result: F1 = 0.748 +/- 0.012, 95% CI [0.737, 0.759].
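For comparison, a fully nested procedure would wrap the search itself in the outer loop, so hyperparameters are re-selected inside every outer training fold. The sketch below reuses SEARCH_SPACE from Step 1 and is far more expensive than scoring a single fixed configuration:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold, cross_val_score

def fully_nested_cv(X, y, n_iter: int = 50, outer_k: int = 5, inner_k: int = 3,
                    random_state: int = 42) -> np.ndarray:
    """Nested CV sketch: the inner RandomizedSearchCV re-tunes within each outer
    training fold, so the outer fold scores are free of selection bias."""
    inner_cv = StratifiedKFold(n_splits=inner_k, shuffle=True, random_state=random_state)
    outer_cv = StratifiedKFold(n_splits=outer_k, shuffle=True, random_state=random_state)
    inner_search = RandomizedSearchCV(
        estimator=GradientBoostingClassifier(random_state=random_state),
        param_distributions=SEARCH_SPACE,  # defined in Step 1
        n_iter=n_iter,
        cv=inner_cv,
        scoring="f1",
        n_jobs=-1,
        random_state=random_state,
    )
    return cross_val_score(inner_search, X, y, cv=outer_cv, scoring="f1")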
Step 7: Documentation and Reproducibility
The team created a comprehensive experiment log:
def create_experiment_log(
phases: list[dict],
final_params: dict,
final_metrics: dict,
production_checks: dict
) -> dict:
"""Create a structured experiment log for reproducibility.
Args:
phases: List of results from each optimization phase.
final_params: Final selected hyperparameters.
final_metrics: Final performance metrics.
production_checks: Production constraint validation results.
Returns:
Complete experiment log dictionary.
"""
return {
"experiment_id": "churn-gbt-v2.3",
"date": "2024-09-15",
"dataset": {
"n_samples": 500000,
"n_features": 45,
"positive_rate": 0.18,
"train_size": 400000,
"test_size": 100000,
},
"optimization": {
"phase1_random_search": {
"n_iterations": 200,
"best_f1": phases[0]["best_score"],
},
"phase2_grid_search": {
"n_configurations": 4050,
"best_f1": phases[1]["best_score"],
},
"phase3_bayesian": {
"n_iterations": 50,
"best_f1": phases[2].get("best_score", "N/A"),
},
},
"final_model": {
"type": "GradientBoostingClassifier",
"hyperparameters": final_params,
"metrics": final_metrics,
"production_validation": production_checks,
},
"random_seed": 42,
"software_versions": {
"python": "3.11",
"scikit-learn": "1.3.0",
"numpy": "1.25.0",
},
}
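A usage sketch that writes the log to disk alongside the model artifact. The variable names (phase1_results, nested_eval_results, prod_checks, and the output filename) are illustrative placeholders for the return values of the functions defined above:

import json

log = create_experiment_log(
    phases=[phase1_results, phase2_results, phase3_results],
    final_params=phase2_results["best_params"],
    final_metrics=nested_eval_results,
    production_checks=prod_checks,
)
with open("churn-gbt-v2.3-experiment-log.json", "w") as f:
    json.dump(log, f, indent=2)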
Results Summary
| Phase | Strategy | Budget | Best F1 | Wall Time |
|---|---|---|---|---|
| 1 | Random Search | 200 evals | 0.723 | 3.2 hours |
| 2 | Grid Search | 4,050 evals | 0.741 | 14.5 hours |
| 3 | Bayesian Optimization | 50 evals | 0.748 | 2.1 hours |
The three-phase approach spent the most computation in Phase 2 (exhaustive local refinement) but gained the most per evaluation in Phase 3 (sample-efficient fine-tuning).
Lessons Learned
- Start broad, then narrow. Random search efficiently identifies promising regions of the hyperparameter space. Grid search and Bayesian optimization are more effective once the search space is constrained.
- Log-uniform distributions for scale parameters. Learning rate, regularization strength, and other parameters that span orders of magnitude should be searched on a log scale.
- Monitor the overfitting gap. The difference between training and validation scores is as important as the validation score itself. A high-scoring model with a large gap is likely to disappoint in production.
- Production constraints are hyperparameters too. Model size and inference latency should be treated as hard constraints during the search, not afterthoughts.
- Document everything. Random seeds, software versions, search spaces, and all intermediate results must be recorded for reproducibility. Six months from now, when the model needs retraining, the team must be able to reproduce the process exactly.
- Diminishing returns are real. The jump from Phase 1 to Phase 2 (0.723 to 0.741) was much larger than the jump from Phase 2 to Phase 3 (0.741 to 0.748). Know when to stop tuning and start deploying.
Connection to Chapter Concepts
This case study directly applies:
- Section 8.7.2: Grid search for exhaustive local refinement.
- Section 8.7.3: Random search for efficient exploration of high-dimensional spaces.
- Section 8.7.4: Bayesian optimization for sample-efficient fine-tuning.
- Section 8.7.6: Practical tips including log-uniform distributions and computational budgets.
- Section 8.3.5: Nested cross-validation for unbiased performance estimation.
- Section 8.6: Bias-variance diagnosis through the overfitting gap.
- Section 8.10.2: Awareness of overfitting to the validation set.
Discussion Questions
- The team evaluated 4,300 total configurations (200 + 4,050 + 50). How would you determine if this was sufficient? What evidence would suggest that more evaluations would yield meaningful improvement?
- The team used F1 as the optimization metric. If the marketing team's retention campaign has a fixed budget (it can only contact 5,000 customers per week), would a different metric be more appropriate? What would you recommend?
- How would you handle the situation if the Phase 3 Bayesian optimization found a model with F1 = 0.755 but with p99 latency of 65ms (exceeding the 50ms constraint)?
- The model will be retrained weekly. Should hyperparameter tuning be repeated each week, or should the tuned hyperparameters be reused? What signals would indicate that re-tuning is necessary?