Case Study 2: Hyperparameter Optimization for a Production Model

Building a Customer Churn Prediction System at Scale

Background

DataDrive Inc. is an e-commerce platform with 2 million active customers. Their data science team has been tasked with building a customer churn prediction system. A customer is considered "churned" if they have not made a purchase in 90 days. Early identification of at-risk customers allows the marketing team to launch targeted retention campaigns (discount codes, personalized emails, loyalty rewards).

The dataset consists of:

  • Features: 45 engineered features including purchase frequency, average order value, days since last purchase, customer support interactions, browsing behavior, and demographic attributes.
  • Samples: 500,000 customers with labeled outcomes (churned vs. active).
  • Class distribution: 18% churned, 82% active.
  • Constraint: The model must retrain weekly and make predictions for all active customers daily. Inference latency must be under 50ms per customer.

The Challenge

The team has already established that a Gradient Boosted Tree (GBT) model outperforms logistic regression and random forests in preliminary experiments. Now they must:

  1. Find the optimal hyperparameters for the GBT model.
  2. Ensure the tuning process is statistically rigorous.
  3. Meet production constraints on latency and model size.
  4. Document the process for reproducibility and auditing.

Step 1: Defining the Search Space

The team identified the most impactful hyperparameters based on domain expertise and the scikit-learn documentation:

from scipy.stats import uniform, randint, loguniform

# Hyperparameter search space
SEARCH_SPACE = {
    # Tree structure
    "n_estimators": randint(100, 1000),
    "max_depth": randint(3, 12),
    "min_samples_split": randint(2, 50),
    "min_samples_leaf": randint(1, 30),

    # Learning parameters
    "learning_rate": loguniform(1e-3, 1.0),
    "subsample": uniform(0.5, 0.5),

    # Regularization
    "max_features": uniform(0.3, 0.7),
}

Key decisions in defining the search space:

  • Learning rate uses a log-uniform distribution because the relevant range spans multiple orders of magnitude (0.001 to 1.0). A uniform distribution would oversample the upper range, where differences matter less (see the sketch after this list).
  • n_estimators has a wide range because it interacts strongly with learning_rate (lower learning rates need more trees).
  • subsample and max_features control stochastic gradient boosting, reducing variance (see Chapter 7, Section 7.4).
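
To make the first point concrete, the short sketch below (illustrative only; not part of the team's pipeline) draws 10,000 values from each distribution over the same range and checks how many fall below 0.01:

from scipy.stats import uniform, loguniform

# Draw the same number of samples from each distribution over [0.001, 1.0]
n_samples = 10_000
log_draws = loguniform(1e-3, 1.0).rvs(n_samples, random_state=0)
uni_draws = uniform(loc=1e-3, scale=1.0 - 1e-3).rvs(n_samples, random_state=0)

# Share of samples landing in the low region (< 0.01), where small learning
# rates live: roughly one third under log-uniform, about 1% under uniform.
print(f"log-uniform below 0.01: {(log_draws < 0.01).mean():.1%}")
print(f"uniform below 0.01:     {(uni_draws < 0.01).mean():.1%}")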

Step 2: Phase 1 -- Random Search Exploration

The team allocated a computational budget of 200 model evaluations. They started with random search to broadly explore the space.

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.metrics import make_scorer, f1_score
import time

def phase1_random_search(
    X_train: np.ndarray,
    y_train: np.ndarray,
    n_iter: int = 200,
    cv_folds: int = 5,
    random_state: int = 42
) -> dict:
    """Phase 1: Broad exploration with random search.

    Args:
        X_train: Training feature matrix.
        y_train: Training target vector.
        n_iter: Number of random configurations to evaluate.
        cv_folds: Number of cross-validation folds.
        random_state: Random seed for reproducibility.

    Returns:
        Dictionary with search results and timing information.
    """
    cv = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=random_state)

    search = RandomizedSearchCV(
        estimator=GradientBoostingClassifier(random_state=random_state),
        param_distributions=SEARCH_SPACE,
        n_iter=n_iter,
        cv=cv,
        scoring="f1",
        n_jobs=-1,
        random_state=random_state,
        verbose=1,
        return_train_score=True,
    )

    start_time = time.time()
    search.fit(X_train, y_train)
    elapsed = time.time() - start_time

    # Analyze results
    results = {
        "best_params": search.best_params_,
        "best_score": search.best_score_,
        "elapsed_minutes": elapsed / 60,
        "all_results": search.cv_results_,
    }

    # Print top 5 configurations
    scores = search.cv_results_["mean_test_score"]
    top_indices = np.argsort(scores)[-5:][::-1]
    print("\nTop 5 configurations:")
    for rank, idx in enumerate(top_indices, 1):
        params = {k: search.cv_results_[f"param_{k}"][idx]
                  for k in SEARCH_SPACE.keys()}
        print(f"  #{rank}: F1={scores[idx]:.4f} | {params}")

    return results

The random search results revealed several important patterns:

Insight                         Finding
Most important hyperparameter   learning_rate (low values consistently better)
Second most important           n_estimators (strong interaction with learning_rate)
Least sensitive                 min_samples_leaf (performance stable across range)
Overfitting region              max_depth > 8 with high n_estimators
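
One way to surface patterns like these (a minimal sketch, not the team's actual analysis code; it assumes every searched hyperparameter is numeric, as is the case here) is to rank-correlate each sampled value with the mean cross-validated score in cv_results_:

import numpy as np
from scipy.stats import spearmanr

def summarize_sensitivity(cv_results: dict) -> list[tuple[str, float]]:
    """Rank hyperparameters by |Spearman correlation| with the mean CV score.

    Args:
        cv_results: The cv_results_ attribute of a fitted search object.

    Returns:
        (hyperparameter, correlation) pairs, strongest first.
    """
    scores = np.asarray(cv_results["mean_test_score"], dtype=float)
    sensitivities = []
    for key in cv_results:
        if not key.startswith("param_"):
            continue
        values = np.asarray(cv_results[key], dtype=float)  # numeric search space only
        rho, _ = spearmanr(values, scores)
        sensitivities.append((key.removeprefix("param_"), float(rho)))
    # A large |rho| flags a sensitive hyperparameter
    return sorted(sensitivities, key=lambda item: abs(item[1]), reverse=True)

Being a univariate summary, this understates interactions such as the coupling between learning_rate and n_estimators, so it complements rather than replaces inspection of the full results.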

The best configuration from Phase 1 achieved F1 = 0.723 with learning_rate=0.052, n_estimators=487, max_depth=5.

Step 3: Phase 2 -- Grid Search Refinement

Based on Phase 1 insights, the team defined a refined grid around the promising region:

from sklearn.model_selection import GridSearchCV

def phase2_grid_search(
    X_train: np.ndarray,
    y_train: np.ndarray,
    cv_folds: int = 5,
    random_state: int = 42
) -> dict:
    """Phase 2: Focused refinement with grid search.

    Args:
        X_train: Training feature matrix.
        y_train: Training target vector.
        cv_folds: Number of cross-validation folds.
        random_state: Random seed for reproducibility.

    Returns:
        Dictionary with search results.
    """
    # Refined grid based on Phase 1 findings
    refined_grid = {
        "n_estimators": [400, 450, 500, 550, 600],
        "max_depth": [4, 5, 6],
        "learning_rate": [0.03, 0.04, 0.05, 0.06, 0.07],
        "subsample": [0.7, 0.8, 0.9],
        "min_samples_split": [10, 20, 30],
        "min_samples_leaf": [5, 10],  # Fixed near best from Phase 1
        "max_features": [0.5, 0.6, 0.7],
    }

    # Total configurations: 5 * 3 * 5 * 3 * 3 * 2 * 3 = 4050
    # With 5-fold CV: 20,250 model fits
    # This is expensive -- consider reducing the grid or using Bayesian optimization

    cv = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=random_state)

    grid = GridSearchCV(
        estimator=GradientBoostingClassifier(random_state=random_state),
        param_grid=refined_grid,
        cv=cv,
        scoring="f1",
        n_jobs=-1,
        verbose=1,
        return_train_score=True,
    )

    grid.fit(X_train, y_train)

    # Check for overfitting
    best_idx = grid.best_index_
    train_score = grid.cv_results_["mean_train_score"][best_idx]
    test_score = grid.cv_results_["mean_test_score"][best_idx]

    print(f"Best train F1: {train_score:.4f}")
    print(f"Best val F1:   {test_score:.4f}")
    print(f"Overfit gap:   {train_score - test_score:.4f}")

    return {
        "best_params": grid.best_params_,
        "best_score": grid.best_score_,
        "train_score": train_score,
        "overfit_gap": train_score - test_score,
    }

Phase 2 improved the best F1 from 0.723 to 0.741. The final best parameters:

Hyperparameter       Value
n_estimators         550
max_depth            5
learning_rate        0.05
subsample            0.8
min_samples_split    20
min_samples_leaf     5
max_features         0.6
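
For reference, this configuration maps directly onto a scikit-learn estimator. The sketch below is illustrative only (the commented fit call assumes the X_train and y_train arrays prepared earlier):

from sklearn.ensemble import GradientBoostingClassifier

# The Phase 2 winner, written out explicitly for readability
BEST_PARAMS = {
    "n_estimators": 550,
    "max_depth": 5,
    "learning_rate": 0.05,
    "subsample": 0.8,
    "min_samples_split": 20,
    "min_samples_leaf": 5,
    "max_features": 0.6,
}

final_model = GradientBoostingClassifier(**BEST_PARAMS, random_state=42)
# final_model.fit(X_train, y_train)  # using the training split from earlier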

Step 4: Phase 3 -- Bayesian Optimization for Fine-Tuning

For the final refinement, the team used Bayesian optimization to squeeze out the last bits of performance:

def phase3_bayesian_optimization(
    X_train: np.ndarray,
    y_train: np.ndarray,
    n_calls: int = 50,
    cv_folds: int = 5,
    random_state: int = 42
) -> dict:
    """Phase 3: Fine-tuning with Bayesian optimization.

    Args:
        X_train: Training feature matrix.
        y_train: Training target vector.
        n_calls: Number of Bayesian optimization iterations.
        cv_folds: Number of cross-validation folds.
        random_state: Random seed for reproducibility.

    Returns:
        Dictionary with optimization results.
    """
    from sklearn.model_selection import cross_val_score

    # Define the objective function
    def objective(params: list) -> float:
        """Objective function for Bayesian optimization.

        Args:
            params: List of hyperparameter values.

        Returns:
            Negative mean F1 score (we minimize).
        """
        n_estimators, max_depth, learning_rate, subsample = params

        model = GradientBoostingClassifier(
            n_estimators=int(n_estimators),
            max_depth=int(max_depth),
            learning_rate=learning_rate,
            subsample=subsample,
            min_samples_split=20,
            min_samples_leaf=5,
            max_features=0.6,
            random_state=random_state,
        )

        cv = StratifiedKFold(
            n_splits=cv_folds, shuffle=True, random_state=random_state
        )
        scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1")

        return -np.mean(scores)  # Negative because we minimize

    try:
        from skopt import gp_minimize
        from skopt.space import Integer, Real

        space = [
            Integer(450, 650, name="n_estimators"),
            Integer(4, 7, name="max_depth"),
            Real(0.03, 0.08, prior="log-uniform", name="learning_rate"),
            Real(0.7, 0.95, name="subsample"),
        ]

        result = gp_minimize(
            objective,
            space,
            n_calls=n_calls,
            random_state=random_state,
            verbose=True,
        )

        return {
            "best_score": -result.fun,
            "best_params": {
                "n_estimators": result.x[0],
                "max_depth": result.x[1],
                "learning_rate": result.x[2],
                "subsample": result.x[3],
            },
            "convergence": [-v for v in result.func_vals],
        }

    except ImportError:
        print("scikit-optimize not installed. Skipping Bayesian optimization.")
        return {}

Bayesian optimization improved F1 marginally to 0.748 -- a small but meaningful improvement given the scale of the customer base (each 0.1% improvement in churn prediction translates to approximately $50,000 in annual retention revenue).

Step 5: Production Constraints Validation

Before deployment, the team validated that the final model met production requirements:

import time

def validate_production_constraints(
    model,
    X_single: np.ndarray,
    max_latency_ms: float = 50.0,
    max_model_size_mb: float = 100.0,
    n_timing_runs: int = 1000
) -> dict[str, float | bool]:
    """Validate that the model meets production constraints.

    Args:
        model: Trained scikit-learn estimator.
        X_single: Single sample for latency testing, shape (1, n_features).
        max_latency_ms: Maximum allowable inference latency in milliseconds.
        max_model_size_mb: Maximum allowable model size in megabytes.
        n_timing_runs: Number of runs for latency measurement.

    Returns:
        Dictionary with constraint satisfaction results.
    """
    import pickle
    import sys

    # Measure inference latency
    latencies = []
    for _ in range(n_timing_runs):
        start = time.perf_counter()
        model.predict_proba(X_single)
        end = time.perf_counter()
        latencies.append((end - start) * 1000)

    p50_latency = np.percentile(latencies, 50)
    p99_latency = np.percentile(latencies, 99)

    # Measure model size
    model_bytes = sys.getsizeof(pickle.dumps(model))
    model_size_mb = model_bytes / (1024 * 1024)

    results = {
        "p50_latency_ms": round(p50_latency, 2),
        "p99_latency_ms": round(p99_latency, 2),
        "latency_ok": p99_latency < max_latency_ms,
        "model_size_mb": round(model_size_mb, 2),
        "size_ok": model_size_mb < max_model_size_mb,
    }

    print(f"Latency: p50={results['p50_latency_ms']}ms, "
          f"p99={results['p99_latency_ms']}ms "
          f"({'PASS' if results['latency_ok'] else 'FAIL'})")
    print(f"Model size: {results['model_size_mb']}MB "
          f"({'PASS' if results['size_ok'] else 'FAIL'})")

    return results
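
As a usage sketch (final_model and X_test are hypothetical names for the fitted Phase 2 model and a held-out feature matrix), the check might be run on a single customer row:

# Keep the probe two-dimensional: shape (1, n_features), not (n_features,)
single_customer = X_test[:1]

report = validate_production_constraints(
    final_model,
    single_customer,
    max_latency_ms=50.0,
    max_model_size_mb=100.0,
)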

The model with 550 trees and max_depth=5 passed both constraints:

  • p99 latency: 12ms (well under 50ms)
  • Model size: 23MB (well under 100MB)

However, the team noted that a much larger ensemble (on the order of 2,000 or more trees) would have pushed latency past the 50ms constraint. This illustrates the importance of treating production requirements as hard constraints during hyperparameter search, as sketched below.
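
One way to bake the constraint into Phase 3 itself (a sketch only, assuming a quick latency probe inside the objective is affordable; this is not the team's actual code) is to return a heavily penalized value whenever a candidate configuration breaks the latency budget:

import time
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def constrained_objective(params: list, X, y, X_probe, max_latency_ms: float = 50.0) -> float:
    """Negated-F1 objective that rejects latency-violating configurations.

    X_probe is a single row of shape (1, n_features) used for the timing probe.
    """
    n_estimators, max_depth, learning_rate, subsample = params
    model = GradientBoostingClassifier(
        n_estimators=int(n_estimators),
        max_depth=int(max_depth),
        learning_rate=learning_rate,
        subsample=subsample,
        random_state=42,
    )

    # Latency for a boosted tree model depends mostly on tree count and depth,
    # so a fit on a modest subsample gives a usable single-row timing estimate.
    model.fit(X[:20_000], y[:20_000])
    start = time.perf_counter()
    model.predict_proba(X_probe)
    latency_ms = (time.perf_counter() - start) * 1000

    if latency_ms > max_latency_ms:
        return 1.0  # Far worse than any achievable negated F1, so the optimizer avoids it

    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    return -float(np.mean(cross_val_score(model, X, y, cv=cv, scoring="f1")))

A hard penalty like this makes the surrogate model's job harder near the feasibility boundary; an alternative is to subtract a smooth penalty proportional to the latency overshoot.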

Step 6: Final Evaluation with Nested Cross-Validation

To estimate how the tuned model generalizes, the team ran nested cross-validation. The helper below implements the outer evaluation loop for the selected configuration; a fully nested variant, which repeats the hyperparameter search inside each outer fold, is sketched after it:

from sklearn.model_selection import cross_val_score

def final_nested_evaluation(
    X: np.ndarray,
    y: np.ndarray,
    best_params: dict,
    outer_k: int = 5,
    random_state: int = 42
) -> dict:
    """Final unbiased evaluation using the tuned hyperparameters.

    Args:
        X: Full feature matrix.
        y: Full target vector.
        best_params: Best hyperparameters from tuning.
        outer_k: Number of outer cross-validation folds.
        random_state: Random seed.

    Returns:
        Dictionary with performance estimates and confidence intervals.
    """
    model = GradientBoostingClassifier(**best_params, random_state=random_state)

    cv = StratifiedKFold(n_splits=outer_k, shuffle=True, random_state=random_state)
    scores = cross_val_score(model, X, y, cv=cv, scoring="f1")

    mean_f1 = np.mean(scores)
    std_f1 = np.std(scores)
    ci_lower = mean_f1 - 1.96 * std_f1 / np.sqrt(outer_k)
    ci_upper = mean_f1 + 1.96 * std_f1 / np.sqrt(outer_k)

    print(f"F1 Score: {mean_f1:.4f} +/- {std_f1:.4f}")
    print(f"95% CI: [{ci_lower:.4f}, {ci_upper:.4f}]")
    print(f"Fold scores: {[round(s, 4) for s in scores]}")

    return {
        "mean_f1": mean_f1,
        "std_f1": std_f1,
        "ci_lower": ci_lower,
        "ci_upper": ci_upper,
        "fold_scores": scores.tolist(),
    }
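
Strictly speaking, the helper above scores a configuration that was already selected using the same data, so its estimate can still be slightly optimistic. A fully nested procedure repeats the inner search inside every outer fold; a minimal sketch (reusing the Phase 2 refined_grid purely for illustration) looks like this:

import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

def nested_cv_estimate(
    X: np.ndarray,
    y: np.ndarray,
    param_grid: dict,
    outer_k: int = 5,
    inner_k: int = 3,
    random_state: int = 42,
) -> float:
    """Nested CV: the hyperparameter search is repeated inside each outer fold."""
    inner_cv = StratifiedKFold(n_splits=inner_k, shuffle=True, random_state=random_state)
    outer_cv = StratifiedKFold(n_splits=outer_k, shuffle=True, random_state=random_state)

    # The outer loop treats the whole GridSearchCV as one estimator, so every
    # outer test fold is scored by a model tuned only on that fold's training data.
    inner_search = GridSearchCV(
        estimator=GradientBoostingClassifier(random_state=random_state),
        param_grid=param_grid,
        cv=inner_cv,
        scoring="f1",
        n_jobs=-1,
    )
    outer_scores = cross_val_score(inner_search, X, y, cv=outer_cv, scoring="f1")
    return float(outer_scores.mean())

The cost multiplies by the number of outer folds, which is why fully nested runs are usually reserved for the final report rather than routine tuning.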

Final result: F1 = 0.748 +/- 0.012, 95% CI [0.737, 0.759].

Step 7: Documentation and Reproducibility

The team created a comprehensive experiment log:

def create_experiment_log(
    phases: list[dict],
    final_params: dict,
    final_metrics: dict,
    production_checks: dict
) -> dict:
    """Create a structured experiment log for reproducibility.

    Args:
        phases: List of results from each optimization phase.
        final_params: Final selected hyperparameters.
        final_metrics: Final performance metrics.
        production_checks: Production constraint validation results.

    Returns:
        Complete experiment log dictionary.
    """
    return {
        "experiment_id": "churn-gbt-v2.3",
        "date": "2024-09-15",
        "dataset": {
            "n_samples": 500000,
            "n_features": 45,
            "positive_rate": 0.18,
            "train_size": 400000,
            "test_size": 100000,
        },
        "optimization": {
            "phase1_random_search": {
                "n_iterations": 200,
                "best_f1": phases[0]["best_score"],
            },
            "phase2_grid_search": {
                "n_configurations": 4050,
                "best_f1": phases[1]["best_score"],
            },
            "phase3_bayesian": {
                "n_iterations": 50,
                "best_f1": phases[2].get("best_score", "N/A"),
            },
        },
        "final_model": {
            "type": "GradientBoostingClassifier",
            "hyperparameters": final_params,
            "metrics": final_metrics,
            "production_validation": production_checks,
        },
        "random_seed": 42,
        "software_versions": {
            "python": "3.11",
            "scikit-learn": "1.3.0",
            "numpy": "1.25.0",
        },
    }
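
To make the log auditable, it can be persisted alongside the model artifacts; the sketch below (file layout and naming are illustrative assumptions, not the team's convention) writes it as pretty-printed JSON:

import json
from pathlib import Path

def save_experiment_log(log: dict, directory: str = "experiment_logs") -> Path:
    """Write the experiment log to disk as JSON, keyed by its experiment_id."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{log['experiment_id']}.json"
    # default=str handles values json cannot serialize natively (e.g. numpy floats)
    out_path.write_text(json.dumps(log, indent=2, default=str))
    return out_path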

Results Summary

Phase   Strategy                Budget        Best F1   Wall Time
1       Random Search           200 evals     0.723     3.2 hours
2       Grid Search             4,050 evals   0.741     14.5 hours
3       Bayesian Optimization   50 evals      0.748     2.1 hours

The three-phase approach invested the most computation in Phase 2 (exhaustive local refinement) but achieved the best per-evaluation efficiency in Phase 3 (targeted Bayesian optimization).

Lessons Learned

  1. Start broad, then narrow. Random search efficiently identifies promising regions of the hyperparameter space. Grid search and Bayesian optimization are more effective once the search space is constrained.

  2. Log-uniform distributions for scale parameters. Learning rate, regularization strength, and other parameters that span orders of magnitude should always be searched on a log scale.

  3. Monitor the overfitting gap. The difference between training and validation scores is as important as the validation score itself. A high-scoring model with a large gap is likely to disappoint in production.

  4. Production constraints are hyperparameters too. Model size and inference latency should be treated as hard constraints during the search, not afterthoughts.

  5. Document everything. Random seeds, software versions, search spaces, and all intermediate results must be recorded for reproducibility. Six months from now, when the model needs retraining, the team must be able to reproduce the process exactly.

  6. Diminishing returns are real. The jump from Phase 1 to Phase 2 (0.723 to 0.741) was much larger than Phase 2 to Phase 3 (0.741 to 0.748). Know when to stop tuning and start deploying.

Connection to Chapter Concepts

This case study directly applies:

  • Section 8.7.2: Grid search for exhaustive local refinement.
  • Section 8.7.3: Random search for efficient exploration of high-dimensional spaces.
  • Section 8.7.4: Bayesian optimization for sample-efficient fine-tuning.
  • Section 8.7.6: Practical tips including log-uniform distributions and computational budgets.
  • Section 8.3.5: Nested cross-validation for unbiased performance estimation.
  • Section 8.6: Bias-variance diagnosis through the overfitting gap.
  • Section 8.10.2: Awareness of overfitting to the validation set.

Discussion Questions

  1. The team evaluated 4,300 total configurations (200 + 4,050 + 50). How would you determine if this was sufficient? What evidence would suggest that more evaluations would yield meaningful improvement?

  2. The team used F1 as the optimization metric. If the marketing team's retention campaign has a fixed budget (can only contact 5,000 customers per week), would a different metric be more appropriate? What would you recommend?

  3. How would you handle the situation if the Phase 3 Bayesian optimization found a model with F1 = 0.755 but with p99 latency of 65ms (exceeding the 50ms constraint)?

  4. The model will be retrained weekly. Should hyperparameter tuning be repeated each week, or should the tuned hyperparameters be reused? What signals would indicate that re-tuning is necessary?