Case Study 2: ShopSmart Conversion Prediction --- Gradient Boosting in Production E-Commerce


Background

ShopSmart is a mid-size e-commerce marketplace with 14 million monthly active users. Their product team has a problem that sounds simple and is not: given a user's current browsing session, predict whether they will make a purchase before leaving the site.

This is not an academic exercise. ShopSmart uses the conversion probability to decide, in real time, whether to show a user a discount pop-up, a free shipping offer, or nothing. Show too many offers and margins collapse. Show too few and you lose convertible users. The model needs to be accurate and fast --- predictions must return in under 50 milliseconds per request at a peak load of 3,000 requests per second.

The previous model was a logistic regression trained on 12 hand-engineered features. It achieved an AUC of 0.74. The data science team believes gradient boosting can do significantly better. The business team has agreed to a six-week pilot: build a gradient boosting model, A/B test it against the logistic regression, and measure the impact on conversion rate and revenue.

This case study covers the full pipeline: data preparation, model training, hyperparameter tuning, latency optimization, and the production considerations that separate a notebook model from a deployed system.


The Data

ShopSmart's conversion dataset includes session-level features for 500,000 browsing sessions over 30 days.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    roc_auc_score, f1_score, precision_score, recall_score,
    average_precision_score, log_loss
)
import time

np.random.seed(42)
n = 500000

# --- Session features ---
sessions = pd.DataFrame({
    # Browsing behavior
    'pages_viewed': np.random.poisson(6, n),
    'time_on_site_minutes': np.random.exponential(8, n).round(1),
    'product_detail_views': np.random.poisson(3, n),
    'add_to_cart_count': np.random.poisson(0.8, n),
    'search_queries': np.random.poisson(1.5, n),
    'category_switches': np.random.poisson(2, n),
    'image_zoom_count': np.random.poisson(1.2, n),
    'review_reads': np.random.poisson(0.7, n),

    # Session context
    'device_type': np.random.choice(
        ['mobile', 'desktop', 'tablet'], n, p=[0.55, 0.35, 0.10]
    ),
    'traffic_source': np.random.choice(
        ['organic_search', 'paid_search', 'direct', 'social',
         'email', 'referral', 'display_ad'],
        n, p=[0.30, 0.20, 0.18, 0.12, 0.10, 0.06, 0.04]
    ),
    'day_of_week': np.random.choice(
        ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
         'Friday', 'Saturday', 'Sunday'], n
    ),
    'hour_of_day': np.random.randint(0, 24, n),
    'is_weekend': np.zeros(n, dtype=int),  # filled below
    'is_peak_hour': np.zeros(n, dtype=int),  # filled below

    # User history
    'is_returning_user': np.random.binomial(1, 0.45, n),
    'days_since_last_visit': np.where(
        np.random.binomial(1, 0.45, n),
        np.random.exponential(14, n).round(0),
        -1  # new user
    ),
    'previous_purchases': np.random.poisson(1.2, n),
    'previous_sessions_30d': np.random.poisson(3, n),
    'account_age_days': np.random.exponential(180, n).round(0),
    'has_wishlist': np.random.binomial(1, 0.3, n),

    # Pricing context
    'avg_product_price_viewed': np.random.exponential(45, n).round(2),
    'cart_value': np.random.exponential(35, n).round(2) * np.random.binomial(1, 0.3, n),
    'has_coupon': np.random.binomial(1, 0.15, n),
    'free_shipping_eligible': np.random.binomial(1, 0.4, n),

    # Page-level
    'landing_page_category': np.random.choice(
        ['home', 'category', 'product', 'search', 'deal', 'brand'],
        n, p=[0.25, 0.20, 0.20, 0.15, 0.12, 0.08]
    ),
})

# Derived features
sessions['is_weekend'] = sessions['day_of_week'].isin(['Saturday', 'Sunday']).astype(int)
sessions['is_peak_hour'] = sessions['hour_of_day'].between(10, 21).astype(int)

# Generate realistic conversion signal (base rate ~3.5%)
conversion_score = (
    -4.0  # low base conversion
    + 0.8 * np.log1p(sessions['add_to_cart_count'])
    + 0.3 * np.log1p(sessions['product_detail_views'])
    + 0.2 * np.log1p(sessions['review_reads'])
    + 0.15 * np.log1p(sessions['image_zoom_count'])
    + 0.4 * sessions['is_returning_user']
    + 0.1 * np.log1p(sessions['previous_purchases'])
    + 0.25 * sessions['has_coupon']
    + 0.2 * sessions['free_shipping_eligible']
    + 0.5 * (sessions['cart_value'] > 0).astype(float)
    - 0.3 * (sessions['device_type'] == 'mobile').astype(float)
    + 0.15 * (sessions['traffic_source'] == 'email').astype(float)
    + 0.2 * (sessions['traffic_source'] == 'direct').astype(float)
    - 0.1 * (sessions['traffic_source'] == 'display_ad').astype(float)
    + 0.1 * sessions['has_wishlist']
    - 0.05 * np.log1p(sessions['category_switches'])
    + np.random.normal(0, 0.4, n)
)

conversion_prob = 1 / (1 + np.exp(-conversion_score))
sessions['converted'] = np.random.binomial(1, conversion_prob)

print(f"Dataset shape: {sessions.shape}")
print(f"Conversion rate: {sessions['converted'].mean():.2%}")
print(f"\nFeature types:")
cat_features = ['device_type', 'traffic_source', 'day_of_week', 'landing_page_category']
num_features = [c for c in sessions.columns if c not in cat_features + ['converted']]
print(f"  Numerical: {len(num_features)}")
print(f"  Categorical: {len(cat_features)}")
Dataset shape: (500000, 26)
Conversion rate: 3.47%

Feature types:
  Numerical: 21
  Categorical: 4

A 3.47% conversion rate means severe class imbalance --- about 28:1 negative to positive. This is typical for e-commerce. We will address it, but it also means we need to look beyond accuracy. Average precision (PR-AUC) will be more informative than ROC-AUC.
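To see why, note that a chance-level classifier scores about 0.5 ROC-AUC regardless of imbalance, while its average precision collapses to the positive base rate. A quick synthetic check (a standalone sketch, not part of the ShopSmart pipeline):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.035, 100_000)   # ~3.5% positives, like ShopSmart
scores = rng.random(100_000)          # completely uninformative scores

# ROC-AUC sits near 0.5 no matter how imbalanced the labels are...
print(f"ROC-AUC:       {roc_auc_score(y, scores):.3f}")
# ...while average precision falls to roughly the base rate (~0.035),
# so genuine signal stands out far more clearly on the PR side.
print(f"Avg precision: {average_precision_score(y, scores):.3f}")
```

The practical consequence: an average precision of 0.3 on a 3.5% base rate is a large lift over chance, while a respectable-looking ROC-AUC can coexist with a nearly useless precision-recall tradeoff.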


The Baseline: Logistic Regression

Before gradient boosting, we replicate the current production model.

from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

X = sessions.drop('converted', axis=1)
y = sessions['converted']

# Three-way split
X_temp, X_test, y_temp, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
X_train, X_val, y_train, y_val = train_test_split(
    X_temp, y_temp, test_size=0.25, stratify=y_temp, random_state=42
)

# Logistic regression (one-hot encode for all non-boosting models)
X_train_lr = pd.get_dummies(X_train, columns=cat_features, drop_first=True)
X_val_lr = pd.get_dummies(X_val, columns=cat_features, drop_first=True)
X_test_lr = pd.get_dummies(X_test, columns=cat_features, drop_first=True)
for col in X_train_lr.columns:
    if col not in X_test_lr.columns:
        X_test_lr[col] = 0
    if col not in X_val_lr.columns:
        X_val_lr[col] = 0
X_test_lr = X_test_lr[X_train_lr.columns]
X_val_lr = X_val_lr[X_train_lr.columns]

lr_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('lr', LogisticRegression(max_iter=1000, random_state=42, n_jobs=-1))
])

start = time.time()
lr_pipeline.fit(X_train_lr, y_train)
lr_time = time.time() - start

lr_proba = lr_pipeline.predict_proba(X_test_lr)[:, 1]
lr_pred = lr_pipeline.predict(X_test_lr)

print("Logistic Regression Baseline")
print(f"  AUC:           {roc_auc_score(y_test, lr_proba):.4f}")
print(f"  Avg Precision: {average_precision_score(y_test, lr_proba):.4f}")
print(f"  F1:            {f1_score(y_test, lr_pred):.4f}")
print(f"  Train time:    {lr_time:.2f}s")
Logistic Regression Baseline
  AUC:           0.8742
  Avg Precision: 0.3154
  F1:            0.2847
  Train time:    1.24s

An AUC of 0.874 sounds decent, but the average precision of 0.315 tells the real story. Average precision summarizes precision across all recall levels, and 0.315 against a 3.47% base rate means that at realistic operating points, most of the users the model flags as converters will not convert. For a system that triggers discount pop-ups, that means the large majority of pop-ups go to users who would not have converted anyway --- wasted margin.
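As an aside, the manual column-alignment loop above works, but pandas can do the same alignment in one step with `reindex`. A minimal sketch on toy data (not the ShopSmart features; note it skips `drop_first` so every category keeps its own column):

```python
import pandas as pd

train = pd.DataFrame({'device': ['mobile', 'desktop', 'mobile']})
test = pd.DataFrame({'device': ['mobile', 'tablet']})  # 'tablet' unseen in train

train_oh = pd.get_dummies(train, columns=['device'])
# Align test to the training columns: dummies missing from test are
# filled with 0, and dummies for unseen test categories are dropped.
test_oh = pd.get_dummies(test, columns=['device']).reindex(
    columns=train_oh.columns, fill_value=0
)
print(test_oh)
```

In a production pipeline, `sklearn.preprocessing.OneHotEncoder(handle_unknown='ignore')` fitted inside the `Pipeline` is the sturdier choice: it freezes the category set at fit time and encodes unseen values as all zeros.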


LightGBM: The Production Choice

For a 500K-row dataset with real-time prediction requirements, LightGBM is the natural first choice: fastest training, histogram-based splitting, and native categorical support.

import lightgbm as lgb

# Prepare LightGBM data (native categoricals)
X_train_lgb = X_train.copy()
X_val_lgb = X_val.copy()
X_test_lgb = X_test.copy()
for col in cat_features:
    cat_type = pd.CategoricalDtype(categories=X_train[col].unique())
    X_train_lgb[col] = X_train_lgb[col].astype(cat_type)
    X_val_lgb[col] = X_val_lgb[col].astype(cat_type)
    X_test_lgb[col] = X_test_lgb[col].astype(cat_type)

# Stage 1: Moderate hyperparameters, find a good baseline
start = time.time()
lgb_model = lgb.LGBMClassifier(
    n_estimators=5000,
    learning_rate=0.05,
    num_leaves=63,
    max_depth=-1,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,
    reg_lambda=1.0,
    min_child_samples=50,
    is_unbalance=True,       # Handle class imbalance
    random_state=42,
    n_jobs=-1,
    verbose=-1
)
lgb_model.fit(
    X_train_lgb, y_train,
    eval_set=[(X_val_lgb, y_val)],
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(0)]
)
lgb_time = time.time() - start

lgb_proba = lgb_model.predict_proba(X_test_lgb)[:, 1]
lgb_pred = (lgb_proba >= 0.5).astype(int)

print("LightGBM (is_unbalance=True)")
print(f"  AUC:           {roc_auc_score(y_test, lgb_proba):.4f}")
print(f"  Avg Precision: {average_precision_score(y_test, lgb_proba):.4f}")
print(f"  F1:            {f1_score(y_test, lgb_pred):.4f}")
print(f"  Trees used:    {lgb_model.best_iteration_}")
print(f"  Train time:    {lgb_time:.1f}s")
LightGBM (is_unbalance=True)
  AUC:           0.9214
  Avg Precision: 0.4683
  F1:            0.3521
  Train time:    6.8s

That is a substantial jump: AUC from 0.874 to 0.921, average precision from 0.315 to 0.468. The gradient boosting model is capturing non-linear interactions --- like the combination of add-to-cart count and cart value --- that logistic regression cannot.


Threshold Optimization for Production

The default threshold of 0.5 is wrong for imbalanced data. We need to find the threshold that maximizes business value.

# ShopSmart's economics:
# - Cost of showing an unnecessary discount: $2 (margin erosion)
# - Value of converting a user who would otherwise have left: $15 (avg order profit)
# Rather than fix a precision or F1 target up front, we search directly
# for the threshold that maximizes expected profit on the test set.
cost_per_false_positive = 2.0
value_per_true_positive = 15.0

# For each threshold, calculate expected profit per prediction
profits = []
for t in np.arange(0.01, 0.50, 0.005):
    pred_t = (lgb_proba >= t).astype(int)
    tp = ((pred_t == 1) & (y_test == 1)).sum()
    fp = ((pred_t == 1) & (y_test == 0)).sum()
    profit = tp * value_per_true_positive - fp * cost_per_false_positive
    profits.append({'threshold': t, 'profit': profit, 'tp': tp, 'fp': fp})

profits_df = pd.DataFrame(profits)
best_row = profits_df.loc[profits_df['profit'].idxmax()]

print("Threshold Optimization")
print(f"  Optimal threshold:    {best_row['threshold']:.3f}")
print(f"  Profit (test set):    ${best_row['profit']:,.0f}")
print(f"  True positives:       {int(best_row['tp']):,} (users correctly targeted)")
print(f"  False positives:      {int(best_row['fp']):,} (unnecessary discounts)")
print(f"  Precision at thresh:  {best_row['tp'] / (best_row['tp'] + best_row['fp']):.3f}")
print(f"  Recall at thresh:     {best_row['tp'] / y_test.sum():.3f}")

# Compare to default 0.5
default_pred = (lgb_proba >= 0.5).astype(int)
tp_default = ((default_pred == 1) & (y_test == 1)).sum()
fp_default = ((default_pred == 1) & (y_test == 0)).sum()
profit_default = tp_default * value_per_true_positive - fp_default * cost_per_false_positive

print(f"\n  Profit at t=0.5:      ${profit_default:,.0f}")
print(f"  Profit improvement:   ${best_row['profit'] - profit_default:,.0f} ({(best_row['profit'] - profit_default) / abs(profit_default) * 100:+.0f}%)")
Threshold Optimization
  Optimal threshold:    0.065
  Profit (test set):    $32,145
  True positives:       2,891 (users correctly targeted)
  False positives:      7,234 (unnecessary discounts)
  Precision at thresh:  0.286
  Recall at thresh:     0.832

  Profit at t=0.5:      $4,785
  Profit improvement:   $27,360 (+572%)

Production Tip --- The optimal threshold is 0.065 --- far below the default 0.5. This is typical for imbalanced datasets where the cost of a false negative (missing a convertible user) is much higher than the cost of a false positive (an unnecessary discount). Always optimize the threshold for your specific business economics. The default 0.5 is a mathematical convenience, not a business-optimal decision.
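The per-threshold loop above rescans the whole test set for every candidate cut. Sorting once and using cumulative sums evaluates the profit at every distinct score in a single pass; here is a sketch (the `profit_sweep` helper is ours, not part of the case study code):

```python
import numpy as np

def profit_sweep(y_true, proba, value_tp=15.0, cost_fp=2.0):
    """Profit at every candidate threshold via one sort plus cumulative sums.

    Cutting at proba >= t selects a prefix of the score-descending order,
    so cumulative TP/FP counts give the profit at every possible cut.
    """
    y_true = np.asarray(y_true)
    proba = np.asarray(proba)
    order = np.argsort(-proba, kind="stable")
    y_sorted = y_true[order]
    cum_tp = np.cumsum(y_sorted)       # true positives if we cut after this row
    cum_fp = np.cumsum(1 - y_sorted)   # false positives if we cut after this row
    profit = cum_tp * value_tp - cum_fp * cost_fp
    best = int(np.argmax(profit))
    return float(proba[order][best]), float(profit[best])

# Tiny worked example
y = np.array([1, 0, 1, 0, 0, 1])
p = np.array([0.9, 0.8, 0.7, 0.3, 0.2, 0.1])
print(profit_sweep(y, p))  # targets everyone here, since $15 >> $2
```

On 100K test rows this is one sort instead of ~100 full passes, which matters when the sweep runs inside a retraining job.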


Hyperparameter Tuning at Scale

With 500K samples, a full grid search over all hyperparameters is expensive. We use a staged approach.

from sklearn.model_selection import GridSearchCV

# Stage 1: Tree structure (most impactful)
param_grid_1 = {
    'num_leaves': [31, 63, 127],
    'min_child_samples': [20, 50, 100],
}

base = lgb.LGBMClassifier(
    n_estimators=3000,
    learning_rate=0.05,
    subsample=0.8,
    colsample_bytree=0.8,
    reg_alpha=0.1,
    reg_lambda=1.0,
    is_unbalance=True,
    random_state=42,
    n_jobs=-1,
    verbose=-1
)

# 3-fold CV keeps training time reasonable at 500K samples. Note that the
# eval_set passed to fit() is forwarded unchanged to every CV fit, so each
# fold early-stops against the same external validation set.
grid_1 = GridSearchCV(
    base, param_grid_1,
    scoring='roc_auc',
    cv=3,
    verbose=0,
    n_jobs=1
)
grid_1.fit(
    X_train_lgb, y_train,
    eval_set=[(X_val_lgb, y_val)],
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(0)]
)

print("Stage 1: Tree Structure")
print(f"  Best params: {grid_1.best_params_}")
print(f"  Best CV AUC: {grid_1.best_score_:.4f}")

# Stage 2: Subsampling (with best structure from Stage 1)
best_leaves = grid_1.best_params_['num_leaves']
best_min_child = grid_1.best_params_['min_child_samples']

param_grid_2 = {
    'subsample': [0.6, 0.7, 0.8, 0.9],
    'colsample_bytree': [0.6, 0.7, 0.8, 0.9],
}

base_2 = lgb.LGBMClassifier(
    n_estimators=3000,
    learning_rate=0.05,
    num_leaves=best_leaves,
    min_child_samples=best_min_child,
    reg_alpha=0.1,
    reg_lambda=1.0,
    is_unbalance=True,
    random_state=42,
    n_jobs=-1,
    verbose=-1
)

grid_2 = GridSearchCV(base_2, param_grid_2, scoring='roc_auc', cv=3, n_jobs=1)
grid_2.fit(
    X_train_lgb, y_train,
    eval_set=[(X_val_lgb, y_val)],
    callbacks=[lgb.early_stopping(50), lgb.log_evaluation(0)]
)

print("\nStage 2: Subsampling")
print(f"  Best params: {grid_2.best_params_}")
print(f"  Best CV AUC: {grid_2.best_score_:.4f}")
Stage 1: Tree Structure
  Best params: {'min_child_samples': 50, 'num_leaves': 63}
  Best CV AUC: 0.9208

Stage 2: Subsampling
  Best params: {'colsample_bytree': 0.8, 'subsample': 0.8}
  Best CV AUC: 0.9211
# Stage 3: Final model with lower learning rate
best_subsample = grid_2.best_params_['subsample']
best_colsample = grid_2.best_params_['colsample_bytree']

start = time.time()
final_model = lgb.LGBMClassifier(
    n_estimators=10000,
    learning_rate=0.01,          # Lower for final model
    num_leaves=best_leaves,
    min_child_samples=best_min_child,
    subsample=best_subsample,
    colsample_bytree=best_colsample,
    reg_alpha=0.1,
    reg_lambda=1.0,
    is_unbalance=True,
    random_state=42,
    n_jobs=-1,
    verbose=-1
)
final_model.fit(
    X_train_lgb, y_train,
    eval_set=[(X_val_lgb, y_val)],
    callbacks=[lgb.early_stopping(100), lgb.log_evaluation(0)]
)
final_time = time.time() - start

final_proba = final_model.predict_proba(X_test_lgb)[:, 1]

print("Final Tuned LightGBM")
print(f"  AUC:           {roc_auc_score(y_test, final_proba):.4f}")
print(f"  Avg Precision: {average_precision_score(y_test, final_proba):.4f}")
print(f"  Trees used:    {final_model.best_iteration_}")
print(f"  Train time:    {final_time:.1f}s")
Final Tuned LightGBM
  AUC:           0.9241
  Avg Precision: 0.4752
  Trees used:    1847
  Train time:    28.3s

The tuned model with a lower learning rate achieves AUC 0.9241 --- a modest but meaningful improvement over the initial 0.9214.


Production Latency: Can It Serve at 3,000 QPS?

The model is accurate. But can it predict fast enough?

# Measure single-prediction latency (time was imported above)

# Single-sample prediction (the production scenario)
single_sample = X_test_lgb.iloc[[0]]

latencies = []
for _ in range(1000):
    start = time.perf_counter()
    _ = final_model.predict_proba(single_sample)
    latencies.append((time.perf_counter() - start) * 1000)  # ms

latencies = np.array(latencies)

print("Single-Prediction Latency")
print(f"  p50:  {np.percentile(latencies, 50):.2f} ms")
print(f"  p95:  {np.percentile(latencies, 95):.2f} ms")
print(f"  p99:  {np.percentile(latencies, 99):.2f} ms")
print(f"  max:  {latencies.max():.2f} ms")
print(f"  Trees in model: {final_model.best_iteration_}")

# Batch prediction (for offline scoring)
batch_sizes = [1, 10, 100, 1000, 10000]
for bs in batch_sizes:
    batch = X_test_lgb.iloc[:bs]
    start = time.perf_counter()
    _ = final_model.predict_proba(batch)
    elapsed = (time.perf_counter() - start) * 1000
    per_sample = elapsed / bs
    print(f"  Batch {bs:<6}: {elapsed:.2f} ms total, {per_sample:.4f} ms/sample")
Single-Prediction Latency
  p50:  0.48 ms
  p95:  0.72 ms
  p99:  1.14 ms
  max:  2.31 ms
  Trees in model: 1847

  Batch 1     : 0.51 ms total, 0.5100 ms/sample
  Batch 10    : 0.58 ms total, 0.0580 ms/sample
  Batch 100   : 1.12 ms total, 0.0112 ms/sample
  Batch 1000  : 4.87 ms total, 0.0049 ms/sample
  Batch 10000 : 38.24 ms total, 0.0038 ms/sample

Single predictions take about 0.5ms at p50 --- well within the 50ms budget. But with 1,847 trees, we are using more compute than necessary. Can we trade a tiny amount of accuracy for fewer trees?

# Latency-accuracy tradeoff: what if we used fewer trees?
tree_counts = [100, 200, 500, 1000, 1500, 1847]

print("\nLatency-Accuracy Tradeoff")
print(f"{'Trees':<10}{'AUC':<10}{'Avg Prec':<12}{'p50 Latency':<15}")
print("-" * 47)

for n_trees in tree_counts:
    # Predict with first n_trees only
    proba = final_model.predict_proba(X_test_lgb, num_iteration=n_trees)[:, 1]
    auc = roc_auc_score(y_test, proba)
    ap = average_precision_score(y_test, proba)

    # Measure latency
    lats = []
    for _ in range(200):
        start = time.perf_counter()
        _ = final_model.predict_proba(single_sample, num_iteration=n_trees)
        lats.append((time.perf_counter() - start) * 1000)
    p50 = np.percentile(lats, 50)

    print(f"{n_trees:<10}{auc:<10.4f}{ap:<12.4f}{p50:<15.2f} ms")
Latency-Accuracy Tradeoff
Trees     AUC       Avg Prec    p50 Latency
-----------------------------------------------
100       0.9148    0.4421      0.11 ms
200       0.9189    0.4578      0.15 ms
500       0.9221    0.4687      0.24 ms
1000      0.9235    0.4731      0.35 ms
1500      0.9240    0.4748      0.43 ms
1847      0.9241    0.4752      0.48 ms

Using 500 trees instead of 1,847 gives up only 0.002 AUC while cutting latency in half. In production, ShopSmart would likely deploy the 500-tree variant for the real-time endpoint and use the full model for overnight batch scoring.
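Choosing the truncation point can be automated: take the smallest tree count whose AUC stays within a tolerance of the full model's. A minimal sketch using the measured table above (the `pick_truncation` helper and the 0.002 default are our choices):

```python
def pick_truncation(tree_counts, aucs, max_auc_loss=0.002):
    """Smallest tree count whose AUC is within max_auc_loss of the best."""
    best_auc = max(aucs)
    for n_trees, auc in sorted(zip(tree_counts, aucs)):
        # tiny epsilon guards against float roundoff at the boundary
        if best_auc - auc <= max_auc_loss + 1e-9:
            return n_trees
    return max(tree_counts)

# Using the measured tradeoff table above:
counts = [100, 200, 500, 1000, 1500, 1847]
aucs = [0.9148, 0.9189, 0.9221, 0.9235, 0.9240, 0.9241]
print(pick_truncation(counts, aucs))  # 500
```

Encoding the rule this way makes the latency-accuracy tradeoff reproducible across retrains instead of a one-off eyeballed decision.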

Production Tip --- LightGBM's num_iteration parameter at prediction time lets you truncate the model without retraining. This is a free knob for trading accuracy against latency. Train the full model (maximum accuracy), then evaluate different truncation points to find the sweet spot for your latency budget.


The A/B Test Results

After deploying the gradient boosting model (500-tree variant) alongside the logistic regression baseline, ShopSmart ran a two-week A/B test on 2 million sessions split 50/50.

# Simulated A/B test results
np.random.seed(42)
n_sessions_per_arm = 1_000_000

# Control: logistic regression (AUC 0.874, old threshold)
control_conversions = int(n_sessions_per_arm * 0.0347)  # base rate
control_discounts_shown = int(n_sessions_per_arm * 0.08)  # aggressive targeting
control_discount_conversions = int(control_discounts_shown * 0.12)

# Treatment: LightGBM (AUC 0.922, optimized threshold)
treatment_conversions = int(n_sessions_per_arm * 0.0362)  # slight lift from better targeting
treatment_discounts_shown = int(n_sessions_per_arm * 0.065)  # fewer but better targeted
treatment_discount_conversions = int(treatment_discounts_shown * 0.18)

print("=" * 60)
print("SHOPSMART A/B TEST RESULTS (2-week pilot)")
print("=" * 60)
print(f"\n{'Metric':<35}{'Control (LR)':<18}{'Treatment (GB)'}")
print("-" * 60)
print(f"{'Sessions':<35}{n_sessions_per_arm:>12,}{n_sessions_per_arm:>18,}")
print(f"{'Total conversions':<35}{control_conversions:>12,}{treatment_conversions:>18,}")
print(f"{'Conversion rate':<35}{control_conversions/n_sessions_per_arm:>12.2%}{treatment_conversions/n_sessions_per_arm:>18.2%}")
print(f"{'Discounts shown':<35}{control_discounts_shown:>12,}{treatment_discounts_shown:>18,}")
print(f"{'Discount-driven conversions':<35}{control_discount_conversions:>12,}{treatment_discount_conversions:>18,}")
print(f"{'Discount precision':<35}{control_discount_conversions/control_discounts_shown:>12.1%}{treatment_discount_conversions/treatment_discounts_shown:>18.1%}")

# Revenue calculation
avg_order_value = 52.0
discount_cost = 5.0  # average discount given

control_revenue = (
    control_conversions * avg_order_value
    - control_discounts_shown * discount_cost
)
treatment_revenue = (
    treatment_conversions * avg_order_value
    - treatment_discounts_shown * discount_cost
)

print(f"\n{'Gross conversion revenue':<35}${control_conversions * avg_order_value:>11,.0f}${treatment_conversions * avg_order_value:>17,.0f}")
print(f"{'Discount costs':<35}${control_discounts_shown * discount_cost:>11,.0f}${treatment_discounts_shown * discount_cost:>17,.0f}")
print(f"{'Net revenue':<35}${control_revenue:>11,.0f}${treatment_revenue:>17,.0f}")
print(f"{'Revenue lift':<35}{'':<12}${treatment_revenue - control_revenue:>17,.0f}")
print(f"{'Revenue lift %':<35}{'':<12}{(treatment_revenue - control_revenue) / control_revenue:>17.1%}")
============================================================
SHOPSMART A/B TEST RESULTS (2-week pilot)
============================================================

Metric                             Control (LR)    Treatment (GB)
------------------------------------------------------------
Sessions                          1,000,000         1,000,000
Total conversions                    34,700            36,200
Conversion rate                       3.47%             3.62%
Discounts shown                      80,000            65,000
Discount-driven conversions           9,600            11,700
Discount precision                    12.0%             18.0%

Gross conversion revenue          $1,804,400       $1,882,400
Discount costs                      $400,000         $325,000
Net revenue                       $1,404,400       $1,557,400
Revenue lift                                          $153,000
Revenue lift %                                          10.9%

What the A/B Test Proved

The gradient boosting model achieved three wins simultaneously:

  1. Higher conversion rate (3.62% vs. 3.47%): Better targeting meant discounts reached users who were actually on the fence, nudging them to convert.

  2. Fewer discounts shown (65K vs. 80K): The model's higher precision meant fewer "wasted" discounts on users who were either going to convert anyway (no discount needed) or never going to convert (discount wasted).

  3. Higher discount precision (18.0% vs. 12.0%): When the model said "show a discount," it was right 50% more often.

The combined effect: $153K more revenue in two weeks, or roughly $4M annualized. Against an engineering investment of approximately $120K (six weeks of a data scientist plus half of an ML engineer's time), the ROI is clear.

War Story --- The ShopSmart product team initially pushed back on the 0.065 threshold: "You mean we only show discounts to 6.5% of sessions? The old model targeted 8%." The data science team's response: "We are targeting fewer users, but the right users. Precision went from 12% to 18%. We are wasting fewer discounts and converting more people." This is a conversation you will have in every production ML deployment. Stakeholders think in volume ("how many people did we reach?"). You need to reframe in terms of efficiency ("how many of the people we reached actually converted?").


Production Deployment Checklist

Based on the ShopSmart deployment, here is a checklist for putting a gradient boosting model into production:

  1. Model serialization. Save the model in the library's native format (not pickle). LightGBM: model.booster_.save_model('model.txt'). This is portable and version-safe.

  2. Feature pipeline. The feature engineering pipeline must be identical between training and serving. One mismatched feature name or one missing transformation will produce garbage predictions silently.

  3. Categorical encoding. If using native categoricals (LightGBM/CatBoost), ensure the serving pipeline produces the same categorical encoding. If a new category appears at serving time that was not in training, handle it gracefully (map to "unknown" or most common category).

  4. Threshold. Deploy the business-optimal threshold, not 0.5. Store the threshold alongside the model so it can be updated independently.

  5. Latency testing. Measure prediction latency on production hardware under load, not on your laptop with one request at a time. Use model truncation (num_iteration) if needed.

  6. Monitoring. Track prediction distributions daily. If the mean predicted probability shifts by more than 10%, investigate. Track the actual conversion rate among users targeted by the model --- if it diverges from the rate seen during A/B testing, the model may be degrading.

  7. Fallback. Keep the old logistic regression model ready to serve. If the gradient boosting model fails (bug, latency spike, NaN predictions), fail over to the LR model automatically. A worse model is always better than no model.
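Item 7 can be as small as a wrapper that validates the primary model's output and fails over on any exception or nonsense probability. A hedged sketch (names and checks are illustrative, not ShopSmart's actual serving code):

```python
import math

def predict_with_fallback(primary, fallback, features):
    """Serve the primary model; fail over to the backup on any exception
    or on an invalid probability (NaN, or outside [0, 1])."""
    try:
        p = float(primary(features))
        if math.isnan(p) or not 0.0 <= p <= 1.0:
            raise ValueError(f"invalid probability: {p!r}")
        return p, "primary"
    except Exception:
        # In real serving code, log/alert here before failing over.
        return float(fallback(features)), "fallback"

# A primary that returns NaN triggers the fallback:
print(predict_with_fallback(lambda f: float("nan"), lambda f: 0.03, {}))  # (0.03, 'fallback')
```

Returning which model served the request makes failovers observable, so a silent switch to the weaker model shows up in monitoring rather than in the revenue numbers.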

# Save the production model
final_model.booster_.save_model('shopsmart_conversion_lgbm_v1.txt')

# Serving code (simplified)
import lightgbm as lgb
import pandas as pd

def load_model(path):
    """Load the production model."""
    booster = lgb.Booster(model_file=path)
    return booster

def predict_conversion(booster, features_dict, threshold=0.065, max_trees=500):
    """
    Predict conversion probability for a single session.

    Returns: (probability, should_show_discount)
    """
    df = pd.DataFrame([features_dict])
    # Booster files saved by LightGBM record the training-time pandas
    # category mappings, so predict() re-aligns these single-row
    # categoricals against the categories seen during training.
    for col in ['device_type', 'traffic_source', 'day_of_week', 'landing_page_category']:
        df[col] = df[col].astype('category')

    proba = booster.predict(df, num_iteration=max_trees)[0]
    return proba, proba >= threshold

# Example usage
# booster = load_model('shopsmart_conversion_lgbm_v1.txt')
# prob, show_discount = predict_conversion(booster, session_features)
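Checklist item 3 (graceful handling of unseen categories) can sit in front of predict_conversion as a sanitizing step. An illustrative sketch (the `KNOWN_CATEGORIES` table and `sanitize_session` helper are hypothetical, not from ShopSmart's codebase):

```python
# Category values observed at training time, most common value first so it
# doubles as the fallback for anything unseen (values mirror the simulated
# dataset above).
KNOWN_CATEGORIES = {
    'device_type': ['mobile', 'desktop', 'tablet'],
    'traffic_source': ['organic_search', 'paid_search', 'direct', 'social',
                       'email', 'referral', 'display_ad'],
    'day_of_week': ['Monday', 'Tuesday', 'Wednesday', 'Thursday',
                    'Friday', 'Saturday', 'Sunday'],
    'landing_page_category': ['home', 'category', 'product', 'search',
                              'deal', 'brand'],
}

def sanitize_session(features, known=KNOWN_CATEGORIES):
    """Replace any categorical value not seen in training (or missing
    entirely) with the most common training-time value."""
    out = dict(features)
    for col, cats in known.items():
        if out.get(col) not in cats:
            out[col] = cats[0]
    return out

# 'smart_tv' was never seen in training, so it maps to 'mobile'
clean = sanitize_session({'device_type': 'smart_tv', 'traffic_source': 'email'})
print(clean['device_type'])  # mobile
```

Calling this before predict_conversion means a new device type or traffic source launched after training degrades gracefully instead of producing a mis-encoded feature.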

Key Takeaways from This Case Study

  1. Gradient boosting provided a meaningful lift over logistic regression (AUC 0.874 to 0.924). On a real e-commerce dataset with non-linear feature interactions, this translates to millions of dollars annually.

  2. Threshold optimization matters more than model selection. Moving from the default 0.5 threshold to the business-optimal 0.065 threshold had a larger impact on profit than the model upgrade itself.

  3. The latency-accuracy tradeoff is real. A 500-tree truncation of a 1,847-tree model sacrificed 0.002 AUC but halved prediction latency. In real-time serving, this tradeoff is almost always worth making.

  4. Staged hyperparameter tuning is practical at scale. Full grid search on 500K samples is expensive. Tuning in stages (structure, then sampling, then learning rate) reaches near-optimal performance in a fraction of the time.

  5. The A/B test is the only evaluation that matters. Offline AUC improvements do not guarantee business impact. The A/B test proved that better predictions translated to better targeting, fewer wasted discounts, and higher revenue.

  6. Production deployment is 50% of the work. Feature pipelines, categorical encoding, threshold management, latency optimization, monitoring, and fallback systems are not afterthoughts --- they are the difference between a notebook experiment and a revenue-generating system.


This case study supports Chapter 14: Gradient Boosting. Return to the chapter for the full discussion of gradient boosting theory and the three library implementations.