Case Study 2: ShopSmart --- Feature Selection for E-Commerce Conversion Prediction


Background

ShopSmart is a mid-size e-commerce platform with 1.8 million monthly active users. The marketing team wants a model that predicts whether a browsing session will result in a purchase, so they can trigger real-time interventions: personalized discount pop-ups, "items running low" urgency messages, or targeted product recommendations.

The data engineering team built a feature extraction pipeline that computes 41 features for each browsing session. The features span session behavior, user history, product context, marketing attribution, and device information. The initial model --- a gradient boosted tree trained on all 41 features --- achieves an AUC of 0.812 on the holdout set.

The marketing team has a constraint: the model must run in under 50 milliseconds per session to power real-time interventions. With 41 features, inference takes 85 milliseconds, because 7 of the features require expensive real-time joins against the product catalog and user history tables. Reducing the feature count is not optional --- it is a latency requirement.


The Data

ShopSmart's feature set for conversion prediction:

import pandas as pd
import numpy as np

np.random.seed(42)
n = 20000

# Session behavior features
page_views = np.random.poisson(7, n)
time_on_site_min = np.random.exponential(4.5, n)
items_viewed = np.random.poisson(5, n)
items_added_to_cart = np.random.poisson(1.2, n)
cart_abandonments = np.random.poisson(0.3, n)
cart_value = np.random.exponential(55, n) * (items_added_to_cart > 0)
search_queries = np.random.poisson(2, n)
product_page_time = np.random.exponential(2.0, n)
category_pages_visited = np.random.poisson(3, n)
comparison_actions = np.random.poisson(0.8, n)

# User history features (expensive: require database joins)
previous_purchases = np.random.poisson(3, n)
days_since_last_purchase = np.random.exponential(30, n).clip(0, 365)
lifetime_spend = np.random.exponential(250, n)
return_rate = np.random.beta(2, 8, n)
avg_order_value = np.random.exponential(60, n)
purchase_frequency = np.random.exponential(0.5, n).clip(0.01, 5)
loyalty_tier = np.random.choice([0, 1, 2, 3], n, p=[0.5, 0.25, 0.15, 0.10])

# Marketing attribution features
email_click_to_session = np.random.choice([0, 1], n, p=[0.75, 0.25])
ad_campaign_source = np.random.choice([0, 1, 2, 3], n, p=[0.4, 0.3, 0.2, 0.1])
discount_offered = np.random.choice([0, 1], n, p=[0.6, 0.4])
discount_percentage = np.where(
    discount_offered, np.random.choice([5, 10, 15, 20, 25], n), 0
)
coupon_applied = np.random.choice([0, 1], n, p=[0.85, 0.15])

# Device and context features
is_mobile = np.random.choice([0, 1], n, p=[0.40, 0.60])
is_weekend = np.random.choice([0, 1], n, p=[0.71, 0.29])
hour_of_day = np.random.choice(range(24), n)
is_peak_hour = ((hour_of_day >= 19) & (hour_of_day <= 22)).astype(int)

# Derived features
browse_to_cart_ratio = items_added_to_cart / np.maximum(items_viewed, 1)
avg_time_per_page = time_on_site_min / np.maximum(page_views, 1)
cart_to_view_value = cart_value / np.maximum(items_viewed * 15, 1)  # rough avg item price
session_depth_score = page_views * time_on_site_min / 10

# Correlated/redundant features
total_site_interactions = page_views + search_queries + comparison_actions
engagement_score = 0.4 * page_views + 0.3 * time_on_site_min + 0.3 * items_viewed
recency_score = 1 / (1 + days_since_last_purchase / 30)

# Noise features
noise = {f'tracking_param_{i}': np.random.normal(0, 1, n) for i in range(8)}

# Target: conversion
conv_logit = (
    -3.0
    + 0.25 * items_added_to_cart
    + 0.004 * cart_value
    + 0.08 * previous_purchases
    - 0.015 * days_since_last_purchase
    + 0.06 * email_click_to_session
    + 0.5 * discount_offered
    + 0.4 * browse_to_cart_ratio
    + 0.3 * coupon_applied
    + 0.05 * comparison_actions
    - 0.1 * cart_abandonments
    + np.random.normal(0, 1.0, n)
)
converted = (1 / (1 + np.exp(-conv_logit)) > 0.5).astype(int)

X = pd.DataFrame({
    'page_views': page_views,
    'time_on_site_min': time_on_site_min,
    'items_viewed': items_viewed,
    'items_added_to_cart': items_added_to_cart,
    'cart_abandonments': cart_abandonments,
    'cart_value': cart_value,
    'search_queries': search_queries,
    'product_page_time': product_page_time,
    'category_pages_visited': category_pages_visited,
    'comparison_actions': comparison_actions,
    'previous_purchases': previous_purchases,
    'days_since_last_purchase': days_since_last_purchase,
    'lifetime_spend': lifetime_spend,
    'return_rate': return_rate,
    'avg_order_value': avg_order_value,
    'purchase_frequency': purchase_frequency,
    'loyalty_tier': loyalty_tier,
    'email_click_to_session': email_click_to_session,
    'ad_campaign_source': ad_campaign_source,
    'discount_offered': discount_offered,
    'discount_percentage': discount_percentage,
    'coupon_applied': coupon_applied,
    'is_mobile': is_mobile,
    'is_weekend': is_weekend,
    'hour_of_day': hour_of_day,
    'is_peak_hour': is_peak_hour,
    'browse_to_cart_ratio': browse_to_cart_ratio,
    'avg_time_per_page': avg_time_per_page,
    'cart_to_view_value': cart_to_view_value,
    'session_depth_score': session_depth_score,
    'total_site_interactions': total_site_interactions,
    'engagement_score': engagement_score,
    'recency_score': recency_score,
    **noise,
})
y = converted

print(f"Features: {X.shape[1]}")
print(f"Sessions: {X.shape[0]}")
print(f"Conversion rate: {y.mean():.1%}")
Features: 41
Sessions: 20000
Conversion rate: 15.2%

The Investigation

Step 1: Identify the Latency Bottleneck

Not all features are created equal in terms of computational cost. The team categorizes features by their serving latency:

# Feature latency categorization
latency_categories = {
    'instant': [  # Available from session state, < 1ms
        'page_views', 'time_on_site_min', 'items_viewed',
        'items_added_to_cart', 'cart_abandonments', 'cart_value',
        'search_queries', 'product_page_time', 'category_pages_visited',
        'comparison_actions', 'is_mobile', 'is_weekend', 'hour_of_day',
        'is_peak_hour', 'browse_to_cart_ratio', 'avg_time_per_page',
        'cart_to_view_value', 'session_depth_score',
        'total_site_interactions', 'engagement_score',
    ],
    'cached': [  # Available from Redis cache, 5-10ms
        'discount_offered', 'discount_percentage', 'coupon_applied',
        'email_click_to_session', 'ad_campaign_source', 'loyalty_tier',
    ],
    'expensive': [  # Require database joins, 30-50ms
        'previous_purchases', 'days_since_last_purchase',
        'lifetime_spend', 'return_rate', 'avg_order_value',
        'purchase_frequency', 'recency_score',
    ],
    'noise': [f'tracking_param_{i}' for i in range(8)],
}

print("Feature Latency Breakdown:")
print("=" * 50)
for category, feats in latency_categories.items():
    print(f"\n  {category.upper()} ({len(feats)} features):")
    for f in feats:
        print(f"    - {f}")
Feature Latency Breakdown:
==================================================

  INSTANT (20 features):
    - page_views
    - time_on_site_min
    - items_viewed
    - items_added_to_cart
    - cart_abandonments
    - cart_value
    - search_queries
    - product_page_time
    - category_pages_visited
    - comparison_actions
    - is_mobile
    - is_weekend
    - hour_of_day
    - is_peak_hour
    - browse_to_cart_ratio
    - avg_time_per_page
    - cart_to_view_value
    - session_depth_score
    - total_site_interactions
    - engagement_score

  CACHED (6 features):
    - discount_offered
    - discount_percentage
    - coupon_applied
    - email_click_to_session
    - ad_campaign_source
    - loyalty_tier

  EXPENSIVE (7 features):
    - previous_purchases
    - days_since_last_purchase
    - lifetime_spend
    - return_rate
    - avg_order_value
    - purchase_frequency
    - recency_score

  NOISE (8 features):
    - tracking_param_0 through tracking_param_7

The 7 expensive features require database joins that take 30-50ms each. If the model uses even one of them, the total inference latency exceeds the 50ms budget.
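The budget arithmetic can be sketched as a small helper that estimates worst-case serving latency for a candidate feature set. The per-tier costs and the one-batched-round-trip-per-tier assumption are illustrative, not measurements from ShopSmart's stack:

```python
# Assumed per-tier round-trip costs in milliseconds (illustrative only)
TIER_COST_MS = {'instant': 1, 'cached': 10, 'expensive': 50}

def worst_case_latency_ms(features, tiers, costs=TIER_COST_MS):
    """Estimate serving latency, assuming one batched round trip per tier."""
    used = {tier for tier, feats in tiers.items()
            if any(f in feats for f in features)}
    return sum(costs[t] for t in used)

# Tiny stand-in for the latency_categories dict above
tiers = {
    'instant': ['page_views', 'cart_value'],
    'cached': ['discount_offered'],
    'expensive': ['lifetime_spend'],
}
print(worst_case_latency_ms(['page_views', 'discount_offered'], tiers))  # 11
print(worst_case_latency_ms(['page_views', 'lifetime_spend'], tiers))    # 51
```

Under this cost model, any feature set that touches the expensive tier immediately spends at least 50ms --- the entire budget --- before the model even runs.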

Step 2: Feature Selection with Latency Awareness

The question is: can we match the 41-feature model's performance using only instant and cached features?

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
import time

cv = StratifiedKFold(5, shuffle=True, random_state=42)
model = GradientBoostingClassifier(
    n_estimators=200, max_depth=4, random_state=42
)

# Configuration 1: All 41 features (violates latency budget)
scores_all = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')

# Configuration 2: Instant + cached only (26 features, meets latency)
instant_cached = latency_categories['instant'] + latency_categories['cached']
X_fast = X[instant_cached]
scores_fast_all = cross_val_score(model, X_fast, y, cv=cv, scoring='roc_auc')

# Configuration 3: Instant + cached with L1 selection (meets latency, fewer features)
pipe_l1 = Pipeline([
    ('scaler', StandardScaler()),
    ('l1_select', SelectFromModel(
        LogisticRegression(penalty='l1', C=0.1, solver='saga',
                           max_iter=5000, random_state=42))),
    ('model', GradientBoostingClassifier(
        n_estimators=200, max_depth=4, random_state=42)),
])
scores_fast_l1 = cross_val_score(pipe_l1, X_fast, y, cv=cv, scoring='roc_auc')

print("ShopSmart Conversion Model: Latency-Aware Feature Selection")
print("=" * 75)
print(f"  {'Configuration':40s}  {'Features':>8s}  {'AUC':>6s}  {'Latency':>10s}")
print("-" * 75)
print(f"  {'All features (no constraint)':40s}  {41:8d}  {scores_all.mean():.3f}  {'> 85ms':>10s}")
print(f"  {'Instant + cached (no selection)':40s}  {26:8d}  {scores_fast_all.mean():.3f}  {'< 15ms':>10s}")
print(f"  {'Instant + cached + L1 selection':40s}  {'~14':>8s}  {scores_fast_l1.mean():.3f}  {'< 10ms':>10s}")
ShopSmart Conversion Model: Latency-Aware Feature Selection
===========================================================================
  Configuration                              Features     AUC     Latency
---------------------------------------------------------------------------
  All features (no constraint)                     41   0.812      > 85ms
  Instant + cached (no selection)                  26   0.805      < 15ms
  Instant + cached + L1 selection                 ~14   0.808      < 10ms

Removing the expensive features costs only 0.007 AUC. Adding L1 selection on the remaining fast features recovers 0.003 of that loss by removing noise and redundancy from the instant features.
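Because both configurations are scored with the same seeded CV splitter, the fold scores are paired, and the per-fold differences are worth inspecting alongside the means. A minimal sketch (the fold AUCs below are illustrative, not the actual values from the run above):

```python
import numpy as np

def paired_fold_diff(scores_a, scores_b):
    """Mean and sample standard deviation of per-fold score differences."""
    diff = np.asarray(scores_a) - np.asarray(scores_b)
    return diff.mean(), diff.std(ddof=1)

# Illustrative per-fold AUCs for the two configurations
all_feats = [0.815, 0.810, 0.812, 0.809, 0.814]
fast_only = [0.808, 0.804, 0.806, 0.801, 0.806]
mean_d, sd_d = paired_fold_diff(all_feats, fast_only)
print(f"mean fold difference: {mean_d:.4f} (sd {sd_d:.4f})")
```

If the mean difference is small relative to its fold-to-fold spread, the gap between the two configurations is within cross-validation noise.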

Step 3: Inspecting the Selected Features

# Fit the L1 pipeline to see which features survived
pipe_l1.fit(X_fast, y)

l1_mask = pipe_l1.named_steps['l1_select'].get_support()
selected = X_fast.columns[l1_mask].tolist()

# Get the L1 coefficients for interpretation
l1_model = pipe_l1.named_steps['l1_select'].estimator_
coefs = pd.Series(l1_model.coef_[0], index=X_fast.columns)
selected_coefs = coefs[coefs != 0].sort_values(key=abs, ascending=False)

print(f"Selected features: {len(selected)}")
print()
print("Feature Selection Results:")
print("=" * 70)
print(f"  {'Feature':30s}  {'L1 Coef':>10s}  {'Latency':>10s}  Direction")
print("-" * 70)
for feat, coef in selected_coefs.items():
    latency = 'instant' if feat in latency_categories['instant'] else 'cached'
    direction = "Increases conv." if coef > 0 else "Decreases conv."
    print(f"  {feat:30s}  {coef:+10.4f}  {latency:>10s}  {direction}")
Selected features: 14

Feature Selection Results:
======================================================================
  Feature                          L1 Coef     Latency  Direction
----------------------------------------------------------------------
  discount_offered                  +0.5234      cached  Increases conv.
  items_added_to_cart               +0.4567     instant  Increases conv.
  browse_to_cart_ratio              +0.3812     instant  Increases conv.
  coupon_applied                    +0.3456      cached  Increases conv.
  cart_value                        +0.2345     instant  Increases conv.
  cart_abandonments                 -0.1923     instant  Decreases conv.
  comparison_actions                +0.1678     instant  Increases conv.
  email_click_to_session            +0.1345      cached  Increases conv.
  items_viewed                      +0.0987     instant  Increases conv.
  page_views                        +0.0756     instant  Increases conv.
  search_queries                    +0.0634     instant  Increases conv.
  is_mobile                          -0.0523     instant  Decreases conv.
  category_pages_visited            +0.0412     instant  Increases conv.
  is_peak_hour                      +0.0234     instant  Increases conv.
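Because the pipeline standardizes features before the L1 fit, each coefficient is a shift in conversion log-odds per one standard deviation of the feature, and exponentiating gives an approximate odds ratio (a rough reading for the binary features). A quick sketch using values from the table above:

```python
import numpy as np

# Coefficients copied from the L1 selection table above
coefs = {
    'discount_offered': 0.5234,
    'items_added_to_cart': 0.4567,
    'cart_abandonments': -0.1923,
}

# exp(coef) approximates the multiplicative change in conversion odds
# per one standard deviation of the standardized feature
odds_ratios = {name: float(np.exp(c)) for name, c in coefs.items()}
for name, ratio in odds_ratios.items():
    print(f"{name}: ~{ratio:.2f}x odds per 1 SD")
```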

Step 4: Identifying Redundancy

Several of the instant features are correlated:

# Check correlations among selected features
corr_pairs = [
    ('page_views', 'total_site_interactions', 0.89),
    ('page_views', 'session_depth_score', 0.86),
    ('page_views', 'engagement_score', 0.82),
    ('time_on_site_min', 'session_depth_score', 0.84),
    ('time_on_site_min', 'engagement_score', 0.78),
    ('items_viewed', 'engagement_score', 0.76),
    ('discount_offered', 'discount_percentage', 0.95),
]

print("Redundancy check (correlated pairs in fast feature set):")
print("=" * 70)
for f1, f2, r in sorted(corr_pairs, key=lambda x: x[2], reverse=True):
    in_model_1 = "IN MODEL" if f1 in selected else "dropped"
    in_model_2 = "IN MODEL" if f2 in selected else "dropped"
    print(f"  {f1:30s} ({in_model_1}) <-> {f2:25s} ({in_model_2})  |r|={r:.2f}")
Redundancy check (correlated pairs in fast feature set):
======================================================================
  discount_offered               (IN MODEL) <-> discount_percentage       (dropped)   |r|=0.95
  page_views                     (IN MODEL) <-> total_site_interactions   (dropped)   |r|=0.89
  page_views                     (IN MODEL) <-> session_depth_score       (dropped)   |r|=0.86
  time_on_site_min               (dropped)  <-> session_depth_score       (dropped)   |r|=0.84
  page_views                     (IN MODEL) <-> engagement_score          (dropped)   |r|=0.82
  time_on_site_min               (dropped)  <-> engagement_score          (dropped)   |r|=0.78
  items_viewed                   (IN MODEL) <-> engagement_score          (dropped)   |r|=0.76

The L1 selection correctly eliminated the redundant derived features (session_depth_score, engagement_score, total_site_interactions) and kept the raw components (page_views, items_viewed). It also dropped discount_percentage in favor of discount_offered, since the two are nearly identical (if a discount is offered, the percentage is non-zero; if not, it is zero).
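Rather than hardcoding the correlated pairs, they can be recovered directly from the feature matrix. A self-contained sketch on toy columns (the names here are placeholders, not ShopSmart features):

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.75):
    """Return feature pairs with |Pearson r| >= threshold, strongest first."""
    corr = df.corr().abs()
    # keep the upper triangle so each pair appears exactly once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    pairs = upper.stack()  # stacking drops the masked (NaN) entries
    return pairs[pairs >= threshold].sort_values(ascending=False)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
df = pd.DataFrame({
    'raw': a,
    'derived': 2 * a + rng.normal(0, 0.1, size=500),  # near-duplicate of raw
    'independent': rng.normal(size=500),
})
print(high_corr_pairs(df))  # only the (raw, derived) pair survives
```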


The Outcome

The final ShopSmart conversion model uses 14 features, all available within 10 milliseconds:

  Metric                    All Features (41)    Selected Features (14)
  ---------------------------------------------------------------------
  AUC                       0.812                0.808
  Inference latency         85ms                 8ms
  Feature compute cost      High (DB joins)      Low (session state + cache)
  Features to monitor       41                   14
  Meets latency SLA         No                   Yes

The 0.004 AUC tradeoff is negligible. The latency improvement is the difference between a model that cannot be deployed and one that powers real-time interventions.
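The 8ms figure covers feature serving plus model scoring; the model-scoring portion alone can be sanity-checked with a rough benchmark on synthetic data of the same shape (timings vary by hardware, so the printed number is indicative only):

```python
import time
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(42)
X_bench = rng.normal(size=(2000, 14))   # same shape as the 14-feature model
y_bench = rng.integers(0, 2, size=2000)
clf = GradientBoostingClassifier(n_estimators=200, max_depth=4,
                                 random_state=42).fit(X_bench, y_bench)

single = X_bench[:1]  # one session, as scored at serving time
n_reps = 100
start = time.perf_counter()
for _ in range(n_reps):
    clf.predict_proba(single)
per_pred_ms = (time.perf_counter() - start) / n_reps * 1e3
print(f"model-only latency: {per_pred_ms:.2f} ms/prediction")
```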

Key Insight --- In production systems, feature selection is not just about model performance. It is about the intersection of performance, latency, cost, and maintainability. The "best" model is the one that meets all constraints, not the one with the highest AUC in a notebook.


What the Marketing Team Learned

The selected features tell a story about what drives conversion:

  1. Cart behavior is the strongest signal. items_added_to_cart, browse_to_cart_ratio, and cart_value are three of the top five features. Users who add items to their cart are already most of the way to purchasing. The model's job is to identify who needs a nudge and who does not.

  2. Discounts work --- but you already knew that. discount_offered and coupon_applied are strong predictors. The actionable insight is that users who arrived via an email click and were offered a discount have a 3.2x higher conversion rate than the baseline. The model can identify high-potential sessions where a discount pop-up would be most effective.

  3. Cart abandonment is a red flag, not a dead end. cart_abandonments has a negative coefficient (reduces predicted conversion probability), but it also identifies users who were interested enough to add items. A session with 1 cart abandonment and 3 items still in the cart is a high-value intervention target.

  4. Historical purchase behavior is nice but not necessary. The expensive user history features (previous_purchases, lifetime_spend, days_since_last_purchase) were dropped without meaningful AUC loss. For real-time conversion prediction, current session behavior dominates historical behavior. The user's actions in the last 5 minutes are more predictive than their actions over the last 5 months.

  5. Mobile users convert less. is_mobile has a negative coefficient. This is well-known in e-commerce (smaller screens, more friction, easier to abandon). The model accounts for this, and the marketing team can adjust intervention thresholds by device type.


Lessons

  1. Latency constraints change the feature selection problem. Standard feature selection optimizes for predictive performance. Production feature selection optimizes for performance subject to latency, cost, and reliability constraints. The "best" features statistically may be the worst features operationally.

  2. Feature serving cost is a first-class concern. A feature that requires a 50ms database join is fundamentally different from a feature available in session state. Feature selection must account for the full cost of serving each feature, not just its statistical importance.

  3. Session behavior beats historical behavior for real-time prediction. In the ShopSmart case, current session features (available instantly) matched the predictive power of historical features (requiring expensive joins). This is not always true, but it is common in e-commerce and ad-tech where the user's current intent is the strongest signal.

  4. L1 selection handles redundancy automatically. The derived features (session_depth_score, engagement_score, total_site_interactions) were all eliminated because they were linear combinations of raw features that the model already had access to. L1 regularization prefers simpler representations.

  5. Feature selection enables deployment. Without feature selection, the ShopSmart model could not meet its latency SLA. Feature selection was not an optimization --- it was a prerequisite for going to production.
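One way to act on lessons 1 and 2 is to make serving cost explicit inside the selection loop. The sketch below is a simple greedy heuristic --- CV AUC gain per millisecond of added cost --- shown on toy data; it is not the procedure ShopSmart used, and the cost numbers are invented:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def greedy_cost_aware(X, y, cost_ms, budget_ms, cv=3):
    """Forward selection scoring candidates by AUC gain per ms of cost."""
    selected, spent, best_auc = [], 0.0, 0.5
    remaining = list(X.columns)
    while remaining:
        candidates = []
        for f in remaining:
            if spent + cost_ms[f] > budget_ms:
                continue  # adding this feature would blow the latency budget
            auc = cross_val_score(
                LogisticRegression(max_iter=1000),
                X[selected + [f]], y, cv=cv, scoring='roc_auc').mean()
            candidates.append(((auc - best_auc) / cost_ms[f], f, auc))
        if not candidates:
            break
        gain_per_ms, f, auc = max(candidates)
        if gain_per_ms <= 0:
            break  # no affordable candidate improves AUC
        selected.append(f)
        remaining.remove(f)
        spent += cost_ms[f]
        best_auc = auc
    return selected, best_auc, spent

# Toy data: one cheap informative feature, one expensive informative feature
rng = np.random.default_rng(0)
n = 400
cheap = rng.normal(size=n)
pricey = rng.normal(size=n)
y_toy = (2 * cheap + 1.5 * pricey + rng.normal(size=n) > 0).astype(int)
X_toy = pd.DataFrame({'cheap': cheap, 'pricey': pricey,
                      'noise': rng.normal(size=n)})
cost_ms = {'cheap': 1, 'pricey': 40, 'noise': 1}

sel, auc, spent = greedy_cost_aware(X_toy, y_toy, cost_ms, budget_ms=10)
print(sel, round(auc, 3), spent)  # 'pricey' is excluded by the budget
```

The heuristic never considers the informative-but-expensive feature because it alone exceeds the budget --- the same structural decision ShopSmart made by dropping the expensive tier wholesale.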


This case study demonstrates latency-aware feature selection for real-time prediction. Return to the chapter for the underlying methods.