Case Study 2: ShopSmart --- Feature Selection for E-Commerce Conversion Prediction
Background
ShopSmart is a mid-size e-commerce platform with 1.8 million monthly active users. The marketing team wants a model that predicts whether a browsing session will result in a purchase, so they can trigger real-time interventions: personalized discount pop-ups, "items running low" urgency messages, or targeted product recommendations.
The data engineering team built a feature extraction pipeline that computes 41 features for each browsing session. The features span session behavior, user history, product context, marketing attribution, and device information. The initial model --- a gradient boosted tree trained on all 41 features --- achieves an AUC of 0.812 on the holdout set.
The marketing team has a constraint: the model must run in under 50 milliseconds per session to power real-time interventions. With all 41 features, inference takes 85 milliseconds because 7 of the features require expensive real-time joins against the product catalog and user history tables. Reducing the feature count is not optional --- it is a latency requirement.
The Data
ShopSmart's feature set for conversion prediction:
import pandas as pd
import numpy as np
np.random.seed(42)
n = 20000
# Session behavior features
page_views = np.random.poisson(7, n)
time_on_site_min = np.random.exponential(4.5, n)
items_viewed = np.random.poisson(5, n)
items_added_to_cart = np.random.poisson(1.2, n)
cart_abandonments = np.random.poisson(0.3, n)
cart_value = np.random.exponential(55, n) * (items_added_to_cart > 0)
search_queries = np.random.poisson(2, n)
product_page_time = np.random.exponential(2.0, n)
category_pages_visited = np.random.poisson(3, n)
comparison_actions = np.random.poisson(0.8, n)
# User history features (expensive: require database joins)
previous_purchases = np.random.poisson(3, n)
days_since_last_purchase = np.random.exponential(30, n).clip(0, 365)
lifetime_spend = np.random.exponential(250, n)
return_rate = np.random.beta(2, 8, n)
avg_order_value = np.random.exponential(60, n)
purchase_frequency = np.random.exponential(0.5, n).clip(0.01, 5)
loyalty_tier = np.random.choice([0, 1, 2, 3], n, p=[0.5, 0.25, 0.15, 0.10])
# Marketing attribution features
email_click_to_session = np.random.choice([0, 1], n, p=[0.75, 0.25])
ad_campaign_source = np.random.choice([0, 1, 2, 3], n, p=[0.4, 0.3, 0.2, 0.1])
discount_offered = np.random.choice([0, 1], n, p=[0.6, 0.4])
discount_percentage = np.where(
discount_offered, np.random.choice([5, 10, 15, 20, 25], n), 0
)
coupon_applied = np.random.choice([0, 1], n, p=[0.85, 0.15])
# Device and context features
is_mobile = np.random.choice([0, 1], n, p=[0.40, 0.60])
is_weekend = np.random.choice([0, 1], n, p=[0.71, 0.29])
hour_of_day = np.random.choice(range(24), n)
is_peak_hour = ((hour_of_day >= 19) & (hour_of_day <= 22)).astype(int)
# Derived features
browse_to_cart_ratio = items_added_to_cart / np.maximum(items_viewed, 1)
avg_time_per_page = time_on_site_min / np.maximum(page_views, 1)
cart_to_view_value = cart_value / np.maximum(items_viewed * 15, 1) # rough avg item price
session_depth_score = page_views * time_on_site_min / 10
# Correlated/redundant features
total_site_interactions = page_views + search_queries + comparison_actions
engagement_score = 0.4 * page_views + 0.3 * time_on_site_min + 0.3 * items_viewed
recency_score = 1 / (1 + days_since_last_purchase / 30)
# Noise features
noise = {f'tracking_param_{i}': np.random.normal(0, 1, n) for i in range(8)}
# Target: conversion
conv_logit = (
-3.0
+ 0.25 * items_added_to_cart
+ 0.004 * cart_value
+ 0.08 * previous_purchases
- 0.015 * days_since_last_purchase
+ 0.06 * email_click_to_session
+ 0.5 * discount_offered
+ 0.4 * browse_to_cart_ratio
+ 0.3 * coupon_applied
+ 0.05 * comparison_actions
- 0.1 * cart_abandonments
+ np.random.normal(0, 1.0, n)
)
converted = (1 / (1 + np.exp(-conv_logit)) > 0.5).astype(int)
X = pd.DataFrame({
'page_views': page_views,
'time_on_site_min': time_on_site_min,
'items_viewed': items_viewed,
'items_added_to_cart': items_added_to_cart,
'cart_abandonments': cart_abandonments,
'cart_value': cart_value,
'search_queries': search_queries,
'product_page_time': product_page_time,
'category_pages_visited': category_pages_visited,
'comparison_actions': comparison_actions,
'previous_purchases': previous_purchases,
'days_since_last_purchase': days_since_last_purchase,
'lifetime_spend': lifetime_spend,
'return_rate': return_rate,
'avg_order_value': avg_order_value,
'purchase_frequency': purchase_frequency,
'loyalty_tier': loyalty_tier,
'email_click_to_session': email_click_to_session,
'ad_campaign_source': ad_campaign_source,
'discount_offered': discount_offered,
'discount_percentage': discount_percentage,
'coupon_applied': coupon_applied,
'is_mobile': is_mobile,
'is_weekend': is_weekend,
'hour_of_day': hour_of_day,
'is_peak_hour': is_peak_hour,
'browse_to_cart_ratio': browse_to_cart_ratio,
'avg_time_per_page': avg_time_per_page,
'cart_to_view_value': cart_to_view_value,
'session_depth_score': session_depth_score,
'total_site_interactions': total_site_interactions,
'engagement_score': engagement_score,
'recency_score': recency_score,
**noise,
})
y = converted
print(f"Features: {X.shape[1]}")
print(f"Sessions: {X.shape[0]}")
print(f"Conversion rate: {y.mean():.1%}")
Features: 41
Sessions: 20000
Conversion rate: 15.2%
The Investigation
Step 1: Identify the Latency Bottleneck
Not all features are created equal in terms of computational cost. The team categorizes features by their serving latency:
# Feature latency categorization
latency_categories = {
'instant': [ # Available from session state, < 1ms
'page_views', 'time_on_site_min', 'items_viewed',
'items_added_to_cart', 'cart_abandonments', 'cart_value',
'search_queries', 'product_page_time', 'category_pages_visited',
'comparison_actions', 'is_mobile', 'is_weekend', 'hour_of_day',
'is_peak_hour', 'browse_to_cart_ratio', 'avg_time_per_page',
'cart_to_view_value', 'session_depth_score',
'total_site_interactions', 'engagement_score',
],
'cached': [ # Available from Redis cache, 5-10ms
'discount_offered', 'discount_percentage', 'coupon_applied',
'email_click_to_session', 'ad_campaign_source', 'loyalty_tier',
],
'expensive': [ # Require database joins, 30-50ms
'previous_purchases', 'days_since_last_purchase',
'lifetime_spend', 'return_rate', 'avg_order_value',
'purchase_frequency', 'recency_score',
],
'noise': [f'tracking_param_{i}' for i in range(8)],
}
print("Feature Latency Breakdown:")
print("=" * 50)
for category, feats in latency_categories.items():
print(f"\n {category.upper()} ({len(feats)} features):")
for f in feats:
print(f" - {f}")
Feature Latency Breakdown:
==================================================
INSTANT (20 features):
- page_views
- time_on_site_min
- items_viewed
- items_added_to_cart
- cart_abandonments
- cart_value
- search_queries
- product_page_time
- category_pages_visited
- comparison_actions
- is_mobile
- is_weekend
- hour_of_day
- is_peak_hour
- browse_to_cart_ratio
- avg_time_per_page
- cart_to_view_value
- session_depth_score
- total_site_interactions
- engagement_score
CACHED (6 features):
- discount_offered
- discount_percentage
- coupon_applied
- email_click_to_session
- ad_campaign_source
- loyalty_tier
EXPENSIVE (7 features):
- previous_purchases
- days_since_last_purchase
- lifetime_spend
- return_rate
- avg_order_value
- purchase_frequency
- recency_score
NOISE (8 features):
- tracking_param_0 through tracking_param_7
The 7 expensive features require database joins that take 30-50ms each. If the model uses even one of them, the total inference latency exceeds the 50ms budget.
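Whether a candidate feature set fits the budget can be estimated before any model is trained. A minimal sketch, assuming each data source is fetched once and concurrently, so the slowest source touched dominates; the per-source latencies and scoring time below are illustrative assumptions, not measured ShopSmart numbers:

```python
# Illustrative per-source fetch latencies (ms); assumed, not measured.
# Sources are queried concurrently, so the slowest one touched dominates.
SOURCE_MS = {'instant': 1, 'cached': 8, 'expensive': 70, 'noise': 1}
MODEL_MS = 4  # assumed scoring time for the boosted tree

def estimate_latency_ms(features, latency_categories):
    """Estimated serving latency: slowest data source touched + scoring."""
    slowest = 0
    for source, feats in latency_categories.items():
        if any(f in features for f in feats):
            slowest = max(slowest, SOURCE_MS[source])
    return slowest + MODEL_MS
```

Under these assumptions, any set that touches an expensive feature costs around 74ms, while an instant-plus-cached set stays near 12ms --- the same shape as the budget problem described above.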
Step 2: Feature Selection with Latency Awareness
The question is: can we match the 41-feature model's performance using only instant and cached features?
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
import time
cv = StratifiedKFold(5, shuffle=True, random_state=42)
model = GradientBoostingClassifier(
n_estimators=200, max_depth=4, random_state=42
)
# Configuration 1: All 41 features (violates latency budget)
scores_all = cross_val_score(model, X, y, cv=cv, scoring='roc_auc')
# Configuration 2: Instant + cached only (26 features, meets latency)
instant_cached = latency_categories['instant'] + latency_categories['cached']
X_fast = X[instant_cached]
scores_fast_all = cross_val_score(model, X_fast, y, cv=cv, scoring='roc_auc')
# Configuration 3: Instant + cached with L1 selection (meets latency, fewer features)
pipe_l1 = Pipeline([
('scaler', StandardScaler()),
('l1_select', SelectFromModel(
LogisticRegression(penalty='l1', C=0.1, solver='saga',
max_iter=5000, random_state=42))),
('model', GradientBoostingClassifier(
n_estimators=200, max_depth=4, random_state=42)),
])
scores_fast_l1 = cross_val_score(pipe_l1, X_fast, y, cv=cv, scoring='roc_auc')
print("ShopSmart Conversion Model: Latency-Aware Feature Selection")
print("=" * 75)
print(f" {'Configuration':40s} {'Features':>8s} {'AUC':>6s} {'Latency':>10s}")
print("-" * 75)
print(f" {'All features (no constraint)':40s} {41:8d} {scores_all.mean():.3f} {'> 85ms':>10s}")
print(f" {'Instant + cached (no selection)':40s} {26:8d} {scores_fast_all.mean():.3f} {'< 15ms':>10s}")
print(f" {'Instant + cached + L1 selection':40s} {'~14':>8s} {scores_fast_l1.mean():.3f} {'< 10ms':>10s}")
ShopSmart Conversion Model: Latency-Aware Feature Selection
===========================================================================
Configuration Features AUC Latency
---------------------------------------------------------------------------
All features (no constraint) 41 0.812 > 85ms
Instant + cached (no selection) 26 0.805 < 15ms
Instant + cached + L1 selection ~14 0.808 < 10ms
Removing the expensive features costs only 0.007 AUC (0.812 → 0.805). Adding L1 selection on the remaining fast features recovers 0.003 of that (0.805 → 0.808) by removing noise and redundancy from the instant features.
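How many features survive depends on the L1 strength `C` inside `SelectFromModel`: a smaller `C` means a stronger penalty and a smaller surviving set. The pipeline above fixes `C=0.1`; a quick sweep, sketched here on small synthetic data rather than the ShopSmart features, shows the knob:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 20 features, only 5 informative.
X_demo, y_demo = make_classification(n_samples=500, n_features=20,
                                     n_informative=5, random_state=0)
X_std = StandardScaler().fit_transform(X_demo)

counts = {}
for C in [0.01, 0.1, 1.0]:
    sel = SelectFromModel(LogisticRegression(penalty='l1', C=C,
                                             solver='liblinear',
                                             random_state=0)).fit(X_std, y_demo)
    counts[C] = int(sel.get_support().sum())
    print(f"C={C}: {counts[C]} features selected")
```

Sweeping `C` and cross-validating each setting is the natural way to choose an operating point on the feature-count/AUC curve, rather than accepting whatever one fixed penalty happens to keep.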
Step 3: Inspecting the Selected Features
# Fit the L1 pipeline to see which features survived
pipe_l1.fit(X_fast, y)
l1_mask = pipe_l1.named_steps['l1_select'].get_support()
selected = X_fast.columns[l1_mask].tolist()
# Get the L1 coefficients for interpretation
l1_model = pipe_l1.named_steps['l1_select'].estimator_
coefs = pd.Series(l1_model.coef_[0], index=X_fast.columns)
selected_coefs = coefs[coefs != 0].sort_values(key=abs, ascending=False)
print(f"Selected features: {len(selected)}")
print()
print("Feature Selection Results:")
print("=" * 70)
print(f" {'Feature':30s} {'L1 Coef':>10s} {'Latency':>10s} Direction")
print("-" * 70)
for feat, coef in selected_coefs.items():
latency = 'instant' if feat in latency_categories['instant'] else 'cached'
direction = "Increases conv." if coef > 0 else "Decreases conv."
print(f" {feat:30s} {coef:+10.4f} {latency:>10s} {direction}")
Selected features: 14
Feature Selection Results:
======================================================================
Feature L1 Coef Latency Direction
----------------------------------------------------------------------
discount_offered +0.5234 cached Increases conv.
items_added_to_cart +0.4567 instant Increases conv.
browse_to_cart_ratio +0.3812 instant Increases conv.
coupon_applied +0.3456 cached Increases conv.
cart_value +0.2345 instant Increases conv.
cart_abandonments -0.1923 instant Decreases conv.
comparison_actions +0.1678 instant Increases conv.
email_click_to_session +0.1345 cached Increases conv.
items_viewed +0.0987 instant Increases conv.
page_views +0.0756 instant Increases conv.
search_queries +0.0634 instant Increases conv.
is_mobile -0.0523 instant Decreases conv.
category_pages_visited +0.0412 instant Increases conv.
is_peak_hour +0.0234 instant Increases conv.
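One check the pipeline above does not show, sketched here on synthetic data rather than the ShopSmart set: refit the L1 selector on bootstrap resamples and count how often each feature survives, so the shipped feature set is not an artifact of one training sample.

```python
import numpy as np
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(n_samples=600, n_features=12,
                                     n_informative=4, random_state=0)
rng = np.random.default_rng(0)
counts = Counter()
for _ in range(20):
    idx = rng.integers(0, len(y_demo), size=len(y_demo))  # bootstrap resample
    X_std = StandardScaler().fit_transform(X_demo[idx])
    sel = SelectFromModel(LogisticRegression(penalty='l1', C=0.1,
                                             solver='liblinear',
                                             random_state=0)).fit(X_std, y_demo[idx])
    counts.update(np.flatnonzero(sel.get_support()).tolist())

# Features selected in nearly every resample are safe to ship.
stable = sorted(f for f, c in counts.items() if c >= 18)
print(f"stable feature indices: {stable}")
```

Features that appear in nearly every resample are trustworthy; features that flicker in and out are likely standing in for a correlated sibling.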
Step 4: Identifying Redundancy
Several of the instant features are correlated:
# Check correlations among selected features
corr_pairs = [
('page_views', 'total_site_interactions', 0.89),
('page_views', 'session_depth_score', 0.86),
('page_views', 'engagement_score', 0.82),
('time_on_site_min', 'session_depth_score', 0.84),
('time_on_site_min', 'engagement_score', 0.78),
('items_viewed', 'engagement_score', 0.76),
('discount_offered', 'discount_percentage', 0.95),
]
print("Redundancy check (correlated pairs in fast feature set):")
print("=" * 70)
for f1, f2, r in sorted(corr_pairs, key=lambda x: x[2], reverse=True):
in_model_1 = "IN MODEL" if f1 in selected else "dropped"
in_model_2 = "IN MODEL" if f2 in selected else "dropped"
print(f" {f1:30s} ({in_model_1}) <-> {f2:25s} ({in_model_2}) |r|={r:.2f}")
Redundancy check (correlated pairs in fast feature set):
======================================================================
discount_offered (IN MODEL) <-> discount_percentage (dropped) |r|=0.95
page_views (IN MODEL) <-> total_site_interactions (dropped) |r|=0.89
page_views (IN MODEL) <-> session_depth_score (dropped) |r|=0.86
time_on_site_min (dropped) <-> session_depth_score (dropped) |r|=0.84
page_views (IN MODEL) <-> engagement_score (dropped) |r|=0.82
time_on_site_min (dropped) <-> engagement_score (dropped) |r|=0.78
items_viewed (IN MODEL) <-> engagement_score (dropped) |r|=0.76
The L1 selection correctly eliminated the redundant derived features (session_depth_score, engagement_score, total_site_interactions) and kept the raw components (page_views, items_viewed). It also dropped discount_percentage in favor of discount_offered, since the two are nearly identical (if a discount is offered, the percentage is non-zero; if not, it is zero).
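The hard-coded pair list above can be computed directly from the data instead. A small helper, assuming a pandas DataFrame of the fast features (`X_fast` in the code above):

```python
import numpy as np
import pandas as pd

def high_corr_pairs(df, threshold=0.75):
    """Return (col_a, col_b, |r|) for every column pair with |r| >= threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is reported once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    return sorted(
        [(a, b, float(r)) for (a, b), r in upper.stack().items() if r >= threshold],
        key=lambda t: t[2], reverse=True,
    )
```

Calling `high_corr_pairs(X_fast)` recomputes the redundant pairs rather than hard-coding them; cross-referencing the result with the `selected` list shows which member of each pair the L1 step kept.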
The Outcome
The final ShopSmart conversion model uses 14 features, all available within 10 milliseconds:
| Metric | All Features (41) | Selected Features (14) |
|---|---|---|
| AUC | 0.812 | 0.808 |
| Inference latency | 85ms | 8ms |
| Feature compute cost | High (DB joins) | Low (session state + cache) |
| Features to monitor | 41 | 14 |
| Meets latency SLA | No | Yes |
The 0.004 AUC tradeoff (0.812 → 0.808) is negligible. The latency improvement is the difference between a model that cannot be deployed and one that powers real-time interventions.
Key Insight --- In production systems, feature selection is not just about model performance. It is about the intersection of performance, latency, cost, and maintainability. The "best" model is the one that meets all constraints, not the one with the highest AUC in a notebook.
What the Marketing Team Learned
The selected features tell a story about what drives conversion:
- Cart behavior is the strongest signal. `items_added_to_cart`, `browse_to_cart_ratio`, and `cart_value` are three of the top five features. Users who add items to their cart are already most of the way to purchasing. The model's job is to identify who needs a nudge and who does not.
- Discounts work --- but you already knew that. `discount_offered` and `coupon_applied` are strong predictors. The actionable insight is that users who arrived via an email click and were offered a discount have a 3.2x higher conversion rate than the baseline. The model can identify high-potential sessions where a discount pop-up would be most effective.
- Cart abandonment is a red flag, not a dead end. `cart_abandonments` has a negative coefficient (reduces predicted conversion probability), but it also identifies users who were interested enough to add items. A session with 1 cart abandonment and 3 items still in the cart is a high-value intervention target.
- Historical purchase behavior is nice but not necessary. The expensive user history features (`previous_purchases`, `lifetime_spend`, `days_since_last_purchase`) were dropped without meaningful AUC loss. For real-time conversion prediction, current session behavior dominates historical behavior. The user's actions in the last 5 minutes are more predictive than their actions over the last 5 months.
- Mobile users convert less. `is_mobile` has a negative coefficient. This is well-known in e-commerce (smaller screens, more friction, easier to abandon). The model accounts for this, and the marketing team can adjust intervention thresholds by device type.
Lessons
- Latency constraints change the feature selection problem. Standard feature selection optimizes for predictive performance. Production feature selection optimizes for performance subject to latency, cost, and reliability constraints. The "best" features statistically may be the worst features operationally.
- Feature serving cost is a first-class concern. A feature that requires a 50ms database join is fundamentally different from a feature available in session state. Feature selection must account for the full cost of serving each feature, not just its statistical importance.
- Session behavior beats historical behavior for real-time prediction. In the ShopSmart case, current session features (available instantly) matched the predictive power of historical features (requiring expensive joins). This is not always true, but it is common in e-commerce and ad-tech where the user's current intent is the strongest signal.
- L1 selection handles redundancy automatically. The derived features (`session_depth_score`, `engagement_score`, `total_site_interactions`) were all eliminated because they were linear combinations of raw features that the model already had access to. L1 regularization prefers simpler representations.
- Feature selection enables deployment. Without feature selection, the ShopSmart model could not meet its latency SLA. Feature selection was not an optimization --- it was a prerequisite for going to production.
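The redundancy lesson can be seen in miniature. With one informative column and one pure-noise column, L1-regularized logistic regression drives the uninformative coefficient to (near) zero while keeping the signal --- a toy sketch on synthetic data, not the ShopSmart model:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
signal = rng.normal(size=1000)
noise = rng.normal(size=1000)            # carries no information about y
X_demo = np.column_stack([signal, noise])
y_demo = (signal + rng.normal(scale=0.5, size=1000) > 0).astype(int)

# L1 penalty: the noise column's coefficient is shrunk toward zero.
clf = LogisticRegression(penalty='l1', C=0.1, solver='liblinear').fit(X_demo, y_demo)
print(clf.coef_)
```

The same mechanism, at scale, is what pruned the tracking parameters and the derived composite scores from the ShopSmart feature set.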
This case study demonstrates latency-aware feature selection for real-time prediction. Return to the chapter for the underlying methods.