Case Study 1: ShopSmart Customer Segmentation for Targeted A/B Tests
Background
ShopSmart is a mid-market e-commerce platform with 200,000 monthly active customers, $180M in annual GMV, and a marketing team that has been running A/B tests on a one-size-fits-all basis. Every customer sees the same email campaigns, the same homepage promotions, the same cart abandonment nudges. The results have been mediocre: a 2.1% email click-through rate, a 12% homepage conversion rate, and a cart abandonment recovery rate of 6%.
The VP of Marketing, Priya Desai, has a hypothesis: "We are averaging across customers who behave completely differently. A high-value loyalist does not respond to the same incentive as a deal-seeker. If we can segment our customers into coherent groups, we can run targeted A/B tests within each segment and find what actually works."
She is probably right. This case study builds the segmentation.
The Data
ShopSmart's data warehouse contains 18 months of transaction and behavioral data. After feature engineering, the analysis dataset has 200,000 customers and 8 features:
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score, silhouette_samples, calinski_harabasz_score
import matplotlib.pyplot as plt
np.random.seed(42)
n = 200_000
# Simulate three latent customer types with overlap
type_labels = np.random.choice([0, 1, 2], n, p=[0.30, 0.40, 0.30])
shopsmart = pd.DataFrame({
    'annual_spend': np.where(
        type_labels == 0,
        np.random.normal(2400, 500, n),
        np.where(type_labels == 1, np.random.normal(780, 250, n),
                 np.random.normal(140, 70, n))
    ).clip(0).round(2),
    'orders_per_year': np.where(
        type_labels == 0,
        np.random.poisson(18, n),
        np.where(type_labels == 1, np.random.poisson(12, n),
                 np.random.poisson(3, n))
    ),
    'avg_order_value': np.where(
        type_labels == 0,
        np.random.normal(135, 30, n),
        np.where(type_labels == 1, np.random.normal(65, 20, n),
                 np.random.normal(48, 15, n))
    ).clip(5).round(2),
    'sessions_per_month': np.where(
        type_labels == 0,
        np.random.normal(11, 3, n),
        np.where(type_labels == 1, np.random.normal(18, 5, n),
                 np.random.normal(22, 7, n))
    ).clip(1).round(0).astype(int),
    'avg_discount_pct': np.where(
        type_labels == 0,
        np.random.normal(5, 3, n),
        np.where(type_labels == 1, np.random.normal(32, 8, n),
                 np.random.normal(12, 5, n))
    ).clip(0, 60).round(1),
    'items_per_order': np.where(
        type_labels == 0,
        np.random.normal(4.2, 1.1, n),
        np.where(type_labels == 1, np.random.normal(2.8, 0.9, n),
                 np.random.normal(1.3, 0.5, n))
    ).clip(1).round(1),
    'return_rate': np.where(
        type_labels == 0,
        np.random.beta(2, 18, n),
        np.where(type_labels == 1, np.random.beta(3, 12, n),
                 np.random.beta(1, 8, n))
    ).round(3),
    'days_since_last_purchase': np.where(
        type_labels == 0,
        np.random.exponential(12, n),
        np.where(type_labels == 1, np.random.exponential(20, n),
                 np.random.exponential(55, n))
    ).clip(0, 365).round(0).astype(int),
})
print(f"Dataset: {shopsmart.shape[0]:,} customers, {shopsmart.shape[1]} features")
print(shopsmart.describe().round(2).to_string())
Step 1: Feature Scaling
feature_cols = shopsmart.columns.tolist()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(shopsmart[feature_cols])
print("Post-scaling means (should be ~0):", X_scaled.mean(axis=0).round(4))
print("Post-scaling stds (should be ~1):", X_scaled.std(axis=0).round(4))
Step 2: Choosing k
Elbow Method
inertias = []
K_range = range(2, 11)
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
    print(f"k={k}: inertia={km.inertia_:,.0f}")
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
ax.set_xlabel('Number of Clusters (k)')
ax.set_ylabel('Inertia')
ax.set_title('ShopSmart: Elbow Method')
ax.set_xticks(list(K_range))
plt.tight_layout()
plt.show()
Silhouette Analysis
silhouette_scores = []
for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels_k = km.fit_predict(X_scaled)
    sil = silhouette_score(X_scaled, labels_k, sample_size=30_000, random_state=42)
    silhouette_scores.append(sil)
    print(f"k={k}: silhouette={sil:.4f}")
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(K_range, silhouette_scores, 'bo-', linewidth=2, markersize=8)
ax.set_xlabel('Number of Clusters (k)')
ax.set_ylabel('Mean Silhouette Score')
ax.set_title('ShopSmart: Silhouette Analysis')
ax.set_xticks(list(K_range))
plt.tight_layout()
plt.show()
Observation --- The elbow appears at k=3, and the silhouette score peaks at k=3. This is consistent with the three latent customer types we simulated. In a real project, you would not know the true number of segments, but when multiple methods agree, the evidence is stronger.
Silhouette Plots for k=3 vs. k=4
def plot_silhouette(X, labels, k, ax, sample_size=20_000):
    """Per-cluster silhouette plot with optional subsampling for speed."""
    if len(X) > sample_size:
        rng = np.random.RandomState(42)
        idx = rng.choice(len(X), sample_size, replace=False)
        X_sub, labels_sub = X[idx], labels[idx]
    else:
        X_sub, labels_sub = X, labels
    sil_values = silhouette_samples(X_sub, labels_sub)
    y_lower = 10
    for i in range(k):
        cluster_vals = np.sort(sil_values[labels_sub == i])
        size_i = cluster_vals.shape[0]
        y_upper = y_lower + size_i
        ax.fill_betweenx(np.arange(y_lower, y_upper), 0, cluster_vals, alpha=0.7)
        ax.text(-0.05, y_lower + 0.5 * size_i, str(i))
        y_lower = y_upper + 10
    avg = silhouette_score(X_sub, labels_sub)
    ax.axvline(x=avg, color='red', linestyle='--', label=f'Mean: {avg:.3f}')
    ax.set_xlabel('Silhouette Coefficient')
    ax.set_ylabel('Cluster Label')
    ax.legend()
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
for idx, k in enumerate([3, 4]):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels_k = km.fit_predict(X_scaled)
    plot_silhouette(X_scaled, labels_k, k, axes[idx])
    axes[idx].set_title(f'Silhouette Plot (k={k})')
plt.tight_layout()
plt.show()
At k=3, all three clusters have well-shaped blades with few points below zero. At k=4, one cluster is thin and has many negative-silhouette points --- a sign that the fourth cluster is an artifact of splitting a natural group.
Decision: k=3.
Step 3: Final K-Means Clustering
km_final = KMeans(n_clusters=3, random_state=42, n_init=10)
shopsmart['segment'] = km_final.fit_predict(X_scaled)
sil = silhouette_score(X_scaled, shopsmart['segment'], sample_size=30_000, random_state=42)
ch = calinski_harabasz_score(X_scaled, shopsmart['segment'])
print(f"Final clustering: silhouette={sil:.4f}, Calinski-Harabasz={ch:.1f}")
print(f"Cluster sizes:\n{shopsmart['segment'].value_counts().sort_index()}")
Step 4: Cluster Profiling
profile = shopsmart.groupby('segment').agg(
    n=('annual_spend', 'count'),
    pct=('annual_spend', lambda x: f"{100 * len(x) / len(shopsmart):.1f}%"),
    avg_spend=('annual_spend', 'mean'),
    median_spend=('annual_spend', 'median'),
    avg_orders=('orders_per_year', 'mean'),
    avg_aov=('avg_order_value', 'mean'),
    avg_sessions=('sessions_per_month', 'mean'),
    avg_discount=('avg_discount_pct', 'mean'),
    avg_items=('items_per_order', 'mean'),
    avg_return_rate=('return_rate', 'mean'),
    avg_recency=('days_since_last_purchase', 'mean'),
).round(2)
print(profile.to_string())
The profiles tell a clear story:
| Segment | Name | Annual Spend | Orders/Year | Discount % | Sessions/Mo | Recency (days) |
|---|---|---|---|---|---|---|
| 0 | High-Value Loyalists | ~$2,400 | ~18 | ~5% | ~11 | ~12 |
| 1 | Deal-Seekers | ~$780 | ~12 | ~32% | ~18 | ~20 |
| 2 | Window-Shoppers | ~$140 | ~3 | ~12% | ~22 | ~55 |
Key Insight --- Window-Shoppers browse the most (22 sessions/month) but buy the least (3 orders/year, $140 annual spend). They are not disengaged --- they are interested but unconverted. This is the segment with the highest potential ROI from intervention.
Step 5: Segment Validation with DBSCAN
Does the three-segment structure hold under a different algorithm?
from sklearn.neighbors import NearestNeighbors
# k-distance plot
min_samples = 2 * X_scaled.shape[1] # 2 * 8 = 16
nn = NearestNeighbors(n_neighbors=min_samples)
nn.fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_dists = np.sort(distances[:, -1])
fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(k_dists, linewidth=1.0)
ax.set_xlabel('Points (sorted)')
ax.set_ylabel(f'{min_samples}-th Nearest Neighbor Distance')
ax.set_title('k-Distance Plot')
plt.tight_layout()
plt.show()
# DBSCAN with candidate eps from the k-distance plot
for eps in [1.0, 1.3, 1.5, 1.8]:
    db = DBSCAN(eps=eps, min_samples=min_samples)
    labels_db = db.fit_predict(X_scaled)
    n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)
    n_noise = (labels_db == -1).sum()
    pct_noise = 100 * n_noise / len(labels_db)
    mask = labels_db != -1
    if mask.sum() > 1 and len(set(labels_db[mask])) >= 2:
        sil_db = silhouette_score(X_scaled[mask], labels_db[mask],
                                  sample_size=30_000, random_state=42)
    else:
        sil_db = float('nan')
    print(f"eps={eps:.1f}: {n_clusters} clusters, "
          f"{n_noise:,} noise ({pct_noise:.1f}%), silhouette={sil_db:.4f}")
Observation --- DBSCAN finds 3 clusters at certain eps values, broadly confirming the K-Means structure. The noise points (5-15% of customers depending on eps) are the ambiguous cases at cluster boundaries --- customers who share characteristics of multiple segments. In a business context, these are worth examining separately: they may be transitioning between segments.
Step 6: Designing Targeted A/B Tests
The segmentation enables Priya's team to design segment-specific A/B tests:
High-Value Loyalists (Segment 0)
These customers are already valuable. The risk is losing them; the opportunity is increasing their basket size.
loyalists = shopsmart[shopsmart['segment'] == 0]
print("High-Value Loyalists:")
print(f" Count: {len(loyalists):,}")
print(f" Avg spend: ${loyalists['annual_spend'].mean():,.0f}")
print(f" Avg discount: {loyalists['avg_discount_pct'].mean():.1f}%")
print(f" Avg AOV: ${loyalists['avg_order_value'].mean():,.0f}")
A/B test: Early access to new products (Treatment) vs. standard experience (Control). Hypothesis: loyalists are motivated by exclusivity, not discounts. Metric: 90-day average order value.
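Before launching, the test needs to be sized. A minimal power-calculation sketch using the normal approximation for a two-sample comparison of means; the $10 target lift and the $30 standard deviation of 90-day AOV are planning assumptions, not measured values:

```python
import numpy as np
from scipy.stats import norm

def sample_size_per_arm(delta, sigma, alpha=0.05, power=0.80):
    """Per-arm n for a two-sample comparison of means (normal approximation)."""
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided significance
    z_beta = norm.ppf(power)            # desired power
    return int(np.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2))

# Planning assumptions: detect a $10 lift in 90-day AOV against a $30
# standard deviation, at alpha=0.05 and 80% power.
n_per_arm = sample_size_per_arm(delta=10, sigma=30)
print(f"Required customers per arm: {n_per_arm:,}")  # 142
```

With roughly 60,000 loyalists available, a per-arm requirement in the hundreds leaves ample room; the binding constraint is usually the 90-day measurement window, not the sample.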
Deal-Seekers (Segment 1)
High engagement with promotions. They buy frequently but only when incentivized.
deal_seekers = shopsmart[shopsmart['segment'] == 1]
print("Deal-Seekers:")
print(f" Count: {len(deal_seekers):,}")
print(f" Avg spend: ${deal_seekers['annual_spend'].mean():,.0f}")
print(f" Avg discount: {deal_seekers['avg_discount_pct'].mean():.1f}%")
print(f" Avg sessions: {deal_seekers['sessions_per_month'].mean():.1f}")
A/B test: Tiered loyalty rewards (Treatment A: spend $100+, get 10% off next order; Treatment B: standard 30% flash sale) vs. standard promotions (Control). Hypothesis: tiered rewards shift deal-seekers toward higher spend per transaction. Metric: 90-day spend per customer.
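A three-arm test also needs a stable assignment rule so each deal-seeker sees the same treatment on every visit. One common sketch is deterministic hashing of a customer id; the arm names and `salt` here are hypothetical:

```python
import hashlib
import pandas as pd

def assign_arm(customer_id, arms=('control', 'tiered', 'flash_sale'), salt='ds-test-1'):
    """Deterministically map a customer id to one of the test arms."""
    # Salting the hash lets a later test re-randomize the same customers.
    digest = hashlib.md5(f"{salt}:{customer_id}".encode()).hexdigest()
    return arms[int(digest, 16) % len(arms)]

# Hypothetical deal-seeker id range; real ids would come from the warehouse.
customers = pd.DataFrame({'customer_id': range(9000)})
customers['arm'] = customers['customer_id'].map(assign_arm)
print(customers['arm'].value_counts().to_string())
```

The hash gives a near-even split without storing an assignment table, and changing the salt produces an independent randomization for the next test.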
Window-Shoppers (Segment 2)
Browse heavily but rarely convert. The conversion gap is the opportunity.
browsers = shopsmart[shopsmart['segment'] == 2]
print("Window-Shoppers:")
print(f" Count: {len(browsers):,}")
print(f" Avg sessions: {browsers['sessions_per_month'].mean():.1f}")
print(f" Avg orders: {browsers['orders_per_year'].mean():.1f}")
print(f" Avg recency: {browsers['days_since_last_purchase'].mean():.0f} days")
A/B test: Personalized product recommendations based on browse history (Treatment) vs. generic best-seller recommendations (Control). Hypothesis: browsers know what they want but cannot find it. Metric: 30-day conversion rate.
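When this test concludes, the 30-day conversion rates can be compared with a two-proportion z-test. A minimal sketch with illustrative counts (made-up numbers, not results from this case study):

```python
import numpy as np
from scipy.stats import norm

def two_proportion_ztest(x1, n1, x2, n2):
    """Two-sided z-test for a difference in conversion rates (pooled SE)."""
    p1, p2 = x1 / n1, x2 / n2
    p_pool = (x1 + x2) / (n1 + n2)
    se = np.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return z, 2 * norm.sf(abs(z))  # two-sided p-value

# Illustrative outcome: 5,000 browsers per arm, 310 vs 240 conversions
# (6.2% treatment vs 4.8% control).
z, p = two_proportion_ztest(310, 5000, 240, 5000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

With these hypothetical counts the lift clears conventional significance; in practice the team would also report the confidence interval on the rate difference, not just the p-value.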
Step 7: Monitoring Segment Stability
Segments are not static. Customers move between segments over time (a deal-seeker may become a loyalist; a loyalist may lapse into a browser). ShopSmart should re-cluster monthly and track segment migration:
# Simulate a "next month" re-clustering
np.random.seed(99)
noise = np.random.normal(0, 0.15, X_scaled.shape)
X_next_month = X_scaled + noise
labels_next = km_final.predict(X_next_month)
# Migration matrix
from sklearn.metrics import adjusted_rand_score
migration = pd.crosstab(
    shopsmart['segment'], labels_next,
    rownames=['Month 1'], colnames=['Month 2']
)
migration_pct = migration.div(migration.sum(axis=1), axis=0).round(3)
print("Segment Migration Matrix (row = Month 1, col = Month 2):")
print(migration_pct.to_string())
ari = adjusted_rand_score(shopsmart['segment'], labels_next)
print(f"\nAdjusted Rand Index (month-over-month): {ari:.4f}")
Practical Recommendation --- If month-over-month ARI drops below 0.80, the segments are unstable and the A/B test targeting may be stale. Rebuild the segmentation and re-evaluate whether the underlying customer behavior has shifted or whether the features need updating.
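The threshold check can be wrapped in a small helper for the monthly job; the 0.80 cutoff follows the recommendation above:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def check_segment_stability(labels_prev, labels_curr, threshold=0.80):
    """Return (ARI, rebuild_flag); flag a rebuild when agreement drops."""
    ari = adjusted_rand_score(labels_prev, labels_curr)
    return ari, bool(ari < threshold)

# Identical labelings agree perfectly, so no rebuild is flagged.
prev = np.array([0, 0, 1, 1, 2, 2])
ari, rebuild = check_segment_stability(prev, prev)
print(f"ARI = {ari:.2f}, rebuild = {rebuild}")  # ARI = 1.00, rebuild = False
```

Because ARI is invariant to label permutation, a simple renumbering of segments between months does not trigger a false alarm; only genuine reassignment of customers does.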
Results and Business Impact
The segmentation project delivered three actionable outcomes:
- Targeted A/B tests replaced the one-size-fits-all approach. Early results (simulated) show a 40% improvement in email click-through rate for the loyalist segment (exclusivity messaging) and a 25% improvement in conversion rate for the window-shopper segment (personalized recommendations).
- Marketing budget reallocation. The team discovered that 40% of their discount budget was going to loyalists who would have purchased anyway. Redirecting that budget to window-shopper conversion and deal-seeker upselling improved ROI by an estimated 18%.
- Segment-specific KPIs. Instead of tracking one aggregate conversion rate, the team now tracks per-segment metrics: loyalist retention rate, deal-seeker spend-per-transaction, and window-shopper conversion rate. Each metric has a different owner and a different intervention playbook.
Key Lessons
- Clustering is a means, not an end. The value was not in the clusters themselves but in the differentiated A/B tests and budget allocation they enabled.
- Start with K-Means, validate with alternatives. K-Means gave a fast, interpretable result. DBSCAN confirmed the structure and identified boundary cases. Hierarchical clustering (on a subsample) confirmed the three-group structure via the dendrogram.
- Profile clusters on original features, not scaled features. Scaled values are meaningless to business stakeholders. "Cluster 0 has a z-score of 1.4 on annual spend" is useless. "Cluster 0 spends $2,400/year" is actionable.
- Monitor segment stability. A segmentation that was valid last quarter may not be valid today. Re-cluster periodically and track migration.
- Silhouette analysis resolves k ambiguity. The elbow method was suggestive but not definitive. The silhouette plot for k=4 clearly showed an artificial fourth cluster. Multiple methods, applied together, give confidence.
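The hierarchical-clustering check mentioned in the lessons can be sketched as follows. This uses a small synthetic stand-in with three well-separated groups rather than a subsample of `X_scaled`; on the real data you would subsample (e.g. 5,000 of the 200,000 customers) because Ward linkage scales poorly, and `scipy.cluster.hierarchy.dendrogram(Z)` would render the tree itself:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with three well-separated latent groups.
X_demo, y_true = make_blobs(
    n_samples=2000, centers=[[0, 0], [6, 6], [-6, 6]],
    cluster_std=1.0, random_state=42,
)
X_demo = StandardScaler().fit_transform(X_demo)

Z = linkage(X_demo, method='ward')   # dendrogram(Z) would plot the tree
labels_hier = fcluster(Z, t=3, criterion='maxclust')
ari = adjusted_rand_score(y_true, labels_hier)
print(f"Hierarchical vs. latent groups, ARI = {ari:.3f}")
```

A high ARI between the hierarchical cut at three clusters and the K-Means segments is the agreement the lesson describes; a low one would suggest the three-segment structure is algorithm-dependent.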
This case study supports Chapter 20: Clustering. Return to the chapter for the full algorithmic treatment.