Case Study 1: Customer Segmentation with Clustering
Business Context
A mid-sized e-commerce retailer wants to segment its customer base to tailor marketing strategies, personalize product recommendations, and optimize pricing. The marketing team has access to transactional data for 2,000 customers over the past year, but no predefined customer categories exist. This is a classic unsupervised learning problem.
The dataset includes the following features per customer:
| Feature | Description | Range |
|---|---|---|
| annual_spending | Total amount spent in the past year ($) | 50 -- 15,000 |
| purchase_frequency | Number of purchases in the past year | 1 -- 120 |
| avg_order_value | Average order value ($) | 10 -- 500 |
| days_since_last_purchase | Recency of last purchase (days) | 0 -- 365 |
| product_categories_browsed | Number of distinct product categories browsed | 1 -- 25 |
| return_rate | Fraction of purchases returned | 0.0 -- 0.5 |
| customer_tenure_months | Months since account creation | 1 -- 60 |
| support_tickets | Number of support tickets filed | 0 -- 15 |
Objectives
- Discover natural customer segments without prior labels
- Characterize each segment with descriptive profiles
- Compare multiple clustering approaches
- Evaluate clustering quality using internal metrics
- Generate actionable recommendations for the marketing team
Step 1: Data Generation and Exploration
Since we are working with a synthetic dataset for reproducibility, we generate data that mimics realistic customer behavior patterns.
import numpy as np
from sklearn.preprocessing import StandardScaler
def generate_customer_data(
    n_customers: int = 2000,
    random_state: int = 42
) -> tuple[np.ndarray, list[str]]:
    """Generate synthetic customer segmentation data.

    Creates data with approximately 4 natural segments:
    - High-value loyalists
    - Occasional big spenders
    - Frequent budget shoppers
    - At-risk / churning customers

    Args:
        n_customers: Number of customers to generate.
        random_state: Random seed for reproducibility.

    Returns:
        Tuple of (feature_matrix, feature_names).
    """
    rng = np.random.RandomState(random_state)

    # Segment 1: High-value loyalists (~25%)
    n1 = n_customers // 4
    seg1 = np.column_stack([
        rng.normal(8000, 2000, n1),   # annual_spending
        rng.normal(60, 15, n1),       # purchase_frequency
        rng.normal(150, 40, n1),      # avg_order_value
        rng.normal(10, 5, n1),        # days_since_last_purchase
        rng.normal(15, 3, n1),        # product_categories_browsed
        rng.normal(0.05, 0.02, n1),   # return_rate
        rng.normal(36, 10, n1),       # customer_tenure_months
        rng.normal(2, 1, n1),         # support_tickets
    ])

    # Segment 2: Occasional big spenders (~25%)
    n2 = n_customers // 4
    seg2 = np.column_stack([
        rng.normal(4000, 1500, n2),
        rng.normal(10, 5, n2),
        rng.normal(350, 80, n2),
        rng.normal(60, 30, n2),
        rng.normal(5, 2, n2),
        rng.normal(0.08, 0.03, n2),
        rng.normal(24, 12, n2),
        rng.normal(1, 1, n2),
    ])

    # Segment 3: Frequent budget shoppers (~25%)
    n3 = n_customers // 4
    seg3 = np.column_stack([
        rng.normal(2000, 500, n3),
        rng.normal(80, 20, n3),
        rng.normal(30, 10, n3),
        rng.normal(5, 3, n3),
        rng.normal(8, 3, n3),
        rng.normal(0.15, 0.05, n3),
        rng.normal(18, 8, n3),
        rng.normal(5, 2, n3),
    ])

    # Segment 4: At-risk / churning (~25%)
    n4 = n_customers - n1 - n2 - n3
    seg4 = np.column_stack([
        rng.normal(500, 300, n4),
        rng.normal(5, 3, n4),
        rng.normal(80, 30, n4),
        rng.normal(200, 60, n4),
        rng.normal(3, 2, n4),
        rng.normal(0.20, 0.08, n4),
        rng.normal(30, 15, n4),
        rng.normal(8, 3, n4),
    ])

    X = np.vstack([seg1, seg2, seg3, seg4])
    X = np.clip(X, 0, None)  # No negative values

    feature_names = [
        "annual_spending", "purchase_frequency", "avg_order_value",
        "days_since_last_purchase", "product_categories_browsed",
        "return_rate", "customer_tenure_months", "support_tickets"
    ]
    return X, feature_names
See code/case-study-code.py for the complete runnable implementation.
Step 2: Preprocessing
As emphasized throughout Chapters 3 and 7, scaling is essential before applying distance-based algorithms.
X, feature_names = generate_customer_data()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(f"Dataset shape: {X.shape}")
print(f"Features: {feature_names}")
print(f"Scaled mean (should be ~0): {X_scaled.mean(axis=0).round(2)}")
print(f"Scaled std (should be ~1): {X_scaled.std(axis=0).round(2)}")
Step 3: Determining the Number of Clusters
We apply three methods to estimate the optimal number of clusters.
Elbow Method
from sklearn.cluster import KMeans
inertias = []
K_range = range(2, 11)
for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
# Look for the elbow: the point where adding clusters
# yields diminishing returns in inertia reduction
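The elbow is usually judged by eye, but a simple curvature check can back up the visual read. The sketch below is a rough heuristic rather than part of the case-study code: it picks the k where the inertia curve bends most sharply, using the inertias and K_range computed above.
second_diffs = np.diff(inertias, n=2)  # curvature proxy; length len(K_range) - 2
elbow_k = K_range[int(np.argmax(second_diffs)) + 1]  # +1 realigns indices after the double diff
print(f"Heuristic elbow estimate: k = {elbow_k}")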
Silhouette Analysis
from sklearn.metrics import silhouette_score
sil_scores = []
for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    sil_scores.append(silhouette_score(X_scaled, labels))
best_k_sil = K_range[np.argmax(sil_scores)]
print(f"Best k by silhouette: {best_k_sil} (score: {max(sil_scores):.3f})")
BIC with GMM
from sklearn.mixture import GaussianMixture
bics = []
for k in K_range:
    gmm = GaussianMixture(n_components=k, n_init=5, random_state=42)
    gmm.fit(X_scaled)
    bics.append(gmm.bic(X_scaled))
best_k_bic = K_range[np.argmin(bics)]
print(f"Best k by BIC: {best_k_bic}")
Expected Result: All three methods should converge on $k = 4$, matching the four segments we designed into the synthetic data.
Step 4: Applying and Comparing Clustering Algorithms
K-Means
kmeans = KMeans(n_clusters=4, n_init=20, random_state=42)
labels_km = kmeans.fit_predict(X_scaled)
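Because generate_customer_data stacks the four designed segments in equal blocks, we can sanity-check how well K-means recovers them. The snippet below is a convenience check rather than part of the marketing pipeline; it assumes the default n_customers=2000 (500 customers per segment) and uses the adjusted Rand index.
from sklearn.metrics import adjusted_rand_score

# Designed labels: four equal blocks of 500 customers, in generation order
true_segments = np.repeat([0, 1, 2, 3], 500)
ari = adjusted_rand_score(true_segments, labels_km)
print(f"Adjusted Rand index vs. designed segments: {ari:.3f}")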
Gaussian Mixture Model
gmm = GaussianMixture(n_components=4, covariance_type="full", n_init=10, random_state=42)
labels_gmm = gmm.fit_predict(X_scaled)
probs_gmm = gmm.predict_proba(X_scaled)
Agglomerative Clustering
from sklearn.cluster import AgglomerativeClustering
agg = AgglomerativeClustering(n_clusters=4, linkage="ward")
labels_agg = agg.fit_predict(X_scaled)
Evaluation Comparison
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score
)

algorithms = {
    "K-Means": labels_km,
    "GMM": labels_gmm,
    "Agglomerative": labels_agg,
}

for name, labels in algorithms.items():
    sil = silhouette_score(X_scaled, labels)
    ch = calinski_harabasz_score(X_scaled, labels)
    db = davies_bouldin_score(X_scaled, labels)
    print(f"{name:15s} | Silhouette: {sil:.3f} | CH: {ch:.1f} | DB: {db:.3f}")
Step 5: Visualizing Segments with PCA
To visualize the 8-dimensional customer data, we project it to 2D using PCA.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(f"Variance explained by 2 components: {pca.explained_variance_ratio_.sum():.2%}")
# Plot X_2d colored by cluster labels from each algorithm
# See code/case-study-code.py for complete visualization
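A minimal plotting sketch with matplotlib, colored here by the K-means labels from Step 4 (the complete multi-algorithm figure lives in code/case-study-code.py):
import matplotlib.pyplot as plt

# Scatter the 2D PCA projection, one color per K-means cluster
fig, ax = plt.subplots(figsize=(7, 5))
scatter = ax.scatter(X_2d[:, 0], X_2d[:, 1], c=labels_km, cmap="tab10", s=10, alpha=0.7)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_title("Customer segments (K-means) in PCA space")
fig.colorbar(scatter, ax=ax, label="Cluster")
plt.show()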
Step 6: Profiling the Segments
The most important step for business value: characterizing what each segment looks like.
def profile_segments(
    X: np.ndarray,
    labels: np.ndarray,
    feature_names: list[str]
) -> None:
    """Print descriptive statistics for each cluster segment.

    Args:
        X: Original (unscaled) feature matrix.
        labels: Cluster labels.
        feature_names: Names of each feature.
    """
    for cluster_id in np.unique(labels):
        mask = labels == cluster_id
        cluster_data = X[mask]
        print(f"\n--- Segment {cluster_id} ({mask.sum()} customers, "
              f"{mask.sum() / len(labels):.1%}) ---")
        for j, name in enumerate(feature_names):
            mean_val = cluster_data[:, j].mean()
            std_val = cluster_data[:, j].std()
            print(f"  {name:35s}: {mean_val:8.1f} (+/- {std_val:.1f})")
Expected Segment Profiles:
| Segment | Label | Spending | Frequency | Recency | Description |
|---|---|---|---|---|---|
| 0 | High-value loyalist | High | High | Low | Core customers; reward and retain |
| 1 | Occasional big spender | Medium | Low | Medium | Upsell frequency; personalize recommendations |
| 2 | Frequent budget shopper | Low | Very high | Very low | Cross-sell; premium product exposure |
| 3 | At-risk / churning | Very low | Very low | High | Re-engagement campaigns; win-back offers |
Step 7: Actionable Recommendations
Based on the discovered segments, we recommend the following marketing strategies:
- High-value loyalists: Loyalty programs, early access to new products, personalized thank-you notes. High retention priority.
- Occasional big spenders: Targeted campaigns around their preferred product categories. Reminder emails. Incentivize more frequent purchases with time-limited offers.
- Frequent budget shoppers: Bundle deals, volume discounts. Gradually introduce higher-margin products. Leverage their high engagement.
- At-risk / churning: Win-back email campaigns, special discount codes, satisfaction surveys to identify pain points. Monitor support ticket trends.
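One lightweight way to operationalize these recommendations is to map each cluster ID to a campaign and tag every customer. The sketch below is illustrative only: K-means numbers its clusters arbitrarily, so the ID-to-persona mapping must first be verified against the segment profiles from Step 6.
# Hypothetical ID-to-campaign mapping; verify against the segment profiles
# before use, since cluster numbering varies from run to run
campaign_by_segment = {
    0: "loyalty_program",
    1: "frequency_upsell",
    2: "bundle_cross_sell",
    3: "win_back",
}
customer_campaigns = [campaign_by_segment[label] for label in labels_km]
print(customer_campaigns[:5])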
Key Lessons from This Case Study
- Multiple methods for choosing $k$: The elbow method, silhouette analysis, and BIC/AIC often agree but can differ. Use all three and apply domain judgment.
- Algorithm comparison matters: Different algorithms can find slightly different segmentations. K-means is a good baseline, but GMMs provide richer information through soft assignments.
- Scaling is non-negotiable: Without standardization, features like annual_spending (range: 50--15,000) would dominate over return_rate (range: 0--0.5).
- Visualization aids interpretation: PCA projections to 2D provide a sanity check on whether clusters are well-separated.
- Business context drives interpretation: The numbers alone are not actionable. Translating segments into personas and strategies is where unsupervised learning creates business value.
- Soft assignments from GMMs add nuance: A customer with 60% probability of being a "loyalist" and 40% probability of being a "big spender" might receive messaging from both campaigns.
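As an illustration of that last point, the GMM posterior probabilities from Step 4 can flag customers whose segment membership is genuinely mixed. A sketch assuming probs_gmm from Step 4; the 0.8 cutoff is an arbitrary choice:
# Customers with no dominant segment are candidates for multi-segment messaging
max_prob = probs_gmm.max(axis=1)
ambiguous = np.where(max_prob < 0.8)[0]
print(f"{len(ambiguous)} customers have no dominant segment (max prob < 0.8)")
print(probs_gmm[ambiguous[:3]].round(2))  # a few example probability rows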