Case Study 1: Customer Segmentation with Clustering
Business Context
A mid-sized e-commerce retailer wants to segment its customer base to tailor marketing strategies, personalize product recommendations, and optimize pricing. The marketing team has access to transactional data for 2,000 customers over the past year, but no predefined customer categories exist. This is a classic unsupervised learning problem.
The dataset includes the following features per customer:
| Feature | Description | Range |
|---|---|---|
| annual_spending | Total amount spent in the past year ($) | 50 -- 15,000 |
| purchase_frequency | Number of purchases in the past year | 1 -- 120 |
| avg_order_value | Average order value ($) | 10 -- 500 |
| days_since_last_purchase | Recency of last purchase (days) | 0 -- 365 |
| product_categories_browsed | Number of distinct product categories browsed | 1 -- 25 |
| return_rate | Fraction of purchases returned | 0.0 -- 0.5 |
| customer_tenure_months | Months since account creation | 1 -- 60 |
| support_tickets | Number of support tickets filed | 0 -- 15 |
Objectives
- Discover natural customer segments without prior labels
- Characterize each segment with descriptive profiles
- Compare multiple clustering approaches
- Evaluate clustering quality using internal metrics
- Generate actionable recommendations for the marketing team
Step 1: Data Generation and Exploration
Since we are working with a synthetic dataset for reproducibility, we generate data that mimics realistic customer behavior patterns.
import numpy as np
from sklearn.preprocessing import StandardScaler
def generate_customer_data(
    n_customers: int = 2000,
    random_state: int = 42
) -> tuple[np.ndarray, list[str]]:
    """Generate synthetic customer segmentation data.

    Creates data with approximately 4 natural segments:
    - High-value loyalists
    - Occasional big spenders
    - Frequent budget shoppers
    - At-risk / churning customers

    Args:
        n_customers: Number of customers to generate.
        random_state: Random seed for reproducibility.

    Returns:
        Tuple of (feature_matrix, feature_names).
    """
    rng = np.random.RandomState(random_state)

    # Segment 1: High-value loyalists (~25%)
    n1 = n_customers // 4
    seg1 = np.column_stack([
        rng.normal(8000, 2000, n1),   # annual_spending
        rng.normal(60, 15, n1),       # purchase_frequency
        rng.normal(150, 40, n1),      # avg_order_value
        rng.normal(10, 5, n1),        # days_since_last_purchase
        rng.normal(15, 3, n1),        # product_categories_browsed
        rng.normal(0.05, 0.02, n1),   # return_rate
        rng.normal(36, 10, n1),       # customer_tenure_months
        rng.normal(2, 1, n1),         # support_tickets
    ])

    # Segment 2: Occasional big spenders (~25%)
    n2 = n_customers // 4
    seg2 = np.column_stack([
        rng.normal(4000, 1500, n2),
        rng.normal(10, 5, n2),
        rng.normal(350, 80, n2),
        rng.normal(60, 30, n2),
        rng.normal(5, 2, n2),
        rng.normal(0.08, 0.03, n2),
        rng.normal(24, 12, n2),
        rng.normal(1, 1, n2),
    ])

    # Segment 3: Frequent budget shoppers (~25%)
    n3 = n_customers // 4
    seg3 = np.column_stack([
        rng.normal(2000, 500, n3),
        rng.normal(80, 20, n3),
        rng.normal(30, 10, n3),
        rng.normal(5, 3, n3),
        rng.normal(8, 3, n3),
        rng.normal(0.15, 0.05, n3),
        rng.normal(18, 8, n3),
        rng.normal(5, 2, n3),
    ])

    # Segment 4: At-risk / churning (~25%)
    n4 = n_customers - n1 - n2 - n3
    seg4 = np.column_stack([
        rng.normal(500, 300, n4),
        rng.normal(5, 3, n4),
        rng.normal(80, 30, n4),
        rng.normal(200, 60, n4),
        rng.normal(3, 2, n4),
        rng.normal(0.20, 0.08, n4),
        rng.normal(30, 15, n4),
        rng.normal(8, 3, n4),
    ])

    X = np.vstack([seg1, seg2, seg3, seg4])
    X = np.clip(X, 0, None)  # No negative values

    feature_names = [
        "annual_spending", "purchase_frequency", "avg_order_value",
        "days_since_last_purchase", "product_categories_browsed",
        "return_rate", "customer_tenure_months", "support_tickets"
    ]
    return X, feature_names
See code/case-study-code.py for the complete runnable implementation.
Step 2: Preprocessing
As emphasized throughout Chapters 3 and 7, scaling is essential before applying distance-based algorithms.
X, feature_names = generate_customer_data()
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(f"Dataset shape: {X.shape}")
print(f"Features: {feature_names}")
print(f"Scaled mean (should be ~0): {X_scaled.mean(axis=0).round(2)}")
print(f"Scaled std (should be ~1): {X_scaled.std(axis=0).round(2)}")
Step 3: Determining the Number of Clusters
We apply three methods to estimate the optimal number of clusters.
Elbow Method
from sklearn.cluster import KMeans
inertias = []
K_range = range(2, 11)
for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    km.fit(X_scaled)
    inertias.append(km.inertia_)
# Look for the elbow: the point where adding clusters
# yields diminishing returns in inertia reduction
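The elbow is usually judged by eye, but a simple curvature check can back up the visual read. The sketch below is a rough heuristic rather than part of the case-study code: it picks the k where the inertia curve bends most sharply, using the inertias and K_range computed above.
second_diffs = np.diff(inertias, n=2)  # curvature proxy; length len(K_range) - 2
elbow_k = K_range[int(np.argmax(second_diffs)) + 1]  # +1 realigns indices after the double diff
print(f"Heuristic elbow estimate: k = {elbow_k}")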
Silhouette Analysis
from sklearn.metrics import silhouette_score
sil_scores = []
for k in K_range:
    km = KMeans(n_clusters=k, n_init=10, random_state=42)
    labels = km.fit_predict(X_scaled)
    sil_scores.append(silhouette_score(X_scaled, labels))
best_k_sil = K_range[np.argmax(sil_scores)]
print(f"Best k by silhouette: {best_k_sil} (score: {max(sil_scores):.3f})")
BIC with GMM
from sklearn.mixture import GaussianMixture
bics = []
for k in K_range:
    gmm = GaussianMixture(n_components=k, n_init=5, random_state=42)
    gmm.fit(X_scaled)
    bics.append(gmm.bic(X_scaled))
best_k_bic = K_range[np.argmin(bics)]
print(f"Best k by BIC: {best_k_bic}")
Expected Result: All three methods should converge on $k = 4$, matching the four segments we designed into the synthetic data.
Step 4: Applying and Comparing Clustering Algorithms
K-Means
kmeans = KMeans(n_clusters=4, n_init=20, random_state=42)
labels_km = kmeans.fit_predict(X_scaled)
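Because generate_customer_data stacks the four designed segments in equal blocks, we can sanity-check how well K-means recovers them. The snippet below is a convenience check rather than part of the marketing pipeline; it assumes the default n_customers=2000 (500 customers per segment) and uses the adjusted Rand index.
from sklearn.metrics import adjusted_rand_score

# Designed labels: four equal blocks of 500 customers, in generation order
true_segments = np.repeat([0, 1, 2, 3], 500)
ari = adjusted_rand_score(true_segments, labels_km)
print(f"Adjusted Rand index vs. designed segments: {ari:.3f}")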
Gaussian Mixture Model
gmm = GaussianMixture(n_components=4, covariance_type="full", n_init=10, random_state=42)
labels_gmm = gmm.fit_predict(X_scaled)
probs_gmm = gmm.predict_proba(X_scaled)
Agglomerative Clustering
from sklearn.cluster import AgglomerativeClustering
agg = AgglomerativeClustering(n_clusters=4, linkage="ward")
labels_agg = agg.fit_predict(X_scaled)
Evaluation Comparison
from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score
)

algorithms = {
    "K-Means": labels_km,
    "GMM": labels_gmm,
    "Agglomerative": labels_agg,
}

for name, labels in algorithms.items():
    sil = silhouette_score(X_scaled, labels)
    ch = calinski_harabasz_score(X_scaled, labels)
    db = davies_bouldin_score(X_scaled, labels)
    print(f"{name:15s} | Silhouette: {sil:.3f} | CH: {ch:.1f} | DB: {db:.3f}")
Step 5: Visualizing Segments with PCA
To visualize the 8-dimensional customer data, we project it to 2D using PCA.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)
print(f"Variance explained by 2 components: {pca.explained_variance_ratio_.sum():.2%}")
# Plot X_2d colored by cluster labels from each algorithm
# See code/case-study-code.py for complete visualization
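A minimal plotting sketch with matplotlib, colored here by the K-means labels from Step 4 (the complete multi-algorithm figure lives in code/case-study-code.py):
import matplotlib.pyplot as plt

# Scatter the 2D PCA projection, one color per K-means cluster
fig, ax = plt.subplots(figsize=(7, 5))
scatter = ax.scatter(X_2d[:, 0], X_2d[:, 1], c=labels_km, cmap="tab10", s=10, alpha=0.7)
ax.set_xlabel("PC 1")
ax.set_ylabel("PC 2")
ax.set_title("Customer segments (K-means) in PCA space")
fig.colorbar(scatter, ax=ax, label="Cluster")
plt.show()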
Step 6: Profiling the Segments
The most important step for business value: characterizing what each segment looks like.
def profile_segments(
    X: np.ndarray,
    labels: np.ndarray,
    feature_names: list[str]
) -> None:
    """Print descriptive statistics for each cluster segment.

    Args:
        X: Original (unscaled) feature matrix.
        labels: Cluster labels.
        feature_names: Names of each feature.
    """
    for cluster_id in np.unique(labels):
        mask = labels == cluster_id
        cluster_data = X[mask]
        print(f"\n--- Segment {cluster_id} ({mask.sum()} customers, "
              f"{mask.sum() / len(labels):.1%}) ---")
        for j, name in enumerate(feature_names):
            mean_val = cluster_data[:, j].mean()
            std_val = cluster_data[:, j].std()
            print(f"  {name:35s}: {mean_val:8.1f} (+/- {std_val:.1f})")
Expected Segment Profiles:
| Segment | Label | Spending | Frequency | Recency | Description |
|---|---|---|---|---|---|
| 0 | High-value loyalist | High | High | Low | Core customers; reward and retain |
| 1 | Occasional big spender | Medium | Low | Medium | Upsell frequency; personalize recommendations |
| 2 | Frequent budget shopper | Low | Very high | Very low | Cross-sell; premium product exposure |
| 3 | At-risk / churning | Very low | Very low | High | Re-engagement campaigns; win-back offers |
Step 7: Actionable Recommendations
Based on the discovered segments, we recommend the following marketing strategies:
- High-value loyalists: Loyalty programs, early access to new products, personalized thank-you notes. High retention priority.
- Occasional big spenders: Targeted campaigns around their preferred product categories. Reminder emails. Incentivize more frequent purchases with time-limited offers.
- Frequent budget shoppers: Bundle deals, volume discounts. Gradually introduce higher-margin products. Leverage their high engagement.
- At-risk / churning: Win-back email campaigns, special discount codes, satisfaction surveys to identify pain points. Monitor support ticket trends.
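One lightweight way to operationalize these recommendations is to map each cluster ID to a campaign and tag every customer. The sketch below is illustrative only: K-means numbers its clusters arbitrarily, so the ID-to-persona mapping must first be verified against the segment profiles from Step 6.
# Hypothetical ID-to-campaign mapping; verify against the segment profiles
# before use, since cluster numbering varies from run to run
campaign_by_segment = {
    0: "loyalty_program",
    1: "frequency_upsell",
    2: "bundle_cross_sell",
    3: "win_back",
}
customer_campaigns = [campaign_by_segment[label] for label in labels_km]
print(customer_campaigns[:5])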
Key Lessons from This Case Study
- Multiple methods for choosing $k$: The elbow method, silhouette analysis, and BIC/AIC often agree but can differ. Use all three and apply domain judgment.
- Algorithm comparison matters: Different algorithms can find slightly different segmentations. K-means is a good baseline, but GMMs provide richer information through soft assignments.
- Scaling is non-negotiable: Without standardization, features like annual_spending (range: 50--15,000) would dominate over return_rate (range: 0--0.5).
- Visualization aids interpretation: PCA projections to 2D provide a sanity check on whether clusters are well-separated.
- Business context drives interpretation: The numbers alone are not actionable. Translating segments into personas and strategies is where unsupervised learning creates business value.
- Soft assignments from GMMs add nuance: A customer with 60% probability of being a "loyalist" and 40% probability of being a "big spender" might receive messaging from both campaigns.
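As an illustration of that last point, the GMM posterior probabilities from Step 4 can flag customers whose segment membership is genuinely mixed. A sketch assuming probs_gmm from Step 4; the 0.8 cutoff is an arbitrary choice:
# Customers with no dominant segment are candidates for multi-segment messaging
max_prob = probs_gmm.max(axis=1)
ambiguous = np.where(max_prob < 0.8)[0]
print(f"{len(ambiguous)} customers have no dominant segment (max prob < 0.8)")
print(probs_gmm[ambiguous[:3]].round(2))  # a few example probability rows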