
Chapter 20: Clustering

K-Means, DBSCAN, and Hierarchical Clustering


Learning Objectives

By the end of this chapter, you will be able to:

  1. Implement K-Means and understand its assumptions and failure modes
  2. Choose k using the elbow method, silhouette analysis, and gap statistic
  3. Apply DBSCAN for density-based clustering and noise detection
  4. Use hierarchical clustering with dendrograms for exploratory analysis
  5. Evaluate clusters without ground truth labels

Clustering Is Exploratory --- There Is No Right Answer

Core Principle --- There is no "correct" clustering of a dataset. There are only useful clusterings and less useful ones. This is the hardest lesson in unsupervised learning, and most practitioners learn it too late.

In supervised learning, you had a safety net: labeled data, a target variable, cross-validation scores, confusion matrices. You could measure whether your model was right. That safety net is gone now. A clustering algorithm does not learn a mapping from inputs to outputs. It imposes structure on unlabeled data, and whether that structure is meaningful depends entirely on the question you are asking.

Here is what this means in practice. ShopSmart, an e-commerce platform with 200,000 active customers, wants to segment its customer base for targeted A/B tests. The marketing team believes there are "high-value loyalists," "deal-seekers," and "browsers." Are there? Maybe. Or maybe there are five segments. Or twelve. Or the data is a continuous gradient with no natural clusters at all. K-Means will find three clusters if you ask for three --- even if the true structure is completely different. It will find ten clusters if you ask for ten. The algorithm always returns an answer. Your job is to determine whether the answer is useful.

This chapter teaches three clustering algorithms --- K-Means, DBSCAN, and hierarchical clustering --- because no single algorithm handles all data geometries. Then it teaches you how to evaluate clusters without ground truth labels, which is the skill that separates clustering practitioners who produce insight from those who produce noise.

| Algorithm | Cluster Shape | Handles Noise? | Needs k? | Scales to 100k+ Rows? |
|---|---|---|---|---|
| K-Means | Spherical, equal variance | No | Yes | Yes (fast) |
| DBSCAN | Arbitrary shape | Yes (noise label) | No (needs eps, min_samples) | Yes (with indexing) |
| Hierarchical (Agglomerative) | Depends on linkage | No | Yes (cut the dendrogram) | Moderate ($O(n^2)$ memory) |

By the end of this chapter, you will know which algorithm to reach for, how to tune it, how to evaluate the result, and --- crucially --- when to admit that your data does not cluster well at all.


Part 1: K-Means --- Simple, Fast, and Wrong About 30% of the Time

How K-Means Works

K-Means is the simplest clustering algorithm that is actually useful. The algorithm:

  1. Choose k initial centroids (cluster centers).
  2. Assign each point to the nearest centroid.
  3. Recompute each centroid as the mean of its assigned points.
  4. Repeat steps 2-3 until assignments stop changing (or a maximum iteration count is reached).

That is the entire algorithm. It minimizes the inertia --- the sum of squared distances from each point to its assigned centroid:

$$\text{Inertia} = \sum_{i=1}^{n} \| x_i - \mu_{c(i)} \|^2$$

where $\mu_{c(i)}$ is the centroid of the cluster assigned to observation $x_i$.
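The four steps above translate almost directly into NumPy. The following is a minimal sketch for intuition (the helper name lloyd_kmeans is ours; it omits the initialization tricks and optimizations that scikit-learn applies):

```python
import numpy as np
from sklearn.datasets import make_blobs

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: assign points, recompute means, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its assigned points
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 4: stop once the centroids (hence assignments) no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    inertia = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, inertia

X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=1.0, random_state=42)
labels, centroids, inertia = lloyd_kmeans(X_demo, k=3)
print(f"Converged inertia: {inertia:.1f}")
```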

Let us see it in action on a dataset where it works well:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_blobs

# Generate data where K-Means assumptions hold:
# 3 spherical, equal-variance clusters
X_blobs, y_true = make_blobs(
    n_samples=1500, centers=3, cluster_std=1.0, random_state=42
)

kmeans = KMeans(n_clusters=3, random_state=42, n_init=10)
labels = kmeans.fit_predict(X_blobs)

fig, axes = plt.subplots(1, 2, figsize=(12, 5))

# Ground truth
axes[0].scatter(X_blobs[:, 0], X_blobs[:, 1], c=y_true, cmap='viridis', s=8, alpha=0.6)
axes[0].set_title('Ground Truth')

# K-Means result
axes[1].scatter(X_blobs[:, 0], X_blobs[:, 1], c=labels, cmap='viridis', s=8, alpha=0.6)
axes[1].scatter(
    kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1],
    c='red', marker='X', s=200, edgecolors='black', linewidths=1.5
)
axes[1].set_title('K-Means (k=3)')
plt.tight_layout()
plt.show()

On well-separated spherical clusters, K-Means is nearly perfect. The centroids (red X markers) land at the center of each cluster, and every point is assigned correctly. This is K-Means at its best.

K-Means Initialization: Why K-Means++ Matters

The original K-Means algorithm initializes centroids randomly, which creates a problem: bad initialization can produce bad clusters that never recover.

# Demonstrate initialization sensitivity
fig, axes = plt.subplots(1, 3, figsize=(15, 4))

for i, seed in enumerate([0, 7, 42]):
    km = KMeans(n_clusters=3, init='random', n_init=1, random_state=seed)
    labels_i = km.fit_predict(X_blobs)
    axes[i].scatter(X_blobs[:, 0], X_blobs[:, 1], c=labels_i, cmap='viridis', s=8, alpha=0.6)
    axes[i].scatter(
        km.cluster_centers_[:, 0], km.cluster_centers_[:, 1],
        c='red', marker='X', s=200, edgecolors='black', linewidths=1.5
    )
    axes[i].set_title(f'Random init (seed={seed}), Inertia={km.inertia_:.0f}')

plt.tight_layout()
plt.show()

K-Means++ (Arthur and Vassilvitskii, 2007) solves this by spreading initial centroids apart. It selects the first centroid uniformly at random, then chooses each subsequent centroid with probability proportional to its squared distance from the nearest already-chosen centroid. This guarantees that the expected inertia is within a factor of $O(\log k)$ of the optimum.
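The seeding rule translates into a few lines of NumPy. This is a sketch for intuition (kmeanspp_init is our illustrative name; scikit-learn's built-in version adds greedy candidate sampling for speed):

```python
import numpy as np
from sklearn.datasets import make_blobs

def kmeanspp_init(X, k, rng):
    """k-means++ seeding: first centroid uniform, the rest D^2-weighted."""
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        # squared distance from each point to its nearest chosen centroid
        d2 = ((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # sample the next centroid with probability proportional to d^2,
        # so far-away points are strongly favored
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)

X_demo, _ = make_blobs(n_samples=500, centers=3, random_state=42)
rng = np.random.default_rng(0)
C = kmeanspp_init(X_demo, k=3, rng=rng)
print(C)
```

Note that once a point is chosen, its distance to the nearest centroid is zero, so it can never be selected twice.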

Practical Rule --- Always use init='k-means++' (the scikit-learn default). The n_init=10 parameter runs the full algorithm 10 times with different initializations and keeps the best result; do not reduce it below 10 unless you have a specific reason. Note that scikit-learn 1.4+ defaults to n_init='auto', which runs a single k-means++ initialization, so set n_init=10 explicitly if you want the extra robustness.

The Assumptions K-Means Makes (and Breaks)

K-Means assumes:

  1. Clusters are spherical. Each cluster is roughly a ball in feature space. Elongated, ring-shaped, or crescent-shaped clusters will be split incorrectly.
  2. Clusters have similar variance. A tight cluster next to a diffuse cluster confuses K-Means because the large cluster's points are farther from its centroid and can get "stolen" by the small cluster.
  3. Clusters have roughly equal size. K-Means prefers balanced partitions. It will sometimes split a large cluster and merge two small ones.
  4. The number of clusters k is known. K-Means requires you to specify k in advance.

When any of these assumptions is violated, K-Means still runs. It still returns labels. It just returns wrong labels, and it does not tell you they are wrong.

from sklearn.datasets import make_moons, make_circles

fig, axes = plt.subplots(2, 3, figsize=(15, 9))

# --- Failure mode 1: Non-spherical clusters ---
X_moons, y_moons = make_moons(n_samples=1000, noise=0.08, random_state=42)
km_moons = KMeans(n_clusters=2, random_state=42, n_init=10)

axes[0, 0].scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, cmap='viridis', s=8)
axes[0, 0].set_title('Moons: Ground Truth')
axes[0, 1].scatter(X_moons[:, 0], X_moons[:, 1], c=km_moons.fit_predict(X_moons), cmap='viridis', s=8)
axes[0, 1].set_title('Moons: K-Means (fails)')

# --- Failure mode 2: Concentric circles ---
X_circles, y_circles = make_circles(n_samples=1000, noise=0.05, factor=0.5, random_state=42)
km_circles = KMeans(n_clusters=2, random_state=42, n_init=10)

axes[1, 0].scatter(X_circles[:, 0], X_circles[:, 1], c=y_circles, cmap='viridis', s=8)
axes[1, 0].set_title('Circles: Ground Truth')
axes[1, 1].scatter(X_circles[:, 0], X_circles[:, 1], c=km_circles.fit_predict(X_circles), cmap='viridis', s=8)
axes[1, 1].set_title('Circles: K-Means (fails)')

# --- Failure mode 3: Unequal variance ---
X_unequal, y_unequal = make_blobs(
    n_samples=[100, 1500, 100], centers=[[-5, 0], [0, 0], [5, 0]],
    cluster_std=[0.5, 3.0, 0.5], random_state=42
)
km_unequal = KMeans(n_clusters=3, random_state=42, n_init=10)

axes[0, 2].scatter(X_unequal[:, 0], X_unequal[:, 1], c=y_unequal, cmap='viridis', s=8)
axes[0, 2].set_title('Unequal Variance: Truth')
axes[1, 2].scatter(X_unequal[:, 0], X_unequal[:, 1], c=km_unequal.fit_predict(X_unequal), cmap='viridis', s=8)
axes[1, 2].set_title('Unequal Variance: K-Means (fails)')

plt.tight_layout()
plt.show()

This visualization is the most important in the chapter. K-Means fails silently on non-spherical clusters (moons), nested clusters (circles), and unequal-variance clusters. It does not raise an exception. It does not warn you. It returns confident labels that happen to be wrong.

Feature Scaling Is Non-Negotiable

K-Means uses Euclidean distance. If one feature is measured in thousands (annual spend) and another in single digits (number of sessions), the high-magnitude feature dominates the distance calculation and the clustering is effectively one-dimensional.

# Simulate unscaled ShopSmart data
np.random.seed(42)
n = 5000

shopsmart_raw = pd.DataFrame({
    'annual_spend': np.concatenate([
        np.random.normal(2500, 400, 1500),   # high-value
        np.random.normal(800, 200, 2000),     # deal-seekers
        np.random.normal(150, 80, 1500),      # browsers
    ]),
    'sessions_per_month': np.concatenate([
        np.random.normal(12, 3, 1500),
        np.random.normal(18, 4, 2000),
        np.random.normal(25, 5, 1500),
    ]),
    'avg_discount_pct': np.concatenate([
        np.random.normal(5, 2, 1500),
        np.random.normal(35, 8, 2000),
        np.random.normal(10, 4, 1500),
    ]),
    'items_per_order': np.concatenate([
        np.random.normal(4.5, 1.2, 1500),
        np.random.normal(2.8, 0.8, 2000),
        np.random.normal(1.2, 0.4, 1500),
    ]),
})

# Without scaling --- clustering dominated by annual_spend
km_unscaled = KMeans(n_clusters=3, random_state=42, n_init=10)
labels_unscaled = km_unscaled.fit_predict(shopsmart_raw)

# With scaling --- all features contribute equally
scaler = StandardScaler()
shopsmart_scaled = scaler.fit_transform(shopsmart_raw)
km_scaled = KMeans(n_clusters=3, random_state=42, n_init=10)
labels_scaled = km_scaled.fit_predict(shopsmart_scaled)

print("Cluster sizes (unscaled):", np.bincount(labels_unscaled))
print("Cluster sizes (scaled):  ", np.bincount(labels_scaled))

Practical Rule --- Always scale features before K-Means. Use StandardScaler for continuous features. If you have binary or categorical features, consider separate treatment (Gower distance, k-prototypes) or encode them before scaling.

Mini-Batch K-Means for Large Datasets

Standard K-Means recomputes all n distances at every iteration. For datasets above 100,000 rows, this becomes slow. Mini-Batch K-Means (Sculley, 2010) updates centroids using random mini-batches instead of the full dataset, trading a small amount of accuracy for significant speed improvement.

from sklearn.cluster import MiniBatchKMeans
import time

# Generate a large dataset
X_large, _ = make_blobs(n_samples=500_000, centers=8, random_state=42)

# Standard K-Means
start = time.time()
km_full = KMeans(n_clusters=8, random_state=42, n_init=3)
km_full.fit(X_large)
time_full = time.time() - start

# Mini-Batch K-Means
start = time.time()
km_mini = MiniBatchKMeans(n_clusters=8, random_state=42, batch_size=1024, n_init=3)
km_mini.fit(X_large)
time_mini = time.time() - start

print(f"K-Means:           {time_full:.2f}s, inertia={km_full.inertia_:.0f}")
print(f"Mini-Batch K-Means: {time_mini:.2f}s, inertia={km_mini.inertia_:.0f}")

Mini-Batch K-Means typically runs 3-10x faster with inertia within 1-3% of the full algorithm. Use it whenever your dataset exceeds 50,000 rows and you need to iterate quickly on k.


Part 2: Choosing k --- The Problem That Never Goes Away

The most common question in any clustering project: "How many clusters should I use?" There is no single correct answer, but three methods help you make an informed choice.

The Elbow Method

The elbow method plots inertia against k and looks for the "elbow" --- the point where adding another cluster yields diminishing returns.

inertias = []
K_range = range(2, 11)

for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    km.fit(shopsmart_scaled)
    inertias.append(km.inertia_)

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(K_range, inertias, 'bo-', linewidth=2, markersize=8)
ax.set_xlabel('Number of Clusters (k)')
ax.set_ylabel('Inertia (Within-Cluster Sum of Squares)')
ax.set_title('Elbow Method')
ax.set_xticks(list(K_range))
plt.tight_layout()
plt.show()

Limitation --- The elbow is often ambiguous. Real-world data rarely produces a sharp bend. When three people look at the same elbow plot and pick k=3, k=4, and k=5, all three are arguably correct. Use the elbow method as a starting point, not a final answer.
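If you want a programmatic starting point rather than eyeballing, one common heuristic picks the point on the inertia curve farthest from the straight line joining its endpoints (the idea behind the "kneedle" method). A sketch, with our own helper name; treat its answer as a suggestion, not a verdict:

```python
import numpy as np

def elbow_by_chord(ks, inertias):
    """Return the k whose point lies farthest from the line joining the
    curve's endpoints -- a rough automation of eyeballing the elbow."""
    ks = np.asarray(ks, dtype=float)
    ys = np.asarray(inertias, dtype=float)
    # normalize both axes so the perpendicular distance is scale-free
    x = (ks - ks[0]) / (ks[-1] - ks[0])
    y = (ys - ys[-1]) / (ys[0] - ys[-1])
    # the chord runs from (0, 1) to (1, 0), i.e. the line x + y = 1
    dist = np.abs(x + y - 1) / np.sqrt(2)
    return int(ks[np.argmax(dist)])

# A synthetic inertia curve with a visible bend at k=3
print(elbow_by_chord(range(1, 7), [100, 40, 20, 15, 12, 10]))  # prints 3
```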

Silhouette Analysis

The silhouette score for a single observation measures how similar it is to its own cluster versus the nearest other cluster:

$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$

where $a(i)$ is the mean distance to other points in the same cluster and $b(i)$ is the mean distance to points in the nearest neighboring cluster. The score ranges from -1 (wrong cluster) to +1 (well-separated).
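The definition is easy to verify by hand on a toy example (four points on a line, clustered as two pairs), and the manual result matches what silhouette_samples computes:

```python
import numpy as np
from sklearn.metrics import silhouette_samples

# Two tight pairs on a line, clustered correctly
X_toy = np.array([[0.0], [1.0], [10.0], [11.0]])
labels_toy = np.array([0, 0, 1, 1])

# For the first point: a = mean distance to the rest of its own cluster,
# b = mean distance to the nearest other cluster
a = 1.0                        # distance to [1.0]
b = (10.0 + 11.0) / 2          # mean distance to [10.0] and [11.0]
s_manual = (b - a) / max(a, b)

s_sklearn = silhouette_samples(X_toy, labels_toy)[0]
print(s_manual, s_sklearn)     # both are approximately 0.905
```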

The overall silhouette score (mean across all observations) summarizes clustering quality. But the silhouette plot is far more informative than the single number because it reveals per-cluster structure.

from sklearn.metrics import silhouette_score, silhouette_samples

silhouette_avgs = []

for k in K_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels_k = km.fit_predict(shopsmart_scaled)
    score = silhouette_score(shopsmart_scaled, labels_k)
    silhouette_avgs.append(score)
    print(f"k={k}: silhouette={score:.4f}")

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(K_range, silhouette_avgs, 'bo-', linewidth=2, markersize=8)
ax.set_xlabel('Number of Clusters (k)')
ax.set_ylabel('Mean Silhouette Score')
ax.set_title('Silhouette Analysis')
ax.set_xticks(list(K_range))
plt.tight_layout()
plt.show()

Now create the per-cluster silhouette plot for the best candidate:

def plot_silhouette(X, labels, k, ax):
    """Silhouette plot showing per-cluster silhouette width."""
    sample_silhouette_values = silhouette_samples(X, labels)
    y_lower = 10

    for i in range(k):
        ith_cluster_values = sample_silhouette_values[labels == i]
        ith_cluster_values.sort()
        size_cluster_i = ith_cluster_values.shape[0]
        y_upper = y_lower + size_cluster_i

        ax.fill_betweenx(
            np.arange(y_lower, y_upper), 0, ith_cluster_values, alpha=0.7
        )
        ax.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))
        y_lower = y_upper + 10

    avg_score = silhouette_score(X, labels)
    ax.axvline(x=avg_score, color='red', linestyle='--', label=f'Mean: {avg_score:.3f}')
    ax.set_xlabel('Silhouette Coefficient')
    ax.set_ylabel('Cluster Label')
    ax.legend()

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for idx, k in enumerate([3, 4, 5]):
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels_k = km.fit_predict(shopsmart_scaled)
    plot_silhouette(shopsmart_scaled, labels_k, k, axes[idx])
    axes[idx].set_title(f'Silhouette Plot (k={k})')

plt.tight_layout()
plt.show()

What to Look For --- Good clusters produce silhouette plots where every cluster's "blade" is roughly the same width (balanced cluster sizes) and roughly the same height (similar silhouette values). If one cluster has many points below zero, those points are probably in the wrong cluster. If one blade is much thinner than the others, that cluster may be an artifact.

The Gap Statistic

The gap statistic (Tibshirani, Walther, and Hastie, 2001) compares the observed within-cluster dispersion to the dispersion expected under a reference null distribution (uniform random data in the same bounding box). The idea: if clusters are real, the observed dispersion should be substantially lower than the random expectation.

from sklearn.cluster import KMeans

def gap_statistic(X, K_range, n_references=20, random_state=42):
    """Compute the gap statistic for each k in K_range."""
    rng = np.random.RandomState(random_state)
    gaps = []
    s_k = []

    for k in K_range:
        # Observed inertia
        km = KMeans(n_clusters=k, random_state=random_state, n_init=10)
        km.fit(X)
        W_k = km.inertia_

        # Reference inertias from uniform random data
        ref_inertias = []
        for _ in range(n_references):
            X_ref = rng.uniform(
                low=X.min(axis=0), high=X.max(axis=0), size=X.shape
            )
            km_ref = KMeans(n_clusters=k, random_state=random_state, n_init=5)
            km_ref.fit(X_ref)
            ref_inertias.append(km_ref.inertia_)

        ref_log = np.log(ref_inertias)
        gap = ref_log.mean() - np.log(W_k)
        sdk = ref_log.std() * np.sqrt(1 + 1.0 / n_references)

        gaps.append(gap)
        s_k.append(sdk)

    return np.array(gaps), np.array(s_k)

gaps, s_k = gap_statistic(shopsmart_scaled, K_range, n_references=20)

fig, ax = plt.subplots(figsize=(8, 5))
ax.errorbar(K_range, gaps, yerr=s_k, fmt='bo-', linewidth=2, markersize=8, capsize=4)
ax.set_xlabel('Number of Clusters (k)')
ax.set_ylabel('Gap Statistic')
ax.set_title('Gap Statistic')
ax.set_xticks(list(K_range))
plt.tight_layout()
plt.show()

# The optimal k is the smallest k such that Gap(k) >= Gap(k+1) - s(k+1)
for i in range(len(gaps) - 1):
    k_val = list(K_range)[i]
    if gaps[i] >= gaps[i + 1] - s_k[i + 1]:
        print(f"Gap statistic suggests k = {k_val}")
        break

Practical Rule --- Use all three methods together. If the elbow, silhouette, and gap statistic all point to k=3, you can be reasonably confident. If they disagree, you have a harder problem --- the data may not have clean, well-separated clusters, and your choice of k may depend more on business utility than on mathematical optimality.

A Decision Framework for Choosing k

| Situation | Recommended Approach |
|---|---|
| Strong elbow + high silhouette + gap statistic agrees | Use that k. You got lucky. |
| Methods disagree by 1-2 | Try both, profile the clusters, and pick the one that tells a more actionable business story. |
| No clear elbow, silhouette is flat | Your data may not cluster well. Consider DBSCAN or question whether clustering is the right approach. |
| Business already defines segments | Use the business-defined k and validate whether the data supports it. If silhouette is below 0.25, the segments may not be statistically distinct. |
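The last situation is easy to operationalize: silhouette_score accepts any label vector, not just one produced by a clustering fit, so you can score business-defined segments directly. Sketched here on synthetic stand-in data; the 0.25 threshold is the rough guide from the table, not a hard rule:

```python
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Stand-in for a business-supplied segment column
X, business_segments = make_blobs(n_samples=1500, centers=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

score = silhouette_score(X_scaled, business_segments)
if score < 0.25:
    print(f"silhouette={score:.3f}: the data does not clearly support these segments")
else:
    print(f"silhouette={score:.3f}: the segments are reasonably distinct")
```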

Part 3: DBSCAN --- When Clusters Are Not Spheres

The Density-Based Intuition

DBSCAN (Density-Based Spatial Clustering of Applications with Noise, Ester et al., 1996) takes a fundamentally different approach. Instead of partitioning space around centroids, it finds dense regions separated by sparse regions. Dense regions become clusters. Points in sparse regions become noise.

Two parameters control the algorithm:

  • eps ($\varepsilon$): The maximum distance between two points for them to be considered neighbors.
  • min_samples: The minimum number of neighbors (within eps) for a point to be considered a core point.

Three point types:

| Type | Definition | Cluster Membership |
|---|---|---|
| Core point | Has at least min_samples neighbors within eps | Always in a cluster |
| Border point | Within eps of a core point, but has fewer than min_samples neighbors | In the cluster of a nearby core point |
| Noise point | Not within eps of any core point | Label = -1 (no cluster) |

The algorithm:

  1. For each unvisited point, find all neighbors within eps.
  2. If the point has at least min_samples neighbors, it is a core point. Start a new cluster.
  3. Expand the cluster by adding all density-reachable points (neighbors of core points, recursively).
  4. Points that are not core points and not reachable from any core point are noise (label -1).

from sklearn.cluster import DBSCAN

# DBSCAN on the moons dataset where K-Means failed
db_moons = DBSCAN(eps=0.15, min_samples=5)
labels_db_moons = db_moons.fit_predict(X_moons)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))

# Ground truth
axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=y_moons, cmap='viridis', s=12)
axes[0].set_title('Moons: Ground Truth')

# K-Means (fails)
axes[1].scatter(X_moons[:, 0], X_moons[:, 1], c=km_moons.fit_predict(X_moons), cmap='viridis', s=12)
axes[1].set_title('Moons: K-Means (fails)')

# DBSCAN (succeeds)
colors = labels_db_moons.copy().astype(float)
colors[labels_db_moons == -1] = np.nan  # NaN maps to the colormap's "bad" color (hidden); noise is re-plotted in gray below
axes[2].scatter(
    X_moons[:, 0], X_moons[:, 1], c=colors, cmap='viridis', s=12, alpha=0.7
)
noise_mask = labels_db_moons == -1
if noise_mask.any():
    axes[2].scatter(
        X_moons[noise_mask, 0], X_moons[noise_mask, 1],
        c='gray', marker='x', s=20, alpha=0.5, label='Noise'
    )
    axes[2].legend()
axes[2].set_title('Moons: DBSCAN (eps=0.15)')

plt.tight_layout()
plt.show()

n_clusters = len(set(labels_db_moons)) - (1 if -1 in labels_db_moons else 0)
n_noise = (labels_db_moons == -1).sum()
print(f"DBSCAN found {n_clusters} clusters, {n_noise} noise points")

DBSCAN handles the moons and circles datasets perfectly. It does not need you to specify k. It can find clusters of arbitrary shape. And it explicitly labels noise, which K-Means cannot do.
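The three point types can also be recovered from a fitted estimator: scikit-learn records core points in core_sample_indices_, and border points are simply the clustered points that are not core:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X_moons, _ = make_moons(n_samples=1000, noise=0.08, random_state=42)
db = DBSCAN(eps=0.15, min_samples=5).fit(X_moons)

# core points are stored on the fitted estimator
core_mask = np.zeros(len(X_moons), dtype=bool)
core_mask[db.core_sample_indices_] = True
noise_mask = db.labels_ == -1
border_mask = ~core_mask & ~noise_mask  # clustered, but not core

print(f"core={core_mask.sum()}, border={border_mask.sum()}, noise={noise_mask.sum()}")
```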

Choosing eps and min_samples

DBSCAN does not need k, but it needs eps and min_samples. Poor choices produce poor results.

min_samples: A common heuristic is min_samples = 2 * n_features. For 4-dimensional data, start with min_samples=8. Higher values produce more conservative clusters (more noise points, fewer clusters). Lower values produce more aggressive clusters.

eps: The k-distance plot helps choose eps. Compute the distance to each point's k-th nearest neighbor (where k = min_samples), sort these distances, and look for an elbow.

from sklearn.neighbors import NearestNeighbors

min_samples = 8  # 2 * n_features for 4D data

# Use the scaled ShopSmart data
nn = NearestNeighbors(n_neighbors=min_samples)
nn.fit(shopsmart_scaled)
distances, _ = nn.kneighbors(shopsmart_scaled)

# Sort the k-th nearest neighbor distances
k_distances = np.sort(distances[:, -1])

fig, ax = plt.subplots(figsize=(8, 5))
ax.plot(k_distances, linewidth=1.5)
ax.set_xlabel('Points (sorted by distance)')
ax.set_ylabel(f'{min_samples}-th Nearest Neighbor Distance')
ax.set_title('k-Distance Plot for Choosing eps')
ax.axhline(y=1.5, color='red', linestyle='--', alpha=0.7, label='Candidate eps=1.5')
ax.legend()
plt.tight_layout()
plt.show()

Practical Rule --- The elbow in the k-distance plot suggests the eps value that separates dense regions from sparse ones. Points below the elbow are in clusters; points above are noise. In practice, try 3-5 candidate eps values around the elbow and compare the results.

# Compare multiple eps values
for eps in [1.0, 1.5, 2.0, 2.5]:
    db = DBSCAN(eps=eps, min_samples=min_samples)
    labels_db = db.fit_predict(shopsmart_scaled)
    n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)
    n_noise = (labels_db == -1).sum()
    pct_noise = 100 * n_noise / len(labels_db)
    print(f"eps={eps:.1f}: {n_clusters} clusters, {n_noise} noise ({pct_noise:.1f}%)")

DBSCAN's Limitations

DBSCAN is not a universal solution:

  1. Varying density. If clusters have very different densities, a single eps cannot capture both. The dense cluster is found correctly, but the sparse cluster is labeled as noise.
  2. High dimensions. In high-dimensional space, distances concentrate (the "curse of dimensionality"), making it hard to distinguish dense from sparse regions. DBSCAN degrades above ~20 features. Reduce dimensionality first (Chapter 21) if needed.
  3. No predict method. DBSCAN does not learn a model that can assign new points to clusters. It must be refit on the full dataset. (Workaround: train a classifier on the DBSCAN labels.)
  4. Parameter sensitivity. Small changes in eps can produce dramatically different results.
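The workaround for limitation 3 takes only a few lines: fit DBSCAN once, then train any classifier on the non-noise labels to score new points. A sketch using k-nearest neighbors (the choice of classifier and the example query points are ours):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

X_moons, _ = make_moons(n_samples=1000, noise=0.08, random_state=42)
labels = DBSCAN(eps=0.15, min_samples=5).fit_predict(X_moons)

# Train a classifier on the clustered (non-noise) points only
mask = labels != -1
clf = KNeighborsClassifier(n_neighbors=5).fit(X_moons[mask], labels[mask])

# New observations can now be routed to the existing clusters
X_new = np.array([[0.0, 0.5], [1.0, -0.3]])
preds = clf.predict(X_new)
print(preds)
```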

HDBSCAN: The Pragmatic Extension

When density varies across clusters, HDBSCAN (McInnes et al., 2017) is often a better choice. It extends DBSCAN by building a hierarchy of clusterings across different density levels and extracting the most stable clusters. It requires only min_cluster_size as a primary parameter. (Scikit-learn 1.3+ also ships an implementation as sklearn.cluster.HDBSCAN, if you prefer not to install the separate hdbscan package.)

# HDBSCAN (requires: pip install hdbscan)
import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=50, min_samples=8)
labels_hdb = clusterer.fit_predict(shopsmart_scaled)

n_clusters = len(set(labels_hdb)) - (1 if -1 in labels_hdb else 0)
n_noise = (labels_hdb == -1).sum()
print(f"HDBSCAN: {n_clusters} clusters, {n_noise} noise ({100*n_noise/len(labels_hdb):.1f}%)")

# HDBSCAN also provides cluster membership probabilities
print(f"Mean membership probability: {clusterer.probabilities_.mean():.3f}")

Practical Recommendation --- If your data has clusters of varying density (most real-world datasets), try HDBSCAN before standard DBSCAN. It requires fewer parameter decisions and handles density variation more gracefully.


Part 4: Hierarchical Clustering and Dendrograms

The Agglomerative Approach

Hierarchical agglomerative clustering (HAC) starts with each point as its own cluster and progressively merges the two closest clusters until only one remains. The result is a dendrogram --- a tree that shows the complete merging history.

The linkage criterion determines how "distance between clusters" is computed:

| Linkage | Distance Between Clusters | Effect |
|---|---|---|
| Ward | Increase in total within-cluster variance | Tends to produce equal-sized, compact clusters. Most common default. |
| Complete | Maximum distance between any two points in different clusters | Produces compact clusters; sensitive to outliers. |
| Average | Mean distance between all pairs of points in different clusters | Compromise between single and complete. |
| Single | Minimum distance between any two points in different clusters | Can find elongated clusters; susceptible to chaining (bridges between clusters). |

from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
from sklearn.cluster import AgglomerativeClustering

# Use a subset for dendrogram visibility
np.random.seed(42)
subset_idx = np.random.choice(len(shopsmart_scaled), 200, replace=False)
X_subset = shopsmart_scaled[subset_idx]

# Compute the linkage matrix
Z = linkage(X_subset, method='ward')

fig, ax = plt.subplots(figsize=(14, 6))
dendrogram(Z, truncate_mode='lastp', p=30, ax=ax, leaf_rotation=90)
ax.set_title('Dendrogram (Ward Linkage, 200-Point Subset)')
ax.set_xlabel('Cluster Size')
ax.set_ylabel('Distance')
ax.axhline(y=20, color='red', linestyle='--', alpha=0.7, label='Cut at distance=20')
ax.legend()
plt.tight_layout()
plt.show()

Reading a Dendrogram

The dendrogram is one of the most useful exploratory visualizations in data science, and one of the most misread.

What the vertical axis means: The height of each merge represents the distance (or dissimilarity) at which two clusters were joined. Large jumps indicate well-separated clusters. Small jumps indicate gradual merging of similar groups.

How to choose k: Draw a horizontal line. The number of vertical lines it crosses equals the number of clusters at that distance. Move the line up (fewer clusters, more general grouping) or down (more clusters, finer resolution).

The gap heuristic: Look for the largest vertical gap between successive merges. Cutting just below that gap gives you the "natural" number of clusters --- the grouping that persists over the widest range of distances.
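The gap heuristic can be read straight off the linkage matrix: the third column of Z holds the merge heights in increasing order, so the largest jump between successive heights marks the natural cut. A sketch on synthetic blobs with hand-picked, well-separated centers:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from sklearn.datasets import make_blobs

# Three well-separated blobs at hand-picked centers
X, _ = make_blobs(
    n_samples=200, centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0, random_state=42
)
Z = linkage(X, method='ward')

heights = Z[:, 2]                  # merge distances, in increasing order
i = np.argmax(np.diff(heights))    # largest jump between successive merges
k_suggested = len(heights) - i     # clusters remaining if we cut inside that gap
print(f"Largest gap after merge {i}: suggested k = {k_suggested}")
```

On messier real data the largest gap is less decisive; compare it against the silhouette-based choice before committing.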

# Full dataset with AgglomerativeClustering
for n_clust in [3, 4, 5]:
    hac = AgglomerativeClustering(n_clusters=n_clust, linkage='ward')
    labels_hac = hac.fit_predict(shopsmart_scaled)
    sil = silhouette_score(shopsmart_scaled, labels_hac)
    print(f"HAC (ward, k={n_clust}): silhouette={sil:.4f}, sizes={np.bincount(labels_hac)}")

Comparing Linkage Methods

fig, axes = plt.subplots(1, 4, figsize=(20, 5))
linkage_methods = ['ward', 'complete', 'average', 'single']

for ax, method in zip(axes, linkage_methods):
    Z_method = linkage(X_subset, method=method)
    dendrogram(Z_method, truncate_mode='lastp', p=20, ax=ax, leaf_rotation=90)
    ax.set_title(f'{method.capitalize()} Linkage')
    ax.set_ylabel('Distance')

plt.tight_layout()
plt.show()

Practical Rule --- Start with Ward linkage unless you have a specific reason not to. Ward minimizes variance and produces the most "K-Means-like" clusters in the hierarchical framework. Use single linkage only if you expect elongated or chain-like clusters, and be prepared for it to produce unbalanced results.

When to Use Hierarchical Clustering

Hierarchical clustering is best suited for:

  • Exploratory analysis where you do not yet know k. The dendrogram lets you explore multiple k values simultaneously.
  • Small-to-medium datasets (under ~20,000 points). The O(n^2) memory and O(n^3) or O(n^2 log n) time make it impractical for large datasets.
  • Nested or hierarchical structure where clusters contain sub-clusters (e.g., product categories within departments).
  • Presentation. Dendrograms are intuitive for non-technical audiences. They show grouping at every level of resolution.

For large datasets, use K-Means or DBSCAN for the actual clustering and hierarchical clustering on a random subsample or on the K-Means centroids for exploratory structure analysis.
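That two-stage recipe looks like this in practice: over-cluster with K-Means into "micro-clusters", run hierarchical clustering on the centroids, then propagate labels back. A sketch; the cluster counts (50 micro-clusters, 5 final) are illustrative:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_large, _ = make_blobs(n_samples=20_000, centers=5, random_state=42)

# Step 1: over-cluster with K-Means into many small "micro-clusters"
km = KMeans(n_clusters=50, random_state=42, n_init=3).fit(X_large)

# Step 2: hierarchical clustering on the 50 centroids is cheap
Z = linkage(km.cluster_centers_, method='ward')
centroid_labels = fcluster(Z, t=5, criterion='maxclust')

# Step 3: map every original point through its micro-cluster's label
final_labels = centroid_labels[km.labels_]
print(np.unique(final_labels))
```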


Part 5: Evaluating Clusters Without Ground Truth

This is the hard part. In supervised learning, you evaluate against labeled data. In clustering, the labels are what you are trying to create. How do you know if your clusters are good?

Internal Metrics

Internal metrics evaluate clusters based on the data alone:

| Metric | Range | Best Value | What It Measures |
|---|---|---|---|
| Silhouette Score | [-1, 1] | Higher | Cohesion vs. separation |
| Calinski-Harabasz Index | [0, inf) | Higher | Between-cluster variance / within-cluster variance |
| Davies-Bouldin Index | [0, inf) | Lower | Average similarity between each cluster and its most similar neighbor |

from sklearn.metrics import (
    silhouette_score,
    calinski_harabasz_score,
    davies_bouldin_score,
)

# Compare K-Means, DBSCAN, and HAC on the ShopSmart data
results = []

# K-Means
km3 = KMeans(n_clusters=3, random_state=42, n_init=10)
labels_km3 = km3.fit_predict(shopsmart_scaled)
results.append(('K-Means (k=3)', labels_km3))

# DBSCAN (filter noise for metrics)
db = DBSCAN(eps=1.5, min_samples=8)
labels_db = db.fit_predict(shopsmart_scaled)
results.append(('DBSCAN (eps=1.5)', labels_db))

# Hierarchical
hac3 = AgglomerativeClustering(n_clusters=3, linkage='ward')
labels_hac3 = hac3.fit_predict(shopsmart_scaled)
results.append(('HAC Ward (k=3)', labels_hac3))

print(f"{'Algorithm':<25} {'Silhouette':>12} {'Calinski-H':>12} {'Davies-B':>12}")
print("-" * 65)

for name, labels in results:
    # DBSCAN may have noise (-1). Exclude noise for internal metrics.
    mask = labels != -1
    if mask.sum() < 2 or len(set(labels[mask])) < 2:
        print(f"{name:<25} {'N/A':>12} {'N/A':>12} {'N/A':>12}")
        continue
    sil = silhouette_score(shopsmart_scaled[mask], labels[mask])
    ch = calinski_harabasz_score(shopsmart_scaled[mask], labels[mask])
    db_score = davies_bouldin_score(shopsmart_scaled[mask], labels[mask])
    print(f"{name:<25} {sil:>12.4f} {ch:>12.1f} {db_score:>12.4f}")

External Validation (When You Have Partial Labels)

Occasionally you have some labeled data --- even if it is noisy or incomplete. In that case, external metrics compare cluster assignments against the known labels:

from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

# We generated the ShopSmart data with known segments, so we can compute external metrics
true_labels = np.concatenate([
    np.zeros(1500),   # high-value
    np.ones(2000),    # deal-seekers
    np.full(1500, 2), # browsers
])

for name, labels in results:
    # Exclude DBSCAN noise points (-1); note this can flatter DBSCAN's
    # external scores, since only confidently clustered points are scored
    mask = labels != -1
    if mask.sum() < 2:
        continue
    ari = adjusted_rand_score(true_labels[mask], labels[mask])
    nmi = normalized_mutual_info_score(true_labels[mask], labels[mask])
    print(f"{name:<25} ARI={ari:.4f}  NMI={nmi:.4f}")

Key Insight --- Adjusted Rand Index (ARI) is at most 1, scores ~0 for random assignments, and goes negative for worse-than-chance agreement. Normalized Mutual Information (NMI) ranges from 0 to 1. Both are invariant to relabeling (cluster 0 matching true label 2 is fine --- only the partition matters). Use these when you have labels for validation, but remember: in real clustering problems, you usually do not.
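A quick sanity check makes the relabeling invariance concrete: swapping the cluster label names leaves ARI and NMI unchanged, while a shuffled assignment scores near zero.

```python
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

truth = [0, 0, 0, 1, 1, 1]

# Same partition, different label names: both metrics give a perfect score
print(adjusted_rand_score(truth, [1, 1, 1, 0, 0, 0]))        # -> 1.0
print(normalized_mutual_info_score(truth, [1, 1, 1, 0, 0, 0]))  # -> 1.0

# An alternating assignment that ignores the true structure scores near 0
print(adjusted_rand_score(truth, [0, 1, 0, 1, 0, 1]))
```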

Cluster Profiling --- The Most Important Step

Metrics tell you whether clusters are well-separated. They do not tell you whether clusters are useful. Cluster profiling answers the question: "What makes each cluster different?"

# Profile the K-Means clusters
shopsmart_profiled = shopsmart_raw.copy()
shopsmart_profiled['cluster'] = labels_km3

profile = shopsmart_profiled.groupby('cluster').agg(['mean', 'median', 'std', 'count'])
print(profile.to_string())

# Cleaner profiling table
profile_clean = shopsmart_profiled.groupby('cluster').agg(
    n=('annual_spend', 'count'),
    avg_spend=('annual_spend', 'mean'),
    median_spend=('annual_spend', 'median'),
    avg_sessions=('sessions_per_month', 'mean'),
    avg_discount=('avg_discount_pct', 'mean'),
    avg_items_per_order=('items_per_order', 'mean'),
).round(2)

print(profile_clean.to_string())

The profiling table is what you show the business. "Cluster 0 spends $2,400/year, visits 12 times/month, and uses minimal discounts --- these are your loyalists. Cluster 1 spends $800/year, visits 18 times/month, and uses 35% discounts --- these are your deal-seekers." Assign each cluster a human-readable label and validate with domain experts.
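In code, the naming step can be as simple as a dictionary from cluster id to a business label. The ids and names below are hypothetical --- they come from reading the profiling table, not from the algorithm:

```python
import pandas as pd

# Hypothetical labels chosen after inspecting the profiling table
segment_names = {0: 'loyalists', 1: 'deal-seekers', 2: 'browsers'}

df = pd.DataFrame({'cluster': [0, 1, 1, 2, 0]})
df['segment'] = df['cluster'].map(segment_names)
print(df['segment'].tolist())
# -> ['loyalists', 'deal-seekers', 'deal-seekers', 'browsers', 'loyalists']
```

Keep the mapping in version control next to the model: if you refit the clustering, the ids can shuffle, and the names must be reassigned by hand.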


Part 6: Putting It Together --- A Clustering Workflow

Here is the workflow I use on every clustering project:

1. Define the business question
   "What customer segments exist for differentiated A/B testing?"

2. Select and engineer features
   Remove identifiers, scale continuous features, handle categoricals.

3. Reduce dimensionality if > 20 features
   PCA or UMAP (Chapter 21) to avoid curse of dimensionality.

4. Run K-Means with multiple k values
   Elbow, silhouette, gap statistic. Narrow to 2-3 candidate k values.

5. Run DBSCAN (or HDBSCAN) as a complement
   Compare to K-Means. Does DBSCAN find noise that K-Means forced into clusters?

6. Run hierarchical clustering on a subsample
   Use the dendrogram to check whether the data has natural hierarchical structure.

7. Compare algorithms using internal metrics
   Silhouette, Calinski-Harabasz, Davies-Bouldin. No single metric is definitive.

8. Profile clusters on original (unscaled) features
   Compute means, medians, distributions for each cluster. Name the clusters.

9. Validate with domain experts
   Show the cluster profiles. Do they recognize the segments? Are they actionable?

10. Iterate
    Clustering is exploratory. Expect to revise feature selection and k at least once.

Practical Rule --- The best clustering is the one that changes a decision. If the marketing team runs different A/B tests for each segment, and the segments perform differently, the clustering was useful. If the segments do not lead to different actions, the clustering was an academic exercise. Always ask: "What will we do differently for each cluster?"
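Steps 4 and 7 of the workflow can be sketched as a single loop: fit K-Means over a range of k, record the internal metrics, and shortlist the k values where silhouette peaks. Here `make_blobs` stands in for your real scaled feature matrix:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

# Synthetic stand-in for a scaled feature matrix
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=1.0, random_state=42)

scores = {}
for k in range(2, 8):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}  silhouette={scores[k]:.3f}  "
          f"davies-bouldin={davies_bouldin_score(X, labels):.3f}")

best_k = max(scores, key=scores.get)
print(f"Silhouette favors k={best_k}")
```

Treat the result as a shortlist, not a verdict: carry the top two or three k values forward into profiling and domain review.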


Part 7: Applying Clustering to the StreamFlow Churn Problem

Let us bring this back to the running example. StreamFlow wants to segment its 50,000 subscribers to understand whether different subscriber types have different churn rates --- and whether they should receive different retention interventions.

np.random.seed(42)
n = 50000

streamflow = pd.DataFrame({
    'monthly_hours_watched': np.random.exponential(18, n).round(1),
    'sessions_last_30d': np.random.poisson(14, n),
    'avg_session_minutes': np.random.exponential(28, n).round(1),
    'content_completion_rate': np.random.beta(3, 2, n).round(3),
    'binge_sessions_30d': np.random.poisson(2, n),
    'hours_change_pct': np.random.normal(0, 30, n).round(1),
    'months_active': np.random.randint(1, 60, n),
    'plan_price': np.random.choice(
        [9.99, 14.99, 19.99, 24.99], n, p=[0.35, 0.35, 0.20, 0.10]
    ),
    'devices_used': np.random.randint(1, 6, n),
    'days_since_last_session': np.random.exponential(5, n).round(0).clip(0, 60),
    'support_tickets_90d': np.random.poisson(0.8, n),
    'payment_failures_6m': np.random.poisson(0.3, n),
})

# Simulate churn as correlated with engagement features
churn_logit = (
    -2.0
    + 0.08 * streamflow['days_since_last_session']
    - 0.03 * streamflow['monthly_hours_watched']
    - 0.05 * streamflow['sessions_last_30d']
    + 0.4 * streamflow['payment_failures_6m']
    + 0.15 * streamflow['support_tickets_90d']
    - 0.02 * streamflow['months_active']
    + np.random.normal(0, 0.5, n)
)
streamflow['churned'] = (np.random.random(n) < 1 / (1 + np.exp(-churn_logit))).astype(int)
print(f"Churn rate: {streamflow['churned'].mean():.3f}")

# Select clustering features (exclude the target)
cluster_features = [
    'monthly_hours_watched', 'sessions_last_30d', 'avg_session_minutes',
    'content_completion_rate', 'binge_sessions_30d', 'hours_change_pct',
    'months_active', 'plan_price', 'devices_used', 'days_since_last_session',
    'support_tickets_90d', 'payment_failures_6m',
]

scaler = StandardScaler()
X_stream = scaler.fit_transform(streamflow[cluster_features])

# K-Means with k=4
km_stream = KMeans(n_clusters=4, random_state=42, n_init=10)
streamflow['segment'] = km_stream.fit_predict(X_stream)

# Churn rate by segment
churn_by_segment = streamflow.groupby('segment').agg(
    n=('churned', 'count'),
    churn_rate=('churned', 'mean'),
    avg_hours=('monthly_hours_watched', 'mean'),
    avg_sessions=('sessions_last_30d', 'mean'),
    avg_days_inactive=('days_since_last_session', 'mean'),
    avg_months=('months_active', 'mean'),
    avg_plan_price=('plan_price', 'mean'),
    avg_tickets=('support_tickets_90d', 'mean'),
).round(3)

print("\nStreamFlow Segments:")
print(churn_by_segment.to_string())

The segments reveal different behavioral profiles with different churn rates. Each segment suggests a different retention strategy:

  • High-engagement, low-churn: Leave them alone. They are happy.
  • Declining engagement: Proactive content recommendations (Chapter 24).
  • Payment friction: Fix the billing issue. This is the easiest intervention.
  • New and disengaged: Onboarding improvement. These subscribers never found content they liked.

This is the clustering payoff: not the clusters themselves, but the differentiated actions they enable.
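Wiring segments to actions can be a small lookup with a safe default. The segment ids and intervention names here are hypothetical --- in practice the mapping comes from the profiling table and the retention team:

```python
# Hypothetical mapping from profiled segment id to retention action
interventions = {
    0: 'no action (happy, high-engagement)',
    1: 'proactive content recommendations',
    2: 'billing fix outreach',
    3: 'onboarding email sequence',
}

def route_subscriber(segment_id: int) -> str:
    # Fall back to a default action for unseen or reshuffled segment ids
    return interventions.get(segment_id, 'default retention offer')

print(route_subscriber(2))  # -> billing fix outreach
```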


Summary

K-Means is simple, fast, and your first tool for clustering. Use it when you have a reasonable guess for k and your data is roughly spherical. DBSCAN finds arbitrary-shaped clusters and labels noise, but requires careful parameter tuning. Hierarchical clustering produces dendrograms that reveal the data's nested structure at every resolution.

No single algorithm works everywhere. The most important skill in clustering is not running the algorithm --- it is evaluating the result. Silhouette analysis, cluster profiling, and domain expert validation separate useful clusters from algorithmic noise. And the ultimate test is not a metric but a question: does this clustering change a decision?


Next chapter: Chapter 21 --- Dimensionality Reduction will give you PCA, t-SNE, and UMAP --- tools that reduce feature spaces and visualize the clusters you just created.