Exercises: Chapter 20
Clustering: K-Means, DBSCAN, and Hierarchical Clustering
Exercise 1: K-Means Fundamentals (Conceptual)
a) Explain in your own words what "inertia" measures in the context of K-Means. Why does inertia always decrease as k increases? Why is minimizing inertia alone not a good strategy for choosing k?
b) K-Means assumes clusters are spherical, equal-variance, and roughly equal-sized. For each assumption violation below, describe what K-Means will do and why:
- Two crescent-shaped clusters (moons)
- One cluster with 50 points and another with 5,000 points, both well-separated
- Three clusters with standard deviations of 0.5, 3.0, and 0.5
c) Why is feature scaling required before K-Means? A colleague argues: "I scaled my features, but the clusters are the same as without scaling." Under what conditions would this be true? Under what conditions would it definitely not be true?
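To see the effect of scaling concretely, here is a small sketch on synthetic data (not the chapter's dataset): feature 0 cleanly separates two groups at a small scale, while feature 1 is pure noise at a much larger scale. Without scaling, the noisy feature dominates the Euclidean distances.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
# Feature 0: informative, small scale. Feature 1: uninformative, large scale.
X = np.column_stack([
    np.concatenate([rng.normal(0, 0.5, 200), rng.normal(5, 0.5, 200)]),
    rng.normal(0, 100, 400),
])
true = np.repeat([0, 1], 200)

raw = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(X)
scaled = KMeans(n_clusters=2, random_state=0, n_init=10).fit_predict(
    StandardScaler().fit_transform(X))

# Without scaling, the large-scale noise feature dominates the distance metric
print("ARI raw vs true:   ", round(adjusted_rand_score(raw, true), 3))
print("ARI scaled vs true:", round(adjusted_rand_score(scaled, true), 3))
```

This also answers the colleague's claim: if all features already have comparable scales (or the informative features happen to dominate), scaling changes little; when scales differ by orders of magnitude, as here, the two clusterings diverge completely.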
d) Explain the difference between init='random' and init='k-means++' in scikit-learn's KMeans. Why does K-Means++ produce better initializations? What does the n_init parameter do, and what is its default value?
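A quick way to observe the initialization difference (illustrative only, on synthetic blobs; `n_init=1` isolates a single initialization so the two strategies are directly comparable):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=0)

# One run each; with n_init=1 the final inertia reflects the initialization quality
km_rand = KMeans(n_clusters=4, init='random', n_init=1, random_state=0).fit(X)
km_pp = KMeans(n_clusters=4, init='k-means++', n_init=1, random_state=0).fit(X)
print(f"random init inertia:    {km_rand.inertia_:.1f}")
print(f"k-means++ init inertia: {km_pp.inertia_:.1f}")
```

Run this across several seeds: k-means++ rarely lands in the bad local optima (two centroids in one blob) that random initialization sometimes produces.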
Exercise 2: Implementing K-Means from Scratch (Code)
Implement K-Means from scratch (without using scikit-learn's KMeans) to solidify your understanding. Your implementation should:
- Accept a 2D numpy array X and an integer k
- Use random initialization (select k random points from X as initial centroids)
- Iterate until convergence (assignments do not change) or max_iter is reached
- Return the cluster labels and final centroids
import numpy as np
from sklearn.datasets import make_blobs

def kmeans_scratch(X, k, max_iter=100, random_state=42):
    """
    K-Means clustering from scratch.

    Parameters
    ----------
    X : np.ndarray of shape (n_samples, n_features)
    k : int, number of clusters
    max_iter : int, maximum iterations
    random_state : int, random seed

    Returns
    -------
    labels : np.ndarray of shape (n_samples,)
    centroids : np.ndarray of shape (k, n_features)
    n_iterations : int
    """
    rng = np.random.RandomState(random_state)

    # Step 1: Initialize centroids by selecting k random points from X
    # Your code here

    # Step 2: Iterate until convergence or max_iter
    for iteration in range(max_iter):
        # 2a: Assign each point to the nearest centroid
        # Your code here
        # 2b: Recompute centroids as the mean of assigned points
        # Your code here
        # 2c: Check for convergence (labels did not change)
        # Your code here
        pass

    return labels, centroids, iteration + 1

# Test on synthetic data
X_test, y_test = make_blobs(n_samples=1000, centers=3, cluster_std=1.0, random_state=42)
labels, centroids, n_iters = kmeans_scratch(X_test, k=3)
print(f"Converged in {n_iters} iterations")
print(f"Cluster sizes: {np.bincount(labels)}")
After implementing, compare your results to scikit-learn's KMeans on the same data. Do the centroids match (up to label permutation)?
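Because cluster labels are arbitrary, compare partitions with the Adjusted Rand Index and match centroids by nearest distance rather than by index. The sketch below uses two scikit-learn fits with different seeds as stand-ins for "your implementation" versus scikit-learn; substitute your own labels and centroids:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X_test, _ = make_blobs(n_samples=1000, centers=3, cluster_std=1.0, random_state=42)

# Two independent fits stand in for scratch vs scikit-learn results
km_a = KMeans(n_clusters=3, random_state=0, n_init=10).fit(X_test)
km_b = KMeans(n_clusters=3, random_state=1, n_init=10).fit(X_test)

# ARI is invariant to label permutation: identical partitions score 1.0
print("ARI:", adjusted_rand_score(km_a.labels_, km_b.labels_))

# Match each centroid in A to its nearest centroid in B, then measure displacement
order = cdist(km_a.cluster_centers_, km_b.cluster_centers_).argmin(axis=1)
print("Max centroid displacement:",
      np.abs(km_a.cluster_centers_ - km_b.cluster_centers_[order]).max())
```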
Exercise 3: Choosing k with Three Methods (Code + Analysis)
Using the ShopSmart dataset from the chapter (or your own dataset with at least 4 continuous features and 2,000+ rows), perform a full k selection analysis:
a) Elbow method: Plot inertia vs. k for k = 2 through 10. Identify the elbow visually. Is it clear or ambiguous?
b) Silhouette analysis: Plot mean silhouette score vs. k for the same range. Create silhouette plots (per-cluster blades) for your top 3 candidate k values. Identify any clusters with many points below zero.
c) Gap statistic: Compute and plot the gap statistic for k = 2 through 10 using at least 20 reference datasets. Apply the formal rule: the optimal k is the smallest k where Gap(k) >= Gap(k+1) - s(k+1).
d) Synthesis: Do the three methods agree? If not, examine the candidate k values by profiling the clusters (mean of each feature per cluster). Which k produces the most interpretable and actionable segments? Justify your choice in 3-5 sentences.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score, silhouette_samples
# Load or generate your dataset
# Your code here
# Elbow method
# Your code here
# Silhouette analysis (scores + per-cluster plots)
# Your code here
# Gap statistic
# Your code here
# Cluster profiling for the top candidate(s)
# Your code here
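For part (c), one possible implementation of the gap statistic is sketched below, following the Tibshirani et al. formulation with uniform reference samples drawn over the bounding box of X. This is a sketch, not the chapter's exact code; a PCA-aligned reference box is a common refinement.

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_statistic(X, k_max=10, n_refs=20, random_state=42):
    """Gap statistic with uniform reference datasets over X's bounding box."""
    rng = np.random.RandomState(random_state)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    gaps, s_k = [], []
    for k in range(1, k_max + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        log_wk = np.log(km.fit(X).inertia_)
        # Expected log-dispersion under the null (no cluster structure)
        ref_logs = []
        for _ in range(n_refs):
            X_ref = rng.uniform(mins, maxs, size=X.shape)
            ref_logs.append(np.log(KMeans(n_clusters=k, n_init=10,
                                          random_state=random_state).fit(X_ref).inertia_))
        ref_logs = np.array(ref_logs)
        gaps.append(ref_logs.mean() - log_wk)
        s_k.append(ref_logs.std() * np.sqrt(1 + 1.0 / n_refs))
    return np.array(gaps), np.array(s_k)
```

Apply the formal rule to the returned arrays: the optimal k is the smallest k with `gaps[k-1] >= gaps[k] - s_k[k]`.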
Exercise 4: DBSCAN Parameter Tuning (Code + Analysis)
a) Using the ShopSmart dataset (scaled), create the k-distance plot with k = 2 * n_features. Identify the elbow and select a candidate eps value.
b) Run DBSCAN with your candidate eps and three additional values (one above and two below). For each, report:
- Number of clusters found
- Number and percentage of noise points
- Silhouette score (excluding noise points)
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.metrics import silhouette_score
# k-distance plot
# Your code here
# DBSCAN with multiple eps values
eps_candidates = [___]  # fill in your candidates
for eps in eps_candidates:
    db = DBSCAN(eps=eps, min_samples=___)
    labels = db.fit_predict(X_scaled)
    # Report metrics
    # Your code here
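The k-distance computation in part (a) can be sketched as follows (synthetic blobs stand in for the scaled ShopSmart features). Note that `kneighbors` on the training data counts each point as its own nearest neighbor at distance zero, so we request k + 1 neighbors and keep the last column.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled ShopSmart features
X, _ = make_blobs(n_samples=2000, n_features=4, centers=3, random_state=42)
X_scaled = StandardScaler().fit_transform(X)

k = 2 * X_scaled.shape[1]  # heuristic from the chapter: k = 2 * n_features
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_scaled)
dists, _ = nn.kneighbors(X_scaled)
k_dist = np.sort(dists[:, -1])  # distance to each point's k-th true neighbor, ascending

# plt.plot(k_dist) gives the k-distance curve; the elbow suggests a candidate eps
print(f"k-distance range: {k_dist[0]:.3f} to {k_dist[-1]:.3f}")
```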
c) Pick the best DBSCAN result and compare it to K-Means (k=3) using silhouette score, cluster sizes, and cluster profiles. Which algorithm produces more interpretable segments for this data?
d) Generate or load a dataset with non-spherical clusters (e.g., make_moons or make_circles). Run both K-Means and DBSCAN. Show visually that DBSCAN succeeds where K-Means fails. What is the silhouette score for each?
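A minimal version of the part (d) comparison is sketched below; `eps=0.2` and `min_samples=5` are eyeballed for `make_moons` with `noise=0.05`, not tuned.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import silhouette_score

X, _ = make_moons(n_samples=500, noise=0.05, random_state=42)

km_labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

print("K-Means silhouette:", round(silhouette_score(X, km_labels), 3))
mask = db_labels != -1  # exclude noise points from the silhouette
if len(set(db_labels[mask])) > 1:
    print("DBSCAN silhouette: ", round(silhouette_score(X[mask], db_labels[mask]), 3))
print("DBSCAN clusters:", len(set(db_labels)) - (1 if -1 in db_labels else 0))
```

Plot the labels to see the failure mode: K-Means bisects the moons with a straight boundary while DBSCAN traces each crescent. Note that DBSCAN's silhouette may actually be lower here even though its clustering is correct; silhouette rewards convex, compact clusters, which is itself a useful lesson about internal metrics.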
Exercise 5: Hierarchical Clustering and Dendrograms (Code + Analysis)
a) Take a random subsample of 300 points from the ShopSmart dataset (scaled). Compute the linkage matrix using Ward linkage and plot the dendrogram. Based on the dendrogram:
- How many "natural" clusters does the data appear to have?
- At what distance would you cut the dendrogram to get that many clusters?
- Is there a large gap between merge distances that supports your choice?
b) Compare four linkage methods (ward, complete, average, single) on the same 300-point subsample. Plot all four dendrograms side by side. Which linkage produces the most balanced clusters? Which produces the most unbalanced? Explain why single linkage behaves differently.
c) Use AgglomerativeClustering on the full ShopSmart dataset with Ward linkage and k=3. Compare the resulting cluster profiles to those from K-Means (k=3) and DBSCAN. Are the clusters similar across algorithms? Compute the Adjusted Rand Index between each pair of clusterings.
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score
# Subsample and dendrogram
# Your code here
# Compare linkage methods
# Your code here
# Full dataset comparison
# Your code here
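The merge-distance inspection and dendrogram cut from part (a) can be sketched like this (synthetic blobs stand in for the 300-point ShopSmart subsample; `fcluster` cuts the tree produced by `linkage`):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Synthetic 300-point stand-in for the ShopSmart subsample
X, _ = make_blobs(n_samples=300, n_features=4, centers=3, random_state=42)
Z = linkage(X, method='ward')

# Column 2 of the linkage matrix holds merge distances; a large jump between
# consecutive merges supports cutting just below that height
merge_d = Z[:, 2]
print("last five merge distances:", np.round(merge_d[-5:], 2))

labels = fcluster(Z, t=3, criterion='maxclust')  # cut into exactly 3 clusters
print("cluster sizes:", np.bincount(labels)[1:])  # fcluster labels start at 1
```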
Exercise 6: Cluster Evaluation Without Ground Truth (Code + Analysis)
a) Using the ShopSmart dataset, run K-Means with k=3, k=4, and k=5. For each, compute:
- Silhouette score
- Calinski-Harabasz index
- Davies-Bouldin index
Present the results in a table. Do the three internal metrics agree on the best k?
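A sketch of the metrics table (synthetic blobs stand in for ShopSmart; keep the directions straight when reading it, as noted in the comments):

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

# Synthetic stand-in for the scaled ShopSmart features
X, _ = make_blobs(n_samples=2000, n_features=4, centers=4, random_state=42)

rows = []
for k in (3, 4, 5):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    rows.append({'k': k,
                 'silhouette': silhouette_score(X, labels),               # higher is better
                 'calinski_harabasz': calinski_harabasz_score(X, labels), # higher is better
                 'davies_bouldin': davies_bouldin_score(X, labels)})      # lower is better
print(pd.DataFrame(rows).set_index('k').round(3))
```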
b) For your best k, create a detailed cluster profile showing the mean, median, and standard deviation of each feature per cluster. Name each cluster based on its profile (e.g., "High-Value Loyalists," "Deal-Seekers," "Browsers"). Justify each name with specific feature values.
c) Stability analysis: Run K-Means (k=3) on 10 different random 80% subsamples of the data. For each pair of runs, compute the ARI between their labels (on the overlapping observations). Report the mean and standard deviation of the ARI across all 45 pairs. Is the clustering stable? What ARI threshold would you consider "stable enough"?
import numpy as np
from itertools import combinations
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

# Stability analysis (assumes X_scaled from the earlier parts)
n_runs = 10
all_labels = []
for i in range(n_runs):
    sample_idx = np.random.RandomState(i).choice(
        len(X_scaled), int(0.8 * len(X_scaled)), replace=False)
    km = KMeans(n_clusters=3, random_state=42, n_init=10)
    labels_run = np.full(len(X_scaled), -1)
    labels_run[sample_idx] = km.fit_predict(X_scaled[sample_idx])
    all_labels.append((sample_idx, labels_run))

ari_scores = []
for (idx_a, lab_a), (idx_b, lab_b) in combinations(all_labels, 2):
    overlap = np.intersect1d(idx_a, idx_b)
    if len(overlap) > 0:
        ari = adjusted_rand_score(lab_a[overlap], lab_b[overlap])
        ari_scores.append(ari)

print(f"Mean ARI: {np.mean(ari_scores):.4f} +/- {np.std(ari_scores):.4f}")
Exercise 7: End-to-End Clustering Project (Comprehensive)
Build a complete clustering analysis for a real or simulated dataset. Choose one of:
- Option A: The StreamFlow subscriber dataset from the chapter (50,000 subscribers, 12 features). Segment subscribers and analyze churn rate differences across segments.
- Option B: Your own tabular dataset with at least 5 continuous features and 1,000+ rows.
Your deliverable should include:
- Data preparation: Feature selection justification, scaling, handling of categoricals.
- K-Means analysis: Elbow, silhouette, and gap statistic. At least 3 candidate k values explored.
- DBSCAN analysis: k-distance plot, eps selection, comparison to K-Means.
- Hierarchical clustering: Dendrogram on a subsample, linkage comparison.
- Algorithm comparison: Table of internal metrics (silhouette, Calinski-Harabasz, Davies-Bouldin) across all methods.
- Cluster profiles: Named clusters with feature summaries. For Option A, include churn rate per segment.
- Business recommendation: One paragraph per cluster describing the recommended action (intervention, offer, status quo).
This exercise should produce a Jupyter notebook of 20-30 cells with code, output, and markdown commentary.
Exercise 8: Scaling and Computational Considerations (Code)
a) Generate datasets of increasing size (n = 5,000; 10,000; 50,000; 100,000; 500,000) with 10 features and 5 clusters. Time K-Means, Mini-Batch K-Means, DBSCAN, and AgglomerativeClustering on each size. Plot runtime vs. dataset size for each algorithm.
import time
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, MiniBatchKMeans, DBSCAN, AgglomerativeClustering

sizes = [5_000, 10_000, 50_000, 100_000, 500_000]
results = {alg: [] for alg in ['KMeans', 'MiniBatchKMeans', 'DBSCAN', 'HAC']}

for n in sizes:
    X, _ = make_blobs(n_samples=n, n_features=10, centers=5, random_state=42)

    # K-Means
    start = time.time()
    KMeans(n_clusters=5, random_state=42, n_init=3).fit(X)
    results['KMeans'].append(time.time() - start)

    # Mini-Batch K-Means
    start = time.time()
    MiniBatchKMeans(n_clusters=5, random_state=42, batch_size=1024).fit(X)
    results['MiniBatchKMeans'].append(time.time() - start)

    # DBSCAN (skip for 500k --- too slow)
    if n <= 100_000:
        start = time.time()
        DBSCAN(eps=2.0, min_samples=10).fit(X)
        results['DBSCAN'].append(time.time() - start)
    else:
        results['DBSCAN'].append(np.nan)

    # HAC (skip for > 50k --- O(n^2) memory)
    if n <= 50_000:
        start = time.time()
        AgglomerativeClustering(n_clusters=5).fit(X)
        results['HAC'].append(time.time() - start)
    else:
        results['HAC'].append(np.nan)

    print(f"n={n:>7}: KM={results['KMeans'][-1]:.2f}s, "
          f"MBK={results['MiniBatchKMeans'][-1]:.2f}s, "
          f"DBSCAN={results['DBSCAN'][-1]:.2f}s, "
          f"HAC={results['HAC'][-1]:.2f}s")
b) At what dataset size does each algorithm become impractical (>60 seconds for a single fit)? What is your recommendation for a clustering pipeline that must handle 1 million rows?
Exercises support Chapter 20: Clustering. Return to the chapter for full context, or check your work against selected answers.