> "The goal is to turn data into information, and information into insight." --- Carly Fiorina
Learning Objectives
- Understand the fundamental differences between supervised and unsupervised learning
- Implement and evaluate clustering algorithms including K-means, hierarchical, DBSCAN, and Gaussian mixture models
- Apply dimensionality reduction techniques such as PCA, t-SNE, and UMAP
- Detect anomalies using unsupervised methods
- Evaluate unsupervised learning results using appropriate metrics
- Select the right unsupervised algorithm for a given problem
In the preceding chapters, we worked exclusively with labeled data. Every training example came paired with a target value---a class label in Chapter 5 (Classification) or a continuous output in Chapter 6 (Regression). But in the real world, labels are expensive. A radiologist must annotate each medical scan. A human reviewer must tag each email as spam or not-spam. A domain expert must assign quality scores to each product. In many practical scenarios, you have mountains of data but no labels at all.
This is where unsupervised learning enters the picture. Rather than learning a mapping from inputs to known outputs, unsupervised algorithms discover hidden structure in unlabeled data. They find natural groupings, reveal underlying patterns, compress high-dimensional representations, and flag observations that deviate from the norm---all without ever being told what to look for.
This chapter covers three major families of unsupervised techniques:
- Clustering --- partitioning data into meaningful groups
- Dimensionality reduction --- projecting high-dimensional data into lower-dimensional spaces while preserving essential structure
- Anomaly detection --- identifying data points that deviate significantly from the majority
By the end of this chapter, you will have a comprehensive toolkit for extracting insight from unlabeled data, and you will know how to evaluate the quality of your results even in the absence of ground-truth labels.
7.1 The Unsupervised Learning Landscape
7.1.1 Supervised vs. Unsupervised Learning
Recall from Chapter 1 that machine learning problems fall into several broad categories. In supervised learning, we optimize a model to predict known targets. The loss function compares predictions to ground truth, providing a clear optimization signal. In unsupervised learning, there are no targets. The algorithm must define its own notion of "good structure" and optimize accordingly.
This distinction has profound implications:
| Aspect | Supervised Learning | Unsupervised Learning |
|---|---|---|
| Input | Features + labels | Features only |
| Objective | Minimize prediction error | Discover hidden structure |
| Evaluation | Accuracy, RMSE, F1, etc. | Silhouette score, inertia, reconstruction error |
| Validation | Train/test split with ground truth | Internal metrics, domain expertise |
| Overfitting risk | Well-understood | Less well-defined |
7.1.2 Why Unsupervised Learning Matters
Unsupervised learning serves several critical purposes in the AI engineering pipeline:
Exploratory data analysis. Before building a supervised model, clustering and dimensionality reduction help you understand your data's structure. Are there natural subgroups? Are features redundant? These insights inform feature engineering decisions (Chapter 4).
Feature extraction. Dimensionality reduction techniques like PCA produce new features that can improve downstream supervised models. This connects directly to the preprocessing pipelines we built in Chapter 3.
Data compression. Reducing dimensionality from thousands of features to tens or hundreds makes subsequent computation tractable. This is especially important for the high-dimensional data common in text and image applications.
Anomaly detection. Identifying unusual observations is essential for fraud detection, network security, manufacturing quality control, and medical diagnostics.
Label generation. Clustering can produce pseudo-labels for semi-supervised learning, bridging the gap between fully unsupervised and supervised approaches.
7.1.3 A Taxonomy of Unsupervised Methods
Unsupervised Learning
├── Clustering
│ ├── Partitional (K-means, K-medoids)
│ ├── Hierarchical (Agglomerative, Divisive)
│ ├── Density-based (DBSCAN, OPTICS, HDBSCAN)
│ └── Model-based (Gaussian Mixture Models)
├── Dimensionality Reduction
│ ├── Linear (PCA, Factor Analysis)
│ └── Nonlinear (t-SNE, UMAP, Autoencoders)
├── Anomaly Detection
│ ├── Statistical (Z-score, IQR)
│ ├── Distance-based (LOF, k-NN)
│ └── Isolation-based (Isolation Forest)
└── Association Rule Learning
└── (Apriori, FP-Growth --- not covered here)
We will focus on the first three families, as they are the most widely used in modern AI engineering.
7.2 Clustering Algorithms
Clustering is the task of grouping data points so that points within the same cluster are more similar to each other than to points in other clusters. The notion of "similarity" varies by algorithm, and this choice profoundly affects the clusters you discover.
7.2.1 K-Means Clustering
K-means is the most widely used clustering algorithm, prized for its simplicity and scalability. It partitions $n$ observations into $k$ clusters by minimizing the within-cluster sum of squares (WCSS), also called inertia:
$$J = \sum_{i=1}^{k} \sum_{\mathbf{x} \in C_i} \|\mathbf{x} - \boldsymbol{\mu}_i\|^2$$
where $C_i$ is the set of points assigned to cluster $i$ and $\boldsymbol{\mu}_i$ is the centroid (mean) of cluster $i$.
The Algorithm (Lloyd's Algorithm):
1. Initialize $k$ centroids (randomly, or using K-means++ for smarter initialization)
2. Assign each point to the nearest centroid
3. Update each centroid to the mean of its assigned points
4. Repeat steps 2-3 until convergence (assignments no longer change) or a maximum number of iterations is reached
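To make these steps concrete, here is a minimal NumPy sketch of Lloyd's algorithm. It is illustrative only (it uses plain random initialization and does not handle empty clusters); in practice you would use scikit-learn's KMeans as shown below.
import numpy as np
def lloyd_kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: initialize centroids by picking k distinct points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned points
        # (a production implementation must handle empty clusters here)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop once the centroids (and hence assignments) no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids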
K-means++ Initialization. Random initialization can lead to poor local minima. K-means++ (Arthur and Vassilvitskii, 2007) selects initial centroids that are spread apart:
- Choose the first centroid uniformly at random from the data
- For each subsequent centroid, select a data point with probability proportional to $D(\mathbf{x})^2$, where $D(\mathbf{x})$ is the distance to the nearest existing centroid
- Repeat until $k$ centroids are chosen
This initialization provides an $O(\log k)$ approximation guarantee and is the default in scikit-learn.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
# Generate synthetic data with 4 clusters
X, y_true = make_blobs(
n_samples=500,
centers=4,
cluster_std=0.8,
random_state=42
)
# Always scale before K-means (recall Chapter 3)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Fit K-means
kmeans = KMeans(
n_clusters=4,
init="k-means++",
n_init=10, # Run 10 times with different seeds
max_iter=300,
random_state=42
)
labels = kmeans.fit_predict(X_scaled)
print(f"Inertia: {kmeans.inertia_:.2f}")
print(f"Centroids shape: {kmeans.cluster_centers_.shape}")
print(f"Iterations to converge: {kmeans.n_iter_}")
Choosing $k$: The Elbow Method. Since K-means requires specifying $k$ in advance, we need a principled way to select it. The elbow method plots inertia against $k$ and looks for an "elbow" where the rate of decrease sharply changes:
inertias = []
K_range = range(2, 11)
for k in K_range:
km = KMeans(n_clusters=k, n_init=10, random_state=42)
km.fit(X_scaled)
inertias.append(km.inertia_)
# The "elbow" typically appears where adding another cluster
# provides diminishing returns in inertia reduction
Strengths and Limitations of K-means:
| Strengths | Limitations |
|---|---|
| Simple and fast: $O(nkd)$ per iteration | Must specify $k$ in advance |
| Scales well to large datasets | Assumes spherical, equally-sized clusters |
| Guaranteed to converge | Sensitive to initialization (mitigated by K-means++) |
| Works well when clusters are compact and well-separated | Cannot detect non-convex clusters |
| | Sensitive to outliers |
7.2.2 Hierarchical Clustering
Hierarchical clustering builds a tree-like structure (a dendrogram) that captures nested groupings at multiple levels of granularity. Unlike K-means, you do not need to specify $k$ in advance---you can cut the dendrogram at any level to obtain the desired number of clusters.
Agglomerative (Bottom-Up) Approach:
- Start with each point as its own cluster ($n$ clusters)
- Merge the two closest clusters
- Repeat until a single cluster remains
The key design choice is the linkage criterion, which defines "distance between clusters":
- Single linkage: $d(A, B) = \min_{a \in A, b \in B} d(a, b)$ --- distance between the closest pair. Can detect elongated clusters but suffers from the "chaining effect."
- Complete linkage: $d(A, B) = \max_{a \in A, b \in B} d(a, b)$ --- distance between the farthest pair. Produces compact clusters but is sensitive to outliers.
- Average linkage: $d(A, B) = \frac{1}{|A||B|} \sum_{a \in A} \sum_{b \in B} d(a, b)$ --- average pairwise distance. A compromise between single and complete.
- Ward's method: Minimizes the total within-cluster variance at each merge. Tends to produce equally-sized, spherical clusters. This is the default in scikit-learn and generally the best starting point.
from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram, linkage
# Compute linkage matrix for dendrogram visualization
Z = linkage(X_scaled, method="ward")
# Fit agglomerative clustering with a chosen number of clusters
agg = AgglomerativeClustering(
n_clusters=4,
linkage="ward"
)
labels_agg = agg.fit_predict(X_scaled)
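Because the dendrogram encodes every granularity level, you can also cut the linkage matrix Z computed above into a flat clustering directly with SciPy; a brief sketch:
from scipy.cluster.hierarchy import fcluster
# criterion="maxclust" asks for exactly 4 clusters; labels are 1-based
labels_cut = fcluster(Z, t=4, criterion="maxclust")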
Time Complexity. Naive agglomerative clustering is $O(n^3)$, which limits it to moderate-sized datasets (roughly $n < 10{,}000$). Optimized implementations using priority queues achieve $O(n^2 \log n)$, and Ward's linkage with structured connectivity can be even faster.
7.2.3 DBSCAN: Density-Based Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) takes a fundamentally different approach: it defines clusters as dense regions of points separated by regions of lower density. This allows it to discover clusters of arbitrary shape and naturally handles outliers. The algorithm was introduced by Ester, Kriegel, Sander, and Xu in 1996 and remains one of the most cited clustering papers in computer science---a testament to its practical value.
Key Parameters:
- $\varepsilon$ (eps): The radius of the neighborhood around each point
- min_samples: The minimum number of points within $\varepsilon$ to form a dense region
Point Classification:
- Core point: Has at least min_samples neighbors within $\varepsilon$
- Border point: Within $\varepsilon$ of a core point but does not have min_samples neighbors itself
- Noise point: Neither core nor border---these are the outliers
Formal Definitions for Density Reachability. To understand DBSCAN's cluster formation precisely, we need two definitions:
- Directly density-reachable: A point $\mathbf{q}$ is directly density-reachable from a core point $\mathbf{p}$ if $\mathbf{q}$ is within $\varepsilon$ of $\mathbf{p}$.
- Density-reachable: A point $\mathbf{q}$ is density-reachable from $\mathbf{p}$ if there exists a chain of points $\mathbf{p} = \mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_k = \mathbf{q}$ such that each $\mathbf{p}_{i+1}$ is directly density-reachable from $\mathbf{p}_i$.
- Density-connected: Two points $\mathbf{p}$ and $\mathbf{q}$ are density-connected if there exists a point $\mathbf{o}$ such that both $\mathbf{p}$ and $\mathbf{q}$ are density-reachable from $\mathbf{o}$.
A DBSCAN cluster is a maximal set of density-connected points. This formal definition is what allows DBSCAN to discover clusters of arbitrary shape---any set of points connected through a chain of dense neighborhoods forms a single cluster, regardless of the overall shape.
The Algorithm (Step by Step):
1. For each unvisited point, compute the $\varepsilon$-neighborhood (all points within distance $\varepsilon$)
2. If the point has at least min_samples neighbors, it is a core point: start a new cluster and add it
3. Expand the cluster: for each core point in the cluster, add all points in its $\varepsilon$-neighborhood. If a newly added point is also a core point, recursively expand from it
4. Border points are added to the cluster of the first core point that reaches them
5. Points that are never reached by any core point are labeled as noise
Time Complexity. With a naive implementation, DBSCAN is $O(n^2)$ because every point must query every other point. Using spatial indexing structures like KD-trees or ball trees, the average case improves to $O(n \log n)$. Scikit-learn uses these index structures automatically.
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
# Use k-distance plot to estimate eps
# k = min_samples - 1 is a common heuristic
k = 4
nn = NearestNeighbors(n_neighbors=k + 1)
nn.fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_distances = np.sort(distances[:, k])
# Look for the "knee" in the sorted k-distance plot
# This suggests a good value for eps
# Fit DBSCAN
dbscan = DBSCAN(eps=0.5, min_samples=5)
labels_db = dbscan.fit_predict(X_scaled)
n_clusters = len(set(labels_db)) - (1 if -1 in labels_db else 0)
n_noise = (labels_db == -1).sum()
print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise}")
Worked Example: Choosing eps with the k-Distance Plot. Suppose you have a dataset of 1,000 customer transactions with 5 features, all standardized. You set min_samples = 5 (a common rule of thumb is min_samples = 2 * d where $d$ is the number of features, so for 5 features you might start with 10; for simplicity we use 5 here). You compute the 5th-nearest-neighbor distance for every point and sort these distances in ascending order. The resulting plot typically shows a smooth curve that rises gradually, then suddenly steepens. The "knee" of this curve---the point where the slope changes most dramatically---corresponds to a good value for $\varepsilon$. Points below this threshold live in dense regions; points above it are in sparse areas. If the knee occurs at a distance of 0.5, you set eps = 0.5.
HDBSCAN: Hierarchical DBSCAN. A significant limitation of DBSCAN is that a single global $\varepsilon$ cannot handle clusters of varying density. HDBSCAN (Campello, Moulavi, and Sander, 2013) addresses this by extending DBSCAN into a hierarchical framework. Instead of fixing $\varepsilon$, HDBSCAN runs DBSCAN at all possible $\varepsilon$ values, builds a cluster hierarchy (a dendrogram of density-based clusters), and then extracts the most stable clusters from this hierarchy. The key parameter shifts from $\varepsilon$ to min_cluster_size, which is more intuitive to set.
# HDBSCAN requires the hdbscan package
# pip install hdbscan
import hdbscan
clusterer = hdbscan.HDBSCAN(
min_cluster_size=15,
min_samples=5,
metric="euclidean"
)
labels_hdb = clusterer.fit_predict(X_scaled)
# HDBSCAN also provides cluster membership probabilities
probabilities = clusterer.probabilities_
print(f"Clusters found: {labels_hdb.max() + 1}")
print(f"Noise points: {(labels_hdb == -1).sum()}")
HDBSCAN is often a better default than DBSCAN because it eliminates the need to tune $\varepsilon$ and handles varying-density clusters gracefully.
Strengths and Limitations of DBSCAN:
| Strengths | Limitations |
|---|---|
| No need to specify number of clusters | Sensitive to $\varepsilon$ and min_samples |
| Can find arbitrarily-shaped clusters | Struggles with clusters of varying density |
| Robust to outliers (labels them as noise) | Not suitable for high-dimensional data (curse of dimensionality) |
| Deterministic (given fixed parameters) | Cannot assign cluster labels to new points directly |
7.2.4 Gaussian Mixture Models (GMMs)
Gaussian Mixture Models offer a probabilistic approach to clustering. Instead of making hard assignments, GMMs model the data as a mixture of $k$ multivariate Gaussian distributions and provide soft assignments---the probability that each point belongs to each cluster.
The probability density of a GMM with $k$ components is:
$$p(\mathbf{x}) = \sum_{j=1}^{k} \pi_j \, \mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)$$
where $\pi_j$ are the mixing coefficients ($\sum_j \pi_j = 1$), $\boldsymbol{\mu}_j$ are the means, and $\boldsymbol{\Sigma}_j$ are the covariance matrices.
Expectation-Maximization (EM) Algorithm:
The parameters are estimated using the EM algorithm, which alternates between:
- E-step: Compute the posterior probability (responsibility) that each component generated each data point:
$$\gamma_{ij} = \frac{\pi_j \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_j, \boldsymbol{\Sigma}_j)}{\sum_{l=1}^{k} \pi_l \, \mathcal{N}(\mathbf{x}_i \mid \boldsymbol{\mu}_l, \boldsymbol{\Sigma}_l)}$$
- M-step: Update the parameters using the responsibilities:
$$\boldsymbol{\mu}_j = \frac{\sum_i \gamma_{ij} \mathbf{x}_i}{\sum_i \gamma_{ij}}, \quad \boldsymbol{\Sigma}_j = \frac{\sum_i \gamma_{ij} (\mathbf{x}_i - \boldsymbol{\mu}_j)(\mathbf{x}_i - \boldsymbol{\mu}_j)^T}{\sum_i \gamma_{ij}}, \quad \pi_j = \frac{\sum_i \gamma_{ij}}{n}$$
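To make the E- and M-steps concrete, here is a minimal NumPy/SciPy sketch of a single EM iteration for full-covariance components. It is illustrative only; in practice you would use scikit-learn's GaussianMixture, shown next.
import numpy as np
from scipy.stats import multivariate_normal
def em_step(X, pis, mus, covs):
    n, _ = X.shape
    k = len(pis)
    # E-step: responsibilities gamma[i, j] for each point i and component j
    dens = np.column_stack([
        pis[j] * multivariate_normal.pdf(X, mean=mus[j], cov=covs[j])
        for j in range(k)
    ])
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: update parameters from the responsibilities
    Nj = gamma.sum(axis=0)              # effective number of points per component
    mus = (gamma.T @ X) / Nj[:, None]   # responsibility-weighted means
    covs = []
    for j in range(k):
        diff = X - mus[j]
        covs.append((gamma[:, j, None] * diff).T @ diff / Nj[j])
    pis = Nj / n
    return pis, mus, covs, gamma
# Example (hypothetical initialization): start from the K-means solution above
# pis0 = np.full(4, 0.25); mus0 = kmeans.cluster_centers_
# covs0 = [np.eye(X_scaled.shape[1])] * 4
# pis1, mus1, covs1, gamma1 = em_step(X_scaled, pis0, mus0, covs0)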
from sklearn.mixture import GaussianMixture
gmm = GaussianMixture(
n_components=4,
covariance_type="full", # Each component has its own full covariance matrix
n_init=5,
random_state=42
)
gmm.fit(X_scaled)
# Hard assignments
labels_gmm = gmm.predict(X_scaled)
# Soft assignments (probabilities)
probs = gmm.predict_proba(X_scaled)
print(f"Probability of point 0 belonging to each cluster: {probs[0].round(3)}")
# Model selection with BIC
bics = []
for k in range(2, 11):
gm = GaussianMixture(n_components=k, n_init=5, random_state=42)
gm.fit(X_scaled)
bics.append(gm.bic(X_scaled))
Covariance Types. Scikit-learn supports four covariance parameterizations, each offering a different trade-off between flexibility and the number of parameters:
"full": Each component has its own unconstrained covariance matrix. Most flexible but requires the most data."tied": All components share the same covariance matrix."diag": Each component has its own diagonal covariance matrix (features are independent within each component)."spherical": Each component has a single variance parameter (isotropic). Equivalent to K-means when covariances are equal.
Model Selection with BIC/AIC. Unlike the elbow method, GMMs provide a principled way to select $k$ using information criteria. The Bayesian Information Criterion (BIC) balances model fit against complexity:
$$\text{BIC} = -2 \ln \hat{L} + p \ln n$$
where $\hat{L}$ is the maximized likelihood, $p$ is the number of parameters, and $n$ is the number of data points. Lower BIC is better. The Akaike Information Criterion (AIC) uses a similar formula with a different penalty: $\text{AIC} = -2 \ln \hat{L} + 2p$.
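Continuing the BIC loop above, a short sketch that also records the AIC and picks the $k$ minimizing each criterion:
aics = []
for k in range(2, 11):
    gm = GaussianMixture(n_components=k, n_init=5, random_state=42)
    gm.fit(X_scaled)
    aics.append(gm.aic(X_scaled))
best_k_bic = range(2, 11)[int(np.argmin(bics))]
best_k_aic = range(2, 11)[int(np.argmin(aics))]
print(f"Best k by BIC: {best_k_bic}, best k by AIC: {best_k_aic}")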
7.2.5 Spectral Clustering
Spectral clustering is a powerful technique that can discover clusters with complex, non-convex shapes by leveraging the eigenstructure of a similarity graph. While K-means operates directly in feature space, spectral clustering first transforms the data into a space where clusters are easier to find.
Intuition. Imagine your data forms two interleaved half-moons. K-means fails because it assumes spherical clusters. Spectral clustering succeeds because it reformulates the problem as a graph partitioning task: build a graph where nearby points are connected, then find the "best" way to cut this graph into groups with minimal connections between them.
The Algorithm:
- Build a similarity graph. Construct an affinity matrix $\mathbf{A}$ where $A_{ij}$ measures the similarity between points $\mathbf{x}_i$ and $\mathbf{x}_j$. A common choice is the Gaussian (RBF) kernel:
$$A_{ij} = \exp\left(-\frac{\|\mathbf{x}_i - \mathbf{x}_j\|^2}{2\sigma^2}\right)$$
Alternatively, use a k-nearest-neighbor graph where $A_{ij} = 1$ if $\mathbf{x}_j$ is among the $k$ nearest neighbors of $\mathbf{x}_i$ (or vice versa).
- Compute the graph Laplacian. The unnormalized Laplacian is $\mathbf{L} = \mathbf{D} - \mathbf{A}$, where $\mathbf{D}$ is the degree matrix with $D_{ii} = \sum_j A_{ij}$. The normalized symmetric Laplacian $\mathbf{L}_{\text{sym}} = \mathbf{D}^{-1/2} \mathbf{L} \mathbf{D}^{-1/2}$ is often preferred for its better mathematical properties.
- Compute eigenvectors. Find the $k$ eigenvectors $\mathbf{u}_1, \ldots, \mathbf{u}_k$ corresponding to the $k$ smallest eigenvalues of $\mathbf{L}$. Stack them into a matrix $\mathbf{U} \in \mathbb{R}^{n \times k}$.
- Cluster in the embedded space. Treat each row of $\mathbf{U}$ as a new representation of the corresponding data point, and apply K-means in this $k$-dimensional space.
Why it works. The eigenvectors of the Laplacian encode the connectivity structure of the graph. The smallest eigenvalue is always zero (corresponding to a constant eigenvector), and the number of zero eigenvalues equals the number of connected components. For nearly-separated clusters, the smallest non-trivial eigenvectors approximately indicate cluster membership.
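The steps above translate almost line for line into NumPy. The following from-scratch sketch is for intuition only (the RBF bandwidth sigma is an arbitrary illustrative choice); scikit-learn's SpectralClustering, shown next, is what you would use in practice.
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.cluster import KMeans
def spectral_clustering_sketch(X, k, sigma=1.0):
    # 1. Affinity matrix with a Gaussian (RBF) kernel
    A = np.exp(-cdist(X, X, metric="sqeuclidean") / (2 * sigma ** 2))
    np.fill_diagonal(A, 0.0)
    # 2. Normalized symmetric Laplacian: L_sym = I - D^{-1/2} A D^{-1/2}
    d_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
    L_sym = np.eye(len(X)) - d_inv_sqrt @ A @ d_inv_sqrt
    # 3. Eigenvectors for the k smallest eigenvalues (eigh sorts ascending)
    _, eigvecs = np.linalg.eigh(L_sym)
    U = eigvecs[:, :k]
    # 4. K-means on the rows of U
    return KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(U)
labels_sketch = spectral_clustering_sketch(X_scaled, k=4)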
from sklearn.cluster import SpectralClustering
spectral = SpectralClustering(
n_clusters=4,
affinity="rbf", # Gaussian kernel
gamma=1.0, # Kernel bandwidth parameter
assign_labels="kmeans",
random_state=42
)
labels_spectral = spectral.fit_predict(X_scaled)
Strengths: Handles non-convex clusters, grounded in graph theory, works well when clusters are defined by connectivity rather than compactness. Limitations: Requires specifying $k$, computationally expensive for large datasets ($O(n^3)$ for eigendecomposition, though approximations exist), and sensitive to the similarity kernel and its parameters.
7.2.6 Comparing Clustering Algorithms
Choosing the right clustering algorithm depends on your data and objectives:
| Algorithm | Cluster Shape | Handles Noise | Scalability | Soft Assignments | Requires $k$ |
|---|---|---|---|---|---|
| K-means | Spherical | No | Excellent | No | Yes |
| Hierarchical | Depends on linkage | No | Poor ($O(n^2)$+) | No | Optional |
| DBSCAN | Arbitrary | Yes | Good | No | No |
| GMM | Ellipsoidal | No | Moderate | Yes | Yes |
| Spectral | Arbitrary (graph-based) | No | Poor ($O(n^3)$) | No | Yes |
Rules of thumb:
- Start with K-means for a quick baseline
- Use DBSCAN when you expect non-convex clusters or significant noise
- Use GMMs when you need probabilistic assignments or ellipsoidal clusters
- Use hierarchical clustering when you want to explore multiple granularity levels
- Use spectral clustering when clusters have complex shapes and the dataset is moderate-sized
7.2.7 Real-World Clustering Application: Customer Segmentation
To ground these algorithms in practice, consider a common AI engineering task: customer segmentation for an e-commerce platform. You have data on customer behavior---purchase frequency, average order value, recency of last purchase, browsing time, and category preferences.
Step 1: Preprocessing. As we discussed in Chapter 3, standardize all features to zero mean and unit variance. Purchase frequency and order value often have skewed distributions, so consider a log transform before scaling.
Step 2: Choose and apply an algorithm. Start with K-means as a baseline. Use the elbow method and silhouette analysis to choose $k$. Suppose you find $k = 5$ gives the best balance.
Step 3: Profile the clusters. Examine the cluster centroids and compute summary statistics for each cluster. You might discover segments like:
- High-value regulars: High frequency, high order value, recent purchases
- Bargain hunters: High frequency, low order value, drawn to discounts
- Dormant customers: Low recency, previously active but now inactive
- Window shoppers: High browsing time but low purchase frequency
- Occasional splurgers: Low frequency but very high order value
Step 4: Validate. Check cluster stability by running K-means with different random seeds. Compute silhouette scores. Discuss findings with domain experts. Good clusters should lead to actionable insights---for example, targeting "dormant customers" with re-engagement campaigns.
This workflow---preprocess, cluster, profile, validate---applies whether you are segmenting customers, grouping documents, or categorizing sensor readings.
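A condensed sketch of the four steps, assuming a hypothetical pandas DataFrame named customers with the behavioral columns described above (the column names are placeholders, not a real schema):
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
# Hypothetical behavioral features; replace with your own columns
features = ["frequency", "avg_order_value", "recency_days", "browsing_minutes"]
X_cust = customers[features].copy()
# Step 1: log-transform the skewed count/monetary features, then standardize
X_cust[["frequency", "avg_order_value"]] = np.log1p(X_cust[["frequency", "avg_order_value"]])
X_cust_scaled = StandardScaler().fit_transform(X_cust)
# Step 2: cluster (k chosen beforehand via elbow and silhouette analysis)
km = KMeans(n_clusters=5, n_init=10, random_state=42)
customers["segment"] = km.fit_predict(X_cust_scaled)
# Step 3: profile each segment with summary statistics
print(customers.groupby("segment")[features].mean().round(2))
# Step 4: quick quantitative sanity check before discussing with domain experts
print(f"Silhouette: {silhouette_score(X_cust_scaled, customers['segment']):.3f}")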
7.3 Dimensionality Reduction
High-dimensional data presents several challenges. As we discussed in Chapter 4 (Feature Engineering), the curse of dimensionality means that distances become less meaningful, models require exponentially more data, and computation becomes expensive. Dimensionality reduction addresses these challenges by projecting data into a lower-dimensional space while preserving as much relevant structure as possible.
7.3.1 Principal Component Analysis (PCA)
PCA is the most fundamental dimensionality reduction technique. It finds orthogonal directions of maximum variance in the data and projects onto those directions. These directions are called principal components.
Mathematical Foundation:
Given a centered data matrix $\mathbf{X} \in \mathbb{R}^{n \times d}$ (mean subtracted from each feature, as discussed in Chapter 3), PCA computes the eigendecomposition of the covariance matrix:
$$\mathbf{C} = \frac{1}{n-1} \mathbf{X}^T \mathbf{X} = \mathbf{V} \boldsymbol{\Lambda} \mathbf{V}^T$$
where $\mathbf{V}$ contains the eigenvectors (principal components) as columns and $\boldsymbol{\Lambda}$ is diagonal with eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_d \geq 0$.
The first principal component $\mathbf{v}_1$ is the direction that maximizes the variance of the projected data:
$$\mathbf{v}_1 = \arg\max_{\|\mathbf{v}\|=1} \text{Var}(\mathbf{X}\mathbf{v}) = \arg\max_{\|\mathbf{v}\|=1} \mathbf{v}^T \mathbf{C} \mathbf{v}$$
Each subsequent component maximizes variance subject to being orthogonal to all previous components.
Explained Variance Ratio. The fraction of total variance captured by the $j$-th component is:
$$\text{EVR}_j = \frac{\lambda_j}{\sum_{i=1}^{d} \lambda_i}$$
The cumulative explained variance ratio tells you how many components you need to retain a given fraction (e.g., 95%) of the total variance.
from sklearn.decomposition import PCA
# Fit PCA
pca = PCA(n_components=None) # Compute all components
pca.fit(X_scaled)
# Explained variance
evr = pca.explained_variance_ratio_
cumulative_evr = np.cumsum(evr)
print(f"Variance explained by first 2 components: {cumulative_evr[1]:.2%}")
# Find number of components for 95% variance
n_95 = np.argmax(cumulative_evr >= 0.95) + 1
print(f"Components needed for 95% variance: {n_95}")
# Transform to reduced dimensionality
pca_2d = PCA(n_components=2)
X_pca = pca_2d.fit_transform(X_scaled)
print(f"Reduced shape: {X_pca.shape}")
The Connection to Singular Value Decomposition (SVD). PCA can also be computed via SVD of the centered data matrix: $\mathbf{X} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^T$. The columns of $\mathbf{V}$ are the principal components, and the singular values $\sigma_i$ relate to the eigenvalues by $\lambda_i = \sigma_i^2 / (n - 1)$. SVD is numerically more stable than eigendecomposition of the covariance matrix and is the method scikit-learn uses internally.
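A quick numerical check of this equivalence, reusing the pca object fitted above on the scaled data:
# PCA via SVD of the centered data matrix
X_centered = X_scaled - X_scaled.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
# Singular values relate to eigenvalues by lambda_i = sigma_i^2 / (n - 1)
eigvals_from_svd = S ** 2 / (len(X_centered) - 1)
print(np.allclose(eigvals_from_svd, pca.explained_variance_))   # True
# The rows of Vt are the principal components (up to sign flips)
print(np.allclose(np.abs(Vt), np.abs(pca.components_)))          # True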
Worked Example. Suppose you have a dataset with 5 features and 100 samples. After standardization and PCA, you find that the first two components explain 85% of the variance, the third adds 8%, and the remaining two add 4% and 3%. To retain 95% of the variance, you would keep the first 3 components, reducing dimensionality from 5 to 3. Each component is a linear combination of the original features; for example, PC1 might be 0.6 * Feature1 + 0.5 * Feature2 - 0.3 * Feature3 + ..., which you can interpret by examining the loadings.
When to Use PCA:
- Preprocessing for supervised learning: Reduce feature dimensionality before training a classifier or regressor (Chapters 5-6). This can reduce overfitting and speed up training.
- Visualization: Project high-dimensional data to 2D or 3D for exploration.
- Noise reduction: Discarding low-variance components can remove noise. If the signal lives in the first few components and the noise spreads uniformly across all components, discarding the tail effectively denoises the data.
- Multicollinearity: PCA components are orthogonal, eliminating collinearity issues in linear models.
- Speeding up computation: Reducing a dataset from 10,000 features to 100 components makes subsequent algorithms dramatically faster.
Limitations of PCA:
- Captures only linear relationships; nonlinear structure requires t-SNE, UMAP, or autoencoders
- Assumes variance equals importance (not always true---a low-variance feature could be the most predictive)
- Components can be difficult to interpret, especially when many features contribute to each component
- Sensitive to feature scaling (always standardize first---see Chapter 3)
- Sensitive to outliers, which can disproportionately affect the directions of maximum variance
7.3.2 t-Distributed Stochastic Neighbor Embedding (t-SNE)
While PCA excels at preserving global structure (large distances), it often fails to reveal local structure in complex datasets. t-SNE (van der Maaten and Hinton, 2008) is a nonlinear technique designed specifically for visualization of high-dimensional data in 2D or 3D.
Historical Context. t-SNE evolved from an earlier method called Stochastic Neighbor Embedding (SNE), proposed by Hinton and Roweis in 2002. The original SNE used Gaussian distributions in both the high- and low-dimensional spaces, but it suffered from a "crowding problem" and an asymmetric cost function that was difficult to optimize. The breakthrough of t-SNE was the use of the Student's t-distribution in the low-dimensional space and a symmetrized objective function, which together produced dramatically better visualizations.
How t-SNE Works:
- Compute pairwise similarities in high-dimensional space using Gaussian distributions. For each pair of points $i, j$, the conditional probability that $i$ would pick $j$ as its neighbor is:
$$p_{j|i} = \frac{\exp(-\|\mathbf{x}_i - \mathbf{x}_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|\mathbf{x}_i - \mathbf{x}_k\|^2 / 2\sigma_i^2)}$$
where:
- $\sigma_i$ is the bandwidth of the Gaussian kernel centered on point $i$, and
- the denominator ensures the conditional probabilities sum to 1 over all points $k \neq i$.
The bandwidth $\sigma_i$ is set automatically so that the entropy of the conditional distribution $P_i$ matches a user-specified perplexity. Formally, $\sigma_i$ is chosen to satisfy:
$$\text{Perplexity}(P_i) = 2^{H(P_i)} = 2^{-\sum_{j} p_{j|i} \log_2 p_{j|i}}$$
This is solved via binary search for each point, which is why t-SNE adapts its kernel bandwidth locally---points in dense regions get smaller $\sigma_i$, and points in sparse regions get larger $\sigma_i$.
The joint probability is symmetrized: $p_{ij} = (p_{j|i} + p_{i|j}) / (2n)$.
- Compute pairwise similarities in low-dimensional space using a Student's t-distribution with one degree of freedom (a heavy-tailed distribution):
$$q_{ij} = \frac{(1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2)^{-1}}{\sum_{k \neq l} (1 + \|\mathbf{y}_k - \mathbf{y}_l\|^2)^{-1}}$$
- Minimize the KL divergence between $P$ and $Q$ using gradient descent:
$$KL(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}$$
where:
- $p_{ij}$ is the joint similarity in the high-dimensional space,
- $q_{ij}$ is the joint similarity in the low-dimensional embedding, and
- the KL divergence is asymmetric: it heavily penalizes cases where $p_{ij}$ is large but $q_{ij}$ is small (nearby points mapped far apart), while being lenient when $p_{ij}$ is small but $q_{ij}$ is large (distant points mapped close together).
This asymmetry is a feature, not a bug: t-SNE prioritizes preserving local neighborhoods over global distances, which is exactly what makes it effective for visualization.
The Crowding Problem and Why the t-Distribution Helps. In high-dimensional space there is vastly more room at moderate distances from a point than there is in two dimensions. This means that a point in $\mathbb{R}^{100}$ has room for many "moderately distant" neighbors, but when we embed into $\mathbb{R}^2$, all those moderately distant points crowd into a limited area. Using a Gaussian in the low-dimensional space forces these moderate-distance points too close together, crushing cluster structure. The Student's t-distribution, with its heavier tails, allocates more probability mass to moderate distances, giving clusters room to separate while keeping truly nearby points close.
The Gradient. The gradient of the KL divergence with respect to the embedding coordinate $\mathbf{y}_i$ is:
$$\frac{\partial KL}{\partial \mathbf{y}_i} = 4 \sum_{j} (p_{ij} - q_{ij})(1 + \|\mathbf{y}_i - \mathbf{y}_j\|^2)^{-1}(\mathbf{y}_i - \mathbf{y}_j)$$
Each term has three factors: the mismatch between $p_{ij}$ and $q_{ij}$ (the "error signal"), a weight that decays with distance (the t-distribution kernel), and the direction from $\mathbf{y}_j$ to $\mathbf{y}_i$. Points where $p_{ij} > q_{ij}$ (they should be closer) are attracted; points where $p_{ij} < q_{ij}$ (they should be farther) are repelled.
The Perplexity Parameter. The key hyperparameter is perplexity, which loosely corresponds to the effective number of neighbors considered for each point. It controls the bandwidth $\sigma_i$ of the Gaussian kernels:
- Low perplexity (5--10): Focuses on very local structure; may produce fragmented clusters
- Medium perplexity (30--50): Good default range; balances local and global structure
- High perplexity (100+): Emphasizes global structure; clusters may merge
from sklearn.manifold import TSNE
# t-SNE with default perplexity
tsne = TSNE(
n_components=2,
perplexity=30,
learning_rate="auto",
n_iter=1000,
random_state=42
)
X_tsne = tsne.fit_transform(X_scaled)
print(f"t-SNE output shape: {X_tsne.shape}")
print(f"Final KL divergence: {tsne.kl_divergence_:.4f}")
Critical Caveats for t-SNE:
- Not suitable for new data: t-SNE has no transform() method---it cannot project unseen points. For this, consider parametric t-SNE or UMAP.
- Distances between clusters are meaningless: Only local structure (within-cluster relationships) is preserved. The relative positions and distances between clusters should not be interpreted.
- Run multiple times: Results vary across runs. Always check stability.
- Scale matters: Always standardize features first.
- Reduce dimensionality first: For high-dimensional data ($d > 50$), apply PCA to 50 components before running t-SNE. This speeds up computation and can improve results.
7.3.3 Manifold Learning: The Underlying Idea
Before we explore UMAP, it helps to understand the broader concept of manifold learning. The manifold hypothesis states that high-dimensional real-world data often lies on or near a low-dimensional manifold embedded in the high-dimensional space. For example, images of a single person's face under varying lighting and pose form a low-dimensional manifold within the space of all possible pixel arrays.
PCA can only recover manifolds that are linear subspaces (hyperplanes). For curved manifolds---such as the "Swiss roll" (a 2D sheet rolled up in 3D)---we need nonlinear techniques. The family of algorithms designed for this purpose is called manifold learning, and it includes Isomap, Locally Linear Embedding (LLE), Laplacian Eigenmaps, t-SNE, and UMAP.
The key insight shared by most manifold learning methods is to preserve some notion of local geometry: if two points are close on the manifold, they should remain close in the embedding. What differs between methods is how they define "closeness" and what objective function they optimize.
7.3.4 Uniform Manifold Approximation and Projection (UMAP)
UMAP (McInnes, Healy, and Melville, 2018) is a more recent nonlinear technique that addresses several limitations of t-SNE while achieving similar or better visualization quality.
Conceptual Overview:
UMAP is grounded in Riemannian geometry and algebraic topology, but its practical mechanics can be understood without deep knowledge of these fields. At a high level, UMAP performs three steps:
- Build a weighted k-nearest-neighbor graph in high-dimensional space
- Optimize a low-dimensional layout to preserve the graph structure
- Use a cross-entropy loss (rather than KL divergence) that better preserves both local and global structure
Theoretical Foundation (Simplified). UMAP models the high-dimensional data as a fuzzy topological structure---a "fuzzy simplicial set." For each point, UMAP constructs a local metric based on the distance to its nearest neighbors, then builds a fuzzy graph where edge weights represent the probability that two points are connected. The key idea is that each point defines its own local notion of distance (by normalizing distances relative to the distance to its nearest neighbor), and these local metrics are combined into a single global structure via a fuzzy union operation. The low-dimensional embedding is then found by minimizing the cross-entropy between the high-dimensional fuzzy graph and an analogous graph in the embedding space.
The cross-entropy objective has the form:
$$\mathcal{L} = \sum_{(i,j)} \left[ w_{ij} \log\frac{w_{ij}}{v_{ij}} + (1 - w_{ij}) \log\frac{1 - w_{ij}}{1 - v_{ij}} \right]$$
where:
- $w_{ij}$ is the edge weight in the high-dimensional graph, and
- $v_{ij}$ is the edge weight in the low-dimensional embedding.
Unlike the KL divergence used in t-SNE, this cross-entropy penalizes both "pulling apart nearby points" and "pushing together distant points," which is why UMAP preserves global structure better than t-SNE.
Key Parameters:
- n_neighbors: Controls the balance between local and global structure (analogous to perplexity). Default: 15. Higher values preserve more global structure.
- min_dist: Controls how tightly points are packed in the embedding. Smaller values create denser clusters. Default: 0.1.
- metric: Distance metric in the input space. Supports Euclidean, cosine, Manhattan, and many others.
Worked Example: Exploring n_neighbors. Suppose you have a dataset of 10,000 handwritten digit images. With n_neighbors=5, UMAP focuses on very local structure: each digit class may fragment into subclusters (e.g., different writing styles of the digit "7"). With n_neighbors=50, UMAP considers broader neighborhoods: digit classes form cohesive blobs, and the relative positions of these blobs reflect global similarity (e.g., "3" and "8" are placed near each other because they share visual features). With n_neighbors=200, the embedding starts to resemble PCA---local detail is lost but the overall layout of classes is meaningful. A value between 10 and 50 usually provides the best balance for visualization.
# UMAP requires the umap-learn package
# pip install umap-learn
import umap
reducer = umap.UMAP(
n_components=2,
n_neighbors=15,
min_dist=0.1,
metric="euclidean",
random_state=42
)
X_umap = reducer.fit_transform(X_scaled)
print(f"UMAP output shape: {X_umap.shape}")
# Unlike t-SNE, UMAP can transform new data
# X_new_umap = reducer.transform(X_new_scaled)
UMAP for General Dimensionality Reduction. An important advantage of UMAP over t-SNE is that UMAP is not limited to 2D/3D visualization. You can set n_components to 10, 50, or any value and use the resulting features as input to downstream models. In this mode, UMAP acts as a nonlinear alternative to PCA, often capturing structure that PCA misses. For example, applying UMAP with 10 components before K-means can reveal clusters that are invisible in PCA-reduced space.
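A brief sketch of this pattern. Here X_highdim is a placeholder for a scaled, higher-dimensional feature matrix, and the parameter values are illustrative; min_dist=0.0 is a common choice when the embedding feeds a clustering step rather than a plot.
# Assume X_highdim is a scaled, higher-dimensional feature matrix (placeholder name)
reducer_10d = umap.UMAP(
    n_components=10,
    n_neighbors=30,
    min_dist=0.0,
    random_state=42
)
X_umap_10d = reducer_10d.fit_transform(X_highdim)
labels_umap_km = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X_umap_10d)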
t-SNE vs. UMAP:
| Feature | t-SNE | UMAP |
|---|---|---|
| Speed | Slow for large $n$ | Significantly faster |
| Transform new data | No | Yes |
| Global structure | Poorly preserved | Better preserved |
| Theoretical foundation | Information theory | Topology / category theory |
| Reproducibility | Varies across runs | More stable with random_state |
| Components | Typically 2-3 | Any (can be used for general reduction) |
| Scalability | $O(n^2)$ or $O(n \log n)$ with Barnes-Hut | $O(n^{1.14})$ approximate |
7.3.5 Matrix Factorization and Topic Modeling
Dimensionality reduction is not limited to numerical feature matrices. An important family of techniques applies matrix factorization to discover latent structure in data represented as matrices---most notably, term-document matrices in text analysis.
Non-Negative Matrix Factorization (NMF). Given a non-negative matrix $\mathbf{V} \in \mathbb{R}^{m \times n}_{\geq 0}$ (for example, a term-document matrix where entry $V_{ij}$ is the frequency of term $i$ in document $j$), NMF finds non-negative factors $\mathbf{W} \in \mathbb{R}^{m \times r}_{\geq 0}$ and $\mathbf{H} \in \mathbb{R}^{r \times n}_{\geq 0}$ such that:
$$\mathbf{V} \approx \mathbf{W} \mathbf{H}$$
where:
- $r$ is the number of latent components (analogous to topics),
- each column of $\mathbf{W}$ is a "topic"---a distribution over terms, and
- each column of $\mathbf{H}$ is a document's representation---a distribution over topics.
The non-negativity constraint is what makes NMF interpretable for text: topics are additive combinations of words (no negative word frequencies), and documents are additive combinations of topics. This is in contrast to PCA, where components can have negative loadings that are harder to interpret.
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer
# Build a term-document matrix
documents = [
"machine learning and deep learning",
"neural networks for image classification",
"clustering algorithms for data analysis",
"reinforcement learning in robotics",
]
vectorizer = TfidfVectorizer()
V = vectorizer.fit_transform(documents)
# Apply NMF to discover topics
nmf = NMF(n_components=2, random_state=42)
W = nmf.fit_transform(V) # Document-topic matrix
H = nmf.components_ # Topic-term matrix
# Display top words per topic
feature_names = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(H):
top_words = [feature_names[i] for i in topic.argsort()[-5:]]
print(f"Topic {topic_idx}: {', '.join(top_words)}")
NMF connects directly to the clustering ideas earlier in this chapter: documents that load heavily on the same topic are implicitly clustered together. In practice, NMF and Latent Dirichlet Allocation (LDA) are the two dominant techniques for topic modeling. NMF is simpler and faster; LDA provides a full probabilistic model with Bayesian inference. Both are valuable tools in the AI engineer's toolkit, especially when working with text data as we will explore further in Part IV.
7.3.6 Choosing a Dimensionality Reduction Method
The choice depends on your goal:
- Preprocessing for ML models: Use PCA. It is fast, deterministic, and produces components that work well as features. Set n_components to preserve 95% of variance.
- 2D/3D visualization: Use UMAP as a default. Fall back to t-SNE if UMAP results look questionable. Always try multiple parameter settings.
- Interpretable components: Use PCA. The loadings matrix tells you which original features contribute most to each component.
- Very large datasets ($n > 100{,}000$): Use PCA (near-instant) or UMAP (minutes). Avoid t-SNE unless you use an approximate implementation.
- General nonlinear reduction: Use UMAP with more components (e.g., 10--50) for downstream tasks.
7.4 Anomaly Detection
Anomaly detection (also called outlier detection) identifies observations that deviate significantly from the majority of the data. This is inherently an unsupervised problem: anomalies are, by definition, rare and unexpected, so labeled examples are scarce.
7.4.1 Statistical Methods
The simplest approach uses univariate statistics:
Z-Score Method. Flag points where $|z_j| > \theta$ for any feature $j$:
$$z_j = \frac{x_j - \mu_j}{\sigma_j}$$
A common threshold is $\theta = 3$ (corresponding to the 99.7% interval for Gaussian data). We encountered Z-scores in Chapter 3 during standardization.
IQR Method. For non-Gaussian data, use the interquartile range:
$$\text{Lower bound} = Q_1 - 1.5 \times \text{IQR}, \quad \text{Upper bound} = Q_3 + 1.5 \times \text{IQR}$$
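Both rules take only a few lines of NumPy. A sketch applied to a single feature of the scaled data:
# Univariate outlier flags via Z-score and IQR
x = X_scaled[:, 0]                      # one feature as an example
# Z-score rule: |z| > 3
z = (x - x.mean()) / x.std()
z_outliers = np.abs(z) > 3
# IQR rule: outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_outliers = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)
print(f"Z-score flags: {z_outliers.sum()}, IQR flags: {iqr_outliers.sum()}")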
Worked Example. A manufacturing sensor reads temperature values. The mean is $\mu = 72.0$°F and the standard deviation is $\sigma = 3.5$°F. A reading of $x = 85.0$°F gives a Z-score of:
$$z = \frac{85.0 - 72.0}{3.5} = \frac{13.0}{3.5} = 3.71$$
Since $|z| = 3.71 > 3$, this reading is flagged as anomalous. However, if the temperature is legitimately high due to a seasonal pattern, this would be a false alarm---which is why domain context always matters for anomaly detection.
These univariate methods are fast but cannot detect multivariate anomalies---points that are normal in each individual feature but abnormal in their combination. For example, a temperature of 80°F and a humidity of 90% might each be within normal ranges individually, but their combination might be physically impossible in the environment being monitored, signaling a sensor malfunction.
7.4.2 Isolation Forest
Isolation Forest (Liu, Ting, and Zhou, 2008) is based on a beautifully simple insight: anomalies are easier to isolate than normal points. If you randomly partition the feature space, anomalies will be separated from the rest in fewer splits.
How it Works:
- Build an ensemble of random isolation trees. Each tree recursively partitions the data by randomly selecting a feature and a random split value within the feature's range.
- The anomaly score of a point is based on the average path length across all trees. Shorter paths indicate anomalies.
The anomaly score is normalized:
$$s(\mathbf{x}, n) = 2^{-E[h(\mathbf{x})] / c(n)}$$
where $E[h(\mathbf{x})]$ is the average path length and $c(n)$ is a normalization constant. Scores close to 1 indicate anomalies; scores close to 0.5 indicate normal points.
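As a small sketch of this normalization, using the definition of $c(n)$ from the Liu et al. paper with the harmonic-number approximation $H(i) \approx \ln(i) + 0.5772$ (the path lengths below are illustrative values, not computed from real trees):
def c(n: int) -> float:
    # Average path length of an unsuccessful binary-search-tree search
    harmonic = np.log(n - 1) + 0.5772156649
    return 2 * harmonic - 2 * (n - 1) / n
def anomaly_score(avg_path_length: float, n: int) -> float:
    return 2 ** (-avg_path_length / c(n))
print(f"{anomaly_score(3.0, n=256):.3f}")   # short path -> score near 1 (anomalous)
print(f"{anomaly_score(10.0, n=256):.3f}")  # longer path -> score near 0.5 (normal)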
from sklearn.ensemble import IsolationForest
# contamination is the expected proportion of anomalies
iso_forest = IsolationForest(
n_estimators=100,
contamination=0.05, # Expect ~5% anomalies
random_state=42
)
iso_forest.fit(X_scaled)
# Predictions: 1 = normal, -1 = anomaly
anomaly_labels = iso_forest.predict(X_scaled)
anomaly_scores = iso_forest.decision_function(X_scaled)
n_anomalies = (anomaly_labels == -1).sum()
print(f"Anomalies detected: {n_anomalies}")
7.4.3 Local Outlier Factor (LOF)
LOF (Breunig et al., 2000) measures the local density around each point relative to its neighbors. A point with substantially lower density than its neighbors is considered an anomaly.
Key Concept: Local Reachability Density (LRD).
For a point $\mathbf{x}$, the local reachability density is:
$$\text{LRD}_k(\mathbf{x}) = \left( \frac{\sum_{\mathbf{o} \in N_k(\mathbf{x})} \text{reach-dist}_k(\mathbf{x}, \mathbf{o})}{|N_k(\mathbf{x})|} \right)^{-1}$$
The LOF score compares a point's LRD to its neighbors' LRDs:
$$\text{LOF}_k(\mathbf{x}) = \frac{1}{|N_k(\mathbf{x})|} \sum_{\mathbf{o} \in N_k(\mathbf{x})} \frac{\text{LRD}_k(\mathbf{o})}{\text{LRD}_k(\mathbf{x})}$$
LOF $\approx 1$ means the point has similar density to its neighbors (normal). LOF $\gg 1$ means the point is in a sparser region (anomaly).
from sklearn.neighbors import LocalOutlierFactor
lof = LocalOutlierFactor(
n_neighbors=20,
contamination=0.05
)
lof_labels = lof.fit_predict(X_scaled)
lof_scores = lof.negative_outlier_factor_
n_anomalies_lof = (lof_labels == -1).sum()
print(f"LOF anomalies: {n_anomalies_lof}")
LOF vs. Isolation Forest:
- LOF is density-based and excels at detecting local anomalies---points that are anomalous relative to their neighborhood, even if they are in a globally dense region.
- Isolation Forest is better for global anomalies and scales better to high dimensions.
- LOF requires computing k-nearest neighbors ($O(n^2)$ naive or $O(n \log n)$ with KD-trees), making it slower for large datasets.
7.4.4 Using Clustering for Anomaly Detection
Clustering algorithms can be repurposed for anomaly detection:
- K-means: Points far from all centroids (high distance to nearest centroid) are potential anomalies.
- DBSCAN: Points labeled as noise (label = -1) are natural anomaly candidates.
- GMM: Points with low probability under the mixture model are anomalies. Use gmm.score_samples(X) to get log-likelihoods.
# GMM-based anomaly detection
log_likelihoods = gmm.score_samples(X_scaled)
threshold = np.percentile(log_likelihoods, 5) # Bottom 5%
gmm_anomalies = log_likelihoods < threshold
print(f"GMM anomalies: {gmm_anomalies.sum()}")
7.5 Evaluating Unsupervised Learning
Evaluation in unsupervised learning is inherently challenging because there are no ground-truth labels to compare against. We rely on internal metrics (computed from the data alone) and, when available, external metrics (comparing to known labels).
7.5.1 Internal Evaluation Metrics
Silhouette Score. The silhouette coefficient for a single point $i$ is:
$$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$$
where:
- $a(i)$ = mean distance from $i$ to other points in the same cluster (cohesion)
- $b(i)$ = mean distance from $i$ to points in the nearest other cluster (separation)
The score ranges from $-1$ to $+1$:
- $s(i) \approx +1$: Point is well-clustered (far from neighboring clusters)
- $s(i) \approx 0$: Point is on or near the boundary between clusters
- $s(i) \approx -1$: Point is likely in the wrong cluster
The overall silhouette score is the mean across all points.
from sklearn.metrics import silhouette_score, silhouette_samples
# Overall score
sil_score = silhouette_score(X_scaled, labels)
print(f"Silhouette Score: {sil_score:.3f}")
# Per-sample scores for visualization
sil_samples = silhouette_samples(X_scaled, labels)
Calinski-Harabasz Index (Variance Ratio Criterion). Measures the ratio of between-cluster dispersion to within-cluster dispersion:
$$\text{CH} = \frac{\text{tr}(\mathbf{B}_k)}{\text{tr}(\mathbf{W}_k)} \times \frac{n - k}{k - 1}$$
where:
- $\mathbf{B}_k$ is the between-cluster scatter matrix, which measures how spread apart the cluster centroids are from the overall centroid,
- $\mathbf{W}_k$ is the within-cluster scatter matrix, which measures how tightly the points in each cluster are grouped, and
- the factor $(n - k) / (k - 1)$ adjusts for the number of clusters.
Higher values indicate better-defined clusters. The CH index favors convex, well-separated clusters and tends to increase monotonically with $k$ for non-clustered data, so it should be used in conjunction with other metrics.
Worked Example. Suppose you run K-means with $k = 3$ on a dataset of 300 points and obtain CH = 450. You then try $k = 4$ and get CH = 380. The decrease suggests that splitting into 4 clusters does not improve cluster definition over 3 clusters. However, if $k = 5$ yields CH = 520, the data may have a natural 5-cluster structure. Always plot CH against $k$ and look for peaks.
Davies-Bouldin Index. Measures the average similarity between each cluster and its most similar cluster:
$$\text{DB} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} \frac{S_i + S_j}{d(\boldsymbol{\mu}_i, \boldsymbol{\mu}_j)}$$
where:
- $S_i$ is the average distance of points in cluster $i$ to the centroid $\boldsymbol{\mu}_i$ (a measure of cluster spread),
- $d(\boldsymbol{\mu}_i, \boldsymbol{\mu}_j)$ is the distance between cluster centroids, and
- the ratio $(S_i + S_j) / d(\boldsymbol{\mu}_i, \boldsymbol{\mu}_j)$ measures the "similarity" between clusters $i$ and $j$---higher values mean the clusters overlap more.
Lower values indicate better clustering. A DB index of 0 would mean perfectly separated, infinitely compact clusters. In practice, values below 1.0 indicate good separation; values above 2.0 suggest significant cluster overlap. The Davies-Bouldin index is particularly useful because it penalizes configurations where large clusters are close together, a situation that other metrics like silhouette score may not flag as clearly.
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score
ch_score = calinski_harabasz_score(X_scaled, labels)
db_score = davies_bouldin_score(X_scaled, labels)
print(f"Calinski-Harabasz: {ch_score:.2f}")
print(f"Davies-Bouldin: {db_score:.3f}")
Comparing Internal Metrics. Each metric captures a different aspect of clustering quality:
| Metric | What It Measures | Best Value | Biases |
|---|---|---|---|
| Silhouette | Cohesion vs. separation | +1 | Favors convex clusters |
| Calinski-Harabasz | Between/within variance ratio | Higher is better | Favors spherical, equally-sized clusters |
| Davies-Bouldin | Worst-case cluster overlap | 0 | Sensitive to outlier clusters |
| Inertia (K-means) | Within-cluster sum of squares | Lower is better | Always decreases with $k$ |
No single metric tells the whole story. We recommend computing all three (silhouette, CH, DB) and looking for consistent signals across them. If silhouette and CH agree that $k = 4$ is optimal but DB suggests $k = 3$, investigate the boundary between the 3- and 4-cluster solutions to understand why they disagree.
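A brief sketch that scans $k$ and prints all three metrics side by side, which makes such disagreements easy to spot (reusing the imports above):
for k in range(2, 8):
    lbls = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    print(
        f"k={k}: silhouette={silhouette_score(X_scaled, lbls):.3f}, "
        f"CH={calinski_harabasz_score(X_scaled, lbls):.1f}, "
        f"DB={davies_bouldin_score(X_scaled, lbls):.3f}"
    )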
7.5.2 External Evaluation Metrics
When ground-truth labels are available (e.g., during development or benchmarking), we can use external metrics:
Adjusted Rand Index (ARI). Measures the agreement between two clusterings, adjusted for chance. Ranges from $-1$ (worse than random) to $+1$ (perfect agreement), with $0$ indicating random clustering.
Normalized Mutual Information (NMI). Measures the mutual information between the predicted and true clusterings, normalized to $[0, 1]$. A value of 1 indicates perfect agreement.
Adjusted Mutual Information (AMI). Like NMI but adjusted for chance. More reliable when cluster sizes are imbalanced.
from sklearn.metrics import (
adjusted_rand_score,
normalized_mutual_info_score,
adjusted_mutual_info_score
)
# These require ground-truth labels
ari = adjusted_rand_score(y_true, labels)
nmi = normalized_mutual_info_score(y_true, labels)
ami = adjusted_mutual_info_score(y_true, labels)
print(f"Adjusted Rand Index: {ari:.3f}")
print(f"Normalized Mutual Information: {nmi:.3f}")
print(f"Adjusted Mutual Information: {ami:.3f}")
7.5.3 Evaluating Dimensionality Reduction
For dimensionality reduction, evaluation depends on the goal:
For preprocessing: Evaluate by the performance of the downstream supervised model. Does PCA with 50 components yield similar accuracy to using all 500 features? If so, the reduction is successful.
For visualization: Evaluation is largely qualitative---do clusters separate visually? Do known groups appear as distinct regions? Quantitative measures include:
- Trustworthiness: Measures whether nearest neighbors in the embedding were also near in the original space. Values close to 1 are good.
- Reconstruction error: For PCA, the mean squared error between original and reconstructed data.
from sklearn.manifold import trustworthiness
trust_tsne = trustworthiness(X_scaled, X_tsne, n_neighbors=15)
print(f"t-SNE trustworthiness: {trust_tsne:.3f}")
# PCA reconstruction error
X_reconstructed = pca_2d.inverse_transform(X_pca)
recon_error = np.mean((X_scaled - X_reconstructed) ** 2)
print(f"PCA reconstruction MSE: {recon_error:.4f}")
7.5.4 Practical Evaluation Workflow
In practice, evaluating unsupervised learning requires a combination of quantitative metrics and qualitative assessment:
- Compute multiple internal metrics. No single metric captures all aspects of clustering quality. Use silhouette, Calinski-Harabasz, and Davies-Bouldin together.
- Visualize the results. Project clustered data to 2D using PCA or UMAP. Color points by cluster label. Look for clear separation and check whether assignments make intuitive sense.
- Stability analysis. Run the algorithm with different random seeds or on bootstrap samples. Stable algorithms should produce similar results.
- Domain validation. The ultimate test: do the discovered clusters correspond to meaningful real-world categories? This requires domain expertise and is often the most important evaluation step.
- Downstream task performance. If clusters are used as features or for data segmentation, evaluate by the performance of the downstream task.
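For the stability check in particular, the Adjusted Rand Index from Section 7.5.2 is a convenient measure of agreement between runs. A minimal sketch comparing K-means runs with different seeds:
seeds = (0, 1, 2)
runs = {
    s: KMeans(n_clusters=4, n_init=10, random_state=s).fit_predict(X_scaled)
    for s in seeds
}
for i, s1 in enumerate(seeds):
    for s2 in seeds[i + 1:]:
        ari_pair = adjusted_rand_score(runs[s1], runs[s2])
        print(f"ARI(seed {s1}, seed {s2}): {ari_pair:.3f}")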
7.6 Putting It All Together: An Unsupervised Learning Pipeline
Let us build a complete unsupervised learning pipeline that combines the techniques from this chapter. This pipeline follows the preprocessing principles from Chapter 3 and integrates with the scikit-learn API patterns from Chapters 5 and 6.
7.6.1 The Standard Workflow
Raw Data → Preprocessing → Dimensionality Reduction → Clustering → Evaluation → Interpretation
Each stage has multiple options, and the best combination depends on your data:
```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def build_unsupervised_pipeline(
    n_pca_components: int = 50,
    n_clusters: int = 5,
) -> Pipeline:
    """Build a complete unsupervised learning pipeline.

    Args:
        n_pca_components: Number of PCA components to retain.
        n_clusters: Number of clusters for K-means.

    Returns:
        A scikit-learn Pipeline ready for fit/predict.
    """
    return Pipeline([
        ("scaler", StandardScaler()),
        ("pca", PCA(n_components=n_pca_components)),
        ("kmeans", KMeans(n_clusters=n_clusters, n_init=10, random_state=42)),
    ])
```
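A minimal usage sketch, assuming a feature matrix `X` with more than 20 features is already loaded: fit the pipeline, read the cluster assignments, and score them in the same scaled, PCA-reduced space that K-means actually saw.

```python
pipeline = build_unsupervised_pipeline(n_pca_components=20, n_clusters=5)
labels = pipeline.fit_predict(X)

# Transform with every step except the final K-means to recover the reduced space
X_transformed = pipeline[:-1].transform(X)
print(f"Silhouette: {silhouette_score(X_transformed, labels):.3f}")
```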
7.6.2 Systematic Algorithm Comparison
When you do not know which algorithm will work best, compare them systematically:
```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

def compare_clustering_algorithms(
    X: np.ndarray,
    k_range: range = range(2, 11),
) -> dict:
    """Compare clustering algorithms across different k values.

    Args:
        X: Scaled feature matrix.
        k_range: Range of cluster counts to evaluate.

    Returns:
        Dictionary mapping algorithm names to lists of silhouette scores.
    """
    # DBSCAN is omitted here: it has no n_clusters parameter, so its
    # eps/min_samples settings must be tuned separately.
    results = {"kmeans": [], "agglomerative": [], "gmm": []}
    for k in k_range:
        # K-means
        km_labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        results["kmeans"].append(silhouette_score(X, km_labels))

        # Agglomerative (Ward linkage by default)
        agg_labels = AgglomerativeClustering(n_clusters=k).fit_predict(X)
        results["agglomerative"].append(silhouette_score(X, agg_labels))

        # GMM (soft assignments hardened by fit_predict)
        gmm_labels = GaussianMixture(
            n_components=k, n_init=5, random_state=42
        ).fit_predict(X)
        results["gmm"].append(silhouette_score(X, gmm_labels))
    return results
```
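A short usage sketch, again assuming a scaled matrix `X_scaled` from earlier in the chapter: run the comparison and report the best-scoring cluster count for each algorithm.

```python
scores = compare_clustering_algorithms(X_scaled)
for name, sils in scores.items():
    best_idx = int(np.argmax(sils))
    best_k = range(2, 11)[best_idx]
    print(f"{name}: best k = {best_k} (silhouette = {sils[best_idx]:.3f})")
```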
7.6.3 End-to-End Example
Here is a complete workflow on a realistic dataset:
```python
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import (
    silhouette_score,
    adjusted_rand_score,
    normalized_mutual_info_score,
)

# Load handwritten digits dataset (8x8 images, 10 classes)
digits = load_digits()
X, y = digits.data, digits.target
print(f"Original shape: {X.shape}")  # (1797, 64)

# Step 1: Scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Step 2: Reduce dimensionality
pca = PCA(n_components=0.95, random_state=42)  # Retain 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(f"Reduced shape: {X_reduced.shape}")  # (1797, k), where k components explain 95% of the variance

# Step 3: Cluster
kmeans = KMeans(n_clusters=10, n_init=20, random_state=42)
labels = kmeans.fit_predict(X_reduced)

# Step 4: Evaluate
print(f"Silhouette Score: {silhouette_score(X_reduced, labels):.3f}")
print(f"ARI: {adjusted_rand_score(y, labels):.3f}")
print(f"NMI: {normalized_mutual_info_score(y, labels):.3f}")

# Step 5: Visualize (project to 2D for plotting)
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X_scaled)
# Plot X_2d colored by labels for visual inspection (a plotting sketch follows)
```
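A minimal plotting sketch for the final step, assuming matplotlib is available: it colors the 2D PCA projection by K-means cluster assignment.

```python
import matplotlib.pyplot as plt

# Color the 2D projection by cluster assignment
plt.figure(figsize=(8, 6))
scatter = plt.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap="tab10", s=10)
plt.colorbar(scatter, label="Cluster")
plt.xlabel("First principal component")
plt.ylabel("Second principal component")
plt.title("Digits: PCA projection colored by K-means cluster")
plt.show()
```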
7.6.4 Common Pitfalls and Best Practices
Pitfall 1: Forgetting to scale. K-means, hierarchical clustering, PCA, t-SNE, and UMAP are all sensitive to feature scaling. Always standardize first (Chapter 3). The only exception is when features are already on comparable scales.
Pitfall 2: Using t-SNE distances for clustering. t-SNE distorts distances. Never cluster on t-SNE output. Instead, cluster on the original (or PCA-reduced) data and use t-SNE only for visualization.
Pitfall 3: Treating K-means inertia as an absolute metric. Inertia always decreases with more clusters (it reaches zero when $k = n$). It is only useful for relative comparisons across different $k$ values.
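As a concrete illustration of "relative comparison", here is a minimal sketch, assuming a scaled matrix `X_scaled`: record inertia across a range of $k$ values and look for the point of diminishing returns, rather than reading any single value in isolation.

```python
from sklearn.cluster import KMeans

# Inertia is only meaningful relative to other k values on the same data
inertias = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_scaled)
    inertias[k] = km.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.1f}")
```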
Pitfall 4: Ignoring cluster stability. A "good" silhouette score is meaningless if the clustering changes drastically with different random seeds. Always check stability.
Pitfall 5: Applying PCA to categorical features. PCA assumes linear, continuous features. For mixed data, consider Multiple Correspondence Analysis (MCA) or encode categoricals appropriately first (Chapter 3).
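For the "encode categoricals appropriately first" path, here is a minimal sketch with illustrative column names and toy data: one-hot encode the categorical column, scale the numeric ones, and only then apply PCA.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative mixed-type data
df = pd.DataFrame({
    "income": [42_000, 55_000, 61_000, 38_000],
    "age": [23, 41, 35, 29],
    "region": ["north", "south", "south", "west"],
})

preprocess = ColumnTransformer(
    [
        ("num", StandardScaler(), ["income", "age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["region"]),
    ],
    sparse_threshold=0.0,  # force a dense output so PCA can consume it
)

mixed_pca = Pipeline([
    ("preprocess", preprocess),
    ("pca", PCA(n_components=2)),
])
X_mixed = mixed_pca.fit_transform(df)
print(X_mixed.shape)  # (4, 2)
```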
Best Practice Summary:
- Preprocess: Scale features, handle missing values (Chapter 3)
- Reduce (if needed): PCA for datasets with $d > 50$
- Explore: Try multiple algorithms and parameter settings
- Evaluate: Use multiple internal metrics plus visualization
- Validate: Check stability and seek domain expert feedback
- Iterate: Refine based on evaluation and domain knowledge
7.7 Connections to Other Chapters
Unsupervised learning does not exist in a vacuum. It connects deeply to nearly every other chapter in this book:
- Chapter 1 (Introduction to AI Engineering): Unsupervised learning is one of the three major paradigms introduced in the ML taxonomy.
- Chapter 2 (Mathematical Foundations): The linear algebra behind PCA (eigendecomposition, SVD) and the probability theory behind GMMs (Gaussian distributions, EM algorithm) draw directly on the foundations established there.
- Chapter 3 (Data Preprocessing): Feature scaling is essential for distance-based methods. Unsupervised techniques can also serve as preprocessing steps (PCA for dimensionality reduction, clustering for feature generation).
- Chapter 4 (Feature Engineering): PCA components and cluster assignments can serve as engineered features for supervised models. t-SNE and UMAP embeddings provide powerful visualizations for understanding feature spaces.
- Chapter 5 (Classification): Cluster labels can be used as pseudo-labels for semi-supervised classification. Evaluation metrics like the confusion matrix have analogs in clustering evaluation (contingency tables).
- Chapter 6 (Regression): Clustering can segment a regression problem into subpopulations, each with its own model (a technique called "mixture of experts").
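To make that last connection concrete, here is a minimal cluster-then-regress sketch on synthetic data (variable names are illustrative): segment the observations with K-means, then fit a separate linear model to each segment.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 4))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=300)

# Segment the data, then fit one regression model per segment
segments = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
models = {}
for c in np.unique(segments):
    mask = segments == c
    models[c] = LinearRegression().fit(X[mask], y[mask])
    print(f"Segment {c}: {mask.sum()} samples, R^2 = {models[c].score(X[mask], y[mask]):.3f}")
```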
As we move forward into Part III of this book, you will see unsupervised techniques appear repeatedly---in neural network architectures (autoencoders in Chapter 14), natural language processing (word embeddings, topic models), and generative AI. The dimensionality reduction techniques from this chapter---PCA, t-SNE, and UMAP---will become indispensable visualization tools as you explore high-dimensional embedding spaces in later chapters. The clustering algorithms will reappear in image segmentation, document clustering, and data preprocessing for large-scale model training.
Perhaps most importantly, the evaluation challenges we discussed in this chapter---the absence of ground truth, the need for multiple metrics, the importance of domain expertise---are not unique to unsupervised learning. They resurface in generative AI (how do you evaluate the quality of generated text?), in reinforcement learning (how do you define a good policy?), and in many real-world applications where the "right answer" is ambiguous. The mindset of combining quantitative metrics with qualitative assessment and domain validation will serve you throughout this book and your career.
Summary
In this chapter, we covered the three pillars of unsupervised learning:
Clustering groups data points based on similarity. K-means is fast and effective for spherical clusters; hierarchical clustering reveals multi-scale structure; DBSCAN handles arbitrary shapes and noise; and Gaussian mixture models provide probabilistic, soft assignments.
Dimensionality reduction projects high-dimensional data into lower-dimensional spaces. PCA provides fast, linear, interpretable reduction; t-SNE creates high-quality 2D visualizations; and UMAP offers fast nonlinear reduction that preserves both local and global structure.
Anomaly detection identifies unusual observations. Statistical methods handle simple cases; Isolation Forest provides fast, scalable detection; and Local Outlier Factor captures local density variations.
Evaluation of unsupervised methods requires internal metrics (silhouette score, Calinski-Harabasz, Davies-Bouldin), external metrics when labels are available (ARI, NMI), visualization, stability analysis, and domain expertise.
The techniques in this chapter complete your foundation in classical machine learning. You now have the tools to handle both labeled and unlabeled data, to extract structure from raw observations, and to prepare high-dimensional datasets for downstream analysis. In the next chapter, we will build on these foundations as we transition to deep learning architectures that can learn even more complex representations.