Chapter 7: Key Takeaways
Core Concepts
- Unsupervised learning discovers structure in unlabeled data. Unlike supervised learning (Chapters 5--6), there are no target labels. The algorithm defines and optimizes its own notion of "good structure", whether that means compact clusters, maximum variance directions, or density-based groupings.
- The three pillars of unsupervised learning are clustering, dimensionality reduction, and anomaly detection. Each addresses a different question: "What groups exist?" (clustering), "How can I simplify this data?" (dimensionality reduction), and "What is unusual?" (anomaly detection).
- There is no single best clustering algorithm. K-means is fast and effective for spherical clusters; DBSCAN handles arbitrary shapes and noise; hierarchical clustering reveals multi-scale structure; and GMMs provide probabilistic soft assignments. The choice depends on data geometry, noise levels, scalability needs, and whether you need hard or soft assignments.
Clustering
- K-means minimizes the within-cluster sum of squares (inertia). It is the default starting point for clustering due to its simplicity, speed ($O(nkd)$ per iteration), and effectiveness on well-separated, roughly spherical clusters. Always use K-means++ initialization.
- The elbow method helps choose $k$, but it is not always definitive. Plot inertia vs. $k$ and look for the "elbow." When the elbow is ambiguous, supplement with silhouette analysis or BIC from GMMs (see the first sketch after this list).
- DBSCAN does not require specifying the number of clusters. Instead, it requires eps (the neighborhood radius) and min_samples (the density threshold). Use the k-distance plot to guide eps selection, as in the second sketch below. DBSCAN naturally labels outliers as noise.
- Gaussian Mixture Models generalize K-means. GMMs model each cluster as a Gaussian distribution with its own mean and covariance, providing soft (probabilistic) assignments. K-means is a special case of GMMs with spherical, equal-variance components.
- Use BIC for model selection with GMMs. The Bayesian Information Criterion balances model fit against complexity and provides a principled, objective criterion for choosing the number of components (see the third sketch below).
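As a minimal sketch of the K-means workflow above, the snippet below runs K-means++ over a range of $k$ values and records both inertia (for the elbow) and the silhouette score. The synthetic make_blobs data and the range of $k$ values are placeholder assumptions, not part of the chapter's examples.

```python
# Sketch: choose k for K-means with inertia (elbow) and silhouette score.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=1000, centers=4, random_state=42)  # stand-in data
X = StandardScaler().fit_transform(X)  # K-means is distance-sensitive (Chapter 3)

inertias, silhouettes = {}, {}
for k in range(2, 9):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42)
    labels = km.fit_predict(X)
    inertias[k] = km.inertia_                    # within-cluster sum of squares
    silhouettes[k] = silhouette_score(X, labels)  # cohesion vs. separation

# Look for the elbow in inertia; break ties with the silhouette score.
best_k = max(silhouettes, key=silhouettes.get)
print("inertia by k:     ", inertias)
print("silhouette by k:  ", silhouettes)
print("k with highest silhouette:", best_k)
```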
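A second sketch shows the k-distance heuristic for DBSCAN's eps. The eps value hard-coded below is only a placeholder for the value you would read off the knee of the plotted curve; the data is again synthetic.

```python
# Sketch: guide DBSCAN's eps with a k-distance curve, then cluster.
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=1000, centers=4, random_state=42)  # stand-in data
X = StandardScaler().fit_transform(X)

min_samples = 5
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)              # column 0 is the point itself
k_distances = np.sort(distances[:, -1])      # distance to the min_samples-th neighbor
# Plot k_distances and read a candidate eps off the knee of the curve.

labels = DBSCAN(eps=0.3, min_samples=min_samples).fit_predict(X)  # eps is a placeholder
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, "| noise points:", int((labels == -1).sum()))
```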
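A third sketch fits GMMs over a range of component counts, selects the one with the lowest BIC, and reads off soft assignments. The range of component counts and the full covariance type are assumptions for illustration.

```python
# Sketch: GMM model selection with BIC, plus soft (probabilistic) assignments.
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

X, _ = make_blobs(n_samples=1000, centers=4, random_state=42)  # stand-in data
X = StandardScaler().fit_transform(X)

models = {
    k: GaussianMixture(n_components=k, covariance_type="full", random_state=42).fit(X)
    for k in range(1, 9)
}
bics = {k: m.bic(X) for k, m in models.items()}
best_k = min(bics, key=bics.get)            # lower BIC is better
best = models[best_k]

resp = best.predict_proba(X)                # per-point membership probabilities
print("BIC by k:", {k: round(v, 1) for k, v in bics.items()})
print("selected components:", best_k)
print("first point's membership probabilities:", resp[0].round(3))
```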
Dimensionality Reduction
- PCA finds the directions of maximum variance. It is linear, deterministic, fast, and interpretable. Use it for preprocessing (retain 95% of the variance), denoising, and as a first-pass visualization. Always standardize features first (Chapter 3).
- t-SNE creates high-quality 2D visualizations by preserving local neighborhoods. It excels at revealing cluster structure that PCA misses, but it has critical caveats: inter-cluster distances are meaningless, results vary across runs, and it cannot transform new data.
- UMAP is the recommended default for nonlinear dimensionality reduction. It is faster than t-SNE, better preserves global structure, supports transforming new data, and works for general-purpose reduction (not just visualization). A short sketch of these tools follows this list.
- Never cluster on t-SNE output. t-SNE distorts distances. Cluster on the original or PCA-reduced data, then overlay the cluster labels on the t-SNE visualization for interpretation (see the second sketch below).
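The first sketch below strings the reduction tools together: standardize, keep enough principal components for 95% of the variance, then embed in 2D with t-SNE. The digits dataset is only a stand-in, and UMAP is wrapped in an import guard because it lives in the separate umap-learn package.

```python
# Sketch: PCA for 95% variance, then 2D embeddings with t-SNE and (optionally) UMAP.
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)              # stand-in dataset
X_std = StandardScaler().fit_transform(X)        # scale before distance-based methods

pca = PCA(n_components=0.95)                     # keep components covering 95% variance
X_pca = pca.fit_transform(X_std)
print("components kept:", pca.n_components_)

X_tsne = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_pca)

try:
    import umap                                  # pip install umap-learn
    X_umap = umap.UMAP(n_components=2, random_state=42).fit_transform(X_std)
except ImportError:
    X_umap = None                                # UMAP is optional in this sketch
```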
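The second sketch illustrates the "never cluster on t-SNE output" rule under the same assumptions: clusters are computed on the PCA-reduced data, and the t-SNE embedding is used only as a canvas for coloring points by those labels.

```python
# Sketch: cluster on PCA space, display the labels on a t-SNE embedding.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)              # stand-in dataset
X_std = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=0.95).fit_transform(X_std)

labels = KMeans(n_clusters=10, n_init=10, random_state=42).fit_predict(X_pca)
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X_pca)

plt.scatter(X_tsne[:, 0], X_tsne[:, 1], c=labels, s=5, cmap="tab10")
plt.title("Clusters computed on PCA space, shown on a t-SNE embedding")
plt.show()
```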
Anomaly Detection
- Isolation Forest detects anomalies based on ease of isolation. Anomalies require fewer random splits to isolate. It is fast, scalable, and effective for global anomalies in moderate-to-high dimensions.
- Local Outlier Factor (LOF) detects local anomalies. LOF compares a point's density to its neighbors' densities. It excels when anomalies are relative to their local context (e.g., a point that is sparse relative to its dense neighborhood). Both detectors appear in the first sketch after this list.
- Clustering algorithms can double as anomaly detectors. K-means (large distance to the nearest centroid), DBSCAN (the noise label), and GMMs (low log-likelihood) all provide natural anomaly scores, as shown in the second sketch below.
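A minimal sketch contrasting Isolation Forest and LOF on the same synthetic data. The contamination level and the data-generating scheme are assumptions for illustration, not recommendations.

```python
# Sketch: Isolation Forest (global anomalies) vs. Local Outlier Factor (local anomalies).
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 1, size=(950, 2)),     # dense "normal" cloud
               rng.uniform(-6, 6, size=(50, 2))])   # scattered anomalies

iso = IsolationForest(contamination=0.05, random_state=42)
iso_labels = iso.fit_predict(X)                     # -1 = anomaly, 1 = normal
iso_scores = iso.score_samples(X)                   # lower = more anomalous

lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)
lof_labels = lof.fit_predict(X)                     # -1 = anomaly, 1 = normal
lof_scores = lof.negative_outlier_factor_           # lower = more anomalous

print("Isolation Forest flagged:", int((iso_labels == -1).sum()))
print("LOF flagged:             ", int((lof_labels == -1).sum()))
```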
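A second sketch shows the clustering-based anomaly scores named above: distance to the nearest K-means centroid, DBSCAN's noise label, and a GMM's per-point log-likelihood. The thresholds and parameter values are placeholder assumptions.

```python
# Sketch: K-means, DBSCAN, and GMM doubling as anomaly detectors.
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(980, 2)),
               rng.uniform(-8, 8, size=(20, 2))])   # stand-in data with outliers

# K-means: a large distance to the nearest centroid suggests an anomaly.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
km_score = km.transform(X).min(axis=1)

# DBSCAN: points labelled -1 are noise.
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# GMM: a low log-likelihood under the fitted mixture suggests an anomaly.
gmm = GaussianMixture(n_components=3, random_state=0).fit(X)
gmm_score = -gmm.score_samples(X)                   # higher = more anomalous

km_flags = km_score > np.quantile(km_score, 0.98)   # flag the top 2% (placeholder cutoff)
gmm_flags = gmm_score > np.quantile(gmm_score, 0.98)
print("K-means flags:", int(km_flags.sum()))
print("DBSCAN noise points:", int((db_labels == -1).sum()))
print("GMM flags:", int(gmm_flags.sum()))
```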
Evaluation
- Evaluating unsupervised learning is inherently harder than evaluating supervised learning. Without ground-truth labels, we rely on internal metrics, visualization, stability analysis, and domain expertise.
- The silhouette score is the most versatile internal metric. It measures both cohesion (intra-cluster compactness) and separation (inter-cluster distance) on a scale from $-1$ to $+1$. Higher is better.
- Use multiple evaluation metrics together. The silhouette score, Calinski-Harabasz index, and Davies-Bouldin index each capture a different aspect of clustering quality; no single metric tells the whole story. The first sketch after this list combines them.
- When labels are available, use ARI and NMI. The Adjusted Rand Index and Normalized Mutual Information are the standard external metrics for comparing predicted clusters to ground-truth labels.
- Trustworthiness quantifies dimensionality reduction quality. It measures whether nearest neighbors in the embedding are consistent with nearest neighbors in the original space. Scores close to 1.0 are desirable (see the second sketch below).
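The first sketch below computes the three internal metrics side by side and, because the synthetic data happens to come with labels, the two external metrics as well. The data and cluster count are assumptions for illustration.

```python
# Sketch: internal metrics (no labels needed) alongside external metrics (ARI, NMI).
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (adjusted_rand_score, calinski_harabasz_score,
                             davies_bouldin_score, normalized_mutual_info_score,
                             silhouette_score)
from sklearn.preprocessing import StandardScaler

X, y_true = make_blobs(n_samples=1000, centers=4, random_state=42)  # stand-in data
X = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

# Internal: higher silhouette and Calinski-Harabasz are better; lower Davies-Bouldin is better.
print("silhouette:       ", silhouette_score(X, labels))
print("calinski-harabasz:", calinski_harabasz_score(X, labels))
print("davies-bouldin:   ", davies_bouldin_score(X, labels))

# External: only possible when ground-truth labels exist.
print("ARI:", adjusted_rand_score(y_true, labels))
print("NMI:", normalized_mutual_info_score(y_true, labels))
```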
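The second sketch scores a 2D embedding with scikit-learn's trustworthiness function; the digits dataset and the neighborhood size of 5 are assumptions.

```python
# Sketch: trustworthiness of a t-SNE embedding (values near 1.0 are desirable).
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE, trustworthiness
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)                 # stand-in dataset
X = StandardScaler().fit_transform(X)
X_2d = TSNE(n_components=2, random_state=42).fit_transform(X)

print("trustworthiness:", trustworthiness(X, X_2d, n_neighbors=5))
```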
Practical Workflow
- Always scale features before applying unsupervised algorithms. K-means, hierarchical clustering, PCA, t-SNE, and UMAP are all distance-sensitive. Standardization (zero mean, unit variance) is the default choice.
- The standard unsupervised pipeline is: preprocess, reduce, cluster, evaluate, interpret. Iterate on each step based on evaluation results and domain knowledge (a sketch of the full pipeline follows this list).
- Unsupervised learning connects to every other chapter. It uses the preprocessing from Chapter 3, can create features for the models in Chapters 5--6, builds on the math from Chapter 2, and provides foundational concepts for the deep learning techniques in later chapters.
- Domain expertise is the ultimate evaluator. Metrics and visualizations guide you, but the final judgment on whether discovered clusters or reduced representations are meaningful requires understanding the problem domain.
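As a closing sketch, the pipeline below walks through preprocess, reduce, cluster, and evaluate on stand-in data; the final interpret step is the manual inspection of cluster profiles with domain knowledge. The dataset, the range of $k$ values, and the choice of silhouette as the selection criterion are assumptions for illustration.

```python
# Sketch: preprocess -> reduce -> cluster -> evaluate, iterating over k.
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X, _ = load_digits(return_X_y=True)                    # stand-in dataset

# 1. Preprocess: standardize, since every method below is distance-sensitive.
X_std = StandardScaler().fit_transform(X)

# 2. Reduce: PCA retaining 95% of the variance.
X_red = PCA(n_components=0.95).fit_transform(X_std)

# 3. Cluster: K-means (K-means++ is the default init), scanning k.
best = None
for k in range(2, 15):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_red)
    score = silhouette_score(X_red, labels)            # 4. Evaluate each candidate
    if best is None or score > best[1]:
        best = (k, score, labels)

# 5. Interpret: inspect the chosen clustering with domain expertise.
print(f"chosen k = {best[0]}, silhouette = {best[1]:.3f}")
```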