Chapter 7: Key Takeaways

Core Concepts

  1. Unsupervised learning discovers structure in unlabeled data. Unlike supervised learning (Chapters 5--6), there are no target labels. The algorithm defines and optimizes its own notion of "good structure"---whether that means compact clusters, maximum variance directions, or density-based groupings.

  2. The three pillars of unsupervised learning are clustering, dimensionality reduction, and anomaly detection. Each addresses a different question: "What groups exist?" (clustering), "How can I simplify this data?" (dimensionality reduction), and "What is unusual?" (anomaly detection).

  3. There is no single best clustering algorithm. K-means is fast and effective for spherical clusters; DBSCAN handles arbitrary shapes and noise; hierarchical clustering reveals multi-scale structure; and GMMs provide probabilistic soft assignments. The choice depends on data geometry, noise levels, scalability needs, and whether you need hard or soft assignments.

Clustering

  1. K-means minimizes within-cluster sum of squares (inertia). It is the default starting point for clustering due to its simplicity, speed ($O(nkd)$ per iteration), and effectiveness on well-separated, roughly spherical clusters. Always use K-means++ initialization (see the clustering sketch after this list).

  2. The elbow method helps choose $k$, but it is not always definitive. Plot inertia vs. $k$ and look for the "elbow." When the elbow is ambiguous, supplement with silhouette analysis or BIC from GMMs.

  3. DBSCAN does not require specifying the number of clusters. Instead, it requires eps (neighborhood radius) and min_samples (density threshold). Use the k-distance plot to guide eps selection. DBSCAN naturally labels outliers as noise.

  4. Gaussian Mixture Models generalize K-means. GMMs model each cluster as a Gaussian distribution with its own mean and covariance, providing soft (probabilistic) assignments. K-means corresponds to the limiting case of a GMM with spherical, equal-variance components and hard assignments.

  5. Use BIC for model selection with GMMs. The Bayesian Information Criterion balances model fit against complexity and provides a principled, objective criterion for choosing the number of components.
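
The sketch below, assuming scikit-learn and synthetic blob data, pulls these points together: a K-means sweep that records inertia (for the elbow) and silhouette, DBSCAN with an illustrative eps, and BIC-driven selection of GMM components. All parameter values are assumptions for illustration, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score

# Illustrative synthetic data; scale features before distance-based clustering
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)
X = StandardScaler().fit_transform(X)

# K-means: sweep k, recording inertia (for an elbow plot) and silhouette
for k in range(2, 8):
    km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=42).fit(X)
    print(f"k={k}  inertia={km.inertia_:.1f}  "
          f"silhouette={silhouette_score(X, km.labels_):.3f}")

# DBSCAN: eps would normally come from a k-distance plot; 0.3 is a guess here
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)
n_clusters = len(set(db_labels)) - (1 if -1 in db_labels else 0)
print(f"DBSCAN found {n_clusters} clusters, {int(np.sum(db_labels == -1))} noise points")

# GMM: pick the number of components that minimizes BIC
bic = {n: GaussianMixture(n_components=n, random_state=42).fit(X).bic(X)
       for n in range(2, 8)}
print("Best n_components by BIC:", min(bic, key=bic.get))
```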

Dimensionality Reduction

  1. PCA finds directions of maximum variance. It is linear, deterministic, fast, and interpretable. Use it for preprocessing (retain 95% variance), denoising, and as a first-pass visualization. Always standardize features first (Chapter 3).

  2. t-SNE creates high-quality 2D visualizations by preserving local neighborhoods. It excels at revealing cluster structure that PCA misses but has critical caveats: inter-cluster distances are meaningless, results vary across runs, and it cannot transform new data.

  3. UMAP is the recommended default for nonlinear dimensionality reduction. It is faster than t-SNE, better preserves global structure, supports transforming new data, and works for general-purpose reduction (not just visualization).

  4. Never cluster on t-SNE output. t-SNE distorts distances. Cluster on the original or PCA-reduced data, then overlay cluster labels on the t-SNE visualization for interpretation (a sketch of this workflow follows the list).
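
Assuming scikit-learn and the umap-learn package, the following illustrates these points on the digits dataset: PCA keeps 95% of the variance, K-means clusters the PCA-reduced data (never the t-SNE output), and the 2D t-SNE and UMAP embeddings are produced only for visualization, with the cluster labels overlaid at plot time. Parameter choices are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
import umap  # provided by the umap-learn package

X = StandardScaler().fit_transform(load_digits().data)

# PCA for preprocessing: keep components explaining 95% of the variance
X_pca = PCA(n_components=0.95, random_state=42).fit_transform(X)

# Cluster on the PCA-reduced data, never on a t-SNE embedding
labels = KMeans(n_clusters=10, n_init=10, random_state=42).fit_predict(X_pca)

# 2D embeddings for visualization only; color points by `labels` when plotting
X_tsne = TSNE(n_components=2, random_state=42).fit_transform(X_pca)
X_umap = umap.UMAP(n_components=2, random_state=42).fit_transform(X_pca)
print(X_pca.shape, X_tsne.shape, X_umap.shape)
```

Unlike scikit-learn's TSNE, the fitted UMAP object also exposes transform() for embedding new data, which is one reason it is the recommended general-purpose default above.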

Anomaly Detection

  1. Isolation Forest detects anomalies based on ease of isolation. Anomalies require fewer random splits to isolate. It is fast, scalable, and effective for global anomalies in moderate-to-high dimensions.

  2. Local Outlier Factor (LOF) detects local anomalies. LOF compares a point's density to the densities of its neighbors. It excels when anomalies are defined relative to local context (e.g., a point whose local density is far lower than that of the dense neighborhood surrounding it); both detectors are sketched after this list.

  3. Clustering algorithms can double as anomaly detectors. K-means (large distance to nearest centroid), DBSCAN (noise label), and GMM (low log-likelihood) all provide natural anomaly scores.
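
A minimal sketch contrasting the two dedicated detectors on synthetic data with injected outliers, using scikit-learn only; the contamination level and dataset are assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Dense Gaussian inliers plus a handful of scattered outliers
rng = np.random.default_rng(42)
X_inliers = rng.normal(loc=0.0, scale=1.0, size=(300, 2))
X_outliers = rng.uniform(low=-6, high=6, size=(15, 2))
X = np.vstack([X_inliers, X_outliers])

# Isolation Forest: global anomalies isolate with few random splits
iso_pred = IsolationForest(contamination=0.05, random_state=42).fit(X).predict(X)

# LOF: flags points whose local density is much lower than their neighbors'
lof_pred = LocalOutlierFactor(n_neighbors=20, contamination=0.05).fit_predict(X)

# Both predictors use +1 for inliers and -1 for anomalies
print("Isolation Forest flagged:", int((iso_pred == -1).sum()))
print("LOF flagged:", int((lof_pred == -1).sum()))
```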

Evaluation

  1. Evaluating unsupervised learning is inherently harder than evaluating supervised learning. Without ground-truth labels, we rely on internal metrics, visualization, stability analysis, and domain expertise.

  2. The silhouette score is the most versatile internal metric. It measures both cohesion (intra-cluster compactness) and separation (inter-cluster distance) on a scale from $-1$ to $+1$. Higher is better.

  3. Use multiple evaluation metrics together. Silhouette score, Calinski-Harabasz index, and Davies-Bouldin index each capture different aspects of clustering quality. No single metric tells the whole story (see the metrics sketch after this list).

  4. When labels are available, use ARI and NMI. The Adjusted Rand Index and Normalized Mutual Information are the standard external metrics for comparing predicted clusters to ground-truth labels.

  5. Trustworthiness quantifies dimensionality reduction quality. It measures whether nearest neighbors in the embedding are consistent with nearest neighbors in the original space. Scores close to 1.0 are desirable.
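
The sketch below, assuming scikit-learn and the Iris dataset purely for illustration, computes the internal metrics from the data and predicted labels, the external metrics against the known species labels, and trustworthiness for a 2D PCA embedding.

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.manifold import trustworthiness
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score, adjusted_rand_score,
                             normalized_mutual_info_score)

X, y_true = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)

# Internal metrics: need only the data and the predicted labels
print(f"silhouette:        {silhouette_score(X, labels):.3f}")
print(f"calinski-harabasz: {calinski_harabasz_score(X, labels):.1f}")
print(f"davies-bouldin:    {davies_bouldin_score(X, labels):.3f}")

# External metrics: compare to ground-truth labels when they exist
print(f"ARI: {adjusted_rand_score(y_true, labels):.3f}")
print(f"NMI: {normalized_mutual_info_score(y_true, labels):.3f}")

# Trustworthiness of a 2D embedding relative to the original space
X_2d = PCA(n_components=2).fit_transform(X)
print(f"trustworthiness: {trustworthiness(X, X_2d, n_neighbors=5):.3f}")
```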

Practical Workflow

  1. Always scale features before applying unsupervised algorithms. K-means, hierarchical clustering, PCA, t-SNE, and UMAP are all distance-sensitive. Standardization (zero mean, unit variance) is the default choice.

  2. The standard unsupervised pipeline is: preprocess, reduce, cluster, evaluate, interpret. Iterate on each step based on evaluation results and domain knowledge (a compact version of the pipeline is sketched after this list).

  3. Unsupervised learning connects to every other chapter. It uses preprocessing from Chapter 3, can create features for models in Chapters 5--6, builds on the math from Chapter 2, and provides foundational concepts for deep learning techniques in later chapters.

  4. Domain expertise is the ultimate evaluator. Metrics and visualizations guide you, but the final judgment on whether discovered clusters or reduced representations are meaningful requires understanding the problem domain.
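
As a closing reference, here is a compact end-to-end sketch of the pipeline above (preprocess, reduce, cluster, evaluate), assuming scikit-learn and the Wine dataset as a stand-in; the final interpretation step is left to domain expertise, as noted above, and every parameter here is a placeholder to revisit for your own data.

```python
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# 1. Preprocess: standardize so no feature dominates the distance metric
X, _ = load_wine(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# 2. Reduce: PCA keeping 95% of the variance
X_reduced = PCA(n_components=0.95, random_state=42).fit_transform(X_scaled)

# 3. Cluster: try a small range of k and keep the labels for each
results = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_reduced)
    results[k] = (silhouette_score(X_reduced, labels), labels)

# 4. Evaluate: report the k with the best silhouette; 5. interpret with domain knowledge
best_k = max(results, key=lambda k: results[k][0])
print(f"best k={best_k}, silhouette={results[best_k][0]:.3f}")
```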