Chapter 7: Further Reading
Foundational Texts
Clustering
- Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning, 2nd ed. Springer. Chapters 13 (Prototypes and Nearest-Neighbors) and 14 (Unsupervised Learning) provide rigorous coverage of K-means, hierarchical clustering, and self-organizing maps. Freely available at https://hastie.su.domains/ElemStatLearn/.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapter 9 (Mixture Models and EM) is the definitive reference for Gaussian Mixture Models and the Expectation-Maximization algorithm. The treatment of latent variable models provides deep theoretical insight.
- Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. Chapter 21 covers clustering with a modern probabilistic perspective. Freely available at https://probml.github.io/pml-book/.
Dimensionality Reduction
- Jolliffe, I. T. (2002). Principal Component Analysis, 2nd ed. Springer. The definitive reference on PCA, covering theory, computation, variants, and applications across many domains.
- Shalizi, C. R. (2024). Advanced Data Analysis from an Elementary Point of View. Chapter 19 covers PCA from a statistical perspective with excellent intuition. Available at https://www.stat.cmu.edu/~cshalizi/ADAfaEPoV/.
Key Papers
Clustering Algorithms
- Arthur, D. and Vassilvitskii, S. (2007). "k-means++: The Advantages of Careful Seeding." Proceedings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms. The paper that introduced K-means++ initialization, now the default in virtually all implementations; see the short sketch after this list.
- Ester, M., Kriegel, H.-P., Sander, J., and Xu, X. (1996). "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise." Proceedings of KDD. The original DBSCAN paper. A landmark in density-based clustering.
- Campello, R. J. G. B., Moulavi, D., and Sander, J. (2013). "Density-Based Clustering Based on Hierarchical Density Estimates." PAKDD 2013. Introduces HDBSCAN, which extends DBSCAN by varying epsilon and extracting a hierarchy of clusters. Available as the hdbscan Python package.
- Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm." Journal of the Royal Statistical Society, Series B. The foundational paper on the EM algorithm, which underpins GMM parameter estimation.
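
Both k-means++ seeding and DBSCAN are exposed directly in scikit-learn. A minimal sketch, assuming scikit-learn is installed and using synthetic data purely for illustration:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans, DBSCAN

# Synthetic data for illustration only.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=0)

# k-means++ seeding (Arthur & Vassilvitskii, 2007) is scikit-learn's default init.
kmeans = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)

# DBSCAN (Ester et al., 1996): density-based; points labeled -1 are noise.
dbscan = DBSCAN(eps=0.5, min_samples=5).fit(X)

print(kmeans.labels_[:10], dbscan.labels_[:10])
```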
Dimensionality Reduction
- van der Maaten, L. and Hinton, G. (2008). "Visualizing Data using t-SNE." Journal of Machine Learning Research, 9, 2579--2605. The original t-SNE paper. Clear exposition of the method and its advantages over prior techniques.
- McInnes, L., Healy, J., and Melville, J. (2018). "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction." arXiv:1802.03426. The UMAP paper. Combines strong theoretical motivation from topology with practical performance that often surpasses t-SNE.
- Wattenberg, M., Viegas, F., and Johnson, I. (2016). "How to Use t-SNE Effectively." Distill. An excellent interactive article that demonstrates how t-SNE behavior changes with perplexity and other parameters; a short example follows this list. Essential reading for anyone using t-SNE. Available at https://distill.pub/2016/misread-tsne/.
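
As a companion to the Distill article, a minimal sketch of running t-SNE at two perplexities with scikit-learn; the digits dataset and the parameter values are illustrative only:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, _ = load_digits(return_X_y=True)

# The apparent structure of the embedding can change markedly with perplexity.
for perplexity in (5, 30):
    embedding = TSNE(n_components=2, perplexity=perplexity,
                     random_state=0).fit_transform(X)
    print(f"perplexity={perplexity}: embedding shape {embedding.shape}")
```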
Anomaly Detection
- Liu, F. T., Ting, K. M., and Zhou, Z.-H. (2008). "Isolation Forest." Proceedings of the 8th IEEE International Conference on Data Mining. Introduces the elegant idea that anomalies are easier to isolate with random partitioning; see the sketch after this list for the scikit-learn implementation.
- Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. (2000). "LOF: Identifying Density-Based Local Outliers." Proceedings of ACM SIGMOD. The original Local Outlier Factor paper, introducing the concept of local density-based anomaly scoring.
- Chandola, V., Banerjee, A., and Kumar, V. (2009). "Anomaly Detection: A Survey." ACM Computing Surveys, 41(3). A comprehensive survey covering statistical, classification-based, clustering-based, and information-theoretic approaches to anomaly detection.
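
Both methods above ship with scikit-learn as IsolationForest and LocalOutlierFactor. A minimal sketch, with synthetic data and illustrative parameter values:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data: a dense Gaussian blob plus a few scattered outliers.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(200, 2)), rng.uniform(-6, 6, size=(10, 2))])

iso = IsolationForest(contamination=0.05, random_state=0).fit(X)
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.05)

# Both estimators label predicted anomalies as -1.
print((iso.predict(X) == -1).sum(), (lof.fit_predict(X) == -1).sum())
```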
Evaluation
- Rousseeuw, P. J. (1987). "Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis." Journal of Computational and Applied Mathematics, 20, 53--65. The original silhouette coefficient paper.
- Hubert, L. and Arabie, P. (1985). "Comparing Partitions." Journal of Classification, 2, 193--218. Foundational work on the Rand Index and its adjusted version.
- Vinh, N. X., Epps, J., and Bailey, J. (2010). "Information Theoretic Measures for Clusterings Comparison: Variants, Properties, Normalization and Correction for Chance." Journal of Machine Learning Research, 11, 2837--2854. Comprehensive treatment of NMI, AMI, and related metrics. All three families of metrics are implemented in scikit-learn; see the sketch after this list.
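
A minimal sketch of the metrics from these three papers as implemented in scikit-learn, using a synthetic labeled dataset so the external indices (ARI, AMI) have a ground truth to compare against:

```python
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, adjusted_rand_score,
                             adjusted_mutual_info_score)

X, y_true = make_blobs(n_samples=300, centers=3, random_state=0)
y_pred = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, y_pred))                 # Rousseeuw (1987), internal index
print(adjusted_rand_score(y_true, y_pred))         # Hubert & Arabie (1985)
print(adjusted_mutual_info_score(y_true, y_pred))  # Vinh et al. (2010)
```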
Online Resources and Tutorials
- scikit-learn Clustering Documentation: https://scikit-learn.org/stable/modules/clustering.html --- Excellent practical guide with comparison of all clustering algorithms, including visual examples on synthetic datasets.
- scikit-learn Decomposition Documentation: https://scikit-learn.org/stable/modules/decomposition.html --- Covers PCA, Incremental PCA, Kernel PCA, and other decomposition methods with code examples.
- UMAP Documentation: https://umap-learn.readthedocs.io/ --- Official UMAP documentation with tutorials, parameter guides, and advanced usage (supervised UMAP, semi-supervised UMAP, inverse transform).
- StatQuest: K-means, PCA, t-SNE (YouTube): Josh Starmer's video explanations of these algorithms are exceptionally clear and intuitive. Recommended for visual learners.
- Google's "Clustering in Machine Learning" Course: Part of the Machine Learning Crash Course. Provides interactive exercises for K-means and hierarchical clustering.
Software Libraries
- scikit-learn (sklearn): The primary library used throughout this chapter. Provides KMeans, AgglomerativeClustering, DBSCAN, GaussianMixture, PCA, TSNE, IsolationForest, LocalOutlierFactor, and all evaluation metrics.
- umap-learn: The reference UMAP implementation. Install with pip install umap-learn. Supports GPU acceleration via cuml.
- hdbscan: Implements the HDBSCAN algorithm, which extends DBSCAN by selecting density thresholds automatically. Install with pip install hdbscan; a short usage sketch follows this list.
- yellowbrick: Visualization library built on scikit-learn that provides elbow plots, silhouette visualizers, and other diagnostic tools for clustering. Install with pip install yellowbrick.
- RAPIDS cuML: GPU-accelerated implementations of K-means, DBSCAN, PCA, UMAP, and other algorithms. Essential for large-scale unsupervised learning. Available at https://rapids.ai/.
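
A minimal sketch of the two third-party libraries above, assuming pip install umap-learn hdbscan and that their APIs match the current documentation; data and parameter values are illustrative only:

```python
import numpy as np
import umap      # installed as umap-learn
import hdbscan

X = np.random.default_rng(0).normal(size=(500, 20))

# UMAP embedding to 2D; n_neighbors and min_dist are the main knobs.
embedding = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0).fit_transform(X)

# HDBSCAN selects density thresholds itself; -1 marks points left as noise.
labels = hdbscan.HDBSCAN(min_cluster_size=10).fit_predict(X)

print(embedding.shape, np.unique(labels))
```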
Advanced Topics for Further Study
- Spectral Clustering: Uses the eigenvectors of a graph Laplacian built from a similarity matrix to embed the data before clustering. Effective for non-convex clusters. See sklearn.cluster.SpectralClustering and the sketch after this list.
- Kernel PCA: Applies PCA in a kernel-induced feature space, enabling nonlinear dimensionality reduction. Available as sklearn.decomposition.KernelPCA.
- Non-Negative Matrix Factorization (NMF): A dimensionality reduction method that produces non-negative components, leading to parts-based representations useful for text and image analysis.
- Autoencoders: Neural network-based dimensionality reduction (covered in Part III of this book). Autoencoders learn nonlinear mappings and can produce powerful low-dimensional representations.
- Variational Autoencoders (VAEs): A probabilistic extension of autoencoders that learns a generative model of the data. Combines deep learning with the probabilistic framework of GMMs.
- Contrastive Learning: A modern self-supervised approach where representations are learned by contrasting similar and dissimilar pairs. See SimCLR (Chen et al., 2020) and BYOL (Grill et al., 2020).
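
The first three topics are available directly in scikit-learn. A minimal sketch with illustrative parameters (the shift applied before NMF simply makes the input non-negative for the example):

```python
from sklearn.datasets import make_moons
from sklearn.cluster import SpectralClustering
from sklearn.decomposition import KernelPCA, NMF

# Two interleaving half-moons: non-convex clusters where K-means struggles.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
                            random_state=0).fit_predict(X)
X_kpca = KernelPCA(n_components=2, kernel="rbf", gamma=15).fit_transform(X)
X_nmf = NMF(n_components=2, init="nndsvda",
            random_state=0).fit_transform(X - X.min())  # NMF requires non-negative input

print(labels[:10], X_kpca.shape, X_nmf.shape)
```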