Further Reading: Chapter 21
Dimensionality Reduction: PCA, t-SNE, and UMAP
Foundational Papers
1. "Principal Component Analysis" --- Karl Pearson (1901) Pearson introduced PCA as a method for fitting lines and planes to data points in high-dimensional space. The original formulation is geometric: find the lower-dimensional hyperplane that minimizes the total squared perpendicular distance from the data points. Published in The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science. This is the paper that started it all, though the modern SVD-based implementation came later from Hotelling (1933).
2. "Visualizing Data using t-SNE" --- Laurens van der Maaten and Geoffrey Hinton (2008) The paper that introduced t-SNE. Van der Maaten and Hinton described how replacing the Gaussian kernel in the low-dimensional space (used in SNE) with a Student's t-distribution solved the "crowding problem" --- the tendency of SNE to crush nearby points together in low dimensions. The paper demonstrates t-SNE on handwritten digits (MNIST), gene expression data, and other high-dimensional datasets. Published in the Journal of Machine Learning Research, Vol. 9. This is the definitive reference for understanding how t-SNE works and why it outperforms earlier methods like Sammon mapping and Isomap for visualization.
3. "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction" --- Leland McInnes, John Healy, and James Melville (2018)
The paper that introduced UMAP. McInnes et al. ground the method in Riemannian geometry and algebraic topology, constructing a fuzzy topological representation of the high-dimensional data and then optimizing a low-dimensional layout to match it. The paper demonstrates that UMAP is faster than t-SNE, preserves more global structure, and supports a transform operation for embedding new data. Published as a preprint on arXiv (1802.03426). This is the paper to read if you want to understand UMAP's theoretical foundations and its practical advantages over t-SNE.
PCA Theory and Extensions
4. "Analysis of a Complex of Statistical Variables into Principal Components" --- Harold Hotelling (1933)
Hotelling formalized PCA in its modern algebraic form: computing eigenvalues and eigenvectors of the covariance matrix. Published in the Journal of Educational Psychology. Hotelling's formulation is the one used in textbooks and software today, including scikit-learn's PCA (which uses SVD, a numerically equivalent approach).
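The equivalence between Hotelling's covariance eigendecomposition and the SVD route used by scikit-learn can be checked directly. A sketch on synthetic data, assuming NumPy and scikit-learn are installed:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Synthetic correlated data (hypothetical, for illustration only)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))

# Hotelling's formulation: eigenvectors of the covariance matrix
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
top = eigvecs[:, np.argmax(eigvals)]  # eigenvector of the largest eigenvalue

# scikit-learn's SVD-based PCA recovers the same direction (up to sign)
pca = PCA(n_components=1).fit(X)
sk = pca.components_[0]
agreement = abs(np.dot(top, sk))  # close to 1.0
```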
5. "Kernel Principal Component Analysis" --- Bernhard Scholkopf, Alexander Smola, and Klaus-Robert Muller (1998)
The paper that extended PCA to non-linear manifolds using the kernel trick. By mapping data into a high-dimensional feature space via a kernel function (RBF, polynomial), Kernel PCA can capture non-linear structure that standard PCA misses. Published in Neural Computation, Vol. 10. Read this for the theory behind scikit-learn's KernelPCA, though in practice UMAP has largely replaced Kernel PCA for non-linear dimensionality reduction.
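A quick illustration of the kind of structure Kernel PCA captures and linear PCA misses, assuming scikit-learn is installed; the concentric-circles dataset and gamma value follow the standard textbook setup, not anything specific to this paper:

```python
from sklearn.datasets import make_circles
from sklearn.decomposition import PCA, KernelPCA
from sklearn.linear_model import LogisticRegression

# Two concentric rings: not linearly separable in the original 2-D space
X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

# Linear PCA of 2-D data is just a rotation; the rings stay entangled
lin = PCA(n_components=2).fit_transform(X)

# RBF Kernel PCA maps the rings to (approximately) linearly separable scores
kpca = KernelPCA(n_components=2, kernel="rbf", gamma=10).fit_transform(X)

# Crude separability check: a linear classifier on each embedding
acc_lin = LogisticRegression().fit(lin, y).score(lin, y)
acc_kpca = LogisticRegression().fit(kpca, y).score(kpca, y)
```

On this data `acc_kpca` is near perfect while `acc_lin` hovers near chance, which is the whole point of the kernel trick here.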
6. "Probabilistic Principal Component Analysis" --- Michael Tipping and Christopher Bishop (1999) Tipping and Bishop recast PCA as a latent variable model, deriving PCA from a probabilistic generative model. This formulation enables PCA with missing data, Bayesian model selection for the number of components, and mixtures of PCA models. Published in the Journal of the Royal Statistical Society, Series B. Read this if you need to handle missing values in PCA or want a principled way to select the number of components.
t-SNE Extensions and Critiques
7. "Accelerating t-SNE Using Tree-Based Algorithms" --- Laurens van der Maaten (2014)
Van der Maaten introduced Barnes-Hut t-SNE, which reduces the computational complexity from O(n^2) to O(n log n) using a Barnes-Hut tree approximation of the gradient. This is the algorithm implemented in scikit-learn's TSNE when method='barnes_hut' (the default). Published in the Journal of Machine Learning Research, Vol. 15. Read this if you need to understand why t-SNE is slow and what trade-offs the accelerated version makes.
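In scikit-learn the Barnes-Hut variant is simply the default; a small sketch assuming scikit-learn, using the bundled digits dataset subsampled to keep the run short:

```python
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

X, y = load_digits(return_X_y=True)
X, y = X[:500], y[:500]  # subsample for speed; 64 features per sample

# Barnes-Hut t-SNE: O(n log n) gradient via a tree approximation.
# `angle` trades accuracy for speed (smaller = closer to exact t-SNE).
tsne = TSNE(n_components=2, method="barnes_hut", angle=0.5,
            perplexity=30, init="pca", random_state=0)
emb = tsne.fit_transform(X)
print(emb.shape)  # (500, 2)
```

For exact gradients (and `n_components > 3`), scikit-learn offers `method="exact"`, at quadratic cost.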
8. "How to Use t-SNE Effectively" --- Martin Wattenberg, Fernanda Viegas, and Ian Johnson (2016) An interactive visual essay (distill.pub/2016/misread-tsne) that demonstrates the effects of perplexity, iteration count, and data structure on t-SNE output. The authors systematically show how t-SNE can produce misleading visualizations: apparent clusters from uniform data, varying cluster sizes from identical distributions, and perplexity-dependent structure. This is the single best resource for developing intuition about t-SNE artifacts. If you read one additional resource from this list, make it this one.
9. "The Art of Using t-SNE for Single-Cell Transcriptomics" --- Dmitry Kobak and Philipp Berens (2019) A practical guide to t-SNE in bioinformatics, but the lessons apply to any domain. Kobak and Berens discuss perplexity selection, initialization (PCA initialization is recommended), the importance of sufficient iterations, and the effect of dataset size on t-SNE behavior. Published in Nature Communications. Read this for practical recommendations that go beyond the default settings.
UMAP Theory and Practice
10. "Understanding UMAP" --- Andy Coenen and Adam Pearce (2019)
An interactive visual essay on Google's PAIR blog that walks through UMAP's algorithm step by step, including the fuzzy simplicial set construction, the cross-entropy objective, and the effect of n_neighbors and min_dist. Available at pair-code.github.io/understanding-umap. This is the UMAP equivalent of Wattenberg et al.'s t-SNE essay (item 8) and is equally essential.
11. "Initialization Is Critical for Preserving Global Data Structure in Both t-SNE and UMAP" --- Dmitry Kobak and George Linderman (2021) Kobak and Linderman demonstrate that both t-SNE and UMAP can preserve global structure much better when initialized with PCA (or other informative initialization) rather than random initialization. They recommend PCA initialization for both methods and show that with proper initialization, t-SNE can preserve global structure comparably to UMAP. Published in Nature Biotechnology. This paper challenges the claim that UMAP is inherently better at preserving global structure and attributes much of the difference to initialization.
12. "Benchmarking Dimensionality Reduction Methods on Single-Cell RNA-Sequencing Data" --- Sun, Zhu, Ma, Gao, Finley, and Fan (2019) A large-scale benchmark comparing PCA, t-SNE, UMAP, and other methods on single-cell transcriptomics data. The authors evaluate preservation of local structure (neighborhood recall), global structure (distance correlation), computational speed, and reproducibility. Key finding: UMAP consistently provides the best balance of local structure preservation and speed. Published in Genome Biology. Read this for empirical evidence comparing the three methods covered in this chapter.
Manifold Learning (Broader Context)
13. "A Global Geometric Framework for Nonlinear Dimensionality Reduction" --- Joshua Tenenbaum, Vin de Silva, and John Langford (2000) The paper that introduced Isomap, one of the first manifold learning methods. Isomap computes geodesic distances (shortest paths through the data graph) and then applies classical MDS to embed the data. While Isomap has been largely superseded by t-SNE and UMAP, it introduced the key idea that high-dimensional data often lies on a lower-dimensional manifold and that the manifold's intrinsic geometry matters more than the ambient Euclidean distances. Published in Science, Vol. 290.
14. "Nonlinear Dimensionality Reduction by Locally Linear Embedding" --- Sam Roweis and Lawrence Saul (2000) Published in the same issue of Science as Isomap, LLE takes a different approach: each point is reconstructed as a linear combination of its neighbors, and the same reconstruction weights are used in the low-dimensional embedding. LLE is faster than Isomap but more sensitive to noise. Read this alongside Isomap for the two founding approaches to manifold learning.
Practical Applications
15. "Dimensionality Reduction for Visualizing Industrial Health Monitoring Data" --- Verma, Hossain, and Khan (2020) A practical study comparing PCA, t-SNE, and UMAP for visualizing sensor data from industrial equipment. The authors demonstrate how UMAP reveals failure modes that PCA misses and how t-SNE's sensitivity to perplexity can lead to inconsistent conclusions. Published in Sensors, Vol. 20. This is useful for seeing dimensionality reduction applied outside the typical ML/bioinformatics context.
16. "Visualizing and Understanding Convolutional Networks" --- Matthew Zeiler and Rob Fergus (2014) While focused on deep learning, this paper demonstrates how t-SNE can be used to visualize the learned representations (embeddings) of neural networks. The approach --- extract feature vectors from an intermediate layer and plot them with t-SNE colored by class label --- is directly applicable to debugging recommendation models, as shown in Section 5 of this chapter. Published at ECCV 2014.
Software and Tools
17. scikit-learn User Guide --- Decomposition and Manifold Learning
scikit-learn's documentation for PCA, TruncatedSVD, IncrementalPCA, KernelPCA, and TSNE. Includes practical examples, parameter descriptions, and performance tips. The manifold learning section compares t-SNE with Isomap, LLE, and Spectral Embedding on standard datasets. Available at scikit-learn.org in the User Guide.
18. UMAP Library Documentation --- Leland McInnes
The official documentation for the umap-learn Python package. Covers basic usage, parameter tuning, supervised UMAP, plotting utilities, and integration with scikit-learn pipelines. The "How UMAP Works" page provides a readable summary of the algorithm. Available at umap-learn.readthedocs.io.
19. "openTSNE: A Modular Python Library for t-SNE Embeddings" --- Pavlin Policar, Martin Strazar, and Blaz Zupan (2019)
openTSNE provides a more flexible t-SNE implementation than scikit-learn, supporting embedding new points (via affinity-based interpolation), custom affinity kernels, and better control over initialization and optimization. Published in the Journal of Statistical Software. If you need t-SNE with transform-like functionality, openTSNE is the tool.
Responsible Use and Interpretation
20. "The Specious Art of Single-Cell Genomics" --- Tara Chari and Lior Pachter (2023) A provocative paper demonstrating that common visualizations in single-cell biology (primarily t-SNE and UMAP) can produce misleading results, including phantom clusters in continuous data and false separation between biological states. While the biological context is specific, the methodological warnings apply to any domain where t-SNE or UMAP is used for exploratory analysis. Published in PLOS Computational Biology. Read this as a cautionary tale about over-interpreting dimensionality reduction visualizations.
How to Use This List
If you read nothing else, read Wattenberg et al. (item 8) on t-SNE pitfalls and Coenen and Pearce (item 10) on UMAP internals. These two interactive essays will build intuition that reading papers alone cannot match.
If you want to understand the theory, read van der Maaten and Hinton (item 2) on t-SNE and McInnes et al. (item 3) on UMAP. Both are well-written and accessible to practitioners with a solid math background.
If you are deciding between t-SNE and UMAP for a specific application, read Sun et al. (item 12) for an empirical benchmark and Kobak and Linderman (item 11) for the argument that PCA initialization matters more than the choice of algorithm.
If you are worried about misinterpretation, read Chari and Pachter (item 20) for a rigorous demonstration of how t-SNE and UMAP can mislead, and always apply the validation workflow from Case Study 2 of this chapter.
This reading list supports Chapter 21: Dimensionality Reduction. Return to the chapter to review concepts before diving in.