Key Takeaways: Chapter 21
Dimensionality Reduction: PCA, t-SNE, and UMAP
-
PCA, t-SNE, and UMAP are three tools that serve two different purposes --- and confusing the purposes is the most common mistake. PCA is for preprocessing: compressing many features into fewer components for downstream models. t-SNE and UMAP are for visualization: projecting high-dimensional data into 2D to reveal structure that human eyes can see. PCA for visualization usually produces uninformative blobs. t-SNE for preprocessing produces non-deterministic, non-invertible features that are meaningless to a downstream model. Know which tool matches your goal before writing any code.
-
PCA finds the directions of maximum variance by computing eigenvectors of the covariance matrix. The first principal component captures the most variance, the second captures the next most (orthogonal to the first), and so on. The explained variance ratio tells you the fraction of total variance each component captures. The cumulative explained variance tells you how much information you retain with $k$ components. Always standardize features before PCA --- otherwise high-variance features dominate the components regardless of their importance.
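A minimal sketch of this workflow on synthetic data (the array shapes and variable names are illustrative, not from the chapter):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Synthetic example: 500 observations, 20 features on wildly different scales
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20)) * rng.uniform(0.1, 100.0, size=20)

# Standardize first -- otherwise the high-variance columns dominate the components
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=5).fit(X_std)
print(pca.explained_variance_ratio_)           # variance fraction per component
print(pca.explained_variance_ratio_.cumsum())  # cumulative variance with k components
```

The `explained_variance_ratio_` entries are sorted in decreasing order, so the cumulative sum directly answers "how much do I keep with k components?"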
-
Choose the number of PCA components using explained variance thresholds, scree plots, or downstream model performance. A cumulative variance threshold of 80-95% is the most common approach. The scree plot's "elbow" provides a visual heuristic. The most rigorous approach is to vary the number of components and measure the downstream model's cross-validated performance. For tree-based models (XGBoost, Random Forest), PCA preprocessing rarely helps because these models handle high-dimensional data natively. PCA is most valuable as preprocessing for linear models, SVMs, and kNN.
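One convenient way to apply a cumulative variance threshold in scikit-learn is to pass a float to `n_components`, which keeps the smallest number of components reaching that fraction. A sketch on synthetic correlated data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
# Correlated synthetic data: 20 features built from 5 underlying signals
base = rng.normal(size=(300, 5))
X = np.hstack([base + 0.1 * rng.normal(size=(300, 5)) for _ in range(4)])

X_std = StandardScaler().fit_transform(X)

# Float n_components = "keep enough components for 90% cumulative variance"
pca = PCA(n_components=0.90).fit(X_std)
print(pca.n_components_)                            # chosen k
print(pca.explained_variance_ratio_.cumsum()[-1])   # at least 0.90 by construction
```

Because the 20 features here are built from 5 signals, far fewer than 20 components suffice, which is exactly the situation where PCA preprocessing pays off.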
-
t-SNE preserves local neighborhoods but systematically distorts everything else. Three facts are non-negotiable: distances between clusters in a t-SNE plot are meaningless (they do not reflect actual feature-space distances); cluster sizes are meaningless (t-SNE normalizes density via adaptive bandwidths); and perplexity changes the visual result dramatically (always try multiple values). These are not subtle caveats --- they are the difference between correct and incorrect interpretation. Every t-SNE conclusion about inter-cluster distances, relative cluster sizes, or cluster counts is unreliable unless verified in the original feature space.
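Trying multiple perplexities can be as simple as looping over values with scikit-learn's `TSNE`; a sketch on two synthetic clusters:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Two synthetic clusters in 10-D
X = np.vstack([rng.normal(0, 1, size=(100, 10)),
               rng.normal(4, 1, size=(100, 10))])

# Perplexity changes the picture dramatically -- always compare several values
embeddings = {}
for perplexity in (5, 30, 50):
    tsne = TSNE(n_components=2, perplexity=perplexity, random_state=0, init="pca")
    embeddings[perplexity] = tsne.fit_transform(X)
    # Each embedding is (200, 2); plot them side by side before trusting any pattern
```

Note that perplexity must be smaller than the number of observations, and that only patterns which persist across settings deserve further investigation.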
-
UMAP is faster, more scalable, and preserves more global structure than t-SNE --- making it the default choice for most visualization tasks. UMAP uses approximate nearest neighbors and stochastic gradient descent, making it practical for datasets of 100K+ observations where t-SNE is impractical. It preserves more of the large-scale relationships between clusters, though it still distorts inter-cluster distances. Critically, UMAP has a
`transform` method that allows embedding new observations without refitting, which t-SNE cannot do.
-
UMAP's two key hyperparameters control the balance between local detail and global coherence.
`n_neighbors` (default 15) controls how many neighbors influence each point's position: low values preserve fine-grained local structure, high values preserve broader patterns. `min_dist` (default 0.1) controls how tightly points pack together: low values produce tight, separated clusters, high values spread points more uniformly. Always visualize the same data with 2-3 different hyperparameter settings to check that patterns are robust.
-
Every pattern you see in a t-SNE or UMAP plot must be validated in the original feature space. If UMAP shows a region of concentrated churners, confirm it by examining the original features of those points --- their
`days_since_last_session`, `payment_failures_6m`, and `support_tickets_90d`. If you see clusters, run a clustering algorithm on the original data and check whether it finds the same groups. If you see outliers, examine those observations in the original features. The visualization suggests where to look; the original data confirms what you found.
-
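The cluster check can be made quantitative: cluster both the 2D embedding and the original features, then compare the labelings. A sketch on synthetic data with three well-separated groups:

```python
import numpy as np
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Three well-separated synthetic clusters in 12-D
X = np.vstack([rng.normal(c, 1, size=(80, 12)) for c in (0, 6, 12)])

emb = TSNE(n_components=2, perplexity=30, random_state=0, init="pca").fit_transform(X)

# Clusters seen in the 2-D plot...
labels_emb = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(emb)
# ...must be re-found in the ORIGINAL feature space before you trust them
labels_orig = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Adjusted Rand index near 1.0 means both spaces agree on the grouping
print(adjusted_rand_score(labels_emb, labels_orig))
```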
PCA components are interpretable through their loadings, but the interpretations are fragile. Each principal component is a linear combination of the original features. The loadings tell you which features contribute most to each component. You might find that PC1 loads heavily on engagement features and PC2 on behavioral features --- but these labels are subjective and can change with different datasets or feature sets. Use loadings for understanding the structure of your data, not for naming features in production systems.
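Inspecting loadings is a one-liner once the PCA is fit; the rows of `components_` are the loadings of each component on the original features. A sketch with hypothetical feature names:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Hypothetical feature names; any interpretation of loadings is dataset-specific
features = ["sessions_30d", "minutes_per_session", "payment_failures_6m",
            "support_tickets_90d", "days_since_last_session"]
X = rng.normal(size=(400, len(features)))

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# Rows = components, columns = original features; each entry is a loading
loadings = pd.DataFrame(pca.components_, columns=features, index=["PC1", "PC2"])
print(loadings.T.round(2))  # inspect which features drive each component
```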
-
For sparse data, use TruncatedSVD instead of PCA. For data that does not fit in memory, use IncrementalPCA. Standard PCA centers the data by subtracting the mean, which destroys sparsity. TruncatedSVD operates directly on the sparse matrix. IncrementalPCA processes data in batches using
`partial_fit`, enabling PCA on datasets larger than available RAM. Both produce results comparable to standard PCA with appropriate settings.
-
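Both variants in miniature (the sparse matrix and batch sizes are synthetic stand-ins; in practice each batch would be read from disk):

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import TruncatedSVD, IncrementalPCA

rng = np.random.default_rng(0)

# Sparse case: TruncatedSVD works on the sparse matrix without densifying it
X_sparse = sparse.random(1000, 500, density=0.01, random_state=0, format="csr")
svd = TruncatedSVD(n_components=20, random_state=0)
X_reduced = svd.fit_transform(X_sparse)      # (1000, 20)

# Out-of-core case: IncrementalPCA consumes the data batch by batch
ipca = IncrementalPCA(n_components=20)
X_big = rng.normal(size=(5000, 100))
for batch in np.array_split(X_big, 10):      # pretend each batch comes from disk
    ipca.partial_fit(batch)
X_big_reduced = ipca.transform(X_big)        # (5000, 20)
```

One constraint worth knowing: each `partial_fit` batch must contain at least `n_components` rows.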
The deliverable is the business insight, not the plot. A UMAP visualization is a means to an end. The VP of Product does not need to understand eigenvectors, perplexity, or manifold learning. She needs to know that there are five customer archetypes, which ones are at risk, and what actions to take. The dimensionality reduction enables the discovery; the cluster profiles, feature statistics, and business recommendations are the actual output. Present the map with archetype labels. Keep the technical validation in your notebook.
If You Remember One Thing
PCA for preprocessing, t-SNE/UMAP for visualization --- different tools, different purposes. PCA compresses features deterministically and invertibly for downstream models. t-SNE and UMAP reveal visual structure in 2D but distort distances, densities, and cluster sizes. The most dangerous mistake is interpreting t-SNE or UMAP plots as if they faithfully represent the original feature space. They do not. Every visual finding must be confirmed with statistics on the original data. The visualization tells you where to look. The data tells you what is there.
These takeaways summarize Chapter 21: Dimensionality Reduction. Return to the chapter for full context.