In This Chapter
- PCA, t-SNE, and UMAP
- Different Tools, Different Purposes
- Part 1: Principal Component Analysis (PCA)
- Part 2: t-SNE --- Visualization of High-Dimensional Data
- Part 3: UMAP --- The Modern Alternative
- Part 4: Comparing the Three Methods
- Part 5: E-Commerce Product Embeddings
- Part 6: Advanced PCA Topics
- Part 7: Practical Workflow
- Summary
Chapter 21: Dimensionality Reduction
PCA, t-SNE, and UMAP
Learning Objectives
By the end of this chapter, you will be able to:
- Apply PCA for dimensionality reduction and explain variance retention
- Choose the number of components using explained variance ratios
- Use t-SNE for 2D visualization of high-dimensional data
- Apply UMAP as a faster, more structure-preserving alternative to t-SNE
- Distinguish between use cases: PCA for preprocessing vs. t-SNE/UMAP for visualization
Different Tools, Different Purposes
You have a 24-feature churn dataset. You want to reduce it to something smaller. But "reduce it to something smaller" hides two fundamentally different goals, and confusing them is one of the most common mistakes in applied data science.
Goal 1: Preprocessing. You want to compress 24 features into 5 or 10 components that retain most of the variance, then feed those components into a downstream model. You need the transformation to be deterministic, invertible (approximately), and applicable to new data. This is PCA.
Goal 2: Visualization. You want to plot 24-dimensional data in 2D so you can see clusters, outliers, and patterns. You do not care about invertibility. You do not need to apply the transformation to new data in production. You need the 2D layout to reveal meaningful structure. This is t-SNE or UMAP.
These are not interchangeable. PCA for visualization often produces uninformative blobs. t-SNE for preprocessing produces features that are non-deterministic, non-invertible, and meaningless to a downstream model. The single most important lesson in this chapter is knowing which tool matches which goal.
| | PCA | t-SNE | UMAP |
|---|---|---|---|
| Primary use | Preprocessing, compression | Visualization | Visualization (and some preprocessing) |
| Output dimensions | Any (typically 5-50) | 2 or 3 | 2 or 3 (can do more) |
| Preserves | Global variance | Local neighborhoods | Local + some global structure |
| Deterministic | Yes | No (random initialization) | No (random initialization) |
| Invertible | Yes (approximately) | No | No |
| Scales to new data | Yes (transform method) | No (must refit) | Yes (transform method) |
| Speed | Fast (linear algebra) | Slow (O(n^2) memory) | Fast (approximate nearest neighbors) |
We will build each technique from the ground up, starting with PCA.
Part 1: Principal Component Analysis (PCA)
The Intuition
PCA finds the directions in your data that capture the most variance. If your 24 features are correlated --- and they always are --- then the data does not truly live in 24 independent dimensions. It lives on a lower-dimensional surface embedded in that 24-dimensional space. PCA finds that surface.
Imagine a cloud of points in 3D that is shaped like a pancake --- spread wide in two directions but thin in the third. PCA rotates the coordinate system so the first axis points along the widest spread, the second along the next widest, and the third along the thin direction. If you drop the third axis, you lose very little information because there was almost no variation in that direction.
That is exactly what PCA does in 24 dimensions. It finds the directions of maximum variance (principal components), orders them by how much variance each captures, and lets you keep only the top few.
The Math (Just Enough)
PCA operates on the covariance matrix of your centered data. Given a data matrix $X$ with $n$ observations and $p$ features (centered so each feature has mean zero):
- Compute the covariance matrix: $C = \frac{1}{n-1} X^T X$
- Compute the eigenvalues $\lambda_1 \geq \lambda_2 \geq \dots \geq \lambda_p$ and corresponding eigenvectors $v_1, v_2, \dots, v_p$ of $C$
- The eigenvectors are the principal components (directions). The eigenvalues tell you how much variance each direction captures.
- Project the data onto the top $k$ eigenvectors to get $k$-dimensional data.
The explained variance ratio for component $i$ is:
$$\text{EVR}_i = \frac{\lambda_i}{\sum_{j=1}^{p} \lambda_j}$$
This is the fraction of total variance captured by that component. The cumulative sum tells you how much variance you retain with $k$ components.
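The recipe above is easy to verify directly with NumPy: build the covariance matrix of centered data, eigendecompose it, and compare the resulting variance ratios to what scikit-learn's `PCA` (which uses SVD internally) reports. A minimal sketch on a small synthetic matrix, not the StreamFlow data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))  # correlated features
Xc = X - X.mean(axis=0)                                  # center each feature

C = Xc.T @ Xc / (len(Xc) - 1)          # covariance matrix
eigvals = np.linalg.eigh(C)[0][::-1]   # eigh returns ascending; flip to descending

evr_manual = eigvals / eigvals.sum()   # explained variance ratios
evr_sklearn = PCA().fit(Xc).explained_variance_ratio_
print(np.allclose(evr_manual, evr_sklearn))  # True
```

The agreement holds to floating-point precision; the SVD route just avoids explicitly forming $C$, which matters for numerical stability when features are nearly collinear.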
Practical Note --- You do not need to implement this yourself. scikit-learn's `PCA` does the eigendecomposition (actually, it uses SVD, which is numerically more stable). What you need to understand is what the output means and how to choose $k$.
PCA on the StreamFlow Churn Dataset
Let us apply PCA to the StreamFlow SaaS churn dataset and see what the components reveal.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
np.random.seed(42)
n = 50000
streamflow = pd.DataFrame({
'monthly_hours_watched': np.random.exponential(18, n).round(1),
'sessions_last_30d': np.random.poisson(14, n),
'avg_session_minutes': np.random.exponential(28, n).round(1),
'unique_titles_watched': np.random.poisson(8, n),
'content_completion_rate': np.random.beta(3, 2, n).round(3),
'binge_sessions_30d': np.random.poisson(2, n),
'weekend_ratio': np.random.beta(2.5, 3, n).round(3),
'peak_hour_ratio': np.random.beta(3, 2, n).round(3),
'hours_change_pct': np.random.normal(0, 30, n).round(1),
'sessions_change_pct': np.random.normal(0, 25, n).round(1),
'months_active': np.random.randint(1, 60, n),
'plan_price': np.random.choice(
[9.99, 14.99, 19.99, 24.99], n, p=[0.35, 0.35, 0.20, 0.10]
),
'devices_used': np.random.randint(1, 6, n),
'profiles_active': np.random.randint(1, 5, n),
'payment_failures_6m': np.random.poisson(0.3, n),
'support_tickets_90d': np.random.poisson(0.8, n),
'days_since_last_session': np.random.exponential(5, n).round(0).clip(0, 60),
'recommendation_click_rate': np.random.beta(2, 8, n).round(3),
'search_frequency_30d': np.random.poisson(6, n),
'download_count_30d': np.random.poisson(3, n),
'share_count_30d': np.random.poisson(1, n),
'rating_count_30d': np.random.poisson(2, n),
'free_trial_convert': np.random.binomial(1, 0.65, n),
'referral_source': np.random.choice(
[0, 1, 2, 3], n, p=[0.50, 0.25, 0.15, 0.10]
),
})
churn_logit = (
-3.0
+ 0.08 * streamflow['days_since_last_session']
- 0.02 * streamflow['monthly_hours_watched']
- 0.04 * streamflow['sessions_last_30d']
+ 0.15 * streamflow['payment_failures_6m']
+ 0.10 * streamflow['support_tickets_90d']
- 0.03 * streamflow['content_completion_rate'] * 10
+ 0.05 * (streamflow['hours_change_pct'] < -30).astype(int)
- 0.01 * streamflow['months_active']
+ 0.08 * (streamflow['plan_price'] > 19.99).astype(int)
- 0.02 * streamflow['unique_titles_watched']
+ np.random.normal(0, 0.3, n)
)
churn_prob = 1 / (1 + np.exp(-churn_logit))
streamflow['churned'] = np.random.binomial(1, churn_prob)
X = streamflow.drop(columns=['churned'])
y = streamflow['churned']
Critical Step --- Always scale before PCA. PCA maximizes variance, so if one feature has a range of 0-60 and another has a range of 0-1, PCA will be dominated by the high-variance feature regardless of its importance. StandardScaler puts all features on equal footing.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled = pd.DataFrame(X_scaled, columns=X.columns)
# Fit PCA with all components to analyze variance
pca_full = PCA(random_state=42)
X_pca_full = pca_full.fit_transform(X_scaled)
print("Explained variance ratio (first 10 components):")
for i, evr in enumerate(pca_full.explained_variance_ratio_[:10]):
cumulative = pca_full.explained_variance_ratio_[:i+1].sum()
print(f" PC{i+1}: {evr:.4f} (cumulative: {cumulative:.4f})")
Output (the seed is fixed, so values are reproducible, but they may differ slightly across NumPy and scikit-learn versions):
Explained variance ratio (first 10 components):
PC1: 0.0726 (cumulative: 0.0726)
PC2: 0.0618 (cumulative: 0.1344)
PC3: 0.0534 (cumulative: 0.1878)
PC4: 0.0491 (cumulative: 0.2369)
PC5: 0.0468 (cumulative: 0.2837)
PC6: 0.0446 (cumulative: 0.3283)
PC7: 0.0432 (cumulative: 0.3715)
PC8: 0.0421 (cumulative: 0.4136)
PC9: 0.0416 (cumulative: 0.4552)
PC10: 0.0413 (cumulative: 0.4965)
The Scree Plot
The scree plot visualizes the explained variance per component. You are looking for an "elbow" --- a point where the curve bends sharply, indicating that adding more components yields diminishing returns.
fig, axes = plt.subplots(1, 2, figsize=(14, 5))
# Individual explained variance
axes[0].bar(
range(1, len(pca_full.explained_variance_ratio_) + 1),
pca_full.explained_variance_ratio_,
color='steelblue', alpha=0.8
)
axes[0].set_xlabel('Principal Component')
axes[0].set_ylabel('Explained Variance Ratio')
axes[0].set_title('Scree Plot')
# Cumulative explained variance
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
axes[1].plot(
range(1, len(cumulative) + 1), cumulative,
marker='o', markersize=4, color='steelblue'
)
axes[1].axhline(y=0.80, color='red', linestyle='--', label='80% threshold')
axes[1].axhline(y=0.90, color='orange', linestyle='--', label='90% threshold')
axes[1].set_xlabel('Number of Components')
axes[1].set_ylabel('Cumulative Explained Variance')
axes[1].set_title('Cumulative Variance')
axes[1].legend()
plt.tight_layout()
plt.savefig('pca_scree_plot.png', dpi=150, bbox_inches='tight')
plt.show()
Choosing the Number of Components
Three practical strategies:
Strategy 1: Cumulative variance threshold. Pick the number of components that capture a target percentage of variance --- typically 80%, 90%, or 95%.
def components_for_variance(pca, threshold=0.90):
"""Return the number of components needed to reach the variance threshold."""
cumulative = np.cumsum(pca.explained_variance_ratio_)
n_components = np.argmax(cumulative >= threshold) + 1
return n_components, cumulative[n_components - 1]
for threshold in [0.80, 0.90, 0.95]:
n, actual = components_for_variance(pca_full, threshold)
print(f"{threshold:.0%} variance: {n} components (actual: {actual:.4f})")
Strategy 2: The elbow method. Look at the scree plot and find where the curve flattens. This is subjective, which is why practitioners often prefer Strategy 1.
Strategy 3: Downstream performance. Vary the number of components and measure how the downstream model (e.g., the churn classifier) performs. This is the most rigorous approach but requires more computation.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
results = []
for n_comp in [5, 8, 10, 12, 15, 18, 20, 24]:
pipe = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=n_comp, random_state=42)),
('clf', LogisticRegression(max_iter=1000, random_state=42))
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')
results.append({
'n_components': n_comp,
'mean_auc': scores.mean(),
'std_auc': scores.std()
})
results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))
Practical Guidance --- In most tabular datasets, if the first 2-3 components do not capture at least 40-50% of the variance, the data has no dominant low-dimensional structure, and PCA is unlikely to produce useful 2D visualizations. It may still be useful for preprocessing (reducing 100 features to 20), but do not expect pretty scatter plots.
Interpreting the Components
Each principal component is a linear combination of the original features. The weights (loadings) tell you what each component "means."
loadings = pd.DataFrame(
pca_full.components_[:5].T,
columns=[f'PC{i+1}' for i in range(5)],
index=X.columns
)
# Show top 5 features by absolute loading for each component
for pc in loadings.columns:
top = loadings[pc].abs().nlargest(5)
print(f"\n{pc} - Top features by |loading|:")
for feat, val in top.items():
direction = "+" if loadings.loc[feat, pc] > 0 else "-"
print(f" {direction} {feat}: {loadings.loc[feat, pc]:.4f}")
Warning --- PCA components are mathematical constructs, not business concepts. PC1 might load heavily on `monthly_hours_watched`, `sessions_last_30d`, and `avg_session_minutes` --- you could call it an "engagement" component. But this labeling is subjective and can be misleading. Use loadings for understanding, not for naming features in production.
PCA for Preprocessing: A Pipeline
When using PCA for preprocessing, always wrap it in a pipeline so the scaler and PCA are fit on the training data only.
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42, stratify=y
)
# Without PCA
model_full = Pipeline([
('scaler', StandardScaler()),
('clf', XGBClassifier(
n_estimators=200, max_depth=4, learning_rate=0.1,
random_state=42, eval_metric='logloss'
))
])
model_full.fit(X_train, y_train)
auc_full = roc_auc_score(y_test, model_full.predict_proba(X_test)[:, 1])
# With PCA (10 components)
model_pca = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=10, random_state=42)),
('clf', XGBClassifier(
n_estimators=200, max_depth=4, learning_rate=0.1,
random_state=42, eval_metric='logloss'
))
])
model_pca.fit(X_train, y_train)
auc_pca = roc_auc_score(y_test, model_pca.predict_proba(X_test)[:, 1])
print(f"AUC without PCA (24 features): {auc_full:.4f}")
print(f"AUC with PCA (10 components): {auc_pca:.4f}")
When PCA Helps --- PCA is most valuable as preprocessing when you have many correlated features (100+ sensor readings, genomics, NLP embeddings) and a downstream model that struggles with high dimensionality (logistic regression, SVM, kNN). Tree-based models like XGBoost handle high-dimensional data natively and rarely benefit from PCA preprocessing. Use PCA for preprocessing when (a) the number of features is large relative to the number of observations, (b) features are highly correlated, or (c) you need to reduce computational cost.
Reconstruction Error
Since PCA is approximately invertible, you can reconstruct the original data from the reduced representation. The reconstruction error tells you how much information was lost.
pca_10 = PCA(n_components=10, random_state=42)
X_reduced = pca_10.fit_transform(X_scaled)
X_reconstructed = pca_10.inverse_transform(X_reduced)
# Per-feature reconstruction error (RMSE)
recon_error = np.sqrt(((X_scaled.values - X_reconstructed) ** 2).mean(axis=0))
recon_df = pd.DataFrame({
'feature': X.columns,
'rmse': recon_error
}).sort_values('rmse', ascending=False)
print("Features with highest reconstruction error (10 components):")
print(recon_df.head(10).to_string(index=False))
Features with high reconstruction error are not well-represented by the top 10 components. This is informative --- it tells you which features carry unique information that is orthogonal to the dominant patterns.
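There is also a useful sanity check on the total error: for a $k$-component PCA, the mean squared reconstruction error summed over features equals the total variance times the fraction of variance not captured (i.e., $1$ minus the cumulative explained variance ratio). A small sketch on synthetic data, not the StreamFlow frame:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(2000, 8)) @ rng.normal(size=(8, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0)   # standardize, as before PCA

pca = PCA(n_components=3).fit(X)
X_rec = pca.inverse_transform(pca.transform(X))

# Average squared reconstruction error per observation...
mse = ((X - X_rec) ** 2).sum(axis=1).mean()
# ...equals total variance times the variance ratio NOT captured
discarded = 1 - pca.explained_variance_ratio_.sum()
total_var = X.var(axis=0).sum()
print(mse, discarded * total_var)  # the two values agree
```

This identity is why "keep 90% of the variance" and "tolerate 10% reconstruction error" are the same statement for PCA.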
Part 2: t-SNE --- Visualization of High-Dimensional Data
A Different Goal Entirely
PCA preserves global variance structure. That makes it good for preprocessing but often poor for visualization. In 2D, PCA tends to produce a single blob where structure is hard to see --- especially when the interesting structure is non-linear.
t-SNE (t-distributed Stochastic Neighbor Embedding) was designed specifically for visualization. It preserves local neighborhoods: if two points are close in 24-dimensional space, t-SNE tries to place them close in 2D. It does not care about preserving global distances or variance. This makes it excellent at revealing clusters and local structure.
How t-SNE Works (Conceptual)
t-SNE operates in three steps:

1. Build a probability distribution over pairs in high-dimensional space. For each point, t-SNE computes a probability for every other point based on their distance. Nearby points get high probability; distant points get low probability. This uses a Gaussian kernel, and the bandwidth of the Gaussian is controlled by the perplexity parameter.

2. Build a probability distribution over pairs in low-dimensional space. Points are randomly placed in 2D, and a similar probability distribution is computed using a t-distribution (heavier tails than the Gaussian, which helps prevent crowding).

3. Minimize the KL divergence between the two distributions. t-SNE moves the 2D points around until the 2D neighborhood structure matches the high-dimensional neighborhood structure as closely as possible.
The result is a 2D layout where clusters are visible and local structure is preserved.
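To make the first step concrete, here is a toy computation of the Gaussian affinities $p_{j|i}$ for a single point. A fixed bandwidth stands in for the per-point binary search that t-SNE actually performs to hit the target perplexity:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))   # six points in 3D
i = 0                         # the focus point
sigma = 1.0                   # fixed bandwidth; real t-SNE tunes this per point

# Squared distances from point i to every point
d2 = ((X - X[i]) ** 2).sum(axis=1)
p = np.exp(-d2 / (2 * sigma ** 2))
p[i] = 0.0                    # a point is not its own neighbor
p /= p.sum()                  # conditional probabilities p_{j|i}, summing to 1

print(p.round(3))             # nearby points carry most of the probability mass
```

Raising sigma (higher perplexity) flattens this distribution so that more distant points count as neighbors; lowering it concentrates the mass on the closest few.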
The Mandatory t-SNE Warnings
Before you run a single line of t-SNE code, you must internalize three facts. These are not caveats buried in a footnote. They are the difference between using t-SNE correctly and drawing false conclusions from it.
Warning 1: Distances between clusters are meaningless. If t-SNE shows three clusters, the distances between those clusters in the 2D plot have no relationship to the distances between those clusters in the original space. Cluster A might be close to Cluster B in the plot but far away in 24D, or vice versa. You cannot conclude "these two groups are similar" from their proximity in a t-SNE plot.
Warning 2: Cluster sizes are meaningless. t-SNE distorts densities. A tight cluster in the plot might be spread out in the original space. A large, diffuse cloud in the plot might be a tight cluster in 24D. You cannot conclude "this group is more homogeneous" from a tight t-SNE cluster.
Warning 3: Perplexity changes everything. Different perplexity values produce radically different plots from the same data. A perplexity of 5 emphasizes very local structure and produces many small clusters. A perplexity of 50 captures broader structure and may merge small clusters. There is no single "correct" perplexity --- you should always try multiple values and look for patterns that are robust across perplexities.
These warnings are not hypothetical. I have seen teams present t-SNE plots in executive reviews and draw conclusions about customer segment distances ("Segment A is closest to Segment B, suggesting similar behavior") that were entirely artifacts of the visualization.
t-SNE on StreamFlow
from sklearn.manifold import TSNE
# Subsample for speed --- t-SNE is O(n^2) in memory
sample_idx = np.random.RandomState(42).choice(len(X_scaled), 5000, replace=False)
X_sample = X_scaled.iloc[sample_idx]
y_sample = y.iloc[sample_idx]
tsne = TSNE(
n_components=2,
perplexity=30,
random_state=42,
n_iter=1000,
learning_rate='auto',
init='pca'
)
X_tsne = tsne.fit_transform(X_sample)
fig, ax = plt.subplots(figsize=(10, 8))
scatter = ax.scatter(
X_tsne[:, 0], X_tsne[:, 1],
c=y_sample, cmap='coolwarm', alpha=0.4, s=8
)
ax.set_xlabel('t-SNE 1')
ax.set_ylabel('t-SNE 2')
ax.set_title('t-SNE of StreamFlow Churn Data (perplexity=30)')
plt.colorbar(scatter, label='Churned')
plt.savefig('tsne_churn.png', dpi=150, bbox_inches='tight')
plt.show()
Perplexity: The Critical Hyperparameter
Perplexity roughly corresponds to the number of effective nearest neighbors. It controls the balance between local and global structure.
fig, axes = plt.subplots(2, 2, figsize=(14, 12))
perplexities = [5, 15, 30, 50]
for ax, perp in zip(axes.ravel(), perplexities):
tsne = TSNE(
n_components=2,
perplexity=perp,
random_state=42,
n_iter=1000,
learning_rate='auto',
init='pca'
)
X_tsne = tsne.fit_transform(X_sample)
ax.scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_sample,
cmap='coolwarm', alpha=0.4, s=8)
ax.set_title(f'Perplexity = {perp}')
ax.set_xlabel('t-SNE 1')
ax.set_ylabel('t-SNE 2')
plt.suptitle('t-SNE at Different Perplexities', fontsize=14)
plt.tight_layout()
plt.savefig('tsne_perplexity_comparison.png', dpi=150, bbox_inches='tight')
plt.show()
Practical Guidance --- A perplexity between 20 and 50 works well for most datasets with 1,000-50,000 observations. For very large datasets (100K+), higher perplexity values (50-100) may be needed. If you see a single uniform blob at all perplexities, the data may genuinely lack cluster structure, or the features may not capture the relevant differences.
What t-SNE Cannot Do
t-SNE has no transform method. You cannot fit it on training data and transform test data. Every time you add data, you must refit from scratch. This makes it unsuitable for production pipelines.
# This does NOT work:
# tsne.transform(X_new) # AttributeError: TSNE has no transform method
# You must refit on the combined data:
# X_combined = np.vstack([X_sample, X_new])
# tsne.fit_transform(X_combined)
This is a fundamental limitation, not a missing feature. t-SNE's optimization is non-convex and sensitive to the full dataset. Adding even a single point changes the global solution.
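If you must place a handful of new points on an existing t-SNE plot anyway, a common heuristic (not part of the t-SNE API, and suitable only for eyeballing) is to position each new point at the mean embedding coordinates of its nearest neighbors in the original feature space. A sketch of that workaround, with `place_new_points` as a hypothetical helper:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def place_new_points(X_fitted, embedding, X_new, k=5):
    """Place each new point at the mean 2D location of its k nearest
    neighbors in the original high-dimensional space."""
    nn = NearestNeighbors(n_neighbors=k).fit(X_fitted)
    _, idx = nn.kneighbors(X_new)          # (n_new, k) neighbor indices
    return embedding[idx].mean(axis=1)     # (n_new, 2) approximate positions

# Toy demo: `emb` stands in for a fitted t-SNE layout of X_old
rng = np.random.default_rng(0)
X_old = rng.normal(size=(100, 10))
emb = rng.normal(size=(100, 2))            # placeholder for tsne.fit_transform(X_old)
X_new = X_old[:3] + 0.01                   # near-duplicates of known points
print(place_new_points(X_old, emb, X_new).shape)  # (3, 2)
```

This keeps the existing layout fixed, which is exactly what refitting t-SNE cannot guarantee; the price is that the new positions are rough approximations.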
Part 3: UMAP --- The Modern Alternative
Why UMAP?
UMAP (Uniform Manifold Approximation and Projection) was introduced in 2018 as an alternative to t-SNE that addresses several of its limitations:
- Speed. UMAP uses approximate nearest neighbors and stochastic gradient descent. It is significantly faster than t-SNE, especially on large datasets.
- Global structure preservation. UMAP preserves more of the large-scale relationships between clusters, though it is still not perfectly faithful to global distances.
- Transform method. UMAP has a `transform` method that lets you embed new data without refitting.
- Scalability. UMAP can handle datasets of 100K-1M+ observations that would be impractical for t-SNE.
UMAP is based on topological data analysis and Riemannian geometry, but you do not need to understand the math to use it effectively. What you need to understand are its two key hyperparameters.
The Key Hyperparameters
n_neighbors controls the balance between local and global structure, similar to perplexity in t-SNE. Low values (5-15) preserve fine-grained local structure. High values (50-200) preserve broader, more global structure.
min_dist controls how tightly UMAP packs points together. Low values (0.0-0.1) produce tight, separated clusters. High values (0.5-1.0) spread points out more uniformly.
# pip install umap-learn
import umap
reducer = umap.UMAP(
n_neighbors=15,
min_dist=0.1,
n_components=2,
random_state=42,
metric='euclidean'
)
X_umap = reducer.fit_transform(X_sample)
fig, ax = plt.subplots(figsize=(10, 8))
scatter = ax.scatter(
X_umap[:, 0], X_umap[:, 1],
c=y_sample, cmap='coolwarm', alpha=0.4, s=8
)
ax.set_xlabel('UMAP 1')
ax.set_ylabel('UMAP 2')
ax.set_title('UMAP of StreamFlow Churn Data')
plt.colorbar(scatter, label='Churned')
plt.savefig('umap_churn.png', dpi=150, bbox_inches='tight')
plt.show()
Hyperparameter Sensitivity
fig, axes = plt.subplots(2, 3, figsize=(18, 11))
configs = [
(5, 0.0), (15, 0.0), (50, 0.0),
(5, 0.5), (15, 0.5), (50, 0.5),
]
for ax, (n_neigh, md) in zip(axes.ravel(), configs):
reducer = umap.UMAP(
n_neighbors=n_neigh, min_dist=md,
n_components=2, random_state=42
)
embedding = reducer.fit_transform(X_sample)
ax.scatter(embedding[:, 0], embedding[:, 1], c=y_sample,
cmap='coolwarm', alpha=0.4, s=8)
ax.set_title(f'n_neighbors={n_neigh}, min_dist={md}')
plt.suptitle('UMAP Hyperparameter Sensitivity', fontsize=14)
plt.tight_layout()
plt.savefig('umap_hyperparameters.png', dpi=150, bbox_inches='tight')
plt.show()
Practical Guidance --- Start with `n_neighbors=15, min_dist=0.1` (the defaults). If clusters are too fragmented, increase `n_neighbors`. If clusters are too merged, decrease it. Adjust `min_dist` to control visual tightness. As with t-SNE, always check that patterns are robust across different hyperparameter values.
UMAP's Transform Method
Unlike t-SNE, UMAP can embed new data using a fitted model. This makes it useful in semi-production contexts --- for example, embedding new customers into an existing visualization.
# Split sample to simulate "new" data
X_train_sample = X_sample.iloc[:4000]
y_train_sample = y_sample.iloc[:4000]
X_new_sample = X_sample.iloc[4000:]
y_new_sample = y_sample.iloc[4000:]
# Fit on training sample
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
X_umap_train = reducer.fit_transform(X_train_sample)
# Transform new data (no refitting)
X_umap_new = reducer.transform(X_new_sample)
fig, ax = plt.subplots(figsize=(10, 8))
ax.scatter(X_umap_train[:, 0], X_umap_train[:, 1],
c='lightgray', alpha=0.3, s=5, label='Existing')
ax.scatter(X_umap_new[:, 0], X_umap_new[:, 1],
c=y_new_sample, cmap='coolwarm', alpha=0.6, s=15, label='New')
ax.set_title('UMAP: New Observations Embedded into Existing Space')
ax.legend()
plt.savefig('umap_transform.png', dpi=150, bbox_inches='tight')
plt.show()
Caution --- While UMAP's transform method is useful, the embeddings of new points are approximate and depend on the training data. If the new data is substantially different from the training data (distribution shift), the embeddings may not be meaningful. Use this for exploration, not for production feature engineering.
Part 4: Comparing the Three Methods
Side-by-Side Comparison
fig, axes = plt.subplots(1, 3, figsize=(20, 6))
# PCA (2 components)
pca_2d = PCA(n_components=2, random_state=42)
X_pca_2d = pca_2d.fit_transform(X_sample)
axes[0].scatter(X_pca_2d[:, 0], X_pca_2d[:, 1], c=y_sample,
cmap='coolwarm', alpha=0.4, s=8)
axes[0].set_title(f'PCA (explains {pca_2d.explained_variance_ratio_.sum():.1%} var)')
axes[0].set_xlabel('PC1')
axes[0].set_ylabel('PC2')
# t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42,
learning_rate='auto', init='pca')
X_tsne = tsne.fit_transform(X_sample)
axes[1].scatter(X_tsne[:, 0], X_tsne[:, 1], c=y_sample,
cmap='coolwarm', alpha=0.4, s=8)
axes[1].set_title('t-SNE (perplexity=30)')
axes[1].set_xlabel('t-SNE 1')
axes[1].set_ylabel('t-SNE 2')
# UMAP
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = reducer.fit_transform(X_sample)
axes[2].scatter(X_umap[:, 0], X_umap[:, 1], c=y_sample,
cmap='coolwarm', alpha=0.4, s=8)
axes[2].set_title('UMAP (n_neighbors=15, min_dist=0.1)')
axes[2].set_xlabel('UMAP 1')
axes[2].set_ylabel('UMAP 2')
plt.suptitle('StreamFlow Churn: Three Reduction Methods', fontsize=14)
plt.tight_layout()
plt.savefig('comparison_three_methods.png', dpi=150, bbox_inches='tight')
plt.show()
Typical results for churn data:
- PCA produces a single cloud. The churners and non-churners overlap substantially. This is expected --- churn is not driven by the top two variance components.
- t-SNE reveals more local structure, often showing regions where churners are concentrated, but the interpretation must respect the warnings above.
- UMAP shows similar local structure to t-SNE but with more coherent global layout and better-separated regions.
When to Use Each
| Scenario | Best Choice | Why |
|---|---|---|
| Reduce 100 features to 20 for a classifier | PCA | Deterministic, invertible, has transform |
| Visualize clusters in a report | UMAP | Fast, preserves global + local structure |
| Debug a recommendation system by visualizing product embeddings | t-SNE or UMAP | Fine for one-off exploration |
| Feature engineering for production model | PCA (maybe) | The only one that is fully deterministic and stable |
| Check if clusters from K-Means are "real" | UMAP | Overlay cluster labels on UMAP to see separation |
| Dashboard that updates daily with new data | UMAP | Has transform method; t-SNE cannot handle new data |
| Reduce noise in high-dimensional sensor data | PCA | Global variance preservation removes noise dimensions |
Part 5: E-Commerce Product Embeddings
Let us apply these techniques to the second anchor example: visualizing product embeddings for recommendation debugging.
np.random.seed(42)
n_products = 3000
# Simulated product embeddings (128-dimensional)
# Products belong to 6 categories with different embedding patterns
categories = np.random.choice(
['Electronics', 'Clothing', 'Books', 'Home', 'Sports', 'Food'],
n_products,
p=[0.20, 0.25, 0.15, 0.15, 0.10, 0.15]
)
# Each category has a different mean in embedding space
category_centers = {
'Electronics': np.random.randn(128) * 0.5 + 1.0,
'Clothing': np.random.randn(128) * 0.5 - 1.0,
'Books': np.random.randn(128) * 0.5 + 0.5,
'Home': np.random.randn(128) * 0.5,
'Sports': np.random.randn(128) * 0.5 - 0.5,
'Food': np.random.randn(128) * 0.5 + 0.3,
}
embeddings = np.zeros((n_products, 128))
for i, cat in enumerate(categories):
embeddings[i] = category_centers[cat] + np.random.randn(128) * 0.8
# Standardize
scaler_emb = StandardScaler()
embeddings_scaled = scaler_emb.fit_transform(embeddings)
# UMAP visualization colored by category
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
emb_umap = reducer.fit_transform(embeddings_scaled)
fig, ax = plt.subplots(figsize=(12, 9))
for cat in ['Electronics', 'Clothing', 'Books', 'Home', 'Sports', 'Food']:
mask = categories == cat
ax.scatter(emb_umap[mask, 0], emb_umap[mask, 1],
alpha=0.5, s=12, label=cat)
ax.set_xlabel('UMAP 1')
ax.set_ylabel('UMAP 2')
ax.set_title('Product Embeddings (UMAP) --- Colored by Category')
ax.legend()
plt.savefig('product_embeddings_umap.png', dpi=150, bbox_inches='tight')
plt.show()
This is the kind of visualization that reveals problems in recommendation systems. If products that should be similar (same category) are scattered across the embedding space, the recommendation model has learned poor representations. If products from different categories overlap significantly, the model may be making cross-category recommendations that confuse users.
Debugging Insight --- If you see a product from "Electronics" sitting in the middle of the "Clothing" cluster, check that product's metadata. It might have incorrect category labels in the database, or the recommendation model might be grouping it with clothing products based on purchase co-occurrence (users who buy running shoes also buy fitness trackers). Both findings are actionable.
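The insight above can be automated rather than eyeballed: flag any product whose nearest neighbors in embedding space mostly carry a different category label. A sketch with made-up data, using a simple mismatch fraction over k neighbors (`flag_misplaced` is a hypothetical helper, not part of any library):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def flag_misplaced(embeddings, labels, k=10, threshold=0.8):
    """Return indices of items whose k nearest neighbors mostly
    (>= threshold fraction) carry a different label."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)
    _, idx = nn.kneighbors(embeddings)
    neighbor_labels = labels[idx[:, 1:]]              # drop self (column 0)
    mismatch = (neighbor_labels != labels[:, None]).mean(axis=1)
    return np.where(mismatch >= threshold)[0]

# Toy demo: two well-separated groups plus one mislabeled point
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
labels = np.array(['A'] * 50 + ['B'] * 50)
labels[10] = 'B'                                      # sits in the A cluster
print(flag_misplaced(emb, labels))                    # index 10 is flagged
```

Running this on the raw 128-dimensional embeddings (rather than the 2D UMAP layout) avoids the distance distortions discussed earlier; the UMAP plot is for spotting the pattern, the neighbor check is for confirming it.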
Part 6: Advanced PCA Topics
PCA on Sparse Data: TruncatedSVD
Standard PCA centers the data (subtracts the mean), which destroys sparsity. For sparse matrices (e.g., TF-IDF vectors, one-hot encoded features), use TruncatedSVD instead.
from sklearn.decomposition import TruncatedSVD
from scipy.sparse import random as sparse_random
# Simulate a sparse matrix (e.g., TF-IDF)
X_sparse = sparse_random(10000, 5000, density=0.01, random_state=42)
svd = TruncatedSVD(n_components=50, random_state=42)
X_reduced = svd.fit_transform(X_sparse)
print(f"Original shape: {X_sparse.shape}")
print(f"Reduced shape: {X_reduced.shape}")
print(f"Explained variance (50 components): {svd.explained_variance_ratio_.sum():.4f}")
Incremental PCA for Large Datasets
If the dataset does not fit in memory, use IncrementalPCA which processes data in batches.
from sklearn.decomposition import IncrementalPCA
ipca = IncrementalPCA(n_components=10, batch_size=5000)
for batch_start in range(0, len(X_scaled), 5000):
batch = X_scaled.iloc[batch_start:batch_start + 5000]
ipca.partial_fit(batch)
X_reduced = ipca.transform(X_scaled)
print(f"Incremental PCA explained variance: {ipca.explained_variance_ratio_.sum():.4f}")
Kernel PCA for Non-Linear Structure
Standard PCA finds linear projections. If the data has non-linear structure (e.g., a Swiss roll), Kernel PCA can capture it using the kernel trick.
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=0.01, random_state=42)
X_kpca = kpca.fit_transform(X_sample)
Practical Note --- Kernel PCA is rarely used in practice because (a) choosing the kernel and its parameters requires tuning, (b) it does not scale well, and (c) UMAP handles non-linear structure better for visualization. Know that it exists; reach for UMAP first.
Part 7: Practical Workflow
Here is the workflow I use when I encounter a high-dimensional dataset:
Step 1: PCA for a quick variance check. Fit full PCA and look at the scree plot. How many components capture 80% of the variance? If the answer is 3-5, the data has strong low-dimensional structure. If the answer is 15-20, the variance is spread across many dimensions.
Step 2: UMAP for visual exploration. Fit UMAP with default parameters (n_neighbors=15, min_dist=0.1). Color by the target variable. Color by other metadata (customer segment, product category, time period). Look for clusters, outliers, and unexpected patterns.
Step 3: Validate findings. Any pattern you see in UMAP must be confirmed with actual statistics. If UMAP shows two clusters, check whether they correspond to meaningful groups by examining summary statistics of the original features for each cluster. Never draw conclusions from the UMAP plot alone.
Step 4: PCA for preprocessing (if needed). If a downstream model benefits from dimensionality reduction, use PCA in a pipeline. Tune the number of components using cross-validation on the downstream metric.
# The full workflow in code
from sklearn.pipeline import Pipeline
# Step 1: Quick PCA variance check
pca_check = PCA(random_state=42).fit(X_scaled)
cumvar = np.cumsum(pca_check.explained_variance_ratio_)
n_80 = np.argmax(cumvar >= 0.80) + 1
print(f"Components for 80% variance: {n_80}")
# Step 2: UMAP for exploration
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = reducer.fit_transform(X_scaled.iloc[:5000])
# Step 3: Validate by checking group statistics
# (example: compare features for left vs. right halves of UMAP)
left = X_umap[:, 0] < np.median(X_umap[:, 0])
print("\nFeature means --- Left half vs. Right half:")
for col in X.columns[:5]:
left_mean = X.iloc[:5000].loc[left, col].mean()
right_mean = X.iloc[:5000].loc[~left, col].mean()
print(f" {col}: {left_mean:.2f} vs {right_mean:.2f}")
# Step 4: PCA in a production pipeline
production_pipe = Pipeline([
('scaler', StandardScaler()),
('pca', PCA(n_components=n_80, random_state=42)),
('clf', XGBClassifier(
n_estimators=200, max_depth=4,
random_state=42, eval_metric='logloss'
))
])
Summary
Dimensionality reduction is not one technique --- it is three techniques that serve two different purposes:
- PCA is for preprocessing: deterministic, invertible, global variance preservation, has transform. Use it to compress features for downstream models or to quickly check the variance structure of your data.
- t-SNE is for visualization: non-deterministic, non-invertible, local neighborhood preservation, no transform. Excellent at revealing clusters, but distances between clusters and cluster sizes are meaningless. Always vary perplexity.
- UMAP is for visualization (primarily): non-deterministic, non-invertible but has transform, preserves local and some global structure. Faster than t-SNE, scales better, and has become the default choice for 2D visualization.
The most common mistake is using the wrong tool for the wrong purpose. PCA for visualization produces blobs. t-SNE for preprocessing produces meaningless features. Know which tool you need before you write the first line of code.
Next chapter: Chapter 22: Anomaly Detection