Case Study 2: Visualizing High-Dimensional Data with t-SNE and UMAP

Context and Motivation

You are a data scientist at an AI startup building an image classification pipeline. Before training deep learning models (Part III of this book), you want to understand the structure of your image data. Are the classes well-separated? Are any classes easily confused? Are there outlier images that might be mislabeled?

To answer these questions, you need to visualize data that lives in a high-dimensional space. Each image in the digits dataset is 8x8 pixels = 64 dimensions. Directly plotting 64 dimensions is impossible; we need dimensionality reduction.
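To make those 64 dimensions concrete: scikit-learn exposes each sample both as an 8x8 image and as the same pixels unrolled into a 64-dimensional vector, and the two views are directly interchangeable:

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
print(digits.images.shape)  # (1797, 8, 8) - one 8x8 grid per sample
print(digits.data.shape)    # (1797, 64)  - the same grids, flattened

# The flat vector is just the image unrolled row by row
assert np.array_equal(digits.images[0].ravel(), digits.data[0])
```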

This case study compares three approaches---PCA, t-SNE, and UMAP---on the scikit-learn digits dataset (1,797 samples, 64 features, 10 classes). The goal is not just to produce pretty plots, but to understand what each method reveals and conceals about the data.

Step 1: Load and Explore the Data

import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

# Load data
digits = load_digits()
X = digits.data          # Shape: (1797, 64)
y = digits.target        # Shape: (1797,) - labels 0-9
class_names = [str(i) for i in range(10)]

print(f"Dataset: {X.shape[0]} samples, {X.shape[1]} features")
print(f"Classes: {np.unique(y)}")
print(f"Samples per class: {np.bincount(y)}")

# Standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Step 2: PCA Baseline

PCA provides a fast, deterministic baseline. Its linear nature means it preserves global variance structure but may not reveal nonlinear cluster boundaries.

from sklearn.decomposition import PCA

# Full PCA to understand variance distribution
pca_full = PCA(n_components=None)
pca_full.fit(X_scaled)

cumulative_var = np.cumsum(pca_full.explained_variance_ratio_)
n_90 = np.argmax(cumulative_var >= 0.90) + 1
n_95 = np.argmax(cumulative_var >= 0.95) + 1
print(f"Components for 90% variance: {n_90}")
print(f"Components for 95% variance: {n_95}")

# 2D projection
pca_2d = PCA(n_components=2)
X_pca = pca_2d.fit_transform(X_scaled)
print(f"Variance in 2D: {pca_2d.explained_variance_ratio_.sum():.2%}")

Observations from PCA:

  • Two components capture only about 22% of total variance, meaning the 2D PCA plot is a very lossy representation.
  • Some digit classes partially overlap. For example, digits 3, 5, and 8 tend to occupy similar regions.
  • The global layout is informative: digit 0 is far from digit 1, which makes intuitive sense given their visual dissimilarity.
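These observations come from scatter plots of the 2D embedding. A minimal plotting helper, assuming matplotlib is installed (the plot_embedding function and output filename are our own, not part of scikit-learn):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the example runs without a display
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def plot_embedding(X_2d, y, title):
    """Scatter-plot a 2D embedding, one color per digit class."""
    fig, ax = plt.subplots(figsize=(6, 5))
    for digit in range(10):
        mask = y == digit
        ax.scatter(X_2d[mask, 0], X_2d[mask, 1], s=10, alpha=0.6, label=str(digit))
    ax.set_title(title)
    ax.legend(title="Digit", fontsize=8)
    return fig, ax

# Rebuild the Step 2 embedding so the example is self-contained
digits = load_digits()
X_scaled = StandardScaler().fit_transform(digits.data)
X_pca = PCA(n_components=2).fit_transform(X_scaled)

fig, ax = plot_embedding(X_pca, digits.target, "PCA projection of digits")
fig.savefig("pca_digits.png", dpi=150)
```

The same helper works unchanged for the t-SNE and UMAP embeddings in the later steps.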

Step 3: t-SNE Visualization

t-SNE should reveal local cluster structure that PCA misses.

Perplexity Study

Different perplexity values can dramatically change the embedding. We run t-SNE with multiple perplexities to understand the sensitivity.

from sklearn.manifold import TSNE

perplexities = [5, 15, 30, 50, 100]
tsne_results = {}

for perp in perplexities:
    tsne = TSNE(
        n_components=2,
        perplexity=perp,
        learning_rate="auto",
        n_iter=1000,
        random_state=42
    )
    X_embedded = tsne.fit_transform(X_scaled)
    tsne_results[perp] = X_embedded
    print(f"Perplexity {perp:3d} | KL divergence: {tsne.kl_divergence_:.4f}")

Observations from the perplexity study:

  • Perplexity 5: Very tight, fragmented clusters. Some digits split into sub-clusters (e.g., different writing styles of "7"). Noise dominates the layout.
  • Perplexity 15: Clusters begin to consolidate. Most digits form coherent groups, but some remain fragmented.
  • Perplexity 30 (default): Good balance. Ten distinct clusters are visible, roughly corresponding to the ten digits. Some overlap between visually similar digits (3 vs. 5, 4 vs. 9).
  • Perplexity 50: Clusters are more spread out. The global layout becomes more stable across runs.
  • Perplexity 100: Clusters start to merge. As perplexity grows toward a sizable fraction of the dataset size (a common rule of thumb keeps it below n/3, here roughly 600), t-SNE begins to behave more like a global method.
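The sensitivity can be quantified rather than just eyeballed, using scikit-learn's trustworthiness score (introduced formally in Step 5). A sketch on a random subsample to keep runtime short; the subsample size and the perplexity pair compared here are illustrative choices, not part of the study above:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE, trustworthiness
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_scaled = StandardScaler().fit_transform(digits.data)

# Subsample for speed; the full study above uses all 1,797 points
rng = np.random.default_rng(42)
idx = rng.choice(len(X_scaled), size=400, replace=False)
X_sub = X_scaled[idx]

for perp in (5, 50):
    emb = TSNE(n_components=2, perplexity=perp, init="pca",
               random_state=42).fit_transform(X_sub)
    t = trustworthiness(X_sub, emb, n_neighbors=10)
    print(f"perplexity={perp:3d}  trustworthiness={t:.3f}")
```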

Best Practice: PCA First, Then t-SNE

For higher-dimensional datasets, reducing to 50 components with PCA before applying t-SNE is recommended. On this 64-dimensional dataset, the speedup is modest, but for datasets with hundreds or thousands of features, it is essential.

# PCA reduction first
pca_50 = PCA(n_components=50)
X_pca50 = pca_50.fit_transform(X_scaled)

tsne_from_pca = TSNE(
    n_components=2,
    perplexity=30,
    learning_rate="auto",
    n_iter=1000,
    random_state=42
)
X_tsne_from_pca = tsne_from_pca.fit_transform(X_pca50)

Step 4: UMAP Visualization

UMAP should produce similar local structure to t-SNE but with better global structure preservation and faster computation.

Parameter Study

import umap

# Parameter grids for the study
neighbors_values = [5, 15, 30, 50]
min_dist_values = [0.0, 0.1, 0.5, 1.0]

umap_results = {}

# Vary n_neighbors with min_dist fixed at its default (0.1)
for n_nb in neighbors_values:
    reducer = umap.UMAP(n_components=2, n_neighbors=n_nb, min_dist=0.1,
                        metric="euclidean", random_state=42)
    umap_results[f"n_neighbors={n_nb}"] = reducer.fit_transform(X_scaled)

# Vary min_dist with n_neighbors fixed at its default (15)
for md in min_dist_values:
    reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=md,
                        metric="euclidean", random_state=42)
    umap_results[f"min_dist={md}"] = reducer.fit_transform(X_scaled)

# Default configuration (n_neighbors=15, min_dist=0.1), reused in later steps
X_umap = umap_results["min_dist=0.1"]

Observations from the UMAP parameter study:

  • n_neighbors=5: Very local focus. Clusters are tight and separated by large gaps. Sub-structure within clusters is visible.
  • n_neighbors=15 (default): Good balance. Clusters are well-separated, and the global layout is meaningful---visually similar digits (e.g., 3, 5, 8) are placed near each other.
  • n_neighbors=50: More global structure. Clusters may begin to blend at their boundaries.
  • min_dist=0.0: Points within clusters are packed very tightly. Good for identifying sub-clusters.
  • min_dist=0.5: More spread-out clusters. The continuous "manifold" structure is more apparent.
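The "tightness" claims in these bullets can also be put on a number. One simple measure is the mean within-class pairwise distance relative to the overall mean pairwise distance; the cluster_tightness helper below is our own construction, demonstrated on the PCA embedding so the snippet runs without umap-learn (it applies unchanged to any of the UMAP embeddings):

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def cluster_tightness(X_2d: np.ndarray, y: np.ndarray) -> float:
    """Mean within-class pairwise distance divided by the overall mean
    pairwise distance. Lower values indicate tighter class clusters."""
    within = np.mean([pdist(X_2d[y == c]).mean() for c in np.unique(y)])
    return within / pdist(X_2d).mean()

digits = load_digits()
X_scaled = StandardScaler().fit_transform(digits.data)
X_pca = PCA(n_components=2).fit_transform(X_scaled)
t = cluster_tightness(X_pca, digits.target)
print(f"Tightness ratio (PCA): {t:.3f}")
```

Comparing this ratio across min_dist settings makes the "packed very tightly" versus "spread-out" distinction concrete.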

Step 5: Quantitative Comparison

Beyond visual inspection, we compare the methods quantitatively.

Trustworthiness

from sklearn.manifold import trustworthiness

trust_pca = trustworthiness(X_scaled, X_pca, n_neighbors=15)
trust_tsne = trustworthiness(X_scaled, tsne_results[30], n_neighbors=15)
trust_umap = trustworthiness(X_scaled, X_umap, n_neighbors=15)

print(f"Trustworthiness (k=15):")
print(f"  PCA:   {trust_pca:.4f}")
print(f"  t-SNE: {trust_tsne:.4f}")
print(f"  UMAP:  {trust_umap:.4f}")

Expected results: UMAP and t-SNE should both achieve trustworthiness above 0.95, significantly outperforming PCA (around 0.90 or below).
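For reference, the score scikit-learn computes is

$$ T(k) = 1 - \frac{2}{nk(2n - 3k - 1)} \sum_{i=1}^{n} \sum_{j \in \mathcal{N}_i^{k}} \max\bigl(0,\; r(i, j) - k\bigr) $$

where $\mathcal{N}_i^{k}$ is the set of $k$ nearest neighbors of point $i$ in the embedding and $r(i, j)$ is the rank of $j$ among the nearest neighbors of $i$ in the original space. Points that are close in the embedding but far away in the original space are penalized; a score of 1 means every embedded neighborhood is fully trustworthy.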

Clustering Quality on Embeddings

We can also evaluate how well the 2D embeddings preserve cluster structure by running K-means on the embeddings and comparing with true labels.

from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def evaluate_embedding_clustering(
    X_2d: np.ndarray,
    y_true: np.ndarray,
    n_clusters: int = 10
) -> dict[str, float]:
    """Evaluate clustering quality on a 2D embedding.

    Args:
        X_2d: 2D embedded data.
        y_true: Ground-truth labels.
        n_clusters: Number of clusters for K-means.

    Returns:
        Dictionary with ARI and NMI scores.
    """
    km = KMeans(n_clusters=n_clusters, n_init=20, random_state=42)
    labels = km.fit_predict(X_2d)
    return {
        "ARI": adjusted_rand_score(y_true, labels),
        "NMI": normalized_mutual_info_score(y_true, labels),
    }

for name, X_2d in [("PCA", X_pca), ("t-SNE", tsne_results[30]), ("UMAP", X_umap)]:
    scores = evaluate_embedding_clustering(X_2d, y)
    print(f"{name:6s} | ARI: {scores['ARI']:.3f} | NMI: {scores['NMI']:.3f}")

Timing Comparison

import time

# PCA timing (perf_counter is the idiomatic clock for measuring elapsed time)
start = time.perf_counter()
PCA(n_components=2).fit_transform(X_scaled)
pca_time = time.perf_counter() - start

# t-SNE timing
start = time.perf_counter()
TSNE(n_components=2, random_state=42).fit_transform(X_scaled)
tsne_time = time.perf_counter() - start

# UMAP timing
start = time.perf_counter()
umap.UMAP(n_components=2, random_state=42).fit_transform(X_scaled)
umap_time = time.perf_counter() - start

print(f"PCA:   {pca_time:.3f}s")
print(f"t-SNE: {tsne_time:.3f}s")
print(f"UMAP:  {umap_time:.3f}s")

Step 6: Identifying Confusing Digit Pairs

One practical application of visualization is identifying which classes are most easily confused.

def find_confused_pairs(
    X_2d: np.ndarray,
    y: np.ndarray,
    n_neighbors: int = 10
) -> list[tuple[int, int, float]]:
    """Find digit pairs that are most confused in the embedding.

    For each point, check how many of its nearest neighbors
    belong to a different class.

    Args:
        X_2d: 2D embedded coordinates.
        y: True class labels.
        n_neighbors: Number of neighbors to check.

    Returns:
        List of (class_a, class_b, confusion_rate) sorted by confusion.
    """
    from sklearn.neighbors import NearestNeighbors

    nn = NearestNeighbors(n_neighbors=n_neighbors + 1)
    nn.fit(X_2d)
    _, indices = nn.kneighbors(X_2d)

    confusion = np.zeros((10, 10))
    for i in range(len(X_2d)):
        neighbors = indices[i, 1:]  # Exclude self
        for j in neighbors:
            if y[i] != y[j]:
                confusion[y[i], y[j]] += 1

    pairs = []
    for a in range(10):
        for b in range(a + 1, 10):
            rate = (confusion[a, b] + confusion[b, a]) / (
                np.sum(y == a) + np.sum(y == b)
            )
            pairs.append((a, b, rate))

    return sorted(pairs, key=lambda x: -x[2])

# Find most confused pairs in UMAP embedding
confused = find_confused_pairs(X_umap, y)
print("Most confused digit pairs (UMAP):")
for a, b, rate in confused[:5]:
    print(f"  Digits {a} vs {b}: confusion rate {rate:.3f}")

Expected confusing pairs: (3, 5), (4, 9), (3, 8), (1, 7). These correspond to digits that look visually similar in handwriting.

Step 7: Using UMAP for New Data

Unlike t-SNE, UMAP can transform new data points. This is crucial for production use.

# Split data for demonstration
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)

# Fit on training data
reducer_train = umap.UMAP(n_components=2, n_neighbors=15, random_state=42)
X_train_umap = reducer_train.fit_transform(X_train)

# Transform test data
X_test_umap = reducer_train.transform(X_test)

# Verify that test points land near their correct class clusters
test_scores = evaluate_embedding_clustering(
    np.vstack([X_train_umap, X_test_umap]),
    np.concatenate([y_train, y_test]),
    n_clusters=10
)
print(f"Combined embedding ARI: {test_scores['ARI']:.3f}")
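In production, the fitted reducer should be persisted rather than refit on every run. A sketch of the pattern using joblib; PCA stands in for the UMAP reducer here so the snippet runs without umap-learn, but the same dump/load pattern applies to a fitted umap.UMAP object:

```python
import joblib
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

digits = load_digits()
X_scaled = StandardScaler().fit_transform(digits.data)

# Fit once and save to disk (PCA stands in for the UMAP reducer here)
reducer = PCA(n_components=2).fit(X_scaled)
joblib.dump(reducer, "reducer.joblib")

# Later, e.g. in a dashboard or API process: load and embed new points
loaded = joblib.load("reducer.joblib")
new_embedding = loaded.transform(X_scaled[:5])
assert np.allclose(new_embedding, reducer.transform(X_scaled[:5]))
```

In practice, persist the StandardScaler as well (or wrap scaler and reducer in a single Pipeline) so that new data is preprocessed exactly as the training data was.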

Summary of Findings

Method            Trustworthiness   ARI     Time     Transform New Data
PCA               ~0.88             ~0.45   <0.01s   Yes
t-SNE (perp=30)   ~0.97             ~0.75   ~5s      No
UMAP (default)    ~0.97             ~0.80   ~2s      Yes

Key Lessons from This Case Study

  1. PCA is fast but limited: It provides a useful first look and baseline, but its linear nature cannot capture the nonlinear manifold structure of image data. Use it as a first step, not the final answer.

  2. t-SNE excels at local structure: It produces beautiful visualizations where clusters are clearly separated, but the global layout is not reliable. Never interpret distances between clusters in t-SNE.

  3. UMAP is the best all-rounder: It matches t-SNE's visualization quality, runs faster, better preserves global structure, and supports transforming new data. For most visualization tasks, UMAP should be the default choice.

  4. Perplexity and n_neighbors matter enormously: Always try multiple parameter values. A single t-SNE or UMAP plot can be misleading.

  5. Combine methods: Use PCA to understand variance structure and reduce dimensionality. Then use UMAP for visualization. Use the confused-pairs analysis to identify potential labeling issues or inherently difficult classification boundaries.

  6. Quantify, do not just visualize: Trustworthiness scores and embedding-based clustering metrics provide objective measures of embedding quality, complementing visual inspection.

  7. UMAP for production, t-SNE for exploration: If you need to embed new data (e.g., in a dashboard or API), use UMAP. If you are doing one-off exploratory analysis and want the most "tunable" visualization, t-SNE is still a strong choice.