Case Study 2: The t-SNE Lies --- Common Misinterpretations

How to Read (and Misread) a t-SNE Plot


Background

This case study is different from most in this textbook. Instead of building a solution to a business problem, we are building your ability to spot flawed reasoning. Specifically, we are going to generate t-SNE visualizations and then systematically demonstrate the wrong conclusions that practitioners commonly draw from them --- and show you how to check whether those conclusions are real.

Every example in this case study is based on patterns I have seen in real data science teams: conference presentations where t-SNE distances were interpreted as feature-space distances, product reviews where cluster sizes were interpreted as segment homogeneity, and strategy decks where visual separation was interpreted as behavioral separation.

The goal is simple: after working through these examples, you should be able to look at any t-SNE (or UMAP) plot and immediately identify interpretation traps.


Setup: Two Datasets

We will work with two carefully constructed datasets where we know the ground truth, so we can verify whether t-SNE conclusions are correct.

import numpy as np
import pandas as pd
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import pairwise_distances
import matplotlib.pyplot as plt

np.random.seed(42)

# Dataset 1: Three clusters with KNOWN distances
# Cluster A and B are close in feature space
# Cluster C is far from both
n_per = 500
cluster_a = np.random.randn(n_per, 20) * 0.8 + np.array([0]*20)
cluster_b = np.random.randn(n_per, 20) * 0.8 + np.array([2]*20)
cluster_c = np.random.randn(n_per, 20) * 0.8 + np.array([10]*20)

X_clusters = np.vstack([cluster_a, cluster_b, cluster_c])
labels_clusters = np.array(['A']*n_per + ['B']*n_per + ['C']*n_per)

# Dataset 2: Two groups with KNOWN variance
# Group "Tight" has low variance; Group "Spread" has high variance
tight = np.random.randn(500, 20) * 0.3 + np.array([0]*20)
spread = np.random.randn(500, 20) * 3.0 + np.array([5]*20)

X_variance = np.vstack([tight, spread])
labels_variance = np.array(['Tight']*500 + ['Spread']*500)

Ground Truth: Actual Distances

# Compute actual centroid distances
centroid_a = cluster_a.mean(axis=0)
centroid_b = cluster_b.mean(axis=0)
centroid_c = cluster_c.mean(axis=0)

dist_ab = np.linalg.norm(centroid_a - centroid_b)
dist_ac = np.linalg.norm(centroid_a - centroid_c)
dist_bc = np.linalg.norm(centroid_b - centroid_c)

print("Ground truth centroid distances (20-dimensional):")
print(f"  A to B: {dist_ab:.2f}")
print(f"  A to C: {dist_ac:.2f}")
print(f"  B to C: {dist_bc:.2f}")
print(f"  Ratio A-C / A-B: {dist_ac/dist_ab:.2f}x")

Expected output (approximate; sampling noise shifts the last digits):

Ground truth centroid distances (20-dimensional):
  A to B: 8.94
  A to C: 44.72
  B to C: 35.78
  Ratio A-C / A-B: 5.00x

Cluster C is 5 times further from A than B is. This is the ground truth we will compare against t-SNE's visual representation.


Lie 1: "These Clusters Are Equidistant"

scaler = StandardScaler()
X_clusters_scaled = scaler.fit_transform(X_clusters)

fig, axes = plt.subplots(1, 3, figsize=(20, 6))
perplexities = [5, 30, 50]

for ax, perp in zip(axes, perplexities):
    tsne = TSNE(n_components=2, perplexity=perp, random_state=42,
                learning_rate='auto', init='pca')
    X_tsne = tsne.fit_transform(X_clusters_scaled)

    for label, color in [('A', 'steelblue'), ('B', 'salmon'), ('C', 'forestgreen')]:
        mask = labels_clusters == label
        ax.scatter(X_tsne[mask, 0], X_tsne[mask, 1],
                   c=color, alpha=0.5, s=10, label=label)

    # Measure t-SNE centroid distances
    tsne_centroid_a = X_tsne[labels_clusters == 'A'].mean(axis=0)
    tsne_centroid_b = X_tsne[labels_clusters == 'B'].mean(axis=0)
    tsne_centroid_c = X_tsne[labels_clusters == 'C'].mean(axis=0)

    tsne_ab = np.linalg.norm(tsne_centroid_a - tsne_centroid_b)
    tsne_ac = np.linalg.norm(tsne_centroid_a - tsne_centroid_c)

    ax.set_title(f'Perplexity={perp}\n'
                 f't-SNE dist A-B: {tsne_ab:.1f}, A-C: {tsne_ac:.1f}\n'
                 f'Ratio: {tsne_ac/tsne_ab:.2f}x (true: 5.00x)')
    ax.legend()

plt.suptitle('Lie 1: t-SNE Distorts Inter-Cluster Distances', fontsize=14)
plt.tight_layout()
plt.savefig('cs2_lie1_distances.png', dpi=150, bbox_inches='tight')
plt.show()

The Misinterpretation

A presenter shows the perplexity=30 plot and says: "Clusters A, B, and C are roughly equidistant. This suggests all three customer segments are equally different from each other."

The Truth

The true distance ratio is 5:1, but t-SNE compresses it to approximately 1.5:1 --- or even 1:1, depending on perplexity. t-SNE's objective does not try to preserve absolute distances or distance ratios between clusters: once clusters are well separated, the gradient carries almost no information about how much further apart to push them. It only tries to preserve local neighborhood structure within clusters.

The Rule --- Never compare distances between clusters in a t-SNE plot. If you need to compare inter-cluster distances, compute them in the original feature space using centroid distances, Mahalanobis distance, or pairwise distance distributions.

The Correct Analysis

# Actual pairwise distance distributions between clusters
from scipy.spatial.distance import cdist

dists_ab = cdist(cluster_a, cluster_b).ravel()
dists_ac = cdist(cluster_a, cluster_c).ravel()
dists_bc = cdist(cluster_b, cluster_c).ravel()

fig, ax = plt.subplots(figsize=(10, 5))
ax.hist(dists_ab, bins=50, alpha=0.5, label=f'A-B (median: {np.median(dists_ab):.1f})',
        density=True)
ax.hist(dists_ac, bins=50, alpha=0.5, label=f'A-C (median: {np.median(dists_ac):.1f})',
        density=True)
ax.hist(dists_bc, bins=50, alpha=0.5, label=f'B-C (median: {np.median(dists_bc):.1f})',
        density=True)
ax.set_xlabel('Pairwise Euclidean Distance (20D)')
ax.set_ylabel('Density')
ax.set_title('Actual Pairwise Distance Distributions')
ax.legend()
plt.savefig('cs2_actual_distances.png', dpi=150, bbox_inches='tight')
plt.show()

This histogram tells the real story. A and B sit close together in feature space; C is far from both. No t-SNE plot can show these magnitudes.
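The Rule above also mentions Mahalanobis distance, which rescales the centroid gap by the within-cluster covariance. Here is a minimal standalone sketch (it regenerates clusters with the Dataset 1 geometry so it runs on its own; pooling the covariance of the two clusters being compared is one common convention, not the only one):

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

rng = np.random.default_rng(42)
# Same construction as Dataset 1: isotropic 20-D clusters with known centers
cluster_a = rng.standard_normal((500, 20)) * 0.8 + 0.0
cluster_b = rng.standard_normal((500, 20)) * 0.8 + 2.0
cluster_c = rng.standard_normal((500, 20)) * 0.8 + 10.0

def mahal_centroid_dist(x, y):
    # Pool the centered samples of both clusters to estimate covariance
    pooled = np.vstack([x - x.mean(axis=0), y - y.mean(axis=0)])
    vi = np.linalg.inv(np.cov(pooled, rowvar=False))
    return mahalanobis(x.mean(axis=0), y.mean(axis=0), vi)

d_ab = mahal_centroid_dist(cluster_a, cluster_b)
d_ac = mahal_centroid_dist(cluster_a, cluster_c)
print(f"Mahalanobis A-B: {d_ab:.2f}, A-C: {d_ac:.2f}, ratio: {d_ac/d_ab:.2f}x")
```

Because these clusters are isotropic, the ratio matches the Euclidean 5:1; with correlated features the two can disagree, which is exactly when the Mahalanobis version is worth the extra step.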


Lie 2: "This Cluster Is More Homogeneous"

X_variance_scaled = scaler.fit_transform(X_variance)

fig, axes = plt.subplots(1, 3, figsize=(20, 6))

for ax, perp in zip(axes, perplexities):
    tsne = TSNE(n_components=2, perplexity=perp, random_state=42,
                learning_rate='auto', init='pca')
    X_tsne = tsne.fit_transform(X_variance_scaled)

    for label, color in [('Tight', 'steelblue'), ('Spread', 'salmon')]:
        mask = labels_variance == label
        ax.scatter(X_tsne[mask, 0], X_tsne[mask, 1],
                   c=color, alpha=0.5, s=10, label=label)

    # Measure visual sizes
    tight_spread_tsne = X_tsne[labels_variance == 'Tight'].std(axis=0).mean()
    spread_spread_tsne = X_tsne[labels_variance == 'Spread'].std(axis=0).mean()

    ax.set_title(f'Perplexity={perp}\n'
                 f'Tight visual spread: {tight_spread_tsne:.1f}, '
                 f'Spread visual spread: {spread_spread_tsne:.1f}')
    ax.legend()

plt.suptitle('Lie 2: t-SNE Distorts Cluster Sizes', fontsize=14)
plt.tight_layout()
plt.savefig('cs2_lie2_sizes.png', dpi=150, bbox_inches='tight')
plt.show()

The Misinterpretation

A data scientist observes that the "Tight" group appears as a compact ball while the "Spread" group appears as a larger cloud and concludes: "The 'Spread' customers are about 2x more diverse than the 'Tight' customers."

The Truth

The actual standard deviation ratio is 10:1 (0.3 vs. 3.0), but t-SNE compresses this to approximately 2:1 in the visual. At some perplexity values, the two groups may appear similar in size. t-SNE normalizes density through its use of adaptive Gaussian bandwidths (the perplexity mechanism), which explicitly adjusts for different local densities.

The Rule --- Never infer relative variance or homogeneity from visual cluster sizes in a t-SNE plot. Compute within-cluster variance in the original feature space.

The Correct Analysis

# Actual within-cluster variance
tight_var = np.var(tight, axis=0).mean()
spread_var = np.var(spread, axis=0).mean()

print(f"Tight group - mean feature variance: {tight_var:.4f}")
print(f"Spread group - mean feature variance: {spread_var:.4f}")
print(f"Ratio: {spread_var/tight_var:.1f}x")

Expected output (approximate):

Tight group - mean feature variance: 0.0900
Spread group - mean feature variance: 9.0000
Ratio: 100.0x

The spread group has 100x the variance. No t-SNE plot will show this ratio faithfully.


Lie 3: "The Number of Clusters Is Three"

Perplexity does not just change the layout. It can change the apparent number of clusters.

# Create data with ambiguous cluster structure:
# 2 real clusters, each with 2 sub-clusters
np.random.seed(42)
sub1a = np.random.randn(300, 15) * 0.5 + np.concatenate([np.array([0, 0, 0]), np.zeros(12)])
sub1b = np.random.randn(300, 15) * 0.5 + np.concatenate([np.array([1.5, 0, 0]), np.zeros(12)])
sub2a = np.random.randn(300, 15) * 0.5 + np.concatenate([np.array([8, 0, 0]), np.zeros(12)])
sub2b = np.random.randn(300, 15) * 0.5 + np.concatenate([np.array([9.5, 0, 0]), np.zeros(12)])

X_ambig = np.vstack([sub1a, sub1b, sub2a, sub2b])
labels_ambig = np.array(['1a']*300 + ['1b']*300 + ['2a']*300 + ['2b']*300)
X_ambig_scaled = StandardScaler().fit_transform(X_ambig)

fig, axes = plt.subplots(1, 4, figsize=(22, 5))
for ax, perp in zip(axes, [5, 15, 30, 50]):
    tsne = TSNE(n_components=2, perplexity=perp, random_state=42,
                learning_rate='auto', init='pca')
    X_tsne = tsne.fit_transform(X_ambig_scaled)

    for label, color in [('1a', 'steelblue'), ('1b', 'royalblue'),
                          ('2a', 'salmon'), ('2b', 'indianred')]:
        mask = labels_ambig == label
        ax.scatter(X_tsne[mask, 0], X_tsne[mask, 1],
                   c=color, alpha=0.5, s=10, label=label)
    ax.set_title(f'Perplexity={perp}')
    ax.legend(fontsize=8)

plt.suptitle('Lie 3: Perplexity Changes the Apparent Number of Clusters', fontsize=14)
plt.tight_layout()
plt.savefig('cs2_lie3_cluster_count.png', dpi=150, bbox_inches='tight')
plt.show()

The Misinterpretation

At perplexity=5, you might see 4 clusters. At perplexity=50, you might see 2. A presenter picks the most dramatic result and says "there are 4 distinct customer segments."

The Truth

The data has a hierarchical structure: 2 major groups, each with 2 sub-groups. The "correct" number of clusters depends on the scale of analysis. t-SNE's perplexity controls which scale is emphasized. There is no single correct answer from t-SNE alone.

The Rule --- Never determine the number of clusters from a t-SNE plot. Use a clustering algorithm (K-Means with silhouette scores, DBSCAN, hierarchical clustering) on the original high-dimensional data. Use t-SNE or UMAP only to visualize clusters that have been identified by other methods.
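As a sketch of that rule, here is what K-Means plus silhouette scores says about this kind of data. The block rebuilds the hierarchical dataset locally so it runs standalone, and the candidate k values are illustrative, not prescribed:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)

def blob(center_first_dim, n=300, d=15):
    # Signal lives in the first dimension; the rest is noise
    X = rng.standard_normal((n, d)) * 0.5
    X[:, 0] += center_first_dim
    return X

# Same hierarchical structure: 2 macro groups, each with 2 sub-groups
X = np.vstack([blob(0.0), blob(1.5), blob(8.0), blob(9.5)])
X_scaled = StandardScaler().fit_transform(X)

scores = {}
for k in [2, 3, 4, 5]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_scaled)
    scores[k] = silhouette_score(X_scaled, labels)
    print(f"k={k}: silhouette = {scores[k]:.3f}")
```

The absolute scores are modest here because 14 of the 15 dimensions are pure noise, and the two scales of structure compete. That is the honest answer --- and the point is that the comparison happens in the original space, not in a 2-D picture.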


Lie 4: "These Two Runs Disagree, So Something Is Wrong"

fig, axes = plt.subplots(1, 3, figsize=(20, 6))

for ax, seed in zip(axes, [42, 99, 7]):
    tsne = TSNE(n_components=2, perplexity=30, random_state=seed,
                learning_rate='auto', init='random')  # random init, so the seed changes the layout
    X_tsne = tsne.fit_transform(X_clusters_scaled)

    for label, color in [('A', 'steelblue'), ('B', 'salmon'), ('C', 'forestgreen')]:
        mask = labels_clusters == label
        ax.scatter(X_tsne[mask, 0], X_tsne[mask, 1],
                   c=color, alpha=0.5, s=10, label=label)
    ax.set_title(f'random_state={seed}')
    ax.legend()

plt.suptitle('Different Random Seeds, Same Data', fontsize=14)
plt.tight_layout()
plt.savefig('cs2_lie4_random_seeds.png', dpi=150, bbox_inches='tight')
plt.show()

The Misinterpretation

"The plots look different every time I run it. The method is unreliable."

The Truth

The three clusters appear in all three runs --- they are just in different positions and orientations. t-SNE's cost function is non-convex and the layout depends on the (random) initialization, so absolute positions, orientations, and reflections are arbitrary. What is stable across runs is which points are near which other points. If the same clusters appear (regardless of position) across multiple runs, the structure is real.

The Rule --- When running t-SNE multiple times, look for consistent groupings, not consistent positions. If the same observations always cluster together, the local structure is real, even if the layout rotates or flips between runs.


Lie 5: "UMAP Shows the Same Thing, So It Must Be True"

This is the most dangerous lie because it feels like validation.

import umap  # provided by the umap-learn package

fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42,
            learning_rate='auto', init='pca')
X_tsne = tsne.fit_transform(X_clusters_scaled)
for label, color in [('A', 'steelblue'), ('B', 'salmon'), ('C', 'forestgreen')]:
    mask = labels_clusters == label
    axes[0].scatter(X_tsne[mask, 0], X_tsne[mask, 1],
                    c=color, alpha=0.5, s=10, label=label)
axes[0].set_title('t-SNE (perplexity=30)')
axes[0].legend()

# UMAP
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = reducer.fit_transform(X_clusters_scaled)
for label, color in [('A', 'steelblue'), ('B', 'salmon'), ('C', 'forestgreen')]:
    mask = labels_clusters == label
    axes[1].scatter(X_umap[mask, 0], X_umap[mask, 1],
                    c=color, alpha=0.5, s=10, label=label)
axes[1].set_title('UMAP (n_neighbors=15, min_dist=0.1)')
axes[1].legend()

plt.suptitle('"Both methods agree" --- but agree on what?', fontsize=14)
plt.tight_layout()
plt.savefig('cs2_lie5_false_validation.png', dpi=150, bbox_inches='tight')
plt.show()

The Misinterpretation

"Both t-SNE and UMAP show three clusters that appear equidistant. This confirms that the three segments are equally different."

The Truth

Both methods share the same fundamental limitation: they do not preserve inter-cluster distances. "Agreement" between t-SNE and UMAP on the number of clusters is meaningful --- both are detecting local structure independently. But agreement on apparent distances is meaningless because both methods distort distances in similar ways. Two methods with the same blind spot do not validate each other.

The Rule --- Use multiple visualization methods to confirm the existence of structure (clusters, outliers). Never use their agreement to confirm properties of that structure (distances, sizes, densities). Confirm properties in the original feature space.


The Correct t-SNE/UMAP Workflow

Given everything above, here is the workflow that produces reliable insights:

# Step 1: Generate the visualization
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
embedding = reducer.fit_transform(X_clusters_scaled)

# Step 2: Note visual patterns (but do not draw conclusions yet)
# "I see three groups. Group A and B look close. Group C looks separate."

# Step 3: Validate with original-space statistics
centroids = {}
for label in ['A', 'B', 'C']:
    mask = labels_clusters == label
    centroids[label] = X_clusters[mask].mean(axis=0)

for pair in [('A', 'B'), ('A', 'C'), ('B', 'C')]:
    d = np.linalg.norm(centroids[pair[0]] - centroids[pair[1]])
    print(f"  {pair[0]}-{pair[1]} centroid distance: {d:.2f}")

# Step 4: Only state conclusions supported by Step 3
# "There are three clusters (confirmed by both t-SNE and UMAP).
#  C is ~5x further from A than B is (confirmed by centroid distances).
#  t-SNE visual distances were misleading --- the actual separation is much larger."

Summary: The Five Lies and Their Antidotes

| Lie | What You See | Why It Is Wrong | The Antidote |
|---|---|---|---|
| "Equidistant clusters" | Clusters appear equally spaced | t-SNE does not preserve inter-cluster distances | Compute centroid distances in original space |
| "Homogeneous cluster" | One cluster looks tighter | t-SNE normalizes density via adaptive bandwidth | Compute within-cluster variance in original space |
| "N clusters exist" | You count N blobs | Perplexity changes apparent cluster count | Use a clustering algorithm on original data |
| "Method is unreliable" | Different runs look different | Absolute positions are arbitrary; groupings are stable | Check if the same points cluster together across runs |
| "Two methods agree" | t-SNE and UMAP show similar layout | Both methods share the same inter-cluster distance limitations | Use agreement to confirm existence, not properties |

The common thread: t-SNE and UMAP are powerful tools for revealing local structure, but they systematically distort global properties. Every conclusion about distances, sizes, densities, or counts must be verified in the original feature space.


This case study supports Chapter 21: Dimensionality Reduction. Return to the chapter to review the technical foundations.