Case Study 1: StreamFlow PCA + UMAP Visualization
Understanding Churn Patterns Through Dimensionality Reduction
Background
StreamFlow's data science team has built a high-performing churn prediction model (Chapters 11-19). The model works --- AUC of 0.938, tuned hyperparameters, SHAP explanations for individual predictions. But the VP of Product has a different question:
"I do not want to know why individual customers churn. I want to see the landscape. Are there distinct types of churners? Are there pockets of healthy customers that look different from each other? Show me the shape of our customer base."
This is a dimensionality reduction problem. You need to take the 24-feature customer dataset and project it into a space where human eyes can see structure. The VP does not want a model --- she wants a map.
The plan:
- Use PCA to understand the variance structure and check for dominant patterns
- Use UMAP to create a 2D visualization that reveals customer segments
- Overlay churn status and other metadata to find actionable patterns
- Validate any visual patterns with actual feature statistics
Step 1: PCA Variance Analysis
Before creating any visualization, check whether the data has low-dimensional structure at all.
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
np.random.seed(42)
n = 50000
streamflow = pd.DataFrame({
'monthly_hours_watched': np.random.exponential(18, n).round(1),
'sessions_last_30d': np.random.poisson(14, n),
'avg_session_minutes': np.random.exponential(28, n).round(1),
'unique_titles_watched': np.random.poisson(8, n),
'content_completion_rate': np.random.beta(3, 2, n).round(3),
'binge_sessions_30d': np.random.poisson(2, n),
'weekend_ratio': np.random.beta(2.5, 3, n).round(3),
'peak_hour_ratio': np.random.beta(3, 2, n).round(3),
'hours_change_pct': np.random.normal(0, 30, n).round(1),
'sessions_change_pct': np.random.normal(0, 25, n).round(1),
'months_active': np.random.randint(1, 60, n),
'plan_price': np.random.choice(
[9.99, 14.99, 19.99, 24.99], n, p=[0.35, 0.35, 0.20, 0.10]
),
'devices_used': np.random.randint(1, 6, n),
'profiles_active': np.random.randint(1, 5, n),
'payment_failures_6m': np.random.poisson(0.3, n),
'support_tickets_90d': np.random.poisson(0.8, n),
'days_since_last_session': np.random.exponential(5, n).round(0).clip(0, 60),
'recommendation_click_rate': np.random.beta(2, 8, n).round(3),
'search_frequency_30d': np.random.poisson(6, n),
'download_count_30d': np.random.poisson(3, n),
'share_count_30d': np.random.poisson(1, n),
'rating_count_30d': np.random.poisson(2, n),
'free_trial_convert': np.random.binomial(1, 0.65, n),
'referral_source': np.random.choice(
[0, 1, 2, 3], n, p=[0.50, 0.25, 0.15, 0.10]
),
})
churn_logit = (
-3.0
+ 0.08 * streamflow['days_since_last_session']
- 0.02 * streamflow['monthly_hours_watched']
- 0.04 * streamflow['sessions_last_30d']
+ 0.15 * streamflow['payment_failures_6m']
+ 0.10 * streamflow['support_tickets_90d']
- 0.03 * streamflow['content_completion_rate'] * 10
+ 0.05 * (streamflow['hours_change_pct'] < -30).astype(int)
- 0.01 * streamflow['months_active']
+ 0.08 * (streamflow['plan_price'] > 19.99).astype(int)
- 0.02 * streamflow['unique_titles_watched']
+ np.random.normal(0, 0.3, n)
)
churn_prob = 1 / (1 + np.exp(-churn_logit))
streamflow['churned'] = np.random.binomial(1, churn_prob)
X = streamflow.drop(columns=['churned'])
y = streamflow['churned']
scaler = StandardScaler()
X_scaled = pd.DataFrame(scaler.fit_transform(X), columns=X.columns)
# Full PCA
pca_full = PCA(random_state=42)
pca_full.fit(X_scaled)
cumvar = np.cumsum(pca_full.explained_variance_ratio_)
n_80 = np.argmax(cumvar >= 0.80) + 1
n_90 = np.argmax(cumvar >= 0.90) + 1
print(f"Components for 80% variance: {n_80}")
print(f"Components for 90% variance: {n_90}")
print(f"Variance in first 2 components: {cumvar[1]:.4f}")
print(f"Variance in first 5 components: {cumvar[4]:.4f}")
Key Finding --- The variance is spread broadly across components. The first two components capture only about 13% of the variance. This tells us two things: (a) a 2D PCA plot will lose 87% of the information and likely show a single blob, and (b) we should not expect PCA visualization to reveal churn patterns. This is exactly the scenario where UMAP adds value.
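The executive deck skips the scree plot, but in the notebook it makes the "variance is spread broadly" finding concrete. A minimal sketch of the cumulative-variance curve; `X_demo` is a synthetic stand-in for `X_scaled` so the snippet runs on its own:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; safe in scripts
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(2000, 24))  # stand-in for X_scaled

pca = PCA(random_state=42).fit(X_demo)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Cumulative variance curve with the 80% / 90% thresholds marked
fig, ax = plt.subplots(figsize=(8, 4))
ax.plot(np.arange(1, len(cumvar) + 1), cumvar, marker="o")
ax.axhline(0.80, ls="--", color="gray", label="80% threshold")
ax.axhline(0.90, ls=":", color="gray", label="90% threshold")
ax.set_xlabel("Number of components")
ax.set_ylabel("Cumulative explained variance")
ax.legend()
fig.savefig("cs1_pca_scree.png", dpi=150, bbox_inches="tight")
```

A flat, nearly diagonal curve like the one this data produces is the visual signature of variance spread thinly across many components.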
PCA Loadings: What the Components Capture
Even though PCA is not ideal for visualization here, the loadings are informative.
loadings = pd.DataFrame(
pca_full.components_[:5].T,
columns=[f'PC{i+1}' for i in range(5)],
index=X.columns
)
fig, axes = plt.subplots(1, 3, figsize=(18, 6))
for i, ax in enumerate(axes):
pc = f'PC{i+1}'
    # Sort by magnitude so the largest bars sit at the top;
    # keep the sign information in the bar color
    order = loadings[pc].abs().sort_values().index
    sorted_loadings = loadings[pc].loc[order]
    colors = ['steelblue' if v > 0 else 'salmon' for v in sorted_loadings]
    sorted_loadings.abs().plot.barh(ax=ax, color=colors)
ax.set_title(f'{pc} Loadings (explained var: '
f'{pca_full.explained_variance_ratio_[i]:.3f})')
ax.set_xlabel('|Loading|')
plt.tight_layout()
plt.savefig('cs1_pca_loadings.png', dpi=150, bbox_inches='tight')
plt.show()
Typical interpretation:

- PC1 loads on engagement metrics (hours, sessions, titles). This is the "how much do they use the platform" axis.
- PC2 loads on behavioral patterns (weekend ratio, peak hour ratio, session length). This is the "how do they use the platform" axis.
- PC3 loads on friction indicators (payment failures, support tickets, days since last session). This is the "how much trouble are they having" axis.
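Bar charts aside, the top loadings are often easier to scan as text. A small sketch that ranks features by |loading| per component; the tiny synthetic frame (`demo`, `cols`) is a stand-in for `X_scaled` and `X.columns`:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
cols = [f"feat_{i}" for i in range(8)]  # stand-in for X.columns
demo = pd.DataFrame(rng.normal(size=(500, 8)), columns=cols)
pca = PCA(n_components=3, random_state=0).fit(demo)

# Rows = features, columns = components
loadings = pd.DataFrame(pca.components_.T,
                        columns=[f"PC{i+1}" for i in range(3)],
                        index=cols)

# Print the four strongest features per component by |loading|
for pc in loadings.columns:
    top = loadings[pc].abs().sort_values(ascending=False).head(4)
    print(pc, "->", list(top.index))
```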
Step 2: UMAP Visualization
Now we move to UMAP for the actual visualization. We subsample for speed and to keep the plot readable.
import umap
# Subsample 8,000 observations
sample_idx = np.random.RandomState(42).choice(len(X_scaled), 8000, replace=False)
X_sample = X_scaled.iloc[sample_idx].reset_index(drop=True)
y_sample = y.iloc[sample_idx].reset_index(drop=True)
X_orig_sample = X.iloc[sample_idx].reset_index(drop=True)
# UMAP with default parameters
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2, random_state=42)
embedding = reducer.fit_transform(X_sample)
# Store for later use
umap_df = pd.DataFrame({
'umap_1': embedding[:, 0],
'umap_2': embedding[:, 1],
'churned': y_sample
})
The Churn Map
fig, axes = plt.subplots(1, 2, figsize=(18, 7))
# Left: colored by churn
colors_churn = y_sample.map({0: 'steelblue', 1: 'salmon'})
axes[0].scatter(embedding[:, 0], embedding[:, 1],
c=colors_churn, alpha=0.3, s=5)
axes[0].set_title('UMAP --- Colored by Churn Status')
axes[0].set_xlabel('UMAP 1')
axes[0].set_ylabel('UMAP 2')
# Manual legend
from matplotlib.lines import Line2D
legend_elements = [
Line2D([0], [0], marker='o', color='w', markerfacecolor='steelblue',
label='Retained', markersize=8),
Line2D([0], [0], marker='o', color='w', markerfacecolor='salmon',
label='Churned', markersize=8),
]
axes[0].legend(handles=legend_elements)
# Right: colored by days_since_last_session
sc = axes[1].scatter(embedding[:, 0], embedding[:, 1],
c=X_orig_sample['days_since_last_session'],
cmap='YlOrRd', alpha=0.4, s=5)
axes[1].set_title('UMAP --- Colored by Days Since Last Session')
axes[1].set_xlabel('UMAP 1')
axes[1].set_ylabel('UMAP 2')
plt.colorbar(sc, ax=axes[1], label='Days Since Last Session')
plt.tight_layout()
plt.savefig('cs1_umap_churn_map.png', dpi=150, bbox_inches='tight')
plt.show()
Multi-Feature Overlay
To understand what drives the visual structure, overlay multiple features.
features_to_plot = [
'monthly_hours_watched', 'sessions_last_30d',
'payment_failures_6m', 'support_tickets_90d',
'months_active', 'plan_price'
]
fig, axes = plt.subplots(2, 3, figsize=(20, 12))
for ax, feat in zip(axes.ravel(), features_to_plot):
sc = ax.scatter(embedding[:, 0], embedding[:, 1],
c=X_orig_sample[feat], cmap='viridis',
alpha=0.4, s=5)
ax.set_title(f'Colored by {feat}')
ax.set_xlabel('UMAP 1')
ax.set_ylabel('UMAP 2')
plt.colorbar(sc, ax=ax)
plt.suptitle('UMAP --- Feature Overlays', fontsize=14)
plt.tight_layout()
plt.savefig('cs1_umap_feature_overlay.png', dpi=150, bbox_inches='tight')
plt.show()
What to Look For --- If a specific region of the UMAP plot lights up for both churn (red) and a feature (e.g., high days_since_last_session), that tells you UMAP has organized customers so that disengaged churners cluster together. This is visual confirmation of what the churn model already knows from SHAP (Chapter 19), but in a format the VP can digest at a glance.
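One way to make "lights up" less subjective is to bin the embedding into a grid and compute per-cell churn rates. A sketch with synthetic stand-ins (`emb`, `churn`) in place of `embedding` and `y_sample`:

```python
import numpy as np

rng = np.random.default_rng(42)
emb = rng.normal(size=(8000, 2))          # stand-in for `embedding`
churn = rng.binomial(1, 0.1, size=8000)   # stand-in for y_sample

bins = 20
# Count all customers, then churners, in the same grid cells
total, xe, ye = np.histogram2d(emb[:, 0], emb[:, 1], bins=bins)
churned, _, _ = np.histogram2d(emb[churn == 1, 0], emb[churn == 1, 1],
                               bins=[xe, ye])
# Per-cell churn rate; empty cells stay at 0
rate = np.divide(churned, total, out=np.zeros_like(total), where=total > 0)

# Well-populated cells with at least double the base churn rate are
# candidate regions to investigate in the original features
hot = np.argwhere((total >= 30) & (rate > 2 * churn.mean()))
print(f"{len(hot)} high-churn cells out of {bins * bins}")
```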
Step 3: Identifying Customer Archetypes
Rather than reading the UMAP plot subjectively, we can cluster the UMAP embedding and then profile each cluster using original features.
from sklearn.cluster import KMeans
# Cluster the UMAP embedding
kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
umap_df['cluster'] = kmeans.fit_predict(embedding)
# Profile each cluster
profile = X_orig_sample.copy()
profile['churned'] = y_sample
profile['cluster'] = umap_df['cluster']
cluster_profile = profile.groupby('cluster').agg({
'churned': 'mean',
'monthly_hours_watched': 'mean',
'sessions_last_30d': 'mean',
'days_since_last_session': 'mean',
'payment_failures_6m': 'mean',
'support_tickets_90d': 'mean',
'months_active': 'mean',
'plan_price': 'mean',
'content_completion_rate': 'mean',
}).round(3)
cluster_profile['size'] = profile.groupby('cluster').size()
cluster_profile = cluster_profile.sort_values('churned', ascending=False)
print(cluster_profile.to_string())
Visualizing the Clusters
fig, axes = plt.subplots(1, 2, figsize=(18, 7))
# Left: UMAP colored by cluster
scatter = axes[0].scatter(embedding[:, 0], embedding[:, 1],
c=umap_df['cluster'], cmap='Set2',
alpha=0.4, s=5)
axes[0].set_title('UMAP --- Customer Clusters')
axes[0].set_xlabel('UMAP 1')
axes[0].set_ylabel('UMAP 2')
# Right: Churn rate by cluster
cluster_churn = profile.groupby('cluster')['churned'].mean().sort_values(ascending=False)
cluster_churn.plot.bar(ax=axes[1], color='steelblue', alpha=0.8)
axes[1].set_title('Churn Rate by Cluster')
axes[1].set_ylabel('Churn Rate')
axes[1].set_xlabel('Cluster')
plt.tight_layout()
plt.savefig('cs1_umap_clusters.png', dpi=150, bbox_inches='tight')
plt.show()
Critical Validation --- We clustered the UMAP embedding for visual labeling, but the cluster profiles use the original features. This is the correct workflow. Never describe clusters using UMAP coordinates --- they have no inherent meaning. Always go back to the original features to characterize what makes each cluster distinct.
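A compact way to do that profiling is to z-score each cluster's feature means against the overall mean, so every cell reads as "standard deviations from typical." A sketch with a synthetic stand-in frame (`df`) and stand-in labels in place of `X_orig_sample` and the KMeans assignments:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
df = pd.DataFrame({"hours": rng.exponential(18, 1000),
                   "tickets": rng.poisson(0.8, 1000)})
labels = rng.integers(0, 3, size=1000)   # stand-in cluster labels

# Cluster means expressed in standard deviations from the overall mean;
# large |z| cells are what make a cluster distinctive
overall_mean, overall_std = df.mean(), df.std()
z_profile = (df.groupby(labels).mean() - overall_mean) / overall_std
print(z_profile.round(2))
```

Sorting each row by absolute value then gives a one-line description of each cluster in original-feature terms.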
Step 4: Naming the Archetypes
Based on the cluster profiles, we assign human-readable labels. Typical archetypes in SaaS churn data:
| Cluster | Churn Rate | Archetype | Key Characteristics |
|---|---|---|---|
| 3 | ~14% | At-Risk Disengaged | Low hours, high days since last session, payment failures |
| 1 | ~10% | Frustrated Active | Moderate hours, high support tickets, medium tenure |
| 0 | ~8% | Average Users | Near-mean on all features |
| 4 | ~6% | Loyal Power Users | High hours, many sessions, high completion rate, long tenure |
| 2 | ~5% | Casual Satisfied | Low-moderate hours, few tickets, few payment failures |
archetype_labels = {
3: 'At-Risk Disengaged',
1: 'Frustrated Active',
0: 'Average Users',
4: 'Loyal Power Users',
2: 'Casual Satisfied',
}
# Note: cluster numbers may differ based on random state; map by churn rate ranking
umap_df['archetype'] = umap_df['cluster'].map(archetype_labels)
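The mapping above hard-codes cluster ids. To make the labels robust to a different random state, assign names by churn-rate rank instead, as the note suggests. A sketch with synthetic stand-ins (`clusters`, `churned`) for `umap_df['cluster']` and `y_sample`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
clusters = pd.Series(rng.integers(0, 5, size=2000))   # stand-in labels
churned = pd.Series(rng.binomial(1, 0.1, size=2000))  # stand-in outcomes

# Archetype names ordered from highest to lowest expected churn rate
names_by_rank = ['At-Risk Disengaged', 'Frustrated Active',
                 'Average Users', 'Loyal Power Users', 'Casual Satisfied']

# Rank clusters by observed churn rate, then pair with the names
order = churned.groupby(clusters).mean().sort_values(ascending=False).index
label_map = {cid: name for cid, name in zip(order, names_by_rank)}
archetype = clusters.map(label_map)
print(archetype.value_counts())
```

With this mapping, re-running KMeans under a new seed reshuffles the ids but not the names.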
Step 5: Robustness Check
Any finding from UMAP must survive a robustness check. We re-run with different hyperparameters.
fig, axes = plt.subplots(1, 3, figsize=(20, 6))
configs = [
{'n_neighbors': 10, 'min_dist': 0.05},
{'n_neighbors': 15, 'min_dist': 0.1},
{'n_neighbors': 30, 'min_dist': 0.2},
]
for ax, cfg in zip(axes, configs):
r = umap.UMAP(**cfg, random_state=42)
emb = r.fit_transform(X_sample)
ax.scatter(emb[:, 0], emb[:, 1], c=y_sample,
cmap='coolwarm', alpha=0.3, s=5)
ax.set_title(f'n_neighbors={cfg["n_neighbors"]}, '
f'min_dist={cfg["min_dist"]}')
plt.suptitle('UMAP Robustness: Different Hyperparameters', fontsize=14)
plt.tight_layout()
plt.savefig('cs1_umap_robustness.png', dpi=150, bbox_inches='tight')
plt.show()
Validation Rule --- If a pattern (e.g., a region of concentrated churners) appears across multiple UMAP configurations and corresponds to a meaningful feature profile in the original data, it is likely real. If a pattern appears at one setting but vanishes at another, it is an artifact. Never present a UMAP finding without this check.
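Visual comparison can also be backed by a number: cluster each embedding and score their agreement with the adjusted Rand index. A sketch using synthetic blob embeddings (`emb_a`, `emb_b`) as stand-ins for two UMAP runs with different hyperparameters:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Two near-identical embeddings standing in for two UMAP configs
base, _ = make_blobs(n_samples=3000, centers=5, cluster_std=0.5,
                     random_state=42)
rng = np.random.default_rng(42)
emb_a = base + rng.normal(scale=0.05, size=base.shape)
emb_b = base + rng.normal(scale=0.05, size=base.shape)

# Cluster each embedding independently, then score label agreement
labels_a = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(emb_a)
labels_b = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(emb_b)

ari = adjusted_rand_score(labels_a, labels_b)
print(f"Adjusted Rand index between configs: {ari:.3f}")
# Values near 1 mean the cluster structure is stable across settings;
# values near 0 mean the apparent structure is an artifact
```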
Step 6: The Executive Summary
The VP asked for a map of the customer base. Here is what we deliver:
Slide 1: The Customer Landscape. A single UMAP scatter plot colored by the five archetypes, with a clean legend. No axis labels (the VP does not care about "UMAP 1"). Each cluster is labeled directly on the plot.
Slide 2: The Archetype Profiles. A table showing each archetype's size, churn rate, average engagement, and average friction metrics. The "At-Risk Disengaged" archetype is highlighted in red.
Slide 3: The Action Items.

- "At-Risk Disengaged" customers (about 15% of the base) have 2x the churn rate. Recommended intervention: a re-engagement campaign with personalized content recommendations.
- "Frustrated Active" customers (about 18%) are using the platform but filing support tickets. Recommended intervention: proactive support outreach.
- "Loyal Power Users" (about 20%) are the healthiest segment. Recommended intervention: referral incentives to grow this segment.
What We Did Not Show --- We did not show the PCA scree plot. We did not show perplexity experiments. We did not explain UMAP's topological foundations. The VP needed a map and action items. The technical validation (robustness checks, feature profiling, cluster stability) stays in our notebook for the technical review.
Key Lessons
- Start with PCA to assess dimensional structure. Before running UMAP, check whether the data has meaningful low-dimensional structure. If the first few PCA components explain very little variance, PCA visualization will not work, but UMAP still might.
- UMAP reveals structure that PCA misses. PCA is limited to linear projections. For churn data where the interesting patterns are non-linear combinations of engagement, friction, and tenure, UMAP produces much more informative visualizations.
- Always profile clusters using original features. UMAP coordinates are meaningless. The value of UMAP is the spatial organization --- points near each other share characteristics. But you must go back to the original data to name and describe those characteristics.
- Robustness checks are mandatory. Run UMAP with at least 2-3 different hyperparameter settings. Any pattern you present must survive across settings.
- The output is the business insight, not the plot. The VP does not need to understand PCA or UMAP. She needs to know that there are five distinct customer archetypes, which ones are at risk, and what to do about it. The dimensionality reduction is the tool; the customer archetypes are the product.
This case study supports Chapter 21: Dimensionality Reduction. Return to the chapter for the full technical treatment.