Exercises: Chapter 21

Dimensionality Reduction: PCA, t-SNE, and UMAP


Exercise 1: PCA Fundamentals (Conceptual)

a) Explain in your own words what PCA does. Avoid the phrase "reduces dimensions" --- instead describe the geometric operation (rotation, projection) and what is preserved.

b) Why must you standardize features before applying PCA? Give a concrete example: if monthly_hours_watched ranges from 0-200 and content_completion_rate ranges from 0-1, what happens if you run PCA without scaling?
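To see the effect described in (b) concretely, here is a minimal sketch with two synthetic features whose ranges mimic the ones named above (the data itself is simulated, not the chapter's):

```python
# Without scaling, PCA is dominated by the feature with the largest variance.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
hours = rng.uniform(0, 200, 1000)       # monthly_hours_watched, range 0-200
completion = rng.uniform(0, 1, 1000)    # content_completion_rate, range 0-1
X = np.column_stack([hours, completion])

# Unscaled: PC1 points almost entirely along the hours axis
pc1_raw = PCA(n_components=1).fit(X).components_[0]

# Scaled: both features contribute comparably to PC1
X_std = StandardScaler().fit_transform(X)
pc1_std = PCA(n_components=1).fit(X_std).components_[0]

print(np.abs(pc1_raw))   # completion's loading is near zero
print(np.abs(pc1_std))   # loadings are now balanced
```

The unscaled run "discovers" only that hours has a larger numeric range, which is exactly the failure mode part (b) asks you to explain.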

c) A colleague fits PCA with 10 components on a 24-feature dataset and reports "we captured 85% of the variance." Explain what this means in precise terms. What is in the remaining 15%?

d) The explained variance ratios for a dataset are: PC1=0.45, PC2=0.20, PC3=0.10, PC4=0.05, .... Compare this to a dataset where the ratios are: PC1=0.06, PC2=0.05, PC3=0.05, PC4=0.05, .... What does the difference tell you about the structure of the two datasets? Which one would produce a more informative 2D PCA scatter plot?
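As a quick numeric check for (d), the cumulative variance of the two spectra can be compared directly (using only the four ratios given above):

```python
# Cumulative explained variance for the two hypothetical spectra in part (d).
import numpy as np

concentrated = np.array([0.45, 0.20, 0.10, 0.05])
flat = np.array([0.06, 0.05, 0.05, 0.05])

print(np.cumsum(concentrated))  # the first two PCs already hold 65%
print(np.cumsum(flat))          # the first two PCs hold only 11%
```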


Exercise 2: PCA on StreamFlow (Code)

Using the StreamFlow churn dataset from the chapter, complete the following:

a) Fit PCA with all 24 components. Create a scree plot showing both individual and cumulative explained variance. How many components are needed to capture 90% of the variance?

import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt

# --- Use the chapter's StreamFlow data setup ---
# Your code here: load data, scale, fit PCA

# Create the scree plot
# Your code here

# Report the number of components for 90% variance
# Your code here
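A minimal sketch of the part (a) computation, run here on synthetic correlated data standing in for the StreamFlow frame (swap in the chapter's scaled features):

```python
# Fit full PCA and find the component count reaching 90% cumulative variance.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# 24 correlated synthetic features driven by 6 latent factors
latent = rng.normal(size=(2000, 6))
X = latent @ rng.normal(size=(6, 24)) + 0.3 * rng.normal(size=(2000, 24))

pca = PCA().fit(StandardScaler().fit_transform(X))
cumvar = np.cumsum(pca.explained_variance_ratio_)
n90 = int(np.argmax(cumvar >= 0.90)) + 1  # first index reaching 90%, 1-based
print(n90)
# Scree plot: plt.bar for the individual ratios, plt.step for cumvar
```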

b) Extract the loadings for the first 3 principal components. For each, identify the top 3 features by absolute loading value and describe what the component captures in plain English (e.g., "PC1 is an engagement component dominated by hours and sessions").

c) Compare the AUC of a logistic regression classifier using: (i) all 24 original features, (ii) PCA with 5 components, (iii) PCA with 10 components, (iv) PCA with 15 components. Use 5-fold cross-validation with roc_auc scoring. At what number of components does performance plateau?

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Your code here: build pipelines and evaluate
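One of the four configurations in (c), sketched end-to-end; `make_classification` stands in for the chapter's features and churn labels:

```python
# Pipeline with PCA(10) in front of logistic regression, scored by 5-fold AUC.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=2000, n_features=24, n_informative=8,
                           random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10, random_state=42)),
    ('clf', LogisticRegression(max_iter=1000)),
])
scores = cross_val_score(pipe, X, y, cv=5, scoring='roc_auc')
print(scores.mean())
```

Repeating this with `n_components` set to 5, 15, and with the PCA step removed gives the four AUC values to compare.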

d) Compute the reconstruction error (RMSE per feature) when reducing to 10 components. Which 3 features have the highest reconstruction error? Why might these features be poorly captured by the top 10 components?
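The per-feature reconstruction error in (d) comes from projecting down and back up with `inverse_transform`; a sketch on synthetic stand-in data:

```python
# Per-feature RMSE between the scaled data and its 10-component reconstruction.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 24)) @ rng.normal(size=(24, 24))  # full-rank stand-in
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=10).fit(X_scaled)
X_hat = pca.inverse_transform(pca.transform(X_scaled))
rmse = np.sqrt(((X_scaled - X_hat) ** 2).mean(axis=0))  # one RMSE per feature

worst3 = np.argsort(rmse)[-3:][::-1]  # indices of the 3 worst-reconstructed features
print(worst3, rmse[worst3])
```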


Exercise 3: t-SNE Exploration (Code + Analysis)

a) Run t-SNE on a 5,000-observation subsample of the StreamFlow data with perplexity values of 5, 15, 30, and 50. Create a 2x2 grid of plots, coloring points by churn status. Describe how the visual structure changes across perplexity values.

from sklearn.manifold import TSNE

# Subsample 5,000 rows; X_scaled and y come from the chapter's setup
# (X_scaled as a DataFrame here --- use plain indexing if it is a NumPy array)
sample_idx = np.random.RandomState(42).choice(len(X_scaled), size=5000, replace=False)
X_sample = X_scaled.iloc[sample_idx]
y_sample = y.iloc[sample_idx]

# Your code here: fit t-SNE at 4 perplexity values, create 2x2 grid
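The fit at a single perplexity looks like the sketch below (shown on a small synthetic sample so it runs quickly; the exercise loops this over perplexity values 5, 15, 30, and 50):

```python
# t-SNE embedding at one perplexity value; each run yields one scatter panel.
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(42)
X_small = rng.normal(size=(500, 24))  # synthetic stand-in for X_sample

emb = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(X_small)
print(emb.shape)  # 2-D coordinates, one row per observation
```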

b) Run t-SNE twice with perplexity=30 but different random states (random_state=42 and random_state=99). Compare the two plots. Are the same clusters visible? Are they in the same positions? Explain why or why not, and what this means for interpreting t-SNE results.

c) A colleague shows you a t-SNE plot with two well-separated clusters and says: "The churners and non-churners are very distinct groups --- they are far apart in feature space." Write a 3-4 sentence response explaining why this interpretation is incorrect. Cite the specific properties of t-SNE that make this conclusion invalid.


Exercise 4: UMAP Exploration (Code + Analysis)

a) Run UMAP on the same 5,000-observation subsample with the following parameter combinations and create a 2x3 grid:

n_neighbors    min_dist
     5           0.0
    15           0.0
    50           0.0
     5           0.5
    15           0.5
    50           0.5

import umap

# Your code here: fit UMAP with 6 configurations, create 2x3 grid

Describe the effect of increasing n_neighbors (holding min_dist constant) and the effect of increasing min_dist (holding n_neighbors constant).

b) Fit UMAP on the first 4,000 observations, then use the transform method to embed the remaining 1,000. Color the new observations differently. Do they land in expected positions relative to the training data?

c) Compare UMAP with metric='euclidean' versus metric='cosine' on the StreamFlow data. Create side-by-side plots. For what types of data would cosine distance produce meaningfully different results?


Exercise 5: Head-to-Head Comparison (Code + Analysis)

a) Create a 1x3 figure showing PCA, t-SNE, and UMAP applied to the StreamFlow subsample, all colored by churn status. Below the figure, fill in this comparison table from your observations:

Property               PCA    t-SNE    UMAP
Visible clusters?
Churn separation?
Run time (seconds)

import time

# Time each method
start = time.time()
# Your code here: fit and transform with PCA
pca_time = time.time() - start

# Repeat the timing for t-SNE and UMAP

b) Color all three plots by plan_price instead of churn status. Then by days_since_last_session. Then by months_active. For each coloring variable, which visualization method reveals the most structure?
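Coloring an embedding by a continuous variable is a one-argument change to the scatter call; a sketch with hypothetical stand-ins for `emb` and `plan_price`:

```python
# Scatter an embedding colored by a continuous variable, with a colorbar.
import matplotlib
matplotlib.use('Agg')  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
emb = rng.normal(size=(500, 2))                         # stand-in embedding
plan_price = rng.choice([7.99, 12.99, 18.99], size=500)  # stand-in variable

sc = plt.scatter(emb[:, 0], emb[:, 1], c=plan_price, cmap='viridis', s=8)
plt.colorbar(sc, label='plan_price')
plt.savefig('colored_embedding.png')
```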

c) A product manager asks you: "Which method should we use for our quarterly business review?" Write a 3-4 sentence recommendation that considers audience, interpretation risks, and the specific properties of each method.


Exercise 6: PCA for Preprocessing in a Pipeline (Code)

a) Build two XGBoost pipelines for churn prediction: one with PCA preprocessing (10 components) and one without. Compare 5-fold cross-validated AUC. Is PCA helping, hurting, or neutral for this model?

from xgboost import XGBClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

pipe_no_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', XGBClassifier(
        n_estimators=200, max_depth=4,
        random_state=42, eval_metric='logloss'
    ))
])

pipe_with_pca = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=10, random_state=42)),
    ('clf', XGBClassifier(
        n_estimators=200, max_depth=4,
        random_state=42, eval_metric='logloss'
    ))
])

# Your code here: cross-validate both and compare

b) Repeat the comparison using LogisticRegression instead of XGBoost. Does PCA help more for logistic regression than for XGBoost? Explain why this might be the case.

c) Build a pipeline that uses GridSearchCV to jointly tune the number of PCA components (5, 10, 15, 20) and the logistic regression regularization strength (C=0.01, 0.1, 1.0, 10.0). Report the best combination.

from sklearn.model_selection import GridSearchCV

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(random_state=42)),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])

param_grid = {
    'pca__n_components': [5, 10, 15, 20],
    'clf__C': [0.01, 0.1, 1.0, 10.0]
}

# Your code here
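The search from (c), run end-to-end on synthetic stand-in data (`make_classification` replaces the chapter's features here):

```python
# Jointly tune PCA component count and logistic regression C over a 4x4 grid.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=1500, n_features=24, n_informative=8,
                           random_state=42)

pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(random_state=42)),
    ('clf', LogisticRegression(max_iter=1000, random_state=42)),
])
param_grid = {
    'pca__n_components': [5, 10, 15, 20],
    'clf__C': [0.01, 0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring='roc_auc')
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```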


Exercise 7: E-Commerce Product Embeddings (Code + Analysis)

Using the product embedding simulation from the chapter (or your own embedding data):

a) Run PCA on the 128-dimensional product embeddings. How many components capture 80% of the variance? What does this tell you about the intrinsic dimensionality of the embedding space?

b) Create a 1x2 figure showing t-SNE and UMAP embeddings of the products, colored by category. Which method produces better visual separation of categories?

c) Identify the 10 products whose UMAP position is furthest from the centroid of their category cluster. These are the "outlier" products. Examine their original embedding distances to products in other categories. Are they genuinely miscategorized, or are they cross-category products?
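The centroid-distance step in (c) can be sketched as follows, with synthetic stand-ins for the UMAP coordinates (`emb`) and category labels (`cats`):

```python
# Distance of each point's 2-D embedding from its own category centroid,
# then the 10 points furthest from their centroid as candidate outliers.
import numpy as np

rng = np.random.default_rng(0)
cats = rng.integers(0, 4, size=400)                   # 4 hypothetical categories
emb = rng.normal(size=(400, 2)) + cats[:, None] * 3.0  # crude cluster structure

centroids = np.stack([emb[cats == c].mean(axis=0) for c in range(4)])
dists = np.linalg.norm(emb - centroids[cats], axis=1)  # one distance per point

outliers = np.argsort(dists)[-10:][::-1]  # 10 furthest, most distant first
print(outliers)
```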

d) A data engineer asks: "Should we use PCA or UMAP to compress these 128-dimensional embeddings to 32 dimensions for faster nearest-neighbor search in production?" Write a recommendation that considers determinism, invertibility, speed, and the impact on recommendation quality.


Exercise 8: The Dangers of t-SNE Misinterpretation (Analysis)

This exercise has no code. It tests whether you have internalized the t-SNE warnings.

a) A colleague shows a t-SNE plot where churners form a tight, small cluster on the left and non-churners form a large, spread-out cloud on the right. They conclude: "Churners are a very homogeneous group --- they all look alike. Non-churners are diverse." Explain why this conclusion is not supported by the t-SNE visualization. What would you need to check in the original feature space?

b) A team presents a t-SNE plot where three customer segments are separated by large gaps. They argue: "Segment A and Segment B are far apart, so they are very different. Segment A and Segment C are close, so they are similar." Explain why this reasoning is flawed. How would you actually measure segment similarity?

c) You run t-SNE three times with the same data but different random seeds. Each run produces a different layout. A junior analyst asks: "Which one is correct?" Write a response that explains t-SNE's non-deterministic nature and how to extract reliable insights despite it.

d) Under what circumstances would you trust a pattern in a t-SNE plot? List 3 specific criteria that would increase your confidence that a pattern is real rather than an artifact.


These exercises support Chapter 21: Dimensionality Reduction. Return to the chapter for reference.