Quiz: Chapter 21

Dimensionality Reduction: PCA, t-SNE, and UMAP


Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.


Question 1 (Multiple Choice)

What does PCA maximize when selecting principal components?

  • A) The correlation between original features and the target variable
  • B) The variance captured along each successive orthogonal direction
  • C) The distance between data points in the reduced space
  • D) The classification accuracy on the training set

Answer: B) The variance captured along each successive orthogonal direction. PCA finds orthogonal directions (eigenvectors of the covariance matrix) that capture the maximum amount of variance in the data. The first principal component captures the most variance, the second captures the most remaining variance while being orthogonal to the first, and so on. PCA has no knowledge of the target variable --- it is an unsupervised method. It does not optimize for correlation with the target, classification accuracy, or inter-point distances.
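The eigenvector view in the answer can be sketched directly in numpy. This is an illustrative example on synthetic data, not code from the chapter: the eigenvalues of the covariance matrix are exactly the variances captured along each component.

```python
import numpy as np

# Synthetic correlated data: 200 samples, 4 features (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) @ rng.normal(size=(4, 4))

# Center, then eigendecompose the covariance matrix.
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))  # ascending
order = np.argsort(eigvals)[::-1]                            # descending by variance
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Each eigenvalue is the variance captured along its (orthogonal) eigenvector.
explained_ratio = eigvals / eigvals.sum()
print(explained_ratio)  # PC1 captures the most variance, PC2 the next most, ...
```

Note that nothing about a target variable appears anywhere in the computation, which is the unsupervised point the answer makes.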


Question 2 (Multiple Choice)

Why is standardization (e.g., StandardScaler) required before applying PCA?

  • A) PCA requires normally distributed features
  • B) PCA maximizes variance, so features on larger scales will dominate the components
  • C) PCA cannot handle negative values
  • D) Standardization improves the numerical stability of the eigendecomposition

Answer: B) PCA maximizes variance, so features on larger scales will dominate the components. If monthly_hours_watched ranges from 0-200 and content_completion_rate ranges from 0-1, the variance of monthly_hours_watched is orders of magnitude larger. PCA will create components dominated by the high-variance feature, regardless of whether that feature is informative. Standardization puts all features on the same scale so PCA treats them equally. PCA does not require normality (option A), handles negative values (option C), and while standardization may help stability, it is not the primary reason (option D).
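The scale-dominance effect is easy to reproduce. A minimal numpy sketch with synthetic stand-ins for the two features named above (the feature names and ranges are taken from the answer; the data itself is made up):

```python
import numpy as np

rng = np.random.default_rng(1)
# Two features on very different scales:
hours = rng.uniform(0, 200, size=500)   # monthly_hours_watched, 0-200
rate = rng.uniform(0, 1, size=500)      # content_completion_rate, 0-1
X = np.column_stack([hours, rate])

def first_pc(X):
    """Leading eigenvector of the covariance matrix (first principal component)."""
    Xc = X - X.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return eigvecs[:, np.argmax(eigvals)]

# Unscaled: PC1 points almost entirely along the high-variance hours feature.
print(first_pc(X))

# After standardization both features have unit variance, so neither can
# dominate the components purely because of its measurement scale.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)
print(Xz.var(axis=0))  # [1. 1.]
```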


Question 3 (Multiple Choice)

A scree plot shows the following explained variance ratios: PC1=0.42, PC2=0.18, PC3=0.08, PC4=0.06, PC5-PC20 each below 0.04. How many components would you select using the "elbow" heuristic?

  • A) 1
  • B) 2
  • C) 3
  • D) 5

Answer: C) 3. The curve bends at PC3: explained variance falls steeply across the first three components (0.42, 0.18, 0.08) and then flattens into a long tail (0.06, then values below 0.04). The first three components capture 0.42 + 0.18 + 0.08 = 0.68 (68%) of the variance, and adding more yields diminishing returns. Two components capture only 60%, which may miss meaningful variance in PC3. Five components would reach into the flat tail, where each component adds very little.
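The arithmetic behind this answer can be checked with a cumulative sum. The PC5-PC20 tail is only described as "each below 0.04" in the question, so it is split evenly here purely for illustration so the ratios sum to 1:

```python
import numpy as np

# Ratios from the question; tail split evenly (0.26 / 16 ≈ 0.016 each).
ratios = np.array([0.42, 0.18, 0.08, 0.06] + [0.26 / 16] * 16)
cumulative = np.cumsum(ratios)

print(cumulative[:4])           # 0.42, 0.60, 0.68, 0.74
print(np.diff(cumulative[:5]))  # marginal gain of each added component
```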


Question 4 (Short Answer)

Explain the difference between using PCA for preprocessing and using PCA for visualization. Why does PCA often produce uninformative 2D scatter plots?

Answer: PCA for preprocessing compresses many features into fewer components (typically 5-50) to reduce dimensionality for a downstream model, preserving as much variance as possible. PCA for visualization projects to exactly 2 components for plotting. The 2D projection is often uninformative because the first two components may capture only a small fraction of the total variance (e.g., 15-20%), meaning 80% of the data's structure is lost. The resulting scatter plot shows a single amorphous blob because the variation that distinguishes clusters or groups lives in the discarded components.
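A practical habit that follows from this answer: compute how much variance a 2D projection actually captures before interpreting the plot. A quick numpy sketch on synthetic isotropic data, the kind of dataset whose 2D PCA plot is an uninformative blob:

```python
import numpy as np

# 20 dimensions with no dominant directions.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 20))

Xc = X - X.mean(axis=0)
eigvals = np.sort(np.linalg.eigvalsh(np.cov(Xc, rowvar=False)))[::-1]
top2 = eigvals[:2].sum() / eigvals.sum()

# Report this number alongside any 2D PCA scatter plot.
print(f"variance captured by the 2D projection: {top2:.1%}")
```

Here the first two components capture only a small fraction of the variance, so a 2D scatter plot would discard most of the structure.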


Question 5 (Multiple Choice)

Which of the following is TRUE about t-SNE?

  • A) t-SNE preserves global distances between clusters
  • B) t-SNE has a transform method for embedding new data
  • C) t-SNE preserves local neighborhoods --- nearby points in high dimensions stay nearby in 2D
  • D) t-SNE produces identical results regardless of random seed

Answer: C) t-SNE preserves local neighborhoods --- nearby points in high dimensions stay nearby in 2D. This is the core design objective of t-SNE. It explicitly does NOT preserve global distances (A) --- the distances between clusters in a t-SNE plot are meaningless. It does NOT have a transform method (B) --- you must refit from scratch when adding new data. It is NOT deterministic (D) --- different random seeds produce different layouts because the optimization is non-convex.
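The missing transform method from option B can be confirmed against scikit-learn's implementation (assuming scikit-learn is installed):

```python
from sklearn.manifold import TSNE

# scikit-learn's TSNE learns a one-off layout: it exposes fit_transform,
# but no transform method for projecting new points into an existing map.
print(hasattr(TSNE, "fit_transform"))  # True
print(hasattr(TSNE, "transform"))      # False
```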


Question 6 (Multiple Choice)

A t-SNE plot shows two clusters: one is tight and small, the other is large and diffuse. What can you conclude?

  • A) The tight cluster has lower variance in the original feature space
  • B) The diffuse cluster has more diversity in the original feature space
  • C) Nothing about relative cluster sizes or densities in the original space
  • D) The tight cluster has fewer observations

Answer: C) Nothing about relative cluster sizes or densities in the original space. t-SNE distorts densities during the embedding process. A tight cluster in 2D might correspond to a spread-out group in the original space, and vice versa. The visual size and tightness of clusters in a t-SNE plot do not reflect the actual variance or density of those groups in high-dimensional space. You would need to examine the original features to draw conclusions about cluster homogeneity.


Question 7 (Multiple Choice)

What does the perplexity parameter in t-SNE control?

  • A) The number of output dimensions
  • B) The learning rate of the optimization
  • C) The effective number of nearest neighbors considered for each point
  • D) The maximum number of iterations

Answer: C) The effective number of nearest neighbors considered for each point. Perplexity controls the bandwidth of the Gaussian kernel used to compute pairwise affinities in the high-dimensional space. Higher perplexity considers more neighbors, capturing broader structure. Lower perplexity focuses on very local neighborhoods, often producing more fragmented clusters. Typical values range from 5 to 50, and practitioners should always try multiple values to check robustness.
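The "effective number of neighbors" reading comes straight from the definition: perplexity is 2 raised to the Shannon entropy (in bits) of a point's neighbor distribution. A small numpy illustration (the distributions are made up):

```python
import numpy as np

def perplexity(p):
    """Perplexity of a probability distribution: 2 ** Shannon entropy (bits)."""
    p = p[p > 0]
    return 2.0 ** (-(p * np.log2(p)).sum())

# Uniform attention over k neighbors gives perplexity exactly k, which is
# why perplexity reads as an "effective neighbor count".
print(perplexity(np.ones(30) / 30))  # 30.0 (up to float rounding)

# Attention concentrated on one neighbor gives an effective count near 1.
print(perplexity(np.array([0.97, 0.01, 0.01, 0.01])))
```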


Question 8 (Short Answer)

Explain two advantages of UMAP over t-SNE. In what situation would you still prefer t-SNE?

Answer: First, UMAP is significantly faster than t-SNE, especially on large datasets, because it uses approximate nearest neighbors instead of computing all pairwise distances. Second, UMAP has a transform method that allows embedding new data without refitting, which t-SNE cannot do. You might prefer t-SNE when you need to reproduce results from an older analysis that used t-SNE, when your workflow is already built around t-SNE, or when your dataset is small enough (under 10,000 points) that speed is not a concern and you are already familiar with interpreting t-SNE's output.


Question 9 (Multiple Choice)

Which method should you use to reduce 100 features to 15 components as a preprocessing step for a logistic regression classifier?

  • A) t-SNE with n_components=15
  • B) UMAP with n_components=15
  • C) PCA with n_components=15
  • D) Any of the above --- they are interchangeable

Answer: C) PCA with n_components=15. PCA is the correct choice for preprocessing because it is deterministic (same input always produces the same output), approximately invertible, has a transform method for applying the same transformation to new data, and preserves the global variance structure that downstream linear models need. t-SNE is designed for 2- or 3-dimensional output (scikit-learn's default Barnes-Hut solver supports at most 3 components), has no transform method, and is non-deterministic. UMAP could technically produce 15 components and has a transform method, but its optimization is stochastic unless explicitly seeded, and its non-linear, distance-distorting mapping makes it unsuitable as a preprocessing step for a linear model.
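The recommended workflow is conventionally expressed as a scikit-learn pipeline. A hedged sketch on synthetic data standing in for the question's 100 features (assuming scikit-learn is available; the dataset and labels are fabricated for illustration):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the 100-feature dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 100))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Scale -> PCA(15) -> linear classifier, fitted as one unit so the exact
# basis learned during training is reused verbatim on any future data.
clf = make_pipeline(StandardScaler(), PCA(n_components=15), LogisticRegression())
clf.fit(X, y)

print(clf.named_steps["pca"].components_.shape)  # (15, 100)
print(clf.predict(X[:3]))
```

Bundling the steps in one pipeline also prevents accidentally refitting the scaler or PCA on test data.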


Question 10 (Multiple Choice)

What does UMAP's min_dist parameter control?

  • A) The minimum distance between cluster centroids
  • B) How tightly the embedding packs nearby points together
  • C) The minimum number of neighbors considered
  • D) The minimum explained variance per component

Answer: B) How tightly the embedding packs nearby points together. Low min_dist values (0.0-0.1) allow UMAP to pack points very closely, producing tight, well-separated clusters. High min_dist values (0.5-1.0) force points to spread out more uniformly, preserving the broader topological structure at the expense of tight clustering. This parameter affects the visual appearance of the embedding and should be tuned based on whether you want to emphasize cluster boundaries or continuous variation.


Question 11 (Multiple Choice)

You fit PCA on 24 features and find that the first 10 components capture 75% of the variance. You then reconstruct the original data from these 10 components. Which features will have the highest reconstruction error?

  • A) The features with the highest variance
  • B) The features with the highest correlation with the target
  • C) The features whose variance is least aligned with the top 10 principal components
  • D) The features with the most missing values

Answer: C) The features whose variance is least aligned with the top 10 principal components. Reconstruction error measures how well the reduced representation can recreate the original features. Features that are well-represented by the top components (i.e., their variance lies primarily along those directions) will have low reconstruction error. Features whose variance is orthogonal to the top components --- those carrying unique information that does not correlate with the dominant patterns --- will have the highest reconstruction error. This is independent of the target variable (B) and missing values (D).
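This can be demonstrated numerically. The sketch below builds a deliberately contrived dataset: 23 features driven by 10 shared latent factors, plus one independent feature whose variance is orthogonal to that shared structure. After reconstructing from the top 10 components, the independent feature has by far the largest error. All names and numbers here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n, k = 500, 10

# 23 features from 10 strong shared factors + small noise, plus one
# independent feature whose variance no shared component explains.
shared = rng.normal(size=(n, 10)) @ (2 * rng.normal(size=(10, 23)))
X = np.column_stack([shared + 0.1 * rng.normal(size=(n, 23)),
                     rng.normal(size=n)])          # (500, 24)

# "Fit" PCA: center, eigendecompose, keep the top-k components.
mean = X.mean(axis=0)
Xc = X - mean
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
W = eigvecs[:, np.argsort(eigvals)[::-1][:k]]      # (24, k)

# Project to k dims, reconstruct, and measure per-feature error.
X_hat = Xc @ W @ W.T + mean
per_feature_err = ((X - X_hat) ** 2).mean(axis=0)
print(per_feature_err.argmax())  # 23: the independent feature reconstructs worst
```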


Question 12 (Short Answer)

A junior data scientist proposes adding UMAP embeddings as two new features in a production churn prediction pipeline. Identify two problems with this approach.

Answer: First, UMAP is non-deterministic --- different random seeds produce different embeddings, which means the feature values would change between training and retraining, making the model unstable and difficult to debug. Second, UMAP embeddings are sensitive to the full dataset composition. When new observations are added or the data distribution shifts over time, the UMAP embedding space changes, meaning the features would not be comparable across different time periods. PCA would be a safer choice for production feature engineering because it is deterministic and has stable transformation properties.


Question 13 (Multiple Choice)

Which of the following is the correct workflow for using PCA in a train/test split?

  • A) Fit PCA on the full dataset, then split into train and test
  • B) Fit PCA on the training set, then transform both training and test sets
  • C) Fit PCA separately on the training set and the test set
  • D) Fit PCA on the test set and transform the training set

Answer: B) Fit PCA on the training set, then transform both training and test sets. This prevents data leakage --- if you fit PCA on the full dataset (A), the principal components incorporate information from the test set, biasing your evaluation. Fitting PCA separately on each set (C) produces different component directions, making the features incomparable. Fitting on the test set (D) is backwards. The correct approach mirrors all preprocessing: learn parameters from training data, apply to both.
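In code, the leak-free workflow looks like the following (a sketch with synthetic data, assuming scikit-learn; the shapes are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 12))
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

# Learn the scaling and the component directions from the training set only...
scaler = StandardScaler().fit(X_train)
pca = PCA(n_components=5).fit(scaler.transform(X_train))

# ...then apply that same fitted transformation to both sets.
Z_train = pca.transform(scaler.transform(X_train))
Z_test = pca.transform(scaler.transform(X_test))
print(Z_train.shape, Z_test.shape)  # (150, 5) (50, 5)
```

Note that `fit` is never called on the test set; only `transform` is.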


Question 14 (Short Answer)

You run t-SNE on 5,000 observations with perplexity=30 and random_state=42. The plot shows three clear clusters. You then run it again with perplexity=10 and the three clusters split into seven. A colleague asks: "Are there three clusters or seven?" How do you answer?

Answer: Neither the t-SNE plot alone nor any single perplexity value tells you the "true" number of clusters. The number of visible clusters in t-SNE depends on the perplexity setting, which controls the scale of neighborhood structure the algorithm emphasizes. The correct answer requires confirming structure with a proper clustering algorithm (e.g., K-Means, DBSCAN) on the original high-dimensional data and evaluating cluster quality with metrics like silhouette score. t-SNE is a visualization tool, not a clustering method.
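The confirmation step described above might look like the following sketch, which clusters in the original feature space and scores candidate cluster counts with the silhouette score (assuming scikit-learn; the blob dataset is a synthetic stand-in for the original features, not a t-SNE embedding):

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.datasets import make_blobs

# Synthetic 10-dimensional data with a known 3-cluster structure.
X, _ = make_blobs(n_samples=600, centers=3, n_features=10, random_state=42)

# Cluster in the ORIGINAL space and score each candidate k.
scores = {
    k: silhouette_score(
        X, KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    )
    for k in (2, 3, 4, 5, 6, 7)
}
best_k = max(scores, key=scores.get)
print(best_k)  # 3 on this synthetic data
```

A clustering validated this way answers the colleague's question; the t-SNE plots at different perplexities cannot.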


Question 15 (Multiple Choice)

For which of the following tasks is t-SNE the WORST choice?

  • A) Visualizing 50-dimensional embeddings in 2D for a presentation
  • B) Reducing 200 features to 20 for a downstream gradient boosting model
  • C) Exploring whether customer segments from K-Means are visually separable
  • D) Creating a one-time scatter plot of gene expression data

Answer: B) Reducing 200 features to 20 for a downstream gradient boosting model. t-SNE is designed for 2D or 3D visualization, not for producing features for downstream models. Its output is non-deterministic, non-invertible, has no transform method for new data, and the feature values have no interpretable meaning. For reducing 200 features to 20, PCA is the correct choice. The other three options (A, C, D) are all legitimate visualization use cases where t-SNE is appropriate, provided you respect its limitations when interpreting the results.


This quiz covers Chapter 21: Dimensionality Reduction. Return to the chapter to review concepts.