Chapter 7 Quiz
Test your understanding of unsupervised learning concepts. Each question has one best answer unless stated otherwise. Try to answer each question before revealing the solution.
Question 1
What is the primary objective of unsupervised learning?
- A) Minimize prediction error on labeled test data
- B) Discover hidden structure in unlabeled data
- C) Maximize classification accuracy
- D) Reduce the size of the training set
Show Answer
**B) Discover hidden structure in unlabeled data.** Unsupervised learning operates without target labels. The goal is to find patterns, groupings, or representations in the data itself, rather than learning a mapping from inputs to known outputs.
Question 2
In K-means clustering, the objective function (inertia) is defined as:
- A) The sum of distances between all pairs of points
- B) The sum of squared distances from each point to the overall mean
- C) The sum of squared distances from each point to its assigned cluster centroid
- D) The maximum distance between any two cluster centroids
Show Answer
**C) The sum of squared distances from each point to its assigned cluster centroid.** Inertia (WCSS) = $\sum_{i=1}^{k} \sum_{\mathbf{x} \in C_i} \|\mathbf{x} - \boldsymbol{\mu}_i\|^2$. It measures the total within-cluster variance.
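As a quick check of this definition, here is a minimal scikit-learn sketch (the toy dataset and `k=3` are illustrative) that recomputes the inertia by hand and compares it with `KMeans.inertia_`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Recompute the WCSS by hand: squared distance from each point to its assigned centroid
manual_inertia = sum(
    np.sum((X[km.labels_ == j] - km.cluster_centers_[j]) ** 2)
    for j in range(km.n_clusters)
)

print(km.inertia_)     # scikit-learn's stored value
print(manual_inertia)  # should match, up to floating-point error
```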
Question 3
Which of the following is NOT a limitation of K-means clustering?
- A) It requires specifying the number of clusters in advance
- B) It assumes clusters are spherical and equally sized
- C) It is sensitive to the initial placement of centroids
- D) It has exponential time complexity
Show Answer
**D) It has exponential time complexity.** K-means has time complexity $O(nkd)$ per iteration, which is linear in the number of points, clusters, and dimensions. This makes it one of the most scalable clustering algorithms. The other three options are genuine limitations.
Question 4
K-means++ initialization improves upon random initialization by:
- A) Using the first $k$ data points as centroids
- B) Selecting centroids that are spread apart, with probability proportional to squared distance from existing centroids
- C) Running K-means multiple times and selecting the best result
- D) Placing centroids at the corners of the data's bounding box
Show Answer
**B) Selecting centroids that are spread apart, with probability proportional to squared distance from existing centroids.** K-means++ selects each new centroid with probability $\propto D(\mathbf{x})^2$, where $D(\mathbf{x})$ is the distance to the nearest existing centroid. This ensures initial centroids are well-separated and provides an $O(\log k)$ approximation guarantee.
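In scikit-learn, `init="k-means++"` is the default. A small sketch comparing it with purely random initialization on toy data (dataset and cluster count are illustrative; exact numbers will vary):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

km_pp = KMeans(n_clusters=4, init="k-means++", n_init=10, random_state=0).fit(X)
km_rand = KMeans(n_clusters=4, init="random", n_init=10, random_state=0).fit(X)

# k-means++ typically reaches an equal or lower inertia, often in fewer iterations
print(km_pp.inertia_, km_pp.n_iter_)
print(km_rand.inertia_, km_rand.n_iter_)
```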
Question 5
In hierarchical clustering, which linkage criterion is most susceptible to the "chaining effect"?
- A) Single linkage
- B) Complete linkage
- C) Average linkage
- D) Ward's linkage
Show Answer
**A) Single linkage.** Single linkage defines inter-cluster distance as the minimum distance between any two points in different clusters. This can produce elongated, "chained" clusters: two groups that are quite different overall may still merge through a chain of closely spaced intermediate points.
Question 6
In DBSCAN, a point is classified as a core point if:
- A) It is the centroid of a cluster
- B) It has at least `min_samples` points within distance `eps`
- C) It has the highest density in its neighborhood
- D) It is not within distance `eps` of any other point
Show Answer
**B) It has at least `min_samples` points within distance `eps`.** A core point has a dense enough neighborhood (at least `min_samples` points within radius `eps`). Border points are within `eps` of a core point but do not themselves meet the density threshold. Points that are neither core nor border are classified as noise.
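A minimal sketch separating the three point types with scikit-learn's `DBSCAN` (the `eps` and `min_samples` values are illustrative for this toy dataset):

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.08, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True   # dense-enough neighborhoods
noise_mask = db.labels_ == -1               # not reachable from any core point
border_mask = ~core_mask & ~noise_mask      # in a cluster, but not dense enough to be core

print(core_mask.sum(), border_mask.sum(), noise_mask.sum())
```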
Question 7
Which clustering algorithm does NOT require specifying the number of clusters in advance?
- A) K-means
- B) Gaussian Mixture Model
- C) DBSCAN
- D) Agglomerative Clustering (when using the `n_clusters` parameter)
Show Answer
**C) DBSCAN.** DBSCAN determines the number of clusters automatically based on the density structure of the data. It requires `eps` and `min_samples` instead. K-means requires $k$, GMM requires $k$ (number of components), and Agglomerative Clustering typically requires either `n_clusters` or a distance threshold.
Question 8
In a Gaussian Mixture Model, the "responsibilities" $\gamma_{ij}$ computed in the E-step represent:
- A) The weight of the $j$-th Gaussian component
- B) The posterior probability that component $j$ generated point $i$
- C) The Euclidean distance from point $i$ to the mean of component $j$
- D) The variance of component $j$
Show Answer
**B) The posterior probability that component $j$ generated point $i$.** The responsibility $\gamma_{ij}$ is the posterior probability $P(z_i = j \mid \mathbf{x}_i)$, computed using Bayes' theorem. This is what makes GMMs provide "soft" cluster assignments, unlike K-means which provides hard assignments.
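A short sketch of soft assignments with scikit-learn's `GaussianMixture` (dataset and component count are illustrative); `predict_proba` returns the responsibilities $\gamma_{ij}$:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=400, centers=3, random_state=0)

gmm = GaussianMixture(n_components=3, random_state=0).fit(X)

resp = gmm.predict_proba(X)   # responsibilities, shape (n_samples, n_components)
print(resp[:3].round(3))      # soft assignments for the first few points
print(resp.sum(axis=1)[:3])   # each row sums to 1
```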
Question 9
When using BIC to select the number of components in a GMM, you should choose the model with:
- A) The highest BIC
- B) The lowest BIC
- C) BIC closest to zero
- D) The steepest drop in BIC
Show Answer
**B) The lowest BIC.** BIC = $-2\ln\hat{L} + p\ln n$. It balances model fit (likelihood) against complexity (number of parameters). Lower BIC indicates a better trade-off. Unlike the elbow method, BIC provides a clear optimum rather than requiring subjective judgment.
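A minimal model-selection loop on toy data (the candidate range of 1 to 7 components is illustrative), keeping the component count with the lowest BIC:

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

bics = {
    k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
    for k in range(1, 8)
}
best_k = min(bics, key=bics.get)   # lowest BIC wins

print(bics)
print("selected number of components:", best_k)
```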
Question 10
What does PCA maximize when selecting principal components?
- A) The classification accuracy of the projected data
- B) The variance of the projected data along each component
- C) The distance between data points after projection
- D) The number of features that can be removed
Show Answer
**B) The variance of the projected data along each component.** PCA finds orthogonal directions that maximize the variance of the projected data. The first principal component captures the most variance, the second captures the most remaining variance (subject to orthogonality), and so on.
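A short sketch (the digits dataset is just an illustrative example) showing how `explained_variance_ratio_` reports the variance captured per component, in decreasing order:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X, _ = load_digits(return_X_y=True)

pca = PCA(n_components=10).fit(X)

print(pca.explained_variance_ratio_.round(3))             # per-component variance fractions
print(np.cumsum(pca.explained_variance_ratio_).round(3))  # cumulative variance retained
```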
Question 11
A dataset with 100 features is reduced to 10 principal components that explain 95% of the total variance. The reconstruction error is:
- A) 5% of the total variance
- B) 10% of the total variance
- C) 95% of the total variance
- D) Cannot be determined from this information
Show Answer
**A) 5% of the total variance.** The reconstruction error from PCA is equal to the variance in the discarded components. If 95% of variance is retained, then 5% is lost, which equals the reconstruction error (in terms of variance).
Question 12
Why is it important to standardize features before applying PCA?
- A) PCA requires features to be normally distributed
- B) Features with larger scales would dominate the principal components
- C) Standardization makes the covariance matrix invertible
- D) PCA only works with features in the range [0, 1]
Show Answer
**B) Features with larger scales would dominate the principal components.** PCA maximizes variance, so features measured in larger units (e.g., salary in dollars vs. age in years) would have artificially higher variance and dominate the principal components. Standardization (zero mean, unit variance) ensures all features contribute equally, as discussed in Chapter 3.
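A hedged sketch on the wine dataset (chosen only because its features span very different scales) contrasting PCA with and without standardization:

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, _ = load_wine(return_X_y=True)

raw = PCA(n_components=2).fit(X)
scaled = make_pipeline(StandardScaler(), PCA(n_components=2)).fit(X)

print(raw.explained_variance_ratio_)                        # dominated by the largest-scale features
print(scaled.named_steps["pca"].explained_variance_ratio_)  # more balanced after standardization
```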
Question 13
Which of the following statements about t-SNE is TRUE?
- A) t-SNE preserves global distances between clusters
- B) t-SNE can be used to transform new data points
- C) The relative sizes of clusters in a t-SNE plot are meaningful
- D) t-SNE uses a heavy-tailed Student's t-distribution in the low-dimensional space
Show Answer
**D) t-SNE uses a heavy-tailed Student's t-distribution in the low-dimensional space.** The Student's t-distribution (with one degree of freedom) is used in the low-dimensional space to address the "crowding problem." Options A and C are false because t-SNE distorts global distances and cluster sizes. Option B is false because t-SNE has no `transform()` method.
Question 14
The perplexity parameter in t-SNE roughly corresponds to:
- A) The number of output dimensions
- B) The effective number of neighbors considered for each point
- C) The learning rate for gradient descent
- D) The number of iterations
Show Answer
**B) The effective number of neighbors considered for each point.** Perplexity controls the bandwidth of the Gaussian kernels used to compute pairwise similarities. It can be interpreted as a smooth measure of the effective number of neighbors. Typical values range from 5 to 50, with 30 being a common default.
Question 15
Compared to t-SNE, UMAP offers which advantage(s)?
- A) Faster computation and the ability to transform new data
- B) Guaranteed preservation of all pairwise distances
- C) No hyperparameters to tune
- D) Always produces better visualizations
Show Answer
**A) Faster computation and the ability to transform new data.** UMAP is significantly faster than t-SNE (especially for large datasets) and supports a `transform()` method for embedding new points. It does still have hyperparameters (`n_neighbors`, `min_dist`), does not guarantee distance preservation, and the quality of visualizations depends on the dataset.
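A minimal sketch, assuming the third-party `umap-learn` package is installed (`pip install umap-learn`); unlike scikit-learn's `TSNE`, the fitted reducer can embed unseen points:

```python
import umap  # provided by the umap-learn package
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, _ = load_digits(return_X_y=True)
X_fit, X_new = train_test_split(X, test_size=0.2, random_state=0)

reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=0)
emb_fit = reducer.fit_transform(X_fit)  # learn the embedding
emb_new = reducer.transform(X_new)      # project points not seen during fitting

print(emb_fit.shape, emb_new.shape)
```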
Question 16
Isolation Forest detects anomalies based on the principle that:
- A) Anomalies have high local density
- B) Anomalies are easier to isolate (require fewer random splits)
- C) Anomalies are always in the corners of feature space
- D) Anomalies have the highest variance
Show Answer
**B) Anomalies are easier to isolate (require fewer random splits).** Isolation Forest builds random trees by selecting random features and random split values. Anomalies, being rare and different, tend to be isolated in fewer splits (shorter path lengths in the tree). Normal points, being clustered together, require more splits to isolate.
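A minimal sketch on synthetic data (the contamination rate and blob sizes are illustrative); predictions of -1 mark points that were easy to isolate:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.vstack([
    rng.normal(0, 1, size=(300, 2)),   # a dense "normal" blob
    rng.uniform(-6, 6, size=(10, 2)),  # a few scattered points
])

iso = IsolationForest(contamination=0.03, random_state=0).fit(X)

labels = iso.predict(X)        # +1 = normal, -1 = anomaly
scores = iso.score_samples(X)  # lower scores correspond to shorter average path lengths
print(int((labels == -1).sum()), scores.min().round(3))
```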
Question 17
A Local Outlier Factor (LOF) score of approximately 1.0 means:
- A) The point is definitely an outlier
- B) The point has similar density to its neighbors (likely normal)
- C) The point is at the center of a cluster
- D) The point has no neighbors within the search radius
Show Answer
**B) The point has similar density to its neighbors (likely normal).** LOF compares a point's local reachability density to that of its neighbors. A score near 1 means the density is comparable, indicating a normal point. Scores much greater than 1 indicate the point is in a sparser region relative to its neighbors, suggesting an anomaly.
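A short sketch (the point at (4, 4) is an artificial outlier): scikit-learn stores the negated score, so flipping the sign recovers LOF values near 1 for normal points:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 0.5, size=(200, 2)),
               [[4.0, 4.0]]])           # one point far from the cluster

lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)             # +1 = inlier, -1 = outlier
scores = -lof.negative_outlier_factor_  # ~1 for normal points, much larger for outliers

print(np.median(scores).round(2))       # typical inlier score, near 1
print(scores[-1].round(2))              # the isolated point scores well above 1
```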
Question 18
The silhouette coefficient for a data point ranges from:
- A) 0 to 1
- B) -1 to 1
- C) 0 to infinity
- D) -infinity to infinity
Show Answer
**B) -1 to 1.** The silhouette coefficient $s(i) = (b(i) - a(i)) / \max(a(i), b(i))$ is bounded between -1 (point is in the wrong cluster) and +1 (point is perfectly clustered). A score of 0 indicates the point is on the boundary between clusters.
Question 19
You compute silhouette scores for K-means with $k = 2, 3, 4, 5$ and get scores of 0.55, 0.72, 0.65, 0.45. Which $k$ does the silhouette score suggest is best?
- A) k = 2
- B) k = 3
- C) k = 4
- D) k = 5
Show Answer
**B) k = 3.** The silhouette score is highest at $k = 3$ (0.72), indicating the best balance between cluster cohesion and separation. Higher silhouette scores are better.
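A minimal sketch of that workflow on toy data (the candidate range of $k$ is illustrative), picking the $k$ with the highest mean silhouette:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

print(scores)
print("best k:", max(scores, key=scores.get))  # highest mean silhouette wins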
Question 20
The Adjusted Rand Index (ARI) differs from the raw Rand Index in that:
- A) It uses cosine distance instead of Euclidean distance
- B) It adjusts for chance agreement, with an expected value of 0 for random clusterings
- C) It only considers cluster centroids
- D) It requires the same number of clusters as ground-truth classes
Show Answer
**B) It adjusts for chance agreement, with an expected value of 0 for random clusterings.** The raw Rand Index can give high values even for random clusterings (especially with many clusters). ARI corrects for this by subtracting the expected value under random permutations, giving 0 for random clusterings and 1 for perfect agreement. It does not require the same number of clusters.
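A quick sketch comparing the two metrics on labelings generated independently of each other (exact values will vary with the random seed):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score, rand_score

rng = np.random.RandomState(0)
y_true = rng.randint(0, 3, size=1000)
y_random = rng.randint(0, 3, size=1000)  # clustering unrelated to y_true

print(rand_score(y_true, y_random))           # well above 0 despite random labels
print(adjusted_rand_score(y_true, y_random))  # close to 0, as expected under chance
print(adjusted_rand_score(y_true, y_true))    # 1.0 for perfect agreement
```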
Question 21
Which of the following is a common pitfall when using t-SNE for data analysis?
- A) Running it only once, when multiple runs should be compared
- B) Standardizing the input features
- C) Using perplexity in the range 30-50
- D) Reducing dimensionality with PCA before applying t-SNE
Show Answer
**A) Running it only once, when multiple runs should be compared.** t-SNE results can vary across runs due to the stochastic optimization. Running it multiple times (or with different perplexity values) helps ensure the observed structure is robust. Options B, C, and D are all recommended best practices, not pitfalls.
Question 22
You want to use clustering to create features for a downstream supervised model. Which approach is most appropriate?
- A) Cluster on t-SNE output and use cluster labels as features
- B) Cluster on the original (scaled) data and use distances to centroids as features
- C) Cluster on the test set and propagate labels to the training set
- D) Use the silhouette score directly as a feature
Show Answer
**B) Cluster on the original (scaled) data and use distances to centroids as features.** Distances to cluster centroids provide informative continuous features that capture the data's structure. Option A is wrong because t-SNE distorts distances. Option C violates the train/test split principle (data leakage, as discussed in Chapter 3). Option D confuses an evaluation metric with a feature.
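A hedged sketch of that pipeline (the dataset, scaler, and `n_clusters=5` are illustrative): the clusterer is fit on the scaled training data only, and `KMeans.transform` supplies distance-to-centroid features for both splits:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=600, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)  # fit preprocessing on training data only
km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(scaler.transform(X_train))

# transform() returns distances to each centroid: 5 new continuous features
train_feats = np.hstack([X_train, km.transform(scaler.transform(X_train))])
test_feats = np.hstack([X_test, km.transform(scaler.transform(X_test))])

print(train_feats.shape, test_feats.shape)
```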
Question 23
For a dataset with varying-density clusters (one tight cluster and one spread-out cluster), which algorithm is most likely to produce correct results?
- A) K-means
- B) DBSCAN with a single eps value
- C) Gaussian Mixture Model with full covariance
- D) K-means with mini-batch updates
Show Answer
**C) Gaussian Mixture Model with full covariance.** GMMs with full covariance matrices can model clusters of different shapes and sizes, since each component has its own covariance matrix. K-means assumes equal-sized spherical clusters. DBSCAN with a single eps struggles with varying densities: an eps small enough for the tight cluster labels much of the spread-out cluster as noise, while an eps large enough for the spread-out cluster tends to merge the tight cluster with its surroundings.
Question 24
When applying PCA as a preprocessing step before K-means, what is the primary benefit?
- A) PCA guarantees that K-means will find the global optimum
- B) Dimensionality reduction speeds up K-means and removes noisy features
- C) PCA converts categorical features to numerical ones
- D) PCA ensures that the optimal $k$ is always 2
Show Answer
**B) Dimensionality reduction speeds up K-means and removes noisy features.** By reducing dimensionality, PCA makes K-means faster (fewer distance calculations) and can improve results by removing noisy, low-variance dimensions. It does not guarantee optimality, convert data types, or determine the number of clusters.
Question 25
A trustworthiness score close to 1.0 for a dimensionality reduction method indicates:
- A) The reduction perfectly preserves all pairwise distances
- B) Points that are neighbors in the embedding were also neighbors in the original space
- C) The method has no hyperparameters
- D) The embedding took very little time to compute