Quiz: Chapter 20

Clustering: K-Means, DBSCAN, and Hierarchical Clustering


Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.


Question 1 (Multiple Choice)

K-Means minimizes which objective function?

  • A) The sum of absolute distances from each point to its assigned centroid
  • B) The sum of squared distances from each point to its assigned centroid (inertia)
  • C) The maximum distance from any point to its assigned centroid
  • D) The number of points that change cluster assignment between iterations

Answer: B) The sum of squared distances from each point to its assigned centroid (inertia). K-Means iteratively assigns points to the nearest centroid and recomputes centroids as the mean of assigned points. This process minimizes inertia --- the total within-cluster sum of squares. Because it uses squared Euclidean distance, K-Means implicitly assumes spherical clusters with equal variance. Minimizing absolute distances (A) would correspond to K-Medians; minimizing maximum distance (C) would correspond to K-Center.
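This can be checked directly in scikit-learn, whose `inertia_` attribute exposes the objective. A minimal sketch on synthetic blobs (dataset and parameters are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated clusters (illustrative choice).
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)
km = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# Recompute inertia by hand: squared Euclidean distance from each point
# to the centroid of its assigned cluster, summed over all points.
manual_inertia = ((X - km.cluster_centers_[km.labels_]) ** 2).sum()
print(np.isclose(km.inertia_, manual_inertia))  # True
```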


Question 2 (Multiple Choice)

You run K-Means with k=3 on a dataset containing two crescent-shaped clusters (like make_moons). What happens?

  • A) K-Means will raise an error because it detects non-spherical clusters
  • B) K-Means will return 3 clusters, but they will not correspond to the 2 actual clusters
  • C) K-Means will return 2 clusters and ignore the k=3 setting
  • D) K-Means will return 3 clusters that perfectly separate the two crescents

Answer: B) K-Means will return 3 clusters, but they will not correspond to the 2 actual clusters. K-Means always returns exactly k clusters, regardless of the true structure in the data. It partitions space using Voronoi cells (linear boundaries along the perpendicular bisectors between centroids), which cannot capture crescent or non-convex shapes. K-Means will split the data into 3 regions that minimize inertia, producing clusters that cut across the two crescents. It does not detect or warn about assumption violations.
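This behavior is easy to reproduce with `make_moons`; the settings below are illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_moons

X, y_true = make_moons(n_samples=400, noise=0.05, random_state=0)
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# K-Means returns k=3 partitions despite the 2 true crescents,
# with no error or warning about the violated shape assumption.
print(len(np.unique(y_true)))  # 2
print(len(np.unique(labels)))  # 3
```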


Question 3 (Multiple Choice)

Why is K-Means++ initialization preferred over random initialization?

  • A) K-Means++ guarantees finding the global optimum
  • B) K-Means++ spreads initial centroids apart, reducing the chance of poor convergence
  • C) K-Means++ eliminates the need for the n_init parameter
  • D) K-Means++ works with non-spherical clusters

Answer: B) K-Means++ spreads initial centroids apart, reducing the chance of poor convergence. K-Means++ selects the first centroid randomly, then each subsequent centroid is chosen with probability proportional to the squared distance from the nearest existing centroid. This biases initialization toward spread-out centroids, making it less likely that two centroids start in the same cluster. K-Means++ provides a theoretical guarantee of O(log k) approximation to the optimal inertia, but it does not guarantee the global optimum (A), does not remove the need for multiple runs (C), and does not change K-Means' spherical cluster assumption (D).
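In scikit-learn, `init="k-means++"` is already the default for KMeans; the comparison below is a sketch with illustrative data and seeds:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.7, random_state=1)

# K-Means++ is the default init; n_init > 1 is still worthwhile because
# the seeding is stochastic and carries no global-optimum guarantee.
plus = KMeans(n_clusters=5, init="k-means++", n_init=10, random_state=1).fit(X)
rand = KMeans(n_clusters=5, init="random", n_init=10, random_state=1).fit(X)

print(plus.inertia_, rand.inertia_)  # often similar when n_init is generous
```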


Question 4 (Short Answer)

Explain the elbow method for choosing k. What is the main limitation of this approach?

Answer: The elbow method plots inertia (within-cluster sum of squares) against the number of clusters k and looks for a "bend" or "elbow" where adding another cluster yields diminishing improvement. Below the elbow, each additional cluster captures meaningful structure; above the elbow, additional clusters merely split coherent clusters or fit noise. The main limitation is that the elbow is often ambiguous in real-world data --- the curve decreases smoothly without a sharp bend, making k selection subjective. Different practitioners looking at the same plot may choose different k values.
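A sketch of the elbow computation on synthetic data (in practice you would plot `inertias` against k):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=4, random_state=7)

# Inertia for k = 1..8; it always decreases as k grows, which is why the
# shape of the curve (the elbow), not the raw value, guides the choice.
inertias = {k: KMeans(n_clusters=k, n_init=10, random_state=7).fit(X).inertia_
            for k in range(1, 9)}

# Successive drops shrink sharply past the true structure
# (here the data was generated with 4 centers).
drops = {k: inertias[k - 1] - inertias[k] for k in range(2, 9)}
print(drops)
```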


Question 5 (Multiple Choice)

A silhouette score of -0.15 for a particular observation means:

  • A) The observation is at the center of its assigned cluster
  • B) The observation is on the border between two clusters
  • C) The observation is closer on average to points in a neighboring cluster than to points in its own cluster
  • D) The observation is a noise point

Answer: C) The observation is closer on average to points in a neighboring cluster than to points in its own cluster. The silhouette score s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean intra-cluster distance and b(i) is the mean nearest-cluster distance. A negative value means a(i) > b(i) --- the observation is, on average, closer to the nearest other cluster than to its own. This suggests the point may be misassigned. A score near 0 (B) means the point is roughly equidistant from its own cluster and the nearest neighboring cluster, i.e., on the border. Silhouette scores do not identify noise points (D) --- that is DBSCAN's role.
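`silhouette_samples` exposes the per-observation scores behind the mean `silhouette_score`; a sketch on deliberately overlapping blobs (parameters illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Overlapping blobs so that some border points may score negatively.
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=2.5, random_state=3)
labels = KMeans(n_clusters=3, n_init=10, random_state=3).fit_predict(X)

per_point = silhouette_samples(X, labels)  # s(i) for every observation
print(np.isclose(silhouette_score(X, labels), per_point.mean()))  # True
print((per_point < 0).sum())  # observations that may be misassigned
```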


Question 6 (Multiple Choice)

In DBSCAN, a core point is defined as a point that:

  • A) Is closest to the cluster centroid
  • B) Has at least min_samples neighbors within distance eps
  • C) Has the highest density in its cluster
  • D) Was the first point visited in its cluster

Answer: B) Has at least min_samples neighbors within distance eps. DBSCAN classifies each point into one of three types: core points (at least min_samples neighbors within eps), border points (within eps of a core point but with fewer than min_samples neighbors), and noise points (not within eps of any core point). Core points form the "backbone" of clusters --- a cluster is built by connecting core points that are within eps of each other and then assigning border points to their nearest core point's cluster.
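scikit-learn exposes the core points via `core_sample_indices_`, which lets you recover all three point types; note that in scikit-learn's implementation, `min_samples` counts the point itself. Parameters below are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=400, noise=0.05, random_state=0)
db = DBSCAN(eps=0.2, min_samples=5).fit(X)  # illustrative parameters

core_mask = np.zeros(len(X), dtype=bool)
core_mask[db.core_sample_indices_] = True   # core points
noise_mask = db.labels_ == -1               # noise points (label -1)
border_mask = ~core_mask & ~noise_mask      # everything else is a border point

print(core_mask.sum(), border_mask.sum(), noise_mask.sum())
```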


Question 7 (Short Answer)

Name two advantages and two disadvantages of DBSCAN compared to K-Means.

Answer: Advantages: (1) DBSCAN can find clusters of arbitrary shape (non-spherical, non-convex), unlike K-Means which assumes spherical clusters. (2) DBSCAN automatically identifies noise points (label = -1), providing a natural way to handle outliers, while K-Means forces every point into a cluster. Disadvantages: (1) DBSCAN struggles with clusters of varying density --- a single eps value cannot capture both dense and sparse clusters simultaneously. (2) DBSCAN does not have a predict method for new data; it must be refit on the entire dataset, whereas K-Means can assign new points to their nearest centroid.


Question 8 (Multiple Choice)

The k-distance plot is used in DBSCAN to choose:

  • A) The number of clusters k
  • B) The min_samples parameter
  • C) The eps parameter
  • D) Whether to use DBSCAN or K-Means

Answer: C) The eps parameter. The k-distance plot shows, for each point, the distance to its k-th nearest neighbor (where k = min_samples), sorted in ascending order. The "elbow" in this plot indicates the distance at which the density transition occurs --- points below the elbow are in dense regions (clusters), and points above are in sparse regions (noise). The elbow distance is a good starting point for eps. The plot does not choose the number of clusters (DBSCAN determines this automatically) or the min_samples parameter (which is set based on a heuristic, typically 2 * n_features).
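The k-distance curve can be computed with `NearestNeighbors`; a sketch, with `min_samples` set to an assumed value:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

X, _ = make_blobs(n_samples=500, centers=3, random_state=42)
min_samples = 5  # illustrative; a common heuristic is 2 * n_features

# Querying the training set returns each point as its own first neighbor,
# so asking for min_samples neighbors matches scikit-learn's min_samples
# convention (which counts the point itself).
nn = NearestNeighbors(n_neighbors=min_samples).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])  # ascending k-distance curve

# Plot k_dist to find the elbow; the tail holds the sparse-region points.
print(k_dist[0], k_dist[-1])
```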


Question 9 (Multiple Choice)

In hierarchical agglomerative clustering, Ward linkage computes the distance between two clusters as:

  • A) The minimum distance between any pair of points in the two clusters
  • B) The maximum distance between any pair of points in the two clusters
  • C) The increase in total within-cluster variance when the two clusters are merged
  • D) The average distance between all pairs of points in the two clusters

Answer: C) The increase in total within-cluster variance when the two clusters are merged. Ward linkage minimizes the total within-cluster variance at each merge step, making it analogous to K-Means in the hierarchical framework. It tends to produce compact, equal-sized clusters. Single linkage (A) finds elongated clusters but is sensitive to chaining. Complete linkage (B) produces compact clusters but is sensitive to outliers. Average linkage (D) is a compromise between single and complete.
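In scikit-learn the linkage is a constructor parameter; a minimal sketch with illustrative data:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=5)

# linkage="ward" minimizes the variance increase at each merge; it requires
# Euclidean distances (the default metric).
ward = AgglomerativeClustering(n_clusters=3, linkage="ward").fit(X)
print(sorted(set(ward.labels_)))  # [0, 1, 2]
```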


Question 10 (Short Answer)

How do you determine the number of clusters from a dendrogram? What visual pattern suggests that the data has a strong cluster structure?

Answer: To determine the number of clusters from a dendrogram, draw a horizontal line at a chosen height (distance). The number of vertical lines the horizontal line intersects equals the number of clusters at that resolution. A strong cluster structure is indicated by large vertical gaps between successive merge distances --- the dendrogram shows long vertical lines before a merge, meaning those clusters were well-separated and persisted over a wide range of distances before being joined. If all merges happen at similar heights (no large gaps), the data does not have strong natural clusters.
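The horizontal-line cut can be done programmatically with SciPy's `fcluster`; a sketch on well-separated blobs (data and cut height are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=2)
Z = linkage(X, method="ward")  # (n-1) merges; column 2 holds merge heights

# Large gaps between successive merge heights indicate well-separated
# clusters that persisted over a wide range of distances.
heights = Z[:, 2]
print(np.diff(heights[-4:]))

# "Draw the horizontal line": cut just above the third-from-last merge,
# undoing the final two merges and leaving three clusters.
labels = fcluster(Z, t=heights[-3] * 1.001, criterion="distance")
print(len(np.unique(labels)))
```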


Question 11 (Multiple Choice)

You cluster 10,000 customers into 4 segments using K-Means. The silhouette score is 0.62. A colleague then profiles the clusters and finds that all 4 clusters have nearly identical mean values for every feature. What is the most likely explanation?

  • A) The silhouette score is incorrect
  • B) The features were not scaled before clustering
  • C) The clusters differ in feature distributions (variance, skew) rather than means
  • D) K-Means always produces identical cluster profiles

Answer: C) The clusters differ in feature distributions (variance, skew) rather than means. A high silhouette score confirms that the clusters are well-separated in the feature space. If the cluster means are similar, the clusters may differ in other distributional properties --- for example, one cluster may have low variance (concentrated behavior) while another has high variance (diverse behavior), or clusters may differ in the tail of the distribution. This is why cluster profiling should include not only means but also medians, standard deviations, and percentile comparisons. It is also possible that a few features drive the separation while the majority do not, making the means across all features look similar.


Question 12 (Multiple Choice)

Which of the following internal validation metrics is LEAST appropriate when comparing a K-Means clustering (which assigns all points) to a DBSCAN clustering (which labels some points as noise)?

  • A) Silhouette score computed on all points
  • B) Silhouette score computed only on non-noise points
  • C) Adjusted Rand Index between the two clusterings
  • D) Calinski-Harabasz index computed only on non-noise points

Answer: A) Silhouette score computed on all points. When comparing K-Means and DBSCAN, noise points (label = -1 in DBSCAN) create an asymmetry. K-Means assigns every point to a cluster, while DBSCAN excludes noise points from clusters. Computing silhouette on all points would penalize K-Means for including outliers that DBSCAN separated as noise, or it would require assigning noise points to a "noise cluster" that distorts the metric. The fair comparison is to compute metrics only on non-noise points (B and D) or to use ARI on the subset of points that both methods assign to clusters (C is valid if the noise points are excluded from the comparison).
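A sketch of the fair comparison, restricting the metric to DBSCAN's non-noise points (data and `eps` are illustrative, and the guard covers the case where DBSCAN finds fewer than two clusters):

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=400, centers=3, cluster_std=0.8, random_state=9)
km_labels = KMeans(n_clusters=3, n_init=10, random_state=9).fit_predict(X)
db_labels = DBSCAN(eps=0.8, min_samples=5).fit_predict(X)  # illustrative eps

# Restrict both metrics to the points DBSCAN kept, so neither method is
# penalized (or shielded) by the noise points.
mask = db_labels != -1
if len(np.unique(db_labels[mask])) > 1:
    print("K-Means:", silhouette_score(X[mask], km_labels[mask]))
    print("DBSCAN :", silhouette_score(X[mask], db_labels[mask]))
```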


Question 13 (Short Answer)

A marketing team asks you to cluster customers into exactly 5 segments because their CRM system has 5 campaign slots. Your silhouette analysis suggests k=3 is optimal. What do you do?

Answer: Use k=5 because the business constraint is real and silhouette optimality is not the only consideration. Clustering is exploratory, and the "best" k is the one that produces actionable segments, not necessarily the one with the highest silhouette score. However, communicate the tradeoff: show the silhouette plots for k=3 and k=5 side by side, demonstrate that some segments at k=5 may be poorly separated, and verify that the 5-segment profiles are genuinely different enough to warrant different campaigns. If two of the five segments are nearly identical in their profiles, recommend merging them and using only 4 campaign slots.


Question 14 (Multiple Choice)

You need to cluster 2 million customer records with 15 features. Which algorithm is the most practical starting point?

  • A) Agglomerative hierarchical clustering with Ward linkage
  • B) Standard K-Means with n_init=10
  • C) Mini-Batch K-Means with batch_size=1024
  • D) DBSCAN with eps determined from a k-distance plot

Answer: C) Mini-Batch K-Means with batch_size=1024. At 2 million rows, agglomerative clustering (A) is infeasible due to O(n^2) memory requirements. Standard K-Means (B) is feasible but slow because each iteration computes distances for all 2 million points. DBSCAN (D) can struggle with 15 features (curse of dimensionality) and the k-distance plot on 2 million points is expensive to compute. Mini-Batch K-Means updates centroids using random mini-batches, running 3-10x faster than standard K-Means with minimal accuracy loss. It is the standard choice for large-scale clustering.
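The call pattern is the same as standard K-Means; a sketch on a scaled-down stand-in for the 2-million-row problem:

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_blobs

# Scaled-down stand-in for the 2M-row, 15-feature problem; at full scale
# the same call applies, just with more data.
X, _ = make_blobs(n_samples=20_000, n_features=15, centers=4, random_state=0)

mbk = MiniBatchKMeans(n_clusters=4, batch_size=1024, n_init=10,
                      random_state=0).fit(X)
print(mbk.cluster_centers_.shape)  # (4, 15)
```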


Question 15 (Short Answer)

Explain why clustering is described as "exploratory" rather than "predictive." How does this affect how you evaluate clustering results compared to a supervised learning model?

Answer: Clustering is exploratory because there is no target variable and therefore no objectively "correct" answer. A supervised model can be evaluated against held-out labels (accuracy, AUC, RMSE), but a clustering solution can only be evaluated against indirect criteria: internal metrics (silhouette, Calinski-Harabasz), stability across subsamples, interpretability of cluster profiles, and --- most importantly --- whether the clusters lead to different actions. Two different clusterings of the same data can both be valid if they answer different questions. This means evaluation in clustering is inherently more subjective than in supervised learning, and domain expert validation is not optional but essential.


Quiz covers Chapter 20: Clustering. Return to the chapter for review, or check your answers against the answer key.