Chapter 9 Quiz: Unsupervised Learning
Multiple Choice
Question 1. What is the fundamental difference between supervised and unsupervised learning?
a) Supervised learning uses Python; unsupervised learning uses R
b) Supervised learning uses labeled data with a target variable; unsupervised learning discovers patterns without labels
c) Supervised learning is more accurate than unsupervised learning
d) Supervised learning finds clusters; unsupervised learning makes predictions
Question 2. In K-means clustering, what does the algorithm do after assigning each data point to its nearest centroid?
a) Removes outlier data points that are far from all centroids
b) Increases K by one and adds a new centroid
c) Recalculates each centroid as the mean of all points assigned to it
d) Calculates the accuracy of the cluster assignments
Question 3. A data scientist runs K-means for K = 2 through K = 10 and plots the within-cluster sum of squares (WCSS). The curve drops sharply from K = 2 to K = 5, then flattens significantly. Using the elbow method, the recommended K is:
a) K = 2, because it has the steepest drop
b) K = 5, because it's where the curve bends from steep to flat
c) K = 10, because it has the lowest WCSS
d) K = 7, because it's the midpoint of the range
Question 4. Which of the following is true about the silhouette score?
a) It ranges from 0 to 1, where 1 indicates perfect clustering
b) It measures the distance between cluster centroids
c) It ranges from -1 to +1, where values near +1 indicate well-separated clusters
d) It can only be calculated for K-means, not for other clustering algorithms
Question 5. A marketing team uses K-means to segment 50,000 customers. One feature (annual revenue) ranges from $10 to $2 million, while another feature (satisfaction score) ranges from 1 to 5. Before running K-means, the data scientist should:
a) Remove the satisfaction score because its range is too small
b) Scale both features so they contribute proportionally to the distance calculation
c) Log-transform the revenue feature and leave the satisfaction score unchanged
d) Use only the revenue feature since it contains more information
Question 6. What is the primary advantage of hierarchical clustering over K-means?
a) It runs faster on large datasets
b) It always produces better cluster assignments
c) It produces a dendrogram showing the full hierarchy of cluster relationships, without requiring K to be specified upfront
d) It handles categorical features directly
Question 7. In a dendrogram, the height at which two clusters merge indicates:
a) The number of data points in each cluster
b) The dissimilarity between the two clusters — higher merges mean the clusters were more different
c) The time the algorithm took to process the merge
d) The silhouette score of the resulting merged cluster
Question 8. DBSCAN classifies data points into three types: core points, border points, and noise points. Which statement best describes a noise point?
a) A point that is equidistant from two cluster centroids
b) A point that has at least min_samples neighbors within eps distance
c) A point that is not within eps distance of any core point and doesn't belong to any cluster
d) A point that was assigned to a cluster but has a negative silhouette score
Question 9. Which clustering algorithm would be most appropriate for identifying clusters of irregular shape in geographic data while also detecting outliers?
a) K-means
b) Hierarchical clustering with Ward linkage
c) DBSCAN
d) PCA
Question 10. Principal Component Analysis (PCA) reduces dimensionality by:
a) Removing the features with the lowest variance
b) Randomly selecting a subset of features
c) Finding new axes (principal components) that are linear combinations of original features, ordered by the amount of variance they explain
d) Clustering features into groups and selecting one representative feature per group
Question 11. After running PCA on 30 features, you find that the first 5 principal components explain 88% of the total variance. This means:
a) The remaining 25 features are irrelevant and can be deleted from the dataset
b) 88% of the information in the original 30 features is captured by 5 composite dimensions
c) 88% of the data points are well-represented; 12% are outliers
d) You need exactly 5 clusters for K-means
Question 12. Which statement about t-SNE is correct?
a) Distances between clusters in a t-SNE plot accurately represent how different the clusters are
b) Cluster sizes in a t-SNE plot accurately represent how many data points each cluster contains
c) t-SNE preserves local neighborhood structure but may distort global distances and cluster sizes
d) t-SNE is deterministic and will produce identical results every time it is run
Question 13. An isolation forest detects anomalies by:
a) Comparing each data point to the mean and flagging points more than 3 standard deviations away
b) Measuring how quickly a data point can be isolated by random splits — anomalies are isolated faster
c) Training a supervised classifier on labeled anomaly data
d) Calculating the Mahalanobis distance of each point from the data center
Question 14. In RFM analysis, which combination of scores would most likely indicate a customer at risk of churning?
a) High Recency (long time since purchase), High Frequency, High Monetary
b) Low Recency (recent purchase), Low Frequency, Low Monetary
c) High Recency (long time since purchase), declining Frequency, declining Monetary
d) Low Recency (recent purchase), High Frequency, High Monetary
Question 15. Athena's behavioral clustering revealed the "Quiet Loyalists" segment. Which characteristic made this segment invisible to the traditional demographic-based segmentation?
a) They had extremely high spending that skewed the demographic averages
b) They had moderate spending, shopped primarily in-store, and had near-zero email engagement — making them invisible to digital marketing metrics
c) They were all in the same ZIP code, which the demographic model excluded
d) They were new customers with no purchase history
True or False
Question 16. K-means clustering will always produce the same result regardless of the initial random centroid placement.
Question 17. DBSCAN requires you to specify the number of clusters (K) before running the algorithm.
Question 18. A silhouette score of -0.3 for a data point suggests it may have been assigned to the wrong cluster.
Question 19. PCA can only reduce data to 2 dimensions.
Question 20. In anomaly detection for fraud, unsupervised methods are valuable because they can detect novel fraud patterns that supervised models trained on historical fraud would miss.
Short Answer
Question 21. Explain why you should always scale your features before running K-means clustering. Give a specific example of what can go wrong if you don't.
Question 22. A colleague presents a customer segmentation and says: "We achieved a silhouette score of 0.85, so we know this is the correct segmentation." Identify two flaws in this statement.
Question 23. Describe one scenario where DBSCAN would produce meaningfully better results than K-means, and one scenario where K-means would be the better choice. Explain the characteristics of each scenario that favor one algorithm over the other.
Question 24. Athena's segmentation revealed that the Quiet Loyalists segment responded to the strategy "do less" — specifically, stop sending them emails they never read. Explain why "do less" can be a valid strategic response to a segmentation finding, and give another hypothetical example where the optimal strategy for a segment might be to reduce, rather than increase, contact.
Question 25. Professor Okonkwo says unsupervised learning "reveals questions you didn't know to ask." Using the Athena case as an example, explain what question the clustering revealed and why that question was more valuable than any prediction the algorithm could have made.
Answer Key
1. b — Supervised learning uses labeled data with a target variable; unsupervised learning discovers patterns without labels.
2. c — Recalculates each centroid as the mean of all points assigned to it, then repeats the assign-and-update cycle until convergence.
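As a concrete illustration, the assign-and-update cycle can be sketched in a few lines of NumPy. This is a toy sketch on invented two-blob data, not scikit-learn's production implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy data: two well-separated point clouds (invented for this example)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])

K = 2
centroids = X[rng.choice(len(X), K, replace=False)]  # random initialization
for _ in range(10):
    # Assignment step: each point goes to its nearest centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid becomes the mean of its assigned points
    # (keep the old centroid if a cluster ends up empty)
    centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(K)
    ])
print(centroids.round(2))
```

In practice you would iterate until assignments stop changing rather than a fixed 10 passes; scikit-learn's `KMeans` handles convergence checking, multiple restarts, and smarter initialization (k-means++) for you.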
3. b — K = 5 is where the curve transitions from steep improvement to diminishing returns — the "elbow."
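A minimal sketch of this procedure with scikit-learn, using synthetic five-cluster data invented for the example (`inertia_` is KMeans' WCSS):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 5 well-separated clusters (invented for illustration)
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.8, random_state=42)

wcss = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss[k] = km.inertia_  # within-cluster sum of squares

for k, w in wcss.items():
    print(f"K={k}: WCSS={w:.1f}")
```

Plotting `wcss` against K and looking for the bend reproduces the elbow described in the question; in real data the bend is usually less crisp than in synthetic blobs.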
4. c — The silhouette score ranges from -1 to +1. Values near +1 indicate points are well-matched to their cluster and poorly matched to neighboring clusters.
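Computing it with scikit-learn, on made-up four-cluster data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_samples, silhouette_score

# Four well-separated synthetic clusters (centers invented for illustration)
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [5, 5], [0, 5], [5, 0]],
                  cluster_std=0.6, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

score = silhouette_score(X, labels)        # mean over all points, range [-1, +1]
per_point = silhouette_samples(X, labels)  # one score per individual point
print(f"mean silhouette: {score:.2f}")
```

Note that `silhouette_score` accepts labels from any clustering algorithm, not just K-means, which is why option (d) is wrong.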
5. b — Scaling ensures all features contribute proportionally to the distance calculation. Without scaling, revenue (range: millions) would completely dominate satisfaction (range: 1-5).
6. c — Hierarchical clustering's dendrogram preserves the full hierarchy, allowing exploration at any level of granularity without specifying K in advance.
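A sketch with SciPy, on invented three-blob data: the linkage matrix encodes the full hierarchy, and you can cut it at any K after the fact:

```python
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.datasets import make_blobs

# Three separated blobs (centers invented for illustration)
X, _ = make_blobs(n_samples=60, centers=[[0, 0], [6, 0], [0, 6]],
                  cluster_std=0.5, random_state=1)

Z = linkage(X, method="ward")  # full merge hierarchy; no K required up front
# Cut the tree afterwards at whatever granularity you want
labels_3 = fcluster(Z, t=3, criterion="maxclust")
labels_2 = fcluster(Z, t=2, criterion="maxclust")
print(len(set(labels_3)), len(set(labels_2)))
```

`scipy.cluster.hierarchy.dendrogram(Z)` draws the tree itself; the merge heights stored in `Z` are the dissimilarities referenced in Question 7.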
7. b — The merge height represents dissimilarity. Higher merges indicate the two clusters were more different from each other.
8. c — A noise point doesn't have enough neighbors to be a core point and isn't within eps distance of any core point.
9. c — DBSCAN handles arbitrary cluster shapes and explicitly labels outliers as noise, making it ideal for geographic data with irregular boundaries.
10. c — PCA finds new axes (principal components) that are ordered by variance explained, where each component is a linear combination of original features.
11. b — 88% of the variance (information) is captured by the 5 principal components. The original features are still "present" as weighted contributions to each component.
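The cumulative-variance check looks like this in scikit-learn; the Iris data (4 features) stands in for the question's 30-feature dataset:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data  # 4 features, standing in for the 30 in the question
pca = PCA().fit(X)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print("cumulative variance explained:", cumulative.round(3))
# Smallest number of components reaching a chosen threshold
n_keep = int(np.searchsorted(cumulative, 0.88) + 1)
print("components needed for 88%:", n_keep)
```

`PCA` also accepts a float for `n_components` (e.g. `PCA(n_components=0.88)`), in which case it selects the component count needed to reach that variance fraction for you.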
12. c — t-SNE preserves local structure (nearby points stay nearby) but distorts global distances and relative cluster sizes.
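A minimal run, illustrating the reproducibility caveat from option (d); the blob data is invented for the example:

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# Fixing random_state makes the embedding reproducible; without it,
# successive runs can produce visibly different layouts
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)
print(emb.shape)  # one 2-D point per input row
```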
13. b — Isolation forest uses random splits in decision trees. Anomalies, being rare and different, are isolated with fewer splits (shorter path lengths).
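A sketch with scikit-learn, injecting a few obvious outliers into invented "normal" data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Normal points clustered near the origin, plus three extreme outliers
normal = rng.normal(0, 1, size=(500, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.0], [10.0, -8.0]])
X = np.vstack([normal, outliers])

iso = IsolationForest(contamination=0.01, random_state=42).fit(X)
pred = iso.predict(X)  # +1 = inlier, -1 = anomaly
print("flagged anomalies:", int((pred == -1).sum()))
```

`contamination` sets the expected anomaly fraction and hence the decision threshold; `score_samples(X)` exposes the raw path-length-based scores if you prefer to threshold them yourself.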
14. c — High recency (long time since last purchase) combined with declining frequency and monetary value is the classic churn risk profile.
15. b — Quiet Loyalists were moderate spenders with extremely low digital engagement. Because the traditional segmentation was based on demographics and the marketing team tracked only digital engagement metrics, this behaviorally distinct group was spread across multiple demographic segments and therefore invisible.
16. False — K-means is sensitive to initialization. Different starting centroids can produce different final clusters. This is why scikit-learn's implementation runs the algorithm multiple times (n_init parameter) and selects the best result.
17. False — DBSCAN discovers the number of clusters automatically based on the density structure of the data. It requires eps and min_samples, not K.
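A sketch on the classic two-moons shape (data invented for the example), which K-means cannot separate; note that no cluster count is passed in:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two crescent-shaped clusters plus one far-away outlier
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)
X = np.vstack([X, [[5.0, 5.0]]])  # inject an obvious noise point

labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)  # only density params
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, "| noise points:", int((labels == -1).sum()))
```

DBSCAN labels noise as -1 rather than forcing it into a cluster, which is the behavior Questions 8 and 9 rely on.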
18. True — A negative silhouette score means the point is, on average, closer to points in a neighboring cluster than to points in its own cluster, suggesting misassignment.
19. False — PCA can reduce to any number of dimensions from 1 up to the number of original features. Reducing to 2 or 3 is common for visualization, but higher numbers are used for preprocessing.
20. True — Supervised fraud models can only detect patterns similar to historically labeled fraud. Unsupervised anomaly detection identifies any unusual transaction, including novel fraud schemes.
21. K-means uses distance (typically Euclidean) to assign points to clusters. If features are on different scales — e.g., income in tens of thousands and age in tens — income will dominate the distance calculation, and the clusters will be based almost entirely on income, ignoring age. Scaling (e.g., StandardScaler) ensures each feature contributes proportionally.
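A sketch of the fix with StandardScaler, using invented revenue and satisfaction columns matching the ranges in Question 5:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Invented features: annual revenue in dollars, satisfaction on a 1-5 scale
revenue = rng.uniform(10, 2_000_000, size=(1000, 1))
satisfaction = rng.integers(1, 6, size=(1000, 1)).astype(float)
X = np.hstack([revenue, satisfaction])

scaled = StandardScaler().fit_transform(X)
# After scaling, both columns have mean ~0 and standard deviation 1,
# so neither dominates a Euclidean distance calculation
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

Feeding `scaled` (rather than `X`) to KMeans is what lets the satisfaction score influence the clusters at all; on the raw data, distances are effectively determined by revenue alone.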
22. Flaw 1: A high silhouette score indicates statistical cluster quality (well-separated clusters) but says nothing about whether the clusters are useful for business decisions. Flaw 2: There is no single "correct" segmentation. The same data can be validly segmented in multiple ways, each optimizing for different objectives. Silhouette score helps compare clusterings but doesn't establish correctness.
23. DBSCAN is better when data contains non-spherical clusters and outliers — for example, identifying geographic delivery zones in a city, where zones follow road networks (irregular shapes) and some addresses are in rural areas (noise). K-means would force round clusters and assign outlier addresses to distant zones. K-means is better when you need a specific number of well-defined, interpretable segments — for example, creating 5 customer tiers for a loyalty program, where each tier must have a clear centroid profile that stakeholders can understand and act on.
24. "Do less" is valid when the current strategy is either wasting resources on an unresponsive group or actively harming the relationship. For the Quiet Loyalists, email campaigns they never opened were wasting marketing budget and, worse, lowering their CRM engagement scores. Another example: a subscription service discovers that a segment of long-tenured, satisfied customers churns at higher rates after receiving frequent "we miss you" discount offers — the offers inadvertently signal that the company expects them to leave, or train them to wait for discounts. The optimal strategy is to reduce promotional contact and maintain the relationship through service quality, not outreach volume.
25. The clustering revealed the question: "Is there a high-value customer group defined by behavioral patterns (in-store loyalty, full-price purchasing, email disengagement) that our demographic segmentation is systematically missing?" This question was more valuable than any prediction because it exposed a structural blind spot in Athena's marketing strategy. No supervised model could have surfaced this — you'd need to know the segment exists before you could build a model to predict membership in it. The unsupervised algorithm discovered the category itself, not just membership in a known category.