Chapter 9 Exercises: Unsupervised Learning
Section A: Conceptual Foundations
Exercise 9.1 — Supervised vs. Unsupervised
For each of the following business problems, classify it as supervised learning, unsupervised learning, or a combination. Justify your answer by identifying whether a target variable exists, whether labeled data is available, and what the desired output is.
a) A bank wants to group its loan applicants into risk tiers based on financial behavior patterns, without predefined categories.
b) A streaming service wants to predict which users will cancel their subscription next month.
c) An e-commerce company wants to identify unusual return patterns that might indicate fraud.
d) An insurance company wants to set premium prices for new customers based on historical claims data.
e) A hospital wants to find natural groupings among patients with chronic conditions to personalize treatment plans.
f) A manufacturer wants to detect faulty products on an assembly line using sensor readings from equipment where defects are extremely rare and inconsistently labeled.
Exercise 9.2 — The "No Right Answer" Problem
Professor Okonkwo says: "You don't know if you're right. You know if you're useful."
a) Explain in two to three sentences why this distinction matters for unsupervised learning but not for supervised classification. b) A data scientist presents a customer segmentation with a silhouette score of 0.72. The marketing VP responds: "The segments don't map to any campaigns we can run." Is the segmentation "good"? Discuss the tension between statistical quality and business utility, and how you would resolve it. c) Propose three criteria (beyond silhouette score) that a business should use to evaluate whether a clustering result is useful.
Exercise 9.3 — When Unsupervised Fails
A retail company runs K-means on a dataset of 10,000 customers using five features: age, annual income, number of children, ZIP code, and years of education. They find four clusters and label them "Young Professionals," "Affluent Families," "Retirees," and "Students."
a) What type of segmentation is this (demographic, behavioral, or psychographic)? b) Identify at least two problems with including ZIP code as a clustering feature without transformation. c) Explain why this segmentation, even if statistically valid, might produce less actionable insights than a behavioral segmentation based on purchase data. d) Propose a set of five alternative features that would produce a more actionable behavioral segmentation for a retailer.
Section B: K-Means Clustering
Exercise 9.4 — K-Means by Hand
Consider the following five data points in two dimensions (already scaled):
| Point | X | Y |
|---|---|---|
| A | 1 | 1 |
| B | 1.5 | 2 |
| C | 3 | 4 |
| D | 5 | 7 |
| E | 3.5 | 5 |
You initialize K-means with K = 2 and the following starting centroids: C1 = (1, 1) and C2 = (5, 7).
a) Assign each point to the nearest centroid (using Euclidean distance). Show your distance calculations. b) Recalculate each centroid as the mean of its assigned points. c) Reassign each point to the nearest (new) centroid. Did any points change clusters? d) If the algorithm has converged (no changes), state the final clusters. If not, perform one more iteration.
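To check your hand calculations, here is a minimal NumPy sketch of the same Lloyd's-algorithm iterations on these five points (illustrative code, not the chapter's implementation). Note that point C is exactly √13 from both starting centroids; `argmin` breaks that tie toward the first centroid.

```python
import numpy as np

# The five scaled points from Exercise 9.4, in order A..E
points = np.array([[1, 1], [1.5, 2], [3, 4], [5, 7], [3.5, 5]])
centroids = np.array([[1.0, 1.0], [5.0, 7.0]])  # given starting centroids

for _ in range(10):
    # Assignment step: nearest centroid by Euclidean distance
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Update step: each centroid moves to the mean of its assigned points
    new_centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])
    if np.allclose(new_centroids, centroids):
        break  # converged: a further pass cannot change assignments
    centroids = new_centroids

print("labels (A..E):", labels)          # [0 0 0 1 1]
print("centroids:", centroids.round(3))  # [[1.833 2.333] [4.25 6.]]
```

The algorithm converges after one centroid update: {A, B, C} and {D, E}.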
Exercise 9.5 — Choosing K
You run K-means on a customer dataset for K = 2 through K = 10 and obtain the following inertia (WCSS) values:
| K | Inertia | Silhouette Score |
|---|---|---|
| 2 | 48,500 | 0.45 |
| 3 | 32,200 | 0.52 |
| 4 | 22,800 | 0.58 |
| 5 | 18,400 | 0.61 |
| 6 | 16,900 | 0.59 |
| 7 | 15,800 | 0.53 |
| 8 | 15,200 | 0.48 |
| 9 | 14,900 | 0.44 |
| 10 | 14,700 | 0.41 |
a) Based on the elbow method, what value of K would you recommend? Explain your reasoning. b) Based on the silhouette score, what value of K would you recommend? c) The two methods suggest different values. How would you decide? What additional considerations (business, operational) would factor into your decision? d) A marketing executive says: "We can only run three distinct campaigns, so K must be 3." How would you respond? Is this a valid way to choose K?
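A table like the one above is typically produced by a loop over candidate K values. The sketch below uses a synthetic stand-in dataset (the exercise's real customer data is not shown), with scikit-learn's `KMeans` and `silhouette_score`:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the customer dataset (5 planted groups)
X, _ = make_blobs(n_samples=500, centers=5, cluster_std=1.0, random_state=42)

results = {}
for k in range(2, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    results[k] = (km.inertia_, silhouette_score(X, km.labels_))
    print(f"K={k:2d}  inertia={results[k][0]:10.1f}  silhouette={results[k][1]:.3f}")

# Elbow method: look for where inertia's decline flattens.
# Silhouette method: simply take the K with the highest score.
best_k = max(results, key=lambda k: results[k][1])
print("best K by silhouette:", best_k)
```

Inertia always decreases as K grows, which is why the elbow (the bend in the curve) matters more than the raw value.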
Exercise 9.6 — Scaling Matters
A data scientist clusters customers using three unscaled features:
| Feature | Min | Max | Mean |
|---|---|---|---|
| Annual revenue ($) | 50 | 500,000 | 25,000 |
| Number of purchases | 1 | 120 | 15 |
| Customer satisfaction score | 1 | 5 | 3.8 |
a) Without scaling, which feature will dominate the distance calculation? Why? b) After StandardScaler is applied, what will the mean and standard deviation of each feature be? c) A colleague suggests using MinMaxScaler instead of StandardScaler. Under what circumstances might MinMaxScaler be preferable? d) The data scientist finds completely different clusters after scaling. Which result should they trust — scaled or unscaled? Why?
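Part (b) can be verified empirically. The sketch below generates hypothetical data mimicking the table's ranges (not the real dataset) and applies both scalers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

rng = np.random.default_rng(0)
# Hypothetical features matching the table's ranges
X = np.column_stack([
    rng.uniform(50, 500_000, 200),  # annual revenue ($)
    rng.integers(1, 121, 200),      # number of purchases
    rng.uniform(1, 5, 200),         # satisfaction score
])

X_std = StandardScaler().fit_transform(X)  # mean 0, std 1 per feature
X_mm = MinMaxScaler().fit_transform(X)     # min 0, max 1 per feature

print("raw stds:", X.std(axis=0).round(1))        # revenue dwarfs the others
print("scaled stds:", X_std.std(axis=0).round(6)) # all 1 after StandardScaler
print("minmax range:", X_mm.min(axis=0), "to", X_mm.max(axis=0))
```

The raw standard deviations show why revenue would dominate any Euclidean distance before scaling.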
Section C: Hierarchical Clustering and DBSCAN
Exercise 9.7 — Reading Dendrograms
A dendrogram for 8 customers shows the following merge structure (height indicates dissimilarity at which clusters merge):
- Height 0.5: Customers A and B merge
- Height 0.8: Customers C and D merge
- Height 1.2: Cluster {A, B} merges with Customer E
- Height 1.5: Customers F and G merge
- Height 2.0: Cluster {C, D} merges with Cluster {F, G}
- Height 3.5: Cluster {A, B, E} merges with Customer H
- Height 6.0: All clusters merge into one
a) If you cut the dendrogram at height 2.5, how many clusters do you get? List their members. b) If you cut at height 1.0, how many clusters do you get? List their members. c) Which two customers are most similar to each other? How do you know? d) Which customer is most different from all others? How do you know? e) If you need exactly 3 clusters, where would you cut? What are the clusters?
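The merge structure above can be encoded directly as a SciPy linkage matrix, which lets you verify your cuts with `fcluster` (a sketch; the labels A..H map to observation indices 0..7):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster

names = list("ABCDEFGH")  # observations 0..7 in SciPy's numbering
# Each linkage row is [cluster_i, cluster_j, merge_height, new_cluster_size];
# merged clusters receive new ids 8, 9, ... in row order.
Z = np.array([
    [0, 1, 0.5, 2],    # A + B          -> cluster 8
    [2, 3, 0.8, 2],    # C + D          -> cluster 9
    [8, 4, 1.2, 3],    # {A,B} + E      -> cluster 10
    [5, 6, 1.5, 2],    # F + G          -> cluster 11
    [9, 11, 2.0, 4],   # {C,D} + {F,G}  -> cluster 12
    [10, 7, 3.5, 4],   # {A,B,E} + H    -> cluster 13
    [12, 13, 6.0, 8],  # everything     -> cluster 14
])

for cut in (2.5, 1.0):
    flat = fcluster(Z, t=cut, criterion="distance")
    groups = {}
    for name, c in zip(names, flat):
        groups.setdefault(c, []).append(name)
    print(f"cut at {cut}: {sorted(groups.values())}")
```

Cutting at 2.5 keeps every merge below that height, so {A, B, E}, {C, D, F, G}, and {H} survive as separate clusters.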
Exercise 9.8 — Choosing Between K-Means and DBSCAN
For each of the following scenarios, recommend either K-means or DBSCAN, and explain your reasoning:
a) Segmenting 1 million loyalty program members for a national restaurant chain into 5 tiers for a new rewards program.
b) Identifying clusters of fraudulent ATM withdrawals based on location, time, and amount — knowing that fraud represents less than 0.1% of transactions.
c) Grouping retail stores by performance metrics for regional management reporting, where the CEO wants exactly 4 performance tiers.
d) Analyzing GPS data from delivery trucks to find natural delivery zones in a city, where zones may be irregular shapes following road networks.
e) Segmenting social media followers into interest groups for targeted content, with no prior knowledge of how many groups exist.
Exercise 9.9 — DBSCAN Parameter Sensitivity
You run DBSCAN on a dataset with three different parameter settings and get the following results:
| Setting | eps | min_samples | Clusters Found | Noise Points (%) |
|---|---|---|---|---|
| A | 0.3 | 5 | 12 | 35% |
| B | 0.5 | 5 | 5 | 8% |
| C | 1.0 | 5 | 2 | 1% |
a) Explain the relationship between eps and the number of clusters/noise points. Why does increasing eps reduce both?
b) Which setting would you investigate first for a customer segmentation project, and why?
c) How would you use a k-distance plot to select eps more systematically?
d) What would happen if you kept eps = 0.5 but increased min_samples to 20? Predict the likely effect on cluster count and noise percentage.
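For part (c), the k-distance computation looks like the following sketch (synthetic stand-in data; `min_samples = 5` to match the exercise). The sorted distances form the k-distance curve you would plot:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Synthetic stand-in data; min_samples matches the exercise's settings
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=7)
min_samples = 5

# Distance from each point to its min_samples-th nearest neighbor
# (the point itself counts as neighbor 0, so request min_samples + 1)
nn = NearestNeighbors(n_neighbors=min_samples + 1).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

# Plotting k_dist ascending gives the k-distance curve; the sharp bend
# ("knee") separates dense-region distances from outlier distances,
# and its height is a principled candidate for eps.
print("median k-distance:", round(float(np.median(k_dist)), 3))
print("95th-percentile k-distance:", round(float(k_dist[int(0.95 * len(k_dist))]), 3))
```

Points to the right of the knee would become noise at that eps, so the knee height trades cluster coverage against noise directly.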
Section D: Dimensionality Reduction
Exercise 9.10 — PCA Intuition
You run PCA on a dataset with 20 features and obtain the following explained variance ratios for the first 6 principal components:
| Component | Explained Variance Ratio | Cumulative |
|---|---|---|
| PC1 | 35.2% | 35.2% |
| PC2 | 18.7% | 53.9% |
| PC3 | 12.4% | 66.3% |
| PC4 | 8.1% | 74.4% |
| PC5 | 5.6% | 80.0% |
| PC6 | 3.9% | 83.9% |
a) If you need to retain at least 80% of the total variance, how many components should you keep? b) You've reduced 20 features to 5. What are the trade-offs of this reduction for (i) a visualization task, (ii) a clustering task, and (iii) a supervised learning task? c) A colleague says: "PC1 accounts for 35% of the variance, so it must be the most important feature." Explain why this statement misunderstands PCA. d) After running K-means on the original 20 features and on the 5 PCA components, you get slightly different cluster assignments. Which result should you use, and why?
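In scikit-learn, the variance-retention decision in part (a) can be delegated to PCA itself: passing a float in (0, 1) as `n_components` keeps the smallest number of components reaching that cumulative variance. A sketch on a synthetic 20-feature stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Stand-in for a 20-feature dataset (the real data is not shown)
X, _ = make_classification(n_samples=400, n_features=20, n_informative=8,
                           random_state=0)

# Keep the smallest number of components whose cumulative
# explained variance reaches 80%
pca = PCA(n_components=0.80)
X_reduced = pca.fit_transform(X)

print("components kept:", pca.n_components_)
print("cumulative variance:", round(float(pca.explained_variance_ratio_.sum()), 3))
```

The component count chosen here depends on the stand-in data's correlation structure; on the exercise's table, five components would be kept.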
Exercise 9.11 — t-SNE Interpretation Pitfalls
A data scientist creates a t-SNE visualization of customer data and presents the following observations to the marketing team. For each observation, evaluate whether it's a valid interpretation or a misuse of t-SNE. Explain your reasoning.
a) "These five clusters are clearly separated in the plot, confirming our segmentation is valid." b) "Cluster A is much larger than Cluster B in the plot, so Cluster A contains more customers." c) "Clusters 2 and 3 are right next to each other, so those customer groups are very similar." d) "This cluster has a tight, dense shape, which means the customers within it are very homogeneous." e) "I ran t-SNE twice and got different-looking plots, so the analysis is unreliable."
Section E: Anomaly Detection
Exercise 9.12 — Fraud Detection Design
You are the Head of Data Analytics at an online payment processor handling 5 million transactions per day. Approximately 0.05% of transactions are fraudulent. Your current rule-based system catches about 60% of fraud but generates a false positive rate of 2% (meaning 2% of legitimate transactions are incorrectly flagged).
a) Calculate the approximate daily numbers: How many transactions are fraudulent? How many legitimate transactions are incorrectly flagged? What is the ratio of false positives to true positives? b) Explain why a supervised classifier trained on labeled fraud data might struggle with this problem. (Hint: consider class imbalance and evolving fraud patterns.) c) Describe how you would use an isolation forest to supplement the rule-based system. What features would you include? How would you set the contamination parameter? d) A colleague proposes replacing the rule-based system entirely with an unsupervised anomaly detector. Evaluate this proposal. What are the risks?
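For part (c), an isolation forest sketch on invented transaction features (amount, hour of day, merchant risk score — these are illustrative assumptions, not the processor's real feature set) shows the basic mechanics, including setting `contamination` slightly above the believed fraud rate:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Hypothetical legitimate transactions
legit = np.column_stack([
    rng.lognormal(3.5, 0.8, 5000),  # typical purchase amounts
    rng.normal(14, 4, 5000),        # daytime-centered transaction hours
    rng.beta(2, 8, 5000),           # mostly low merchant risk scores
])
# Hypothetical fraud: jointly extreme on all three features
fraud = np.column_stack([
    rng.lognormal(6.5, 0.5, 25),    # unusually large amounts
    rng.normal(3, 1, 25),           # middle-of-the-night hours
    rng.beta(8, 2, 25),             # high merchant risk scores
])
X = np.vstack([legit, fraud])
y = np.array([0] * 5000 + [1] * 25)  # ground truth, unknown in production

# contamination set a little above the believed fraud rate (25/5025 ~ 0.5%)
iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
flags = iso.predict(X) == -1  # -1 marks predicted anomalies

caught = int((flags & (y == 1)).sum())
false_pos = int((flags & (y == 0)).sum())
print(f"fraud caught: {caught}/25, legitimate flagged: {false_pos}")
```

Because the model has no labels, every flagged transaction, fraudulent or not, costs review effort — which is the crux of part (d).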
Exercise 9.13 — Anomaly Detection Methods
For each of the following scenarios, recommend the most appropriate anomaly detection approach (Z-score, IQR, isolation forest, or Mahalanobis distance) and explain why:
a) Detecting unusual daily sales figures for a single product at a single store, where sales are approximately normally distributed.
b) Detecting anomalous customer behavior profiles based on 15 correlated behavioral features.
c) Detecting unusual delivery times in a logistics operation, where the data is right-skewed (most deliveries are fast, some are very late).
d) Detecting anomalous sensor readings in a manufacturing plant where you have 50 sensor measurements per machine per minute.
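The two univariate rules can be stated in a few lines. A sketch on synthetic daily sales with one injected outlier (illustrative data, not from the exercise):

```python
import numpy as np

rng = np.random.default_rng(1)
sales = rng.normal(1000, 150, 365)  # ~normal daily sales for one year
sales[100] = 2500                   # inject one obvious anomaly

# Z-score rule: flag points more than 3 standard deviations from the mean
z = (sales - sales.mean()) / sales.std()
z_flags = np.abs(z) > 3

# IQR rule: robust to skew; flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(sales, [25, 75])
iqr = q3 - q1
iqr_flags = (sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)

print("z-score flags:", np.where(z_flags)[0])
print("IQR flags:", np.where(iqr_flags)[0])
```

Note that the IQR fences sit at roughly ±2.7σ for normal data, so it may flag a few legitimate extremes that the 3σ z-score rule does not; its advantage appears on skewed data, where the mean and standard deviation are themselves distorted.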
Section F: Customer Segmentation and Business Application
Exercise 9.14 — RFM Analysis
You have the following customer data:
| Customer | Last Purchase | Purchases (12 mo.) | Total Spend (12 mo.) |
|---|---|---|---|
| Alice | 5 days ago | 24 | $3,200 |
| Bob | 90 days ago | 3 | $150 |
| Carol | 15 days ago | 12 | $1,800 |
| Dave | 2 days ago | 48 | $8,500 |
| Eve | 180 days ago | 1 | $45 |
| Frank | 30 days ago | 8 | $600 |
a) Rank each customer on Recency, Frequency, and Monetary (1 = lowest, 5 = highest). Assign scores using quintile-like groupings based on this small dataset. b) Which customer(s) would you classify as "best customers"? Which as "at-risk"? Which as "lost"? c) If you could only send a retention offer to two customers, which two would you choose and why? d) What additional behavioral features (beyond RFM) would make this segmentation more actionable for an e-commerce company?
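One way to mechanize part (a) with pandas: rank each dimension, then bin the ranks into five groups (a stand-in for true quintiles, since six customers cannot fill five quintiles evenly). Recency is ranked descending because fewer days since purchase is better:

```python
import pandas as pd

df = pd.DataFrame({
    "customer": ["Alice", "Bob", "Carol", "Dave", "Eve", "Frank"],
    "recency_days": [5, 90, 3, 2, 180, 30],
    "frequency": [24, 3, 12, 48, 1, 8],
    "monetary": [3200, 150, 1800, 8500, 45, 600],
})
# Correct Carol's recency to match the table
df.loc[df.customer == "Carol", "recency_days"] = 15

# Rank-then-bin: lower recency_days ranks higher (ascending=False)
df["R"] = pd.qcut(df["recency_days"].rank(ascending=False), 5,
                  labels=range(1, 6)).astype(int)
df["F"] = pd.qcut(df["frequency"].rank(), 5, labels=range(1, 6)).astype(int)
df["M"] = pd.qcut(df["monetary"].rank(), 5, labels=range(1, 6)).astype(int)
df["RFM"] = df["R"] + df["F"] + df["M"]

print(df[["customer", "R", "F", "M", "RFM"]].sort_values("RFM", ascending=False))
```

Dave scores 5-5-5 and Eve bottoms out at 1-1-1, matching the intuition that Dave is the best customer and Eve is likely lost.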
Exercise 9.15 — From Clusters to Strategy
You've run K-means (K = 4) on a hotel chain's guest data and discovered the following four segments:
| Segment | Size | Avg Nights/Year | Avg Rate Paid | % Business Travel | Loyalty Member | Restaurant Spend/Stay |
|---|---|---|---|---|---|---|
| A | 30% | 2 | $120 | 10% | 15% | $15 |
| B | 25% | 12 | $180 | 85% | 70% | $45 |
| C | 15% | 4 | $350 | 20% | 40% | $120 |
| D | 30% | 1 | $95 | 5% | 5% | $10 |
a) Assign a descriptive name to each segment based on the data. b) For each segment, propose one specific marketing action and one specific operational action. c) Which segment has the highest estimated lifetime value? Show your reasoning (consider frequency, rate, ancillary spend, and loyalty). d) Segment D is the largest but lowest-value segment. The CMO wants to "convert" them into Segment B customers. Evaluate this strategy — is it realistic? What alternative approach might be more effective? e) How would you validate that these four segments are genuinely different and not artifacts of the clustering algorithm?
Exercise 9.16 — Segmentation Ethics
A health insurance company uses unsupervised learning to segment its members into behavioral clusters. The clustering reveals that one segment has significantly higher healthcare utilization, lower income, and is disproportionately composed of members from minority racial groups.
a) Should the insurance company use this segment for pricing decisions? Discuss the ethical and legal implications. b) If the company uses this segmentation for marketing (e.g., targeting preventive care programs to high-utilization members), does the ethical calculus change? Why or why not? c) How might algorithmic segmentation amplify existing social inequalities, even when race is not explicitly included as a feature? d) Propose a framework for evaluating whether a particular use of ML-based segmentation is ethical. Include at least three criteria.
Section G: Python and Applied Skills
Exercise 9.17 — Modifying the CustomerSegmenter
Using the CustomerSegmenter class from the chapter as a starting point:
a) Add a new feature called days_since_first_purchase (customer tenure) to the synthetic data generator. Assign different tenure distributions to each segment (e.g., Quiet Loyalists should have long tenures, Young Digital should have short tenures).
b) Re-run the clustering pipeline with the new feature. Does the addition of tenure change the cluster assignments? How do the silhouette scores compare?
c) Modify the profile_clusters method to output a "recommended channel" for each cluster based on the online_ratio and email_engagement values (e.g., "in-store outreach," "email campaigns," "push notifications," "social media").
Exercise 9.18 — Hierarchical Clustering Comparison
Using the synthetic data from the CustomerSegmenter:
a) Run agglomerative hierarchical clustering with Ward linkage on the same scaled data. Use scipy.cluster.hierarchy.dendrogram to plot the dendrogram.
b) Cut the dendrogram at a level that produces 6 clusters. Compare these clusters to the K-means clusters using a cross-tabulation. How much overlap is there?
c) Repeat with average linkage and complete linkage. Which linkage method produces clusters most similar to K-means?
d) Write a paragraph summarizing when you would recommend hierarchical clustering over K-means to a business stakeholder.
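The cross-tabulation in part (b) can be built with `pd.crosstab`. The sketch below substitutes a synthetic stand-in for the CustomerSegmenter's scaled features (the class itself is not reproduced here):

```python
import pandas as pd
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the CustomerSegmenter's scaled data
X, _ = make_blobs(n_samples=600, centers=6, random_state=11)
X = StandardScaler().fit_transform(X)

km_labels = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)

Z = linkage(X, method="ward")                        # Ward linkage hierarchy
hc_labels = fcluster(Z, t=6, criterion="maxclust")   # cut to exactly 6 clusters

# One dominant cell per row and column means the two methods largely agree
xtab = pd.crosstab(pd.Series(km_labels, name="kmeans"),
                   pd.Series(hc_labels, name="ward"))
print(xtab)
```

Ward linkage minimizes within-cluster variance at each merge — the same objective K-means optimizes — which is why it typically produces the closest match in part (c).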
Exercise 9.19 — Anomaly Detection in Practice
Using the CustomerSegmenter's synthetic data:
a) Add 50 synthetic "anomalous" customers with extreme values (e.g., very high spend but zero frequency, or very high return rates with very high frequency). These represent potential fraud or data entry errors.
b) Fit an IsolationForest from scikit-learn with contamination=0.01 on the full dataset (original + anomalous customers).
c) How many of the 50 injected anomalies does the isolation forest detect? What is the false positive rate (legitimate customers flagged as anomalies)?
d) Experiment with different contamination values (0.005, 0.01, 0.02, 0.05). Plot the true positive rate and false positive rate for each value. What is the trade-off?
Exercise 9.20 — PCA Deep Dive
a) Run PCA on the CustomerSegmenter's scaled data and extract the first three principal components. Print the component loadings (the weights assigned to each original feature for each PC).
b) Interpret PC1 in business terms. Which original features contribute most strongly? What "concept" does PC1 capture?
c) Create a 3D scatter plot using the first three principal components, colored by cluster assignment. Does the third dimension reveal structure that the 2D plot missed?
d) Calculate the cumulative explained variance for all 8 components. At how many components do you reach 95% explained variance?
Section H: Integration and Critical Thinking
Exercise 9.21 — Unsupervised as Preprocessing
A data science team at a bank is building a supervised model to predict loan default. They have 200 features (credit history, income, employment, spending patterns, etc.). Before building the classifier, they consider two preprocessing approaches:
a) Approach A: Use PCA to reduce 200 features to 30 components (retaining 90% of variance), then train a gradient-boosted tree classifier. b) Approach B: Use K-means to cluster borrowers into 8 behavioral segments, add the cluster label as a new feature to the original 200 features, then train the classifier.
For each approach, discuss: (i) What information might be gained or lost? (ii) How does it affect model interpretability? (iii) Under what circumstances would you prefer this approach?
Exercise 9.22 — Cluster Stability Analysis
You've presented a 5-cluster customer segmentation to the VP of Marketing. She asks: "How do I know these clusters are real and not just noise?"
Design a cluster stability analysis that addresses her concern. Your plan should include:
a) At least two quantitative methods for assessing cluster stability b) A visual method for communicating stability to a non-technical stakeholder c) A business-level validation approach (beyond statistical metrics) d) Criteria for deciding that the clusters are "stable enough" to act on
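One common quantitative method for part (a) is bootstrap stability: re-cluster resampled data and compare the labels to a reference clustering with the adjusted Rand index (ARI). A sketch on synthetic stand-in data:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the customer data
X, _ = make_blobs(n_samples=500, centers=5, random_state=2)

# Reference clustering on the full dataset
base = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X)

rng = np.random.default_rng(0)
scores = []
for _ in range(20):
    idx = rng.choice(len(X), size=len(X), replace=True)  # bootstrap resample
    boot = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X[idx])
    # Agreement between reference and bootstrap labels on the resampled points
    scores.append(adjusted_rand_score(base.labels_[idx], boot.labels_[idx]))

print(f"mean ARI across bootstraps: {np.mean(scores):.3f}")
```

ARI near 1.0 across resamples indicates stable clusters; values drifting well below 1 suggest the partition depends heavily on which customers happened to be sampled. A histogram of these scores also serves as the visual for part (b).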
Exercise 9.23 — Full Segmentation Proposal
You are a data analyst at a mid-sized subscription box company (250,000 subscribers). The CEO has asked you to develop a customer segmentation to inform retention strategy. You have access to: subscription history, box customization choices, engagement with social media and email, customer service interactions, referral activity, and payment history.
Write a two-page proposal that includes:
a) The business objective of the segmentation and how it will inform specific decisions b) Your recommended feature set (at least 8 features, with justification for each) c) Your recommended clustering approach (K-means, hierarchical, DBSCAN, or a combination) with justification d) Your plan for choosing the number of clusters e) Your plan for validating and communicating the results to non-technical stakeholders f) Three potential risks or failure modes and how you would mitigate each
Solutions to selected exercises are available in Appendix B.