Chapter 7 Exercises
These exercises reinforce the concepts from Chapter 7 and build practical skills with unsupervised learning algorithms. They are organized into three tiers: Conceptual (testing understanding), Applied (hands-on implementation), and Challenge (open-ended, harder problems).
Solutions to selected exercises are available in code/exercise-solutions.py.
Conceptual Exercises
Exercise 7.1: Supervised vs. Unsupervised
Explain in your own words the fundamental difference between supervised and unsupervised learning. Give two real-world scenarios where unsupervised learning is the only practical option (i.e., labeled data is unavailable or impractical to obtain).
Exercise 7.2: K-Means Assumptions
K-means clustering makes several implicit assumptions about the data. List at least three assumptions, and for each, describe a dataset that would violate it. What alternative algorithm would you recommend for each violation?
Exercise 7.3: Linkage Criteria
Consider four data points in 2D: A = (0, 0), B = (1, 0), C = (5, 0), D = (6, 0). Suppose we have two clusters: {A, B} and {C, D}. Compute the inter-cluster distance under (a) single linkage, (b) complete linkage, and (c) average linkage. Which linkage criterion gives the largest inter-cluster distance?
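Hint: after working this out by hand, you can verify your answer with a few lines of SciPy, for example:

```python
import numpy as np
from scipy.spatial.distance import cdist

cluster_1 = np.array([[0, 0], [1, 0]])   # A, B
cluster_2 = np.array([[5, 0], [6, 0]])   # C, D

# All pairwise distances between the two clusters
d = cdist(cluster_1, cluster_2)

print("single linkage:  ", d.min())    # minimum pairwise distance
print("complete linkage:", d.max())    # maximum pairwise distance
print("average linkage: ", d.mean())   # mean of all pairwise distances
```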
Exercise 7.4: DBSCAN Point Classification
Given the following 2D points and parameters eps=1.5, min_samples=3:
Points: (0,0), (1,0), (0,1), (5,5), (5,6), (6,5), (6,6), (10,10)
Classify each point as a core point, border point, or noise point. How many clusters does DBSCAN find?
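Hint: after classifying the points by hand, you can check your answer with scikit-learn (note that its DBSCAN counts a point as a member of its own eps-neighborhood):

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([(0, 0), (1, 0), (0, 1), (5, 5), (5, 6), (6, 5), (6, 6), (10, 10)])

db = DBSCAN(eps=1.5, min_samples=3).fit(X)
print("labels:      ", db.labels_)               # -1 marks noise points
print("core indices:", db.core_sample_indices_)  # indices of core points
```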
Exercise 7.5: GMM vs. K-Means
Explain the relationship between Gaussian Mixture Models and K-means. Under what specific conditions does the EM algorithm for GMMs reduce to the K-means algorithm? What does this tell you about K-means from a probabilistic perspective?
Exercise 7.6: PCA Variance
A dataset has 5 features with the following eigenvalues after PCA: [4.2, 2.1, 0.8, 0.5, 0.1]. (a) Compute the explained variance ratio for each component. (b) How many components are needed to explain at least 90% of the total variance? (c) What is the reconstruction error (in terms of variance lost) if we keep only 2 components?
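Hint: the arithmetic is easy to check with NumPy, for example:

```python
import numpy as np

eigenvalues = np.array([4.2, 2.1, 0.8, 0.5, 0.1])
ratios = eigenvalues / eigenvalues.sum()   # explained variance ratio per component
cumulative = np.cumsum(ratios)             # cumulative explained variance

print("ratios:    ", np.round(ratios, 3))
print("cumulative:", np.round(cumulative, 3))
print("variance lost with 2 components:", round(1 - cumulative[1], 3))
```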
Exercise 7.7: t-SNE Interpretation
Your colleague runs t-SNE on a dataset and observes three well-separated clusters in the 2D embedding. They conclude (a) that the data has exactly three natural clusters, (b) that the clusters are equidistant from each other, and (c) that cluster 1 is larger than cluster 2 because it occupies more area in the plot. Which of these conclusions are valid? Justify your answers.

Exercise 7.8: Silhouette Score Properties
Prove that the silhouette coefficient $s(i)$ is always in the range $[-1, +1]$. What does a silhouette score of exactly 0 mean geometrically? Under what conditions would you expect negative silhouette scores?
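As a reminder, the standard definition is

$$
s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}},
$$

where $a(i)$ is the mean distance from point $i$ to the other points in its own cluster and $b(i)$ is the smallest mean distance from $i$ to the points of any other cluster.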
Exercise 7.9: Evaluation Without Labels
You have clustered a customer dataset into 5 segments using K-means. You have no ground-truth labels. Describe a complete evaluation strategy using at least three different approaches (quantitative metrics, visualization, and domain-based methods).
Exercise 7.10: Anomaly Detection Comparison
Compare Isolation Forest and Local Outlier Factor in terms of: (a) their underlying detection philosophy, (b) time complexity, (c) ability to detect local vs. global anomalies, and (d) sensitivity to high dimensionality.
Applied Exercises
Exercise 7.11: K-Means from Scratch
Implement the K-means algorithm from scratch using only NumPy. Your implementation should:
- Accept a data matrix X and number of clusters k
- Use random initialization (select k random points as initial centroids)
- Return cluster labels and final centroids
- Include a convergence check (stop when assignments do not change)
Test your implementation on sklearn.datasets.make_blobs and compare results with sklearn.cluster.KMeans.
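One possible skeleton to start from (the function and variable names are only suggestions):

```python
import numpy as np

def kmeans(X, k, max_iter=300, seed=None):
    """A minimal K-means sketch: random init, Lloyd iterations, convergence check."""
    rng = np.random.default_rng(seed)
    # Random initialization: pick k distinct data points as initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = None
    for _ in range(max_iter):
        # Assignment step: label each point with the index of its nearest centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments unchanged -> converged
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return labels, centroids
```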
Exercise 7.12: Elbow Method Implementation
Write a function that takes a dataset and a range of $k$ values, computes the inertia for each $k$, and returns the data needed to plot an elbow curve. Apply it to the Iris dataset (without using the species labels). What value of $k$ does the elbow suggest?
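Hint: a minimal sketch of the required function, assuming ten restarts of scikit-learn's KMeans per $k$:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

def elbow_data(X, k_values):
    """Return (k, inertia) pairs for an elbow plot."""
    return [(k, KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_)
            for k in k_values]

X = load_iris().data               # the species labels are deliberately ignored
curve = elbow_data(X, range(1, 11))
# Plot k vs. inertia and look for the "elbow"
```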
Exercise 7.13: Hierarchical Clustering Dendrogram
Using the Iris dataset (ignoring labels), perform agglomerative clustering with Ward's linkage. (a) Generate the full linkage matrix. (b) Determine a distance threshold that results in exactly 3 clusters. (c) Compare the resulting clusters to the actual species labels using the Adjusted Rand Index.
Exercise 7.14: DBSCAN Parameter Tuning
Load the make_moons dataset from scikit-learn (with noise=0.1). (a) Show that K-means fails to properly cluster this data. (b) Use the k-distance plot method to estimate a good value of eps for DBSCAN. (c) Apply DBSCAN and show that it correctly identifies the two crescent-shaped clusters.
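Hint: a sketch of the k-distance plot for part (b); the sample size of 500 and $k = 5$ are just illustrative choices:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, y = make_moons(n_samples=500, noise=0.1, random_state=0)

k = 5  # a common heuristic ties k to the intended min_samples
# kneighbors on the training data counts each point as its own nearest neighbor,
# so request k + 1 neighbors to reach the k-th genuine neighbor.
distances, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
kth_distances = np.sort(distances[:, -1])

plt.plot(kth_distances)
plt.xlabel("points sorted by k-distance")
plt.ylabel(f"distance to {k}-th nearest neighbor")
plt.show()  # the "knee" of this curve is a candidate value for eps
```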
Exercise 7.15: GMM Model Selection
Generate a dataset with make_blobs using 5 centers, but treat the true number of components as unknown when selecting the model. Fit GMMs with $k = 2, 3, \ldots, 10$ components and both "full" and "diag" covariance types. Plot BIC and AIC vs. $k$ for both covariance types. Do the information criteria correctly identify $k = 5$?
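Hint: a minimal sketch of the model-selection loop (sample size and random seeds are arbitrary choices):

```python
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

X, _ = make_blobs(n_samples=1000, centers=5, random_state=0)  # 5 is the "hidden" truth

results = {}
for cov_type in ["full", "diag"]:
    for k in range(2, 11):
        gmm = GaussianMixture(n_components=k, covariance_type=cov_type,
                              random_state=0).fit(X)
        results[(cov_type, k)] = (gmm.bic(X), gmm.aic(X))
# Plot BIC and AIC vs. k for each covariance type and look for the minimum
```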
Exercise 7.16: PCA on the Digits Dataset
Apply PCA to the scikit-learn digits dataset (64 features). (a) Plot the cumulative explained variance ratio. (b) How many components are needed for 90% and 95% variance? (c) Visualize the first two principal components, colored by digit label. (d) Visualize the first 5 principal component images (reshape the loading vectors to 8x8).
Exercise 7.17: PCA for Noise Reduction
Create a noisy dataset by adding Gaussian noise to the digits dataset: X_noisy = X + np.random.normal(0, 2, X.shape). Apply PCA with different numbers of components and reconstruct the data. At what number of components does the reconstruction best resemble the original clean data? Quantify using mean squared error.
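Hint: a sketch of the reconstruction loop; the candidate component counts are arbitrary:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA

X = load_digits().data
np.random.seed(0)
X_noisy = X + np.random.normal(0, 2, X.shape)

for n in [5, 10, 15, 20, 30, 40, 64]:
    pca = PCA(n_components=n).fit(X_noisy)
    X_rec = pca.inverse_transform(pca.transform(X_noisy))  # reconstruct from n components
    mse = np.mean((X_rec - X) ** 2)                        # compare against the clean data
    print(f"{n:3d} components: MSE vs. clean digits = {mse:.3f}")
```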
Exercise 7.18: t-SNE Perplexity Study
Apply t-SNE to the digits dataset with perplexity values of 5, 15, 30, 50, and 100. For each, create a 2D scatter plot colored by digit label. How does perplexity affect the visualization? At what perplexity do the digit clusters appear most clearly?
Exercise 7.19: UMAP Parameter Exploration
Apply UMAP to the digits dataset with varying n_neighbors (5, 15, 50, 100) and min_dist (0.0, 0.1, 0.5, 1.0). Create a grid of 2D scatter plots. How do these parameters affect the embedding? Which combination gives the most informative visualization?
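Hint: a sketch of the parameter grid, assuming the umap-learn package is installed:

```python
import matplotlib.pyplot as plt
import umap  # provided by the umap-learn package
from sklearn.datasets import load_digits

digits = load_digits()
n_neighbors_values = [5, 15, 50, 100]
min_dist_values = [0.0, 0.1, 0.5, 1.0]

fig, axes = plt.subplots(len(n_neighbors_values), len(min_dist_values), figsize=(16, 16))
for i, nn in enumerate(n_neighbors_values):
    for j, md in enumerate(min_dist_values):
        emb = umap.UMAP(n_neighbors=nn, min_dist=md,
                        random_state=0).fit_transform(digits.data)
        axes[i, j].scatter(emb[:, 0], emb[:, 1], c=digits.target, cmap="tab10", s=3)
        axes[i, j].set_title(f"n_neighbors={nn}, min_dist={md}", fontsize=8)
plt.tight_layout()
plt.show()
```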
Exercise 7.20: Isolation Forest on Synthetic Data
Create a 2D dataset with 500 normal points drawn from a standard normal distribution and 25 anomalous points drawn from a uniform distribution on $[-6, 6]^2$. Apply Isolation Forest with varying contamination values (0.01, 0.05, 0.10, 0.15). At which contamination level does the algorithm best separate anomalies from normal points?
Exercise 7.21: LOF vs. Isolation Forest
Create a dataset with two clusters of different densities: one dense cluster (std=0.5) and one sparse cluster (std=2.0). Add a few outlier points between the clusters. Apply both LOF and Isolation Forest. Which algorithm better identifies the outliers without flagging points from the sparse cluster as anomalous?
Exercise 7.22: Complete Pipeline
Build a complete unsupervised learning pipeline for the Wine dataset (sklearn.datasets.load_wine):
1. Standardize features
2. Apply PCA (retain 95% variance)
3. Run K-means with $k = 2, 3, 4, 5$
4. For each $k$, compute silhouette score, Calinski-Harabasz index, and Davies-Bouldin index
5. Visualize the best clustering in 2D (using the first two PCA components)
6. Compare with the actual wine classes using ARI
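A condensed sketch of steps 1-4 and 6 (step 5, the 2D visualization, is left to you):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.metrics import (adjusted_rand_score, calinski_harabasz_score,
                             davies_bouldin_score, silhouette_score)
from sklearn.preprocessing import StandardScaler

data = load_wine()
X = StandardScaler().fit_transform(data.data)    # step 1: standardize
X_pca = PCA(n_components=0.95).fit_transform(X)  # step 2: keep 95% of the variance

for k in [2, 3, 4, 5]:                           # steps 3-4: cluster and score
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_pca)
    print(f"k={k}: silhouette={silhouette_score(X_pca, labels):.3f}, "
          f"CH={calinski_harabasz_score(X_pca, labels):.1f}, "
          f"DB={davies_bouldin_score(X_pca, labels):.3f}, "
          f"ARI={adjusted_rand_score(data.target, labels):.3f}")  # step 6
```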
Exercise 7.23: Silhouette Analysis
Perform silhouette analysis on the Iris dataset (without using labels) for $k = 2, 3, 4, 5$. For each $k$, compute the per-cluster silhouette scores. At which $k$ are the silhouette scores most uniform across clusters? How does this compare to the known number of species (3)?
Exercise 7.24: Clustering Comparison on make_moons
Generate make_moons(n_samples=500, noise=0.1). Apply K-means ($k=2$), Agglomerative Clustering ($k=2$, each linkage type), DBSCAN, and GMM ($k=2$). Compute the Adjusted Rand Index for each against the true labels. Which algorithm performs best and why?
Exercise 7.25: Feature Extraction with PCA
Use PCA-transformed features as input to a classifier. On the digits dataset: (a) Train a logistic regression classifier (Chapter 5) on the raw 64 features. (b) Train the same classifier on PCA-reduced data with 10, 20, 30, and 40 components. (c) Plot accuracy vs. number of components. At how many components does accuracy plateau?
Challenge Exercises
Exercise 7.26: K-Means++ from Scratch
Implement the K-means++ initialization algorithm from scratch. Compare convergence speed and final inertia between your K-means++ initialization and random initialization over 50 runs on a challenging dataset (e.g., make_blobs with cluster_std=2.0).
Exercise 7.27: EM Algorithm for 1D GMM
Implement the EM algorithm for a 1D Gaussian mixture model from scratch. Generate data from a mixture of two Gaussians with parameters $\mu_1 = -2, \sigma_1 = 1, \mu_2 = 3, \sigma_2 = 0.5, \pi_1 = 0.6$, and show that your EM algorithm recovers these parameters from the data alone. Track and plot the log-likelihood at each iteration to verify convergence.
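Hint: a sketch for generating the test data (the sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
# Mixture weights: pi_1 = 0.6, pi_2 = 0.4
component = rng.random(n) < 0.6
data = np.where(component,
                rng.normal(-2.0, 1.0, n),   # component 1: mu = -2, sigma = 1
                rng.normal(3.0, 0.5, n))    # component 2: mu = 3,  sigma = 0.5
```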
Exercise 7.28: PCA from Scratch
Implement PCA from scratch using NumPy's eigendecomposition or SVD. Your implementation should:
- Center the data
- Compute the covariance matrix
- Find eigenvectors and eigenvalues
- Project onto the top $k$ components
- Compute explained variance ratios
- Support inverse transformation (reconstruction)
Verify your results match sklearn.decomposition.PCA on a test dataset.
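Hint: when comparing against scikit-learn, remember that the sign of each principal component is arbitrary. A sketch of the check, where my_pca is a placeholder for your own function:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data
k = 2

reference = PCA(n_components=k).fit(X)
Z_ref = reference.transform(X)

# Suppose your implementation exposes my_pca(X, k) -> (Z, components, evr)  (hypothetical names)
# Z, components, evr = my_pca(X, k)
# Each principal component is only defined up to sign, so compare per-column magnitudes:
# assert np.allclose(np.abs(Z), np.abs(Z_ref))
# assert np.allclose(evr, reference.explained_variance_ratio_)
```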
Exercise 7.29: Clustering Stability Analysis
Implement a clustering stability analysis. For a given dataset and algorithm, run the algorithm $M$ times with different random seeds and compute the pairwise Adjusted Rand Index between all pairs of runs. Report the mean and standard deviation of ARI. Apply this to K-means and GMM on the digits dataset. Which algorithm is more stable?
Exercise 7.30: Multi-View Clustering
The digits dataset can be "viewed" differently: (a) raw pixel features, (b) PCA-reduced features, (c) histogram of oriented gradients (from skimage). Run K-means on each view separately and compute ARI with the true labels. Then combine all three views by concatenating the features (after scaling each) and cluster again. Does the multi-view approach improve results?
Exercise 7.31: Anomaly Detection Benchmark
Create a comprehensive anomaly detection benchmark. Generate three types of anomalies: (a) global outliers (points far from all clusters), (b) local outliers (points at the boundary between clusters), and (c) contextual outliers (points normal in most features but unusual in one). Apply Isolation Forest, LOF, and GMM-based detection to each type. Report precision, recall, and F1 for each algorithm-anomaly combination.
Exercise 7.32: Dimensionality Reduction Race
Compare PCA, t-SNE, and UMAP on three datasets: (a) digits (1,797 samples, 64 features), (b) a subset of MNIST (5,000 samples, 784 features), and (c) a synthetic dataset with 50,000 samples and 100 features. For each, record: wall-clock time, trustworthiness score, and visual quality of the 2D embedding. Summarize your findings in a comparison table.
Exercise 7.33: Semi-Supervised Learning
Use clustering to bootstrap a semi-supervised learning approach. On the digits dataset: (a) Assume only 5% of labels are available. (b) Cluster the entire dataset using K-means. (c) For each cluster, assign the majority label from the labeled points as the cluster's label. (d) Use these pseudo-labels to train a classifier. Compare accuracy with: (i) a classifier trained on only the 5% labeled data, and (ii) a classifier trained on all labels.
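Hint: a sketch of steps (b) and (c), assuming $k = 10$ clusters (one per digit class) and the illustrative variable names below:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()
X, y = digits.data, digits.target

# Keep labels for only 5% of the points
labeled_idx, _ = train_test_split(np.arange(len(y)), train_size=0.05,
                                  stratify=y, random_state=0)

labels = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)

# Majority vote of the labeled points inside each cluster becomes its pseudo-label
pseudo = np.empty(len(y), dtype=int)
for c in range(10):
    members = labeled_idx[labels[labeled_idx] == c]
    if len(members) > 0:
        values, counts = np.unique(y[members], return_counts=True)
        pseudo[labels == c] = values[counts.argmax()]
    else:
        pseudo[labels == c] = -1  # no labeled point fell in this cluster
```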
Exercise 7.34: Robust PCA
Standard PCA is sensitive to outliers. Implement a simple robust PCA approach: (a) Apply standard PCA to a contaminated dataset (add 5% gross outliers). (b) Detect outliers using Isolation Forest on the PCA scores. (c) Remove detected outliers and re-run PCA. (d) Compare the principal components from steps (a) and (c). How much do the outliers affect the components?
Exercise 7.35: Custom Clustering Metric
Design and implement a custom clustering evaluation metric that considers both cluster compactness and inter-cluster separation, but also penalizes clusters with very unequal sizes. Apply it alongside the silhouette score to compare K-means and GMM on a dataset with naturally imbalanced cluster sizes (e.g., make_blobs with a different number of samples per center, such as n_samples=[600, 150, 50]).