Chapter 9 Further Reading: Unsupervised Learning
Clustering: Foundations and Algorithms
1. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer. Chapter 14: Unsupervised Learning. The definitive reference for the mathematical foundations of clustering, PCA, and related techniques. Chapter 14 covers K-means, hierarchical clustering, self-organizing maps, and spectral clustering with mathematical rigor. Not a casual read, but indispensable for anyone who wants to understand why these algorithms work, not just how to run them. Freely available online from the authors' website.
2. Jain, A. K. (2010). "Data Clustering: 50 Years Beyond K-Means." Pattern Recognition Letters, 31(8), 651-666. A retrospective from one of the field's leading researchers, surveying the evolution of clustering from K-means through density-based, spectral, and kernel methods. Particularly valuable for understanding the limitations of K-means and when to reach for alternatives. Accessible to readers with moderate technical background.
3. Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise." Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD-96), 226-231. The original DBSCAN paper. Remarkably readable for a foundational algorithm paper, with clear explanations of core points, border points, and noise. Essential for understanding the design philosophy behind density-based clustering and why it was such a departure from K-means and hierarchical methods.
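The paper's vocabulary of core points, border points, and noise maps directly onto scikit-learn's implementation. Below is a minimal sketch (my own construction, not from the paper) on the classic two-moons shape that defeats K-means; `eps=0.2` and `min_samples=5` are illustrative values, not recommendations:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Two interleaved half-moons: a shape K-means cannot separate.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps is the neighborhood radius; min_samples is the density threshold
# a point must meet to count as a core point.
db = DBSCAN(eps=0.2, min_samples=5).fit(X)

n_core = len(db.core_sample_indices_)      # core points
n_noise = int(np.sum(db.labels_ == -1))    # points labeled noise (-1)
n_clusters = len(set(db.labels_)) - (1 if -1 in db.labels_ else 0)
print(n_clusters)  # → 2: each moon is recovered as one cluster
```

The `-1` label for noise, rather than forcing every point into a cluster, is the design departure from K-means that the paper argues for.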
4. McInnes, L., Healy, J., & Astels, S. (2017). "hdbscan: Hierarchical Density Based Clustering." Journal of Open Source Software, 2(11), 205. Introduces HDBSCAN, the hierarchical extension of DBSCAN that handles varying-density clusters and eliminates the need to select the eps parameter manually. A practical upgrade for practitioners who find DBSCAN's parameter sensitivity limiting. The Python implementation (hdbscan library) is production-ready and widely used.
Dimensionality Reduction
5. Shlens, J. (2014). "A Tutorial on Principal Component Analysis." arXiv preprint arXiv:1404.1100. The clearest and most accessible explanation of PCA available, written for readers who want intuition before mathematics. Shlens builds PCA from first principles using linear algebra and statistics, with helpful visualizations. Ideal for MBA students and business practitioners who want to understand PCA beyond "it reduces dimensions."
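Shlens's construction compresses into a few lines: center the data, form the covariance matrix, and read the principal components off its eigendecomposition. The sketch below (my own, on synthetic data) checks that recipe against scikit-learn's `PCA`:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic data with three very different variance directions.
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [1.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.1]])

# The recipe: center, form the covariance matrix, eigendecompose,
# and sort eigenvectors by descending eigenvalue.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (len(X) - 1)
eigvals, eigvecs = np.linalg.eigh(cov)        # returned in ascending order
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# scikit-learn's PCA (computed via SVD) agrees with the eigendecomposition.
pca = PCA(n_components=3).fit(X)
assert np.allclose(pca.explained_variance_, eigvals)
```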
6. van der Maaten, L., & Hinton, G. (2008). "Visualizing Data Using t-SNE." Journal of Machine Learning Research, 9, 2579-2605. The original t-SNE paper by its creators. Explains the probabilistic foundations of the technique and demonstrates its superiority over PCA for visualizing high-dimensional data with complex local structure. The paper's visualizations of MNIST digit clusters are iconic in the ML community. Technical but well-written.
7. McInnes, L., Healy, J., & Melville, J. (2018). "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction." arXiv preprint arXiv:1802.03426. The UMAP paper introduces a technique that achieves t-SNE-quality visualizations at significantly lower computational cost, with better preservation of global structure. The paper is mathematically dense (grounded in topological data analysis), but the practical implications are straightforward: UMAP is faster, more consistent, and more versatile than t-SNE for most use cases.
8. Wattenberg, M., Viégas, F., & Johnson, I. (2016). "How to Use t-SNE Effectively." Distill. An interactive, beautifully designed article that demonstrates how t-SNE's hyperparameters (especially perplexity) affect the resulting visualizations. Essential reading for anyone who presents t-SNE plots to stakeholders, as it makes viscerally clear why naive interpretation of t-SNE plots can be misleading. Available free at distill.pub.
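The article's central exercise is easy to reproduce: embed the same dataset at several perplexity values and compare the pictures. A minimal version (my own sketch, on synthetic blobs):

```python
from sklearn.datasets import make_blobs
from sklearn.manifold import TSNE

X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# The same data embedded at three perplexity values; in a notebook each
# 2-D embedding would be scattered and compared side by side. The
# apparent cluster sizes and spacings differ across the three plots.
embeddings = {
    p: TSNE(perplexity=p, random_state=0).fit_transform(X)
    for p in (5, 30, 50)
}
for p, emb in embeddings.items():
    print(p, emb.shape)   # each embedding is (150, 2)
```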
Anomaly Detection
9. Liu, F. T., Ting, K. M., & Zhou, Z.-H. (2008). "Isolation Forest." Proceedings of the 8th IEEE International Conference on Data Mining (ICDM), 413-422. The original isolation forest paper. The key insight — anomalies are easier to isolate with random splits — is elegant and practically powerful. The paper demonstrates the algorithm's advantages over distance-based and density-based anomaly detectors, particularly on high-dimensional data. Readable and well-motivated.
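The isolation insight takes only a few lines with scikit-learn's `IsolationForest` (a sketch of mine, not from the paper): injected extreme points are isolated by the fewest random splits and so receive the lowest `score_samples` values.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
inliers = rng.normal(0.0, 1.0, size=(300, 2))
outliers = np.array([[8.0, 8.0], [-9.0, 7.0]])   # far from the cloud
X = np.vstack([inliers, outliers])

iso = IsolationForest(random_state=0).fit(X)
scores = iso.score_samples(X)                    # lower = more anomalous

# The two injected points sit far from the data mass, so random
# axis-aligned splits cut them off almost immediately.
worst_two = np.argsort(scores)[:2]
print(np.sort(worst_two).tolist())  # → [300, 301], the injected outliers
```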
10. Chandola, V., Banerjee, A., & Kumar, V. (2009). "Anomaly Detection: A Survey." ACM Computing Surveys, 41(3), 1-58. The most comprehensive survey of anomaly detection techniques, covering statistical, classification-based, clustering-based, nearest-neighbor, and information-theoretic approaches. The taxonomy of anomaly types (point anomalies, contextual anomalies, collective anomalies) is particularly useful for framing business applications. A reference work rather than a tutorial — useful when you need to understand the full landscape.
11. Goldstein, M., & Uchida, S. (2016). "A Comparative Evaluation of Unsupervised Anomaly Detection Algorithms for Multivariate Data." PLoS ONE, 11(4), e0152173. A systematic comparison of 19 unsupervised anomaly detection algorithms across 10 datasets. Particularly valuable for practitioners who need to choose between isolation forest, local outlier factor, one-class SVM, and other approaches. The benchmarking methodology is rigorous and the results are directly applicable to real-world algorithm selection.
Customer Segmentation and Business Applications
12. Fader, P. (2020). Customer Centricity: Focus on the Right Customers for Strategic Advantage (2nd ed.). Wharton School Press. Peter Fader, one of the foremost experts on customer valuation, makes the strategic case for treating different customers differently. His framework for customer lifetime value (CLV) and customer-level profitability analysis provides the business context for why segmentation matters — and why some segments (like Athena's Quiet Loyalists) are worth far more than their transaction volumes suggest.
13. McDonald, M., & Dunbar, I. (2012). Market Segmentation: How to Do It and How to Profit from It (4th ed.). Wiley. The standard reference on market segmentation for business practitioners, covering both traditional (demographic, psychographic) and data-driven approaches. Useful for understanding the business strategy context into which ML-based segmentation must fit. The chapters on segment evaluation criteria and implementation are particularly relevant.
14. Christodoulakis, C., Meng, Y., & Papangelou, K. (2020). "Customer Segmentation Using Machine Learning." Applied Marketing Analytics, 6(2), 131-145. A practical guide to implementing ML-based customer segmentation, including RFM analysis, K-means clustering, and cluster evaluation. Written for marketing analytics practitioners rather than data scientists, making it accessible to the target audience of this textbook. Includes worked examples and business interpretation guidance.
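The RFM-then-cluster pipeline the article describes can be sketched in a few lines (hypothetical transaction data and illustrative parameters, not taken from the article): derive recency, frequency, and monetary features per customer, standardize them, then cluster with K-means.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical transaction log: one row per purchase.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3, 4, 4],
    "days_ago":    [3, 40, 1, 2, 5, 200, 10, 12],
    "amount":      [50, 20, 500, 450, 480, 15, 90, 85],
})

rfm = tx.groupby("customer_id").agg(
    recency=("days_ago", "min"),      # days since last purchase
    frequency=("days_ago", "count"),  # number of transactions
    monetary=("amount", "sum"),       # total spend
)

# Standardize so no single RFM dimension dominates the distance metric.
X = StandardScaler().fit_transform(rfm)
rfm["segment"] = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(rfm)
```

The business-interpretation step the article emphasizes — naming and profiling each segment — happens after this, by inspecting the RFM profile of each cluster.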
Fraud Detection
15. Bolton, R. J., & Hand, D. J. (2002). "Statistical Fraud Detection: A Review." Statistical Science, 17(3), 235-255. A foundational review of statistical approaches to fraud detection, covering both supervised and unsupervised methods. Bolton and Hand categorize fraud detection approaches by type (credit card, telecommunications, insurance, money laundering) and technique (peer-group analysis, break-point analysis, network analysis). Provides historical context for the evolution toward ML-based fraud detection.
16. Abdallah, A., Maarof, M. A., & Zainal, A. (2016). "Fraud Detection System: A Survey." Journal of Network and Computer Applications, 68, 90-113. An updated survey covering deep learning, ensemble methods, and hybrid systems for fraud detection. Particularly useful for understanding how supervised and unsupervised approaches complement each other in production fraud detection systems. The discussion of feature engineering for fraud detection is directly applicable to the PayPal case study.
Spotify and Music Recommendation
17. Jacobson, K., Murali, V., Newett, E., Whitman, B., & Yon, R. (2016). "Music Personalization at Spotify." Proceedings of the 10th ACM Conference on Recommender Systems (RecSys), 373. A brief but authoritative overview of Spotify's recommendation architecture from Spotify engineers. Describes the interplay of collaborative filtering, content-based filtering, and NLP-based approaches. The practical constraints of recommending at scale (600M+ users, 100M+ tracks) are illuminating.
18. van den Oord, A., Dieleman, S., & Schrauwen, B. (2013). "Deep Content-Based Music Recommendation." Advances in Neural Information Processing Systems (NIPS), 26. Describes the deep learning approach to music content analysis that underlies Spotify's audio feature extraction. The paper demonstrates how convolutional neural networks trained on audio spectrograms can learn features that predict listener preferences, enabling content-based recommendation for songs with no listening history (the cold start problem).
Practical Implementation
19. Müller, A. C., & Guido, S. (2016). Introduction to Machine Learning with Python: A Guide for Data Scientists. O'Reilly. Chapters 3 and 7. Chapter 3 covers unsupervised transformations (PCA, NMF, t-SNE) and Chapter 7 covers clustering (K-means, agglomerative, DBSCAN) with scikit-learn. The best practical introduction for readers who want to implement the techniques discussed in this chapter. Code-heavy, with excellent explanations of hyperparameter tuning and evaluation.
20. VanderPlas, J. (2016). Python Data Science Handbook. O'Reilly. Chapter 5: Machine Learning. A comprehensive, code-first introduction to ML with scikit-learn, including K-means, Gaussian mixture models, PCA, and manifold learning. The K-means section includes a particularly clear discussion of the elbow method and silhouette analysis. Available free online at jakevdp.github.io.
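The two diagnostics VanderPlas discusses can be sketched together (my own example, on synthetic blobs whose true cluster count is 4): inertia across candidate values of k for the elbow heuristic, and the mean silhouette score as a direct quality measure.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Four well-separated blobs, so the "right" answer is k = 4.
X, _ = make_blobs(n_samples=400,
                  centers=[[-5, -5], [-5, 5], [5, -5], [5, 5]],
                  cluster_std=0.6, random_state=0)

scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = silhouette_score(X, km.labels_)
    # Inertia falls monotonically with k (hence the "elbow" heuristic);
    # the silhouette score instead peaks at the true cluster count.
    print(k, round(km.inertia_, 1), round(scores[k], 3))

print(max(scores, key=scores.get))  # → 4
```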
21. Scikit-learn Documentation. "Clustering," "Decomposition," and "Novelty and Outlier Detection." The official scikit-learn documentation is itself an excellent learning resource. The clustering section includes mathematical descriptions, practical guidance, and comparison charts for every algorithm. The "Comparing different clustering algorithms on toy datasets" example page is a must-see — it visually demonstrates each algorithm's behavior on different data shapes.
Ethics and Responsible Use
22. Barocas, S., & Selbst, A. D. (2016). "Big Data's Disparate Impact." California Law Review, 104(3), 671-732. A legal and ethical analysis of how data mining and ML can reproduce and amplify existing social inequalities, even when sensitive attributes (race, gender) are not used as features. Directly relevant to the ethics of customer segmentation: clusters that correlate with protected characteristics can lead to discriminatory treatment, even if discrimination was not intended.
23. Eubanks, V. (2018). Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor. St. Martin's Press. A journalistic investigation of how algorithmic systems — including clustering and profiling techniques — affect the lives of vulnerable populations. Eubanks examines automated eligibility systems, predictive policing, and data-driven child welfare intervention. Essential context for anyone deploying segmentation or anomaly detection in contexts that affect individuals' access to services or resources.
Advanced Topics
24. Xu, D., & Tian, Y. (2015). "A Comprehensive Survey of Clustering Algorithms." Annals of Data Science, 2(2), 165-193. A thorough survey covering partition-based, hierarchical, density-based, grid-based, model-based, and spectral clustering algorithms. Useful as a reference when K-means, hierarchical, and DBSCAN aren't sufficient for a particular problem. The comparison tables summarizing algorithm properties and appropriate use cases are particularly useful for practitioners.
25. Sculley, D., Holt, G., Golovin, D., et al. (2015). "Hidden Technical Debt in Machine Learning Systems." Advances in Neural Information Processing Systems (NIPS), 28. While not specific to unsupervised learning, this seminal Google paper on ML system maintenance is essential reading for anyone deploying clustering or anomaly detection in production. The paper's warning about "glue code," "pipeline jungles," and "undeclared consumers" applies directly to customer segmentation systems that feed into downstream marketing, pricing, and inventory decisions. When your segmentation pipeline breaks, everything downstream breaks with it.
Each item in this reading list was selected because it directly supports concepts introduced in Chapter 9 and developed throughout the textbook. Entries are ordered by relevance within each category. For additional resources on supervised learning foundations, see the Further Reading for Chapters 7-8. For recommendation systems (which build on collaborative filtering concepts introduced here), see Chapter 10.