Part IV: Unsupervised Learning

Parts I through III had a safety net: labeled data. Every model had a target variable — a "right answer" to learn from and evaluate against. That safety net is now gone.

Unsupervised learning asks: What structure exists in this data that nobody told us about? There is no target variable, no accuracy score, no confusion matrix. There is only the data and the question of whether the patterns you find are real, useful, or artifacts of your algorithm's assumptions.

This makes unsupervised learning simultaneously more creative and more dangerous than supervised learning. More creative because you can discover groupings, structures, and anomalies that no one thought to label. More dangerous because there is no ground truth to tell you when you are wrong. A clustering algorithm will always find clusters — even in random noise.
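The claim about clusters in noise is easy to demonstrate. The following sketch (data and parameters are illustrative, not from any chapter) runs K-Means on pure Gaussian noise and gets back four confident-looking groups anyway:

```python
# K-Means happily partitions structureless data into "clusters".
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))  # i.i.d. Gaussian noise: no real structure

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)
print(np.bincount(km.labels_))  # four well-populated groups regardless
```

Nothing in the output hints that the partition is meaningless, which is exactly why evaluation without ground truth matters.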

Five chapters. Five ways to find structure without supervision.

Chapter 20: Clustering covers K-Means, DBSCAN, and hierarchical clustering — finding groups in data without labels. The emphasis is on evaluation without ground truth and on understanding when your clusters are real versus when your algorithm is hallucinating.
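One label-free sanity check previewed here is the silhouette score. In this sketch (datasets and parameters are invented for illustration), the same K-Means run scores markedly higher on genuinely grouped data than on noise:

```python
# Silhouette score as a label-free check: real structure vs. noise.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X_blobs, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # real groups
X_noise = np.random.default_rng(0).normal(size=(300, 2))           # no groups

scores = {}
for name, X in [("blobs", X_blobs), ("noise", X_noise)]:
    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
    scores[name] = silhouette_score(X, labels)
print(scores)  # the structured data scores substantially higher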

Chapter 21: Dimensionality Reduction presents PCA for preprocessing and t-SNE/UMAP for visualization. These serve different purposes: PCA compresses information, while t-SNE and UMAP create visual maps of high-dimensional landscapes. The chapter includes mandatory warnings against misinterpreting t-SNE plots — the most misused visualization in data science.
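"PCA compresses information" has a concrete meaning: a few components can carry nearly all the variance. A minimal sketch with synthetic data (the dimensions and noise level are assumptions for illustration):

```python
# Data with 3 true degrees of freedom embedded in 20 dimensions:
# the first 3 principal components recover almost all the variance.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))            # 3 underlying factors
X = latent @ rng.normal(size=(3, 20))         # linearly embedded in 20-D
X += 0.01 * rng.normal(size=X.shape)          # small measurement noise

pca = PCA(n_components=5).fit(X)
print(np.round(np.cumsum(pca.explained_variance_ratio_), 3))
```

The cumulative explained-variance ratio is the standard way to pick how many components to keep before feeding data into a downstream model.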

Chapter 22: Anomaly Detection finds the needles in the haystack, using Isolation Forests, autoencoders, and statistical methods to identify the data points that do not belong. The manufacturing anchor example drives this chapter: the vibration sensor that starts behaving strangely is the turbine that is about to fail.
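A minimal sketch of the vibration-sensor idea, with entirely synthetic numbers: an Isolation Forest trained on mostly-healthy readings flags the two that drift out of the normal band.

```python
# Synthetic sensor amplitudes; the "faulty" values are invented for illustration.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
healthy = rng.normal(loc=0.5, scale=0.05, size=(200, 1))  # normal amplitudes
faulty = np.array([[1.8], [2.1]])                         # the failing turbine
X = np.vstack([healthy, faulty])

iso = IsolationForest(contamination=0.01, random_state=0).fit(X)
pred = iso.predict(X)           # -1 = anomaly, +1 = normal
print(np.where(pred == -1)[0])  # indices of flagged readings
```

The `contamination` parameter encodes a prior belief about how rare anomalies are, which is itself a judgment call the chapter examines.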

Chapter 23: Association Rules mines transaction data for patterns. "People who buy X also buy Y" is simple to state, surprisingly nuanced to compute, and commercially valuable when done right.
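To make "surprisingly nuanced" concrete, here is a hand-rolled computation on an invented five-basket dataset: a rule can have high confidence yet lift below 1, meaning the "pattern" is weaker than buying the item at random.

```python
# Support, confidence, and lift for the rule bread -> milk (toy data).
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"milk"},
    {"bread", "eggs"},
    {"milk", "eggs"},
]
n = len(transactions)

def support(itemset):
    # Fraction of baskets containing every item in the set.
    return sum(itemset <= t for t in transactions) / n

sup_both = support({"bread", "milk"})
confidence = sup_both / support({"bread"})   # P(milk | bread)
lift = confidence / support({"milk"})        # vs. P(milk) overall

print(f"support={sup_both:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

Here confidence is a healthy-looking 67%, but milk appears in 80% of all baskets, so the lift is below 1: bread buyers are actually *less* likely than average to buy milk.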

Chapter 24: Recommender Systems closes the part with collaborative filtering, content-based methods, and hybrid approaches. If you have ever received a "you might also like" suggestion, you have experienced the output of the algorithms in this chapter.
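The core of collaborative filtering fits in a few lines. This sketch uses an invented 3-user ratings matrix: find the user most similar to user 0 by cosine similarity, then recommend that neighbor's best-rated item among the ones user 0 has not seen.

```python
# Toy user-based collaborative filtering (ratings matrix is made up).
import numpy as np

# rows = users, cols = items; 0 = unrated
R = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
], dtype=float)

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

sims = [cosine(R[0], R[j]) for j in range(1, len(R))]
nearest = 1 + int(np.argmax(sims))        # most similar other user
unseen = np.where(R[0] == 0)[0]           # items user 0 hasn't rated
best = int(unseen[np.argmax(R[nearest, unseen])])
print(f"nearest user: {nearest}, recommend item {best}")
```

Real systems add mean-centering, sparsity handling, and matrix factorization on top, but the neighbor-then-recommend loop is the same.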


Progressive Project Connection

The progressive project takes a detour in Part IV. There are no numbered milestones, but each chapter connects back to StreamFlow:

  • Clustering: Segment subscribers into behavioral groups with different churn rates
  • Dimensionality reduction: Visualize the churn/retained boundary in feature space
  • Anomaly detection: Flag unusual usage patterns as early churn indicators
  • Association rules: Discover content combinations that reduce churn
  • Recommender systems: Build a content recommender to increase engagement

What You Need

  • Parts I–III completed (especially Chapter 16 on evaluation and Chapter 19 on interpretation)
  • scikit-learn, umap-learn, mlxtend (for association rules)
  • The StreamFlow pipeline from Chapter 10

Chapters in This Part