Key Takeaways: Chapter 15

Naive Bayes and Nearest Neighbors


  1. Naive Bayes assumes conditional independence of features given the class, and this assumption is almost always wrong --- yet the classifier often works remarkably well. The reason is that classification only requires correct ranking of class probabilities, not accurate probability values. The independence assumption can wildly distort probabilities while preserving the ranking. Domingos and Pazzani (1997) showed that NB is optimal when feature dependencies distribute evenly across classes, which happens more often than you might expect.
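The ranking-versus-calibration point can be seen in a tiny sketch: duplicating a feature violates conditional independence and pushes the predicted probabilities toward 0 or 1, yet the predicted class is unchanged. The data and setup below are invented for illustration, using scikit-learn's GaussianNB.

```python
# Illustrative sketch (not from the chapter): duplicating a feature distorts
# NB's probability estimates but preserves the class ranking.
import numpy as np
from sklearn.naive_bayes import GaussianNB

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 1)),    # class 0 centered at 0
               rng.normal(2, 1, (100, 1))])   # class 1 centered at 2
y = np.array([0] * 100 + [1] * 100)

nb_one = GaussianNB().fit(X, y)
X_dup = np.hstack([X, X, X])                  # three identical copies of the feature
nb_dup = GaussianNB().fit(X_dup, y)

p_one = nb_one.predict_proba([[1.5]])[0, 1]
p_dup = nb_dup.predict_proba([[1.5, 1.5, 1.5]])[0, 1]
print(f"P(class 1), 1 copy:   {p_one:.3f}")
print(f"P(class 1), 3 copies: {p_dup:.3f}")   # more extreme, same winner
```

The duplicated model triple-counts the same evidence, so its probability is further from 0.5, but the argmax (and thus the classification) is identical.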

  2. There are three classic Naive Bayes variants, each designed for a different feature type. Gaussian NB assumes continuous features follow a normal distribution within each class. Multinomial NB assumes count data (word frequencies, event counts). Bernoulli NB assumes binary features (word presence/absence). Choosing the wrong variant for your data type is a common source of poor NB performance. For text classification, Multinomial NB on count or TF-IDF features is almost always the right starting point; it tolerates fractional TF-IDF values in practice despite its count assumption.
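A minimal sketch of matching the variant to the feature type, using scikit-learn. The four-document corpus and labels are invented for illustration.

```python
# Sketch: each NB variant paired with the feature representation it assumes.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, GaussianNB

docs = ["free offer win money", "meeting notes attached",
        "win free prize now", "project schedule update"]
labels = [1, 0, 1, 0]                                  # 1 = spam, 0 = ham

counts = CountVectorizer().fit_transform(docs)         # word counts
mnb = MultinomialNB().fit(counts, labels)              # counts -> Multinomial

binary = CountVectorizer(binary=True).fit_transform(docs)
bnb = BernoulliNB().fit(binary, labels)                # presence/absence -> Bernoulli

X_cont = np.random.default_rng(0).normal(size=(4, 3))  # continuous features
gnb = GaussianNB().fit(X_cont, labels)                 # continuous -> Gaussian
```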

  3. Laplace smoothing is non-negotiable for Multinomial and Bernoulli NB. Without smoothing, a word never observed in a class during training drives that class's probability to zero. The smoothing parameter alpha controls the bias-variance tradeoff: too low and the model overfits to rare words; too high and it washes out the signal. Cross-validate alpha between 0.01 and 5.0 instead of accepting the default of 1.0.
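One way to tune alpha is a grid search, sketched below. The synthetic count data and the particular grid values are illustrative, not prescriptive.

```python
# Sketch: cross-validating NB's smoothing parameter alpha instead of
# accepting the default of 1.0. Data is synthetic counts.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB

rng = np.random.default_rng(42)
X = rng.poisson(lam=2.0, size=(200, 30))       # synthetic count features
y = (X[:, 0] + X[:, 1] > 4).astype(int)        # label driven by two columns

grid = GridSearchCV(
    MultinomialNB(),
    param_grid={"alpha": [0.01, 0.1, 0.5, 1.0, 2.0, 5.0]},
    cv=5,
    scoring="accuracy",
)
grid.fit(X, y)
print("best alpha:", grid.best_params_["alpha"])
```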

  4. ComplementNB is often a better default than MultinomialNB for text classification, especially with imbalanced classes. ComplementNB estimates parameters from the complement of each class, which provides more stable estimates for minority classes and reduces the majority-class bias that standard Multinomial NB exhibits.
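Trying ComplementNB is a one-line swap in scikit-learn. The sketch below compares the two on synthetic imbalanced count data (450 majority vs. 50 minority examples, with the minority class concentrated in the first five features); the data is invented for illustration.

```python
# Sketch: MultinomialNB vs ComplementNB on an imbalanced count dataset,
# scored on minority-class F1.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB, ComplementNB

rng = np.random.default_rng(7)
X_major = rng.poisson(2.0, size=(450, 20))
X_minor = np.hstack([rng.poisson(6.0, size=(50, 5)),   # minority class skews
                     rng.poisson(1.0, size=(50, 15))]) # toward 5 features
X = np.vstack([X_major, X_minor])
y = np.array([0] * 450 + [1] * 50)

for model in (MultinomialNB(), ComplementNB()):
    f1 = cross_val_score(model, X, y, cv=5, scoring="f1").mean()
    print(type(model).__name__, "minority-class F1:", round(f1, 3))
```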

  5. KNN is the ultimate "no assumptions" model: it memorizes the training data and classifies by local majority vote. There is no training phase and no learned parameters (the only choices are K and the distance metric), and no assumptions about the data distribution. KNN adapts to any decision boundary shape because the boundary is defined entirely by the local structure of the training data. This makes it a powerful baseline and anomaly detection tool.

  6. Feature scaling is mandatory for KNN. Without standardization, features on larger scales dominate the distance calculation regardless of their predictive importance. Always apply StandardScaler (or equivalent) before KNN. The one partial exception is cosine distance, which is invariant to each vector's magnitude by construction, though it still responds to per-feature rescaling.
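A Pipeline keeps the scaler fitted on training folds only, avoiding leakage during cross-validation. The sketch uses scikit-learn's built-in wine dataset, whose features span very different scales; the exact accuracy gap will vary.

```python
# Sketch: KNN with and without StandardScaler on features of wildly
# different scales (wine dataset).
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_wine(return_X_y=True)

raw = cross_val_score(KNeighborsClassifier(5), X, y, cv=5).mean()
scaled = cross_val_score(
    make_pipeline(StandardScaler(), KNeighborsClassifier(5)), X, y, cv=5
).mean()
print(f"unscaled KNN accuracy: {raw:.3f}")
print(f"scaled KNN accuracy:   {scaled:.3f}")
```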

  7. The curse of dimensionality is KNN's fatal weakness. As dimensionality increases, all points become approximately equidistant (nearest-to-farthest distance ratio approaches 1.0), and the concept of "nearest neighbor" becomes meaningless. KNN degrades rapidly above 20-30 features. The fix is dimensionality reduction (PCA, feature selection) before applying KNN, with domain-driven feature selection preferred over blind PCA.
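The "all points become equidistant" effect is easy to observe directly. The sketch below draws random uniform points and measures the nearest-to-farthest distance ratio as dimensionality grows; the specific dimensions and sample size are arbitrary.

```python
# Sketch: the nearest/farthest distance ratio drifts toward 1.0 as
# dimensionality grows, for uniformly random points.
import numpy as np

rng = np.random.default_rng(0)
ratios = {}
for d in (2, 10, 100, 1000):
    X = rng.random((500, d))                  # 500 random points in [0,1]^d
    q = rng.random(d)                         # random query point
    dists = np.linalg.norm(X - q, axis=1)
    ratios[d] = dists.min() / dists.max()
    print(f"d={d:5d}  nearest/farthest = {ratios[d]:.3f}")
```

In low dimensions the nearest neighbor is much closer than the farthest point; in high dimensions the ratio approaches 1.0 and "nearest" loses its meaning.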

  8. K controls the bias-variance tradeoff in KNN. Small K (1-3) gives a jagged, low-bias, high-variance boundary that overfits to noise. Large K (50+) gives a smooth, high-bias, low-variance boundary that underfits complex patterns. Cross-validate K; a common starting point is sqrt(n), but the optimal K depends on the data. Use odd K values for binary classification to avoid ties.

  9. Naive Bayes excels on small datasets, high-dimensional sparse data, and when training speed matters. With 50-500 labeled examples, NB often outperforms logistic regression and ensemble methods because it has fewer parameters to overfit. Text classification with thousands of features is NB's sweet spot. Near-instant training (a single counting pass over the data) enables real-time retraining that few other algorithms can match.

  10. KNN excels at anomaly detection, nonlinear boundaries, and cold-start problems. Distance-based anomaly scoring (average distance to the K nearest normal neighbors) requires no labeled anomaly data and provides interpretable explanations. KNN's ability to adapt to any boundary shape makes it strong on problems where the decision surface is complex and localized. Adding a new training point is O(1) with brute-force storage (tree-based indexes must be rebuilt), making KNN naturally suited to streaming and continuously updating scenarios.
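The distance-based anomaly score described above can be sketched with scikit-learn's NearestNeighbors. The "normal" data, the choice of K=5, and the test points are all illustrative.

```python
# Sketch: anomaly score = mean distance to the K nearest "normal" points.
# No labeled anomalies are needed; only normal data is fitted.
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
X_normal = rng.normal(0, 1, size=(300, 2))     # unlabeled "normal" observations

nn = NearestNeighbors(n_neighbors=5).fit(X_normal)

def anomaly_score(points):
    # mean distance to the 5 nearest normal neighbors
    dists, _ = nn.kneighbors(points)
    return dists.mean(axis=1)

inlier = np.array([[0.1, -0.2]])
outlier = np.array([[6.0, 6.0]])
print("inlier score: ", anomaly_score(inlier)[0])
print("outlier score:", anomaly_score(outlier)[0])
```

Higher scores mean farther from everything seen before; a threshold on the score flags anomalies, and the nearest neighbors themselves serve as the explanation.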


If You Remember One Thing

Simple models are not simplistic models. Naive Bayes and KNN have clear, well-understood limitations --- violated independence assumptions, curse of dimensionality, uncalibrated probabilities, O(n) prediction cost. But within their domains of strength (small data, text classification, anomaly detection, nonlinear boundaries, real-time retraining), they match or beat algorithms that are orders of magnitude more complex. The senior practitioner does not reach for gradient boosting by default. They reach for the simplest model that solves the problem, and they know --- from experience and from this chapter --- exactly when that model is Naive Bayes or KNN.


These takeaways summarize Chapter 15: Naive Bayes and Nearest Neighbors. Return to the chapter for full context.