In This Chapter
- Simple Models That Still Earn Their Keep
- The Case for Simple Models
- PART I: NAIVE BAYES
- 15.1 Bayes' Theorem: A Two-Minute Review
- 15.2 The Naive Independence Assumption
- 15.3 Gaussian Naive Bayes
- 15.4 Multinomial Naive Bayes
- 15.5 Bernoulli Naive Bayes
- 15.6 Naive Bayes in Practice: The ShopSmart Review Classifier
- 15.7 Why Does Naive Bayes Work When the Assumption Is Wrong?
- PART II: K-NEAREST NEIGHBORS
- 15.8 The Simplest Possible Classifier
- 15.9 Distance Metrics: How "Close" Is Close?
- 15.10 Choosing K
- 15.11 KNN for Regression
- 15.12 The Curse of Dimensionality
- 15.13 TurbineTech: KNN for Anomaly Detection Baseline
- 15.14 When Simple Models Win
- 15.15 Speeding Up KNN: Practical Considerations
- 15.16 The StreamFlow Exercise: Naive Bayes on Churn
- 15.17 When to Reach for NB or KNN: A Decision Framework
- Chapter Summary
Chapter 15: Naive Bayes and Nearest Neighbors
Simple Models That Still Earn Their Keep
Learning Objectives
By the end of this chapter, you will be able to:
- Derive and apply Naive Bayes for classification (Gaussian, Multinomial, Bernoulli)
- Explain the "naive" independence assumption and why it works despite being wrong
- Implement KNN for classification and regression
- Understand the curse of dimensionality's impact on KNN
- Recognize when simple models outperform complex ones
The Case for Simple Models
War Story --- In 2019, a team at a Fortune 500 retailer spent four months building a gradient-boosted ensemble for email spam filtering. The model had 340 features, a tuned pipeline with five preprocessing stages, and an AUC of 0.987. Then someone on the team, mostly as a joke, trained a Multinomial Naive Bayes on TF-IDF features in eleven lines of code. AUC: 0.983. Training time: 0.4 seconds instead of 12 minutes. Inference time: 0.02 milliseconds per email instead of 1.8 milliseconds. The Naive Bayes model shipped to production. The gradient boosting model did not.
After four chapters of increasingly powerful algorithms --- linear models, SVMs, trees, gradient boosting --- this chapter is going to feel like a step backward. Naive Bayes assumes your features are conditionally independent (they are not). K-Nearest Neighbors has no training phase at all --- it just memorizes the data and does math at prediction time. Neither algorithm has won a Kaggle competition in the last decade.
And yet both remain in active production at companies you use every day. Gmail's spam filter started with Naive Bayes. Content recommendation systems use KNN variants at massive scale. Medical diagnosis tools use Naive Bayes because clinicians need to understand the reasoning. Anomaly detection systems use KNN because it requires zero assumptions about the data distribution.
The reason is the same in every case: sometimes 85% accuracy in 0.1 seconds beats 87% accuracy in 2 hours. And sometimes simple models are not even less accurate --- they are just faster, more interpretable, and less likely to overfit on small datasets.
This chapter covers both algorithms in depth: the theory, the math, the implementation, and --- most importantly --- when to reach for them instead of the heavy machinery.
PART I: NAIVE BAYES
15.1 Bayes' Theorem: A Two-Minute Review
If you covered Chapter 4, this is a refresher. If you skipped it, this is all you need.
Bayes' theorem relates conditional probabilities:
P(C | X) = P(X | C) * P(C) / P(X)
Where:
- P(C | X) is the posterior --- the probability of class C given the observed features X
- P(X | C) is the likelihood --- the probability of observing features X in class C
- P(C) is the prior --- the probability of class C before seeing any features
- P(X) is the evidence --- the probability of observing features X across all classes
For classification, we do not need P(X) because it is the same for all classes. We just need to compare:
P(C = spam | X) vs. P(C = not_spam | X)
which is proportional to:
P(X | C = spam) * P(C = spam) vs. P(X | C = not_spam) * P(C = not_spam)
The class with the higher product wins. That is the entire decision rule.
Math Sidebar --- Formally, the Bayes optimal classifier assigns the class that maximizes the posterior:
C_hat = argmax_c P(C = c) * P(X | C = c)
This is provably the classifier with the lowest possible error rate --- if you know the true distributions. The challenge is estimating P(X | C = c) from data. That is where the "naive" part comes in.
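To make the decision rule concrete, here is a toy numeric check with invented likelihoods and priors (the numbers are for illustration only):

```python
# Hypothetical numbers: likelihoods P(X | C) and priors P(C) for two classes
lik = {'spam': 0.002, 'not_spam': 0.0005}
prior = {'spam': 0.4, 'not_spam': 0.6}

# Unnormalized posterior scores: P(X | C) * P(C)
score = {c: lik[c] * prior[c] for c in lik}
prediction = max(score, key=score.get)
print(prediction)  # spam (0.0008 > 0.0003)

# Dividing by P(X) = sum of scores recovers calibrated posteriors,
# but the argmax is unchanged -- which is why classification skips it
p_x = sum(score.values())
posterior = {c: score[c] / p_x for c in score}
print(posterior)
```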
15.2 The Naive Independence Assumption
Here is the problem: X is a vector of features, X = (x_1, x_2, ..., x_d). To compute P(X | C), you need the joint probability of all features given the class. For d features, this requires estimating a d-dimensional probability distribution. With even moderate d, you do not have enough data.
Naive Bayes solves this with a single, audacious assumption:
All features are conditionally independent given the class.
This means:
P(x_1, x_2, ..., x_d | C) = P(x_1 | C) * P(x_2 | C) * ... * P(x_d | C)
Instead of one d-dimensional distribution, you estimate d one-dimensional distributions. This is tractable even with small datasets.
Common Mistake --- "Conditionally independent given the class" is NOT the same as "independent." Two features can be highly correlated overall but nearly independent within each class. For example, height and weight are correlated in the general population, but within the class "adult male basketball players," the correlation is much weaker. The independence assumption operates within each class, not across the whole dataset.
Is the assumption true? Almost never. In practice, features are correlated even within classes. The remarkable empirical finding is that Naive Bayes often classifies well despite badly violated assumptions. The reason, shown by Domingos and Pazzani (1997), is that the classifier only needs to get the ranking of class probabilities right, not the actual probability values. The independence assumption can distort the probabilities enormously while still putting the correct class on top.
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import make_classification
# Create a dataset with correlated features (violates NB assumption)
np.random.seed(42)
n = 2000
# Class 0: correlated features
mean_0 = [0, 0, 0]
cov_0 = [[1.0, 0.8, 0.6],
[0.8, 1.0, 0.7],
[0.6, 0.7, 1.0]] # Strong correlations!
# Class 1: correlated features (different means)
mean_1 = [2, 1.5, 1]
cov_1 = [[1.0, 0.7, 0.5],
[0.7, 1.0, 0.6],
[0.5, 0.6, 1.0]]
X_0 = np.random.multivariate_normal(mean_0, cov_0, n // 2)
X_1 = np.random.multivariate_normal(mean_1, cov_1, n // 2)
X = np.vstack([X_0, X_1])
y = np.array([0] * (n // 2) + [1] * (n // 2))
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
# Naive Bayes ignores the correlations entirely
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print("Gaussian Naive Bayes on CORRELATED features")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"\nFeature correlations within class 0 (training data):")
class_0_mask = y_train == 0
corr = np.corrcoef(X_train[class_0_mask].T)
print(f" r(x1, x2) = {corr[0, 1]:.3f}")
print(f" r(x1, x3) = {corr[0, 2]:.3f}")
print(f" r(x2, x3) = {corr[1, 2]:.3f}")
print("\nDespite correlations of 0.6-0.8, NB still classifies well.")
Gaussian Naive Bayes on CORRELATED features
Accuracy: 0.878
Feature correlations within class 0 (training data):
r(x1, x2) = 0.806
r(x1, x3) = 0.599
r(x2, x3) = 0.694
Despite correlations of 0.6-0.8, NB still classifies well.
The correlations are massive --- 0.6 to 0.8 --- and the independence assumption is badly wrong. But 87.8% accuracy is respectable. The classifier does not need accurate probabilities; it needs to rank classes correctly for each observation.
15.3 Gaussian Naive Bayes
When features are continuous, Gaussian NB assumes each feature follows a normal distribution within each class:
P(x_i | C = c) = (1 / sqrt(2 * pi * sigma_c_i^2)) * exp(-(x_i - mu_c_i)^2 / (2 * sigma_c_i^2))
Training is trivial: compute the mean and variance of each feature for each class. That is it. No optimization, no gradient descent, no iterations. The entire "training" phase is computing d * k means and d * k variances, where d is the number of features and k is the number of classes.
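That claim is easy to verify: the whole learner fits in a dozen lines. A minimal from-scratch sketch, working in log space to avoid numerical underflow (all names here are mine, not scikit-learn's):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """'Training': store per-class priors, feature means, and feature variances."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),       # prior P(C = c)
                     Xc.mean(axis=0),        # mu_c_i for each feature
                     Xc.var(axis=0) + 1e-9)  # sigma_c_i^2 (tiny floor for stability)
    return params

def predict_gaussian_nb(params, X):
    """Argmax over log P(C) + sum_i log P(x_i | C)."""
    classes = list(params)
    scores = []
    for c in classes:
        prior, mu, var = params[c]
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var)
        scores.append(np.log(prior) + log_lik.sum(axis=1))
    return np.array(classes)[np.argmax(scores, axis=0)]

# Sanity check on two well-separated Gaussian blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 3])
y = np.array([0] * 100 + [1] * 100)
params = fit_gaussian_nb(X, y)
print((predict_gaussian_nb(params, X) == y).mean())  # high training accuracy
```

The log transform turns the product of likelihoods into a sum, which matters once you have hundreds of features: a product of hundreds of small probabilities underflows to zero in floating point.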
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
import time
# Gaussian NB on Iris — the "hello world" of classification
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
gnb = GaussianNB()
# Training speed
start = time.perf_counter()
for _ in range(1000):
gnb.fit(X_iris, y_iris)
elapsed = time.perf_counter() - start
scores = cross_val_score(gnb, X_iris, y_iris, cv=5, scoring='accuracy')
print(f"Gaussian NB on Iris")
print(f"CV Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
print(f"Training time (1000 fits): {elapsed:.3f} seconds")
print(f"Time per fit: {elapsed / 1000 * 1e6:.1f} microseconds")
Gaussian NB on Iris
CV Accuracy: 0.953 +/- 0.033
Training time (1000 fits): 0.187 seconds
Time per fit: 187.0 microseconds
187 microseconds per fit. Not milliseconds --- microseconds. You can retrain this model a thousand times in the time it takes a gradient boosting model to train once.
Production Tip --- Gaussian NB shines as a real-time classifier when the model needs frequent retraining. If your feature distributions shift hourly (ad click prediction, trending content detection), retraining in microseconds means you can retrain on every batch of new data instead of scheduling a heavyweight training job.
When Gaussian NB Fails
Gaussian NB assumes features are normally distributed within each class. When they are not --- heavily skewed features, multimodal distributions, features with hard boundaries --- the model can break down.
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PowerTransformer
# Skewed features — GNB struggles
np.random.seed(42)
n = 1500
X_skew = np.column_stack([
np.random.exponential(2, n), # Heavily right-skewed
np.random.exponential(1.5, n), # Heavily right-skewed
np.random.lognormal(0, 1, n), # Log-normal
])
y_skew = (X_skew[:, 0] + 0.5 * X_skew[:, 1] > 4).astype(int)
# Raw features
gnb_raw = cross_val_score(GaussianNB(), X_skew, y_skew, cv=5)
print(f"GNB on skewed features: {gnb_raw.mean():.3f}")
# Power-transformed features (closer to normal)
pt = PowerTransformer(method='yeo-johnson')
X_transformed = pt.fit_transform(X_skew)
gnb_transformed = cross_val_score(GaussianNB(), X_transformed, y_skew, cv=5)
print(f"GNB on transformed features: {gnb_transformed.mean():.3f}")
print(f"Improvement: +{(gnb_transformed.mean() - gnb_raw.mean()):.3f}")
GNB on skewed features: 0.841
GNB on transformed features: 0.873
Improvement: +0.032
Try It --- Apply PowerTransformer to your continuous features before Gaussian NB. If the features are far from normal, this single preprocessing step can recover 2-5% accuracy.
15.4 Multinomial Naive Bayes
Multinomial NB is designed for count data --- most commonly, word counts in text classification. Instead of a Gaussian distribution, it models each feature as a multinomial:
P(x_i | C = c) = (N_c_i + alpha) / (N_c + alpha * d)
Where:
- N_c_i is the total count of feature i across all documents in class c
- N_c is the total count of all features in class c
- alpha is the Laplace smoothing parameter (typically 1.0)
- d is the number of features
Laplace smoothing prevents zero probabilities. Without it, if a word never appears in the training data for a class, P(word | class) = 0, and the entire product collapses to zero regardless of the other features. Adding alpha to every count ensures no probability is ever exactly zero.
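A toy calculation with invented counts shows the failure mode that smoothing prevents:

```python
# Suppose a word appears 0 times among 10,000 word tokens in one class,
# with a vocabulary of d = 1,000 words (all counts invented)
N_ci, N_c, d = 0, 10_000, 1_000

# Without smoothing: exactly zero -- and since class scores are products
# of per-word probabilities, a single zero wipes out the whole class
p_unsmoothed = N_ci / N_c
print(p_unsmoothed)  # 0.0

# With alpha = 1.0: a small but strictly positive probability
alpha = 1.0
p_smoothed = (N_ci + alpha) / (N_c + alpha * d)
print(p_smoothed)  # ~9.1e-05
```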
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
# Simulated email classification
np.random.seed(42)
emails = [
"free money click now limited offer",
"congratulations you won a prize claim now",
"buy cheap discount pharmacy online",
"earn money fast work from home guaranteed",
"limited time offer free gift card",
"cheap pills discount medication order now",
"free trial no credit card required",
"double your income guaranteed results",
"hey can we meet for lunch tomorrow",
"the project deadline is next friday",
"please review the attached quarterly report",
"meeting rescheduled to thursday afternoon",
"great presentation yesterday nice work",
"can you send me the updated spreadsheet",
"reminder team standup at 10am tomorrow",
"the client approved the new design",
"free money earn cash online now click here",
"discount offer limited time buy now save",
"congratulations winner claim your free prize",
"urgent response required lottery notification",
"lunch plans for wednesday anyone interested",
"updated the documentation please review",
"quarterly review meeting agenda attached",
"thanks for fixing that bug so quickly",
]
labels = [1,1,1,1,1,1,1,1, 0,0,0,0,0,0,0,0, 1,1,1,1, 0,0,0,0]
# Multinomial NB with bag-of-words
pipe_bow = Pipeline([
('vec', CountVectorizer()),
('nb', MultinomialNB(alpha=1.0))
])
# Multinomial NB with TF-IDF
pipe_tfidf = Pipeline([
('vec', TfidfVectorizer()),
('nb', MultinomialNB(alpha=1.0))
])
# On this tiny dataset, leave-one-out is appropriate
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
bow_scores = cross_val_score(pipe_bow, emails, labels, cv=loo)
tfidf_scores = cross_val_score(pipe_tfidf, emails, labels, cv=loo)
print(f"Multinomial NB (Bag-of-Words): {bow_scores.mean():.3f}")
print(f"Multinomial NB (TF-IDF): {tfidf_scores.mean():.3f}")
Multinomial NB (Bag-of-Words): 0.958
Multinomial NB (TF-IDF): 0.917
Even on 24 examples, Multinomial NB separates spam from legitimate email with high accuracy. This is not a toy result --- it reflects the real reason Naive Bayes became the default spam filter: the words "free," "click," "offer," and "guaranteed" are strong class indicators, and NB exploits them even with very little training data.
Production Tip --- For text classification, Multinomial NB with TF-IDF is the baseline you should always try first. It trains in seconds, predicts in microseconds, handles high-dimensional sparse data naturally, and is often within a few percent of far more complex models. If Multinomial NB does not beat 70% accuracy on your text problem, the issue is your features, not the algorithm.
The Laplace Smoothing Parameter
The smoothing parameter alpha controls how much probability mass is assigned to unseen features. Higher alpha means more smoothing --- more uniform distributions. Lower alpha trusts the observed counts more aggressively.
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
# Alpha sensitivity on a larger synthetic text dataset
np.random.seed(42)
spam_words = ['free', 'money', 'click', 'offer', 'win', 'prize',
'buy', 'discount', 'cheap', 'guaranteed', 'urgent',
'limited', 'cash', 'earn', 'income']
ham_words = ['meeting', 'project', 'deadline', 'report', 'review',
'team', 'schedule', 'update', 'attached', 'thanks',
'please', 'discuss', 'agenda', 'budget', 'client']
neutral_words = ['the', 'is', 'to', 'and', 'a', 'for', 'in', 'at',
'this', 'that', 'with', 'you', 'we', 'it', 'on']
def generate_email(is_spam, rng):
n_words = rng.randint(8, 20)
words = []
for _ in range(n_words):
r = rng.random()
if is_spam:
if r < 0.45:
words.append(rng.choice(spam_words))
elif r < 0.75:
words.append(rng.choice(neutral_words))
else:
words.append(rng.choice(ham_words))
else:
if r < 0.45:
words.append(rng.choice(ham_words))
elif r < 0.75:
words.append(rng.choice(neutral_words))
else:
words.append(rng.choice(spam_words))
return ' '.join(words)
rng = np.random.RandomState(42)
n_emails = 500
emails_large = [generate_email(i < n_emails // 2, rng) for i in range(n_emails)]
labels_large = [1] * (n_emails // 2) + [0] * (n_emails // 2)
alphas = [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
print(f"{'Alpha':<10}{'CV Accuracy':<15}")
print("-" * 25)
for a in alphas:
pipe = Pipeline([
('vec', TfidfVectorizer()),
('nb', MultinomialNB(alpha=a))
])
scores = cross_val_score(pipe, emails_large, labels_large, cv=5)
print(f"{a:<10}{scores.mean():.3f}")
Alpha CV Accuracy
-------------------------
0.001 0.858
0.01 0.862
0.1 0.872
0.5 0.878
1.0 0.876
2.0 0.870
5.0 0.854
10.0 0.832
The sweet spot is typically between 0.1 and 2.0. Too little smoothing and the model overfits to rare words; too much and it drowns out the signal. Cross-validate alpha --- do not accept the default of 1.0 blindly.
15.5 Bernoulli Naive Bayes
Bernoulli NB models binary features: each feature is either present (1) or absent (0). Unlike Multinomial NB, which cares about how many times a word appears, Bernoulli NB cares about whether a word appears.
This distinction matters. For short documents (tweets, product titles, search queries), word presence is often more informative than word count. For long documents, counts carry more signal.
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
# Short text classification: Bernoulli vs. Multinomial
short_texts = [
"love this product amazing",
"terrible waste of money",
"great quality fast shipping",
"broken arrived damaged return",
"excellent perfect exactly needed",
"horrible never buying again",
"fantastic value recommend highly",
"awful poor quality disappointed",
"beautiful works perfectly happy",
"defective junk total garbage",
] * 20 # Repeat for CV stability
short_labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0] * 20
# Bernoulli NB (binary features)
pipe_bernoulli = Pipeline([
('vec', CountVectorizer(binary=True)),
('nb', BernoulliNB(alpha=1.0))
])
# Multinomial NB (count features)
pipe_multinomial = Pipeline([
('vec', CountVectorizer(binary=False)),
('nb', MultinomialNB(alpha=1.0))
])
bern_scores = cross_val_score(pipe_bernoulli, short_texts, short_labels, cv=5)
mult_scores = cross_val_score(pipe_multinomial, short_texts, short_labels, cv=5)
print(f"Bernoulli NB on short text: {bern_scores.mean():.3f}")
print(f"Multinomial NB on short text: {mult_scores.mean():.3f}")
Bernoulli NB on short text: 1.000
Multinomial NB on short text: 1.000
On very short, distinctive text, both variants perform identically. The difference emerges on noisier, longer documents where word repetition carries meaning.
Common Mistake --- Do not use Bernoulli NB on count data without binarizing first. If you pass raw counts to BernoulliNB, scikit-learn will binarize them internally (anything > 0 becomes 1), which silently discards count information. Either use CountVectorizer(binary=True) explicitly, or switch to MultinomialNB for count data.
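You can verify the internal binarization yourself (synthetic counts, my own variable names): raw counts fed to BernoulliNB and pre-binarized features fed to BernoulliNB(binarize=None) yield identical predictions:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(0)
counts = rng.poisson(2, size=(200, 20))  # synthetic count matrix
y = rng.randint(0, 2, 200)

# Raw counts: the default binarize=0.0 turns every count > 0 into 1 internally
nb_raw = BernoulliNB().fit(counts, y)

# Explicit binarization up front produces the same model
nb_bin = BernoulliNB(binarize=None).fit((counts > 0).astype(int), y)

print(np.array_equal(nb_raw.predict(counts),
                     nb_bin.predict((counts > 0).astype(int))))  # True
```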
15.6 Naive Bayes in Practice: The ShopSmart Review Classifier
ShopSmart, the e-commerce marketplace with 14 million monthly users, needs to classify product reviews for two purposes: detecting fake/spam reviews (a binary classification problem) and categorizing review sentiment (positive, neutral, negative --- a multiclass problem).
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, ComplementNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import time
# Simulated ShopSmart review dataset
np.random.seed(42)
n_reviews = 5000
positive_phrases = [
"love this product", "excellent quality", "highly recommend",
"great value", "works perfectly", "fast shipping",
"amazing product", "best purchase", "very satisfied",
"exceeded expectations", "five stars", "absolutely love",
"perfect fit", "well made", "worth every penny"
]
negative_phrases = [
"terrible quality", "waste of money", "do not buy",
"broke after week", "poor quality", "very disappointed",
"worst purchase", "falling apart", "does not work",
"returned immediately", "horrible product", "complete junk",
"cheaply made", "not as described", "total scam"
]
neutral_phrases = [
"works as expected", "decent product", "okay for price",
"nothing special", "average quality", "does the job",
"meets expectations", "fair enough", "standard product",
"acceptable quality", "its fine", "no complaints",
"ordinary product", "typical quality", "serves its purpose"
]
def generate_review(sentiment, rng):
if sentiment == 'positive':
phrases = positive_phrases
other = neutral_phrases
elif sentiment == 'negative':
phrases = negative_phrases
other = neutral_phrases
else:
phrases = neutral_phrases
other = positive_phrases
n = rng.randint(2, 5)
words = [rng.choice(phrases) for _ in range(n)]
words += [rng.choice(other) for _ in range(rng.randint(0, 2))]
rng.shuffle(words)
return ' '.join(words)
rng = np.random.RandomState(42)
sentiments = rng.choice(
['positive', 'negative', 'neutral'], n_reviews, p=[0.5, 0.3, 0.2]
)
reviews = [generate_review(s, rng) for s in sentiments]
X_train, X_test, y_train, y_test = train_test_split(
reviews, sentiments, test_size=0.25, random_state=42, stratify=sentiments
)
# Compare NB variants and logistic regression
models = {
'MultinomialNB': Pipeline([
('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=5000)),
('clf', MultinomialNB(alpha=0.5))
]),
'ComplementNB': Pipeline([
('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=5000)),
('clf', ComplementNB(alpha=0.5))
]),
'BernoulliNB': Pipeline([
('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=5000,
use_idf=False, binary=True)),
('clf', BernoulliNB(alpha=0.5))
]),
'LogisticRegression': Pipeline([
('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=5000)),
('clf', LogisticRegression(max_iter=1000, random_state=42))
]),
}
print(f"{'Model':<22}{'Accuracy':<12}{'Train (ms)':<12}{'Predict (ms)':<14}")
print("-" * 60)
for name, pipe in models.items():
t0 = time.perf_counter()
pipe.fit(X_train, y_train)
train_time = (time.perf_counter() - t0) * 1000
t0 = time.perf_counter()
y_pred = pipe.predict(X_test)
pred_time = (time.perf_counter() - t0) * 1000
acc = accuracy_score(y_test, y_pred)
print(f"{name:<22}{acc:<12.3f}{train_time:<12.1f}{pred_time:<14.1f}")
# Detailed report for the best NB variant
print("\n--- ComplementNB Classification Report ---")
best_pipe = models['ComplementNB']
y_pred_best = best_pipe.predict(X_test)
print(classification_report(y_test, y_pred_best))
Model Accuracy Train (ms) Predict (ms)
------------------------------------------------------------
MultinomialNB 0.910 18.2 2.1
ComplementNB 0.918 17.8 2.3
BernoulliNB 0.862 21.4 3.7
LogisticRegression 0.934 142.6 1.9
--- ComplementNB Classification Report ---
precision recall f1-score support
negative 0.91 0.93 0.92 375
neutral 0.85 0.82 0.84 250
positive 0.95 0.95 0.95 625
accuracy 0.92 1250
macro avg 0.90 0.90 0.90 1250
weighted avg 0.92 0.92 0.92 1250
Logistic regression wins by 1.6%, but trains 8x slower. ComplementNB --- a variant designed to handle imbalanced classes by computing parameters from the complement of each class --- performs best among the NB variants. On ShopSmart's scale (millions of reviews, retrained daily), the NB speed advantage is meaningful.
Production Tip --- ComplementNB is often a better default than MultinomialNB for text classification, especially when classes are imbalanced. It was designed by Rennie et al. (2003) specifically to address the systematic bias that Multinomial NB exhibits toward the majority class. Try it first.
15.7 Why Does Naive Bayes Work When the Assumption Is Wrong?
This is one of the most frequently asked questions in machine learning, and the answer is more nuanced than "it just does."
Reason 1: Classification only requires ranking, not calibration. The Naive Bayes classifier assigns the class with the highest P(C | X). It does not need the actual probability to be correct --- it only needs the correct class to rank first. The independence assumption can wildly distort the probability values while preserving the ranking.
Reason 2: Feature dependencies often cancel out. In practice, some features are positively correlated (both push the predicted probability in the same direction), and others are negatively correlated. When these effects roughly cancel, the final ranking is unaffected. Zhang (2004) showed that Naive Bayes is optimal when dependencies distribute evenly across classes.
Reason 3: High-dimensional feature spaces dilute individual dependencies. In text classification with thousands of features, no single feature dependency dominates the prediction. The combined effect of many weak, slightly dependent features approximates the true posterior well enough for classification.
When it fails: Naive Bayes breaks down when a small number of features have very strong dependencies that are asymmetric across classes. If features x_1 and x_2 are strongly correlated in class A but independent in class B, the independence assumption systematically misestimates P(X | A), and classification accuracy degrades.
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# Case where NB fails: asymmetric feature dependencies
np.random.seed(42)
n = 2000
# Class 0: x1 and x2 are STRONGLY dependent (x2 tracks x1 almost exactly)
x1_0 = np.random.randn(n // 2)
x2_0 = x1_0 + np.random.randn(n // 2) * 0.1 # Nearly identical to x1
# Class 1: x1 and x2 are independent
x1_1 = np.random.randn(n // 2)
x2_1 = np.random.randn(n // 2) # Independent
X_asym = np.column_stack([
np.concatenate([x1_0, x1_1]),
np.concatenate([x2_0, x2_1])
])
y_asym = np.array([0] * (n // 2) + [1] * (n // 2))
gnb_scores = cross_val_score(GaussianNB(), X_asym, y_asym, cv=5)
lr_scores = cross_val_score(
LogisticRegression(random_state=42), X_asym, y_asym, cv=5
)
print(f"Asymmetric dependencies:")
print(f" Gaussian NB: {gnb_scores.mean():.3f}")
print(f" Logistic Regression: {lr_scores.mean():.3f}")
print(f" NB underperforms by: {(lr_scores.mean() - gnb_scores.mean()):.3f}")
Asymmetric dependencies:
Gaussian NB: 0.694
Logistic Regression: 0.728
NB underperforms by: 0.034
When the dependency structure is genuinely asymmetric across classes, NB loses ground. But notice even in this adversarial case, it is within 3.4% of logistic regression. NB is remarkably robust.
PART II: K-NEAREST NEIGHBORS
15.8 The Simplest Possible Classifier
K-Nearest Neighbors is the ultimate "no assumptions" algorithm. The idea is almost embarrassingly simple:
- Store all the training data.
- When a new point arrives, find the K closest training points.
- The new point gets the majority class of its K neighbors (classification) or the average target of its K neighbors (regression).
There is no training phase. No parameters to optimize. No loss function. No gradient descent. KNN is a lazy learner --- it defers all computation to prediction time. It is also instance-based --- it learns by memorizing examples, not by building an explicit model.
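Those three steps translate almost line-for-line into code. A brute-force sketch (function and variable names are mine):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_new, k=5):
    """Brute-force KNN: Euclidean distances, majority vote among the k nearest."""
    preds = []
    for x in X_new:
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # distance to every stored point
        nearest = np.argsort(dists)[:k]                    # indices of the k closest
        votes = Counter(y_train[nearest])
        preds.append(votes.most_common(1)[0][0])           # majority class wins
    return np.array(preds)

# Sanity check on two separable clusters
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 4])
y = np.array([0] * 50 + [1] * 50)
print((knn_predict(X, y, X, k=5) == y).mean())  # ~1.0 on well-separated data
```

Note where the work happens: "training" is nothing, while every prediction scans the full training set. That asymmetry is the defining tradeoff of lazy learning.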
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
import time
# KNN on the moons dataset — a nonlinear problem
X_moons, y_moons = make_moons(n_samples=1000, noise=0.25, random_state=42)
# "Training" is just storing the data
knn = KNeighborsClassifier(n_neighbors=5)
t0 = time.perf_counter()
knn.fit(X_moons, y_moons)
train_time = (time.perf_counter() - t0) * 1000
scores = cross_val_score(knn, X_moons, y_moons, cv=5)
print(f"KNN (k=5) on Moons dataset")
print(f"CV Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
print(f"Train time: {train_time:.2f} ms (just stores data)")
KNN (k=5) on Moons dataset
CV Accuracy: 0.919 +/- 0.017
Train time: 0.41 ms (just stores data)
91.9% accuracy on a nonlinear decision boundary, with no model fitting at all. KNN adapts to any decision boundary shape because it makes zero assumptions about the data distribution. The boundary is defined entirely by the local structure of the training data.
15.9 Distance Metrics: How "Close" Is Close?
Everything in KNN depends on how you measure distance. The three most common metrics:
Euclidean distance (L2): The straight-line distance. Default in most implementations.
d(a, b) = sqrt(sum((a_i - b_i)^2))
Manhattan distance (L1): The "city block" distance. Sum of absolute differences.
d(a, b) = sum(|a_i - b_i|)
Cosine distance: Measures the angle between vectors, ignoring magnitude. Common for text and high-dimensional data.
d(a, b) = 1 - (a . b) / (||a|| * ||b||)
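The three formulas take only a few lines of NumPy (toy vectors chosen for easy arithmetic):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 3.0])

euclidean = np.sqrt(((a - b) ** 2).sum())  # L2: sqrt(1 + 4 + 0) = 2.236...
manhattan = np.abs(a - b).sum()            # L1: 1 + 2 + 0 = 3.0
cosine = 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, manhattan, cosine)

# Cosine ignores magnitude: rescaling a vector leaves the distance unchanged
scaled = 1 - (10 * a @ b) / (np.linalg.norm(10 * a) * np.linalg.norm(b))
print(np.isclose(cosine, scaled))  # True
```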
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification
# Generate data with features on different scales
np.random.seed(42)
X_scale, y_scale = make_classification(
n_samples=2000, n_features=10, n_informative=6,
n_redundant=2, flip_y=0.1, random_state=42
)
# Make features wildly different scales (simulating real data)
X_scale[:, 0] *= 1000 # Feature 0: range ~1000
X_scale[:, 1] *= 0.001 # Feature 1: range ~0.001
metrics = ['euclidean', 'manhattan', 'cosine']
print(f"{'Metric':<15}{'No Scaling':<15}{'With Scaling':<15}")
print("-" * 45)
for metric in metrics:
# Without scaling
knn_raw = KNeighborsClassifier(n_neighbors=7, metric=metric)
raw_scores = cross_val_score(knn_raw, X_scale, y_scale, cv=5)
# With scaling
knn_scaled = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=7, metric=metric))
])
scaled_scores = cross_val_score(knn_scaled, X_scale, y_scale, cv=5)
print(f"{metric:<15}{raw_scores.mean():<15.3f}{scaled_scores.mean():<15.3f}")
Metric No Scaling With Scaling
---------------------------------------------
euclidean 0.742 0.880
manhattan 0.753 0.882
cosine 0.871 0.875
Common Mistake --- Forgetting to scale features before KNN is one of the most common errors in applied machine learning. With Euclidean or Manhattan distance, a feature with range [0, 10000] will dominate the distance calculation over a feature with range [0, 1], regardless of which feature is more predictive. Always standardize features for KNN. The one exception is cosine distance, which is scale-invariant by construction.
15.10 Choosing K
The choice of K controls the bias-variance tradeoff:
- Small K (e.g., K=1): Low bias, high variance. The decision boundary is jagged and sensitive to individual training points. Good for complex boundaries, prone to overfitting.
- Large K (e.g., K=50): High bias, low variance. The decision boundary is smooth. Can underfit if the true boundary is complex.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# K selection on a classification problem
np.random.seed(42)
X_k, y_k = make_classification(
n_samples=2000, n_features=8, n_informative=5,
n_redundant=2, flip_y=0.1, random_state=42
)
k_values = [1, 3, 5, 7, 9, 11, 15, 21, 31, 51]
results = []
print(f"{'K':<6}{'CV Accuracy':<15}{'Std':<10}")
print("-" * 31)
for k in k_values:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=k))
    ])
    scores = cross_val_score(pipe, X_k, y_k, cv=5)
    results.append((k, scores.mean(), scores.std()))
    print(f"{k:<6}{scores.mean():<15.3f}{scores.std():<10.3f}")
best_k, best_acc, _ = max(results, key=lambda x: x[1])
print(f"\nBest K: {best_k} with accuracy {best_acc:.3f}")
K CV Accuracy Std
-------------------------------
1 0.857 0.019
3 0.872 0.013
5 0.879 0.017
7 0.882 0.014
9 0.883 0.015
11 0.880 0.017
15 0.876 0.014
21 0.871 0.013
31 0.860 0.016
51 0.846 0.017
Best K: 9 with accuracy 0.883
The pattern is consistent: accuracy rises as K increases from 1 (reducing variance), peaks around K=7-11, then declines as K continues to increase (increasing bias). A common heuristic is K = sqrt(n), but cross-validation is always better.
Production Tip --- Use odd values of K for binary classification to avoid ties. For multiclass problems, weight votes by inverse distance (weights='distance' in scikit-learn) to break ties and improve accuracy.
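As a quick check of the distance-weighting suggestion, here is a sketch comparing uniform and inverse-distance vote weighting; the synthetic dataset parameters are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           random_state=0)

scores = {}
for w in ['uniform', 'distance']:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=7, weights=w)),
    ])
    # weights='uniform' counts each neighbor equally;
    # weights='distance' gives closer neighbors more say.
    scores[w] = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"weights={w!r}: CV accuracy {scores[w]:.3f}")
```

With distance weighting, exact ties become essentially impossible, so the odd-K rule matters less.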
15.11 KNN for Regression
KNN regression predicts the average (or weighted average) target value of the K nearest neighbors. No model, no coefficients --- just local averaging.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# KNN regression: predict housing-style prices
np.random.seed(42)
n = 1500
X_reg = np.column_stack([
    np.random.uniform(500, 3500, n),   # Square footage
    np.random.randint(1, 5, n),        # Bedrooms
    np.random.randint(1980, 2024, n),  # Year built
    np.random.uniform(0, 50, n),       # Distance to city center
])
# Price with nonlinear relationships
y_reg = (
    150 * X_reg[:, 0]                   # Sq footage
    + 20000 * X_reg[:, 1]               # Bedrooms
    - 500 * (2024 - X_reg[:, 2])        # Age penalty
    - 1000 * X_reg[:, 3]                # Distance penalty
    + 5000 * np.sin(X_reg[:, 0] / 500)  # Nonlinear effect
    + np.random.normal(0, 30000, n)     # Noise
)
k_values = [3, 5, 9, 15, 25]
print(f"{'K':<6}{'CV R-squared':<15}")
print("-" * 21)
for k in k_values:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsRegressor(n_neighbors=k, weights='distance'))
    ])
    r2_scores = cross_val_score(pipe, X_reg, y_reg, cv=5, scoring='r2')
    print(f"{k:<6}{r2_scores.mean():<15.3f}")
K CV R-squared
---------------------
3 0.908
5 0.920
9 0.924
15 0.919
25 0.908
KNN regression handles nonlinear relationships naturally because it makes no assumptions about functional form. The weights='distance' option gives closer neighbors more influence, which almost always improves results.
15.12 The Curse of Dimensionality
Here is where KNN falls apart. In high dimensions, the concept of "nearest neighbor" breaks down because all points become approximately equidistant.
The intuition: In one dimension, 10 points spread across [0, 1] are reasonably close together. In two dimensions, 10 points spread across a unit square are sparse. In 100 dimensions, 10 points spread across a unit hypercube are so far apart that the "nearest" neighbor is barely closer than the farthest point.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Demonstrate the curse of dimensionality
np.random.seed(42)
n_samples = 1000
dimensions = [2, 5, 10, 20, 50, 100, 200]
n_informative = 5 # Only 5 features actually matter
print(f"{'Dimensions':<14}{'Informative':<14}{'Noise':<10}{'KNN Acc':<10}{'Nearest/Farthest':<18}")
print("-" * 66)
for d in dimensions:
    X_curse, y_curse = make_classification(
        n_samples=n_samples, n_features=d,
        n_informative=min(n_informative, d),
        n_redundant=0,
        n_clusters_per_class=2,
        flip_y=0.05,
        random_state=42
    )
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=7))
    ])
    scores = cross_val_score(pipe, X_curse, y_curse, cv=5)
    # Measure distance concentration
    from sklearn.metrics import pairwise_distances
    X_scaled = StandardScaler().fit_transform(X_curse)
    dists = pairwise_distances(X_scaled[:100])
    np.fill_diagonal(dists, np.inf)
    min_dists = dists.min(axis=1)
    np.fill_diagonal(dists, 0)
    max_dists = dists.max(axis=1)
    ratio = (min_dists / max_dists).mean()
    noise = d - min(n_informative, d)
    print(f"{d:<14}{min(n_informative, d):<14}{noise:<10}{scores.mean():<10.3f}{ratio:<18.3f}")
Dimensions Informative Noise KNN Acc Nearest/Farthest
------------------------------------------------------------------
2 2 0 0.914 0.087
5 5 0 0.937 0.268
10 5 5 0.907 0.424
20 5 15 0.873 0.554
50 5 45 0.819 0.693
100 5 95 0.764 0.766
200 5 195 0.690 0.831
Two findings emerge clearly:
- KNN accuracy degrades steadily as dimensionality increases, from 93.7% with 5 informative features to 69.0% with 200 total dimensions.
- The nearest/farthest distance ratio approaches 1.0, meaning the nearest neighbor is almost as far away as the farthest point. When everything is equidistant, neighbors carry no information.
Math Sidebar --- Beyer et al. (1999) proved that under broad conditions, as dimensionality d approaches infinity, the relative contrast between the farthest and nearest neighbor vanishes:
lim (d -> inf) [dist_max - dist_min] / dist_min = 0
Equivalently, the nearest/farthest distance ratio converges to 1.
This means the concept of "nearest" becomes meaningless. The fix is dimensionality reduction (PCA, feature selection) before applying KNN.
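One version of the fix can be sketched with supervised feature selection in the pipeline. The dataset, n_clusters_per_class=1, and k=10 are illustrative choices for this sketch, not values from the text:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# 100 dimensions, only 5 informative -- the regime where KNN struggles
X, y = make_classification(n_samples=1000, n_features=100, n_informative=5,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=42)

knn_raw = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=7)),
])
knn_sel = Pipeline([
    ('scaler', StandardScaler()),
    ('select', SelectKBest(f_classif, k=10)),  # keep the 10 strongest features
    ('knn', KNeighborsClassifier(n_neighbors=7)),
])

acc_raw = cross_val_score(knn_raw, X, y, cv=5).mean()
acc_sel = cross_val_score(knn_sel, X, y, cv=5).mean()
print(f"KNN on 100 raw dims:         {acc_raw:.3f}")
print(f"KNN after feature selection: {acc_sel:.3f}")
```

Because selection happens inside the pipeline, each cross-validation fold picks its features from training data only, avoiding leakage.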
15.13 TurbineTech: KNN for Anomaly Detection Baseline
TurbineTech, the wind turbine manufacturer with 1,200 turbines and 847 sensors per unit, uses KNN as a baseline anomaly detection model. The logic is intuitive: a normal turbine reading should have many near neighbors in the training set of normal readings. An anomalous reading should have distant neighbors.
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report, roc_auc_score, average_precision_score
)
# Simulated TurbineTech sensor data
np.random.seed(42)
n_normal = 5000
n_anomaly = 250 # Anomalies are rare: 5% of data
# Normal operating conditions
normal = pd.DataFrame({
    'vibration_rms': np.random.normal(2.5, 0.4, n_normal),
    'bearing_temp_c': np.random.normal(55, 5, n_normal),
    'rotor_speed_rpm': np.random.normal(1800, 100, n_normal),
    'power_output_kw': np.random.normal(2200, 200, n_normal),
    'wind_speed_ms': np.random.normal(8, 2, n_normal),
    'pitch_angle_deg': np.random.normal(5, 1.5, n_normal),
    'gearbox_oil_temp_c': np.random.normal(62, 4, n_normal),
    'generator_temp_c': np.random.normal(70, 6, n_normal),
})
# Anomalous readings (pre-failure conditions)
anomaly = pd.DataFrame({
    'vibration_rms': np.random.normal(4.5, 0.8, n_anomaly),    # Higher vibration
    'bearing_temp_c': np.random.normal(80, 8, n_anomaly),      # Overheating
    'rotor_speed_rpm': np.random.normal(1600, 200, n_anomaly), # Irregular speed
    'power_output_kw': np.random.normal(1500, 400, n_anomaly), # Reduced output
    'wind_speed_ms': np.random.normal(8, 2, n_anomaly),        # Wind is same
    'pitch_angle_deg': np.random.normal(8, 3, n_anomaly),      # Over-pitched
    'gearbox_oil_temp_c': np.random.normal(78, 6, n_anomaly),  # Hot gearbox
    'generator_temp_c': np.random.normal(90, 10, n_anomaly),   # Hot generator
})
X = pd.concat([normal, anomaly], ignore_index=True)
y = np.array([0] * n_normal + [1] * n_anomaly)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
# Approach 1: KNN distance-based anomaly scoring
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
nn = NearestNeighbors(n_neighbors=10, metric='euclidean')
nn.fit(X_train_scaled[y_train == 0]) # Fit only on normal data
# Anomaly score = mean distance to 10 nearest normal neighbors
distances, _ = nn.kneighbors(X_test_scaled)
anomaly_scores = distances.mean(axis=1)
# Approach 2: KNN classifier (supervised)
knn_clf = KNeighborsClassifier(n_neighbors=7, weights='distance')
knn_clf.fit(X_train_scaled, y_train)
y_pred_clf = knn_clf.predict(X_test_scaled)
y_prob_clf = knn_clf.predict_proba(X_test_scaled)[:, 1]
# Evaluate both approaches
auc_distance = roc_auc_score(y_test, anomaly_scores)
auc_classifier = roc_auc_score(y_test, y_prob_clf)
ap_distance = average_precision_score(y_test, anomaly_scores)
ap_classifier = average_precision_score(y_test, y_prob_clf)
print("TurbineTech Anomaly Detection — KNN Approaches")
print(f"{'Approach':<30}{'AUC-ROC':<12}{'Avg Precision':<15}")
print("-" * 57)
print(f"{'KNN Distance (unsupervised)':<30}{auc_distance:<12.3f}{ap_distance:<15.3f}")
print(f"{'KNN Classifier (supervised)':<30}{auc_classifier:<12.3f}{ap_classifier:<15.3f}")
print("\n--- KNN Classifier Report ---")
print(classification_report(y_test, y_pred_clf, target_names=['Normal', 'Anomaly']))
TurbineTech Anomaly Detection — KNN Approaches
Approach AUC-ROC Avg Precision
---------------------------------------------------------
KNN Distance (unsupervised) 0.998 0.974
KNN Classifier (supervised) 0.997 0.962
--- KNN Classifier Report ---
precision recall f1-score support
Normal 1.00 1.00 1.00 1250
Anomaly 0.98 0.97 0.97 63
accuracy 1.00 1313
macro avg 0.99 0.98 0.99 1313
weighted avg 1.00 1.00 1.00 1313
Both approaches perform excellently on this data. The distance-based approach has a practical advantage: it does not require labeled anomaly data for training. You only need examples of normal operation.
Production Tip --- KNN distance-based anomaly detection is a strong baseline for manufacturing and IoT applications because anomaly labels are often scarce or unreliable. Train the nearest-neighbor index on known-good data only, then flag new observations whose average distance to the K nearest normal observations exceeds a threshold. At TurbineTech, this approach runs on each turbine's 10-minute summary readings, producing an anomaly score that maintenance crews use for prioritization.
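The thresholding step described in the tip can be sketched as follows. The data, the 99th-percentile cutoff, and the batch sizes are illustrative assumptions, not TurbineTech's actual configuration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_normal = rng.normal(0, 1, size=(2000, 8))         # known-good readings only
X_new = np.vstack([rng.normal(0, 1, size=(50, 8)),  # 50 normal-looking readings
                   rng.normal(4, 1, size=(5, 8))])  # 5 clearly off-distribution

scaler = StandardScaler().fit(X_normal)
nn = NearestNeighbors(n_neighbors=10).fit(scaler.transform(X_normal))

# Calibrate the cutoff on the normal data itself. Each training point's
# nearest "neighbor" is itself at distance 0, so drop the first column.
d_train, _ = nn.kneighbors(scaler.transform(X_normal))
train_scores = d_train[:, 1:].mean(axis=1)
threshold = np.percentile(train_scores, 99)  # flag roughly the top 1%

# Score new readings: mean distance to the 10 nearest normal points
d_new, _ = nn.kneighbors(scaler.transform(X_new))
new_scores = d_new.mean(axis=1)
flags = new_scores > threshold
print(f"flagged {flags.sum()} of {len(flags)} new readings")
```

The percentile controls the false-alarm rate on normal data; in practice it would be tuned against maintenance capacity.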
15.14 When Simple Models Win
After gradient boosting (Chapter 14) and SVMs (Chapter 12), it is tempting to dismiss NB and KNN as educational relics. This would be a mistake. Here is when simple models earn their keep:
Naive Bayes Wins When:
- Training data is small. NB has very few parameters (means and variances for Gaussian; counts for Multinomial). With 50 labeled examples, NB often outperforms logistic regression because it cannot overfit.
- Features are high-dimensional and sparse. Text classification with thousands of vocabulary features is NB's sweet spot. The independence assumption is less damaging when no single feature dominates.
- Training speed matters. If you retrain hourly, daily, or on every new batch of data, microsecond training time is a genuine advantage over algorithms that take minutes.
- Incremental learning is required. GaussianNB and MultinomialNB in scikit-learn support partial_fit(), which updates the model with new data without retraining from scratch. This is essential for streaming data.
- Interpretability is critical. NB provides class-conditional probabilities for each feature, which are directly interpretable. "This email is spam because it contains 'free' (P=0.82 in spam vs. 0.05 in ham) and 'click' (P=0.71 in spam vs. 0.03 in ham)."
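The incremental-learning point is easy to demonstrate. A sketch, using a synthetic dataset and a batch size of 500 chosen purely for illustration:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=3000, n_features=10, n_informative=6,
                           random_state=42)

nb = GaussianNB()
classes = np.unique(y)  # must be declared on the first partial_fit call

# Simulate a stream: feed the data in batches of 500
for start in range(0, len(X), 500):
    batch_X, batch_y = X[start:start + 500], y[start:start + 500]
    nb.partial_fit(batch_X, batch_y, classes=classes)

# The streamed model should agree closely with one fit on all data at once,
# since GaussianNB updates its per-class means and variances incrementally.
nb_full = GaussianNB().fit(X, y)
agreement = (nb.predict(X) == nb_full.predict(X)).mean()
print(f"prediction agreement with batch-fit model: {agreement:.3f}")
```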
KNN Wins When:
- The decision boundary is highly nonlinear. KNN adapts to any boundary shape because it makes no parametric assumptions.
- New data arrives continuously and the model needs to "update" without retraining. Adding a new training point to KNN is O(1) --- just add it to the store.
- You need a baseline fast. KNN requires no preprocessing decisions beyond scaling. It is the fastest path from raw data to a working classifier.
- Local patterns matter more than global patterns. KNN is purely local --- the prediction for a point depends only on its neighborhood. If different regions of feature space follow different rules, KNN handles this naturally.
- Anomaly detection on tabular data. KNN-based anomaly scoring (distance to K-th neighbor) is a strong baseline that requires no assumptions about the anomaly distribution.
import numpy as np
import time
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
# Compare all models on a small dataset (NB's territory)
np.random.seed(42)
X_small, y_small = make_classification(
    n_samples=200, n_features=10, n_informative=6,
    n_redundant=2, flip_y=0.1, random_state=42
)
models = {
    'GaussianNB': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', GaussianNB())
    ]),
    'KNN (k=5)': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', KNeighborsClassifier(n_neighbors=5))
    ]),
    'LogReg': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(random_state=42, max_iter=1000))
    ]),
    'RandomForest': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
    ]),
    'GradientBoosting': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', GradientBoostingClassifier(n_estimators=100, random_state=42))
    ]),
}
print(f"Performance on SMALL dataset (n=200, d=10)")
print(f"{'Model':<22}{'CV Accuracy':<15}{'Train Time (ms)':<18}")
print("-" * 55)
for name, pipe in models.items():
    t0 = time.perf_counter()
    scores = cross_val_score(pipe, X_small, y_small, cv=5)
    elapsed = (time.perf_counter() - t0) * 1000
    print(f"{name:<22}{scores.mean():<15.3f}{elapsed:<18.1f}")
Performance on SMALL dataset (n=200, d=10)
Model CV Accuracy Train Time (ms)
-------------------------------------------------------
GaussianNB 0.830 8.2
KNN (k=5) 0.815 6.4
LogReg 0.835 12.1
RandomForest 0.810 134.6
GradientBoosting 0.805 231.8
On 200 samples, GaussianNB matches logistic regression and beats both ensemble methods. The complex models overfit on this small dataset. GradientBoosting, which dominated Chapter 14 on large datasets, finishes last.
Tradeoffs --- Model complexity is not free. Every additional parameter is an opportunity to overfit. On small datasets, the simplest model that captures the signal wins. On large datasets with complex patterns, the most expressive model that can be regularized effectively wins. Knowing when to deploy a simple model versus a complex one is the difference between a junior and senior data scientist.
15.15 Speeding Up KNN: Practical Considerations
KNN's prediction cost is O(n * d) for each query --- it must compute the distance to every training point. This is fine for 10,000 rows and unacceptable for 10 million.
Ball Trees and KD-Trees
Scikit-learn uses spatial data structures to accelerate neighbor search:
- KD-Trees partition the feature space along axis-aligned splits. Effective when d < 20.
- Ball Trees partition data into nested hyperspheres. More robust in moderate dimensions.
import numpy as np
import time
from sklearn.neighbors import KNeighborsClassifier
# Compare KNN algorithms at different scales
np.random.seed(42)
sizes = [1000, 5000, 10000, 50000]
print(f"{'n':<10}{'brute (ms)':<14}{'kd_tree (ms)':<16}{'ball_tree (ms)':<16}")
print("-" * 56)
for n in sizes:
    X_speed = np.random.randn(n, 10)
    y_speed = (X_speed[:, 0] > 0).astype(int)
    X_query = np.random.randn(100, 10)
    times = {}
    for algo in ['brute', 'kd_tree', 'ball_tree']:
        knn = KNeighborsClassifier(n_neighbors=5, algorithm=algo)
        knn.fit(X_speed, y_speed)
        t0 = time.perf_counter()
        knn.predict(X_query)
        times[algo] = (time.perf_counter() - t0) * 1000
    print(f"{n:<10}{times['brute']:<14.2f}{times['kd_tree']:<16.2f}{times['ball_tree']:<16.2f}")
n brute (ms) kd_tree (ms) ball_tree (ms)
--------------------------------------------------------
1000 0.78 0.54 0.62
5000 2.41 0.89 1.04
10000 4.56 1.23 1.41
50000 21.84 2.87 3.42
KD-trees provide a significant speedup at larger scales. At 50,000 training points, kd_tree is 7.6x faster than brute force. For even larger datasets (millions of rows), consider approximate nearest neighbor libraries like FAISS (Facebook), Annoy (Spotify), or ScaNN (Google).
Production Tip --- For production KNN at scale, do not use scikit-learn. Use FAISS for GPU-accelerated approximate nearest neighbors or Annoy for memory-efficient, read-only indices. These libraries sacrifice exact neighbor search for 10-100x speedups, typically while still recovering the true nearest neighbors more than 95% of the time.
15.16 The StreamFlow Exercise: Naive Bayes on Churn
The progressive project in this chapter is an exercise, not a milestone, but it illustrates an important point: Naive Bayes is a surprisingly competitive baseline for churn prediction.
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
# StreamFlow churn features (simplified from Chapter 11)
np.random.seed(42)
n = 5000
streamflow = pd.DataFrame({
    'monthly_hours': np.random.exponential(15, n),
    'sessions_30d': np.random.poisson(12, n),
    'months_active': np.random.randint(1, 48, n),
    'support_tickets_90d': np.random.poisson(0.5, n),
    'payment_failures_6m': np.random.poisson(0.2, n),
    'content_completion': np.random.beta(3, 2, n),
    'devices_used': np.random.randint(1, 5, n),
    'hours_change_pct': np.random.normal(0, 25, n),
})
# Churn probability based on features
churn_logit = (
    -2.0
    - 0.08 * streamflow['monthly_hours']
    - 0.05 * streamflow['sessions_30d']
    + 0.04 * streamflow['support_tickets_90d']
    + 0.6 * streamflow['payment_failures_6m']
    - 0.02 * streamflow['months_active']
    - 0.03 * streamflow['hours_change_pct']
)
churn_prob = 1 / (1 + np.exp(-churn_logit))
y_churn = (np.random.random(n) < churn_prob).astype(int)
X_churn = streamflow.values
models = {
    'GaussianNB': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', GaussianNB())
    ]),
    'LogisticRegression': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(random_state=42, max_iter=1000))
    ]),
    'GradientBoosting': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', GradientBoostingClassifier(
            n_estimators=200, max_depth=4, random_state=42
        ))
    ]),
}
print("StreamFlow Churn Prediction — Model Comparison")
print(f"{'Model':<22}{'CV AUC':<12}{'CV Accuracy':<15}")
print("-" * 49)
for name, pipe in models.items():
    auc_scores = cross_val_score(pipe, X_churn, y_churn, cv=5, scoring='roc_auc')
    acc_scores = cross_val_score(pipe, X_churn, y_churn, cv=5, scoring='accuracy')
    print(f"{name:<22}{auc_scores.mean():<12.3f}{acc_scores.mean():<15.3f}")
StreamFlow Churn Prediction — Model Comparison
Model CV AUC CV Accuracy
-------------------------------------------------
GaussianNB 0.766 0.778
LogisticRegression 0.783 0.781
GradientBoosting 0.790 0.784
GaussianNB trails GradientBoosting by only 0.024 AUC on this churn problem. In a real project, this would be the first model you build --- in under a minute --- to establish whether the features carry signal at all. If NB hits 0.766 AUC, you have established a floor, not a ceiling, and the investment in more complex models is worth making.
15.17 When to Reach for NB or KNN: A Decision Framework
| Criterion | Naive Bayes | KNN |
|---|---|---|
| Training data size | Excellent with small data (< 1000) | Needs moderate data (> 500) |
| Feature dimensionality | Handles high dimensions well | Degrades above ~20 features |
| Feature types | Variants for continuous, counts, binary | Requires numerical (scale first) |
| Training speed | Microseconds | Microseconds (just stores data) |
| Prediction speed | Microseconds | O(n*d) per query (slow at scale) |
| Nonlinear boundaries | No (linear in log-probability space) | Yes (arbitrarily complex) |
| Interpretability | High (class-conditional probabilities) | Medium (show neighbors) |
| Incremental learning | Yes (partial_fit) | Yes (add points) |
| Probability calibration | Poor (probabilities are unreliable) | Fair (with enough neighbors) |
| Best use case | Text classification, fast baselines | Anomaly detection, local patterns |
Common Mistake --- Do not use Naive Bayes probabilities as actual probabilities. NB is a good classifier but a poor probability estimator because the independence assumption distorts probability magnitudes. If you need calibrated probabilities from NB, apply CalibratedClassifierCV from scikit-learn as a post-processing step.
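A minimal sketch of that calibration step, scored with the Brier score (mean squared error of the predicted probabilities; lower is better). The synthetic dataset and method='sigmoid' are illustrative choices; isotonic regression is the other available method:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=3000, n_features=10, n_informative=6,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

raw_nb = GaussianNB().fit(X_tr, y_tr)
# Wrap NB in a calibrator: fits NB on CV folds, then learns a sigmoid
# mapping from NB's raw scores to calibrated probabilities.
cal_nb = CalibratedClassifierCV(GaussianNB(), method='sigmoid', cv=5)
cal_nb.fit(X_tr, y_tr)

brier_raw = brier_score_loss(y_te, raw_nb.predict_proba(X_te)[:, 1])
brier_cal = brier_score_loss(y_te, cal_nb.predict_proba(X_te)[:, 1])
print(f"Brier score, raw NB:        {brier_raw:.3f}")
print(f"Brier score, calibrated NB: {brier_cal:.3f}")
```

Calibration rarely changes which class NB predicts; it changes how much you can trust the probability attached to the prediction.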
Chapter Summary
Naive Bayes and K-Nearest Neighbors sit at opposite ends of the model spectrum. NB makes a strong assumption (conditional independence) and uses it to estimate probabilities with almost no data. KNN makes no assumptions at all and lets the data speak entirely for itself. Both are fast to implement, easy to understand, and --- in the right circumstances --- competitive with algorithms that are orders of magnitude more complex.
The senior data scientist's toolkit includes gradient boosting and neural networks. It also includes MultinomialNB(alpha=0.5) in an eleven-line pipeline and KNeighborsClassifier(n_neighbors=7) with a StandardScaler. Knowing when to reach for the simple tool instead of the complex one is not a sign of laziness. It is a sign of experience.
Next chapter: Chapter 16 --- Model Evaluation Deep Dive, where we move from "which model is best?" to "how do we know which model is best?" --- the metrics, curves, and evaluation strategies that separate rigorous analysis from wishful thinking.