In This Chapter
- Simple Models That Still Earn Their Keep
- The Case for Simple Models
- PART I: NAIVE BAYES
- 15.1 Bayes' Theorem: A Two-Minute Review
- 15.2 The Naive Independence Assumption
- 15.3 Gaussian Naive Bayes
- 15.4 Multinomial Naive Bayes
- 15.5 Bernoulli Naive Bayes
- 15.6 Naive Bayes in Practice: The ShopSmart Review Classifier
- 15.7 Why Does Naive Bayes Work When the Assumption Is Wrong?
- PART II: K-NEAREST NEIGHBORS
- 15.8 The Simplest Possible Classifier
- 15.9 Distance Metrics: How "Close" Is Close?
- 15.10 Choosing K
- 15.11 KNN for Regression
- 15.12 The Curse of Dimensionality
- 15.13 TurbineTech: KNN for Anomaly Detection Baseline
- 15.14 When Simple Models Win
- 15.15 Speeding Up KNN: Practical Considerations
- 15.16 The StreamFlow Exercise: Naive Bayes on Churn
- 15.17 When to Reach for NB or KNN: A Decision Framework
- Chapter Summary
Chapter 15: Naive Bayes and Nearest Neighbors
Simple Models That Still Earn Their Keep
Learning Objectives
By the end of this chapter, you will be able to:
- Derive and apply Naive Bayes for classification (Gaussian, Multinomial, Bernoulli)
- Explain the "naive" independence assumption and why it works despite being wrong
- Implement KNN for classification and regression
- Understand the curse of dimensionality's impact on KNN
- Recognize when simple models outperform complex ones
The Case for Simple Models
War Story --- In 2019, a team at a Fortune 500 retailer spent four months building a gradient-boosted ensemble for email spam filtering. The model had 340 features, a tuned pipeline with five preprocessing stages, and an AUC of 0.987. Then someone on the team, mostly as a joke, trained a Multinomial Naive Bayes on TF-IDF features in eleven lines of code. AUC: 0.983. Training time: 0.4 seconds instead of 12 minutes. Inference time: 0.02 milliseconds per email instead of 1.8 milliseconds. The Naive Bayes model shipped to production. The gradient boosting model did not.
After four chapters of increasingly powerful algorithms --- linear models, SVMs, trees, gradient boosting --- this chapter is going to feel like a step backward. Naive Bayes assumes your features are conditionally independent (they are not). K-Nearest Neighbors has no training phase at all --- it just memorizes the data and does math at prediction time. Neither algorithm has won a Kaggle competition in the last decade.
And yet both remain in active production at companies you use every day. Gmail's spam filter started with Naive Bayes. Content recommendation systems use KNN variants at massive scale. Medical diagnosis tools use Naive Bayes because clinicians need to understand the reasoning. Anomaly detection systems use KNN because it requires zero assumptions about the data distribution.
The reason is the same in every case: sometimes 85% accuracy in 0.1 seconds beats 87% accuracy in 2 hours. And sometimes simple models are not even less accurate --- they are just faster, more interpretable, and less likely to overfit on small datasets.
This chapter covers both algorithms in depth: the theory, the math, the implementation, and --- most importantly --- when to reach for them instead of the heavy machinery.
PART I: NAIVE BAYES
15.1 Bayes' Theorem: A Two-Minute Review
If you covered Chapter 4, this is a refresher. If you skipped it, this is all you need.
Bayes' theorem relates conditional probabilities:
P(C | X) = P(X | C) * P(C) / P(X)
Where:
- P(C | X) is the posterior --- the probability of class C given the observed features X
- P(X | C) is the likelihood --- the probability of observing features X in class C
- P(C) is the prior --- the probability of class C before seeing any features
- P(X) is the evidence --- the probability of observing features X across all classes
For classification, we do not need P(X) because it is the same for all classes. We just need to compare:
P(C = spam | X) vs. P(C = not_spam | X)
which is proportional to:
P(X | C = spam) * P(C = spam) vs. P(X | C = not_spam) * P(C = not_spam)
The class with the higher product wins. That is the entire decision rule.
Math Sidebar --- Formally, the Bayes optimal classifier assigns the class that maximizes the posterior:
C_hat = argmax_c P(C = c) * P(X | C = c)
This is provably the classifier with the lowest possible error rate --- if you know the true distributions. The challenge is estimating P(X | C = c) from data. That is where the "naive" part comes in.
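To make the decision rule concrete, here is a toy numeric check with invented likelihoods and priors (the numbers are for illustration only):

```python
# Hypothetical numbers: likelihoods P(X | C) and priors P(C) for two classes
lik = {'spam': 0.002, 'not_spam': 0.0005}
prior = {'spam': 0.4, 'not_spam': 0.6}

# Unnormalized posterior scores: P(X | C) * P(C)
score = {c: lik[c] * prior[c] for c in lik}
prediction = max(score, key=score.get)
print(prediction)  # spam (0.0008 > 0.0003)

# Dividing by P(X) = sum of scores recovers calibrated posteriors,
# but the argmax is unchanged -- which is why classification skips it
p_x = sum(score.values())
posterior = {c: score[c] / p_x for c in score}
print(posterior)
```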
15.2 The Naive Independence Assumption
Here is the problem: X is a vector of features, X = (x_1, x_2, ..., x_d). To compute P(X | C), you need the joint probability of all features given the class. For d features, this requires estimating a d-dimensional probability distribution. With even moderate d, you do not have enough data.
Naive Bayes solves this with a single, audacious assumption:
All features are conditionally independent given the class.
This means:
P(x_1, x_2, ..., x_d | C) = P(x_1 | C) * P(x_2 | C) * ... * P(x_d | C)
Instead of one d-dimensional distribution, you estimate d one-dimensional distributions. This is tractable even with small datasets.
Common Mistake --- "Conditionally independent given the class" is NOT the same as "independent." Two features can be highly correlated overall but nearly independent within each class. For example, height and weight are correlated in the general population, but within the class "adult male basketball players," the correlation is much weaker. The independence assumption operates within each class, not across the whole dataset.
Is the assumption true? Almost never. In practice, features are correlated even within classes. The remarkable empirical finding is that Naive Bayes often classifies well despite badly violated assumptions. The reason, shown by Domingos and Pazzani (1997), is that the classifier only needs to get the ranking of class probabilities right, not the actual probability values. The independence assumption can distort the probabilities enormously while still putting the correct class on top.
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.datasets import make_classification
# Create a dataset with correlated features (violates NB assumption)
np.random.seed(42)
n = 2000
# Class 0: correlated features
mean_0 = [0, 0, 0]
cov_0 = [[1.0, 0.8, 0.6],
[0.8, 1.0, 0.7],
[0.6, 0.7, 1.0]] # Strong correlations!
# Class 1: correlated features (different means)
mean_1 = [2, 1.5, 1]
cov_1 = [[1.0, 0.7, 0.5],
[0.7, 1.0, 0.6],
[0.5, 0.6, 1.0]]
X_0 = np.random.multivariate_normal(mean_0, cov_0, n // 2)
X_1 = np.random.multivariate_normal(mean_1, cov_1, n // 2)
X = np.vstack([X_0, X_1])
y = np.array([0] * (n // 2) + [1] * (n // 2))
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
# Naive Bayes ignores the correlations entirely
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print("Gaussian Naive Bayes on CORRELATED features")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"\nFeature correlations within class 0 (training data):")
class_0_mask = y_train == 0
corr = np.corrcoef(X_train[class_0_mask].T)
print(f" r(x1, x2) = {corr[0, 1]:.3f}")
print(f" r(x1, x3) = {corr[0, 2]:.3f}")
print(f" r(x2, x3) = {corr[1, 2]:.3f}")
print("\nDespite correlations of 0.6-0.8, NB still classifies well.")
Gaussian Naive Bayes on CORRELATED features
Accuracy: 0.878
Feature correlations within class 0 (training data):
r(x1, x2) = 0.806
r(x1, x3) = 0.599
r(x2, x3) = 0.694
Despite correlations of 0.6-0.8, NB still classifies well.
The correlations are massive --- 0.6 to 0.8 --- and the independence assumption is badly wrong. But 87.8% accuracy is respectable. The classifier does not need accurate probabilities; it needs to rank classes correctly for each observation.
15.3 Gaussian Naive Bayes
When features are continuous, Gaussian NB assumes each feature follows a normal distribution within each class:
P(x_i | C = c) = (1 / sqrt(2 * pi * sigma_c_i^2)) * exp(-(x_i - mu_c_i)^2 / (2 * sigma_c_i^2))
Training is trivial: compute the mean and variance of each feature for each class. That is it. No optimization, no gradient descent, no iterations. The entire "training" phase is computing d * k means and d * k variances, where d is the number of features and k is the number of classes.
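That claim is easy to verify: the whole learner fits in a dozen lines. A minimal from-scratch sketch, working in log space to avoid numerical underflow (all names here are mine, not scikit-learn's):

```python
import numpy as np

def fit_gaussian_nb(X, y):
    """'Training': store per-class priors, feature means, and feature variances."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),       # prior P(C = c)
                     Xc.mean(axis=0),        # mu_c_i for each feature
                     Xc.var(axis=0) + 1e-9)  # sigma_c_i^2 (tiny floor for stability)
    return params

def predict_gaussian_nb(params, X):
    """Argmax over log P(C) + sum_i log P(x_i | C)."""
    classes = list(params)
    scores = []
    for c in classes:
        prior, mu, var = params[c]
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var)
        scores.append(np.log(prior) + log_lik.sum(axis=1))
    return np.array(classes)[np.argmax(scores, axis=0)]

# Sanity check on two well-separated Gaussian blobs
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(100, 2), rng.randn(100, 2) + 3])
y = np.array([0] * 100 + [1] * 100)
params = fit_gaussian_nb(X, y)
print((predict_gaussian_nb(params, X) == y).mean())  # high training accuracy
```

The log transform turns the product of likelihoods into a sum, which matters once you have hundreds of features: a product of hundreds of small probabilities underflows to zero in floating point.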
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
import time
# Gaussian NB on Iris — the "hello world" of classification
iris = load_iris()
X_iris, y_iris = iris.data, iris.target
gnb = GaussianNB()
# Training speed
start = time.perf_counter()
for _ in range(1000):
gnb.fit(X_iris, y_iris)
elapsed = time.perf_counter() - start
scores = cross_val_score(gnb, X_iris, y_iris, cv=5, scoring='accuracy')
print(f"Gaussian NB on Iris")
print(f"CV Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
print(f"Training time (1000 fits): {elapsed:.3f} seconds")
print(f"Time per fit: {elapsed / 1000 * 1e6:.1f} microseconds")
Gaussian NB on Iris
CV Accuracy: 0.953 +/- 0.033
Training time (1000 fits): 0.187 seconds
Time per fit: 187.0 microseconds
187 microseconds per fit. Not milliseconds --- microseconds. You can retrain this model a thousand times in the time it takes a gradient boosting model to train once.
Production Tip --- Gaussian NB shines as a real-time classifier when the model needs frequent retraining. If your feature distributions shift hourly (ad click prediction, trending content detection), retraining in microseconds means you can retrain on every batch of new data instead of scheduling a heavyweight training job.
When Gaussian NB Fails
Gaussian NB assumes features are normally distributed within each class. When they are not --- heavily skewed features, multimodal distributions, features with hard boundaries --- the model can break down.
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PowerTransformer
# Skewed features — GNB struggles
np.random.seed(42)
n = 1500
X_skew = np.column_stack([
np.random.exponential(2, n), # Heavily right-skewed
np.random.exponential(1.5, n), # Heavily right-skewed
np.random.lognormal(0, 1, n), # Log-normal
])
y_skew = (X_skew[:, 0] + 0.5 * X_skew[:, 1] > 4).astype(int)
# Raw features
gnb_raw = cross_val_score(GaussianNB(), X_skew, y_skew, cv=5)
print(f"GNB on skewed features: {gnb_raw.mean():.3f}")
# Power-transformed features (closer to normal)
pt = PowerTransformer(method='yeo-johnson')
X_transformed = pt.fit_transform(X_skew)
gnb_transformed = cross_val_score(GaussianNB(), X_transformed, y_skew, cv=5)
print(f"GNB on transformed features: {gnb_transformed.mean():.3f}")
print(f"Improvement: +{(gnb_transformed.mean() - gnb_raw.mean()):.3f}")
GNB on skewed features: 0.841
GNB on transformed features: 0.873
Improvement: +0.032
Try It --- Apply PowerTransformer to your continuous features before Gaussian NB. If the features are far from normal, this single preprocessing step can recover 2-5% accuracy.
15.4 Multinomial Naive Bayes
Multinomial NB is designed for count data --- most commonly, word counts in text classification. Instead of a Gaussian distribution, it models each feature as a multinomial:
P(x_i | C = c) = (N_c_i + alpha) / (N_c + alpha * d)
Where:
- N_c_i is the total count of feature i across all documents in class c
- N_c is the total count of all features in class c
- alpha is the Laplace smoothing parameter (typically 1.0)
- d is the number of features
Laplace smoothing prevents zero probabilities. Without it, if a word never appears in the training data for a class, P(word | class) = 0, and the entire product collapses to zero regardless of the other features. Adding alpha to every count ensures no probability is ever exactly zero.
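A toy calculation with invented counts shows the failure mode that smoothing prevents:

```python
# Suppose a word appears 0 times among 10,000 word tokens in one class,
# with a vocabulary of d = 1,000 words (all counts invented)
N_ci, N_c, d = 0, 10_000, 1_000

# Without smoothing: exactly zero -- and since class scores are products
# of per-word probabilities, a single zero wipes out the whole class
p_unsmoothed = N_ci / N_c
print(p_unsmoothed)  # 0.0

# With alpha = 1.0: a small but strictly positive probability
alpha = 1.0
p_smoothed = (N_ci + alpha) / (N_c + alpha * d)
print(p_smoothed)  # ~9.1e-05
```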
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
# Simulated email classification
np.random.seed(42)
emails = [
"free money click now limited offer",
"congratulations you won a prize claim now",
"buy cheap discount pharmacy online",
"earn money fast work from home guaranteed",
"limited time offer free gift card",
"cheap pills discount medication order now",
"free trial no credit card required",
"double your income guaranteed results",
"hey can we meet for lunch tomorrow",
"the project deadline is next friday",
"please review the attached quarterly report",
"meeting rescheduled to thursday afternoon",
"great presentation yesterday nice work",
"can you send me the updated spreadsheet",
"reminder team standup at 10am tomorrow",
"the client approved the new design",
"free money earn cash online now click here",
"discount offer limited time buy now save",
"congratulations winner claim your free prize",
"urgent response required lottery notification",
"lunch plans for wednesday anyone interested",
"updated the documentation please review",
"quarterly review meeting agenda attached",
"thanks for fixing that bug so quickly",
]
labels = [1,1,1,1,1,1,1,1, 0,0,0,0,0,0,0,0, 1,1,1,1, 0,0,0,0]
# Multinomial NB with bag-of-words
pipe_bow = Pipeline([
('vec', CountVectorizer()),
('nb', MultinomialNB(alpha=1.0))
])
# Multinomial NB with TF-IDF
pipe_tfidf = Pipeline([
('vec', TfidfVectorizer()),
('nb', MultinomialNB(alpha=1.0))
])
# On this tiny dataset, leave-one-out is appropriate
from sklearn.model_selection import LeaveOneOut
loo = LeaveOneOut()
bow_scores = cross_val_score(pipe_bow, emails, labels, cv=loo)
tfidf_scores = cross_val_score(pipe_tfidf, emails, labels, cv=loo)
print(f"Multinomial NB (Bag-of-Words): {bow_scores.mean():.3f}")
print(f"Multinomial NB (TF-IDF): {tfidf_scores.mean():.3f}")
Multinomial NB (Bag-of-Words): 0.958
Multinomial NB (TF-IDF): 0.917
Even on 24 examples, Multinomial NB separates spam from legitimate email with high accuracy. This is not a toy result --- it reflects the real reason Naive Bayes became the default spam filter: the words "free," "click," "offer," and "guaranteed" are strong class indicators, and NB exploits them even with very little training data.
Production Tip --- For text classification, Multinomial NB with TF-IDF is the baseline you should always try first. It trains in seconds, predicts in microseconds, handles high-dimensional sparse data naturally, and is often within a few percent of far more complex models. If Multinomial NB does not beat 70% accuracy on your text problem, the issue is your features, not the algorithm.
The Laplace Smoothing Parameter
The smoothing parameter alpha controls how much probability mass is assigned to unseen features. Higher alpha means more smoothing --- more uniform distributions. Lower alpha trusts the observed counts more aggressively.
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
# Alpha sensitivity on a larger synthetic text dataset
np.random.seed(42)
spam_words = ['free', 'money', 'click', 'offer', 'win', 'prize',
'buy', 'discount', 'cheap', 'guaranteed', 'urgent',
'limited', 'cash', 'earn', 'income']
ham_words = ['meeting', 'project', 'deadline', 'report', 'review',
'team', 'schedule', 'update', 'attached', 'thanks',
'please', 'discuss', 'agenda', 'budget', 'client']
neutral_words = ['the', 'is', 'to', 'and', 'a', 'for', 'in', 'at',
'this', 'that', 'with', 'you', 'we', 'it', 'on']
def generate_email(is_spam, rng):
n_words = rng.randint(8, 20)
words = []
for _ in range(n_words):
r = rng.random()
if is_spam:
if r < 0.45:
words.append(rng.choice(spam_words))
elif r < 0.75:
words.append(rng.choice(neutral_words))
else:
words.append(rng.choice(ham_words))
else:
if r < 0.45:
words.append(rng.choice(ham_words))
elif r < 0.75:
words.append(rng.choice(neutral_words))
else:
words.append(rng.choice(spam_words))
return ' '.join(words)
rng = np.random.RandomState(42)
n_emails = 500
emails_large = [generate_email(i < n_emails // 2, rng) for i in range(n_emails)]
labels_large = [1] * (n_emails // 2) + [0] * (n_emails // 2)
alphas = [0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
print(f"{'Alpha':<10}{'CV Accuracy':<15}")
print("-" * 25)
for a in alphas:
pipe = Pipeline([
('vec', TfidfVectorizer()),
('nb', MultinomialNB(alpha=a))
])
scores = cross_val_score(pipe, emails_large, labels_large, cv=5)
print(f"{a:<10}{scores.mean():.3f}")
Alpha CV Accuracy
-------------------------
0.001 0.858
0.01 0.862
0.1 0.872
0.5 0.878
1.0 0.876
2.0 0.870
5.0 0.854
10.0 0.832
The sweet spot is typically between 0.1 and 2.0. Too little smoothing and the model overfits to rare words; too much and it drowns out the signal. Cross-validate alpha --- do not accept the default of 1.0 blindly.
15.5 Bernoulli Naive Bayes
Bernoulli NB models binary features: each feature is either present (1) or absent (0). Unlike Multinomial NB, which cares about how many times a word appears, Bernoulli NB cares about whether a word appears.
This distinction matters. For short documents (tweets, product titles, search queries), word presence is often more informative than word count. For long documents, counts carry more signal.
from sklearn.naive_bayes import BernoulliNB, MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
# Short text classification: Bernoulli vs. Multinomial
short_texts = [
"love this product amazing",
"terrible waste of money",
"great quality fast shipping",
"broken arrived damaged return",
"excellent perfect exactly needed",
"horrible never buying again",
"fantastic value recommend highly",
"awful poor quality disappointed",
"beautiful works perfectly happy",
"defective junk total garbage",
] * 20 # Repeat for CV stability
short_labels = [1, 0, 1, 0, 1, 0, 1, 0, 1, 0] * 20
# Bernoulli NB (binary features)
pipe_bernoulli = Pipeline([
('vec', CountVectorizer(binary=True)),
('nb', BernoulliNB(alpha=1.0))
])
# Multinomial NB (count features)
pipe_multinomial = Pipeline([
('vec', CountVectorizer(binary=False)),
('nb', MultinomialNB(alpha=1.0))
])
bern_scores = cross_val_score(pipe_bernoulli, short_texts, short_labels, cv=5)
mult_scores = cross_val_score(pipe_multinomial, short_texts, short_labels, cv=5)
print(f"Bernoulli NB on short text: {bern_scores.mean():.3f}")
print(f"Multinomial NB on short text: {mult_scores.mean():.3f}")
Bernoulli NB on short text: 1.000
Multinomial NB on short text: 1.000
On very short, distinctive text, both variants perform identically. The difference emerges on noisier, longer documents where word repetition carries meaning.
Common Mistake --- Do not use Bernoulli NB on count data without binarizing first. If you pass raw counts to BernoulliNB, scikit-learn will binarize them internally (anything > 0 becomes 1), which silently discards count information. Either use CountVectorizer(binary=True) explicitly, or switch to MultinomialNB for count data.
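You can verify the internal binarization yourself (synthetic counts, my own variable names): raw counts fed to BernoulliNB and pre-binarized features fed to BernoulliNB(binarize=None) yield identical predictions:

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.RandomState(0)
counts = rng.poisson(2, size=(200, 20))  # synthetic count matrix
y = rng.randint(0, 2, 200)

# Raw counts: the default binarize=0.0 turns every count > 0 into 1 internally
nb_raw = BernoulliNB().fit(counts, y)

# Explicit binarization up front produces the same model
nb_bin = BernoulliNB(binarize=None).fit((counts > 0).astype(int), y)

print(np.array_equal(nb_raw.predict(counts),
                     nb_bin.predict((counts > 0).astype(int))))  # True
```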
15.6 Naive Bayes in Practice: The ShopSmart Review Classifier
ShopSmart, the e-commerce marketplace with 14 million monthly users, needs to classify product reviews for two purposes: detecting fake/spam reviews (a binary classification problem) and categorizing review sentiment (positive, neutral, negative --- a multiclass problem).
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB, BernoulliNB, ComplementNB
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
import time
# Simulated ShopSmart review dataset
np.random.seed(42)
n_reviews = 5000
positive_phrases = [
"love this product", "excellent quality", "highly recommend",
"great value", "works perfectly", "fast shipping",
"amazing product", "best purchase", "very satisfied",
"exceeded expectations", "five stars", "absolutely love",
"perfect fit", "well made", "worth every penny"
]
negative_phrases = [
"terrible quality", "waste of money", "do not buy",
"broke after week", "poor quality", "very disappointed",
"worst purchase", "falling apart", "does not work",
"returned immediately", "horrible product", "complete junk",
"cheaply made", "not as described", "total scam"
]
neutral_phrases = [
"works as expected", "decent product", "okay for price",
"nothing special", "average quality", "does the job",
"meets expectations", "fair enough", "standard product",
"acceptable quality", "its fine", "no complaints",
"ordinary product", "typical quality", "serves its purpose"
]
def generate_review(sentiment, rng):
if sentiment == 'positive':
phrases = positive_phrases
other = neutral_phrases
elif sentiment == 'negative':
phrases = negative_phrases
other = neutral_phrases
else:
phrases = neutral_phrases
other = positive_phrases
n = rng.randint(2, 5)
words = [rng.choice(phrases) for _ in range(n)]
words += [rng.choice(other) for _ in range(rng.randint(0, 2))]
rng.shuffle(words)
return ' '.join(words)
rng = np.random.RandomState(42)
sentiments = rng.choice(
['positive', 'negative', 'neutral'], n_reviews, p=[0.5, 0.3, 0.2]
)
reviews = [generate_review(s, rng) for s in sentiments]
X_train, X_test, y_train, y_test = train_test_split(
reviews, sentiments, test_size=0.25, random_state=42, stratify=sentiments
)
# Compare NB variants and logistic regression
models = {
'MultinomialNB': Pipeline([
('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=5000)),
('clf', MultinomialNB(alpha=0.5))
]),
'ComplementNB': Pipeline([
('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=5000)),
('clf', ComplementNB(alpha=0.5))
]),
'BernoulliNB': Pipeline([
('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=5000,
use_idf=False, binary=True)),
('clf', BernoulliNB(alpha=0.5))
]),
'LogisticRegression': Pipeline([
('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=5000)),
('clf', LogisticRegression(max_iter=1000, random_state=42))
]),
}
print(f"{'Model':<22}{'Accuracy':<12}{'Train (ms)':<12}{'Predict (ms)':<14}")
print("-" * 60)
for name, pipe in models.items():
t0 = time.perf_counter()
pipe.fit(X_train, y_train)
train_time = (time.perf_counter() - t0) * 1000
t0 = time.perf_counter()
y_pred = pipe.predict(X_test)
pred_time = (time.perf_counter() - t0) * 1000
acc = accuracy_score(y_test, y_pred)
print(f"{name:<22}{acc:<12.3f}{train_time:<12.1f}{pred_time:<14.1f}")
# Detailed report for the best NB variant
print("\n--- ComplementNB Classification Report ---")
best_pipe = models['ComplementNB']
y_pred_best = best_pipe.predict(X_test)
print(classification_report(y_test, y_pred_best))
Model Accuracy Train (ms) Predict (ms)
------------------------------------------------------------
MultinomialNB 0.910 18.2 2.1
ComplementNB 0.918 17.8 2.3
BernoulliNB 0.862 21.4 3.7
LogisticRegression 0.934 142.6 1.9
--- ComplementNB Classification Report ---
precision recall f1-score support
negative 0.91 0.93 0.92 375
neutral 0.85 0.82 0.84 250
positive 0.95 0.95 0.95 625
accuracy 0.92 1250
macro avg 0.90 0.90 0.90 1250
weighted avg 0.92 0.92 0.92 1250
Logistic regression wins by 1.6%, but trains 8x slower. ComplementNB --- a variant designed to handle imbalanced classes by computing parameters from the complement of each class --- performs best among the NB variants. On ShopSmart's scale (millions of reviews, retrained daily), the NB speed advantage is meaningful.
Production Tip --- ComplementNB is often a better default than MultinomialNB for text classification, especially when classes are imbalanced. It was designed by Rennie et al. (2003) specifically to address the systematic bias that Multinomial NB exhibits toward the majority class. Try it first.
15.7 Why Does Naive Bayes Work When the Assumption Is Wrong?
This is one of the most frequently asked questions in machine learning, and the answer is more nuanced than "it just does."
Reason 1: Classification only requires ranking, not calibration. The Naive Bayes classifier assigns the class with the highest P(C | X). It does not need the actual probability to be correct --- it only needs the correct class to rank first. The independence assumption can wildly distort the probability values while preserving the ranking.
Reason 2: Feature dependencies often cancel out. In practice, some features are positively correlated (both push the predicted probability in the same direction), and others are negatively correlated. When these effects roughly cancel, the final ranking is unaffected. Zhang (2004) showed that Naive Bayes is optimal when dependencies distribute evenly across classes.
Reason 3: High-dimensional feature spaces dilute individual dependencies. In text classification with thousands of features, no single feature dependency dominates the prediction. The combined effect of many weak, slightly dependent features approximates the true posterior well enough for classification.
When it fails: Naive Bayes breaks down when a small number of features have very strong dependencies that are asymmetric across classes. If features x_1 and x_2 are strongly correlated in class A but independent in class B, the independence assumption systematically misestimates P(X | A), and classification accuracy degrades.
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
# Case where NB fails: asymmetric feature dependencies
np.random.seed(42)
n = 2000
# Class 0: x1 and x2 are STRONGLY dependent (x2 tracks x1 almost exactly)
x1_0 = np.random.randn(n // 2)
x2_0 = x1_0 + np.random.randn(n // 2) * 0.1 # Nearly identical to x1
# Class 1: x1 and x2 are independent
x1_1 = np.random.randn(n // 2)
x2_1 = np.random.randn(n // 2) # Independent
X_asym = np.column_stack([
np.concatenate([x1_0, x1_1]),
np.concatenate([x2_0, x2_1])
])
y_asym = np.array([0] * (n // 2) + [1] * (n // 2))
gnb_scores = cross_val_score(GaussianNB(), X_asym, y_asym, cv=5)
lr_scores = cross_val_score(
LogisticRegression(random_state=42), X_asym, y_asym, cv=5
)
print(f"Asymmetric dependencies:")
print(f" Gaussian NB: {gnb_scores.mean():.3f}")
print(f" Logistic Regression: {lr_scores.mean():.3f}")
print(f" NB underperforms by: {(lr_scores.mean() - gnb_scores.mean()):.3f}")
Asymmetric dependencies:
Gaussian NB: 0.694
Logistic Regression: 0.728
NB underperforms by: 0.034
When the dependency structure is genuinely asymmetric across classes, NB loses ground. But notice even in this adversarial case, it is within 3.4% of logistic regression. NB is remarkably robust.
PART II: K-NEAREST NEIGHBORS
15.8 The Simplest Possible Classifier
K-Nearest Neighbors is the ultimate "no assumptions" algorithm. The idea is almost embarrassingly simple:
- Store all the training data.
- When a new point arrives, find the K closest training points.
- The new point gets the majority class of its K neighbors (classification) or the average target of its K neighbors (regression).
There is no training phase. No parameters to optimize. No loss function. No gradient descent. KNN is a lazy learner --- it defers all computation to prediction time. It is also instance-based --- it learns by memorizing examples, not by building an explicit model.
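Those three steps translate almost line-for-line into code. A brute-force sketch (function and variable names are mine):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, X_new, k=5):
    """Brute-force KNN: Euclidean distances, majority vote among the k nearest."""
    preds = []
    for x in X_new:
        dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))  # distance to every stored point
        nearest = np.argsort(dists)[:k]                    # indices of the k closest
        votes = Counter(y_train[nearest])
        preds.append(votes.most_common(1)[0][0])           # majority class wins
    return np.array(preds)

# Sanity check on two separable clusters
rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 4])
y = np.array([0] * 50 + [1] * 50)
print((knn_predict(X, y, X, k=5) == y).mean())  # ~1.0 on well-separated data
```

Note where the work happens: "training" is nothing, while every prediction scans the full training set. That asymmetry is the defining tradeoff of lazy learning.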
from sklearn.neighbors import KNeighborsClassifier
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
import time
# KNN on the moons dataset — a nonlinear problem
X_moons, y_moons = make_moons(n_samples=1000, noise=0.25, random_state=42)
# "Training" is just storing the data
knn = KNeighborsClassifier(n_neighbors=5)
t0 = time.perf_counter()
knn.fit(X_moons, y_moons)
train_time = (time.perf_counter() - t0) * 1000
scores = cross_val_score(knn, X_moons, y_moons, cv=5)
print(f"KNN (k=5) on Moons dataset")
print(f"CV Accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
print(f"Train time: {train_time:.2f} ms (just stores data)")
KNN (k=5) on Moons dataset
CV Accuracy: 0.919 +/- 0.017
Train time: 0.41 ms (just stores data)
91.9% accuracy on a nonlinear decision boundary, with no model fitting at all. KNN adapts to any decision boundary shape because it makes zero assumptions about the data distribution. The boundary is defined entirely by the local structure of the training data.
15.9 Distance Metrics: How "Close" Is Close?
Everything in KNN depends on how you measure distance. The three most common metrics:
Euclidean distance (L2): The straight-line distance. Default in most implementations.
d(a, b) = sqrt(sum((a_i - b_i)^2))
Manhattan distance (L1): The "city block" distance. Sum of absolute differences.
d(a, b) = sum(|a_i - b_i|)
Cosine distance: Measures the angle between vectors, ignoring magnitude. Common for text and high-dimensional data.
d(a, b) = 1 - (a . b) / (||a|| * ||b||)
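The three formulas take only a few lines of NumPy (toy vectors chosen for easy arithmetic):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.0, 3.0])

euclidean = np.sqrt(((a - b) ** 2).sum())  # L2: sqrt(1 + 4 + 0) = 2.236...
manhattan = np.abs(a - b).sum()            # L1: 1 + 2 + 0 = 3.0
cosine = 1 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(euclidean, manhattan, cosine)

# Cosine ignores magnitude: rescaling a vector leaves the distance unchanged
scaled = 1 - (10 * a @ b) / (np.linalg.norm(10 * a) * np.linalg.norm(b))
print(np.isclose(cosine, scaled))  # True
```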
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification
# Generate data with features on different scales
np.random.seed(42)
X_scale, y_scale = make_classification(
n_samples=2000, n_features=10, n_informative=6,
n_redundant=2, flip_y=0.1, random_state=42
)
# Make features wildly different scales (simulating real data)
X_scale[:, 0] *= 1000 # Feature 0: range ~1000
X_scale[:, 1] *= 0.001 # Feature 1: range ~0.001
metrics = ['euclidean', 'manhattan', 'cosine']
print(f"{'Metric':<15}{'No Scaling':<15}{'With Scaling':<15}")
print("-" * 45)
for metric in metrics:
# Without scaling
knn_raw = KNeighborsClassifier(n_neighbors=7, metric=metric)
raw_scores = cross_val_score(knn_raw, X_scale, y_scale, cv=5)
# With scaling
knn_scaled = Pipeline([
('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=7, metric=metric))
])
scaled_scores = cross_val_score(knn_scaled, X_scale, y_scale, cv=5)
print(f"{metric:<15}{raw_scores.mean():<15.3f}{scaled_scores.mean():<15.3f}")
Metric No Scaling With Scaling
---------------------------------------------
euclidean 0.742 0.880
manhattan 0.753 0.882
cosine 0.871 0.875
Common Mistake --- Forgetting to scale features before KNN is one of the most common errors in applied machine learning. With Euclidean or Manhattan distance, a feature with range [0, 10000] will dominate the distance calculation over a feature with range [0, 1], regardless of which feature is more predictive. Always standardize features for KNN. The one exception is cosine distance, which is scale-invariant by construction.
15.10 Choosing K
The choice of K controls the bias-variance tradeoff:
- Small K (e.g., K=1): Low bias, high variance. The decision boundary is jagged and sensitive to individual training points. Good for complex boundaries, prone to overfitting.
- Large K (e.g., K=50): High bias, low variance. The decision boundary is smooth. Can underfit if the true boundary is complex.
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# K selection on a classification problem
np.random.seed(42)
X_k, y_k = make_classification(
n_samples=2000, n_features=8, n_informative=5,
n_redundant=2, flip_y=0.1, random_state=42
)
k_values = [1, 3, 5, 7, 9, 11, 15, 21, 31, 51]
results = []
print(f"{'K':<6}{'CV Accuracy':<15}{'Std':<10}")
print("-" * 31)
for k in k_values:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=k))
    ])
    scores = cross_val_score(pipe, X_k, y_k, cv=5)
    results.append((k, scores.mean(), scores.std()))
    print(f"{k:<6}{scores.mean():<15.3f}{scores.std():<10.3f}")
best_k, best_acc, _ = max(results, key=lambda x: x[1])
print(f"\nBest K: {best_k} with accuracy {best_acc:.3f}")
K CV Accuracy Std
-------------------------------
1 0.857 0.019
3 0.872 0.013
5 0.879 0.017
7 0.882 0.014
9 0.883 0.015
11 0.880 0.017
15 0.876 0.014
21 0.871 0.013
31 0.860 0.016
51 0.846 0.017
Best K: 9 with accuracy 0.883
The pattern is consistent: accuracy rises as K increases from 1 (reducing variance), peaks around K=7-11, then declines as K continues to increase (increasing bias). A common heuristic is K = sqrt(n), but cross-validation is always better.
Production Tip --- Use odd values of K for binary classification to avoid ties. For multiclass problems, weight votes by inverse distance (weights='distance' in scikit-learn) to break ties and improve accuracy.
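As a quick check of the distance-weighting suggestion, here is a sketch comparing uniform and inverse-distance vote weighting; the synthetic dataset parameters are arbitrary choices for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=8, n_informative=5,
                           random_state=0)

scores = {}
for w in ['uniform', 'distance']:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=7, weights=w)),
    ])
    # weights='uniform' counts each neighbor equally;
    # weights='distance' gives closer neighbors more say.
    scores[w] = cross_val_score(pipe, X, y, cv=5).mean()
    print(f"weights={w!r}: CV accuracy {scores[w]:.3f}")
```

With distance weighting, exact ties become essentially impossible, so the odd-K rule matters less.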
15.11 KNN for Regression
KNN regression predicts the average (or weighted average) target value of the K nearest neighbors. No model, no coefficients --- just local averaging.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# KNN regression: predict housing-style prices
np.random.seed(42)
n = 1500
X_reg = np.column_stack([
    np.random.uniform(500, 3500, n),   # Square footage
    np.random.randint(1, 5, n),        # Bedrooms
    np.random.randint(1980, 2024, n),  # Year built
    np.random.uniform(0, 50, n),       # Distance to city center
])
# Price with nonlinear relationships
y_reg = (
    150 * X_reg[:, 0]                   # Sq footage
    + 20000 * X_reg[:, 1]               # Bedrooms
    - 500 * (2024 - X_reg[:, 2])        # Age penalty
    - 1000 * X_reg[:, 3]                # Distance penalty
    + 5000 * np.sin(X_reg[:, 0] / 500)  # Nonlinear effect
    + np.random.normal(0, 30000, n)     # Noise
)
k_values = [3, 5, 9, 15, 25]
print(f"{'K':<6}{'CV R-squared':<15}")
print("-" * 21)
for k in k_values:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsRegressor(n_neighbors=k, weights='distance'))
    ])
    r2_scores = cross_val_score(pipe, X_reg, y_reg, cv=5, scoring='r2')
    print(f"{k:<6}{r2_scores.mean():<15.3f}")
K CV R-squared
---------------------
3 0.908
5 0.920
9 0.924
15 0.919
25 0.908
KNN regression handles nonlinear relationships naturally because it makes no assumptions about functional form. The weights='distance' option gives closer neighbors more influence, which almost always improves results.
15.12 The Curse of Dimensionality
Here is where KNN falls apart. In high dimensions, the concept of "nearest neighbor" breaks down because all points become approximately equidistant.
The intuition: In one dimension, 10 points spread across [0, 1] are reasonably close together. In two dimensions, 10 points spread across a unit square are sparse. In 100 dimensions, 10 points spread across a unit hypercube are so far apart that the "nearest" neighbor is barely closer than the farthest point.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Demonstrate the curse of dimensionality
np.random.seed(42)
n_samples = 1000
dimensions = [2, 5, 10, 20, 50, 100, 200]
n_informative = 5 # Only 5 features actually matter
print(f"{'Dimensions':<14}{'Informative':<14}{'Noise':<10}{'KNN Acc':<10}{'Nearest/Farthest':<18}")
print("-" * 66)
for d in dimensions:
    X_curse, y_curse = make_classification(
        n_samples=n_samples, n_features=d,
        n_informative=min(n_informative, d),
        n_redundant=0,
        n_clusters_per_class=2,
        flip_y=0.05,
        random_state=42
    )
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=7))
    ])
    scores = cross_val_score(pipe, X_curse, y_curse, cv=5)
    # Measure distance concentration
    from sklearn.metrics import pairwise_distances
    X_scaled = StandardScaler().fit_transform(X_curse)
    dists = pairwise_distances(X_scaled[:100])
    np.fill_diagonal(dists, np.inf)
    min_dists = dists.min(axis=1)
    np.fill_diagonal(dists, 0)
    max_dists = dists.max(axis=1)
    ratio = (min_dists / max_dists).mean()
    noise = d - min(n_informative, d)
    print(f"{d:<14}{min(n_informative, d):<14}{noise:<10}{scores.mean():<10.3f}{ratio:<18.3f}")
Dimensions Informative Noise KNN Acc Nearest/Farthest
------------------------------------------------------------------
2 2 0 0.914 0.087
5 5 0 0.937 0.268
10 5 5 0.907 0.424
20 5 15 0.873 0.554
50 5 45 0.819 0.693
100 5 95 0.764 0.766
200 5 195 0.690 0.831
Two findings emerge clearly:
- KNN accuracy degrades steadily as dimensionality increases, from 93.7% with 5 informative features to 69.0% with 200 total dimensions.
- The nearest/farthest distance ratio approaches 1.0, meaning the nearest neighbor is almost as far away as the farthest point. When everything is equidistant, neighbors carry no information.
Math Sidebar --- Beyer et al. (1999) proved that under broad conditions, as dimensionality d approaches infinity, the relative contrast between the farthest and nearest neighbor vanishes:
lim (d -> inf) [dist_max - dist_min] / dist_min = 0
Equivalently, the nearest/farthest distance ratio converges to 1.
This means the concept of "nearest" becomes meaningless. The fix is dimensionality reduction (PCA, feature selection) before applying KNN.
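One version of the fix can be sketched with supervised feature selection in the pipeline. The dataset, n_clusters_per_class=1, and k=10 are illustrative choices for this sketch, not values from the text:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

# 100 dimensions, only 5 informative -- the regime where KNN struggles
X, y = make_classification(n_samples=1000, n_features=100, n_informative=5,
                           n_redundant=0, n_clusters_per_class=1,
                           random_state=42)

knn_raw = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsClassifier(n_neighbors=7)),
])
knn_sel = Pipeline([
    ('scaler', StandardScaler()),
    ('select', SelectKBest(f_classif, k=10)),  # keep the 10 strongest features
    ('knn', KNeighborsClassifier(n_neighbors=7)),
])

acc_raw = cross_val_score(knn_raw, X, y, cv=5).mean()
acc_sel = cross_val_score(knn_sel, X, y, cv=5).mean()
print(f"KNN on 100 raw dims:         {acc_raw:.3f}")
print(f"KNN after feature selection: {acc_sel:.3f}")
```

Because selection happens inside the pipeline, each cross-validation fold picks its features from training data only, avoiding leakage.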
15.13 TurbineTech: KNN for Anomaly Detection Baseline
TurbineTech, the wind turbine manufacturer with 1,200 turbines and 847 sensors per unit, uses KNN as a baseline anomaly detection model. The logic is intuitive: a normal turbine reading should have many near neighbors in the training set of normal readings. An anomalous reading should have distant neighbors.
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors, KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report, roc_auc_score, average_precision_score
)
# Simulated TurbineTech sensor data
np.random.seed(42)
n_normal = 5000
n_anomaly = 250 # Anomalies are rare: 5% of data
# Normal operating conditions
normal = pd.DataFrame({
    'vibration_rms': np.random.normal(2.5, 0.4, n_normal),
    'bearing_temp_c': np.random.normal(55, 5, n_normal),
    'rotor_speed_rpm': np.random.normal(1800, 100, n_normal),
    'power_output_kw': np.random.normal(2200, 200, n_normal),
    'wind_speed_ms': np.random.normal(8, 2, n_normal),
    'pitch_angle_deg': np.random.normal(5, 1.5, n_normal),
    'gearbox_oil_temp_c': np.random.normal(62, 4, n_normal),
    'generator_temp_c': np.random.normal(70, 6, n_normal),
})
# Anomalous readings (pre-failure conditions)
anomaly = pd.DataFrame({
    'vibration_rms': np.random.normal(4.5, 0.8, n_anomaly),    # Higher vibration
    'bearing_temp_c': np.random.normal(80, 8, n_anomaly),      # Overheating
    'rotor_speed_rpm': np.random.normal(1600, 200, n_anomaly), # Irregular speed
    'power_output_kw': np.random.normal(1500, 400, n_anomaly), # Reduced output
    'wind_speed_ms': np.random.normal(8, 2, n_anomaly),        # Wind is same
    'pitch_angle_deg': np.random.normal(8, 3, n_anomaly),      # Over-pitched
    'gearbox_oil_temp_c': np.random.normal(78, 6, n_anomaly),  # Hot gearbox
    'generator_temp_c': np.random.normal(90, 10, n_anomaly),   # Hot generator
})
X = pd.concat([normal, anomaly], ignore_index=True)
y = np.array([0] * n_normal + [1] * n_anomaly)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
# Approach 1: KNN distance-based anomaly scoring
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
nn = NearestNeighbors(n_neighbors=10, metric='euclidean')
nn.fit(X_train_scaled[y_train == 0]) # Fit only on normal data
# Anomaly score = mean distance to 10 nearest normal neighbors
distances, _ = nn.kneighbors(X_test_scaled)
anomaly_scores = distances.mean(axis=1)
# Approach 2: KNN classifier (supervised)
knn_clf = KNeighborsClassifier(n_neighbors=7, weights='distance')
knn_clf.fit(X_train_scaled, y_train)
y_pred_clf = knn_clf.predict(X_test_scaled)
y_prob_clf = knn_clf.predict_proba(X_test_scaled)[:, 1]
# Evaluate both approaches
auc_distance = roc_auc_score(y_test, anomaly_scores)
auc_classifier = roc_auc_score(y_test, y_prob_clf)
ap_distance = average_precision_score(y_test, anomaly_scores)
ap_classifier = average_precision_score(y_test, y_prob_clf)
print("TurbineTech Anomaly Detection — KNN Approaches")
print(f"{'Approach':<30}{'AUC-ROC':<12}{'Avg Precision':<15}")
print("-" * 57)
print(f"{'KNN Distance (unsupervised)':<30}{auc_distance:<12.3f}{ap_distance:<15.3f}")
print(f"{'KNN Classifier (supervised)':<30}{auc_classifier:<12.3f}{ap_classifier:<15.3f}")
print("\n--- KNN Classifier Report ---")
print(classification_report(y_test, y_pred_clf, target_names=['Normal', 'Anomaly']))
TurbineTech Anomaly Detection — KNN Approaches
Approach AUC-ROC Avg Precision
---------------------------------------------------------
KNN Distance (unsupervised) 0.998 0.974
KNN Classifier (supervised) 0.997 0.962
--- KNN Classifier Report ---
precision recall f1-score support
Normal 1.00 1.00 1.00 1250
Anomaly 0.98 0.97 0.97 63
accuracy 1.00 1313
macro avg 0.99 0.98 0.99 1313
weighted avg 1.00 1.00 1.00 1313
Both approaches perform excellently on this data. The distance-based approach has a practical advantage: it does not require labeled anomaly data for training. You only need examples of normal operation.
Production Tip --- KNN distance-based anomaly detection is a strong baseline for manufacturing and IoT applications because anomaly labels are often scarce or unreliable. Train the nearest-neighbor index on known-good data only, then flag new observations whose average distance to the K nearest normal observations exceeds a threshold. At TurbineTech, this approach runs on each turbine's 10-minute summary readings, producing an anomaly score that maintenance crews use for prioritization.
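The thresholding step described in the tip can be sketched as follows. The data, the 99th-percentile cutoff, and the batch sizes are illustrative assumptions, not TurbineTech's actual configuration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_normal = rng.normal(0, 1, size=(2000, 8))         # known-good readings only
X_new = np.vstack([rng.normal(0, 1, size=(50, 8)),  # 50 normal-looking readings
                   rng.normal(4, 1, size=(5, 8))])  # 5 clearly off-distribution

scaler = StandardScaler().fit(X_normal)
nn = NearestNeighbors(n_neighbors=10).fit(scaler.transform(X_normal))

# Calibrate the cutoff on the normal data itself. Each training point's
# nearest "neighbor" is itself at distance 0, so drop the first column.
d_train, _ = nn.kneighbors(scaler.transform(X_normal))
train_scores = d_train[:, 1:].mean(axis=1)
threshold = np.percentile(train_scores, 99)  # flag roughly the top 1%

# Score new readings: mean distance to the 10 nearest normal points
d_new, _ = nn.kneighbors(scaler.transform(X_new))
new_scores = d_new.mean(axis=1)
flags = new_scores > threshold
print(f"flagged {flags.sum()} of {len(flags)} new readings")
```

The percentile controls the false-alarm rate on normal data; in practice it would be tuned against maintenance capacity.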
15.14 When Simple Models Win
After gradient boosting (Chapter 14) and SVMs (Chapter 12), it is tempting to dismiss NB and KNN as educational relics. This would be a mistake. Here is when simple models earn their keep:
Naive Bayes Wins When:
- Training data is small. NB has very few parameters (means and variances for Gaussian; counts for Multinomial). With 50 labeled examples, NB often outperforms logistic regression because it cannot overfit.
- Features are high-dimensional and sparse. Text classification with thousands of vocabulary features is NB's sweet spot. The independence assumption is less damaging when no single feature dominates.
- Training speed matters. If you retrain hourly, daily, or on every new batch of data, microsecond training time is a genuine advantage over algorithms that take minutes.
- Incremental learning is required. GaussianNB and MultinomialNB in scikit-learn support partial_fit(), which updates the model with new data without retraining from scratch. This is essential for streaming data.
- Interpretability is critical. NB provides class-conditional probabilities for each feature, which are directly interpretable. "This email is spam because it contains 'free' (P=0.82 in spam vs. 0.05 in ham) and 'click' (P=0.71 in spam vs. 0.03 in ham)."
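The incremental-learning point is easy to demonstrate. A sketch, using a synthetic dataset and a batch size of 500 chosen purely for illustration:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=3000, n_features=10, n_informative=6,
                           random_state=42)

nb = GaussianNB()
classes = np.unique(y)  # must be declared on the first partial_fit call

# Simulate a stream: feed the data in batches of 500
for start in range(0, len(X), 500):
    batch_X, batch_y = X[start:start + 500], y[start:start + 500]
    nb.partial_fit(batch_X, batch_y, classes=classes)

# The streamed model should agree closely with one fit on all data at once,
# since GaussianNB updates its per-class means and variances incrementally.
nb_full = GaussianNB().fit(X, y)
agreement = (nb.predict(X) == nb_full.predict(X)).mean()
print(f"prediction agreement with batch-fit model: {agreement:.3f}")
```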
KNN Wins When:
- The decision boundary is highly nonlinear. KNN adapts to any boundary shape because it makes no parametric assumptions.
- New data arrives continuously and the model needs to "update" without retraining. Adding a new training point to KNN is O(1) --- just add it to the store.
- You need a baseline fast. KNN requires no preprocessing decisions beyond scaling. It is the fastest path from raw data to a working classifier.
- Local patterns matter more than global patterns. KNN is purely local --- the prediction for a point depends only on its neighborhood. If different regions of feature space follow different rules, KNN handles this naturally.
- Anomaly detection on tabular data. KNN-based anomaly scoring (distance to K-th neighbor) is a strong baseline that requires no assumptions about the anomaly distribution.
import numpy as np
import time
from sklearn.datasets import make_classification
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
# Compare all models on a small dataset (NB's territory)
np.random.seed(42)
X_small, y_small = make_classification(
    n_samples=200, n_features=10, n_informative=6,
    n_redundant=2, flip_y=0.1, random_state=42
)
models = {
    'GaussianNB': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', GaussianNB())
    ]),
    'KNN (k=5)': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', KNeighborsClassifier(n_neighbors=5))
    ]),
    'LogReg': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(random_state=42, max_iter=1000))
    ]),
    'RandomForest': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', RandomForestClassifier(n_estimators=100, random_state=42))
    ]),
    'GradientBoosting': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', GradientBoostingClassifier(n_estimators=100, random_state=42))
    ]),
}
print(f"Performance on SMALL dataset (n=200, d=10)")
print(f"{'Model':<22}{'CV Accuracy':<15}{'Train Time (ms)':<18}")
print("-" * 55)
for name, pipe in models.items():
    t0 = time.perf_counter()
    scores = cross_val_score(pipe, X_small, y_small, cv=5)
    elapsed = (time.perf_counter() - t0) * 1000
    print(f"{name:<22}{scores.mean():<15.3f}{elapsed:<18.1f}")
Performance on SMALL dataset (n=200, d=10)
Model CV Accuracy Train Time (ms)
-------------------------------------------------------
GaussianNB 0.830 8.2
KNN (k=5) 0.815 6.4
LogReg 0.835 12.1
RandomForest 0.810 134.6
GradientBoosting 0.805 231.8
On 200 samples, GaussianNB matches logistic regression and beats both ensemble methods. The complex models overfit on this small dataset. GradientBoosting, which dominated Chapter 14 on large datasets, finishes last.
Tradeoffs --- Model complexity is not free. Every additional parameter is an opportunity to overfit. On small datasets, the simplest model that captures the signal wins. On large datasets with complex patterns, the most expressive model that can be regularized effectively wins. Knowing when to deploy a simple model versus a complex one is the difference between a junior and senior data scientist.
15.15 Speeding Up KNN: Practical Considerations
KNN's prediction cost is O(n * d) for each query --- it must compute the distance to every training point. This is fine for 10,000 rows and unacceptable for 10 million.
Ball Trees and KD-Trees
Scikit-learn uses spatial data structures to accelerate neighbor search:
- KD-Trees partition the feature space along axis-aligned splits. Effective when d < 20.
- Ball Trees partition data into nested hyperspheres. More robust in moderate dimensions.
import numpy as np
import time
from sklearn.neighbors import KNeighborsClassifier
# Compare KNN algorithms at different scales
np.random.seed(42)
sizes = [1000, 5000, 10000, 50000]
print(f"{'n':<10}{'brute (ms)':<14}{'kd_tree (ms)':<16}{'ball_tree (ms)':<16}")
print("-" * 56)
for n in sizes:
    X_speed = np.random.randn(n, 10)
    y_speed = (X_speed[:, 0] > 0).astype(int)
    X_query = np.random.randn(100, 10)
    times = {}
    for algo in ['brute', 'kd_tree', 'ball_tree']:
        knn = KNeighborsClassifier(n_neighbors=5, algorithm=algo)
        knn.fit(X_speed, y_speed)
        t0 = time.perf_counter()
        knn.predict(X_query)
        times[algo] = (time.perf_counter() - t0) * 1000
    print(f"{n:<10}{times['brute']:<14.2f}{times['kd_tree']:<16.2f}{times['ball_tree']:<16.2f}")
n brute (ms) kd_tree (ms) ball_tree (ms)
--------------------------------------------------------
1000 0.78 0.54 0.62
5000 2.41 0.89 1.04
10000 4.56 1.23 1.41
50000 21.84 2.87 3.42
KD-trees provide a significant speedup at larger scales. At 50,000 training points, kd_tree is 7.6x faster than brute force. For even larger datasets (millions of rows), consider approximate nearest neighbor libraries like FAISS (Facebook), Annoy (Spotify), or ScaNN (Google).
Production Tip --- For production KNN at scale, do not use scikit-learn. Use FAISS for GPU-accelerated approximate nearest neighbors or Annoy for memory-efficient, read-only indices. These libraries sacrifice exact neighbor search for 10-100x speedups, typically while still recovering the true nearest neighbors more than 95% of the time.
15.16 The StreamFlow Exercise: Naive Bayes on Churn
The progressive project in this chapter is an exercise, not a milestone, but it illustrates an important point: Naive Bayes is a surprisingly competitive baseline for churn prediction.
import numpy as np
import pandas as pd
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
# StreamFlow churn features (simplified from Chapter 11)
np.random.seed(42)
n = 5000
streamflow = pd.DataFrame({
    'monthly_hours': np.random.exponential(15, n),
    'sessions_30d': np.random.poisson(12, n),
    'months_active': np.random.randint(1, 48, n),
    'support_tickets_90d': np.random.poisson(0.5, n),
    'payment_failures_6m': np.random.poisson(0.2, n),
    'content_completion': np.random.beta(3, 2, n),
    'devices_used': np.random.randint(1, 5, n),
    'hours_change_pct': np.random.normal(0, 25, n),
})
# Churn probability based on features
churn_logit = (
    -2.0
    - 0.08 * streamflow['monthly_hours']
    - 0.05 * streamflow['sessions_30d']
    + 0.04 * streamflow['support_tickets_90d']
    + 0.6 * streamflow['payment_failures_6m']
    - 0.02 * streamflow['months_active']
    - 0.03 * streamflow['hours_change_pct']
)
churn_prob = 1 / (1 + np.exp(-churn_logit))
y_churn = (np.random.random(n) < churn_prob).astype(int)
X_churn = streamflow.values
models = {
    'GaussianNB': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', GaussianNB())
    ]),
    'LogisticRegression': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(random_state=42, max_iter=1000))
    ]),
    'GradientBoosting': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', GradientBoostingClassifier(
            n_estimators=200, max_depth=4, random_state=42
        ))
    ]),
}
print("StreamFlow Churn Prediction — Model Comparison")
print(f"{'Model':<22}{'CV AUC':<12}{'CV Accuracy':<15}")
print("-" * 49)
for name, pipe in models.items():
    auc_scores = cross_val_score(pipe, X_churn, y_churn, cv=5, scoring='roc_auc')
    acc_scores = cross_val_score(pipe, X_churn, y_churn, cv=5, scoring='accuracy')
    print(f"{name:<22}{auc_scores.mean():<12.3f}{acc_scores.mean():<15.3f}")
StreamFlow Churn Prediction — Model Comparison
Model CV AUC CV Accuracy
-------------------------------------------------
GaussianNB 0.766 0.778
LogisticRegression 0.783 0.781
GradientBoosting 0.790 0.784
GaussianNB trails GradientBoosting by only 0.024 AUC on this churn problem. In a real project, this would be the first model you build --- in under a minute --- to establish whether the features carry signal at all. If NB hits 0.766 AUC, you have established a floor, not a ceiling, and the investment in more complex models is worth making.
15.17 When to Reach for NB or KNN: A Decision Framework
| Criterion | Naive Bayes | KNN |
|---|---|---|
| Training data size | Excellent with small data (< 1000) | Needs moderate data (> 500) |
| Feature dimensionality | Handles high dimensions well | Degrades above ~20 features |
| Feature types | Variants for continuous, counts, binary | Requires numerical (scale first) |
| Training speed | Microseconds | Microseconds (just stores data) |
| Prediction speed | Microseconds | O(n*d) per query (slow at scale) |
| Nonlinear boundaries | No (linear in log-probability space) | Yes (arbitrarily complex) |
| Interpretability | High (class-conditional probabilities) | Medium (show neighbors) |
| Incremental learning | Yes (partial_fit) | Yes (add points) |
| Probability calibration | Poor (probabilities are unreliable) | Fair (with enough neighbors) |
| Best use case | Text classification, fast baselines | Anomaly detection, local patterns |
Common Mistake --- Do not use Naive Bayes probabilities as actual probabilities. NB is a good classifier but a poor probability estimator because the independence assumption distorts probability magnitudes. If you need calibrated probabilities from NB, apply CalibratedClassifierCV from scikit-learn as a post-processing step.
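A minimal sketch of that calibration step, scored with the Brier score (mean squared error of the predicted probabilities; lower is better). The synthetic dataset and method='sigmoid' are illustrative choices; isotonic regression is the other available method:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import brier_score_loss

X, y = make_classification(n_samples=3000, n_features=10, n_informative=6,
                           random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

raw_nb = GaussianNB().fit(X_tr, y_tr)
# Wrap NB in a calibrator: fits NB on CV folds, then learns a sigmoid
# mapping from NB's raw scores to calibrated probabilities.
cal_nb = CalibratedClassifierCV(GaussianNB(), method='sigmoid', cv=5)
cal_nb.fit(X_tr, y_tr)

brier_raw = brier_score_loss(y_te, raw_nb.predict_proba(X_te)[:, 1])
brier_cal = brier_score_loss(y_te, cal_nb.predict_proba(X_te)[:, 1])
print(f"Brier score, raw NB:        {brier_raw:.3f}")
print(f"Brier score, calibrated NB: {brier_cal:.3f}")
```

Calibration rarely changes which class NB predicts; it changes how much you can trust the probability attached to the prediction.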
Chapter Summary
Naive Bayes and K-Nearest Neighbors sit at opposite ends of the model spectrum. NB makes a strong assumption (conditional independence) and uses it to estimate probabilities with almost no data. KNN makes no assumptions at all and lets the data speak entirely for itself. Both are fast to implement, easy to understand, and --- in the right circumstances --- competitive with algorithms that are orders of magnitude more complex.
The senior data scientist's toolkit includes gradient boosting and neural networks. It also includes MultinomialNB(alpha=0.5) in an eleven-line pipeline and KNeighborsClassifier(n_neighbors=7) with a StandardScaler. Knowing when to reach for the simple tool instead of the complex one is not a sign of laziness. It is a sign of experience.
Next chapter: Chapter 16 --- Model Evaluation Deep Dive, where we move from "which model is best?" to "how do we know which model is best?" --- the metrics, curves, and evaluation strategies that separate rigorous analysis from wishful thinking.