Exercises: Chapter 15

Naive Bayes and Nearest Neighbors


Exercise 1: Naive Bayes by Hand (Conceptual)

A spam filter has been trained on 1,000 emails: 400 spam and 600 ham (not spam). The per-class conditional word probabilities are:

| Word    | P(word | spam) | P(word | ham) |
|---------|----------------|---------------|
| free    | 0.60           | 0.05          |
| meeting | 0.02           | 0.35          |
| click   | 0.45           | 0.03          |
| report  | 0.01           | 0.28          |

A new email contains the words: "free", "click", and "report".

a) Calculate the prior probabilities P(spam) and P(ham).

b) Using the Naive Bayes independence assumption, calculate P(free, click, report | spam) and P(free, click, report | ham).

c) Calculate the unnormalized posterior for each class: P(spam) * P(X | spam) and P(ham) * P(X | ham).

d) Classify the email. Show your work.

e) The email contains both spam indicators ("free", "click") and a ham indicator ("report"). Explain why the spam class still wins despite the presence of "report."
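Once you have worked parts (a)-(d) by hand, the arithmetic can be checked in a few lines (all probabilities taken directly from the table above):

```python
# Quick numeric check of the hand calculation. Priors come from the
# 400/600 split; likelihoods come from the table above.
p_spam, p_ham = 400 / 1000, 600 / 1000

# Naive Bayes independence: multiply per-word likelihoods.
like_spam = 0.60 * 0.45 * 0.01   # free * click * report, given spam
like_ham = 0.05 * 0.03 * 0.28    # free * click * report, given ham

post_spam = p_spam * like_spam   # unnormalized posterior for spam
post_ham = p_ham * like_ham      # unnormalized posterior for ham

print(f"spam: {post_spam:.6f}  ham: {post_ham:.6f}")
print("prediction:", "spam" if post_spam > post_ham else "ham")
```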


Exercise 2: Laplace Smoothing (Conceptual + Code)

Consider a Multinomial Naive Bayes classifier trained on 100 documents (50 spam, 50 ham) with a vocabulary of 5,000 unique words.

a) Without Laplace smoothing, what happens if a test email contains a word that appeared in ham training emails but never in spam training emails? Explain the mathematical consequence.

b) With Laplace smoothing (alpha = 1), the probability of word w in class c becomes:

P(w | c) = (count(w, c) + 1) / (total_words_in_c + V)

where V is the vocabulary size. If "synergy" appeared 3 times in ham emails (total ham words = 15,000) and 0 times in spam emails (total spam words = 12,000), calculate P("synergy" | ham) and P("synergy" | spam) with and without smoothing.
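The arithmetic in part (b) can be checked with a small helper implementing the formula above, parameterized by alpha (alpha=0 recovers the unsmoothed estimate):

```python
# The smoothing formula from part (b), applied to the numbers given.
V = 5000                          # vocabulary size
count_ham, total_ham = 3, 15000   # "synergy" counts in ham
count_spam, total_spam = 0, 12000 # "synergy" counts in spam

def p_word(count, total, alpha=1, V=V):
    """Laplace-smoothed estimate of P(w | c); alpha=0 gives the raw MLE."""
    return (count + alpha) / (total + alpha * V)

print(f"P(synergy|ham),  no smoothing: {p_word(count_ham, total_ham, alpha=0):.6f}")
print(f"P(synergy|ham),  alpha=1:      {p_word(count_ham, total_ham):.6f}")
print(f"P(synergy|spam), no smoothing: {p_word(count_spam, total_spam, alpha=0):.6f}")
print(f"P(synergy|spam), alpha=1:      {p_word(count_spam, total_spam):.6f}")
```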

c) Run the following experiment and explain the results:

import numpy as np
from sklearn.naive_bayes import MultinomialNB
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

np.random.seed(42)

# Generate sparse text data where smoothing matters
vocab_class_a = [f'word_a_{i}' for i in range(100)]
vocab_class_b = [f'word_b_{i}' for i in range(100)]
shared_vocab = [f'word_s_{i}' for i in range(50)]

rng = np.random.RandomState(42)
n_docs = 200
docs = []
labels = []

for i in range(n_docs):
    if i < n_docs // 2:
        words = list(rng.choice(vocab_class_a, 8))
        words += list(rng.choice(shared_vocab, 4))
        labels.append(0)
    else:
        words = list(rng.choice(vocab_class_b, 8))
        words += list(rng.choice(shared_vocab, 4))
        labels.append(1)
    docs.append(' '.join(words))

alphas = [0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0]
for alpha in alphas:
    pipe = Pipeline([
        ('vec', CountVectorizer()),
        ('nb', MultinomialNB(alpha=alpha))
    ])
    scores = cross_val_score(pipe, docs, labels, cv=5)
    print(f"alpha={alpha:<8} accuracy={scores.mean():.3f}")

d) Based on your results, what is the optimal range for alpha on this dataset? Why does accuracy degrade at both extremes?


Exercise 3: Gaussian NB Assumptions (Code)

Run the following experiment comparing Gaussian NB with and without feature transformations:

import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.preprocessing import PowerTransformer, QuantileTransformer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

np.random.seed(42)
n = 2000

# Features with various non-Gaussian distributions
X = np.column_stack([
    np.random.exponential(2, n),           # Right-skewed
    np.random.lognormal(1, 0.8, n),        # Log-normal
    np.random.beta(0.5, 0.5, n),           # U-shaped
    np.random.uniform(0, 10, n),           # Uniform
    np.random.normal(5, 2, n),             # Normal (control)
    np.random.chisquare(3, n),             # Right-skewed
])

y = ((X[:, 0] > 2) & (X[:, 2] > 0.5) | (X[:, 4] > 6)).astype(int)

# Raw features
scores_raw = cross_val_score(GaussianNB(), X, y, cv=5)

# Power transform (Yeo-Johnson)
pipe_power = Pipeline([
    ('transform', PowerTransformer(method='yeo-johnson')),
    ('nb', GaussianNB())
])
scores_power = cross_val_score(pipe_power, X, y, cv=5)

# Quantile transform (maps to uniform, then to normal)
pipe_quantile = Pipeline([
    ('transform', QuantileTransformer(output_distribution='normal',
                                       random_state=42)),
    ('nb', GaussianNB())
])
scores_quantile = cross_val_score(pipe_quantile, X, y, cv=5)

print(f"Raw features:        {scores_raw.mean():.3f}")
print(f"Power transform:     {scores_power.mean():.3f}")
print(f"Quantile transform:  {scores_quantile.mean():.3f}")

a) Which transformation helps the most? Why?

b) Which of the six features would benefit most from transformation? Explain your reasoning based on how far each distribution deviates from Gaussian.

c) The "U-shaped" feature (beta(0.5, 0.5)) is bimodal. Why is this particularly problematic for Gaussian NB? What happens when a single Gaussian is fit to bimodal data?

d) Would these transformations help logistic regression or gradient boosting? Why is Gaussian NB uniquely sensitive to feature distributions?
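To make part (c) concrete, here is a small synthetic demonstration (sample size chosen for illustration) of what fitting a single Gaussian to a U-shaped beta(0.5, 0.5) sample does: the fitted density peaks near 0.5, exactly where the data is rarest.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.beta(0.5, 0.5, 100_000)   # U-shaped: mass piles up near 0 and 1

# A single Gaussian fit uses the sample mean and standard deviation.
mu, sigma = x.mean(), x.std()
print(f"fitted Gaussian: mean={mu:.3f}, std={sigma:.3f}")

# Compare empirical mass near the center vs. near one edge
# (two intervals of equal width 0.1).
center = ((x > 0.45) & (x < 0.55)).mean()   # fraction in [0.45, 0.55]
edge = (x < 0.10).mean()                    # fraction in [0, 0.10]
print(f"mass near 0.5: {center:.3f}   mass in [0, 0.1]: {edge:.3f}")
# The fitted Gaussian puts its mode at ~0.5, where the data is sparsest,
# so its density estimates are wrong almost everywhere.
```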


Exercise 4: KNN Distance Metrics (Code)

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

np.random.seed(42)

# Dataset 1: Dense, low-dimensional
X_dense, y_dense = make_classification(
    n_samples=1000, n_features=5, n_informative=4,
    n_redundant=1, flip_y=0.05, random_state=42
)

# Dataset 2: Sparse, high-dimensional (text-like)
X_sparse = np.random.binomial(1, 0.05, (1000, 200)).astype(float)
y_sparse = (X_sparse[:, :10].sum(axis=1) > 1).astype(int)

metrics = ['euclidean', 'manhattan', 'cosine', 'chebyshev']

for name, X, y in [('Dense (d=5)', X_dense, y_dense),
                   ('Sparse (d=200)', X_sparse, y_sparse)]:
    print(f"\n--- {name} ---")
    for metric in metrics:
        pipe = Pipeline([
            ('scaler', StandardScaler()),
            ('knn', KNeighborsClassifier(n_neighbors=7, metric=metric))
        ])
        scores = cross_val_score(pipe, X, y, cv=5)
        print(f"  {metric:<12} accuracy={scores.mean():.3f}")

a) Run the experiment. Which metric performs best on each dataset?

b) Explain why cosine distance tends to work better on sparse, high-dimensional data.

c) Chebyshev distance (L-infinity norm) measures the maximum absolute difference across any single feature. In what real-world scenario would this metric be preferable to Euclidean distance?

d) If you had a dataset with mixed feature types (5 continuous features and 10 binary features), which metric would you choose and why?
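For intuition on part (b), here is a toy comparison (separate from the experiment above) of how the two metrics treat sparse binary vectors with and without overlapping active features:

```python
import numpy as np
from scipy.spatial.distance import euclidean, cosine

# Three sparse 200-dim "documents", each with 5 active features.
a = np.zeros(200); a[[0, 1, 2, 3, 4]] = 1
b = np.zeros(200); b[[0, 1, 2, 10, 11]] = 1     # overlaps a on 3 features
c = np.zeros(200); c[[50, 51, 52, 53, 54]] = 1  # no overlap with a

print(f"euclidean(a,b)={euclidean(a, b):.3f}  euclidean(a,c)={euclidean(a, c):.3f}")
print(f"cosine(a,b)={cosine(a, b):.3f}        cosine(a,c)={cosine(a, c):.3f}")
# Cosine distance jumps from 0.4 (partial overlap) to 1.0 (no overlap),
# a 2.5x spread, while euclidean only moves from 2.0 to ~3.16: cosine
# is driven entirely by shared active features, which is what matters
# in sparse data.
```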


Exercise 5: The Curse of Dimensionality (Code + Analysis)

Run the following experiment and answer the questions:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

np.random.seed(42)
n_samples = 1000
n_informative = 5

dimensions = [5, 10, 20, 50, 100, 200, 500]

print(f"{'Dim':<8}{'KNN':<10}{'GaussianNB':<12}{'LogReg':<10}")
print("-" * 40)

for d in dimensions:
    X, y = make_classification(
        n_samples=n_samples, n_features=d,
        n_informative=n_informative, n_redundant=0,
        n_clusters_per_class=2, flip_y=0.05,
        random_state=42
    )

    knn_pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=7))
    ])
    nb_pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('nb', GaussianNB())
    ])
    lr_pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('lr', LogisticRegression(random_state=42, max_iter=1000))
    ])

    knn_acc = cross_val_score(knn_pipe, X, y, cv=5).mean()
    nb_acc = cross_val_score(nb_pipe, X, y, cv=5).mean()
    lr_acc = cross_val_score(lr_pipe, X, y, cv=5).mean()

    print(f"{d:<8}{knn_acc:<10.3f}{nb_acc:<12.3f}{lr_acc:<10.3f}")

a) Which model degrades most as dimensionality increases? Which is most robust?

b) At what dimensionality does KNN start performing worse than Gaussian NB? Why does this crossover happen?

c) Why is Gaussian NB relatively immune to the curse of dimensionality?

d) A colleague suggests using PCA to reduce dimensions before applying KNN. Modify the pipeline to apply PCA (keeping 95% variance) before KNN. Does this help? At which dimensionalities is the improvement most dramatic?

e) In a real dataset, how would you decide between (i) applying PCA + KNN and (ii) switching to a model that handles high dimensions natively?
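A possible starting point for the PCA modification in part (d): the KNN pipeline from the experiment with a PCA step inserted after scaling. Passing a float to n_components keeps the smallest number of components explaining that fraction of variance.

```python
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Same KNN pipeline as in the experiment, with PCA between the scaler
# and the classifier. n_components=0.95 retains enough components to
# explain 95% of the variance.
knn_pca_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=0.95, random_state=42)),
    ('knn', KNeighborsClassifier(n_neighbors=7))
])
```

Swap this pipeline in for the plain KNN pipeline inside the dimensionality loop and compare the two accuracy columns.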


Exercise 6: KNN for Regression vs. Classification (Code)

import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

np.random.seed(42)
n = 1500

X = np.column_stack([
    np.random.uniform(0, 10, n),
    np.random.uniform(0, 10, n),
])

# Continuous target with nonlinear relationship
y_reg = np.sin(X[:, 0]) * np.cos(X[:, 1]) + np.random.normal(0, 0.2, n)

# Binary target
y_clf = (y_reg > 0).astype(int)

k_values = [1, 3, 5, 9, 15, 25, 51]

print("--- Regression (R-squared) ---")
for k in k_values:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsRegressor(n_neighbors=k, weights='distance'))
    ])
    scores = cross_val_score(pipe, X, y_reg, cv=5, scoring='r2')
    print(f"  k={k:<4} R2={scores.mean():.3f}")

print("\n--- Classification (Accuracy) ---")
for k in k_values:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('knn', KNeighborsClassifier(n_neighbors=k, weights='distance'))
    ])
    scores = cross_val_score(pipe, X, y_clf, cv=5, scoring='accuracy')
    print(f"  k={k:<4} Accuracy={scores.mean():.3f}")

a) Run the experiment. Does the optimal K differ between regression and classification? If so, why?

b) Compare weights='uniform' vs. weights='distance' for both tasks. When does distance-weighting help most?

c) KNN regression predicts the (weighted) mean of neighbors. What happens when the target has outliers in the training data? How would you make KNN regression more robust to outliers?
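One answer to the robustness question in part (c) is to aggregate neighbor targets with a median instead of a mean. KNeighborsRegressor does not expose a median option, but a sketch with NearestNeighbors shows the effect (synthetic data; the outlier count and magnitude are illustrative choices):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
X_train = rng.uniform(0, 10, (500, 2))
y_train = np.sin(X_train[:, 0]) + rng.normal(0, 0.1, 500)
y_train[::50] += 20                # inject a few large target outliers

nn = NearestNeighbors(n_neighbors=9).fit(X_train)

def predict_mean(X_query):
    """Standard KNN regression: mean of neighbor targets."""
    _, idx = nn.kneighbors(X_query)
    return y_train[idx].mean(axis=1)

def predict_median(X_query):
    """Robust variant: median of neighbor targets."""
    _, idx = nn.kneighbors(X_query)
    return np.median(y_train[idx], axis=1)

X_test = rng.uniform(0, 10, (200, 2))
y_true = np.sin(X_test[:, 0])      # noiseless ground truth
mean_mae = np.abs(predict_mean(X_test) - y_true).mean()
median_mae = np.abs(predict_median(X_test) - y_true).mean()
print(f"mean MAE:   {mean_mae:.3f}")
print(f"median MAE: {median_mae:.3f}")
```

A single +20 outlier among 9 neighbors shifts the mean by more than 2 but leaves the median untouched.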


Exercise 7: Naive Bayes on StreamFlow Churn (Applied)

Build a Naive Bayes baseline for the StreamFlow churn prediction problem using the features from Chapter 11. Specifically:

a) Train a Gaussian NB classifier on the StreamFlow feature set. Report 5-fold cross-validated AUC and accuracy.

b) The StreamFlow features include both continuous features (monthly hours, content completion rate) and count features (sessions, support tickets, payment failures). Train a separate Gaussian NB on the continuous features and a separate Multinomial NB (after discretizing) on the count features. Compare their individual performance to the combined Gaussian NB.

c) Apply CalibratedClassifierCV (method='isotonic') to the Gaussian NB classifier. Compare the Brier scores (a calibration metric) before and after calibration. Plot a calibration curve for both.

d) A product manager asks: "Why is this user predicted to churn?" Using the Gaussian NB model, show how to extract the per-feature contribution to the prediction. Specifically, compute P(feature_value | churn) / P(feature_value | no_churn) for each feature to identify the strongest churn signals for a given user.


Exercise 8: KNN Anomaly Detection (Applied)

Build a KNN-based anomaly detection system for TurbineTech sensor data:

a) Using only normal operating data for training, implement a KNN distance-based anomaly detector. The anomaly score for each point is the mean distance to its K nearest normal neighbors.

b) Vary K from 3 to 50 and plot the AUC-ROC of the anomaly detector. Is there a sweet spot? Why does very small K give noisy results and very large K give diluted results?

c) Compare three distance-based anomaly scores: (i) mean distance to K neighbors, (ii) distance to the K-th neighbor only, (iii) Local Outlier Factor (LOF) from sklearn.neighbors.LocalOutlierFactor. Which performs best?

d) The TurbineTech dataset has 8 sensor features. Introduce 20 random noise features (uniformly distributed). How much does the anomaly detection AUC degrade? Apply PCA to reduce to the original 8 dimensions and re-evaluate. What does this tell you about feature selection for KNN-based anomaly detection?
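A minimal sketch of the detector in part (a), on synthetic 2-D stand-in data rather than the 8-feature TurbineTech sensors (cluster locations and sizes are illustrative choices):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
X_normal = rng.normal(0, 1, (500, 2))          # training: normal operation only
X_test = np.vstack([rng.normal(0, 1, (50, 2)),   # normal test points
                    rng.normal(6, 1, (5, 2))])   # anomalies, far from the cluster

k = 10
nn = NearestNeighbors(n_neighbors=k).fit(X_normal)
dist, _ = nn.kneighbors(X_test)
scores = dist.mean(axis=1)     # anomaly score: mean distance to k normal neighbors

print("median score, normal points:   ", round(np.median(scores[:50]), 3))
print("median score, anomalous points:", round(np.median(scores[50:]), 3))
```

Thresholding the score (e.g., at a high percentile of training-set scores) turns it into a binary detector.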


Exercise 9: Model Selection Challenge (Synthesis)

You are given four datasets with different characteristics. For each dataset, predict which model (Gaussian NB, Multinomial NB, KNN, Logistic Regression, Random Forest) will perform best. Then verify with code.

import numpy as np
from sklearn.datasets import make_classification

np.random.seed(42)

# Dataset A: 100 samples, 5 features, linearly separable
X_a, y_a = make_classification(
    n_samples=100, n_features=5, n_informative=5,
    n_redundant=0, n_clusters_per_class=1, flip_y=0.02,
    class_sep=2.0, random_state=42
)

# Dataset B: 10000 samples, 3 features, complex boundary
X_b, y_b = make_classification(
    n_samples=10000, n_features=3, n_informative=3,
    n_redundant=0, n_clusters_per_class=3, flip_y=0.05,
    class_sep=0.8, random_state=42
)

# Dataset C: 500 samples, 100 features, sparse
X_c = np.random.binomial(1, 0.03, (500, 100)).astype(float)
y_c = (X_c[:, :5].sum(axis=1) > 0).astype(int)

# Dataset D: 5000 samples, 50 features, 5 informative
X_d, y_d = make_classification(
    n_samples=5000, n_features=50, n_informative=5,
    n_redundant=5, flip_y=0.1, random_state=42
)

a) Before running any code, predict the winner for each dataset. Justify your reasoning.

b) Run all five models on all four datasets. Were your predictions correct?

c) For any incorrect predictions, explain what you missed about the dataset or the model.
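A compact harness you could use for part (b); the hyperparameters here (k=7, default forest settings) are illustrative choices, not the only reasonable ones:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def compare_models(X, y, cv=5):
    """Cross-validated accuracy for the five candidate models.

    MultinomialNB requires non-negative features, so it is skipped
    when the data contains negative values (datasets A, B, and D).
    """
    models = {
        'GaussianNB': GaussianNB(),
        'MultinomialNB': MultinomialNB(),
        'KNN': KNeighborsClassifier(n_neighbors=7),
        'LogReg': LogisticRegression(max_iter=1000, random_state=42),
        'RandomForest': RandomForestClassifier(random_state=42),
    }
    results = {}
    for name, model in models.items():
        if name == 'MultinomialNB' and (np.asarray(X) < 0).any():
            continue
        results[name] = cross_val_score(model, X, y, cv=cv).mean()
    return results
```

Call compare_models(X_a, y_a) and so on for each dataset, then compare the results against your predictions.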


Exercise 10: Production Pipeline (Applied)

Build a production-ready text classification pipeline for ShopSmart review sentiment that includes:

a) Preprocessing: TF-IDF with optimized parameters (tune max_features, ngram_range, min_df, max_df via GridSearchCV).

b) Model comparison: evaluate MultinomialNB, ComplementNB, and LogisticRegression as the classifier step.

c) Probability calibration: Apply CalibratedClassifierCV to the winning NB model.

d) Inference speed benchmark: Time a single prediction for each model. Calculate the number of reviews per second each model can classify.

e) Final recommendation: Write a 3-sentence summary recommending which model ShopSmart should deploy and why, considering accuracy, speed, and interpretability.
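A minimal sketch of the timing harness for part (d), on toy stand-in reviews (the real ShopSmart data and tuned TF-IDF settings would plug into the same structure); time.perf_counter is the standard monotonic timer for benchmarks like this:

```python
import time
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in reviews; replace with the ShopSmart data and the
# winning pipeline from parts (a)-(c).
docs = ['great product love it', 'terrible waste of money'] * 100
labels = [1, 0] * 100

pipe = Pipeline([('tfidf', TfidfVectorizer()), ('nb', MultinomialNB())])
pipe.fit(docs, labels)

# Time repeated single-review predictions and average.
n_reps = 200
start = time.perf_counter()
for _ in range(n_reps):
    pipe.predict(['is this review positive or negative'])
elapsed = time.perf_counter() - start

per_pred = elapsed / n_reps
print(f"single prediction: {per_pred * 1000:.2f} ms "
      f"(~{1 / per_pred:,.0f} reviews/sec)")
```

Running the same loop for each candidate model gives the reviews-per-second comparison the exercise asks for.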

