Quiz: Chapter 15

Naive Bayes and Nearest Neighbors


Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.


Question 1 (Multiple Choice)

The "naive" assumption in Naive Bayes refers to:

  • A) The assumption that all features have equal importance
  • B) The assumption that features are conditionally independent given the class
  • C) The assumption that the prior probability of each class is equal
  • D) The assumption that the data is normally distributed

Answer: B) The assumption that features are conditionally independent given the class. This means P(x_1, x_2, ..., x_d | C) = P(x_1 | C) * P(x_2 | C) * ... * P(x_d | C): the joint likelihood decomposes into a product of individual feature likelihoods, making estimation tractable even in high-dimensional feature spaces. The assumption is almost always violated in practice, yet the classifier often works well because correct classification requires only a correct ranking of class probabilities, not accurate probability estimates.
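To make the decomposition concrete, here is a minimal sketch of the Naive Bayes scoring rule in log space (the product of likelihoods becomes a sum of logs, which avoids numerical underflow). All the probability values below are made up purely for illustration:

```python
import math

# Toy per-feature likelihoods P(x_i | C) for one test point, under each class.
# These numbers are invented for illustration, not estimated from data.
likelihoods = {
    "spam": [0.20, 0.05, 0.30],
    "ham":  [0.10, 0.40, 0.02],
}
priors = {"spam": 0.4, "ham": 0.6}

def log_posterior(cls):
    # Naive Bayes score: log P(C) + sum_i log P(x_i | C).
    # The independence assumption is what lets us sum per-feature terms.
    return math.log(priors[cls]) + sum(math.log(p) for p in likelihoods[cls])

scores = {c: log_posterior(c) for c in priors}
prediction = max(scores, key=scores.get)  # class with the highest score wins
```

Note that only the ranking of `scores` matters for the prediction, which is exactly why distorted probability magnitudes often do not hurt classification.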


Question 2 (Multiple Choice)

A Multinomial Naive Bayes classifier is trained on a corpus of 10,000 documents with a vocabulary of 50,000 unique words. Without Laplace smoothing, the classifier will fail when:

  • A) A test document contains a word that appeared in both classes during training
  • B) A test document contains a word that appeared in one class but not the other during training
  • C) A test document is shorter than the average training document
  • D) The prior probabilities of the two classes are unequal

Answer: B) A test document contains a word that appeared in one class but not the other during training. If word w never appeared in spam training documents, P(w | spam) = 0. Since Naive Bayes multiplies feature likelihoods, a single zero factor makes the entire product zero: P(X | spam) = 0, regardless of how many other features strongly indicate spam. Laplace smoothing adds a small count (typically 1) to every word-class combination, ensuring no probability is ever exactly zero.


Question 3 (Short Answer)

Explain why Naive Bayes often classifies correctly despite the independence assumption being violated. What property of the classifier makes it robust to this assumption violation?

Answer: Naive Bayes only needs to rank classes correctly, not estimate accurate probabilities. The independence assumption can distort the magnitude of P(C | X) enormously while still placing the correct class at the top of the ranking. Domingos and Pazzani (1997) showed that when feature dependencies are distributed roughly evenly across classes, the distortions approximately cancel out, leaving the ranking intact. Additionally, in high-dimensional feature spaces (like text with thousands of words), no single feature dependency dominates the overall product, so individual violations have limited impact on the final classification.


Question 4 (Multiple Choice)

Which Naive Bayes variant is most appropriate for classifying product reviews based on TF-IDF features?

  • A) Gaussian NB, because TF-IDF values are continuous
  • B) Multinomial NB, because TF-IDF values represent word importance weights
  • C) Bernoulli NB, because each word is either present or absent
  • D) All three variants will produce identical results on TF-IDF features

Answer: B) Multinomial NB, because TF-IDF values represent word importance weights. The multinomial model formally assumes integer counts, but its likelihood computation accepts any non-negative weights, and treating TF-IDF values as fractional counts works well in practice because they are still count-like measures of word importance. Bernoulli NB would discard the magnitude information entirely, and Gaussian NB would assume each TF-IDF value follows a normal distribution within each class, which is a poor fit for the sparse, skewed nature of TF-IDF vectors.
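A minimal sketch of this setup using scikit-learn, on a tiny invented corpus (the documents and labels are illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny illustrative corpus; labels: 1 = positive review, 0 = negative review.
docs = ["great product love it", "terrible waste of money",
        "love the quality", "money wasted terrible"]
labels = [1, 0, 1, 0]

# TfidfVectorizer produces non-negative weights, which MultinomialNB accepts
# even though they are not integer counts.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(docs, labels)
pred = model.predict(["love this great quality"])[0]
```

The pipeline keeps vectorization and classification together, so the same vocabulary learned at fit time is applied at prediction time.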


Question 5 (Multiple Choice)

K-Nearest Neighbors is called a "lazy learner" because:

  • A) It uses a simplified loss function
  • B) It performs no computation during the training phase --- all work is deferred to prediction time
  • C) It converges slowly during optimization
  • D) It requires fewer features than other algorithms

Answer: B) It performs no computation during the training phase --- all work is deferred to prediction time. During "training," KNN simply stores the dataset. All distance calculations, neighbor searches, and vote counting happen when a new prediction is requested. This is the opposite of "eager" learners like logistic regression or gradient boosting, which invest significant computation during training to build a compact model and then make fast predictions.
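The lazy/eager distinction is visible in a minimal brute-force KNN sketch: `fit` does nothing but store the data, and every distance computation happens inside prediction. This is an illustrative toy class, not a library implementation:

```python
from collections import Counter

class LazyKNN:
    """Minimal brute-force KNN classifier: 'training' just memorizes the data."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):
        self.X, self.y = X, y  # no computation at all: store and return
        return self

    def predict_one(self, q):
        # All the work happens here: O(n * d) squared distances per query.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(x, q)), label)
            for x, label in zip(self.X, self.y)
        )
        votes = Counter(label for _, label in dists[: self.k])
        return votes.most_common(1)[0][0]

knn = LazyKNN(k=3).fit([(0, 0), (0, 1), (5, 5), (6, 5)], ["a", "a", "b", "b"])
label = knn.predict_one((0.2, 0.5))
```

An eager learner would spend the effort in `fit` and keep only a small set of parameters; here the entire training set must be retained and scanned for every query.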


Question 6 (Short Answer)

A data scientist applies KNN (k=5) to a dataset with 10 features. Feature 1 has range [0, 100000] (annual salary) and Feature 2 has range [0, 1] (probability score). Without preprocessing, which feature will dominate the distance calculation? What preprocessing step fixes this, and why?

Answer: Feature 1 (salary) will dominate because Euclidean and Manhattan distances are computed as sums across features, and a difference of 1000 in salary produces a much larger contribution than a difference of 0.1 in probability. StandardScaler (or MinMaxScaler) fixes this by transforming all features to comparable scales. After standardization, each feature has mean 0 and standard deviation 1, so no single feature dominates the distance calculation based on its raw scale.
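A quick numeric check makes the dominance obvious. The two customer vectors below are invented for illustration:

```python
# Two customers: (annual salary, probability score). Illustrative values.
a = (52_000, 0.90)
b = (51_000, 0.10)

# Squared contribution of each feature to the Euclidean distance.
salary_term = (a[0] - b[0]) ** 2  # 1_000_000
prob_term = (a[1] - b[1]) ** 2    # 0.64

# Fraction of the squared distance contributed by salary alone.
share = salary_term / (salary_term + prob_term)
```

Here salary accounts for essentially 100% of the distance, so the probability score is invisible to KNN. After standardization (subtract the mean, divide by the standard deviation), a one-standard-deviation gap contributes equally for every feature.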


Question 7 (Multiple Choice)

The "curse of dimensionality" affects KNN because:

  • A) KNN cannot handle more than 20 features
  • B) In high dimensions, all points become approximately equidistant, making the concept of "nearest neighbor" meaningless
  • C) The number of possible classes grows exponentially with dimensionality
  • D) KNN's training time grows exponentially with the number of features

Answer: B) In high dimensions, all points become approximately equidistant, making the concept of "nearest neighbor" meaningless. As dimensionality increases, the ratio of the nearest-neighbor distance to the farthest-neighbor distance converges to 1.0 (Beyer et al., 1999). When the nearest point is almost as far away as the farthest point, the K "nearest" neighbors are essentially random, and the prediction carries no more information than a random sample of training points.
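The distance-concentration effect is easy to demonstrate with a small simulation: sample random points in the unit hypercube and compare the nearest and farthest distances from the origin. The sample sizes and seed below are arbitrary choices for illustration:

```python
import random

def near_far_ratio(dim, n_points=200, seed=0):
    """Ratio of nearest to farthest Euclidean distance from the origin,
    over random points in the unit hypercube [0, 1]^dim."""
    rng = random.Random(seed)
    dists = []
    for _ in range(n_points):
        p = [rng.random() for _ in range(dim)]
        dists.append(sum(x * x for x in p) ** 0.5)
    return min(dists) / max(dists)

low = near_far_ratio(dim=2)      # nearest is much closer than farthest
high = near_far_ratio(dim=1000)  # ratio approaches 1: near-equidistance
```

In 2 dimensions the nearest point is a small fraction of the farthest point's distance; at 1000 dimensions the ratio is close to 1, so "nearest" carries almost no signal.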


Question 8 (Multiple Choice)

A ShopSmart data scientist trains both Multinomial NB and ComplementNB on an imbalanced review dataset (80% positive, 10% neutral, 10% negative). ComplementNB achieves higher recall on the minority classes. This is because:

  • A) ComplementNB uses a different smoothing strategy
  • B) ComplementNB estimates parameters from the complement of each class, reducing the bias toward the majority class
  • C) ComplementNB automatically resamples the training data
  • D) ComplementNB uses a different distance metric

Answer: B) ComplementNB estimates parameters from the complement of each class, reducing the bias toward the majority class. Standard Multinomial NB estimates P(w | c) from documents in class c. When class c has few documents, these estimates are noisy. ComplementNB instead estimates P(w | not c) from all documents NOT in class c, which provides more stable estimates for minority classes. This approach was specifically designed by Rennie et al. (2003) to address Multinomial NB's systematic bias toward classes with more training examples.
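A minimal sketch of ComplementNB in scikit-learn, on a deliberately imbalanced invented corpus (five positive reviews, one negative):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import ComplementNB

# Imbalanced illustrative corpus: mostly positive, a single negative review.
docs = ["love it", "great value", "works great", "love the design",
        "great product", "awful broken junk"]
labels = ["pos", "pos", "pos", "pos", "pos", "neg"]

vec = CountVectorizer()
X = vec.fit_transform(docs)

# Parameters for class c are estimated from all documents NOT in c,
# so the minority "neg" class borrows statistical strength from the
# much larger complement set.
clf = ComplementNB().fit(X, labels)
pred = clf.predict(vec.transform(["broken junk awful"]))[0]
```

Because "broken", "junk", and "awful" are essentially absent from the complement of the negative class, the negative class scores highest for this query despite having only one training document.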


Question 9 (Short Answer)

A TurbineTech engineer asks: "We have 847 sensor features per turbine. Should we use KNN for anomaly detection?" Provide a specific, actionable recommendation.

Answer: Not directly on all 847 features. At 847 dimensions, KNN suffers severely from the curse of dimensionality --- distances become meaningless and the anomaly detector will produce unreliable scores. The recommendation is to first apply dimensionality reduction: use PCA to reduce to 20-50 principal components that capture 95% of variance, or use domain knowledge to select the 15-20 most diagnostic sensors (vibration, bearing temperature, gearbox oil temperature). Then apply KNN anomaly detection on the reduced feature set, where distance calculations are meaningful and the method works well.
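The PCA-then-KNN recommendation can be sketched as follows. The synthetic sensor matrix, sample counts, and neighbor count are illustrative stand-ins, not real turbine data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 847))  # stand-in for 847 sensor readings per turbine

# Standardize, then keep enough principal components to explain 95% of
# the variance (PCA accepts a float in (0, 1) as a variance target).
reducer = make_pipeline(StandardScaler(), PCA(n_components=0.95, random_state=0))
X_red = reducer.fit_transform(X)

# Score each point by its distance to its 5th nearest neighbor in the
# reduced space (the 1st "neighbor" of a training point is itself).
nn = NearestNeighbors(n_neighbors=5).fit(X_red)
dists, _ = nn.kneighbors(X_red)
anomaly_score = dists[:, -1]  # larger distance => more anomalous
```

In production the reducer would be fit on known-healthy turbine data, and a score threshold chosen from the distribution of healthy scores.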


Question 10 (Multiple Choice)

Which statement about KNN is FALSE?

  • A) KNN can handle nonlinear decision boundaries without modification
  • B) KNN prediction time is independent of the training set size
  • C) KNN can be used for both classification and regression
  • D) KNN requires a distance metric to be specified

Answer: B) KNN prediction time is independent of the training set size. This is false. For brute-force KNN, prediction time is O(n * d) per query, where n is the number of training points and d is the number of features. Every training point must be examined to find the K nearest neighbors. While tree-based data structures (KD-tree, Ball tree) reduce this to approximately O(d * log(n)), prediction time still depends on n. This is KNN's fundamental scalability limitation.


Question 11 (Multiple Choice)

For binary classification with KNN, using an even value of K (e.g., K=10) can cause problems because:

  • A) Even values of K always produce lower accuracy
  • B) Ties are possible when 5 neighbors vote for each class, requiring a tiebreaking rule
  • C) Even values of K violate the algorithm's mathematical assumptions
  • D) Even values of K cause the decision boundary to be discontinuous

Answer: B) Ties are possible when 5 neighbors vote for each class, requiring a tiebreaking rule. With K=10 in binary classification, a 5-5 split is possible. Scikit-learn resolves such ties in favor of the class that comes first in the sorted list of class labels, which is arbitrary and can produce inconsistent predictions for borderline points. Using odd K values (5, 7, 9, ...) eliminates ties in binary classification. For multiclass problems, weights='distance' is a better tiebreaking strategy than relying on odd K.
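The tie and its distance-weighted resolution can be sketched directly. The neighbor labels and distances below are invented to produce an exact 5-5 split:

```python
from collections import Counter

# Labels and distances of the K=10 nearest neighbors (illustrative values
# chosen to create a 5-5 tie under uniform voting).
neighbors = [("pos", 0.2), ("pos", 0.3), ("pos", 0.4), ("pos", 0.5), ("pos", 0.6),
             ("neg", 0.8), ("neg", 0.9), ("neg", 1.0), ("neg", 1.1), ("neg", 1.2)]

# Uniform voting: 5 vs 5, so some arbitrary tiebreak rule must decide.
uniform = Counter(label for label, _ in neighbors)
tie = uniform["pos"] == uniform["neg"]

# Distance weighting: each neighbor votes with weight 1/distance, so
# closer neighbors dominate and exact ties become vanishingly rare.
weighted = Counter()
for label, dist in neighbors:
    weighted[label] += 1.0 / dist
winner = max(weighted, key=weighted.get)
```

This mirrors the effect of weights='distance' in scikit-learn's KNeighborsClassifier: the five closer positive neighbors outvote the five farther negative ones.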


Question 12 (Short Answer)

A colleague claims: "I used Naive Bayes to predict customer churn and it gave me a probability of 0.97 for a specific customer. I told the retention team that this customer has a 97% chance of churning." What is wrong with this claim?

Answer: Naive Bayes probability estimates are typically poorly calibrated because the independence assumption distorts probability magnitudes. A Naive Bayes output of 0.97 does not mean there is a 97% chance of churn --- it means the model ranks this customer as very likely to churn relative to other customers. The actual probability could be much lower. To obtain reliable probability estimates from Naive Bayes, apply post-hoc calibration using CalibratedClassifierCV with isotonic regression or Platt scaling, then validate the calibration on held-out data before communicating probabilities to stakeholders.
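A sketch of the calibration workflow with scikit-learn. The synthetic dataset below is a stand-in for a churn table, and the model choices (GaussianNB, isotonic regression, 5-fold CV) are illustrative defaults:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in for a churn dataset: 600 customers, 8 numeric features,
# with the label driven by the first two features plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 8))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=600) > 0).astype(int)

raw = GaussianNB().fit(X, y)

# Wrap the same model family in cross-validated isotonic calibration:
# the calibrator remaps raw NB scores onto honest probabilities.
calibrated = CalibratedClassifierCV(GaussianNB(), method="isotonic", cv=5).fit(X, y)

raw_p = raw.predict_proba(X)[:, 1]
cal_p = calibrated.predict_proba(X)[:, 1]
```

Only the calibrated probabilities (validated on held-out data, e.g. with a reliability curve) should be reported to the retention team; the raw NB outputs are useful for ranking customers, not for quoting a "97% chance".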


Question 13 (Multiple Choice)

You are building a text classifier for 50,000 customer support tickets with 500 labeled examples and a vocabulary of 20,000 words. Which model is the strongest default choice?

  • A) Gradient Boosting on TF-IDF features
  • B) Multinomial Naive Bayes on TF-IDF features
  • C) KNN on TF-IDF features
  • D) Deep learning (BERT) fine-tuned on the 500 examples

Answer: B) Multinomial Naive Bayes on TF-IDF features. With only 500 labeled examples and 20,000 features, Multinomial NB is the strongest default because: (1) it has very few parameters relative to the feature space, making overfitting unlikely; (2) it handles high-dimensional sparse data naturally; (3) Laplace smoothing provides built-in regularization. Gradient boosting would likely overfit on 500 examples with 20,000 features. KNN would suffer from the curse of dimensionality at 20,000 dimensions. Fine-tuning BERT on 500 examples can work but is far more complex and may overfit without careful regularization.


Question 14 (Short Answer)

Explain the difference between a "lazy learner" (KNN) and an "eager learner" (logistic regression). What are the practical consequences of each approach for training time, prediction time, and model storage?

Answer: An eager learner (logistic regression) invests computation during training to build a compact model (a weight vector), then makes predictions cheaply using that model. Training is expensive, prediction is fast, and storage is minimal (just the coefficients). A lazy learner (KNN) defers all computation to prediction time by memorizing the entire training set. Training is instant, prediction is expensive (must search all stored examples), and storage is proportional to the training set size. The practical consequence is that eager learners are preferred when prediction speed matters (real-time serving), while lazy learners can be useful when training speed matters (frequently updated models) and the training set is small enough that brute-force search is feasible.


Question 15 (Multiple Choice)

A data scientist runs Gaussian NB on a dataset where one feature (annual income) follows a log-normal distribution. The model performs poorly. The best fix is:

  • A) Switch to Multinomial NB
  • B) Apply a log transform or PowerTransformer to the feature before fitting Gaussian NB
  • C) Remove the feature entirely
  • D) Increase the training set size

Answer: B) Apply a log transform or PowerTransformer to the feature before fitting Gaussian NB. Gaussian NB assumes each feature is normally distributed within each class. A log-normal feature violates this assumption, causing the Gaussian likelihood calculation to assign incorrect probabilities. A log transform converts a log-normal distribution to a normal distribution, directly satisfying the model's assumption. PowerTransformer (Yeo-Johnson method) generalizes this to handle a wider range of skewed distributions. This is a targeted fix that addresses the root cause without discarding information.
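The effect of the log transform can be checked numerically: sample a log-normal "income" feature and compare its skewness before and after the transform. The distribution parameters are arbitrary illustrative choices:

```python
import math
import random

rng = random.Random(0)
# Log-normal "annual income" feature (illustrative parameters).
incomes = [rng.lognormvariate(10, 0.8) for _ in range(2000)]

def skewness(xs):
    """Sample skewness: third central moment over variance^(3/2)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    return sum((x - mean) ** 3 for x in xs) / (n * var ** 1.5)

before = skewness(incomes)                        # strongly right-skewed
after = skewness([math.log(x) for x in incomes])  # approximately symmetric
```

After the transform the feature is close to normal, so Gaussian NB's per-class normal likelihood becomes a reasonable model; PowerTransformer automates the same idea for distributions that a plain log does not fix.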

