Further Reading: Chapter 15
Naive Bayes and Nearest Neighbors
Foundational Papers
1. "On the Optimality of the Simple Bayesian Classifier under Zero-One Loss" --- Domingos and Pazzani (1997) The paper that explained why Naive Bayes works despite its violated assumptions. Domingos and Pazzani showed that NB is optimal (in terms of zero-one loss) not when features are independent, but when dependencies distribute evenly across classes --- a much weaker condition that holds surprisingly often. Published in Machine Learning, Vol. 29. This is the single most important paper for understanding Naive Bayes's empirical success.
2. "Tackling the Poor Assumptions of Naive Bayes Text Classifiers" --- Rennie, Shih, Teevan, and Karger (2003) The paper introducing Complement Naive Bayes (ComplementNB) and several other corrections to standard Multinomial NB. The authors demonstrated that NB's poor probability estimates stem from specific identifiable biases: skewed class sizes, burstiness of word distributions, and the document-length effect. Their corrected variants achieve accuracy competitive with SVMs on standard text benchmarks. Published in ICML 2003.
3. "Nearest Neighbor Pattern Classification" --- Cover and Hart (1967) The foundational paper on K-Nearest Neighbors. Cover and Hart proved that as the number of training samples approaches infinity, the error rate of the 1-NN classifier is bounded by twice the Bayes optimal error rate. This means 1-NN is at most twice as bad as the theoretical best, which is a remarkably strong guarantee for such a simple algorithm. Published in IEEE Transactions on Information Theory.
4. "When Is 'Nearest Neighbor' Meaningful?" --- Beyer, Goldstein, Ramakrishnan, and Shaft (1999) The paper that formalized the curse of dimensionality for nearest-neighbor methods. Beyer et al. proved that under broad conditions, as dimensionality increases, the ratio of nearest-to-farthest distance converges to 1.0, making distance-based methods degenerate. Essential reading for anyone deploying KNN in practice. Published in ICDT 1999.
Practical Guides and Tutorials
5. Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow --- Aurélien Géron (3rd edition, 2022) Chapter 3 covers classification with a strong Naive Bayes section using the MNIST dataset. The treatment of KNN in Chapter 2 (end-to-end project) and the discussion of instance-based learning provide practical code examples with scikit-learn. Géron's emphasis on pipelines and cross-validation aligns well with production practice.
6. An Introduction to Statistical Learning (ISLR) --- James, Witten, Hastie, Tibshirani (2nd edition, 2021) Section 4.4 covers Naive Bayes from first principles with clear mathematical exposition. Chapter 2 introduces KNN as the prototypical nonparametric method and uses it to illustrate the bias-variance tradeoff. The ISLP (Python edition, 2023) includes updated labs. Free at statlearning.com.
7. Scikit-learn Documentation --- Naive Bayes Module The scikit-learn documentation for sklearn.naive_bayes includes clear descriptions of Gaussian, Multinomial, Bernoulli, Complement, and Categorical NB variants, along with mathematical formulations and usage examples. The "User Guide" section on Naive Bayes explains the relationship between variants and when to use each. Available at scikit-learn.org.
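As a quick orientation, the sketch below pairs each variant with the feature type it models; the class names are scikit-learn's, but the random data is purely illustrative:

```python
# Sketch: choosing an NB variant by feature type. All four classes are
# real sklearn estimators; the data is random and purely illustrative.
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB, BernoulliNB, CategoricalNB

rng = np.random.default_rng(0)
y = rng.integers(0, 2, 100)

GaussianNB().fit(rng.normal(size=(100, 5)), y)         # continuous features
MultinomialNB().fit(rng.poisson(2, size=(100, 5)), y)  # counts (e.g. word counts)
BernoulliNB().fit(rng.integers(0, 2, (100, 5)), y)     # binary presence/absence
CategoricalNB().fit(rng.integers(0, 4, (100, 5)), y)   # categorical codes
```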
8. Scikit-learn Documentation --- Nearest Neighbors Module The sklearn.neighbors documentation covers KNN classification, regression, and unsupervised nearest-neighbor search, including algorithm selection (brute, kd_tree, ball_tree), distance metrics, and LocalOutlierFactor for anomaly detection. The "User Guide" section discusses computational complexity and scaling strategies. Available at scikit-learn.org.
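A minimal sketch of that API surface, on made-up data:

```python
# Sketch: the sklearn.neighbors API described above. The dataset is
# random; algorithm="auto" would pick brute force vs. a tree structure.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, NearestNeighbors

rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 8)), rng.integers(0, 3, 500)

clf = KNeighborsClassifier(n_neighbors=5, algorithm="ball_tree",
                           metric="euclidean").fit(X, y)
print(clf.predict(X[:3]))

# Unsupervised search: distances and indices of each query's 5 neighbors.
nn = NearestNeighbors(n_neighbors=5).fit(X)
dist, idx = nn.kneighbors(X[:3])
```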
Text Classification with Naive Bayes
9. "A Comparison of Event Models for Naive Bayes Text Classification" --- McCallum and Nigam (1998) The paper comparing the Multinomial and Multivariate Bernoulli event models for text classification. McCallum and Nigam showed that Multinomial NB outperforms Bernoulli NB on longer documents where word frequency carries information, while Bernoulli NB can be competitive on short documents. Published in AAAI-98 Workshop on Learning for Text Categorization.
10. "Spam Filtering with Naive Bayes --- Which Naive Bayes?" --- Metsis, Androutsopoulos, and Paliouras (2006) A systematic comparison of five NB variants (Multinomial, Multivariate Bernoulli, Boolean Multinomial, TF-IDF Multinomial, and a combined model) on spam filtering. The paper finds that Boolean Multinomial NB (binary word presence with multinomial model) is the strongest variant for spam detection, and that the choice of NB variant matters more than feature selection. Published at CEAS 2006.
11. "The Optimality of Naive Bayes" --- Zhang (2004) A more rigorous theoretical treatment of why Naive Bayes works. Zhang proved that NB is optimal not just when dependencies are evenly distributed, but under even weaker conditions related to the symmetry of dependency structures. Published in FLAIRS Conference 2004. More accessible than the Domingos and Pazzani paper for readers with a statistics background.
KNN in Practice
12. "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality" --- Indyk and Motwani (1998) The paper that introduced locality-sensitive hashing (LSH) for approximate nearest-neighbor search, trading exact answers for dramatic speedups in high dimensions. This is the theoretical foundation for modern approximate NN libraries like FAISS and Annoy. Published in STOC 1998.
13. FAISS Documentation --- Facebook AI Similarity Search FAISS is the production-grade library for nearest-neighbor search at scale. The documentation covers index types (flat, IVF, HNSW), GPU acceleration, and the accuracy-speed tradeoff for approximate search. If you are deploying KNN at scale (millions to billions of vectors), FAISS is the standard tool. Available at github.com/facebookresearch/faiss.
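A minimal usage sketch, assuming the faiss-cpu package and random data; IndexFlatL2 is the exact-search baseline that approximate indexes (IVF, HNSW) are measured against:

```python
# Sketch: minimal FAISS usage (pip install faiss-cpu). Data is random.
import numpy as np
import faiss

d = 128
xb = np.random.default_rng(0).random((100_000, d)).astype("float32")
xq = xb[:5]                       # query with the first few vectors

index = faiss.IndexFlatL2(d)      # exact L2 search, no training needed
index.add(xb)
distances, ids = index.search(xq, 10)   # top-10 neighbors per query
```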
14. Annoy Documentation --- Approximate Nearest Neighbors Oh Yeah (Spotify) Annoy is a C++ library with Python bindings designed for memory-efficient, read-only nearest-neighbor indices. It uses random projection trees and is optimized for the "build once, query many times" pattern common in recommendation systems. Simpler than FAISS and sufficient for datasets up to ~100M vectors. Available at github.com/spotify/annoy.
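A minimal usage sketch, assuming the annoy package; the vectors are random and the parameter choices arbitrary:

```python
# Sketch: minimal Annoy usage (pip install annoy). Vectors are random;
# "angular" is Annoy's cosine-like metric.
import random
from annoy import AnnoyIndex

f = 64                             # vector dimensionality
index = AnnoyIndex(f, "angular")
for i in range(10_000):
    index.add_item(i, [random.gauss(0, 1) for _ in range(f)])
index.build(10)                    # more trees: better recall, bigger index
index.save("vectors.ann")          # saved indices are mmap-ed, read-only

neighbor_ids = index.get_nns_by_item(0, 10)   # 10 nearest items to item 0
```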
Anomaly Detection with KNN
15. "LOF: Identifying Density-Based Local Outliers" --- Breunig, Kriegel, Ng, and Sander (2000) The paper introducing Local Outlier Factor (LOF), which extends KNN-based anomaly detection by comparing each point's local density to the local density of its neighbors. A point is anomalous if its density is significantly lower than its neighbors' density. LOF handles datasets with varying density regions, where simple distance thresholds fail. Published in ACM SIGMOD 2000.
16. Outlier Analysis --- Aggarwal (2017, 2nd edition) A comprehensive textbook on anomaly detection covering distance-based methods (KNN, LOF), density-based methods, clustering-based methods, and more. Chapter 3 covers proximity-based outlier detection in depth, including the theoretical properties of KNN-based anomaly scores and their behavior in high dimensions. Springer.
Deeper Theory
17. The Elements of Statistical Learning --- Hastie, Tibshirani, Friedman (2nd edition, 2009) Section 6.6 covers kernel density estimation and classification (part of the local-methods machinery underlying KNN). Section 2.5 covers local methods in high dimensions, with a discussion of the curse of dimensionality that is among the clearest in the literature. Chapter 13 covers prototype methods, including KNN and learning vector quantization. Free PDF at the authors' website.
18. Pattern Recognition and Machine Learning --- Bishop (2006) Section 4.2 covers probabilistic generative models including Naive Bayes, with a rigorous derivation from first principles. Section 2.5 covers nonparametric methods including KNN density estimation and classification. Bishop's treatment is more mathematical than ISLR but provides deep intuition about why these methods work.
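Bishop's KNN density estimate is short enough to sketch directly: p(x) is approximated by k / (n V), where V is the volume of the smallest ball around x containing its k nearest training points. The helper below is illustrative, not from the book:

```python
# Sketch: KNN density estimation in the spirit of Bishop Section 2.5.
# Note the estimate is not a proper density (it does not integrate to 1).
import numpy as np
from scipy.spatial import cKDTree
from scipy.special import gamma

def knn_density(train, query, k=10):
    n, d = train.shape
    dist, _ = cKDTree(train).query(query, k=k)
    r = dist[:, -1]                                 # radius to k-th neighbor
    ball_volume = np.pi ** (d / 2) / gamma(d / 2 + 1) * r ** d
    return k / (n * ball_volume)

x = np.random.default_rng(0).normal(size=(5000, 2))
print(knn_density(x, np.array([[0.0, 0.0], [3.0, 3.0]])))
```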
Calibration and Probability Estimation
19. "Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayes Classifiers" --- Zadrozny and Elkan (2001) The paper demonstrating that Naive Bayes produces poorly calibrated probabilities and proposing isotonic regression as a post-hoc calibration method. If you need actual probability estimates (not just rankings) from NB, this paper explains why calibration is necessary and how to do it. Published in ICML 2001.
20. "Predicting Good Probabilities with Supervised Learning" --- Niculescu-Mizil and Caruana (2005) A broader comparison of probability calibration across multiple algorithms (NB, trees, SVMs, neural nets, boosted ensembles). The paper shows that NB's probability estimates are among the worst (overly extreme --- pushed toward 0 and 1), while boosted trees and random forests are relatively well-calibrated. Essential context for understanding when to trust model probabilities. Published in ICML 2005.
How to Use This List
If you read nothing else, read Domingos and Pazzani (item 1) for the theory of why NB works, and Beyer et al. (item 4) for the theory of why KNN fails in high dimensions. Together they take about 3 hours and give you the conceptual foundation for both algorithms.
If you are building a text classifier, start with Rennie et al. (item 2) for improved NB variants and Metsis et al. (item 10) for the empirical comparison of NB variants on spam data.
If you are deploying KNN at scale, skip scikit-learn and go directly to the FAISS documentation (item 13). For anomaly detection specifically, read Breunig et al. (item 15) on LOF to understand density-adjusted anomaly scoring.
If you need calibrated probabilities from Naive Bayes, read Zadrozny and Elkan (item 19) and use CalibratedClassifierCV from scikit-learn.
This reading list supports Chapter 15: Naive Bayes and Nearest Neighbors. Return to the chapter to review concepts before diving in.