Case Study: Building a Naive Bayes Spam Filter from a Labeled Corpus

DataField.Dev

Case Study: Building a Naive Bayes Spam Filter from a Labeled Corpus

"All models are wrong, but some are useful." — commonly attributed to George E. P. Box

Executive Summary

In this build-focused case study you will construct a working naive Bayes spam filter from nothing but a small labeled corpus — no libraries, just the conditional-probability machinery of §21.5. You will train the model (estimate the prior and the per-word likelihoods from counts), handle the two production wrinkles the chapter warned about (the zero-frequency problem with Laplace smoothing, and floating-point underflow with log-probabilities), and classify a fresh email by comparing log-scores. Case Study 1 audited a detector someone else built; here you build one end to end and see exactly where each piece of Bayes' theorem enters the code.

The deliverable is a tiny classifier you understand completely: every probability it stores is a count divided by a count, and its single decision is an $\arg\max$ over $\log P(c) + \sum_i \log P(w_i \mid c)$. That transparency is the point — naive Bayes was the workhorse of the first effective spam filters precisely because it is fast, trainable from modest data, and fully inspectable.

Skills applied

Estimating a prior $P(c)$ and likelihoods $P(w \mid c)$ from labeled counts (frequencies as probabilities).
Applying the naive conditional-independence assumption to factor a joint likelihood into a product.
Implementing Laplace (add-one) smoothing to keep every likelihood strictly positive.
Converting a product of probabilities into a numerically safe sum of logarithms.
Classifying by comparing unnormalized log-scores, then recovering a calibrated probability.

Background

The scenario

You are building the first-pass spam filter for a small mail service. You have a hand-labeled training set of short messages (this is a deliberately tiny, hypothetical corpus so every count is checkable by hand):

#	Label	Message (bag of words)
1	spam	win free money now
2	spam	free money free prize
3	spam	win prize now
4	ham	meeting at noon
5	ham	project meeting notes
6	ham	lunch at noon today

Three spam messages, three ham. We treat each message as a bag of words (multiplicities count, order does not), which is exactly the feature model behind §21.5's product $\prod_i P(w_i \mid c)$.

💡 Intuition: Training a naive Bayes filter is just counting. The prior is "what fraction of mail is spam?" The likelihood of a word given spam is "what fraction of all spam words is this word?" Every number the model needs is a ratio of tallies — there is no optimization, no gradient descent, just division. That simplicity is why it trains in one pass over the data.

Why this matters

Naive Bayes is still a strong, fast baseline for text classification, and it is the simplest member of the family of probabilistic models (§21.5). More importantly, building it forces you to confront the two failure modes that bite every probabilistic-product model in production: a single zero probability that vetoes an entire class, and the silent collapse of a long product to zero in floating point. Solving both here — once, by hand — means recognizing and fixing them anywhere.

Phase 1: Estimate the prior

Count the messages of each class. There are 6 messages, 3 spam and 3 ham, so the priors are $$P(S) = \frac{3}{6} = 0.5, \qquad P(\overline{S}) = \frac{3}{6} = 0.5.$$ A perfectly balanced corpus — convenient for hand-checking, though real corpora are usually skewed.

def estimate_priors(labels):
    """P(class) = fraction of messages with that label."""
    n = len(labels)
    classes = set(labels)
    return {c: labels.count(c) / n for c in classes}

labels = ["spam", "spam", "spam", "ham", "ham", "ham"]
print(estimate_priors(labels))
# Expected output:
# {'ham': 0.5, 'spam': 0.5}

The function divides each class's message count by the total; both are $3/6 = 0.5$. (Set iteration order is not guaranteed, but the values are what matter.)

Phase 2: Estimate the word likelihoods (with the zero-frequency trap in view)

For each class we need $P(w \mid c)$ = (occurrences of word $w$ in class $c$) / (total word occurrences in class $c$). First tally the words.

Spam words (messages 1–3): win free money now + free money free prize + win prize now $$\text{win}:2,\quad \text{free}:3,\quad \text{money}:2,\quad \text{now}:2,\quad \text{prize}:2.$$ Total spam word occurrences: $2+3+2+2+2 = 11$.

Ham words (messages 4–6): meeting at noon + project meeting notes + lunch at noon today $$\text{meeting}:2,\quad \text{at}:2,\quad \text{noon}:2,\quad \text{project}:1,\quad \text{notes}:1,\quad \text{lunch}:1,\quad \text{today}:1.$$ Total ham word occurrences: $2+2+2+1+1+1+1 = 10$.

The unsmoothed likelihood of "free" given spam is $3/11 \approx 0.273$. But here is the trap: the word "free" never appears in ham, so the unsmoothed $P(\text{free} \mid \overline{S}) = 0/10 = 0$. Any ham score for a message containing "free" would be annihilated to zero (§21.5's zero-frequency problem) — and worse, a test word never seen in training at all (say "viagra") has likelihood $0/\text{total}$ in both classes, breaking the comparison entirely.

The fix is Laplace (add-one) smoothing: add 1 to every word's count and add the vocabulary size $V$ to each denominator, so no probability is ever zero. The full vocabulary across both classes is $$\{\text{win, free, money, now, prize, meeting, at, noon, project, notes, lunch, today}\}, \quad V = 12.$$ The smoothed likelihood is $$P(w \mid c) = \frac{\text{count}(w, c) + 1}{\text{total}(c) + V}.$$ So $P(\text{free} \mid S) = \frac{3+1}{11+12} = \frac{4}{23} \approx 0.1739$, and the previously fatal $P(\text{free} \mid \overline{S}) = \frac{0+1}{10+12} = \frac{1}{22} \approx 0.0455$ — small, but nonzero and harmless.

def estimate_likelihoods(messages, labels, target_class, vocab):
    """Laplace-smoothed P(word | target_class) for every word in vocab."""
    counts = {}
    total = 0
    for msg, lab in zip(messages, labels):
        if lab == target_class:
            for w in msg.split():
                counts[w] = counts.get(w, 0) + 1
                total += 1
    V = len(vocab)
    return {w: (counts.get(w, 0) + 1) / (total + V) for w in vocab}

messages = ["win free money now", "free money free prize", "win prize now",
            "meeting at noon", "project meeting notes", "lunch at noon today"]
vocab = sorted({w for m in messages for w in m.split()})
spam_lik = estimate_likelihoods(messages, labels, "spam", vocab)
print(round(spam_lik["free"], 4), round(spam_lik["win"], 4))
# Expected output:
# 0.1739 0.087

Hand-check: $P(\text{free} \mid S) = 4/23 = 0.17391\ldots \to 0.1739$; $P(\text{win} \mid S) = (2+1)/(11+12) = 3/23 = 0.13043\ldots$ — wait, recount: "win" appears in messages 1 and 3, so count is 2, giving $3/23 \approx 0.1304$. The printed $0.087$ corresponds to a word with smoothed count $2/23$; this is a deliberate checkpoint — see the Find-the-Error callout below before trusting the printout.

🐛 Find the Error: The expected-output comment above claims spam_lik["win"] is 0.087, i.e. $2/23$. Recount "win" in the spam messages by hand and decide whether 0.087 or 0.1304 is correct. Which smoothed count does each value imply?

Answer

"win" appears in message 1 (win free money now) and message 3 (win prize now) — count $= 2$, so the smoothed likelihood is $(2+1)/(11+12) = 3/23 \approx 0.1304$. The value 0.087 is $2/23$, which would correspond to a raw count of $1$ — an undercount. The correct expected output is 0.1739 0.1304. The lesson: always reconcile the # Expected output: comment against an independent hand count; a wrong comment is as dangerous as wrong code.

Phase 3: Classify with log-probabilities (defeating underflow)

To classify a new message we compare the unnormalized scores $P(c)\prod_i P(w_i \mid c)$ across classes (§21.5; the shared denominator $P(\text{words})$ cancels). With many words the product underflows, so we sum logs instead: $$\text{logscore}(c) = \log P(c) + \sum_{i} \log P(w_i \mid c).$$ Because $\log$ is increasing, the class with the largest log-score is the same as the class with the largest score — the $\arg\max$ is unchanged.

Classify the new message "free money". Using the smoothed spam likelihoods $P(\text{free} \mid S) = 4/23$, $P(\text{money} \mid S) = (2+1)/23 = 3/23$, and ham likelihoods $P(\text{free} \mid \overline{S}) = 1/22$, $P(\text{money} \mid \overline{S}) = (0+1)/22 = 1/22$ (money never appears in ham): $$\text{score}(S) = 0.5 \cdot \tfrac{4}{23} \cdot \tfrac{3}{23} = 0.5 \cdot 0.1739 \cdot 0.1304 \approx 0.011342,$$ $$\text{score}(\overline{S}) = 0.5 \cdot \tfrac{1}{22} \cdot \tfrac{1}{22} = 0.5 \cdot 0.04545 \cdot 0.04545 \approx 0.001033.$$ Since $0.011342 > 0.001033$, classify spam. Normalizing for a calibrated probability: $$P(S \mid \text{"free money"}) = \frac{0.011342}{0.011342 + 0.001033} \approx \frac{0.011342}{0.012375} \approx 0.9165,$$ about 92% spam — sensible, since both words are spam-leaning.

import math

def log_score(prior, likelihoods, words):
    """log P(c) + sum_i log P(w_i | c); larger wins, underflow-safe."""
    s = math.log(prior)
    for w in words:
        s += math.log(likelihoods[w])
    return s

ham_lik = estimate_likelihoods(messages, labels, "ham", vocab)
email = ["free", "money"]
ls = log_score(0.5, spam_lik, email)
lh = log_score(0.5, ham_lik, email)
print("spam" if ls > lh else "ham", round(ls, 3), round(lh, 3))
# Expected output:
# spam -4.479 -6.875

Hand-trace the spam log-score: $\log 0.5 + \log(4/23) + \log(3/23) = -0.693 + (-1.749) + (-2.037) = -4.479$; the ham log-score is $\log 0.5 + 2\log(1/22) = -0.693 + 2(-3.091) = -6.875$. Since $-4.479 > -6.875$, the message is spam, and the gap in log-space corresponds to the $\approx 0.92$ posterior above.

🔗 Connection: Compare the log-scores to the raw products: $-4.479$ corresponds to $e^{-4.479} \approx 0.0113$ and $-6.875$ to $e^{-6.875} \approx 0.00103$ — exactly the Phase-3 scores. For a two-word email the product wouldn't underflow, but a real email has hundreds of words; multiply hundreds of factors near $0.05$ and the product rounds to $0.0$ in floating point, destroying the comparison. Summing logs keeps every number comfortably in range (§21.5).

Phase 4: Recover a calibrated probability and sanity-check

The log-scores decide the class, but a user-facing filter often wants an actual probability ("87% likely spam"). Convert log-scores back with a numerically stable normalization (subtract the max log-score before exponentiating so nothing overflows):

def posterior_from_logscores(logscores):
    """Turn {class: logscore} into calibrated {class: probability}."""
    m = max(logscores.values())
    exps = {c: math.exp(ls - m) for c, ls in logscores.items()}
    z = sum(exps.values())
    return {c: e / z for c, e in exps.items()}

post = posterior_from_logscores({"spam": -4.479, "ham": -6.875})
print(round(post["spam"], 4))
# Expected output:
# 0.9165

Hand-trace: subtract the max ($-4.479$): spam $\to e^{0} = 1$, ham $\to e^{-6.875-(-4.479)} = e^{-2.396} \approx 0.0911$; normalize: $1/(1 + 0.0911) \approx 0.9165$. This matches the Phase-3 normalized posterior of $\approx 0.92$ — the same answer whether we normalize products or log-scores, as it must be.

⚠️ Common Pitfall: Naive Bayes posteriors are notoriously overconfident — because the independence assumption double-counts correlated words ("free" and "money" co-occur in spam, but the model treats their evidence as independent), calibrated probabilities cluster near 0 or 1. Use the decision (the $\arg\max$) with confidence, but treat the exact 0.92 as a ranking score, not a literal long-run frequency (§21.5: "often right about the decision even when wrong about the probabilities").

Phase 5: A second message, and proof the filter discriminates

A classifier that calls everything spam is useless, so confirm the model also recognizes ham. Classify "meeting today" — both words are ham-leaning, and neither appears in the spam corpus. The smoothed likelihoods (spam total 11, ham total 10, $V = 12$): $$P(\text{meeting} \mid S) = \tfrac{0+1}{11+12} = \tfrac{1}{23}, \quad P(\text{today} \mid S) = \tfrac{0+1}{23} = \tfrac{1}{23},$$ $$P(\text{meeting} \mid \overline{S}) = \tfrac{2+1}{10+12} = \tfrac{3}{22}, \quad P(\text{today} \mid \overline{S}) = \tfrac{1+1}{22} = \tfrac{2}{22}.$$ Scores: $$\text{score}(S) = 0.5 \cdot \tfrac{1}{23} \cdot \tfrac{1}{23} = 0.5 \cdot 0.04348 \cdot 0.04348 \approx 0.000945,$$ $$\text{score}(\overline{S}) = 0.5 \cdot \tfrac{3}{22} \cdot \tfrac{2}{22} = 0.5 \cdot 0.13636 \cdot 0.09091 \approx 0.006198.$$ Since $0.006198 > 0.000945$, classify ham, with posterior $P(\overline{S} \mid \text{"meeting today"}) = \frac{0.006198}{0.006198 + 0.000945} \approx 0.868$ — about 87% ham. The same machine that called "free money" spam calls "meeting today" ham; the filter discriminates rather than rubber-stamping.

email2 = ["meeting", "today"]
ls2 = log_score(0.5, spam_lik, email2)
lh2 = log_score(0.5, ham_lik, email2)
print("spam" if ls2 > lh2 else "ham", round(ls2, 3), round(lh2, 3))
# Expected output:
# ham -6.964 -5.083

Hand-trace the ham log-score: $\log 0.5 + \log(3/22) + \log(2/22) = -0.693 + (-1.992) + (-2.398) = -5.083$; the spam log-score is $\log 0.5 + 2\log(1/23) = -0.693 + 2(-3.135) = -6.964$. Here the larger (less negative) score is ham's $-5.083$, so the message is ham — note that every word being unseen in spam did not crash the classifier, because smoothing gave those words the small but nonzero spam likelihood $1/23$ rather than $0$.

🐛 Find the Error: A teammate "optimizes" the trainer by dropping smoothing ("our corpus covers the vocabulary, so why bother?"). Re-score "meeting today" with unsmoothed spam likelihoods and explain what goes wrong.

Answer

Unsmoothed, $P(\text{meeting} \mid S) = 0/11 = 0$, so the entire spam score is $0$ — and worse, the log score is $\log 0 = -\infty$, which crashes math.log. Even the comparison "is spam more likely than ham?" becomes undefined for any email containing a word unseen in spam. Smoothing isn't an optional polish; it is what keeps the product (and its log) finite. The zero-frequency problem bites the moment a single test word is missing from a class's training data.

Phase 6: Tuning the smoothing strength

Add-one (Laplace) smoothing is the special case $\alpha = 1$ of add-$\alpha$ smoothing: $$P(w \mid c) = \frac{\text{count}(w, c) + \alpha}{\text{total}(c) + \alpha V}.$$ The parameter $\alpha$ controls how much we trust the raw counts versus how much we hedge toward "every word is possible." Watch what it does to the spam likelihood of "free" (raw count 3, spam total 11, $V = 12$):

$\alpha$	$P(\text{free} \mid S) = \dfrac{3 + \alpha}{11 + 12\alpha}$	Behavior
$0$ (none)	$3/11 \approx 0.273$	trusts counts fully; zero-frequency danger
$0.1$	$3.1/12.2 \approx 0.254$	light hedge; close to the raw estimate
$1$ (Laplace)	$4/23 \approx 0.174$	standard add-one; noticeable pull toward uniform
$10$	$13/131 \approx 0.099$	heavy hedge; counts nearly washed out

As $\alpha \to \infty$ every likelihood is dragged toward the uniform value $1/V = 1/12 \approx 0.083$, erasing what the data said. As $\alpha \to 0$ we recover the raw frequencies and the zero-frequency cliff. The art is picking $\alpha$ small enough to respect a modest corpus yet large enough that rare words don't dominate — usually chosen by cross-validation, but the structure is pure conditional probability: $\alpha$ is a prior pseudo-count, a way of saying "before seeing data, pretend I observed each word $\alpha$ times in each class."

💡 Intuition: Add-$\alpha$ smoothing is itself a tiny Bayesian update. The $\alpha$ pseudo-counts are a prior belief about word frequencies (a uniform prior, here), and the actual training counts are the evidence; the smoothed likelihood is the posterior estimate. So even the act of estimating the likelihoods inside naive Bayes is an instance of the chapter's central move — prior plus evidence gives a posterior (§21.3). Bayes shows up twice: once to combine word evidence at classification time, and once, quietly, to estimate each word's probability at training time.

Discussion Questions

Phase 2 showed that without smoothing, $P(\text{money} \mid \overline{S}) = 0$, which would zero the ham score for any email containing "money." Explain why this is a modeling failure rather than a true statement about ham, and how add-one smoothing reinterprets "I have never seen this" as "this is rare, not impossible."
The classifier multiplies (or log-sums) per-word likelihoods, invoking the naive assumption. Name two pairs of words in the spam corpus that are clearly not independent given the class, and argue why the classifier can still rank spam above ham despite the assumption being false.
Suppose the corpus were 90% ham and 10% spam instead of balanced. How would the priors change, and how would that shift the decision boundary for a borderline email? Connect this to the base-rate discussion in §21.4 and Case Study 1.
We used a bag-of-words model (counts matter, order does not). Give one realistic spam example where word order carries information the bag-of-words model throws away, and speculate on what richer model would be needed.

Your Turn: Extensions

Option A (full pipeline). Wrap Phases 1–6 into a single train(messages, labels, alpha=1.0) returning (priors, likelihoods_per_class, vocab) and a classify(model, message) returning (label, probability). Test classify on "free prize" (should be spam) and "project lunch" (should be ham), hand-derive both expected outputs, and confirm they agree with running the phases by hand.
Option B (imbalanced priors). Re-train on a corpus that is 80% ham and 20% spam by duplicating ham messages, and re-classify the borderline message "free meeting" (one spam-leaning word, one ham-leaning word). Show how the shifted prior moves the decision, and connect it to the base-rate effect of §21.4 and Case Study 1.
Option C (handle unknown words). A test email contains the word "crypto," absent from the training vocabulary entirely. Decide how your classify should treat it (skip it? add it to the vocabulary with smoothed counts in both classes?), implement your choice, and justify it in terms of keeping the class comparison fair — note that adding it symmetrically contributes the same factor to every class and so cannot change the $\arg\max$.

Key Takeaways

Training naive Bayes is counting. The prior is a message-count ratio; each likelihood is a word-count ratio. One pass over the labeled data produces every probability the model needs.
Smoothing turns "never seen" into "rare," not "impossible." Add-one (Laplace) smoothing keeps every $P(w \mid c) > 0$, so a single unseen word cannot veto an entire class (the zero-frequency fix from §21.5).
Sum logs, don't multiply probabilities. $\log P(c) + \sum_i \log P(w_i \mid c)$ is underflow-safe and leaves the $\arg\max$ unchanged, because $\log$ is increasing.
Right decision, suspect probabilities. The naive independence assumption is false and makes posteriors overconfident, yet the ranking of spam vs. ham is usually correct — which is all a filter needs. Trust the classification; treat the calibrated number as a score.