Case Study 1: Bayesian Spam Classification

Overview

Spam filtering is one of the oldest and most successful applications of probabilistic machine learning. In this case study, we build a Naive Bayes spam classifier from scratch using only NumPy, applying the concepts of Bayes' theorem, conditional probability, conditional independence, and maximum likelihood estimation from Chapter 4. Despite its simplicity, Naive Bayes remains a competitive baseline for text classification and is used in production systems to this day.

By the end of this case study, you will have:

  • Implemented a complete Naive Bayes classifier using only NumPy
  • Understood the "naive" conditional independence assumption and its practical consequences
  • Applied Laplace smoothing to handle unseen words
  • Evaluated the classifier using accuracy and confusion matrices
  • Connected the mathematical theory to a working system

Background: The Spam Classification Problem

Given an email (or SMS message) represented as a collection of words, we want to determine whether it is spam ($C = 1$) or ham ($C = 0$). Using Bayes' theorem:

$$ P(C = 1 \mid \mathbf{w}) = \frac{P(\mathbf{w} \mid C = 1) \, P(C = 1)}{P(\mathbf{w})} $$

where $\mathbf{w} = (w_1, w_2, \ldots, w_d)$ is the vector of word indicators or counts.

The challenge is estimating $P(\mathbf{w} \mid C)$: the probability of an entire word pattern given the class. With a vocabulary of $V$ words, the joint distribution has $2^V$ possible patterns -- far too many to estimate from data.
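
If we somehow had these likelihoods, applying Bayes' theorem itself would be straightforward. Here is a minimal sketch of the posterior computation with made-up quantities (the likelihoods and prior below are illustrative, not estimated from any data):

# Illustrative (made-up) quantities for a single word pattern w
p_w_given_spam = 0.008   # P(w | C=1)
p_w_given_ham = 0.0005   # P(w | C=0)
p_spam = 0.3             # P(C=1)

# Bayes' theorem: P(C=1 | w) = P(w | C=1) P(C=1) / P(w), where
# P(w) = P(w | C=1) P(C=1) + P(w | C=0) P(C=0)
evidence = p_w_given_spam * p_spam + p_w_given_ham * (1 - p_spam)
posterior_spam = p_w_given_spam * p_spam / evidence
print(f"P(spam | w) = {posterior_spam:.3f}")  # roughly 0.87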

The Naive Bayes Solution

The naive assumption is that words are conditionally independent given the class:

$$ P(\mathbf{w} \mid C) = \prod_{j=1}^{d} P(w_j \mid C) $$

This reduces the number of parameters from exponential to linear in the vocabulary size. While the assumption is clearly wrong (the words "Nigerian" and "prince" are not independent in spam emails), it works surprisingly well in practice.
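
The scale of this reduction is easy to check: under the Bernoulli model, the factorized form needs one parameter per word per class, while the full joint needs a probability for essentially every one of the $2^V$ word patterns. A quick sketch, assuming an illustrative vocabulary of 50,000 words:

import math

V = 50_000                                # illustrative vocabulary size
naive_params = V                          # one Bernoulli parameter per word, per class
full_joint_digits = int(V * math.log10(2)) + 1

print(f"Naive Bayes: {naive_params} parameters per class")
print(f"Full joint: the number of patterns per class has about {full_joint_digits} digits")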

The classification rule becomes:

$$ \hat{C} = \arg\max_{c \in \{0, 1\}} \left[\log P(C = c) + \sum_{j=1}^{d} \log P(w_j \mid C = c)\right] $$

We work entirely in log-space for numerical stability.
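
Why log-space matters can be demonstrated directly: with a few hundred per-word probabilities, the raw product underflows to zero in double precision, while the sum of logs remains a finite, usable score (the probabilities below are synthetic):

import numpy as np

rng = np.random.default_rng(0)
probs = rng.uniform(0.001, 0.1, size=500)   # synthetic per-word likelihoods

direct_product = np.prod(probs)    # underflows: the true value is far below ~1e-308
log_score = np.sum(np.log(probs))  # a finite negative number, safe to compare

print(direct_product)   # 0.0
print(log_score)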

Parameter Estimation

We estimate parameters using MLE from the training data:

Class prior: $P(C = c) = \frac{N_c}{N}$ where $N_c$ is the number of documents in class $c$ and $N$ is the total.

Word likelihood (Bernoulli model): $P(w_j = 1 \mid C = c) = \frac{N_{jc}}{N_c}$ where $N_{jc}$ is the number of class-$c$ documents containing word $j$.
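
As a sanity check on these formulas, the following sketch computes the estimates by hand on a tiny invented dataset of four documents and three words:

import numpy as np

# Rows are documents, columns are binary word indicators (invented toy data)
X_toy = np.array([
    [1, 0, 1],   # spam
    [1, 1, 0],   # spam
    [0, 1, 0],   # ham
    [0, 0, 1],   # ham
])
y_toy = np.array([1, 1, 0, 0])

# Class prior: P(C=1) = N_1 / N
p_spam = np.mean(y_toy == 1)                    # 2 / 4 = 0.5

# Word likelihood for word j=0: P(w_0=1 | C=1) = N_{0,spam} / N_spam
n_spam = np.sum(y_toy == 1)
n_word0_in_spam = np.sum(X_toy[y_toy == 1, 0])
p_word0_given_spam = n_word0_in_spam / n_spam   # 2 / 2 = 1.0

print(p_spam, p_word0_given_spam)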

The Problem of Unseen Words

If a word never appears in spam training emails, its estimated probability is zero: $P(w_j = 1 \mid \text{spam}) = 0$. A single unseen word would then force $P(\text{spam} \mid \mathbf{w}) = 0$, regardless of all other evidence.

Laplace smoothing solves this by adding a pseudocount $\alpha$ (typically $\alpha = 1$):

$$ P(w_j = 1 \mid C = c) = \frac{N_{jc} + \alpha}{N_c + 2\alpha} $$

This corresponds to a MAP estimate with a Beta prior (Section 4.3.4), providing a Bayesian justification for this practical technique.
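
A quick sketch of the effect, with invented counts: a word that never appears in spam gets probability zero under the MLE, but a small positive probability once the pseudocount is added:

n_spam_docs = 300    # invented: number of spam training emails
n_jc = 0             # word j never appears in any spam training email
alpha = 1.0

mle_estimate = n_jc / n_spam_docs                        # 0.0 -- would veto all other evidence
smoothed = (n_jc + alpha) / (n_spam_docs + 2 * alpha)    # ~0.0033, small but nonzero

print(mle_estimate, smoothed)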

Implementation

The complete implementation is in code/case-study-code.py. Here we walk through the key components.

Step 1: Data Preparation

We create a synthetic dataset that mimics real spam/ham patterns:

import numpy as np
from typing import Dict, List, Tuple, Union


def create_dataset(
    n_emails: int = 1000,
    spam_ratio: float = 0.3,
    seed: int = 42,
) -> Tuple[List[List[str]], np.ndarray]:
    """Create a synthetic email dataset for spam classification.

    Args:
        n_emails: Total number of emails to generate.
        spam_ratio: Fraction of emails that are spam.
        seed: Random seed for reproducibility.

    Returns:
        Tuple of (emails, labels) where emails is a list of word lists
        and labels is a binary array (1 = spam, 0 = ham).
    """
    rng = np.random.default_rng(seed)

    spam_words = [
        "free", "winner", "click", "buy", "offer", "discount",
        "limited", "urgent", "cash", "prize", "congratulations",
        "deal", "cheap", "money", "credit", "loan",
    ]
    ham_words = [
        "meeting", "project", "report", "schedule", "team",
        "review", "update", "deadline", "budget", "presentation",
        "research", "analysis", "data", "results", "plan",
    ]
    shared_words = [
        "please", "hello", "thanks", "the", "and", "for",
        "this", "that", "with", "have", "from",
    ]

    emails = []
    labels = np.zeros(n_emails, dtype=int)

    for i in range(n_emails):
        is_spam = rng.random() < spam_ratio
        labels[i] = int(is_spam)

        if is_spam:
            n_words = rng.integers(5, 20)
            primary = rng.choice(spam_words, size=n_words // 2)
            filler = rng.choice(shared_words, size=n_words - n_words // 2)
            # Occasionally include ham words in spam
            if rng.random() < 0.2:
                bonus = rng.choice(ham_words, size=2)
                email = list(primary) + list(filler) + list(bonus)
            else:
                email = list(primary) + list(filler)
        else:
            n_words = rng.integers(8, 25)
            primary = rng.choice(ham_words, size=n_words // 2)
            filler = rng.choice(shared_words, size=n_words - n_words // 2)
            # Occasionally include spam words in ham
            if rng.random() < 0.1:
                bonus = rng.choice(spam_words, size=1)
                email = list(primary) + list(filler) + list(bonus)
            else:
                email = list(primary) + list(filler)

        emails.append(email)

    return emails, labels

Step 2: Feature Extraction

We convert emails into binary word-occurrence vectors:

def build_vocabulary(emails: List[List[str]]) -> Dict[str, int]:
    """Build a vocabulary mapping words to indices.

    Args:
        emails: List of emails, each as a list of words.

    Returns:
        Dictionary mapping each unique word to its index.
    """
    vocab = {}
    for email in emails:
        for word in email:
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def emails_to_features(
    emails: List[List[str]],
    vocab: Dict[str, int],
) -> np.ndarray:
    """Convert emails to binary feature matrix.

    Args:
        emails: List of emails, each as a list of words.
        vocab: Vocabulary mapping words to indices.

    Returns:
        Binary matrix of shape (n_emails, vocab_size).
    """
    n = len(emails)
    d = len(vocab)
    features = np.zeros((n, d), dtype=np.float64)

    for i, email in enumerate(emails):
        for word in email:
            if word in vocab:
                features[i, vocab[word]] = 1.0

    return features
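
A quick check of the resulting shapes (with the generator above, the vocabulary contains at most the 42 distinct words defined in create_dataset):

emails, labels = create_dataset(n_emails=1000, spam_ratio=0.3)
vocab = build_vocabulary(emails)
X = emails_to_features(emails, vocab)

print(len(vocab))   # at most 42 with the word lists above
print(X.shape)      # (1000, len(vocab))
print(X[0, :5])     # binary indicators for the first email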

Step 3: Naive Bayes Classifier

class NaiveBayesClassifier:
    """Bernoulli Naive Bayes classifier for text classification.

    Implements Bayes' theorem with the naive conditional independence
    assumption and Laplace smoothing.

    Attributes:
        alpha: Laplace smoothing parameter.
        log_prior: Log class prior probabilities, shape (2,).
        log_likelihood: Log word likelihoods, shape (2, vocab_size).
    """

    def __init__(self, alpha: float = 1.0) -> None:
        """Initialize the classifier.

        Args:
            alpha: Laplace smoothing parameter (default: 1.0).
        """
        self.alpha = alpha
        self.log_prior: np.ndarray = np.array([])
        self.log_likelihood: np.ndarray = np.array([])

    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        """Fit the model using maximum likelihood estimation.

        Args:
            X: Feature matrix of shape (n_samples, n_features).
            y: Labels of shape (n_samples,), values in {0, 1}.
        """
        n_samples, n_features = X.shape

        # Estimate class priors: P(C=c) = N_c / N
        n_class_0 = np.sum(y == 0)
        n_class_1 = np.sum(y == 1)
        self.log_prior = np.log(
            np.array([n_class_0 / n_samples, n_class_1 / n_samples])
        )

        # Estimate word likelihoods with Laplace smoothing
        # P(w_j=1 | C=c) = (N_jc + alpha) / (N_c + 2*alpha)
        self.log_likelihood = np.zeros((2, n_features))

        for c in range(2):
            mask = y == c
            n_c = np.sum(mask)
            word_counts = np.sum(X[mask], axis=0)  # N_jc for each word j

            prob = (word_counts + self.alpha) / (n_c + 2 * self.alpha)
            self.log_likelihood[c] = np.log(prob)

        # Also store log(1 - P(w_j=1 | C=c)) for absent words
        self._log_complement = np.log(1 - np.exp(self.log_likelihood))

    def predict_log_proba(self, X: np.ndarray) -> np.ndarray:
        """Compute log posterior probabilities for each class.

        Args:
            X: Feature matrix of shape (n_samples, n_features).

        Returns:
            Unnormalized log posterior scores of shape (n_samples, 2).
        """
        # log P(C=c | w) = log P(C=c) + sum_j [w_j * log P(w_j=1|C=c)
        #                                     + (1-w_j) * log P(w_j=0|C=c)]
        log_probs = np.zeros((X.shape[0], 2))

        for c in range(2):
            # Contribution from present words
            present = X @ self.log_likelihood[c]
            # Contribution from absent words
            absent = (1 - X) @ self._log_complement[c]
            log_probs[:, c] = self.log_prior[c] + present + absent

        return log_probs

    def predict(self, X: np.ndarray) -> np.ndarray:
        """Predict class labels.

        Args:
            X: Feature matrix of shape (n_samples, n_features).

        Returns:
            Predicted labels of shape (n_samples,).
        """
        log_probs = self.predict_log_proba(X)
        return np.argmax(log_probs, axis=1)
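
Note that predict_log_proba returns the log of the numerator of Bayes' theorem; the constant $\log P(\mathbf{w})$ is dropped because it does not affect the argmax. If calibrated posterior probabilities are wanted, they can be recovered by normalizing the two class scores. A minimal sketch (the helper name predict_proba is not part of the case-study code):

def predict_proba(clf: NaiveBayesClassifier, X: np.ndarray) -> np.ndarray:
    """Turn unnormalized log scores into posterior probabilities per class."""
    log_joint = clf.predict_log_proba(X)   # shape (n_samples, 2)
    # Subtract the row-wise max before exponentiating to avoid overflow,
    # then renormalize so each row sums to one (a softmax over the two classes).
    shifted = log_joint - log_joint.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    return probs / probs.sum(axis=1, keepdims=True)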

Step 4: Evaluation

def evaluate(
    y_true: np.ndarray,
    y_pred: np.ndarray,
) -> Dict[str, Union[float, np.ndarray]]:
    """Compute classification metrics.

    Args:
        y_true: True labels.
        y_pred: Predicted labels.

    Returns:
        Dictionary with accuracy, precision, recall, F1 score, and the confusion matrix.
    """
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))

    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall) > 0
        else 0.0
    )

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "confusion_matrix": np.array([[tn, fp], [fn, tp]]),
    }

Running the Experiment

# Generate data
emails, labels = create_dataset(n_emails=2000, spam_ratio=0.3)

# Build vocabulary and features
vocab = build_vocabulary(emails)
X = emails_to_features(emails, vocab)

# Train/test split (80/20)
n_train = int(0.8 * len(emails))
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = labels[:n_train], labels[n_train:]

# Train and evaluate
clf = NaiveBayesClassifier(alpha=1.0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

metrics = evaluate(y_test, y_pred)
print(f"Accuracy:  {metrics['accuracy']:.4f}")
print(f"Precision: {metrics['precision']:.4f}")
print(f"Recall:    {metrics['recall']:.4f}")
print(f"F1 Score:  {metrics['f1']:.4f}")

Analysis: Most Informative Features

We can identify which words are most indicative of spam by examining the log-likelihood ratio:

$$ \log \frac{P(w_j = 1 \mid \text{spam})}{P(w_j = 1 \mid \text{ham})} $$

Words with large positive values strongly indicate spam; words with large negative values strongly indicate ham.

# Log-likelihood ratio for each word
log_ratio = clf.log_likelihood[1] - clf.log_likelihood[0]
idx_to_word = {v: k for k, v in vocab.items()}

# Top 10 spam indicators
spam_indicators = np.argsort(log_ratio)[-10:][::-1]
print("Top spam indicators:")
for idx in spam_indicators:
    print(f"  {idx_to_word[idx]:20s} log-ratio: {log_ratio[idx]:.3f}")

# Top 10 ham indicators
ham_indicators = np.argsort(log_ratio)[:10]
print("\nTop ham indicators:")
for idx in ham_indicators:
    print(f"  {idx_to_word[idx]:20s} log-ratio: {log_ratio[idx]:.3f}")

Discussion

Why Does Naive Bayes Work Despite the Wrong Assumption?

The conditional independence assumption is clearly violated -- "buy" and "now" are more likely to co-occur in spam. Yet Naive Bayes often performs well because:

  1. Classification, not probability estimation. We only need to get the ranking right (spam score > ham score), not the exact probabilities. The naive assumption distorts probabilities but often preserves rankings, as the sketch after this list illustrates.

  2. High bias, low variance. The strong assumption reduces the number of parameters dramatically, preventing overfitting on small datasets.

  3. Errors cancel out. Dependencies between features can introduce errors in both directions, which tend to cancel out in the final sum.
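
The first point can be illustrated by duplicating a feature column, an extreme violation of conditional independence: the copied word's evidence is double-counted, pushing the scores further apart, yet the argmax decision usually does not change. A toy sketch reusing the trained classifier and data from the experiment above:

# Duplicate the first feature column in both splits (toy check, not part of the pipeline)
X_dup_train = np.hstack([X_train, X_train[:, :1]])
X_dup_test = np.hstack([X_test, X_test[:, :1]])

clf_dup = NaiveBayesClassifier(alpha=1.0)
clf_dup.fit(X_dup_train, y_train)

# Fraction of test emails whose predicted label is unchanged by the duplication
agreement = np.mean(clf_dup.predict(X_dup_test) == clf.predict(X_test))
print(f"Predictions unchanged for {agreement:.1%} of test emails")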

Connection to Information Theory

The log-likelihood ratio $\log \frac{P(w_j \mid \text{spam})}{P(w_j \mid \text{ham})}$ is closely related to the pointwise mutual information between the word and the class. Words with high mutual information with the spam class are the most discriminative features. In Chapter 7, we will see how mutual information can be used for feature selection more generally.
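
One way to make the connection explicit: the pointwise mutual information between the event $w_j = 1$ and a class $c$ is $\mathrm{PMI}(w_j; c) = \log \frac{P(w_j = 1 \mid C = c)}{P(w_j = 1)}$, so the log-likelihood ratio used above is exactly the difference of the two PMI values:

$$ \log \frac{P(w_j = 1 \mid \text{spam})}{P(w_j = 1 \mid \text{ham})} = \mathrm{PMI}(w_j; \text{spam}) - \mathrm{PMI}(w_j; \text{ham}) $$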

Practical Extensions

In production spam filters, several enhancements improve Naive Bayes:

  • Multinomial model: Use word counts instead of binary presence, with the vector of word counts following a multinomial distribution given the class (sketched after this list).
  • Feature engineering: Include non-word features like sender reputation, email metadata, and URL analysis.
  • Online learning: Update parameters incrementally as new labeled emails arrive.
  • Complementary Naive Bayes: Estimate parameters from the complement class to reduce bias.
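
As a flavor of the multinomial variant, here is a minimal sketch of its smoothed likelihood estimate, assuming a count matrix X_counts of shape (n_samples, vocab_size); the function below is illustrative and not part of the case-study code:

def multinomial_log_likelihood(
    X_counts: np.ndarray,
    y: np.ndarray,
    alpha: float = 1.0,
) -> np.ndarray:
    """Estimate log P(word j | class c) under a multinomial model with smoothing."""
    vocab_size = X_counts.shape[1]
    log_like = np.zeros((2, vocab_size))
    for c in range(2):
        counts_c = X_counts[y == c].sum(axis=0)   # total count of each word in class c
        total_c = counts_c.sum()                  # total number of word tokens in class c
        # Smoothed estimate: (count_jc + alpha) / (total_c + alpha * V)
        log_like[c] = np.log((counts_c + alpha) / (total_c + alpha * vocab_size))
    return log_like

At prediction time the score for class $c$ becomes $\log P(C = c) + \sum_j x_j \log P(\text{word } j \mid C = c)$, where $x_j$ is the count of word $j$ in the email.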

These ideas connect to the supervised learning techniques we will develop in Chapter 6 and the feature engineering strategies in Chapter 9.

Key Takeaways

  1. Bayes' theorem provides a principled framework for combining prior beliefs with observed evidence for classification.
  2. The naive conditional independence assumption makes high-dimensional probability estimation tractable by reducing exponential complexity to linear.
  3. Laplace smoothing is a practical application of MAP estimation with a Beta prior, preventing zero probabilities for unseen features.
  4. Working in log-space is essential for numerical stability when multiplying many small probabilities.
  5. Despite its simplicity, Naive Bayes is a surprisingly effective baseline that remains relevant in modern AI systems.