Case Study 1: ShopSmart Review Classification --- Naive Bayes as a Text Classification Workhorse


Background

ShopSmart processes 85,000 product reviews per day across its marketplace of 14 million monthly users. Two classification problems sit at the center of the review pipeline:

  1. Spam detection: Is this review genuine or fake/incentivized? Fake reviews erode buyer trust and expose ShopSmart to regulatory risk. The trust and safety team estimates that 8-12% of submitted reviews are non-genuine.

  2. Sentiment classification: Is this review positive, negative, or neutral? Sentiment labels power the product ranking algorithm, seller scorecards, and the "Review Highlights" feature on product pages.

The previous system was a set of hand-coded rules maintained by two engineers. It caught obvious spam ("I was paid to write this") and classified sentiment using keyword lists. The rules worked at launch. Five years later, with 200+ product categories, 40,000 active sellers, and reviews in increasingly varied language, the rule-based system misclassifies 22% of reviews. The trust team wants a machine learning replacement that can be retrained weekly as new labeled data accumulates.

The constraints are firm: the classifier must process at least 50,000 reviews per minute (the worst-case burst rate during ingestion spikes and backfills, far above the daily average), retrain in under 5 minutes on a single-core VM, and provide interpretable explanations for flagged reviews (because sellers can appeal).


The Data

ShopSmart's content moderation team has labeled 30,000 reviews over the past year. Each review has a text body, a star rating, and two labels: is_spam (binary) and sentiment (positive/neutral/negative).

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

np.random.seed(42)
n = 30000

# --- Generate synthetic ShopSmart review corpus ---

positive_templates = [
    "love this {product} works {adverb} great quality",
    "amazing {product} exactly what I needed {adverb} recommended",
    "excellent {product} fast shipping well packaged",
    "best {product} I have ever purchased {adverb} happy",
    "fantastic quality {product} exceeded my expectations",
    "{adverb} impressed with this {product} will buy again",
    "great value for the price this {product} is perfect",
    "very satisfied with my purchase this {product} is solid",
    "top notch {product} the quality is outstanding",
    "highly recommend this {product} to anyone looking for quality",
]

negative_templates = [
    "terrible {product} broke after {time} waste of money",
    "horrible quality {product} do not buy {adverb} disappointed",
    "worst {product} I have ever purchased returning immediately",
    "cheaply made {product} fell apart after {time}",
    "{product} does not work as described total scam",
    "very disappointed with this {product} not worth the price",
    "{product} arrived damaged and customer service was unhelpful",
    "poor quality {product} looks nothing like the pictures",
    "regret buying this {product} complete waste of money",
    "do not waste your money on this {product} {adverb} bad",
]

neutral_templates = [
    "{product} is okay for the price nothing special",
    "decent {product} does the job not amazing not terrible",
    "average quality {product} meets basic expectations",
    "the {product} is fine works as expected",
    "acceptable {product} for the price point standard quality",
    "its an okay {product} nothing to write home about",
    "the {product} does what it says average performance",
    "fair {product} for the money could be better",
    "standard {product} no complaints but nothing impressive",
    "mediocre {product} serves its purpose adequately",
]

spam_templates = [
    "I received this {product} for free in exchange for my honest review {adverb} great",
    "the seller contacted me and offered a gift card if I left a {sentiment} review",
    "best {product} ever five stars buy now click the link in my profile",
    "I was compensated for this review but my opinion is my own great {product}",
    "amazing {product} amazing quality amazing price amazing seller five stars",
    "this {product} is the greatest invention of all time buy ten of them now",
    "five stars five stars five stars excellent {product} best ever made",
    "I received a discount code for posting this review the {product} is good",
    "sponsored review this {product} was provided free by the manufacturer",
    "the seller asked me to change my review to five stars for a refund",
]

products = ['headphones', 'charger', 'case', 'stand', 'cable', 'adapter',
            'speaker', 'keyboard', 'mouse', 'monitor', 'lamp', 'chair',
            'backpack', 'bottle', 'mat', 'pillow', 'blanket', 'organizer']
adverbs = ['really', 'absolutely', 'incredibly', 'very', 'so', 'truly',
           'extremely', 'remarkably', 'genuinely', 'totally']
times = ['one week', 'two days', 'a month', 'three uses', 'first use']
sentiments_word = ['positive', 'five star', 'good', 'great']

rng = np.random.RandomState(42)

def fill_template(template, rng):
    return (template
            .replace('{product}', rng.choice(products))
            .replace('{adverb}', rng.choice(adverbs))
            .replace('{time}', rng.choice(times))
            .replace('{sentiment}', rng.choice(sentiments_word)))

reviews = []
sentiments = []
is_spam = []

for i in range(n):
    spam = rng.random() < 0.10  # 10% spam rate

    if spam:
        template = rng.choice(spam_templates)
        text = fill_template(template, rng)
        # Spam reviews often add extra filler
        if rng.random() < 0.5:
            text += ' ' + fill_template(rng.choice(positive_templates), rng)
        reviews.append(text)
        sentiments.append('positive')  # Most spam is fake-positive
        is_spam.append(1)
    else:
        r = rng.random()
        if r < 0.55:
            template = rng.choice(positive_templates)
            sentiments.append('positive')
        elif r < 0.80:
            template = rng.choice(negative_templates)
            sentiments.append('negative')
        else:
            template = rng.choice(neutral_templates)
            sentiments.append('neutral')
        reviews.append(fill_template(template, rng))
        is_spam.append(0)

df = pd.DataFrame({
    'text': reviews,
    'sentiment': sentiments,
    'is_spam': is_spam
})

print(f"Dataset: {len(df)} reviews")
print("\nSpam distribution:")
print(df['is_spam'].value_counts(normalize=True).to_string())
print("\nSentiment distribution (non-spam only):")
print(df[df['is_spam'] == 0]['sentiment'].value_counts(normalize=True).to_string())
Dataset: 30000 reviews

Spam distribution:
0    0.9008
1    0.0992

Sentiment distribution (non-spam only):
positive    0.5498
negative    0.2498
neutral     0.2004

Task 1: Spam Detection

The spam detection pipeline needs to be fast and interpretable. Naive Bayes is the natural starting point.
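Before reaching for scikit-learn, it helps to see the arithmetic Naive Bayes actually performs. The sketch below uses an invented four-word vocabulary (toy counts, not ShopSmart data) to score one message with Laplace smoothing and the 10% spam prior:

```python
import math

# Toy word counts per class (invented numbers, for illustration only)
spam_counts = {'free': 4, 'gift': 3, 'great': 2, 'quality': 1}
ham_counts = {'great': 5, 'quality': 6}
prior = {'spam': 0.10, 'ham': 0.90}   # matches the 10% spam base rate
vocab = set(spam_counts) | set(ham_counts)
alpha = 1.0                            # Laplace smoothing

def log_likelihood(words, counts):
    """Sum of smoothed log P(word | class) over the message's words."""
    total = sum(counts.values())
    return sum(
        math.log((counts.get(w, 0) + alpha) / (total + alpha * len(vocab)))
        for w in words
    )

msg = ['free', 'gift', 'great']
scores = {
    c: math.log(prior[c]) + log_likelihood(msg, counts)
    for c, counts in [('spam', spam_counts), ('ham', ham_counts)]
}

# Convert log scores to posterior probabilities (shift by the max for stability)
m = max(scores.values())
z = sum(math.exp(s - m) for s in scores.values())
posterior = {c: math.exp(s - m) / z for c, s in scores.items()}
print(posterior)   # spam edges out ham (~0.58) despite the 10% prior
```

Everything the production model does is this computation at scale: one log-probability lookup per token, summed. That is where both the speed and the word-level interpretability come from.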

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    classification_report, roc_auc_score, average_precision_score
)
from sklearn.pipeline import Pipeline
import time

X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['is_spam'], test_size=0.25, random_state=42,
    stratify=df['is_spam']
)

# --- Model comparison ---
models = {
    'MultinomialNB': Pipeline([
        ('tfidf', TfidfVectorizer(
            ngram_range=(1, 2), max_features=10000,
            min_df=3, max_df=0.95
        )),
        ('clf', MultinomialNB(alpha=0.3))
    ]),
    'ComplementNB': Pipeline([
        ('tfidf', TfidfVectorizer(
            ngram_range=(1, 2), max_features=10000,
            min_df=3, max_df=0.95
        )),
        ('clf', ComplementNB(alpha=0.3))
    ]),
    'LogisticRegression': Pipeline([
        ('tfidf', TfidfVectorizer(
            ngram_range=(1, 2), max_features=10000,
            min_df=3, max_df=0.95
        )),
        ('clf', LogisticRegression(
            C=1.0, max_iter=1000, random_state=42, class_weight='balanced'
        ))
    ]),
}

print("--- Spam Detection: Model Comparison ---\n")
print(f"{'Model':<22}{'AUC-ROC':<10}{'Avg Prec':<10}{'Train (ms)':<12}{'Predict (ms)':<14}")
print("-" * 68)

for name, pipe in models.items():
    t0 = time.perf_counter()
    pipe.fit(X_train, y_train)
    train_ms = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    y_pred = pipe.predict(X_test)
    if hasattr(pipe.named_steps['clf'], 'predict_proba'):
        y_prob = pipe.predict_proba(X_test)[:, 1]
    else:
        y_prob = pipe.decision_function(X_test)
    pred_ms = (time.perf_counter() - t0) * 1000

    auc = roc_auc_score(y_test, y_prob)
    ap = average_precision_score(y_test, y_prob)
    print(f"{name:<22}{auc:<10.3f}{ap:<10.3f}{train_ms:<12.1f}{pred_ms:<14.1f}")
--- Spam Detection: Model Comparison ---

Model                 AUC-ROC   Avg Prec  Train (ms)  Predict (ms)
--------------------------------------------------------------------
MultinomialNB         0.993     0.962     58.3        6.2
ComplementNB          0.992     0.958     56.8        6.4
LogisticRegression    0.996     0.978     284.1       5.1

All three models achieve excellent spam detection (AUC > 0.99). Logistic regression wins by a small margin but trains 5x slower. For ShopSmart's constraint of retraining weekly on a single-core VM, this difference is negligible at 30,000 reviews. But if the training corpus grows to millions, the NB speed advantage becomes meaningful.
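One way to realize that speed advantage at larger scale: MultinomialNB supports `partial_fit`, so the corpus never needs to fit in memory. A minimal out-of-core sketch (the chunked stream and texts are invented; a stateless `HashingVectorizer` stands in for the fitted TF-IDF so each chunk can be transformed independently):

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# HashingVectorizer is stateless, so each chunk is transformed independently --
# no vocabulary-building pass over the full corpus is needed.
# alternate_sign=False keeps features non-negative, as MultinomialNB requires.
vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
nb = MultinomialNB(alpha=0.3)

# Hypothetical chunked stream; in production each chunk would come from
# a database cursor or file reader, not an in-memory list.
chunks = [
    (["free gift card for honest review", "great quality love it"], [1, 0]),
    (["seller offered a discount code", "works as expected decent value"], [1, 0]),
]

for texts, labels in chunks:
    nb.partial_fit(vectorizer.transform(texts), labels, classes=[0, 1])

pred = nb.predict(vectorizer.transform(["received a gift card for this review"]))
print(pred)   # [1] -- flagged as spam
```

The tradeoff is that hashed features lose the word names that make the appeals workflow below possible, so this variant fits the throughput story better than the interpretability one.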


Task 2: Feature Interpretability for Spam Appeals

When a seller's review is flagged as spam, they can appeal. The content moderation team needs to explain why the model flagged it. Naive Bayes provides this naturally.

# Train the spam detector
spam_pipe = models['MultinomialNB']
spam_pipe.fit(X_train, y_train)

tfidf = spam_pipe.named_steps['tfidf']
nb = spam_pipe.named_steps['clf']

# Extract feature log-probabilities
feature_names = tfidf.get_feature_names_out()
log_prob_spam = nb.feature_log_prob_[1]   # log P(word | spam)
log_prob_ham = nb.feature_log_prob_[0]    # log P(word | not spam)

# Ratio: how much more likely a word is in spam vs. ham
log_ratio = log_prob_spam - log_prob_ham
word_importance = pd.DataFrame({
    'word': feature_names,
    'log_ratio_spam_vs_ham': log_ratio,
    'P(w|spam)': np.exp(log_prob_spam),
    'P(w|ham)': np.exp(log_prob_ham),
})

# Top spam indicators
print("--- Top 15 Spam Indicators ---\n")
top_spam = word_importance.nlargest(15, 'log_ratio_spam_vs_ham')
for _, row in top_spam.iterrows():
    print(f"  {row['word']!r:<27} "
          f"P(w|spam)={row['P(w|spam)']:.4f}  "
          f"P(w|ham)={row['P(w|ham)']:.4f}  "
          f"ratio={np.exp(row['log_ratio_spam_vs_ham']):.1f}x")

# Top ham (legitimate) indicators
print("\n--- Top 15 Legitimate Review Indicators ---\n")
top_ham = word_importance.nsmallest(15, 'log_ratio_spam_vs_ham')
for _, row in top_ham.iterrows():
    ratio = np.exp(-row['log_ratio_spam_vs_ham'])
    print(f"  {row['word']!r:<27} "
          f"P(w|ham)={row['P(w|ham)']:.4f}  "
          f"P(w|spam)={row['P(w|spam)']:.4f}  "
          f"ratio={ratio:.1f}x")
--- Top 15 Spam Indicators ---

  'for free'                P(w|spam)=0.0178  P(w|ham)=0.0003  ratio=53.1x
  'in exchange'             P(w|spam)=0.0121  P(w|ham)=0.0003  ratio=36.2x
  'compensated'             P(w|spam)=0.0098  P(w|ham)=0.0003  ratio=29.3x
  'gift card'               P(w|spam)=0.0094  P(w|ham)=0.0003  ratio=28.1x
  'five stars five'         P(w|spam)=0.0089  P(w|ham)=0.0003  ratio=26.6x
  'honest review'           P(w|spam)=0.0082  P(w|ham)=0.0003  ratio=24.5x
  'received this'           P(w|spam)=0.0075  P(w|ham)=0.0004  ratio=20.4x
  'free in'                 P(w|spam)=0.0071  P(w|ham)=0.0004  ratio=19.3x
  'sponsored'               P(w|spam)=0.0064  P(w|ham)=0.0004  ratio=17.4x
  'link in'                 P(w|spam)=0.0058  P(w|ham)=0.0004  ratio=15.8x
  'seller asked'            P(w|spam)=0.0052  P(w|ham)=0.0003  ratio=15.5x
  'change my review'        P(w|spam)=0.0048  P(w|ham)=0.0004  ratio=13.1x
  'discount code'           P(w|spam)=0.0045  P(w|ham)=0.0004  ratio=12.3x
  'buy ten'                 P(w|spam)=0.0041  P(w|ham)=0.0004  ratio=11.2x
  'greatest invention'      P(w|spam)=0.0038  P(w|ham)=0.0003  ratio=10.9x

--- Top 15 Legitimate Review Indicators ---

  'broke after'             P(w|ham)=0.0072  P(w|spam)=0.0004  ratio=19.6x
  'waste of money'          P(w|ham)=0.0061  P(w|spam)=0.0004  ratio=16.6x
  'fell apart'              P(w|ham)=0.0054  P(w|spam)=0.0003  ratio=15.6x
  'not worth'               P(w|ham)=0.0047  P(w|spam)=0.0003  ratio=13.6x
  'does not work'           P(w|ham)=0.0045  P(w|spam)=0.0004  ratio=12.3x
  'returning immediately'   P(w|ham)=0.0041  P(w|spam)=0.0003  ratio=11.8x
  'arrived damaged'         P(w|ham)=0.0038  P(w|spam)=0.0003  ratio=11.0x
  'cheaply made'            P(w|ham)=0.0036  P(w|spam)=0.0003  ratio=10.4x
  'okay for'                P(w|ham)=0.0033  P(w|spam)=0.0004  ratio=9.0x
  'nothing special'         P(w|ham)=0.0031  P(w|spam)=0.0004  ratio=8.4x
  'does the job'            P(w|ham)=0.0029  P(w|spam)=0.0004  ratio=7.9x
  'disappointed'            P(w|ham)=0.0049  P(w|spam)=0.0007  ratio=7.4x
  'poor quality'            P(w|ham)=0.0043  P(w|spam)=0.0006  ratio=6.9x
  'nothing to write'        P(w|ham)=0.0024  P(w|spam)=0.0004  ratio=6.5x
  'terrible'                P(w|ham)=0.0055  P(w|spam)=0.0009  ratio=6.3x

The interpretability is immediate and actionable. The top spam indicators ("for free," "in exchange," "compensated," "gift card") directly reflect incentivized-review language. The top "legitimate" indicators are dominated by negative-review phrases: spam is almost never negative, because fake reviews exist to inflate ratings, and that asymmetry is itself a useful signal.

Production Tip --- This word-level interpretability is why Naive Bayes remains the default for content moderation pipelines at many companies. When a seller appeals a flagged review, the moderation team can say: "Your review was flagged because it contained the phrases 'received this for free' and 'in exchange for honest review,' which appear 53x and 36x more often in incentivized reviews than in genuine reviews." This is a concrete, defensible explanation --- something a gradient boosting SHAP explanation cannot match for text data.


Task 3: Sentiment Classification

Sentiment classification is the second pipeline stage, applied only to reviews that pass the spam filter.

from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
import time

# Filter to non-spam reviews
legit = df[df['is_spam'] == 0].copy()

X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    legit['text'], legit['sentiment'],
    test_size=0.25, random_state=42, stratify=legit['sentiment']
)

# ComplementNB for imbalanced sentiment classes
sentiment_models = {
    'ComplementNB': Pipeline([
        ('tfidf', TfidfVectorizer(
            ngram_range=(1, 2), max_features=8000, min_df=2, max_df=0.95
        )),
        ('clf', ComplementNB(alpha=0.5))
    ]),
    'MultinomialNB': Pipeline([
        ('tfidf', TfidfVectorizer(
            ngram_range=(1, 2), max_features=8000, min_df=2, max_df=0.95
        )),
        ('clf', MultinomialNB(alpha=0.5))
    ]),
    'LogisticRegression': Pipeline([
        ('tfidf', TfidfVectorizer(
            ngram_range=(1, 2), max_features=8000, min_df=2, max_df=0.95
        )),
        ('clf', LogisticRegression(
            C=1.0, max_iter=1000, random_state=42, class_weight='balanced'
        ))
    ]),
}

print("--- Sentiment Classification ---\n")
for name, pipe in sentiment_models.items():
    t0 = time.perf_counter()
    pipe.fit(X_train_s, y_train_s)
    train_ms = (time.perf_counter() - t0) * 1000

    t0 = time.perf_counter()
    y_pred_s = pipe.predict(X_test_s)
    pred_ms = (time.perf_counter() - t0) * 1000

    print(f"=== {name} (train: {train_ms:.1f}ms, predict: {pred_ms:.1f}ms) ===")
    print(classification_report(y_test_s, y_pred_s, digits=3))
--- Sentiment Classification ---

=== ComplementNB (train: 42.6ms, predict: 4.8ms) ===
              precision    recall  f1-score   support

    negative      0.956     0.972     0.964      1687
     neutral      0.930     0.905     0.917      1352
    positive      0.977     0.975     0.976      3712

    accuracy                          0.962      6751
   macro avg      0.954     0.951     0.952      6751
weighted avg      0.962     0.962     0.962      6751

=== MultinomialNB (train: 40.3ms, predict: 4.6ms) ===
              precision    recall  f1-score   support

    negative      0.952     0.970     0.961      1687
     neutral      0.918     0.896     0.907      1352
    positive      0.975     0.971     0.973      3712

    accuracy                          0.958      6751
   macro avg      0.948     0.946     0.947      6751
weighted avg      0.958     0.958     0.958      6751

=== LogisticRegression (train: 198.4ms, predict: 3.9ms) ===
              precision    recall  f1-score   support

    negative      0.964     0.979     0.971      1687
     neutral      0.943     0.921     0.932      1352
    positive      0.981     0.980     0.981      3712

    accuracy                          0.969      6751
   macro avg      0.963     0.960     0.961      6751
weighted avg      0.969     0.969     0.969      6751

ComplementNB edges out standard MultinomialNB by 0.4 accuracy points (0.962 vs 0.958), with the biggest gain on the minority neutral class (0.917 vs 0.907 F1). Logistic regression wins overall by a further 0.7 points, but trains roughly 5x slower.
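The classification reports show that neutral is the weakest class, but not where its errors go. A confusion matrix makes that visible; the sketch below uses invented labels to illustrate the typical failure mode, and the same `confusion_matrix` call applies directly to `y_test_s` and the pipeline's predictions:

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Invented labels illustrating the typical failure mode: neutral reviews
# leaking into the positive and negative classes, rarely the reverse.
y_true = ['neutral'] * 10 + ['positive'] * 10 + ['negative'] * 10
y_pred = (['neutral'] * 7 + ['positive'] * 2 + ['negative'] * 1
          + ['positive'] * 10
          + ['negative'] * 9 + ['neutral'] * 1)

labels = ['negative', 'neutral', 'positive']
cm = confusion_matrix(y_true, y_pred, labels=labels)
print(pd.DataFrame(cm, index=labels, columns=labels))  # rows = true, cols = predicted
```

If most neutral errors land in the positive column, that matters for the product: a misread neutral review inflates seller scorecards, while a misread negative one merely mutes a complaint.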


Task 4: The Full Pipeline and Throughput Benchmark

import time

# Build the two-stage pipeline
spam_model = Pipeline([
    ('tfidf', TfidfVectorizer(
        ngram_range=(1, 2), max_features=10000,
        min_df=3, max_df=0.95
    )),
    ('clf', MultinomialNB(alpha=0.3))
])

sentiment_model = Pipeline([
    ('tfidf', TfidfVectorizer(
        ngram_range=(1, 2), max_features=8000,
        min_df=2, max_df=0.95
    )),
    ('clf', ComplementNB(alpha=0.5))
])

# Train both stages
spam_model.fit(X_train, y_train)
sentiment_model.fit(X_train_s, y_train_s)

# Throughput benchmark: classify 10,000 reviews
test_batch = list(df['text'].sample(10000, random_state=42))

t0 = time.perf_counter()

# Stage 1: Spam filter
spam_predictions = spam_model.predict(test_batch)
legit_indices = [i for i, pred in enumerate(spam_predictions) if pred == 0]
legit_texts = [test_batch[i] for i in legit_indices]

# Stage 2: Sentiment (only for non-spam)
if legit_texts:
    sentiment_predictions = sentiment_model.predict(legit_texts)

elapsed = time.perf_counter() - t0
reviews_per_second = 10000 / elapsed
reviews_per_minute = reviews_per_second * 60

print("--- Full Pipeline Throughput ---")
print(f"Reviews processed:    10,000")
print(f"Total time:           {elapsed * 1000:.1f} ms")
print(f"Reviews/second:       {reviews_per_second:,.0f}")
print(f"Reviews/minute:       {reviews_per_minute:,.0f}")
print(f"Target:               50,000/minute")
print(f"Meets requirement:    {'Yes' if reviews_per_minute > 50000 else 'No'}")
print(f"\nSpam flagged:         {sum(spam_predictions)}/{len(spam_predictions)} "
      f"({sum(spam_predictions)/len(spam_predictions)*100:.1f}%)")
--- Full Pipeline Throughput ---
Reviews processed:    10,000
Total time:           38.4 ms
Reviews/second:       260,417
Reviews/minute:       15,625,000
Target:               50,000/minute
Meets requirement:    Yes

Spam flagged:         984/10000 (9.8%)

The Naive Bayes pipeline processes 15.6 million reviews per minute on a single core --- 312x faster than the 50,000/minute requirement. This massive headroom means the system can run on minimal infrastructure and absorb traffic spikes without autoscaling.
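One more knob the headroom enables: because every flagged review can trigger an appeal, the default 0.5 decision threshold is not sacred. The sketch below uses invented scores standing in for the spam model's `predict_proba` output to show the precision/recall trade as the flagging threshold rises:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Invented spam probabilities standing in for predict_proba output:
# spam scores cluster near 0.9, ham scores near 0.1, with noise.
rng = np.random.RandomState(0)
y_true = (rng.random(1000) < 0.10).astype(int)
y_prob = np.clip(y_true * 0.8 + rng.normal(0.1, 0.15, 1000), 0, 1)

for threshold in [0.5, 0.7, 0.9]:
    y_flag = (y_prob >= threshold).astype(int)
    print(f"threshold={threshold:.1f}  "
          f"precision={precision_score(y_true, y_flag):.3f}  "
          f"recall={recall_score(y_true, y_flag):.3f}")
```

Raising the threshold trades missed spam for fewer false flags and fewer appeals; the right operating point is a product decision, not a modeling one.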


Discussion Questions

  1. The speed-accuracy tradeoff is real. Logistic regression achieved 0.7 points higher sentiment accuracy than ComplementNB. At ShopSmart's scale (85,000 reviews/day), that is roughly 595 additional correct classifications per day. Is that worth the 5x training-time increase and the added model complexity? Under what circumstances would the answer change?

  2. Interpretability is a product requirement, not an academic nicety. ShopSmart sellers can appeal flagged reviews. The Naive Bayes feature importance (word-level log probability ratios) provides a clear, defensible explanation: "This review was flagged because it contains phrases that appear 53x more often in incentivized reviews." Could you provide an equally clear explanation with gradient boosting?

  3. The two-stage pipeline separates concerns. Spam detection and sentiment classification are trained independently, on different subsets, with different features. What are the advantages and risks of this modular approach versus a single model that predicts both labels simultaneously?

  4. Weekly retraining matters. ShopSmart's spam patterns evolve as spammers adapt. Naive Bayes retrains in under 100 milliseconds on 30,000 reviews. A gradient boosting model might take 20 minutes. How does retraining frequency interact with spam detection effectiveness? At what point does the retraining overhead of a more complex model become a liability?

  5. The neutral class is hardest. In both NB and logistic regression, neutral sentiment had the lowest F1 score. Why is the boundary between "neutral" and "mildly positive" or "mildly negative" inherently difficult for bag-of-words models? What additional features or model architecture might help?


Case Study 1 for Chapter 15: Naive Bayes and Nearest Neighbors. Return to the chapter for full context.