Case Study 1: ShopSmart Review Classification --- Naive Bayes as a Text Classification Workhorse
Background
ShopSmart processes 85,000 product reviews per day across its marketplace of 14 million monthly users. Two classification problems sit at the center of the review pipeline:
- Spam detection: Is this review genuine or fake/incentivized? Fake reviews erode buyer trust and expose ShopSmart to regulatory risk. The trust and safety team estimates that 8-12% of submitted reviews are non-genuine.
- Sentiment classification: Is this review positive, negative, or neutral? Sentiment labels power the product ranking algorithm, seller scorecards, and the "Review Highlights" feature on product pages.
The previous system was a set of hand-coded rules maintained by two engineers. It caught obvious spam ("I was paid to write this") and classified sentiment using keyword lists. The rules worked at launch. Five years later, with 200+ product categories, 40,000 active sellers, and reviews in increasingly varied language, the rule-based system misclassifies 22% of reviews. The trust team wants a machine learning replacement that can be retrained weekly as new labeled data accumulates.
The constraints are firm: the classifier must process at least 50,000 reviews per minute (the peak ingestion rate), retrain in under 5 minutes on a single-core VM, and provide interpretable explanations for flagged reviews (because sellers can appeal).
The Data
ShopSmart's content moderation team has labeled 30,000 reviews over the past year. Each review has a text body, a star rating, and two labels: is_spam (binary) and sentiment (positive/neutral/negative).
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
np.random.seed(42)
n = 30000
# --- Generate synthetic ShopSmart review corpus ---
positive_templates = [
    "love this {product} works {adverb} great quality",
    "amazing {product} exactly what I needed {adverb} recommended",
    "excellent {product} fast shipping well packaged",
    "best {product} I have ever purchased {adverb} happy",
    "fantastic quality {product} exceeded my expectations",
    "{adverb} impressed with this {product} will buy again",
    "great value for the price this {product} is perfect",
    "very satisfied with my purchase this {product} is solid",
    "top notch {product} the quality is outstanding",
    "highly recommend this {product} to anyone looking for quality",
]
negative_templates = [
    "terrible {product} broke after {time} waste of money",
    "horrible quality {product} do not buy {adverb} disappointed",
    "worst {product} I have ever purchased returning immediately",
    "cheaply made {product} fell apart after {time}",
    "{product} does not work as described total scam",
    "very disappointed with this {product} not worth the price",
    "{product} arrived damaged and customer service was unhelpful",
    "poor quality {product} looks nothing like the pictures",
    "regret buying this {product} complete waste of money",
    "do not waste your money on this {product} {adverb} bad",
]
neutral_templates = [
    "{product} is okay for the price nothing special",
    "decent {product} does the job not amazing not terrible",
    "average quality {product} meets basic expectations",
    "the {product} is fine works as expected",
    "acceptable {product} for the price point standard quality",
    "its an okay {product} nothing to write home about",
    "the {product} does what it says average performance",
    "fair {product} for the money could be better",
    "standard {product} no complaints but nothing impressive",
    "mediocre {product} serves its purpose adequately",
]
spam_templates = [
    "I received this {product} for free in exchange for my honest review {adverb} great",
    "the seller contacted me and offered a gift card if I left a {sentiment} review",
    "best {product} ever five stars buy now click the link in my profile",
    "I was compensated for this review but my opinion is my own great {product}",
    "amazing {product} amazing quality amazing price amazing seller five stars",
    "this {product} is the greatest invention of all time buy ten of them now",
    "five stars five stars five stars excellent {product} best ever made",
    "I received a discount code for posting this review the {product} is good",
    "sponsored review this {product} was provided free by the manufacturer",
    "the seller asked me to change my review to five stars for a refund",
]
products = ['headphones', 'charger', 'case', 'stand', 'cable', 'adapter',
            'speaker', 'keyboard', 'mouse', 'monitor', 'lamp', 'chair',
            'backpack', 'bottle', 'mat', 'pillow', 'blanket', 'organizer']
adverbs = ['really', 'absolutely', 'incredibly', 'very', 'so', 'truly',
           'extremely', 'remarkably', 'genuinely', 'totally']
times = ['one week', 'two days', 'a month', 'three uses', 'first use']
sentiments_word = ['positive', 'five star', 'good', 'great']
rng = np.random.RandomState(42)
def fill_template(template, rng):
    return (template
            .replace('{product}', rng.choice(products))
            .replace('{adverb}', rng.choice(adverbs))
            .replace('{time}', rng.choice(times))
            .replace('{sentiment}', rng.choice(sentiments_word)))
reviews = []
sentiments = []
is_spam = []
for i in range(n):
    spam = rng.random() < 0.10  # 10% spam rate
    if spam:
        template = rng.choice(spam_templates)
        text = fill_template(template, rng)
        # Spam reviews often add extra filler
        if rng.random() < 0.5:
            text += ' ' + fill_template(rng.choice(positive_templates), rng)
        reviews.append(text)
        sentiments.append('positive')  # Most spam is fake-positive
        is_spam.append(1)
    else:
        r = rng.random()
        if r < 0.55:
            template = rng.choice(positive_templates)
            sentiments.append('positive')
        elif r < 0.80:
            template = rng.choice(negative_templates)
            sentiments.append('negative')
        else:
            template = rng.choice(neutral_templates)
            sentiments.append('neutral')
        reviews.append(fill_template(template, rng))
        is_spam.append(0)
df = pd.DataFrame({
    'text': reviews,
    'sentiment': sentiments,
    'is_spam': is_spam,
})
print(f"Dataset: {len(df)} reviews")
print(f"\nSpam distribution:")
print(df['is_spam'].value_counts(normalize=True).to_string())
print(f"\nSentiment distribution (non-spam only):")
print(df[df['is_spam'] == 0]['sentiment'].value_counts(normalize=True).to_string())
Dataset: 30000 reviews
Spam distribution:
0 0.9008
1 0.0992
Sentiment distribution (non-spam only):
positive 0.5498
negative 0.2498
neutral 0.2004
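Before modeling, it is worth spot-checking one example per label cell to confirm the text and labels agree. A minimal sketch using pandas' grouped sampling; the four-row frame here is a toy stand-in with the same columns, so point the same call at the real df above in practice:

```python
import pandas as pd

# Toy stand-in with the same columns as the ShopSmart frame above.
df = pd.DataFrame({
    'text': ['love this charger works great',
             'terrible cable broke after one week',
             'the lamp is fine works as expected',
             'I received this mat for free great'],
    'sentiment': ['positive', 'negative', 'neutral', 'positive'],
    'is_spam': [0, 0, 0, 1],
})

# One example per (is_spam, sentiment) cell: a fast eyeball check
# that labels and text line up before any modeling.
sample = df.groupby(['is_spam', 'sentiment']).sample(n=1, random_state=0)
print(sample.to_string(index=False))
```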
Task 1: Spam Detection
The spam detection pipeline needs to be fast and interpretable. Naive Bayes is the natural starting point.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, average_precision_score
from sklearn.pipeline import Pipeline
import time
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['is_spam'], test_size=0.25, random_state=42,
    stratify=df['is_spam']
)
# --- Model comparison ---
models = {
    'MultinomialNB': Pipeline([
        ('tfidf', TfidfVectorizer(
            ngram_range=(1, 2), max_features=10000,
            min_df=3, max_df=0.95
        )),
        ('clf', MultinomialNB(alpha=0.3))
    ]),
    'ComplementNB': Pipeline([
        ('tfidf', TfidfVectorizer(
            ngram_range=(1, 2), max_features=10000,
            min_df=3, max_df=0.95
        )),
        ('clf', ComplementNB(alpha=0.3))
    ]),
    'LogisticRegression': Pipeline([
        ('tfidf', TfidfVectorizer(
            ngram_range=(1, 2), max_features=10000,
            min_df=3, max_df=0.95
        )),
        ('clf', LogisticRegression(
            C=1.0, max_iter=1000, random_state=42, class_weight='balanced'
        ))
    ]),
}
print("--- Spam Detection: Model Comparison ---\n")
print(f"{'Model':<22}{'AUC-ROC':<10}{'Avg Prec':<10}{'Train (ms)':<12}{'Predict (ms)':<14}")
print("-" * 68)
for name, pipe in models.items():
    t0 = time.perf_counter()
    pipe.fit(X_train, y_train)
    train_ms = (time.perf_counter() - t0) * 1000
    t0 = time.perf_counter()
    y_pred = pipe.predict(X_test)
    if hasattr(pipe.named_steps['clf'], 'predict_proba'):
        y_prob = pipe.predict_proba(X_test)[:, 1]
    else:
        y_prob = pipe.decision_function(X_test)
    pred_ms = (time.perf_counter() - t0) * 1000
    auc = roc_auc_score(y_test, y_prob)
    ap = average_precision_score(y_test, y_prob)
    print(f"{name:<22}{auc:<10.3f}{ap:<10.3f}{train_ms:<12.1f}{pred_ms:<14.1f}")
--- Spam Detection: Model Comparison ---
Model AUC-ROC Avg Prec Train (ms) Predict (ms)
--------------------------------------------------------------------
MultinomialNB 0.993 0.962 58.3 6.2
ComplementNB 0.992 0.958 56.8 6.4
LogisticRegression 0.996 0.978 284.1 5.1
All three models achieve excellent spam detection (AUC > 0.99). Logistic regression wins by a small margin but trains 5x slower. For ShopSmart's constraint of retraining weekly on a single-core VM, this difference is negligible at 30,000 reviews. But if the training corpus grows to millions, the NB speed advantage becomes meaningful.
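The training-time gap is structural: fitting multinomial Naive Bayes is a single counting pass over the data, with no iterative optimization. A minimal NumPy sketch of that pass on a toy count matrix (illustrative values, not the review data):

```python
import numpy as np

# Toy document-term counts: 4 docs x 3 terms, two classes.
X = np.array([[2, 0, 1],
              [1, 1, 0],
              [0, 3, 1],
              [0, 2, 2]], dtype=float)
y = np.array([0, 0, 1, 1])
alpha = 0.3  # same smoothing as the spam pipeline above

# Per-class term totals, smoothed, then log-normalized -- that's the
# entire fit. No gradient steps, no iterations, which is why NB trains
# roughly 5x faster than logistic regression in the comparison above.
log_probs = []
for c in (0, 1):
    counts = X[y == c].sum(axis=0) + alpha
    log_probs.append(np.log(counts / counts.sum()))
log_probs = np.vstack(log_probs)
print(np.round(log_probs, 3))
```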
Task 2: Feature Interpretability for Spam Appeals
When a seller's review is flagged as spam, they can appeal. The content moderation team needs to explain why the model flagged it. Naive Bayes provides this naturally.
# Train the spam detector
spam_pipe = models['MultinomialNB']
spam_pipe.fit(X_train, y_train)
tfidf = spam_pipe.named_steps['tfidf']
nb = spam_pipe.named_steps['clf']
# Extract feature log-probabilities
feature_names = tfidf.get_feature_names_out()
log_prob_spam = nb.feature_log_prob_[1] # log P(word | spam)
log_prob_ham = nb.feature_log_prob_[0] # log P(word | not spam)
# Ratio: how much more likely a word is in spam vs. ham
log_ratio = log_prob_spam - log_prob_ham
word_importance = pd.DataFrame({
    'word': feature_names,
    'log_ratio_spam_vs_ham': log_ratio,
    'P(w|spam)': np.exp(log_prob_spam),
    'P(w|ham)': np.exp(log_prob_ham),
})
# Top spam indicators
print("--- Top 15 Spam Indicators ---\n")
top_spam = word_importance.nlargest(15, 'log_ratio_spam_vs_ham')
for _, row in top_spam.iterrows():
    print(f"  '{row['word']:<25}' "
          f"P(w|spam)={row['P(w|spam)']:.4f} "
          f"P(w|ham)={row['P(w|ham)']:.4f} "
          f"ratio={np.exp(row['log_ratio_spam_vs_ham']):.1f}x")
# Top ham (legitimate) indicators
print("\n--- Top 15 Legitimate Review Indicators ---\n")
top_ham = word_importance.nsmallest(15, 'log_ratio_spam_vs_ham')
for _, row in top_ham.iterrows():
    ratio = np.exp(-row['log_ratio_spam_vs_ham'])
    print(f"  '{row['word']:<25}' "
          f"P(w|ham)={row['P(w|ham)']:.4f} "
          f"P(w|spam)={row['P(w|spam)']:.4f} "
          f"ratio={ratio:.1f}x")
--- Top 15 Spam Indicators ---
'for free' P(w|spam)=0.0178 P(w|ham)=0.0003 ratio=53.1x
'in exchange' P(w|spam)=0.0121 P(w|ham)=0.0003 ratio=36.2x
'compensated' P(w|spam)=0.0098 P(w|ham)=0.0003 ratio=29.3x
'gift card' P(w|spam)=0.0094 P(w|ham)=0.0003 ratio=28.1x
'five stars five' P(w|spam)=0.0089 P(w|ham)=0.0003 ratio=26.6x
'honest review' P(w|spam)=0.0082 P(w|ham)=0.0003 ratio=24.5x
'received this' P(w|spam)=0.0075 P(w|ham)=0.0004 ratio=20.4x
'free in' P(w|spam)=0.0071 P(w|ham)=0.0004 ratio=19.3x
'sponsored' P(w|spam)=0.0064 P(w|ham)=0.0004 ratio=17.4x
'link in' P(w|spam)=0.0058 P(w|ham)=0.0004 ratio=15.8x
'seller asked' P(w|spam)=0.0052 P(w|ham)=0.0003 ratio=15.5x
'change my review' P(w|spam)=0.0048 P(w|ham)=0.0004 ratio=13.1x
'discount code' P(w|spam)=0.0045 P(w|ham)=0.0004 ratio=12.3x
'buy ten' P(w|spam)=0.0041 P(w|ham)=0.0004 ratio=11.2x
'greatest invention' P(w|spam)=0.0038 P(w|ham)=0.0003 ratio=10.9x
--- Top 15 Legitimate Review Indicators ---
'broke after' P(w|ham)=0.0072 P(w|spam)=0.0004 ratio=19.6x
'waste of money' P(w|ham)=0.0061 P(w|spam)=0.0004 ratio=16.6x
'fell apart' P(w|ham)=0.0054 P(w|spam)=0.0003 ratio=15.6x
'not worth' P(w|ham)=0.0047 P(w|spam)=0.0003 ratio=13.6x
'does not work' P(w|ham)=0.0045 P(w|spam)=0.0004 ratio=12.3x
'returning immediately' P(w|ham)=0.0041 P(w|spam)=0.0003 ratio=11.8x
'arrived damaged' P(w|ham)=0.0038 P(w|spam)=0.0003 ratio=11.0x
'cheaply made' P(w|ham)=0.0036 P(w|spam)=0.0003 ratio=10.4x
'okay for' P(w|ham)=0.0033 P(w|spam)=0.0004 ratio=9.0x
'nothing special' P(w|ham)=0.0031 P(w|spam)=0.0004 ratio=8.4x
'does the job' P(w|ham)=0.0029 P(w|spam)=0.0004 ratio=7.9x
'disappointed' P(w|ham)=0.0049 P(w|spam)=0.0007 ratio=7.4x
'poor quality' P(w|ham)=0.0043 P(w|spam)=0.0006 ratio=6.9x
'nothing to write' P(w|ham)=0.0024 P(w|spam)=0.0004 ratio=6.5x
'terrible' P(w|ham)=0.0055 P(w|spam)=0.0009 ratio=6.3x
The interpretability is immediate and actionable. The top spam indicators ("for free," "in exchange," "compensated," "gift card") directly reflect incentivized review language. The top legitimate indicators are phrases from negative reviews --- spam is almost never negative, which is itself a useful signal.
Production Tip --- This word-level interpretability is why Naive Bayes remains the default for content moderation pipelines at many companies. When a seller appeals a flagged review, the moderation team can say: "Your review was flagged because it contained the phrases 'received this for free' and 'in exchange for honest review,' which appear 53x and 36x more often in incentivized reviews than in genuine reviews." This is a concrete, defensible explanation --- something a gradient boosting SHAP explanation cannot match for text data.
Task 3: Sentiment Classification
Sentiment classification is the second pipeline stage, applied only to reviews that pass the spam filter.
from sklearn.naive_bayes import MultinomialNB, ComplementNB
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
import time
# Filter to non-spam reviews
legit = df[df['is_spam'] == 0].copy()
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    legit['text'], legit['sentiment'],
    test_size=0.25, random_state=42, stratify=legit['sentiment']
)
# ComplementNB for imbalanced sentiment classes
sentiment_models = {
    'ComplementNB': Pipeline([
        ('tfidf', TfidfVectorizer(
            ngram_range=(1, 2), max_features=8000, min_df=2, max_df=0.95
        )),
        ('clf', ComplementNB(alpha=0.5))
    ]),
    'MultinomialNB': Pipeline([
        ('tfidf', TfidfVectorizer(
            ngram_range=(1, 2), max_features=8000, min_df=2, max_df=0.95
        )),
        ('clf', MultinomialNB(alpha=0.5))
    ]),
    'LogisticRegression': Pipeline([
        ('tfidf', TfidfVectorizer(
            ngram_range=(1, 2), max_features=8000, min_df=2, max_df=0.95
        )),
        # Multinomial (softmax) handling is the default with lbfgs;
        # the explicit multi_class argument is deprecated in sklearn.
        ('clf', LogisticRegression(
            C=1.0, max_iter=1000, random_state=42, class_weight='balanced'
        ))
    ]),
}
print("--- Sentiment Classification ---\n")
for name, pipe in sentiment_models.items():
    t0 = time.perf_counter()
    pipe.fit(X_train_s, y_train_s)
    train_ms = (time.perf_counter() - t0) * 1000
    t0 = time.perf_counter()
    y_pred_s = pipe.predict(X_test_s)
    pred_ms = (time.perf_counter() - t0) * 1000
    print(f"=== {name} (train: {train_ms:.1f}ms, predict: {pred_ms:.1f}ms) ===")
    print(classification_report(y_test_s, y_pred_s, digits=3))
--- Sentiment Classification ---
=== ComplementNB (train: 42.6ms, predict: 4.8ms) ===
precision recall f1-score support
negative 0.956 0.972 0.964 1687
neutral 0.930 0.905 0.917 1352
positive 0.977 0.975 0.976 3712
accuracy 0.962 6751
macro avg 0.954 0.951 0.952 6751
weighted avg 0.962 0.962 0.962 6751
=== MultinomialNB (train: 40.3ms, predict: 4.6ms) ===
precision recall f1-score support
negative 0.952 0.970 0.961 1687
neutral 0.918 0.896 0.907 1352
positive 0.975 0.971 0.973 3712
accuracy 0.958 6751
macro avg 0.948 0.946 0.947 6751
weighted avg 0.958 0.958 0.958 6751
=== LogisticRegression (train: 198.4ms, predict: 3.9ms) ===
precision recall f1-score support
negative 0.964 0.979 0.971 1687
neutral 0.943 0.921 0.932 1352
positive 0.981 0.980 0.981 3712
accuracy 0.969 6751
macro avg 0.963 0.960 0.961 6751
weighted avg 0.969 0.969 0.969 6751
ComplementNB outperforms standard MultinomialNB by 0.4% accuracy, with the biggest gain on the minority neutral class (91.7% vs 90.7% F1 on neutral). Logistic regression wins overall by 0.7%, but trains 5x slower.
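The reason ComplementNB helps most on the minority neutral class is that it estimates each class's parameters from the counts of every *other* class (the complement), so small classes borrow statistical strength from larger count pools. A simplified NumPy sketch of the complement-count idea on toy data; sklearn's ComplementNB layers optional weight normalization on top of this:

```python
import numpy as np

# Toy document-term counts: 4 docs x 4 terms, three classes.
X = np.array([[3, 0, 1, 0],
              [2, 1, 0, 0],
              [0, 2, 0, 1],
              [0, 0, 2, 2]], dtype=float)
y = np.array([0, 0, 1, 2])
alpha = 0.5  # matches the smoothing used in the sentiment pipelines

# For each class, accumulate counts from every OTHER class; the
# negated complement log-probabilities act as the class weights.
# Even a one-document class is scored against a three-document pool.
weights = []
for c in np.unique(y):
    comp = X[y != c].sum(axis=0) + alpha
    weights.append(-np.log(comp / comp.sum()))
weights = np.vstack(weights)
print(np.round(weights, 3))
```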
Task 4: The Full Pipeline and Throughput Benchmark
import time
# Build the two-stage pipeline
spam_model = Pipeline([
    ('tfidf', TfidfVectorizer(
        ngram_range=(1, 2), max_features=10000,
        min_df=3, max_df=0.95
    )),
    ('clf', MultinomialNB(alpha=0.3))
])
sentiment_model = Pipeline([
    ('tfidf', TfidfVectorizer(
        ngram_range=(1, 2), max_features=8000,
        min_df=2, max_df=0.95
    )),
    ('clf', ComplementNB(alpha=0.5))
])
# Train both stages
spam_model.fit(X_train, y_train)
sentiment_model.fit(X_train_s, y_train_s)
# Throughput benchmark: classify 10,000 reviews
test_batch = list(df['text'].sample(10000, random_state=42))
t0 = time.perf_counter()
# Stage 1: Spam filter
spam_predictions = spam_model.predict(test_batch)
legit_indices = [i for i, pred in enumerate(spam_predictions) if pred == 0]
legit_texts = [test_batch[i] for i in legit_indices]
# Stage 2: Sentiment (only for non-spam)
if legit_texts:
    sentiment_predictions = sentiment_model.predict(legit_texts)
elapsed = time.perf_counter() - t0
reviews_per_second = 10000 / elapsed
reviews_per_minute = reviews_per_second * 60
print("--- Full Pipeline Throughput ---")
print(f"Reviews processed: 10,000")
print(f"Total time: {elapsed * 1000:.1f} ms")
print(f"Reviews/second: {reviews_per_second:,.0f}")
print(f"Reviews/minute: {reviews_per_minute:,.0f}")
print(f"Target: 50,000/minute")
print(f"Meets requirement: {'Yes' if reviews_per_minute > 50000 else 'No'}")
print(f"\nSpam flagged: {sum(spam_predictions)}/{len(spam_predictions)} "
      f"({sum(spam_predictions)/len(spam_predictions)*100:.1f}%)")
--- Full Pipeline Throughput ---
Reviews processed: 10,000
Total time: 38.4 ms
Reviews/second: 260,417
Reviews/minute: 15,625,000
Target: 50,000/minute
Meets requirement: Yes
Spam flagged: 984/10000 (9.8%)
The Naive Bayes pipeline processes 15.6 million reviews per minute on a single core --- 312x faster than the 50,000/minute requirement. This massive headroom means the system can run on minimal infrastructure and absorb traffic spikes without autoscaling.
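For the weekly retrain cycle, the fitted pipeline can be serialized and swapped into the serving process as a single artifact. A minimal sketch using pickle on a toy four-review corpus (both the corpus and the round-trip-in-memory setup are illustrative; joblib and an atomic file rename are common in practice):

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in corpus; in production this is the week's labeled reviews.
texts = ['free gift card from seller', 'great charger works well',
         'sponsored review provided free', 'terrible cable broke']
labels = [1, 0, 1, 0]

pipe = Pipeline([('tfidf', TfidfVectorizer()),
                 ('clf', MultinomialNB(alpha=0.3))]).fit(texts, labels)

# Weekly cycle: fit, serialize, then swap the artifact the serving
# process loads. The vectorizer and classifier travel together, so
# vocabulary and model weights can never drift apart.
blob = pickle.dumps(pipe)
restored = pickle.loads(blob)
print(list(restored.predict(texts)))
```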
Discussion Questions
- The speed-accuracy tradeoff is real. Logistic regression achieved 0.7% higher sentiment accuracy than ComplementNB. At ShopSmart's scale (85,000 reviews/day), 0.7% accuracy means roughly 595 additional correct classifications per day. Is that worth the 5x training time increase and more complex model? Under what circumstances would the answer change?
- Interpretability is a product requirement, not an academic nicety. ShopSmart sellers can appeal flagged reviews. The Naive Bayes feature importance (word-level log probability ratios) provides a clear, defensible explanation: "This review was flagged because it contains phrases that appear 53x more often in incentivized reviews." Could you provide an equally clear explanation with gradient boosting?
- The two-stage pipeline separates concerns. Spam detection and sentiment classification are trained independently, on different subsets, with different features. What are the advantages and risks of this modular approach versus a single model that predicts both labels simultaneously?
- Weekly retraining matters. ShopSmart's spam patterns evolve as spammers adapt. Naive Bayes retrains in under 100 milliseconds on 30,000 reviews. A gradient boosting model might take 20 minutes. How does retraining frequency interact with spam detection effectiveness? At what point does the retraining overhead of a more complex model become a liability?
- The neutral class is hardest. In both NB and logistic regression, neutral sentiment had the lowest F1 score. Why is the boundary between "neutral" and "mildly positive" or "mildly negative" inherently difficult for bag-of-words models? What additional features or model architecture might help?
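On the retraining question above: MultinomialNB also supports incremental updates via partial_fit, which pairs naturally with a stateless HashingVectorizer (a fitted TfidfVectorizer's vocabulary would shift between weekly batches). A hedged sketch with two toy weekly batches, not ShopSmart's actual update job:

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stateless vectorizer: no fitted vocabulary, so feature indices stay
# stable across batches. alternate_sign=False keeps features
# non-negative, which MultinomialNB requires.
vec = HashingVectorizer(n_features=2**16, alternate_sign=False)
nb = MultinomialNB(alpha=0.3)

week1 = (['free gift card from seller', 'great charger works well'], [1, 0])
week2 = (['sponsored review provided free', 'terrible cable broke'], [1, 0])

# The first call must declare the full label set; later calls just
# add to the per-class counts -- no full retrain needed.
nb.partial_fit(vec.transform(week1[0]), week1[1], classes=[0, 1])
nb.partial_fit(vec.transform(week2[0]), week2[1])

print(nb.predict(vec.transform(['seller offered a free gift card'])))
```

The tradeoff: hashing gives up the word-level interpretability that Task 2 relies on, so an incremental variant would need a separate explanation path.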
Case Study 1 for Chapter 15: Naive Bayes and Nearest Neighbors. Return to the chapter for full context.