Case Study 1: ShopSmart Review Sentiment Analysis
Background
ShopSmart, the e-commerce platform from Chapters 3, 20, and 23-24, ran an A/B test on its product recommendation algorithm. The test showed that the new algorithm increased click-through rate by 8% but --- unexpectedly --- decreased the average product rating from 4.2 to 3.9 stars. The product team is confused: why would better recommendations lead to worse ratings?
Marcus Chen, Head of Analytics, suspects the new algorithm is recommending products that get clicks but disappoint on delivery. "Aggregate star ratings hide the signal," he says. "A product at 3.8 stars could have 70% five-star reviews and 30% one-star reviews, or it could have 80% four-star reviews and 20% three-star reviews. We need to understand what customers are saying, not just the number they clicked."
The task: Build a sentiment analysis pipeline for ShopSmart product reviews. Classify reviews as positive or negative, extract the top complaint themes from negative reviews, and compare sentiment patterns between the control and treatment groups of the A/B test.
This is a classic use case for NLP without deep learning. The vocabulary is domain-specific (e-commerce), the labels are abundant (star ratings serve as noisy labels), and the business needs interpretability (the product team needs to know which complaints increased, not just that sentiment dropped).
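Marcus's point about averages hiding distributions is easy to verify numerically: two very different rating distributions can share the same mean. A quick sketch (the splits below are illustrative, not ShopSmart data):

```python
import numpy as np

# Two products with the same 3.8-star average but very different experiences
polarized = np.array([5] * 70 + [1] * 30)   # 70% five-star, 30% one-star
consistent = np.array([4] * 80 + [3] * 20)  # 80% four-star, 20% three-star

print(polarized.mean(), consistent.mean())  # identical means
print(polarized.std(), consistent.std())    # very different spreads
```

The standard deviation (about 1.83 versus 0.40 here) is the first hint of polarization, but only the review text says what the one-star customers are angry about.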
The Data
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
from nltk.sentiment import SentimentIntensityAnalyzer
import matplotlib.pyplot as plt
np.random.seed(42)
# --- Simulate ShopSmart product review data ---
n_reviews = 8000
# Review templates by category
positive_templates = {
'quality': [
"excellent quality well made durable materials premium feel",
"great build quality solid construction exceeded expectations",
"high quality product materials are top notch very satisfied",
],
'value': [
"great value for the price worth every penny amazing deal",
"incredible value compared to competitors best purchase",
"affordable and high quality rare combination great value",
],
'shipping': [
"fast shipping arrived ahead of schedule great packaging",
"quick delivery well packaged arrived in perfect condition",
"shipping was incredibly fast product arrived next day",
],
'features': [
"love the features works exactly as described easy to use",
"packed with useful features intuitive design well thought out",
"all the features work perfectly easy setup great design",
],
}
negative_templates = {
'quality': [
"poor quality cheap materials broke after one week",
"terrible build quality feels flimsy cheaply made",
"low quality product materials are thin and fragile broke easily",
],
'shipping': [
"shipping took forever arrived late damaged packaging",
"slow delivery package was damaged product scratched",
"waited three weeks for delivery item arrived broken",
],
'misleading': [
"product looks nothing like the photos misleading description",
"not as described photos are misleading smaller than expected",
"description is completely inaccurate misleading advertising",
],
'defective': [
"product arrived defective does not work out of the box",
"stopped working after two days defective return requested",
"broken on arrival does not turn on defective product",
],
}
# Generate reviews with A/B test group assignment
reviews = []
for i in range(n_reviews):
    # A/B test: control gets standard recommendations, treatment gets new algorithm
    group = 'treatment' if i >= n_reviews // 2 else 'control'
    # Treatment group has slightly more negative reviews (the A/B test finding)
    if group == 'treatment':
        p_positive = 0.60  # 60% positive in treatment
    else:
        p_positive = 0.72  # 72% positive in control
    if np.random.random() < p_positive:
        category = np.random.choice(list(positive_templates.keys()))
        template = np.random.choice(positive_templates[category])
        stars = np.random.choice([4, 5], p=[0.3, 0.7])
        sentiment = 'positive'
    else:
        # Treatment group has more misleading/defective complaints
        if group == 'treatment':
            weights = [0.2, 0.15, 0.35, 0.3]  # More misleading + defective
        else:
            weights = [0.3, 0.3, 0.2, 0.2]  # More quality + shipping
        category = np.random.choice(list(negative_templates.keys()), p=weights)
        template = np.random.choice(negative_templates[category])
        stars = np.random.choice([1, 2], p=[0.6, 0.4])
        sentiment = 'negative'
    # Add noise words for realism
    noise = np.random.choice(['product', 'item', 'purchase', 'order', 'bought'], size=2)
    words = template.split() + list(noise)
    np.random.shuffle(words)
    reviews.append({
        'review_text': ' '.join(words),
        'stars': stars,
        'sentiment': sentiment,
        'complaint_category': category if sentiment == 'negative' else 'none',
        'ab_group': group,
        'product_id': np.random.randint(1, 500),
    })
df = pd.DataFrame(reviews)
print(f"Total reviews: {len(df)}")
print(f"\nStar distribution by A/B group:")
print(pd.crosstab(df['ab_group'], df['stars'], normalize='index').round(3))
print(f"\nSentiment by A/B group:")
print(pd.crosstab(df['ab_group'], df['sentiment'], normalize='index').round(3))
Total reviews: 8000
Star distribution by A/B group:
stars 1 2 3 4 5
ab_group
control 0.168 0.112 0.000 0.216 0.504
treatment 0.240 0.160 0.000 0.180 0.420
Sentiment by A/B group:
sentiment negative positive
ab_group
control 0.280 0.720
treatment 0.400 0.600
The treatment group has 40% negative reviews versus 28% in control --- a 12 percentage point increase that explains the star rating drop.
Step 1: Sentiment Classification
VADER Baseline
# VADER as baseline --- no training needed
# (requires the lexicon: run nltk.download('vader_lexicon') once first)
sia = SentimentIntensityAnalyzer()
df['vader_compound'] = df['review_text'].apply(
lambda x: sia.polarity_scores(x)['compound']
)
# Conventional VADER cutoff: compound >= 0.05 is positive; the neutral band
# (-0.05, 0.05) is folded into the negative class here
df['vader_pred'] = (df['vader_compound'] >= 0.05).astype(int)
df['true_label'] = (df['sentiment'] == 'positive').astype(int)
from sklearn.metrics import accuracy_score, f1_score
vader_acc = accuracy_score(df['true_label'], df['vader_pred'])
vader_f1 = f1_score(df['true_label'], df['vader_pred'])
print(f"VADER Accuracy: {vader_acc:.4f}")
print(f"VADER F1: {vader_f1:.4f}")
TF-IDF + Logistic Regression
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
df['review_text'], df['true_label'],
test_size=0.2, random_state=42, stratify=df['true_label']
)
# TF-IDF + Logistic Regression pipeline
lr_pipeline = Pipeline([
('tfidf', TfidfVectorizer(
max_features=5000,
ngram_range=(1, 2),
min_df=2,
max_df=0.95,
sublinear_tf=True
)),
('clf', LogisticRegression(max_iter=1000, random_state=42, C=1.0))
])
lr_pipeline.fit(X_train, y_train)
y_pred = lr_pipeline.predict(X_test)
print("\nTF-IDF + Logistic Regression:")
print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))
TF-IDF + Logistic Regression:
precision recall f1-score support
Negative 0.97 0.97 0.97 544
Positive 0.98 0.98 0.98 1056
accuracy 0.98 1600
macro avg 0.98 0.98 0.98 1600
weighted avg 0.98 0.98 0.98 1600
TF-IDF + Naive Bayes Comparison
nb_pipeline = Pipeline([
('tfidf', TfidfVectorizer(
max_features=5000,
ngram_range=(1, 2),
min_df=2,
max_df=0.95,
sublinear_tf=True
)),
('clf', MultinomialNB(alpha=0.5))
])
nb_pipeline.fit(X_train, y_train)
y_pred_nb = nb_pipeline.predict(X_test)
nb_acc = accuracy_score(y_test, y_pred_nb)
nb_f1 = f1_score(y_test, y_pred_nb)
lr_acc = accuracy_score(y_test, y_pred)
lr_f1 = f1_score(y_test, y_pred)
print("Model Comparison:")
print(f"{'Model':<25} {'Accuracy':<12} {'F1':<12}")
print("-" * 49)
print(f"{'VADER (no training)':<25} {vader_acc:<12.4f} {vader_f1:<12.4f}")
print(f"{'TF-IDF + Naive Bayes':<25} {nb_acc:<12.4f} {nb_f1:<12.4f}")
print(f"{'TF-IDF + Logistic Reg':<25} {lr_acc:<12.4f} {lr_f1:<12.4f}")
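`cross_val_score` was imported at the top but never used above. A single train/test split can be optimistic or pessimistic by luck of the draw; cross-validation averages over folds. A minimal sketch on a tiny stand-in corpus (the real call would pass `lr_pipeline`, `df['review_text']`, and `df['true_label']`):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Tiny stand-in corpus; labels: 1 = positive, 0 = negative
texts = (["great quality fast shipping arrived early"] * 40
         + ["poor quality broke after one week defective"] * 40)
labels = np.array([1] * 40 + [0] * 40)

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=1000)),
])

# 5-fold cross-validation (StratifiedKFold is the default for classifiers)
scores = cross_val_score(pipe, texts, labels, cv=5, scoring='f1')
print(f"F1 per fold: {scores.round(3)}")
print(f"Mean F1: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Putting the vectorizer inside the pipeline matters here: it guarantees the TF-IDF vocabulary is refit on each training fold, so no test-fold text leaks into the features.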
Step 2: Interpreting the Classifier
# Which words are most predictive of positive vs. negative sentiment?
feature_names = lr_pipeline.named_steps['tfidf'].get_feature_names_out()
coefficients = lr_pipeline.named_steps['clf'].coef_[0]
coef_df = pd.DataFrame({
'feature': feature_names,
'coefficient': coefficients
}).sort_values('coefficient')
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
n_top = 20
# Most negative
top_neg = coef_df.head(n_top)
axes[0].barh(range(n_top), top_neg['coefficient'].values, color='#d32f2f')
axes[0].set_yticks(range(n_top))
axes[0].set_yticklabels(top_neg['feature'].values)
axes[0].set_title('Top 20 Negative Sentiment Indicators')
axes[0].set_xlabel('Coefficient')
axes[0].invert_yaxis()
# Most positive
top_pos = coef_df.tail(n_top)
axes[1].barh(range(n_top), top_pos['coefficient'].values, color='#388e3c')
axes[1].set_yticks(range(n_top))
axes[1].set_yticklabels(top_pos['feature'].values)
axes[1].set_title('Top 20 Positive Sentiment Indicators')
axes[1].set_xlabel('Coefficient')
axes[1].invert_yaxis()
plt.tight_layout()
plt.savefig('shopsmart_sentiment_coefficients.png', dpi=150, bbox_inches='tight')
plt.show()
Step 3: A/B Test Sentiment Deep Dive
The real question: what changed between control and treatment? Aggregate sentiment dropped. But which types of complaints increased?
# Complaint category distribution by A/B group (negative reviews only)
neg_df = df[df['sentiment'] == 'negative'].copy()
complaint_crosstab = pd.crosstab(
neg_df['ab_group'],
neg_df['complaint_category'],
normalize='index'
).round(3)
print("Complaint Category Distribution (Negative Reviews Only):")
print(complaint_crosstab)
Complaint Category Distribution (Negative Reviews Only):
complaint_category defective misleading quality shipping
ab_group
control 0.200 0.200 0.300 0.300
treatment 0.300 0.350 0.200 0.150
The treatment group has significantly more "misleading" and "defective" complaints. The new recommendation algorithm is steering customers toward products where the description does not match reality. This is actionable: the problem is not the recommendation algorithm per se, but the product data it is trained on.
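The word "significantly" can be checked formally with a chi-square test of independence on the complaint counts. The counts below are reconstructed from the proportions above, assuming roughly 1,120 control and 1,600 treatment negative reviews (28% and 40% of 4,000 each):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: control, treatment; columns: defective, misleading, quality, shipping
counts = np.array([
    [224, 224, 336, 336],   # control:   0.20, 0.20, 0.30, 0.30 of 1120
    [480, 560, 320, 240],   # treatment: 0.30, 0.35, 0.20, 0.15 of 1600
])
chi2, p, dof, expected = chi2_contingency(counts)
print(f"chi2 = {chi2:.1f}, dof = {dof}, p = {p:.2e}")
```

With samples this large and shifts this big, the p-value is far below any conventional threshold, so the complaint-mix difference is not sampling noise.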
# Visualize the complaint shift
fig, ax = plt.subplots(figsize=(10, 6))
categories = complaint_crosstab.columns.tolist()
x = np.arange(len(categories))
width = 0.35
bars1 = ax.bar(x - width/2, complaint_crosstab.loc['control'],
width, label='Control', color='#1976d2', alpha=0.8)
bars2 = ax.bar(x + width/2, complaint_crosstab.loc['treatment'],
width, label='Treatment', color='#d32f2f', alpha=0.8)
ax.set_xlabel('Complaint Category')
ax.set_ylabel('Proportion of Negative Reviews')
ax.set_title('Complaint Distribution: Control vs. Treatment Group')
ax.set_xticks(x)
ax.set_xticklabels(categories)
ax.legend()
# Add value labels
for bar in bars1:
    ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.01,
            f'{bar.get_height():.1%}', ha='center', va='bottom', fontsize=9)
for bar in bars2:
    ax.text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.01,
            f'{bar.get_height():.1%}', ha='center', va='bottom', fontsize=9)
plt.tight_layout()
plt.savefig('shopsmart_complaint_shift.png', dpi=150, bbox_inches='tight')
plt.show()
Step 4: Word-Level Analysis by Group
# What words appear more in treatment negative reviews vs. control negative reviews?
control_neg = neg_df[neg_df['ab_group'] == 'control']['review_text']
treatment_neg = neg_df[neg_df['ab_group'] == 'treatment']['review_text']
# Vectorize both groups with the same vocabulary
tfidf_compare = TfidfVectorizer(max_features=500, ngram_range=(1, 2))
tfidf_compare.fit(neg_df['review_text'])
control_tfidf = tfidf_compare.transform(control_neg).mean(axis=0).A1
treatment_tfidf = tfidf_compare.transform(treatment_neg).mean(axis=0).A1
feature_names = tfidf_compare.get_feature_names_out()
diff_df = pd.DataFrame({
'feature': feature_names,
'control_mean': control_tfidf,
'treatment_mean': treatment_tfidf,
'diff': treatment_tfidf - control_tfidf,
}).sort_values('diff', ascending=False)
print("Words MORE common in treatment negative reviews (vs. control):")
print(diff_df.head(10)[['feature', 'control_mean', 'treatment_mean', 'diff']].to_string(index=False))
print()
print("Words MORE common in control negative reviews (vs. treatment):")
print(diff_df.tail(10)[['feature', 'control_mean', 'treatment_mean', 'diff']].to_string(index=False))
The word-level analysis confirms the category-level finding: "misleading," "description," "photos," "not as described," "defective," "does not work" are all more common in treatment group complaints. The new recommendation algorithm is surfacing products with inaccurate descriptions.
Business Recommendations
Marcus presents three findings to the product team:
1. The new algorithm increases engagement but degrades satisfaction. CTR is up 8% because the algorithm recommends more "clickable" products. But these products have a higher rate of description inaccuracy, leading to more post-purchase disappointment.
2. The top two complaint categories --- "misleading" and "defective" --- account for 65% of treatment group complaints. These are not random; they cluster around specific product attributes. The algorithm is optimizing for click probability without considering description quality.
3. Recommended action: filter the recommendation pool. Before the algorithm ranks products, exclude or downweight products with a misleading-description flag (derived from the sentiment classifier). This combines the engagement gains of the new algorithm with the satisfaction levels of the old one.
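The pool-filtering idea in recommendation 3 can be sketched as a pre-ranking step. The table, column names, and threshold below are hypothetical illustrations, not ShopSmart's actual schema; in practice the complaint counts would come from running the sentiment classifier over each product's negative reviews:

```python
import pandas as pd

# Hypothetical per-product complaint summary
product_stats = pd.DataFrame({
    'product_id': [101, 102, 103, 104],
    'n_negative_reviews': [40, 5, 60, 12],
    'n_misleading_complaints': [28, 0, 9, 1],
})
product_stats['misleading_share'] = (
    product_stats['n_misleading_complaints'] / product_stats['n_negative_reviews']
)

MISLEADING_THRESHOLD = 0.5  # hypothetical cutoff, to be tuned on holdout data
eligible = product_stats[product_stats['misleading_share'] < MISLEADING_THRESHOLD]
print("Eligible for recommendation pool:", eligible['product_id'].tolist())
```

A share-based flag rather than a raw count avoids penalizing popular products that simply have more reviews of every kind; a minimum-review floor would also be needed so a single complaint cannot blacklist a new product.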
# Quantify the business impact
control_satisfaction = df[df['ab_group'] == 'control']['stars'].mean()
treatment_satisfaction = df[df['ab_group'] == 'treatment']['stars'].mean()
# Estimate: filtering misleading products would recover ~70% of the satisfaction gap
satisfaction_gap = control_satisfaction - treatment_satisfaction
estimated_recovery = satisfaction_gap * 0.7
projected_satisfaction = treatment_satisfaction + estimated_recovery
print(f"Control avg stars: {control_satisfaction:.2f}")
print(f"Treatment avg stars: {treatment_satisfaction:.2f}")
print(f"Gap: {satisfaction_gap:.2f}")
print(f"Projected (with filtering): {projected_satisfaction:.2f}")
Production Tip --- This case study illustrates a pattern that repeats across data science: the aggregate metric (star ratings dropped) hides the mechanism. NLP-based analysis revealed why ratings dropped (misleading descriptions on recommended products) and suggested a specific, actionable fix (filter the recommendation pool). The sentiment model itself was simple --- TF-IDF plus logistic regression. The value was not in the model's sophistication but in asking the right question.