Exercises: Chapter 26
NLP Fundamentals: Text Preprocessing, TF-IDF, Sentiment Analysis, and Topic Modeling
Exercise 1: Text Preprocessing by Hand (Conceptual)
Consider the following five product reviews:
- "I LOVE this product!!! It's the BEST I've ever used."
- "The product isn't bad, but it's not great either."
- "Returned it. Terrible quality. Don't buy."
- "Fast shipping & great value --- would recommend to friends!"
- "Product arrived broken. Company won't issue a refund. AVOID."
a) Apply the following preprocessing steps manually to Review 1: lowercasing, removing punctuation, tokenization (split on whitespace after cleaning), and stop word removal (remove: "i", "this", "it", "the", "is", "its", "ive", "ever"). Write the resulting token list.
b) Apply the Porter stemmer mentally to the tokens: "running", "recommendation", "happiness", "defective", "shipping", "disappointment". What would each become? (Hint: Porter stemmer aggressively removes suffixes.)
c) Review 2 contains negation. If you remove stop words using NLTK's default English list, which critical words would be lost? How would this affect downstream sentiment analysis?
d) Design a custom stop word list for product review sentiment analysis. Start with NLTK's English list and specify which words you would ADD and which you would REMOVE. Justify each change.
e) Review 5 has all-caps "AVOID." In what scenario would you want to preserve case information rather than lowercasing? How would VADER handle "AVOID" differently from "avoid"?
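The part (a) pipeline can be checked mechanically. A minimal sketch, assuming punctuation (including apostrophes) is stripped before tokenizing and using exactly the stop list given above:

```python
import re

review = "I LOVE this product!!! It's the BEST I've ever used."
stop_words = {"i", "this", "it", "the", "is", "its", "ive", "ever"}

# Lowercase, then strip everything except letters, digits, and whitespace
cleaned = re.sub(r"[^a-z0-9\s]", "", review.lower())
tokens = cleaned.split()                              # tokenize on whitespace
tokens = [t for t in tokens if t not in stop_words]   # remove stop words
print(tokens)  # -> ['love', 'product', 'best', 'used']
```

Note that removing apostrophes first is what turns "it's" into "its" and "I've" into "ive", which is why those forms appear in the stop list.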
Exercise 2: Bag-of-Words and TF-IDF by Hand (Conceptual + Computation)
Given the following preprocessed documents (stop words already removed):
- Doc A: "great product love quality"
- Doc B: "terrible product broke quickly"
- Doc C: "great product fast shipping"
- Doc D: "product terrible waste money"
a) Construct the Bag-of-Words document-term matrix by hand. The vocabulary should be alphabetically sorted.
b) Compute the TF (term frequency, here the raw count) for "product" in each document and the IDF (inverse document frequency) for "product" using the formula: IDF = log((1 + N) / (1 + df)) + 1, where N is the number of documents and df is the document frequency. What is the TF-IDF weight for "product" in Doc A?
c) Compute the IDF for "great" and "terrible." Which has a higher IDF? Explain why in terms of what IDF measures.
d) If you added a fifth document --- Doc E: "great great great product quality" --- how would the BoW count for "great" differ from its TF-IDF weight? Why does TF-IDF's sublinear_tf=True setting matter here?
e) What is the sparsity of the 4-document BoW matrix from part (a)? In a real corpus with 50,000 documents and a vocabulary of 20,000 words, what sparsity would you expect, and why does scikit-learn use sparse matrices?
Exercise 3: TF-IDF Vectorizer Parameters (Code)
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
np.random.seed(42)
# Sample corpus of 20 product reviews
corpus = [
    "great product excellent quality love it highly recommend",
    "terrible quality broke after one week waste of money",
    "fast shipping arrived early package was perfect condition",
    "not worth the price cheaply made would not recommend",
    "love this product best purchase ever made so happy",
    "product does not work defective arrived broken return",
    "amazing value for the price great quality sturdy build",
    "slow shipping took three weeks arrived damaged box",
    "perfect product exactly as described works perfectly fine",
    "worst product ever purchased complete waste never again",
    "good quality decent product nothing special but works",
    "horrible customer service no refund no response avoid",
    "exceeded expectations premium quality fast delivery love",
    "cheap materials fell apart first day terrible quality",
    "works great easy to use simple setup recommended",
    "product arrived late and damaged very disappointed",
    "fantastic product everyone should buy this excellent",
    "returning this product does not match description at all",
    "five stars best product in this category great value",
    "do not buy terrible experience worst customer service",
]
a) Create four different TfidfVectorizer instances with the following configurations and compare vocabulary sizes:
- Config 1: defaults
- Config 2: max_features=50
- Config 3: min_df=2, max_df=0.8
- Config 4: ngram_range=(1, 2), max_features=100
Print the vocabulary size for each. Which configuration produces the most compact representation? Which produces the most features?
b) Fit Config 4 and print the top 10 features by IDF score (the rarest features). Then print the bottom 10 by IDF score (the most common features). Explain why the rare features have higher IDF.
c) Vectorize the corpus with sublinear_tf=True and without it. For a document that contains the word "great" three times, compare the TF component in each case. When would sublinear_tf=True improve classification performance?
d) Explain what max_df=0.8 does and why you would use it. Give an example of a word that would be removed by this setting in the corpus above.
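One way to start part (a) is to fit each configuration and compare vocabulary sizes. A sketch (the corpus is the 20-review list defined above, repeated here so the snippet runs standalone):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "great product excellent quality love it highly recommend",
    "terrible quality broke after one week waste of money",
    "fast shipping arrived early package was perfect condition",
    "not worth the price cheaply made would not recommend",
    "love this product best purchase ever made so happy",
    "product does not work defective arrived broken return",
    "amazing value for the price great quality sturdy build",
    "slow shipping took three weeks arrived damaged box",
    "perfect product exactly as described works perfectly fine",
    "worst product ever purchased complete waste never again",
    "good quality decent product nothing special but works",
    "horrible customer service no refund no response avoid",
    "exceeded expectations premium quality fast delivery love",
    "cheap materials fell apart first day terrible quality",
    "works great easy to use simple setup recommended",
    "product arrived late and damaged very disappointed",
    "fantastic product everyone should buy this excellent",
    "returning this product does not match description at all",
    "five stars best product in this category great value",
    "do not buy terrible experience worst customer service",
]

configs = {
    "defaults": TfidfVectorizer(),
    "max_features=50": TfidfVectorizer(max_features=50),
    "min_df=2, max_df=0.8": TfidfVectorizer(min_df=2, max_df=0.8),
    "ngram_range=(1,2), max_features=100": TfidfVectorizer(ngram_range=(1, 2),
                                                           max_features=100),
}
sizes = {}
for name, vec in configs.items():
    vec.fit(corpus)
    sizes[name] = len(vec.vocabulary_)
    print(f"{name}: {sizes[name]} features")
```

Since the full unigram vocabulary exceeds 50 terms, the max_features=50 configuration is capped at exactly 50; min_df=2 shrinks the vocabulary by dropping every word that appears in only one review.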
Exercise 4: Text Classification Pipeline (Code)
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report
np.random.seed(42)
# Generate a 3-class ticket classification dataset
categories = {
    'billing': [
        "charged twice on credit card need refund billing error",
        "subscription renewed without consent cancel billing",
        "wrong amount on invoice dispute payment charged",
        "refund not processed billing support credit card issue",
        "promo code not applied charged full price billing",
    ],
    'technical': [
        "app crashes on startup cannot open freezes phone",
        "error message when uploading file bug report",
        "slow performance lag loading time unresponsive",
        "cannot connect to server timeout network error",
        "feature not working button does nothing broken",
    ],
    'general': [
        "how do I change my password account settings",
        "where can I find my order tracking information",
        "what are your business hours contact support",
        "need help understanding pricing plans options",
        "how to update shipping address profile settings",
    ],
}
n_per_class = 500
texts, labels = [], []
for label, templates in categories.items():
    for _ in range(n_per_class):
        template = np.random.choice(templates)
        words = template.split()
        noise = list(np.random.choice(['please', 'help', 'urgent', 'thanks'], size=2))
        all_words = words + noise
        np.random.shuffle(all_words)
        texts.append(' '.join(all_words))
        labels.append(label)
df = pd.DataFrame({'text': texts, 'label': labels})
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
a) Split the data 80/20 with stratification. Build three pipelines:
- TF-IDF + Logistic Regression
- TF-IDF + Multinomial Naive Bayes
- TF-IDF + LinearSVC
Use 5-fold cross-validation to compare accuracy. Which model performs best? Which trains fastest?
b) For the best model, print the classification report on the test set. Which category is easiest to classify? Which is hardest? Why?
c) Extract the top 5 most important features for each class from the logistic regression model's coefficients. Do the top features make intuitive sense for each category?
d) Add a GridSearchCV to tune tfidf__ngram_range, tfidf__max_features, and clf__C simultaneously. Report the best parameters and whether they improve over the defaults.
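The part (a) comparison follows a standard pattern. A sketch on a reduced stand-in for the dataset above (fewer templates and samples so it runs quickly; scores on the full data will differ):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

np.random.seed(42)
# Reduced stand-in: two templates per class, 50 tickets per class
categories = {
    'billing': ["charged twice on credit card need refund billing error",
                "wrong amount on invoice dispute payment charged"],
    'technical': ["app crashes on startup cannot open freezes phone",
                  "cannot connect to server timeout network error"],
    'general': ["how do I change my password account settings",
                "what are your business hours contact support"],
}
texts, labels = [], []
for label, templates in categories.items():
    for _ in range(50):
        words = np.random.choice(templates).split()
        words += list(np.random.choice(['please', 'help', 'urgent', 'thanks'], size=2))
        np.random.shuffle(words)
        texts.append(' '.join(words))
        labels.append(label)

# One TF-IDF + classifier pipeline per model, scored with 5-fold CV
cv_means = {}
for name, clf in [('logreg', LogisticRegression(max_iter=1000)),
                  ('nb', MultinomialNB()),
                  ('svc', LinearSVC())]:
    pipe = Pipeline([('tfidf', TfidfVectorizer()), ('clf', clf)])
    scores = cross_val_score(pipe, texts, labels, cv=5)
    cv_means[name] = scores.mean()
    print(f"{name}: {cv_means[name]:.3f} (+/- {scores.std():.3f})")
```

Putting the vectorizer inside the Pipeline is what makes the cross-validation honest: the TF-IDF vocabulary is refit on each training fold rather than leaking information from the held-out fold.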
Exercise 5: Sentiment Analysis Comparison (Code)
import numpy as np
from nltk.sentiment import SentimentIntensityAnalyzer
np.random.seed(42)
# Tricky sentiment examples
tricky_reviews = [
    ("Not bad at all, actually quite good.", "positive"),
    ("I wouldn't say this is terrible.", "positive"),
    ("Could be worse, but could be much better.", "negative"),
    ("The product is fine. Just fine. Nothing more.", "neutral"),
    ("Absolutely phenomenal! Best purchase of my life!", "positive"),
    ("Worst. Purchase. Ever.", "negative"),
    ("It works. That's about all I can say.", "neutral"),
    ("I expected more for the price, very disappointed.", "negative"),
    ("Not the best, not the worst.", "neutral"),
    ("This product changed my life for the better.", "positive"),
]
a) Score each review with VADER. Map compound scores to labels using thresholds: positive >= 0.05, negative <= -0.05, neutral otherwise. Which reviews does VADER misclassify?
b) For the misclassified reviews, explain why VADER gets them wrong. What linguistic patterns does VADER struggle with?
c) Design a simple rule-based improvement to VADER for product reviews. What domain-specific rules would you add? (Example: "for the price" following a negative word amplifies negativity.)
d) If you had 500 labeled product reviews, describe how you would build an ML-based sentiment classifier that handles these tricky cases. What preprocessing choices and features would help?
Exercise 6: Topic Modeling with LDA (Code)
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
np.random.seed(42)
# Simulate a corpus of 3000 news article snippets across 4 topics
topic_words = {
    'sports': 'game team win score player season championship league tournament final',
    'politics': 'election vote president senate law policy campaign government reform bill',
    'technology': 'software data algorithm startup cloud platform digital innovation app code',
    'health': 'patient treatment hospital clinical study drug vaccine disease therapy trial',
}
articles = []
true_labels = []
for _ in range(3000):
    topic = np.random.choice(list(topic_words.keys()))
    words = topic_words[topic].split()
    # Primary topic words (80%) + random noise (20%)
    n_primary = np.random.randint(6, 12)
    n_noise = np.random.randint(1, 4)
    primary = list(np.random.choice(words, size=n_primary, replace=True))
    noise_topic = np.random.choice([t for t in topic_words if t != topic])
    noise = list(np.random.choice(topic_words[noise_topic].split(), size=n_noise))
    all_words = primary + noise
    np.random.shuffle(all_words)
    articles.append(' '.join(all_words))
    true_labels.append(topic)
a) Fit LDA with k=4 topics. Display the top 10 words per topic. Do the discovered topics align with the ground truth? Manually label each discovered topic.
b) Transform the corpus to get topic distributions for each document. Compute the "topic purity" --- for each true label, what fraction of documents have their correct topic as the dominant (highest-probability) topic?
c) Experiment with k=3 and k=6. For k=3, which two ground truth topics get merged? For k=6, which ground truth topic gets split? Print the top words for each to show this.
d) Compute the average topic distribution entropy for each document. Documents with high entropy are "mixed topic" --- they do not belong cleanly to one topic. What is the median entropy? What fraction of documents have entropy above 1.0?
e) LDA's generative model assumes raw word counts, so it is normally paired with CountVectorizer rather than TfidfVectorizer. Deliberately fit LDA on TF-IDF vectors and compare the topic quality (top words) to the CountVectorizer version. Is the difference noticeable?
Exercise 7: Progressive Project --- StreamFlow Churn Model Enhancement (Code)
This exercise integrates NLP features into the StreamFlow churn prediction pipeline from earlier chapters.
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_val_score
np.random.seed(42)
# Simulated subscriber-level data with text features
n_subs = 3000
subscriber_data = pd.DataFrame({
    'subscriber_id': range(1, n_subs + 1),
    'months_active': np.random.randint(1, 36, n_subs),
    'avg_hours_weekly': np.random.exponential(5, n_subs).round(1),
    'n_support_tickets': np.random.poisson(1.5, n_subs),
    'last_ticket_text': np.random.choice([
        "billing error charged twice refund",
        "video buffering slow stream quality",
        "cannot login account locked password",
        "show removed library content missing",
        "want to cancel subscription too expensive",
        "great service no complaints satisfied",
        "app crashes frequently unusable",
        "content boring nothing new to watch",
    ], n_subs),
    'churned': np.random.binomial(1, 0.25, n_subs),
})
a) Build two churn prediction models:
- Model A: numeric features only (months_active, avg_hours_weekly, n_support_tickets)
- Model B: numeric features + TF-IDF of last_ticket_text using ColumnTransformer
Compare 5-fold cross-validated AUC. Does adding text features improve the model?
b) From Model B, extract the top 10 most important text features for predicting churn. Which ticket topics are most predictive of cancellation?
c) Instead of using the raw TF-IDF features, create a two-stage pipeline: first classify tickets into topics (using a pre-trained classifier), then use the topic probabilities as features in the churn model. Compare this approach to the raw TF-IDF approach from part (a). Which is more interpretable? Which performs better?
d) Write a function predict_churn_risk(subscriber_data) that takes a DataFrame with both numeric and text columns and returns a churn probability for each subscriber. The function should use the Model B pipeline.
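The Model B plumbing for part (a) and the part (d) function share one pipeline. A sketch with a small stand-in frame (fewer rows and ticket templates than the data above, so it runs standalone):

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

np.random.seed(42)
n = 200  # small stand-in for the 3000-subscriber frame above
df = pd.DataFrame({
    'months_active': np.random.randint(1, 36, n),
    'avg_hours_weekly': np.random.exponential(5, n).round(1),
    'n_support_tickets': np.random.poisson(1.5, n),
    'last_ticket_text': np.random.choice([
        "billing error charged twice refund",
        "want to cancel subscription too expensive",
        "great service no complaints satisfied",
    ], n),
    'churned': np.random.binomial(1, 0.25, n),
})

numeric_cols = ['months_active', 'avg_hours_weekly', 'n_support_tickets']
pre = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('txt', TfidfVectorizer(), 'last_ticket_text'),  # a column name (string), not a list
])
# Unlisted columns (including 'churned') are dropped by default, so no target leakage
model_b = Pipeline([('pre', pre), ('clf', LogisticRegression(max_iter=1000))])
model_b.fit(df, df['churned'])

def predict_churn_risk(subscriber_data):
    """Return P(churn) per row; expects the numeric and text columns above."""
    return model_b.predict_proba(subscriber_data)[:, 1]

risk = predict_churn_risk(df)
print(risk[:5].round(3))
```

The string-vs-list distinction in ColumnTransformer matters: TfidfVectorizer expects a 1-D sequence of strings, which is what a single column name selects; a one-element list would pass a 2-D frame and fail.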
Solutions to selected exercises are available in the Answers to Selected Exercises appendix.