Chapter 35 Exercises: Natural Language Processing for Business Text

These exercises are organized into five tiers, from foundational to advanced. Complete them in order — each tier builds on skills from the previous one.


Tier 1: Foundations (No Prior NLP Knowledge Required)

Exercise 1.1 — Text Preprocessing Pipeline

Write a function clean_and_tokenize(text) that takes a raw string and returns a list of processed tokens. The function should:

  • Convert to lowercase
  • Remove punctuation and numbers
  • Split into individual words
  • Remove words shorter than 3 characters

Test it on the following strings:

"Our Q4 revenue grew by 23.7%! Fantastic results for the team."
"URGENT: Package #98342 has NOT been delivered. Contact us ASAP!!!"
"The customer's satisfaction score was 4.5/5.0 — excellent performance."

Expected output for the first string: ['our', 'revenue', 'grew', 'fantastic', 'results', 'for', 'the', 'team']


Exercise 1.2 — Stopword Removal

Using NLTK, extend your clean_and_tokenize function to remove English stopwords. Compare the output with and without stopword removal for a five-sentence paragraph about a business meeting.

Questions to answer:

  • How many words were removed as stopwords?
  • What percentage of the original word count remains?
  • Are any words removed that you think should have been kept?
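A minimal sketch of the extended function is shown below. To stay dependency-free, a small inline stopword set stands in for NLTK's full English list (nltk.corpus.stopwords.words('english'), which requires a one-time nltk.download('stopwords')); treat the set's contents as placeholders.

```python
import re

# Placeholder stopword set; swap in NLTK's full list for the exercise.
STOPWORDS = {"the", "a", "an", "and", "or", "but", "for", "our", "was", "were", "with"}

def clean_and_tokenize(text, remove_stopwords=False):
    text = re.sub(r'[^\w\s]', '', text.lower())   # lowercase, strip punctuation
    text = re.sub(r'\d+', '', text)               # strip numbers
    tokens = [t for t in text.split() if len(t) >= 3]
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens

sentence = "Our Q4 revenue grew by 23.7%! Fantastic results for the team."
with_stops = clean_and_tokenize(sentence)
without = clean_and_tokenize(sentence, remove_stopwords=True)
print(len(with_stops) - len(without), "words removed as stopwords")
```

Running both variants on the same sentence makes the before/after comparison in the questions straightforward.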


Exercise 1.3 — TextBlob Quick Sentiment

Score the sentiment of these five customer messages using TextBlob and print the polarity and a human-readable label (POSITIVE/NEUTRAL/NEGATIVE):

messages = [
    "Thank you so much for the quick resolution. You've been amazing.",
    "The package arrived but the box was slightly dented.",
    "I've called three times and nobody has helped me. This is ridiculous.",
    "Item received. Works as expected.",
    "Honestly surprised by the quality — much better than I expected!",
]

For each message, print: the first 50 characters, the polarity score, and the label.
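One way to map a polarity score to a label is sketched below. The ±0.1 cutoffs are an assumption to tune for your data; the TextBlob call itself is shown in comments so the sketch runs without the library installed.

```python
# Map a polarity score in [-1.0, +1.0] to a label; ±0.1 is an assumed cutoff.
def polarity_label(polarity):
    if polarity > 0.1:
        return "POSITIVE"
    if polarity < -0.1:
        return "NEGATIVE"
    return "NEUTRAL"

# With TextBlob installed, the loop would look like:
# from textblob import TextBlob
# for msg in messages:
#     p = TextBlob(msg).sentiment.polarity
#     print(f"{msg[:50]:<50} {p:+.2f} {polarity_label(p)}")
print(polarity_label(0.6), polarity_label(-0.5), polarity_label(0.0))
```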


Tier 2: Core Skills

Exercise 2.1 — Batch Sentiment Analysis

Create a DataFrame with 15 rows representing customer reviews. Include columns: review_id, product_category, review_text, star_rating. Add reviews across three product categories (5 per category).

Apply TextBlob sentiment analysis to create polarity and sentiment_label columns. Then:

  • Calculate the average polarity per product category
  • Identify the product category with the lowest average sentiment
  • Check: does your NLP-based sentiment ranking match the star-rating ranking?


Exercise 2.2 — Keyword Frequency Counter

You have been given 20 customer support ticket texts (create your own realistic samples for a fictional e-commerce company). Build a function that:

  1. Preprocesses all ticket texts
  2. Counts the frequency of each word across all tickets
  3. Returns the top 15 keywords as a formatted table with rank, keyword, and count

Which keywords appear most frequently? Do they suggest any business problems?
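A minimal counting sketch using collections.Counter follows; the two sample tickets are invented placeholders for your 20.

```python
import re
from collections import Counter

def top_keywords(texts, n=15):
    counts = Counter()
    for text in texts:
        cleaned = re.sub(r'[^\w\s]', '', text.lower())   # lowercase, strip punctuation
        counts.update(t for t in cleaned.split() if len(t) >= 3)
    return counts.most_common(n)

tickets = [
    "My refund has not arrived and shipping was slow.",
    "Slow shipping again, still waiting on the refund.",
]
for rank, (word, count) in enumerate(top_keywords(tickets, n=5), start=1):
    print(f"{rank:>2}  {word:<12} {count}")
```

Counter.most_common already returns (word, count) pairs sorted by frequency, so only the table formatting is left to write.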


Exercise 2.3 — Simple Text Classifier

Write a function classify_inquiry(text) that classifies a customer inquiry into one of five categories:

  • billing
  • shipping
  • returns
  • technical_support
  • general

Use keyword matching. Define at least 8 keywords per category. Test your classifier on 10 sample inquiries and report the classification result for each.
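The matching logic can be sketched as below, with two categories shown; the keyword sets are illustrative, not the 8-per-category lists the exercise asks for.

```python
# Illustrative keyword sets; the exercise requires at least 8 per category
# across all five categories.
CATEGORY_KEYWORDS = {
    "billing": {"invoice", "charge", "payment", "refund", "card", "bill", "receipt", "overcharged"},
    "shipping": {"delivery", "shipped", "tracking", "package", "arrived", "courier", "delayed", "address"},
}

def classify_inquiry(text):
    words = set(text.lower().split())
    # Score each category by how many of its keywords appear in the inquiry.
    scores = {cat: len(words & kws) for cat, kws in CATEGORY_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "general"

print(classify_inquiry("My package tracking shows it never arrived"))
```

Falling back to "general" when no keywords match avoids arbitrary ties at zero.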


Exercise 2.4 — N-gram Extraction

Using NLTK's ngrams function, extract all bigrams (2-word phrases) and trigrams (3-word phrases) from a collection of product reviews. Find the 10 most common bigrams and the 10 most common trigrams.

How do the results differ from single-word frequency analysis? Give one example of a bigram that provides more business insight than its individual words would.


Tier 3: Applied Business Analysis

Exercise 3.1 — Support Ticket Triage System

Build a complete support ticket triage function that takes a DataFrame of support tickets and returns the same DataFrame sorted by urgency. Urgency should be calculated from:

  • Sentiment polarity (more negative = more urgent)
  • Ticket age in hours (older = more urgent)
  • Presence of specific escalation keywords ("urgent," "manager," "lawyer," "refund demand")

The function should add an urgency_score column (0 to 10) and a priority_label column (HIGH/MEDIUM/LOW). Test it on a DataFrame of at least 15 sample tickets.
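One possible scoring formula is sketched below in plain Python; the component weights, the 48-hour cap, and the HIGH/MEDIUM/LOW cutoffs are all assumptions to tune, and the keyword set is abbreviated.

```python
# Assumed escalation keywords (abbreviated from the exercise list).
ESCALATION_KEYWORDS = {"urgent", "manager", "lawyer", "refund"}

def urgency_score(polarity, age_hours, text):
    sentiment_part = (1 - polarity) / 2 * 4        # 0-4: more negative = higher
    age_part = min(age_hours / 48, 1.0) * 3        # 0-3: caps at 48 hours old
    kw_hits = sum(1 for w in text.lower().split() if w.strip(".,!?") in ESCALATION_KEYWORDS)
    keyword_part = min(kw_hits, 3)                 # 0-3: one point per keyword hit
    return round(sentiment_part + age_part + keyword_part, 1)

def priority_label(score):
    return "HIGH" if score >= 7 else "MEDIUM" if score >= 4 else "LOW"

s = urgency_score(-0.8, 72, "This is urgent, I want my manager involved")
print(s, priority_label(s))
```

Wrapping this per-row logic with DataFrame.apply yields the urgency_score and priority_label columns.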


Exercise 3.2 — Product Review Dashboard

Given a DataFrame of 50+ product reviews (create realistic synthetic data), build an analysis function that produces:

  1. Overall sentiment summary (counts and percentages)
  2. Average polarity by star rating (1-5) — are 4-star reviews actually more positive in language than 3-star reviews?
  3. Top 10 most common words in 5-star reviews
  4. Top 10 most common words in 1-star reviews
  5. Three words that appear proportionally more in negative reviews than in positive reviews

Present the output in a clean, readable format.


Exercise 3.3 — spaCy Named Entity Extraction

Write a function extract_business_entities(text) using spaCy that returns a structured dictionary containing:

  • organizations: list of mentioned company names
  • people: list of person names
  • dates: list of date expressions
  • money: list of monetary amounts
  • locations: list of cities/countries/states

Test it on:

  1. A sample contract excerpt you write yourself (100-200 words)
  2. A sample business email (75-150 words)
  3. A news article excerpt about a business acquisition (100-200 words)

Which document type yielded the most entities? Which entity type was most reliable?


Exercise 3.4 — Sentiment Trend Over Time

Create a DataFrame of 60 customer reviews with realistic created_date values spanning 6 months and realistic review_text values (some positive, some negative, with a pattern — perhaps reviews get worse in the final two months due to a fictional product defect).

  • Calculate average sentiment by month
  • Plot a line chart of average polarity over time
  • Add a horizontal reference line at y=0 (neutral sentiment)
  • Write a two-sentence interpretation of the chart

Tier 4: Integration and Advanced Application

Exercise 4.1 — ML Text Classifier

Using scikit-learn, build a machine learning text classifier for customer support tickets:

  1. Create a labeled dataset of at least 60 tickets across 4 categories (15 per category). Write the ticket texts yourself — make them realistic.
  2. Split 80/20 into train/test sets
  3. Build a Pipeline with TfidfVectorizer and MultinomialNB
  4. Train and evaluate the model
  5. Print the classification report

Questions:

  • Which category has the lowest F1 score? Why might that be?
  • What would you need to do to improve accuracy?
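The Pipeline wiring from step 3 can be sketched as follows; the six invented tickets stand in for the 60-ticket dataset and are only enough to show the API, not to train a meaningful model.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny invented dataset; the exercise calls for 60+ tickets.
texts = [
    "I was charged twice on my invoice", "refund the extra payment please",
    "package never arrived", "tracking number shows no movement",
    "the app crashes when I log in", "error message on checkout page",
]
labels = ["billing", "billing", "shipping", "shipping", "technical", "technical"]

model = Pipeline([("tfidf", TfidfVectorizer()), ("nb", MultinomialNB())])
model.fit(texts, labels)
print(model.predict(["my invoice shows a double charge"]))
```

For the real exercise, fit on the training split only and score the held-out 20% with sklearn.metrics.classification_report.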


Exercise 4.2 — LDA Topic Discovery

Collect or create a dataset of at least 40 short text documents on a business theme (customer reviews, employee feedback, product descriptions — your choice). Build an LDA topic model with 4 topics.

Then:

  1. Display the top 8 words per topic
  2. Label each topic with your interpretation
  3. Assign the dominant topic to each document
  4. Build a bar chart showing how many documents belong to each topic

Bonus: Test with 3 topics and 5 topics. Which number produces the most interpretable results? Explain your reasoning.


Exercise 4.3 — Comparative Language Analysis

You have two sets of survey responses: one from clients who renewed contracts and one from clients who did not renew. Analyze whether the language in these two groups differs.

  • What words appear more frequently in non-renewal responses?
  • Do non-renewal responses mention different topics?
  • Is there a measurable sentiment difference?

(You will need to create synthetic data for this exercise — aim for 25 responses per group.)
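A relative-frequency comparison between the two groups can be sketched as below; the two sample lists are invented placeholders for your 25-response groups, and the 1e-9 smoothing constant is an assumption to avoid division by zero.

```python
from collections import Counter

def word_rates(texts):
    # Per-group word frequencies, normalized by total word count.
    counts = Counter(w for t in texts for w in t.lower().split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

renewed = ["great support team", "responsive and helpful support"]
churned = ["slow support response", "pricing too high", "slow onboarding"]

renew_rates = word_rates(renewed)
churn_rates = word_rates(churned)
# Rank words by how over-represented they are in the churned group.
overrep = sorted(churn_rates,
                 key=lambda w: churn_rates[w] / (renew_rates.get(w, 0) + 1e-9),
                 reverse=True)
print(overrep[:3])
```

Pairing this word-level view with average polarity per group answers the sentiment-difference question.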


Exercise 4.4 — Entity Co-occurrence Network

Using spaCy, process 10 news article excerpts about business mergers and acquisitions (write or adapt them). Extract all ORG and PERSON entities. Build a co-occurrence table showing which organizations are mentioned together in the same document.

Which organization appears in the most documents? Which pair of organizations co-occurs most frequently?


Tier 5: Capstone

Exercise 5.1 — Complete NLP Analysis Pipeline

Build a self-contained Python script analyze_feedback.py that performs a complete NLP analysis on a business text dataset. The script should:

  1. Accept a CSV file path as a command-line argument
  2. Detect which column contains the main text (or accept a --text-col argument)
  3. Run sentiment analysis and produce a distribution report
  4. Perform keyword frequency analysis and display the top 20 terms
  5. If a --group-col argument is provided, break down sentiment by that column
  6. Save a summary to analysis_output.csv
  7. Generate and save two charts: sentiment distribution and (if group column provided) sentiment by group

The script should handle missing data, short responses, and encoding issues gracefully. Include a --help message. Test it on at least two different datasets.
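The command-line plumbing can be sketched with the standard library's argparse; the argument names follow the exercise spec, and the analysis steps themselves are left as stubs.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        prog="analyze_feedback.py",
        description="Run a complete NLP analysis on a business text dataset.",
    )
    parser.add_argument("csv_path", help="Path to the input CSV file")
    parser.add_argument("--text-col", help="Name of the text column (auto-detected if omitted)")
    parser.add_argument("--group-col", help="Optional column for group-level sentiment breakdown")
    return parser

# Simulate a command line; in the script, call parse_args() with no argument.
args = build_parser().parse_args(["feedback.csv", "--text-col", "comment"])
print(args.csv_path, args.text_col, args.group_col)
```

argparse generates the --help message automatically; note that --text-col is exposed as args.text_col (dashes become underscores).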


Exercise 5.2 — Multi-Dataset Comparison Study

Choose a business domain (e-commerce, healthcare, hospitality, SaaS — your choice). Find or create three different text datasets from that domain:

  • Customer reviews
  • Social media mentions (can be synthetic)
  • Support/complaint text

Apply the full NLP pipeline to each dataset separately. Write a 400-500 word analysis comparing:

  • Sentiment profiles across the three sources
  • Whether the same topics appear across all three
  • Which dataset provides the most actionable business intelligence and why
  • What NLP limitations are most apparent in your analysis


Exercise 5.3 — NLP-Powered Weekly Report Generator

Build a Python script that reads a CSV of support tickets filed in the last 7 days and generates a plain-text weekly summary report containing:

  1. Total ticket count and comparison to previous week (if prior week data is available)
  2. Sentiment breakdown with trend indicator (↑↓→)
  3. Top 5 keywords this week
  4. Most urgent ticket (lowest polarity) with ticket ID and truncated text
  5. Category breakdown table
  6. Any categories where sentiment dropped by more than 0.05 vs. prior week

The report should be formatted to be readable when pasted into an email. Include a function send_summary_email(report_text, recipients) stub that prints "Email would be sent to: [recipients]" for now.
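The email stub and a trend-indicator helper can be sketched as follows; the 0.02 dead zone for the "→" indicator is an assumed threshold.

```python
def trend_indicator(current, previous, threshold=0.02):
    # Assumed dead zone: changes within ±threshold count as flat.
    delta = current - previous
    if delta > threshold:
        return "↑"
    if delta < -threshold:
        return "↓"
    return "→"

def send_summary_email(report_text, recipients):
    # Placeholder: a real implementation would use smtplib or an email API.
    print(f"Email would be sent to: {recipients}")

print(trend_indicator(0.30, 0.10))
send_summary_email("weekly report body", ["ops@example.com"])
```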


Answer Guidance

Tier 1 Solutions

1.1 Solution sketch:

import re

def clean_and_tokenize(text: str) -> list[str]:
    text = text.lower()                        # normalize case
    text = re.sub(r'[^\w\s]', '', text)        # strip punctuation
    text = re.sub(r'\d+', '', text)            # strip numbers
    tokens = text.split()
    return [t for t in tokens if len(t) >= 3]  # drop words shorter than 3 chars

1.3 Polarity expectations: Message 1 ≈ +0.5 to +0.7, Message 3 ≈ -0.4 to -0.6, Message 4 ≈ 0 (neutral). If your scores differ significantly, check that TextBlob corpora are installed correctly.

Common Mistakes to Watch For

  • Stopword removal before sentiment analysis: Removing "not" or "never" will flip the meaning of negative sentences. Apply stopword removal only for keyword extraction and topic modeling, not for TextBlob sentiment scoring.

  • Forgetting to handle NaN values: Real datasets almost always contain missing text. Use .fillna('') or check isinstance(text, str) before processing.

  • Over-trusting individual polarity scores: A single score of +0.1 does not mean a review is meaningfully positive. Look at distributions and aggregates.

  • LDA with too few documents: LDA needs sufficient data to find stable topic patterns. With fewer than 30 documents, topics will be unstable across runs. For small datasets, stick to keyword frequency analysis.