In This Chapter
- Learning Objectives
- Introduction: The Mountain of Unread Text
- 35.1 What Is Natural Language Processing?
- 35.2 Text Preprocessing: Cleaning Your Data
- 35.3 Sentiment Analysis
- 35.4 Keyword Extraction and Frequency Analysis
- 35.5 Named Entity Recognition
- 35.6 Text Classification
- 35.7 Topic Modeling: What Are People Talking About?
- 35.8 Analyzing Survey Data at Scale
- 35.9 Practical Limits of Classical NLP
- 35.10 Putting It All Together: A Complete Analysis Workflow
- Chapter Summary
- Key Terms
Chapter 35: Natural Language Processing for Business Text
Learning Objectives
By the end of this chapter, you will be able to:
- Explain what Natural Language Processing is and why it matters for business operations
- Preprocess raw text data for analysis using lowercasing, stopword removal, and lemmatization
- Tokenize text into words and sentences using NLTK and spaCy
- Perform sentiment analysis on customer reviews, support tickets, and survey responses with TextBlob
- Extract named entities (people, organizations, dates, money) from business documents using spaCy
- Classify text into categories to route customer inquiries automatically
- Apply keyword extraction and frequency analysis to large collections of text
- Understand the practical limits of classical NLP approaches
Introduction: The Mountain of Unread Text
Every business generates an enormous amount of text. Customer emails arrive overnight. Support tickets pile up over the weekend. Survey responses trickle in for weeks after a product launch. Employees write meeting notes, sales teams log call summaries, and customers leave reviews on a dozen different platforms.
Most of that text goes largely unread, or at best skimmed. A human analyst can review fifty customer reviews in an afternoon and come away with a general impression. But what about five thousand? Or fifty thousand?
This is the problem Natural Language Processing (NLP) solves.
NLP is the branch of computer science and linguistics that gives programs the ability to understand, interpret, and manipulate human language. It has been a research field for decades, and the tools available today — particularly in Python — make it remarkably accessible to business analysts who are not machine learning researchers.
In this chapter, we will focus on classical NLP techniques: the workhorses of text analysis that have been refined over thirty years and that work reliably on business text without requiring a PhD or a GPU cluster. We will not cover transformers or large language models (LLMs) like GPT — those are covered in a separate domain — but the techniques here handle the vast majority of real business text analysis tasks and remain the right tool when you need interpretable, auditable results.
Priya at Acme Corp has a backlog of 4,200 customer support tickets from the last quarter. She knows, vaguely, that there are complaints about shipping and about the mobile app, but she cannot quantify either problem for Sandra Chen's board presentation. By the end of this chapter, she will be able to turn those tickets into a structured analysis in about two hours of work.
Maya Reyes, who runs a boutique consulting practice, sends a short open-text survey to clients after every engagement. She has two years of responses but has never found the time to read all of them. By the end of this chapter, she will know exactly what her clients value and what they quietly avoid mentioning.
Let's begin.
35.1 What Is Natural Language Processing?
Human language is extraordinarily complex. Words change meaning based on context. The sentence "The bank can guarantee deposits will eventually cover future tuition costs" contains the word "bank," which could mean a financial institution, a riverbank, or even a maneuver in aviation — but any competent reader instantly understands which meaning applies.
NLP is the set of computational techniques that allow programs to work with this complexity. Modern NLP spans a wide spectrum, from simple counting of word frequencies to sophisticated systems that generate fluent text. For business purposes, the most immediately useful tasks are:
Sentiment Analysis — Determining whether text expresses a positive, negative, or neutral opinion. This is the backbone of review monitoring, support ticket triage, and survey analysis.
Named Entity Recognition (NER) — Identifying and classifying specific entities mentioned in text: company names, people's names, locations, dates, dollar amounts. This powers contract analysis, competitive intelligence, and data extraction from unstructured documents.
Text Classification — Assigning text to predefined categories. A customer inquiry gets classified as "billing question," "technical support," or "general feedback" and routed accordingly.
Keyword Extraction — Identifying the most important or frequent terms in a body of text. This surfaces recurring themes without requiring you to read everything.
Topic Modeling — Discovering the latent topics that appear across a collection of documents, without knowing those topics in advance.
Tokenization — The fundamental preprocessing step of splitting text into meaningful units (words, sentences).
Each of these has direct business applications. We will work through each one with practical Python code.
The NLP Ecosystem in Python
Three libraries dominate Python NLP for business use:
NLTK (Natural Language Toolkit) is the classic, comprehensive library. It contains implementations of almost every classical NLP algorithm, extensive corpora, and excellent documentation. It can be verbose, but it is reliable and well-understood.
spaCy is the modern production-grade library. It is fast, opinionated, and excellent for named entity recognition, dependency parsing, and text preprocessing. SpaCy's pre-trained models are state of the art for classical NLP.
TextBlob is the simplest option. It wraps NLTK and provides an intuitive API for sentiment analysis, part-of-speech tagging, and basic NLP tasks. It is ideal for quick analyses where you need results fast without fine-tuning.
You will also encounter scikit-learn for text classification and gensim for topic modeling. We will use all of these in this chapter.
Installation
pip install nltk spacy textblob scikit-learn gensim
python -m spacy download en_core_web_sm
python -m textblob.download_corpora
After installing NLTK, you need to download the data packages:
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')   # required by newer NLTK releases
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')
35.2 Text Preprocessing: Cleaning Your Data
Raw business text is messy. Customer emails contain greetings, email signatures, HTML remnants, and inconsistent capitalization. Before any meaningful analysis, you need to normalize and clean the text. This preprocessing stage is not glamorous, but it is where the quality of your analysis is determined.
The standard preprocessing pipeline consists of several steps, and the right combination depends on your task.
35.2.1 Lowercasing
The simplest step: convert all text to lowercase. This ensures that "Shipping," "SHIPPING," and "shipping" are treated as the same word.
text = "We had TERRIBLE service but the PRODUCT was fine."
text_lower = text.lower()
print(text_lower)
# we had terrible service but the product was fine.
35.2.2 Punctuation and Special Character Removal
Punctuation contributes little to meaning in most NLP tasks. You can remove it with Python's re module:
import re
def remove_punctuation(text: str) -> str:
    """Remove punctuation from text, preserving spaces."""
    return re.sub(r'[^\w\s]', '', text)
sample = "Great product! Arrived quickly, no damage. 5/5 stars."
cleaned = remove_punctuation(sample)
print(cleaned)
# Great product Arrived quickly no damage 55 stars
Note that we removed the slash from "5/5" and it became "55" — this illustrates why you should think carefully about each step. For review star ratings or structured data embedded in text, you may need a more nuanced approach.
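If ratios like "5/5" carry information you need, one option (a sketch, not the only way) is to keep slashes only when they sit between digits; the helper name here is invented for the example:

```python
import re

def remove_punctuation_keep_ratios(text: str) -> str:
    """Strip punctuation but keep slashes that sit between digits (e.g. '5/5')."""
    # First drop any slash NOT flanked by digits, then strip remaining punctuation
    text = re.sub(r'(?<!\d)/|/(?!\d)', ' ', text)
    return re.sub(r'[^\w\s/]', '', text)

print(remove_punctuation_keep_ratios("Great product! Arrived quickly, no damage. 5/5 stars."))
# Great product Arrived quickly no damage 5/5 stars
```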
35.2.3 Tokenization
Tokenization splits text into individual words (word tokenization) or sentences (sentence tokenization). This is more complex than it sounds: contractions ("don't" → ["do", "n't"]), hyphenated words, and abbreviations all present challenges.
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
text = """The product arrived on Dec. 15th. It was in perfect condition.
However, customer service at the N.Y. office was slow to respond."""
# Word tokenization
words = word_tokenize(text)
print("Words:", words[:10])
# Words: ['The', 'product', 'arrived', 'on', 'Dec.', '15th', '.', 'It', 'was', 'in']
# Sentence tokenization
sentences = sent_tokenize(text)
for i, sent in enumerate(sentences):
    print(f"Sentence {i+1}: {sent}")
# Sentence 1: The product arrived on Dec. 15th.
# Sentence 2: It was in perfect condition.
# Sentence 3: However, customer service at the N.Y. office was slow to respond.
NLTK's sentence tokenizer correctly treats the period in "Dec." as part of an abbreviation rather than a sentence boundary, so "The product arrived on Dec. 15th." stays a single sentence. This is non-trivial and the result of years of engineering.
SpaCy handles tokenization differently — it processes text as a Doc object and makes tokens available as attributes:
import spacy
nlp = spacy.load("en_core_web_sm")
doc = nlp("The order arrived Dec. 15th, but the N.Y. warehouse shipped it Dec. 10th.")
for token in doc:
    print(f"{token.text:<20} {token.pos_:<10} {token.lemma_}")
35.2.4 Stopword Removal
Stopwords are common words — "the," "a," "is," "and," "of" — that appear in almost every document and carry little analytical meaning. Removing them reduces noise and focuses analysis on substantive content.
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
stop_words = set(stopwords.words('english'))
text = "The customer was very unhappy with the delivery time and the packaging."
tokens = word_tokenize(text.lower())
filtered = [word for word in tokens if word not in stop_words and word.isalpha()]
print("Original tokens:", len(tokens))
print("After stopword removal:", len(filtered))
print("Remaining:", filtered)
# Remaining: ['customer', 'unhappy', 'delivery', 'time', 'packaging']
Stopword removal dramatically improves the signal-to-noise ratio for tasks like keyword extraction and topic modeling.
A word of caution: for sentiment analysis, removing stopwords can be counterproductive. "Not good" becomes "good" if you remove the word "not." For sentiment analysis, either keep stopwords or use a library that handles negation intelligently.
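To see the problem concretely, here is a toy filter (the stopword set is hand-picked for the example; NLTK's full English list also includes "not"):

```python
# Toy stopword set for illustration; NLTK's real English list also contains "not"
stop_words = {"the", "a", "is", "was", "not"}

review = "the service was not good"
filtered = [t for t in review.split() if t not in stop_words]

print(filtered)  # ['service', 'good'] -- the negation has vanished
```

The filtered tokens now read as a positive statement, which is the opposite of what the customer said.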
35.2.5 Stemming vs. Lemmatization
Both stemming and lemmatization reduce words to a base form. The difference matters in practice.
Stemming is a blunt, rule-based approach: it chops off word endings. It is fast but crude.
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
words = ['shipping', 'shipped', 'shipment', 'ships', 'running', 'runs', 'ran']
for word in words:
    print(f"{word:<15} → {stemmer.stem(word)}")
# shipping → ship
# shipped → ship
# shipment → shipment (note: not reduced to 'ship')
# ships → ship
# running → run
# runs → run
# ran → ran (irregular verb — stemmer fails here)
Lemmatization is linguistically aware: it uses vocabulary and morphological analysis to return the actual dictionary form (the lemma) of a word. It is slower but more accurate.
from nltk.stem import WordNetLemmatizer
import nltk
lemmatizer = WordNetLemmatizer()
words = ['shipping', 'shipped', 'ships', 'running', 'runs', 'ran', 'better', 'geese']
for word in words:
    # You need to specify part of speech for best results; 'v' = verb, 'n' = noun
    print(f"{word:<15} → {lemmatizer.lemmatize(word, pos='v')}")
# shipping → ship
# shipped → ship
# ships → ship
# running → run
# runs → run
# ran → run (correctly handles irregular verb)
# better → better (needs pos='a' for adjective)
# geese → geese (needs pos='n' for noun)
For business text analysis, lemmatization is almost always preferred. The extra computational cost is negligible on modern hardware, and the improved accuracy matters when you are comparing word frequencies or building topic models.
35.2.6 Putting the Pipeline Together
Here is a complete preprocessing function that handles all of these steps:
import re
import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
# Download required NLTK data (run once)
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))
def preprocess_text(
    text: str,
    remove_stops: bool = True,
    lemmatize: bool = True,
    min_word_length: int = 2
) -> list[str]:
    """
    Preprocess text for NLP analysis.

    Args:
        text: Raw input text.
        remove_stops: Whether to remove stopwords.
        lemmatize: Whether to lemmatize tokens.
        min_word_length: Minimum word length to keep.

    Returns:
        List of preprocessed tokens.
    """
    if not text or not isinstance(text, str):
        return []
    # Lowercase
    text = text.lower()
    # Remove URLs
    text = re.sub(r'http\S+|www\S+', '', text)
    # Remove email addresses
    text = re.sub(r'\S+@\S+', '', text)
    # Remove punctuation (keep letters, numbers, spaces)
    text = re.sub(r'[^\w\s]', ' ', text)
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    # Tokenize
    tokens = word_tokenize(text)
    # Filter: letters only, minimum length
    tokens = [t for t in tokens if t.isalpha() and len(t) >= min_word_length]
    # Remove stopwords
    if remove_stops:
        tokens = [t for t in tokens if t not in stop_words]
    # Lemmatize
    if lemmatize:
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    return tokens
# Test it
sample_ticket = """Hi, I ordered product #12345 on Dec. 15th and it STILL hasn't arrived!
The tracking page at tracking.acmecorp.com shows 'in transit' but that's been
three weeks. Please help! - Frustrated customer"""
tokens = preprocess_text(sample_ticket)
print("Processed tokens:", tokens)
# Processed tokens: ['hi', 'ordered', 'product', 'dec', 'still', 'arrived',
#                    'tracking', 'page', 'tracking', 'acmecorp', 'com', 'show',
#                    'transit', 'three', 'week', 'please', 'help',
#                    'frustrated', 'customer']
# Note: the bare domain 'tracking.acmecorp.com' survives the URL filter (it has
# no http/www prefix) and gets split into separate tokens. Always inspect your
# output; preprocessing rules rarely cover every case on the first pass.
35.3 Sentiment Analysis
Sentiment analysis is the most immediately useful NLP capability for most businesses. It answers the question: "What do people feel about this?" Applied to customer reviews, support tickets, or survey responses, it transforms thousands of unread text documents into quantified emotional signals.
35.3.1 TextBlob for Quick Sentiment Scoring
TextBlob provides the simplest sentiment API in Python. It returns two scores for any piece of text:
- Polarity: A score from -1.0 (most negative) to +1.0 (most positive). A score near 0 is neutral.
- Subjectivity: A score from 0.0 (objective/factual) to 1.0 (highly subjective/opinionated).
from textblob import TextBlob
reviews = [
"Absolutely love this product! It exceeded all my expectations.",
"The item was okay. Nothing special, but it worked as described.",
"Terrible experience. The product broke after two days and support was useless.",
"Shipped faster than expected. Good quality for the price.",
"I've been waiting three weeks. Complete disaster.",
]
print(f"{'Review':<55} {'Polarity':>10} {'Subjectivity':>14} {'Label'}")
print("-" * 95)
for review in reviews:
    blob = TextBlob(review)
    polarity = blob.sentiment.polarity
    subjectivity = blob.sentiment.subjectivity
    if polarity > 0.1:
        label = "POSITIVE"
    elif polarity < -0.1:
        label = "NEGATIVE"
    else:
        label = "NEUTRAL"
    short = review[:52] + "..." if len(review) > 55 else review
    print(f"{short:<55} {polarity:>10.3f} {subjectivity:>14.3f} {label}")
Output:
Review Polarity Subjectivity Label
-----------------------------------------------------------------------------------------------
Absolutely love this product! It exceeded all my e... 0.625 0.975 POSITIVE
The item was okay. Nothing special, but it worked ... 0.225 0.550 POSITIVE
Terrible experience. The product broke after two d... -0.467 0.767 NEGATIVE
Shipped faster than expected. Good quality for the... 0.650 0.650 POSITIVE
I've been waiting three weeks. Complete disaster. -0.750 1.000 NEGATIVE
35.3.2 Analyzing a Collection of Reviews
In practice, you will apply sentiment analysis to a DataFrame of reviews, not individual strings:
import pandas as pd
from textblob import TextBlob
def analyze_sentiment(df: pd.DataFrame, text_column: str) -> pd.DataFrame:
    """
    Add sentiment scores to a DataFrame of text.

    Args:
        df: DataFrame containing text data.
        text_column: Name of the column containing text to analyze.

    Returns:
        DataFrame with added polarity, subjectivity, and sentiment_label columns.
    """
    df = df.copy()
    # Apply TextBlob to each row
    blobs = df[text_column].fillna('').apply(TextBlob)
    df['polarity'] = blobs.apply(lambda b: b.sentiment.polarity)
    df['subjectivity'] = blobs.apply(lambda b: b.sentiment.subjectivity)
    # Categorize sentiment (note: pd.np was removed from pandas, so we
    # classify with a plain conditional instead of np.select)
    df['sentiment_label'] = df['polarity'].apply(
        lambda p: 'POSITIVE' if p > 0.1 else ('NEGATIVE' if p < -0.1 else 'NEUTRAL')
    )
    return df
# Aggregate sentiment by category
def sentiment_summary(df: pd.DataFrame, group_col: str) -> pd.DataFrame:
    """Summarize sentiment statistics grouped by a category column."""
    summary = df.groupby(group_col).agg(
        ticket_count=('polarity', 'count'),
        avg_polarity=('polarity', 'mean'),
        pct_negative=('sentiment_label', lambda x: (x == 'NEGATIVE').mean() * 100),
        pct_positive=('sentiment_label', lambda x: (x == 'POSITIVE').mean() * 100),
    ).round(2)
    return summary.sort_values('avg_polarity')
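To see the shape of the output sentiment_summary produces, here is the same aggregation applied to a toy DataFrame with hand-assigned scores (no TextBlob required; the column names match the functions above):

```python
import pandas as pd

# Toy tickets with hand-assigned scores (in practice these come from analyze_sentiment)
df = pd.DataFrame({
    'category': ['shipping', 'shipping', 'shipping', 'billing', 'billing'],
    'polarity': [-0.5, -0.2, 0.3, 0.4, 0.6],
    'sentiment_label': ['NEGATIVE', 'NEGATIVE', 'POSITIVE', 'POSITIVE', 'POSITIVE'],
})

# The same aggregation sentiment_summary() performs
summary = df.groupby('category').agg(
    ticket_count=('polarity', 'count'),
    avg_polarity=('polarity', 'mean'),
    pct_negative=('sentiment_label', lambda x: (x == 'NEGATIVE').mean() * 100),
).round(2).sort_values('avg_polarity')

print(summary)
# 'shipping' sorts first: 3 tickets, avg_polarity -0.13, pct_negative 66.67
```

Sorting by average polarity puts the unhappiest category at the top, which is usually where you want to look first.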
35.3.3 Understanding TextBlob's Limitations
TextBlob uses a lexicon-based approach: it contains a dictionary of words with pre-assigned sentiment scores and combines them. This makes it:
Fast and transparent — you can understand exactly why a score was assigned.
Domain-agnostic — which also means it can miss domain-specific sentiment. In a medical context, "positive" test result is bad news. In a restaurant review, "killer" might mean "excellent."
Sensitive to negation in simple cases — "not bad" scores slightly positive — but it struggles with complex negation like "I wouldn't say it was terrible, but..."
Misled by sarcasm — "Oh, great, another delay. Wonderful." will score as positive.
For most business applications — product reviews, customer satisfaction surveys, support ticket triage — TextBlob's accuracy is typically in the 70-80% range, which is often sufficient for aggregate analysis. You are not making decisions based on individual scores; you are looking at patterns across hundreds or thousands of tickets.
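To make the lexicon mechanism concrete, here is a toy scorer built on an invented three-word lexicon; real lexicons like TextBlob's contain thousands of scored entries but combine them on the same averaging principle:

```python
# Invented mini-lexicon, for illustration only
LEXICON = {'great': 0.8, 'wonderful': 0.9, 'delay': -0.4}

def toy_polarity(text: str) -> float:
    """Average the scores of known words in the text (0.0 if none match)."""
    words = [w.strip('.,!?') for w in text.lower().split()]
    hits = [LEXICON[w] for w in words if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

# Sarcasm defeats a pure lexicon: two matched words are positive, one negative
print(toy_polarity("Oh, great, another delay. Wonderful."))  # 0.433...: scored positive
```

Because each word is scored in isolation, no lexicon-only method can detect that the sentence as a whole is sarcastic; that requires context-aware models.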
35.3.4 Business Application: Support Ticket Triage
One of the most valuable applications of sentiment analysis is triage: identifying the most urgent, most negative tickets so they get attention first.
import pandas as pd
from textblob import TextBlob
from datetime import datetime
def triage_tickets(tickets_df: pd.DataFrame) -> pd.DataFrame:
    """
    Add urgency scores to a support ticket DataFrame.

    Urgency combines negative sentiment with waiting time to prioritize
    tickets that are both very unhappy AND have been waiting the longest.
    """
    df = tickets_df.copy()
    # Add sentiment
    df['polarity'] = df['text'].apply(
        lambda t: TextBlob(str(t)).sentiment.polarity
    )
    # Hours waiting ('created_at' must be a datetime-like column)
    if 'created_at' in df.columns:
        df['hours_waiting'] = (
            datetime.now() - pd.to_datetime(df['created_at'])
        ).dt.total_seconds() / 3600
    else:
        df['hours_waiting'] = 0.0  # No timestamps: urgency reduces to sentiment alone
    # Urgency score: most negative + longest waiting.
    # Normalize both components to the 0-1 range (the epsilon avoids
    # division by zero when all values are identical).
    pol_norm = (df['polarity'] - df['polarity'].min()) / (
        df['polarity'].max() - df['polarity'].min() + 1e-9
    )
    wait_norm = (df['hours_waiting'] - df['hours_waiting'].min()) / (
        df['hours_waiting'].max() - df['hours_waiting'].min() + 1e-9
    )
    # Urgency = inverse of normalized polarity (lower polarity = higher urgency)
    # weighted 60/40 with the waiting-time contribution
    df['urgency_score'] = (1 - pol_norm) * 0.6 + wait_norm * 0.4
    return df.sort_values('urgency_score', ascending=False)
35.4 Keyword Extraction and Frequency Analysis
Before you know what questions to ask, you need to know what your text is about. Keyword extraction gives you the most important terms in a body of text, and frequency analysis tells you how often each topic appears.
35.4.1 Term Frequency Analysis
The simplest approach: count word occurrences after preprocessing.
from collections import Counter
import pandas as pd
from your_preprocessing import preprocess_text # from section 35.2.6
def extract_top_keywords(
    texts: list[str],
    top_n: int = 20,
    remove_stops: bool = True
) -> pd.DataFrame:
    """
    Extract the most frequent keywords from a collection of texts.

    Args:
        texts: List of text strings.
        top_n: Number of top keywords to return.
        remove_stops: Whether to remove stopwords before counting.

    Returns:
        DataFrame with keyword and count columns, sorted by frequency.
    """
    all_tokens = []
    for text in texts:
        tokens = preprocess_text(text, remove_stops=remove_stops)
        all_tokens.extend(tokens)
    counter = Counter(all_tokens)
    top_words = counter.most_common(top_n)
    return pd.DataFrame(top_words, columns=['keyword', 'count'])
# Example usage with support tickets
top_keywords = extract_top_keywords(tickets_df['text'].tolist(), top_n=25)
print(top_keywords.head(10))
35.4.2 TF-IDF: Finding What Makes Each Document Unique
Raw frequency has a problem: it tells you what words appear most often across all documents, but not what makes individual documents distinctive. TF-IDF (Term Frequency-Inverse Document Frequency) solves this.
TF measures how often a word appears in a specific document. IDF penalizes words that appear in almost every document (they add little information). TF-IDF is their product: it scores words that are frequent in a specific document but rare across the collection.
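To ground the arithmetic before handing it to scikit-learn, here is the textbook tf × log(N/df) computation on a three-document toy corpus. (TfidfVectorizer uses a smoothed idf and L2 normalization, so its exact numbers differ, but the ranking intuition is the same.)

```python
import math

docs = [
    ['shipping', 'delay', 'refund'],
    ['shipping', 'fast', 'great'],
    ['refund', 'issued', 'quickly'],
]
N = len(docs)

def tf_idf(term: str, doc: list[str]) -> float:
    """Plain tf * log(N / df): term frequency times inverse document frequency."""
    tf = doc.count(term) / len(doc)                 # frequency within this document
    df = sum(1 for d in docs if term in d)          # how many documents contain it
    return tf * math.log(N / df)

# 'shipping' appears in 2 of 3 docs: common, so a low score
print(round(tf_idf('shipping', docs[0]), 3))  # 0.135
# 'delay' appears in only 1 doc: distinctive, so a higher score
print(round(tf_idf('delay', docs[0]), 3))     # 0.366
```

Both terms appear once in the first document, so their tf is identical; the idf factor alone is what pushes the rarer word to the top.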
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd
def get_tfidf_keywords(texts: list[str], top_n: int = 10) -> pd.DataFrame:
    """
    Extract TF-IDF keywords from a collection of texts.

    Returns a DataFrame with each document's top keywords.
    """
    vectorizer = TfidfVectorizer(
        max_features=500,
        stop_words='english',
        ngram_range=(1, 2),  # Include bigrams (two-word phrases)
        min_df=2,            # Word must appear in at least 2 documents
        max_df=0.95,         # Ignore words in more than 95% of documents
    )
    tfidf_matrix = vectorizer.fit_transform(texts)
    feature_names = vectorizer.get_feature_names_out()
    # For each document, find the top TF-IDF terms
    results = []
    for i, row in enumerate(tfidf_matrix):
        scores = zip(feature_names, row.toarray()[0])
        top_terms = sorted(scores, key=lambda x: x[1], reverse=True)[:top_n]
        results.append({
            'document_index': i,
            'top_terms': [term for term, score in top_terms if score > 0]
        })
    return pd.DataFrame(results)

# More useful: get the overall corpus-level top TF-IDF terms
def corpus_tfidf_keywords(texts: list[str], top_n: int = 20) -> list[tuple]:
    """Get the most distinctive keywords across the whole corpus."""
    vectorizer = TfidfVectorizer(
        stop_words='english',
        ngram_range=(1, 2),
        min_df=2,
        max_df=0.9,
    )
    matrix = vectorizer.fit_transform(texts)
    # Average TF-IDF score across all documents
    mean_scores = matrix.mean(axis=0).A1
    feature_names = vectorizer.get_feature_names_out()
    top_indices = mean_scores.argsort()[-top_n:][::-1]
    return [(feature_names[i], mean_scores[i]) for i in top_indices]
35.4.3 N-grams: Capturing Phrases
Single words ("shipping") often tell you less than phrases ("delayed shipping," "shipping damage," "free shipping"). N-grams are sequences of n consecutive words. The ngram_range=(1, 2) parameter in the TF-IDF vectorizer above already handles this, but you can also extract n-grams directly:
from nltk.util import ngrams
from collections import Counter
def extract_ngrams(
    texts: list[str],
    n: int = 2,
    top_n: int = 20
) -> list[tuple]:
    """
    Extract the most common n-grams from a collection of texts.

    Args:
        texts: List of text strings.
        n: Size of n-gram (2 = bigrams, 3 = trigrams).
        top_n: Number of top n-grams to return.
    """
    all_ngrams = []
    for text in texts:
        tokens = preprocess_text(text, remove_stops=False)  # Keep stops for phrases
        text_ngrams = list(ngrams(tokens, n))
        all_ngrams.extend(text_ngrams)
    return Counter(all_ngrams).most_common(top_n)
# Example
bigrams = extract_ngrams(tickets_df['text'].tolist(), n=2, top_n=15)
for gram, count in bigrams:
    print(f"'{' '.join(gram)}': {count}")
# 'shipping delay': 234
# 'not received': 187
# 'wrong item': 143
# 'refund request': 98
35.5 Named Entity Recognition
Named Entity Recognition (NER) is the task of identifying and classifying real-world entities mentioned in text. For business applications, this is enormously useful: you can automatically extract company names from competitor mentions, pull dates from contracts, identify monetary amounts from invoices or emails, and track which locations appear in logistics documents.
35.5.1 SpaCy NER in Practice
SpaCy's pre-trained models include a highly accurate NER component. The standard English model (en_core_web_sm) recognizes the following entity types relevant for business:
| Label | Description | Example |
|---|---|---|
| PERSON | People's names | "John Smith," "Sarah" |
| ORG | Organizations, companies | "Acme Corp," "Federal Reserve" |
| GPE | Countries, cities, states | "Chicago," "Germany" |
| DATE | Dates and periods | "January 2024," "last Tuesday" |
| TIME | Times | "3:00 PM," "midnight" |
| MONEY | Monetary values | "$1,500," "two million dollars" |
| PERCENT | Percentages | "15%," "a third" |
| PRODUCT | Products, objects | "iPhone," "Model S" |
import spacy
nlp = spacy.load("en_core_web_sm")
def extract_entities(text: str) -> dict[str, list[str]]:
    """
    Extract named entities from text using spaCy.

    Returns a dictionary mapping entity type to list of entity strings.
    """
    doc = nlp(text)
    entities: dict[str, list[str]] = {}
    for ent in doc.ents:
        if ent.label_ not in entities:
            entities[ent.label_] = []
        # Avoid duplicates within same document
        if ent.text not in entities[ent.label_]:
            entities[ent.label_].append(ent.text)
    return entities
# Example: extracting from a business email
email_text = """
Dear Ms. Chen,
Following our meeting on January 15th, 2024, we are pleased to confirm that
Acme Corporation will proceed with the $45,000 consulting engagement with
Strategic Partners LLC. The work will commence February 1st and conclude
by March 31st. The primary contact at Strategic Partners will be Marcus Webb,
their VP of Operations, based in the Chicago office.
Best regards,
Priya Sharma
"""
entities = extract_entities(email_text)
for entity_type, values in entities.items():
    print(f"{entity_type}: {values}")
Output:
PERSON: ['Ms. Chen', 'Marcus Webb', 'Priya Sharma']
DATE: ['January 15th, 2024', 'February 1st', 'March 31st']
ORG: ['Acme Corporation', 'Strategic Partners LLC', 'Strategic Partners']
MONEY: ['$45,000']
GPE: ['Chicago']
(Exact entity boundaries can vary slightly between spaCy model versions.)
35.5.2 Batch Entity Extraction
For analyzing many documents, use spaCy's pipe() method, which processes documents in efficient batches:
import spacy
import pandas as pd
from collections import defaultdict
nlp = spacy.load("en_core_web_sm")
def batch_extract_entities(
    texts: list[str],
    entity_types: list[str] | None = None,
    batch_size: int = 50
) -> pd.DataFrame:
    """
    Extract named entities from a list of texts efficiently.

    Args:
        texts: List of text strings to process.
        entity_types: Entity types to extract (None = all types).
        batch_size: Documents to process per batch.

    Returns:
        DataFrame with columns: doc_index, entity_type, entity_text, count.
    """
    if entity_types is None:
        entity_types = ['ORG', 'PERSON', 'GPE', 'DATE', 'MONEY', 'PRODUCT']
    results = []
    docs = nlp.pipe(texts, batch_size=batch_size, disable=['parser'])
    for doc_idx, doc in enumerate(docs):
        entity_counter: dict = defaultdict(list)
        for ent in doc.ents:
            if ent.label_ in entity_types:
                entity_counter[ent.label_].append(ent.text.strip())
        for etype, entities in entity_counter.items():
            for entity in set(entities):  # Deduplicate within document
                results.append({
                    'doc_index': doc_idx,
                    'entity_type': etype,
                    'entity_text': entity,
                    'count': entities.count(entity)
                })
    return pd.DataFrame(results)
# Find the most frequently mentioned organizations across all documents
def top_organizations(texts: list[str], top_n: int = 20) -> pd.DataFrame:
    """Extract and rank organizations mentioned most frequently."""
    entities_df = batch_extract_entities(texts, entity_types=['ORG'])
    return (
        entities_df[entities_df['entity_type'] == 'ORG']
        .groupby('entity_text')['count']
        .sum()
        .sort_values(ascending=False)
        .head(top_n)
        .reset_index()
    )
35.5.3 Business Use Case: Contract Date Extraction
Extracting dates from contracts is a time-consuming manual task that NER handles well:
import spacy
from datetime import datetime
import re
nlp = spacy.load("en_core_web_sm")
def extract_contract_dates(contract_text: str) -> list[dict]:
    """
    Extract all date mentions from a contract with surrounding context.

    Returns a list of dicts with 'date_text' and 'context' keys.
    """
    doc = nlp(contract_text)
    dates = []
    for ent in doc.ents:
        if ent.label_ == 'DATE':
            # Get surrounding context (50 characters on each side)
            start_char = max(0, ent.start_char - 50)
            end_char = min(len(contract_text), ent.end_char + 50)
            context = contract_text[start_char:end_char].replace('\n', ' ')
            dates.append({
                'date_text': ent.text,
                'context': f"...{context}..."
            })
    return dates
35.6 Text Classification
Text classification assigns text to predefined categories. For business, the most common applications are:
- Customer inquiry routing: Categorize incoming emails as billing, technical support, returns, general inquiry
- Complaint categorization: Classify support tickets by product area, department, or issue type
- Content moderation: Flag inappropriate content
- News and document classification: Route articles to the right team
35.6.1 Rule-Based Classification
For many business problems, a simple keyword-based approach is surprisingly effective and has the advantage of being completely transparent and easy to update.
from collections import defaultdict
CATEGORY_KEYWORDS = {
    'shipping': [
        'shipping', 'delivery', 'delivered', 'arrived', 'package',
        'tracking', 'transit', 'courier', 'fedex', 'ups', 'usps',
        'lost', 'damaged', 'missing', 'delay', 'delayed'
    ],
    'billing': [
        'charge', 'charged', 'billing', 'invoice', 'payment',
        'refund', 'credit', 'overcharged', 'price', 'discount',
        'coupon', 'cost', 'fee', 'subscription'
    ],
    'product_quality': [
        'broken', 'defective', 'faulty', 'quality', 'damaged',
        'wrong', 'incorrect', 'missing parts', 'not working',
        'stopped working', 'cheap', 'poor quality', 'disappointing'
    ],
    'account': [
        'login', 'password', 'account', 'sign in', 'access',
        'username', 'reset', 'locked out', 'profile', 'settings'
    ],
    'returns': [
        'return', 'returning', 'refund', 'exchange', 'send back',
        'return policy', 'rma', 'replacement'
    ],
}
def classify_ticket_rules(text: str) -> str:
    """
    Classify a support ticket using keyword matching.

    Returns the category with the most keyword matches.
    Ties are broken by category order; 'other' is returned if no match.
    """
    text_lower = text.lower()
    scores = defaultdict(int)
    for category, keywords in CATEGORY_KEYWORDS.items():
        for keyword in keywords:
            if keyword in text_lower:
                scores[category] += 1
    if not scores:
        return 'other'
    return max(scores, key=scores.get)
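A quick sanity check of the rule-based approach, using a trimmed copy of the keyword table (the full CATEGORY_KEYWORDS above behaves the same way):

```python
from collections import defaultdict

# Trimmed keyword table for the demo; see CATEGORY_KEYWORDS in the chapter for the full version
KEYWORDS = {
    'shipping': ['shipping', 'delivery', 'tracking', 'delayed'],
    'billing': ['charge', 'invoice', 'refund', 'payment'],
}

def classify(text: str) -> str:
    """Return the category whose keywords match the text most often."""
    text_lower = text.lower()
    scores = defaultdict(int)
    for category, keywords in KEYWORDS.items():
        for keyword in keywords:
            if keyword in text_lower:
                scores[category] += 1
    return max(scores, key=scores.get) if scores else 'other'

print(classify("My delivery is delayed and tracking shows nothing"))  # shipping
print(classify("I was double-charged on my invoice"))                 # billing
print(classify("How do I change my email address?"))                  # other
```

Note that matching is by substring, so 'charge' also matches "charged"; that is often what you want, but it can misfire (e.g. 'rma' inside "format"), which is why the full table uses longer, more specific phrases.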
35.6.2 Machine Learning Classification with TF-IDF + Naive Bayes
When you have labeled training data — even a few hundred labeled examples — machine learning classification will outperform keyword rules:
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import classification_report

def train_ticket_classifier(
    labeled_df: pd.DataFrame,
    text_col: str = 'text',
    label_col: str = 'category'
) -> Pipeline:
    """
    Train a text classification pipeline on labeled support tickets.

    Args:
        labeled_df: DataFrame with text and category columns.
        text_col: Name of the text column.
        label_col: Name of the category label column.

    Returns:
        Trained sklearn Pipeline (can be saved with joblib).
    """
    X = labeled_df[text_col].fillna('')
    y = labeled_df[label_col]

    # Split for evaluation
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )

    # Build pipeline: TF-IDF vectorization + Naive Bayes classifier
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer(
            ngram_range=(1, 2),
            max_features=5000,
            stop_words='english',
            sublinear_tf=True,  # Apply log(1 + tf) scaling
        )),
        ('classifier', MultinomialNB(alpha=0.1)),
    ])
    pipeline.fit(X_train, y_train)

    # Evaluate
    y_pred = pipeline.predict(X_test)
    print("Classification Report:")
    print(classification_report(y_test, y_pred))

    # Cross-validation for more robust estimate
    cv_scores = cross_val_score(pipeline, X, y, cv=5, scoring='accuracy')
    print(f"Cross-val accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")

    return pipeline

def predict_category(pipeline: Pipeline, texts: list[str]) -> list[str]:
    """Predict categories for new texts using a trained pipeline."""
    return pipeline.predict(texts).tolist()
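To see the pipeline shape end to end, here is a toy run on a handful of invented labeled tickets. A real training set needs at least a few hundred rows per category; the train/test split is skipped here because six rows are far too few to evaluate:

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical labeled examples, invented for illustration only.
toy = pd.DataFrame({
    'text': [
        'package delayed in transit', 'delivery arrived damaged',
        'tracking number never updated', 'refund still not processed',
        'charged twice on my invoice', 'payment failed for subscription',
    ],
    'category': ['shipping', 'shipping', 'shipping',
                 'billing', 'billing', 'billing'],
})

# Same pipeline shape as train_ticket_classifier above.
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('classifier', MultinomialNB(alpha=0.1)),
])
pipeline.fit(toy['text'], toy['category'])

print(pipeline.predict(['my package never arrived'])[0])  # shipping
print(pipeline.predict(['why was my card charged'])[0])   # billing
```

Unseen words ("card") are simply ignored at prediction time; the classifier scores whatever vocabulary overlap remains, which is why training coverage matters more than model choice.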
35.7 Topic Modeling: What Are People Talking About?
Sometimes you do not know in advance what categories exist in your text. Topic modeling discovers latent themes in a collection of documents without any labeled data. The most widely used algorithm is Latent Dirichlet Allocation (LDA).
35.7.1 LDA Intuition
LDA assumes that each document is a mixture of topics, and each topic is a mixture of words. It works backward from what it can observe (the words in each document) to infer the underlying topic structure.
Think of it this way: if you have a thousand customer reviews, some will be mostly about delivery, some mostly about product quality, some about customer service, and many will touch on multiple topics. LDA identifies those themes mathematically by looking for groups of words that tend to appear together.
LDA is not magic. It requires you to specify the number of topics in advance, and it returns topics as unlabeled word lists — you have to interpret what each topic means. But on business text it usually surfaces recognizable themes.
import pandas as pd
from gensim import corpora, models
from gensim.utils import simple_preprocess
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
stop_words.update(['would', 'could', 'also', 'get', 'got', 'one', 'two'])

def build_lda_model(
    texts: list[str],
    num_topics: int = 5,
    num_words: int = 10,
    passes: int = 15
) -> tuple:
    """
    Build an LDA topic model from a list of texts.

    Args:
        texts: List of raw text strings.
        num_topics: Number of topics to extract.
        num_words: Words per topic to display.
        passes: Training passes (more = better but slower).

    Returns:
        Tuple of (lda_model, dictionary, corpus).
    """
    # Preprocess
    processed = [
        [word for word in simple_preprocess(text, deacc=True)
         if word not in stop_words and len(word) > 2]
        for text in texts
    ]

    # Build dictionary and corpus
    dictionary = corpora.Dictionary(processed)
    dictionary.filter_extremes(no_below=2, no_above=0.9)
    corpus = [dictionary.doc2bow(doc) for doc in processed]

    # Train LDA
    lda_model = models.LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        passes=passes,
        random_state=42,
        alpha='auto',  # Learn topic concentration from data
        per_word_topics=True,
    )

    # Display topics
    print("Discovered Topics:")
    print("=" * 60)
    for idx, topic in lda_model.print_topics(num_words=num_words):
        print(f"\nTopic {idx + 1}:")
        # Each topic string looks like '0.052*"word" + 0.031*"other" + ...'
        for word_weight in topic.split('+'):
            weight, word = word_weight.strip().split('*')
            word = word.strip('"')
            print(f"  {word:<20} (weight: {weight})")

    return lda_model, dictionary, corpus

def get_document_topics(
    lda_model,
    dictionary,
    text: str,
    min_probability: float = 0.1
) -> list[tuple[int, float]]:
    """Get the topic distribution for a single new document."""
    processed = [
        word for word in simple_preprocess(text, deacc=True)
        if word not in stop_words and len(word) > 2
    ]
    bow = dictionary.doc2bow(processed)
    topics = lda_model.get_document_topics(bow, minimum_probability=min_probability)
    return sorted(topics, key=lambda x: x[1], reverse=True)
35.7.2 Choosing the Number of Topics
One of the practical challenges with LDA is choosing how many topics to specify. Too few, and distinct themes get lumped together. Too many, and topics become redundant or incoherent.
The coherence score provides a quantitative measure of topic quality:
from gensim.models.coherencemodel import CoherenceModel

def find_optimal_topics(
    processed_texts: list[list[str]],
    dictionary,
    corpus,
    topic_range: range = range(3, 15)
) -> pd.DataFrame:
    """
    Evaluate LDA models with different topic counts using coherence scores.

    Higher coherence score = more interpretable, coherent topics.
    """
    results = []
    for num_topics in topic_range:
        model = models.LdaModel(
            corpus=corpus,
            id2word=dictionary,
            num_topics=num_topics,
            passes=10,
            random_state=42,
        )
        coherence_model = CoherenceModel(
            model=model,
            texts=processed_texts,
            dictionary=dictionary,
            coherence='c_v'
        )
        coherence = coherence_model.get_coherence()
        results.append({'num_topics': num_topics, 'coherence': coherence})
        print(f"  Topics: {num_topics:2d} | Coherence: {coherence:.4f}")
    return pd.DataFrame(results)
In practice, for business text corpora, 5-10 topics is usually the right range. Start with 7, look at what the topics mean, and adjust from there.
35.8 Analyzing Survey Data at Scale
Open-text survey responses are a goldmine of insight, yet most organizations extract very little value from them. A typical post-purchase survey might ask "What could we do better?" and receive thousands of responses. NLP makes those responses analyzable.
35.8.1 A Complete Survey Analysis Pipeline
import pandas as pd
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer

def analyze_survey_responses(
    df: pd.DataFrame,
    response_col: str,
    segment_col: str | None = None
) -> dict:
    """
    Comprehensive analysis of open-text survey responses.

    Args:
        df: DataFrame with survey responses.
        response_col: Column containing text responses.
        segment_col: Optional column to segment analysis by (e.g., 'product_line').

    Returns:
        Dictionary with analysis results.
    """
    df = df.copy()
    responses = df[response_col].fillna('').astype(str)

    # Filter out very short or empty responses
    meaningful = responses[responses.str.len() > 20]
    print(f"Total responses: {len(responses)}")
    print(f"Meaningful responses (>20 chars): {len(meaningful)}")

    # Sentiment analysis
    sentiments = meaningful.apply(lambda t: TextBlob(t).sentiment.polarity)
    df.loc[meaningful.index, 'polarity'] = sentiments

    # Keyword extraction
    vectorizer = TfidfVectorizer(
        stop_words='english',
        ngram_range=(1, 2),
        max_features=200,
        min_df=3,
    )
    tfidf = vectorizer.fit_transform(meaningful)
    mean_tfidf = tfidf.mean(axis=0).A1
    feature_names = vectorizer.get_feature_names_out()
    top_keywords = sorted(
        zip(feature_names, mean_tfidf),
        key=lambda x: x[1],
        reverse=True
    )[:20]

    results = {
        'total_responses': len(responses),
        'meaningful_responses': len(meaningful),
        'avg_sentiment': sentiments.mean(),
        'pct_positive': (sentiments > 0.1).mean() * 100,
        'pct_negative': (sentiments < -0.1).mean() * 100,
        'top_keywords': top_keywords,
        'sentiment_distribution': sentiments.describe().to_dict(),
    }

    # Segment analysis
    if segment_col and segment_col in df.columns:
        segment_summary = (
            df.dropna(subset=['polarity'])
            .groupby(segment_col)['polarity']
            .agg(['mean', 'count'])
            .sort_values('mean')
            .rename(columns={'mean': 'avg_sentiment', 'count': 'response_count'})
        )
        results['by_segment'] = segment_summary

    return results
35.8.2 Theme Extraction Without Topic Modeling
A lighter-weight approach to finding themes is to look for meaningful bigrams and trigrams in positive versus negative responses separately:
def compare_positive_negative_language(
    df: pd.DataFrame,
    text_col: str,
    sentiment_col: str = 'polarity'
) -> dict:
    """
    Compare language used in positive vs negative responses.

    Returns top phrases for each sentiment group.
    """
    positive_texts = df[df[sentiment_col] > 0.1][text_col].tolist()
    negative_texts = df[df[sentiment_col] < -0.1][text_col].tolist()

    def get_top_ngrams(texts: list[str], n: int = 2, top: int = 15) -> list:
        if len(texts) < 3:
            return []
        vec = TfidfVectorizer(
            ngram_range=(n, n),
            stop_words='english',
            max_features=100,
            min_df=2,
        )
        try:
            matrix = vec.fit_transform(texts)
            scores = matrix.mean(axis=0).A1
            names = vec.get_feature_names_out()
            return sorted(zip(names, scores), key=lambda x: x[1], reverse=True)[:top]
        except ValueError:  # raised when no term survives min_df / stopword filtering
            return []

    return {
        'positive_phrases': get_top_ngrams(positive_texts),
        'negative_phrases': get_top_ngrams(negative_texts),
    }
35.9 Practical Limits of Classical NLP
Before you build production systems around these techniques, you need to understand where they fail. Knowing the limits is not defeatist — it is essential for designing systems that produce reliable results.
What Classical NLP Does Well
- Aggregate pattern detection: Finding what themes appear most frequently across thousands of documents. If 30% of your tickets mention "delayed shipping," you can detect that reliably.
- Sentiment direction on clear text: Strongly positive or strongly negative text is classified correctly more than 80% of the time with TextBlob.
- Entity extraction on clean, formal text: NER on business documents (contracts, formal emails) achieves high accuracy.
- Keyword and phrase extraction: TF-IDF reliably surfaces the most distinctive terms.
Where Classical NLP Struggles
Sarcasm and irony: "Oh wonderful, another package 'delivered' to the wrong address. Just what I needed." TextBlob will score this as positive because of "wonderful."
Domain-specific language: Technical jargon, slang, and industry-specific terms are not in standard sentiment lexicons. A negative review of a medical device that says "the probe had excessive artifact" requires domain knowledge.
Negation in complex sentences: "I can't say I wasn't disappointed, though the product itself wasn't bad" requires more sophisticated parsing than TextBlob provides.
Short text: Tweets, one-sentence reviews, and brief responses have high variance in sentiment scoring. A single word can flip a score.
Changing language: Slang and new terminology appear faster than lexicons update.
Multilingual text: The tools in this chapter work for English. For other languages, you need separate models (spaCy has models for many languages, but quality varies).
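The sarcasm failure is easy to reproduce with any lexicon-based scorer. The sketch below uses a tiny invented word list (not TextBlob's actual lexicon) to show why averaging word polarities misfires:

```python
import re

# Tiny invented polarity lexicon -- illustrative only, not TextBlob's word list.
LEXICON = {'wonderful': 0.9, 'delivered': 0.0, 'wrong': -0.5}

def lexicon_polarity(text: str) -> float:
    """Average the polarity of every lexicon word found in the text."""
    words = re.findall(r"[a-z]+", text.lower())
    hits = [LEXICON[w] for w in words if w in LEXICON]
    return sum(hits) / len(hits) if hits else 0.0

sarcastic = "Oh wonderful, another package delivered to the wrong address"
print(lexicon_polarity(sarcastic))  # about 0.13: scored positive despite the complaint
```

The single strongly positive word "wonderful" outweighs the mild negative "wrong," and nothing in a word-level lexicon can see the sarcastic framing around them.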
Designing Around the Limits
- Use sentiment as a signal, not a verdict. Base decisions on aggregate statistics and trends, not on individual scores.
- Validate on your data. Sample 100-200 texts, manually label them, and measure your classifier's accuracy on your specific domain.
- Keep humans in the loop for edge cases. Build a review queue for tickets with borderline scores.
- Version and monitor your models. Language use changes; models should be periodically retrained.
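The "validate on your data" step needs nothing more than pairing each sampled text's predicted label with your manual label and counting agreements. A minimal sketch, with labels invented for illustration:

```python
from collections import Counter

# Hypothetical (predicted_label, manual_label) pairs from a review of 8 tickets.
# A real validation sample should be 100-200 manually labeled texts.
pairs = [
    ('shipping', 'shipping'), ('billing', 'billing'),
    ('shipping', 'returns'),  ('billing', 'billing'),
    ('other', 'account'),     ('shipping', 'shipping'),
    ('returns', 'returns'),   ('billing', 'shipping'),
]

accuracy = sum(p == m for p, m in pairs) / len(pairs)
print(f"Accuracy on the manual sample: {accuracy:.1%}")  # 62.5%

# Where do disagreements concentrate? Count the confused label pairs.
confusions = Counter((p, m) for p, m in pairs if p != m)
for (pred, manual), n in confusions.most_common():
    print(f"  predicted {pred!r}, human said {manual!r}: {n}x")
```

The confusion counts matter as much as the headline accuracy: they tell you which keyword lists or training categories to fix first.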
35.10 Putting It All Together: A Complete Analysis Workflow
Here is how Priya at Acme Corp would approach analyzing 4,200 support tickets from scratch:
import pandas as pd
from pathlib import Path
from textblob import TextBlob
from sklearn.feature_extraction.text import TfidfVectorizer

# --- Configuration ---
DATA_PATH = Path("data/support_tickets_q4.csv")
OUTPUT_PATH = Path("analysis/ticket_analysis_q4.html")

def main():
    # 1. Load data
    print("Loading data...")
    df = pd.read_csv(DATA_PATH)
    print(f"Loaded {len(df):,} tickets")

    # 2. Add sentiment scores
    print("Analyzing sentiment...")
    df['polarity'] = df['ticket_text'].apply(
        lambda t: TextBlob(str(t)).sentiment.polarity
    )
    df['sentiment'] = pd.cut(
        df['polarity'],
        bins=[-1, -0.1, 0.1, 1],
        labels=['NEGATIVE', 'NEUTRAL', 'POSITIVE'],
        include_lowest=True,  # so a polarity of exactly -1.0 is labeled NEGATIVE
    )

    # 3. Classify by topic (rule-based; classify_ticket_rules is defined in 35.6.1)
    print("Classifying topics...")
    df['category'] = df['ticket_text'].apply(classify_ticket_rules)

    # 4. Summary statistics
    print("\n=== SENTIMENT BY CATEGORY ===")
    category_summary = df.groupby('category').agg(
        ticket_count=('polarity', 'count'),
        avg_sentiment=('polarity', 'mean'),
        pct_negative=('sentiment', lambda x: (x == 'NEGATIVE').mean() * 100)
    ).sort_values('avg_sentiment')
    print(category_summary.round(3).to_string())

    # 5. Day-of-week analysis
    if 'created_at' in df.columns:
        df['day_of_week'] = pd.to_datetime(df['created_at']).dt.day_name()
        day_order = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
                     'Saturday', 'Sunday']
        day_analysis = (
            df.groupby(['day_of_week', 'category'])
            .size()
            .reset_index(name='count')
        )
        print("\n=== TICKETS BY DAY OF WEEK ===")
        pivot = day_analysis.pivot(
            index='day_of_week', columns='category', values='count'
        ).fillna(0)
        pivot = pivot.reindex([d for d in day_order if d in pivot.index])
        print(pivot.astype(int).to_string())

    # 6. Extract top keywords for the most negative category
    most_negative_category = category_summary.index[0]
    negative_tickets = df[df['category'] == most_negative_category]['ticket_text']
    print(f"\n=== TOP KEYWORDS IN '{most_negative_category.upper()}' TICKETS ===")
    vec = TfidfVectorizer(stop_words='english', ngram_range=(1, 2), max_features=50)
    vec.fit_transform(negative_tickets.fillna(''))
    print("Top phrases:", list(vec.get_feature_names_out()[:15]))

if __name__ == '__main__':
    main()
Chapter Summary
Natural Language Processing transforms unstructured text into analyzable, actionable data. In this chapter, you learned:
- Text preprocessing is the foundation: lowercasing, punctuation removal, tokenization, stopword removal, and lemmatization prepare raw text for analysis.
- Sentiment analysis with TextBlob provides fast, interpretable polarity scores for customer-facing text. It works well for aggregate analysis and triage, but struggles with sarcasm and complex negation.
- Keyword extraction using TF-IDF surfaces the most distinctive terms in a corpus, and n-gram extraction captures meaningful phrases.
- Named Entity Recognition with spaCy identifies people, organizations, dates, and monetary amounts from business documents with high accuracy.
- Text classification can be rule-based (keyword matching, transparent and maintainable) or machine learning-based (TF-IDF + Naive Bayes, requires labeled data).
- Topic modeling with LDA discovers latent themes in a collection of documents without labeled data.
- Practical limits matter: treat NLP scores as signals for aggregate patterns, validate on your specific domain, and keep humans in the loop for critical decisions.
The code files in this chapter (sentiment_analyzer.py and ner_extractor.py) provide production-ready implementations of these techniques that you can adapt for your own business text data.
Key Terms
Corpus — A collection of text documents used for analysis.
Lemmatization — Reducing a word to its dictionary base form using linguistic rules (e.g., "ran" → "run").
Named Entity Recognition (NER) — Identifying and classifying real-world entities (people, organizations, dates) in text.
N-gram — A contiguous sequence of n words in text (bigram = 2 words, trigram = 3 words).
Polarity — A sentiment score from -1.0 (most negative) to +1.0 (most positive).
Stemming — Reducing a word to its root form by removing suffixes (blunt, rule-based; e.g., "shipping" → "ship").
Stopwords — Common words (the, a, is) that are filtered out because they carry little analytical meaning.
TF-IDF — Term Frequency-Inverse Document Frequency; a measure of how distinctive a word is within a specific document relative to the whole corpus.
Tokenization — Splitting text into individual words or sentences.
Topic Modeling — Discovering latent thematic structures in a collection of documents.