
Chapter 26: NLP Fundamentals

Text Preprocessing, TF-IDF, Sentiment Analysis, and Topic Modeling


Learning Objectives

By the end of this chapter, you will be able to:

  1. Preprocess text (tokenization, lowercasing, stemming, lemmatization, stop word removal)
  2. Create numerical text representations with Bag-of-Words and TF-IDF
  3. Build a text classifier using TF-IDF + logistic regression / Naive Bayes
  4. Apply topic modeling with LDA (Latent Dirichlet Allocation)
  5. Perform sentiment analysis with lexicon-based and ML-based approaches

Why NLP Without Deep Learning Still Matters

Core Principle --- Machine learning models consume numbers, not words. The entire discipline of Natural Language Processing is, at its core, the art of turning human language into numerical representations that algorithms can operate on. This chapter covers the representations that have worked for twenty years and still work today: Bag-of-Words, TF-IDF, and the classifiers built on top of them. They are not obsolete. For many business problems --- document classification, ticket routing, sentiment scoring, topic discovery --- TF-IDF plus a logistic regression or Naive Bayes classifier is sufficient, interpretable, fast to train, and easy to maintain in production. You need to understand these fundamentals before reaching for BERT.

Every data scientist eventually encounters text data. Customer reviews, support tickets, survey responses, social media posts, product descriptions, email bodies, clinical notes. The question is always the same: how do you get this into a form your model can use?

The answer involves a pipeline with four stages:

1. Preprocessing --- tokenize, lowercase, remove punctuation and stop words, stem or lemmatize. This is the unglamorous plumbing that determines whether your model sees "Running," "running," "runs," and "ran" as four different words or as one concept.

2. Vectorization --- convert the cleaned tokens into a numerical matrix. Bag-of-Words counts how many times each word appears. TF-IDF refines those counts by penalizing words that appear everywhere (and thus carry little information). The result is a document-term matrix where each row is a document and each column is a word.

3. Modeling --- feed that matrix into a classifier (logistic regression, Naive Bayes), a topic model (LDA), or a sentiment analyzer (VADER, or your own trained classifier).

4. Evaluation --- measure whether the model actually works. For classification, the metrics from Chapter 16 apply directly. For topic modeling, coherence scores. For sentiment, both accuracy and qualitative review of misclassified examples.

This chapter builds all four stages from scratch, applies them to two business problems --- product review sentiment analysis at ShopSmart and support ticket topic modeling at StreamFlow --- and previews the word embedding approaches covered in Chapter 36.
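Before diving into each stage in detail, it helps to see how little code the whole pipeline takes. Below is a minimal end-to-end sketch with toy, already-preprocessed documents (the document strings and labels are illustrative, not from a real dataset):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Stage 1 (preprocessing) assumed done: lowercased, cleaned tokens
docs = [
    "great product love it",
    "terrible quality broke fast",
    "excellent value recommend",
    "waste of money avoid",
]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

# Stages 2 and 3: vectorize with TF-IDF, then classify
model = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])
model.fit(docs, labels)

# Stage 4: evaluation would use a held-out test set; here, one new review
print(model.predict(["love this great product"]))  # [1]
```

Each of these lines hides real decisions --- which preprocessing steps to apply, which vectorizer parameters to set, which classifier to choose --- and the rest of the chapter unpacks them.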


Part 1: Text Preprocessing

Why Preprocessing Matters More Than the Model

In tabular machine learning, feature engineering often matters more to final performance than the choice of algorithm. In NLP, preprocessing plays that same role. A logistic regression on well-preprocessed text will routinely outperform a random forest on raw text. Every minute spent on preprocessing pays dividends downstream.

The fundamental challenge: natural language is messy. Consider these five product reviews:

  • "GREAT product!!! Love it."
  • "great product, love it"
  • "This is a great product. I really love it."
  • "gr8 product luv it"
  • "The product is great and I love it!!!"

A human reads these as five expressions of the same sentiment. A computer sees five completely different strings. Preprocessing closes that gap.
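Even trivial normalization closes much of that gap. A minimal sketch (lowercasing plus punctuation stripping, the first steps of the pipeline built below; the helper name normalize is illustrative):

```python
import re

variants = [
    "GREAT product!!! Love it.",
    "great product, love it",
    "The product is great and I love it!!!",
]

def normalize(text):
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)  # drop punctuation and digits
    return " ".join(text.split())         # collapse whitespace

for v in variants:
    print(normalize(v))
# great product love it
# great product love it
# the product is great and i love it
```

The first two variants collapse into the same string; stop word removal ("the," "is," "and," "i") closes most of the remaining gap for the third.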

Tokenization

Tokenization splits a string into individual units (tokens). The simplest approach splits on whitespace. A better approach uses a tokenizer that handles punctuation, contractions, and edge cases.

import re
import nltk
from nltk.tokenize import word_tokenize

# Download required NLTK data (run once)
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('vader_lexicon', quiet=True)

text = "The product's quality isn't great --- I wouldn't recommend it."

# Naive: split on whitespace
naive_tokens = text.split()
print("Naive:", naive_tokens)
# ['The', "product's", 'quality', "isn't", 'great', '---', 'I', "wouldn't", 'recommend', 'it.']

# Better: NLTK word_tokenize handles contractions and punctuation
nltk_tokens = word_tokenize(text)
print("NLTK:", nltk_tokens)
# ['The', 'product', "'s", 'quality', 'is', "n't", 'great', '---', 'I', 'would', "n't", 'recommend', 'it', '.']

Notice that word_tokenize splits contractions: "isn't" becomes "is" and "n't", "product's" becomes "product" and "'s". This matters because "isn't" carries negation information. If you treat "isn't" as a single token, the negation is trapped inside a word that your model may never see again in exactly that form.

Common Mistake --- Do not tokenize by splitting on whitespace and calling it done. You will miss contractions, hyphenated words, and punctuation attached to words. NLTK's word_tokenize or spaCy's tokenizer handles these edge cases. The ten minutes you spend using a proper tokenizer will save you hours of debugging downstream.

Lowercasing

Lowercasing maps all characters to lowercase so that "Great," "great," and "GREAT" are treated as the same token. This is almost always the right choice for classification and topic modeling. The exception is named entity recognition, where case carries information ("apple" the fruit vs. "Apple" the company).

text = "GREAT product! Love it."
lowered = text.lower()
print(lowered)
# "great product! love it."

Stop Word Removal

Stop words are high-frequency words that carry little semantic meaning: "the," "is," "and," "it," "a," "to." Every document contains them, so they contribute noise rather than signal to classification models. Removing them shrinks the vocabulary and improves signal-to-noise ratio.

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
print(f"NLTK English stop words: {len(stop_words)}")
print(sorted(list(stop_words))[:20])
# ['a', 'about', 'above', 'after', 'again', 'against', 'ain', 'all', 'am', 'an',
#  'and', 'any', 'are', 'aren', "aren't", 'as', 'at', 'be', 'because', 'been']

tokens = ['the', 'product', 'quality', 'is', 'not', 'great', 'i', 'would', 'not', 'recommend', 'it']
filtered = [t for t in tokens if t not in stop_words]
print(filtered)
# ['product', 'quality', 'great', 'would', 'recommend']

Common Mistake --- Notice that "not" was removed by NLTK's default stop word list. That is dangerous for sentiment analysis, where "not great" and "great" have opposite meanings. For sentiment tasks, either use a custom stop word list that preserves negation words, or skip stop word removal entirely and let the model learn which words matter. Scikit-learn's TfidfVectorizer has a built-in stop_words='english' parameter, but it also removes negations. Be deliberate.

Stemming vs. Lemmatization

Both stemming and lemmatization reduce words to a common base form. They solve the same problem --- "running," "runs," "ran" should be treated as the same concept --- but they use different methods.

Stemming chops off suffixes using rules. It is fast and aggressive. The Porter Stemmer reduces "running" to "run," "happiness" to "happi," and "university" to "univers." The results are not always real words, but that does not matter for classification --- the model just needs consistent tokens.

Lemmatization uses a dictionary and part-of-speech information to reduce words to their dictionary form (lemma). It is slower but produces real words: "running" becomes "run," "better" becomes "good," "mice" becomes "mouse" --- provided the lemmatizer knows each word's part of speech. Without a POS tag, NLTK's WordNetLemmatizer assumes every word is a noun, which is why some forms pass through unchanged in the comparison below.

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

words = ['running', 'ran', 'runs', 'happily', 'happiness',
         'better', 'mice', 'geese', 'universities']

print("Word          | Stemmed       | Lemmatized")
print("-" * 50)
for word in words:
    stemmed = stemmer.stem(word)
    lemmatized = lemmatizer.lemmatize(word)
    print(f"{word:<14}| {stemmed:<14}| {lemmatized}")
Word          | Stemmed       | Lemmatized
--------------------------------------------------
running       | run           | running
ran           | ran           | ran
runs          | run           | run
happily       | happili       | happily
happiness     | happi         | happiness
better        | better        | better
mice          | mice          | mouse
geese         | gees          | goose
universities  | univers       | university

Production Tip --- For classification tasks (spam detection, ticket routing, sentiment), stemming is usually sufficient and faster. For tasks where the output text needs to be human-readable (topic labels, search results), use lemmatization. In practice, TF-IDF with scikit-learn's TfidfVectorizer does neither by default --- it relies on the vocabulary being large enough that "run" and "running" both carry signal. Adding stemming or lemmatization to the pipeline sometimes helps and sometimes does not. Test both.

Building a Complete Preprocessing Pipeline

Let us combine all the steps into a reusable function:

import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

def preprocess_text(text, remove_stopwords=True, lemmatize=True,
                    preserve_negation=True):
    """
    Full NLP preprocessing pipeline.

    Parameters
    ----------
    text : str
        Raw input text.
    remove_stopwords : bool
        Whether to remove stop words.
    lemmatize : bool
        Whether to lemmatize tokens.
    preserve_negation : bool
        If True, keep negation words ('not', 'no', "n't", 'never', 'neither',
        'nobody', 'nothing', 'nowhere', 'nor') even when removing stop words.

    Returns
    -------
    str
        Preprocessed text as a single string of space-separated tokens.
    """
    # Lowercase
    text = text.lower()

    # Expand "n't" contractions so the negation survives punctuation
    # stripping below (e.g. "wouldn't" -> "would not")
    text = re.sub(r"n't\b", " not", text)

    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', '', text)

    # Remove HTML tags
    text = re.sub(r'<.*?>', '', text)

    # Remove non-alphabetic characters (keep spaces)
    text = re.sub(r'[^a-z\s]', '', text)

    # Tokenize
    tokens = word_tokenize(text)

    # Stop word removal (with optional negation preservation)
    if remove_stopwords:
        stop = set(stopwords.words('english'))
        if preserve_negation:
            negation_words = {'not', 'no', 'nor', 'never', 'neither',
                              'nobody', 'nothing', 'nowhere', 'nt'}
            stop = stop - negation_words
        tokens = [t for t in tokens if t not in stop]

    # Lemmatization
    if lemmatize:
        lem = WordNetLemmatizer()
        tokens = [lem.lemmatize(t) for t in tokens]

    # Remove very short tokens
    tokens = [t for t in tokens if len(t) > 1]

    return ' '.join(tokens)


# Test it
reviews = [
    "GREAT product!!! Love it <br>",
    "The product is NOT great. I wouldn't recommend this to anyone.",
    "Arrived quickly. Works as described. 5/5",
    "Terrible quality --- broke after 2 days. Never buying again.",
]

for review in reviews:
    print(f"Original: {review}")
    print(f"Cleaned:  {preprocess_text(review)}")
    print()
Original: GREAT product!!! Love it <br>
Cleaned:  great product love

Original: The product is NOT great. I wouldn't recommend this to anyone.
Cleaned:  product not great not recommend anyone

Original: Arrived quickly. Works as described. 5/5
Cleaned:  arrived quickly work described

Original: Terrible quality --- broke after 2 days. Never buying again.
Cleaned:  terrible quality broke day never buying

Notice that "not" is preserved in the second review because preserve_negation=True. This is critical for downstream sentiment analysis.


Part 2: Text Vectorization

From Words to Numbers

Preprocessing gives us clean token sequences. Vectorization converts those sequences into the numerical matrices that scikit-learn models consume. The two approaches covered here --- Bag-of-Words and TF-IDF --- both produce a document-term matrix: a matrix where each row is a document and each column is a unique word from the vocabulary.

Bag-of-Words with CountVectorizer

The simplest vectorization: count how many times each word appears in each document.

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "great product love it",
    "product not great not recommend",
    "terrible product broke quickly",
    "love this product great quality",
]

# CountVectorizer tokenizes and counts
count_vec = CountVectorizer()
bow_matrix = count_vec.fit_transform(corpus)

# The result is a sparse matrix
print(f"Type: {type(bow_matrix)}")
print(f"Shape: {bow_matrix.shape}")
print(f"Non-zero entries: {bow_matrix.nnz}")
print(f"Sparsity: {1 - bow_matrix.nnz / (bow_matrix.shape[0] * bow_matrix.shape[1]):.1%}")
Type: <class 'scipy.sparse._csr.csr_matrix'>
Shape: (4, 11)
Non-zero entries: 17
Sparsity: 61.4%
import pandas as pd

# Visualize as a dense DataFrame
bow_df = pd.DataFrame(
    bow_matrix.toarray(),
    columns=count_vec.get_feature_names_out(),
    index=[f"Doc {i}" for i in range(len(corpus))]
)
print(bow_df)
       broke  great  it  love  not  product  quality  quickly  recommend  terrible  this
Doc 0      0      1   1     1    0        1        0        0          0         0     0
Doc 1      0      1   0     0    2        1        0        0          1         0     0
Doc 2      1      0   0     0    0        1        0        1          0         1     0
Doc 3      0      1   0     1    0        1        1        0          0         0     1

Document 1 has "not" appearing twice ("not great not recommend"). BoW captures that count. But it does not capture word order --- "not great" and "great not" produce the same vector. This is the fundamental limitation of bag-of-words: it destroys word order.
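You can see the limitation directly: two documents with opposite meanings produce identical vectors. A small sketch:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
X = vec.fit_transform(["not great", "great not"])

# The two rows are identical: counts ignore word order entirely
print(np.array_equal(X[0].toarray(), X[1].toarray()))  # True
```

Any model downstream of this representation cannot tell these documents apart, no matter how powerful it is.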

N-Grams: Partial Word Order Recovery

N-grams partially recover word order by treating sequences of N consecutive words as single features. Bigrams (N=2) capture "not great" as a single feature distinct from "great product."

# Bigrams only
bigram_vec = CountVectorizer(ngram_range=(2, 2))
bigram_matrix = bigram_vec.fit_transform(corpus)

bigram_df = pd.DataFrame(
    bigram_matrix.toarray(),
    columns=bigram_vec.get_feature_names_out(),
    index=[f"Doc {i}" for i in range(len(corpus))]
)
print(bigram_df.to_string())
       broke quickly  great not  great product  great quality  love it  love this  not great  not recommend  product broke  product great  product love  product not  terrible product  this product
Doc 0              0          0              1              0        1          0          0              0              0              0             1            0                 0             0
Doc 1              0          1              0              0        0          0          1              1              0              0             0            1                 0             0
Doc 2              1          0              0              0        0          0          0              0              1              0             0            0                 1             0
Doc 3              0          0              0              1        0          1          0              0              0              1             0            0                 0             1

Now "not great" is a distinct feature from "great quality." In practice, you combine unigrams and bigrams:

# Unigrams + bigrams (most common in practice)
combo_vec = CountVectorizer(ngram_range=(1, 2))
combo_matrix = combo_vec.fit_transform(corpus)
print(f"Unigrams only: {bow_matrix.shape[1]} features")
print(f"Bigrams only: {bigram_matrix.shape[1]} features")
print(f"Unigrams + bigrams: {combo_matrix.shape[1]} features")
Unigrams only: 11 features
Bigrams only: 14 features
Unigrams + bigrams: 25 features

Production Tip --- Unigrams + bigrams is the standard baseline for text classification. Trigrams (N=3) are rarely worth the vocabulary explosion unless you have a very large corpus. The max_features parameter in CountVectorizer and TfidfVectorizer caps the vocabulary at the top N most frequent terms, which controls memory and overfitting.

TF-IDF: Weighting Words by Importance

Bag-of-Words treats every word equally. But "product" appears in all four documents and tells you nothing about what distinguishes them. "Terrible" appears in one document and is highly discriminative. TF-IDF (Term Frequency-Inverse Document Frequency) encodes this intuition mathematically.

TF (Term Frequency): How often the word appears in this document. A word that appears 5 times in a document is more important to that document than a word that appears once.

IDF (Inverse Document Frequency): How rare the word is across the entire corpus. A word that appears in every document gets a low IDF score. A word that appears in one document gets a high IDF score.

TF-IDF = TF x IDF. A word that appears frequently in a specific document but rarely across the corpus gets the highest weight.

import numpy as np

# Manual TF-IDF calculation for intuition
corpus_tokens = [doc.split() for doc in corpus]
vocab = sorted(set(word for doc in corpus_tokens for word in doc))
n_docs = len(corpus)

print("Term             | Doc Freq | IDF          | Meaning")
print("-" * 65)
for word in vocab:
    doc_freq = sum(1 for doc in corpus_tokens if word in doc)
    # Scikit-learn's smooth IDF: log((1 + n) / (1 + df)) + 1
    idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1
    if doc_freq == n_docs:
        meaning = "Appears everywhere -> low weight"
    elif doc_freq == 1:
        meaning = "Rare -> high weight"
    else:
        meaning = f"In {doc_freq}/{n_docs} docs -> medium weight"
    print(f"{word:<17}| {doc_freq:<9}| {idf:<13.4f}| {meaning}")
Term             | Doc Freq | IDF          | Meaning
-----------------------------------------------------------------
broke            | 1        | 1.9163       | Rare -> high weight
great            | 3        | 1.2231       | In 3/4 docs -> medium weight
it               | 1        | 1.9163       | Rare -> high weight
love             | 2        | 1.5108       | In 2/4 docs -> medium weight
not              | 1        | 1.9163       | Rare -> high weight
product          | 4        | 1.0000       | Appears everywhere -> low weight
quality          | 1        | 1.9163       | Rare -> high weight
quickly          | 1        | 1.9163       | Rare -> high weight
recommend        | 1        | 1.9163       | Rare -> high weight
terrible         | 1        | 1.9163       | Rare -> high weight
this             | 1        | 1.9163       | Rare -> high weight

"Product" has IDF of 1.0 (lowest possible with smoothing) because it appears in every document. "Terrible," "broke," "quality" have IDF of 1.92 because they each appear in only one document.

TfidfVectorizer in Practice

Scikit-learn's TfidfVectorizer combines tokenization, counting, and TF-IDF weighting in a single transformer. It also L2-normalizes each document vector so that document length does not bias the representation.

from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vec = TfidfVectorizer()
tfidf_matrix = tfidf_vec.fit_transform(corpus)

tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray().round(3),
    columns=tfidf_vec.get_feature_names_out(),
    index=[f"Doc {i}" for i in range(len(corpus))]
)
print(tfidf_df.to_string())
       broke  great     it   love    not  product  quality  quickly  recommend  terrible   this
Doc 0  0.000  0.421  0.659  0.520  0.000    0.344    0.000    0.000      0.000     0.000  0.000
Doc 1  0.000  0.268  0.000  0.000  0.839    0.219    0.000    0.000      0.420     0.000  0.000
Doc 2  0.553  0.000  0.000  0.000  0.000    0.288    0.000    0.553      0.000     0.553  0.000
Doc 3  0.000  0.351  0.000  0.434  0.000    0.287    0.550    0.000      0.000     0.000  0.550

Compare this to the raw counts. In Document 1, "not" dominates (0.839) because it appears twice and only in this document. "Product" gets the lowest weight across all documents because it is ubiquitous.

Sparse Matrices: Why They Matter

A real-world corpus has tens of thousands of unique words. A document-term matrix with 50,000 documents and 30,000 words has 1.5 billion cells, but most are zero (a typical document uses a few hundred unique words). Scikit-learn stores these as sparse matrices (CSR format), which only store the non-zero entries.

from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.sparse import csr_matrix

# Simulate a larger corpus
np.random.seed(42)
vocab_pool = ['product', 'quality', 'shipping', 'price', 'great', 'terrible',
              'fast', 'slow', 'broken', 'excellent', 'return', 'refund',
              'love', 'hate', 'recommend', 'avoid', 'value', 'waste',
              'perfect', 'defective', 'comfortable', 'cheap', 'durable']

large_corpus = []
for _ in range(10000):
    n_words = np.random.randint(5, 30)
    doc = ' '.join(np.random.choice(vocab_pool, size=n_words))
    large_corpus.append(doc)

tfidf = TfidfVectorizer(max_features=5000, ngram_range=(1, 2))
X = tfidf.fit_transform(large_corpus)

print(f"Shape: {X.shape}")
print(f"Non-zero entries: {X.nnz:,}")
print(f"Total entries: {X.shape[0] * X.shape[1]:,}")
print(f"Sparsity: {1 - X.nnz / (X.shape[0] * X.shape[1]):.2%}")

# Memory comparison
import sys
dense_size = X.shape[0] * X.shape[1] * 8  # 8 bytes per float64
sparse_size = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(f"\nDense memory: {dense_size / 1e6:.1f} MB")
print(f"Sparse memory: {sparse_size / 1e6:.1f} MB")
print(f"Compression ratio: {dense_size / sparse_size:.1f}x")
Shape: (10000, 552)
Non-zero entries: ~275,000
Total entries: 5,520,000
Sparsity: ~95%

Dense memory: 44.2 MB
Sparse memory: ~3.4 MB
Compression ratio: ~13x

(Values after "Shape" are approximate. With only 23 base words, this simulated corpus is far denser than real text.)

Production Tip --- Never call .toarray() on a large TF-IDF matrix. Scikit-learn classifiers (logistic regression, Naive Bayes, SVM) accept sparse matrices directly. Converting to dense can blow up your memory. With real text data, sparsity is typically 99%+, and the compression ratio is dramatic.


Part 3: Text Classification with TF-IDF

The Standard Pipeline: TF-IDF + Classifier

Text classification is the most common NLP task in business: spam detection, ticket routing, sentiment classification, intent recognition. The standard approach before deep learning --- and still the production standard for many problems --- is TF-IDF vectorization followed by a linear classifier.

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt

np.random.seed(42)

# --- Simulate product review sentiment data ---
n_reviews = 5000

positive_templates = [
    "great product love it highly recommend",
    "excellent quality fast shipping perfect",
    "amazing value exactly what needed works perfectly",
    "best purchase made so happy with quality",
    "fantastic product exceeded expectations would buy again",
    "really good product quality worth every penny",
    "love this product works great highly satisfied",
    "wonderful product fast delivery great packaging",
    "superb quality easy to use love it",
    "outstanding product great value recommend to everyone",
]

negative_templates = [
    "terrible quality broke after one week",
    "waste of money would not recommend avoid",
    "poor quality cheap materials very disappointed",
    "product arrived damaged terrible packaging",
    "does not work as described want refund",
    "horrible product worst purchase ever regret buying",
    "defective product terrible customer service",
    "cheaply made broke immediately total waste",
    "not worth the price poor quality avoid",
    "disappointed with quality does not match description",
]

noise_words = ['product', 'item', 'bought', 'ordered', 'received',
               'delivery', 'package', 'price', 'company', 'store']

reviews = []
labels = []

for i in range(n_reviews):
    if i < n_reviews // 2:
        template = np.random.choice(positive_templates)
        label = 1
    else:
        template = np.random.choice(negative_templates)
        label = 0

    # Add noise words to make it realistic
    n_noise = np.random.randint(1, 5)
    noise = ' '.join(np.random.choice(noise_words, size=n_noise))
    review = f"{template} {noise}"

    # Randomly shuffle words to add variety
    words = review.split()
    np.random.shuffle(words)
    reviews.append(' '.join(words))
    labels.append(label)

df = pd.DataFrame({'review': reviews, 'sentiment': labels})
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

print(f"Dataset: {len(df)} reviews")
print(f"Class distribution:\n{df['sentiment'].value_counts()}")
Dataset: 5000 reviews
Class distribution:
sentiment
1    2500
0    2500
Name: count, dtype: int64

Building the Pipeline

Scikit-learn's Pipeline is the right way to chain vectorization and classification. It ensures that the vectorizer is fit only on training data and applied consistently to test data.

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    df['review'], df['sentiment'], test_size=0.2, random_state=42, stratify=df['sentiment']
)

print(f"Train: {len(X_train)}, Test: {len(X_test)}")
Train: 4000, Test: 1000
# Pipeline 1: TF-IDF + Logistic Regression
lr_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=5000,
        ngram_range=(1, 2),
        min_df=2,
        max_df=0.95,
        stop_words='english'
    )),
    ('clf', LogisticRegression(
        max_iter=1000,
        random_state=42,
        C=1.0
    ))
])

# Pipeline 2: TF-IDF + Multinomial Naive Bayes
nb_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(
        max_features=5000,
        ngram_range=(1, 2),
        min_df=2,
        max_df=0.95,
        stop_words='english'
    )),
    ('clf', MultinomialNB(alpha=1.0))
])

# Cross-validation comparison
lr_scores = cross_val_score(lr_pipeline, X_train, y_train, cv=5, scoring='accuracy')
nb_scores = cross_val_score(nb_pipeline, X_train, y_train, cv=5, scoring='accuracy')

print(f"Logistic Regression CV Accuracy: {lr_scores.mean():.4f} (+/- {lr_scores.std():.4f})")
print(f"Naive Bayes CV Accuracy:         {nb_scores.mean():.4f} (+/- {nb_scores.std():.4f})")
Logistic Regression CV Accuracy: 0.9870 (+/- 0.0049)
Naive Bayes CV Accuracy:         0.9830 (+/- 0.0042)
# Fit the better model on full training set and evaluate on test
lr_pipeline.fit(X_train, y_train)
y_pred = lr_pipeline.predict(X_test)

print(classification_report(y_test, y_pred, target_names=['Negative', 'Positive']))
              precision    recall  f1-score   support

    Negative       0.99      0.98      0.99       500
    Positive       0.98      0.99      0.99       500

    accuracy                           0.99      1000
   macro avg       0.99      0.99      0.99      1000
weighted avg       0.99      0.99      0.99      1000

Interpreting the Model: What Words Drive Predictions?

One of the biggest advantages of TF-IDF + logistic regression over deep learning: interpretability. The logistic regression coefficients tell you exactly which words push the prediction toward positive and which toward negative.

# Extract feature names and coefficients
feature_names = lr_pipeline.named_steps['tfidf'].get_feature_names_out()
coefficients = lr_pipeline.named_steps['clf'].coef_[0]

# Top positive and negative features
coef_df = pd.DataFrame({
    'feature': feature_names,
    'coefficient': coefficients
}).sort_values('coefficient')

n_top = 15

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

# Most negative
top_neg = coef_df.head(n_top)
axes[0].barh(top_neg['feature'], top_neg['coefficient'], color='#d32f2f')
axes[0].set_title('Top Negative Indicators')
axes[0].set_xlabel('Coefficient')
axes[0].invert_yaxis()

# Most positive
top_pos = coef_df.tail(n_top)
axes[1].barh(top_pos['feature'], top_pos['coefficient'], color='#388e3c')
axes[1].set_title('Top Positive Indicators')
axes[1].set_xlabel('Coefficient')
axes[1].invert_yaxis()

plt.tight_layout()
plt.savefig('tfidf_lr_coefficients.png', dpi=150, bbox_inches='tight')
plt.show()

This is the interpretability that TF-IDF buys you. When the VP of Product asks "Why is the model predicting negative sentiment for this batch of reviews?", you can point to specific words with specific coefficient weights. Try explaining a BERT embedding to that same VP.

War Story --- A team at a fintech company built a BERT-based model for classifying customer complaints. It achieved 94% accuracy. The TF-IDF + logistic regression baseline achieved 91%. The BERT model required a GPU for inference, took 200ms per prediction, and nobody could explain its decisions to regulators. The TF-IDF model ran on a single CPU, took 0.5ms per prediction, and produced coefficient-based explanations that satisfied the compliance team. They shipped the TF-IDF model. Three percentage points of accuracy were not worth the operational complexity.

TfidfVectorizer Parameters That Matter

# Key parameters and what they control
tfidf_explained = TfidfVectorizer(
    max_features=10000,   # Cap vocabulary at top N terms (memory + regularization)
    ngram_range=(1, 2),   # Include unigrams and bigrams
    min_df=3,             # Ignore terms appearing in fewer than 3 documents (typos, rare junk)
    max_df=0.90,          # Ignore terms appearing in >90% of documents (de facto stop words)
    stop_words='english', # Remove English stop words
    sublinear_tf=True,    # Apply log(1 + tf) instead of raw tf (dampens count magnitude)
    strip_accents='unicode',  # Normalize accented characters
    lowercase=True,       # Default True
)

Production Tip --- sublinear_tf=True is an underused parameter that often improves classification performance. It replaces raw term frequency with 1 + log(tf), which dampens the effect of a word appearing 50 times vs. 5 times. A word that appears 50 times in a document is not 10x more important than a word that appears 5 times; sublinear_tf reflects that.
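The damping is easy to see numerically (natural log, as scikit-learn uses):

```python
import numpy as np

# Sublinear TF replaces a raw count with 1 + ln(count)
for tf in [1, 5, 50]:
    print(f"raw tf = {tf:>2}  ->  sublinear tf = {1 + np.log(tf):.2f}")
# raw tf =  1  ->  sublinear tf = 1.00
# raw tf =  5  ->  sublinear tf = 2.61
# raw tf = 50  ->  sublinear tf = 4.91
```

Under sublinear scaling, 50 occurrences carry roughly 1.9x the weight of 5 occurrences rather than 10x.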


Part 4: Sentiment Analysis

Two Approaches to Sentiment

Sentiment analysis determines whether text expresses positive, negative, or neutral opinion. There are two families of approach:

Lexicon-based methods use a predefined dictionary of words annotated with sentiment scores. VADER (Valence Aware Dictionary and sEntiment Reasoner) is the most widely used. It requires no training data, works out of the box, and handles social media conventions (capitalization, punctuation emphasis, emoticons).

ML-based methods train a classifier (the TF-IDF pipeline above) on labeled examples. They are domain-specific: a model trained on movie reviews may not work on product reviews. They require labeled data but learn domain-specific sentiment patterns that lexicons miss.

VADER: Rule-Based Sentiment

VADER assigns a compound sentiment score between -1 (most negative) and +1 (most positive). It uses a hand-curated lexicon plus rules for handling negation, capitalization, punctuation, and degree modifiers.

from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()

test_sentences = [
    "This product is great.",
    "This product is not great.",
    "This product is GREAT!!!",
    "This product is good, but not great.",
    "The product is absolutely, positively wonderful.",
    "This is the worst product I have ever purchased.",
    "The delivery was fast but the product quality is terrible.",
]

print("Text                                                    | Compound | Label")
print("-" * 90)
for sent in test_sentences:
    scores = sia.polarity_scores(sent)
    compound = scores['compound']
    if compound >= 0.05:
        label = "Positive"
    elif compound <= -0.05:
        label = "Negative"
    else:
        label = "Neutral"
    print(f"{sent:<56}| {compound:>7.4f}  | {label}")
Text                                                    | Compound | Label
------------------------------------------------------------------------------------------
This product is great.                                  |  0.6249  | Positive
This product is not great.                              | -0.3412  | Negative
This product is GREAT!!!                                |  0.7351  | Positive
This product is good, but not great.                    |  0.1779  | Positive
The product is absolutely, positively wonderful.        |  0.8042  | Positive
This is the worst product I have ever purchased.        | -0.6249  | Negative
The delivery was fast but the product quality is terrible. | -0.4939  | Negative

VADER handles negation ("not great" is negative), capitalization emphasis ("GREAT!!!" is more positive than "great"), and degree modifiers ("absolutely wonderful" is more positive than "wonderful").

Common Mistake --- VADER was trained primarily on social media text. It performs well on informal text with strong sentiment signals (product reviews, tweets) but poorly on formal, nuanced text (legal documents, clinical notes, academic prose). Do not assume VADER generalizes to your domain without validation.

VADER vs. ML-Based: When to Use Which

# Compare VADER vs. trained classifier on our review dataset
from sklearn.metrics import accuracy_score

# VADER predictions on test set
vader_predictions = []
for review in X_test:
    compound = sia.polarity_scores(review)['compound']
    vader_predictions.append(1 if compound >= 0.05 else 0)

vader_acc = accuracy_score(y_test, vader_predictions)
lr_acc = accuracy_score(y_test, y_pred)  # From our trained LR pipeline

print(f"VADER accuracy:       {vader_acc:.4f}")
print(f"TF-IDF + LR accuracy: {lr_acc:.4f}")
print(f"Difference:           {lr_acc - vader_acc:+.4f}")

Production Tip --- Use VADER when you have no labeled data and need a quick, interpretable baseline. Use ML-based sentiment (TF-IDF + classifier) when you have at least a few hundred labeled examples and the domain has specialized vocabulary. The ML approach learns that "runs hot" is negative for electronics but positive for restaurant reviews. VADER does not.

Handling Negation Properly

Negation is the hardest problem in lexicon-based sentiment. "Not bad" is positive. "Not good" is negative. "Not the worst" is weakly positive. The simplest ML-based approach: use bigrams.

# Demonstrate the power of bigrams for negation handling
unigram_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 1), max_features=5000)),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])

bigram_pipe = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), max_features=5000)),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])

uni_scores = cross_val_score(unigram_pipe, X_train, y_train, cv=5, scoring='accuracy')
bi_scores = cross_val_score(bigram_pipe, X_train, y_train, cv=5, scoring='accuracy')

print(f"Unigrams only:      {uni_scores.mean():.4f} (+/- {uni_scores.std():.4f})")
print(f"Unigrams + bigrams: {bi_scores.mean():.4f} (+/- {bi_scores.std():.4f})")

Bigrams capture "not great," "not recommend," "not worth" as distinct features. Without bigrams, the model sees "not" and "great" separately and must learn indirectly that their co-occurrence is negative.


Part 5: Topic Modeling with LDA

What Is Topic Modeling?

Topic modeling is unsupervised: given a collection of documents with no labels, discover the latent topics that the documents are about. The most widely used algorithm is Latent Dirichlet Allocation (LDA).

LDA assumes each document is a mixture of topics, and each topic is a distribution over words. A support ticket about billing might be 70% "billing topic" and 30% "account access topic." The algorithm discovers these topics automatically from the word patterns.
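
The "mixture of topics" assumption can be made tangible by sampling topic proportions from a Dirichlet distribution, the prior LDA places over document-topic mixtures (an illustrative sketch with an arbitrary alpha, separate from actually fitting LDA):

```python
import numpy as np

rng = np.random.default_rng(42)

# A Dirichlet draw is a vector of topic proportions that sums to 1.
# A small alpha concentrates mass on a few topics, matching the intuition
# that a support ticket is mostly about one thing.
alpha = [0.1, 0.1, 0.1, 0.1]          # hypothetical sparse prior, 4 topics
doc_mixtures = rng.dirichlet(alpha, size=3)

for i, mix in enumerate(doc_mixtures):
    print(f"Document {i}: {mix.round(3)}  (sums to {mix.sum():.0f})")
```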

The business use case is straightforward: "We have 50,000 support tickets. What are customers complaining about? Can we categorize them without reading all 50,000?"

LDA with Scikit-Learn

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
import numpy as np

np.random.seed(42)

# --- Simulate support ticket corpus ---
topic_templates = {
    'billing': [
        "charged twice credit card billing error refund",
        "subscription renewed unexpectedly cancel billing charged",
        "wrong amount charged invoice billing dispute payment",
        "refund not received billing support credit card",
        "auto renewal cancel subscription charged billing",
    ],
    'performance': [
        "slow loading buffering video stream performance",
        "app crashes frequently performance error loading",
        "lag delay buffering stream quality performance",
        "freezing during playback performance slow buffer",
        "video quality poor resolution stream performance lag",
    ],
    'account': [
        "cannot login password reset account locked",
        "forgot password account access login email",
        "two factor authentication login account verify",
        "account suspended access denied login support",
        "email change account update profile settings login",
    ],
    'content': [
        "show removed library content missing catalog",
        "subtitle wrong language audio content options",
        "new season available content release schedule",
        "content not available region library restricted",
        "download offline content save library access",
    ],
}

# Generate synthetic tickets
n_tickets = 2000
tickets = []
true_topics = []

for _ in range(n_tickets):
    # Each ticket is primarily one topic with some noise
    topic = np.random.choice(list(topic_templates.keys()))
    template = np.random.choice(topic_templates[topic])

    # Add noise from other topics
    other_topics = [t for t in topic_templates if t != topic]
    noise_topic = np.random.choice(other_topics)
    noise_template = np.random.choice(topic_templates[noise_topic])

    # 80% primary topic words, 20% noise
    primary_words = template.split()
    noise_words = noise_template.split()[:2]

    words = primary_words + noise_words
    np.random.shuffle(words)
    tickets.append(' '.join(words))
    true_topics.append(topic)

ticket_df = pd.DataFrame({'ticket': tickets, 'true_topic': true_topics})
print(f"Tickets: {len(ticket_df)}")
print(f"True topic distribution:\n{ticket_df['true_topic'].value_counts()}")
Tickets: 2000
True topic distribution:
true_topic
performance    517
billing        507
content        501
account        475
Name: count, dtype: int64
# Step 1: Vectorize with CountVectorizer (LDA expects counts, not TF-IDF)
count_vec = CountVectorizer(
    max_features=1000,
    max_df=0.95,
    min_df=2,
    stop_words='english'
)
doc_term_matrix = count_vec.fit_transform(ticket_df['ticket'])

print(f"Document-term matrix: {doc_term_matrix.shape}")
Document-term matrix: (2000, 55)
# Step 2: Fit LDA
n_topics = 4  # We know there are 4 true topics

lda = LatentDirichletAllocation(
    n_components=n_topics,
    random_state=42,
    max_iter=20,
    learning_method='online',  # 'online' for large corpora, 'batch' for small
    n_jobs=-1
)
lda.fit(doc_term_matrix)

# Step 3: Display top words per topic
feature_names = count_vec.get_feature_names_out()

def display_topics(model, feature_names, n_top_words=10):
    """Print the top words for each topic."""
    for topic_idx, topic in enumerate(model.components_):
        top_word_indices = topic.argsort()[-n_top_words:][::-1]
        top_words = [feature_names[i] for i in top_word_indices]
        print(f"Topic {topic_idx}: {', '.join(top_words)}")

display_topics(lda, feature_names, n_top_words=10)
Topic 0: billing, charged, subscription, cancel, refund, credit, card, payment, invoice, renewed
Topic 1: performance, stream, buffering, loading, slow, lag, video, quality, buffer, crashes
Topic 2: login, account, password, access, email, reset, locked, denied, verify, authentication
Topic 3: content, library, available, subtitle, region, download, offline, language, missing, catalog

LDA discovered four topics that align cleanly with our ground truth: billing, performance, account access, and content availability.

Assigning Topics to Documents

# Transform documents to get topic distributions
topic_distributions = lda.transform(doc_term_matrix)
print(f"Topic distribution shape: {topic_distributions.shape}")
print(f"\nFirst ticket: '{ticket_df['ticket'].iloc[0][:60]}...'")
print(f"True topic: {ticket_df['true_topic'].iloc[0]}")
print(f"LDA topic distribution: {topic_distributions[0].round(3)}")
print(f"Assigned topic: Topic {topic_distributions[0].argmax()}")
# Assign dominant topic to each ticket
ticket_df['lda_topic'] = topic_distributions.argmax(axis=1)
ticket_df['lda_confidence'] = topic_distributions.max(axis=1)

# Map topic numbers to labels (manual interpretation)
topic_labels = {0: 'billing', 1: 'performance', 2: 'account', 3: 'content'}
ticket_df['lda_label'] = ticket_df['lda_topic'].map(topic_labels)

# How well does LDA match true topics?
accuracy = (ticket_df['lda_label'] == ticket_df['true_topic']).mean()
print(f"\nLDA topic assignment accuracy: {accuracy:.2%}")
print(f"\nTopic confidence distribution:")
print(ticket_df['lda_confidence'].describe().round(3))

Choosing the Number of Topics

In practice, you do not know the true number of topics. Scikit-learn provides two fit diagnostics: perplexity (lower is better) and log-likelihood (higher is better). A complementary measure, topic coherence, scores how semantically related the top words within each topic are; higher coherence means more interpretable topics, but scikit-learn does not compute it (gensim's CoherenceModel does). Here we sweep over candidate topic counts using the scikit-learn diagnostics:

from sklearn.decomposition import LatentDirichletAllocation
import matplotlib.pyplot as plt

# Test different numbers of topics
topic_range = range(2, 10)
perplexities = []
log_likelihoods = []

for n in topic_range:
    lda_test = LatentDirichletAllocation(
        n_components=n,
        random_state=42,
        max_iter=20,
        learning_method='online',
        n_jobs=-1
    )
    lda_test.fit(doc_term_matrix)
    perplexities.append(lda_test.perplexity(doc_term_matrix))
    log_likelihoods.append(lda_test.score(doc_term_matrix))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

ax1.plot(list(topic_range), perplexities, 'b-o')
ax1.set_xlabel('Number of Topics')
ax1.set_ylabel('Perplexity')
ax1.set_title('Perplexity vs. Number of Topics')
ax1.axvline(x=4, color='red', linestyle='--', label='k=4 (true)')
ax1.legend()

ax2.plot(list(topic_range), log_likelihoods, 'g-o')
ax2.set_xlabel('Number of Topics')
ax2.set_ylabel('Log-Likelihood')
ax2.set_title('Log-Likelihood vs. Number of Topics')
ax2.axvline(x=4, color='red', linestyle='--', label='k=4 (true)')
ax2.legend()

plt.tight_layout()
plt.savefig('lda_topic_selection.png', dpi=150, bbox_inches='tight')
plt.show()

Common Mistake --- Perplexity and log-likelihood curves do not always show a clean elbow. In practice, topic modeling is as much art as science. Run LDA with k=3, 5, 8, 12 topics, examine the top words for each, and pick the number that produces topics a domain expert can label and act on. The "right" number of topics is the one that produces actionable categories for your business, not the one that minimizes a statistical metric.

Production Tip --- LDA requires CountVectorizer, not TfidfVectorizer. LDA is a generative model that assumes word counts follow a multinomial distribution. TF-IDF weights violate this assumption. If you accidentally feed TF-IDF into LDA, it will still run without error, but the topics will be lower quality.


Part 6: Putting It All Together --- The NLP Pipeline

End-to-End: Raw Text to Predictions

The complete NLP pipeline chains preprocessing, vectorization, and modeling. Here is the production pattern:

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
import numpy as np

np.random.seed(42)

# End-to-end pipeline with hyperparameter tuning
pipeline = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])

# Parameters to search
param_grid = {
    'tfidf__max_features': [3000, 5000, 10000],
    'tfidf__ngram_range': [(1, 1), (1, 2)],
    'tfidf__min_df': [1, 2, 5],
    'tfidf__sublinear_tf': [True, False],
    'clf__C': [0.1, 1.0, 10.0],
}

grid_search = GridSearchCV(
    pipeline,
    param_grid,
    cv=3,
    scoring='f1',
    n_jobs=-1,
    verbose=0
)

grid_search.fit(X_train, y_train)

print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV F1: {grid_search.best_score_:.4f}")
print(f"Test F1: {grid_search.score(X_test, y_test):.4f}")

Production Tip --- Always put the vectorizer inside the pipeline so that GridSearchCV can tune vectorizer parameters alongside classifier parameters. If you vectorize outside the pipeline, you cannot cross-validate vectorizer choices, and you risk data leakage (fitting the vectorizer on test data).
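
The two patterns side by side, sketched with a toy corpus (the chapter's real dataset would behave the same way):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, just to contrast the two patterns
docs = ["good product", "bad product", "great value", "terrible value",
        "good quality", "awful quality"] * 10
labels = [1, 0, 1, 0, 1, 0] * 10

# WRONG: the vectorizer sees every document, including each CV fold's
# held-out split, before cross-validation starts (vocabulary/IDF leakage).
X_leaky = TfidfVectorizer().fit_transform(docs)
leaky_scores = cross_val_score(LogisticRegression(max_iter=1000),
                               X_leaky, labels, cv=3)

# RIGHT: the vectorizer lives inside the pipeline, so each fold refits it
# on that fold's training documents only.
pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LogisticRegression(max_iter=1000))])
clean_scores = cross_val_score(pipe, docs, labels, cv=3)

print(f"Leaky CV accuracy: {leaky_scores.mean():.3f}")
print(f"Clean CV accuracy: {clean_scores.mean():.3f}")
```

On this trivial corpus the scores barely differ; on real data with a large vocabulary, the leaky version overstates performance.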

Adding NLP Features to a Tabular Model

In the StreamFlow churn model, text data from support tickets is one signal among many (usage patterns, billing history, demographics). The standard approach: vectorize the text separately, then combine with tabular features.

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import numpy as np
import pandas as pd

np.random.seed(42)

# Simulated combined dataset: tabular + text features
n = 2000
combined_df = pd.DataFrame({
    'usage_hours': np.random.exponential(10, n),
    'support_tickets': np.random.poisson(2, n),
    'months_active': np.random.randint(1, 36, n),
    'ticket_text': np.random.choice([
        "billing charged wrong amount refund",
        "app slow loading buffering performance",
        "cannot login password reset account",
        "great service no issues happy",
        "cancel subscription not using service",
    ], n),
    'churned': np.random.binomial(1, 0.3, n)
})

# ColumnTransformer handles mixed types
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), ['usage_hours', 'support_tickets', 'months_active']),
        ('text', TfidfVectorizer(max_features=500, ngram_range=(1, 2)), 'ticket_text'),
    ],
    remainder='drop'
)

# Full pipeline: preprocess mixed types -> classify
full_pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('clf', LogisticRegression(max_iter=1000, random_state=42))
])

from sklearn.model_selection import cross_val_score

scores = cross_val_score(full_pipeline, combined_df.drop('churned', axis=1),
                         combined_df['churned'], cv=5, scoring='f1')
print(f"Combined model CV F1: {scores.mean():.4f} (+/- {scores.std():.4f})")

This pattern --- ColumnTransformer with TfidfVectorizer for text columns and StandardScaler for numeric columns --- is the standard way to combine text and tabular features in scikit-learn. It will appear again in the progressive project exercise for this chapter.


Part 7: Preview --- Word Embeddings and What Comes Next

The Limitations of TF-IDF

TF-IDF treats every word as independent. "Happy" and "joyful" have no relationship in TF-IDF space. "Bank" (financial institution) and "bank" (river bank) are the same vector. TF-IDF cannot capture synonymy, polysemy, or semantic similarity.
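
A two-line check makes the orthogonality concrete (a minimal sketch):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Two documents with no overlapping tokens are orthogonal in TF-IDF space
X = TfidfVectorizer().fit_transform(["happy", "joyful"])
sim = cosine_similarity(X[0], X[1])[0, 0]
print(sim)  # 0.0
```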

Word embeddings solve this by mapping words to dense vectors in a continuous space where semantically similar words are close together. Word2Vec, GloVe, and FastText learn these vectors from large text corpora. BERT and other transformer models go further, creating context-dependent embeddings where "bank" gets different vectors depending on the surrounding words.

# A taste of what word embeddings look like (conceptual demonstration)
# In Chapter 36, you will use actual pretrained embeddings

# TF-IDF: "happy" and "joyful" are orthogonal (similarity = 0)
# Embeddings: "happy" and "joyful" are close (similarity ~ 0.7)

# TF-IDF: vocabulary is bounded by your corpus
# Embeddings: pretrained on billions of words, transfer to your domain

# TF-IDF: interpretable (each dimension is a specific word)
# Embeddings: dense, not directly interpretable (300 abstract dimensions)

Core Principle --- TF-IDF is sufficient when your text data has a limited vocabulary, your problem is classification or search, and interpretability matters. Embeddings are necessary when you need semantic similarity, your vocabulary is open-ended, or you are working with short texts where word overlap is low. Chapter 36 covers the advanced techniques. For now, know that TF-IDF is not a stepping stone to something better --- it is a legitimate production tool that solves real problems.


Summary

This chapter covered the NLP fundamentals that form the foundation for all text-based machine learning:

  1. Text preprocessing --- tokenization, lowercasing, stop word removal, stemming, lemmatization --- is the plumbing that determines whether your NLP pipeline works. Negation handling (preserving "not") is critical for sentiment analysis.

  2. Bag-of-Words and TF-IDF convert text into numerical matrices. TF-IDF improves on raw counts by downweighting ubiquitous words and upweighting discriminative words. Use sublinear_tf=True and bigrams as your default configuration.

  3. TF-IDF + logistic regression is the standard baseline for text classification. It is fast, interpretable (coefficients show which words drive predictions), and often competitive with deep learning for business problems with sufficient training data.

  4. VADER provides rule-based sentiment scoring without training data. It handles negation, capitalization, and punctuation. Use it as a baseline or when you have no labeled data. Switch to ML-based sentiment when you have domain-specific labels.

  5. LDA discovers latent topics in a document collection. Feed it CountVectorizer output (not TF-IDF). Choose the number of topics based on interpretability, not just statistical metrics. The output --- topic distributions per document --- can become features in downstream models.

The two case studies that follow apply these techniques to ShopSmart product reviews (sentiment analysis for A/B test interpretation) and StreamFlow support tickets (topic modeling and ticket classification for churn analysis).


Next: Case Study 1 --- ShopSmart Review Sentiment Analysis | Case Study 2 --- StreamFlow Support Ticket Classification