Key Takeaways: Chapter 26

NLP Fundamentals: Text Preprocessing, TF-IDF, Sentiment Analysis, and Topic Modeling


  1. Preprocessing determines NLP model quality. Tokenization, lowercasing, stop word removal, and stemming or lemmatization are the plumbing that decides whether your model sees signal or noise. The most impactful decision is often the simplest: whether to preserve negation words during stop word removal. Removing "not" from "not great" inverts the sentiment. Build a preprocessing pipeline once, test it on edge cases, and reuse it across projects.
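
A minimal sketch of that decision (the helper name and stop word lists are illustrative, not from the chapter): subtract negation words from the stop word list before filtering, so "not great" keeps its "not".

```python
import re

# Typical stop word lists include negations -- remove them before filtering.
BASE_STOP_WORDS = {"the", "a", "an", "is", "it", "this", "and", "or",
                   "not", "no", "never", "nor"}
NEGATIONS = {"not", "no", "never", "nor"}
STOP_WORDS = BASE_STOP_WORDS - NEGATIONS  # preserve negations on purpose

def preprocess(text: str) -> list[str]:
    """Lowercase, tokenize on letters/apostrophes, drop stop words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

print(preprocess("This is not a great product"))  # ['not', 'great', 'product']
```

With the full stop word list, the same sentence would collapse to ['great', 'product'] and the sentiment would flip.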

  2. TF-IDF is not a stepping stone to something better --- it is a legitimate production tool. For document classification, ticket routing, sentiment scoring, and search ranking, TF-IDF plus a linear classifier is sufficient, interpretable, fast, and easy to maintain. It solves the majority of business NLP problems without GPUs, without pretrained models, and without deep learning expertise. Use it as the default and switch to embeddings only when TF-IDF demonstrably fails.

  3. Bag-of-Words destroys word order; bigrams partially recover it. "Not great" and "great not" produce identical unigram vectors. Adding bigrams (ngram_range=(1, 2)) creates distinct features for "not great," "great product," and "fast shipping." Unigrams plus bigrams is the standard baseline configuration. Trigrams are rarely worth the vocabulary explosion unless you have a very large corpus.
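
The word-order point can be verified directly (a small sketch using scikit-learn's CountVectorizer; the two-document corpus is illustrative):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["not great", "great not"]

# Unigram-only vectors for the two documents are identical...
uni = CountVectorizer(ngram_range=(1, 1))
X_uni = uni.fit_transform(docs).toarray()

# ...while adding bigrams creates a distinct feature per phrase.
bi = CountVectorizer(ngram_range=(1, 2))
X_bi = bi.fit_transform(docs).toarray()

print(sorted(uni.vocabulary_))  # ['great', 'not']
print(sorted(bi.vocabulary_))   # ['great', 'great not', 'not', 'not great']
```

The unigram rows are equal; the bigram rows differ on the "not great" and "great not" columns.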

  4. The TfidfVectorizer parameters that matter most are max_features, ngram_range, min_df, max_df, and sublinear_tf. max_features caps vocabulary size, bounding memory and acting as implicit regularization. min_df removes typos and one-off terms. max_df removes domain-specific stop words. sublinear_tf=True dampens the effect of extreme word counts. These five parameters often have more impact on performance than the choice of classifier.
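
Putting those five parameters together (the specific values and the tiny corpus are illustrative; tune per corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer(
    ngram_range=(1, 2),    # unigrams + bigrams: the standard baseline
    max_features=50_000,   # cap vocabulary: memory bound + regularization
    min_df=2,              # drop terms in fewer than 2 documents (typos, one-offs)
    max_df=0.9,            # drop terms in more than 90% of documents
    sublinear_tf=True,     # use 1 + log(tf) to dampen extreme counts
)

docs = ["great product fast shipping",
        "great price fast shipping",
        "terrible product slow shipping"]
vec.fit(docs)
# "shipping" (in every document) and one-off terms are filtered out
print(sorted(vec.vocabulary_))
```

On this toy corpus, min_df removes the singleton terms and max_df removes "shipping", which behaves like a corpus-specific stop word.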

  5. Put the vectorizer inside the pipeline. Pipeline([('tfidf', TfidfVectorizer()), ('clf', LogisticRegression())]) ensures the vectorizer is fit only on training data, enables cross-validation of vectorizer parameters alongside classifier parameters, and prevents data leakage. If you vectorize outside the pipeline, you fit the vocabulary on test data, which inflates your metrics.
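
A leak-free sketch of that setup, including joint tuning of vectorizer and classifier parameters (the toy texts and labels are hypothetical):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

pipe = Pipeline([("tfidf", TfidfVectorizer()),
                 ("clf", LogisticRegression(max_iter=1000))])

# On every CV split, the vocabulary is refit on that split's training fold
# only, so vectorizer and classifier parameters tune together without leakage.
grid = GridSearchCV(pipe,
                    param_grid={"tfidf__ngram_range": [(1, 1), (1, 2)],
                                "clf__C": [0.1, 1.0, 10.0]},
                    cv=3)

texts = ["great product", "really great quality", "love this product",
         "terrible service", "awful experience", "very bad quality"]
labels = [1, 1, 1, 0, 0, 0]
grid.fit(texts, labels)
print(grid.best_params_)
```

The `tfidf__` and `clf__` prefixes route each grid parameter to the matching pipeline step.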

  6. TF-IDF + logistic regression gives you interpretability that deep learning does not. Each coefficient corresponds to a specific word or bigram. You can explain to a non-technical stakeholder exactly why the model classified a document as "billing complaint" by listing the words that contributed the most to that prediction. This interpretability is often more valuable than a small accuracy improvement from a black-box model.

  7. VADER is a fast, zero-training-data sentiment baseline; it is not a production solution for domain-specific text. VADER handles negation, capitalization, and punctuation emphasis using handcrafted rules. It performs well on informal text with clear sentiment signals (product reviews, tweets). It struggles with nuance, sarcasm, domain-specific vocabulary, and formal writing. Use it when you have no labeled data. Train a domain-specific classifier when you do.

  8. LDA discovers topics from unlabeled documents, but it requires human interpretation. LDA outputs distributions over words for each topic and distributions over topics for each document. The algorithm does not label the topics --- a domain expert must examine the top words and assign meaningful names. Choose the number of topics based on interpretability, not just perplexity curves. Five interpretable topics are more useful than fifteen that nobody can label.

  9. Feed CountVectorizer to LDA, not TfidfVectorizer. LDA is a generative model that assumes word counts follow a multinomial distribution. TF-IDF weights violate this assumption. LDA will run on TF-IDF input without error, but the discovered topics will be lower quality. This is a common mistake that silently degrades results.

  10. The real value of NLP is not the model --- it is the question you ask. The ShopSmart case study showed that aggregate star ratings hid the real problem (misleading product descriptions in the treatment group). The StreamFlow case study showed that ticket content matters more than ticket count for predicting churn. In both cases, the models were simple (TF-IDF + logistic regression, LDA). The value was in using NLP to answer the right business question.


If You Remember One Thing

TF-IDF + logistic regression is the baseline you must beat before reaching for deep learning. It trains in seconds, predicts in microseconds, explains its decisions in plain English, runs on any hardware, and solves the majority of business text classification problems. If your BERT model achieves 96% and TF-IDF + LR achieves 93%, you have to justify not just the three-point improvement but the GPU cost, inference latency, maintenance complexity, and the loss of interpretability. Sometimes that justification exists. Often it does not.


These takeaways summarize Chapter 26: NLP Fundamentals. Return to the chapter for full context.