Further Reading: Chapter 26

NLP Fundamentals: Text Preprocessing, TF-IDF, Sentiment Analysis, and Topic Modeling


Foundational Papers

1. "A Vector Space Model for Automatic Indexing" --- Gerard Salton, A. Wong, and C.S. Yang (1975) The paper that introduced the vector space model for information retrieval, the conceptual ancestor of TF-IDF. Salton, Wong, and Yang proposed representing documents as vectors in a high-dimensional space where each dimension corresponds to a term, and similarity between documents is measured by the cosine of the angle between their vectors. Published in Communications of the ACM. Half a century later, the core idea --- documents are vectors, similarity is cosine distance --- remains the foundation of TF-IDF, document search, and the embedding models that succeeded it. Read this for historical grounding.

2. "Introduction to Information Retrieval" --- Christopher Manning, Prabhakar Raghavan, and Hinrich Schutze (2008) The definitive textbook on the mathematics behind text vectorization. Chapters 6 (TF-IDF weighting) and 13 (text classification with Naive Bayes and logistic regression) cover exactly the material in this chapter, with full mathematical derivations. The book is available free online at nlp.stanford.edu/IR-book/. Read Chapters 6 and 13 if you want the formal proofs behind why TF-IDF works. The treatment of Naive Bayes for text classification (multinomial vs. Bernoulli models) is the clearest in the literature.

3. "Latent Dirichlet Allocation" --- David Blei, Andrew Ng, and Michael Jordan (2003) The original LDA paper. Blei, Ng, and Jordan introduced LDA as a generative probabilistic model where each document is a mixture of latent topics and each topic is a distribution over words. Published in the Journal of Machine Learning Research. The paper is mathematically dense (variational inference, Dirichlet distributions) but the intuition is accessible: LDA reverse-engineers the process by which documents were "generated" from hidden topics. Read the first five pages for the intuition, skip the variational inference derivation unless you are implementing LDA from scratch.


Sentiment Analysis

4. "VADER: A Parsimonious Rule-Based Model for Sentiment Analysis of Social Media Text" --- C.J. Hutto and Eric Gilbert (2014) The paper introducing VADER, the lexicon-based sentiment analyzer used throughout this chapter. Hutto and Gilbert constructed a sentiment lexicon of 7,500 features (words, emoticons, slang) and validated it with human raters. The key innovation: five grammatical heuristics that modify word-level sentiment (negation, degree modifiers, capitalization, punctuation, conjunctions). Published at ICWSM 2014. VADER consistently outperforms other lexicon-based tools on social media text and performs competitively with supervised methods when labeled data is scarce. Read this if you want to understand exactly what rules VADER applies.

5. "Thumbs Up? Sentiment Classification Using Machine Learning Techniques" --- Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan (2002) The paper that established ML-based sentiment analysis as a research field. Pang, Lee, and Vaithyanathan showed that standard text classifiers (Naive Bayes, SVM, MaxEnt) with unigram or bigram features could classify movie review sentiment with 78-83% accuracy. Published at EMNLP 2002. Two decades later, TF-IDF + logistic regression on sentiment is still a competitive baseline. The paper's key finding --- that sentiment classification is harder than topic classification because sentiment depends on more subtle linguistic cues --- remains true.

6. "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank" --- Richard Socher, Alex Perelygin, Jean Wu, et al. (2013) The paper that introduced the Stanford Sentiment Treebank and recursive neural networks for fine-grained sentiment analysis. The dataset includes sentence-level and phrase-level sentiment labels on a 5-point scale, enabling models to learn compositional sentiment (how "not" modifies "good"). Published at EMNLP 2013. While the recursive neural network architecture has been superseded by transformers, the dataset remains a standard benchmark, and the paper's analysis of compositional sentiment is essential reading for understanding why bag-of-words approaches struggle with negation and intensification.


Topic Modeling

7. "Probabilistic Topic Models" --- David Blei (2012) Blei's survey article in Communications of the ACM, written for a general computer science audience. This is the accessible version of the 2003 LDA paper: clear intuitions, helpful diagrams, minimal math. Blei walks through how LDA works, what the topics mean, and how to evaluate them. He also covers extensions: dynamic topic models (topics that evolve over time), correlated topic models (topics that co-occur), and supervised topic models (topics that predict labels). Read this instead of the original LDA paper unless you need the mathematical details.

8. "Reading Tea Leaves: How Humans Interpret Topic Models" --- Jonathan Chang, Jordan Boyd-Graber, David Blei, Sean Gerrish, and Lal Amr (2009) A fascinating study showing that statistical metrics (perplexity, log-likelihood) for evaluating topic models do not correlate with human judgments of topic quality. Chang et al. conducted experiments where humans rated whether topics were coherent and whether documents were correctly assigned to topics. Models with better perplexity often produced topics that humans found less interpretable. Published at NeurIPS 2009. This paper is the reason this chapter emphasizes qualitative topic evaluation over perplexity curves.


Text Preprocessing and Feature Engineering

9. "Feature Engineering for Text Data" --- Alice Zheng and Amanda Casari, from Feature Engineering for Machine Learning (O'Reilly, 2018) Chapter 3 of this O'Reilly book covers bag-of-words, TF-IDF, and n-grams with a practitioner focus. The treatment is less mathematical than Manning et al. but more oriented toward production data science. The discussion of when to use character-level n-grams (for capturing misspellings and morphological patterns) and subword features is particularly useful. Read this after this chapter for additional practical techniques.

10. "A Comparison of Stemming Algorithms for English Text Retrieval" --- Jiaul H. Paik (2013) A systematic comparison of stemming algorithms (Porter, Snowball, Lancaster, Krovetz, Xerox) across multiple retrieval tasks. Paik showed that aggressive stemming (Porter, Lancaster) hurts precision on short queries but helps recall on long queries. The practical takeaway: stemming is not universally beneficial, and the choice of stemmer matters. Published in the Journal of the American Society for Information Science and Technology. Read this if you are deciding between stemmers for a specific application.


Scikit-Learn Documentation

11. Scikit-learn Text Feature Extraction Guide The official tutorial at scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction covers CountVectorizer, TfidfVectorizer, and TfidfTransformer with code examples. The section on customizing the vectorizer (custom tokenizers, preprocessors, analyzers) is essential for production use. The guide also covers hashing vectorization with HashingVectorizer for very large vocabularies that do not fit in memory --- a technique not covered in this chapter but useful for large-scale deployments.
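The hashing trick behind HashingVectorizer can be sketched in a few lines: hash each token straight to a bucket index, so no vocabulary dictionary is ever stored. (This sketch uses MD5 for determinism; scikit-learn's implementation uses MurmurHash with signed buckets to reduce the bias from collisions.)

```python
import hashlib

def hashed_features(tokens, n_buckets=16):
    """Map tokens to a fixed-length count vector without a vocabulary."""
    vec = [0] * n_buckets
    for tok in tokens:
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        vec[h % n_buckets] += 1
    return vec

print(hashed_features("the cat sat on the mat".split()))
```

The trade-off: memory is bounded regardless of vocabulary size, but distinct tokens can collide into the same bucket, and you can no longer map a feature index back to a word.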

12. Scikit-learn Working with Text Data Tutorial The tutorial at scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html walks through a complete text classification pipeline on the 20 Newsgroups dataset: loading data, vectorizing, training Naive Bayes and SVM classifiers, evaluating, and tuning with grid search. It is the closest thing to a "quick start" for NLP with scikit-learn and covers the same pipeline pattern used throughout this chapter.
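The tutorial's pipeline pattern compresses to a few lines. A minimal sketch on stand-in data (the real tutorial loads 20 Newsgroups with fetch_20newsgroups and uses a proper test split):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

# Tiny invented spam/ham corpus standing in for 20 Newsgroups (1 = spam).
texts = ["free money win prize now", "win cash prize click now",
         "claim your free prize", "meeting agenda for tuesday",
         "project status and agenda", "notes from the tuesday meeting"]
labels = [1, 1, 1, 0, 0, 0]

pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])
grid = GridSearchCV(
    pipe,
    {"tfidf__ngram_range": [(1, 1), (1, 2)], "clf__alpha": [0.1, 1.0]},
    cv=3,
)
grid.fit(texts, labels)
print(grid.best_params_)
```

The step__parameter naming convention (tfidf__ngram_range) is what lets grid search tune the vectorizer and the classifier jointly, which is the tutorial's main lesson.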


Beyond TF-IDF (Preview for Chapter 36)

13. "Efficient Estimation of Word Representations in Vector Space" --- Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean (2013) The Word2Vec paper. Mikolov et al. introduced two architectures --- CBOW and Skip-gram --- for learning dense word vectors from large text corpora. The resulting vectors capture semantic relationships: vec("king") - vec("man") + vec("woman") approximates vec("queen"). Published as a preprint at arXiv. Word2Vec was the first widely adopted word embedding method and the conceptual bridge between TF-IDF (sparse, independent dimensions) and modern transformer embeddings (dense, contextual). Chapter 36 covers this in detail.

14. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding" --- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova (2019) The paper that transformed NLP. BERT produces context-dependent embeddings: "bank" gets different vectors depending on whether the surrounding words are about finance or rivers. Fine-tuned BERT models set new state-of-the-art results on nearly every NLP benchmark. Published at NAACL 2019. Read the abstract and Section 1 for context on why embeddings superseded TF-IDF for tasks requiring semantic understanding. But remember: for many business classification problems, TF-IDF remains the right tool.


Practical Applications

15. "Machine Learning for Text" --- Charu Aggarwal (Springer, 2018) A comprehensive textbook covering text preprocessing, classification, clustering, topic modeling, sentiment analysis, information extraction, and deep learning for text. Aggarwal balances theory and practice, with implementations in Python. Chapters 4-5 (text classification with classical methods) and Chapter 8 (sentiment analysis) directly extend the material in this chapter. The book is more detailed than this chapter on text clustering and information extraction.


Return to the chapter for full context.