Quiz: Chapter 26

NLP Fundamentals: Text Preprocessing, TF-IDF, Sentiment Analysis, and Topic Modeling


Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.


Question 1 (Multiple Choice)

Which preprocessing step is most critical for sentiment analysis specifically?

  • A) Lowercasing all text
  • B) Removing punctuation
  • C) Preserving negation words during stop word removal
  • D) Applying stemming to reduce vocabulary size

Answer: C) Preserving negation words during stop word removal. Negation words ("not," "no," "never," "n't") invert sentiment: "not great" is negative, but removing "not" as a stop word leaves "great," which reads as positive. This is a common source of error in sentiment pipelines. Lowercasing (A) and punctuation removal (B) are useful general-purpose steps, but neither risks flipping a sentence's polarity. Stemming (D) reduces vocabulary size but does not affect sentiment direction.
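The fix can be sketched with scikit-learn's built-in English stop list minus the negation words; the documents below are toy examples:

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Drop negation words from the built-in stop list so they survive
NEGATIONS = {"no", "not", "nor", "never"}
custom_stops = list(ENGLISH_STOP_WORDS - NEGATIONS)

vec = TfidfVectorizer(stop_words=custom_stops, ngram_range=(1, 2))
X = vec.fit_transform(["the movie was not great", "the movie was great"])

vocab = vec.vocabulary_
print("not great" in vocab)  # True: the negated bigram is its own feature
print("was" in vocab)        # False: ordinary stop words are still removed
```

Because n-grams are built after stop word removal, keeping "not" also lets bigrams like "not great" become features that a downstream classifier can weight negatively.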


Question 2 (Multiple Choice)

What does the IDF component of TF-IDF measure?

  • A) How frequently a term appears in a specific document
  • B) How rare a term is across the entire corpus
  • C) The total number of unique terms in the vocabulary
  • D) The length of the document normalized by the corpus average

Answer: B) How rare a term is across the entire corpus. In scikit-learn's smoothed form, IDF = log((1 + N) / (1 + df)) + 1, where N is the total number of documents and df is the number of documents containing the term (the classic textbook form is log(N / df)). Terms that appear in many documents get low IDF (they are common and non-discriminative). Terms that appear in few documents get high IDF (they are rare and potentially informative). Choice A describes TF, not IDF.
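The smoothed formula can be checked directly against scikit-learn's fitted idf_ values on a toy corpus:

```python
import math

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog sat", "the cat ran"]
vec = TfidfVectorizer()  # smooth_idf=True is the default
vec.fit(docs)

N = len(docs)
# "the" appears in all 3 documents; "dog" in only 1
idf_the = math.log((1 + N) / (1 + 3)) + 1  # log(4/4) + 1 = 1.0
idf_dog = math.log((1 + N) / (1 + 1)) + 1  # log(4/2) + 1 ~ 1.693

vocab = vec.vocabulary_
print(round(vec.idf_[vocab["the"]], 3))  # 1.0
print(round(vec.idf_[vocab["dog"]], 3))  # 1.693
```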


Question 3 (Multiple Choice)

You are building a text classifier for customer support tickets. Your corpus has 50,000 tickets. Which TfidfVectorizer configuration is the best starting point?

  • A) TfidfVectorizer(max_features=100, ngram_range=(1, 1))
  • B) TfidfVectorizer(max_features=10000, ngram_range=(1, 2), min_df=2, max_df=0.95, sublinear_tf=True)
  • C) TfidfVectorizer(ngram_range=(1, 3), max_features=100000)
  • D) TfidfVectorizer(max_features=500, stop_words='english', ngram_range=(3, 3))

Answer: B) This is the standard production configuration. max_features=10000 provides a reasonable vocabulary cap. ngram_range=(1, 2) includes unigrams and bigrams (capturing "not working" as a single feature). min_df=2 removes typos and one-off terms. max_df=0.95 removes near-universal terms. sublinear_tf=True dampens the effect of extreme word counts. Choice A is too restrictive (100 features loses too much signal). Choice C creates a massive feature space with trigrams that is likely to overfit. Choice D uses only trigrams, missing the information in individual words and bigrams.
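Choice B wired into a full pipeline; the tickets and labels below are toy placeholders, not a realistic training set size:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(
    TfidfVectorizer(max_features=10_000, ngram_range=(1, 2),
                    min_df=2, max_df=0.95, sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)

tickets = ["printer not working again", "printer not working today",
           "billing charge wrong amount", "wrong billing charge on card"]
labels = ["hardware", "hardware", "billing", "billing"]
pipe.fit(tickets, labels)

print(pipe.predict(["my billing amount is wrong"]))  # ['billing']
```

Note how min_df=2 already prunes the one-off words ("again," "today," "card") from this tiny vocabulary, while the surviving bigrams like "not working" carry the negation signal.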


Question 4 (Multiple Choice)

Why does scikit-learn store TF-IDF matrices as sparse matrices?

  • A) Sparse matrices are faster to compute than dense matrices
  • B) Most entries in a document-term matrix are zero, so sparse storage saves memory
  • C) Dense matrices cannot be used with logistic regression
  • D) Sparse matrices preserve word order information

Answer: B) Most entries in a document-term matrix are zero, so sparse storage saves memory. A typical document uses a few hundred unique words out of a vocabulary of tens of thousands. In a matrix with 50,000 documents and 10,000 features, over 99% of entries are zero. Sparse matrices (CSR format) store only the non-zero entries, reducing memory from gigabytes to megabytes. Choice A is partially true (sparse operations can be faster) but the primary reason is memory. Choice C is wrong --- logistic regression works with both. Choice D is wrong --- neither sparse nor dense matrices preserve word order.
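The memory arithmetic is easy to verify with a synthetic matrix at roughly the density described above (the sizes here are illustrative):

```python
import numpy as np
from scipy import sparse

# Toy document-term matrix: 1,000 docs x 10,000 terms,
# each doc using ~100 distinct terms (~1% density).
rng = np.random.default_rng(0)
rows = np.repeat(np.arange(1000), 100)
cols = rng.integers(0, 10_000, size=100_000)
vals = rng.random(100_000)
X = sparse.csr_matrix((vals, (rows, cols)), shape=(1000, 10_000))

dense_bytes = X.shape[0] * X.shape[1] * 8  # float64, stored densely
sparse_bytes = X.data.nbytes + X.indices.nbytes + X.indptr.nbytes
print(dense_bytes // 1_000_000)   # 80 (MB if stored dense)
print(sparse_bytes // 1_000_000)  # 1 (MB in CSR: data + indices + indptr)
```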


Question 5 (Multiple Choice)

What is the fundamental limitation of Bag-of-Words and TF-IDF representations?

  • A) They cannot handle large vocabularies
  • B) They destroy word order information
  • C) They require labeled training data
  • D) They cannot represent negation

Answer: B) They destroy word order information. "The dog bit the man" and "the man bit the dog" produce identical BoW/TF-IDF vectors because only word frequencies are recorded, not positions. N-grams (bigrams, trigrams) partially recover local word order but cannot capture long-range dependencies. Choice A is wrong --- they handle large vocabularies via sparse matrices. Choice C is wrong --- TF-IDF is unsupervised. Choice D is wrong --- bigrams can capture negation patterns like "not good."
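The claim is directly checkable: the two sentences from the answer produce identical vectors.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

vec = TfidfVectorizer()
X = vec.fit_transform(["the dog bit the man", "the man bit the dog"])

# Only frequencies are recorded, not positions, so the rows match exactly
print(np.allclose(X[0].toarray(), X[1].toarray()))  # True
```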


Question 6 (Multiple Choice)

VADER assigns a compound score of +0.7 to "This product is GREAT!!!" and +0.5 to "This product is great." What explains the difference?

  • A) VADER has a larger vocabulary than other sentiment tools
  • B) VADER applies rules for capitalization emphasis and punctuation intensity
  • C) VADER trains on each input text before scoring
  • D) The two sentences have different word counts

Answer: B) VADER applies rules for capitalization emphasis and punctuation intensity. All-caps ("GREAT") signals emphasis, which VADER interprets as intensified sentiment. Multiple exclamation marks further boost the score. VADER uses a hand-curated lexicon plus a set of five rules: punctuation amplification, capitalization, degree modifiers ("very," "extremely"), conjunctions ("but"), and negation ("not"). It does not train on input (C) --- it is purely rule-based.
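VADER's real lexicon and rules live in the vaderSentiment/NLTK packages; the toy function below (hypothetical valences, made-up boost factors) only illustrates the shape of two of those rules, capitalization emphasis and exclamation intensity:

```python
# Toy sketch -- NOT VADER's implementation; valences and boosts are invented
LEXICON = {"great": 0.6, "terrible": -0.6}

def toy_score(text: str) -> float:
    score = 0.0
    for token in text.split():
        word = token.strip("!?.,").lower()
        if word in LEXICON:
            s = LEXICON[word]
            if token.strip("!?.,").isupper():  # ALL-CAPS emphasis
                s *= 1.5
            score += s
    bangs = min(text.count("!"), 3)  # cap the '!' boost, as VADER does
    if score != 0:
        score += bangs * 0.05 * (1 if score > 0 else -1)
    return round(score, 3)

print(toy_score("This product is great."))    # 0.6
print(toy_score("This product is GREAT!!!"))  # 1.05
```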


Question 7 (Multiple Choice)

When should you use VADER instead of a trained ML-based sentiment classifier?

  • A) When you need the highest possible accuracy on a domain-specific corpus
  • B) When you have abundant labeled training data
  • C) When you have no labeled data and need a quick, interpretable baseline
  • D) When the text contains domain-specific jargon

Answer: C) When you have no labeled data and need a quick, interpretable baseline. VADER requires no training data --- it works out of the box with its built-in lexicon. It is fast, interpretable (you can trace scores to specific words), and performs well on informal text with clear sentiment signals. However, it does not learn domain-specific vocabulary (D) and is typically outperformed by ML classifiers when labeled data is available (A, B).


Question 8 (Multiple Choice)

LDA (Latent Dirichlet Allocation) requires CountVectorizer input rather than TfidfVectorizer because:

  • A) TF-IDF matrices are too large for LDA
  • B) LDA assumes word counts follow a multinomial distribution, and TF-IDF weights violate this assumption
  • C) CountVectorizer produces dense matrices that LDA requires
  • D) TF-IDF removes stop words, which LDA needs

Answer: B) LDA assumes word counts follow a multinomial distribution, and TF-IDF weights violate this assumption. LDA is a generative model: it models each document as a mixture of topics, and each topic generates words according to a multinomial distribution over the vocabulary. This requires integer counts (or at least non-negative values that can be interpreted as counts). TF-IDF weights are real-valued transformations that no longer represent counts, violating the model's assumptions. LDA will still run on TF-IDF input without raising an error, but the topics will be lower quality.
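A minimal sketch of the correct pairing (toy documents; topic quality on four tickets is obviously not meaningful):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "refund charge billing invoice payment",
    "billing invoice charge refund card",
    "login password reset account locked",
    "account locked password login error",
]
counts = CountVectorizer().fit_transform(docs)  # integer counts, as LDA expects
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)
print(doc_topics.shape)  # (4, 2): per-document topic mixture, rows sum to 1
```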


Question 9 (Short Answer)

Explain the difference between stemming and lemmatization. Give one scenario where stemming is preferred and one where lemmatization is preferred.

Answer: Stemming applies rules to chop suffixes off words, producing stems that may not be real words (e.g., "happiness" becomes "happi"). Lemmatization uses a dictionary to reduce words to their base form, producing valid words (e.g., "happiness" becomes "happiness," "mice" becomes "mouse"). Stemming is preferred for classification tasks where the model just needs consistent token grouping and speed matters (e.g., spam detection on millions of emails). Lemmatization is preferred when the output must be human-readable (e.g., extracting topic labels that a non-technical audience will interpret).
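The contrast can be shown with a toy suffix-stripper standing in for a real stemmer (this is not the Porter algorithm) and a tiny hand-built lemma table standing in for a dictionary lookup:

```python
# Toy stemmer: chops suffixes by rule, so outputs need not be real words
SUFFIXES = ("ness", "ing", "ed", "es", "s")
# Toy lemma table: a real lemmatizer consults a full dictionary (e.g. WordNet)
LEMMAS = {"mice": "mouse", "running": "run", "happiness": "happiness"}

def toy_stem(word: str) -> str:
    for suf in SUFFIXES:
        if word.endswith(suf) and len(word) - len(suf) >= 3:
            return word[: -len(suf)]
    return word

print(toy_stem("happiness"))       # 'happi' -- not a real word
print(toy_stem("running"))         # 'runn'  -- not a real word
print(LEMMAS.get("mice", "mice"))  # 'mouse' -- a valid dictionary form
```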


Question 10 (Short Answer)

You have a TF-IDF + logistic regression model for classifying support tickets into 5 categories. The model achieves 95% accuracy. Your manager asks: "Can we explain why the model classified this specific ticket as 'billing'?" How do you provide the explanation?

Answer: Extract the logistic regression coefficients for the "billing" class and the TF-IDF feature values for the specific ticket. The prediction is the dot product of the coefficient vector and the feature vector. Multiply each feature's TF-IDF weight by its coefficient to get each word's contribution to the "billing" score. The words with the largest positive contributions (e.g., "charged," "refund," "credit card") are the reasons the model chose "billing." This coefficient-level explanation is one of the major advantages of TF-IDF + logistic regression over deep learning models.


Question 11 (Short Answer)

A colleague suggests using LDA with k=20 topics on a corpus of 5,000 short support tickets (average 15 words each). What concerns would you raise?

Answer: Twenty topics is likely too many for 5,000 short documents. With only 15 words per document, the topic distribution estimate for each document will be noisy --- there are not enough words to reliably assign probabilities across 20 topics. The topics themselves will likely overlap and be hard to interpret. A better approach: start with k=4 or k=5, examine the top words per topic, and increase k only if clearly distinct themes are being merged into single topics. For short text, fewer topics with clearer separation is almost always better than many topics with blurred boundaries.


Question 12 (Multiple Choice)

You build a TF-IDF + logistic regression sentiment classifier on movie reviews. You deploy it to classify product reviews and accuracy drops from 93% to 72%. The most likely reason is:

  • A) The TF-IDF vocabulary is too large
  • B) The model overfit to domain-specific vocabulary (e.g., "plot," "acting," "director" are irrelevant in product reviews)
  • C) Logistic regression cannot handle sentiment analysis
  • D) The product reviews are longer than the movie reviews

Answer: B) The model overfit to domain-specific vocabulary. Sentiment classifiers learn which words predict positive or negative sentiment in the training domain. "Plot," "acting," "cinematography" are strong features in movie reviews but irrelevant in product reviews. Conversely, "shipping," "durable," "defective" are important in product reviews but absent from movie training data. This is the domain transfer problem: NLP models trained on one domain often fail when applied to another. The fix is to train on in-domain data or use a domain-independent representation (like word embeddings, covered in Chapter 36).


Question 13 (Short Answer)

Explain what the max_df and min_df parameters in TfidfVectorizer control and why you would use them.

Answer: max_df sets an upper bound on document frequency: terms appearing in more than this fraction of documents are excluded. Setting max_df=0.90 removes words that appear in over 90% of documents --- these are effectively domain-specific stop words that carry little discriminative information. min_df sets a lower bound: terms appearing in fewer than this many documents are excluded. Setting min_df=3 removes typos, misspellings, and extremely rare terms that the model cannot learn from. Together, they focus the vocabulary on the "goldilocks zone" of terms that are common enough to be learnable but rare enough to be discriminative.


Question 14 (Multiple Choice)

In the StreamFlow case study, ticket topic probabilities were added as features to the churn model. The topic_prob_content feature had the second-highest coefficient after topic_prob_cancellation. The retention team should prioritize:

  • A) Cancellation tickets, because they have the highest churn coefficient
  • B) Content tickets, because they represent preventable churn
  • C) Billing tickets, because billing errors are easy to fix
  • D) All tickets equally, because every complaint matters

Answer: B) Content tickets, because they represent preventable churn. Cancellation tickets have the highest coefficient, but those subscribers have already decided to leave --- the intervention window is largely closed. Content tickets represent subscribers who are unhappy but have not yet decided to cancel. Proactive outreach (personalized content recommendations, upcoming release previews) has the highest probability of changing the outcome. This is the distinction between predicting churn and preventing churn.


Question 15 (Short Answer)

TF-IDF + logistic regression achieves 95% accuracy on a ticket classification task. A colleague builds a BERT-based model that achieves 97%. Under what circumstances would you recommend shipping the TF-IDF model instead?

Answer: Ship the TF-IDF model when: (1) the 2% accuracy gap does not translate to a meaningful business impact (e.g., the difference is 3 misrouted tickets per day out of 150); (2) the BERT model requires GPU infrastructure that adds cost and operational complexity; (3) interpretability matters (TF-IDF coefficients explain predictions; BERT does not); (4) latency requirements are strict (TF-IDF inference is sub-millisecond; BERT is 50-200ms); (5) the team does not have deep learning expertise to maintain the model in production. The best model is the one that ships, runs reliably, and meets the business need --- not the one with the highest number on a leaderboard.