Chapter 35 Quiz: Natural Language Processing for Business Text
Instructions: Choose the best answer for each question. Answer key with explanations is at the end of this document.
Questions
1. Which of the following best describes what TextBlob's polarity score measures?
A) The number of positive and negative words in the text
B) A value from -1.0 to +1.0 indicating how negative or positive the text is
C) The percentage of sentences that express an opinion
D) The reading grade level of the text
2. You have a customer review that reads: "Not bad at all — actually quite good." TextBlob might score this with an unexpectedly low positive score. What is the most likely cause?
A) TextBlob only processes English text
B) The review contains spelling errors
C) TextBlob struggles with complex negation like "not bad"
D) The review is too short to score accurately
3. What is the primary purpose of the TfidfVectorizer from scikit-learn when used for keyword extraction?
A) To remove stopwords from text
B) To convert text into numerical values that weight distinctive words more highly than common words
C) To translate text between languages
D) To split text into sentences
4. You are building a support ticket classifier. You have 1,000 labeled tickets and 50,000 unlabeled tickets. Which approach is most appropriate?
A) Apply keyword rules only, because you already have labels
B) Train a machine learning classifier on the labeled data and use it to classify all 50,000 unlabeled tickets
C) Use LDA topic modeling on the 50,000 unlabeled tickets and ignore the labeled data
D) Apply TextBlob sentiment analysis to classify all tickets
5. Which spaCy entity label would be used for the string "March 15, 2024"?
A) TIME
B) GPE
C) DATE
D) EVENT
6. You want to find which two-word phrases appear most frequently in a collection of customer complaints. Which technique should you use?
A) Lemmatization
B) Bigram frequency analysis
C) Sentiment analysis
D) Named entity recognition
7. What does the subjectivity score from TextBlob measure?
A) How positive or negative the text is
B) Whether the text is factual/objective (0.0) or opinion-based (1.0)
C) The proportion of stopwords in the text
D) How many named entities appear in the text
8. Priya is analyzing support tickets and wants to identify the most urgent ones first. She calculates an urgency score. Which combination of factors is most appropriate?
A) Polarity score alone
B) Ticket length alone
C) Polarity score combined with how long the ticket has been waiting
D) Subjectivity score combined with the number of words in the ticket
9. LDA topic modeling requires you to specify which of the following before training?
A) The keywords for each topic
B) The number of topics
C) The sentiment score for each document
D) The language of the text
10. You remove stopwords before performing sentiment analysis with TextBlob. What problem might this cause?
A) The analysis will run slower
B) Words like "not" may be removed, flipping the meaning of negative phrases
C) TextBlob cannot process text without stopwords
D) Stopword removal has no effect on sentiment analysis
11. Which spaCy entity label captures company and organization names?
A) COMPANY
B) ORG
C) ENTITY
D) CORP
12. You are analyzing 500 survey responses and find that the top word in positive responses is "helpful." In negative responses, the top word is also "helpful" because people wrote "not helpful." What preprocessing technique would best separate these cases?
A) Stemming the word "helpful" to "help"
B) Using bigram analysis to capture "not helpful" as a distinct phrase
C) Removing "helpful" from the analysis entirely
D) Using LDA topic modeling
13. What is the main difference between stemming and lemmatization?
A) Stemming removes stopwords; lemmatization does not
B) Stemming uses rule-based suffix removal; lemmatization uses a dictionary to find the base form
C) Stemming works only on verbs; lemmatization works on all word types
D) Stemming is always more accurate than lemmatization
14. Maya wants to discover what themes appear in her client survey responses without defining the categories in advance. Which technique is most appropriate?
A) TextBlob sentiment analysis
B) Rule-based keyword classification
C) LDA topic modeling
D) Named entity recognition
15. You train a Naive Bayes text classifier and get 74% accuracy on your test set. Your manager asks if this is good enough to automatically route 100% of incoming support tickets without human review. What is the best response?
A) Yes, 74% is very high for NLP tasks
B) No, a 26% misclassification rate means over 1 in 4 tickets would be routed incorrectly with no human review
C) Accuracy is irrelevant for routing decisions
D) Yes, as long as you have more than 1,000 tickets in your training set
16. The ngram_range=(1, 2) parameter in TfidfVectorizer means:
A) Only bigrams (2-word phrases) will be analyzed
B) The text will be analyzed at both the word level and the bigram level
C) Only words with at least 2 characters will be included
D) The model will look back 2 sentences for context
17. Which of the following is a practical limitation of classical NLP approaches like TextBlob?
A) They cannot process text longer than 1,000 words
B) They cannot handle sarcasm reliably
C) They require labeled training data for sentiment analysis
D) They only work with English text from social media
18. You have 10,000 customer reviews and want to find all mentions of competitor company names automatically. Which technique is most directly applicable?
A) Sentiment analysis
B) LDA topic modeling
C) Named entity recognition with ORG entity extraction
D) TF-IDF keyword extraction
19. The min_df=3 parameter in TfidfVectorizer means:
A) Words must have a minimum length of 3 characters
B) Words must appear in at least 3 documents to be included
C) The model will use the 3 most frequent words per document
D) Words appearing in more than 3% of documents are excluded
20. Priya discovers that 34.4% of shipping delay tickets arrive on Monday. What is the most analytically sound interpretation of this finding?
A) Monday shipping is worse than other days
B) Customers shop on weekends and contact support on Monday when tracking shows no weekend updates
C) The support team is understaffed on Mondays
D) Shipping delays only happen on weekends
Answer Key
1. B — TextBlob's polarity is a float from -1.0 (most negative) to +1.0 (most positive), with 0 being neutral. It is not a count of words (A), not a percentage (C), and not a readability measure (D).
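For reference, the lexicon idea behind polarity can be sketched in a few lines of plain Python. This is an illustration of the concept only, not TextBlob's actual implementation, and the word scores below are invented:

```python
# Toy lexicon-based polarity: average the scores of known words.
# The lexicon values are invented for illustration; TextBlob uses
# its own much larger pattern lexicon.
LEXICON = {"great": 0.8, "good": 0.7, "terrible": -1.0, "slow": -0.3}

def polarity(text: str) -> float:
    """Return a score in [-1.0, +1.0]; 0.0 if no lexicon words are found."""
    scores = [LEXICON[w] for w in text.lower().split() if w in LEXICON]
    return sum(scores) / len(scores) if scores else 0.0

print(polarity("great service, good price"))   # positive
print(polarity("terrible and slow delivery"))  # negative
```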
2. C — TextBlob uses a lexicon-based approach that scores individual words and applies only simple negation handling. It does process "not bad," but the result (slightly positive) is usually weaker than the clearly positive meaning a human reads, and qualified phrasing like "actually quite good" adds further noise. Answer (D) is also partially true, since very short texts have high score variance, but (C) is the primary cause here.
3. B — TF-IDF (Term Frequency-Inverse Document Frequency) converts text to numerical vectors where words that are frequent in a specific document but rare across the collection get higher scores. This weights distinctive terms more heavily than common ones. Stopword removal is a separate step (A); TfidfVectorizer can perform it but that is not its primary purpose.
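The weighting can be sketched by hand. This simplified formula, tf × log(N/df), omits the smoothing and L2 normalization that scikit-learn's TfidfVectorizer applies on top of it, and the documents are invented examples:

```python
import math

# Simplified TF-IDF: tf(word, doc) * log(N / df(word)).
docs = [
    "refund request for damaged item",
    "refund request for late delivery",
    "password reset request",
]

def tfidf(word: str, doc: str) -> float:
    tf = doc.split().count(word) / len(doc.split())
    df = sum(1 for d in docs if word in d.split())
    return tf * math.log(len(docs) / df)

# "request" appears in every document, so its IDF is log(3/3) = 0.
print(tfidf("request", docs[0]))   # 0.0 — common word carries no weight
print(tfidf("password", docs[2]))  # > 0 — distinctive word scores highly
```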
4. B — Labeled data should be used to train a classifier. Using keyword rules ignores the advantage of having labels. LDA operates on unlabeled data but does not use the labels you already have. Sentiment analysis categorizes positive/negative — not support topics.
5. C — spaCy labels calendar dates as DATE. TIME is used for specific times of day like "3:00 PM." GPE is for geographic locations. EVENT is for named events.
6. B — Bigrams are two-word sequences. Extracting and counting bigrams using NLTK's ngrams function or TfidfVectorizer's ngram_range parameter is exactly the technique for finding frequent two-word phrases.
7. B — Subjectivity measures how opinion-based vs. factual the text is, from 0.0 (purely objective/factual) to 1.0 (highly subjective/opinionated). It is independent of sentiment direction.
8. C — Urgency in a support context combines emotional intensity (captured by negative polarity) with time pressure (how long the customer has been waiting). Neither factor alone captures urgency fully.
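One possible way to combine the two factors. The weights, the 48-hour cap, and the example tickets below are all invented for illustration:

```python
# Illustrative urgency score: more negative sentiment and a longer wait
# both increase urgency. The weights are arbitrary, for demonstration only.
def urgency(polarity: float, hours_waiting: float) -> float:
    sentiment_pressure = max(0.0, -polarity)        # only negative polarity counts
    time_pressure = min(hours_waiting / 48.0, 1.0)  # cap at two days
    return 0.6 * sentiment_pressure + 0.4 * time_pressure

tickets = [("angry, broken item", -0.8, 30), ("mild question", 0.1, 2)]
ranked = sorted(tickets, key=lambda t: urgency(t[1], t[2]), reverse=True)
print(ranked[0][0])  # the angry, long-waiting ticket ranks first
```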
9. B — LDA is an unsupervised algorithm that discovers topics without labeled data, but you must specify the number of topics (num_topics) before training. It does not require predefined keywords, sentiment scores, or language specification (though language affects preprocessing).
10. B — Standard English stopword lists include words like "not," "never," "no," "don't." Removing these can flip the meaning of negative phrases: "not satisfied" becomes "satisfied," "no good" becomes "good." For sentiment analysis, keep stopwords or use a sentiment library that handles negation (like TextBlob does to some degree on unprocessed text).
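A toy demonstration of the flip. The one-word lexicon and the crude previous-word negation rule are invented for illustration:

```python
LEXICON = {"satisfied": 0.6}                 # invented score
STOPWORDS = {"not", "the", "is", "i", "am"}  # "not" appears in many default lists

def lexicon_score(tokens):
    # Crude scorer: flip the sign of any word preceded by "not".
    total = 0.0
    for i, t in enumerate(tokens):
        s = LEXICON.get(t, 0.0)
        if i > 0 and tokens[i - 1] == "not":
            s = -s
        total += s
    return total

raw = "i am not satisfied".split()
cleaned = [t for t in raw if t not in STOPWORDS]

print(lexicon_score(raw))      # negative: negation intact
print(lexicon_score(cleaned))  # positive: "not" was stripped as a stopword
```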
11. B — spaCy uses ORG for companies, agencies, institutions, and similar organizations. COMPANY is not a label in spaCy's default English models.
12. B — Bigram analysis would capture "not helpful" as a single distinct phrase, distinguishing it from "very helpful" or just "helpful." This is a key advantage of n-gram analysis over single-word frequency analysis.
13. B — Stemming applies algorithmic suffix-stripping rules (fast but crude: a rule-based stemmer cannot map the irregular form "ran" to "run"). Lemmatization uses a dictionary (vocabulary) and morphological analysis to find the actual base form, handling irregular forms correctly. Both work on all word types with appropriate POS tagging, not just verbs (C is false), and lemmatization is generally more accurate, not less (D is false).
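The contrast can be sketched with a toy stemmer and a toy lemma dictionary. Both are invented; real tools such as the Porter stemmer and the WordNet lemmatizer are far more thorough:

```python
# Rule-based suffix stripping vs. dictionary lookup.
def crude_stem(word):
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

LEMMA_DICT = {"ran": "run", "running": "run", "better": "good", "mice": "mouse"}

def lemmatize(word):
    return LEMMA_DICT.get(word, word)

print(crude_stem("running"))  # "runn" — the rule leaves an artifact
print(crude_stem("ran"))      # "ran" — irregular form untouched
print(lemmatize("ran"))       # "run" — dictionary handles the irregular form
```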
14. C — LDA topic modeling discovers latent thematic structures in a collection of documents without predefined categories. Keyword classification requires predefined categories and their keywords. NER extracts specific named entities. Sentiment analysis scores emotional direction — it does not discover topics.
15. B — 74% accuracy means 26% of tickets are misclassified. For a fully automated system handling thousands of tickets daily, that misclassification rate would create significant problems. The appropriate recommendation is to use the classifier as a routing aid with human review for low-confidence predictions, not as a fully autonomous system.
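A common middle ground is confidence-threshold routing: auto-route only predictions the classifier is sure about. The 0.9 threshold and the probabilities below are invented examples of the kind of output a predict_proba-style classifier produces:

```python
THRESHOLD = 0.9  # arbitrary cutoff, for illustration

predictions = [
    ("ticket-1", "billing", 0.97),
    ("ticket-2", "shipping", 0.62),
    ("ticket-3", "returns", 0.91),
]

auto_routed = [t for t, _, p in predictions if p >= THRESHOLD]
needs_review = [t for t, _, p in predictions if p < THRESHOLD]

print(auto_routed)   # confident predictions routed automatically
print(needs_review)  # low-confidence tickets go to a human
```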
16. B — ngram_range=(1, 2) tells TfidfVectorizer to analyze text at both the unigram (single word) level AND the bigram (2-word phrase) level simultaneously. This is more informative than either level alone.
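What that setting produces can be sketched by hand (a simplified view of the vectorizer's feature extraction; real tokenization details differ):

```python
# ngram_range=(1, 2) means each document yields unigrams plus bigrams.
def ngrams_1_2(text):
    tokens = text.split()
    unigrams = tokens
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return unigrams + bigrams

print(ngrams_1_2("not helpful at all"))
# ['not', 'helpful', 'at', 'all', 'not helpful', 'helpful at', 'at all']
```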
17. B — Sarcasm like "Oh, wonderful, another delay" contains positive words that lexicon-based systems misclassify as positive. TextBlob does not require labeled data for sentiment (C is false) — it uses a pre-built lexicon. It can process long text (A is false). It is designed for English generally, not just social media (D is false).
18. C — Named entity recognition with the ORG entity type extracts company and organization names from text. This is precisely the use case NER is designed for. TF-IDF would surface frequent terms but cannot reliably distinguish between "Apple" the company and "apple" the fruit without entity classification.
19. B — min_df stands for "minimum document frequency." Setting it to 3 means a word must appear in at least 3 documents to be included in the vocabulary. This filters out rare words that are unlikely to generalize. max_df is the parameter for excluding words that appear in too many documents (D describes max_df).
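The filter can be sketched with a document-frequency count (the documents below are invented):

```python
from collections import Counter

# min_df=3 keeps only words that appear in at least 3 documents.
# Document frequency counts documents containing the word, not total uses.
docs = [
    "refund late order",
    "refund late shipping",
    "refund broken item",
    "password reset",
]

df = Counter()
for d in docs:
    df.update(set(d.split()))  # set(): count each word once per document

vocabulary = sorted(w for w, n in df.items() if n >= 3)
print(vocabulary)  # only "refund" appears in 3 or more documents
```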
20. B — The data supports the interpretation that the Monday spike is a weekend order processing artifact: customers place orders Thursday-Sunday, carriers have reduced Sunday service, and customers who see no movement in tracking by Monday morning contact support. This is an operational insight about order fulfillment timing, not about Monday being inherently worse (A), staffing levels (C, which requires separate data), or delays only happening on weekends (D, which is not supported by the data).