Chapter 27 Quiz

Multiple Choice (1 point each)

1. Which of the following is the primary advantage of lemmatization over stemming?

a) It is faster to compute for large corpora
b) It returns linguistically valid base forms using a dictionary, producing more readable tokens
c) It is available only in NLTK and therefore more reproducible
d) It always produces shorter tokens than stemming

Answer: b — Lemmatization uses vocabulary and morphological analysis to return valid dictionary forms. Stemming is faster but may produce non-words (e.g., "studi" from "studying").
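The contrast can be sketched with a toy suffix-stripping stemmer and a toy dictionary lemmatizer. Both functions and the mini-lexicon below are invented for illustration; a real pipeline would use something like NLTK's PorterStemmer and WordNetLemmatizer.

```python
def toy_stem(word):
    """Crude Porter-style rules: strip -ing, turn a trailing y into i."""
    if word.endswith("ing"):
        word = word[:-3]
    if word.endswith("y"):
        word = word[:-1] + "i"
    return word

# Tiny stand-in for a real morphological dictionary.
TOY_LEXICON = {"studying": "study", "studies": "study", "better": "good"}

def toy_lemmatize(word):
    """Dictionary lookup returns a valid base form, or the word unchanged."""
    return TOY_LEXICON.get(word, word)

print(toy_stem("studying"))       # studi  (not a real word)
print(toy_lemmatize("studying"))  # study  (valid dictionary form)
```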


2. VADER's compound score ranges from -1 to +1. A score of -0.72 on a campaign speech most accurately means:

a) The speech was 72% negative in its political content
b) VADER's algorithm assigned a compound score of -0.72 based on word-level sentiment in its lexicon, indicating predominantly negative word choices
c) The speech had 72% more negative words than the average political speech
d) The candidate expressed disagreement with their opponent in 72% of sentences

Answer: b — The VADER score reflects word-level patterns relative to a sentiment lexicon. It does not directly measure political negativity in the substantive sense, and it is not a percentage.
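The bounded range comes from a normalization step: VADER's reference implementation maps the summed word valences into (-1, +1) via x / sqrt(x^2 + alpha), with alpha = 15. A minimal sketch of that step, showing why the compound score is a bounded index rather than a percentage:

```python
import math

def vader_normalize(valence_sum, alpha=15):
    """VADER-style normalization of a summed valence into (-1, 1).
    alpha=15 matches the constant in the reference implementation."""
    return valence_sum / math.sqrt(valence_sum ** 2 + alpha)

# A strongly negative sum of word valences maps toward -1, but the
# result is a bounded index, not "percent negative".
print(round(vader_normalize(-4.5), 2))
```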


3. In TF-IDF weighting, the "IDF" component gives higher weight to:

a) Terms that appear frequently across many documents in the corpus
b) Terms that appear frequently within a single document
c) Terms that appear rarely across the corpus — making them more distinctive
d) Terms that appear in the most recent documents

Answer: c — Inverse Document Frequency down-weights terms that appear in many documents (common words) and up-weights terms that appear in few documents (distinctive terms).
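A minimal sketch of smoothed IDF over a toy three-document corpus (the corpus is invented; the smoothing convention follows scikit-learn's default of log((1+n)/(1+df)) + 1):

```python
import math

# Each document reduced to its set of terms, for document frequency only.
docs = [
    {"economy", "freedom", "jobs"},
    {"economy", "healthcare"},
    {"economy", "border", "security"},
]
n = len(docs)

def idf(term):
    df = sum(term in d for d in docs)        # document frequency
    return math.log((1 + n) / (1 + df)) + 1  # smoothed IDF

print(idf("economy"))  # appears in all 3 docs -> lowest weight
print(idf("freedom"))  # appears in only 1 doc -> higher weight
```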


4. When Sam applies LDA to the ODA speech corpus and labels Topic 3 as "Healthcare," this label is:

a) Generated by the LDA algorithm based on the dominant semantic field
b) A human interpretive act applied to the algorithm's output — the top words that define the topic
c) Validated by the populism_score variable in the dataset
d) Defined by the n-gram preprocessing choice

Answer: b — LDA produces topics as distributions over words; human interpretation labels those topics. The label is not algorithmic — it is an analyst's judgment.


5. A corpus of campaign speeches shows that the word "freedom" appears with frequency 0.008 in Republican speeches and 0.003 in Democratic speeches. Which approach best characterizes "freedom" as a distinctively Republican term?

a) Comparing raw counts of "freedom" in each corpus
b) Comparing proportions (frequency per total words) — which is what these numbers already represent
c) Using only speeches where "freedom" appears at least five times
d) Running a topic model and checking whether "freedom" appears in Republican-coded topics

Answer: b — Proportions (frequency relative to total words) are the appropriate comparison unit because the two corpora likely have different total word counts. The numbers given are already proportions, making the comparison valid.
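The arithmetic can be sketched with invented totals chosen to reproduce the 0.008 and 0.003 figures, showing how raw counts can point the wrong way:

```python
# Invented counts and corpus sizes, chosen so the proportions match
# the question's 0.008 vs 0.003.
rep_count, rep_total = 400, 50_000   # "freedom" in a Republican corpus
dem_count, dem_total = 450, 150_000  # "freedom" in a Democratic corpus

rep_prop = rep_count / rep_total     # 0.008
dem_prop = dem_count / dem_total     # 0.003

print(dem_count > rep_count)  # True: raw counts suggest Democratic usage is higher
print(rep_prop > dem_prop)    # True: proportions show the real contrast
```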


6. The Flesch-Kincaid Grade Level formula produces higher scores (harder text) when:

a) Average sentence length is shorter and average word length is shorter
b) Average sentence length is longer and average word length is longer (more syllables)
c) The text contains more political jargon regardless of sentence or word length
d) The text was written for a broadcast rather than a print audience

Answer: b — Flesch-Kincaid Grade Level increases with longer sentences and longer words (measured in syllables). It does not directly account for content difficulty.
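The published Flesch-Kincaid Grade Level coefficients make both effects explicit. In this sketch the syllable counts are supplied by hand; a real pipeline would need a syllable counter, which is itself an approximation.

```python
def fk_grade(total_words, total_sentences, total_syllables):
    """Flesch-Kincaid Grade Level, standard published coefficients."""
    return (0.39 * (total_words / total_sentences)
            + 11.8 * (total_syllables / total_words)
            - 15.59)

# Longer sentences and more syllables per word both raise the grade.
easy = fk_grade(total_words=100, total_sentences=10, total_syllables=130)
hard = fk_grade(total_words=100, total_sentences=4, total_syllables=200)
print(round(easy, 1), round(hard, 1))
```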


7. Sam applies VADER to a speech that uses the phrase "our community has suffered enormously from years of neglect" in a call to action for veterans' services. VADER would likely score this excerpt as:

a) Highly positive, because the speech is advocating for a sympathetic cause
b) Neutral, because VADER balances the positive and negative words
c) Negative, because "suffered," "neglect," and similar words score as negative in VADER's lexicon
d) Unable to process, because VADER was not designed for political text

Answer: c — VADER scores at the word level. "Suffered" and "neglect" carry negative sentiment in VADER's lexicon, and "enormously" acts as a degree modifier that intensifies the negativity, regardless of the contextual meaning (expressing care for veterans). This is precisely the construct validity problem described in Section 27.12.
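A toy word-level scorer makes the mismatch concrete. The lexicon values below are invented for illustration, not VADER's actual ratings:

```python
# Invented valences standing in for a sentiment lexicon.
TOY_LEXICON = {"suffered": -2.1, "neglect": -1.9}

def word_level_score(text):
    """Sum per-word valences; words outside the lexicon score 0."""
    tokens = text.lower().replace(",", "").split()
    return sum(TOY_LEXICON.get(t, 0.0) for t in tokens)

excerpt = "our community has suffered enormously from years of neglect"
# Negative total, even though the sentence expresses care for veterans.
print(word_level_score(excerpt))  # -4.0
```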


8. In the partisan language classifier, cross-validated accuracy of 78% means:

a) The model correctly classifies Republican vs. Democratic speeches 78% of the time on held-out data, averaged across 5 folds
b) 78% of the words in Republican speeches are not present in Democratic speeches
c) The model was trained on 78% of the data
d) The classifier's precision for Republican speeches is 78%

Answer: a — Cross-validation accuracy is the mean proportion of correctly classified test instances across all folds of held-out data. It is not a precision measure for a single class.
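A sketch with scikit-learn's cross_val_score on synthetic data (all data invented) shows what the single reported number is: the mean of per-fold accuracies on held-out instances.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic data with a learnable signal in the first feature.
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))
y = (X[:, 0] + 0.5 * rng.normal(size=100) > 0).astype(int)

scores = cross_val_score(LogisticRegression(), X, y, cv=5, scoring="accuracy")
print(scores.round(2))         # accuracy on each held-out fold
print(scores.mean().round(2))  # the single "cross-validated accuracy"
```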


9. The "construct validity" problem in computational text analysis refers to:

a) The computational cost of processing large corpora
b) The degree to which the computational measure actually corresponds to the theoretical construct the analyst wants to measure
c) The problem of choosing between lemmatization and stemming
d) The accuracy of VADER's lexicon for political text

Answer: b — Construct validity is whether you are measuring what you think you are measuring. A VADER score is a proxy for sentiment; the construct validity question is whether VADER scores track the theoretically meaningful concept of political speech tone.


10. N-gram analysis in political communication research is particularly useful for:

a) Identifying individual word frequencies that cannot be obtained from standard word counts
b) Capturing multi-word expressions ("working families," "border security") that carry political meaning as phrases rather than individual words
c) Measuring the readability of text passages
d) Assigning topic labels without human interpretation

Answer: b — N-grams capture compound expressions that carry distinct meaning as phrases. Individual word frequency analysis would split "border security" into separate "border" and "security" counts, losing the compound meaning.
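A pure-Python sketch shows what unigram counting loses and bigram counting keeps (the sentence is invented):

```python
from collections import Counter

tokens = "we will strengthen border security and protect working families".split()

# Unigrams split the phrase; bigrams keep it as one unit.
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

print(unigrams["border"], unigrams["security"])  # separate counts: 1 1
print(bigrams[("border", "security")])           # phrase count: 1
```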


Short Answer (5 points each)

11. Explain in your own words why the illusory truth effect creates a specific challenge for text analysis pipelines that are used to track misinformation. How should an analyst account for this effect when designing a monitoring system?

Model Answer: The illusory truth effect means that repeated exposure to a false claim — even in a corrective or analytic context — increases its perceived truthfulness. A monitoring system that tracks the spread of a false claim must itself reference that claim, potentially contributing to its spread. Analysts should: (1) publish monitoring findings in ways that lead with corrections rather than the false claim itself, (2) be selective about which false claims to amplify through analysis, and (3) track correction penetration alongside claim spread to assess net information quality effects.


12. What is the "bag of words" assumption embedded in most word frequency analysis and VADER sentiment analysis, and what political communication phenomena does it fail to capture?

Model Answer: The bag-of-words assumption treats text as an unordered collection of words, ignoring word order and syntactic structure. It fails to capture: (a) negation ("not great" ≠ "great"), though VADER partially handles this; (b) sarcasm and irony, which depend on contextual knowledge; (c) pragmatic implication — what is meant but not said; (d) metaphor and figurative language; and (e) rhetorical structure, where word order and positioning carry meaning. In political communication, where implication and dog-whistle language are significant, bag-of-words approaches miss a substantial portion of politically meaningful content.
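The order-blindness is easy to demonstrate: two invented sentences with opposite meanings produce identical bags of words.

```python
from collections import Counter

# Same words, opposite meanings: who praised and who attacked is swapped.
a = "the senator praised the bill critics attacked"
b = "the senator attacked the bill critics praised"

print(Counter(a.split()) == Counter(b.split()))  # True: order is discarded
```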


True/False with Justification (3 points each)

13. True or False: LDA topic modeling requires the analyst to specify the number of topics in advance, which means the algorithm discovers topics "objectively" once k is chosen.

Answer: FALSE. The algorithm discovers word co-occurrence patterns objectively for a given k, but the choice of k is a human decision that strongly affects the results. Different values of k produce different topic structures, none of which is objectively "correct." The algorithm's output is also dependent on preprocessing choices, random initialization, and vocabulary size — all human decisions.


14. True or False: A speech with a VADER compound score of +0.65 is objectively more positive in its political message than a speech scoring +0.30.

Answer: FALSE. The VADER score reflects word-level sentiment patterns, not the substantive political positivity of the message. A speech could score +0.65 by using simple, positive-affect words while being substantively neutral or evasive, while a speech scoring +0.30 could contain deeply positive policy commitments expressed in complex, technical language that VADER scores less positively. The score is a measure of the specific linguistic patterns VADER is calibrated to detect, not a direct measure of political message positivity.


15. True or False: Setting a random seed (e.g., random_state=42) in sklearn's LDA and LogisticRegression ensures that running the code on a different computer will produce identical results.

Answer: FALSE. A random seed ensures reproducibility within a given software environment (same library versions, same hardware architecture). Different library versions or different hardware can produce different floating-point results even with the same seed. True reproducibility requires also documenting and fixing library versions (via a requirements.txt or conda environment.yml), which is why Sam's pipeline setup code includes library version documentation.
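A short sketch, assuming NumPy and scikit-learn are installed: the same seed reproduces draws within one environment, and recording library versions is what extends reproducibility across machines.

```python
import numpy as np
import sklearn

# Same seed, same environment: identical draws.
draws1 = np.random.default_rng(42).random(3)
draws2 = np.random.default_rng(42).random(3)
print(np.array_equal(draws1, draws2))  # True within this environment

# Cross-machine reproducibility also needs pinned versions; record them.
print("numpy", np.__version__, "| scikit-learn", sklearn.__version__)
```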