Chapter 27 Key Takeaways

Foundations and Setup

1. Computational text analysis scales beyond what manual coding can reach. Analyzing thousands of speeches, millions of social media posts, or decades of congressional records for patterns is impractical with human coders alone. Computational methods make it tractable — but they introduce their own constraints and assumptions.

2. Environment setup and reproducibility are methodological obligations, not optional niceties. Documenting library versions, setting random seeds, and logging preprocessing decisions are required for results that can be reproduced, shared, and evaluated. In civic data contexts, reproducibility is an ethical obligation.
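
A minimal sketch of what "reproducibility as obligation" looks like at the top of an analysis script: fix the random seed before any stochastic step and log a provenance record alongside results. The `reproducibility_header` function is illustrative, not a standard API.

```python
# Minimal reproducibility header for an analysis script (illustrative sketch).
import json
import random
import sys

def reproducibility_header(seed=42):
    """Seed the RNG and return a provenance record to log with results."""
    random.seed(seed)
    return {
        "python": sys.version.split()[0],
        "seed": seed,
        # In a real pipeline, also record e.g. nltk.__version__,
        # sklearn.__version__, and every preprocessing decision made.
    }

record = reproducibility_header(seed=42)
first_draw = random.random()

random.seed(42)                          # re-seeding reproduces the draw
assert random.random() == first_draw
print(json.dumps(record))
```

Logging the record next to every output file means a reader can re-run the pipeline under the same seed and versions and evaluate the results on equal footing.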

3. Preprocessing choices are methodological decisions. The choice between lemmatization and stemming, the composition of the stopword list, the minimum document frequency threshold for the vocabulary — each of these affects results. They must be documented, justified, and reported.


Text Preprocessing

4. Lemmatization preserves readable, linguistically valid tokens; stemming is faster but less precise. For political text analysis where interpretability matters, lemmatization is generally preferred. For high-speed classification pipelines where human interpretation of features is not the goal, stemming may be acceptable.
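
The trade-off can be shown with a toy contrast. Both the suffix rules and the lemma table below are illustrative stand-ins for real tools such as NLTK's `PorterStemmer` and `WordNetLemmatizer`; the point is the shape of the output, not the algorithms.

```python
# Toy contrast: stemming (rule-based suffix stripping) vs. lemmatization
# (dictionary lookup to a valid word). Both are simplified stand-ins.

def toy_stem(word):
    """Crude suffix stripping: fast, but can emit non-words."""
    for suffix, repl in (("ies", "i"), ("ing", ""), ("s", "")):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + repl
    return word

# Hypothetical lemma table; real lemmatizers consult a full dictionary.
TOY_LEMMAS = {"policies": "policy", "voters": "voter", "running": "run"}

def toy_lemmatize(word):
    """Dictionary lookup: slower in practice, but returns real words."""
    return TOY_LEMMAS.get(word, word)

print(toy_stem("policies"))       # "polici" -- consistent but not a word
print(toy_lemmatize("policies"))  # "policy" -- readable and valid
```

When features will be read by humans — distinctive-word tables, topic top-words — "policy" is interpretable where "polici" is not.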

5. Domain-specific stopwords improve signal-to-noise in political text analysis. NLTK's default English stopword list filters out common function words but leaves in common political speech fillers ("would," "going," "well"). Adding a political stopword list improves the distinctiveness of word frequency analysis for political content.
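
A sketch of layering a domain list on top of a generic one. `GENERIC_STOPWORDS` is a small stand-in for `nltk.corpus.stopwords.words("english")`; the political fillers follow the examples in the text.

```python
# Layering a domain stopword list on top of a generic one (sketch).
GENERIC_STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is",
                     "that", "we", "are", "for", "this"}
POLITICAL_STOPWORDS = {"would", "going", "well", "know", "think", "say"}
STOPWORDS = GENERIC_STOPWORDS | POLITICAL_STOPWORDS

def content_tokens(tokens):
    """Drop both generic function words and political speech fillers."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

speech = "well we are going to fight for the working families of this state"
print(content_tokens(speech.split()))
```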

6. Always inspect your data before analyzing it. Missing values, inconsistent name formatting, outlier word counts, and date parsing issues are common in political datasets. Failing to document and handle these produces misleading results that may not be obvious in output.


Word Frequency and N-gram Analysis

7. Proportions, not raw counts, are the appropriate comparison unit for corpora of different sizes. Democratic and Republican speech corpora contain different total word counts. Comparing raw frequencies is comparing apples to oranges. Always normalize by total words (or document count) before comparison.
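
The normalization step is one line; the word-count dicts below are made-up illustrations, not ODA figures.

```python
# Normalize to proportions before comparing corpora of different sizes.
from collections import Counter

dem = Counter({"community": 120, "freedom": 40, "healthcare": 90})  # 250 words
gop = Counter({"community": 60, "freedom": 150, "healthcare": 30})  # 240 words

def proportions(counts):
    """Convert raw counts to shares of the corpus total."""
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

dem_p, gop_p = proportions(dem), proportions(gop)
# Raw counts (120 vs 60) exaggerate the gap whenever one corpus is larger;
# proportions give the size-adjusted comparison.
print(f'{dem_p["community"]:.3f} vs {gop_p["community"]:.3f}')
```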

8. Distinctive word analysis reveals partisan vocabulary differences that are consistent with political communication theory. Computational analysis of the ODA corpus reproduced well-established findings about partisan vocabulary — Democratic language emphasizing community and collective action, Republican language emphasizing freedom and security — demonstrating the method's validity against known benchmarks.

9. N-gram analysis captures compound political expressions that individual word analysis misses. "Working families," "border security," "Medicare for All," "law and order" — these phrases carry political meaning as units. Bigram and trigram analysis surfaces these phrase-level patterns.
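
Bigram extraction needs only a sliding window and a counter; in practice you would run it over tokenized, stopword-filtered speeches. The example sentence is invented.

```python
# Minimal n-gram extraction with collections.Counter (sketch).
from collections import Counter

def ngrams(tokens, n=2):
    """Return all contiguous n-token phrases as strings."""
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

tokens = ("working families deserve border security and working "
          "families deserve relief").split()
bigram_counts = Counter(ngrams(tokens, 2))
print(bigram_counts.most_common(2))
```

Unigram counts would score "working," "families," "border," and "security" separately; the bigram counter is what surfaces "working families" and "border security" as units.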


Sentiment Analysis

10. VADER is calibrated for informal registers and handles capitalization, punctuation, and negation. VADER's design for social media text makes it more appropriate for political speech than models calibrated on formal writing. But its lexicon is general-purpose and may mis-score political terminology.

11. VADER scores at the word level — contextual meaning and sarcasm are not captured. Words like "suffered," "sacrifice," and "neglect" score as negative in VADER's lexicon regardless of whether they appear in contexts of admiration for veterans or genuine criticism. Validate VADER scores against human coding for your specific corpus before making strong claims.
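
The context-blindness is easy to demonstrate with a toy word-level scorer. The lexicon values below are illustrative, not VADER's actual weights, and the scorer omits VADER's capitalization, punctuation, and negation handling.

```python
# Toy lexicon scorer showing why word-level sentiment misses context:
# "sacrifice" contributes negatively whether the sentence honors veterans
# or attacks an opponent. Lexicon values are invented illustrations.
TOY_LEXICON = {"sacrifice": -1.2, "suffered": -1.8, "honor": 2.0, "neglect": -1.9}

def toy_score(text):
    """Sum per-word lexicon values; context plays no role."""
    words = text.lower().replace(".", "").split()
    return sum(TOY_LEXICON.get(w, 0.0) for w in words)

praise = "We honor the sacrifice of our veterans."
attack = "Their neglect means veterans suffered."
print(toy_score(praise), toy_score(attack))
# "sacrifice" drags the admiring sentence toward negative
# exactly as it would in a genuine criticism.
```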

12. Sentiment trends over time reveal campaign-cycle dynamics. Both parties' speeches in the ODA corpus showed declining compound sentiment scores approaching Election Day — computational confirmation of the well-documented pattern of increasing campaign negativity in the final pre-election weeks.


Readability Analysis

13. Readability scores are politically meaningful signals, not just linguistic metrics. Lower Flesch-Kincaid scores (simpler language) correlate with populist communication scores in the ODA corpus. Plain speaking functions as a political brand signal — accessibility as anti-elitism — not merely a stylistic choice.
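
It helps to see what the score is actually made of. The implementation below uses the published Flesch-Kincaid grade formula with a crude vowel-group syllable counter; real tools (e.g. the `textstat` package) use better syllable heuristics, so treat this as a sketch.

```python
# Flesch-Kincaid grade level from its published formula (sketch):
# 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
import re

def count_syllables(word):
    """Crude heuristic: count groups of consecutive vowels."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / sentences)
            + 11.8 * (syllables / len(words)) - 15.59)

plain = "We will fight for you. We will win."
wonky = ("Comprehensive legislative modernization necessitates "
         "deliberative interagency coordination.")
print(f"{fk_grade(plain):.1f} vs {fk_grade(wonky):.1f}")
```

The formula makes the political interpretation concrete: short sentences of short words drive the grade level down, which is precisely the "plain speaking" register the populism correlation picks up.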

14. Incumbents use more complex language than challengers on average. A 1.1 grade level difference between incumbents and challengers in the ODA corpus is consistent with incumbents having deeper policy records to defend and more policy-specific vocabulary.

15. Measurement overlap between scales must be checked. When populism scores and readability scores both reflect sentence-structure features, their correlation partly reflects measurement overlap rather than a purely theoretical relationship. Always examine the specific features underlying your measures.


Topic Modeling

16. LDA requires k to be set in advance — k is a human decision, not an algorithmic discovery. Different values of k produce different topic structures, none of which is objectively correct. Evaluate multiple values using both quantitative measures (perplexity) and qualitative coherence assessment.
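
A sketch of the k-comparison loop using scikit-learn, assuming it is available. Perplexity gives the quantitative signal; the top words per topic still need the qualitative coherence check. The corpus here is a toy, not ODA data.

```python
# Comparing candidate k values for LDA: perplexity per k (sketch).
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "border security immigration enforcement",
    "healthcare medicare coverage insurance",
    "border enforcement immigration wall",
    "medicare healthcare insurance costs",
] * 5  # repeat so the toy corpus has some bulk

X = CountVectorizer().fit_transform(docs)
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    # Lower perplexity is better, but a human still has to judge whether
    # the topics at that k are coherent and distinct.
    print(k, round(lda.perplexity(X), 1))
```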

17. Topic labels are human interpretive acts, not algorithmic outputs. LDA produces topics as distributions over words. The labels analysts assign to those distributions reflect their interpretive judgment. Report top words alongside labels and let readers evaluate your interpretation.

18. More aggressive preprocessing (higher min_df, smaller vocabulary) typically produces more coherent topics. Rare words and highly specific terminology tend to fracture topics in LDA. The vocabulary size and frequency threshold choices directly affect topic quality.
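
The effect of `min_df` is visible on even a tiny example, assuming scikit-learn's `CountVectorizer`: raising the minimum document frequency drops the rare terms that tend to fracture topics.

```python
# min_df in action: higher thresholds trim rare vocabulary (toy documents).
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "tax policy reform plan",
    "tax policy budget plan",
    "tax policy reform idiosyncratic",
]
loose = CountVectorizer(min_df=1).fit(docs)   # keep everything
strict = CountVectorizer(min_df=2).fit(docs)  # require >= 2 documents
# "budget" and "idiosyncratic" each appear in only one document,
# so the stricter vocabulary drops them.
print(len(loose.vocabulary_), "->", len(strict.vocabulary_))
```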


Classification and Media Framing

19. TF-IDF weighting makes classifiers sensitive to distinctive vocabulary, not just common words. TF-IDF suppresses common words (appearing in most documents) and amplifies rare but distinctive words — exactly the feature pattern that distinguishes partisan language.

20. Classifier accuracy must be evaluated by cross-validation, not training set performance. A classifier that achieves 95% accuracy on training data but 60% on held-out test data is overfitted and useless. Report cross-validated accuracy. The ODA partisan classifier achieved 78% — meaningful but not definitive.
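
The evaluation discipline in code, assuming scikit-learn: report the cross-validated mean, never the training-set score. The texts and labels below are invented illustrations, far too clean to show the train/test gap a real corpus would.

```python
# Cross-validated accuracy vs. training-set accuracy for a toy
# partisan-phrase classifier (invented data; sketch only).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

texts = [
    "working families deserve healthcare", "community investment and equity",
    "border security and law enforcement", "freedom from government overreach",
] * 10
labels = ["D", "D", "R", "R"] * 10

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
train_acc = clf.fit(texts, labels).score(texts, labels)  # optimistic
cv_acc = cross_val_score(clf, texts, labels, cv=5).mean()  # report this one
print(round(train_acc, 2), round(cv_acc, 2))
```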

21. Hyperpartisan media outlets show the widest sentiment variance — highly positive about their candidate, highly negative about the opponent. The ODA media analysis found systematic sentiment differences by source type, with local TV and print showing near-neutral mean sentiment and hyperpartisan outlets showing extreme variance. This is computational confirmation of expected structural differences in the information environment.


Interpretation and Epistemic Humility

22. Computational measures are proxies for theoretical constructs — they are not the constructs themselves. A VADER score is not "speech negativity" — it is a word-pattern measurement correlated with negativity. A readability score is not "cognitive accessibility" — it is a formula. The construct validity question — does the measure actually track the theoretical concept? — must always be addressed.

23. Statistical significance is not the same as interpretive certainty. A p < .001 finding may have multiple viable interpretations, confounds, or measurement artifact explanations. Report significant findings as invitations to interpret, not conclusions, and address competing interpretations explicitly.

24. The "what we cannot say" section is as important as the findings. Sam's epistemic humility slide — leading every presentation with explicit limitations — is a professional and ethical obligation in civic data contexts. Computational analysis that overstates its conclusions can mislead policy, misrepresent candidates, and undermine public trust.

25. Good computational text analysis generates hypotheses for follow-up, not final answers. The most intellectually honest framing: computational analysis reveals patterns; understanding those patterns requires additional evidence, theoretical interpretation, and often qualitative work that the computational analysis cannot provide.