
Learning Objectives

  • Apply text preprocessing techniques including tokenization, stopword removal, stemming, and lemmatization
  • Perform word frequency analysis and generate interpretable visualizations
  • Conduct rule-based sentiment analysis using VADER on political speech data
  • Calculate readability metrics (Flesch-Kincaid, Gunning Fog) for political communication
  • Conduct basic topic modeling using LDA with sklearn
  • Analyze media framing through sentiment by source type
  • Interpret computational text analysis results with appropriate epistemic humility

Chapter 27: Analyzing Political Text and Media (Python Lab)

Opening: Sam Builds a Pipeline

Sam Harding sat at their desk in ODA's cramped but well-lit office at 7:30 a.m. on the Monday after the Garza-Whitfield fact-check — the one that had reached a fifth of the people who saw the original false claim. Adaeze had asked for a deeper analysis of the information environment around the race. Not just "what was said" but "how it was said" — the language patterns in Garza's and Whitfield's speeches, the sentiment arcs in media coverage, whether specific frames showed up in sympathetic outlets before they showed up in the campaigns' own language.

Sam had pulled two datasets: oda_speeches.csv (political speeches by candidates in competitive races across six states) and oda_media.csv (media coverage articles about those races). The speech dataset included text excerpts, metadata, and a populism_score variable generated by a validated populist communication scale. The media dataset included article excerpts, source type, sentiment scores generated by a third-party vendor, and fact-check ratings.

"You want me to build you a text analysis pipeline," Sam told Adaeze. "I'm going to build you a text analysis pipeline."

This chapter follows Sam's work step by step. Along the way, you will build the same pipeline — understanding not just the code but the choices embedded in every methodological decision, and the gap between what computational text analysis can tell us and what we actually want to know.


27.1 Introduction: Why Computational Text Analysis for Political Science?

Political life generates enormous quantities of text. Speeches, press releases, debate transcripts, legislative records, social media posts, news articles, op-eds, campaign mailers — the verbal output of democratic politics is, in a real sense, its primary substance. For most of the discipline's history, analyzing this text required manual reading and coding, which limited the scale of analysis and introduced well-documented problems of intercoder reliability.

Computational text analysis — applying algorithmic methods to process and analyze large quantities of text — has fundamentally changed what is possible. Researchers can now analyze thousands of speeches, millions of social media posts, or decades of congressional records in the time it previously took to analyze a handful of documents. This scale creates new possibilities and new pitfalls.

27.1.1 What Computational Text Analysis Can and Cannot Do

Computational methods are excellent at:

  • Pattern detection at scale: Identifying which words, phrases, and topics appear more frequently in one corpus than another, across thousands of documents
  • Consistency: Applying the same measurement procedure to all documents without fatigue, mood, or the selective attention that affects human coders
  • Replicability: Providing explicit, documentable procedures that other researchers can apply to the same data and expect similar results
  • Exploratory analysis: Surfacing patterns in data that human analysis would miss, generating hypotheses for subsequent investigation

Computational methods are poor at:

  • Contextual meaning: Understanding that "raking in cash" means financial gain and is pejorative, while "exceptional fundraising quarter" refers to the same event approvingly, without extensive annotation or fine-tuning
  • Sarcasm, irony, and subtext: The claim "what a great policy" in a speech critical of the policy is not detected as negative by standard sentiment algorithms
  • Causal inference: Finding that two variables co-vary in text data does not establish causal direction
  • Normative judgment: Determining whether populist rhetoric is good or bad, effective or counterproductive, authentic or performative

⚠️ Common Pitfall: Treating the Model as the Phenomenon The most pervasive error in computational text analysis is confusing the model's output with the underlying social phenomenon. A sentiment score is not "how negative the speech was" — it is "how the VADER algorithm categorized word patterns in the speech." The map is not the territory. Sam learned this early in their career when a naive sentiment analysis flagged a speech about veterans' benefits as "highly negative" because of words like "suffered," "wounded," and "sacrifice" — language that in context was deeply positive and admiring.

27.1.2 The ODA Datasets

This chapter works with two datasets that Sam has prepared. Understanding their structure before analyzing them is essential practice.

oda_speeches.csv contains 847 speech records from competitive congressional races across six states, spanning two election cycles. Key variables:

| Variable | Type | Description |
| --- | --- | --- |
| speech_id | string | Unique identifier |
| date | date | Speech date |
| speaker | string | Candidate name |
| party | string | D / R / I |
| office | string | Office sought |
| event_type | string | rally / debate / townhall / press_conference |
| state | string | State abbreviation |
| word_count | integer | Total words in speech |
| text_excerpt | string | First 500 words |
| full_text | string | Complete speech text |
| populism_score | float | 0–1 scale, validated populist language measure |

oda_media.csv contains 2,341 article records about the same races. Key variables:

| Variable | Type | Description |
| --- | --- | --- |
| article_id | string | Unique identifier |
| date | date | Publication date |
| source | string | Outlet name |
| source_type | string | local_tv / local_print / national / hyperpartisan / social_aggregator |
| state | string | State abbreviation |
| topic | string | Primary topic (economy / crime / immigration / healthcare / other) |
| headline | string | Article headline |
| excerpt | string | First 300 words |
| sentiment_score | float | -1 to 1 (negative to positive), vendor-generated |
| candidate_mentions | string | Candidates mentioned (comma-separated) |
| factcheck_rating | string | true / mostly_true / mixed / mostly_false / false / unrated |

27.2 Setting Up the Environment

Before writing any analysis code, Sam sets up a reproducible environment. Good practice in computational research means documenting your environment so results can be reproduced.

# Environment requirements
# pip install nltk pandas matplotlib scikit-learn textstat wordcloud

import nltk
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.ticker as ticker
from collections import Counter
import re
import string
import warnings
warnings.filterwarnings('ignore')

# Download required NLTK data
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # required by newer NLTK releases
nltk.download('stopwords', quiet=True)
nltk.download('wordnet', quiet=True)
nltk.download('averaged_perceptron_tagger', quiet=True)
nltk.download('vader_lexicon', quiet=True)

# Set consistent plot style
plt.style.use('seaborn-v0_8-whitegrid')
FIGSIZE_STANDARD = (10, 6)
FIGSIZE_WIDE = (14, 6)

# Create the directory where figures will be saved
import os
os.makedirs('output', exist_ok=True)

💡 Intuition: Why Environment Setup Matters When Sam runs this pipeline three months from now, or shares it with a colleague, the analysis needs to produce the same results. Documenting library versions, setting random seeds for any stochastic operations, and creating reproducible data loading procedures are the difference between a one-time analysis and a reusable analytical tool. In civic data work — where findings may be reported publicly — reproducibility is an ethical obligation.
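One concrete way to act on this is to log the interpreter and package versions at the start of every run. A minimal standard-library sketch (the environment_report helper is illustrative, not part of the ODA pipeline; the package list matches this chapter's installs):

```python
import platform
from importlib import metadata


def environment_report(packages):
    """Return interpreter info and installed versions for the given packages."""
    report = {
        'python': platform.python_version(),
        'platform': platform.platform(),
    }
    for pkg in packages:
        try:
            report[pkg] = metadata.version(pkg)
        except metadata.PackageNotFoundError:
            report[pkg] = 'not installed'
    return report


# Packages used by this chapter's pipeline
for name, version in environment_report(
        ['nltk', 'pandas', 'matplotlib', 'scikit-learn', 'textstat']).items():
    print(f"{name:15s} {version}")
```

Printing (or saving) this report alongside each analysis output makes it possible to diagnose version-drift discrepancies months later.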


27.3 Text Preprocessing: Building the Foundation

27.3.1 Loading and Inspecting the Data

# Load the ODA datasets
speeches = pd.read_csv('oda_speeches.csv', parse_dates=['date'])
media = pd.read_csv('oda_media.csv', parse_dates=['date'])

print(f"Speeches dataset: {speeches.shape[0]} records, {speeches.shape[1]} columns")
print(f"Date range: {speeches['date'].min().date()} to {speeches['date'].max().date()}")
print(f"\nParty breakdown:")
print(speeches['party'].value_counts())
print(f"\nEvent type breakdown:")
print(speeches['event_type'].value_counts())

Good exploratory practice involves understanding missing data before proceeding:

# Check for missing values
print("\nMissing values in speeches:")
print(speeches.isnull().sum())

# Check for suspicious values
print(f"\nWord count statistics:")
print(speeches['word_count'].describe())

# Flag speeches shorter than 100 words as potentially incomplete
short_speeches = speeches[speeches['word_count'] < 100]
print(f"\nSpeeches under 100 words: {len(short_speeches)}")

27.3.2 The Preprocessing Pipeline

Text preprocessing transforms raw text into a form suitable for analysis. Each step involves choices that affect downstream results.

from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer

# Initialize tools
lemmatizer = WordNetLemmatizer()
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

# Political speech-specific stopwords to add
# These are common in political speech but carry little analytical meaning
political_stopwords = {
    'would', 'will', 'going', 'must', 'need', 'want', 'think',
    'know', 'come', 'get', 'make', 'take', 'give', 'see', 'let',
    'us', 'one', 'two', 'also', 'well', 'still', 'even', 'back',
    'every', 'many', 'much', 'said', 'say', 'today', 'now', 'year'
}
stop_words.update(political_stopwords)


def preprocess_text(text, method='lemmatize'):
    """
    Full preprocessing pipeline for political speech text.

    Steps:
    1. Lowercase conversion
    2. Remove punctuation and special characters
    3. Tokenization
    4. Stopword removal
    5. Lemmatization or stemming

    Parameters:
        text (str): Input text
        method (str): 'lemmatize' or 'stem'

    Returns:
        list: List of processed tokens
    """
    if not isinstance(text, str) or len(text.strip()) == 0:
        return []

    # Step 1: Lowercase
    text = text.lower()

    # Step 2: Remove URLs, email addresses, special characters
    text = re.sub(r'http\S+|www\S+', '', text)
    text = re.sub(r'\S+@\S+', '', text)
    text = re.sub(r'[^\w\s]', ' ', text)  # Replace punctuation with space
    text = re.sub(r'\d+', '', text)        # Remove numbers
    text = re.sub(r'\s+', ' ', text).strip()  # Normalize whitespace

    # Step 3: Tokenize
    tokens = word_tokenize(text)

    # Step 4: Remove stopwords (and very short tokens)
    tokens = [t for t in tokens if t not in stop_words and len(t) > 2]

    # Step 5: Lemmatize or stem
    if method == 'lemmatize':
        tokens = [lemmatizer.lemmatize(t) for t in tokens]
    elif method == 'stem':
        tokens = [stemmer.stem(t) for t in tokens]

    return tokens


# Apply preprocessing to the text_excerpt column
# We use text_excerpt for speed; in production, use full_text
speeches['tokens'] = speeches['text_excerpt'].apply(preprocess_text)
speeches['token_count'] = speeches['tokens'].apply(len)

print("Preprocessing complete.")
print(f"Average tokens per excerpt: {speeches['token_count'].mean():.1f}")

⚠️ Common Pitfall: Stemming vs. Lemmatization — Choose Deliberately Stemming (Porter algorithm) is faster and strips words to their root form by rule — "running," "runs," "runner" all become "run." Lemmatization uses a vocabulary and morphological analysis to return the base dictionary form — "better" becomes "good" with lemmatization but stays "better" with stemming. For political text where meaning matters, lemmatization is generally preferred. Sam uses lemmatization as the default, but the function accepts a method parameter so the choice is explicit and changeable.
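The distinction is easy to demonstrate with toy versions of each approach. The suffix rules and lemma dictionary below are illustrative stand-ins, not the actual Porter or WordNet logic:

```python
# Toy rule-based "stemmer": strips common suffixes mechanically
def toy_stem(word):
    for suffix in ('ing', 'ers', 'er', 'es', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


# Toy "lemmatizer": looks irregular forms up in a dictionary
TOY_LEMMAS = {'better': 'good', 'ran': 'run', 'women': 'woman'}


def toy_lemmatize(word):
    return TOY_LEMMAS.get(word, word)


for w in ['running', 'voters', 'better']:
    print(f"{w:10s} stem={toy_stem(w):8s} lemma={toy_lemmatize(w)}")
# The rule-based stemmer mangles 'better' into 'bett' and produces non-word
# stems like 'runn'; the dictionary lemmatizer maps 'better' to 'good'.
```

The trade-off scales up: real stemmers are fast but produce non-words; real lemmatizers are slower but return dictionary forms suitable for interpretation.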


27.4 Word Frequency Analysis

27.4.1 Overall Frequency in the Speech Corpus

from collections import Counter

def get_top_words(token_lists, n=30):
    """Get the n most common words across a list of token lists."""
    all_tokens = [token for sublist in token_lists for token in sublist]
    return Counter(all_tokens).most_common(n)


# Overall top words
overall_top = get_top_words(speeches['tokens'], n=40)
words, counts = zip(*overall_top)

fig, ax = plt.subplots(figsize=FIGSIZE_WIDE)
ax.barh(words[::-1], counts[::-1], color='steelblue', alpha=0.8)
ax.set_xlabel('Frequency', fontsize=12)
ax.set_title('Top 40 Words in ODA Speech Corpus', fontsize=14, fontweight='bold')
ax.set_ylabel('')
plt.tight_layout()
plt.savefig('output/top_words_overall.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nTop 20 words in corpus:")
for word, count in overall_top[:20]:
    print(f"  {word:20s} {count:6d}")

27.4.2 Comparing Word Frequencies by Party

One of the most analytically useful applications of word frequency analysis in political science is comparing language across groups — parties, candidates, ideological factions. The goal is not just to identify the most common words overall, but to find words distinctively associated with one group relative to another.

# Separate by party
dem_tokens = speeches[speeches['party'] == 'D']['tokens']
rep_tokens = speeches[speeches['party'] == 'R']['tokens']

dem_freq = Counter([t for sublist in dem_tokens for t in sublist])
rep_freq = Counter([t for sublist in rep_tokens for t in sublist])

# Normalize by total token count (proportion, not raw count)
dem_total = sum(dem_freq.values())
rep_total = sum(rep_freq.values())

dem_prop = {w: c / dem_total for w, c in dem_freq.items()}
rep_prop = {w: c / rep_total for w, c in rep_freq.items()}

# Distinctiveness: words most over-represented in each party
# Simple approach: proportion in one party minus proportion in the other
all_words = set(dem_prop.keys()) | set(rep_prop.keys())
distinctiveness = {}
for w in all_words:
    d = dem_prop.get(w, 0)
    r = rep_prop.get(w, 0)
    if d + r > 0.0001:  # Filter rare words
        distinctiveness[w] = d - r  # Positive = more Democratic

dem_distinctive = sorted(distinctiveness.items(), key=lambda x: x[1], reverse=True)[:20]
rep_distinctive = sorted(distinctiveness.items(), key=lambda x: x[1])[:20]

# Side-by-side visualization
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=FIGSIZE_WIDE)

dem_words, dem_scores = zip(*dem_distinctive)
rep_words, rep_scores = zip(*[(w, -s) for w, s in rep_distinctive])

ax1.barh(dem_words[::-1], [s * 1000 for s in dem_scores[::-1]], color='#1a6bb5', alpha=0.8)
ax1.set_title('Most Distinctive Democratic Words', fontsize=12, fontweight='bold')
ax1.set_xlabel('Distinctiveness Score (× 1000)', fontsize=10)

ax2.barh(rep_words[::-1], [s * 1000 for s in rep_scores[::-1]], color='#b51a1a', alpha=0.8)
ax2.set_title('Most Distinctive Republican Words', fontsize=12, fontweight='bold')
ax2.set_xlabel('Distinctiveness Score (× 1000)', fontsize=10)

plt.suptitle('Partisan Word Distinctiveness in ODA Speech Corpus', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('output/partisan_word_distinctiveness.png', dpi=150, bbox_inches='tight')
plt.show()

📊 Real-World Application: The Political Vocabulary Sam's analysis of the ODA speech corpus found systematic differences in partisan language. Democratic speeches showed higher rates of words related to community, access, and collective action ("together," "community," "fair," "right," "everyone"). Republican speeches showed higher rates of words related to freedom, security, and threat ("freedom," "border," "protect," "strength," "America"). These patterns are consistent with decades of political communication research — but the computational approach generates them from the data rather than presupposing them.
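The proportion-difference score used above is simple but tends to favor high-frequency words. A common refinement is a smoothed log-odds ratio, which measures relative rather than absolute over-representation. A sketch with add-one smoothing (the function name and toy counts are invented; the Counters play the role of dem_freq and rep_freq):

```python
import math
from collections import Counter


def log_odds(freq_a, freq_b, alpha=1.0):
    """Smoothed log-odds of each word in corpus A relative to corpus B.
    Positive = over-represented in A; negative = over-represented in B."""
    total_a, total_b = sum(freq_a.values()), sum(freq_b.values())
    vocab = set(freq_a) | set(freq_b)
    scores = {}
    for w in vocab:
        # Add-alpha smoothing keeps unseen words from producing log(0)
        p_a = (freq_a.get(w, 0) + alpha) / (total_a + alpha * len(vocab))
        p_b = (freq_b.get(w, 0) + alpha) / (total_b + alpha * len(vocab))
        scores[w] = math.log(p_a / (1 - p_a)) - math.log(p_b / (1 - p_b))
    return scores


# Tiny worked example with invented counts
dem_toy = Counter({'community': 8, 'freedom': 1, 'vote': 5})
rep_toy = Counter({'community': 1, 'freedom': 8, 'vote': 5})
print(sorted(log_odds(dem_toy, rep_toy).items(), key=lambda kv: kv[1], reverse=True))
```

Note that 'vote', equally common in both toy corpora, scores near zero, while the proportion-difference approach would also down-weight genuinely distinctive but rare words.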


27.5 Sentiment Analysis: VADER for Political Speech

27.5.1 Why VADER?

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based sentiment analysis tool developed by C.J. Hutto and Eric Gilbert specifically for social media text. It is particularly well-suited for political communication because:

  • It handles capitalization ("GREAT" vs. "great"), punctuation ("awesome!" vs. "awesome"), and negation ("not good")
  • It is calibrated for informal registers, which political speech resembles more than formal writing
  • It produces a compound score (-1 to 1) as well as positive, negative, and neutral component scores
  • It is fast and does not require training data

VADER's limitations are equally important to understand:

  • It is a lexicon-based approach — it looks up individual words in a sentiment dictionary. It does not understand complex syntax or extended context.
  • Its dictionary was developed primarily from general English text. Political terminology may be mis-scored.
  • Irony and sarcasm are not handled.

For political speech analysis, VADER works best for gross-level characterization (is this speech overall positive or negative in tone?) rather than fine-grained semantic interpretation.
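The mechanism behind a lexicon-based scorer can be illustrated with a drastically simplified sketch: a four-word valence dictionary plus a one-token negation flip. The real VADER lexicon has roughly 7,500 entries and additional heuristics for intensifiers, punctuation, and capitalization; none of the values below are VADER's:

```python
# Toy valence lexicon (illustrative values, not VADER's)
LEXICON = {'great': 2.0, 'good': 1.5, 'terrible': -2.0, 'failed': -1.5}
NEGATORS = {'not', 'never', 'no'}


def toy_sentiment(text):
    """Average word valence, flipping sign when the preceding token negates."""
    tokens = text.lower().split()
    total, hits = 0.0, 0
    for i, tok in enumerate(tokens):
        if tok in LEXICON:
            score = LEXICON[tok]
            if i > 0 and tokens[i - 1] in NEGATORS:
                score = -score  # crude one-token negation handling
            total += score
            hits += 1
    return total / hits if hits else 0.0


print(toy_sentiment('a great plan'))      # positive
print(toy_sentiment('not great'))         # negation flips the score
print(toy_sentiment('not a great plan'))  # negator not adjacent: still positive
```

The last line shows why lexicon methods need windowed negation rules (VADER looks back several tokens) and why they still fail on sarcasm, where no surface cue marks the reversal.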

27.5.2 Applying VADER to the Speech Corpus

from nltk.sentiment import SentimentIntensityAnalyzer

sia = SentimentIntensityAnalyzer()


def get_vader_scores(text):
    """
    Apply VADER sentiment analysis to text.
    Returns a dict with neg, neu, pos, compound scores.
    """
    if not isinstance(text, str) or len(text.strip()) == 0:
        return {'neg': 0.0, 'neu': 1.0, 'pos': 0.0, 'compound': 0.0}
    return sia.polarity_scores(text)


# Apply to text_excerpt (VADER works on raw text, not preprocessed tokens)
vader_results = speeches['text_excerpt'].apply(get_vader_scores)
speeches['vader_compound'] = vader_results.apply(lambda x: x['compound'])
speeches['vader_pos'] = vader_results.apply(lambda x: x['pos'])
speeches['vader_neg'] = vader_results.apply(lambda x: x['neg'])
speeches['vader_neu'] = vader_results.apply(lambda x: x['neu'])

# Summary statistics by party
print("VADER Compound Score by Party:")
print(speeches.groupby('party')['vader_compound'].describe().round(3))

# Summary by event type
print("\nVADER Compound Score by Event Type:")
print(speeches.groupby('event_type')['vader_compound'].describe().round(3))

27.5.3 Visualizing Sentiment Patterns

# Distribution of sentiment by party
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

parties = [('D', '#1a6bb5', 'Democratic'), ('R', '#b51a1a', 'Republican')]
score_cols = [
    ('vader_compound', 'Compound Score', axes[0]),
    ('vader_pos', 'Positive Score', axes[1]),
    ('vader_neg', 'Negative Score', axes[2])
]

for col, label, ax in score_cols:
    for party_code, color, party_label in parties:
        data = speeches[speeches['party'] == party_code][col]
        ax.hist(data, bins=30, alpha=0.6, color=color, label=party_label, density=True)
    ax.set_xlabel(label, fontsize=11)
    ax.set_ylabel('Density', fontsize=11)
    ax.set_title(f'Distribution of {label}', fontsize=12, fontweight='bold')
    ax.legend()

plt.suptitle('VADER Sentiment Scores by Party: ODA Speech Corpus', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('output/vader_distribution_by_party.png', dpi=150, bbox_inches='tight')
plt.show()


# Sentiment over time (rolling average)
speeches_sorted = speeches.sort_values('date')

fig, ax = plt.subplots(figsize=FIGSIZE_WIDE)

for party_code, color, party_label in parties:
    party_data = speeches_sorted[speeches_sorted['party'] == party_code].copy()
    # 30-day rolling average
    party_data = party_data.set_index('date')
    rolling_sentiment = party_data['vader_compound'].rolling('30D', min_periods=3).mean()
    ax.plot(rolling_sentiment.index, rolling_sentiment.values,
            color=color, label=party_label, linewidth=2.5)

ax.axhline(y=0, color='black', linestyle='--', alpha=0.4, linewidth=1)
ax.set_xlabel('Date', fontsize=12)
ax.set_ylabel('Compound Sentiment (30-day rolling avg)', fontsize=12)
ax.set_title('Speech Sentiment Over Time by Party', fontsize=14, fontweight='bold')
ax.legend(fontsize=11)
plt.tight_layout()
plt.savefig('output/sentiment_over_time.png', dpi=150, bbox_inches='tight')
plt.show()

🔗 Connection to Chapter 26: Sam's sentiment tracking pipeline is the same infrastructure used to monitor media coverage sentiment in the Garza-Whitfield fact-check response. The oda_media.csv vendor-generated sentiment_score variable is created by a commercial version of exactly this process — VADER applied at scale with domain-specific lexicon adjustments. Understanding how the tool works reveals both its value and its limitations.


27.6 Readability Analysis: How Complex Is Political Speech?

27.6.1 Why Readability Matters Politically

The complexity of political language is not neutral. Simpler, more accessible language reaches a broader audience, but may sacrifice precision. More complex language may be appropriate for policy-sophisticated audiences but excludes many voters. Research by Schaffner and Sellers (2003) found that clear, concrete language in political advertising produced stronger persuasion effects than abstract policy language.

Populist communication research has specifically identified plain-speaking as a marker of authenticity — the claim to speak directly to ordinary people in contrast to an elite "establishment" that speaks in evasive, technical language. A candidate's readability scores are thus not just measures of linguistic complexity; they are politically meaningful signals.

import textstat


def compute_readability(text):
    """
    Compute multiple readability metrics for political text.

    Returns:
        dict with Flesch Reading Ease, Flesch-Kincaid Grade Level,
        Gunning Fog Index, and average sentence length
    """
    if not isinstance(text, str) or len(text.strip()) < 50:
        return {
            'flesch_reading_ease': None,
            'flesch_kincaid_grade': None,
            'gunning_fog': None,
            'avg_sentence_length': None
        }

    sentences = sent_tokenize(text)
    avg_sent_len = len(word_tokenize(text)) / max(len(sentences), 1)

    return {
        'flesch_reading_ease': textstat.flesch_reading_ease(text),
        'flesch_kincaid_grade': textstat.flesch_kincaid_grade(text),
        'gunning_fog': textstat.gunning_fog(text),
        'avg_sentence_length': avg_sent_len
    }


# Apply readability analysis
readability_results = speeches['text_excerpt'].apply(compute_readability)
readability_df = pd.DataFrame(list(readability_results))
speeches = pd.concat([speeches, readability_df], axis=1)

print("Readability Statistics by Party:")
print(speeches.groupby('party')[['flesch_reading_ease', 'flesch_kincaid_grade',
                                   'gunning_fog', 'avg_sentence_length']].mean().round(2))

27.6.2 Interpreting Flesch-Kincaid Scores

The Flesch Reading Ease score (0–100) and Flesch-Kincaid Grade Level (approximate US school grade) are the most widely used readability metrics. For reference:

| Flesch Reading Ease | Interpretation | FK Grade Level |
| --- | --- | --- |
| 90–100 | Very easy (5th grade) | < 5 |
| 80–90 | Easy (6th grade) | 6 |
| 70–80 | Fairly easy (7th grade) | 7 |
| 60–70 | Standard (8th–9th grade) | 8–9 |
| 50–60 | Fairly difficult (10th–12th grade) | 10–12 |
| 30–50 | Difficult (college level) | 13–16 |
| < 30 | Very difficult (professional) | > 16 |

Presidential speeches historically fall around grade 8–10. Campaign stump speeches often score simpler than State of the Union addresses. The Gettysburg Address scores around grade 9; Lincoln's Second Inaugural scores around grade 12. Trump's 2016 campaign speeches scored around grade 4–5 (very accessible); Clinton's scored around grade 7–8.
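These scores come from closed-form formulas over three counts: words, sentences, and syllables. The sketch below computes both metrics directly, using a crude vowel-group syllable heuristic, so its numbers will differ slightly from textstat's more careful syllable counting:

```python
import re


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels."""
    return max(len(re.findall(r'[aeiouy]+', word.lower())), 1)


def flesch_scores(text):
    """Flesch Reading Ease and Flesch-Kincaid Grade from raw counts."""
    sentences = max(len(re.findall(r'[.!?]+', text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)
    n_syllables = sum(count_syllables(w) for w in words)
    asl = n_words / sentences       # average sentence length
    asw = n_syllables / n_words     # average syllables per word
    reading_ease = 206.835 - 1.015 * asl - 84.6 * asw
    fk_grade = 0.39 * asl + 11.8 * asw - 15.59
    return reading_ease, fk_grade


ease, grade = flesch_scores("We will fight for working families. We will win.")
print(f"Reading ease: {ease:.1f}, FK grade: {grade:.1f}")
```

The formulas make the political trade-off visible: short sentences and short words drive ease up and grade level down, which is exactly the plain-speaking register populist communication research describes.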

# Readability vs. populism score
fig, axes = plt.subplots(1, 2, figsize=FIGSIZE_WIDE)

# FK Grade vs Populism Score
for party_code, color, party_label in parties:
    party_data = speeches[speeches['party'] == party_code].dropna(
        subset=['flesch_kincaid_grade', 'populism_score'])
    axes[0].scatter(party_data['flesch_kincaid_grade'],
                    party_data['populism_score'],
                    color=color, alpha=0.4, label=party_label, s=30)
axes[0].set_xlabel('Flesch-Kincaid Grade Level', fontsize=11)
axes[0].set_ylabel('Populism Score', fontsize=11)
axes[0].set_title('Readability vs. Populism Score', fontsize=12, fontweight='bold')
axes[0].legend()

# Average sentence length vs. compound sentiment
for party_code, color, party_label in parties:
    party_data = speeches[speeches['party'] == party_code].dropna(
        subset=['avg_sentence_length', 'vader_compound'])
    axes[1].scatter(party_data['avg_sentence_length'],
                    party_data['vader_compound'],
                    color=color, alpha=0.4, label=party_label, s=30)
axes[1].set_xlabel('Average Sentence Length (words)', fontsize=11)
axes[1].set_ylabel('VADER Compound Score', fontsize=11)
axes[1].set_title('Sentence Length vs. Sentiment', fontsize=12, fontweight='bold')
axes[1].legend()

plt.suptitle('Readability and Populism in ODA Speech Corpus', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('output/readability_vs_populism.png', dpi=150, bbox_inches='tight')
plt.show()

27.7 N-gram Analysis: What Phrases Are Campaign-Specific?

Individual word frequencies miss multi-word expressions that carry political meaning. "Border security" means something different from "border" and "security" separately. "Job creation," "working families," "freedom of speech," "climate crisis" — these compound expressions are the vocabulary of campaigns.

N-gram analysis identifies frequently co-occurring word sequences (bigrams = two words, trigrams = three words).
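Under the hood, n-gram extraction is just a sliding window: nltk.util.ngrams is equivalent to zipping a token list against shifted copies of itself. A standard-library sketch (the sliding_ngrams helper is illustrative):

```python
tokens = ['border', 'security', 'means', 'border', 'security']

# Bigrams: pair each token with its successor
bigrams = list(zip(tokens, tokens[1:]))


def sliding_ngrams(seq, n):
    """Generic n-grams via n shifted views of the sequence."""
    return list(zip(*(seq[i:] for i in range(n))))


print(bigrams)
print(sliding_ngrams(tokens, 3))
```

Because the window slides one token at a time, a document of m tokens yields m - n + 1 n-grams, so trigram counts are always sparser than bigram counts from the same corpus.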

from nltk.util import ngrams


def get_ngrams(token_list, n):
    """Extract n-grams from a list of tokens."""
    return list(ngrams(token_list, n))


def top_ngrams(token_series, n, top_k=25):
    """Get the top k n-grams across all documents in a series."""
    all_ngrams = []
    for tokens in token_series:
        if len(tokens) >= n:
            all_ngrams.extend(get_ngrams(tokens, n))
    return Counter(all_ngrams).most_common(top_k)


# Top bigrams and trigrams by party
print("Top 20 Bigrams — Democratic Speeches:")
dem_bigrams = top_ngrams(dem_tokens, 2, top_k=20)
for ngram, count in dem_bigrams:
    print(f"  {' '.join(ngram):30s} {count}")

print("\nTop 20 Bigrams — Republican Speeches:")
rep_bigrams = top_ngrams(rep_tokens, 2, top_k=20)
for ngram, count in rep_bigrams:
    print(f"  {' '.join(ngram):30s} {count}")


# Visualize top trigrams by party
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 7))

for ax, token_series, color, title in [
    (ax1, dem_tokens, '#1a6bb5', 'Democratic Speeches'),
    (ax2, rep_tokens, '#b51a1a', 'Republican Speeches')
]:
    top_tris = top_ngrams(token_series, 3, top_k=15)
    labels = [' '.join(ng) for ng, _ in top_tris]
    values = [c for _, c in top_tris]
    ax.barh(labels[::-1], values[::-1], color=color, alpha=0.8)
    ax.set_title(f'Top 15 Trigrams: {title}', fontsize=12, fontweight='bold')
    ax.set_xlabel('Frequency', fontsize=10)

plt.suptitle('Distinctive Phrase Usage by Party: ODA Speech Corpus', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('output/ngram_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

🧪 Try This: Event-Type N-gram Comparison Run the n-gram analysis separately for rallies versus townhall events within each party. Do candidates use different phrase vocabularies in different event contexts? What does this reveal about audience-tailored messaging?


27.8 Topic Modeling with LDA

27.8.1 What Topic Modeling Does (and Doesn't Do)

Latent Dirichlet Allocation (LDA) is a generative probabilistic model that assumes documents are mixtures of topics, and topics are distributions over words. Given a corpus of documents, LDA infers: (a) a set of topics (each a distribution over vocabulary words), and (b) for each document, a distribution over topics.
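That generative story can be made concrete with a toy simulator. The two topics, their word distributions, and the document mixture below are all invented for illustration:

```python
import random

random.seed(42)  # stochastic process: seed for reproducibility

# Two toy topics: each is a distribution over words (illustrative, not fitted)
TOPICS = {
    'economy': (['jobs', 'wages', 'taxes', 'growth'], [0.4, 0.3, 0.2, 0.1]),
    'security': (['border', 'crime', 'police', 'safety'], [0.4, 0.3, 0.2, 0.1]),
}


def generate_document(topic_mixture, length=10):
    """Sample a document: pick a topic for each word, then a word from it."""
    names = list(topic_mixture)
    weights = list(topic_mixture.values())
    words = []
    for _ in range(length):
        topic = random.choices(names, weights=weights)[0]
        vocab, probs = TOPICS[topic]
        words.append(random.choices(vocab, weights=probs)[0])
    return words


# A document that is 70% economy, 30% security
print(' '.join(generate_document({'economy': 0.7, 'security': 0.3})))
```

Fitting LDA is the inverse problem: given only the generated words, recover the topic-word distributions and the per-document mixtures.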

Topic modeling is exploratory — it discovers patterns in the data without being told what to look for. Its outputs are not "the topics in these speeches" but rather "word co-occurrence patterns that LDA represents as topics." Interpreting the output requires human judgment.

LDA requires several key decisions:

  • Number of topics (k): Must be set in advance. Common practice is to try several values and evaluate coherence.
  • Preprocessing choices: LDA is sensitive to preprocessing; more aggressive filtering typically produces cleaner topics.
  • Random seed: LDA uses stochastic optimization; results vary across runs unless a seed is set.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import numpy as np


def prepare_text_for_lda(token_series):
    """Join token lists back to strings for sklearn's CountVectorizer."""
    return token_series.apply(lambda tokens: ' '.join(tokens) if tokens else '')


# Prepare text
lda_texts = prepare_text_for_lda(speeches['tokens'])

# Create document-term matrix
# min_df=5: ignore terms appearing in fewer than 5 documents
# max_df=0.7: ignore terms appearing in more than 70% of documents
# max_features=2000: limit vocabulary size
vectorizer = CountVectorizer(
    min_df=5,
    max_df=0.70,
    max_features=2000,
    ngram_range=(1, 2)  # Include bigrams
)

dtm = vectorizer.fit_transform(lda_texts)
feature_names = vectorizer.get_feature_names_out()

print(f"Document-term matrix shape: {dtm.shape}")
print(f"(Documents × Vocabulary terms)")


# Fit LDA model
# n_components = number of topics
# We try k=8 topics based on prior domain knowledge of campaign issue areas
N_TOPICS = 8
RANDOM_SEED = 42

lda_model = LatentDirichletAllocation(
    n_components=N_TOPICS,
    random_state=RANDOM_SEED,
    learning_method='online',  # Faster than 'batch' for large corpora
    max_iter=25
)

lda_model.fit(dtm)
print(f"\nLDA model fitted with {N_TOPICS} topics.")


def display_topics(model, feature_names, n_top_words=12):
    """Display top words for each LDA topic."""
    for topic_idx, topic in enumerate(model.components_):
        top_word_idx = topic.argsort()[:-n_top_words - 1:-1]
        top_words = [feature_names[i] for i in top_word_idx]
        print(f"Topic {topic_idx + 1}: {' | '.join(top_words)}")


print("\nLDA Topics (top 12 words each):")
display_topics(lda_model, feature_names, n_top_words=12)

27.8.2 Assigning Topic Proportions to Documents

# Get topic proportions for each document
topic_proportions = lda_model.transform(dtm)

# Assign dominant topic
speeches['dominant_topic'] = topic_proportions.argmax(axis=1) + 1  # 1-indexed
speeches['topic_confidence'] = topic_proportions.max(axis=1)

# Add topic proportion columns for all topics
for i in range(N_TOPICS):
    speeches[f'topic_{i+1}_proportion'] = topic_proportions[:, i]

# Topic distribution by party
topic_cols = [f'topic_{i+1}_proportion' for i in range(N_TOPICS)]
print("\nAverage Topic Proportions by Party:")
print(speeches.groupby('party')[topic_cols].mean().round(3).T)

⚠️ Common Pitfall: Over-Interpreting LDA Topics LDA topics are mathematical artifacts — patterns in word co-occurrence that the algorithm finds. Labeling Topic 3 as "healthcare" because it contains words like "insurance," "coverage," "affordable" is a human interpretive act that the algorithm does not validate. Two researchers might label the same topic differently. Always report the top words alongside your labels and let readers evaluate your interpretation.


27.9 Analyzing Media Framing: Sentiment by Source Type

One of Sam's core research questions: does the sentiment of media coverage vary systematically by source type? If hyperpartisan outlets consistently frame the same candidates more negatively than local print or national news outlets, that reveals structural bias in the information environment.

# Reload fresh view of the media dataset
media = pd.read_csv('oda_media.csv', parse_dates=['date'])

print("Media dataset overview:")
print(f"Records: {len(media)}")
print(f"\nSource types:")
print(media['source_type'].value_counts())
print(f"\nSentiment score statistics:")
print(media['sentiment_score'].describe().round(3))


# Sentiment by source type — box plots
fig, ax = plt.subplots(figsize=FIGSIZE_STANDARD)

source_order = ['local_tv', 'local_print', 'national', 'hyperpartisan', 'social_aggregator']
source_labels = ['Local TV', 'Local Print', 'National', 'Hyperpartisan', 'Social Aggregator']

plot_data = [
    media[media['source_type'] == st]['sentiment_score'].dropna().values
    for st in source_order
]

bp = ax.boxplot(plot_data, labels=source_labels, patch_artist=True,
                medianprops=dict(color='black', linewidth=2))

colors = ['#5B9BD5', '#70AD47', '#FFC000', '#FF6B6B', '#A569BD']
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)
    patch.set_alpha(0.7)

ax.axhline(y=0, color='black', linestyle='--', alpha=0.4)
ax.set_ylabel('Sentiment Score (-1 = negative, +1 = positive)', fontsize=11)
ax.set_title('Media Sentiment by Source Type: ODA Coverage Dataset', fontsize=13, fontweight='bold')
ax.set_ylim(-1.05, 1.05)
plt.tight_layout()
plt.savefig('output/sentiment_by_source_type.png', dpi=150, bbox_inches='tight')
plt.show()


# Statistical test: Are source-type means significantly different?
from scipy import stats

national = media[media['source_type'] == 'national']['sentiment_score'].dropna()
hyperpartisan = media[media['source_type'] == 'hyperpartisan']['sentiment_score'].dropna()
# Welch's t-test (equal_var=False): appropriate here because the
# hyperpartisan group shows much wider variance than the national group
t_stat, p_value = stats.ttest_ind(national, hyperpartisan, equal_var=False)
print(f"\nNational vs. Hyperpartisan sentiment (Welch t-test):")
print(f"  National mean: {national.mean():.3f}")
print(f"  Hyperpartisan mean: {hyperpartisan.mean():.3f}")
print(f"  t-statistic: {t_stat:.3f}, p-value: {p_value:.4f}")
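The t-test above compares only two source types. To test whether mean sentiment differs across all five source types at once, the usual next step is a one-way ANOVA, or its rank-based analogue, the Kruskal-Wallis test, which is more robust to the unequal variances visible in the box plots. A minimal sketch, using synthetic scores as a stand-in for the `oda_media.csv` columns (the real dataset is not distributed with this chapter):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic sentiment scores standing in for media['sentiment_score'] by source type
groups = {
    'local_tv':          rng.normal(0.00, 0.20, 150),
    'local_print':       rng.normal(0.02, 0.18, 140),
    'national':          rng.normal(-0.08, 0.22, 200),
    'hyperpartisan':     rng.normal(-0.15, 0.45, 120),
    'social_aggregator': rng.normal(-0.20, 0.35, 90),
}

# One-way ANOVA: tests equality of means across all groups simultaneously
f_stat, p_anova = stats.f_oneway(*groups.values())
print(f"ANOVA: F = {f_stat:.2f}, p = {p_anova:.4f}")

# Kruskal-Wallis: rank-based alternative, robust to unequal variances
h_stat, p_kw = stats.kruskal(*groups.values())
print(f"Kruskal-Wallis: H = {h_stat:.2f}, p = {p_kw:.4f}")
```

A significant omnibus test says only that at least one group differs; pairwise follow-ups (with a multiple-comparison correction) are needed to say which.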


# Sentiment by source type AND fact-check rating
pivot = media.groupby(['source_type', 'factcheck_rating'])['sentiment_score'].mean().unstack()
print("\nMean sentiment by source type and fact-check rating:")
print(pivot.round(3))

📊 Real-World Application: What Sam Found Sam's analysis of the ODA media dataset revealed a pronounced sentiment gradient by source type. Local TV and local print coverage had mean sentiment scores close to zero (neutral), consistent with professional journalism norms. National outlets skewed slightly negative overall — consistent with research showing that political news is generally negatively framed. Hyperpartisan outlets showed the widest sentiment variance: highly positive about their favored candidate, highly negative about the opponent. Social aggregators showed the most negative mean sentiment, consistent with engagement-maximizing principles (negative content drives more interaction). Articles rated "mostly_false" or "false" by fact-checkers showed higher emotional extremity (both positive and negative) compared to articles rated "true" or "mostly_true" — a computational validation of the research finding that emotionally arousing content is more likely to be false.
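The "emotional extremity" finding above can be operationalized directly: take the absolute value of the sentiment score and average it within each fact-check rating. A sketch with a synthetic frame (the column names `sentiment_score` and `factcheck_rating` follow the chapter's dataset; the data here is simulated for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for the media dataset: false-rated articles drawn from a
# wider sentiment distribution (more extreme), true-rated from a narrower one
ratings = ['true', 'mostly_true', 'mostly_false', 'false']
spreads = {'true': 0.15, 'mostly_true': 0.20, 'mostly_false': 0.40, 'false': 0.50}
rows = []
for rating in ratings:
    scores = np.clip(rng.normal(0, spreads[rating], 200), -1, 1)
    rows.extend({'factcheck_rating': rating, 'sentiment_score': s} for s in scores)
media_demo = pd.DataFrame(rows)

# "Emotional extremity" operationalized as the absolute sentiment score
media_demo['extremity'] = media_demo['sentiment_score'].abs()
extremity = media_demo.groupby('factcheck_rating')['extremity'].mean().reindex(ratings)
print(extremity.round(3))
```

Note that mean extremity is a deliberately simple proxy; it conflates positive and negative arousal, which is exactly what the "highly positive about one candidate, highly negative about the other" pattern requires.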


27.10 Building a Simple Political Text Classifier

27.10.1 The Task

A partisan language detector is a simple binary classifier: given a speech excerpt, does it use language more characteristic of Democratic or Republican political communication? This is useful for tracking "partisan drift" in a candidate's language over a campaign, detecting when campaign communication becomes unusually partisan, or flagging content that may have been coordinated.

This is a demonstration classifier: the goal is pedagogical insight, not a production-ready tool. Sam is explicit with Adaeze about this distinction.

from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np


# Filter to Democratic and Republican speeches only (exclude Independents)
# Only use speeches with sufficient text
clf_data = speeches[
    speeches['party'].isin(['D', 'R']) &
    speeches['text_excerpt'].notna() &
    (speeches['word_count'] > 100)
].copy()

print(f"Classification dataset: {len(clf_data)} speeches")
print(clf_data['party'].value_counts())


# Features: TF-IDF on raw text (not preprocessed — preserves more signal)
# TF-IDF: term frequency * inverse document frequency
# Weights terms that are frequent in a document but rare across the corpus
tfidf = TfidfVectorizer(
    max_features=3000,
    ngram_range=(1, 2),
    min_df=3,
    max_df=0.8,
    strip_accents='unicode',
    stop_words='english'
)

X = tfidf.fit_transform(clf_data['text_excerpt'])
y = (clf_data['party'] == 'R').astype(int)  # 1 = Republican, 0 = Democrat

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# Logistic regression classifier
# C=1.0 is default regularization; higher values = less regularization
clf = LogisticRegression(C=1.0, max_iter=1000, random_state=42)
clf.fit(X_train, y_train)

# Evaluation
y_pred = clf.predict(X_test)
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Democratic', 'Republican']))

# Cross-validation
cv_scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
print(f"5-fold cross-validation accuracy: {cv_scores.mean():.3f} (+/- {cv_scores.std():.3f})")
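One subtlety in the cross-validation above: `tfidf` was fit on the full dataset before splitting, so document-frequency statistics from test documents leak into the training features, and the same leak occurs inside every CV fold. Wrapping the vectorizer and classifier in a `Pipeline` re-fits TF-IDF on the training fold only. A minimal sketch on a toy corpus (the ODA speech data is not distributed with this chapter, so the texts and labels here are invented):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

# Toy corpus standing in for clf_data['text_excerpt'] and the party labels
texts = [
    "cut taxes and secure the border for hardworking families",
    "expand healthcare coverage and protect voting rights",
    "small government and strong national defense",
    "climate action and affordable college for every student",
] * 10  # repeat so 5-fold CV has enough documents per class
labels = [1, 0, 1, 0] * 10  # 1 = Republican-style, 0 = Democratic-style

# The vectorizer is now re-fit inside each CV split, on training folds only
pipe = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('clf', LogisticRegression(max_iter=1000, random_state=42)),
])

scores = cross_val_score(pipe, texts, labels, cv=5, scoring='accuracy')
print(f"Leakage-free 5-fold accuracy: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

With a few thousand features and a modest corpus, the leakage usually shifts accuracy only slightly, but the pipelined version is the defensible one to report.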


# Feature importance: most partisan words
feature_names = tfidf.get_feature_names_out()
coef = clf.coef_[0]

# Most Republican-associated features (highest positive coefficients)
rep_idx = coef.argsort()[-20:][::-1]
# Most Democratic-associated features (most negative coefficients)
dem_idx = coef.argsort()[:20]

print("\nTop Republican-associated terms (positive coefficients):")
for i in rep_idx:
    print(f"  {feature_names[i]:30s}  {coef[i]:.4f}")

print("\nTop Democratic-associated terms (negative coefficients):")
for i in dem_idx:
    print(f"  {feature_names[i]:30s}  {coef[i]:.4f}")


# Visualize feature importance
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=FIGSIZE_WIDE)

rep_words = [feature_names[i] for i in rep_idx]
rep_coefs = [coef[i] for i in rep_idx]
dem_words = [feature_names[i] for i in dem_idx]
dem_coefs = [abs(coef[i]) for i in dem_idx]

ax1.barh(rep_words[::-1], rep_coefs[::-1], color='#b51a1a', alpha=0.8)
ax1.set_title('Republican-Associated Terms', fontsize=12, fontweight='bold')
ax1.set_xlabel('Logistic Regression Coefficient', fontsize=10)

ax2.barh(dem_words[::-1], dem_coefs[::-1], color='#1a6bb5', alpha=0.8)
ax2.set_title('Democratic-Associated Terms', fontsize=12, fontweight='bold')
ax2.set_xlabel('Coefficient Magnitude', fontsize=10)

plt.suptitle('Partisan Language Classifier: Most Predictive Features', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('output/classifier_features.png', dpi=150, bbox_inches='tight')
plt.show()

⚖️ Ethical Analysis: Classifier Limitations and Misuse Risk A partisan language classifier trained on historical data encodes historical patterns. If deployed to flag "partisan" content in a live context, it will systematically mis-flag cross-partisan language — a Democrat who uses language associated with Republican voters, or vice versa. More seriously, a classifier that labels content as "partisan" could be used to suppress legitimate political speech if deployed by platforms as a moderation tool. Sam documents these limitations explicitly in every report that references the classifier output: "This classifier identifies speech patterns historically associated with each party. It does not measure authenticity, sincerity, or appropriateness of political communication."


27.11 Putting the Pipeline Together

Sam's final pipeline integrates all components into a single analytical report for Adaeze. The key findings from the ODA speech and media datasets:

# Summary dashboard: key metrics by party and event type
summary_stats = speeches.groupby(['party', 'event_type']).agg(
    n_speeches=('speech_id', 'count'),
    avg_word_count=('word_count', 'mean'),
    avg_sentiment=('vader_compound', 'mean'),
    avg_readability=('flesch_kincaid_grade', 'mean'),
    avg_populism=('populism_score', 'mean')
).round(3)

print("Summary Statistics by Party and Event Type:")
print(summary_stats.to_string())


# Create a comprehensive summary visualization
fig, axes = plt.subplots(2, 2, figsize=(14, 10))

metrics = [
    ('vader_compound', 'Average Sentiment Score', axes[0, 0]),
    ('flesch_kincaid_grade', 'Avg. Reading Grade Level', axes[0, 1]),
    ('populism_score', 'Avg. Populism Score', axes[1, 0]),
    ('word_count', 'Avg. Speech Word Count', axes[1, 1])
]

event_types = speeches['event_type'].unique()
x = np.arange(len(event_types))
width = 0.35

for metric_col, ylabel, ax in metrics:
    dem_vals = [speeches[(speeches['party'] == 'D') &
                          (speeches['event_type'] == et)][metric_col].mean()
                for et in event_types]
    rep_vals = [speeches[(speeches['party'] == 'R') &
                          (speeches['event_type'] == et)][metric_col].mean()
                for et in event_types]

    bars1 = ax.bar(x - width/2, dem_vals, width, label='Democratic',
                   color='#1a6bb5', alpha=0.8)
    bars2 = ax.bar(x + width/2, rep_vals, width, label='Republican',
                   color='#b51a1a', alpha=0.8)
    ax.set_xticks(x)
    ax.set_xticklabels([e.replace('_', ' ').title() for e in event_types],
                       rotation=20, ha='right', fontsize=9)
    ax.set_ylabel(ylabel, fontsize=10)
    ax.legend(fontsize=9)
    ax.set_title(ylabel, fontsize=11, fontweight='bold')

plt.suptitle('ODA Speech Corpus: Key Metrics by Party and Event Type',
             fontsize=14, fontweight='bold')
plt.tight_layout()
plt.savefig('output/summary_dashboard.png', dpi=150, bbox_inches='tight')
plt.show()

print("\nPipeline complete. Outputs saved to output/ directory.")

27.12 Interpreting Results with Epistemic Humility

Sam's report to Adaeze includes a section that Adaeze insists appear in every computational analysis ODA publishes. It is titled "What This Analysis Can and Cannot Tell Us."

The general principles:

What word frequency analysis tells you: Which words appear more often in one corpus than another. What it does not tell you: whether those words are used approvingly or critically, ironically or sincerely, as central claims or peripheral qualifications.

What VADER sentiment analysis tells you: An approximation of the emotional valence of word choices, based on a general-purpose dictionary. What it does not tell you: speaker intent, audience effect, contextual nuance, or whether negative language is being used to criticize an opponent or to express genuine concern about a problem.

What LDA topic modeling tells you: Word co-occurrence patterns that the algorithm represents as topics. What it does not tell you: whether those patterns are meaningful, why they co-occur, or whether human readers would recognize the "topics" as coherent.

What a partisan classifier tells you: Whether language statistically resembles historical patterns associated with one party or the other. What it does not tell you: whether the speaker is partisan, authentic, effective, or appropriate.

🔴 Critical Thinking: The Construct Validity Problem Every computational text analysis measure faces a construct validity question: does the thing we're measuring (word frequencies, VADER scores, LDA topics) actually correspond to the thing we want to know about (populist rhetoric, emotional framing, issue emphasis)? The answer is: partly, and we should be honest about which part. Computational measures are proxies for theoretical constructs. Treating them as direct measures is the most consequential error in the field — and it is more common than it should be.


27.12.1 Going Beyond VADER: Advanced Sentiment Approaches

VADER is an excellent starting point but is limited to rule-based lexicon matching. The field of political text analysis has moved toward more powerful methods that Sam previews in their research roadmap for ODA.

Lexicon-based extensions: Several political-science-specific sentiment lexicons have been developed that address VADER's limitations for political text. The Lexicoder Sentiment Dictionary (Young and Soroka, 2012) was specifically validated for news media text about politics. The sentimentr package (available in R) handles context-dependent sentiment, including negation and amplifier/deamplifier detection, more robustly than VADER.

Transformer-based models: BERT (Bidirectional Encoder Representations from Transformers, Devlin et al. 2019) and its derivatives (RoBERTa, DistilBERT, domain-adapted variants) can capture context at the sentence and paragraph level, substantially outperforming lexicon approaches on political text — but require GPU compute, training data, and expertise to fine-tune properly. For most applied political analytics contexts, a carefully validated VADER implementation is more practical than a fine-tuned BERT model that is harder to explain to non-technical stakeholders.

Zero-shot classification: Models like Facebook's BART-large-mnli and similar zero-shot inference models can classify text into categories without explicit training data, using natural language descriptions of the categories. For example, a zero-shot classifier can evaluate whether a speech excerpt is "about immigration" or "about healthcare" without being trained on labeled political speeches. The Hugging Face transformers library makes these models accessible, though interpretation requires care.

The analytical choice among these methods depends on resources, corpus size, interpretability requirements, and the specific construct being measured. For a civic data organization like ODA, VADER with careful validation is typically the right first tool. For academic research seeking publication in top political science journals, the validation demands are higher and transformer-based approaches are increasingly expected.


27.12.2 Integrating Computational and Human Analysis

The most robust political text analysis combines computational and human methods. This is not a counsel of despair about computational methods — it is a recognition that different methods have complementary strengths.

Computational → Human (exploratory → confirmatory): Use computational methods to identify patterns at scale (which topics show up in which speeches, which words are most distinctive, which speeches show unusual sentiment), then use human coding to validate and interpret those patterns on a manageable sample. This is the approach Sam took: computational analysis surfaced the readability-populism interaction; understanding what that interaction meant required reading the specific speeches that showed extreme values.

Human → Computational (annotation → training): For classification tasks, human annotation of a training sample creates labeled data that can be used to train a supervised classifier, which then applies the human categories at scale. This is how most production political text classifiers are built: start with human coders, validate agreement, train a model, validate the model against held-out human-coded data.

Computational → Computational (replication and validation): Different computational methods for the same construct should produce convergent results if they are measuring the same underlying phenomenon. If VADER sentiment scores and a trained sentiment classifier disagree substantially, that is a measurement problem — not a sign that one method is simply "better."

Sam's pipeline draws on all three strategies: exploratory word frequency analysis (computational), followed by human interpretation of distinctive vocabulary (computational → human), followed by classifier training from a human-annotated subset (human → computational).


27.12.3 Reproducibility Standards for Political Text Analysis

Political analysts whose work influences public understanding or policy have a particular obligation to meet high reproducibility standards. The political text analysis field has been shaken by several high-profile failures of replication, and best practice has evolved in response.

Pre-registration: For confirmatory analysis (testing a specific theoretical prediction), pre-registering the analysis plan before data collection eliminates the temptation to "p-hack" — searching through alternative specifications until a significant result appears. The American Political Science Review, American Journal of Political Science, and other top journals now publish pre-registered studies, and some offer badges for computational reproducibility.

Open data and code: Sharing analysis code and data (subject to appropriate privacy restrictions) allows other researchers to verify results, extend analyses, and identify errors. GitHub is the standard platform for code sharing; Harvard Dataverse and the ICPSR archive are standard platforms for data sharing.

Workflow documentation: Tools like renv (R) and conda/pip freeze (Python) allow analysts to document and share the exact library versions used in an analysis. Sam documents ODA's pipeline dependencies in a requirements.txt file that is updated before each election cycle.
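The freeze step itself is a one-liner from the shell (`pip freeze > requirements.txt`), but it can also be scripted so it runs as part of the pipeline. A minimal sketch of that practice (the `requirements.txt` filename follows the chapter's convention):

```python
import subprocess
import sys

# Capture the exact version of every installed package in this environment
frozen = subprocess.run(
    [sys.executable, "-m", "pip", "freeze"],
    capture_output=True, text=True, check=True,
).stdout

with open("requirements.txt", "w") as f:
    f.write(frozen)

print(f"Recorded {len(frozen.splitlines())} pinned packages to requirements.txt")
```

A collaborator (or future analyst) can then recreate the environment with `pip install -r requirements.txt`.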

Version control: Using git (or another version control system) to track changes to analysis code means that every version of the code is preserved, changes can be reversed, and collaboration can happen without overwriting each other's work. The progression from "analysis_v1.py" to "analysis_v2_final_FINAL.py" in a folder is not version control — it is chaos.

Best Practice: The NARR Framework for Computational Social Science Schwartz and Ungar proposed the NARR (Narrate, Analyze, Reflect, Refine) framework for computational social science projects. Narrate: write a plain-language description of the research question before touching any data. Analyze: run the analysis. Reflect: explicitly ask what could be wrong with the results — alternative interpretations, data quality issues, measurement concerns. Refine: revise the approach based on reflection. This framework explicitly builds interpretive humility into the research process rather than treating it as an afterthought.


27.13 Political Text Analysis in Practice: A Workflow Summary

Before the chapter summary, it is worth stepping back to describe the complete analytical workflow Sam uses at ODA — not just the individual components, but how they connect into a coherent research process. Understanding the workflow helps analysts avoid the common mistake of treating computational text analysis as a bag of disconnected techniques rather than an integrated methodology.

27.13.1 The Research Design Layer

All computational text analysis begins with a research question — not a data source, not a method, not a visualization. Sam's first step on any new analysis is a one-paragraph plain-language statement: "I want to know whether [specific pattern] exists in [specific corpus], because [specific theoretical or practical reason]." This question determines what corpus is needed, what preprocessing is appropriate, what analytical method is suited, and what findings would constitute an answer.

Analysts who start with data and look for patterns — the exploratory approach — can still be rigorous, but they must be honest about the difference between exploratory and confirmatory analysis. Exploratory findings are hypotheses for subsequent investigation, not confirmed conclusions. Sam distinguishes clearly between ODA's exploratory work (generating hypotheses for future research) and confirmatory analysis (testing a specific pre-specified hypothesis).

27.13.2 The Data Layer

Corpus construction is a substantive methodological choice. What documents are included, what documents are excluded, what time period is covered, what variation is present — these decisions shape what can be found. A corpus that includes only rally speeches will show a different picture of political language than one that includes the full range of communication contexts. A corpus that overrepresents large, populous states will show a different picture than one balanced across competitive races.

For the ODA speech corpus, Sam documents: selection criteria (competitive congressional races in six states over two election cycles), collection method (campaign websites, C-SPAN transcripts, local news archives), and known gaps (town halls in non-English languages were not included due to translation resource constraints, which may affect the picture in high-Hispanic-population districts).

27.13.3 The Analysis Layer

The analysis layer is where the pipeline from this chapter lives: preprocessing, frequency analysis, sentiment analysis, readability, n-grams, topic modeling, classification. The key principle organizing this layer is the hierarchy from description to inference. Descriptive analyses (what is the distribution of sentiment scores?) should be conducted and understood before inferential analyses (does the sentiment-accuracy correlation hold after controlling for source type?). Rushing to inference on poorly characterized data is a common source of errors.

Sam also emphasizes parallel processing of multiple methods. Running VADER sentiment analysis AND word frequency analysis AND LDA topic modeling on the same corpus produces multiple complementary views of the same data. When they converge (sentiment analysis and topic modeling both find that immigration-focused speeches are more negative than healthcare-focused speeches), the finding is more robust. When they diverge, the divergence is informative — it means the measures are capturing different things, and understanding why they diverge is analytically valuable.

27.13.4 The Communication Layer

The final layer — how findings are communicated — is as important as the analysis itself. Sam's reports follow a consistent structure:

  1. The question: A plain-language statement of what was being investigated
  2. The method: A brief description of the data source, preprocessing choices, and analytical approach — enough for a reader to understand what was done and why
  3. The findings: Results in plain language, with specific numbers and appropriate uncertainty acknowledgment
  4. What this cannot tell us: An explicit list of the limitations specific to this analysis
  5. What's next: What follow-up analysis would be needed to strengthen the conclusions

This structure — borrowed and adapted from open science pre-registration templates — ensures that limitations are not buried at the end of a report after readers have formed impressions from the findings. It also protects ODA's credibility: an organization known for honest limitation acknowledgment is harder to attack when a finding turns out to be more complicated than it initially appeared.


27.14 Summary

Computational text analysis gives political analysts powerful tools for characterizing large quantities of political communication — tools that would be practically impossible to apply at the same scale with human coding alone. The pipeline developed in this chapter — preprocessing, word frequency analysis, VADER sentiment analysis, readability scoring, n-gram analysis, topic modeling, and classification — addresses questions that matter for understanding political campaigns, media environments, and information ecosystems.

Sam's analysis of the ODA datasets found:

- Systematic partisan differences in word choice, readability, and sentiment patterns, consistent with decades of political communication research
- Hyperpartisan media outlets showing extreme sentiment variance compared to local and national outlets, with articles carrying lower fact-check ratings showing higher emotional extremity
- A measurable correlation between readability (simpler language) and populism score — plain speech functioning as a political brand signal
- A partisan classifier achieving approximately 78% cross-validated accuracy — well above chance but far from perfect, and explicitly limited to historical pattern detection rather than normative assessment
- The readability-populism interaction (high-populism candidates showing wider grade-level ranges across event types) as a novel finding that merited careful interpretive caution

Every output came with an explicit acknowledgment of what it cannot tell us: not speaker intent, not audience effect, not electoral consequence, not whether any communication is appropriate or authentic. The gap between the computational model and the social phenomenon it represents is not a bug to be engineered away — it is a fundamental feature of quantitative social science that honest analysts account for in every report they produce.

The computational tools in this chapter become most powerful when combined with the analytical and theoretical frameworks introduced in earlier chapters: the persuasion research of Chapter 23, the misinformation dynamics of Chapter 26, and the campaign strategy contexts of Chapters 28–32. Text analysis is not a substitute for political science — it is an instrument for making political science questions answerable at scale.

The next chapter turns to the applied context: how data-driven campaigns use text analysis, voter targeting, and integrated data pipelines to plan and execute electoral strategy.


Key Terms

Tokenization — The process of splitting text into individual units (tokens), typically words or sub-word units, as a prerequisite for most computational text analysis.

Lemmatization — Reducing a word to its base dictionary form using morphological analysis (e.g., "running" → "run," "better" → "good").

Stemming — Reducing a word to its root form by rule (e.g., "running" → "run"), less linguistically precise than lemmatization but faster.

VADER — Valence Aware Dictionary and sEntiment Reasoner; a rule-based sentiment analysis tool calibrated for informal text, producing scores from -1 (most negative) to +1 (most positive).

TF-IDF — Term Frequency-Inverse Document Frequency; a weighting scheme that gives higher weight to terms that are frequent in a document but rare across the corpus.

LDA (Latent Dirichlet Allocation) — A generative probabilistic model that represents documents as mixtures of topics, where topics are distributions over vocabulary words.

N-gram — A contiguous sequence of n items from a text; bigrams are 2-word sequences, trigrams are 3-word sequences.

Flesch-Kincaid Grade Level — A readability formula that estimates the US school grade level required to understand a text, based on average sentence length and average word length.

Construct validity — The degree to which a measurement procedure actually measures the theoretical construct it is intended to measure.

Corpus — A structured collection of text documents used as the basis for computational analysis.


27.15 When the Tools Fail: Limitations and Failure Modes

Sam Harding's epistemically humble approach to computational text analysis is grounded in direct experience with the ways these tools break down. Each technique in this chapter has characteristic failure modes that analysts must understand to avoid being misled by their own output.

27.15.1 When VADER Fails

VADER is a capable tool, but it was designed for social media text — short, informal, modern English. When applied to political speech, it fails in predictable ways:

Negation scope errors: VADER handles simple negation ("not good" → negative) but fails on complex negation structures. "I do not think anyone in this chamber would deny that our constituents deserve better" contains negation that VADER may misparse, producing a spuriously positive or negative compound score. Long sentences with multiple embedded clauses are particularly vulnerable.

Political jargon and proper nouns: VADER's dictionary was not calibrated for political terminology. "Radical" has a consistently negative valence in the VADER dictionary, which means speeches that use "radical" as a neutral policy descriptor (as in "radical change" in the progressive sense) may score more negatively than the speaker intended. "Freedom" scores highly positive, which inflates positivity scores for any speech that invokes this word regardless of the context in which it is deployed.

Irony, sarcasm, and rhetorical inversion: "My opponent's healthcare plan is so great, I'm sure the 40,000 families who will lose coverage will thank him" will receive a positive compound score from VADER because it contains positive lexical items. Speakers who use irony strategically — a common rhetorical technique in political speech — are systematically mis-scored.

Domain-specific speech norms: Political speech has distinct stylistic conventions that differ from the social media text VADER was calibrated on. Political speeches frequently contain formal, measured language that scores near zero on VADER (neutral) even when they are making sharp attacks — because the attack is delivered through insinuation and implication rather than emotionally valenced vocabulary. Attack ads, by contrast, contain dense emotionally charged vocabulary and tend to score as highly negative even when making factually accurate claims.

A practical diagnostic for VADER failures: run a spot-check on the 20 highest-scoring and 20 lowest-scoring speeches in your corpus. Read them. If the "most positive" speeches do not sound genuinely positive, or the "most negative" speeches include speeches you would not describe as particularly negative, VADER is failing on your specific corpus and you need to either supplement with manual validation or switch to a domain-adapted sentiment tool.
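The spot-check takes only a few lines. A sketch assuming a `speeches` frame with `vader_compound` and `text_excerpt` columns as in the chapter (the rows here are synthetic placeholders for illustration):

```python
import pandas as pd

# Synthetic stand-in for the speeches frame with VADER compound scores
speeches_demo = pd.DataFrame({
    'speech_id': range(100),
    'vader_compound': [(i % 21 - 10) / 10 for i in range(100)],  # spread over [-1, 1]
    'text_excerpt': [f"Speech text {i}..." for i in range(100)],
})

# Pull the 20 highest- and 20 lowest-scoring speeches for manual reading
top20 = speeches_demo.nlargest(20, 'vader_compound')
bottom20 = speeches_demo.nsmallest(20, 'vader_compound')

print("Read these as 'most positive' candidates:")
print(top20[['speech_id', 'vader_compound']].head().to_string(index=False))
print("\nRead these as 'most negative' candidates:")
print(bottom20[['speech_id', 'vader_compound']].head().to_string(index=False))
```

The output is a reading list, not a result: the diagnostic only works if a human actually reads the excerpts and judges whether the scores match the tone.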

27.15.2 When Topic Modeling Misleads

LDA topic modeling is among the most frequently misapplied tools in computational social science. The characteristic failure modes:

The labeling illusion: LDA produces word distributions, not topics with interpretable meaning. When a researcher labels Topic 4 as "healthcare" because it contains words like "insurance," "coverage," and "affordable," they are performing an interpretive act that the algorithm does not validate. A different researcher might label the same topic "economic burden" or "government spending" — all three labels are consistent with the same word distribution. The label is an assertion about meaning that requires human validation, not an output of the algorithm.

Sensitivity to preprocessing choices: LDA topic structures are highly sensitive to preprocessing decisions. Different stopword lists, different minimum document frequency thresholds, different vocabulary sizes, and different choices about whether to include bigrams can produce dramatically different topic structures from the same corpus. Analysts who present LDA results without reporting their preprocessing choices — or who chose preprocessing parameters specifically to produce interpretable topics — are presenting results that may not be reproducible.

The k-selection problem: The number of topics (k) must be specified in advance. Common practice is to try several values of k and select based on topic coherence metrics (such as the UMass or UCI coherence measures) or perplexity on held-out data. However, different k values do not just produce different numbers of topics — they produce qualitatively different topic structures, with higher k values splitting topics that lower k values merged. There is no single "correct" k for most political corpora; the choice reflects a theoretical judgment about the granularity of analysis the researcher seeks.
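The k-sweep can be sketched with held-out perplexity, which sklearn's LDA implementation computes directly (topic-coherence metrics such as UMass require gensim or similar). The document-term matrix here is synthetic Poisson counts purely for illustration; on real data the sweep would use the `dtm` built earlier in the chapter:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
# Synthetic document-term counts: 200 documents over a 50-word vocabulary
dtm_demo = rng.poisson(1.0, size=(200, 50))

train_dtm, heldout_dtm = train_test_split(dtm_demo, test_size=0.25, random_state=42)

# Fit LDA for several candidate k values and compare held-out perplexity
for k in [3, 5, 8, 12]:
    lda = LatentDirichletAllocation(n_components=k, max_iter=15, random_state=42)
    lda.fit(train_dtm)
    print(f"k = {k:2d}: held-out perplexity = {lda.perplexity(heldout_dtm):.1f}")
```

Lower perplexity is conventionally "better," but perplexity often keeps improving as k grows; treat it as one input to a theoretical judgment about granularity, not a decision rule.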

Degenerate topics: In real political corpora, LDA frequently produces "garbage topics" — topics with high-frequency background words that appear across many documents and resist meaningful interpretation. These often appear alongside a topic labeled "procedural language" or similar. The presence of garbage topics is normal; the mistake is attempting to interpret them as if they carry substantive meaning.

⚠️ The Reproducibility Gap in LDA. Because LDA is stochastic (it uses random initialization), running the same model twice with different random seeds may produce different topic structures. This is not just a theoretical concern: two analysts running LDA on the same corpus with the same k but different seeds may produce topics that are superficially similar but differ in their most distinctive words and their document assignments. Best practice: set and report the random seed, run multiple seeds and compare, and report only findings that are robust across seed values. In the ODA pipeline, RANDOM_SEED = 42 is set explicitly for exactly this reason.

27.15.3 When Partisanship Classifiers Mislead

The logistic regression partisan language classifier in Section 27.10 is the most conceptually fraught tool in the chapter, and the one that requires the most explicit limitation acknowledgment.

Training data encodes historical patterns: The classifier is trained on speeches from a specific period (the ODA corpus covers two election cycles). Partisan language patterns change over time; terms that were distinctively Democratic in 2018 may have migrated to Republican vocabulary by 2026, or may have lost partisan distinctiveness. A classifier trained on historical data applied to current text will be systematically biased toward historical patterns, potentially flagging cross-partisan language as partisan based on outdated associations.

False positives in cross-partisan communication: Candidates who deliberately use language associated with the opposing party — as a rhetorical strategy, to appeal to crossover voters, or because their political position genuinely differs from party orthodoxy — will be misclassified by a historical partisan classifier. A moderate Republican who uses language typically associated with Democratic economic messaging will score as "Democratic" on the classifier even though they are a Republican. The classifier measures vocabulary association, not partisan identity.

The ecological fallacy in individual-level application: The classifier was trained on aggregate patterns across many speeches. Applying it to individual speeches or individual paragraphs assumes that the aggregate pattern characterizes individual instances — a form of ecological fallacy. The fact that Democratic speeches, on average, use "community" more frequently than Republican speeches does not mean that any individual Democratic speech that lacks "community" is somehow less Democratic.

Misuse risk: The partisan language classifier, if deployed outside its documented limitations, could be used to flag or suppress "partisan" speech in contexts where the real concern is political viewpoint. Sam's explicit limitation acknowledgment in every report that references the classifier — "This classifier identifies speech patterns historically associated with each party. It does not measure authenticity, sincerity, or appropriateness of political communication" — is not boilerplate: it is a substantive ethical guardrail against misuse.


27.16 Advanced Approaches: A Conceptual Overview

This chapter has focused on tools that are accessible to analysts working with standard Python data science libraries: VADER, LDA, logistic regression on TF-IDF features. The field of computational political text analysis has moved substantially beyond these tools, and students who want to go further should understand the conceptual landscape of more advanced methods.

27.16.1 Word Embeddings

Word embeddings represent each word as a dense vector in a high-dimensional space (typically 50-300 dimensions), where words with similar meanings and usage patterns are represented by vectors that are close together. The canonical embedding models — Word2Vec (Mikolov et al. 2013), GloVe (Pennington et al. 2014), and fastText (Bojanowski et al. 2017) — learn these representations from very large corpora by predicting contextual word co-occurrence.

For political text analysis, word embeddings offer several advantages over the bag-of-words approaches used in this chapter:

Semantic similarity capture: Embeddings capture that "border" and "immigration" are semantically related, that "healthcare" and "insurance" tend to appear in similar contexts. Word frequency methods treat every word as independent; embeddings encode the semantic relationships between words.

Analogical reasoning: Embeddings support analogical reasoning — "Democrat : liberal :: Republican : _?" — that reveals implicit associations in the training corpus. Researchers have used this to study how political concepts are framed across different corpora: does the embedding for "welfare" cluster closer to "dependency" or "assistance" in conservative versus progressive media corpora?

Cross-candidate comparison: Training word embeddings separately on speeches from different candidates or different time periods allows researchers to compare how the same concepts are used differently across those contexts — detecting semantic shifts in political vocabulary.
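The "closeness" these advantages rely on is usually cosine similarity between vectors. A minimal numpy sketch with toy 4-dimensional vectors (hypothetical values — real Word2Vec or GloVe embeddings are typically 50-300 dimensions learned from a corpus):

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy vectors chosen by hand for illustration only.
vec = {
    "border":      np.array([0.9, 0.1, 0.0, 0.2]),
    "immigration": np.array([0.8, 0.2, 0.1, 0.3]),
    "insurance":   np.array([0.1, 0.9, 0.8, 0.0]),
}

print(cosine_similarity(vec["border"], vec["immigration"]))  # high: related terms
print(cosine_similarity(vec["border"], vec["insurance"]))    # low: unrelated terms
```

The same function applied to learned embeddings is what lets "border" and "immigration" score as related even though frequency-based methods treat them as independent tokens.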

The limitation: word embeddings are static representations trained on a fixed corpus. They do not capture context — the word "right" has the same embedding whether it means "right-wing," "correct," or "the right to vote." This limitation is addressed by the transformer models discussed below.

27.16.2 Transformer-Based Models

Transformer architecture, introduced in Vaswani et al.'s 2017 "Attention Is All You Need" paper and operationalized for language understanding in BERT (Devlin et al. 2019), represents the current state of the art for most natural language processing tasks, including political text analysis.

BERT and its derivatives (RoBERTa, DistilBERT, DeBERTa, domain-adapted variants) differ from bag-of-words and word embedding approaches in a fundamental way: they produce contextual representations. The representation of each word depends on the entire surrounding context. "Right" in "the right to vote" and "right-wing politics" receives different representations because the surrounding words are different. This context-dependence is what allows transformer models to substantially outperform lexicon-based approaches on tasks requiring semantic understanding.

For applied political analytics, transformer models offer:

Substantially better sentiment analysis: BERT-based sentiment classifiers, fine-tuned on domain-appropriate labeled data, consistently outperform VADER by 10-20 percentage points on human-labeled political text benchmarks. The improvement is particularly large for irony, negation, and complex sentence structures.

Zero-shot and few-shot classification: Models like facebook/bart-large-mnli, available through the Hugging Face transformers library, can classify text into categories without explicit training data, using natural language descriptions of the categories. A zero-shot classifier can evaluate whether a speech excerpt is "about immigration policy" or "about economic security" by treating these as hypotheses and scoring their textual entailment with the input text.

Named entity recognition and relation extraction: Transformer-based NER models identify specific political actors, organizations, locations, and events in political text with substantially higher accuracy than rule-based approaches.

The practical limitations for most applied political analytics contexts:

Computational requirements: Running transformer models, particularly BERT-scale or larger, requires substantially more computational resources than the scikit-learn approaches in this chapter. A single BERT inference pass on a GPU takes milliseconds; on a standard laptop CPU, it may take seconds per document, making full-corpus analysis on large corpora slow. For the ODA corpus (a few hundred speeches), this is manageable; for a corpus of hundreds of thousands of social media posts, it requires cloud computing resources.

Fine-tuning complexity: The full benefit of transformer models comes from fine-tuning on domain-specific labeled data. A BERT model fine-tuned on 2,000 human-labeled political speech sentiment examples will substantially outperform an off-the-shelf BERT model on political text. Fine-tuning requires labeled data, training expertise, and compute — resources that not all applied political analytics organizations have.

Interpretability: Unlike logistic regression on TF-IDF features, where the most influential features (words) can be directly inspected, transformer model decisions are difficult to interpret. Attention visualization tools exist but do not provide the same direct interpretability as coefficient inspection. For contexts where analysts must explain their methods to non-technical audiences, this interpretability gap matters.

💡 Practical Guidance: When to Use Which Tool. For most applied political analytics work at campaign or advocacy organization scale, VADER with careful domain validation is the appropriate starting point for sentiment analysis. For research projects requiring publication-quality analysis, or for tasks where VADER validation shows significant failure, Hugging Face's pipeline("sentiment-analysis") with a political-domain-adapted model is the next step. Full fine-tuning of BERT-scale models is appropriate for research projects with sufficient labeled data, compute resources, and the need for maximum performance on a specific well-defined task. The choice should be driven by the task requirements and available resources, not by the appeal of using the most sophisticated available tool.


27.17 Presenting Text Analysis Findings to Non-Technical Audiences

The gap between computational analysis and political decision-making is bridged by communication. Sam Harding's analysis has value only to the extent that Adaeze Nwosu and the advocacy partners ODA serves can understand it, trust it, and act on it. The communication challenge is substantial: most political professionals are not trained in computational methods, and many are understandably skeptical of analyses they cannot directly evaluate.

27.17.1 The Translation Problem

When Sam presents findings from the word frequency analysis, the key challenge is translating a quantitative finding ("Democratic speeches show higher rates of the term 'community' by 0.0034 per token, a statistically significant difference at p < 0.001") into a meaningful claim about political communication ("Democratic speeches consistently invoke community and collective action, while Republican speeches invoke individual freedom and national security — a pattern that reflects distinct mobilization strategies for their respective bases").

The translation is not just simplification — it is interpretation. The statistical finding is what the data shows; the interpretation is the claim about what the finding means. Both are necessary, and they must be clearly separated in communication to non-technical audiences. Conflating "the data shows X" with "this means Y" — without explicit acknowledgment that the latter is an interpretive claim — is the most common communication error in applied political analytics.

27.17.2 Visualization Principles for Political Text Analysis

The visualizations in this chapter — bar charts of word frequencies, sentiment distributions, bigram comparisons — are designed for analysts who can read them. For non-technical audiences, different design choices apply:

Lead with the finding, not the method. A visualization titled "Word Frequency Comparison by Party" is a technical description that tells a non-technical reader nothing about why they should care. A visualization titled "Democrats Invoke 'Community' at Three Times the Rate Republicans Do" leads with the substantive finding and lets the chart provide the evidence.

Reduce visual complexity. Bar charts with 30 bars, multiple overlaid color-coded distributions, and statistical annotations are appropriate for analytical documentation but overwhelming for briefing slides. For non-technical audiences, showing the top 5-8 most distinctive terms with a clear visual hierarchy communicates the finding more effectively than a comprehensive chart.

Annotate for interpretation. Adding brief text annotations to key chart elements — "This spike in negative sentiment corresponds to the first debate, where Whitfield attacked on crime" — gives non-technical readers the context they need to interpret a pattern they might otherwise miss.

Pair quantitative and qualitative. Non-technical audiences often trust quantitative findings more when they are accompanied by illustrative examples. After showing the LDA topic model output, include a one-paragraph excerpt from a speech that exemplifies the dominant topic — the concrete example makes the abstract topic representation interpretable.
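The first three principles can be sketched in matplotlib. The numbers, title, and annotation below are illustrative assumptions, not ODA findings:

```python
# Sketch: a briefing-style chart that leads with the finding, reduces
# visual complexity (few bars), and annotates for interpretation.
import matplotlib
matplotlib.use("Agg")  # render without a display, e.g. in a pipeline
import matplotlib.pyplot as plt

terms = ["community", "families", "freedom", "security"]
rate_ratio = [3.1, 1.8, 0.4, 0.5]  # hypothetical D/R usage ratios

fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(terms, rate_ratio)
ax.axvline(1.0, linestyle="--")  # parity line: equal use by both parties
# Title states the finding, not the method.
ax.set_title("Democrats invoke 'community' at three times the rate Republicans do")
ax.set_xlabel("Democratic-to-Republican usage ratio")
ax.annotate("parity", xy=(1.02, 0.1), fontsize=8)
fig.tight_layout()
fig.savefig("frequency_briefing.png")
```

Compare the title here with a method-first alternative like "Word Frequency Comparison by Party": the chart is identical, but the finding-led version tells the reader why to look.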

27.17.3 Calibrating Confidence Communication

One of the most important communication challenges is conveying appropriate confidence levels to non-technical audiences. The limitations section Sam includes in every ODA report — what this analysis cannot tell us — is technically honest but can create a credibility problem if poorly framed: an audience that sees three paragraphs of caveats before a finding may conclude that the finding is unreliable, even when the caveats reflect routine epistemic honesty rather than fundamental problems.

Sam's calibration approach: distinguish between method limitations (this tool has known failure modes in X context, which we have checked for in this dataset), data limitations (this corpus covers these sources and time periods; findings may not generalize beyond them), and inference limitations (this analysis shows correlation, not causation; this finding is descriptive, not predictive). Presenting these three categories separately — rather than as an undifferentiated list of caveats — helps non-technical audiences calibrate how much each limitation should concern them.

📊 Real-World Application: ODA's Briefing Format. Adaeze requires that ODA's externally-facing analytical products follow a four-section structure for non-technical audiences: (1) What We Looked At — a two-sentence description of the data and method; (2) What We Found — the substantive findings in plain language, with charts that lead with findings; (3) What This Means — interpretation and implications for campaign or advocacy strategy; (4) What We're Not Certain About — one paragraph of key limitations, framed as "the questions this analysis cannot answer" rather than as a list of failures. This structure has been tested with actual advocacy partners and campaign staff, and has substantially improved the uptake of analytical findings compared to ODA's earlier technical report format.


27.18 Reproducibility and Sharing Political Text Analysis Code

Political text analysis that influences public understanding of democracy carries a particular obligation to meet high standards of reproducibility and transparency. Sam's pipeline is designed with reproducibility as a first-order concern, not an afterthought.

27.18.1 The Core Reproducibility Checklist

The ODA political text analysis reproducibility checklist, applied to every analysis before publication:

1. Code is version-controlled. Every change to the analysis code is tracked in git. The specific commit hash that produced the published analysis is recorded in the publication metadata. This means that any researcher who wants to replicate the analysis can check out exactly the code version that was used.

2. Dependencies are documented. A requirements.txt file (for pip) or environment.yml file (for conda) captures the exact library versions used in the analysis. This matters because breaking changes between library versions — particularly in sklearn, NLTK, and the VADER dictionary — can produce different results from the same code.
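An illustrative pinned requirements.txt for a pipeline like this chapter's (version numbers are hypothetical — pin whatever `pip freeze` reports for the environment that actually produced the analysis):

```
# requirements.txt — exact versions, not ranges
nltk==3.8.1
scikit-learn==1.3.2
pandas==2.1.4
vaderSentiment==3.3.2
matplotlib==3.8.2
```

Ranges like `scikit-learn>=1.0` defeat the purpose: a replicator installing a later version may silently get different results.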

3. Random seeds are set and reported. Every stochastic operation in the pipeline — LDA, train/test split, classifier training, any sampling — uses an explicitly set random seed that is reported in the analysis documentation. Sam's convention: RANDOM_SEED = 42 is defined once at the top of the notebook and passed to every function that accepts a seed parameter.

4. Data provenance is documented. The corpus documentation specifies: which sources are included, what collection method was used, what date range is covered, what known gaps exist. Data files are stored with checksums so that inadvertent modifications to the source data are detectable.
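Checksum verification can be sketched with the standard library. The data path is hypothetical:

```python
# Sketch: record a SHA-256 checksum for a raw data file at collection time,
# then re-verify it before each analysis run so modifications are detectable.
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file in chunks so large corpora don't load into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

data_file = Path("data/raw/oda_speeches.csv")  # hypothetical path
if data_file.exists():
    recorded = sha256_of(data_file)  # store this in the corpus documentation
    assert sha256_of(data_file) == recorded, "source data has changed"
```

Storing the recorded hashes in version control alongside the code means any drift in the raw data fails loudly rather than silently shifting results.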

5. Results are validated against held-out data. Classification models and automated coding approaches report accuracy on held-out test sets, not just training accuracy. LDA topic coherence is reported with multiple metrics. Where human validation was conducted, the sample size and inter-rater agreement are reported.

27.18.2 Open Code and Data Practices

ODA's policy is to publish analysis code for all externally-facing work. Code is published to a GitHub repository under a Creative Commons license with attribution requirement. Data publishing follows a tiered approach based on privacy sensitivity:

Tier 1 (Fully open): Corpora of public political speech — transcripts, official statements, press releases — are published alongside the analysis code without restriction.

Tier 2 (Open with documentation): Corpora that include social media data are published with documentation of collection method and date, subject to platform terms of service. Individual-level social media data is typically not publishable under platform ToS; aggregate statistics and anonymized samples may be.

Tier 3 (Restricted): Any data that includes commercially licensed content (LexisNexis archives, commercial media monitoring data) cannot be freely republished. For analyses using restricted data, ODA publishes the code and a synthetic dataset that produces approximately the same analytical results — allowing researchers to verify the analytical pipeline without requiring access to the original data.

27.18.3 Building a Reproducible Pipeline Directory Structure

Sam's standard directory structure for a political text analysis project:

oda_text_analysis_2026/
├── README.md                  # Project description, setup instructions
├── requirements.txt           # Python dependencies with pinned versions
├── data/
│   ├── raw/                   # Original data files (never modified)
│   │   ├── oda_speeches.csv
│   │   └── oda_media.csv
│   ├── processed/             # Cleaned, preprocessed data
│   └── synthetic/             # Synthetic data for open publication
├── notebooks/
│   ├── 01_data_loading.ipynb
│   ├── 02_preprocessing.ipynb
│   ├── 03_frequency_analysis.ipynb
│   ├── 04_sentiment_analysis.ipynb
│   ├── 05_readability.ipynb
│   ├── 06_topic_modeling.ipynb
│   ├── 07_classification.ipynb
│   └── 08_summary_report.ipynb
├── src/
│   ├── preprocessing.py       # Reusable preprocessing functions
│   ├── sentiment.py           # Sentiment analysis utilities
│   ├── topic_model.py         # LDA wrapper with reproducibility settings
│   └── visualization.py      # Shared visualization functions
├── output/
│   ├── figures/               # All saved charts (with filenames matching notebook)
│   └── reports/               # Generated analytical reports
└── CHANGELOG.md               # Log of significant changes to analysis approach

This structure separates raw data from processed data (protecting against accidental modification of source files), separates analysis notebooks from reusable functions (making the code testable and shareable), and makes the output directory structure correspond to the analysis sequence.
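The skeleton can be created programmatically so every new project starts from the same layout. A minimal sketch using only the standard library, following the structure shown above:

```python
# Sketch: create the project skeleton from the directory layout above.
from pathlib import Path

def make_project_skeleton(root: str) -> Path:
    base = Path(root)
    for sub in [
        "data/raw", "data/processed", "data/synthetic",
        "notebooks", "src", "output/figures", "output/reports",
    ]:
        (base / sub).mkdir(parents=True, exist_ok=True)
    # Top-level files; contents are filled in per project.
    for name in ("README.md", "requirements.txt", "CHANGELOG.md"):
        (base / name).touch()
    return base

make_project_skeleton("oda_text_analysis_2026")
```

Scripting the layout also makes it easy to enforce the raw/processed separation: analysis code can refuse to write anywhere under data/raw/.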

Best Practice: The Binder/Colab Reproducibility Test. Sam's personal reproducibility standard: before publishing any analysis, create a fresh virtual environment with only the packages in requirements.txt, run the notebooks in order from a clean state, and verify that the outputs match the published figures. Tools like Binder and Google Colab allow others to do this without installing anything locally — including a Binder launch badge in the repository README allows any reader to run the analysis in the cloud with a single click. This "push-button reproducibility" standard is increasingly expected for computational social science publications in top political science and communication journals, and is good practice for applied analytics organizations regardless of publication intentions.


27.19 Additional Key Terms

Word embeddings — Dense vector representations of words that capture semantic similarity and contextual relationships, learned from large text corpora through neural network training.

BERT (Bidirectional Encoder Representations from Transformers) — A transformer-based language model that generates contextual representations of words based on their full surrounding context, substantially outperforming earlier approaches on most NLP tasks.

Zero-shot classification — The application of a pre-trained language model to classify text into categories without explicit training on labeled examples of those categories.

Construct validity (extended for computational text analysis) — The degree to which a computational measure — a VADER score, a topic proportion, a classifier output — actually corresponds to the theoretical construct (sentiment, topical emphasis, partisanship) it is intended to measure.

Ecological fallacy — The error of applying aggregate statistical patterns to predict or explain individual instances, common when group-level classifier patterns are applied to individual documents.

Pre-registration — The practice of publicly specifying an analysis plan, including hypotheses, data sources, and analytical methods, before data collection begins, to prevent post-hoc specification of analyses designed to produce significant results.

Reproducibility — The ability to obtain the same analytical results from the same code, data, and documentation — a minimum standard for credible computational social science.