
Learning Objectives

  • Load and preprocess the ODA speeches dataset for text analysis
  • Engineer populist language features based on the ideational definition
  • Validate engineered features against the existing populism_score column
  • Build and evaluate a binary populist rhetoric classifier
  • Analyze feature importance to understand which linguistic signals matter most
  • Track rhetorical change over time using time-series visualization
  • Compare populism scores across parties and office types
  • Critically interpret what text classifiers can and cannot tell us about political rhetoric
  • Extend the classifier to new contexts and languages
  • Understand the ethical responsibilities of building rhetoric measurement systems

Chapter 37: Tracking Populist Rhetoric (Python Lab)

Sam Harding stared at the screen displaying ODA's speech database — 14,782 rows, each one a political speech, testimony, rally address, or debate clip collected since 2018. The populism_score column had been computed by a previous research assistant using a method documented only in a PDF buried in a shared drive. Sam's job was to figure out whether that score was measuring what it claimed to measure, build a better measurement approach, and apply it to a specific problem: were speeches that structurally resembled Tom Whitfield's rhetorical style becoming more common in Republican Senate campaigns over time?

"The interesting question isn't whether Whitfield is populist," Sam said at an ODA team meeting. "We can see that by reading five minutes of his speeches. The interesting question is whether his type of populism — the specific linguistic patterns he uses — is spreading. Are other candidates learning from his playbook? Is there a Whitfield rhetorical school developing?"

That question requires the quantitative text analysis tools we develop in this chapter. Building a populist rhetoric classifier is not just a technical exercise — it is a practical application of the analytical framework from Chapter 34, translated into working code. Every methodological choice we make in this lab embodies a conceptual decision about what populism is and how it manifests in language.

This chapter walks you through the complete pipeline: loading and understanding the ODA dataset, engineering features grounded in the ideational definition of populism, exploring the existing populism_score and its construction, building a classifier, analyzing results, and — crucially — maintaining the critical perspective on what the classifier can and cannot tell us.


37.1 The ODA Speeches Dataset

Before writing any analysis code, we need to understand our data. This is not a step to skip.

37.1.1 Dataset Overview

The oda_speeches.csv file contains the following columns:

Column           Type     Description
speech_id        string   Unique identifier (e.g., "SP-2019-0042")
date             string   Date of speech (YYYY-MM-DD format)
speaker          string   Speaker's full name
party            string   Political party (D, R, I, L, G, Other)
office           string   Speaker's current or sought office
event_type       string   Type of event (rally, debate, floor_speech, press_conference, town_hall, interview, testimony, other)
state            string   State (two-letter code)
word_count       int      Total word count of the speech
text_excerpt     string   500-word excerpt from the speech
full_text        string   Complete speech text (may be null for some records)
populism_score   float    Pre-computed populism score (0.0–1.0 scale)

Key data characteristics:

  • 14,782 speeches from 2018 through early 2026
  • 847 unique speakers
  • 50 states represented
  • Median word count: 1,847 words; mean word count: 2,341 words (right-skewed)
  • Populism score range: 0.0–1.0; median: 0.21; mean: 0.24 (right-skewed)
  • Approximately 23% of records have null full_text (only text_excerpt available)

37.1.2 Loading and Initial Exploration

# example-01-populism-features.py
# ODA Populist Rhetoric Classifier — Part 1: Feature Engineering
# OpenDemocracy Analytics | Sam Harding, Data Journalist
# Chapter 37: Tracking Populist Rhetoric

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import re
from collections import Counter
import warnings
warnings.filterwarnings('ignore')

# ─────────────────────────────────────────────
# 1. LOAD AND INSPECT THE DATASET
# ─────────────────────────────────────────────

df = pd.read_csv('oda_speeches.csv', parse_dates=['date'])

print("=" * 60)
print("ODA SPEECHES DATASET — INITIAL INSPECTION")
print("=" * 60)
print(f"Shape: {df.shape[0]:,} rows × {df.shape[1]} columns")
print(f"\nDate range: {df['date'].min().date()} to {df['date'].max().date()}")
print(f"\nColumn dtypes:\n{df.dtypes}")
print(f"\nMissing values:\n{df.isnull().sum()}")

Expected output:

============================================================
ODA SPEECHES DATASET — INITIAL INSPECTION
============================================================
Shape: 14,782 rows × 11 columns

Date range: 2018-01-03 to 2026-02-28

Column dtypes:
speech_id       object
date            datetime64[ns]
speaker         object
party           object
office          object
event_type      object
state           object
word_count      int64
text_excerpt    object
full_text       object
populism_score  float64
dtype: object

Missing values:
speech_id         0
date              0
speaker           0
party             0
office            4
event_type        0
state             2
word_count        0
text_excerpt      0
full_text      3,401
populism_score     0
dtype: int64

37.2 Exploring the Existing Populism Score

Before building anything new, we want to understand what the existing populism_score column is actually measuring. This is responsible data science: never inherit a metric without investigating its construction.

37.2.1 Score Distribution Analysis

# ─────────────────────────────────────────────
# 2. UNDERSTANDING THE EXISTING POPULISM_SCORE
# ─────────────────────────────────────────────

# Distribution of existing populism scores
fig, axes = plt.subplots(1, 3, figsize=(15, 5))
fig.suptitle("ODA populism_score: Distribution and Patterns", fontsize=14, fontweight='bold')

# Overall distribution
axes[0].hist(df['populism_score'], bins=50, color='steelblue', edgecolor='white', alpha=0.8)
axes[0].axvline(df['populism_score'].median(), color='red', linestyle='--',
                label=f"Median: {df['populism_score'].median():.3f}")
axes[0].axvline(df['populism_score'].mean(), color='orange', linestyle='-.',
                label=f"Mean: {df['populism_score'].mean():.3f}")
axes[0].set_xlabel('Populism Score')
axes[0].set_ylabel('Count')
axes[0].set_title('Overall Distribution')
axes[0].legend()

# Score by party
party_scores = df.groupby('party')['populism_score'].agg(['mean', 'median', 'count'])
party_scores = party_scores[party_scores['count'] >= 50].sort_values('mean', ascending=False)
axes[1].barh(party_scores.index, party_scores['mean'], color='coral', edgecolor='white')
axes[1].set_xlabel('Mean Populism Score')
axes[1].set_title('Mean Score by Party\n(parties with ≥50 speeches)')

# Score by event type
event_scores = df.groupby('event_type')['populism_score'].mean().sort_values(ascending=True)
axes[2].barh(event_scores.index, event_scores.values, color='mediumseagreen', edgecolor='white')
axes[2].set_xlabel('Mean Populism Score')
axes[2].set_title('Mean Score by Event Type')

plt.tight_layout()
plt.savefig('populism_score_distribution.png', dpi=150, bbox_inches='tight')
plt.show()

# Descriptive statistics by party
print("\nPopulism Score by Party (parties with ≥50 speeches):")
print(party_scores.to_string())

# Check for temporal trend in raw scores
df_sorted = df.sort_values('date')
df_sorted['year'] = df_sorted['date'].dt.year
yearly_scores = df_sorted.groupby('year')['populism_score'].agg(['mean', 'median', 'count'])
print("\nYearly Populism Score Trends:")
print(yearly_scores.to_string())

37.2.2 Reverse-Engineering the Score

The documentation for the populism_score is sparse. From the PDF Sam found, the score appears to be a normalized weighted combination of:

  1. A word-frequency-based anti-elite term count (Rooduijn-Pauwels style dictionary)
  2. A people-centric term frequency measure
  3. A "contrast ratio" — the proportion of sentences containing explicit binary contrast framing

Let's test this hypothesis by computing our own versions of these components and comparing to the existing score.

# ─────────────────────────────────────────────
# 3. REVERSE-ENGINEERING THE POPULISM_SCORE
# ─────────────────────────────────────────────

# Populism dictionaries (following Rooduijn-Pauwels framework, extended for US political speech)
ANTI_ELITE_TERMS = [
    'elite', 'elites', 'establishment', 'mainstream media', 'fake news',
    'corrupt', 'corruption', 'politicians', 'washington', 'deep state',
    'bureaucrat', 'bureaucrats', 'special interests', 'lobbyist', 'lobbyists',
    'ruling class', 'political class', 'donor class', 'swamp', 'drain the swamp',
    'out of touch', 'ivory tower', 'so-called experts', 'experts', 'technocrat',
    'globalist', 'globalists', 'wall street', 'bankers', 'billionaire class',
    'political establishment', 'political elite', 'media elite', 'coastal elite',
    'academic elite', 'failed leadership', 'career politician', 'insider',
    'power brokers', 'oligarchs', 'plutocrat'
]

PEOPLE_CENTRIC_TERMS = [
    'the people', 'ordinary people', 'working people', 'working families',
    'regular americans', 'everyday americans', 'hard-working', 'hardworking',
    'real america', 'real americans', 'main street', 'grassroots',
    'average american', 'middle class', 'working class', 'forgotten',
    'left behind', 'common sense', "people's", 'will of the people',
    'voice of the people', 'silent majority', 'taxpayers', 'voters',
    'citizens', 'neighbors', 'communities', 'local', 'families',
    'small business', 'small businesses', 'farmers', 'workers'
]

MANICHEAN_TERMS = [
    'us vs them', 'us versus them', 'enemies', 'traitors', 'radical left',
    'radical right', 'extremist', 'dangerous', 'destroy', 'fight back',
    'take back', 'reclaim', 'wake up', 'enough is enough', 'no more',
    'this ends', 'stand up', 'rise up', 'good vs evil', 'truth vs lies',
    'either you are with', 'time to choose', 'clear choice', 'simple choice',
    'common enemy', 'real enemy', 'threat to', 'existential threat'
]


def compute_term_density(text, term_list):
    """
    Compute the density of a term list in a text.
    Returns the proportion of sentences containing at least one term.
    Uses text_excerpt (500 words) as the default analysis unit.
    """
    if not isinstance(text, str) or len(text.strip()) == 0:
        return 0.0
    text_lower = text.lower()
    words = text_lower.split()
    if len(words) < 10:
        return 0.0

    # Count sentences containing at least one term.
    # Approximate sentence boundaries by splitting on period/exclamation/question marks
    sentences = re.split(r'[.!?]+', text_lower)
    sentences = [s.strip() for s in sentences if len(s.strip()) > 10]
    if not sentences:
        return 0.0

    hits = 0
    for sentence in sentences:
        for term in term_list:
            if term in sentence:
                hits += 1
                break  # Count sentence once even if multiple terms match

    return hits / len(sentences)


def compute_contrast_ratio(text):
    """
    Compute the proportion of sentences containing explicit binary contrast framing.
    Looks for patterns like 'X vs Y', 'either X or Y', 'not X but Y', etc.
    """
    if not isinstance(text, str):
        return 0.0
    text_lower = text.lower()
    sentences = re.split(r'[.!?]+', text_lower)
    sentences = [s.strip() for s in sentences if len(s.strip()) > 10]
    if not sentences:
        return 0.0

    # Contrast patterns
    contrast_patterns = [
        r'\b(vs|versus)\b',
        r'\b(either|neither)\b.{1,50}\b(or|nor)\b',
        r'\bnot\b.{1,30}\bbut\b',
        r'\binstead of\b',
        r'\breal\b.{1,20}\b(fake|phony|corrupt)\b',
        r'\b(us|we|our)\b.{1,30}\b(them|they|their|those)\b',
    ]

    hits = 0
    for sentence in sentences:
        for pattern in contrast_patterns:
            if re.search(pattern, sentence):
                hits += 1
                break

    return hits / len(sentences)


# Apply feature computation to text_excerpt (available for all records)
print("Computing populism features... (this may take 60-90 seconds)")

df['anti_elite_density'] = df['text_excerpt'].apply(
    lambda x: compute_term_density(x, ANTI_ELITE_TERMS)
)
df['people_centric_density'] = df['text_excerpt'].apply(
    lambda x: compute_term_density(x, PEOPLE_CENTRIC_TERMS)
)
df['manichean_density'] = df['text_excerpt'].apply(
    lambda x: compute_term_density(x, MANICHEAN_TERMS)
)
df['contrast_ratio'] = df['text_excerpt'].apply(compute_contrast_ratio)

print("Feature computation complete.")
print(f"\nFeature summary statistics:")
feature_cols = ['anti_elite_density', 'people_centric_density', 'manichean_density', 'contrast_ratio']
print(df[feature_cols].describe().round(4))

# Correlation with existing populism_score
print("\nCorrelation with existing populism_score:")
for col in feature_cols:
    corr = df['populism_score'].corr(df[col])
    print(f"  {col}: r = {corr:.3f}")

Expected output (approximate):

Feature summary statistics:
       anti_elite_density  people_centric_density  manichean_density  contrast_ratio
count           14782.000               14782.000          14782.000       14782.000
mean                0.082                   0.143              0.031           0.124
std                 0.091                   0.087              0.048           0.089
min                 0.000                   0.000              0.000           0.000
25%                 0.021                   0.082              0.000           0.063
50%                 0.056                   0.127              0.011           0.104
75%                 0.112                   0.192              0.041           0.167
max                 0.714                   0.612              0.389           0.583

Correlation with existing populism_score:
  anti_elite_density: r = 0.623
  people_centric_density: r = 0.418
  manichean_density: r = 0.571
  contrast_ratio: r = 0.489

The correlations tell us the existing score is substantially driven by anti-elite language and Manichean framing, with people-centric language playing a smaller role. This is consistent with the Rooduijn-Pauwels framework but weighted toward the "elite critique" dimension.

💡 Intuition: Correlation as Hypothesis Test

When you correlate your engineered features with an existing score, you're running a hypothesis test about the score's construction. Strong positive correlations (r > 0.4) suggest your feature captures something the existing score was trying to measure. Weak or negative correlations suggest either your feature is measuring something different or the existing score has issues. In this case, the moderately strong but imperfect correlations tell us: yes, the existing score is related to our dictionary-based features, but it's not simply a linear combination of them — there's something else (perhaps sentence structure, punctuation patterns, or additional dictionary terms) we're not capturing.
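One concrete way to run this "hypothesis test" is to regress the existing score on the engineered features and inspect R². A minimal sketch on synthetic data (the feature matrix and score here are invented stand-ins; the real check would use the four density columns and df['populism_score']):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 1000
X = rng.uniform(0, 0.5, size=(n, 4))          # four dictionary-style features
noise = rng.normal(0, 0.05, n)
# A score that is partly linear in the features, partly something else
score = np.clip(0.6 * X[:, 0] + 0.4 * X[:, 2] + 0.3 * np.sin(6 * X[:, 1]) + noise, 0, 1)

reg = LinearRegression().fit(X, score)
r2 = reg.score(X, score)
print(f"R² of linear fit: {r2:.3f}")
# An R² well below 1.0 means the score is not just a weighted sum of these
# features; unexplained structure remains, as with the ODA score.
```

An R² near 1.0 would indicate the existing score is essentially a reweighting of your features; anything substantially lower points to ingredients your dictionaries miss.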


37.3 Feature Engineering for Populist Rhetoric

With the basic framework established, we can engineer a richer feature set that goes beyond simple term counting.

37.3.1 Additional Features

# ─────────────────────────────────────────────
# 4. EXTENDED FEATURE ENGINEERING
# ─────────────────────────────────────────────

def compute_second_person_density(text):
    """
    Measure direct address to audience ("you", "your", "yours").
    Populist appeals often shift from third-person elite description to
    second-person direct address to "the people."
    """
    if not isinstance(text, str):
        return 0.0
    text_lower = text.lower()
    words = text_lower.split()
    if not words:
        return 0.0
    second_person = sum(1 for w in words if w in {'you', 'your', 'yours', "you're", "you've", "you'll"})
    return second_person / len(words)


def compute_first_person_plural_density(text):
    """
    Measure we/us/our language — people-group construction.
    High first-person-plural with low first-person-singular is a populist signal.
    """
    if not isinstance(text, str):
        return 0.0
    text_lower = text.lower()
    words = text_lower.split()
    if not words:
        return 0.0
    first_plural = sum(1 for w in words if w in {'we', 'us', 'our', 'ours', "we're", "we've", "we'll"})
    first_singular = sum(1 for w in words if w in {'i', 'me', 'my', 'mine', "i'm", "i've", "i'll"})
    # Return the plural share of all personal pronouns (bounded 0–1)
    total_personal = first_plural + first_singular
    if total_personal == 0:
        return 0.0
    return first_plural / total_personal


def compute_common_sense_appeals(text):
    """
    Measure appeals to "common sense" vs expertise — a key populist epistemic move.
    """
    if not isinstance(text, str):
        return 0.0
    text_lower = text.lower()
    phrases = [
        'common sense', 'every american knows', 'everyone knows', 'obviously',
        'any fool can see', "you don't need to be", 'simple truth',
        'plain truth', 'plain and simple', 'no-brainer', 'wake up',
        'open your eyes', 'just look', 'anyone can see'
    ]
    count = sum(1 for phrase in phrases if phrase in text_lower)
    sentences = len(re.split(r'[.!?]+', text_lower))
    return min(count / max(sentences, 1), 1.0)


def compute_urgency_intensity(text):
    """
    Measure urgency/emergency language — characteristic of populist mobilization.
    """
    if not isinstance(text, str):
        return 0.0
    text_lower = text.lower()
    words = text_lower.split()
    if not words:
        return 0.0
    # Exclamation marks (normalized by length)
    exclamation_density = text.count('!') / max(len(words), 1) * 100
    # Urgency words
    urgency_terms = [
        'now', 'immediately', 'urgent', 'crisis', 'emergency', 'critical',
        'must', 'cannot wait', 'no time', 'last chance', 'final opportunity',
        'never before', 'historic', 'unprecedented', 'once in a lifetime',
        'everything', 'nothing', 'always', 'never', 'completely', 'totally',
        'absolutely', 'every single', 'all of them', 'none of them'
    ]
    # Match whole words/phrases only, so 'now' does not fire inside 'know'
    urgency_count = sum(
        1 for term in urgency_terms
        if re.search(r'\b' + re.escape(term) + r'\b', text_lower)
    )
    urgency_density = urgency_count / len(words)
    return (exclamation_density * 0.3 + urgency_density * 0.7)


# Apply extended features
print("Computing extended features...")

df['second_person_density'] = df['text_excerpt'].apply(compute_second_person_density)
df['plural_pronoun_ratio'] = df['text_excerpt'].apply(compute_first_person_plural_density)
df['common_sense_appeals'] = df['text_excerpt'].apply(compute_common_sense_appeals)
df['urgency_intensity'] = df['text_excerpt'].apply(compute_urgency_intensity)

print("Extended feature computation complete.")

# Examine the full feature set
all_features = [
    'anti_elite_density', 'people_centric_density', 'manichean_density',
    'contrast_ratio', 'second_person_density', 'plural_pronoun_ratio',
    'common_sense_appeals', 'urgency_intensity'
]

print("\nFull feature correlation with populism_score:")
correlations = {}
for col in all_features:
    corr = df['populism_score'].corr(df[col])
    correlations[col] = corr
    print(f"  {col:35s}: r = {corr:.3f}")

# Feature correlation heatmap
feature_corr_matrix = df[all_features + ['populism_score']].corr()
plt.figure(figsize=(10, 8))
mask = np.zeros_like(feature_corr_matrix, dtype=bool)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(feature_corr_matrix, annot=True, fmt='.2f', cmap='RdYlGn',
            center=0, mask=mask, square=True,
            cbar_kws={'shrink': 0.8})
plt.title('Feature Correlation Matrix\n(including populism_score)', fontsize=13)
plt.tight_layout()
plt.savefig('feature_correlation_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()

37.4 Building the Populist Rhetoric Classifier

With features engineered and understood, we can build a classifier. The goal is a binary classifier: populist speech vs. non-populist speech. We'll use the existing populism_score to create our binary labels (with a deliberate threshold decision that we examine critically), then train and evaluate a logistic regression model.

37.4.1 Label Creation and the Threshold Decision

# ─────────────────────────────────────────────────────────────
# EXAMPLE 2: BUILDING THE CLASSIFIER
# example-02-rhetoric-classifier.py
# ─────────────────────────────────────────────────────────────

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, cross_val_score, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (classification_report, confusion_matrix,
                             roc_auc_score, roc_curve, precision_recall_curve)
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

# Load the dataset with features computed in example-01
# (In a real workflow, save to parquet between scripts; here we re-compute)
df = pd.read_csv('oda_speeches.csv', parse_dates=['date'])
# [Feature computation code from example-01 would run here]
# For brevity, assume all features are already in df

# ─────────────────────────────────────────────
# CRITICAL DECISION: Setting the populism threshold
# ─────────────────────────────────────────────

print("=" * 60)
print("THRESHOLD ANALYSIS: CREATING BINARY LABELS")
print("=" * 60)

# Examine the distribution to inform threshold choice
print(f"\npopulism_score percentiles:")
for pct in [50, 60, 65, 70, 75, 80, 85, 90, 95]:
    val = df['populism_score'].quantile(pct / 100)
    print(f"  {pct}th percentile: {val:.3f}")

# The existing score is right-skewed; most speeches score low.
# Two defensible threshold choices:
#   Option A: Fixed threshold at 0.40 (treats scores ≥0.40 as "clearly populist")
#   Option B: Percentile threshold at 75th percentile (top quarter of scores as "populist")
#
# We'll use Option A (0.40) as theoretically grounded, but flag both for transparency.

POPULISM_THRESHOLD = 0.40

df['is_populist'] = (df['populism_score'] >= POPULISM_THRESHOLD).astype(int)

print(f"\nUsing threshold: {POPULISM_THRESHOLD}")
print(f"Populist speeches (score >= {POPULISM_THRESHOLD}): {df['is_populist'].sum():,} "
      f"({df['is_populist'].mean()*100:.1f}%)")
print(f"Non-populist speeches: {(df['is_populist']==0).sum():,} "
      f"({(1-df['is_populist'].mean())*100:.1f}%)")

# ─────────────────────────────────────────────
# FEATURE MATRIX AND TRAIN/TEST SPLIT
# ─────────────────────────────────────────────

feature_cols = [
    'anti_elite_density', 'people_centric_density', 'manichean_density',
    'contrast_ratio', 'second_person_density', 'plural_pronoun_ratio',
    'common_sense_appeals', 'urgency_intensity'
]

X = df[feature_cols].fillna(0)
y = df['is_populist']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

print(f"\nTrain set: {len(X_train):,} speeches")
print(f"Test set: {len(X_test):,} speeches")
print(f"Class balance in train: {y_train.mean()*100:.1f}% populist")
print(f"Class balance in test: {y_test.mean()*100:.1f}% populist")
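Option B from the comments above is easy to put beside Option A. A small sensitivity sketch, using synthetic right-skewed scores in place of df['populism_score'] (the Beta parameters are invented to mimic the skew described in 37.1.1):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
# Synthetic right-skewed scores standing in for df['populism_score']
scores = pd.Series(np.clip(rng.beta(1.5, 5, 10_000), 0, 1))

label_a = (scores >= 0.40).astype(int)                   # Option A: fixed threshold
label_b = (scores >= scores.quantile(0.75)).astype(int)  # Option B: 75th percentile

agreement = (label_a == label_b).mean()
print(f"Option A positive rate: {label_a.mean():.3f}")
print(f"Option B positive rate: {label_b.mean():.3f}")
print(f"Label agreement:        {agreement:.3f}")
```

If agreement is high, the downstream trend is unlikely to hinge on the threshold choice; if it is low, both labelings should be reported.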

37.4.2 Model Training and Comparison

# ─────────────────────────────────────────────
# MODEL TRAINING AND COMPARISON
# ─────────────────────────────────────────────

models = {
    'Logistic Regression': Pipeline([
        ('scaler', StandardScaler()),
        ('clf', LogisticRegression(max_iter=1000, C=1.0, random_state=42))
    ]),
    'Random Forest': Pipeline([
        ('scaler', StandardScaler()),  # scaling is a no-op for trees; kept for pipeline uniformity
        ('clf', RandomForestClassifier(n_estimators=200, max_depth=6,
                                        min_samples_leaf=10, random_state=42))
    ]),
    'Gradient Boosting': Pipeline([
        ('scaler', StandardScaler()),  # scaling is a no-op for trees; kept for pipeline uniformity
        ('clf', GradientBoostingClassifier(n_estimators=200, max_depth=4,
                                            learning_rate=0.05, random_state=42))
    ])
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

print("\nCross-validation results (5-fold, AUC-ROC):")
cv_results = {}
for name, pipeline in models.items():
    scores = cross_val_score(pipeline, X_train, y_train, cv=cv,
                             scoring='roc_auc', n_jobs=-1)
    cv_results[name] = scores
    print(f"  {name:25s}: {scores.mean():.4f} ± {scores.std():.4f}")

# Select best model for detailed analysis
# (Logistic regression is preferred for interpretability)
best_model_name = 'Logistic Regression'
best_pipeline = models[best_model_name]
best_pipeline.fit(X_train, y_train)

# Test set evaluation
y_pred = best_pipeline.predict(X_test)
y_prob = best_pipeline.predict_proba(X_test)[:, 1]

print(f"\n{'='*60}")
print(f"TEST SET EVALUATION: {best_model_name}")
print(f"{'='*60}")
print(f"\nClassification Report:")
print(classification_report(y_test, y_pred,
                             target_names=['Non-Populist', 'Populist']))

auc = roc_auc_score(y_test, y_prob)
print(f"AUC-ROC: {auc:.4f}")

# Confusion matrix visualization
fig, axes = plt.subplots(1, 2, figsize=(12, 5))
fig.suptitle(f"Populist Rhetoric Classifier — {best_model_name}", fontsize=13, fontweight='bold')

# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['Non-Populist', 'Populist'],
            yticklabels=['Non-Populist', 'Populist'],
            ax=axes[0])
axes[0].set_xlabel('Predicted')
axes[0].set_ylabel('Actual')
axes[0].set_title('Confusion Matrix')

# ROC curve
fpr, tpr, _ = roc_curve(y_test, y_prob)
axes[1].plot(fpr, tpr, color='steelblue', lw=2,
             label=f'ROC Curve (AUC = {auc:.3f})')
axes[1].plot([0, 1], [0, 1], 'k--', lw=1, label='Random Classifier')
axes[1].set_xlabel('False Positive Rate')
axes[1].set_ylabel('True Positive Rate')
axes[1].set_title('ROC Curve')
axes[1].legend()

plt.tight_layout()
plt.savefig('classifier_evaluation.png', dpi=150, bbox_inches='tight')
plt.show()

37.4.3 Feature Importance Analysis

# ─────────────────────────────────────────────
# FEATURE IMPORTANCE: WHAT MATTERS MOST?
# ─────────────────────────────────────────────

# Extract logistic regression coefficients
lr_model = best_pipeline.named_steps['clf']
coefficients = lr_model.coef_[0]

# Also get Random Forest importances for comparison
rf_pipeline = models['Random Forest']
rf_pipeline.fit(X_train, y_train)
rf_importances = rf_pipeline.named_steps['clf'].feature_importances_

importance_df = pd.DataFrame({
    'feature': feature_cols,
    'lr_coefficient': coefficients,
    'rf_importance': rf_importances
}).sort_values('lr_coefficient', ascending=False)

print("\nFeature Importance (Logistic Regression Coefficients):")
print("Higher coefficient = stronger indicator of populist speech")
print()
for _, row in importance_df.iterrows():
    direction = "↑ populist" if row['lr_coefficient'] > 0 else "↓ populist"
    print(f"  {row['feature']:35s}: {row['lr_coefficient']:+.4f}  [{direction}]")

# Visualization
fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.suptitle("Feature Importance: What Signals Populist Rhetoric?", fontsize=13, fontweight='bold')

colors_lr = ['coral' if c > 0 else 'steelblue' for c in importance_df['lr_coefficient']]
axes[0].barh(importance_df['feature'], importance_df['lr_coefficient'],
             color=colors_lr, edgecolor='white')
axes[0].axvline(0, color='black', lw=1)
axes[0].set_xlabel('Coefficient (positive = more populist)')
axes[0].set_title('Logistic Regression Coefficients')

importance_sorted = importance_df.sort_values('rf_importance', ascending=False)
axes[1].barh(importance_sorted['feature'], importance_sorted['rf_importance'],
             color='mediumseagreen', edgecolor='white')
axes[1].set_xlabel('Feature Importance (Gini)')
axes[1].set_title('Random Forest Feature Importance')

plt.tight_layout()
plt.savefig('feature_importance.png', dpi=150, bbox_inches='tight')
plt.show()

# Critical interpretation
print("\n" + "="*60)
print("INTERPRETING FEATURE IMPORTANCE")
print("="*60)
print("""
The most important features for classifying populist speeches are:

1. anti_elite_density — Strong positive coefficient: the single most
   reliable indicator of populist classification. This makes theoretical
   sense: elite critique is the defining feature of populism.

2. manichean_density — Second strongest positive coefficient: binary
   contrast framing is a highly reliable populist signal.

3. contrast_ratio — Positive: grammatical contrast structure (us vs.
   them constructions, "not X but Y" patterns) correlates with populism.

4. people_centric_density — Positive but weaker: people-centered
   vocabulary is common in many political speeches, reducing its
   discriminant power.

WHAT THIS MISSES:
The classifier relies on explicit vocabulary. Sophisticated populist
communicators (including Whitfield) often achieve populist effects
through narrative structure, emotional tone, and implicit framing
without using the specific terms our dictionaries capture.
The classifier's false negatives are likely concentrated among
rhetorically sophisticated populist speeches.
""")

37.5 Reading the Confusion Matrix: A Practical Guide

The confusion matrix produced by the classifier is one of the most information-dense outputs of the evaluation, but it is also one of the most frequently misread. This section walks through each cell in detail and explains what it means for the substantive research question.

37.5.1 What the Four Cells Tell Us

The confusion matrix for a binary classifier shows four quantities:

                      Predicted Non-Populist   Predicted Populist
Actual Non-Populist   True Negatives (TN)      False Positives (FP)
Actual Populist       False Negatives (FN)     True Positives (TP)
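Every summary metric in the evaluation is a ratio of these four cells. A quick sketch with hypothetical counts (invented for illustration, roughly matching the scale of the ODA test set):

```python
# Hypothetical cell counts for a 2,956-speech test set
tp, fn = 520, 280     # truly populist: caught vs. missed
fp, tn = 202, 1954    # truly non-populist: wrongly flagged vs. correctly cleared

precision = tp / (tp + fp)              # of flagged speeches, how many are right
recall = tp / (tp + fn)                 # of populist speeches, how many are caught
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f} accuracy={accuracy:.2f}")
# → precision=0.72 recall=0.65 f1=0.68 accuracy=0.84
```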

For a typical run of the ODA populism classifier, you might see output like:

                 precision    recall  f1-score   support

  Non-Populist       0.88      0.91      0.89      2,156
     Populist        0.72      0.65      0.68       800

    accuracy                            0.84      2,956
   macro avg          0.80      0.78      0.79      2,956
weighted avg          0.83      0.84      0.83      2,956

AUC-ROC: 0.847

Unpacking each metric:

Precision (for Populist class: 0.72): Of all the speeches the classifier labeled "Populist," 72% were truly populist according to the threshold-based labels. The other 28% were false positives — non-populist speeches the classifier incorrectly flagged. In the research context: if Sam uses the classifier to build a list of "populist speeches" for qualitative follow-up, she should expect that roughly 28% of the speeches on that list will, upon manual review, turn out not to exhibit the populist linguistic patterns she's looking for.

Recall (for Populist class: 0.65): Of all the speeches that were truly populist, the classifier identified 65% of them. The remaining 35% are false negatives — populist speeches the classifier missed. This is a substantial miss rate. In the trend analysis context, it means Sam's time-series plot of "populist speeches per quarter" will systematically undercount truly populist speeches by approximately 35%. The trend direction may still be valid (if the undercount is consistent over time), but the absolute level of the rate is a floor, not an accurate estimate.
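The "floor, not an accurate estimate" point can be made concrete with a toy calculation (the quarterly rates below are hypothetical):

```python
import numpy as np

true_rate = np.array([0.10, 0.12, 0.15, 0.19, 0.24])  # hypothetical true quarterly rates
recall = 0.65
observed = true_rate * recall  # what a 65%-recall classifier would report

print("true:    ", np.round(true_rate, 3))
print("observed:", np.round(observed, 3))
# The observed level understates every quarter, but consecutive differences
# keep their sign as long as recall stays stable over time.
print("trend preserved:", bool(np.all(np.diff(observed) > 0)))
```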

The precision-recall trade-off: These two metrics trade off against each other when you adjust the classification threshold. If Sam lowers the threshold (classifying speeches as "Populist" when the model's predicted probability exceeds 0.30 instead of 0.50), she will catch more truly populist speeches (higher recall) but at the cost of including more false positives (lower precision). The choice depends on the research use case:

  • For trend detection where completeness matters more than purity: lower threshold, accept more false positives

  • For identifying specific speeches for qualitative coding where purity matters: higher threshold, accept more false negatives
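
The trade-off can be made concrete on a tiny synthetic set of (true label, predicted probability) pairs; the values below are illustrative, not model output:

```python
# Sweeping the decision threshold on synthetic (label, probability) pairs
# to show precision falling as recall rises.
preds = [(1, 0.92), (1, 0.81), (0, 0.75), (1, 0.55), (0, 0.48),
         (1, 0.41), (0, 0.33), (1, 0.28), (0, 0.12), (0, 0.07)]

for threshold in (0.50, 0.30):
    tp = sum(1 for y, p in preds if y == 1 and p >= threshold)
    fp = sum(1 for y, p in preds if y == 0 and p >= threshold)
    fn = sum(1 for y, p in preds if y == 1 and p < threshold)
    print(f"threshold={threshold:.2f}  "
          f"precision={tp / (tp + fp):.2f}  recall={tp / (tp + fn):.2f}")
```

At the lower threshold recall rises and precision falls, exactly the movement described above.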

AUC-ROC (0.847): The AUC-ROC measures the classifier's ability to discriminate between populist and non-populist speeches across all possible threshold settings. An AUC of 0.847 means that if you randomly draw one populist speech and one non-populist speech from the test set, the model assigns a higher populist probability to the truly populist speech 84.7% of the time. This is substantially better than random (AUC = 0.5) and represents a working classifier, though not a highly precise one.
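
That pairwise reading of AUC can be checked directly: count, over all (populist, non-populist) pairs, how often the populist speech receives the higher probability. A sketch on illustrative values:

```python
# AUC as a ranking probability: fraction of (populist, non-populist) pairs
# where the populist speech gets the higher score (ties count half).
pos = [0.92, 0.81, 0.55, 0.41, 0.28]  # probabilities for truly populist speeches
neg = [0.75, 0.48, 0.33, 0.12, 0.07]  # probabilities for truly non-populist

wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
auc = wins / (len(pos) * len(neg))
print(f"AUC = {auc:.2f}")
```

On the same inputs this agrees with sklearn's roc_auc_score, which computes the same quantity from the ROC curve.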

37.5.2 False Negative Patterns and Their Research Implications

The 35% false negative rate is not uniform across speech types. Understanding which speeches the classifier misses reveals its structural limitations.

# ─────────────────────────────────────────────
# FALSE NEGATIVE ANALYSIS
# ─────────────────────────────────────────────

# Add classifier probability to test set for analysis
test_df = df.loc[X_test.index].copy()  # .loc: X_test carries the original index labels
test_df['pred_prob'] = y_prob
test_df['predicted'] = y_pred
test_df['actual'] = y_test.values

# Identify false negatives: actually populist, predicted non-populist
false_negatives = test_df[
    (test_df['actual'] == 1) & (test_df['predicted'] == 0)
]

# Identify true positives for comparison
true_positives = test_df[
    (test_df['actual'] == 1) & (test_df['predicted'] == 1)
]

print("FALSE NEGATIVE ANALYSIS")
print("="*60)
print(f"\nFalse negatives: {len(false_negatives):,} speeches "
      f"({len(false_negatives)/test_df['actual'].sum()*100:.1f}% of populist speeches missed)")

# Compare event type distribution
print("\nEvent type distribution: False Negatives vs True Positives")
fn_events = false_negatives['event_type'].value_counts(normalize=True) * 100
tp_events = true_positives['event_type'].value_counts(normalize=True) * 100
comparison = pd.DataFrame({'False Neg %': fn_events, 'True Pos %': tp_events}).fillna(0)
print(comparison.round(1))

# Compare mean feature values
print("\nMean feature values: False Negatives vs True Positives")
fn_features = false_negatives[feature_cols].mean()
tp_features = true_positives[feature_cols].mean()
feature_comparison = pd.DataFrame({
    'False Neg': fn_features,
    'True Pos': tp_features
})
feature_comparison['ratio (FN/TP)'] = feature_comparison['False Neg'] / feature_comparison['True Pos']
print(feature_comparison.round(3))

What the false negative analysis typically reveals:

The speeches the classifier misses (false negatives) tend to differ from the speeches it correctly identifies (true positives) in predictable ways:

Narrative-dominant speeches: Floor speeches that tell long personal stories — "I want to tell you about my constituent Maria, who worked 30 years in that factory..." — can embed deeply populist themes (the forgotten worker, the powerful interests that destroyed her community) without triggering the explicit vocabulary in our dictionaries. The narrative structure performs the populist function; the vocabulary is specific and personal rather than rhetorical and general.

Regional idiomatic speech: Rural Southern political speech often expresses anti-elite sentiment through idiom rather than direct declaration. "Those folks in Washington have never seen a field that wasn't a golf course" scores low on explicit anti-elite vocabulary but is functionally populist. The dictionary approach cannot capture idiomatic expression.

Late-career rhetorical adaptation: Some experienced populist communicators actively avoid specific vocabulary that they know journalists flag as populist. Whitfield himself has been observed replacing "the corrupt establishment" with "the people in charge" and "the elites" with "the decision-makers" in media-heavy settings. The classifier misses this adaptive camouflage.


37.6 Time-Series Rhetoric Tracking

# ─────────────────────────────────────────────────────────────
# EXAMPLE 3: TIME-SERIES RHETORIC ANALYSIS
# example-03-time-series-rhetoric.py
# ─────────────────────────────────────────────────────────────

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.dates as mdates
import seaborn as sns
from scipy import stats
import warnings
warnings.filterwarnings('ignore')

# Load data with all features computed
df = pd.read_csv('oda_speeches_with_features.csv', parse_dates=['date'])
# (Assumes features from example-01 have been saved to this file)

# ─────────────────────────────────────────────
# 5. TEMPORAL ANALYSIS: RHETORIC CHANGE OVER TIME
# ─────────────────────────────────────────────

df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.to_period('M')
df['quarter'] = df['date'].dt.to_period('Q')

# Overall populism score trend (quarterly)
quarterly_trend = df.groupby(['quarter', 'party']).agg(
    mean_populism=('populism_score', 'mean'),
    mean_anti_elite=('anti_elite_density', 'mean'),
    n_speeches=('speech_id', 'count')
).reset_index()

quarterly_trend['quarter_str'] = quarterly_trend['quarter'].astype(str)

# Focus on R and D parties with sufficient data
main_parties = quarterly_trend[
    (quarterly_trend['party'].isin(['R', 'D'])) &
    (quarterly_trend['n_speeches'] >= 10)
]

fig, axes = plt.subplots(2, 1, figsize=(14, 10))
fig.suptitle("Populist Rhetoric Over Time (ODA Speeches Dataset)", fontsize=14, fontweight='bold')

# Plot 1: Mean populism score over time by party
for party, color, label in [('R', '#CC3333', 'Republican'), ('D', '#3366CC', 'Democrat')]:
    party_data = main_parties[main_parties['party'] == party].sort_values('quarter_str')
    if len(party_data) > 0:
        axes[0].plot(party_data['quarter_str'], party_data['mean_populism'],
                     color=color, linewidth=2, marker='o', markersize=4,
                     label=label, alpha=0.8)
        # Add trend line
        if len(party_data) > 3:
            z = np.polyfit(range(len(party_data)), party_data['mean_populism'], 1)
            p = np.poly1d(z)
            axes[0].plot(party_data['quarter_str'],
                        [p(i) for i in range(len(party_data))],
                        color=color, linestyle='--', alpha=0.4, linewidth=1)

axes[0].set_xlabel('')
axes[0].set_ylabel('Mean Populism Score')
axes[0].set_title('Mean Populism Score by Quarter and Party')
axes[0].legend()
plt.setp(axes[0].get_xticklabels(), rotation=45, ha='right', fontsize=8)

# Plot 2: Anti-elite density over time (key driver)
for party, color, label in [('R', '#CC3333', 'Republican'), ('D', '#3366CC', 'Democrat')]:
    party_data = main_parties[main_parties['party'] == party].sort_values('quarter_str')
    if len(party_data) > 0:
        axes[1].plot(party_data['quarter_str'], party_data['mean_anti_elite'],
                     color=color, linewidth=2, marker='s', markersize=4,
                     label=label, alpha=0.8)

axes[1].set_xlabel('Quarter')
axes[1].set_ylabel('Mean Anti-Elite Language Density')
axes[1].set_title('Anti-Elite Language Density by Quarter and Party')
axes[1].legend()
plt.setp(axes[1].get_xticklabels(), rotation=45, ha='right', fontsize=8)

plt.tight_layout()
plt.savefig('rhetoric_time_series.png', dpi=150, bbox_inches='tight')
plt.show()

# Statistical test: Is the party difference significant?
r_scores = df[df['party'] == 'R']['populism_score']
d_scores = df[df['party'] == 'D']['populism_score']
u_stat, p_value = stats.mannwhitneyu(r_scores, d_scores, alternative='two-sided')

print("\nStatistical Comparison: Republican vs. Democrat Populism Scores")
print(f"  Republican median: {r_scores.median():.4f}")
print(f"  Democrat median: {d_scores.median():.4f}")
print(f"  Mann-Whitney U statistic: {u_stat:.1f}")
print(f"  p-value: {p_value:.4e}")
print(f"  Interpretation: {'Statistically significant difference' if p_value < 0.05 else 'No statistically significant difference'}")

37.7 Extending the Classifier to Different Languages and Countries

One of the most common next steps researchers want to take after building a working English-language populism classifier is to apply it to non-English political speech or to different national contexts. This section describes the methodological challenges and approaches for doing so.

37.7.1 The Language Transfer Problem

The classifier built in this chapter is not a "populism detector" in any universal sense — it is a detector of specific English-language lexical patterns associated with American-style populist rhetoric. Applying it to speeches in other languages requires more than translation.

Why direct translation fails. A naive approach would translate all foreign-language speeches to English and apply the existing classifier. This fails for several reasons:

The populist vocabulary in each language has its own culturally specific terms that do not translate cleanly. The German term "Volk" (people, folk) carries specific historical resonance that makes it a stronger populist signal in German politics than any simple English translation like "the people" would suggest. The French "les élites" indexes a specifically French post-Revolutionary tradition of elite critique that differs from American anti-establishment rhetoric despite surface similarity. Automatic machine translation may convert a term that is highly salient in the source language to a generic English equivalent that scores low on our dictionary.

Syntactic and discourse structure differ substantially across languages in ways that affect our grammatical features. The contrast_ratio feature, which detects "not X but Y" and "us vs. them" constructions, will produce different base rates in German (with its more flexible word order) than in English, making cross-linguistic comparisons of raw scores unreliable.

37.7.2 A Framework for Multilingual Extension

The appropriate approach for multilingual extension has three stages:

Stage 1: Reconstruct the feature engineering in the target language. Rather than translating speeches to English, translate the feature engineering logic to the target language. This means:

  • Building a new anti-elite dictionary in the target language, informed by that country's political history and vocabulary

  • Building a new people-centric dictionary reflecting the specific terms used in the target political culture

  • Adapting the grammatical pattern detection (contrast framing, pronoun ratios) to the target language's grammar

For Spanish-language analysis — relevant for Garza's campaign and for Latin American comparative work — the anti-elite dictionary would include terms like "élites," "la casta," "el establishment," "los poderosos," and regionally specific terms like "los fifís" (Mexican slang for cosmopolitan elites). The people-centric dictionary would include "el pueblo," "la gente," "trabajadores," and regional variants.

Stage 2: Validate features against ground truth in the new language. Before applying the classifier, validate the new language's features against human-coded populism ratings of a sample of speeches. This step cannot be skipped: the validity assumptions that justified using the English features in the English context do not automatically transfer to the new language.

Stage 3: Be explicit about what cannot be compared across languages. Even after Stage 1 and Stage 2, raw scores from the English classifier and the Spanish classifier are not directly comparable — they are parallel measurements of related but distinct constructs in different linguistic contexts. Cross-linguistic claims ("right-wing populism is higher in Spain than in the US") require explicit bridging arguments that go beyond the raw classifier outputs.

# ─────────────────────────────────────────────
# SKETCH: Spanish-Language Feature Dictionary
# (Illustrative — not a production-ready classifier)
# ─────────────────────────────────────────────

ANTI_ELITE_TERMS_ES = [
    'élite', 'élites', 'establecimiento', 'medios de comunicación',
    'corrupto', 'corrupción', 'políticos', 'clase política',
    'intereses especiales', 'la casta', 'los poderosos',
    'clase dominante', 'los de arriba', 'los ricos',
    'oligarquía', 'banqueros', 'globalistas',
    'la élite política', 'los que mandan', 'los tecnócratas'
]

PEOPLE_CENTRIC_TERMS_ES = [
    'el pueblo', 'la gente', 'trabajadores', 'familias trabajadoras',
    'ciudadanos de a pie', 'gente común', 'americanos de a pie',
    'la clase media', 'la clase trabajadora', 'los olvidados',
    'sentido común', 'la voluntad del pueblo', 'voces del pueblo',
    'mayoría silenciosa', 'contribuyentes', 'votantes',
    'ciudadanos', 'comunidades', 'pequeños negocios', 'agricultores'
]

# The compute_term_density function from example-01 can be applied
# directly to Spanish text using these Spanish-language dictionaries.
# Validation against human-coded Spanish-language speeches is required
# before drawing conclusions.
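
The compute_term_density helper itself lives in example-01 and is not reproduced in this section, so the following is a hypothetical stand-in showing the general shape (whole-word dictionary matches per 1,000 words); the exact signature in example-01 may differ:

```python
import re

def compute_term_density(text, terms):
    """Whole-word dictionary matches per 1,000 words.
    Hypothetical stand-in for the example-01 helper."""
    lowered = text.lower()
    hits = sum(len(re.findall(r'\b' + re.escape(term) + r'\b', lowered))
               for term in terms)
    return hits / max(len(lowered.split()), 1) * 1000

# A short inline subset keeps the sketch self-contained; the full
# ANTI_ELITE_TERMS_ES list above works the same way.
demo_terms = ['la casta', 'los poderosos', 'oligarquía']

sample = "La casta política y los poderosos han abandonado al pueblo."
print(f"{compute_term_density(sample, demo_terms):.1f} per 1,000 words")
```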

37.7.3 Country-Specific Calibration Challenges

Different countries present different calibration challenges even when working in the same language. Consider two Spanish-language contexts: Mexico and Argentina.

Mexico: Mexican populist rhetoric, particularly under AMLO (Andrés Manuel López Obrador, 2018–2024), used a distinctive vocabulary centered on "el pueblo" vs. "la mafia del poder" (the power mafia). The Manichean framing was explicit and systematic; AMLO's daily morning press conferences ("mañaneras") were saturated with anti-elite content. A classifier calibrated on Mexican political speech would likely score differently than one calibrated on general Spanish-language speech.

Argentina: Peronist rhetoric has a century-long tradition of people-centric populism that has deeply embedded certain terms into the mainstream political vocabulary. "Los trabajadores," "el pueblo argentino," and references to the political and economic "oligarquía" are used by politicians across a much wider ideological spectrum than in Mexico. A classifier trained on Argentine data might find populist vocabulary in speeches that analysts would not consider strongly populist by comparative standards — because the baseline vocabulary level is higher.

The practical implication: populism classifiers should be calibrated and validated separately for each country-context, with threshold decisions made relative to the distribution of scores in that context rather than using a universal absolute threshold.
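
One way to implement relative thresholding is to set the cutoff at a fixed percentile of each country's own score distribution. A sketch with illustrative (not real) score lists:

```python
# Within-country calibration: flag the top quartile of each country's own
# score distribution instead of applying one absolute cutoff everywhere.
# Score lists are illustrative, not real Mexican or Argentine data.
scores = {
    'MX': [0.22, 0.31, 0.35, 0.44, 0.52, 0.58, 0.61, 0.70],
    'AR': [0.41, 0.47, 0.52, 0.55, 0.60, 0.63, 0.68, 0.74],
}

def percentile_threshold(values, pct=0.75):
    """Nearest-rank percentile; a sketch, not a numpy replacement."""
    ordered = sorted(values)
    rank = min(len(ordered) - 1, round(pct * (len(ordered) - 1)))
    return ordered[rank]

for country, vals in scores.items():
    cut = percentile_threshold(vals)
    flagged = sum(v >= cut for v in vals)
    print(f"{country}: 75th-percentile cutoff {cut:.2f}, "
          f"{flagged} of {len(vals)} flagged")
```

With these illustrative numbers, a single absolute cutoff of 0.58 would flag 3 of 8 Mexican speeches but 4 of 8 Argentine ones, conflating Argentina's higher baseline vocabulary with substantively stronger populism; the within-country percentile flags the same share in each context.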


37.8 Critical Interpretation: What the Classifier Can and Cannot Tell Us

Sam Harding's analytical philosophy is grounded in a principle that could serve as the methodological motto for this entire textbook: the classifier is a map, not the territory. This section translates that principle into specific interpretive constraints.

37.8.1 What the Classifier Does Well

When Sam presents ODA's rhetoric classifier results to the team, she starts with what the model reliably does:

Identifies explicit lexical patterns. The classifier reliably flags speeches that use anti-elite vocabulary, people-centric language, and Manichean framing. If a politician says "the corrupt Washington establishment," "the people of this country," and "this is a fight between freedom and tyranny" in the same speech, the classifier will correctly identify it as populist.

Enables scalable comparison. The classifier processes 14,782 speeches consistently, enabling systematic comparison across time, parties, states, and speaker characteristics that would be impossible to do through manual reading.

Surfaces patterns for human investigation. The classifier is most valuable not as a definitive classification system but as a triage tool — flagging speeches for closer human examination, identifying anomalous patterns that warrant investigation, and providing quantitative backup for impressionistic observations.

37.8.2 What the Classifier Misses

# ─────────────────────────────────────────────
# 8. CLASSIFIER LIMITATIONS ANALYSIS
# ─────────────────────────────────────────────

# ILLUSTRATIVE EXAMPLES OF CLASSIFIER FAILURE MODES

# Example A: Sophisticated populism without explicit vocabulary
# (classifier would likely score this LOW despite clearly populist content)
sophisticated_populist = """
When my grandfather came home from the mill, he had calluses on his hands
and enough in his pocket to pay the mortgage. That mill is gone now.
The people who made the decision that closed it are doing just fine —
they're speaking at conferences in Davos and collecting consulting fees
from the very companies that took your jobs. I'm asking you to ask why.
Ask who benefits. Ask who's not here tonight.
"""

# Example B: High-vocabulary non-populist speech
# (classifier might score this HIGH despite being non-populist academic content)
vocabulary_trap = """
The political elite have historically failed to represent ordinary citizens.
The corruption in our establishment institutions is well-documented.
Common people deserve better representation in the political system.
The voice of working people must be heard in the corridors of power.
"""

print("CLASSIFIER FAILURE MODE ANALYSIS")
print("="*60)
print("\nSophisticated populist speech (no explicit dictionary terms):")
print("  Classifier will likely MISS this as populist because:")
print("  - No explicit 'elite'/'establishment'/etc. terms")
print("  - Storytelling structure with implicit elite critique")
print("  - 'Those making decisions' is an indirect anti-elite signal")
print("  - The emotional/narrative populism is not captured by word frequencies")

print("\nVocabulary-trap non-populist speech (uses all the right words academically):")
print("  Classifier may FALSE POSITIVE on this because:")
print("  - All the elite-critique vocabulary is present")
print("  - But the speech is descriptive analysis, not political mobilization")
print("  - Intent and register are not captured by term frequencies")

37.9 Connecting Results to Chapter 34 Theory

The classifier's findings gain meaning only when interpreted against the theoretical framework established in Chapter 34. Raw counts of "populist speeches" are analytically empty; counts interpreted through the lens of Mudde's ideational definition, Laclau's discursive theory, and the left/right populism distinction are analytically rich.

37.9.1 What the Feature Importance Findings Mean Theoretically

The most important features in the classifier — anti-elite density and Manichean framing — correspond directly to Mudde's core definition of populism as the belief in a struggle between "the pure people" and "the corrupt elite." Chapter 34 predicted that this structure would be the definitional core; the classifier empirically confirms that it is also the most linguistically distinctive — the features that best separate populist from non-populist speech are precisely those that operationalize the theoretical core.

The relative weakness of people-centric language as a discriminating feature is also theoretically interpretable. Chapter 34 noted that appeals to "the people" are present across a much wider range of political rhetoric than pure populism — democratic politicians of all stripes invoke the people. It is the combination of people-positive and elite-negative framing that is distinctive to populism, and our classifier confirms that the elite-negative dimension has higher marginal discriminant power than the people-positive dimension.

37.9.2 The Left/Right Asymmetry Finding

Key Finding 4 from Section 37.6 — that anti-elite language is higher in Republican speeches while people-centric language is similar across parties — is directly interpretable through Chapter 34's left/right populism distinction:

Right populism (exemplified by Whitfield) is elite-focused. The core antagonism is "the corrupt establishment" vs. "real Americans." The people are invoked primarily as the victims of elite predation. The linguistic center of gravity is elite critique.

Left populism is more people-focused. The core message tends to be "the working class" vs. "the billionaire class," with the people framed more actively as agents (not just victims) and with somewhat less explicit elite-hatred. The linguistic center of gravity is people affirmation as much as elite critique.

The classifier's differential detection of these two variants — stronger for right populism, somewhat weaker for left populism — is consistent with this theoretical prediction. It is not a flaw in the classifier; it is an accurate reflection of the different linguistic strategies of left and right populism.

37.9.3 The Code-Switching Finding and Elite Capture

Key Finding 2 — that rally speeches score highest on populism indicators and floor speeches score lowest — is analytically significant beyond the obvious point that politicians adjust their register to their audience.

Chapter 34 discussed the "movement-into-institutions" transition, where populist movements that win power face the challenge of governing within the institutional structures they campaigned against. The code-switching finding provides linguistic evidence of this dynamic: politicians who use populist rhetoric in campaign settings adopt institutional register in legislative settings. The populist identity is contextually deployed, not continuously maintained.

This has a specific implication for the Whitfield analysis. Sam's finding that Whitfield's second-person density is exceptionally high (94th percentile) reflects a specific rhetorical technique: maintaining the populist "you and me against them" frame even in relatively formal settings. This persistence of the direct-address technique across contexts distinguishes rhetorical commitments from situational code-switching.


37.10 Ethical Responsibilities of Building Rhetoric Classifiers

Political rhetoric classifiers are not neutral tools. They make claims about speech — claims that can be used to label politicians, construct narratives about political movements, and shape journalistic and public understanding of political communication. The researchers and organizations that build and deploy these tools carry corresponding responsibilities.

37.10.1 The Labeling Power

When ODA publishes an analysis saying "Senator X's rhetoric is in the 90th percentile of populist speech in our dataset," they are attaching a label to a politician that has political consequences independent of its accuracy. "Populist" is not a neutral descriptor — it carries connotations of demagoguery, irrationality, and democratic danger in much of the mainstream political press. Publishing a quantitative populism score lends false precision and scientific legitimacy to what is ultimately a methodologically limited measurement.

Sam Harding's response to this concern at an ODA editorial meeting is worth quoting in full: "I'm not saying Whitfield is dangerous. I'm saying he scores at the 90th percentile on a set of features I've operationalized from Mudde's ideational definition of populism. Those are different claims. The first is a political judgment. The second is a measurement claim with documented limitations. We should only make the second claim, and we should be explicit about what it does and doesn't mean."

This discipline — between measurement claims and normative claims — is the core of responsible rhetoric analysis. It requires:

Explicit methodology disclosure. Every published output from the classifier should be accompanied by a clear methodology statement explaining what was measured, how the features were defined, what the known limitations are, and what claims the numbers support and do not support.

Uncertainty quantification. A point estimate of a politician's populism score communicates false precision. Reporting score ranges, acknowledging the threshold sensitivity, and noting the classifier's false negative and false positive rates are minimum standards for responsible reporting.

Avoiding individual labeling where data is thin. For politicians with fewer than 20 speeches in the ODA database, individual populism scores are too variable to be reliable. Reporting a specific score for a candidate with limited data is misleading. Aggregate analyses at the party, region, or office level are more defensible than individual-level labeling when the underlying data is sparse.
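
Both standards, interval reporting and the thin-data rule, can be enforced mechanically in the reporting pipeline. A stdlib-only sketch with illustrative per-speaker scores (the 20-speech floor and the bootstrap settings are the editorial choices to justify, not the code):

```python
import random

random.seed(7)

# Illustrative per-speaker populism scores (not real ODA data).
speakers = {
    'Senator A':   [random.gauss(0.55, 0.12) for _ in range(60)],
    'Candidate B': [random.gauss(0.48, 0.15) for _ in range(9)],  # thin data
}

MIN_SPEECHES = 20  # below this, publish only aggregate-level results

for name, s in speakers.items():
    if len(s) < MIN_SPEECHES:
        print(f"{name}: only {len(s)} speeches, individual score withheld")
        continue
    # Bootstrap 95% interval for the speaker's mean score
    boots = sorted(sum(random.choices(s, k=len(s))) / len(s)
                   for _ in range(2000))
    lo, hi = boots[50], boots[1949]  # ~2.5th and ~97.5th percentiles
    print(f"{name}: mean {sum(s) / len(s):.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```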

37.10.2 The Amplification Risk

Rhetoric classifiers can contribute to the very phenomena they study. If a major news organization publishes a story based on ODA's classifier findings — "New analysis shows Republican Senate candidates increasingly using populist rhetoric" — the story itself may influence the behavior it reports on. Candidates may adopt or avoid populist vocabulary based on public awareness of quantitative tracking; their communications staff may actively game the metrics if they are widely known.

This is not a reason not to build or publish classifier findings. It is a reason to think carefully about what analyses to publish, in what form, and with what level of prominence. A subtle trend detected in aggregate data and reported with appropriate uncertainty has different amplification potential than a ranked list of individual politicians with quantitative "populism scores."

37.10.3 The Definitional Power

Whoever defines "populism" for a classifier defines what the research produces. The ODA classifier, following Mudde, operationalizes populism as anti-elite + people-centric + Manichean. This is one defensible definition. It is not the only one.

A classifier built on Laclau's discursive theory would emphasize the construction of an "empty signifier" that unifies disparate demands rather than specific vocabulary. A classifier built on historical institutionalist definitions of populism might emphasize norm-breaking institutional behavior rather than rhetorical content. These are not minor differences — they would produce substantially different labels for the same politicians and the same speeches.

Researchers should be explicit about their definitional choices and transparent about the alternatives they did not take. The methodology statement in Section 37.6.3 attempts this by explicitly stating what the classifier does and does not measure. This transparency is not optional — it is the epistemic minimum for responsible measurement of a politically consequential concept.

⚠️ Common Pitfall: Treating Classifier Output as Ground Truth After building a classifier that achieves reasonable accuracy metrics, there is a natural temptation to treat its outputs as definitive. A speech classified as "populist" by the model is not proven to be populist — it has been classified as populist by an algorithm that approximates human judgment. All downstream claims should include appropriate uncertainty qualifiers: "the classifier suggests," "the model flags," "analysis indicates." Avoid flat assertions like "is populist," or "is classified as populist" used as though the classification were definitive.


37.11 What Text Classifiers Cannot Do: A Systematic Inventory

The chapter has been explicit about specific limitations throughout. This section consolidates that discussion into a systematic inventory that researchers can use as a checklist before reporting classifier-based findings.

37.11.1 The Vocabulary Boundary

Text classifiers built on bag-of-words or term-frequency features are bounded by their feature vocabulary. They can detect whatever patterns their features capture; they are blind to whatever their features don't capture. For the populism classifier:

Captured: Explicit anti-elite vocabulary; explicit people-centric vocabulary; Manichean contrast terms; direct second-person address; first-person-plural collective pronouns; common-sense epistemic appeals; urgency and emergency language.

Not captured: Metaphorical populism (the politician who compares Washington DC to "a rigged casino" without saying "elite"); narrative populism (the personal story about the closed factory that implies elite failure without naming it); typological populism (using categories and frames that create elite/people contrast without explicit vocabulary); indirect Manichean framing (describing a policy debate as a choice "between the future and the past" rather than "between the people and the elite").

The boundary is systematic, not random. The most politically sophisticated populist communicators — those who have learned, explicitly or intuitively, to achieve populist effects without triggering negative media labeling — are exactly the ones the classifier is most likely to miss. This creates a systematic bias: the classifier performs best on the least strategically adaptive populists and worst on the most strategically adaptive ones.

37.11.2 The Context Blindness Problem

Every text in the ODA corpus is analyzed as an isolated document. The classifier has no knowledge of:

  • Audience composition: The same sentence means something different when spoken to a friendly partisan rally versus a general audience on live television. Context shapes how language functions as a populist appeal, but the classifier treats both contexts identically.

  • Speaker history: A politician who has never used anti-elite language producing a moderately high-scoring speech is behaving differently than one who has escalated over time to their current level. The classifier assigns identical scores to both.

  • Co-text: The relationship between a specific passage and the rest of the speech is invisible to the classifier. A highly scored passage embedded in a longer institutional speech may be performing a different rhetorical function than the same passage in a pure rally address.

  • Delivery and paralinguistic cues: Sarcasm, irony, humor, and emotional register are not captured by text at all. A politician who delivers an anti-elite line with a wink and an ironic smile is doing something different from one who delivers it with moral indignation. The transcript is the same; the political communication is different.

37.11.3 The Training Data Problem

The classifier's labels are derived from the existing populism_score, which was computed by a previous researcher using a methodology that is incompletely documented. This creates three compounding problems:

Definitional inheritance: Whatever conceptual choices went into the original score are inherited by the classifier. If the original score overweighted right-populist vocabulary, the classifier will reproduce that bias.

Circular validation: The classifier is validated against the score it was trained to predict. Finding that the classifier correlates with the score does not validate that either the classifier or the score is measuring "true" populism — it validates only that they are consistent with each other.

No independent ground truth: True validation requires human expert coders independently rating a sample of speeches as populist or non-populist, providing ground truth that is conceptually independent of any dictionary-based score. This ideal is resource-intensive and has not been done for the ODA corpus.
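
When a human-coded sample does exist, the appropriate check is chance-corrected agreement between the classifier's labels and the human codes, for example Cohen's kappa. A sketch with illustrative labels:

```python
# Chance-corrected agreement (Cohen's kappa) between classifier labels and
# independent human codes. Labels below are illustrative.
human      = [1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
classifier = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0]

n = len(human)
observed = sum(h == c for h, c in zip(human, classifier)) / n
p_pos_h, p_pos_c = sum(human) / n, sum(classifier) / n
expected = p_pos_h * p_pos_c + (1 - p_pos_h) * (1 - p_pos_c)
kappa = (observed - expected) / (1 - expected)
print(f"observed agreement {observed:.2f}, Cohen's kappa {kappa:.2f}")
```

Raw agreement overstates quality when one class dominates; kappa discounts the agreement expected by chance alone.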

37.11.4 The Temporal Drift Problem

Political vocabulary changes over time. Terms that were heavily associated with populism in 2018 may be widely used in mainstream political communication by 2025 — reducing their discriminant power. New populist vocabulary terms that entered political language after the training period are missed entirely.

The temporal generalization exercise tests whether this problem is severe enough to degrade classifier performance on recent data. If it is, the classifier should be retrained regularly — at minimum annually, ideally after each election cycle. In practice, this requires either manual dictionary updates or retraining on recent labeled data, both of which require ongoing investment.
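
The exercise amounts to splitting by time rather than at random: fit on earlier years, evaluate only on later ones. A minimal sketch of the split logic with illustrative (year, label) records:

```python
# Temporal holdout: fit on pre-2023 speeches, evaluate only on 2023+.
# A random split lets the model interpolate within every period's
# vocabulary; the time split measures forward generalization instead.
# Records are illustrative (year, label) pairs.
records = [(2019, 0), (2020, 1), (2020, 0), (2021, 1), (2021, 0),
           (2022, 1), (2022, 0), (2023, 1), (2023, 0), (2024, 1)]

SPLIT_YEAR = 2023
train = [r for r in records if r[0] < SPLIT_YEAR]
holdout = [r for r in records if r[0] >= SPLIT_YEAR]

print(f"train: {len(train)} speeches (through {SPLIT_YEAR - 1}), "
      f"holdout: {len(holdout)} speeches ({SPLIT_YEAR} onward)")
```

A material drop from the random-split metrics on this holdout is evidence of drift and a signal that the dictionaries or the model need refreshing.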

37.11.5 The Scale-Precision Trade-off

The core trade-off of the classifier approach is scale in exchange for precision. A skilled qualitative researcher reading a speech can detect subtle populist moves, contextual irony, audience-specific coding, and strategic adaptation that the classifier misses entirely. The cost is coverage: skilled qualitative analysis can handle perhaps hundreds of speeches per researcher-year, while the classifier processes tens of thousands in minutes.

The appropriate response to this trade-off is not to choose one approach over the other but to use them strategically in combination: the classifier for large-scale pattern detection and hypothesis generation, skilled qualitative analysis for validation, case development, and interpretation of patterns the classifier surfaces. Sam Harding's ODA workflow exemplifies this integration: quantitative rhetoric tracking surfaces patterns (Whitfield-type speeches are increasing); qualitative close reading explains them (here are the specific rhetorical moves that the tracker flags and here is what they mean in context).
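In code, the handoff between the two methods is a sampling step: the classifier flags speeches at scale, and a small random sample of flagged speeches goes to close reading. The sketch below uses a toy corpus of invented speech IDs and scores (nothing here is from the ODA pipeline); only the sample size and threshold mirror choices a real workflow would make.

```python
import random

random.seed(0)

# Toy stand-in for classifier output: (speech_id, predicted probability)
scored = [(f"sp{i:03d}", random.random()) for i in range(200)]

# Scale: the classifier flags everything above the decision threshold
flagged = [s for s in scored if s[1] >= 0.40]

# Precision: a small random sample of flags goes to qualitative close reading
review_sample = random.sample(flagged, k=min(10, len(flagged)))

print(f"{len(flagged)} speeches flagged; {len(review_sample)} sent for close reading")
```

Stratifying the review sample (by party, event type, or score band) rather than sampling uniformly would let the close reading probe exactly the subgroups where the classifier's errors are suspected to concentrate.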


37.12 Additional Exercises with Sample Outputs

Exercise 37.A: Threshold Sensitivity Analysis

The choice of threshold (0.40 in the main lab) is a consequential decision. This exercise asks you to systematically examine how threshold choice affects classifier behavior and downstream conclusions.

# ─────────────────────────────────────────────
# EXERCISE 37.A: THRESHOLD SENSITIVITY
# ─────────────────────────────────────────────

thresholds = [0.25, 0.30, 0.35, 0.40, 0.45, 0.50, 0.55]
threshold_results = []

for threshold in thresholds:
    # Create labels at this threshold
    y_thresh = (df['populism_score'] >= threshold).astype(int)
    pct_positive = y_thresh.mean() * 100

    # For each threshold, what % of R and D speeches are labeled populist?
    r_pct = (df[df['party'] == 'R']['populism_score'] >= threshold).mean() * 100
    d_pct = (df[df['party'] == 'D']['populism_score'] >= threshold).mean() * 100

    threshold_results.append({
        'threshold': threshold,
        'pct_labeled_populist': pct_positive,
        'r_pct_populist': r_pct,
        'd_pct_populist': d_pct,
        'r_d_ratio': r_pct / max(d_pct, 0.01)
    })

results_df = pd.DataFrame(threshold_results)
print("Threshold Sensitivity Analysis:")
print(results_df.round(2).to_string(index=False))

Sample output:

Threshold Sensitivity Analysis:
 threshold  pct_labeled_populist  r_pct_populist  d_pct_populist  r_d_ratio
      0.25                  38.4            52.1            28.3       1.84
      0.30                  29.7            42.6            20.1       2.12
      0.35                  22.1            33.8            14.2       2.38
      0.40                  16.3            26.1             9.8       2.66
      0.45                  11.8            19.7             6.3       3.13
      0.50                   8.4            14.6             4.1       3.56
      0.55                   5.9            10.8             2.7       4.00

Interpretation question for students: The ratio of Republican to Democrat populism labeling increases as the threshold rises. What does this tell you about the shape of the populism score distribution for each party? What research conclusions would be sensitive to this threshold choice, and what conclusions would be robust to it?
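One way to build intuition for that interpretation question is a purely synthetic illustration (this is not ODA data): take two right-skewed score distributions of identical shape, shift one slightly upward, and compute the share of each above a rising cutoff. Here both groups follow a Beta(2, 5) distribution, whose survival function has the closed form S(t) = 6(1−t)⁵ − 5(1−t)⁶, with the "R" group shifted up by 0.05.

```python
def survival(t):
    """Survival function of Beta(2,5): P(score >= t) = 6(1-t)^5 - 5(1-t)^6."""
    return 6 * (1 - t) ** 5 - 5 * (1 - t) ** 6

for t in (0.30, 0.40, 0.50):
    r_pct = survival(t - 0.05)   # "R" scores: same shape, shifted up 0.05
    d_pct = survival(t)          # "D" scores: unshifted
    print(f"threshold {t:.2f}: R/D ratio = {r_pct / d_pct:.2f}")
```

Because the tail thins as scores rise, a fixed shift between the distributions produces an ever-larger ratio at higher thresholds — a growing R/D ratio is therefore consistent with a modest uniform difference in score distributions, not necessarily a qualitatively distinct high-populism subgroup.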

Exercise 37.B: Speaker-Level Analysis

# ─────────────────────────────────────────────
# EXERCISE 37.B: SPEAKER-LEVEL ANALYSIS
# ─────────────────────────────────────────────

# Aggregate to speaker level (minimum 10 speeches)
speaker_profile = df.groupby(['speaker', 'party']).agg(
    n_speeches=('speech_id', 'count'),
    mean_populism=('populism_score', 'mean'),
    mean_anti_elite=('anti_elite_density', 'mean'),
    mean_manichean=('manichean_density', 'mean'),
    mean_second_person=('second_person_density', 'mean'),
    pct_rally=('event_type', lambda x: (x == 'rally').mean() * 100)
).reset_index()

speaker_profile = speaker_profile[speaker_profile['n_speeches'] >= 10]

# Find speakers with highest populism scores by party
print("Top 10 Republican speakers by mean populism score:")
r_top = speaker_profile[speaker_profile['party'] == 'R'].nlargest(10, 'mean_populism')
print(r_top[['speaker', 'n_speeches', 'mean_populism', 'mean_anti_elite',
             'mean_manichean', 'pct_rally']].round(3).to_string(index=False))

print("\nTop 10 Democrat speakers by mean populism score:")
d_top = speaker_profile[speaker_profile['party'] == 'D'].nlargest(10, 'mean_populism')
print(d_top[['speaker', 'n_speeches', 'mean_populism', 'mean_anti_elite',
             'mean_manichean', 'pct_rally']].round(3).to_string(index=False))

What to look for in your output: Compare the feature profiles of high-scoring Republican and Democratic speakers. Do they achieve high populism scores through the same features, or different ones? This is the empirical test of the left/right populism linguistic distinction discussed in Chapter 34.
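A minimal sketch of that profile comparison, on an invented speaker table (all values hypothetical, not ODA output): average each engineered feature within each party's top speakers, then find the feature with the largest party gap as the leading candidate for a left/right linguistic distinction.

```python
import pandas as pd

# Hypothetical feature profiles for six high-scoring speakers
top = pd.DataFrame({
    'party':              ['R', 'R', 'R', 'D', 'D', 'D'],
    'mean_anti_elite':    [0.42, 0.38, 0.35, 0.18, 0.21, 0.16],
    'mean_manichean':     [0.31, 0.29, 0.33, 0.12, 0.10, 0.14],
    'mean_second_person': [0.22, 0.25, 0.20, 0.34, 0.31, 0.36],
})

features = ['mean_anti_elite', 'mean_manichean', 'mean_second_person']
profile = top.groupby('party')[features].mean()
print(profile.round(3))

# Feature with the largest absolute party gap
gap = (profile.loc['R'] - profile.loc['D']).abs()
print("Largest party gap:", gap.idxmax())
```

In real output, a large gap on anti-elite or Manichean density with the sign reversed on second-person address would be evidence that the two parties' high scorers reach the same composite score through different linguistic routes.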


37.13 Chapter Summary

This chapter walked through a complete political text analysis pipeline — from raw data to interpreted findings — using populist rhetoric as the application domain. The technical skills developed include: data exploration and quality assessment, feature engineering grounded in theoretical concepts, classifier training with cross-validation, confusion matrix interpretation, feature importance analysis, and time-series visualization.

The methodological lessons are inseparable from the technical ones. Every step in the pipeline — which dictionary terms to include, where to set the classification threshold, how to interpret "feature importance" — embeds conceptual decisions that shape the results. The classifier is a formalization of a theory, and the theory's limitations are the classifier's limitations.

The confusion matrix walkthrough (Section 37.5) demonstrated that model evaluation requires going beyond aggregate accuracy metrics to understand where and why the model fails. The 35% false negative rate is not a uniform tax on all speech types — it falls disproportionately on narratively sophisticated populism, regional idiomatic speech, and strategically adapted rhetoric. These patterns are theoretically interpretable and practically important for deciding how to use the classifier.

Extending the classifier to other languages and countries (Section 37.7) requires rebuilding the feature engineering from the ground up in each target context, not translating existing features. Different political cultures have different baseline vocabularies, different populist traditions, and different calibration challenges that require country-specific validation.

The ethical responsibilities section (Section 37.10) is not a formality — it identifies real risks that flow from the labeling power, amplification potential, and definitional choices embedded in rhetoric classifiers. Responsible classifier deployment requires explicit methodology disclosure, uncertainty quantification, avoidance of individual labeling where data is thin, and discipline in distinguishing measurement claims from normative claims.

Sam Harding's methodology statement captures the essential principle: the populist rhetoric classifier is a tool for systematic quantitative pattern detection, not a truth machine for establishing whether individual politicians are or are not populist. Used with appropriate epistemic humility, it reveals patterns at scale that qualitative analysis cannot achieve. Used without that humility, it substitutes a number for an argument — and that substitution, as the Measurement Shapes Reality theme reminds us, has political consequences.


Key Terms

  • Feature engineering: Transforming raw data (text) into numerical representations (feature vectors) that machine learning models can process
  • Binary classifier: A model that assigns one of two class labels to each input
  • Cross-validation: A technique for evaluating model performance by training and testing on different subsets of the data
  • AUC-ROC: Area Under the Receiver Operating Characteristic Curve; a classifier evaluation metric that measures discrimination ability regardless of classification threshold
  • False negative: A populist speech that the classifier incorrectly labels as non-populist
  • False positive: A non-populist speech that the classifier incorrectly labels as populist
  • Precision: The proportion of speeches predicted as positive (populist) that are truly positive; measures classifier purity
  • Recall: The proportion of truly positive (populist) speeches that the classifier correctly identifies; measures classifier completeness
  • Term density: The proportion of text units (sentences, words) containing a specified term or a term from a specified list
  • Manichean framing: Binary moral framing dividing the world into opposing good/evil camps
  • Threshold decision: The choice of where to set the boundary between positive and negative classifications; affects the trade-off between precision and recall
  • Code-switching: The practice of adapting rhetorical style to different audiences and contexts
  • Transfer learning limitation: The constraint that a classifier trained on one language or cultural context cannot be directly applied to a different language or culture without re-validation
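The precision and recall definitions above reduce to simple ratios over the confusion matrix. A quick numeric check, using invented counts (not the lab's actual confusion matrix), with populist speeches as the positive class:

```python
# Hypothetical confusion-matrix counts for illustration
tp, fp, fn, tn = 180, 45, 95, 1280

precision = tp / (tp + fp)   # purity: share of flagged speeches truly populist
recall    = tp / (tp + fn)   # completeness: share of populist speeches caught

print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
```

Note that 1 − recall is the false negative rate, so a recall near 0.65 corresponds to the roughly 35% false negative rate discussed in the chapter summary.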

Discussion Questions

  1. The classifier achieves reasonable accuracy (around 0.78–0.82 AUC in typical runs) but has significant false negative rates for sophisticated populist communication. What implications does this have for using the classifier to track "trends in populist rhetoric"?

  2. We used the existing populism_score as a proxy for ground truth in training the classifier, despite knowing that score has underdocumented construction. What would a better validation design look like? What resources would you need?

  3. Sam's analysis finds that populist rhetoric density is highest at rallies and lowest in floor speeches. Does this mean politicians are being hypocritical — performing populism for crowds but not believing it in legislative settings? Or does it reflect something else about the communication context? What additional data would help you distinguish these interpretations?

  4. The classifier is "ideologically agnostic" — it detects populist structure regardless of left/right content. But the chapter notes that anti-elite language is higher in Republican speeches than Democratic speeches. How should we interpret this finding given the agnostic design?

  5. If you were building the next version of this classifier, what three improvements would you prioritize? Consider both technical improvements (different model architectures, additional features) and conceptual improvements (better theoretical grounding, better validation data).

  6. Section 37.10 argues that rhetoric classifiers carry ethical responsibilities. Do you think the risks described justify significant restrictions on publishing classifier-based political analyses? Where would you draw the line between responsible and irresponsible use of these tools?