Case Study 37.1: Building the ODA Rhetoric Tracker — Design Decisions and Lessons Learned

Overview

This case study documents the design decisions Sam Harding made in building ODA's populist rhetoric tracker, focusing on the moments where technical choices had analytical and political consequences. It serves as a model for reflective data science practice: understanding not just what to build, but why each design choice matters and what alternatives were considered.

The Origin of the Project

The ODA rhetoric tracker began as an informal project in late 2021. Sam had noticed that media coverage of "populism" was almost entirely based on qualitative assessment — commentators watched a rally, read a transcript, and declared a politician "populist." There was no systematic, replicable method for making those assessments, and no consistent way to track whether rhetoric was changing over time.

"The problem with pure qualitative assessment isn't that it's wrong," Sam explains. "It's that it's not auditable. When two analysts disagree about whether a speech is populist, there's no way to figure out where the disagreement comes from. Is it the definition of populism? The interpretation of a specific phrase? A difference in baseline expectations for that speaker? A quantitative approach makes the disagreements visible."

The initial requirement specification Sam wrote in early 2022 listed four design goals:

  1. Replicability: Any two researchers applying the method to the same text should get the same score.

  2. Transparency: The method's choices should be documentable and debatable.

  3. Scale: The method should work on the full ODA speech corpus, not just a sample.

  4. Theoretical grounding: The features should be derived from the ideational definition of populism, not from statistical pattern-finding alone.

Decision Point 1: Which Text to Analyze?

ODA's speech database includes both text_excerpt (500 words, available for all speeches) and full_text (complete speech, available for ~77%). Sam's first major decision: which to use for feature computation.

Option A: full_text only. More comprehensive; features reflect the full speech arc, including the ending emotional climax where populist appeals often peak.

Option B: text_excerpt for all. Less comprehensive but consistent; no selection bias from the 23% with null full_text.

Sam's choice: text_excerpt for tracking, full_text for validation. The tracker's primary use case is longitudinal comparison — tracking whether rhetoric changes over time. For this purpose, consistent coverage of all speeches is more important than more accurate features for a subset. However, for validation studies (testing the classifier against human-coded ground truth), Sam uses full_text where available to get the most accurate feature representation.

Lesson: The appropriate data choice depends on the analytical use case, not on abstract quality. Comprehensiveness and consistency can conflict; the research question should determine which to prioritize.

Decision Point 2: Dictionary Construction

Sam consulted three sources for the populism dictionaries:

  1. Rooduijn and Pauwels' original 2011 dictionary (developed for European party manifestos)

  2. Mudde and Kaltwasser's conceptual discussion of populist discourse markers

  3. A literature review of US-specific studies of populist communication

The three sources overlapped substantially but also diverged. The original Rooduijn-Pauwels dictionary was developed for formal written text (manifestos) rather than oral political speech; its formal-register terms ("citizenry," "ordinary citizens") did not fully capture the colloquial register of US rally speeches ("regular folks," "hardworking Americans").

Sam added 28 terms to the original dictionary through:

  - Close reading of 50 high-populism-score speeches in the existing corpus

  - Consultation with two political scientists who study US populist communication

  - A pilot test comparing classifier performance with and without the additions

The additions improved AUC by approximately 0.04 on the pilot test, suggesting the US-specific extensions captured meaningful additional signal.
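The basic mechanics of dictionary-based scoring can be illustrated with a minimal sketch. The term list below is a tiny stand-in (the real dictionaries, Rooduijn-Pauwels plus Sam's 28 US-specific additions, are far larger), and the per-100-words normalization is an assumption about the implementation.

```python
import re

# Illustrative stand-in for a people-centric dictionary; mixes the original
# formal-register terms with colloquial US additions.
PEOPLE_TERMS = {"citizenry", "ordinary citizens", "regular folks",
                "hardworking americans", "the people"}


def dictionary_rate(text: str, terms: set[str]) -> float:
    """Dictionary hits per 100 words; multi-word terms matched as phrases."""
    lowered = text.lower()
    n_words = len(re.findall(r"\b\w+\b", lowered))
    if n_words == 0:
        return 0.0
    hits = sum(len(re.findall(re.escape(term), lowered)) for term in terms)
    return 100.0 * hits / n_words


# Two hits ("regular folks", "the people") in seven words:
print(dictionary_rate("Regular folks know the people deserve better.",
                      PEOPLE_TERMS))  # roughly 28.6 hits per 100 words
```

A dictionary extension like Sam's simply enlarges the term set: "regular folks" scores here only because the colloquial term was added alongside the formal "citizenry."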

Lesson: Published dictionaries may need adaptation when applied to different cultural contexts, speech registers, or time periods. The adaptation process should be documented and justified, not silently implemented.

Decision Point 3: Treating the Existing populism_score

The existing populism_score column was Sam's inheritance — built by a previous researcher whose methodology was incompletely documented. Sam faced a choice:

Option A: Accept the score as ground truth and build the classifier to predict it.

Option B: Treat the score with skepticism, conduct a reverse-engineering analysis to understand its construction, and use it as a starting point but not an unexamined gold standard.

Sam chose Option B. The reverse-engineering analysis (Section 37.3.2 in the chapter) revealed that the existing score correlated most strongly with anti-elite language and Manichean framing, with weaker contributions from people-centric vocabulary. This pattern suggested the existing score prioritized elite-critique over people-centrism in its weighting scheme.

This finding had consequences. If Sam had simply accepted the existing score, the classifier would reproduce its weighting — potentially under-weighting the people-centric dimension relative to a theoretically balanced operationalization. By understanding the existing score's construction, Sam could make a deliberate choice about whether to replicate it or build a differently-weighted alternative.

Lesson: Inherited data artifacts — pre-computed scores, prior-team classifications, previous researchers' feature engineering — should never be treated as neutral ground truth. Investigate their construction and understand their assumptions before building on them.

Decision Point 4: Handling Sophisticated Populism

In pilot testing, Sam identified a systematic false negative pattern: speeches by politically experienced populist communicators who had clearly learned to avoid explicit dictionary vocabulary while maintaining populist rhetorical effects. These speeches had high human-expert populism assessments but low classifier scores.

Three responses were available:

  1. Expand the dictionary to capture more of the implicit vocabulary.

  2. Add structural features that capture narrative patterns without relying on specific vocabulary.

  3. Accept the limitation and document it clearly.

Sam implemented Option 2 (adding the contrast_ratio and second_person_density features) and Option 3 (the methodology statement explicitly flags this limitation). Option 1 was rejected because expanding the dictionary to capture implicit patterns risks capturing non-populist text that shares the implicit patterns — reducing precision while improving recall.

The second_person_density feature proved particularly valuable: Whitfield-type speakers who avoid explicit "elite" vocabulary often use heavy direct address ("they don't care about you") that creates the in-group/out-group structure through pronouns rather than nouns. This feature captured that pattern without requiring explicit populist vocabulary.
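A minimal version of the second_person_density feature can be sketched as the share of tokens that are second-person pronouns. The exact pronoun set and normalization here are assumptions; the tracker's real implementation may differ.

```python
import re

# Assumed pronoun set -- the actual feature may use a different inventory.
SECOND_PERSON = {"you", "your", "yours", "yourself", "yourselves"}


def second_person_density(text: str) -> float:
    """Fraction of tokens that are second-person pronouns."""
    tokens = re.findall(r"\b[\w']+\b", text.lower())
    if not tokens:
        return 0.0
    return sum(token in SECOND_PERSON for token in tokens) / len(tokens)


# Direct address builds the in-group/out-group structure with pronouns,
# not explicit "elite" vocabulary:
print(second_person_density("They don't care about you. They never did."))
```

Note that this feature fires on exactly the pattern the dictionaries miss: the example sentence contains no populist dictionary terms at all, yet its pronoun structure opposes "they" to "you."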

Lesson: Classifier limitations are partially addressable through thoughtful feature engineering, but some limitations reflect fundamental constraints of vocabulary-based approaches and should be documented rather than hidden by over-engineered features that introduce new problems.

What Sam Learned About Measurement and Politics

Looking back on two years of building and using the rhetoric tracker, Sam offers three reflections:

The classifier changed how ODA's journalists reported. Once quantitative populism scores were available, there was pressure to include them in stories — to say "Whitfield's rhetoric scores in the 88th percentile" rather than "Whitfield uses populist rhetoric." This quantitative precision is sometimes appropriate and sometimes misleading; it creates an impression of objectivity that the score's construction doesn't warrant. Sam has had to actively push back against colleagues treating classifier outputs as facts rather than measurements.

The tracker has been useful for editorial assignments. When an editor asks "is there a trend worth covering in Republican rhetoric?" the time-series plots from the tracker can provide evidence-based answers that support or challenge editorial intuition. This is the tracker's strongest application.

The counter-populism researchers' paradox is real. Several campaigns and political organizations have asked ODA about using the rhetoric tracker for campaign strategy purposes. Sam declined, citing ODA's non-partisan mission, but the interaction highlighted how easily accountability tools can be repurposed as campaign resources. The same features that identify when a politician's rhetoric is escalating toward populism can be used to help a campaign understand when its messaging is not populist enough to activate its base.

Discussion Questions

  1. Sam chose to use text_excerpt rather than full_text for tracking purposes, prioritizing coverage consistency over feature accuracy. What alternative research designs might mitigate the tradeoff (e.g., strategies for imputing full-text features for speeches where only excerpts exist)?

  2. The reverse-engineering analysis of the existing populism_score revealed that it prioritizes anti-elite language over people-centric language. Is this appropriate given the ideational definition, which emphasizes both dimensions? What would a theoretically-weighted score that equally values all three dimensions (anti-elite, people-centric, Manichean) look like, and would it produce different findings?

  3. Sam declined to share the ODA rhetoric tracker with political campaigns. But the methodology is published in this textbook. What is the practical effect of that refusal? Does publishing the method while refusing to share the implementation provide meaningful accountability protection?