Case Study 37.2: The Gap Between Map and Territory — Three Cases of Classifier Failure
Overview
This case study presents three concrete cases where the ODA populist rhetoric classifier produced outputs that diverged significantly from what careful human analysis would conclude. These cases are not included to discredit the classifier but to develop the critical analytical thinking that responsible use of any measurement tool requires. The gap between the classifier's map and the political territory it claims to represent is where the most important analytical lessons live.
Case 1: The Governor's First Inaugural (False Negative)
In January 2021, a newly elected Republican governor of a competitive Midwestern state delivered an inaugural address that independent political scientists uniformly assessed as exhibiting strong populist themes. The speech was notable for: extensive framing of the speaker as a political outsider despite having served in the state legislature; explicit references to "the people of this state" versus "the disconnected leadership in the state capitol and in Washington"; and repeated use of the phrase "they don't know your lives" as a contrast device.
The classifier's assessment: Populism probability: 0.31. Below the 0.40 threshold; classified as non-populist.
Why the classifier failed: The speech exemplifies what Sam Harding calls "vocabulary-evasive populism." The governor's communications team had warned her that media critics would apply the "populist" label if she used explicit anti-elite vocabulary, so she achieved the populist effect through:
- Personal narrative structure (extensive biographical detail establishing outsider credentials)
- Indirect pronouns ("they don't know your lives" rather than "the elite doesn't know your lives")
- Policy-specific framing ("the education bureaucracy" rather than "the educational elite")
- Hometown imagery rather than explicit people-category invocations
Among the classifier's eight features, the speech registered: low anti_elite_density (no explicit "elite/establishment/corrupt" vocabulary), moderate people_centric_density (some people-centered language), low manichean_density (the contrast was implied rather than stated), but high second_person_density (extensive "you/your" direct address).
The analytical lesson: second_person_density was the only feature that captured something real about this speech's populist character; the other features misrepresented it. A researcher using the classifier to study this governor's populism trajectory would conclude she was a moderate; political scientists who read the speech would strongly disagree. The classifier's map and the territory diverge precisely where the political actor made a deliberate strategic choice to evade measurement.
What should a researcher do? Flag this speech for human review. The high second_person_density combined with low dictionary-term density should trigger a "possible vocabulary-evasive populism" alert rather than a confident non-populist classification.
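A minimal sketch of such an alert rule, assuming the eight features are available as a dictionary of normalized densities; the cutoff values are illustrative, not the ODA classifier's calibrated thresholds:

```python
def flag_for_review(features: dict) -> str | None:
    """Flag a text when direct address ("you/your") is high but explicit
    anti-elite / Manichean dictionary terms are nearly absent."""
    # Cutoffs are illustrative; in practice they would be tuned on
    # hand-reviewed examples of vocabulary-evasive speeches.
    high_direct_address = features["second_person_density"] > 0.6
    low_dictionary_terms = (
        features["anti_elite_density"] < 0.05
        and features["manichean_density"] < 0.05
    )
    if high_direct_address and low_dictionary_terms:
        return "possible vocabulary-evasive populism"
    return None


# A Case 1-like feature profile (invented values) is routed to human review.
print(flag_for_review({
    "second_person_density": 0.72,
    "anti_elite_density": 0.01,
    "manichean_density": 0.02,
}))
```

The point of the rule is not to rescue the classification but to route the text to a human reader rather than letting a confident non-populist label stand.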
Case 2: The Academic Policy Memo (False Positive)
A research memo produced by a university policy center was included in the ODA corpus because it was submitted as public testimony to a state legislative committee by an interest group. The memo analyzed populist movements in three US states, describing their rhetoric, tactics, and policy effects.
The classifier's assessment: Populism probability: 0.71. Above the 0.40 threshold; classified as populist.
Why the classifier failed: The memo was about populism, not an instance of populism. In describing populist rhetoric, it necessarily used populist vocabulary: "elite capture," "ordinary citizens," "the people's demands," "corrupt establishment," "Manichean framing." These descriptions appeared in every section of the analysis.
Among the classifier's eight features, the memo registered: high anti_elite_density (extensive use of "elite/establishment/corrupt" in descriptive context), moderate people_centric_density, moderate manichean_density, and very high contrast_ratio (the memo repeatedly used "populist X vs. mainstream Y" constructions as analytical descriptions).
The analytical lesson: The classifier cannot distinguish performing populism from describing populism. This is a fundamental limitation of vocabulary-based approaches: the vocabulary of populism analysis overlaps with the vocabulary of populist communication itself. Any corpus that includes academic analysis of populism, journalism about populist politicians, or criticism of populist movements will contain false positives on texts that use populist vocabulary analytically rather than politically.
What should a researcher do? Apply source filtering. Texts produced by academic institutions, think tanks, and journalistic organizations should be flagged as potentially "meta-populist" (analytical rather than performative) before applying the classifier. Event type metadata (the ODA dataset includes event_type) can help: "testimony" and "policy_brief" categories are at higher risk for this pattern than "rally" or "town_hall."
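A minimal sketch of such a source filter, assuming the corpus is loaded as a pandas DataFrame; event_type is a documented ODA field as noted above, while the source_type column and its category names are illustrative assumptions:

```python
import pandas as pd

# Event types the text identifies as higher risk for analytical ("meta-populist")
# texts; the source_type categories are illustrative additions, not ODA fields.
ANALYTICAL_EVENT_TYPES = {"testimony", "policy_brief"}
ANALYTICAL_SOURCE_TYPES = {"university", "think_tank", "news_organization"}

def flag_meta_populist(corpus: pd.DataFrame) -> pd.DataFrame:
    """Mark texts that should be reviewed as potentially analytical
    (about populism) before the classifier's output is trusted."""
    corpus = corpus.copy()
    corpus["meta_populist_risk"] = (
        corpus["event_type"].isin(ANALYTICAL_EVENT_TYPES)
        | corpus["source_type"].isin(ANALYTICAL_SOURCE_TYPES)
    )
    return corpus
```

Flagging rather than dropping keeps genuinely populist speeches by politically engaged academics or journalists in the corpus, pending human review.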
Case 3: The Bipartisan Jobs Bill Speech (A Theoretically Indeterminate Borderline Result)
A Democratic senator's floor speech supporting a bipartisan infrastructure bill was the subject of debate among ODA's analysts. The speech celebrated "working families," "communities that Washington has left behind," and "hardworking Americans who deserve better than the promises politicians keep making and breaking." Nadia Osei, reviewing the speech as part of her research on Garza's messaging strategy, noted that it sounded "more populist than our senator's usual register."
The classifier's assessment: Populism probability: 0.44. Just above threshold; classified as populist.
The human analysts' disagreement:
- Sam Harding: "The speech is using people-centric vocabulary, but there's no anti-elite critique. The senator is praising Washington's ability to deliver for communities, not attacking Washington as corrupt. The frame is 'government can work for you,' which is the opposite of populism."
- A second analyst: "The 'promises politicians keep making and breaking' language is clearly anti-establishment, even if it's mild. It's light populism but real populism."
- A third analyst: "Context matters entirely here. The same words in a different setting would read as populism. In the context of a bipartisan bill passage, it reads as boilerplate legislative celebration language."
Why the classifier produced a borderline result:
The speech exhibited: high people_centric_density (people-centered vocabulary throughout); moderate second_person_density; but very low anti_elite_density (almost no explicit elite critique) and very low manichean_density (no binary framing). The borderline classification (0.44) is, mathematically, the net result of strong people-centric features pulling the score up while near-absent anti-elite features hold it down.
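To make that mechanism concrete, here is a purely illustrative calculation that assumes, only for illustration, a logistic combination of four of the features; the weights and bias below are invented and are not the ODA classifier's actual parameters:

```python
import math

# Invented feature values mirroring the Case 3 profile described above.
features = {
    "people_centric_density": 0.80,   # high
    "second_person_density": 0.45,    # moderate
    "anti_elite_density": 0.02,       # very low
    "manichean_density": 0.01,        # very low
}
# Invented weights and bias, for illustration only.
weights = {
    "people_centric_density": 1.2,
    "second_person_density": 0.6,
    "anti_elite_density": 2.0,
    "manichean_density": 1.5,
}
bias = -1.53

logit = bias + sum(weights[name] * value for name, value in features.items())
probability = 1 / (1 + math.exp(-logit))
print(round(probability, 2))  # ≈ 0.44: borderline, carried by people-centric features alone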
What the human disagreement reveals: The analysts' disagreement maps directly onto a theoretical debate about whether people-centrism alone (without anti-elitism) constitutes populism. If you define populism as requiring both people-centrism and anti-elitism (Mudde's definition), this speech is non-populist. If you define populism as a spectrum where people-centrism alone constitutes "soft populism," it qualifies. The classifier's 0.44 score is not a mistake — it is an accurate reflection of a theoretically underdetermined case.
What should a researcher do? Borderline cases (scores near the threshold, say 0.35–0.45) are not good candidates for confident classification. A responsible analysis would code borderline cases as "ambiguous" and either exclude them from trend analyses or conduct sensitivity tests showing whether conclusions change when borderline cases are included or excluded.
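A minimal sketch of such a sensitivity test, assuming scores are stored in a pandas DataFrame with a populism_prob column (the column name is an assumption):

```python
import pandas as pd

AMBIGUOUS_BAND = (0.35, 0.45)   # borderline band suggested in the text
THRESHOLD = 0.40                # classification threshold used throughout

def sensitivity_check(scores: pd.DataFrame) -> dict:
    """Compare the populist share of a corpus with and without borderline
    cases, to see whether substantive conclusions depend on them."""
    low, high = AMBIGUOUS_BAND
    borderline = scores["populism_prob"].between(low, high)
    share_all = (scores["populism_prob"] >= THRESHOLD).mean()
    share_excl = (scores.loc[~borderline, "populism_prob"] >= THRESHOLD).mean()
    return {
        "n_borderline": int(borderline.sum()),
        "populist_share_all": round(float(share_all), 3),
        "populist_share_excluding_borderline": round(float(share_excl), 3),
    }
```

If the two shares diverge meaningfully, the trend conclusion is being driven by theoretically contested cases and should be reported as such.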
Cross-Case Lessons
These three cases together illustrate the core analytical principle of this chapter:
The classifier is sensitive to explicit vocabulary, insensitive to intent and context. All three cases turn on context:
- Case 1: Vocabulary-evasive intent produces a false negative
- Case 2: Analytical context produces a false positive
- Case 3: Ambiguous context produces a theoretically indeterminate result
The classifier's errors are not random. They are systematically related to:
- Speaker sophistication (more sophisticated speakers are more likely to produce false negatives)
- Text type (analytical texts about populism are more likely to produce false positives)
- Theoretical borderline cases (texts near definitional boundaries produce unstable results)
The gap between map and territory is informative, not just a failure. Case 3 is not simply a classifier error — it surfaces a genuine theoretical disagreement about whether people-centrism alone constitutes populism. The classifier's ambiguous output makes that theoretical question concrete and empirically tractable in a way that pure qualitative discussion does not.
The Meta-Lesson for Political Analytics
Sam Harding draws a broader lesson from these cases that applies beyond populism measurement: classifier disagreement with human judgment is usually most analytically interesting at the classifier's failure modes. The cases where the map and territory agree (clear populism, clear non-populism) tell you that the tool works. The cases where they disagree tell you something about the phenomenon itself — about how political actors strategically adapt to measurement, about the boundaries of concepts, about the ambiguity inherent in classification projects.
A political analytics practice that only reports what the classifier confirms and ignores what it misses is not rigorous analysis. It is the construction of a convenient reality from the data that happens to fit the measurement tool. The responsible analyst stays alert to the gap — and treats the gap as information, not as noise to be filtered out.
Discussion Questions
- The governor's vocabulary-evasive populism (Case 1) was described as a deliberate strategic choice to avoid the "populist" label. This represents a feedback loop between academic measurement and political practice. How should researchers respond when their measurement instruments are evaded by the subjects being measured?
- The false positive from the academic policy memo (Case 2) suggests that researchers should apply source filtering before using populism classifiers on mixed corpora. Design a source filtering scheme for the ODA corpus that would reduce false positives from analytical texts while not excluding genuinely populist speeches by politically engaged academics or journalists.
- The three human analysts in Case 3 disagree about whether the bipartisan bill speech is populist, and their disagreement maps onto a theoretical controversy. What does this suggest about the appropriate role of human expert coding in classifier validation? How many coders, and what expertise level, would be needed for adequate inter-rater reliability?