Case Study 27.2: The Readability Trap — When Simplicity Becomes a Political Signal
Background
One of the unexpected findings from Sam's text analysis pipeline was the relationship between readability scores and electoral context. Across the ODA speech corpus, speeches at rallies scored consistently simpler (lower Flesch-Kincaid Grade Level, shorter average sentence length) than speeches at townhall events or press conferences — roughly 1.5 to 2 grade levels simpler on average. This was expected; rally communication is designed for emotional engagement, not policy depth.
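Flesch-Kincaid Grade Level is a deterministic function of average sentence length and average syllables per word (0.39 × words per sentence + 11.8 × syllables per word − 15.59). A minimal self-contained sketch, using a rough vowel-group heuristic for syllables rather than the dictionary-based counters a production pipeline would use, reproduces the qualitative pattern: rally-style prose scores well below policy prose.

```python
import re

def count_syllables(word):
    # Rough heuristic: count vowel groups. Production pipelines use a
    # pronunciation dictionary (e.g. CMUdict) or a package like textstat.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * (len(words) / len(sentences))
            + 11.8 * (syllables / len(words))
            - 15.59)

# Toy examples (not from the ODA corpus):
rally = "We will win. They failed you. We fight for you."
policy = ("Our infrastructure proposal allocates substantial appropriations "
          "toward municipal transportation modernization initiatives.")
print(flesch_kincaid_grade(rally), flesch_kincaid_grade(policy))
```

Because the formula rewards short sentences and short words, any stylistic shift toward punchy, monosyllabic phrasing lowers the score mechanically.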
What was not expected was the interaction: the gap between rally and townhall readability was significantly larger for candidates with higher populism scores. High-populism candidates didn't just use simpler language at rallies — they switched more dramatically between contexts. The range of their reading grade levels across event types was wider. They appeared to be deliberately code-switching: performing simplicity when it served their populist brand.
This case study explores the methodological and interpretive challenges this finding created for Sam and ODA — and what it reveals about the relationship between computational measurement and political meaning.
Part I: The Initial Finding
Sam had run the readability analysis as a routine component of the pipeline — a descriptive measure, not a primary research question. But when they added populism_score as a moderator variable in a visualization, the interaction pattern was striking.
# The code Sam used to identify the interaction
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
# Bin populism score into quartiles
speeches['populism_quartile'] = pd.qcut(
    speeches['populism_score'],
    q=4,
    labels=['Q1 (Low)', 'Q2', 'Q3', 'Q4 (High)']
)
# For each populism quartile, compute readability by event type
pivot = (
    speeches
    .groupby(['populism_quartile', 'event_type'])['flesch_kincaid_grade']
    .mean()
    .unstack()
)
print("Flesch-Kincaid Grade Level by Populism Quartile and Event Type:")
print(pivot.round(2))
# The range (max - min grade level across event types) as a measure of code-switching
pivot['grade_range'] = pivot.max(axis=1) - pivot.min(axis=1)
print("\nGrade level range (code-switching measure):")
print(pivot['grade_range'].round(2))
The output showed: Q1 (low populism) candidates had a grade range of approximately 1.8 levels across event types. Q4 (high populism) candidates had a grade range of approximately 3.6 levels — twice as wide. High-populism candidates were switching more aggressively between simple and complex registers depending on the audience.
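The pivot table identifies the pattern descriptively; the natural formal check is an OLS model with a populism × event-type interaction term. A hedged sketch on synthetic data (statsmodels assumed available; only two of the three event types used for brevity, and the data generated to mimic the pattern described above, not drawn from the ODA corpus):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "populism_score": rng.uniform(0, 1, n),
    "event_type": rng.choice(["rally", "townhall"], n),
})
# Simulate the pattern from the text: rallies are simpler overall,
# and the rally-townhall gap widens as populism_score rises.
is_rally = (df["event_type"] == "rally").astype(float)
df["flesch_kincaid_grade"] = (
    10.0
    - 1.5 * is_rally
    - 2.0 * df["populism_score"] * is_rally
    + rng.normal(0, 1.0, n)
)

# The interaction term tests whether the slope of populism_score
# differs between event types.
model = smf.ols(
    "flesch_kincaid_grade ~ populism_score * C(event_type)", data=df
).fit()
print(model.params.round(2))
```

With "rally" as the reference category, a positive populism_score:C(event_type)[T.townhall] coefficient corresponds to the widening gap Sam observed: the more populist the candidate, the larger the complexity difference between townhall and rally speech.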
Part II: The Interpretive Challenge
Sam brought this finding to Adaeze with four possible interpretations:
Interpretation A (Authentic populism): High-populism candidates communicate in genuinely simpler ways because they are less policy-sophisticated. They speak simply because that's how they think. The wider grade range reflects genuine uncertainty when confronted with townhall policy questions: answering less fluently produces longer, more tangled sentences that inflate the measured grade level.
Interpretation B (Strategic populism): High-populism candidates deliberately adopt simpler language at rallies as a brand-building performance — plain-speaking as political identity construction. Their actual cognitive engagement is equally sophisticated across contexts, but they strategically modulate their register.
Interpretation C (Measurement artifact): The populism score scale captures certain linguistic patterns (anti-elite language, people vs. establishment framing) that happen to co-occur with simple sentence construction in the training corpus. The readability-populism correlation is partly a measurement artifact — two scales that partially overlap in their linguistic features.
Interpretation D (Selection effect): High-populism candidates in this dataset happen to be running in particular types of competitive races, against particular types of opponents, in particular states. The readability pattern is confounded with unmeasured variables correlated with both populism scores and speech context.
Sam was honest with Adaeze: "I can't rule out C or D without additional analysis. Interpretations A and B are substantively distinct but computationally indistinguishable from each other using this data."
Part III: Testing the Alternatives
Sam designed three additional analyses to probe the competing interpretations:
Test of Interpretation C (Measurement artifact):
Sam examined the specific linguistic features in the populism scale — the items used to generate the populism_score. Three of the five feature sets in the scale involved sentence-level patterns (short, declarative sentences; direct address to "the people"; anti-establishment noun phrases) that are correlated by construction with Flesch-Kincaid scores. Sam computed a partial correlation between populism score and Flesch-Kincaid grade controlling for these overlapping features.
from sklearn.linear_model import LinearRegression
import numpy as np
import pandas as pd
from scipy import stats
# Create simplified proxy for the overlapping features
# (using avg_sentence_length as a proxy for the sentence-structure overlap)
X_control = speeches[['avg_sentence_length']].dropna()
y_fk = speeches.loc[X_control.index, 'flesch_kincaid_grade']
y_pop = speeches.loc[X_control.index, 'populism_score']
# Residualize: partial out the effect of sentence length from both variables
def residualize(X, y):
    """Return residuals of y after regressing on X."""
    mask = ~(pd.isna(X).any(axis=1) | pd.isna(y))
    X_clean = X[mask]
    y_clean = y[mask]
    model = LinearRegression().fit(X_clean, y_clean)
    resid = y_clean - model.predict(X_clean)
    return resid
fk_resid = residualize(X_control, y_fk)
pop_resid = residualize(X_control, y_pop)
r, p = stats.pearsonr(fk_resid, pop_resid)
print(f"Partial correlation (controlling for sentence length): r = {r:.3f}, p = {p:.4f}")
The partial correlation was r = -0.18, still statistically significant but smaller in magnitude than the raw correlation of r = -0.28. Interpretation C had some support: part of the readability-populism relationship was measurement overlap. But a meaningful partial correlation remained.
Test of Interpretation D (Selection effects): Sam added state and race-type fixed effects to a regression predicting Flesch-Kincaid Grade Level from populism score. The populism coefficient remained significant after controls, suggesting the finding was not driven purely by state-level or race-type confounding.
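Sam's fixed-effects specification can be sketched with statsmodels' formula interface, treating state and race type as categorical fixed effects. The column names and data below are illustrative assumptions, not ODA's actual schema:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 600
# Synthetic data; 'state' and 'race_type' are assumed column names.
speeches = pd.DataFrame({
    "populism_score": rng.uniform(0, 1, n),
    "state": rng.choice(["AZ", "GA", "MI", "PA"], n),
    "race_type": rng.choice(["senate", "house", "governor"], n),
})
speeches["flesch_kincaid_grade"] = (
    9.0 - 2.0 * speeches["populism_score"] + rng.normal(0, 1.0, n)
)

# C(...) expands each grouping variable into dummies: the fixed effects.
# If the populism coefficient survives these controls, the relationship
# is not purely between-state or between-race-type variation.
fe_model = smf.ols(
    "flesch_kincaid_grade ~ populism_score + C(state) + C(race_type)",
    data=speeches,
).fit()
print(fe_model.params["populism_score"], fe_model.pvalues["populism_score"])
```

Note that fixed effects only address confounders that are constant within a state or race type; candidate-level confounders would require candidate fixed effects or additional covariates.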
Test of Interpretations A vs. B (Authentic vs. Strategic): This proved harder. Sam could not directly distinguish authentic from strategic simplicity using text data alone. They proposed a design for a follow-up study: content-coding a sample of high-populism speeches by trained coders for "substantive policy content" (operationalized as specific policy proposals, factual claims with evidence, quantitative targets). If high-populism candidates showed lower substantive policy content at rallies but comparable content at press conferences, that would support the strategic code-switching interpretation. If substantive content was uniformly lower, that would support Interpretation A.
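If the content-coding study were executed, the first analytic step would be verifying inter-coder reliability before comparing substantive-content rates across event types. A minimal sketch with two hypothetical coders and invented binary codes:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical binary codes from two trained coders: does each speech
# segment contain a specific policy proposal? (1 = yes, 0 = no)
coder_a = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0]
coder_b = [1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0]

# Cohen's kappa corrects raw agreement for agreement expected by chance.
kappa = cohen_kappa_score(coder_a, coder_b)
print(f"Cohen's kappa: {kappa:.2f}")
```

Comparisons of substantive content across event types are only meaningful once agreement beyond chance is established (conventionally kappa above roughly 0.6 to 0.7).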
ODA did not have the resources to conduct the content-coding study before the election. Sam flagged it for the post-election research agenda.
Part IV: The Publication Decision
Sam's finding was genuinely interesting — potentially publishable in a political communication journal with the right framing and additional analysis. But ODA had an immediate question about publication: could they write about the code-switching finding in their public-facing briefings during the election?
Adaeze convened a brief ethics review. The concern: publishing a finding that high-populism candidates "strategically simplify" their language could be read as a negative characterization of those candidates — as cynical or manipulative. The finding might be accurate, but the framing could shade into electoral commentary that ODA, as a non-partisan organization, was not positioned to make.
They agreed on a formulation: "Our analysis finds that candidates with higher populism scores show larger variation in language complexity across event contexts. Whether this reflects strategic communication choice or natural variation in speech contexts is an empirical question we cannot resolve from this data." This formulation was accurate, acknowledged the limitation, and avoided the interpretive leap that would have been editorially problematic.
Part V: Broader Lessons for Text Analysis Methodology
This case illustrates several principles that Sam incorporated into ODA's standard analytical documentation:
Principle 1: Correlation structure in your measures matters. When multiple computational measures draw on overlapping linguistic features, correlation between them may reflect measurement overlap rather than theoretical constructs. Always examine the specific features underlying your measures for overlap.
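In practice, Principle 1 can be checked by correlating each underlying scale feature directly against the readability measure. A sketch with illustrative (assumed) feature columns, one of which overlaps with Flesch-Kincaid by construction:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 300
# Illustrative (assumed) scale features, not the actual populism items.
feats = pd.DataFrame({
    "avg_sentence_length": rng.normal(18, 5, n),
    "anti_elite_terms": rng.poisson(3, n).astype(float),
    "direct_address_rate": rng.uniform(0, 0.2, n),
})
# Flesch-Kincaid depends on sentence length directly, so this feature
# correlates with the readability measure by construction.
feats["flesch_kincaid_grade"] = (
    0.39 * feats["avg_sentence_length"] + rng.normal(0, 1.0, n)
)
print(feats.corr()["flesch_kincaid_grade"].round(2))
```

A feature that correlates strongly with the outcome measure before any theory is invoked is a candidate source of measurement overlap and should be partialed out, as Sam did.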
Principle 2: Distinguish statistical significance from interpretive certainty. A statistically significant finding is an invitation to interpret, not a conclusion. The readability-populism interaction was highly significant (p < .001) — and multiple substantively distinct interpretations remained viable.
Principle 3: Publication context changes your interpretive obligations. A finding that is appropriate for an academic paper (with full methodology section, explicit limitations, peer review) may require different handling in a public-facing civic data context (where readers bring less statistical sophistication and where political implications are immediate).
Principle 4: Follow-up study design is part of the finding. When a finding is interesting but its interpretation is ambiguous, explicitly designing the study that would resolve the ambiguity — even if you cannot execute it — is a productive research output. Sam's content-coding proposal was a contribution, not a concession of failure.
Discussion Questions
- Sam identified four competing interpretations of the readability-populism finding. Which do you find most plausible based on your knowledge of political communication? What evidence beyond Sam's partial correlation and fixed effects test would you want to see?
- The measurement artifact concern (Interpretation C) is a fundamental challenge in computational social science — when scales are built from similar linguistic features, they will correlate regardless of theoretical relationship. How should researchers test for this problem in their own work?
- ODA chose a formulation for their public briefing that acknowledged uncertainty without making an interpretive claim. Was this the right choice? Is there a version of the finding that is both honest and informative for public audiences without risking unfair electoral characterization?
- Sam proposed a content-coding follow-up study as part of the finding. How does this proposal illustrate the complementary relationship between computational and qualitative methods? What would a full mixed-methods research design for this question look like?
- The "readability gap" between rally speeches and townhall speeches was 1.5–2 grade levels across all candidates, with high-populism candidates showing a 3.6 grade level gap. If you were advising a campaign on communication strategy, how would you use this finding? What would you recommend, and what caveats would you attach?
- Sam insisted on the epistemic humility slide — a summary of what the analysis cannot support — at the beginning of every presentation. Some data journalists and advocacy researchers argue that leading with limitations undercuts the impact of findings and makes non-technical audiences dismiss the results. How would you respond to this argument?