Case Study 27.1: Sam's Pipeline — From Speech Corpus to Actionable Intelligence
Background
Sam Harding had worked in data journalism for seven years before joining ODA. Their career had taken them through a regional newspaper's data team, a national political media organization, and a brief stint at a civic technology nonprofit before Adaeze recruited them. Sam had a reputation for two things: building tools that were technically excellent, and being radically honest about what those tools could not do.
After the Garza-Whitfield fact-check — which reached approximately 18,500 people against a false claim that had reached 74,000 — Adaeze asked Sam to step back from individual fact-checking and build a more systematic analytical infrastructure. The question: could computational text analysis of the ODA speech and media datasets reveal patterns in political communication that would help ODA be more proactive, rather than reactive?
This case study follows Sam's complete project arc: from question formation through data preparation, analysis, findings, and the difficult conversation about what the findings could and could not support.
Part I: Framing the Questions
Sam's first meeting with Adaeze produced five research questions:
1. Are there measurable differences in language complexity and emotional register between campaign speeches by incumbents and challengers in competitive races?
2. Does the sentiment of media coverage of a candidate correlate with subsequent factual accuracy problems in that coverage?
3. Do candidates shift their language (vocabulary, readability, sentiment) as Election Day approaches?
4. Is there evidence of coordinated language across superficially independent sources — the kind of pattern that might indicate talking points circulating through networks before appearing in public speech?
5. Does the populism_score variable, derived from a validated academic scale, predict anything about a speech's media coverage or factual treatment in subsequent fact-checks?
Sam immediately raised a methodological concern about Question 4. "The kind of coordinated language we're looking at is genuinely possible to detect computationally — shared unusual phrases, time-ordered adoption of specific terminology. But we have to be careful not to assume coordination when we see similar language. Ideologically aligned people sound similar. That's not evidence of coordination."
They agreed: the pipeline would be built to detect shared language patterns, not to make claims about coordination without additional evidence.
Part II: Data Preparation Challenges
The oda_speeches.csv dataset had several data quality problems that Sam documented before analysis.
Missing text: Approximately 8% of rows in full_text were empty or contained only metadata. Sam created a text_quality flag variable (0 = missing/under 100 words, 1 = text_excerpt only, 2 = full_text available) and restricted analyses to text_quality >= 1 for most questions and text_quality == 2 for the topic modeling and classifier.
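The flag itself is straightforward to compute. A minimal sketch in Python, using the dataset's full_text and text_excerpt fields and Sam's 100-word cutoff (the function name is illustrative):

```python
def text_quality(full_text, text_excerpt, min_words=100):
    """Sam's three-level flag: 0 = missing or under min_words,
    1 = excerpt only, 2 = usable full text."""
    if full_text and len(full_text.split()) >= min_words:
        return 2
    if text_excerpt and text_excerpt.strip():
        return 1
    return 0
```

Applied row-wise, this makes the later filters (`text_quality >= 1`, `text_quality == 2`) one-line operations.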
Inconsistent speaker names: The same candidate appeared with multiple name variants ("Elena Garza," "Rep. Garza," "E. Garza," "Garza"). Sam wrote a normalization function using fuzzy string matching (via the fuzzywuzzy library) to consolidate to a canonical name per candidate.
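A sketch of the normalization step. This version substitutes the standard library's difflib for fuzzywuzzy so it runs with no dependencies; the canonical roster, title list, and similarity cutoff are illustrative assumptions, not Sam's actual values:

```python
from difflib import get_close_matches

CANONICAL = ["Elena Garza", "Marcus Whitfield"]  # hypothetical roster

def strip_titles(name):
    # Drop common honorifics before matching.
    for title in ("Rep.", "Sen.", "Gov.", "Dr."):
        name = name.replace(title, "")
    return name.strip()

def normalize_speaker(raw, canonical=CANONICAL, cutoff=0.4):
    """Map a raw speaker string to the closest canonical name,
    or return it unchanged if nothing clears the cutoff."""
    matches = get_close_matches(strip_titles(raw), canonical, n=1, cutoff=cutoff)
    return matches[0] if matches else raw
```

In a real pipeline, matches below the cutoff would be routed to manual review rather than silently passed through.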
Date parsing: Several dates were formatted inconsistently (some ISO 8601, some MM/DD/YYYY, some with approximate dates noted as "early October"). Sam created a date_confidence variable (exact / month / approximate) and restricted temporal analyses to exact dates.
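The three-level flag can be sketched as a cascade of parse attempts, assuming the formats mentioned above (ISO 8601, MM/DD/YYYY, month-plus-year, and free text such as "early October"); the exact format list Sam used is not specified:

```python
from datetime import datetime

def date_confidence(raw):
    """Classify a raw date string as exact / month / approximate."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):       # ISO 8601, US-style
        try:
            datetime.strptime(raw.strip(), fmt)
            return "exact"
        except ValueError:
            pass
    try:
        datetime.strptime(raw.strip(), "%B %Y")  # e.g. "October 2024"
        return "month"
    except ValueError:
        return "approximate"                     # e.g. "early October"
```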
Outliers in word count: Three speeches had word counts above 15,000 — unusually long, suggesting possible transcription errors that combined multiple speeches. Sam flagged these for review and excluded them from readability analysis (where extreme length can distort averages).
Sam documented each decision in a preprocessing log — a practice they insisted on as essential for reproducible research. "If I can't explain every decision to Adaeze in plain language," Sam told a junior colleague, "I haven't made a decision — I've made a mistake I haven't noticed yet."
Part III: The Findings
On Question 1 (Incumbents vs. Challengers): Incumbents showed significantly higher Flesch-Kincaid Grade Level scores (mean 9.2, SD 2.1) than challengers (mean 8.1, SD 2.4), a difference of approximately 1.1 grade levels (t = 4.87, p < .001). Sam's interpretation: incumbents have more policy substance to defend and more policy-specific language in their vocabulary. Their communication is modestly less accessible than challengers'. However, Sam flagged the alternative explanation: challenger candidates in competitive races skew toward economic populism, which research suggests correlates with simpler language.
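For readers unfamiliar with the metric: Flesch-Kincaid Grade Level is a deterministic function of sentence length and syllable density, FKGL = 0.39 × (words per sentence) + 11.8 × (syllables per word) − 15.59. A naive sketch with a crude vowel-group syllable heuristic (a production pipeline would use a library such as textstat, which handles syllables properly):

```python
import re

def count_syllables(word):
    # Crude heuristic: count groups of consecutive vowels.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def fk_grade(text):
    """Flesch-Kincaid Grade Level with a naive syllable heuristic."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words) - 15.59)
```

Short, monosyllabic sentences score near (or below) zero; dense policy language scores many grade levels higher, which is the contrast driving the incumbent-challenger difference.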
On Question 2 (Coverage sentiment and factual accuracy): Articles coded with factcheck_rating of "mostly_false" or "false" showed higher absolute sentiment scores (|sentiment| mean = 0.41) compared to articles rated "true" or "mostly_true" (|sentiment| mean = 0.28). The difference was statistically significant (Mann-Whitney U test, p < .01). Sam found this result interesting but noted the critical confound: fact-checking organizations may selectively fact-check the most extreme, emotionally charged claims — meaning the association could reflect selection into fact-checking rather than a causal relationship between emotional language and factual inaccuracy.
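The Mann-Whitney U test is a rank-based comparison, a sensible choice here because sentiment scores are bounded and non-normal. The U statistic itself just counts, over all cross-sample pairs, how often one group's value exceeds the other's (ties count half). A pure-Python sketch with illustrative data; a real analysis would use scipy.stats.mannwhitneyu, which also supplies the p-value:

```python
def mann_whitney_u(xs, ys):
    """U statistic for the first sample: count of pairs (x, y)
    with x > y, plus 0.5 for each tie."""
    u = 0.0
    for x in xs:
        for y in ys:
            if x > y:
                u += 1.0
            elif x == y:
                u += 0.5
    return u
```

When U is near its maximum (len(xs) * len(ys)) or near zero, the two groups are well separated in rank, as the |sentiment| distributions were here.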
On Question 3 (Language shift over time): VADER compound scores showed a modest but consistent negative trend across both parties as Election Day approached — a 0.03 unit decline in compound score per 30-day period in the final three months. Readability was stable. Populism scores increased modestly for both parties in the final 30 days (mean increase of 0.04 on the 0-1 scale). Sam's interpretation: both findings are consistent with campaigns becoming more sharply negative as the election approaches, a well-documented pattern in campaign communication research.
On Question 4 (Coordinated language patterns): Sam identified 14 unusual phrase trigrams that appeared in both campaign speeches and media coverage within narrow time windows, potentially indicating shared talking points. However, Sam's explicit conclusion: "We cannot determine from this analysis whether the pattern reflects coordination or ideological alignment. These phrases are specific enough to be interesting but not specific enough to be dispositive. I recommend we not publish the coordination inference."
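The detection logic can be sketched as a set intersection over trigrams, keeping only phrases rare enough not to be generic boilerplate. The corpora and rarity cutoff below are illustrative; Sam's actual detector (and its time-window logic) is not reproduced here:

```python
import re
from collections import Counter

def trigrams(text):
    words = re.findall(r"[a-z']+", text.lower())
    return [" ".join(words[i:i + 3]) for i in range(len(words) - 2)]

def shared_rare_trigrams(speech_texts, media_texts, max_background=1):
    """Trigrams appearing in both corpora but in at most
    max_background speeches, i.e. unusual rather than generic."""
    speech = Counter(t for text in speech_texts for t in set(trigrams(text)))
    media = set(t for text in media_texts for t in trigrams(text))
    return sorted(t for t in speech
                  if t in media and speech[t] <= max_background)
```

Note what this does and does not establish: shared rare phrases are detectable; the mechanism behind them (coordination vs. common sources vs. ideological convergence) is not.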
On Question 5 (Populism score and media treatment): Higher populism scores in speeches were associated with: (a) more negative subsequent media coverage (Spearman's ρ = -0.18, p < .01), (b) higher engagement with that coverage on social media (positive correlation, ρ = 0.24), and (c) marginally higher rates of fact-check attention. Sam flagged the usual caveats: these are correlational, confounds abound (populist candidates often make bolder and more specific factual claims that invite fact-checking), and the effect sizes are modest.
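Spearman's ρ, used throughout this question, is Pearson correlation computed on ranks, which makes it robust to the heavily skewed distributions typical of engagement metrics. A pure-Python sketch with tie-aware ranking (scipy.stats.spearmanr is the usual tool and also returns a p-value):

```python
def rankdata(values):
    # Assign 1-based ranks, averaging over ties.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson on the ranks."""
    rx, ry = rankdata(xs), rankdata(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)
```

Values like ρ = -0.18 and ρ = 0.24 are real but weak monotonic associations, consistent with Sam's "effect sizes are modest" caveat.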
Part IV: The Hard Conversation
When Sam presented the findings to Adaeze, they began with what they called the "epistemic humility slide" — a summary of what the analysis could not support:
"This analysis cannot tell us whether any specific candidate is more or less honest. It cannot tell us whether coordinated talking points existed. It cannot tell us whether the sentiment patterns in media coverage caused any electoral effect. It can tell us that there are measurable differences in language patterns, that those patterns are correlated with some other measurable outcomes, and that some of those correlations are consistent with well-established theoretical predictions."
Adaeze pushed back. "So what can we publish?"
Sam's answer was precise: "We can publish the descriptive findings — language patterns by incumbent vs. challenger, the sentiment-accuracy correlation with appropriate caveats about selection effects, the temporal trend. We should not publish the coordination inference. We should publish the populism-coverage correlation with an explicit note that causation cannot be established."
Adaeze accepted the constraints. But she added a question: "What would we need to be able to say more?"
Sam's answer became the roadmap for ODA's next research project: experimental study designs, partnership with campaign communication researchers to access more granular data, and a replication study with a second election cycle's data to test whether the patterns held across campaigns.
Part V: What Sam's Pipeline Actually Changed
ODA used the pipeline outputs in three concrete ways in the final weeks of the campaign:
Monitoring trigger: Sam set up an automated alert that flagged media articles with absolute sentiment scores above 0.6 in the final two weeks before Election Day. The logic: extreme sentiment articles correlated with factual accuracy problems. The alert surfaced 23 articles that ODA then reviewed manually, of which 7 warranted a formal fact-check response.
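The trigger reduces to a threshold filter plus a precision check. A sketch with hypothetical article records (note that Sam's 7-of-23 outcome corresponds to a precision of roughly 0.30):

```python
def flag_for_review(articles, threshold=0.6):
    """Articles whose absolute sentiment exceeds the alert threshold."""
    return [a for a in articles if abs(a["sentiment"]) > threshold]

def precision(n_flagged, n_true_positives):
    # Of everything the alert surfaced, what fraction warranted action?
    return n_true_positives / n_flagged if n_flagged else 0.0
```

Whether ~0.30 precision is acceptable depends on the cost of review: here each false positive cost only a manual read, so a permissive threshold was defensible.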
Populism tracking: Sam generated a weekly "populism score trend" for each competitive race in ODA's portfolio. When a candidate's populism score increased sharply between one week and the next, ODA flagged the campaign communication for closer review.
Media brief: ODA's biweekly briefing to partner organizations included a "language environment" summary — readability trends, sentiment by source type, and top n-grams from the week's speech coverage. Several partner organizations said this was the most consistently useful product ODA produced.
Discussion Questions
1. Sam documented every data preparation decision in a preprocessing log. What is the cost-benefit analysis of this practice? When might the cost (time, documentation burden) outweigh the benefit?
2. Sam refused to publish the "coordinated language" inference without additional evidence. Was this the right call? What additional evidence would you want before publishing such a claim, and how would you obtain it?
3. The finding that more extreme sentiment in media articles correlates with lower factual accuracy ratings has an obvious selection bias explanation (fact-checkers choose extreme claims to fact-check). Design a study that would allow you to disentangle selection effects from a genuine causal relationship between emotional framing and factual accuracy.
4. Sam's automated alert (flagging articles with |sentiment| > 0.6) produced 23 articles, 7 of which warranted fact-checking. Evaluate this as a classification system: what is the precision (of the 23 flagged, how many were true positives)? What other information would you need to compute recall? Is 7/23 precision acceptable for this use case?
5. Adaeze asked "What would we need to be able to say more?" and Sam's answer became a research roadmap. What does this exchange illustrate about the relationship between applied data journalism and academic political science research? What can each field offer that the other cannot?