Case Study 5.2: Sam's First Story — From Dataset to Publication
Background
Three weeks after Adaeze walked Sam Harding through the ODA Dataset, Sam publishes a story. The headline is "The Persuadable Middle: Inside the Voter Pool That Could Decide the Garza-Whitfield Race." It runs on OpenDemocracy Analytics' public-facing site, with supporting charts and a downloadable summary of the analysis.
This case study examines how Sam got from that first chaotic morning with the spreadsheets to a published story with a documented methodology and auditable code.
Phase 1: Question First
Sam's natural instinct as a journalist was to start with interesting data. Adaeze redirected them to start with an interesting question. After a long conversation about what readers genuinely needed to understand about the Garza-Whitfield race — not what the data happened to make easy to show — Sam settled on a question: "Who are the persuadable voters in this race, how many of them are there, and where do they live?"
This question was analytically useful because it was connected to a concrete decision (campaign resource allocation), it had a clear unit of analysis (the persuadable voter), and it was answerable with the data Sam had access to.
Phase 2: Define the Universe
The first analytical challenge was definitional: what counts as a "persuadable voter"? Sam initially proposed using the ODA voter file's persuadability_score column directly — filter to persuadability_score >= 60 and call those the persuadables. Adaeze pushed back.
"That score is a model output," she said. "It has uncertainty. It was built with specific assumptions. If your story rests on it, your story needs to explain what it is."
They worked through several possible definitions:
Definition A: persuadability_score >= 60 (vendor model threshold)
- Pros: Uses expert model, fast to compute
- Cons: Black box, methodology opaque, threshold is arbitrary
Definition B: support_score between 40 and 60 (the "true middle")
- Pros: Transparent, intuitive
- Cons: Misses partisans who vote across party lines; conflates undecided with split-ticket voters
Definition C: Independent registration AND support_score between 35 and 65
- Pros: Adds party registration as a behavioral signal
- Cons: Misses registered partisans who are genuinely persuadable (closet swing voters)
Definition D: persuadability_score >= 55 AND vote_history_2022 == 1 (high persuadability, likely to vote)
- Pros: Focuses on voters who are both persuadable and likely to show up
- Cons: Excludes the mobilization component (low-turnout persuadables are also important)
Sam and Adaeze settled on publishing two universes: the broad persuadable pool (Definition B, with the band widened from 40–60 to support_score 35–65 so its edges matched Definition C's) and the high-value persuasion targets (Definition D, high persuadability + likely voter). The story would be explicit about both definitions and explain why two different views were needed.
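The four candidate definitions can be sketched as boolean masks and compared side by side. The sketch below runs on a tiny synthetic frame rather than the ODA voter file, and the party-registration column name ("party_reg") and its "Independent" label are assumptions; the score and vote-history column names follow the case study.

```python
import pandas as pd

# Tiny synthetic stand-in for the ODA voter file (real analysis would load
# oda_voters.csv and filter to the Garza-Whitfield state).
gw = pd.DataFrame({
    "persuadability_score": [62, 55, 40, 70, 58],
    "support_score":        [50, 38, 45, 80, 60],
    "party_reg":            ["Independent", "Democrat", "Independent",
                             "Republican", "Independent"],  # assumed column
    "vote_history_2022":    [1, 0, 1, 1, 1],
})

# Each candidate universe as a boolean mask over the same frame.
universes = {
    "A (vendor threshold)":  gw["persuadability_score"] >= 60,
    "B (true middle)":       gw["support_score"].between(40, 60),
    "C (Indep. + 35-65)":    (gw["party_reg"] == "Independent")
                             & gw["support_score"].between(35, 65),
    "D (high-value targets)": (gw["persuadability_score"] >= 55)
                              & (gw["vote_history_2022"] == 1),
}
for name, mask in universes.items():
    print(f"{name}: {int(mask.sum())} of {len(gw)} voters")
```

Putting all four definitions in one dictionary makes the overlap between universes easy to audit (for example, `(universes["B (true middle)"] & universes["D (high-value targets)"]).sum()` counts voters in both).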
Phase 3: The Analysis
import pandas as pd
import matplotlib.pyplot as plt      # used for the Phase 4 charts
import matplotlib.ticker as mticker  # axis/label formatting for the Phase 4 charts
import seaborn as sns                # chart styling for the Phase 4 charts
DATA_DIR = "data/oda/"
GW_STATE = "TX_ANALOG"
voters = pd.read_csv(DATA_DIR + "oda_voters.csv")
gw = voters[voters["state"] == GW_STATE].copy()
# Universe A: Broad persuadable pool
broad_persuadable = gw[
(gw["support_score"] >= 35) & (gw["support_score"] <= 65)
].copy()
# Universe B: High-value persuasion targets
high_value = gw[
(gw["persuadability_score"] >= 55) &
(gw["vote_history_2022"] == 1)
].copy()
print(f"Broad persuadable pool: {len(broad_persuadable):,}")
print(f"High-value targets: {len(high_value):,}")
# Geographic distribution: where are persuadables concentrated?
county_persuadable = (
    broad_persuadable.groupby("county")["voter_id"]
    .count()
    .rename("persuadable_count")
    .to_frame()
    # Denominator: all registered voters per county, not just persuadables.
    .join(gw.groupby("county")["voter_id"].count().rename("total_county"))
    .reset_index()
)
county_persuadable["pct_persuadable"] = (
    county_persuadable["persuadable_count"]
    / county_persuadable["total_county"] * 100
).round(1)
top_counties = county_persuadable.nlargest(15, "persuadable_count")
print("\nTop 15 counties by persuadable voter count:")
print(top_counties.to_string(index=False))
Phase 4: The Chart Decision
Sam's first draft charts were, Adaeze admitted privately, pretty good for a first attempt. But several needed revision.
Original Chart 1: A pie chart of the persuadable voter pool's racial/ethnic composition.
Problem: Pie charts make it hard to compare slices when there are more than three or four categories. The racial/ethnic breakdown had seven categories, and the slices were illegible.
Revised: A horizontal bar chart ordered by group size, with percentage labels. This made comparison across groups immediate and precise.
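A revision along these lines can be sketched in matplotlib. The group labels and percentages below are invented placeholders for illustration, not figures from the story, and the output filename is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt

# Placeholder composition figures (NOT from the ODA analysis), already
# ordered largest to smallest as the revised chart requires.
groups = ["Group A", "Group B", "Group C", "Group D",
          "Group E", "Group F", "Group G"]
shares = [41.0, 28.5, 14.2, 7.8, 3.6, 3.1, 1.8]

fig, ax = plt.subplots(figsize=(7, 4))
# Reverse so the largest group appears at the top of the chart.
ax.barh(groups[::-1], shares[::-1])
for y, pct in enumerate(shares[::-1]):
    ax.text(pct + 0.5, y, f"{pct:.1f}%", va="center")  # percentage labels
ax.set_xlabel("Share of persuadable pool (%)")
ax.set_xlim(0, max(shares) * 1.2)
fig.tight_layout()
fig.savefig("persuadable_composition.png", dpi=150)
```

Ordering the bars by size and labeling each with its percentage is what makes cross-group comparison immediate, which is exactly what the seven-slice pie chart could not do.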
Original Chart 2: A scatter plot of support_score vs. persuadability_score for all voters in the state, colored by party registration, showing every data point.
Problem: With 1.2 million voters, the chart was a solid blob of overlapping points from which no structure was visible.
Revised: A 2D histogram (hexbin plot) showing density rather than individual points, which revealed the actual distributional structure — including the key bimodal shape of Independent voters.
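A density view of this kind can be sketched with matplotlib's hexbin. The scores below are synthetic draws from a two-component mixture, standing in for the 1.2 million-voter file only to show the technique; the bimodal shape here is simulated, not the story's actual finding, and the filename is an assumption.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted runs
import matplotlib.pyplot as plt

# Synthetic two-cluster score data standing in for the real voter file.
rng = np.random.default_rng(0)
support = np.concatenate([rng.normal(35, 8, 60_000),
                          rng.normal(65, 8, 60_000)])
persuade = np.concatenate([rng.normal(60, 10, 60_000),
                           rng.normal(45, 10, 60_000)])

fig, ax = plt.subplots(figsize=(6, 5))
# Hexagonal binning: each cell's color encodes how many voters fall in it,
# so structure survives even when individual points would overplot.
hb = ax.hexbin(support, persuade, gridsize=50, cmap="viridis", mincnt=1)
fig.colorbar(hb, ax=ax, label="Voters per bin")
ax.set_xlabel("support_score")
ax.set_ylabel("persuadability_score")
fig.tight_layout()
fig.savefig("support_vs_persuadability_hexbin.png", dpi=150)
```

With a million-plus points, a scatter plot saturates into a blob; binning trades individual points for counts per cell, which is what lets distributional features like bimodality reappear.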
Original Chart 3: A map of persuadable voter counts by county, using a sequential blue scale.
Problem: Raw count maps are dominated by population: large counties always have the most voters of every type, so the map just showed which counties were most populous. This told readers nothing about where persuadable voters were concentrated.
Revised: A choropleth map using percent of county registered voters who are in the persuadable pool — a measure of persuasion density rather than raw count. This revealed that some smaller rural counties had disproportionately high shares of persuadable voters, a finding that would not be visible on a raw-count map.
Phase 5: The Methodology Box
OpenDemocracy Analytics' editorial standards require every data-driven story to include a "Methodology" section at the bottom. Sam and Adaeze spent an afternoon writing it. The final version included:
- Data source: ODA voter file, version 3.2, last updated September 15, 2025
- Definition of "persuadable voter" (both universes, with explanation of choice)
- Sample size: all registered voters in the Garza-Whitfield state (N = 1,214,882)
- Limitations: voter file may underrepresent recent movers; persuadability score is a modeled estimate with unknown confidence intervals; county-level geographic analysis uses county of registration, not county of residence
The methodology box is not exciting to read. It is, however, essential — both for transparency and for the story's own protection. When a state party official called to challenge the story's findings, the methodology box gave Sam a ready-made answer: here is exactly what we measured, here is exactly how we measured it, and here are the limitations we disclosed.
Phase 6: What Sam Learned
Sam's reflection, published in an editor's note a week after the story ran:
"I came to this story as a journalist who was suspicious of data — not because I thought data was wrong, but because I had seen data misused often enough to be cautious. What I learned working with the ODA Dataset is that data isn't a shortcut to the truth; it's a more systematic way of asking questions. The voter file doesn't tell you who is going to vote for whom. It gives you one imperfect map of an enormously complicated territory. What you do with that map depends on how honest you are about what it can and can't show."
Discussion Questions
1. Sam and Adaeze debated four different definitions of "persuadable voter." For each definition, identify one downstream analytical error that would result from using it in a campaign targeting context without acknowledging its limitations.
2. Adaeze pushed back on using the vendor persuadability_score directly, calling it a "black box." Under what circumstances is it analytically acceptable to use a vendor-provided modeled score without understanding its construction? What disclosures are minimally required?
3. The revised chart 3 switched from raw persuadable voter counts to persuadable voter density (percentage of county). What information does the raw count map convey that the density map loses? Is there a situation where the raw count map is the analytically correct choice?
4. Sam describes data as "a more systematic way of asking questions" rather than a shortcut to truth. What does this framing imply about the relationship between analytical rigor and journalistic humility? How is this similar to or different from the intellectual humility framework developed in Chapter 4?
5. The methodology box disclosed that the voter file "may underrepresent recent movers." Why are recent movers a specific concern for voter file analysis in the context of a Sun Belt state experiencing significant demographic change?
Key Analytical Concepts Illustrated
- Question-first analysis: forming clear, answerable questions before examining data
- Universe definition: the analytical and rhetorical stakes of defining your analytical population
- Visualization revision: systematic improvement of draft charts for accuracy and readability
- Transparency and methodology disclosure: the professional and analytical function of documented methodology
- The difference between count and density maps: when each is appropriate
- Data journalism ethics: treating modeled scores as estimates, not facts