Learning Objectives

  • Set up a Python environment for political data analysis
  • Load and inspect multiple CSV datasets with pandas
  • Apply descriptive statistics to polling and demographic data
  • Create informative visualizations from political data
  • Filter and subset data to examine a specific race
  • Handle missing data responsibly in political datasets
  • Produce basic cross-tabulations from voter-level data

Chapter 5: Your First Political Dataset

Adaeze Nwosu has been explaining the same thing for thirty minutes, and she is not sure her audience is getting it. It is a Tuesday morning at the OpenDemocracy Analytics offices — a converted warehouse in a mid-size Midwestern city, all exposed brick and monitor stands — and Sam Harding is holding a printed copy of the ODA Dataset documentation, frowning at it the way a person frowns at a furniture assembly diagram that seems to be missing a page.

"I understand what the data is," Sam says. "I don't understand what to do with it."

This is a conversation Adaeze has had in some version a hundred times. Sam is a talented data journalist — thirty-five, non-binary, with a gift for explaining complex things to ordinary readers — but they came up through words, not numbers. The ODA Dataset, which OpenDemocracy Analytics has spent three years assembling, represents one of the most comprehensive repositories of American political data in the public domain: six interconnected tables covering polls, voters, advertising, speeches, donations, and media coverage across hundreds of races over four election cycles. Knowing it exists and knowing what to do with it are entirely different things.

"Start at the beginning," Adaeze says, sliding a laptop across the table. "Just look at it."

This chapter starts at the beginning too. If you have never loaded a political dataset in Python — or if you have loaded datasets in other contexts but are new to political data — everything you need is here. By the end, you will have explored the ODA Dataset, pulled your first insights about the Garza-Whitfield Senate race, and produced visualizations that would not look out of place in a professional analytics memo. More importantly, you will have developed the habits that separate careful political data analysis from the kind that produces confident conclusions from wrong premises.

A brief note on the relationship between this chapter and the previous one: Chapter 4 gave you the why of political analysis — the frameworks for thinking before you compute. Chapter 5 gives you the how — the specific Python operations you will use throughout this textbook. But the chapters are not sequential in the sense that one is done with when the other begins. Every line of code you write in this chapter should be filtered through the questions in Chapter 4: What decision does this serve? Who is in this data? What is the right unit of analysis? Good technical execution in the wrong analytical framework produces wrong answers very efficiently. The goal of this chapter is not to make you fast at pandas — it is to make you precise, careful, and honest as you learn the tools.

5.1 Setting Up Your Environment

Before we touch data, a brief word on environment. Political data analysis in Python typically requires three core libraries: pandas for data manipulation, matplotlib for visualization, and numpy for numerical operations. We will also use seaborn, a higher-level visualization library built on matplotlib that produces cleaner default aesthetics with less code.

If you are working in a fresh environment, install these with:

pip install pandas matplotlib numpy seaborn

If you are using Anaconda or a similar distribution, these libraries are likely already available. For this chapter, we recommend working in a Jupyter notebook — the cell-based execution model is well-suited to exploratory analysis, and you will want to see your visualizations inline as you produce them. The code files accompanying this chapter (example-01-loading-oda-data.py, example-02-descriptive-stats.py, example-03-first-visualizations.py) are written as plain Python scripts for portability, but they map directly to the notebook workflow.

# Standard imports for this chapter
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set a clean visual style
sns.set_style("whitegrid")
plt.rcParams["figure.dpi"] = 120
plt.rcParams["font.size"] = 11

Best Practice: Version Control from Day One Start every analysis project with a git repository, even a simple one. Political data analysis often involves multiple iterations, and version control saves you from the nightmare of analysis_v3_FINAL_revised_actually_final.py. A simple git init and periodic git commit is all you need to begin.

5.2 The ODA Dataset: What It Is and Where It Came From

Sam is still frowning at the documentation. Adaeze takes the printout from them and sets it aside. "The documentation is the wrong place to start. The data is the right place to start."

She is right in a general sense, but the documentation matters more for political data than for many other domains — because in political data, the how of collection shapes the what of content in ways that are analytically crucial. Before loading a single file, you need to understand what each table represents and how the underlying data was gathered.

Political data is assembled from an unusual variety of sources, each with its own logic, incentives, and limitations. Voter file data originates in county and state voter registration systems — bureaucratic records maintained by election officials for administrative purposes, not analytical ones. Polling data originates in sample surveys designed to estimate population preferences, subject to all the sampling, question design, and coverage uncertainties that Chapter 2 established. Campaign finance data originates in mandatory federal disclosure filings, which are publicly available but inconsistently formatted across campaigns and election cycles. Speech and media data requires either manual transcription and coding or automated text processing, each of which introduces its own error modes. Understanding which source each table represents is the first step toward interpreting what it can and cannot tell you.

There is also the question of temporal coverage. Political data has a rhythm determined by election cycles: registration data is most accurate close to election deadlines, polling data is densest in the final months of a race, campaign finance data accumulates over a fundraising period, advertising spending spikes in the final weeks. The ODA Dataset covers multiple election cycles and multiple states, which means it has both the richness of longitudinal data and the complexity of data collected under different conditions at different times. When you compute a cross-state average of any variable, you are potentially averaging across wildly different political contexts. Keeping this temporal and geographic complexity in mind prevents spurious comparisons.

The ODA Dataset is a synthetic-but-realistic political dataset assembled from publicly available data structures, modeling real-world campaign and election data with all personally identifying information altered or generated to preserve privacy while maintaining realistic distributional properties. It consists of six tables:

oda_polls.csv contains records of Senate race polls across multiple states and election cycles. Each row is a single poll. Key columns include poll_id, date, state, pollster, methodology (online/phone/IVR), candidate_d, candidate_r, pct_d, pct_r, sample_size, margin_error, population (likely voters, registered voters, or adults), and race_type.

oda_voters.csv contains voter-level records — one row per voter — with demographic attributes, party registration, vote history, and modeled scores. Key columns include voter_id, state, county, age, gender, race_ethnicity, education, income_bracket, party_reg, vote_history_2018, vote_history_2020, vote_history_2022, urban_rural, support_score, and persuadability_score.

oda_ads.csv contains records of political advertising at the market level. Each row is an advertising buy. Key columns include ad_id, date, sponsor, party, platform, state, market, spend_usd, impressions, issue_topic, tone, and target_demo.

oda_speeches.csv contains annotated speech records with text excerpts and modeled scores. Each row is a speech or major public statement. Key columns include speech_id, date, speaker, party, office, event_type, state, word_count, text_excerpt, full_text, and populism_score.

oda_donations.csv contains campaign finance records. Each row is an individual or organizational donation. Key columns include donation_id, date, donor_name, donor_state, donor_zip, amount, recipient, recipient_party, donation_type, employer, and occupation.

oda_media.csv contains records of news and media coverage. Each row is an article or broadcast segment. Key columns include article_id, date, source, source_type, state, topic, headline, excerpt, sentiment_score, candidate_mentions, and factcheck_rating.

📊 Real-World Application: How This Mirrors Real Data Real campaign data looks similar to this structure — though real voter files have more columns, more missing data, more inconsistency, and more proprietary scoring layers. The ODA Dataset's structure mirrors the architecture of voter file vendors like Catalist and TargetSmart, polling aggregators like the datasets maintained by FiveThirtyEight and the Roper Center, and campaign finance data from the Federal Election Commission. Learning to navigate this structure transfers directly to the tools you will use in professional contexts.

5.3 Loading the Data

Let us load all six tables and perform the first round of inspection. The code below is from example-01-loading-oda-data.py.

import pandas as pd
import numpy as np

# Load all six ODA tables
# Adjust the path to wherever you have stored the ODA Dataset files
DATA_DIR = "data/oda/"

polls = pd.read_csv(DATA_DIR + "oda_polls.csv", parse_dates=["date"])
voters = pd.read_csv(DATA_DIR + "oda_voters.csv")
ads = pd.read_csv(DATA_DIR + "oda_ads.csv", parse_dates=["date"])
speeches = pd.read_csv(DATA_DIR + "oda_speeches.csv", parse_dates=["date"])
donations = pd.read_csv(DATA_DIR + "oda_donations.csv", parse_dates=["date"])
media = pd.read_csv(DATA_DIR + "oda_media.csv", parse_dates=["date"])

print("Tables loaded successfully.")
print(f"  polls:     {polls.shape[0]:,} rows × {polls.shape[1]} columns")
print(f"  voters:    {voters.shape[0]:,} rows × {voters.shape[1]} columns")
print(f"  ads:       {ads.shape[0]:,} rows × {ads.shape[1]} columns")
print(f"  speeches:  {speeches.shape[0]:,} rows × {speeches.shape[1]} columns")
print(f"  donations: {donations.shape[0]:,} rows × {donations.shape[1]} columns")
print(f"  media:     {media.shape[0]:,} rows × {media.shape[1]} columns")

The parse_dates=["date"] argument tells pandas to automatically convert the date column to datetime format rather than treating it as a string. This matters because we will frequently want to filter, sort, and aggregate by date, and datetime objects are far more powerful for these operations than strings. With a datetime column, you can write polls[polls['date'] > '2025-06-01'] to get only recent polls, or polls.groupby(pd.Grouper(key='date', freq='W')) to aggregate by week. With a string column, the weekly aggregation fails outright, and the comparison only happens to work because ISO-formatted date strings sort alphabetically in chronological order, an accident of formatting rather than something to rely on.
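To make the difference concrete, here is a minimal sketch on a tiny made-up polls frame (the column names follow the ODA schema; the values are invented):

```python
import pandas as pd

# Tiny invented polls frame with ODA-style columns
polls = pd.DataFrame({
    "date": ["2025-05-15", "2025-06-10", "2025-06-24"],
    "pct_d": [47.0, 48.5, 49.0],
})

# Loaded without parse_dates, 'date' is a string (object) column
assert polls["date"].dtype == object

# Convert to datetime, as parse_dates=["date"] would have done at read time
polls["date"] = pd.to_datetime(polls["date"])

# Datetime filtering: only polls after June 1
recent = polls[polls["date"] > "2025-06-01"]
print(f"Polls after June 1: {len(recent)}")

# Weekly aggregation requires a true datetime column
weekly = polls.groupby(pd.Grouper(key="date", freq="W"))["pct_d"].mean().dropna()
print(weekly.round(1).to_string())
```

Converting once at load time, rather than re-parsing strings in every downstream operation, is also simply faster.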

Notice that only the tables with meaningful date columns (polls, ads, speeches, donations, media) get parse_dates applied. The voters table does not have a single date column in the same sense — it has columns like reg_date (registration date) that you would parse separately if needed.

Two things you should do with every new dataset before any analysis: look at the shape (rows and columns) and look at the first few rows.

# First look at the polls table
print("\n--- POLLS TABLE ---")
print(polls.info())
print("\nFirst 5 rows:")
print(polls.head())

The .info() method is your first diagnostic tool. It shows you the number of rows, the number of columns, each column's data type, and the count of non-null values. A column with significantly fewer non-null values than the total row count has missing data — a signal that demands investigation. For example, if .info() shows that polls has 2,400 rows but margin_error has only 1,820 non-null values, you know immediately that 580 polls (about 24%) are missing their margin of error — and you need to understand why before you can use that column in any weighting or aggregation.

The .head() method shows you the first five rows, which gives you a concrete sense of what the data looks like: are values formatted as you expect, are categories spelled consistently, are numeric columns actually numeric? The .head() output is not a random sample — it is the first five rows as stored in the CSV — which means it may not be representative of the full range of values. Complement it with .sample(10) to get a random selection, or .tail(5) to see the most recent entries in a date-sorted table. For categorical columns, .value_counts() gives you a much better overall picture than .head() alone.
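A quick sketch of that inspection routine, on an invented miniature table:

```python
import pandas as pd

# Invented miniature polls table
polls = pd.DataFrame({
    "pollster": ["Acme"] * 3 + ["Beacon"] * 2,
    "pct_d": [47.0, 48.0, 46.5, 49.0, 48.5],
})

# .head() shows rows in stored order, not a random sample
print(polls.head(2))

# .sample() gives a random selection for a fairer first look
print(polls.sample(2, random_state=1))

# For categorical columns, .value_counts() gives the full picture
print(polls["pollster"].value_counts().to_string())
```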

One critical detail that .info() does not show: whether numeric columns have extreme values, outliers, or impossible entries. A pct_d column with minimum 0 and maximum 100 looks plausible; a column with minimum −3 or maximum 112 signals data entry errors. This is why .describe() is always the second call after .info().
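A minimal plausibility check along these lines, using invented values (the 0-100 bounds apply specifically to percentage columns; other columns need their own limits):

```python
import pandas as pd

# Invented percentage columns, each with one deliberately impossible entry
polls = pd.DataFrame({
    "pct_d": [47.2, 49.1, 112.0, 45.8],   # 112 is impossible for a percentage
    "pct_r": [46.0, 44.3, 43.5, -3.0],    # -3 is impossible too
})

# .describe() surfaces the min/max that .info() never shows
summary = polls[["pct_d", "pct_r"]].describe()
print(summary.loc[["min", "max"]])

# An explicit check: flag any row with a value outside [0, 100]
pcts = polls[["pct_d", "pct_r"]]
out_of_range = ((pcts < 0) | (pcts > 100)).any(axis=1)
print(f"Rows with impossible percentages: {out_of_range.sum()}")
```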

Sam types both methods carefully. The info printout for the polls table shows what looks like a clean structure — all expected columns, appropriate dtypes, only the margin_error column showing substantial missing values. But the head printout reveals something unexpected: two rows where pct_d and pct_r both equal zero. "Are those real polls?" Sam asks. Adaeze looks at the rows: the dates are from early in the cycle, the sample sizes are listed as 0, and the pollster name is "TEST." These are administrative test records that were never cleaned from the file. This is the value of looking at the data: the summary statistics would have averaged in those zero-percentage rows without triggering any error message.
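Once such records are identified, removing them should be explicit. Here is a sketch on a synthetic frame, using the markers from the scenario above (the "TEST" pollster name and zeroed sample sizes); the exact cleaning rule will depend on your file:

```python
import pandas as pd

# Invented polls frame containing two administrative test records
polls = pd.DataFrame({
    "pollster": ["Acme Research", "TEST", "Beacon Polls", "TEST"],
    "pct_d": [48.0, 0.0, 47.5, 0.0],
    "pct_r": [45.0, 0.0, 46.0, 0.0],
    "sample_size": [800, 0, 650, 0],
})

# Flag rows that are clearly not real polls: the TEST pollster,
# or zero sample size with both percentages at zero
is_test = (
    (polls["pollster"] == "TEST")
    | ((polls["sample_size"] == 0) & (polls["pct_d"] == 0) & (polls["pct_r"] == 0))
)
print(f"Dropping {is_test.sum()} test record(s)")

clean = polls[~is_test].copy()
print(f"{len(clean)} real polls remain; mean pct_d = {clean['pct_d'].mean():.1f}")
```

Printing how many rows the filter removed is part of the point: a silent drop is how cleaning decisions get lost.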

💡 Intuition: Always Check dtypes One of the most common causes of analytical errors in pandas is a numeric column being read as a string (object dtype). This happens when a column contains entries like "N/A" or "--" that prevent automatic numeric inference. Always check that pct_d, pct_r, sample_size, margin_error, and similar columns are float or int, not object. If they are object, you need to clean them before analysis.
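A common repair is pd.to_numeric with errors="coerce", which turns unparseable sentinels like "N/A" into proper NaN values instead of raising an error. A sketch with invented entries:

```python
import pandas as pd

# A margin_error column read as strings because of "N/A" and "--" sentinels
polls = pd.DataFrame({"margin_error": ["3.1", "N/A", "4.0", "--", "2.8"]})
assert polls["margin_error"].dtype == object  # numeric inference was blocked

# Coerce to numeric; unparseable sentinels become NaN (real missing values)
polls["margin_error"] = pd.to_numeric(polls["margin_error"], errors="coerce")

print(polls["margin_error"].dtype)           # now a float column
print(polls["margin_error"].isnull().sum())  # sentinel rows are now NaN
```

Note that coercion converts the problem from a dtype error into a missing-data question, which Section 5.6 takes up directly.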

5.4 Your First Descriptive Statistics

Now we move to the second example file, example-02-descriptive-stats.py. Sam is typing along, following Adaeze's guidance. They have successfully loaded the polls table and stared at the first five rows. "Okay," Sam says. "So we have polls. What do I do with them?"

"You describe them," Adaeze says. "Before you try to find patterns, find out what you have."

import pandas as pd
import numpy as np

DATA_DIR = "data/oda/"
polls = pd.read_csv(DATA_DIR + "oda_polls.csv", parse_dates=["date"])

# --- Basic descriptive statistics ---
print("=== POLLS: Descriptive Statistics ===\n")
print(polls.describe())

The .describe() method generates summary statistics for all numeric columns: count, mean, standard deviation, minimum, 25th percentile, median (50th percentile), 75th percentile, and maximum. For the polls table, this gives you an immediate sense of the distribution of polling margins, sample sizes, and margin of error across the dataset.

# How many unique states are covered?
print(f"\nStates covered: {polls['state'].nunique()}")
print(polls['state'].value_counts().head(10))

# How many unique pollsters?
print(f"\nUnique pollsters: {polls['pollster'].nunique()}")
print(polls['pollster'].value_counts().head(10))

# Distribution of polling methodology
print(f"\nMethodology breakdown:")
print(polls['methodology'].value_counts(normalize=True).round(3))

# Distribution of population type
print(f"\nPopulation type breakdown:")
print(polls['population'].value_counts(normalize=True).round(3))

The value_counts() method is one of the most useful tools in your descriptive analysis toolkit for categorical variables. With normalize=True, it shows proportions rather than raw counts. The results here tell you something immediately important: not all polls in the dataset use the same population type (LV, RV, or A for adults), which is analytically significant — likely voter polls and registered voter polls cannot be directly compared without adjustment. Mixing population types when computing polling averages will produce distorted results, because LV polls tend to show more Republican-leaning results (older, habitual voters skew right) while A polls include many non-voters who tend to express more Democratic sympathy.

The pollster value counts reveal another important data feature: some pollsters have contributed many polls across many states, while others contributed only one or two. High-frequency pollsters' methodological idiosyncrasies — what analysts call "house effects" — have more influence on the overall dataset than low-frequency pollsters'. This is one reason why polling averages should not simply average all polls with equal weight, a topic we address carefully in Chapter 10.
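A rough way to glimpse house effects is to compare each pollster's average margin to the overall average. This sketch uses invented margins; real house-effect estimation, the subject of Chapter 10, also controls for timing, methodology, and population type:

```python
import pandas as pd

# Invented polls: "Acme" runs consistently more D-friendly than the field
polls = pd.DataFrame({
    "pollster": ["Acme", "Acme", "Acme", "Beacon", "Beacon", "Civiq"],
    "pct_d":    [49.0,   50.0,   49.5,   47.0,     47.5,     48.0],
    "pct_r":    [44.0,   43.5,   44.5,   46.0,     46.5,     45.0],
})
polls["margin"] = polls["pct_d"] - polls["pct_r"]

# Each pollster's average margin relative to the overall average
overall = polls["margin"].mean()
house = polls.groupby("pollster")["margin"].mean() - overall
print(f"Overall D-R margin: {overall:+.1f}")
print(house.round(1).to_string())
```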

The methodology breakdown is also analytically relevant. Different polling methodologies reach different populations. Live phone polls tend to reach older, more politically engaged voters. Online opt-in panels tend to oversample politically interested respondents who sign up for survey panels. Interactive Voice Response (IVR) polls, which use pre-recorded automated questions, can reach people who will not talk to a live interviewer but have well-documented issues with cell phone penetration. The methodology column is not just a description of how the poll was conducted; it is a marker of which voters the poll was most likely to reach and miss.

# When is the data from?
print(f"\nDate range:")
print(f"  Earliest poll: {polls['date'].min().date()}")
print(f"  Latest poll:   {polls['date'].max().date()}")
print(f"  Total polls:   {len(polls):,}")

⚠️ Common Pitfall: Comparing Across Population Types A poll of "all adults" will typically show different results than a poll of "likely voters," even in the same race at the same moment. Likely voter screens tend to produce slightly more Republican-leaning samples because older, higher-frequency voters skew Republican. Always check the population column before comparing polls or averaging them. Many apparent polling discrepancies evaporate when you realize you are comparing apples to oranges.
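The check itself is a one-line groupby. A sketch with invented numbers (the size of the LV/RV/A gaps here is purely illustrative):

```python
import pandas as pd

# Invented polls of one race taken with three population types
polls = pd.DataFrame({
    "population": ["LV", "LV", "RV", "RV", "A", "A"],
    "pct_d": [46.0, 46.5, 48.0, 48.5, 50.0, 49.5],
    "pct_r": [47.5, 47.0, 45.5, 45.0, 43.0, 43.5],
})

# Average D-R margin by population type: the "same race" looks different
polls["margin"] = polls["pct_d"] - polls["pct_r"]
by_pop = polls.groupby("population")["margin"].mean()
print(by_pop.round(1).to_string())
```

If your dataset shows gaps like these across population types, averaging them together blends three different estimands into one misleading number.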

Exploring the Voter Table

The voter table requires a somewhat different approach — it is voter-level data, so the unit of analysis is an individual, and the distributions tell you about the composition of the electorate rather than the results of individual polls.

voters = pd.read_csv(DATA_DIR + "oda_voters.csv")

print("=== VOTERS: Descriptive Statistics ===\n")
print(voters.describe())

# Age distribution
print(f"\nAge distribution:")
print(f"  Mean: {voters['age'].mean():.1f}")
print(f"  Median: {voters['age'].median():.0f}")
print(f"  Std: {voters['age'].std():.1f}")
print(f"  Range: {voters['age'].min()} – {voters['age'].max()}")

# Categorical breakdowns
for col in ['gender', 'race_ethnicity', 'education', 'income_bracket',
            'party_reg', 'urban_rural']:
    print(f"\n{col}:")
    print(voters[col].value_counts(normalize=True).round(3).to_string())

This gives you a portrait of who is in the voter file — which is the first step in understanding who might be missing. If the voter file overrepresents certain demographic groups (as all voter files do), you need to understand those biases before drawing inferences from the data.

# Voting history cross-tabs
print("\n=== Voting History ===")
print("Voted in 2018:", voters['vote_history_2018'].value_counts(normalize=True).round(3))
print("Voted in 2020:", voters['vote_history_2020'].value_counts(normalize=True).round(3))
print("Voted in 2022:", voters['vote_history_2022'].value_counts(normalize=True).round(3))

# How many are consistent voters (voted in all three)?
consistent = (
    (voters['vote_history_2018'] == 1) &
    (voters['vote_history_2020'] == 1) &
    (voters['vote_history_2022'] == 1)
)
print(f"\nConsistent 3-cycle voters: {consistent.sum():,} ({consistent.mean():.1%} of file)")

The consistent voter figure is analytically significant: this is the universe of voters who are most likely to vote in the current election and whose preferences are most predictable from past behavior. It is also typically the most over-canvassed segment — campaigns have been contacting these people for decades, and the marginal value of additional contact is low.

5.5 Filtering and Subsetting: Finding the Garza-Whitfield Race

Nadia has opened her own laptop next to Sam's. She has been following along with the exploratory analysis, but now she has a specific question: "Show me the state."

Filtering in pandas uses boolean indexing — you create a condition and apply it to the DataFrame to return only rows that satisfy that condition. The Garza-Whitfield race takes place in a Sun Belt state that we will call "TX_ANALOG" in the dataset (a synthetic state modeled on Sun Belt demographics). Let's pull the relevant data.

# Filter polls to the Garza-Whitfield state
gw_state = "TX_ANALOG"

# Polls for this state
gw_polls = polls[polls['state'] == gw_state].copy()
print(f"\nGarza-Whitfield state polls: {len(gw_polls)} polls")
print(gw_polls[['date', 'pollster', 'methodology', 'population',
                 'pct_d', 'pct_r', 'sample_size', 'margin_error']].head(10))

# Further filter to likely voter polls only
gw_lv_polls = gw_polls[gw_polls['population'] == 'LV'].copy()
print(f"\nLikely voter polls only: {len(gw_lv_polls)} polls")
# Compute the polling average for D and R in this state (LV polls only)
avg_pct_d = gw_lv_polls['pct_d'].mean()
avg_pct_r = gw_lv_polls['pct_r'].mean()
avg_margin = avg_pct_d - avg_pct_r

print(f"\n=== Garza-Whitfield Race: Polling Average (LV polls) ===")
print(f"  Garza (D):   {avg_pct_d:.1f}%")
print(f"  Whitfield (R): {avg_pct_r:.1f}%")
print(f"  Margin (D–R): {avg_margin:+.1f} points")

💡 Intuition: The Copy Warning Notice the .copy() call when we create filtered DataFrames. When you filter a pandas DataFrame without .copy(), you sometimes get a "SettingWithCopyWarning" if you later try to modify the filtered subset. Using .copy() creates an independent copy, preventing this warning and ensuring your modifications do not accidentally affect the original DataFrame.

Now let us look at the voter file for the same state:

# Filter voters to the Garza-Whitfield state
gw_voters = voters[voters['state'] == gw_state].copy()

print(f"\n=== Garza-Whitfield State: Voter File ===")
print(f"  Total registered voters: {len(gw_voters):,}")

# Demographic breakdown in this state
print("\nParty registration:")
print(gw_voters['party_reg'].value_counts(normalize=True).round(3))

print("\nRace/ethnicity breakdown:")
print(gw_voters['race_ethnicity'].value_counts(normalize=True).round(3))

print("\nUrban/rural breakdown:")
print(gw_voters['urban_rural'].value_counts(normalize=True).round(3))

print("\nEducation breakdown:")
print(gw_voters['education'].value_counts(normalize=True).round(3))
# Support score distribution
print("\nGarza support score distribution:")
print(gw_voters['support_score'].describe().round(2))

# Persuadability score distribution
print("\nPersuadability score distribution:")
print(gw_voters['persuadability_score'].describe().round(2))

# How many high-persuadability voters (score >= 60)?
high_persuadable = (gw_voters['persuadability_score'] >= 60)
print(f"\nHigh-persuadability voters (score ≥ 60): {high_persuadable.sum():,}")
print(f"  ({high_persuadable.mean():.1%} of registered voters in state)")

Nadia leans forward. "That's my persuasion universe," she says. The number matters more than it might seem: if the high-persuadability universe is large, the campaign has a broad field to work in. If it is narrow, targeting precision becomes paramount.

📊 Real-World Application: What Support Scores Actually Measure Voter support scores (also called "favorability scores" or "support propensity scores") are modeled estimates of a voter's probability of supporting a given candidate, typically ranging from 0 to 100. They are built by vendor firms using a combination of vote history, consumer data, public records, and survey weights. Treating them as precise probability estimates is a mistake — the models behind them are imperfect and the precision is largely illusory. More defensibly, they are ordinal rankings: a voter with a score of 75 is more likely to support Garza than one with a score of 40, but the 35-point difference does not correspond to a 35-percentage-point difference in actual probability.
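One defensible way to honor that ordinal interpretation is to bin scores into ranked tiers rather than treating raw values as probabilities. A sketch on simulated scores; the five-tier split is an arbitrary illustrative choice:

```python
import pandas as pd
import numpy as np

# Simulated support scores for 1,000 voters
rng = np.random.default_rng(42)
voters = pd.DataFrame({"support_score": rng.uniform(0, 100, size=1000)})

# Rank-based quintiles: tier 5 = most likely supporters. This keeps the
# ordering while discarding the illusory point-estimate precision.
voters["support_tier"] = pd.qcut(voters["support_score"], q=5,
                                 labels=[1, 2, 3, 4, 5])

print(voters["support_tier"].value_counts().sort_index().to_string())
```

Downstream analyses that compare tiers, rather than raw scores, are far less sensitive to miscalibration in the underlying model.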

5.6 Handling Missing Data in Political Datasets

Missing data is a fact of life in political analysis, and handling it thoughtlessly is one of the most reliable ways to produce wrong conclusions. Let us look at where the ODA Dataset has missing values and think about what they mean.

# Check missing data across all tables
print("=== Missing Data Summary ===\n")

for name, df in [("polls", polls), ("voters", voters), ("ads", ads),
                 ("speeches", speeches), ("donations", donations), ("media", media)]:
    missing = df.isnull().sum()
    missing_cols = missing[missing > 0]
    if len(missing_cols) > 0:
        print(f"{name}:")
        for col, count in missing_cols.items():
            pct = count / len(df) * 100
            print(f"  {col}: {count:,} missing ({pct:.1f}%)")
        print()
    else:
        print(f"{name}: No missing values\n")
# Focus on polls: what does it mean when margin_error is missing?
print("Polls with missing margin_error:")
missing_me = polls[polls['margin_error'].isnull()]
print(missing_me['methodology'].value_counts())

This is a revealing query. If margin of error is systematically missing for a specific methodology (say, online opt-in panels), that tells you something about the nature of the data — some pollsters do not report margin of error for nonprobability samples because the concept of sampling error does not straightforwardly apply to nonprobability sampling designs. The missingness is not random; it encodes information about the polling methodology.

# Common strategies for handling missing values

# Strategy 1: Drop rows with missing values in a specific column
polls_with_me = polls.dropna(subset=['margin_error']).copy()
print(f"Polls with non-missing margin_error: {len(polls_with_me):,}")

# Strategy 2: Fill with a summary statistic (use carefully!)
median_me = polls['margin_error'].median()
polls_filled = polls.copy()
polls_filled['margin_error'] = polls_filled['margin_error'].fillna(median_me)
print(f"Missing margin_error values after filling with median ({median_me:.1f}): "
      f"{polls_filled['margin_error'].isnull().sum()}")

# Strategy 3: Create a flag variable and analyze separately
polls['me_missing'] = polls['margin_error'].isnull().astype(int)
print("\nPolls by margin-error availability and methodology:")
print(polls.groupby(['methodology', 'me_missing']).size().unstack(fill_value=0))

⚠️ Common Pitfall: Never Fill Silently The most dangerous missing-data handling is invisible: code that silently drops rows with missing values, or fills them with zeros or means without acknowledging the assumption. Always make your missing-data decisions explicit and document them. If you drop rows, note how many and why. If you fill, state what you filled with and what assumption that implies. Missing data in political datasets is almost never random, and treating it as such leads to systematic bias.
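One lightweight discipline is to route every drop through a helper that prints an audit line. The helper below is an invented illustration, not part of the ODA tooling:

```python
import pandas as pd
import numpy as np

def drop_missing_documented(df, column, reason):
    """Drop rows missing `column`, printing an explicit audit line."""
    before = len(df)
    out = df.dropna(subset=[column]).copy()
    dropped = before - len(out)
    print(f"Dropped {dropped} of {before} rows ({dropped / before:.1%}) "
          f"missing '{column}': {reason}")
    return out

# Invented polls with some missing margins of error
polls = pd.DataFrame({
    "pollster": ["A", "B", "C", "D"],
    "margin_error": [3.1, np.nan, 4.0, np.nan],
})
polls_me = drop_missing_documented(
    polls, "margin_error",
    reason="margin of error required for precision weighting")
```

The printed line doubles as documentation: copy it into your analysis memo and the decision is on the record.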

5.7 Your First Visualizations

Sam has been quiet for a few minutes, scrolling through output. "The numbers are useful," they say. "But I need pictures. I'm a journalist."

Adaeze nods. "That's true of most people. Let's make pictures."

Visualization in political analysis serves two distinct functions. The first is exploration — making charts for yourself to spot patterns, outliers, and relationships you might not see in a table. The second is communication — making charts for others that convey an analytical finding clearly and honestly. These are different tasks with different aesthetics, and we will practice both. The code below is from example-03-first-visualizations.py.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from pathlib import Path

sns.set_style("whitegrid")
plt.rcParams["figure.dpi"] = 120
plt.rcParams["font.size"] = 11

DATA_DIR = "data/oda/"
Path("output").mkdir(exist_ok=True)  # the savefig calls below write into output/
polls = pd.read_csv(DATA_DIR + "oda_polls.csv", parse_dates=["date"])
voters = pd.read_csv(DATA_DIR + "oda_voters.csv")
ads = pd.read_csv(DATA_DIR + "oda_ads.csv", parse_dates=["date"])

gw_state = "TX_ANALOG"
gw_polls = polls[(polls['state'] == gw_state) & (polls['population'] == 'LV')].copy()
gw_voters = voters[voters['state'] == gw_state].copy()

Visualization 1: The Polling Trend Chart

# Sort by date for the trend line
gw_polls_sorted = gw_polls.sort_values('date')

fig, ax = plt.subplots(figsize=(11, 5))

# Plot each poll as a scatter point
ax.scatter(gw_polls_sorted['date'], gw_polls_sorted['pct_d'],
           color='#2166ac', alpha=0.6, s=50, label='Garza (D)', zorder=3)
ax.scatter(gw_polls_sorted['date'], gw_polls_sorted['pct_r'],
           color='#d6604d', alpha=0.6, s=50, label='Whitfield (R)', zorder=3)

# 14-day rolling average (smoothed trend)
gw_polls_sorted = gw_polls_sorted.set_index('date')
roll_d = gw_polls_sorted['pct_d'].rolling('14D').mean()
roll_r = gw_polls_sorted['pct_r'].rolling('14D').mean()

ax.plot(roll_d.index, roll_d.values, color='#2166ac', linewidth=2.5,
        label='Garza 14-day avg')
ax.plot(roll_r.index, roll_r.values, color='#d6604d', linewidth=2.5,
        label='Whitfield 14-day avg')

# Reference line at 50%
ax.axhline(50, color='gray', linestyle='--', linewidth=0.8, alpha=0.7)

ax.set_title("Garza-Whitfield Senate Race: Polling Trend (Likely Voters)",
             fontsize=13, fontweight='bold', pad=12)
ax.set_xlabel("Date", fontsize=11)
ax.set_ylabel("Percentage (%)", fontsize=11)
ax.legend(loc='upper left', framealpha=0.9)
ax.set_ylim(35, 62)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'{x:.0f}%'))

plt.tight_layout()
plt.savefig("output/01_polling_trend.png", dpi=150, bbox_inches='tight')
plt.show()
print("Chart saved: output/01_polling_trend.png")

💡 Intuition: The 14-Day Rolling Average Individual polls are noisy — sampling variation, timing differences, and house effects make adjacent polls bounce around. A rolling average smooths this noise by computing the average of all polls within a trailing window. We use 14 days (two weeks) because it is long enough to smooth noise but short enough to capture genuine trend movement. The right window length depends on your data; with sparse polling, you may need a longer window.
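You can see the trade-off directly by smoothing the same noisy series with several window lengths. This sketch uses a simulated daily series (a gentle upward trend plus noise):

```python
import pandas as pd
import numpy as np

rng = np.random.default_rng(0)
dates = pd.date_range("2025-06-01", periods=60, freq="D")
# Simulated daily readings: a gentle upward trend plus sampling noise
series = pd.Series(47 + 0.03 * np.arange(60) + rng.normal(0, 1.5, 60),
                   index=dates)

# Time-based rolling windows of three lengths over the same series
wiggles = {}
for window in ["7D", "14D", "28D"]:
    smoothed = series.rolling(window).mean()
    # Longer windows damp day-to-day movement more aggressively
    wiggles[window] = smoothed.diff().abs().mean()
    print(f"{window}: mean day-to-day change = {wiggles[window]:.3f}")
```

The longer window always produces a calmer line; the cost is that a genuine late shift in the race shows up more slowly.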

Visualization 2: Voter Demographics — A Bar Chart

# Bar chart: party registration breakdown in the Garza-Whitfield state
party_counts = gw_voters['party_reg'].value_counts()

fig, ax = plt.subplots(figsize=(8, 5))

colors = {'Democrat': '#2166ac', 'Republican': '#d6604d',
          'Independent': '#4dac26', 'Other': '#888888', 'No Party': '#aaaaaa'}
bar_colors = [colors.get(p, '#888888') for p in party_counts.index]

bars = ax.bar(party_counts.index, party_counts.values, color=bar_colors,
              edgecolor='white', linewidth=0.5)

# Add value labels on bars
for bar, val in zip(bars, party_counts.values):
    pct = val / party_counts.sum() * 100
    ax.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 500,
            f'{val:,}\n({pct:.1f}%)', ha='center', va='bottom', fontsize=9)

ax.set_title("Voter File: Party Registration\nGarza-Whitfield State", fontsize=12,
             fontweight='bold', pad=10)
ax.set_xlabel("Party Registration", fontsize=11)
ax.set_ylabel("Registered Voters", fontsize=11)
ax.yaxis.set_major_formatter(plt.FuncFormatter(lambda x, _: f'{x:,.0f}'))

plt.tight_layout()
plt.savefig("output/02_party_registration.png", dpi=150, bbox_inches='tight')
plt.show()

Visualization 3: Support Score Distribution — A Histogram

# Histogram: support score distribution (stratified by party registration)
fig, ax = plt.subplots(figsize=(10, 5))

for party, color, label in [
    ('Democrat', '#2166ac', 'Democrats'),
    ('Republican', '#d6604d', 'Republicans'),
    ('Independent', '#4dac26', 'Independents')
]:
    subset = gw_voters[gw_voters['party_reg'] == party]['support_score']
    ax.hist(subset, bins=30, alpha=0.55, color=color, label=label, density=True)

ax.set_title("Garza Support Score Distribution by Party Registration",
             fontsize=12, fontweight='bold', pad=10)
ax.set_xlabel("Support Score (0 = strong Whitfield, 100 = strong Garza)", fontsize=11)
ax.set_ylabel("Density", fontsize=11)
# Draw the neutral line before building the legend so its label is included
ax.axvline(50, color='gray', linestyle='--', linewidth=1, alpha=0.8, label='Neutral')
ax.legend(loc='upper center', framealpha=0.9)

plt.tight_layout()
plt.savefig("output/03_support_score_distribution.png", dpi=150, bbox_inches='tight')
plt.show()

Nadia stares at the histogram for a moment. The Democrat and Republican distributions are, as expected, clustered at opposite ends. But the Independent distribution is bimodal — there is a hump near 35 and another hump near 65, with a valley in the middle. "Independents aren't in the middle," she says quietly. "There are two kinds."

This is an important insight. "Pure independents" — voters with genuinely ambivalent preferences — are a minority of the self-identified independent electorate. Many independents are "closet partisans" who reliably vote for one party while preferring not to be labeled as partisans. The bimodal distribution of support scores among independents captures this distinction visually.
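The two-humps observation can be quantified directly rather than eyeballed. A hedged sketch — the 45/55 neutral band and the helper name are our own illustrative choices:

```python
import pandas as pd

def independent_composition(voters: pd.DataFrame,
                            lo: float = 45, hi: float = 55) -> pd.Series:
    """Share of registered Independents below, inside, and above a neutral band.

    A genuinely bimodal distribution shows large shares in the two outer
    bins and a small share in the middle. The cutoffs are illustrative,
    not canonical — vary them to check the result is not an artifact.
    """
    ind = voters.loc[voters["party_reg"] == "Independent", "support_score"]
    bands = pd.cut(ind, bins=[-0.01, lo, hi, 100.01],
                   labels=["Whitfield-leaning", "Neutral band", "Garza-leaning"])
    return bands.value_counts(normalize=True).round(3)
```

A small neutral-band share relative to the two leaning shares is the numeric signature of the "two kinds of independents" pattern Nadia spotted in the histogram.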

Visualization 4: Persuadability vs. Support — A Scatter Plot

# Scatter plot: persuadability score vs. support score
# This is Nadia's "targeting map" — where are the winnable voters?

# Sample up to 5,000 voters for readability (guards against smaller files)
sample = gw_voters.sample(min(5000, len(gw_voters)), random_state=42)

fig, ax = plt.subplots(figsize=(9, 7))

scatter = ax.scatter(
    sample['support_score'],
    sample['persuadability_score'],
    c=sample['support_score'],
    cmap='RdBu',
    alpha=0.4,
    s=15,
    vmin=0, vmax=100
)

plt.colorbar(scatter, ax=ax, label='Support Score (Blue = Garza)')

# Highlight the "sweet spot": high persuadability, mid support
ax.axvline(40, color='gray', linestyle='--', linewidth=0.8, alpha=0.6)
ax.axvline(60, color='gray', linestyle='--', linewidth=0.8, alpha=0.6)
ax.axhline(55, color='gray', linestyle='--', linewidth=0.8, alpha=0.6)

ax.annotate("Persuadable\nGarza leaners",
            xy=(55, 70), fontsize=9, color='#2166ac',
            bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.8))

ax.annotate("Persuadable\nWhitfield leaners",
            xy=(30, 70), fontsize=9, color='#d6604d',
            bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.8))

ax.set_title("Support Score vs. Persuadability Score\n(Sample of 5,000 registered voters)",
             fontsize=12, fontweight='bold', pad=10)
ax.set_xlabel("Garza Support Score (0–100)", fontsize=11)
ax.set_ylabel("Persuadability Score (0–100)", fontsize=11)
ax.set_xlim(-2, 102)
ax.set_ylim(-2, 102)

plt.tight_layout()
plt.savefig("output/04_support_vs_persuadability.png", dpi=150, bbox_inches='tight')
plt.show()

⚠️ Common Pitfall: Overplotting in Scatter Plots When you plot all voters in the file (potentially millions of rows) as individual points, the chart becomes an illegible blob. We addressed this by sampling 5,000 rows and using alpha=0.4 (40% transparency) so individual points are visible through the overlap. For very large datasets, hexbin plots or 2D histograms can be more informative than scatter plots.
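For the full voter file, the hexbin alternative mentioned above looks like the sketch below. The function name and the `gridsize=40` bin size are our own choices, and the `Agg` backend line simply lets the sketch run headless:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; harmless in scripts
import matplotlib.pyplot as plt
import pandas as pd

def support_persuadability_hexbin(voters: pd.DataFrame, path: str) -> None:
    """Hexbin density of support vs. persuadability — no sampling needed.

    Each hexagon's color encodes how many voters fall inside it, so the
    full file can be plotted without the overplotting blob.
    """
    fig, ax = plt.subplots(figsize=(9, 7))
    hb = ax.hexbin(voters["support_score"], voters["persuadability_score"],
                   gridsize=40, cmap="viridis", mincnt=1)
    fig.colorbar(hb, ax=ax, label="Voters per hexagon")
    ax.set_xlabel("Garza Support Score (0–100)")
    ax.set_ylabel("Persuadability Score (0–100)")
    fig.savefig(path, dpi=150, bbox_inches="tight")
    plt.close(fig)
```

Unlike the sampled scatter, the hexbin uses every row, so density differences between regions are exact rather than subject to sampling variation.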

5.8 Basic Cross-Tabulations

Cross-tabulations (or "crosstabs") are the bread and butter of political data analysis — they let you examine how one categorical variable is distributed across levels of another. What is the support score distribution by education level? How does vote history vary by urban-rural status?

# Cross-tab: party registration by urban-rural status
ct_urban_party = pd.crosstab(
    gw_voters['urban_rural'],
    gw_voters['party_reg'],
    normalize='index'  # Row percentages
).round(3)

print("Party Registration by Urban-Rural Status (row percentages):")
print(ct_urban_party.to_string())

# Cross-tab: education by party registration
ct_edu_party = pd.crosstab(
    gw_voters['education'],
    gw_voters['party_reg'],
    normalize='index'
).round(3)

print("\nParty Registration by Education Level (row percentages):")
print(ct_edu_party.to_string())

# Cross-tab: race/ethnicity by vote history 2022
ct_race_turnout = pd.crosstab(
    gw_voters['race_ethnicity'],
    gw_voters['vote_history_2022'],
    normalize='index'
).round(3)

print("\n2022 Turnout by Race/Ethnicity (row percentages):")
print(ct_race_turnout.to_string())

The turnout-by-race table is analytically important. In the Garza-Whitfield context, Nadia's campaign is heavily invested in Latino voter mobilization. The cross-tab will tell her what the historical baseline is for that population — which is the prior that any mobilization investment must overcome.

# Average support score by demographic group — groupby approach
print("\nMean Garza Support Score by Education Level:")
edu_support = gw_voters.groupby('education')['support_score'].agg(
    ['mean', 'median', 'std', 'count']
).round(2)
print(edu_support.to_string())

print("\nMean Garza Support Score by Race/Ethnicity:")
race_support = gw_voters.groupby('race_ethnicity')['support_score'].agg(
    ['mean', 'median', 'std', 'count']
).round(2)
print(race_support.to_string())

🧪 Try This: The Income-Support Cross-Tab Before reading ahead, predict what you will find when you cross-tabulate income_bracket against support_score. (Because support_score is continuous, bin it first — for example into quartiles with pd.qcut.) Will higher-income voters support Garza or Whitfield? Why? Now compute the cross-tab and compare to your prediction. If the result differs from your expectation, what explains the difference? This exercise practices the Chapter 4 habit of forming a prior before looking at data.

🧪 Try This: The Education-Support Prediction Before running the education-by-support cross-tab, write down your prediction: will college-educated voters support Garza more or less than voters without college degrees? By how many points? Then run the cross-tab and compare. The education polarization in American politics has been one of the most dramatic demographic shifts of the past twenty years, and this exercise gives you a chance to test whether your mental model of that shift is calibrated to the actual data.
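Because support_score is continuous, cross-tabulating it directly against a demographic column produces one row per unique score. Bin it first. A sketch for either exercise — the four band cut points and labels are our own illustrative choices:

```python
import pandas as pd

def support_crosstab(voters: pd.DataFrame, by: str) -> pd.DataFrame:
    """Row-percentage crosstab of a demographic column against binned support.

    Bins the 0-100 support score into four illustrative bands, then
    computes row percentages so each demographic group sums to 1.0.
    """
    bands = pd.cut(voters["support_score"], bins=[0, 35, 50, 65, 100],
                   labels=["Strong Whitfield", "Lean Whitfield",
                           "Lean Garza", "Strong Garza"],
                   include_lowest=True)
    return pd.crosstab(voters[by], bands, normalize="index").round(3)
```

Call it as `support_crosstab(gw_voters, "income_bracket")` or `support_crosstab(gw_voters, "education")` and compare the result to your written-down prediction.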

5.9 Exploring the Advertising and Donations Data

Political science is not only about voters — it is about money and messages. Let us look at the advertising and donations tables briefly, establishing patterns we will explore more deeply in later chapters.

Campaign advertising data is particularly rich for understanding strategic choices. The oda_ads.csv table records where campaigns and outside groups spent money, on which platforms, targeting which demographics, and around which issue topics. This is not what voters saw — individual impressions are not recorded — but it is a reliable record of what campaigns invested in and prioritized. The spend_usd and impressions columns together give you a cost-per-thousand-impressions (CPM) metric that reflects both the cost of reaching audiences on different platforms and the efficiency with which each campaign's dollars translated into audience contact.

Campaign finance data captures a different dimension of political activity: who gave money to whom, and in what amounts. The structure of political giving tells you something about both the economic interests aligned with each candidate and the grassroots intensity of their support base. A campaign funded primarily by large donors from a narrow set of industries has a different political character than one funded by hundreds of thousands of small-dollar donors — even if the total dollar amounts are similar. The FEC disclosure threshold ($200) means that small-dollar donors are partially invisible in public records, which is itself an analytically important feature of the data.

# Advertising: where is money being spent in the Garza-Whitfield state?
gw_ads = ads[ads['state'] == gw_state].copy()

print(f"Total ad spending in GW state: ${gw_ads['spend_usd'].sum():,.0f}")

# By party
party_spend = gw_ads.groupby('party')['spend_usd'].sum().sort_values(ascending=False)
print("\nAd spending by party:")
print(party_spend.apply(lambda x: f"${x:,.0f}").to_string())

# By platform
platform_spend = gw_ads.groupby('platform')['spend_usd'].sum().sort_values(ascending=False)
print("\nAd spending by platform:")
print(platform_spend.apply(lambda x: f"${x:,.0f}").to_string())

# By issue topic
issue_spend = gw_ads.groupby(['party', 'issue_topic'])['spend_usd'].sum().unstack(
    fill_value=0).round(0)
print("\nAd spending by party and issue topic:")
print(issue_spend.to_string())
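The CPM metric described in the prose above can be computed from spend_usd and impressions. A sketch — the helper name and the guard against zero-impression rows are our own additions:

```python
import pandas as pd

def cpm_by_platform(ads: pd.DataFrame) -> pd.Series:
    """Cost per thousand impressions: total spend / total impressions x 1000.

    Aggregate before dividing — averaging per-row CPMs would weight a tiny
    buy equally with a huge one. Rows with zero recorded impressions are
    excluded to avoid division by zero.
    """
    valid = ads[ads["impressions"] > 0]
    g = valid.groupby("platform")
    return (g["spend_usd"].sum() / g["impressions"].sum() * 1000).round(2)
```

A platform with a low CPM reached audiences cheaply; comparing CPMs across platforms and parties shows whose dollars translated into contact most efficiently.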
# Donations: small dollar vs. large dollar composition
gw_donations = donations[donations['recipient'].str.contains(gw_state, na=False)].copy()

# Donation size buckets
bins = [0, 50, 200, 1000, 2900, float('inf')]
labels = ['Micro (<$50)', 'Small ($50–200)', 'Medium ($200–1k)',
          'Major ($1k–2.9k)', 'Max ($2.9k+)']
gw_donations['donation_bucket'] = pd.cut(
    gw_donations['amount'], bins=bins, labels=labels, right=False
)

print("Donation composition by size bucket:")
bucket_summary = gw_donations.groupby(['recipient_party', 'donation_bucket']).agg(
    count=('amount', 'count'),
    total=('amount', 'sum')
).reset_index()
print(bucket_summary.to_string())

⚖️ Ethical Analysis: Who Is in the Donations Data? Campaign finance data raises sharp "who gets counted" questions. Donations over $200 to federal candidates are publicly disclosed, meaning the donor appears in the public record by name, employer, and occupation. Donations below $200 are bundled and reported in aggregate. This means that the most politically engaged small-dollar donors — who may represent a genuine grassroots movement — are essentially invisible in public FEC data, while large donors are individually identifiable. When you analyze donations data, you are seeing a systematically truncated picture of the campaign finance landscape, heavily biased toward large donors.

5.10 Saving and Documenting Your Work

Good analysis is reproducible analysis. This means saving not just your outputs but your code, your documentation decisions, and the inputs that produced each output. In political analytics, where the same analysis may be run dozens of times with updated data, reproducibility is not a nice-to-have — it is essential.

import os

# Create output directory if it doesn't exist
os.makedirs("output", exist_ok=True)

# Save key summary tables
gw_polls['margin_d'] = gw_polls['pct_d'] - gw_polls['pct_r']
gw_polls_summary = gw_polls.groupby(pd.Grouper(key='date', freq='W')).agg(
    n_polls=('poll_id', 'count'),
    avg_pct_d=('pct_d', 'mean'),
    avg_pct_r=('pct_r', 'mean'),
    avg_margin=('margin_d', 'mean')
).round(2)

gw_polls_summary.to_csv("output/gw_weekly_polling_summary.csv", index=True)
print("Saved: output/gw_weekly_polling_summary.csv")

# Save voter demographic summary
voter_demo_summary = {
    'total_voters': len(gw_voters),
    'mean_age': gw_voters['age'].mean(),
    'pct_dem': (gw_voters['party_reg'] == 'Democrat').mean(),
    'pct_rep': (gw_voters['party_reg'] == 'Republican').mean(),
    'pct_ind': (gw_voters['party_reg'] == 'Independent').mean(),
    'mean_support_score': gw_voters['support_score'].mean(),
    'mean_persuadability': gw_voters['persuadability_score'].mean(),
    'high_persuadable_count': (gw_voters['persuadability_score'] >= 60).sum()
}

pd.Series(voter_demo_summary).to_csv("output/gw_voter_summary.csv", header=False)
print("Saved: output/gw_voter_summary.csv")

Best Practice: Analysis Logs Include a brief "analysis log" comment block at the top of every script, recording: (1) the date and author, (2) what the script does and what inputs it requires, (3) what outputs it produces, and (4) any significant analytical decisions made in the script. Future you — and your colleagues — will thank present you. In political campaigns, where staff turnover is high and time pressure is intense, documented code is the difference between analysis that can be handed off and analysis that dies with its creator.
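A minimal version of such a log, as a comment block. The fields and values shown are an illustrative template, not the chapter's actual script header:

```python
# =====================================================================
# Analysis log
# Date / author : 2025-10-15, Nadia
# Purpose       : First-look exploratory analysis of the Garza-Whitfield
#                 race (polls, voter file, ads, donations)
# Inputs        : ODA polls, voters, ads, and donations tables
# Outputs       : output/ charts; weekly polling and voter summaries
# Decisions     : LV polls only; 14-day rolling window; persuadability
#                 threshold of 60 for "high persuadable" counts
# =====================================================================
```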

Beyond logging, consider maintaining a separate "analytical decisions" document — a brief prose record of the choices you made and why: why you filtered to LV polls rather than all polls, why you chose a 60-point threshold for high persuadability, why you used a 14-day rolling window for the trend chart. These decisions are invisible in the code itself but are precisely what makes the difference between analysis that is trustworthy and analysis that just looks that way. When a campaign staffer a month later asks "why did you do it this way?", having that document saves hours of reconstruction from memory.

File naming matters more than it seems. Save outputs with names that include the analysis date and a brief description: gw_lv_polls_2025-10-15.csv rather than output.csv. When you update the analysis with new data, save a new file rather than overwriting the old one. The discipline of retaining historical snapshots means you can always answer "what did we know on October 15th?" — a question that matters enormously during post-election reviews and press inquiries.
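The dated-filename habit is easy to automate so you never type the date by hand. A sketch, with a helper name of our own:

```python
from datetime import date

def dated_filename(stem: str, ext: str = "csv") -> str:
    """Build an output filename stamped with today's ISO date,
    e.g. 'gw_lv_polls' -> 'gw_lv_polls_2025-10-15.csv'.

    Using a fresh name per run preserves historical snapshots instead
    of overwriting them.
    """
    return f"{stem}_{date.today().isoformat()}.{ext}"
```

Pair it with `to_csv(dated_filename("gw_lv_polls"))` and the "what did we know on October 15th?" question answers itself from the output directory.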

5.11 Nadia's First Findings: What the Data Says

Nadia has been running variations of this analysis for three hours. The picture that has emerged is more nuanced than the simple horse-race numbers suggested.

The polling average shows Garza trailing by approximately 1.8 points in the most recent likely voter polls — a deficit within the margin of error but consistent enough across pollsters to treat as real. The voter file reveals a demographic landscape that cuts in multiple directions. The state's growing Latino population — which averages a Garza support score of 63 — represents significant potential upside if mobilized. But historical turnout rates for that community in midterm-equivalent cycles have been substantially lower than the campaign's mobilization model assumed.

The advertising data shows Whitfield's aligned outside groups outspending Garza's by roughly 1.4 to 1 on television, concentrated in the state's three major media markets. But Garza's campaign is running a more geographically distributed digital buy, reaching suburban markets that Whitfield's television-heavy strategy may be undersaturating.

The donations analysis tells an interesting story about small-dollar enthusiasm: Garza's campaign is receiving a higher share of its total dollars from donors giving under $100 — a signal of grassroots intensity that may predict volunteer availability in the final push.

None of this tells Nadia who is going to win. That is not what first-look exploratory analysis does. What it does is tell her where to look next, what questions to ask of the data in the coming weeks, and which assumptions in the campaign's planning documents deserve the most skeptical scrutiny. It also gives her a baseline — a documented snapshot of where things stood at the beginning of her analytical engagement with this race — against which she can measure movement as new data arrives.

🔗 Connection to Chapter 10 Chapter 10, the next Python lab chapter, goes deeper on polling data specifically: weighting, house effect adjustment, and time-weighted polling averages. The polling work in this chapter is deliberately unweighted and exploratory. When you return to polling data in Chapter 10, you will have the tools to produce a properly aggregated polling average that accounts for the methodological issues we identified here.

5.12 Who Gets Counted, Who Gets Heard

Sam has finally stopped frowning. They are scrolling through the cross-tabulations of race/ethnicity by vote history, and something is bothering them in a productive way.

"The Latino voters in this file have lower historical turnout than the white voters," they say. "So they're systematically underweighted in any likely voter model that uses past voting behavior as its primary screen."

"That's right," Adaeze says.

"Which means any analysis that only looks at likely voters is starting from a universe that underrepresents Latino voters."

"That's also right."

"Which means all the conclusions about what the electorate looks like, what the race is about, what messages resonate — all of that is built on a sample that systematically excludes part of the population whose participation is actively being contested in this race."

Adaeze smiles. "Now you're thinking like a political analyst."

This is the "Who Gets Counted, Who Gets Heard" theme at the analytical level: the definitions and filters we apply to data embed assumptions about political participation that can quietly determine whose preferences count in our analysis. The likely voter screen is not a neutral methodological choice — it is a choice to analyze the electorate as it has historically been, rather than as it might become. For a campaign actively working to expand the electorate, that choice systematically underweights the population it is most trying to reach.

This does not mean likely voter polls are wrong. It means they are measuring one thing — the electorate as historically constituted — and not another thing — the electorate as potentially expanded. Both are legitimate analytical objects. Being clear about which one you are measuring is not a technicality; it is a substantive question about whose preferences and whose participation you are representing in your analysis.

🌍 Global Perspective: Voter File Infrastructure Across Democracies The voter file infrastructure that makes the ODA Dataset possible is a distinctly American institution. The United States has one of the most commercially developed voter data ecosystems in the world, partly because voter registration data is largely public and partly because the decentralized structure of American elections has created a market for data aggregation. By contrast, many European democracies have stricter data privacy regulations (including GDPR in the EU) that limit the commercial use of voter data, making the kind of fine-grained targeting the ODA Dataset enables either illegal or significantly constrained. Understanding this difference is important for applying American analytical techniques in global contexts.

5.13 Reading the Data Critically: What the ODA Dataset Assumes

Before closing this chapter, it is worth pausing to examine some of the assumptions built into the ODA Dataset's structure — because understanding those assumptions is as important as knowing how to manipulate the data. Every dataset makes choices, and the choices embedded in the ODA Dataset reflect real choices that real data vendors and campaign analytics teams make every day.

The binary vote history assumption. The vote_history_2018, vote_history_2020, and vote_history_2022 columns each contain 0 or 1: did not vote, voted. This is a simplification. In reality, early voting, absentee voting, and provisional ballot status create a richer picture of electoral participation that a binary flag obscures. A voter who cast an absentee ballot that was rejected for a signature mismatch is recorded as a non-voter in the official record, though their intent was to participate. A voter who showed up to a polling place that had moved and was told to go to a new location — and who did not make it — is also a non-voter in the record. The binary flag treats these very different situations identically.

The static demographic assumption. The race_ethnicity, education, and income_bracket columns are recorded as fixed attributes of each voter. But these variables change over time. People get degrees. Incomes shift. Ethnic identification itself can change — particularly in mixed-heritage communities where individuals may identify differently depending on context and political climate. Voter files record these attributes as snapshots, often derived from consumer data or self-reported registration information that may be years out of date. When you compute "the average support score among college-educated voters," you are actually computing "the average support score among voters whose education level was recorded as college-educated at the time the data was assembled," which is a subtly different thing.

The individual independence assumption. Every row in oda_voters.csv represents one voter, implicitly treated as an independent unit of analysis. But voters are embedded in households, in social networks, in communities. Research consistently shows that household members influence each other's vote choices, that social networks transmit political information and mobilization signals, and that community-level norms shape individual behavior in ways that individual-level data cannot capture. The household-level structure of political behavior is largely invisible in a flat voter file.

The support score as ground truth. The support_score and persuadability_score columns represent the output of proprietary models run by data vendors. These models are trained on historical data and validated against past elections. But they are black boxes: the exact features, weights, and training procedures are not disclosed, which means you cannot fully evaluate what they are measuring or when they might fail. Treating these scores as ground truth — as if they were direct measurements of voter preference rather than modeled estimates — is a systematic error that campaigns make constantly. They are useful, they are informative, and they are always uncertain to a degree that the point estimates do not communicate.

The single-race focus. The ODA Dataset is structured around individual Senate races, which is appropriate for a textbook focused on the Garza-Whitfield race. But real electorates are shaped by the full ballot — presidential races, gubernatorial races, ballot initiatives, local contests. Voters make choices in a context shaped by all of these races simultaneously, and campaigns compete for attention in an environment where their race may not be the most salient contest on the ballot. The single-race framing of the ODA Dataset is pedagogically useful but substantively limiting.

None of these limitations make the ODA Dataset a bad tool. They make it a tool with specific properties that you need to understand to use it well. This is true of every political dataset you will ever work with: understanding what a dataset assumes is as important as understanding what it contains.

🔴 Critical Thinking: Who Built This Dataset? Every dataset is the product of decisions made by specific people with specific resources, goals, and incentives. The ODA Dataset was built by OpenDemocracy Analytics, a nonprofit with a mission of transparency in political data. But what if it had been built by a campaign vendor? By a partisan organization? By an academic team with specific theoretical commitments? The data would look structurally similar but might embed different assumptions, prioritize different variables, and reflect different decisions about what to measure and how. Always ask: who built this, why, and what decisions did they make that I cannot see from the data alone?

5.14 Exploratory vs. Confirmatory Analysis: Knowing Which Mode You Are In

A distinction that becomes increasingly important as you develop your analytical practice is the one between exploratory and confirmatory analysis. This chapter has been entirely exploratory — you have been looking at the ODA Dataset without strong prior hypotheses, trying to understand its structure and surface potentially interesting patterns. This is appropriate for a first encounter with a new dataset. It is analytically dangerous if you do not recognize when you have stopped exploring and started confirming.

Exploratory analysis generates hypotheses. When you notice that the support score distribution of Independent voters is bimodal, you have generated a hypothesis: there may be two distinct types of political Independent, not one. When you notice that the Latino community has lower historical turnout but higher average support scores for Garza, you have generated a hypothesis: this community represents an under-realized mobilization opportunity. These are legitimate, valuable outputs of exploratory work.

The danger comes when you immediately treat these exploratory observations as confirmed findings — when you present the bimodal Independent distribution as "proof" of two Independent types, or when you present the Latino turnout-support gap as "evidence" that the campaign's mobilization strategy will produce a breakthrough. Exploratory observations are hypotheses, not conclusions. Confirming them requires additional analysis: independent data, pre-specified tests, or experimental evidence.

Confirmatory analysis tests pre-specified hypotheses. The analysis is designed in advance, the expected results are stated before looking at the data, and the criteria for confirmation or refutation are explicit. This is the approach that produces replicable, trustworthy findings. It is also much harder to do in a campaign context, where time pressure is intense and the temptation to treat every interesting pattern as an actionable finding is enormous.

The appropriate workflow is: explore first, then form hypotheses, then confirm with independent data or analysis. Do not use the same data for exploration and confirmation — that is circular reasoning. If you notice a pattern in the ODA polls data from the first half of the campaign and then confirm it using the same polls data, you have not confirmed anything; you have just found the same pattern twice in the same data.
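One mechanical safeguard is to split time-stamped data before you start exploring, reserving the later half for confirmation. A sketch — the midpoint split and helper name are our own illustrative choices; any pre-specified cutoff works:

```python
import pandas as pd

def split_by_date(df: pd.DataFrame, date_col: str = "date"):
    """Split a time-stamped frame into exploration and confirmation halves.

    Explore only the first half; write down your hypotheses; then test
    them on the second half. The median-date cutoff is illustrative —
    what matters is that the cutoff is fixed before you look.
    """
    cutoff = df[date_col].median()
    explore = df[df[date_col] <= cutoff]
    confirm = df[df[date_col] > cutoff]
    return explore, confirm
```

Applied to gw_polls, a pattern found in the exploration half and then recovered in the confirmation half is far stronger evidence than the same pattern found twice in the same rows.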

For Nadia, this distinction has concrete implications. The support-score and persuadability patterns she observed in her initial analysis are interesting hypotheses. Before the campaign makes major resource allocation decisions based on those patterns, she needs to confirm them against additional evidence: targeted surveys, experimental contact programs, or comparison against races with similar demographic profiles. The exploratory analysis has told her where to look next, not what to do right now.

Best Practice: Label Your Analysis Type In any memo, presentation, or script, explicitly label whether the analysis is exploratory (hypothesis-generating) or confirmatory (hypothesis-testing). This protects against overstating findings and sets appropriate expectations for decision-makers. A finding labeled as exploratory signals "this is interesting and warrants follow-up investigation." A finding labeled as confirmatory signals "this was tested rigorously and the conclusion is reliable." These are very different claims, and conflating them is one of the most common sources of analytical overconfidence.

5.15 Your Analytical Workflow Going Forward

Chapter 5 has established the foundation: you can load, inspect, clean, filter, summarize, cross-tabulate, and visualize political data. You have a dataset to work with throughout the rest of the textbook, and you have begun to understand its structure and limitations. Before moving on, it is worth articulating the workflow you should follow every time you begin a new analysis, whether in this course or in professional practice.

Step 1: Understand the data structure before looking at values. Check shape, column names, and dtypes. Read any available documentation. Know what each table represents and what the unit of analysis is.

Step 2: Inspect the data quality. Look for impossible values, missing data, inconsistent coding, and duplicates. Do not assume the data is clean because you received it from a reputable source.

Step 3: Run descriptive statistics. Generate .describe() outputs for numeric columns and value_counts() outputs for categorical columns. Look at the distributions, not just the means. Means can be misleading when distributions are skewed or bimodal.

Step 4: State your question before looking for answers. Based on your understanding of the data's structure and quality, form a specific analytical question. If you are in exploratory mode, be explicit about it. If you are in confirmatory mode, write down your expected result before computing anything.

Step 5: Filter and subset appropriately. Apply the filters that are appropriate for your question — the right population, the right time period, the right methodological subset. Document what you filtered and why.

Step 6: Compute what you need. Run the relevant statistics, cross-tabulations, and regressions. Check intermediate results for sanity as you go. If a number looks surprising, investigate it before proceeding — surprising results are either genuine findings or data errors, and you need to know which.

Step 7: Visualize. Charts are not decorations added at the end. They are analytical tools that reveal structure invisible in tables. Always plot your key variables before drawing conclusions.

Step 8: Document and save. Record what you did, why you did it, and what you found. Save outputs with descriptive names. Write the analysis log at the top of your script.

Step 9: Interpret with humility. State what the data says, what it does not say, and what additional analysis would be needed to say more. Your exploratory findings are hypotheses. Your confirmatory findings have limits defined by your data, your sample, and your assumptions.

This workflow is not glamorous. It takes longer than diving straight into the interesting question. But it is the difference between analysis that is trustworthy and analysis that merely looks trustworthy — and in political practice, where decisions have real consequences for real people, that difference matters enormously.

📊 Real-World Application: How the ODA Dataset Compares to Real Campaign Data If you were working on the actual Garza campaign (or any real Senate campaign), the data structure would be recognizable from the ODA Dataset but substantially messier. Real voter files contain 200 to 400 columns rather than the ODA's carefully curated 14. Real polling exports have inconsistent date formats, truncated decimal places, and column names that change with each delivery from the pollster. Real campaign finance data has donor occupation fields that are free-text strings with thousands of unique values, many of them abbreviated, misspelled, or jokingly filled in. The ODA Dataset has been designed to represent the structural patterns of real political data while removing the most painful data-cleaning friction — a pedagogically appropriate choice, but one you should be aware of as you move toward professional practice. The skills you learn here transfer directly; the time you spend will scale up significantly.

Chapter Summary

You have now loaded, explored, filtered, analyzed, and visualized a real political dataset — and more importantly, you have begun developing the habits that make political data analysis trustworthy rather than misleading. The ODA Dataset is large enough to have real structure and real complexity, and your first exploratory pass has already surfaced several analytically important findings: the bimodal distribution of Independent support scores, the demographic skew in historical turnout rates, the advertising spending imbalance, the small-dollar donation signal.

The technical skills you have practiced — read_csv(), .info(), .describe(), value_counts(), boolean filtering, groupby(), pd.crosstab(), matplotlib charting — are the building blocks of everything that follows in the Python lab chapters. But the deeper lesson is about process: start with the structure of the data, understand who is in it and who is missing, form your questions before your conclusions, and document your work at every step.

The ODA Dataset will accompany you throughout this textbook. In Chapter 10, you will build a properly aggregated and weighted polling average from oda_polls.csv. In Chapter 16, you will create publication-quality visualizations of the electorate. In Chapter 21, you will build a forecasting model. In Chapter 27, you will analyze the text of speeches in oda_speeches.csv. Each time, you will return to data you already know — which means each new analysis builds on a foundation of context rather than starting from scratch.

Adaeze closes her laptop and looks at Sam. "What do you want to know about this race?"

Sam looks at the charts still open on their screen, at the distributions and the trend lines and the scatter plot with its clusters of persuadable voters. "Everything," they say. "I want to know everything."

That is exactly the right answer. We have barely started.


Chapter 6 leaves the data and returns to theory: What is public opinion, and how do we know when we are measuring it? The frameworks in Chapter 4 and the data in Chapter 5 inform the deeper conceptual questions Chapter 6 asks about what polls are actually measuring when they claim to represent the public.