Case Study 2: Election Night Analysis — Jordan Explores Voter Turnout Data


Tier 3 — Illustrative/Composite Example: This case study uses a fictional voter turnout dataset loosely inspired by the types of publicly available election data published by state election offices and organizations such as the United States Election Project. All county names, specific numbers, and demographic breakdowns are invented for pedagogical purposes. No real election or jurisdiction is represented, and the patterns described are composites created to illustrate common analytical challenges.


The Setting

It's the Wednesday after a midterm election, and Jordan — our university student from Chapter 1 — is scrolling through news coverage while drinking coffee in the campus library. Every outlet has a different take: one says turnout was "historically high," another calls it "disappointing in key demographics," and a third focuses on "the suburban swing." They're all citing numbers, but the numbers don't seem to agree.

Jordan has been developing a habit over the past few chapters of this course: when people make claims about data, Jordan wants to check the data. So instead of just reading about voter turnout, Jordan decides to analyze turnout data directly.

Their professor has shared a CSV file called county_turnout.csv containing voter turnout data for 280 counties in a fictional state. The file has 10 columns:

  Column              Description                          Example
  county              County name                          "Adams County"
  region              State region                         "North", "South", "Central", "Metro"
  registered_voters   Total registered voters              "45200"
  votes_cast          Total votes cast                     "28750"
  turnout_pct         Turnout as percentage                "63.6"
  median_income       Median household income              "52400"
  college_pct         % with bachelor's degree or higher   "28.5"
  median_age          Median age of population             "38.2"
  population_density  People per square mile               "125"
  prev_turnout_pct    Previous election turnout %          "58.2"

Jordan has Python basics from Chapters 3-5 and the EDA workflow from Chapter 6. Time to put them to work.

Asking the Right Questions

Jordan opens a new Jupyter notebook and starts with a Markdown cell:

Project: County-Level Voter Turnout Analysis

Data source: State Election Office, county_turnout.csv

Questions:
1. What does the overall distribution of voter turnout look like across counties?
2. Are there demographic patterns? Do wealthier or more educated counties have higher turnout?
3. How does this election's turnout compare to the previous one?

Notice how Jordan frames question 2 carefully. They don't ask "Does income cause higher turnout?" — that would be a causal question requiring much more sophisticated analysis. Instead, they ask whether there's a pattern, which is a descriptive question they can investigate with the tools they have.

Loading and First Look

import csv

data = []
with open("county_turnout.csv", "r", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        data.append(row)

print(f"Loaded {len(data)} counties")
print(f"Columns: {list(data[0].keys())}")
Loaded 280 counties
Columns: ['county', 'region', 'registered_voters', 'votes_cast',
          'turnout_pct', 'median_income', 'college_pct', 'median_age',
          'population_density', 'prev_turnout_pct']

Jordan checks the first and last rows to confirm the file loaded cleanly:

print("First row:", data[0])
print("Last row:", data[-1])

Both look structurally correct. Good. Now for the unique values in the categorical column:

regions = sorted(set(row["region"] for row in data))
print(f"Regions ({len(regions)}): {regions}")

# Count counties per region
for region in regions:
    count = sum(1 for row in data if row["region"] == region)
    print(f"  {region}: {count} counties")
Regions (4): ['Central', 'Metro', 'North', 'South']
  Central: 72 counties
  Metro: 45 counties
  North: 85 counties
  South: 78 counties

Missing Values Check

Before computing anything, Jordan follows the Chapter 6 workflow — check data quality first:

print("Missing values by column:")
print("-" * 45)
for col in data[0].keys():
    missing = sum(1 for row in data if row[col].strip() == "")
    pct = (missing / len(data)) * 100
    status = "OK" if pct < 1 else "NOTE" if pct < 5 else "WARNING"
    print(f"  {col:22s}  {missing:3d} ({pct:4.1f}%)  [{status}]")
Missing values by column:
---------------------------------------------
  county                    0 ( 0.0%)  [OK]
  region                    0 ( 0.0%)  [OK]
  registered_voters         0 ( 0.0%)  [OK]
  votes_cast                0 ( 0.0%)  [OK]
  turnout_pct               0 ( 0.0%)  [OK]
  median_income             8 ( 2.9%)  [NOTE]
  college_pct              12 ( 4.3%)  [NOTE]
  median_age                3 ( 1.1%)  [NOTE]
  population_density        0 ( 0.0%)  [OK]
  prev_turnout_pct         15 ( 5.4%)  [WARNING]

Jordan notes: "Core election data (county, votes, turnout) is complete. Demographic variables have small amounts of missing data (1-4%), which shouldn't affect overall analysis. Previous turnout is missing for 15 counties (5.4%) — these might be newly created counties or counties that were reorganized since the last election."
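Jordan could test that hunch directly by listing the affected counties and checking them against the state's county reorganization records. A minimal sketch (the helper name is ours, and the tiny `sample` list stands in for the loaded `data`):

```python
def counties_missing(data, column):
    """Return the county names whose value for `column` is blank."""
    return [row["county"] for row in data if row[column].strip() == ""]

# Tiny invented sample standing in for the loaded `data` list:
sample = [
    {"county": "Adams County", "prev_turnout_pct": "58.2"},
    {"county": "New County", "prev_turnout_pct": ""},
]
print(counties_missing(sample, "prev_turnout_pct"))  # ['New County']
```

Putting names on the missing rows turns a vague worry into a checkable claim.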

Overall Turnout Distribution

def safe_floats(data, column):
    """Extract numeric values, skipping empty or malformed entries."""
    values = []
    for row in data:
        raw = row[column].strip()
        if raw == "":
            continue
        try:
            values.append(float(raw))
        except ValueError:
            pass
    return values

def summarize(values, label):
    """Print summary statistics."""
    sorted_v = sorted(values)
    n = len(sorted_v)
    mid = n // 2
    median = sorted_v[mid] if n % 2 else (sorted_v[mid-1] + sorted_v[mid]) / 2
    mean = sum(values) / n

    print(f"--- {label} ---")
    print(f"  Count:  {n}")
    print(f"  Min:    {min(values):.1f}")
    print(f"  Max:    {max(values):.1f}")
    print(f"  Mean:   {mean:.1f}")
    print(f"  Median: {median:.1f}")
    print(f"  Range:  {max(values) - min(values):.1f}")
    print()

turnout = safe_floats(data, "turnout_pct")
summarize(turnout, "Voter Turnout (%)")
--- Voter Turnout (%) ---
  Count:  280
  Min:    22.4
  Max:    78.3
  Mean:   54.8
  Median: 55.6
  Range:  55.9

Jordan writes: "Mean turnout is 54.8%, with a median of 55.6% — close together, suggesting a roughly symmetric distribution. But the range of 55.9 percentage points (from 22.4% to 78.3%) means there's enormous variation. Some counties are deeply engaged; others are barely participating."
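Jordan doesn't have a plotting library yet, but a crude text histogram is within reach of pure Python. Here is one way it could look (the function name, bin width, and sample values are illustrative choices, not part of Jordan's notebook):

```python
def text_histogram(values, bin_width=5, bar_char="#"):
    """Print one bar of `bar_char`s per bin; return (bin_start, count) pairs."""
    lo = int(min(values) // bin_width) * bin_width
    hi = int(max(values) // bin_width) * bin_width
    bins = []
    for start in range(lo, hi + bin_width, bin_width):
        # Each bin is half-open: start <= value < start + bin_width
        count = sum(1 for v in values if start <= v < start + bin_width)
        bins.append((start, count))
        print(f"  {start:3d}-{start + bin_width:<3d} {bar_char * count} ({count})")
    return bins

# Invented values for illustration; Jordan would call text_histogram(turnout).
text_histogram([22.4, 48.0, 51.2, 55.6, 56.1, 63.0, 78.3], bin_width=10)
```

Even an ASCII chart like this would let Jordan eyeball the rough symmetry that the close mean and median suggest.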

Turnout by Region

print("Turnout by region:")
print("-" * 50)
for region in sorted(regions):
    region_turnout = []
    for row in data:
        if row["region"] == region:
            try:
                region_turnout.append(float(row["turnout_pct"]))
            except ValueError:
                pass
    if region_turnout:
        mean_t = sum(region_turnout) / len(region_turnout)
        sorted_rt = sorted(region_turnout)
        n = len(sorted_rt)
        mid = n // 2
        median_t = sorted_rt[mid] if n % 2 else (sorted_rt[mid-1] + sorted_rt[mid]) / 2
        print(f"  {region:10s}  mean={mean_t:.1f}%  median={median_t:.1f}%  "
              f"min={min(region_turnout):.1f}%  max={max(region_turnout):.1f}%  "
              f"(n={n})")
Turnout by region:
--------------------------------------------------
  Central     mean=52.1%  median=52.8%  min=28.5%  max=71.2%  (n=72)
  Metro       mean=62.4%  median=63.1%  min=41.2%  max=78.3%  (n=45)
  North       mean=53.7%  median=54.2%  min=22.4%  max=72.8%  (n=85)
  South       mean=51.8%  median=51.5%  min=24.1%  max=74.6%  (n=78)

Metro counties have notably higher turnout (mean 62.4%) compared to the other three regions (51-54%). Jordan makes a note: "Metro areas show approximately 10 percentage points higher turnout than non-metro areas. This is consistent with known patterns in political science — urban areas tend to have more accessible polling locations, more political engagement, and different demographic compositions."

Exploring the Income-Turnout Relationship

Here's where Jordan gets ambitious — and where pure Python starts to show its limitations.

Jordan wants to know: do counties with higher median incomes have higher voter turnout? Without pandas or a plotting library, Jordan uses a clever workaround — split counties into groups by income level and compare turnout:

# Get counties with both income and turnout data
paired_data = []
for row in data:
    income_raw = row["median_income"].strip()
    turnout_raw = row["turnout_pct"].strip()
    if income_raw != "" and turnout_raw != "":
        try:
            paired_data.append({
                "county": row["county"],
                "income": float(income_raw),
                "turnout": float(turnout_raw)
            })
        except ValueError:
            pass

print(f"Counties with both income and turnout data: {len(paired_data)}")

# Split into income thirds (tertiles)
incomes = sorted([d["income"] for d in paired_data])
n = len(incomes)
low_threshold = incomes[n // 3]
high_threshold = incomes[2 * n // 3]

print(f"\nIncome thresholds:")
print(f"  Lower third: below ${low_threshold:,.0f}")
print(f"  Middle third: ${low_threshold:,.0f} - ${high_threshold:,.0f}")
print(f"  Upper third: above ${high_threshold:,.0f}")

# Compute mean turnout for each income group
for label, low, high in [("Lower third", 0, low_threshold),
                          ("Middle third", low_threshold, high_threshold),
                          ("Upper third", high_threshold, float('inf'))]:
    group_turnout = [d["turnout"] for d in paired_data
                     if low <= d["income"] < high]
    if group_turnout:
        mean_t = sum(group_turnout) / len(group_turnout)
        print(f"  {label:15s}  mean turnout = {mean_t:.1f}%  (n={len(group_turnout)})")
Counties with both income and turnout data: 272

Income thresholds:
  Lower third: below $42,800
  Middle third: $42,800 - $61,500
  Upper third: above $61,500

  Lower third      mean turnout = 48.3%  (n=90)
  Middle third     mean turnout = 55.2%  (n=91)
  Upper third      mean turnout = 61.1%  (n=91)

"Clear pattern," Jordan writes. "Counties in the upper income third have mean turnout of 61.1% — nearly 13 percentage points higher than the lower income third (48.3%). Each step up in income tertile corresponds to higher mean turnout."

Jordan immediately adds an important caveat: "But this is a correlation, not a causal claim. We can't say higher income causes higher turnout. Other factors — education level, age, urbanization — are likely correlated with both income and turnout. Disentangling these relationships would require techniques we haven't learned yet (Chapter 24)."

Comparing to the Previous Election

Jordan now turns to the third question: how does this election compare to the last one?

# Get counties with both current and previous turnout
change_data = []
for row in data:
    current = row["turnout_pct"].strip()
    previous = row["prev_turnout_pct"].strip()
    if current != "" and previous != "":
        try:
            curr_val = float(current)
            prev_val = float(previous)
            change_data.append({
                "county": row["county"],
                "region": row["region"],
                "current": curr_val,
                "previous": prev_val,
                "change": curr_val - prev_val
            })
        except ValueError:
            pass

print(f"Counties with both elections: {len(change_data)}")

# Summarize the change
changes = [d["change"] for d in change_data]
mean_change = sum(changes) / len(changes)
pos_changes = sum(1 for c in changes if c > 0)
neg_changes = sum(1 for c in changes if c < 0)
no_change = sum(1 for c in changes if c == 0)

print(f"\nOverall: mean change = {mean_change:+.1f} percentage points")
print(f"  Counties with increased turnout: {pos_changes} ({pos_changes/len(changes)*100:.0f}%)")
print(f"  Counties with decreased turnout: {neg_changes} ({neg_changes/len(changes)*100:.0f}%)")
print(f"  Counties with no change:         {no_change}")
Counties with both elections: 265

Overall: mean change = +2.3 percentage points
  Counties with increased turnout: 178 (67%)
  Counties with decreased turnout: 84 (32%)
  Counties with no change:         3

"Turnout increased in two-thirds of counties," Jordan writes, "with a mean increase of 2.3 percentage points. But let's check whether this increase is uniform across regions."

print("Change by region:")
for region in sorted(regions):
    region_changes = [d["change"] for d in change_data if d["region"] == region]
    if region_changes:
        mean_c = sum(region_changes) / len(region_changes)
        print(f"  {region:10s}  mean change = {mean_c:+.1f} pp  (n={len(region_changes)})")
Change by region:
  Central     mean change = +1.5 pp  (n=68)
  Metro       mean change = +4.8 pp  (n=42)
  North       mean change = +1.2 pp  (n=82)
  South       mean change = +2.1 pp  (n=73)

"The Metro region shows the largest increase (+4.8 percentage points), four times the increase in the North (+1.2). The 'suburban swing' story in the news might be onto something — but I'd need more granular data to confirm."

Hitting the Wall: The Limits of Pure Python

At this point, Jordan wants to do something that should be simple: create a scatter plot of income vs. turnout to see the relationship visually. But they don't have matplotlib yet.

Jordan also wants to compute a correlation coefficient — a single number that measures the strength of the linear relationship between two variables. The formula exists (they've seen it in a statistics textbook), but implementing it from scratch requires computing standard deviations, covariances, and handling edge cases. It's doable, but tedious.
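For the curious, the computation is tedious rather than deep. A sketch of the Pearson correlation coefficient in pure Python (the function name is ours) shows what Jordan would have to write and maintain by hand:

```python
import math

def pearson_r(xs, ys):
    """Pearson r: sum of cross-deviations over the product of the
    deviation norms (the 1/n factors cancel, so they are omitted)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences of 2+ values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("correlation undefined when a variable is constant")
    return cross / (norm_x * norm_y)

# Perfectly linear data gives r = 1.0 (up to float rounding):
print(pearson_r([1, 2, 3, 4], [10, 20, 30, 40]))
```

With scipy, this plus a p-value is a single function call — which is precisely Jordan's point.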

And then there's the filtering. To investigate which Metro counties had the biggest turnout increase while also having above-median income, Jordan needs to chain together multiple conditions, convert strings to numbers, and handle missing values — all in nested loops. The code is getting long, repetitive, and hard to read.
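To make the pain concrete, here is roughly what that filter looks like when written by hand (a sketch against the same rows-of-strings structure; the helper name and the tiny invented sample are ours):

```python
def metro_high_income_gainers(data, income_median):
    """Metro counties with above-median income, sorted by turnout gain."""
    results = []
    for row in data:
        if row["region"] != "Metro":
            continue
        try:
            income = float(row["median_income"])
            gain = float(row["turnout_pct"]) - float(row["prev_turnout_pct"])
        except ValueError:  # missing or malformed value: skip the county
            continue
        if income > income_median:
            results.append((row["county"], gain))
    # Biggest turnout increase first
    return sorted(results, key=lambda pair: pair[1], reverse=True)

# Tiny invented sample to show the shape of the output:
sample = [
    {"county": "A", "region": "Metro", "median_income": "70000",
     "turnout_pct": "65.0", "prev_turnout_pct": "58.0"},
    {"county": "B", "region": "Metro", "median_income": "40000",
     "turnout_pct": "60.0", "prev_turnout_pct": "59.0"},
    {"county": "C", "region": "North", "median_income": "80000",
     "turnout_pct": "55.0", "prev_turnout_pct": "54.0"},
]
print(metro_high_income_gainers(sample, income_median=52000))  # [('A', 7.0)]
```

Every new condition means another conversion, another try/except, and another chance for a subtle bug — exactly the friction pandas is designed to remove.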

Jordan writes a frustrated note in the notebook:

"I can feel this analysis wanting to go deeper. I want to see a scatter plot. I want to compute correlations. I want to filter on three conditions at once without writing a 10-line loop every time. Pure Python got me here, and I'm proud of what I found — but I'm running into walls. The code is getting longer and less readable as my questions get more interesting."

This is exactly the point. Jordan has reached the boundary of what's practical with pure Python for data analysis. The tools they need — pandas for data manipulation, matplotlib for visualization, scipy for statistics — are just around the corner in Part II. And because Jordan understands what those tools need to do (having done it manually), learning them will be dramatically faster.

Jordan's Key Findings

Jordan's final notebook section summarizes:

  1. Overall turnout averaged 54.8%, with substantial variation across counties (range: 22.4% to 78.3%).

  2. Metro counties had markedly higher turnout (mean 62.4%) compared to non-metro regions (51-54%). This ~10 percentage point gap is consistent across the dataset.

  3. Income and turnout show a positive association. Counties in the highest income third had mean turnout of 61.1%, compared to 48.3% in the lowest third. However, this is a correlation, not a causal claim.

  4. Turnout increased overall from the previous election by a mean of 2.3 percentage points, with 67% of counties showing gains. Metro counties showed the largest gains (+4.8 pp).

  5. Data quality is solid, with core election fields 100% complete and demographic fields missing under 5% of values.

Limitations: This analysis is purely descriptive. We cannot determine whether income causes higher turnout. We also lack data on some important factors (voter registration laws, polling place accessibility, same-day registration availability) that might explain both regional and income-based patterns. The 15 counties missing previous turnout data could slightly bias the comparison analysis.

Next steps: Scatter plots (need matplotlib), formal correlation analysis (need scipy or numpy), and multivariate analysis to disentangle the effects of income, education, and urbanization on turnout.

What This Case Study Illustrates

Jordan's analysis demonstrates the full Chapter 6 workflow applied to a non-health dataset:

  1. Transferable skills. The same EDA workflow — load, inspect, check quality, summarize, compare groups, document findings — works regardless of domain. Vaccination data and voter turnout data are completely different topics, but the analytical approach is identical.

  2. Honest limitations. Jordan explicitly notes what the analysis can't tell us (causation) and what data is missing (key contextual variables). This intellectual honesty is a hallmark of good data science.

  3. The productive wall. Jordan hit the limits of pure Python in a way that's educational rather than frustrating. The desire to make scatter plots, compute correlations, and filter more fluidly creates genuine motivation for learning pandas and matplotlib. That motivation — born from personal experience of the limitations — is far more powerful than simply being told "you should learn pandas because it's useful."

  4. Domain awareness. Jordan brings knowledge from sociology and political science to the analysis, noting that the income-turnout relationship is well-documented in the literature and suggesting specific alternative explanations. This is domain expertise in action — one of the three pillars of data science from Chapter 1.

  5. Notebook as communication. Jordan's notebook could be shown to a political science professor, a newsroom editor, or a fellow student and would make sense to all of them — because it's written as a narrative with context, evidence, and interpretation, not as a wall of code.

This is what Chapter 6 prepares you for: the ability to take any dataset, ask it questions, and extract meaningful answers using the Python fundamentals you've already learned. The tools will get better in Part II. But the thinking — the curiosity, the rigor, the narrative instinct — starts here.