Chapter 5 Exercises: Your First Political Dataset

DataField.Dev

Chapter 5 Exercises: Your First Political Dataset

These exercises assume you have access to the ODA Dataset and a working Python environment with pandas, matplotlib, numpy, and seaborn installed. See Section 5.1 for setup instructions. Solutions are available in code/exercise-solutions.py — attempt each exercise independently first.

Part A: Data Loading and Inspection

Exercise 5.1 — Loading and Verifying

Load all six ODA tables as shown in example-01-loading-oda-data.py. For each table, answer the following questions in a brief comment block at the top of your script:

a) How many rows and columns does each table have?

b) Which tables have missing values, and in which columns? Is the missingness likely to be missing at random, or is it systematic?

c) Which date column (if any) spans the longest time period?

d) Which state has the most records in the polls table? In the voters table? What might explain the difference?

Exercise 5.2 — dtype Inspection

When pandas reads a CSV, it infers the data type of each column. Sometimes numeric columns are read as strings (object dtype) if they contain non-numeric entries.

a) For oda_polls.csv, identify any columns that you would expect to be numeric (float or int) but are read as object dtype. Write code to diagnose this problem.

b) For any such columns, identify what non-numeric values are causing the issue (hint: use .unique() or .value_counts() after filtering to non-numeric entries).

c) Write code to clean the problematic column and convert it to a numeric dtype.

Part B: Descriptive Statistics

Exercise 5.3 — Polling Summary Statistics

Using oda_polls.csv, filtered to likely voter (LV) polls only:

a) Compute the mean, median, and standard deviation of pct_d and pct_r across all states.

b) Identify the five polls with the largest Democratic margin (pct_d − pct_r) and the five with the largest Republican margin.

c) Is there a difference in average sample size between phone polls and online polls? Compute this and provide a one-sentence interpretation.

d) How does margin of error vary by sample size? Create a scatter plot of sample_size (x-axis) vs. margin_error (y-axis). Overlay the theoretical margin of error formula: MOE ≈ 1 / √n × 100. Does the data follow the expected pattern?

Exercise 5.4 — Voter Demographics Deep Dive

Using oda_voters.csv filtered to the Garza-Whitfield state (TX_ANALOG):

a) Create a histogram of voter ages, separately for Democrat, Republican, and Independent registrants on the same chart. Which party has the oldest average voter? The youngest?

b) What percentage of voters in each racial/ethnic group voted in all three election cycles (2018, 2020, 2022)? Which group has the highest consistent voter share? Which has the lowest?

c) Compute the average support_score for voters in each income bracket. Is the relationship between income and Garza support monotonic? How do you interpret the pattern?

d) Create a box plot of persuadability_score by urban_rural classification. What does this tell you about where persuadable voters are concentrated geographically?

Exercise 5.5 — Cross-Tabulation Analysis

Using oda_voters.csv filtered to the Garza-Whitfield state:

a) Create a cross-tabulation of race_ethnicity by party_reg (row percentages). Which racial/ethnic group has the highest proportion of Independent registration?

b) Create a cross-tabulation of education by vote_history_2022 (row percentages). Is there a consistent education gradient in 2022 turnout? What does this pattern suggest about which voters are already voting vs. those who might be mobilized?

c) Cross-tabulate urban_rural by vote_history_2022 and vote_history_2018 simultaneously (separately). Has the urban-rural turnout gap changed between 2018 and 2022?

Part C: Filtering and Subsetting

Exercise 5.6 — Targeted Subsetting

Write code to identify the following subsets of the Garza-Whitfield state voter file. For each, report the count and the average support_score:

a) High-propensity Garza supporters: support_score >= 70 AND vote_history_2022 == 1

b) Persuadable Garza leaners: support_score between 52 and 70 (inclusive) AND persuadability_score >= 60

c) Dormant Garza supporters: support_score >= 65 AND vote_history_2022 == 0 AND vote_history_2020 == 0

d) True swing voters: support_score between 45 and 55 (inclusive) AND persuadability_score >= 70

Exercise 5.7 — State Comparison

Pick three states from oda_voters.csv (other than TX_ANALOG). For each state and for the national aggregate, compute:

a) Mean age of registered voters

b) Share of voters registered as Democrat, Republican, and Independent

c) 2022 turnout rate

d) Mean support_score and mean persuadability_score

Present your results in a formatted comparison table. Write two to three sentences interpreting the key differences between the states.

Part D: Visualization

Exercise 5.8 — Polling Trend for a Second State

Reproduce the polling trend chart from example-03-first-visualizations.py for a different state in the dataset. Use a 21-day rolling window instead of 14 days. In a comment block at the top of the chart, note: (a) whether the 21-day window produces more or less smooth lines than the 14-day window, and (b) why you might prefer a shorter or longer window depending on polling density.

Exercise 5.9 — Diverging Bar Chart

Create a diverging bar chart showing the average support_score by demographic group relative to the 50-point neutral score. Specifically: - Compute mean support_score by race_ethnicity and education (separately) - Plot the deviation from 50 as horizontal bars (positive = Garza-leaning, negative = Whitfield-leaning) - Use blue for positive (Garza) bars and red for negative (Whitfield) bars - Add a vertical reference line at 0

This type of chart is widely used in campaign analytics to show relative candidate strength across demographic segments.

Exercise 5.10 — Time Series of Ad Spending

Using oda_ads.csv for the Garza-Whitfield state:

a) Create a line chart showing weekly ad spending by party over the campaign period. Use the standard blue/red color scheme.

b) Add vertical reference lines at key dates: the primary date, the general election registration deadline, and Election Day (you will need to look up or define these dates).

c) Which party's spending increases more sharply in the final two weeks? What does this suggest about campaign strategy?

Part E: Missing Data

Exercise 5.11 — Missing Data Analysis

Return to oda_polls.csv and focus on the margin_error column.

a) What percentage of polls are missing margin_error?

b) Is the missingness correlated with methodology? Compute the missing rate separately for each polling methodology and report it in a table.

c) Propose three different strategies for handling the missing margin_error values (drop, fill with median, fill with formula). What are the trade-offs of each approach for a downstream analysis that weights polls by inverse margin of error?

d) Implement your preferred strategy and write a comment explaining your choice.

Part F: Challenge Exercises

Exercise 5.12 — County-Level Targeting Analysis (Challenge)

Using oda_voters.csv for the Garza-Whitfield state:

a) Group voters by county. For each county, compute: - Total registered voters - Number of Garza persuasion targets (persuadability >= 60, support_score >= 52) - Number of Garza turnout targets (vote_history_2022 == 0, vote_history_2020 == 0, support_score >= 55) - Average support score - 2022 turnout rate

b) Create a scatter plot where each point is a county: x-axis = persuasion target count, y-axis = turnout target count. Size the points by total county population. Color the points by average support score (blue-to-red scale).

c) Identify the top 5 counties by combined targeting opportunity (persuasion + turnout targets). What does this suggest about where the campaign should concentrate field resources?

Exercise 5.13 — Donations Data Analysis (Challenge)

Using oda_donations.csv:

a) Separate donations into small-dollar (< $200) and large-dollar (>= $200) categories. Compute the share of total dollars that comes from each category for each recipient party.

b) Create a stacked bar chart showing, for each party, the proportion of total dollars coming from each donation size bucket.

c) Which party has a higher share of small-dollar donations? What does this suggest about each campaign's grassroots support base?

d) Look at donor occupation. What are the top 10 occupations by total donation amount for each party? What patterns do you observe?

Exercise 5.14 — Combining Multiple Tables (Challenge)

This exercise requires joining two ODA tables.

a) Filter oda_ads.csv to the Garza-Whitfield state. Compute weekly Democratic and Republican ad spending.

b) Filter oda_polls.csv to the Garza-Whitfield state (LV polls). Compute a weekly polling average for each candidate.

c) Join the two weekly series on date. Create a two-panel plot: the top panel shows the polling trend and the bottom panel shows weekly ad spending. Use the same date x-axis for both.

d) Is there any visible relationship between ad spending spikes and polling movement? Apply the analytical frameworks from Chapter 4: what alternative explanations exist for any patterns you observe?

Reflection Questions

R1 — After working through the ODA Dataset, what surprised you most about the structure or content of political data? What was harder than you expected?

R2 — In Exercise 5.12, the county targeting analysis assigns a single "opportunity score" to each county. What assumptions is this approach making? What important information is it ignoring?

R3 — Sam Harding's realization in Section 5.12 — that likely voter models systematically underrepresent communities whose mobilization is actively being contested — applies to many political analyses. Can you think of two other situations where the filtering or weighting choices in a political analysis systematically exclude or underweight a particular population? What are the policy implications of those exclusions?