Chapter 5 Quiz: Your First Political Dataset

DataField.Dev

This quiz covers both the conceptual content of Chapter 5 and applied pandas/Python knowledge. Coding questions require you to write correct pandas syntax; you do not need to run the code unless specified.

Part A: Conceptual Questions

1. Which pandas method produces output that includes count, mean, std, min, 25th percentile, median, 75th percentile, and max for all numeric columns?

a) .info()
b) .describe()
c) .summary()
d) .stats()

2. You load a CSV file and notice that the pct_d column has dtype object instead of float64. The most likely explanation is:

a) The column contains values above 100
b) The column contains non-numeric entries such as "N/A" or "–"
c) Pandas automatically converts all percentage columns to object dtype
d) The CSV file was saved with the wrong encoding

3. When you write polls[polls['population'] == 'LV'], this operation is called:

a) Column selection
b) Boolean indexing (filtering)
c) Aggregation
d) Cross-tabulation

4. The .copy() method is used after filtering a DataFrame primarily to:

a) Make the code run faster by caching the result
b) Create an independent copy that avoids the SettingWithCopyWarning
c) Ensure the filtered DataFrame is sorted correctly
d) Remove duplicate rows from the filtered result

5. value_counts(normalize=True) returns:

a) Counts sorted from smallest to largest
b) Proportions (fractions summing to 1.0)
c) Z-scores for each category
d) Counts and proportions in separate columns

6. A voter support score of 75 (on a 0–100 scale where 100 = strong Garza support) is best interpreted as:

a) The voter will vote for Garza with 75% probability
b) The voter is 75 percentage points more likely to vote for Garza than Whitfield
c) A relative ranking indicating this voter is more likely to support Garza than a voter scored 40
d) The voter has been contacted 75 times by the Garza campaign

7. In the support score histogram (Figure 5.3 / Visualization 3), the bimodal distribution of Independent voter scores means:

a) The model has a bug and is assigning incorrect scores to Independents
b) Most Independents are genuinely undecided and cluster near the 50-point neutral mark
c) Many self-identified Independents lean reliably toward one party or the other
d) Independent voters were excluded from the model's training data

8. Values in the ODA Dataset's margin_error column are missing for some polls. The text establishes that this missingness is correlated with polling methodology. This is an example of:

a) Missing completely at random (MCAR)
b) Missing at random (MAR)
c) Missing not at random (MNAR)
d) Listwise deletion

9. A 14-day rolling average of polling percentages is useful because:

a) It corrects for house effects by averaging across pollsters
b) It smooths sampling variation while still capturing genuine trend movement
c) It weights more recent polls more heavily than older polls
d) It adjusts for the likely voter screen used by different pollsters

10. The statement "The ODA Dataset's voter file overrepresents older, habitual voters" is best classified as a concern about:

a) Sampling bias within the voter file
b) Coverage bias: systematic exclusion of a population segment from the frame
c) Nonresponse bias: differential response to the voter file survey
d) Measurement error: inaccurate recording of voter attributes

Part B: Coding Questions

For each question, write the pandas code that produces the described result. Assume the data has been loaded as described in Section 5.3.

11. Write code to display the first 10 rows of the polls DataFrame, showing only the columns date, state, pollster, pct_d, pct_r, and population.

12. Write code to filter polls to only the rows where:
- state equals "TX_ANALOG"
- population equals "LV" (likely voters)
- date is after January 1, 2025

Store the result in a variable called gw_recent.

13. Given a DataFrame called gw_voters, write code to compute the mean support_score separately for each category of race_ethnicity. Sort the result from highest to lowest mean support score.

14. Write the pandas code to create a cross-tabulation of party_reg (rows) vs. urban_rural (columns) using row percentages, rounded to one decimal place. Assume the DataFrame is gw_voters.

15. Write code to count the number of voters in gw_voters who meet ALL of the following criteria:
- persuadability_score >= 60
- support_score between 45 and 65 (inclusive)
- vote_history_2022 == 1

16. You have a DataFrame called gw_lv_polls. Write code to compute and print:
- The total number of polls
- The mean of pct_d
- The mean of pct_r
- The difference between those means (the average polling margin)

17. Write code to identify all columns in polls that have any missing values, and for each such column, print the column name and the count of missing values.

18. Write the code to create a 14-day rolling average of pct_d for a DataFrame called gw_sorted that has been sorted by date and has date set as its index. Store the result as roll_d.

Part C: Short Analysis Questions

19. Explain in two to three sentences why you should use parse_dates=["date"] when loading political datasets with date columns, rather than leaving pandas to infer the type automatically.

20. Sam Harding's observation in Section 5.12 is that likely voter screens systematically underrepresent communities whose mobilization is being actively contested. Explain in three to four sentences why this creates a substantive analytical problem — not just a technical one — for understanding the Garza-Whitfield race.

21. You observe that counties with more campaign canvassing visits also have higher Garza support scores in the voter file. A colleague concludes that canvassing is increasing support. Using concepts from Chapter 4, explain in three sentences why this observational finding does not support that conclusion, and what type of additional evidence would be needed.

Answer Key

Part A: 1-b | 2-b | 3-b | 4-b | 5-b | 6-c | 7-c | 8-c | 9-b | 10-b
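
The keyed answers to questions 2 and 5 can be checked on a few made-up rows. The sketch below is illustrative only: the CSV text and the names csv_text, toy, dtype_before, and shares are hypothetical and not part of the ODA Dataset. It uses the en dash placeholder rather than "N/A" because pandas' read_csv already treats the literal string "N/A" as a missing value by default.

```python
import io
import pandas as pd

# Made-up CSV text: the second pct_d value is an en dash placeholder.
csv_text = "pollster,pct_d,population\nA,48.0,LV\nB,–,RV\nC,51.5,LV\nD,47.0,LV\n"
toy = pd.read_csv(io.StringIO(csv_text))

# Q2: one non-numeric entry is enough to stop pandas parsing the column as float64.
dtype_before = toy["pct_d"].dtype
print(dtype_before)

# Coercing turns the placeholder into NaN and restores a numeric dtype.
toy["pct_d"] = pd.to_numeric(toy["pct_d"], errors="coerce")
print(toy["pct_d"].dtype)

# Q5: normalize=True returns shares of the total rather than raw counts.
shares = toy["population"].value_counts(normalize=True)
print(shares)
```

Coercing with errors="coerce" is one common fix; it silently converts every unparseable entry to NaN, so it is worth counting the new missing values afterwards.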

Part B — Model Answers:

Q11:

polls[["date", "state", "pollster", "pct_d", "pct_r", "population"]].head(10)

Q12:

gw_recent = polls[
    (polls["state"] == "TX_ANALOG") &
    (polls["population"] == "LV") &
    (polls["date"] > "2025-01-01")
].copy()
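
To see which rows this filter keeps, here is a minimal sketch on invented data. The frame polls_demo and its values are hypothetical; it assumes the date column is already a datetime dtype, as it would be after loading with parse_dates. Comparing against an explicit pd.Timestamp is an optional variation on the string comparison in the model answer.

```python
import pandas as pd

# Hypothetical rows standing in for the polls DataFrame; the values are made up.
polls_demo = pd.DataFrame({
    "date": pd.to_datetime(["2024-12-20", "2025-01-01", "2025-02-10", "2025-03-05"]),
    "state": ["TX_ANALOG", "TX_ANALOG", "TX_ANALOG", "OTHER"],
    "population": ["LV", "LV", "LV", "LV"],
    "pct_d": [46.0, 47.0, 48.5, 50.0],
})

cutoff = pd.Timestamp("2025-01-01")
recent_demo = polls_demo[
    (polls_demo["state"] == "TX_ANALOG")
    & (polls_demo["population"] == "LV")
    & (polls_demo["date"] > cutoff)  # strictly after, so January 1 itself is dropped
].copy()
print(recent_demo)
```

Each condition needs its own parentheses because & binds more tightly than == and > in Python.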

Q13:

(gw_voters.groupby("race_ethnicity")["support_score"]
 .mean()
 .sort_values(ascending=False)
 .round(2))

Q14:

pd.crosstab(
    gw_voters["party_reg"],
    gw_voters["urban_rural"],
    normalize="index"
).mul(100).round(1)
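
A quick sanity check for row percentages is that every row sums to 100. The sketch below runs the same call on six invented voters; voters_demo and its category labels are hypothetical stand-ins for gw_voters.

```python
import pandas as pd

# Six made-up voters: four registered D, two registered R.
voters_demo = pd.DataFrame({
    "party_reg": ["D", "D", "D", "D", "R", "R"],
    "urban_rural": ["urban", "urban", "urban", "rural", "rural", "urban"],
})

row_pct = pd.crosstab(
    voters_demo["party_reg"],
    voters_demo["urban_rural"],
    normalize="index",  # divide each cell by its row total
).mul(100).round(1)
print(row_pct)

# With row percentages, each party_reg row should add up to 100.
print(row_pct.sum(axis=1))
```

Using normalize="columns" instead would answer a different question: the party mix within each urban_rural category.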

Q15:

mask = (
    (gw_voters["persuadability_score"] >= 60) &
    (gw_voters["support_score"] >= 45) &
    (gw_voters["support_score"] <= 65) &
    (gw_voters["vote_history_2022"] == 1)
)
print(mask.sum())
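
An equivalent way to write the inclusive range is Series.between, which includes both endpoints by default. The sketch below applies it to five invented voters; voters_demo, mask_demo, and n_targets are hypothetical names used only for illustration.

```python
import pandas as pd

# Five made-up voters chosen to exercise each boundary of the criteria.
voters_demo = pd.DataFrame({
    "persuadability_score": [70, 60, 59, 80, 65],
    "support_score": [45, 65, 50, 66, 55],
    "vote_history_2022": [1, 1, 1, 1, 0],
})

mask_demo = (
    (voters_demo["persuadability_score"] >= 60)
    & voters_demo["support_score"].between(45, 65)  # both endpoints included
    & (voters_demo["vote_history_2022"] == 1)
)

# Summing a boolean Series counts the True values.
n_targets = int(mask_demo.sum())
print(n_targets)
```

Only the first two rows pass all three tests: the third fails on persuadability, the fourth on support score, and the fifth on vote history.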

Q16:

n = len(gw_lv_polls)
mean_d = gw_lv_polls["pct_d"].mean()
mean_r = gw_lv_polls["pct_r"].mean()
margin = mean_d - mean_r
print(f"N polls: {n}")
print(f"Avg pct_d: {mean_d:.1f}%")
print(f"Avg pct_r: {mean_r:.1f}%")
print(f"Avg margin (D-R): {margin:+.1f}")

Q17:

missing = polls.isnull().sum()
for col, count in missing[missing > 0].items():
    print(f"{col}: {count}")
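
The same pattern can be tried on a three-row frame to confirm that fully populated columns are skipped. The data and the names polls_demo, missing_demo, and with_gaps below are hypothetical.

```python
import numpy as np
import pandas as pd

# Made-up polls: pct_d has one gap, margin_error has two, pollster has none.
polls_demo = pd.DataFrame({
    "pollster": ["A", "B", "C"],
    "pct_d": [48.0, np.nan, 51.0],
    "margin_error": [np.nan, np.nan, 3.5],
})

missing_demo = polls_demo.isna().sum()      # missing count per column
with_gaps = missing_demo[missing_demo > 0]  # keep only columns with gaps

for col, count in with_gaps.items():
    print(f"{col}: {count}")
```

isna() and isnull() are aliases in pandas, so either spelling gives the same counts.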

Q18:

roll_d = gw_sorted["pct_d"].rolling("14D", min_periods=2).mean()
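
It helps to see how the "14D" window and min_periods=2 interact on a handful of dates. The sketch below uses four invented polls; demo and roll_demo are hypothetical names. By default, a time-based window in pandas includes the current row and excludes the row exactly 14 days earlier.

```python
import pandas as pd

# Four made-up polls indexed by date, already sorted.
demo = pd.DataFrame(
    {"pct_d": [46.0, 48.0, 50.0, 52.0]},
    index=pd.to_datetime(["2025-01-01", "2025-01-08", "2025-01-15", "2025-02-20"]),
)
demo.index.name = "date"

# Each value averages the polls in the 14 days up to and including that date;
# windows holding fewer than two polls produce NaN.
roll_demo = demo["pct_d"].rolling("14D", min_periods=2).mean()
print(roll_demo)
```

The first and last dates come out as NaN because each has only one poll in its window, while January 8 and January 15 average two polls each.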

Part C — Scoring Rubric:

Q19: Full credit for explaining that parse_dates converts the column to a datetime dtype, enabling date arithmetic (filtering by date range, computing rolling averages, resampling by week), whereas a string column does not support these operations.
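
The contrast in the Q19 rubric can be demonstrated by loading the same text twice. This is a sketch on an invented two-row CSV; csv_text, as_text, as_dates, and span are hypothetical names.

```python
import io
import pandas as pd

# Made-up CSV with an ISO-formatted date column.
csv_text = "date,pct_d\n2025-01-05,47.0\n2025-01-12,48.0\n"

# Without parse_dates the dates arrive as plain text.
as_text = pd.read_csv(io.StringIO(csv_text))
print(as_text["date"].dtype)

# With parse_dates the column becomes a datetime dtype.
as_dates = pd.read_csv(io.StringIO(csv_text), parse_dates=["date"])
print(as_dates["date"].dtype)

# Date arithmetic works only on the parsed column.
span = as_dates["date"].max() - as_dates["date"].min()
print(span)
```

Here span is a Timedelta of seven days; the same subtraction on the text column raises a TypeError.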

Q20: Full credit for explaining that the likely voter model encodes past behavior as a proxy for future behavior, but for a community that has been systematically excluded from or discouraged by the political process, low historical turnout reflects structural barriers rather than disinterest — and thus likely voter screens exclude the exact population whose political engagement is most uncertain and most contested. The analytical problem is not just underestimation of one group; it is that the analysis treats a past pattern of exclusion as a predictor of a future that the campaign is actively working to change.

Q21: Full credit for identifying: (a) canvassers were not randomly assigned and were plausibly sent to counties where support was already high (selection/targeting confound); (b) the observed association therefore conflates targeting decisions with canvassing effects; (c) a randomized experiment (randomly assigning canvassing across comparable counties or precincts) would provide clean evidence of the causal effect.