Case Study 10-2: Building a Race Rating System — The Forecaster's Perspective
Background
Election forecasters — from FiveThirtyEight to The Economist to Cook Political Report — translate pools of messy polling data into probability statements and race ratings that voters, campaigns, and media use to understand the electoral landscape. Their methods vary, but the challenge is common: how do you take a collection of polls with different methodologies, population definitions, field dates, and house effects and produce a single, credible summary of where a race stands?
This case study follows Carlos through the process of building a simplified race rating system for the Garza-Whitfield Senate contest, applying the polling evaluation skills from Chapter 10 to produce a formal probabilistic assessment.
The Inputs: The ODA Dataset
Carlos starts with the full oda_polls.csv dataset. Before any averaging, he applies his evaluation framework to characterize the polling environment.
Dataset summary (47 polls, all dates):

- 28 LV polls, 14 RV polls, 5 Adult polls
- Average sample size: 682 respondents
- Methodologies: 13 CATI, 6 Online-Probability, 18 Online-Opt-in, 8 Mixed, 2 IVR
- Pollster sponsors: 8 independent, 4 campaign-commissioned, 3 party-affiliated, 32 commercial
- Date range: 120 days before election to 3 days before election
Initial quality assessment:
| Tier | Criteria | N Polls | Avg Quality Score |
|---|---|---|---|
| High | LV or RV, CATI/Online-Prob, n≥500, independent sponsor | 11 | 87 |
| Medium | LV or RV, Mixed/CATI, n≥400, any sponsor | 21 | 68 |
| Low | Adult, IVR/Opt-in, small n, or partisan sponsor | 15 | 44 |
The bulk of publicly available polls fall in the medium tier. Only 23% are high-quality. Fifteen polls (32%) are low-quality by Carlos's framework.
Step 1: Construct the Quality-Weighted Average
Carlos uses a weighting scheme that combines three factors:
Weight = Quality Score × Recency Factor × Sample Factor
Where:

- Quality Score: 0–100 scale as described in the chapter
- Recency Factor: exp(−days_before_election / 30), decaying to 0.37 at 30 days out and 0.14 at 60 days out (more recent polls count more)
- Sample Factor: √(sample_size / 600), normalized so a 600-person poll has a sample factor of 1.0
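The scheme can be sketched in a few lines of Python. The function and field names here are illustrative, not taken from oda_polls.csv:

```python
import math

def poll_weight(quality_score, days_before_election, sample_size):
    """Combine the three factors: Weight = Quality x Recency x Sample."""
    recency = math.exp(-days_before_election / 30)  # ~0.37 at 30 days, ~0.14 at 60
    sample = math.sqrt(sample_size / 600)           # 1.0 for a 600-person poll
    return quality_score * recency * sample

def weighted_average(polls):
    """Weighted mean of each poll's 'garza' share (illustrative dict keys)."""
    weights = [poll_weight(p["quality"], p["days_out"], p["n"]) for p in polls]
    return sum(w * p["garza"] for w, p in zip(weights, polls)) / sum(weights)
```

Because the recency factor decays exponentially, a fresh high-quality poll can easily carry several times the weight of an older, smaller one.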
For LV polls only, this weighting scheme produces:
Quality-Weighted Average (LV, all polls):
Garza: 47.8%
Whitfield: 46.7%
Margin: Garza +1.1
High-Quality Only Average (LV, quality ≥ 70):
Garza: 48.1%
Whitfield: 46.4%
Margin: Garza +1.7
Final 30-day Average (LV, recency-weighted):
Garza: 47.5%
Whitfield: 47.1%
Margin: Garza +0.4
The race has tightened significantly in the final 30 days. The early-cycle average showed Garza with more comfortable leads; recent polling shows an essentially tied race.
Step 2: House Effect Adjustments
Carlos's house effects analysis identified two pollsters with statistically significant biases: Progressive Polling Inc. (+3.1 D) and Right Track Analytics (−4.3 R). A third pollster, "RedState Insights," has an estimated house effect of −2.8, but with only 3 polls the estimate falls just short of statistical significance.
Should he adjust the polls from these firms before including them in his average?
This is a genuine methodological debate. Arguments for adjustment:

- If a house effect is real and measurable, ignoring it biases the average
- Historical track records provide valid information about systematic bias
- Aggregators like FiveThirtyEight make similar adjustments
Arguments against adjustment:

- House effects estimated from small samples (3–4 polls) are imprecise
- House effects may change over time as methodologies evolve
- Adjusting polls post hoc introduces its own model risk
Carlos chooses a compromise: he applies half the estimated house effect adjustment to polls from firms with statistically significant effects. For Progressive Polling Inc. (4 polls, +3.1 D effect), he subtracts 1.55 points from their Democratic margin before including them in the average. For Right Track Analytics (5 polls, −4.3 R effect), he adds 2.15 points to their Democratic margin.
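This half-adjustment rule can be sketched as follows, using the effect estimates from the case study. The function name and sign convention (positive = leans Democratic) are assumptions for illustration:

```python
# Estimated house effects in Democratic-margin points (positive = leans D).
HOUSE_EFFECTS = {
    "Progressive Polling Inc.": +3.1,
    "Right Track Analytics": -4.3,
}
ADJUSTMENT_FACTOR = 0.5  # apply only half the estimate to limit over-correction

def adjusted_margin(pollster, dem_margin):
    """Democratic margin after the half house-effect correction."""
    effect = HOUSE_EFFECTS.get(pollster, 0.0)  # unlisted firms stay unadjusted
    return dem_margin - ADJUSTMENT_FACTOR * effect
```

A Progressive Polling Inc. result thus loses 1.55 points of Democratic margin and a Right Track Analytics result gains 2.15, matching the figures in the text.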
After adjustment, the quality-weighted LV average moves from Garza +1.1 to Garza +1.4.
Step 3: Population Definition Correction
The 14 RV polls in the dataset show Garza averaging +3.8 among registered voters. The 28 LV polls show Garza +1.1. The LV/RV gap of 2.7 points is consistent with the expected Republican advantage from LV screening in this state's midterm context.
Carlos uses the LV estimate as his primary average — it is the most relevant population for election prediction — but notes the RV average as a check on whether the LV model might be unusually strict or lenient.
Step 4: Uncertainty Quantification
A polling average is not a point prediction. It is a distribution. Carlos quantifies uncertainty from three sources:
Source 1: Sampling uncertainty. The weighted effective sample size of the LV average (after quality weighting) is approximately 3,400 respondents. At 95% confidence, this implies ±1.7% sampling uncertainty on the average.
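The ±1.7% figure is the standard margin of error at p = 0.5 applied to the weighted effective sample size. A quick check, assuming simple random sampling:

```python
import math

def margin_of_error(n_effective, p=0.5, z=1.96):
    """Approximate 95% margin of error, in percentage points, for a proportion."""
    return 100 * z * math.sqrt(p * (1 - p) / n_effective)

print(round(margin_of_error(3400), 2))  # ~1.68, i.e. about +/-1.7%
```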
Source 2: House effect uncertainty. The uncertainty in his house effect adjustments adds approximately ±0.8% to his average uncertainty.
Source 3: Trend uncertainty. The race has been tightening. If the trend continues, the final result could be 1–2 points more Republican than the current average. Carlos assigns ±1.5% uncertainty from trend extrapolation.
Combined uncertainty (assuming partial correlation): ≈ ±2.4%
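The arithmetic can be checked by combining the three sources in quadrature (the uncorrelated case); a perfectly correlated combination would instead give the simple sum. This sketch shows both bounds:

```python
import math

sources = {"sampling": 1.7, "house_effect": 0.8, "trend": 1.5}  # each in +/- pts

quadrature = math.sqrt(sum(u ** 2 for u in sources.values()))  # uncorrelated case
linear_sum = sum(sources.values())                             # fully correlated

print(f"quadrature: +/-{quadrature:.1f}")  # +/-2.4
print(f"linear sum: +/-{linear_sum:.1f}")  # +/-4.0
```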
Final estimate: Garza +1.4 ± 2.4% (approximately 95% confidence interval from −1.0 to +3.8 for Garza).
Step 5: Converting to Probability
Under a normal distribution with mean +1.4 and standard deviation ≈ 1.2 (the ±2.4% half-width divided by 1.96):
P(Garza wins) = P(margin > 0) = P(Z > −1.4/1.2) = P(Z > −1.17) ≈ 0.879
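This conversion can be reproduced with the standard normal CDF via math.erf; no external libraries are needed. A minimal sketch:

```python
import math

def win_probability(margin, sd):
    """P(margin > 0) for a normally distributed margin with given mean and SD."""
    return 0.5 * (1 + math.erf(margin / (sd * math.sqrt(2))))  # standard normal CDF

print(round(win_probability(1.4, 1.2), 2))  # 0.88
```

The same function applied to the final-30-day margin, win_probability(0.4, 1.2), gives roughly 0.63, matching the stress-test figure.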
Carlos's model gives Garza approximately an 88% probability of winning, with significant uncertainty — about 1 in 8 chance Whitfield pulls it off.
He stress-tests this against the high-quality-only average (+1.7), which gives Garza ~92% probability, and the final-30-day average (+0.4), which gives Garza only ~63% probability. The spread across these specifications is substantial.
Step 6: The Race Rating
Using a standard rating ladder:

- Safe D: >95% probability
- Likely D: 80–95%
- Lean D: 65–80%
- Toss-up: 35–65%
- Lean R: 20–35%
- Likely R: 5–20%
- Safe R: <5%
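The ladder translates directly into a small lookup function. Exactly which side of each cutoff a boundary value falls on is a convention choice left implicit above; this sketch treats the intervals as half-open:

```python
def rate_race(p_dem):
    """Map P(Democratic win) to the rating ladder (boundaries treated half-open)."""
    if p_dem > 0.95:
        return "Safe D"
    if p_dem >= 0.80:
        return "Likely D"
    if p_dem >= 0.65:
        return "Lean D"
    if p_dem >= 0.35:
        return "Toss-up"
    if p_dem >= 0.20:
        return "Lean R"
    if p_dem >= 0.05:
        return "Likely R"
    return "Safe R"

print(rate_race(0.88))  # Likely D
```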
Based on his full-cycle model (88% Garza), Carlos rates the race Likely D. The final-30-day model (63% Garza) lands just below the Lean D threshold, at the upper edge of Toss-up. The honest answer, he notes in his report, is that the race sits somewhere between Likely D and Lean D, with the direction of movement in the final 30 days a meaningful concern.
Discussion Questions
Question 1: Carlos chooses to apply "half the estimated house effect adjustment" to avoid over-correcting on imprecisely estimated effects. Describe the trade-off this represents. What are the costs of over-correcting vs. under-correcting for house effects?
Question 2: The final-30-day model produces a much more Republican-favorable estimate (Garza +0.4) than the full-cycle model (Garza +1.4). What are two explanations for this discrepancy, and how would you determine which explanation is more likely?
Question 3: Carlos identifies three distinct sources of uncertainty in his polling average: sampling uncertainty, house effect uncertainty, and trend uncertainty. Why does he not simply add these three numbers (1.7 + 0.8 + 1.5 = 4.0%) to get his total uncertainty? What does "partial correlation" mean in this context?
Question 4: A campaign manager asks Carlos whether to "believe" his 88% Garza probability. Write a 200-word explanation of what that probability does and does not mean, appropriate for a non-statistician.
Question 5: The forecaster's estimate depends critically on assumptions about: (a) quality weights assigned to different methodologies, (b) the size and direction of house effect adjustments, and (c) the model used to convert polling averages to probabilities. Any of these could be wrong. How should a responsible forecaster communicate this "model uncertainty" to consumers of their ratings?
What Actually Happened
In this illustrative race, Garza won the election by 1.8 points, within the uncertainty range of Carlos's model (though above his central estimate of +1.4). The high-quality-only average (+1.7) was the most accurate of his three estimates. Right Track Analytics's final poll, released two days before the election, showed Whitfield +3, a nearly 5-point error consistent with their historical house effect.
The race rated "Likely D" by Carlos's full-cycle model, and at the Toss-up/Lean D boundary by the 30-day model, was won by the Democratic candidate with a margin consistent with a "Likely D" classification.
Carlos's post-election note to Vivian: "The model was right for the right reasons: quality-weighted averages, house effect correction, and honest uncertainty quantification. The outlier IVR polls would have dragged a simple, unweighted average toward the Republican. This is why the weighting matters."
Methodological Reflection
The exercise of building a race rating model reveals something important about the relationship between methodology and conclusion: the inputs you trust, the adjustments you make, and the uncertainty you acknowledge determine what your model says. There is no view from nowhere in political analytics. Every averaging system is a model, and every model embeds choices.
The responsible forecaster makes those choices explicit, documents them, and reports the sensitivity of conclusions to the most consequential assumptions. A probability of 88% is not a fact — it is the output of a model applied to imperfect data. The value of that probability is entirely contingent on the quality of the inputs and the soundness of the modeling choices. Carlos understands this. It is why he builds the dashboard before reading the headlines.