
Chapter 21: Building a Simple Election Model (Python Lab)

The conference room had been commandeered. Nadia Osei had taken over the whiteboard, the laptop projector, and, to the visible irritation of the campaign's finance director, the catering setup that had been laid out for a donor call. She had moved the canapé trays to the windowsill and replaced them with printed cross-tabs and a laptop displaying a scatter plot.

"Walk us through what you built," said the campaign manager, folding her hands.

"It's a weighted poll aggregator with a fundamentals prior and Monte Carlo uncertainty simulation," Nadia said.

"In English."

"We take all the polls, weight the recent ones and the bigger ones more heavily, adjust for the fact that the economy and the incumbent's approval rating give us a baseline expectation independent of polls, and then run the whole thing 100,000 times with random variations to get a distribution of outcomes instead of a single number." She clicked to the next slide. "As of right now, Garza wins 61 percent of simulations."

The finance director, who had just retrieved her half-eaten canapé from the windowsill, looked up. "Wins 61 percent of simulations — what does that mean, exactly? Wins 61 percent of actual Senate seats?"

Nadia smiled. She had answered this question before. "It means that under our model's assumptions, if this election were run 100,000 times under conditions identical to current conditions, Garza would win about 61,000 of them. It's a probability estimate, not a seat count."

"And what are the assumptions?"

"That's the interesting part," Nadia said. "Let me show you the code."

This chapter is a laboratory session. By the end of it, you will have built a working version of the model Nadia is presenting — a simple but genuinely functional election forecasting tool that implements the core concepts from Chapters 17 through 20. The model you build will not be as sophisticated as FiveThirtyEight's or The Economist's, but it rests on the same principles, and understanding its mechanics gives you the foundation to extend it.

We will use the oda_polls.csv dataset, which contains polling data for the Garza-Whitfield Senate race along with other races in the same cycle.


21.1 Architecture Overview: What a Simple Election Model Does

Before writing a line of code, it is worth understanding the conceptual architecture of what we are building. A simple election model has three layers:

Layer 1: Poll Aggregation. Raw polls are noisy measurements of a latent quantity — the true distribution of vote preference in the electorate. No single poll is authoritative. Aggregation, by combining multiple polls with appropriate weighting, reduces noise and produces a better estimate than any individual poll.

Layer 2: Fundamentals Integration. Polls measure current opinion, but current opinion is not the only useful signal. Structural factors — the economy, the incumbent's approval rating, the partisan lean of the state — provide a baseline expectation that is less subject to short-term noise than polling. A blended model that uses fundamentals as a prior and updates with polling data has historically outperformed pure poll aggregation.

Layer 3: Uncertainty Quantification. A single point estimate ("Garza leads by 3.2 points") is less useful than a probability distribution over outcomes. Monte Carlo simulation allows us to propagate uncertainty through the model — from sampling variance in individual polls, to systematic polling error, to uncertainty in the fundamentals-polling blend — and produce a probability statement about the outcome.

💡 Why Three Layers? Each layer addresses a different source of information and a different source of uncertainty. Poll aggregation addresses measurement noise. Fundamentals integration addresses the gap between current snapshot and final outcome. Uncertainty quantification addresses the irreducible unpredictability of the social system being measured. A model that skips any of these layers is making implicit assumptions about the remaining uncertainty — usually that it is zero, which is never correct.

The three Python files in this chapter implement each layer separately before combining them into the full model:

  • example-01-poll-aggregation.py — Layer 1: weighted poll averages
  • example-02-monte-carlo-simulation.py — Layer 3: uncertainty simulation
  • example-03-election-model.py — Full integration of all three layers

Let us begin.


21.2 The ODA Dataset

The ODA polls dataset (oda_polls.csv) has the following schema:

Column        Type    Description
poll_id       string  Unique identifier
date          date    Field date (end date of polling window)
state         string  Two-letter state code
pollster      string  Polling organization name
methodology   string  phone, online, ivr, mixed
candidate_d   string  Democratic candidate name
candidate_r   string  Republican candidate name
pct_d         float   Democratic candidate percentage
pct_r         float   Republican candidate percentage
sample_size   int     Number of respondents
margin_error  float   Stated margin of error (±)
population    string  rv (registered voters), lv (likely voters), a (adults)
race_type     string  senate, governor, house, president

For the Garza-Whitfield Senate race, the relevant filters are:

  • state == "ODA" (the fictional state of Ordana)
  • race_type == "senate"

Throughout this chapter, "Garza" refers to the Democratic candidate and "Whitfield" to the Republican candidate.


21.3 Step 1 — Loading and Cleaning Polling Data

The first task is to load the dataset, filter to the race of interest, and clean the data. Real polling data is messy: dates need parsing, percentages may be stored as strings, outliers need inspection.

# example-01-poll-aggregation.py (excerpt — see full file in code/)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

# Load data
df = pd.read_csv("oda_polls.csv", parse_dates=["date"])

# Filter to Garza-Whitfield Senate race
race_df = df[(df["state"] == "ODA") & (df["race_type"] == "senate")].copy()

print(f"Total polls found: {len(race_df)}")
print(f"Date range: {race_df['date'].min().date()} to {race_df['date'].max().date()}")
print(f"\nPollsters:\n{race_df['pollster'].value_counts()}")
print(f"\nMethodology breakdown:\n{race_df['methodology'].value_counts()}")

21.3.1 Line-by-Line: What the Loading Code Does

Let's walk through each component of this block deliberately, because the habits formed in data loading determine the quality of everything downstream.

pd.read_csv("oda_polls.csv", parse_dates=["date"]) — The parse_dates parameter instructs pandas to convert the date column from strings to datetime64 values at load time. Without it, the column would remain a string ("2024-09-15" rather than Timestamp('2024-09-15')), and subsequent date arithmetic would either raise a cryptic error or silently compare text instead of time. Always specify date columns at load time rather than converting afterward.

.copy() — When you filter a DataFrame with boolean indexing, pandas may return a view of the original data or a new copy, and it does not guarantee which. If you later modify the result, pandas may emit the ambiguous SettingWithCopyWarning. The explicit .copy() call creates an independent copy of the filtered data, removing the ambiguity. Developing the habit of calling .copy() after filtering is small overhead that prevents confusing bugs.

df["state"] == "ODA" and df["race_type"] == "senate" — Each comparison produces a Boolean Series (a column of True/False values). The & operator combines them element-wise: a row is included only when both conditions are True. Note the parentheses around each condition — they are required because Python's operator precedence would otherwise interpret df["state"] == "ODA" & df["race_type"] as df["state"] == ("ODA" & df["race_type"]), which is not what we want.

race_df['pollster'].value_counts() — This diagnostic output is intentional. Before any analysis, you want to know which pollsters contributed data and how many polls each produced. A dataset where one pollster is responsible for 70 percent of the polls raises different issues than a dataset evenly distributed across many firms.

A well-designed data cleaning step checks for several specific issues:

Date validity: Polls conducted after the election date are data entry errors. Polls with impossible dates (a poll "conducted" in the future relative to the dataset generation date) should be flagged.

Percentage validity: pct_d and pct_r should be non-negative and, when combined with third-party percentages and undecided, should sum to approximately 100. Values outside 0–100 are errors.

Sample size: Polls with sample sizes below 200 should be treated with extreme caution; polls below 100 are almost certainly uninterpretable. Flag them but do not automatically remove them — a small poll from an unusual subpopulation might contain useful information.

Duplicate polls: Some datasets contain the same poll listed twice (once per candidate, or once per publication outlet). Check for duplicates by poll_id or by the combination of pollster, date, and sample_size.
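The duplicate check described above can be sketched as follows. The tiny DataFrame here is invented for illustration; the column names follow the ODA schema.

```python
import pandas as pd

# Hypothetical polls table with one duplicated entry (p2 appears twice).
polls = pd.DataFrame({
    "poll_id": ["p1", "p2", "p2"],
    "pollster": ["Acme", "Beta", "Beta"],
    "date": ["2024-10-01", "2024-10-05", "2024-10-05"],
    "sample_size": [800, 650, 650],
})

# Exact duplicates by poll_id (keep=False marks every copy, not just repeats)
dupes_by_id = polls[polls.duplicated(subset=["poll_id"], keep=False)]

# Fallback check: duplicates by the pollster + date + sample_size combination
dupes_by_combo = polls.duplicated(subset=["pollster", "date", "sample_size"]).sum()

# Drop duplicates, keeping the first occurrence of each poll_id
deduped = polls.drop_duplicates(subset=["poll_id"], keep="first")
print(len(deduped))  # 2
```

Inspect `dupes_by_id` before dropping anything: two rows that merely look similar (same pollster, same day, different samples) may be legitimate tracking polls rather than duplicates.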

# Data cleaning
# Parse date safely
race_df["date"] = pd.to_datetime(race_df["date"], errors="coerce")

# Drop rows with missing critical values
race_df = race_df.dropna(subset=["date", "pct_d", "pct_r", "sample_size"])

# Calculate the two-party margin (positive = Garza/D lead)
race_df["margin_d"] = race_df["pct_d"] - race_df["pct_r"]

# Flag low-quality polls
race_df["low_quality"] = (race_df["sample_size"] < 300) | \
                          (race_df["population"] == "a")  # adults, not voters

# Flag likely voter polls (highest quality for final forecast)
race_df["is_lv"] = race_df["population"] == "lv"

print("\nCleaned dataset shape:", race_df.shape)
print("Low quality polls:", race_df["low_quality"].sum())
print("Likely voter polls:", race_df["is_lv"].sum())

21.3.2 Line-by-Line: The Cleaning Block

pd.to_datetime(race_df["date"], errors="coerce") — The errors="coerce" argument converts any value that cannot be parsed as a date to NaT (Not a Time — pandas' equivalent of NaN for datetime columns) rather than raising an error. This is safer than the default errors="raise" when processing datasets of unknown quality. After this call, scan for NaT values: race_df["date"].isna().sum() tells you how many dates were unparseable.

race_df.dropna(subset=["date", "pct_d", "pct_r", "sample_size"]) — We drop rows with missing values only in the columns we cannot impute. A poll without a date cannot be recency-weighted; a poll without candidate percentages cannot contribute to the margin estimate; a poll without a sample size cannot be sample-weighted. Columns like margin_error may have missings that we can handle (by computing a theoretical MOE from sample size), so we do not include them in the drop condition.
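The theoretical-MOE imputation mentioned above can be sketched like this (made-up numbers): for a 50/50 proportion at 95 percent confidence, MOE = 1.96 × √(0.25/n) × 100 = 98/√n.

```python
import numpy as np
import pandas as pd

# Hypothetical polls: one stated MOE is missing and gets the theoretical value.
polls = pd.DataFrame({"sample_size": [400, 2500],
                      "margin_error": [np.nan, 2.8]})

# Theoretical 95% MOE for a proportion near 0.5: 98 / sqrt(n)
theoretical_moe = 98.0 / np.sqrt(polls["sample_size"])
polls["margin_error"] = polls["margin_error"].fillna(theoretical_moe)
print(polls["margin_error"].tolist())  # [4.9, 2.8]
```

Stated MOEs that are much smaller than the theoretical value deserve scrutiny; a pollster cannot beat sampling mathematics without modeling assumptions.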

race_df["margin_d"] = race_df["pct_d"] - race_df["pct_r"] — This creates the primary analytic variable: the signed two-party margin. Positive values favor Garza (Democrat); negative values favor Whitfield (Republican). Working with the margin rather than raw percentages simplifies downstream averaging: the weighted average of margins is equivalent to the difference between weighted averages of the raw percentages, and the margin is the natural scale for forecasting electoral outcomes.
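The equivalence claimed above — the weighted average of margins equals the difference of the weighted averages — is easy to verify numerically with made-up polls:

```python
import numpy as np

# Three hypothetical polls and weights summing to 1
pct_d = np.array([48.0, 50.0, 46.0])
pct_r = np.array([45.0, 44.0, 47.0])
w = np.array([0.5, 0.3, 0.2])

lhs = np.sum(w * (pct_d - pct_r))            # weighted average of margins
rhs = np.sum(w * pct_d) - np.sum(w * pct_r)  # difference of weighted averages
print(lhs, np.isclose(lhs, rhs))  # 3.1 True
```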

⚠️ Do Not Drop Low-Quality Polls Automatically. It is tempting to immediately exclude polls with small samples or non-likely-voter screens. Resist this temptation in the cleaning step. Instead, flag them and account for them in the weighting step. Automatic exclusion can introduce selection bias if certain pollsters systematically use non-standard methodologies. Document and justify every exclusion decision.


21.4 Step 2 — Calculating Weighted Polling Averages

A simple unweighted average treats all polls equally regardless of sample size, methodology, or timing. Weighted averaging assigns greater influence to polls that should be more accurate. Two primary weights are used in most aggregators:

Recency weight: More recent polls are more informative about current conditions. A poll from eight weeks before the election may reflect voter preferences that have since shifted. The standard approach is to apply an exponential decay function: a poll from $d$ days ago receives a weight of $w_{\text{recency}} = e^{-\lambda d}$, where $\lambda$ is the decay rate.

Sample size weight: Larger samples have smaller sampling variance. A poll of 2,000 respondents contains more information than a poll of 500. The contribution of a poll to variance reduction scales with $n$ (or sometimes $\sqrt{n}$, depending on the weighting scheme).

These weights are typically combined multiplicatively and normalized so that the sum of all weights equals 1.

# Weighted poll average calculation

ELECTION_DATE = datetime(2024, 11, 5)  # Target election date for Garza-Whitfield
DECAY_RATE = 0.05   # Controls how fast old polls lose weight
                    # Higher = faster decay (more recency-focused)

def calculate_weights(poll_df, election_date, decay_rate=0.05,
                      lv_bonus=1.5, quality_penalty=0.5):
    """
    Calculate composite weights for each poll.

    Parameters
    ----------
    poll_df      : DataFrame with polling data
    election_date: datetime — the forecast target date
    decay_rate   : float — lambda in exponential decay function
    lv_bonus     : float — multiplier for likely voter polls
    quality_penalty: float — multiplier for low-quality polls

    Returns
    -------
    DataFrame with 'weight' column added, normalized to sum to 1.
    """
    df = poll_df.copy()

    # Days before election
    df["days_to_election"] = (election_date - df["date"]).dt.days
    df["days_to_election"] = df["days_to_election"].clip(lower=0)

    # Recency weight: exponential decay
    df["w_recency"] = np.exp(-decay_rate * df["days_to_election"])

    # Sample size weight: square root of sample size (diminishing returns)
    df["w_sample"] = np.sqrt(df["sample_size"])

    # Population quality bonus
    df["w_quality"] = np.where(df["is_lv"], lv_bonus,
                       np.where(df["low_quality"], quality_penalty, 1.0))

    # Composite weight (multiplicative)
    df["weight_raw"] = df["w_recency"] * df["w_sample"] * df["w_quality"]

    # Normalize weights to sum to 1
    df["weight"] = df["weight_raw"] / df["weight_raw"].sum()

    return df

# Apply weights
weighted_df = calculate_weights(race_df, ELECTION_DATE)

# Compute weighted average margin
polling_average = (weighted_df["margin_d"] * weighted_df["weight"]).sum()
print(f"\nWeighted polling average (Garza margin): {polling_average:.2f} points")

21.4.1 Line-by-Line: The Weighting Function

(election_date - df["date"]).dt.days — This subtracts each poll date from the election date, producing a Series of timedelta values; the .dt.days accessor extracts the number of days as an integer. A poll fielded on October 20 for a November 5 election produces (2024-11-05 - 2024-10-20).days = 16. The .clip(lower=0) call ensures that polls dated after the election (data errors) receive a days value of 0 rather than a negative number, which would produce a recency weight greater than 1.

np.exp(-decay_rate * df["days_to_election"]) — With decay_rate = 0.05, a poll from 14 days ago has weight e^(-0.05 × 14) = e^(-0.7) ≈ 0.497 — about half the weight of a poll from today. A poll from 60 days ago has weight e^(-3.0) ≈ 0.050 — about one-twentieth the weight. The decay rate is a tunable parameter: higher values weight recent polls more aggressively; lower values give older polls more influence. The appropriate rate depends on how quickly opinion moves in the specific race context.
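A convenient way to reason about the decay rate is its half-life — the age at which a poll retains half the weight of a brand-new poll, ln(2)/λ. A quick sketch:

```python
import numpy as np

# Half-life of the exponential decay: age at which weight falls to 0.5
decay_rate = 0.05
half_life = np.log(2) / decay_rate
print(round(half_life, 1))  # 13.9 days

# Weight retained at selected poll ages (days)
for d in (7, 14, 30, 60):
    print(d, round(np.exp(-decay_rate * d), 3))
```

With λ = 0.05 the half-life is about two weeks; tuning λ is equivalent to choosing how many weeks of polling history meaningfully shape the average.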

np.where(df["is_lv"], lv_bonus, np.where(df["low_quality"], quality_penalty, 1.0)) — This nested np.where implements a three-way conditional: if the poll is a likely voter poll, apply lv_bonus = 1.5; else if it is low-quality, apply quality_penalty = 0.5; otherwise apply a neutral multiplier of 1.0. This is a concise, vectorized alternative to writing a for loop with if/elif/else.

df["weight"] = df["weight_raw"] / df["weight_raw"].sum() — Normalization ensures all weights sum to 1, so the weighted average is a genuine weighted mean rather than a weighted sum. This makes the blending arithmetic with the fundamentals prior straightforward.

The effective number of polls in a weighted average is related to the Herfindahl index of weight concentration: $N_{\text{eff}} = 1 / \sum_i w_i^2$. A weighted average where one poll gets 80 percent of the weight has an effective sample size closer to 1 than to the number of polls included. Print this diagnostic to assess whether the weighting scheme is reasonable.

# Effective number of polls
n_eff_polls = 1.0 / (weighted_df["weight"] ** 2).sum()
print(f"Number of polls: {len(weighted_df)}")
print(f"Effective number of polls (N_eff): {n_eff_polls:.1f}")

# Effective sample size
effective_n = (weighted_df["weight"] * weighted_df["sample_size"]).sum()
# MOE on the two-party margin: roughly twice the proportion MOE,
# 2 x 1.96 x sqrt(0.25/n) x 100 = 196/sqrt(n)
naive_moe = 1.96 / np.sqrt(effective_n) * 100
print(f"Effective sample size: {effective_n:.0f}")
print(f"Approximate margin of error (±): {naive_moe:.2f} points")

# Display top polls by weight
print("\nTop 5 polls by weight:")
top_polls = weighted_df.nlargest(5, "weight")[
    ["date", "pollster", "margin_d", "sample_size", "weight"]
].copy()
top_polls["date"] = top_polls["date"].dt.strftime("%Y-%m-%d")
top_polls["weight"] = (top_polls["weight"] * 100).round(1)
print(top_polls.to_string(index=False))

📊 Real-World Application: FiveThirtyEight's Pollster Ratings. Major aggregators apply a third weighting dimension: pollster quality, based on historical accuracy. Pollsters with a long track record of accurate predictions receive higher weights; newer or historically less accurate firms receive lower weights. Building a pollster rating system requires historical data on past accuracy — something not available in the ODA dataset — but it is an important extension for any operational forecasting system. The AAPOR report after each election cycle provides raw data for constructing such ratings.
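A pollster-quality weight could slot into calculate_weights as a third multiplicative dimension. The sketch below uses invented ratings purely for illustration; a real system would derive them from historical pollster accuracy.

```python
import pandas as pd

# Hypothetical pollster ratings (these firms and numbers are made up)
POLLSTER_RATINGS = {"Acme Research": 1.2, "Beta Polling": 1.0, "Gamma IVR": 0.7}

def pollster_weight(pollster, ratings, default=0.9):
    """Multiplier for a pollster; unknown firms get a mildly skeptical default."""
    return ratings.get(pollster, default)

df = pd.DataFrame({"pollster": ["Acme Research", "Gamma IVR", "Unknown Firm"]})
df["w_pollster"] = df["pollster"].map(
    lambda p: pollster_weight(p, POLLSTER_RATINGS)
)
print(df["w_pollster"].tolist())  # [1.2, 0.7, 0.9]
```

The default for unknown firms is itself a modeling decision: too generous and fly-by-night pollsters can move your average; too harsh and you discard genuinely new information.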


21.5 Step 3 — Adding Fundamental Inputs

Polling aggregation alone has a known limitation: it is entirely dependent on the quality of the polls themselves. If the polls are systematically biased, the aggregate will be too. A second source of signal — structural fundamentals — provides partial insurance against a polling miss.

For a Senate race, relevant fundamentals include:

State partisan lean: The historical tendency of the state to vote above or below the national partisan baseline. A state that has voted 3 points more Republican than the nation in the last three election cycles has a partisan lean of R+3. This is available from sources like CPVI (Cook Partisan Voting Index).

Presidential approval: The incumbent president's approval rating is correlated with down-ballot performance, particularly in midterm elections. A presidential approval of 42 percent in an environment where Garza is the president's co-partisan should reduce our estimate of Garza's support relative to a neutral environment.

Economic indicator: The state unemployment rate, or a national economic index, provides additional signal about the structural environment.

The fundamentals model produces a "prior" — a baseline estimate before looking at polls. The final estimate blends the fundamentals prior with the polling average, weighting each by a blending parameter that reflects our confidence in each source.

# Fundamentals model for Garza-Whitfield

# Historical parameters for Ordana
STATE_LEAN_R = 1.8          # Ordana leans Republican by 1.8 points historically
INCUMBENT_PARTY = "R"       # Republican president in this cycle
PRES_APPROVAL = 44.5        # Presidential approval, national average
APPROVAL_EFFECT_PER_PT = 0.3 # Expected margin change per 1-pt approval shift
                               # from neutral (50%)

# Unemployment effect
STATE_UNEMP = 4.2           # State unemployment rate
NATIONAL_UNEMP = 3.8        # National unemployment rate
UNEMP_EFFECT_PER_PT = -0.8  # 1-pt higher unemployment hurts incumbent party by 0.8 pts

def calculate_fundamentals_prior(state_lean_r, incumbent_party,
                                  pres_approval, approval_effect_per_pt,
                                  state_unemp, national_unemp, unemp_effect_per_pt):
    """
    Calculate a fundamentals-based prior for the Democratic candidate's margin.

    Positive output = Democratic advantage; negative = Republican advantage.
    """
    # Baseline from state lean
    # State lean is defined as R-advantage relative to national, so negate for D margin
    baseline = -state_lean_r

    # Presidential approval effect
    # If D is incumbent's party, high approval helps D; low approval hurts
    approval_deviation = pres_approval - 50.0  # deviation from neutral
    if incumbent_party == "D":
        approval_adj = approval_deviation * approval_effect_per_pt
    else:
        approval_adj = -approval_deviation * approval_effect_per_pt

    # Economic effect
    # Higher unemployment relative to national average hurts the incumbent
    unemp_deviation = state_unemp - national_unemp
    if incumbent_party == "D":
        unemp_adj = unemp_deviation * unemp_effect_per_pt
    else:
        unemp_adj = -unemp_deviation * unemp_effect_per_pt

    fundamentals_prior = baseline + approval_adj + unemp_adj
    return fundamentals_prior

fundamentals_prior = calculate_fundamentals_prior(
    STATE_LEAN_R, INCUMBENT_PARTY,
    PRES_APPROVAL, APPROVAL_EFFECT_PER_PT,
    STATE_UNEMP, NATIONAL_UNEMP, UNEMP_EFFECT_PER_PT
)

print(f"State lean baseline (D margin): {-STATE_LEAN_R:.1f}")
print(f"Presidential approval adjustment: "
      f"{-(PRES_APPROVAL - 50) * APPROVAL_EFFECT_PER_PT:.1f}")
print(f"Unemployment adjustment: "
      f"{-(STATE_UNEMP - NATIONAL_UNEMP) * UNEMP_EFFECT_PER_PT:.1f}")
print(f"\nFundamentals prior (D margin): {fundamentals_prior:.2f} points")

🔴 Critical Thinking: Which Fundamentals? The choice of which fundamentals to include in a model is itself a theoretical claim about which structural factors drive electoral outcomes. Including presidential approval assumes that Senate races are referenda on the president. Including unemployment assumes that economic conditions drive partisan preference. Both assumptions are supported by substantial empirical evidence — but both are approximations. A model that includes too many fundamentals overfits to past elections; a model that includes too few leaves predictable variation unexplained. In practice, most simple Senate race models use 2–4 fundamentals rather than exhaustive indicator lists.


21.6 Step 4 — Generating a Point Estimate

The blended estimate combines the polling average and the fundamentals prior. The blending weight — how much to trust polls vs. fundamentals — is a model hyperparameter. It is typically calibrated empirically against past elections, or set using principled priors about how much polls vs. fundamentals have historically predicted final outcomes.

# Blending polls and fundamentals

# POLL_WEIGHT of 0.75 means: 75% polling average, 25% fundamentals
# This is a reasonable setting early in the cycle; closer to election,
# polling weight should increase toward 1.0
POLL_WEIGHT = 0.75

def blend_estimate(polling_avg, fundamentals_prior, poll_weight):
    """
    Blend polling average with fundamentals prior.

    Parameters
    ----------
    polling_avg      : float — weighted poll average (D margin)
    fundamentals_prior: float — fundamentals-based prior (D margin)
    poll_weight      : float in [0, 1] — weight given to polling

    Returns
    -------
    float — blended point estimate
    """
    fund_weight = 1.0 - poll_weight
    return poll_weight * polling_avg + fund_weight * fundamentals_prior

point_estimate = blend_estimate(polling_average, fundamentals_prior, POLL_WEIGHT)

print(f"\nPolling average (D margin):   {polling_average:+.2f}")
print(f"Fundamentals prior (D margin): {fundamentals_prior:+.2f}")
print(f"Blended point estimate:        {point_estimate:+.2f}")
print(f"\nPoll weight: {POLL_WEIGHT:.0%}, Fundamentals weight: {1-POLL_WEIGHT:.0%}")
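The comment above notes that the poll weight should rise toward 1.0 as the election approaches. One way to operationalize that is a smooth schedule; the functional form and constants below are assumptions for illustration, not a calibrated rule.

```python
import numpy as np

def poll_weight_schedule(days_to_election, floor=0.5, scale=60.0):
    """Poll weight rising from `floor` far out toward 1.0 on Election Day.

    `scale` (days) controls how quickly fundamentals lose influence;
    both parameters are hypothetical and would need empirical calibration.
    """
    return 1.0 - (1.0 - floor) * (1.0 - np.exp(-days_to_election / scale))

for d in (180, 90, 30, 0):
    print(d, round(poll_weight_schedule(d), 2))
```

On Election Day the schedule returns exactly 1.0 (pure polling); far out it approaches the floor, never discarding polls entirely.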

When Nadia presented these numbers to the campaign, the point estimate showed Garza with a small but meaningful lead. The blended estimate — which weighted the historical Republican lean of Ordana through the fundamentals prior — was more conservative than the raw polling average, which had been running several points higher.

"So we're not as far ahead as the raw polls suggest," the campaign manager said.

"Correct. The fundamentals say Ordana leans Republican. The polls say we've overcome that lean, but fundamentals are skeptical. The truth is probably somewhere between the two."

Jake Rourke, the Whitfield campaign manager, would have seen similar numbers. His internal model, which Nadia knew existed because Jake was too methodical not to have built one, almost certainly used a higher fundamentals weight — placing more confidence in the historical Republican lean of the state and less in polls that he, following the pattern documented in Chapter 20, may have had reason to doubt.


21.7 Step 5 — Monte Carlo Simulation

A point estimate tells you the center of the distribution. Monte Carlo simulation tells you the shape of the distribution — the range of outcomes consistent with your assumptions about uncertainty.

The uncertainty in our forecast comes from several sources:

  1. Sampling variance in individual polls — captured by sample size and MOE.
  2. Fundamental model uncertainty — our fundamentals estimates are themselves uncertain.
  3. Systematic polling error — the possibility that all polls are biased in the same direction by some unknown amount (the mechanism from Chapter 20).
  4. Late movement — the possibility that voter preferences will shift between now and Election Day.

A Monte Carlo simulation draws random values from each of these uncertainty distributions, computes a simulated outcome for each draw, and assembles the distribution of thousands of simulated outcomes.

# Monte Carlo simulation for election outcome uncertainty

import numpy as np

N_SIMULATIONS = 100_000
np.random.seed(42)  # for reproducibility

# Uncertainty parameters
POLL_AVG_SD = 1.5          # Standard deviation of polling average uncertainty
                            # (reflects sampling variance of weighted average)
FUNDAMENTALS_SD = 2.5       # Standard deviation of fundamentals model uncertainty
SYSTEMATIC_ERROR_SD = 2.0   # Standard deviation of potential systematic polling error
                            # (correlated error analogous to Chapter 20 discussion)
LATE_MOVEMENT_SD = 1.0      # Standard deviation of potential late movement

def run_monte_carlo(point_estimate, poll_avg_sd, fundamentals_sd,
                    systematic_error_sd, late_movement_sd,
                    n_simulations=100_000):
    """
    Simulate election outcomes using Monte Carlo sampling.

    For each simulation:
      1. Draw a polling noise term (uncertainty around the weighted average)
      2. Draw a fundamentals noise term (uncertainty in the prior)
      3. Draw a systematic error term (correlated polling bias)
      4. Draw a late movement term (pre-election shift)
      5. Compute a simulated margin and classify as D-win or R-win

    Returns
    -------
    array of simulated D margins (shape: [n_simulations])
    """
    # Independent noise components
    poll_noise      = np.random.normal(0, poll_avg_sd,      n_simulations)
    fund_noise      = np.random.normal(0, fundamentals_sd,  n_simulations)
    systematic_err  = np.random.normal(0, systematic_error_sd, n_simulations)
    late_movement   = np.random.normal(0, late_movement_sd, n_simulations)

    # Simulated margin = point estimate + all noise terms
    simulated_margins = (point_estimate
                         + poll_noise
                         + fund_noise
                         + systematic_err
                         + late_movement)

    return simulated_margins

simulated_margins = run_monte_carlo(
    point_estimate,
    POLL_AVG_SD, FUNDAMENTALS_SD,
    SYSTEMATIC_ERROR_SD, LATE_MOVEMENT_SD,
    N_SIMULATIONS
)

# Win probability
d_win_prob = (simulated_margins > 0).mean()
r_win_prob = 1 - d_win_prob

print(f"\nMonte Carlo Results ({N_SIMULATIONS:,} simulations)")
print(f"{'='*40}")
print(f"Point estimate (D margin): {point_estimate:+.2f}")
print(f"Simulated median:          {np.median(simulated_margins):+.2f}")
print(f"Garza win probability:     {d_win_prob:.1%}")
print(f"Whitfield win probability: {r_win_prob:.1%}")
print(f"\nPercentile Distribution:")
for pct in [5, 25, 50, 75, 95]:
    print(f"  {pct:3d}th percentile: {np.percentile(simulated_margins, pct):+.2f}")

21.7.1 Line-by-Line: The Simulation Function

np.random.seed(42) — Setting the random seed makes the simulation reproducible: every time you run the code with the same seed, you get the same sequence of random draws. This is essential for debugging and for sharing results with colleagues who need to verify your numbers. In production, you might want to run multiple seeds and verify that results are stable, but during development, a fixed seed eliminates a source of variation that would otherwise make debugging difficult.

np.random.normal(0, poll_avg_sd, n_simulations) — This generates n_simulations = 100,000 draws from a Normal distribution centered at 0 with standard deviation poll_avg_sd = 1.5. Each draw represents the polling error in one simulated election. The mean of 0 reflects our assumption that the polling average is unbiased on average; the standard deviation of 1.5 represents the typical magnitude of polling average error.

Why Normal distributions? The Central Limit Theorem justifies using Normal distributions for sampling error: the average of many independent polls is approximately normally distributed regardless of the distribution of individual poll errors. For late movement and fundamentals uncertainty, normality is a modeling assumption of convenience rather than a derived result — but it is reasonable for small deviations from the current estimate.

(simulated_margins > 0).mean() — This is a vectorized computation that converts the array of simulated margins to a Boolean array (True where Garza wins, False where Whitfield wins) and takes the mean. Since True = 1 and False = 0, the mean of a Boolean array is the proportion of True values — which is the win probability. This is the most common pattern for computing probabilities from Monte Carlo results and is worth memorizing.

💡 Why Add Systematic Error as a Separate Term? Most sources of uncertainty in our model are independent: poll sampling variance is independent of fundamentals uncertainty. But systematic polling error (the kind documented in Chapter 20) is not independent across simulations — it represents a correlated shift that affects all polls in the same direction simultaneously. By adding it as a single draw per simulation that applies to all polls uniformly, we correctly capture the scenario where all polls are wrong in the same direction — the catastrophic case that deterministic models understate.
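Because the four noise terms are drawn independently, their variances add, so the spread of the simulated margins has a closed form you can check against the simulation — a useful sanity test when modifying the uncertainty parameters. A minimal sketch, using the SDs defined above:

```python
import numpy as np

# Independent Normal terms: variances add, so total SD is the root sum of squares
rng = np.random.default_rng(0)
n = 200_000
sds = [1.5, 2.5, 2.0, 1.0]  # poll avg, fundamentals, systematic, late movement
total = sum(rng.normal(0, sd, n) for sd in sds)

expected_sd = np.sqrt(sum(sd**2 for sd in sds))
print(round(expected_sd, 2))  # 3.67
print(round(total.std(), 2))  # empirical SD, close to 3.67
```

If the empirical SD of simulated_margins drifts far from this value, a noise term is being applied twice or not at all.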


21.8 Step 6 — Visualizing the Probability Distribution

The probability distribution of simulated outcomes is the primary product of a well-designed forecasting model. Communicating it clearly to a non-technical audience is often the hardest part of the job.

# Visualization

import matplotlib.pyplot as plt
import matplotlib.patches as mpatches

fig, axes = plt.subplots(1, 2, figsize=(14, 6))
fig.suptitle("Garza-Whitfield Senate Race Forecast", fontsize=14, fontweight="bold")

# --- Panel 1: Distribution of simulated margins ---
ax1 = axes[0]

# Histogram with color split at 0
bins = np.linspace(-15, 15, 61)
d_margins = simulated_margins[simulated_margins > 0]
r_margins = simulated_margins[simulated_margins <= 0]

ax1.hist(d_margins, bins=bins, color="#2166AC", alpha=0.75, label="Garza wins")
ax1.hist(r_margins, bins=bins, color="#D6604D", alpha=0.75, label="Whitfield wins")
ax1.axvline(0, color="black", linewidth=1.5, linestyle="--")
ax1.axvline(point_estimate, color="navy", linewidth=2, linestyle="-",
            label=f"Point estimate: {point_estimate:+.1f}")

# 80% credible interval
ci_lo, ci_hi = np.percentile(simulated_margins, [10, 90])
ax1.axvspan(ci_lo, ci_hi, alpha=0.15, color="gray",
            label=f"80% interval: [{ci_lo:+.1f}, {ci_hi:+.1f}]")

ax1.set_xlabel("Simulated Garza Margin (D - R percentage points)", fontsize=11)
ax1.set_ylabel("Number of Simulations", fontsize=11)
ax1.set_title("Distribution of Simulated Outcomes", fontsize=12)
ax1.legend(fontsize=9)
ax1.text(0.97, 0.95, f"Garza: {d_win_prob:.0%}\nWhitfield: {r_win_prob:.0%}",
         transform=ax1.transAxes, ha="right", va="top",
         fontsize=12, fontweight="bold",
         bbox=dict(boxstyle="round", facecolor="white", alpha=0.8))

# --- Panel 2: Poll timeline with weighted average ---
ax2 = axes[1]

# Sort polls by date
poll_plot = weighted_df.sort_values("date")
days_before = (ELECTION_DATE - poll_plot["date"]).dt.days

ax2.scatter(days_before, poll_plot["margin_d"],
            s=poll_plot["sample_size"] / 15,
            alpha=0.6, color="#2166AC", zorder=3,
            label="Individual polls (size = sample size)")

# Rolling weighted average: at each checkpoint d days before the election,
# average the polls fielded in the 14 days preceding that checkpoint.
# Polls fielded closer to the election than the checkpoint are excluded,
# so the line shows what the average looked like at that point in time.
rolling_avg_line = []
poll_age = (ELECTION_DATE - weighted_df["date"]).dt.days
for d in [120, 90, 60, 30, 14, 7, 3]:
    if d <= days_before.max():
        recent = weighted_df[(poll_age >= d) & (poll_age <= d + 14)]
        if len(recent) >= 3:
            w = recent["weight_raw"] / recent["weight_raw"].sum()
            avg = (recent["margin_d"] * w).sum()
            rolling_avg_line.append((d, avg))

if rolling_avg_line:
    roll_x, roll_y = zip(*rolling_avg_line)
    ax2.plot(roll_x, roll_y, "b-o", linewidth=2,
             markersize=5, label="Weighted moving average", zorder=4)

ax2.axhline(0, color="black", linewidth=1, linestyle="--")
ax2.axhline(point_estimate, color="navy", linewidth=2, linestyle=":",
            label=f"Final blended estimate: {point_estimate:+.1f}")

ax2.invert_xaxis()  # Time flows left to right: right edge = Election Day
ax2.set_xlabel("Days Before Election", fontsize=11)
ax2.set_ylabel("Garza Margin (D - R percentage points)", fontsize=11)
ax2.set_title("Poll Timeline — Garza-Whitfield Senate Race", fontsize=12)
ax2.legend(fontsize=9)

plt.tight_layout()
plt.savefig("garza_whitfield_forecast.png", dpi=150, bbox_inches="tight")
plt.show()
print("\nFigure saved: garza_whitfield_forecast.png")

When Nadia showed this visualization to the campaign, the finance director's question came up again: "Why does the distribution go all the way out to minus ten? Whitfield winning by ten?"

"Yes," Nadia said. "We're 61 percent confident Garza wins. That means we're 39 percent unconfident. The tail on the left represents outcomes where all the polls are wrong by a lot — the scenario from 2016. We should not pretend that can't happen."

Jake Rourke, she imagined, would be showing his team the mirror image of this figure: the axis flipped so that Whitfield's margin ran positive, his candidate's wins shaded in the right-hand tail. Both campaigns were living in the other side's tail. The 61-39 split meant there was a race. That was the point.


21.9 Model Assumptions and Sensitivity Analysis

No model is assumption-free. Understanding which assumptions drive the results is as important as running the model itself. A sensitivity analysis systematically varies key parameters to see how much the output changes.

# Sensitivity analysis

print("SENSITIVITY ANALYSIS")
print("="*60)
print(f"{'Parameter':<35} {'Win Prob':>10} {'Delta':>10}")
print("-"*60)

def quick_sim(pe, poll_sd=POLL_AVG_SD, fund_sd=FUNDAMENTALS_SD,
              sys_sd=SYSTEMATIC_ERROR_SD, late_sd=LATE_MOVEMENT_SD,
              n=50_000):
    np.random.seed(99)
    m = (pe
         + np.random.normal(0, poll_sd, n)
         + np.random.normal(0, fund_sd, n)
         + np.random.normal(0, sys_sd, n)
         + np.random.normal(0, late_sd, n))
    return (m > 0).mean()

# Recompute the baseline with quick_sim itself (same seed, same n),
# so the base-case row's delta is exactly zero
base_prob = quick_sim(point_estimate)

scenarios = [
    # (label, point_estimate, poll_sd, fund_sd, sys_sd, late_sd)
    ("Base case",                  point_estimate, 1.5, 2.5, 2.0, 1.0),
    ("Higher systematic error",    point_estimate, 1.5, 2.5, 3.5, 1.0),
    ("Lower systematic error",     point_estimate, 1.5, 2.5, 1.0, 1.0),
    ("Polls +2 (late D surge)",    point_estimate+2, 1.5, 2.5, 2.0, 1.0),
    ("Polls -2 (late R surge)",    point_estimate-2, 1.5, 2.5, 2.0, 1.0),
    ("Higher fundamentals weight", blend_estimate(polling_average,
                                   fundamentals_prior, 0.5),
                                   1.5, 2.5, 2.0, 1.0),
    ("No fundamentals (polls only)", polling_average, 1.5, 2.5, 2.0, 1.0),
]

for label, pe, psd, fsd, ssd, lsd in scenarios:
    prob = quick_sim(pe, psd, fsd, ssd, lsd)
    delta = prob - base_prob
    print(f"{label:<35} {prob:>9.1%} {delta:>+9.1%}")

📊 Reading a Sensitivity Table. The most important rows are those where a plausible change in one parameter produces a large change in the win probability. If moving the systematic error parameter from 2.0 to 3.5 shifts the win probability by 8 points, the model's output is sensitive to this assumption. That tells you where to invest effort in validation: can you bound the systematic error assumption empirically from historical data? The rows that barely move win probability are the ones you can be less worried about.

⚠️ The Danger of False Precision. After running these numbers, it is tempting to report the win probability to two decimal places: "Garza has a 61.23% chance of winning." This is false precision. The model's uncertainty about its own assumptions — the choice of decay rate, the systematic error SD, the fundamentals blending weight — is much larger than the Monte Carlo variance (which can be reduced to near zero by increasing simulations). The appropriate precision is the nearest 5 or 10 percentage points unless you have very strong empirical grounds for believing your model is accurate at finer resolution.


21.10 What Can Go Wrong: Model Failure Modes

Building the model is the easier part. Knowing where it can fail is the harder and more important part.

Failure Mode 1: Overfitting to the polling data. If the polling average is badly biased (as in 2016 or 2020), a high polling weight will carry that bias into the final estimate. The fundamentals component provides partial protection, but only partial — fundamentals models have their own failure modes.

Failure Mode 2: Inappropriate fundamentals. The relationship between presidential approval and Senate race outcomes is real but imperfect. In cycles where a high-profile issue (abortion, immigration, economic shocks) dominates, the standard fundamentals relationship may break down. Including economic approval when voters are voting on social issues introduces noise, not signal.

Failure Mode 3: Ignoring candidate-specific factors. A generic model treats the Democratic candidate as interchangeable with any other Democratic candidate. In reality, candidate quality, fundraising, scandal, and biography produce meaningful deviations from the generic partisan expectation. These are difficult to model systematically and are often best incorporated as qualitative adjustments with explicit uncertainty rather than precise coefficients.

Failure Mode 4: Incorrect uncertainty parameters. The systematic error SD of 2.0 was not derived from empirical analysis of historical polling error in Ordana specifically; it was set based on general historical averages. If Ordana has experienced larger or smaller polling errors historically, this parameter should be updated. Using generic parameters when specific data is available is a form of information inefficiency.

Failure Mode 5: Treating the model as certain. The most common and consequential failure mode is not methodological but behavioral: the model is built, a number is produced, and that number is treated as ground truth rather than an estimate with substantial uncertainty. Nadia understood this. She had spent the previous hour managing the campaign's tendency to read "61%" as "we've already won."

🔴 Critical Thinking: Who Is the Model For? The model Nadia built serves a specific purpose: resource allocation decisions. The campaign needs to know whether Garza's race requires additional advertising spending, whether it is safe to redirect volunteers elsewhere, whether it is competitive enough to attract attention from the national party. A probability of 61% says: this race is competitive enough to warrant continued investment, but safe enough that marginal resources might be better deployed elsewhere. This is a different use case than a public forecast. Understanding the decision the model is supposed to inform shapes how it should be built, communicated, and updated.


21.11 The Complete Model: Example-03 Integration

The third code file (example-03-election-model.py) integrates all components into a single callable function and adds a Senate seat probability matrix for multiple simultaneous races.

# Abbreviated from example-03-election-model.py
# Full code in code/example-03-election-model.py

def full_election_model(polls_df, race_filters,
                         fundamentals_params,
                         weighting_params,
                         monte_carlo_params,
                         n_simulations=100_000):
    """
    Full election model pipeline.

    Parameters
    ----------
    polls_df         : DataFrame — full ODA dataset
    race_filters     : dict — {'state': 'ODA', 'race_type': 'senate'}
    fundamentals_params: dict — state_lean_r, incumbent_party, etc.
    weighting_params : dict — decay_rate, lv_bonus, poll_weight, election_date
    monte_carlo_params: dict — poll_avg_sd, fund_sd, systematic_error_sd, etc.
    n_simulations    : int

    Returns
    -------
    dict with keys:
        'point_estimate' — blended margin
        'polling_average' — raw weighted poll average
        'fundamentals_prior' — fundamentals estimate
        'win_probability_d' — P(D wins)
        'simulated_margins' — array of N simulated margins
        'percentiles' — dict of {5, 25, 50, 75, 95} percentiles
        'effective_n_polls' — effective number of polls
    """
    # Step 1: Filter and clean
    race = polls_df[
        (polls_df["state"] == race_filters["state"]) &
        (polls_df["race_type"] == race_filters["race_type"])
    ].copy()
    race = race.dropna(subset=["date", "pct_d", "pct_r", "sample_size"])
    race["margin_d"] = race["pct_d"] - race["pct_r"]
    race["is_lv"] = race["population"] == "lv"
    race["low_quality"] = race["sample_size"] < 300

    # Step 2: Weighted average
    w_race = calculate_weights(
        race,
        election_date=weighting_params["election_date"],
        decay_rate=weighting_params.get("decay_rate", 0.05),
        lv_bonus=weighting_params.get("lv_bonus", 1.5),
        quality_penalty=weighting_params.get("quality_penalty", 0.5)
    )
    polling_avg = (w_race["margin_d"] * w_race["weight"]).sum()
    n_eff = 1.0 / (w_race["weight"] ** 2).sum()

    # Step 3: Fundamentals
    f = fundamentals_params
    fund_prior = calculate_fundamentals_prior(
        f["state_lean_r"], f["incumbent_party"],
        f["pres_approval"], f["approval_effect_per_pt"],
        f["state_unemp"], f["national_unemp"], f["unemp_effect_per_pt"]
    )

    # Step 4: Blend
    pe = blend_estimate(polling_avg, fund_prior, weighting_params.get("poll_weight", 0.75))

    # Step 5: Monte Carlo
    mc = monte_carlo_params
    sims = run_monte_carlo(
        pe,
        mc.get("poll_avg_sd", 1.5),
        mc.get("fundamentals_sd", 2.5),
        mc.get("systematic_error_sd", 2.0),
        mc.get("late_movement_sd", 1.0),
        n_simulations
    )

    return {
        "point_estimate": pe,
        "polling_average": polling_avg,
        "fundamentals_prior": fund_prior,
        "win_probability_d": (sims > 0).mean(),
        "simulated_margins": sims,
        "percentiles": {p: np.percentile(sims, p) for p in [5, 25, 50, 75, 95]},
        "effective_n_polls": n_eff
    }

Running the full model on the Garza-Whitfield race produces a complete output package that can be updated as new polls arrive, as economic indicators change, or as Election Day approaches and the blend shifts further toward the polls and away from fundamentals.


21.12 Model Validation and Backtesting

Building a model is only the first step. Before trusting its outputs for real decisions, you need to know whether it would have been accurate historically. Backtesting — running the model on past elections where you already know the outcome — is the primary tool for this validation.

21.12.1 The Backtesting Procedure

The procedure for backtesting an election model is straightforward in concept, though it requires careful discipline in execution:

  1. Select historical races where you have polling data equivalent to what your model would have consumed. For Ordana Senate races, this means past cycles with the same pollster coverage.
  2. Apply the model as if you were forecasting blind — use only data that was available before the election, not anything known afterward. This means not including polls fielded after Election Day, not incorporating post-election revised economic figures, and not using any knowledge of who actually won.
  3. Record the model's probability for each race and whether the favored candidate actually won.
  4. Evaluate calibration: A well-calibrated model should be correct approximately 70 percent of the time in races it calls at 70 percent confidence. If your model calls eight races at 70 percent confidence and wins all eight, it is probably underconfident — your uncertainty parameters may be too wide. If it wins only four of those eight, it is overconfident.

# Backtesting framework (conceptual — requires historical data)

def backtest_model(historical_races, model_params):
    """
    Backtest the election model against historical outcomes.

    Parameters
    ----------
    historical_races : list of dicts, each with keys:
        'state', 'race_type', 'election_date', 'actual_winner',
        'actual_margin_d', plus model parameter inputs

    Returns
    -------
    DataFrame with columns: race, predicted_prob_d, actual_winner,
                            correct, brier_score
    """
    results = []
    for race in historical_races:
        # Run model (conceptual — use full_election_model in practice)
        pred_prob_d = race.get("predicted_prob_d", 0.5)  # from model
        actual_d_won = race["actual_winner"] == "D"
        correct = (pred_prob_d > 0.5) == actual_d_won

        # Brier score: (probability - outcome)^2, lower is better
        brier = (pred_prob_d - int(actual_d_won)) ** 2

        results.append({
            "race": f"{race['state']} {race['race_type']} {race['year']}",
            "predicted_prob_d": pred_prob_d,
            "actual_winner": race["actual_winner"],
            "actual_d_won": actual_d_won,
            "correct": correct,
            "brier_score": brier
        })

    results_df = pd.DataFrame(results)

    print(f"\nBacktest Results ({len(results_df)} races)")
    print(f"Accuracy: {results_df['correct'].mean():.1%}")
    print(f"Mean Brier Score: {results_df['brier_score'].mean():.4f}")
    print("  (Brier = 0.00 is perfect; 0.25 is no-skill baseline)")

    # Calibration bins: actual D win rate within each predicted-probability bin
    results_df["prob_bin"] = pd.cut(results_df["predicted_prob_d"],
                                    bins=[0, 0.3, 0.5, 0.7, 0.85, 1.0])
    cal = (results_df.groupby("prob_bin", observed=False)["actual_d_won"]
           .agg(["mean", "count"]))
    print("\nCalibration by predicted probability (mean = actual D win rate):")
    print(cal.to_string())

    return results_df

21.12.2 What Good Calibration Looks Like

The Brier score — the mean squared error between the predicted probability and the actual binary outcome — is the standard metric for evaluating probabilistic forecasts. A model that assigns 50 percent probability to every race achieves a Brier score of 0.25 (the "no-skill baseline"). A perfect model that assigns 100 percent to all winners achieves a score of 0. Real election models typically achieve Brier scores in the 0.10–0.18 range over a large sample of competitive races.
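These reference points are easy to confirm directly (a standalone sketch with made-up outcomes):

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return ((probs - outcomes) ** 2).mean()

outcomes = np.array([1, 0, 1, 1, 0])          # hypothetical: 1 = D won

print(brier_score([0.5] * 5, outcomes))       # no-skill baseline: 0.25
print(brier_score(outcomes, outcomes))        # perfect forecast: 0.0
print(brier_score([0.8, 0.3, 0.7, 0.9, 0.2], outcomes))  # ~0.054
```

A useful intuition: the score punishes confident wrongness quadratically, so a 90 percent call that loses costs 0.81 on that race, while a 60 percent call that loses costs only 0.16.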

Calibration plots — where the x-axis is the predicted probability and the y-axis is the actual win rate in each probability bucket — are the most intuitive way to diagnose miscalibration. A perfectly calibrated model produces a calibration plot where all points fall on the 45-degree diagonal: races called at 60 percent really do produce wins 60 percent of the time. If the calibration plot curves above the diagonal, the model is underconfident; if it curves below, it is overconfident.
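A calibration plot takes only a few lines to sketch. The data here is synthetic and generated to be well calibrated by construction, so the points should hug the diagonal:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)

# Hypothetical backtest: predicted D probabilities and simulated outcomes,
# with outcomes drawn so the forecasts are well calibrated by construction
pred = rng.uniform(0.05, 0.95, size=500)
won = rng.random(500) < pred

bins = np.linspace(0, 1, 6)                 # five probability buckets
idx = np.digitize(pred, bins) - 1
bin_pred = [pred[idx == i].mean() for i in range(5)]
bin_rate = [won[idx == i].mean() for i in range(5)]

fig, ax = plt.subplots(figsize=(5, 5))
ax.plot([0, 1], [0, 1], "k--", label="Perfect calibration")
ax.plot(bin_pred, bin_rate, "bo-", label="Model")
ax.set_xlabel("Predicted probability")
ax.set_ylabel("Actual win rate")
ax.legend()
plt.show()
```

With real backtest data, substitute your model's predicted probabilities for pred and the actual race outcomes for won; sustained departures from the diagonal across adjacent buckets are the signal of miscalibration, not single-bucket noise.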

📊 Real-World Application: FiveThirtyEight Calibration Reports. FiveThirtyEight has published calibration analyses of their presidential and Senate forecasts going back to 2008. Their models show good calibration (close to the diagonal) across probability levels, though with some systematic overconfidence at extreme probabilities (98%+ predictions that don't win quite as often as they should). These reports are worth reading not just for the calibration data but for the methodology discussion about how they evaluate their own work.

21.12.3 Overfitting and the Validation Set Problem

Backtesting against all available historical races creates a risk of overfitting: tuning your model parameters until they perform well on the historical data, at the cost of performance on new data. The correct procedure is to divide historical data into a training set (used to calibrate parameters) and a held-out test set (used only for final validation), then report performance only on the test set.

For election models, the sample of historical races is small enough that this discipline is rarely enforced rigorously — there simply are not enough historical Senate races in any given state to support a meaningful train/test split. The appropriate response is humility about backtesting results: a model that looks well-calibrated on fifteen historical races may be overfit to those fifteen specific elections in ways that won't generalize.


21.13 Common Errors and How to Debug Them

Even well-designed code encounters errors. The following are the most common problems students encounter when building this model, along with diagnosis strategies.

21.13.1 The Shape Mismatch Error

Symptom: ValueError: operands could not be broadcast together with shapes (N,) and (M,)

Cause: You are trying to multiply or add two arrays with different lengths. This commonly occurs when you filter the DataFrame after calculating weights, causing the weight array and the margin array to have different lengths.

Fix: Always work within the DataFrame. Use df["weight"] * df["margin_d"] rather than assigning intermediate numpy arrays and then operating on them. The DataFrame keeps the indices aligned; raw numpy arrays do not.
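A toy example shows the difference. The column names match the chapter's; the data is made up:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"margin_d": [2.0, -1.0, 3.0, 0.5],
                   "weight":   [0.1,  0.2, 0.3, 0.4]})

# Filtering keeps the ORIGINAL index labels (here 0, 2, 3)
recent = df[df["margin_d"] > 0]

# Inside the DataFrame, multiplication aligns on those labels — safe
weighted = recent["margin_d"] * recent["weight"]
print(weighted.sum())

# With raw numpy arrays, a stale full-length array raises instead
full_weights = df["weight"].to_numpy()             # length 4
try:
    recent["margin_d"].to_numpy() * full_weights   # length 3 vs 4
except ValueError as e:
    print("Broadcast error:", e)
```

The pandas version silently does the right thing because alignment is by index label; the numpy version fails loudly, which is at least better than the third possibility, a silent wrong answer when the lengths happen to match.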

21.13.2 The All-Zero Weights Problem

Symptom: ZeroDivisionError or NaN in the weighted average output.

Cause: All polls are very old (large days_to_election) and the exponential decay has reduced all weights to values so small they round to zero in floating point arithmetic. This can happen if you accidentally set the election date to a date before all polls were fielded, or if the decay rate is very high.

Diagnosis: Print weighted_df["weight_raw"].describe(). If the minimum, maximum, and mean are all effectively zero, the decay is the problem.

Fix: Check the election date, check the decay rate, and print weighted_df["days_to_election"].describe() to confirm the time scale is what you expect.

21.13.3 The TypeError on Date Arithmetic

Symptom: TypeError: unsupported operand type(s) for -: 'datetime.datetime' and 'str'

Cause: The date column was not properly converted to datetime. It remains a string object.

Fix: Verify that pd.to_datetime() was called on the column. Print race_df["date"].dtype — it should show datetime64[ns], not object. If it shows object, the conversion failed, possibly because errors="coerce" silently produced NaT values for all rows. Inspect the raw values with race_df["date"].head(10) before conversion.

21.13.4 Monte Carlo Variance Between Runs

Symptom: The win probability changes by several percentage points each time you run the code.

Cause: The random seed was not set, so each run produces different random draws.

Fix: Add np.random.seed(42) (or any fixed integer) before the Monte Carlo call to make runs reproducible. If the probability varies significantly across different seed choices, the number of simulations may be too small. With 100,000 simulations, the Monte Carlo standard error on a 60 percent win probability is approximately sqrt(0.6 * 0.4 / 100000) = 0.0015, or 0.15 percentage points, which is negligible for decision-making purposes.
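That standard-error figure can be checked both from the formula and empirically (a standalone sketch):

```python
import numpy as np

def mc_standard_error(p, n):
    """Standard error of a Monte Carlo estimate of a probability p."""
    return np.sqrt(p * (1 - p) / n)

# The formula's prediction at p = 0.6, n = 100,000 (about 0.0015)
print(f"Formula: {mc_standard_error(0.60, 100_000):.4f}")

# Empirical check: spread of the win-probability estimate across
# 200 repeated simulation runs with fresh random draws each time
rng = np.random.default_rng(0)
estimates = [(rng.normal(1.0, 4.0, 100_000) > 0).mean() for _ in range(200)]
print(f"Empirical SD across runs: {np.std(estimates):.4f}")
```

The two numbers should agree to within sampling noise, which is the practical meaning of "the Monte Carlo variance can be reduced to near zero by increasing simulations."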

⚠️ A Debugging Habit Worth Developing: Before running any substantial computation, print the shape and first few rows of every DataFrame you are about to process. print(df.shape, df.dtypes, df.head(3)) takes two seconds and can save an hour of debugging. The most common cause of silent errors in data pipelines is a filter or merge that unexpectedly produces zero rows — and a zero-row DataFrame will propagate through subsequent calculations without error, producing NaN or zero outputs that look superficially reasonable.


21.14 Nadia's Presentation: What the Model Cannot Tell You

After walking through the model, Nadia put her laser pointer down.

"The model says 61-39. I want to be clear about what that means and what it doesn't mean. It means that given all the information we have right now — the polls, the fundamentals, our uncertainty about both — we believe Garza is a modest favorite. It does not mean we've won. It does not mean the polls are right. It does not mean we should move any resources off this race."

The campaign manager nodded slowly. "What would move the number significantly?"

"A new poll from a high-quality firm that shows a different picture. A significant change in the president's approval rating. A major candidate event — a debate stumble, a scandal, a major endorsement. The model updates. We update it every time new information comes in."

21.14.1 When Not to Trust Your Model

The finance director asked a different question: "Is there a scenario where we just shouldn't trust this at all?"

Nadia had thought about this. "Several."

When the polling environment is known to be unreliable. If there is reason to believe that the methodology producing the polls in the dataset is systematically problematic — the shift to online polling in a state with poor internet penetration, an unusual cycle where non-response bias is likely elevated — the polling average may be garbage in, garbage out. The model has no way to know this from inside the data.

When candidate-specific factors dominate. A Senate race where one candidate has been credibly accused of a serious crime, or where an unexpected endorsement from a popular former official has shifted the dynamic, may be poorly served by a generic fundamentals model that knows nothing about those events. The model assumes the world is normal. When the world is not normal, the model's priors may be badly miscalibrated.

When the polling drought is severe. If only one or two polls are available and both are from the same pollster, the weighted average is not really an aggregate — it is a single pollster's opinion with recency adjustment. The uncertainty parameters appropriate for a genuine aggregate are too narrow for this situation.

When it is very early. Fundamentals models gain accuracy as the election approaches, because early fundamentals estimates for factors like the unemployment rate are subject to revision. A forecast produced 18 months before an election is substantially more uncertain than the model's default parameters capture, because the world can change so much between now and Election Day.

When something unprecedented is happening. COVID-19 in 2020 invalidated many assumptions that had been stable across previous election cycles. A fundamentals model calibrated to pre-pandemic election environments made implicit assumptions about the relationship between economic conditions and vote share that may not have held in an environment where unemployment temporarily spiked to 15 percent due to a public health emergency rather than an economic recession. Know the historical context of your model's calibration.

Jake Rourke, presenting a nearly identical model to the Whitfield campaign at roughly the same time, was probably saying the same thing: the model is a decision-support tool, not an oracle. The 39 percent on his side of the ledger was not defeat; it was an invitation to close the gap.

That is the appropriate relationship to a probabilistic model: it is not a sentence; it is a score in a game still being played.


21.15 Practical Extensions for Students

The model built in this chapter is functional but simple. The following extensions are ordered from straightforward to challenging. Each builds on the infrastructure already in place.

Extension 1: Dynamic poll weight (easy). Currently, POLL_WEIGHT = 0.75 is a constant. In practice, the appropriate weight given to polls versus fundamentals increases as Election Day approaches — polls that are close to Election Day are more predictive of the final outcome than polls taken six months prior. Implement a function dynamic_poll_weight(days_to_election) that returns a value that increases from approximately 0.5 (early cycle) to 0.95 (final week). Apply this in the blending step.
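One possible shape for this function is an exponential approach toward the late-cycle weight. The endpoints match the exercise; the half-life of 60 days is an illustrative choice, not a calibrated value:

```python
import numpy as np

def dynamic_poll_weight(days_to_election, w_early=0.5, w_late=0.95,
                        halflife=60.0):
    """
    Poll weight rising from ~w_early far out to w_late on Election Day.
    `halflife` controls how fast the gap to w_late closes (illustrative).
    """
    decay = np.exp(-np.log(2) * np.maximum(days_to_election, 0) / halflife)
    return w_early + (w_late - w_early) * decay

for d in [365, 180, 90, 30, 7, 0]:
    print(f"{d:>4} days out: poll weight = {dynamic_poll_weight(d):.2f}")
```

Any monotone increasing curve with the right endpoints would satisfy the exercise; the exponential form is convenient because a single parameter controls the transition speed.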

Extension 2: Pollster house effects (moderate). Some pollsters systematically favor one party over another — a phenomenon called "house effects." Estimate house effects by computing, for each pollster, the average deviation of their polls from the weighted average of all other polls on the same date window. Subtract the estimated house effect from each poll before averaging.

# Conceptual framework for house effects
def estimate_house_effects(df, min_polls_per_pollster=3):
    """
    Estimate house effects as each pollster's average deviation from
    the average of all OTHER pollsters' polls (leave-one-out).
    Returns dict of {pollster: house_effect_estimate}.
    """
    effects = {}
    for pollster, group in df.groupby("pollster"):
        if len(group) >= min_polls_per_pollster:
            others = df[df["pollster"] != pollster]
            effects[pollster] = (group["margin_d"].mean()
                                 - others["margin_d"].mean())
    return effects

Extension 3: Multi-race simulation (challenging). Extend the Monte Carlo simulation to run multiple Senate races simultaneously, with correlated systematic errors. The key insight is that if all polls are biased toward Democrats in a given cycle (as in 2020), that bias applies to all states, not just one. Draw one systematic error term per simulation and apply it to all races, then compute the distribution of seat outcomes.
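A minimal version of the correlated multi-race simulation, with hypothetical point estimates and one shared systematic draw broadcast across all races:

```python
import numpy as np

rng = np.random.default_rng(1)
n_sims = 100_000

# Hypothetical point estimates (D margin) for four Senate races
margins = np.array([3.0, 1.0, -0.5, -2.0])

# Independent per-race noise plus ONE shared systematic draw per simulation;
# the (n_sims, 1) shape broadcasts the shared draw across all four races
race_noise = rng.normal(0, 3.0, size=(n_sims, 4))
systematic = rng.normal(0, 2.0, size=(n_sims, 1))
sims = margins + race_noise + systematic

seats = (sims > 0).sum(axis=1)          # D seats won in each simulation
for k in range(5):
    print(f"P(D wins {k} of 4) = {(seats == k).mean():.1%}")
```

Compared to fully independent errors, the shared draw pushes probability mass toward the extremes of the seat distribution: sweeps and wipeouts both become more likely, which is exactly the 2020-style scenario the extension is meant to capture.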

Extension 4: Bayesian updating (advanced). Replace the linear blending of polls and fundamentals with a proper Bayesian update: treat the fundamentals as a prior distribution over the true margin, and update it with polling data as likelihoods. This requires specifying the prior as a Normal distribution with mean = fundamentals_prior and standard deviation = FUNDAMENTALS_SD, and the likelihood as a Normal distribution with mean = the polling average and standard deviation = the polling standard error. The posterior is then also Normal with analytically derivable parameters.
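The analytically derivable posterior mentioned above is the standard Normal-Normal conjugate update. A sketch, with example values standing in for the chapter's parameters:

```python
import numpy as np

def normal_normal_update(prior_mean, prior_sd, data_mean, data_sd):
    """
    Conjugate update: Normal prior x Normal likelihood -> Normal posterior.
    Precisions (1/variance) add; the posterior mean is a precision-weighted
    average of the prior mean and the data mean.
    """
    prior_prec = 1.0 / prior_sd**2
    data_prec = 1.0 / data_sd**2
    post_var = 1.0 / (prior_prec + data_prec)
    post_mean = post_var * (prior_prec * prior_mean + data_prec * data_mean)
    return post_mean, np.sqrt(post_var)

# Example values standing in for fundamentals_prior, FUNDAMENTALS_SD,
# the polling average, and its standard error
mean, sd = normal_normal_update(1.0, 2.5, 2.8, 1.5)
print(f"Posterior: {mean:+.2f} +/- {sd:.2f}")
```

Note the connection to the chapter's linear blend: the implied poll weight here is the polling precision divided by the total precision, roughly 0.74 with these example values, close to the fixed POLL_WEIGHT of 0.75. The Bayesian version earns its keep by also producing a posterior standard deviation, not just a point estimate.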

Best Practice for Extension Work: When modifying the model, always test against the base case first. After adding your extension, verify that when the extension is turned off (e.g., house effects set to zero, poll weight fixed at base value), the model reproduces the original results. This confirms your extension is implemented correctly and has not inadvertently broken existing functionality.
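The check can be encoded as a tiny regression test. This sketch uses a hypothetical apply_house_effects helper standing in for whichever extension is being tested:

```python
import pandas as pd

def apply_house_effects(df, effects):
    """Hypothetical extension hook: subtract each pollster's house effect."""
    out = df.copy()
    out["margin_d"] = out["margin_d"] - out["pollster"].map(effects).fillna(0.0)
    return out

polls = pd.DataFrame({"pollster": ["A", "B", "A"],
                      "margin_d": [2.0, 4.0, 3.0]})

# Extension "off": all-zero house effects must reproduce the input exactly
zeroed = apply_house_effects(polls, {"A": 0.0, "B": 0.0})
assert zeroed["margin_d"].equals(polls["margin_d"]), "base case changed!"
print("Base case reproduced with the extension disabled.")
```

The same pattern applies to the dynamic poll weight (pin it at 0.75 and compare) or any other extension: neutralize the new behavior, rerun, and assert bit-for-bit agreement with the original output before trusting the extension's numbers.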


21.16 Conclusion: The Model as a Thinking Tool

The model built in this chapter is simple. A professional forecasting operation would extend it in several directions: a pollster quality rating system, more sophisticated fundamentals regressions calibrated to historical Senate races, Bayesian updating rather than simple linear blending, and simulation of correlated errors across multiple simultaneous races. These extensions are not cosmetic — they produce meaningfully better predictions.

But the simple model captures the essential structure: aggregate information, acknowledge uncertainty, simulate the full range of plausible outcomes, and communicate the result honestly as a probability rather than a false certainty. The three-layer architecture — aggregation, fundamentals, uncertainty — is not specific to this dataset or this race. It applies to any electoral forecasting problem where polling data is available.

The skills developed in this lab — weighted averaging, Monte Carlo simulation, sensitivity analysis, visualization — are transferable to every domain in political analytics. They are also habits of mind: a well-built model requires clarity about sources of information, explicit treatment of uncertainty, and honest recognition of what the model cannot know. That is, ultimately, what distinguishes rigorous political analysis from sophisticated-sounding guesswork.


Summary

  • An election model has three layers: poll aggregation, fundamentals integration, and uncertainty quantification.
  • Weighted poll averages assign greater influence to recent polls (via exponential decay), larger polls (via square root of sample size), and higher-quality polls (via population-type bonuses).
  • Fundamentals priors combine state partisan lean, presidential approval, and economic indicators to produce a baseline independent of polling.
  • Blending polls with fundamentals using a poll weight parameter (typically 0.5–0.85, increasing as Election Day approaches) reduces dependence on any single information source.
  • Monte Carlo simulation draws random values from each uncertainty distribution to produce a full probability distribution over outcomes rather than a single point estimate.
  • Systematic error — the possibility that all polls are biased in the same direction — must be modeled as a correlated term (one draw per simulation) rather than independent noise.
  • Sensitivity analysis reveals which parameters most strongly drive the output, identifying where empirical validation is most important.
  • Backtesting against historical races is the primary tool for validating model calibration, with Brier scores and calibration plots as the key diagnostic outputs.
  • Common errors include shape mismatches from misaligned arrays, all-zero weights from excessive decay, and Monte Carlo variance from missing random seeds.
  • The model is a decision-support tool. Its output is a probability distribution to be communicated honestly — and its limitations in abnormal cycles must be understood before any single number is reported as fact.

Key Terms

Exponential decay weighting — A recency weighting scheme in which a poll's weight decreases exponentially with its age: $w = e^{-\lambda d}$.

Effective sample size — The sample size equivalent of a weighted estimate, accounting for the statistical efficiency loss from weighting.

Fundamentals prior — A baseline estimate of electoral outcome based on structural factors (state lean, economic indicators, presidential approval) independent of current polling.

Monte Carlo simulation — A computational technique that generates a probability distribution of outcomes by repeatedly drawing random values from specified uncertainty distributions.

Systematic error — Directional bias that affects all measurements in the same direction; modeled as a correlated term in simulation.

Sensitivity analysis — Systematic variation of model parameters to identify which assumptions most strongly influence the output.
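
For a normally distributed margin, the simulation's win probability has a closed form, $\Phi(\mu/\sigma)$, which makes a one-parameter sensitivity sweep cheap. The margin and error values below are illustrative:

```python
import math

def win_prob(margin, error_sd):
    """P(margin + N(0, sd) > 0): closed-form analogue of the simulation."""
    return 0.5 * (1 + math.erf(margin / (error_sd * math.sqrt(2))))

# Sensitivity: how much does the forecast move as assumed error grows?
for sd in (2.0, 4.0, 6.0):
    print(f"error_sd={sd}: {win_prob(2.0, sd):.0%}")
```

If the output swings from a strong favorite to a near coin flip across plausible error magnitudes, the error assumption is where validation effort belongs.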

Point estimate — A single number summarizing the center of a distribution; always less informative than the full probability distribution.

Blending weight — The parameter controlling how much weight is given to polling vs. fundamentals in the blended estimate; typically calibrated empirically against historical elections.
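
The blend itself is a one-line convex combination; the margins and the 0.8 weight below are illustrative:

```python
def blend(poll_margin, fundamentals_margin, poll_weight):
    """Linear blend of the poll average and the fundamentals prior."""
    return poll_weight * poll_margin + (1 - poll_weight) * fundamentals_margin

# Late in the cycle the poll weight is high, so polls dominate
blend(poll_margin=2.4, fundamentals_margin=0.5, poll_weight=0.8)  # → 2.02
```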

Brier score — A proper scoring rule for probabilistic forecasts, defined as the mean squared error between predicted probability and binary outcome; lower values indicate better calibration.
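
In code, with outcomes coded 1 for a win and 0 for a loss:

```python
import numpy as np

def brier_score(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    p = np.asarray(probs, dtype=float)
    y = np.asarray(outcomes, dtype=float)
    return np.mean((p - y) ** 2)

# Confident-and-right scores well; confident-and-wrong is penalized heavily
brier_score([0.9, 0.61], [1, 1])  # → (0.01 + 0.1521) / 2 ≈ 0.081
```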

Backtesting — The procedure of running a forecasting model on historical data where outcomes are known, in order to evaluate the model's accuracy and calibration before deployment.

House effects — Systematic directional bias in a specific pollster's results relative to other pollsters; estimated by comparing a pollster's polls to the contemporaneous field average.
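
A minimal estimate, assuming a `polls` DataFrame with hypothetical `pollster` and `margin` columns, compares each pollster's mean to the field average:

```python
import pandas as pd

# Hypothetical polls: margin = Garza lead in points
polls = pd.DataFrame({
    "pollster": ["A", "A", "B", "B", "C"],
    "margin":   [3.0,  2.0, -1.0, 0.0,  1.0],
})

field_average = polls["margin"].mean()
house_effects = polls.groupby("pollster")["margin"].mean() - field_average
# Pollster A leans toward Garza relative to the field; B leans away
```

A fuller version would compare each poll to the field average in its own time window, as the definition requires, so that a pollster active only when the race shifted is not mislabeled as biased.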


Chapter Summary

This chapter built a complete, operational election forecasting model from the ground up, using the Garza-Whitfield Senate race in Ordana as the working application. The core architecture — poll aggregation, fundamentals integration, and Monte Carlo uncertainty quantification — reflects the structure used by professional forecasters from major news organizations to campaign analytics teams. The lab emphasized not just the technical implementation but the reasoning behind each design choice: why recency weighting uses exponential decay rather than a simple cutoff, why systematic polling error must be modeled as a correlated draw rather than independent noise, why the blending weight between polls and fundamentals should shift as Election Day approaches.

The Python skills developed here — weighted averaging with pandas, Monte Carlo simulation with NumPy, calibration diagnostics, and sensitivity analysis — apply well beyond election forecasting. Any domain that requires combining multiple uncertain information sources into a probability distribution uses the same underlying logic. The habit of asking "what happens if this parameter is wrong?" is the sensitivity analysis mindset applied to every quantitative model you will build.

Perhaps most importantly, the chapter established the appropriate relationship between a practitioner and their model: the forecast is a decision-support tool, not an oracle. Nadia's 61-percent estimate for Garza did not mean the campaign had won — it meant they were a modest favorite in a genuinely uncertain race, and that the appropriate response was continued disciplined investment rather than either complacency or panic. Communicating uncertainty honestly to non-technical audiences is as important a skill as the quantitative techniques that generate the uncertainty estimate. The model tells you what the numbers say. The analyst tells you what the numbers mean.