Case Study 2: Simulating Disease Spread — Elena Models Outbreak Probability

Contributors to Introduction to Data Science

Case Study 2: Simulating Disease Spread — Elena Models Outbreak Probability

Tier 3 — Illustrative/Composite Example: This case study uses a simplified disease transmission model for pedagogical purposes. The simulation parameters are chosen to illustrate probability concepts, not to replicate any specific real-world pathogen. Actual epidemiological modeling involves substantially more complexity (heterogeneous contact patterns, variable infectious periods, spatial dynamics, behavioral responses, etc.). The basic SIR framework described here is a genuine epidemiological tool, originally developed by Kermack and McKendrick in 1927.

The Setting

Elena has been working with vaccination data all semester. She knows the numbers — 55% coverage here, 92% there. But her supervisor asks a question that shifts her perspective:

"Elena, we know neighborhood X has a 60% vaccination rate for measles. Neighborhood Y has 85%. Can you model how an outbreak would spread differently in these two neighborhoods? The county board is deciding where to allocate outreach resources, and a simulation would be much more convincing than a table of percentages."

Elena doesn't have an epidemiology degree. But she has Python, she understands probability, and she's about to discover that simulation is one of the most powerful tools in public health.

The Probability Foundation

Before building the simulation, Elena thinks through the probability.

When an infectious person contacts a susceptible person, there's some probability of transmission. In epidemiology, this is captured by a parameter called R0 (pronounced "R-naught") — the average number of people one infected person will infect in a fully susceptible population.

For measles, R0 is about 12-18. That means each infected person would infect 12-18 others if nobody were immune. But vaccination changes everything. If 90% of the population is vaccinated (and the vaccine is effective), the effective reproduction number drops to about R0 * (1 - vaccination_rate).

import numpy as np
import matplotlib.pyplot as plt

# R0 for measles: approximately 15
R0 = 15

# Effective reproduction number at different vaccination rates
vacc_rates = np.arange(0, 1.01, 0.01)
R_effective = R0 * (1 - vacc_rates)

plt.figure(figsize=(10, 5))
plt.plot(vacc_rates * 100, R_effective, color='steelblue', linewidth=2)
plt.axhline(y=1, color='red', linestyle='--', label='R=1 (outbreak threshold)')
plt.fill_between(vacc_rates * 100, R_effective, 1,
                 where=(R_effective > 1), color='coral', alpha=0.2, label='Outbreak likely')
plt.fill_between(vacc_rates * 100, R_effective, 0,
                 where=(R_effective <= 1), color='lightgreen', alpha=0.2, label='Outbreak unlikely')
plt.xlabel('Vaccination Rate (%)')
plt.ylabel('Effective Reproduction Number (R)')
plt.title('How Vaccination Rate Affects Disease Spread (Measles, R0=15)')
plt.legend()

# Mark the herd immunity threshold
herd_threshold = (1 - 1/R0) * 100
plt.axvline(x=herd_threshold, color='green', linestyle=':', alpha=0.7)
plt.annotate(f'Herd immunity\nthreshold: {herd_threshold:.0f}%',
             xy=(herd_threshold, R0*0.3), fontsize=9, color='green',
             ha='center')

plt.tight_layout()
plt.savefig('vaccination_R_effective.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"Herd immunity threshold for measles: {herd_threshold:.1f}%")
print(f"At 60% vaccination: R_effective = {R0 * 0.4:.1f}")
print(f"At 85% vaccination: R_effective = {R0 * 0.15:.1f}")
print(f"At 93% vaccination: R_effective = {R0 * 0.07:.1f}")

The key threshold is R = 1. If R_effective > 1, each infected person infects more than one other person on average, and the outbreak grows. If R_effective < 1, the outbreak dies out.

For measles (R0 ≈ 15), the herd immunity threshold is about 93%. Below that, outbreaks can spread. Elena's two neighborhoods — at 60% and 85% — are both below herd immunity, but how differently would outbreaks play out?

The Simulation: A Simple SIR Model

Elena builds a simplified SIR model — one of the most fundamental tools in epidemiology. SIR stands for: - Susceptible: can catch the disease - Infected: currently infected and can spread it - Recovered: recovered and now immune

def simulate_outbreak(population_size, vaccination_rate, R0,
                      infectious_days=10, initial_infected=1,
                      max_days=200):
    """
    Simulate a disease outbreak using a stochastic SIR model.

    Each day, each infected person has a probability of infecting
    each susceptible person they contact.
    """
    # Initialize population
    n_vaccinated = int(population_size * vaccination_rate)
    n_susceptible = population_size - n_vaccinated - initial_infected

    # Contact rate: average contacts per day
    contacts_per_day = 10  # Simplified assumption
    # Transmission probability per contact
    p_transmit = R0 / (contacts_per_day * infectious_days)

    # Track daily counts
    S = [n_susceptible]
    I = [initial_infected]
    R_recovered = [n_vaccinated]  # Vaccinated start as "recovered" (immune)

    day = 0
    while I[-1] > 0 and day < max_days:
        day += 1
        current_S = S[-1]
        current_I = I[-1]
        current_R = R_recovered[-1]

        # New infections: each infected person contacts some susceptible people
        if current_S > 0 and current_I > 0:
            # Probability any given susceptible person gets infected today
            p_not_infected = (1 - p_transmit) ** min(current_I, contacts_per_day)
            new_infections = np.random.binomial(current_S, 1 - p_not_infected)
        else:
            new_infections = 0

        # Recoveries: simplified — each infected person recovers with P = 1/infectious_days
        new_recoveries = np.random.binomial(current_I, 1 / infectious_days)

        S.append(current_S - new_infections)
        I.append(current_I + new_infections - new_recoveries)
        R_recovered.append(current_R + new_recoveries)

    return {
        'days': list(range(len(S))),
        'susceptible': S,
        'infected': I,
        'recovered': R_recovered,
        'total_infected': population_size - S[-1] - n_vaccinated
    }

# Compare two neighborhoods
np.random.seed(42)
population = 10000

result_60 = simulate_outbreak(population, vaccination_rate=0.60, R0=15)
result_85 = simulate_outbreak(population, vaccination_rate=0.85, R0=15)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for ax, result, vacc_rate in [(axes[0], result_60, 60), (axes[1], result_85, 85)]:
    ax.plot(result['days'], result['infected'], color='coral', linewidth=2, label='Infected')
    ax.plot(result['days'], result['susceptible'], color='steelblue', linewidth=2, label='Susceptible')
    ax.plot(result['days'], result['recovered'], color='mediumseagreen', linewidth=2, label='Recovered/Immune')
    ax.set_title(f'Vaccination Rate: {vacc_rate}%\nTotal infected: {result["total_infected"]}')
    ax.set_xlabel('Days')
    ax.set_ylabel('Number of People')
    ax.legend()

plt.suptitle('Disease Outbreak Simulation: Two Neighborhoods', fontsize=14)
plt.tight_layout()
plt.savefig('sir_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"\n60% vaccination: {result_60['total_infected']} people infected "
      f"({result_60['total_infected']/population*100:.1f}% of population)")
print(f"85% vaccination: {result_85['total_infected']} people infected "
      f"({result_85['total_infected']/population*100:.1f}% of population)")

The Probability Questions

Elena uses this simulation to answer several probability questions for the county board.

Question 1: What's the probability of a large outbreak?

Because the model is stochastic (involves randomness), running it once isn't enough. Elena runs it many times:

def outbreak_probability(n_simulations, population_size, vaccination_rate,
                         R0, threshold=50):
    """
    Estimate the probability of an outbreak exceeding a threshold
    by running many simulations.
    """
    outbreak_sizes = []
    for _ in range(n_simulations):
        result = simulate_outbreak(population_size, vaccination_rate, R0)
        outbreak_sizes.append(result['total_infected'])

    p_large_outbreak = np.mean(np.array(outbreak_sizes) >= threshold)
    return outbreak_sizes, p_large_outbreak

np.random.seed(42)
n_sims = 500

sizes_60, p_large_60 = outbreak_probability(n_sims, 10000, 0.60, 15, threshold=100)
sizes_85, p_large_85 = outbreak_probability(n_sims, 10000, 0.85, 15, threshold=100)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].hist(sizes_60, bins=30, color='coral', edgecolor='white', alpha=0.8)
axes[0].set_title(f'60% Vaccinated\nP(>100 infected) = {p_large_60:.2f}')
axes[0].set_xlabel('Total Infected')

axes[1].hist(sizes_85, bins=30, color='mediumseagreen', edgecolor='white', alpha=0.8)
axes[1].set_title(f'85% Vaccinated\nP(>100 infected) = {p_large_85:.2f}')
axes[1].set_xlabel('Total Infected')

plt.suptitle('Distribution of Outbreak Sizes (500 simulations each)', fontsize=14)
plt.tight_layout()
plt.savefig('outbreak_distribution.png', dpi=150, bbox_inches='tight')
plt.show()

print(f"\nProbability of >100 infections:")
print(f"  60% vaccinated: {p_large_60:.2f} ({p_large_60*100:.0f}%)")
print(f"  85% vaccinated: {p_large_85:.2f} ({p_large_85*100:.0f}%)")

Question 2: What vaccination rate makes outbreaks unlikely?

vacc_rates_test = [0.50, 0.60, 0.70, 0.80, 0.85, 0.90, 0.93, 0.95]
probabilities = []

np.random.seed(42)
for vr in vacc_rates_test:
    _, p_large = outbreak_probability(200, 10000, vr, 15, threshold=50)
    probabilities.append(p_large)
    print(f"  Vaccination rate {vr*100:.0f}%: P(outbreak > 50) = {p_large:.2f}")

plt.figure(figsize=(10, 5))
plt.plot([v*100 for v in vacc_rates_test], probabilities,
         'o-', color='steelblue', linewidth=2, markersize=8)
plt.axhline(y=0.05, color='red', linestyle='--', label='5% risk threshold')
plt.xlabel('Vaccination Rate (%)')
plt.ylabel('Probability of Outbreak (>50 cases)')
plt.title('Outbreak Probability vs. Vaccination Rate')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig('outbreak_vs_vaccination.png', dpi=150, bbox_inches='tight')
plt.show()

Question 3: How does randomness affect outcomes?

Elena runs the same scenario 20 times to show the county board that randomness matters:

np.random.seed(42)
fig, ax = plt.subplots(figsize=(12, 6))

for i in range(20):
    result = simulate_outbreak(10000, vaccination_rate=0.80, R0=15)
    ax.plot(result['days'], result['infected'], alpha=0.3, color='coral')

ax.set_title('20 Simulations at 80% Vaccination Rate\nSame parameters, different randomness')
ax.set_xlabel('Days')
ax.set_ylabel('Number Currently Infected')
plt.tight_layout()
plt.savefig('outbreak_variability.png', dpi=150, bbox_inches='tight')
plt.show()

Even with the same vaccination rate, some simulations produce large outbreaks and others fizzle out. This is the nature of randomness — the same conditions can produce very different outcomes, especially when numbers are small (the first few days of an outbreak).

Elena's Report to the County Board

Elena prepares a brief for the board:

Key Findings from Outbreak Simulation:

At the current 60% vaccination rate in Neighborhood X, the probability of a significant outbreak (>100 cases) following a single introduction is approximately 85-95%. In contrast, at Neighborhood Y's 85% rate, the probability drops to approximately 20-40%.

To reduce outbreak probability below 5%, vaccination rates need to exceed approximately 93% — the herd immunity threshold for measles.

Even at 85% vaccination, outbreaks are not impossible — randomness plays a significant role. Some simulations produce large outbreaks even at relatively high vaccination rates. This argues for a margin of safety above the theoretical threshold.

Recommendation: Prioritize outreach resources in Neighborhood X, where the outbreak risk is highest. The 25-percentage-point gap in vaccination rates between the two neighborhoods translates to a dramatically different risk profile.

The Lessons

Lesson 1: Simulation Turns Abstract Probability into Tangible Stories

The county board members didn't need to understand R0 formulas or SIR differential equations. They could look at the epidemic curves and immediately understand that 60% vaccination is dangerous and 85% is much safer. Simulation made probability visible.

Lesson 2: Stochastic Models Reveal Variability, Not Just Averages

Running the simulation once gives you one possible outcome. Running it hundreds of times gives you a distribution of outcomes — connecting back to the distribution thinking from Chapter 19. The probability of a large outbreak is a summary statistic of that distribution.

Lesson 3: Probability Is About Risk, Not Certainty

Elena didn't tell the board "there will be an outbreak" or "there won't be." She said "there's an 85% chance of a significant outbreak at 60% vaccination." That's what probability gives us — a language for quantifying uncertainty, not eliminating it.

Lesson 4: Python Turns You Into a Modeler

Elena built a useful epidemiological model with basic Python — loops, random number generation, and plotting. She didn't need specialized software or years of training. The combination of probability thinking and Python coding is remarkably powerful.

Discussion Questions

What simplifications did Elena's model make? How might these affect the accuracy of her predictions? (Think about: homogeneous mixing, fixed recovery time, perfect vaccine efficacy.)
If Elena wanted to make her model more realistic, what additional data would she need?
The model assumes "homogeneous mixing" — every person has equal contact with every other person. Why is this unrealistic, and how might contact structure (e.g., families, schools, workplaces) change the results?
How would you communicate the concept of "stochastic variability" (same conditions, different outcomes) to non-technical decision-makers?

Connection to the Chapter

This case study applies nearly every concept from Chapter 20: probability as frequency (running many simulations), independence assumptions, the law of large numbers (why many simulations give stable estimates), and Monte Carlo simulation as a problem-solving strategy. It also connects to Elena's progressive project, extending her vaccination rate analysis from descriptive statistics (Chapter 19) to probabilistic risk modeling.