Case Study 2: Estimating Global Vaccination Coverage from Incomplete Data
Tier 2 — Attributed Findings: This case study discusses real global health estimation challenges. Statistics and methodology descriptions are attributed to the World Health Organization (WHO) and UNICEF, whose joint reporting system (WHO/UNICEF Estimates of National Immunization Coverage, or WUENIC) is the primary source for global vaccination data. Specific figures are based on widely published WHO estimates; minor variations may exist across report editions. Illustrative examples of specific countries and estimation challenges are based on documented patterns in the global health literature, with some details simplified for pedagogical clarity.
The Question That Sounds Simple
Here is a question that should be easy to answer: What percentage of the world's children have been vaccinated against measles?
It's the kind of question you'd expect to answer in ten minutes. Open a database. Download a file. Compute a mean. Done.
Except it's not that simple. Not even close.
The WHO and UNICEF jointly produce what are considered the most authoritative estimates of national immunization coverage — a dataset known as WUENIC (WHO/UNICEF Estimates of National Immunization Coverage). These estimates cover over 190 countries and multiple vaccines, and they're used by governments, international organizations, and researchers to set health policy, allocate funding, and track progress toward global health goals.
But the WUENIC estimates aren't just downloads from a global database. They're the product of a sophisticated statistical estimation process that grapples with exactly the challenges we studied in Chapter 22: imperfect sampling, missing data, biased measurements, and the fundamental question of how to learn about a population when you can't observe it directly.
This case study walks through how vaccination coverage is estimated globally, what can go wrong, and what it teaches us about the practice of estimation in the real world.
The Data Landscape: A Patchwork, Not a Picture
To understand the estimation challenge, you need to appreciate the wildly different data situations across countries.
Countries with Strong Administrative Data
In many high-income countries and an increasing number of middle-income countries, vaccination is tracked through administrative systems. Every time a child receives a vaccine dose, it's recorded in a database — often an electronic immunization registry that's linked to the child's birth record. At the end of the year, the government can simply divide the number of doses administered by the number of eligible children (the target population) to get a coverage rate.
# The simple case: administrative coverage
doses_administered = 485000
target_population = 500000
admin_coverage = doses_administered / target_population
print(f"Administrative coverage: {admin_coverage:.1%}")
This sounds straightforward, but even in this "easy" case, there are problems:
- Numerator errors: Doses might be counted more than once if a child moves between health districts. Doses given in the private sector might not be reported to the government system.
- Denominator errors: The "target population" is usually an estimate from census projections, which can be years out of date. In countries with high migration, the actual number of children in a district may differ substantially from projections.
- Coverage over 100%: When the numerator is inflated (double-counting) or the denominator is deflated (population estimates too low), the calculated coverage can exceed 100% — which is obviously impossible. This happens more often than you'd think.
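A toy calculation shows how easily this happens. All figures below are illustrative, not from any real country:

```python
# Illustrative only: how numerator and denominator errors push
# administrative coverage above 100%.
true_children = 520_000        # actual number of eligible children
projected_children = 480_000   # outdated census projection (denominator too low)
doses_recorded = 505_000       # includes some double-counted doses (numerator too high)

admin_coverage = doses_recorded / projected_children
print(f"Reported coverage: {admin_coverage:.1%}")  # exceeds 100%
```

The reported figure lands above 100% even though true coverage (505,000 minus double-counts, over 520,000 children) is well below it.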
Countries with Household Survey Data
Many low- and middle-income countries conduct periodic household surveys — such as the Demographic and Health Surveys (DHS) or Multiple Indicator Cluster Surveys (MICS) — that include questions about children's vaccination status. An interviewer visits randomly selected households, asks to see the child's vaccination card, and records which vaccines have been received.
This is sampling in action — the same kind of sampling you learned about in the chapter. A well-designed survey with 5,000-10,000 households can produce nationally representative estimates of vaccination coverage.
But surveys have their own challenges:
Recall bias: When a vaccination card isn't available (lost, never issued, or the child was never vaccinated), the interviewer asks the mother to recall which vaccines the child received. Mothers of vaccinated children are more likely to remember doses than mothers of unvaccinated children. This creates an upward bias in reported coverage from recall.
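A quick simulation illustrates the mechanism. The recall probabilities below are made-up numbers, chosen only to show the direction of the bias:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
vaccinated = rng.random(n) < 0.70   # true coverage: 70%
has_card = rng.random(n) < 0.60     # vaccination card available for 60% of children

# Without a card, assume mothers of vaccinated children recall the dose 95%
# of the time, while 25% of unvaccinated children are mistakenly reported
# as vaccinated (illustrative rates only).
recalled = np.where(vaccinated, rng.random(n) < 0.95, rng.random(n) < 0.25)
reported = np.where(has_card, vaccinated, recalled)

print(f"True coverage:     {vaccinated.mean():.1%}")
print(f"Reported coverage: {reported.mean():.1%}")  # biased upward
```

With these assumed rates, the reported coverage among card-less children is 0.70 × 0.95 + 0.30 × 0.25 ≈ 74%, pulling the overall reported figure above the true 70%.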
Sampling error: Surveys are samples, not censuses. A DHS might estimate measles coverage at 76% ± 3.5 percentage points (95% CI). That uncertainty matters when you're tracking whether coverage has gone up or down from the previous survey five years ago.
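A back-of-the-envelope check shows where a margin of error like ±3.5 points comes from. The effective sample size here is an illustrative assumption: cluster sampling shrinks the effective n for age-eligible children well below the number of households surveyed:

```python
import math

p = 0.76      # estimated coverage
n_eff = 600   # assumed effective sample of age-eligible children after clustering
se = math.sqrt(p * (1 - p) / n_eff)
half_width = 1.96 * se
print(f"95% CI half-width: ±{half_width * 100:.1f} percentage points")  # roughly ±3.4
```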
Timing: Major household surveys are expensive and typically conducted only every 3-5 years. Between surveys, there's a data gap that must be filled with other sources.
Let's simulate the difference between administrative and survey data:
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(42)
# Simulate a country over 20 years
years = np.arange(2003, 2023)
true_coverage = 50 + 2 * np.arange(20) + np.random.normal(0, 2, 20)
true_coverage = np.clip(true_coverage, 30, 99)
# Administrative data: available every year, but with systematic upward bias
admin_data = true_coverage + np.random.normal(8, 3, 20) # ~8 points too high
admin_data = np.clip(admin_data, 30, 105) # Can exceed 100%!
# Survey data: available only every 5 years, but unbiased (with sampling error)
survey_years = [2003, 2008, 2013, 2018]
survey_indices = [np.where(years == y)[0][0] for y in survey_years]
survey_data = true_coverage[survey_indices] + np.random.normal(0, 3.5, len(survey_years))
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(years, true_coverage, 'k-', linewidth=2, label='True coverage', zorder=3)
ax.plot(years, admin_data, 's-', color='#e74c3c', alpha=0.7,
label='Administrative data (biased high)')
ax.errorbar(survey_years, survey_data, yerr=7, # ~±3.5 percentage points × 2
fmt='o', color='#2ecc71', markersize=10, linewidth=2,
capsize=5, label='Survey data with 95% CI')
ax.axhline(y=100, color='gray', linestyle=':', alpha=0.5)
ax.set_xlabel('Year', fontsize=12)
ax.set_ylabel('Vaccination Coverage (%)', fontsize=12)
ax.set_title('The Challenge: Two Data Sources, Different Biases', fontsize=14)
ax.legend(fontsize=11)
ax.set_ylim(30, 110)
plt.tight_layout()
plt.savefig('vaccination_data_sources.png', dpi=150, bbox_inches='tight')
plt.show()
The plot reveals the challenge that WHO/UNICEF analysts face: administrative data is available every year but biased upward. Survey data is unbiased (on average) but only available occasionally, and with substantial uncertainty. The true coverage is the black line that nobody can directly observe.
Countries with Little or No Data
For some countries — particularly those affected by conflict, political instability, or extreme poverty — neither administrative data nor survey data is available in any recent or reliable form. These tend to be countries where vaccination coverage is likely lowest, creating the same cruel paradox we saw with maternal mortality data in Case Study 2 of Chapter 1: the data is worst where the need is greatest.
The WUENIC Estimation Process
Given this patchwork of data sources, how does the WHO/UNICEF estimation process work? Here's a simplified version:
Step 1: Assemble All Available Data
For each country, analysts compile every available data source:
- Annual administrative reports to WHO
- Household survey estimates (DHS, MICS, or national surveys)
- Coverage surveys (smaller-scale surveys focused on vaccination)
- Any other relevant data
Step 2: Assess Each Source
Each data point gets a credibility assessment. Administrative data from a country with a functioning electronic immunization registry is weighted more heavily than administrative data from a country with known reporting problems. Survey data from a well-designed national survey with a large sample is weighted more heavily than data from a small convenience survey.
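One simple way to formalize this kind of weighting is an inverse-variance (precision-weighted) average, where more credible sources are assigned smaller assumed error variances. The numbers below are illustrative assumptions, not WUENIC's actual method:

```python
import numpy as np

# Three estimates for the same country-year, with assumed standard errors
# reflecting each source's credibility (illustrative numbers).
estimates = np.array([95.0, 72.0, 75.0])  # admin report, national survey, small survey
std_errs = np.array([10.0, 2.5, 6.0])     # less credible source -> larger assumed SE

weights = 1 / std_errs**2
combined = np.sum(weights * estimates) / np.sum(weights)
print(f"Precision-weighted estimate: {combined:.1f}%")
```

The combined estimate lands close to the trusted national survey, because its small assumed standard error gives it most of the weight.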
Step 3: Look for Inconsistencies
Analysts compare the different sources. If administrative data says coverage is 95% but a household survey says it's 72%, that's a red flag. The gap might be due to:
- Double-counting in the administrative system
- Inaccurate population denominators
- Vaccines administered to non-target populations (e.g., adults counted in a denominator meant for children)
- Survey sampling error
Step 4: Produce a Best Estimate
The final WUENIC estimate for each country and year is a judgment call informed by data. It's not a simple average of sources. Analysts consider the reliability of each source, the consistency between sources, trends over time, and contextual knowledge about the country's health system.
For many countries, the final estimate comes with implicit uncertainty, though WUENIC traditionally publishes point estimates without confidence intervals for individual country-years (a limitation that the global health community has debated).
A Worked Example: Estimating Coverage in a Data-Sparse Country
Let's work through a simplified example to see how estimation works in practice.
import pandas as pd
import numpy as np
from scipy import stats
# Imagine a country with the following data:
data = pd.DataFrame({
'year': [2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019,
2020, 2021, 2022],
'admin': [82, 85, 88, 91, 93, 95, 97, 96, 92, 88, 75, 80, 85],
'survey': [np.nan, np.nan, 72, np.nan, np.nan, np.nan, 78, np.nan,
np.nan, np.nan, np.nan, 68, np.nan],
'survey_ci_lower': [np.nan, np.nan, 67, np.nan, np.nan, np.nan, 73,
np.nan, np.nan, np.nan, np.nan, 62, np.nan],
'survey_ci_upper': [np.nan, np.nan, 77, np.nan, np.nan, np.nan, 83,
np.nan, np.nan, np.nan, np.nan, 74, np.nan],
})
print("Available data:")
print(data.to_string(index=False))
year admin survey survey_ci_lower survey_ci_upper
2010 82 NaN NaN NaN
2011 85 NaN NaN NaN
2012 88 72.0 67.0 77.0
2013 91 NaN NaN NaN
2014 93 NaN NaN NaN
2015 95 NaN NaN NaN
2016 97 78.0 73.0 83.0
2017 96 NaN NaN NaN
2018 92 NaN NaN NaN
2019 88 NaN NaN NaN
2020 75 NaN NaN NaN
2021 80 68.0 62.0 74.0
2022 85 NaN NaN NaN
Notice the patterns:
- Administrative data is available every year but consistently higher than survey data
- Survey data is available only in 2012, 2016, and 2021
- In every survey year, the survey estimate is substantially below the administrative figure (by 12-19 points)
- The 2020-2021 dip in admin data (likely due to pandemic disruptions) is confirmed by the 2021 survey
# Visualize the data challenge
fig, ax = plt.subplots(figsize=(12, 6))
ax.plot(data['year'], data['admin'], 's-', color='#e74c3c', markersize=8,
label='Administrative data', alpha=0.8)
# Plot survey data with error bars
survey_mask = data['survey'].notna()
ax.errorbar(data.loc[survey_mask, 'year'],
data.loc[survey_mask, 'survey'],
yerr=[data.loc[survey_mask, 'survey'] -
data.loc[survey_mask, 'survey_ci_lower'],
data.loc[survey_mask, 'survey_ci_upper'] -
data.loc[survey_mask, 'survey']],
fmt='o', color='#2ecc71', markersize=12, linewidth=2,
capsize=6, label='Survey estimate with 95% CI')
# Annotate the gap between admin and survey estimates
ax.annotate('Admin-survey gap\n(~12-19 points)',
xy=(2016, 87.5), fontsize=11, ha='center',
bbox=dict(boxstyle='round', facecolor='lightyellow', alpha=0.8))
ax.set_xlabel('Year', fontsize=12)
ax.set_ylabel('Vaccination Coverage (%)', fontsize=12)
ax.set_title('A Country with Conflicting Data Sources', fontsize=14)
ax.legend(fontsize=11)
ax.set_ylim(50, 105)
ax.grid(alpha=0.3)
plt.tight_layout()
plt.savefig('conflicting_data_sources.png', dpi=150, bbox_inches='tight')
plt.show()
An analyst looking at this data would conclude:
- The administrative data is systematically inflated — by about 12-19 percentage points based on the three survey comparisons. This could be due to denominator problems, double-counting, or reporting incentives.
- The survey data is the more reliable anchor, but it's only available for three years. In other years, the analyst must estimate coverage by adjusting the administrative trend downward.
- The confidence intervals on the survey data matter. In 2021, the survey CI is (62%, 74%). The true coverage could be anywhere in that range — a 12-point span that has real policy implications.
# Estimate coverage using survey-anchored adjustment
# Simple approach: compute average admin-survey gap and adjust
gap_years = data[survey_mask]
gaps = gap_years['admin'].values - gap_years['survey'].values
mean_gap = gaps.mean()
std_gap = gaps.std(ddof=1)  # sample standard deviation
print(f"Admin-survey gaps: {gaps}")
print(f"Mean gap: {mean_gap:.1f} percentage points")
print(f"Std dev of gap: {std_gap:.1f}")
# Adjusted estimates: admin data minus the average gap
data['adjusted'] = data['admin'] - mean_gap
data['adjusted'] = data['adjusted'].clip(0, 100)
# For survey years, use the survey estimate
for idx in data[survey_mask].index:
data.loc[idx, 'adjusted'] = data.loc[idx, 'survey']
print("\nYear-by-year estimates:")
for _, row in data.iterrows():
source = "survey" if not np.isnan(row['survey']) else "adjusted admin"
print(f" {int(row['year'])}: {row['adjusted']:.0f}% ({source})")
This is obviously a simplified version of what WUENIC does, but it illustrates the core principle: use the most reliable data source as an anchor and adjust other sources to be consistent with it.
Implications for Confidence Intervals
Here's where the chapter concepts become critical. When you compute a confidence interval for global vaccination coverage, what does the "uncertainty" include?
The Confidence Interval You Can Compute
If you take the WUENIC estimates for 195 countries and compute a confidence interval for the mean:
# Simulated WUENIC-style estimates for all countries
np.random.seed(42)
countries = pd.DataFrame({
'country': [f'Country_{i}' for i in range(195)],
'coverage': np.concatenate([
np.random.normal(55, 20, 47), # African region
np.random.normal(75, 12, 35), # Americas
np.random.normal(70, 15, 11), # SE Asia
np.random.normal(82, 8, 53), # Europe
np.random.normal(63, 18, 22), # E Mediterranean
np.random.normal(78, 10, 27), # W Pacific
]).clip(10, 99)
})
mean_cov = countries['coverage'].mean()
se_cov = countries['coverage'].std(ddof=1) / np.sqrt(len(countries))
ci = stats.t.interval(0.95, df=len(countries)-1,
loc=mean_cov, scale=se_cov)
print(f"Mean coverage across 195 countries: {mean_cov:.1f}%")
print(f"Standard error: {se_cov:.2f}")
print(f"95% CI: ({ci[0]:.1f}%, {ci[1]:.1f}%)")
This confidence interval is mathematically correct. But what does it actually capture?
It captures between-country variability — the fact that countries differ from each other. It does not capture:
- The uncertainty in each country's estimate (each WUENIC figure has its own error)
- The countries that don't report data at all
- The systematic biases in administrative data
- The outdated denominators
- The within-country variation (vaccination rates differ enormously between urban and rural areas within the same country)
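A two-line example shows how a national average can mask that variation. The urban/rural split here is hypothetical:

```python
# Hypothetical country: a 70% national figure hides a large urban/rural gap.
urban_share, urban_coverage = 0.40, 0.88
rural_share, rural_coverage = 0.60, 0.58

national = urban_share * urban_coverage + rural_share * rural_coverage
print(f"National coverage: {national:.0%}")  # 70%, despite a 30-point internal gap
```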
The Honest Uncertainty Statement
A more complete uncertainty analysis would need to propagate the within-country estimation uncertainty into the between-country confidence interval. This is substantially more complex:
# Simulating propagated uncertainty (vectorized)
np.random.seed(42)
n_simulations = 5000
# For each simulation, draw a "true coverage" for every country by adding
# within-country estimation uncertainty (normal noise with a 7-point SD)
noise = np.random.normal(0, 7, size=(n_simulations, len(countries)))
simulated_coverages = np.clip(countries['coverage'].values + noise, 0, 100)
global_means = simulated_coverages.mean(axis=1)
ci_full = (np.percentile(global_means, 2.5), np.percentile(global_means, 97.5))
print(f"Naive CI (ignoring within-country uncertainty): ({ci[0]:.1f}%, {ci[1]:.1f}%)")
print(f"Width: {ci[1]-ci[0]:.1f} percentage points")
print(f"\nFull CI (with propagated uncertainty): ({ci_full[0]:.1f}%, {ci_full[1]:.1f}%)")
print(f"Width: {ci_full[1]-ci_full[0]:.1f} percentage points")
The full CI — accounting for within-country estimation uncertainty — is wider. This is the honest picture: we're less certain about global vaccination coverage than the naive calculation suggests.
Lessons for Data Scientists
Lesson 1: Point Estimates Hide Layers of Uncertainty
When you read "global measles vaccination coverage is 83%," that number is the product of a complex estimation chain. Each link in the chain adds uncertainty. A single number can't communicate all those layers — which is why confidence intervals and sensitivity analyses are so important.
Lesson 2: Missing Data Is Not Random
Countries that don't report data tend to be countries with weaker health systems. Countries with weaker health systems tend to have lower vaccination coverage. This means the global estimate, computed from available data, is likely biased upward. This is a form of selection bias — the "sample" of reporting countries is not a random sample of all countries.
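We can see this selection bias in a small simulation, under the illustrative assumption that the probability a country reports data rises with its coverage:

```python
import numpy as np

rng = np.random.default_rng(1)
true_coverage = rng.uniform(30, 95, size=195)  # 195 countries, coverage in percent

# Illustrative assumption: reporting probability increases with coverage.
report_prob = np.clip((true_coverage - 20) / 100, 0.1, 0.95)
reports = rng.random(195) < report_prob

print(f"True global mean:          {true_coverage.mean():.1f}%")
print(f"Mean among reporters only: {true_coverage[reports].mean():.1f}%")
```

Because high-coverage countries are over-represented among reporters, the mean computed from available data overstates the true global mean.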
Lesson 3: Multiple Data Sources Require Judgment
When administrative data says 95% and a survey says 72%, the answer isn't to split the difference. It requires understanding why the sources disagree and which is more trustworthy in context. This kind of judgment is a core data science skill that can't be automated.
Lesson 4: Confidence Intervals Should Reflect Real Uncertainty
The standard confidence interval formula captures one narrow type of uncertainty. Good data science practice means thinking about — and ideally quantifying — the other sources of uncertainty that the formula misses.
Connecting to the Progressive Project
In the progressive project, you're computing confidence intervals for vaccination rates by region. This case study should make you think more carefully about several questions:
- What does each country's reported rate actually measure? Is it from administrative data or a survey? How reliable is the estimate?
- Which countries are missing? If some countries in a region don't report data, your regional mean is biased toward the countries that do report.
- Should you weight countries equally or by population? A mean across countries gives equal weight to a country of 1 million and a country of 1 billion. A population-weighted mean gives more weight to larger countries. Which is right depends on what question you're trying to answer.
- How wide should your confidence interval really be? The formula gives you the between-country sampling uncertainty, but the real uncertainty is larger.
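A quick sketch of the difference between the two kinds of mean, using made-up populations and coverages:

```python
import numpy as np

# Hypothetical region: one very large country with low coverage and
# several small countries with high coverage.
coverage = np.array([55.0, 90.0, 88.0, 92.0])      # percent
population = np.array([1_000.0, 5.0, 8.0, 3.0])    # millions of children

unweighted = coverage.mean()
weighted = np.average(coverage, weights=population)
print(f"Unweighted mean:          {unweighted:.1f}%")
print(f"Population-weighted mean: {weighted:.1f}%")
```

The unweighted mean (about 81%) describes the typical country; the population-weighted mean (about 56%) describes the typical child. Both are legitimate answers to different questions.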
These are the questions that turn a mechanical calculation into a thoughtful analysis. They don't have simple answers — but asking them is what makes the analysis trustworthy.
Discussion Questions
- Estimation trade-offs: If you were the WHO analyst and had to choose between publishing (a) point estimates for every country, acknowledging they have unknown uncertainty, or (b) estimates with confidence intervals for only the subset of countries where uncertainty can be reasonably quantified, which would you choose? What are the trade-offs?
- Data quality: Why might a government have an incentive to report higher vaccination coverage than the true value? How does this affect the global estimation process?
- Communicating uncertainty: Draft a one-paragraph summary of global vaccination coverage for a newspaper audience. How would you communicate the uncertainty without confusing readers?
- Project reflection: How does the within-country estimation uncertainty affect the confidence intervals you computed in the progressive project? Write a brief "limitations" paragraph that could accompany your analysis.
Key Takeaway: Estimation in the real world is rarely as clean as sampling from a known population. Real estimates sit atop layers of measurement, reporting, and modeling uncertainty. A confidence interval is a starting point for communicating uncertainty, not the final word. The best data scientists are honest about what their numbers can and cannot tell us.