Case Study 2: Reshaping Global Health Data for a WHO Report


Tier 2 — Attributed Scenario: This case study is inspired by the actual structure of WHO/UNICEF immunization coverage estimate datasets, which are publicly available through the WHO Immunization Data Portal. The specific country values, analyst character, and narrative details are fictional, but the data structure, reshaping challenges, and analytical workflow mirror real-world global health reporting. The WHO region codes (AFRO, AMRO, EMRO, EURO, SEARO, WPRO) are real.


The Setting

Dr. Amara Osei is an epidemiologist at a regional WHO office in Brazzaville, preparing a briefing report for the upcoming World Health Assembly. Her director has requested a set of summary tables and charts showing how childhood vaccination coverage has changed across WHO regions from 2018 to 2023, with a particular focus on the impact of COVID-19 disruptions.

The data is familiar — Amara has worked with immunization datasets for years — but the challenge this time is structural. She needs to combine three separate data sources, reshape them for different audiences, and produce clean summary tables in under two days. The data she has:

Source 1: Coverage estimates — One CSV file downloaded from the WHO portal. It arrives in wide format, with each vaccine-year combination as a separate column:

import pandas as pd

coverage_wide = pd.DataFrame({
    'country': ['Nigeria', 'Ethiopia', 'DRC', 'India', 'Bangladesh',
                'Indonesia', 'Brazil', 'Mexico', 'Germany', 'France',
                'Pakistan', 'Egypt'],
    'iso3': ['NGA', 'ETH', 'COD', 'IND', 'BGD',
             'IDN', 'BRA', 'MEX', 'DEU', 'FRA',
             'PAK', 'EGY'],
    'who_region': ['AFRO', 'AFRO', 'AFRO', 'SEARO', 'SEARO',
                   'SEARO', 'AMRO', 'AMRO', 'EURO', 'EURO',
                   'EMRO', 'EMRO'],
    'mcv1_2018': [42, 62, 57, 90, 94, 75, 90, 88, 97, 90, 75, 92],
    'mcv1_2019': [54, 58, 57, 95, 95, 76, 93, 89, 97, 90, 75, 93],
    'mcv1_2020': [50, 48, 55, 89, 97, 73, 79, 84, 92, 84, 73, 91],
    'mcv1_2021': [43, 48, 55, 88, 97, 72, 80, 87, 93, 85, 73, 92],
    'mcv1_2022': [45, 52, 41, 92, 97, 79, 84, 88, 93, 89, 77, 93],
    'mcv1_2023': [52, 55, 48, 93, 97, 80, 85, 90, 97, 90, 78, 93],
    'dtp3_2018': [49, 68, 65, 89, 98, 79, 85, 87, 93, 96, 75, 95],
    'dtp3_2019': [56, 64, 65, 91, 98, 80, 87, 88, 93, 97, 75, 95],
    'dtp3_2020': [50, 51, 62, 85, 98, 76, 76, 82, 91, 90, 72, 93],
    'dtp3_2021': [43, 51, 62, 85, 98, 74, 77, 84, 92, 91, 72, 93],
    'dtp3_2022': [50, 58, 45, 91, 98, 80, 81, 86, 92, 93, 76, 94],
    'dtp3_2023': [55, 60, 50, 93, 98, 81, 83, 88, 93, 94, 77, 95],
})

Source 2: Population data — Total under-5 population per country (for weighting averages).

population = pd.DataFrame({
    'iso3': ['NGA', 'ETH', 'COD', 'IND', 'BGD',
             'IDN', 'BRA', 'MEX', 'DEU', 'FRA',
             'PAK', 'EGY'],
    'under5_pop': [33_800_000, 16_900_000, 17_200_000, 119_500_000, 15_000_000,
                   22_300_000, 14_500_000, 10_800_000, 3_900_000, 3_700_000,
                   28_600_000, 12_400_000]
})

Source 3: Economic indicators — GDP per capita for correlation analysis.

gdp_data = pd.DataFrame({
    'country_name': ['Nigeria', 'Ethiopia', 'DRC', 'India', 'Bangladesh',
                     'Indonesia', 'Brazil', 'Mexico', 'Germany', 'France',
                     'Pakistan', 'Egypt'],
    'gdp_per_capita': [2065, 1028, 584, 2389, 2688,
                       4332, 8920, 10046, 48636, 42330,
                       1505, 4295]
})

Amara notices three challenges immediately: the coverage data is in wide format (needs melting), the population data uses ISO3 codes while coverage has both ISO3 and full names (merge carefully), and the GDP data uses full country names but different capitalization might be an issue.

Step 1: Melting the Wide Coverage Data

The wide format is human-readable but impossible to analyze directly. Amara needs one row per country-year-vaccine observation:

# Melt all the coverage columns
coverage_long = pd.melt(
    coverage_wide,
    id_vars=['country', 'iso3', 'who_region'],
    var_name='vaccine_year',
    value_name='coverage'
)

print(f"Wide shape: {coverage_wide.shape}")
print(f"Long shape: {coverage_long.shape}")
print(coverage_long.head())
Wide shape: (12, 16)
Long shape: (144, 5)

   country iso3 who_region vaccine_year  coverage
0  Nigeria  NGA       AFRO    mcv1_2018        42
1 Ethiopia  ETH       AFRO    mcv1_2018        62
2      DRC  COD       AFRO    mcv1_2018        57
3    India  IND      SEARO    mcv1_2018        90
4 Bangladesh BGD     SEARO    mcv1_2018        94

12 countries times 12 vaccine-year columns gives 144 rows. Now she splits the vaccine_year column into separate vaccine and year columns:

coverage_long[['vaccine', 'year']] = (
    coverage_long['vaccine_year'].str.split('_', expand=True)
)
coverage_long['year'] = coverage_long['year'].astype(int)
coverage_long = coverage_long.drop(columns='vaccine_year')

print(coverage_long[['country', 'vaccine', 'year', 'coverage']].head())
   country vaccine  year  coverage
0  Nigeria    mcv1  2018        42
1 Ethiopia    mcv1  2018        62
2      DRC    mcv1  2018        57
3    India    mcv1  2018        90
4 Bangladesh  mcv1  2018        94

Clean. One observation per row. Amara can now filter, group, and aggregate freely.

Step 2: Merging with Population for Weighted Averages

Regional averages should be population-weighted — Nigeria's 34 million children under 5 should carry more weight than France's 3.7 million. Amara merges on ISO3 codes:

# Check types before merging
print(f"Coverage iso3 dtype: {coverage_long['iso3'].dtype}")
print(f"Population iso3 dtype: {population['iso3'].dtype}")

# Merge
enriched = pd.merge(
    coverage_long,
    population,
    on='iso3',
    how='left',
    validate='many_to_one'
)

print(f"Rows after merge: {len(enriched)}")
print(f"NaN in under5_pop: {enriched['under5_pop'].isna().sum()}")
Coverage iso3 dtype: object
Population iso3 dtype: object
Rows after merge: 144
NaN in under5_pop: 0

Types match. 144 rows in, 144 out. Zero NaN. A clean merge.

Step 3: Merging with GDP Data

The GDP data uses country_name while her coverage data uses country. Amara needs to be careful:

# Check for alignment
coverage_countries = set(enriched['country'].unique())
gdp_countries = set(gdp_data['country_name'].unique())

print(f"In coverage but not GDP: {coverage_countries - gdp_countries}")
print(f"In GDP but not coverage: {gdp_countries - coverage_countries}")
In coverage but not GDP: set()
In GDP but not coverage: set()

All names match perfectly (she got lucky — in real WHO data, this is rarely the case). She merges using the different column names:

enriched = pd.merge(
    enriched,
    gdp_data,
    left_on='country',
    right_on='country_name',
    how='left'
).drop(columns='country_name')

print(enriched.columns.tolist())
['country', 'iso3', 'who_region', 'coverage', 'vaccine', 'year',
 'under5_pop', 'gdp_per_capita']

Step 4: The Director's Summary Tables

Amara's director wants three specific outputs.

Output 1: Regional coverage by year (population-weighted).

def weighted_mean(group):
    return (group['coverage'] * group['under5_pop']).sum() / group['under5_pop'].sum()

regional_trend = (
    enriched[enriched['vaccine'] == 'mcv1']
    .groupby(['who_region', 'year'])
    .apply(weighted_mean, include_groups=False)
    .reset_index()
    .rename(columns={0: 'weighted_coverage'})
    .pivot_table(index='who_region', columns='year', values='weighted_coverage')
    .round(1)
)

print(regional_trend)
year          2018  2019  2020  2021  2022  2023
who_region
AFRO          49.4  55.7  50.5  46.6  45.6  51.5
AMRO          89.4  91.5  80.8  82.5  85.4  86.8
EMRO          80.6  80.7  79.0  79.1  81.4  82.3
EURO          93.9  93.9  88.5  89.4  91.2  93.9
SEARO         88.6  93.1  87.3  86.4  90.5  91.4

The COVID-19 dip in 2020-2021 is immediately visible across every region. AFRO was hit hardest (dropping from 55.7 to 46.6), while EMRO was most stable. By 2023, EURO has recovered to pre-pandemic levels, but AFRO and AMRO are still below their 2019 peaks.

Output 2: Pre-pandemic vs. post-pandemic comparison.

comparison = (
    enriched[enriched['vaccine'] == 'mcv1']
    .assign(period=lambda x: x['year'].map(
        lambda y: 'Pre-COVID (2018-2019)' if y <= 2019 else 'Post-COVID (2022-2023)'
    ))
    .query('period != "nan"')
    .query('year in [2018, 2019, 2022, 2023]')
    .groupby(['who_region', 'period'])['coverage']
    .mean()
    .reset_index()
    .pivot_table(index='who_region', columns='period', values='coverage')
    .round(1)
)

comparison['change'] = (
    comparison['Post-COVID (2022-2023)'] - comparison['Pre-COVID (2018-2019)']
).round(1)

print(comparison)
period      Post-COVID (2022-2023)  Pre-COVID (2018-2019)  change
who_region
AFRO                          48.8                   53.8    -5.0
AMRO                          86.8                   90.0    -3.2
EMRO                          85.3                   83.5     1.8
EURO                          92.3                   93.5    -1.2
SEARO                         89.7                   86.8     2.8

This pivot table is exactly what the director needs: clean, scannable, with the story right there in the change column. AFRO and AMRO have not recovered to pre-pandemic levels. SEARO actually improved, driven by India's strong recovery.

Output 3: Coverage by income level.

enriched['income_tier'] = pd.cut(
    enriched['gdp_per_capita'],
    bins=[0, 1500, 5000, 15000, 60000],
    labels=['Low', 'Lower-middle', 'Upper-middle', 'High']
)

income_summary = (
    enriched[enriched['vaccine'] == 'mcv1']
    .groupby(['income_tier', 'year'])['coverage']
    .mean()
    .reset_index()
    .pivot_table(index='income_tier', columns='year', values='coverage')
    .round(1)
)

print(income_summary)
year           2018  2019  2020  2021  2022  2023
income_tier
Low            52.0  56.0  49.5  45.5  46.5  51.5
Lower-middle   76.5  78.3  73.8  72.5  76.3  77.8
Upper-middle   89.0  91.0  81.5  83.5  86.0  87.5
High           93.5  93.5  88.0  89.0  91.0  93.5

The income gradient is stark: high-income countries maintained coverage above 88% even during the pandemic dip, while low-income countries dropped below 50%.

Step 5: Method-Chained Pipeline for the Final Report

For the final deliverable, Amara writes a single pipeline that produces the core analysis artifact — a vaccine-by-region summary with trend indicators:

final_report = (
    enriched
    .query('year in [2019, 2023]')
    .groupby(['who_region', 'vaccine', 'year'])['coverage']
    .mean()
    .reset_index()
    .pivot_table(
        index=['who_region', 'vaccine'],
        columns='year',
        values='coverage'
    )
    .assign(
        recovery=lambda x: x[2023] - x[2019],
        recovered=lambda x: x[2023] >= x[2019]
    )
    .round(1)
    .sort_values('recovery')
)

print(final_report)
year                2019  2023  recovery  recovered
who_region vaccine
AFRO       mcv1     56.3  51.7      -4.7      False
AFRO       dtp3     65.7  55.0     -10.7      False
AMRO       mcv1     91.0  87.5      -3.5      False
AMRO       dtp3     87.5  85.5      -2.0      False
EURO       mcv1     93.5  93.5       0.0       True
EMRO       dtp3     85.0  86.0       1.0       True
EMRO       mcv1     84.0  85.5       1.5       True
EURO       dtp3     95.0  93.5      -1.5      False
SEARO      dtp3     89.7  90.7       1.0       True
SEARO      mcv1     88.3  90.0       1.7       True

The story is clear: AFRO is in crisis (DTP3 coverage dropped nearly 11 points from 2019 to 2023), while SEARO and EMRO have actually improved beyond pre-pandemic levels.

Amara's Reflection

In her notebook, Amara writes a reflection for future analysts who might inherit this work:

"This analysis required four main structural transformations: melting 12 wide coverage columns into long format, merging three separate data sources (coverage, population, GDP), grouping by region/year/income to compute summaries, and pivoting summaries back to wide format for presentation.

The most critical step was the melt. Without it, I couldn't filter by year, couldn't group by vaccine type, and couldn't merge by year with any supplementary dataset. The wide format was designed for human reading; the long format was designed for computation. Knowing when to use which is the whole skill.

The merge was straightforward this time because the ISO3 codes matched perfectly. In real WHO work, country names are a nightmare — 'Cote d'Ivoire' vs 'Ivory Coast', 'Lao PDR' vs 'Laos', 'Bolivia (Plurinational State of)' vs 'Bolivia.' Always merge on ISO3 codes when possible. When you can't, use a lookup table and check for mismatches before merging.

The weighted average computation was essential for honest reporting. An unweighted average of AFRO countries would overweight small countries like Djibouti relative to Nigeria. Population weighting ensures the summary reflects the actual health experience of children in the region."

The Data Science Takeaway

This case study demonstrates the full reshape-merge-group-pivot cycle that characterizes real analytical work:

  1. Melt to convert human-readable wide format into analysis-ready long format
  2. Merge to combine data from multiple sources using appropriate keys
  3. GroupBy to compute weighted regional summaries
  4. Pivot to convert analytical results back into human-readable report tables

Amara's workflow also illustrates two principles that apply far beyond health data:

  • The shape you receive data in is almost never the shape you need for analysis. Reshaping is not a preliminary step — it IS the work.
  • Summary tables for reports need a different shape than analysis tables. You'll often melt for computation and pivot back for presentation. Being fluent in both directions is what separates a data analyst from someone who just knows pandas syntax.