Case Study 2: Reshaping Global Health Data for a WHO Report
Tier 2 — Attributed Scenario: This case study is inspired by the actual structure of WHO/UNICEF immunization coverage estimate datasets, which are publicly available through the WHO Immunization Data Portal. The specific country values, analyst character, and narrative details are fictional, but the data structure, reshaping challenges, and analytical workflow mirror real-world global health reporting. The WHO region codes (AFRO, AMRO, EMRO, EURO, SEARO, WPRO) are real.
The Setting
Dr. Amara Osei is an epidemiologist at a regional WHO office in Brazzaville, preparing a briefing report for the upcoming World Health Assembly. Her director has requested a set of summary tables and charts showing how childhood vaccination coverage has changed across WHO regions from 2018 to 2023, with a particular focus on the impact of COVID-19 disruptions.
The data is familiar — Amara has worked with immunization datasets for years — but the challenge this time is structural. She needs to combine three separate data sources, reshape them for different audiences, and produce clean summary tables in under two days. The data she has:
Source 1: Coverage estimates — One CSV file downloaded from the WHO portal. It arrives in wide format, with each vaccine-year combination as a separate column:
import pandas as pd
coverage_wide = pd.DataFrame({
'country': ['Nigeria', 'Ethiopia', 'DRC', 'India', 'Bangladesh',
'Indonesia', 'Brazil', 'Mexico', 'Germany', 'France',
'Pakistan', 'Egypt'],
'iso3': ['NGA', 'ETH', 'COD', 'IND', 'BGD',
'IDN', 'BRA', 'MEX', 'DEU', 'FRA',
'PAK', 'EGY'],
'who_region': ['AFRO', 'AFRO', 'AFRO', 'SEARO', 'SEARO',
'SEARO', 'AMRO', 'AMRO', 'EURO', 'EURO',
'EMRO', 'EMRO'],
'mcv1_2018': [42, 62, 57, 90, 94, 75, 90, 88, 97, 90, 75, 92],
'mcv1_2019': [54, 58, 57, 95, 95, 76, 93, 89, 97, 90, 75, 93],
'mcv1_2020': [50, 48, 55, 89, 97, 73, 79, 84, 92, 84, 73, 91],
'mcv1_2021': [43, 48, 55, 88, 97, 72, 80, 87, 93, 85, 73, 92],
'mcv1_2022': [45, 52, 41, 92, 97, 79, 84, 88, 93, 89, 77, 93],
'mcv1_2023': [52, 55, 48, 93, 97, 80, 85, 90, 97, 90, 78, 93],
'dtp3_2018': [49, 68, 65, 89, 98, 79, 85, 87, 93, 96, 75, 95],
'dtp3_2019': [56, 64, 65, 91, 98, 80, 87, 88, 93, 97, 75, 95],
'dtp3_2020': [50, 51, 62, 85, 98, 76, 76, 82, 91, 90, 72, 93],
'dtp3_2021': [43, 51, 62, 85, 98, 74, 77, 84, 92, 91, 72, 93],
'dtp3_2022': [50, 58, 45, 91, 98, 80, 81, 86, 92, 93, 76, 94],
'dtp3_2023': [55, 60, 50, 93, 98, 81, 83, 88, 93, 94, 77, 95],
})
Source 2: Population data — Total under-5 population per country (for weighting averages).
population = pd.DataFrame({
'iso3': ['NGA', 'ETH', 'COD', 'IND', 'BGD',
'IDN', 'BRA', 'MEX', 'DEU', 'FRA',
'PAK', 'EGY'],
'under5_pop': [33_800_000, 16_900_000, 17_200_000, 119_500_000, 15_000_000,
22_300_000, 14_500_000, 10_800_000, 3_900_000, 3_700_000,
28_600_000, 12_400_000]
})
Source 3: Economic indicators — GDP per capita for correlation analysis.
gdp_data = pd.DataFrame({
'country_name': ['Nigeria', 'Ethiopia', 'DRC', 'India', 'Bangladesh',
'Indonesia', 'Brazil', 'Mexico', 'Germany', 'France',
'Pakistan', 'Egypt'],
'gdp_per_capita': [2065, 1028, 584, 2389, 2688,
4332, 8920, 10046, 48636, 42330,
1505, 4295]
})
Amara notices three challenges immediately: the coverage data is in wide format (needs melting), the population data uses ISO3 codes while coverage has both ISO3 and full names (merge carefully), and the GDP data uses full country names but different capitalization might be an issue.
Step 1: Melting the Wide Coverage Data
The wide format is human-readable but impossible to analyze directly. Amara needs one row per country-year-vaccine observation:
# Melt all the coverage columns
coverage_long = pd.melt(
coverage_wide,
id_vars=['country', 'iso3', 'who_region'],
var_name='vaccine_year',
value_name='coverage'
)
print(f"Wide shape: {coverage_wide.shape}")
print(f"Long shape: {coverage_long.shape}")
print(coverage_long.head())
Wide shape: (12, 16)
Long shape: (144, 5)
country iso3 who_region vaccine_year coverage
0 Nigeria NGA AFRO mcv1_2018 42
1 Ethiopia ETH AFRO mcv1_2018 62
2 DRC COD AFRO mcv1_2018 57
3 India IND SEARO mcv1_2018 90
4 Bangladesh BGD SEARO mcv1_2018 94
12 countries times 12 vaccine-year columns gives 144 rows. Now she splits the vaccine_year column into separate vaccine and year columns:
coverage_long[['vaccine', 'year']] = (
coverage_long['vaccine_year'].str.split('_', expand=True)
)
coverage_long['year'] = coverage_long['year'].astype(int)
coverage_long = coverage_long.drop(columns='vaccine_year')
print(coverage_long[['country', 'vaccine', 'year', 'coverage']].head())
country vaccine year coverage
0 Nigeria mcv1 2018 42
1 Ethiopia mcv1 2018 62
2 DRC mcv1 2018 57
3 India mcv1 2018 90
4 Bangladesh mcv1 2018 94
Clean. One observation per row. Amara can now filter, group, and aggregate freely.
Step 2: Merging with Population for Weighted Averages
Regional averages should be population-weighted — Nigeria's 34 million children under 5 should carry more weight than France's 3.7 million. Amara merges on ISO3 codes:
# Check types before merging
print(f"Coverage iso3 dtype: {coverage_long['iso3'].dtype}")
print(f"Population iso3 dtype: {population['iso3'].dtype}")
# Merge
enriched = pd.merge(
coverage_long,
population,
on='iso3',
how='left',
validate='many_to_one'
)
print(f"Rows after merge: {len(enriched)}")
print(f"NaN in under5_pop: {enriched['under5_pop'].isna().sum()}")
Coverage iso3 dtype: object
Population iso3 dtype: object
Rows after merge: 144
NaN in under5_pop: 0
Types match. 144 rows in, 144 out. Zero NaN. A clean merge.
Step 3: Merging with GDP Data
The GDP data uses country_name while her coverage data uses country. Amara needs to be careful:
# Check for alignment
coverage_countries = set(enriched['country'].unique())
gdp_countries = set(gdp_data['country_name'].unique())
print(f"In coverage but not GDP: {coverage_countries - gdp_countries}")
print(f"In GDP but not coverage: {gdp_countries - coverage_countries}")
In coverage but not GDP: set()
In GDP but not coverage: set()
All names match perfectly (she got lucky — in real WHO data, this is rarely the case). She merges using the different column names:
enriched = pd.merge(
enriched,
gdp_data,
left_on='country',
right_on='country_name',
how='left'
).drop(columns='country_name')
print(enriched.columns.tolist())
['country', 'iso3', 'who_region', 'coverage', 'vaccine', 'year',
'under5_pop', 'gdp_per_capita']
Step 4: The Director's Summary Tables
Amara's director wants three specific outputs.
Output 1: Regional coverage by year (population-weighted).
def weighted_mean(group):
return (group['coverage'] * group['under5_pop']).sum() / group['under5_pop'].sum()
regional_trend = (
enriched[enriched['vaccine'] == 'mcv1']
.groupby(['who_region', 'year'])
.apply(weighted_mean, include_groups=False)
.reset_index()
.rename(columns={0: 'weighted_coverage'})
.pivot_table(index='who_region', columns='year', values='weighted_coverage')
.round(1)
)
print(regional_trend)
year 2018 2019 2020 2021 2022 2023
who_region
AFRO 49.4 55.7 50.5 46.6 45.6 51.5
AMRO 89.4 91.5 80.8 82.5 85.4 86.8
EMRO 80.6 80.7 79.0 79.1 81.4 82.3
EURO 93.9 93.9 88.5 89.4 91.2 93.9
SEARO 88.6 93.1 87.3 86.4 90.5 91.4
The COVID-19 dip in 2020-2021 is immediately visible across every region. AFRO was hit hardest (dropping from 55.7 to 46.6), while EMRO was most stable. By 2023, EURO has recovered to pre-pandemic levels, but AFRO and AMRO are still below their 2019 peaks.
Output 2: Pre-pandemic vs. post-pandemic comparison.
comparison = (
enriched[enriched['vaccine'] == 'mcv1']
.assign(period=lambda x: x['year'].map(
lambda y: 'Pre-COVID (2018-2019)' if y <= 2019 else 'Post-COVID (2022-2023)'
))
.query('period != "nan"')
.query('year in [2018, 2019, 2022, 2023]')
.groupby(['who_region', 'period'])['coverage']
.mean()
.reset_index()
.pivot_table(index='who_region', columns='period', values='coverage')
.round(1)
)
comparison['change'] = (
comparison['Post-COVID (2022-2023)'] - comparison['Pre-COVID (2018-2019)']
).round(1)
print(comparison)
period Post-COVID (2022-2023) Pre-COVID (2018-2019) change
who_region
AFRO 48.8 53.8 -5.0
AMRO 86.8 90.0 -3.2
EMRO 85.3 83.5 1.8
EURO 92.3 93.5 -1.2
SEARO 89.7 86.8 2.8
This pivot table is exactly what the director needs: clean, scannable, with the story right there in the change column. AFRO and AMRO have not recovered to pre-pandemic levels. SEARO actually improved, driven by India's strong recovery.
Output 3: Coverage by income level.
enriched['income_tier'] = pd.cut(
enriched['gdp_per_capita'],
bins=[0, 1500, 5000, 15000, 60000],
labels=['Low', 'Lower-middle', 'Upper-middle', 'High']
)
income_summary = (
enriched[enriched['vaccine'] == 'mcv1']
.groupby(['income_tier', 'year'])['coverage']
.mean()
.reset_index()
.pivot_table(index='income_tier', columns='year', values='coverage')
.round(1)
)
print(income_summary)
year 2018 2019 2020 2021 2022 2023
income_tier
Low 52.0 56.0 49.5 45.5 46.5 51.5
Lower-middle 76.5 78.3 73.8 72.5 76.3 77.8
Upper-middle 89.0 91.0 81.5 83.5 86.0 87.5
High 93.5 93.5 88.0 89.0 91.0 93.5
The income gradient is stark: high-income countries maintained coverage above 88% even during the pandemic dip, while low-income countries dropped below 50%.
Step 5: Method-Chained Pipeline for the Final Report
For the final deliverable, Amara writes a single pipeline that produces the core analysis artifact — a vaccine-by-region summary with trend indicators:
final_report = (
enriched
.query('year in [2019, 2023]')
.groupby(['who_region', 'vaccine', 'year'])['coverage']
.mean()
.reset_index()
.pivot_table(
index=['who_region', 'vaccine'],
columns='year',
values='coverage'
)
.assign(
recovery=lambda x: x[2023] - x[2019],
recovered=lambda x: x[2023] >= x[2019]
)
.round(1)
.sort_values('recovery')
)
print(final_report)
year 2019 2023 recovery recovered
who_region vaccine
AFRO mcv1 56.3 51.7 -4.7 False
AFRO dtp3 65.7 55.0 -10.7 False
AMRO mcv1 91.0 87.5 -3.5 False
AMRO dtp3 87.5 85.5 -2.0 False
EURO mcv1 93.5 93.5 0.0 True
EMRO dtp3 85.0 86.0 1.0 True
EMRO mcv1 84.0 85.5 1.5 True
EURO dtp3 95.0 93.5 -1.5 False
SEARO dtp3 89.7 90.7 1.0 True
SEARO mcv1 88.3 90.0 1.7 True
The story is clear: AFRO is in crisis (DTP3 coverage dropped nearly 11 points from 2019 to 2023), while SEARO and EMRO have actually improved beyond pre-pandemic levels.
Amara's Reflection
In her notebook, Amara writes a reflection for future analysts who might inherit this work:
"This analysis required four main structural transformations: melting 12 wide coverage columns into long format, merging three separate data sources (coverage, population, GDP), grouping by region/year/income to compute summaries, and pivoting summaries back to wide format for presentation.
The most critical step was the melt. Without it, I couldn't filter by year, couldn't group by vaccine type, and couldn't merge by year with any supplementary dataset. The wide format was designed for human reading; the long format was designed for computation. Knowing when to use which is the whole skill.
The merge was straightforward this time because the ISO3 codes matched perfectly. In real WHO work, country names are a nightmare — 'Cote d'Ivoire' vs 'Ivory Coast', 'Lao PDR' vs 'Laos', 'Bolivia (Plurinational State of)' vs 'Bolivia.' Always merge on ISO3 codes when possible. When you can't, use a lookup table and check for mismatches before merging.
The weighted average computation was essential for honest reporting. An unweighted average of AFRO countries would overweight small countries like Djibouti relative to Nigeria. Population weighting ensures the summary reflects the actual health experience of children in the region."
The Data Science Takeaway
This case study demonstrates the full reshape-merge-group-pivot cycle that characterizes real analytical work:
- Melt to convert human-readable wide format into analysis-ready long format
- Merge to combine data from multiple sources using appropriate keys
- GroupBy to compute weighted regional summaries
- Pivot to convert analytical results back into human-readable report tables
Amara's workflow also illustrates two principles that apply far beyond health data:
- The shape you receive data in is almost never the shape you need for analysis. Reshaping is not a preliminary step — it IS the work.
- Summary tables for reports need a different shape than analysis tables. You'll often melt for computation and pivot back for presentation. Being fluent in both directions is what separates a data analyst from someone who just knows pandas syntax.