Case Study 2: Building a Country Health Profile Database — Nested Data for Global Health
Tier 2 — Attributed Findings: This case study uses real-world health indicator categories and organizational structures from the World Health Organization (WHO) and the World Bank. Specific numeric values used in code examples are approximate and drawn from publicly available sources (WHO Global Health Observatory, World Bank Open Data) for illustration. The dataset structure and analysis workflow are designed to demonstrate Chapter 5 concepts. Elena is a fictional character from the book's anchor examples.
The Problem
Elena is a public health analyst who has been asked to prepare a briefing on global health disparities. Her supervisor wants answers to questions like:
- "Which countries in Africa have the lowest life expectancy?"
- "How do vaccination rates compare across WHO regions?"
- "Is there a relationship between a country's income group and its maternal mortality ratio?"
The data she needs exists — the WHO publishes it, the World Bank publishes it — but it is scattered across multiple files in different formats. She has a CSV file with basic country indicators, a JSON file with WHO region assignments and income group classifications, and her own notes on a few countries she has researched in depth.
Her task: organize all of this into a single Python data structure that she can query programmatically. No pandas yet — just pure Python, dictionaries, lists, and the file-reading skills from Chapter 5.
Step 1: Designing the Data Structure
Before writing any code, Elena sketches what she wants the data to look like. Each country needs:
- Basic info: name, ISO code, WHO region, income group
- Demographics: population, median age, urban percentage
- Health indicators: life expectancy, maternal mortality ratio, vaccination rates (COVID-19, measles, polio)
She designs a nested dictionary structure:
```python
# One country profile
ethiopia = {
    "name": "Ethiopia",
    "iso_code": "ETH",
    "region": "Africa",
    "income_group": "Low income",
    "demographics": {
        "population": 120_283_000,
        "median_age": 19.5,
        "urban_pct": 21.7
    },
    "health": {
        "life_expectancy": 66.6,
        "maternal_mortality_ratio": 267,
        "infant_mortality_rate": 35.7,
        "vaccination_rates": {
            "covid_19": 57.1,
            "measles": 57.0,
            "polio": 87.2
        }
    }
}
```
This mirrors how the data is actually organized: a country has demographics (one sub-group of related fields) and health indicators (another sub-group), and some of those health indicators are themselves grouped (vaccination rates by disease).
Elena asks herself: "Can I answer my supervisor's questions using this structure?" She traces through each question:
- "Which countries in Africa have the lowest life expectancy?" — She can filter by
country["region"] == "Africa"and sort bycountry["health"]["life_expectancy"]. Yes. - "How do vaccination rates compare across WHO regions?" — She can group by region and compute average
country["health"]["vaccination_rates"]["covid_19"]. Yes. - "Is there a relationship between income group and maternal mortality?" — She can group by
country["income_group"]and comparecountry["health"]["maternal_mortality_ratio"]. Yes.
The structure supports her questions. She proceeds.
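Each of those traces is just a chained dictionary lookup. A minimal sketch, using a trimmed-down version of the Ethiopia profile (only the fields needed for the trace):

```python
# A trimmed version of the Ethiopia profile, just enough to trace the lookups
ethiopia = {
    "name": "Ethiopia",
    "region": "Africa",
    "income_group": "Low income",
    "health": {
        "life_expectancy": 66.6,
        "maternal_mortality_ratio": 267,
        "vaccination_rates": {"covid_19": 57.1},
    },
}

# One chained lookup per supervisor question
print(ethiopia["region"] == "Africa")                       # filter test
print(ethiopia["health"]["life_expectancy"])                # sort key
print(ethiopia["health"]["vaccination_rates"]["covid_19"])  # group value
```

Each extra level of nesting costs exactly one more pair of brackets, which is the price of the hierarchical design.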
Step 2: Loading the CSV Data
Elena's CSV file, country_indicators.csv, contains basic data:
```
iso_code,name,life_expectancy,maternal_mortality,infant_mortality,population,covid_vax_rate,measles_vax_rate
ETH,Ethiopia,66.6,267,35.7,120283000,57.1,57.0
BRA,Brazil,75.9,72,12.4,214326223,80.2,83.0
JPN,Japan,84.8,5,1.7,125124989,83.5,97.0
NGA,Nigeria,52.7,1047,70.6,218541212,12.7,56.0
DEU,Germany,80.6,4,3.1,83294633,77.8,97.0
IND,India,70.8,103,25.5,1417173173,72.4,89.0
CAN,Canada,82.7,11,4.3,38929902,83.4,92.0
TCD,Chad,52.5,856,67.8,17723315,8.4,41.0
```
She reads it:
```python
import csv

raw_records = []
with open("country_indicators.csv", "r") as f:
    reader = csv.DictReader(f)
    for row in reader:
        raw_records.append(row)

# Quick sanity check
print(f"Loaded {len(raw_records)} countries")
print(f"Columns: {list(raw_records[0].keys())}")
print(f"First record: {raw_records[0]}")
```
She immediately spots the issue: all values are strings. "267" is not the number 267. She writes a cleaning function:
```python
def clean_csv_record(raw):
    """Convert a raw CSV row (all strings) to properly typed values."""
    return {
        "iso_code": raw["iso_code"],
        "name": raw["name"],
        "life_expectancy": float(raw["life_expectancy"]),
        "maternal_mortality": int(raw["maternal_mortality"]),
        "infant_mortality": float(raw["infant_mortality"]),
        "population": int(raw["population"]),
        "covid_vax_rate": float(raw["covid_vax_rate"]),
        "measles_vax_rate": float(raw["measles_vax_rate"]),
    }

cleaned = [clean_csv_record(r) for r in raw_records]
print(f"Ethiopia life expectancy: {cleaned[0]['life_expectancy']} (type: {type(cleaned[0]['life_expectancy'])})")
# Ethiopia life expectancy: 66.6 (type: <class 'float'>)
```
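Her cleaner assumes every cell is present and well formed; a single blank or "n/a" would crash the conversion. A more defensive variant, sketched here with a helper name (`to_float`) of our own invention, returns `None` for malformed cells instead of raising:

```python
def to_float(value):
    """Convert a CSV cell to float, or None if the cell is blank or malformed."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

print(to_float("66.6"))  # 66.6
print(to_float(""))      # None
print(to_float("n/a"))   # None
```

Whether `None` is the right fallback depends on the analysis; the point is that real CSV files rarely convert cleanly on the first try.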
Step 3: Loading the JSON Metadata
The JSON file country_metadata.json contains region and income group information:
```json
{
  "ETH": {"region": "Africa", "income_group": "Low income", "median_age": 19.5, "urban_pct": 21.7},
  "BRA": {"region": "Americas", "income_group": "Upper middle income", "median_age": 33.2, "urban_pct": 87.6},
  "JPN": {"region": "Western Pacific", "income_group": "High income", "median_age": 48.6, "urban_pct": 91.9},
  "NGA": {"region": "Africa", "income_group": "Lower middle income", "median_age": 18.1, "urban_pct": 52.0},
  "DEU": {"region": "Europe", "income_group": "High income", "median_age": 45.7, "urban_pct": 77.5},
  "IND": {"region": "South-East Asia", "income_group": "Lower middle income", "median_age": 28.2, "urban_pct": 35.4},
  "CAN": {"region": "Americas", "income_group": "High income", "median_age": 41.8, "urban_pct": 81.6},
  "TCD": {"region": "Africa", "income_group": "Low income", "median_age": 16.6, "urban_pct": 23.5}
}
```
Elena loads it:
```python
import json

with open("country_metadata.json", "r") as f:
    metadata = json.load(f)

print(f"Metadata for {len(metadata)} countries")
print(f"Brazil's region: {metadata['BRA']['region']}")
# Brazil's region: Americas
```
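Before merging, it is worth confirming that the two sources actually agree on their ISO codes. A sketch using set differences (with inline stand-ins for the loaded data rather than the files themselves):

```python
# Inline stand-ins for the key sets of the two loaded sources
csv_codes = {"ETH", "BRA", "JPN", "NGA", "DEU", "IND", "CAN", "TCD"}
meta_codes = {"ETH", "BRA", "JPN", "NGA", "DEU", "IND", "CAN", "TCD"}

missing_meta = csv_codes - meta_codes  # in the CSV but not in the metadata
extra_meta = meta_codes - csv_codes    # in the metadata but not in the CSV
print(f"Missing metadata for: {sorted(missing_meta) or 'none'}")
print(f"Unused metadata for: {sorted(extra_meta) or 'none'}")
```

In this dataset both differences are empty, but running the check first means a mismatched code surfaces as a named problem rather than a silent `"Unknown"` region downstream.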
Step 4: Merging into the Nested Structure
Now comes the critical step: combining the CSV data and the JSON metadata into her designed nested structure.
```python
countries = []
for record in cleaned:
    iso = record["iso_code"]
    # Get metadata for this country (with safe fallback)
    meta = metadata.get(iso, {})
    country_profile = {
        "name": record["name"],
        "iso_code": iso,
        "region": meta.get("region", "Unknown"),
        "income_group": meta.get("income_group", "Unknown"),
        "demographics": {
            "population": record["population"],
            "median_age": meta.get("median_age", None),
            "urban_pct": meta.get("urban_pct", None),
        },
        "health": {
            "life_expectancy": record["life_expectancy"],
            "maternal_mortality_ratio": record["maternal_mortality"],
            "infant_mortality_rate": record["infant_mortality"],
            "vaccination_rates": {
                "covid_19": record["covid_vax_rate"],
                "measles": record["measles_vax_rate"],
            }
        }
    }
    countries.append(country_profile)

print(f"Built {len(countries)} country profiles")
```
She also builds a lookup dictionary for quick access by name:
```python
by_name = {c["name"]: c for c in countries}
by_iso = {c["iso_code"]: c for c in countries}

# Quick lookup
print(by_name["Japan"]["health"]["life_expectancy"])  # 84.8
print(by_iso["NGA"]["region"])  # "Africa"
```
This double-dictionary pattern — one for lookup by name, one for lookup by ISO code — is common in real data work where the same entity has multiple identifiers.
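One property of the pattern worth knowing: both dictionaries hold references to the same profile objects, not copies, so an update made through one index is visible through the other. A small sketch with simplified profiles:

```python
countries = [
    {"name": "Japan", "iso_code": "JPN", "region": "Western Pacific"},
    {"name": "Canada", "iso_code": "CAN", "region": "Americas"},
]
by_name = {c["name"]: c for c in countries}
by_iso = {c["iso_code"]: c for c in countries}

by_iso["JPN"]["region"] = "WPRO"   # update via one index...
print(by_name["Japan"]["region"])  # ...visible via the other: WPRO
```

This shared-reference behavior is usually what you want (one source of truth, two access paths), but it can surprise you if you expected the indexes to be independent snapshots.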
Step 5: Answering the Questions
With her data structure built, Elena can now answer her supervisor's questions.
Question 1: Which countries in Africa have the lowest life expectancy?
```python
african = [c for c in countries if c["region"] == "Africa"]
african_sorted = sorted(african, key=lambda c: c["health"]["life_expectancy"])

print("African countries by life expectancy (lowest first):")
for c in african_sorted:
    le = c["health"]["life_expectancy"]
    print(f"  {c['name']}: {le} years")
```
Output:

```
African countries by life expectancy (lowest first):
  Chad: 52.5 years
  Nigeria: 52.7 years
  Ethiopia: 66.6 years
```
Question 2: How do vaccination rates compare across regions?
```python
# Group countries by region and compute average COVID vaccination rate
region_rates = {}
for c in countries:
    region = c["region"]
    rate = c["health"]["vaccination_rates"]["covid_19"]
    if region not in region_rates:
        region_rates[region] = []
    region_rates[region].append(rate)

print("Average COVID-19 vaccination rate by WHO region:")
for region in sorted(region_rates):
    rates = region_rates[region]
    avg = sum(rates) / len(rates)
    print(f"  {region}: {avg:.1f}% (n={len(rates)})")
```
Output:

```
Average COVID-19 vaccination rate by WHO region:
  Africa: 26.1% (n=3)
  Americas: 81.8% (n=2)
  Europe: 77.8% (n=1)
  South-East Asia: 72.4% (n=1)
  Western Pacific: 83.5% (n=1)
```
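The "check whether the key exists, then create an empty list" dance can be shortened with `collections.defaultdict`, which creates the empty list automatically on first access. A sketch with a few inline (region, rate) pairs standing in for the full `countries` list:

```python
from collections import defaultdict

# Inline stand-ins: (region, covid_19 rate) pairs from the dataset
records = [
    ("Africa", 57.1), ("Africa", 12.7), ("Africa", 8.4),
    ("Americas", 80.2), ("Americas", 83.4),
]

region_rates = defaultdict(list)  # a missing key starts as an empty list
for region, rate in records:
    region_rates[region].append(rate)

for region in sorted(region_rates):
    rates = region_rates[region]
    print(f"{region}: {sum(rates) / len(rates):.1f}% (n={len(rates)})")
# Africa: 26.1% (n=3)
# Americas: 81.8% (n=2)
```

Either form is fine; the explicit `if region not in region_rates` version makes the grouping mechanics visible, which is the point of doing it by hand in this chapter.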
Elena notes a limitation: with only 8 countries, some regions have just one representative. The pattern is suggestive but not statistically meaningful. She adds a note to her briefing: "Full analysis requires the complete 194-country dataset."
Question 3: Income group and maternal mortality
```python
# Group by income group
income_maternal = {}
for c in countries:
    group = c["income_group"]
    mmr = c["health"]["maternal_mortality_ratio"]
    if group not in income_maternal:
        income_maternal[group] = []
    income_maternal[group].append({"name": c["name"], "mmr": mmr})

print("Maternal mortality ratio by income group:")
for group in sorted(income_maternal):
    entries = income_maternal[group]
    avg_mmr = sum(e["mmr"] for e in entries) / len(entries)
    print(f"\n  {group} (avg MMR: {avg_mmr:.0f}):")
    for e in entries:
        print(f"    {e['name']}: {e['mmr']}")
```
Output:

```
Maternal mortality ratio by income group:

  High income (avg MMR: 7):
    Japan: 5
    Germany: 4
    Canada: 11

  Low income (avg MMR: 562):
    Ethiopia: 267
    Chad: 856

  Lower middle income (avg MMR: 575):
    Nigeria: 1047
    India: 103

  Upper middle income (avg MMR: 72):
    Brazil: 72
```
The pattern is stark: low-income countries have maternal mortality ratios that are roughly 80 times higher than high-income countries. Elena knows correlation is not causation — income group and maternal mortality are both shaped by a web of factors including healthcare infrastructure, education, geographic challenges, and historical underinvestment — but the disparity is real and measurable.
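The "roughly 80 times" figure follows directly from the group averages above; a quick arithmetic check:

```python
low_income_avg = (267 + 856) / 2    # Ethiopia, Chad -> 561.5
high_income_avg = (5 + 4 + 11) / 3  # Japan, Germany, Canada -> ~6.67
print(f"Ratio: {low_income_avg / high_income_avg:.0f}x")  # Ratio: 84x
```

The exact ratio depends on rounding (84x with unrounded averages, 80x with the rounded values 562 and 7), which is why the briefing says "roughly 80 times" rather than quoting a falsely precise number.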
Step 6: Saving the Unified Dataset
Elena saves her merged data as a JSON file for future use:
```python
import json

output = {
    "description": "Country health profiles — merged from WHO and World Bank sources",
    "sources": ["country_indicators.csv", "country_metadata.json"],
    "last_updated": "2024-03-15",
    "countries": countries
}

with open("country_profiles.json", "w") as f:
    json.dump(output, f, indent=2)

print("Unified dataset saved to country_profiles.json")
```
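One reason JSON suits this structure: the nested dictionaries round-trip exactly through serialization. A sketch using a string instead of a file (strings sidestep the filesystem for a quick check):

```python
import json

# A miniature version of the output structure
output = {"countries": [{"name": "Brazil", "health": {"life_expectancy": 75.9}}]}

text = json.dumps(output, indent=2)  # serialize to a JSON string
reloaded = json.loads(text)          # parse it back

print(reloaded == output)  # True: nesting, keys, and values all survive
```

(The main caveats, not hit here: JSON has no tuples or sets, and all keys become strings.)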
She also writes a summary CSV for her supervisor, who prefers spreadsheets:
```python
import csv

with open("health_summary.csv", "w", newline="") as f:
    fieldnames = ["name", "region", "income_group", "life_expectancy",
                  "maternal_mortality_ratio", "covid_vax_rate"]
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    for c in countries:
        writer.writerow({
            "name": c["name"],
            "region": c["region"],
            "income_group": c["income_group"],
            "life_expectancy": c["health"]["life_expectancy"],
            "maternal_mortality_ratio": c["health"]["maternal_mortality_ratio"],
            "covid_vax_rate": c["health"]["vaccination_rates"]["covid_19"],
        })

print("Summary CSV saved")
```
Notice the flattening: the nested structure is perfect for Python processing, but a CSV requires flat rows. Elena must explicitly extract the nested values into top-level fields. This tension between hierarchical data (JSON) and flat data (CSV) is something she will encounter repeatedly.
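The hand-written field extraction can be generalized. A sketch of a recursive flattening helper (our own illustration, not part of Elena's code) that joins nested keys with underscores:

```python
def flatten(d, prefix=""):
    """Flatten a nested dict, joining nested keys with underscores."""
    flat = {}
    for key, value in d.items():
        full_key = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into the sub-dictionary, extending the key prefix
            flat.update(flatten(value, prefix=f"{full_key}_"))
        else:
            flat[full_key] = value
    return flat

profile = {"name": "Chad", "health": {"vaccination_rates": {"measles": 41.0}}}
print(flatten(profile))
# {'name': 'Chad', 'health_vaccination_rates_measles': 41.0}
```

The tradeoff is the usual one: the helper handles any nesting depth automatically, while Elena's explicit `writer.writerow({...})` makes the chosen columns, and their order, obvious at a glance.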
What Elena Learned
| Challenge | Concept Applied |
|---|---|
| Designing a data representation | Nested dictionaries matching real-world data hierarchy |
| Reading from multiple sources | csv.DictReader for CSV, json.load for JSON |
| Type conversion | Strings from CSV to int/float |
| Merging data from two sources | Matching records by ISO code using .get() |
| Safe access to potentially missing data | .get() with default values |
| Grouping and aggregation | Dictionary-based grouping pattern |
| Answering questions programmatically | Comprehensions, loops, and sorting |
| Output in multiple formats | JSON for hierarchical data, CSV for flat summaries |
The Bigger Picture
Elena's work in this case study is a preview of what pandas will make much easier in Chapter 7. Her manual grouping-and-averaging code takes 8-10 lines; in pandas, it will be a single line: df.groupby("income_group")["maternal_mortality_ratio"].mean(). The file merging she did by hand will become pd.merge().
But there is value in doing it manually first. Elena now understands what grouping means — she has looped through records, built dictionaries of lists, and computed averages herself. When pandas does it in one line, she will know what that one line is actually doing. And when something goes wrong — when the merge produces unexpected results or the groupby returns NaN — she will have the mental model to diagnose the problem.
The progression is deliberate: Chapter 5 teaches you the mechanics. Chapter 7 teaches you the shortcuts. The mechanics make the shortcuts meaningful.
Discussion Questions
1. **Data source trust.** Elena merged data from two sources (a CSV and a JSON file) assuming they used the same ISO codes. What could go wrong? What would happen if one source used "USA" and the other used "US" for the United States?
2. **Missing data choices.** Elena used `.get(key, None)` for metadata that might be missing. What are the implications of storing `None` versus leaving the key out of the dictionary entirely? How might each choice affect downstream analysis?
3. **Hierarchical vs. flat.** Elena's nested dictionary structure is intuitive for humans but requires flattening for CSV output. When is hierarchical organization worth the complexity? When would a flat structure be simpler and just as effective?
4. **Sample size warnings.** Elena noted that some regions had only one country in her 8-country sample. How should she communicate this limitation to her supervisor? What is the risk of presenting regional averages based on one or two countries?
5. **The ethics of aggregation.** When Elena reports "average maternal mortality by income group," she is collapsing the experiences of millions of women into a few numbers. What is gained and what is lost through this aggregation? How should she balance the need for summary statistics with the need to preserve individual country stories?
Try It Yourself
Build your own version of Elena's project using a different domain:
- Choose a domain (sports, music, movies, environmental data, or anything with accessible data).
- Create two data files: a CSV with numeric indicators and a JSON file with categorical metadata.
- Write Python code to read both files, merge them into a nested data structure, and answer at least two specific questions.
- Save results in both JSON and CSV format.
The specific data matters less than the process: designing a structure, reading from files, merging, querying, and saving. This pipeline is the skeleton of every data analysis you will ever do.