Case Study 2: Health Metrics Across Nations — A Statistical Portrait

Contributors to Introduction to Data Science

Tier 2 — Attributed Findings: This case study uses real public health indicators published by the World Health Organization (WHO), the World Bank, and UNICEF. Statistics cited here are drawn from widely published estimates for the approximate period of 2018-2022. Specific country figures may vary slightly between source editions and reporting years. The analysis approach is genuine but simplified for pedagogical purposes; actual WHO reports use more sophisticated statistical methods.

The Setting

Elena has been working on the Global Health Data Explorer project since Chapter 6. She's loaded datasets, cleaned missing values, reshaped tables, and created visualizations. Now it's time for a formal statistical portrait — the kind of descriptive analysis that a public health researcher would include in the first pages of a report, before any modeling or hypothesis testing begins.

Her goal is to describe the landscape of global health using three key metrics:

1. Life expectancy at birth (years)
2. Measles vaccination coverage (% of children receiving at least one dose)
3. Under-five mortality rate (deaths per 1,000 live births)

She wants to answer a simple but fundamental question: How much do health outcomes vary across countries, and what does that variation look like?

The Analysis

Step 1: Load and Inspect the Data

import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# Simulated country-level health data based on WHO/World Bank patterns
# (In your project, replace with actual cleaned data)
np.random.seed(202)

n_countries = 180

# Income groups with realistic proportions
income_groups = np.random.choice(
    ['Low income', 'Lower middle', 'Upper middle', 'High income'],
    size=n_countries,
    p=[0.15, 0.28, 0.30, 0.27]
)

# Generate correlated health metrics
health_data = []
for group in income_groups:
    if group == 'Low income':
        le = np.random.normal(58, 6, 1)
        vacc = np.clip(np.random.normal(60, 18, 1), 15, 99)
        mort = np.clip(np.random.normal(85, 25, 1), 15, 200)
    elif group == 'Lower middle':
        le = np.random.normal(67, 5, 1)
        vacc = np.clip(np.random.normal(75, 12, 1), 30, 99)
        mort = np.clip(np.random.normal(45, 20, 1), 8, 120)
    elif group == 'Upper middle':
        le = np.random.normal(74, 4, 1)
        vacc = np.clip(np.random.normal(88, 8, 1), 50, 99)
        mort = np.clip(np.random.normal(18, 8, 1), 3, 60)
    else:  # High income
        le = np.random.normal(80, 3, 1)
        vacc = np.clip(np.random.normal(93, 4, 1), 70, 99)
        mort = np.clip(np.random.normal(6, 3, 1), 2, 20)

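    # Each draw above is a one-element array, so take element 0 explicitly;
    # calling float() on the array itself is deprecated in recent NumPy.
    health_data.append({
        'income_group': group,
        'life_expectancy': float(le[0]),
        'measles_vacc_pct': float(vacc[0]),
        'under5_mortality': float(mort[0])
    })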

df = pd.DataFrame(health_data)

print("Dataset shape:", df.shape)
print("\nFirst few rows:")
print(df.head(10))
print("\nIncome group distribution:")
print(df['income_group'].value_counts())

Step 2: Global Overview — What Does the World Look Like?

Elena starts with the big picture. Before breaking data into groups, she wants to understand the overall distribution of each metric.

metrics = ['life_expectancy', 'measles_vacc_pct', 'under5_mortality']
labels = ['Life Expectancy (years)', 'Measles Vaccination (%)', 'Under-5 Mortality (per 1,000)']

print("=" * 70)
print("GLOBAL HEALTH METRICS: DESCRIPTIVE OVERVIEW")
print("=" * 70)

for metric, label in zip(metrics, labels):
    col = df[metric]
    print(f"\n--- {label} ---")
    print(f"  Mean:     {col.mean():.1f}")
    print(f"  Median:   {col.median():.1f}")
    print(f"  Std Dev:  {col.std():.1f}")
    print(f"  IQR:      {col.quantile(0.75) - col.quantile(0.25):.1f}")
    print(f"  Range:    {col.min():.1f} to {col.max():.1f}")
    print(f"  Skewness: {col.skew():.2f}", end="")

    skew_val = col.skew()
    if abs(skew_val) < 0.5:
        print(" (approximately symmetric)")
    elif skew_val > 0:
        print(" (right-skewed)")
    else:
        print(" (left-skewed)")

Elena pauses to interpret the results.

Life expectancy has a mean around 71 years and a median around 73 years — the mean is slightly lower, suggesting left skew. This makes sense: most countries cluster around 70-80 years, but a group of low-income countries with life expectancies in the 50s and 60s pulls the tail leftward.

Measles vaccination has a mean around 80% and a median around 85%. Again, left-skewed — most countries have high coverage, but a tail of low-coverage countries extends leftward.

Under-5 mortality has a mean around 35 and a median around 20. The mean is much higher than the median — this is right-skewed. Most countries have low mortality rates, but a significant number of low-income countries have very high rates, creating a long right tail.
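
The link between the mean, the median, and the skewness statistic can be checked by hand. The sketch below uses a small made-up sample (not Elena's data) and assumes that `Series.skew()` reports the adjusted Fisher-Pearson coefficient, which is how pandas documents it. It rebuilds that number from the central moments and compares the two.

```python
import numpy as np
import pandas as pd

# Made-up right-skewed sample (not Elena's data): mostly low values, a few high ones
sample = pd.Series([4, 5, 6, 7, 8, 10, 12, 20, 45, 90], dtype=float)

n = len(sample)
deviations = sample - sample.mean()

# Biased (divide-by-n) second and third central moments
m2 = (deviations ** 2).mean()
m3 = (deviations ** 3).mean()

# Sample skewness g1, then the small-sample adjustment applied by pandas
g1 = m3 / m2 ** 1.5
adjusted_skew = np.sqrt(n * (n - 1)) / (n - 2) * g1

print(f"Mean:   {sample.mean():.1f}")    # 20.7
print(f"Median: {sample.median():.1f}")  # 9.0
print(f"Skewness by hand: {adjusted_skew:.3f}")
print(f"Series.skew():    {sample.skew():.3f}")
```

A few large values drag the mean well above the median, and the skewness comes out positive, the same pattern Elena sees for under-5 mortality.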

# Visualize all three distributions
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for ax, metric, label, color in zip(axes, metrics, labels,
                                     ['steelblue', 'mediumseagreen', 'coral']):
    col = df[metric]
    ax.hist(col, bins=25, color=color, edgecolor='white', alpha=0.8)
    ax.axvline(col.mean(), color='black', linestyle='--', linewidth=1.5,
               label=f'Mean: {col.mean():.1f}')
    ax.axvline(col.median(), color='black', linestyle=':', linewidth=1.5,
               label=f'Median: {col.median():.1f}')
    ax.set_title(label)
    ax.set_ylabel('Number of Countries')
    ax.legend(fontsize=8)

plt.suptitle('Global Health Metrics: Distribution Overview', fontsize=14)
plt.tight_layout()
plt.savefig('health_distributions.png', dpi=150, bbox_inches='tight')
plt.show()

Step 3: Stratify by Income Group

The global overview told one story. Stratifying by income group tells a much richer one.

print("=" * 70)
print("HEALTH METRICS BY INCOME GROUP")
print("=" * 70)

for metric, label in zip(metrics, labels):
    print(f"\n--- {label} ---")
    group_stats = df.groupby('income_group')[metric].agg(
        ['count', 'mean', 'median', 'std']
    ).round(1)
    group_stats['IQR'] = df.groupby('income_group')[metric].apply(
        lambda x: x.quantile(0.75) - x.quantile(0.25)
    ).round(1)
    group_stats['Skew'] = df.groupby('income_group')[metric].apply(
        lambda x: x.skew()
    ).round(2)

    # Reorder by income level
    order = ['Low income', 'Lower middle', 'Upper middle', 'High income']
    group_stats = group_stats.reindex(order)
    print(group_stats.to_string())

Elena highlights several patterns:

The inequality gradient is steep. Mean life expectancy rises from about 58 years in the low-income group to about 80 years in the high-income group, a 22-year gap. On average, a baby born in a low-income country can expect to live roughly two decades less than one born in a high-income country.

Variation shrinks as income rises. The standard deviation and IQR are largest for the low-income group and smallest for the high-income group. This means rich countries look similar to each other, but poor countries vary widely — some are doing much better than others despite limited resources.

Shape differs by group. The high-income group is often left-skewed (most countries near the top, a few lagging). The low-income group is more symmetric or even right-skewed. This affects which summary statistic Elena should report for each group.

Step 4: Box Plots for Comparison

fig, axes = plt.subplots(1, 3, figsize=(16, 6))
order = ['Low income', 'Lower middle', 'Upper middle', 'High income']
colors = ['#e74c3c', '#e67e22', '#2ecc71', '#3498db']

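for ax, metric, label in zip(axes, metrics, labels):
    # Build the samples explicitly in income order. df.boxplot(by=...) sorts
    # the groups alphabetically, which would put the boxes out of step with
    # the colors and tick labels assigned below.
    samples = [df.loc[df['income_group'] == group, metric].to_numpy()
               for group in order]
    bp = ax.boxplot(samples, positions=range(len(order)), widths=0.6,
                    patch_artist=True)

    # Color the boxes
    for patch, color in zip(bp['boxes'], colors):
        patch.set_facecolor(color)
        patch.set_alpha(0.7)

    ax.set_title(label)
    ax.set_xlabel('')
    ax.set_xticks(range(len(order)))
    ax.set_xticklabels(order, rotation=15, fontsize=8)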

plt.suptitle('Health Metrics by Income Group', fontsize=14)
plt.tight_layout()
plt.savefig('health_boxplots.png', dpi=150, bbox_inches='tight')
plt.show()

The box plots make the story visual and immediate. Elena will use this figure in her report because it communicates five pieces of information simultaneously: the center (median line), the spread (box = IQR), the reach of the typical values (whiskers, which by default extend to the most extreme points within 1.5 × IQR of the box), outliers (dots), and the comparison across groups.
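
To make the anatomy of a box plot concrete, here is a minimal sketch that computes the numbers behind a single box. The helper name `boxplot_summary` and the sample values are made up for illustration, and the whisker rule assumes matplotlib's default of 1.5 × IQR.

```python
import pandas as pd

def boxplot_summary(values: pd.Series) -> dict:
    """Compute the quantities a default box plot draws for one group."""
    q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr

    # Whiskers stop at the most extreme observations inside the fences
    inside = values[(values >= lower_fence) & (values <= upper_fence)]
    outliers = values[(values < lower_fence) | (values > upper_fence)]

    return {
        'median': median,
        'q1': q1,
        'q3': q3,
        'whisker_low': inside.min(),
        'whisker_high': inside.max(),
        'outliers': outliers.tolist(),
    }

# Made-up mortality-style values with one extreme observation
rates = pd.Series([3, 4, 5, 5, 6, 7, 8, 9, 30], dtype=float)
summary = boxplot_summary(rates)
for name, value in summary.items():
    print(f"{name}: {value}")
```

In this sample the upper whisker stops at 9 and the value 30 is drawn as a separate dot, which is why the whiskers should not be read as the full minimum-to-maximum range.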

Step 5: Identifying Outliers and Interesting Cases

Which countries defy expectations — performing better or worse than their income group would predict?

print("=" * 70)
print("OUTLIER ANALYSIS: Countries Defying Their Income Group")
print("=" * 70)

for metric, label in zip(metrics, labels):
    print(f"\n--- {label} ---")
    for group in order:
        group_data = df[df['income_group'] == group][metric]
        Q1 = group_data.quantile(0.25)
        Q3 = group_data.quantile(0.75)
        IQR = Q3 - Q1
        lower = Q1 - 1.5 * IQR
        upper = Q3 + 1.5 * IQR

        outliers = group_data[(group_data < lower) | (group_data > upper)]
        if len(outliers) > 0:
            print(f"  {group}: {len(outliers)} outlier(s) — values: "
                  f"{[f'{v:.1f}' for v in outliers.values]}")

Elena knows that in real data, these outliers often tell the most important stories. A low-income country with unexpectedly high vaccination rates might have an effective public health program worth studying. A high-income country with unexpectedly high child mortality might have a hidden health equity problem.
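
With real data, Elena would want the names of the outlying countries, not just their values. The sketch below shows one way to apply the same 1.5 × IQR rule within each income group while keeping whole rows. The function name `flag_group_outliers`, the `country` column, and the placeholder labels are assumptions for illustration; the simulated dataset above has no country names.

```python
import pandas as pd

def flag_group_outliers(data: pd.DataFrame, group_col: str,
                        value_col: str) -> pd.DataFrame:
    """Return rows whose value lies outside the 1.5 * IQR fences of their own group."""
    grouped = data.groupby(group_col)[value_col]
    # transform() broadcasts each group's quartile back to that group's rows
    q1 = grouped.transform(lambda s: s.quantile(0.25))
    q3 = grouped.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    is_outlier = (data[value_col] < q1 - 1.5 * iqr) | (data[value_col] > q3 + 1.5 * iqr)
    return data[is_outlier]

# Hypothetical example rows; labels and values are placeholders, not real figures
toy = pd.DataFrame({
    'country': [f'Country {letter}' for letter in 'ABCDEFGHIJKL'],
    'income_group': ['Low income'] * 6 + ['High income'] * 6,
    'measles_vacc_pct': [55, 58, 60, 62, 65, 97,   # one unusually high value
                         92, 93, 94, 95, 96, 70],  # one unusually low value
})

outliers = flag_group_outliers(toy, 'income_group', 'measles_vacc_pct')
print(outliers)
```

Because the fences are computed per group, a value of 97% is flagged among the low-income rows even though it would be unremarkable in the high-income group.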

Step 6: Correlation Preview

Elena notices something: countries that do well on one metric tend to do well on all three. She computes a quick correlation matrix (a topic we'll explore fully in Chapter 24):

print("\nCorrelation Matrix:")
corr = df[metrics].corr().round(2)
print(corr)
print()
print("Interpretation:")
print("  - Life expectancy and vaccination are positively correlated")
print("  - Life expectancy and child mortality are negatively correlated")
print("  - These metrics move together — they reflect underlying development")

She notes: "This makes intuitive sense. Countries with resources to vaccinate children also have resources for other health infrastructure, leading to longer life expectancy and lower child mortality. But correlation isn't causation — we'll need to be careful about that when we get to Chapter 24."
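
For readers who want to see what sits behind each entry of the correlation matrix, here is a small sketch that computes a Pearson correlation by hand and compares it with pandas. The five rows are made-up values chosen only to show a downward relationship; they are not taken from Elena's data.

```python
import numpy as np
import pandas as pd

# Made-up values: higher life expectancy paired with lower child mortality
toy = pd.DataFrame({
    'life_expectancy': [55.0, 62.0, 68.0, 74.0, 81.0],
    'under5_mortality': [95.0, 60.0, 35.0, 15.0, 4.0],
})

x = toy['life_expectancy']
y = toy['under5_mortality']

# Pearson r: sum of co-deviations, scaled by the two sums of squared deviations
x_dev = x - x.mean()
y_dev = y - y.mean()
r_manual = (x_dev * y_dev).sum() / np.sqrt((x_dev ** 2).sum() * (y_dev ** 2).sum())

r_pandas = x.corr(y)  # pandas uses Pearson by default

print(f"By hand: {r_manual:.4f}")
print(f"pandas:  {r_pandas:.4f}")
```

The sign carries the direction of the relationship and the magnitude its strength. Neither says anything about which variable, if either, drives the other.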

The Report

Elena drafts a summary for her project notebook:

Global Health Statistical Portrait: Key Findings

Life expectancy varies from about 48 to 86 years across countries (a 38-year range), with a global median of about 73 years. The distribution is left-skewed — most countries cluster in the 65-82 range, with a tail of lower-income countries extending downward.

Measles vaccination coverage has a global median of about 85%, but with enormous variation — from about 15% in the worst-performing countries to 99% in the best. The distribution is left-skewed.

Under-5 mortality shows the starkest inequality. The global median is about 20 deaths per 1,000 live births, but the distribution is strongly right-skewed — a substantial group of low-income countries has rates above 60, while most high-income countries are below 10.

Income group is a powerful predictor of all three metrics. The gap between low-income and high-income countries is enormous on every measure. Within income groups, variation is highest among low-income countries, suggesting that some are doing much better than others despite similar economic constraints.

For communication purposes: I recommend using the median for all global statistics (due to skewness) and the IQR for spread. Group-specific statistics should be reported alongside global figures to avoid masking the enormous between-group variation.
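
Elena's reporting recommendation can be sketched as a small table builder that puts the global median and IQR next to the group-level figures. The function name `median_iqr_report` and the toy values are assumptions for illustration, not part of her project code.

```python
import pandas as pd

def iqr(values: pd.Series) -> float:
    """Interquartile range: 75th percentile minus 25th percentile."""
    return values.quantile(0.75) - values.quantile(0.25)

def median_iqr_report(data: pd.DataFrame, group_col: str, value_col: str,
                      group_order: list) -> pd.DataFrame:
    """One row for all countries, followed by one row per group in the given order."""
    by_group = (
        data.groupby(group_col)[value_col]
        .agg(n='count', median='median', IQR=iqr)
        .reindex(group_order)
    )
    overall = pd.DataFrame(
        {'n': [data[value_col].count()],
         'median': [data[value_col].median()],
         'IQR': [iqr(data[value_col])]},
        index=['All countries'],
    )
    return pd.concat([overall, by_group])

# Made-up mortality-style values for two groups
toy = pd.DataFrame({
    'income_group': ['Low income'] * 3 + ['High income'] * 3,
    'under5_mortality': [70.0, 90.0, 110.0, 4.0, 6.0, 8.0],
})

report = median_iqr_report(toy, 'income_group', 'under5_mortality',
                           ['Low income', 'High income'])
print(report)
```

In this toy table the global median (39) describes neither group well, which is the masking effect the recommendation is meant to avoid.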

The Lessons

Lesson 1: Descriptive Statistics Are Not Just Preliminary

Elena's analysis contains no hypothesis tests, no p-values, no regression models — and yet it tells a powerful, important story. Descriptive statistics aren't just the appetizer before the "real" analysis. For many questions, a careful descriptive analysis is the analysis.

Lesson 2: Always Stratify

The global mean life expectancy (about 71 years) hides the 22-year gap between income groups. Any time you have meaningful subgroups in your data, compute statistics for each group separately before reporting an overall number.
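
The arithmetic behind this lesson is short enough to work through. An overall mean is a weighted average of the group means, so it can sit comfortably in the middle while the groups themselves are far apart. The sketch below uses the group proportions and group means from the simulation parameters in Step 1; the realized sample will differ slightly because group membership is drawn at random.

```python
# Group shares and group means taken from the Step 1 simulation parameters
shares = {
    'Low income': 0.15,
    'Lower middle': 0.28,
    'Upper middle': 0.30,
    'High income': 0.27,
}
group_means = {
    'Low income': 58,
    'Lower middle': 67,
    'Upper middle': 74,
    'High income': 80,
}

# Overall mean = sum over groups of (share of countries * group mean)
overall_mean = sum(shares[g] * group_means[g] for g in shares)
gap = group_means['High income'] - group_means['Low income']

print(f"Expected overall mean: {overall_mean:.2f} years")  # 71.26
print(f"High-income minus low-income gap: {gap} years")    # 22
```

A single figure of about 71 years is an accurate average and still says nothing about the 22-year spread between the lowest and highest groups.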

Lesson 3: Match the Statistic to the Shape

Elena used the median and IQR for skewed metrics, and noted where the mean and median diverged as evidence of skewness. This isn't just a classroom rule — it's how professional analysts avoid misleading their audiences.
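
One way to turn this lesson into a habit is a helper that picks the summary pair based on skewness. The sketch below reuses the |skew| < 0.5 cutoff from Step 2 as a rule of thumb; the function name `summarize_by_shape` and the sample values are made up, and the cutoff is a convention, not a formal test.

```python
import pandas as pd

def summarize_by_shape(values: pd.Series, skew_threshold: float = 0.5) -> dict:
    """Report mean and std for roughly symmetric data, median and IQR otherwise."""
    skew = values.skew()
    if abs(skew) < skew_threshold:
        center = ('mean', values.mean())
        spread = ('std', values.std())
    else:
        center = ('median', values.median())
        spread = ('IQR', values.quantile(0.75) - values.quantile(0.25))
    return {'skew': skew, 'center': center, 'spread': spread}

# Made-up samples: one symmetric, one with a long right tail
symmetric = pd.Series([1, 2, 3, 4, 5], dtype=float)
right_tailed = pd.Series([1, 1, 2, 2, 3, 40], dtype=float)

print(summarize_by_shape(symmetric))
print(summarize_by_shape(right_tailed))
```

The helper only automates the first pass. A histogram is still worth a look, since a single number cannot distinguish a long tail from, say, two separate clusters.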

Lesson 4: Outliers Are Stories

The countries that don't fit their income group's pattern aren't nuisances to be removed — they're the most interesting data points. They might represent policy successes, data quality issues, or unique circumstances worth investigating.

Discussion Questions

1. Elena found that variation was highest among low-income countries. What might explain this? Think about factors that differ widely among low-income countries.
2. If you were presenting these findings to a non-technical audience (say, a congressional committee), which visualization from this analysis would you choose and why?
3. The correlation between health metrics and income is strong but not perfect. What other factors might explain why some countries perform better or worse than their income group predicts?
4. Elena noted that under-5 mortality is right-skewed. Why is this particular shape especially important for public health policy? (Hint: think about where the "action" needs to be.)

Connection to the Chapter

This case study applies nearly every concept from Chapter 19: measures of center (mean vs. median), measures of spread (standard deviation vs. IQR), distribution shape (skewness), outlier detection, grouped statistics, and the critical choice of which summary statistic to report. It also serves as a milestone in the progressive project — the Global Health Data Explorer now has a formal statistical foundation that Elena will build on with probability (Chapter 20), distributions (Chapter 21), and eventually inference and modeling.