Case Study 1: Visualizing Health Disparities Across Income Groups

Contributors to Introduction to Data Science

Tier 2 — Attributed Example: This case study uses the structure of real WHO and World Bank datasets to explore health disparities across income groups. The analyst, Priya, is a fictional graduate student. Country names reference real nations and WHO regions, but all specific numerical values for vaccination coverage, health expenditure, and composite scores have been simplified, rounded, or adjusted for pedagogical clarity. The analytical workflow and seaborn techniques demonstrated are representative of real public health data analysis.

The Setting

Priya is a first-year MPH student working on a term paper about global health equity. Her professor has asked a deceptively simple question: Do wealthier countries have better health outcomes, and if so, how large is the gap?

She has the WHO vaccination dataset that this course has been using, enriched with World Bank income classifications (Low, Lower-Middle, Upper-Middle, High) and GDP per capita. She knows the answer is "probably yes" — but her professor wants her to show the answer, not just state it. Charts, not claims. Evidence, not assertions.

Priya opens a Jupyter notebook, imports seaborn, and starts exploring.

Step 1: The Big Picture with a Pair Plot

Priya wants to see all the relationships at once before zooming in on any single one:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_theme(style="whitegrid", palette="colorblind")
df = pd.read_csv("who_vaccination_enriched.csv")

sns.pairplot(df[["coverage_pct", "gdp_per_capita",
                 "health_exp_pct_gdp", "literacy_rate",
                 "income_group"]],
             hue="income_group", diag_kind="kde",
             plot_kws={"alpha": 0.5, "s": 15})

The pair plot is a wall of information. Priya spends five minutes studying it. Here is what jumps out:

GDP and coverage have a positive relationship, but it is not linear — it curves and flattens at the top.
The income groups form distinct clusters in the GDP dimension, by definition, but they also separate clearly on coverage and literacy.
Health expenditure as a percentage of GDP does not separate cleanly by income group. Some low-income countries spend a high percentage of GDP on health; some high-income countries spend surprisingly little.
The KDE plots on the diagonal show that high-income countries have a tight, left-skewed distribution of coverage (most are near 90-99%), while low-income countries have a wide, relatively flat distribution.

This gives Priya her roadmap: she will focus on coverage disparities by income group, the GDP-coverage curve, and the surprise about health expenditure.
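
A pair plot is read by eye, so a numeric companion can help confirm which relationships are strong. The sketch below computes rank (Spearman) correlations, which suit the curved GDP-coverage relationship better than Pearson correlations. The helper name `rank_correlations` and the toy rows are assumptions made for illustration; the values are synthetic and are not WHO or World Bank figures.

```python
import pandas as pd

def rank_correlations(frame: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Spearman correlations, computed as Pearson correlations of the ranks."""
    return frame[columns].rank().corr()

# Synthetic toy rows for illustration only; not WHO or World Bank values.
toy = pd.DataFrame({
    "coverage_pct":       [55, 62, 70, 78, 85, 90, 94, 96],
    "gdp_per_capita":     [600, 900, 1500, 3000, 6000, 12000, 30000, 55000],
    "health_exp_pct_gdp": [7.0, 3.5, 5.0, 4.0, 6.5, 5.5, 4.5, 8.0],
})

corr = rank_correlations(toy, list(toy.columns))
print(corr.round(2))
```

In this toy data, coverage rises monotonically with GDP, so their rank correlation is at its maximum, while health expenditure tracks coverage only weakly. On the real file the same call would use `df` and the four numeric columns from the pair plot.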

Step 2: Distribution Comparison — Box and Violin

Priya's first focused chart compares coverage distributions across income groups:

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

sns.boxplot(data=df, x="income_group", y="coverage_pct",
            order=["Low", "Lower-Middle",
                   "Upper-Middle", "High"],
            ax=axes[0])
axes[0].set_title("Box Plot: Coverage by Income Group")
axes[0].set_xlabel("Income Group")
axes[0].set_ylabel("Coverage (%)")

sns.violinplot(data=df, x="income_group",
               y="coverage_pct",
               order=["Low", "Lower-Middle",
                      "Upper-Middle", "High"],
               inner="quartile", ax=axes[1])
axes[1].set_title("Violin Plot: Coverage by Income Group")
axes[1].set_xlabel("Income Group")
axes[1].set_ylabel("Coverage (%)")

plt.tight_layout()

The box plot gives her the clean summary: medians rise steadily from Low (around 72%) to High (around 95%). The IQR narrows dramatically — high-income countries are clustered tightly, while low-income countries are spread across a 40-percentage-point range.

The violin plot adds nuance. The Low-income violin has a slight bimodal shape — there is a cluster of countries around 80-85% and another cluster around 50-60%. Some low-income countries achieve high coverage; others are far behind. The High-income violin is almost a spike near 95%, with a thin tail down to around 80%.

Priya writes in her notebook: "The gap is not just about averages. It is about variability. Low-income countries do not just have lower coverage — they have wildly inconsistent coverage."
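
The medians and spreads that Priya reads off the box plot can also be computed directly, which makes the "variability" claim checkable. This is a minimal sketch: the function name `spread_by_group` is hypothetical, and the toy values are synthetic, chosen only to mimic the pattern described above.

```python
import pandas as pd

ORDER = ["Low", "Lower-Middle", "Upper-Middle", "High"]

def spread_by_group(frame, group_col="income_group", value_col="coverage_pct"):
    """Median, interquartile range, and full range of value_col per group."""
    grouped = frame.groupby(group_col)[value_col]
    summary = pd.DataFrame({
        "median": grouped.median(),
        "iqr": grouped.quantile(0.75) - grouped.quantile(0.25),
        "range": grouped.max() - grouped.min(),
    })
    return summary.reindex(ORDER)  # same order as the plots

# Synthetic toy values for illustration only; not WHO data.
toy = pd.DataFrame({
    "income_group": ["Low"] * 4 + ["Lower-Middle"] * 4
                    + ["Upper-Middle"] * 4 + ["High"] * 4,
    "coverage_pct": [50, 60, 80, 90, 65, 75, 85, 95,
                     80, 86, 90, 96, 92, 94, 96, 98],
})

summary = spread_by_group(toy)
print(summary)
```

Calling `spread_by_group(df)` on the real data would give the numbers behind the box plot: the `median` column carries the gap, and the `iqr` and `range` columns carry the inconsistency.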

Step 3: The GDP-Coverage Curve with Regression

Priya wants to visualize the relationship between GDP and coverage, but she suspects it is not linear. She tries three different regression fits:

fig, axes = plt.subplots(1, 3, figsize=(18, 5))

sns.regplot(data=df, x="gdp_per_capita",
            y="coverage_pct", ax=axes[0],
            scatter_kws={"alpha": 0.4, "s": 15})
axes[0].set_title("Linear Fit")

sns.regplot(data=df, x="gdp_per_capita",
            y="coverage_pct", ax=axes[1], order=2,
            scatter_kws={"alpha": 0.4, "s": 15})
axes[1].set_title("Quadratic Fit")

# lowess=True requires the statsmodels package; no confidence band is drawn
sns.regplot(data=df, x="gdp_per_capita",
            y="coverage_pct", ax=axes[2],
            lowess=True,
            scatter_kws={"alpha": 0.4, "s": 15})
axes[2].set_title("LOWESS Fit")

for ax in axes:
    ax.set_xlabel("GDP per Capita (USD)")
    ax.set_ylabel("Coverage (%)")

plt.tight_layout()

The linear fit misses the curve: it overshoots at low GDP and undershoots in the middle. The quadratic fit captures the diminishing returns: coverage rises steeply with GDP at first, then plateaus. The LOWESS fit, which imposes no functional form, follows the data most closely: there is a steep rise from 0 to about 10,000 USD GDP per capita, then a gradual leveling off. Above 30,000 USD, higher GDP is no longer associated with higher coverage.

Priya writes: "Money helps — up to a point. Below $10,000 GDP per capita, increases in national wealth correspond to dramatic improvements in coverage. Above $30,000, the relationship essentially disappears."
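
The visual comparison of fits can be backed by residuals. The sketch below fits a line and a quadratic with `numpy.polyfit` to a synthetic saturating curve (an assumed shape, not WHO data) and checks where the line overshoots and undershoots. The helper names and the band cut-offs are illustrative choices.

```python
import numpy as np

def fit_residuals(x, y, degree):
    """Least-squares polynomial fit; returns residuals (observed minus fitted)."""
    coeffs = np.polyfit(x, y, degree)
    return y - np.polyval(coeffs, x)

def rmse(residuals):
    return float(np.sqrt(np.mean(residuals ** 2)))

# Synthetic saturating curve for illustration only; not WHO data.
rng = np.random.default_rng(0)
gdp_k = np.linspace(0.5, 60, 120)  # GDP per capita in thousands of USD
coverage = 97 - 45 * np.exp(-gdp_k / 8) + rng.normal(0, 1.5, gdp_k.size)

linear_resid = fit_residuals(gdp_k, coverage, 1)
quadratic_resid = fit_residuals(gdp_k, coverage, 2)

low = gdp_k < 5                          # lowest-GDP countries
middle = (gdp_k >= 20) & (gdp_k <= 35)   # middle of the GDP range

print("RMSE linear:   ", round(rmse(linear_resid), 2))
print("RMSE quadratic:", round(rmse(quadratic_resid), 2))
# Negative mean residual: the line sits above the data (overshoots).
print("Linear residual, low GDP:   ", round(float(linear_resid[low].mean()), 2))
# Positive mean residual: the line sits below the data (undershoots).
print("Linear residual, middle GDP:", round(float(linear_resid[middle].mean()), 2))
```

A lower in-sample RMSE for the quadratic is expected by construction, since the linear model is nested inside it. The more informative check is the sign pattern of the linear residuals across GDP bands.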

Step 4: The Income-by-Region Heatmap

Priya builds a heatmap to see how income group and region interact:

pivot = df.pivot_table(
    values="coverage_pct",
    index="income_group",
    columns="region",
    aggfunc="mean"
)

# Reorder index
pivot = pivot.reindex(["Low", "Lower-Middle",
                       "Upper-Middle", "High"])

sns.heatmap(pivot, annot=True, fmt=".0f",
            cmap="YlGnBu", linewidths=1,
            cbar_kws={"label": "Mean Coverage (%)"})
plt.title("Mean Vaccination Coverage by Income and Region")
plt.ylabel("Income Group")

The heatmap reveals something the box plots could not: the income-coverage relationship varies by region. In AFRO (Africa), the gap between Low and Upper-Middle income countries is enormous — nearly 25 percentage points. In EURO (Europe), even the lower-middle-income countries achieve coverage above 90%. The region moderates the effect of income.

Some cells are empty (NaN) — there are no low-income countries in EURO, for example. These blank cells are informative too: they show that the income-group-by-region matrix is sparse, and global averages can be misleading.
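
Because each heatmap cell is a mean, it helps to know how many countries stand behind it. One possible approach, sketched below with synthetic toy rows (not WHO data), pairs the pivot table with a `pd.crosstab` count and masks cells that rest on too few countries. The threshold `MIN_COUNTRIES` is an arbitrary assumption for illustration.

```python
import pandas as pd

ORDER = ["Low", "Lower-Middle", "Upper-Middle", "High"]

# Synthetic toy rows for illustration only; not WHO data.
toy = pd.DataFrame(
    [
        ("Low", "AFRO", 55), ("Low", "AFRO", 65), ("Low", "SEARO", 70),
        ("Lower-Middle", "AFRO", 72), ("Lower-Middle", "EURO", 92),
        ("Lower-Middle", "SEARO", 80), ("Lower-Middle", "SEARO", 84),
        ("Upper-Middle", "AFRO", 85), ("Upper-Middle", "EURO", 94),
        ("Upper-Middle", "EURO", 96),
        ("High", "EURO", 95), ("High", "EURO", 97),
    ],
    columns=["income_group", "region", "coverage_pct"],
)

# Mean coverage per cell, as in the heatmap
means = toy.pivot_table(values="coverage_pct", index="income_group",
                        columns="region", aggfunc="mean").reindex(ORDER)

# Number of countries per cell; 0 marks a combination that does not occur
counts = pd.crosstab(toy["income_group"], toy["region"]).reindex(ORDER)

MIN_COUNTRIES = 2  # assumed threshold; pick one that suits the dataset
reliable = means.where(counts >= MIN_COUNTRIES)

print(counts)
print(reliable)
```

Passing `reliable` instead of the raw pivot to `sns.heatmap` would blank out thinly supported cells, and `counts` could be shown as a second annotated heatmap beside it.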

Step 5: Faceted Scatter for Regional Detail

Priya creates her most detailed visualization:

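# lowess=True requires statsmodels. With hue set, a separate curve is fit
# for each income group within each region, so panels with only a few
# countries per group will give unstable curves.
g = sns.lmplot(data=df, x="gdp_per_capita",
               y="coverage_pct",
               col="region", col_wrap=3,
               hue="income_group",
               height=3.5, aspect=1.2,
               scatter_kws={"alpha": 0.5, "s": 20},
               lowess=True)
g.set_axis_labels("GDP per Capita (USD)",
                  "Coverage (%)")
# FacetGrid.figure replaces the deprecated .fig attribute (seaborn >= 0.11.2)
g.figure.suptitle(
    "GDP vs. Coverage by Region and Income",
    y=1.02, fontsize=14)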

Each panel tells a different story. In SEARO (South-East Asia), there is a strong positive relationship — wealthier countries in the region vaccinate more. In EURO, the relationship is nearly flat — all countries, regardless of GDP, achieve high coverage. In AFRO, the relationship is positive but noisy — many countries deviate from the trend.
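
The per-panel impressions ("strong", "flat", "noisy") can be summarized with one rank correlation per region. The sketch below is one way to do that; the function name `rank_corr_by_region` is hypothetical, and the toy rows are synthetic values arranged to mimic the three patterns described, not WHO data.

```python
import pandas as pd

def rank_corr_by_region(frame):
    """Spearman correlation of GDP and coverage within each region."""
    result = {}
    for region, group in frame.groupby("region"):
        gdp_rank = group["gdp_per_capita"].rank()
        cov_rank = group["coverage_pct"].rank()
        result[region] = gdp_rank.corr(cov_rank)  # Pearson on ranks = Spearman
    return pd.Series(result, name="spearman_rho")

# Synthetic toy rows for illustration only; not WHO data.
toy = pd.DataFrame({
    "region": ["SEARO"] * 4 + ["EURO"] * 4 + ["AFRO"] * 5,
    "gdp_per_capita": [1000, 2000, 4000, 8000,
                       5000, 15000, 30000, 50000,
                       500, 1000, 2000, 4000, 8000],
    "coverage_pct": [60, 70, 80, 90,
                     95, 93, 96, 94,
                     50, 70, 60, 85, 80],
})

rho = rank_corr_by_region(toy)
print(rho.round(2))
```

With only a handful of countries per region these coefficients are unstable, so they are best read alongside the faceted plot rather than in place of it.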

Step 6: The Summary Visualization

For her paper, Priya creates one publication-ready composite figure:

fig, axes = plt.subplots(2, 2, figsize=(12, 10))

# Panel A: Distribution by income group
sns.violinplot(data=df, x="income_group",
               y="coverage_pct",
               order=["Low", "Lower-Middle",
                      "Upper-Middle", "High"],
               inner="quartile", ax=axes[0, 0])
axes[0, 0].set_title("A. Coverage Distribution")

# Panel B: GDP scatter with LOWESS
sns.regplot(data=df, x="gdp_per_capita",
            y="coverage_pct", lowess=True,
            scatter_kws={"alpha": 0.3, "s": 10},
            ax=axes[0, 1])
axes[0, 1].set_title("B. GDP vs. Coverage")

# Panel C: Heatmap
sns.heatmap(pivot, annot=True, fmt=".0f",
            cmap="YlGnBu", ax=axes[1, 0])
axes[1, 0].set_title("C. Mean Coverage by Group/Region")

# Panel D: KDE by income group
for group in ["Low", "Lower-Middle",
              "Upper-Middle", "High"]:
    subset = df[df["income_group"] == group]
    sns.kdeplot(subset["coverage_pct"],
                label=group, ax=axes[1, 1],
                fill=True, alpha=0.3)
axes[1, 1].set_title("D. Coverage Density")
axes[1, 1].legend(fontsize=8)

fig.suptitle("Health Disparities in Vaccination "
             "Coverage Across Income Groups",
             fontsize=14, y=1.01)
plt.tight_layout()
fig.savefig("health_disparities.png",
            dpi=300, bbox_inches="tight")

Priya's Conclusions

From her visualizations, Priya writes three key findings for her paper:

The income-coverage gap is real and large. Median vaccination coverage is approximately 23 percentage points higher in high-income countries than in low-income countries. But the story is about more than the median — low-income countries have far more variability, meaning some achieve excellent coverage while others fall dramatically behind.
GDP's effect on coverage has diminishing returns. The relationship between GDP per capita and vaccination coverage follows a logarithmic-style curve: steep improvements below $10,000 GDP per capita, gradual gains between $10,000 and $30,000, and essentially no relationship above $30,000. This suggests that policy interventions, not just economic growth, drive coverage at higher income levels.
Region moderates the income effect. The same income group can have very different coverage levels depending on the region. This suggests that regional health infrastructure, governance, and international aid programs play an important role beyond what national income alone predicts.
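
The thresholds in the second finding can be checked by fitting a separate straight line within each GDP band and comparing slopes. This is a rough sketch under stated assumptions: the band edges come from the text, the helper name `slope_per_band` is hypothetical, and the data are a noise-free synthetic curve rather than WHO values.

```python
import numpy as np
import pandas as pd

def slope_per_band(frame, edges=(0, 10_000, 30_000, np.inf),
                   labels=("<10k", "10k-30k", ">=30k")):
    """Least-squares slope of coverage on GDP (per 1,000 USD) within each band."""
    bands = pd.cut(frame["gdp_per_capita"], bins=list(edges),
                   labels=list(labels), right=False)
    slopes = {}
    for band, group in frame.groupby(bands, observed=True):
        slope, _intercept = np.polyfit(group["gdp_per_capita"] / 1000,
                                       group["coverage_pct"], 1)
        slopes[band] = float(slope)
    return pd.Series(slopes, name="points_per_1000_usd")

# Synthetic saturating curve for illustration only; not WHO data.
gdp = np.linspace(500, 60_000, 120)
toy = pd.DataFrame({
    "gdp_per_capita": gdp,
    "coverage_pct": 97 - 45 * np.exp(-gdp / 8000),
})

slopes = slope_per_band(toy)
print(slopes.round(3))
```

If the diminishing-returns reading holds on the real data, the slopes should fall from band to band and approach zero in the top band. Band-wise slopes are sensitive to where the edges are placed, so it is worth trying nearby cut-offs before quoting a threshold.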

Pedagogical Reflection

This case study demonstrates the seaborn exploration workflow:

Start broad (pair plot) to identify interesting patterns.
Focus (violin plot, scatter plot) on the most important relationships.
Layer complexity (hue, col, regression) to investigate moderating variables.
Synthesize into a single composite figure for communication.
Write conclusions that connect the visual evidence to substantive claims.

Notice how each visualization answered a specific question and how the progression from simple to complex was driven by what Priya learned at each step. She did not plan all six visualizations in advance — she followed the data, letting each chart raise the next question.