Case Study 1: A Model Capstone: Complete Vaccination Rate Analysis

Contributors to Introduction to Data Science

Tier 3 — Illustrative/Composite Example: This case study presents a fictional student's completed capstone project as a detailed walkthrough of what a strong, portfolio-ready data science investigation looks like. The student, analytical decisions, code snippets, and specific findings are constructed for pedagogical purposes. The datasets referenced (WHO vaccination data, World Bank indicators) are real and publicly available, but the specific numerical results described here are illustrative and may not match what you'd find by running the analysis yourself.

Introduction

What does a finished capstone actually look like? Not the specification, not the rubric, not the milestone checklist — but the real thing, the notebook you'd be proud to pin to your GitHub profile and show to a hiring manager?

This case study walks you through a model capstone project from beginning to end: Elena's investigation of global vaccination rate disparities. We'll see her notebook structure, her analytical decisions, her writing, her visualizations, and her conclusions. Along the way, we'll note what she does well and where the project exemplifies the capstone rubric's criteria.

Think of this as a worked example — not to copy, but to calibrate your own expectations for what "excellent" looks like.

The Notebook: Section by Section

Title and Abstract

Elena's notebook opens with:

What Explains the Global Vaccination Divide?

A Data Science Investigation of COVID-19 Vaccination Rate Disparities Across 183 Countries

Abstract: COVID-19 vaccination rates ranged from over 90% in some high-income countries to under 5% in several low-income nations as of December 2023. This analysis investigates what national-level factors explain this variation, using data from the WHO (vaccination records), World Bank (economic and demographic indicators), and WHO Global Health Expenditure Database (health system characteristics). Through exploratory analysis, hypothesis testing, and predictive modeling, I find that while GDP per capita is strongly associated with vaccination rates (r = 0.72), healthcare workforce density (physicians and nurses per capita) adds significant predictive power beyond GDP alone. A random forest model incorporating economic, demographic, and health system features achieves R-squared = 0.76 on held-out test data, with healthcare workforce density emerging as the most important feature. These findings suggest that vaccination coverage depends not just on national wealth but on the capacity of the health system to deliver vaccines to the population.

What works here: The title is specific and interesting (not "Capstone Project"). The abstract is self-contained — a reader who reads only the abstract understands the question, data, methods, and findings. Technical terms (R-squared, random forest) are used but the overall message is accessible.

Introduction and Motivation

Elena's introduction spans about 600 words. Here's how it opens:

Imagine two children born on the same day in 2021 — one in Portugal, one in Chad. The child in Portugal had a 93% chance of receiving a full course of COVID-19 vaccination by their second birthday. The child in Chad had a 1.4% chance.

This is not a typo. The gap in COVID-19 vaccination rates between the world's richest and poorest countries is not 10 percentage points, or 20, or even 50. It is, in many cases, more than 80 percentage points — a chasm that represents one of the starkest inequalities of the pandemic era.

But what drives this gap? The intuitive answer, "poor countries can't afford vaccines," turns out to be incomplete. Several lower-middle-income countries achieved vaccination rates above 80%, while some upper-middle-income countries lagged behind. Within the same income group, vaccination rates varied by 50 or more percentage points. Clearly, something beyond national wealth is at work.

This analysis investigates what that "something" might be...

What works here: The opening is vivid and specific — two children, concrete numbers, an emotionally resonant comparison. The motivation isn't generic ("vaccination is important"); it identifies a specific puzzle (why does the intuitive explanation fall short?) that makes the reader want to know the answer. The transition from motivation to investigation is smooth.

Data Description

Elena documents three data sources with specific details:

Source 1: WHO COVID-19 Vaccination Data
- Downloaded March 15, 2024 from the WHO COVID-19 dashboard data portal
- Contains cumulative vaccination counts by country, updated daily
- Coverage: 194 WHO member states; I retained 183 countries with complete primary series data
- Key variable: people_fully_vaccinated_per_hundred as of December 31, 2023

Source 2: World Bank World Development Indicators
- Downloaded March 16, 2024 from the World Bank data portal
- Variables selected: GDP per capita (current USD), population, urban population (%), secondary school enrollment (%), mobile cellular subscriptions per 100 people
- Year: 2021 (most recent complete year for most indicators)

Source 3: WHO Global Health Expenditure Database
- Downloaded March 16, 2024
- Variables selected: current health expenditure per capita (USD), current health expenditure as % of GDP, physicians per 1,000 population, nursing/midwifery personnel per 1,000 population
- Year: 2020 (most recent available for expenditure data)

What works here: Every source is named and comes with a download date, variable list, and time period, so a reader could reproduce the data acquisition. She notes that she retained 183 of 194 countries and explains why 11 were dropped (incomplete primary series data), which is important for transparency.
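To make the "183 of 194" bookkeeping concrete, here is a minimal sketch of how three sources like these might be joined. It assumes each download has been loaded into a pandas DataFrame keyed by a three-letter country code in a hypothetical column named iso3. The toy rows, and every column name except people_fully_vaccinated_per_hundred, are illustrative assumptions rather than Elena's actual code.

```python
import pandas as pd

# Toy stand-ins for the three downloads; all values are made up.
who_vax = pd.DataFrame({
    "iso3": ["AAA", "BBB", "CCC"],
    "people_fully_vaccinated_per_hundred": [91.0, 45.5, None],
})
world_bank = pd.DataFrame({
    "iso3": ["AAA", "BBB", "CCC"],
    "gdp_per_capita": [42000.0, 3800.0, 900.0],
})
health_exp = pd.DataFrame({
    "iso3": ["AAA", "CCC"],
    "physicians_per_1000": [3.9, 0.2],
})

# Keep only countries with a recorded vaccination outcome.
outcome = who_vax.dropna(subset=["people_fully_vaccinated_per_hundred"])

# Left joins keep every retained country even when a predictor is missing.
# validate="one_to_one" raises an error if a country code is duplicated.
merged = (
    outcome
    .merge(world_bank, on="iso3", how="left", validate="one_to_one")
    .merge(health_exp, on="iso3", how="left", validate="one_to_one")
)

print(len(who_vax), "countries downloaded;", len(merged), "retained")
print(merged.isna().sum())  # missing values per column after the join
```

Printing the row counts before and after each step is what lets you report exactly how many countries were dropped and why.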

Data Cleaning

Elena documents five cleaning decisions. Here's one of the strongest:

Decision 3: Handling missing healthcare workforce data

Twenty-three countries had missing physician density data, and 31 had missing nursing personnel data. These countries were disproportionately in Sub-Saharan Africa and South Asia — precisely the regions where understanding vaccination barriers matters most.

I considered three options:
1. Drop rows with missing values. This would reduce the dataset from 183 to 147 countries and systematically exclude low-income countries, biasing the analysis toward wealthier nations where data infrastructure is stronger.
2. Impute with regional medians. This assumes countries within a WHO region have similar healthcare workforces, which is a rough approximation.
3. Impute with k-nearest-neighbors using GDP, health expenditure, and urbanization as features. This estimates workforce density based on structurally similar countries.

I chose option 3 (KNN imputation) because it uses country-specific information rather than regional averages. I created a boolean column physician_density_imputed to flag imputed values for later sensitivity analysis. The sensitivity analysis (Section 6.3) confirms that my main conclusions hold whether I use KNN imputation, regional medians, or drop missing rows.
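A decision like this one could be sketched with scikit-learn's KNNImputer, as below. This is an illustration under stated assumptions, not Elena's notebook: the data are randomly generated, the column names other than physician_density_imputed are hypothetical, and both the choice of five neighbors and the standardization step are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; values are random and column names are assumed.
rng = np.random.default_rng(0)
n = 40
df = pd.DataFrame({
    "gdp_per_capita": rng.lognormal(mean=9, sigma=1, size=n),
    "health_exp_per_capita": rng.lognormal(mean=6, sigma=1, size=n),
    "urban_pct": rng.uniform(20, 95, size=n),
    "physicians_per_1000": rng.uniform(0.1, 5.0, size=n),
})
df.loc[rng.choice(n, size=6, replace=False), "physicians_per_1000"] = np.nan

# Flag the missing rows BEFORE imputing so they can be identified later.
df["physician_density_imputed"] = df["physicians_per_1000"].isna()

cols = ["gdp_per_capita", "health_exp_per_capita", "urban_pct",
        "physicians_per_1000"]

# Standardize so no single feature dominates the distance calculation.
# StandardScaler ignores NaNs when fitting and preserves them on transform.
scaler = StandardScaler()
scaled = scaler.fit_transform(df[cols])

# Fill each gap with the average of the 5 most similar countries.
imputed_scaled = KNNImputer(n_neighbors=5).fit_transform(scaled)

# Undo the scaling and write back only the physician density column.
imputed = scaler.inverse_transform(imputed_scaled)
df["physicians_per_1000"] = imputed[:, cols.index("physicians_per_1000")]

print(df["physician_density_imputed"].sum(), "values imputed")
```

Creating the flag column before imputing matters: once the gaps are filled, there is no other way to tell observed values from estimated ones.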

What works here: This is exactly the kind of documented decision-making that the capstone rubric rewards. She identifies the problem, considers multiple solutions, chooses one with reasoning, flags the imputation for sensitivity analysis, and follows through on the sensitivity check. A hiring manager reading this sees someone who thinks carefully about analytical choices.
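A sensitivity check of this kind can be as simple as re-estimating one key quantity under each missing-data strategy and comparing the results. The sketch below uses synthetic data and a hypothetical helper, physician_coef, which returns the physician coefficient from a two-predictor linear regression. The variable names, and the choice of that coefficient as the quantity to compare, are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.linear_model import LinearRegression

# Synthetic data with a known positive physician effect (illustrative only).
rng = np.random.default_rng(4)
n = 120
df = pd.DataFrame({
    "region": rng.choice(["A", "B", "C"], size=n),
    "log_gdp": rng.normal(9, 1, size=n),
})
df["physicians"] = 0.8 * df["log_gdp"] - 5 + rng.normal(0, 0.5, size=n)
df["vax_rate"] = (10 * df["physicians"] + 5 * df["log_gdp"]
                  + rng.normal(0, 5, size=n))
df.loc[rng.choice(n, size=20, replace=False), "physicians"] = np.nan

def physician_coef(data):
    """Coefficient on physicians in a regression that also includes log GDP."""
    model = LinearRegression().fit(data[["log_gdp", "physicians"]],
                                   data["vax_rate"])
    return float(model.coef_[1])

# Strategy 1: drop incomplete rows.
dropped = df.dropna(subset=["physicians"])

# Strategy 2: fill with the median of the country's region.
median_filled = df.copy()
median_filled["physicians"] = df["physicians"].fillna(
    df.groupby("region")["physicians"].transform("median"))

# Strategy 3: KNN imputation. The outcome (vax_rate) is deliberately
# left out of the imputation features.
knn_filled = df.copy()
knn_filled[["log_gdp", "physicians"]] = KNNImputer(n_neighbors=5).fit_transform(
    df[["log_gdp", "physicians"]])

estimates = {
    "drop rows": physician_coef(dropped),
    "regional median": physician_coef(median_filled),
    "KNN": physician_coef(knn_filled),
}
for name, value in estimates.items():
    print(f"{name:>16}: {value:.2f}")
```

If the three estimates agree in sign and rough magnitude, the conclusion does not hinge on the imputation choice. If they diverge, that divergence is itself a finding worth reporting.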

Exploratory Analysis

Elena presents six visualizations. Each one follows a pattern: a code cell producing the chart, immediately followed by a Markdown interpretation.

Her strongest visualization is a scatter plot of GDP per capita versus vaccination rate, colored by WHO region, with a fitted LOESS curve:

Figure 3: The Diminishing Returns of Wealth

The relationship between GDP per capita and vaccination rate is clearly positive but decidedly non-linear. Below approximately $10,000 GDP per capita, vaccination rates increase steeply with wealth. Between $10,000 and $30,000, the relationship flattens. Above $30,000, additional GDP has almost no association with vaccination rates — wealthy countries are all highly vaccinated, regardless of whether their GDP is $30,000 or $90,000.

This non-linearity has important implications: it suggests that the marginal effect of economic development on vaccination coverage decreases as countries grow wealthier, and that for the poorest countries, even modest economic gains might translate to meaningful vaccination improvements.

What works here: The chart has a descriptive title that communicates the finding (not "Scatter Plot" or "Figure 3"). The interpretation tells the reader what to notice, identifies the non-linear pattern, and explains its significance. She connects the observation to its analytical implications (this will motivate using a non-linear model later).
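One way to produce a chart like Figure 3 is with the lowess smoother from statsmodels plus a matplotlib scatter plot. The sketch below is an assumption about tooling, since the case study does not say which library Elena used. The data are synthetic, the helper names loess_curve and plot_gdp_vs_vaccination are hypothetical, and the smoothing fraction of 0.6 is an arbitrary starting point.

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

def loess_curve(x, y, frac=0.6):
    """Return (x sorted, smoothed y) from a LOWESS fit."""
    fitted = lowess(y, x, frac=frac)  # note the (y, x) argument order
    return fitted[:, 0], fitted[:, 1]

def plot_gdp_vs_vaccination(gdp, rate, region_codes):
    """Scatter plot colored by region with the LOWESS curve overlaid."""
    import matplotlib.pyplot as plt  # imported here so the fit runs without it
    fig, ax = plt.subplots()
    ax.scatter(gdp, rate, c=region_codes, cmap="tab10", alpha=0.7)
    ax.plot(*loess_curve(gdp, rate), color="black")
    ax.set_xlabel("GDP per capita (current USD)")
    ax.set_ylabel("People fully vaccinated per hundred")
    ax.set_title("The Diminishing Returns of Wealth")
    return fig

# Synthetic data with a saturating shape, for illustration only.
rng = np.random.default_rng(1)
gdp = rng.uniform(500, 90_000, size=150)
rate = np.clip(90 * (1 - np.exp(-gdp / 8_000)) + rng.normal(0, 8, size=150),
               0, 100)

x_smooth, y_smooth = loess_curve(gdp, rate)
print("Smoothed rate at lowest GDP: ", round(float(y_smooth[0]), 1))
print("Smoothed rate at highest GDP:", round(float(y_smooth[-1]), 1))
```

Notice that the finding goes in the title and the variable names go on the axes. Try a few values of frac: too small and the curve chases noise, too large and it hides the bend.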

Statistical Analysis

Elena runs two formal tests:

Test 1: A Kruskal-Wallis test comparing vaccination rates across four World Bank income groups (low, lower-middle, upper-middle, high). She chooses Kruskal-Wallis over ANOVA because a Shapiro-Wilk test showed the distributions were non-normal in two groups.

The test confirms a highly significant difference in vaccination rates across income groups (H = 98.7, p < 0.001). Post-hoc Dunn's tests with Bonferroni correction show that all pairwise comparisons are significant, but the largest effect size is between low-income and lower-middle-income countries (d = 1.42), not between high-income and upper-middle-income countries (d = 0.67). This suggests the biggest "vaccination cliff" is at the bottom of the income distribution.
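The assumption check and the omnibus test in Test 1 can be sketched with SciPy as follows. The group sizes and vaccination rates here are randomly generated, so the printed statistics will not match the values Elena reports. The pairwise follow-up tests are omitted from this sketch.

```python
import numpy as np
from scipy import stats

# Synthetic vaccination rates for each income group (illustrative only).
rng = np.random.default_rng(2)
groups = {
    "low": np.clip(rng.normal(20, 12, size=28), 0, 100),
    "lower_middle": np.clip(rng.normal(45, 18, size=50), 0, 100),
    "upper_middle": np.clip(rng.normal(62, 16, size=50), 0, 100),
    "high": np.clip(rng.normal(80, 10, size=55), 0, 100),
}

# Step 1: Shapiro-Wilk test of normality within each group.
alpha = 0.05
non_normal = []
for name, values in groups.items():
    _, p_normal = stats.shapiro(values)
    if p_normal < alpha:
        non_normal.append(name)
print("Groups rejecting normality:", non_normal)

# Step 2: Kruskal-Wallis compares the groups using ranks, so it does not
# require normally distributed data.
h_stat, p_value = stats.kruskal(*groups.values())
print(f"H = {h_stat:.1f}, p = {p_value:.3g}")
```

A significant result here says only that at least one group differs. Identifying which pairs differ requires a post-hoc procedure with a multiple-comparison correction, as Elena does.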

Test 2: A partial correlation analysis examining the relationship between healthcare workforce density and vaccination rate, controlling for GDP per capita.

After controlling for GDP, the partial correlation between physician density and vaccination rate remains significant (r_partial = 0.38, p < 0.001). This suggests that healthcare workforce density is associated with vaccination coverage independent of national wealth — countries with more healthcare workers per capita tend to have higher vaccination rates even after accounting for economic differences.
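A partial correlation with a single control variable can be computed by regressing both variables on the control and correlating the residuals. The sketch below shows that approach on synthetic data. The function name partial_corr is hypothetical, and controlling for log GDP rather than raw GDP is an assumption made here for illustration.

```python
import numpy as np
from scipy import stats

def partial_corr(x, y, control):
    """Correlation of x and y after removing the linear effect of control."""
    x, y, control = (np.asarray(a, dtype=float) for a in (x, y, control))
    design = np.column_stack([np.ones_like(control), control])
    # Residuals from least-squares regressions of x and y on the control.
    res_x = x - design @ np.linalg.lstsq(design, x, rcond=None)[0]
    res_y = y - design @ np.linalg.lstsq(design, y, rcond=None)[0]
    r, _ = stats.pearsonr(res_x, res_y)
    # With one control variable the test has n - 3 degrees of freedom,
    # so the p-value is computed here rather than taken from pearsonr.
    dof = len(x) - 3
    t_stat = r * np.sqrt(dof / (1 - r**2))
    p = 2 * stats.t.sf(abs(t_stat), dof)
    return float(r), float(p)

# Synthetic data in which physician density matters beyond GDP.
rng = np.random.default_rng(5)
n = 183
log_gdp = rng.normal(9, 1.2, size=n)
physicians = 0.5 * log_gdp + rng.normal(0, 0.6, size=n)
vax_rate = 12 * log_gdp + 8 * physicians + rng.normal(0, 10, size=n)

r_partial, p_partial = partial_corr(physicians, vax_rate, log_gdp)
print(f"r_partial = {r_partial:.2f}, p = {p_partial:.3g}")
```

Comparing this value with the ordinary correlation between the same two variables shows how much of the raw association was carried by wealth.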

What works here: She checks assumptions (normality), chooses appropriate tests (non-parametric when assumptions are violated), reports full results (test statistic, p-value, effect size), and interprets in context. The partial correlation analysis directly addresses her research question.

Modeling

Elena builds three models and compares them:

Model	R-squared (train)	R-squared (test)	RMSE (test)	Notes
Linear Regression	0.69	0.66	18.4	Interpretable coefficients
LASSO Regression	0.68	0.67	18.1	Regularization selected 7 of 12 features
Random Forest	0.89	0.76	15.5	Highest test performance

She notes the gap between training and test R-squared for the random forest (0.89 vs. 0.76) and acknowledges mild overfitting, which she addresses by tuning max_depth and min_samples_leaf.
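A comparison table like this one can be generated by fitting each model on the same training split and scoring it on the same held-out split. The sketch below uses scikit-learn with synthetic data of the same shape (183 rows, 12 features). The hyperparameter values and grid are assumptions, and the resulting numbers will not match Elena's.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the merged country table.
X, y = make_regression(n_samples=183, n_features=12, n_informative=7,
                       noise=20.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

models = {
    "Linear Regression": make_pipeline(StandardScaler(), LinearRegression()),
    "LASSO Regression": make_pipeline(StandardScaler(), Lasso(alpha=1.0)),
    "Random Forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred_test = model.predict(X_test)
    results[name] = {
        "r2_train": r2_score(y_train, model.predict(X_train)),
        "r2_test": r2_score(y_test, pred_test),
        "rmse_test": float(np.sqrt(mean_squared_error(y_test, pred_test))),
    }
    print(name, {k: round(v, 2) for k, v in results[name].items()})

# Count the features LASSO kept (non-zero coefficients).
lasso = models["LASSO Regression"].named_steps["lasso"]
n_selected = int(np.sum(lasso.coef_ != 0))
print("LASSO kept", n_selected, "of", X.shape[1], "features")

# Tune the two overfitting controls with cross-validation on the
# training set only, so the test set stays untouched.
grid = GridSearchCV(
    RandomForestRegressor(n_estimators=100, random_state=0),
    param_grid={"max_depth": [3, 5, None], "min_samples_leaf": [1, 3, 5]},
    cv=5,
    scoring="r2",
)
grid.fit(X_train, y_train)
print("Best random forest settings:", grid.best_params_)
```

Reporting the training score next to the test score is what makes overfitting visible. A large gap between the two is the signal to constrain the model.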

Her feature importance analysis from the random forest reveals:

Healthcare workforce density (physicians + nurses combined): 28% importance
GDP per capita: 22%
Health expenditure as % of GDP: 15%
Urban population percentage: 12%
Secondary school enrollment: 8%
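Percentages like these can be read from a fitted random forest's feature_importances_ attribute, which scikit-learn normalizes to sum to 1. The sketch below uses synthetic data and hypothetical column names, including a single workforce_density column standing in for the combined physician and nurse measure.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Synthetic predictors; names and values are assumptions for illustration.
rng = np.random.default_rng(3)
n = 183
X = pd.DataFrame({
    "workforce_density": rng.uniform(0.2, 15, size=n),
    "gdp_per_capita": rng.lognormal(9, 1, size=n),
    "health_exp_pct_gdp": rng.uniform(2, 12, size=n),
    "urban_pct": rng.uniform(15, 100, size=n),
})
# The outcome is built from workforce density and log GDP plus noise.
y = (4 * X["workforce_density"] + 8 * np.log(X["gdp_per_capita"])
     + rng.normal(0, 5, size=n))

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances, expressed as percentages and sorted.
importance_pct = (
    pd.Series(rf.feature_importances_, index=X.columns)
    .sort_values(ascending=False) * 100
).round(1)
print(importance_pct)
```

These impurity-based importances are computed from the training data and describe how the model makes its predictions, not what causes vaccination rates. Permutation importance on held-out data is a common cross-check.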

She uses the linear regression coefficients to provide interpretable estimates: "A 10% increase in health expenditure as a share of GDP is associated with a 6.3 percentage point increase in vaccination rate, holding other factors constant."

Findings and Conclusions

Elena returns to her original question and answers it directly:

What explains the global vaccination divide?

The short answer: a combination of economic capacity and health system infrastructure, with the latter mattering more than is commonly assumed.

The longer answer reveals three key findings:

GDP matters, but not linearly. Wealth is strongly associated with vaccination rates at the bottom of the income distribution, but the relationship flattens above approximately $10,000 GDP per capita. This means that for the world's poorest countries, economic development may meaningfully improve vaccination outcomes — but for middle-income countries, other factors become more important.

Healthcare workforce density is the strongest single predictor. Across all three models, the number of physicians and nurses per capita was the most important feature — more important than GDP, total health spending, urbanization, or education levels. This makes intuitive sense: vaccines don't deliver themselves. Without healthcare workers to administer and distribute vaccines, coverage remains low regardless of supply.

Health spending as a share of GDP predicts better than health spending in absolute terms. This suggests that a country's prioritization of healthcare — what fraction of its resources it devotes to health — matters more than its absolute wealth. Some countries with modest GDPs achieve high vaccination rates by investing heavily in health systems.

Limitations

Elena lists four specific limitations:

Country-level aggregation masks within-country variation
Cross-sectional analysis cannot establish causation
The healthcare workforce data is from 2020, before the pandemic may have altered workforce levels
The analysis cannot distinguish between vaccine supply constraints and vaccine demand/hesitancy

Ethical Reflection

Elena's ethical reflection is her most thoughtful writing:

This analysis treats 183 countries as data points in a regression model. Behind each data point are millions of people whose lives were shaped by the pandemic, and whose access to vaccination was determined by factors far beyond their control.

I want to be explicit about a tension in this work: by identifying "predictors" of low vaccination rates, there is a risk of implicitly blaming the countries with the lowest rates. But low vaccination rates in Sub-Saharan Africa are not a failure of African countries — they reflect global structures of pharmaceutical production, intellectual property law, international aid, and historical colonialism that concentrated vaccine manufacturing in a handful of wealthy nations. My analysis can identify statistical associations, but it cannot capture these structural dynamics.

I also note that the data itself is shaped by power. Countries with weaker statistical systems, often the same countries with the lowest vaccination rates, are the least likely to have complete, accurate data. Eleven countries were excluded from my analysis for missing data, and a further 23 required imputed physician density data (31 for nursing personnel). These are not random omissions. They are systematic, and they mean that the countries most in need of attention are the hardest to study.

What Makes This Capstone Excellent

Elena's capstone falls in the rubric's Exceptional band (22-24 points), earning full marks on each of the six criteria. Here's why:

Question and Motivation (4/4): Specific question, compelling motivation, vivid opening, clear scope.
Data Handling (4/4): Three sources documented with download dates; five cleaning decisions documented with alternatives considered; sensitivity analysis on key choices.
Exploration and Visualization (4/4): Six polished visualizations with descriptive titles and written interpretations; non-obvious findings identified.
Statistical Analysis/Modeling (4/4): Appropriate tests with checked assumptions; three models compared with honest evaluation; feature importance analysis.
Communication (4/4): Reads as a narrative; accessible to non-technical readers; conclusion answers the question directly.
Critical Reflection (4/4): Specific, honest limitations; thoughtful ethical reflection connecting analytical choices to structural dynamics.

Discussion Questions

Elena's ethical reflection discusses structural factors (pharmaceutical production, intellectual property, colonialism) that her quantitative analysis cannot capture. How does acknowledging these qualitative factors strengthen rather than weaken the quantitative work?
Elena chose to present both a linear regression (for interpretability) and a random forest (for prediction). When might you present only one model? When is presenting multiple models valuable?
Elena's strongest visualization has a descriptive title ("The Diminishing Returns of Wealth") rather than a generic one ("Scatter Plot of GDP vs. Vaccination Rate"). Rewrite three of your own visualization titles to be more descriptive and finding-oriented.
The capstone uses KNN imputation for missing healthcare workforce data. What are the risks of this approach? Under what circumstances would dropping the rows be preferable?
Elena's abstract is about 150 words. Try writing an abstract for your own capstone in 150 words or fewer. Every word must earn its place.