Capstone Project 1: Public Health Data Investigation

Project Overview

You are a junior data analyst at a county public health department. Your supervisor has asked you to investigate a health-related question using publicly available data and produce a comprehensive report that could inform policy decisions. Your analysis must be thorough, statistically rigorous, ethically thoughtful, and clearly communicated to a non-technical audience.

This project requires you to demonstrate every major skill you've learned in this course — from exploratory data analysis through regression modeling — applied to a single, coherent investigation.

Estimated time: 15-25 hours over 2-3 weeks

Deliverables:

  1. A Jupyter notebook containing all code, analysis, and narrative (the "technical report")
  2. A 2-page executive summary written for a non-statistician audience
  3. A brief reflection (at least 1 page) on ethical considerations


Step 1: Choose Your Dataset and Research Question

Select a publicly available health dataset from one of the following sources (or propose an alternative with your instructor's approval):

Recommended Data Sources:

  • CDC WONDER (wonder.cdc.gov): Mortality data, birth data, environmental health data. Example datasets include cause-of-death by county, infant mortality rates, cancer incidence.
  • Behavioral Risk Factor Surveillance System (BRFSS): The largest continuously conducted health survey in the world. Covers health behaviors, chronic conditions, and preventive services across U.S. states.
  • WHO Global Health Observatory: International health statistics, including life expectancy, disease prevalence, vaccination rates, and health expenditure by country.
  • State or county health department open data: Many states publish datasets on opioid overdoses, lead exposure, disease outbreaks, hospital readmissions, etc.
  • National Health and Nutrition Examination Survey (NHANES): Combines interviews and physical examinations for a nationally representative sample.

Your research question must:

  • Be specific and answerable with the data you have
  • Involve at least one numerical variable and at least one categorical variable
  • Be relevant to a real public health concern
  • Require more than descriptive statistics to answer (i.e., it should call for inference)

Example research questions (choose your own; these are just inspiration):

  • Is there a significant difference in obesity rates between states that expanded Medicaid and those that did not?
  • After controlling for age and income, does smoking status predict self-reported health status?
  • Has the opioid mortality rate in rural counties changed significantly compared to urban counties over the past decade?
  • Is there an association between county-level air quality index and asthma hospitalization rates?
  • Do vaccination rates differ significantly by education level, and if so, how large is the effect?

Deliverable for this step: A 200-300 word statement including your research question, the dataset you've chosen, why this question matters for public health, and what you expect to find (your hypothesis).


Step 2: Data Acquisition and Preparation

Download your dataset and prepare it for analysis. This step should be fully documented in your Jupyter notebook.

Required tasks:

  1. Load and inspect the data. Use pandas to load the dataset. Report the number of rows and columns, display the first several rows, and run .info() and .describe().

  2. Create a data dictionary. For each variable you plan to use, document:
     - Variable name
     - Description
     - Type (categorical nominal, categorical ordinal, numerical discrete, numerical continuous)
     - Units (if applicable)
     - Source notes

  3. Assess data quality. Identify and report:
     - Missing values (count and percentage for each variable)
     - Potential outliers or implausible values
     - Duplicate records
     - Any inconsistencies in coding or formatting

  4. Clean the data. Handle missing values, remove or correct errors, and document every decision. For each decision, write a brief justification:
     - Why did you choose to drop rows vs. impute values?
     - What method of imputation did you use and why?
     - Did you remove any outliers? On what basis?

  5. Create any derived variables you'll need for your analysis (e.g., binning a continuous variable into groups, creating a BMI variable from height and weight, calculating rates from counts and populations).
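The tasks above can be sketched roughly as follows. This is a minimal illustration, not a required workflow: the DataFrame is a tiny synthetic stand-in for a real download, and the column names (`county`, `deaths`, `population`, `urban_rural`) and the choice to drop rather than impute are assumptions made only for the example.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for a county-level file. In a real notebook this would be
# loaded with something like pd.read_csv(...). All names and values are made up.
df = pd.DataFrame({
    "county": ["A", "B", "C", "D", "D", "E"],
    "deaths": [12, 30, np.nan, 8, 8, 45],
    "population": [15000, 42000, 27000, 9000, 9000, 61000],
    "urban_rural": ["rural", "urban", "rural", "rural", "rural", "urban"],
})

# Task 1. Inspect: shape, first rows, dtypes, summary statistics
print(df.shape)
print(df.head())
df.info()
print(df.describe())

# Task 3. Data quality: missing values (count and percent) and duplicate rows
missing = pd.DataFrame({
    "n_missing": df.isna().sum(),
    "pct_missing": (df.isna().mean() * 100).round(1),
})
print(missing)
n_duplicates = int(df.duplicated().sum())
print(f"Duplicate rows: {n_duplicates}")

# Task 4. Clean: drop exact duplicates, then drop rows missing the outcome.
# (Dropping is an illustrative choice here; your notebook must justify its own.)
clean = df.drop_duplicates().dropna(subset=["deaths"]).copy()

# Task 5. Derived variable: crude rate per 100,000 population
clean["rate_per_100k"] = clean["deaths"] / clean["population"] * 100_000
print(clean)
```

Keeping the raw `df` untouched and writing cleaned results to a new object (`clean`) makes it easier to report how many rows each decision removed.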

Rubric focus for this step: Data handling, reproducibility.


Step 3: Exploratory Data Analysis

Explore the data visually and numerically before conducting any formal tests. Remember: you should never run a hypothesis test on data you haven't looked at first.

Required elements:

  1. Univariate exploration. For each key variable:
     - Numerical variables: histogram, box plot, five-number summary, mean, standard deviation, assessment of shape (symmetric, skewed, outliers)
     - Categorical variables: bar chart, frequency table, relative frequency table

  2. Bivariate exploration. For relationships relevant to your research question:
     - Numerical vs. numerical: scatterplot with trend line, correlation coefficient
     - Numerical vs. categorical: side-by-side box plots, group means and standard deviations
     - Categorical vs. categorical: contingency table, grouped bar chart

  3. Distributional assessment. For any variable you plan to use in a parametric test:
     - QQ-plot
     - Shapiro-Wilk test (for smaller samples)
     - Discussion of whether parametric methods are appropriate

  4. Narrative. Write 2-3 paragraphs summarizing what the exploratory analysis reveals. What patterns do you see? What surprises you? What concerns do you have about the data? Does anything you've found change your initial hypothesis?
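As a rough sketch of the numerical side of these elements, the example below computes a five-number summary, group-level statistics, and a Shapiro-Wilk test. The data are synthetic (a seeded lognormal draw), and the column names `rate` and `group` are hypothetical. Plots are omitted to keep the block short; in a notebook you would pair each summary with the corresponding figure.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed so the synthetic data are reproducible

# Synthetic data (made up for illustration): a right-skewed outcome and a two-level group
df = pd.DataFrame({
    "rate": rng.lognormal(mean=3.0, sigma=0.6, size=80),
    "group": np.repeat(["rural", "urban"], 40),
})

# Univariate: five-number summary plus mean, standard deviation, and skewness
five_num = df["rate"].quantile([0, 0.25, 0.5, 0.75, 1.0])
print(five_num)
print(f"mean={df['rate'].mean():.2f}, sd={df['rate'].std():.2f}, "
      f"skew={df['rate'].skew():.2f}")

# Bivariate (numerical vs. categorical): group sizes, means, standard deviations
group_summary = df.groupby("group")["rate"].agg(["count", "mean", "std"])
print(group_summary)

# Distributional assessment: Shapiro-Wilk on the raw and log-transformed outcome
w_raw, p_raw = stats.shapiro(df["rate"])
w_log, p_log = stats.shapiro(np.log(df["rate"]))
print(f"Shapiro-Wilk raw: W={w_raw:.3f}, p={p_raw:.4f}")
print(f"Shapiro-Wilk log: W={w_log:.3f}, p={p_log:.4f}")

# QQ-plot coordinates. Passing plot=matplotlib.pyplot to probplot draws the figure.
(theoretical_q, ordered_vals), (slope, intercept, r) = stats.probplot(
    df["rate"], dist="norm"
)
```

Comparing the test on the raw and transformed scales is one way to support the required discussion of whether parametric methods are appropriate, alongside the QQ-plot itself.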

Rubric focus for this step: Visualization, interpretation.


Step 4: Statistical Analysis

This is the core of your project. You must include at least three of the following four analysis types:

4A: Confidence Intervals

Construct and interpret at least two confidence intervals relevant to your research question.

  • State the parameter being estimated
  • Verify conditions (randomness, sample size, normality)
  • Calculate the interval (show your work or code)
  • Interpret the interval in context — not just "we are 95% confident that the true mean is between X and Y," but what that means for public health
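A minimal sketch of the calculation step, assuming a t-interval for a mean and a large-sample (Wald) interval for a proportion. The blood-pressure values and survey counts are invented for illustration, and the variable names are hypothetical.

```python
import numpy as np
from scipy import stats

# Made-up sample: systolic blood pressure (mmHg) for 12 hypothetical respondents
x = np.array([118, 125, 131, 122, 140, 128, 135, 119, 127, 133, 124, 130],
             dtype=float)

n = x.size
mean = x.mean()
se = x.std(ddof=1) / np.sqrt(n)          # standard error of the mean
t_crit = stats.t.ppf(0.975, df=n - 1)    # two-sided 95% critical value
ci_mean = (mean - t_crit * se, mean + t_crit * se)
print(f"mean = {mean:.1f}, 95% CI = ({ci_mean[0]:.1f}, {ci_mean[1]:.1f})")

# Proportion: made-up counts, 168 of 400 respondents reporting a condition
successes, trials = 168, 400
p_hat = successes / trials
z_crit = stats.norm.ppf(0.975)

# Condition check for the large-sample interval: at least 10 successes and failures
assert successes >= 10 and trials - successes >= 10

se_p = np.sqrt(p_hat * (1 - p_hat) / trials)
ci_prop = (p_hat - z_crit * se_p, p_hat + z_crit * se_p)
print(f"p_hat = {p_hat:.3f}, 95% CI = ({ci_prop[0]:.3f}, {ci_prop[1]:.3f})")
```

Writing out the standard error and critical value separately, rather than calling a single helper, makes it easier to "show your work" and to explain each piece in the narrative.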

4B: Hypothesis Test

Conduct at least one formal hypothesis test.

  • State the null and alternative hypotheses in both words and symbols
  • Identify the appropriate test (z-test for proportions, t-test for means, chi-square, ANOVA, etc.) and justify your choice
  • Verify the conditions/assumptions for the test
  • Calculate the test statistic and p-value
  • State your conclusion in context
  • Calculate and interpret the effect size (Cohen's d, Cramér's V, eta-squared, or r-squared as appropriate)
  • Discuss whether the result is practically significant, not just statistically significant
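The test-statistic and effect-size steps might look like the sketch below, which assumes a two-sided comparison of two independent group means. Welch's t-test is used here as one reasonable choice, not the only valid one. The groups are synthetic, and `group_a`, `group_b`, and `cohens_d` are hypothetical names introduced for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed for reproducibility

# Synthetic BMI-like values for two hypothetical groups (made up for illustration)
group_a = rng.normal(loc=28.5, scale=4.0, size=60)
group_b = rng.normal(loc=27.0, scale=4.5, size=55)

# H0: mu_a = mu_b    Ha: mu_a != mu_b (two-sided)
# Welch's t-test does not assume equal population variances
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)


def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)
    ) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)


d = cohens_d(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, Cohen's d = {d:.2f}")
```

Reporting the effect size next to the p-value is what lets you address practical significance: a small p-value with a small d usually points to a real but modest difference.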

4C: Regression Analysis

Fit at least one regression model relevant to your research question.

  • For simple linear regression: scatterplot, fitted line, interpretation of slope and intercept, r-squared, residual plot
  • For multiple regression: interpretation of coefficients ("holding other variables constant"), adjusted r-squared, significance of individual predictors, residual diagnostics, check for multicollinearity (VIF)
  • For logistic regression (if your outcome is binary): interpretation of odds ratios, confusion matrix, discussion of classification performance

  • Discuss the limitations of your model: what confounders might you be missing? Does the model support causal claims or only associative ones?
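For the multiple-regression items, here is a sketch that fits ordinary least squares and computes adjusted r-squared, coefficient p-values, and VIFs directly with NumPy and SciPy. This is a deliberate simplification so each quantity is visible. In practice a modeling library such as statsmodels would typically report the same values. The predictors, the outcome, and the `vif` helper are all hypothetical and the data are simulated.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n = 200

# Synthetic predictors (made up): age in years, income in $1,000s correlated with age
age = rng.normal(45, 12, n)
income = 20 + 0.5 * age + rng.normal(0, 8, n)
# Synthetic outcome with known coefficients so the fit can be sanity-checked
y = 10 + 0.30 * age - 0.10 * income + rng.normal(0, 2, n)

X = np.column_stack([np.ones(n), age, income])   # design matrix with intercept
names = ["intercept", "age", "income"]

beta, *_ = np.linalg.lstsq(X, y, rcond=None)     # ordinary least squares fit
resid = y - X @ beta
k = X.shape[1] - 1                               # number of predictors

ss_res = resid @ resid
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Standard errors, t statistics, and two-sided p-values for each coefficient
sigma2 = ss_res / (n - k - 1)
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_vals = beta / se
p_vals = 2 * stats.t.sf(np.abs(t_vals), df=n - k - 1)


def vif(X_pred, j):
    """VIF for column j: 1 / (1 - R^2) from regressing it on the other predictors."""
    target = X_pred[:, j]
    others = np.column_stack([np.ones(len(target)), np.delete(X_pred, j, axis=1)])
    coef, *_ = np.linalg.lstsq(others, target, rcond=None)
    ss_r = np.sum((target - others @ coef) ** 2)
    ss_t = np.sum((target - target.mean()) ** 2)
    return 1 / (1 - (1 - ss_r / ss_t))


predictors = X[:, 1:]
vifs = [vif(predictors, j) for j in range(predictors.shape[1])]

for name, b, s, p in zip(names, beta, se, p_vals):
    print(f"{name:>9}: coef={b:.3f}, se={s:.3f}, p={p:.4f}")
print(f"R^2={r2:.3f}, adjusted R^2={adj_r2:.3f}, VIFs={np.round(vifs, 2)}")
```

Each slope is read as the expected change in the outcome per one-unit change in that predictor, holding the other predictor constant. The residuals in `resid` are what you would plot against fitted values for the diagnostics.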

4D: Group Comparison

Compare outcomes across groups using the appropriate method.

  • Two groups: two-sample t-test or two-proportion z-test
  • More than two groups: one-way ANOVA with post-hoc tests
  • Categorical variables: chi-square test of independence
  • Assumption violations: nonparametric alternative (Wilcoxon rank-sum, also called Mann-Whitney U, for two groups; Kruskal-Wallis for more than two)

Report group-level descriptive statistics, test results, effect sizes, and confidence intervals for differences.
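The sketch below illustrates the more-than-two-groups and categorical cases with simulated data: one-way ANOVA with eta-squared, Kruskal-Wallis as the rank-based alternative, a simple post-hoc procedure, and a chi-square test with Cramér's V. Pairwise Welch t-tests with a Bonferroni correction stand in for the post-hoc step as one option among several. The group labels and the contingency table are invented.

```python
from itertools import combinations

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Synthetic outcome for three hypothetical education levels (made up)
groups = {
    "high_school": rng.normal(50, 10, 40),
    "some_college": rng.normal(54, 10, 40),
    "bachelors": rng.normal(58, 10, 40),
}
samples = list(groups.values())

# One-way ANOVA and its rank-based alternative
f_stat, p_anova = stats.f_oneway(*samples)
h_stat, p_kw = stats.kruskal(*samples)

# Eta-squared = between-group sum of squares / total sum of squares
all_vals = np.concatenate(samples)
grand_mean = all_vals.mean()
ss_between = sum(len(s) * (s.mean() - grand_mean) ** 2 for s in samples)
ss_total = np.sum((all_vals - grand_mean) ** 2)
eta_sq = ss_between / ss_total

# Post-hoc: pairwise Welch t-tests, p-values multiplied by the number of pairs
pairs = list(combinations(groups, 2))
posthoc = {}
for a, b in pairs:
    _, p = stats.ttest_ind(groups[a], groups[b], equal_var=False)
    posthoc[(a, b)] = min(p * len(pairs), 1.0)

# Chi-square test of independence on a made-up 2x3 contingency table
table = np.array([[30, 45, 25],
                  [50, 40, 10]])
chi2, p_chi, dof, expected = stats.chi2_contingency(table)
cramers_v = np.sqrt(chi2 / (table.sum() * (min(table.shape) - 1)))

print(f"ANOVA: F={f_stat:.2f}, p={p_anova:.4f}, eta^2={eta_sq:.3f}")
print(f"Kruskal-Wallis: H={h_stat:.2f}, p={p_kw:.4f}")
print(f"Chi-square: chi2={chi2:.2f}, dof={dof}, p={p_chi:.4f}, V={cramers_v:.3f}")
```

The `expected` array returned by `chi2_contingency` is worth printing in a real analysis, since the usual condition for the test is stated in terms of expected counts.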

Rubric focus for this step: Statistical analysis, question formulation.


Step 5: Ethical Considerations

Every dataset about human health involves real people. Your analysis has the potential to inform decisions that affect communities. This step asks you to think carefully about the ethical dimensions of your work.

Required elements (1 page minimum):

  1. Data provenance and consent. How was the data collected? Was informed consent obtained? Are there populations that might be represented in the data without their full knowledge or agreement?

  2. Privacy and identification risk. Even if the data is de-identified, could your analysis — especially when combined with geographic or demographic variables — risk re-identifying individuals or small groups?

  3. Representation and missing voices. Who might be underrepresented or absent from your dataset? How might that affect your conclusions? Are there communities that bear the burden of the health issue you're studying but aren't reflected in the data?

  4. Potential for misuse. How could your findings be misinterpreted or misused? If a policymaker read only your headline finding without the nuance, what might go wrong?

  5. Your responsibility. Given everything above, what steps have you taken (or would you recommend) to minimize harm and maximize the benefit of this analysis?

Rubric focus for this step: Ethics.


Step 6: Executive Summary

Write a 2-page (approximately 800-1,000 word) executive summary of your findings, written for a public health official who has no statistical training.

Requirements:

  • Open with the problem and why it matters
  • Summarize key findings in plain language (no jargon, no formulas)
  • Include 2-3 well-designed visualizations that support your narrative
  • Clearly state what the data shows and, equally important, what it does not show
  • End with actionable recommendations or next steps
  • Acknowledge limitations honestly

Rubric focus for this step: Communication.


Step 7: Reproducibility Check

Before submitting, verify that your work is fully reproducible.

Checklist:

  - [ ] The Jupyter notebook runs from top to bottom without errors ("Restart and Run All")
  - [ ] All data files are included or the notebook contains clear download instructions
  - [ ] All library imports are at the top of the notebook
  - [ ] Every data cleaning decision is documented with a justification
  - [ ] Random seeds are set for any simulation or bootstrap procedures
  - [ ] All figures have titles, axis labels, and legends where appropriate
  - [ ] Code cells include comments explaining what they do
  - [ ] The notebook includes a "Methods" section explaining your analytical choices
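To illustrate the random-seed item, here is a small sketch of a percentile bootstrap whose result is identical on every run because the generator is seeded. The seed value, the function name `bootstrap_ci_median`, and the length-of-stay numbers are all hypothetical.

```python
import numpy as np

SEED = 2024  # arbitrary value; what matters is that it is fixed and recorded


def bootstrap_ci_median(values, n_boot=5000, level=0.95, seed=SEED):
    """Percentile bootstrap confidence interval for the median."""
    rng = np.random.default_rng(seed)          # seeded generator, local to the call
    values = np.asarray(values, dtype=float)
    # Each row of idx is one resample (with replacement) of the original indices
    idx = rng.integers(0, len(values), size=(n_boot, len(values)))
    medians = np.median(values[idx], axis=1)
    alpha = 1 - level
    lo, hi = np.quantile(medians, [alpha / 2, 1 - alpha / 2])
    return lo, hi


# Made-up length-of-stay values (days)
stays = [2, 3, 3, 4, 4, 5, 5, 6, 7, 9, 12, 21]

ci_first = bootstrap_ci_median(stays)
ci_second = bootstrap_ci_median(stays)   # same seed, so the same interval
print(ci_first, ci_second)
```

Creating the generator inside the function from an explicit seed, instead of relying on global random state, means the result does not depend on which other cells ran first.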

Rubric focus for this step: Reproducibility.


Project Structure

Organize your Jupyter notebook with the following section headers:

1. Title and Research Question
2. Data Description and Dictionary
3. Data Cleaning and Preparation
4. Exploratory Data Analysis
5. Statistical Analysis
   5a. Confidence Intervals
   5b. Hypothesis Testing
   5c. Regression Analysis
   5d. Group Comparisons
6. Discussion and Interpretation
7. Ethical Considerations
8. Conclusions and Recommendations
9. References

Submit the following files:

  • capstone_public_health.ipynb: your complete Jupyter notebook
  • executive_summary.pdf: your 2-page executive summary
  • Any data files needed to run the notebook (or clear instructions for obtaining them)


Tips for Success

  • Start with the question, not the technique. Your research question should drive every analytical choice. Don't run a chi-square test just because it's in the rubric — run it because it answers your question.

  • Let the EDA guide you. Exploratory analysis often reveals surprises that reshape your approach. That's not a failure; that's good science. Document what you found and how it changed your plan.

  • Interpret everything in context. A p-value of 0.03 means nothing without context. "Counties with higher PM2.5 levels had significantly higher asthma hospitalization rates (p = 0.03), though the effect was small (r-squared = 0.08), suggesting air quality is one factor among many" — that's interpretation.

  • Be honest about limitations. The best projects aren't the ones with the cleanest results. They're the ones that clearly distinguish what the data can support from what it can't.

  • Use the rubric. It's included in this section for a reason. Read it before you start, check it while you work, and review it before you submit.

  • Cite your sources. Cite the data source, any methodology papers you consulted, and any claims you make about public health context. Use APA format or a consistent alternative.


Assessment

This project is assessed using the Capstone Rubric provided in this section. The rubric evaluates eight criteria:

Criterion               Weight
Question Formulation    10%
Data Handling           15%
Statistical Analysis    25%
Visualization           10%
Interpretation          15%
Ethics                  10%
Communication           10%
Reproducibility         5%

Total: 100%

See the detailed rubric for performance level descriptions (Excellent / Good / Developing / Needs Improvement) for each criterion.