Chapter 35 Exercises: Capstone Project Milestones

Contributors to Introduction to Data Science


How to use these exercises: Unlike the exercises in other chapters, these are the capstone project itself. They correspond to the milestones in Section 35.3 and guide you through the entire investigation step by step. Complete them in order: each exercise builds on the previous one, and together they produce your finished capstone.

Difficulty key: ⭐ Foundational | ⭐⭐ Intermediate | ⭐⭐⭐ Advanced | ⭐⭐⭐⭐ Extension

Phase 1: Setup and Scoping ⭐

Exercise 35.1 — Define your question

Write your capstone research question in one to two clear sentences. Then answer these validation questions:

Is the question specific enough to be answered with available data?
Is the question interesting enough that someone outside this course would care about the answer?
Can you identify at least two data sources that are relevant to the question?
Can you realistically complete this investigation in 15 hours?

If you answered "no" to any of these, refine your question until all four answers are "yes."

Guidance

A question like "What predicts vaccination rates?" is too broad. "What national-level health and economic indicators best predict a country's COVID-19 vaccination completion rate, and do these predictors differ across World Bank income groups?" is specific, interesting, data-answerable, and achievable. If you're doing Option B or C, apply the same specificity: "What factors drive daily revenue at a small bakery, and can we forecast Q4 revenue to within 15%?" or "When did three-point shooting volume become a significant predictor of NBA team winning percentage?"

Exercise 35.2 — Inventory your resources

Create a table documenting everything you're bringing to the capstone:

Resource	Status	Notes
Data source 1: [name]	Downloaded / Need to download	Size, format, date range
Data source 2: [name]	Downloaded / Need to download	Size, format, date range
Previous analysis (Chapter X)	Complete / Partial / Not done	What can be reused
...	...	...

For Option A: list every progressive project milestone you completed and identify gaps. For Options B/C: list all data sources you need and their acquisition status.

Guidance

This inventory prevents the "I thought I had that data" moment at 2 AM. Be honest about gaps. If you skipped Chapter 11's time series milestone, note that you'll need to do date parsing as part of the capstone. If you need weather data for Option B, verify that NOAA's data portal has what you need *before* you start the analysis.

Exercise 35.3 — Set up the repository

Create your capstone GitHub repository with the correct structure:

capstone-project-name/
    README.md
    notebooks/
    data/
        raw/
        processed/
    figures/
    requirements.txt
    .gitignore

Write a draft README with:

- A working title
- Your research question
- A list of planned data sources
- A brief description of your planned approach

Commit this initial structure with a descriptive commit message.

Guidance

The repository structure should be set up before you write any analysis code. This enforces good habits and ensures your project is organized from the start. The draft README will evolve as you work — that's fine. The point is to have something written down that anchors your project's direction.
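If you would rather script the layout than create each folder by hand, a minimal sketch along these lines would do it. The function name `create_skeleton` is an illustrative choice, not something the project specification defines, and the folder names are simply copied from the structure shown above.

```python
from pathlib import Path


def create_skeleton(root: Path) -> None:
    """Create the capstone folder layout under `root` without overwriting existing files."""
    for folder in ("notebooks", "data/raw", "data/processed", "figures"):
        (root / folder).mkdir(parents=True, exist_ok=True)
    for filename in ("README.md", "requirements.txt", ".gitignore"):
        # touch() creates an empty file, or leaves an existing one untouched
        (root / filename).touch(exist_ok=True)


# Replace the placeholder name with your own project name
create_skeleton(Path("capstone-project-name"))
```

Git does not track empty directories, so the folders will not show up on GitHub until each contains at least one file. A common convention is to add an empty placeholder file such as `.gitkeep` to each one.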

Phase 2: Data Preparation ⭐⭐

Exercise 35.4 — Load and inspect all data

In your capstone notebook, load all data sources. For each one, run and document:

df.shape — how many rows and columns?
df.dtypes — what types are the columns?
df.head() — what does the data look like?
df.describe() — what are the summary statistics?
df.isnull().sum() — where is data missing?

Write a Markdown summary (3-5 sentences) of what you observe. Note any immediate concerns about data quality.

Guidance

This is your first real look at the data you'll be working with for the entire project. Don't rush it. Look at the output of each command and think about what it tells you. Are there columns with unexpected types (a numeric column stored as strings)? Are there columns that are entirely null? Is the date range what you expected? Document everything — your future self will thank you.
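When you have several data sources, it can help to wrap these checks in one helper so every source gets the same treatment. The sketch below is one possible way to do that; the function name `inspect` and the toy DataFrame are hypothetical stand-ins for your own data.

```python
import pandas as pd


def inspect(df: pd.DataFrame, name: str) -> pd.DataFrame:
    """Print the basic checks for one data source and return a per-column summary."""
    print(f"=== {name} ===")
    print("shape:", df.shape)
    print(df.head())
    print(df.describe(include="all"))
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_missing": df.isnull().sum(),
        "pct_missing": (df.isnull().mean() * 100).round(1),
    })
    print(summary)
    return summary


# Hypothetical toy data standing in for one of your sources
example = pd.DataFrame({
    "country": ["A", "B", "C"],
    "gdp_per_capita": ["1200", "n/a", "45000"],  # numeric values stored as strings
    "vaccination_rate": [41.5, None, 78.2],
})
summary = inspect(example, "toy example")
```

In this toy case the summary would show `gdp_per_capita` with a non-numeric dtype, which is exactly the kind of concern worth noting in your Markdown summary.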

Exercise 35.5 — Clean the data with documented decisions

Address all data quality issues you identified in Exercise 35.4. For each cleaning step, write a Markdown cell explaining:

What the problem is
What options you considered
What you chose and why
What the potential consequences of your choice are

You must document at least three cleaning decisions with this level of detail.

After cleaning, produce a summary statistics table of the cleaned dataset and compare it briefly to the pre-cleaning summary.

Guidance

The documentation is as important as the cleaning itself. Example: "The `gdp_per_capita` column had 47 missing values, all in low-income countries. Options: (a) drop these rows, which would remove 25% of Sub-Saharan African countries; (b) impute with the regional median, which assumes countries within a region are economically similar; (c) impute with each country's most recent available value from the World Bank. I chose option (c) because it uses country-specific data rather than regional averages, and flagged imputed values in a new boolean column for sensitivity analysis."
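To make the flag-then-impute pattern from that example concrete, here is a small sketch. It assumes a long-format table with one row per country and year; the column name `gdp_imputed` and all values are hypothetical.

```python
import pandas as pd

# Hypothetical long-format data: one row per country and year
gdp = pd.DataFrame({
    "country": ["A", "A", "A", "B", "B", "B"],
    "year": [2019, 2020, 2021, 2019, 2020, 2021],
    "gdp_per_capita": [1100.0, 1150.0, None, 40000.0, None, None],
})

gdp = gdp.sort_values(["country", "year"])

# Record which values were missing BEFORE filling, so they can be excluded later
gdp["gdp_imputed"] = gdp["gdp_per_capita"].isna()

# Carry each country's most recent earlier value forward (never across countries)
gdp["gdp_per_capita"] = gdp.groupby("country")["gdp_per_capita"].ffill()

print(gdp)
```

A country with no earlier value at all would still be missing after this step, so it is worth re-running `isnull().sum()` afterwards and documenting what you do with any rows that remain.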

Exercise 35.6 — Merge and prepare the analytical dataset

If you're using multiple data sources, merge them into a single analytical DataFrame. Document:

What key(s) you merged on
How many records matched vs. didn't match
How you handled non-matching records
The final shape and structure of your analytical dataset

Save the cleaned, merged dataset to data/processed/ as a CSV file.

Guidance

Merging is where many data quality issues surface. Country names might not match between sources ("United States" vs. "United States of America" vs. "US"). Date formats might differ. Some countries in one source might not exist in another. Document how you resolved each issue. After merging, verify that the merge didn't introduce unexpected duplicates or null values.
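One way to get the match counts and duplicate checks described above is to let pandas report them during the merge. The sketch below uses made-up values and a hypothetical `name_fixes` mapping; your own sources will need their own mapping.

```python
import pandas as pd

# Hypothetical sources with inconsistent country names
vaccinations = pd.DataFrame({
    "country": ["United States of America", "France", "Kenya"],
    "vaccination_rate": [68.0, 78.0, 21.0],
})
indicators = pd.DataFrame({
    "country": ["United States", "France", "Brazil"],
    "gdp_per_capita": [70000.0, 43000.0, 7500.0],
})

# Harmonize names before merging, and keep the mapping in the notebook as documentation
name_fixes = {"United States of America": "United States", "US": "United States"}
vaccinations["country"] = vaccinations["country"].replace(name_fixes)

merged = vaccinations.merge(
    indicators,
    on="country",
    how="outer",            # keep non-matching rows so they can be counted
    indicator=True,         # adds a _merge column: both / left_only / right_only
    validate="one_to_one",  # raises an error if either side has duplicate keys
)
match_counts = merged["_merge"].value_counts()
print(match_counts)

# Keep only matched rows for the analytical dataset
analytical = merged[merged["_merge"] == "both"].drop(columns="_merge")
print(analytical.shape)
```

The `left_only` and `right_only` counts are the numbers to report under "how many records matched vs. didn't match". Whether to drop those rows, as done here, is itself a decision to document.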

Phase 3: Exploration ⭐⭐

Exercise 35.7 — Exploratory visualization set

Create at least four visualizations that reveal patterns in your data. For each visualization:

Choose an appropriate chart type (refer to Chapter 14's grammar of graphics and Chapter 18's design principles)
Give it a descriptive title that communicates the finding
Label all axes with units
Write a 2-4 sentence Markdown interpretation explaining what the chart reveals

Your visualizations should cover at least three of these categories:

- Distribution of a key variable
- Comparison across groups
- Relationship between two variables
- Trend over time

Guidance

The exploratory visualizations should tell a story, even without the rest of the notebook. A reader scrolling through just the charts should get a rough picture of your findings. Choose chart types that match the data: histograms or box plots for distributions, bar charts or grouped bar charts for comparisons, scatter plots for relationships, line charts for trends. Avoid pie charts for more than 3-4 categories.
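As an illustration of the title and axis-label requirements, here is a minimal matplotlib sketch of a "relationship between two variables" chart. The data values and column names are invented for the example, so the pattern in the title is not a real finding.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values purely for illustration
countries = pd.DataFrame({
    "health_spending_pct_gdp": [3.1, 4.8, 6.2, 7.9, 9.5, 11.0],
    "vaccination_rate": [22.0, 35.0, 48.0, 61.0, 70.0, 79.0],
})

fig, ax = plt.subplots(figsize=(7, 4))
ax.scatter(countries["health_spending_pct_gdp"], countries["vaccination_rate"])

# The title states the finding; the axis labels carry the units
ax.set_title("Vaccination rate rises with the share of GDP spent on health (toy data)")
ax.set_xlabel("Health spending (% of GDP)")
ax.set_ylabel("Fully vaccinated (% of population)")
fig.tight_layout()
```

Compare the title with a generic alternative such as "Scatter plot of spending vs. vaccination": the first tells the reader what to see, the second only names the chart type.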

Exercise 35.8 — Discover something surprising

During your exploration, identify at least one finding that surprised you — something that contradicted your expectations or revealed a pattern you didn't anticipate. Write a Markdown cell describing:

What you expected to find
What you actually found
Why the discrepancy might exist
What additional analysis it suggests

Guidance

Surprises are the heart of good data science. If your exploratory analysis confirms every prior assumption, either you're not looking carefully enough or the question isn't interesting enough. For the vaccination project, a surprise might be: "I expected healthcare spending per capita to be the strongest predictor, but healthcare spending as a *percentage of GDP* was stronger, suggesting that a country's *prioritization* of health matters more than its absolute wealth."

Exercise 35.9 — Summarize exploration findings

Write a 300-500 word summary of your exploratory analysis. Structure it as:

The three most important patterns you observed
How these patterns relate to your research question
What hypotheses they suggest for formal testing
What questions remain that the exploration couldn't answer

This summary will become part of your capstone notebook's transition between exploration and formal analysis.

Guidance

This transition section is important for narrative flow. It connects "here's what I see in the data" to "here's what I'm going to test formally." A good transition might be: "The exploratory analysis suggests that economic indicators alone don't fully explain vaccination rate variation — healthcare infrastructure variables show independent relationships. To test whether these variables add predictive power beyond GDP, I'll build regression models with and without healthcare features."

Phase 4: Formal Analysis ⭐⭐⭐

Exercise 35.10 — Statistical analysis

Conduct at least two formal statistical analyses appropriate to your question. For each one:

State the hypothesis (null and alternative)
Check the assumptions of the test
Run the test and report the results (test statistic, p-value, effect size)
Interpret the results in plain language — what does this tell us about the real world?

Guidance

For the vaccination project, you might test: (1) "Do vaccination rates differ significantly across World Bank income groups?" using ANOVA or Kruskal-Wallis, and (2) "Is there a significant correlation between healthcare worker density and vaccination rate after controlling for GDP?" using partial correlation or multiple regression. Always check assumptions — normality, homogeneity of variance — and use appropriate non-parametric alternatives when assumptions are violated.
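The first of those tests could be sketched with SciPy roughly as follows. The data are simulated, the group means are arbitrary, and epsilon-squared (computed here as H divided by n minus 1) is only one of several effect sizes you could report for Kruskal-Wallis.

```python
import numpy as np
from scipy import stats

# Simulated vaccination rates for three income groups (not real data)
rng = np.random.default_rng(42)
groups = {
    "low": rng.normal(25, 10, 30),
    "middle": rng.normal(50, 12, 30),
    "high": rng.normal(72, 8, 30),
}
samples = list(groups.values())

# Assumption checks: normality within each group, equal variances across groups
shapiro_p = {name: stats.shapiro(values).pvalue for name, values in groups.items()}
levene_p = stats.levene(*samples).pvalue

# Kruskal-Wallis does not assume normality, so it is a safe default here
h_stat, p_value = stats.kruskal(*samples)

# Epsilon-squared effect size: share of rank variance explained by group
n_total = sum(len(s) for s in samples)
epsilon_squared = h_stat / (n_total - 1)

print(f"Shapiro p-values: {shapiro_p}")
print(f"Levene p-value: {levene_p:.3f}")
print(f"H = {h_stat:.2f}, p = {p_value:.4g}, epsilon^2 = {epsilon_squared:.2f}")
```

If the Shapiro and Levene checks gave no cause for concern, a one-way ANOVA (`stats.f_oneway`) would be the parametric alternative. Either way, the plain-language interpretation still has to come from you.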

Exercise 35.11 — Build and evaluate models

Build at least two predictive or explanatory models. For each model:

Explain why you chose this model type (connect to your question and data characteristics)
Split data into training and test sets (or use cross-validation)
Train the model and report performance metrics
Interpret the results — what do the coefficients, feature importances, or predictions tell you?

Then compare the models:

- Which performs better by your chosen metric(s)?
- Which is more interpretable?
- Which would you recommend, and why?

Guidance

Model comparison is where analytical maturity shows. A strong capstone doesn't just report "Model A got R-squared 0.68 and Model B got R-squared 0.78." It explains the tradeoff: "The linear regression (R-squared = 0.68) allows direct coefficient interpretation — a $1,000 increase in GDP per capita is associated with a 0.8 percentage point increase in vaccination rate. The random forest (R-squared = 0.78) captures non-linear effects but doesn't provide interpretable coefficients. For this project's goal of understanding *which factors matter*, I present both: the linear model for interpretation and the random forest for feature importance ranking."
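A comparison like the one described could be set up with scikit-learn along these lines. This is a sketch on simulated data: the feature names, the coefficients used to generate `y`, and the choice of 5-fold cross-validation are all assumptions made for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Simulated data: vaccination rate driven by GDP and health worker density
rng = np.random.default_rng(7)
n = 200
gdp_thousands = rng.uniform(1, 60, n)
health_workers = rng.uniform(0.5, 15, n)
X = np.column_stack([gdp_thousands, health_workers])
y = 20 + 0.8 * gdp_thousands + 1.5 * health_workers + rng.normal(0, 5, n)

models = {
    "linear": LinearRegression(),
    "random_forest": RandomForestRegressor(n_estimators=200, random_state=0),
}

# Same folds and same metric for both models, so the comparison is fair
cv_r2 = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
print(cv_r2)

# Refit on all data to read off the interpretable quantities
linear = LinearRegression().fit(X, y)
forest = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
gdp_coefficient = linear.coef_[0]          # change in rate per $1,000 of GDP
importances = forest.feature_importances_  # relative importance, sums to 1
print(gdp_coefficient, importances)
```

Because the simulated relationship is linear, the linear model should do at least as well as the forest here. On real data the ranking may differ, which is where the tradeoff discussion becomes interesting.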

Exercise 35.12 — Sensitivity analysis

Choose one major analytical decision you made (e.g., how you handled missing data, how you defined your outcome variable, which features you included) and test how sensitive your results are to that choice. Specifically:

Identify the decision
Implement an alternative approach
Re-run the analysis with the alternative
Compare results — do your conclusions change?

Guidance

Sensitivity analysis is a hallmark of rigorous work. For example, if you imputed missing GDP values, re-run the model after dropping those countries instead of imputing. If your conclusions hold under both approaches, that strengthens your findings. If they change, that's important information — and discussing it honestly makes your project *stronger*, not weaker.
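The drop-versus-impute comparison might look like this in practice. The data are simulated, the missing values are placed at random, and a simple regression slope stands in for whatever headline result your own analysis produces.

```python
import numpy as np
import pandas as pd

# Simulated data with 20 GDP values removed at random
rng = np.random.default_rng(0)
n = 120
gdp = rng.uniform(1, 60, n)
rate = 20 + 0.8 * gdp + rng.normal(0, 8, n)
df = pd.DataFrame({"gdp_thousands": gdp, "vaccination_rate": rate})
df.loc[rng.choice(n, size=20, replace=False), "gdp_thousands"] = np.nan


def slope(data: pd.DataFrame) -> float:
    """Least-squares slope of vaccination_rate on gdp_thousands."""
    return float(np.polyfit(data["gdp_thousands"], data["vaccination_rate"], deg=1)[0])


# Alternative 1: drop rows with missing GDP
dropped = df.dropna(subset=["gdp_thousands"])

# Alternative 2: fill missing GDP with the overall median
imputed = df.fillna({"gdp_thousands": df["gdp_thousands"].median()})

results = {"drop rows": slope(dropped), "median impute": slope(imputed)}
for approach, value in results.items():
    print(f"{approach}: slope = {value:.3f}")
```

Reporting both numbers side by side, with one sentence on whether the conclusion changes, is usually enough. In real data, missingness is rarely random, so the two approaches may diverge more than they do in this simulation.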

Phase 5: Communication ⭐⭐⭐

Exercise 35.13 — Write the introduction

Write the Title/Abstract and Introduction/Motivation sections of your capstone notebook (700-1100 words combined). Include:

A descriptive, interesting title
A 200-300 word abstract summarizing question, data, methods, and key findings
A 500-800 word introduction explaining the question, its real-world significance, and a preview of your approach

Write these for a reader who is intelligent but may not know data science terminology. Every technical term should be explained or implied by context.

Guidance

The introduction is the most important writing in your capstone. It's what determines whether a reader engages with the rest. Start with something concrete and compelling — a striking statistic, a real-world consequence, or a question that anyone would find interesting. Then narrow to your specific investigation. End with a brief roadmap: "This analysis examines X using Y, finding Z."

Exercise 35.14 — Write the conclusions

Write the Findings/Conclusions section (500-800 words). Structure it as:

A direct answer to your research question (first paragraph)
Your three to five most important specific findings
What surprised you and what confirmed your expectations
Practical implications — who should care about these findings, and what should they do?

Guidance

The conclusion should mirror the introduction: it returns to the question stated at the beginning and answers it. If the answer is nuanced ("GDP is associated with vaccination rates, but the relationship is non-linear and moderated by healthcare infrastructure"), say so. If the data couldn't fully answer the question, explain why and what would be needed. Honesty is more impressive than false certainty.

Exercise 35.15 — Write the limitations and ethical reflection

Write the Limitations section (300-500 words) and the Ethical Reflection section (300-500 words).

Limitations should be specific:

- Not "the data may have errors" but "the WHO vaccination data relies on country self-reporting, and countries with weaker health information systems may undercount doses administered"
- Not "more data would be helpful" but "individual-level vaccination data (rather than country-level aggregates) would allow analysis of within-country disparities, which are likely substantial"

Ethical reflection should engage genuinely:

- Who is represented in your data, and who is invisible?
- How could your findings be misinterpreted or misused?
- What assumptions have you embedded in your analysis choices?
- What responsibility do you have as the person presenting these results?

Guidance

These sections demonstrate intellectual maturity. Generic limitations ("the dataset was small") and performative ethics ("data should be used responsibly") add nothing. Specific, thoughtful reflections show that you understand the real-world context of your work. The best ethical reflections connect specific analytical choices to their potential real-world consequences.

Phase 6: Polish and Finalize ⭐⭐⭐

Exercise 35.16 — Polish all visualizations

Review every visualization in your capstone notebook. For each one, verify:

[ ] Descriptive title (communicates the finding, not just the chart type)
[ ] Labeled axes with units
[ ] Appropriate color palette (accessible to color-blind readers)
[ ] Legend if multiple groups are shown
[ ] Consistent styling across all charts in the notebook
[ ] A Markdown cell immediately following with interpretation

Remove any visualizations that don't advance the narrative. If a chart doesn't help answer your question, it doesn't belong in the capstone.

Guidance

Less is more. Five excellent, well-interpreted charts are better than twelve mediocre ones. Each chart should have a clear purpose in the narrative — it should reveal something that words alone couldn't convey as effectively. If you can't articulate what a chart adds to the story, remove it.

Exercise 35.17 — Clean the notebook

Perform a thorough cleaning pass on your entire notebook:

Remove all debugging cells (repeated print(df.shape) calls, type(x) checks, etc.)
Remove dead-end analyses that didn't contribute to the final story
Ensure code cells use descriptive variable names
Add comments to any code where the intent isn't immediately obvious
Check that Markdown headers create a logical hierarchy (H1 for title, H2 for sections, H3 for subsections)
Verify consistent formatting (bullet points, bold terms, code formatting)
Run Kernel > Restart & Run All and verify every cell executes without error

Guidance

The Restart & Run All test is the most important. If your notebook doesn't run cleanly from top to bottom, it's not reproducible, which undermines everything. Common issues: cells that reference variables defined in deleted cells, cells run out of order during development, and cells that depend on external files not included in the repository.

Exercise 35.18 — Write the final README

Write the final README.md for your repository. It should stand alone — a reader who only reads the README should understand what you did, what you found, and how to reproduce your work. Include all sections specified in the project specification (Section 35.2).

Guidance

The README is your project's elevator pitch. Lead with the most interesting finding. A hiring manager reading this README for 30 seconds should understand: what question you asked, what data you used, what you found, and why it matters. Reproduction instructions should be specific enough that someone could clone your repository and run the notebook in under 10 minutes.

Exercise 35.19 — Conduct peer review

Exchange your capstone notebook with a classmate, study partner, or community member. Use the peer review guidelines in Section 35.6.

As a reviewer, provide:

1. A score for each rubric dimension (1-4)
2. Three specific things the project does well
3. Three specific things that could be improved
4. One question the analysis raised that could be explored further

After receiving feedback, address the most important suggestions and note (briefly) which feedback you incorporated and which you didn't (and why).

Guidance

If you can't find a peer reviewer, use the self-review checklist in Section 35.6. But peer review is strongly recommended — fresh eyes catch things you've been staring at for too long. If you're working independently, consider posting in an online data science community (r/datascience, a Discord server, or a study group) and offering to exchange reviews.

Exercise 35.20 — Final submission and reflection

Complete these final steps:

Run Kernel > Restart & Run All one final time
Commit all changes to your GitHub repository with a descriptive message
Verify that the repository is public and the README renders correctly on GitHub
Write a brief reflection (200-300 words) answering:

- What was the hardest part of this project?
- What would you do differently if you started over?
- What skill did you use most that you didn't expect to need?
- What are you most proud of in this capstone?

Guidance

The reflection is for you, not for a grade. Be honest. Common answers to "what was hardest?": writing the narrative, making the notebook tell a coherent story, knowing when to stop. Common answers to "what unexpected skill?": writing, data cleaning, making decisions under uncertainty. Common answers to "most proud of?": the final product existing at all — because finishing a major project is genuinely hard and genuinely worth celebrating. Congratulations. You just completed a data science investigation. That's not a small thing.