Chapter 35 Quiz: Capstone Readiness Self-Assessment
Instructions: This quiz is different from other chapter quizzes. It's a self-assessment designed to help you evaluate whether you're ready to begin (or have completed) the capstone project. Use it before starting to identify areas you may need to review, or after completing the capstone to verify you've covered all bases. Total points: 100.
Section 1: Skills Readiness (8 questions, 5 points each)
These questions test whether you have the skills needed for the capstone. If you struggle with any question, review the referenced chapter before starting the project.
Question 1. You have a DataFrame with a column vaccination_rate that contains some negative values (which are clearly errors). Which of the following is the BEST way to handle this in a capstone project?
- (A) Delete the entire column
- (B) Replace negative values with NaN, document the decision, and explain why you chose NaN over another approach (like zero or the column mean)
- (C) Replace negative values with zero without explanation
- (D) Ignore them — the model will figure it out
Answer
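A minimal pandas sketch of the recommended fix (the DataFrame, country labels, and values here are illustrative, not from the capstone dataset):

```python
import numpy as np
import pandas as pd

# Illustrative data; in the capstone this would be your real DataFrame.
df = pd.DataFrame({
    "country": ["A", "B", "C", "D"],
    "vaccination_rate": [72.5, -3.0, 55.1, -1.2],
})

# Count affected rows first, so the decision can be documented.
n_negative = int((df["vaccination_rate"] < 0).sum())

# Negative rates are impossible, so treat them as unknown (NaN), not zero.
df.loc[df["vaccination_rate"] < 0, "vaccination_rate"] = np.nan
print(f"Replaced {n_negative} negative values with NaN")
```

Recording `n_negative` in a markdown cell next to the code is what turns this from a silent fix into a documented decision.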
**Correct: (B)** Replacing with NaN is appropriate because negative vaccination rates are clearly data errors, not meaningful values. The key capstone requirement is *documentation* — explaining that negative rates are impossible, how many rows were affected, and why you chose NaN over alternatives (zero would imply no vaccination, which is different from "unknown"; the mean would be arbitrary). This connects to Chapter 8 (data cleaning) and the capstone rubric's Data Handling dimension.

Question 2. You're choosing between a bar chart, a scatter plot, and a box plot to show the relationship between World Bank income group (4 categories) and vaccination rate. Which chart type is MOST appropriate?
- (A) A scatter plot, because it shows individual data points
- (B) A bar chart showing the mean vaccination rate per group
- (C) A box plot, because it shows the distribution within each group, not just the mean
- (D) A pie chart, because there are only four categories
Answer
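A hedged matplotlib sketch of this kind of chart (the group names and values are synthetic, for illustration only):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(42)
# Synthetic vaccination rates for four illustrative income groups.
groups = {
    "Low": rng.normal(35, 12, 40),
    "Lower-middle": rng.normal(50, 12, 40),
    "Upper-middle": rng.normal(65, 10, 40),
    "High": rng.normal(80, 8, 40),
}

fig, ax = plt.subplots()
bp = ax.boxplot(list(groups.values()))
ax.set_xticks([1, 2, 3, 4], labels=list(groups.keys()))
ax.set_ylabel("Vaccination rate (%)")
ax.set_title("Distribution of vaccination rates by income group")
```

Each box shows the median, quartiles, and outliers per group, which is exactly the within-group spread a bar chart of means would hide.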
**Correct: (C)** A box plot is most appropriate because it shows the distribution (median, quartiles, range, outliers) within each income group, revealing important information about spread and overlap. A bar chart of means would hide the within-group variation. A scatter plot doesn't naturally accommodate a categorical x-axis. A pie chart shows parts of a whole, which doesn't apply here. This connects to Chapters 14-16 (visualization) and 18 (design principles).

Question 3. You built a logistic regression classifying countries as "high vaccination" (≥60%) vs. "low vaccination" (<60%). Your test set has 70% high-vaccination countries. Your model achieves 72% accuracy. Is this a good result?
- (A) Yes, 72% is above the passing threshold of 70%
- (B) No, because a model that always predicts "high vaccination" would achieve 70% accuracy, so 72% is barely better than guessing the majority class
- (C) Yes, because logistic regression is a simple model and 72% is reasonable for it
- (D) It depends on the R-squared value
Answer
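The baseline comparison can be made explicit in a few lines (the label names and counts mirror the scenario in the question):

```python
import pandas as pd

# Test-set labels: 70% "high", 30% "low", as in the scenario.
y_true = pd.Series(["high"] * 70 + ["low"] * 30)

# Baseline: always predict the majority class.
majority_class = y_true.mode()[0]
baseline_accuracy = float((y_true == majority_class).mean())

model_accuracy = 0.72  # the model's reported accuracy
lift = model_accuracy - baseline_accuracy
print(f"baseline={baseline_accuracy:.2f}, model={model_accuracy:.2f}, lift={lift:.2f}")
```

A lift of only two percentage points over the majority-class baseline is the number that should give you pause.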
**Correct: (B)** When the baseline accuracy (always predicting the majority class) is 70%, achieving 72% is marginal improvement. This is a critical evaluation skill from Chapter 29 — always compare model performance against a meaningful baseline. You should also consider metrics beyond accuracy, such as precision, recall, and F1-score, especially for imbalanced classes. R-squared (D) is used for regression, not classification.

Question 4. You want to test whether vaccination rates differ significantly across four WHO regions. Which test is most appropriate?
- (A) Four separate t-tests, one comparing each region to the overall mean
- (B) A single ANOVA (or Kruskal-Wallis if assumptions aren't met), followed by post-hoc comparisons
- (C) A chi-square test
- (D) A paired t-test
Answer
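A minimal scipy sketch of both tests, using synthetic region data (the group means and sizes are assumptions for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic vaccination rates for four regions with different true means.
regions = [rng.normal(mean, 10, size=30) for mean in (50, 55, 60, 70)]

# One omnibus ANOVA across all four groups (not four separate t-tests).
f_stat, p_anova = stats.f_oneway(*regions)

# Non-parametric alternative if normality or equal-variance assumptions fail.
h_stat, p_kruskal = stats.kruskal(*regions)

print(f"ANOVA p={p_anova:.4g}, Kruskal-Wallis p={p_kruskal:.4g}")
```

If the omnibus test is significant, follow up with post-hoc pairwise comparisons (e.g., Tukey's HSD) rather than uncorrected t-tests.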
**Correct: (B)** ANOVA (or its non-parametric equivalent, Kruskal-Wallis) is designed for comparing means across three or more groups. Multiple separate t-tests (A) inflate the Type I error rate — with four tests at alpha=0.05, the chance of at least one false positive is about 19%. Chi-square (C) is for categorical outcomes, not continuous. Paired t-test (D) requires paired observations. This connects to Chapter 23 (hypothesis testing).

Question 5. You're writing the capstone introduction and want to explain why vaccination rate disparities matter. Which opening is STRONGEST?
- (A) "In this notebook, I will analyze vaccination data."
- (B) "Data science is an important field that can be applied to many problems, including healthcare."
- (C) "A child born in a high-income country is five times more likely to be fully vaccinated against COVID-19 than a child born in a low-income country. This analysis investigates what drives that gap."
- (D) "The WHO publishes vaccination data for 194 countries. I downloaded it and performed analysis."
Answer
**Correct: (C)** Option C leads with a specific, striking fact that makes the reader care about the question. It immediately establishes the "so what?" — why this analysis matters for real people. Options A and D are procedural (describing what you did, not why). Option B is generic and could apply to any topic. This connects to Chapter 31 (communicating results) and the capstone rubric's Communication dimension.

Question 6. Your random forest model has an R-squared of 0.92 on the training set and 0.58 on the test set. What does this indicate?
- (A) The model is excellent — 0.92 is very high
- (B) The model is overfitting — it memorized the training data and doesn't generalize well
- (C) The test set is too small
- (D) Random forests are inappropriate for this data
Answer
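A tiny helper, using the scores from the scenario, that makes the train/test gap check explicit (the 0.15 threshold is an illustrative rule of thumb, not a standard):

```python
def overfitting_gap(train_score: float, test_score: float,
                    threshold: float = 0.15) -> tuple[float, bool]:
    """Return the train/test score gap and whether it exceeds the threshold."""
    gap = train_score - test_score
    return gap, gap > threshold

# Scores from the scenario: R-squared of 0.92 (train) vs 0.58 (test).
gap, likely_overfit = overfitting_gap(train_score=0.92, test_score=0.58)
print(f"gap={gap:.2f}, likely overfitting: {likely_overfit}")
```

The point is not the helper itself but the habit: always report both scores and the gap, never the training score alone.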
**Correct: (B)** A large gap between training performance (0.92) and test performance (0.58) is the classic sign of overfitting. The model learned patterns specific to the training data (including noise) that don't generalize. Solutions include: reducing model complexity (limiting max depth, raising the minimum samples per leaf), tuning hyperparameters with cross-validation, or gathering more training data. This connects to Chapter 29 (model evaluation) and is a critical skill for the capstone's modeling section.

Question 7. You want to create a requirements.txt for your capstone repository. Which approach is BEST?
- (A) List every Python package installed on your system
- (B) Manually list only the packages your notebook actually imports, with version numbers
- (C) Don't include one — the code is self-explanatory
- (D) Copy a requirements.txt from another project
Answer
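For concreteness, a focused requirements.txt for a typical capstone might look like this (the packages and pinned versions are illustrative; pin whatever versions your notebook actually uses):

```
pandas==2.1.4
numpy==1.26.2
matplotlib==3.8.2
scipy==1.11.4
scikit-learn==1.3.2
```

Five lines that cover your actual imports beat a two-hundred-line dump of everything installed on your machine.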
**Correct: (B)** A requirements.txt should include only the packages your project actually uses, with specific version numbers for reproducibility. Listing every installed package (A) includes irrelevant dependencies and creates conflicts. Not including one (C) makes your project non-reproducible. Copying from another project (D) may include wrong versions or missing packages. This connects to Chapter 33 (reproducibility).

Question 8. You found that GDP per capita and healthcare spending per capita are highly correlated (r = 0.85). What implication does this have for your regression model?
- (A) None — high correlation between features is always fine
- (B) Multicollinearity — the two features provide redundant information, making individual coefficient estimates unstable
- (C) You should multiply them together to create an interaction term
- (D) You should remove both features from the model
Answer
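A small numpy sketch of how you might quantify the redundancy (synthetic data; note that with exactly two predictors, the R-squared from regressing one on the other equals r squared, so the variance inflation factor simplifies to 1/(1 - r^2)):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic, deliberately correlated predictors.
gdp = rng.normal(20_000, 5_000, size=200)
health_spend = 0.05 * gdp + rng.normal(0, 120, size=200)

r = float(np.corrcoef(gdp, health_spend)[0, 1])
# Two-predictor shortcut: VIF = 1 / (1 - r^2)
vif = 1.0 / (1.0 - r**2)
print(f"r={r:.2f}, VIF={vif:.1f}")
```

A VIF well above roughly 5 is a common (if rough) warning sign that individual coefficient estimates will be unstable.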
**Correct: (B)** High correlation between predictors (multicollinearity) means they provide overlapping information. While the model's overall predictions may still be accurate, individual coefficient estimates become unstable — a small change in the data can cause large swings in the coefficients. Solutions include: dropping one of the correlated features, using regularization (LASSO), or using the features separately in different model specifications. You wouldn't remove both (D), because they're both potentially informative — just redundant. This connects to Chapter 26 (linear regression).

Section 2: True/False (4 questions, 5 points each)
Question 9. TRUE or FALSE: A capstone project should use every analytical technique you learned in this book to demonstrate the full range of your skills.
Answer
**FALSE.** A capstone should use the methods that are appropriate for the question and the data — not every technique in your toolkit. Using decision trees on a problem that's better suited to linear regression, just to demonstrate you know decision trees, is a sign of poor judgment. The capstone rubric values *appropriate* method selection, not *comprehensive* method application. Simplicity applied well is always better than complexity applied indiscriminately.

Question 10. TRUE or FALSE: If your capstone results contradict what you expected to find, you should revise your analysis until the results match your hypothesis.
Answer
**FALSE.** This is the opposite of good data science. If your results contradict your expectations, that's often the most interesting part of the analysis. Report what you found, not what you wanted to find. Discuss why the discrepancy might exist. Unexpected results — honestly presented and thoughtfully interpreted — are a strength, not a weakness. Manipulating analysis to match a predetermined conclusion is one of the most serious ethical violations in data science.

Question 11. TRUE or FALSE: The ethical reflection section of the capstone is optional if your analysis uses publicly available data.
Answer
**FALSE.** Ethical reflection is a required component of the capstone regardless of data source. Public data still represents real people, can still be misused, and still embeds assumptions about who is counted and how. The fact that data is publicly available doesn't mean it's free of ethical considerations — public health data, for example, may undercount marginalized populations, and analysis of such data can reinforce or challenge existing power structures.

Question 12. TRUE or FALSE: A capstone notebook that runs successfully with Kernel > Restart & Run All is more valuable than one with better analysis but execution errors.
Answer
**TRUE.** Reproducibility is a fundamental requirement of data science work. A notebook that doesn't run from top to bottom is, in a practical sense, broken — no one can verify your results, build on your work, or trust your findings. While analysis quality matters enormously, a non-reproducible analysis is worth very little. Both matter, but reproducibility is the floor, not the ceiling.

Section 3: Short Answer (4 questions, 5 points each)
Question 13. Explain the difference between a capstone project and a homework assignment. What characteristics make the capstone a higher-level demonstration of data science ability?
Answer
A homework assignment typically has a predetermined question, provided data, specified methods, and a known "right answer." A capstone requires the student to make every decision: choosing or refining the question, acquiring and cleaning real data, selecting appropriate methods, interpreting ambiguous results, and communicating findings to an audience. The capstone demonstrates the full data science lifecycle end-to-end, with the student responsible for integration, judgment, and narrative — skills that individual homework assignments test in isolation. It's the difference between following a recipe and creating a menu.

Question 14. Why does the capstone rubric weight "Communication and Narrative" at 15 points (same as "Question and Motivation") rather than giving all the weight to the technical dimensions?
Answer
Communication is weighted heavily because it reflects what real data science work requires. An analysis that can't be understood by its intended audience — whether that's a policymaker, a business executive, or a fellow researcher — fails to achieve its purpose, regardless of technical quality. Hiring managers consistently cite communication as the skill most lacking in junior data scientists. The capstone is a portfolio piece, and portfolio pieces must tell a clear story. Technical excellence without communication is like a tree falling in an empty forest.

Question 15. A student's capstone includes a random forest model with R-squared = 0.78 and concludes: "This proves that GDP and healthcare spending cause high vaccination rates." Identify at least two problems with this conclusion.
Answer
Problem 1: **Correlation vs. causation.** A predictive model demonstrates association, not causation. The random forest shows that GDP and healthcare spending are *associated with* vaccination rates, but it cannot establish that increasing GDP or healthcare spending *would cause* rates to rise. Confounding variables (e.g., political stability, institutional capacity) might drive both the predictors and the outcome. Problem 2: **Overstatement ("proves").** A single cross-sectional analysis with one dataset doesn't "prove" anything. The word "proves" implies certainty that no observational study can provide. A more appropriate statement: "The model suggests a strong association between economic/health indicators and vaccination rates, though causal interpretation requires additional evidence." Problem 3 (bonus): R-squared of 0.78 means 22% of variance is unexplained — claiming the model captures the full picture ignores the substantial portion it doesn't explain.

Question 16. Describe what "sensitivity analysis" means in the context of a capstone project and give a specific example of how you would conduct one.
Answer
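A compact sketch of an imputation sensitivity check (synthetic data; the column names and the 10% missingness are assumptions for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
gdp = rng.normal(20_000, 5_000, size=100)
rate = 40 + 0.001 * gdp + rng.normal(0, 5, size=100)
df = pd.DataFrame({"gdp": gdp, "rate": rate})

# Knock out 10% of GDP values to simulate missingness.
missing_idx = df.sample(frac=0.1, random_state=0).index
df.loc[missing_idx, "gdp"] = np.nan

# Re-run the same estimate under two different reasonable choices.
dropped = df.dropna()
imputed = df.fillna({"gdp": df["gdp"].median()})
results = {
    "drop_missing": dropped["gdp"].corr(dropped["rate"]),
    "median_impute": imputed["gdp"].corr(imputed["rate"]),
}
print({k: round(v, 3) for k, v in results.items()})
```

If the two estimates agree, the conclusion is robust to the imputation choice; if they diverge, report both and qualify the finding.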
Sensitivity analysis tests whether your conclusions are robust to the analytical choices you made. Since every data science project involves judgment calls (how to handle missing data, which variables to include, how to define the outcome), sensitivity analysis checks whether different reasonable choices lead to the same conclusions or different ones. Specific example: if I imputed missing GDP values using the most recent available year, I would re-run my regression model after (a) dropping countries with missing GDP instead of imputing, and (b) imputing with the regional median. If all three approaches produce similar coefficients and significance levels, my conclusions are robust. If the results change substantially, I would report the sensitivity and qualify my conclusions accordingly, noting which conclusions depend on the imputation method.

Section 4: Applied Scenarios (4 questions, 5 points each)
Question 17. You're two weeks into the capstone and realize that your original question ("Do weather patterns affect vaccination campaign effectiveness?") can't be answered with the data you have — weather data is only available for 30 countries, and campaign effectiveness data doesn't exist. What should you do?
Answer
This is a normal part of the data science process, not a failure. You should: (1) Acknowledge the data limitation honestly in your notebook. (2) Pivot to a related question that your available data *can* answer — perhaps "What country-level factors predict vaccination rate variation?" using the WHO and World Bank data you do have. (3) Note the original question as a "future work" item in your limitations section. (4) Don't panic — adjusting scope based on data availability is a professional skill, not a shortcoming. The capstone rubric rewards honest engagement with data limitations.

Question 18. You've finished your capstone and a peer reviewer says: "Your analysis is technically strong, but I have no idea what your main finding is. I had to read the entire 40-cell notebook to figure out your conclusion." How would you address this feedback?
Answer
This is a communication problem, and it's fixable: (1) Add a clear abstract at the top of the notebook that states the main finding in the first sentence. (2) Review the introduction to ensure it previews the conclusion ("This analysis finds that..."). (3) Add a "Key Findings" section before the detailed analysis, summarizing the three to five most important results. (4) Ensure each section ends with a brief summary sentence connecting back to the main question. (5) Make the Conclusion section start with a direct answer to the research question, not more analysis. The principle: the reader should understand your main finding within 60 seconds of opening the notebook.

Question 19. You discover that your vaccination data significantly underrepresents small island nations (population under 100,000). This means 15 countries are missing from your analysis. How should you handle this in your capstone?
Answer
You should: (1) Document the gap explicitly in your Data Description section, specifying which countries are missing and why. (2) Discuss whether these missing countries share characteristics that could bias your analysis — small island nations may have unique healthcare delivery challenges. (3) Note this as a limitation: "Our analysis underrepresents small island developing states, which face distinct challenges including geographic isolation and limited healthcare infrastructure." (4) Consider whether the gap affects your conclusions — if your analysis focuses on income-group comparisons, and most missing countries are in the same income group, the within-group representation is affected. (5) In the ethical reflection, note that data availability itself reflects global power structures — countries with weaker statistical systems are less visible in analyses like this one.

Question 20. Your capstone model identifies "government effectiveness index" as the strongest predictor of vaccination rates. A friend suggests you title your blog post: "Bad Government Is Why Poor Countries Can't Vaccinate Their People." Why is this title problematic, and what would be better?