Chapter 23 Exercises: Hypothesis Testing
How to use these exercises: Part A tests conceptual understanding — can you explain the logic of hypothesis testing without any numbers? Part B asks you to set up and interpret tests. Part C requires Python code. Part D pushes toward synthesis, judgment, and critical thinking about the testing framework itself.
Difficulty key: ⭐ Foundational | ⭐⭐ Intermediate | ⭐⭐⭐ Advanced | ⭐⭐⭐⭐ Extension
Part A: Conceptual Understanding ⭐
Exercise 23.1 — Stating hypotheses
For each scenario, state the null hypothesis (H₀) and alternative hypothesis (H₁) in both words and mathematical notation.
- A pharmaceutical company tests whether a new drug lowers cholesterol levels compared to a placebo.
- A teacher wants to know whether a new teaching method produces different test scores than the traditional method.
- A tech company runs an A/B test to determine whether a red "Buy Now" button produces a higher click-through rate than a blue button.
- A public health researcher investigates whether there is an association between a country's income level and whether it meets the 90% vaccination target.
Guidance
1. H₀: μ_drug = μ_placebo (the drug has no effect). H₁: μ_drug < μ_placebo (the drug lowers cholesterol). This is one-tailed because the company specifically predicts a *decrease*.
2. H₀: μ_new = μ_traditional. H₁: μ_new ≠ μ_traditional. Two-tailed because the new method might produce higher OR lower scores.
3. H₀: p_red = p_blue. H₁: p_red > p_blue. One-tailed because they specifically predict red is *better*.
4. H₀: Income level and meeting the target are independent (no association). H₁: They are not independent (there is an association). This calls for a chi-square test.
Exercise 23.2 — P-value interpretation ⭐
A researcher reports: "The difference between groups was statistically significant (t = 2.45, p = 0.018)." For each statement below, indicate whether it is a correct or incorrect interpretation.
- "There is a 1.8% probability that the null hypothesis is true."
- "If there were truly no difference between the groups, data this extreme would occur about 1.8% of the time."
- "The probability that the result is due to chance is 1.8%."
- "There is a 98.2% probability that the alternative hypothesis is true."
- "The data provide evidence against the null hypothesis at the 0.05 significance level."
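Statement 2 can be made concrete with a short simulation — under H₀, how often does |t| reach 2.45? A sketch, assuming two groups of 25 each (the exercise doesn't state the group sizes; df = 48 is an assumption chosen so that t = 2.45 corresponds to p ≈ 0.018):

```python
import numpy as np

rng = np.random.default_rng(42)
n, sims = 25, 100_000  # group size is an assumption; the exercise doesn't state it

# Under H0 both groups come from the same population, so every t-statistic
# below reflects pure sampling variation.
a = rng.normal(0, 1, size=(sims, n))
b = rng.normal(0, 1, size=(sims, n))

# Pooled two-sample t-statistic for each simulated experiment
sp = np.sqrt((a.var(axis=1, ddof=1) + b.var(axis=1, ddof=1)) / 2)
t = (a.mean(axis=1) - b.mean(axis=1)) / (sp * np.sqrt(2 / n))

# Proportion of null experiments at least as extreme as the observed t = 2.45
p_sim = np.mean(np.abs(t) >= 2.45)
print(p_sim)  # close to 0.018
```

The simulated proportion is exactly what statement 2 describes: the long-run frequency of data this extreme when the null is true.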
Guidance
1. **Incorrect.** The p-value is not the probability that H₀ is true. It's the probability of the data given H₀. This is the most common misconception.
2. **Correct.** This is the textbook definition of a p-value.
3. **Incorrect.** "Due to chance" is vague, but this is essentially restating misconception 1. The p-value doesn't tell you the probability that the result has any particular cause.
4. **Incorrect.** The p-value doesn't give the probability of H₁ being true. You'd need Bayesian methods for that.
5. **Correct.** Since 0.018 < 0.05, the result is statistically significant at α = 0.05.
Exercise 23.3 — Type I and Type II errors ⭐
For each scenario, describe what a Type I error and a Type II error would mean in practical terms.
- A clinical trial tests whether a new cancer treatment is effective.
- A quality control test checks whether a batch of microchips meets specification.
- A spam filter classifies emails as spam or not-spam.
- A COVID test determines whether a patient is infected.
Guidance
1. Type I: Concluding the treatment works when it doesn't → patients receive an ineffective treatment. Type II: Concluding the treatment doesn't work when it does → an effective treatment is abandoned.
2. Type I: Rejecting a good batch → waste. Type II: Accepting a bad batch → defective products reach customers.
3. Type I: Flagging a legitimate email as spam → user misses important messages. Type II: Failing to flag a spam email → user sees spam.
4. Type I: Positive test when patient is healthy → unnecessary quarantine, anxiety. Type II: Negative test when patient is infected → infected person spreads disease.
Note: which error is "worse" depends on context. For the cancer treatment, Type I might lead to adopting a harmful/useless treatment. For the COVID test, Type II might lead to outbreaks. The significance level and test design should reflect which error is more costly.
Exercise 23.4 — Statistical vs. practical significance ⭐⭐
A study with 50,000 participants finds that people who drink green tea score 0.3 points higher on a 100-point memory test than those who don't. The result is statistically significant (p = 0.004).
- Is this result statistically significant? At what level?
- Is this result practically significant? Why or why not?
- Compute an approximate Cohen's d if the pooled standard deviation is 15 points.
- Write a one-paragraph summary that honestly communicates both the statistical and practical findings.
Guidance
1. Yes, significant at α = 0.01 (since 0.004 < 0.01).
2. Probably not. A 0.3-point difference on a 100-point test is almost certainly too small to be meaningful in practice. No one would notice a 0.3-point improvement.
3. Cohen's d = 0.3 / 15 = 0.02. This is a trivially small effect — well below the "small" threshold of 0.2.
4. Something like: "While our large sample (n = 50,000) provides statistically significant evidence that green tea drinkers score slightly higher on the memory test (p = 0.004), the actual difference is only 0.3 points out of 100 (Cohen's d = 0.02). This difference is too small to have any practical relevance. The statistical significance is entirely an artifact of the very large sample size, which can detect differences too small to matter."
Exercise 23.5 — The logic of "failing to reject" ⭐⭐
Explain in your own words why we say "fail to reject the null hypothesis" rather than "accept the null hypothesis." Use an analogy (courtroom, search, medical test, or your own) to make the distinction clear.
Guidance
"Fail to reject" reflects that we haven't proven the null is true — we've merely failed to find sufficient evidence against it. Just as a "not guilty" verdict doesn't mean the defendant is innocent (it means the prosecution didn't meet the burden of proof), a non-significant p-value doesn't mean there's no effect — it means we didn't find strong enough evidence of one. The effect might be real but too small for our sample to detect. Accepting the null would claim certainty we don't have.
Exercise 23.6 — Choosing the right test ⭐⭐
For each research question, identify the most appropriate hypothesis test from this list: one-sample t-test, two-sample t-test, paired t-test, chi-square test of independence, ANOVA.
- Do mean reading scores differ between students in three different schools?
- Is there an association between gender and choice of major (STEM vs. non-STEM)?
- Is the average customer satisfaction rating different from the company's target of 4.0 (out of 5)?
- Do patients' blood pressure readings change after a 6-week exercise program? (Same patients measured before and after.)
- Do vaccination rates differ between countries that participated in the COVAX program and those that didn't?
Guidance
1. **ANOVA** — comparing means across 3+ groups.
2. **Chi-square test** — both variables are categorical.
3. **One-sample t-test** — comparing a sample mean to a known value.
4. **Paired t-test** — the same subjects measured twice (before and after), so observations are not independent.
5. **Two-sample t-test** — comparing means of two independent groups.
Exercise 23.7 — Multiple testing reasoning ⭐⭐
A researcher tests 100 different food additives for their effect on rat behavior. None of them actually have any effect (the null is true for all 100). She uses α = 0.05 for each test.
- How many "significant" results does she expect to find?
- She publishes a paper highlighting the 5 additives that showed significant effects. What's wrong with this?
- If she had used Bonferroni correction, what α would she use for each individual test?
- A journalist writes: "Scientists discover 5 food additives that affect behavior." How would you respond?
Guidance
1. Expected significant results = 100 × 0.05 = 5.
2. These 5 "significant" results are almost certainly false positives. By running 100 tests at α = 0.05, she expects about 5 false positives by chance alone. Selectively reporting only the significant ones creates a misleading picture.
3. Bonferroni α = 0.05/100 = 0.0005. Each test must have p < 0.0005 to be declared significant. With this correction, the false positive rate across all 100 tests is controlled at 0.05.
4. "These results are likely false positives from multiple testing. The researcher tested 100 additives and found 5 'significant' results, which is exactly what we'd expect by chance when no real effects exist. The findings need to be replicated with pre-registered hypotheses before any conclusions can be drawn."
Part B: Applied Problems ⭐⭐
Exercise 23.8 — Two-sample t-test by hand
Two groups of countries have the following vaccination rates:
Group A (high GDP): n = 25, mean = 84.2%, s = 7.3%
Group B (low GDP): n = 20, mean = 62.8%, s = 14.5%
- State H₀ and H₁.
- Compute the pooled standard error for the difference in means.
- Compute the t-statistic.
- Using a t-table or calculator, find the approximate p-value.
- Compute Cohen's d.
- State your conclusion in a complete sentence, including the effect size.
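After working it by hand, the result can be checked from the summary statistics alone — scipy.stats.ttest_ind_from_stats accepts means, standard deviations, and sample sizes without the raw data. A sketch (Welch's version, since the group SDs differ markedly):

```python
import numpy as np
from scipy import stats

# Summary statistics from the exercise
mean_a, sd_a, n_a = 84.2, 7.3, 25   # high-GDP countries
mean_b, sd_b, n_b = 62.8, 14.5, 20  # low-GDP countries

# Welch's t-test computed directly from summary statistics
t, p = stats.ttest_ind_from_stats(mean_a, sd_a, n_a, mean_b, sd_b, n_b,
                                  equal_var=False)

# Cohen's d from the pooled standard deviation
sp = np.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2))
d = (mean_a - mean_b) / sp

print(f"t = {t:.2f}, p = {p:.2g}, d = {d:.2f}")
```

Your hand-computed SE, t, and d should match these values to rounding.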
Guidance
1. H₀: μ_A = μ_B. H₁: μ_A ≠ μ_B.
2. SE = √(s_A²/n_A + s_B²/n_B) = √(7.3²/25 + 14.5²/20) = √(2.1316 + 10.5125) = √12.644 = 3.556.
3. t = (84.2 - 62.8) / 3.556 = 21.4 / 3.556 = 6.02.
4. With df ≈ 27 (using Welch's approximation), t = 6.02 gives p < 0.0001.
5. Pooled SD = √(((24)(7.3²) + (19)(14.5²)) / 43) = √((1278.96 + 3994.75)/43) = √122.64 = 11.07. Cohen's d = 21.4/11.07 = 1.93.
6. "High-GDP countries have significantly higher vaccination rates (M = 84.2%) than low-GDP countries (M = 62.8%), t(27) = 6.02, p < 0.001, d = 1.93. The difference of 21.4 percentage points represents a very large effect."
Exercise 23.9 — Chi-square test interpretation
A study examined whether vaccine hesitancy was associated with education level:
| | Hesitant | Not Hesitant | Total |
|---|---|---|---|
| No college degree | 120 | 280 | 400 |
| College degree | 60 | 340 | 400 |
| Total | 180 | 620 | 800 |
- What are the expected frequencies under H₀ (no association)?
- The chi-square statistic is 25.8 with 1 degree of freedom. What is the approximate p-value?
- Compute Cramér's V.
- Is the association statistically significant? Is it practically meaningful?
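A quick check with scipy.stats.chi2_contingency — pass correction=False so the statistic matches the plain chi-square formula (scipy's default applies Yates' continuity correction to 2×2 tables):

```python
import numpy as np
from scipy import stats

# Observed counts: rows = education level, columns = hesitant / not hesitant
table = np.array([[120, 280],
                  [ 60, 340]])

# correction=False matches the hand computation; the default Yates correction
# would give a slightly smaller statistic for a 2x2 table
chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)

# Cramér's V for a 2x2 table reduces to sqrt(chi2 / n), since min(r-1, c-1) = 1
v = np.sqrt(chi2 / table.sum())

print(f"chi2 = {chi2:.1f}, df = {dof}, p = {p:.2g}, V = {v:.2f}")
```

The returned expected array should match the expected frequencies you computed in part 1.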
Guidance
1. Expected = (row total × column total) / grand total. E(No college, Hesitant) = 400 × 180/800 = 90. E(No college, Not hesitant) = 400 × 620/800 = 310. E(College, Hesitant) = 90. E(College, Not hesitant) = 310.
2. χ² = 25.8, df = 1 → p < 0.0001.
3. Cramér's V = √(25.8 / (800 × 1)) = √0.0323 = 0.18. A small-to-moderate effect.
4. Statistically significant: yes (p < 0.001). Practically: the effect is real but modest (V = 0.18). The hesitancy rate is 30% among non-degree holders vs. 15% among degree holders — a meaningful difference in public health terms, even if the statistical effect size measure is "small."
Exercise 23.10 — Power calculation ⭐⭐
You're planning a study to detect a 5-percentage-point difference in vaccination rates between two regions. Based on previous data, you expect a standard deviation of about 12 percentage points.
- If you sample 50 countries per region, what is the approximate power? (Use the rule of thumb that power ≈ P(Z > z_α − (δ/σ)√(n/2)), where Z is standard normal, z_α = 1.96 for a two-sided test at α = 0.05, and δ is the true difference.)
- What sample size per group would give you 80% power?
- What sample size per group would give you 90% power?
- If the true difference were 10 percentage points instead of 5, how would the required sample size change?
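The rule of thumb translates directly into a few lines of code; a sketch using the normal approximation (the exact t-based answer differs slightly at small n):

```python
import numpy as np
from scipy import stats

delta, sigma, alpha = 5.0, 12.0, 0.05    # true difference, SD, significance level
z_alpha = stats.norm.ppf(1 - alpha / 2)  # 1.96 for a two-sided test

def approx_power(n):
    """Normal-approximation power for a two-sample test with n per group."""
    d = delta / sigma
    return stats.norm.sf(z_alpha - d * np.sqrt(n / 2))

def n_for_power(target):
    """Smallest n per group reaching the target power, by direct search."""
    n = 2
    while approx_power(n) < target:
        n += 1
    return n

print(approx_power(50))   # power with 50 per group
print(n_for_power(0.80))  # n per group for 80% power
print(n_for_power(0.90))  # n per group for 90% power
```

Setting delta = 10.0 and rerunning shows how doubling the effect size roughly quarters the required sample.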
Guidance
1. Effect size d = 5/12 = 0.417. For a two-sample test, the noncentrality parameter is d × √(n/2) = 0.417 × √25 = 2.085. Power ≈ P(Z > 1.96 - 2.085) = P(Z > -0.125) ≈ 0.55. About 55% power.
2. For 80% power: need d × √(n/2) ≈ 2.80. So √(n/2) = 2.80/0.417 = 6.71, n/2 = 45.1, n ≈ 90 per group.
3. For 90% power: need d × √(n/2) ≈ 3.24. So √(n/2) = 7.77, n/2 = 60.4, n ≈ 121 per group.
4. With δ = 10, d = 10/12 = 0.833. For 80% power: √(n/2) = 2.80/0.833 = 3.36, n/2 = 11.3, n ≈ 23 per group. Doubling the effect size quarters the required sample.
Exercise 23.11 — Before-and-after comparison ⭐⭐
Ten countries implemented a public health campaign. Their vaccination rates before and after were:
| Country | Before | After | Difference |
|---|---|---|---|
| A | 62 | 68 | +6 |
| B | 71 | 74 | +3 |
| C | 55 | 62 | +7 |
| D | 78 | 80 | +2 |
| E | 45 | 53 | +8 |
| F | 68 | 69 | +1 |
| G | 59 | 66 | +7 |
| H | 73 | 75 | +2 |
| I | 51 | 58 | +7 |
| J | 64 | 70 | +6 |
- What type of t-test is appropriate here?
- Compute the mean and standard deviation of the differences.
- Compute the t-statistic and find the p-value.
- Can we conclude the campaign was effective? What are the limitations of this analysis?
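The hand computation can be verified with scipy.stats.ttest_rel; a sketch using the table's values:

```python
import numpy as np
from scipy import stats

before = np.array([62, 71, 55, 78, 45, 68, 59, 73, 51, 64])
after  = np.array([68, 74, 62, 80, 53, 69, 66, 75, 58, 70])

diffs = after - before
print(diffs.mean(), diffs.std(ddof=1))  # mean and SD of the differences

# Paired t-test: equivalent to a one-sample t-test on the differences
t, p = stats.ttest_rel(after, before)
print(f"t = {t:.2f}, p = {p:.4f}")
```

Note that ttest_rel(after, before) and a one-sample test of diffs against zero give identical results — a useful sanity check.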
Guidance
1. **Paired t-test** — the same countries are measured before and after.
2. Differences: [6, 3, 7, 2, 8, 1, 7, 2, 7, 6]. Mean = 4.9, SD = 2.60.
3. SE = 2.60/√10 = 0.822. t = 4.9/0.822 = 5.96. With df = 9, p < 0.001.
4. The increase is statistically significant. However, we can't conclude the campaign *caused* the increase — other factors (seasonal trends, supply increases, other policies) could explain the change. A proper causal conclusion would require a control group. This is an important limitation we'll explore further in Chapter 24.
Exercise 23.12 — Interpreting non-significance ⭐⭐
A researcher tests whether a new curriculum improves math scores. With 15 students per group, she finds: mean difference = 4.2 points, p = 0.12, 95% CI for the difference = (-1.1, 9.5).
- Is the result statistically significant at α = 0.05?
- Should she conclude "the curriculum has no effect"? Why or why not?
- What does the confidence interval tell you that the p-value alone doesn't?
- What would you recommend she do next?
Guidance
1. No, p = 0.12 > 0.05.
2. No! "Not significant" doesn't mean "no effect." The CI extends from -1.1 to +9.5, which includes the possibility of a substantial positive effect (up to 9.5 points). The sample is simply too small to be conclusive.
3. The CI shows the range of plausible effect sizes. An effect as large as 9.5 points is still plausible. The p-value alone would just say "not significant," giving the misleading impression that the curriculum doesn't work.
4. Increase the sample size. With only 15 per group, the study has very low power to detect a moderate effect. A power analysis could determine how many students are needed.
Exercise 23.13 — Reporting results correctly ⭐⭐
Rewrite each of the following result statements to be more complete and accurate:
- "The difference was significant (p < 0.05)."
- "We found no significant difference between groups."
- "The treatment group had a significantly higher score (p = 0.03)."
Guidance
1. Better: "The treatment group scored higher than the control group (M = 78.3 vs. M = 72.1, difference = 6.2, 95% CI [1.4, 11.0], t(48) = 2.61, p = 0.012, d = 0.74)."
2. Better: "The difference between groups was not statistically significant (difference = 2.1, 95% CI [-3.4, 7.6], t(38) = 0.78, p = 0.44, d = 0.25), though the confidence interval does not rule out a meaningful effect."
3. Better: "The treatment group scored higher (M = 82.4, SD = 9.1) than the control group (M = 77.6, SD = 10.3), a difference of 4.8 points (95% CI [0.5, 9.1], t(58) = 2.22, p = 0.030, d = 0.49, a medium effect)."
Part C: Coding Exercises ⭐⭐–⭐⭐⭐
Exercise 23.14 — Simulate the null distribution ⭐⭐
Write a Python simulation that:
- Creates two groups of 30 values each from the SAME population (normal, mean=70, std=15)
- Computes their difference in means
- Repeats this 10,000 times
- Plots the null distribution of differences
- Marks where a specified observed difference (say, 8.0) would fall
- Computes the two-tailed p-value
Verify that your simulated p-value is close to the p-value from scipy.stats.ttest_ind.
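One possible skeleton for the simulation core (the plotting step is omitted; the population parameters follow the exercise):

```python
import numpy as np

rng = np.random.default_rng(0)
n, sims, observed = 30, 10_000, 8.0

# Both groups come from the SAME population (mean 70, SD 15), so the
# differences in means below form the null distribution.
a = rng.normal(70, 15, size=(sims, n))
b = rng.normal(70, 15, size=(sims, n))
null_diffs = a.mean(axis=1) - b.mean(axis=1)

# Two-tailed p-value: how often is a null difference at least as extreme as 8.0?
p_sim = np.mean(np.abs(null_diffs) >= observed)
print(p_sim)
```

For these parameters the standard error of the difference is 15·√(2/30) ≈ 3.87, so the simulated p-value should land near 0.04 — close to what scipy.stats.ttest_ind reports for a sample with an 8-point difference.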
Exercise 23.15 — Permutation test implementation ⭐⭐
Implement a permutation test from scratch:
- Use the data: group_a = [82, 87, 91, 78, 85, 90, 88, 83, 86, 89] and group_b = [71, 65, 74, 68, 72, 70, 66, 73, 69, 75]
- Compute the observed difference in means
- Combine all values, then randomly shuffle and split into two groups 10,000 times
- Compute the p-value as the proportion of shuffled differences as extreme as the observed
- Compare to the scipy.stats.ttest_ind p-value
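The shuffle-and-split loop is the heart of the exercise; a sketch of one implementation:

```python
import numpy as np

group_a = np.array([82, 87, 91, 78, 85, 90, 88, 83, 86, 89])
group_b = np.array([71, 65, 74, 68, 72, 70, 66, 73, 69, 75])

observed = group_a.mean() - group_b.mean()

rng = np.random.default_rng(1)
combined = np.concatenate([group_a, group_b])
n_a = len(group_a)

count = 0
for _ in range(10_000):
    shuffled = rng.permutation(combined)
    diff = shuffled[:n_a].mean() - shuffled[n_a:].mean()
    if abs(diff) >= abs(observed):  # two-tailed: as extreme in either direction
        count += 1

p_perm = count / 10_000
print(observed, p_perm)  # the groups barely overlap, so p is essentially 0
```

Many practitioners report (count + 1) / (sims + 1) instead, so the estimated p-value is never exactly zero.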
Exercise 23.16 — Power simulation ⭐⭐
Create a simulation to estimate statistical power:
- Define a "true" population difference of 5 points (mean_A = 70, mean_B = 75, shared std = 12)
- For sample sizes n = 10, 20, 30, 50, 75, 100, 150, 200:
  - Run 5,000 simulated experiments
  - For each experiment, draw n values from each population and run a t-test
  - Record the proportion of experiments that reject H₀ at α = 0.05
- Plot sample size vs. power
- Add a horizontal line at 80% power
- From your plot, identify the sample size needed for 80% power
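A possible shape for the simulation loop (trimmed to 2,000 simulations per sample size for speed; the plot and the 80% reference line are left to you):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def estimated_power(n, sims=2_000, mean_a=70, mean_b=75, sd=12, alpha=0.05):
    """Fraction of simulated experiments that reject H0 at the given alpha."""
    rejections = 0
    for _ in range(sims):
        a = rng.normal(mean_a, sd, n)
        b = rng.normal(mean_b, sd, n)
        if stats.ttest_ind(a, b).pvalue < alpha:
            rejections += 1
    return rejections / sims

for n in [10, 20, 30, 50, 75, 100, 150, 200]:
    print(n, estimated_power(n))
```

With d = 5/12 ≈ 0.42, power should cross 80% somewhere around n ≈ 90 per group, matching the analytic answer from Exercise 23.10.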
Exercise 23.17 — Multiple testing simulation ⭐⭐⭐
Simulate the multiple testing problem:
- Run 100 t-tests where the null is true for ALL tests (both groups from the same population)
- Count and display the "significant" results
- Apply Bonferroni correction and recount
- Apply Benjamini-Hochberg (FDR) correction and recount
- Now modify the simulation: make the null true for 80 tests and false for 20 tests (with a real difference of 8 points)
- Compare the three approaches (uncorrected, Bonferroni, BH) in terms of true positives, false positives, and false negatives
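The corrections themselves are only a few lines; a sketch of the all-null half, with Benjamini-Hochberg written out directly (statsmodels' multipletests does the same job if you prefer a library):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
m, alpha = 100, 0.05

# All 100 nulls are true: both groups drawn from the same population
pvals = np.array([
    stats.ttest_ind(rng.normal(70, 15, 30), rng.normal(70, 15, 30)).pvalue
    for _ in range(m)
])

uncorrected = np.sum(pvals < alpha)     # expect about 5 false positives
bonferroni = np.sum(pvals < alpha / m)  # per-test threshold 0.0005

# Benjamini-Hochberg: find the largest k with p_(k) <= (k/m) * alpha,
# then reject the k hypotheses with the smallest p-values
ranked = np.sort(pvals)
below = ranked <= alpha * np.arange(1, m + 1) / m
bh = 0 if not below.any() else np.max(np.nonzero(below)[0]) + 1

print(uncorrected, bonferroni, bh)
```

For the second half of the exercise, shift the mean of one group by 8 points for 20 of the tests and track which of the rejections are true versus false positives.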
Exercise 23.18 — Complete hypothesis test workflow ⭐⭐⭐
Using the vaccination data from the progressive project (or a simulated version), perform a complete hypothesis test:
- State H₀ and H₁
- Check assumptions (normality with a Q-Q plot, equal variances with Levene's test)
- Perform a two-sample t-test with scipy.stats.ttest_ind (use equal_var=False for Welch's t-test if variances are unequal)
- Compute Cohen's d
- Compute the 95% confidence interval for the difference
- Create a visualization showing the distributions of both groups with means and CIs annotated
- Write a complete results paragraph
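A minimal skeleton of the middle steps on simulated stand-in data (the group names and parameters below are made up for illustration — substitute the project's actual vaccination data; the Q-Q plot, CI, and final figure are left to you):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
# Stand-in data; replace with the project's real groups
high_income = rng.normal(84, 7, 25)
low_income = rng.normal(63, 14, 20)

# Levene's test: a small p-value suggests unequal variances -> use Welch's t-test
_, p_levene = stats.levene(high_income, low_income)
t, p = stats.ttest_ind(high_income, low_income, equal_var=(p_levene > 0.05))

# Cohen's d from the pooled standard deviation
n1, n2 = len(high_income), len(low_income)
sp = np.sqrt(((n1 - 1) * high_income.var(ddof=1) +
              (n2 - 1) * low_income.var(ddof=1)) / (n1 + n2 - 2))
d = (high_income.mean() - low_income.mean()) / sp

print(f"Levene p = {p_levene:.3f}, t = {t:.2f}, p = {p:.2g}, d = {d:.2f}")
```

Gating equal_var on the Levene p-value is one common convention; another is to simply use Welch's test throughout, since it loses little power when variances happen to be equal.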
Exercise 23.19 — Effect of sample size on p-values ⭐⭐⭐
Demonstrate how p-values depend on sample size for a fixed effect:
- Set a fixed "true" difference between two populations (e.g., mean 70 vs. 75, std = 15)
- For each sample size n = 5, 10, 15, 20, 30, 50, 100, 200, 500:
  - Draw 1,000 samples and compute the p-value each time
  - Record the median p-value and the proportion of p-values < 0.05
- Create two plots: (a) sample size vs. median p-value, (b) sample size vs. proportion significant
- Explain the pattern: why do p-values get systematically smaller as n increases?
Part D: Synthesis and Critical Thinking ⭐⭐⭐–⭐⭐⭐⭐
Exercise 23.20 — The significance filter ⭐⭐⭐
Consider this scenario: 100 research groups all study the same question (e.g., "Does supplement X improve memory?"). Suppose the true effect is small but real (Cohen's d = 0.3), and each group has 50 participants per condition (power ≈ 50%).
- How many of the 100 groups would you expect to get a significant result?
- Among those significant results, the "winners" will tend to have overestimated the true effect. Why? (This is called the "winner's curse" or "significance filter.")
- If only significant results get published, what will the published literature conclude about the effect size?
- How does this relate to the replication crisis?
Guidance
1. About 50 groups (50% power → 50% chance of detecting the real effect).
2. With only 50% power, the studies that happen to find significance are disproportionately the ones where the sample effect was larger than the true effect (by luck). The 50 studies that "fail" include many where the sample effect was close to or below the true value. So the average effect size among significant studies is inflated.
3. The published literature would show an average effect size larger than d = 0.3 — possibly d = 0.5 or more. This is "publication bias" in action.
4. When other researchers try to replicate the inflated effect size, they get smaller effects (closer to the true d = 0.3), which may not be significant with the same sample size. The original findings "fail to replicate" not because the effect is fake, but because the original estimate was inflated by the significance filter.
Exercise 23.21 — Designing a study ⭐⭐⭐
You've been asked to design a study testing whether a new vaccine education program increases vaccination rates. The program will be implemented in some schools but not others.
- Define your null and alternative hypotheses.
- What kind of test will you use?
- Conduct a power analysis: if you expect a 5-percentage-point increase and the standard deviation is 15, how many schools per group do you need for 80% power?
- What significance level will you use and why?
- Will you use one-tailed or two-tailed? Justify your choice.
- What potential confounders might threaten the validity of your test?
- How will you handle multiple comparisons if you plan to test multiple outcomes?
Guidance
This is an open-ended design exercise. Key considerations:
1. H₀: no difference; H₁: program increases rates.
2. Two-sample t-test or, better, a regression model controlling for baseline differences.
3. d = 5/15 = 0.33. For 80% power: n ≈ 144 per group (from power formula or simulation).
4. α = 0.05 is standard; could argue for 0.01 if you want to be conservative.
5. Two-tailed is safer — the program could conceivably decrease rates if it creates backlash.
6. School demographics, pre-existing attitudes, teacher enthusiasm, seasonal effects.
7. Pre-specify primary and secondary outcomes; use Bonferroni or Holm correction for the secondary outcomes.
Exercise 23.22 — Critiquing a published study ⭐⭐⭐
Read this summary of a (fictional) study:
"We tested whether background music improves study performance. We had students study in three conditions: silence, classical music, and pop music. We measured performance on a vocabulary test, a math test, a reading comprehension test, and a spatial reasoning test. We found that classical music significantly improved spatial reasoning scores (p = 0.04). We conclude that classical music enhances spatial cognitive abilities."
Identify at least four methodological or statistical problems with this study's analysis and conclusions.
Guidance
1. **Multiple testing:** 3 conditions × 4 outcomes = 12 comparisons. At α = 0.05, we'd expect about 0.6 false positives by chance. Finding one significant result among 12 is not compelling without correction.
2. **Cherry-picking:** They tested four outcomes and only reported the one that was significant. This is selective reporting / p-hacking.
3. **Borderline p-value:** p = 0.04 is just barely under 0.05. With the multiple testing context, this provides very weak evidence.
4. **No effect size reported:** How much did spatial reasoning improve? A significant p-value with a tiny effect size is meaningless.
5. **Overgeneralized conclusion:** They tested one specific spatial reasoning task and concluded "classical music enhances spatial cognitive abilities" — a much broader claim.
6. **Possible confounders:** Were students randomly assigned to conditions? Were the conditions counterbalanced? Was the spatial reasoning test always given last?
Exercise 23.23 — The ASA statement on p-values ⭐⭐⭐⭐
In 2016, the American Statistical Association released a statement with six principles about p-values. Look up this statement and:
- List the six principles in your own words.
- For each principle, give a specific example of how violating it could lead to incorrect conclusions.
- The statement says "scientific conclusions should not be based only on whether a p-value passes a specific threshold." What else should they be based on?
- Some statisticians have proposed banning p-values entirely. What are the arguments for and against?
Guidance
The six principles (paraphrased):
1. P-values indicate how incompatible the data are with a specified model.
2. P-values do not measure the probability that the hypothesis is true.
3. Scientific conclusions should not be based on p-values alone.
4. Proper reporting requires full transparency.
5. A p-value does not measure effect size or importance.
6. A p-value alone does not provide a good measure of evidence.
Arguments for banning: reduces mechanical thinking, forces richer analysis. Arguments against: p-values are useful when understood correctly; banning them doesn't address the underlying problems with how research is conducted.
Exercise 23.24 — Project extension: Comprehensive analysis ⭐⭐⭐
Conduct a comprehensive hypothesis testing analysis on the vaccination dataset:
- Test whether vaccination rates differ between WHO regions (ANOVA)
- Conduct all pairwise comparisons with Bonferroni correction
- Test whether meeting the 90% target is associated with income level (chi-square)
- For each test, report: the test statistic, p-value, effect size, and confidence interval
- Create a summary table of all results
- Write a "Results" section suitable for a report, using proper statistical notation
Exercise 23.25 — Reflection: The limits of testing ⭐⭐⭐⭐
Write a thoughtful essay (500-800 words) addressing this question: "If hypothesis testing is so commonly misunderstood and misused, should we still teach it? What would we lose if we replaced it entirely with confidence intervals and effect sizes?"