Chapter 23 Quiz: Hypothesis Testing
Instructions: This quiz tests your understanding of Chapter 23. Answer all questions before checking the solutions. For multiple choice, select the best answer. For short answer questions, aim for 2-4 clear sentences. Total points: 94.
Section 1: Multiple Choice (10 questions, 4 points each)
Question 1. A researcher wants to test whether a new fertilizer increases plant growth. She sets up an experiment with a control group (no fertilizer) and a treatment group (with fertilizer). What is the appropriate null hypothesis?
- (A) The fertilizer increases growth
- (B) The fertilizer has no effect on growth
- (C) The fertilizer decreases growth
- (D) The fertilizer might or might not affect growth
Answer
**Correct: (B)** The null hypothesis always represents the "nothing is happening" or "no effect" position. It's the skeptic's default: until proven otherwise, we assume the fertilizer has no effect. (A) is the alternative hypothesis. (C) is a specific directional alternative. (D) is too vague to be a hypothesis.
Question 2. A two-sample t-test produces a p-value of 0.03. Which interpretation is correct?
- (A) There is a 3% probability that the null hypothesis is true
- (B) There is a 97% probability that the treatment works
- (C) If there were no real difference between groups, data this extreme would occur about 3% of the time
- (D) 3% of the data points are significantly different
Answer
**Correct: (C)** The p-value is the probability of observing a test statistic as extreme as (or more extreme than) the one calculated, assuming the null hypothesis is true. (A) is the classic misconception — the p-value is about the probability of the data, not the probability of the hypothesis. (B) conflates the p-value with the probability of the alternative. (D) is nonsensical — p-values are about the sample as a whole, not individual data points.
Question 3. A pharmaceutical company tests a new drug for headache relief. They use α = 0.05. In reality, the drug has no effect. What type of error would occur if they conclude the drug works?
- (A) Type II error
- (B) Type I error
- (C) Standard error
- (D) Sampling error
Answer
**Correct: (B)** A Type I error is a false positive — rejecting the null hypothesis when it's actually true. In this case, concluding the drug works (rejecting H₀) when it actually doesn't (H₀ is true) is a Type I error. Type II (A) would be concluding the drug doesn't work when it actually does. (C) and (D) are unrelated concepts.
Question 4. Which of the following will increase the statistical power of a test?
- (A) Decreasing the sample size
- (B) Decreasing the significance level from 0.05 to 0.01
- (C) Increasing the sample size
- (D) Decreasing the true effect size
Answer
**Correct: (C)** Power increases with: (1) larger sample size, (2) larger effect size, (3) higher significance level, and (4) lower variability. (A) decreases power. (B) makes it harder to reject the null, decreasing power. (D) makes the effect harder to detect, decreasing power. Only (C) increases power by reducing the standard error and making it easier to detect a real difference.
Question 5. A study with 200,000 participants finds that people who eat breakfast score 0.1 points higher on a 100-point cognitive test (p = 0.001). What is the most appropriate conclusion?
- (A) Eating breakfast dramatically improves cognitive function
- (B) The result is both statistically significant and practically significant
- (C) The result is statistically significant but practically meaningless
- (D) The result is not statistically significant
Answer
**Correct: (C)** With 200,000 participants, even trivial differences become statistically significant. A 0.1-point difference on a 100-point test is far too small to matter in practice — Cohen's d would be negligibly small. This is the "giant sample trap." (A) overstates the finding. (B) ignores the tiny effect size. (D) is wrong — p = 0.001 is highly significant.
Question 6. A researcher runs 20 hypothesis tests and finds one significant result at p = 0.04. All 20 tests were exploring different potential effects with no prior theory. What is the most likely explanation?
- (A) The significant result represents a genuine discovery
- (B) The significant result is likely a false positive due to multiple testing
- (C) The significance level should be adjusted to 0.10
- (D) All 20 results should be considered significant
Answer
**Correct: (B)** With 20 tests at α = 0.05, we expect about 1 false positive by chance (20 × 0.05 = 1). Finding exactly one "significant" result among 20 exploratory tests is exactly what we'd expect from random noise. Without correction for multiple testing (like Bonferroni), this result should not be taken as a discovery. (A) ignores the multiple testing problem. (C) and (D) don't address the core issue.
Question 7. What is the relationship between a 95% confidence interval and a two-tailed hypothesis test at α = 0.05?
- (A) They are completely unrelated
- (B) If the 95% CI for the difference includes zero, the test will reject H₀
- (C) If the 95% CI for the difference does NOT include zero, the test will reject H₀
- (D) The CI is always wider than what the test implies
Answer
**Correct: (C)** A 95% CI and a two-tailed test at α = 0.05 are two sides of the same coin. If the hypothesized value (often zero for a difference) is outside the 95% CI, the p-value is less than 0.05 and the test rejects H₀. If the hypothesized value is inside the CI, the p-value exceeds 0.05 and the test fails to reject H₀. They always agree.
Question 8. A chi-square test of independence is most appropriate when:
- (A) Comparing the means of two groups
- (B) Testing whether a sample mean equals a specific value
- (C) Testing whether two categorical variables are associated
- (D) Comparing the variances of two groups
Answer
**Correct: (C)** The chi-square test of independence tests whether there is an association between two categorical variables (e.g., income level and vaccination status). (A) calls for a t-test. (B) calls for a one-sample t-test. (D) calls for an F-test or Levene's test.
Question 9. Which statement about Cohen's d is correct?
- (A) Cohen's d measures the statistical significance of a result
- (B) A Cohen's d of 0.5 means there is a 50% chance the effect is real
- (C) Cohen's d measures the size of an effect in standard deviation units
- (D) Cohen's d can only be positive
Answer
**Correct: (C)** Cohen's d = (mean difference) / (pooled standard deviation). It expresses the difference in standardized units, allowing comparisons across different studies and measures. (A) confuses effect size with significance. (B) is a misinterpretation — d has nothing to do with probability. (D) is wrong — d can be negative (indicating one group is lower than the other), though its absolute value is what matters for interpreting magnitude.
Question 10. "We failed to reject the null hypothesis" means:
- (A) The null hypothesis has been proven true
- (B) The alternative hypothesis has been proven false
- (C) The data did not provide sufficient evidence to conclude the null is false
- (D) The experiment was poorly designed
Answer
**Correct: (C)** "Fail to reject" means exactly that — the evidence wasn't strong enough to overcome the null hypothesis. It does NOT mean the null is true (A) or that the alternative is false (B). The effect might be real but the sample too small to detect it. And it doesn't necessarily mean anything about the experimental design (D), although low power due to small samples is one common reason for failing to reject.
Section 2: True or False (4 questions, 4 points each)
Question 11. True or False: A p-value of 0.001 means the effect is larger than a p-value of 0.05.
Answer
**False.** A smaller p-value means the data is more incompatible with the null hypothesis, but it does NOT mean the effect is larger. A tiny effect with a huge sample can produce a very small p-value (like our 0.1-point difference with 200,000 participants). A large effect with a tiny sample can produce a large p-value. P-values conflate effect size and sample size — they don't measure either one separately.
Question 12. True or False: If you use α = 0.05 and the null hypothesis is true, you will make a Type I error exactly 5% of the time.
Answer
**True.** This is the definition of the significance level. If H₀ is true and you use α = 0.05, the probability of incorrectly rejecting H₀ is exactly 0.05 (5%), assuming an exact test such as one based on a continuous test statistic. This is what the significance level controls — the long-run rate of false positives when the null is true.
Question 13. True or False: A one-tailed test is always more powerful than a two-tailed test.
Answer
**True** (with an important caveat). A one-tailed test concentrates the entire rejection region in one tail, making it more powerful for detecting effects in the predicted direction. The caveat: it has ZERO power to detect effects in the opposite direction, so it can completely miss an unexpected result. This is why two-tailed tests are generally recommended.
Question 14. True or False: If a result is not statistically significant, the 95% confidence interval for the effect must include zero.
Answer
**True** (for a two-tailed test at α = 0.05). This follows from the equivalence between hypothesis tests and confidence intervals. If the test at α = 0.05 fails to reject H₀ (that the effect is zero), the 95% CI must include zero. If the CI excluded zero, the test would have rejected H₀. They always agree for the same α and confidence level (α = 0.05 ↔ 95% CI).
Section 3: Short Answer (3 questions, 6 points each)
Question 15. Explain the difference between statistical significance and practical significance in 2-3 sentences. Give an example where a result could be statistically significant but not practically significant.
Answer
**Statistical significance** means the observed result is unlikely to have occurred by chance under the null hypothesis (p < α). **Practical significance** means the effect is large enough to matter in the real world. They are different dimensions — a result can be statistically significant without being practically important, and vice versa. **Example:** A study with 100,000 participants finds that people who floss daily have blood pressure readings 0.2 mmHg lower than non-flossers (p = 0.002). This is statistically significant but practically meaningless — 0.2 mmHg is far too small to have any health implications. No doctor would prescribe flossing as a blood pressure treatment based on this finding.
Question 16. What is statistical power, and what three factors primarily determine it? Why is power important when planning a study?
Answer
**Statistical power** is the probability of correctly rejecting a false null hypothesis — in other words, the probability of detecting a real effect when one exists. Power = 1 - β, where β is the probability of a Type II error (false negative). The three primary factors: (1) **Sample size** — larger samples give more power. (2) **Effect size** — larger effects are easier to detect. (3) **Significance level** — higher α (e.g., 0.10 vs. 0.01) gives more power but at the cost of more false positives. Power is important when planning a study because a study with low power (say, 30%) is likely to fail even if the effect is real — you'd be spending time and money on a study that has only a 30% chance of producing a useful result. Power analysis before the study tells you how many observations you need to have a reasonable chance (typically 80%+) of detecting the expected effect.
Question 17. Describe the multiple testing problem in 2-3 sentences. Briefly explain one method for correcting it.
Answer
The **multiple testing problem** occurs when you conduct many hypothesis tests simultaneously. With each test at α = 0.05, there's a 5% chance of a false positive. Over many tests, the probability of at least one false positive grows rapidly — for 20 tests, it's about 64%. This means running many tests guarantees that some "significant" results are spurious. **Bonferroni correction** is the simplest fix: divide α by the number of tests. If you're running 20 tests, use α = 0.05/20 = 0.0025 for each individual test. This ensures the overall probability of any false positive stays at 0.05. The trade-off is reduced power — legitimate effects may be missed because the threshold is much stricter.
Section 4: Applied Scenarios (2 questions, 7 points each)
Question 18. A public health department wants to know whether a vaccination campaign increased coverage. Before the campaign, a random sample of 200 people showed 68% were vaccinated. After the campaign, a new random sample of 250 people showed 74% were vaccinated.
(a) State H₀ and H₁. (b) This is a test comparing two proportions. The test statistic is z = 1.40 and the two-tailed p-value is 0.162. At α = 0.05, do you reject H₀? (c) The 95% confidence interval for the difference in proportions (p₂ - p₁) is (-0.024, 0.144). Interpret this interval. (d) Should the department conclude the campaign was ineffective? Why or why not?
Answer
**(a)** H₀: p_after = p_before (no change in vaccination rate). H₁: p_after ≠ p_before (there is a change). **(b)** No, we fail to reject H₀ because the p-value (about 0.16) exceeds 0.05. **(c)** We are 95% confident that the true change in vaccination rate is somewhere between about -2 percentage points (a small decrease) and about +14 percentage points (a substantial increase). The interval includes zero, consistent with no change, but also includes effects as large as 14 points. **(d)** No! Failing to reject H₀ does not prove the campaign was ineffective. The confidence interval shows that an effect as large as 14 percentage points is still plausible. The study may simply have been underpowered — with samples of 200 and 250, there may not be enough data to detect a moderate effect. A larger follow-up study would be more informative. The appropriate conclusion is "the evidence is inconclusive," not "the campaign didn't work."
Question 19. A tech company runs an A/B test with 120,000 users per group. The conversion rate is 5.2% for version A and 5.4% for version B. The t-test gives p = 0.03.
(a) Is the result statistically significant at α = 0.05? (b) Compute the absolute difference and the relative difference between groups. (c) Would you recommend the company switch to version B? Consider both statistical and practical factors. (d) What additional information would help you make a better recommendation?
Answer
**(a)** Yes, p = 0.03 < 0.05. **(b)** Absolute difference: 5.4% - 5.2% = 0.2 percentage points. Relative difference: 0.2/5.2 × 100 = 3.8% relative increase. **(c)** This requires judgment. The 0.2 percentage-point absolute difference is statistically significant but very small in absolute terms. However, in business contexts, even small relative improvements (3.8% more conversions) can translate to significant revenue at scale. If the company has millions of users, a 3.8% increase in conversions could be worth millions of dollars. But consider also: Is version B harder to maintain? Does it affect user experience in other ways? Was this the only metric tested, or one of many? **(d)** Additional helpful information: (1) Revenue per conversion — to translate the conversion difference into dollars. (2) Whether the test measured other outcomes (time on site, user satisfaction, return visits) — the change might help one metric while hurting another. (3) Whether there were subgroup differences — version B might help some users and hurt others. (4) The confidence interval for the difference — how uncertain is the 0.2-point estimate? (5) Whether this was the only test run or one of many (multiple testing concern).
Section 5: Code Analysis (1 question, 6 points)
Question 20. Read the following code and answer the questions:
```python
import numpy as np
from scipy import stats

np.random.seed(42)
group1 = np.random.normal(70, 12, 25)
group2 = np.random.normal(75, 12, 25)

t_stat, p_value = stats.ttest_ind(group1, group2)

d = (group2.mean() - group1.mean()) / np.sqrt(
    ((len(group1) - 1) * group1.var(ddof=1) + (len(group2) - 1) * group2.var(ddof=1))
    / (len(group1) + len(group2) - 2)
)

if p_value < 0.05:
    print("Significant!")
else:
    print("Not significant.")
print(f"p = {p_value:.4f}, d = {d:.3f}")
```
(a) What test is being performed?
(b) What does the variable d compute?
(c) If the output says "Not significant" with d = 0.42, what does this mean in practical terms?
(d) The true population difference is 5 points. With n = 25 per group, is this study likely to detect it? Why or why not?
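For part (d), one way to check your intuition is a power simulation. The sketch below is an illustration, not part of the original quiz: it assumes the same setup as the code above (normal populations with standard deviation 12, a true difference of 5 points, n = 25 per group) and estimates power as the fraction of simulated experiments that reach p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims = 3000          # number of simulated experiments
hits = 0
for _ in range(n_sims):
    g1 = rng.normal(70, 12, 25)   # "control": mean 70, sd 12, n = 25
    g2 = rng.normal(75, 12, 25)   # "treatment": true difference of 5 points
    if stats.ttest_ind(g1, g2).pvalue < 0.05:
        hits += 1

power = hits / n_sims  # fraction of experiments that detected the effect
print(f"Estimated power: {power:.2f}")
```

With these settings the estimate lands around 0.3 — roughly a 30% chance of detecting the 5-point difference, which is why a study of this size is likely to miss a real effect.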