Quiz: Hypothesis Testing: Making Decisions with Data
Test your understanding of hypothesis testing, p-values, significance levels, and error types. Try to answer each question before revealing the answer.
1. In hypothesis testing, the null hypothesis ($H_0$) represents:
(a) The researcher's belief about what will happen (b) The claim that there is a significant effect (c) The default assumption of no effect, no difference, or status quo (d) The conclusion drawn after analyzing the data
Answer
**(c) The default assumption of no effect, no difference, or status quo.** The null hypothesis is the "nothing special is happening" claim. It's what we assume is true until the data provide strong evidence against it — analogous to the presumption of innocence in a courtroom. The researcher's belief about what will happen is typically the alternative hypothesis ($H_a$).

2. A p-value of 0.03 means:
(a) There is a 3% probability that the null hypothesis is true (b) There is a 3% probability that the results are due to chance (c) If the null hypothesis is true, there is a 3% probability of observing data as extreme or more extreme than what was observed (d) The effect is 97% likely to be real
Answer
**(c) If the null hypothesis is true, there is a 3% probability of observing data as extreme or more extreme than what was observed.** The p-value is a conditional probability: $P(\text{data this extreme} \mid H_0)$. It is NOT the probability that $H_0$ is true (option a), NOT the probability results are "due to chance" (option b), and NOT the probability the effect is real (option d). Getting this right is arguably the most important thing in this entire chapter.

3. Which of the following is the correct way to state a conclusion when the p-value is greater than $\alpha$?
(a) "We accept the null hypothesis" (b) "We prove the null hypothesis is true" (c) "We fail to reject the null hypothesis" (d) "We reject the alternative hypothesis"
Answer
**(c) "We fail to reject the null hypothesis."** We never "accept" or "prove" the null hypothesis — we simply don't have enough evidence to reject it. "Fail to reject" correctly captures the asymmetry of hypothesis testing: the burden of proof is on the evidence. Not guilty ≠ innocent.

4. Sam tests whether Daria's three-point shooting has improved from her career rate of 31%. Which hypotheses are correct?
(a) $H_0: \hat{p} = 0.31$, $H_a: \hat{p} > 0.31$ (b) $H_0: p = 0.31$, $H_a: p > 0.31$ (c) $H_0: p = 0.31$, $H_a: p \neq 0.31$ (d) $H_0: p > 0.31$, $H_a: p = 0.31$
Answer
**(b) $H_0: p = 0.31$, $H_a: p > 0.31$.** Hypotheses are always about population parameters ($p$), not sample statistics ($\hat{p}$), so option (a) is wrong. Since Sam specifically wants to know if Daria has *improved*, the alternative is one-sided ($>$), not two-sided ($\neq$), so option (c) is wrong. Option (d) has the hypotheses reversed.

5. The significance level $\alpha$ represents:
(a) The probability of making a Type II error (b) The maximum probability of a Type I error that you're willing to tolerate (c) The p-value threshold that makes a result "important" (d) The probability that the null hypothesis is true
Answer
**(b) The maximum probability of a Type I error that you're willing to tolerate.** $\alpha$ is set before data collection and defines the threshold for rejecting $H_0$. If $H_0$ is true, the probability of incorrectly rejecting it is exactly $\alpha$. It is NOT a measure of importance (option c) — statistical significance is not the same as practical significance.

6. A test statistic of $z = 2.50$ in a one-tailed test (right) gives a p-value of approximately:
(a) 0.0062 (b) 0.0124 (c) 0.4938 (d) 0.9938
Answer
**(a) 0.0062.** For a right-tailed test: $p = P(Z \geq 2.50) = 1 - P(Z < 2.50) = 1 - 0.9938 = 0.0062$. Option (b) is the two-tailed p-value ($2 \times 0.0062 = 0.0124$). Option (c) is $P(0 < Z < 2.50)$. Option (d) is $P(Z < 2.50)$, the cumulative probability, not the tail probability.

7. What would the p-value in Question 6 be for a two-tailed test?
(a) 0.0062 (b) 0.0124 (c) 0.0031 (d) 0.9876
Answer
**(b) 0.0124.** For a two-tailed test, the p-value is $2 \times P(Z \geq |z|) = 2 \times 0.0062 = 0.0124$. We double the one-tailed p-value because extreme values in *either* direction count as evidence against $H_0$ when $H_a$ is two-sided.

8. A researcher conducts a hypothesis test and obtains $p = 0.04$. At $\alpha = 0.05$, she rejects $H_0$. Which statement is true?
(a) She has proven the alternative hypothesis is true (b) There is a 4% chance the null hypothesis is true (c) She might be making a Type I error (d) She cannot possibly be making any error
Answer
**(c) She might be making a Type I error.** When we reject $H_0$, there's always a possibility that $H_0$ was actually true and we got unlucky — that's a Type I error. The probability of this error is at most $\alpha = 0.05$. We never "prove" alternatives (a), the p-value is not $P(H_0 \text{ true})$ (b), and errors are always possible (d).

9. A Type I error occurs when:
(a) You reject $H_0$ when it is actually false (b) You fail to reject $H_0$ when it is actually false (c) You reject $H_0$ when it is actually true (d) You fail to reject $H_0$ when it is actually true
Answer
**(c) You reject $H_0$ when it is actually true.** A Type I error is a "false alarm" — concluding there's an effect when there isn't one. In the courtroom analogy, it's convicting an innocent person. Option (b) is a Type II error (missed detection). Options (a) and (d) are correct decisions.

10. A Type II error occurs when:
(a) You reject $H_0$ when it is actually false (b) You fail to reject $H_0$ when it is actually false (c) You reject $H_0$ when it is actually true (d) You fail to reject $H_0$ when it is actually true
Answer
**(b) You fail to reject $H_0$ when it is actually false.** A Type II error is a "missed detection" — failing to identify a real effect. In the courtroom analogy, it's acquitting a guilty person. The probability of a Type II error is denoted $\beta$, and its complement ($1 - \beta$) is called the **power** of the test (Chapter 17).

11. In a cancer screening test, a Type I error means:
(a) Telling a patient they have cancer when they don't (b) Telling a patient they don't have cancer when they do (c) The test correctly identifies cancer (d) The test correctly rules out cancer
Answer
**(a) Telling a patient they have cancer when they don't (false positive).** In this context, $H_0$: patient does not have cancer. A Type I error means rejecting $H_0$ (concluding cancer) when it's true (patient is healthy). A Type II error (option b) — missing real cancer — is arguably more dangerous in this scenario, which is why cancer screening tests are designed to have high sensitivity (low Type II error rate), even at the cost of more false positives.

12. If you decrease $\alpha$ from 0.05 to 0.01 (without changing sample size), what happens?
(a) Type I error probability decreases; Type II error probability decreases (b) Type I error probability decreases; Type II error probability increases (c) Type I error probability increases; Type II error probability decreases (d) Both error probabilities stay the same
Answer
**(b) Type I error probability decreases; Type II error probability increases.** Lowering $\alpha$ makes it harder to reject $H_0$, which reduces false alarms (Type I) but increases the chance of missing real effects (Type II). The two error rates have a seesaw relationship for any fixed sample size. The only way to reduce both simultaneously is to increase $n$.

13. A test statistic measures:
(a) The probability of the null hypothesis being true (b) The size of the effect in practical terms (c) How many standard errors the sample statistic is from the null hypothesis value (d) The significance level of the test
Answer
**(c) How many standard errors the sample statistic is from the null hypothesis value.** The test statistic (e.g., $z = (\bar{x} - \mu_0) / (\sigma/\sqrt{n})$) standardizes the distance between what we observed and what the null predicts, using the standard error as the ruler. A test statistic far from 0 suggests the data are inconsistent with $H_0$.

14. Which of the following is an example of p-hacking?
(a) Setting $\alpha = 0.05$ before collecting data (b) Reporting all analyses, whether significant or not (c) Testing many variables and reporting only the one with $p < 0.05$ (d) Using a larger sample to increase power
Answer
**(c) Testing many variables and reporting only the one with $p < 0.05$.** P-hacking involves exploiting the "garden of forking paths" — running multiple analyses and cherry-picking the significant result. When you test 20 variables at $\alpha = 0.05$, the probability of finding at least one "significant" result by chance is $1 - 0.95^{20} \approx 64\%$. Options (a), (b), and (d) are all appropriate practices.

15. A pharmaceutical company tests a new drug and reports $p = 0.001$. A colleague says: "The drug has a very strong effect." Is the colleague correct?
(a) Yes — a very small p-value always indicates a strong effect (b) No — the p-value measures the strength of evidence against $H_0$, not the size of the effect (c) Yes — $p = 0.001$ means the effect is practically significant (d) No — $p = 0.001$ means the null hypothesis is almost certainly false
Answer
**(b) No — the p-value measures the strength of evidence against $H_0$, not the size of the effect.** A very small p-value can result from a tiny effect with a very large sample. For example, a study of 100,000 people might find that a drug lowers blood pressure by 0.5 mmHg with $p = 0.001$ — statistically significant but clinically meaningless. To assess effect size, you need the actual magnitude of the difference, not just the p-value.

16. Dr. Maya Chen constructs a 95% CI for mean systolic BP: (126.5, 133.1). Without any additional calculations, what can you conclude about a two-sided hypothesis test of $H_0: \mu = 130$ at $\alpha = 0.05$?
(a) Reject $H_0$, because 130 is near the edge of the interval (b) Fail to reject $H_0$, because 130 is inside the interval (c) Reject $H_0$, because the sample mean (129.8) is not exactly 130 (d) Cannot determine without computing the test statistic
Answer
**(b) Fail to reject $H_0$, because 130 is inside the interval.** The duality principle: a 95% CI contains exactly the values of $\mu_0$ that would NOT be rejected by a two-sided test at $\alpha = 0.05$. Since 130 is inside (126.5, 133.1), we would fail to reject $H_0: \mu = 130$. This is the CI-hypothesis test connection from Section 13.11.

17. For the same CI in Question 16, what about $H_0: \mu = 125$?
(a) Reject $H_0$ at $\alpha = 0.05$ (two-sided) (b) Fail to reject $H_0$ at $\alpha = 0.05$ (two-sided) (c) Cannot determine without knowing the sample size (d) Cannot determine without knowing the standard deviation
Answer
**(a) Reject $H_0$ at $\alpha = 0.05$ (two-sided).** Since 125 is outside the 95% CI (126.5, 133.1), we would reject $H_0: \mu = 125$ at the 5% significance level. The CI serves as a visual decision tool for two-sided tests.

18. A researcher runs a two-tailed test and obtains $z = 1.80$ with $p = 0.0719$. If she had run a one-tailed test (in the correct direction), the p-value would be:
(a) 0.0360 (b) 0.0719 (c) 0.1438 (d) 0.9281
Answer
**(a) 0.0360.** For a one-tailed test in the direction of the observed effect, the p-value is half the two-tailed p-value: $0.0719 / 2 = 0.0360$. This means the result would be significant at $\alpha = 0.05$ with a one-tailed test but not with a two-tailed test — which is exactly why the choice between one- and two-tailed must be made before seeing the data.

19. Twenty different research teams each test a different dietary supplement for its effect on memory. All supplements are actually ineffective (the null hypothesis is true for all 20). Using $\alpha = 0.05$, how many teams would you expect to find a "statistically significant" result?
(a) 0 (b) 1 (c) 5 (d) 10
Answer
**(b) 1.** If $H_0$ is true and $\alpha = 0.05$, each team has a 5% chance of a false positive. With 20 teams: $20 \times 0.05 = 1$ expected false positive. This is the multiple testing problem. If only the "successful" team publishes, the literature will contain one impressive-looking study showing supplements improve memory — even though the result is entirely due to chance.

20. Which of the following is the best interpretation of "statistically significant at $\alpha = 0.05$"?
(a) The result is scientifically important (b) The effect is large enough to matter in practice (c) The data are unlikely enough under $H_0$ that we reject $H_0$ at the 5% level (d) There is a 95% probability that the alternative hypothesis is true
Answer
**(c) The data are unlikely enough under $H_0$ that we reject $H_0$ at the 5% level.** "Statistically significant" is a statement about the strength of evidence against $H_0$, nothing more. It does not mean the result is scientifically important (option a) or practically large (option b) — see Question 15 — and it does not give a 95% probability that $H_a$ is true (option d), since the p-value is not a probability about hypotheses.
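As a closing check, the z-to-p conversions in Questions 6, 7, and 18 can be reproduced with a few lines of Python using only the standard library (the helper names `p_right_tail` and `p_two_tailed` are illustrative; `scipy.stats.norm.sf` would give the same numbers):

```python
from math import erfc, sqrt

def p_right_tail(z):
    # P(Z >= z) for a standard normal, via the complementary error function:
    # P(Z >= z) = erfc(z / sqrt(2)) / 2
    return 0.5 * erfc(z / sqrt(2))

def p_two_tailed(z):
    # Two-tailed p-value: extreme results in either direction count against H0.
    return 2 * p_right_tail(abs(z))

print(round(p_right_tail(2.50), 4))  # Question 6 (one-tailed)  -> 0.0062
print(round(p_two_tailed(2.50), 4))  # Question 7 (two-tailed)  -> 0.0124
print(round(p_two_tailed(1.80), 4))  # Question 18 (two-tailed) -> 0.0719
print(round(p_right_tail(1.80), 4))  # Question 18 (one-tailed) -> 0.0359
```

Note the last line: the exact one-tailed p-value for $z = 1.80$ is 0.0359; the quiz's 0.0360 comes from halving the already-rounded two-tailed value 0.0719. Either way, the decision at $\alpha = 0.05$ is the same.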