Chapter 26: Quiz — Scientific Thinking and Evidence Evaluation

Instructions: Answer all questions. Multiple choice: select the best answer. Short answer: write 2–4 sentences. Answers are hidden at the bottom of this file.


Part I: Multiple Choice (Questions 1–14)

Question 1. Popper's falsifiability criterion holds that:

(A) Scientific theories must be proven true through sufficient confirmatory evidence. (B) A hypothesis is scientific only if there is possible evidence that could show it false. (C) All scientific claims are equally uncertain until proven. (D) Science requires absolute certainty before accepting any claim.


Question 2. Which study design provides the strongest evidence for a causal relationship between an intervention and an outcome?

(A) Prospective cohort study (B) Case-control study (C) Randomized controlled trial (D) Systematic review of case studies


Question 3. A p-value of 0.03 means:

(A) There is a 3% probability that the null hypothesis is true. (B) There is a 3% probability of observing results this extreme if the null hypothesis is true. (C) There is a 97% probability that the alternative hypothesis is true. (D) The result will replicate 97% of the time.


Question 4. The "file drawer problem" refers to:

(A) Researchers losing their data in disorganized filing systems. (B) Unpublished null results that remain hidden, distorting the published literature toward positive findings. (C) Journal editors rejecting papers for improper formatting. (D) Researchers who fail to cite relevant prior work.


Question 5. A cohort study finds that people who eat breakfast daily have 20% lower rates of obesity. The most likely confounding explanation is:

(A) Obesity causes people to skip breakfast. (B) Breakfast consumption directly reduces caloric absorption. (C) People who eat breakfast may have generally healthier lifestyles that account for the lower obesity rates. (D) The study's measurement of breakfast consumption was inaccurate.


Question 6. What is the main distinguishing feature of a randomized controlled trial that makes it superior for causal inference?

(A) Larger sample size than other study designs. (B) Longer follow-up period. (C) Random assignment distributes known and unknown confounders equally across groups. (D) RCTs always use blinding, while other studies do not.


Question 7. Pre-registration of a clinical trial primarily addresses which problem?

(A) Publication bias (B) P-hacking and HARKing (C) Small sample sizes (D) Conflicts of interest


Question 8. Which of the following is the correct interpretation of a 95% confidence interval?

(A) There is a 95% probability that the true value lies within this specific interval. (B) 95% of values in the sample fall within this interval. (C) If this experiment were repeated many times, 95% of the calculated intervals would contain the true value. (D) The result is statistically significant at the 95% level.


Question 9. A drug reduces the relative risk of stroke by 25%. The absolute risk of stroke in the control group is 2%. What is the absolute risk reduction?

(A) 25% (B) 2% (C) 0.5 percentage points (D) 0.25 percentage points


Question 10. HARKing stands for:

(A) Highly Accurate Result Keeping (B) Hypothesizing After Results are Known (C) High-power Analysis and Replication Kernel (D) Heterogeneous Analysis and Randomized Knowledge


Question 11. Simpson's Paradox demonstrates that:

(A) Simple p-values are always misleading. (B) Aggregate statistics can reverse when data is stratified by a confounding variable. (C) Randomization always eliminates confounding. (D) Effect sizes are more important than p-values.


Question 12. Which of the following is the most reliable guide to scientific consensus on a medical question?

(A) A survey of popular opinion (B) The position of one prominent expert who has conducted extensive research (C) A Cochrane systematic review of all available RCTs (D) Media coverage of the latest individual study


Question 13. The "winner's curse" in the context of underpowered studies refers to:

(A) The fact that larger, better-funded studies always win in publication. (B) The tendency for published significant findings from small studies to overestimate true effect sizes. (C) The bias toward publishing studies that support profitable treatments. (D) The difficulty of replicating results in different populations.


Question 14. An observational study finds that coffee drinkers have significantly lower rates of liver disease. Which Bradford Hill criterion would be MOST important to establish to strengthen a causal claim?

(A) Specificity (B) Analogy (C) Temporality (coffee consumption preceding disease) and a plausible biological mechanism (D) The strength of the association


Part II: True/False with Justification (Questions 15–20)

Write TRUE or FALSE, then provide one sentence justifying your answer.

Question 15. A study with p = 0.001 necessarily has a larger effect size than a study with p = 0.05.

Question 16. Peer review is designed to detect data fabrication.

Question 17. A systematic review always provides stronger evidence than a single well-designed RCT.

Question 18. A confidence interval that includes zero is consistent with no effect.

Question 19. Finding that coffee consumption correlates with lower Alzheimer's rates in a cohort study is sufficient to recommend coffee consumption for Alzheimer's prevention.

Question 20. Replication failure always means the original finding was fraudulent.


Part III: Short Answer (Questions 21–25)

Question 21. Explain the difference between statistical significance and practical significance. Give an example where a result can be statistically significant but practically insignificant.


Question 22. What is publication bias, and how does it distort the published literature? Describe one reform that directly addresses publication bias.


Question 23. A news headline reads: "Scientists discover that people who drink green tea have 30% lower cancer rates." You know the study was an observational cohort study. List three limitations or concerns that would cause you to be skeptical of the headline's implied causal claim.


Question 24. Explain how p-hacking inflates the false positive rate. If a researcher tests 20 independent hypotheses at alpha = 0.05, what is the expected number of false positives?


Question 25. What is the Duhem-Quine thesis, and why does it complicate the simple falsificationist picture of how science works?


Answer Key

(Scroll past the separator for answers.)

---

---

---

---

---


ANSWERS

Part I: Multiple Choice

Q1: (B) — Popper's criterion is that a hypothesis is scientific only if there exists possible evidence that would show it to be false — i.e., it generates testable predictions that could be refuted by observation. This demarcates science from non-falsifiable claims like "God did it" or "the universe was created with the appearance of age." Note: Popper held that science never proves theories true, only fails to refute them.

Q2: (C) — Among individual study designs, the RCT provides the strongest evidence for causation, because random assignment distributes both known and unknown confounders equally between groups. A systematic review of RCTs would be stronger still, but option D specifies a systematic review of case studies, which is far weaker.

Q3: (B) — The p-value is defined as the probability of observing data at least as extreme as obtained, given that the null hypothesis is true. It is not the probability that the null hypothesis is true (that would require Bayesian analysis with a prior). It says nothing about the probability of replication.

Q4: (B) — The file drawer problem describes the systematic non-publication of null results. Since journals preferentially publish significant findings, null results sit unpublished in researchers' files. Meta-analyses that include only published results will therefore overestimate effect sizes.

Q5: (C) — This is a classic confounding scenario. People who eat breakfast tend to have generally healthier habits (regular routines, better food planning, fewer metabolic disruptions, higher socioeconomic status) that are themselves correlated with lower obesity rates. Reverse causation (option A) is also plausible: people with obesity may skip breakfast due to appetite disruption or dietary restriction attempts. But C describes the most common confounding mechanism.

Q6: (C) — The key feature of randomization is that it distributes both measured and unmeasured confounders between groups, making them comparable in expectation. This allows the difference in outcomes to be attributed to the intervention. No other study design achieves this without strong assumptions. Note: not all RCTs use blinding, so option D is incorrect.

Q7: (B) — Pre-registration — making the hypothesis and analysis plan public before data collection — primarily addresses p-hacking (trying multiple analyses) and HARKing (presenting exploratory results as confirmatory). Publication bias is addressed more directly by registered reports (where journals commit to publish based on design quality, not results).

Q8: (C) — The confidence interval is a procedure that, if repeated many times, would contain the true value in 95% of repetitions. It does NOT say there is a 95% probability that the true value lies within this specific interval (that would be a Bayesian credible interval). The confidence level is a property of the procedure, not the specific interval.
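The repeated-sampling interpretation can be checked with a short simulation (a sketch, not part of the quiz: it uses a known-sigma normal interval and illustrative parameter values):

```python
import random
import statistics

# Simulate repeating an experiment many times and count how often the
# 95% interval captures the true mean. All numbers are illustrative.
random.seed(0)
true_mean, sigma, n = 10.0, 2.0, 30
z = 1.96  # normal critical value for a 95% interval
trials = 2000

covered = 0
for _ in range(trials):
    sample = [random.gauss(true_mean, sigma) for _ in range(n)]
    m = statistics.mean(sample)
    se = sigma / n ** 0.5  # known-sigma standard error, for simplicity
    if m - z * se <= true_mean <= m + z * se:
        covered += 1

coverage = covered / trials
print(f"Empirical coverage: {coverage:.3f}")  # close to 0.95
```

Each individual interval either does or does not contain the true mean; the 95% figure is a property of the procedure across repetitions, which is exactly what the simulation measures.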

Q9: (C) — A 25% relative risk reduction on a 2% baseline = 0.25 × 2% = 0.5 percentage points absolute risk reduction: the drug reduces stroke risk from 2% to 1.5%. While the relative reduction sounds impressive (25%), the absolute reduction is modest. The number needed to treat (NNT) = 1/0.005 = 200 patients to prevent one stroke.
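The arithmetic can be verified directly (a minimal sketch; the 2% baseline and 25% relative reduction come from the question):

```python
# Convert a relative risk reduction into absolute terms (Q9 numbers).
control_risk = 0.02               # 2% baseline stroke risk
relative_risk_reduction = 0.25    # 25% relative reduction

absolute_risk_reduction = control_risk * relative_risk_reduction  # 0.005
treated_risk = control_risk - absolute_risk_reduction             # 0.015
number_needed_to_treat = 1 / absolute_risk_reduction              # 200

print(f"ARR = {absolute_risk_reduction * 100:.1f} percentage points, "
      f"NNT = {number_needed_to_treat:.0f}")
```

This is why reporting only relative risk reductions can mislead: the same 25% figure implies very different absolute benefits depending on the baseline risk.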

Q10: (B) — HARKing (Hypothesizing After Results are Known) refers to forming a hypothesis after seeing the data but presenting it as if it were formed beforehand. This converts exploratory analysis into the appearance of confirmatory analysis, inflating false positive rates and preventing proper evaluation of the hypothesis.

Q11: (B) — Simpson's Paradox occurs when a trend in aggregate data reverses or disappears when the data is stratified by a confounding variable. The UC Berkeley admissions example is classic: men were accepted at higher rates overall, but when stratified by department, women were accepted at higher rates in most departments. The confounding variable (which departments people applied to) reversed the apparent relationship.
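The reversal can be reproduced with a toy dataset (hypothetical numbers, not the actual Berkeley data):

```python
# (admitted, applied) for two hypothetical departments. Women are
# admitted at a higher rate within each department, but most women
# applied to the more selective department B.
admissions = {
    "dept_A": {"men": (500, 800), "women": (70, 100)},
    "dept_B": {"men": (20, 200), "women": (180, 900)},
}

def rate(admitted, applied):
    return admitted / applied

# Per-department acceptance rates: women higher in both departments.
rates = {
    dept: {group: rate(*counts) for group, counts in groups.items()}
    for dept, groups in admissions.items()
}

# Aggregate acceptance rates: men higher overall, i.e. the reversal.
overall = {}
for group in ("men", "women"):
    admitted = sum(admissions[d][group][0] for d in admissions)
    applied = sum(admissions[d][group][1] for d in admissions)
    overall[group] = admitted / applied

print(rates)
print(overall)
```

Stratifying by department (the confounder) reverses the aggregate conclusion, which is the paradox in miniature.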

Q12: (C) — A Cochrane systematic review comprehensively and systematically synthesizes all available RCT evidence using pre-specified inclusion criteria and rigorous methodology. It is generally the most reliable guide to evidence on medical interventions. Individual expert opinion, media coverage, and popular surveys are all less reliable for establishing medical consensus.

Q13: (B) — The winner's curse describes how published significant results from underpowered studies tend to overestimate true effect sizes. When a small study produces a significant result by chance, the observed effect must be large enough to exceed the significance threshold despite the large sampling variability, leading to inflated estimates. Subsequent larger, better-powered studies typically find smaller (more accurate) effects.
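A small simulation illustrates the inflation (a sketch with illustrative parameters: a modest true effect, small samples, and a rough z-based significance filter):

```python
import random
import statistics

random.seed(1)
true_effect, sigma, n = 0.2, 1.0, 20   # modest effect, underpowered study
z_crit = 1.96

# Keep only the effect estimates from "studies" that reach significance.
significant_estimates = []
for _ in range(5000):
    sample = [random.gauss(true_effect, sigma) for _ in range(n)]
    m = statistics.mean(sample)
    se = sigma / n ** 0.5
    if abs(m / se) > z_crit:           # crosses the significance threshold
        significant_estimates.append(m)

mean_significant = statistics.mean(significant_estimates)
print(f"True effect: {true_effect}, "
      f"mean significant estimate: {mean_significant:.2f}")  # inflated
```

Because a small study's estimate must be large to clear the significance bar, conditioning on significance selects the overestimates, exactly as the answer describes.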

Q14: (C) — While all Bradford Hill criteria are useful, temporality is the only strictly necessary condition for causation: exposure must precede outcome. Without establishing that coffee consumption preceded the lower liver disease risk rather than followed it, causation cannot be inferred. Biological plausibility is the next most important criterion for distinguishing genuine causal relationships from spurious correlations.

Part II: True/False

Q15: FALSE — P-values are strongly influenced by sample size. With a very large sample, even a tiny (practically insignificant) effect produces a very small p-value. A study with n = 1,000,000 might find p = 0.0001 for an effect of Cohen's d = 0.01 (trivially small), while a study with n = 50 might find p = 0.04 for d = 0.5 (medium effect). P-values do not directly indicate effect size.
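The point can be made concrete with a one-sample z-test approximation (the effect sizes and sample sizes below are illustrative, chosen for round numbers):

```python
import math

def two_sided_p_from_z(z: float) -> float:
    # Two-sided p-value under a standard normal reference distribution.
    return math.erfc(abs(z) / math.sqrt(2))

# For a one-sample z-test, z = d * sqrt(n), where d is Cohen's d.
p_tiny_effect_huge_n = two_sided_p_from_z(0.01 * math.sqrt(1_000_000))  # d = 0.01
p_medium_effect_small_n = two_sided_p_from_z(0.5 * math.sqrt(16))       # d = 0.5

print(p_tiny_effect_huge_n)      # vanishingly small despite a trivial effect
print(p_medium_effect_small_n)   # about 0.0455 despite a medium effect
```

The trivial effect produces by far the smaller p-value purely because of its enormous sample, confirming that p-values conflate effect size with sample size.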

Q16: FALSE — In standard peer review, reviewers do not have access to the original raw data; they evaluate the methods and results as reported. Sophisticated data fabrication (which produces internally consistent data) is essentially impossible to detect through peer review alone. Major fraud cases (Hwang, Stapel, Wansink) passed peer review and were detected through post-publication scrutiny and whistleblowers.

Q17: FALSE — A systematic review is only as strong as the studies it includes. If a systematic review combines poorly designed observational studies or small, low-quality trials, its conclusions may be less reliable than a single, large, well-designed RCT. Furthermore, systematic reviews can be affected by heterogeneity among included studies, inconsistent methodology choices, and the aggregation of biased literature. Quality of included studies matters as much as the systematic review design itself.

Q18: TRUE — When a confidence interval includes zero (for a difference or risk reduction), it means the observed data are compatible with there being no effect. The result would not be statistically significant at the alpha level corresponding to that confidence level (e.g., a 95% CI that includes zero is not significant at p < 0.05). Note: including zero does not prove there is no effect — it means the study cannot rule out no effect with the given precision.

Q19: FALSE — Observational cohort studies establish association, not causation. The association between coffee and lower Alzheimer's rates could be explained by confounders (coffee drinkers may be more educated, have higher income, exercise more, have generally healthier diets), reverse causation (early Alzheimer's symptoms cause people to stop drinking coffee), or selection bias. Clinical recommendations require evidence from randomized trials demonstrating that intervening on coffee consumption actually changes Alzheimer's outcomes.

Q20: FALSE — Replication failure can result from many causes that do not imply fraud: the original finding may have been a false positive (expected for 5% of true-null hypotheses at alpha = 0.05), an underpowered study whose positive result was a chance overestimate, a genuine effect that requires specific conditions the replication did not meet, or real differences between study populations. While fraud can cause replication failure, it is a less common cause than methodological factors and statistical chance.

Part III: Short Answer

Q21: Statistical significance indicates that an observed result is unlikely to be due to sampling chance alone (given the null hypothesis). Practical significance (also called clinical or substantive significance) indicates that the effect is large enough to matter in the real world. A study with one million participants might find that green tea drinkers have, on average, 0.001 IQ points higher test scores — statistically significant because the sample is so large, but practically meaningless because the effect is too small to be detectable or actionable. Always examine effect sizes alongside p-values.

Q22: Publication bias is the systematic tendency for journals and researchers to preferentially submit and publish studies with statistically significant (positive) results compared to null results. This distorts the published literature: if 20 studies test a false hypothesis at alpha = 0.05, one will produce a false positive on average, and this study is far more likely to be published than the 19 null studies. The result is an overestimation of effect sizes in the literature. Registered Reports directly address publication bias by having journals commit to publish studies based on the quality of the design before results are known, making the eventual result irrelevant to publication decisions.

Q23: Three key concerns about a cohort study finding green tea → lower cancer rates: (1) Confounding: Green tea drinkers may differ from non-drinkers in many other health-relevant ways (dietary patterns, lifestyle, socioeconomic status) that account for the lower cancer rates. Even sophisticated statistical control cannot eliminate all unmeasured confounders. (2) Reverse causation: People at early stages of developing conditions that elevate cancer risk (e.g., pre-existing liver disease, digestive problems) may reduce green tea consumption before diagnosis, creating a spurious correlation. (3) Publication bias and multiple comparisons: Large cohort studies often test many dietary variables; only associations that achieve significance tend to be reported, inflating the probability that any given "significant" finding is a false positive.

Q24: P-hacking involves trying multiple analytical choices (different outcome measures, different covariates, different subgroup definitions, different data exclusion rules) until a significant result emerges, then reporting only the significant analysis as if it were the only one tried. If a researcher tests 20 independent hypotheses at alpha = 0.05 when no true effects exist, the expected number of false positives is 20 × 0.05 = 1 false positive. The probability of obtaining at least one false positive is 1 - (0.95)^20 ≈ 64%. This means that in studies that explore many outcome measures without correction, finding "significant" results is expected by chance alone.
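The expected-count and at-least-one calculations can be written out in a few lines (the Bonferroni correction at the end is added for context; it is not asked for in the question):

```python
n_tests, alpha = 20, 0.05

expected_false_positives = n_tests * alpha            # 20 * 0.05 = 1.0
p_at_least_one = 1 - (1 - alpha) ** n_tests           # 1 - 0.95**20, about 0.64

# Bonferroni correction: test each hypothesis at alpha / n_tests, which
# brings the family-wise error rate back to roughly alpha.
bonferroni_alpha = alpha / n_tests                    # 0.0025
p_at_least_one_corrected = 1 - (1 - bonferroni_alpha) ** n_tests  # ~0.049

print(expected_false_positives,
      round(p_at_least_one, 3),
      round(p_at_least_one_corrected, 3))
```

The contrast between the uncorrected (~64%) and corrected (~5%) family-wise error rates shows why uncorrected multiple testing makes "significant" findings expected by chance alone.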

Q25: The Duhem-Quine thesis, proposed independently by Pierre Duhem and Willard Van Orman Quine, holds that no scientific hypothesis can be tested in isolation. Any experimental test involves not only the hypothesis being tested but also a large network of auxiliary hypotheses: assumptions about the experimental apparatus, the measurement instruments, the experimental conditions, background theories in physics and chemistry, and more. When an experiment produces unexpected results, the scientist faces a "holistic" choice — which element of the network to revise. This complicates simple falsificationism because a failed prediction does not automatically falsify the central hypothesis; it could be an auxiliary assumption that needs revision. This explains why scientists do not immediately abandon theories when single experiments fail.