Chapter 26: Key Takeaways — Scientific Thinking and Evidence Evaluation

Core Concepts

1. Falsifiability is the hallmark of science. Popper's criterion holds that scientific claims must be testable — capable in principle of being shown false. Claims structured to be immune to any possible evidence are not scientific. Understanding falsifiability explains why science revises its claims (because disconfirming evidence is taken seriously) and why this revision is a feature, not a bug.

2. Evidence quality is hierarchical. The evidence pyramid — from anecdote at the bottom to meta-analysis at the top — provides a structured way to assess the strength of scientific claims. Higher tiers of evidence offer greater protection against confounding, selection bias, and chance findings. However, no single study design is universally superior; the appropriate design depends on the research question.

3. Randomized controlled trials provide the strongest single-study evidence for causation. Randomization balances known and unknown confounders across groups (in expectation), so that systematic differences in outcomes can be attributed to the intervention. This is why RCTs are the gold standard for evaluating interventions. Their limitations (ethical constraints, timeframe, external validity) are real but do not negate their superiority for the questions they can address.
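
The logic of randomization can be seen in a small simulation. All numbers here are illustrative, not from any real trial; the "risk score" stands in for any confounder, measured or not:

```python
import random
import statistics

random.seed(42)

# Hypothetical trial population: each subject carries an unmeasured
# "risk score" confounder.
population = [random.gauss(50, 10) for _ in range(10_000)]

# Randomize into treatment and control arms.
random.shuffle(population)
treatment = population[:5_000]
control = population[5_000:]

# Randomization balances the confounder across arms without the
# experimenter ever having measured it.
diff = abs(statistics.mean(treatment) - statistics.mean(control))
print(f"difference in mean risk score between arms: {diff:.3f}")
```

The point generalizes: the same shuffle that balances this score also balances every other subject characteristic, which is exactly what observational designs cannot guarantee.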

4. Peer review is necessary but insufficient. Peer review improves scientific quality by catching errors and raising standards, but it cannot detect fabricated data, does not correct for publication bias, and is vulnerable to reviewer fatigue and confirmation bias. The replication crisis revealed that peer-reviewed publication is not a reliable indicator of truth.

5. P-values are widely misunderstood. A p-value is the probability of observing results at least as extreme as those obtained, assuming the null hypothesis is true — not the probability that the results are due to chance, not the probability that the null is true, and not the probability of replication. The 0.05 threshold is an arbitrary convention and should not be the sole criterion for evaluating evidence.
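
A short simulation makes the definition concrete. The scenario (60 heads in 100 flips of a possibly biased coin) is purely illustrative:

```python
import random

random.seed(0)

# Suppose we observed 60 heads in 100 flips. The p-value asks: if the
# coin were fair (the null hypothesis), how often would we see a result
# at least this extreme?
observed, n_flips, n_sims = 60, 100, 20_000

extreme = 0
for _ in range(n_sims):
    heads = sum(random.random() < 0.5 for _ in range(n_flips))
    if abs(heads - 50) >= abs(observed - 50):  # at least as extreme, two-sided
        extreme += 1

p_value = extreme / n_sims
print(f"simulated p-value: {p_value:.3f}")  # close to the exact value of ~0.057
```

Note what the simulation does and does not say: it assumes the null and computes how often the data would look this extreme. It never computes the probability that the null is true.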

6. Effect sizes matter as much as statistical significance. With large samples, trivially small effects achieve statistical significance. Always examine effect sizes (Cohen's d, relative risk, odds ratio, NNT) alongside p-values. Statistical significance tells you the result would be surprising if there were no real effect; effect size tells you whether the effect is large enough to matter.
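
A sketch of the large-sample phenomenon, using hypothetical data and computing Cohen's d and a simple two-sided z-test directly rather than via a statistics library:

```python
import math
import random
import statistics

random.seed(1)

# Two hypothetical groups whose true means differ by a trivial amount:
# 1 point on a scale with SD 10, i.e. a true Cohen's d of 0.1.
n = 20_000
group_a = [random.gauss(100.0, 10.0) for _ in range(n)]
group_b = [random.gauss(101.0, 10.0) for _ in range(n)]

mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
sd_pooled = math.sqrt((statistics.variance(group_a) + statistics.variance(group_b)) / 2)

# Cohen's d: the standardized mean difference.
d = (mean_b - mean_a) / sd_pooled

# Large-sample z-test for the difference in means.
se = sd_pooled * math.sqrt(2 / n)
z = (mean_b - mean_a) / se
p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

print(f"Cohen's d = {d:.3f}  (tiny effect)")
print(f"p-value   = {p:.2e}  (highly 'significant')")
```

With 20,000 subjects per group, a one-tenth-of-an-SD effect produces a vanishingly small p-value, yet few interventions with d = 0.1 would be worth pursuing on effect size alone.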

7. Correlation does not establish causation. An association between A and B can arise from A causing B, B causing A, a common cause C, or coincidence. Establishing causation requires ruling out alternative explanations — through controlled experiments, prospective design, dose-response relationships, biological plausibility, and multiple converging lines of evidence.
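
The common-cause case can be demonstrated in a few lines. The variable names and numbers are invented for illustration: a shared factor C drives both A and B, which never influence each other:

```python
import math
import random

random.seed(7)

def pearson(xs, ys):
    # Pearson correlation coefficient, computed from first principles.
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical common cause C (say, health consciousness). A (exercise)
# and B (supplement use) each depend on C but have no causal link to
# each other.
c = [random.gauss(0, 1) for _ in range(5_000)]
a = [ci + random.gauss(0, 1) for ci in c]
b = [ci + random.gauss(0, 1) for ci in c]

r = pearson(a, b)
print(f"corr(A, B) = {r:.2f}")  # substantial correlation, zero causation
```

A naive reading of the data would conclude that exercise and supplement use influence each other; the simulation shows a sizable correlation arising entirely from the unobserved common cause.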

8. Confounding is pervasive in observational research. Observational studies cannot randomize exposures, so known and unknown factors related to both exposure and outcome create spurious or masked associations. The Bradford Hill criteria provide a structured approach to evaluating causal claims from observational data.

9. The replication crisis revealed systematic distortions in scientific practice. Publication bias, p-hacking, HARKing, underpowered studies, and motivated reasoning have collectively inflated effect sizes and filled the published literature with false positives — particularly in psychology, nutrition, and some areas of medicine. This is a genuine and serious problem.

10. The replication crisis does not justify wholesale skepticism. The crisis affects specific research areas with specific features (small effects, high analytical flexibility, strong publication pressures). It does not apply equally across all science. Physics, chemistry, and well-established biological sciences have different methodological cultures and replication rates. The appropriate response is calibrated skepticism, not global distrust.

11. Pre-registration and open data are key reforms. Pre-registering hypotheses and analysis plans prevents p-hacking and HARKing. Open data allows independent verification. Registered Reports directly counter publication bias by basing acceptance on design quality, not results. These reforms improve the reliability of science moving forward.

12. Single studies should not change practice. A single study, however prominent the journal, is insufficient basis for dietary recommendations, clinical practice changes, or public policy. Replicated findings across independent labs and methods, ideally summarized in pre-registered systematic reviews, provide the appropriate evidentiary basis.

13. Science news is systematically distorted. Press releases overstate findings; journalists simplify and sensationalize; headlines eliminate qualifications; social media amplifies dramatic claims. The translation from research to news introduces distortions at every step. Reading the original paper — especially the methods, results, and limitations sections — is required for genuine evaluation.

14. Scientific consensus is real and identifiable. Consensus on well-established issues (evolution, climate change, vaccine safety, germ theory) reflects convergent evidence from multiple independent lines of research, not mere agreement or authority. It can be identified through systematic reviews, position statements of major scientific bodies, and surveys of actively publishing experts. Manufactured controversy mimics but is not scientific controversy.

15. Scientific thinking is a set of transferable habits. Calibrated uncertainty, active search for disconfirmation, distinguishing personal experience from statistical evidence, and base-rate thinking are habits of mind that improve reasoning in any domain, not only scientific research. They are the cognitive tools of the scientifically literate citizen.

Practical Skills Developed

  • Identifying study designs and their position in the evidence hierarchy
  • Interpreting p-values, confidence intervals, and effect sizes correctly
  • Distinguishing correlation from causation using Bradford Hill criteria
  • Detecting confounding and considering alternative explanations
  • Recognizing publication bias, p-hacking, and HARKing
  • Evaluating science news stories for translation errors
  • Distinguishing scientific consensus from manufactured controversy
  • Applying Bayesian thinking to update beliefs in light of evidence quality
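
The Bayesian-updating skill above can be sketched numerically. The priors and likelihoods below are invented purely for illustration; the point is how evidence quality changes the size of the update:

```python
def update(prior, p_evidence_if_true, p_evidence_if_false):
    # Bayes' rule: posterior probability the claim is true, given the evidence.
    num = p_evidence_if_true * prior
    return num / (num + p_evidence_if_false * (1 - prior))

# Hypothetical numbers: a surprising claim gets a 5% prior.
prior = 0.05

# A single small positive study: false claims "pass" such tests fairly
# often (publication bias, p-hacking), so the update is modest.
after_small_study = update(prior, 0.8, 0.3)

# A pre-registered multi-lab replication: false claims rarely survive it,
# so the same prior moves much further.
after_replication = update(prior, 0.8, 0.02)

print(f"posterior after one small study: {after_small_study:.2f}")
print(f"posterior after replication:     {after_replication:.2f}")
```

Under these assumed numbers, a small study lifts a 5% prior only to about 12%, while a strong replication lifts it to roughly two-thirds: the same claim, but very different warranted confidence.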

Connections to Other Chapters

  • Chapter 25 (Logic and Fallacies): Post hoc ergo propter hoc connects directly to the correlation-causation problem; the appeal to authority fallacy connects to legitimate vs. illegitimate expert consensus; cherry picking is the logical fallacy version of publication bias.
  • Chapter 27 (Source Evaluation): Evaluating the credibility of scientific claims requires understanding who published them, where, and with what methodology.
  • Chapter 24 (Cognitive Biases): Confirmation bias, availability heuristic, and motivated reasoning are the psychological underpinnings of p-hacking, HARKing, and selective reading of scientific evidence.
  • Chapter 28 (Conspiracy Theories): The replication crisis has been weaponized in conspiracy theory discourse; understanding what it does and does not show is necessary for responding to these claims.