In This Chapter
- Learning Objectives
- Section 26.1: The Scientific Method
- Section 26.2: Evidence Quality Hierarchies
- Section 26.3: Peer Review
- Section 26.4: Understanding Statistical Evidence
- Section 26.5: Confounding, Causation, and Correlation
- Section 26.6: The Replication Crisis
- Section 26.7: Evaluating Scientific Claims in Media
- Section 26.8: Consensus vs. Frontier
- Section 26.9: Applied Scientific Thinking
- Callout Box: Reading a Results Section
- Callout Box: The Bayesian Perspective
- Key Terms
- Discussion Questions
Chapter 26: Scientific Thinking and Evidence Evaluation
Learning Objectives
By the end of this chapter, students will be able to:
- Describe the structure of the scientific method and explain Popper's falsifiability criterion and its role in demarcating science from pseudoscience.
- Apply the evidence quality hierarchy to evaluate the strength of specific scientific claims, explaining when each level of evidence is appropriate.
- Explain how peer review works, what it can and cannot catch, and how the replication crisis revealed its limitations.
- Interpret common statistical measures — p-values, confidence intervals, and effect sizes — and distinguish statistical significance from practical significance.
- Explain the difference between correlation and causation, apply the Bradford Hill criteria, and describe how confounding variables undermine causal inference.
- Describe the key mechanisms of the replication crisis (publication bias, p-hacking, HARKing) and explain methodological reforms being implemented in response.
- Identify common patterns by which science news is distorted in translation from research to media reports.
- Distinguish well-established scientific consensus from active frontier debate.
- Apply scientific thinking habits to everyday evidence evaluation.
Section 26.1: The Scientific Method
What Science Does
Science is often described as a method for producing reliable knowledge about the natural world. But this description understates what is distinctive about science. Many enterprises produce knowledge; science produces self-correcting knowledge — knowledge that is systematically tested, publicly shared, independently verified, and revised when evidence demands revision.
The key features that make science distinctive are:
Empiricism: Scientific claims must be grounded in, and revisable by, observation and experiment. Armchair reasoning alone, however elegant, cannot establish empirical claims.
Testability: Scientific hypotheses must generate predictions that can, in principle, be tested. A hypothesis immune to all possible evidence is not a scientific hypothesis.
Public replication: Scientific findings must be accessible for other researchers to examine, challenge, and attempt to reproduce. Private knowledge claims cannot be scientifically established.
Communal critique: Science is a community activity. The scientific community's collective, adversarial evaluation of claims — through peer review, replication, and meta-analysis — is what gives science its distinctive reliability.
Revisability: Science is designed to change in light of evidence. A scientific community that never revises its views is not functioning scientifically.
The Classic Hypothetico-Deductive Method
The textbook scientific method is often presented as a simple sequence:
- Observe: Notice a phenomenon in the world that needs explanation.
- Hypothesize: Formulate a tentative explanation for the phenomenon.
- Predict: Derive testable predictions from the hypothesis.
- Test: Design experiments or observational studies to test the predictions.
- Analyze: Evaluate the results against the predictions.
- Revise: Modify the hypothesis in light of results, or discard it if it consistently fails.
Real science is messier than this clean sequence suggests. Hypotheses are formed before observations in some cases; observations are theory-laden (what we observe depends partly on our conceptual framework); experiments have multiple interpretations; and scientific communities debate the interpretation of ambiguous results for years. But the idealized method captures the essential logic: derive predictions from hypotheses, test them, and update beliefs accordingly.
Popper's Falsifiability Criterion
Karl Popper (1902-1994) argued that what distinguishes science from non-science (pseudoscience, metaphysics, ideology) is falsifiability: a claim is scientific only if it is possible, in principle, to observe something that would show the claim to be false.
This criterion has a counterintuitive implication: scientific claims are not proven true by confirming evidence. Any number of confirming observations is compatible with a false hypothesis. But a single genuine disconfirming observation can falsify a hypothesis. Science progresses not by accumulating proof but by eliminating falsehoods.
Popper's criterion matters for misinformation detection in several ways:
Identifying pseudoscience: Claims that are structured to be immune to disconfirmation — where any evidence can be explained away — are not scientific claims. Homeopathy's defenders, for example, explain away negative clinical trials by claiming that the studies weren't designed correctly, that randomized trials don't apply to individualized homeopathic treatments, or that the research was biased. When no possible result would count as disconfirmation, the hypothesis is not falsifiable.
Understanding scientific revisions: When science revises a claim (e.g., dietary guidelines on fat), misinformers sometimes present this as evidence that "science doesn't know anything." But revision is precisely what falsifiability requires. Science changed the fat-diet recommendations because evidence accumulated that the original claim was too simple. This is the system working, not failing.
Distinguishing hypotheses from established facts: Some scientific claims have survived enormous amounts of potential falsification — evolution, germ theory, anthropogenic climate change — and represent our best available knowledge. Falsifiability means we should always be open to revision in principle, but it does not mean all claims are equally uncertain.
Limits of the Falsifiability Criterion
Popper's criterion, while immensely influential, has been refined and critiqued:
The Duhem-Quine thesis: Any experimental test involves not just the hypothesis being tested but also auxiliary hypotheses about the experimental apparatus, measurement methods, and background conditions. When an experiment fails, it may be the auxiliary hypothesis, not the main hypothesis, that is wrong. Scientists do not simply abandon a hypothesis every time a single test fails.
Kuhn's paradigms: Thomas Kuhn argued that scientists work within paradigms — conceptual frameworks — and that anomalous results are routinely absorbed or ignored until enough accumulate to trigger a "paradigm shift." Normal science is not continually testing its foundational assumptions.
Lakatos's research programs: Imre Lakatos proposed that science operates through research programs with a "hard core" (protected from falsification) and a "protective belt" of auxiliary hypotheses that absorb anomalies. A research program is progressive if it generates new predictions that are confirmed; degenerative if it only accommodates anomalies post hoc.
These refinements matter because they show that the relationship between evidence and theory is complex. Science does not mechanically accept or reject hypotheses based on single experiments. Understanding this complexity is important for interpreting scientific disputes correctly.
Section 26.2: Evidence Quality Hierarchies
The Evidence Pyramid
In medicine and public health, an evidence quality hierarchy — commonly depicted as a pyramid — ranks study designs by the reliability of the evidence they produce. Understanding this hierarchy allows non-experts to assess the evidentiary weight of scientific claims.
From lowest to highest quality:
Level 1 — Anecdote: A single individual's experience or testimony. Anecdotes are valuable for hypothesis generation but cannot establish general claims. They are subject to all the biases of individual experience: selection effects, expectation effects, placebo responses, and coincidental co-occurrence of events.
Level 2 — Case Series / Case Reports: A collection of case reports, typically without a control group. Useful for rare conditions where randomized trials are impossible, and for generating hypotheses about unusual phenomena. Cannot establish cause-effect relationships because there is no comparison group.
Level 3 — Case-Control Study: A retrospective observational study comparing people with a condition (cases) to people without it (controls), looking backward to identify differences in exposures. Can detect associations but is vulnerable to recall bias (cases and controls may remember past exposures differently) and selection bias in the control group.
Level 4 — Cross-Sectional Study: Examines a population at one point in time. Useful for determining prevalence and generating hypotheses, but cannot establish temporal sequence or causation.
Level 5 — Cohort Study: Follows a group of people forward in time, comparing outcomes between those exposed and unexposed to a factor. Can establish temporal sequence (exposure before outcome) but cannot control for all confounders. Strong cohort studies with large samples and rigorous methods approach RCT quality for some questions.
Level 6 — Randomized Controlled Trial (RCT): Participants are randomly assigned to an intervention or control group. Randomization, when successful, distributes confounding variables equally between groups, allowing causal inference. The RCT is the gold standard for evaluating interventions. However, RCTs have limitations: ethical constraints prevent randomization of harmful exposures; some questions involve populations or timescales too large for RCTs; RCTs can be poorly designed or selectively reported.
Level 7 — Systematic Review: A structured, comprehensive review of all available studies on a question, using explicit methods to identify, select, and critically appraise evidence. Systematic reviews reduce selection bias in summarizing evidence by being comprehensive and explicit about inclusion criteria.
Level 8 — Meta-Analysis: A systematic review that statistically combines results from multiple studies to produce a pooled effect estimate. Meta-analyses can detect effects too small to be detected in individual studies and provide more precise effect estimates. However, meta-analyses inherit the weaknesses of the studies they include ("garbage in, garbage out"), and methodological choices in the meta-analysis can affect results substantially.
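To see how statistical pooling yields a more precise estimate than any single study, here is a minimal fixed-effect inverse-variance meta-analysis, a standard textbook pooling method. The three study estimates and standard errors are made up purely for illustration.

```python
import math

# (effect estimate, standard error) for three hypothetical studies
studies = [(0.30, 0.15), (0.10, 0.10), (0.25, 0.20)]

# Weight each study by the inverse of its variance: precise studies count more
weights = [1 / se**2 for _, se in studies]
pooled = sum(w * e for (e, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))

print(f"Pooled effect: {pooled:.3f} (SE {pooled_se:.3f})")
print(f"95% CI: {pooled - 1.96*pooled_se:.3f} to {pooled + 1.96*pooled_se:.3f}")
```

Note that the pooled standard error is smaller than that of any individual study: combining studies buys precision, which is exactly why meta-analyses can detect effects too small for single studies, and also why flawed inputs ("garbage in") produce a confidently wrong output.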
When Each Level Is Appropriate
The evidence pyramid is not an absolute hierarchy of value. Different research questions call for different study designs:
RCTs are inappropriate when: The exposure of interest is an environmental or behavioral factor that cannot be ethically randomized (e.g., studying the health effects of smoking cannot involve randomizing people to smoke); outcomes take decades to develop; or rare populations are too small for adequate statistical power.
Observational studies are essential when: Questions are too long-term, too complex, or too large for RCTs; when natural experiments provide useful quasi-experimental variation; or when the question is about populations and exposures as they naturally occur.
Anecdotes and case reports remain valuable for: Hypothesis generation; documenting rare, unusual phenomena; signaling unexpected safety events; and humanizing statistical findings for policy purposes.
Evaluating Individual Studies
Beyond the evidence hierarchy, several questions improve individual study evaluation:
- What was the sample size? Was the study adequately powered to detect the effect of interest?
- Was the sample representative of the population to which conclusions are generalized?
- How was the exposure or intervention measured? Are measurement tools validated?
- What potential confounders were considered, and how were they controlled?
- Was blinding appropriate and maintained (for interventional studies)?
- What were the actual effect sizes? Are they clinically meaningful, or merely statistically significant?
- Were the outcomes pre-specified, or were they selected post-hoc?
- Was the study replicated?
Section 26.3: Peer Review
How Peer Review Works
When scientists submit a manuscript to a scientific journal, the editor first performs an initial assessment. If the paper passes, it is sent to two to four independent experts in the relevant field — "peers" — who review it anonymously and provide assessments covering:
- Methodological rigor (study design, statistical analysis)
- Appropriateness of the interpretation for the data
- Significance and originality of the contribution
- Clarity of presentation
- Relationship to relevant prior literature
Reviewers recommend acceptance (rare), major revision, minor revision, or rejection. The journal editor makes the final decision, typically following reviewer recommendations. High-quality journals reject most submissions (Nature and Science reject more than 90% of submissions).
What Peer Review Catches
Peer review is effective at catching certain problems:
- Obvious statistical errors in analysis
- Inconsistencies between methods, results, and conclusions
- Failure to cite relevant prior work
- Implausible interpretations of data
- Methodological details that are unclear or potentially invalid
- Papers that do not meet the journal's standards for significance
What Peer Review Misses
Peer review has important, documented limitations:
Fabrication and fraud: Reviewers cannot access the original data and cannot detect fabricated results if the presented data is internally consistent. Major frauds — Hwang Woo-suk's stem cell fabrications, the Wansink food psychology scandals, Diederik Stapel's fabricated social psychology data — passed peer review.
Publication bias: Reviewers and editors prefer positive results (studies finding a significant effect) over null results (studies finding no effect). This systematic preference distorts the published literature toward significant findings even when the true effect is zero or small.
P-hacking: Researchers may try multiple analytical approaches and report only the one that produces a significant result. Peer reviewers typically see only the final analysis, not the decisions made along the way.
Insufficient expertise: For highly technical or interdisciplinary work, finding reviewers with the right combination of expertise is difficult. Reviewers may be competent to evaluate some aspects of a paper but not others.
Time pressure and reviewer fatigue: Peer review is an unpaid, time-consuming service. Reviews are often rapid and superficial rather than comprehensive.
Confirmation bias: Reviewers may be more sympathetic to papers consistent with their own theoretical commitments and more critical of papers that challenge those commitments.
Post-Publication Peer Review
Increasingly, the scientific community recognizes that peer review should not end at publication. Post-publication review through journals' letters, public comment platforms like PubPeer, and replication attempts represents an ongoing, distributed quality-control process. Many significant errors and frauds have been detected through post-publication scrutiny rather than pre-publication review.
Section 26.4: Understanding Statistical Evidence
Why Statistics Matter
Scientific results are almost never deterministic — they are probabilistic. Understanding the language of statistical evidence is necessary for evaluating scientific claims, detecting manipulation of statistics, and avoiding common misinterpretations.
P-Values: Meaning and Misinterpretation
The p-value is probably the most widely misunderstood statistical concept in public discourse.
What it is: The p-value is the probability of observing results at least as extreme as those obtained, assuming the null hypothesis is true. If p < 0.05 (the conventional threshold, also called "alpha"), results are declared "statistically significant."
What it is NOT:
- The p-value is not the probability that the null hypothesis is true.
- It is not the probability that the result was due to chance.
- It is not the probability that the finding will replicate.
- A small p-value does not indicate a large or important effect.
Why p < 0.05?: The 0.05 threshold is arbitrary and conventional, introduced by Ronald Fisher in the 1920s without theoretical justification. Many researchers and statisticians now argue it has caused enormous harm by encouraging dichotomous thinking (significant vs. non-significant) at the expense of effect size thinking.
Type I and Type II errors: A Type I error (false positive) occurs when the null hypothesis is wrongly rejected — finding an effect that isn't there. With alpha = 0.05, Type I errors occur in 5% of tests under the null hypothesis. A Type II error (false negative) occurs when the null hypothesis is wrongly retained — failing to detect an effect that is real. Type II error rate depends on statistical power (see Section 26.6).
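The claim that Type I errors occur in 5% of tests under the null can be checked with a short simulation. This is a stdlib-only sketch: the two-sample z-test with known sigma, the sample sizes, and the seed are illustrative choices, not drawn from any study in this chapter.

```python
import math
import random

random.seed(42)

def z_test_p(sample_a, sample_b, sigma=1.0):
    """Two-sided p-value for a two-sample z-test with known sigma."""
    n = len(sample_a)
    diff = (sum(sample_a) - sum(sample_b)) / n      # difference of means
    se = sigma * math.sqrt(2.0 / n)
    z = diff / se
    # Standard normal CDF via math.erf
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

false_positives = 0
trials = 10_000
for _ in range(trials):
    # Both groups drawn from the SAME distribution: the null hypothesis is true
    a = [random.gauss(0, 1) for _ in range(30)]
    b = [random.gauss(0, 1) for _ in range(30)]
    if z_test_p(a, b) < 0.05:
        false_positives += 1

print(f"False-positive rate: {false_positives / trials:.3f}")  # close to 0.05
```

Even with no real effect anywhere, roughly one test in twenty comes up "significant", which is what alpha = 0.05 means.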
Confidence Intervals
A confidence interval (CI) provides a range of values compatible with the data. A 95% confidence interval does not mean there is a 95% probability that the true value lies within that specific interval (a common misconception). It means that if we repeated the experiment many times and calculated a CI each time, 95% of those intervals would contain the true value.
Confidence intervals are more informative than p-values because they convey statistical significance (does the interval include zero?), precision (how wide is the interval?), and effect direction (on which side of zero does the bulk of the interval lie?). A wide confidence interval indicates low precision; a narrow interval indicates high precision.
Example: A study finds that a drug reduces blood pressure by 4.2 mmHg (95% CI: 0.1 to 8.3 mmHg). This is statistically significant (the interval excludes zero) but the practical significance depends on clinical context: a reduction of 0.1 mmHg would be medically meaningless, while 8.3 mmHg would be clinically important. The wide interval tells us our estimate is imprecise.
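The repeated-sampling interpretation of a 95% CI can be demonstrated directly. This is a stdlib-only sketch in which the true mean, sigma, and sample size are invented, and sigma is treated as known so the interval formula stays simple.

```python
import math
import random

random.seed(0)
TRUE_MEAN, SIGMA, N = 4.2, 10.0, 25   # invented "true" parameters

covered = 0
trials = 5_000
for _ in range(trials):
    sample = [random.gauss(TRUE_MEAN, SIGMA) for _ in range(N)]
    mean = sum(sample) / N
    half_width = 1.96 * SIGMA / math.sqrt(N)   # known-sigma 95% CI
    lo, hi = mean - half_width, mean + half_width
    if lo <= TRUE_MEAN <= hi:
        covered += 1

print(f"Coverage: {covered / trials:.3f}")  # close to 0.95
```

Each individual interval either contains the true mean or it does not; the "95%" describes the long-run behavior of the procedure, not the probability for any one interval.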
Effect Sizes
Effect size quantifies the magnitude of a relationship or difference, independent of sample size. Common effect size measures include:
- Cohen's d: Standardized difference between two means (d = 0.2 is small, 0.5 is medium, 0.8 is large by convention)
- Relative Risk (RR): Ratio of outcome probability in exposed vs. unexposed groups
- Odds Ratio (OR): Ratio of odds in exposed vs. unexposed groups
- Number Needed to Treat (NNT): How many patients must be treated for one additional patient to benefit
Practical vs. statistical significance: With large enough samples, trivially small effects become statistically significant. A study of 1 million people might find a statistically significant relationship between, say, left-handedness and tea preference — but the effect size might be so small as to be completely irrelevant to any practical decision. Always look at effect sizes alongside p-values.
Example: A headline reads "New study: Eating breakfast linked to 10% lower obesity risk!" This sounds substantial, but the 10% is a relative risk. If the absolute risk of obesity among non-breakfast eaters is 30%, then among breakfast eaters it is 27% — an absolute risk reduction of 3 percentage points. The effect is real but modest.
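The breakfast arithmetic above generalizes to a small helper. The function name is mine, and the 30% baseline and 10% relative reduction follow the worked example, not a real study.

```python
def absolute_from_relative(baseline_risk, relative_reduction):
    """Convert a relative risk reduction into absolute terms."""
    treated_risk = baseline_risk * (1 - relative_reduction)
    arr = baseline_risk - treated_risk          # absolute risk reduction
    nnt = 1 / arr                               # number needed to treat
    return treated_risk, arr, nnt

# Breakfast example from the text: 30% baseline risk, 10% relative reduction
treated, arr, nnt = absolute_from_relative(0.30, 0.10)
print(f"Risk among breakfast eaters: {treated:.0%}")   # 27%
print(f"Absolute risk reduction:     {arr:.1%}")       # 3.0%
print(f"Number needed to treat:      {nnt:.0f}")       # 33
```

The number needed to treat makes the modesty concrete: about 33 people would have to eat breakfast for one additional person to avoid obesity, assuming the association were causal in the first place.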
How to Read a Results Section
When reading scientific papers' Results sections:
- Look at effect sizes, not just p-values.
- Check confidence intervals for both statistical significance and precision.
- Look at the actual numbers, not just the description ("significantly higher" means what, exactly?).
- Check sample sizes — they determine the power to detect effects.
- Count how many outcomes were tested. Multiple comparisons inflate Type I error.
- Look for subgroup analyses — these are exploratory, not confirmatory.
Section 26.5: Confounding, Causation, and Correlation
The Correlation-Causation Fallacy
"Correlation does not equal causation" is one of the most important principles in scientific thinking, and one of the most widely ignored in science journalism. A correlation between A and B can arise in four general ways:
- A causes B (the direction claimed in popular reporting)
- B causes A (reverse causation)
- C causes both A and B (confounding variable C)
- Pure coincidence (spurious correlation)
Organic food consumption has been correlated with autism diagnosis rates over time — both have risen. But both track common background factors, such as economic development and the expansion of diagnostic criteria. Neither causes the other.
Confounding Variables
A confounding variable is a variable that is associated with both the exposure of interest and the outcome, and that therefore creates a spurious association between exposure and outcome (or masks a real one).
Classic example: Studies have found that carrying a lighter is associated with higher rates of lung cancer. Does carrying a lighter cause lung cancer? No — smoking, the confounder, causes both lighter-carrying and lung cancer.
In nutritional epidemiology (the source of most "miracle food" news, discussed in Section 26.7), confounding is endemic because people who eat certain foods also differ in many other health-relevant ways: their overall dietary patterns, exercise habits, socioeconomic status, education, access to healthcare, and dozens of other factors. Controlling for confounders in observational studies is difficult and never complete.
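The lighter/lung-cancer example can be simulated to show how a confounder manufactures an association out of nothing. All probabilities below are invented for illustration; the key design choice is that carrying a lighter has no effect on cancer at all, yet an association appears, and stratifying on the confounder makes it vanish.

```python
import random

random.seed(1)

people = []
for _ in range(100_000):
    smoker = random.random() < 0.3
    # Smoking drives both lighter-carrying and cancer risk;
    # the lighter itself does nothing.
    lighter = random.random() < (0.8 if smoker else 0.05)
    cancer = random.random() < (0.15 if smoker else 0.01)
    people.append((smoker, lighter, cancer))

def cancer_rate(group):
    return sum(c for _, _, c in group) / len(group)

with_lighter = [p for p in people if p[1]]
without = [p for p in people if not p[1]]
print(f"Cancer rate, lighter carriers: {cancer_rate(with_lighter):.3f}")
print(f"Cancer rate, non-carriers:     {cancer_rate(without):.3f}")

# Stratify on the confounder: within smokers, the association disappears
smokers = [p for p in people if p[0]]
print(f"Smokers with lighter:    {cancer_rate([p for p in smokers if p[1]]):.3f}")
print(f"Smokers without lighter: {cancer_rate([p for p in smokers if not p[1]]):.3f}")
```

The unstratified comparison shows lighter carriers with several times the cancer rate of non-carriers; holding smoking fixed, the difference collapses to noise.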
The Bradford Hill Criteria
In 1965, epidemiologist Austin Bradford Hill proposed a set of criteria for evaluating causal claims from epidemiological data. These criteria do not determine causation definitively, but they help assess the plausibility of causal relationships:
- Strength: Larger associations are harder to explain by confounding alone. (RR > 2 provides stronger causal evidence than RR of 1.1.)
- Consistency: Has the association been replicated by different researchers in different populations using different methods?
- Specificity: Is the exposure associated specifically with the outcome, or with many different outcomes? (Specificity is less decisive than other criteria.)
- Temporality: Does exposure precede the outcome? This is the only criterion that is logically necessary for causation.
- Biological gradient (dose-response): Do higher levels of exposure produce larger effects? A dose-response relationship strengthens causal inference.
- Plausibility: Is there a plausible biological mechanism? Plausibility is constrained by current knowledge — implausibility based on unknown mechanisms is weak evidence against causation.
- Coherence: Does the causal interpretation fit with what is known about the biology, natural history, and epidemiology of the disease?
- Experiment: Can we intervene on the exposure and change the outcome? Quasi-experimental evidence (natural experiments, Mendelian randomization) provides stronger causal evidence than observational studies.
- Analogy: Are similar causal relationships known? (Weak criterion — analogies can be misleading.)
When evaluating a causal claim in science news, asking how many Bradford Hill criteria are met provides a structured way to assess the strength of causal evidence.
Simpson's Paradox
Simpson's Paradox is a phenomenon where a trend present in several groups of data reverses or disappears when the groups are combined. It arises from confounding by a variable that affects both the grouping variable and the outcome.
Classic example: A hospital study finds that Hospital A has a higher overall survival rate than Hospital B. But when patients are stratified by disease severity, Hospital B has better survival rates for both mild and severe cases. The paradox arises because Hospital B treats more severe cases, which have lower survival rates regardless of hospital quality. The overall comparison is confounded by disease severity.
Simpson's Paradox appears in many real-world contexts — the UC Berkeley gender bias study, clinical trial subgroup analyses, immigration statistics, and sports performance comparisons. It illustrates why aggregate statistics require careful examination of confounding structure.
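The hospital example can be reproduced with made-up counts chosen purely to exhibit the reversal: B wins within every severity stratum, yet A wins overall because B's caseload is dominated by severe patients.

```python
# Hypothetical counts: (hospital, severity) -> (survivors, patients)
data = {
    ("A", "mild"):   (870, 900),   # 96.7% survive
    ("A", "severe"): ( 30, 100),   # 30.0% survive
    ("B", "mild"):   ( 98, 100),   # 98.0% survive
    ("B", "severe"): (315, 900),   # 35.0% survive
}

def rate(pairs):
    survived = sum(s for s, _ in pairs)
    total = sum(n for _, n in pairs)
    return survived / total

for hosp in ("A", "B"):
    overall = rate([v for (h, _), v in data.items() if h == hosp])
    print(f"Hospital {hosp} overall survival: {overall:.1%}")

# Within each stratum B beats A (98.0% > 96.7%, 35.0% > 30.0%),
# yet A's overall rate (90.0%) beats B's (41.3%): the aggregate
# comparison is confounded by case mix.
```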
Section 26.6: The Replication Crisis
What Is the Replication Crisis?
The replication crisis is the ongoing recognition, beginning around 2011-2015, that a surprisingly large proportion of published scientific findings — particularly in psychology, medicine, nutrition, and economics — fail to replicate when independently tested. The Reproducibility Project: Psychology (Open Science Collaboration, 2015), which attempted to replicate 100 prominent psychology studies, found that only about 39% were judged to have replicated, and the replication effect sizes were on average roughly half those originally reported. Similar replication failures have been documented across biomedicine, cancer biology, and other fields.
The Mechanisms of the Crisis
Publication Bias: Journals prefer to publish positive results — studies that find significant effects. Studies that find null results (no effect) are less likely to be submitted or accepted. This creates a file-drawer problem: null results sit unpublished in researchers' file drawers while significant results are published. The published literature then overrepresents positive findings, creating a misleadingly optimistic picture of the evidence.
P-Hacking (Data Dredging): Researchers analyze data in multiple ways — trying different covariates, exclusion criteria, outcome measures, or subgroup definitions — until a statistically significant result emerges. Since any analytical choice that produces p < 0.05 may be reported while others are silently discarded, the probability that the published result is a false positive is much higher than the nominal 5%.
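The inflation from trying many analyses and reporting any hit can be simulated directly. This stdlib sketch models one simple form of p-hacking: testing 20 independent null outcomes and declaring success if any clears p < 0.05. The test statistic, outcome count, and seed are illustrative choices.

```python
import math
import random

random.seed(7)

def z_p(sample_a, sample_b):
    """Two-sided p-value for a difference of means, sigma = 1 known."""
    n = len(sample_a)
    z = ((sum(sample_a) - sum(sample_b)) / n) / math.sqrt(2.0 / n)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def study_finds_something(n_outcomes):
    """Test n_outcomes independent null outcomes; 'succeed' if ANY p < .05."""
    for _ in range(n_outcomes):
        a = [random.gauss(0, 1) for _ in range(30)]
        b = [random.gauss(0, 1) for _ in range(30)]
        if z_p(a, b) < 0.05:
            return True
    return False

trials = 2_000
hits = sum(study_finds_something(20) for _ in range(trials))
print(f"Studies reporting a 'significant' finding under the null: {hits / trials:.2f}")
```

With 20 shots at significance, the chance of at least one false positive is 1 - 0.95^20, about 64%, not the nominal 5% that the published p-value implies.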
HARKing (Hypothesizing After Results are Known): Researchers form their hypothesis after seeing the data — but present it as if the hypothesis preceded the analysis. This turns exploratory analysis into the appearance of confirmatory analysis. A hypothesis designed to fit the data it was derived from will appear to be supported by that data even if the relationship is spurious.
Underpowered Studies: Many studies in psychology and medicine are too small to reliably detect the true effect size. When a study is underpowered, the results are highly variable, and any significant results that emerge are likely to be overestimates of the true effect (the "winner's curse"). Underpowered studies that by chance produce significant results are particularly vulnerable to non-replication.
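The winner's curse can likewise be simulated: with a small true effect and small samples, only inflated estimates are extreme enough to clear the significance bar. The true effect size, sample size, and seed below are invented for illustration.

```python
import math
import random

random.seed(3)
TRUE_EFFECT, N = 0.2, 20   # small true effect, small samples: low power

significant_effects = []
for _ in range(20_000):
    a = [random.gauss(TRUE_EFFECT, 1) for _ in range(N)]
    b = [random.gauss(0, 1) for _ in range(N)]
    diff = (sum(a) - sum(b)) / N
    se = math.sqrt(2.0 / N)               # known-sigma standard error
    z = diff / se
    p = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    if p < 0.05:
        significant_effects.append(diff)  # keep only "publishable" results

mean_sig = sum(significant_effects) / len(significant_effects)
print(f"True effect: {TRUE_EFFECT}")
print(f"Mean effect among significant results: {mean_sig:.2f}")
```

The mean of the significant results lands far above the true effect of 0.2: an underpowered literature that publishes only significant findings systematically overstates effect sizes, even with no fraud anywhere.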
Flexibility in Data Analysis: Researchers have enormous flexibility in how they collect, exclude, and analyze data. Simmons, Nelson, and Simonsohn (2011) showed that seemingly reasonable analytical choices could dramatically inflate Type I error rates — from the nominal 5% to over 60% in some scenarios.
Motivated Reasoning: Scientists are human. They are motivated to find significant results (for publication, grants, career advancement), and this motivation subtly influences countless small decisions in the research process — how long to run a trial, which covariates to include, how to handle outliers, when to stop data collection.
Key Cases of Replication Failure
Social priming: Studies suggesting that brief exposure to words or concepts (e.g., words associated with elderly people) could automatically affect behavior (e.g., walking speed) failed to replicate in multiple attempts. The work of John Bargh, whose priming studies were foundational to this research program, came under particular scrutiny.
Ego depletion: The hypothesis that willpower is a limited resource that gets depleted with use (analogous to a muscle that fatigues) was supported by hundreds of studies. A large pre-registered multi-lab replication found a near-zero overall effect, raising questions about whether ego depletion is a real phenomenon.
Power posing: Amy Cuddy's claim that adopting "power poses" (expansive, dominant body postures) for two minutes increases testosterone and risk-taking failed to replicate in multiple independent attempts, including by Cuddy's own co-author.
Cancer biology: The Reproducibility Project: Cancer Biology found that fewer than half of key results from prominent cancer biology papers could be replicated, often with effect sizes far smaller than originally reported.
Reforms in Response to the Crisis
Pre-registration: Researchers submit their hypotheses, methods, and analysis plans to a public registry (e.g., ClinicalTrials.gov, the Open Science Framework) before data collection begins. This prevents HARKing and limits p-hacking by making the exploratory vs. confirmatory distinction explicit and public.
Registered Reports: A publication format in which journals commit to publish a study based on the quality of the design, before results are known. This directly counters publication bias.
Open Data and Open Materials: Researchers share their raw data and analysis code, allowing independent verification. Data availability dramatically aids replication efforts and can catch errors.
Pre-registration of systematic reviews: PROSPERO and similar registries allow prospective registration of systematic review protocols.
Bayesian approaches: Some statisticians advocate replacing or supplementing p-values with Bayesian methods that quantify evidence in more interpretable terms.
Multi-site replications: Large-scale collaborative projects that replicate findings across many labs simultaneously (e.g., ManyLabs, ManyBabies, the Psychological Science Accelerator).
What the Crisis Means for Media Literacy
The replication crisis has important implications for how we evaluate science news:
- Single studies — even published in prestigious journals — are not reliable guides to truth.
- A finding becomes meaningful only after consistent replication across independent labs.
- Effect sizes reported in initial studies are often overestimates.
- Press releases and media coverage rarely wait for replication.
- The most dramatic-sounding findings (because they are most publishable) are often the least reliable.
Section 26.7: Evaluating Scientific Claims in Media
The Translation Problem
Scientific findings undergo multiple translations between the research paper and the public:
- Researcher's finding → Press release (often written by university communications office)
- Press release → Journalist's story (often written under time pressure with limited expertise)
- Journalist's story → Headline (often written by an editor who didn't read the article)
- Headline → Social media share (often shared by people who read only the headline)
Each translation introduces potential distortion. A study's carefully qualified finding ("In this population, under these conditions, we found a statistically significant association that requires replication") becomes "Scientists PROVE coffee prevents Alzheimer's disease."
Common Translation Errors
Correlation reported as causation: The study found an association; the headline announces a cause.
Relative risk amplification: A 50% reduction sounds dramatic; if the baseline risk is 0.1%, the absolute reduction is 0.05 percentage points.
Extrapolating from animal models: A study in mice becomes "breakthrough cure for cancer" despite the vast majority of mouse model results not translating to humans.
Extrapolating from cell cultures: Even more distant from human relevance than animal studies.
Ignoring confidence intervals: The study found X effect (95% CI: -Y to +Z); the headline reports the point estimate X without acknowledging that the true effect may be zero or negative.
Single study overemphasis: A single study contradicts the established consensus; headline: "New study overturns conventional wisdom on [X]."
Press release science: Studies are published whose primary purpose is generating favorable press coverage for the institution, product, or intervention — rather than genuine scientific inquiry.
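The relative risk amplification described above lends itself to a quick calculation. The sketch below uses the hypothetical numbers from the text (a 50% relative reduction on a 0.1% baseline risk); the function name is ours, invented for illustration:

```python
# Convert a headline-friendly relative risk reduction into absolute terms.
# Numbers mirror the hypothetical example in the text: a "50% reduction"
# applied to a 0.1% baseline risk.

def absolute_risk_reduction(baseline_risk, relative_reduction):
    """Absolute risk reduction implied by a relative reduction
    applied to a given baseline risk (both as fractions)."""
    treated_risk = baseline_risk * (1 - relative_reduction)
    return baseline_risk - treated_risk

baseline = 0.001  # 0.1% baseline risk
arr = absolute_risk_reduction(baseline, 0.50)
print(f"Absolute reduction: {arr * 100:.3f} percentage points")  # 0.050
print(f"Number needed to treat: {1 / arr:.0f}")                  # 2000
```

The "number needed to treat" (NNT) makes the same point in clinical terms: 2,000 people would need the intervention for one to benefit, which sounds very different from "cuts risk in half."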
The "Miracle Food" Pattern
Nutrition news is particularly prone to a characteristic pattern:
- A large epidemiological study (typically observational) finds a correlation between consumption of food X and lower rates of condition Y.
- The association is statistically significant but modest (RR = 0.85 or similar).
- Confounders (people who eat food X may be healthier in many other ways) are acknowledged in the paper but omitted from the press release.
- The press release announces that "X reduces risk of Y."
- Headlines declare "Scientists prove X prevents Y" or "X is a superfood."
- Further studies fail to replicate, or a meta-analysis finds a much smaller effect.
- The cycle repeats with a different food.
John Ioannidis, among others, has argued that most nutritional epidemiology findings are unreliable because of the combination of weak effect sizes, massive confounding, unreliable dietary measurement methods (food frequency questionnaires), and publication bias.
The 10x Exaggeration Pattern
Research by Ben Goldacre and others suggests that media coverage systematically exaggerates effect sizes, sometimes by a factor of ten or more. In headlines, a 10% relative risk reduction can become a claim that an intervention is "10 times more effective." Basic numeracy is required to recognize this pattern.
How to Read a Science News Story
Questions to ask:
- What type of study was this? (RCT, observational, animal model, cell culture?)
- How large was the sample?
- What was the actual effect size, not just whether it was "significant"?
- Has this been replicated?
- Does this contradict established consensus, or add to a consistent pattern?
- Who funded the study?
- What do independent experts say about it?
- Is there a link to the actual paper?
Section 26.8: Consensus vs. Frontier
What Scientific Consensus Means
Scientific consensus is not a vote or a matter of opinion. It reflects the convergent judgment of the relevant expert community based on accumulated evidence: the position that remains most strongly supported after critical scrutiny and replication.
Distinguishing scientific consensus from active frontier debate is a critical media literacy skill.
Consensus (not legitimately in scientific dispute):
- The Earth is approximately 4.5 billion years old
- Evolution by natural selection explains the diversity of life
- Human activity is causing the current climate warming trend
- Vaccines are safe and effective; they do not cause autism
- HIV causes AIDS
- The universe began approximately 13.8 billion years ago
Active frontier debate (genuine scientific controversy):
- The exact value of the Hubble constant (measuring the universe's expansion rate)
- The specific mechanisms of Alzheimer's disease
- The long-term cardiovascular effects of low-carbohydrate diets
- The optimal duration and timing of pre-surgical fasting
- The relative contributions of genes and environment to specific complex traits
The difference matters because misinformation systematically conflates these categories — presenting consensus positions as if they were contested frontier issues.
How to Identify Scientific Consensus
Systematic reviews and meta-analyses: These are the most reliable guides to where the evidence points. Cochrane Reviews are particularly trusted in medicine.
Position statements of major scientific organizations: When the American Medical Association, the British Medical Association, the National Academies of Sciences, and equivalent bodies in multiple countries all agree, that agreement is a strong consensus signal.
The Cook et al. method: Surveys of actively publishing researchers in the relevant field, or analysis of peer-reviewed literature, provide quantitative estimates of consensus. Studies on climate change consensus find 97%+ agreement on basic anthropogenic warming; similar surveys find comparable consensus on vaccine safety and evolution.
Independence of convergent evidence: If multiple independent lines of evidence — from different methodologies, different research groups, different countries — converge on the same conclusion, that convergence is strong evidence of reliability.
The Manufactured Controversy
Industry and ideological interests have developed sophisticated techniques for creating the impression of scientific controversy where genuine consensus exists. Key tactics include:
- Highlighting dissenting scientists (who may have funding ties to industry)
- Creating "institutes" and "centers" with scientific-sounding names that produce industry-favorable "research"
- Demanding balance in media coverage, implying that minority views deserve equal time
- Pointing to uncertainty ranges as if they negate the central estimate
- Misrepresenting normal scientific revision as evidence of unreliability
Recognizing manufactured controversy requires checking: Who are the dissenters? What are their credentials in the relevant field? Who funds them? Does the dissent appear in peer-reviewed literature or only in policy documents and popular media?
Section 26.9: Applied Scientific Thinking
The Scientific Mindset Outside the Lab
Scientific thinking is not confined to laboratories. The habits of scientific reasoning — empiricism, testability, openness to evidence, calibrated uncertainty — are applicable to everyday decision-making and information evaluation.
Calibrated uncertainty: Being appropriately uncertain (neither more confident nor more skeptical than the evidence warrants) is a central scientific virtue. It means saying "the evidence suggests X is probably true, but I'm open to revision" rather than "X is certainly true" or "nothing is ever certain."
Actively seeking disconfirmation: Confirmation bias is the tendency to seek out information that confirms existing beliefs. Scientific thinking means actively seeking out disconfirming evidence — looking for the best arguments against your current position.
Distinguishing personal experience from statistical evidence: Personal experience is a legitimate data point, but it cannot override large, well-designed studies that account for confounders and selection effects. The brain's tendency to weight vivid personal experience over abstract statistics (the availability heuristic) is a systematic error that scientific thinking corrects.
Base rate thinking: Before evaluating a specific claim, consider the prior probability. If a claim is that a common substance causes a rare disease, consider: how many such claims are tested each year? If most of them are false, even a "significant" result may be more likely a false positive than a true positive.
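The base rate reasoning above can be made concrete with a short calculation. The numbers below (a 1% prior, 80% statistical power, 5% false positive rate) are illustrative assumptions, not figures from the text:

```python
# Base-rate sketch: if only 1 in 100 tested claims of some kind is true,
# what fraction of "statistically significant" results are true positives?
# Assumes 80% power and a 5% false positive rate (illustrative values).

def prob_true_given_significant(prior, power=0.80, alpha=0.05):
    """P(claim is true | significant result), by Bayes' theorem."""
    true_positives = prior * power          # true claims that test positive
    false_positives = (1 - prior) * alpha   # false claims that test positive
    return true_positives / (true_positives + false_positives)

p = prob_true_given_significant(prior=0.01)
print(f"P(claim true | significant result) = {p:.2f}")  # ≈ 0.14
```

Under these assumptions, roughly six out of seven "significant" results would be false positives, which is exactly the point: a low prior can make even a significant result more likely false than true.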
Considering alternative hypotheses: When a pattern or event is observed, scientific thinking asks: what are the alternative explanations? Which one is most parsimonious? What evidence would distinguish between them?
The Public Understanding of Science
The "deficit model" of science communication — the idea that public science skepticism results primarily from lack of information — has been largely discredited. Research shows that increased scientific knowledge can sometimes increase rather than decrease polarization on politically charged science questions (the cultural cognition phenomenon, studied by Dan Kahan and colleagues).
This finding has important implications: effective science communication is not just about providing more information. It requires attending to the social and identity dimensions of belief, building trust, and communicating in ways that respect the audience's values and epistemic autonomy.
Callout Box: Reading a Results Section
When encountering a scientific paper's results section:
Step 1: Identify the primary outcome and how it was measured.
Step 2: Find the effect size. Don't stop at "significant" — find the actual number.
Step 3: Find the confidence interval. How wide is it? Does it suggest practical significance?
Step 4: Find the sample size. Is it large enough to be reliable?
Step 5: Count the comparisons. Were many outcomes tested? Multiple comparisons inflate false positive rates.
Step 6: Look for pre-registration. Was the analysis plan specified before data collection?
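Step 5's warning about multiple comparisons follows from simple probability: if all null hypotheses are true, the chance of at least one false positive across m independent tests at alpha = 0.05 is 1 - (1 - alpha)^m. A quick sketch:

```python
# Familywise error rate: probability of at least one false positive
# across m independent tests when every null hypothesis is true.

def familywise_error_rate(m, alpha=0.05):
    return 1 - (1 - alpha) ** m

for m in (1, 5, 20):
    print(f"{m:2d} tests -> P(at least one false positive) = "
          f"{familywise_error_rate(m):.2f}")
# With 20 outcomes tested, the chance of a spurious "significant"
# finding rises from 5% to roughly 64%.
```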
Callout Box: The Bayesian Perspective
From a Bayesian perspective, the impact of new evidence on our beliefs depends not just on the evidence itself but on our prior probability — how likely we thought the claim was before seeing the evidence. A positive result from a poorly designed study should update our beliefs less than a positive result from a well-designed RCT. And a positive result for a claim that contradicts well-established science should update our beliefs less than a positive result for a claim consistent with established science.
Frequentist statistics (p-values) do not incorporate prior probability. The Bayesian framework makes this element of reasoning explicit and formal. This is why a single "significant" result for a highly implausible claim (e.g., a homeopathic remedy diluted to 10^-23) should not move our beliefs as much as the p-value might suggest.
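The Bayesian point can be illustrated numerically. Under a common simplification (the likelihood ratio of a significant result is roughly power divided by alpha), the same evidence moves belief in a plausible claim far more than belief in an implausible one. The priors and power values below are illustrative assumptions:

```python
# Bayesian update sketch: how much should one "significant" result
# (assumed 80% power, alpha = 0.05) move our belief in a claim?

def posterior(prior, power=0.80, alpha=0.05):
    """Posterior probability the claim is true after one significant result."""
    numerator = prior * power
    return numerator / (numerator + (1 - prior) * alpha)

# A plausible claim vs. a highly implausible one (e.g., homeopathy):
print(f"Prior 0.50  -> posterior = {posterior(0.50):.2f}")   # ≈ 0.94
print(f"Prior 0.001 -> posterior = {posterior(0.001):.2f}")  # ≈ 0.02
```

The identical p-value leaves the implausible claim almost certainly false, while making the plausible one very likely true, which is the formal version of "extraordinary claims require extraordinary evidence."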
Key Terms
Falsifiability: The property of scientific claims that makes it possible, in principle, to observe something that would show the claim to be false.
Evidence pyramid: A hierarchy of study designs from weakest (anecdote) to strongest (meta-analysis) in terms of the reliability of causal evidence they can provide.
Randomized controlled trial (RCT): A study design in which participants are randomly assigned to intervention or control groups, allowing causal inference.
Peer review: The process by which scientific manuscripts are evaluated by independent experts before publication.
P-value: The probability of observing results at least as extreme as obtained, under the null hypothesis; not the probability that results are due to chance.
Confidence interval: A range of values consistent with the data; conveys both statistical significance and precision of estimation.
Effect size: The magnitude of a relationship or difference, independent of sample size.
Confounding variable: A variable associated with both exposure and outcome that creates spurious or masked associations.
Bradford Hill criteria: A set of criteria for evaluating causal claims from epidemiological data.
Publication bias: The tendency for journals and researchers to preferentially report positive results over null results.
P-hacking: Trying multiple analyses until a significant result is found; inflates the false positive rate.
HARKing: Hypothesizing After Results are Known; presenting post-hoc hypotheses as if they were pre-specified.
Pre-registration: Pre-specifying hypotheses, methods, and analysis plans before data collection, to distinguish confirmatory from exploratory research.
Replication crisis: The ongoing recognition that many published scientific findings fail independent replication.
Scientific consensus: The convergent judgment of the relevant expert community based on accumulated evidence.
Discussion Questions
- Popper's falsifiability criterion implies that no scientific theory can be proven true — only false. Does this mean scientific theories are merely unrefuted hypotheses, with no stronger status? What would it mean to have high confidence in a theory given this framework?
- A major RCT finds that a drug significantly reduces heart attack risk (p = 0.03, NNT = 200). A companion editorial says the result is "promising but requires replication." A patient advocacy group announces "breakthrough cure." How should patients and physicians respond?
- The replication crisis revealed systematic problems in how science is practiced. Does this support the position of those who distrust science? Why or why not? What distinction is important here?
- Nutritional epidemiology generates enormous amounts of media coverage while being one of the least reliable scientific fields. Who bears responsibility for this gap? Researchers? Journals? Universities? Journalists? Readers?
- How should public policy relate to scientific uncertainty? Does the existence of uncertainty (which is always present) provide grounds for inaction on issues like climate change or air quality regulation?
- Scientific consensus exists on evolution, climate change, and vaccine safety, yet significant proportions of the public disagree. If the deficit model (lack of information causes public skepticism) is wrong, what does explain this skepticism, and what follows for science communication?
- A pharmaceutical company funds a study that finds their drug is effective. The study is well-designed and published in a peer-reviewed journal. How should this funding source affect your assessment of the evidence? What additional information would you want?
Chapter 26 continues in Exercises, Quiz, and Case Studies.