Case Study 21-2: The Replication Crisis in Social Psychology — What It Means for Misinformation Research
Overview
This textbook cites dozens of studies from social psychology, cognitive psychology, and behavioral science in building its arguments about how misinformation spreads, why people believe false claims, and what interventions can reduce belief in misinformation. Many of these studies come from fields at the center of what researchers have called the "replication crisis" — a period of reckoning, beginning roughly in 2011 and continuing today, in which it became clear that a substantial fraction of published findings in psychology could not be replicated when tested independently.
This case study examines the replication crisis: what it is, how it happened, what specific findings relevant to misinformation research have been challenged, and what it means for how we should use social science research. It does not conclude that social psychology is worthless or that everything is uncertain — such a conclusion would itself be an overcorrection. It argues instead for calibrated skepticism: a set of criteria for distinguishing more from less reliable findings, and an intellectual honesty about what the evidence does and does not establish.
1. The Origins of the Crisis
1.1 The 2011 Turning Point
The replication crisis in psychology is often dated to two events in 2011:
Daryl Bem's "Feeling the Future" (Journal of Personality and Social Psychology, 2011) reported nine experiments that appeared to demonstrate precognition — the ability to sense future events. The paper was methodologically unremarkable: it used standard procedures from established social psychology paradigms and reported results with p < 0.05 across multiple studies. The problem was that its conclusions were physically impossible. The paper forced the field to confront an uncomfortable question: if standard methodology could produce apparently significant evidence for ESP, what else might it be producing significant evidence for that was merely illusory?
Diederik Stapel's fraud case (2011) revealed that a prominent Dutch social psychologist had fabricated data for dozens of studies over many years, producing findings that had been published in top journals and widely cited. The fabrication was undetected for years because the fabricated data were too clean — they showed the expected patterns without the noise that real data always contain. The Stapel case revealed how weak the error-correction mechanisms in social psychology were.
1.2 The Machinery of False Positives
The replication crisis was not primarily caused by fraud (which was always rare). It was caused by institutional structures and research practices that systematically generated false positives even from honest, well-intentioned researchers.
Publication bias — the tendency of journals to publish positive, significant results while rejecting null results — meant that the published literature was a selected sample biased toward false positives. If ten research teams independently study a hypothesis, and by chance one finds p < 0.05 while nine find null results, the null results go unpublished. The published literature shows a significant finding that is actually a chance artifact.
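The arithmetic of this selection effect can be simulated directly. The sketch below (illustrative parameters, not drawn from any cited study) has ten teams independently test each of 2,000 hypotheses where the true effect is exactly zero; a hypothesis "enters the literature" if any one team reaches p < 0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 2,000 true-null hypotheses, each tested by 10 independent teams
# with n = 40 participants per group.
n_hyp, n_teams, n = 2_000, 10, 40
a = rng.normal(size=(n_hyp, n_teams, n))
b = rng.normal(size=(n_hyp, n_teams, n))
p = stats.ttest_ind(a, b, axis=2).pvalue      # shape (n_hyp, n_teams)

# A hypothesis is "published" if at least one team finds p < .05.
rate = (p < 0.05).any(axis=1).mean()
print(f"fraction of null hypotheses with a publishable result: {rate:.2f}")
# Expected value: 1 - 0.95**10, roughly 0.40
```

Even with every individual test honestly run at alpha = 0.05, selecting the one significant result out of ten attempts makes about 40% of purely null hypotheses appear supported in the published record.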
Researcher degrees of freedom — the legitimate choices researchers make about data collection, analysis, and reporting — create opportunities for inflation of false positive rates. Joseph Simmons, Leif Nelson, and Uri Simonsohn demonstrated in a 2011 paper in Psychological Science that these degrees of freedom, used flexibly, could reliably produce p < 0.05 even from completely random data. Their list of problematic (but common) practices included:
- Collecting data until p < 0.05, then stopping
- Excluding outliers after examining their effect on the p-value
- Reporting only the dependent variables that showed significant effects
- Controlling for covariates only when they improve the result
- Analyzing subgroups and reporting only significant ones
None of these practices is necessarily conscious fraud. Each can be rationalized as a legitimate methodological choice. The problem is that when choices are made based on which result they produce, rather than based on principled a priori rationale, the nominal p-value no longer controls the false positive rate.
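The first practice on the list — collecting data until p < 0.05, then stopping — can be sketched as a simulation. The parameters below are illustrative (they are not taken from the Simmons et al. paper): both groups are drawn from the same null distribution, yet stopping at the first significant peek inflates the false positive rate well above the nominal 5%:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def optional_stopping(n_start=20, n_max=100, step=5, alpha=0.05):
    """Test after every `step` added participants per group and stop
    at the first p < alpha. Both groups are pure noise (true null)."""
    a = list(rng.normal(size=n_start))
    b = list(rng.normal(size=n_start))
    while len(a) <= n_max:
        if stats.ttest_ind(a, b).pvalue < alpha:
            return True            # a "significant" finding from noise
        a.extend(rng.normal(size=step))
        b.extend(rng.normal(size=step))
    return False

trials = 2_000
hits = sum(optional_stopping() for _ in range(trials)) / trials
print(f"false positive rate with optional stopping: {hits:.3f}")
```

Because each interim look is another chance to cross the significance threshold, the realized false positive rate here lands well above 5%, even though every individual test is computed correctly.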
Small sample sizes compounded these problems. Studies with n = 30–60 per condition — typical for lab-based social psychology — have low statistical power. When a study is underpowered, it is less likely to detect real effects, but a result that does emerge is more likely to be inflated (the "winner's curse" of small studies). The combination of low power and flexible analysis procedures is particularly dangerous.
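The winner's curse can also be made concrete. In this sketch (illustrative parameters), a real but modest effect of d = 0.3 is studied repeatedly with n = 30 per group; the minority of studies that happen to reach significance report effect sizes roughly double the truth:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

true_d, n, n_studies = 0.3, 30, 20_000   # modest real effect, small samples
a = rng.normal(true_d, 1.0, size=(n_studies, n))
b = rng.normal(0.0, 1.0, size=(n_studies, n))
p = stats.ttest_ind(a, b, axis=1).pvalue

# Cohen's d as estimated by each individual study
pooled_sd = np.sqrt((a.var(axis=1, ddof=1) + b.var(axis=1, ddof=1)) / 2)
d_hat = (a.mean(axis=1) - b.mean(axis=1)) / pooled_sd

sig = p < 0.05
print(f"power: {sig.mean():.2f}")                      # well below 50%
print(f"mean d among significant studies: {d_hat[sig].mean():.2f}")
# The significant subset substantially overstates the true d of 0.3
```

An underpowered study can only reach significance when sampling noise happens to exaggerate the effect, so the published (significant) estimates are systematically inflated.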
2. The Reproducibility Project
The most systematic examination of the replication crisis was the Reproducibility Project: Psychology (RPP), coordinated by Brian Nosek and reported in Science in 2015 under the title "Estimating the Reproducibility of Psychological Science."
The RPP recruited 270 researchers to replicate 100 published psychology studies, chosen to represent a range of subfields and journals. The replications used methods as close to the original studies as possible, typically with larger sample sizes (to ensure adequate statistical power), and each replication was pre-registered — the protocol was filed in advance, preventing post-hoc changes.
Key findings:
- 97% of original studies had reported significant results (p < 0.05)
- Only 36–39% of replication studies achieved significance at p < 0.05 in the same direction
- The average effect size in replications was roughly half the average effect size in original studies
- Cognitive psychology studies replicated at higher rates (~50%) than social psychology studies (~25%)
- Studies with larger original effect sizes replicated more reliably
The interpretation of these findings was immediately contested. The RPP authors were careful to note that a failed replication is not proof that the original finding was false — there are legitimate reasons why a well-designed replication might not reproduce an original result (different populations, different cultural contexts, changes in the phenomenon over time). But the systematic pattern — fewer than half of findings replicating, and effect sizes halving on average — could not be explained entirely by these legitimate factors. The most parsimonious explanation was that a substantial fraction of published social psychology findings reflected inflated or false effects generated by the practices described above.
3. Specific Findings Relevant to Misinformation Research
Several findings that bear on misinformation, persuasion, and cognitive psychology have been challenged or failed to replicate. Understanding these specific cases is essential for this textbook's intellectual honesty.
3.1 Ego Depletion / Willpower as a Limited Resource
Roy Baumeister and colleagues developed "ego depletion theory" over many studies, arguing that self-control draws on a limited cognitive resource — a kind of mental energy that is depleted by use. This theory was cited extensively in popular media and policy discussions about cognitive performance, decision fatigue, and behavior change. Relevant to misinformation: if cognitive resources are limited, distracted or tired individuals might be more susceptible to misleading claims.
A multi-site pre-registered replication attempt by Hagger et al. (2016), involving 23 laboratories and over 2,000 participants, found essentially no support for the basic ego depletion effect. The original findings appear to reflect some combination of p-hacking, publication bias, and possibly laboratory-specific conditions that do not generalize.
Implication for this book: Claims that media consumption fatigue reduces critical evaluation of information should be treated with significant caution if they rely on ego depletion theory as their mechanism.
3.2 Power Posing and Physiological Effects
Dana Carney, Amy Cuddy, and Andy Yap published a study in 2010 reporting that "power poses" — expansive, open body postures — increased testosterone, decreased cortisol, and increased risk tolerance. This claim was widely disseminated through a TED Talk (one of the most-watched of all time) and popular media. Its relevance to misinformation research seemed plausible: if posture could boost confidence and persuasive presence, it might matter for how people hold up when their claims are challenged.
Eva Ranehill and colleagues (2015) conducted a pre-registered replication with n = 200 and found no effect of power posing on testosterone or cortisol. Subsequent large-scale replications confirmed no robust physiological effects. Dana Carney, the first author of the original study, subsequently issued a public statement disavowing the power-pose findings and detailing the flexible analysis choices that likely produced the original result.
Implication: This case illustrates how a single study with flexible analysis can produce a finding that spreads virally through popular media and becomes deeply embedded in cultural consciousness before replication evidence can correct it — a meta-level misinformation problem about research itself.
3.3 Priming Effects: Social Priming and Its Problems
Social priming research — showing that brief, subtle exposures to concepts can significantly change behavior — was among the most contentious areas of replication failure. Studies showing that reading words associated with old age makes people walk more slowly, or that exposure to the American flag shifts political attitudes, or that brief subliminal exposures change complex social behaviors, have largely failed rigorous replication.
This matters for misinformation research because priming is often cited as a mechanism by which exposure to false information (even briefly or incidentally) leaves lasting traces on belief. If priming effects are smaller and more fragile than the original literature suggested, some of the research on "continued influence effects" and "backfire effects" may also be overstated.
3.4 The Backfire Effect: A Partial Retraction
Brendan Nyhan and Jason Reifler's (2010) "backfire effect" — the finding that corrections to misinformation sometimes cause people to believe the false information more strongly — was widely cited in popular and academic discussions of fact-checking. This textbook references it as a reason why corrections must be carefully designed.
Subsequent research, including Nyhan and Reifler's own subsequent work and large-scale pre-registered replications, has substantially qualified this finding. The backfire effect appears to be rare, condition-specific, and not the robust phenomenon the initial studies suggested. Corrections generally do reduce belief in false claims; in some cases, people resist corrections, but entrenchment of the original false belief after correction appears uncommon.
Implication: The evidence that corrections can "backfire" is much weaker than early reporting suggested. This is actually partially good news — corrections work more reliably than many feared. But it means discussions of "the backfire effect" as an established phenomenon require careful qualification.
3.5 What Has Held Up
Importantly, not all relevant social psychology has been challenged. Several findings central to understanding misinformation have survived rigorous replication testing:
- Confirmation bias: The tendency to evaluate evidence in ways that favor prior beliefs is robust across many paradigms and populations.
- Motivated reasoning: People's reasoning about factual questions is influenced by their identity, group membership, and desired conclusions.
- The continued influence effect: Corrections reduce but rarely eliminate the influence of initially presented misinformation on judgment.
- Source credibility effects: Information from credible, trusted sources is more persuasive than identical information from non-credible sources.
- Illusory truth effect: Repeated exposure to claims increases their perceived truth, a well-replicated finding with important implications for viral misinformation.
The challenge is distinguishing which findings are robust from which are replication failures — a judgment that requires looking at the full body of replication evidence rather than any single study.
4. The Role of Pre-Registration
Pre-registration — filing a study's hypotheses, methods, and analysis plan in a public registry before data collection begins — has emerged as the primary structural reform to address p-hacking and related problems.
When researchers pre-register, they commit in advance to:
- The primary and secondary hypotheses
- The sample size and stopping rule
- The measures and their operationalization
- The analysis strategy, including covariate inclusion and subgroup analyses
- The exclusion criteria
Any deviation from the pre-registered plan must be explicitly acknowledged as exploratory (hypothesis-generating) rather than confirmatory (hypothesis-testing). This prevents the "garden of forking paths" — the multiple post-hoc analysis choices that can inflate false positive rates.
Pre-registration was uncommon in psychology before 2012. It has become standard practice in the most rigorous contemporary research. The presence or absence of pre-registration is now a meaningful quality signal when evaluating studies.
However, pre-registration is not a panacea:
- Researchers can deviate from pre-registered plans (though this is detectable and must be disclosed)
- Pre-registration does not prevent honest measurement error, attrition bias, or other sources of invalidity
- Even well-powered, pre-registered studies can produce false positives by chance
- The practice of "hypothesizing after results are known" (HARKing) — re-describing exploratory findings as confirmatory — can be disguised within pre-registration
5. Open Science Practices
Beyond pre-registration, the open science movement has introduced several reforms that improve research quality and reliability:
Open data: Making raw data publicly available allows other researchers to independently verify analyses and conduct secondary analyses. The existence of raw data also deters fraud.
Open materials: Sharing experimental stimuli, questionnaires, and protocols allows exact replication and identifies when materials drive effects.
Multi-site replication: Pre-registered replications across many labs and populations — "Many Labs" projects — provide robust estimates of whether effects replicate and whether they vary across contexts.
Registered Reports: Some journals now accept papers before data collection, based on the quality of the design and hypotheses. The paper is guaranteed publication regardless of the direction of results, eliminating publication bias.
Post-publication peer review: Platforms like PubPeer allow researchers to publicly comment on published work, flagging potential errors or concerns.
These practices have improved the reliability of recent research substantially. Studies published after 2015 with pre-registration and open data are, other things being equal, considerably more credible than comparable studies from the earlier era.
6. What This Means for This Textbook
This textbook cites social science research on misinformation, persuasion, motivated reasoning, and media effects throughout its chapters. Given what we know about the replication crisis, how should these citations be evaluated?
6.1 Principles for Evaluating Cited Research
Replicated findings should be weighted more heavily than single-study results. If a chapter cites a phenomenon supported by multiple independent replications — including pre-registered ones — the finding is on substantially firmer ground than a single clever experiment, however well-designed.
Effect sizes matter. Even in replicated findings, small effect sizes (d < 0.2, r < 0.1) should temper confidence in practical significance. Research on misinformation often finds that inoculation, debunking, and other interventions have effects that are statistically reliable but modest in absolute magnitude; with large samples, statistical significance alone says little about whether an effect matters.
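The gap between statistical and practical significance is easy to see numerically. In this sketch (illustrative numbers), a very large sample makes a tiny true effect of d = 0.05 statistically significant, while Cohen's d correctly reports that the effect is negligible:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

n = 20_000                                # very large sample per group
a = rng.normal(0.05, 1.0, n)              # tiny true effect: d = 0.05
b = rng.normal(0.0, 1.0, n)

result = stats.ttest_ind(a, b)
pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
d = (a.mean() - b.mean()) / pooled_sd     # Cohen's d

print(f"p = {result.pvalue:.2e}, d = {d:.3f}")
# p is very small, yet d sits far below the conventional
# d = 0.2 "small effect" benchmark
```

This is why effect sizes, not p-values, should anchor judgments about whether an intervention is practically useful.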
Be skeptical of counterintuitive single studies. The most counterintuitive findings — corrections making people believe misinformation more, power poses boosting hormones, subtle primes changing complex behavior — are disproportionately represented in failed replications. This is consistent with the statistical logic: counterintuitive findings are more likely to be false positives because the prior probability they are true is lower.
Consider the era of the research. Pre-2012 studies from high-flexibility research environments should be treated with more caution than post-2015 pre-registered, high-powered studies. This does not mean discarding older research — but it warrants additional scrutiny.
Check for replication evidence. For any important claim, it is worth checking whether replication attempts exist and what they found. PsychFileDrawer, the Open Science Framework registry, and systematic reviews provide this information.
6.2 A Specific Self-Assessment
Where this textbook cites research on:
The illusory truth effect (repeated exposure increases perceived truth): Well-replicated, robust across many conditions. Confidence: High.
Confirmation bias and motivated reasoning: Well-documented across multiple paradigms and methodologies. Confidence: High.
Corrections reducing misinformation belief: Generally supported by the evidence, though effects are modest and vary with correction design. Confidence: Moderate-High.
The backfire effect (corrections increasing belief): Evidence substantially weakened by replication failures. Should be treated as a rare rather than typical outcome. Confidence in strong version: Low.
Third-person effect (others are more influenced by media than oneself): Replicated across many studies, though mechanisms are debated. Confidence: Moderate.
Social norm corrections for misinformation: Promising but relatively new evidence base; requires more replication. Confidence: Low-Moderate.
7. The Broader Epistemic Lesson
The replication crisis teaches a lesson that extends beyond psychology: scientific knowledge is not a binary between "proven" and "disproven" but a probability distribution over possible truths, shaped by the quality and quantity of evidence. The institutional structures of academic science — publication bias, career incentives for novel positive findings, insufficient statistical power, inadequate peer review — can systematically skew this distribution away from truth.
Recognizing this does not license rejecting science or treating all findings as equally uncertain. It requires developing the skills to distinguish more from less reliable findings: looking for replication, adequate statistical power, pre-registration, open data, effect sizes, and consistency across methodologies.
For the study of misinformation specifically, there is a poignant irony in the replication crisis: some of the most-cited research on how misinformation spreads and persists is itself an example of misinformation that spread widely before being corrected by subsequent evidence. The "backfire effect" is a kind of scientific misinformation — a claim that appeared credible, was widely disseminated through authoritative sources, and was only partially corrected when evidence failed to support it.
This irony should not breed nihilism. It should breed humility, methodological rigor, and the practice of updating beliefs in proportion to evidence — which is precisely what this textbook asks of its readers in every domain.
8. Discussion Questions
- The replication crisis was caused substantially by institutional incentives in academic science — specifically, the value placed on novel positive findings. What institutional reforms beyond pre-registration could address these incentive problems?
- Some critics of the replication crisis narrative argue that "failed replications" often represent genuine contextual moderators — the effect is real but only under certain conditions. How would you design research to distinguish a genuine contextual moderator from an artifact of underpowered original research?
- This textbook now knows that the "backfire effect" is weaker than initially reported. How should this change our advice to fact-checkers and misinformation correction practitioners? Does the weaker version of the effect (corrections generally work, but resistance is possible) imply different practices than the stronger version?
- If a scientific finding has been cited in thousands of news articles and policy documents, and then fails to replicate, how should corrections be issued? What mechanisms exist to correct the public record, and how effective are they?
- Given the replication crisis, would you trust a recent (post-2020) pre-registered, open-data study with n = 2,000 more or less than a landmark pre-2010 study with n = 80 that has been cited 2,000 times? Justify your reasoning.
Key Concepts Illustrated
- Publication bias: Systematic overrepresentation of positive findings in the published literature
- Researcher degrees of freedom: Legitimate analytical choices that can inflate false positive rates when made post-hoc
- P-hacking: Exploiting researcher degrees of freedom to achieve significance
- Statistical power: The probability of detecting a real effect; low power leads to inflated effect sizes in significant findings
- Pre-registration: Filing hypotheses and analysis plans before data collection to prevent selective reporting
- HARKing (Hypothesizing After Results are Known): Presenting exploratory findings as confirmatory
- Open science: A cluster of reforms (open data, open materials, registered reports) designed to increase research transparency and reproducibility
- Effect size: A measure of practical significance independent of sample size
- Replication: The critical scientific mechanism for distinguishing genuine effects from false positives