Chapter 3: The Replication Crisis — When Psychology's Foundations Cracked

In 2011, one of the most prestigious psychology journals in the world — the Journal of Personality and Social Psychology — published a paper claiming to prove that humans can see the future.

The paper was by Daryl Bem, a respected social psychologist at Cornell. Across nine experiments involving over 1,000 participants, Bem reported evidence for "precognition" — the ability to sense future events before they happen. In one experiment, participants could correctly identify which curtain concealed an erotic image at rates above chance, before the image's location had been determined by a random number generator.

The paper was not a joke. It was not retracted. It passed peer review at one of the field's top journals. It used the same statistical methods, the same p-value thresholds, and the same experimental logic that the field had been using for decades. And that was the problem.

If the standard methods of psychological research could produce convincing evidence for psychic powers, then something was deeply wrong with the standard methods. Bem's paper didn't prove that precognition is real. It proved that the methods psychology had been using to generate knowledge were broken in ways that could produce evidence for virtually anything — including the impossible.

The paper landed like a bomb. It catalyzed what became known as the replication crisis — the discovery that a disturbingly large proportion of published psychology findings cannot be reproduced when other researchers try. And it forced the field to confront a set of uncomfortable questions that had been festering for decades.

This chapter tells that story. It matters for this book because many of the popular psychology claims you've heard — and many that we'll evaluate in later chapters — are built on studies from the era before the replication crisis. Some of those studies hold up. Many don't. Understanding the crisis is essential for understanding which claims to trust.

Before You Read: Confidence Check

Rate your confidence (1–10) that each statement is true.

  1. "If a psychology study is published in a top journal, the finding is reliable." ___
  2. "The Stanford Prison Experiment proved that situational forces can make anyone behave cruelly." ___
  3. "Ego depletion is real — willpower is a limited resource that gets used up." ___
  4. "If a study has a p-value below .05, the result is probably true." ___
  5. "Psychology is less trustworthy now than it was before the replication crisis." ___

The Spark: Daryl Bem and the Case for Precognition

Bem's 2011 paper, "Feeling the Future," was designed to be provocative. Bem was not a fringe figure — he had spent decades at Cornell, had published influential work on self-perception theory, and was a fellow of the American Psychological Association. He used standard methods: he recruited participants from the university subject pool, ran controlled experiments, analyzed data using conventional statistics, and reported statistically significant results (p < .05).

The logic of the paper was simple: if the standard methods of psychology are valid, and this study follows those methods, then either precognition is real or the methods are flawed. Readers overwhelmingly drew the second conclusion. Whatever Bem's own intentions, the paper functioned as a demonstration of just what the field's methods could produce.

The Methods Problem He Exposed

Bem's paper highlighted several practices that were common in psychology but that, taken together, could produce statistically significant results even when the underlying effect was zero:

Flexible stopping rules. Researchers could collect data until they got a significant result, then stop. If the first 30 participants didn't produce a significant finding, collect 20 more. If the next 50 did, stop there and report the result. This flexibility inflates the false positive rate dramatically.
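A quick simulation makes that inflation concrete. The sketch below is illustrative only: the batch sizes (30 participants per group, then 20 more) are arbitrary assumptions chosen to echo the example above, and there is no true effect in the simulated data, so every "significant" result is a false positive.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims = 10_000
false_positives = 0

for _ in range(n_sims):
    # Two groups drawn from the SAME distribution: the true effect is exactly zero.
    a = rng.normal(0, 1, 30)
    b = rng.normal(0, 1, 30)
    _, p = stats.ttest_ind(a, b)
    if p >= 0.05:
        # Not significant? Collect 20 more participants per group and test again.
        a = np.concatenate([a, rng.normal(0, 1, 20)])
        b = np.concatenate([b, rng.normal(0, 1, 20)])
        _, p = stats.ttest_ind(a, b)
    if p < 0.05:
        false_positives += 1

print(f"False positive rate with optional stopping: {false_positives / n_sims:.1%}")
```

Even this mild version of the rule (stop if significant, otherwise look once more) pushes the false positive rate noticeably above the nominal 5 percent, and adding further "looks" at the data pushes it higher still.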

Flexible analysis. With multiple measures and multiple ways of analyzing data, researchers had many "researcher degrees of freedom." They could analyze the full sample, or just women, or just people who scored above the median on some other measure. Each choice is a separate test, and the more tests you run, the more likely you are to find something "significant" by chance.
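Under the simplifying assumption that each extra analysis is an independent test at the conventional .05 threshold (real subgroup analyses overlap, which softens but does not remove the problem), the arithmetic looks like this:

```python
# Probability of at least one false positive across k independent tests at alpha = .05
for k in (1, 3, 5, 10, 20):
    print(f"{k:>2} tests: {1 - 0.95 ** k:.0%}")
# prints 5%, 14%, 23%, 40%, 64%
```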

Selective reporting. If a study produced 10 measures but only 2 were significant, the researcher could report only those 2. The 8 non-significant results would go unreported, creating the illusion that the significant ones were robust.

Publication bias. Journals overwhelmingly published significant results. Null results — studies that found nothing — were rejected as uninteresting. This meant the published literature was a systematically biased sample of all research conducted, skewed toward positive findings regardless of their truth.

None of these practices were considered fraudulent. They were the normal, accepted way of doing psychology research. Bem simply applied them to an absurd hypothesis, revealing what the field had been ignoring: the methods could find "evidence" for literally anything.


The Reckoning: The Open Science Collaboration (2015)

In 2015, a massive collaborative effort organized by Brian Nosek and the Center for Open Science published its results in Science. The Open Science Collaboration (OSC) attempted to replicate 100 published psychology studies from three top journals (Psychological Science, Journal of Personality and Social Psychology, and Journal of Experimental Psychology: Learning, Memory, and Cognition).

The results were devastating:

  • Only 36% of the replication attempts produced statistically significant results in the same direction as the original studies. Put differently: roughly two-thirds of the findings they tested failed to replicate.
  • The average effect size in the replications was about half the size reported in the original studies.
  • Even among the studies that did replicate, the effects were generally weaker than originally reported.

These results didn't mean that all of psychology was wrong. Some subfields fared better than others — cognitive psychology replicated more reliably than social psychology, for instance. And "failure to replicate" doesn't definitively prove an effect is false; it could mean the original study captured a real but fragile effect, or that the replication differed in important ways.

But the overall message was clear: the published psychology literature was far less reliable than anyone had assumed. The foundations were cracked.

The Landmark Studies That Didn't Hold Up

The replication crisis wasn't just about obscure studies in obscure journals. Some of the most famous, most-cited, most-taught findings in psychology turned out to be on shaky ground:

Ego depletion (Baumeister, 1998). The idea that willpower is a limited resource — like a muscle that gets tired — was one of the most influential findings in social psychology. It spawned hundreds of studies, multiple books, and widespread popular adoption (including corporate training programs and self-help advice). In 2016, a large pre-registered replication involving 23 labs and over 2,000 participants found... essentially nothing. No significant ego depletion effect. The foundational finding didn't hold up. (We'll examine this in detail in Chapter 27.)

Social priming effects. A series of famous studies claimed that subtle environmental cues unconsciously influence behavior. Being exposed to words related to old age made people walk more slowly (Bargh, Chen, & Burrows, 1996). Holding a warm cup of coffee made people judge others as warmer (Williams & Bargh, 2008). These findings were elegant, surprising, and frequently taught. Multiple replication attempts failed to confirm them.

The Stanford Prison Experiment (Zimbardo, 1971). One of the most famous studies in all of psychology — the claim that ordinary people placed in guard roles will become cruel — was revealed to have serious methodological and ethical problems. Researcher demand (Zimbardo explicitly encouraged aggressive guard behavior), wide variation in how participants actually responded (some guards never became cruel), and a sample of only 24 participants all undermined the study's dramatic conclusions. Archival research by Thibault Le Texier (2018), drawing on previously unreleased recordings, revealed further problems.

The Marshmallow Test (Mischel, 1972/1990). As we discussed in Chapter 2, the follow-up finding that delay of gratification at age four predicted later success shrank substantially in the 2018 large-scale replication by Watts, Duncan, and Quan.

Stereotype threat (Steele & Aronson, 1995). The finding that reminding people of negative stereotypes about their group impairs performance has been influential in education policy. But meta-analyses and replication attempts have produced more mixed results than the popular version suggests, with publication bias potentially inflating the apparent effect size.

These examples are not meant to suggest that all of psychology is unreliable. They are meant to demonstrate that even the most famous, most-cited findings can turn out to be weaker than originally reported — and that the public's understanding of psychology is built substantially on this pre-crisis literature.


What Went Wrong: The Four Horsemen of Unreliable Science

The replication crisis was not caused by a single problem. It was caused by a cluster of practices that, in combination, made the published literature systematically unreliable. Researchers have identified four primary culprits:

1. P-Hacking

P-hacking refers to the practice of analyzing data in multiple ways until a statistically significant result (p < .05) is found, then reporting only the significant analysis. Simmons, Nelson, and Simonsohn (2011) demonstrated in a now-famous paper that with common "researcher degrees of freedom" — flexible sample sizes, flexible measures, flexible covariates — it was possible to find statistically significant evidence that listening to "When I'm Sixty-Four" by the Beatles literally makes you younger.

They weren't committing fraud. They were using the same analytical flexibility that was common practice in the field. Their point was that this flexibility, combined with the p < .05 threshold, could produce "evidence" for anything.
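A toy simulation shows how quickly this flexibility adds up. The sketch below is not a reconstruction of Simmons, Nelson, and Simonsohn's actual analyses; it simply assumes a study with two outcome measures and an optional median split on a covariate, with no true effect anywhere, and counts the study as a "finding" whenever any of those analysis paths crosses p < .05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_sims, n_per_group = 5_000, 40
hits = 0

for _ in range(n_sims):
    group = np.repeat([0, 1], n_per_group)          # condition labels: control vs. treatment
    dv1 = rng.normal(0, 1, 2 * n_per_group)         # outcome measure 1 (true effect: zero)
    dv2 = rng.normal(0, 1, 2 * n_per_group)         # outcome measure 2 (true effect: zero)
    covariate = rng.normal(0, 1, 2 * n_per_group)   # e.g. a personality score used for splits

    p_values = []
    for dv in (dv1, dv2):
        # Analysis path 1: compare conditions in the full sample
        p_values.append(stats.ttest_ind(dv[group == 0], dv[group == 1]).pvalue)
        # Analysis path 2: compare conditions only among high scorers on the covariate
        high = covariate > np.median(covariate)
        p_values.append(stats.ttest_ind(dv[(group == 0) & high],
                                        dv[(group == 1) & high]).pvalue)

    if min(p_values) < 0.05:   # report whichever analysis "worked"
        hits += 1

print(f"Studies with at least one 'significant' result: {hits / n_sims:.0%}")
```

Even with only four loosely related analysis paths, the chance of reporting something "significant" ends up well above the nominal 5 percent.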

The term "p-hacking" sounds sinister, but the practice was often unintentional. Researchers genuinely believed they were exploring the data. The distinction between "exploring" and "hacking" is subtle, and the field had never clearly drawn the line.

2. HARKing

HARKing stands for "Hypothesizing After the Results are Known." The practice involves examining the data, finding an interesting pattern, and then writing the paper as if that pattern had been predicted all along. A researcher might test five hypotheses, find one significant result, and write the paper as though only that one hypothesis was tested.

HARKing turns exploratory research (which is fine) into seemingly confirmatory research (which is misleading). It makes the published literature look more confirmatory than it actually is, because the hypotheses appear to be pre-specified when they were actually generated by the data.

3. Publication Bias (The File Drawer Problem)

Journals overwhelmingly publish significant results. A researcher who conducts a study and finds no effect — a null result — has almost no chance of publishing it. This means the published literature is a biased sample: it systematically overrepresents positive findings and underrepresents null findings.

Psychologist Robert Rosenthal called this the file drawer problem in 1979. For every published study finding an effect, there might be several unpublished studies that found nothing — sitting in file drawers, invisible to the field.

The consequence is that published effect sizes are inflated. If five labs study the same phenomenon and one finds a significant result by chance while four find nothing, only the one is published. The published literature then shows "evidence" for the phenomenon, with an effect size that reflects the one lucky study, not the true effect.
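A short simulation shows the mechanism. The numbers below are illustrative assumptions, not estimates from any study discussed in this chapter: a small true effect (d = 0.2), small samples (30 per group), and a rule that only significant positive results get "published."

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_d, n = 0.2, 30        # assumed small true effect, small per-group sample
published = []

for _ in range(20_000):
    control = rng.normal(0, 1, n)
    treatment = rng.normal(true_d, 1, n)
    t, p = stats.ttest_ind(treatment, control)
    if p < 0.05 and t > 0:
        # Only significant positive results escape the file drawer.
        pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
        published.append((treatment.mean() - control.mean()) / pooled_sd)

print(f"True effect size:       d = {true_d}")
print(f"Mean published effect:  d = {np.mean(published):.2f}")
```

Because only the lucky, overestimating studies clear the significance bar, the average published effect comes out several times larger than the true one.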

4. Small Samples and Low Statistical Power

Many psychology studies used small samples — 20, 30, 50 participants per condition. With small samples, studies have low statistical power — the ability to detect a real effect if one exists. Counter-intuitively, studies with low power that do produce significant results are more likely to be false positives or to overestimate the true effect size.

This is because in a low-powered study, a true effect can only reach significance if the observed effect happens to be much larger than the real effect (by chance). This phenomenon, called the winner's curse, means that published effect sizes from small studies are systematically inflated.

Button and colleagues (2013) estimated that the median statistical power in neuroscience studies was just 21% — meaning that even if the effect being studied is real, the study has only a 21% chance of detecting it. The situation in social psychology was similar.
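The winner's curse itself can be sketched the same way. The effect size and sample sizes below are illustrative assumptions: a true effect of d = 0.3, studied once with 20 participants per group and once with 200.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
true_d = 0.3   # assumed true effect (standardized, since both SDs are 1)

def significant_runs(n_per_group, sims=20_000):
    """Return the share of runs reaching p < .05 and their mean observed effect."""
    effects = []
    for _ in range(sims):
        treatment = rng.normal(true_d, 1, n_per_group)
        control = rng.normal(0, 1, n_per_group)
        t, p = stats.ttest_ind(treatment, control)
        if p < 0.05 and t > 0:
            effects.append(treatment.mean() - control.mean())  # ~d, since SD = 1
    return len(effects) / sims, np.mean(effects)

for n in (20, 200):
    power, d_hat = significant_runs(n)
    print(f"n = {n:>3} per group | power ~ {power:.0%} | mean significant effect ~ {d_hat:.2f}")
```

The small study rarely reaches significance, and when it does, the estimated effect is more than double the truth; the large study usually reaches significance and lands close to d = 0.3.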


What's Being Fixed: The Open Science Revolution

The replication crisis was painful, but it produced something remarkable: a reform movement that has fundamentally changed how psychology research is conducted.

Pre-Registration

Researchers now increasingly pre-register their studies — they publicly commit to their hypotheses, sample sizes, and analysis plans before collecting data. This eliminates HARKing (the hypothesis is specified in advance) and reduces p-hacking (the analysis plan is locked in). The Open Science Framework (osf.io) hosts thousands of pre-registered studies.

Registered Reports

Some journals now publish Registered Reports — a format where the paper is peer-reviewed and provisionally accepted before the data is collected. The journal commits to publishing the result regardless of whether it is significant or null. This eliminates publication bias at the journal level, because the decision to publish is based on the quality of the research design, not the interestingness of the results.

Open Data and Open Materials

Researchers are increasingly expected to share their raw data, analysis code, and experimental materials publicly. This makes it possible for other researchers to verify analyses, detect errors, and attempt direct replications with the same methods.

Larger Samples and Multi-Lab Replications

The field has moved toward larger samples and collaborative replication efforts. The "Many Labs" projects — large-scale replication attempts conducted simultaneously across multiple laboratories — provide more reliable estimates of effect sizes than any single study.

A Changed Culture

Perhaps most importantly, the culture of psychology has shifted. Replication is now valued rather than dismissed. Null results are publishable. Methodological rigor is rewarded rather than treated as an obstacle to interesting findings. The field's willingness to examine and correct its own practices is, arguably, a sign of health, not sickness.


Why This Matters for You: The Pre-Crisis Knowledge Base

Here's why the replication crisis matters for a book about popular psychology: most of the psychology claims circulating in popular culture are based on pre-crisis research.

The self-help books you've read, the corporate training programs you've attended, the parenting advice you've followed, the TikTok psychology content you've watched — much of this is based on findings published before 2015, before pre-registration was common, before large-scale replications were standard, before the field confronted its methodological problems.

Some of those findings are robust. Basic cognitive processes, well-established personality dimensions (the Big Five), well-replicated developmental findings, and the general effectiveness of CBT for anxiety and depression — these have survived the crisis intact.

But many popular findings have not survived, or have survived only in much weaker form:

  • Ego depletion (willpower as a muscle) — failed to replicate
  • Power posing (hormonal effects) — failed to replicate
  • Many social priming effects — failed to replicate
  • The marshmallow test (as a predictor of life success) — substantially reduced
  • Growth mindset (as originally reported) — effect sizes much smaller than claimed
  • The Stanford Prison Experiment — methodologically compromised

When you encounter a popular psychology claim in the rest of this book, one of the first questions we'll ask is: has this finding been replicated? That question barely existed before 2011. It is now one of the most important questions you can ask about any psychology claim.

Verdict: "If a psychology study is published in a top journal, the finding is reliable"DEBUNKED — The Open Science Collaboration (2015) found that only 36% of published findings from top journals replicated successfully. Publication in a prestigious journal is not a guarantee of reliability, because publication bias, p-hacking, and low statistical power affect all journals. Origin: This belief reflects the general assumption that peer review is a reliable quality filter. Replication status: The OSC study directly tested and refuted this assumption.

Verdict: "Psychology is less trustworthy now than before the replication crisis"DEBUNKED — Psychology is more trustworthy now because the field is actively correcting its methods. Pre-registration, Registered Reports, open data, and large-scale replications have improved the reliability of new findings. A science that checks its own work is more trustworthy than one that doesn't. The crisis revealed problems that had always existed; the reforms are making the field more reliable than it has ever been. Context: The replication crisis is a sign of health, not terminal illness. Sciences that don't check their work should be the ones you worry about.


Fact-Check Portfolio: Chapter 3

Return to your list of 15–20 claims from Chapter 1. Now narrow it to 10 claims you want to investigate for the rest of the book.

Choose claims that:

  • You feel strongly about (either believing or doubting them)
  • Come from different domains (personality, brain, relationships, self-improvement, parenting, mental health)
  • You would actually want to know the truth about

For each of your 10 claims, try to find out: is there an original study behind this claim? Has that study been replicated? You don't need to do a full literature search — just a quick Google Scholar search for the claim and the word "replication." Note what you find. You may be surprised how many popular claims have shaky or no replication evidence.


After Reading: Confidence Revisited

Revisit your confidence ratings from the start of this chapter.

  1. "If a psychology study is published in a top journal, the finding is reliable." — What did the Open Science Collaboration find?
  2. "The Stanford Prison Experiment proved that situational forces can make anyone behave cruelly." — What methodological problems have been revealed?
  3. "Ego depletion is real — willpower is a limited resource that gets used up." — What happened in the 2016 pre-registered replication?
  4. "If a study has a p-value below .05, the result is probably true." — What did Bem's precognition paper reveal about the p < .05 standard?
  5. "Psychology is less trustworthy now than it was before the replication crisis." — What reforms have been implemented, and why do they make the field more trustworthy?