> "Science isn't about authority or white coats; it's about following a method. That method is, at its core, about checking."
Learning Objectives
- Define the replication crisis and explain why it extends far beyond psychology
- Identify the structural incentives that discourage replication in every field
- Analyze p-hacking, HARKing, and researcher degrees of freedom as systemic features, not individual misconduct
- Evaluate the scale of the replication problem across medicine, psychology, economics, education, and forensic science
- Add the replication lens to your Epistemic Audit
In This Chapter
- Chapter Overview
- 10.1 The Reproducibility Project: The Reckoning
- 10.2 The Machinery of Non-Replication
- 10.3 The Structural Incentives Against Replication
- 10.4 Beyond Psychology: The Replication Problem Is Everywhere
- 10.5 Active Right Now: Where the Replication Problem May Be Operating
- 10.6 What It Looked Like From Inside
- 10.7 The Structural Diagnosis: Why Checking Is Disincentivized
- 10.8 The Reform Movement: Open Science
- 10.9 Practical Considerations: Living With Unreplicated Evidence
- 10.10 Chapter Summary
- Spaced Review
- What's Next
- Chapter 10 Exercises → exercises.md
- Chapter 10 Quiz → quiz.md
- Case Study: The Ego Depletion Saga — The Rise and Fall of a Textbook Finding → case-study-01.md
- Case Study: Preclinical Cancer Research — When 89% of "Landmark" Studies Don't Replicate → case-study-02.md
Chapter 10: The Replication Problem
"Science isn't about authority or white coats; it's about following a method. That method is, at its core, about checking." — Ben Goldacre, Bad Science
Chapter Overview
In 2011, a social psychologist named Daryl Bem published a paper in one of psychology's most prestigious journals, the Journal of Personality and Social Psychology. The paper, titled "Feeling the Future," claimed to provide experimental evidence for precognition — the ability to perceive future events before they occur.
The experiments were straightforward. In one study, participants were shown two curtains on a computer screen and asked to guess which one concealed an image. The image's location was determined by a random number generator only after the participant had made their choice. Bem reported that participants correctly guessed the location of erotic images (but not non-erotic images) at a rate slightly above chance — 53.1% instead of 50%. He interpreted this as evidence that the future somehow influenced the present.
The paper passed peer review at a top journal. Its methodology, on the surface, was conventional — the same experimental designs and statistical methods used in hundreds of published social psychology studies. The statistical results were technically "significant" by the field's standard criteria (p < 0.05).
And that was precisely the problem.
If the methods and statistics that psychology used to evaluate all of its findings could also produce evidence for precognition — a phenomenon that, if real, would overturn the fundamental physics of causality — then what did those methods actually prove? If the standard tools of the trade couldn't distinguish between a genuine psychological effect and literal magic, how much of the field's published findings could be trusted?
Bem's paper didn't just make a strange claim. It held up a mirror to the entire discipline and asked: How much of what you've published is real?
The response to Bem's paper was revealing. Several research groups attempted to replicate his findings and failed — as most scientists expected. But the replication attempts faced an unexpected obstacle: the Journal of Personality and Social Psychology — the same journal that had published Bem's original paper — rejected the replication studies. The reason: the replications were "not sufficiently novel."
Pause on this. The journal published a paper claiming to demonstrate precognition — one of the most extraordinary claims in the history of psychology. When other researchers attempted to test this extraordinary claim using the same methods, the journal refused to publish their results because testing the claim wasn't novel enough. The system was designed to amplify extraordinary claims but not to verify them.
This asymmetry — easy to publish claims, hard to publish verifications — is the replication crisis in miniature. And the answer, when researchers finally began checking the field's broader claims rather than just Bem's, was devastating.
In this chapter, you will learn to:
- Understand the replication crisis not as an embarrassment confined to psychology but as a structural feature of how knowledge is produced in every field
- Identify p-hacking, HARKing, and researcher degrees of freedom as systemic incentive problems, not individual misconduct
- Evaluate the scale of non-replication across multiple fields
- Understand why "checking the homework" is structurally disincentivized
- Add the replication diagnostic to your Epistemic Audit
🏃 Fast Track: If you're already familiar with the replication crisis in psychology, skip to section 10.4 (Beyond Psychology) for the cross-field analysis and section 10.6 for the structural diagnosis.
🔬 Deep Dive: After this chapter, read Stuart Ritchie's Science Fictions for the most comprehensive popular treatment, and explore the Open Science Framework (osf.io) for the infrastructure being built to address the problem.
10.1 The Reproducibility Project: The Reckoning
In 2011, Brian Nosek and a large team of researchers launched the Reproducibility Project — an unprecedented attempt to systematically replicate 100 published studies from three top psychology journals. The studies were selected to be representative of the field: mainstream topics, reputable journals, standard methodologies.
The results, published in Science in 2015, were a watershed:
- Of the 100 original studies, 97% had reported statistically significant results (p < 0.05)
- Of the 100 replications, only 36% produced statistically significant results
- The average effect size in the replications was roughly half the average effect size in the originals
- Many of the most famous and widely cited findings failed to replicate
The numbers told a story that the field had been avoiding for decades: a substantial proportion of published psychological research — perhaps a majority — could not be confirmed when other researchers tried to reproduce the results.
What Failed
Some of the non-replications involved high-profile findings that had been widely cited, taught in textbooks, and incorporated into practice:
Ego depletion. The theory that willpower is a limited resource that gets "depleted" with use — like a muscle that fatigues — was one of the most cited findings in social psychology. A massive multi-lab replication effort found no evidence for the effect.
Power posing. Amy Cuddy's famous finding that adopting "power poses" (standing with arms on hips, for example) increased testosterone and risk-taking was hugely popular — her TED talk has been viewed over 60 million times. Replication attempts found no evidence for the hormonal or behavioral effects.
Priming effects. Multiple studies claiming that subtle environmental cues (seeing words related to elderly people, holding a warm cup of coffee) unconsciously influenced behavior produced near-zero effects in replication attempts.
The Stanford Prison Experiment. Philip Zimbardo's 1971 study — in which college students assigned to "guard" roles became abusive toward "prisoner" students — is among the most famous experiments in the history of psychology. Subsequent investigation by journalist Ben Blum and others revealed that the guards were coached to be cruel, that participants who didn't conform to the expected narrative were removed from the study, that the results were dramatically overstated, and that the study's methodology would not meet contemporary standards. The study has never been formally replicated and is now widely considered unreliable.
Yet the Stanford Prison Experiment remains one of the most taught studies in introductory psychology courses worldwide. It appears in virtually every social psychology textbook. It has been adapted into films and documentaries. It is cited in discussions of military abuse (Abu Ghraib), corporate misconduct, and institutional corruption. An unreplicable study with serious methodological flaws has become one of the foundational narratives of social psychology — and its persistence in curricula demonstrates how difficult it is to remove a compelling story from the educational infrastructure, even when the evidence underlying the story has collapsed.
This is the intersection of the replication problem with the plausible story problem (Chapter 6): the Stanford Prison Experiment is a great story. It has characters, drama, moral lessons, and a clear narrative arc. The story persists because it satisfies the human need for narrative coherence, even as the evidence underpinning it has been dismantled.
⚠️ Common Pitfall: The replication crisis does not mean that all of psychology is unreliable. Many areas of psychology — perception, learning, memory, some cognitive processes — have strong replication records. The crisis is concentrated in social psychology, personality psychology, and certain areas of clinical psychology where effect sizes are small, samples are small, and the incentives for novel, surprising findings are highest. Dismissing all of psychology because of the replication crisis is like dismissing all of medicine because of the lobotomy — it mistakes a structural problem in part of the field for a condemnation of the whole.
10.2 The Machinery of Non-Replication
The replication crisis was not caused by widespread fraud. It was caused by a system that systematically rewarded practices that inflate false positive rates — practices so common that most researchers didn't even recognize them as problematic until the crisis forced a reckoning.
P-Hacking: Fishing for Significance
P-hacking (also called "data dredging" or "fishing expeditions") is the practice of analyzing data in multiple ways until a statistically significant result emerges. The standard threshold for "significance" in most fields is p < 0.05 — meaning that, if there were no real effect, a result at least as extreme as the one observed would occur by chance less than 5% of the time.
But this threshold assumes that only one analysis was performed. If a researcher runs 20 different analyses on the same data, one of them is likely to produce p < 0.05 purely by chance — even if the hypothesis is completely false. This is not cheating in the traditional sense; the researcher may genuinely believe they are exploring the data legitimately. But the cumulative effect of multiple analyses inflates the false positive rate from the nominal 5% to much higher levels.
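To make the inflation concrete, here is a minimal simulation sketch (Python with NumPy and SciPy; the parameters are illustrative, not drawn from any real study). It runs 20 t-tests on pure noise per simulated "study" and counts how often at least one comes out "significant." For independent tests the analytic answer is 1 − 0.95^20 ≈ 64%; real p-hacked analyses reuse the same data and are correlated, so the inflation is somewhat smaller — but still far above the nominal 5%.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

n_simulations = 2_000   # simulated "studies"
n_tests = 20            # analyses run per study
n_per_group = 30        # participants per group

studies_with_false_positive = 0
for _ in range(n_simulations):
    # Every test compares two groups drawn from the same distribution,
    # so any "significant" result is a false positive by construction.
    p_values = [
        stats.ttest_ind(rng.normal(size=n_per_group),
                        rng.normal(size=n_per_group)).pvalue
        for _ in range(n_tests)
    ]
    if min(p_values) < 0.05:
        studies_with_false_positive += 1

print(f"Studies with at least one 'significant' result: "
      f"{studies_with_false_positive / n_simulations:.0%}")   # roughly 64%
print(f"Analytic expectation for independent tests: {1 - 0.95 ** n_tests:.0%}")
```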
Common p-hacking techniques include:
- Analyzing multiple dependent variables and reporting only the significant ones
- Testing multiple subgroups until one shows a significant effect
- Collecting data until p < 0.05, then stopping (optional stopping; see the simulation sketch below)
- Removing "outliers" using different criteria until the desired result appears
- Transforming variables (log, square root) until the analysis "works"
Each of these is defensible individually — there are legitimate reasons for each. But when they are used in combination and selectively reported, the result is a published literature full of "significant" findings that are actually noise.
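One technique from the list above — collecting data until p < 0.05, then stopping — deserves its own simulation, because it feels so innocuous in practice. The sketch below is hypothetical (batch size, stopping rule, and sample cap are illustrative): the researcher peeks at the data after every ten new participants per group and stops as soon as the test is "significant," even though there is no real effect.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

n_simulations = 2_000
batch_size = 10      # participants added per group between "peeks"
max_n = 100          # give up after 100 participants per group

stopped_with_significance = 0
for _ in range(n_simulations):
    group_a, group_b = [], []
    for _ in range(max_n // batch_size):
        # Both groups come from the same distribution: no true effect.
        group_a.extend(rng.normal(size=batch_size))
        group_b.extend(rng.normal(size=batch_size))
        if stats.ttest_ind(group_a, group_b).pvalue < 0.05:
            stopped_with_significance += 1   # "significant" -- stop and write it up
            break

print(f"False positive rate with optional stopping: "
      f"{stopped_with_significance / n_simulations:.0%}")   # well above the nominal 5%
```

With repeated peeks at accumulating data, the false positive rate ends up well above the nominal 5% — which is why pre-registering a stopping rule (discussed in section 10.8) matters.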
A Worked Example: How P-Hacking Produces "Discoveries"
To make p-hacking concrete, let's trace how a completely false finding can emerge from standard research practices.
Imagine a researcher studying whether listening to classical music improves test performance. They recruit 100 students, assign 50 to listen to Mozart and 50 to silence, and then administer a standardized test.
The result: no significant difference, p = 0.42. Not publishable.
But the researcher has collected additional data: gender, age, time of day, subject major, GPA, musical training, caffeine consumption, and hours of sleep. These weren't part of the original hypothesis, but they're in the dataset.
The researcher begins exploring:
- Mozart vs. silence for males only? p = 0.23. No.
- For females only? p = 0.34. No.
- For students who slept more than 7 hours? p = 0.08. Getting closer.
- For female students who slept more than 7 hours and had no musical training? p = 0.03. Significant!
The paper is written: "We hypothesized that Mozart would improve test performance for individuals without musical training who had adequate sleep (the 'relaxation-priming' effect). As predicted, this subgroup showed significantly higher scores..." The HARKing (hypothesizing after the results are known, described in the next section) frames the finding as predicted. The p-hacking produced the significant result. The published paper looks like a clean confirmatory study.
No individual step was fraudulent. The researcher genuinely explored the data. The subgroup analysis has a plausible justification (musical training and sleep are reasonable moderators). The statistics are correctly computed. But the finding is almost certainly noise — an artifact of searching through many possible analyses until one produced p < 0.05.
If another researcher tries to replicate this finding with a new sample of female students without musical training who slept well, the effect will almost certainly vanish. But by then, the original paper has been published, cited, and incorporated into a narrative about music and cognition. The non-replication, if it's even attempted, will be harder to publish than the original "discovery."
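The subgroup search in this worked example is easy to reproduce in simulation. The sketch below (Python; the variables mirror the hypothetical Mozart scenario and nothing here comes from a real dataset) generates data in which music has no effect at all, then tests the overall comparison plus every subgroup defined by gender, sleep, and musical training. Rerun it with different random seeds and you will regularly find at least one subgroup that looks "publishable."

```python
import itertools
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(7)
n = 100

# Hypothetical dataset mirroring the worked example: test scores are pure
# noise, so the true effect of music is zero in every subgroup.
df = pd.DataFrame({
    "music":    rng.permutation(["mozart"] * 50 + ["silence"] * 50),
    "gender":   rng.choice(["f", "m"], size=n),
    "slept_7h": rng.choice([True, False], size=n),
    "trained":  rng.choice([True, False], size=n),
    "score":    rng.normal(loc=500, scale=100, size=n),
})

def mozart_vs_silence(subset: pd.DataFrame) -> float:
    """Two-sample t-test p-value for Mozart vs. silence within a subset."""
    a = subset.loc[subset.music == "mozart", "score"]
    b = subset.loc[subset.music == "silence", "score"]
    if len(a) < 5 or len(b) < 5:     # skip subgroups that are too small
        return float("nan")
    return stats.ttest_ind(a, b).pvalue

# The garden of forking paths: the full sample plus every subgroup defined
# by one, two, or three of the incidental covariates.
covariates = ["gender", "slept_7h", "trained"]
results = {"all participants": mozart_vs_silence(df)}
for r in range(1, len(covariates) + 1):
    for combo in itertools.combinations(covariates, r):
        for values, subset in df.groupby(list(combo)):
            results[f"{combo} = {values}"] = mozart_vs_silence(subset)

tested = {label: p for label, p in results.items() if not np.isnan(p)}
for label, p in sorted(tested.items(), key=lambda kv: kv[1]):
    flag = "  <-- looks 'publishable'" if p < 0.05 else ""
    print(f"{label:55s} p = {p:.3f}{flag}")
```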
💡 Intuition: P-hacking is like rolling dice until you get a six and then claiming you predicted it. With one die, a six is a 1-in-6 event. With 20 dice, getting at least one six is almost certain (about a 97% chance). The researcher who tests 20 hypotheses and reports only the significant ones has rolled 20 dice and reported the sixes. The audience, seeing only the sixes, believes the researcher has remarkable predictive power. The dice, not the researcher, produced the result.
HARKing: Hypothesizing After Results Are Known
HARKing (Hypothesizing After the Results are Known) is the practice of formulating a hypothesis after seeing the data and then presenting the research as though the hypothesis was formulated before the data was collected.
The mechanism is simple. A researcher collects data on 20 variables. Most show no interesting patterns. But variable 14 and variable 17 show an unexpected correlation. The researcher writes up the study as: "We hypothesized that variable 14 would be correlated with variable 17" — as though this was the plan all along.
HARKing transforms an exploratory finding (interesting but unreliable) into a confirmatory finding (apparently rigorous and hypothesis-driven). The false confidence this generates is enormous — both for the researcher and for anyone who reads the published paper.
Researcher Degrees of Freedom: The Garden of Forking Paths
Statistician Andrew Gelman and philosopher Eric Loken coined the term "garden of forking paths" to describe the vast number of analytical choices available to researchers at each stage of a study: how to measure variables, which participants to include or exclude, which statistical tests to run, how to handle missing data, which covariates to include, and how to interpret results.
Each "fork" in the garden is a defensible choice. But the number of possible paths is enormous — hundreds or thousands of possible analytical strategies for any given dataset. If researchers explore many paths and report only the ones that produced significant results (even unconsciously), the published findings will be biased toward false positives.
The critical insight is that researcher degrees of freedom are a structural problem, not a character problem. Most researchers who p-hack or HARK are not deliberately committing fraud. They are navigating a genuine garden of analytical choices with the sincere belief that they are finding real patterns in the data. The incentive system (which rewards novel, significant findings and punishes null results) ensures that the paths leading to publishable results are more likely to be followed — and more likely to be reported — than the paths leading to null results.
🧩 Productive Struggle
Before reading the next section, try this exercise: imagine you have a dataset with 20 variables measured on 200 participants. You don't have a specific hypothesis — you're exploring. How many different analyses could you run? (Consider: pairwise correlations, subgroup analyses, different statistical tests, different exclusion criteria, different variable transformations.) How many of these analyses would produce "significant" results by chance alone?
The answer is sobering. With 20 variables, there are 190 possible pairwise correlations. At p < 0.05, approximately 9-10 of these will be "significant" by chance. If you report only these 9-10 and frame each as a hypothesis-driven finding, you have produced 9-10 "publishable" results — none of which are real.
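That arithmetic is easy to check by simulation. The sketch below (illustrative parameters; the data are pure noise, so every "significant" correlation is a false positive by construction) generates 20 unrelated variables for 200 participants and counts how many of the 190 pairwise correlations clear p < 0.05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_participants, n_variables = 200, 20

# 20 variables with no real relationships between any of them.
data = rng.normal(size=(n_participants, n_variables))

tested, significant = 0, 0
for i in range(n_variables):
    for j in range(i + 1, n_variables):
        r, p = stats.pearsonr(data[:, i], data[:, j])
        tested += 1
        if p < 0.05:
            significant += 1

print(f"Pairwise correlations tested: {tested}")        # 190
print(f"'Significant' at p < 0.05:    {significant}")   # typically around 8-12
```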
10.3 The Structural Incentives Against Replication
If replication is the foundation of scientific reliability, why doesn't it happen more often? The answer is structural: every incentive in the system pushes against replication.
Incentive 1: Publish or Perish
Academic careers — hiring, tenure, promotion, salary — depend primarily on publishing novel findings in prestigious journals. Replications are, by definition, not novel. They don't generate new knowledge; they verify old knowledge. In the career marketplace, a replication study is worth far less than an original study, even if the replication is more valuable for the field.
A researcher who spends a year replicating someone else's findings has produced a valuable contribution to science but a worthless contribution to their career. The rational career strategy is to never replicate anything.
Incentive 2: Journal Prestige and Novelty Bias
Journals compete for readers, citations, and prestige. Novel, surprising findings attract attention. Replications — whether successful or failed — do not. Most top journals have historically rejected replication studies, especially failed replications, on the grounds that they are "not sufficiently novel." This means that even if a researcher conducts a replication, the result may be unpublishable.
The result is a systematic filter: original studies (novel, publishable) enter the literature; replications (not novel, less publishable) do not. The published literature becomes an increasingly unreliable guide to what is actually true — because it overrepresents initial findings and underrepresents the verification that would confirm or refute them.
Some journals have begun to change. PLOS ONE publishes based on methodological soundness regardless of novelty. The Journal of Articles in Support of the Null Hypothesis (yes, it exists) publishes negative results. And the registered reports format, adopted by over 300 journals, eliminates the novelty filter entirely. But these remain minority practices. The vast majority of journals still prefer — explicitly or implicitly — novel, surprising, significant findings.
Incentive 3: Social Cost of Failed Replication
Publishing a failed replication of a prominent researcher's work is a social act as well as a scientific one. It implies that the original researcher's finding may be wrong — which can be perceived as an attack on their competence or integrity. In a field where careers depend on reputation and collegial relationships, publishing a failed replication carries social risk that extends beyond the scientific merits.
Junior researchers who attempt to replicate senior researchers' work face particular vulnerability: the senior researcher may serve on hiring committees, review grant proposals, or referee journal submissions. The power asymmetry makes failed replication risky for the replicator, regardless of the scientific value of the work.
The social dynamics can be severe. When a graduate student attempted to replicate a famous priming study and found no effect, the original author publicly accused the student of incompetence and questioned their motives. The student's advisor had to intervene to protect the student's career. This kind of retaliation — which is documented in multiple cases — sends a clear message to the field: replication is socially dangerous. The cost is not hypothetical; it is observable, documented, and career-threatening.
Incentive 4: Funding Structure
Research funding is allocated for original research, not for replication. Grant applications proposing to replicate existing findings are routinely rejected as "not innovative." The funding structure ensures that the resources necessary for systematic replication are almost never available.
The numbers are telling. The U.S. National Institutes of Health spends roughly $45 billion annually on biomedical research. The proportion allocated specifically to replication studies is negligible — there is no dedicated funding stream for systematic replication. The Dutch government announced a dedicated replication fund in 2016 — the first of its kind anywhere — valued at approximately $3 million. The asymmetry between spending on producing new findings and spending on verifying old ones is on the order of 1,000:1.
This asymmetry means that the scientific enterprise produces vastly more claims than it verifies. The stock of unverified claims grows every year, while the verification capacity remains nearly zero. If this were an accounting system, it would be a clear case of audit failure — the equivalent of a corporation that spends billions generating revenue reports but nothing on auditing whether the revenue is real.
The Ioannidis Bombshell
In 2005, epidemiologist John Ioannidis published a paper with one of the most provocative titles in scientific history: "Why Most Published Research Findings Are False." Published in PLoS Medicine, the paper used mathematical modeling to argue that, given the combination of publication bias, small sample sizes, researcher degrees of freedom, and the base rate of true hypotheses in most fields, a majority of published findings with p < 0.05 are likely to be false positives.
The argument was not based on examining individual studies. It was based on the mathematics of the system. Ioannidis showed that when you combine:
- Low prior probability (most hypotheses tested are false)
- Small sample sizes (common in many fields)
- Researcher degrees of freedom (multiple possible analyses)
- Publication bias (positive results are published, negatives are not)
...the resulting published literature is mathematically expected to contain more false positives than true positives, regardless of the competence or integrity of individual researchers.
This paper — now cited over 12,000 times — was the intellectual foundation for the replication crisis that erupted a decade later. It showed that the replication problem was not a surprising discovery but a mathematical inevitability given the structure of the system.
🎓 Advanced: Ioannidis's model uses Bayesian reasoning to calculate the positive predictive value (PPV) of a published finding — the probability that a "significant" published result is actually true. Under plausible assumptions for many fields (prior probability of 10%, alpha of 0.05, power of 50%, some publication bias), the PPV drops below 50% — meaning that fewer than half of published significant findings are true. Under less favorable assumptions (lower prior probability, more bias), the PPV can drop below 20%. This is a structural property of the system, not an indictment of individual researchers.
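The PPV arithmetic is short enough to write out. The sketch below follows the logic described above — pre-study odds derived from the prior, a significance threshold, statistical power, and a bias term for the fraction of would-be null results that get reported as positive anyway. The formula is derived directly from those assumptions rather than copied from the paper, and the parameter values are the illustrative ones from this section, not empirical estimates.

```python
def ppv(prior: float, alpha: float, power: float, bias: float = 0.0) -> float:
    """Probability that a published "significant" finding is actually true.

    prior : probability that a tested hypothesis is true
    alpha : significance threshold (false positive rate per test)
    power : probability of detecting a true effect (1 - beta)
    bias  : fraction of would-be null results reported as positive anyway
    """
    beta = 1.0 - power
    odds = prior / (1.0 - prior)                 # pre-study odds of a true effect
    true_positives = power * odds + bias * beta * odds
    false_positives = alpha + bias * (1.0 - alpha)
    return true_positives / (true_positives + false_positives)

# Illustrative assumptions from this section, not empirical estimates.
print(f"No bias:              PPV = {ppv(prior=0.10, alpha=0.05, power=0.50):.2f}")            # ~0.53
print(f"Modest bias (10%):    PPV = {ppv(prior=0.10, alpha=0.05, power=0.50, bias=0.10):.2f}") # ~0.30
print(f"Low prior, more bias: PPV = {ppv(prior=0.02, alpha=0.05, power=0.50, bias=0.20):.2f}") # ~0.05
```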
🔄 Check Your Understanding (try to answer without scrolling up)
- Why is the replication crisis a structural problem rather than a fraud problem?
- Name three structural incentives that discourage replication.
Verify
1. Because the practices that produce non-replicable results (p-hacking, HARKing, researcher degrees of freedom) are rational responses to the incentive system — they are rewarded by the publication and career structures. Most researchers engaging in these practices are not committing fraud; they are navigating a system that rewards novel significant findings and punishes null results.
2. Publish-or-perish (replications don't count for career advancement), journal novelty bias (replications are rejected as "not novel"), social cost of failed replication (implying a senior colleague was wrong), and funding structure (grants don't fund replication).
10.4 Beyond Psychology: The Replication Problem Is Everywhere
The replication crisis was first identified in psychology — not because psychology is uniquely unreliable, but because psychologists were the first to systematically check. When other fields have checked, the results have been similarly concerning.
Medicine: Preclinical Cancer Research
In 2012, researchers at Amgen attempted to replicate 53 "landmark" studies in preclinical cancer research — studies that had generated significant excitement, influenced treatment development, and been highly cited. They were able to replicate the findings of only 6 out of 53 — an 11% replication rate.
A similar effort by Bayer HealthCare found that only about 25% of published preclinical studies could be replicated in their laboratories.
These numbers are staggering. Preclinical cancer research is the pipeline through which potential cancer treatments reach clinical trials. If 75-89% of the foundational findings don't replicate, the entire drug development pipeline is built on unreliable foundations. This may partly explain why so many promising cancer treatments fail in clinical trials: the preclinical evidence that justified the trials was itself unreliable.
The economic consequences are immense. Each clinical trial that fails because its preclinical foundation was unreliable costs tens of millions of dollars and years of effort. Across the pharmaceutical industry, the failure rate for drugs entering clinical trials is approximately 90%. Some fraction of that failure is due to genuine biological complexity (the drug works in mice but not in humans). But some fraction is due to the replication problem: the drug never worked in the first place, and the preclinical evidence that said it did was a false positive that was never checked.
Glenn Begley, the lead author of the Amgen study, described a case that illustrates the problem. His team identified a "landmark" cancer study that had been cited over 200 times and had influenced treatment approaches. When they attempted to replicate it using the original protocol, the results were negative. They contacted the original author, who revealed that the published positive result had emerged after six repetitions of the experiment — and that the original team had been unable to replicate their own result in subsequent attempts. Yet the original positive finding — the one non-replicating result out of multiple attempts — remained in the published literature, was widely cited, and was influencing clinical decisions.
This is the replication problem in its purest form: a finding that the original researchers themselves could not replicate became part of the scientific canon because the publication system selected for the positive result and filtered out the failures.
Economics: Macro Models and Experimental Results
A replication effort targeting 18 experimental economics studies published in the American Economic Review and the Quarterly Journal of Economics found that 61% replicated — better than psychology but far from reassuring for a field that claims scientific rigor.
More troublingly, macroeconomic models have been criticized for their failure to predict major economic events (Chapter 3 case study) — a form of non-replication at the field level. The models "replicate" historical data (they fit past observations) but fail to replicate their predictions in new time periods (they don't predict future events). This is the distinction between curve-fitting and genuine explanatory power.
Education: Learning Styles and Other Myths
Education research suffers from particularly severe replication problems, compounded by methodological challenges (Chapter 30 will explore this in detail).
The learning styles hypothesis — the widely believed claim that students learn better when instruction is matched to their preferred learning style (visual, auditory, kinesthetic) — has been debunked in multiple systematic reviews. A 2008 review by Pashler and colleagues found "virtually no evidence" supporting the learning styles hypothesis. Yet the concept persists in teacher training programs worldwide, is built into educational software, and is endorsed by a majority of teachers in survey studies.
Learning styles is a case where the replication problem and the zombie idea (Chapter 16) converge: the claim has been tested and has failed — and it persists anyway. The persistence is not due to lack of evidence against the claim but to the structural forces that maintain it: teacher training programs that teach it (textbook inertia), educational technology companies that market products based on it (economic incentives), and the intuitive appeal of the idea to teachers and students (the plausible story problem). The evidence has been checked, found wanting, and ignored — a pattern that Part II as a whole seeks to explain.
Forensic Science: Never Replicated in the First Place
As we discussed in Chapter 3 (case study 2), the 2009 NAS report found that most forensic disciplines — bite mark analysis, hair microscopy, blood spatter interpretation, and others — had never been scientifically validated at all. They had been adopted based on practitioner experience and courtroom testimony without the systematic testing that would constitute either initial validation or replication.
This is a particularly pure form of the replication problem: the evidence was never established in the first place. There was nothing to replicate because there was never an initial rigorous study. The practices were treated as scientific on the basis of authority and tradition rather than evidence — and when the evidence was finally evaluated, it was found to be absent.
🔗 Connection: The replication problem interacts with publication bias (Chapter 5) in a vicious cycle. Publication bias ensures that the published literature overrepresents positive findings. The replication crisis reveals that many of these positive findings are false. Together, they mean that the published scientific literature — the most reliable source of knowledge available to humanity — is systematically biased in a specific direction: toward findings that are surprising, positive, and potentially unreliable.
10.5 Active Right Now: Where the Replication Problem May Be Operating
Machine learning and AI research. The field has begun its own replication reckoning. Studies have found that published ML results often cannot be replicated due to undisclosed hyperparameter tuning, selective reporting of best-of-many-runs results, and sensitivity to implementation details that are not documented in papers. The ML community is beginning to adopt reproducibility checklists and code-sharing requirements, but the competitive pressure to publish state-of-the-art results creates strong incentives for selective reporting.
Nutrition science. As Chapter 26 will explore in depth, nutrition research has particularly severe replication problems due to small samples, confounded observational designs, industry funding, and the sheer difficulty of conducting randomized controlled trials on diet. The "coffee is good/bad" oscillation in health news is a direct consequence of publishing unreplicated, underpowered studies that produce contradictory results through random variation.
Management and organizational research. Much of the evidence base for management practices — leadership styles, team dynamics, organizational culture — is based on studies with small samples, retrospective designs, and heavy researcher degrees of freedom. Systematic replication efforts in management research are rare, and the field has not yet had the kind of reckoning that psychology experienced after 2015.
Clinical psychology and therapy research. While some therapeutic approaches (CBT for anxiety and depression, exposure therapy for phobias) have strong evidence bases, many widely practiced therapeutic techniques have never been rigorously tested against active controls. The distinction between "therapy works" (likely true for many conditions) and "THIS therapy works because of THESE mechanisms" (often untested) is frequently blurred.
10.6 What It Looked Like From Inside
Consider the perspective of a social psychology researcher in 2010, just before the crisis broke:
- You are a productive researcher who has published regularly in top journals. Your studies use the same methods, sample sizes, and statistical criteria as everyone else in the field. Your findings are interesting, well-received, and frequently cited.
- You know about p-hacking in the abstract — everyone has heard the warnings about multiple comparisons. But in practice, the garden of forking paths doesn't feel like manipulation. It feels like exploration. You try different analyses because you're genuinely trying to understand the data. When one analysis produces a significant result, you don't think "I've just cherry-picked" — you think "I've found the signal in the noise."
- You HARK occasionally — everyone does. You see an interesting pattern in the data and write it up as though you predicted it. This doesn't feel dishonest; it feels like efficient communication. Why burden the reader with the exploratory history when the finding is what matters?
- You rarely replicate others' work because there's no career incentive to do so. You rarely have your own work replicated because the norm is to build on published findings, not to verify them. The entire field operates on trust — trusting that published results are reliable, trusting that peer review has caught errors, trusting that the methods of the field produce valid results.
- When the Reproducibility Project results come out, and 64% of your field's findings don't replicate, the initial reaction is not "we were doing it wrong." It is "the replication method must be flawed." This is not denial in the psychological sense — it is a genuine methodological concern. But it is also the Semmelweis reflex (Chapter 2): the automatic rejection of evidence that challenges the established practice.
From inside this perspective, the replication crisis feels like an attack on the field rather than a diagnosis of the field. The researcher who has followed all the field's accepted practices for their entire career is now told that those practices systematically produce unreliable results. The implication is not just that specific findings are wrong but that the methodology of the entire field is flawed — a far more threatening conclusion than any individual replication failure.
This is why the response to the replication crisis has been so fraught. It's not just about correcting specific findings. It's about confronting the possibility that the field's way of knowing — its methods, its statistics, its publication practices — has been generating unreliable knowledge for decades. The sunk cost (Chapter 9) of the methodology itself — not just individual findings but the entire methodological infrastructure — is enormous.
The response within psychology split along generational and institutional lines. Younger researchers — who had less sunk cost in the old methods — tended to embrace the crisis as an opportunity for reform. Senior researchers — who had built careers on the old methods — tended to resist, defend, and qualify. Some prominent psychologists publicly dismissed the replication crisis as a "manufactured controversy" or as a "witch hunt" against established researchers. Others welcomed it as a long-overdue reckoning.
The most productive response came from researchers who could see both sides: acknowledging that the crisis was real and severe while also recognizing that most researchers had not been deliberately fraudulent — they had been operating rationally within a system that produced unreliable results. The structural framing of this book — that the failure is in the system, not the people — was eventually adopted by most participants in the debate. But it took years, and the wounds are still healing.
🪞 Learning Check-In
Pause and reflect:
- Have you ever read a research finding and assumed it was reliable because it was published? Does this chapter change how you evaluate published evidence?
- In your field, how confident are you that the foundational findings have been independently replicated? Have you ever checked?
- If you discovered that a finding central to your work was a false positive, what would you do?
10.7 The Structural Diagnosis: Why Checking Is Disincentivized
We can now see the full structural picture. The replication problem is not a bug in the system — it is a predictable consequence of how the system is designed.
| System Feature | Effect on Replication |
|---|---|
| Publish-or-perish career structure | Discourages replication (not novel) |
| Journal novelty bias | Rejects replication studies |
| Funding for innovation only | Won't fund replication |
| Social cost of challenging colleagues | Makes failed replication risky |
| p < 0.05 threshold | Creates false positives |
| Researcher degrees of freedom | Inflates apparent effect sizes |
| Publication bias (Ch.5) | Filters out null results |
| Sunk cost (Ch.9) | Resists acknowledging unreliable foundations |
| Streetlight effect (Ch.4) | Citation metrics reward novel claims, not verified ones |
Every feature of the system points in the same direction: toward producing novel, surprising, publishable findings — and away from verifying whether those findings are true. The system is optimized for production, not for verification. It is as if the manufacturing industry had a quality assurance department but never actually inspected any products.
📐 Project Checkpoint
Your Epistemic Audit — Chapter 10 Addition
Return to your audit target and ask:
Has the foundational evidence in your field been independently replicated? For the 3-5 core claims you identified in Chapter 3, can you find independent replications from different research groups?
What is the replication culture in your field? Is replication valued, tolerated, or discouraged? Are there journals that publish replications? Are there funding mechanisms for verification?
What are the researcher degrees of freedom? When a study in your field is published, how many analytical choices were made? Could different choices have produced different results?
What would it cost to replicate the key findings? In time, money, and career risk — what would it take to verify the foundational evidence? Has anyone tried? If not, why not?
Add 300–500 words to your Epistemic Audit document.
10.8 The Reform Movement: Open Science
The replication crisis has generated a significant reform movement, under the umbrella of Open Science, that aims to redesign the incentive structures responsible for the problem.
Pre-Registration
Researchers publicly declare their hypothesis, methods, and analysis plan before collecting data. This eliminates HARKing (the hypothesis is committed in advance) and reduces p-hacking (the analysis plan is fixed in advance). Pre-registration doesn't eliminate all analytical flexibility, but it distinguishes between pre-planned analyses (confirmatory) and post-hoc analyses (exploratory) — a distinction that is critical for interpreting statistical results.
Registered Reports
Journals commit to publishing a study based on the methodology (before data collection), not on the results (after). This eliminates publication bias at its source: null results get published at the same rate as positive results, because the publication decision was made before the results were known.
Early evidence suggests that registered reports produce substantially different findings than traditional publications — with more null results and smaller effect sizes, exactly as the publication bias model predicts. A 2023 analysis found that registered reports produced null results approximately 55% of the time, compared to roughly 5-10% in traditional publications. This gap — from 5% null to 55% null — is a direct measure of how much the traditional publication process distorts the evidence base. The truth is that most hypotheses fail; the publication system hides this reality.
Open Data and Open Materials
Researchers share their raw data and analysis code alongside their publications, allowing others to verify the analyses and test alternative analytical approaches. This doesn't prevent p-hacking, but it makes it detectable — which creates a deterrent.
Multi-Lab Replication
Large-scale, coordinated replication efforts — involving dozens of laboratories simultaneously running the same study — provide definitive evidence about whether a finding is reliable. The Many Labs project and similar initiatives have replicated some famous findings and failed to replicate others, providing the field with a ground truth against which to calibrate.
Limitations of the Reform
The Open Science movement is genuinely promising, but it faces structural headwinds:
- Pre-registration adds effort and reduces flexibility, which researchers may resist
- Registered reports require journals to commit space to potentially unexciting null results
- Open data raises privacy and intellectual property concerns
- Multi-lab replications are expensive and require coordination
- Most importantly: the career incentive structure has not changed. As long as tenure committees evaluate researchers primarily on novel publications in prestigious journals, the pressure to produce novel, surprising findings will remain. Some universities have begun to include replication contributions and open science practices in tenure evaluations, but these remain exceptions rather than the norm. The reform is happening at the level of research practices while the incentive structure that drives those practices remains largely unchanged — which is like treating the symptoms while leaving the disease intact.
✅ Best Practice: When evaluating evidence in any field, give substantially more weight to pre-registered studies than to non-pre-registered studies, and more weight to replicated findings than to unreplicated ones. This is the single most effective calibration for navigating a literature affected by the replication crisis.
10.9 Practical Considerations: Living With Unreplicated Evidence
We cannot wait for every finding to be replicated before acting on it. Much of the evidence we rely on — in medicine, policy, education, business — has not been independently verified. The question is how to make good decisions in a world where the evidence base is unreliable.
Strategy 1: Weight by Replication Status
Create an explicit hierarchy: replicated findings > pre-registered findings > large-sample findings > single published studies. Apply this hierarchy consistently when evaluating evidence in your field.
Strategy 2: Look for Convergent Evidence
A finding supported by multiple methods (experiments, observational studies, case studies, theoretical predictions) is more likely to be real than one supported by a single method, even if none of the individual studies has been formally replicated. Convergent evidence from different methodologies is a partial substitute for direct replication.
Strategy 3: Attend to Effect Sizes, Not Just Significance
A finding can be "statistically significant" with a tiny, practically meaningless effect size. When evaluating evidence, ask: "How large is the effect?" not just "Is it significant?" Small effects that are barely detectable in large samples are the findings most likely to be artifacts of p-hacking and publication bias.
Strategy 4: Apply the "Would This Replicate?" Question
Before accepting any finding, ask: "If another research group repeated this study with a new sample, would they get the same result?" Your intuitive answer — based on the methodology, sample size, effect size, and researcher degrees of freedom — is often surprisingly accurate. If the answer is "probably not," calibrate your confidence accordingly.
Strategy 5: Distinguish Between the Claim and the Evidence
A claim can be true even if the specific study supporting it is unreliable. "Unconscious biases affect decision-making" is probably true — supported by converging evidence from many domains — even though some specific priming studies didn't replicate. The replication crisis should make you skeptical of specific published studies, not necessarily of the broader claims those studies were intended to support. The broader claim may be valid; the specific evidence may not be.
This distinction is important because the replication crisis has been weaponized by people who want to dismiss findings they dislike: "Social psychology is all fake, so implicit bias doesn't exist." This is a non sequitur. The replication crisis means that specific studies must be evaluated more carefully — not that entire fields of inquiry are invalid.
⚠️ Common Pitfall: The most dangerous response to the replication crisis is to become either uncritically accepting ("it's published, so it must be true") or uncritically dismissive ("nothing in the literature can be trusted"). Both are wrong. The appropriate response is calibrated skepticism: weighting replicated findings more heavily, attending to effect sizes and sample sizes, and recognizing that the published literature is a biased but still informative source of evidence.
10.10 Chapter Summary
Key Arguments
- The replication crisis is not confined to psychology — it extends across medicine, economics, education, and forensic science
- The crisis is caused by structural incentives (publish-or-perish, journal novelty bias, funding for innovation only) that systematically discourage verification and reward practices that inflate false positive rates
- P-hacking, HARKing, and researcher degrees of freedom are systemic features of these incentive structures, not primarily evidence of individual misconduct
- The Open Science movement (pre-registration, registered reports, open data) represents a genuine attempt at structural reform, but faces headwinds from unchanged career incentive structures
- The practical response is to calibrate confidence based on replication status, convergent evidence, and effect size
Key Debates
- Can the Open Science reforms succeed without changing the tenure and promotion system?
- Should replications count equally with original research for career advancement?
- Is the 36% replication rate in psychology an underestimate or overestimate of the field's reliability?
- How should policy and practice respond to findings that haven't been replicated?
Analytical Framework
- The four structural incentives against replication
- The three practices that inflate false positives (p-hacking, HARKing, researcher degrees of freedom)
- The evidence hierarchy (replicated > pre-registered > large-sample > single study)
- The "Would this replicate?" question as a quick diagnostic
Spaced Review
Revisiting earlier material to strengthen retention.
- (From Chapter 4) Goodhart's Law says "when a measure becomes a target, it ceases to be a good measure." How does this apply to the p < 0.05 threshold? What happened when "significance" became the target?
- (From Chapter 5) Publication bias (the file drawer problem) selectively publishes positive results. The replication crisis shows that many published positive results are false. How do these two problems compound each other?
- (From Chapter 9) The sunk cost of consensus keeps wrong answers in place. How does the sunk cost of the methodological infrastructure (p-values, significance testing, researcher degrees of freedom) interact with the replication crisis?
Answers
1. P < 0.05 was originally a reasonable threshold for evaluating individual studies. When it became the *target* that researchers had to hit for publication, Goodhart's Law activated: researchers optimized for the metric (p-hacking, HARKing) and the metric ceased to be a reliable indicator of real effects. The threshold didn't change; its meaning changed because hitting it became a goal rather than a discovery.
2. Publication bias ensures that only "significant" results are published. The replication crisis reveals that many "significant" results are false. Together: the published literature is biased toward exactly the kind of findings most likely to be unreliable. The file drawer contains the null results that would correct the bias — but no one sees them.
3. The methodological infrastructure (significance testing, conventional sample sizes, analytical flexibility) has accumulated massive sunk cost: decades of training, thousands of textbooks, entire careers built on these methods. Acknowledging that the methodology produces unreliable results threatens all of this investment. The replication crisis is, in part, a sunk cost problem: the field invested too heavily in methods that turn out to be inadequate, and the switching cost of adopting better methods is high.
What's Next
In Chapter 11: How Incentive Structures Manufacture Error, we'll examine the third persistence mechanism: the systematic ways in which the business model of knowledge production — funding, publishing, evaluating, and rewarding research — creates structural biases toward wrong answers. If the replication problem shows that nobody checks the homework, Chapter 11 shows why the homework is designed to get the wrong answer in the first place.
Before moving on, complete the exercises and quiz to solidify your understanding.
Chapter 10 Exercises → exercises.md
Chapter 10 Quiz → quiz.md
Case Study: The Ego Depletion Saga — The Rise and Fall of a Textbook Finding → case-study-01.md
Case Study: Preclinical Cancer Research — When 89% of "Landmark" Studies Don't Replicate → case-study-02.md
Related Reading
Explore this topic in other books
- How Humans Get Stuck — Field Autopsy: Psychology
- Intro to Data Science — Reproducibility and Collaboration
- Media Literacy — Scientific Misinformation