Case Study 7.2: The Replication Crisis — How Small Samples Broke Modern Science


The Moment It Became Undeniable

In 2011, a paper by Daryl Bem — a respected, senior Cornell psychologist — was published in a top psychology journal. The paper claimed to provide experimental evidence for precognition: the ability to sense future events before they happen. Bem reported nine experiments with apparently significant results showing that human beings could, at a rate slightly above chance, "feel the future."

The psychology community was rattled. Not because anyone thought ESP was real, but because Bem had used entirely standard methods — the same methods used in thousands of other published papers. His sample sizes were typical. His statistical analyses were conventional. His p-values were below 0.05.

If the field's methods could produce evidence for precognition, what did that say about the methods?

The answer, which emerged over the next decade, was uncomfortable: the methods had a problem, and small samples were at the center of it.


The Open Science Collaboration Study

In 2015, Brian Nosek of the University of Virginia led the Open Science Collaboration — a consortium of 270 researchers across many institutions — in the most ambitious replication study in scientific history.

Their goal: take 100 published psychology studies and attempt to replicate them exactly, using the original authors' methods, pre-registered hypotheses, and adequate statistical power.

The results, published in Science (Open Science Collaboration, 2015, "Estimating the Reproducibility of Psychological Science"), were striking:

  • 97 of the 100 original studies reported statistically significant results.
  • Only 36-39 of the replications counted as successes, depending on how replication success is defined (36 produced statistically significant results).
  • The average effect size in replications was about half the size reported in the original studies.
  • Studies with smaller original samples were less likely to replicate.
  • Studies with smaller original p-values (e.g., p = 0.001 rather than p = 0.04) were more likely to replicate.

This was not a scandal about fraud or dishonesty. Most of the original researchers were honest scientists doing their best. The problem was structural.


Why Small Samples Were Largely to Blame

The replication crisis has multiple causes — publication bias, p-hacking, flexibility in data analysis, and more. But underpowered studies (too-small samples) are at or near the center of every explanation.

Here is the mechanism, step by step:

Step 1: Researchers run studies with small samples.

A typical psychology study in the 1990s–2010s used 20-50 participants. This was partly due to resource constraints (running subjects costs money and time) and partly because the field had no systematic norm requiring larger samples.

Step 2: Small samples produce noisy estimates.

With a small sample, your estimate of the effect size bounces around wildly. The law of large numbers hasn't had enough trials to stabilize it. The true effect might be 0.3 standard deviations, yet your data could show anything from roughly zero to 0.7 or more.

Step 3: Only significant results get published.

If your result crosses p < 0.05, it goes to a journal. If it doesn't, it goes in the file drawer (or, in recent years, an unpublished dataset). This is "publication bias." Journals overwhelmingly prefer positive results.

Step 4: The studies that get published are those that got lucky with their small samples.

With small samples and random variation, some studies will observe larger-than-true effects by chance. These are the studies most likely to cross the significance threshold and get published. The effect size in print is, on average, inflated.

Step 5: Replication attempts use the reported effect size to plan sample size.

When a replication team reads that an effect is 0.6 standard deviations, they design a study powered to detect 0.6 standard deviations. But the true effect is only 0.3. Their replication is now underpowered for the true effect — and likely to fail.

This is the full mechanism. Honest researchers, doing their best, produce a literature that is systematically unreliable because the individual sample sizes are too small for the law of large numbers to have worked.
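
To see how strong this selection effect is, it helps to simulate it. The sketch below (Python with NumPy and SciPy; the true effect of 0.3, the group size of 20, and the 5,000 simulated studies are illustrative assumptions, not figures from any study discussed here) runs many small two-group experiments, applies the publication filter from Step 3, and compares the average published effect to the truth.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(7)

    true_effect = 0.3     # true group difference, in standard deviations (assumed)
    n_per_group = 20      # a typically small sample (assumed)
    n_studies = 5000      # independent labs running the same experiment (assumed)

    all_effects = []      # observed effects from every study, published or not
    published = []        # observed effects that cross p < 0.05

    for _ in range(n_studies):
        control = rng.normal(0.0, 1.0, n_per_group)
        treatment = rng.normal(true_effect, 1.0, n_per_group)
        observed = treatment.mean() - control.mean()   # observed effect, roughly in SD units
        _, p = stats.ttest_ind(treatment, control)
        all_effects.append(observed)
        if p < 0.05:                                   # the publication filter (Step 3)
            published.append(observed)

    print(f"true effect:                {true_effect:.2f}")
    print(f"mean observed effect, all:  {np.mean(all_effects):.2f}")
    print(f"mean published effect:      {np.mean(published):.2f}")
    print(f"share of studies published: {len(published) / n_studies:.0%}")

Under these assumed numbers, only a small minority of studies clear the threshold, and the ones that do report an average effect roughly two to three times the true value. No one cheated; the filter did all the work.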


The Role of the Winner's Curse

The statistician Andrew Gelman and colleagues have emphasized an important related point: in a literature with small samples and publication bias, the studies that get published are not a random sample of the truth. They are the studies that got lucky.

This creates what is often called the "winner's curse": the published effect size is almost always an overestimate of the true effect. The study "won" the random lottery of getting a large enough observed effect to cross the significance threshold. That win came partly through genuine effect and partly through lucky noise.

When a replication team tries to repeat the study, they don't get the same lucky noise. They get a more typical noise draw — which produces a smaller observed effect, often below the significance threshold. The replication "fails" not because the original was fraudulent but because the original was lucky.

Over the entire literature, this means: a field built on small samples contains many more false and inflated findings than it contains true ones, even assuming everyone played by the rules.
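
This replication arithmetic (Step 5 above) can be made concrete with the standard normal-approximation formula for a two-sample comparison, n per group ≈ 2 * (z_alpha + z_beta)^2 / d^2, where z_alpha and z_beta are the standard normal quantiles for the significance level and the desired power. In the sketch below, the published effect of 0.6 and the true effect of 0.3 are the same illustrative numbers used earlier, not values from any particular study.

    from scipy.stats import norm

    def n_per_group(d, alpha=0.05, power=0.80):
        """Approximate sample size per group for a two-sample comparison."""
        z_alpha = norm.ppf(1 - alpha / 2)
        z_beta = norm.ppf(power)
        return 2 * (z_alpha + z_beta) ** 2 / d ** 2

    def achieved_power(d, n, alpha=0.05):
        """Approximate power against a true effect d with n per group."""
        z_alpha = norm.ppf(1 - alpha / 2)
        return norm.cdf(d * (n / 2) ** 0.5 - z_alpha)

    published_d = 0.6    # the inflated effect reported in the literature (assumed)
    true_d = 0.3         # the true underlying effect (assumed)

    n_planned = n_per_group(published_d)
    print(f"replication planned for d = {published_d}: about {n_planned:.0f} per group")
    print(f"power of that plan against d = {true_d}: {achieved_power(true_d, n_planned):.0%}")
    print(f"n per group actually needed for d = {true_d}: about {n_per_group(true_d):.0f}")

Planned around the published effect, the replication needs only about 44 people per group but has under 30% power against the true effect; detecting the true effect reliably would take roughly 175 per group. The replication "fails" most of the time even though the effect is real.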


Fields Affected

The replication crisis is not unique to psychology. Analyses of similar scope and methods have found comparable problems in:

Neuroscience: Button et al. (2013) found median power of ~21% in neuroscience studies.

Medicine: Ioannidis (2005), in the landmark paper "Why Most Published Research Findings Are False," demonstrated mathematically that under typical conditions (small samples, multiple comparisons, publication bias), the majority of published medical findings are likely false positives.

Economics: Large-scale replication projects in experimental economics have found that a substantial share of published laboratory results fail to replicate, and behavioral findings such as nudging interventions have faced similar challenges.

Social psychology: The field that produced some of the most famous psychological findings — ego depletion, power poses, the predictive validity of the implicit association test — has faced severe challenges to many of its landmark results.


The Reforms That Followed

The replication crisis produced a scientific reform movement. Its key tools are directly related to the statistical concepts in this chapter:

Pre-registration: Researchers commit to their hypotheses, sample size, and analysis plan before data collection. This prevents researchers from fitting post-hoc stories to the data and guards against the multiple comparisons problem.

Larger required sample sizes: Journals and funding agencies are increasingly requiring power analyses demonstrating adequate sample size before a study begins. The field has moved toward requiring 80-90% power rather than accepting whatever power the budget allows.

Registered Reports: Some journals now accept papers for publication based on the quality of the study design — before seeing the results. This eliminates publication bias against null results.

Open data and open materials: Researchers share their raw data and analysis code, allowing others to check and re-analyze results.

Effect size reporting: Beyond p-values, the field now emphasizes reporting effect sizes with confidence intervals, which communicate both the direction and the uncertainty in the estimate.
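
As a small illustration of that last reform, the sketch below computes Cohen's d for two groups along with a rough 95% confidence interval, using a common large-sample approximation for the standard error of d. The data are simulated for the example; nothing here comes from the studies discussed above.

    import numpy as np

    def cohens_d_with_ci(group_a, group_b, z=1.96):
        """Cohen's d with an approximate 95% confidence interval."""
        n_a, n_b = len(group_a), len(group_b)
        pooled_var = ((n_a - 1) * np.var(group_a, ddof=1) +
                      (n_b - 1) * np.var(group_b, ddof=1)) / (n_a + n_b - 2)
        d = (np.mean(group_a) - np.mean(group_b)) / np.sqrt(pooled_var)
        # Large-sample approximation to the standard error of d
        se = np.sqrt((n_a + n_b) / (n_a * n_b) + d ** 2 / (2 * (n_a + n_b)))
        return d, d - z * se, d + z * se

    rng = np.random.default_rng(42)
    treatment = rng.normal(0.3, 1.0, 40)   # simulated data with an assumed true effect of 0.3
    control = rng.normal(0.0, 1.0, 40)

    d, lower, upper = cohens_d_with_ci(treatment, control)
    print(f"d = {d:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")

An interval like this makes the uncertainty visible in a way a bare p-value never does: with 40 people per group, the same data can be consistent with effects ranging from near zero to fairly large.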

These reforms are working. Studies published since 2015 with pre-registration replicate at higher rates than older studies without it.


What This Means for the Rest of Us

The replication crisis is not just a story about academic science. It's a story about how human beings in any domain — content creation, startup strategy, sports coaching, investing — can generate systematic false knowledge from small samples.

When Nadia concludes that Wednesday 7 p.m. is her "winning" posting time after six weeks of data, she has run her own underpowered study. When a startup pivots based on three months of trending data, it has made a decision on an underpowered analysis. When a coach decides that praise makes athletes perform worse (Chapter 8 preview), they've misread regression to the mean in a small sample.

The law of large numbers is not just a theorem in probability textbooks. It is the operating system of reliable knowledge. When sample sizes are too small, the operating system hasn't had time to run. The output is noise that looks like signal, and we build entire frameworks of understanding on top of that noise.

The good news: we now know this problem exists, we know its mechanism, and we know how to fix it. The fix requires patience, rigor, and intellectual humility. It requires treating small samples as starting points for investigation, not conclusions for action.


Discussion Questions

  1. The Open Science Collaboration (2015) study found that about 61% of attempted replications failed. Does this mean psychology is not a real science? Defend your answer using the distinction between the field's methods and the validity of its subject matter.

  2. The "winner's curse" says that studies that get published tend to have overestimated effect sizes. How might this affect how you read news stories about scientific findings? What questions should you always ask?

  3. Some researchers have argued that pre-registration limits scientific creativity by forcing researchers to commit to hypotheses before they can explore their data. Is there a way to preserve the benefits of exploratory research while also guarding against false findings?

  4. Consider your own life as a study. Think of a belief you hold about yourself ("I perform better under pressure," "I make better decisions in the morning"). How many data points does that belief rest on? Is it a pre-registered hypothesis or a post-hoc explanation? What would a proper test look like?

  5. The medical replication crisis means that some treatments currently in use may be based on false or inflated findings. How should this knowledge affect your relationship with health information? How can you evaluate medical claims more rigorously?