Case Study 2: The Replication Crisis — When Significant Results Disappear


Tier 2 — Attributed Findings: This case study discusses a well-documented and ongoing crisis in scientific research. Key findings and statistics are attributed to published studies and reports, including the Open Science Collaboration's 2015 Reproducibility Project (published in Science), John Ioannidis's 2005 essay "Why Most Published Research Findings Are False" (published in PLOS Medicine), and reports from the Center for Open Science. Specific studies mentioned as examples of replication failures are based on widely reported cases in the scientific literature. Details have been simplified for pedagogical clarity.


The Earthquake

In 2015, the Open Science Collaboration, a group of 270 researchers, published the results of an extraordinary project in the journal Science. They had taken 100 psychology studies that had been published in top journals, all reporting statistically significant results (p < 0.05), and attempted to replicate each one: to repeat the experiments as closely as possible and see whether the same results emerged.

The findings shook the scientific world: only about 36% of the replications produced statistically significant results consistent with the original findings. The average effect size in the replications was roughly half of what the original studies had reported.

In other words, about two-thirds of published "discoveries" — all of which had passed the p < 0.05 threshold, all of which had been peer-reviewed and published in reputable journals — could not be reproduced.

This wasn't just a psychology problem. Similar replication projects in other fields found comparable rates of failure. A 2012 effort by the pharmaceutical company Amgen attempted to replicate 53 "landmark" cancer biology studies; they could reproduce only 6 (11%). A 2011 analysis at Bayer reported that about two-thirds of published drug target validation studies failed internal replication.

This became known as the replication crisis, and it has fundamentally changed how scientists — and data scientists — think about statistical evidence.

How Did We Get Here?

The replication crisis didn't happen because scientists are dishonest (though fraud exists, it's rare). It happened because of a set of systemic problems that interact with the machinery of hypothesis testing in ways that almost guarantee inflated results.

Problem 1: Publication Bias

Scientific journals overwhelmingly publish studies with statistically significant results. Studies that find "no effect" (p > 0.05) are much harder to publish. This creates a filter: the published literature is a biased sample of all research that's been conducted.

Imagine 100 research groups all studying the same question. Suppose the null hypothesis is true — there is no real effect. By chance, about 5 groups will get p < 0.05 (that's what α = 0.05 means). Those 5 groups publish their results. The other 95 groups get non-significant results and either don't submit for publication or get rejected. Someone reading the published literature would see five studies all finding a "significant" effect, with zero studies finding no effect. The evidence looks overwhelming — but it's entirely composed of false positives.

This problem is sometimes called the file drawer problem — the non-significant results are stuck in researchers' file drawers, invisible to the scientific community.

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

np.random.seed(42)

# Simulate 100 studies testing a null effect
n_studies = 100
n_per_group = 50
true_effect = 0  # The null is TRUE

published_effects = []
all_effects = []

for _ in range(n_studies):
    group1 = np.random.normal(0, 1, n_per_group)
    group2 = np.random.normal(true_effect, 1, n_per_group)
    effect = group2.mean() - group1.mean()
    _, p = stats.ttest_ind(group1, group2)
    all_effects.append(effect)
    if p < 0.05:
        published_effects.append(effect)

print(f"Total studies: {n_studies}")
print(f"Published (p < 0.05): {len(published_effects)}")
print(f"True effect: {true_effect}")
# Report magnitudes: a two-sided test flags large effects in BOTH directions,
# so a signed average of the published effects can cancel toward zero
print(f"Average |effect| across ALL studies: {np.mean(np.abs(all_effects)):.3f}")
print(f"Average |effect| in PUBLISHED studies: {np.mean(np.abs(published_effects)):.3f}")
print(f"\nThe published literature dramatically overestimates the effect size!")

Problem 2: P-Hacking (Researcher Degrees of Freedom)

Even well-intentioned researchers have many decisions to make during analysis: which variables to include, how to handle outliers, when to stop collecting data, which subgroups to examine, which statistical test to use. Each of these decisions is a "degree of freedom," and each one is an opportunity to accidentally (or deliberately) push p-values below 0.05.

A famous 2011 paper by Simmons, Nelson, and Simonsohn, titled "False-Positive Psychology," demonstrated this vividly. Through a combination of legitimate-seeming analytical choices, they were able to show — with p < 0.05 — that listening to the Beatles song "When I'm Sixty-Four" literally made people younger. (It didn't, of course. They used the absurd result to demonstrate how flexible analysis can produce any result you want.)

# Simulating p-hacking
np.random.seed(42)

# A researcher with "flexible" analysis choices
n_strategies = 50  # 50 different ways to analyze the same data
base_data = np.random.normal(0, 1, (2, 30))  # Two groups, no real difference

p_values = []
for i in range(n_strategies):
    # Each "strategy" stands in for a slightly different analysis of the
    # same data; the added noise is a proxy for varying subsets,
    # transformations, covariate choices, etc.
    noise = np.random.normal(0, 0.3, (2, 30))
    modified_data = base_data + noise * (i / n_strategies)

    g1 = modified_data[0]
    g2 = modified_data[1]

    # Try removing different "outliers"
    if i % 3 == 0:
        g1 = g1[g1 > np.percentile(g1, 10)]
        g2 = g2[g2 > np.percentile(g2, 10)]
    elif i % 3 == 1:
        g1 = g1[g1 < np.percentile(g1, 90)]
        g2 = g2[g2 < np.percentile(g2, 90)]

    _, p = stats.ttest_ind(g1, g2)
    p_values.append(p)

min_p = min(p_values)
print(f"Tested {n_strategies} analytical strategies on the SAME data")
print(f"Smallest p-value found: {min_p:.4f}")
print(f"'Significant' at α = 0.05? {'Yes!' if min_p < 0.05 else 'No'}")
print(f"\nNumber of strategies yielding p < 0.05: "
      f"{sum(1 for p in p_values if p < 0.05)}")

Problem 3: Low Statistical Power

Many studies — especially in fields like psychology and neuroscience — have historically used sample sizes that provide very low statistical power (often 20-50% power to detect realistic effect sizes).

Low power has a counterintuitive consequence: among the studies that DO find significant results, the estimated effect sizes are inflated. This happens because only the samples that, by luck, produced large effect estimates manage to cross the significance threshold. The studies that, by luck, produced smaller estimates (even if the true effect is the same) fall below the threshold and disappear into the file drawer.

This is the winner's curse or significance filter: the act of selecting for significance inflates the apparent effect size.

# Demonstrating the significance filter
np.random.seed(42)

true_d = 0.3  # True effect size (small)
n_per_group = 30  # Low power at d = 0.3 with n = 30 per group
n_studies = 1000

significant_effects = []
all_effects = []

for _ in range(n_studies):
    g1 = np.random.normal(0, 1, n_per_group)
    g2 = np.random.normal(true_d, 1, n_per_group)
    observed_d = (g2.mean() - g1.mean()) / np.sqrt(
        ((n_per_group-1)*g1.var(ddof=1) + (n_per_group-1)*g2.var(ddof=1)) /
        (2*n_per_group - 2)
    )
    _, p = stats.ttest_ind(g1, g2)
    all_effects.append(observed_d)
    if p < 0.05:
        significant_effects.append(observed_d)

print(f"True effect size: d = {true_d:.2f}")
print(f"Average observed d (all studies): {np.mean(all_effects):.3f}")
print(f"Average observed d (significant only): {np.mean(significant_effects):.3f}")
print(f"Inflation factor: {np.mean(significant_effects)/true_d:.1f}x")
print(f"\nPower: {len(significant_effects)/n_studies*100:.0f}%")
print(f"The published effect size is inflated by "
      f"{(np.mean(significant_effects)/true_d - 1)*100:.0f}%!")
True effect size: d = 0.30
Average observed d (all studies): 0.302
Average observed d (significant only): 0.682
Inflation factor: 2.3x

Power: 13%
The published effect size is inflated by 127%!

The published literature would report an average effect size of about d = 0.68 — more than double the true effect of d = 0.30. When someone tries to replicate with a sample size designed for d = 0.68, they'll find a much smaller effect and likely call the replication a "failure."

Problem 4: HARKing and Outcome Switching

HARKing (Hypothesizing After Results are Known) occurs when researchers examine their data, find an unexpected "significant" result, and then write the paper as if they had predicted it all along. The hypothesis test framework assumes the hypothesis was stated before the data was examined. When the hypothesis is generated from the data, the p-value is meaningless — it's circular reasoning.

Outcome switching is a related problem: a study pre-registers one primary outcome (say, overall cholesterol) but then reports results for a different outcome (say, LDL cholesterol) because that's the one that happened to be significant.
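The circularity is easy to see in a short simulation. In this sketch (the subgroup count, sample sizes, and seed are illustrative, not from any real study), scanning twenty null subgroups for the most extreme difference all but guarantees an impressive-looking p-value; collecting fresh data for the "discovered" hypothesis gives an honest test:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# 20 candidate subgroups, ALL drawn from the same null distribution
n_groups, n_per_group = 20, 30
baseline = rng.normal(0, 1, n_per_group)
subgroups = rng.normal(0, 1, (n_groups, n_per_group))

# HARKing: scan the data first, then "hypothesize" the most extreme subgroup
p_vals = [stats.ttest_ind(g, baseline).pvalue for g in subgroups]
best = int(np.argmin(p_vals))
print(f"Post-hoc 'discovery': subgroup {best} with p = {p_vals[best]:.4f}")

# An honest confirmation: collect FRESH data for that same hypothesis
new_subgroup = rng.normal(0, 1, n_per_group)
new_baseline = rng.normal(0, 1, n_per_group)
_, p_confirm = stats.ttest_ind(new_subgroup, new_baseline)
print(f"Confirmation on new data: p = {p_confirm:.4f}")
```

The smallest of twenty null p-values is usually far below 0.05, while the confirmation p-value is just a fresh uniform draw, which is the whole point: the "hypothesis" was manufactured by the selection, not supported by it.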

The Fix: What's Being Done

The scientific community has responded to the replication crisis with a range of reforms:

Pre-Registration

Researchers publicly register their hypotheses, sample sizes, and analysis plans before collecting data. The pre-registration is time-stamped and publicly accessible, making it impossible to retroactively change the hypothesis. Platforms like the Open Science Framework (OSF) and ClinicalTrials.gov facilitate pre-registration.

Registered Reports

Some journals now offer registered reports, where the study design is peer-reviewed before data collection. If the design is approved, the journal commits to publishing the results regardless of whether they're significant. This eliminates publication bias at the source.

Larger Sample Sizes and Power Analysis

The replication crisis has pushed researchers to use larger samples and conduct formal power analyses. Journals increasingly require power calculations in submitted manuscripts.

Reporting Effect Sizes and Confidence Intervals

Many journals now require effect sizes and confidence intervals alongside p-values. Some have gone further: in 2015 the journal Basic and Applied Social Psychology banned p-values (and null hypothesis significance testing generally), requiring authors to rely on effect sizes and descriptive statistics instead.
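As a rough illustration of what such reporting looks like in practice (the data below are simulated; the pooled-SD formula for Cohen's d and the t-based interval are standard, but treat this as a sketch rather than a template):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
g1 = rng.normal(0.0, 1.0, 40)
g2 = rng.normal(0.4, 1.0, 40)

# Cohen's d with pooled standard deviation
n1, n2 = len(g1), len(g2)
pooled_sd = np.sqrt(((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1))
                    / (n1 + n2 - 2))
d = (g2.mean() - g1.mean()) / pooled_sd

# 95% t-based confidence interval for the difference in means
diff = g2.mean() - g1.mean()
se = np.sqrt(g1.var(ddof=1) / n1 + g2.var(ddof=1) / n2)
df = n1 + n2 - 2
ci = diff + np.array([-1, 1]) * stats.t.ppf(0.975, df) * se

print(f"Cohen's d = {d:.2f}")
print(f"Difference = {diff:.2f}, 95% CI [{ci[0]:.2f}, {ci[1]:.2f}]")
```

A reader gets direction, magnitude, and uncertainty at a glance, far more than "p < 0.05" alone conveys.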

Open Data and Code

The open science movement encourages (and sometimes requires) researchers to share their raw data and analysis code. This allows others to verify results and catch errors.

Multi-Site Replication Studies

Large-scale replication efforts, where many labs simultaneously attempt to replicate a finding, provide more definitive evidence about whether effects are real.

What This Means for Data Scientists

The replication crisis isn't just a problem for academic scientists. Every principle that contributed to the crisis applies to data science work in industry, government, and nonprofit settings.

Your A/B Test Is a Hypothesis Test

When a tech company tests whether a new website design increases conversion, they're running a hypothesis test. All the same pitfalls apply:

  • Multiple testing: If you test 20 variations, you expect one to be "significant" by chance
  • Early stopping: If you peek at results daily and stop the test when p < 0.05, your false positive rate is much higher than 5%
  • Selective reporting: If you measure 10 metrics and only report the one that's significant, you're cherry-picking
  • Underpowered tests: If your test runs for only three days with low traffic, you might miss a real improvement
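The early-stopping pitfall in particular is easy to quantify by simulation. This sketch (the traffic volume and 14-day horizon are arbitrary, chosen for illustration) estimates the actual false positive rate when an analyst checks a null A/B test daily and stops at the first p < 0.05:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# The null is true: both arms have identical distributions
n_sims, n_days, users_per_day = 2000, 14, 100

false_positives = 0
for _ in range(n_sims):
    a = rng.normal(0, 1, (n_days, users_per_day))
    b = rng.normal(0, 1, (n_days, users_per_day))
    # Peek at the cumulative data every "day"; stop at the first p < 0.05
    for day in range(1, n_days + 1):
        _, p = stats.ttest_ind(a[:day].ravel(), b[:day].ravel())
        if p < 0.05:
            false_positives += 1
            break

print("Nominal false positive rate: 5%")
print(f"Actual rate with daily peeking: {false_positives / n_sims:.1%}")
```

Each peek is another chance to cross the threshold by luck, so the realized false positive rate lands well above the nominal 5%. Sequential testing methods (alpha spending, always-valid p-values) exist precisely to make peeking safe.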

Your Model Evaluation Is a Hypothesis Test

When you report that Model A outperforms Model B with p < 0.05, the same considerations apply. Did you try many models and only report the best one? Did you tune on the test set? Did you account for the multiple comparisons implicit in model selection?
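A toy version of this multiple-comparisons trap: evaluate many models that are pure coin flips on the same test set, then run a significance test on only the best one (the model count and test-set size below are arbitrary):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

# 50 "models" that all guess at random on the same 200-example test set
n_models, n_test = 50, 200
y_true = rng.integers(0, 2, n_test)
correct = [int((rng.integers(0, 2, n_test) == y_true).sum())
           for _ in range(n_models)]
best = max(correct)

# Testing ONLY the best model against 50% chance ignores the selection step
p = stats.binomtest(best, n_test, 0.5, alternative='greater').pvalue
print(f"Best of {n_models} random models: {best}/{n_test} correct "
      f"({best / n_test:.1%}), p = {p:.4f}")
```

Every model here is literally a coin flip, yet the best of fifty typically looks comfortably better than chance. Reporting only the winner's p-value is the model-selection version of publication bias.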

Your Correlations Are Hypothesis Tests

When you find a "significant" correlation in your data, ask: did you compute a correlation matrix with 50 variables (1,225 pairwise tests)? Some of them will be "significant" by chance.
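This is easy to verify directly. The following sketch (50 independent noise variables with 100 observations each, both arbitrary) counts how many of the 1,225 pairwise correlations come out "significant" even though every true correlation is zero:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# 50 completely independent noise variables, 100 observations each
n_vars, n_obs = 50, 100
X = rng.normal(0, 1, (n_obs, n_vars))

significant = 0
n_pairs = 0
for i in range(n_vars):
    for j in range(i + 1, n_vars):
        r, p = stats.pearsonr(X[:, i], X[:, j])
        n_pairs += 1
        if p < 0.05:
            significant += 1

print(f"Pairwise tests: {n_pairs}")  # 50 * 49 / 2 = 1,225
print(f"'Significant' at 0.05: {significant} (~{significant / n_pairs:.0%})")
```

Roughly 5% of the pairs clear the threshold, which means dozens of spurious "findings" from pure noise. Any one of them, reported in isolation, would look like a discovery.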

A Framework for Honest Analysis

Based on the lessons of the replication crisis, here's a framework for conducting trustworthy hypothesis tests in data science:

  1. State your question before exploring the data. If you're testing a specific hypothesis, write it down before you look at results. If you're exploring, be transparent that your findings are exploratory and need confirmation.

  2. Pre-specify your analysis. Decide which test you'll use, which variables you'll examine, and what α you'll use — before running any analysis.

  3. Report everything, not just the significant parts. If you tested five hypotheses and only one was significant, report all five. The significant result might be the false positive.

  4. Always report effect sizes. A significant p-value with a tiny effect size is a non-finding for practical purposes. The effect size is what matters for decisions.

  5. Acknowledge the multiple testing problem. If you ran multiple tests, say so and apply appropriate corrections.

  6. Distinguish exploratory from confirmatory. It's fine to explore data and discover patterns. But discoveries from exploration need to be confirmed on new data.

  7. Provide confidence intervals, not just yes/no decisions. A CI communicates both the direction and the uncertainty of the effect. It's more informative than a binary significant/not-significant verdict.

  8. Share your data and code when possible. Transparency allows others to verify your work and catch mistakes you might have missed.
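For item 5, a correction can be just a few lines. Below is a minimal sketch of the Holm-Bonferroni procedure (the p-values are made up for illustration; in practice a library routine such as statsmodels' `multipletests` could be used instead):

```python
import numpy as np

def holm_reject(p_values, alpha=0.05):
    """Holm-Bonferroni: compare the k-th smallest p-value to alpha / (m - k),
    rejecting until the first failure. Controls the familywise error rate."""
    m = len(p_values)
    order = np.argsort(p_values)
    reject = np.zeros(m, dtype=bool)
    for k, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - k):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject

p_vals = np.array([0.001, 0.008, 0.039, 0.041, 0.20])
print("Uncorrected (p < 0.05):", [bool(p < 0.05) for p in p_vals])
print("Holm-corrected:        ", list(holm_reject(p_vals)))
```

With these illustrative p-values, four of the five tests pass uncorrected, but only the two smallest survive the Holm correction; the borderline results near 0.04 are exactly the kind that multiple testing tends to manufacture.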

Connecting to the Progressive Project

In the progressive project, you tested whether vaccination rates differ between income groups and found a highly significant difference. This finding is likely robust — the effect is large (d > 2), the sample is substantial, and the result is consistent with extensive prior research and common-sense understanding of global health disparities.

But consider: what if you had tested many possible groupings (by region, by continent, by latitude, by population size, by government type) and only reported the most dramatic one? The multiple testing problem would apply. What if the countries in your dataset aren't a random sample of all countries? Selection bias would apply.

Being aware of these issues doesn't invalidate your analysis. It makes it more honest. And in data science, honesty is what builds trust.


Discussion Questions

  1. Publication bias in industry: Do you think publication bias exists in corporate data science (e.g., A/B testing, product analytics)? If a company runs 20 A/B tests and only announces the one that showed improvement, how is this similar to academic publication bias?

  2. The role of α = 0.05: Some scientists have proposed changing the default significance threshold from 0.05 to 0.005. What would be the advantages and disadvantages of this change? Would it solve the replication crisis?

  3. Pre-registration trade-offs: Pre-registration prevents p-hacking but also reduces the flexibility to follow unexpected leads in the data. How would you balance the need for rigor with the value of exploratory discovery?

  4. Personal reflection: Think about a time when you found a "surprising" or "interesting" pattern in data. Looking back, how confident are you that it was real? What would you need to do to confirm it?


Key Takeaway: The replication crisis reveals what happens when the machinery of hypothesis testing — p-values, significance thresholds, publication filters — interacts with human incentives and systemic pressures. The solution isn't to abandon hypothesis testing. It's to use it honestly: pre-register hypotheses, report effect sizes, correct for multiple testing, distinguish exploration from confirmation, and be transparent about uncertainty. Statistical significance is a tool for evidence evaluation, not a stamp of truth.