Case Study 1: The Reproducibility Crisis — When Published Results Cannot Be Replicated


Tier 1 — Verified Concepts: This case study discusses the well-documented reproducibility crisis in science. The Amgen replication study (Begley & Ellis, 2012, published in Nature), the Reproducibility Project: Psychology (Open Science Collaboration, 2015, published in Science), and the Duke University clinical trials scandal (documented in investigative journalism and legal proceedings) are all based on published sources. Statistical concepts and software versioning examples are based on documented phenomena.


The Study That Changed Everything

In 2012, C. Glenn Begley and Lee Ellis published a brief but devastating paper in the journal Nature. Begley, then at the biotechnology company Amgen, described an internal effort to reproduce 53 "landmark" studies in cancer biology — studies that had been published in top scientific journals and that had shaped drug development strategies across the pharmaceutical industry.

Of the 53 studies, Amgen's scientists could reproduce the results of only 6.

An 11% reproduction rate. For studies that had passed peer review, been cited hundreds of times, and influenced billions of dollars in research investment.

Begley and Ellis were careful to note that they were not accusing anyone of fraud. Most of the original researchers had likely conducted their experiments honestly. The failures were due to a combination of factors: small sample sizes, flexible analysis methods, selective reporting of results, and — crucially — the inability to exactly recreate the experimental conditions of the original studies.

Around the same time, a similar effort in psychology — the Reproducibility Project, led by Brian Nosek at the University of Virginia — attempted to replicate 100 studies published in three top psychology journals. The results, published in Science in 2015, were sobering: only 36% of the replications produced statistically significant results in the same direction as the original studies.

The reproducibility crisis was no longer a rumor. It was a documented, quantified failure of the scientific enterprise.

What Does This Have to Do with Data Science?

You might think: "I'm not curing cancer or studying psychology. I'm analyzing datasets in Python. How does this apply to me?"

It applies more directly than you might expect. The same factors that cause irreproducible science cause irreproducible data analysis:

Factor 1: Undocumented Software Environments

In 2014, a computational biology researcher discovered that a widely used genomics tool produced subtly different results when run on different versions of its underlying numerical library. The difference was tiny — in the 15th decimal place for individual calculations — but it accumulated over millions of operations, producing meaningfully different final results.

The researchers who had used this tool in published papers had not recorded which version of the numerical library they were running. When others tried to reproduce their results, they got different answers — and nobody could figure out why, because nobody had thought to record something as "minor" as a library version number.

This is exactly the problem that requirements.txt and virtual environments solve. If you record the exact versions of every library in your analysis environment, this entire class of reproducibility failure largely disappears.
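Beyond pinning versions in requirements.txt, you can record the environment directly inside the analysis itself. The sketch below uses only the standard library; environment_record is a hypothetical helper, not a standard function, and the package names passed to it are examples.

```python
# Minimal sketch: record the interpreter and package versions an analysis
# ran under, so the environment can be reported alongside the results.
import platform
from importlib import metadata

def environment_record(packages):
    """Return a dict of interpreter and package versions for provenance."""
    record = {"python": platform.python_version()}
    for name in packages:
        try:
            record[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            record[name] = "NOT INSTALLED"
    return record

# Example: log the versions of the libraries the analysis depends on.
print(environment_record(["numpy", "pandas"]))
```

Printing (or saving) this record at the top of every analysis run costs one line and removes the "nobody thought to record the library version" failure mode entirely.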

Factor 2: Uncontrolled Randomness

Many data science techniques involve randomness, and randomness that is not controlled is a direct threat to reproducibility.

Consider a simple example. You run a random forest classifier and report an accuracy of 87.3%. Your colleague runs the same code on the same data and gets 86.1%. Who is right? Both are, technically — the difference is due to the random initialization of the model and the random train/test split. But which result should be reported? Which result would hold up under scrutiny?

Without a random seed, neither result is reproducible. With a seed, both would get the same answer every time. This is not a hypothetical concern — it is one of the most common reasons that data science results differ between runs, machines, and analysts.

Factor 3: The Garden of Forking Paths

The statistician Andrew Gelman coined the phrase "garden of forking paths" to describe the many small decisions that analysts make during an analysis, each of which can change the result:

  • Should I exclude outliers? Which definition of "outlier" should I use?
  • Should I use the log-transformed or raw version of this variable?
  • Should I control for age? Income? Both? Neither?
  • Should I use the full dataset or a subset?
  • Should I use a 70/30 or 80/20 train/test split?

Each decision is reasonable on its own. But when you make 20 such decisions in sequence, and each one could go either way, you have 2^20 (more than one million) possible analysis paths. The final result depends on which path you took — and most of those paths are never documented.
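A single fork is enough to see the problem. In the sketch below, the data and the outlier threshold are invented for illustration; both analysis paths are defensible, yet they report different answers.

```python
# Hypothetical measurements; the last value is a candidate "outlier".
data = [2.1, 2.3, 1.9, 2.2, 9.8]

def mean(xs):
    return sum(xs) / len(xs)

path_a = mean(data)                         # path A: keep every point
path_b = mean([x for x in data if x < 5])   # path B: drop values above a threshold

# Both choices are reasonable, yet the reported result differs substantially.
assert path_a != path_b
print(f"all points: {path_a:.2f}, outlier excluded: {path_b:.2f}")
```

Multiply this one fork by twenty and you have the garden: over a million reachable results, only one of which ends up in the paper.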

Version control (git) addresses this by creating a record of every change. Even if you explored multiple paths, the commit history shows which decisions you made and when. And if you committed at key decision points with descriptive messages ("Use log-transformed GDP instead of raw values because the distribution is heavily right-skewed"), the rationale is preserved.

A Case Study in Consequences: The Duke Clinical Trials Scandal

The consequences of irreproducible research are not always academic. Sometimes they are life-threatening.

Between 2006 and 2010, researchers at Duke University published a series of papers claiming that gene expression patterns could predict which chemotherapy drugs would be most effective for individual cancer patients. The research, led by Anil Potti, was celebrated as a breakthrough in personalized medicine. Three clinical trials were launched based on the findings, enrolling real patients who received treatment recommendations based on the models.

In 2009, biostatisticians Keith Baggerly and Kevin Coombes at MD Anderson Cancer Center attempted to reproduce the Duke results using the published data and methods. They could not. Instead, they discovered a series of errors:

  • Data was mislabeled. In one analysis, the labels for "sensitive" and "resistant" tumors were switched, meaning the model was predicting the exact opposite of what was claimed.
  • Data was selectively excluded. Samples that did not fit the model were removed without documentation.
  • Software versions were not recorded. Different versions of the analysis tools produced different results, and the original versions were not documented.
  • Results could not be independently verified. The analysis code was not shared, and when Baggerly and Coombes reconstructed it from the published descriptions, they found the errors.

The clinical trials were eventually halted. Anil Potti resigned from Duke after investigations revealed additional fraudulent activity (he had fabricated a Rhodes Scholar credential on his resume and a fellowship from the American Cancer Society). The university settled lawsuits from patients who had received treatment recommendations based on the flawed models.

Baggerly and Coombes later gave a talk titled "The Importance of Reproducible Research in High-Throughput Biology" — a talk that has been viewed hundreds of thousands of times and is widely credited with catalyzing the reproducibility movement in computational biology.

Their key point: if the analysis had been reproducible from the start — if the code, data, and environment had been documented and shared — the errors would have been caught in weeks, not years. Patients would never have been enrolled in clinical trials based on flawed models.

The Anatomy of a Reproducible Analysis

What would it have taken to catch the Duke errors early? Let's map it to the tools you learned in this chapter:

For each error, a tool that would have prevented it:

  • Data mislabeling: version-controlled data with checksums (DVC or git-lfs) and validation code committed alongside the data
  • Selective data exclusion: git history showing when and why data points were removed, with commit messages explaining the rationale
  • Software version mismatch: requirements.txt or environment.yml pinning exact library versions
  • Non-shared code: code shared in a git repository, allowing independent verification
  • Non-reproducible results: random seeds set for all stochastic operations

None of these tools are exotic. None require advanced skills. A graduate student with a week of git training could have established these practices.

The Human Side of Reproducibility

The reproducibility crisis is not just a technical problem. It is also a cultural one. Several human factors contribute:

Publication pressure. Researchers are rewarded for publishing novel, statistically significant results. This creates incentives to explore many analysis paths (the garden of forking paths) and report only the ones that produce interesting findings — a practice called p-hacking or data dredging.

The file drawer problem. Studies that find "nothing interesting" (null results) are less likely to be published. This means the published literature is a biased sample of all research — it overrepresents findings that may be flukes.

Lack of incentive for reproducibility. Sharing code, data, and documentation takes time. Researchers are not rewarded for making their work reproducible — they are rewarded for publishing the next paper. The tools in this chapter take effort to adopt, and without institutional support, that effort may not be recognized.

Embarrassment. Sharing your code means others can find your mistakes. This is uncomfortable. But it is precisely the point — errors caught early cause less damage than errors caught after years of building on flawed foundations.

What Has Changed Since 2012?

The reproducibility crisis triggered significant reforms:

Journals now require data and code sharing. Many top journals (Nature, Science, PLOS) now require authors to share the data and code underlying their results. Some require the analysis to be run by an independent party before publication.

Preregistration. Researchers can now publicly register their analysis plan before seeing the data, reducing the temptation to explore multiple paths and report only the best result.

Open science tools. Platforms like the Open Science Framework (OSF), Zenodo (for sharing datasets), and GitHub (for code) make it easier to share research artifacts.

Reproducibility in industry. Companies have adopted many of the same practices — version control, CI/CD pipelines, Docker containers, automated testing — because reproducibility failures are costly in business too.

Education. Courses like this one now include reproducibility as a core topic, not an afterthought.

Lessons for Data Scientists

  1. Reproducibility is not optional. It is the foundation of trust. If your results cannot be reproduced, they cannot be trusted.

  2. The tools are simple. Git, virtual environments, random seeds, and READMEs are not difficult. The barrier is not skill — it is habit.

  3. Document decisions, not just code. The "why" behind your choices is as important as the code that implements them.

  4. Share your work. Open code and open data are the best defenses against errors. The discomfort of having others examine your work is far less costly than the damage of undiscovered mistakes.

  5. Reproducibility is an ethical obligation. As we discussed in Chapter 32, transparent and verifiable work is a check against bias, errors, and even fraud. If your analysis influences decisions — about health, policy, business, or people's lives — those affected have a right to verify it.

Discussion Questions

  1. The Amgen study found that only 11% of landmark cancer studies could be reproduced. Does this mean that 89% of cancer research is wrong? What other explanations are there?

  2. Baggerly and Coombes spent months trying to reproduce the Duke results. Should there be dedicated "reproducibility reviewers" for published research? Who would fund them?

  3. Some researchers resist sharing their code because they fear others will find mistakes. Is this a legitimate concern? How would you address it?

  4. The practices in this chapter (git, environments, seeds, documentation) are standard in software engineering but still uncommon in many academic data science labs. Why do you think adoption has been slow? What would accelerate it?

  5. Think about your own work in this course. Could someone reproduce your analyses from the previous chapters? What specific steps would you need to take to make them fully reproducible?

  6. The Duke scandal involved both honest errors (data mislabeling) and deliberate fraud (fabricated credentials). Reproducibility tools can catch the former but not necessarily the latter. What additional safeguards are needed to prevent fraud?