Case Study 1: Machine Learning and Medicine -- Two Fields, One Error

"The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data." -- John Tukey


When the Algorithm and the Doctor Make the Same Mistake

A machine learning engineer in San Francisco and a medical researcher in Boston have never met, never spoken, and work in fields that share almost no vocabulary. Yet in the spring of 2015, they were both making exactly the same error, for exactly the same reasons, with exactly the same consequences.

The engineer was building a model to predict which users of a social media platform would click on a particular advertisement. She had access to a dataset of ten million user interactions -- clicks, scrolls, page views, dwell times, demographics, device types, time of day, day of week -- and she was fitting a deep neural network with several hundred million parameters. The model achieved a click-through rate prediction accuracy of 94 percent on her training data. She was elated. Her manager was impressed. The model was deployed.

Within three weeks, the model's accuracy had dropped to 61 percent. User behavior had shifted slightly -- a new feature on the platform changed how people scrolled, a competing platform launched a viral campaign that altered attention patterns, and the holiday season changed purchase intent. The model had memorized the specific behavioral patterns of users during the training period, including patterns that were artifacts of that particular moment rather than stable features of user behavior. It had overfit.

Three thousand miles away, a medical researcher was reviewing the results of a clinical trial for a new anti-inflammatory drug. The trial enrolled 180 patients with rheumatoid arthritis, randomized to drug or placebo. The primary endpoint -- reduction in joint swelling at twelve weeks -- showed no significant effect. The drug did not appear to work.

But the researcher did not stop there. He examined subgroups. Did the drug work better in women than men? In patients over fifty versus under fifty? In patients with severe versus mild disease? In patients who also took aspirin versus those who did not? He tested fourteen subgroups. In one of them -- women over fifty with severe disease who did not take aspirin -- the drug showed a statistically significant effect (p = 0.03). He published this result. The paper was titled "Novel Anti-Inflammatory Agent Shows Significant Benefit in Older Women with Severe Rheumatoid Arthritis."

Three years later, a larger trial specifically targeting this subgroup found no effect. The drug did not work for older women with severe rheumatoid arthritis any more than it worked for anyone else. The original finding was a false positive -- a pattern in noise, discovered by testing enough subgroups that one was bound to show a significant result by chance.


The Structural Parallel

The engineer and the researcher made the same error, despite working in different fields with different data, different methods, and different goals. The structure of the error is identical.

The engineer's version:

  Model:              Deep neural network with hundreds of millions of parameters
  Training data:      Ten million user interactions from a specific time period
  Degrees of freedom: Hundreds of millions (the model's parameters)
  What was overfit:   Behavioral patterns specific to the training period (scroll patterns, seasonal effects, platform-specific quirks)
  Test data:          Live user behavior three weeks later
  Result:             Dramatic performance drop; model recalled and retrained

The researcher's version:

  Model:              "The drug works for women over 50 with severe disease who don't take aspirin"
  Training data:      180 patients in the original trial
  Degrees of freedom: Fourteen subgroup analyses (each an additional hypothesis test)
  What was overfit:   A chance correlation in a small subgroup that happened to cross the significance threshold
  Test data:          A larger, independent trial targeting the same subgroup
  Result:             No effect found; original result was a false positive

The parallel extends to every feature of overfitting:

Too many degrees of freedom relative to data. The engineer's model had hundreds of millions of parameters for ten million data points -- a ratio that virtually guarantees overfitting without heavy regularization. The researcher tested fourteen subgroups on 180 patients -- and with a significance threshold of 0.05, the probability of finding at least one "significant" result across fourteen tests, even if the drug has no effect, is approximately 1 - (0.95)^14 ≈ 51 percent (assuming the tests are independent). More than coin-flip odds of a false positive.
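The arithmetic behind that 51 percent can be checked in a few lines. A minimal sketch, assuming independent tests (the function name is illustrative, not from the case study):

```python
# Probability of at least one false positive across m independent tests,
# each run at significance level alpha, when no real effect exists.
def family_wise_error(m: int, alpha: float = 0.05) -> float:
    return 1 - (1 - alpha) ** m

# Fourteen subgroup analyses at p < 0.05:
print(round(family_wise_error(14), 2))  # about 0.51
```

The same formula shows how quickly the risk compounds: at five tests the chance of a spurious hit is already over 22 percent.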

No out-of-sample testing before deployment. The engineer evaluated her model on the same time period's data. The researcher drew conclusions from the same trial that generated the hypothesis. Neither tested their model on genuinely independent data before committing to it.
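The missing discipline can be sketched in a few lines: hold out data that comes strictly after the training period, so that evaluation mimics deployment. The record fields here are hypothetical, not from the engineer's actual pipeline:

```python
def chronological_split(records, train_fraction=0.8):
    """Split time-ordered records so the test set is strictly later
    than the training set -- no peeking at the future."""
    ordered = sorted(records, key=lambda r: r["timestamp"])
    cut = int(len(ordered) * train_fraction)
    return ordered[:cut], ordered[cut:]

# Ten interactions spread over time; the last two become the test set.
interactions = [{"timestamp": t, "clicked": t % 3 == 0} for t in range(10)]
train, test = chronological_split(interactions)
```

A random shuffle before splitting would leak future behavior into training, which is exactly the mistake a time-based split prevents.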

The pattern was real in the data. This is the most insidious feature of overfitting. The engineer's model genuinely predicted user behavior during the training period with 94 percent accuracy. The researcher genuinely observed a significant difference in the target subgroup. Neither was fabricating results. The patterns were real -- real in the same way that a face in the clouds is real. It is really there, in that particular arrangement of water droplets. It just doesn't mean anything.


How Machine Learning Learned to Regularize

The history of machine learning is, to a significant degree, the history of learning how to prevent overfitting. The field did not begin with an understanding of the problem. It arrived at that understanding through decades of painful experience.

In the early decades of neural network research (1960s through 1980s), overfitting was a constant but poorly understood obstacle. Researchers would build networks, train them on data, achieve impressive results, and then find that the networks performed far worse on new data. The response was often to add more data or build bigger networks, which sometimes helped and sometimes made things worse.

The breakthrough came with the development of formal regularization techniques:

Early stopping (1990s): Rather than training the model until it achieved minimum error on the training data, researchers discovered that stopping training when the error on a held-out validation set began to increase produced models that generalized better. The model was deliberately prevented from fitting the training data as well as it could. This is satisficing applied to optimization: accepting a good-enough fit rather than pursuing a perfect one.
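The rule can be sketched in a dozen lines, assuming validation error is recorded once per epoch. The `patience` parameter and the error values below are illustrative:

```python
def early_stop_epoch(val_errors, patience=2):
    """Return (best_epoch, best_error), abandoning training once the
    validation error fails to improve for `patience` consecutive epochs."""
    best_epoch, best_err, stalled = 0, float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_epoch, best_err, stalled = epoch, err, 0
        else:
            stalled += 1
            if stalled >= patience:
                break
    return best_epoch, best_err

# Validation error falls, then rises as the model begins to overfit:
print(early_stop_epoch([0.9, 0.6, 0.4, 0.35, 0.37, 0.41, 0.45]))  # (3, 0.35)
```

Training halts at epoch 3, before the model has had the chance to memorize the training set's noise.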

Weight decay / L2 regularization: Adding a penalty proportional to the size of the model's parameters to the loss function. This penalizes complexity -- large parameters mean the model is making sharp distinctions that may be based on noise. The penalty pushes the model toward simpler, smoother solutions that are less likely to be overfit.
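In update-rule form, the penalty appears as a term that pulls every weight toward zero on each step. A sketch, with illustrative learning rate and decay strength:

```python
def sgd_step(weights, grads, lr=0.1, weight_decay=0.01):
    """One SGD step with L2 weight decay: each weight follows the data
    gradient *and* is shrunk slightly toward zero."""
    return [w - lr * (g + weight_decay * w) for w, g in zip(weights, grads)]

# With a zero data gradient, decay alone shrinks both weights:
w = sgd_step([2.0, -3.0], grads=[0.0, 0.0])
```

Over many steps, only weights that the data gradient consistently supports stay large; weights sustained by noise decay away.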

Dropout (2012): During training, randomly "dropping" a fraction of the neural connections, forcing the network to learn redundant representations that are robust to the loss of any individual connection. This prevents the network from relying on specific, potentially noise-driven features of the training data.
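A sketch of "inverted" dropout applied to one layer's activations; the rescaling by 1/(1 - p) keeps the layer's expected output unchanged (names and values are illustrative):

```python
import random

def dropout(activations, p=0.5, rng=None):
    """Zero each activation with probability p; rescale survivors by
    1/(1 - p) so the expected output is unchanged."""
    rng = rng or random.Random()
    return [0.0 if rng.random() < p else a / (1 - p) for a in activations]

# Roughly half the units vanish on any given pass; the network cannot
# afford to depend on any single one of them.
masked = dropout([1.0] * 8, p=0.5, rng=random.Random(42))
```

At inference time dropout is switched off and all activations are used, which is why the rescaling during training matters.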

Data augmentation: Artificially expanding the training set by creating modified versions of existing data (rotating images, adding noise, cropping). This increases the effective sample size without collecting new data, reducing the degrees-of-freedom-to-data ratio.
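For numeric data the idea can be sketched as jittered copies; for images, rotations and crops play the same role. The noise scale and copy count here are illustrative:

```python
import random

def augment(samples, copies=3, noise=0.1, rng=None):
    """Expand a dataset with jittered copies of each sample, raising the
    effective sample size without collecting new data."""
    rng = rng or random.Random(0)
    out = list(samples)
    for x in samples:
        out.extend(x + rng.uniform(-noise, noise) for _ in range(copies))
    return out

bigger = augment([1.0, 2.0, 3.0])  # 3 originals plus 9 jittered copies
```

The jitter must respect invariances the task actually has; augmenting with transformations that change the label would inject noise rather than remove it.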

Ensemble methods: Training multiple models on different subsets of the data and averaging their predictions. This reduces variance because the individual models' overfitting errors tend to cancel out when averaged. This is the machine learning version of portfolio diversification.
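A toy demonstration of the variance cancellation, assuming each "model" produces the true value plus its own independent error:

```python
import random
import statistics

rng = random.Random(0)
true_value = 2.0

# One hundred "models", each off by its own independent noise term.
single_predictions = [true_value + rng.gauss(0, 1) for _ in range(100)]
ensemble_prediction = statistics.mean(single_predictions)

# Individual models err by roughly 0.8 on average; their average errs far less,
# because the independent errors largely cancel.
typical_single_error = statistics.mean(abs(p - true_value) for p in single_predictions)
ensemble_error = abs(ensemble_prediction - true_value)
```

The cancellation depends on the errors being (mostly) independent, which is why ensembles are trained on different data subsets rather than identical copies.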

Each of these techniques encodes the same principle: constrain the model's freedom, accept some bias, and gain generalization. The field arrived at this principle through trial and error, through the accumulated experience of thousands of researchers hitting the same wall.


How Medicine Is Learning to Regularize

Medicine's encounter with overfitting has been more painful and more recent, in part because the consequences are measured in human lives rather than advertising revenue.

The replication crisis in medicine -- the discovery that many published clinical findings fail to hold up in larger, independent trials -- triggered a wave of institutional reforms that are, in structure, identical to the regularization techniques developed in machine learning.

Pre-registration of clinical trials (analogous to fixing the model architecture before training): Since 2005, major medical journals have required that clinical trials be registered before they begin, specifying the primary hypothesis, the primary endpoint, the sample size, and the statistical analysis plan. This prevents researchers from adjusting their hypotheses after seeing the data -- from HARKing (Hypothesizing After the Results are Known). Pre-registration is the scientific equivalent of committing to a model architecture before training: you lock in your degrees of freedom before you see the data.

Correction for multiple comparisons (analogous to regularization penalties): When multiple hypotheses are tested on the same dataset, statistical corrections (such as the Bonferroni correction) tighten the significance threshold in proportion to the number of tests. If you test fourteen subgroups, the threshold for any individual test is adjusted from p < 0.05 to p < 0.05/14 ≈ 0.0036. This makes it harder for any individual test to pass, reducing the false positive rate. It is a penalty for complexity -- for using too many degrees of freedom.
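The correction is one line of arithmetic. A sketch, with p-values invented to mirror the trial in the case study:

```python
def bonferroni_survivors(p_values, alpha=0.05):
    """Flag the tests that remain significant after dividing the
    threshold by the number of tests performed."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]

# Fourteen subgroup p-values, one of which (0.03) crossed the naive 0.05 bar.
# Under the corrected threshold of 0.05/14 (about 0.0036), nothing survives.
results = bonferroni_survivors([0.03] + [0.40] * 13)
```

Bonferroni is deliberately conservative; milder corrections exist, but all share the same logic of charging a price for each additional test.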

Larger sample sizes (analogous to more training data): The replication crisis demonstrated that many underpowered studies -- studies with too few participants to detect real effects reliably -- had produced false positives. The response has been a push toward larger trials, multi-site studies, and meta-analyses that pool data across studies. More data reduces the degrees-of-freedom-to-data ratio, making overfitting less likely.

Independent replication (analogous to out-of-sample testing): The gold standard in medicine, as in machine learning, is testing on independent data. A finding that replicates in a new trial with new patients, conducted by independent researchers, has passed the most demanding test of generalization.

Systematic reviews and meta-analyses (analogous to ensemble methods): Rather than relying on any single study, systematic reviews aggregate evidence across all available studies on a question, weighting each by its quality and sample size. This is the medical equivalent of ensemble learning: averaging across multiple "models" to reduce the variance that afflicts any individual one.
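The standard fixed-effect pooling rule weights each study by the inverse of its variance, so precise studies count for more. A sketch with invented numbers:

```python
def pooled_effect(effects, variances):
    """Inverse-variance weighted mean of study effects, plus the variance
    of the pooled estimate (smaller than any single study's variance)."""
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    estimate = sum(w * e for w, e in zip(weights, effects)) / total
    return estimate, 1.0 / total

# Three small studies pooled into one tighter estimate:
est, var = pooled_effect([0.30, 0.10, 0.25], [0.04, 0.02, 0.05])
```

The shrinking pooled variance is the statistical payoff of the ensemble: each study's idiosyncratic error partially cancels against the others'.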


The Convergent Evolution of Regularization

The most striking feature of this parallel history is that machine learning and medicine arrived at structurally identical solutions independently. No machine learning researcher read a clinical trial methodology textbook and adapted its lessons. No medical statistician studied neural network regularization and applied it to trial design. The solutions converged because the problem was the same.

This convergent evolution of regularization techniques is itself a powerful example of cross-domain pattern recognition. The pattern is:

  1. A field develops methods for finding patterns in data.
  2. The methods find patterns, some real and some not.
  3. The field deploys the patterns and discovers that some fail in new contexts.
  4. The field develops constraints -- regularization techniques -- to prevent future overfitting.
  5. The constraints share a common structure: reduce degrees of freedom, penalize complexity, test on independent data.

This sequence has occurred in machine learning, medicine, finance (where backtesting reforms now require out-of-sample testing), psychology (where the Open Science Framework promotes pre-registration and replication), and ecology (where meta-analyses have replaced reliance on individual field studies). Each field went through the sequence independently. Each arrived at the same destination.

The lesson is not that overfitting is a solved problem. It is not. New forms of overfitting emerge as methods become more sophisticated. In machine learning, the rise of large language models has introduced new overfitting challenges: models trained on the entire internet may overfit to the statistical regularities of human-generated text rather than learning to reason. In medicine, the proliferation of observational studies using electronic health records introduces new sources of confounding that mimic the multiple-testing problem.

The lesson is that recognizing the shared structure of overfitting across domains allows you to import solutions. A medical researcher who understands dropout in neural networks can see why multi-site trials with diverse populations are important. A machine learning engineer who understands the replication crisis can see why out-of-sample testing must be sacrosanct. The vocabulary differs. The math differs. The constraints are the same.


The Human Cost

The overfitting error in machine learning cost the engineer's company some advertising revenue and the engineer some embarrassment. The overfitting error in medicine cost patients something far more valuable.

When the subgroup analysis of the rheumatoid arthritis drug was published, doctors read it. Some prescribed the drug to older women with severe disease, believing -- based on a published, peer-reviewed study -- that it would help. It did not help. Those patients took a medication with side effects and no benefits, paid for a prescription that did nothing, and in some cases delayed receiving treatments that might have worked.

This is why overfitting is not merely an academic concern. In every domain where decisions are based on patterns extracted from data -- which is every domain -- overfitting has consequences that range from the trivial (bad ad targeting) to the catastrophic (ineffective medical treatments, failed financial strategies, misguided policy interventions). The structural identity of the error across domains means that the lessons learned in any one domain can, and should, inform all the others.

The engineer and the researcher, three thousand miles apart, made the same mistake for the same reasons. Understanding why they made it, and how to prevent it, is among the most practically important applications of cross-domain pattern recognition.


Questions for Discussion

  1. The case study describes overfitting in both machine learning and medicine as involving "too many degrees of freedom relative to data." In what specific ways do the degrees of freedom differ between the two fields? Are "model parameters" and "subgroup analyses" the same kind of flexibility?

  2. The medical researcher tested fourteen subgroups and found one significant result. If you were a journal reviewer, what specific regularization requirements would you impose before accepting such a paper for publication?

  3. Machine learning developed regularization techniques largely through engineering trial and error. Medicine developed them largely in response to a public crisis of confidence (the replication crisis). Why do you think the timelines were different? Does the nature of the domain (engineering vs. human health) affect how quickly a field learns to regularize?

  4. The case study notes that ensemble methods in machine learning and meta-analyses in medicine serve the same structural function. Can you identify other pairs of domain-specific techniques that are structurally identical regularization methods appearing under different names?

  5. Both fields continue to face new forms of overfitting as their methods evolve. What new overfitting risks might emerge in the next decade in machine learning? In medicine? Would understanding the parallel help practitioners in either field anticipate and prepare for these risks?