Case Study 1: Medicine's Long Struggle with Bayes
"It is simply no longer possible by 'starring' anecdotally at the evidence to make sense of it." -- Archie Cochrane, Effectiveness and Efficiency (1972)
The Deadly Arithmetic
In 1978, the pathologist David Eddy published a study that should have changed the practice of medicine. He surveyed ninety-five physicians and asked them a question that, by that point, had a well-established correct answer:
A patient has a positive result on a screening test. The test has a sensitivity of 80 percent (it catches 80 percent of true cases) and a false positive rate of 10 percent. The prevalence of the condition in the population being screened is 1 percent. What is the probability that the patient actually has the condition?
The correct answer, derived from Bayes' theorem, is approximately 7.5 percent. Out of every 1,000 people screened, 10 have the condition. The test catches 8 of them (true positives). Of the 990 who do not have the condition, the test falsely flags 99 (false positives). So out of 107 total positive results, only 8 are genuine. The probability of disease given a positive test is 8/107, or about 7.5 percent.
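For readers who want to check the arithmetic, here is a minimal sketch in Python. The function name and structure are ours, not Eddy's; the inputs are simply the numbers from the problem above.

```python
def positive_predictive_value(sensitivity, false_positive_rate, prevalence):
    """P(disease | positive test), via Bayes' theorem."""
    # P(positive) = P(positive | disease) P(disease) + P(positive | healthy) P(healthy)
    p_positive = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
    # P(disease | positive) = P(positive | disease) P(disease) / P(positive)
    return sensitivity * prevalence / p_positive

# Eddy's problem: 80% sensitivity, 10% false positive rate, 1% prevalence.
print(positive_predictive_value(0.80, 0.10, 0.01))  # 0.0748..., about 7.5 percent
```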
The physicians' answers were devastating. The vast majority estimated the probability at between 70 and 80 percent. They were off by a factor of ten.
This was not a population of poorly trained doctors. These were practicing physicians at major medical centers. They had years of clinical experience. They made life-and-death decisions routinely. And they could not correctly interpret the single most common type of result in diagnostic medicine: a positive screening test.
Eddy's study was not an isolated finding. In the decades since, researchers have replicated it with depressing consistency. Gerd Gigerenzer and his colleagues tested 160 gynecologists on a mammography version of the problem in 2007. Only 21 percent arrived at the correct answer. Casscells, Schoenberger, and Graboys found in 1978 that only 18 percent of Harvard Medical School faculty and students could correctly answer a similar problem. The numbers have improved somewhat in recent years as Bayesian literacy has entered some curricula, but the fundamental pattern persists: most physicians do not intuitively grasp the impact of base rates on the interpretation of diagnostic tests.
Why Physicians Get It Wrong
The failure is not about intelligence or education. It is about the cognitive architecture of human reasoning and the way medical training reinforces that architecture.
The Seduction of Sensitivity
Medical education teaches physicians to think in terms of individual patients, not populations. A doctor caring for a patient with symptoms wants to know: "If my patient has this disease, will the test catch it?" That question is answered by the sensitivity. A test with 90 percent sensitivity will catch 90 percent of cases. This feels reassuring.
What sensitivity does not answer is the inverse question: "If the test is positive, does my patient have the disease?" That question requires Bayes' theorem. It requires knowing the base rate. And the base rate is not a property of the test or the patient -- it is a property of the population from which the patient is drawn.
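To see how much the population matters, the sketch below holds the test fixed and varies only the prevalence; the prevalences are illustrative, not drawn from any particular screening program.

```python
def ppv(sensitivity, false_positive_rate, prevalence):
    """P(disease | positive test), from Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = false_positive_rate * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Same test (90% sensitivity, 9% false positive rate), different populations.
for prevalence in (0.001, 0.01, 0.10, 0.30):
    print(f"prevalence {prevalence:6.1%} -> P(disease | positive) = "
          f"{ppv(0.90, 0.09, prevalence):.1%}")
# prevalence   0.1% -> P(disease | positive) = 1.0%
# prevalence   1.0% -> P(disease | positive) = 9.2%
# prevalence  10.0% -> P(disease | positive) = 52.6%
# prevalence  30.0% -> P(disease | positive) = 81.1%
```

The sensitivity never changes; only the population does. That is why the same positive result can mean almost nothing in a screening clinic and a great deal in a specialist referral practice.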
This distinction -- between the probability of the evidence given the disease and the probability of the disease given the evidence -- is the distinction between the likelihood and the posterior in Bayesian terminology. It is also the distinction that the prosecutor's fallacy violates (Chapter 10, Section 10.5). The cognitive error is identical: confusing P(evidence | hypothesis) with P(hypothesis | evidence).
The Narrative Pull
Physicians are trained to think in clinical narratives. A patient presents with symptoms. The physician generates a differential diagnosis -- a list of possible conditions ranked by plausibility. Each symptom, each lab result, each piece of the clinical history serves as evidence that updates the differential.
In principle, this process is Bayesian. In practice, it is distorted by what Amos Tversky and Daniel Kahneman called the representativeness heuristic. Physicians judge the probability of a diagnosis not by combining evidence with base rates but by assessing how well the patient's presentation matches the "typical" case of each condition. A patient who presents with textbook symptoms of a rare disease will often receive that diagnosis, even when the base rate of the disease makes it far less likely than a common condition with an atypical presentation.
This is representativeness bias dressed in clinical clothing. The physician is reasoning: "This presentation looks like lupus, therefore it probably is lupus." The Bayesian reasoning would be: "This presentation looks like lupus (high likelihood), but lupus is rare in this population (low prior), so the posterior probability is moderate at best -- and a common condition with a similar presentation is more probable."
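The contrast can be put in numbers. In the sketch below the priors and likelihoods are invented for illustration, and the posterior is normalized over just these two candidates; the point is only that a weaker match to a much more common condition can still dominate.

```python
# Illustrative numbers only: a rare disease matching the presentation well
# versus a common condition matching it less well.
candidates = {
    "rare disease, textbook presentation":     (0.001, 0.90),  # (prior, likelihood)
    "common condition, atypical presentation": (0.050, 0.30),
}

# Unnormalized posterior = prior * likelihood; normalize over these two candidates.
scores = {name: prior * lik for name, (prior, lik) in candidates.items()}
total = sum(scores.values())
for name, score in scores.items():
    print(f"{name}: posterior ~ {score / total:.0%}")
# rare disease, textbook presentation: posterior ~ 6%
# common condition, atypical presentation: posterior ~ 94%
```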
The Institutional Failure
The problem extends beyond individual cognition to institutional practice. Medical screening programs are often evaluated and promoted based on sensitivity alone. "Our test catches 95 percent of cases!" is a compelling marketing claim and a reassuring clinical statistic. What is rarely communicated -- to physicians, patients, or policymakers -- is the positive predictive value: the probability that a positive result is real, given the base rate in the population being screened.
This institutional failure has concrete consequences. The PSA (prostate-specific antigen) test for prostate cancer is a case in point. The test has reasonable sensitivity but moderate specificity, and latent prostate cancer is very common in older men (autopsy studies suggest that a large fraction of elderly men have histological evidence of prostate cancer). The result: widespread PSA screening generated an enormous number of positive results, most of which led to biopsies, many of which detected cancers that were unlikely to cause symptoms or death. The cascade of diagnosis and treatment -- surgery, radiation, their side effects -- caused substantial harm to men who would have lived their lives perfectly well without ever knowing about a microscopic, slow-growing tumor.
In 2012, the U.S. Preventive Services Task Force recommended against routine PSA screening for healthy men. The recommendation was controversial, but the Bayesian logic was clear: in a population with a relatively high base rate of indolent disease, a moderately specific test produces too many false positives (in a clinical sense -- true positives for cancers that did not need treatment), and the harms of downstream interventions outweigh the benefits of early detection.
The Natural Frequency Revolution
Gigerenzer's work pointed toward a solution: change the format, not the formula.
The problem, Gigerenzer argued, is not that physicians cannot do Bayesian reasoning. It is that conditional probabilities -- "90 percent sensitivity," "9 percent false positive rate" -- are an unnatural format for human cognition. Our brains did not evolve to process probabilities. They evolved to process frequencies. We are good at reasoning about counts of things: "9 out of 100 people with positive tests actually have the disease" is a statement that activates our natural frequency intuitions. "The positive predictive value is 9 percent" does not.
The natural frequency format presents the same information as Bayes' theorem but arranges it so the base rate is visible. Instead of:
- Sensitivity: 90 percent
- Specificity: 91 percent
- Base rate: 1 percent
You say:
- Out of every 1,000 women screened, about 10 have cancer.
- Of those 10, the test catches 9.
- Of the 990 without cancer, the test falsely flags about 90.
- So out of about 99 positive results, only 9 are real.
When physicians are given the problem in this format, accuracy jumps from roughly 20 percent to roughly 80 percent. The improvement is dramatic, immediate, and requires no additional mathematical training.
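A sketch of what that reformatting looks like in code. The function below is ours, a toy illustration rather than any published decision tool; it simply restates the mammography numbers above as counts.

```python
def natural_frequencies(sensitivity, specificity, base_rate, n=1000):
    """Restate a test's conditional probabilities as counts out of n people."""
    sick = round(n * base_rate)
    true_pos = round(sick * sensitivity)
    false_pos = round((n - sick) * (1 - specificity))  # 89 here; the text rounds to "about 90"
    return (f"Out of every {n} people screened, about {sick} have the condition.\n"
            f"Of those {sick}, the test catches {true_pos}.\n"
            f"Of the {n - sick} without the condition, it falsely flags about {false_pos}.\n"
            f"So out of about {true_pos + false_pos} positive results, only {true_pos} are real.")

print(natural_frequencies(sensitivity=0.90, specificity=0.91, base_rate=0.01))
```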
Gigerenzer and his colleagues have advocated for reforming medical communication along these lines: patient brochures, informed consent forms, and clinical decision tools should present risk information in natural frequencies rather than conditional probabilities. Some countries and medical institutions have adopted this approach. The UK's National Health Service, for example, has experimented with natural frequency formats in cancer screening materials.
But adoption has been slow. The conditional probability format is entrenched in medical education, clinical guidelines, and journal publications. Changing it requires reforming not just how physicians think but how medical information is produced, packaged, and delivered. This is an institutional inertia problem of exactly the kind discussed in Chapter 10, Section 10.9.
A Deeper Problem: What Counts as Disease?
The Bayesian analysis of medical screening reveals a problem that goes beyond arithmetic. It forces a confrontation with a question that medicine has struggled with for decades: what counts as disease?
The PSA screening controversy illustrates this. Many elderly men have histological prostate cancer -- abnormal cells visible under a microscope. But many of these cancers are indolent: they grow so slowly that the man will die of something else long before the cancer causes symptoms. Is this cancer a "disease"? In one sense, yes: the cells are abnormal. In another sense, no: the abnormality will never harm the patient. The base rate of histological prostate cancer is high, but the base rate of clinically significant prostate cancer is much lower. Which base rate you use in the Bayesian calculation changes the posterior dramatically.
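A quick illustration of how much that choice changes things. The test characteristics and the two priors below are placeholders, chosen only to show the shape of the effect, not estimates for PSA or for actual prostate cancer prevalence.

```python
def ppv(sensitivity, false_positive_rate, prevalence):
    tp = sensitivity * prevalence
    fp = false_positive_rate * (1 - prevalence)
    return tp / (tp + fp)

sens, fpr = 0.80, 0.20  # placeholder test characteristics
print(ppv(sens, fpr, prevalence=0.40))  # prior = any histological abnormality: ~0.73
print(ppv(sens, fpr, prevalence=0.03))  # prior = clinically significant disease: ~0.11
```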
This is not just a Bayesian technicality. It reflects a deep tension in medicine between the pathological definition of disease (abnormal cells are present) and the clinical definition (the abnormality causes or will cause symptoms). Bayesian reasoning, by insisting that the prior matters, forces the question: prior of what? Clinically significant disease? Any histological abnormality? Future morbidity? The answer changes the entire analysis.
The mammography debate follows the same contour. Modern mammography can detect very small tumors, some of which may never progress to clinical significance. These are sometimes called "pseudodisease" -- conditions that meet the pathological definition of disease but would never have caused harm without intervention. When pseudodisease is included in the base rate, sensitivity appears high (the test catches these small tumors) and the screening program appears successful (survival rates improve because disease is detected earlier). But the patients with pseudodisease did not benefit from detection. They were harmed: they received a cancer diagnosis, underwent treatment, and suffered side effects for a condition that was never going to hurt them.
Bayesian reasoning does not resolve this debate. But it clarifies it. It shows that every screening decision rests on a prior -- and that the choice of prior is not a statistical question but a medical and ethical one. What condition are we screening for? What counts as a positive result? What population are we screening? These questions, which seem like they should be answered before the Bayesian calculation begins, are in fact the most important part of the calculation.
The Bayesian Future of Medicine
Despite decades of evidence that physicians struggle with Bayesian reasoning, and despite the existence of effective solutions (natural frequency formats, clinical decision support tools, Bayesian diagnostic algorithms), the integration of Bayesian thinking into routine clinical practice remains incomplete.
The reasons are familiar from the chapter's analysis of why Bayesian reasoning keeps getting forgotten: cognitive difficulty, institutional inertia, the seductiveness of simpler heuristics, and the cultural resistance to making uncertainty explicit. Medicine, like science more broadly, has a complicated relationship with uncertainty. Patients want certainty. Physicians are trained to project confidence. The Bayesian framework, which foregrounds uncertainty and makes it quantitative, can feel like an admission of ignorance rather than a tool for managing it.
But the tide is turning. Evidence-based medicine -- the movement to ground clinical decisions in systematic evidence rather than individual experience -- is fundamentally Bayesian in spirit, even when its methods are frequentist in execution. The growing use of clinical decision support tools, many of which implement Bayesian algorithms behind a user-friendly interface, is quietly bringing Bayesian reasoning into clinical practice without requiring physicians to understand the underlying mathematics. And the rise of personalized medicine -- which adjusts diagnosis and treatment to individual patients based on their specific risk profiles -- is, in essence, a movement toward making priors patient-specific rather than population-generic.
The Bayesian revolution in medicine is happening. It is just happening slowly, as all revolutions do in institutions with deep investments in existing practice. Thomas Kuhn would not be surprised.
Questions for Discussion
- Why do you think physicians consistently overestimate the probability of disease given a positive test? Is this primarily a cognitive error, a training failure, or an institutional problem?
- The natural frequency format dramatically improves Bayesian reasoning. Why do you think it has not been more widely adopted in medical practice?
- The PSA screening controversy illustrates the problem of choosing which base rate to use. How should this choice be made -- by physicians, patients, public health authorities, or some combination?
- The chapter's threshold concept is "Priors Are Not Bias." How does this apply to the medical context? Is a physician who considers disease prevalence before interpreting a test being more or less objective than one who does not?
- How does the medical struggle with Bayesian reasoning connect to the signal detection problems discussed in Chapter 6? Are the errors the same, or do they have a different character in the clinical context?