Case Study 36.1: IBM Watson for Oncology — The Rise and Fall of AI's Healthcare Moonshot

From Jeopardy! to the Cancer Ward


Overview

In February 2011, a computer named Watson competed on Jeopardy! — the American television quiz show — against its two most successful human champions, Ken Jennings and Brad Rutter. Watson won decisively, defeating both champions over three televised episodes. IBM had spent four years and tens of millions of dollars developing Watson's natural language understanding capabilities for this purpose. The victory was a genuine technical achievement: Watson demonstrated an ability to parse complex English-language questions, draw on a massive knowledge base, and respond quickly and accurately in a high-stakes, public setting.

IBM's subsequent decision to commercialize Watson in healthcare, and specifically in oncology, is a case study in the gap between AI's demonstrated capabilities in controlled settings and its performance in the complex, high-stakes, variable environment of clinical medicine. It is also a case study in what happens when commercial ambition, marketing enthusiasm, and institutional prestige combine to accelerate deployment beyond what clinical evidence supports.


The Pivot to Healthcare

Following the Jeopardy! victory, IBM positioned Watson as a platform for enterprise cognitive computing — artificial intelligence that could process unstructured information (like clinical literature) and provide expert-quality analysis. Healthcare was identified as a primary market, based on the intuition that medicine involved the exact challenge Watson had demonstrated facility with: large volumes of text-based information, complex questions, and the need for synthesized, authoritative answers.

The oncology application was particularly attractive because of cancer treatment's genuine complexity. Oncology is one of the most rapidly evolving fields in medicine, with hundreds of cancer types, thousands of clinical trial results, evolving genomic classifications, complex drug interactions, and patient-specific factors that all bear on treatment selection. Even experienced oncologists may not be current on the latest evidence across all tumor types. The vision of an AI system that had read and synthesized all of the oncology literature — one that could give any oncologist anywhere access to reasoning of the quality practiced at Memorial Sloan Kettering Cancer Center (MSKCC) — was compelling.

IBM announced a series of major institutional partnerships:

  • In 2013, IBM partnered with the University of Texas MD Anderson Cancer Center on a project to create a "cognitive computing system" for leukemia treatment, with a reported cost to MD Anderson of approximately $62 million.
  • IBM partnered with Memorial Sloan Kettering Cancer Center (MSKCC) to develop Watson for Oncology, which produced treatment recommendations across multiple cancer types.
  • IBM announced partnerships with Manipal Hospitals in India, Samsung Medical Center in South Korea, Jupiter Medical Center in Florida, and hospitals across Europe, Asia, and Latin America.
  • IBM sold Watson Health products to multiple health systems and positioned Watson as a transformative force in precision medicine.

The partnership announcements were made with significant media coverage and ambitious language. IBM executives and partner institution leaders described Watson's potential to transform oncology care, reduce diagnostic errors, and democratize access to expert oncology reasoning.


The Training Methodology

Understanding why Watson for Oncology failed requires understanding how it was trained — a methodology that contained the seeds of the subsequent problems.

Watson for Oncology's treatment recommendations were not generated by analysis of actual patient outcomes data. The system was not trained on databases showing which treatments led to which outcomes for which types of patients. Instead, Watson was trained on hypothetical patient cases — clinical vignettes — that were created and annotated by oncologists at MSKCC.

The training process worked roughly as follows: MSKCC oncologists described hypothetical patient scenarios, including simulated patient characteristics, pathology findings, and clinical contexts. They then annotated these scenarios with their recommended treatments and reasoning, explaining why certain therapies were preferred. Watson learned to match similar patient presentations to the annotated recommendations. The system was, in effect, learning MSKCC's clinical reasoning from curated examples, not from the messy, variable reality of actual patient outcomes across diverse populations.

This training methodology had a significant limitation that was not adequately appreciated during development: it produced a system that captured MSKCC's institutional approach to oncology in the context of MSKCC's patient population, MSKCC's formulary (the drugs available at MSKCC), MSKCC's clinical resources, and the cases that MSKCC oncologists happened to curate as training examples. When the system was deployed in different institutions — with different patient populations, different drugs available, different clinical resources, different treatment protocols, and different clinical contexts — there was no reason to expect that MSKCC's curated reasoning would translate accurately.

Additionally, the training cases were hypothetical rather than drawn from actual patient records. This meant that the training data could not reflect the actual distribution of patient presentations, clinical complications, and real-world variability that practicing oncologists encounter.
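The case-matching approach described above can be caricatured in a few lines of code. This is a deliberately simplified, hypothetical sketch (IBM never published Watson for Oncology's architecture, and every name and number below is invented); it exists only to make the limitation concrete: a system that retrieves the most similar curated case inherits every assumption baked into the curation, including the curating institution's formulary and patient mix.

```python
# Hypothetical sketch: recommend a treatment by matching a new patient to
# the most similar curated training case. All data here are invented.

from math import dist

# Curated cases: (feature vector, annotated recommendation).
# Features might encode age, stage, a biomarker flag, etc. (illustrative).
CURATED_CASES = [
    ((62, 3, 1), "regimen A (on the curating institution's formulary)"),
    ((48, 2, 0), "regimen B (on the curating institution's formulary)"),
    ((71, 4, 1), "regimen C (on the curating institution's formulary)"),
]

def recommend(patient_features):
    """Return the annotation attached to the nearest curated case.

    Note what is missing: no outcomes data, and no check that the
    recommended drug is available or approved where the patient is
    actually being treated.
    """
    _, recommendation = min(
        CURATED_CASES,
        key=lambda case: dist(case[0], patient_features),
    )
    return recommendation

print(recommend((65, 3, 1)))  # the nearest curated case wins, context or not
```

Nothing in this retrieval step encodes the deployment site's drug availability, guidelines, or population, which is exactly the generalizability gap the later evaluations documented.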


The STAT News Investigation

In September 2017, a detailed investigative report by Casey Ross and Ike Swetlitz at STAT News, drawing on internal IBM documents obtained by the publication, revealed the magnitude of the gap between Watson's marketing claims and its clinical performance.

The STAT News investigation reported that:

  • Watson for Oncology had generated treatment recommendations that oncologists described as "unsafe and incorrect" in a significant number of cases.
  • One internal document, presented at a meeting of Watson developers, described Watson recommending "high-dose chemotherapy for a patient with severe bleeding," a recommendation that oncologists noted could be lethal.
  • Watson's recommendations were generated from "synthetic" (hypothetical) cases rather than actual patient data, a limitation that IBM had not prominently disclosed to potential customers.
  • Watson sometimes flagged as "not recommended" rare cancer treatments that were in fact appropriate, based on reasoning that did not align with international oncology guidelines.

The STAT News investigation prompted significant attention within the medical and technology communities. Internal documents from multiple hospital partners indicated dissatisfaction with Watson's performance. Oncologists at multiple institutions described the gap between what IBM's sales teams had promised and what Watson actually delivered.

The investigation also revealed that MD Anderson Cancer Center had ended its Watson-based leukemia project — the largest and most expensive Watson Health partnership — after spending approximately $62 million without achieving deployment. An audit by the University of Texas System found project management failures and concerns about clinical performance. MD Anderson had quietly wound down the project without public announcement.


The Evaluation at MSKCC

MSKCC's own internal evaluation of Watson's recommendations, portions of which were shared with the STAT News investigation and subsequently discussed in medical literature, provided evidence of the generalizability problem.

Oncologists at MSKCC reviewed Watson's recommendations for cancer patients at various institutions and found that a substantial fraction of recommendations did not align with what trained oncologists would recommend. The discrepancy was particularly pronounced for patient populations that differed from the MSKCC patient population used to generate training cases: patients in India, South Korea, or other countries where patient demographics, cancer presentation patterns, available drugs, and treatment infrastructure differed from MSKCC's.

A 2018 study published in the journal Oncotarget, conducted at a hospital in South Korea that had implemented Watson for Oncology, found that Watson's recommendations concurred with tumor board recommendations in only 49% of colon cancer cases and 55% of rectal cancer cases, with higher concordance for some other cancers but important discordances in treatment details. The authors noted that the discordances could not always be resolved clearly in Watson's favor, suggesting that the system's recommendations did not consistently improve on standard clinical practice.

A study from a hospital in India found similar patterns: meaningful rates of discordance between Watson's recommendations and institutional tumor board recommendations, with particular concerns about Watson recommending drugs not available in India or not approved in India, reflecting the MSKCC-centric training.
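Concordance figures like the ones in these studies reduce to a simple calculation that any hospital can run in a silent evaluation before deployment: compare the AI's recommendation with the tumor board's decision case by case and report agreement per cancer type. A minimal sketch, with entirely invented case data:

```python
# Minimal per-cancer-type concordance calculation, as used in silent-trial
# evaluations of a decision-support tool. All case data below are invented.

from collections import defaultdict

cases = [
    # (cancer_type, ai_recommendation, tumor_board_recommendation)
    ("colon", "FOLFOX", "FOLFOX"),
    ("colon", "FOLFIRI", "FOLFOX"),
    ("rectal", "chemoradiation", "chemoradiation"),
    ("rectal", "surgery first", "chemoradiation"),
    ("rectal", "chemoradiation", "chemoradiation"),
]

def concordance_by_type(cases):
    """Fraction of cases, per cancer type, where AI and tumor board agree."""
    agree = defaultdict(int)
    total = defaultdict(int)
    for cancer_type, ai_rec, board_rec in cases:
        total[cancer_type] += 1
        agree[cancer_type] += (ai_rec == board_rec)
    return {t: agree[t] / total[t] for t in total}

print(concordance_by_type(cases))
# {'colon': 0.5, 'rectal': 0.6666666666666666}
```

The metric is trivial to compute; the governance point is that it must be computed on the deploying institution's own cases, not assumed from the vendor's home population.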


IBM's Response and Continuing Sales

IBM's commercial response to the emerging evidence of Watson's clinical limitations was a subject of significant criticism. IBM continued to market Watson for Oncology to hospitals internationally, announcing new partnership deals while the evidence of performance problems was accumulating. Critics noted that IBM's sales materials and executive statements did not adequately disclose the limitations of Watson's evidence base or the concerns that clinical evaluations had raised.

IBM's public communications described Watson's capabilities in terms that clinical experts found inconsistent with the evidence. When challenged on discordances between Watson's recommendations and clinical practice, IBM representatives sometimes characterized the discordances as evidence that Watson was identifying better options that conventional tumor boards had missed — an interpretation that was difficult to square with the clinical expert reviews suggesting the opposite.


The Divestiture

In 2022, IBM announced the sale of the majority of Watson Health assets to Francisco Partners, a private equity firm, for a reported price of approximately $1 billion. IBM's aggregate investment in Watson Health, accumulated over multiple years, had been variously estimated at $4 billion to $15 billion, including the $2.6 billion acquisition of the health data analytics firm Truven Health Analytics in 2016. The divestiture represented a substantial write-down and an acknowledgment that the healthcare AI vision had not been realized.

The sale closed in 2023. Francisco Partners subsequently rebranded the Watson Health assets under the Merative brand. The oncology-specific products were not continued in the same form.


What Watson Taught About AI Validation in Healthcare

The Watson for Oncology case teaches several lessons that remain relevant to clinical AI governance.

Marketing claims and clinical evidence operate on different evidentiary standards. Marketing claims about AI products are governed primarily by disclosure and deception standards that are far less rigorous than clinical evidence standards. IBM made representations about Watson's capabilities that were not supported by the clinical evidence at the time those representations were made. Healthcare organizations should apply clinical evidence standards — peer-reviewed validation studies, head-to-head comparisons with clinical practice, demographic performance data — to clinical AI procurement, not marketing standards.

Training methodology is clinical evidence, not just a technical detail. The decision to train Watson on hypothetical MSKCC-curated cases rather than on real patient outcomes data was a fundamental methodological choice with predictable clinical consequences. Healthcare organizations procuring AI tools should demand detailed transparency about training data — how it was collected, from what population, what it represents, and what it cannot represent. "We trained our AI on clinical data" is not an adequate answer.

Institutional prestige is not clinical evidence. MSKCC is one of the world's premier cancer centers. IBM's partnership with MSKCC was presented as lending MSKCC's clinical authority to Watson's recommendations. But clinical authority in human judgment does not automatically transfer to an AI system built on curated training data from that institution. Prestige associations in AI marketing should be examined critically.

Generalizability cannot be assumed — it must be demonstrated. A model trained in one clinical context will not automatically perform well in different contexts. Performance testing must occur in the specific population and clinical environment in which the system will be deployed. International deployment of Watson — to hospitals in India, South Korea, Europe, and Latin America — without population-specific validation was a governance failure that contributed to the clinical discordances documented in those settings.

The absence of regulatory requirements for clinical AI created a permissive environment for deployment without evidence. Watson for Oncology was never subject to FDA premarket review in the way a new drug or conventional medical device would be; the regulatory pathway for AI clinical decision support was unclear during the period of Watson's primary deployment. This regulatory gap allowed Watson to be marketed and sold to hospitals without the level of clinical evidence that would be required for, say, a new chemotherapy drug. The development of Software as a Medical Device (SaMD) regulatory frameworks since this period has begun to close the gap, but the Watson episode illustrates why those frameworks matter.

Accountability requires honesty about failure. Watson's failures were not publicly acknowledged by IBM while the products were still being actively marketed. MD Anderson's termination of its Watson project was not publicly announced. The accumulation of negative clinical evidence came through investigative journalism and independent research, not through IBM's own disclosure. Accountability in clinical AI requires proactive disclosure of performance problems, not just reactive response to external investigation.


Legacy

Watson for Oncology's failure did not end AI in oncology. Since Watson's effective withdrawal, multiple other AI oncology tools have been developed and deployed, including AI tools for tumor detection in pathology images, genomic analysis for treatment selection, and radiation treatment planning. These tools have generally been developed with more rigorous clinical evidence standards than Watson was held to.

Watson's legacy, however, is primarily as a cautionary tale — one that is cited in clinical AI governance discussions, regulatory deliberations, and academic analyses of AI in healthcare. The case established several reference points: the gap between AI marketing and clinical evidence, the generalizability problem in clinical AI, and the importance of regulatory oversight for clinical AI systems. For this reason, it remains an essential case study for anyone responsible for governing AI in healthcare settings.


Discussion Questions

  1. IBM trained Watson for Oncology on hypothetical patient cases rather than actual patient outcomes data. What would a rigorous training dataset for an oncology AI have looked like? What methodological standards should have applied?

  2. Watson's recommendations were found to diverge significantly from tumor board recommendations in institutions outside the United States. What due diligence should hospitals in those countries have conducted before purchasing and deploying Watson?

  3. IBM continued to market Watson for Oncology and sign new partner deals while evidence of performance problems was accumulating internally. What ethical obligations did IBM have in this situation? What transparency was owed to existing and prospective hospital partners?

  4. The regulatory framework for AI clinical decision support was inadequate during Watson's primary deployment period. If the current FDA SaMD framework had been in place, what requirements would Watson have needed to meet? Would those requirements have prevented the problems documented in this case?

  5. Watson for Oncology was presented as democratizing access to MSKCC-quality oncology expertise. The actual evidence suggested it did not deliver on this promise for hospitals in India and South Korea. What does this teach about the equity implications of exporting AI tools trained in high-income country settings to lower-resource settings without validation?

  6. IBM spent an estimated $4–15 billion on Watson Health before divesting the assets for approximately $1 billion. Who bore the costs of this failure? Patients who received Watson's recommendations? Hospitals that paid for Watson? IBM shareholders? What accountability mechanisms were available to each of these parties?