Case Study: Explainable AI in Healthcare — Promises and Pitfalls
"The AI can see things in the scan that I can't. But I need to know why it sees them — because if I can't understand the reasoning, I can't trust the recommendation, and I certainly can't explain it to my patient." — Radiologist at a U.S. academic medical center, 2022
Overview
Healthcare is one of the highest-stakes domains for AI deployment. Algorithmic systems now assist with diagnosing diseases from medical images, predicting patient deterioration, recommending treatments, identifying drug interactions, and allocating scarce resources. In many of these applications, AI systems match or exceed human clinician performance.
But healthcare is also a domain where explainability is uniquely critical. Clinicians must understand AI recommendations to integrate them with their own judgment. Patients have a right to understand the basis for decisions affecting their health. Regulators must ensure that AI systems are safe and effective. And when things go wrong — when an AI misses a diagnosis or recommends a harmful treatment — the consequences can be irreversible.
This case study examines the landscape of explainable AI (XAI) in healthcare: where it is deployed, how explanations are provided, what challenges arise, and what the experience reveals about the broader promises and pitfalls of transparency in algorithmic systems.
Skills Applied
- Evaluating XAI methods in a high-stakes domain
- Analyzing the relationship between explainability and clinical trust
- Assessing the adequacy of explanations for different healthcare stakeholders
- Connecting healthcare AI governance to the chapter's broader frameworks
The Landscape: AI in Clinical Practice
Where AI Is Used
AI-assisted clinical decision-making has expanded rapidly across multiple domains:
Radiology. AI systems analyze medical images — X-rays, CT scans, MRIs, mammograms — to detect abnormalities. Systems from companies including Google Health, Aidoc, and Lunit have demonstrated performance comparable to experienced radiologists in detecting lung nodules, breast cancer, diabetic retinopathy, and bone fractures. The FDA has cleared over 500 AI/ML-enabled medical devices, with radiology representing the largest category.
Pathology. AI systems analyze tissue samples to identify cancerous cells, classify tumor types, and predict disease progression. PathAI and Paige AI have received FDA approval for pathology AI tools. These systems can process slides faster than human pathologists and may identify patterns invisible to the human eye.
Clinical decision support. Systems like Sepsis Watch (Duke University) predict which patients are at risk of developing sepsis, enabling earlier intervention. Epic Systems' deterioration index predicts patient deterioration in hospitals. VitraMed's patient risk model (from this textbook's narrative) represents this category.
Drug discovery and interaction. AI systems predict potential drug interactions, identify candidate compounds for new medications, and optimize dosing. These systems typically operate behind the scenes, supporting researchers rather than making direct patient-facing decisions.
The Accuracy Advantage
In many domains, AI systems demonstrate accuracy advantages over human clinicians:
- Google's LYNA system detected metastatic breast cancer in lymph node biopsies with 99% accuracy, compared to human pathologists' 62% sensitivity under time pressure
- A Stanford deep learning system diagnosed 14 types of skin cancer from clinical photographs with accuracy comparable to 21 board-certified dermatologists
- Epic's sepsis prediction model, deployed at hundreds of hospitals, has been credited with reducing sepsis mortality rates at some institutions
These accuracy gains create pressure to deploy AI systems broadly — even when their internal reasoning is opaque.
The Explainability Challenge
Why Healthcare Needs XAI
The chapter's general argument for explainability applies with particular force in healthcare:
Clinical integration. A risk score or diagnostic recommendation is useful only if clinicians can integrate it with their own clinical judgment. A radiologist who receives a flag saying "possible nodule detected" needs to understand why the system flagged that region — the characteristics of the suspected lesion, the features the system identified, and how confident the system is. Without this information, the clinician faces a choice between blind trust (accepting the AI's output uncritically) and dismissal (ignoring the output because they cannot evaluate it).
Patient communication. Patients have a right to understand their diagnosis and treatment. If an AI system contributes to a clinical decision, the patient should be able to understand that contribution. "The AI says you have cancer" is not an acceptable basis for a diagnosis. "The AI identified a 2.3cm mass in the left lower lobe of your lung with characteristics consistent with malignancy, including irregular borders and ground-glass opacity" provides context the patient and clinician can discuss.
Error detection. AI systems make errors — and in healthcare, errors can be lethal. Explainability enables clinicians to catch errors by evaluating whether the AI's reasoning is consistent with clinical knowledge. If a system flags a region of an X-ray as suspicious and the explanation reveals that the "suspicion" is based on a bone fragment or imaging artifact rather than a genuine lesion, the clinician can override the recommendation. Without an explanation, the error may go undetected.
Regulatory requirements. The FDA requires that AI-based medical devices provide evidence of safety and efficacy. For "black box" systems, this creates a tension: how can regulators evaluate a system they cannot understand? The FDA's approach has been to require clinical validation (demonstrating that the system performs well in practice) rather than mechanical transparency (explaining how the system works internally). This approach measures outcomes but does not require understanding.
How Explanations Are Currently Provided
Healthcare AI systems employ various explanation methods:
Heat maps and saliency maps. In radiology, AI systems often produce heat maps that highlight the regions of a medical image that most influenced the system's output. A chest X-ray with a suspected nodule might show a red overlay on the region of concern, with cooler colors indicating areas the system considered less relevant. These visualizations are intuitive for radiologists, who are accustomed to focused visual analysis.
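One simple, model-agnostic way such heat maps can be produced is occlusion sensitivity: mask each patch of the image, re-score, and record how much the model's output drops. The sketch below is illustrative only — the `score` function stands in for a real diagnostic model, and the image, patch size, and "lesion" are toy assumptions, not any vendor's implementation.

```python
import numpy as np

def occlusion_map(image, score_fn, patch=8):
    """Saliency by occlusion: how much does masking each patch lower the score?"""
    base = score_fn(image)
    h, w = image.shape
    heat = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = image.copy()
            occluded[i:i + patch, j:j + patch] = 0.0  # mask this patch
            heat[i // patch, j // patch] = base - score_fn(occluded)
    return heat  # high values mark regions the score depends on

# Toy "model": score is the mean intensity of the top-left quadrant.
score = lambda img: img[:8, :8].mean()

img = np.zeros((16, 16))
img[:8, :8] = 1.0  # a bright synthetic "lesion" in the top-left
heat = occlusion_map(img, score, patch=8)
```

Masking the patch containing the "lesion" collapses the score, so that cell of the heat map lights up while the others stay at zero — the same logic, applied per pixel or per patch with a deep model, yields the red-overlay visualizations radiologists see.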
Feature importance lists. Clinical decision support systems often provide lists of the factors that most influenced a risk score — similar to the SHAP-based explanations discussed in the chapter. "Top risk factors: elevated troponin levels, tachycardia, age > 65, history of congestive heart failure."
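The arithmetic behind such lists is easiest to see for a linear model, where the SHAP value of each feature has a closed form: the feature's weight times its deviation from the baseline (population mean). The weights, baseline, and patient values below are invented for illustration; real deployments use libraries such as shap over nonlinear models, where the attributions must be estimated rather than computed directly.

```python
import numpy as np

# Hypothetical linear risk model: score = w . x + b
features = ["troponin", "heart_rate", "age", "chf_history"]
w = np.array([0.8, 0.3, 0.2, 0.5])
b = -1.0

def risk(x):
    return w @ x + b

baseline = np.array([0.1, 70.0, 60.0, 0.0])   # made-up population means
patient  = np.array([2.5, 110.0, 72.0, 1.0])  # made-up patient values

# For a linear model, the exact SHAP value of feature i is
# phi_i = w_i * (x_i - baseline_i).
phi = w * (patient - baseline)

# "Efficiency" property: attributions sum to f(patient) - f(baseline).
assert np.isclose(phi.sum(), risk(patient) - risk(baseline))

# Rank factors by contribution, like the top-risk-factor lists clinicians see.
# Note: on raw units, large-scale features (heart rate) dominate unless
# inputs are standardized first.
top = sorted(zip(features, phi), key=lambda t: -abs(t[1]))
```

The efficiency property is what makes these lists honest accounting: every point of the risk score is attributed to some feature, so "top risk factors" is a decomposition of the score, not a loose narrative.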
Confidence scores. Some systems provide not just a prediction but a confidence level: "85% confidence that this lesion is malignant." This allows clinicians to calibrate their response — treating a high-confidence flag differently from a low-confidence one.
Natural language explanations. Emerging systems generate natural language summaries of their reasoning: "This patient is at elevated risk of sepsis based on: increasing white blood cell count over the past 12 hours, elevated lactate, and tachycardia. These findings are consistent with early systemic inflammatory response."
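A minimal version of this pattern is template filling over the feature attributions: keep the factors that push risk up past a threshold, rank them, and render a sentence. The threshold, wording, and example contributions below are assumptions for illustration, not any deployed system's logic.

```python
def summarize(risk_score, contributions, threshold=0.1):
    """Render the top risk drivers as a clinician-readable sentence."""
    # Keep only factors pushing risk up by more than the threshold.
    drivers = sorted(
        (kv for kv in contributions.items() if kv[1] > threshold),
        key=lambda kv: -kv[1],
    )
    if not drivers:
        return f"Risk score {risk_score:.2f}; no single factor dominates."
    names = ", ".join(name for name, _ in drivers)
    return f"Risk score {risk_score:.2f}; driven mainly by: {names}."

# Hypothetical per-factor contributions for one patient
msg = summarize(0.85, {
    "rising white blood cell count": 0.30,
    "elevated lactate": 0.25,
    "tachycardia": 0.15,
    "age": 0.02,  # below threshold, omitted from the summary
})
```

Even this toy version shows the design tension from the Sepsis Watch case: the template surfaces what the model weighted, which may be exactly the factors a nurse already considers obvious.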
Case Illustrations
Case 1: IBM Watson for Oncology
IBM Watson for Oncology was one of the most prominent and most cautionary AI healthcare deployments. Launched in 2013 as a partnership with Memorial Sloan Kettering Cancer Center, Watson was marketed as an AI that could recommend cancer treatments by analyzing patient data against a vast corpus of medical literature.
The promise: Watson would democratize access to expert-level oncology guidance, bringing MSK-quality treatment recommendations to hospitals worldwide — particularly in low-resource settings.
The reality: Multiple investigations revealed significant problems:
- Watson's recommendations were based primarily on training by MSK oncologists, not on independent analysis of medical literature. The system was effectively replicating one institution's practices, not synthesizing global evidence.
- In some cases, Watson recommended treatments that were unsafe — including, in at least one documented case, recommending a drug for a cancer patient with a bleeding disorder where the drug could cause fatal hemorrhaging.
- The system's explanations were often boilerplate — listing general factors without explaining why a specific treatment was recommended for a specific patient.
- When clinicians disagreed with Watson's recommendations (which occurred in approximately 30-40% of cases at some hospitals), the system could not explain its reasoning in a way that facilitated productive dialogue.
IBM eventually scaled back Watson Health, and the oncology product was discontinued in its original form. The case illustrated that accuracy claims without explainability, and AI deployment without deep clinical integration, can produce systems that are neither trustworthy nor safe.
Case 2: Sepsis Prediction at Duke University
Duke University Hospital developed Sepsis Watch, a deep learning system that predicts which patients in the emergency department are at risk of developing sepsis — a life-threatening condition that kills approximately 270,000 Americans annually and requires rapid intervention.
The approach: Rather than deploying the model as a standalone tool, Duke integrated it into a sociotechnical system:
- The deep learning model generates a risk score for each patient
- A rapid response nurse receives the score along with an explanation of the key contributing factors
- The nurse reviews the score, considers the clinical context, and decides whether to contact the clinical team
- The clinical team makes the treatment decision
The XAI component: The system provides feature-attribution explanations showing which clinical factors most influenced the score (vital signs, lab values, medications, clinical notes). Importantly, the system also communicates what it does not know — flagging cases where the available data is insufficient for a confident prediction.
The results: Sepsis Watch has been credited with reducing sepsis mortality at Duke. But researchers also identified challenges:
- Nurses sometimes dismissed high-risk scores when they conflicted with clinical intuition, even when the model was correct
- The explanations sometimes highlighted factors that nurses considered obvious (e.g., "elevated temperature"), reducing trust in the model's value-add
- Integration required months of training and organizational change — the technical system was the easy part
Case 3: Dermatology AI and Skin Tone Bias
AI systems for dermatological diagnosis — detecting skin cancer, rashes, and other conditions from clinical photographs — have demonstrated impressive accuracy. But investigations have revealed a critical limitation: many systems were trained primarily on images of light-skinned patients.
When these systems are deployed on patients with darker skin tones, accuracy drops significantly. A 2021 study found that dermatological AI systems achieved 10-20 percentage points lower accuracy on dark skin compared to light skin for some conditions.
The explainability challenge is acute here. When the system misidentifies a condition on dark skin, the explanation (typically a heat map or feature attribution) may not reveal the reason for the error — which is not about the specific image but about the training data's composition. The explanation tells the clinician what the system "saw," but not that the system was poorly equipped to see accurately in the first place.
This connects directly to the bias discussion from Chapter 14: representation bias in training data produces systematic failures that XAI tools can document but not resolve.
The Governance Question
Who Should Regulate Healthcare AI?
Healthcare AI governance involves multiple regulatory bodies with overlapping jurisdiction:
- The FDA regulates AI as a medical device, requiring evidence of safety and efficacy
- HIPAA governs the privacy and security of health data used to train and operate AI systems
- Professional medical boards govern clinician practice, including how clinicians use AI tools
- Hospital ethics committees evaluate the ethical implications of new technologies
- The EU AI Act classifies healthcare AI as "high-risk" and imposes transparency and quality management requirements
No single body has comprehensive authority, and the regulatory landscape is fragmented. The result is a governance gap: AI systems can be deployed in clinical settings with limited oversight of their explainability, fairness, or ongoing performance.
The VitraMed Thread
Mira's experience with VitraMed's patient risk model exemplifies the governance challenge. The model is accurate in aggregate but biased against Black patients (as she discovered in Chapter 14). It is a black box — a gradient-boosted ensemble whose specific decision logic for any individual patient is difficult to articulate. And the hospital system has no formal process for evaluating, auditing, or overriding algorithmic recommendations.
"I can tell the ethics committee that the model has a problem," Mira said. "But what I can't tell them is exactly how it decides which patients to flag. I know the model's overall feature importances. I can generate SHAP values for individual patients. But I can't say: 'Here is the logic chain that led to this patient being deprioritized.' There is no logic chain. There are 5,000 decision trees, each with hundreds of nodes, and the final score is an average of all of them."
Discussion Questions
- The trust calibration problem. Clinicians need to trust AI systems enough to use them but not so much that they defer uncritically. How should explanations be designed to promote appropriate trust? What level of detail helps clinicians make better decisions, and at what point does additional detail become counterproductive?
- The patient's right to know. Should patients be informed when AI systems contribute to their diagnosis or treatment? If so, how? What would a meaningful disclosure look like — and how can it be provided without causing unnecessary anxiety about a technology most patients do not understand?
- The FDA's approach. The FDA evaluates healthcare AI primarily by clinical outcomes (does the system perform well?) rather than by mechanistic transparency (can we explain how it works?). Is this approach adequate? Should the FDA require interpretability for AI medical devices, even at the cost of some accuracy?
- The bias-explainability connection. The dermatology AI case shows that XAI methods can reveal what a system focused on but not why it was wrong for certain populations. How should healthcare AI address the intersection of bias and explainability? Is it sufficient to audit for bias separately, or should explanations be required to include bias-relevant information?
Your Turn: Mini-Project
Option A: Clinical Explanation Design. Design a user interface for presenting AI-generated risk scores to emergency room clinicians. Specify: What information would be displayed? How would explanations be formatted? How would confidence/uncertainty be communicated? How would the interface support (rather than undermine) clinical judgment? Create a mockup or detailed written specification.
Option B: Patient Communication Framework. Draft a one-page patient-facing document that a hospital could provide to patients explaining how AI is used in their care. The document should be written at an 8th-grade reading level and should address: what the AI does, how it contributes to decisions, what it cannot do, and how patients can ask questions or express concerns.
Option C: Healthcare AI Governance Framework. Draft a governance framework for a hospital system deploying AI in clinical practice. Your framework should address: (1) requirements for explainability before deployment, (2) ongoing monitoring for bias and accuracy, (3) protocols for clinician training, (4) patient notification and consent, (5) procedures for overriding algorithmic recommendations, and (6) accountability for errors. Write a two-page framework document.
References
- Topol, Eric J. Deep Medicine: How Artificial Intelligence Can Make Healthcare Human Again. New York: Basic Books, 2019.
- Sendak, Mark P., Madeleine Clare Elish, Michael Gao, Joseph Futoma, William Ratliff, Marshall Nichols, Armando Bedoya, Suresh Balu, and Cara O'Brien. "'The Human Body Is a Black Box': Supporting Clinical Decision-Making with Deep Learning." Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (FAT* '20), 99-109. ACM, 2020.
- Strickland, Eliza. "IBM Watson, Heal Thyself: How IBM Overpromised and Underdelivered on AI Health Care." IEEE Spectrum, April 2, 2019.
- Rudin, Cynthia. "Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead." Nature Machine Intelligence 1, no. 5 (2019): 206-215.
- Daneshjou, Roxana, Mary P. Smith, Mary D. Sun, Veronica Rotemberg, and James Zou. "Lack of Transparency and Potential Bias in Artificial Intelligence Data Sets and Algorithms: A Scoping Review." JAMA Dermatology 157, no. 11 (2021): 1362-1369.
- U.S. Food and Drug Administration. "Artificial Intelligence and Machine Learning (AI/ML)-Enabled Medical Devices." FDA, October 2024.
- European Commission. "Regulation of the European Parliament and of the Council Laying Down Harmonised Rules on Artificial Intelligence (Artificial Intelligence Act)." 2024.
- Obermeyer, Ziad, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. "Dissecting Racial Bias in an Algorithm Used to Manage the Health of Populations." Science 366, no. 6464 (2019): 447-453.