In This Chapter
- Consequential, Opaque, and Hard to Hold Accountable
- Learning Objectives
- Section 1: The Clinical Decision-Making Landscape
- Section 2: The Human-in-the-Loop Question
- Section 3: Informed Consent for AI-Assisted Care
- Section 4: FDA Regulation of Clinical AI
- Section 5: Algorithmic Harm in Healthcare
- Section 6: IBM Watson for Oncology — A Cautionary Tale
- Section 7: End-of-Life Care and AI
- Section 8: Mental Health AI
- Section 9: AI and Health Equity
- Section 10: Building Ethical Clinical AI Governance
- Recurring Themes in This Chapter
- Conclusion
- Key Terms
Chapter 36: AI in Healthcare Decision-Making
Consequential, Opaque, and Hard to Hold Accountable
Opening Hook
In September 2021, Epic Systems — the healthcare software giant whose electronic health record products run in approximately 2,500 hospitals across the United States, covering more than 250 million patient records — quietly updated its Deterioration Index. The Deterioration Index is an AI model embedded in Epic's hospital software that assigns patients scores predicting which individuals are at risk of rapid clinical decline, triggering alerts for clinical staff to intervene before a crisis.
The update changed the model's behavior in ways that were not trivial. Clinicians in hospitals across the country were not uniformly informed that the tool they relied on had been altered; some simply did not know. They continued consulting Deterioration Index scores with the trust and clinical habits they had built around the previous version of the model — habits calibrated to a different algorithm. Independent researchers who later published validation studies of Epic's Deterioration Index found substantially variable performance across different hospital populations, raising questions about whether the model generalized adequately to diverse patient groups.
This episode captures, in miniature, the governance challenge of AI in healthcare. The Deterioration Index is not a research prototype — it runs in real clinical environments, alerts real nurses and physicians, shapes real decisions about which patients receive urgent attention. When it changes, patients are potentially affected. When it performs differently across demographic groups, the differential is not academic; it affects who survives clinical deterioration.
Healthcare AI is consequential, often opaque, and currently difficult to hold accountable. The frameworks for governing it — FDA regulation, hospital procurement standards, professional medical ethics — are developing but remain inadequate to the pace and scale of deployment. This chapter examines the ethical dimensions of AI in clinical decision-making: not AI in healthcare broadly (the field is too large), but specifically the domain of AI-assisted clinical decisions — diagnosis, risk prediction, treatment recommendations, and end-of-life care.
Learning Objectives
By the end of this chapter, students will be able to:
- Describe the range of clinical contexts in which AI decision support tools are currently deployed and distinguish between augmentation and automation in clinical AI.
- Explain the concept of automation bias in healthcare and analyze its implications for human oversight of clinical AI.
- Evaluate the ethical dimensions of informed consent for AI-assisted care, including current practice gaps and what meaningful disclosure would require.
- Explain the FDA's regulatory framework for Software as a Medical Device (SaMD) and identify its primary limitations in governing adaptive clinical AI.
- Analyze specific documented cases of algorithmic harm in healthcare, identifying the sources of harm and the governance failures that allowed them.
- Assess the ethical challenges specific to end-of-life care and mental health applications of AI.
- Apply a health equity lens to clinical AI deployment decisions, identifying how AI can widen or narrow health disparities.
- Develop a framework for ethical clinical AI governance in a hospital or health system context, including procurement, validation, monitoring, and patient communication requirements.
Section 1: The Clinical Decision-Making Landscape
Artificial intelligence has moved from research settings into active clinical practice at a pace that has outrun both regulatory oversight and professional norms. Understanding the landscape of where AI is currently deployed in clinical decision-making is a prerequisite for evaluating its governance.
Radiology and Medical Imaging
Radiology has been the leading edge of clinical AI deployment, in part because imaging data is structured and abundant, making it amenable to pattern recognition approaches. FDA-authorized AI tools now assist in interpreting chest X-rays, mammograms, CT scans, MRIs, and retinal images. Applications include detection of pulmonary nodules, measurement of cardiac structures, identification of diabetic retinopathy, and flagging of imaging findings consistent with intracranial hemorrhage.
The radiology use case illustrates a key distinction: AI tools in this space are generally positioned as decision support — assisting radiologists in identifying findings they might review and interpret — rather than fully replacing radiologist review. The question of what genuinely constitutes decision support, versus what creates automation pressure to accept AI-generated interpretations, is a recurring theme in clinical AI governance.
Sepsis Prediction and Deterioration Detection
One of the most widely deployed categories of clinical AI involves predicting which hospitalized patients are at risk of sepsis (a life-threatening infection response) or deterioration (rapid clinical decline). These tools analyze electronic health record data — vital signs, lab values, medication records, clinical notes — and generate risk scores that alert clinical staff to patients who may need urgent intervention.
Tools in this category include Epic's Deterioration Index, Epic's Sepsis Model, and competing products from various vendors. These tools operate at extraordinary scale — hundreds of thousands of hospitalized patients per day. Their performance, their demographic variation in performance, and their effects on clinical alert fatigue and decision-making are consequential issues examined throughout this chapter.
Treatment Recommendations and Clinical Decision Support
AI tools increasingly generate treatment recommendations alongside or within electronic health records: drug dosing suggestions, clinical pathway recommendations, diagnostic differentials. These tools range from evidence-based clinical decision support (CDSS) drawing on published guidelines to more complex AI systems that generate recommendations based on population-level pattern recognition.
The ethical stakes of treatment AI are high because recommendations can affect the quality and nature of care a patient receives. IBM Watson for Oncology — examined in detail in this chapter's case studies — became the most prominent cautionary tale about what happens when AI treatment recommendations fail to generalize appropriately.
Triage
Emergency department triage — the prioritization of patients for care based on urgency — is increasingly AI-assisted. Tools predict which patients presenting to emergency departments are at highest risk, enabling staff to prioritize scarce emergency care resources. The fairness implications of triage AI are significant: systematic underestimation of risk for patients from certain demographic groups means those patients receive lower priority in contexts where delays can cause harm.
Pathology
Digital pathology — AI analysis of tissue samples — is an emerging area with significant implications for cancer diagnosis. AI tools for analyzing biopsy slides have received FDA authorization in several categories, offering the potential to improve diagnostic consistency and throughput.
The Spectrum from Augmentation to Automation
A conceptual distinction that recurs throughout clinical AI governance is the spectrum from augmentation to automation. Augmentation uses AI to enhance human clinical judgment — presenting additional information, flagging findings for review, suggesting alternatives the clinician might not have considered. The clinician remains the decision-maker, with AI providing additional inputs. Automation, at the other end of the spectrum, replaces human judgment with algorithmic decisions — routing patients to care pathways, adjusting medication doses, or making diagnostic determinations without clinician review.
Most clinical AI is currently presented as augmentation. Whether it functions as augmentation in practice — whether clinicians genuinely exercise independent judgment over AI recommendations, or whether they tend to accept AI outputs without meaningful review — is an empirical question whose answer has significant governance implications.
Section 2: The Human-in-the-Loop Question
The principle of "human-in-the-loop" — maintaining meaningful human oversight of AI decisions — is central to AI ethics discussions generally. In clinical contexts, the question becomes concrete and consequential: What does genuine clinical oversight of AI actually look like? What conditions must exist for clinical oversight to be meaningful rather than nominal?
The Evidence on Automation Bias in Healthcare
Automation bias is the documented tendency to over-rely on automated systems, particularly when those systems present recommendations in authoritative-seeming formats. In healthcare, automation bias has been documented in multiple settings.
Studies of clinical decision support systems have found that clinicians accept system recommendations at high rates, including recommendations that, on reflection, they would identify as inappropriate. Alert override studies — tracking how clinicians respond to electronic health record alerts — consistently find that clinicians override most alerts, but also that when they agree with alerts, they do so rapidly, without always engaging in independent clinical reasoning.
For AI tools specifically, research has found that presenting AI confidence scores alongside recommendations can paradoxically increase automation bias: high-confidence AI recommendations are accepted at higher rates than lower-confidence ones, even when the clinician's own clinical assessment suggests uncertainty. The format and presentation of AI outputs shape clinical behavior in ways that governance frameworks must address.
The Epic Deterioration Index Story
The Epic Deterioration Index story illustrates the automation bias problem in a real deployed system. Epic's electronic health record is used in the majority of large U.S. health systems. The Deterioration Index, embedded in Epic's flowsheet interface, presents clinicians with a score and risk category for each patient.
Research published in peer-reviewed journals has found substantial variation in the Deterioration Index's performance across hospital settings: positive predictive values, sensitivity, and specificity that differ significantly from the statistics Epic reports for the model. This variability suggests that the model, trained in one context, may not generalize adequately to other clinical environments.
The governance implication is critical: when clinicians trust a clinical AI tool as if it were validated for their specific patient population and workflow, but the tool's performance is substantially different from what they believe, the "human-in-the-loop" becomes a human who is making decisions based on information they incorrectly believe to be reliable. That is not meaningful human oversight — it is nominal oversight with false confidence.
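Part of the cross-site variability has a simple arithmetic component: even if a model's sensitivity and specificity were perfectly stable, its positive predictive value would still shift with how common deterioration is in the scored population. A minimal sketch, using hypothetical performance figures rather than Epic's actual statistics:

```python
def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' rule: of all alerts fired,
    what fraction identify a patient who truly deteriorates?"""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# The same hypothetical model (80% sensitivity, 90% specificity) deployed
# at two hospitals with different deterioration rates:
ppv(0.80, 0.90, 0.10)  # ~0.47 where 10% of patients deteriorate
ppv(0.80, 0.90, 0.02)  # ~0.14 where only 2% do: most alerts are false alarms
```

Real cross-site differences also reflect population and data shift, but this prevalence effect alone means a positive predictive value reported from one validation cohort cannot be assumed to hold elsewhere.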
When AI Augments vs. Undermines Clinical Judgment
Not all clinical AI undermines clinical judgment. Well-designed, well-validated AI tools can genuinely augment clinical capability by drawing attention to findings that clinicians might miss, presenting probabilistic assessments that contextualize clinical decisions, and reducing cognitive load in high-volume clinical environments.
The conditions for genuine augmentation include: the clinician has independent basis for verifying AI recommendations in the relevant domain; the AI tool's performance characteristics are transparent and validated for the relevant population; the presentation of AI outputs does not discourage independent verification; and feedback mechanisms exist so clinicians can learn when AI recommendations were incorrect.
The conditions for AI undermining clinical judgment include: clinicians lack the domain expertise to independently evaluate AI recommendations; AI outputs are presented in authoritative formats that discourage questioning; alert fatigue has led clinicians to develop habits of rapid acceptance or override without case-by-case assessment; and no feedback mechanism exists to reveal AI errors to the clinicians who acted on them.
Section 3: Informed Consent for AI-Assisted Care
A foundational principle of medical ethics is informed consent: patients have the right to know what is being done to them and to make autonomous decisions about their care. The question of what informed consent requires in the context of AI-assisted care is genuinely contested.
What Current Practice Looks Like
As of the early 2020s, most hospitals do not specifically disclose to patients that AI tools are involved in their care. Standard consent forms address the general use of technology in care delivery but do not typically enumerate specific AI tools or describe their role in clinical decisions. Patients whose images are read with AI assistance, whose records feed risk-scoring models, or whose clinicians consult AI decision support typically have no awareness that AI is involved.
This reflects a broader legal and ethical ambiguity about what disclosure is required. No federal law in the United States explicitly requires patient disclosure of AI involvement in clinical care as of 2024. State laws vary. Professional guidance is evolving.
The Arguments for Disclosure
The arguments for requiring disclosure of AI involvement in care are grounded in several principles. First, informed consent requires that patients understand the nature of their care. If AI significantly influences diagnostic or treatment decisions, patients arguably have a right to know this, both to understand their care and to exercise their right to seek alternative opinions or providers.
Second, patients may have values-based objections to AI involvement in their care that deserve respect — concerns about privacy, about algorithmic bias, or about the substitution of relational care with algorithmic processing. These are legitimate patient preferences that informed consent should protect.
Third, disclosure enables accountability. When patients know AI is involved in their care, they can ask questions, seek explanations, and advocate for themselves in ways they cannot when AI involvement is invisible.
The Arguments Against Mandatory Disclosure
Arguments against mandatory AI disclosure in every clinical interaction focus on practical feasibility and clinical impact. Clinical care involves dozens of technology-mediated processes; requiring specific disclosure for each AI tool would create disclosure burdens that might obscure rather than illuminate meaningful information. Patients may also lack the technical background to make meaningful decisions based on disclosure of specific AI tools.
What Informed Consent Should Require
A principled framework for AI disclosure in healthcare does not require disclosure of every algorithm that touches care. It requires disclosure that is meaningful, material, and decision-relevant. At a minimum, disclosure should occur when: AI plays a significant role in diagnosis or treatment recommendation; the AI tool's accuracy or reliability may be relevant to a patient's decision about care; or the patient is from a demographic group for which the AI tool's performance has been found to differ materially from the reported average.
Section 4: FDA Regulation of Clinical AI
The primary federal regulatory framework for clinical AI in the United States is the Food and Drug Administration's oversight of Software as a Medical Device (SaMD) — software that makes or supports medical decisions.
The SaMD Framework
The FDA has applied the medical device regulatory framework to clinical AI software, treating software that aids in medical decision-making as a medical device subject to premarket review. The SaMD framework categorizes AI/ML software based on risk: software that could cause serious harm to patients if it fails (high risk) is subject to stricter premarket review than software with lower stakes.
The 510(k) and De Novo Pathways
The two primary pathways for SaMD authorization are the 510(k) substantial equivalence pathway and the De Novo pathway for novel devices without a predicate.
The 510(k) pathway allows manufacturers to demonstrate that their device is substantially equivalent to a previously authorized predicate device. For AI medical devices, the 510(k) pathway presents challenges: many AI clinical tools are genuinely novel, without clear predicates, yet some have used 510(k) as an authorization pathway by identifying technically similar predecessors. Critics argue that 510(k) is inadequate for novel AI medical devices because substantial equivalence to an older technology does not ensure that the new AI system is safe or effective.
The De Novo pathway is intended for novel, low-to-moderate risk devices without predicates. It requires the manufacturer to demonstrate safety and effectiveness through a formal review process that establishes special controls for the device category. Multiple AI medical devices have received De Novo authorization, establishing predicate status that subsequent similar devices can then use in 510(k) submissions.
Adaptive AI and Predetermined Change Control Plans
A distinctive challenge for FDA regulation of clinical AI is the treatment of adaptive AI — models that can update their parameters based on new data during deployment. Traditional medical devices have fixed specifications; once authorized, they function as described. AI/ML models can update continuously, potentially changing their performance characteristics in ways that the original authorization did not contemplate.
The FDA's 2021 AI/ML Action Plan proposed the concept of predetermined change control plans (PCCPs): a manufacturer would submit a plan describing what types of model updates would be made, how those updates would be validated, and what performance specifications must be maintained. The FDA would authorize not just a specific model version but a plan for how the model would evolve. This approach allows AI systems to improve during deployment while maintaining regulatory oversight of the change process.
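The logic of a predetermined change control plan reduces to a gate: an update ships only if pre-registered validation metrics stay within bounds the regulator has already reviewed. A schematic sketch — the metric names and thresholds here are hypothetical, not FDA-specified:

```python
# Hypothetical pre-registered performance floors from an authorized PCCP.
PCCP_FLOORS = {"sensitivity": 0.80, "specificity": 0.85}

def update_permitted(validation_metrics: dict, floors: dict = PCCP_FLOORS) -> bool:
    """An updated model may deploy only if every pre-registered metric
    meets its floor on the agreed validation protocol."""
    return all(validation_metrics.get(metric, 0.0) >= floor
               for metric, floor in floors.items())

update_permitted({"sensitivity": 0.83, "specificity": 0.88})  # True: deploy
update_permitted({"sensitivity": 0.83, "specificity": 0.80})  # False: hold
```

The substance of a real PCCP lies in the validation protocol behind those numbers, but the structure is the same: the change process, not each model version, is what gets authorized.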
The Demographic Performance Gap
In 2022, the FDA proposed a rule on transparency and accountability for clinical decision support software that included requirements for demographic performance reporting — reporting how AI tools perform across racial, ethnic, age, and sex subgroups. The proposal reflected growing evidence that clinical AI tools often perform differently across demographic groups, with some groups experiencing significantly worse performance than the overall statistics suggest. The demographic performance gap is examined further in Section 9.
Section 5: Algorithmic Harm in Healthcare
The theoretical risk of algorithmic harm in healthcare has concrete documentation in multiple real cases. Understanding these cases — what went wrong, how it was discovered, what the consequences were — is essential for developing effective governance.
The Optum Algorithm
One of the most thoroughly documented cases of racial bias in healthcare AI was the Optum care management algorithm studied by Ziad Obermeyer and colleagues in research published in Science in 2019. Health systems used Optum's algorithm to identify patients who would benefit from care management programs — coordinated, intensive care for complex medical needs.
The algorithm used healthcare costs as a proxy for health need, on the assumption that patients with more health needs generate more healthcare costs. Obermeyer's team found that this proxy systematically underestimated the health needs of Black patients, because Black patients generated lower healthcare costs than equally sick white patients — a reflection of the systemic barriers to healthcare access that affect Black Americans. The result was that the algorithm allocated care management resources in ways that disadvantaged Black patients: at a given algorithmic risk score, Black patients were significantly sicker than white patients.
The algorithm was commercially deployed and widely used before the research team identified the bias. Optum subsequently updated the algorithm to reduce the racial disparity.
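The mechanism is easy to see in stylized form: two patients with identical underlying need, where structural access barriers mean one generates lower spending, receive different cost-based scores. The numbers below are invented for illustration and bear no relation to Optum's actual model:

```python
# Two hypothetical patients with identical health need; patient B's lower
# spending reflects access barriers, not better health.
patients = {
    "A": {"need": 8, "annual_cost": 12_000},
    "B": {"need": 8, "annual_cost": 7_500},
}

def cost_proxy_score(annual_cost: float, cost_cap: float = 20_000) -> float:
    """A risk score that uses spending as a stand-in for health need."""
    return min(annual_cost / cost_cap, 1.0)

scores = {name: cost_proxy_score(p["annual_cost"]) for name, p in patients.items()}
# scores == {"A": 0.6, "B": 0.375}: equal need, unequal scores. With a
# hypothetical program-eligibility cutoff of 0.5, A is enrolled and B is not.
```

The proxy looks clinically reasonable in isolation; the bias enters because the proxy variable itself encodes unequal access.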
eGFR Race Correction
The estimated glomerular filtration rate (eGFR) formula used to estimate kidney function incorporated a race-based correction factor that assigned higher kidney function values to Black patients than to white patients with identical creatinine measurements. The practical consequence was that Black patients with kidney disease could appear, by the formula's calculation, to have better kidney function than they actually did — delaying diagnosis and treatment.
The race correction was not based on robust evidence of a biological difference; it reflected historical misclassification of race-based health differences as biological rather than structural. In 2021, a joint task force of the National Kidney Foundation and the American Society of Nephrology recommended race-free eGFR equations, following advocacy by medical trainees and researchers who documented the harm, and health systems moved to eliminate the correction.
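The correction is visible directly in the arithmetic. Below is a simplified sketch of the 2009 CKD-EPI estimating equation (the version since replaced by a race-free 2021 revision); the coefficients follow the published formula, but this is an illustration, not a clinical calculator:

```python
def ckd_epi_2009(scr_mg_dl: float, age: int, female: bool, black: bool) -> float:
    """2009 CKD-EPI eGFR in mL/min/1.73 m^2. Illustration only."""
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    egfr = (141
            * min(scr_mg_dl / kappa, 1.0) ** alpha
            * max(scr_mg_dl / kappa, 1.0) ** -1.209
            * 0.993 ** age)
    if female:
        egfr *= 1.018
    if black:
        egfr *= 1.159  # the race term: identical labs, 15.9% higher eGFR
    return egfr

# Same patient, same creatinine of 1.4 mg/dL at age 60: the formula reports
# roughly 16% higher kidney function if the patient is recorded as Black,
# which could keep a declining eGFR above a referral or transplant threshold.
```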
The Pulse Oximeter Problem
Pulse oximeters — widely used in clinical settings to measure blood oxygen saturation — were found in published research during the COVID-19 pandemic to systematically overestimate oxygen saturation in patients with darker skin tones. The FDA's device authorizations had not required testing across skin tone groups. The consequence was that clinical decisions about oxygen supplementation, mechanical ventilation, and discharge were potentially being made based on inaccurate measurements for patients with darker skin — precisely the patients disproportionately affected by severe COVID-19.
IBM Watson for Oncology
Watson for Oncology is examined in detail in this chapter's case studies. In brief: IBM's AI oncology treatment recommendation system generated treatment recommendations that were found to be clinically inappropriate in a substantial fraction of cases, as documented in internal IBM materials reported by the STAT News investigation. The system's recommendations did not generalize from its MSKCC training context to the diverse clinical contexts in which it was deployed internationally.
Section 6: IBM Watson for Oncology — A Cautionary Tale
IBM Watson for Oncology is the most extensively documented cautionary tale in clinical AI, and understanding it in detail is essential for business and policy professionals involved in healthcare AI governance. The full case study appears in this chapter's case-study-01.md. This section provides the analytical framework.
The Promise
Watson for Oncology was announced amid extraordinary expectations. IBM had achieved global recognition through Watson's 2011 victory over human champions on Jeopardy! — a demonstration of natural language understanding that attracted enormous media attention. IBM pivoted Watson to healthcare, arguing that the same ability to process unstructured information at scale could be applied to clinical evidence and patient records to improve medical decision-making.
The oncology application promised to give every oncologist, at every hospital worldwide, access to the institutional knowledge and treatment recommendations of Memorial Sloan Kettering Cancer Center — one of the premier cancer research hospitals in the United States. IBM announced a partnership with Memorial Sloan Kettering in 2012 to train Watson on lung, breast, colorectal, cervical, and other cancers, and signed a separate agreement with MD Anderson Cancer Center in 2013 to develop a Watson-based advisor for leukemia.
The Reality
A 2018 investigation by STAT News, based on internal IBM documents obtained by the publication, revealed that Watson for Oncology was generating treatment recommendations that oncologists found "unsafe and incorrect" in a significant number of cases. An internal IBM document quoted in the STAT News investigation described Watson recommending "high-dose chemotherapy for a patient with severe bleeding" — a recommendation that oncologists noted could be lethal.
The core problem was the training methodology. Watson for Oncology's recommendations were generated not from analysis of real-world patient outcomes data, but from training on hypothetical patient cases curated by MSKCC oncologists. These were cases designed to elicit MSKCC's clinical reasoning — but they represented MSKCC's specific institutional approach, not universally validated evidence-based medicine. When Watson was deployed in different clinical contexts — different patient populations, different treatment resources, different formularies, different clinical guidelines — its recommendations did not generalize appropriately.
The Lessons
Watson for Oncology teaches several durable lessons about AI in healthcare. First, marketing and clinical validation are different domains, and the gap between them can be wide and dangerous. Second, training data that reflects the practices of one institution — even a preeminent one — does not produce systems that generalize to other institutions. Third, the absence of real-world outcomes data in training creates systems that reflect clinical judgment without validating that judgment against patient outcomes. Fourth, the reputational and commercial pressures to announce and deploy AI solutions can overwhelm the methodological rigor necessary for clinical safety.
Section 7: End-of-Life Care and AI
The application of AI to end-of-life care represents some of the most ethically fraught territory in clinical AI. Death is inevitable; how and when it is anticipated, communicated, and managed has profound implications for human dignity, patient autonomy, and the quality of dying.
Prognostic AI
AI systems are increasingly used to predict mortality — to estimate the probability that a patient will die within a given time horizon. Epic's proprietary mortality predictor score, embedded in its EHR platform, is among the most widely deployed such systems, operating across thousands of hospitals. These systems generate numerical predictions — often presented as probabilities or risk scores — that clinical teams may use in making decisions about care intensity, palliative care referrals, and conversations with patients and families.
The Dignity Dimension
The application of algorithmic mortality prediction to end-of-life decision-making raises questions that are not purely technical. A mortality prediction score is not simply information — it is a framing that shapes how clinicians, families, and patients think about the future. Algorithmic predictions presented in authoritative numerical formats may carry different weight than clinical judgment communicated in language, which can express uncertainty, context, and the limits of knowledge.
When an algorithm assigns a patient a 74% probability of dying within thirty days, what does that mean for the clinical relationship? For the patient's decision-making about preferences and goals of care? For the family's understanding of prognosis? These are ethical and relational questions that the numerical output does not answer.
Who Controls the Data
End-of-life prognostic AI also raises data governance questions. The data used to generate mortality predictions is drawn from electronic health records — patient data held by health systems. Patients typically do not know their mortality prediction score, cannot review the data that generated it, and have limited recourse if the prediction is wrong. The asymmetry between institutional use of this data and patient knowledge or control of it is a dimension of healthcare AI governance that receives insufficient attention.
Patient Autonomy and Algorithmic Prognosis
Patient autonomy in end-of-life decision-making requires accurate, honest information about prognosis. AI can contribute to better prognostic information if it is well-validated and appropriately communicated. But it can undermine autonomy if patients are not informed about AI involvement in prognostication, if predictions are presented with false precision, or if algorithmic assessments drive clinical conversations in directions that do not reflect patient values.
The Palliative Care AI Opportunity
Used thoughtfully, prognostic AI can improve access to palliative care — a chronic undersupply in U.S. healthcare. By identifying patients with limited prognosis who have not received palliative care consultation, AI tools can trigger conversations that improve the quality of dying. Research has found that accurate prognostication and early palliative care referral improve both quality of life and, in some studies, length of life for patients with serious illness. The ethical use of end-of-life AI requires not just technical validity but careful attention to how prognostic information is communicated and how patient preferences are integrated into care decisions triggered by AI tools.
Section 8: Mental Health AI
The application of AI to mental health represents a rapidly developing area with significant opportunities and significant governance gaps.
Therapy Chatbots
A range of digital mental health applications using AI conversational interfaces have been deployed for consumer mental health support. Applications like Woebot, Wysa, and Replika use conversational AI to provide support based on cognitive behavioral therapy (CBT) principles, mindfulness techniques, and general emotional support.
These tools have potential advantages: accessibility (available 24/7, without geographic constraints or cost barriers that limit human therapy access), scalability (potentially reaching many more people than the mental health workforce could serve), and the reduced stigma that some users experience when disclosing distress to an AI rather than a human.
Evidence and Effectiveness
The evidence base for digital mental health interventions, including AI chatbots, is developing but uneven. Small randomized controlled trials of tools like Woebot have shown positive effects on depression and anxiety symptoms relative to control conditions. But the evidence base has important limitations: small sample sizes, short follow-up periods, concerns about active control group design, and the challenge of measuring outcomes in mental health.
FDA Oversight Gaps
Digital therapeutic applications occupy an ambiguous regulatory space. The FDA has taken a light-touch approach to most general wellness apps, including many digital mental health tools, under the 21st Century Cures Act's exclusion of low-risk general wellness software from the medical device definition. More specific digital therapeutics that claim to treat particular mental health conditions — such as prescription digital therapeutic (PDT) products — are subject to fuller FDA review.
The result is that many widely used AI mental health tools operate without robust regulatory oversight or validated clinical evidence of effectiveness, while being presented to consumers in ways that may imply clinical benefit.
Privacy Concerns
Mental health applications collect deeply sensitive personal information: disclosures of trauma, suicidal ideation, relationship difficulties, medication use, and other information that carries significant potential for harm if disclosed inappropriately. The privacy policies of mental health apps have been found, in multiple analyses, to permit substantial data sharing with third parties including advertisers and data brokers.
Crisis Intervention
Perhaps the most significant clinical risk in AI mental health applications is the handling of crisis situations — disclosures of suicidal ideation, self-harm, or immediate danger. AI chatbots are not equipped to provide crisis intervention services equivalent to trained crisis counselors. Apps have faced criticism for inadequate responses to crisis disclosures. The Replika app, which positions itself as an AI companion, was investigated following reports of users developing emotional dependence and being inadequately redirected to professional resources during crises.
The Clinical Relationship Question
Mental health treatment is not purely informational; it is relational. Therapeutic benefit in evidence-based psychotherapy is substantially mediated by the therapeutic alliance: the relationship between clinician and patient. AI systems can simulate aspects of relational interaction but cannot replicate a genuine human relationship. The ethical question is not whether AI tools can provide any benefit (they may) but whether they should substitute for professional care, or instead be governed primarily as accessible complements for populations that cannot reach human care.
Section 9: AI and Health Equity
Building on the foundational discussion of healthcare AI bias in Chapter 12, this section examines how AI deployment decisions can widen or narrow health disparities, and what equity-centered clinical AI governance looks like in practice.
How AI Widens Disparities
AI tools can widen health disparities through several pathways. First, training data that underrepresents certain patient populations produces models with worse performance for underrepresented groups. If a sepsis prediction model is trained primarily on data from large urban academic medical centers, its performance may be substantially worse in rural hospitals, community hospitals, or hospitals serving predominantly minority populations.
Second, proxy variables that correlate with health outcomes may also correlate with race, ethnicity, or socioeconomic status in ways that introduce discriminatory effects. The Optum algorithm's use of healthcare costs as a proxy for health need is the best-documented example: a variable that appears clinically reasonable nonetheless encodes structural inequities in healthcare access.
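The cost-as-need failure mode can be reproduced in a toy simulation. All numbers below are invented for illustration (they are not drawn from the Optum study): two groups have identical distributions of true health need, but one group's access barriers suppress its observed costs, so a program that enrolls the "highest-cost" patients systematically under-selects that group despite equal need.

```python
# Toy simulation of proxy-variable bias (hypothetical parameters): observed
# cost is true need filtered through unequal healthcare access, so ranking
# patients by cost under-selects the group with less access.
import random

random.seed(0)

def simulate(n=10_000, access_factor_b=0.7, top_frac=0.10):
    """Return group B's share of the cohort selected by the cost proxy."""
    patients = []
    for i in range(n):
        group = "A" if i % 2 == 0 else "B"
        need = random.gauss(50, 10)                 # true need: same distribution for both groups
        access = 1.0 if group == "A" else access_factor_b
        cost = need * access + random.gauss(0, 5)   # observed cost: need filtered by access
        patients.append((group, need, cost))

    # The proxy-based program enrolls the top 10% of patients by cost, not by need.
    by_cost = sorted(patients, key=lambda p: p[2], reverse=True)
    selected = by_cost[: int(n * top_frac)]
    return sum(1 for g, _, _ in selected if g == "B") / len(selected)

share_b = simulate()
print(f"Group B share of the cost-selected cohort: {share_b:.1%}")
```

Although both groups make up half the population and half the true need, group B's share of the selected cohort collapses toward zero, which is the structural signature of the Optum finding.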
Third, differential access to AI tools across healthcare settings can itself create inequity. If the most sophisticated AI diagnostic tools are concentrated in well-resourced academic medical centers, patients receiving care in underresourced community health settings — disproportionately patients of color and low-income patients — may receive a lower standard of AI-augmented care.
How AI Can Narrow Disparities
Used thoughtfully, AI has potential to reduce healthcare disparities. AI tools that extend specialist expertise to primary care settings can improve access to specialist-quality diagnostic support for patients who cannot access specialists. AI that detects findings that clinicians might miss can reduce the diagnostic disparities that result from implicit bias in human clinical judgment. AI screening tools that identify patients for outreach can help health systems reach patients who are underutilizing preventive care.
Realizing the equity-positive potential of AI requires that developers test tools for equitable performance across demographic groups before deployment, that health systems prioritize equity in AI procurement, and that AI tools be deployed in settings where underserved populations can benefit.
Algorithmic Redlining
The term "algorithmic redlining" describes the use of algorithmic tools to systematically disadvantage patients from certain demographic groups in healthcare access, quality, or resource allocation. It draws an analogy to historical redlining in mortgage lending: the systematic exclusion of Black neighborhoods from federally backed lending through geographic proxies for race.
Algorithmic redlining in healthcare can occur through risk scoring tools that systematically underestimate risk for certain populations (as documented with the Optum algorithm), resource allocation tools that direct scarce interventions away from high-need populations because of proxy variable bias, or predictive tools that generate different recommendations for identically situated patients based on demographic characteristics.
Equity-Centered Clinical AI Governance
What does equity-centered clinical AI governance look like in practice? Several elements are essential:
- Demographic subgroup performance testing before procurement: health systems should require vendors to provide performance statistics broken down by race, ethnicity, sex, age, language, and socioeconomic indicators — not just overall performance statistics.
- Ongoing performance monitoring by demographic group after deployment: disparities in real-world performance may differ from pre-deployment test performance, particularly as patient populations evolve.
- Procurement priorities that consider whether AI tools will serve the health system's full patient population equitably.
- Institutional equity impact assessments before deploying AI tools that affect resource allocation.
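The first element, subgroup performance testing, can be made concrete with a small sketch. The data, the choice of sensitivity as the headline metric, and the 5-point disparity tolerance are all assumptions for illustration; a real procurement review would examine multiple metrics and set thresholds deliberately.

```python
# Sketch of pre-procurement subgroup performance testing (hypothetical data
# and tolerance): compute sensitivity per demographic group and flag any
# group that falls materially below the tool's overall sensitivity.
from collections import defaultdict

def sensitivity(records):
    """Fraction of true-positive cases the tool flagged.
    Each record is (group, y_true, y_pred)."""
    positives = [r for r in records if r[1] == 1]
    if not positives:
        return None
    return sum(1 for r in positives if r[2] == 1) / len(positives)

def subgroup_report(records, max_gap=0.05):
    """Return overall sensitivity and a per-group report with disparity flags."""
    overall = sensitivity(records)
    by_group = defaultdict(list)
    for r in records:
        by_group[r[0]].append(r)
    report = {}
    for group, recs in sorted(by_group.items()):
        sens = sensitivity(recs)
        report[group] = {
            "sensitivity": sens,
            "flagged": sens is not None and (overall - sens) > max_gap,
        }
    return overall, report

# Toy validation set: (group, true_label, model_prediction).
records = [
    ("A", 1, 1), ("A", 1, 1), ("A", 1, 1), ("A", 1, 0), ("A", 0, 0),
    ("B", 1, 1), ("B", 1, 0), ("B", 1, 0), ("B", 1, 0), ("B", 0, 0),
]
overall, report = subgroup_report(records)
```

In this toy set the tool's overall sensitivity of 50% masks a split between 75% for group A and 25% for group B, which is exactly the pattern that overall-only vendor statistics conceal.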
Section 10: Building Ethical Clinical AI Governance
The accumulated evidence of this chapter — algorithmic harm cases, automation bias research, regulatory gaps, equity concerns — points toward specific governance requirements for healthcare organizations that deploy clinical AI.
Procurement Standards
Clinical AI procurement must go beyond the purchasing criteria that govern traditional healthcare IT. Evaluating a clinical AI tool requires assessment of: clinical validation studies (not marketing claims) in populations similar to the institution's patient population; FDA authorization status and the pathway through which authorization was obtained; demographic subgroup performance data; vendor transparency about training data, model architecture, and known failure modes; terms for model updates and notification requirements; and the vendor's approach to ongoing monitoring and adverse event reporting.
Clinical Validation Requirements
Before deploying a clinical AI tool in a consequential clinical workflow, health systems should establish: that the tool has been externally validated (not just internally validated by the developer) in patient populations comparable to the institution's; that performance statistics are understood and communicated to clinical users; that the performance is consistent across demographic groups; and that the tool has been assessed for integration into the specific clinical workflow, not just for standalone performance.
Ongoing Monitoring
Deployment is not the end of the governance obligation — it is the beginning of a new phase. Clinical AI tools require ongoing performance monitoring, including tracking of clinical outcomes in patient populations affected by AI-assisted decisions, surveillance for demographic performance drift, and mechanisms for clinicians to report suspected AI errors. The Epic Deterioration Index story illustrates what happens when this monitoring does not occur: silent model updates that change clinical tool behavior without clinician awareness.
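One monitoring mechanism suggested by this discussion is tracking the positive predictive value of a deployed tool's alerts against its validated baseline, so that a silent model update shows up as a drift flag. The class below is a sketch under assumed parameters (window size, minimum event count, and the 10-point drop threshold are all illustrative choices, not standards).

```python
# Sketch of post-deployment drift monitoring (hypothetical thresholds):
# track alert PPV over a rolling window of clinician-adjudicated alerts
# and escalate when it drops materially below the validated baseline.
from collections import deque

class PPVDriftMonitor:
    def __init__(self, baseline_ppv, window=500, min_events=100, max_drop=0.10):
        self.baseline_ppv = baseline_ppv
        self.window = deque(maxlen=window)   # recent alerts: True if clinically confirmed
        self.min_events = min_events
        self.max_drop = max_drop

    def record_alert(self, confirmed: bool) -> bool:
        """Record one adjudicated alert; return True if drift should be escalated."""
        self.window.append(confirmed)
        if len(self.window) < self.min_events:
            return False                     # not enough events to judge drift
        ppv = sum(self.window) / len(self.window)
        return (self.baseline_ppv - ppv) > self.max_drop

monitor = PPVDriftMonitor(baseline_ppv=0.40)
# Simulate 200 alerts confirmed at a degraded 20% rate, e.g. after an
# unannounced model update changes alerting behavior.
flags = [monitor.record_alert(i % 5 == 0) for i in range(200)]
```

Once enough events accumulate, the monitor flags the gap between the 40% validated PPV and the observed 20%, giving the institution a concrete trigger for the kind of review the Epic Deterioration Index update never received.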
Patient Communication
Health systems should establish clear policies about when patients will be informed of AI involvement in their care. A minimum standard would require: disclosure in standard consent forms that AI tools may be used in clinical care; specific disclosure when AI plays a significant role in diagnosis or treatment recommendations; and a mechanism for patients to ask about and understand AI's role in their care.
Incident Response
Clinical AI tools will produce errors. Health systems need defined incident response procedures for AI-related adverse events: processes for identifying when an AI tool may have contributed to patient harm, reporting mechanisms, root cause analysis, and remediation. These procedures should parallel the existing adverse event reporting systems in healthcare, with modifications appropriate to algorithmic failures.
Regulatory Compliance
The regulatory landscape for clinical AI is evolving. Health systems should maintain awareness of FDA authorization requirements for the AI tools they deploy, comply with applicable state laws addressing healthcare AI, and participate in professional association guidance development. As the regulatory framework develops further, early engagement with emerging requirements reduces compliance risk.
Human Accountability
Ultimately, accountability in clinical AI must rest with identifiable human actors — not with algorithms. This requires clear allocation of responsibility within health systems: who is responsible for clinical AI procurement decisions, who is responsible for monitoring, who is accountable when AI tools contribute to patient harm. The automation of clinical decision-making does not remove clinical accountability; it creates new governance requirements for how that accountability is exercised.
Recurring Themes in This Chapter
Power and Accountability: Clinical AI concentrates substantial power in the companies that develop electronic health record platforms — particularly Epic, which runs in the majority of large U.S. hospitals. The ability to quietly update AI models embedded in ubiquitous clinical software without systematic clinician notification illustrates how opacity in powerful platforms translates into governance failure.
Innovation vs. Harm: The genuine potential of clinical AI to improve diagnosis, expand access to specialist expertise, and identify patients at risk of deterioration is real. So are the documented harms — algorithmic bias, Watson's unsafe recommendations, pulse oximeter inaccuracy. The governance challenge is not choosing between innovation and harm but creating structures that enable beneficial innovation while preventing and detecting harm.
Ethics Washing: The clinical AI market has seen significant marketing hype unaccompanied by clinical evidence. IBM Watson for Oncology is the paradigm case: a product marketed with claims of revolutionary impact that the evidence did not support, deployed internationally before adequate validation. Distinguishing genuine clinical AI from AI-flavored marketing is a core procurement challenge.
Diversity and Inclusion: Demographic bias in clinical AI — differential performance across racial and ethnic groups — is not a peripheral concern. It is documented in deployed systems and has direct consequences for health equity. Governance frameworks that treat demographic performance as secondary to overall performance will perpetuate and potentially amplify existing health disparities.
Global Variation: Clinical AI governance frameworks vary substantially across countries. The FDA's SaMD framework is specific to the United States; the EU has its own medical device regulation; low- and middle-income countries often lack effective regulatory frameworks for clinical AI entirely. AI tools validated in high-income country populations may be deployed in LMIC populations without adequate validation, creating equity concerns at a global scale.
Conclusion
AI in healthcare decision-making is not a future possibility — it is an operational reality in thousands of hospitals serving millions of patients. The promise is genuine: better detection, more consistent care, expanded access to specialist expertise. The documented risks are also genuine: bias that worsens outcomes for already-disadvantaged populations, automation bias that undermines meaningful clinical oversight, opaque systems that change without clinician awareness, and marketing claims that outrun evidence.
The governance gap in clinical AI — between deployment scale and governance maturity — is smaller than in consumer generative AI, partly because healthcare has existing regulatory infrastructure (the FDA), professional ethics traditions (medical ethics, informed consent), and institutional governance capacity. But significant gaps remain: the regulatory framework is still adapting to adaptive AI; demographic performance requirements are proposed but not yet binding; patient disclosure standards are underdeveloped; and the Epic Deterioration Index story illustrates that even widely deployed tools in well-governed institutions can change without adequate oversight.
For health system leaders, clinical leaders, and health policy professionals, the practical implication is clear: deploying clinical AI is a governance responsibility, not just a technology procurement decision. The patients whose care is shaped by these tools deserve the same rigor in validation, monitoring, and accountability that they receive from any other medical intervention.
Key Terms
Software as a Medical Device (SaMD): Software intended for a medical purpose, such as informing or supporting clinical decisions, that performs that purpose without being part of a hardware medical device, and is therefore subject to FDA regulatory oversight.
Automation Bias: The documented tendency to over-rely on automated systems, accepting their outputs without adequate independent verification.
Deterioration Index: Epic Systems' AI model predicting which hospitalized patients are at risk of rapid clinical decline, embedded in Epic's widely used electronic health record platform.
Predetermined Change Control Plan (PCCP): The FDA's proposed mechanism for authorizing AI medical devices that will update over time, by approving a plan for how updates will be made and validated.
Algorithmic Redlining: The use of algorithmic tools to systematically disadvantage patients from certain demographic groups in healthcare access, quality, or resource allocation.
Proxy Variable Bias: The introduction of discriminatory effects into algorithmic systems through the use of variables that appear clinically reasonable but correlate with protected demographic characteristics.
Digital Therapeutic: A software-based therapeutic intervention (including AI chatbots for mental health) that may be subject to FDA review if it claims to treat a medical condition.
eGFR: Estimated glomerular filtration rate, a measure of kidney function whose calculation historically included a race-based correction factor that disadvantaged Black patients.
Prescription Digital Therapeutic (PDT): A digital therapeutic that requires a physician prescription and has received FDA authorization as a medical device.