
Chapter 12: Bias in Healthcare AI


Opening: A System That Knew the Price of Care but Not Its Need

In November 2019, a study published in Science — one of the world's most prestigious peer-reviewed journals — sent a shockwave through the healthcare industry. Ziad Obermeyer and his colleagues at UC Berkeley had spent months dissecting a commercial algorithm used by hospitals, insurers, and health systems across the United States to manage the care of roughly 200 million people. Their finding was stark, damning, and, in retrospect, entirely predictable: the algorithm was systematically steering Black patients away from extra care management programs — not because of any explicit racial variable, but because of a design choice that seemed, on its face, perfectly reasonable.

The algorithm was built to do something genuinely valuable: identify which patients were most likely to benefit from intensive care coordination programs — the kind of proactive, wraparound support that reduces hospitalizations, improves medication adherence, and ultimately saves lives. The designers chose healthcare cost as their primary proxy for health need. After all, sicker patients cost more to treat. That relationship is real, broadly documented, and intuitive. Cost data was also abundant, clean, and easy to obtain from insurance records. Using it seemed like an elegant engineering solution.

What the designers did not adequately account for — or did not check for — was that healthcare cost reflects not just how sick a patient is, but how much care that patient has historically received. And in the United States, how much care a patient has received is not race-neutral. Black Americans, due to centuries of systemic discrimination in medicine, residential segregation, income inequality, and insurance disparities, have historically received less care than their white counterparts — even when comparably or more severely ill. The algorithm did not see that history. It saw cost. And because Black patients had lower historical costs (not because they were healthier, but because they had received less care), it ranked them as lower-need.

The result: at any given level of predicted risk, Black patients enrolled in the care management program were significantly sicker than white patients. To receive the same automated referral, a Black patient had to be substantially more ill. Eighteen percent of Black patients who should have been referred were not. The algorithm was not malicious — it was a mirror of the society that produced its training data, and it reflected that society's inequities back onto patients who could least afford it.

This chapter examines bias in healthcare AI — what it is, where it comes from, how it manifests across clinical contexts, and what healthcare organizations, developers, and regulators can do about it. We do so in full awareness that the stakes here are unlike almost any other domain of AI ethics. In facial recognition, a false positive might mean wrongful arrest. In credit scoring, it might mean a denied loan. In healthcare AI, a false negative might mean a missed cancer diagnosis. A biased risk score might mean a patient dies without receiving care they needed. Bias in healthcare AI is not a compliance problem. It is a matter of life and death.


Learning Objectives

After completing this chapter, you will be able to:

  1. Explain the major ways bias enters healthcare AI systems, from training data to proxy selection to measurement tools, using specific real-world examples.

  2. Analyze the Obermeyer et al. (2019) Optum algorithm study, articulating how a seemingly neutral design choice (using cost as a proxy for need) produced racially disparate outcomes at scale.

  3. Describe the eGFR race correction controversy and the VBAC prediction tool controversy as examples of clinical decision support tools that encoded race in ways that delayed or distorted care.

  4. Identify the specific challenges of gender bias in healthcare AI, including the historical exclusion of women from clinical trials and its downstream effects on AI trained on that data.

  5. Evaluate the regulatory landscape governing healthcare AI, including FDA oversight of Software as a Medical Device (SaMD), and identify the significant gaps in current oversight.

  6. Apply a framework for equitable AI development to a healthcare procurement or development scenario, including pre-deployment demographic testing, model documentation, and ongoing bias monitoring.

  7. Distinguish between the representation problem (diverse training data) and the deeper structural problem (historical treatment disparities as training signal) in healthcare AI bias.

  8. Articulate the particular ethical risks of AI in mental health contexts, including therapy chatbots, depression screening tools, and suicide risk prediction algorithms.


Section 12.1: The Promise and Peril of AI in Healthcare

The Transformative Promise

The case for AI in healthcare is not manufactured enthusiasm. The potential is real, the early results in some domains are genuinely impressive, and the problems AI might help solve are serious enough to warrant significant investment and attention.

Consider the scale of what AI could theoretically address. Diagnostic error — misdiagnosis or delayed diagnosis — affects an estimated 12 million Americans annually and contributes to approximately 40,000 to 80,000 deaths per year in the United States alone. Physician burnout, driven in significant part by administrative burden, is accelerating a global shortage of clinical talent. Health disparities — differences in health outcomes between demographic groups — persist across virtually every condition and every country, driven by a complex web of social, economic, and historical factors that the healthcare system alone cannot resolve but frequently exacerbates. Into this landscape, the promise of AI is compelling: earlier, more accurate diagnosis; personalized treatment tailored to the individual patient rather than the average trial participant; reduced administrative burden on clinicians; and — perhaps most tantalizing — the potential to extend high-quality clinical decision support to settings that currently lack it.

Early results in specific domains have validated parts of this promise. In 2016, Google researchers published results showing that a deep learning algorithm could detect diabetic retinopathy — a leading cause of blindness that is highly treatable if caught early — from retinal images with accuracy comparable to ophthalmologists. This mattered enormously because ophthalmologists are scarce in many parts of the world where diabetes is endemic. If an AI system could flag which retinal images required urgent specialist review, primary care physicians and even trained nurses could operate the screening program, dramatically expanding access. Subsequent deployments in Thailand and India tested this promise against the friction of real-world healthcare delivery.

AI for sepsis detection represents another domain of genuine clinical achievement. Sepsis — the body's dysregulated response to infection — kills approximately 270,000 Americans annually and is notoriously difficult to detect early, when treatment is most effective. AI systems trained on electronic health record data can identify patterns in vital signs, lab values, and nursing assessments that predict sepsis onset hours before it becomes clinically obvious. Several hospital systems have deployed such systems with documented reductions in sepsis mortality. The AI is not replacing the physician — it is alerting nurses and doctors to patients who need urgent attention before the clinical picture is obvious.

In radiology, AI systems have demonstrated the ability to outperform radiologists in specific, narrow tasks — detecting pneumonia from chest X-rays, identifying breast cancer in mammograms, flagging intracranial hemorrhages in head CT scans. Importantly, the most rigorous studies show that AI plus radiologist outperforms either alone, suggesting a collaborative model rather than a replacement model.

The Peril

Against this genuine promise, the harms documented in this chapter are not edge cases or theoretical risks. They are documented, measured phenomena affecting real patients — often the most vulnerable ones.

The fundamental problem is this: AI systems are trained on historical data, and healthcare data is a product of a healthcare system that has never been neutral. Every dataset used to train a clinical AI system is a record of who received care, what care they received, and how their outcomes were documented. Those records reflect decades or centuries of inequity: who had insurance, who lived near a hospital, who was believed when they reported pain, who was included in clinical trials, who received appropriate treatment for their symptoms. An AI system trained on this data does not learn healthcare — it learns the healthcare system, in all its historical injustice.

The peril compounds when AI systems are deployed without adequate testing across the populations who will actually use them. A skin lesion classifier trained predominantly on images of light-skinned patients will perform differently on patients with darker skin tones. A cardiac risk model trained mostly on male patients may be less accurate for women. A mental health screening tool developed and validated in a particular cultural context may produce meaningfully different results in a different context. These are not hypothetical concerns — they are documented, measured performance gaps with direct clinical consequences.

The health equity context cannot be separated from the healthcare AI conversation. In the United States, Black Americans have a life expectancy roughly four years shorter than white Americans. Maternal mortality rates for Black women are three times higher than for white women. Native Americans face dramatically elevated rates of diabetes, heart disease, and liver disease compared to the general population. Hispanic Americans face higher rates of uninsured status and lower rates of preventive care. Women's heart disease symptoms are systematically underrecognized and undertreated. These disparities did not emerge from individual choices — they are the product of systemic factors including historical discrimination, residential segregation, environmental racism, economic inequality, and a healthcare system that has not always served all populations equally. Healthcare AI has the potential to narrow these gaps or to widen them. The evidence so far suggests that, without deliberate intervention, it is more likely to widen them.

Vocabulary Builder

Clinical AI: Artificial intelligence systems designed to support or automate clinical decision-making, including diagnosis, prognosis, treatment recommendation, and care management.

Decision Support System (DSS): A clinical tool — AI-based or otherwise — that provides information, alerts, or recommendations to assist clinicians in making decisions. Critically, the clinician retains decision-making authority.

FDA Clearance (510(k)): The regulatory pathway by which the U.S. Food and Drug Administration clears most moderate-risk medical devices, by finding them substantially equivalent to a legally marketed predicate device (higher-risk devices require premarket approval instead). Many clinical AI tools are regulated as Software as a Medical Device (SaMD) and require clearance or approval before marketing.

Risk Stratification: The clinical practice of categorizing patients by their predicted risk of a particular outcome (e.g., hospitalization, disease progression) in order to allocate scarce resources like care management programs to highest-need patients.

Health Equity: The principle that all people should have a fair and just opportunity to achieve their highest level of health, with specific attention to eliminating disparities that disadvantage historically marginalized groups.

Social Determinants of Health (SDOH): The non-medical factors that influence health outcomes, including housing, food security, income, education, transportation, and exposure to violence or discrimination.


Section 12.2: Sources of Healthcare Data Bias

Understanding bias in healthcare AI requires understanding precisely where, in the long chain from data collection to clinical deployment, bias enters. The answer, unfortunately, is: everywhere. Each of the following mechanisms is distinct, produces distinct types of harm, and requires distinct mitigation strategies.

Clinical Trial Underrepresentation

The foundational evidence base of modern medicine — clinical trials of drugs and devices — has historically and systematically underrepresented women, racial and ethnic minorities, the elderly, and people with multiple chronic conditions. The reasons are partly historical (early medical research operated under assumptions about who constituted the "standard" patient), partly economic (diverse recruitment is more expensive), and partly logistical (many trials are conducted at academic medical centers that do not reflect national demographics).

The consequences are not abstract. Drugs have been approved based on evidence drawn predominantly from middle-aged white men, then administered to populations whose responses differ — sometimes dramatically — due to differences in pharmacokinetics, hormonal environment, and comorbidity burden. The FDA issued guidelines in 1993 requiring the inclusion of women in clinical trials, and in 1998 began requiring reporting of results by sex, race, and age. But these rules applied prospectively — the legacy data from decades of underrepresentation remains in the medical literature and in the electronic health records that now train AI systems.

When AI is trained on this literature or on EHR data that reflects it, the AI inherits these gaps. A drug dosing recommendation algorithm trained on pharmacokinetics data drawn mostly from men may produce systematically different recommendations for women. A clinical risk model trained on trial data that excluded patients over 75 may perform poorly in elderly populations. The training data encodes the history of who was studied.

Electronic Health Record Bias

EHR data — the primary training resource for the majority of clinical AI systems — does not represent the population; it represents the population that accessed care within particular health systems. This is a profound limitation. People without insurance are less likely to have longitudinal EHR records because they are less likely to have a consistent care relationship with a health system. Rural residents are underrepresented in urban academic medical center datasets. Undocumented immigrants may avoid healthcare settings due to fear, appearing in records only in emergency situations. Homeless individuals may have fragmented records spread across multiple institutions with no integration.

The result: AI trained on EHR data learns from a population that skews toward insured, urban, documented, and connected-to-care — systematically underrepresenting the very populations that often have the highest unmet health need. When these tools are then deployed in settings serving underrepresented populations, performance degrades in ways that may not be immediately apparent because monitoring is often insufficient.

Imaging Dataset Demographics

Radiology and dermatology AI systems are typically trained on large databases of labeled images — X-rays, CT scans, MRIs, skin photographs — that have been collected at academic medical centers over decades. These centers, while medically elite, serve patient populations that are not nationally representative. The NIH ChestX-ray14 dataset, one of the most widely used training datasets for chest X-ray AI, was drawn from patients at the National Institutes of Health Clinical Center in Bethesda, Maryland. The ISIC Archive, a primary source for dermatology AI training, draws heavily from institutions in the United States, Europe, and Australia, with corresponding demographic skews toward lighter-skinned populations.

A study by Seyyed-Kalantari and colleagues (2021) examined the performance of chest X-ray AI across demographic subgroups in the NIH dataset and found that models trained on this data exhibited meaningful performance gaps across sex, race, insurance status, and age. Patients who were uninsured, female, or from racial minority groups were more likely to be falsely flagged as healthy, i.e., underdiagnosed. The AI had learned the demographic patterns of the training dataset — and those patterns did not generalize equally.

Measurement Bias in Clinical Data

Some of the most troubling examples of healthcare data bias involve not what data is collected, but how it is measured — and the fact that some clinical measurement tools perform differently across demographic groups. The pulse oximeter is a now-infamous example, discussed in Chapter 8: the device uses light absorption to estimate blood oxygen saturation, and its accuracy degrades at higher levels of skin pigmentation. Studies published during the COVID-19 pandemic demonstrated that pulse oximeters overestimated blood oxygen saturation in Black, Asian, and Hispanic patients compared to white patients — meaning that clinicians using oximeter readings to make ventilator decisions were operating on systematically inaccurate data for patients with darker skin.

The clinical consequence was direct and grave: some patients with dangerously low true oxygen levels appeared to have adequate oxygenation on the oximeter, delaying life-saving interventions. When this flawed measurement tool is embedded in EHR data and that EHR data is used to train an AI system, the AI learns from data that systematically misrepresents the physiological status of patients with darker skin.

Documentation Bias in Clinical Notes

Unstructured clinical notes — the narrative documentation that clinicians write about patient encounters — encode the attitudes, assumptions, and biases of the clinicians who wrote them. Studies have demonstrated systematic differences in how pain is documented for patients of different races: providers are less likely to document pain complaints from Black patients, more likely to describe those patients as non-compliant or drug-seeking, and less likely to recommend strong analgesics. Similar patterns appear in mental health documentation, where the language used to describe patients varies systematically by race, gender, and socioeconomic status.

When natural language processing (NLP) systems are trained on clinical notes to extract features for downstream AI tasks — risk scoring, diagnosis coding, care recommendation — they learn from this biased documentation. If the training notes systematically underrepresent the pain burden of Black patients, an AI that predicts pain management needs from clinical notes will systematically underestimate those needs.

Missing Data Patterns

Missing data in clinical datasets is not random — it is structured by the social determinants of health. Patients without consistent insurance coverage have gaps in their records where preventive screenings should be. Patients who moved between geographic areas may have fragmented records across systems that do not communicate. Patients who primarily seek care in emergency settings rather than primary care have records dominated by acute events with little longitudinal context.

AI systems trained on datasets with these structured missing data patterns make implicit assumptions when data is absent. If an AI system treats the absence of a colonoscopy record as evidence of low GI risk (because well-resourced patients who did have colonoscopies and were found to be low-risk would have that documented), it may systematically underestimate risk for patients whose records are sparse simply because they never had access to preventive care.
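The distinction can be sketched in a few lines of Python. The field names and patients below are hypothetical; the point is the feature encoding, not any particular model.

```python
# Two hypothetical patients: one screened and found low-risk, one
# never screened because they lacked access to preventive care.
patients = [
    {"id": 1, "colonoscopy_result": "normal"},  # screened, genuinely low risk
    {"id": 2, "colonoscopy_result": None},      # no record: never had access
]

def naive_feature(patient):
    # Treats a missing result as "normal": the absence of care is
    # silently read as the absence of risk.
    return patient["colonoscopy_result"] or "normal"

def explicit_feature(patient):
    # Keeps missingness visible, so a downstream model can learn that
    # "never screened" is not the same as "screened and found normal".
    result = patient["colonoscopy_result"]
    return {"result": result, "was_screened": result is not None}

# Under the naive encoding, the two patients are indistinguishable.
print(naive_feature(patients[0]) == naive_feature(patients[1]))      # True
print(explicit_feature(patients[0]) == explicit_feature(patients[1]))  # False
```

Whether the explicit encoding actually helps depends on the downstream model and deployment population, but it at least stops the system from silently equating lack of access with good health.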

Historical Treatment Disparities as Training Signal

Perhaps the most insidious source of healthcare data bias is the most deeply embedded: the AI learns not just what diseases look like, but how diseases were historically managed — and historical management has been racially stratified. Studies consistently show that Black patients receive less adequate pain management than white patients with comparable conditions. If an AI system for pain management recommendation is trained on historical prescribing patterns, it will learn a pattern in which Black patients receive fewer opioids and fewer referrals to pain management — not because this is medically appropriate, but because it reflects historical provider bias. The AI then recommends this discriminatory pattern as if it were clinical evidence, automating and scaling a human bias into systematic institutional practice.

This is the most difficult source of bias to address, because the solution is not simply to obtain more data or more diverse data — it requires making a deliberate judgment that the historical data reflects harmful practices that should not be learned, and constructing training objectives that encode better clinical standards.


Section 12.3: The Optum Healthcare Algorithm — A Case Study in Proxy Bias

The Market Context

Health risk stratification is not a niche product. It is one of the highest-value applications of analytics in American healthcare, generating hundreds of millions of dollars annually for companies including Optum (a subsidiary of UnitedHealth Group), IBM Watson Health, Epic Systems, Cerner, and numerous smaller vendors. The fundamental promise is straightforward: hospitals, health systems, insurers, and employers contract with these vendors to help them identify which patients are at highest risk of expensive, avoidable healthcare utilization — hospitalizations, emergency department visits, disease complications — so that proactive care management resources can be concentrated on those patients before crises occur.

This is genuinely valuable. Intensive care management programs — teams of nurses, social workers, pharmacists, and care coordinators who proactively reach out to high-risk patients, help them manage medications, navigate the healthcare system, and address social determinants of health — improve outcomes and reduce costs. The challenge is that they are expensive to operate, so health systems can serve only a fraction of their eligible patient population. The algorithm decides who gets in.

Optum's algorithm, marketed under the brand name "Impact Pro," was, by Obermeyer et al.'s estimate, used to manage the care of approximately 200 million patients when the study was published in 2019. It was among the most widely deployed commercial risk stratification products in the United States.

The Study Methodology

Ziad Obermeyer and colleagues at UC Berkeley obtained data from a large academic medical center that had licensed the Optum algorithm. They had access to the algorithm's output — risk scores for a large patient population — as well as detailed clinical data for those same patients, including their actual disease burden as measured by the number of active chronic conditions, laboratory values, and other clinical indicators.

The researchers asked a deceptively simple question: at a given level of algorithm-assigned risk score, are Black and white patients equally sick? If the algorithm were equitable, the answer should be yes — patients assigned the same risk score should have similar levels of health need, regardless of race. That is what a well-calibrated algorithm would produce.

What they found was sharply different. At every level of risk score, Black patients were significantly sicker than white patients. A Black patient and a white patient assigned the same risk score were not equally ill — the Black patient was, on average, carrying a substantially higher burden of chronic disease. The algorithm was systematically underestimating the health needs of Black patients.

The practical consequence: to be automatically enrolled in the care management program — which required exceeding a threshold risk score — a Black patient had to be meaningfully sicker than a white patient who qualified at the same threshold. Obermeyer et al. estimated that approximately 18 percent of Black patients who should have been referred to the care management program were not, due to this bias. If the algorithm had been corrected to produce equal health need (rather than equal cost prediction) at a given risk score, the proportion of Black patients enrolled in care management programs would have increased from approximately 17.7 percent to 46.5 percent.
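The study's core check, whether patients assigned the same score are equally sick across groups, can be sketched as a simple calibration audit. The data below is entirely synthetic, with a disparity deliberately embedded so the audit has something to detect; the numbers are assumptions, not estimates from the study.

```python
import random
from collections import defaultdict

random.seed(1)

def calibration_audit(records, n_bins=5):
    """Bin patients by risk score and compare mean true health need
    (here, chronic-condition burden) across groups within each bin.
    A score well calibrated to need shows similar means per bin."""
    bins = defaultdict(lambda: defaultdict(list))
    for score, group, burden in records:
        b = min(int(score * n_bins), n_bins - 1)
        bins[b][group].append(burden)
    return {b: {g: sum(v) / len(v) for g, v in groups.items()}
            for b, groups in sorted(bins.items())}

# Synthetic records embedding the pattern the study found: at any
# given score, Black patients carry a heavier chronic-disease burden.
records = []
for _ in range(10_000):
    group = random.choice(["Black", "white"])
    score = random.random()
    gap = 1.5 if group == "Black" else 0.0  # assumed disparity
    records.append((score, group, score * 8 + gap + random.gauss(0, 1)))

for b, means in calibration_audit(records).items():
    flag = "  <-- disparity" if means["Black"] - means["white"] > 0.5 else ""
    print(f"score bin {b}: " +
          ", ".join(f"{g}={m:.2f}" for g, m in sorted(means.items())) + flag)
```

An audit of this shape requires only the algorithm's scores, a group label, and an independent measure of need — exactly the three ingredients the researchers had.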

The Proxy Mechanism

The specific mechanism by which the algorithm produced this outcome was the use of healthcare cost as a proxy for healthcare need. The algorithm was designed to predict future healthcare cost — and cost was used as the operational definition of health need. The correlation between cost and need is real; sicker patients do, in general, cost more to treat. But the correlation is imperfect in a way that is systematically structured by race.

In the United States, access to healthcare is not race-neutral. Black Americans face higher rates of uninsurance, are more likely to live in areas with fewer primary care physicians, face documented discrimination in clinical settings that affects willingness to seek care, and have lower average incomes that create financial barriers to care-seeking. The consequence: at a given level of true illness severity, a Black patient is likely to have incurred lower healthcare costs than a comparably ill white patient — not because the Black patient is less sick, but because the Black patient has received less care.

The algorithm learned this pattern from its training data. It did not know about race — race was not a variable in the algorithm. But it did not need to know about race explicitly. Healthcare cost, in the United States, is a variable that is correlated with race through the very mechanisms of structural racism that create health disparities in the first place. Using cost as a proxy for need was, in effect, using a racially proxied variable. The algorithm was proxying for race without encoding race.

This is the essence of proxy bias: the use of a facially neutral variable that serves as a proxy for a protected characteristic, because that variable is itself shaped by the historical discrimination the protected characteristic has experienced. It is a mechanism that appears repeatedly across AI applications — in credit scoring, in criminal justice risk assessment, in education — but has particularly grave consequences in healthcare, where the resource being allocated is medical care itself.
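The mechanism can be made concrete with a toy simulation. Everything here is synthetic: the access gap and noise level are assumptions chosen for illustration, not parameters from the study.

```python
import random

random.seed(0)

# Toy model of the proxy mechanism. True health need is distributed
# identically in both groups, but group B has historically received
# less care, so the cost the algorithm observes understates its need.
patients = []
for _ in range(20_000):
    group = random.choice(["A", "B"])
    need = random.uniform(0, 10)                  # true illness burden
    access = 1.0 if group == "A" else 0.7         # assumed access gap
    cost = need * access + random.gauss(0, 0.5)   # what the model sees
    patients.append((group, need, cost))

def group_b_referral_share(idx):
    # "Refer" the top 20% of patients ranked on the chosen target
    # (index 1 = true need, index 2 = observed cost).
    cutoff = sorted(p[idx] for p in patients)[int(0.8 * len(patients))]
    referred = [p for p in patients if p[idx] >= cutoff]
    return sum(p[0] == "B" for p in referred) / len(referred)

print(f"ranking on cost proxy: B share of referrals = {group_b_referral_share(2):.2f}")
print(f"ranking on true need:  B share of referrals = {group_b_referral_share(1):.2f}")
```

Ranking on the cost proxy pushes group B's share of referrals far below its half of the population, while ranking on true need restores it — the same shift Obermeyer et al. demonstrated when they substituted clinical indicators of need for cost.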

What Optum Said and Did

Optum responded to the Obermeyer et al. study with a public statement acknowledging the findings and committing to action. The company stated that it "concur[red] with the findings and support[ed] the Science article's conclusions." Optum said it was working to modify the algorithm to reduce the identified bias, and committed to re-examining similar tools across its product portfolio.

This response was notable for several reasons. First, it was relatively rapid — within weeks of publication. Second, it was substantive — an acknowledgment of the finding rather than a defense of the methodology. Third, the changes were company-initiated rather than the product of regulatory enforcement. The FDA did not order a recall. No healthcare regulator mandated a correction. The correction occurred because a published academic study created reputational and legal pressure sufficient to motivate voluntary action.

The adequacy of this response has been debated. Critics noted that the algorithm had been deployed at scale for years before the bias was identified — raising questions about what internal testing Optum had conducted before deployment, whether that testing had examined performance by demographic group, and what processes existed for post-deployment monitoring. These questions remain at least partially unanswered in the public record.

Better Design: What a More Equitable Proxy Would Have Looked Like

The Obermeyer study did not merely document a problem — it pointed toward a solution. The researchers demonstrated that replacing healthcare cost with actual clinical indicators of health need — the number of active chronic conditions, laboratory values indicative of disease burden — produced dramatically more equitable risk stratification. A model trained to predict health need (rather than healthcare cost) did not show the same racial disparity in calibration.

This points to a general principle: the choice of what an algorithm is trained to predict is a values decision with ethical consequences. When designers chose to train on cost because cost data was clean and abundant, they made a decision that had predictable equity implications — whether or not they recognized it. The lesson for healthcare AI development is that the target variable must be chosen with explicit attention to what it measures and what its relationship to protected characteristics might be.

The Widespread Adoption Problem

A critical aspect of the Optum case is the scale of potential harm. The researchers estimated that the algorithm they analyzed served approximately 200 million Americans. Health risk stratification algorithms from Optum, Epic, Cerner, and IBM Watson Health collectively influence the care management decisions affecting a substantial fraction of the U.S. population. If similar proxy bias affects similar products — and there is no reason to believe the Optum product was uniquely flawed — the aggregate harm is not thousands of patients but potentially tens of millions.

This scale dynamic is one of the defining features of algorithmic harm compared to individual clinical bias: a single biased clinician affects the patients they personally treat; a biased algorithm deployed in a major health system's EHR affects every patient whose record the algorithm scores.


Section 12.4: Racial Bias in Clinical Decision Support

The Optum algorithm is the most studied and most publicized example of racial bias in healthcare AI, but it is far from unique. A pattern runs through clinical decision support across multiple specialties: algorithms developed without systematic attention to demographic variation, deployed at scale, and discovered to have differential performance — often after harm has occurred.

The eGFR Race Correction Controversy

Perhaps the most consequential embedded racial adjustment in clinical medicine was the race correction factor in the estimated Glomerular Filtration Rate (eGFR) — the primary laboratory measure of kidney function used to diagnose chronic kidney disease, determine medication dosing, and establish eligibility for kidney transplant referral.

For approximately two decades, the most widely used eGFR equations (the MDRD equation, introduced in 1999, and the CKD-EPI equation, introduced in 2009) included a race adjustment factor: the calculated eGFR was multiplied upward for patients identified as Black (by 1.212 in MDRD, 1.159 in CKD-EPI). The justification offered was that, on average, Black individuals have higher muscle mass, which produces more creatinine (the waste product from which kidney function is inferred), which would otherwise cause eGFR to be underestimated. This adjustment, the argument went, produced more accurate kidney function estimates for Black patients.

The clinical community debate — catalyzed in part by the NephMadness competition and the work of nephrologists including Nwamaka Eneanya, Chi-yuan Hsu, and others — exposed significant problems with this reasoning. First, the original research supporting the adjustment was conducted on small, non-representative samples. Second, the adjustment was applied categorically to all patients identified as Black, erasing the enormous variation within that population. Third, and most consequentially: the race adjustment produced higher eGFR values for Black patients — meaning it made their kidney function appear better than it was.

The consequence was directly harmful. Because kidney transplant referrals are triggered at specific eGFR thresholds, Black patients with the race adjustment applied needed to have more severely damaged kidneys to cross the referral threshold. Studies found that eliminating the race adjustment would have reclassified significant numbers of Black patients to earlier stages of kidney disease — making them eligible for transplant referral sooner. The adjustment was delaying Black patients' access to transplant lists, potentially contributing to worse outcomes and longer wait times.
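The arithmetic of the harm is easy to demonstrate. The sketch below applies the published 1.159 multiplier to a hypothetical patient; the referral threshold of eGFR ≤ 20 mL/min/1.73 m² is a commonly cited cutoff, used here purely for illustration.

```python
# Illustrative sketch: how the 1.159 race multiplier can move a patient
# across a transplant-referral threshold. The 1.159 factor is from the
# MDRD/CKD-EPI equations described above; the threshold of 20 is an
# illustrative referral cutoff, not a universal clinical rule.

RACE_MULTIPLIER = 1.159
REFERRAL_THRESHOLD = 20.0  # refer when eGFR falls at or below this value

def adjusted_egfr(unadjusted_egfr: float, identified_black: bool) -> float:
    """Apply the (now-retired) race adjustment to an unadjusted eGFR."""
    return unadjusted_egfr * RACE_MULTIPLIER if identified_black else unadjusted_egfr

def eligible_for_referral(egfr: float) -> bool:
    return egfr <= REFERRAL_THRESHOLD

# Two patients with identical unadjusted kidney function:
unadjusted = 18.0
print(eligible_for_referral(adjusted_egfr(unadjusted, identified_black=False)))  # True
print(eligible_for_referral(adjusted_egfr(unadjusted, identified_black=True)))   # False: 18.0 * 1.159 = 20.86
```

The two patients have identical measured kidney function; only the patient whose score was inflated by the adjustment fails to qualify for referral.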

In September 2021, the National Kidney Foundation and American Society of Nephrology jointly recommended eliminating race from eGFR calculations, adopting a race-free equation (CKD-EPI 2021). This represented a significant reversal by leading medical societies, and most major health systems and laboratories have since transitioned to the new equation. But the reversal raises urgent questions: how many patients were harmed during the decades the adjustment was in use? Who bears accountability? And how many other clinical algorithms incorporate race in similarly problematic ways?

The VBAC Prediction Tool

The Vaginal Birth After Cesarean (VBAC) calculator, developed from data collected by the Maternal-Fetal Medicine Units Network of the National Institutes of Health, is used by obstetricians to estimate the probability of successful vaginal delivery for patients who have previously had a cesarean section. The decision of whether to attempt VBAC or schedule a repeat cesarean has meaningful consequences for both maternal and infant health.

The calculator, implemented in clinical software widely used by obstetricians, included race and ethnicity as predictive variables. The inclusion of race reduced the predicted probability of VBAC success for Black and Hispanic patients relative to white patients. In a clinical conversation where the obstetrician presents the calculated probability to the patient, a lower predicted probability may influence both the clinician's recommendation and the patient's decision.

Critics of the inclusion of race in this tool argued that using race as a predictive variable — when race is a social category, not a biological one — encodes historical disparities in obstetric outcomes as if they were fixed characteristics of Black and Hispanic patients' biology, rather than reflections of the differential care and structural conditions those patients have faced. The tool was essentially saying "Black patients have a lower probability of VBAC success" when it should have been asking "why have Black patients had worse VBAC outcomes, and what can we do to address those structural causes?"

The Society for Maternal-Fetal Medicine subsequently recommended removing race and ethnicity from the VBAC prediction calculator.

Dermatology AI and Skin Tone Performance Gaps

AI systems for analyzing skin lesions — distinguishing benign moles from potentially cancerous melanoma — have been celebrated as a transformative application of computer vision in medicine. Early studies, including a widely cited 2017 paper in Nature by Esteva and colleagues, showed deep learning systems matching or exceeding dermatologist performance on skin lesion classification tasks.

What these initial studies did not examine was differential performance across skin tones. The training datasets for these systems were drawn overwhelmingly from images of patients with lighter skin, reflecting both the demographics of the patient populations at the contributing institutions and the demographics of the dermatology specialty itself (which remains predominantly white).

Subsequent research demonstrated meaningful performance gaps: leading skin lesion classifiers performed substantially worse on images of patients with darker skin tones than on images of patients with lighter skin. The clinical consequence is not abstract: a dermatology AI that performs well on light-skinned patients and poorly on dark-skinned patients, deployed in a setting that serves a mixed population, will produce systematically less reliable guidance for the patients with darker skin — who are, in the United States, disproportionately patients who already face barriers to specialty dermatology care.

The Adamson and Smith (2018) analysis in JAMA Dermatology examined the demographic composition of commonly used dermatology image datasets and found stark underrepresentation: fewer than 5 percent of images in several major datasets showed darker skin tones, despite those populations representing a substantial fraction of the at-risk public.

The Pattern Across Domains

What unites the eGFR case, the VBAC calculator, the dermatology AI, and the Optum algorithm is a pattern that recurs across clinical AI: algorithms and decision support tools are developed without systematic examination of their performance across demographic groups, deployed into clinical practice at scale, and discovered to produce disparate outcomes — often after years of harm, often only because a researcher specifically looked for the problem.

This pattern reflects a systemic failure of clinical algorithm development and procurement: the absence of a standard requirement that AI tools demonstrate equitable performance before deployment. The FDA, until relatively recently, did not require demographic performance reporting as part of device clearance. Professional societies have only recently begun developing standards for algorithmic equity. Health systems routinely purchase commercial clinical AI products without demanding evidence that those products have been tested for differential performance across their specific patient population.


Section 12.5: Gender Bias in Healthcare AI

The Yentl Syndrome and Its Legacy

In 1991, cardiologist Bernadine Healy — then director of the National Institutes of Health, and the first woman to hold that role — published an editorial in the New England Journal of Medicine describing what she called "the Yentl syndrome." In the 1983 film Yentl, a young Jewish woman in early 20th-century Eastern Europe must disguise herself as a man to gain access to religious education. Healy used the metaphor to describe the state of women's cardiac care: women with heart disease received appropriate workup and treatment only when they presented with the same symptoms as men.

The clinical reality Healy was describing is well-documented: women's heart disease frequently presents differently than men's. The "classic" heart attack — crushing chest pain radiating to the left arm — is more characteristic of men's presentation. Women more often present with atypical symptoms: fatigue, nausea, jaw pain, shortness of breath, back pain. Because the medical literature was built on male-dominated clinical trials, and because clinicians were trained on male-typical presentations, women's cardiac symptoms were — and continue to be — more likely to be dismissed, attributed to anxiety, or inadequately worked up.

AI cardiac diagnostic systems trained on this male-dominated evidence base inherit the same bias. Electrocardiogram (ECG) interpretation algorithms, cardiac risk scores, and imaging analysis systems that learned to recognize heart disease from data in which women's presentations were underrepresented will be less accurate for women — potentially contributing to the same pattern of underdiagnosis and undertreatment that Healy identified in 1991.

Reproductive Health AI and Expertise Gaps

The proliferation of AI tools in reproductive health — fertility prediction applications, pregnancy complication risk models, menstrual cycle tracking, ovulation prediction — represents a domain of particular concern. Several specific problems have been documented.

First, many reproductive health AI products have been built by development teams with limited obstetric or gynecological expertise, producing tools that encode incorrect or oversimplified models of reproductive physiology. Menstrual cycle prediction algorithms, for example, have often assumed a standard 28-day cycle when cycle length varies substantially across individuals and across the lifespan.

Second, the training data for these tools often comes from the self-selected population of users of consumer health apps — who are not representative of women generally, skewing toward younger, English-speaking, higher-income users with smartphone access.

Third, the regulatory oversight of consumer reproductive health AI is limited. Many of these tools do not require FDA clearance because they do not make explicit medical claims, operating in a zone between wellness products and medical devices.

The Intersectional Dimension

The most serious harms from gender bias in healthcare AI fall on patients who face both racial and gender bias — particularly Black women. The United States' maternal mortality crisis is concentrated in this population: Black women die from pregnancy-related causes at approximately three times the rate of white women. The causes are multiple and contested, but documented factors include inadequate pain management, underrecognition of clinical warning signs, and structural barriers to care.

AI systems deployed in obstetric care — including pregnancy risk prediction tools, early warning systems, and postpartum complication prediction — have rarely been specifically evaluated for performance in Black women. The population in which algorithmic performance most needs to be verified is often the population least represented in the validation studies. When a pregnancy complication prediction AI performs less well for Black women because its training data underrepresented them, the tool compounds rather than addresses an existing crisis.

Intersectionality — the framework developed by legal scholar Kimberlé Crenshaw to describe how overlapping systems of oppression produce outcomes that cannot be reduced to any single axis of discrimination — is an essential analytical lens for healthcare AI equity. The harm experienced by a Black woman from a biased healthcare AI is not simply the sum of "being Black" harm and "being a woman" harm; it is a distinct harm produced by the interaction of multiple systems of disadvantage. Equity analysis that evaluates only race or only gender will miss these intersectional harms.


Section 12.6: AI in Mental Health — Particular Risks

The Proliferation of Mental Health AI

The global mental health crisis — exacerbated dramatically by the COVID-19 pandemic and its social and economic consequences — has created intense demand for scalable mental health support solutions. Into this demand gap has rushed an array of AI-powered products: therapy chatbots marketed to individuals and employers, depression and anxiety screening tools deployed in primary care, suicide risk prediction algorithms used in emergency departments and inpatient settings, and AI-powered diagnostic aids for psychiatric conditions.

The appeal is understandable: mental health care is severely underprovided globally. The World Health Organization estimates a shortage of approximately 1.18 million mental health workers worldwide. In many low- and middle-income countries, there are fewer than one psychiatrist per 100,000 people. If AI can extend the reach of mental health support, the potential benefit is enormous.

But the risks in this domain are also particularly severe, and several are specific to mental health.

The Training Data Problem in Psychiatric AI

Psychiatric diagnosis is not like X-ray interpretation. The diagnoses that structure mental health treatment — depression, anxiety, schizophrenia, bipolar disorder, PTSD — are defined not by objective biological measurements but by sets of symptoms and their impact on functioning, as codified in the Diagnostic and Statistical Manual of Mental Disorders (DSM). The DSM itself has a contested history: diagnostic categories have been added and removed based on social and political context as much as scientific evidence (homosexuality was a DSM diagnosis until 1987), and current categories reflect predominantly Western, educated, industrialized, rich, and democratic (WEIRD) populations.

AI systems trained on psychiatric diagnosis data inherit all of the DSM's historical distortions. Studies have documented racial disparities in psychiatric diagnosis: Black patients are more likely to be diagnosed with schizophrenia and less likely to be diagnosed with depression or bipolar disorder relative to white patients with similar symptom presentations — a pattern that reflects both provider bias and cultural context. An AI trained to predict psychiatric diagnoses from clinical notes or symptom questionnaires will learn these biased patterns and replicate them.

Suicide Risk Prediction AI

Suicide risk prediction algorithms represent a particularly high-stakes application. These tools — deployed in emergency departments, inpatient units, and outpatient mental health settings — attempt to identify patients at elevated risk of suicide attempt, enabling more intensive intervention. The clinical need is real: suicide risk assessment by clinicians is notoriously inaccurate, and a more reliable algorithmic tool could save lives.

The research on demographic performance gaps in suicide risk prediction AI is concerning. Studies have found that leading suicide risk prediction models perform differently across racial and ethnic groups, with higher false negative rates (missing at-risk patients) in some minority populations. Given that minority populations already face barriers to mental health care, a prediction tool with higher false negative rates in those populations compounds existing underservice.

The false positive problem is also serious: patients identified as high suicide risk typically face involuntary evaluation, hospitalization, and restriction of autonomy. Systematic false positive rate differences across demographic groups would mean that patients in some groups are more frequently subjected to involuntary psychiatric holds — a severe deprivation of liberty — based on inaccurate risk scores.

The Substitution Risk

Perhaps the deepest concern about AI in mental health is not that the AI will be biased, but that it will be used to substitute for human care rather than to supplement it — and that this substitution will be concentrated in under-resourced settings serving vulnerable populations. When a large, well-resourced health system deploys a therapy chatbot, it is supplementing an already-adequate system of human mental health care. When a public community mental health center with no therapist capacity deploys a chatbot as the primary intervention, the chatbot is not supplementing — it is replacing.

The risk of AI substitution for human care is highest precisely where human care is scarcest: low-income communities, rural areas, incarcerated populations, under-resourced public mental health systems. These populations also face the greatest data representation gaps in mental health AI training, the greatest structural barriers to care, and the least ability to advocate for appropriate care when AI systems fail them.

FDA Oversight of Mental Health AI

The regulatory landscape for mental health AI is particularly fragmented. The FDA regulates Software as a Medical Device (SaMD) — software intended to diagnose, treat, mitigate, or prevent disease. Many mental health AI tools fall into gray areas: a chatbot marketed as "mental wellness support" may claim not to diagnose or treat, thus avoiding FDA jurisdiction, while providing guidance that clinically functions as treatment. The FDA's exemptions for "software functions that are intended to provide general wellness" and for "clinical decision support software" that presents information to clinicians (rather than making automated decisions) create gaps that mental health AI products may fall through.

As mental health AI becomes more sophisticated and more widely deployed, the question of when it requires FDA review — and what that review should demand in terms of demographic performance data — becomes increasingly urgent.


Section 12.7: Regulatory Framework for Healthcare AI

FDA Regulation of AI/ML-Based SaMD

The Food and Drug Administration has regulatory authority over Software as a Medical Device — software that performs a medical device function. AI systems that diagnose disease, recommend treatment, or perform risk stratification generally qualify as SaMD and require either FDA clearance (510(k) pathway, demonstrating substantial equivalence to an existing cleared device) or FDA approval (PMA pathway, requiring clinical evidence of safety and effectiveness). As of 2024, the FDA had cleared or approved over 900 AI/ML-enabled medical devices.

The challenge for AI is the "locked algorithm" model that traditional medical device regulation assumed. A conventional drug or device is approved in a specific configuration and cannot change substantially without additional review. AI/ML systems, by contrast, are often designed to learn continuously from new data — potentially changing their performance characteristics over time in ways that may not be apparent to users. A radiology AI cleared in 2020 based on performance data from one population may, if it continues to train on data from a different population, have meaningfully different performance characteristics by 2024.

The 2021 FDA AI/ML Action Plan

In January 2021, the FDA released its Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device Action Plan, outlining a framework for how the agency planned to approach the regulatory challenges posed by adaptive AI systems. The Action Plan addressed five areas: a tailored regulatory framework for adaptive AI (including predetermined change control plans), good machine learning practices, a patient-centered approach incorporating transparency to users, regulatory science methods for evaluating algorithmic bias and robustness, and real-world performance monitoring.

Most relevant to health equity, the Action Plan explicitly acknowledged the need for demographic analysis of AI performance and committed the FDA to developing guidance on how manufacturers should assess and report performance across demographic subgroups.

FDA Diversity Requirements for AI Medical Devices

In 2023, the FDA proposed updated guidance requiring manufacturers of AI/ML-based medical devices to include demographic performance reporting as part of device submissions. Under the proposed guidance, manufacturers would be required to demonstrate that their devices have been tested across relevant demographic subgroups — including race, ethnicity, sex, and age — and to report performance metrics for each subgroup. Devices with significant performance gaps across demographic groups would face additional scrutiny.

This represented a significant step toward requiring equity evidence as a condition of market authorization. However, as a proposed guidance rather than a binding regulation, its implementation and enforcement remained to be established.

HHS AI Guidance and Health Equity

The Department of Health and Human Services has developed guidance on the use of AI in programs receiving federal funding, with explicit attention to health equity requirements. Under the guidance, healthcare organizations receiving Medicare or Medicaid reimbursement are expected to ensure that AI tools they deploy do not produce discriminatory outcomes based on race, color, national origin, sex, age, or disability — under existing anti-discrimination provisions in federal health law (Section 1557 of the Affordable Care Act, the Rehabilitation Act, and others).

Whether existing anti-discrimination law provides meaningful recourse against algorithmic discrimination in healthcare has been contested in courts, with outcomes that remain unsettled.

European Medical Device Regulation and the EU AI Act

European regulation of healthcare AI operates through two primary frameworks: the Medical Device Regulation (MDR), which governs AI systems functioning as medical devices and requires rigorous post-market surveillance including monitoring for differential performance; and the EU Artificial Intelligence Act, which came into force in 2024 and classifies most clinical AI systems as high-risk AI, requiring conformity assessments, technical documentation, human oversight requirements, and registration in an EU AI database.

The EU AI Act's high-risk classification for clinical AI is more prescriptive than the FDA's approach in several respects, requiring explicit testing for accuracy and robustness across population segments and documentation of datasets used for training, including their demographic characteristics.

The Certification Gap

A critical feature of the U.S. regulatory landscape is the large number of clinical AI tools that are not subject to FDA oversight. The FDA's clinical decision support (CDS) exemption excludes from the definition of a medical device software that provides information to clinicians who can independently review the basis for its recommendations — meaning such tools require no clearance or approval. Many risk stratification tools, including health risk algorithms similar to the Optum product, have been marketed under this exemption.

The practical effect is that the clinical AI tools with the most pervasive influence on healthcare delivery — embedded in EHR systems, used routinely in care management decisions — may operate with minimal regulatory oversight. Health systems procuring these tools have no FDA-mandated evidence of performance, no required demographic testing, and no post-market surveillance requirements to fall back on. They are largely on their own.

State-Level Initiatives

In the absence of comprehensive federal regulation, several states have enacted or proposed legislation addressing healthcare AI. New York City enacted Local Law 144 of 2021, requiring automated employment decision tools to undergo bias audits — a model that several state legislatures have considered adapting for healthcare AI. California has enacted legislation requiring health plans to use clinical standards rather than algorithmic standards for certain coverage decisions. These state-level initiatives are fragmented and partial, but they represent growing political will to address algorithmic accountability in healthcare.


Section 12.8: Building Equitable Healthcare AI

Pre-Deployment Demographic Testing

The single most important technical intervention for equitable healthcare AI is one that is also the most basic: before deploying an AI system in clinical practice, testing its performance separately for each major demographic subgroup represented in the clinical population it will serve.

For a risk stratification algorithm deployed in a health system where 40 percent of patients are Black, performance on Black patients is not an edge case — it is a core performance requirement. A product whose vendor cannot demonstrate comparable performance across demographic groups, in data that reflects the deployment environment, should not be deployed. This seems obvious stated plainly, and yet it has been the exception rather than the rule in healthcare AI procurement.

Pre-deployment demographic testing requires that health systems demand this evidence from vendors, that vendors collect it, and that both parties understand what acceptable performance parity means — because perfect parity is rarely achievable, and the clinical and ethical significance of specific performance gaps requires human judgment, not just statistical testing.
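As a sketch of what this testing looks like in practice — using synthetic data, invented group labels, and an illustrative 0.05 parity tolerance, not a clinical standard:

```python
# Minimal sketch of pre-deployment subgroup testing: compute sensitivity
# (true positive rate) separately for each demographic group, then flag
# any group whose gap from the best-performing group exceeds a tolerance.
# Data, group labels, and the tolerance are all illustrative.

from collections import defaultdict

def sensitivity_by_group(y_true, y_pred, groups):
    """Return {group: TP / (TP + FN)} for each group with positive cases."""
    tp = defaultdict(int)
    fn = defaultdict(int)
    for truth, pred, g in zip(y_true, y_pred, groups):
        if truth == 1:
            if pred == 1:
                tp[g] += 1
            else:
                fn[g] += 1
    return {g: tp[g] / (tp[g] + fn[g]) for g in sorted(set(tp) | set(fn))}

def parity_gaps(metrics, tolerance=0.05):
    """Flag groups whose metric trails the best group by more than `tolerance`."""
    best = max(metrics.values())
    return {g: best - m for g, m in metrics.items() if best - m > tolerance}

# Synthetic example: the model misses more true positives in group "B".
y_true = [1, 1, 1, 1, 1, 1, 1, 1]
y_pred = [1, 1, 1, 1, 1, 0, 0, 1]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

metrics = sensitivity_by_group(y_true, y_pred, groups)
print(metrics)               # {'A': 1.0, 'B': 0.5}
print(parity_gaps(metrics))  # {'B': 0.5}
```

The point of the sketch is the disaggregation itself: an aggregate sensitivity of 0.75 here would look acceptable while hiding a 50-point gap for group "B". Deciding whether a given gap is clinically tolerable remains a human judgment, as the text above notes.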

Diverse Development Teams

The composition of the teams that develop healthcare AI matters, both for the values embedded in design decisions and for the domain knowledge brought to bear on potential failure modes. Teams that are demographically homogeneous are less likely to ask whether a training dataset reflects the deployment population, more likely to accept a proxy variable without examining its demographic implications, and less likely to have personal or clinical knowledge of how algorithmic failures would affect marginalized communities.

Diversity in development teams means diversity at every level: data scientists and engineers, clinical advisors, project leadership, and patient advisory boards. It also means diversity in the type of expertise represented: epidemiologists who can identify dataset confounding, health equity researchers who understand the structural determinants of health disparities, medical ethicists who can raise questions about proxy selection and value alignment.

Community Engagement

For healthcare AI applications affecting specific communities — particularly vulnerable ones — direct engagement with those communities in the design and evaluation process is both an ethical obligation and a practical improvement to the product. Community members can identify failure modes that technical teams miss. They can surface concerns about privacy, surveillance, and the uses to which data might be put. They can evaluate whether AI-generated recommendations make sense in the context of their lived experience of the healthcare system.

Meaningful community engagement is not the same as a focus group or a survey. It involves ongoing participation, genuine power to influence design decisions, compensation for expertise, and transparency about how input is used and when it is not followed (and why).

Bias Monitoring Post-Deployment

Deploying a healthcare AI system without ongoing monitoring for differential performance is the equivalent of launching a drug without post-market surveillance. Performance can degrade as patient populations shift, as clinical practice changes, as the distribution of input data changes, or as the AI system continues to learn from new data. Monitoring for differential performance across demographic groups — using both process measures (who is referred, who receives interventions) and outcome measures (clinical outcomes by group) — is a basic requirement of responsible deployment.

Establishing this monitoring requires data infrastructure that records patient demographics alongside clinical outcomes in a form that enables analysis; statistical expertise to detect meaningful performance gaps; clinical leadership committed to acting on findings; and a process for feeding findings back to developers or triggering retraining.
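A minimal sketch of the monitoring step, assuming per-group metrics are already being computed on a rolling window (the group names, numbers, and 0.05 alert threshold are illustrative):

```python
# Sketch of post-deployment drift monitoring: compare each group's recent
# performance against a deployment-time baseline and raise an alert when
# the drop exceeds a threshold. Values and threshold are assumptions for
# the example, not recommended operating parameters.

def drift_alerts(baseline, recent, max_drop=0.05):
    """Return groups whose recent metric fell more than `max_drop` below baseline."""
    return sorted(
        g for g, base in baseline.items()
        if base - recent.get(g, 0.0) > max_drop
    )

# Illustrative numbers: performance for group "B" has degraded since launch.
baseline = {"A": 0.91, "B": 0.89}
recent   = {"A": 0.90, "B": 0.78}

print(drift_alerts(baseline, recent))  # ['B']
```

An alert here should trigger the human process the text describes — clinical review, vendor notification, possibly retraining — rather than an automated fix.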

The Model Card for Healthcare

Model cards — documentation artifacts that describe the training data, intended use, performance characteristics, and known limitations of an AI system — were proposed by researchers at Google and have become an emerging standard in responsible AI development. For healthcare AI, model cards should include:

  • The demographic composition of the training dataset (race, ethnicity, sex, age, geographic source, insurance status of included patients)
  • Performance metrics for each major demographic subgroup in the validation dataset
  • Known limitations and failure modes
  • The intended clinical use environment and the patient populations for which it has and has not been validated
  • The regulatory status of the tool (FDA-cleared, exempt, pending)
  • The last date of retraining and the data used for retraining

Health systems procuring AI should require model cards from vendors and should evaluate them as a core component of procurement due diligence.
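One way to make such a model card usable in procurement is as a machine-readable record; the field names below follow the list above, and all example values are hypothetical, not a published schema.

```python
# Sketch of a machine-readable healthcare model card. Field names mirror
# the documentation items listed above; the tool name and all values are
# invented for illustration.

from dataclasses import dataclass, asdict

@dataclass
class HealthcareModelCard:
    name: str
    intended_use: str
    training_demographics: dict   # e.g. {"sex": {"female": 0.52, ...}}
    subgroup_performance: dict    # validation metrics per subgroup
    known_limitations: list
    validated_populations: list
    regulatory_status: str        # e.g. "FDA-cleared", "CDS-exempt", "pending"
    last_retrained: str           # ISO date of last retraining

card = HealthcareModelCard(
    name="SepsisRisk v2 (hypothetical)",
    intended_use="Inpatient sepsis risk stratification; not for pediatric use",
    training_demographics={"sex": {"female": 0.52, "male": 0.48}},
    subgroup_performance={"female": {"auroc": 0.81}, "male": {"auroc": 0.84}},
    known_limitations=["Not validated for patients under 18"],
    validated_populations=["Adults at three US academic medical centers"],
    regulatory_status="CDS-exempt",
    last_retrained="2024-03-01",
)

print(asdict(card)["regulatory_status"])  # CDS-exempt
```

A structured card like this can be diffed across vendor updates and checked automatically during due diligence — for example, rejecting any submission whose subgroup_performance omits a group present in the health system's own population.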

Procurement Standards

Healthcare organizations have significant leverage over the AI products they use through their purchasing decisions. A large health system that tells a major EHR vendor "we will not purchase your AI tools unless you provide demographic performance data and model documentation" is exercising market power in the service of health equity. When many health systems make the same demand, market incentives shift.

Procurement standards for healthcare AI equity should include:

  • Mandatory demographic performance reporting across specified subgroups
  • Documentation of training dataset demographics
  • Commitment to post-market demographic performance monitoring
  • Contractual obligations to remediate identified performance gaps
  • Audit rights enabling the health system to conduct independent evaluation

Clinical Workflow Integration

The design of how AI is presented within clinical workflows has significant implications for equity. An AI tool that presents a risk score without explanation of how it was calculated, without indication of the populations in which it was validated, and without a clear pathway for clinician override may be used in ways its designers did not intend. Clinicians who do not understand the basis of an AI recommendation are less likely to catch errors — including systematic errors affecting specific demographic groups.

Clinical AI should be designed to augment human judgment, not to replace it. This means presenting AI outputs in ways that invite clinical assessment rather than automatic acceptance: explaining the factors driving a recommendation, flagging cases where the patient's characteristics differ from the training population, building override mechanisms that do not create friction, and ensuring that the clinician remains the decision-maker.

The equity implications of this design principle are direct: a clinician who is trained to understand AI limitations, who knows that a particular risk score performs less well for patients with certain characteristics, and who is empowered to exercise clinical judgment can partially compensate for algorithmic bias. A clinician who is trained to defer to algorithm outputs, or who operates in a workflow designed to route patients before clinician review based on the algorithm's output, cannot.
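One of the workflow elements described above — flagging cases where the patient's characteristics differ from the training population — can be sketched as a simple check; the attribute names, ranges, and group sets are illustrative assumptions.

```python
# Sketch of an "outside validated population" flag: warn the clinician
# when a patient's attributes fall outside what the training and
# validation data covered. All ranges and categories are illustrative.

TRAINING_RANGES = {"age": (18, 89), "egfr": (15.0, 120.0)}
VALIDATED_GROUPS = {"race": {"White", "Black"}, "sex": {"female", "male"}}

def out_of_population_flags(patient: dict) -> list:
    """Return human-readable warnings for attributes outside the validated population."""
    flags = []
    for attr, (lo, hi) in TRAINING_RANGES.items():
        value = patient.get(attr)
        if value is not None and not (lo <= value <= hi):
            flags.append(f"{attr}={value} outside training range [{lo}, {hi}]")
    for attr, seen in VALIDATED_GROUPS.items():
        value = patient.get(attr)
        if value is not None and value not in seen:
            flags.append(f"{attr}={value!r} not represented in validation data")
    return flags

print(out_of_population_flags({"age": 16, "sex": "female"}))
# ['age=16 outside training range [18, 89]']
```

A flag like this does not fix the underlying gap; it makes the gap visible at the point of care, so the clinician knows to weight the algorithm's output accordingly.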

Looking ahead: Chapter 36 will go substantially deeper on AI in healthcare decision-making, examining case studies in specific clinical domains, the evolving regulatory landscape, and frameworks for healthcare organizational AI governance.


Discussion Questions

  1. The Optum algorithm was not designed to discriminate by race — it did not use race as an input variable. Yet it produced racially disparate outcomes. Does the absence of explicit racial coding absolve the algorithm's designers of ethical responsibility for the disparate outcomes it produced? What standard of care should apply to developers of health risk stratification algorithms?

  2. The eGFR race correction was, for decades, presented as a refinement that made the algorithm more accurate for Black patients. Yet it turned out to delay their access to transplant referrals. What does this case suggest about the process by which clinical algorithms that incorporate race should be developed and reviewed? Who should have decision-making authority over these choices?

  3. A community health organization serving predominantly low-income Black and Hispanic patients is considering deploying an AI-powered depression screening tool that was validated primarily in white, college-educated populations. The organization has no budget to conduct its own validation study, but the tool is free and could dramatically expand their screening capacity. What should they do? What obligations do the tool's developers have in this situation?

  4. AI companies sometimes argue that they cannot collect race and ethnicity data for model training because of privacy concerns or because they fear legal liability. Evaluate this argument. What are the legitimate privacy concerns, and what approaches might address both the privacy concern and the equity imperative?

  5. The FDA's proposed 2023 guidance on demographic performance reporting for AI medical devices was welcomed by health equity advocates but criticized by some industry groups as imposing onerous requirements that would slow innovation. How would you evaluate this tradeoff? What does "adequate" demographic performance evidence look like for a clinical AI system?

  6. Intersectionality theory suggests that the harms from AI bias for Black women cannot be understood as simply additive — the result of being Black plus the result of being a woman. How would you design a bias evaluation framework for healthcare AI that captures intersectional harms? What challenges would you face?

  7. The "substitution risk" argument suggests that deploying AI as a substitute for human care in under-resourced settings — rather than as a supplement — may be particularly harmful in mental health. Who bears responsibility for preventing this substitution: the AI developers, the healthcare organizations deploying the tools, regulators, or health system funders?


Chapter 12 continues in case studies examining the Optum health risk algorithm and dermatology AI performance gaps. Key takeaways, exercises, and quiz follow.