Key Takeaways: Chapter 12 — Bias in Healthcare AI


Core Takeaways

1. Healthcare AI bias is not hypothetical — it is documented, measured, and causing harm now. The examples examined in this chapter — the Optum health risk algorithm, eGFR race correction, dermatology AI performance gaps, the VBAC calculator, pulse oximeter inaccuracy — are not theoretical risks. They are documented phenomena with measurable effects on clinical care and health outcomes. Healthcare organizations cannot treat AI equity as a future concern to be addressed when the technology matures; it is a present obligation.

2. Bias enters healthcare AI at every stage: data collection, data labeling, variable selection, proxy choice, validation, and deployment. There is no single point of failure. A model can be trained on demographically diverse data but still encode historical treatment disparities in its labels. A model trained on accurate labels can still use a proxy variable (like healthcare cost) that functions as a racial proxy. A model with good training-set performance can fail at deployment when the deployment population differs from the training population. Equity requires attention at every stage of the development and deployment lifecycle.

3. The proxy problem is the central mechanism of many healthcare AI disparities. Using a variable that correlates with a protected characteristic — because that variable is itself shaped by historical discrimination — is the mechanism behind the Optum case, the eGFR race correction, and many other examples. Healthcare cost, access frequency, historical treatment patterns, and geographic indicators are all potentially racial proxies in the U.S. healthcare context. Every variable selection decision carries implicit equity implications that must be explicitly examined.
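The cost-as-proxy mechanism can be made concrete with a small simulation. This is a hypothetical sketch, not the actual Optum analysis: two groups have identical distributions of true healthcare need, but one group generates lower cost for the same need because of access barriers, so ranking by cost (the proxy) under-selects that group for care management.

```python
# Hypothetical simulation of proxy bias. Groups A and B have identical
# true-need distributions, but group B generates lower cost for the same
# need (an assumed access-barrier effect). Selecting the top 5% by cost,
# as a risk-stratification program might, then under-selects group B.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

need = rng.gamma(shape=2.0, scale=1.0, size=n)        # true (unobserved) need
group = rng.choice(["A", "B"], size=n)                # demographic group
access = np.where(group == "A", 1.0, 0.6)             # group B accesses less care
cost = need * access * rng.lognormal(0, 0.3, size=n)  # observed cost (the proxy)

# Select the top 5% by cost for a care-management program.
threshold = np.quantile(cost, 0.95)
selected = cost >= threshold

for g in ["A", "B"]:
    mask = group == g
    print(f"group {g}: mean need {need[mask].mean():.2f}, "
          f"selected {selected[mask].mean():.1%}")
```

Despite equal need, the lower-access group is selected far less often: the proxy is predictive in aggregate yet inequitable by group, which is exactly the tension this takeaway describes.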

4. Scale transforms individual design choices into population-scale harms. A biased clinician affects the patients they personally see. A biased algorithm deployed in a health system's EHR affects every patient whose record it processes. An algorithm used by health systems serving 200 million patients propagates a design flaw into the care decisions affecting a substantial fraction of the U.S. population. The scale of healthcare AI deployment means that errors that would be minor at the individual level become major at the population level.

5. Equity testing across demographic subgroups must be required before deployment — not discovered after. The pattern across virtually every documented case of healthcare AI bias is the same: the algorithm was deployed at scale, operated for months or years, and the demographic performance gap was discovered only after a researcher specifically looked for it. This sequence means that harm occurs in the gap between deployment and discovery. Pre-deployment demographic testing — requiring evidence of comparable performance across demographic groups before clinical use — is the primary intervention to close this gap.
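As a sketch of what such a pre-deployment gate might look like in code: compute a per-group performance metric and block deployment when any group falls too far below the best-performing group. The choice of sensitivity as the metric and the 5-point tolerance are illustrative policy choices, not a regulatory standard.

```python
# A minimal sketch of a pre-deployment equity gate: per-group sensitivity
# (true positive rate), with deployment blocked if the gap between the
# best and worst group exceeds a chosen tolerance. The tolerance value
# is an illustrative policy choice, not a standard.
from collections import defaultdict

def equity_gate(records, tolerance=0.05):
    """records: iterable of (group, y_true, y_pred) with binary labels."""
    tp = defaultdict(int)
    fn = defaultdict(int)
    for group, y_true, y_pred in records:
        if y_true == 1:
            if y_pred == 1:
                tp[group] += 1
            else:
                fn[group] += 1
    sens = {g: tp[g] / (tp[g] + fn[g]) for g in tp.keys() | fn.keys()}
    worst_gap = max(sens.values()) - min(sens.values())
    return sens, worst_gap <= tolerance

# Synthetic validation results: the model misses 10% of group A's
# positive cases but 30% of group B's.
records = [("A", 1, 1)] * 90 + [("A", 1, 0)] * 10 \
        + [("B", 1, 1)] * 70 + [("B", 1, 0)] * 30
sens, ok = equity_gate(records)
print(sens, "deploy" if ok else "hold: equity gap too large")
```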

6. Race in clinical algorithms requires critical scrutiny, not routine acceptance. The eGFR race correction and the VBAC calculator were incorporated into clinical practice without adequate examination of whether race was being used as a valid biological variable or as a problematic proxy for social and historical factors. When race is used in a clinical algorithm, the question "what is this variable actually measuring, and are we using it to compensate for historical underservice rather than to correct a genuine biological difference?" must be explicitly asked and answered.
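To make the eGFR mechanism concrete: the 2009 CKD-EPI creatinine equation applied a fixed multiplier of 1.159 whenever the patient was identified as Black, raising the reported kidney-function estimate by 15.9% on otherwise identical inputs. A sketch of the published 2009 coefficients, for illustration only and not clinical use (the 2021 revision removed the race coefficient):

```python
# The 2009 CKD-EPI creatinine equation, including the race coefficient
# that was removed in the 2021 revision. Coefficients as published for
# the 2009 equation; illustrative only, not for clinical use.
def egfr_ckd_epi_2009(scr, age, female, black):
    """scr: serum creatinine in mg/dL; returns eGFR in mL/min/1.73 m^2."""
    kappa = 0.7 if female else 0.9
    alpha = -0.329 if female else -0.411
    return (141
            * min(scr / kappa, 1.0) ** alpha
            * max(scr / kappa, 1.0) ** -1.209
            * 0.993 ** age
            * (1.018 if female else 1.0)
            * (1.159 if black else 1.0))

# Same labs, same patient: the race multiplier raises the estimate by
# 15.9%, which can keep a Black patient above a referral cutoff.
base = egfr_ckd_epi_2009(1.8, 55, female=False, black=False)
with_correction = egfr_ckd_epi_2009(1.8, 55, female=False, black=True)
print(round(base, 1), round(with_correction, 1))
```

Because transplant referral thresholds are defined on eGFR, a uniform upward multiplier for one racial group systematically delays that group's referrals, which is the harm documented in this chapter.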

7. The regulatory gap is wide: most clinical decision support AI operates without mandatory equity requirements. The FDA's oversight of clinical AI is significant but incomplete. The clinical decision support exemption means that many of the most widely deployed clinical algorithms — embedded in major EHR systems, used by health systems serving millions of patients — are not subject to FDA clearance requirements, including demographic performance testing. Health systems purchasing these tools have largely been on their own in assessing equity. This regulatory gap represents a systemic accountability failure.

8. Gender bias in healthcare AI inherits decades of gender bias in clinical research. Women were systematically excluded from clinical trials for decades. AI trained on the resulting male-dominated evidence base inherits this exclusion and performs less well for women in domains including cardiac diagnosis, drug dosing, and other areas where sex-linked differences in physiology and disease presentation were understudied. The Yentl syndrome — women's symptoms being taken seriously only when they resembled men's — is reproduced in AI systems trained on the male-dominated record.

9. Mental health AI carries compounded risks: biased training data, contested diagnostic categories, and the substitution risk. Psychiatric diagnosis categories themselves encode historical biases. Suicide risk prediction algorithms show demographic performance gaps. Therapy chatbots and screening tools may substitute for human care in under-resourced settings serving vulnerable populations. The combination of high stakes (mental health emergencies can be life-threatening), contested ground truth (diagnostic categories are not objective measures), and underservice risk (AI replacing rather than supplementing human care) makes mental health AI a domain requiring particularly careful ethical attention.

10. Intersectionality matters: the harms facing Black women cannot be reduced to race harm plus gender harm. Equity analysis that examines only one demographic axis at a time will miss the compounded harms experienced by patients who face multiple overlapping disadvantages. Black women face maternal mortality rates roughly three times those of white women — a harm that sits at the intersection of race and gender and that cannot be understood by examining either dimension alone. Healthcare AI bias evaluation frameworks must be designed to detect intersectional harms.

11. Community engagement is both an ethical obligation and a practical improvement. Including affected communities in the design, evaluation, and governance of healthcare AI that will affect them is not only the right thing to do — it produces better AI. Community members identify failure modes that technical teams miss, surface concerns about privacy and surveillance that designers do not anticipate, and provide ground-truth knowledge about the lived experience of the healthcare system that is essential to building tools that work in the real world.

12. Procurement is power: health systems can drive equity by demanding evidence before purchase. Healthcare organizations that purchase commercial AI products have market leverage. A large health system that requires demographic performance data, model documentation, and contractual commitments to remediate identified biases as conditions of purchase is exercising power in the service of equity. When many health systems make the same demands, market incentives shift, and vendors respond. Procurement standards are an underutilized lever for advancing healthcare AI equity.
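One way a procurement team might operationalize these conditions is as a structured evidence checklist evaluated before signing. The field names and the gap threshold below are hypothetical illustrations, not an industry standard.

```python
# An illustrative (hypothetical) procurement checklist expressed as data:
# the evidence a health system might require from a vendor before purchase.
# Field names and the gap threshold are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class VendorEvidence:
    model_card_provided: bool
    subgroup_metrics: dict          # e.g. {"sensitivity": {"A": 0.91, "B": 0.88}}
    remediation_clause: bool        # contractual commitment to fix found biases
    post_market_monitoring: bool    # ongoing demographic performance reports

def meets_procurement_bar(ev: VendorEvidence, max_gap: float = 0.05) -> bool:
    if not (ev.model_card_provided and ev.remediation_clause
            and ev.post_market_monitoring):
        return False
    for metric, by_group in ev.subgroup_metrics.items():
        if max(by_group.values()) - min(by_group.values()) > max_gap:
            return False
    return True

ev = VendorEvidence(True, {"sensitivity": {"A": 0.91, "B": 0.80}}, True, True)
print(meets_procurement_bar(ev))  # fails on the 11-point sensitivity gap
```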


Essential Vocabulary

Proxy bias: Bias resulting from using a variable that correlates with a protected characteristic because that variable is itself shaped by discrimination against that group (e.g., using healthcare cost as a proxy for healthcare need in a racially stratified healthcare system).
Proxy variable: A measurable variable used in place of an unmeasured variable of interest; the proxy is valid to the extent that it correlates with the target variable equally across all subgroups.
Calibration: A property of a predictive model indicating that predicted probabilities reflect actual outcome frequencies; where a well-calibrated model predicts a probability of 0.7, the outcome should occur, across many patients, 70% of the time — and calibration should hold equally across demographic groups.
Clinical decision support (CDS): Software tools that provide information, alerts, or recommendations to assist clinicians in making decisions; CDS that requires clinician interpretation and override may qualify for FDA exemption.
Software as a Medical Device (SaMD): Software intended to be used for medical purposes without being part of a hardware medical device; clinical AI systems are typically SaMD and subject to FDA regulation.
Fitzpatrick scale: A classification system for skin tone, ranging from type I (very light, always burns) to type VI (deeply pigmented, never burns); used in dermatology AI to describe the skin tone distribution of training datasets.
eGFR: Estimated glomerular filtration rate, a laboratory calculation used to assess kidney function; its race correction factor produced inflated estimates of kidney function for Black patients, delaying their transplant referrals.
Intersectionality: A framework developed by Kimberlé Crenshaw describing how overlapping systems of oppression (race, gender, class, etc.) produce distinct harms that cannot be understood as the simple sum of harms from each system independently.
Yentl syndrome: Named by cardiologist Bernadine Healy in 1991 to describe the phenomenon whereby women's cardiac symptoms are taken seriously only when they resemble male presentations; it has broader implications for any AI trained on male-dominated medical data.
Health risk stratification: The practice of using data analytics to categorize patients by their predicted future health needs or healthcare utilization, enabling allocation of care management resources to the highest-need patients.
Model card: A documentation artifact for AI systems describing training data, intended use, performance metrics (including across demographic subgroups), known limitations, and regulatory status.
False negative rate disparity: When a predictive model misclassifies positive cases as negative (e.g., a high-risk patient as low-risk, a malignant lesion as benign) at systematically different rates across demographic groups; particularly harmful in healthcare when the missed condition requires urgent intervention.
Post-market surveillance: Ongoing monitoring of a medical device's performance after deployment, required under FDA regulations for cleared devices; for AI, this includes monitoring for demographic performance drift.
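Two of the terms defined above, calibration and false negative rate disparity, can be computed directly per demographic group. A minimal sketch on synthetic data, using a fixed 0.5 decision threshold:

```python
# Per-group calibration summary (mean predicted probability vs. observed
# outcome rate) and false negative rate, on synthetic data. A group whose
# mean prediction sits below its observed rate is under-predicted, and
# tends to show a higher false negative rate at a fixed threshold.
def per_group_metrics(rows):
    """rows: list of (group, y_true, p_pred); decision threshold fixed at 0.5."""
    out = {}
    for group in {g for g, _, _ in rows}:
        sub = [(y, p) for g, y, p in rows if g == group]
        mean_pred = sum(p for _, p in sub) / len(sub)
        observed = sum(y for y, _ in sub) / len(sub)
        positives = [(y, p) for y, p in sub if y == 1]
        fn = sum(1 for _, p in positives if p < 0.5)
        out[group] = {
            "mean_predicted": round(mean_pred, 2),
            "observed_rate": round(observed, 2),
            "false_negative_rate": round(fn / len(positives), 2),
        }
    return out

rows = [("A", 1, 0.8), ("A", 0, 0.2), ("A", 1, 0.7), ("A", 0, 0.3),
        ("B", 1, 0.4), ("B", 0, 0.2), ("B", 1, 0.6), ("B", 0, 0.3)]
print(per_group_metrics(rows))
```

On this toy data the model is well calibrated for group A but under-predicts group B, and group B's false negative rate is correspondingly higher: a miniature version of the disparity pattern the vocabulary describes.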

Core Tensions in This Chapter

Proxy validity vs. equity: Proxy variables that are statistically valid predictors at the aggregate level may produce systematically inequitable predictions at the demographic level, because the proxy variable's relationship to the outcome differs across groups.

Innovation speed vs. equity verification: Requiring pre-deployment demographic testing slows the development and release cycle for clinical AI. The cost of this slowdown must be weighed against the cost of deploying tools with undetected demographic performance gaps — a cost borne primarily by underserved patients.

Data availability vs. data representativeness: The data that is easiest to obtain — large insurance claims datasets, images from academic medical center archives — reflects the populations that access care most, which are systematically different from the populations with the highest unmet healthcare needs.

Scale benefits vs. scale harms: The same scale that makes AI a powerful tool for extending clinical decision support creates the capacity for algorithmic errors to propagate uniformly and simultaneously across millions of patients.

Automation efficiency vs. human clinical judgment: Designing AI tools to streamline decision-making can reduce the friction that enables clinicians to apply their own judgment when the algorithm may be wrong — particularly for patients whose characteristics differ from the training population.

Race as biology vs. race as social category: Clinical algorithms that incorporate race blur the distinction between race as a biological variable (which is contested and complex) and race as a social category that reflects historical treatment and discrimination. This conflation can encode historical injustice as if it were biological fact.


Questions to Carry Forward

  1. As AI systems become more deeply embedded in clinical workflows — eventually making real-time treatment recommendations during procedures — what forms of human oversight remain meaningful, and how should they be structured to catch demographic performance gaps in operation?

  2. The historical data problem — that AI trained on records of discriminatory care will learn those discriminatory patterns — may not be solvable simply by collecting more data from underrepresented populations. What alternative approaches to training data construction or learning objective design might address this problem?

  3. Health systems in low- and middle-income countries often have even less data infrastructure, fewer regulatory resources, and more limited capacity to evaluate AI equity than U.S. health systems — but stand to adopt AI clinical tools from companies whose products were validated in high-income country populations. How should the global equity implications of healthcare AI development be governed?

  4. If AI is used to recommend clinical interventions and those recommendations are biased against a demographic group, do the patients in that group have legal recourse? Against whom — the AI developer, the health system that deployed the tool, the clinician who acted on it? How should liability for algorithmic harm be structured?

  5. Are there healthcare AI applications where the equity risks are low enough, and the potential benefits to underserved populations large enough, that deployment should proceed without full demographic validation? If so, what are those conditions, and who should make that judgment?