Further Reading: Chapter 12 — Bias in Healthcare AI

The sources below are organized thematically and annotated for use by business and healthcare management readers. Each annotation describes what the source contains, why it matters, and what a reader will gain from engaging with it. Sources marked [Primary] are foundational studies directly discussed in the chapter text. Sources marked [Policy] are regulatory and government documents. Sources marked [Review] are synthesizing works useful for broader context.


Foundational Studies

1. Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447–453. [Primary]

The study that brought healthcare AI bias to widespread public attention. Obermeyer and colleagues reverse-engineered a commercial health risk stratification algorithm used to manage care for approximately 200 million Americans and demonstrated that it systematically underestimated the health needs of Black patients — because it used healthcare cost as a proxy for health need, and Black patients historically incurred lower costs due to receiving less care, not because they were healthier. The study provides a rigorous methodological template for equity evaluation of clinical algorithms: examining calibration across demographic groups by comparing predicted risk to independent measures of health burden. Essential reading for anyone who wants to understand the Optum case in its full technical and social context. Freely available through Science magazine and the Obermeyer Lab website.
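The calibration-audit template the paper provides can be sketched in a few lines of analysis code. This is a minimal illustration on synthetic data; the column names (group, predicted_risk, chronic_conditions) and the simulated relationship are invented for exposition and are not the study's actual data or code.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in for an algorithm's output file (illustrative schema only).
n = 10_000
df = pd.DataFrame({
    "group": rng.choice(["A", "B"], size=n),
    "predicted_risk": rng.uniform(0, 1, size=n),
})

# Independent measure of health burden, e.g. a count of active chronic
# conditions. Here group B carries more burden at the same predicted risk,
# mimicking the proxy-label failure mode the study identified.
burden_shift = np.where(df["group"] == "B", 1.0, 0.0)
df["chronic_conditions"] = rng.poisson(2 * df["predicted_risk"] + burden_shift)

# Calibration audit: within each decile of predicted risk, compare mean
# observed burden across groups. Equal calibration implies similar means.
df["risk_decile"] = pd.qcut(df["predicted_risk"], 10, labels=False)
audit = (df.groupby(["risk_decile", "group"])["chronic_conditions"]
           .mean().unstack("group"))
audit["gap_B_minus_A"] = audit["B"] - audit["A"]
print(audit.round(2))
```

A persistent positive gap across deciles, as in this simulation, is the signature of the bias the paper documented: equal predicted risk, unequal actual health need.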


2. Adamson, A.S. & Smith, A. (2018). Machine learning and health care disparities in dermatology. JAMA Dermatology, 154(11), 1247–1248. [Primary]

A brief but highly influential research letter documenting the demographic composition of major dermatology AI training datasets. Adamson and Smith found that images depicting patients with darker skin tones (Fitzpatrick types V and VI) constituted fewer than 5 percent of images in several of the most widely used training datasets for skin lesion classification AI. The paper anticipated, with precision, the performance gap studies that followed. Essential reading for understanding the structural origins of dermatology AI bias and for illustrating the value of asking simple representational questions about training datasets before, rather than after, deployment. Adewole Adamson subsequently became one of the leading researchers at the intersection of dermatology and health equity.
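The "simple representational question" the paper argues for can be posed programmatically before a dataset is ever used for training. A minimal sketch on invented metadata; the fitzpatrick_type field and the counts below are hypothetical, chosen only to mirror the kind of imbalance the paper reported.

```python
from collections import Counter

# Hypothetical metadata records for a lesion-image training set. The
# "fitzpatrick_type" annotation (I-VI) is assumed for illustration; it is
# not a standard field in public dermatology datasets.
records = [
    {"image_id": i, "fitzpatrick_type": t}
    for i, t in enumerate(
        ["I"] * 20 + ["II"] * 30 + ["III"] * 25 + ["IV"] * 17
        + ["V"] * 5 + ["VI"] * 3
    )
]

counts = Counter(r["fitzpatrick_type"] for r in records)
total = sum(counts.values())
dark_skin_share = (counts["V"] + counts["VI"]) / total
print(f"Fitzpatrick V-VI share: {dark_skin_share:.1%}")  # -> 8.0%
```

An audit this simple, run before deployment rather than after, is precisely the practice the paper advocates.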


3. Seyyed-Kalantari, L., Zhang, H., McDermott, M.B.A., Chen, I.Y., & Ghassemi, M. (2021). Underdiagnosis bias of artificial intelligence algorithms applied to chest radiographs in under-served patient populations. Nature Medicine, 27, 2176–2182.

A rigorous study examining the performance of chest X-ray AI algorithms across demographic subgroups, using data from the NIH Clinical Center's publicly available chest X-ray dataset. The study found that models trained on this dataset exhibited meaningful performance gaps across sex, race, insurance status, and age — with patients who were uninsured, female, or from racial minority groups more likely to be incorrectly classified. This paper is essential reading because it demonstrates the demographic performance gap in a domain (radiology AI) that had been celebrated as a success story for clinical AI, and because its analysis of insurance status as a performance-relevant variable illustrates how socioeconomic factors translate into AI performance disparities. The methodology — examining subgroup performance in a large public dataset — is a useful template for similar analyses.
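The subgroup-audit methodology reduces to computing one error metric per demographic slice and comparing. A hedged sketch on synthetic labels, using the paper's headline metric (the false-negative, or underdiagnosis, rate); the insurance-status split and every number here are simulated, not the study's.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic predictions and ground truth. "insured" is the subgroup
# attribute; the study also sliced by sex, race, and age.
n = 5_000
insured = rng.random(n) < 0.7
has_finding = rng.random(n) < 0.3

# Simulate a model that misses findings more often for uninsured patients.
miss_prob = np.where(insured, 0.10, 0.25)
predicted = has_finding & (rng.random(n) >= miss_prob)

def false_negative_rate(y_true, y_pred):
    """Share of true positives the model fails to flag."""
    positives = y_true.sum()
    return ((y_true & ~y_pred).sum() / positives) if positives else float("nan")

for label, mask in [("insured", insured), ("uninsured", ~insured)]:
    fnr = false_negative_rate(has_finding[mask], predicted[mask])
    print(f"{label:>9}: underdiagnosis (FNR) = {fnr:.1%}")
```

The comparison across slices, not either number in isolation, is what surfaces an underdiagnosis gap of the kind the paper reports.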


4. Esteva, A., Kuprel, B., Novoa, R.A., Ko, J., Swetter, S.M., Blau, H.M., & Thrun, S. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542, 115–118.

The landmark paper demonstrating that a convolutional neural network could classify skin lesions with accuracy comparable to board-certified dermatologists. The paper was widely celebrated as a breakthrough in clinical AI; its training data were subsequently analyzed by Adamson and Smith, who identified the demographic composition problems described in this chapter. Reading the original Esteva et al. paper alongside the subsequent critique illustrates how performance claims can be technically accurate while obscuring population-level equity failures. The gap between the headline ("AI equals dermatologist") and the follow-up finding ("AI equals dermatologist primarily for patients similar to those in the training data") is a valuable lesson in how to read AI performance claims critically.


5. Vyas, D.A., Eisenstein, L.G., & Jones, D.S. (2020). Hidden in plain sight — reconsidering the use of race correction in clinical algorithms. New England Journal of Medicine, 383, 874–882.

A comprehensive review of the use of race as a variable in clinical algorithms, examining multiple examples including eGFR, VBAC prediction, spirometry reference values, and cardiac risk calculators. The authors argue that the routine incorporation of race into clinical algorithms encodes social categories as if they were biological constants, and that this practice needs systematic re-examination. The paper provides a useful taxonomy of how race enters clinical algorithms and a framework for evaluating whether its inclusion is scientifically justified. Essential for readers who want to understand the eGFR controversy and its relationship to broader debates about race in medicine.


6. Obermeyer, Z., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations [supplemental commentary]. Science, 366(6464).

The supplemental materials and commentary associated with the primary Obermeyer et al. paper, which provide additional methodological detail and broader reflection on implications for health risk stratification as a field. Of particular value is the researchers' analysis of what an alternative, more equitable algorithm design would look like — demonstrating that clinical indicators of disease burden, substituted for cost, produce substantially more equitable predictions. This reframe — from "the algorithm was wrong" to "a better algorithm is achievable" — is essential for productive engagement with healthcare AI equity.


Regulatory and Policy Documents

7. U.S. Food and Drug Administration. (2021, January). Artificial Intelligence/Machine Learning (AI/ML)-Based Software as a Medical Device (SaMD) Action Plan. [Policy]

The FDA's comprehensive statement of its regulatory approach to AI and machine learning in medical devices, published in January 2021. The Action Plan addresses five areas: a tailored regulatory framework for adaptive AI, good machine learning practices, patient-centered approaches with transparency requirements, regulatory science methods for evaluating algorithmic bias and robustness, and real-world performance monitoring. The document explicitly acknowledges the need for demographic performance analysis and commits the FDA to developing guidance in this area. Essential reading for anyone seeking to understand the regulatory landscape for clinical AI in the United States, and particularly for understanding the gap between the FDA's stated commitments and the tools it has deployed to enforce them. Freely available on FDA.gov.


8. U.S. Food and Drug Administration. (2023). Developing a Software Precertification Program: A Working Model [and associated guidance on AI/ML demographic performance reporting].

The FDA's proposed guidance on demographic performance reporting for AI/ML-based medical devices, released as part of the agency's evolving approach to healthcare AI equity. The proposed guidance would require manufacturers to demonstrate performance across relevant demographic subgroups and to disclose training dataset demographics. Understanding this document requires reading it against the backdrop of the FDA's existing regulatory frameworks for medical devices — particularly the 510(k) clearance pathway and the clinical decision support software exemption under the 21st Century Cures Act. Available on FDA.gov; the evolving regulatory landscape means readers should check for updates after the publication date of this textbook.


9. Office of the National Coordinator for Health Information Technology (ONC). (2021). Report on Health IT and Health Information Blocking.

ONC's framework for health information transparency and interoperability, relevant to healthcare AI in several ways: first, data sharing rules affect what training data is available to AI developers; second, transparency requirements about how data is used may create obligations relevant to algorithmic decision-making; third, the information-blocking framework creates pathways for patients to access their own health data, including potentially understanding algorithmic decisions that affected their care. Healthcare managers should read this alongside FDA guidance to understand the full federal regulatory landscape.


10. The National Kidney Foundation and American Society of Nephrology Joint Task Force on Reassessing the Inclusion of Race in Diagnosing Kidney Disease. (2021). A new race-free equation to estimate GFR: Recommendations from the NKF/ASN Task Force. American Journal of Kidney Diseases, 78(6), 861–872.

The formal recommendation by the leading nephrology professional societies to eliminate race from eGFR calculations, adopting the CKD-EPI 2021 equation. The document provides the scientific rationale for the change, addresses anticipated concerns from practitioners who had relied on the race-adjusted equation, and describes the transition process. Reading this alongside the earlier literature on the race correction's development illustrates how a clinical practice that was adopted in good faith with apparently reasonable scientific justification can, over time, be recognized to cause harm — and how professional societies can and should respond when this recognition arrives. Essential reading for understanding the eGFR case study in Chapter 12.


Reviews and Syntheses

11. Gianfrancesco, M.A., Tamang, S., Yazdany, J., & Schmajuk, G. (2018). Potential biases in machine learning algorithms using electronic health record data. JAMA Internal Medicine, 178(11), 1544–1547. [Review]

A concise, highly readable review of the major sources of bias in machine learning systems trained on EHR data, aimed at clinical audiences. The paper provides a taxonomy of bias types — including measurement bias, omitted variable bias, and treatment selection bias — with healthcare-specific examples. It is particularly useful for readers who want a systematic framework for thinking about where bias enters EHR-based AI, rather than a case-by-case account; the taxonomy complements the sources-of-bias discussion in Section 12.2 of this chapter.


12. Chen, I.Y., Agrawal, M., Horng, S., & Sontag, D. (2019). Robustly extracting medical knowledge from EHRs: A case study of learning a health knowledge graph from emergency department notes. Pacific Symposium on Biocomputing, 24, 1–12.

While technical in places, this paper illustrates how natural language processing systems extract clinical knowledge from unstructured clinical notes — and how the biases embedded in clinical documentation (discussed in Section 12.2 of this chapter) propagate into NLP-derived knowledge systems. Particularly useful for readers interested in the documentation bias problem and how it translates into AI system behavior.


13. Wawira Gichoya, J., Banerjee, I., Bhimireddy, A.R., et al. (2022). AI recognition of patient race in medical imaging: A modelling study. The Lancet Digital Health, 4(6), e406–e414.

A disturbing finding with profound implications for healthcare AI equity: deep learning models trained on medical images — chest X-rays, retinal photographs, bone density scans — can detect patient race with high accuracy, even from images where race is not apparent to trained human observers. This finding means that AI systems processing medical images may encode race in their representations and potentially use race-correlated features in ways that produce demographic performance disparities, even when race is not an explicit variable. The paper is essential for understanding why "we don't include race as a variable" is insufficient assurance that a healthcare AI system will not produce racially disparate outcomes.


14. Moons, K.G.M., Wolff, R.F., Riley, R.D., et al. (2019). PROBAST: A tool to assess the risk of bias and applicability of prediction model studies. Annals of Internal Medicine, 170(1), W1–W33.

The Prediction Model Risk of Bias Assessment Tool (PROBAST), developed by a multinational group of clinical researchers, provides a systematic framework for evaluating the quality and potential bias of clinical prediction models. The tool examines potential bias in participant selection, predictors, outcome definition, and statistical analysis. Healthcare managers evaluating commercial clinical AI products can use PROBAST as a structured framework for asking the right questions about a product's development methodology. The tool and user guide are freely available at probast.org.
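PROBAST's structure lends itself to a simple machine-readable review record. The sketch below encodes the four domains and a simplified version of the tool's aggregation rule (any high-risk domain makes the overall judgement high); the full instrument's signalling questions are omitted, so consult probast.org before using anything like this in practice.

```python
from dataclasses import dataclass

# The four PROBAST domains; per-domain signalling questions are omitted.
DOMAINS = ("participants", "predictors", "outcome", "analysis")

@dataclass
class ProbastReview:
    """Domain-level risk-of-bias judgements for one prediction model:
    each field is 'low', 'high', or 'unclear'."""
    participants: str
    predictors: str
    outcome: str
    analysis: str

    def overall(self) -> str:
        # Simplified aggregation: any high-risk domain -> overall high;
        # all low -> low; otherwise unclear.
        ratings = [getattr(self, d) for d in DOMAINS]
        if "high" in ratings:
            return "high"
        if all(r == "low" for r in ratings):
            return "low"
        return "unclear"

review = ProbastReview(
    participants="low", predictors="low",
    outcome="unclear", analysis="high",
)
print(review.overall())  # -> high
```

Structuring a vendor assessment this way forces an explicit judgement for each domain rather than a single impressionistic score.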


15. Healy, B. (1991). The Yentl syndrome. New England Journal of Medicine, 325(4), 274–276.

The original editorial by Bernadine Healy coining the term "Yentl syndrome" to describe the systematic underrecognition and undertreatment of women's heart disease in medicine. While the editorial predates AI healthcare systems by decades, it identifies the pattern of gender bias in clinical medicine that AI systems trained on historical data inherit. Reading this short, powerful essay alongside contemporary studies of gender bias in cardiac AI creates a sobering perspective on how long-standing clinical biases persist through generations of medical knowledge, including AI-based medical knowledge. Available through the NEJM archives.


16. Obermeyer, Z., & Emanuel, E.J. (2016). Predicting the future — big data, machine learning, and clinical medicine. New England Journal of Medicine, 375(13), 1216–1219.

An earlier piece by Obermeyer and Emanuel that provides context for understanding Obermeyer's later bias research. This essay, written before the race bias findings, describes the promise and methodology of machine learning in clinical medicine. Reading it alongside the 2019 Science paper illustrates how the field moved from enthusiasm about clinical AI's potential to systematic examination of its equity implications — a trajectory worth understanding for readers thinking about how healthcare AI governance should evolve.


17. Obermeyer, Z., Nissan, R., Stern, M., Jena, A., Feigenbaum, J., & Mullainathan, S. (2023). Algorithmic bias in healthcare: Causes, consequences, and mitigation strategies. JAMA, 330(6), 587–592.

A follow-on paper by Obermeyer and colleagues synthesizing lessons from the 2019 study and subsequent research, with particular attention to the practical implications for health system leaders. The paper provides actionable guidance for how health systems can identify, measure, and mitigate algorithmic bias in clinical AI — making it particularly useful for the healthcare management audience of this textbook. The paper's framing — bias as a remediable organizational problem rather than an inevitable technical limitation — is valuable for readers who may be tempted toward fatalism about healthcare AI equity.


18. Jacobs, M., Pradier, M.F., McCoy, T.H., et al. (2021). How machine learning recommendations influence clinician treatment selections: The example of antidepressant selection. Translational Psychiatry, 11, 108.

An empirical study of how clinicians actually respond to AI treatment recommendations, with implications for the equity of AI-influenced clinical decisions. The study found that AI recommendations significantly influenced clinician selections — with clinicians less likely to deviate from AI recommendations when they were confident in the AI's accuracy. This finding has equity implications: if clinicians defer to AI recommendations even when those recommendations may be biased for a particular patient, the human oversight layer may be less protective than assumed. Essential for readers thinking about clinical workflow integration of AI tools.


19. Benjamin, R. (2019). Race After Technology: Abolitionist Tools for the New Jim Code. Polity Press.

A book-length treatment of how technology encodes and perpetuates racial inequality, by Princeton sociologist Ruha Benjamin. While not exclusively about healthcare, Benjamin's analysis of the "New Jim Code" — the use of technology to perpetuate racial hierarchy while appearing neutral — provides important theoretical grounding for understanding healthcare AI bias as a structural phenomenon rather than a set of isolated technical errors. Benjamin's concept of "discriminatory design" — bias that operates through apparently neutral design choices — directly illuminates the proxy bias mechanism in the Optum algorithm. Healthcare managers who want to think systematically about algorithmic equity, rather than case by case, will find this essential.


20. Dastin, J. (2018, October 10). Amazon scraps secret AI recruiting tool that showed bias against women. Reuters.

While not a healthcare example, this widely cited journalism on Amazon's AI recruiting tool illustrates how the proxy bias mechanism operates in a different domain — one that is accessible to business readers who may not have clinical backgrounds. The Amazon system, trained on historical hiring decisions made in a male-dominated engineering environment, learned to penalize resumes that contained the word "women's" (as in "women's chess club") and to favor resumes that used language statistically associated with male applicants. Reading this alongside the Optum case allows readers to see proxy bias as a general pattern across AI domains, not a peculiarity of healthcare.


Note on currency: Healthcare AI regulation is evolving rapidly. Readers should verify the current status of FDA guidance documents and state-level legislation cited in this chapter, as these may have been updated after publication of this textbook. FDA.gov, ONC.gov, and the Kaiser Family Foundation's Health Equity tracker are useful resources for current information.