Case Study 12.2: Skin Lesion Classifiers and Demographic Gaps — When Dermatology AI Fails Darker Skin

Chapter 12 | Bias in Healthcare AI

Primary Sources:
Adamson, A.S., & Smith, A. (2018). Machine learning and health care disparities in dermatology. JAMA Dermatology, 154(11), 1247–1248.
Esteva, A., et al. (2017). Dermatologist-level classification of skin cancer with deep neural networks. Nature, 542, 115–118.


Introduction

In January 2017, a paper published in Nature made international news. A team led by Andre Esteva at Stanford University reported that a convolutional neural network — trained on 129,450 clinical images — could classify skin lesions as benign or malignant with accuracy comparable to board-certified dermatologists. The headline was seductive: AI could detect skin cancer as well as a specialist. The implications for global health equity seemed straightforwardly positive. If a smartphone-based AI could substitute for a dermatologist in diagnosing melanoma, it could extend potentially life-saving early detection to communities that lacked access to dermatology specialists — communities disproportionately made up of people without insurance, people in rural or underserved areas, and people in low- and middle-income countries.

What the 2017 paper did not address — and what subsequent research would document — was how the system performed across different skin tones. The training dataset was drawn predominantly from images of patients with lighter skin. The dermatologists against whom the AI was benchmarked were, like the American dermatology specialty as a whole, predominantly white. The research team did not ask whether a classifier trained on light-skinned images would perform equally well on images of darker-skinned patients.

When researchers did ask that question, the answer was troubling. This case study examines what they found, how the medical community responded, and what the dermatology AI case reveals about the structural challenges of building healthcare AI that works equitably across the full spectrum of patients it will serve.


1. The Promise of AI Dermatology: Early Skin Cancer Detection

Skin cancer is the most common cancer in the United States, with approximately 9,500 diagnoses every day. Melanoma — the most dangerous form of skin cancer — accounts for only about 1 percent of skin cancers but the majority of skin cancer deaths. Early-stage melanoma is highly treatable; the five-year survival rate for localized melanoma exceeds 98 percent. Late-stage melanoma, once it has spread to distant organs, carries a five-year survival rate of approximately 30 percent. Early detection is, quite literally, the difference between life and death.

Early detection depends on access to dermatologic evaluation. A dermatologist examining a suspicious mole can apply clinical judgment — including the ABCDE criteria (Asymmetry, Border, Color, Diameter, Evolution) and dermoscopy — to distinguish the benign from the malignant. But access to dermatologists is distributed very unevenly. The United States has approximately 12,000 practicing dermatologists, concentrated in urban areas and affluent communities. Wait times to see a dermatologist for a suspicious lesion can run to weeks or months in underserved areas. In many parts of the world, dermatology specialist access is effectively nonexistent.

AI skin lesion classification, in this context, represented a genuinely promising technology. If a primary care physician, a nurse practitioner, or a community health worker could photograph a suspicious lesion with a standard camera and receive a reliable classification of whether it required urgent specialist referral, the triage bottleneck created by dermatologist scarcity could be substantially alleviated. The AI would not replace the dermatologist for definitive diagnosis and treatment — but it could help route patients to the right level of care, catching dangerous lesions that might otherwise be watched and waited.

This was the promise. The peril lay in a foundational assumption: that an AI trained on available images would perform equally well across all patients who presented with suspicious lesions — regardless of their skin tone.


2. How Training Datasets Were Assembled

The performance of any AI image classifier is bounded by the data it was trained on. For skin lesion classification, training data consists of labeled images — photographs of lesions tagged by dermatologists as benign or malignant, or classified into specific diagnostic categories (melanoma, basal cell carcinoma, seborrheic keratosis, etc.). Assembling these datasets requires institutions with large collections of dermatology images, the capacity to have those images labeled by expert dermatologists, and the technical infrastructure to organize and share them.

In practice, the major dermatology AI training datasets were assembled at academic medical centers in the United States, Europe, and Australia — institutions with the resources and research capacity to lead this work. The patient populations served by these institutions are not nationally, much less globally, representative. They tend to serve populations that are urban, insured, English-speaking, and — particularly in the context of dermatology specialty care — predominantly white.

There is a structural reason for this last pattern. Dermatology as a specialty has historically served, and been accessed by, predominantly lighter-skinned populations in wealthier countries. This is partly because light-skinned individuals have substantially higher rates of melanoma — UV radiation is more carcinogenic when there is less melanin to absorb it — and partly because access to dermatology specialty care reflects broader patterns of healthcare access that are correlated with race and income. The consequence: the clinical photographs accumulated at academic dermatology departments over decades predominantly depict lighter skin tones.

When researchers assembled large-scale training datasets by digitizing these institutional archives and combining them with publicly available sources, the demographic skew of the source institutions was baked into the training data. The AI learned what skin lesions look like on the skin it was trained on.


3. The ISIC Archive and Its Demographic Skew

The International Skin Imaging Collaboration (ISIC) Archive is one of the primary public resources for skin lesion AI research — a repository of tens of thousands of dermoscopic images compiled from contributing institutions worldwide. The ISIC Archive has been used to train or validate many of the leading skin lesion classification systems, including those featured in high-profile competitions (the ISIC Challenge, held annually) that have driven significant advances in the field.

The contributing institutions to the ISIC Archive are concentrated in North America, Europe, and Australia. The demographic composition of the archive — the skin tone distribution of the patients depicted — reflects this geography and the institutional demographics of the contributing centers. A systematic analysis of the archive found that images depicting darker skin tones are substantially underrepresented relative to the racial and ethnic composition of the populations these tools will ultimately serve.

This underrepresentation is not incidental — it reflects the structure of dermatology research and practice. The ISIC Archive is an accumulation of what was available, which reflects who had access to specialty dermatology care and which institutions had the resources to contribute to research databases. The archive is a mirror of the healthcare system, and what that mirror reflects is a specialty that has not equally served all populations.


4. The Adamson and Smith (2018) Analysis

In November 2018, Adewole Adamson and Avery Smith published a viewpoint article in JAMA Dermatology that provided the clearest early documentation of the demographic composition problem in dermatology AI training datasets. The article was titled "Machine Learning and Health Care Disparities in Dermatology."

Adamson and Smith examined the demographic composition of images in the datasets most commonly used to train and validate skin lesion classification AI. Their findings were striking:

  • In three of the most widely used datasets, fewer than 5 percent of images depicted darker skin tones (Fitzpatrick skin types V and VI on the standard dermatology skin tone scale).
  • Images depicting Fitzpatrick type I and II skin (the lightest categories) were dramatically overrepresented relative to these types' share of the global or even U.S. population.
  • None of the datasets they examined systematically recorded skin type as a metadata field — meaning researchers using these datasets could not easily filter by skin tone or analyze performance across skin tone groups.
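The third finding can be made concrete. The sketch below shows what a skin-type metadata field would enable: auditing a dataset's composition and flagging unrecorded entries. The records and field names (`fitzpatrick`, `label`) are illustrative assumptions, not the schema of any real dataset.

```python
from collections import Counter

# Hypothetical image records. The "fitzpatrick" field is exactly the
# metadata that Adamson and Smith found missing from the datasets they
# examined; without it, per-skin-tone analysis is impossible.
images = [
    {"id": "img_001", "label": "malignant", "fitzpatrick": "II"},
    {"id": "img_002", "label": "benign",    "fitzpatrick": "I"},
    {"id": "img_003", "label": "benign",    "fitzpatrick": "V"},
    {"id": "img_004", "label": "malignant", "fitzpatrick": None},  # unrecorded
]

def audit_composition(records):
    """Count images per Fitzpatrick type, flagging missing metadata.

    Returns {group: (count, fraction_of_dataset)}.
    """
    counts = Counter(r["fitzpatrick"] or "unrecorded" for r in records)
    total = len(records)
    return {group: (n, n / total) for group, n in counts.items()}

print(audit_composition(images))
```

With such a field in place, researchers could filter a training set by skin tone group or report validation metrics per group, neither of which was possible with the datasets as distributed.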

Adamson and Smith drew a direct line from this training data composition to an anticipated performance problem: AI classifiers trained predominantly on light-skinned images would be less accurate for patients with darker skin. They framed this not merely as a technical problem but as a health equity issue: the populations most likely to be harmed by reduced AI accuracy were the same populations that already faced the most significant barriers to dermatology specialty care.

The article was prescient. Subsequent studies confirmed the performance gap the authors anticipated.


5. Performance Gap Studies: Accuracy Differences Across Skin Tone Groups

Following the Adamson and Smith article, researchers began directly measuring whether skin lesion AI classifiers performed differently across skin tone groups. The results confirmed and quantified the anticipated gap.

Multiple studies found that leading skin lesion classification systems performed meaningfully better on images of lighter-skinned patients than on images of darker-skinned patients. The performance gap manifested in both sensitivity (the ability to correctly identify malignant lesions) and specificity (the ability to correctly classify benign lesions as benign). Different systems showed different magnitudes of gap, but few showed equivalent performance across the full spectrum of skin tones.

The gap in sensitivity is particularly clinically significant: a lower sensitivity for darker-skinned patients means a higher false negative rate — more malignant lesions classified as benign, more potential cancers missed. Given that the population at elevated risk from missed melanoma is precisely the population less likely to have access to repeat evaluation or follow-up imaging, the stakes of this error pattern are high.

The ISIC Challenge — the international competition that has driven much of the technical progress in skin lesion AI — did not include stratified performance by skin tone in its standard evaluation metrics until researchers explicitly raised the issue. Competition-winning models were optimized for overall accuracy, which, in a dataset dominated by light-skinned images, could be maximized without achieving adequate performance on darker-skinned images. The evaluation framework made the demographic performance gap invisible.
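The masking effect described above is simple arithmetic. The confusion-matrix counts below are invented for illustration, but they show the mechanism: when one group dominates the evaluation set, pooled accuracy can look strong even while sensitivity on the underrepresented group is poor.

```python
def sensitivity(tp, fn):
    """True positive rate: fraction of malignant lesions correctly flagged."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """True negative rate: fraction of benign lesions correctly cleared."""
    return tn / (tn + fp)

# Illustrative (not real) per-group confusion-matrix counts.
# The lighter-skin group dominates the evaluation set, as in ISIC-style data.
groups = {
    "I-IV": {"tp": 180, "fn": 20, "tn": 760, "fp": 40},  # 1,000 images
    "V-VI": {"tp": 6,   "fn": 9,  "tn": 30,  "fp": 5},   # 50 images
}

for name, g in groups.items():
    print(name,
          "sensitivity:", round(sensitivity(g["tp"], g["fn"]), 3),
          "specificity:", round(specificity(g["tn"], g["fp"]), 3))

# Overall accuracy pools all groups, so the majority group dominates it.
correct = sum(g["tp"] + g["tn"] for g in groups.values())
total = sum(g["tp"] + g["fn"] + g["tn"] + g["fp"] for g in groups.values())
print("overall accuracy:", round(correct / total, 3))
```

In these invented numbers, overall accuracy is about 0.93 while sensitivity for the V–VI group is 0.40: six of every ten melanomas in that group are classified as benign, and the headline accuracy figure gives no hint of it. Stratified reporting, not a single pooled metric, is what makes such a gap visible.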


6. The Clinical Consequence: Missed Diagnoses in Patients with Darker Skin

The clinical consequence of the performance gap is direct and quantifiable, though difficult to trace in individual cases because AI-supported diagnosis involves clinician judgment as well as algorithmic output.

In settings where AI is used to triage suspicious lesions — routing patients to urgent specialist referral or watchful waiting — a classifier with systematically lower sensitivity for darker skin would refer fewer malignant lesions in darker-skinned patients for urgent evaluation, producing delayed diagnosis. Given that melanoma survival is strongly correlated with stage at diagnosis, delayed diagnosis translates directly into worse outcomes.

The affected population is not small. While melanoma rates are lower in patients with darker skin — because melanin provides some UV protection — melanoma in these populations is disproportionately diagnosed at later stages, and survival is correspondingly worse. Multiple studies have documented that Black and Hispanic patients have worse melanoma outcomes than white patients, even after accounting for stage at diagnosis. An AI system that is less accurate for these patients adds an algorithmic layer to the existing pattern of delayed diagnosis.

Moreover, patients with darker skin are more likely to develop melanoma in atypical locations — palms, soles, nail beds, mucous membranes — that may present differently than the typical UV-induced melanomas that dominate the training datasets. An AI trained on UV-induced melanomas from light-skinned patients may be doubly disadvantaged in classifying acral lentiginous melanoma, which disproportionately affects Black and Asian patients, because both the skin tone and the lesion type are underrepresented in training.


7. Medical Community Response

The medical community's response to the documented performance gap in dermatology AI evolved through several phases.

Initial phase: dismissal and minimization. Some initial responses from the AI dermatology research community questioned whether the performance gap was practically significant, noting that the absolute accuracy differences were modest in some studies, that dermoscopy images of darker skin are technically more challenging for any classifier, and that the clinical deployment context included human clinician oversight that would partially compensate for AI errors. These responses, while not entirely without merit, shared a pattern of minimizing the documented harm and defending the existing research trajectory.

Second phase: acknowledgment and redirection. As the evidence accumulated and health equity concerns gained prominence in medical publishing, the dermatology AI community broadly acknowledged the performance gap and began focusing on solutions — primarily, the need to diversify training datasets by collecting more images from patients with darker skin.

Third phase: deeper structural critique. More recent scholarship has pushed beyond the "just add more diverse data" framing to question whether the AI dermatology research agenda was structured from the beginning in ways that made these gaps predictable and preventable — and to ask who bears responsibility for the harm that occurred in the interval between the widespread deployment of these tools and the documentation of their performance limitations.

Professional organizations including the American Academy of Dermatology began developing position statements on AI equity, recommending that AI-assisted dermatology tools be evaluated across skin tone groups before clinical adoption.


8. FDA's 2023 Proposed Guidance on Demographic Performance Reporting

The FDA's proposed guidance on demographic performance reporting for AI/ML-based medical devices, released in 2023, directly addressed the skin lesion classifier case and others like it. The proposed guidance would require manufacturers of AI/ML-enabled medical devices to:

  • Identify the demographic subgroups (including sex, age, race, and ethnicity) that are relevant to the device's intended use
  • Provide performance data disaggregated by those subgroups, demonstrating that the device performs adequately across groups
  • Describe the demographic composition of the datasets used for training and validation
  • Identify known performance limitations for specific subgroups and propose appropriate labeling or mitigation

For skin lesion classifiers, this would mean demonstrating performance across Fitzpatrick skin type groups — not just overall accuracy — as a condition of market authorization. A classifier with documented sensitivity gaps for darker skin types would face additional scrutiny and might need to carry labeling indicating its limitations for that population.

The proposed guidance was widely welcomed by health equity researchers. Industry responses were more mixed: some manufacturers raised concerns about the feasibility of collecting demographically diverse validation datasets, particularly for devices intended for rare conditions; others questioned what performance parity thresholds were acceptable.

The FDA's proposed guidance represented a significant shift in the regulatory framing of healthcare AI: from an exclusive focus on overall accuracy toward explicit attention to equitable performance across population groups. Its final implementation and enforcement would determine whether it produced meaningful change in market practices.


9. The Representation Fix: Efforts to Diversify Dermatology Training Datasets

The most commonly proposed solution to the dermatology AI demographic gap is dataset diversification: collecting more images of skin lesions on patients with darker skin tones, labeling them with expert dermatologist diagnoses, and incorporating them into training datasets.

Several initiatives have pursued this goal. Researchers at institutions serving more diverse patient populations have begun contributing images to shared databases. Funded programs to collect dermatology images in Africa, South Asia, and among Hispanic populations in the Americas have generated new datasets that partially address the prior imbalance. The ISIC Archive has begun explicitly tracking skin tone metadata and actively soliciting contributions from underrepresented populations.

These efforts represent meaningful progress. A classifier trained on a more demographically diverse dataset should, in principle, perform better across skin tone groups, because it has been exposed to the appearance of skin lesions across the range of pigmentation levels it will encounter in practice. Several studies have confirmed that including diverse images in training produces more equitable validation performance.

However, dataset diversification faces practical challenges. Collecting high-quality dermoscopy images requires specialized equipment and trained operators. In settings with limited dermatology resources — precisely the settings serving populations that have been historically underrepresented — obtaining labeled images requires either training local providers in dermoscopy or bringing specialist resources into communities that currently lack them. Both approaches require sustained investment that the research community and commercial sector have not reliably provided.


10. The Deeper Problem: Diversifying Images Does Not Fix the Historical Data Problem

The dataset diversification response, while necessary, is not sufficient. It addresses the representation problem — the underrepresentation of darker-skinned patients in training images — but leaves a deeper structural problem largely untouched: the historical data problem.

Even if a skin lesion training dataset included perfectly equal proportions of images across all Fitzpatrick skin types, those images would still be drawn from a healthcare system that has provided systematically different care to patients of different races. The labels attached to training images — "malignant" or "benign" — reflect the diagnoses that were made. If darker-skinned patients with melanoma were historically more likely to be misdiagnosed (diagnosed as benign when malignant) due to physician bias, unfamiliarity with the appearance of melanoma on darker skin, or delayed presentation, then the training labels for images from darker-skinned patients would include a higher proportion of incorrect benign labels — melanomas labeled as benign because they were initially missed.

An AI trained on this data would learn from historical diagnostic errors as if they were ground truth. More diverse training data, if drawn from the same historical clinical record, would partially solve the representation problem while carrying forward the historical error problem. The AI would be exposed to more images of darker skin but would learn from diagnoses that may themselves reflect the bias of past clinical practice.

Addressing the historical data problem requires additional steps: using only images with pathology-confirmed diagnoses (biopsy results) as ground truth, rather than clinical diagnoses; actively auditing training labels for cases where the diagnosis may have been clinician-biased; and potentially weighting training samples to adjust for the historical underservice of specific populations.
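The last mitigation above, weighting training samples, can be sketched as inverse-frequency group weights. The normalization choice (mean weight of 1 across samples) and the group labels here are illustrative assumptions, not a published method; real reweighting schemes must also contend with the label-quality issues just described.

```python
from collections import Counter

def group_weights(groups):
    """Inverse-frequency weight per group, normalized so that the mean
    per-sample weight is 1. Underrepresented groups get larger weights,
    so they contribute proportionally more to the training loss."""
    counts = Counter(groups)
    n, k = len(groups), len(counts)
    return {g: n / (k * c) for g, c in counts.items()}

# Hypothetical Fitzpatrick group labels for a small, skewed training set:
# eight lighter-skin images, two darker-skin images.
labels = ["I-II"] * 8 + ["V-VI"] * 2
w = group_weights(labels)
sample_weights = [w[g] for g in labels]
print(w)  # {'I-II': 0.625, 'V-VI': 2.5}
```

These per-sample weights could then be passed to a training routine that accepts them (most gradient-based frameworks and scikit-learn estimators expose a `sample_weight`-style parameter). Reweighting adjusts for representation, but it does nothing about labels that are themselves wrong, which is why pathology-confirmed ground truth remains the more fundamental fix.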

These are technically demanding interventions that add cost and complexity to the development process. They are also essential if the goal is genuinely equitable performance rather than the appearance of equitable training data composition.


Discussion Questions

  1. The 2017 Nature paper by Esteva et al. was widely celebrated as a breakthrough in democratizing dermatology expertise. In retrospect, what questions should peer reviewers and editors have asked about the demographic composition of the training data and the generalizability of the results before publication? What obligations do high-profile scientific publications have when reporting AI performance claims?

  2. The dermatology AI performance gap was anticipated by researchers like Adamson and Smith before it was systematically measured. Why do you think the primary developers of skin lesion AI did not conduct the demographic performance analysis that would have revealed the gap before deploying these tools? What organizational, economic, or cultural factors might explain this?

  3. The FDA's proposed 2023 guidance on demographic performance reporting would require manufacturers to demonstrate equitable performance across skin tone groups. A small AI dermatology startup argues that it cannot afford to collect the demographically diverse validation dataset this would require for its device. How should the FDA balance innovation incentives against equity requirements? Is there a role for public investment in this situation?

  4. Achieving equitable performance across Fitzpatrick skin types in dermatology AI requires diverse training data. But collecting that data requires engaging communities that have historically been harmed by medical research — in many cases, the same communities that are underrepresented in current datasets. How should researchers approach community engagement for dataset collection in this context? What principles should govern data collection in communities that have historical reasons to distrust medical research?


This case study is part of Chapter 12: Bias in Healthcare AI. See also Case Study 12.1 on the Optum health risk stratification algorithm.