Case Study 8.1: The Pulse Oximeter Problem — When Medical Device Bias Becomes a Matter of Life and Death


Overview

When a device used in virtually every hospital in the world turns out to give systematically inaccurate readings for patients with dark skin, and when that inaccuracy contributes to patients being sent home to die of conditions the device failed to detect, the question is not merely technical. It is a question about who is considered a default patient, whose body is treated as the norm, and what institutional structures allowed a dangerous measurement bias to persist for decades in a trusted medical device.

The pulse oximeter story is a case study in measurement bias — bias introduced not by algorithm or dataset, but by the assumptions embedded in the calibration of a physical instrument that generates data. It illustrates how bias baked into data-generating tools propagates through every downstream system that consumes that data, including AI-powered clinical decision support. And it raises sharp questions about regulatory accountability: who is responsible for ensuring that medical devices work equally well for all patients?


1. What Pulse Oximeters Measure and How They Work

A pulse oximeter is a small clip-on device, typically attached to a fingertip or earlobe, that measures blood oxygen saturation (SpO2) — the percentage of hemoglobin in the bloodstream that is carrying oxygen. Oxygen saturation is a critical vital sign: normal values are typically 95% or above, and values below 90% represent clinically significant hypoxemia requiring intervention. Pulse oximeters are among the most widely used medical monitoring devices in the world, present in virtually every emergency room, intensive care unit, hospital ward, and increasingly in patients' homes.

The technology works through a principle called photoplethysmography. The device emits two wavelengths of light, typically red (around 660 nanometers) and infrared (around 940 nanometers), through the tissue of the fingertip. Oxygenated and deoxygenated hemoglobin absorb these wavelengths differently. By comparing how much of each wavelength is absorbed, and in particular how that absorption rises and falls with each pulse of arterial blood, the device estimates the share of hemoglobin that is carrying oxygen.
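
The comparison of the two wavelengths is commonly summarized as a "ratio of ratios." The sketch below is illustrative only: the trace values are invented, the function names are hypothetical, and the signal processing inside commercial devices is proprietary and not described in this case study.

```python
def ac_dc(samples):
    """Split a photodetector trace into a pulsatile (AC) swing and a steady (DC) level."""
    dc = sum(samples) / len(samples)      # steady absorption: skin, tissue, venous blood
    ac = max(samples) - min(samples)      # peak-to-peak change caused by the arterial pulse
    return ac, dc


def ratio_of_ratios(red, infrared):
    """R = (AC_red / DC_red) / (AC_ir / DC_ir), the quantity a calibration curve maps to SpO2."""
    ac_red, dc_red = ac_dc(red)
    ac_ir, dc_ir = ac_dc(infrared)
    return (ac_red / dc_red) / (ac_ir / dc_ir)


# Invented photodetector readings over one heartbeat (arbitrary units).
red_trace = [1.00, 1.01, 1.02, 1.01, 1.00]
ir_trace = [2.00, 2.04, 2.08, 2.04, 2.00]

r = ratio_of_ratios(red_trace, ir_trace)
print(round(r, 3))
```

In principle, dividing each pulsatile swing by its steady level cancels constant absorbers such as skin pigment. The evidence reviewed in Section 3 suggests that in practice the cancellation is incomplete.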

The critical phrase is "through the tissue of the fingertip." That tissue includes skin, and skin contains melanin — the pigment responsible for skin tone. Melanin absorbs light, and it does so in ways that vary with concentration and with the specific wavelengths used by the pulse oximeter. The optical physics of the measurement means that skin pigmentation is not simply irrelevant background; it is part of the signal that the device must correctly interpret.


2. The Historical Calibration Problem

Pulse oximeters do not directly measure oxygen saturation from first principles. They measure light absorption ratios and use an empirically calibrated algorithm to translate those ratios into oxygen saturation estimates. The calibration — the mathematical relationship between the measured light absorption ratio and the true oxygen saturation — was determined by comparing pulse oximeter readings to gold-standard arterial blood gas measurements taken from volunteers.

Those calibration studies were conducted primarily in the 1970s and 1980s, when the technology was being developed and commercialized. The volunteers used in those studies were predominantly light-skinned. This was not necessarily an intentional act of exclusion; it reflected the convenience sampling typical of clinical research in that era, as well as the demographics of the populations most readily recruited for medical studies at research institutions.

The calibration curves derived from these studies accurately represented the relationship between light absorption ratios and oxygen saturation for the subjects in the studies. For patients with similar skin tones, the devices work well. For patients with darker skin tones, however, the calibration curves derived from light-skinned subjects are not equally accurate. The higher concentration of melanin in darker skin absorbs more light, altering the absorption ratio in ways that the calibration algorithm was not designed to handle. The device reads the absorption ratio, applies the calibration curve, and produces an estimate of oxygen saturation — an estimate that, for darker-skinned patients, may be systematically higher than the true value.
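
The mechanism can be shown with a toy simulation. Everything numeric here is an assumption made for illustration: the linear relation 110 - 25 * R is a textbook-style approximation rather than any manufacturer's curve, and the fixed offset standing in for the optical effect of melanin is invented, not a measured value.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


true_sao2 = list(range(70, 101, 5))            # reference saturations, in percent

# ASSUMPTION: a textbook-style linear relation stands in for the calibration cohort.
r_calibration = [(110 - s) / 25 for s in true_sao2]

# ASSUMPTION: an invented fixed offset stands in for the effect of more melanin.
HYPOTHETICAL_OFFSET = 0.08
r_other = [r - HYPOTHETICAL_OFFSET for r in r_calibration]

# The curve is learned from the calibration cohort only.
intercept, slope = fit_line(r_calibration, true_sao2)


def estimate(r):
    """Apply the fitted calibration curve to a measured ratio."""
    return intercept + slope * r


errors_calibration = [estimate(r) - s for r, s in zip(r_calibration, true_sao2)]
errors_other = [estimate(r) - s for r, s in zip(r_other, true_sao2)]

mean_error_calibration = sum(errors_calibration) / len(errors_calibration)
mean_error_other = sum(errors_other) / len(errors_other)
print(round(mean_error_calibration, 2), round(mean_error_other, 2))
```

The fitted curve reproduces the reference values for the cohort it was built from and over-reads by a constant amount for the other group. Real discrepancies are unlikely to be a constant offset; the point is only that a curve can be accurate for the population it was fitted on and biased for a population it never saw.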

This is measurement bias in its most direct form: a measuring instrument that produces systematically different accuracy for different demographic groups, because the calibration data used to develop it was not representative of the full population it would serve.


3. Research Evidence: What the Studies Show

The measurement inaccuracy of pulse oximeters in darker-skinned patients was documented in research publications over multiple decades. Studies in the 1990s and 2000s noted potential inaccuracies associated with skin pigmentation, though the clinical significance was debated and the evidence was not consistently acted upon by manufacturers or regulators.

The publication that brought this issue to broad clinical and public attention was a study by Sjoding, Dickson, Iwashyna, Gay, and Valley published in the New England Journal of Medicine in December 2020. Using data from the University of Michigan Hospital and from intensive care units at 178 hospitals in a multicenter database, the researchers compared pulse oximeter readings to arterial blood gas measurements (the gold standard for oxygen saturation) for thousands of patients.

Their finding was striking: occult hypoxemia was roughly three times as common in Black patients as in white patients. The study defined occult hypoxemia as an arterial oxygen saturation below 88% (dangerously low) in a patient whose pulse oximeter showed an apparently acceptable reading of 92% to 96%. In the University of Michigan cohort, it was found in 11.7% of paired measurements from Black patients compared with 3.6% of those from white patients.
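
The definition and the headline comparison can be written out as a short sketch. The function names and the example pairs are hypothetical. The reading window of 92% to 96% reflects the definition used in the Sjoding et al. analysis as recalled here and should be checked against the original paper.

```python
def is_occult_hypoxemia(spo2, sao2):
    """True when the oximeter reads 92 to 96% while arterial saturation is below 88%."""
    return 92 <= spo2 <= 96 and sao2 < 88


def occult_rate(pairs):
    """Share of paired (SpO2, SaO2) measurements that meet the definition."""
    return sum(is_occult_hypoxemia(spo2, sao2) for spo2, sao2 in pairs) / len(pairs)


# Rates reported in the study, in percent.
rate_black, rate_white = 11.7, 3.6
relative_rate = rate_black / rate_white
print(round(relative_rate, 2))

# Invented paired readings, only to show how the definition is applied.
example_pairs = [(95, 86), (94, 93), (93, 87), (97, 85), (92, 91)]
print(occult_rate(example_pairs))
```

The ratio of the two reported rates is about 3.25, which is the basis for describing the disparity as roughly threefold.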

A patient with occult hypoxemia appears, by pulse oximeter reading, to have adequate oxygen saturation. They do not. Their actual oxygen level is dangerously low — a condition associated with organ damage, cardiac events, and death if untreated. But the device does not reveal this, because the device's calibration overestimates saturation for that patient's skin tone.

Subsequent research confirmed and extended these findings. A 2022 study in JAMA Internal Medicine, using data from multiple hospitals, found similar patterns and documented that the measurement inaccuracy was associated with higher rates of harmful clinical events — patients with occult hypoxemia were less likely to receive supplemental oxygen therapy because their readings appeared normal. The harm was not merely statistical; it manifested in differential clinical outcomes.


4. The COVID-19 Connection

The clinical consequences of pulse oximeter bias would have been significant in any period. The COVID-19 pandemic, which made pulse oximeter readings central to millions of clinical decisions, amplified those consequences dramatically.

COVID-19 causes respiratory failure that manifests as hypoxemia — falling oxygen saturation. Pulse oximetry became the primary tool for monitoring COVID-19 patients both in hospitals and at home. The phenomenon of "silent hypoxia" — dramatically low oxygen levels in COVID-19 patients who appeared relatively comfortable and showed few other signs of distress — made oxygen saturation monitoring particularly critical.

Public health guidance during the pandemic directed patients to monitor their oxygen saturation at home using consumer pulse oximeters, often available cheaply at pharmacies, and to seek emergency care if their saturation fell below a threshold, typically 94% or 95%. Patients who appeared to maintain adequate saturation on their home oximeters were advised to continue home management.

For Black patients and other patients with darker skin tones, this guidance was compromised by measurement bias. A patient with an actual oxygen saturation of 85% — dangerously low, requiring immediate medical attention — might see a reading of 95% on their pulse oximeter. Following the public health guidance, they would not seek emergency care. They might deteriorate further at home, experiencing organ damage or death from a condition that would have been identified as an emergency if they had lighter skin.
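
The scenario above reduces to a threshold rule applied to the wrong number. The sketch below uses a hypothetical function name and a threshold of 94%, one of the values mentioned in the guidance; it is not a clinical decision rule.

```python
def advises_emergency_care(displayed_spo2, threshold=94):
    """Hypothetical home-monitoring rule: seek care when the reading is below the threshold."""
    return displayed_spo2 < threshold


true_saturation = 85        # the scenario's actual arterial saturation, in percent
displayed_reading = 95      # what the oximeter shows in the scenario

rule_on_display = advises_emergency_care(displayed_reading)   # what the patient acts on
rule_on_truth = advises_emergency_care(true_saturation)       # what an accurate reading would trigger
missed_emergency = rule_on_truth and not rule_on_display
print(rule_on_display, rule_on_truth, missed_emergency)
```

The rule itself is applied correctly in both cases. The failure lies entirely in its input, which is why guidance built on a fixed threshold inherits whatever bias the device carries.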

Multiple analyses conducted during the pandemic documented disparate outcomes consistent with this mechanism. Black patients with COVID-19 experienced higher rates of severe disease and death than white patients with similar clinical presentations. While these disparities had multiple causes — differential access to care, pre-existing health conditions related to structural inequality, socioeconomic factors — the role of pulse oximeter inaccuracy in delaying necessary treatment was increasingly documented.


5. FDA Response and Its Delays

The U.S. Food and Drug Administration (FDA) regulates pulse oximeters as medical devices. The FDA had received reports of potential inaccuracies associated with skin pigmentation over the years, but the regulatory response was slow relative to the body of evidence.

In February 2021, approximately two months after the Sjoding et al. study appeared, the FDA issued a Safety Communication acknowledging that pulse oximeters may be less accurate in darker-skinned patients and advising clinicians to consider this limitation when interpreting pulse oximeter readings. This was an important acknowledgment but fell short of requiring manufacturers to update their devices, revise their labeling, or conduct new calibration studies.

In November 2022, the FDA convened an expert panel to discuss the evidence and potential regulatory responses. The panel heard testimony from researchers, clinicians, patient advocates, and manufacturers. The testimony documented the evidence of bias, the clinical consequences, and the technical feasibility of improved calibration approaches.

In early 2024, the FDA proposed new guidance that would require pulse oximeter manufacturers to test their devices on individuals with a wider range of skin tones and to include this performance information in device labeling. The proposed requirements represented a significant step forward but were prospective — they would apply to new submissions, not require existing devices already on the market to be recalled or relabeled.

The regulatory timeline — decades of published evidence of inaccuracy, followed by a pandemic that demonstrated fatal consequences at scale, followed by years of regulatory deliberation — raises fundamental questions about who bears the cost of slow regulatory response and why the burden of proof was set so high for a well-documented problem with a vulnerable population.


6. Why This Persisted for Decades

The persistence of pulse oximeter bias for decades, despite documented evidence of the problem, reflects several reinforcing dynamics that appear repeatedly in the broader landscape of technology and medical device bias.

Calibration data never challenged. Once established, calibration curves for medical devices are rarely revisited absent a specific safety concern. The original calibration data — drawn from predominantly light-skinned volunteers — became embedded in device design and was treated as the technical standard. Challenging it would have required manufacturers to conduct new studies, potentially acknowledge performance limitations, and face regulatory scrutiny. There was no competitive pressure to do so, because all devices were calibrated similarly.

The burden of proof problem. Documenting measurement inaccuracy in a clinical device is technically demanding. It requires paired measurements — pulse oximeter readings and gold-standard arterial blood gas measurements — from large enough samples of diverse patients to detect the disparity. Such data is collected in research contexts, not routinely in clinical practice. The absence of routine paired measurement data meant the problem remained underdocumented even as it was ongoing.
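
A rough sense of "large enough" comes from the standard normal-approximation formula for comparing two proportions. This is a simplified sketch: the function name is hypothetical, the smaller disparity is invented for comparison, and a real study design would also account for repeated measurements per patient.

```python
from math import ceil
from statistics import NormalDist


def patients_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sided comparison of two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_power) ** 2 * variance / (p1 - p2) ** 2)


large_gap = patients_per_group(0.117, 0.036)   # the rates reported in Section 3
small_gap = patients_per_group(0.06, 0.036)    # ASSUMPTION: invented smaller disparity
print(large_gap, small_gap)
```

Every unit counted here needs a pulse oximeter reading paired with an arterial blood draw, and only readings in the relevant range contribute. The smaller the true disparity, the more paired measurements are needed, which routine clinical practice does not produce.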

Adverse outcomes not attributed to measurement error. When a Black patient with occult hypoxemia was sent home and subsequently deteriorated, the adverse outcome was not typically attributed to pulse oximeter error. The patient was documented as having declined; the contribution of an inaccurate baseline reading was not part of the clinical record. Without attribution, the harm is invisible to standard outcome monitoring.

Regulatory categorization. Pulse oximeters reached the market through a regulatory framework (the 510(k) pathway) that allows devices to be cleared based on demonstrated substantial equivalence to previously cleared devices, rather than requiring new clinical testing. Devices continued to be cleared as substantially equivalent to earlier devices, including devices with the same calibration limitations.

The "race-neutral" assumption. Medical device development has historically proceeded on the assumption that devices developed and tested on one population can be applied to all patients without adjustment. This assumption — that a device that works for the calibration population is a device that works — is deeply embedded in clinical culture and regulatory frameworks. Challenging it required arguing against default assumptions that had institutional and legal weight.


7. The Broader Pattern: Medical AI and Device Bias

The pulse oximeter case is not isolated. It belongs to a broader pattern of medical device and clinical AI bias that disproportionately harms women, minorities, and elderly patients.

Electrocardiogram (ECG) interpretation: AI systems trained to interpret ECGs for cardiac risk have been found to perform differently across patient demographics, with documented disparities in performance for women. Women have different baseline ECG characteristics than men, and systems trained predominantly on male patients may not generalize.

Dermatology AI: Skin cancer detection AI systems have been trained predominantly on images of lighter-skinned patients, because dermatological image datasets reflect the demographics of patients seeking dermatological care at academic medical centers. Several studies have documented lower sensitivity for detecting skin cancer on darker skin tones.

Sepsis prediction: Clinical AI systems for predicting sepsis — a life-threatening systemic infection — have shown disparate performance across racial groups, with some studies finding lower sensitivity for Black patients. These systems are trained on electronic health record data that reflects the historical disparities in care documented for Black patients.

Cardiac risk calculators: Traditional cardiac risk calculators use race as a clinical variable in ways that have been questioned — sometimes including race adjustments that reduce estimated risk for Black patients in ways that may be based on historical data reflecting unequal access to care rather than genuine biological difference.

These patterns collectively suggest that the pulse oximeter problem is not an anomaly. It reflects a systematic tendency in medical device development and clinical AI to treat white, male, middle-aged patients as the default population, and to conduct calibration, validation, and evaluation in populations that overrepresent this default.


8. Regulatory Accountability Questions

The pulse oximeter case raises fundamental questions about regulatory accountability for medical device bias.

Who cleared these devices? Pulse oximeters were cleared through the FDA's 510(k) pathway, which requires demonstration of substantial equivalence to a predicate device. The calibration data underlying the predicate devices was never required to meet demographic diversity standards because no such standards existed.

Who monitored outcomes? Post-market surveillance for medical devices in the United States is limited. The Medical Device Reporting system requires manufacturers, importers, and device user facilities such as hospitals to report certain adverse events, and it accepts voluntary reports from clinicians and patients, but every report depends on someone first recognizing that a device contributed to the harm. Adverse events attributable to measurement inaccuracy, particularly when the inaccuracy is not recognized, are unlikely to be reported. The system is not designed to detect disparities in device performance across demographic groups.

Who bears liability? When a patient is harmed because a medical device gave an inaccurate reading due to calibration bias, the chain of liability is diffuse: the device manufacturer calibrated against existing standards; the regulatory agency approved the device under existing requirements; the clinician used the approved device according to labeling; the hospital purchased the device from an approved manufacturer. No single actor is clearly responsible for the harm, which creates collective responsibility that functions as collective unaccountability.

What would accountability look like? Meaningful accountability for medical device bias would require: demographic diversity requirements in calibration studies, as a condition of device approval; mandatory post-market surveillance with disaggregated performance monitoring; clear liability for manufacturers when device performance disparities are documented; and expedited regulatory pathways to require manufacturers to address documented performance disparities.


9. What Better Calibration Practices Would Look Like

The technical solution to pulse oximeter bias is well understood. The FDA's 2024 proposed guidance points in the right direction: calibration studies must include subjects across the full range of skin tones likely to be encountered in clinical use, and device labeling must report performance across skin tone categories.

More specifically, better calibration practices would include:

  • Diverse calibration cohorts: Recruitment of calibration study participants explicitly stratified by skin tone, using standardized measurement scales such as the Fitzpatrick scale or individual typology angle (ITA) measurements, ensuring adequate statistical power in each stratum.
  • Validation against gold standard: Comparison of device readings to arterial blood gas measurements (not to other pulse oximeters) across the full range of physiologically relevant oxygen saturation levels, not just the range where the device is most commonly used.
  • Performance reporting by stratum: Publication and labeling of device performance (accuracy, precision, limits of agreement) separately for each skin tone stratum, not as aggregate performance across all subjects.
  • Ongoing post-market monitoring: Systematic collection of paired pulse oximeter and blood gas measurements in clinical settings, analyzed for demographic performance disparities, with reporting requirements to regulatory authorities.
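
The stratified reporting described in the list above can be sketched as follows. The records are invented, the stratum labels are placeholders rather than a validated skin tone scale, and the limits of agreement use the simple Bland-Altman form, which ignores repeated measurements from the same patient.

```python
from math import sqrt
from statistics import mean, stdev


def agreement_stats(pairs):
    """Bias, precision, 95% limits of agreement, and RMS error for (SpO2, SaO2) pairs."""
    errors = [spo2 - sao2 for spo2, sao2 in pairs]
    bias = mean(errors)                          # positive means the device over-reads
    precision = stdev(errors)                    # sample standard deviation of the errors
    rms = sqrt(mean([e * e for e in errors]))    # root-mean-square error
    limits = (bias - 1.96 * precision, bias + 1.96 * precision)
    return {"n": len(pairs), "bias": bias, "precision": precision,
            "limits_of_agreement": limits, "rms_error": rms}


def report_by_stratum(records):
    """Group (stratum, SpO2, SaO2) records and compute statistics for each stratum."""
    grouped = {}
    for stratum, spo2, sao2 in records:
        grouped.setdefault(stratum, []).append((spo2, sao2))
    return {stratum: agreement_stats(pairs) for stratum, pairs in grouped.items()}


# Invented records: (skin tone stratum, device reading, arterial blood gas value).
records = [
    ("lighter", 96, 96), ("lighter", 94, 93), ("lighter", 91, 92), ("lighter", 89, 89),
    ("darker", 96, 94), ("darker", 94, 91), ("darker", 92, 90), ("darker", 90, 87),
]

by_stratum = report_by_stratum(records)
pooled = agreement_stats([(spo2, sao2) for _, spo2, sao2 in records])
for name, stats in by_stratum.items():
    print(name, round(stats["bias"], 2), round(stats["rms_error"], 2))
print("pooled", round(pooled["bias"], 2))
```

In this invented data the pooled bias falls between the two strata, understating the error for one group and overstating it for the other. That is the reason for reporting each stratum separately rather than as an aggregate.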

These practices are not technically complex. They are organizationally and economically demanding — they require larger, more diverse studies, and they create the possibility of having to acknowledge and address performance disparities. The barrier to implementation has been primarily institutional rather than technical.


10. Discussion Questions

  1. The FDA's regulatory pathway for medical devices (the 510(k) substantial equivalence pathway) did not require demographic diversity in calibration studies for pulse oximeters. Who bears moral responsibility for the harms that resulted from this omission: the device manufacturers, the FDA, the clinical researchers who did not raise the issue earlier, or some combination? How should responsibility be apportioned?

  2. The pulse oximeter case illustrates measurement bias in a physical device, not in a software algorithm. Does the distinction matter? How does the mechanism of harm differ from software-based measurement bias? How does it resemble it?

  3. If a hospital purchases and deploys pulse oximeters that have documented inaccuracies for darker-skinned patients, does the hospital have an obligation to act on that knowledge even if the devices remain FDA-cleared? What actions would fulfill that obligation?

  4. The pulse oximeter case took decades to become publicly recognized despite documented evidence in the research literature. What organizational and institutional dynamics allowed this to persist? What would have been required to accelerate recognition and response? Apply your analysis to a current AI bias problem that may follow a similar trajectory.


See also: Section 8.3 (Representation Bias), Section 8.4 (Measurement Bias), Further Reading: Sjoding et al. (2020)