Case Study 12.1: The Optum Health Algorithm — How Cost Became a Proxy for Race
Chapter 12 | Bias in Healthcare AI
Source: Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447–453.
Introduction
In the fall of 2019, a research team led by economist and physician Ziad Obermeyer published a study in Science that reframed public understanding of how AI bias operates in high-stakes environments. The team had not built a biased algorithm — they had reverse-engineered one that was already running at scale, already making decisions affecting hundreds of millions of people, and already producing outcomes that were, the researchers showed, racially inequitable in a specific, measurable, and remediable way.
The algorithm they studied was a commercial health risk stratification product. The company that produced it — Optum, a subsidiary of UnitedHealth Group and one of the largest health services companies in the United States — was not unusual in selling such a product. Health risk stratification is a major market, serving a genuine clinical need. The specific mechanism by which the algorithm produced racially inequitable results was not exotic or surprising, in retrospect. It was a design choice that is common across industries and often passes without scrutiny: the use of a proxy variable.
This case study examines the Optum algorithm in depth — its commercial context, the methodology by which the research team uncovered its bias, the specific mechanism of that bias, the responses it generated, and the broader lessons it offers for organizations that develop, purchase, or are affected by clinical AI systems.
1. The Commercial Context: Health Risk Stratification as a Major Market
To understand the Optum algorithm, it is necessary first to understand the market in which it operated. Health risk stratification is the practice of using data — clinical records, insurance claims, demographic information, pharmacy data — to predict which patients are most likely to have high healthcare needs in the near future. These predictions drive the allocation of care management resources.
Care management programs represent a substantial clinical investment: a team of registered nurses, social workers, pharmacists, and care coordinators who proactively reach out to high-risk patients, help them manage chronic conditions, coordinate specialist referrals, ensure medication adherence, address social needs like housing instability and food insecurity, and generally provide the kind of intensive support that keeps complex patients from deteriorating to the point of hospitalization or emergency department use. These programs are expensive — typically costing thousands of dollars per patient per year — and the patient population that could benefit from them is large. Risk stratification allows health systems and insurers to concentrate this expensive resource on the patients whose expected benefit is highest.
The market for health risk stratification products is substantial. Major players include Optum (with its suite of predictive analytics products marketed under the Optum Insight brand), Epic Systems (whose integrated EHR includes predictive models for readmission, sepsis, and deterioration), Cerner (now Oracle Health), IBM Watson Health (since wound down and divested), Evolent Health, and numerous smaller vendors. These products are purchased by hospital systems, health insurers, employer-sponsored health plans, and Medicaid managed care organizations. They process data on tens to hundreds of millions of Americans annually.
Optum's product — marketed as Impact Pro — was, by Obermeyer et al.'s estimate at the time of publication, used to manage the care of approximately 200 million Americans. This is not a niche product or an experimental tool. It is a central piece of clinical infrastructure that shapes who receives proactive care management and who does not.
The business model creates specific incentive structures. Health systems and insurers purchase these products to reduce downstream costs — by intervening with high-risk patients before they become hospitalized, they reduce expensive acute care utilization. The economic logic is clear. The equity implications are less likely to appear on the vendor's product roadmap unless someone specifically looks for them.
2. The Study Methodology: How Obermeyer et al. Identified the Bias
The research team's methodology was elegant in its simplicity. They obtained data from a large academic medical center that had licensed the Optum algorithm, giving them access to two things simultaneously: the algorithm's output (risk scores assigned to patients) and detailed clinical data about those same patients that was independent of the algorithm's inputs.
The critical insight was to ask a calibration question rather than a prediction question. The standard way to evaluate a predictive model is to ask: does a patient assigned a high risk score actually have high future costs (or needs)? This is the internal consistency question — does the model predict what it says it predicts? The model, by this measure, was performing well.
Obermeyer et al. asked a different question: at a given level of predicted risk, are Black and white patients equally sick? This is the equity question — does the model treat comparable patients comparably, regardless of race? To answer this question, they needed an independent measure of how sick patients were — one that did not depend on healthcare cost or the algorithm's prediction.
They used a combination of clinical indicators: the number of active chronic conditions as documented in the medical record, laboratory values that reflect disease burden and physiological compromise, and other clinical markers. These indicators provided an estimate of patients' actual health need — distinct from their predicted future cost.
The comparison was then straightforward: take patients assigned a given risk score (say, the top decile of risk), and compare their actual disease burden across racial groups. If the algorithm were well-calibrated for equity, Black and white patients in the same risk decile should have similar disease burdens. They did not.
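The audit logic described above can be sketched in a few lines. The following is a minimal illustration on synthetic data, not the study's actual analysis: the cohort, the group labels "A" and "B", the 0.7 "access penalty", and the score formula are all invented assumptions, constructed so that both groups have identical true disease burdens while one group's cost-like score understates its illness.

```python
import random
from statistics import mean

random.seed(0)

# Synthetic cohort: both groups share the same true-illness distribution,
# but group "B" carries a hypothetical access penalty that deflates the
# cost-like score the model sees (all names and numbers are illustrative).
def make_patient(group):
    conditions = random.randint(0, 8)                # true disease burden
    access = 1.0 if group == "A" else 0.7            # access penalty (assumption)
    score = conditions * access + random.random()    # noisy cost-based risk score
    return {"group": group, "conditions": conditions, "score": score}

patients = [make_patient(g) for g in ("A", "B") for _ in range(5000)]

# The Obermeyer-style audit: rank everyone by predicted score, cut the
# ranking into deciles, and compare actual disease burden across groups
# within each decile. Equal calibration means equal burden per decile.
patients.sort(key=lambda p: p["score"])
decile_size = len(patients) // 10
audit = []
for d in range(10):
    chunk = patients[d * decile_size:(d + 1) * decile_size]
    means = {}
    for g in ("A", "B"):
        vals = [p["conditions"] for p in chunk if p["group"] == g]
        means[g] = mean(vals) if vals else None   # extreme deciles may lack a group
    audit.append(means)

fmt = lambda x: "n/a" if x is None else f"{x:.2f}"
for d, means in enumerate(audit):
    print(f"decile {d}: A={fmt(means['A'])}  B={fmt(means['B'])}")
```

Because the score deflates group B's illness, B patients in any given decile are sicker than A patients in the same decile, which is exactly the calibration gap the researchers measured.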
3. The Specific Findings in Quantitative Terms
The magnitude of the bias the research team documented was substantial. At the same predicted risk score — meaning patients the algorithm evaluated as equally likely to incur high future healthcare costs — Black patients were meaningfully sicker than white patients by every clinical measure the researchers examined.
Specifically:
- Black patients assigned a given risk score had, on average, 26.3 percent more active chronic conditions than white patients assigned the same score.
- Black patients were more likely to have poorly controlled diabetes, hypertension, anemia, renal insufficiency, and congestive heart failure at any given risk score level.
- The bias was not concentrated at the extremes of the score distribution — it was present throughout.
The consequence for care management enrollment was direct. Care management programs typically enroll patients above a threshold risk score — the top 3 to 5 percent of predicted risk. At this threshold, a Black patient needed to be substantially sicker than a white patient to qualify. The researchers estimated that approximately 18 percent of Black patients who should have been enrolled in care management programs — based on their actual health need — were not enrolled, due to the algorithm systematically underestimating their risk score relative to comparably sick white patients.
A corrected version of the algorithm — one that targeted health need directly rather than predicted cost — would have increased the share of Black patients among those automatically identified for care management from approximately 17.7 percent to 46.5 percent. The gap is not marginal. It is the difference between a program that meaningfully serves its sickest Black patients and one that systematically bypasses them.
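The enrollment consequence of a deflated score can be illustrated with threshold arithmetic on synthetic data. This sketch is not the study's data: the cohort, the 0.6 "access factor", and all resulting percentages are invented assumptions, chosen only to show how a cost-trained ranking and a need-based ranking select different populations at a top-3-percent cutoff.

```python
import random

random.seed(1)

# Hypothetical cohort: true health need is distributed identically in both
# groups, but observed cost for group "B" is deflated by reduced access to
# care (the 0.6 factor is an illustrative assumption, not an estimate).
cohort = []
for group in ("A", "B"):
    for _ in range(10_000):
        need = random.gauss(0, 1)                      # true health need
        access = 1.0 if group == "A" else 0.6          # access factor (assumption)
        cost = need * access + random.gauss(0, 0.3)    # observed cost signal
        cohort.append({"group": group, "need": need, "cost": cost})

def top_share(key, frac=0.03):
    """Share of group B among the top `frac` of the cohort ranked by `key`."""
    k = int(len(cohort) * frac)
    top = sorted(cohort, key=lambda p: p[key], reverse=True)[:k]
    return sum(p["group"] == "B" for p in top) / k

share_by_cost = top_share("cost")   # who a cost-trained score would enroll
share_by_need = top_share("need")   # who a need-based score would enroll
print(f"B share enrolled by cost score: {share_by_cost:.1%}")
print(f"B share enrolled by need:       {share_by_need:.1%}")
```

Ranking by true need enrolls the two groups at roughly equal rates, while ranking by the deflated cost signal enrolls group B at a fraction of that rate — the same directional effect, in miniature, that the study documented at the care management threshold.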
4. The Proxy Mechanism: Healthcare Cost as a Signal of Need
The mechanism by which this bias was produced is worth examining in careful detail, because it illustrates a pattern that appears across many AI applications and that is easy to miss precisely because the logic of the proxy choice seems sound.
The algorithm was designed to predict future healthcare cost. This is a reasonable operational target. Sicker patients do incur higher healthcare costs, and the correlation between cost and need is real and documented. Cost data is also abundant, clean, and consistently recorded in insurance claims — making it a technically attractive training target compared to messier clinical indicators.
The problem lies not in the correlation but in what else cost is correlated with. In the United States, healthcare cost reflects not only how sick a patient is but how much care that patient has actually received. And how much care a patient receives is substantially determined by their access to care — which is in turn shaped by insurance status, income, geography, and, in the United States, by race.
The causal chain is documented at every link:
Race → Insurance Status: Black Americans are uninsured at higher rates than white Americans, limiting their access to non-emergency care.
Race → Income: The racial wealth gap in the United States is large and persistent. Lower income means higher cost-sharing burdens, reducing ability to seek care.
Race → Geographic Access: Residential segregation has concentrated Black Americans in areas that have historically had fewer primary care physicians, specialist offices, and healthcare facilities.
Race → Provider Discrimination: Multiple studies document that Black patients receive less pain medication, fewer diagnostic tests, and less aggressive treatment for equivalent conditions compared to white patients, reflecting provider bias.
Race → Patient Avoidance: Historical mistreatment in medical settings — including the documented legacy of experiments like the Tuskegee syphilis study — contributes to lower rates of care-seeking among some Black Americans, a rational response to historical violation of trust.
Each of these mechanisms means that, at a given level of true illness severity, a Black patient is likely to have incurred lower healthcare costs than a comparably ill white patient — not because they are healthier, but because they have received less care. The algorithm learned this pattern from its training data: Black patients at a given level of disease severity had historically lower costs, so the algorithm assigned them lower risk scores. The bias was built into the training signal.
This is the essential structure of proxy bias: a facially neutral variable (healthcare cost) serves as a proxy for a protected characteristic (race) because that variable is itself causally downstream of the structural discrimination that has affected that protected characteristic. The algorithm did not know about race. It did not need to. Cost did the work.
5. Optum's Response
Within weeks of the Science publication, Optum issued a public statement. The company stated that it "concur[red] with the findings" of the research and "support[ed] the Science article's conclusions." Optum committed to working to improve its algorithm to address the identified bias.
The company also pushed back on certain characterizations of the product's breadth and the article's methodology, arguing that the authors had studied a version of the algorithm and deployment context that may not be representative of all implementations of similar tools. Optum pointed out that health systems configure and deploy risk stratification tools in a variety of ways, and that the specific threshold-setting and enrollment decisions are typically made by the health system rather than dictated by the algorithm.
This response illustrates a common dynamic in algorithmic accountability: the algorithm vendor and the deploying institution can each point to the other as the locus of accountability. The vendor argues that the deploying institution makes the actual clinical decisions. The institution argues that it relies on the vendor's product and cannot independently audit its performance. The patient is caught in the gap.
The response also raised a question that the research team's work did not fully resolve: how widely had Optum tested its own product for demographic performance before deployment? The answer — at least in the public record — was not clearly provided. No internal testing data demonstrating adequate demographic performance was released. No timeline was given for when corrective modifications would be complete or how affected patients might be identified and retroactively served.
6. The Editorial Response in the Medical Community
The Science paper generated immediate and extensive commentary across medicine, public health, and technology policy. Several themes recurred.
The "we should have known" response: Many health equity researchers and clinicians noted that the mechanism was entirely predictable. The use of cost as a proxy for need in a racially stratified healthcare system should have triggered alarm at the design stage. The question of why it did not — who reviewed the algorithm's design choices, what equity analysis was performed, what internal testing was conducted — pointed to systemic failures in the development process, not merely a technical oversight.
The scale problem: Commentators repeatedly noted that the aspect of the case most difficult to process was not the bias per se — bias in clinical decision-making is extensively documented — but the scale at which a single algorithmic product could propagate that bias. An individual clinician's bias affects the patients they personally see; a nationwide algorithm propagates bias uniformly across every patient whose record it touches.
The invisible harm problem: Because the algorithm's effects were visible only in aggregate — no single patient received a letter saying "you were not referred because our algorithm underestimated your risk due to a racially biased proxy" — the harms were diffuse and largely invisible to the people experiencing them. This is a recurring feature of algorithmic harm: the affected individuals often cannot know they have been affected.
The accountability vacuum: Several editorials noted the absence of any regulatory mechanism that would have required Optum to test the demographic performance of its algorithm before deployment. Unlike a new drug, which requires FDA review of clinical evidence before marketing, a health risk stratification algorithm deployed under the clinical decision support exemption could reach 200 million patients without independent review of its equity performance.
7. How Widespread Are Similar Algorithms?
The Optum case was studied because researchers had rare access to the algorithm's output alongside independent clinical data. This combination — necessary to conduct the kind of equity analysis Obermeyer et al. performed — is rarely available. Most health risk stratification algorithms are proprietary products that do not share their score distributions with researchers. The vast majority of commercial clinical algorithms have never been independently evaluated for demographic performance.
There are strong theoretical reasons to expect that algorithms similar to the Optum product — using healthcare cost as a primary predictive target, trained on U.S. insurance claims data — would exhibit similar bias, because they share the same fundamental design choice. The cost proxy problem is not unique to one vendor; it reflects a widely shared approach to health risk stratification.
Epic's EHR includes predictive models for readmission risk, sepsis risk, and patient deterioration. These models are used by hundreds of health systems and affect millions of patients. Subsequent to the Optum publication, researchers began examining whether similar demographic performance gaps existed in Epic's predictive models. Studies found that Epic's sepsis prediction model performed differently across demographic groups, with documented differences in sensitivity between Black and white patients.
The pattern suggests that the Optum algorithm may be a representative example of a class of clinical AI products with similar design choices and similar equity implications — not an outlier that can be addressed by fixing one product while leaving the broader market unchanged.
8. What Happened After: Policy Responses and Algorithm Changes
The Obermeyer et al. paper contributed to a wave of policy and research attention to algorithmic equity in healthcare that continues to develop. Several specific policy developments followed:
FDA attention: The FDA cited concerns about algorithmic bias in healthcare AI in its 2021 AI/ML Action Plan and in subsequent guidance on demographic performance reporting, acknowledging that the clinical AI market had grown rapidly with insufficient attention to equity.
Congressional attention: Members of Congress wrote to major health risk stratification vendors seeking information about their demographic performance testing practices. Several held hearings on algorithmic bias in healthcare, at which versions of the Optum case were prominently discussed.
Academic follow-on research: The paper catalyzed a wave of research examining demographic performance gaps in other clinical algorithms — risk scores for hospital readmission, sepsis prediction, cardiac risk, and dermatology AI, among others. The field of algorithmic fairness in healthcare, already emerging, became substantially more prominent.
Optum's algorithm revision: Optum committed to modifying its algorithm to reduce racial bias. The specific technical changes made and the post-modification performance characteristics have not been publicly reported in peer-reviewed literature as of this writing, leaving the question of whether and to what degree the bias has been addressed partially open.
9. The Broader Pattern: Proxy Selection as an Ethical Choice
The Optum case is not fundamentally about a bad actor making a racist decision. The team that designed the algorithm almost certainly did not intend to disadvantage Black patients. They made a design choice — use healthcare cost as a proxy for healthcare need — that was technically sensible, practically convenient, and aligned with common practice in health analytics. They may not have asked whether that choice would produce racially disparate outcomes. Or they may not have had the tools or data to test it. Or they may have tested it and not known what to look for.
What the case illustrates is that proxy selection is never a purely technical decision. The choice of what an algorithm is trained to predict, and the choice of what variables are used to make that prediction, embeds values and assumptions about what matters and what the relationship between variables reflects. When those choices are made without explicit attention to their equity implications — when engineers treat proxy selection as a technical optimization problem rather than an ethical one — they are making a values choice by default.
The Optum case asks us to see proxy selection — an apparently mundane engineering decision — as a domain that requires ethical review. Questions that should be asked at the design stage include:
- What does this proxy variable actually measure, and what else is it correlated with?
- How does the relationship between this proxy and the outcome we care about differ across demographic groups?
- If we train our model to optimize for this proxy, what are the downstream consequences for patients in different demographic groups?
- Is there an alternative proxy or outcome target that would better reflect what we actually care about, with fewer equity implications?
These are not purely technical questions. They require domain expertise, equity expertise, and ethical deliberation. They require asking not just "does this algorithm predict accurately?" but "accurately for whom, and what does it mean for those for whom it is less accurate?"
10. What Health Equity-Centered Algorithm Design Would Look Like
The Obermeyer et al. study did not merely document a problem — it demonstrated an alternative. The researchers showed that replacing healthcare cost with direct clinical measures of health burden — the number of active chronic conditions, laboratory indicators of physiological compromise — produced risk scores that were substantially more equitable across racial groups. A model trained to predict health need, rather than healthcare cost, did not exhibit the same racial disparity in calibration.
This points toward a set of design principles for health equity-centered algorithm development:
Start with the outcome you actually care about. In health risk stratification, the ultimate goal is to identify patients who would benefit most from care management — not patients who will incur the highest future costs. These are correlated but not identical. Design the algorithm to predict the actual goal.
Examine every proxy for its demographic implications before including it. Any variable that reflects historical access to care, insurance status, or geographic resources is potentially a racial proxy in the U.S. context. This does not mean these variables cannot be used, but their use must be consciously chosen and explicitly tested for its equity effects.
Require calibration testing across demographic groups before deployment. The specific test that Obermeyer et al. applied — examining whether patients at a given predicted risk level have similar actual need across racial groups — should be a standard pre-deployment requirement for any health risk stratification algorithm.
Include affected communities in the design process. Community members, patient advocates, and clinicians who serve underrepresented populations bring knowledge of the lived experience of algorithmic health systems that technical teams often lack.
Build ongoing monitoring into the deployment contract. Equity is not something that can be established once and assumed to persist. As patient populations shift, as clinical practice changes, as the algorithm's training data evolves, its equity properties may change. Monitoring is essential.
Require transparency about training data demographics. Health systems purchasing risk stratification algorithms should know the demographic composition of the data the algorithm was trained on, the demographic composition of the population in which it was validated, and the performance metrics disaggregated by demographic group.
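The pre-deployment calibration test recommended above can be packaged as a small reusable check. The function below is a sketch, not a standardized audit tool: the record schema (`"score"`, `"group"`, `"need"`), the bin count, and the divergence tolerance are all hypothetical choices that a real deployment would need to set with clinical and equity input.

```python
from statistics import mean

def calibration_by_group(records, n_bins=10):
    """For each predicted-risk bin, return mean independent need per group.

    `records` is a list of dicts with keys "score" (model output), "group"
    (demographic label), and "need" (an independent clinical measure of
    health burden) — a hypothetical schema. A model that is calibrated
    equitably shows similar mean need across groups within every bin.
    """
    ranked = sorted(records, key=lambda r: r["score"])
    size = max(1, len(ranked) // n_bins)
    report = []
    for b in range(n_bins):
        chunk = ranked[b * size:(b + 1) * size]
        groups = {}
        for r in chunk:
            groups.setdefault(r["group"], []).append(r["need"])
        report.append({g: mean(v) for g, v in groups.items()})
    return report

# Usage: flag any bin where group means diverge beyond a chosen tolerance.
# This toy cohort is deliberately calibrated (identical score/need pairs in
# both groups), so no bin should be flagged.
example = [{"score": s / 10, "group": "A", "need": s // 10} for s in range(100)] + \
          [{"score": s / 10, "group": "B", "need": s // 10} for s in range(100)]
report = calibration_by_group(example, n_bins=5)
flags = [b for b, means in enumerate(report)
         if len(means) == 2 and abs(means["A"] - means["B"]) > 0.5]
print("bins flagged:", flags)
```

A purchaser could run a check of this shape on its own patient population before go-live, and again on a schedule afterward, which is what the monitoring and transparency principles above call for.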
Discussion Questions
1. The Optum algorithm used healthcare cost as a proxy for healthcare need without conducting, or at least without publicly demonstrating, testing for racial performance disparities. What internal processes should algorithm developers be required to complete before deploying a health risk stratification tool that will affect millions of patients? Who should be responsible for enforcing these requirements — regulators, professional societies, or health system purchasers?
2. Optum stated that health systems make their own enrollment decisions and that the algorithm's outputs are one input to those decisions. The health systems, in turn, often rely on the algorithm's scores as the primary enrollment criterion. In this diffusion of responsibility, how should accountability for the disparate enrollment outcomes be allocated? Does it matter for your analysis whether Optum provided transparency about its training methodology?
3. The researchers found that replacing healthcare cost with clinical indicators of health burden produced a more equitable algorithm. Why might commercial vendors have incentives to use cost as a training target rather than clinical burden? How might regulatory or market incentives be structured to make equity-centered design more attractive?
4. A hospital administrator reads the Obermeyer study and realizes that the risk stratification algorithm their hospital has been using for three years may have similar bias. They cannot know for certain without conducting the same type of analysis. What steps should they take? What are their obligations to the patients who may have been underserved during those three years?
5. Critics of algorithmic fairness requirements sometimes argue that requiring equal performance across demographic groups is technically infeasible and that the resources spent on demographic testing should instead be spent on direct services to underserved populations. Evaluate this argument. What assumptions does it rest on, and are they valid?
This case study is part of Chapter 12: Bias in Healthcare AI. See also Case Study 12.2 on dermatology AI and skin tone performance gaps.