Quiz: Chapter 12 — Bias in Healthcare AI

Total Questions: 20
Format: 8 Multiple Choice | 5 True/False | 4 Short Answer | 3 Applied Scenario
Recommended time: 50–60 minutes


Part I: Multiple Choice (2 points each)

1. The Obermeyer et al. (2019) study found that the Optum health risk algorithm systematically underestimated the health needs of Black patients. The primary mechanism was:

A. The algorithm explicitly used race as a variable and weighted it negatively
B. Healthcare cost, used as a proxy for health need, was lower for Black patients because they had historically received less care — not because they were healthier
C. The algorithm was trained only on white patients and therefore could not accurately predict risk for Black patients
D. The algorithm used zip code data, which is correlated with race, as its primary predictive feature


2. Which of the following best describes "calibration" in a healthcare risk prediction model?

A. The process of updating the model with new training data to improve accuracy over time
B. The degree to which the model's predicted probabilities accurately reflect actual outcome frequencies, including across demographic subgroups
C. The proportion of the training dataset that comes from demographically underrepresented populations
D. The adjustment made to the model's outputs based on feedback from clinicians during deployment


3. The eGFR race correction factor, applied in kidney function calculations for Black patients, was ultimately found to be harmful because:

A. The adjustment was based on a racial stereotype with no scientific basis whatsoever
B. Black patients do not have higher average muscle mass, making the biological rationale entirely false
C. The upward adjustment inflated eGFR values, making kidney disease appear less severe and delaying transplant referrals for Black patients
D. It caused Black patients to be prescribed incorrect medication doses for unrelated conditions


4. The Adamson and Smith (2018) study in JAMA Dermatology found that in major dermatology AI training datasets:

A. Patients with darker skin tones were excluded from datasets for privacy reasons
B. Fewer than 5 percent of images depicted darker skin tones (Fitzpatrick types V and VI)
C. Dermatologists who labeled the training images were more likely to misclassify lesions on darker skin
D. The FDA had not approved any dermatology AI trained on datasets with demographic imbalances


5. Which of the following best describes the "substitution risk" in mental health AI?

A. The risk that patients will substitute one AI therapy chatbot for another rather than engaging with evidence-based tools
B. The risk that AI diagnostic tools will substitute incorrect diagnoses for correct ones due to demographic bias
C. The risk that AI will be deployed as a replacement for human mental health care in under-resourced settings, rather than as a supplement
D. The risk that AI recommendations will be substituted for clinical guidelines by uninformed clinicians


6. Under current FDA regulation, which of the following clinical AI tools is MOST LIKELY to require FDA clearance before deployment?

A. A health risk stratification algorithm embedded in a hospital EHR that presents risk scores to physicians for their clinical judgment
B. A consumer mental health chatbot marketed as a wellness application
C. A deep learning system that autonomously analyzes radiology images and produces a diagnostic report without required clinician review
D. A clinical protocol reminder system that alerts nurses to standard care checklist items


7. The concept of "intersectionality," as applied to healthcare AI bias, means:

A. Multiple AI systems from different vendors may produce compounded bias when used together in clinical workflows
B. The harms experienced by patients who face multiple overlapping forms of disadvantage (e.g., Black women) cannot be understood as the simple sum of harms from each disadvantage independently
C. Healthcare AI bias intersects with financial bias and criminal justice bias to produce a comprehensive system of discrimination
D. Bias in clinical algorithms always intersects with unconscious bias in the clinicians who use them, making both problems worse


8. The FDA's clinical decision support (CDS) software exemption matters for healthcare AI equity primarily because:

A. It prevents the FDA from requiring demographic performance testing for many widely deployed clinical AI tools
B. It requires CDS software to be tested across all demographic groups before deployment
C. It limits the types of data that CDS software can access, preventing the use of socioeconomic variables
D. It mandates that CDS software carry warnings about potential demographic performance gaps


Part II: True/False (1 point each)

9. An AI health risk stratification algorithm that does not include race as an input variable cannot produce racially disparate outcomes.

True / False


10. The Optum algorithm's primary error was that it predicted which patients had the highest healthcare costs rather than which patients had the highest healthcare needs, and healthcare cost is not equally correlated with health need across racial groups.

True / False


11. The National Kidney Foundation and American Society of Nephrology recommended eliminating race from eGFR calculations in 2021, replacing the race-adjusted equation with a race-free equation.

True / False


12. AI skin lesion classifiers that perform well overall necessarily perform comparably across all skin tone groups, because overall accuracy statistics average across all subgroups.

True / False


13. The historical data problem in healthcare AI refers to the fact that AI trained on records of discriminatory care may learn and replicate those discriminatory patterns, even if the training dataset is demographically representative.

True / False


Part III: Short Answer (8 points each)

14. Explain the "Yentl syndrome" and describe two specific ways that AI systems trained on historical clinical data may inherit and perpetuate this pattern of gender bias. Be specific about the type of AI system and the mechanism of the bias.

(Recommended length: 150–200 words)


15. A hospital administrator is evaluating two AI-powered sepsis prediction systems:

System A: AUC = 0.85 overall; demographic performance breakdown not provided. Training data drawn from 15 academic medical centers. Vendor does not disclose training data demographics.

System B: AUC = 0.83 overall; sensitivity: white patients 0.84, Black patients 0.79, Hispanic patients 0.81. Training data includes demographic breakdown; validation conducted in populations similar to the deploying hospital.

Which system would you recommend for deployment, and why? What additional information would you want before making a final recommendation?

(Recommended length: 150–200 words)


16. Describe the concept of "proxy bias" and explain how it differs from "explicit bias" (the deliberate use of a protected characteristic as an input variable). Why is proxy bias particularly difficult to detect, and what are two specific methods for identifying it in a healthcare AI system before deployment?

(Recommended length: 150–200 words)


17. What is a "model card" in the context of healthcare AI, and how should it be used in the procurement process? List at least four specific pieces of information a model card should contain for a clinical AI system, and explain why each matters for a health system considering deployment.

(Recommended length: 150–200 words)


Part IV: Applied Scenario (15 points each)

18. Read the following scenario and answer the questions below.

A large Medicaid managed care organization (MCO) serving 2.4 million low-income patients in a predominantly urban state has licensed an AI tool from a major health analytics vendor to identify members at high risk of avoidable hospitalization. The tool will be used to route high-risk members into a care management program offering home visits, nurse coaching, and social services coordination. The vendor's product brochure reports 82% accuracy in predicting high utilizers. The MCO paid $4.2 million for a three-year license. No demographic performance data was requested or provided during procurement. Six months after deployment, a health equity researcher at a partner organization conducts an analysis and finds that, at the same predicted risk level, Black and Hispanic members are on average significantly sicker than white members — a pattern similar to the Optum findings.

a. Identify the specific ethical failures that occurred in this scenario, tracing the error chain from procurement decision to discovered bias.

b. What immediate steps should the MCO take upon receiving the researcher's findings?

c. What contractual, regulatory, or organizational mechanisms would have prevented this outcome if they had been in place?

d. The MCO's legal counsel advises that sharing the researcher's findings publicly could expose the organization to liability. How should the MCO's leadership weigh transparency obligations against legal risk? Who should be involved in this decision?


19. Read the following scenario and answer the questions below.

A digital health startup has developed an AI-powered postpartum depression screening tool. The tool analyzes responses to a 15-item questionnaire administered via smartphone app two weeks after delivery. The tool was trained on data from 12,000 women who completed the questionnaire at a network of maternal health clinics in the Pacific Northwest. The training sample was 71% white, 14% Asian American, 9% Hispanic, 4% Black, and 2% other. The tool's sensitivity for identifying postpartum depression is 0.88 overall. The startup is seeking to license the tool to a national network of OB-GYN practices and has applied for FDA clearance. A reviewer at the FDA notes the limited diversity of the training sample.

a. What specific risks arise from the demographic composition of the training sample, given the intended deployment context?

b. As the FDA reviewer, what would you require the startup to provide before clearing the device?

c. One of the OB-GYN practices in the proposed licensing network serves a predominantly Black patient population in the Southeast U.S., where Black maternal mortality rates are significantly elevated. What specific obligations does this practice have before deploying the tool?

d. The startup argues that waiting for additional demographic validation will delay access to a beneficial screening tool for all postpartum patients, including Black and Hispanic patients who need it most. Evaluate this argument. Under what conditions, if any, should the tool be deployed before additional demographic validation is completed?


20. Read the following scenario and answer the questions below.

A county public health department in a mid-sized city is exploring the use of AI to support suicide risk assessment in its community mental health centers. The centers are severely understaffed — there is an 8-week wait for an initial psychiatric evaluation. The proposed AI tool would screen all patients who contact the crisis line, assigning each a risk tier that determines the speed and type of follow-up. High-risk patients would receive a same-day call from a crisis counselor; moderate-risk patients would be scheduled within 72 hours; low-risk patients would be sent self-help resources. The AI tool was developed by a startup and has been validated in a predominantly white, college-educated population. The county's crisis line serves a predominantly Black, Hispanic, and low-income population. The county health director is under pressure from elected officials to address the mental health crisis rapidly; deploying the AI tool is seen as a faster solution than hiring additional staff.

a. Identify the specific equity risks of deploying this AI tool in this context.

b. Describe what the "substitution risk" looks like in this specific scenario, and explain why it is a particular concern in mental health contexts.

c. The county argues that some mental health support — even imperfect — is better than the current situation of an 8-week wait. Evaluate this argument carefully. Does it justify deployment of this specific tool for this specific population? What conditions would need to be met?

d. Propose an alternative approach to the county's mental health capacity problem that takes the equity concerns seriously while also addressing the genuine need. Your proposal should be realistic about resource constraints.


Answer Key

Multiple Choice

  1. B — The algorithm used healthcare cost as a proxy; Black patients had lower historical costs due to less access to care, not less illness.
  2. B — Calibration refers to the accuracy of predicted probabilities reflecting actual outcomes, including subgroup-level accuracy.
  3. C — The upward adjustment inflated eGFR, making disease appear less severe and delaying transplant referrals.
  4. B — Fewer than 5% of images in major datasets depicted darker Fitzpatrick types V and VI.
  5. C — Substitution risk is the risk of AI replacing rather than supplementing human care in under-resourced settings.
  6. C — Autonomous radiology diagnosis without required clinician review is most clearly SaMD requiring FDA clearance; the CDS exemption does not apply when there is no independent clinician review.
  7. B — Intersectionality describes how overlapping disadvantages produce distinct harms not reducible to the sum of individual axes.
  8. A — The CDS exemption allows many clinical AI tools to avoid FDA review, including the demographic performance testing that review would require.

True/False

  9. False — As the Optum case demonstrates, algorithms without explicit race inputs can produce racially disparate outcomes through proxy variables.
  10. True — This accurately describes the primary mechanism of the Optum algorithm's bias.
  11. True — The NKF/ASN jointly recommended eliminating the race adjustment in September 2021.
  12. False — Overall accuracy averages across subgroups and can be high even when performance varies significantly across groups; a dataset dominated by light-skinned images allows high overall accuracy without equivalent performance on darker skin (see the worked example after this list).
  13. True — The historical data problem refers specifically to this mechanism — AI learning discriminatory treatment patterns from historical records.
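The following worked example for item 12 uses hypothetical figures (not drawn from any published dataset) to show how a test set dominated by one subgroup keeps overall accuracy high even when performance on the minority subgroup is poor:

```python
# Hypothetical figures for illustration only; not from any published dermatology dataset.
# A test set dominated by lighter skin tones keeps overall accuracy high even when
# accuracy on darker skin tones (Fitzpatrick V-VI) is substantially lower.

n_light, acc_light = 950, 0.91   # 95% of test images, lighter skin tones
n_dark, acc_dark = 50, 0.62      # 5% of test images, darker skin tones

overall = (n_light * acc_light + n_dark * acc_dark) / (n_light + n_dark)
print(f"Overall accuracy: {overall:.2f}")  # ~0.90, masking the 0.62 subgroup accuracy
```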

Short Answer — Scoring Guide

Question 14 (Yentl Syndrome — 8 points) Full credit requires: accurate description of Yentl syndrome (women's cardiac symptoms taken seriously only when they resemble male presentations); two specific AI examples (e.g., ECG interpretation AI trained on male-dominated data producing lower sensitivity for female cardiac events; cardiac risk models underestimating risk in women because training data reflected historical underdiagnosis); clear description of the mechanism linking the historical clinical bias to the AI system's performance gap.

Question 15 (Sepsis system comparison — 8 points) Full credit requires: recommendation of System B with clear justification related to transparency and demographic performance data; acknowledgment that System A's higher overall accuracy could mask subgroup disparities; identification of additional needed information (the hospital's own patient demographics, the validation population's similarity to deployment population, the clinical significance of the 5-point sensitivity gap in System B); no full credit for recommending System A based solely on overall accuracy.
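Answers that reference how System B's per-group sensitivity figures would be produced or verified can be credited along the lines of the sketch below; the column names and toy data are hypothetical, not vendor-supplied:

```python
# Minimal sketch of a per-group sensitivity (true positive rate) check, assuming a
# validation table with hypothetical columns: 'label' (1 = confirmed sepsis),
# 'prediction' (1 = flagged by the model), and 'race_ethnicity'.
import pandas as pd

def sensitivity_by_group(df: pd.DataFrame, group_col: str = "race_ethnicity") -> pd.Series:
    """Sensitivity within each demographic group: flagged cases among true cases."""
    true_cases = df[df["label"] == 1]
    return true_cases.groupby(group_col)["prediction"].mean()

# Toy usage example (illustrative values only):
toy = pd.DataFrame({
    "label":          [1, 1, 1, 1, 0, 1, 1, 0],
    "prediction":     [1, 0, 1, 1, 0, 1, 0, 1],
    "race_ethnicity": ["White", "Black", "White", "Hispanic",
                       "White", "Black", "Hispanic", "Black"],
})
print(sensitivity_by_group(toy))
```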

Question 16 (Proxy bias vs. explicit bias — 8 points) Full credit requires: accurate definition of both concepts; clear explanation of why proxy bias is harder to detect (the proxying variable appears legitimate and the protected characteristic is absent from the model); two specific detection methods with explanation (e.g., calibration testing across demographic groups, examining correlation between proxy variables and protected characteristics in training data, sensitivity analysis removing suspect proxy variables).
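One of the detection methods named above, calibration testing across demographic groups, can be illustrated with a short sketch; the column names ('risk_score', 'outcome', 'group') are hypothetical placeholders for a real validation dataset:

```python
# Minimal sketch of subgroup calibration testing, one proxy-bias detection method.
# Assumes a hypothetical validation DataFrame with columns 'risk_score' (model output),
# 'outcome' (observed health need, 0/1), and 'group' (demographic group).
import pandas as pd

def calibration_table(df: pd.DataFrame, n_bins: int = 10) -> pd.DataFrame:
    """Mean predicted risk vs. observed outcome rate, per demographic group and risk decile."""
    df = df.copy()
    df["risk_bin"] = pd.qcut(df["risk_score"], q=n_bins, duplicates="drop")
    return (df.groupby(["group", "risk_bin"], observed=True)
              .agg(mean_predicted=("risk_score", "mean"),
                   observed_rate=("outcome", "mean"),
                   n=("outcome", "size")))

# In a well-calibrated, unbiased model, mean_predicted tracks observed_rate in every
# bin for every group. If one group shows systematically higher observed need at the
# same predicted score (as Black patients did in the Optum case), the score is
# understating that group's need, which is the signature of proxy bias.
```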

Question 17 (Model cards — 8 points) Full credit requires: accurate description of model cards as documentation artifacts for AI systems; at least four specific content items with procurement-relevant explanations (e.g., training data demographics — to assess representativeness for deployment population; subgroup performance metrics — to identify demographic performance gaps; regulatory status — to understand FDA oversight; known limitations — to inform clinical workflow integration; last retraining date — to assess potential performance drift).
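As an illustration of the documentation a full-credit answer might describe, a minimal model-card record could look like the sketch below; the structure and all field values are hypothetical, not any vendor's actual format:

```python
# Hypothetical, minimal model-card record for procurement review. Field names and
# example values are illustrative only, not a real vendor's documentation format.
from dataclasses import dataclass, field

@dataclass
class ClinicalModelCard:
    model_name: str
    intended_use: str
    training_data_demographics: dict   # assess representativeness for the deployment population
    subgroup_performance: dict         # identify demographic performance gaps
    regulatory_status: str             # understand the level of FDA oversight
    known_limitations: list = field(default_factory=list)  # inform clinical workflow integration
    last_retraining_date: str = "unknown"                   # assess potential performance drift

card = ClinicalModelCard(
    model_name="ExampleSepsisPredictor",  # hypothetical product
    intended_use="Adult inpatient sepsis risk stratification",
    training_data_demographics={"White": 0.68, "Black": 0.12, "Hispanic": 0.11, "Other": 0.09},
    subgroup_performance={"sensitivity": {"White": 0.84, "Black": 0.79, "Hispanic": 0.81}},
    regulatory_status="CDS-exempt; no FDA clearance",
    known_limitations=["Not validated in pediatric populations"],
    last_retraining_date="2023-01",
)
```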

Applied Scenario — Scoring Guide

Question 18 (Medicaid MCO scenario — 15 points)
- Part a (4 pts): Should identify: failure to require demographic performance data during procurement; vendor failure to proactively disclose; MCO failure to conduct independent evaluation; absence of regulatory requirement to force disclosure.
- Part b (4 pts): Immediate steps should include: briefing clinical leadership; notifying care management teams of the limitation; conducting own analysis to quantify the extent of the bias in their population; engaging vendor to address the gap; considering whether to disclose to affected members.
- Part c (4 pts): Should identify contractual equity requirements, FDA demographic reporting requirements (if applicable), state Medicaid oversight requirements, procurement standards demanding performance data.
- Part d (3 pts): Should address tension between transparency and legal risk thoughtfully; note that HHS anti-discrimination provisions may create affirmative obligations; recommend involving legal counsel, ethics committee, and board.

Question 19 (Postpartum depression tool — 15 points)
- Part a (4 pts): Training sample underrepresents Black patients (4%), who face the highest postpartum depression and mortality risk; a Pacific Northwest sample may not reflect the Southeast US deployment population; risk of lower sensitivity for Black postpartum women.
- Part b (4 pts): Should require subgroup performance data by race/ethnicity; validation in populations more similar to intended deployment demographics; transparency about known limitations in labeling.
- Part c (3 pts): Practice should conduct independent equity evaluation; brief clinicians on limitations; establish enhanced oversight for patients whose demographics differ from the training population.
- Part d (4 pts): Argument has some merit (imperfect screening > no screening) but ignores differential impact; lower sensitivity for Black patients means the tool provides less benefit to those at highest risk; conditions for conditional deployment should include clinician awareness, enhanced follow-up for flagged demographic groups, and a rapid validation plan.

Question 20 (County mental health AI — 15 points)
- Part a (4 pts): Training/deployment population mismatch (white, college-educated vs. Black, Hispanic, low-income); potential false negative rate disparities mean highest-risk patients may be routed to lowest care tier; false positives mean patients face involuntary holds inequitably.
- Part b (4 pts): Substitution risk is acute — AI is the primary intervention, not a supplement; uniquely concerning in mental health because of high stakes of both false negatives (missed suicide risk) and false positives (involuntary hold).
- Part c (4 pts): "Some care is better than no care" argument is partially valid but fails to account for differential impact — if the AI's false negative rate is higher for the county's predominantly Black/Hispanic population, it provides less benefit to those most served; conditions should include independent performance evaluation in this population, clinician override training, monitoring plan, defined threshold for discontinuation.
- Part d (3 pts): Alternative proposals might include: grant funding for staff expansion; peer support workers; telehealth partnerships; partnership with academic medical center; targeted evidence-based crisis intervention (e.g., Zero Suicide model); any proposal should address resource constraints realistically and prioritize equity.