Chapter 9 Quiz: Measuring Fairness — Metrics and Trade-offs
Total Questions: 20 | Recommended Time: 45 minutes
Part 1: Multiple Choice (8 questions, 2 points each)
Question 1
ProPublica found that COMPAS assigned high-risk scores to Black defendants who did not reoffend at nearly twice the rate of white defendants who did not reoffend. This finding describes which specific fairness metric violation?
A) Demographic parity — the overall prediction rate was different across racial groups
B) False positive rate disparity — the rate of wrongly flagging non-reoffenders differed by race
C) Calibration failure — the predicted probabilities did not match actual recidivism rates
D) Individual fairness violation — similar defendants were scored differently
E) Treatment equality violation — the ratio of false negatives to false positives differed
Question 2
The Chouldechova (2017) impossibility theorem states that, when base rates differ across groups, three common fairness criteria cannot simultaneously hold. Which three criteria does the theorem address?
A) Individual fairness, group fairness, and counterfactual fairness
B) Demographic parity, equalized odds, and calibration
C) True positive rate, false positive rate, and precision
D) Equal opportunity, treatment equality, and demographic parity
E) Accuracy, calibration, and individual fairness
Question 3
A lender's automated system approves 68% of white applicants and 44% of Black applicants. According to the four-fifths (80%) rule used in US employment and fair lending analysis, what is the disparity ratio and does it trigger scrutiny?
A) Ratio = 1.55; this does not trigger scrutiny because it is greater than 1.0
B) Ratio = 0.65; this triggers scrutiny because it is below 0.80
C) Ratio = 0.44; this triggers scrutiny because it equals the minority group's rate
D) Ratio = 1.24; this does not trigger scrutiny because it is within acceptable bounds
E) Ratio = 0.80; this exactly meets the threshold and does not trigger scrutiny
Question 4
A medical diagnostic algorithm has the following properties: among patients who have the disease, it correctly identifies 85% of white patients and 60% of Black patients. Among patients who do not have the disease, it incorrectly identifies 10% of white patients and 10% of Black patients as diseased. Which fairness criterion is satisfied and which is violated?
A) Equal opportunity satisfied (equal FPR); equalized odds violated (unequal TPR)
B) Equal opportunity violated (unequal TPR); equalized odds partially satisfied (equal FPR but not TPR)
C) Demographic parity satisfied; calibration violated
D) Equalized odds satisfied; equal opportunity violated
E) Calibration satisfied; demographic parity violated
Question 5
An organization wants to evaluate whether its algorithm satisfies counterfactual fairness. What would this require?
A) Equal selection rates across all demographic groups
B) A causal model of how the protected attribute influences other features, and evidence that decisions would not change if the protected attribute were different
C) Equal true positive and false positive rates across demographic groups
D) Evidence that predicted probabilities match actual outcomes within each group
E) Evidence that similar individuals (measured by a task-relevant distance metric) receive similar predictions
Question 6
A company reports: "Our hiring algorithm achieves 91% accuracy for both male and female candidates." Which of the following statements best evaluates this claim as a fairness report?
A) This is a complete fairness evaluation; equal accuracy means the system is fair
B) This is inadequate because accuracy does not disaggregate error types; the system could have very different false positive and false negative rates for men vs. women while maintaining equal overall accuracy
C) This is adequate for equalized odds but does not address demographic parity
D) This proves calibration is satisfied but says nothing about equal opportunity
E) Equal accuracy is a stronger fairness criterion than equalized odds, so this report is sufficient
Question 7
Hébert-Johnson et al. (2018) proposed "multicalibration" as a stronger fairness criterion than ordinary calibration. What does multicalibration require beyond standard calibration?
A) Equal false positive rates across all demographic groups
B) Calibration that holds not just for each protected group but for all efficiently computable subgroups, addressing intersectional fairness
C) Calibration that holds across all possible prediction thresholds, not just at the default threshold
D) Equal accuracy across all intersectional subgroups defined by combinations of protected attributes
E) Perfect calibration (zero calibration error) for all groups simultaneously
Question 8
Which of the following scenarios best illustrates the tension between "formal fairness" and "substantive fairness" as described in Section 9.1?
A) A hiring algorithm uses race as an explicit input variable to increase diversity
B) A lending algorithm applies the same underwriting criteria to all applicants, but those criteria were designed around the financial profiles historically associated with advantaged groups, producing lower approval rates for disadvantaged groups
C) A medical algorithm is less accurate for elderly patients because there is less training data for that population
D) A criminal justice algorithm assigns different score distributions to different racial groups because they have different base rates of recidivism
E) An employer uses a test that has equal pass rates across racial groups but measures skills that are not relevant to job performance
Part 2: True or False (5 questions, 2 points each)
For each statement, mark True or False and write one to two sentences explaining your reasoning.
Question 9
A classification model that has equal overall accuracy for Group A and Group B necessarily has equal false positive rates for both groups.
Question 10
The impossibility theorem proves that we should not attempt to measure algorithmic fairness, because perfect fairness is unachievable.
Question 11
Excluding race as a direct input variable from a machine learning model guarantees that the model's predictions will be race-neutral.
Question 12
Individual fairness and group fairness are mathematically equivalent when the population is large and representative.
Question 13
When base rates of an outcome are equal across demographic groups, it becomes possible to simultaneously satisfy demographic parity, equalized odds, and calibration.
Part 3: Short Answer (4 questions, 5 points each)
Question 14
In two to three sentences, explain the difference between equalized odds and equal opportunity. Give a concrete example where you would choose equal opportunity over equalized odds, and justify that choice.
Question 15
A bank uses zip code as one of many input variables in its mortgage underwriting algorithm. A fair lending advocate argues this constitutes illegal discrimination; the bank argues it is capturing legitimate risk information. In three to four sentences, explain the concept of the "zip code proxy" problem and why this debate cannot be resolved by simply looking at whether race is an explicit input variable.
Question 16
Briefly explain what a confusion matrix is and why computing separate confusion matrices for each demographic group reveals information that an aggregate confusion matrix hides. Use a concrete numerical example to illustrate.
Question 17
What is ethics washing in the context of algorithmic fairness? Provide a specific example of what ethics washing would look like in an organization's fairness reporting, and explain what genuine fairness practice would require instead.
Part 4: Applied Scenario (3 questions, 6 points each)
Scenario for Questions 18–20:
VetPath Analytics has developed an algorithm that predicts which veterans are at risk of experiencing housing instability within 90 days. The algorithm is used by a network of veterans' service organizations to prioritize outreach and preventive support services. When resources are scarce, the algorithm determines who receives proactive assistance.
A researcher has evaluated the algorithm on a sample of 1,200 veterans and found the following:
| Metric | Male Veterans (n=900) | Female Veterans (n=300) |
|---|---|---|
| Base rate (actual instability) | 20% | 35% |
| True Positive Rate | 72% | 58% |
| False Positive Rate | 18% | 15% |
| Selection rate | 27% | 28% |
Question 18
Based on the data provided:
a. Does the system satisfy demographic parity? Show your calculation. (2 points)
b. Does the system satisfy equalized odds? Identify which component(s) of equalized odds are violated and by how much. (2 points)
c. Does the system satisfy equal opportunity? Show your calculation. (2 points)
Question 19
A service organization director argues: "Female veterans are the most vulnerable group. The system is missing 42% of the female veterans who will experience instability (false negative rate = 1 - TPR = 42%), compared to only 28% of male veterans. This is the most serious disparity."
a. Is the director's calculation correct? Verify the numbers. (2 points)
b. Which fairness metric does this disparity relate to? (1 point)
c. If you were asked to recalibrate the algorithm's threshold specifically to improve performance for female veterans, what trade-off would you likely face? (3 points)
Question 20
The algorithm's developer argues: "Our system has similar selection rates for male and female veterans (27% vs. 28%), which means it satisfies demographic parity. Any remaining differences in error rates are due to the genuinely higher base rate of instability among female veterans — our model is just accurately reflecting the real world."
a. Evaluate this argument. Is it correct that the system satisfies demographic parity? (2 points)
b. The developer invokes the Chouldechova impossibility theorem to argue that unequal error rates are inevitable given different base rates. Is this invocation of the theorem correct? (2 points)
c. The developer concludes that because metric conflicts are mathematically inevitable, no further action is warranted. Explain why this conclusion is an example of ethics washing and what genuine ethics practice would require instead. (2 points)
Answer Key
Part 1: Multiple Choice
Q1: B — ProPublica documented that among defendants who did not reoffend, Black defendants were flagged as high-risk at a higher rate than white defendants. This is a false positive rate disparity, not overall prediction rate (A), calibration failure (C), individual fairness (D), or treatment equality (E).
Q2: B — The Chouldechova impossibility theorem addresses the incompatibility of demographic parity, equalized odds, and calibration when base rates differ across groups. The other combinations are either not the subject of the theorem or are partially overlapping constructs.
Q3: B — Disparity ratio = 44%/68% = 0.647. Since 0.647 < 0.80, this triggers scrutiny under the four-fifths rule. Option A reverses the ratio. Options C, D, and E make mathematical errors.
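The Q3 arithmetic can be verified with a short sketch (the `disparity_ratio` helper is illustrative, not from the chapter):

```python
# Four-fifths (80%) rule check for Q3: compare the disadvantaged
# group's selection rate to the advantaged group's rate.
def disparity_ratio(disadvantaged_rate: float, advantaged_rate: float) -> float:
    """Selection rate ratio used in disparate-impact screening."""
    return disadvantaged_rate / advantaged_rate

ratio = disparity_ratio(0.44, 0.68)
print(round(ratio, 3))   # 0.647
print(ratio < 0.80)      # True: the disparity triggers scrutiny
```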
Q4: B — Equal FPR (10% for both groups) means one component of equalized odds is satisfied. Unequal TPR (85% vs. 60%) means the TPR component of equalized odds is violated, and equal opportunity (which requires equal TPR) is also violated.
Q5: B — Counterfactual fairness requires a causal model of how the protected attribute influences other features, plus counterfactual analysis of whether the decision would change if the protected attribute were different. Options A, C, D, and E describe demographic parity, equalized odds, calibration, and individual fairness respectively.
Q6: B — Equal overall accuracy is a necessary but not sufficient condition for fairness. A model can have equal accuracy while having dramatically different error compositions by group. This is one of the most common forms of inadequate fairness reporting.
Q7: B — Multicalibration requires calibration for all efficiently computable subgroups, not just the major protected groups. This addresses intersectional fairness by ensuring that even small intersectional subgroups receive calibrated predictions.
Q8: B — This is the classic formal vs. substantive fairness tension: applying formally neutral criteria that produce substantively disparate outcomes because the criteria encode historical privilege.
Part 2: True or False
Q9: FALSE — Equal overall accuracy constrains only the total number of errors in each group, not their composition. Two groups can have identical accuracy while one group's errors are mostly false positives and the other's are mostly false negatives, producing very different FPRs (especially when base rates differ).
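The Q9 point can be made concrete with a small sketch; the counts below are invented for illustration:

```python
# Two groups of 100 with identical 85% accuracy but very
# different error compositions (counts are invented).
def rates(tp, fn, fp, tn):
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    fpr = fp / (fp + tn)   # false positive rate
    fnr = fn / (fn + tp)   # false negative rate
    return accuracy, fpr, fnr

acc_a, fpr_a, fnr_a = rates(tp=45, fn=5, fp=10, tn=40)   # Group A: base rate 50%
acc_b, fpr_b, fnr_b = rates(tp=5, fn=5, fp=10, tn=80)    # Group B: base rate 10%

print(acc_a, acc_b)                       # 0.85 0.85 (equal accuracy)
print(round(fpr_a, 2), round(fpr_b, 2))   # 0.2 0.11  (unequal FPR)
print(fnr_a, fnr_b)                       # 0.1 0.5   (unequal FNR)
```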
Q10: FALSE — The impossibility theorem demonstrates that not all fairness criteria can be simultaneously satisfied, but this does not mean fairness measurement is pointless. It means metric selection must be deliberate and transparent, and it heightens rather than diminishes the importance of measuring fairness carefully.
Q11: FALSE — The "fairness through unawareness" approach fails because other variables in the model may proxy for race. In the United States, variables such as zip code, educational attainment, employment history, and many others are correlated with race due to historical patterns. Excluding the explicit variable does not eliminate its influence.
Q12: FALSE — Individual fairness and group fairness are conceptually distinct and can conflict even with large, representative populations. A system can treat every individual consistently relative to a task-relevant distance metric while still producing systematically different selection rates across groups.
Q13: TRUE — Chouldechova's proof depends on the existence of base rate differences between groups. When base rates are equal, the mathematical relationships that force metric conflicts no longer operate, and an imperfect classifier can in principle satisfy demographic parity, equalized odds, and calibration simultaneously.
Part 3: Short Answer
Q14: Key elements: Equal opportunity requires only equal TPR (equal rates of correctly identifying positive cases). Equalized odds requires equal TPR AND equal FPR. Example: in hiring, if you are primarily concerned with ensuring qualified candidates from all groups are identified at equal rates (false negatives are most costly), equal opportunity is appropriate. You might choose equal opportunity over equalized odds when false positives (incorrectly advancing unqualified candidates) are less harmful than false negatives.
Q15: Key elements: The zip code proxy problem arises because residential segregation is so persistent in the US that zip code is a strong predictor of racial composition. A model that uses zip code as an underwriting variable will produce racially correlated outputs even without using race explicitly. Evaluating legality requires examining the disparate impact of the variable's inclusion, not merely whether the variable is race-neutral on its face.
Q16: Key elements: A confusion matrix shows TP, TN, FP, FN counts. A single aggregate matrix averages over all groups, masking that one group may have FPR = 10% while another has FPR = 40%. A numerical example should show how two groups can have identical overall accuracy but very different FPR/FNR compositions.
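A sketch of the aggregation effect described in Q16, with invented per-group counts:

```python
# Per-group confusion matrices as (tp, fn, fp, tn) tuples;
# counts are invented for illustration.
group_a = (45, 5, 10, 40)
group_b = (5, 5, 10, 80)
aggregate = tuple(a + b for a, b in zip(group_a, group_b))

def fpr(counts):
    tp, fn, fp, tn = counts
    return fp / (fp + tn)

print(round(fpr(group_a), 2))    # 0.2
print(round(fpr(group_b), 2))    # 0.11
print(round(fpr(aggregate), 2))  # 0.14: the pooled matrix hides the gap
```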
Q17: Key elements: Ethics washing = superficial adoption of fairness language without substantive commitment. Example: a company reports "our model is 95% accurate for all groups" without disclosing FPR disparities, without specifying which fairness metric was chosen, and without acknowledging trade-offs. Genuine practice = disclose all computed metrics, document which was prioritized and why, acknowledge accepted trade-offs, involve affected communities.
Part 4: Applied Scenario
Q18:
a. Selection rates: Male = 27%, Female = 28%, a difference of 1 percentage point. Demographic parity is approximately satisfied.
b. TPR: Male = 72%, Female = 58% — a difference of 14 percentage points (violated). FPR: Male = 18%, Female = 15% — a difference of 3 percentage points (approximately satisfied). Equalized odds is violated on the TPR component but approximately satisfied on the FPR component.
c. Equal opportunity requires equal TPR. The TPR gap is 72% - 58% = 14 percentage points, so equal opportunity is violated.
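The Q18 gaps can be computed directly from the rates reported in the scenario table (a sketch; the dictionary names are illustrative):

```python
# Rates reported in the VetPath scenario table.
male = {"selection": 0.27, "tpr": 0.72, "fpr": 0.18}
female = {"selection": 0.28, "tpr": 0.58, "fpr": 0.15}

dp_gap = abs(male["selection"] - female["selection"])
tpr_gap = abs(male["tpr"] - female["tpr"])
fpr_gap = abs(male["fpr"] - female["fpr"])

print(round(dp_gap, 2))   # 0.01: demographic parity approximately satisfied
print(round(tpr_gap, 2))  # 0.14: equal opportunity (and equalized odds) violated
print(round(fpr_gap, 2))  # 0.03: FPR component approximately satisfied
```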
Q19:
a. FNR (male) = 1 - 0.72 = 28%; FNR (female) = 1 - 0.58 = 42%. The director's calculation is correct.
b. This relates to the true positive rate (the equal opportunity / equalized odds TPR component) — specifically the false negative rate, which is the complement of TPR.
c. Lowering the classification threshold for female veterans would increase TPR (catching more true positives) but would also increase FPR (generating more false positives among female veterans who will not experience instability). Given female veterans' higher base rate, this trade-off may be worthwhile but would require careful analysis of resource constraints and the relative costs of missing true cases vs. over-targeting.
Q20:
a. Selection rate for female veterans = 28%; for male veterans = 27%. The difference is 1 percentage point, so demographic parity is approximately satisfied by selection rate.
b. The invocation is partially correct: different base rates (20% vs. 35%) do create pressure toward unequal error rates under calibration, consistent with Chouldechova. However, the impossibility theorem does not excuse the disparity — it explains its mathematical origin. The developer is correct that some metric conflict was inevitable, but incorrect to conclude that no action is warranted.
c. Ethics washing: The developer uses the impossibility theorem as a conversation-stopper — as evidence that nothing needs to be done — rather than as a prompt for deliberate, accountable metric selection. Genuine ethics practice would require: acknowledging the trade-off explicitly; documenting which metric was prioritized (demographic parity) and why; explaining what was sacrificed (equal opportunity, particularly for the more vulnerable female veteran population); consulting affected stakeholders about whether this prioritization aligns with their values; and implementing ongoing monitoring with clear remediation processes.