Chapter 9: Key Takeaways

Essential Takeaways

  1. Fairness is not a single thing. There are multiple mathematically coherent definitions of algorithmic fairness — demographic parity, equalized odds, equal opportunity, calibration, counterfactual fairness, treatment equality — and they capture different moral intuitions about what it means to treat people equitably. No single definition is universally correct.

  2. The Chouldechova impossibility result is a mathematical theorem, not a bug to be engineered away. When base rates of an outcome differ across demographic groups, as they almost always do in socially complex domains, it is mathematically impossible to satisfy demographic parity, equalized odds, and calibration simultaneously. Choosing among them is a values choice, not a technical one.

  3. Confusion matrices disaggregated by group reveal disparities that aggregate accuracy statistics hide. A system can achieve 90% overall accuracy while producing wildly different false positive and false negative rates for different demographic groups. Disaggregated analysis is not optional — it is the minimum standard for responsible evaluation.

  4. Different errors carry different costs, and the distribution of those costs across groups is a fairness question. False positives and false negatives impose different harms on different parties. In criminal justice, false positives (flagging low-risk people as high-risk) harm defendants; false negatives harm potential future victims. Who bears the cost of which type of error — and whether that burden is equally distributed across racial groups — is a core fairness concern.

  5. Choosing a fairness metric is choosing a political position. The metric that Northpointe chose for COMPAS — calibration — protected the system's vendor by emphasizing predictive accuracy. The metric that ProPublica highlighted — equalized odds / false positive rates — protected defendants by emphasizing equitable error distribution. These choices were not neutral. In every high-stakes application, the choice of fairness metric reflects whose interests are prioritized.

  6. Individual fairness depends on a contestable definition of "similar." The requirement that similar individuals be treated similarly is intuitive and important, but it requires a task-relevant similarity metric that is itself a substantive and often contested choice. Defining who counts as similarly situated frequently reproduces the same social judgments that produced inequality in the first place.

  7. Single-axis fairness analysis can mask intersectional harms. A system can satisfy demographic parity for race and separately for gender while still producing worse outcomes for Black women. Fairness analysis must be conducted at the intersectional level for the groups most vulnerable to compound discrimination.

  8. There is no universal fairness metric; the right metric depends on the domain. Criminal justice applications prioritize equalized odds (distributing errors equitably across groups). Healthcare applications prioritize calibration and equalized odds jointly. Employment applications are assessed against the four-fifths rule for disparate impact, an EEOC guideline rather than a strict statutory threshold. The framework in Section 9.7 provides domain-specific guidance.

  9. Fairness measurement requires data you may not have. Computing fairness metrics requires outcome data disaggregated by protected characteristics. Many organizations lack this data. Absence of measurement is not evidence of fairness; it is evidence of inadequate monitoring. The legal and practical challenges of data collection must be navigated, not used as excuses to avoid analysis.

  10. Fairness is a continuous process, not a one-time check. Models drift over time. Populations change. Feedback loops can create new disparities. Post-deployment monitoring is essential, with automated alerts, regular audits, and clear escalation processes for when metrics deteriorate.

  11. Ethics washing in fairness is common and recognizable. Organizations that report only their most favorable fairness metric, fail to acknowledge trade-offs, or claim to have "solved" fairness without disclosing which metric they chose are engaging in ethics washing. Genuine fairness practice requires transparency about which metrics were evaluated, which were prioritized, what trade-offs were accepted, and who made those decisions.

  12. The impossibility theorem applies to human decision-makers too. Human judges, loan officers, and hiring managers also cannot simultaneously satisfy all fairness criteria. This is not an argument for replacing humans with algorithms — it is an argument for applying the same rigorous fairness analysis and accountability mechanisms to both.
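Takeaways 3, 4, and 8 can be made concrete with a short sketch. The helpers below are illustrative only (names like group_rates and from_counts are ours, not the chapter's): they build per-group confusion-matrix rates from toy data constructed so that overall accuracy is identical across groups while error rates diverge sharply, and then apply a four-fifths screen to the selection rates.

```python
import numpy as np

def group_rates(y_true, y_pred, groups):
    """Confusion-matrix rates disaggregated by demographic group."""
    out = {}
    for g in np.unique(groups):
        m = groups == g
        yt, yp = y_true[m], y_pred[m]
        tp = int(np.sum((yt == 1) & (yp == 1)))
        fp = int(np.sum((yt == 0) & (yp == 1)))
        tn = int(np.sum((yt == 0) & (yp == 0)))
        fn = int(np.sum((yt == 1) & (yp == 0)))
        out[g] = {
            "accuracy": (tp + tn) / len(yt),
            "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
            "fnr": fn / (fn + tp) if (fn + tp) else 0.0,
            "selection_rate": (tp + fp) / len(yt),
        }
    return out

def from_counts(tp, fp, tn, fn):
    """Build label/prediction arrays from confusion-matrix counts."""
    y_true = np.array([1] * (tp + fn) + [0] * (fp + tn))
    y_pred = np.array([1] * tp + [0] * fn + [1] * fp + [0] * tn)
    return y_true, y_pred

# Toy data: both groups are 90% accurate overall, but the errors fall
# very differently. Group A: base rate 0.5, FPR = FNR = 0.1.
# Group B: base rate 0.2, FPR = 0.0, FNR = 0.5.
ya, pa = from_counts(tp=45, fp=5, tn=45, fn=5)
yb, pb = from_counts(tp=10, fp=0, tn=80, fn=10)
y_true = np.concatenate([ya, yb])
y_pred = np.concatenate([pa, pb])
groups = np.array(["A"] * len(ya) + ["B"] * len(yb))

rates = group_rates(y_true, y_pred, groups)
overall_accuracy = float(np.mean(y_true == y_pred))  # 0.9 in aggregate

# Four-fifths screen on selection rates (disparate-impact check):
ratio = rates["B"]["selection_rate"] / rates["A"]["selection_rate"]
passes_four_fifths = ratio >= 0.8
```

Despite identical 90% accuracy in each group, group B's false negative rate (0.5) is five times group A's (0.1), and the selection-rate ratio (0.2) falls far below the four-fifths threshold: exactly the disparity that aggregate accuracy statistics hide.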


Essential Vocabulary

Confusion matrix: A table showing the distribution of a classifier's predictions into true positives, true negatives, false positives, and false negatives. The foundation of fairness metric computation.

Demographic parity: The fairness criterion requiring equal rates of positive predictions (e.g., loan approvals, low-risk scores) across demographic groups, regardless of base rate differences.

Equalized odds: The fairness criterion requiring equal true positive rates AND equal false positive rates across demographic groups; formalized by Hardt, Price, and Srebro (2016).

Calibration: The property that predicted probabilities match actual outcome rates within each demographic group. COMPAS was calibrated but did not satisfy equalized odds.

Equal opportunity: A relaxed version of equalized odds requiring only equal true positive rates — equal rates of correctly identifying positive cases — across groups.

Counterfactual fairness: The criterion that a decision would remain the same if the individual's protected attribute had been different, holding all non-descendant factors constant.

Impossibility theorem: Chouldechova's (2017) formal proof that demographic parity, equalized odds, and calibration cannot simultaneously hold when base rates differ across groups and errors are nonzero.

False positive rate (FPR): Among individuals who do not have the outcome of interest, the proportion incorrectly classified as positive. The disparity in FPR across racial groups was the central ProPublica finding.

Base rate: The actual prevalence of an outcome in a population or subgroup. Differences in base rates between groups are the mathematical root cause of fairness metric incompatibility.

Multicalibration: The stronger fairness requirement (Hébert-Johnson et al., 2018) that predictions be calibrated not just for each protected group but for all efficiently computable subgroups, enabling intersectional fairness analysis.
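The base-rate entry can be given a worked form. Chouldechova's (2017) identity ties the quantities defined above together: FPR = p/(1−p) · (1−PPV)/PPV · (1−FNR), where p is the group's base rate and PPV its positive predictive value. A minimal sketch (the function name is ours, not the chapter's) shows that if two groups share the same PPV (predictive parity) and the same FNR, unequal base rates force unequal false positive rates:

```python
def implied_fpr(base_rate, ppv, fnr):
    """Chouldechova's identity: FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR).

    Given a group's base rate p, positive predictive value, and false
    negative rate, the false positive rate is fully determined.
    """
    p = base_rate
    return (p / (1 - p)) * ((1 - ppv) / ppv) * (1 - fnr)

# Two groups with identical predictive parity (PPV = 0.6) and identical
# FNR (0.3), but different base rates.
fpr_a = implied_fpr(base_rate=0.5, ppv=0.6, fnr=0.3)  # higher base rate
fpr_b = implied_fpr(base_rate=0.3, ppv=0.6, fnr=0.3)  # lower base rate

# Equal PPV + equal FNR + unequal base rates => unequal FPRs.
# No amount of tuning escapes this: it is arithmetic, not engineering.
```

Here the higher-base-rate group necessarily carries a higher false positive rate (about 0.47 versus 0.20), which is the mathematical core of the COMPAS dispute: a calibrated score with differing base rates cannot also equalize error rates.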

Core Tensions to Carry Forward

Technical precision vs. social meaning: Fairness metrics are mathematically precise. But what they measure is socially constructed: who gets arrested, who is labeled a recidivist, whose neighborhood is considered high-risk. Precise measurement of an imprecisely constructed social reality gives false confidence.

Accuracy vs. equity: Systems optimized for predictive accuracy across the full population will typically not satisfy equalized odds. Improving equity for disadvantaged groups may require accepting some reduction in overall predictive accuracy — a trade-off that has real costs and that must be made deliberately.

Transparency vs. trade secrecy: Algorithmic fairness cannot be verified without access to the algorithm. But vendors protect their models as trade secrets. This tension between accountability and intellectual property is unresolved in most jurisdictions.

Compliance vs. commitment: Fairness metrics can be satisfied in ways that meet the letter of regulatory requirements without genuine commitment to equitable outcomes. Distinguishing ethics washing from genuine fairness practice requires examining what metrics were chosen, who was consulted, and what trade-offs were acknowledged.

Individual vs. group: Individual fairness and group fairness are conceptually distinct and sometimes conflicting. A system can treat each individual consistently relative to others while still producing systematically disparate group-level outcomes, and vice versa.


Questions to Carry Forward

  • If a perfectly calibrated algorithmic system and a racially biased human judge produce identical outcome distributions, are they morally equivalent? If not, why not?
  • Who should have the democratic authority to choose which fairness metric is used when an algorithm makes decisions that affect someone's liberty or economic opportunity?
  • Can a system be genuinely fair if it is trained on data that reflects historical injustice — if the outcomes it learns to predict are themselves products of discrimination?
  • What would it mean for the criminal justice system, or the credit system, to be not just less unfair than current systems, but genuinely just? Can algorithmic tools help achieve that, or do they merely manage injustice more efficiently?
  • How do we balance the need for demographic data to measure fairness against the privacy interests and legal constraints that limit data collection?