In This Chapter
- Opening: A Paradox in a Courtroom
- Learning Objectives
- Section 9.1: What Does "Fair" Mean? The Problem of Competing Definitions
- Section 9.2: Understanding Confusion Matrices — The Foundation of Fairness Measurement
- Section 9.3: Individual Fairness
- Section 9.4: Group Fairness Metrics
- Section 9.5: The Impossibility Theorem
- Section 9.6: Intersectional Fairness
- Section 9.7: Context Dependence of Fairness Metrics
- Section 9.8: Fairness Measurement in Practice — Organizational Processes
- Section 9.9: Beyond Binary — Fairness in Multi-Class and Regression Settings
- Section 9.10: Organizational Implications — Making Fairness a Practice
- Discussion Questions
Chapter 9: Measuring Fairness — Metrics and Trade-offs
Opening: A Paradox in a Courtroom
On any given morning in an American courthouse, a judge preparing to set bail or recommend a sentence may glance at a printout generated by a piece of software called COMPAS — Correctional Offender Management Profiling for Alternative Sanctions. The score on that printout, a number from one to ten, represents the algorithm's estimate of how likely the defendant is to reoffend. Judges have discretion about how much weight to give it, but the number is there, authoritative, the product of computation and data.
In 2016, journalists at ProPublica — led by Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner — published an investigation titled "Machine Bias" that would become one of the most consequential pieces of technology journalism ever written. They obtained COMPAS scores for more than 7,000 people arrested in Broward County, Florida, then tracked those people for two years to see who actually reoffended. What they found was striking: Black defendants were nearly twice as likely as white defendants to be falsely labeled as high risk. White defendants, conversely, were more likely to be falsely labeled low risk when they went on to commit new crimes.
The reaction was immediate and widespread. Here, critics said, was proof that algorithmic systems launder human bias into the false authority of mathematics. A tool marketed as objective and scientific was perpetuating racial disparities in the justice system under a different name.
Northpointe, the company that built COMPAS, responded with its own analysis. Their system, they said, was fair — and they had the numbers to prove it. When COMPAS assigned a risk score of seven to a defendant, roughly 70% of defendants at that score actually reoffended, regardless of race. The predictive probabilities were equally accurate for Black and white defendants. This property — calibration — is a legitimate and important fairness criterion.
Here is the paradox: both sides were telling the truth.
ProPublica was right that Black defendants experienced higher false positive rates — they were flagged as high-risk at greater rates when they were actually low-risk. Northpointe was right that the system was calibrated — scores meant the same thing in probability terms across racial groups. Both of these things were simultaneously true. And as the Carnegie Mellon statistician Alexandra Chouldechova proved mathematically in 2017, when base rates of a measured outcome differ between groups — as they do for recidivism in the United States, due to decades of differential policing, prosecution, and incarceration — it is mathematically impossible to satisfy both calibration and equal false positive rates at the same time.
This is not a software bug. It is not an engineering oversight that better code could fix. It is a mathematical theorem.
The COMPAS controversy is the perfect entry point into the subject of this chapter because it reveals something that makes many people uncomfortable: fairness is not a single thing. There are multiple mathematically coherent definitions of algorithmic fairness. They make different moral claims. They serve different interests. And in many realistic circumstances, they cannot all be satisfied simultaneously. Choosing among them is not a technical decision — it is a values decision, with real consequences for real people. The choice of which fairness metric to use is, inescapably, a political choice.
This chapter equips you with the tools to understand those choices: what the major fairness metrics are, how to calculate them, when each one is appropriate, and how to build fairness measurement into an organizational practice. The goal is not to resolve the paradox — it cannot be resolved — but to help you reason about it rigorously and honestly.
Learning Objectives
By the end of this chapter, you will be able to:
- Explain why there is no single definition of algorithmic fairness and why multiple definitions can be simultaneously valid yet mutually contradictory.
- Construct and interpret a confusion matrix and identify the different costs associated with false positives and false negatives in high-stakes decision contexts.
- Define and calculate six major group fairness metrics — demographic parity, equalized odds, equal opportunity, calibration, counterfactual fairness, and treatment equality — and provide worked numerical examples.
- Explain the Chouldechova impossibility theorem and describe its practical implications for organizations deploying classification systems.
- Apply a principled framework for selecting a fairness metric appropriate to a specific organizational context, considering domain, error costs, vulnerability, and legal requirements.
- Identify the challenges of measuring fairness at the intersectional level and describe approaches for evaluating subgroup performance.
- Design a basic organizational fairness monitoring process, including what data to collect, when to measure, and how to document fairness choices.
- Distinguish between ethics washing — the superficial adoption of fairness language without substantive commitment — and genuine organizational fairness practice.
Section 9.1: What Does "Fair" Mean? The Problem of Competing Definitions
Ask ten people whether a system is fair, and you will likely get ten different answers — and all of them might be right. This is not because fairness is meaningless. It is because fairness is a genuinely complex concept that, when pressed for precision, reveals multiple distinct and sometimes conflicting principles. Before we can measure fairness, we need to be clear about what we are trying to measure — and why that question is harder than it looks.
Formal vs. Substantive Fairness
Philosophers and legal scholars have long distinguished between two broad approaches to fairness. Formal fairness focuses on procedural consistency: treat like cases alike, apply the same rules to everyone, do not discriminate based on irrelevant characteristics. Substantive fairness focuses on outcomes: are people ending up in similar positions? Are historical disadvantages being perpetuated or remedied?
These two approaches often conflict. A formally fair process — applying the same lending criteria to everyone — can produce substantively unfair outcomes if those criteria have historically been designed around the profiles of white applicants, or if they correlate with neighborhood characteristics that encode racial history. Conversely, a process designed to produce equal outcomes may require treating different groups differently by formal criteria, which some people find unfair in a different sense.
This tension runs through every major debate about algorithmic fairness. When Northpointe argued that COMPAS was fair because it applied the same algorithm to everyone, they were making a formal fairness claim. When ProPublica argued that COMPAS was unfair because it produced worse outcomes for Black defendants — more false positives — they were making a substantive fairness claim. Neither side was wrong in its own terms. They were answering different questions.
The Proliferation of Fairness Definitions
In the academic literature on machine learning fairness — a field that has exploded since roughly 2014 — researchers have proposed more than twenty distinct mathematical definitions of fairness. Most of these cluster around a smaller number of core concepts: demographic parity, equalized odds, equal opportunity, calibration, counterfactual fairness, individual fairness, and a handful of variants. We will examine the major ones carefully in Sections 9.3 and 9.4.
What is important to understand at the outset is that these definitions are not simply different names for the same thing. They make substantively different claims about what matters morally. Demographic parity says that the right fairness goal is equal outcomes across groups — the same rate of loan approvals, the same rate of parole recommendations. Calibration says the right fairness goal is that predicted probabilities accurately reflect actual probabilities for all groups — that when we say someone has a 70% chance of doing something, that should be equally true across groups. These are different goals, grounded in different values, and in many realistic situations they pull in opposite directions.
Why Value Judgments Are Unavoidable
Here is the uncomfortable truth that this chapter asks you to sit with: there is no neutral, value-free way to choose a fairness metric. Every choice embeds a judgment about what matters most.
Choosing demographic parity reflects a commitment to equal outcomes — a priority often associated with remedying historical disadvantage. Choosing calibration reflects a commitment to predictive accuracy and trust in the score — a priority often associated with the interests of the system's deployer. Choosing equalized odds reflects a commitment to distributing errors equally across groups — a priority often associated with the interests of people subjected to adverse decisions.
These preferences are not arbitrary — they correspond to different moral and political frameworks with long intellectual histories. But they are choices. And in high-stakes settings like criminal justice, credit, and healthcare, the choice of fairness metric determines whose interests are prioritized and whose harms are accepted as tolerable.
One of the themes of this textbook is the distinction between ethics washing and genuine ethics. Ethics washing — the superficial adoption of ethical language without substantive commitment — is nowhere more tempting than in the domain of fairness metrics. An organization can produce a technically accurate statement like "our system is 95% accurate across groups" or "our system is calibrated" while obscuring the fact that it produces systematically worse outcomes for already disadvantaged groups. The goal of this chapter is to give you the vocabulary and analytical tools to see through such claims.
Vocabulary Builder
Fairness: In algorithmic systems, the property of producing decisions or predictions that do not systematically disadvantage individuals based on protected characteristics such as race, gender, or age. Multiple formal definitions exist; no single definition captures all morally relevant aspects of fairness.
Confusion matrix: A table summarizing the performance of a classification model by showing the counts of true positives, true negatives, false positives, and false negatives.
False positive rate (FPR): Among people who do not have the outcome of interest (e.g., people who will not reoffend), the proportion who are incorrectly classified as positive (high-risk). Also called the Type I error rate.
False negative rate (FNR): Among people who do have the outcome of interest (e.g., people who will reoffend), the proportion who are incorrectly classified as negative (low-risk). Also called the Type II error rate.
Precision: Among all cases predicted as positive, the proportion that are truly positive.
Recall (sensitivity, true positive rate): Among all truly positive cases, the proportion correctly identified as positive.
Calibration: The property that predicted probabilities match observed outcome frequencies. A well-calibrated model that assigns 70% risk to a set of people will see approximately 70% of those people experience the predicted outcome.
Demographic parity: The fairness criterion requiring that the rate of positive predictions be equal across demographic groups.
Equalized odds: The fairness criterion requiring that both the true positive rate and the false positive rate be equal across demographic groups.
Counterfactual fairness: The fairness criterion requiring that a decision be the same regardless of an individual's protected attribute, holding constant all factors that are not causally downstream of that attribute.
Section 9.2: Understanding Confusion Matrices — The Foundation of Fairness Measurement
Before we can evaluate whether a system is fair, we need a way to characterize its errors. Confusion matrices are the essential tool for this. They are simple, powerful, and — as we will see — capable of revealing disparities that overall accuracy statistics completely hide.
The Four Types of Outcomes
Any binary classification system — one that outputs a yes/no, high/low, approve/deny decision — produces four types of outcomes:
- True positive (TP): The system predicts positive, and the outcome is actually positive. In COMPAS terms: the system predicts high risk, and the person does reoffend.
- True negative (TN): The system predicts negative, and the outcome is actually negative. COMPAS predicts low risk, and the person does not reoffend.
- False positive (FP): The system predicts positive, but the outcome is actually negative. COMPAS predicts high risk, but the person does not reoffend. This is the error ProPublica highlighted: people labeled dangerous when they are not.
- False negative (FN): The system predicts negative, but the outcome is actually positive. COMPAS predicts low risk, but the person does reoffend. This is the opposite error: people labeled safe when they are not.
The Different Costs of Different Errors
In most real-world applications, these errors are not equally costly. The appropriate weighting of errors depends on the stakes and who bears the costs.
In criminal justice, a false positive — labeling a person high-risk when they are not — may mean they receive a harsher sentence, are denied bail, or remain incarcerated longer than warranted. The cost falls on the individual defendant. A false negative — labeling a person low-risk when they go on to commit another crime — imposes costs on potential future victims and on public safety. Neither type of error is free. The question is how to distribute those costs, and whether the distribution is equitable across demographic groups.
In medical screening, the costs are configured differently. A false negative on a cancer screening (missing a cancer that is present) may allow a treatable condition to become fatal. A false positive triggers unnecessary follow-up tests that are costly, stressful, and occasionally harmful. Most people, weighing these asymmetrically, prefer to accept more false positives to avoid missing true cases — which is why cancer screening programs are typically tuned for high sensitivity even at the cost of lower precision.
The point is that the choice of how to balance error types is a values choice, not a technical one. And when we ask whether a system is fair, we must ask: are errors of each type distributed equitably across demographic groups?
A Worked Example
The following tables illustrate a hypothetical scenario modeled on ProPublica's COMPAS analysis. The numbers are simplified for clarity but reflect the proportional patterns documented in the original investigation.
Table 9.1: Hypothetical Confusion Matrices for Two Demographic Groups
| | Predicted High-Risk | Predicted Low-Risk | Total |
|---|---|---|---|
| Group A: Actually reoffended | 180 (TP) | 60 (FN) | 240 |
| Group A: Did not reoffend | 100 (FP) | 160 (TN) | 260 |
| Group A Total | 280 | 220 | 500 |

| | Predicted High-Risk | Predicted Low-Risk | Total |
|---|---|---|---|
| Group B: Actually reoffended | 70 (TP) | 50 (FN) | 120 |
| Group B: Did not reoffend | 48 (FP) | 232 (TN) | 280 |
| Group B Total | 118 | 282 | 400 |
From these tables, we can calculate:
Overall accuracy:
- Group A: (180 + 160) / 500 = 68%
- Group B: (70 + 232) / 400 = 75.5%

False positive rate (among people who did not reoffend, the proportion incorrectly flagged as high-risk):
- Group A: 100 / 260 = 38.5%
- Group B: 48 / 280 = 17.1%

False negative rate (among people who did reoffend, the proportion incorrectly flagged as low-risk):
- Group A: 60 / 240 = 25%
- Group B: 50 / 120 = 41.7%

True positive rate / recall (among people who did reoffend, the proportion correctly flagged):
- Group A: 180 / 240 = 75%
- Group B: 70 / 120 = 58.3%
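These calculations are mechanical enough to script. The sketch below reproduces the rates above directly from the hypothetical Table 9.1 counts (the group labels and dictionary layout are illustrative choices, not part of the original analysis):

```python
# Hypothetical confusion-matrix counts from Table 9.1.
counts = {
    "A": {"tp": 180, "fn": 60, "fp": 100, "tn": 160},
    "B": {"tp": 70,  "fn": 50, "fp": 48,  "tn": 232},
}

def rates(c):
    """Return (accuracy, FPR, FNR, TPR) for one group's confusion matrix."""
    total = c["tp"] + c["fn"] + c["fp"] + c["tn"]
    accuracy = (c["tp"] + c["tn"]) / total
    fpr = c["fp"] / (c["fp"] + c["tn"])   # among people who did NOT reoffend
    fnr = c["fn"] / (c["fn"] + c["tp"])   # among people who DID reoffend
    tpr = c["tp"] / (c["tp"] + c["fn"])   # recall
    return accuracy, fpr, fnr, tpr

for group, c in counts.items():
    acc, fpr, fnr, tpr = rates(c)
    print(f"Group {group}: accuracy={acc:.1%} FPR={fpr:.1%} FNR={fnr:.1%} TPR={tpr:.1%}")
```

Running this confirms the disaggregated pattern: Group A's accuracy is lower but its errors also skew toward false positives, a fact invisible in the aggregate number.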
Notice what these numbers reveal. Group A has a higher false positive rate (38.5% vs. 17.1%) — members of Group A who did not reoffend are flagged as high-risk at more than twice the rate of Group B members who did not reoffend. This is the disparity ProPublica documented. But Group B has a higher false negative rate (41.7% vs. 25%) — members of Group B who did reoffend are missed at higher rates. Both types of errors are unequally distributed. The question of which matters more is, again, a values question.
Overall accuracy — which we might naively use as our single performance metric — would tell us Group B is doing better (75.5% vs. 68%). But this aggregate number hides the fact that the system is making systematically different types of errors for different groups.
This is why overall accuracy is an insufficient measure of performance in high-stakes systems, and why disaggregated analysis is essential.
Section 9.3: Individual Fairness
The oldest intuitive principle of fairness is that like cases should be treated alike. Individual fairness translates this principle into a mathematical framework for algorithmic systems.
The Core Idea
Individual fairness, formalized by Cynthia Dwork and colleagues in a landmark 2012 paper, holds that any two individuals who are similar in all task-relevant respects should receive similar predictions or decisions from the algorithm. If two loan applicants have identical income, credit history, employment status, debt load, and every other characteristic relevant to creditworthiness, the algorithm should give them similar approval probabilities — regardless of whether they belong to different demographic groups.
The formal mathematical statement uses what is called a Lipschitz condition. If we have a metric $d$ measuring distance between individuals in task-relevant feature space, and a metric $D$ measuring distance between decisions or predictions, individual fairness requires that:
$$D(M(x), M(y)) \leq d(x, y)$$
In plain language: the difference in decisions for two people should be no greater than the difference in their relevant characteristics. People who are close together in relevant ways should receive close decisions.
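A minimal sketch makes the Lipschitz condition concrete. Everything here is hypothetical — the linear scoring function and the two distance metrics stand in for whatever model and (contested) similarity metric a real deployment would use:

```python
def is_individually_fair(model, x, y, d, D, tol=1e-9):
    """Check D(M(x), M(y)) <= d(x, y) for one pair of individuals.

    `model` maps a feature vector to a prediction; `d` measures distance
    between individuals, `D` between predictions. Choosing `d` is the
    value-laden step discussed in the text, not a technical detail.
    """
    return D(model(x), model(y)) <= d(x, y) + tol

# Toy illustration: a hypothetical linear credit score.
def model(x):
    return 0.3 * x[0] + 0.2 * x[1]

d = lambda x, y: sum(abs(a - b) for a, b in zip(x, y))  # feature distance
D = lambda p, q: abs(p - q)                             # decision distance

# Two applicants who differ slightly in one feature.
print(is_individually_fair(model, (0.5, 0.4), (0.6, 0.4), d, D))
```

A full audit would check this condition over many sampled pairs, but the hard problem remains the one in the text: agreeing on `d`.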
The Appeal
Individual fairness has immediate intuitive appeal. It captures something real about what discrimination means: treating people differently based on characteristics that should not matter. It avoids some of the problems of group-level metrics, which can require ignoring genuine differences between individuals in service of group-level balance. And it connects naturally to anti-discrimination law's concept of disparate treatment — making decisions on the basis of protected characteristics.
For business professionals, individual fairness aligns well with intuitions about meritocracy: reward people based on their individual characteristics, not their group membership.
The Problem: Defining "Similar"
Individual fairness's critical vulnerability is exactly the phrase "similar in task-relevant respects." Before you can implement individual fairness, you need a metric — a way of measuring how similar two people are. And that metric is itself a substantive choice that embeds value judgments.
For a simple domain like comparing loan applications, you might think the metric is obvious: compare income, credit scores, employment history. But consider: should two applicants with identical incomes be considered similarly situated for lending purposes if one has recently been incarcerated? Prior incarceration correlates with race in the United States due to decades of differential enforcement. Using it as a relevant characteristic may preserve a racially disparate outcome while satisfying formal individual fairness. Excluding it may ignore a factor that is genuinely predictive of default.
The similarity metric is not neutral. It encodes judgments about what matters and what does not — judgments that often reproduce existing social hierarchies.
When Individual Fairness Works
Individual fairness is most tractable in domains where there is broad agreement on what makes two cases similar and where those similarity criteria do not themselves encode historical bias. Technical screening tests, where two candidates' answers can be compared objectively, are one example. Image classification tasks, where two images can be compared via established perceptual metrics, are another.
When It Fails
In complex social domains — criminal justice, credit, hiring, healthcare — the contested nature of the similarity metric makes individual fairness difficult to implement in a way that commands broad agreement. Two defendants may look similar on observable characteristics but differ in ways that are contested as either relevant or irrelevant. For these settings, group fairness metrics — despite their own limitations — provide a more tractable framework for identifying systematic disparities.
Section 9.4: Group Fairness Metrics
Group fairness metrics shift the focus from individual cases to aggregate patterns across demographic groups. Rather than asking whether two specific individuals are treated similarly, they ask whether groups defined by protected characteristics — race, gender, age, disability status — experience systematically different rates of favorable and unfavorable outcomes or errors.
Each of the six metrics below captures something real and important about fairness. Each has contexts where it is the right metric and contexts where it is inadequate. Understanding when each applies is the essential skill for responsible AI deployment.
9.4.1 Demographic Parity (Statistical Parity)
Definition: A classifier satisfies demographic parity if the rate of positive predictions is equal across demographic groups. If the system approves loans, approvals should happen at equal rates for all groups. If the system flags items for review, flagging should happen at equal rates across groups.
In formula terms: $P(\hat{Y} = 1 | A = 0) = P(\hat{Y} = 1 | A = 1)$, where $\hat{Y}$ is the prediction and $A$ is the protected attribute.
Worked example: Suppose an automated loan system approves 60% of applications from Group A and 40% of applications from Group B. Demographic parity is violated. To satisfy it, the approval rate must be equalized — either by approving more Group B applicants, approving fewer Group A applicants, or some combination.
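The parity check itself is a one-liner per group. This sketch computes $P(\hat{Y} = 1 \mid A = a)$ from prediction-level data; the toy inputs are constructed to mirror the 60%/40% example above:

```python
from collections import defaultdict

def positive_rates(predictions, groups):
    """Rate of positive predictions per group: P(Y_hat = 1 | A = a)."""
    pos, tot = defaultdict(int), defaultdict(int)
    for yhat, a in zip(predictions, groups):
        tot[a] += 1
        pos[a] += yhat
    return {a: pos[a] / tot[a] for a in tot}

# Toy data mirroring the worked example: 60% approvals for A, 40% for B.
preds  = [1]*6 + [0]*4 + [1]*4 + [0]*6
groups = ["A"]*10 + ["B"]*10

approval = positive_rates(preds, groups)
gap = abs(approval["A"] - approval["B"])   # demographic parity difference
print(approval, gap)
```

In practice the "demographic parity difference" (the `gap` here) is reported against a tolerance threshold rather than demanding exact equality.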
When it is the right metric: Demographic parity is appropriate when you have strong reason to believe that the underlying qualification rates should be equal across groups, or when the purpose of the system is explicitly to achieve equal representation. It is also appropriate in regulatory contexts like employment law in some jurisdictions, where a significant disparity in selection rates across groups triggers scrutiny regardless of the reason.
Limitations: Demographic parity can require ignoring real predictive differences. If Group A genuinely applies with stronger average credentials than Group B (perhaps because of differential access to education — itself a product of historical injustice), forcing equal approval rates will mean approving some Group B applicants who are less creditworthy by the model's criteria, which may increase default rates. It can also be gamed: a system could achieve demographic parity by randomly selecting applicants while hiding its discriminatory logic behind a veneer of equal outcomes. Demographic parity also does not prevent differential error rates — you could have equal approval rates but very different false positive rates across groups.
9.4.2 Equalized Odds
Definition: A classifier satisfies equalized odds if it has equal true positive rates (TPR) AND equal false positive rates (FPR) across demographic groups. This means: among people who should receive a positive decision, the system identifies them at equal rates across groups; AND among people who should receive a negative decision, the system incorrectly flags them at equal rates across groups.
Formal definition from Hardt, Price, and Srebro (2016): $P(\hat{Y} = 1 | A = 0, Y = y) = P(\hat{Y} = 1 | A = 1, Y = y)$ for both $y = 0$ and $y = 1$.
Worked example: Using our Table 9.1 numbers, Group A has a false positive rate of 38.5% and a true positive rate of 75%. Group B has a false positive rate of 17.1% and a true positive rate of 58.3%. Equalized odds is violated on both metrics. Achieving equalized odds would require adjusting the system's threshold differently for each group — a practice called threshold adjustment or postprocessing.
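Because equalized odds is a two-part condition, a check must compare both rates. A minimal sketch using the Table 9.1 counts (the tuple layout and function names are illustrative):

```python
# Table 9.1 counts per group: (TP, FN, FP, TN).
table = {"A": (180, 60, 100, 160), "B": (70, 50, 48, 232)}

def tpr_fpr(tp, fn, fp, tn):
    """True positive rate and false positive rate for one group."""
    return tp / (tp + fn), fp / (fp + tn)

def equalized_odds_gaps(table):
    """Largest cross-group gaps in TPR and FPR.

    Equalized odds requires BOTH gaps to be (approximately) zero.
    """
    tprs, fprs = zip(*(tpr_fpr(*c) for c in table.values()))
    return max(tprs) - min(tprs), max(fprs) - min(fprs)

tpr_gap, fpr_gap = equalized_odds_gaps(table)
print(f"TPR gap: {tpr_gap:.3f}, FPR gap: {fpr_gap:.3f}")
```

Both gaps come out far from zero for these numbers, confirming the violation on both components.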
When it is the right metric: Equalized odds is appropriate when we believe errors should be equally distributed across groups — that it is unjust for one group to bear a disproportionate burden of either type of mistake. In criminal justice, where both false positives (wrongly detaining low-risk people) and false negatives (failing to identify high-risk people) have real consequences, equalized odds captures an important symmetrical fairness intuition.
Limitations: Equalized odds can conflict with calibration (as we will see in Section 9.5). It may also require deliberately worsening performance for one group to achieve parity — reducing the true positive rate for Group A to match Group B's lower rate. Some argue this is perverse: it makes the system worse in absolute terms to achieve relative parity. It also requires knowing the true outcome, which may itself be contaminated by bias (who gets arrested, who gets charged, who gets convicted all reflect pre-existing biases in the system).
9.4.3 Equal Opportunity
Definition: Equal opportunity is a relaxed version of equalized odds, requiring only that the true positive rate be equal across groups. It focuses specifically on ensuring that people who deserve a positive decision get one at equal rates, without requiring that false positive rates also be equalized.
When it is the right metric: Equal opportunity is appropriate when false negatives are substantially more costly than false positives, and when the primary concern is that qualified individuals are being unfairly denied opportunities. In hiring, for example, the primary concern might be ensuring that equally qualified candidates have equal chances of being identified as strong prospects — rather than worrying as much about the rate at which unqualified candidates from different groups are incorrectly screened in.
Limitations: By relaxing the constraint on false positive rates, equal opportunity can leave unchecked a disparity in who is wrongly flagged as positive — which may itself be a serious fairness concern in many contexts.
9.4.4 Calibration
Definition: A classifier is calibrated if the predicted probability of an outcome matches the actual frequency of that outcome within each demographic group. If COMPAS assigns a risk score of 70% to a set of defendants, approximately 70% of those defendants should actually reoffend — and this should hold true separately for each racial group.
In formula terms: $P(Y = 1 | \hat{P} = p, A = a) = p$ for all values of $p$ and all groups $a$.
Why it matters: Calibration is crucial for decision-makers who use predicted probabilities as inputs to their own reasoning. If a judge or a loan officer uses a score expecting it to mean "70% probability," calibration ensures that interpretation is correct. A miscalibrated model may assign higher scores to one group without those scores meaning what they appear to mean.
The ProPublica vs. Northpointe conflict: Northpointe's defense of COMPAS rested primarily on calibration. At each score level, the proportion of defendants who actually reoffended was consistent across racial groups. This is a genuine and important fairness property. A score of 7 meant roughly the same thing in probability terms for Black and white defendants.
ProPublica's critique rested on false positive rates — a different metric that COMPAS did not satisfy. These two facts coexist. The impossibility theorem (Section 9.5) explains why they must.
When it is the right metric: Calibration is essential when decision-makers need to use predicted probabilities as probabilities — when they need to know that "70% risk" actually means 70% and not 50% or 90%. It is also important for combining predictions across different models or systems.
Limitations: Calibration is consistent with large disparities in prediction rates across groups. A system can be calibrated while assigning much higher scores, on average, to one group — perhaps because of differences in features that are themselves products of historical injustice. Calibration ensures the scores are accurate on average within groups, but does not require that the distribution of scores across groups be equitable.
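A standard way to audit calibration is to bin predictions and compare, per group and per bin, the mean predicted score against the observed outcome rate. The sketch below is a minimal version of that check; the toy data is constructed so a 0.7 score is calibrated for both groups:

```python
from collections import defaultdict

def calibration_by_group(scores, outcomes, groups, bins=10):
    """For each (group, score bin): mean predicted score vs. observed rate.

    A calibrated model has these two numbers roughly equal in every cell,
    for every group.
    """
    cells = defaultdict(list)
    for s, y, a in zip(scores, outcomes, groups):
        b = min(int(s * bins), bins - 1)
        cells[(a, b)].append((s, y))
    report = {}
    for key, pairs in sorted(cells.items()):
        mean_score = sum(s for s, _ in pairs) / len(pairs)
        observed = sum(y for _, y in pairs) / len(pairs)
        report[key] = (mean_score, observed)
    return report

# Toy example: everyone scored 0.7, and 70% of each group reoffends.
scores   = [0.7] * 20
outcomes = [1]*7 + [0]*3 + [1]*7 + [0]*3
groups   = ["A"]*10 + ["B"]*10

for (a, b), (pred, obs) in calibration_by_group(scores, outcomes, groups).items():
    print(f"group={a} bin={b}: predicted={pred:.2f} observed={obs:.2f}")
```

Note that this check can pass perfectly, as here, while the score distributions across groups differ sharply — the limitation described above.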
9.4.5 Counterfactual Fairness
Definition: A decision is counterfactually fair if, had a person belonged to a different demographic group (but been otherwise identical in all causally relevant ways), they would have received the same decision.
This definition draws on causal inference and the concept of a counterfactual: what would have happened if one thing had been different? For counterfactual fairness, the counterfactual is: what decision would this person have received if their protected attribute — race, gender, etc. — had been different, holding constant all factors that are not causally downstream of that attribute?
The causal framework: Counterfactual fairness requires a causal model of how the protected attribute influences the input features. Some features are direct descendants of the protected attribute — neighborhood of residence, for example, is causally influenced by race in the United States due to historical housing discrimination. Others are not. Counterfactual fairness asks that the decision not depend on the protected attribute or any of its descendants.
Challenges: Counterfactuals for social categories are philosophically complex. What does it mean to ask what would have happened if a Black person had been white? Their entire life history might be different. Their education, neighborhood, social network, economic circumstances — all of these are influenced by their race in a society with a history of racial discrimination. The counterfactual individual may not be meaningfully comparable to the actual individual. Constructing a credible causal model of these relationships is an enormous empirical and conceptual challenge.
When it is the right metric: Counterfactual fairness is most tractable in settings where the causal relationships are well-understood and the protected attribute's influence on other variables is relatively contained. It is theoretically appealing as a framework even when exact computation is difficult.
9.4.6 Treatment Equality
Definition: Treatment equality requires that the ratio of false negatives to false positives be equal across demographic groups. It captures the idea that the relative costs of the two types of errors should be distributed equally.
Worked example: Using Table 9.1:
- Group A: FN/FP ratio = 60/100 = 0.6
- Group B: FN/FP ratio = 50/48 = 1.04
These ratios differ substantially. Group A is experiencing relatively more false positives than false negatives, while Group B experiences roughly equal numbers of each type of error. Whether this matters depends on whether the costs of false positives and false negatives are symmetric — in criminal justice, many would argue they are not.
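The check is a few lines of Python, using the FN and FP counts from the worked example above:

```python
# Treatment-equality check: ratio of false negatives to false positives
# per group, with the counts from Table 9.1.
def fn_fp_ratio(fn: int, fp: int) -> float:
    """Ratio of false negatives to false positives for one group."""
    return fn / fp

groups = {
    "Group A": {"fn": 60, "fp": 100},
    "Group B": {"fn": 50, "fp": 48},
}

ratios = {name: fn_fp_ratio(g["fn"], g["fp"]) for name, g in groups.items()}
for name, r in ratios.items():
    print(f"{name}: FN/FP = {r:.2f}")
# Group A: 0.60, Group B: 1.04 -- treatment equality is violated.
```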
When it is the right metric: Treatment equality is most relevant when the costs of false positives and false negatives differ, and the system's designers want to ensure that neither group disproportionately bears one type of error burden relative to the other.
Comparison Table: Six Fairness Metrics at a Glance
| Metric | What It Requires | What It Ignores | When It Conflicts |
|---|---|---|---|
| Demographic Parity | Equal positive prediction rates across groups | Differences in base rates; error distributions | Calibration; equalized odds |
| Equalized Odds | Equal TPR and FPR across groups | Overall prediction rates may differ | Calibration (Chouldechova 2017) |
| Equal Opportunity | Equal TPR only across groups | FPR disparities | Calibration; demographic parity |
| Calibration | Predicted probabilities match actual rates per group | Differences in score distributions across groups | Equalized odds when base rates differ |
| Counterfactual Fairness | Decision unchanged if protected attribute changed | Practical computation; causal model validity | May conflict with all group metrics |
| Treatment Equality | Equal FN/FP ratio across groups | Absolute error rates; overall accuracy | May conflict with calibration; equalized odds |
Section 9.5: The Impossibility Theorem
What we have seen in Section 9.4 is that there are multiple reasonable metrics for fairness, each capturing something important. A natural question is: can we just satisfy all of them? The answer — formally proven — is generally no.
The Mathematical Result
In 2017, Alexandra Chouldechova, a statistician at Carnegie Mellon University, proved a theorem that made rigorous what the COMPAS controversy had suggested intuitively. Independently, Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan proved a closely related result. The core finding is:
When base rates differ between groups, it is mathematically impossible to simultaneously satisfy calibration and equalized odds (equal false positive and false negative rates), unless the classifier is perfect (zero errors).
The proof follows directly from the mathematical relationships between the terms of the confusion matrix. If calibration holds — scores mean the same thing as probabilities for each group — and if base rates differ between groups (Group A has a 30% base rate of recidivism while Group B has a 50% base rate), then equalizing false positive rates while also equalizing true positive rates becomes impossible without making the predictions meaningless.
Here is the intuition: if the base rate of an outcome is genuinely different between groups, a well-calibrated system will assign different distributions of scores to those groups. Those different score distributions will necessarily produce different error rates for any given classification threshold — unless you use different thresholds for different groups. But using different thresholds violates some people's intuitions about fairness (different rules for different groups). This tension cannot be resolved.
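The intuition can be made concrete with a small numeric sketch. The score buckets below are hypothetical, not the COMPAS data: each group is perfectly calibrated by construction, yet a shared threshold produces very different error rates because the base rates differ:

```python
# Hypothetical score buckets for two groups. Within each bucket the
# true-outcome rate equals the score, so both groups are perfectly
# calibrated; their base rates differ (~0.32 vs ~0.48).
buckets = {
    "A": [(0.2, 700), (0.6, 300)],  # (score, number of people)
    "B": [(0.2, 300), (0.6, 700)],
}

def error_rates(group: str, threshold: float = 0.5):
    """FPR and TPR implied by calibrated buckets at a shared threshold."""
    tp = fp = pos = neg = 0.0
    for score, count in buckets[group]:
        pos += count * score        # expected actual positives in bucket
        neg += count * (1 - score)  # expected actual negatives in bucket
        if score >= threshold:      # bucket is flagged "high risk"
            tp += count * score
            fp += count * (1 - score)
    return fp / neg, tp / pos

for g in ("A", "B"):
    fpr, tpr = error_rates(g)
    print(f"Group {g}: FPR={fpr:.3f}, TPR={tpr:.3f}")
# Same scores, same threshold, both groups calibrated, yet FPR and TPR
# diverge sharply (roughly 0.18 vs 0.54 FPR).
```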
What This Means Practically
The impossibility theorem has profound practical implications. It means that:
- Choosing a fairness metric is a values choice, not a technical one. There is no algorithm that discovers the "right" fairness metric. There is no technical solution that achieves all reasonable fairness criteria simultaneously. The selection must be made through a substantive deliberative process that is transparent about the trade-offs involved.
- Different stakeholders will prefer different metrics — and those preferences track their interests. Northpointe, as the system's developer, preferred calibration — a metric on which COMPAS performed well. Defendants' rights advocates preferred equalized odds — a metric that highlighted the system's disparate impact on Black defendants. Neither group is wrong in its own terms. But neither is neutral.
- The metric choice has distributional consequences. Choosing calibration over equalized odds, in a world with different base rates, means accepting higher false positive rates for the group with higher base rates. In the United States, due to systematic over-policing and over-prosecution of Black communities, this means accepting that a calibrated system will produce higher false positive rates for Black defendants. Whether this is the right trade-off to make is a question about justice, not statistics.
- This does not mean fairness is impossible. The impossibility theorem does not say that all metrics are equivalent, that fairness is meaningless, or that we should give up. It says that we cannot satisfy all reasonable criteria simultaneously, and that we must make explicit, accountable choices about which criteria to prioritize. That is a demand for greater rigor and transparency, not a license for nihilism.
The Human Judgment Parallel
A useful perspective: human judges also cannot simultaneously satisfy all fairness criteria. Studies of bail decisions by human judges show that they exhibit all of the same disparities that COMPAS does — different false positive rates for Black and white defendants, different false negative rates, miscalibrated confidence. The impossibility theorem applies to human decision-making too. What this suggests is not that algorithms and humans are equally flawed, but that the comparison needs to be made carefully: both human and algorithmic systems must be evaluated on the same criteria, their errors made equally visible, and accountability mechanisms applied equally.
What the COMPAS Case Illustrates
In the COMPAS case, base rates of recidivism as measured in the dataset — which reflects who gets arrested and convicted rather than who actually commits crimes — differ between racial groups. This makes it mathematically inevitable that calibration and equalized odds cannot both be fully satisfied. Northpointe chose calibration. ProPublica highlighted the equalized odds failure that follows necessarily from that choice. Both descriptions are accurate. The controversy was fundamentally about which fairness criterion should take precedence, and who gets to make that choice.
Section 9.6: Intersectional Fairness
The metrics we have discussed so far treat demographic groups as single-axis categories — race, or gender, or age, one at a time. But real people occupy multiple categories simultaneously. A Black woman is not just Black and not just a woman; her experience as a Black woman may not be captured by analysis of either axis in isolation. Intersectional fairness addresses this limitation.
Why Single-Axis Analysis Misses Intersectional Harms
Consider a hiring algorithm that is fair by race (equal true positive rates for Black and white candidates) and fair by gender (equal true positive rates for men and women). Does this guarantee fairness for Black women? Not necessarily. It is mathematically possible for a system to satisfy fairness constraints on each individual axis while still producing worse outcomes specifically for the intersection of Black and female — because intersectional subgroups may have different base rates, feature distributions, or error structures that are masked when you look at each axis separately.
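This masking effect can be demonstrated numerically. The subgroup counts below are hypothetical, constructed so that every single-axis true positive rate works out to 0.80 while the rate for Black women is 0.60:

```python
# Hypothetical counts: correctly identified positives (tp) out of all
# actual positives (pos) in each race x gender subgroup.
subgroups = {
    ("Black", "Female"): {"tp": 12, "pos": 20},
    ("Black", "Male"):   {"tp": 20, "pos": 20},
    ("White", "Female"): {"tp": 52, "pos": 60},
    ("White", "Male"):   {"tp": 76, "pos": 100},
}

def tpr(filter_fn):
    """TPR pooled over subgroups whose (race, gender) key matches filter_fn."""
    cells = [v for k, v in subgroups.items() if filter_fn(k)]
    return sum(c["tp"] for c in cells) / sum(c["pos"] for c in cells)

print(tpr(lambda k: k[0] == "Black"))           # 0.8
print(tpr(lambda k: k[0] == "White"))           # 0.8
print(tpr(lambda k: k[1] == "Female"))          # 0.8
print(tpr(lambda k: k[1] == "Male"))            # 0.8
print(tpr(lambda k: k == ("Black", "Female")))  # 0.6 -- masked harm
```

Every marginal check passes, and only the intersectional check reveals the disparity.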
Legal scholar Kimberlé Crenshaw coined the term "intersectionality" in 1989 to describe how overlapping systems of oppression create experiences that cannot be reduced to any single axis of identity. This insight applies with full force to algorithmic fairness: a system can satisfy every single-axis fairness criterion while still imposing disproportionate harm on people who occupy multiple marginalized identities.
The Subgroup Explosion Problem
Intersectional analysis faces a practical challenge: the number of subgroups grows rapidly with the number of protected attributes. With just two binary protected attributes (say, race coded as two categories and gender coded as two categories), you have four intersectional subgroups. With race at five categories and gender at three categories, you have fifteen subgroups. Add age and disability status and the number multiplies further. With small datasets, many subgroups will have too few members to support reliable statistical analysis.
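A quick sketch of the multiplication, with illustrative category counts:

```python
# Subgroup counts multiply as protected attributes are added.
# The category counts here are illustrative.
attributes = {
    "race": 5,
    "gender": 3,
    "age_bracket": 4,
    "disability": 2,
}

n = 1
for name, k in attributes.items():
    n *= k
    print(f"after adding {name}: {n} subgroups")
# With 120 subgroups and, say, 2,000 evaluation records, the average
# subgroup has under 17 members, too few for stable error-rate estimates.
```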
This is a real constraint, not an excuse. It argues for:
- Prioritizing the most at-risk intersections. Even if you cannot analyze all possible intersections, you can focus on combinations that prior knowledge suggests are likely to be most vulnerable to harm.
- Collecting larger and more representative datasets. Better data collection enables intersectional analysis. This may require deliberate oversampling of smaller subgroups.
- Using uncertainty quantification. Report confidence intervals on fairness metrics so that readers can see when a subgroup estimate is too uncertain to be reliable.
Multicalibration
A formal approach to intersectional fairness appears in work by Hébert-Johnson, Kim, Reingold, and Rothblum (2018) on multicalibration: the requirement that a predictor be calibrated not just for each protected group but for every efficiently computable subgroup of the population. Multicalibration is a much stronger condition than ordinary calibration — it requires that the predicted probabilities be accurate for every intersection of characteristics that can be computed by a simple function. This is a technically demanding standard, but it provides a theoretical framework for taking intersectional analysis seriously.
Practical Recommendations
For organizations deploying high-stakes classification systems, the minimum intersectional analysis should include:
- Performance evaluation for race × gender subgroups
- Performance evaluation for race × age brackets (e.g., young vs. older)
- Any intersections known from prior research or community input to be particularly vulnerable
- Honest reporting of when subgroup sizes are too small for reliable estimates
Section 9.7: Context Dependence of Fairness Metrics
There is no universal fairness metric. The metric that is appropriate for criminal justice bail decisions is not necessarily appropriate for credit approval, and neither may be right for medical diagnosis. Choosing the right metric requires understanding the context — the type of decision, the costs of errors, the vulnerability of affected parties, what affected communities want, and what the law requires.
A Framework for Metric Selection
Step 1: What type of decision is being made?
Screening decisions (admit/deny, flag/pass) have different implications than scoring decisions (rank-ordering individuals on a continuous scale) or classification decisions (assigning individuals to categories). Screening decisions produce sharp in/out distinctions where threshold choices are especially consequential. Scoring decisions produce orderings where rank disparity may be the relevant fairness concern. Classification decisions depend on whether the categories themselves are appropriate.
Step 2: What are the costs of different errors?
Identify specifically what happens when the system makes a false positive versus a false negative, and for whom. In criminal justice, the costs of false positives fall primarily on defendants; the costs of false negatives may fall on potential future victims and on public trust. In cancer screening, false negatives are life-threatening while false positives are costly and stressful. The cost structure should inform which error rates matter most.
Step 3: Who is most vulnerable to harm?
Fairness metrics should be chosen to protect the parties who are most exposed to adverse consequences and who have the least ability to contest or recover from wrong decisions. In general, this argues for paying more attention to error rates for historically marginalized groups, and for error types that produce the most severe and least reversible harms.
Step 4: What do affected communities say they want?
This is the most frequently neglected step. Affected communities often have sophisticated views about what fairness means to them, rooted in their lived experience of the systems they are being subjected to. Surveys of criminal justice-involved communities, for instance, have found significant variation in how people weight different fairness criteria — and those preferences do not always match what researchers or policymakers assume. Participatory design processes that include affected communities in metric selection produce both better metrics and greater legitimacy.
Step 5: What does the law require?
Legal requirements create floors below which no fairness metric selection is acceptable. In the United States, the major frameworks include:
- Title VII disparate impact doctrine: A selection rate for a protected group below 80% of the most-favored group's rate triggers scrutiny (the "four-fifths rule"). This is effectively a demographic parity floor for employment decisions.
- Equal Credit Opportunity Act (ECOA): Prohibits discriminatory credit decisions and requires that credit models be validated for disparate impact.
- Fair Housing Act: Prohibits discriminatory effects in housing decisions, including algorithmic ones.
- EU AI Act: Classifies AI systems used in credit, employment, education, and justice as high-risk, requiring conformity assessments that include bias testing.
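A minimal four-fifths-rule check, using hypothetical selection counts:

```python
# Four-fifths (80%) rule: compare each group's selection rate to the
# most-favored group's rate. Selection counts below are hypothetical.
selected = {"Group A": 120, "Group B": 60, "Group C": 90}
applied = {"Group A": 200, "Group B": 150, "Group C": 160}

rates = {g: selected[g] / applied[g] for g in selected}
best = max(rates.values())

for g, r in rates.items():
    impact_ratio = r / best
    flag = "REVIEW" if impact_ratio < 0.8 else "ok"
    print(f"{g}: rate={r:.2f}, ratio vs best={impact_ratio:.2f} [{flag}]")
# Group B's ratio (0.67) falls below 0.8 and would trigger scrutiny.
```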
Domain-Specific Guidance
Criminal justice: The stakes are asymmetric and severe. False positives deprive people of liberty. Equalized odds — ensuring that both true positive rates and false positive rates are equal across racial groups — is the most defensible fairness criterion for systems that influence incarceration decisions. This should be the minimum standard.
Credit and lending: Both demographic parity (equal approval rates) and equalized odds (equal error rates) are relevant. The ECOA's requirement to avoid disparate impact pushes toward demographic parity. Equalized odds ensures that qualified applicants from all groups are equally likely to be identified correctly.
Hiring and promotion: The four-fifths rule provides a legal floor. Beyond compliance, organizations deploying AI-assisted hiring should evaluate both the selection rate and the error rate for each protected group. Where the candidate pool includes people from groups historically excluded from the industry, calibration becomes particularly important: does a score of "strong candidate" mean the same thing across demographic groups?
Healthcare: Calibration is often paramount — clinical decision-makers need to trust that predicted risks are accurately scaled. But equalized odds is also critical: equal access to interventions for people who need them, and equal rates of unnecessary interventions for people who do not, across demographic groups. The stakes make this a domain that demands both criteria be taken seriously, even if the impossibility theorem means they cannot both be perfectly satisfied.
The Participatory Approach
Metric selection is too consequential to be done only by the technical team. A genuinely ethical approach to fairness metric selection involves:
- Convening a diverse stakeholder group that includes affected community representatives
- Presenting the available metrics, their trade-offs, and what would be gained and lost by each
- Making the deliberation transparent and documented
- Revisiting metric selection regularly as the context evolves
This is not just a procedural nicety. Organizations that skip this step regularly find that their technically sound fairness claims fail to persuade affected communities — because those communities were never consulted about what fairness means to them. And in a domain as contested as algorithmic fairness, legitimacy matters.
Section 9.8: Fairness Measurement in Practice — Organizational Processes
Understanding fairness metrics conceptually is necessary but not sufficient. Organizations need to embed fairness measurement into their actual processes — before models are deployed, and continuously after. This section provides practical guidance for doing so.
When to Measure
Pre-deployment testing: Before any system that makes or influences decisions about people is deployed, its behavior should be characterized across demographic groups. This means:
- Evaluating all relevant fairness metrics (not just the most favorable one) on a held-out test set
- Documenting the results, including the metric chosen and the reasoning behind it
- Setting explicit fairness thresholds that the system must meet before deployment
- Involving diverse reviewers, including people from affected communities, in the evaluation
Post-deployment monitoring: Fairness is not a property you check once and certify. Model performance drifts as the world changes. The population of people subject to decisions may shift. Feedback loops — where the model's decisions affect the data it will be trained on in the future — can create dynamic disparities that emerge over time. Organizations need:
- Ongoing collection of outcome data disaggregated by protected characteristics
- Automated alerts when fairness metrics exceed thresholds
- Regular audits (quarterly, at minimum, for high-stakes systems) with external review
- Clear processes for investigating disparities and determining whether they warrant model revision or discontinuation
The Data Challenge
Fairness measurement requires data you may not have. Specifically, it requires:
- Protected attribute data: To measure disparate impact, you need to know which group each affected person belongs to. Many organizations do not collect race, gender, or disability data — sometimes for privacy reasons, sometimes because of legal uncertainty, sometimes because of organizational discomfort.
- Outcome data: To compute most fairness metrics, you need to know what actually happened — not just what the model predicted. For loan approval, did the applicant actually default? For criminal justice, did the defendant actually reoffend? This data may be unavailable, delayed, or itself biased (as recidivism data reflects who gets arrested and prosecuted, not who commits crimes).
- Disaggregated outcome data: You need both protected attribute data and outcome data, linked at the individual level, for fairness metrics to be computable.
Legal Constraints on Data Collection
In the United States, anti-discrimination law creates real tensions around data collection. Employers are generally prohibited from using race as a factor in employment decisions — but you need race data to know whether your hiring algorithm is producing racially disparate outcomes. The law permits collecting demographic data for compliance monitoring purposes, but there are constraints on how it can be used. Credit regulations require financial institutions to analyze whether their models produce disparate impacts but restrict using race as an input variable. These tensions are real and require careful legal analysis.
Statistical Power
Even when demographic data is available, small group sizes can make fairness metrics unreliable. If you have only 50 members of a minority group in your evaluation dataset, confidence intervals on your false positive rate estimate will be very wide — you may not be able to reliably distinguish a 20% difference in false positive rates from random noise. This argues for:
- Calculating and reporting confidence intervals on all fairness metrics
- Using equivalence testing methodologies to assess whether small observed differences are practically significant
- Being honest about when subgroup sizes are insufficient for reliable analysis — and using that as a reason to collect more data, not to avoid the analysis
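The small-subgroup problem is easy to see with the Wilson score interval for a binomial proportion. The subgroup below is hypothetical: 10 false positives out of 50 negatives, an observed FPR of 0.20:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score confidence interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Hypothetical small subgroup: 10 false positives among 50 negatives.
lo, hi = wilson_interval(10, 50)
print(f"Observed FPR 0.20, 95% CI ({lo:.3f}, {hi:.3f})")
# The interval spans roughly 0.11 to 0.33 -- far too wide to reliably
# distinguish this group's FPR from, say, 0.30 in another group.
```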
Fairness Tools
Several open-source toolkits support fairness analysis:
AI Fairness 360 (IBM): A comprehensive Python library implementing more than seventy fairness metrics and more than ten bias mitigation algorithms. Includes pre-processing, in-processing, and post-processing approaches. Well-documented and actively maintained.
Fairlearn (Microsoft): A Python package focused on assessing and mitigating fairness issues in classification, regression, and clustering. Includes visualizations and a dashboard for comparing models.
Aequitas (University of Chicago): A bias and fairness audit toolkit designed for practitioners, with a focus on clarity of output and policy relevance. Developed explicitly for criminal justice applications.
What-If Tool (Google): A visual tool for exploring model behavior across different subsets of a dataset, supporting counterfactual analysis and metric comparison.
These tools are useful for technical analysis but cannot substitute for the substantive deliberation about which metrics to use and how to interpret them.
Building Fairness into Model Documentation
Every model that influences decisions about people should include a fairness section in its documentation, specifying:
- Which protected characteristics were included in fairness analysis
- Which fairness metrics were computed and what values were found
- Which fairness criterion the model was optimized or selected for, and why
- What trade-offs were accepted and who made that decision
- What the monitoring plan is for ongoing assessment
Model cards (Mitchell et al., 2019) and datasheets for datasets (Gebru et al., 2018) provide frameworks for this kind of documentation. The practice of publishing fairness documentation is increasingly expected by regulators, researchers, and sophisticated organizational customers.
Section 9.9: Beyond Binary — Fairness in Multi-Class and Regression Settings
Most of the academic fairness literature, and most of this chapter so far, focuses on binary classification — systems that output yes/no, high/low, approve/deny. The real world is considerably more complex. Fairness analysis must extend to multi-class classification, regression, and ranking settings.
Multi-Class Fairness
Many real-world systems assign inputs to more than two categories. A job screening system might categorize candidates as "Strong Yes," "Borderline," "No." A medical triage system might assign severity levels. A content moderation system might assign content to multiple violation categories.
In multi-class settings, the pairwise relationships between groups and outcomes multiply. Demographic parity requires that the distribution across all classes be equal across groups. Equalized odds requires that the confusion matrix be similar across groups for every pair of predicted and actual classes. The computational complexity grows, and the number of statistics to report increases.
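A minimal multi-class parity check compares each group's distribution over predicted classes; total variation distance is one simple summary. The counts below are hypothetical:

```python
# Multi-class demographic parity check: compare per-group distributions
# over predicted classes. Counts are hypothetical.
predictions = {
    "Group A": {"Strong Yes": 30, "Borderline": 50, "No": 120},
    "Group B": {"Strong Yes": 15, "Borderline": 45, "No": 140},
}

def class_distribution(counts):
    """Normalize class counts to a probability distribution."""
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()}

dist_a = class_distribution(predictions["Group A"])
dist_b = class_distribution(predictions["Group B"])

# Total variation distance: 0 means identical class distributions.
tvd = 0.5 * sum(abs(dist_a[c] - dist_b[c]) for c in dist_a)
print(f"TV distance between group distributions: {tvd:.3f}")
```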
An additional complication: the categories themselves may be contested. If a content moderation system categorizes posts as "hate speech," "harassment," or "neither," the category definitions encode judgments about what speech is harmful — judgments that may themselves be applied differently across demographic groups.
Regression Fairness
Regression systems produce continuous outputs: a predicted income, a credit score, a risk probability. Fairness in regression requires that the prediction error be equally distributed across groups — not just that the average prediction be similar, but that the distribution of errors (residuals) be similarly shaped.
- Mean prediction parity: Equal average predictions across groups.
- Residual equity: Equal variance and distribution of prediction errors across groups.
- Calibration in regression: Predictions are on average accurate within each group across the range of predicted values.
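A minimal per-group residual summary, using hypothetical (actual, predicted) pairs, shows why mean prediction parity alone is not enough:

```python
import statistics

# Hypothetical (actual, predicted) pairs for a regression model.
data = {
    "Group A": [(50, 52), (60, 57), (55, 56), (70, 68)],
    "Group B": [(50, 45), (60, 70), (55, 48), (70, 79)],
}

for group, pairs in data.items():
    residuals = [pred - actual for actual, pred in pairs]
    print(
        f"{group}: mean residual={statistics.mean(residuals):+.2f}, "
        f"residual stdev={statistics.stdev(residuals):.2f}"
    )
# Both groups have small mean residuals, but Group B's errors are far
# more dispersed -- a residual-equity failure that the means conceal.
```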
Ranking Fairness
Recommendation systems, search engines, and any system that rank-orders items or individuals create a distinct fairness challenge: even small systematic biases in how items are scored translate into large disparities in who appears at the top of the ranked list. Being ranked first versus tenth has enormous practical consequences — in search results, in job candidate queues, in college application reviews where evaluators may not look past the first page.
Key fairness concepts for ranking:
- Exposure fairness: Do items from different groups receive similar amounts of exposure (visibility in high-ranked positions)?
- Utility fairness: Do users receive equally relevant results regardless of which group the top-ranked items come from?
- Position parity: Are items from different groups distributed similarly across rank positions?
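A sketch of an exposure comparison, using the common logarithmic position discount on a hypothetical ranked list:

```python
import math

# Exposure fairness sketch: position-discounted visibility per group,
# using the common 1 / log2(rank + 1) discount. The ranked list of
# group labels below is hypothetical.
ranking = ["A", "A", "B", "A", "B", "B", "B", "A"]

exposure = {}
for rank, group in enumerate(ranking, start=1):
    exposure[group] = exposure.get(group, 0.0) + 1 / math.log2(rank + 1)

total = sum(exposure.values())
for group, e in sorted(exposure.items()):
    print(f"Group {group}: {e / total:.1%} of total exposure")
# Equal representation (four items each) but unequal exposure, because
# Group A holds the top positions.
```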
The Challenge of Continuous Outcomes
When outcomes are themselves continuous — credit default risk, patient outcomes, job performance — defining ground truth becomes complicated. The outcome we measure may itself be biased: job performance ratings may reflect supervisor bias; credit default correlates with economic shocks that fall unevenly on demographic groups; patient outcomes depend on differential access to follow-up care. Fairness analysis in regression settings must grapple with the possibility that the outcome measure is itself a product of social inequity.
Section 9.10: Organizational Implications — Making Fairness a Practice
Everything in this chapter — the metrics, the impossibility theorem, the domain guidance — is knowledge that is only useful if it is embedded into organizational practice. Knowing that equalized odds is the right metric for criminal justice applications is insufficient if the organization deploying a criminal justice tool does not measure equalized odds before deployment and monitor it afterward.
This section addresses the organizational infrastructure that makes fairness a practice rather than a claim.
Fairness as a Continuous Process
The most important organizational shift is from treating fairness as a checklist item — something you check at deployment — to treating it as an ongoing process. The model's behavior in production will differ from its behavior in testing. Populations change. Policies change. Economic conditions change. What was fair yesterday may not be fair today.
Leading organizations build fairness monitoring into their standard operational processes with the same rigor they apply to performance monitoring. Dashboards track fairness metrics alongside accuracy metrics. Automated alerts notify teams when metrics drift. Root cause analyses investigate disparities. Remediation processes exist for when disparities are found.
Accountability Structures
Fairness is too important to be left to any one team. Effective organizational accountability structures include:
- Technical responsibility: Data scientists and ML engineers who compute and report fairness metrics
- Business responsibility: Product and business leaders who make the substantive decisions about which metrics matter and what trade-offs are acceptable
- Legal and compliance: Counsel who ensure that fairness analysis meets regulatory requirements
- Ethics oversight: Whether an internal ethics board, an external advisory panel, or a formal audit relationship, some form of external review provides accountability that internal teams alone cannot
- Community engagement: Formal mechanisms for affected communities to provide input and raise concerns
The Role of External Auditors
External auditing of AI systems for fairness is an emerging practice with important limitations and important potential. External auditors can provide independence that internal teams cannot — they are not subject to the same organizational pressures to approve a deployment or to characterize results favorably. They can also bring expertise and methodological rigor that may exceed what internal teams have.
But external audits are not a magic bullet. Auditors need access to model internals, training data, and deployment data — which vendors often resist providing. Audit methodologies are not yet standardized. And a favorable audit report does not guarantee that a system is fair; it guarantees that the system passed the tests the auditors ran, which may not capture all relevant fairness dimensions.
Despite these limitations, external auditing is an important accountability mechanism, and one that regulators in the EU and increasingly in the US are moving toward requiring for high-stakes applications. We return to this topic in Chapter 19.
Documenting Fairness Choices
Fairness documentation serves multiple purposes: it creates accountability (decisions are on record), it enables review and challenge, and it provides institutional memory when teams change. At minimum, fairness documentation should record:
- What fairness metrics were computed and what values were found
- Which metric was selected as the primary criterion and who made that decision
- What trade-offs were accepted and on what basis
- Who was consulted, including any affected community engagement
- What monitoring is in place and what the escalation process is for metric violations
Communicating Fairness to Stakeholders
Communicating about fairness is difficult because the subject involves technical concepts, contested values, and high stakes. Some principles for effective communication:
Be specific about which metric: "Our system is fair" is meaningless without specifying what fairness metric was applied. "Our system satisfies calibration across racial groups but does not satisfy equalized odds" is informative.
Acknowledge trade-offs: Pretending that all fairness criteria are satisfied simultaneously — when the impossibility theorem proves this is not possible — is ethics washing. Honest communication acknowledges what was not achieved.
Explain in plain language: Confusion matrices and false positive rates are technical concepts that require translation for non-specialist audiences. Invest in communication that makes these concepts accessible without oversimplifying the trade-offs.
Connect to affected communities' experience: Statistics gain meaning when they are connected to what they mean for people's lives. "Black defendants are flagged as high-risk at twice the rate of equally low-risk white defendants" is more communicatively powerful than a table of false positive rates, because it connects the statistic to its human consequence.
Preview: Chapters 19 and 15
The organizational processes described in this section connect to two later chapters. Chapter 19 examines algorithmic auditing in depth — what good auditing practice looks like, what the current regulatory landscape requires, and what the limits of auditing are. Chapter 15 addresses communication — how to talk about AI risk, fairness, and uncertainty with a range of stakeholder audiences, from technical colleagues to boards of directors to media and the public.
Understanding fairness metrics is the first step. Building the organizational infrastructure to measure them, communicate about them, and act on them is the harder and more consequential work.
Discussion Questions
1. Northpointe argued that COMPAS was fair because it was calibrated, while ProPublica argued it was unfair because of unequal false positive rates. Given what you now know about the Chouldechova impossibility theorem, evaluate both positions. Who do you think has the more defensible fairness claim, and why? Does your answer change if you consider whose interests each metric protects?
2. A hiring algorithm used by a major technology company is found to have equal true positive rates for male and female candidates (equal opportunity satisfied) but substantially higher false positive rates for female candidates (equalized odds not satisfied). The company's HR leadership argues that what matters is that equally qualified female candidates are identified at equal rates, and that higher false positive rates for women simply means more women are being considered — a positive outcome. Evaluate this argument. What is it getting right? What is it missing?
3. The framework in Section 9.7 recommends involving affected communities in the selection of fairness metrics. What practical challenges does this recommendation face? How would you design a community engagement process for a county considering deploying an algorithmic pretrial risk assessment tool? Who would you include, and how would you present the trade-offs in accessible terms?
4. Consider a healthcare algorithm used to predict which patients should receive additional case management resources. The algorithm is found to perform well on calibration but to have a lower true positive rate for elderly patients compared to younger patients — it is less likely to correctly identify elderly patients who need intensive care as high-need. Using the framework in Section 9.7, what fairness metric would you prioritize in this context, and why? What would you do about the discovered disparity?
5. "The impossibility theorem proves that algorithmic fairness is impossible, so we should give up on it and focus on improving the social conditions that create unequal base rates." Evaluate this argument. Is it logically valid? Is it practically wise? What is it getting right and what is it missing?
6. An organization deploying a credit scoring algorithm collects no racial demographic data, arguing that this ensures the model cannot discriminate by race. A fair lending advocate argues that without demographic data, the organization cannot know whether its model discriminates and cannot fix it if it does. Evaluate both positions. What are the legal, ethical, and practical considerations?
7. Review the six fairness metrics described in Section 9.4. For each of the following three domains, identify which metric or combination of metrics you would prioritize as the primary fairness criterion, and explain your reasoning: (a) a parole recommendation algorithm; (b) an algorithm that recommends which children should receive additional educational support services; (c) a fraud detection algorithm that flags financial transactions for human review.
Chapter 9 is part of a running case study that continues in Chapters 18, 19, 20, and 30. The COMPAS algorithm and the Chouldechova impossibility theorem are referenced throughout the book as a touchstone for understanding the fundamental challenges of algorithmic fairness.