> "Fairness is not a single concept. It's a family of competing concepts, and choosing among them is a moral and political act."
Learning Objectives
- Define and distinguish four major mathematical definitions of algorithmic fairness
- Explain demographic parity and identify its strengths and limitations
- Explain equalized odds and its relationship to error-rate parity
- Explain calibration (predictive parity) and why it often conflicts with other fairness definitions
- State the impossibility theorem and explain why simultaneously satisfying multiple fairness criteria is generally infeasible
- Use a Python FairnessCalculator to compute multiple fairness metrics for the same system
- Analyze a real-world scenario to determine which fairness definition is most appropriate given stakeholder values
- Articulate why choosing a fairness metric is a political and ethical decision, not merely a technical one
In This Chapter
- Chapter Overview
- 15.1 The Fairness Problem
- 15.2 Demographic Parity (Statistical Parity)
- 15.3 Equalized Odds
- 15.4 Calibration (Predictive Parity)
- 15.5 Individual Fairness vs. Group Fairness
- 15.6 The Impossibility Theorem
- 15.7 Building a Fairness Calculator in Python
- 15.8 Choosing a Fairness Metric: Values, Not Variables
- 15.9 Fairness Beyond Metrics
- 15.10 The Eli Thread: Fairness and the Criminal Justice System
- 15.11 Chapter Summary
- What's Next
- Chapter 15 Exercises → exercises.md
- Chapter 15 Quiz → quiz.md
- Case Study: ProPublica vs. Northpointe — The COMPAS Fairness Debate → case-study-01.md
- Case Study: Fairness in College Admissions Algorithms → case-study-02.md
Chapter 15: Fairness — Definitions, Tensions, and Trade-offs
"Fairness is not a single concept. It's a family of competing concepts, and choosing among them is a moral and political act." — Arvind Narayanan, Princeton University (2018 tutorial, "21 Fairness Definitions and Their Politics")
Chapter Overview
Chapter 14 established that algorithmic bias is real, pervasive, and consequential. The natural next question — the one every student asks, the one every policymaker demands an answer to — is: how do we fix it?
The answer begins with a prior question that turns out to be far harder than anyone expected: what does "fair" mean?
This chapter explores the discovery that fairness is not a single concept but a family of competing mathematical definitions, each capturing a different moral intuition about what equality requires. Demographic parity demands equal selection rates across groups. Equalized odds demands equal error rates. Calibration demands equal predictive accuracy. Individual fairness demands that similar individuals receive similar treatment.
These definitions sound complementary. They are not. In a landmark result, Chouldechova (2017) and Kleinberg, Mullainathan, and Raghavan (2016) proved independently that — except in trivial cases — you cannot satisfy multiple fairness definitions simultaneously when base rates differ across groups. This is the impossibility theorem, and it is one of the most important results in the entire field of algorithmic fairness.
The impossibility theorem means that building a "fair" algorithm requires choosing which kind of fairness to prioritize — and that choice is not a technical decision. It is an ethical and political one.
In this chapter, you will learn to:
- Define and compute four major fairness metrics
- Demonstrate, using Python, that the same system can be "fair" by one metric and "unfair" by another
- Explain the impossibility theorem and its practical implications
- Analyze a fairness dilemma from multiple stakeholder perspectives
- Recognize that choosing a fairness metric is choosing a set of values
15.1 The Fairness Problem
15.1.1 Why Definition Matters
In Chapter 14, we said that the COMPAS algorithm produced racially disparate outcomes. ProPublica said the system was unfair because it had different false positive rates for Black and white defendants. Northpointe said the system was fair because it was calibrated — defendants with the same score had similar recidivism rates regardless of race.
Both were correct. How is that possible?
It is possible because ProPublica and Northpointe were using different definitions of fairness. ProPublica measured fairness by error-rate parity (a form of equalized odds). Northpointe measured fairness by calibration (predictive parity). These are different mathematical properties, and a system can satisfy one while violating the other.
This is not a bug in the debate. It is the central problem of algorithmic fairness: there are multiple reasonable definitions, they capture different moral intuitions, and they are often mutually incompatible.
"I thought fairness was simple," Mira said in class. "You just... treat everyone equally. Don't discriminate. It seems obvious."
"Obvious," Dr. Adeyemi replied, "until you try to define what 'equally' means. Does it mean equal rates of selection? Equal accuracy? Equal error rates? Equal treatment of similar individuals? These all sound like the same thing. They're not. And today you'll see why."
15.1.2 The Setup: A Binary Classification Problem
To make the fairness definitions concrete, we'll use a common framework: a binary classification system that makes predictions about individuals from two groups.
Terminology:
| Term | Meaning |
|---|---|
| Positive prediction | The system predicts the target outcome, which may be desirable ("will repay loan") or adverse ("will reoffend," "high health risk") depending on the application |
| Negative prediction | The system predicts the non-target outcome |
| True positive (TP) | Predicted positive, actually positive |
| False positive (FP) | Predicted positive, actually negative |
| True negative (TN) | Predicted negative, actually negative |
| False negative (FN) | Predicted negative, actually positive |
| Base rate | The actual rate of the positive outcome in a group (the "ground truth" prevalence) |
For each group g, we can compute:
- Selection rate = (TP + FP) / (TP + FP + TN + FN) — proportion predicted positive
- True positive rate (TPR) = TP / (TP + FN) — proportion of actual positives correctly identified (also called sensitivity or recall)
- False positive rate (FPR) = FP / (FP + TN) — proportion of actual negatives incorrectly flagged
- Positive predictive value (PPV) = TP / (TP + FP) — proportion of positive predictions that are correct
These metrics form the building blocks of every fairness definition that follows.
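These definitions translate directly into code. The helper below is a minimal sketch (the function name `rates_from_counts` is ours, not a standard library function); it computes the four building-block rates from raw confusion-matrix counts:

```python
def rates_from_counts(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute the four building-block rates from confusion-matrix counts."""
    total = tp + fp + tn + fn
    return {
        "selection_rate": (tp + fp) / total,          # proportion predicted positive
        "TPR": tp / (tp + fn) if (tp + fn) else 0.0,  # sensitivity / recall
        "FPR": fp / (fp + tn) if (fp + tn) else 0.0,  # false alarms among actual negatives
        "PPV": tp / (tp + fp) if (tp + fp) else 0.0,  # precision
    }

# Example: out of 100 people, 20 TP, 10 FP, 60 TN, 10 FN.
r = rates_from_counts(20, 10, 60, 10)
# selection_rate = 0.30, TPR = 20/30 ≈ 0.667, FPR = 10/70 ≈ 0.143, PPV = 20/30 ≈ 0.667
```

Note that TPR and PPV are computed over different denominators (actual positives vs. predicted positives); keeping that distinction straight is what separates the fairness definitions that follow.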
15.2 Demographic Parity (Statistical Parity)
15.2.1 The Definition
Demographic parity (also called statistical parity) requires that the selection rate — the proportion of each group receiving the positive prediction — be equal across groups.
Formally: P(Predicted Positive | Group A) = P(Predicted Positive | Group B)
In plain language: the algorithm should approve (or flag, or select) the same proportion of each group.
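Checking demographic parity requires only predictions and group labels, not ground truth. A minimal sketch (the helper name `selection_rates` is illustrative, not part of any library):

```python
def selection_rates(predictions: list, groups: list) -> dict:
    """Selection rate (proportion predicted positive) for each group."""
    rates = {}
    for g in set(groups):
        preds = [p for p, grp in zip(predictions, groups) if grp == g]
        rates[g] = sum(preds) / len(preds)
    return rates

preds  = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
rates = selection_rates(preds, groups)
# Group A: 3/4 = 0.75 selected; Group B: 1/4 = 0.25 — demographic parity is violated.
```

That this check never looks at the `actuals` is precisely the limitation discussed below: demographic parity is blind to whether the predictions are correct.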
15.2.2 The Moral Intuition
Demographic parity captures a straightforward intuition about fairness: if 30% of white applicants are approved for a loan, then 30% of Black applicants should also be approved. Any difference in selection rates indicates that the system is treating groups differently.
This maps to a vision of fairness as equal representation in outcomes — the idea that a fair system should produce the same results regardless of group membership.
15.2.3 Strengths
- Simple and auditable. Selection rates are easy to compute and easy for non-technical stakeholders to understand.
- Addresses structural inequality. If historical discrimination has created disparities in the underlying data, demographic parity prevents the algorithm from reproducing those disparities in its predictions.
- Aligned with legal standards. The disparate impact framework (Chapter 14, four-fifths rule) is essentially a relaxed form of demographic parity.
15.2.4 Limitations
- Ignores accuracy. Demographic parity can be achieved by a completely random system. A coin flip produces equal selection rates regardless of group — but it's not a useful prediction. Demographic parity asks only about outcomes, not about whether the outcomes are correct.
- May require treating unequals equally. If the actual base rates differ across groups (e.g., if one group genuinely has higher default rates due to economic conditions shaped by structural inequality), demographic parity requires approving applicants at the same rate regardless — which may mean approving less-qualified applicants from one group or denying more-qualified applicants from another.
- May harm the intended beneficiaries. In lending, approving applicants who are likely to default does not help them — it saddles them with debt they cannot repay. An algorithm that achieves demographic parity in approvals but leads to disparate default rates may have shifted the location of the harm rather than eliminated it.
- Ignores the question of why base rates differ. Demographic parity treats differential base rates as a given and overrides them with equal selection. But it does not address why the base rates differ — and the reasons are usually structural (poverty, discrimination, unequal access). Achieving demographic parity in loan approvals does not fix the poverty that causes differential default rates.
Reflection: Consider a university admissions algorithm. If demographic parity requires admitting the same percentage of applicants from each racial group, but one group has historically had less access to educational resources and therefore lower test scores, does demographic parity produce a fair outcome or merely an equal-looking one? What happens to admitted students who were not academically prepared — if the university does not invest in support programs, has demographic parity helped them or set them up to fail? Is the metric measuring fairness or masking the absence of structural change?
Ethical Dimensions: The debate about demographic parity maps onto a deeper debate in political philosophy: should fairness be measured by equality of opportunity (everyone has the same chance) or equality of outcome (everyone achieves the same result)? Demographic parity is an outcome-based standard. Its critics often favor opportunity-based standards. But as we discussed in Chapter 6 (justice theory), the distinction is less clear than it appears — because unequal opportunities produce unequal qualifications, which produce unequal outcomes, in a cycle that makes it impossible to cleanly separate "opportunity" from "outcome."
15.3 Equalized Odds
15.3.1 The Definition
Equalized odds (Hardt, Price, and Srebro, 2016) requires that the algorithm's error rates be equal across groups. Specifically, it requires that both the true positive rate and the false positive rate be the same for each group.
Formally:
- P(Predicted Positive | Actually Positive, Group A) = P(Predicted Positive | Actually Positive, Group B) — equal TPR
- P(Predicted Positive | Actually Negative, Group A) = P(Predicted Positive | Actually Negative, Group B) — equal FPR
A relaxed version, equal opportunity, requires only that the true positive rate be equal across groups (equalizing benefit access) without constraining the false positive rate.
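The distinction between the full and relaxed versions is easy to see in code. A minimal sketch (the helper name `error_rates` is ours) that computes both error rates per group:

```python
def error_rates(predictions: list, actuals: list, groups: list) -> dict:
    """True positive rate and false positive rate for each group."""
    out = {}
    for g in set(groups):
        rows = [(p, a) for p, a, grp in zip(predictions, actuals, groups) if grp == g]
        tp = sum(1 for p, a in rows if p == 1 and a == 1)
        fn = sum(1 for p, a in rows if p == 0 and a == 1)
        fp = sum(1 for p, a in rows if p == 1 and a == 0)
        tn = sum(1 for p, a in rows if p == 0 and a == 0)
        out[g] = {"TPR": tp / (tp + fn) if (tp + fn) else 0.0,
                  "FPR": fp / (fp + tn) if (fp + tn) else 0.0}
    return out

preds   = [1, 0, 1, 0,  1, 0, 0, 0]
actuals = [1, 1, 0, 0,  1, 1, 0, 0]
grps    = ["A"] * 4 + ["B"] * 4
rates = error_rates(preds, actuals, grps)
# Group A: TPR = 0.5, FPR = 0.5.  Group B: TPR = 0.5, FPR = 0.0.
# Equal opportunity (equal TPR) holds, but full equalized odds fails
# because the false positive rates differ.
```

Unlike the demographic parity check, this one requires the `actuals` column, which is exactly why the ground-truth problem discussed in the limitations below matters so much here.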
15.3.2 The Moral Intuition
Equalized odds captures the intuition that a fair system should make the same types of mistakes regardless of group membership. If you are actually qualified for a loan, your probability of being correctly approved should not depend on your race. If you are actually not going to reoffend, your probability of being falsely flagged as high-risk should not depend on your race.
This is the definition that ProPublica was (implicitly) applying when it found that COMPAS was unfair: the false positive rate was dramatically different for Black and white defendants.
15.3.3 Strengths
- Focuses on errors, not just outcomes. A system can have different selection rates and still be fair under equalized odds — as long as the differences are justified by actual differences in the target variable.
- Separates accuracy from equity. Equalized odds allows the system to be accurate while still demanding that accuracy be equitable across groups.
- Protects against the most harmful errors. False positives in criminal justice (jailing people who would not have reoffended) and false negatives in healthcare (missing patients who need care) are where the human cost is highest. Equalizing these error rates across groups ensures that the burden of errors is not concentrated on one group.
15.3.4 Limitations
- Assumes reliable ground truth. Equalized odds requires knowing the actual outcome (did the defendant reoffend? did the patient develop the condition?) to calculate error rates. But ground truth is often contaminated by the very biases we're trying to correct — recidivism data reflects policing bias, health outcomes reflect access disparities, hiring outcomes reflect workplace culture. If the ground truth is biased, equalizing error rates relative to that ground truth may perpetuate rather than correct the bias.
- May conflict with calibration. As we'll see in Section 15.6, equalizing error rates across groups is generally incompatible with equal predictive accuracy across groups when base rates differ. This is not a fixable limitation — it is a mathematical impossibility.
- Computationally harder. Achieving equalized odds typically requires post-processing adjustments to model outputs, which may reduce overall accuracy. The system must be accurate and equitable, which creates tension when the two objectives pull in different directions.
- Requires choosing which errors to equalize. Full equalized odds requires equalizing both TPR and FPR. But in practice, you may only be able to approximate one. Equal opportunity (equalizing TPR only) is a common relaxation — but the choice of which error rate to prioritize is itself a value judgment. In criminal justice, equalizing FPR (ensuring that innocents are equally unlikely to be falsely flagged, regardless of race) may be the priority. In healthcare, equalizing TPR (ensuring that sick patients are equally likely to be identified, regardless of race) may matter more.
Connection: The choice between equalizing FPR and equalizing TPR is not a technical question. It is a question about which group of people we are most concerned about protecting from errors. Equalizing FPR protects people who are not in the positive class (innocent defendants, healthy patients) from being incorrectly flagged. Equalizing TPR protects people who are in the positive class (defendants who will reoffend, patients who need care) from being incorrectly missed. The first is about avoiding false positives; the second is about avoiding false negatives. Which matters more depends on the context — and on whose perspective you prioritize.
15.4 Calibration (Predictive Parity)
15.4.1 The Definition
Calibration (also called predictive parity) requires that among individuals who receive the same score or prediction, the actual rate of the positive outcome be the same across groups.
Formally: P(Actually Positive | Predicted Positive, Group A) = P(Actually Positive | Predicted Positive, Group B)
In plain language: if the system says a defendant has a 70% chance of reoffending, that should mean 70% regardless of whether the defendant is Black or white.
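For a binary classifier, this condition reduces to comparing PPV across groups. A minimal sketch (the helper name `ppv_by_group` is illustrative):

```python
def ppv_by_group(predictions: list, actuals: list, groups: list) -> dict:
    """Among those predicted positive, the fraction actually positive, per group."""
    out = {}
    for g in set(groups):
        pos = [a for p, a, grp in zip(predictions, actuals, groups)
               if grp == g and p == 1]
        out[g] = sum(pos) / len(pos) if pos else 0.0
    return out

preds   = [1, 1, 1, 0,  1, 1, 1, 0, 0, 0]
actuals = [1, 1, 0, 0,  1, 1, 0, 1, 0, 0]
grps    = ["A"] * 4 + ["B"] * 6
ppvs = ppv_by_group(preds, actuals, grps)
# Both groups: PPV = 2/3 — a positive prediction means the same thing for each
# group, so predictive parity holds, even though Group B contains a false
# negative (an actual positive the system missed) that Group A does not.
```

The example illustrates the limitation discussed below: equal PPV says nothing about how errors are distributed among those the system did not flag.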
15.4.2 The Moral Intuition
Calibration captures the intuition that a fair system should mean the same thing for everyone. A risk score of "7" should represent the same level of risk regardless of the individual's group membership. This is the fairness of accurate communication — the system's predictions should be equally trustworthy across groups.
This is the definition that Northpointe invoked when defending COMPAS: among defendants who received the same score, recidivism rates were similar across racial groups.
15.4.3 Strengths
- Preserves informational accuracy. If a system is calibrated, decision-makers (judges, loan officers, clinicians) can trust the score regardless of the individual's group membership. A score means the same thing for everyone.
- Respects individual differences. Calibration doesn't force equal selection rates. If one group has a higher base rate of the target variable, the system can reflect that — as long as its predictions are accurate within each group.
- Aligned with statistical fairness. Calibration is a standard statistical property. Well-calibrated models are well-behaved in a statistical sense.
15.4.4 Limitations
- Can coexist with dramatically disparate outcomes. A perfectly calibrated system can have very different selection rates, error rates, and impacts across groups — as long as its predictions are accurate. If Black defendants genuinely reoffend at higher rates (due to structural factors like poverty, lack of support services, and biased policing), a calibrated system will predict higher risk for Black defendants. It will be "fair" in its predictions while reflecting and reinforcing the unfairness of the underlying system.
- Focuses on the model, not the impact. Calibration evaluates the quality of the prediction, not the consequences of acting on it. A calibrated prediction that leads to discriminatory outcomes is still discriminatory in practice.
- Depends on the definition of the target variable. If "recidivism" is measured by re-arrest (which reflects policing patterns) rather than actual criminal behavior, calibration to re-arrest data embeds policing bias.
Ethical Dimensions: The tension between calibration and equalized odds lies at the heart of the COMPAS debate — and of the Accountability Gap. Northpointe could claim their system was "fair" because it was calibrated. ProPublica could claim it was "unfair" because it had disparate error rates. Both were right, by their own definitions. This is not a failure of analysis. It is a revelation: there is no neutral, purely technical definition of fairness. Every definition embeds a value judgment about what matters most.
15.5 Individual Fairness vs. Group Fairness
15.5.1 The Distinction
The definitions above are all forms of group fairness — they evaluate whether the system treats groups equitably. Individual fairness, proposed by Dwork et al. (2012), takes a different approach: it requires that similar individuals receive similar treatment.
Formally: if individual i and individual j are "similar" (by some task-relevant metric), the system should give them similar predictions.
15.5.2 The Moral Intuition
Individual fairness captures the Aristotelian principle that equals should be treated equally. If two people have the same qualifications, experience, and credit history, they should receive the same loan decision — regardless of their demographic group.
15.5.3 The Challenge
The appeal is obvious. The challenge is defining "similarity." What makes two individuals "similar" for the purposes of a loan decision? Education? Income? Employment stability? Zip code? Family wealth? Each choice of similarity metric embeds assumptions about what is relevant — and those assumptions are themselves value-laden.
If you define similarity to include zip code, you may encode racial segregation. If you exclude zip code, you may lose genuinely predictive information. If you define similarity based on credit history, you may disadvantage young people or recent immigrants who haven't had time to build a credit record. The definition of "similar" is as contested as the definition of "fair."
Individual fairness also faces a fundamental tension with group fairness. It is possible for a system to satisfy individual fairness perfectly (similar individuals receive similar treatment) while violating demographic parity dramatically (if the similarity metric itself encodes structural inequality). Two applicants may be "similar" in their credit profiles and yet belong to different racial groups — and if their credit profiles reflect racial disparities in economic opportunity, treating them "similarly" means treating the consequences of inequality as neutral facts.
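Dwork et al.'s condition is commonly expressed as a Lipschitz constraint: the gap between two individuals' predictions must not exceed their distance under the chosen similarity metric. The sketch below is a minimal illustration under that framing; the function name, the unit Lipschitz constant, and the numeric distances are all our illustrative assumptions:

```python
def violates_individual_fairness(score_i: float, score_j: float,
                                 distance_ij: float,
                                 lipschitz: float = 1.0) -> bool:
    """Flag a pair whose predictions differ more than their similarity allows.

    distance_ij is the task-relevant distance between individuals i and j
    (0 = identical). Choosing this metric is itself the value-laden step
    discussed above: it decides which differences "count."
    """
    return abs(score_i - score_j) > lipschitz * distance_ij

# Two applicants judged nearly identical (distance 0.05) but scored far apart:
violates_individual_fairness(0.9, 0.3, 0.05)  # True — a violation
# The same score gap is permitted if the individuals are judged dissimilar:
violates_individual_fairness(0.9, 0.3, 0.8)   # False
```

Notice that the verdict flips purely because `distance_ij` changed: the fairness judgment lives entirely inside the similarity metric, which is the point of the Common Pitfall that follows.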
Common Pitfall: Individual fairness is sometimes presented as a technical solution that avoids the political challenges of group fairness. The reasoning goes: "we don't need to think about groups at all — just treat similar people similarly." But this reasoning obscures the value judgments embedded in the similarity metric. Defining "similarity" is a political act, and it determines which group differences the system ignores and which it preserves. Individual fairness does not escape politics. It hides them in the definition of similarity.
15.5.4 When Individual and Group Fairness Diverge
Consider a concrete scenario: a hospital has a limited number of slots in a cardiac care program. Two patients present with similar clinical profiles — similar age, similar blood pressure, similar cholesterol levels. Patient A is white; Patient B is Black. Under individual fairness, they should receive similar treatment (both enrolled or both not enrolled).
But demographic data shows that Black patients are enrolled in the cardiac care program at half the rate of white patients — a violation of demographic parity. An intervention to achieve demographic parity might give priority to Patient B to correct the group-level disparity, even though Patient A and Patient B are individually "similar."
Which standard should govern? Individual fairness says: treat these two patients the same. Demographic parity says: correct the group-level imbalance. They cannot both be satisfied in this case. The choice between them is a choice about whether fairness is an individual property (each person is treated based on their own characteristics) or a structural property (group-level patterns of advantage and disadvantage must be addressed).
15.6 The Impossibility Theorem
15.6.1 The Mathematical Result
In 2016 and 2017, two independent research groups proved what practitioners had begun to suspect: multiple fairness definitions cannot be simultaneously satisfied except in trivial cases.
Chouldechova (2017) proved that when base rates differ across groups, it is mathematically impossible for a classifier to simultaneously achieve:
- Equal false positive rates across groups
- Equal false negative rates across groups
- Equal positive predictive values across groups
Kleinberg, Mullainathan, and Raghavan (2016) proved a related result: calibration, balance for the positive class, and balance for the negative class cannot all be achieved simultaneously (except when the classifier is perfect or the base rates are equal).
15.6.2 Why This Matters
The impossibility theorem is not merely an academic curiosity. It has profound practical implications:
You must choose. There is no algorithm that is "fair" by all definitions. Every system designer, every policymaker, and every judge who uses an algorithmic score must decide which definition of fairness to prioritize — and that decision will inevitably sacrifice other definitions.
The choice is political. Prioritizing demographic parity says: equal outcomes matter most. Prioritizing equalized odds says: equal error rates matter most. Prioritizing calibration says: equal predictive accuracy matters most. These are different value commitments with different consequences for different people.
Technical solutions are insufficient. No amount of algorithmic sophistication can resolve a value conflict. The impossibility theorem means that fairness in algorithmic systems ultimately requires deliberation — transparent discussion among stakeholders about which trade-offs are acceptable.
Connection: The impossibility theorem is the algorithmic equivalent of the ethical framework conflicts examined in Chapter 6. Just as utilitarianism and deontology can point in different directions, demographic parity and calibration can demand contradictory outcomes. In Chapter 6, we developed a process for navigating ethical disagreement. In this chapter, we need a similar process for navigating fairness disagreement — one that is transparent, inclusive, and accountable.
15.6.3 A Concrete Illustration
Let us make the impossibility theorem tangible with a simplified example.
Suppose a risk assessment tool is applied to two groups, Group A and Group B. Group A has a 40% base rate (40% actually reoffend) and Group B has a 20% base rate (20% actually reoffend). The classifier assigns risk scores and uses a threshold to predict who will reoffend.
If the system achieves calibration (equal PPV): among those predicted positive, 60% actually reoffend regardless of group. This means the system is equally informative for both groups. But because Group A has a higher base rate, more of its members will be predicted positive — leading to unequal selection rates (violating demographic parity) and unequal false positive rates (violating equalized odds). Specifically, Group A will have a higher FPR because the threshold must be set lower relative to Group A's risk distribution to maintain calibration.
If the system achieves equalized odds (equal FPR and TPR): the same proportion of innocents are falsely flagged in each group, and the same proportion of actual positives are detected in each group. But because the base rates differ, the PPV will differ — a positive prediction means different things for different groups. A "high-risk" prediction for a Group B member (lower base rate) would be less likely to be correct than the same prediction for a Group A member.
You cannot have it both ways. The mathematics is unforgiving.
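The arithmetic can be checked directly. Rearranging the definitions in Section 15.1.2 gives the identity FPR = (p / (1 − p)) × ((1 − PPV) / PPV) × TPR, where p is the group's base rate. The sketch below (function name and the specific PPV/TPR values are our illustrative assumptions; the base rates match the scenario above) fixes calibration and TPR for both groups and shows the FPRs are then forced apart:

```python
def implied_fpr(base_rate: float, ppv: float, tpr: float) -> float:
    """FPR forced by the definitions once base rate, PPV, and TPR are fixed.

    Derivation: PPV = TP/(TP+FP) gives FP = TP*(1-PPV)/PPV, and with
    TP = TPR * base_rate * N and negatives = (1 - base_rate) * N,
    FPR = FP / negatives reduces to the expression below.
    """
    return (base_rate / (1 - base_rate)) * ((1 - ppv) / ppv) * tpr

# Same calibration (PPV = 0.60) and same TPR (0.75) for both groups:
fpr_a = implied_fpr(0.40, 0.60, 0.75)  # Group A, 40% base rate → 1/3
fpr_b = implied_fpr(0.20, 0.60, 0.75)  # Group B, 20% base rate → 1/8
# fpr_a ≈ 0.333, fpr_b = 0.125: equalized odds is necessarily violated.
```

No choice of threshold can escape this: as long as the base rates differ and the classifier is imperfect, holding PPV and TPR equal across groups determines unequal FPRs.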
15.6.4 When the Theorem Doesn't Bite
The impossibility result applies when base rates differ across groups. If the base rates are equal (the same proportion of each group actually possesses the target characteristic), the fairness definitions converge — all can be simultaneously satisfied.
This is one reason why addressing underlying social inequality matters even from a purely technical perspective. To the extent that interventions can equalize base rates — by addressing the root causes of disparate outcomes, not merely adjusting algorithmic outputs — the impossibility constraint relaxes.
But we should be honest: base rates in criminal justice, healthcare, lending, and employment differ significantly across racial groups, and those differences reflect centuries of structural inequality that no algorithm can fix. Expecting an algorithm to produce fair outcomes from unfair inputs — to correct at the point of prediction what society has failed to correct at the point of causation — is asking the algorithm to do work that belongs to social policy.
This does not mean algorithms are exempt from fairness requirements. It means that algorithmic fairness and social justice are complementary, not substitutable. You need both.
15.7 Building a Fairness Calculator in Python
15.7.1 Purpose
The FairnessCalculator class demonstrates the impossibility theorem in practice. It takes predictions, actual outcomes, and group labels, then computes multiple fairness metrics for the same system. The output shows how the same algorithm can be "fair" under one definition and "unfair" under another.
15.7.2 The FairnessCalculator Dataclass
"""
FairnessCalculator: Computes multiple fairness metrics for a binary
classification system, demonstrating that different definitions of
fairness can yield contradictory assessments of the same system.
This module supports the core argument of Chapter 15: fairness is not a
single concept, and choosing a fairness metric is a value-laden decision.
Usage:
calc = FairnessCalculator(
predictions=[1, 0, 1, ...],
actuals=[1, 1, 0, ...],
groups=["A", "B", "A", ...],
)
report = calc.full_report()
Requires: Python 3.9+ (for built-in generic annotations such as dict[str, int]), pandas
"""
from dataclasses import dataclass, field
from typing import Any
import pandas as pd
@dataclass
class FairnessCalculator:
"""
Computes and compares fairness metrics across groups defined by a
protected attribute, for a binary classification system.
Attributes:
predictions: List of predicted outcomes (0 or 1).
actuals: List of actual outcomes (0 or 1).
groups: List of group labels for each observation.
positive_label: Value representing the positive class. Default: 1.
"""
predictions: list
actuals: list
groups: list
positive_label: int = 1
_df: pd.DataFrame = field(init=False, repr=False)
def __post_init__(self) -> None:
"""Validate inputs and construct the internal DataFrame."""
n = len(self.predictions)
if not (n == len(self.actuals) == len(self.groups)):
raise ValueError(
"predictions, actuals, and groups must all have the same "
f"length. Got {n}, {len(self.actuals)}, {len(self.groups)}."
)
if n == 0:
raise ValueError("Cannot compute fairness on empty data.")
self._df = pd.DataFrame({
"prediction": self.predictions,
"actual": self.actuals,
"group": self.groups,
})
def _confusion_matrix(self, group_name: str) -> dict[str, int]:
"""
Compute confusion matrix counts for a specific group.
Returns dict with keys: 'TP', 'FP', 'TN', 'FN'.
"""
g = self._df[self._df["group"] == group_name]
pos = self.positive_label
tp = int(((g["prediction"] == pos) & (g["actual"] == pos)).sum())
fp = int(((g["prediction"] == pos) & (g["actual"] != pos)).sum())
tn = int(((g["prediction"] != pos) & (g["actual"] != pos)).sum())
fn = int(((g["prediction"] != pos) & (g["actual"] == pos)).sum())
return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}
def group_metrics(self, group_name: str) -> dict[str, float]:
"""
Compute key metrics for one group: selection rate, base rate,
TPR (sensitivity), FPR, and PPV (precision).
"""
cm = self._confusion_matrix(group_name)
tp, fp, tn, fn = cm["TP"], cm["FP"], cm["TN"], cm["FN"]
total = tp + fp + tn + fn
selection_rate = (tp + fp) / total if total else 0.0
base_rate = (tp + fn) / total if total else 0.0
tpr = tp / (tp + fn) if (tp + fn) else 0.0
fpr = fp / (fp + tn) if (fp + tn) else 0.0
ppv = tp / (tp + fp) if (tp + fp) else 0.0
return {
"group": group_name,
"n": total,
"base_rate": round(base_rate, 4),
"selection_rate": round(selection_rate, 4),
"TPR": round(tpr, 4),
"FPR": round(fpr, 4),
"PPV": round(ppv, 4),
}
def demographic_parity_difference(self) -> dict[str, Any]:
"""
Compute the difference in selection rates between the group
with the highest selection rate and the group with the lowest.
Demographic parity is satisfied when the difference is 0.
"""
metrics = {
g: self.group_metrics(g)
for g in self._df["group"].unique()
}
rates = {g: m["selection_rate"] for g, m in metrics.items()}
max_group = max(rates, key=rates.get)
min_group = min(rates, key=rates.get)
diff = rates[max_group] - rates[min_group]
return {
"metric": "demographic_parity_difference",
"value": round(diff, 4),
"max_group": max_group,
"max_rate": rates[max_group],
"min_group": min_group,
"min_rate": rates[min_group],
"satisfied": diff < 0.05, # common tolerance
}
def equalized_odds_difference(self) -> dict[str, Any]:
"""
Compute the maximum difference in TPR and FPR across groups.
Equalized odds is satisfied when both TPR and FPR are equal
across groups (difference = 0 for both).
"""
metrics = {
g: self.group_metrics(g)
for g in self._df["group"].unique()
}
tprs = {g: m["TPR"] for g, m in metrics.items()}
fprs = {g: m["FPR"] for g, m in metrics.items()}
tpr_diff = max(tprs.values()) - min(tprs.values())
fpr_diff = max(fprs.values()) - min(fprs.values())
return {
"metric": "equalized_odds_difference",
"tpr_difference": round(tpr_diff, 4),
"fpr_difference": round(fpr_diff, 4),
"max_difference": round(max(tpr_diff, fpr_diff), 4),
"tpr_by_group": tprs,
"fpr_by_group": fprs,
"satisfied": max(tpr_diff, fpr_diff) < 0.05,
}
def calibration_difference(self) -> dict[str, Any]:
"""
Compute the difference in positive predictive value (PPV)
across groups.
Calibration (predictive parity) is satisfied when PPV is equal
across groups — i.e., among those predicted positive, the
proportion actually positive is the same regardless of group.
"""
metrics = {
g: self.group_metrics(g)
for g in self._df["group"].unique()
}
ppvs = {g: m["PPV"] for g, m in metrics.items()}
diff = max(ppvs.values()) - min(ppvs.values())
return {
"metric": "calibration_difference",
"value": round(diff, 4),
"ppv_by_group": ppvs,
"satisfied": diff < 0.05,
}
def full_report(self) -> dict[str, Any]:
"""
Run all fairness analyses and produce a comprehensive report.
"""
group_names = sorted(self._df["group"].unique())
per_group = {g: self.group_metrics(g) for g in group_names}
dp = self.demographic_parity_difference()
eo = self.equalized_odds_difference()
cal = self.calibration_difference()
# Build summary.
verdicts = []
if dp["satisfied"]:
verdicts.append("Demographic parity: SATISFIED")
else:
verdicts.append("Demographic parity: NOT SATISFIED")
if eo["satisfied"]:
verdicts.append("Equalized odds: SATISFIED")
else:
verdicts.append("Equalized odds: NOT SATISFIED")
if cal["satisfied"]:
verdicts.append("Calibration: SATISFIED")
else:
verdicts.append("Calibration: NOT SATISFIED")
return {
"per_group_metrics": per_group,
"demographic_parity": dp,
"equalized_odds": eo,
"calibration": cal,
"verdicts": verdicts,
}
# ──────────────────────────────────────────────────────────────────────
# Example: Demonstrating the Impossibility Theorem
# ──────────────────────────────────────────────────────────────────────
if __name__ == "__main__":
# Scenario: A recidivism risk prediction system evaluated on two
# groups with DIFFERENT base rates of recidivism.
#
# Group A: 40% base rate (40 out of 100 actually reoffend).
# Group B: 25% base rate (25 out of 100 actually reoffend).
#
# The classifier is CALIBRATED — among those predicted positive,
# the actual positive rate is similar across groups (~65-67%).
# But the error rates are NOT equal: the false positive rate is
# higher for Group A than Group B.
#
# This demonstrates the impossibility theorem: calibration is
# satisfied, but equalized odds is not.
# Group A: 100 individuals, 40% base rate.
# Predictions designed to be calibrated but with unequal FPR.
predictions_a = (
[1] * 30 + # 30 predicted positive
[0] * 70 # 70 predicted negative
)
actuals_a = (
[1] * 20 + [0] * 10 + # Of 30 predicted positive: 20 TP, 10 FP
[1] * 20 + [0] * 50 # Of 70 predicted negative: 20 FN, 50 TN
)
# Group B: 100 individuals, 25% base rate.
# Predictions designed to be calibrated (~same PPV) but lower FPR.
predictions_b = (
[1] * 20 + # 20 predicted positive
[0] * 80 # 80 predicted negative
)
actuals_b = (
[1] * 13 + [0] * 7 + # Of 20 predicted positive: 13 TP, 7 FP
[1] * 12 + [0] * 68 # Of 80 predicted negative: 12 FN, 68 TN
)
all_predictions = predictions_a + predictions_b
all_actuals = actuals_a + actuals_b
all_groups = ["Group_A"] * 100 + ["Group_B"] * 100
calc = FairnessCalculator(
predictions=all_predictions,
actuals=all_actuals,
groups=all_groups,
)
report = calc.full_report()
# Display results.
print("=" * 65)
print(" FAIRNESS ANALYSIS REPORT — Recidivism Risk Prediction")
print("=" * 65)
print("\nPer-Group Metrics:")
print(f"{'Metric':<20} {'Group_A':>10} {'Group_B':>10}")
print("-" * 42)
for key in ["n", "base_rate", "selection_rate", "TPR", "FPR", "PPV"]:
val_a = report["per_group_metrics"]["Group_A"][key]
val_b = report["per_group_metrics"]["Group_B"][key]
if isinstance(val_a, float):
print(f"{key:<20} {val_a:>10.4f} {val_b:>10.4f}")
else:
print(f"{key:<20} {val_a:>10} {val_b:>10}")
print(f"\n{'=' * 65}")
print("Fairness Verdicts:")
print("-" * 65)
dp = report["demographic_parity"]
print(
f"Demographic Parity: Difference = {dp['value']:.4f} "
f"{'SATISFIED' if dp['satisfied'] else 'NOT SATISFIED'}"
)
eo = report["equalized_odds"]
print(
f"Equalized Odds: TPR diff = {eo['tpr_difference']:.4f}, "
f"FPR diff = {eo['fpr_difference']:.4f} "
f"{'SATISFIED' if eo['satisfied'] else 'NOT SATISFIED'}"
)
cal = report["calibration"]
print(
f"Calibration: PPV diff = {cal['value']:.4f} "
f"{'SATISFIED' if cal['satisfied'] else 'NOT SATISFIED'}"
)
print(f"\n{'=' * 65}")
print("INTERPRETATION:")
print(
"This system is approximately calibrated (similar PPV across\n"
"groups) but does NOT satisfy equalized odds (different FPR\n"
"across groups). This illustrates the impossibility theorem:\n"
"with different base rates (40% vs 25%), calibration and\n"
"equalized odds cannot both be satisfied simultaneously."
)
15.7.3 Walking Through the Code
The dataclass. FairnessCalculator takes three parallel lists — predictions, actual outcomes, and group labels — and validates that they have the same length. The internal DataFrame makes all subsequent computations straightforward.
Confusion matrix. The _confusion_matrix method computes the four cells (TP, FP, TN, FN) for a single group. These are the atomic building blocks of every fairness metric.
Group metrics. The group_metrics method computes the five key rates — selection rate, base rate, TPR, FPR, and PPV — for a single group. These metrics tell you how the system is performing for that group, not just whether it's performing well.
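For readers who want to see the arithmetic without the DataFrame machinery, the same five rates can be computed from plain parallel lists. The sketch below is illustrative — the function name `group_rates` is ours, not a method of FairnessCalculator — and it reproduces Group A's numbers from the example above:

```python
def group_rates(predictions: list[int], actuals: list[int]) -> dict[str, float]:
    """Compute the five per-group rates from parallel 0/1 lists."""
    tp = sum(1 for p, a in zip(predictions, actuals) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predictions, actuals) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predictions, actuals) if p == 0 and a == 1)
    tn = sum(1 for p, a in zip(predictions, actuals) if p == 0 and a == 0)
    n = len(predictions)
    return {
        "selection_rate": (tp + fp) / n,            # share predicted positive
        "base_rate": (tp + fn) / n,                 # share actually positive
        "TPR": tp / (tp + fn) if tp + fn else 0.0,  # sensitivity (recall)
        "FPR": fp / (fp + tn) if fp + tn else 0.0,  # false-alarm rate
        "PPV": tp / (tp + fp) if tp + fp else 0.0,  # precision
    }

# Group A from the example: 30 predicted positive (20 TP, 10 FP),
# 70 predicted negative (20 FN, 50 TN).
preds = [1] * 30 + [0] * 70
acts = [1] * 20 + [0] * 10 + [1] * 20 + [0] * 50
rates = group_rates(preds, acts)
```

Running this yields a selection rate of 0.30, a base rate of 0.40, TPR 0.50, FPR of 10/60, and PPV of 20/30 — matching the report table produced by the full class.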
Three fairness metrics. Each fairness method computes the relevant difference across groups:
- demographic_parity_difference compares selection rates
- equalized_odds_difference compares TPR and FPR
- calibration_difference compares PPV
Each method also returns a satisfied flag based on a 0.05 tolerance threshold. In practice, the threshold is context-dependent.
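Because the appropriate tolerance is context-dependent, one option is to make it an explicit, documented parameter rather than a hard-coded 0.05. A minimal sketch (the helper `parity_verdict` and the sample rates are ours, not part of the class above):

```python
def parity_verdict(rates: dict[str, float], tolerance: float = 0.05) -> dict:
    """Compare per-group rates against an explicit, documented tolerance."""
    diff = max(rates.values()) - min(rates.values())
    return {
        "difference": round(diff, 4),
        "tolerance": tolerance,
        "satisfied": diff < tolerance,
    }

selection_rates = {"Group_A": 0.30, "Group_B": 0.20}
strict = parity_verdict(selection_rates, tolerance=0.05)   # e.g., regulatory audit
lenient = parity_verdict(selection_rates, tolerance=0.15)  # e.g., exploratory screen
```

The same 0.10 gap is "NOT SATISFIED" under the strict tolerance and "SATISFIED" under the lenient one — which is exactly why the threshold itself should be a deliberate, recorded choice.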
The example. The demonstration uses a carefully constructed scenario where the base rates differ (40% vs. 25%), the system is approximately calibrated (similar PPV), but equalized odds is violated (different FPR). Running the code produces a clear report showing that the same system is "fair" by calibration and "unfair" by equalized odds.
Intuition: The example is constructed to mirror the COMPAS debate. Northpointe's defense (calibration) and ProPublica's critique (error-rate disparity) correspond to different methods in the FairnessCalculator. The code doesn't resolve the debate — it makes it precise, quantifiable, and impossible to ignore.
15.7.4 What the Output Shows
When you run the example, the output table will show:
- Group A has a higher base rate (0.40) and a higher selection rate (0.30)
- Group B has a lower base rate (0.25) and a lower selection rate (0.20)
- The PPV is similar across groups (~0.65-0.67): calibration is approximately satisfied
- But the FPR differs significantly (Group A ~0.17 vs. Group B ~0.09): equalized odds is NOT satisfied
- Demographic parity is also not satisfied (selection rates differ: 0.30 vs. 0.20)
This is the impossibility theorem in action. The system cannot satisfy all three metrics simultaneously given the different base rates.
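The constraint can even be checked numerically. Chouldechova (2017) derives an identity tying the error rates of a calibrated classifier to the base rate p: FPR = p/(1 − p) × (1 − PPV)/PPV × TPR. The sketch below (the function name `implied_fpr` is ours) plugs in the example's numbers and confirms that with PPVs this close and base rates of 0.40 vs. 0.25, the FPRs are forced apart:

```python
def implied_fpr(base_rate: float, ppv: float, tpr: float) -> float:
    """Chouldechova's identity: the FPR implied by base rate, PPV, and TPR."""
    return (base_rate / (1 - base_rate)) * ((1 - ppv) / ppv) * tpr

# Group A: p = 0.40, PPV = 20/30, TPR = 20/40
fpr_a = implied_fpr(0.40, 20 / 30, 0.50)   # matches the observed 10/60 ≈ 0.1667
# Group B: p = 0.25, PPV = 13/20, TPR = 13/25
fpr_b = implied_fpr(0.25, 13 / 20, 0.52)   # matches the observed 7/75 ≈ 0.0933
```

The identity makes the trade-off mechanical: holding PPV (approximately) equal while the base rates differ leaves no freedom for the FPRs to coincide.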
15.8 Choosing a Fairness Metric: Values, Not Variables
15.8.1 The Choice Is Political
The impossibility theorem forces a choice. And that choice is not a technical decision — it is a political and ethical one. Different fairness metrics serve different values:
| Metric | Prioritizes | Implicitly Accepts |
|---|---|---|
| Demographic parity | Equal representation in outcomes | May require different accuracy standards for different groups |
| Equalized odds | Equal accuracy (equal error rates) | May produce different selection rates that reflect disparate base rates |
| Calibration | Equal meaning of predictions | May produce dramatically different impact across groups |
| Individual fairness | Consistent treatment of similar individuals | Requires defining "similarity" — itself a value-laden choice |
15.8.2 Context Determines Appropriateness
No single fairness metric is universally correct. The appropriate metric depends on the context, the stakeholders, and the consequences:
Criminal justice — pretrial risk assessment. Many scholars argue that equalized odds (specifically, equal false positive rates) should be prioritized here. A false positive means an innocent person is detained. If Black defendants face a higher false positive rate, they disproportionately experience unjust detention. The harm of this error is severe and falls along racial lines.
Lending. The appropriate metric is debated and depends on what you believe lending fairness requires. Demographic parity ensures equal access to credit across groups — but may lead to higher default rates for historically disadvantaged groups if the underlying economic conditions are not addressed. Calibration ensures that predictions are equally accurate across groups — but a calibrated model may deny credit at much higher rates to groups affected by historical economic exclusion. Equal opportunity (equal TPR) ensures that qualified applicants from all groups have the same chance of approval — but "qualified" is defined by the same historical data that encodes inequality.
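Equal opportunity, as used in the lending discussion above, compares only true positive rates: among applicants who would in fact repay, are all groups approved at the same rate? A minimal sketch — the function name and the sample numbers are hypothetical, chosen only for illustration:

```python
def equal_opportunity_difference(tpr_by_group: dict[str, float]) -> float:
    """Max gap in TPR (approval rate among applicants who actually repay)."""
    return max(tpr_by_group.values()) - min(tpr_by_group.values())

# Hypothetical: of applicants who in fact repaid, the share approved per group.
tprs = {"Group_A": 0.82, "Group_B": 0.71}
gap = equal_opportunity_difference(tprs)
# A gap of 0.11 means qualified Group_B applicants are approved
# 11 percentage points less often than equally qualified Group_A applicants.
```

Note that this metric inherits the data's definition of "repaid" — so it still rests on the historical record the chapter warns about.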
Ray Zhao brought the lending perspective to the class from his experience at NovaCorp. "We tested all three metrics on our credit model last year. The results were sobering. Demographic parity would require us to approve applications we believe will default — and those defaults hurt the applicants as much as anyone, because they damage their credit further. Calibration gives us a technically sound model, but it approves 74% of white applicants and only 51% of Black applicants. Equalized odds improves the false positive balance but reduces overall accuracy. There is no clean answer. Every metric involves a trade-off, and every trade-off affects real people."
Healthcare — care allocation. Mira's VitraMed dilemma highlights the stakes. If the patient risk model is calibrated but has different selection rates (fewer Black patients flagged for additional care), calibration alone is insufficient — because the lower selection rate for Black patients reflects historical access disparities, not lower health need.
15.8.3 The VitraMed Thread: Which Fairness Metric?
Mira brought the fairness question to Dr. Adeyemi's office hours.
"I've been reading about the different fairness definitions," Mira said. "And I realize the problem at VitraMed isn't just that the model is biased — it's that there's no agreement on what 'fair' would even look like. The data science team says the model is calibrated. But calibration means that the model accurately reflects the fact that Black patients have historically used fewer services. It doesn't correct for the reason they used fewer services."
"So which metric would you choose?" Dr. Adeyemi asked.
"I think for healthcare, equalized odds — specifically, equal false negative rates — should be the priority. A false negative in this context means a sick patient who isn't flagged for care. If Black patients have a higher false negative rate, they're being systematically denied preventive care. That's the most harmful error in this context."
"And what are you willing to give up to achieve that?"
Mira paused. "Calibration, probably. The model might 'mean' something slightly different for different groups. But I'd rather have a model that catches more sick patients and loses some statistical precision than one that's precisely calibrated but misses the people who need help most."
"That," Dr. Adeyemi said, "is a value judgment. A good one, I think. But a value judgment nonetheless. And it should be made transparently, by people who understand the trade-offs — not buried in a line of code."
Real-World Application: The Obermeyer et al. (2019) team proposed a fix for the healthcare algorithm they studied: they replaced the cost-based proxy with a health-based measure (the number of active chronic conditions). This effectively shifted the model from calibration-on-spending to something closer to equal opportunity-on-health-need. The change nearly eliminated the racial disparity. But it required a policy decision — changing what the model was optimized for — not merely a technical fix.
15.8.4 Stakeholder Deliberation
Because choosing a fairness metric is a value choice, it should not be made unilaterally by engineers. It requires stakeholder deliberation — a process in which the people affected by the system have a voice in determining what fairness means in their context.
A stakeholder deliberation process for fairness metric selection might include:
- Identify stakeholders. Who is affected by the system? Who bears the costs of its errors? Who benefits from its accuracy?
- Map the trade-offs. Using tools like the FairnessCalculator, show stakeholders what each metric achieves and what it sacrifices. Make the impossibility theorem concrete.
- Elicit values. Which errors are most harmful? Which groups are most vulnerable? What existing inequalities should the system correct, perpetuate, or ignore?
- Decide transparently. Document which metric was chosen, by whom, and why. Make this documentation public — because fairness choices should be accountable.
- Monitor and revisit. Fairness is not a one-time decision. As contexts change, as base rates shift, as new information emerges, the choice of metric should be revisited.
Ethical Dimensions: The Power Asymmetry in fairness metric selection is significant. In practice, the choice is usually made by engineers and data scientists — the people who build the system — not by the communities who are subject to it. A democratic approach to algorithmic fairness would invert this: the people who bear the consequences should have the most say in defining what "fair" means.
15.9 Fairness Beyond Metrics
15.9.1 The Limits of Quantification
This chapter has focused on mathematical definitions of fairness — metrics that can be computed, compared, and optimized. But fairness is not only a mathematical concept. It is also a social, political, and moral one.
Some dimensions of fairness resist quantification:
Procedural fairness. Was the process by which decisions were made fair? Did affected individuals have notice, the opportunity to be heard, and access to appeal? These are process questions, not outcome questions.
Distributive fairness. Are the benefits and burdens of the system distributed justly? This question — central to Rawls's justice theory (Chapter 6) — goes beyond comparing group statistics.
Epistemic fairness. Are the knowledge claims embedded in the system fair? Whose knowledge counts? Whose experiences are represented in the training data? Whose concerns were heard during system design?
Relational fairness. Does the system treat people with dignity? Does it recognize their agency? Does it maintain or undermine relationships of trust? This is the care ethics perspective (Chapter 6) applied to algorithmic systems.
15.9.2 Historical and Corrective Justice
The fairness definitions in this chapter are all forward-looking — they evaluate whether a system treats groups equitably going forward. But they do not address a deeper question: does the system owe anything to groups that have been harmed in the past?
A corrective justice perspective would argue that algorithms deployed in domains shaped by historical injustice (lending, criminal justice, healthcare, employment) have an obligation not merely to avoid reproducing past discrimination but to actively correct for it. This might mean deliberately favorable treatment for historically disadvantaged groups — a form of algorithmic affirmative action.
This perspective is controversial. Its supporters argue that neutrality in the context of historical injustice is itself a form of injustice — that "treating everyone equally" in an unequal world preserves inequality. Its critics argue that corrective treatment through algorithms raises questions about who bears the cost, how long the correction should last, and whether algorithmic adjustment is the right instrument for historical redress.
The debate mirrors the broader societal debate about affirmative action, reparations, and restorative justice — debates that predate algorithms by decades. Algorithms have not created these questions; they have given them new urgency and new technical specificity.
15.9.3 Fairness as Ongoing Practice
Sofia Reyes, speaking at a DataRights Alliance webinar that Dr. Adeyemi assigned to the class, summarized the challenge:
"Every time I hear a tech company say 'we've made our algorithm fair,' I ask: fair by whose definition? Measured how? Evaluated by whom? Over what time period? Fairness isn't a property you can bake into a model and ship. It's an ongoing practice — a relationship between the system and the communities it affects. And like all relationships, it requires continuous attention, adjustment, and accountability."
Reflection: Consider the fairness definitions in this chapter. Which one most closely matches your intuitive sense of what "fair" means? Now consider: if you were a member of the most disadvantaged group affected by an algorithmic system, would your answer change? What does the difference (if any) tell you about the relationship between perspective and fairness?
15.10 The Eli Thread: Fairness and the Criminal Justice System
Eli had been wrestling with the fairness definitions all week. In office hours with Dr. Adeyemi, he laid out his frustration.
"I've been applying these definitions to the predictive policing algorithms used in Detroit," he said. "And here's what I keep coming back to: every definition of fairness assumes that the underlying system is legitimate. Demographic parity asks whether the algorithm selects equally. Equalized odds asks whether the algorithm errs equally. Calibration asks whether the algorithm predicts equally. But none of them ask whether the algorithm should exist."
He continued: "Predictive policing doesn't need to be fair. It needs to be abolished. Making a surveillance tool that targets Black neighborhoods 'fairer' — by whatever metric — doesn't change the fact that it's a surveillance tool that targets Black neighborhoods. You can't make a fundamentally unjust system fair by adjusting its error rates."
Dr. Adeyemi listened carefully. "That's a powerful argument, Eli. And it resonates with a tradition in critical theory — the idea that reform can sometimes entrench the systems it claims to improve. If you make predictive policing 'fair,' you may legitimate it. And a legitimate predictive policing system may be harder to abolish than an obviously biased one."
"But," she added, "there's a counterargument. If an algorithm is going to be used — and many will be, regardless of our philosophical objections — is it not better that it be as fair as possible? People are being scored right now. They're being detained right now. Making the system fairer, even incrementally, reduces real harm for real people. Can we afford to reject incremental improvement while waiting for structural transformation?"
The tension between reform and abolition — between making unjust systems less unjust and demanding that they be replaced — is one of the deepest in the field of algorithmic fairness. This textbook does not resolve it. But it insists that you grapple with it.
Reflection: Consider Eli's argument that some algorithmic systems should not be made fairer but should be eliminated. Can you think of an algorithmic application where the appropriate response is not "make it fairer" but "don't build it at all"? What criteria would you use to make that distinction?
15.11 Chapter Summary
Key Concepts
- Demographic parity requires equal selection rates across groups — fairness as equal representation in outcomes.
- Equalized odds requires equal error rates (TPR and FPR) across groups — fairness as equal accuracy.
- Calibration (predictive parity) requires equal predictive value across groups — fairness as equal meaning of predictions.
- Individual fairness requires that similar individuals receive similar treatment — fairness as consistent treatment.
- The impossibility theorem (Chouldechova 2017; Kleinberg et al. 2016) proves that multiple fairness definitions cannot all be satisfied simultaneously when base rates differ across groups, except in the degenerate case of a perfect predictor.
- Choosing a fairness metric is a political and ethical decision, not a technical one. It requires transparent stakeholder deliberation.
Key Debates
- Is demographic parity the right standard when base rates differ — or does it force systems to ignore real differences?
- Should the COMPAS system be evaluated by calibration (Northpointe's argument) or by equalized odds (ProPublica's argument)?
- Who should choose the fairness metric — engineers, policymakers, affected communities, or some combination?
- Can mathematical definitions of fairness ever capture what fairness truly means, or are they necessarily incomplete?
Applied Framework
When confronting a fairness question in an algorithmic system:
1. Compute multiple fairness metrics (demographic parity, equalized odds, calibration) — do not rely on any single metric
2. Check base rates — if they differ across groups, the impossibility theorem applies
3. Map the trade-offs — what does each metric prioritize, and what does it sacrifice?
4. Identify the most harmful errors — in this specific context, which type of mistake causes the most damage?
5. Engage stakeholders — the people affected by the system should inform the choice of fairness definition
6. Document the choice — which metric was selected, by whom, and why?
7. Monitor over time — fairness is a practice, not a one-time certification
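Step 2 of this framework can be run as a quick pre-check before any modeling: compute per-group base rates, because if they differ, no amount of tuning will satisfy every metric at once. The helper below is a sketch under our own naming and a tolerance we chose for illustration:

```python
def impossibility_applies(actuals: list[int], groups: list[str],
                          tolerance: float = 0.01) -> bool:
    """True if base rates differ across groups by more than `tolerance`,
    meaning calibration and equalized odds cannot both hold (absent a
    perfect predictor)."""
    totals: dict[str, int] = {}
    positives: dict[str, int] = {}
    for a, g in zip(actuals, groups):
        totals[g] = totals.get(g, 0) + 1
        positives[g] = positives.get(g, 0) + a
    base_rates = {g: positives[g] / totals[g] for g in totals}
    return max(base_rates.values()) - min(base_rates.values()) > tolerance

# The chapter's example: 40% vs. 25% base rates.
actuals = [1] * 40 + [0] * 60 + [1] * 25 + [0] * 75
groups = ["Group_A"] * 100 + ["Group_B"] * 100
flag = impossibility_applies(actuals, groups)  # True: the metrics will conflict
```

When the flag is true, steps 3-6 — mapping trade-offs, weighing harms, engaging stakeholders, and documenting the choice — are not optional refinements; they are where the actual decision gets made.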
What's Next
In Chapter 16: Transparency, Explainability, and the Black Box Problem, we'll examine what happens when we cannot explain how an algorithmic system reaches its decisions. We'll explore explainable AI (XAI) methods like LIME and SHAP, the legal right to explanation under GDPR Article 22, the distinction between meaningful transparency and transparency theater, and the trade-off between explainability and accuracy. Mira will confront the question of whether VitraMed can explain its risk predictions to the patients and clinicians who depend on them.
Before moving on, complete the exercises and quiz. The exercises include applying the FairnessCalculator to new scenarios and analyzing the fairness trade-offs in a real-world system.