Case Study: ProPublica vs. Northpointe — The COMPAS Fairness Debate
"They can't both be right — can they?" — Mira, in Dr. Adeyemi's seminar, after reading the competing analyses
Overview
Chapter 14 introduced the COMPAS recidivism prediction tool and ProPublica's finding that it exhibited racially disparate error rates. This case study examines the debate that followed: the most consequential and illuminating exchange in the short history of algorithmic fairness. ProPublica said COMPAS was unfair because Black defendants were more likely to be falsely labeled high-risk. Northpointe said COMPAS was fair because defendants with the same score reoffended at the same rate regardless of race. Both were right, and their disagreement revealed the impossibility theorem months before it was formally proven.
This case study is not primarily about COMPAS. It is about what happens when two reasonable definitions of fairness conflict — and what that conflict means for the institutions that use algorithmic systems.
Skills Applied:
- Comparing and evaluating competing fairness definitions
- Applying the impossibility theorem to a concrete case
- Analyzing how mathematical truths translate into policy debates
- Connecting fairness definitions to stakeholder values
The Debate
ProPublica's Position: Equal Error Rates
ProPublica's May 2016 investigation, "Machine Bias," analyzed COMPAS scores for over 7,000 defendants in Broward County, Florida. Their key finding concerned the distribution of errors:
False positive rate (incorrectly predicted to reoffend):
- Black defendants: 44.9%
- White defendants: 23.5%

False negative rate (incorrectly predicted not to reoffend):
- Black defendants: 28.0%
- White defendants: 47.7%
ProPublica's conclusion: COMPAS is unfair because it makes different kinds of mistakes for different racial groups. A Black defendant who will not reoffend is nearly twice as likely (44.9% versus 23.5%) to be wrongly flagged as dangerous. A white defendant who will reoffend is nearly twice as likely (47.7% versus 28.0%) to be wrongly classified as safe.
The fairness standard implicitly invoked: equalized odds — the requirement that true positive rates and false positive rates be equal across groups. ProPublica's analysis demonstrated a clear violation.
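These error rates come straight from each group's confusion matrix. The sketch below shows the computation on invented outcome and prediction arrays; the toy data and group names are placeholders, not the Broward County records:

```python
import numpy as np

def error_rates(y_true, y_pred):
    """Return (FPR, FNR) for binary outcomes and predictions."""
    y_true = np.asarray(y_true, dtype=bool)   # True = actually reoffended
    y_pred = np.asarray(y_pred, dtype=bool)   # True = predicted to reoffend
    fpr = y_pred[~y_true].mean()    # wrongly flagged, among actual non-reoffenders
    fnr = (~y_pred)[y_true].mean()  # wrongly cleared, among actual reoffenders
    return fpr, fnr

# Invented data for two groups; equalized odds asks whether the rates match.
groups = {
    "group_a": ([1, 0, 1, 0, 0, 1, 0, 0], [1, 1, 1, 0, 1, 0, 0, 1]),
    "group_b": ([1, 0, 0, 0, 1, 0, 0, 0], [1, 0, 0, 0, 0, 0, 1, 0]),
}
for name, (y_true, y_pred) in groups.items():
    fpr, fnr = error_rates(y_true, y_pred)
    print(f"{name}: FPR = {fpr:.2f}, FNR = {fnr:.2f}")
```

Equalized odds holds exactly when both printed rates agree across groups; ProPublica's figures above show how far Broward County fell from that standard.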
Northpointe's Position: Equal Predictive Accuracy
In July 2016, Northpointe (now Equivant) published a detailed response. Their key claim:
Among defendants who received the same COMPAS score, the actual recidivism rates were approximately equal across racial groups. A Black defendant with a score of 7 and a white defendant with a score of 7 reoffended at similar rates. The score means the same thing for everyone.
Northpointe's conclusion: COMPAS is fair because its predictions are equally accurate for both groups. A "high-risk" label is equally likely to be correct whether the defendant is Black or white.
The fairness standard invoked: calibration (predictive parity). Predictive parity requires that among individuals predicted positive, the fraction who actually are positive (the positive predictive value, or PPV) be equal across groups; calibration extends this to every score level, requiring that a given score correspond to the same probability of reoffending in every group.
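Checking this claim is mechanical once the data are in hand. A minimal sketch, again on invented scores, outcomes, and group labels rather than the COMPAS data:

```python
import numpy as np

def calibration_table(scores, reoffended, group):
    """Print, for each score value, the observed reoffense rate in each group.
    Calibration holds if the rates in each row are (roughly) equal."""
    scores, reoffended, group = map(np.asarray, (scores, reoffended, group))
    for s in np.unique(scores):
        cells = []
        for g in np.unique(group):
            mask = (scores == s) & (group == g)
            if mask.any():
                cells.append(f"{g}: {reoffended[mask].mean():.2f}")
        print(f"score {s} -> " + ", ".join(cells))

# Invented data: two groups, scores 1-3.
scores     = [1, 1, 2, 2, 3, 3, 1, 1, 2, 2, 3, 3]
reoffended = [0, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1]
group      = ["a"] * 6 + ["b"] * 6
calibration_table(scores, reoffended, group)
```

Northpointe's claim was that the real version of this table, built from the Broward County data, shows approximately equal rates in every row.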
Both Were Right
This is the crux of the case: both ProPublica and Northpointe were mathematically correct about the facts. COMPAS did have disparate false positive rates (ProPublica's finding). COMPAS was approximately calibrated (Northpointe's finding). These are not contradictory claims about the same metric — they are claims about different metrics that were both true simultaneously.
The disagreement was not about facts. It was about which facts matter most.
The Mathematics: Why Both Cannot Be Achieved
The Underlying Structure
The Broward County data had a critical structural feature: the base rates (actual recidivism rates) differed across racial groups. Black defendants had a base rate of approximately 51%; white defendants had a base rate of approximately 39%.
This difference — whatever its causes (and the causes are structural, as discussed in Chapter 14) — triggers the impossibility theorem.
The Impossibility in This Case
When base rates differ and a system is calibrated:
- A score of "7" correctly means "high risk" for both groups — the PPV is equal.
- But the higher-base-rate group's score distribution is shifted upward: because more of its members actually reoffend, a calibrated score must assign more of them high values.
- At any given threshold, more individuals from the higher-base-rate group are therefore predicted positive, including more of those who will not in fact reoffend.
- The false positive rate for the higher-base-rate group (Black defendants) must therefore be higher than for the lower-base-rate group (white defendants), as the simulation below illustrates.
Conversely, if you force equal false positive rates:
- You must adjust the threshold differently for each group.
- This means a score of "7" no longer means the same thing for both groups — it represents different risk levels depending on race.
- Calibration is violated.
The mathematics is precise and unforgiving. Given these base rates, COMPAS cannot simultaneously achieve calibration and equalized odds. Every improvement on one dimension comes at the cost of the other.
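Chouldechova's proof (discussed below) turns on an identity linking the three quantities. Writing p for a group's base rate:

$$
\mathrm{FPR} = \frac{p}{1-p}\cdot\frac{1-\mathrm{PPV}}{\mathrm{PPV}}\cdot\bigl(1-\mathrm{FNR}\bigr)
$$

If PPV and FNR are held equal across groups, FPR must scale with the odds p/(1-p). With the Broward County base rates, that factor is (0.51/0.49)/(0.39/0.61) ≈ 1.6: a calibrated tool that also matched PPV and FNR across groups would have to give Black defendants a false positive rate roughly 1.6 times that of white defendants. The observed ratio, 44.9/23.5 ≈ 1.9, is of the same order.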
What Each Definition Values
Equalized Odds Values: Protection from Errors
The equalized odds perspective asks: "For someone who is actually innocent (will not reoffend), does the system treat them the same regardless of race?" If the answer is no — if Black innocents are more likely to be wrongly flagged — then the system is imposing a racial penalty on innocence. This is a profound injustice: your race determines the probability that you will be wrongly detained, wrongly sentenced, wrongly denied parole.
This perspective prioritizes the individual experience of being judged. It says: whatever the aggregate statistics, each person who stands before a judge deserves to be assessed by a system that is as accurate for them as it would be for someone of a different race.
Calibration Values: Accuracy of Communication
The calibration perspective asks: "When the system says a defendant is high-risk, does that mean the same thing regardless of race?" If the answer is yes — if a score of "7" represents the same probability of reoffending for Black and white defendants — then the system is providing equally informative assessments to judges, and judges can trust the score regardless of the defendant's race.
This perspective prioritizes the decision-maker's ability to use the information correctly. It says: a risk score should mean what it says. If "high risk" means different things for different racial groups, the score is misleading, and judges cannot rely on it.
Neither Is Wrong
The debate is not between a correct and an incorrect definition of fairness. It is between two legitimate moral intuitions that, given the mathematical constraints, cannot be simultaneously honored.
Equalized odds says: fairness means bearing the same risk of algorithmic error regardless of your race. Calibration says: fairness means receiving equally accurate information regardless of your race. Both are reasonable. Both are defensible. Both are incomplete.
The Scholarly Response
Formalizing the Impossibility
The ProPublica-Northpointe exchange catalyzed a wave of scholarship:
Chouldechova (2017) published a formal proof that when base rates differ, a classifier cannot simultaneously achieve equal false positive rates, equal false negative rates, and equal positive predictive values across groups. Her proof confirmed mathematically what the COMPAS debate had revealed empirically.
Kleinberg, Mullainathan, and Raghavan (2017) proved a related but independently derived result: calibration, balance for the positive class, and balance for the negative class cannot all hold simultaneously except in trivial cases (equal base rates, or a perfect predictor).
Corbett-Davies and Goel (2018) surveyed the landscape of fairness definitions and argued that the choice among them should be understood as a political question — not a technical optimization problem.
The Policy Impact
The debate had significant policy consequences:
- Several jurisdictions paused or reconsidered their use of algorithmic risk assessment tools
- The Wisconsin Supreme Court, in State v. Loomis (2016), affirmed the use of COMPAS in sentencing but required that presentence reports carry written warnings about the tool's limitations
- The debate informed the development of the EU's approach to AI regulation, which emphasizes transparency and human oversight for high-risk AI systems
- Academic and professional organizations developed guidelines for the responsible use of risk assessment in criminal justice
Stakeholder Analysis
The Defendant
A defendant's primary concern is being treated fairly as an individual. From the defendant's perspective, equalized odds is likely more intuitive: "If I'm not going to reoffend, I shouldn't be more likely to be wrongly flagged just because I'm Black." The defendant experiences the false positive personally. The system's calibration across a population is cold comfort to the individual wrongly detained.
The Judge
A judge's primary concern is making good decisions. From the judge's perspective, calibration may be more appealing: "When I see a risk score, I need to know what it means. If a '7' means the same probability of reoffending for every defendant, I can use the score consistently." A judge operating with a non-calibrated tool may be misled about the meaning of scores across different populations.
The Community
Communities have dual concerns: public safety (accurately identifying high-risk individuals) and justice (not over-punishing people based on flawed predictions). The community's preferred fairness definition may depend on which concern is more salient — and on which community is being asked. Communities disproportionately affected by over-policing and mass incarceration may prioritize equalized odds. Communities experiencing high crime rates may prioritize calibration and accuracy.
The Algorithm Designer
The designer faces a technical and ethical choice. The impossibility theorem means they must choose — and whichever choice they make, they will be vulnerable to criticism from the perspective they did not choose. Many designers would prefer to defer this choice to policymakers or stakeholders, but as Section 15.6.2 notes, the choice is often made implicitly during model development, without explicit deliberation.
Discussion Questions
- Whose definition should prevail? If you had to choose one fairness definition for criminal justice risk assessment — equalized odds or calibration — which would you choose, and why? Does your answer change depending on whether the score is used for bail (pretrial detention) versus sentencing versus parole?

- The base rate question revisited. The impossibility theorem applies because base rates differ. Should we accept different base rates as a given, or should we address the structural causes of different base rates (poverty, policing patterns, systemic racism) as part of the fairness intervention? What are the practical challenges of each approach?

- The communication problem. ProPublica published its analysis in plain language. Northpointe published a detailed statistical response. Most members of the public, most judges, and most policymakers could not evaluate the competing mathematical claims. How should complex fairness debates be communicated to non-technical audiences? Whose responsibility is this communication?

- Beyond binary choices. The debate is often framed as "equalized odds vs. calibration." Are there fairness approaches that partially satisfy both, even if perfectly satisfying both is impossible? What would a compromise look like, and who should decide where the compromise falls?
Your Turn: Mini-Project
Option A: Stakeholder Deliberation Simulation. Organize a group of 4-6 people and assign each person one stakeholder role: defendant, judge, prosecutor, defense attorney, community member from a high-crime neighborhood, and algorithm designer. Each person argues for the fairness definition their stakeholder would prefer. After deliberation, attempt to reach consensus. Write a one-page reflection on whether consensus was possible and what the exercise revealed about the nature of the fairness choice.
Option B: Fairness Definition Policy Brief. You are an advisor to a state legislature considering a bill that would mandate the use of a specific fairness definition for all criminal justice risk assessment tools used in the state. Write a two-page policy brief that: (a) explains the three major fairness definitions in plain language, (b) presents the impossibility theorem's implications, (c) recommends a specific definition or approach, and (d) justifies your recommendation.
Option C: The Third Way. Research approaches that attempt to navigate the impossibility theorem — such as the "cost-based" approach (Corbett-Davies et al., 2017), which frames the fairness choice as an optimization problem with explicit costs for different types of errors. Write a two-page analysis: Does this approach resolve the tension, or does it merely relocate the value judgment?
References
- Angwin, Julia, Jeff Larson, Surya Mattu, and Lauren Kirchner. "Machine Bias." ProPublica, May 23, 2016.
- Northpointe Inc. "COMPAS Risk Scales: Demonstrating Accuracy Equity and Predictive Parity." Northpointe Research Department, July 8, 2016.
- Chouldechova, Alexandra. "Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments." Big Data 5, no. 2 (2017): 153-163.
- Kleinberg, Jon, Sendhil Mullainathan, and Manish Raghavan. "Inherent Trade-Offs in the Fair Determination of Risk Scores." Proceedings of Innovations in Theoretical Computer Science (ITCS), 2017.
- Corbett-Davies, Sam, Emma Pierson, Avi Feller, Sharad Goel, and Aziz Huq. "Algorithmic Decision Making and the Cost of Fairness." In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 797-806. ACM, 2017.
- Corbett-Davies, Sam, and Sharad Goel. "The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning." arXiv preprint arXiv:1808.00023, 2018.
- Berk, Richard, Hoda Heidari, Shahin Jabbari, Michael Kearns, and Aaron Roth. "Fairness in Criminal Justice Risk Assessments: The State of the Art." Sociological Methods & Research 50, no. 1 (2021): 3-44.