Case Study 9.1: COMPAS and the Impossibility of Algorithmic Fairness
Chapter 9 | Running Case Study | Referenced in Chapters 18, 19, 20, 30
Introduction
The COMPAS controversy is not just a story about one algorithm. It is the story of how mathematical necessity and moral philosophy collide when we try to quantify justice. Understanding it deeply — the original design, the ProPublica investigation, Northpointe's rebuttal, the mathematics that underlies both, and the aftermath — is essential preparation for anyone who will deploy, regulate, or be affected by algorithmic decision systems. That category now includes nearly everyone.
1. What COMPAS Was Designed to Do and How It Works
COMPAS — Correctional Offender Management Profiling for Alternative Sanctions — is a risk assessment tool developed by Northpointe Inc. (later acquired by Equivant). It was designed to help criminal justice professionals make more consistent and evidence-based decisions about pretrial release, sentencing, and parole supervision levels. The logic behind it was appealing: human judges are inconsistent, subject to mood, fatigue, and implicit bias, and often have very little structured information about the defendants they assess. An actuarial instrument grounded in data might, its proponents argued, introduce greater consistency and objectivity.
COMPAS produces risk scores in several domains: general recidivism risk, violent recidivism risk, and pretrial failure risk. Each score runs from one (lowest risk) to ten (highest risk). The general recidivism score is derived from a proprietary algorithm that weighs a defendant's responses to approximately 137 survey questions covering topics such as criminal history, age, peer associations, substance use, residential instability, education, and family background. Critically, COMPAS does not use race as a direct input variable. Northpointe has consistently emphasized this point in its defense of the system.
The scores are generated automatically and presented to judges, probation officers, and parole boards as part of a defendant's file. Different jurisdictions use the scores differently: some use them only as background information; others have developed formal protocols tying score ranges to specific recommendations. The Wisconsin Supreme Court, in its 2016 decision in State v. Loomis (discussed below), upheld the use of COMPAS scores in sentencing as long as they were not the determinative factor.
By 2016, COMPAS was in use in more than a dozen states. Risk assessment tools of this general type — actuarial instruments that generate numerical predictions about future criminal behavior — had been in use in various forms since at least the 1920s, and their use was expanding as part of the broader movement toward evidence-based practices in criminal justice.
2. The ProPublica Methodology
In May 2016, ProPublica published "Machine Bias: There's Software Used Across the Country to Predict Future Criminals. And It's Biased Against Blacks." The investigation was reported by Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner.
The methodology was straightforward in design but labor-intensive in execution. The journalists obtained from the Broward County Sheriff's Office in Florida the COMPAS risk scores assigned to defendants between 2013 and 2014 — more than 7,000 people. They then linked those scores to arrest records over the following two years to determine whether each defendant had actually been arrested for a new crime. This linkage allowed them to compare predicted risk against actual subsequent behavior.
The team then computed false positive and false negative rates separately for Black and white defendants. A false positive in this context is a defendant who was predicted to be high risk but was not subsequently arrested for a new crime. A false negative is a defendant who was predicted to be low risk but was subsequently arrested.
ProPublica set the threshold for "high risk" at a COMPAS score of five or above on the ten-point scale. Using this threshold, they found:
- Black defendants were nearly twice as likely as white defendants to be falsely labeled high risk: a false positive rate of approximately 45% for Black defendants versus 23% for white defendants.
- White defendants were more likely than Black defendants to be falsely labeled low risk: a false negative rate of approximately 48% for white defendants versus 28% for Black defendants.
- The overall predictive accuracy of COMPAS was modest — approximately 65% — comparable to what studies have found for human judges.
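The per-group computation ProPublica performed can be sketched in a few lines. Everything below is illustrative: the `error_rates` helper and the six defendants are invented, and only the threshold of five matches the article's "high risk" cutoff.

```python
# Sketch of the per-group error-rate computation ProPublica performed.
# The defendants below are invented for illustration, not Broward data.

def error_rates(scores, reoffended, threshold=5):
    """Return (FPR, FNR) treating scores >= threshold as 'high risk'."""
    high = [s >= threshold for s in scores]
    tp = sum(h and r for h, r in zip(high, reoffended))
    fp = sum(h and not r for h, r in zip(high, reoffended))
    tn = sum((not h) and (not r) for h, r in zip(high, reoffended))
    fn = sum((not h) and r for h, r in zip(high, reoffended))
    fpr = fp / (fp + tn)   # false alarms among those who did not reoffend
    fnr = fn / (fn + tp)   # misses among those who did reoffend
    return fpr, fnr

# Six invented defendants: a COMPAS-style score, and whether they reoffended.
scores     = [8, 3, 6, 2, 9, 4]
reoffended = [True, False, False, False, True, True]

fpr, fnr = error_rates(scores, reoffended)
print(f"FPR = {fpr:.2f}, FNR = {fnr:.2f}")  # both 1/3 in this toy example
```

Running this separately for Black and white defendants, as the journalists did, is what surfaced the disparity: the function is the same, but the two groups' inputs yield very different error rates.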
The investigation also presented compelling individual cases: Black defendants with no prior criminal record assigned high-risk scores; white defendants with long criminal histories assigned low-risk scores. These human stories gave visceral meaning to the statistical findings.
The ProPublica analysis and accompanying code were published openly, an important choice that allowed the findings to be scrutinized, replicated, and challenged — which is exactly what happened.
3. The Specific Findings
The ProPublica findings can be summarized in the language of confusion matrices:
For Black defendants (simplified from ProPublica data):
- Among those who did not reoffend: approximately 45% were scored high risk (false positives)
- Among those who did reoffend: approximately 72% were scored high risk (true positives)

For white defendants:
- Among those who did not reoffend: approximately 23% were scored high risk (false positives)
- Among those who did reoffend: approximately 48% were scored high risk (true positives)
The asymmetry is striking. A Black defendant who did not go on to commit a new crime was nearly twice as likely as a white defendant in the same situation to be labeled high risk. A white defendant who did go on to commit a new crime was substantially less likely than a Black defendant in the same situation to have been flagged.
These findings have direct practical implications. In criminal justice, a high-risk score can influence bail decisions (keeping defendants in pretrial detention), sentencing decisions (recommending harsher sentences), and parole and probation decisions (imposing more restrictive supervision). A false positive label does not merely result in a minor inconvenience — it may mean months or years of additional incarceration.
4. Northpointe's Rebuttal: Calibration as the Fairness Metric
Northpointe responded to ProPublica's findings directly and forcefully. Their technical response, authored by William Dieterich, Christina Mendoza, and Tim Brennan, challenged ProPublica's framing and offered an alternative analysis.
Northpointe's central argument was that COMPAS was, in fact, fair — by the metric that mattered most for a risk assessment tool: calibration. At every score level on the one-to-ten scale, the proportion of defendants who actually reoffended was approximately the same for Black and white defendants. A score of seven meant roughly the same thing in probability terms regardless of race. This is a genuine and important fairness property.
Northpointe further argued that ProPublica's choice of metric was inappropriate. By setting a fixed threshold of five and comparing false positive rates, ProPublica was not accounting for the different base rates of recidivism in the dataset. Black defendants in Broward County had higher recidivism rates in the dataset, Northpointe argued, and a calibrated system would necessarily produce different false positive rates when base rates differ — just as a well-calibrated weather forecast for different cities would produce different rates of rain prediction if the cities have different climates.
Northpointe also challenged some of ProPublica's methodological choices, including the use of rearrest as the outcome variable (arguing that rearrest reflects police activity rather than actual criminal behavior) and the choice of threshold.
Both parties had competent statisticians. Both made legitimate points. The subsequent academic literature confirmed that neither was wrong within its own framing.
5. The Mathematical Explanation of Why Both Are Right
This is where the story becomes intellectually fascinating. The ProPublica findings and the Northpointe findings are not contradictory — they are both accurate descriptions of the same system, described using different fairness metrics.
To understand why, consider the mathematical relationships within a confusion matrix. For any binary classifier applied to any population:
- False Positive Rate (FPR) = FP / (FP + TN) = false alarms among negatives
- False Negative Rate (FNR) = FN / (FN + TP) = misses among positives
- Calibration (Positive Predictive Value / PPV) = TP / (TP + FP) = accuracy of positive predictions
These three quantities are not independent. They are related through the base rate (the prevalence of the positive outcome in the population). The relationship can be expressed as:
FPR = (base_rate / (1 - base_rate)) × ((1 - PPV) / PPV) × (1 - FNR)
Or more intuitively: if you know any two of these quantities and the base rate, the third is determined. You cannot freely choose all three independently.
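The relationship can be verified on a concrete confusion matrix. The counts below are arbitrary invented numbers; the point is that once the base rate, FNR, and PPV are fixed, the FPR follows.

```python
# Check the FPR / FNR / PPV / base-rate identity on invented counts.
tp, fp, fn, tn = 60, 40, 20, 80    # made-up confusion-matrix counts

n = tp + fp + fn + tn
base_rate = (tp + fn) / n          # prevalence of the positive outcome
fpr = fp / (fp + tn)               # false positive rate
fnr = fn / (fn + tp)               # false negative rate
ppv = tp / (tp + fp)               # positive predictive value

# FPR as implied by the other three quantities:
implied_fpr = (base_rate / (1 - base_rate)) * ((1 - ppv) / ppv) * (1 - fnr)

print(fpr, implied_fpr)  # both 1/3: the identity pins FPR down exactly
```

Changing any one count changes the base rate and at least one of the three metrics together; there is no way to move one quantity while holding the other three fixed.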
Now consider what happens when the base rate differs between groups. If Group A has a base rate of 30% and Group B has a base rate of 50%, and the system is calibrated (PPV is equal across groups), then it is mathematically impossible for both FPR and FNR to be equal across groups simultaneously — unless the classifier makes no errors at all.
This is not a limitation of COMPAS specifically. It is a mathematical theorem that applies to any classifier operating on populations with different base rates.
Northpointe's system was calibrated (PPV approximately equal across races). The base rates of recidivism in the Broward County dataset differed between Black and white defendants. Therefore, by mathematical necessity, the false positive rates and false negative rates had to be unequal. ProPublica documented the FPR disparity. Northpointe demonstrated calibration. Both descriptions are accurate.
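The reconciliation can be reproduced arithmetically. The error rates below are the approximate figures from the sections above; the base rates of roughly 51% (Black defendants) and 39% (white defendants) are rounded values attributed to the Broward dataset in the public debate and should be treated as assumptions, so this is illustrative rather than a reanalysis.

```python
# Per-group PPV implied by the reported error rates and (approximate)
# Broward base rates. All inputs are rounded public figures.

def ppv(base_rate, tpr, fpr):
    """Fraction of 'high risk' labels that were correct, per group."""
    flagged_reoffenders = base_rate * tpr
    flagged_nonreoffenders = (1 - base_rate) * fpr
    return flagged_reoffenders / (flagged_reoffenders + flagged_nonreoffenders)

black = ppv(base_rate=0.51, tpr=0.72, fpr=0.45)   # assumed base rate: 51%
white = ppv(base_rate=0.39, tpr=0.48, fpr=0.23)   # assumed base rate: 39%

print(f"PPV Black ~ {black:.2f}, PPV white ~ {white:.2f}")
# The FPRs differ by a factor of two, yet a "high risk" label carries a
# broadly similar probability of reoffense for both groups.
```

With these inputs the two PPVs land within a few percentage points of each other, which is exactly Northpointe's calibration claim coexisting with ProPublica's error-rate disparity.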
6. The Chouldechova (2017) Proof
The mathematical argument sketched above was formalized by Alexandra Chouldechova, then at Carnegie Mellon University, in a 2017 paper titled "Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments." Chouldechova's contribution was to state the impossibility result precisely and prove it rigorously, directly in response to the COMPAS controversy.
Chouldechova demonstrated formally that for any binary risk prediction instrument:
If the instrument is calibrated for all groups, and if there are differences in prevalence (base rates) across groups, then the instrument cannot simultaneously satisfy both:
1. Equal false positive rates across groups
2. Equal false negative rates across groups
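The theorem can also be confirmed by brute force: scan a grid of candidate error-rate pairs shared by both groups and check whether any non-degenerate pair produces equal PPV when base rates differ (here the 30% and 50% figures from the earlier example).

```python
# Brute-force check of the impossibility result: with different base rates,
# no shared non-degenerate (FPR, FNR) pair yields equal PPV for both groups.

def ppv(base_rate, fpr, fnr):
    """PPV implied by a group's base rate and its error rates."""
    tp = base_rate * (1 - fnr)         # true positives per capita
    fp = (1 - base_rate) * fpr         # false positives per capita
    return tp / (tp + fp)

p_a, p_b = 0.30, 0.50   # base rates from the example in the text

violations = []
steps = [i / 100 for i in range(1, 100)]  # exclude degenerate 0 and 1
for fpr in steps:
    for fnr in steps:
        if abs(ppv(p_a, fpr, fnr) - ppv(p_b, fpr, fnr)) < 1e-9:
            violations.append((fpr, fnr))

print(violations)  # [] -- calibration and equal error rates never coexist
```

The empty result is not a numerical accident: algebraically, equal PPV with shared error rates forces FPR = 0 or FNR = 1 whenever the base rates differ, which is exactly the degenerate-cases escape hatch in the theorem.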
Independently, Jon Kleinberg, Sendhil Mullainathan, and Manish Raghavan reached a closely related result in their paper "Inherent Trade-Offs in the Fair Determination of Risk Scores" (posted 2016; presented at ITCS 2017). They proved that three natural conditions — calibration within groups, balance for the negative class, and balance for the positive class (score-based analogues of equal false positive and false negative rates) — are mutually incompatible except in degenerate cases (equal base rates or perfect prediction).
These impossibility results transformed the debate. They made clear that the disagreement between ProPublica and Northpointe was not a matter of one side being right and the other wrong. It was a genuine conflict between two legitimate but incompatible fairness criteria. The question could not be resolved empirically. It required a moral and political judgment about which criterion should take precedence.
7. What Courts Have Done: State v. Loomis
The COMPAS controversy intersected with the legal system most directly in the case of Eric Loomis, who was sentenced in Wisconsin in 2013 after being arrested in connection with a drive-by shooting. The judge explicitly referenced Loomis's high COMPAS risk score in the sentencing, stating that the score "confirms the appropriateness" of a lengthy prison term.
Loomis appealed, arguing that using a proprietary algorithm whose methodology is trade-secret protected violates due process — because he could not meaningfully challenge the basis for his sentence if he could not examine how the score was calculated. He also argued that the score was based partly on group characteristics (such as gender, since the algorithm's norms differ by gender) and that sentencing based on group characteristics rather than individual conduct violates equal protection.
In 2016, the Wisconsin Supreme Court rejected Loomis's appeal. The court held that:
1. COMPAS could be used in sentencing, as long as the sentencing judge acknowledged that the score could not be the sole or determinative factor
2. The trade-secret protection did not violate due process, because Loomis had access to the questions and could challenge the questionnaire's application to him
3. Using gender-normed risk assessments does not constitute impermissible use of gender in sentencing
Loomis petitioned the U.S. Supreme Court, which declined to hear the case.
The Wisconsin decision is widely criticized by legal scholars and civil rights advocates. The court's reasoning on the due process issue is particularly strained: telling a defendant he can challenge the questionnaire is not equivalent to giving him access to the algorithm that converts questionnaire answers to scores. And the court's acceptance of gender-normed scoring — where men and women with identical questionnaire answers receive different scores because the algorithm was trained on gender-differentiated historical data — illustrates how historical disparities become encoded in algorithmic systems and then immunized from challenge.
The Loomis decision has shaped practice in many jurisdictions. Several states have since enacted legislation requiring that risk assessment tools used in sentencing be validated for racial bias — but the validation standards vary widely, and the COMPAS-style tools remain in widespread use.
8. The Stakes: COMPAS Scores and Millions of Lives
The scale of COMPAS's deployment makes the fairness question urgent rather than merely academic. At the peak of its deployment, COMPAS and similar tools were in use in over a dozen states. Independent estimates suggest that hundreds of thousands of defendants annually were subject to COMPAS scores that influenced decisions about their pretrial detention, sentencing, and parole supervision.
Consider what those numbers mean in human terms. A false positive score — a person who would not go on to reoffend, labeled high risk — may mean:
- Remaining in pretrial detention for weeks or months because bail is denied or set too high, when the person would have been released on recognizance with a lower score
- Receiving a longer sentence than a comparable defendant who was lucky enough to receive a lower score
- Being placed on more restrictive probation or parole supervision, with more opportunities for technical violations that lead to reincarceration
For Black defendants, who experienced false positive rates nearly twice those of white defendants, these consequences were systematically and disproportionately borne. The algorithm did not create racial disparity in the criminal justice system — that disparity predated it by decades. But it formalized, quantified, and in some ways legitimized a disparity that human discretion might at least have allowed exceptions to.
9. What Happened to COMPAS After the Controversy
The ProPublica investigation did not eliminate COMPAS. The system remained in use in many jurisdictions. Some states doubled down on risk assessment tools; others became more cautious. The debate generated an enormous academic literature on algorithmic fairness, substantially accelerated the development of formal fairness metrics, and contributed to a broader public awareness of bias in AI systems.
Several jurisdictions did move to restrict or eliminate the use of risk assessment tools in criminal sentencing. California, after years of debate, passed legislation limiting the use of risk assessment tools in determinate sentencing — though the tools remain widely used in pretrial and parole contexts.
The state of New Jersey, which had aggressively adopted risk assessment tools for pretrial release decisions, commissioned an independent audit that found significant racial disparities in both the algorithm's outputs and in how judges used the scores. The audit recommended recalibration and stronger oversight.
The broader risk assessment industry continued to grow, with companies offering competing products making varying fairness claims. The field now includes tools validated (with varying methodologies) for racial fairness — but the impossibility theorem means that "validated for fairness" necessarily means "validated for some conception of fairness, not others."
10. The Alternatives: Human Judgment and Its Limits
A fair evaluation of COMPAS requires comparing it to the alternative: human judgment. The evidence on human judgment in criminal justice decisions is not reassuring. Studies of bail decisions by human judges have found:
- Substantial racial disparities in detention rates that cannot be fully explained by case characteristics
- High inconsistency: the same judge may make different decisions about similar defendants at different times of day, before and after lunch, or on days following their sports team's loss
- Judges who are systematically overconfident in their own predictive accuracy
Research by Kleinberg, Lakkaraju, Leskovec, Ludwig, and Mullainathan (2018) used machine learning to analyze hundreds of thousands of bail decisions and found that an algorithmic tool could substantially reduce crime rates while also reducing the number of people detained pretrial — suggesting that human judges were both racially biased and less accurate than a well-designed algorithm.
But there is an important asymmetry: human bias can be investigated, challenged, appealed, and remedied through processes that the justice system has developed over centuries. A judge can be questioned, cross-examined, required to explain their reasoning, reported to judicial conduct boards, and disqualified from a case. An algorithm's reasoning is typically opaque, protected as trade secret, and immune from the usual mechanisms of judicial accountability.
This does not mean algorithms are necessarily worse than human judges. It means that deploying algorithms in criminal justice requires developing new accountability mechanisms appropriate to their specific characteristics — transparency requirements, auditing regimes, and meaningful opportunities to challenge algorithmic outputs on both factual and methodological grounds.
11. The Democratic Question: Who Gets to Choose the Fairness Metric?
Perhaps the most important question raised by the COMPAS controversy is political rather than technical: who gets to decide which fairness metric is used, and through what process?
Northpointe made a design choice to optimize for calibration. That choice — which had the effect of accepting higher false positive rates for Black defendants — was made internally, by a private company, without public deliberation. The company's clients — county governments and court systems — adopted the tool without necessarily understanding or evaluating that design choice. Judges used the scores without necessarily knowing that the false positive rates differed systematically by race.
In a democracy, decisions about the acceptable distribution of criminal justice errors across racial groups seem like decisions that should be made through democratic processes — legislative deliberation, regulatory rulemaking, community input — rather than by private algorithmic vendors. The fact that such decisions are embedded in proprietary software and effectively delegated to technology companies is a significant accountability gap.
Several reform proposals have been advanced:
- Mandatory disclosure: Vendors of risk assessment tools should be required to disclose their fairness metrics, validation methodologies, and the trade-offs made in tool design.
- Regulatory approval: Risk assessment tools used in criminal justice should require regulatory approval based on demonstrated fairness across specified criteria, similar to how medical devices require FDA approval.
- Community participation: Affected communities — criminal justice-involved people, their families, and communities disproportionately affected by incarceration — should have formal roles in deciding what fairness criteria to use.
- Public alternatives: Some scholars advocate for publicly funded, open-source risk assessment tools whose algorithms are fully transparent and subject to democratic oversight.
The COMPAS controversy illustrates a broader challenge: as algorithmic systems take on decision-making functions that were previously performed by humans subject to democratic accountability, the accountability gap between algorithmic and human decision-making widens. Closing that gap requires not just better algorithms but better governance.
Discussion Questions
1. ProPublica's investigation focused on false positive rates; Northpointe emphasized calibration. Given the Chouldechova impossibility theorem, both positions are mathematically defensible. If you were advising a state legislature on which fairness criterion to require for risk assessment tools used in pretrial detention decisions, which would you recommend, and why? What stakeholders would you consult before making this recommendation?
2. The Wisconsin Supreme Court upheld the use of COMPAS in sentencing while acknowledging that it could not be the "sole or determinative" factor. Do you believe this constraint is sufficient to prevent the harms documented by ProPublica? What additional procedural safeguards would you recommend?
3. The research literature suggests that human judges are also racially biased in their bail and sentencing decisions, but their decisions are subject to traditional accountability mechanisms (appeals, judicial conduct boards, public scrutiny). How should we weigh algorithmic bias against human bias when the two are subject to different accountability regimes?
4. Northpointe's algorithm does not take race as a direct input, yet produces racially disparate outputs. How is this possible? What does this tell us about the adequacy of "fairness through unawareness" as an approach to algorithmic fairness?
5. Some critics argue that any risk assessment tool that predicts future criminal behavior is fundamentally unjust because it punishes people for what they might do rather than what they have done. Evaluate this argument. Does it apply equally to human judges who also make predictive judgments? What would a criminal justice system that rejected predictive judgment look like?
This case study is the foundation for discussions of algorithmic auditing (Chapter 19), communicating AI risks (Chapter 15), and the governance of AI in public institutions (Chapters 20 and 30).