Case Study 30-1: The COMPAS Controversy — From ProPublica Investigation to Policy Change
Overview
In May 2016, the investigative news organization ProPublica published "Machine Bias" — a data journalism investigation that documented systematic racial disparities in COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), a proprietary risk assessment algorithm used in criminal sentencing across the United States. The investigation sparked a national controversy that brought together academic researchers, legal advocates, algorithm developers, judges, policymakers, and constitutional scholars in a multi-year debate over whether algorithmic risk assessment can be fair, what fairness means mathematically, and who should be held accountable for algorithmic failures in criminal justice.
This case study examines the full arc of the COMPAS controversy: the ProPublica investigation, the technical rebuttal from Northpointe (COMPAS's developer), the mathematical impossibility proof that reframed the debate, the Loomis v. Wisconsin litigation, and the policy changes (and non-changes) that followed.
Part 1: The ProPublica Investigation
The Data and the Method
ProPublica's investigation was built on an extraordinary dataset: COMPAS risk scores for 7,214 individuals arrested in Broward County, Florida, between 2013 and 2014, matched to criminal history and subsequent arrest records over the following two years. This dataset was obtained through a Florida public records request — a reminder that transparency in criminal justice data is both possible and productive when pursued.
The reporting team analyzed the data along several dimensions:
Accuracy: COMPAS was correct in predicting recidivism approximately 65% of the time. This sounds better than chance (50%), but raises the question of whether a complex, proprietary, 137-question instrument is meaningfully outperforming much simpler approaches. A subsequent analysis by Dressel and Farid (2018) found that COMPAS achieved similar accuracy to a simple two-variable model using only age and number of prior offenses — and to the predictions of random people (untrained Amazon Mechanical Turk workers) given brief case descriptions.
False Positive Rates by Race: Among individuals who did not reoffend within two years (true non-recidivists), Black defendants were nearly twice as likely as white defendants to have been classified as high risk by COMPAS — 44.9% of Black defendants in this category were falsely classified as high risk, compared to 23.5% of white defendants.
False Negative Rates by Race: Among individuals who did reoffend within two years (true recidivists), white defendants were more likely than Black defendants to have been classified as low risk — 47.7% of white recidivists were classified low risk, compared to 28.0% of Black recidivists.
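The two rates above can be reproduced mechanically from group-level confusion counts. The sketch below uses hypothetical counts chosen only to land near the published figures; it is not the actual Broward County data, and the function name is illustrative.

```python
# Illustrative computation of the error-rate metrics ProPublica reported.
# The counts are hypothetical, picked to approximate the published rates;
# they are not the real Broward County dataset.

def error_rates(high_risk_no_reoffend, total_no_reoffend,
                low_risk_reoffend, total_reoffend):
    """Return (false positive rate, false negative rate).

    FPR: share of true non-recidivists flagged high risk.
    FNR: share of true recidivists scored low risk.
    """
    fpr = high_risk_no_reoffend / total_no_reoffend
    fnr = low_risk_reoffend / total_reoffend
    return fpr, fnr

# Hypothetical group counts (not the real data):
black_fpr, black_fnr = error_rates(449, 1000, 280, 1000)
white_fpr, white_fnr = error_rates(235, 1000, 477, 1000)

print(f"Black: FPR={black_fpr:.1%}, FNR={black_fnr:.1%}")  # FPR=44.9%, FNR=28.0%
print(f"White: FPR={white_fpr:.1%}, FNR={white_fnr:.1%}")  # FPR=23.5%, FNR=47.7%
```

Note that both rates condition on the actual outcome (did the person reoffend?), which is exactly the conditioning choice Northpointe's rebuttal would later contest.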
The Individual Illustrations: The investigation used individual case comparisons to make the statistical pattern concrete. Brisha Borden, 18, Black, was scored high risk for a petty theft charge; she did not reoffend. Vernon Prater, 41, white, had a more serious prior record; he was scored low risk and subsequently committed serious new offenses. These individual examples were criticized by statisticians as insufficient to evaluate the algorithm — which they are, in isolation — but they illustrated the aggregate pattern in human terms.
The Cultural Impact
"Machine Bias" generated immediate and widespread attention. It was cited in congressional hearings, law review articles, computer science papers, and popular media. It became a reference point in the nascent field of algorithmic fairness and a catalyst for the broader "AI fairness" academic movement. It also created the canonical demonstration, accessible to non-specialists, of how an AI system could appear technically neutral while producing racially disparate outcomes — a phenomenon scholars later termed "algorithmic discrimination."
Part 2: Northpointe's Rebuttal
The Calibration Defense
Northpointe's response, published shortly after the ProPublica investigation, made a specific technical argument: ProPublica had used the wrong fairness criterion. The investigation had examined error rates conditioned on actual outcome — what proportion of those who didn't recidivate were nonetheless flagged as high risk, separately for Black and white defendants. Northpointe argued that the appropriate criterion was calibration — whether a given score predicted the same actual recidivism rate regardless of race.
Northpointe's analysis showed that COMPAS was well-calibrated: a score of 7 predicted roughly the same actual two-year recidivism rate for Black and white defendants. On this criterion, the algorithm was fair — it meant the same thing for everyone.
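The calibration check itself amounts to a per-score comparison of observed recidivism rates across groups. The sketch below is synthetic and illustrative, not drawn from the actual COMPAS analysis; the function and data names are invented for this example.

```python
# A sketch of a calibration check: within each score, compare observed
# recidivism rates across groups. All data below are synthetic.

from collections import defaultdict

def calibration_table(records):
    """records: iterable of (score, group, reoffended) tuples.
    Returns {(score, group): observed recidivism rate}."""
    counts = defaultdict(lambda: [0, 0])  # (score, group) -> [reoffenders, total]
    for score, group, reoffended in records:
        counts[(score, group)][0] += int(reoffended)
        counts[(score, group)][1] += 1
    return {key: n / total for key, (n, total) in counts.items()}

# Synthetic example: a score of 7 predicts ~60% recidivism in both groups,
# which is what "well-calibrated" means on Northpointe's criterion.
records = ([(7, "A", True)] * 6 + [(7, "A", False)] * 4
           + [(7, "B", True)] * 6 + [(7, "B", False)] * 4)
rates = calibration_table(records)
print(rates)  # {(7, 'A'): 0.6, (7, 'B'): 0.6}
```

A tool passes this check when the observed rates match across groups at every score level, even if, as the next section shows, its error rates conditioned on outcome still differ.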
Both analyses were technically sound, and both conclusions were accurate. The ProPublica analysis accurately described differential false positive and false negative rates by race. Northpointe accurately described calibration across races. These findings were not contradictory because they were measuring different properties of the same system.
The Competing Fairness Criteria
The Northpointe rebuttal framed the dispute as one of competing fairness criteria — a framing that turned out to be exactly right, and that enabled the next crucial analytical step.
The criteria at issue:
- Equal false positive rates (error-rate balance for non-recidivists): ensuring that those who will not reoffend are equally likely to be flagged as high risk regardless of race.
- Equal false negative rates (error-rate balance for recidivists): ensuring that those who will reoffend are equally likely to be flagged as low risk regardless of race.
- Calibration (score consistency): ensuring that a given score predicts the same actual outcome rate regardless of race.
The critical question: Can all three be achieved simultaneously?
Part 3: The Chouldechova Impossibility Result
The Mathematical Proof
In 2017, Alexandra Chouldechova, then a statistician at Carnegie Mellon University, published a mathematical proof that answered the question decisively: No. When two groups have different base rates of the outcome being predicted — when they actually reoffend at different rates — it is mathematically impossible to simultaneously achieve equal false positive rates, equal false negative rates, and calibration.
The proof is elegant and surprisingly simple. When base rates differ between groups (as they do between Black and white defendants in Broward County data, reflecting structural inequalities), any calibrated classifier will necessarily produce different false positive and false negative rates across groups. Conversely, a classifier that achieves equal false positive rates will necessarily be miscalibrated across groups if base rates differ.
This is not a property of COMPAS specifically. It is a mathematical property of any classification system applied to two groups with different base rates. No amount of algorithmic improvement, no change in inputs, no technical innovation will change this fundamental constraint. If you prioritize calibration, you will have unequal error rates. If you prioritize equal error rates, you will have miscalibration. You cannot have both when groups genuinely differ in the outcome they are being scored on.
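The constraint can be made concrete with the identity Chouldechova derives, which ties any binary classifier's false positive rate to its base rate p, positive predictive value (PPV), and false negative rate (FNR): FPR = p/(1−p) · (1−PPV)/PPV · (1−FNR). The numbers below are hypothetical, chosen only to show that holding PPV and FNR fixed while base rates differ forces unequal FPRs.

```python
# Chouldechova's identity for any binary classifier:
#   FPR = p / (1 - p) * (1 - PPV) / PPV * (1 - FNR)
# where p is the group's base rate of the predicted outcome.

def implied_fpr(p, ppv, fnr):
    """False positive rate forced by base rate p, positive
    predictive value ppv, and false negative rate fnr."""
    return p / (1 - p) * (1 - ppv) / ppv * (1 - fnr)

# Hypothetical groups sharing the same PPV (calibration) and the same
# FNR, but with different base rates of reoffending:
group_a_fpr = implied_fpr(p=0.51, ppv=0.60, fnr=0.30)
group_b_fpr = implied_fpr(p=0.39, ppv=0.60, fnr=0.30)

print(f"Group A FPR: {group_a_fpr:.3f}")  # 0.486
print(f"Group B FPR: {group_b_fpr:.3f}")  # 0.298
```

Equalizing the two FPRs would require giving the groups different PPVs, which breaks calibration: that trade-off is the impossibility result in one line of algebra.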
The Implications for Criminal Justice AI
The Chouldechova result does not resolve the COMPAS debate — it reframes it at a deeper and more honest level. The question "is COMPAS biased?" is revealed to be an incomplete question. The more accurate formulation is: "Which fairness criterion does COMPAS prioritize, and is that the right criterion for criminal sentencing?"
This is a political and ethical question, not a technical one. Calibration prioritizes consistency — the same score means the same thing for everyone. Equal error rates prioritize burden distribution — no group bears a disproportionate share of false accusations or false exculpations. Which of these is more important in a criminal justice context is a value judgment. In a context where false positives (incorrectly flagging someone as high risk) result in harsher treatment by the state, prioritizing equal false positive rates arguably follows from the principle that the state should not impose punitive risk on people based on their demographic membership.
But the base rates themselves — the different actual recidivism rates by race — are not fixed facts of nature. They reflect decades of differential policing, differential prosecution, differential socioeconomic opportunity, and racially unequal criminal justice administration. An algorithm calibrated to those base rates is, in effect, encoding and perpetuating structural racial inequality in the criminal justice system. The "calibration" that appears race-neutral is actually tracking the output of a racially biased system.
A second impossibility result, formalized by Kleinberg, Mullainathan, and Raghavan (2017) in a parallel analysis, established similar constraints under different fairness definitions. The set of results as a whole established that algorithmic fairness is not a technical optimization problem with a correct solution — it is a domain where fundamental value choices must be made explicitly.
Part 4: Loomis v. Wisconsin — The Legal Battle
The Case in Detail
Eric Loomis's challenge to his COMPAS-influenced sentence reached the Wisconsin Supreme Court as a test of whether due process rights are violated when a criminal sentence is influenced by a proprietary algorithmic assessment whose formula cannot be disclosed, examined, or challenged.
Loomis was convicted of eluding an officer and operating a vehicle without the owner's consent — offenses that, for a defendant with his prior record, carried significant sentencing exposure. The sentencing judge received Loomis's COMPAS assessment, which showed "high risk" on multiple dimensions. In announcing the sentence of six years, the judge specifically referenced COMPAS: "I'm concerned about the public safety. I'm concerned about your safety. This [COMPAS report] shows that you're a high risk to your community."
Loomis appealed, arguing: (1) COMPAS's proprietary formula violated his right to an individualized sentence because he could not challenge the formula; (2) using a group-based risk score for an individual sentence violated his due process rights; and (3) COMPAS's use of gender as an input variable in one of its subscales violated prohibitions on gender-based sentencing.
The Wisconsin Supreme Court's analysis acknowledged the seriousness of the due process concerns while ultimately rejecting the constitutional challenge. On the formula secrecy issue, the court held that Loomis received sufficient information — his scores, the factors assessed, and the ability to challenge their accuracy — to satisfy due process. On the group-based assessment issue, the court noted that courts routinely use group-based information (prior criminal records, employment history) in sentencing, and that COMPAS's actuarial approach was within that tradition. On the gender issue, the court accepted Northpointe's explanation that gender was used only in some subscales to improve accuracy.
The court did add important caveats: judges should not place "exclusive or determinative weight" on COMPAS; they should be aware of its limitations and the debates about its accuracy; and they should use it as one input among many, not as a substitute for individualized assessment.
The US Supreme Court's refusal to hear the case in 2017, without explanation, as is the Court's custom when denying certiorari, left the Wisconsin ruling as the authoritative treatment of the due process question. The constitutional challenge to algorithmic sentencing was not definitively resolved — a denial of certiorari does not constitute endorsement of the ruling below — but it removed the possibility of a near-term federal constitutional ruling that would have required reform across jurisdictions.
What the Loomis Ruling Permits and What It Leaves Unresolved
The Wisconsin ruling effectively permits the following: a sentencing judge may consult and explicitly reference a proprietary risk assessment whose formula cannot be disclosed to the defendant, as long as the judge treats it as one factor among many and does not rely on it exclusively. This is a significant authorization of algorithmic opacity in criminal proceedings.
What the ruling left unresolved: the specific constitutional limits on how much weight a judge can give a risk score; whether constitutionality changes if a jurisdiction formally incorporates risk scores into sentencing guidelines rather than leaving their use to judicial discretion; whether discovery rights require disclosure of validation studies, training data, and error rates even if the formula itself remains protected; and whether false positive error rates — documented in the ProPublica analysis — constitute evidence of systematic inaccuracy that defendants are entitled to challenge.
Part 5: Policy Responses Across States
The Range of State-Level Responses
Following the COMPAS controversy, state legislatures, court systems, and criminal justice reform advocates pursued a range of policy responses:
California (AB 2799, 2018): Required the California Department of Corrections and Rehabilitation to conduct independent evaluations of any risk assessment tool used in the state's justice system, to make evaluation results public, and to monitor for disparate impact by race and other characteristics. This represented a transparency and accountability requirement rather than a prohibition.
New Jersey: Implemented the PSA (Arnold Foundation's Public Safety Assessment) for pretrial decisions as part of its 2017 criminal justice reform, with a commitment to ongoing independent evaluation. The choice of PSA over COMPAS reflected, in part, a preference for a transparent, non-proprietary tool. New Jersey published detailed evaluation reports examining PSA outcomes by race and other characteristics.
New York: The state legislature passed discovery reform legislation in 2019 requiring that defendants be provided with risk assessment scores and the factors generating those scores in their criminal cases, strengthening disclosure rights beyond what Loomis established as a constitutional minimum.
Hawaii: Passed legislation requiring independent validation studies for any risk assessment tool used in criminal proceedings, specifying that the state should not rely on validation studies provided by tool vendors.
Abolition of Tool Use: No major state enacted a comprehensive prohibition on risk assessment tools in criminal proceedings, though several jurisdictions (Santa Cruz, California, for predictive policing; and some counties within states) enacted narrower prohibitions on specific applications.
Has COMPAS Been Reformed?
As of 2024, COMPAS remains in use in multiple US jurisdictions; Northpointe has since rebranded as Equivant and gone through further corporate transitions. The company has released a "Practitioner's Guide" and some technical documentation in response to transparency pressure, but the core formula has not been made fully public. It has argued that its tools have been independently validated and that the validation evidence supports continued use. Independent researchers who have examined COMPAS validation studies have found methodological limitations and have argued that the validation evidence is weaker than the company claims.
The ProPublica investigation's most direct effect was not regulatory reform of COMPAS but the catalysis of the algorithmic fairness academic field and the broader public and policy debate about AI in criminal justice. The specific policy changes enacted have been incremental rather than fundamental.
Analysis: What COMPAS Teaches About AI Accountability
The Accountability Chain
The COMPAS case reveals the accountability chain problem clearly: when algorithmic criminal justice causes harm, who is responsible? Northpointe/Equivant, as the tool developer, is not a state actor and cannot be sued for constitutional violations. Judges who use COMPAS are protected by judicial immunity. Prosecutors who recommend sentences informed by COMPAS are protected by prosecutorial immunity. The state that uses COMPAS in its justice system faces sovereign immunity limitations. The private company retains trade secrecy protection that defeats disclosure.
This is a near-comprehensive accountability shield: the tool developer cannot be sued constitutionally; the government officials who deploy it are immune; and the trade secrecy doctrine forecloses the transparency that would enable accountability through oversight. The person whose sentence is influenced by an inaccurate or racially biased algorithm has limited recourse.
The Fairness Framework Crisis
The Chouldechova impossibility result transformed the COMPAS debate from a question of technical error to a question of political choice: which fairness criterion should govern? This is a more honest question, but also a more uncomfortable one, because it requires acknowledging that there is no technically correct answer — only choices about whose interests to protect.
For criminal justice specifically, the relevant question is: Who should bear the burden of algorithmic error? Equal false positive rates would mean that no racial group is disproportionately falsely classified as dangerous — protecting defendants from racially differential risk. Calibration would mean that the scores reliably predict outcomes across groups — but at the cost of disparate error rates when base rates differ. The choice between these is a choice about whose interests to prioritize, and it is a choice that is currently made invisibly by algorithm developers without democratic deliberation.
The Transparency Imperative
The single most important policy lesson from the COMPAS case is the transparency imperative: criminal justice AI tools must be fully transparent to be defensible in a democratic constitutional system. Transparency is not merely nice to have; it is a precondition for meaningful challenge, independent evaluation, and accountability. A proprietary formula influencing criminal sentences is constitutionally and ethically anomalous — a private commercial secret embedded in the exercise of the state's coercive power.
Discussion Questions
- The Chouldechova impossibility result shows that when racial groups have different base rates of recidivism, no risk assessment tool can simultaneously satisfy all fairness criteria. Does this mean that algorithmic risk assessment in criminal sentencing is inherently inappropriate, or does it mean that fairness criteria must be chosen explicitly and democratically?
- The Wisconsin Supreme Court held that Northpointe's trade secrecy interest in the COMPAS formula was compatible with due process for criminal defendants. Do you agree? What would a constitutional due process requirement that adequately protected defendants' challenge rights look like?
- If COMPAS's disparate false positive rates for Black defendants are a mathematical consequence of base rate differences rather than a flaw in the algorithm's design, does this affect its ethical use in sentencing?
- The Arnold Foundation's PSA represents the transparency approach — a publicly available, non-proprietary methodology. What are the trade-offs between transparent, non-proprietary tools and proprietary tools that vendors claim are more accurate?
- The COMPAS controversy catalyzed the academic field of algorithmic fairness. Has that academic development translated into meaningful improvements in actual criminal justice AI practice? What would meaningful translation look like?