Case Study 2: COMPAS and Criminal Justice — When Algorithms Judge

Introduction

In May 2016, the investigative journalism organization ProPublica published an article that would reshape the global conversation about algorithmic fairness. Titled "Machine Bias," the article examined COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) — a risk assessment algorithm developed by the company Equivant (formerly Northpointe) and used by courts across the United States to predict the likelihood that a criminal defendant would reoffend.

The article's central finding was explosive: COMPAS was nearly twice as likely to falsely label Black defendants as future criminals (false positives) as white defendants, and markedly more likely to falsely label white defendants as low-risk when they went on to reoffend (false negatives). The algorithm, ProPublica argued, was systematically biased against Black Americans.

Equivant disputed the findings. Academic researchers weighed in. A mathematical proof emerged showing that the debate was not, at its core, about whether COMPAS was biased — but about what "fair" means when applied to an algorithm that affects human freedom. The COMPAS controversy remains, nearly a decade later, the most important case study in algorithmic fairness — not because it was resolved, but because it revealed that the question of fairness in AI does not have a single correct answer.


What COMPAS Does

COMPAS is a proprietary risk assessment tool that evaluates criminal defendants on a 1-to-10 scale across several dimensions, including risk of general recidivism (committing any new crime) and risk of violent recidivism (committing a new violent crime). The scores are derived from a questionnaire of 137 questions, covering criminal history, substance abuse, social and family factors, residential stability, education, and attitudes toward criminal behavior.

Judges use COMPAS scores — alongside other information — when making bail, sentencing, and parole decisions. A defendant with a high COMPAS score might receive a longer sentence, be denied bail, or be denied parole. A defendant with a low score might receive probation instead of incarceration or be released on their own recognizance.

COMPAS is not the only risk assessment tool used in the US criminal justice system — others include the Public Safety Assessment (PSA), the Ohio Risk Assessment System (ORAS), and the Virginia Pretrial Risk Assessment Instrument (VPRAI). But COMPAS became the focal point of the fairness debate because of ProPublica's analysis and because it is commercially developed, proprietary, and widely deployed.

Caution

The use of AI in criminal justice decisions raises stakes that are qualitatively different from commercial applications. A biased product recommendation wastes a customer's time. A biased hiring algorithm costs someone a job. A biased criminal justice algorithm can cost someone their freedom. The severity of the consequences demands a correspondingly higher standard of scrutiny.


The ProPublica Investigation

ProPublica obtained COMPAS risk scores for over 7,000 defendants arrested in Broward County, Florida, between 2013 and 2014. They matched these scores against actual recidivism outcomes over the following two years — did the defendant actually commit a new crime?

The analysis produced several findings:

Overall accuracy was moderate. COMPAS correctly predicted recidivism approximately 61 percent of the time — roughly comparable to a group of humans with no criminal justice expertise who were asked to predict recidivism based on a defendant's age and criminal history.

False positive rates differed dramatically by race. Among defendants who did not go on to reoffend, 44.9 percent of Black defendants had been classified as high-risk by COMPAS, compared with 23.5 percent of white defendants. In other words, COMPAS was nearly twice as likely to falsely flag a Black defendant as a future criminal.

False negative rates also differed. Among defendants who did reoffend, 47.7 percent of white defendants had been classified as low-risk, compared with 28.0 percent of Black defendants. COMPAS was markedly more likely to incorrectly classify a white defendant as safe.
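ProPublica's disparity analysis reduces to a disaggregated confusion-matrix computation: for each racial group, compute the false positive rate among non-reoffenders and the false negative rate among reoffenders. A minimal sketch (the function name and record format are illustrative, not ProPublica's actual code or the Broward County data):

```python
from collections import defaultdict

def error_rates_by_group(records):
    """Compute false positive and false negative rates per group.
    records: iterable of (group, predicted_high_risk, reoffended) tuples.
    FPR = share of non-reoffenders flagged high-risk;
    FNR = share of actual reoffenders classified low-risk."""
    counts = defaultdict(lambda: {"fp": 0, "neg": 0, "fn": 0, "pos": 0})
    for group, predicted_high, reoffended in records:
        c = counts[group]
        if reoffended:
            c["pos"] += 1
            c["fn"] += int(not predicted_high)
        else:
            c["neg"] += 1
            c["fp"] += int(predicted_high)
    return {
        group: {
            "fpr": c["fp"] / c["neg"] if c["neg"] else float("nan"),
            "fnr": c["fn"] / c["pos"] if c["pos"] else float("nan"),
        }
        for group, c in counts.items()
    }
```

An error-rate-balance (equalized odds) audit then amounts to comparing the per-group "fpr" and "fnr" values against each other.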

ProPublica summarized the finding bluntly: "Black defendants were far more likely than white defendants to be incorrectly judged to be at a higher risk of recidivism, while white defendants were more likely than Black defendants to be incorrectly flagged as low risk."

The article included individual stories that gave the statistics human weight. An 18-year-old Black woman named Brisha Borden, with no prior adult offenses, who had taken a child's bicycle and a scooter, received a high risk score. A 41-year-old white man named Vernon Prater, who had been convicted of armed robbery and had served five years in prison, received a low risk score. Borden did not reoffend. Prater went on to commit grand theft.


Equivant's Response

Equivant (then Northpointe) published a detailed rebuttal. Their central argument was not that ProPublica's numbers were wrong, but that ProPublica had used the wrong definition of fairness.

Equivant's defense rested on a fairness criterion called predictive parity (a form of calibration): among all defendants who receive a given risk score, the actual recidivism rate should be roughly the same regardless of race. And by this measure, COMPAS was fair. A Black defendant with a COMPAS score of 7 and a white defendant with a COMPAS score of 7 had approximately the same likelihood of actually reoffending. The scores meant the same thing across racial groups.
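Checking predictive parity is mechanically simple: bucket defendants by score, then compare observed recidivism rates across groups within each bucket. A calibrated model yields roughly equal rates across groups at the same score. A sketch under the same illustrative record format as above (function name and data shape are our own):

```python
from collections import defaultdict

def recidivism_rate_by_score_and_group(records):
    """Predictive parity check.
    records: iterable of (group, score, reoffended) tuples.
    Returns the observed recidivism rate for each (score, group) pair;
    a calibrated model shows similar rates across groups at each score."""
    totals = defaultdict(int)
    reoffenses = defaultdict(int)
    for group, score, reoffended in records:
        totals[(score, group)] += 1
        reoffenses[(score, group)] += int(reoffended)
    return {key: reoffenses[key] / totals[key] for key in totals}
```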

ProPublica had evaluated COMPAS using a different fairness criterion: error rate balance (a component of equalized odds). They asked whether the model's mistakes — false positives and false negatives — were distributed equally across racial groups. By this measure, COMPAS was unfair, because Black defendants bore a disproportionate burden of false positive errors.

Both ProPublica and Equivant were, in their own terms, correct. The question was not who had made a mathematical error. The question was which definition of fairness should apply.


The Mathematical Impossibility

The COMPAS debate forced the academic community to confront a mathematical truth that had been proven but not widely appreciated: the definitions of fairness that ProPublica and Equivant each invoked cannot, in general, be satisfied simultaneously.

Chouldechova (2017) and Kleinberg, Mullainathan, and Raghavan (2016) independently proved that when base rates differ across groups — when one group has a higher actual rate of the predicted outcome than another — it is mathematically impossible to simultaneously achieve:

  1. Calibration (predictive parity): Equal predictive value across groups — a score of X means the same probability for all groups.
  2. Equal false positive rates: The proportion of non-recidivists incorrectly flagged as high-risk is the same across groups.
  3. Equal false negative rates: The proportion of actual recidivists incorrectly classified as low-risk is the same across groups.

In Broward County, the base rate of recidivism differed by race: approximately 51 percent for Black defendants and 39 percent for white defendants. (These base rates themselves reflect systemic factors — policing patterns, poverty, historical disinvestment — not inherent criminality. But they are the mathematical reality that any model must contend with.)

Given these different base rates, COMPAS could be calibrated (Equivant's criterion) or could have equal error rates (ProPublica's criterion), but not both. The impossibility is not a flaw in COMPAS. It is a mathematical property of binary classification with differing base rates. No algorithm — no matter how sophisticated — can escape it.
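The arithmetic behind the impossibility can be made concrete. Chouldechova (2017) shows that for a calibrated binary classifier, FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR), where p is the group's base rate and PPV is the positive predictive value. If PPV and FNR are held equal across groups, the false positive rates are forced apart whenever base rates differ. A sketch using the Broward County base rates from the text (the PPV and FNR values below are illustrative, not COMPAS's actual figures):

```python
def implied_fpr(base_rate, ppv, fnr):
    """Chouldechova (2017): FPR = p/(1-p) * (1-PPV)/PPV * (1-FNR).
    Under calibration (equal PPV) and equal FNR across groups, the
    false positive rate is fully determined by the group's base rate."""
    return base_rate / (1 - base_rate) * (1 - ppv) / ppv * (1 - fnr)

# Base rates from the text; PPV and FNR are illustrative assumptions
ppv, fnr = 0.65, 0.35
fpr_higher_base = implied_fpr(0.51, ppv, fnr)  # base rate 51%
fpr_lower_base = implied_fpr(0.39, ppv, fnr)   # base rate 39%
```

With identical PPV and FNR, the group with the higher base rate necessarily incurs the higher false positive rate; no amount of model tuning removes the gap.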

Research Note: The impossibility theorem does not mean that fairness is unachievable. It means that fairness requires choosing which definition to prioritize — and that choice is a moral and political judgment, not a technical one. Different choices produce different consequences for different communities. The theorem forces transparency about what tradeoffs are being made and who bears the cost.


The Deeper Questions

The COMPAS debate surfaces questions that extend far beyond any single algorithm:

Should We Use Algorithms in Criminal Justice at All?

Risk assessment tools like COMPAS were introduced partly to reduce human bias in judicial decisions. Extensive research documents that judges' sentencing decisions are influenced by legally irrelevant factors — the defendant's race, the time of day, whether the judge's local sports team won the previous evening, and even how recently the judge had a meal break. A risk assessment tool, the argument goes, provides a consistent, data-driven input that constrains the worst excesses of human subjectivity.

But this argument assumes that the algorithm's biases are less harmful than the judge's biases. ProPublica's analysis suggests that the algorithm may simply encode a different form of the same bias — one that is less visible but more systematic. A biased judge can be overruled on appeal. A biased algorithm shapes thousands of decisions with mechanical consistency.

What Does the Base Rate Mean?

The mathematical impossibility theorem rests on the fact that base rates differ across groups. But the base rates themselves are not neutral facts — they are products of the very system that risk assessment tools are embedded in.

If Black Americans are arrested at higher rates partly because of differential policing (more police patrols in Black neighborhoods, different enforcement priorities, documented racial disparities in stop-and-frisk and traffic stops), then the "base rate" of recidivism for Black defendants is partially an artifact of the system, not a measure of underlying criminality. Training a model on data from this system, and then deploying the model within the same system, creates a circularity that no fairness metric can resolve.

This is the feedback loop problem from the chapter, in its most consequential form. Biased policing produces biased arrest data. Biased arrest data produces biased risk scores. Biased risk scores influence sentencing and parole decisions. Those decisions shape future arrest data. The loop reinforces itself.

Who Decides What "Fair" Means?

The impossibility theorem forces a choice: calibration or equal error rates. This choice determines which group bears the cost of the model's inevitable errors.

If we prioritize calibration (Equivant's position), the model's scores are equally meaningful across groups, but Black defendants bear a disproportionate burden of false positives — being labeled high-risk when they would not have reoffended. The cost is borne by innocent Black defendants who are denied bail, sentenced more harshly, or denied parole because of a prediction that turns out to be wrong.

If we prioritize equal error rates (ProPublica's position), both groups bear an equal share of false positives and false negatives, but the model's scores mean different things for different groups — a score of 7 for a Black defendant corresponds to a different actual probability of recidivism than a score of 7 for a white defendant. The cost is borne by all defendants, in the form of a less accurate prediction system.

Neither option is costless. Neither is obviously correct. The choice is, at its core, a question about values: whose interests should be protected, and at what cost to whom?


What Happened After ProPublica

The COMPAS story did not end with the 2016 article. Its ripple effects have shaped law, policy, and research in the years since.

Academic research exploded. The COMPAS debate launched the modern field of algorithmic fairness as a formal discipline. The annual ACM Conference on Fairness, Accountability, and Transparency (FAccT) was founded in 2018, directly inspired by the questions the debate raised. Hundreds of papers have since explored the mathematical foundations of fairness, proposed new fairness metrics, and developed mitigation techniques.

Jurisdictions pushed back. Several jurisdictions have restricted or reconsidered the use of algorithmic risk assessment in criminal justice. In 2020, the Idaho Supreme Court upheld a defendant's right to challenge a risk assessment tool's methodology. In 2021, the European Commission proposed that AI systems used in criminal justice be classified as "high-risk" under the AI Act, requiring transparency, human oversight, and bias testing.

COMPAS remains in use. Despite the controversy, COMPAS and similar risk assessment tools continue to be used across the United States. Proponents argue that even an imperfect algorithm is better than unconstrained judicial discretion, particularly when judges receive bias training and understand the tool's limitations. Critics argue that the tools provide a veneer of scientific objectivity to decisions that are fundamentally shaped by systemic inequality.

The debate shifted. The most important legacy of the COMPAS controversy is not a verdict on the algorithm itself but a shift in the terms of debate. Before ProPublica's article, discussions of AI fairness were largely theoretical. After it, they became practical, legal, and political. The question moved from "Can algorithms be fair?" to "Who decides what fair means, and who bears the cost when they are wrong?"


Lessons for Business Leaders

The COMPAS case may seem remote from the concerns of an MBA student preparing for a career in business. It is not. The mathematical impossibility that underlies the COMPAS debate — the tension between different definitions of fairness when base rates differ — applies to every high-stakes AI system, including those in business contexts.

1. Fairness Is Not a Single Metric

There is no single metric called "fairness" that, if satisfied, guarantees an equitable outcome. Fairness is a family of metrics, and they can conflict with each other. Business leaders who ask, "Is our model fair?" are asking an incomplete question. The complete question is: "Fair by which definition, for whom, and at what cost to whom?"

2. Base Rates Matter — and Base Rates Have Histories

When a lending model shows different default rates across demographic groups, the difference is not just a statistical fact — it reflects decades of unequal access to credit, education, and wealth-building opportunities. Treating the base rate as a neutral input to a fairness calculation ignores the history that produced it.

3. Transparency Is Non-Negotiable

COMPAS is proprietary. Equivant has resisted disclosing the algorithm's full methodology, arguing that it is a trade secret. This lack of transparency makes it impossible for defendants, their attorneys, or independent researchers to fully evaluate or challenge the algorithm's predictions. For business leaders, the lesson is clear: AI systems that affect people's lives must be transparent enough to be scrutinized. "Trust us" is not a fairness strategy.

4. Human Oversight Is Essential, Not Optional

Risk assessment tools were designed to inform judicial decisions, not replace them. But in practice, as with Athena's HR screening model, human decision-makers frequently defer to algorithmic recommendations — the automation bias problem. A COMPAS score of 8 becomes, in the judge's mind, a fact rather than a prediction. Organizational design must ensure that humans retain genuine decision-making authority and are trained to exercise it.

5. The Stakes Determine the Standard

The severity of potential harm should determine the rigor of fairness evaluation. A product recommendation engine that occasionally suggests an irrelevant item causes minimal harm. A criminal justice algorithm that contributes to a wrongful incarceration causes irreversible harm. Business leaders must calibrate their fairness standards to the stakes of the decision — and HR, lending, insurance, and healthcare AI all involve high stakes that demand rigorous scrutiny.


Connection to Athena

Athena's HR screening model shares structural similarities with COMPAS:

  • Both were trained on historical human decisions. COMPAS was trained on past criminal justice outcomes; Athena's model was trained on past hiring decisions. In both cases, the historical decisions reflected the biases of the humans and systems that produced them.

  • Both amplified existing patterns. COMPAS's differential error rates reflect patterns in the criminal justice system that produced its training data. Athena's model amplified the age bias present in its historical hiring data.

  • Both raise the question of who bears the cost. False positives in COMPAS fall disproportionately on Black defendants. False negatives in Athena's model fall disproportionately on older candidates and candidates without four-year degrees.

  • Both reveal the limits of aggregate metrics. COMPAS's overall accuracy of 61% and Athena's overall accuracy mask significant group-level disparities. Both cases demonstrate that disaggregated evaluation — examining performance for each subgroup — is essential.

The key difference is stakes. Athena's bias affected job interviews. COMPAS's bias affected human freedom. But the principle is the same: an AI system trained on biased history will produce biased futures, and the people harmed by that bias are disproportionately those who have been historically marginalized.


Discussion Questions

  1. ProPublica and Equivant each used a mathematically valid definition of fairness to reach opposite conclusions about whether COMPAS is biased. If you were a judge deciding whether to use COMPAS in your courtroom, which definition would you prioritize? Why? What would you tell defendants who bear the cost of your choice?

  2. The COMPAS debate assumes that "recidivism" (being arrested for a new crime) is a reliable measure of criminal behavior. Is it? How do policing patterns, prosecutorial discretion, and access to legal representation affect who gets counted as a "recidivist"?

  3. Some advocates argue that we should abandon algorithmic risk assessment in criminal justice entirely. Others argue that the alternative — pure judicial discretion — is at least as biased. Evaluate both positions. Is there a middle ground?

  4. COMPAS is proprietary. Should risk assessment tools used in criminal justice be required to be open-source? What are the arguments for and against transparency in this context?

  5. The impossibility theorem applies to business AI as well. Identify a business context (lending, insurance, hiring, marketing) where the same impossibility — the inability to simultaneously satisfy calibration, equal false positive rates, and equal false negative rates — would arise. How would you navigate the tradeoff in that context?


Sources

  • Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016). "Machine Bias." ProPublica, May 23, 2016.
  • Dieterich, W., Mendoza, C., & Brennan, T. (2016). "COMPAS Risk Scales: Demonstrating Accuracy Equity and Predictive Parity." Northpointe Research Department.
  • Chouldechova, A. (2017). "Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments." Big Data, 5(2), 153-163.
  • Kleinberg, J., Mullainathan, S., & Raghavan, M. (2016). "Inherent Trade-Offs in the Fair Determination of Risk Scores." Proceedings of Innovations in Theoretical Computer Science (ITCS).
  • Dressel, J. & Farid, H. (2018). "The accuracy, fairness, and limits of predicting recidivism." Science Advances, 4(1).
  • Corbett-Davies, S. & Goel, S. (2018). "The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning." arXiv preprint.
  • Washington, A.L. (2018). "How to Argue with an Algorithm: Lessons from the COMPAS-ProPublica Debate." Colorado Technology Law Journal, 17(1).