Case Study: The COMPAS Algorithm — Predicting Recidivism

"Across the nation, judges, probation and parole officers are using scores generated by algorithms to make decisions about a defendant's future. But a ProPublica analysis shows those scores are unreliable and biased against Black defendants." — ProPublica, "Machine Bias," May 23, 2016

Overview

The COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) case sits at the center of the most prominent and consequential debate in algorithmic fairness. A recidivism prediction tool used in courtrooms across the United States became the test case for competing definitions of fairness, raising questions that remain unresolved today. This case study examines the system, the ProPublica investigation, the competing claims, and the deeper structural questions the debate reveals.

Skills Applied:

  • Analyzing algorithmic bias in criminal justice
  • Distinguishing between different fairness metrics
  • Connecting bias to structural inequality
  • Evaluating the Accountability Gap in automated decision-making


The System

What COMPAS Does

COMPAS was developed by Northpointe, Inc. (now Equivant) and has been used since 1998 in courtrooms across the United States, including Wisconsin, New York, California, Florida, and several other states. The system takes as input a defendant's demographic information, criminal history, and responses to a 137-item questionnaire. Topics covered include:

  • Criminal history and prior convictions
  • Substance abuse history
  • Social environment and peer associations
  • Family structure and living arrangements
  • Education and employment history
  • Attitudes toward crime and the legal system
  • Personal responsibility and financial status

From these inputs, COMPAS generates three scores on a scale of 1 to 10:

  • Pretrial risk score: Likelihood of failing to appear for trial
  • General recidivism score: Likelihood of committing any new crime within two years
  • Violent recidivism score: Likelihood of committing a violent crime within two years

These scores are presented to judges and used to inform — though not determine — decisions about bail, sentencing, and parole. A high general recidivism score might lead a judge to impose a longer sentence or deny parole. A high pretrial risk score might result in higher bail or pretrial detention.

The Scope of Impact

COMPAS and similar risk assessment tools are used in most U.S. states. An estimated 60% of U.S. jurisdictions use some form of algorithmic risk assessment in pretrial, sentencing, or parole decisions. The tools affect hundreds of thousands of decisions per year, influencing whether defendants are held in jail or released, how long they are sentenced, and whether they are granted parole.

The consequences of these decisions are not abstract. Defendants held in pretrial detention (often because of a high risk score) are more likely to lose their jobs, lose custody of their children, and plead guilty to charges — even when they might have been acquitted at trial — simply to get out of jail. The downstream effects of pretrial detention compound through families and communities.


The ProPublica Investigation

Methodology

In May 2016, ProPublica journalists Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner published "Machine Bias," an investigation based on COMPAS scores for more than 7,000 defendants in Broward County, Florida. The team:

  1. Obtained COMPAS scores for defendants arrested in Broward County between 2013 and 2014 through public records requests
  2. Merged these scores with public criminal records to determine which defendants actually reoffended within two years
  3. Analyzed error rates — false positives (predicted high risk but did not reoffend) and false negatives (predicted low risk but did reoffend) — disaggregated by race
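
A minimal sketch of the third step is below. It assumes the merged records live in a pandas DataFrame with hypothetical column names ("race", "decile_score", "reoffended_within_two_years"); ProPublica grouped medium and high decile scores (5 through 10) together as "higher risk" in its error-rate analysis, and the sketch follows that convention. This is the shape of the analysis, not ProPublica's actual code.

    # Sketch of the error-rate analysis; file and column names are placeholders.
    import pandas as pd

    df = pd.read_csv("broward_merged.csv")        # hypothetical merged dataset
    df["high_risk"] = df["decile_score"] >= 5     # scores 5-10 treated as higher risk

    def error_rates(group):
        no_reoffend = group[group["reoffended_within_two_years"] == 0]
        reoffend = group[group["reoffended_within_two_years"] == 1]
        return pd.Series({
            # labeled high-risk but did not reoffend
            "false_positive_rate": no_reoffend["high_risk"].mean(),
            # labeled low-risk but did reoffend
            "false_negative_rate": (~reoffend["high_risk"]).mean(),
        })

    print(df.groupby("race").apply(error_rates))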

Key Findings

The investigation revealed a stark asymmetry in errors:

  Metric                                                          Black Defendants   White Defendants
  False positive rate (labeled high-risk but did not reoffend)    44.9%              23.5%
  False negative rate (labeled low-risk but did reoffend)         28.0%              47.7%
  Overall accuracy                                                 ~61%               ~59%

The pattern: Black defendants were nearly twice as likely as white defendants to be incorrectly labeled high-risk. White defendants were nearly twice as likely as Black defendants to be incorrectly labeled low-risk.

In human terms: if you are a Black defendant who will not reoffend, COMPAS is roughly twice as likely to tell the judge you are dangerous. If you are a white defendant who will reoffend, COMPAS is roughly twice as likely to tell the judge you are safe.

Individual Stories

ProPublica paired statistical analysis with individual cases that made the disparity concrete:

Brisha Borden, an 18-year-old Black woman, was arrested for taking a child's bicycle and a scooter that were left on a sidewalk. She had no prior adult criminal record. COMPAS scored her as high risk (8 out of 10 for general recidivism). She did not reoffend.

Vernon Prater, a 41-year-old white man, was arrested for shoplifting $86.35 worth of tools from Home Depot. He had previously been convicted of armed robbery and had served five years in prison. COMPAS scored him as low risk (3 out of 10). He went on to commit another crime within two years.

These cases illustrate the human stakes of differential error rates. The algorithm's mistakes are not symmetric — and they track the racial fault lines of the American criminal justice system.


The Defense: Northpointe's Response

The Calibration Argument

Northpointe (now Equivant) published a detailed response arguing that COMPAS was, in fact, fair — by a different metric. Their key claim: COMPAS is calibrated.

Calibration means that among defendants who received the same score, actual recidivism rates were similar across racial groups. A Black defendant who scored a "7" and a white defendant who scored a "7" reoffended at approximately the same rate. The score means the same thing regardless of race.

Northpointe argued that calibration is the appropriate standard of fairness for a prediction tool: if the tool's predictions are equally accurate for all groups, it is treating everyone fairly. The differential false positive rates, they argued, are a consequence of different base rates of recidivism — not a flaw in the algorithm.
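
What a calibration check looks like in practice can be sketched in a few lines, using the same hypothetical DataFrame and column names as above; this shows the shape of the comparison Northpointe's argument rests on, not their actual analysis.

    import pandas as pd

    df = pd.read_csv("broward_merged.csv")        # hypothetical dataset

    # Observed recidivism rate for each (score, race) cell.
    calibration = (
        df.groupby(["decile_score", "race"])["reoffended_within_two_years"]
          .mean()
          .unstack("race")
    )
    print(calibration)
    # If the score is calibrated, each row is roughly flat: defendants with the
    # same score reoffend at roughly the same rate regardless of race.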

The Base Rate Problem

The base rates in the Broward County data were not equal. Black defendants had a higher observed recidivism rate (approximately 51%) than white defendants (approximately 39%). Whether this difference reflects genuine behavioral differences, differential policing, systemic disadvantage, or some combination is a matter of profound debate — but the statistical reality is that the rates differ.

When base rates differ, a calibrated classifier will necessarily produce different false positive and false negative rates across groups (unless its predictions are perfect). This is not a flaw in COMPAS specifically; it is a mathematical property of all calibrated classifiers operating on populations with different base rates.

This is the core of the impossibility theorem, which Chapter 15 will examine formally.
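
A toy simulation makes the point concrete. In the sketch below (an illustration, not a model of COMPAS), each defendant's score is their true probability of reoffending, so the score is perfectly calibrated by construction; the score distribution for each group is an assumed Beta distribution whose mean equals that group's base rate. At the same threshold, the two groups end up with different false positive and false negative rates.

    import numpy as np

    rng = np.random.default_rng(0)
    N = 100_000            # simulated defendants per group
    THRESHOLD = 0.5        # scores at or above this are labeled "high risk"

    def simulate_group(base_rate):
        # Assumed score distribution: Beta with mean equal to the base rate.
        scores = rng.beta(base_rate * 10, (1 - base_rate) * 10, size=N)
        reoffends = rng.random(N) < scores        # calibrated by construction
        high_risk = scores >= THRESHOLD
        fpr = high_risk[~reoffends].mean()        # high risk, did not reoffend
        fnr = (~high_risk[reoffends]).mean()      # low risk, did reoffend
        return fpr, fnr

    for label, rate in [("base rate 0.51", 0.51), ("base rate 0.39", 0.39)]:
        fpr, fnr = simulate_group(rate)
        print(f"{label}: false positive rate {fpr:.2f}, false negative rate {fnr:.2f}")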


The Structural Context

Why Base Rates Differ

The COMPAS debate cannot be understood outside its structural context. Black defendants in Broward County had a higher observed recidivism rate than white defendants. But why?

The answer is not that Black people are inherently more likely to commit crimes. The factors that predict recidivism — poverty, unemployment, housing instability, substance abuse, exposure to violence, lack of access to reentry services, and intensive policing — are conditions that are systematically more prevalent in Black communities as a direct consequence of centuries of discriminatory policy: slavery, Jim Crow, redlining, mass incarceration, employment discrimination, school segregation, and the war on drugs.

A risk assessment tool that uses these conditions as predictive features is not discovering a truth about race. It is encoding the consequences of racism into a score — and presenting that score as an objective assessment of individual dangerousness.

The Measurement Problem

COMPAS predicts "recidivism," but recidivism is typically measured as re-arrest — not as actual criminal behavior. Re-arrest reflects policing patterns as much as it reflects crime. Communities that are more heavily policed have higher re-arrest rates — not necessarily because more crime occurs, but because more crime is detected. If Black neighborhoods are policed more intensively (and extensive research documents that they are), then using re-arrest as the target variable introduces measurement bias that compounds the historical bias already present in the data.

The Black Box Problem

COMPAS is a proprietary system. Northpointe/Equivant has not fully disclosed the algorithm's methodology, the specific weights assigned to each questionnaire item, or the training data used to develop the model. Defendants and their attorneys cannot inspect the system that contributes to their sentencing.

In State v. Loomis (2016), the Wisconsin Supreme Court ruled that judges may use COMPAS scores as one factor in sentencing — but also acknowledged that the proprietary nature of the algorithm raised due process concerns. The court stopped short of requiring disclosure, creating a precedent in which defendants can be sentenced based partly on a score they cannot inspect, challenge, or understand.


Alternative Analyses

The "Better Algorithm" View

Some researchers argue that the COMPAS debate highlights the need for better algorithms — systems that are more accurate, more transparent, and designed with fairness constraints built in. A simpler model using only age and criminal history has been shown to perform as well as COMPAS (Dressel and Farid, 2018), suggesting that the 137-item questionnaire may add complexity without adding predictive value.
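
For concreteness, a two-feature model of the kind Dressel and Farid describe can be fit in a few lines. This is an illustrative sketch under assumed file and column names, not their replication.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    df = pd.read_csv("defendants.csv")             # hypothetical dataset
    X = df[["age", "priors_count"]]                # only two features
    y = df["reoffended_within_two_years"]

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    model = LogisticRegression().fit(X_train, y_train)
    print("Held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))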

This view emphasizes technical improvement but does not question whether algorithmic risk assessment should be used in criminal justice at all.

The Abolitionist View

A more radical critique holds that algorithmic risk assessment in criminal justice is fundamentally incompatible with justice — regardless of how accurate or fair the algorithm is. This view argues that:

  • Predicting future behavior based on past data treats individuals as statistical entities rather than moral agents capable of change
  • Risk assessment tools embed systemic inequality in a veneer of scientific objectivity
  • The very concept of a "risk score" for a human being is dehumanizing
  • Resources spent on prediction should be redirected to addressing the root causes of crime: poverty, lack of opportunity, inadequate mental health and substance abuse treatment, and community disinvestment

The Pragmatic Middle

Others argue for a middle position: algorithmic tools can reduce certain forms of human bias (judges are also biased, and studies show significant sentencing disparities driven by judicial ideology, mood, and implicit bias) but only if they are transparent, regularly audited, used as one input among many, and accompanied by the political will to address the structural conditions that produce differential base rates.


Discussion Questions

  1. The fairness dilemma. ProPublica measured fairness by equal false positive rates. Northpointe measured fairness by calibration. Both are mathematically correct. If you were a judge using COMPAS, which definition would you prioritize, and why? Does your answer change depending on whether you prioritize the rights of defendants or public safety?

  2. The base rate question. Should an algorithm be "allowed" to predict higher risk for a group with a genuinely higher base rate of recidivism — when that higher rate is itself a product of structural inequality? Is there a principled way to use statistical prediction without reproducing the inequalities embedded in the data?

  3. Human vs. algorithmic judges. Research shows that human judges exhibit their own biases: sentencing disparities based on race, harsher sentences before lunch, and implicit bias affecting bail decisions. Is a biased algorithm better or worse than a biased human? Does the answer depend on the type of bias? On its consistency?

  4. The proprietary problem. COMPAS is proprietary software. Should defendants have the right to inspect the algorithm that contributes to their sentencing? What weight should courts give to trade secret claims when they conflict with due process rights?


Your Turn: Mini-Project

Option A: Error Rate Analysis. Using the error rates reported in this case study, construct a hypothetical population of 1,000 defendants (500 Black, 500 white) with the base rates and error rates described. Calculate how many individuals in each group are correctly and incorrectly classified. Present your results in a confusion matrix and write a one-page analysis of what the numbers mean in human terms — how many people are held in jail who should not be?
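
A starter sketch for Option A is below, using the rounded base rates and error rates reported above; the 500/500 split follows the prompt.

    def confusion_counts(n, base_rate, fpr, fnr):
        # Expected counts for a group of n defendants given its base rate and error rates.
        reoffend = n * base_rate
        no_reoffend = n - reoffend
        return {
            "labeled high-risk, did reoffend (true positive)":      reoffend * (1 - fnr),
            "labeled low-risk, did reoffend (false negative)":      reoffend * fnr,
            "labeled high-risk, did not reoffend (false positive)": no_reoffend * fpr,
            "labeled low-risk, did not reoffend (true negative)":   no_reoffend * (1 - fpr),
        }

    groups = {
        "Black defendants": confusion_counts(500, 0.51, 0.449, 0.280),
        "White defendants": confusion_counts(500, 0.39, 0.235, 0.477),
    }
    for name, counts in groups.items():
        print(name)
        for label, value in counts.items():
            print(f"  {label}: {value:.0f}")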

Option B: Sentencing Policy. You are a state legislator considering a bill that would either mandate or ban the use of algorithmic risk assessment in criminal sentencing. Write a two-page policy memo that evaluates the arguments for and against each option, drawing on the COMPAS case and the concepts from this chapter. Recommend a specific policy and justify it.

Option C: Alternative Design. Design (on paper) an alternative to COMPAS for informing pretrial detention decisions. Specify: What inputs would you use? What would you predict? What fairness definition would you prioritize and why? How would you handle the base rate problem? How would you ensure transparency and accountability?


References

  • Angwin, Julia, Jeff Larson, Surya Mattu, and Lauren Kirchner. "Machine Bias." ProPublica, May 23, 2016.

  • Northpointe Inc. "COMPAS Risk Scales: Demonstrating Accuracy Equity and Predictive Parity." Northpointe Research Department, July 8, 2016.

  • Chouldechova, Alexandra. "Fair Prediction with Disparate Impact: A Study of Bias in Recidivism Prediction Instruments." Big Data 5, no. 2 (2017): 153-163.

  • Dressel, Julia, and Hany Farid. "The Accuracy, Fairness, and Limits of Predicting Recidivism." Science Advances 4, no. 1 (2018): eaao5580.

  • Corbett-Davies, Sam, and Sharad Goel. "The Measure and Mismeasure of Fairness: A Critical Review of Fair Machine Learning." arXiv preprint arXiv:1808.00023, 2018.

  • State v. Loomis, 881 N.W.2d 749 (Wis. 2016).

  • Harcourt, Bernard E. Against Prediction: Profiling, Policing, and Punishing in an Actuarial Age. Chicago: University of Chicago Press, 2007.

  • Alexander, Michelle. The New Jim Crow: Mass Incarceration in the Age of Colorblindness. New York: The New Press, 2010.