Case Study 19-2: ProPublica's COMPAS Audit — External Auditing from Output Data Alone

Overview

In May 2016, ProPublica published "Machine Bias," an investigative piece by Julia Angwin, Jeff Larson, Surya Mattu, and Lauren Kirchner. The article documented racial disparities in COMPAS (Correctional Offender Management Profiling for Alternative Sanctions), a widely used recidivism prediction tool developed by Northpointe (later Equivant) and used by courts in Florida and elsewhere to assess the likelihood that criminal defendants would reoffend.

The ProPublica investigation was not just a piece of investigative journalism. It was a methodological innovation: it demonstrated that meaningful external auditing of an AI system is possible from output data alone, without access to the model's design, training data, or internal workings. When Northpointe refused to disclose the details of COMPAS's algorithm, ProPublica used public court records to construct an analysis that answered the questions a full technical audit would have addressed.

This case study examines the ProPublica investigation's methodology, findings, and limitations; Northpointe's response; the academic controversy over competing fairness definitions that the investigation triggered; and the broader lessons it provides for external AI auditing when model access is denied.


Background: What Is COMPAS?

COMPAS is a commercial algorithmic risk assessment tool used in criminal justice contexts. It takes questionnaire responses about a defendant's criminal history, social background, and attitudes and produces scores predicting the likelihood of future criminal activity. In Florida — the jurisdiction ProPublica studied — COMPAS scores were incorporated into pre-sentencing reports provided to judges and into bail determination processes.

Northpointe (the company that developed COMPAS) treated the underlying algorithm as a trade secret and declined to disclose how the score was calculated, which variables were included, and what weights were assigned to different factors. This opacity made it impossible for defendants to challenge their scores through traditional legal processes — a due process concern discussed at length in Case Study 20-2.

By 2016, COMPAS was being used in courts across the United States. Its widespread deployment, combined with its opacity and its potential to influence defendants' liberty interests, made it an important subject for external scrutiny.


ProPublica's Methodology: Auditing from Outputs

Data Collection

ProPublica obtained data on more than 7,000 criminal defendants who were arrested in Broward County, Florida in 2013 and 2014, who had been assessed with COMPAS at that time, and for whom subsequent criminal history data was available through 2016. This dataset combined COMPAS scores — obtained through public records requests — with actual criminal record information from Florida Department of Corrections records.

The combination of COMPAS scores and actual recidivism outcomes made it possible to construct the kind of analysis that a technical audit would produce: comparing the model's predictions to actual outcomes, stratified by demographic group.
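The core data step is a join of scores to outcomes on a person identifier. A minimal sketch with pandas — the column names and toy values here are hypothetical, not ProPublica's actual schema (they published their cleaned dataset alongside the article):

```python
import pandas as pd

# Hypothetical columns and values for illustration only.
scores = pd.DataFrame({
    "person_id": [1, 2, 3],
    "race": ["African-American", "Caucasian", "African-American"],
    "decile_score": [8, 3, 5],        # COMPAS risk decile, 1-10
})
outcomes = pd.DataFrame({
    "person_id": [1, 2, 3],
    "reoffended_2yr": [0, 1, 0],      # observed recidivism within two years
})

# Match each defendant's score to their observed outcome.
audit = scores.merge(outcomes, on="person_id", how="inner")
```

Once scores and outcomes sit in one table, every fairness analysis that follows is a matter of conditional rates computed from that table.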

The Core Finding

ProPublica found that COMPAS's predictions of recidivism were racially disparate in a specific and striking way: the system was significantly more likely to produce false positive errors for Black defendants than for white defendants, and significantly more likely to produce false negative errors for white defendants than for Black defendants.

Concretely: among defendants who did not actually reoffend within two years, Black defendants were labeled "higher risk" at nearly twice the rate of white defendants (45% vs. 24%). Conversely, among defendants who did actually reoffend within two years, white defendants were labeled "lower risk" at nearly twice the rate of Black defendants (48% vs. 28%).

ProPublica reported this as a violation of fairness: COMPAS was more likely to incorrectly label Black defendants as future criminals and more likely to incorrectly label white defendants as likely to be law-abiding, even when controlling for the type of crime and criminal history.

The Method's Strengths

The ProPublica methodology demonstrates that meaningful external auditing is possible without model access, when outcome data is available and the investigator is willing to invest in data acquisition.

Access to model not required. ProPublica did not need to know how COMPAS worked internally — what variables it used or how it weighted them. By matching scores to outcomes, they could assess the model's fairness properties from its behavior alone. This is analogous to assessing a test for discriminatory impact under employment discrimination law: you don't need to know how the test was designed to measure whether it produces discriminatory results.

Public records as audit data. The dataset was constructed from public records — COMPAS scores obtained through public records requests, and Florida criminal records that were already partially public. This demonstrates that external auditing does not necessarily require the cooperation of the audited entity: where outcome data is available through other channels, auditors can construct the dataset they need independently.

Longitudinal tracking. ProPublica tracked defendants over two years to assess actual recidivism, rather than assessing only the model's outputs at a single point in time. This allowed assessment of predictive validity — not just whether the model produced equal rates of predictions across groups, but whether its predictions were equally accurate across groups.

Transparency of methodology. ProPublica published its data and methodology alongside its findings, allowing critics to reproduce and evaluate its analysis. This transparency is an important feature of external audit methodology: the credibility of audit findings depends on their reproducibility.


The Response: Northpointe and the Fairness Metric Controversy

Northpointe's Response

Northpointe responded to the ProPublica findings by challenging their fairness analysis on technical grounds. Northpointe argued that COMPAS was fair by a different but equally legitimate fairness criterion: calibration, or predictive parity. Under calibration, a risk score is fair if defendants with the same score have the same probability of reoffending, regardless of race.

Northpointe published its own analysis showing that COMPAS scores were well-calibrated across racial groups: a score of 7 (on a 10-point scale) predicted roughly the same probability of reoffending for Black and white defendants. The company argued that ProPublica's false positive/false negative analysis was not a meaningful measure of algorithmic fairness, because it failed to account for differences in base rates of reoffending across demographic groups — differences that themselves reflected real-world patterns, not algorithmic error.
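A calibration check of this kind is also computable from outputs alone: group defendants by score and race, and compare observed reoffense rates within each score band. A sketch, with illustrative column names:

```python
import pandas as pd

def calibration_table(df):
    """Observed two-year reoffense rate per (decile score, race) cell.

    Under calibration, the race columns should be roughly equal
    within each score row. Column names are illustrative.
    """
    return (df.groupby(["decile_score", "race"])["reoffended"]
              .mean()
              .unstack("race"))
```

The same dataset that showed unequal error rates can thus also show (approximately) equal calibration — the two findings are not in tension about the data, only about which property counts as fairness.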

The Academic Controversy

The dispute between ProPublica and Northpointe generated a substantial academic literature on the incompatibility of fairness metrics. Chouldechova (2017) and Kleinberg, Mullainathan, and Raghavan (2016) independently proved that when base rates differ across groups — that is, when one group has a higher actual rate of the outcome being predicted — it is impossible to satisfy calibration and equal false positive/false negative rates simultaneously, outside of degenerate cases such as perfect prediction.
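One way to see the incompatibility, following Chouldechova (2017): for a binarized risk label, write p for a group's base rate of reoffending, PPV for the positive predictive value (the calibration-style quantity), and TPR and FPR for the true and false positive rates. These quantities are linked by the identity

```latex
\mathrm{FPR} \;=\; \frac{p}{1-p} \cdot \frac{1-\mathrm{PPV}}{\mathrm{PPV}} \cdot \mathrm{TPR}
```

If two groups have different base rates p but equal PPV — the parity Northpointe emphasized — then their FPR and TPR cannot both be equal as well. Equalizing one family of metrics forces a disparity in the other.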

This finding is of enormous practical importance: it means that the choice between fairness metrics is not merely technical. It is a normative judgment about whose errors matter most and what fairness requires in the specific context. In the criminal justice context, the relevant normative questions include:

  • Is it more important that defendants with the same score face the same risk of reoffending (calibration/predictive parity)?
  • Or is it more important that defendants who will not reoffend are mislabeled as high-risk at the same rate regardless of their race (equal false positive rates)?
  • Or is it more important that defendants who will reoffend are identified at the same rate regardless of their race (equal true positive rates / equal opportunity)?

These questions cannot be answered by technical analysis alone. They require normative judgments about whose interests count and how competing interests should be balanced — judgments that should be made through democratic deliberation, not delegated to algorithm designers.

What the Controversy Reveals About Auditing

The ProPublica/Northpointe dispute illustrates a fundamental challenge for AI auditing: the choice of fairness metrics is not a technical question that auditors can answer objectively. It is a normative question that requires stakeholder engagement, public deliberation, and ultimately regulatory or legislative specification.

Effective AI auditing must therefore include: (a) calculation of multiple fairness metrics, so that auditors can present a complete picture of the system's fairness properties rather than cherry-picking the metric that favors the desired conclusion; (b) explicit specification of which metrics are most appropriate for the specific context, with justification; and (c) engagement with affected communities in specifying what fairness means in the relevant context.
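Point (a) is straightforward to operationalize: compute the candidate metrics side by side and report all of them. A sketch of such a panel (illustrative column names, same conventions as above):

```python
import pandas as pd

def fairness_panel(df):
    """Several fairness-relevant rates per group, side by side.

    Expects illustrative columns 'race', 'high_risk' (0/1 prediction),
    'reoffended' (0/1 outcome). Which row should govern a verdict is
    the normative question, not a computation.
    """
    rows = {}
    for race, g in df.groupby("race"):
        pred, y = g["high_risk"], g["reoffended"]
        rows[race] = {
            "selection_rate": pred.mean(),        # demographic parity
            "ppv": y[pred == 1].mean(),           # predictive parity / calibration-style
            "fpr": (pred[y == 0] == 1).mean(),    # ProPublica's headline metric
            "tpr": (pred[y == 1] == 1).mean(),    # equal opportunity
        }
    return pd.DataFrame(rows).T
```

Presenting the full panel makes it harder for any party — vendor or auditor — to report only the metric that favors its conclusion.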

NYC LL 144's requirement to calculate impact ratios (related to demographic parity/selection rate differences) represents a specific normative choice — one that prioritizes equal selection rates over calibration or predictive parity. This choice is defensible in some employment contexts but is not the only legitimate choice. The appropriate fairness metric for criminal justice risk assessment may be very different from the appropriate metric for hiring decisions.
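As typically computed under LL 144, an impact ratio divides each group's selection rate by the highest group's selection rate; a sketch with hypothetical rates:

```python
def impact_ratios(selection_rates):
    """Each group's selection rate divided by the highest group's rate
    (LL 144-style impact ratio). Input values here are hypothetical.
    """
    top = max(selection_rates.values())
    return {group: rate / top for group, rate in selection_rates.items()}

ratios = impact_ratios({"group_a": 0.30, "group_b": 0.45})
# group_b -> 1.0; group_a -> 0.30 / 0.45, roughly 0.67
```

Note what this metric ignores: it says nothing about error rates or calibration, which is precisely why it embeds a normative choice rather than settling one.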


Limitations of the Output-Only Audit Approach

The ProPublica methodology, while a genuine methodological contribution, has significant limitations as a model for AI auditing:

Cannot Identify the Source of Disparities

An output-only audit can identify that disparities exist — that Black defendants receive more false positive predictions than white defendants — but cannot determine why. The disparity could arise from the model's design choices (which variables are included and how they are weighted), from the training data (which may itself encode historical racial disparities in policing and prosecution), or from differences in actual recidivism rates that the model is accurately reflecting. Distinguishing between these explanations requires access to the model and training data that an output-only audit cannot provide.

This limitation matters for accountability. If the disparity arises from the model design — for example, if the model weights certain criminal history factors more heavily in ways that systematically disadvantage Black defendants due to over-policing — then the algorithm itself is the problem. If the disparity arises from the training data encoding historical racial disparities in policing, the problem is deeper and requires more systemic remediation. An output-only audit cannot distinguish these cases.

Requires Access to Outcome Data

The ProPublica methodology required outcome data: actual recidivism records for the defendants whose COMPAS scores were examined. This data was available in Florida because of the state's broad public records laws. It would not be available in all jurisdictions, and combining criminal history data with COMPAS scores required significant data work — feasible for a well-resourced investigative news organization, but perhaps not for smaller auditors or advocacy organizations.

In domains other than criminal justice — employment, credit, healthcare — the outcome data required for a comparable analysis may be less accessible. Employers do not routinely disclose which AI tool was used to screen each applicant or what score that applicant received. Lenders do not publicly disclose credit scores alongside credit outcomes. This means that the ProPublica-style output audit is most feasible in contexts where outcome data is partially public, as in the criminal justice context.

Subject to Ecological Fallacy Concerns

The ProPublica analysis was conducted at the population level — examining aggregate patterns across thousands of defendants. Individual predictions may behave differently than aggregate patterns suggest. Northpointe argued that the ProPublica analysis conflated individual prediction accuracy with group-level fairness, and that the relevant question for judicial use of COMPAS is whether individual predictions are accurate, not whether aggregate error rates differ by group.

This critique has some merit: a group-level disparity in error rates does not automatically mean that any individual defendant's score was inaccurate. But the group-level disparity is the relevant question for systematic fairness: if the system reliably produces different error rates for Black and white defendants at the aggregate level, then its use in the criminal justice system will reliably produce racially disparate outcomes at the system level — regardless of whether any individual prediction is accurate.


Legacy and Influence

Impact on the Field

The ProPublica COMPAS investigation has had enormous influence on AI ethics, law, and policy. It:

  • Demonstrated that external AI auditing is possible from output data, setting a methodological precedent
  • Triggered an academic literature on fairness metric incompatibility that fundamentally advanced the field
  • Created public and legislative awareness of AI bias in criminal justice that has shaped subsequent policy debates
  • Prompted Northpointe/Equivant to release more documentation about COMPAS, including a recidivism risk validation study
  • Contributed to the Loomis v. Wisconsin litigation (discussed in Case Study 20-2), in which a criminal defendant challenged the use of COMPAS in his sentencing

Subsequent Research

Academic researchers have extended and critiqued the ProPublica analysis. Dressel and Farid (2018) found that human predictions of recidivism were no more accurate than COMPAS and were similarly racially biased — suggesting that the question is not whether to use algorithmic prediction but how to design and validate any prediction tool, algorithmic or human. Flores, Bechtel, and Lowenkamp (2016) challenged aspects of ProPublica's methodology. Multiple researchers have examined COMPAS and comparable tools across different jurisdictions, generally finding similar patterns of racial disparity.

The COMPAS controversy has not resulted in COMPAS being withdrawn from use. The tool continues to be used in criminal justice contexts across the United States. This is itself a form of accountability failure: extensive documentation of racial disparities has not prevented continued deployment.


Lessons for AI Auditing Practice

  1. Output auditing is a viable methodology when model access is denied. Public records, outcome data, and rigorous statistical analysis can support meaningful external assessment of AI system fairness properties. This approach should be in the toolkit of researchers, journalists, and advocacy organizations.

  2. Fairness metric selection is a normative, not technical, judgment. Auditors must specify which fairness metric they are applying and justify that choice in the deployment context. Reporting only one metric — when multiple metrics are legitimate and can give different verdicts — is incomplete and potentially misleading.

  3. Output-only audits have fundamental limitations. They cannot identify the source of disparities, and they require outcome data that may not be available. Full technical audits — requiring model access — remain the gold standard.

  4. Transparency of methodology is essential for credibility. ProPublica's decision to publish its data and methodology allowed its findings to be reproduced, critiqued, and extended. Audit credibility depends on transparency that allows independent verification.

  5. Documentation of disparities does not automatically produce remediation. The COMPAS controversy has not ended COMPAS's use. Accountability requires not just audit findings but mechanisms that translate findings into action — enforcement, litigation, regulatory requirements, or market pressure.


Discussion Questions

  1. ProPublica found that COMPAS produced higher false positive rates for Black defendants. Northpointe argued that COMPAS was calibrated — that the same score predicted the same probability of reoffending across racial groups. Both claims can be simultaneously true (due to the mathematical incompatibility of these fairness criteria). How should courts, lawmakers, and the public decide which fairness criterion is appropriate for criminal justice risk assessment?

  2. ProPublica's analysis was feasible in Florida because of the state's relatively open public records laws and the availability of criminal history data. How would you conduct a comparable analysis in a jurisdiction with less open criminal records, or in a domain (such as employment or lending) where outcome data is not publicly available?

  3. The COMPAS controversy has not prevented continued widespread use of the tool. What accountability mechanisms would be necessary to ensure that documented algorithmic bias in high-stakes decision systems actually results in remediation or discontinuation?

  4. Dressel and Farid found that human prediction of recidivism was no more accurate than COMPAS and exhibited similar racial disparities. Does this finding undermine the case for algorithmic accountability? Or does it suggest that the problem of prediction bias is systemic and requires more fundamental reform?

  5. The ProPublica investigation was conducted by a journalistic organization, not a regulatory body or audit firm. What does this say about the current state of AI accountability mechanisms? What institutional infrastructure would need to exist for this kind of external AI assessment to be conducted systematically rather than episodically?