Case Study 1: When Algorithms Discriminate — Bias in Hiring, Lending, and Criminal Justice


Tier 1 — Verified Concepts: This case study discusses documented cases of algorithmic bias in hiring (Amazon, reported by Reuters in 2018), criminal justice (COMPAS, analyzed by ProPublica in 2016), and lending (documented in multiple academic studies). The specific cases are based on published reporting and peer-reviewed research. The analytical framework and synthesis are the authors'.


Three Systems, One Pattern

This case study examines three algorithmic systems that were deployed in high-stakes domains — hiring, criminal justice, and lending — and that produced discriminatory outcomes. Each system was built by skilled engineers with access to large datasets. Each system was technically sophisticated. And each system learned to discriminate, not because its creators intended harm, but because the data, the metrics, and the assumptions embedded discrimination into the system from the start.

The pattern across all three cases is the same: an algorithm trained on historically biased data, optimizing for a narrowly defined metric, producing outcomes that disproportionately harm already-disadvantaged groups.

Case A: Amazon's Resume Screener

What Happened

Between 2014 and 2017, Amazon developed a machine learning system to automate the initial screening of job applicants. The system was trained on resumes submitted to Amazon over the previous decade. It learned to assign each resume a score from 1 to 5, with 5 being the strongest recommendation.

The model learned patterns associated with successful past hires. It learned to favor certain verbs ("executed," "captured") and to penalize others. It learned to prefer candidates from certain universities. And — because Amazon's technical workforce had been predominantly male for the past decade — it learned that being male was a predictor of being hired.

The system penalized resumes containing the word "women's" — as in "women's chess club captain" or "Society of Women Engineers." It downgraded graduates of two all-women's colleges. It had learned, from perfectly accurate historical data, that men were more likely to be hired — and it dutifully replicated that pattern.

Why It Happened

The model was not poorly built. It was, in a narrow technical sense, performing exactly as designed: identifying patterns in historical data that predicted historical outcomes. The problem was that historical outcomes reflected decades of gender imbalance in the technology industry.

Consider the logic chain:

  1. Amazon's past workforce was predominantly male (a documented fact about the tech industry)
  2. The training data consisted of resumes from past applicants, with success defined as "was hired and retained"
  3. The model learned that features associated with male applicants correlated with "success"
  4. The model applied these learned patterns to new applicants, penalizing female-associated features

At no point did anyone tell the model "prefer men." The model discovered this on its own, because the data contained this pattern. This is the core lesson: bias in training data produces biased models, regardless of the model builder's intentions.

What Was Done

Amazon attempted to correct the gender bias by removing explicitly gendered features. But the model found new proxies — universities that were predominantly female, extracurricular activities associated with women, even writing styles that correlated with gender. The proxies were endless.

Eventually, Amazon abandoned the project entirely. The model was never used for actual hiring decisions, though it is unclear how much it influenced internal thinking before being scrapped.

Lessons

  1. Historical data encodes historical bias. If you train on biased outcomes, you get biased predictions.
  2. Removing protected features is insufficient. Proxies abound.
  3. The definition of "success" matters. "Was hired at Amazon in a male-dominated era" is not the same as "would be a good employee."
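Lesson 2 can be checked empirically: if a supposedly neutral feature still predicts the protected attribute, a model can use it as a proxy even after the attribute itself is removed. A minimal sketch of such a "leakage" check, using entirely synthetic records and an invented `college` feature:

```python
# Sketch: a quick proxy-leakage check. If a "neutral" feature predicts
# the protected attribute, removing the attribute does not remove the signal.
# All records and the "college" feature here are synthetic.

from collections import Counter

def leakage(records, feature, protected):
    """For each value of `feature`, the share of records where `protected` == 1."""
    totals, hits = Counter(), Counter()
    for r in records:
        totals[r[feature]] += 1
        hits[r[feature]] += r[protected]
    return {v: hits[v] / totals[v] for v in totals}

records = [
    {"college": "X", "female": 1}, {"college": "X", "female": 1},
    {"college": "X", "female": 0}, {"college": "Y", "female": 0},
    {"college": "Y", "female": 0}, {"college": "Y", "female": 1},
]

# Shares far from the overall rate (here 50%) mean the feature leaks gender
print(leakage(records, "college", "female"))
```

Scrubbing the protected column does nothing while features like these remain, which is why removing explicitly gendered terms failed in the Amazon case.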

Case B: COMPAS and Criminal Justice

What Happened

COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) is a proprietary risk assessment tool used in courts across the United States. It assigns defendants a score from 1 to 10 predicting the likelihood of recidivism (committing another crime within two years). Judges use these scores to inform decisions about bail, sentencing, and parole.

In 2016, ProPublica obtained COMPAS scores for over 7,000 defendants in Broward County, Florida, and matched them with actual outcomes (did the person reoffend within two years?). Their analysis revealed:

  • Black defendants who did NOT reoffend: 44.9% were incorrectly classified as medium or high risk (false positive rate)
  • White defendants who did NOT reoffend: 23.5% were incorrectly classified as medium or high risk
  • White defendants who DID reoffend: 47.7% were incorrectly classified as low risk (false negative rate)
  • Black defendants who DID reoffend: 28.0% were incorrectly classified as low risk

In plain language: if you were Black and innocent, COMPAS was nearly twice as likely to wrongly flag you as dangerous. If you were white and guilty, COMPAS was nearly twice as likely to wrongly flag you as safe.
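ProPublica's comparison boils down to two quantities per group, each computed from a confusion matrix. A minimal sketch (the counts below are illustrative, not the actual Broward County data):

```python
# Sketch: the two error rates ProPublica compared, per group.
# Counts below are illustrative, NOT the actual Broward County data.

def rates(tp, fp, tn, fn):
    """Return (false positive rate, false negative rate) for one group."""
    fpr = fp / (fp + tn)  # non-reoffenders wrongly flagged medium/high risk
    fnr = fn / (fn + tp)  # reoffenders wrongly flagged low risk
    return fpr, fnr

# Hypothetical confusion-matrix counts; "positive" = flagged medium/high risk
groups = {
    "group A": dict(tp=300, fp=450, tn=550, fn=120),
    "group B": dict(tp=250, fp=230, tn=750, fn=230),
}

for name, g in groups.items():
    fpr, fnr = rates(**g)
    print(f"{name}: FPR={fpr:.1%}  FNR={fnr:.1%}")
```

Comparing these rates across demographic groups, rather than reporting a single overall accuracy, is what surfaced the disparity.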

The Fairness Debate

Northpointe (COMPAS's developer) responded that ProPublica's analysis was misleading. They argued that COMPAS satisfied a different — and equally valid — definition of fairness: predictive parity. Among defendants scored as high-risk, roughly the same proportion of Black and white defendants actually reoffended. The score meant the same thing regardless of race.

Both sides were correct. And that is the problem.

A mathematical impossibility result, proved shortly after this debate, formalized what it illustrated: when base rates differ between groups (and they do — for complex reasons including policing patterns, poverty, and systemic inequality), no risk score can simultaneously equalize error rates and predictive values across groups. You must choose.

ProPublica chose to prioritize error rate equality — the idea that innocent people of all races should be equally likely to be wrongly flagged. Northpointe chose to prioritize predictive parity — the idea that a score of "7" should mean the same thing regardless of race.

Neither choice is objectively "correct." Both involve value judgments about which type of unfairness is more acceptable. This is what makes algorithmic fairness fundamentally an ethical question, not a technical one.
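The tradeoff can be made concrete with a toy calculation: hold predictive parity fixed (the same PPV for both groups) and the false positive rates are forced apart whenever base rates differ. All numbers below are invented for illustration:

```python
# Sketch: equal predictive parity (PPV) with unequal base rates forces
# unequal false positive rates. All numbers are invented for illustration.

def fpr_given_parity(n, base_rate, flagged, ppv):
    positives = round(n * base_rate)   # people who actually reoffend
    negatives = n - positives          # people who do not
    true_pos = round(flagged * ppv)    # flagged people who reoffend
    false_pos = flagged - true_pos     # flagged people who do not
    return false_pos / negatives

# Both groups: N = 1000, PPV held at 0.6 ("a flag means the same thing")
fpr_a = fpr_given_parity(n=1000, base_rate=0.5, flagged=500, ppv=0.6)
fpr_b = fpr_given_parity(n=1000, base_rate=0.3, flagged=300, ppv=0.6)
print(f"FPR at base rate 50%: {fpr_a:.1%}")
print(f"FPR at base rate 30%: {fpr_b:.1%}")
```

With PPV fixed at 0.6 for both groups, the higher-base-rate group ends up with a 40% false positive rate versus roughly 17% for the other: satisfying one fairness definition breaks the other.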

The Deeper Problem

Beyond the fairness debate, COMPAS raises questions about whether algorithmic risk assessment belongs in criminal justice at all:

  • The data reflects policing, not crime. Arrest data does not measure criminal behavior — it measures the intersection of criminal behavior and policing. Communities that are more heavily policed generate more arrest data, regardless of actual crime rates.
  • The variables are proxies for race. COMPAS uses 137 features. Race is not one of them. But many features — neighborhood, employment status, family criminal history — are correlated with race due to systemic inequalities. The model can learn racial patterns without ever seeing race.
  • Feedback loops. A defendant labeled as high-risk may receive a longer sentence, which reduces their employment prospects, which increases their likelihood of reoffending, which validates the original prediction.
  • The illusion of objectivity. Putting a number on a person's "risk" gives the impression of scientific precision. But the number reflects the biases of the data, the assumptions of the model, and the values embedded in the definition of "fairness."

Case C: Algorithmic Lending Discrimination

What Happened

Multiple studies have documented racial disparities in algorithmic lending decisions. A 2021 study examining over 2 million mortgage applications found that Black and Hispanic applicants were significantly more likely to be denied loans than white applicants with similar financial profiles. An analysis of fintech lending (where algorithms, not human loan officers, make decisions) found that algorithmic lenders charged Black and Hispanic borrowers 7.9 basis points more in interest rates than comparable white borrowers — even after controlling for creditworthiness.

Why Algorithms Did Not Fix Human Bias

There was hope that algorithmic lending would reduce discrimination by removing the human judgment (and human bias) from the process. In some respects, this happened — discrimination in algorithmic lending is lower than in traditional face-to-face lending. But it did not disappear.

The reasons echo the previous cases:

  • Historical data. Credit scores, payment histories, and other financial data reflect decades of economic inequality. Redlining (the historical practice of denying services to residents of certain neighborhoods, usually based on race) created wealth gaps that persist today and are encoded in the data.
  • Proxy variables. Features like zip code, education level, and employment sector correlate with race. Algorithms can learn racial patterns through these proxies.
  • Narrow optimization. Algorithms optimized to minimize default risk will naturally charge higher rates to borrowers who statistically default more often. But default rates reflect not just individual behavior but structural barriers — communities that were denied generational wealth accumulation through homeownership are more likely to struggle with loan repayment.

The Vicious Cycle

Lending discrimination creates a feedback loop: denied loans mean no homeownership, which means no wealth accumulation, which means lower credit scores, which means higher denial rates. The algorithm does not create this cycle — it inherits it from the data — but it perpetuates it with mechanical efficiency.
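The cycle can be sketched as a toy simulation: a score just above the approval threshold compounds upward, while one just below compounds downward. The threshold, step size, and scores are invented; no real credit model works this way.

```python
# Sketch: a toy simulation of the lending feedback loop. The threshold,
# step size, and starting scores are invented for illustration only.

def simulate(score, rounds=5, threshold=600):
    """Track a credit score when approval raises it and denial lowers it."""
    history = [score]
    for _ in range(rounds):
        approved = score >= threshold
        score += 20 if approved else -20  # toy dynamics: outcomes compound
        history.append(score)
    return history

print(simulate(610))  # starts just above the threshold: climbs every round
print(simulate(590))  # starts just below: falls every round
```

Two applicants who begin 20 points apart end the simulation 220 points apart: the system's own decisions manufacture the gap they later "confirm."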

Connecting the Cases: A Shared Pattern

Across all three cases, we see the same five-step pattern:

Step 1: Historical inequality exists (gender imbalance in tech, racial disparities in policing and wealth)

Step 2: Data is collected that reflects this inequality (resumes from a male-dominated industry, arrest data from racially biased policing, credit data from an unequal economy)

Step 3: A model is trained to find patterns (features correlated with being male, features correlated with recidivism, features correlated with default)

Step 4: The model learns the historical inequality as if it were truth (male = good hire, certain neighborhoods = high risk, certain profiles = high default)

Step 5: The model's predictions reinforce the inequality (fewer women hired, more Black defendants incarcerated, less credit for minority communities)

At each step, there are opportunities for intervention:

  • Step 1: Address the root inequality (systemic change — important but beyond the scope of data science alone)
  • Step 2: Collect more representative data, use different outcome variables, acknowledge the limitations of the data
  • Step 3: Test for proxy discrimination, audit feature importance for proxies
  • Step 4: Evaluate the model across subgroups, not just overall, using multiple fairness definitions
  • Step 5: Monitor deployed systems for disparate impact, implement feedback mechanisms, establish appeals processes

What Should Data Scientists Do?

This case study is not an argument against using algorithms. Human decision-makers in all three domains (hiring, criminal justice, lending) have well-documented biases. The question is not "algorithms vs. humans" — it is "how do we make decisions as fair as possible, regardless of who (or what) makes them?"

For data scientists, the practical implications are:

  1. Question your data. Ask what historical patterns are embedded in it and whether those patterns reflect the world as it should be, not just the world as it has been.

  2. Question your metrics. "Accuracy" is not enough. Test for disparate impact across demographic groups. Use multiple fairness definitions and acknowledge the tradeoffs between them.

  3. Question your deployment. Who is affected by the model's decisions? Is there an appeals process? Is there monitoring for emergent bias? Are feedback loops possible?

  4. Question your silence. If you see bias in a system and say nothing, you are complicit in the harm it causes. Raising concerns is not optional for a responsible practitioner.
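Point 2's disparate-impact test can start very simply. One common first screen is the "four-fifths rule" from EEOC guidance: flag the system if one group's selection rate falls below 80% of another's. A minimal sketch with synthetic loan decisions:

```python
# Sketch: the "four-fifths rule" screen for disparate impact. Outcomes
# are synthetic; the 0.8 threshold follows EEOC guidance.

def selection_rate(decisions):
    return sum(decisions) / len(decisions)

def disparate_impact(group_a, group_b, threshold=0.8):
    """Return (impact ratio, flagged?) for two groups' binary decisions."""
    ra, rb = selection_rate(group_a), selection_rate(group_b)
    ratio = min(ra, rb) / max(ra, rb)
    return ratio, ratio < threshold

# 1 = approved, 0 = denied
approved_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]  # 80% approval rate
approved_b = [1, 0, 1, 0, 1, 0, 0, 1, 0, 1]  # 50% approval rate
ratio, flagged = disparate_impact(approved_a, approved_b)
print(f"impact ratio {ratio:.3f}, flagged: {flagged}")
```

A check like this is a screen, not a verdict: passing it does not establish fairness, and failing it is a prompt for the deeper subgroup analysis described above.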

Discussion Questions

  1. In the Amazon case, the company tried to "debias" the model by removing gendered features. Why did this approach fail? Is there a better approach, or is the problem fundamentally unfixable with this type of data?

  2. In the COMPAS debate, ProPublica and Northpointe both claimed their definition of fairness was correct. If you were a judge, which definition would you prioritize, and why?

  3. Algorithmic lending is measurably less discriminatory than human lending. Should we celebrate the improvement, or focus on the remaining disparities? How do you evaluate "better but not good"?

  4. All three systems were built by people who did not intend to discriminate. Does intent matter? Should the legal and ethical evaluation of an algorithm depend on whether its creators meant well?

  5. Some researchers argue that the solution is "fairness-aware" algorithms that explicitly incorporate fairness constraints. Others argue that this is just putting a band-aid on a structural problem. What is your view?

  6. The feedback loop problem (Step 5) means that biased algorithms do not just reflect inequality — they amplify it. How would you design a system that breaks the loop rather than reinforcing it?