Case Study: Amazon's Hiring Algorithm — When AI Learns to Discriminate
"No one programmed it to be sexist. It taught itself — from the data we gave it." — Paraphrased from an anonymous Amazon engineer, as reported by Reuters, 2018
Overview
In October 2018, Reuters reported that Amazon had built and then scrapped an experimental AI recruiting tool that systematically discriminated against women. The system, developed internally beginning in 2014, was designed to automate the screening of resumes for software engineering roles. Trained on a decade of Amazon's own hiring data, the algorithm learned that male candidates were statistically preferable — because the historical data reflected an industry and a company in which men had been disproportionately hired.
This case study examines how the bias emerged, why Amazon's attempts to fix it failed, and what the case reveals about the structural dynamics of algorithmic bias in hiring. It connects the Amazon case to the broader themes of historical bias, proxy variables, redundant encoding, and the limits of technical fixes.
Skills Applied:

- Tracing bias through the ML development pipeline
- Analyzing the role of training data in encoding historical discrimination
- Evaluating the limits of "fairness through unawareness" (removing protected features)
- Connecting corporate AI to structural gender inequality in the technology industry
The Situation
Amazon's Goal
By 2014, Amazon was one of the largest employers in the world, receiving hundreds of thousands of job applications annually. For popular positions — particularly software development and technical roles — the volume of resumes far exceeded the capacity of human recruiters to review. Amazon's solution: build a machine learning system to automate the first stage of resume screening.
The goal was straightforward and, from a business perspective, appealing. The system would:
- Ingest resumes submitted for technical positions
- Analyze their content using natural language processing (NLP)
- Rate each resume on a scale of 1 to 5 stars
- Forward the highest-rated resumes to human recruiters for further consideration
The system was designed to learn what a "good" resume looked like by studying the resumes of candidates Amazon had successfully hired in the past. In machine learning terms, it was a supervised learning problem: the training data consisted of resumes labeled as "hired" or "not hired," and the model learned to predict which new resumes were most similar to those of successfully hired candidates.
The Training Data
The training data was drawn from Amazon's hiring records over the preceding ten years (approximately 2004-2014). This decade of data reflected the gender composition of Amazon's technical workforce — and of the technology industry more broadly.
In 2017, women held approximately 27% of computing roles in the U.S. (down from 35% in 1990). At Amazon specifically, women constituted a minority of the technical workforce. The exact figures from the training period are not publicly available, but industry composition data makes clear that the resumes of successfully hired candidates were predominantly male.
The training data, in other words, was a faithful record of who Amazon had hired — which was, predominantly, men. The data was not inaccurate. It accurately reflected the historical reality. The problem was that the historical reality was discriminatory.
How the Bias Manifested
What the Algorithm Learned
The algorithm learned that certain features of resumes were predictive of being hired — and those features were correlated with gender. Specifically:
Direct gender signals. The system penalized resumes that contained the word "women's" — as in "women's chess club captain" or "president of the women's engineering society." These phrases appeared almost exclusively on women's resumes. The algorithm learned that their presence was negatively correlated with being hired.
Institutional signals. The algorithm downgraded graduates of two all-women's colleges. Attending an all-women's college is, obviously, a strong indicator of female gender. The algorithm treated it as a negative predictor because very few graduates of all-women's colleges appeared in the "hired" dataset — not because they were less qualified, but because Amazon had historically hired few of them.
Linguistic signals. Amazon engineers discovered that the algorithm had identified subtler patterns as well. Certain verbs more commonly used by women on resumes were penalized. Certain formatting conventions associated with women's resumes were treated as negative signals. The algorithm had constructed a multidimensional profile of "what a successful resume looks like," and that profile was implicitly male.
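The mechanism behind these learned signals can be sketched in a few lines. The code below is not Amazon's system; it is a deliberately naive bag-of-words scorer trained on invented resumes, showing how a skewed hired/not-hired history alone is enough to push a phrase like "women's" toward a negative weight.

```python
# Toy illustration (not Amazon's model): a naive bag-of-words scorer
# trained on a hypothetical, historically skewed hiring record.
from collections import Counter
from math import log

# Invented training resumes with historical outcomes. Because past hires
# were overwhelmingly male, "women's" appears only in the not-hired pile.
training = [
    ("java python chess club captain", 1),
    ("c++ distributed systems java", 1),
    ("python machine learning java", 1),
    ("women's engineering society python java", 0),
    ("women's chess club captain python", 0),
    ("java c++ systems programming", 0),
]

hired_counts, rejected_counts = Counter(), Counter()
for text, hired in training:
    (hired_counts if hired else rejected_counts).update(text.split())

def word_weight(word, smoothing=1.0):
    """Log-odds of a word appearing in hired vs. not-hired resumes."""
    h = hired_counts[word] + smoothing
    r = rejected_counts[word] + smoothing
    return log(h / r)

# "women's" acquires a negative weight purely from the skewed labels:
print(round(word_weight("women's"), 2))  # negative
print(round(word_weight("java"), 2))     # positive
```

No one wrote a rule about gender; the negative weight falls out of the label distribution, which is the sense in which the system "taught itself" from the data.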
The Attempted Fix
When the gender bias was identified, Amazon's engineers attempted the most intuitive correction: they removed gender as an explicit feature. The system would be "blind" to gender, they reasoned, and therefore could not discriminate.
This approach — sometimes called "fairness through unawareness" — failed. The algorithm found other features correlated with gender and used those instead:
- It continued to penalize "women's" in resume text
- It continued to downgrade graduates of women's colleges
- It identified new proxy features — verb patterns, organizational affiliations, formatting choices — that correlated with gender
- Removing one proxy feature caused the model to shift weight to other proxies
The bias was not located in any single feature. It was distributed across the entire feature space — embedded in the correlational structure of the data itself. This is the phenomenon the chapter calls redundant encoding: the protected characteristic (gender) is encoded in so many other features that removing it explicitly does not remove it functionally.
The Decision to Scrap
By early 2017, Amazon's engineers had concluded that the bias could not be adequately removed, and the project was disbanded. The system was never used for actual hiring decisions; an Amazon spokesperson confirmed to Reuters that the tool "was never used by Amazon recruiters to evaluate candidates."
The decision to scrap the system was, in one sense, responsible: Amazon recognized the problem and chose not to deploy a biased tool. But the case raises a question the chapter poses directly: how many similar systems are in production at companies that lack Amazon's resources for internal auditing?
Analysis Through Chapter Frameworks
The Bias Pipeline
The Amazon case illustrates bias entry at multiple pipeline stages:
Problem formulation. The problem was defined as "predict which resumes look like the resumes of people we've hired before." This formulation encodes the assumption that past hiring decisions were correct — that the people Amazon hired were the best candidates, rather than the candidates who most closely matched the preferences and biases of past hiring managers. The question "who should we hire?" was operationalized as "who have we hired?" — and the gap between those questions is where historical bias enters.
Data collection. The training data was a decade of Amazon's own hiring records — a dataset that faithfully reflected the gender imbalance of the tech industry and the company's own workforce. The data was accurate. It was also biased. These are not contradictory statements.
Feature engineering. Resume text was the primary feature space. Text contains countless features correlated with gender: names, pronouns, organizational affiliations, college names, verbs, and linguistic patterns. No amount of feature selection could remove gender from a feature space as rich as natural language.
Model training. The model optimized for predicting past hiring decisions. Since past decisions were biased, the model learned to reproduce that bias. The optimization objective — "match historical outcomes" — was the wrong objective if the goal was fair hiring, because historical outcomes were not fair.
Evaluation. If the evaluation set was drawn from the same biased distribution as the training set, the model would have appeared to perform well — it accurately predicted the biased past. Only when the model's outputs were analyzed by gender did the discrimination become visible.
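The point about disaggregated evaluation can be made concrete with a toy audit: the aggregate metric looks unremarkable until outputs are broken out by group. All numbers below are fabricated for illustration.

```python
# Invented (group, score) pairs: the aggregate hides a per-group gap.
scores = (
    [("M", s) for s in (4, 5, 4, 3, 5, 4)]
    + [("F", s) for s in (2, 3, 2, 3, 2, 3)]
)

def mean_score(group):
    """Average model score for one group."""
    vals = [s for g, s in scores if g == group]
    return sum(vals) / len(vals)

overall = sum(s for _, s in scores) / len(scores)
print(f"overall mean: {overall:.2f}")
print(f"by group: M={mean_score('M'):.2f}, F={mean_score('F'):.2f}")
```

An evaluation that reports only the overall mean would miss the disparity entirely; the audit step is the group-by.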
Historical Bias as Root Cause
The Amazon case is a textbook illustration of historical bias (Section 14.2.1). The bias existed in the world before any data was collected: the technology industry has systematically excluded women for decades, through hostile workplace cultures, biased hiring practices, gender stereotypes, and pipeline constrictions beginning in childhood education. Any dataset of "successful tech hires" from 2004-2014 would reflect this history.
The chapter's key point applies: historical bias cannot be fixed by "better data collection." The data accurately reflects the biased world. The solution requires either (a) changing the world (addressing the structural causes of gender inequality in technology) or (b) changing the optimization objective (predicting who would be a good hire if gender bias were absent, rather than who was hired in a biased past).
Option (b) is technically possible — through counterfactual modeling, causal inference, or reweighting techniques — but it requires defining what "fair hiring" looks like, which is a value judgment, not a technical specification.
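Option (b) has concrete instantiations. One widely used preprocessing method is reweighing (Kamiran and Calders, 2012), which assigns each training example the weight w(g, y) = P(g)P(y) / P(g, y), so that group membership and the hired label are statistically independent in the weighted data. A minimal sketch with invented counts:

```python
# Reweighing sketch: upweight (group, label) combinations that are rarer
# than independence would predict, downweight the over-represented ones.
from collections import Counter

# Invented (group, hired) pairs from a hypothetical biased history.
history = [("M", 1)] * 60 + [("M", 0)] * 40 + [("F", 1)] * 20 + [("F", 0)] * 80

n = len(history)
group_counts = Counter(g for g, _ in history)
label_counts = Counter(y for _, y in history)
joint_counts = Counter(history)

def weight(group, label):
    """w(g, y) = P(g) * P(y) / P(g, y): expected over observed frequency."""
    expected = (group_counts[group] / n) * (label_counts[label] / n)
    observed = joint_counts[(group, label)] / n
    return expected / observed

for g in ("M", "F"):
    for y in (1, 0):
        print(g, y, round(weight(g, y), 2))
```

Hired women (rare in the biased history) get weight 2.0 and hired men 0.67, so the reweighted data no longer rewards maleness. Note what the technique does not do: choosing to equalize hiring rates across groups is itself the value judgment the chapter describes.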
Redundant Encoding and the Limits of Blinding
The Amazon case is one of the most cited examples of why "fairness through unawareness" fails. Removing the protected attribute (gender) from the model's features does not remove the bias, because:
- Gender is encoded in dozens of other features (proxy variables)
- Removing one proxy causes the model to shift weight to other proxies
- The correlational structure of the data preserves the signal even when the explicit variable is removed
This has important implications beyond hiring. Any attempt to "blind" an algorithm to race, gender, age, or other protected characteristics faces the same problem: in a society where these characteristics are correlated with countless aspects of daily life — where you live, where you went to school, what you do for work, how you write, what language you use — removing the label does not remove the reality it represents.
Broader Implications
The Hiring Technology Industry
Amazon's system was experimental and was never deployed. But the hiring technology industry is vast and growing. As of 2024, automated resume screening, video interview analysis, gamified assessments, and AI-driven candidate ranking are used by employers of all sizes. Companies including HireVue, Pymetrics, Textio, and many others offer AI-powered hiring tools.
Most of these systems are proprietary. Most are not subject to independent bias audits. Most are used by HR departments that lack the technical expertise to evaluate their fairness. The Amazon case is notable not because it is unique but because it was discovered and disclosed. The question the chapter poses — "How many biased hiring algorithms are operating right now, unexamined?" — remains open.
The Legal Landscape
U.S. employment law prohibits discrimination based on protected characteristics (Title VII of the Civil Rights Act, the Age Discrimination in Employment Act, the Americans with Disabilities Act). The four-fifths rule provides a framework for detecting disparate impact. But enforcement of these protections in the context of algorithmic hiring remains limited.
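The four-fifths rule is simple enough to state as code: a group's selection rate should be at least 80% of the most-favored group's rate. The applicant counts below are invented for illustration.

```python
# Four-fifths (80%) rule check on hypothetical screening outcomes.
def selection_rate(selected, applicants):
    return selected / applicants

def four_fifths_check(rates):
    """Return (impact_ratio, passes): lowest rate vs. highest rate."""
    top = max(rates.values())
    ratio = min(rates.values()) / top
    return ratio, ratio >= 0.8

rates = {
    "men": selection_rate(120, 400),   # 30% advance to human review
    "women": selection_rate(45, 300),  # 15% advance to human review
}
ratio, passes = four_fifths_check(rates)
print(f"impact ratio: {ratio:.2f}, passes four-fifths rule: {passes}")
```

Here the impact ratio is 0.50, well under the 0.8 threshold, which is the kind of selection-rate comparison Local Law 144 audits codify.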
In 2023, New York City implemented Local Law 144, requiring employers to conduct independent bias audits of automated employment decision tools before using them. The law is the first of its kind in the United States, and its implementation has been closely watched. The audits examine selection rates across race, ethnicity, and sex — essentially codifying the disparate impact framework described in this chapter.
The Liability Question
If an AI hiring system discriminates, who is liable? The employer that deployed it? The vendor that built it? The data scientists who trained it? The training data itself? This question parallels the Accountability Gap discussed throughout Part 3. Current law generally holds the employer responsible for hiring outcomes, but the use of third-party AI tools complicates the chain of accountability.
Discussion Questions
1. The training data trap. Amazon trained its algorithm on "who we hired in the past." What alternative training signal could they have used? Consider: performance reviews (but these may also be biased), retention rates, peer evaluations, or objective productivity metrics. Each has limitations. Is there a training signal that avoids historical bias entirely?

2. The detection problem. Amazon discovered the bias through internal audit. Many companies lack the resources or awareness to conduct similar audits. Should bias audits of hiring algorithms be legally required? If so, who should conduct them, how often, and to what standard?

3. The broader pipeline. The Amazon case focuses on the algorithm, but the gender gap in technology hiring exists across the entire pipeline: from childhood education to university enrollment to workplace culture to promotion patterns. Is fixing the algorithm addressing the right problem? Or is it a distraction from the structural changes needed?

4. The "humans are biased too" argument. Some defend algorithmic hiring by arguing that human recruiters are also biased — and that an algorithm, even an imperfect one, might be more consistent and less prone to individual prejudice. Evaluate this argument. Is it valid? Under what conditions might algorithmic hiring be more fair than human hiring? Under what conditions might it be less fair?
Your Turn: Mini-Project
Option A: Resume Audit. Collect (or create) 10 fictional resumes — 5 with stereotypically male features (men's sports, fraternities, military service) and 5 with stereotypically female features (women's organizations, sororities, care-related volunteer work). Submit them to a publicly available resume scoring tool or AI writing assistant and compare the feedback. Document any differences and analyze whether they suggest gender bias. (Note: respect the terms of service of any tool you use.)
Option B: Vendor Evaluation Framework. You are the HR director of a mid-size company considering purchasing an AI-powered resume screening tool. Design a due diligence framework: What questions would you ask the vendor? What evidence of fairness testing would you require? What ongoing monitoring would you implement? What contractual protections would you demand? Write a two-page evaluation framework.
Option C: Comparative Legal Analysis. Research New York City's Local Law 144 (bias audits for automated employment decision tools) and at least one other jurisdiction's approach to regulating AI in hiring (e.g., the EU AI Act's provisions on high-risk AI in employment, or Illinois's AI Video Interview Act). Write a two-page comparative analysis: What do these regulations require? How do they define bias? What are their strengths and limitations?
References
- Dastin, Jeffrey. "Amazon Scraps Secret AI Recruiting Tool That Showed Bias against Women." Reuters, October 10, 2018.

- Raghavan, Manish, Solon Barocas, Jon Kleinberg, and Karen Levy. "Mitigating Bias in Algorithmic Hiring: Evaluating Claims and Practices." Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency (FAT* '20), 469-481. ACM, 2020.

- Buolamwini, Joy, and Timnit Gebru. "Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification." Proceedings of the 1st Conference on Fairness, Accountability and Transparency, 77-91. PMLR, 2018.

- Kim, Pauline T. "Data-Driven Discrimination at Work." William & Mary Law Review 58, no. 3 (2017): 857-936.

- Bogen, Miranda, and Aaron Rieke. "Help Wanted: An Examination of Hiring Algorithms, Equity, and Bias." Upturn, December 2018.

- New York City Department of Consumer and Worker Protection. "Local Law 144: Automated Employment Decision Tools." New York City, 2023.

- Bertrand, Marianne, and Sendhil Mullainathan. "Are Emily and Greg More Employable than Lakisha and Jamal? A Field Experiment on Labor Market Discrimination." American Economic Review 94, no. 4 (2004): 991-1013.