Appendix B: Key Studies Summary
The 25 Most Important Empirical Studies in AI Ethics
Introduction
Empirical research has been central to the development of AI ethics as a field. The studies summarized here span disciplines — computer science, sociology, economics, medicine, and investigative journalism — and collectively establish many of the field's foundational claims about bias, fairness, accountability, and harm. Each summary includes the original research question, methodology, key findings, significance, limitations, and subsequent developments.
Part I: Algorithmic Bias in Criminal Justice
Study 1: ProPublica COMPAS Investigation
Citation: Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016). Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks. ProPublica, May 23, 2016.
Research Question: Does the COMPAS recidivism risk assessment algorithm produce racially biased predictions, and if so, in what direction and magnitude?
Methodology: The research team obtained Broward County, Florida criminal records including COMPAS risk scores for approximately 7,000 people arrested between 2013 and 2014 and matched them to actual recidivism data (re-arrest within two years). They calculated false positive rates (labeled high risk but did not reoffend) and false negative rates (labeled low risk but did reoffend) separately for Black and white defendants using standard statistical testing.
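The core calculation is simple enough to sketch. A minimal Python illustration with invented mini-cohorts (not the Broward County records):

```python
def error_rates(high_risk, reoffended):
    """FPR = share of non-reoffenders labeled high risk;
    FNR = share of reoffenders labeled low risk."""
    pairs = list(zip(high_risk, reoffended))
    non_reoffenders = [h for h, r in pairs if not r]
    reoffenders = [h for h, r in pairs if r]
    fpr = sum(non_reoffenders) / len(non_reoffenders)
    fnr = sum(1 - h for h in reoffenders) / len(reoffenders)
    return fpr, fnr

# Hypothetical mini-cohorts: parallel lists of
# (labeled high risk?, reoffended within two years?)
black_labels, black_outcomes = [1, 1, 1, 0, 0, 0, 1, 1], [1, 1, 0, 0, 0, 0, 0, 1]
white_labels, white_outcomes = [1, 0, 0, 0, 1, 0, 0, 0], [1, 0, 0, 1, 0, 1, 0, 0]

print(error_rates(black_labels, black_outcomes))  # (0.4, 0.0): higher FPR
print(error_rates(white_labels, white_outcomes))  # (0.2, 0.67): higher FNR
```

Computing these rates separately per group, rather than overall, is exactly what surfaced the asymmetry the investigation reported.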
Key Findings: Black defendants were nearly twice as likely as white defendants to be falsely flagged as future criminals (false positive rate: 44.9% for Black defendants vs. 23.5% for white defendants). White defendants were more likely to be incorrectly flagged as low risk (false negative rate: 47.7% for white vs. 28.0% for Black defendants). The algorithm was slightly better than chance at predicting recidivism overall.
Significance: This investigation catalyzed the modern AI ethics field. It demonstrated that an opaque, commercially licensed algorithm — used in actual sentencing, bail, and parole decisions across the country — produced measurably racially disparate results. It forced a public debate about algorithmic accountability, transparency, and the due process implications of score-based decision-making.
Limitations: Recidivism was measured as rearrest, which is itself racially biased (police surveil Black communities more intensively). The study did not model whether COMPAS performs worse than unassisted human judgment. Broward County may not be nationally representative.
What Happened After: Northpointe (the developer) published a response arguing that COMPAS satisfies predictive parity — that for any given score, Black and white defendants reoffend at equal rates. Researchers including Chouldechova (2017) showed that both sets of claims can be simultaneously true because of base rate differences — igniting the fairness impossibility theorem debate. Dozens of jurisdictions have continued using COMPAS; some courts have rejected challenges to its use; a few states have passed algorithmic transparency laws partly in response.
Study 2: Chouldechova Fairness Impossibility
Citation: Chouldechova, A. (2017). Fair prediction with disparate impact: A study of bias in recidivism prediction instruments. Big Data, 5(2), 153-163.
Research Question: Can a recidivism prediction instrument satisfy both error-rate parity (equal false positive and false negative rates across races) and calibration (predictive parity) simultaneously?
Methodology: Mathematical derivation and proof, supplemented by simulation and application to the COMPAS data from the ProPublica investigation.
Key Findings: When two groups have different base rates of the outcome being predicted (as Black and white defendants have different recidivism base rates due to systemic factors), it is mathematically impossible for a prediction instrument to simultaneously achieve equal false positive rates, equal false negative rates, and calibration (predictive parity). Satisfying one metric requires violating at least one of the others.
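The trade-off can be seen numerically. In the toy sketch below, a score is perfectly calibrated for both groups by construction (a score of 0.8 really means an 80% chance of reoffending), yet because the groups' base rates differ, their error rates cannot match:

```python
def rates(frac_high, p_high=0.8, p_low=0.2):
    """Toy calibrated score: a fraction frac_high of the group scores
    p_high, the rest score p_low, and each score equals the true
    reoffense probability. 'High risk' label = score above 0.5.
    Returns (base_rate, false_positive_rate, false_negative_rate)."""
    frac_low = 1 - frac_high
    base = frac_high * p_high + frac_low * p_low
    fpr = frac_high * (1 - p_high) / (1 - base)  # high-risk label among non-reoffenders
    fnr = frac_low * p_low / base                # low-risk label among reoffenders
    return base, fpr, fnr

for name, frac in [("group A", 0.50), ("group B", 0.25)]:
    base, fpr, fnr = rates(frac)
    print(f"{name}: base rate={base:.2f}  FPR={fpr:.3f}  FNR={fnr:.3f}")
```

The score is calibrated for both groups, yet group A's FPR (0.200) is more than double group B's (0.077), and their FNRs diverge in the opposite direction: the impossibility result in miniature.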
Significance: This is among the most important theoretical results in AI fairness research. It explains why Northpointe and ProPublica were both correct in their respective claims about COMPAS — they were measuring different things. It established that choosing a fairness metric is a value judgment, not a purely technical decision, and that no algorithm can satisfy all conceptions of fairness simultaneously when group base rates differ.
Limitations: The proof applies to binary prediction settings with differing base rates; the practical magnitude of the trade-off depends on how different the base rates are. The result does not tell us which fairness metric to choose — that remains a normative question.
What Happened After: The result was independently derived by Kleinberg, Mullainathan, and Raghavan (2017) and by Corbett-Davies et al. (2017), confirming its validity. It is now considered a foundational result in algorithmic fairness theory and is cited in virtually every serious treatment of AI fairness metrics.
Part II: Facial Recognition and Biometric Bias
Study 3: Gender Shades
Citation: Buolamwini, J., & Gebru, T. (2018). Gender shades: Intersectional accuracy disparities in commercial gender classification. Proceedings of Machine Learning Research, 81, 77-91. (ACM FAccT 2018)
Research Question: Do commercial facial analysis systems perform equally well across gender and racial groups, and if not, how do accuracy disparities vary by the intersection of gender and race?
Methodology: The researchers curated a new benchmark dataset of 1,270 individuals from three African countries and three Nordic countries, with careful attention to skin tone classification using the Fitzpatrick scale. They evaluated three commercial gender classification systems (Microsoft, IBM, and Face++) on this dataset and measured accuracy by gender (female/male) and skin tone (lighter/darker), as well as at the intersection of these characteristics.
Key Findings: All three systems performed worst on darker-skinned women — with error rates up to 34.7% for that subgroup, compared to error rates under 1% for lighter-skinned men. The worst-performing system had an overall accuracy of 87.9% but an accuracy gap of 34.4 percentage points between the best- and worst-performing subgroups. The benchmark datasets then used to train and evaluate these systems substantially underrepresented darker-skinned faces and women.
Significance: Gender Shades provided the first systematic, intersectional evidence of severe performance disparities in widely deployed commercial AI systems. It introduced the concept of intersectionality to AI bias research — the insight that disparities compound at the intersection of multiple marginalized identities. It also demonstrated that the datasets used to benchmark AI systems were themselves biased.
Limitations: The study evaluated gender classification only, not face recognition or identification. The dataset, while more diverse than existing benchmarks, was curated from specific countries. Gender was treated as binary.
What Happened After: All three companies improved their systems within months of the study's release, though Microsoft's improved system still showed residual disparities. The study helped motivate NIST's FRVT Part 3 evaluation of demographic effects and the creation of more diverse benchmark datasets. Buolamwini's follow-up work with the Algorithmic Justice League built directly on these findings.
Study 4: NIST Face Recognition Vendor Test (FRVT)
Citation: Grother, P., Ngan, M., & Hanaoka, K. (2019). Face Recognition Vendor Test (FRVT), Part 3: Demographic Effects. National Institute of Standards and Technology, NISTIR 8280. (Updated 2022)
Research Question: Do face recognition algorithms available in the marketplace show systematic differences in accuracy across demographic groups?
Methodology: NIST evaluated 189 algorithms (2019) from 99 developers on a dataset of 18.27 million images — including visa photographs, mugshots, and border crossing images — all from U.S. government databases. The study measured false match rate and false non-match rate separately by age, gender, and race/ethnicity, enabling a comprehensive, empirically grounded comparison across systems.
Key Findings: The majority of systems showed higher false positive rates for African-American and Asian faces compared to Caucasian faces (by factors of 10 to 100 in some one-to-one matching tasks). For one-to-many identification (the scenario relevant to surveillance and law enforcement), disparities were even larger. Algorithms developed in Asia showed higher accuracy on Asian faces. False match rates were also elevated for older individuals and women.
Significance: The FRVT is the most comprehensive and authoritative empirical assessment of facial recognition bias ever conducted, covering the full market rather than selected systems. Its findings informed policy debates about facial recognition moratoriums, provided authoritative context for a series of high-profile false arrests, and established NIST's role as a de facto standard-setter for AI performance evaluation.
Limitations: NIST used government databases, which may not represent the full range of real-world deployment conditions. The study does not address the separate question of whether deployments are justified regardless of technical accuracy.
What Happened After: False arrests of the kind the FRVT's findings implied soon followed: Robert Williams and Michael Oliver in Michigan and Nijeer Parks in New Jersey were arrested based on erroneous facial recognition matches (all three are Black men). Several major cities (San Francisco, Oakland, Boston) enacted bans or moratoriums. The 2022 update found continued disparities despite industry improvements.
Part III: Healthcare Algorithms
Study 5: Obermeyer et al. Health Algorithm Bias
Citation: Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447-453.
Research Question: Does a widely used commercial algorithm for identifying patients who need enhanced medical care show racial bias in its predictions?
Methodology: The researchers analyzed data from a major academic medical center on roughly 50,000 patients enrolled in a risk management program. The algorithm in question used predicted health care costs as a proxy for health need to identify patients for enhanced care programs. The researchers examined whether Black and white patients with the same algorithm-assigned risk score had the same level of health need (measured by objective indicators of illness).
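The paper's central test — do patients of different races show the same measured health need at the same risk score? — can be sketched with hypothetical records (all numbers below are invented for illustration):

```python
import statistics
from collections import defaultdict

# Hypothetical records: (decile of algorithm risk score, race,
# number of active chronic conditions). Illustrative only.
records = [
    (9, "Black", 5), (9, "Black", 6), (9, "white", 3), (9, "white", 4),
    (5, "Black", 4), (5, "Black", 3), (5, "white", 2), (5, "white", 2),
]

need = defaultdict(list)
for decile, race, conditions in records:
    need[(decile, race)].append(conditions)

# At each score decile, compare mean measured health need by race.
for decile in sorted({d for d, _ in need}, reverse=True):
    b = statistics.mean(need[(decile, "Black")])
    w = statistics.mean(need[(decile, "white")])
    print(f"decile {decile}: Black={b:.1f} conditions, white={w:.1f} conditions")
```

A persistent gap at the same score is the calibration failure the paper documents: equal scores do not mean equal need.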
Key Findings: At a given risk score, Black patients were significantly sicker than white patients — meaning the algorithm systematically underestimated the health needs of Black patients. The algorithm reduced the number of Black patients identified for extra care by more than half compared to a race-neutral assignment. The cause: using healthcare costs as a proxy for health need introduced racial bias because Black patients with the same health conditions generate lower costs (due to reduced access to care, lower trust in the medical system, and other systemic factors).
Significance: This study demonstrated racial bias in a healthcare algorithm used by approximately 200 million people annually — making it one of the highest-stakes AI bias findings ever published. It showed how proxy variables can introduce bias even when the algorithm is ostensibly race-neutral and no one intended discrimination. It became a landmark in medical AI ethics.
Limitations: The study examined a specific algorithm at a specific medical center; the developer later updated the algorithm. The study demonstrates bias in risk scoring but does not evaluate downstream clinical outcomes. Cost as a proxy for health is not the only possible source of healthcare AI bias.
What Happened After: The algorithm's developer (Optum) announced they would modify the algorithm to reduce racial bias. The study attracted enormous policy attention, contributing to FDA and HHS guidance on algorithmic bias in healthcare. It has been cited more than 2,000 times.
Study 6: Sjoding et al. Pulse Oximeter Racial Bias
Citation: Sjoding, M. W., Dickson, R. P., Iwashyna, T. J., Gay, S. E., & Valley, T. S. (2020). Racial bias in pulse oximetry measurement. New England Journal of Medicine, 383(25), 2477-2478.
Research Question: Do pulse oximeters — the finger clip devices used to measure blood oxygen — perform equally well for patients of different racial groups?
Methodology: The researchers used a retrospective cohort study of ICU patients in the Michigan Medicine health system and the eICU database, comparing pulse oximeter readings to simultaneous arterial blood gas (the gold-standard measurement of blood oxygen). They examined whether discrepancies between pulse oximeter and arterial blood gas readings differed by race.
Key Findings: Black patients were approximately three times as likely as white patients to have "occult hypoxemia" — undetected low blood oxygen — where pulse oximeters showed acceptable oxygen levels while arterial blood gas showed dangerously low levels. The failure is generally attributed to the devices' optical design: higher melanin concentrations in darker skin interfere with the light-absorption measurement, and the devices were calibrated predominantly on lighter-skinned patients.
Significance: This study revealed that a ubiquitous, life-critical medical device used in virtually every hospital in the world contained systematic racial bias — not as a software algorithm but as a physical design. It was published during the COVID-19 pandemic, when pulse oximetry was critical to pandemic management. It directly led to FDA safety communications, device manufacturer investigations, and policy changes. It demonstrated that AI ethics bias problems are not limited to software.
Limitations: The study was observational and retrospective. The eICU database has its own sampling limitations. The practical clinical significance in terms of mortality outcomes was documented in subsequent studies, not in this paper.
What Happened After: Subsequent studies found that the hypoxemia bias contributed to disparate COVID-19 treatment and outcomes for Black patients. The FDA issued safety communications in 2021 and 2022. Device manufacturers began investigating redesign. The study has been cited more than 1,500 times and fundamentally changed how medical device regulators think about bias.
Part IV: Labor Market and Hiring Discrimination
Study 7: Bertrand and Mullainathan Résumé Audit
Citation: Bertrand, M., & Mullainathan, S. (2004). Are Emily and Greg more employable than Lakisha and Jamal? A field experiment on labor market discrimination. American Economic Review, 94(4), 991-1013.
Research Question: Does perceived race affect employer callback rates for identical job applications?
Methodology: The researchers responded to over 1,300 help-wanted ads in Chicago and Boston newspapers by sending nearly 5,000 fictitious résumés with randomly assigned names that signal race. White-sounding names (Emily Walsh, Greg Baker) and Black-sounding names (Lakisha Washington, Jamal Jones) were randomly assigned to otherwise near-identical high-quality and low-quality résumés. The outcome was whether the applicant received a callback.
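The statistical comparison at the heart of an audit study is a two-proportion test on callback rates. A sketch using the published rates (9.65% vs. 6.45%) with illustrative sample sizes of roughly 2,435 résumés per name group (the study sent close to 5,000 in total; exact per-cell counts are assumed here):

```python
from math import sqrt, erf

def callback_gap(calls_a, n_a, calls_b, n_b):
    """Two-proportion z-test for a resume-audit comparison."""
    p_a, p_b = calls_a / n_a, calls_b / n_b
    pooled = (calls_a + calls_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # two-sided p-value from the normal CDF
    p_two_sided = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return p_a, p_b, z, p_two_sided

p_w, p_b, z, p = callback_gap(235, 2435, 157, 2435)
print(f"white-name: {p_w:.2%}  Black-name: {p_b:.2%}  z={z:.2f}  p={p:.5f}")
```

With samples this large, a three-percentage-point gap is far outside what chance would produce, which is why the random assignment of names supports a causal interpretation.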
Key Findings: Résumés with white-sounding names received 50% more callbacks than identical résumés with Black-sounding names (9.65% vs. 6.45% callback rate). Higher-quality résumés (more experience, fewer gaps) produced larger callback improvements for white-named applicants than for Black-named applicants, suggesting discrimination compounds with qualifications. Discrimination was consistent across industries, occupations, and employer size.
Significance: This is one of the most cited papers in economics and established the field audit methodology as the gold standard for detecting employment discrimination. It established causal evidence (not just correlation) of racial discrimination in hiring. Its methodology has been adapted for studying algorithmic hiring discrimination and name-based targeting in digital advertising.
Limitations: The study examined one channel (initial callback) not the full hiring process. It used newspaper ads, now largely replaced by online platforms. It was conducted in 2001-2002 in two cities. Subsequent research has examined whether results hold on modern platforms and in different labor markets.
What Happened After: The study has been replicated and extended dozens of times, including studies examining gender discrimination, age discrimination, and discrimination in online platforms. It directly informed the study of algorithmic hiring tools. It is routinely cited in EEOC legal proceedings.
Study 8: Datta et al. Gender Targeting in Job Ads
Citation: Datta, A., Tschantz, M. C., & Datta, A. (2015). Automated experiments on ad privacy settings: A tale of opacity, choice, and discrimination. Proceedings on Privacy Enhancing Technologies, 2015(1), 92-107.
Research Question: Does Google's ad targeting system show gender-based disparities in the delivery of job advertisements?
Methodology: The researchers used AdFisher, a tool they developed to create automated browser profiles with controlled characteristics and observe which ads were shown. They created profiles that differed only in simulated gender signals and measured what ads Google served to male-presenting vs. female-presenting profiles.
Key Findings: Profiles presenting as male were shown ads for high-paying executive positions ($200,000+ salary) significantly more often than profiles presenting as female. This occurred through Google's interest-based targeting, suggesting that either Google's targeting algorithm or its advertiser clients were directly or indirectly targeting based on gender.
Significance: This study was among the first to demonstrate algorithmic discrimination in digital advertising using experimental methodology, establishing a template for auditing online ad systems. It showed that intent is not required — discriminatory targeting can emerge from optimization processes without anyone explicitly deciding to discriminate.
Limitations: The study used simulated profiles, not real users. The mechanism of the bias (advertiser behavior vs. platform algorithm) could not be fully disentangled. Google's advertising systems have changed substantially since 2015.
What Happened After: This study contributed directly to regulatory scrutiny of targeted advertising. The EEOC, HUD, and DOJ investigated Facebook's (Meta's) advertising targeting in subsequent years. Meta restricted targeting options for housing, employment, and credit ads under a 2019 civil rights settlement, and in 2022 settled a DOJ Fair Housing Act suit, paying a civil penalty of $115,054 (the statutory maximum) and agreeing to change how housing ads are targeted and delivered.
Part V: Natural Language Processing and Word Embeddings
Study 9: Caliskan et al. Word Embedding Biases
Citation: Caliskan, A., Bryson, J. J., & Narayanan, A. (2017). Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334), 183-186.
Research Question: Do word embedding models (the foundational technology underlying most NLP systems) encode human-like social biases, and can these biases be measured systematically?
Methodology: The researchers applied the Word Embedding Association Test (WEAT) — adapted from the Implicit Association Test used in psychology to measure implicit bias — to GloVe word embeddings trained on the Common Crawl web corpus. They tested whether embedding proximity scores reflected known human biases about race, gender, and occupation.
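The WEAT statistic itself is compact: for each target word, take its mean cosine similarity to one attribute set minus its mean similarity to the other, then compare the two target sets. A toy sketch with invented 2-d vectors standing in for the real 300-d GloVe embeddings:

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def assoc(w, A, B):
    """s(w, A, B): mean similarity to attribute set A minus to set B."""
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """Cohen's-d-like effect size: how much more strongly target set X
    associates with attributes A (vs. B) than target set Y does."""
    sx = [assoc(x, A, B) for x in X]
    sy = [assoc(y, A, B) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)

# Invented 2-d "embeddings" for illustration:
career = [np.array([1.0, 0.1]), np.array([0.9, 0.0])]   # attribute set A
family = [np.array([0.1, 1.0]), np.array([0.0, 0.9])]   # attribute set B
male   = [np.array([0.8, 0.2]), np.array([0.9, 0.3])]   # target set X
female = [np.array([0.2, 0.8]), np.array([0.3, 0.9])]   # target set Y

d = weat_effect_size(male, female, career, family)
print(f"effect size: {d:.2f}")  # positive: male terms sit closer to career
```

In the paper, effect sizes computed this way on real embeddings matched the direction and rough magnitude of human implicit association results.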
Key Findings: Word embeddings showed strong associations consistent with human social biases: European-American names were more associated with "pleasant" words than African-American names; male names were more associated with careers while female names were associated with family; arts were more associated with women and science with men. The effect sizes were comparable to human implicit association test results.
Significance: This study established that the word embedding models underlying virtually all commercial NLP (search engines, recommendation systems, translation, autocomplete, hiring tools) encode and propagate social biases present in the training text. It connected AI bias to decades of social science research on implicit bias, and provided a measurement methodology that has since been widely used.
Limitations: Effect sizes in WEAT require careful interpretation. The study demonstrates association in the embedding space, not necessarily in downstream application behavior. Web text is not a representative sample of human language.
What Happened After: The WEAT methodology has been applied to dozens of embedding models and extended to multilingual settings. Debiasing techniques have been proposed and critiqued. The study fundamentally shaped how researchers think about bias in large language models.
Study 10: Abid et al. Anti-Muslim Bias in GPT-3
Citation: Abid, A., Farooqi, M., & Zou, J. (2021). Persistent anti-Muslim bias in large language models. Proceedings of the 2021 AAAI/ACM Conference on AI, Ethics, and Society, 298-306.
Research Question: Does GPT-3 associate Muslims with violence and terrorism, and is this association persistent across different prompt formulations?
Methodology: The researchers systematically prompted GPT-3 with sentence completions referencing different religious groups and compared the violence and terrorism associations in the generated text. They used both automated analysis and human evaluation.
Key Findings: GPT-3 completed prompts such as "Two Muslims walked into a..." with violent content roughly 66% of the time — far more often than analogous prompts about any other religious group, such as Christians, Jews, or Hindus — and in analogy tasks it mapped "Muslim" to "terrorist" 23% of the time. The association persisted across diverse prompt framings and was robust to synonym substitutions. Even prompts designed to counteract the bias reduced but did not eliminate it.
Significance: This study was among the first rigorous assessments of social bias in large language models as a class. It demonstrated that LLMs can produce content that would constitute discriminatory or harmful output if deployed in applications — and that the bias is not easily eliminated by prompt engineering.
Limitations: GPT-3 as evaluated represents a specific version at a point in time; subsequent models have changed. The study evaluates association, not the probability that generated text causes harm in deployment. Human evaluation of what constitutes "violent" framing involves subjectivity.
What Happened After: The study contributed to OpenAI's development of safety guidelines and content policies for GPT-3 and subsequent models. It established a methodology adapted by many subsequent bias evaluations of LLMs. The term "stochastic parrot" (Bender et al., 2021) built on similar concerns about what LLMs "know" versus what they statistically associate.
Part VI: Explainability
Study 11: Ribeiro et al. LIME
Citation: Ribeiro, M. T., Singh, S., & Guestrin, C. (2016). "Why should I trust you?": Explaining the predictions of any classifier. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 1135-1144.
Research Question: Can a model-agnostic technique explain individual predictions of any machine learning classifier in human-interpretable terms?
Methodology: The researchers developed LIME (Local Interpretable Model-agnostic Explanations), which works by perturbing the input to a model (changing words in a text, pixels in an image, or feature values in tabular data) and fitting a simple, interpretable model (such as a linear model) to the model's behavior in the local neighborhood of a prediction.
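The perturb-and-fit procedure can be sketched in a few lines. Everything below is a simplified stand-in — a hypothetical black-box classifier and a basic Gaussian perturbation scheme — not the LIME library's actual API:

```python
import numpy as np

def black_box(X):
    """Stand-in for any opaque classifier's probability output."""
    return 1 / (1 + np.exp(-(3.0 * X[:, 0] - 0.5 * X[:, 1])))

def lime_sketch(f, x, n_samples=500, width=0.75, seed=0):
    """Minimal LIME-style explanation: sample perturbations around x,
    weight them by proximity, fit a weighted linear surrogate; its
    coefficients are the local feature importances."""
    rng = np.random.default_rng(seed)
    Z = x + rng.normal(scale=0.5, size=(n_samples, x.size))
    y = f(Z)
    dist = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(dist ** 2) / width ** 2)          # proximity kernel
    Zb = np.hstack([np.ones((n_samples, 1)), Z])   # add intercept column
    sw = np.sqrt(w)[:, None]                        # weighted least squares
    coef, *_ = np.linalg.lstsq(Zb * sw, y * np.sqrt(w), rcond=None)
    return coef[1:]  # per-feature local weights (intercept dropped)

x0 = np.array([0.2, -0.1])
importances = lime_sketch(black_box, x0)
print(importances)  # feature 0 dominates locally, as the black box implies
```

The surrogate is only trusted near x0; repeating the fit at a different point can, and should, give different importances. That locality is both LIME's strength and the source of the consistency concerns noted below.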
Key Findings: LIME reliably identifies the features most influential to individual predictions across diverse model types (text classifiers, image classifiers, tabular data models). In user studies, domain experts using LIME were better able to detect when a model had learned spurious correlations versus genuine signal.
Significance: LIME introduced the concept of local, post-hoc explanation — making any black-box model partially interpretable without changing the model itself. It has become one of the most-used explainability tools in practice and is foundational to the field of Explainable AI (XAI).
Limitations: LIME explanations are local (valid near a specific input, not globally); they may be inconsistent across slightly different inputs. The underlying model is unmodified, raising questions about whether explanations are faithful to actual model behavior. LIME can be gamed by adversarial inputs designed to produce misleading explanations.
What Happened After: LIME has been downloaded millions of times and is integrated into commercial AI platforms. It was followed by SHAP (Lundberg & Lee, 2017), which offers stronger theoretical guarantees. Both are now considered standard tools in responsible AI practice.
Part VII: Environmental Impact
Study 12: Strubell et al. Energy Cost of NLP
Citation: Strubell, E., Ganesh, A., & McCallum, A. (2019). Energy and policy considerations for deep learning in NLP. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 3645-3650.
Research Question: What are the financial and environmental costs of training state-of-the-art natural language processing models?
Methodology: The researchers measured the energy consumption of training several large NLP models on a single GPU setup and extrapolated to CO2 emissions using regional power grid data. They compared costs across model architectures and reported both dollar costs and CO2 equivalents.
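The extrapolation is straightforward arithmetic: hardware power draw times training time, inflated by datacenter overhead (PUE), times the grid's carbon intensity. The sketch below uses the paper's conversion constants (PUE 1.58; 0.954 lbs CO2 per kWh, a U.S. average) with invented hardware numbers:

```python
# Back-of-envelope conversion in the style of Strubell et al.
# The hardware figures are illustrative assumptions, not measurements.
gpu_power_kw = 0.25        # assumed average draw of one GPU, in kW
n_gpus = 8                  # assumed training cluster size
hours = 80                  # assumed wall-clock training time
pue = 1.58                  # power usage effectiveness (datacenter overhead)
co2_lbs_per_kwh = 0.954    # U.S. average grid carbon intensity

energy_kwh = gpu_power_kw * n_gpus * hours * pue
co2_lbs = energy_kwh * co2_lbs_per_kwh
print(f"{energy_kwh:.0f} kWh -> {co2_lbs:.0f} lbs CO2e")
```

Because the grid-intensity term varies several-fold by region and year, the same training run can have very different footprints depending on where and when it happens — one of the limitations the authors flag.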
Key Findings: Training a single large transformer NLP model (BERT-large, on GPU) produces approximately 1,438 lbs of CO2 — roughly a round-trip flight between New York and San Francisco for one passenger. A full neural architecture search for a single NLP model was estimated to produce 626,155 lbs of CO2 — equivalent to the lifetime emissions of five cars. Cloud compute costs for state-of-the-art model training exceeded $250,000.
Significance: This was the first systematic accounting of the environmental and financial costs of large-scale AI training. It introduced the concept of "compute budget" as an equity concern — only well-funded organizations can afford to train the largest models — and established environmental sustainability as an AI ethics issue.
Limitations: Estimates are based on specific hardware configurations and regional power grids; actual environmental impact varies significantly by region and over time. The study predates the scale of GPT-3, GPT-4, and other subsequent large models, whose costs are orders of magnitude higher.
What Happened After: Subsequent studies have documented far larger compute costs for models like GPT-3, GPT-4, and Gemini. Carbon disclosure has become part of responsible AI reporting standards. The EU AI Act includes provisions related to energy consumption reporting for general-purpose AI models.
Part VIII: Surveillance and Discrimination in Digital Advertising
Study 13: Sweeney Name-Based Ad Discrimination
Citation: Sweeney, L. (2013). Discrimination in online ad delivery. Communications of the ACM, 56(5), 44-54.
Research Question: Does Google's ad system deliver discriminatory ads based on names associated with race?
Methodology: Sweeney searched for names that were statistically more common among Black Americans versus white Americans (based on birth certificate data) and observed what ads Google displayed alongside each search.
Key Findings: Searches on Black-identifying names were significantly more likely to generate ads suggestive of a criminal record ("Latanya Sweeney, arrested?") than searches on white-identifying names, even though the researcher had no criminal history. The pattern held consistently across a large sample of names.
Significance: This study directly demonstrated how algorithmic advertising can produce discriminatory outputs — in this case, false insinuations of criminal history — even when no one explicitly programmed discrimination. It illustrated how training on historical data can encode and amplify existing racial disparities.
Limitations: The mechanism — whether the ads were driven by advertiser targeting or Google's optimization algorithm — could not be definitively established from observation alone. The finding may reflect advertiser behavior as much as platform design.
What Happened After: The study contributed to growing regulatory scrutiny of online advertising. Google and other platforms have modified their policies on criminal record-related advertising. The study has been cited in FTC investigations of discriminatory advertising.
Study 14: Facebook Emotional Contagion Experiment
Citation: Kramer, A. D. I., Guillory, J. E., & Hancock, J. T. (2014). Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences, 111(24), 8788-8790.
Research Question: Can emotions spread through social networks through a process of emotional contagion, even without nonverbal cues?
Methodology: For one week in January 2012, Facebook experimentally manipulated the News Feeds of 689,003 users without their knowledge or consent: one group saw reduced positive content, another saw reduced negative content, and users' subsequent posts were analyzed for emotional valence.
Key Findings: Exposure to more negative content caused users to post more negatively; exposure to more positive content caused users to post more positively. Emotional contagion was demonstrated at massive scale in a social media context.
Significance: The study's significance for AI ethics lies not in its findings but in its conduct. It demonstrated that Facebook was willing and able to conduct large-scale psychological experiments on users without their knowledge or consent, that the platform had the ability to manipulate emotional states at scale, and that the terms of service informed consent framework is inadequate for experimental contexts.
Limitations: Effect sizes were small. The study has been critiqued on methodological grounds. The core finding (emotional contagion) was itself already established in prior research.
What Happened After: The study provoked major public backlash when published. It contributed to ongoing debates about informed consent in digital research, the ethics of platform A/B testing, and the GDPR's provisions on automated decision-making. Cornell University (which had IRB jurisdiction over one author) found that the study raised concerns but did not clearly violate existing guidelines, exposing gaps in research ethics frameworks.
Part IX: Algorithmic Redlining in Lending
Study 15: The Markup Algorithmic Redlining Investigation
Citation: Martinez, E., & Kirchner, L. (2021). The secret bias hidden in mortgage-lending algorithms. The Markup, August 25, 2021.
Research Question: Do the mortgage lending algorithms used by major U.S. lenders disadvantage minority applicants even after controlling for creditworthiness?
Methodology: The Markup analyzed approximately 2.7 million conventional mortgage applications submitted in 2019 to the largest U.S. lenders using publicly available HMDA data. They applied the same statistical controls used by federal regulators — including income, loan amount, and loan-to-value ratios — and measured residual racial disparities in denial rates.
Key Findings: Across the largest lenders, Black applicants were 80% more likely to be denied than white applicants with similar financial profiles. Latino applicants were about 40% more likely, and Asian/Pacific Islander applicants about 50% more likely, to be denied than comparably situated white applicants. These disparities persisted across different lenders and geographic markets.
Significance: This investigation applied rigorous investigative journalism methodology to a massive dataset, producing the most comprehensive recent evidence of algorithmic mortgage discrimination. It directly contributed to Congressional scrutiny of algorithmic lending and prompted regulatory responses from CFPB.
Limitations: HMDA data lacks credit scores and detailed debt information, making it impossible to fully control for all legitimate underwriting factors. The investigation cannot prove that the algorithms (as opposed to other factors) caused the disparities.
What Happened After: The investigation prompted congressional hearings and renewed regulatory scrutiny. The CFPB issued updated fair lending guidance citing algorithmic decision-making concerns. Several lenders named in the investigation disputed the findings.
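The logic of The Markup's residual-disparity approach — compare denial rates only among applicants with similar financial profiles, so that any remaining gap cannot be attributed to the controlled variables — can be illustrated with a toy sketch. The data below is synthetic and the coarse bucketing is a stand-in for the regression controls The Markup actually ran over ~2.7 million HMDA records; field names and bucket labels are hypothetical.

```python
from collections import defaultdict

# Synthetic toy applications: (race, income_bracket, ltv_bracket, denied).
# Purely illustrative — NOT The Markup's data or model, which applied
# regulator-style statistical controls to real HMDA records.
apps = [
    ("white", "mid", "low", 0), ("white", "mid", "low", 0),
    ("white", "mid", "low", 1), ("white", "mid", "low", 0),
    ("black", "mid", "low", 1), ("black", "mid", "low", 0),
    ("black", "mid", "low", 1), ("black", "mid", "low", 0),
    ("white", "low", "high", 1), ("white", "low", "high", 0),
    ("black", "low", "high", 1), ("black", "low", "high", 1),
]

# Tally denials per (financial-profile bucket, race).
tally = defaultdict(lambda: [0, 0])  # key -> [denied, total]
for race, inc, ltv, denied in apps:
    cell = tally[((inc, ltv), race)]
    cell[0] += denied
    cell[1] += 1

def denial_rate(bucket, race):
    denied, total = tally[(bucket, race)]
    return denied / total

# Residual disparity: within a bucket, applicants have similar financial
# profiles, so any remaining gap is unexplained by the controls.
for bucket in sorted({b for b, _ in tally}):
    ratio = denial_rate(bucket, "black") / denial_rate(bucket, "white")
    print(f"{bucket}: Black applicants denied {ratio:.1f}x as often")
```

The same caveat from the Limitations paragraph applies to any such analysis: variables absent from the dataset (here, credit scores and detailed debt) can never be controlled for, so residual disparity is evidence of, not proof of, algorithmic discrimination.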
Part X: Additional Significant Studies
Study 16: Angwin et al. Disparate Impact in Insurance Pricing
Citation: Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2017). Minority neighborhoods pay higher car insurance premiums than white areas with the same risk. ProPublica, April 5, 2017.
Summary: ProPublica and Consumer Reports analyzed car insurance premiums in four states and found that major insurers charged drivers in predominantly minority zip codes significantly higher premiums than drivers in white areas with the same accident risk. Five of the largest U.S. auto insurers showed this pattern. The finding illustrated how algorithmic pricing using geographic data can produce racially disparate outcomes even when race is not explicitly used.
Study 17: Crawford et al. Excavating AI
Citation: Crawford, K., & Paglen, T. (2019). Excavating AI: The politics of training sets for machine learning. AI Now Institute.
Summary: The researchers examined the ImageNet dataset — the foundational training dataset for most computer vision systems — and found that it contained thousands of demeaning, offensive, and sexualized labels for human subjects, including racial and gendered slurs. Subjects had been labeled without their consent. The study demonstrated that dataset curation is a value-laden practice with profound ethical implications, not a neutral technical exercise.
Study 18: Eubanks — Automating Inequality
Citation: Eubanks, V. (2018). Automating Inequality: How High-Tech Tools Profile, Police, and Punish the Poor. St. Martin's Press.
Summary: This book-length ethnographic study examined three automated systems: Allegheny County's child welfare risk assessment, Indiana's benefits eligibility automation, and Los Angeles's homeless services coordination system. Eubanks documented how each system concentrated surveillance, restriction, and punishment on low-income communities, creating what she termed a "digital poorhouse." This work established the ethical stakes of algorithmic decision-making in public benefits systems.
Study 19: Dastin — Amazon AI Hiring Bias
Citation: Dastin, J. (2018). Amazon scraps secret AI recruiting tool that showed bias against women. Reuters, October 10, 2018.
Summary: Reuters revealed that Amazon had developed and then abandoned an AI hiring tool after discovering it systematically downgraded résumés that included the word "women's" (as in "women's chess club") and graduates of all-women's colleges. The model was trained on historically male-dominated technical hiring at Amazon, and learned to penalize female signals. Amazon disbanded the team in 2017. The case became the canonical example of historical bias in training data.
Study 20: Buolamwini and MIT Media Lab
Citation: Buolamwini, J. (2018). AI, Ain't I a Woman? (Video performance/research project, MIT Media Lab).
Summary: This accessible public communication of the Gender Shades findings, produced as a spoken-word video performance, demonstrated that commercial facial analysis systems failed to correctly identify the gender of prominent Black women including Michelle Obama, Oprah Winfrey, and Congresswoman Maxine Waters. The work connected technical research to civil rights history and public understanding, demonstrating the importance of accessible research communication in AI ethics.
Study 21: Bender et al. Stochastic Parrots
Citation: Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, 610-623.
Summary: This paper argued that large language models (LLMs) pose environmental costs, encode social biases from training data, can generate text that appears meaningful but is not grounded in understanding, and concentrate power among well-resourced organizations. Google's demand that the paper be withdrawn was central to Timnit Gebru's contested exit from the company, where she co-led the Ethical AI team.
Study 22: Angwin et al. Facebook Ad Targeting Discrimination
Citation: Angwin, J., & Parris, T. (2016). Facebook lets advertisers exclude users by race. ProPublica, October 28, 2016.
Summary: ProPublica demonstrated that Facebook's advertising platform allowed advertisers to target housing ads while excluding users whose "ethnic affinity" Facebook categorized as "African American," "Asian American," or "Hispanic." On its face, such exclusion in housing advertising violates the Fair Housing Act. Facebook initially defended the practice as "standard industry practice" before changing its policies under pressure.
Study 23: Barocas and Selbst Disparate Impact
Citation: Barocas, S., & Selbst, A. D. (2016). Big data's disparate impact. California Law Review, 104, 671-732.
Summary: This foundational law review article analyzed how machine learning systems can produce disparate impact discrimination even when developers have no discriminatory intent. The article identified five specific technical sources of discrimination in the data pipeline — skewed samples, tainted examples, limited features, proxy discrimination, and feedback loops — and analyzed how each relates to existing anti-discrimination law.
Study 24: Noble — Algorithms of Oppression
Citation: Noble, S. U. (2018). Algorithms of Oppression: How Search Engines Reinforce Racism. NYU Press.
Summary: Noble demonstrated through systematic content analysis that Google search results for queries about Black girls and women returned hypersexualized, stereotyped, and dehumanizing content, while equivalent searches about white women did not. The book situates algorithmic racism in the political economy of search engine advertising and the structural position of Black women in American society.
Study 25: Whittaker et al. AI Now 2018 Report
Citation: Whittaker, M., et al. (2018). AI Now Report 2018. AI Now Institute.
Summary: This annual report introduced the concept of "AI ethics washing" — the pattern of companies publishing ethics principles while resisting regulation and accountability — and documented the concentration of AI development in a small number of large technology companies. It called for greater accountability mechanisms, public auditing rights, and worker protections. The AI Now annual reports have served as the field's most comprehensive annual accounting of AI ethics challenges.
These 25 studies represent the foundational empirical literature of AI ethics. Together they establish that algorithmic bias is measurable, widespread, consequential, and amenable to systematic investigation. They also reveal the field's methodological diversity: audit studies, large-scale statistical analysis, ethnography, document analysis, and investigative journalism have all made essential contributions.