Appendix B: Key Studies Summary — Landmark Research Referenced in This Book

This appendix summarizes the most important studies cited throughout the textbook. For each study, we provide the research question, the method, the key finding, why it matters, its limitations, and its citation tier (see Appendix I for the tier system). Studies are organized by topic.


Bias and Fairness in AI Systems

Gender Shades (Buolamwini & Gebru, 2018)

Referenced in: Ch. 6 (§6.4), Ch. 9 (§9.4)

Citation Tier: Tier 1 (Verified)

Research Question: Do commercial facial recognition systems perform equally well across different skin tones and genders?

Method: The researchers created a new benchmark dataset — the Pilot Parliaments Benchmark (PPB) — consisting of 1,270 faces of parliamentarians from three African countries and three European countries. The faces were classified by skin type (using the Fitzpatrick scale) and gender. Three commercial facial recognition systems (from IBM, Microsoft, and Face++) were tested for accuracy in classifying gender.

Key Finding: All three systems performed worst on darker-skinned females, with error rates up to 34.7% for this group compared to error rates below 1% for lighter-skinned males. The overall accuracy figures reported by the companies masked dramatic disparities across demographic subgroups.
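The methodological point, that a single aggregate number can conceal large subgroup gaps, can be sketched in a few lines of Python. The counts below are invented for illustration and are not the study's data; only the qualitative pattern is meant to echo the finding:

```python
# Illustrative sketch: aggregate accuracy can hide subgroup disparities.
# The counts below are invented for demonstration; they are NOT the
# Gender Shades data, only shaped to echo its qualitative pattern.

subgroups = {
    # group: (correct classifications, total examples)
    "lighter-skinned males":   (990, 1000),
    "lighter-skinned females": (930, 1000),
    "darker-skinned males":    (880, 1000),
    "darker-skinned females":  (660, 1000),
}

total_correct = sum(c for c, n in subgroups.values())
total_n = sum(n for c, n in subgroups.values())
print(f"aggregate accuracy: {total_correct / total_n:.1%}")  # looks respectable

for group, (correct, n) in subgroups.items():
    print(f"{group}: error rate {1 - correct / n:.1%}")
```

Disaggregating by subgroup, as the study did, is what exposes the disparity that the single aggregate figure hides.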

Why It Matters: This study demonstrated that aggregate performance metrics can hide serious inequities. It showed that bias in AI systems is not hypothetical — it is measurable, documented, and consequential. The study also demonstrated the power of independent audits: the companies' own testing had not revealed these disparities.

Limitations: The dataset, while carefully constructed, was relatively small. The study tested gender classification specifically, not face identification (matching a face to an identity). Performance has improved since the study was published, partly in response to its findings. The study examined three systems available in 2017; the landscape has changed substantially.

Full Citation: Buolamwini, J., & Gebru, T. (2018). Gender Shades: Intersectional accuracy disparities in commercial gender classification. Proceedings of the Conference on Fairness, Accountability and Transparency (FAccT), 77–91.


ProPublica COMPAS Analysis (Angwin, Larson, Mattu, & Kirchner, 2016)

Referenced in: Ch. 9 (§9.3, §9.4), Ch. 17 (§17.2)

Citation Tier: Tier 1 (Verified)

Research Question: Is the COMPAS recidivism prediction tool, used in U.S. courts to inform sentencing and bail decisions, racially biased?

Method: ProPublica obtained COMPAS risk scores for over 7,000 defendants in Broward County, Florida, and tracked whether they were rearrested within two years. They compared error rates across racial groups.

Key Finding: The analysis found that Black defendants were nearly twice as likely as white defendants to be incorrectly flagged as high risk (false positive rate: 44.9% for Black defendants vs. 23.5% for white defendants). Conversely, white defendants were more likely to be incorrectly labeled as low risk when they actually went on to reoffend. The tool's overall accuracy was similar across groups, but its errors were distributed unevenly.

Why It Matters: This investigation ignited a national debate about algorithmic fairness in the criminal justice system. It also catalyzed a crucial insight in fairness research. Northpointe (COMPAS's developer) countered that COMPAS was calibrated: a score of 7 corresponded to roughly the same recidivism probability regardless of race. Both claims were statistically correct. When base rates of rearrest differ between groups, a calibrated tool cannot also equalize false positive and false negative rates across those groups, so these definitions of fairness cannot all be satisfied at once. This mathematical incompatibility, later formalized by researchers, is one of the threshold concepts of this textbook (Ch. 9, §9.3).
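The tension between the two fairness claims can be demonstrated numerically. The Python sketch below uses invented counts, not the Broward County data: the tool has the same positive predictive value for both groups (the sense in which it is "calibrated"), yet its false positive rates diverge because the groups' underlying rearrest rates differ.

```python
# Toy sketch of the calibration-vs-error-rate tension (invented numbers,
# not the Broward County data). Each group dict gives counts of people
# by (flagged as high risk, actually rearrested within the window).

groups = {
    "group A": {"flagged_rearrested": 360, "flagged_not": 240,
                "unflagged_rearrested": 140, "unflagged_not": 260},
    "group B": {"flagged_rearrested": 180, "flagged_not": 120,
                "unflagged_rearrested": 120, "unflagged_not": 580},
}

for name, g in groups.items():
    flagged = g["flagged_rearrested"] + g["flagged_not"]
    not_rearrested = g["flagged_not"] + g["unflagged_not"]
    ppv = g["flagged_rearrested"] / flagged   # "calibration" at high risk
    fpr = g["flagged_not"] / not_rearrested   # false positive rate
    print(f"{name}: PPV = {ppv:.2f}, FPR = {fpr:.2f}")

# Both groups get PPV = 0.60, yet FPR is 0.48 vs 0.17, because the
# groups' base rates of rearrest (50% vs 30% here) differ.
```

Northpointe measured the first quantity; ProPublica measured the second. With unequal base rates, no redesign of the score can make both equal across groups at once.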

Limitations: The analysis measured rearrest, not reoffending — these are different things, since arrest rates themselves reflect policing patterns. The two-year follow-up window is somewhat arbitrary. The analysis did not assess whether the tool performed better or worse than unaided judicial decision-making. The dispute between ProPublica and Northpointe remains unresolved because both sides used valid but different fairness metrics.

Full Citation: Angwin, J., Larson, J., Mattu, S., & Kirchner, L. (2016, May 23). Machine bias: There's software used across the country to predict future criminals. And it's biased against blacks. ProPublica. https://www.propublica.org/article/machine-bias-risk-assessments-in-criminal-sentencing


Healthcare Algorithm Racial Bias (Obermeyer, Powers, Vogeli, & Mullainathan, 2019)

Referenced in: Ch. 9 (§9.2), Ch. 15 (§15.3)

Citation Tier: Tier 1 (Verified)

Research Question: Does a widely used healthcare algorithm — one that affected the care of approximately 200 million Americans — exhibit racial bias?

Method: The researchers examined a commercial algorithm used by hospitals to identify patients who would benefit from extra care (case management programs). They obtained data on the algorithm's predictions and compared them to actual patient health needs across racial groups.

Key Finding: The algorithm systematically assigned lower risk scores to Black patients than to equally sick white patients. At a given risk score, Black patients were significantly sicker than white patients. The source of the bias: the algorithm used healthcare costs as a proxy for health needs. Because Black patients in the United States historically have less access to healthcare and therefore incur lower costs — even when equally or more ill — the algorithm interpreted their lower spending as lower need.

Why It Matters: This study is one of the clearest demonstrations of how proxy variables can encode structural inequality. Nobody programmed the algorithm to discriminate by race. The variable "healthcare cost" is racially neutral on its face. But because costs are shaped by a history of unequal access, using costs as a proxy for need reproduced that inequality. The study affected an estimated 200 million patients and led the algorithm's developer to collaborate with the researchers on fixes.
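The proxy mechanism can be sketched in a few lines. The function and numbers below are hypothetical, invented only to show how a cost-based target ranks an equally sick but lower-access patient as lower risk:

```python
# Toy sketch (invented numbers) of the proxy-variable mechanism: two
# patients are equally sick, but one has historically had less access
# to care and so has incurred lower costs. A model trained to predict
# cost will therefore rank that patient as lower "risk".

def predicted_cost(chronic_conditions, access_factor):
    """Hypothetical cost model: spending scales with illness,
    discounted by how much care the patient can actually access."""
    return chronic_conditions * 4000 * access_factor

patient_a = predicted_cost(chronic_conditions=5, access_factor=1.0)
patient_b = predicted_cost(chronic_conditions=5, access_factor=0.7)

print(patient_a, patient_b)  # equally sick, unequal "risk" scores

# A need-based target (e.g. chronic_conditions itself) would rank
# both patients identically.
```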

Limitations: The study examined one specific algorithm from one vendor. The researchers had access to the algorithm's outputs but not its complete source code. The fix — replacing cost-based proxies with direct health measures — is conceptually simple but operationally complex.

Full Citation: Obermeyer, Z., Powers, B., Vogeli, C., & Mullainathan, S. (2019). Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464), 447–453.


Language Models and AI Capabilities

"Stochastic Parrots" (Bender, Gebru, McMillan-Major, & Shmitchell, 2021)

Referenced in: Ch. 5 (§5.6)

Citation Tier: Tier 1 (Verified)

Research Question: What are the environmental, social, and ethical risks of ever-larger language models?

Method: This is a position paper and literature review, not an empirical study. The authors synthesized existing research to argue that the trend toward larger and larger language models creates risks that are insufficiently discussed, including environmental costs, training data biases that scale with model size, and the risk that fluent text generation is mistaken for genuine understanding.

Key Finding: The paper coined (or popularized) the term "stochastic parrot" to describe language models: systems that produce plausible-sounding text by recombining patterns from training data without understanding meaning. The authors argued that the appearance of fluency can deceive users into overestimating the model's competence, and that scaling up model size amplifies, rather than resolves, embedded biases.

Why It Matters: The paper sparked an enormous controversy, partly because of its content and partly because of the circumstances surrounding its publication: Timnit Gebru, then co-lead of Google's Ethical AI team, left the company under disputed circumstances shortly after the paper was submitted. The episode raised questions about academic freedom in corporate AI research. The "stochastic parrot" framing remains one of the most influential metaphors in AI discourse.

Limitations: As a position paper, it does not present new empirical data. Some critics argue that the "stochastic parrot" framing underestimates the emergent capabilities of large language models. The paper was written before the release of GPT-4, Claude, and other models that have demonstrated capabilities the authors may not have anticipated. The core debate — whether pattern matching at sufficient scale constitutes something meaningfully different from "mere" repetition — remains unresolved.

Full Citation: Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? 🦜 Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency (FAccT '21), 610–623.


AI and Work

Frey & Osborne — Susceptibility of Jobs to Automation (2013/2017)

Referenced in: Ch. 10 (§10.2, §10.3)

Citation Tier: Tier 1 (Verified)

Research Question: What proportion of U.S. jobs are susceptible to automation by AI and related technologies?

Method: The researchers classified 702 occupations based on whether their component tasks could plausibly be automated using machine learning, mobile robotics, and related technologies. They used a probabilistic model to estimate each occupation's susceptibility.

Key Finding: The study estimated that 47% of U.S. employment was in the "high risk" category — occupations with a 70% or greater probability of being automated within the next decade or two. Transportation, logistics, office support, and production occupations were among the most susceptible.
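Mechanically, the headline figure is an employment-weighted share of occupations whose estimated automation probability exceeds a threshold. The sketch below shows that computation with invented occupations, employment counts, and probabilities, not the study's estimates:

```python
# Sketch of how a headline share like "47% of employment at high risk"
# is computed: an employment-weighted share of occupations whose
# estimated automation probability exceeds a threshold. All numbers
# below are invented for illustration.

occupations = [
    # (name, employment, estimated probability of automation)
    ("telemarketer",         200_000, 0.99),
    ("truck driver",       1_800_000, 0.79),
    ("office clerk",       1_500_000, 0.96),
    ("registered nurse",   2_900_000, 0.009),
    ("software developer", 1_400_000, 0.04),
]

HIGH_RISK = 0.70  # the study's "high risk" cutoff
total = sum(emp for _, emp, _ in occupations)
at_risk = sum(emp for _, emp, p in occupations if p >= HIGH_RISK)
print(f"share of employment at high risk: {at_risk / total:.0%}")
```

Note that the whole occupation's employment counts as "at risk" once its probability crosses the cutoff; this all-or-nothing treatment of occupations is exactly what the task-level critiques below take issue with.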

Why It Matters: The 47% figure became one of the most cited (and most debated) statistics in the automation discourse. It influenced policy discussions worldwide and sparked a wave of follow-up research.

Limitations: The study estimated task-level susceptibility, but jobs consist of many tasks, and automating some tasks within a job is very different from eliminating the job entirely. Subsequent research (notably by the OECD) that examined tasks rather than whole occupations produced much lower estimates — around 9–14% of jobs at high risk. The original study's timeline has not been borne out; the predicted automation has proceeded more slowly than the model suggested. The study does not account for new jobs created by automation.

Full Citation: Frey, C. B., & Osborne, M. A. (2017). The future of employment: How susceptible are jobs to computerisation? Technological Forecasting and Social Change, 114, 254–280. (Originally circulated as a working paper in 2013.)


AI Benchmarks and Capabilities

ImageNet and AlexNet (Russakovsky et al., 2015; Krizhevsky, Sutskever, & Hinton, 2012)

Referenced in: Ch. 2 (§2.4), Ch. 3 (§3.6), Ch. 6 (§6.2)

Citation Tier: Tier 1 (Verified)

Research Question (ImageNet): Can a large-scale, hierarchically organized image dataset advance the state of object recognition research?

Research Question (AlexNet): Can a deep convolutional neural network dramatically outperform traditional computer vision methods on large-scale image classification?

Method: ImageNet is a dataset of over 14 million images organized into more than 20,000 categories. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) used a subset of 1,000 categories to benchmark image classification systems. In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton submitted a deep convolutional neural network (later called "AlexNet") that won the challenge by a dramatic margin.

Key Finding: AlexNet achieved a top-5 error rate of 15.3%, compared to 26.2% for the second-place entry — a relative improvement of over 40%. This result demonstrated that deep neural networks, trained on large datasets using GPUs, could vastly outperform hand-engineered feature approaches. It is widely considered the event that launched the deep learning revolution.
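The top-5 metric counts a prediction as correct if the true label appears anywhere among the model's five highest-scoring guesses. A minimal Python sketch, with invented labels:

```python
# Sketch of the "top-5 error" metric used in the ILSVRC results: a
# prediction counts as correct if the true label appears anywhere in
# the model's five highest-scoring guesses. Labels here are invented.

def top5_error(predictions, true_labels):
    """predictions: list of ranked label lists (best guess first);
    true_labels: the ground-truth label for each example."""
    misses = sum(1 for ranked, truth in zip(predictions, true_labels)
                 if truth not in ranked[:5])
    return misses / len(true_labels)

preds = [["cat", "dog", "fox", "wolf", "lynx", "bear"],
         ["car", "truck", "bus", "van", "tram", "bike"],
         ["tulip", "rose", "daisy", "iris", "lily", "fern"]]
truths = ["lynx", "bike", "rose"]

print(top5_error(preds, truths))  # 1 of 3 truths falls outside the top five
```

With 1,000 candidate categories, top-5 error is a more forgiving measure than top-1, which is why the ILSVRC leaderboard numbers quoted above are top-5 figures.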

Why It Matters: The ImageNet/AlexNet moment is often cited as the beginning of modern AI. It demonstrated three things simultaneously: (1) large datasets matter, (2) deep neural networks work, and (3) GPUs can make training feasible. These three ingredients — data, depth, and compute — remain the foundation of most AI progress.

Limitations: ImageNet has been criticized for biases in its categories, including problematic labels for images of people. The dataset reflects the biases of its creators and the internet from which images were scraped. The person categories were removed in 2019 in response to these critiques. Benchmark performance on ImageNet does not guarantee performance in real-world vision tasks.

Full Citations:
Russakovsky, O., Deng, J., Su, H., et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25.


"Attention Is All You Need" (Vaswani et al., 2017)

Referenced in: Ch. 2 (§2.5), Ch. 5 (§5.2)

Citation Tier: Tier 1 (Verified)

Research Question: Can a neural network architecture based entirely on attention mechanisms — without recurrence or convolution — match or exceed existing approaches to sequence modeling?

Method: The authors proposed the Transformer architecture, which processes input sequences using self-attention mechanisms that allow every element to attend to every other element simultaneously.
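The self-attention computation the paper describes can be sketched in a few lines of NumPy. This is an illustrative single-head version with random weights, not the paper's full multi-head architecture:

```python
# Minimal sketch of scaled dot-product self-attention, the core of the
# Transformer: every position attends to every other position at once.
# Single head, random weights; shapes and values are illustrative only.
import numpy as np

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model). Returns (seq_len, d_v)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])         # every pair of positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over positions
    return weights @ v                              # weighted mix of values

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                  # 4 tokens, model dimension 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(x, w_q, w_k, w_v)
print(out.shape)  # (4, 8)
```

Because the pairwise score matrix is computed in one matrix product rather than step by step, the whole sequence can be processed in parallel, which is the source of the training-speed advantage noted below.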

Key Finding: The Transformer achieved state-of-the-art results on machine translation benchmarks while being faster to train than previous architectures. More importantly, the architecture proved to be extraordinarily versatile and scalable, becoming the foundation for GPT, BERT, Claude, and essentially all large language models.

Why It Matters: This may be the most consequential AI paper of the 2010s. The Transformer architecture is the technical foundation of the generative AI revolution. Every large language model you interact with — ChatGPT, Claude, Gemini, Llama — is built on the Transformer or a direct descendant.

Limitations: The paper itself focused on machine translation and did not anticipate the full range of applications that would follow. The scalability of Transformers also means they are computationally expensive, contributing to the environmental concerns discussed in Chapter 18.

Full Citation: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.


Privacy and Surveillance

Clearview AI Investigation (Hill, 2020)

Referenced in: Ch. 6 (§6.4), Ch. 12 (§12.3)

Citation Tier: Tier 1 (Verified — journalistic investigation, The New York Times)

Key Finding: Clearview AI built a facial recognition database by scraping billions of photos from social media platforms without user consent. The tool was marketed to law enforcement agencies across the United States and could match a photograph of a person against this massive database. The investigation revealed that the company had operated largely in secret, that many law enforcement agencies were using the tool without public knowledge, and that the technology raised fundamental questions about privacy, consent, and mass surveillance.

Why It Matters: The Clearview AI case demonstrated that facial recognition technology had advanced to the point where a private company could build a surveillance infrastructure rivaling state capabilities, using publicly available data that individuals had never consented to being used this way.

Full Citation: Hill, K. (2020, January 18). The secretive company that might end privacy as we know it. The New York Times.


AI Safety and Alignment

AI Alignment Research Overview (Ngo, Chan, & Shlegeris, 2024)

Referenced in: Ch. 20 (§20.1, §20.4)

Citation Tier: Tier 2 (Attributed — based on widely discussed research within the AI safety community)

Summary: The alignment problem — ensuring AI systems do what humans actually want — has generated a growing body of research. Key concerns include specification gaming (AI finding unintended loopholes in reward functions), reward hacking, mesa-optimization (AI systems developing internal goals that diverge from their training objectives), and the difficulty of specifying human values in formal terms. Organizations such as Anthropic, OpenAI, DeepMind, the Alignment Research Center, and MIRI have published research on interpretability, RLHF, constitutional AI, and evaluation frameworks. The field remains nascent and contentious, with significant disagreement about the magnitude of long-term risks.
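Specification gaming can be illustrated with a deliberately silly toy example (entirely invented, not drawn from the literature): a proxy reward meant to encourage cleaning pays per unit of dirt collected, so the highest-scoring behavior is to dump dirt back out and re-collect it rather than to finish the job.

```python
# Toy illustration (invented setup) of specification gaming: the proxy
# reward pays per unit of dirt collected, so an optimizer scores higher
# by dumping and re-collecting dirt than by actually cleaning the room.

def proxy_reward(actions):
    """Intended: reward cleaning. Actual: +1 per 'collect' action."""
    return sum(1 for a in actions if a == "collect")

def room_is_clean(actions):
    """True goal: net dirt removed meets the cleaning target."""
    return actions.count("collect") - actions.count("dump") >= 3

honest = ["collect", "collect", "collect"]       # cleans the room
gamer  = ["collect", "dump"] * 5 + ["collect"]   # exploits the loophole

print(proxy_reward(honest), room_is_clean(honest))  # 3 True
print(proxy_reward(gamer),  room_is_clean(gamer))   # 6 False
```

The gap between the reward function and the intended goal, trivial to see in a toy like this, is precisely what becomes hard to detect when the objective is learned and the policy is opaque.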


How to Use This Appendix

Each study summary above is deliberately concise. For deeper engagement:

  1. If you are writing a paper: Use these summaries to identify relevant studies, then read the original sources. The full citations are provided.
  2. If you are evaluating a claim: Check whether the claim cites any of these studies and whether it represents the findings accurately.
  3. If you are building your AI Audit Report: Use these studies as models for the kind of evidence you should look for when analyzing your chosen system.
  4. If you want to go further: The Further Reading sections in each chapter and the bibliography in Appendix I provide additional sources.