In This Chapter
- Opening: The Gorilla Problem
- Learning Objectives
- 8.1 A Taxonomy of Bias Sources
- 8.2 Historical Bias — When Data Reflects Unjust History
- 8.3 Representation Bias — Who Is Missing from the Data
- 8.4 Measurement Bias — How We Measure Creates What We Find
- 8.5 Aggregation Bias — One Model Doesn't Fit All
- 8.6 Evaluation Bias — What You Test For Shapes What You Build
- 8.7 Deployment Bias — When Context Changes Everything
- 8.8 The Problem of Proxy Variables
- 8.9 Large Language Models and the Encoding of Cultural Bias
- 8.10 Organizational Practices for Bias Prevention
- Discussion Questions
Chapter 8: Sources of Bias in Data and Models
Opening: The Gorilla Problem
In June 2015, a software engineer named Jacky Alciné opened Google Photos on his phone and discovered that the app's automatic tagging feature had labeled photos of him and his Black friends as "gorillas." The story spread rapidly. Google apologized within hours, calling it "horrifying" and "appalling." Engineers scrambled to fix the problem.
The fix they deployed was not what most people expected. They did not retrain the model with better data. They did not audit the full system for racial bias. They blocked the word "gorilla" — and later "chimp," "chimpanzee," and "monkey" — from Google Photos results entirely. The underlying model continued to exist, continued to be used. The label was simply suppressed.
Eight years later, in 2023, journalists at The New York Times tested Google Photos again. The underlying problem had not been solved. The company's response to documented racial bias in a consumer product used by billions of people was to delete a word from a dictionary.
The cause of the original error was not malicious code. No engineer had written an instruction to mislabel Black faces. The cause was a training dataset where images of dark-skinned people were dramatically underrepresented, and what images did appear were not sufficiently varied to allow the model to learn accurate distinctions. The model had learned from the data it was given, and the data reflected the demographics of who assembled it, who contributed to it, and what assumptions governed that assembly.
This is the central problem of this chapter: bias in AI systems does not usually originate from malicious intent. It originates from data — from what data was collected, from whom, by whom, using what measurements, for what purposes, and under what historical conditions. It originates from modeling choices: what a system is asked to optimize for, what benchmarks are used to evaluate it, and where it is ultimately deployed. And it persists not because engineers are evil but because organizations lack the practices, processes, and accountability structures to detect and address it.
Chapter 7 introduced bias as a concept and established the fundamental fairness framework. This chapter goes deeper. We examine the specific technical and organizational sources of bias — a taxonomy of failure modes — so that business professionals can recognize them in the systems they build, buy, and deploy. Understanding where bias comes from is the necessary precondition for preventing it.
Learning Objectives
By the end of this chapter, you will be able to:
- Distinguish between data bias and model bias, and explain why both require attention in AI governance.
- Apply the six-category taxonomy (historical, representation, measurement, aggregation, evaluation, and deployment bias) to diagnose bias sources in real AI systems.
- Explain why removing protected attributes (race, gender, age) from training data is insufficient to prevent discriminatory outcomes.
- Identify common proxy variables and explain how they allow discrimination to persist even in facially neutral models.
- Describe the specific mechanisms by which large language models encode and amplify cultural biases from training data.
- Evaluate organizational practices — data audits, datasheets for datasets, labeling protocols — for their effectiveness in reducing bias.
- Ask substantive vendor due diligence questions about training data and evaluation methodology.
- Connect bias prevention practices to broader governance structures, including the auditing frameworks discussed in Chapter 19.
8.1 A Taxonomy of Bias Sources
Data Bias vs. Model Bias
In popular discourse, "AI bias" is treated as a single phenomenon requiring a single fix. In practice, bias enters AI systems through multiple distinct mechanisms, at different stages of the development lifecycle, and for different reasons. A useful starting point is the distinction between data bias and model bias.
Data bias refers to systematic errors or distortions that originate in the data used to train, validate, or test a model. The data does not accurately represent the world as it is, or it accurately represents the world as it has been — which may be a world shaped by historical injustice. Data bias is a property of the training set, independent of the modeling choices that follow.
Model bias refers to systematic errors introduced by the choices made during model development: what the model is asked to optimize for, how it is validated, and how its performance is measured. A model trained on unbiased data can still produce biased outputs if its architecture, loss function, evaluation benchmark, or deployment context introduces systematic disadvantage for particular groups.
In practice, the two interact. Biased data produces biased models. But even models trained on relatively balanced data can produce disparate outcomes depending on how "performance" is defined and measured, and on whether evaluation includes disaggregated performance across demographic groups.
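What "disaggregated performance across demographic groups" means can be shown in a few lines. The records and group labels below are invented for illustration; the point is that an aggregate metric can look acceptable while hiding a large group-level gap:

```python
# Hypothetical predictions for two groups; aggregate accuracy hides the gap.
records = [
    # (group, prediction, actual)
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 0),
    ("A", 1, 1), ("A", 0, 0), ("A", 1, 1), ("A", 0, 0),   # group A: 8/8 correct
    ("B", 1, 0), ("B", 0, 1), ("B", 1, 1), ("B", 0, 0),   # group B: 2/4 correct
]

def accuracy(rows):
    """Fraction of rows where the prediction matches the actual outcome."""
    return sum(pred == actual for _, pred, actual in rows) / len(rows)

overall = accuracy(records)
by_group = {g: accuracy([r for r in records if r[0] == g]) for g in ("A", "B")}

print(f"overall: {overall:.2f}")   # 0.83 -- looks acceptable in aggregate
print(f"by group: {by_group}")     # A: 1.00, B: 0.50 -- a large hidden gap
```

Reporting only the first number is how evaluation bias enters; reporting the second is the minimum a governance review should require.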
The Six-Category Taxonomy
The most influential framework for understanding bias sources across the machine learning lifecycle was developed by Harini Suresh and John Guttag (2019) at MIT. Their paper, "A Framework for Understanding Sources of Harm Throughout the Machine Learning Life Cycle," identifies six distinct categories of bias, each corresponding to a specific stage in the pipeline from world to deployed system. This chapter organizes its analysis around that taxonomy.
1. Historical bias arises when the world itself — the social reality from which data is drawn — contains systematic inequity. Data collected from a biased world faithfully records that world's biases. The problem is not in the data collection process; it is in the historical conditions that shaped the underlying reality.
2. Representation bias arises when the sample of data collected does not accurately reflect the population the model is intended to serve. Certain groups are undersampled or absent, leading to models that perform poorly for those groups.
3. Measurement bias arises when the features chosen to represent a concept measure that concept differently across groups — or when a proxy variable is used that correlates with protected characteristics.
4. Aggregation bias arises when a model treats a heterogeneous population as homogeneous — fitting a single model to data that should be analyzed separately, obscuring differential patterns.
5. Evaluation bias arises when the benchmarks used to evaluate a model are themselves biased or non-representative, causing a model that performs well on the benchmark to perform poorly in deployment for underrepresented groups.
6. Deployment bias arises when a system is used in contexts substantially different from those for which it was designed — different populations, different use cases, different stakes.
Each of these categories is examined in depth in the sections that follow.
Vocabulary Builder
Historical bias: Bias that enters AI systems because training data reflects historically unjust social conditions, not errors in data collection methodology.
Representation bias: Systematic undersampling of certain demographic groups in training data, causing models to perform worse for those groups.
Measurement bias: Bias introduced when features or proxy variables measure a target concept differently across demographic groups.
Aggregation bias: Error introduced by fitting a single model to data from heterogeneous groups with meaningfully different underlying patterns.
Proxy variable: A variable used in a model as a substitute for a characteristic that cannot be directly measured, which correlates with protected attributes and therefore allows discrimination to persist even after protected attributes are removed.
Benchmark: A standardized test dataset and evaluation methodology used to compare model performance; benchmarks can themselves be biased if they overrepresent certain populations or contexts.
8.2 Historical Bias — When Data Reflects Unjust History
The Core Concept
Historical bias is perhaps the most philosophically challenging category in the taxonomy because it does not require any error in data collection. The data is accurate. It faithfully records what happened. The problem is that what happened was itself unjust — shaped by discrimination, exclusion, unequal enforcement of law, differential access to resources, and the accumulated decisions of institutions that did not treat all people equally.
When an AI system learns from historical data, it learns the patterns encoded in that history. It learns that certain people received certain outcomes — loans, jobs, parole, medical treatment — and it learns to predict that those people will continue to receive those outcomes. The model is doing exactly what it was designed to do. The problem is that the patterns it is learning are the residue of discrimination, not the reflection of genuine differences in ability, risk, or need.
The critical insight for business professionals is this: you cannot make a historically biased dataset ethically appropriate by cleaning it carefully or using it correctly. If the underlying social reality from which the data was drawn was unjust, the data will reproduce that injustice. This does not mean historical data should never be used — it means its use requires explicit scrutiny of what history is being encoded and what decisions will be made based on patterns derived from it.
Credit Scores and Lending History
The credit score is one of the most consequential algorithmic systems in American economic life. It determines whether people can borrow money, at what interest rates, and on what terms. And its inputs — payment history, credit utilization, account age, credit mix — are themselves products of a history of discriminatory lending.
For decades, practices including redlining (the systematic denial of mortgages to residents of predominantly Black neighborhoods), discriminatory pricing, and exclusionary terms denied Black, Latino, and immigrant communities access to the credit system. People who were denied accounts cannot have account histories. People who were offered only high-cost credit will have higher utilization ratios. People whose neighborhoods were not served by mainstream banks will have alternative credit histories not captured by standard scoring.
A credit scoring model trained on this history learns that certain patterns — associated with groups that were historically excluded — predict lower creditworthiness. And because the model uses that prediction for future decisions, it perpetuates the exclusion into the future. This is historical bias operating at scale: a technically sophisticated, statistically valid model that encodes and propagates historical discrimination.
Hiring Algorithms and the "Successful Employee" Problem
Amazon's now-discontinued automated resume screening tool, developed internally and used from approximately 2014 to 2017, provides a vivid illustration of historical bias in hiring contexts. The system was trained on resumes submitted to Amazon over a ten-year period — resumes primarily from men, because the technology industry in that period had hired primarily men. The model learned the characteristics of "successful" applicants: the words on their resumes, the schools they attended, the trajectory of their careers.
The model learned that the word "women's" — as in "women's chess club" or "women's university" — was a negative signal. It downgraded graduates of all-women's colleges. It had learned, from historical data, that women were less likely to be hired at Amazon. When asked to predict who would be hired, it predicted accurately — by selecting men.
Amazon shut the system down when the problem was discovered. But the episode illustrates a general pattern: any hiring algorithm trained on historical hiring data at a company with a history of demographic homogeneity will learn to replicate that homogeneity. This is true even when protected characteristics are explicitly removed from the training data, because the correlates of those characteristics remain.
Healthcare and Pain Management
Researchers studying racial disparities in pain management have documented a systematic pattern: Black patients in American healthcare settings receive less pain medication than white patients for comparable conditions, a disparity that cannot be fully explained by clinical factors. This pattern appears across settings — emergency rooms, post-surgical care, cancer treatment — and across conditions.
The reasons are multiple and complex: provider bias, patient-provider communication patterns, structural factors in healthcare access, and — critically — false beliefs about biological differences between racial groups that have persisted in some clinical training materials despite having no scientific basis.
When AI clinical decision support systems are trained on historical treatment data, they learn these patterns. A model trained to predict "appropriate treatment" based on what treatments were historically administered to similar patients will recommend less aggressive pain management for Black patients — not because of malicious intent, but because that is what the historical record shows. The model reproduces the discrimination embedded in the data it was trained on.
The Recidivism Loop
Criminal risk assessment tools — used in bail, sentencing, and parole decisions in many US jurisdictions — provide a case where multiple forms of historical bias compound one another. These tools are trained on data about arrests, convictions, and reoffending. That data reflects decades of differential policing: Black and Latino neighborhoods have historically been more heavily policed, meaning more arrests per criminal incident, more convictions, and more data points indicating "recidivism."
A risk assessment model trained on this data learns that attributes associated with heavily policed communities — neighborhood, criminal record, family criminal history — are predictive of reoffending. This prediction is self-fulfilling: people assessed as high-risk are more likely to be incarcerated, and upon release are more likely to be subject to intensive supervision, and more likely to be arrested for technical violations. The model embeds a historical policing pattern into future criminal justice decisions.
The COMPAS risk assessment tool, analyzed by ProPublica in 2016, showed that Black defendants were nearly twice as likely as white defendants to be falsely flagged as high-risk — a disparity that the tool's developers contested using a different fairness metric. The debate about COMPAS illuminated the mathematical impossibility of satisfying multiple fairness criteria simultaneously when base rates differ between groups — a point introduced in Chapter 7 and relevant here as a consequence of historical bias in the underlying data.
Key Implication: Protected Attributes Cannot Be "Removed"
Perhaps the most important practical implication of historical bias for technical practitioners is that it cannot be addressed by removing protected attributes from the training data. Removing race, gender, or age from a dataset does not remove the information those attributes carry, because that information is distributed across correlated variables. Zip code correlates with race. Job title correlates with gender. Graduation year correlates with age. Purchase history correlates with religion.
As long as these correlated variables remain — and they generally must remain, because they may carry legitimate predictive information — the model retains the capacity to reproduce the discriminatory outcomes the protected attribute would have produced. This point, rigorously formalized in work by Dwork et al. (2012) on "fairness through awareness," is counterintuitive to many business leaders who believe that data "blind" to protected characteristics is automatically fair. It is not.
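The zip-code example can be made concrete with a small simulation. Everything below is synthetic, and the correlation and approval rates are assumptions chosen for illustration: a lending rule that never sees the group attribute still reproduces the historical disparity through the correlated zip code.

```python
import random

random.seed(0)

population = []
for _ in range(10_000):
    group = random.choice(["g1", "g2"])
    # Assumed residential segregation: zip code correlates with group.
    in_zip1 = (random.random() < 0.9) if group == "g1" else (random.random() < 0.1)
    zip_code = "zip1" if in_zip1 else "zip2"
    # Assumed historical lending favored g1 (the biased labels being learned).
    approved = (random.random() < 0.85) if group == "g1" else (random.random() < 0.15)
    population.append((group, zip_code, approved))

def rate(rows):
    """Historical approval rate within a set of records."""
    return sum(1 for r in rows if r[2]) / len(rows)

# A "group-blind" rule: approve applicants from whichever zip code had the
# higher historical approval rate. The rule never sees `group`.
by_zip = {z: rate([r for r in population if r[1] == z]) for z in ("zip1", "zip2")}
favored = max(by_zip, key=by_zip.get)

for g in ("g1", "g2"):
    members = [r for r in population if r[0] == g]
    approved_now = sum(1 for r in members if r[1] == favored) / len(members)
    print(f"{g}: approval rate under the group-blind rule: {approved_now:.2f}")
```

The printed rates diverge sharply even though the model input contained no protected attribute, which is the mechanism Dwork et al. formalized.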
8.3 Representation Bias — Who Is Missing from the Data
Why Training Data Is Not a Random Sample
Machine learning systems learn from samples, not populations. The assumption underlying the statistical validity of this learning is that the sample adequately represents the population to which the model will be applied. In practice, training datasets frequently violate this assumption in ways that are not random — they systematically underrepresent particular groups, often the same groups that face social marginalization.
Representation bias is sometimes described as if it were merely a technical problem of sample size: include more examples and the bias goes away. But the causes of underrepresentation are frequently structural — rooted in who has internet access, who participates in data collection, whose images appear in stock photo libraries, whose medical records are included in research databases — and these structural causes cannot be addressed by simply collecting more data from the same sources.
The ImageNet Problem
ImageNet, the large-scale visual dataset that catalyzed the deep learning revolution beginning in 2012, was assembled through a process that revealed the structural sources of representation bias in practice. Images were collected from internet sources and labeled by workers recruited through Amazon Mechanical Turk. Both the image sources and the labeling workforce were predominantly English-speaking and Western.
Subsequent analysis revealed significant demographic imbalances. Faces appearing in the dataset skewed toward lighter skin tones and toward a narrow age range. Geographical representation was heavily biased toward North America and Europe. Researchers analyzing ImageNet found that certain image categories contained predominantly white subjects while others contained primarily non-white subjects — and the categories were not neutral in their associations.
When face recognition systems trained on ImageNet-derived data were evaluated across demographic groups, the performance disparities were striking. The Gender Shades project (Buolamwini and Gebru, 2018) evaluated commercial face analysis systems and found error rates for darker-skinned women that were up to 34 percentage points higher than for lighter-skinned men — the group most heavily represented in training data.
Geographic and Linguistic Exclusion
Large datasets used to train AI systems reflect the geography of technology adoption and internet access. English-language content constitutes a dramatically disproportionate share of training data for language models. North American and European users generate the majority of behavioral data used to train recommendation systems. Images used to train computer vision systems are drawn predominantly from contexts familiar to technologists in high-income countries.
This geographic skew has direct consequences for performance. Natural language processing systems trained on English text perform worse in other languages and especially poorly in low-resource languages that lack sufficient text corpora. Computer vision systems trained on Western faces perform worse on faces from East Asian, South Asian, or Sub-Saharan African contexts. Healthcare AI systems trained on data from US academic medical centers may not generalize to clinical settings in other countries or to underserved communities within the United States.
The implication for multinational businesses is particularly significant: a system that works well in the company's US headquarters market may perform substantially worse — and discriminate substantially more — when deployed in markets where the training population is underrepresented.
Age, Disability, and the Digitally Marginal
Elderly populations, people with disabilities, and others who are less active in digital environments are systematically underrepresented in the datasets used to train AI systems. Elderly users are less likely to have contributed to internet text corpora, less likely to have had their voices recorded in datasets used to train voice recognition systems, and less likely to appear in image datasets assembled from social media.
The consequences are predictable and documented. Voice recognition systems perform worse for elderly speakers whose voices have age-related characteristics not well-represented in training data. Accessibility AI tools — systems intended specifically to serve people with disabilities — often perform worst for the very populations they are meant to serve, because people with the relevant disabilities were not included in user testing.
This pattern reveals a troubling irony: the populations with the greatest need for AI-powered assistance are frequently the populations for whom AI systems perform least reliably.
Medical Data Gaps
Clinical trials — the gold standard for medical evidence — have historically excluded women, minorities, and older adults from participation. This was particularly pronounced before the National Institutes of Health Revitalization Act of 1993 required the inclusion of women and minorities in NIH-funded research. The exclusions were sometimes justified on clinical grounds (e.g., concerns about fetal exposure to experimental drugs) and sometimes reflected simple convenience sampling.
The legacy of these exclusions is a medical literature, and by extension a set of medical datasets, that does not represent the full population of patients. AI diagnostic systems trained on medical literature or clinical records inherit these representation gaps. A diagnostic model trained predominantly on data from white male patients may not generalize to women, people of color, or older adults — who may present with different symptom patterns, have different baseline biomarker levels, and respond differently to treatments.
The Pulse Oximeter Case
The pulse oximeter case, examined in depth in Case Study 8.1, illustrates representation bias in a medical device context with fatal consequences. Pulse oximeters — devices that clip to a finger and use light to measure blood oxygen saturation — were calibrated in clinical studies that used predominantly light-skinned subjects. The optical measurement technique that underlies pulse oximetry is affected by skin pigmentation, and the calibration curves developed from light-skinned subjects do not accurately generalize to patients with darker skin.
Research published in the New England Journal of Medicine in December 2020 by Sjoding et al. documented that pulse oximeters overestimated oxygen saturation levels in Black patients, with a threefold higher frequency of occult hypoxemia — dangerously low oxygen that the device failed to detect. During the COVID-19 pandemic, when pulse oximeter readings were used to determine whether patients required hospitalization or could be safely discharged, this measurement error had direct life-or-death consequences.
The pulse oximeter case is discussed more fully in Case Study 8.1, but its essential lesson belongs here: representation bias in the data used to calibrate a medical device produced a device that measured one population accurately and another inaccurately — and the group measured inaccurately was the group already facing worse health outcomes.
8.4 Measurement Bias — How We Measure Creates What We Find
The Core Problem
Measurement bias arises when the features chosen to represent a concept measure that concept differently for different groups — or when a variable is used as a proxy for a concept it does not accurately represent for all groups. The problem is not that measurement is absent; it is that measurement is assumed to be neutral when it is not.
Every feature in a machine learning model is a measurement. Every measurement embeds assumptions about what is being measured and how reliably. When those assumptions hold equally across all demographic groups, the measurement is fair. When they do not — when the same feature captures the underlying concept more accurately for some groups than for others — the model built on that feature will perform differently across those groups, typically to the disadvantage of groups for whom the proxy is less accurate.
Standardized Tests as Proxies for Potential
Standardized tests — the SAT, ACT, GRE, GMAT — are used as proxies for academic potential in admissions and hiring decisions. The justification for their use is that they measure cognitive skills that predict academic and professional performance. The critique is that performance on these tests correlates substantially with family income, parental education, access to test preparation resources, and familiarity with the specific cultural context embedded in test questions — which means the tests measure something considerably broader than the cognitive skills they claim to assess.
When AI admissions or hiring systems use standardized test scores as features, they incorporate this measurement bias. The model learns that high test scores predict success — which may be accurate, but the prediction may operate partly through socioeconomic background and educational advantage rather than through raw cognitive potential. For populations with less access to preparation resources, the test score is a noisier proxy for potential, and models trained on this proxy will systematically underestimate the potential of students from lower-income backgrounds.
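The "noisier proxy" effect can be seen in a toy simulation. All parameters here are invented: both groups have identical underlying potential, but the test score carries more noise for group B, and selecting on the score then passes over more of group B's genuinely strong candidates.

```python
import random

random.seed(1)

NOISE = {"A": 1.0, "B": 5.0}   # assumed score noise per group (arbitrary units)

applicants = []
for _ in range(40_000):
    group = random.choice("AB")
    potential = random.gauss(100, 10)                  # same true distribution
    score = potential + random.gauss(0, NOISE[group])  # noisier proxy for B
    applicants.append((group, potential, score))

# Admit the top 10% of applicants by observed score.
cutoff = sorted(a[2] for a in applicants)[int(0.9 * len(applicants))]

# Among genuinely high-potential applicants (true potential in roughly the
# top decile), what share does the score-based cutoff actually admit?
for g in "AB":
    strong = [a for a in applicants if a[0] == g and a[1] >= 113]
    admitted = sum(1 for a in strong if a[2] >= cutoff) / len(strong)
    print(f"group {g}: share of high-potential applicants admitted: {admitted:.2f}")
```

Group B's admitted share comes out visibly lower despite identical underlying talent: noise in the proxy, not any difference in potential, drives the disparity.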
Arrests as a Proxy for Crime
Criminal risk assessment tools that use arrest history as a feature face a fundamental measurement problem: arrests are a measure of police activity, not of criminal behavior. Areas that are more heavily policed generate more arrests per incident of criminal behavior. Individuals who live in heavily policed areas, or who belong to groups subject to more intensive law enforcement scrutiny, will accumulate more arrests independent of their actual criminal behavior.
When arrest history is used as a feature in a recidivism risk model, the model is not simply measuring criminal history. It is measuring a combination of criminal behavior and policing intensity. For groups subject to more intensive policing, the feature measures more heavily toward policing patterns; for groups subject to less intensive policing, it measures more heavily toward actual criminal behavior. The same nominal feature is measuring different underlying realities for different groups.
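A small simulation with invented rates makes the point concrete: two neighborhoods with the same underlying offense rate generate very different arrest rates once detection probability differs, so a feature built from arrests encodes policing intensity.

```python
import random

random.seed(2)

OFFENSE_RATE = 0.10   # assumed identical true rate in both neighborhoods
DETECTION = {"heavily_policed": 0.60, "lightly_policed": 0.15}  # assumed

def simulated_arrest_rate(neighborhood, n=100_000):
    """Arrests per resident when only detected offenses become arrests."""
    arrests = 0
    for _ in range(n):
        offended = random.random() < OFFENSE_RATE
        if offended and random.random() < DETECTION[neighborhood]:
            arrests += 1
    return arrests / n

rates = {hood: simulated_arrest_rate(hood) for hood in DETECTION}
print(rates)   # roughly 0.06 vs. 0.015
# A model using arrest history would score the heavily policed neighborhood
# about 4x "riskier" despite identical underlying behavior.
```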
Engagement Metrics as Proxies for Quality
Digital platforms — social media, search engines, streaming services — use engagement metrics (clicks, shares, likes, view duration) as proxies for content quality or relevance. Recommendation systems trained on engagement data assume that what people engage with is what is good for them, or at least what is most relevant.
But engagement patterns are shaped by pre-existing inequalities in the media landscape. Content produced by and for groups with less media representation has fewer existing consumers to generate engagement signals. Content that is shocking, outrageous, or emotionally activating generates disproportionate engagement independent of its quality or accuracy. Recommendation systems that optimize for engagement will systematically undervalue content from underrepresented communities and systematically amplify content that exploits psychological vulnerabilities.
The Sentiment Analysis Problem
Natural language processing systems trained to detect sentiment — positive or negative tone — in text perform substantially worse on African American Vernacular English (AAVE) than on Standard American English. Research by Blodgett et al. (2017) documented that social media posts written in AAVE were significantly more likely to be misclassified by sentiment analysis tools, with negative sentiment over-detected and positive sentiment under-detected.
This measurement bias has practical consequences: content moderation systems that use sentiment analysis may over-flag AAVE content. Customer service tools that assess customer satisfaction may systematically misread the emotional tone of responses from Black customers. Employment screening tools that analyze written communication may score AAVE-speaking applicants lower on "communication quality" measures that are actually proxies for conformity to Standard American English norms.
Business Implication
Every feature in a machine learning model encodes an assumption about what it measures. Those assumptions should be documented, scrutinized, and audited. For each feature in a model, practitioners should be able to answer: What real-world concept is this feature intended to represent? How accurately does it represent that concept across different demographic groups? Are there groups for whom this feature is a noisier or more biased proxy? If the answer to the last question is "we don't know," that is itself a red flag requiring investigation.
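These audit questions can be operationalized as lightweight documentation attached to every feature. The schema below is a hypothetical sketch, not an established standard; the field names and example entries are invented.

```python
from dataclasses import dataclass, field

@dataclass
class FeatureAudit:
    """One record per model feature, answering the audit questions above."""
    name: str
    intended_concept: str                 # real-world concept represented
    group_reliability_notes: str          # groups for whom it is a noisier proxy
    correlated_protected_attributes: list = field(default_factory=list)
    evidence: str = "unknown"             # study or audit backing the notes

def review(audits):
    """Flag any feature whose cross-group reliability is undocumented."""
    return [a.name for a in audits if a.evidence == "unknown"]

audits = [
    FeatureAudit(
        name="test_score",
        intended_concept="academic preparedness",
        group_reliability_notes="noisier for applicants without prep access",
        correlated_protected_attributes=["family income"],
        evidence="internal disparity audit",
    ),
    FeatureAudit(name="zip_code", intended_concept="residential stability",
                 group_reliability_notes=""),
]

print(review(audits))   # ['zip_code'] -- "we don't know" is itself a red flag
```

The value of the exercise is less the data structure than the forcing function: a feature with no documented evidence of cross-group reliability cannot silently pass review.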
8.5 Aggregation Bias — One Model Doesn't Fit All
The Assumption of Homogeneity
Aggregation bias arises when a model treats a heterogeneous population as if it were homogeneous — when a single model is fitted to data that contains meaningfully different underlying patterns for different subgroups. The model learns the average relationship across all groups, which may accurately describe no individual group, and may perform poorly for groups whose patterns diverge substantially from the average.
The assumption of homogeneity is often invisible. Practitioners who build a single model on a full dataset may not recognize that subgroups within that dataset have different statistical relationships between features and outcomes. When performance is evaluated only on the aggregate, the model may appear to perform well. When performance is disaggregated by group, the same model may perform excellently for some groups and poorly for others.
Medical Example: HbA1c and Diabetes Management
Glycated hemoglobin (HbA1c) is the primary biomarker used to monitor diabetes management. The standard threshold for diagnosing diabetes is an HbA1c level of 6.5%. This threshold was established based on research that showed an association between this level and elevated risk of retinopathy in the studied population.
However, research has documented that HbA1c distributions differ across ethnic groups. Black patients have higher HbA1c levels on average than white patients with equivalent blood glucose control, due to differences in red blood cell turnover rates that affect the measurement. A single diagnostic threshold applied uniformly across groups will systematically misclassify some patients — producing higher false positive rates for some groups and higher false negative rates for others.
A clinical AI system that uses a single HbA1c threshold for all patients inherits this aggregation bias. It appears to be making the same decision for everyone, applying the same rule uniformly. But because the same threshold has different sensitivity and specificity for different groups, it is effectively applying different diagnostic criteria to different patients.
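The mechanism can be sketched numerically. The numbers below are invented for illustration and are not clinical values: a fixed measurement offset between groups turns one nominal threshold into different effective error rates.

```python
import random

random.seed(3)

THRESHOLD = 6.5
OFFSET = {"group_x": 0.0, "group_y": -0.3}   # assumed measurement shift
NOISE_SD = 0.2                               # assumed measurement noise

def error_rates(group, n=100_000):
    """False negative and false positive rates under the single threshold."""
    fn = fp = diseased = healthy = 0
    for _ in range(n):
        true_level = random.gauss(6.5, 0.5)       # underlying quantity
        measured = true_level + OFFSET[group] + random.gauss(0, NOISE_SD)
        has_condition = true_level >= THRESHOLD
        flagged = measured >= THRESHOLD
        diseased += has_condition
        healthy += not has_condition
        fn += has_condition and not flagged
        fp += (not has_condition) and flagged
    return fn / diseased, fp / healthy

results = {g: error_rates(g) for g in OFFSET}
for g, (fnr, fpr) in results.items():
    print(f"{g}: false negatives {fnr:.2f}, false positives {fpr:.2f}")
# The same nominal rule misses several times more true cases in group_y.
```

The uniform threshold looks like equal treatment; the disaggregated error rates show it is not.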
Language and Sentiment Analysis
Sentiment analysis, encountered in Section 8.4 as a measurement problem, also illustrates aggregation bias. A single sentiment analysis model trained on mixed-population text will learn an average relationship between language features and sentiment. But different communities use language differently: different idioms, different conventions for expressing emotion, different relationships between word choice and intended meaning.
A model that aggregates across these linguistic communities learns an average that may not accurately represent any of them. For communities whose linguistic patterns diverge most from the dominant patterns in the training data, the aggregated model will perform worst. This is aggregation bias: the model's performance is uneven not because of anything wrong with the data collection methodology (representation bias) but because a single model was used where multiple models — or models with explicit group-level components — would have been more appropriate.
The Technical Fix and Its Costs
The technical response to aggregation bias is to use separate models for different groups, or to include interaction terms in a single model that allow the relationship between features and outcomes to vary across groups. Both approaches require additional data (enough data from each subgroup to train and validate a separate model or to estimate interaction effects), additional validation work (each model must be separately validated), and additional monitoring (each model must be separately monitored in deployment).
These costs are not trivial. Organizations that do not have sufficient data on minority subgroups to train separate models face a genuine dilemma: they can use an aggregated model that performs poorly for the minority group, or they can attempt to augment their data — which raises its own questions about data quality and representation. There is no simple technical solution; addressing aggregation bias requires organizational investment.
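A toy regression makes the pooled-versus-per-group tradeoff concrete. The data below is synthetic and deliberately extreme: two subgroups whose feature-outcome relationships point in opposite directions, fit with a simple intercept-free least-squares slope.

```python
def fit_slope(xs, ys):
    # Closed-form least squares for y = slope * x (no intercept)
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)

def mse(slope, xs, ys):
    # Mean squared error of the fitted slope on a dataset
    return sum((slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic subgroups with genuinely different feature-outcome relationships
xs_a, ys_a = [1, 2, 3], [2, 4, 6]      # group A: y = 2x
xs_b, ys_b = [1, 2, 3], [-1, -2, -3]   # group B: y = -x

pooled = fit_slope(xs_a + xs_b, ys_a + ys_b)   # 0.5 — an average of the two
print(mse(pooled, xs_a, ys_a), mse(pooled, xs_b, ys_b))    # 10.5 10.5
print(mse(fit_slope(xs_a, ys_a), xs_a, ys_a))              # 0.0 per-group fit
print(mse(fit_slope(xs_b, ys_b), xs_b, ys_b))              # 0.0
```

The pooled slope (0.5) represents neither group; separate fits are exact for both. The cost, as the text notes, is that each per-group model needs its own data, validation, and monitoring.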
8.6 Evaluation Bias — What You Test For Shapes What You Build
Benchmarks and Their Limitations
Machine learning systems are developed and refined through iterative evaluation against benchmark datasets — standardized test sets used to measure performance. The accuracy, reliability, and fairness of a model are all assessed relative to how well it performs on these benchmarks. This makes the choice of benchmark one of the most consequential decisions in the model development process — and one that receives insufficient attention in discussions of AI ethics.
If a benchmark is biased — if it overrepresents certain populations, if it defines "correct" answers in ways that reflect one group's norms, if it evaluates performance only in contexts favorable to majority users — then optimizing for high performance on that benchmark will produce a model that performs well for majority groups and poorly for minority groups. The model passes its evaluation, but the evaluation itself was not a reliable test of what matters.
The ImageNet Evaluation Problem
The same ImageNet dataset that exemplifies representation bias also exemplifies evaluation bias. As noted in Section 8.3, ImageNet was assembled in ways that produced demographic skews. The same demographic skews that make it a poor training dataset also make it a poor evaluation dataset: a model that achieves 95% accuracy on ImageNet may still perform very poorly on images of darker-skinned people, elderly people, or people and scenes from non-Western geographical contexts, because those images are underrepresented in both the training data and the evaluation benchmark.
The history of the face recognition field illustrates this vividly. For years, face recognition systems were evaluated using benchmarks like LFW (Labeled Faces in the Wild) that were heavily skewed toward white male faces. Performance on these benchmarks was reported as overall accuracy; performance on darker-skinned or female subjects was not separately reported. When the Gender Shades project (Buolamwini and Gebru, 2018) evaluated commercial face analysis systems using a dataset specifically constructed to include equal representation across gender and skin tone categories, it revealed performance disparities that the standard benchmarks had entirely concealed.
NLP Benchmarks and Western-Centric Evaluation
Natural language processing models are commonly evaluated using benchmark datasets such as GLUE (General Language Understanding Evaluation) and SuperGLUE. These benchmarks consist of tasks derived primarily from English-language text, and the performance definitions embedded in them reflect assumptions grounded in mainstream American English usage.
Models that score at or near human performance on GLUE may perform substantially worse on non-standard dialects of English, on text from speakers for whom English is a second language, or on tasks requiring cultural knowledge not represented in the mainstream English corpus on which both the models and the benchmarks were built. Benchmark scores do not generalize to real-world performance on these populations.
The Aggregate Accuracy Problem
A model may achieve 99% accuracy on the majority group and 75% accuracy on a minority group that constitutes 20% of the population, yet report an overall accuracy of roughly 94% (0.8 × 99% + 0.2 × 75% = 94.2%). The weighted average looks good. The disaggregated performance reveals a different story. Optimizing for aggregate accuracy creates powerful incentives to improve performance on majority groups — where most of the data, and therefore most of the gradient signal, resides — while neglecting minority groups.
This is not an edge case. It is the default behavior of standard machine learning optimization when training data is imbalanced and evaluation is aggregate. Without explicit requirements for disaggregated evaluation — and without organizational structures that enforce those requirements — the default trajectory of model development is toward better performance for majority groups and worse performance for minority groups.
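A quick computation shows how illustrative subgroup accuracies of 99% and 75%, with a 20% minority share, blend into a headline number that conceals a 24-point gap:

```python
# Illustrative subgroup accuracies and population shares
majority_share, minority_share = 0.8, 0.2
acc_majority, acc_minority = 0.99, 0.75

aggregate = majority_share * acc_majority + minority_share * acc_minority
print(round(aggregate, 3))   # 0.942 — the headline metric hides the disparity
```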
What Good Evaluation Looks Like
Good evaluation practice requires:
- Disaggregated metrics: Report performance separately for demographic groups, not just in aggregate.
- Representative test sets: Ensure evaluation data includes sufficient samples from all groups the model will serve.
- Real-world testing environments: Evaluate models in conditions that match expected deployment contexts, including edge cases and adversarial conditions.
- Diverse evaluators: Include people from affected communities in evaluation teams; they will identify failure modes that homogeneous teams miss.
- Beyond accuracy: Evaluate not just accuracy but also calibration, fairness metrics across groups, and error analysis to understand what kinds of errors the model makes for different populations.
8.7 Deployment Bias — When Context Changes Everything
The Deployment Gap
A system can be designed well, trained on representative data, and evaluated rigorously — and still cause harm when deployed. Deployment bias arises when the context in which a system is used differs substantially from the context for which it was designed. This mismatch can take several forms: the population of users differs from the training population, the use case has expanded beyond the original design, or the decision-making environment has changed in ways that alter the stakes of errors.
Deployment bias is a particularly insidious failure mode because it often develops gradually, through scope creep, organizational change, or market evolution, rather than appearing at a single identifiable moment. By the time the mismatch between designed use and actual use becomes apparent, the system may be deeply embedded in organizational processes.
Scope Creep: When Tools Get Repurposed
Hiring assessment tools have been repurposed for promotion decisions. Credit scoring systems built for consumer lending have been applied to insurance underwriting. Criminal risk assessment tools designed for bail decisions have been used in parole determinations. In each case, a tool developed and validated for one specific decision context is deployed in a related but distinct context, with different populations, different decision consequences, and different fairness implications.
The Amazon hiring tool case provides a concrete illustration. The tool was designed to screen resumes for software engineering roles — a specific context with a specific population of applicants and a specific set of criteria for "success." Whether or not it could have been made to work fairly in that context (it was ultimately shut down), deploying it for other role types, or in non-technical roles where the attributes of "successful" employees differed substantially, would have amplified its problems.
Population Shift: Training on One Group, Deploying on Another
Population shift occurs when a model trained on one population is deployed on a different population without revalidation. This problem was documented acutely during the COVID-19 pandemic. AI diagnostic tools trained on data from patients with other respiratory conditions were rapidly adapted for COVID-19 screening. The training population had one clinical profile; the deployment population — patients with a novel disease whose presentation differed substantially — had another.
Numerous studies found that COVID-19 AI diagnostic tools developed early in the pandemic had significantly inflated performance estimates because they were evaluated on datasets with spurious confounding features (for example, patient positioning in images that correlated with COVID status in early hospital datasets). When deployed on genuinely new patients, performance degraded substantially.
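The shortcut-learning failure described above can be reproduced in miniature. The records below are invented: in the hypothetical training hospitals, severely ill (COVID-positive) patients were imaged lying down, so positioning perfectly tracks the label; in deployment that correlation is broken.

```python
# Toy records: (supine_position, has_covid). Positioning tracks the label
# in training data drawn from the original hospitals...
train = [(1, 1)] * 40 + [(0, 0)] * 40
# ...but is unrelated to disease in the deployment population.
deploy = [(1, 1)] * 10 + [(0, 1)] * 10 + [(1, 0)] * 10 + [(0, 0)] * 10

def accuracy(data):
    # The shortcut "classifier": predict COVID iff the patient is supine
    return sum(supine == label for supine, label in data) / len(data)

print(accuracy(train))    # 1.0 — looks perfect on data from the same sites
print(accuracy(deploy))   # 0.5 — no better than chance on new patients
```

Any evaluation restricted to data from the original context would have certified this model as flawless.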
Healthcare provides the clearest examples of population shift bias, but the problem is not unique to medicine. Any model trained on historical data and deployed in a changed environment — a new geography, a new customer segment, a new economic context — faces population shift. The longer the time between training and deployment, and the more dynamic the underlying environment, the greater the risk.
The Dual-Use Problem
A more intentional form of deployment bias occurs when systems designed for one purpose are deliberately repurposed for discriminatory ends. A customer sentiment analysis tool designed to measure satisfaction with a retail experience is repurposed by an employer to evaluate employee "attitude." A language model fine-tuned to generate marketing copy is deployed to generate misleading content. A face recognition system designed for voluntary identity verification is deployed for covert surveillance.
The dual-use problem is not unique to AI — most powerful tools can be misused — but AI systems are particularly susceptible because their capabilities are general and because the technical work of deployment is often much simpler than the technical work of initial design. An organization that has acquired a capable AI system faces low friction in deploying it for new purposes, while the ethics review needed to determine whether a new deployment is appropriate happens only where deliberate organizational structures require it.
Monitoring Requirements
The appropriate organizational response to deployment bias is ongoing post-deployment monitoring. Performance metrics must be tracked across demographic groups continuously, not measured once at initial deployment. Use cases must be reviewed regularly to ensure they remain within the scope for which the system was designed and validated. Significant changes in the deployment context — new user populations, new decision stakes, new organizational applications — should trigger re-evaluation.
Building these monitoring structures requires deliberate organizational commitment. The monitoring function must have the authority to flag problems and escalate them to decision-makers who can authorize remediation or discontinuation. Without that authority, monitoring becomes performative — a box-checking exercise that does not change outcomes.
8.8 The Problem of Proxy Variables
What Proxy Variables Are
A proxy variable is a variable used in a model as a substitute for a concept that cannot be directly measured, or as a predictor that is believed to correlate with the target outcome. Proxy variables are legitimate and necessary in many modeling contexts — direct measurement of what we care about is often impossible, so we use observable correlates.
The problem arises when proxy variables correlate with protected characteristics. In a society characterized by residential segregation, occupational segregation, educational inequality, and differential access to economic opportunity, almost every socioeconomic variable will correlate with race, gender, national origin, or other protected characteristics. Using these variables in models — even models that never explicitly include protected attributes — allows discrimination to persist and be amplified.
Why Removing Protected Attributes Is Not Enough
The naive approach to preventing discriminatory AI is to remove protected attributes from the training data: take out race, gender, age, national origin, and the model cannot discriminate on those bases. This approach — sometimes called "fairness through blindness" — is seductive in its simplicity and fails in practice for a well-understood reason: the information carried by protected attributes is distributed across correlated variables.
Dwork et al. (2012) formalized this problem in their paper on fairness through awareness. The critique is simple: if you want to treat similar people similarly, you need to be aware of what dimensions of similarity matter for fairness — which requires knowing what groups people belong to, not ignoring that information. Fairness through blindness makes it impossible to detect discrimination (because you cannot measure group-level outcomes without knowing group membership) while doing nothing to prevent it (because correlated proxies remain in the model).
Common Proxies in Practice
The following proxy relationships are well-documented and should be treated as red flags in any feature audit:
Zip code and race: Due to historical residential segregation, zip code is a powerful predictor of racial composition in the United States. A model that uses zip code as a feature may effectively discriminate by race without ever seeing racial data. This is particularly concerning in financial services, insurance, and healthcare, where zip code is a commonly used feature.
College name and socioeconomic status/race: The college an applicant attended correlates strongly with family income, which correlates with race, and directly with institutional demographic composition. Using college name or selectivity as a feature in hiring or admissions models embeds these correlations.
Job title history and gender: Occupational segregation — the concentration of women in certain roles and men in others — means that job title history is a proxy for gender. A model trained on job title history to predict management potential will encode the gender patterns embedded in historical occupational structures.
Surname and national origin/ethnicity: Surnames correlate strongly with ethnic and national origin background. Models that use name-based features — including NLP models that process resumes or social media profiles — may effectively discriminate by national origin.
Shopping behavior and religion: Consumer purchasing patterns around religious holidays — purchases of Ramadan decorations, Passover foods, Christmas gifts — are predictors of religious affiliation. Models that use purchasing behavior may inadvertently discriminate on the basis of religion.
Word embeddings and gender stereotypes: Bolukbasi et al. (2016) demonstrated that word embeddings trained on Google News text encoded gender stereotypes, with words like "programmer" and "doctor" associated with the male vector and words like "nurse" and "receptionist" associated with the female vector. These embeddings, when used as features in downstream models, propagate the encoded stereotypes into those models' predictions.
The Proxy Whack-a-Mole Problem
Removing a single identified proxy does not solve the discrimination problem if correlated proxies remain. If zip code is identified as a proxy for race and removed from a credit model, but the model retains income, employer, and neighborhood crime statistics — all of which correlate with race through residential segregation patterns — the model's ability to discriminate by race is largely preserved.
This "proxy whack-a-mole" dynamic means that addressing proxy discrimination requires comprehensive feature audits, not targeted removal of identified problem features. Every feature in the model must be evaluated for its correlation with protected characteristics, and the model's overall predictions must be tested for disparate impact across demographic groups, even after individual proxies have been removed.
Practical Guidance: Feature Auditing
A feature audit for proxy variables should:
- Calculate correlation coefficients between each model feature and each protected characteristic in the training population.
- Identify features with correlation above a specified threshold (which will depend on context and risk tolerance).
- For features identified as potential proxies, conduct counterfactual analysis: does the model's prediction change if only that feature is altered while others remain constant?
- Test the full model for disparate impact across protected groups using relevant fairness metrics.
- Document the findings, including features that were retained despite proxy correlations and the justification for retention.
This audit is not a one-time activity. As data distributions shift and as new features are added to models, proxy correlations can change. Ongoing auditing is required.
8.9 Large Language Models and the Encoding of Cultural Bias
How LLMs Learn from Text
Large language models (LLMs) are trained on vast corpora of text — billions or trillions of words drawn from websites, books, code repositories, social media, and other sources. The training process teaches the model to predict what text comes next given what has come before, which requires learning enormous amounts of information about language, concepts, facts, relationships, and cultural associations.
The cultural associations encoded in the training data are not filtered before training. The model learns whatever patterns are present in the text, including patterns reflecting historical prejudice, contemporary stereotype, and the systematic overrepresentation of certain voices and perspectives in the written record. The result is a model that knows a great deal about the world — and that has absorbed the biases of the texts from which it learned.
Word Embeddings and the Reproduction of Stereotype
A foundational result in the study of bias in language models was published by Caliskan, Bryson, and Narayanan in 2017. Their paper, "Semantics derived automatically from language corpora contain human-like biases," demonstrated that word embeddings trained on large text corpora reproduce documented human biases. Using an extension of the Implicit Association Test (IAT) — a psychological measure of implicit bias — they showed that word embeddings associated pleasant concepts with white names and unpleasant concepts with Black names, associated female names with family and arts concepts and male names with career and science concepts, and reproduced other documented patterns of human implicit association.
This result is significant not as a curiosity but as a practical concern. Word embeddings are used as features in downstream models — in hiring tools that process resumes, in content moderation systems that evaluate posts, in customer service tools that classify inquiries. The biases encoded in the embeddings propagate into the predictions of those downstream models. A resume screening model that uses word embeddings will inherit the embedding's gendered associations between occupations and gender.
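The association measure behind these findings can be illustrated with a WEAT-style score: a word's mean cosine similarity to one attribute set minus its mean similarity to another. The 2-D vectors below are invented for illustration, not taken from any real embedding model; real tests use high-dimensional embeddings and permutation-based significance testing.

```python
def cosine(u, v):
    # Cosine similarity between two vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: sum(a * a for a in x) ** 0.5
    return dot / (norm(u) * norm(v))

def association(word, group_a, group_b, emb):
    # WEAT-style score: mean similarity to attribute set A minus set B
    sim = lambda group: sum(cosine(emb[word], emb[g]) for g in group) / len(group)
    return sim(group_a) - sim(group_b)

# Toy 2-D embedding in which occupation words lean toward one gender axis
emb = {
    "he": [1.0, 0.1], "him": [0.9, 0.0],
    "she": [0.1, 1.0], "her": [0.0, 0.9],
    "programmer": [0.8, 0.3],
    "nurse": [0.2, 0.9],
}
male, female = ["he", "him"], ["she", "her"]
print(round(association("programmer", male, female, emb), 2))  # positive: male-leaning
print(round(association("nurse", male, female, emb), 2))       # negative: female-leaning
```

A downstream model consuming such vectors inherits these directional associations whether or not gender appears as an explicit feature.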
Anti-Muslim Bias in GPT-3
Research by Abid, Farooqi, and Zou (2021) documented a striking and persistent pattern of anti-Muslim bias in GPT-3. When the researchers prompted GPT-3 with sentence completions involving the word "Muslim" — for example, "Two Muslims walked into a..." — the model generated violent associations at dramatically higher rates than for comparable prompts involving other religious groups. "Two Christians walked into a..." generated church-related or everyday social completions. "Two Muslims walked into a..." generated terrorism-related content at high rates, even with simple variations designed to elicit different responses.
The researchers tested various prompting strategies and found the pattern robust. The association between Muslims and violence was so strongly encoded in GPT-3's training data — reflecting the prevalence of terrorism-related coverage of Muslims in English-language media — that it persisted across many attempts to elicit different responses. This is not a marginal edge case; it is a systematic encoding of a cultural stereotype with real consequences for any application that generates text about religious groups.
This case is examined in depth in Case Study 8.2.
Toxicity in Language Models
Gehman et al. (2020) introduced the RealToxicityPrompts dataset to study the propensity of language models to generate toxic content — hateful, threatening, or discriminatory text — when given ordinary prompts. They found that large language models trained on web data generated toxic content at substantial rates, even in response to prompts that were not themselves toxic.
The toxicity in these outputs reflects toxicity in the training data. Internet text corpora contain substantial volumes of hateful content, and models trained on this data learn to generate similar content. The toxicity is not uniformly distributed; certain groups — ethnic minorities, women, LGBTQ+ individuals — are disproportionately targeted in the toxic outputs, reflecting the demographics of online harassment.
Stereotype Propagation in Occupation and Nationality
Studies of large language models consistently find that they associate occupations with genders in ways that mirror historical occupational segregation, nationality with attributes in ways that reflect national stereotypes, and social roles with demographic characteristics in ways that encode existing inequalities.
When asked to generate stories, LLMs are more likely to use male pronouns for doctors, engineers, and executives and female pronouns for nurses, teachers, and assistants. When asked to describe people of different nationalities, LLMs generate descriptions that encode national stereotypes — industrious Germans, chaotic Italians, inscrutable Asians — with varying degrees of offensiveness depending on the group. These patterns are not arbitrary; they reflect the content of the training data.
What RLHF Does and Does Not Fix
Reinforcement Learning from Human Feedback (RLHF) is a training technique used to align large language models with human preferences, including preferences for avoiding bias and harmful content. Human raters evaluate model outputs and provide feedback; the model is then fine-tuned to produce outputs rated more highly by human evaluators.
RLHF has substantially reduced the frequency with which contemporary models like GPT-4 and Claude produce overtly hateful or stereotyped content in response to ordinary prompts. But it has not eliminated the underlying problem. The biases encoded in the base model's weights are not removed by RLHF — they are suppressed in contexts where the RLHF training provides a strong signal. In novel contexts, or in response to adversarial prompting, the underlying biases can resurface.
RLHF also introduces its own biases. The human raters who provide feedback reflect the demographic profile of the rater pool — which may itself be unrepresentative. Cultural norms about what constitutes offensive content vary; a rater pool that overrepresents one cultural context will build in the norms of that context. And the "alignment tax" — the observation that fine-tuning for safety sometimes degrades performance on legitimate tasks — creates pressure to minimize alignment interventions.
8.10 Organizational Practices for Bias Prevention
From Diagnosis to Prevention
The preceding sections have established what bias is and where it comes from. This section addresses what organizations can do about it. The answer is not purely technical; effective bias prevention requires organizational practices, institutional structures, and cultural commitments that support technical work. Without these supports, technical interventions are incomplete at best and performative at worst.
The Data Audit
A data audit is a systematic review of training data to identify potential sources of bias before model training begins — or to investigate suspected bias in an existing model's training data. A thorough data audit addresses:
- Composition: Who is represented in the data? What are the demographic distributions? Are there groups that are absent or underrepresented?
- Historical context: Under what conditions was this data generated? Does it reflect historical discrimination or inequity?
- Measurement validity: How were features measured? Are the measurements equally valid across demographic groups?
- Labeling quality: Who labeled the data? What instructions did they receive? How much agreement was there between different labelers?
- Temporal scope: When was the data collected? Does it reflect current conditions or conditions that no longer obtain?
- Geographic scope: What populations are included? What populations are excluded?
The data audit should be conducted by a team that includes both technical practitioners (who can assess statistical properties) and domain experts (who can assess whether the historical context of the data raises fairness concerns). Without domain expertise, purely statistical analyses will miss contextually important sources of bias.
Datasheets for Datasets
One of the most practical tools for improving data transparency is the "Datasheets for Datasets" framework proposed by Gebru et al. (2018). Modeled on the datasheets that accompany electronic components — which document specifications, operating conditions, and limitations — a dataset datasheet provides standardized documentation of a dataset's:
- Motivation: Why was the dataset created? Who created it and for what purpose?
- Composition: What does it contain? How many instances? What demographic groups are represented?
- Collection process: How was the data collected? Who collected it? What ethical review was conducted?
- Preprocessing: What cleaning or transformation was applied?
- Uses: What is the dataset appropriate for? What uses should be avoided?
- Distribution: How is it distributed? What terms govern its use?
- Maintenance: Who maintains it? How is it updated?
Requiring datasheets for all datasets used in model training — including datasets purchased from vendors or downloaded from public repositories — creates documentation that supports bias auditing and enables more informed decisions about which datasets are appropriate for which applications.
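A datasheet can be made machine-checkable so that missing sections are caught automatically. The stub below is a minimal sketch: its field names follow the categories listed above, and the placeholder prompts and `missing_fields` helper are inventions for illustration, not part of the published framework.

```python
# Minimal machine-readable datasheet template; field names follow the
# Gebru et al. (2018) categories, prompts are illustrative placeholders.
datasheet_template = {
    "motivation": "Why was the dataset created? By whom, and for what purpose?",
    "composition": "Instance counts, demographic coverage, known gaps.",
    "collection_process": "Sampling strategy, consent, ethical review conducted.",
    "preprocessing": "Cleaning, filtering, and transformations applied.",
    "uses": "Tasks the data suits; uses that should be avoided.",
    "distribution": "Access mechanism and licensing terms.",
    "maintenance": "Owner, update cadence, error-reporting channel.",
}

def missing_fields(doc, template=datasheet_template):
    # Flag datasheet sections a vendor or team left unanswered
    return sorted(set(template) - set(doc))

print(missing_fields({"motivation": "resume screening research"}))
# lists every section except 'motivation'
```

A check like this can gate dataset intake: no training run proceeds until `missing_fields` returns an empty list for the dataset's documentation.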
Data Collection Protocols
Organizations that collect their own training data can significantly reduce representation bias by implementing deliberate collection protocols:
- Stratified sampling: Design collection processes to ensure adequate representation of all demographic groups, rather than relying on convenience samples.
- Community partnerships: For data that requires participation from underrepresented communities, develop partnerships with community organizations that can facilitate trusted data collection.
- Compensation: Pay data contributors fairly, recognizing that data has economic value and that exploitative data collection practices raise ethical concerns.
- Transparency: Inform data contributors about how their data will be used; obtain meaningful informed consent.
Labeling and Annotation Quality
Human labeling — the process of assigning categories, scores, or other metadata to data examples — is a critical point of bias introduction in supervised learning. Labels that reflect labeler demographics, cultural assumptions, or labeling instructions that embed bias will produce training data that reproduces those biases.
Best practices for labeling include:
- Worker diversity: Ensure the pool of labelers includes people from relevant demographic groups, particularly those who are the subjects of or likely to be affected by the model's decisions.
- Clear, tested instructions: Pilot-test labeling instructions to ensure they are interpreted consistently and are not ambiguous in ways that produce differential labeling across demographic groups.
- Inter-annotator agreement: Measure the rate of agreement between different labelers on the same items; low agreement is a signal of ambiguous or contested labels.
- Audit labeler demographics: Track how labels vary across labeler demographics to identify systematic differences in how different groups interpret the labeling task.
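Inter-annotator agreement is commonly quantified with Cohen's kappa, which corrects raw agreement for the agreement two raters would reach by chance. The sketch below handles the binary-label case; the two rater vectors are invented example data.

```python
def cohens_kappa(a, b):
    # Chance-corrected agreement between two annotators on binary labels
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n    # observed agreement
    pa1, pb1 = sum(a) / n, sum(b) / n             # each rater's positive rate
    pe = pa1 * pb1 + (1 - pa1) * (1 - pb1)        # expected chance agreement
    return (po - pe) / (1 - pe)

# Hypothetical labels from two annotators on the same five items
rater_1 = [1, 1, 0, 0, 1]
rater_2 = [1, 0, 0, 0, 1]
print(round(cohens_kappa(rater_1, rater_2), 2))   # 0.62
```

Values near 1 indicate strong agreement; values near 0 indicate agreement no better than chance, a signal that the labeling task is ambiguous or contested. Multi-class data and more than two raters call for generalizations such as Fleiss' kappa.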
The Role of Domain Experts
Bias detection in AI systems requires substantive knowledge of the domains in which bias occurs. A statistician can identify that a model produces disparate outcomes across demographic groups; a domain expert is needed to explain why, to identify which features are likely proxies for protected characteristics, and to assess whether historical context makes a dataset inappropriate for a given use.
Organizations building AI systems in high-stakes domains — healthcare, criminal justice, financial services, employment — should include domain experts in the model development process, not merely as consultants at the end of the pipeline but as integral team members throughout.
Vendor Due Diligence
Organizations that purchase AI systems from vendors — which today includes the majority of enterprises deploying AI — cannot transfer their ethical obligations to the vendor. They remain responsible for the outcomes of systems they deploy, regardless of who built those systems.
Due diligence for AI vendors should include substantive questions about:
- What training data was used? Is a datasheet or equivalent documentation available?
- What demographic groups are represented in the training data? What groups are underrepresented?
- How was the system evaluated? Were disaggregated performance metrics computed across demographic groups?
- What fairness metrics were used in evaluation? What were the results?
- Has the system been tested in contexts similar to our intended deployment?
- What monitoring capabilities does the vendor provide for post-deployment performance tracking?
A vendor who cannot answer these questions credibly should be treated as a vendor who cannot provide assurance that their system will not produce discriminatory outcomes.
Looking Forward
The organizational practices discussed in this section — data audits, datasheets, labeling protocols, vendor due diligence — are foundational for bias prevention. But they are insufficient without institutional accountability structures: governance bodies with the authority to enforce standards, audit functions with the resources to conduct meaningful reviews, and escalation paths that allow identified problems to be addressed rather than buried.
These accountability structures are the subject of Chapter 19, which examines AI auditing frameworks in depth. The diagnostic framework developed in this chapter — understanding where bias comes from — is the necessary foundation for those auditing practices. You cannot audit for what you cannot describe; the taxonomy of bias sources provides the conceptual vocabulary for systematic review.
Discussion Questions
- The Google Photos "gorilla" incident was addressed by blocking words rather than fixing the underlying model. What does this response reveal about organizational incentives in AI ethics? What would a more substantive response have required, and why might that response have been harder to implement?
- An insurance company wants to use a zip-code-based neighborhood variable as a feature in its auto insurance pricing model, arguing that neighborhood characteristics are legitimate predictors of accident risk. How would you evaluate this argument from an ethics perspective? What questions would you ask about the underlying data? How does historical residential segregation affect your analysis?
- A healthcare organization is evaluating a clinical decision support AI that achieves 94% overall accuracy in predicting patient deterioration. The vendor's evaluation report does not include performance broken down by demographic group. What questions should the organization ask before deploying the system? What information would they need to make an informed decision?
- RLHF has substantially reduced the frequency of overtly offensive output from large language models, but critics argue it does not address the underlying biases in model weights. Evaluate this claim. In what contexts might the underlying biases surface despite RLHF? What would more comprehensive bias mitigation require?
- A bank's historical lending data reflects decades of discriminatory lending practices. The bank wants to use this data to build a new credit scoring model. What are the ethics of using historically biased data for future lending decisions? What alternatives exist, and what tradeoffs do they involve?
- A technology company's AI bias task force consists entirely of data scientists and engineers. What expertise is missing, and how might its absence affect the quality of the bias analysis? Who should be included in bias review processes, and how should their input be incorporated?
- Consider the "proxy whack-a-mole" problem. A company removes zip code from a model after identifying it as a proxy for race, but retains income, employer, and educational institution. Has the company meaningfully reduced the risk of discriminatory outcomes? What further steps would be required? At what point does removing proxies become an inappropriate constraint on legitimate model features?
Next: Case Study 8.1 — The Pulse Oximeter Problem
Next chapter: Chapter 9: Fairness Metrics and Their Mathematical Foundations