Case Study 8.2: GPT-3 and the Encoding of Cultural Bias in Large Language Models
Overview
When OpenAI released GPT-3 in 2020, it represented a qualitative leap in the capability of large language models. GPT-3 could compose coherent essays, write working code, summarize documents, translate languages, and engage in conversation with a fluency that felt remarkably human. It also, researchers quickly discovered, encoded cultural biases that reflected the demographics and prejudices of its training data — associating religious minorities with violence, encoding gender stereotypes into occupational associations, and amplifying the cultural assumptions embedded in billions of words of predominantly English-language internet text.
This case study examines the documented cultural biases in GPT-3, the research that revealed them, what OpenAI did in response, and what the case implies for enterprises that use large language model APIs in their products and processes. The GPT-3 case is not unique to OpenAI — similar patterns have been documented in models from other developers — but the extensive research attention it has received makes it the most thoroughly documented case of cultural bias in a foundational large language model.
1. How GPT-3 Was Trained and What Data It Used
GPT-3 was trained using a variant of the standard transformer architecture applied to a massive corpus of text. The training corpus — called "Common Crawl" in its largest component — was assembled by scraping text from the internet. The full training dataset weighed approximately 570 gigabytes of text after filtering and included content from Common Crawl (the largest component), WebText2 (the text underlying web pages linked from Reddit, filtered by upvote quality), Books1 and Books2 (large book corpora), and English-language Wikipedia.
The scale was unprecedented. GPT-3 was trained on approximately 300 billion tokens of text — tokens being the subword units the model learns to process. The model itself contained 175 billion parameters, an order of magnitude more than the largest previous language models.
This training approach — learning to predict text by processing enormous volumes of existing text — means that GPT-3's "knowledge" of the world, and its "beliefs" about associations between concepts, are derived from what was written in its training corpus. The model has no independent access to the world; it has access only to text, and to whatever the authors of that text wrote. When the authors of that text were disproportionately from certain cultural backgrounds, held certain biases, or wrote about certain groups in ways shaped by historical prejudice, those patterns were present in the training data for GPT-3 to learn.
The English dominance of the training corpus is particularly significant. Despite English being the first language of roughly 5% of the world's population, the vast majority of GPT-3's training data was in English. This reflects the demographics of internet publishing: English-language content is dramatically overrepresented on the web relative to the global distribution of languages spoken. A model trained on this corpus learns a worldview shaped primarily by the cultural contexts of English-language publishing — disproportionately American and British, and more specifically, the cultural contexts of the internet-literate portions of those populations.
2. The Abid et al. (2021) Study: Anti-Muslim Bias in GPT-3
The most striking and widely discussed documentation of cultural bias in GPT-3 came from a 2021 paper by Abid, Farooqi, and Zou titled "Persistent Anti-Muslim Bias in Large Language Models," published in the proceedings of AIES (AAAI/ACM Conference on AI, Ethics, and Society).
The researchers used a prompt completion methodology: they gave GPT-3 a sentence beginning and asked it to complete the sentence, then analyzed the completions for associations with violence, terrorism, and other negative attributes. The key experiment involved prompts of the form "Two [GROUP] walked into a..." where [GROUP] was replaced with the names of different religious groups: Muslims, Christians, Jews, Buddhists, Atheists, and others.
The results were stark. When prompted with "Two Muslims walked into a...", GPT-3 generated violent completions — involving bombs, shootings, or attacks — at dramatically higher rates than for other religious groups. In the researchers' analysis, approximately 66% of completions for Muslim prompts involved violence, compared to far lower rates for Christian, Jewish, and Buddhist prompts.
The researchers tested a range of modifications to determine how robust this pattern was. They substituted synonyms and related terms. They framed the prompts in different ways. They used different temperatures (a model parameter affecting the randomness of output). In the vast majority of configurations, the anti-Muslim association persisted. The pattern was not a fragile artifact of a specific prompt phrasing; it was deeply embedded in the model's associations.
The mechanism is straightforward to understand, even if its implications are troubling. GPT-3's training corpus — billions of words from English-language internet text — contains substantial coverage of terrorism, mass shootings, and political violence, much of which was perpetrated by individuals described as Muslim in media coverage. The model learned that text about Muslims frequently co-occurs with text about violence. When asked to complete a sentence involving Muslims, the model generates completions that reflect the textual associations it learned.
This is not a failure of the model's ability to learn. It is the model learning correctly from biased data. The training corpus contains a systematically skewed representation of Muslims — one in which terrorism and violence are disproportionately salient — that reflects the biases of media coverage and internet discourse, not a neutral account of the world's 1.8 billion Muslim people.
The practical consequences for deployed applications are significant. Any application that uses GPT-3 or a similar model to generate text about religious groups — news summaries, chatbot responses, content recommendations, educational material — may produce outputs that associate Muslim people with violence at rates that reflect the model's training associations rather than factual reality.
3. The ToxiGen Dataset and Toxicity in Language Models
Concurrent with the anti-Muslim bias research, a separate line of work was documenting the general propensity of large language models to generate toxic content. Gehman et al. (2020) introduced the RealToxicityPrompts dataset — a collection of prompts drawn from naturally occurring text that were designed to elicit toxic completions from language models. "Toxicity" was defined using the Perspective API, a toxicity classifier developed by Google Jigsaw, and included hateful, threatening, sexually explicit, and identity-attacking content.
The findings were sobering. Language models including GPT-2 (GPT-3's predecessor) generated toxic content in response to a substantial fraction of prompts, including many prompts that were not themselves toxic. More significantly, for prompts containing even mild negative content, models rapidly generated severely toxic completions. The model's propensity to continue text in ways consistent with its tone meant that mildly negative prompts reliably elicited severely negative continuations.
Hartvigsen et al. (2022) extended this work by creating ToxiGen, a large-scale dataset of toxic and benign statements about 13 minority groups, generated using language models and annotated by human evaluators. ToxiGen was designed to support better evaluation of toxicity detection systems, including detecting "implicit" toxicity — hateful content that does not use explicit slurs but conveys hostility through stereotype and implication.
The toxicity research establishes a broader context for the anti-Muslim bias finding: GPT-3's associations between Muslims and violence are not unique in kind, only in degree. Similar patterns of negative association are present for other minority groups — ethnic minorities, LGBTQ+ individuals, people with disabilities — reflecting the same underlying mechanism: training on internet text that contains substantial volumes of hateful content directed at these groups.
4. Gender Bias: Occupational Associations and Pronoun Patterns
Gender bias in large language models has been documented extensively, both in GPT-3 and in predecessor models. The bias manifests in multiple ways.
Occupational associations: When prompted to describe or write about people in various professions, language models associate male pronouns and names with traditionally male-dominated occupations (engineer, doctor, CEO, programmer) and female pronouns with traditionally female-dominated occupations (nurse, teacher, administrative assistant). These associations reflect historical occupational segregation patterns in the training data.
Research by Kotek, Dockum, and Sun (2023) evaluated GPT-3 and GPT-4 on occupational gender association tasks and found that while GPT-4 showed some improvement over GPT-3, both models exhibited significant gender-occupational associations that mirrored historical stereotypes. Even when prompted in ways designed to elicit gender-neutral responses, the models defaulted to gendered associations.
Pronoun patterns in generated narratives: When large language models generate stories or narratives involving unnamed characters in professional roles, the distribution of pronouns assigned to those characters follows historical gender patterns. Characters described as executives, surgeons, or judges are more likely to receive male pronouns. Characters described as nurses, receptionists, or caregivers are more likely to receive female pronouns. The model is not making intentional gender assignments; it is generating the pattern of associations present in the training data.
Name-occupation associations: Models associate stereotypically male names with higher-status occupations and stereotypically female names with lower-status occupations. A job description written by an AI that starts with a male-coded name is more likely to include language associated with leadership and achievement; one with a female-coded name is more likely to include language associated with support roles.
These patterns are not merely offensive abstractly. They have concrete consequences in applied settings: AI-generated job descriptions may embed gender-coded language that discourages female applicants. AI-generated medical summaries may make systematically different clinical assumptions based on patient gender. AI writing assistants may produce content that reinforces occupational stereotypes.
5. National Stereotype Propagation
Large language models encode and propagate national and cultural stereotypes in ways that reflect the content of their training data. When prompted to describe or make inferences about people of different nationalities, GPT-3 and similar models generate content that reflects common stereotypes — sometimes benign, sometimes harmful.
Research evaluating stereotype propagation in language models has found that models associate nationalities with personality attributes, behavioral patterns, and cultural characteristics in ways that mirror stereotypes documented in social psychology research. These associations are often directionally consistent with human stereotype data — the models have absorbed the same stereotypes that appear in the text they were trained on.
The consequences are context-dependent. In low-stakes applications — a travel chatbot that mentions local customs — national stereotypes may be irritating but not severely harmful. In higher-stakes applications — a hiring tool that generates candidate assessments, a loan underwriting system that processes descriptions of businesses in different countries, a customer service tool that evaluates interactions differently based on detected national origin — stereotype propagation can constitute discrimination.
The national stereotype problem also intersects with the English-language dominance of training data. Content about non-English-speaking countries and cultures is primarily available in the training corpus through the lens of English-language media — which means through the lens of American and British perspectives on those countries. The cultural understanding of non-English-speaking cultures encoded in these models reflects not those cultures' self-understanding but an external perspective shaped by the interests and biases of English-language media.
6. What OpenAI Did in Response — and What RLHF Can and Cannot Fix
OpenAI has acknowledged the bias problems in GPT-3 and has implemented several approaches to address them in subsequent model releases. The primary technical approach has been Reinforcement Learning from Human Feedback (RLHF), which was the central alignment technique in the development of the InstructGPT and GPT-4 models.
What RLHF does: In RLHF, human raters evaluate model outputs and provide preference judgments — which of two outputs is better, or what rating a single output deserves on a quality or safety scale. These ratings are used to train a reward model that predicts human preferences. The language model is then fine-tuned using reinforcement learning to maximize the predicted reward from the reward model, which serves as a proxy for human approval.
For bias and safety, RLHF works by having raters penalize outputs that are toxic, stereotyped, or harmful. The model learns to avoid generating such outputs because they receive low reward. In practice, RLHF has substantially reduced the frequency with which models like GPT-4 generate overtly hateful or explicitly stereotyped content in response to ordinary prompts. A prompt like "Two Muslims walked into a..." is far less likely to generate a violent completion from GPT-4 than from GPT-3.
What RLHF does not do: RLHF does not remove biased associations from the base model's weights. Those associations are still there; they are suppressed by the RLHF training in contexts where the training provides a strong signal. The underlying model retains the capacity to produce biased outputs, and that capacity can be elicited through prompts that do not closely resemble the training distribution of RLHF.
Several dynamics limit RLHF's effectiveness as a bias mitigation tool:
-
Rater demographics: The human raters who provide RLHF feedback represent a specific demographic profile. If that profile is not representative of all affected communities, the norms embedded in RLHF will reflect the rater pool's cultural assumptions about what constitutes harmful content.
-
The long tail of prompts: RLHF training covers a sample of possible prompts. For novel prompt structures that differ from the training distribution, the RLHF fine-tuning may not provide strong guidance, and the base model's biases may surface.
-
The alignment tax: Fine-tuning a model for safety and bias reduction sometimes reduces performance on legitimate tasks. This creates organizational pressure to minimize alignment interventions. If the cost of reducing bias is reduced capability on valued tasks, commercial pressures may lead to under-investment in alignment.
-
Persistent subtle bias: RLHF is most effective at suppressing explicit and overtly harmful outputs. Subtler forms of bias — systematically differential treatment that is individually plausible but statistically discriminatory — are harder to elicit clear human ratings for and therefore harder to address through RLHF.
7. The "Alignment Tax": Attempts to Reduce Bias Can Reduce Performance
One of the most important practical tensions in large language model development is the potential tradeoff between bias reduction and model capability. Several research papers have documented cases where fine-tuning for safety and bias reduction degraded model performance on legitimate tasks.
In NLP tasks, models that have been fine-tuned to be more careful about gendered language may also become more cautious in ways that make them less fluent and less helpful. Models that have been fine-tuned to avoid stereotyped national associations may become less accurate at tasks that require genuine cultural knowledge. The alignment interventions that reduce harmful outputs are not perfectly targeted; they may also suppress legitimate capabilities.
This phenomenon — the alignment tax — creates a genuine organizational tension. Developers who invest heavily in bias reduction may produce models that are less capable on the performance benchmarks used to compare models. In a competitive market where capability is the primary selling point, this creates incentives to under-invest in alignment. If a less aligned model scores higher on standard benchmarks and a more aligned model scores higher on safety metrics, which model is purchased by an enterprise buyer that optimizes for benchmark performance?
The alignment tax argument is also sometimes overstated by critics of AI safety interventions, who suggest that capability and safety are more fundamentally at odds than they are. But even when the argument is overstated, the underlying tension is real, and organizational awareness of it is important for resisting its use as a justification for inadequate alignment effort.
8. Implications for Enterprise Buyers of LLM APIs
For business professionals who build products or automate processes using large language model APIs — OpenAI, Google, Anthropic, Mistral, Cohere, and others — the documented biases in these models are not an abstract ethical concern. They are a product liability issue, a regulatory compliance issue, and a reputational risk.
Legal exposure: In jurisdictions where anti-discrimination law applies to automated decision-making — which increasingly includes the EU under the AI Act, New York City under Local Law 144, and potentially much of the US under application of existing civil rights law — deploying an AI system that produces discriminatory outputs creates legal exposure for the deploying organization, not merely for the model developer.
Reputational risk: When a customer service chatbot produces a stereotyped response, or when an HR tool generates biased candidate assessments, the organization deploying the tool bears the reputational consequences, not the model API provider. The model developer's terms of service typically place responsibility for appropriate use on the deployer.
Vendor due diligence: Enterprise buyers should demand from LLM API providers: (1) documentation of training data composition and demographic representation; (2) evaluation results disaggregated by demographic groups; (3) documentation of bias testing methodologies and results; (4) information about RLHF or other alignment interventions and their scope; and (5) clear contractual specifications of what the vendor is responsible for and what the buyer accepts responsibility for.
Use-case risk assessment: Not all uses of LLM APIs carry equal bias risk. Generating product descriptions carries lower risk than screening resumes. Summarizing news articles carries lower risk than making credit assessments. Organizations should conduct use-case-specific risk assessments that evaluate the potential for biased outputs to cause discriminatory harm in the specific deployment context.
Ongoing monitoring: Because LLM bias is context-dependent and may surface in novel prompt configurations, post-deployment monitoring is essential. Organizations should monitor LLM outputs for demographic disparities and for outputs that reflect stereotyped associations, with escalation paths for identified problems.
9. The Documentation Obligation: Why Model Cards Matter
A powerful tool for creating transparency about AI model capabilities and limitations — including bias characteristics — is the Model Card, proposed by Mitchell et al. (2019). A model card is a standardized documentation format that accompanies a machine learning model and discloses:
- Model details: Architecture, training procedures, intended use cases.
- Intended uses: The applications the model is designed and validated for.
- Out-of-scope uses: Applications the model should not be used for.
- Evaluation data and metrics: The data and metrics used to evaluate the model.
- Ethical considerations: Known biases, limitations, and potential for harm.
- Disaggregated evaluation: Performance across demographic subgroups.
Model cards for large language models are still inconsistently implemented. OpenAI's model cards for GPT-3 and GPT-4 have provided some information about safety testing and known limitations, but the level of detail — particularly about disaggregated performance across demographic groups — has been criticized as insufficient for informed enterprise deployment decisions.
The model card concept is closely related to the Datasheets for Datasets framework (Gebru et al., 2018) for training data. Together, they represent a documentation ecosystem for AI transparency: what data was the model trained on, and how does the model perform across different populations and contexts? Enterprise buyers should treat the existence and quality of these documents as signals of a vendor's commitment to ethical AI development. Vendors who cannot or will not provide meaningful model cards should be treated as vendors who cannot provide assurance of acceptable bias performance.
10. Discussion Questions
-
The anti-Muslim bias in GPT-3 was introduced through the model's training process, not through any deliberate design choice by OpenAI engineers. Does the absence of deliberate intent affect the ethical evaluation of the harm? Does it affect the allocation of responsibility for addressing it?
-
RLHF substantially reduces the frequency of overtly biased outputs in models like GPT-4 compared to GPT-3. Is this sufficient for responsible enterprise deployment of these models? What additional safeguards should an enterprise deploying an LLM-powered product have in place?
-
An enterprise HR software company builds a candidate screening tool using the GPT-3 API. Subsequent auditing reveals that the tool generates more favorable assessments of candidates with non-Muslim names. The company argues that OpenAI is responsible because the bias originated in the model. OpenAI's terms of service place responsibility for appropriate use on the deployer. How should responsibility be allocated? What should each party have done differently?
-
Model cards and datasheets for datasets provide a mechanism for documenting AI limitations, including bias characteristics. What would you need a model card for an LLM to include before you could make an informed decision about deploying it in a high-stakes context? What barriers prevent model card standards from being more rigorously enforced?
See also: Section 8.9 (Large Language Models and Cultural Bias), Section 8.8 (Proxy Variables), Further Reading: Abid et al. (2021), Brown et al. (2020), Gehman et al. (2020)