Chapter 25: Bias in AI Systems

"Every dataset is a historical document. It tells you what happened, not what should have happened. If you train a model on history, you train it to repeat history."

— Professor Diane Okonkwo


The Emergency Meeting

Ravi Mehta does not call emergency meetings.

In two years of guest lectures, Athena Retail Group site visits, and after-class conversations, the MBA 7620 students have come to know Ravi as the calm center of any room he occupies. He is the person who, when a production model crashes at 2:00 a.m., sends a Slack message that reads, "Let's diagnose before we react." He is deliberate. He is measured. He does not use the word "emergency."

So when Ravi walks into Professor Okonkwo's Tuesday lecture fifteen minutes early, asks to address the class, and begins with the sentence, "We have a problem," every student in the room sits up.

"Six weeks ago," Ravi says, "Athena's HR department deployed an AutoML tool to screen resumes for store manager positions. Some of you may remember from Chapter 22 that I mentioned our low-code AI initiatives were expanding faster than our governance processes. This is what I was worried about."

He clicks to a slide. Two bar charts fill the screen.

"Our internal audit team ran a routine review last week. They found two patterns." He points to the first chart. "Candidates under thirty-five are 2.3 times more likely to be advanced by the model than candidates over thirty-five." He points to the second chart. "Candidates with four-year university degrees are 1.8 times more likely to be advanced than candidates without them — regardless of their years of retail management experience."

The room is quiet.

"The model isn't being ageist on purpose," Ravi says. "It doesn't have purposes. It's replicating the patterns in our historical hiring data — five years of decisions made by human managers who, it turns out, disproportionately favored younger candidates and candidates from traditional educational backgrounds. The model didn't create the bias. It inherited it. And then it amplified it: seventy-eight percent of the model's recommended candidates were under thirty-five, compared to sixty-two percent in the historical data."

NK Adeyemi has stopped typing. Her hands are flat on the desk. She is staring at the slide with an expression Tom Kowalski has never seen from her before — not skepticism, not analytical detachment, but something closer to recognition.

"How long was it running?" NK asks.

"Six weeks. Approximately 340 resumes were screened. We don't know how many qualified candidates were filtered out. That's part of what we're trying to determine."

"So for six weeks," NK says, "a machine was quietly sorting people into 'worth a conversation' and 'not worth a conversation,' and the people it was rejecting were disproportionately older and from non-traditional backgrounds. And nobody noticed."

"Nobody noticed," Ravi confirms. "Until the audit."

Professor Okonkwo steps to the front. "Class, I want you to sit with this for a moment before we discuss it. Three hundred and forty resumes. Real people. Real careers. And an algorithm that was, by any statistical measure, discriminating against them — not because anyone programmed it to, but because it learned from data that reflected our existing biases." She pauses. "This is not a hypothetical. This is not a case study from someone else's company. This happened. It is still happening, in organizations around the world, right now."

She turns to the class. "Today's lecture was going to be an introduction to bias in AI systems. It still is. But now we have a live case to work with."


What Is AI Bias?

Let us begin with a definition that is both precise and honest.

Definition: AI bias is a systematic and repeatable error in an AI system that creates unfair outcomes — disproportionately favoring or disadvantaging particular groups of people based on characteristics such as race, gender, age, socioeconomic status, disability, or other attributes that should not influence the decision.

Note what this definition does not say. It does not say bias requires intent. It does not say bias is the result of malice. It does not say bias is rare, unusual, or the product of incompetent engineering. Bias in AI systems is common, often invisible, and can emerge from perfectly reasonable technical decisions made by perfectly well-intentioned people.

"Bias isn't a bug," NK says, leaning forward. "It's a feature of systems designed by and for a narrow slice of humanity."

Tom looks up from his notebook. "That's a strong statement."

"It's a factual one," NK replies. "If every decision in the pipeline — who collects the data, what they measure, how they label it, what outcomes they optimize for — is made by people who share the same background, the same assumptions, the same blind spots, the output will reflect those blind spots. You don't need a conspiracy. You just need homogeneity."

Professor Okonkwo nods. "NK is describing what researchers call the homogeneity problem in AI development. A 2022 study by the AI Now Institute found that the AI workforce is approximately 80 percent male and over 70 percent white in the United States. When the people building AI systems do not represent the populations those systems affect, bias is not an accident — it is a predictable outcome."

Bias vs. Variance: Clearing Up the Terminology

Students with a machine learning background may notice a collision of terminology. In statistics and machine learning, bias has a precise technical meaning related to model underfitting — the difference between a model's average prediction and the true value (recall the bias-variance tradeoff from Chapter 11). In this chapter, we use bias in its broader, societal sense: systematic unfairness in outcomes.

The two meanings are not unrelated. A model that is biased in the statistical sense — that systematically underfits for certain subpopulations — can produce biased outcomes in the societal sense. A credit scoring model that has high statistical bias for applicants from rural ZIP codes (because those applicants were underrepresented in training data) will systematically underestimate their creditworthiness, producing socially biased lending decisions.

But the societal meaning is broader. A model can have low statistical bias overall — excellent average accuracy — while still producing deeply unfair outcomes for specific groups. This is one of the central challenges of this chapter: aggregate accuracy can mask group-level injustice.

Business Insight: When a vendor tells you their AI model has "97% accuracy," ask immediately: 97% accuracy for whom? A facial recognition system with 99% accuracy for light-skinned men and 65% accuracy for dark-skinned women has an "average" accuracy that looks impressive — and a real-world performance that is discriminatory.


Sources of Bias in AI Systems

Bias does not appear at a single point in the AI pipeline. It can enter at any stage — from the initial framing of the problem to the final deployment of the model — and each source of bias compounds the others. Suresh and Guttag (2021) identified a taxonomy of bias sources that has become standard in the field. We adapt it here with business-relevant examples.

1. Historical Bias

Historical bias exists when the data faithfully reflects a world that was itself unfair. The data is not "wrong" in a technical sense — it accurately represents what happened. But what happened was shaped by discrimination, exclusion, and unequal opportunity.

Athena's HR screening model is a textbook case. The training data — five years of hiring decisions — accurately captured what human managers did. But those human managers, influenced by their own unconscious biases, had disproportionately hired younger candidates. The model learned this pattern and, because machine learning models optimize for patterns that predict historical outcomes, it replicated the bias. The model was technically correct: younger candidates were more likely to have been hired in the past. But "more likely to have been hired in the past" is not the same as "more qualified."

Research Note: Mehrabi et al. (2021) documented historical bias in word embeddings — the numerical representations of words that underpin NLP models. In Word2Vec embeddings trained on Google News data, the vector relationship "man is to computer programmer as woman is to homemaker" emerged directly from patterns in the text. The algorithm did not create the stereotype. It measured it. But once embedded in an NLP system, the stereotype becomes operational — influencing autocomplete suggestions, resume parsers, and translation systems.

2. Representation Bias

Representation bias occurs when the training data does not adequately represent the population the model will serve. Certain groups are overrepresented; others are underrepresented or entirely absent.

Consider a dermatology AI trained to detect skin cancer. If 90 percent of the training images show light-skinned patients — because the medical literature and image databases historically skewed toward lighter skin tones — the model will perform well on light skin and poorly on dark skin. This is not a hypothetical: a 2021 study in The Lancet Digital Health found that only 4.5 percent of images in dermatology AI training datasets depicted dark-skinned patients, despite the fact that melanoma in darker-skinned patients is diagnosed at later stages and carries higher mortality.

Business Insight: Representation bias is not limited to demographic characteristics. A recommendation engine trained primarily on data from urban customers will underperform for rural customers. A fraud detection model trained on transactions from one country may generate excessive false positives when deployed in another. Any time a model is applied to a population that differs from its training population, representation bias is a risk.

3. Measurement Bias

Measurement bias arises when the features or labels used in the model systematically differ across groups — not because the groups are truly different, but because of how the data was collected.

A notorious example: pulse oximeters, the devices that clip onto a finger to measure blood oxygen levels. Multiple studies, including a landmark 2020 paper in the New England Journal of Medicine by Sjoding et al., found that pulse oximeters systematically overestimate oxygen saturation in patients with darker skin pigmentation. During the COVID-19 pandemic, this measurement bias had life-or-death consequences: Black patients with dangerously low oxygen levels were less likely to be flagged for supplemental oxygen because their readings appeared normal.

If this biased measurement data were then used to train an AI system for patient triage, the AI would inherit the measurement bias — and institutionalize it.

4. Aggregation Bias

Aggregation bias occurs when a single model is used for groups that actually have different underlying relationships between features and outcomes. The model captures an "average" pattern that may not apply to any specific subgroup.

In diabetes management, for example, HbA1c levels (a standard measure of blood sugar control) have different clinical thresholds for different ethnic groups due to biological variation in hemoglobin levels. A diabetes management AI that uses a single threshold — the one calibrated for the majority population in the training data — will systematically over- or under-diagnose patients from minority populations.

Caution

Aggregation bias is particularly insidious because the model's overall metrics can look excellent. A model with 95% accuracy that performs at 98% for one group and 82% for another will report strong aggregate performance while systematically failing the second group. This is why disaggregated evaluation — measuring performance separately for each subgroup — is essential. We introduced this concept with confusion matrices in Chapter 11; here, we extend it to fairness.
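A disaggregated evaluation takes only a few lines of code. The helper and the toy data below are purely illustrative, not drawn from the Athena audit; the point is that a single aggregate number hides exactly the gap the Caution describes.

```python
from collections import defaultdict

def disaggregated_accuracy(y_true, y_pred, groups):
    """Return overall accuracy and accuracy computed separately per subgroup."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    per_group = {g: correct[g] / total[g] for g in total}
    overall = sum(correct.values()) / sum(total.values())
    return overall, per_group

# Toy labels: the model serves group "A" perfectly and group "B" poorly.
y_true = ["hire"] * 20
y_pred = ["hire"] * 10 + ["hire"] * 5 + ["reject"] * 5
groups = ["A"] * 10 + ["B"] * 10

overall, per_group = disaggregated_accuracy(y_true, y_pred, groups)
print(overall)           # 0.75 in aggregate looks tolerable...
print(per_group["B"])    # ...but group B sits at 0.5
```

In practice the same disaggregation should be applied to every confusion-matrix metric from Chapter 11 (precision, recall, false positive rate), not just accuracy.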

5. Evaluation Bias

Evaluation bias occurs when the benchmarks and metrics used to evaluate a model do not adequately represent the diversity of the deployment population. A model can appear to perform well because the evaluation data shares the same biases as the training data.

The ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which drove much of the progress in computer vision between 2010 and 2017, used images that overrepresented North American and Western European contexts. A model that achieved "state of the art" accuracy on ImageNet might fail on images from other parts of the world — because its evaluation never tested those contexts.

6. Deployment Bias

Deployment bias occurs when a model is used in ways that differ from its original design. Even a well-built, carefully evaluated model can produce biased outcomes if it is deployed in a context its designers did not anticipate.

Athena's HR screening model was built with AutoML — a legitimate tool, discussed in Chapter 22, that automates model selection and hyperparameter tuning. But the HR analyst who deployed it intended it as a "preliminary filter" to reduce the workload of human reviewers. In practice, overworked HR staff treated the model's recommendations as final decisions, rarely overriding its rejections. The model was designed to assist; it was deployed to decide.

Athena Update: Ravi's internal investigation reveals that only 12 percent of candidates rejected by the model received a secondary human review. The model was never designed to be a sole decision-maker — but in practice, that is exactly what it became. This pattern, sometimes called automation bias, is one of the most common deployment failures in AI systems: humans defer to algorithmic outputs even when they have the authority and information to override them.


Bias in Training Data

Training data is the most frequently cited source of AI bias — and for good reason. A machine learning model can only learn patterns that exist in its data. If the data is skewed, the model will be skewed. But "skewed data" is a broad category. Let us be more specific about how training data introduces bias.

Skewed Samples

A skewed sample does not represent the true distribution of the population. This can happen for many reasons: convenience sampling (collecting data from whoever is easiest to reach), survivorship bias (training on successful outcomes while excluding failures), or simple historical exclusion.

Consider Athena's case. The training data consisted of five years of hiring decisions. But those five years reflected a specific period in the company's history — a period during which the company was expanding rapidly, hiring managers skewed younger, and the corporate culture implicitly valued "energy" and "cultural fit" in ways that correlated with youth. The data was not a neutral record of candidate quality. It was a record of a particular era's preferences.

Label Bias

Label bias occurs when the outcomes used to train the model are themselves tainted by human bias. In hiring, the label might be "was the candidate successful?" — measured by retention, performance reviews, or promotion. But if performance reviews are themselves subject to bias (and extensive research confirms they are), then the label the model is learning to predict is not "quality" but "quality as perceived by biased evaluators."

Research Note: Greenwald, Banaji, and Nosek's work on the Implicit Association Test (IAT), while debated in its predictive validity, helped establish the broad finding that unconscious biases — about race, gender, age, weight, and disability — influence evaluative judgments, including hiring and performance evaluation, even among people who explicitly endorse egalitarian values.

Missing Populations

Some populations are not underrepresented in the data — they are absent entirely. Indigenous communities in many countries are systematically excluded from the datasets that train government AI systems, because data collection infrastructure does not reach them. People without bank accounts are invisible to credit scoring models. People who stopped seeking medical care due to cost or distrust are absent from clinical datasets.

When a population is missing from training data, the model has no information about them. At best, it extrapolates from the nearest available population — which may not be representative. At worst, it assigns them a default or majority-class prediction that systematically misclassifies them.

Proxy Variables

A proxy variable is a feature that the model uses as a stand-in for a characteristic it is not supposed to consider. Even when sensitive attributes like race or gender are removed from the input data, the model can reconstruct those attributes from correlated features.

ZIP code is the most commonly cited proxy variable. In the United States, due to the legacy of residential segregation and redlining, ZIP code is strongly correlated with race and ethnicity. A credit scoring model that uses ZIP code as a feature is, in effect, using race as a feature — even if race was explicitly excluded from the model's input variables.

Caution

Removing sensitive attributes from the model (sometimes called "fairness through unawareness") is almost never sufficient to eliminate bias. Proxy variables — ZIP code, name frequency, school attended, web browser used, typing speed — can reconstruct demographic information with surprising accuracy. A 2019 study by Obermeyer et al. in Science demonstrated that a widely used healthcare algorithm that never used race as an input still discriminated against Black patients, because it used healthcare costs as a proxy for health needs — and Black patients, due to systemic inequality, incurred lower healthcare costs for the same level of illness.
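The failure of "fairness through unawareness" can be made concrete with a toy simulation. Everything below is synthetic: the ZIP codes, group labels, and correlation strengths are invented for the sketch. The point is that once a proxy correlates strongly with a protected attribute, even a trivial rule reconstructs that attribute from the proxy alone.

```python
import random
from collections import Counter, defaultdict

random.seed(0)

# Synthetic population: group membership correlates with ZIP code, as it
# does in historically segregated housing markets. All values are invented.
def sample_person():
    zip_code = random.choice(["10001", "10002", "10003", "10004"])
    p_minority = 0.85 if zip_code in ("10001", "10002") else 0.10
    group = "minority" if random.random() < p_minority else "majority"
    return zip_code, group

people = [sample_person() for _ in range(10_000)]

# "Fairness through unawareness": drop the group attribute, keep only ZIP.
# A majority-vote rule per ZIP already recovers group membership far better
# than chance -- a trained model would do at least this well.
by_zip = defaultdict(Counter)
for zip_code, group in people:
    by_zip[zip_code][group] += 1
rule = {z: counts.most_common(1)[0][0] for z, counts in by_zip.items()}

reconstruction_accuracy = sum(rule[z] == g for z, g in people) / len(people)
print(f"ZIP-only reconstruction accuracy: {reconstruction_accuracy:.0%}")
```

With these invented correlation strengths the rule recovers group membership for well over 80 percent of the population, despite the sensitive attribute never appearing as an input.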


Algorithmic Bias

Training data is not the only source of bias. The model architecture itself — the choices made about how to learn from data — can amplify certain patterns and suppress others.

Optimization Objectives

Every machine learning model optimizes for an objective function. In classification, the most common objective is to minimize overall error — to maximize accuracy across the entire dataset. But when the dataset is imbalanced (more examples from one group than another), optimizing for overall accuracy incentivizes the model to perform well on the majority group at the expense of the minority group.

Consider a model that predicts whether a loan applicant will default. If 95 percent of historical applicants came from Group A and 5 percent came from Group B, a model that achieves 98 percent accuracy on Group A and 60 percent accuracy on Group B will have an overall accuracy of approximately 96 percent — which looks excellent in aggregate. The model has effectively learned to ignore the minority group.
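The arithmetic behind that "excellent" aggregate number is worth making explicit, because it is exactly the calculation to run whenever a vendor reports a single accuracy figure:

```python
# Weighted overall accuracy for the loan example in the text:
# 95% of applicants from Group A (98% accuracy), 5% from Group B (60%).
share_a, share_b = 0.95, 0.05
acc_a, acc_b = 0.98, 0.60

overall = share_a * acc_a + share_b * acc_b
print(f"{overall:.3f}")  # 0.961 -- roughly 96%, as stated
```

Because Group B contributes only 5 percent of the weight, its 38-point accuracy deficit moves the aggregate by less than 2 points.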

Feedback Loops

Feedback loops occur when a model's outputs influence the data that is used to train or update the model in the future. This creates a self-reinforcing cycle that can amplify initial biases over time.

Predictive policing is the canonical example. A model trained on historical arrest data may direct more police patrols to neighborhoods with high arrest rates. More patrols lead to more arrests (because crimes that would go undetected in other neighborhoods are now being observed). More arrests generate more training data that reinforces the original prediction. The model's prediction becomes self-fulfilling.

Business Insight: Feedback loops are not limited to criminal justice. A recommendation algorithm that promotes products that have already been purchased will amplify existing popularity, making it harder for new products to gain visibility. A hiring model that recommends candidates who resemble past successful hires will create a homogeneous workforce, which then generates homogeneous training data for the next round. Any AI system that influences the data it learns from is at risk of feedback loops.
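The patrol example can be simulated in a handful of lines. This is a deliberately crude toy model: the districts, rates, and detection curve are all invented, and it is not a real policing model. What it shows is how a small initial skew in the records compounds once the system's outputs feed its future inputs.

```python
# Two districts have IDENTICAL true incident rates, but District 0 starts
# with slightly more recorded arrests. Patrols follow the records, and
# detection rises more than proportionally with patrol concentration.
true_incidents = 100          # actual incidents per district, per period
recorded = [60.0, 50.0]       # initial arrest records (slight skew)

for period in range(10):
    share = [r / sum(recorded) for r in recorded]
    for d in range(2):
        # Convex detection curve: concentrating patrols pays off
        # disproportionately, which is what fuels the loop.
        detection = min(1.0, 3 * share[d] ** 2)
        recorded[d] += true_incidents * detection

final_share = recorded[0] / sum(recorded)
print(f"District 0 share of records: 55% at start -> {final_share:.0%}")
```

Even though both districts generate the same number of true incidents every period, District 0's share of recorded incidents climbs from about 55 percent toward 70 percent, and the model's "evidence" for patrolling it grows with every cycle.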

Feature Interactions

Even when individual features are not biased, their interactions can produce biased outcomes. A model might learn that the combination of "commute time greater than 45 minutes" and "employment gap in the last three years" predicts poor employee retention. Each feature, individually, seems neutral. But their intersection may correlate strongly with single mothers — who are more likely to live in affordable suburbs (long commute) and to have taken career breaks for childcare (employment gaps). The model has effectively learned a proxy for a protected characteristic through feature interaction, without any single feature being problematic on its own.


Human Bias in the AI Pipeline

Behind every model is a team of humans making decisions — and every one of those decisions is subject to cognitive bias.

Confirmation Bias in Feature Selection

Data scientists, like all humans, tend to seek out information that confirms their existing beliefs. When selecting features for a model, a data scientist who believes that "university prestige matters for job performance" will include educational institution as a feature, observe that it has predictive power (because historical data reflects a world where prestigious-university graduates were preferentially hired and promoted), and conclude that their hypothesis was correct. The feature's predictive power confirms the bias — not because prestigious-university graduates are inherently better, but because the system has been rewarding them for decades.

Anchoring in Evaluation

Anchoring bias causes people to rely too heavily on the first piece of information they encounter. In model evaluation, if a team sees that their model achieves 93% accuracy on their test set, that number becomes an anchor. Subsequent discoveries — that accuracy drops to 71% for a specific demographic subgroup — are evaluated relative to the anchor rather than on their own terms. "Well, 71% is still better than random" becomes an acceptable rationalization, when the real question should be: "Is a 22-percentage-point accuracy gap acceptable?"

Homogeneous Teams

We have already noted NK's observation about homogeneous teams. The research supports it. A 2019 study by West, Whittaker, and Crawford at the AI Now Institute found that diversity in AI teams is not merely an ethical aspiration — it is a debiasing mechanism. Teams with diverse backgrounds, experiences, and perspectives are more likely to anticipate how a model might fail for populations they represent. They are more likely to ask: "Have we tested this on...?" They are more likely to notice when a dataset's composition does not match the deployment population.

Business Insight: Diversity on AI teams is not only the right thing to do — it is a risk mitigation strategy. Homogeneous teams have systematic blind spots. Those blind spots become the model's blind spots, which become the organization's liability.


The Hiring AI Scandal: Amazon's Resume Screening Tool

The Athena case is not unique. The most famous corporate example of AI hiring bias occurred at Amazon — a company with some of the most sophisticated machine learning capabilities on earth.

In 2014, Amazon began developing an AI system to automate resume screening. The goal was efficiency: the company received hundreds of thousands of applications each year, and a model that could identify the top candidates would save thousands of hours of human review.

The model was trained on ten years of resumes submitted to Amazon, with the "labels" being which candidates had been hired and promoted. The training data reflected reality: over the previous decade, most of Amazon's technical hires had been men — a reflection of the well-documented gender imbalance in the technology industry.

The model learned this pattern. It penalized resumes that contained the word "women's" — as in "women's chess club captain" or "women's studies." It downgraded graduates of two all-women's colleges. It assigned higher scores to resumes that used language more commonly found in male-authored resumes — aggressive verbs like "executed" and "captured" over collaborative language like "collaborated" and "supported."

Amazon's engineers tried to fix the model. They removed explicit gender indicators. But the proxy variables persisted — the model found other features correlated with gender and continued to discriminate. In 2018, Amazon abandoned the project entirely.

Tom sets down his pen. "I worked at a fintech startup before business school," he says. "We had an algorithm that scored loan applications. We were proud of it — it was fast, consistent, no human subjectivity. I'm now realizing we never tested it for disparate impact across demographic groups. We never even asked the question."

"That's the point," Professor Okonkwo says. "The question is not whether your algorithm is biased. The question is whether you've looked."

Research Note: Amazon's case was first reported by Reuters in October 2018 (Dastin, 2018). Amazon has stated that the tool was never used as the sole mechanism for evaluating candidates, but the incident has become a landmark case in AI ethics education and policy discussions. The EU AI Act, passed in 2024, classifies AI systems used in employment decisions as "high-risk," requiring conformity assessments, human oversight, and bias testing before deployment.


Bias in Facial Recognition: The Gender Shades Study

If Amazon's hiring tool demonstrated bias in text-based AI, Joy Buolamwini and Timnit Gebru's 2018 Gender Shades study revealed it in computer vision — and did so with a rigor and clarity that reshaped the entire field.

Buolamwini, then a graduate researcher at the MIT Media Lab, had noticed something troubling: facial recognition systems frequently failed to detect her face — a dark-skinned woman's face. She began a systematic investigation.

The Gender Shades study evaluated three commercial facial recognition systems — from Microsoft, IBM, and Face++ (Megvii) — on a benchmark dataset of 1,270 faces, balanced across four subgroups defined by the intersection of gender (male/female) and skin type (lighter/darker, measured using the Fitzpatrick skin type scale).

The results were stark:

| Subgroup | Microsoft | IBM | Face++ |
| --- | --- | --- | --- |
| Lighter-skinned males | 99.7% | 99.4% | 99.3% |
| Lighter-skinned females | 98.3% | 97.7% | 95.6% |
| Darker-skinned males | 94.4% | 88.0% | 99.3% |
| Darker-skinned females | 79.2% | 65.3% | 78.7% |

The overall accuracy of each system exceeded 90 percent. But for darker-skinned women, error rates ranged from 20.8 to 34.7 percent, compared with at most 0.7 percent for lighter-skinned men.
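Converting the accuracies in the table above into error rates makes the intersectional gap explicit; a quick sketch:

```python
# Accuracy figures from the Gender Shades table above, in percent.
accuracy = {
    ("lighter", "male"):   {"Microsoft": 99.7, "IBM": 99.4, "Face++": 99.3},
    ("lighter", "female"): {"Microsoft": 98.3, "IBM": 97.7, "Face++": 95.6},
    ("darker", "male"):    {"Microsoft": 94.4, "IBM": 88.0, "Face++": 99.3},
    ("darker", "female"):  {"Microsoft": 79.2, "IBM": 65.3, "Face++": 78.7},
}

# Disaggregated error rates (100 - accuracy) per intersectional subgroup.
error = {grp: {vendor: round(100 - acc, 1) for vendor, acc in vendors.items()}
         for grp, vendors in accuracy.items()}

print(error[("darker", "female")])   # errors of 20.8-34.7% ...
print(error[("lighter", "male")])    # ... versus 0.3-0.7%
```

Reporting each system's single aggregate accuracy would have hidden this entire structure; only the intersectional disaggregation reveals it.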

"This is what 'aggregate accuracy can mask group-level injustice' looks like," Professor Okonkwo says. "If you reported only the overall number, you would conclude these systems work well. If you disaggregated by intersectional subgroups, you would conclude they are unacceptable for deployment in any context that affects dark-skinned women."

The impact was immediate and lasting. Microsoft, IBM, and Amazon (which had not been included in the original study but was later shown to have similar disparities) all publicly committed to improving their systems. In 2020 and 2021, IBM exited the facial recognition market entirely, and multiple US cities banned government use of facial recognition. The study also demonstrated the power of intersectional analysis — examining bias at the intersection of multiple identity dimensions, not just one at a time.

Definition: Intersectional bias occurs when disparities are only visible (or are dramatically amplified) at the intersection of two or more demographic characteristics. A system may show acceptable performance for women overall and for Black individuals overall, while performing far worse for Black women specifically. Kimberlé Crenshaw, who coined the term "intersectionality" in 1989, argued that systems of disadvantage interact multiplicatively, not additively.


Bias in Lending: The Ghost of Redlining

The US financial system provides some of the most well-documented examples of AI bias — in part because lending is heavily regulated and subject to fair lending laws that require disparate impact analysis.

Between the 1930s and the late 1960s, the US government and private lenders systematically denied mortgages and other financial services to residents of predominantly Black neighborhoods — a practice known as redlining (named for the red lines drawn on maps to designate "hazardous" lending areas). The Fair Housing Act of 1968 and the Equal Credit Opportunity Act of 1974 outlawed explicit redlining. But the legacy persists in the data.

Modern credit scoring models, including AI-based models, are trained on historical financial data that reflects decades of unequal access to credit. Applicants from historically redlined neighborhoods — disproportionately Black and Hispanic — have lower average credit scores, fewer lines of credit, and less accumulated wealth. A model trained on this data will learn that these features predict higher default risk. The prediction may be statistically accurate, but it reflects a history of exclusion rather than inherent creditworthiness.

A 2021 study by the National Bureau of Economic Research (Bartlett et al.) found that lenders charged Black and Hispanic borrowers roughly 7.9 basis points more in interest than comparable white borrowers, costing minority borrowers approximately $765 million per year in excess interest payments. Algorithmic lenders discriminated less than face-to-face loan officers, suggesting that AI can reduce bias relative to human decision-making — but only if the underlying data and model are actively debiased.

Business Insight: The lending example illustrates a critical nuance: statistical accuracy and fairness can conflict. A model that uses ZIP code to predict default risk may be accurately predicting default rates in those ZIP codes — but those default rates themselves reflect historical discrimination. The question for business leaders is not "Is the model accurate?" but "Should we optimize for a metric that encodes historical injustice?"

Lena Park, the policy advisor, frames the legal dimension: "Under US law, a lending practice that has a disparate impact on a protected group is presumptively illegal under the Equal Credit Opportunity Act, even if the lender had no discriminatory intent. Intent does not matter. Impact does. AI companies that deploy credit models without disparate impact testing are exposing themselves to enormous legal liability."


Bias in Healthcare

Healthcare AI offers perhaps the most urgent illustration of how bias causes direct, physical harm.

The Pulse Oximeter Problem

We mentioned pulse oximeters earlier. The clinical significance bears emphasis. Sjoding et al. (2020), published in the New England Journal of Medicine, found that Black patients were nearly three times as likely as white patients to have occult hypoxemia (dangerously low oxygen levels) that was not detected by pulse oximetry. During the COVID-19 pandemic, when pulse oximetry readings guided decisions about whether to administer supplemental oxygen or escalate to ICU care, this measurement bias contributed to worse outcomes for Black patients.

Any AI triage system trained on pulse oximetry data would inherit this bias — and might never reveal it, because the biased measurement would be baked into both the training data and the evaluation data. The bias would be invisible unless someone explicitly tested for differential accuracy across skin tones.

Dermatology AI

Computer vision models for skin cancer detection have shown significantly lower accuracy on darker skin tones, for the same reason facial recognition systems do: training data imbalance. A 2022 study in JAMA Dermatology found that the most widely used dermatology AI datasets contained fewer than 5 percent images of dark-skinned patients. Models trained on these datasets achieved area under the curve (AUC) scores above 0.90 for lighter skin tones and below 0.75 for darker skin tones — a clinically meaningful gap that could result in missed diagnoses.

The Optum Algorithm

The Obermeyer et al. (2019) study in Science — one of the most cited papers in algorithmic fairness — examined an algorithm used by Optum (a subsidiary of UnitedHealth Group) to identify patients who would benefit from supplemental healthcare programs. The algorithm used predicted healthcare costs as a proxy for healthcare needs. But due to systemic inequalities in healthcare access and spending, Black patients generated lower healthcare costs than equally sick white patients — not because they were healthier, but because they had less access to care.

The result: at the same risk score, Black patients were significantly sicker than white patients. The algorithm effectively required Black patients to be sicker before recommending them for additional care. The researchers estimated that eliminating this bias would increase the percentage of Black patients identified for additional care from 17.7 percent to 46.5 percent.

"This is the one that keeps me up at night," NK says. "It's not about convenience or efficiency. It's about who gets care and who doesn't. Who lives and who dies."


Measuring Bias: Fairness Metrics

If we are going to take bias seriously, we need to measure it. Intuitions about fairness are important but insufficient — different intuitions can point in different directions, and without quantitative metrics, organizations can engage in what Selbst et al. (2019) called "fairness washing": performing the rituals of fairness without the substance.

This section introduces four core fairness metrics. Chapter 26 will explore their mathematical properties, tensions, and tradeoffs in depth.

1. Disparate Impact Ratio (The Four-Fifths Rule)

The disparate impact ratio compares the selection rate of a disadvantaged group to the selection rate of an advantaged group. In US employment law, the "four-fifths rule" (established in the 1978 Uniform Guidelines on Employee Selection Procedures) states that a selection process has adverse impact if the selection rate for a protected group is less than 80 percent (four-fifths) of the selection rate for the group with the highest rate.

$$\text{Disparate Impact Ratio} = \frac{\text{Selection Rate (disadvantaged group)}}{\text{Selection Rate (advantaged group)}}$$

A ratio below 0.80 indicates potential disparate impact.

For Athena's HR screening model, let us calculate with illustrative numbers: if the model advanced 45 percent of candidates under 35 and 19.6 percent of candidates 35 and over, the disparate impact ratio would be 0.196 / 0.45 ≈ 0.436 — far below the 0.80 threshold. A model with those selection rates has a severe disparate impact on older candidates.
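The arithmetic is one line of code. A minimal helper (the function name is ours, not part of any library):

```python
def disparate_impact_ratio(rate_disadvantaged: float,
                           rate_advantaged: float) -> float:
    """Ratio of selection rates; values below 0.80 flag potential
    adverse impact under the four-fifths rule."""
    return rate_disadvantaged / rate_advantaged

ratio = disparate_impact_ratio(0.196, 0.45)
print(f"DI ratio: {ratio:.3f}")                     # DI ratio: 0.436
print(f"Passes four-fifths rule: {ratio >= 0.80}")  # False
```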

Caution

The four-fifths rule is a screening tool, not a definitive legal standard. A ratio above 0.80 does not guarantee fairness, and a ratio below 0.80 does not guarantee illegality. But it is a widely used and understood starting point, and any model deployed in a hiring context that fails the four-fifths rule should trigger an immediate review.

2. Demographic Parity

Demographic parity (also called statistical parity or independence) requires that the model's positive prediction rate be equal across groups. In hiring, this means the model should recommend the same percentage of candidates from each demographic group.

$$P(\hat{Y} = 1 | A = a) = P(\hat{Y} = 1 | A = b)$$

Demographic parity is intuitive and easy to explain to non-technical stakeholders. Its limitation is that it does not account for differences in base rates — if one group genuinely has more qualified candidates (for legitimate, non-discriminatory reasons), demographic parity may reduce overall prediction quality.
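Checking the condition amounts to comparing positive prediction rates across groups. A toy sketch with invented data:

```python
import numpy as np

y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 1, 0])   # model decisions
group  = np.array(['a'] * 5 + ['b'] * 5)             # sensitive attribute

# P(y_pred = 1 | A = a) for each group
rates = {g: float(y_pred[group == g].mean()) for g in ['a', 'b']}
print(rates)                                         # {'a': 0.8, 'b': 0.2}
print("demographic parity:", rates['a'] == rates['b'])   # False
```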

3. Equalized Odds

Equalized odds requires that the model's true positive rate (sensitivity) and false positive rate be equal across groups. In hiring, this means the model should be equally likely to correctly identify a qualified candidate, regardless of their demographic group, and equally likely to incorrectly advance an unqualified candidate.

$$P(\hat{Y} = 1 | Y = 1, A = a) = P(\hat{Y} = 1 | Y = 1, A = b)$$

$$P(\hat{Y} = 1 | Y = 0, A = a) = P(\hat{Y} = 1 | Y = 0, A = b)$$

Equalized odds is more nuanced than demographic parity because it conditions on the true outcome. Its limitation is that the "true outcome" is often itself tainted by historical bias (as we discussed with label bias).
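The two conditions can be checked directly from per-group true and false positive rates. A toy sketch (all data invented):

```python
import numpy as np

def tpr_fpr(y_true: np.ndarray, y_pred: np.ndarray) -> tuple:
    """True positive rate and false positive rate for one group."""
    tpr = float(y_pred[y_true == 1].mean())
    fpr = float(y_pred[y_true == 0].mean())
    return tpr, fpr

# Two groups with identical base rates but different error patterns
y_true = np.array([1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0])
group  = np.array(['a'] * 6 + ['b'] * 6)

for g in ['a', 'b']:
    m = group == g
    tpr, fpr = tpr_fpr(y_true[m], y_pred[m])
    print(f"group {g}: TPR={tpr:.2f}, FPR={fpr:.2f}")
```

Here group b's qualified members are advanced at half the rate of group a's (TPR 0.33 vs. 0.67), so equalized odds is violated even though both groups have the same base rate.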

4. Calibration

Calibration (or predictive parity) requires that among all individuals who receive a given risk score, the actual positive rate should be the same across groups. If a model assigns a risk score of 0.7 to an applicant, that should mean a 70 percent chance of the relevant outcome regardless of the applicant's demographic group.

Calibration is particularly important in lending and criminal justice, where scores are interpreted as probabilities and used to set interest rates or bail conditions.
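A basic calibration check bins the scores and compares the observed outcome rate per group within each bin. A sketch (the helper name and toy data are ours):

```python
import numpy as np

def calibration_by_group(scores, y_true, groups, edges=(0.0, 0.5, 1.01)):
    """Observed positive rate per score bin, per group. Calibrated
    scores give similar rates across groups within each bin."""
    scores, y_true, groups = map(np.asarray, (scores, y_true, groups))
    result = {}
    for g in np.unique(groups):
        gm = groups == g
        rates = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            bm = gm & (scores >= lo) & (scores < hi)
            rates.append(float(y_true[bm].mean()) if bm.any() else None)
        result[str(g)] = rates
    return result

# Toy data: the same score means different things for the two groups
scores = np.array([0.3, 0.3, 0.7, 0.7, 0.3, 0.3, 0.7, 0.7])
y_true = np.array([0,   0,   1,   1,   0,   1,   1,   0  ])
groups = np.array(['a', 'a', 'a', 'a', 'b', 'b', 'b', 'b'])

print(calibration_by_group(scores, y_true, groups))
# {'a': [0.0, 1.0], 'b': [0.5, 0.5]}
```

A high score corresponds to a 100 percent outcome rate in group a but only 50 percent in group b: the score is miscalibrated across groups.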

Research Note: Chouldechova (2017) proved a foundational impossibility result: except in trivial cases, it is mathematically impossible for a classifier to simultaneously satisfy calibration and equalized odds (equal false positive and false negative rates) when base rates differ across groups. This means that fairness always involves tradeoffs — you must choose which definition of fairness to prioritize, and that choice is fundamentally a values decision, not a technical one. We will explore this tension in depth in Chapter 26.


Building the BiasDetector: Finding Unfairness in Athena's HR Model

Let us move from theory to practice. Ravi has asked his data science team to conduct a comprehensive bias audit of the HR screening model. In this section, we build a BiasDetector class that automates the core steps of that audit.

Try It: The code below uses fairlearn, the open-source fairness assessment toolkit originally developed at Microsoft, alongside standard data science libraries. Install it with pip install fairlearn.

Step 1: Generate Synthetic Hiring Data

We begin by creating a synthetic dataset that mirrors the bias patterns Ravi described. In a real audit, you would use actual data; we use synthetic data here for pedagogical clarity and to avoid privacy concerns.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
import matplotlib.pyplot as plt

np.random.seed(42)

def generate_hiring_data(n_samples: int = 2000) -> pd.DataFrame:
    """
    Generate synthetic hiring data with realistic bias patterns.

    The data simulates Athena's HR screening scenario:
    - Historical managers favored younger candidates
    - Historical managers favored four-year degree holders
    - These biases are embedded in the 'hired' label
    """
    # Candidate demographics
    age = np.random.normal(38, 10, n_samples).clip(22, 65).astype(int)
    age_group = np.where(age < 35, 'Under 35', '35 and Over')

    # Education: 60% four-year degree, 40% other
    has_four_year_degree = np.random.binomial(1, 0.60, n_samples)
    education = np.where(has_four_year_degree, 'Four-Year Degree', 'Other')

    # Gender: roughly balanced
    gender = np.random.choice(['Male', 'Female'], n_samples, p=[0.52, 0.48])

    # Legitimate qualifications (should predict hiring)
    years_experience = np.random.normal(8, 4, n_samples).clip(0, 30)
    management_score = np.random.normal(70, 15, n_samples).clip(0, 100)
    interview_score = np.random.normal(65, 12, n_samples).clip(0, 100)

    # Create hiring probability with embedded bias
    # Base probability from legitimate factors
    hire_prob = (
        0.01 * management_score
        + 0.008 * interview_score
        + 0.005 * years_experience
    )

    # Bias: younger candidates get a boost (historical manager preference)
    hire_prob += np.where(age < 35, 0.25, 0.0)

    # Bias: four-year degree holders get a boost
    hire_prob += np.where(has_four_year_degree, 0.15, 0.0)

    # Normalize to probability range
    hire_prob = (hire_prob - hire_prob.min()) / (hire_prob.max() - hire_prob.min())
    hire_prob = hire_prob * 0.6 + 0.1  # Scale to [0.1, 0.7]

    # Generate hiring decisions
    hired = np.random.binomial(1, hire_prob)

    return pd.DataFrame({
        'age': age,
        'age_group': age_group,
        'education': education,
        'gender': gender,
        'years_experience': years_experience.round(1),
        'management_score': management_score.round(1),
        'interview_score': interview_score.round(1),
        'hired': hired
    })


# Generate the data
df = generate_hiring_data(2000)
print(f"Dataset shape: {df.shape}")
print(f"\nHiring rate by age group:")
print(df.groupby('age_group')['hired'].mean().round(3))
print(f"\nHiring rate by education:")
print(df.groupby('education')['hired'].mean().round(3))
print(f"\nHiring rate by gender:")
print(df.groupby('gender')['hired'].mean().round(3))

Expected output:

Dataset shape: (2000, 8)

Hiring rate by age group:
age_group
35 and Over    0.341
Under 35       0.502
Name: hired, dtype: float64

Hiring rate by education:
education
Four-Year Degree    0.445
Other               0.351
Name: hired, dtype: float64

Hiring rate by gender:
gender
Female    0.403
Male      0.410
Name: hired, dtype: float64

Notice the pattern: the historical hiring data already shows a gap between age groups (50.2% vs. 34.1%) and between education levels (44.5% vs. 35.1%). Gender shows minimal disparity — the bias in this scenario is primarily around age and education, consistent with Ravi's audit findings.

Step 2: Train the Screening Model

Now we train a model on this historical data — the same type of model Athena's HR analyst built using AutoML.

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import LabelEncoder

# Prepare features
le_education = LabelEncoder()
le_gender = LabelEncoder()

X = df[['age', 'years_experience', 'management_score',
        'interview_score']].copy()
X['education_encoded'] = le_education.fit_transform(df['education'])
X['gender_encoded'] = le_gender.fit_transform(df['gender'])

y = df['hired']

# Split data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Train model (simulating the AutoML output)
model = GradientBoostingClassifier(
    n_estimators=100, max_depth=4, random_state=42
)
model.fit(X_train, y_train)

# Predictions
y_pred = model.predict(X_test)
print(f"Overall accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"\n{classification_report(y_test, y_pred)}")

The model will show reasonable overall accuracy — this is the trap. Aggregate metrics look fine. The problems emerge only when we disaggregate.

Step 3: The BiasDetector Class

from fairlearn.metrics import MetricFrame
from sklearn.metrics import accuracy_score, recall_score, precision_score
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd


class BiasDetector:
    """
    Detect and measure bias in binary classification models.

    Designed for HR screening, lending, and other high-stakes
    classification tasks where fairness across demographic groups
    is legally and ethically required.

    Usage:
        detector = BiasDetector(y_true, y_pred, sensitive_features)
        report = detector.full_audit()
        detector.plot_prediction_rates()
    """

    def __init__(
        self,
        y_true: np.ndarray,
        y_pred: np.ndarray,
        sensitive_features: pd.DataFrame
    ):
        """
        Parameters
        ----------
        y_true : array-like
            Ground truth labels (0 or 1).
        y_pred : array-like
            Model predictions (0 or 1).
        sensitive_features : pd.DataFrame
            DataFrame where each column is a sensitive attribute
            (e.g., age_group, gender, education).
        """
        self.y_true = np.array(y_true)
        self.y_pred = np.array(y_pred)
        self.sensitive_features = sensitive_features.reset_index(drop=True)
        self.audit_results = {}

    def disparate_impact_ratio(self, group_col: str) -> dict:
        """
        Calculate the disparate impact ratio for each group.

        The four-fifths rule: a ratio below 0.80 indicates
        potential adverse impact.

        Returns dict with selection rates and DI ratios.
        """
        groups = self.sensitive_features[group_col]
        results = {}

        # Calculate selection rate for each group
        for group_name in groups.unique():
            mask = groups == group_name
            selection_rate = self.y_pred[mask].mean()
            results[group_name] = {
                'selection_rate': selection_rate,
                'n_candidates': mask.sum(),
                'n_selected': self.y_pred[mask].sum()
            }

        # Find the group with the highest selection rate
        max_rate = max(r['selection_rate'] for r in results.values())

        # Calculate DI ratio relative to highest-rate group
        for group_name in results:
            rate = results[group_name]['selection_rate']
            results[group_name]['di_ratio'] = (
                rate / max_rate if max_rate > 0 else 0
            )
            results[group_name]['passes_four_fifths'] = (
                results[group_name]['di_ratio'] >= 0.80
            )

        self.audit_results[f'disparate_impact_{group_col}'] = results
        return results

    def metric_frame_analysis(self, group_col: str) -> pd.DataFrame:
        """
        Use Fairlearn's MetricFrame for group-level metric comparison.

        Computes accuracy, precision, recall, and selection rate
        for each group, plus the difference and ratio between
        the best and worst performing groups.
        """
        metrics = {
            'accuracy': accuracy_score,
            'precision': precision_score,
            'recall': recall_score,
            'selection_rate': lambda y_t, y_p: y_p.mean(),
            'count': lambda y_t, y_p: len(y_t)
        }

        mf = MetricFrame(
            metrics=metrics,
            y_true=self.y_true,
            y_pred=self.y_pred,
            sensitive_features=self.sensitive_features[group_col]
        )

        summary = pd.DataFrame({
            'overall': mf.overall,
            'group_min': mf.group_min(),
            'group_max': mf.group_max(),
            'difference': mf.difference(),
            'ratio': mf.ratio()
        })

        self.audit_results[f'metric_frame_{group_col}'] = {
            'by_group': mf.by_group.to_dict(),
            'summary': summary.to_dict()
        }

        return mf.by_group

    def plot_prediction_rates(
        self, group_col: str, figsize: tuple = (10, 6)
    ) -> None:
        """
        Visualize prediction rates by demographic group.

        Displays selection rate (P(y_pred=1)) for each group
        with the four-fifths threshold line.
        """
        groups = self.sensitive_features[group_col]
        rates = {}

        for group_name in sorted(groups.unique()):
            mask = groups == group_name
            rates[group_name] = self.y_pred[mask].mean()

        max_rate = max(rates.values())
        threshold = 0.80 * max_rate

        fig, ax = plt.subplots(figsize=figsize)

        bars = ax.bar(
            rates.keys(), rates.values(),
            color=['#2ecc71' if r >= threshold else '#e74c3c'
                   for r in rates.values()],
            edgecolor='white', linewidth=1.5
        )

        # Add value labels on bars
        for bar, rate in zip(bars, rates.values()):
            ax.text(
                bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.01,
                f'{rate:.1%}', ha='center', va='bottom', fontweight='bold'
            )

        # Four-fifths threshold line
        ax.axhline(
            y=threshold, color='#e67e22', linestyle='--', linewidth=2,
            label=f'Four-Fifths Threshold ({threshold:.1%})'
        )

        ax.set_ylabel('Selection Rate (Positive Prediction Rate)')
        ax.set_xlabel(group_col.replace('_', ' ').title())
        ax.set_title(
            f'Model Selection Rate by {group_col.replace("_", " ").title()}\n'
            f'Red bars indicate potential disparate impact'
        )
        ax.legend()
        ax.spines['top'].set_visible(False)
        ax.spines['right'].set_visible(False)

        plt.tight_layout()
        plt.savefig(f'bias_audit_{group_col}.png', dpi=150,
                    bbox_inches='tight')
        plt.show()

    def full_audit(self) -> dict:
        """
        Run a complete bias audit across all sensitive features.

        Returns a comprehensive report including disparate impact
        ratios and group-level metrics for each sensitive attribute.
        """
        report = {
            'n_samples': len(self.y_true),
            'overall_selection_rate': self.y_pred.mean(),
            'overall_accuracy': accuracy_score(self.y_true, self.y_pred),
            'attributes_audited': [],
            'findings': []
        }

        for col in self.sensitive_features.columns:
            report['attributes_audited'].append(col)

            # Disparate impact
            di_results = self.disparate_impact_ratio(col)

            # Check for violations
            for group, metrics in di_results.items():
                if not metrics['passes_four_fifths']:
                    report['findings'].append({
                        'attribute': col,
                        'group': group,
                        'di_ratio': metrics['di_ratio'],
                        'selection_rate': metrics['selection_rate'],
                        'severity': 'HIGH' if metrics['di_ratio'] < 0.6
                                    else 'MEDIUM',
                        'recommendation': (
                            'Immediate review required. Selection rate '
                            f'for {group} is {metrics["selection_rate"]:.1%}, '
                            f'DI ratio is {metrics["di_ratio"]:.2f} '
                            '(below 0.80 threshold).'
                        )
                    })

            # Metric frame
            self.metric_frame_analysis(col)

        report['total_findings'] = len(report['findings'])
        report['has_disparate_impact'] = report['total_findings'] > 0

        self.audit_results['full_report'] = report
        return report

    def generate_report(self) -> str:
        """
        Generate a human-readable bias audit report.
        """
        if 'full_report' not in self.audit_results:
            self.full_audit()

        report = self.audit_results['full_report']

        lines = [
            "=" * 60,
            "BIAS AUDIT REPORT",
            "=" * 60,
            f"Samples evaluated: {report['n_samples']}",
            f"Overall selection rate: {report['overall_selection_rate']:.1%}",
            f"Overall accuracy: {report['overall_accuracy']:.1%}",
            f"Attributes audited: {', '.join(report['attributes_audited'])}",
            "",
            "-" * 60,
            "FINDINGS",
            "-" * 60,
        ]

        if not report['findings']:
            lines.append("No disparate impact findings. All groups pass "
                         "the four-fifths rule.")
        else:
            lines.append(
                f"TOTAL FINDINGS: {report['total_findings']}"
            )
            lines.append("")

            for i, finding in enumerate(report['findings'], 1):
                lines.extend([
                    f"Finding {i}: [{finding['severity']}]",
                    f"  Attribute: {finding['attribute']}",
                    f"  Affected group: {finding['group']}",
                    f"  Selection rate: {finding['selection_rate']:.1%}",
                    f"  Disparate impact ratio: {finding['di_ratio']:.3f}",
                    f"  {finding['recommendation']}",
                    ""
                ])

        lines.extend([
            "-" * 60,
            "RECOMMENDATION",
            "-" * 60,
        ])

        if report['has_disparate_impact']:
            lines.extend([
                "This model shows evidence of disparate impact.",
                "Recommended actions:",
                "  1. Halt model deployment pending review",
                "  2. Investigate root causes in training data",
                "  3. Consider mitigation strategies (resampling,",
                "     reweighting, threshold adjustment)",
                "  4. Re-evaluate with legal and ethics review",
                "  5. Document findings and remediation steps"
            ])
        else:
            lines.extend([
                "No disparate impact detected under the four-fifths rule.",
                "Continue monitoring with regular audits."
            ])

        return "\n".join(lines)

Step 4: Run the Audit on Athena's Model

# Prepare sensitive features for the test set
test_indices = X_test.index
sensitive_test = df.loc[test_indices, ['age_group', 'education', 'gender']]

# Create the BiasDetector
detector = BiasDetector(
    y_true=y_test.values,
    y_pred=y_pred,
    sensitive_features=sensitive_test.reset_index(drop=True)
)

# Run the full audit
report = detector.full_audit()

# Print the report
print(detector.generate_report())

Expected output:

============================================================
BIAS AUDIT REPORT
============================================================
Samples evaluated: 600
Overall selection rate: 42.3%
Overall accuracy: 67.8%
Attributes audited: age_group, education, gender

------------------------------------------------------------
FINDINGS
------------------------------------------------------------
TOTAL FINDINGS: 2

Finding 1: [HIGH]
  Attribute: age_group
  Affected group: 35 and Over
  Selection rate: 31.2%
  Disparate impact ratio: 0.574
  Immediate review required. Selection rate for 35 and Over
  is 31.2%, DI ratio is 0.574 (below 0.80 threshold).

Finding 2: [MEDIUM]
  Attribute: education
  Affected group: Other
  Selection rate: 34.8%
  Disparate impact ratio: 0.731
  Immediate review required. Selection rate for Other is
  34.8%, DI ratio is 0.731 (below 0.80 threshold).

------------------------------------------------------------
RECOMMENDATION
------------------------------------------------------------
This model shows evidence of disparate impact.
Recommended actions:
  1. Halt model deployment pending review
  2. Investigate root causes in training data
  3. Consider mitigation strategies (resampling,
     reweighting, threshold adjustment)
  4. Re-evaluate with legal and ethics review
  5. Document findings and remediation steps

The audit confirms what Ravi's team found: the model has severe disparate impact against older candidates (DI ratio of 0.574, well below the 0.80 threshold) and moderate disparate impact against candidates without four-year degrees (DI ratio of 0.731).

Step 5: Visualize the Bias

# Generate visualizations for each sensitive attribute
detector.plot_prediction_rates('age_group')
detector.plot_prediction_rates('education')
detector.plot_prediction_rates('gender')

# Detailed group-level metrics
print("\n--- Metrics by Age Group ---")
print(detector.metric_frame_analysis('age_group'))

print("\n--- Metrics by Education ---")
print(detector.metric_frame_analysis('education'))

The visualizations make the disparity immediately visible. The age group chart shows a red bar for "35 and Over" — well below the four-fifths threshold line. The education chart shows a similar pattern. The gender chart, by contrast, shows both bars in green — no significant disparity.

Athena Update: When Ravi presents these visualizations to Athena's executive team, the response is immediate. The CHRO asks: "How many qualified candidates did we lose?" The General Counsel asks: "What is our legal exposure?" The CEO asks: "How did this happen without anyone knowing?" Ravi's answer to the third question is the most important: "We didn't have a governance process. We had a tool and good intentions. That is not enough." The executive team authorizes Ravi to (1) immediately halt the model, (2) conduct a full bias audit of every AI system at Athena, (3) establish an AI Ethics Board with cross-functional representation, and (4) implement mandatory bias review for all models before deployment. This moment — the shift from "move fast" to "move responsibly" — becomes the turning point in Athena's AI journey.


Mitigation Strategies

Detecting bias is necessary but not sufficient. The goal is to reduce or eliminate unfair disparities while maintaining model utility. Bias mitigation strategies fall into three categories, corresponding to the three stages of the modeling pipeline where intervention is possible.

Pre-Processing: Fix the Data Before Training

Pre-processing strategies modify the training data to remove or reduce bias before a model is trained on it.

Resampling. Oversample underrepresented groups or undersample overrepresented groups to balance the training data. In Athena's case, this might mean duplicating records of hired candidates who were over 35 to increase their representation, or removing some records of hired candidates under 35 to reduce theirs.

Reweighting. Assign different weights to training examples so that the model pays more attention to underrepresented groups. Instead of changing the data itself, you change how much each data point "counts" during training. A candidate over 35 who was hired might receive a weight of 1.5, while a candidate under 35 who was hired might receive a weight of 0.8.
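One standard way to compute such weights is the reweighing scheme of Kamiran and Calders (2012): each example is weighted by P(group) × P(label) / P(group, label), which makes group membership and label statistically independent in the weighted data. A sketch (the function name is ours):

```python
import pandas as pd

def reweighing_weights(df: pd.DataFrame, sensitive_col: str,
                       target_col: str) -> pd.Series:
    """Per-example weights that decorrelate the sensitive attribute
    from the label: P(group) * P(label) / P(group, label)."""
    p_group = df[sensitive_col].value_counts(normalize=True)
    p_label = df[target_col].value_counts(normalize=True)
    p_joint = df.groupby([sensitive_col, target_col]).size() / len(df)
    return df.apply(
        lambda row: p_group[row[sensitive_col]] * p_label[row[target_col]]
        / p_joint[(row[sensitive_col], row[target_col])],
        axis=1,
    )
```

The resulting weights can be passed straight to most scikit-learn estimators via model.fit(X_train, y_train, sample_weight=weights).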

Relabeling. In some cases, the labels themselves are biased (as with the performance review issue discussed earlier). Relabeling involves correcting biased labels — for example, using a panel review to re-evaluate borderline hiring decisions and update labels accordingly.

from sklearn.utils import resample

def resample_for_fairness(
    df: pd.DataFrame,
    target_col: str,
    sensitive_col: str
) -> pd.DataFrame:
    """
    Resample training data to equalize positive outcome rates
    across groups defined by sensitive_col.

    Strategy: oversample positive cases in disadvantaged groups
    to match the advantaged group's positive rate.
    """
    groups = df[sensitive_col].unique()
    positive_rates = df.groupby(sensitive_col)[target_col].mean()
    target_rate = positive_rates.max()

    resampled_parts = []

    for group in groups:
        group_df = df[df[sensitive_col] == group]
        group_positive = group_df[group_df[target_col] == 1]
        group_negative = group_df[group_df[target_col] == 0]

        current_rate = len(group_positive) / len(group_df)

        if current_rate < target_rate:
            # To reach the target positive rate with negatives fixed,
            # solve p / (p + n_neg) = target_rate for the number of
            # positive samples p: p = target_rate / (1 - target_rate) * n_neg
            n_positive_needed = int(
                target_rate / (1 - target_rate) * len(group_negative)
            )
            n_positive_needed = max(n_positive_needed, len(group_positive))

            # Oversample positive cases
            group_positive_resampled = resample(
                group_positive,
                replace=True,
                n_samples=n_positive_needed,
                random_state=42
            )
            resampled_parts.append(
                pd.concat([group_positive_resampled, group_negative])
            )
        else:
            resampled_parts.append(group_df)

    return pd.concat(resampled_parts, ignore_index=True)


# Apply resampling
df_resampled = resample_for_fairness(df, 'hired', 'age_group')
print("After resampling:")
print(df_resampled.groupby('age_group')['hired'].mean().round(3))

Caution

Pre-processing strategies modify the training data, which can affect model accuracy. The tradeoff between fairness and accuracy is real — but in most high-stakes applications, a small reduction in overall accuracy is acceptable (and avoiding disparate impact may be legally required) if it substantially reduces the disparity.

In-Processing: Constrain the Model During Training

In-processing strategies modify the learning algorithm itself to incorporate fairness constraints during training.

Constrained optimization. Instead of minimizing error alone, the model minimizes error subject to a fairness constraint — for example, requiring that the disparate impact ratio remain above 0.80. Fairlearn's ExponentiatedGradient algorithm implements this approach.

Adversarial debiasing. Train two models simultaneously: a predictor that tries to make accurate predictions, and an adversary that tries to predict the sensitive attribute from the predictor's output. The predictor is penalized when the adversary succeeds — forcing the predictor to learn representations that are informative for the task but uninformative about the sensitive attribute.

Regularization for fairness. Add a penalty term to the loss function that increases when predictions are correlated with the sensitive attribute. This is analogous to L1/L2 regularization (introduced in Chapter 8 for preventing overfitting), but instead of penalizing model complexity, it penalizes unfairness.
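As an illustration of the idea — not a production method — a logistic regression can be trained with an added penalty on the squared covariance between its scores and the sensitive attribute. Everything below is our own minimal sketch:

```python
import numpy as np

def fit_fair_logreg(X, y, a, lam=1.0, lr=0.1, epochs=2000):
    """Logistic regression whose loss adds lam * cov(score, a)^2,
    penalizing predictions that track the sensitive attribute a."""
    w = np.zeros(X.shape[1])
    a_c = a - a.mean()                       # centered sensitive attribute
    n = len(y)
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-X @ w))     # predicted probabilities
        grad_ll = X.T @ (p - y) / n          # gradient of the log-loss
        cov = (a_c * p).mean()               # cov(score, a), up to scaling
        # gradient of cov^2 w.r.t. w: 2 * cov * mean(a_c * p * (1-p) * x)
        grad_fair = 2 * cov * (X.T @ (a_c * p * (1 - p))) / n
        w -= lr * (grad_ll + lam * grad_fair)
    return w
```

Setting lam = 0 recovers plain logistic regression; raising it trades predictive fit for independence between the scores and the sensitive attribute.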

from fairlearn.reductions import ExponentiatedGradient, DemographicParity
from sklearn.ensemble import GradientBoostingClassifier

def train_with_fairness_constraint(
    X_train: pd.DataFrame,
    y_train: np.ndarray,
    sensitive_train: pd.Series,
    constraint_type: str = 'demographic_parity'
) -> object:
    """
    Train a model with fairness constraints using Fairlearn's
    ExponentiatedGradient algorithm.

    Parameters
    ----------
    constraint_type : str
        'demographic_parity' or 'equalized_odds'
    """
    base_estimator = GradientBoostingClassifier(
        n_estimators=50, max_depth=3, random_state=42
    )

    if constraint_type == 'demographic_parity':
        constraint = DemographicParity()
    else:
        from fairlearn.reductions import EqualizedOdds
        constraint = EqualizedOdds()

    mitigator = ExponentiatedGradient(
        estimator=base_estimator,
        constraints=constraint
    )

    mitigator.fit(X_train, y_train, sensitive_features=sensitive_train)

    return mitigator


# Train a fair model
sensitive_train = df.loc[X_train.index, 'age_group']
fair_model = train_with_fairness_constraint(
    X_train, y_train, sensitive_train, 'demographic_parity'
)

# Compare predictions
y_pred_fair = fair_model.predict(X_test)
print(f"Fair model accuracy: {accuracy_score(y_test, y_pred_fair):.3f}")

# Re-run bias audit
detector_fair = BiasDetector(
    y_true=y_test.values,
    y_pred=y_pred_fair,
    sensitive_features=sensitive_test.reset_index(drop=True)
)
print(detector_fair.generate_report())

Post-Processing: Adjust Outputs After Prediction

Post-processing strategies modify the model's outputs — after the model has been trained and has generated predictions — to achieve fairness goals.

Threshold adjustment. Instead of using a single classification threshold (typically 0.50) for all groups, use group-specific thresholds calibrated to achieve equal selection rates or equalized odds. For example, if the model's 0.50 threshold produces a 45% selection rate for younger candidates and a 31% selection rate for older candidates, lower the threshold for older candidates (say, to 0.38) to equalize selection rates.
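Before reaching for a library, the mechanics are worth seeing by hand. The scores and thresholds below are illustrative, not outputs of Athena's model:

```python
import numpy as np

scores = np.array([0.55, 0.45, 0.62, 0.35, 0.41, 0.58])  # model scores
group  = np.array(['under_35', 'under_35', 'under_35',
                   'over_35', 'over_35', 'over_35'])
thresholds = {'under_35': 0.50, 'over_35': 0.38}          # group cutoffs

# Classify each candidate against their group's threshold
y_hat = np.array([int(s >= thresholds[g]) for s, g in zip(scores, group)])
for g in ['under_35', 'over_35']:
    print(f"{g}: selection rate {y_hat[group == g].mean():.2f}")
```

Both groups now land at the same selection rate. Fairlearn's ThresholdOptimizer automates this search, choosing per-group thresholds that satisfy the chosen fairness constraint while maximizing accuracy.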

from fairlearn.postprocessing import ThresholdOptimizer

def apply_threshold_adjustment(
    estimator,
    X_train: pd.DataFrame,
    y_train: np.ndarray,
    sensitive_train: pd.Series,
    X_test: pd.DataFrame,
    sensitive_test: pd.Series,
    constraint: str = 'demographic_parity'
) -> np.ndarray:
    """
    Apply post-processing threshold adjustment to equalize
    outcomes across groups.
    """
    postprocessor = ThresholdOptimizer(
        estimator=estimator,
        constraints=constraint,
        objective='accuracy_score',
        prefit=True
    )

    postprocessor.fit(
        X_train, y_train, sensitive_features=sensitive_train
    )

    y_pred_adjusted = postprocessor.predict(
        X_test, sensitive_features=sensitive_test
    )

    return y_pred_adjusted


# Apply threshold adjustment to original (biased) model
sensitive_train_series = df.loc[X_train.index, 'age_group']
sensitive_test_series = sensitive_test['age_group']

y_pred_adjusted = apply_threshold_adjustment(
    model, X_train, y_train, sensitive_train_series,
    X_test, sensitive_test_series,
    constraint='demographic_parity'
)

# Compare before and after
print("=== BEFORE THRESHOLD ADJUSTMENT ===")
for group in ['Under 35', '35 and Over']:
    mask = sensitive_test_series == group
    rate = y_pred[mask.values].mean()
    print(f"  {group}: {rate:.1%} selection rate")

print("\n=== AFTER THRESHOLD ADJUSTMENT ===")
for group in ['Under 35', '35 and Over']:
    mask = sensitive_test_series == group
    rate = y_pred_adjusted[mask.values].mean()
    print(f"  {group}: {rate:.1%} selection rate")

print(f"\nOriginal accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(f"Adjusted accuracy: {accuracy_score(y_test, y_pred_adjusted):.3f}")

Expected output:

=== BEFORE THRESHOLD ADJUSTMENT ===
  Under 35: 54.3% selection rate
  35 and Over: 31.2% selection rate

=== AFTER THRESHOLD ADJUSTMENT ===
  Under 35: 41.8% selection rate
  35 and Over: 40.5% selection rate

Original accuracy: 0.678
Adjusted accuracy: 0.651

The threshold adjustment has dramatically reduced the disparity — the selection rates are now nearly equal. Accuracy dropped slightly (from 67.8% to 65.1%), but this is a small price for eliminating a discriminatory pattern that exposed Athena to both ethical harm and legal liability.

Business Insight: The accuracy-fairness tradeoff is often smaller than organizations expect. In this case, a 2.7-percentage-point reduction in accuracy (well within normal model variance) eliminated a pattern that was disproportionately excluding qualified candidates over 35. In most cases, the "cost" of fairness is modest; the cost of unfairness — in lawsuits, regulatory penalties, reputational damage, and lost talent — is enormous.

Comparing Mitigation Strategies

Each approach has strengths and limitations:

| Strategy | When to Use | Advantage | Limitation |
|---|---|---|---|
| Pre-processing (resampling, reweighting) | When bias is primarily in the training data | Model-agnostic; easy to implement | May not fully eliminate bias learned from correlated features |
| In-processing (constrained optimization) | When you control the training pipeline | Directly optimizes for fairness during learning | Requires fairness-aware training infrastructure |
| Post-processing (threshold adjustment) | When you cannot retrain the model | Can be applied to any model, including black-box vendor models | Does not fix the underlying model; may require group labels at inference time |

In practice, the most effective approach is often a combination: clean the data (pre-processing), train with fairness constraints (in-processing), and validate with threshold analysis (post-processing). Defense in depth — a principle borrowed from cybersecurity — applies to fairness as well.


Organizational Responsibility: Who Owns Bias?

The technical tools in this chapter — disparate impact ratios, Fairlearn, threshold adjustment — are necessary but not sufficient. Bias is not a problem that can be solved by data scientists alone. It is an organizational problem that requires organizational solutions.

The Responsibility Chain

"Who is responsible for bias in Athena's HR model?" Professor Okonkwo asks the class.

"The HR analyst who deployed it," says one student.

"The data science team that didn't review it," says another.

"The executives who didn't establish governance," says Tom.

"The managers who made biased hiring decisions for five years," says NK.

"You're all right," Professor Okonkwo says. "And that's the problem. When everyone is partially responsible, no one feels fully accountable. This is the diffusion of responsibility problem in AI governance."

She draws a chain on the board:

Data Collection → Data Labeling → Feature Engineering → Model Training → Model Evaluation → Deployment → Monitoring → Impact

"Bias can enter at any link in this chain. And the person responsible for each link is different. The managers who collected the biased training data didn't think they were creating an AI problem — they were just hiring. The HR analyst who deployed the model didn't think she was discriminating — she was just automating. The data science team didn't think they were abdicating responsibility — they didn't even know the model existed."

The Three Lines of Defense

Drawing from the risk management framework used in financial services, organizations should establish three lines of defense against AI bias:

First line: The model builders. Data scientists and ML engineers have a responsibility to test for bias before deployment. This includes disaggregated evaluation (breaking performance metrics down by demographic subgroup), disparate impact analysis, and documentation of known limitations. The BiasDetector class we built in this chapter should be a standard step in every model development pipeline.

Second line: The governance function. An AI Ethics Board or AI Risk Committee — independent from the teams building models — should review high-risk models before deployment. "High-risk" should be defined broadly: any model that makes or influences decisions about people (hiring, lending, healthcare, criminal justice, education) should require ethics review.

Third line: Internal audit. An independent audit function should periodically test deployed models for bias drift — the phenomenon where a model's fairness characteristics change over time as the underlying population or data shifts. Athena's bias was caught by internal audit; many organizations do not have this capability.

Business Insight: The cost of establishing these three lines of defense is far less than the cost of a single high-profile bias incident. Amazon's abandoned recruiting tool generated years of negative press. The COMPAS algorithm (which we will examine in Case Study 2) triggered a national debate about algorithmic justice. Clearview AI's facial recognition practices resulted in regulatory action across multiple countries. The organizations that invest in governance before a crisis are the ones that survive it.

The Role of the Business Leader

"You will not write the code," Professor Okonkwo tells the class. "Most of you will never train a model. But you will decide whether to deploy one. You will approve the budget for fairness testing — or not. You will set the culture that determines whether a data scientist feels safe raising a concern about bias — or stays silent."

She pauses.

"The most important anti-bias technology is not an algorithm. It is a culture where someone can say, 'I found a problem,' and the response is 'Thank you for finding it,' not 'You're slowing us down.'"

Ravi nods from the back of the room. "I can confirm that. The HR analyst who deployed the model at Athena was trying to help — she was overwhelmed with resumes and wanted a faster process. She didn't know to test for bias. She didn't know she should test for bias. That's not her failure. That's our failure as an organization. We gave her a powerful tool and no guardrails."


The Legal Landscape

Lena Park has been taking notes throughout the lecture. Now she steps to the front to provide the legal frame.

"Three frameworks matter most for AI bias in the United States and Europe," she says.

Title VII of the Civil Rights Act of 1964

"Title VII prohibits employment discrimination based on race, color, religion, sex, or national origin. The Supreme Court established in Griggs v. Duke Power Co. (1971) that practices with a disparate impact on protected groups are illegal, even if the employer had no discriminatory intent. This applies to AI. If your hiring algorithm disproportionately screens out women or minority candidates, the burden shifts to you to prove that the algorithm is job-related and consistent with business necessity."

The EU AI Act (2024)

"The EU AI Act classifies AI systems into risk tiers. AI used in employment — including recruitment, screening, and performance evaluation — is classified as high-risk. High-risk systems must undergo conformity assessments, provide transparency about how they work, enable human oversight, and demonstrate that they have been tested for bias across demographic groups. Non-compliance can result in fines of up to 35 million euros or 7 percent of global revenue."

The Age Discrimination in Employment Act (ADEA)

"Particularly relevant to Athena's case: ADEA prohibits employment discrimination against individuals 40 years of age and older. An AI system that disproportionately screens out candidates over 40 creates the same legal liability as a human hiring manager who does the same thing. The algorithm is not a defense — it is the mechanism of discrimination."

Business Insight: Lena's core message for business leaders: the legal framework treats AI-driven discrimination the same as human-driven discrimination. "The algorithm did it" is not a defense. If anything, algorithmic discrimination may be more legally risky than human discrimination, because the pattern is systematic, documented, and provable — unlike the scattered, inconsistent biases of individual human decision-makers.


Chapter Summary

NK is the last to leave the lecture hall. She stops at the door and turns back to Professor Okonkwo.

"I've been in every lecture this semester," NK says. "I've learned about classification, regression, neural networks, prompt engineering. All of it was interesting. This was important."

"There's a difference," Okonkwo agrees.

"The difference is that everything else we've learned is about making models that work. This is about making models that work for everyone. And if we don't get this right, all the technical sophistication in the world is just a more efficient way of being unfair."

Tom, who is waiting outside, catches the last sentence. He doesn't say anything. He writes in his notebook:

Technical excellence without ethical awareness is not excellence. It is sophisticated negligence.


Key Concepts Summary

| Concept | Definition | Business Relevance |
|---|---|---|
| AI Bias | Systematic unfairness in AI outputs across groups | Legal liability, reputational risk, ethical harm |
| Historical Bias | Bias from training on data that reflects past discrimination | Most common source; requires data auditing |
| Representation Bias | Underrepresentation of groups in training data | Particularly risky in healthcare, finance |
| Proxy Variables | Features that correlate with protected attributes | Cannot be solved by simply removing sensitive features |
| Disparate Impact Ratio | Selection rate comparison; <0.80 = potential adverse impact | US employment law standard (four-fifths rule) |
| Demographic Parity | Equal prediction rates across groups | Intuitive but may conflict with other metrics |
| Equalized Odds | Equal TPR and FPR across groups | More nuanced; conditions on true outcome |
| Calibration | Same predicted probability = same actual probability across groups | Critical for lending and criminal justice |
| Pre-processing | Fix data before training (resample, reweight) | Model-agnostic; easy to implement |
| In-processing | Constrain model during training | Most principled; requires fairness-aware tools |
| Post-processing | Adjust outputs after prediction | Can be applied to black-box models |
| Automation Bias | Human tendency to defer to algorithmic outputs | Governance and training issue |
| Feedback Loop | Model outputs influence future training data | Can amplify bias over time |
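The four-fifths rule in the table above reduces to a few lines of code. This minimal sketch computes each group's selection rate relative to the most-favored group; the prediction counts are illustrative, and the chapter's `BiasDetector` class performs the same check as part of its report.

```python
import numpy as np

def disparate_impact_ratio(y_pred, groups, reference_group):
    """Ratio of each group's selection rate to the reference group's rate.
    A ratio below 0.80 flags potential adverse impact under the
    four-fifths rule used in US employment guidelines."""
    y_pred = np.asarray(y_pred)
    groups = np.asarray(groups)
    ref_rate = y_pred[groups == reference_group].mean()
    return {
        g: y_pred[groups == g].mean() / ref_rate
        for g in np.unique(groups)
    }

# Illustrative counts: 6 of 10 younger vs 3 of 10 older candidates advanced
y_pred = [1] * 6 + [0] * 4 + [1] * 3 + [0] * 7
groups = ['Under 35'] * 10 + ['35 and Over'] * 10

ratios = disparate_impact_ratio(y_pred, groups, reference_group='Under 35')
for g, r in ratios.items():
    flag = 'ADVERSE IMPACT' if r < 0.80 else 'ok'
    print(f"{g}: ratio {r:.2f} ({flag})")
    # 35 and Over: ratio 0.50 (ADVERSE IMPACT)
    # Under 35: ratio 1.00 (ok)
```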

Looking Forward

This chapter identified the problem. The next five chapters provide the solutions.

Chapter 26 will confront the mathematical tension between fairness definitions — the impossibility result that you cannot satisfy all definitions simultaneously — and introduce explainability tools (SHAP, LIME, model cards) that make AI decisions transparent enough to scrutinize.

Chapter 27 will build the governance frameworks — AI Ethics Boards, model risk management, AI impact assessments — that prevent the kind of ungoverned deployment that created Athena's crisis.

Chapter 30 will operationalize responsible AI at organizational scale, including bias bounties, red-teaming, and responsible AI maturity models.

The tools exist. The frameworks exist. The research exists. What remains is the commitment — organizational, institutional, and personal — to use them.

"The model is only as fair as the decisions we make before, during, and after building it. Fairness is not a feature you add. It is a standard you hold."

— Professor Diane Okonkwo