Chapter 25 Exercises: Bias in AI Systems


Section A: Bias Identification (Exercises 1-5)

Exercise 1: Classifying Bias Sources

For each of the following scenarios, identify the primary source of bias using the Suresh and Guttag taxonomy (historical, representation, measurement, aggregation, evaluation, deployment). Explain your reasoning in two to three sentences.

(a) A speech recognition system trained primarily on American English speakers performs poorly for users with Nigerian, Indian, and Scottish accents. The system was tested using a benchmark dataset that was also composed primarily of American English speakers.

(b) A credit scoring model uses data from a period when banks required cosigners for applicants from certain neighborhoods, making it harder for those applicants to qualify for loans. The model learns that applicants from these neighborhoods have lower repayment rates — because they were historically denied adequate credit products.

(c) A medical AI system uses a single diagnostic threshold for liver enzyme levels across all patients. Research shows that "normal" ranges for liver enzymes differ significantly by sex and ethnicity due to biological factors.

(d) A customer churn prediction model was designed to flag high-risk customers for a retention offer. Instead, marketing uses the model's risk scores to determine which customers to exclude from promotional campaigns, reasoning that high-churn-risk customers are not worth the investment.

(e) A sentiment analysis model trained on product reviews from an electronics retailer is deployed to analyze customer feedback for a luxury hotel chain. The model was never re-evaluated on hospitality-domain text.


Exercise 2: Proxy Variable Detection

A bank's lending model uses the following features to predict loan default. For each feature, assess whether it could serve as a proxy for a protected characteristic (race, gender, age, national origin). Explain the mechanism by which the proxy operates.

Feature                    Description
-------                    -----------
ZIP code                   Applicant's residential ZIP code
First name                 Applicant's first name
Commute distance           Distance from home to workplace
Browser type               Web browser used to submit application
Time of application        Time of day the application was submitted
Highest degree             Highest educational degree earned
Years at current address   Duration at current residence

Exercise 3: Identifying Feedback Loops

A large online marketplace uses an AI recommendation engine to surface products to customers. The engine ranks products based on predicted purchase probability, which is trained on historical purchase data.

(a) Describe the feedback loop that could emerge in this system. Be specific about each step in the cycle.

(b) A new vendor joins the marketplace with a product that has no purchase history. How does the feedback loop affect this vendor? What is this phenomenon called?

(c) Propose two interventions to break or weaken the feedback loop without abandoning the recommendation system entirely.

(d) How does this feedback loop relate to the concept of "rich get richer" dynamics in network economics? Connect to a real-world example beyond e-commerce.


Exercise 4: Bias in Athena's Data

Athena Retail Group's HR screening model was trained on five years of hiring data. Below is a simplified representation of the training data composition.

Characteristic        Hired (%)   Applied (%)
--------------        ---------   -----------
Under 35              62%         48%
35 and Over           38%         52%
Four-Year Degree      71%         55%
No Four-Year Degree   29%         45%

(a) Calculate the historical hiring rate advantage for candidates under 35 relative to their application rate. Do the same for four-year degree holders.

(b) The model amplified these patterns: 78% of model-recommended candidates were under 35 (vs. 62% in the historical data). Why do machine learning models tend to amplify existing patterns rather than simply replicate them?

(c) Ravi discovers that "years of experience" was not included as a feature in the model because the HR analyst considered it "redundant" with age. Explain why this omission could worsen age-based bias.

(d) Suppose Athena removes age from the model's features entirely. Is this sufficient to eliminate age bias? Why or why not?


Exercise 5: The Resume Screening Problem

You are reviewing the features used in an AI resume screening system. The system uses the following to score candidates:

  1. University attended (ranked by US News tier)
  2. Years since graduation
  3. Number of previous employers
  4. Presence of certain keywords (e.g., "led," "managed," "built")
  5. Length of resume (in words)
  6. GPA (if listed)
  7. Extracurricular activities

For each feature, identify at least one way it could introduce or amplify bias. For three of the seven features, propose an alternative feature that captures similar information with less bias risk.


Section B: Fairness Metric Calculation (Exercises 6-10)

Exercise 6: Disparate Impact Ratio

A hiring model evaluates 500 candidates: 300 from Group A and 200 from Group B. The model advances 120 candidates from Group A and 50 from Group B.

(a) Calculate the selection rate for each group.

(b) Calculate the disparate impact ratio.

(c) Does this model pass the four-fifths rule? Show your work.

(d) If the company wanted to meet the four-fifths rule by advancing more candidates from Group B (without changing Group A's numbers), what is the minimum number of Group B candidates that would need to be advanced?

(e) A manager argues: "The four-fifths rule is just a guideline, not a law." Is this legally accurate? What is the legal significance of the four-fifths rule?
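For this and the other Section B exercises, a short helper is enough to check your arithmetic. The sketch below uses illustrative numbers, not the exercise's, so the calculations are still yours to do:

```python
def selection_rate(advanced, total):
    """Fraction of a group's candidates that the model advances."""
    return advanced / total

def disparate_impact_ratio(rate_low, rate_reference):
    """Ratio of a group's selection rate to the reference (highest) rate."""
    return rate_low / rate_reference

# Illustrative numbers only: 80 of 200 advanced vs. 90 of 150.
r1, r2 = selection_rate(80, 200), selection_rate(90, 150)
ratio = disparate_impact_ratio(min(r1, r2), max(r1, r2))
print(f"rates: {r1:.3f}, {r2:.3f}; DI ratio: {ratio:.3f}; passes 4/5: {ratio >= 0.8}")
```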


Exercise 7: Demographic Parity vs. Equalized Odds

A loan approval model produces the following confusion matrices for two demographic groups:

Group A (400 applicants):

                    Predicted: Approve   Predicted: Deny
Actual: Repaid      160                  40
Actual: Defaulted   60                   140

Group B (200 applicants):

                    Predicted: Approve   Predicted: Deny
Actual: Repaid      50                   30
Actual: Defaulted   20                   100

(a) Calculate the approval rate (positive prediction rate) for each group. Does the model satisfy demographic parity?

(b) Calculate the true positive rate (recall) for each group. Does the model satisfy the equal opportunity criterion (equal TPR)?

(c) Calculate the false positive rate for each group. Does the model satisfy equalized odds (equal TPR and equal FPR)?

(d) Calculate the precision (positive predictive value) for each group. What does this tell you about calibration?

(e) Can this model simultaneously satisfy demographic parity and equalized odds? Why or why not?
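All of the quantities in parts (a) through (d) fall out of the four confusion-matrix cells. A generic helper (the numbers below are illustrative, not the exercise's) can verify your hand calculations:

```python
def rates_from_confusion(tp, fn, fp, tn):
    """Group-level metrics from confusion-matrix cells.
    Here 'positive' means an approve prediction, and an actual
    repayer counts as an actual positive."""
    total = tp + fn + fp + tn
    return {
        "approval_rate": (tp + fp) / total,  # positive prediction rate
        "tpr": tp / (tp + fn),               # recall / equal opportunity
        "fpr": fp / (fp + tn),
        "ppv": tp / (tp + fp),               # precision
    }

# Illustrative cells: 70 TP, 30 FN, 20 FP, 80 TN.
m = rates_from_confusion(70, 30, 20, 80)
print(m)
```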


Exercise 8: The Impossibility Theorem in Practice

Chouldechova (2017) proved that when base rates differ across groups, no imperfect classifier can simultaneously achieve calibration, equal false positive rates, and equal false negative rates.

Consider two populations:

  • Population A: 10% default rate (base rate = 0.10)
  • Population B: 20% default rate (base rate = 0.20)

(a) If a model is perfectly calibrated (a score of 0.3 means a 30% chance of default for both groups), and it uses a single threshold of 0.5 for both groups, which group will have a higher false negative rate? Explain intuitively.

(b) If you lower the threshold for Population B to equalize false negative rates, what happens to the false positive rate for Population B?

(c) In your own words, explain why this represents a genuine tradeoff rather than a technical limitation that better algorithms will eventually solve.

(d) Who should make the decision about which fairness metric to prioritize — the data scientist, the legal team, the affected community, or the executive team? Justify your answer.
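Part (a) can be checked with a small discrete example. A calibrated scorer satisfies P(default | score = s) = s, so the expected false negative rate at a threshold t is the share of defaulter mass sitting below t. The score distributions below are invented for illustration; only their means (the base rates) come from the exercise:

```python
def expected_fnr(score_dist, threshold=0.5):
    """FNR for a perfectly calibrated scorer: P(score < t | default).
    score_dist maps score -> fraction of the population with that score;
    calibration means P(default | score s) = s."""
    defaults = sum(s * frac for s, frac in score_dist.items())
    missed = sum(s * frac for s, frac in score_dist.items() if s < threshold)
    return missed / defaults

# Invented calibrated score distributions with base rates 0.10 and 0.20.
pop_a = {0.04: 0.75, 0.20: 0.20, 0.60: 0.05}  # mean (base rate) = 0.10
pop_b = {0.04: 0.50, 0.20: 0.30, 0.60: 0.20}  # mean (base rate) = 0.20
print(expected_fnr(pop_a), expected_fnr(pop_b))
```

Try varying the distributions while holding the means fixed to see how the gap behaves under these assumptions.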


Exercise 9: Calculating Bias in Athena's Model

Using the BiasDetector class from this chapter, analyze the following test set results from Athena's HR screening model.

Candidate ID   Age Group   Education   Predicted   Actual
------------   ---------   ---------   ---------   ------
1              Under 35    Four-Year   1           1
2              Under 35    Four-Year   1           0
3              35+         Other       0           1
4              Under 35    Other       1           1
5              35+         Four-Year   1           1
6              35+         Other       0           0
7              Under 35    Four-Year   1           1
8              35+         Other       0           1
9              Under 35    Other       0           0
10             35+         Four-Year   0           0
11             Under 35    Four-Year   1           0
12             35+         Other       0           1

(a) Calculate the selection rate (positive prediction rate) for each age group.

(b) Calculate the disparate impact ratio for age group.

(c) Calculate the selection rate for each education level.

(d) Calculate the disparate impact ratio for education.

(e) Which dimension (age or education) shows a more severe disparate impact? Does either pass the four-fifths rule?
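If you want to cross-check your tallies, the table is small enough to transcribe and count with the standard library alone (do the hand calculation first):

```python
# Each record: (age_group, education, predicted), transcribed from the table.
records = [
    ("Under 35", "Four-Year", 1), ("Under 35", "Four-Year", 1),
    ("35+", "Other", 0), ("Under 35", "Other", 1),
    ("35+", "Four-Year", 1), ("35+", "Other", 0),
    ("Under 35", "Four-Year", 1), ("35+", "Other", 0),
    ("Under 35", "Other", 0), ("35+", "Four-Year", 0),
    ("Under 35", "Four-Year", 1), ("35+", "Other", 0),
]

def selection_rates(records, index):
    """Positive-prediction rate for each value of the attribute at `index`."""
    counts, positives = {}, {}
    for rec in records:
        key = rec[index]
        counts[key] = counts.get(key, 0) + 1
        positives[key] = positives.get(key, 0) + rec[2]
    return {k: positives[k] / counts[k] for k in counts}

print(selection_rates(records, 0))  # by age group
print(selection_rates(records, 1))  # by education
```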


Exercise 10: Before-and-After Mitigation

A model's selection rates before and after threshold adjustment are shown below:

Group     Before   After
-----     ------   -----
Group A   52%      44%
Group B   28%      41%
Group C   61%      45%
Group D   33%      42%

(a) Calculate the disparate impact ratio before mitigation (using the highest-rate group as the reference).

(b) Calculate the disparate impact ratio after mitigation.

(c) Do all groups pass the four-fifths rule after mitigation?

(d) The overall accuracy dropped from 74% to 69% after threshold adjustment. A product manager says, "We're sacrificing five points of accuracy for fairness — that's too much." Write a two-paragraph response that addresses this objection, incorporating both ethical and business arguments.


Section C: Mitigation Design (Exercises 11-13)

Exercise 11: Designing a Mitigation Strategy

You are the Chief Data Officer at a health insurance company. Your team has built a model that predicts which members are at high risk of hospitalization. The model will be used to enroll members in a preventive care program. A bias audit reveals:

  • The model identifies 32% of white members as high-risk but only 19% of Black members.
  • Black members have higher actual hospitalization rates than white members.
  • The disparity exists because the model uses historical healthcare utilization as a predictor, and Black members have historically had lower utilization rates due to access barriers — not lower health needs.

(a) Which source of bias (from the Suresh and Guttag taxonomy) is primarily responsible?

(b) Would removing race as a feature fix the problem? Why or why not?

(c) Design a mitigation strategy that combines at least two approaches (pre-processing, in-processing, post-processing). Be specific about what you would do at each step.

(d) The actuarial team argues that the model is "accurate" because it correctly predicts utilization. Draft a one-paragraph response explaining why accuracy in predicting utilization is not the same as fairness in identifying health needs.


Exercise 12: Resampling Implementation

Given the following simplified training data distribution:

Group      Positive (Hired)   Negative (Not Hired)   Total   Hire Rate
-----      ----------------   --------------------   -----   ---------
Under 35   180                120                    300     60%
35+        100                300                    400     25%

(a) If you want to equalize hiring rates across groups through oversampling the positive cases in the disadvantaged group, how many additional "hired" records from the 35+ group would you need to add? Show your calculation.

(b) If instead you undersample the negative cases in the 35+ group, how many negative records would you remove? What are the risks of this approach?

(c) If you use reweighting instead of resampling, what weight should you assign to positive cases in the 35+ group relative to positive cases in the Under 35 group? Show the calculation.

(d) Which approach (oversampling, undersampling, or reweighting) would you recommend for a dataset of 10,000 records? What about 500 records? Explain why dataset size matters.
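The algebra in parts (a) and (c) reduces to solving a one-variable equation for the target rate. A sketch for sanity-checking your work (the functions and their names are our own, not from the chapter):

```python
def oversample_needed(pos, neg, target_rate):
    """Extra positive records required so (pos + x) / (pos + x + neg) == target_rate."""
    return (target_rate * (pos + neg) - pos) / (1 - target_rate)

def positive_weight(pos, neg, target_rate):
    """Weight on positive cases so the weighted hire rate equals target_rate,
    assuming negatives (and the other group's cases) keep weight 1.
    Solves pos * w / (pos * w + neg) = target_rate for w."""
    return target_rate * neg / ((1 - target_rate) * pos)

# 35+ group from the exercise: 100 hired, 300 not hired; target rate 60%.
print(oversample_needed(100, 300, 0.60))  # additional "hired" records
print(positive_weight(100, 300, 0.60))    # weight relative to Under-35 positives
```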


Exercise 13: Post-Processing Threshold Analysis

A binary classification model outputs probability scores between 0 and 1. Using a threshold of 0.50, the model produces the following results:

Group     n     Predicted Positive (>0.50)   True Positive Rate   False Positive Rate
-----     ---   --------------------------   ------------------   -------------------
Group A   500   240 (48%)                    0.82                 0.18
Group B   500   150 (30%)                    0.61                 0.09

(a) What is the disparate impact ratio at the 0.50 threshold?

(b) If you lower the threshold for Group B to 0.38, 220 candidates from Group B would be predicted positive (44%). The new TPR would be 0.79 and FPR would be 0.15. Does this pass the four-fifths rule? What happens to equalized odds?

(c) What is the practical concern with using group-specific thresholds in a deployed system? Consider both legal and operational perspectives.

(d) Propose an alternative to group-specific thresholds that achieves a similar fairness improvement without requiring group membership at inference time.
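Parts (b) and (c) concern group-specific thresholds. Given a list of scores, the threshold that yields a target selection rate is just an order statistic of the sorted scores. The sketch below uses hypothetical scores, and ties at the boundary can push the realized rate slightly above the target:

```python
import math

def threshold_for_rate(scores, target_rate):
    """Approximate threshold whose selection rate is at least target_rate."""
    ranked = sorted(scores, reverse=True)
    k = max(1, math.ceil(target_rate * len(ranked)))
    return ranked[k - 1]  # select everyone with score >= this value

# Hypothetical Group B scores; find the threshold giving a ~40% selection rate.
scores_b = [0.91, 0.74, 0.62, 0.55, 0.41, 0.38, 0.30, 0.22, 0.15, 0.08]
t = threshold_for_rate(scores_b, 0.40)
print(t, sum(s >= t for s in scores_b) / len(scores_b))
```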


Section D: Ethical Scenarios (Exercises 14-17)

Exercise 14: The Governance Failure

Re-read the Athena scenario from the chapter opening. Then answer:

(a) List at least five governance failures that allowed the biased model to be deployed. For each, identify who was responsible and what process should have existed.

(b) Ravi halts the model immediately. A senior VP pushes back: "We've already screened 340 resumes. If we stop now and rescreen manually, it'll delay hiring by six weeks during our peak season. The bias isn't that bad — can't we just monitor it going forward?" Write Ravi's response (3-4 sentences).

(c) Three months later, Athena's AI Ethics Board proposes that all AI systems that make or influence decisions about people must undergo a bias audit before deployment. The VP of Engineering objects: "This will slow down our AI development by weeks. Our competitors don't do this." Write a one-page memo (approximately 300 words) to the CEO arguing for or against mandatory pre-deployment bias audits, using evidence from this chapter.


Exercise 15: Competing Fairness Definitions

A university admissions AI system is being evaluated. The following data emerges:

Group     Admitted (%)   Graduation Rate (if admitted)
-----     ------------   -----------------------------
Group A   40%            85%
Group B   25%            78%

The admissions office wants the model to satisfy demographic parity (equal admission rates). The provost wants the model to satisfy calibration (similar success rates for students with the same predicted score, regardless of group). The student government wants equalized odds (equal true positive rates for students who would graduate).

(a) Can the model satisfy all three definitions simultaneously? Why or why not?

(b) For each stakeholder (admissions office, provost, student government), explain why their preferred fairness definition is reasonable from their perspective.

(c) You are the university's Chief Analytics Officer. Write a one-paragraph recommendation for which fairness definition to prioritize, with justification. Acknowledge the tradeoffs explicitly.


Exercise 16: The Whistleblower Dilemma

You are a junior data scientist at a large financial services firm. While conducting routine model monitoring, you discover that the company's AI-powered loan approval system has a disparate impact ratio of 0.62 for Hispanic applicants — well below the four-fifths threshold. You report this to your manager. Your manager says: "The model was approved by compliance last year. Don't make waves. If this gets out, it'll be a PR disaster and people will lose their jobs — possibly including you."

(a) What are your ethical obligations in this scenario? Reference at least one professional code of ethics (e.g., ACM Code of Ethics, IEEE Code of Ethics, or similar).

(b) What are your legal protections if you report this externally? Research the relevant whistleblower protections in US federal law.

(c) Describe three actions you could take, in order from least confrontational to most confrontational. For each, assess the likely outcomes for the affected applicants, for the company, and for you personally.

(d) NK says: "The system is designed to protect itself, not the people it harms." Do you agree? How does organizational culture affect the likelihood that bias discoveries are acted upon?


Exercise 17: Designing an AI Ethics Board

Ravi establishes an AI Ethics Board at Athena Retail Group. Design the board:

(a) Who should be on it? List at least six roles (not just names), including at least two from outside the data science team. Explain why each role is important.

(b) What is the board's scope of authority? Should it have the power to block model deployments, or only make recommendations? Justify your position.

(c) Draft a one-page charter for the board (approximately 300 words) that includes: mission, scope, membership, meeting cadence, decision-making process, and escalation path.

(d) How do you prevent the board from becoming a "rubber stamp" that approves everything without meaningful review? Propose at least three structural safeguards.


Section E: Python Implementation (Exercises 18-20)

Exercise 18: Extending the BiasDetector

Add the following methods to the BiasDetector class:

(a) intersectional_analysis(col_a, col_b) — Calculate selection rates and disparate impact for the intersection of two sensitive attributes (e.g., age_group AND gender). This should handle groups formed by all combinations of values in the two columns.

(b) bias_over_time(timestamps, group_col, window_size) — Given a timestamp column, calculate the disparate impact ratio over a rolling time window to detect bias drift.

(c) feature_importance_by_group(model, X, group_col) — Using SHAP values (or permutation importance), identify which features contribute most to the prediction difference between groups.
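The BiasDetector internals are not repeated here, so the starting point below for part (a) is a standalone function over lists of dicts; turning it into a method mostly means replacing the rows argument with however the class stores its predictions (an assumption about the internals — check the chapter's code). The function name is our own:

```python
def intersectional_selection_rates(rows, col_a, col_b, pred_col="predicted"):
    """Selection rate and DI ratio for every (col_a, col_b) combination.
    rows: list of dicts, e.g. {"age_group": "Under 35", "gender": "F", "predicted": 1}.
    """
    counts, positives = {}, {}
    for r in rows:
        key = (r[col_a], r[col_b])
        counts[key] = counts.get(key, 0) + 1
        positives[key] = positives.get(key, 0) + r[pred_col]
    rates = {k: positives[k] / counts[k] for k in counts}
    reference = max(rates.values())  # highest-rate subgroup as reference
    return {k: {"rate": v, "di_ratio": v / reference} for k, v in rates.items()}
```

Note that intersections can produce very small subgroups, so your method should also report subgroup sizes before anyone acts on the ratios.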


Exercise 19: Building a Bias Dashboard

Using the BiasDetector class and matplotlib, create a multi-panel dashboard (2x2 grid) that displays:

  1. Top-left: Selection rates by group (bar chart with four-fifths threshold line)
  2. Top-right: Confusion matrix heatmaps side by side for two groups
  3. Bottom-left: ROC curves overlaid for each group, with AUC values in the legend
  4. Bottom-right: Calibration curves for each group (predicted probability vs. observed frequency)

The dashboard should accept a BiasDetector instance and a group column name, and produce a single figure saved to a file.
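A scaffold can get the figure layout out of the way so you can focus on the panels. The sketch below draws only the top-left panel and leaves the other three as TODOs; the function name and output filename are placeholders of our choosing:

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering; safe for scripts and CI
import matplotlib.pyplot as plt

def make_dashboard(rates, outfile="bias_dashboard.png"):
    """2x2 dashboard scaffold. rates: dict of group name -> selection rate."""
    fig, axes = plt.subplots(2, 2, figsize=(10, 8))

    # Top-left: selection rates with the four-fifths line (0.8 x highest rate).
    ax = axes[0, 0]
    ax.bar(list(rates), list(rates.values()))
    ax.axhline(0.8 * max(rates.values()), linestyle="--", label="four-fifths line")
    ax.set_title("Selection rates by group")
    ax.legend()

    # Remaining panels: confusion heatmaps, ROC curves, calibration curves.
    for ax in (axes[0, 1], axes[1, 0], axes[1, 1]):
        ax.set_title("TODO")

    fig.tight_layout()
    fig.savefig(outfile)
    return fig
```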


Exercise 20: End-to-End Bias Audit Pipeline

Write a complete Python script that:

  1. Loads the synthetic hiring data (using the generate_hiring_data function from the chapter)
  2. Trains a model on the data
  3. Runs the BiasDetector full audit
  4. Applies threshold adjustment mitigation
  5. Re-runs the audit on the mitigated predictions
  6. Generates a before-and-after comparison report
  7. Saves all visualizations to files

The script should be structured as a command-line tool that accepts parameters for the number of samples, random seed, and which sensitive attributes to audit.
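A minimal argument-parsing scaffold for the command-line interface, assuming argparse; the flag names and defaults are suggestions, not requirements:

```python
import argparse

def build_parser():
    """CLI skeleton for the end-to-end audit pipeline."""
    p = argparse.ArgumentParser(description="End-to-end bias audit pipeline")
    p.add_argument("--n-samples", type=int, default=5000,
                   help="number of synthetic records to generate")
    p.add_argument("--seed", type=int, default=42, help="random seed")
    p.add_argument("--attributes", nargs="+", default=["age_group"],
                   help="sensitive attributes to audit")
    return p

if __name__ == "__main__":
    args = build_parser().parse_args()
    # 1. data = generate_hiring_data(args.n_samples, args.seed)  (from the chapter)
    # 2-7. train, audit, mitigate, re-audit, report, save figures.
    print(args)
```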


Submission Guidelines

  • Section A: Written responses, 2-4 sentences per sub-question unless otherwise specified.
  • Section B: Show all calculations. Express ratios to three decimal places and percentages to one decimal place.
  • Section C: Written responses with specific, actionable recommendations.
  • Section D: Longer-form responses. Engage with the ethical complexity — there are rarely clean answers.
  • Section E: Submit working Python code with comments explaining your design decisions. Code should run without errors on the synthetic data.