Chapter 9: Exercises

Difficulty ratings: ⭐ Basic | ⭐⭐ Applied | ⭐⭐⭐ Analytical | ⭐⭐⭐⭐ Integrative

† Recommended for classroom discussion or written assignment submission


Section A: Comprehension and Vocabulary

Exercise 1 ⭐ Define each of the following terms in one to two sentences, without using the term itself. Use a non-criminal-justice example for each:

a. False positive rate
b. Calibration
c. Demographic parity
d. Equalized odds
e. Base rate


Exercise 2 ⭐ True or False? For each statement, explain why it is true or false in two to three sentences.

a. If a machine learning model is equally accurate for all demographic groups, it satisfies equalized odds.
b. A well-calibrated model cannot have different false positive rates for different demographic groups.
c. The impossibility theorem means algorithmic fairness can never be achieved.
d. Individual fairness and group fairness always require the same adjustments to a model.
e. Choosing demographic parity as a fairness metric is a neutral, value-free technical decision.


Exercise 3 ⭐ Match each fairness metric on the left to the most appropriate domain application on the right. Each domain may match more than one metric, and you should briefly justify each pairing.

Metrics: Demographic Parity | Equalized Odds | Calibration | Equal Opportunity | Counterfactual Fairness

Domains:
a. Parole eligibility recommendation
b. Credit card approval algorithm
c. Medical diagnostic tool for high-risk pregnancy
d. College admissions screening algorithm
e. Fraud detection system that flags transactions for human review


Section B: Confusion Matrix Calculations

Exercise 4 ⭐⭐ † The table below shows prediction outcomes for a hypothetical hiring screening algorithm evaluated on 800 applications. Use the data to answer the questions.

Group A                     Predicted: Advance   Predicted: Reject   Total
Actually Qualified                 160                   40           200
Not Qualified                       80                  320           400
Total                              240                  360           600

Group B                     Predicted: Advance   Predicted: Reject   Total
Actually Qualified                  50                   50           100
Not Qualified                       20                   80           100
Total                               70                  130           200

a. Calculate the overall accuracy for each group.
b. Calculate the false positive rate (FPR) and false negative rate (FNR) for each group.
c. Calculate the true positive rate (TPR) for each group.
d. Calculate the selection rate (rate of positive predictions) for each group.
e. Does this system satisfy demographic parity? Show your calculation.
f. Does this system satisfy equalized odds? Show your calculation.
g. Does this system satisfy equal opportunity? Show your calculation.
h. What is the base rate (proportion of actually qualified applicants) for each group?
i. Using the logic of the Chouldechova impossibility theorem, explain why you would expect some fairness metric to be violated given the base rates you found in part (h).
j. If you were the hiring company's ethics officer, which fairness metric would you prioritize, and why? Who would you consult before making this decision?
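Readers who want to check their arithmetic for parts (a) through (h) can compute every rate directly from the four cells of each group's confusion matrix. The helper below is a minimal sketch (the function and dictionary keys are our own naming, not from any fairness toolkit), and it does not replace working the calculations by hand:

```python
def group_metrics(tp, fn, fp, tn):
    """Compute basic fairness-related rates from one group's confusion matrix."""
    pos = tp + fn          # actually qualified
    neg = fp + tn          # actually not qualified
    total = pos + neg
    return {
        "accuracy": (tp + tn) / total,
        "tpr": tp / pos,                   # true positive rate (sensitivity)
        "fnr": fn / pos,                   # false negative rate
        "fpr": fp / neg,                   # false positive rate
        "selection_rate": (tp + fp) / total,
        "base_rate": pos / total,
    }

# Exercise 4 cell counts, in (TP, FN, FP, TN) order.
group_a = group_metrics(160, 40, 80, 320)
group_b = group_metrics(50, 50, 20, 80)

# Compare rates across groups, e.g. FPR and TPR for the equalized-odds check.
print(group_a["fpr"], group_b["fpr"])
print(group_a["tpr"], group_b["tpr"])
```

The same function also covers Exercise 5's table (read off TP, FN, FP, TN for Groups C and D) and, via the raw FN and FP counts, the treatment equality ratio in Exercise 6.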


Exercise 5 ⭐⭐ The following data describes a hypothetical medical screening algorithm for a serious condition. Group C is an elderly population; Group D is a younger population.

Group C                     Predicted: High Risk   Predicted: Low Risk   Total
Has Condition                       85                     15             100
No Condition                        40                    360             400

Group D                     Predicted: High Risk   Predicted: Low Risk   Total
Has Condition                       30                     20              50
No Condition                        25                    425             450

a. Calculate TPR, FPR, FNR, and precision for each group.
b. Which group has worse sensitivity (TPR)? What are the clinical implications of this difference?
c. Which group has the higher false positive rate? What are the clinical implications?
d. Given the medical context, which error type (false positive vs. false negative) is more costly? How does your answer affect which fairness metric you would prioritize?
e. Is this system more fair by equalized odds or by equal opportunity? Calculate both to support your answer.


Exercise 6 ⭐⭐ Using only the data in Exercise 4, calculate the treatment equality metric (FN/FP ratio) for each group. Is treatment equality satisfied? Interpret what the ratio means in the context of a hiring algorithm.


Section C: Conceptual Analysis

Exercise 7 ⭐⭐ † The Chouldechova impossibility theorem states that calibration, equal false positive rates, and equal false negative rates cannot simultaneously hold when base rates differ. Using a concrete numerical example (you may make up numbers), show how these three metrics conflict. Your example should:

- Specify different base rates for two groups
- Show that if calibration holds, equal FPR and FNR cannot both hold
- Explain in plain language what this means for a decision-maker choosing a fairness metric
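One way to sanity-check a hand-built example is Chouldechova's identity, which ties the quantities together: for a classifier with positive predictive value PPV, FPR = [p / (1 − p)] × (1 − FNR) × (1 − PPV) / PPV, where p is the group's base rate. Holding PPV (a stand-in for calibration) and FNR equal across groups, different base rates force different FPRs. The checker below is a sketch with illustrative numbers of our own choosing; use it only to verify your own example, not to replace it:

```python
def implied_fpr(base_rate, ppv, fnr):
    """FPR implied by Chouldechova's identity, given base rate, PPV, and FNR."""
    p = base_rate
    return p / (1 - p) * (1 - fnr) * (1 - ppv) / ppv

# Hold PPV and FNR fixed; vary only the base rate between two groups.
ppv, fnr = 0.7, 0.2
fpr_low = implied_fpr(0.2, ppv, fnr)    # group with a 20% base rate
fpr_high = implied_fpr(0.4, ppv, fnr)   # group with a 40% base rate

# Equal PPV and equal FNR force unequal FPRs when base rates differ.
print(round(fpr_low, 3), round(fpr_high, 3))
```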


Exercise 8 ⭐⭐ Consider a content moderation algorithm used by a major social media platform to detect "toxic speech." The algorithm is deployed globally and is evaluated for fairness across users from different national and cultural backgrounds.

a. Identify at least three reasons why "toxic speech" is a particularly difficult category for algorithmic fairness analysis.
b. Which fairness metric (demographic parity, equalized odds, or calibration) would you argue is most appropriate for this application? Defend your choice.
c. The platform discovers that users from Country X are flagged at three times the rate of users from Country Y, even after controlling for detectable differences in content. What are the three most likely explanations for this disparity, and how would you investigate each?
d. What would meaningful community engagement look like for this application? Who would you include, and what questions would you ask?


Exercise 9 ⭐⭐ Individual fairness requires that similar individuals be treated similarly. For each of the following scenarios, identify at least two reasons why defining "similar" is difficult or contested:

a. Two loan applicants with the same current credit score but different histories of past incarceration
b. Two candidates for a software engineering job with the same years of experience but degrees from different types of universities
c. Two patients with the same medical diagnosis but living in different neighborhoods
d. Two defendants charged with the same crime but with different immigration statuses


Exercise 10 ⭐⭐⭐ Northpointe's defense of COMPAS rested on calibration; ProPublica's critique rested on false positive rate disparity. Write a 400-word analysis (suitable for a memo to a non-technical executive) that:

- Explains both positions accurately
- Explains why both are correct within their own terms
- Explains the impossibility theorem in plain language
- Recommends which metric should be prioritized in criminal justice applications, with reasoning
- Identifies the stakeholders who should be involved in making this recommendation


Exercise 11 ⭐⭐⭐ † Intersectional fairness analysis requires evaluating performance for subgroups defined by combinations of protected attributes. Consider a hiring algorithm used by a technology company.

a. List all subgroups you would evaluate if you have data on race (3 categories: White, Black, Other), gender (3 categories: Man, Woman, Non-binary), and age group (2 categories: Under 40, 40+). How many subgroups is this?
b. The company's dataset has 3,000 applicants distributed roughly proportionally across all subgroups. What is the average subgroup size? Why does this matter for statistical analysis?
c. You find that the algorithm satisfies equalized odds separately for race and for gender, but Black women experience a true positive rate 15 percentage points lower than white men. Is this a fairness violation? What does it tell us about single-axis analysis?
d. What recommendations would you make to the company about how to address the intersectional disparity?


Section D: Applied Scenarios

Exercise 12 ⭐⭐ A county government is considering deploying an automated system to prioritize child welfare investigations. The system would assign a risk score to each case, and investigators would prioritize cases with higher scores. Child welfare agencies are already understaffed and must ration investigative resources.

a. Identify the four potential outcomes of the confusion matrix (TP, TN, FP, FN) in plain language for this application.
b. Who bears the cost of false positives? Who bears the cost of false negatives?
c. Which fairness metric would you prioritize, and why?
d. What data would you need to compute fairness metrics for this system?
e. Identify two factors that might make it difficult to collect the demographic data needed for fairness analysis in this context.


Exercise 13 ⭐⭐ † A financial services company deploys an algorithm that recommends whether customers should be offered a credit limit increase. An internal audit reveals:

- White customers are offered increases at a rate of 35%
- Black customers are offered increases at a rate of 22%
- The model is calibrated: a given score corresponds to the same observed rate of the predicted outcome in both groups (e.g., customers scored 70% turn out to be good candidates for an increase about 70% of the time, whether White or Black)

a. Is there evidence of a demographic parity violation? Calculate the disparity ratio and apply the four-fifths rule.
b. The company argues that the disparity is explained by credit score differences: Black customers have lower average credit scores, which are the primary driver of the algorithm's recommendations. Is this argument legally sufficient under ECOA? Is it ethically sufficient?
c. A researcher suggests that the credit score gap is itself a product of historical lending discrimination (denying mortgages to Black families reduced their ability to build equity, which affected their financial stability, which affected their credit scores). How does this argument change your ethical analysis?
d. What remediation steps would you recommend? Consider both technical and non-technical options.
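The disparity-ratio calculation in part (a) follows a pattern that generalizes to any audit: divide the disadvantaged group's selection rate by the advantaged group's, and flag the result if it falls below four-fifths. A minimal sketch (the function name and return shape are our own; the rates are the ones given in the exercise):

```python
def four_fifths_check(rate_disadvantaged, rate_advantaged, threshold=0.8):
    """Return the selection-rate ratio and whether it falls below the threshold."""
    ratio = rate_disadvantaged / rate_advantaged
    return ratio, ratio < threshold

# Rates from the internal audit: 22% (Black customers) vs. 35% (White customers).
ratio, flagged = four_fifths_check(0.22, 0.35)
print(round(ratio, 3), flagged)
```

Note that passing or failing this check is only the start of the analysis; parts (b) through (d) ask what the ratio does and does not explain.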


Exercise 14 ⭐⭐⭐ A large hospital system uses a machine learning model to predict which patients are at risk of hospital readmission within 30 days of discharge. High-risk patients receive enhanced care coordination services (follow-up calls, home nurse visits, prescription delivery).

It is discovered that the model assigns lower risk scores to Black patients than to white patients with similar clinical profiles. The original model was trained on historical data in which Black patients used fewer health care services after discharge — not because they needed fewer services, but because they had less access to transportation, fewer follow-up appointments, and worse insurance coverage.

a. What type of feedback loop does this illustrate?
b. What fairness metric would best capture the disparity described?
c. If the model is recalibrated on outcome data (30-day readmission rates), will this fix the fairness problem? Why or why not?
d. What modifications to the training data or model design might address the underlying problem?
e. What are the ethical obligations of the hospital system once this disparity is discovered?


Exercise 15 ⭐⭐⭐⭐ † Design a fairness monitoring program for an automated resume screening tool used by a company with 500 employees that hires approximately 100 new employees per year. Your program should include:

a. A statement of which fairness metrics you will monitor, and why you chose those metrics over alternatives
b. What demographic data you will collect, how you will collect it, and what legal constraints apply
c. At what thresholds you will trigger review (specify thresholds numerically and explain your reasoning)
d. Who within the organization is responsible for each component of the monitoring program
e. What your escalation process is when a threshold is breached
f. How you will document fairness choices and make them available for external review
g. How you will communicate the results of fairness monitoring to: (i) the executive team, (ii) the HR team that uses the tool, (iii) applicants
h. The two most likely ways this monitoring program could fail, and how you would guard against each
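In practice, the threshold-trigger component in part (c) reduces to periodic metric computation plus comparisons against the thresholds you chose. The sketch below shows one possible shape for that component; the metric names and threshold values are illustrative placeholders, not recommendations:

```python
# Illustrative threshold-trigger component for a fairness monitoring program.
# Metric names and threshold values are placeholders, not recommendations.
THRESHOLDS = {
    "selection_rate_ratio_min": 0.80,   # four-fifths rule on selection rates
    "tpr_gap_max": 0.05,                # maximum allowed TPR gap between groups
}

def check_thresholds(metrics):
    """Return the list of alerts triggered in one monitoring period."""
    alerts = []
    if metrics["selection_rate_ratio"] < THRESHOLDS["selection_rate_ratio_min"]:
        alerts.append("selection-rate ratio below four-fifths threshold")
    if metrics["tpr_gap"] > THRESHOLDS["tpr_gap_max"]:
        alerts.append("TPR gap exceeds allowed maximum")
    return alerts

# One hypothetical monitoring period.
period = {"selection_rate_ratio": 0.76, "tpr_gap": 0.03}
print(check_thresholds(period))
```

A real program would wrap this in the governance pieces the exercise asks for: who computes the metrics, who receives the alerts, and what the escalation path is.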


Section E: Research and Extension

Exercise 16 ⭐⭐ Research one of the following fairness toolkits: AI Fairness 360 (IBM), Fairlearn (Microsoft), or Aequitas (University of Chicago). Write a 250-word summary covering:

- What fairness metrics the toolkit supports
- What mitigation algorithms it offers
- What kinds of datasets and model types it works with
- One specific use case documented in the toolkit's examples


Exercise 17 ⭐⭐ The 80% rule (four-fifths rule) is the standard threshold used in US employment discrimination law to assess disparate impact. Look up its origin and scope, then answer:

a. In what context was the 80% rule originally developed?
b. What are two criticisms of using 80% as the threshold?
c. Should the same threshold apply to an algorithm used in healthcare triage as in employment screening? Why or why not?


Exercise 18 ⭐⭐⭐ Read the abstract and key findings sections of Chouldechova's 2017 paper "Fair Prediction with Disparate Impact" (available freely online). Then answer:

a. What is the specific mathematical relationship Chouldechova identifies between calibration, false positive rates, false negative rates, and base rates?
b. How does Chouldechova apply this relationship to the COMPAS dataset specifically?
c. What policy implications does Chouldechova draw from her findings? Do you agree with her framing?


Exercise 19 ⭐⭐⭐ The Global Variation theme: Different countries approach algorithmic fairness differently.

Research the EU AI Act's requirements for "high-risk AI systems" in the areas of credit scoring and criminal justice. Then compare with US regulatory requirements under ECOA and Title VII. Write a 400-word comparison covering:

- What fairness analysis is required before deployment
- What ongoing monitoring is required
- What transparency and explanation requirements exist
- Which framework you believe offers stronger protection, and why


Exercise 20 ⭐⭐⭐⭐ † The following scenario is designed to test your ability to apply the full chapter framework.

Scenario: A state department of motor vehicles (DMV) is considering deploying an algorithmic system to prioritize license reinstatement applications. The system predicts the probability that a driver whose license was revoked for DUI will be involved in a serious accident within two years of reinstatement. High-probability cases are deprioritized; low-probability cases move to the front of the queue for expedited review.

You have been hired as the ethics consultant. You have access to the following information:

- Base rates of post-reinstatement accidents differ by age group (younger drivers have higher rates), by gender (men have higher rates), and by geography (rural drivers have lower rates despite longer commutes)
- The algorithm uses variables including prior accident history, prior DUI offenses, employment status, and residential zip code
- The system's vendor claims it is "fair" because it has equal accuracy (75%) for all demographic groups

Write a memo to the DMV commissioner (2–3 pages, structured document) that:

1. Explains why "equal accuracy" is insufficient to evaluate fairness
2. Identifies which protected groups face the greatest potential for disparate impact and why
3. Specifies which fairness metrics you recommend computing and why
4. Explains the impossibility trade-off the DMV will face in choosing metrics
5. Recommends a process for making the metric choice democratically and accountably
6. Specifies minimum data collection and monitoring requirements
7. Identifies two aspects of the vendor's design that require further scrutiny


Exercise 21 ⭐⭐ Suppose a company claims its new hiring algorithm achieves "fairness" because it has the same false positive rate for all groups. However, the system has a true positive rate of 80% for Group A and only 55% for Group B.

a. What fairness criterion is satisfied?
b. What fairness criterion is violated?
c. In a hiring context, what does a lower true positive rate for Group B mean in practical terms?
d. Is it possible for a system to satisfy equal FPR but not equal TPR? Under what mathematical conditions would this occur?


Exercise 22 ⭐⭐ Consider an algorithm used by a streaming platform to recommend movies. The algorithm is found to recommend a lower proportion of films made by directors of color to users identified as white.

a. Is this a fairness problem? From whose perspective?
b. Which fairness metric or metrics are most relevant here?
c. How would you measure the disparity formally?
d. The platform argues that the algorithm is just giving users what they have watched in the past — it is being personalized, not discriminatory. Evaluate this argument.


Exercise 23 ⭐⭐⭐ † Power and Accountability theme: Who should be empowered to choose the fairness metric for a risk assessment tool used in immigration court to predict flight risk?

Write a 300-word structured argument that:

- Identifies the key stakeholders and their interests
- Evaluates the fairness metric that best serves each stakeholder group's interests
- Recommends a governance process for making the metric selection decision
- Identifies what accountability mechanisms should accompany the tool's deployment


Exercise 24 ⭐⭐ The concept of "algorithmic affirmative action" — deliberately adjusting a model's outputs to produce more equitable outcomes for disadvantaged groups — is controversial.

a. Is demographic parity a form of algorithmic affirmative action? Explain.
b. Some legal scholars argue that adjusting model thresholds differently by demographic group constitutes illegal discrimination. Others argue it is justified as remediation for historical bias. Summarize both arguments.
c. How does the EU approach this question differently from the US approach?


Exercise 25 ⭐⭐⭐⭐ Capstone Exercise: Return to the COMPAS controversy. Using everything you have learned in this chapter, write a 500-word position paper that takes a clear position on the following question:

"Should Northpointe's COMPAS system be considered fair or unfair, and who should have the authority to make this determination?"

Your paper must:

- Reference the specific fairness metrics discussed in Sections 9.3 and 9.4
- Apply the impossibility theorem to explain why the question does not have a purely technical answer
- Identify the relevant stakeholders and characterize whose interests each metric serves
- State and defend your position on which fairness metric should be used in criminal justice risk assessment
- Address the accountability question: what institutional process should be used to make this determination going forward


Answer guides for computational exercises (4, 5, 6) are available in the instructor's resource package.