Chapter 31: Exercises
Exercises are graded by difficulty:

- One star (*): Apply the technique from the chapter to a new dataset or scenario
- Two stars (**): Extend the technique or combine it with a previous chapter's methods
- Three stars (***): Derive a result, implement from scratch, or design a system component
- Four stars (****): Research-level problems that connect to open questions in the field
Fairness Metrics
Exercise 31.1 (*)
A hiring model scores applicants on a 0-1 scale. The company uses a threshold of 0.6 to recommend candidates for interviews. The following confusion matrix data is available for two groups (group 0: male, group 1: female):
| | Group 0 (n=2000) | Group 1 (n=1500) |
|---|---|---|
| True Positives | 600 | 300 |
| False Positives | 200 | 180 |
| True Negatives | 900 | 720 |
| False Negatives | 300 | 300 |
(a) Compute the selection rate, TPR, FPR, PPV, and FNR for each group.
(b) Does this model satisfy demographic parity? Equal opportunity? Equalized odds? Predictive parity? Show your calculations.
(c) Does this model pass the four-fifths rule? What is the disparate impact ratio?
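The required rates all follow directly from the confusion-matrix cells. As a starting point, a small helper (not part of the chapter's code) that computes them for each group:

```python
def rates(tp, fp, tn, fn):
    """Confusion-matrix-derived rates for one group."""
    n = tp + fp + tn + fn
    return {
        "selection_rate": (tp + fp) / n,  # P(Y_hat = 1)
        "tpr": tp / (tp + fn),            # recall; basis of equal opportunity
        "fpr": fp / (fp + tn),
        "ppv": tp / (tp + fp),            # precision; basis of predictive parity
        "fnr": fn / (tp + fn),
    }

group0 = rates(tp=600, fp=200, tn=900, fn=300)
group1 = rates(tp=300, fp=180, tn=720, fn=300)
# Disparate impact ratio: lower selection rate over higher selection rate
disparate_impact = group1["selection_rate"] / group0["selection_rate"]
```

Comparing the two dictionaries metric by metric answers parts (b) and (c).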
Exercise 31.2 (*)
Using the FairnessMetrics class from Section 31.2, compute all fairness metrics for the following synthetic credit scoring scenario:
```python
import numpy as np

np.random.seed(42)
n = 5000
sensitive = np.random.choice([0, 1], size=n, p=[0.6, 0.4])
# Group 0: higher base rate of repayment
base_rate = np.where(sensitive == 0, 0.85, 0.70)
y_true = (np.random.random(n) < base_rate).astype(int)
# Model score correlates with true label but is noisier for group 1
noise = np.where(sensitive == 0, 0.15, 0.25)
y_score = np.clip(
    y_true * 0.7 + (1 - y_true) * 0.3 + np.random.normal(0, noise, n),
    0, 1,
)
y_pred = (y_score >= 0.5).astype(int)
```
(a) Report the group-level confusion matrix rates using FairnessMetrics.group_metrics().
(b) Report demographic parity difference, equalized odds difference, equal opportunity difference, and predictive parity difference.
(c) Which fairness criterion shows the largest violation? Explain why, given the data generation process.
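The FairnessMetrics class is defined in Section 31.2. If you are working through this exercise standalone, a minimal stand-in for the part (b) differences might look like the sketch below (binary groups and predictions assumed; the function name is illustrative, not the chapter's API):

```python
import numpy as np

def fairness_differences(y_true, y_pred, sensitive):
    """Group-0-minus-group-1 differences for the part (b) metrics.

    Assumes both groups contain positive and negative examples and
    at least one positive prediction.
    """
    rates = {}
    for g in (0, 1):
        m = sensitive == g
        yt, yp = y_true[m], y_pred[m]
        tp = np.sum((yt == 1) & (yp == 1))
        fp = np.sum((yt == 0) & (yp == 1))
        fn = np.sum((yt == 1) & (yp == 0))
        tn = np.sum((yt == 0) & (yp == 0))
        rates[g] = {
            "sel": yp.mean(),
            "tpr": tp / (tp + fn),
            "fpr": fp / (fp + tn),
            "ppv": tp / (tp + fp),
        }
    return {
        "demographic_parity_diff": rates[0]["sel"] - rates[1]["sel"],
        "equal_opportunity_diff": rates[0]["tpr"] - rates[1]["tpr"],
        # Equalized odds: the larger of the TPR and FPR gaps
        "equalized_odds_diff": max(
            abs(rates[0]["tpr"] - rates[1]["tpr"]),
            abs(rates[0]["fpr"] - rates[1]["fpr"]),
        ),
        "predictive_parity_diff": rates[0]["ppv"] - rates[1]["ppv"],
    }
```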
Exercise 31.3 (*)
A recidivism prediction model produces risk scores for defendants. The model is calibrated by group: when the model says "60% risk," about 60% of defendants in both groups actually recidivate. However, the selection rate at the common threshold (defendants with risk > 0.5 are detained) differs: 45% of group 0 and 30% of group 1 are detained.
(a) Is this model consistent with demographic parity? Explain.
(b) Can the model simultaneously be calibrated by group and satisfy demographic parity without changing the model itself? Why or why not?
(c) A policy-maker proposes using different thresholds for each group to equalize detention rates. What fairness criterion does this satisfy? What criterion does it violate? Use the impossibility theorem to explain.
Exercise 31.4 (**)
Extend the FairnessMetrics class to support conditional demographic parity: the selection rate should be equal across groups, conditioned on a legitimate risk factor $L$ (e.g., credit score tier).
(a) Define conditional demographic parity formally: $P(\hat{Y} = 1 \mid A = a, L = l) = P(\hat{Y} = 1 \mid A = a', L = l)$ for all $a, a', l$.
(b) Implement a method conditional_demographic_parity(self, legitimate_factor: np.ndarray) -> pd.DataFrame that computes the selection rate by group within each level of the legitimate factor.
(c) Why might conditional demographic parity be more appropriate than unconditional demographic parity in credit scoring? What is the danger of conditioning on a feature that is itself a product of historical discrimination?
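One possible shape for the part (b) method, written here as a standalone function rather than a class method (the return format is a suggestion, not the chapter's specification):

```python
import numpy as np
import pandas as pd

def conditional_demographic_parity(y_pred, sensitive, legitimate_factor):
    """Selection rate by group within each level of a legitimate factor.

    Returns one row per factor level, one column per group, plus the
    within-level gap between the best- and worst-selected groups.
    Conditional demographic parity holds when every gap is (near) zero.
    """
    df = pd.DataFrame({
        "pred": np.asarray(y_pred),
        "group": np.asarray(sensitive),
        "level": np.asarray(legitimate_factor),
    })
    table = (
        df.groupby(["level", "group"])["pred"]
        .mean()
        .unstack("group")
    )
    table["max_gap"] = table.max(axis=1) - table.min(axis=1)
    return table
```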
The Impossibility Theorem
Exercise 31.5 (*)
Using the demonstrate_impossibility() function from Section 31.6.3, explore the impossibility theorem with the following parameter settings:
(a) br_0 = 0.05, br_1 = 0.30, fpr = 0.03, fnr = 0.15. Report the PPV gap. Is it larger or smaller than the example in the chapter? Why?
(b) br_0 = 0.20, br_1 = 0.22, fpr = 0.05, fnr = 0.20. Report the PPV gap. What happens as the base rates converge?
(c) Find values of fpr and fnr such that the PPV gap between groups with br_0 = 0.10 and br_1 = 0.25 is minimized (but nonzero). What does this tell you about the relationship between classifier quality and the severity of the impossibility?
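The demonstrate_impossibility() function is given in Section 31.6.3. If you do not have it handy, the PPV for a group follows from Bayes' rule given its base rate and the classifier's error rates, so the gap can be computed directly (this sketch assumes both groups share the same fpr and fnr, as the exercise's parameterization implies):

```python
def ppv(base_rate, fpr, fnr):
    """P(Y = 1 | Y_hat = 1) via Bayes' rule, given the group's base rate
    and the classifier's error rates."""
    tpr = 1.0 - fnr
    return (tpr * base_rate) / (tpr * base_rate + fpr * (1.0 - base_rate))

def ppv_gap(br_0, br_1, fpr, fnr):
    return abs(ppv(br_0, fpr, fnr) - ppv(br_1, fpr, fnr))

# Part (a): very different base rates, a fairly accurate classifier
gap_a = ppv_gap(0.05, 0.30, fpr=0.03, fnr=0.15)
# Part (b): nearly equal base rates
gap_b = ppv_gap(0.20, 0.22, fpr=0.05, fnr=0.20)
```

For part (c), sweeping fpr and fnr over a grid and recording ppv_gap(0.10, 0.25, fpr, fnr) shows how classifier quality governs the severity of the gap.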
Exercise 31.6 (**)
The impossibility theorem assumes binary classification. Does it extend to continuous scores?
(a) Consider a model that outputs probabilities $S \in [0, 1]$ rather than binary predictions. Show that if the model is perfectly calibrated by group ($P(Y=1 \mid S=s, A=a) = s$ for all $s, a$), then any single threshold $t$ applied to $S$ will generally not satisfy equalized odds when base rates differ.
(b) Show that for a perfectly calibrated model, group-specific thresholds can achieve equalized odds. What must the relationship between $t_0$ and $t_1$ be?
(c) Explain the practical tension: a regulator requires both calibration (the score must mean the same thing for everyone) and a single threshold (the same rules for everyone). The impossibility theorem implies these requirements conflict when base rates differ. How should a practitioner respond?
Exercise 31.7 (***)
Prove that the impossibility theorem holds for the multi-group case. Specifically, for $K > 2$ groups with distinct base rates, show that calibration plus equalized odds implies at most one group can have a non-degenerate classifier (nonzero error rate). Present your proof step by step.
Pre-Processing, In-Processing, and Post-Processing
Exercise 31.8 (*)
Using the compute_reweighing_weights() function from Section 31.8.1, compute the reweighing weights for the following scenario:
| Group | $Y=0$ count | $Y=1$ count |
|---|---|---|
| $A=0$ | 400 | 600 |
| $A=1$ | 350 | 150 |
(a) Compute the weight for each (Y, A) cell. Which cell receives the largest weight? Why?
(b) Train a logistic regression model on the unweighted data and compute the demographic parity difference. Then retrain using the reweighing weights (via the sample_weight parameter) and recompute. How much does the demographic parity difference change?
(c) Compute the accuracy on unweighted and weighted models. Quantify the fairness-accuracy tradeoff.
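The compute_reweighing_weights() function comes from Section 31.8.1. For part (a), a standalone equivalent using the standard reweighing formula $w(a, y) = P(A{=}a)\,P(Y{=}y) / P(A{=}a, Y{=}y)$ (Kamiran and Calders):

```python
def reweighing_weights(counts):
    """counts: dict mapping (a, y) -> cell count.
    Returns dict mapping (a, y) -> weight P(A=a) P(Y=y) / P(A=a, Y=y)."""
    n = sum(counts.values())
    p_a, p_y = {}, {}
    for (a, y), c in counts.items():
        p_a[a] = p_a.get(a, 0.0) + c / n
        p_y[y] = p_y.get(y, 0.0) + c / n
    return {
        (a, y): p_a[a] * p_y[y] / (c / n)
        for (a, y), c in counts.items()
    }

weights = reweighing_weights({
    (0, 0): 400, (0, 1): 600,
    (1, 0): 350, (1, 1): 150,
})
```

Cells that are under-represented relative to independence of $A$ and $Y$ receive weights above 1, which is the key to part (a).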
Exercise 31.9 (**)
Implement a learning fair representations (LFR) pre-processing method (Zemel et al., 2013). The idea is to learn an intermediate representation $Z$ of the features $X$ that:
- Preserves information about $Y$ (is predictive)
- Removes information about $A$ (is fair)
(a) Implement a simple version using an autoencoder with an adversarial branch. The encoder maps $X \to Z$. The decoder reconstructs $X$ from $Z$. The adversary predicts $A$ from $Z$. The encoder is trained to minimize reconstruction loss and maximize adversary loss.
(b) Compare the representation learned by LFR with the original features by measuring (i) mutual information between $Z$ and $A$ (should be low), (ii) mutual information between $Z$ and $Y$ (should be high), and (iii) downstream model accuracy and fairness when trained on $Z$ vs. $X$.
(c) What is the limitation of LFR when the protected attribute is correlated with the legitimate predictive signal? When is LFR preferable to reweighing?
Exercise 31.10 (**)
Compare the three post-processing methods from Section 31.10 on the synthetic credit scoring data from Exercise 31.2:
(a) Apply find_equalized_odds_thresholds() with accuracy_floor=0.70. Report the before and after equalized odds difference and accuracy.
(b) Apply reject_option_classification() with margin=0.15. Report the before and after demographic parity difference and accuracy.
(c) Which method achieves a better fairness-accuracy tradeoff for this dataset? Under what conditions would you prefer one over the other?
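The two methods above are defined in Section 31.10. For intuition about what find_equalized_odds_thresholds() is doing, here is a brute-force sketch (a hypothetical helper, not the chapter's implementation) that grid-searches one threshold per group to minimize the equalized odds gap subject to an accuracy floor:

```python
import numpy as np

def search_group_thresholds(y_true, y_score, sensitive, accuracy_floor=0.70):
    """Grid-search per-group thresholds, minimizing the equalized odds
    gap (max of |TPR gap| and |FPR gap|) subject to an accuracy floor."""
    def evaluate(t0, t1):
        thr = np.where(sensitive == 0, t0, t1)
        y_pred = (y_score >= thr).astype(int)
        acc = float(np.mean(y_pred == y_true))
        rates = []
        for g in (0, 1):
            m = sensitive == g
            yt, yp = y_true[m], y_pred[m]
            tpr = yp[yt == 1].mean() if np.any(yt == 1) else 0.0
            fpr = yp[yt == 0].mean() if np.any(yt == 0) else 0.0
            rates.append((tpr, fpr))
        gap = max(abs(rates[0][0] - rates[1][0]),
                  abs(rates[0][1] - rates[1][1]))
        return gap, acc

    best = None
    for t0 in np.linspace(0.05, 0.95, 19):
        for t1 in np.linspace(0.05, 0.95, 19):
            gap, acc = evaluate(t0, t1)
            if acc >= accuracy_floor and (best is None or gap < best[0]):
                best = (gap, acc, float(t0), float(t1))
    return best  # (gap, accuracy, t0, t1), or None if the floor is infeasible
```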
Exercise 31.11 (**)
Train an adversarial debiasing model (Section 31.9.2) on the Adult Income dataset (predict income > $50K, protected attribute: sex).
(a) Train the model with adversary_loss_weight values of 0.0, 0.5, 1.0, 2.0, and 5.0. For each, report accuracy and demographic parity difference.
(b) Plot the fairness-accuracy frontier. At what value of $\lambda$ does the marginal accuracy cost per unit of fairness improvement increase sharply?
(c) Compare the adversarial debiasing results with Fairlearn's ExponentiatedGradient using the DemographicParity() constraint on the same dataset. Which achieves a better Pareto frontier?
Exercise 31.12 (***)
Implement calibrated equalized odds post-processing (Pleiss et al., 2017). The key insight is that if a model is well-calibrated, post-processing to achieve equalized odds only requires randomizing the prediction in specific score regions.
(a) Implement the algorithm: for each group, find the score regions where predictions must be randomized (flipped with some probability) to equalize TPR and FPR across groups, subject to the constraint that calibration is preserved within the randomization.
(b) Apply it to a calibrated credit scoring model. Compare the calibration curves before and after. Is calibration preserved? How does this compare to the simple threshold adjustment method?
(c) What is the accuracy cost compared to threshold adjustment? What does this tell you about the impossibility theorem in practice — can we "nearly" satisfy both calibration and equalized odds?
Fairness Auditing with Fairlearn and AIF360
Exercise 31.13 (*)
Using the fairlearn_audit() function from Section 31.11.1, audit a gradient-boosted tree model trained on the German Credit dataset.
(a) Train an XGBoost model to predict credit risk. Compute the Fairlearn audit with sensitive = age (binarized at 25) and sensitive = sex.
(b) Which protected attribute shows larger disparities? For which metrics?
(c) Compute the intersectional audit using intersectional_fairness_audit() with both age and sex. Are there intersectional groups that are worse-served than either single-attribute analysis reveals?
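The intersectional_fairness_audit() function is defined in Section 31.11.1. Its core computation can be reproduced with a pandas groupby over the crossed attributes; a sketch (the disparate-impact column is normalized against the best-selected intersection, one of several reasonable conventions):

```python
import numpy as np
import pandas as pd

def intersectional_selection_rates(y_pred, attrs):
    """Selection rate and count for every intersection of the protected
    attributes in `attrs` (a dict mapping attribute name -> array)."""
    df = pd.DataFrame(attrs)
    df["pred"] = np.asarray(y_pred)
    audit = (
        df.groupby(list(attrs))["pred"]
        .agg(selection_rate="mean", n="count")
        .reset_index()
    )
    # Disparate impact of each intersection vs. the best-selected one
    audit["di_ratio_vs_max"] = (
        audit["selection_rate"] / audit["selection_rate"].max()
    )
    return audit
```

Intersections with low di_ratio_vs_max but healthy single-attribute ratios are exactly the part (c) finding.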
Exercise 31.14 (**)
Build a complete fairness audit pipeline that integrates with the ML pipeline from Chapter 27.
(a) Write a Dagster asset that computes the Fairlearn MetricFrame for a trained model and stores the results as a structured artifact.
(b) Write a validation check that fails the pipeline if the demographic parity ratio drops below 0.80 or the equalized odds difference exceeds 0.10.
(c) Write a monitoring asset that computes fairness metrics for each scoring batch and appends results to a time-series table. Generate a visualization showing fairness metric trends over time.
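The validation check in part (b) reduces to a pure function that the Dagster check can call; keeping the thresholds out of the orchestration code makes the rule testable on its own. A sketch, assuming the metric values are computed upstream (the metric key names are illustrative):

```python
def fairness_gate(metrics, dp_ratio_min=0.80, eo_diff_max=0.10):
    """Return (passed, failures) for the part (b) thresholds.

    metrics: dict with keys "demographic_parity_ratio" and
    "equalized_odds_difference" (names are illustrative).
    """
    failures = []
    if metrics["demographic_parity_ratio"] < dp_ratio_min:
        failures.append(
            f"demographic parity ratio "
            f"{metrics['demographic_parity_ratio']:.3f} < {dp_ratio_min}"
        )
    if metrics["equalized_odds_difference"] > eo_diff_max:
        failures.append(
            f"equalized odds difference "
            f"{metrics['equalized_odds_difference']:.3f} > {eo_diff_max}"
        )
    return (len(failures) == 0, failures)
```

The Dagster asset check would call this and raise (or emit a failed AssetCheckResult) when `passed` is False.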
Exercise 31.15 (**)
Compare Fairlearn and AIF360 on the same dataset (COMPAS or Adult Income).
(a) Compute the following metrics using both libraries: statistical parity difference, disparate impact ratio, equal opportunity difference, and average odds difference. Verify that the results match (or explain any differences).
(b) Apply Fairlearn's ExponentiatedGradient with EqualizedOdds() and AIF360's CalibratedEqOddsPostprocessing. Compare the fairness and accuracy of the resulting models.
(c) Which library would you recommend for a production deployment? Justify your choice based on API design, maintainability, and integration with existing ML infrastructure.
Organizational Practice
Exercise 31.16 (*)
Draft a fairness metric selection document for the following scenario: a health insurance company builds a model to predict which patients will benefit most from a proactive care management program. The model's predictions determine which patients receive outreach calls from nurses. Protected attributes include race, age, sex, and disability status.
(a) What is the primary harm this model could cause? Which fairness criterion best addresses that harm?
(b) The model's base rate (patients who would benefit from care management) differs across racial groups because historical healthcare access varies. Should the fairness criterion account for this base rate difference? Why or why not?
(c) Draft a one-page fairness metric selection document that specifies the primary criterion, secondary monitoring metrics, acceptable thresholds, and the rationale for each choice.
Exercise 31.17 (**)
Design a fairness review board charter for a technology company that builds both a credit scoring product and a content recommendation product.
(a) Define the board's composition (roles, not names), meeting cadence, and decision authority.
(b) The credit scoring product is subject to ECOA and requires the four-fifths rule. The recommendation product has no regulatory fairness requirement. Should the same fairness criteria apply to both products? Draft differentiated fairness policies for each.
(c) Define an escalation policy: what happens when a fairness metric crosses the threshold? Who is notified, what actions are required, and what is the timeline for remediation?
Exercise 31.18 (**)
A model has been in production for 6 months. The quarterly fairness review reveals that the demographic parity ratio has declined from 0.85 at deployment to 0.76 (below the 0.80 threshold). The equalized odds difference has remained stable at 0.04.
(a) What are the possible root causes of declining demographic parity while equalized odds remains stable? (Hint: consider changes in the base rate, feature drift, and population composition.)
(b) Design a root cause analysis workflow. What data would you examine first? What hypotheses would you test?
(c) The FRB must decide whether to (i) adjust thresholds immediately, (ii) retrain the model, or (iii) investigate further before acting. Draft a decision memo recommending one of these options, with justification.
Intersectionality and Advanced Topics
Exercise 31.19 (***)
Implement a minimax fairness optimization: instead of minimizing the average loss, minimize the maximum loss across all demographic groups.
(a) Formulate the minimax problem: $\min_\theta \max_{a \in \mathcal{A}} L_a(\theta)$, where $L_a(\theta)$ is the loss on group $a$. Implement this using PyTorch with a custom training loop that upweights the worst-performing group at each step.
(b) Apply minimax training to a neural network on the Adult Income dataset with race as the protected attribute (multi-valued: White, Black, Asian-Pac-Islander, Amer-Indian-Eskimo, Other). Compare per-group accuracy with standard ERM training. Which groups benefit? Which lose?
(c) What is the relationship between minimax fairness and Rawlsian justice (maximize the welfare of the worst-off group)? Under what conditions does minimax training improve the Pareto frontier rather than simply redistributing error?
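The exercise asks for a PyTorch training loop; the minimax heuristic itself is framework-independent. A minimal NumPy sketch of the part (a) idea, training a logistic regression where each step descends the gradient of the currently worst group's loss (a simplification of the full minimax formulation):

```python
import numpy as np

def minimax_logreg(X, y, groups, steps=500, lr=0.1):
    """Logistic regression trained with a minimax heuristic: each step
    takes a gradient step on the currently highest-loss group."""
    w = np.zeros(X.shape[1])
    b = 0.0
    group_ids = np.unique(groups)
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        p = np.clip(p, 1e-7, 1 - 1e-7)
        # Per-group mean log loss under the current parameters
        losses = {
            g: -np.mean(
                y[groups == g] * np.log(p[groups == g])
                + (1 - y[groups == g]) * np.log(1 - p[groups == g])
            )
            for g in group_ids
        }
        m = groups == max(losses, key=losses.get)  # worst group
        w -= lr * (X[m].T @ (p[m] - y[m]) / m.sum())
        b -= lr * float(np.mean(p[m] - y[m]))
    return w, b
```

Comparing per-group loss trajectories under this loop vs. plain ERM makes the part (b) comparison concrete before moving to the full neural network.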
Exercise 31.20 (***)
Implement counterfactual fairness (Kusner et al., 2017) for a loan approval model.
(a) Define a causal graph for loan approval with the following variables: Race ($A$), Education ($E$), Income ($I$), Credit Score ($C$), Loan Default ($Y$). Assume $A \to E \to I$, $A \to I$, $E \to C$, $I \to C$, $I \to Y$, $C \to Y$. Identify which features are causally downstream of $A$.
(b) Implement a counterfactually fair predictor that uses only features not causally downstream of $A$. What features can it use? Compute its accuracy and compare with the full-feature model.
(c) The causal graph is contested: some argue that $A \to E$ is not a direct causal link but reflects systemic barriers. If you change the graph to remove $A \to E$, how does the set of admissible features change? What does this tell you about the sensitivity of counterfactual fairness to causal assumptions?
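For parts (a) and (c), "causally downstream of $A$" means reachable from $A$ in the causal graph, which a few lines of graph traversal can check (the adjacency-dict encoding below is an illustration, not the chapter's code):

```python
def descendants(dag, node):
    """All nodes reachable from `node` in a DAG given as
    {parent: [children, ...]}."""
    seen = set()
    stack = [node]
    while stack:
        cur = stack.pop()
        for child in dag.get(cur, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

# Edges from part (a): A->E, E->I, A->I, E->C, I->C, I->Y, C->Y
loan_dag = {"A": ["E", "I"], "E": ["I", "C"], "I": ["C", "Y"], "C": ["Y"]}
```

Running `descendants` on the graph with and without the $A \to E$ edge shows directly how the set of admissible features shifts for part (c).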
Exercise 31.21 (***)
Implement a fairness-constrained hyperparameter search that integrates fairness into the model selection process.
(a) Modify an Optuna study (Chapter 22) to include a fairness metric as a second objective. Use optuna.multi_objective to find the Pareto frontier of (accuracy, demographic parity ratio).
(b) Compare the Pareto frontier from the multi-objective search with the frontier obtained by post-processing a single unconstrained model (Section 31.10.1). Which approach yields a better frontier?
(c) In practice, should fairness be a constraint (hard threshold) or an objective (optimized jointly with accuracy)? Argue both sides and state your recommendation.
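For the part (b) comparison you need a way to extract the non-dominated set from any collection of (accuracy, demographic parity ratio) points, whether they come from the Optuna study or from post-processing sweeps. A small helper for that (both coordinates treated as maximized):

```python
import numpy as np

def pareto_front(points):
    """Indices of non-dominated points, where both coordinates are to
    be maximized (e.g. accuracy and demographic parity ratio)."""
    pts = np.asarray(points, dtype=float)
    keep = []
    for i, p in enumerate(pts):
        # p is dominated if some point is >= in both coords, > in one
        dominated = np.any(
            np.all(pts >= p, axis=1) & np.any(pts > p, axis=1)
        )
        if not dominated:
            keep.append(i)
    return keep
```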
StreamRec and the Progressive Project
Exercise 31.22 (**)
Extend the CreatorFairnessAudit from Section 31.13.1 to account for content quality.
(a) Modify the equity ratio to be quality-weighted: instead of comparing impression share to production share, compare impression share to quality-weighted production share, where quality is measured by average completion rate.
(b) A creator group produces 20% of content but receives only 8% of impressions (equity ratio 0.40). However, their average completion rate is 15% (vs. 45% platform-wide). Does the quality-weighted equity ratio change the assessment? Should it?
(c) Discuss the ethical tension: is it fair to give less exposure to creators whose content has lower engagement, when that lower engagement may itself be a product of algorithmic feedback loops (less exposure → fewer viewers → lower engagement metrics)?
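The part (b) arithmetic is easy to get wrong by hand. A sketch of the quality-weighted ratio, under the assumption that the platform-wide completion rate is the production-weighted mean (so the quality-weighted production share is production share times relative quality):

```python
def quality_weighted_equity(impression_share, production_share,
                            group_quality, platform_quality):
    """Equity ratio with production share reweighted by content quality.

    Assumes platform_quality is the production-weighted mean completion
    rate, so the quality-weighted production share is
    production_share * group_quality / platform_quality.
    """
    weighted_share = production_share * group_quality / platform_quality
    return impression_share / weighted_share

raw_ratio = 0.08 / 0.20  # the unweighted equity ratio from part (b)
qw_ratio = quality_weighted_equity(0.08, 0.20, 0.15, 0.45)
```

Whether the reweighted number should change the assessment is exactly the part (b)/(c) discussion: the formula bakes in the assumption that engagement is a legitimate proxy for quality.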
Exercise 31.23 (**)
The StreamRec fairness audit (M15) reveals that users in a specific age group (18-24) receive significantly lower NDCG@10 (0.14 vs. 0.21 platform average). Investigation reveals this is because the model has less training data for this cohort (newer users with shorter history).
(a) Is this a fairness problem or a cold-start problem? Can it be both? How does the answer depend on whether age is a protected attribute in your jurisdiction?
(b) Propose two interventions: one technical (model-level) and one organizational (data collection). Estimate the expected impact of each on the NDCG@10 gap.
(c) The fairness review board asks whether to apply a post-processing quality boost for age 18-24 users. What are the risks of this approach? Could it create a new unfairness (older users receiving worse recommendations than they otherwise would)?
Exercise 31.24 (***)
Design a real-time fairness monitoring system for StreamRec that computes fairness metrics on live traffic.
(a) Define the streaming fairness metrics: compute creator exposure equity and user recommendation quality disparity on 1-hour sliding windows. Implement the windowed computation using a streaming data structure.
(b) Define alerting rules: an alert fires if the exposure equity ratio for any creator group drops below 0.5 for two consecutive windows, or if the user quality disparity for any demographic exceeds 0.08 for one window. Implement the alerting logic.
(c) What is the statistical challenge of computing fairness metrics on streaming data? How do you handle the variance of metrics computed on small windows? Propose a sequential testing approach (connecting to Chapter 33's experimentation methods).
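A minimal sketch of the part (a) windowed computation, using a deque as the streaming data structure (event schema and class name are illustrative; production systems would shard this by group and time bucket):

```python
from collections import deque

class WindowedExposureEquity:
    """Creator exposure equity over a sliding time window of events.

    Each event is (timestamp, creator_group, impressions). The equity
    ratio for a group is its impression share within the window divided
    by its (fixed) production share.
    """
    def __init__(self, production_share, window_seconds=3600):
        self.production_share = production_share
        self.window = window_seconds
        self.events = deque()   # (timestamp, group, impressions)
        self.totals = {}        # group -> impressions in window

    def add(self, timestamp, group, impressions):
        self.events.append((timestamp, group, impressions))
        self.totals[group] = self.totals.get(group, 0) + impressions
        # Evict events that have aged out of the window
        while self.events and self.events[0][0] <= timestamp - self.window:
            _, g, imp = self.events.popleft()
            self.totals[g] -= imp

    def equity_ratio(self, group):
        total = sum(self.totals.values())
        if total == 0:
            return float("nan")
        share = self.totals.get(group, 0) / total
        return share / self.production_share[group]
```

The alerting rules in part (b) then reduce to checking equity_ratio per group at each window boundary and counting consecutive violations.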
Exercise 31.25 (****)
The impossibility theorem applies to static, one-shot classification. In a recommendation system like StreamRec, recommendations are made repeatedly over time, and the system's choices affect future data (feedback loops).
(a) Formalize the dynamic fairness problem: define a notion of "long-run fairness" where the fairness criterion must hold not at a single point in time but on average over a time horizon $T$. Show that dynamic fairness can be easier to achieve than static fairness — the system can alternate between satisfying different criteria at different times.
(b) Implement a simple simulation: a two-group recommendation problem where the system alternates between a creator-fair policy (equal exposure) and a user-optimal policy (maximize engagement) every $k$ rounds. Measure long-run creator exposure equity and user engagement as a function of $k$.
(c) Does the impossibility theorem have a dynamic analogue? Can a system that alternates between criteria actually satisfy all criteria "on average," or does the impossibility resurface in a different form? Relate your answer to the literature on online fairness (e.g., Joseph et al., 2016, "Fairness in Learning: Classic and Contextual Bandits").
Capstone Integration
Exercise 31.26 (**)
Integrate the fairness audit into the ML testing infrastructure from Chapter 28. Write a behavioral test suite for Meridian Financial's credit scoring model that includes:
(a) An invariance test: changing the applicant's name (from a name associated with one racial group to a name associated with another) should not change the credit score by more than 5 points.
(b) A minimum functionality test: the model's AUC should be at least 0.75 for every racial group.
(c) A four-fifths rule test: the approval rate ratio between any two racial groups should be at least 0.80.
Implement these as pytest functions that can run in a CI pipeline.
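As a starting point, here is what the part (c) test might look like as a pytest-style function (the data-loading step is stubbed with random predictions; in CI it would score a real validation batch):

```python
import numpy as np

def four_fifths_ratio(y_pred, groups):
    """Minimum pairwise approval-rate ratio across all groups."""
    rates = {g: float(np.mean(y_pred[groups == g]))
             for g in np.unique(groups)}
    vals = list(rates.values())
    return min(vals) / max(vals)

def test_four_fifths_rule():
    # Hypothetical scoring batch; in CI this would come from the model
    # scoring a held-out validation set with recorded group labels.
    rng = np.random.default_rng(7)
    groups = rng.integers(0, 3, size=3000)
    y_pred = (rng.random(3000) < 0.5).astype(int)
    assert four_fifths_ratio(y_pred, groups) >= 0.80
```

The invariance and minimum-functionality tests in (a) and (b) follow the same shape: a pure metric function plus a `test_*` wrapper with the threshold as an assertion.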
Exercise 31.27 (***)
Connect fairness to interpretability (previewing Chapter 35). A credit scoring model uses 200 features. SHAP analysis (Chapter 35) reveals that zip code is the third most important feature overall, but the most important feature for Black applicants — suggesting that zip code is acting as a proxy for race.
(a) Implement a "proxy detection" analysis: for each feature, compute the mutual information between the feature and the protected attribute, the feature importance (SHAP), and the interaction between the two (SHAP feature importance for each group separately). Flag features that are both highly important and highly correlated with the protected attribute.
(b) The model owner proposes removing zip code. Estimate the impact on (i) overall accuracy, (ii) group-specific accuracy, and (iii) fairness metrics. Does removing zip code improve fairness?
(c) An alternative to removal is to replace zip code with a "debiased" version: the residual of zip code after regressing out the protected attribute. Implement this and compare the fairness-accuracy tradeoff with feature removal.
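The "debiasing" step in part (c) is an ordinary least-squares residualization. A sketch, assuming a numeric encoding of both the feature and the protected attribute:

```python
import numpy as np

def residualize(feature, protected):
    """Replace a feature with its residual after regressing out the
    protected attribute (least squares with an intercept)."""
    design = np.column_stack([
        np.ones(len(protected), dtype=float),
        np.asarray(protected, dtype=float),
    ])
    coef, *_ = np.linalg.lstsq(design, np.asarray(feature, dtype=float),
                               rcond=None)
    return feature - design @ coef
```

By construction the residual is uncorrelated with the protected attribute, but note this removes only the *linear* association; nonlinear proxy signal can survive.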
Exercise 31.28 (****)
The Meridian Financial case study requires adverse action notices — explanations for why a credit application was denied. Under ECOA, these explanations must be accurate and non-discriminatory.
(a) A model denies an applicant and lists the top reason as "low income." The applicant's income is at the 40th percentile for their racial group but the 25th percentile overall. Should the adverse action reason be computed relative to the overall population or relative to the applicant's group? What are the legal and ethical implications of each choice?
(b) Implement an adverse action reason generator that produces group-independent explanations: the top features contributing to the denial, ranked by SHAP value, with feature values contextualized relative to approved applicants (not relative to any demographic group). Verify that the explanations do not differ systematically across groups.
(c) Is it possible for a model to be fair (by any metric) but produce systematically different explanations for different groups? Construct an example. What does this imply about the relationship between fairness and interpretability?