Exercises: Chapter 33
Fairness, Bias, and Responsible ML
Exercise 1: Identifying Bias Types (Conceptual)
For each scenario below, identify the primary type of bias (historical, representation, measurement, or aggregation) and explain how it would affect model predictions. Some scenarios may involve more than one type.
a) A hiring model is trained on 10 years of resume-to-hiring-outcome data from a tech company where 82% of engineering hires were male. The model learns that male-associated keywords (e.g., certain sports, military service) are predictive of being hired.
b) A hospital builds a sepsis prediction model using data from its ICU. The ICU patient population is 70% white and 30% non-white, but the hospital's overall patient population is 45% white. The model has significantly lower recall for non-white patients.
c) A loan default model uses "number of credit inquiries in the past 12 months" as a feature. However, some lenders report inquiries to all three credit bureaus, while others report to only one. The reporting pattern correlates with the borrower's geographic region and, indirectly, with race.
d) A customer satisfaction prediction model is built for a national bank. Customer behavior differs substantially between urban and rural branches --- urban customers use mobile banking heavily while rural customers rely on in-branch transactions --- but the model treats all customers identically.
e) A recidivism prediction model is trained on arrest records. Arrest rates for drug offenses are 3.5 times higher for Black Americans than white Americans, despite similar rates of drug use (as measured by anonymous surveys). The model learns that race-correlated features are predictive of re-arrest.
Exercise 2: Computing Fairness Metrics by Hand (Conceptual + Code)
A model for predicting loan default produces the following confusion matrices for two groups:
Group A (n=1000):
| | Predicted Default | Predicted No Default |
|---|---|---|
| Actually Defaulted | 80 (TP) | 20 (FN) |
| Did Not Default | 90 (FP) | 810 (TN) |
Group B (n=1000):
| | Predicted Default | Predicted No Default |
|---|---|---|
| Actually Defaulted | 50 (TP) | 30 (FN) |
| Did Not Default | 120 (FP) | 800 (TN) |
a) Compute by hand for each group: TPR, FPR, precision (PPV), positive prediction rate, and base rate.
b) Does the model satisfy demographic parity? Show the calculation.
c) Does the model satisfy equalized odds? Show the calculations for both TPR and FPR.
d) Does the model satisfy predictive parity? Show the calculation.
e) Compute the disparate impact ratio (Group B relative to Group A). Does it pass the 80% rule?
f) Now verify your hand calculations in Python:
```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Your task: Create y_true and y_pred arrays for each group
# that match the confusion matrices above, then compute all
# five metrics using the functions from this chapter.
```
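One way to build label arrays that reproduce a fixed confusion matrix is to repeat each cell's (true, predicted) pair the right number of times. A minimal sketch for Group A (the helper name `arrays_from_counts` is ours, not from the chapter):

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def arrays_from_counts(tp, fn, fp, tn):
    """Build (y_true, y_pred) arrays that reproduce the given cell counts."""
    y_true = np.concatenate([np.ones(tp + fn, int), np.zeros(fp + tn, int)])
    y_pred = np.concatenate([np.ones(tp, int), np.zeros(fn, int),
                             np.ones(fp, int), np.zeros(tn, int)])
    return y_true, y_pred

# Group A from the tables above: TP=80, FN=20, FP=90, TN=810
y_true_a, y_pred_a = arrays_from_counts(tp=80, fn=20, fp=90, tn=810)
tn, fp, fn, tp = confusion_matrix(y_true_a, y_pred_a).ravel()
tpr = tp / (tp + fn)             # true positive rate: 0.80
fpr_rate = fp / (fp + tn)        # false positive rate: 0.10
ppr = (tp + fp) / len(y_true_a)  # positive prediction rate: 0.17
```

These values should match your hand calculations for part (a) before you move on to the parity checks.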
Exercise 3: The Impossibility Theorem in Practice (Code)
Generate synthetic data with two groups that have different base rates (Group A: 12% positive rate, Group B: 24% positive rate). Train a gradient boosting classifier.
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, roc_curve

np.random.seed(42)
n = 10000
# Generate two groups with different base rates
group = np.random.choice(['A', 'B'], size=n, p=[0.6, 0.4])

# Your task:
# 1. Generate features (at least 5 numeric features)
# 2. Generate a target variable where base rate depends on group
#    (12% for Group A, 24% for Group B)
# 3. Train a GradientBoostingClassifier on the features (not group)
# 4. Apply a single threshold (0.5) and compute TPR, FPR, PPV per group
```
a) With a single threshold of 0.5, show that TPR, FPR, and precision all differ between groups.
b) Find group-specific thresholds that equalize TPR at approximately 0.65. What happens to the FPR for each group?
c) Find group-specific thresholds that equalize FPR at approximately 0.10. What happens to the TPR for each group?
d) Can you find thresholds that simultaneously equalize TPR and FPR? Explain why or why not, relating your finding to the impossibility theorem.
e) Compute the overall accuracy with the single threshold (part a) and with the equalized-TPR thresholds (part b). What is the accuracy cost of fairness?
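For parts (b) and (c), one simple approach is to scan each group's ROC curve for the threshold whose TPR (or FPR) is closest to the target. A sketch on toy scores (`threshold_for_tpr` is an illustrative helper, not from the chapter):

```python
import numpy as np
from sklearn.metrics import roc_curve

def threshold_for_tpr(y_true, y_prob, target_tpr=0.65):
    """Return (threshold, achieved TPR, resulting FPR) closest to target_tpr."""
    fpr, tpr, thresholds = roc_curve(y_true, y_prob)
    idx = np.argmin(np.abs(tpr - target_tpr))
    return thresholds[idx], tpr[idx], fpr[idx]

# Toy check: informative but noisy scores
rng = np.random.default_rng(0)
y = rng.binomial(1, 0.2, size=5000)
scores = np.clip(0.3 + 0.3 * y + rng.normal(0, 0.15, size=5000), 0, 1)
thr, tpr_at, fpr_at = threshold_for_tpr(y, scores, target_tpr=0.65)
```

Apply this per group (filtering `y_true` and `y_prob` by group membership) and then read off the FPR that each group's equal-TPR threshold implies.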
Exercise 4: Reweighting for Fairness (Code)
Using the Metro General readmission dataset from the chapter, implement reweighting and evaluate its effect on fairness metrics.
```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

np.random.seed(42)
# Recreate the Metro General dataset from the chapter
# (copy the data generation code from Part 2)

# Your task:
# 1. Train an unweighted model and compute the fairness audit
# 2. Compute fairness weights using compute_fairness_weights()
# 3. Train a weighted model and compute the fairness audit
# 4. Compare: which fairness metrics improved? Which got worse?
# 5. What happened to overall accuracy?
```
a) Create a summary table comparing the unweighted and weighted models on: overall accuracy, overall AUC, TPR per group, FPR per group, and precision per group.
b) Did reweighting reduce the TPR disparity between groups? By how much?
c) Did reweighting introduce any new disparities (e.g., did FPR become less equal)?
d) Write a 3-sentence recommendation: should Metro General deploy the reweighted model or the original model? Justify your answer in terms of the specific fairness-accuracy tradeoff you observed.
Exercise 5: Threshold Adjustment (Code)
Using the same Metro General model from Exercise 4 (the unweighted version), apply post-processing threshold adjustment.
a) Find group-specific thresholds that achieve approximately equal TPR across all four racial groups (target TPR: 0.70).
b) Report the resulting FPR for each group. Are they equal?
c) Find group-specific thresholds that achieve approximately equal FPR across groups (target FPR: 0.10). Report the resulting TPR for each group.
d) Compare the two approaches:
- Equal-TPR thresholds: What is the overall accuracy? What are the per-group FPRs?
- Equal-FPR thresholds: What is the overall accuracy? What are the per-group TPRs?
- Which approach would you recommend for Metro General, and why? (Consider: what is the cost of a false negative vs. a false positive in readmission prediction?)
e) Combine reweighting (Exercise 4) with threshold adjustment (this exercise). Does the combination produce better fairness outcomes than either approach alone?
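Once per-group thresholds are found, applying them is a one-liner; a sketch (the helper name `apply_group_thresholds` is ours):

```python
import numpy as np

def apply_group_thresholds(y_prob, group, thresholds):
    """Binarize scores with a per-group cutoff (thresholds: dict group -> cutoff)."""
    y_prob, group = np.asarray(y_prob), np.asarray(group)
    cutoffs = np.array([thresholds[g] for g in group])
    return (y_prob >= cutoffs).astype(int)

probs = np.array([0.3, 0.6, 0.3, 0.6])
grps = np.array(['A', 'A', 'B', 'B'])
preds = apply_group_thresholds(probs, grps, {'A': 0.5, 'B': 0.25})
# preds -> [0, 1, 1, 1]
```

Feed the adjusted predictions back through the same fairness audit you used in Exercise 4 so the comparisons are apples to apples.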
Exercise 6: Building a Model Card (Applied)
Create a complete model card for the StreamFlow churn prediction model. Use the following specifications:
- Model: XGBoost classifier predicting 30-day churn for StreamFlow subscribers
- Training data: 50,000 subscribers, features include session frequency, content engagement, account age, plan price, payment history
- Overall AUC: 0.938
- Use case: Weekly "high-risk" list for Customer Success team
- Protected attributes to evaluate: Age group (18--25, 26--40, 41--60, 61+) and geographic region (Urban, Suburban, Rural)
Your model card must include all eight sections from the chapter template:
- Model Details
- Intended Use
- Out of Scope
- Training Data
- Performance (overall metrics)
- Fairness Metrics (with demographic parity and equalized odds for both age group and region)
- Limitations
- Ethical Considerations
Write the model card in Python using the create_model_card function from the chapter. For the fairness metrics, you may simulate reasonable values based on what you know about how base rates might differ across age groups and regions.
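One way to organize your material before calling `create_model_card` is a dict with one entry per required section. The values below are placeholders sketched from the specifications above (the chapter's function may expect different argument names):

```python
# Placeholder content for the eight required sections; replace the
# fairness and limitation entries with your simulated values.
card_sections = {
    'Model Details': 'XGBoost binary classifier predicting 30-day churn',
    'Intended Use': "Weekly 'high-risk' list for the Customer Success team",
    'Out of Scope': 'Uses beyond Customer Success outreach (to be specified)',
    'Training Data': '50,000 subscribers: session frequency, content '
                     'engagement, account age, plan price, payment history',
    'Performance': 'Overall AUC: 0.938',
    'Fairness Metrics': 'Demographic parity and equalized odds by age group '
                        'and region (simulated values)',
    'Limitations': 'To be filled in from your analysis',
    'Ethical Considerations': 'To be filled in from your analysis',
}
```

Keeping the sections in a dict also makes it easy to check programmatically that none of the eight is missing before the card is rendered.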
Exercise 7: Fairness Audit Automation (Code)
Build a FairnessAuditor class that automates the full audit workflow.
```python
class FairnessAuditor:
    """Automated fairness audit for binary classification models."""

    def __init__(self, y_true, y_pred, y_prob, protected_attribute,
                 attribute_name='Protected Attribute'):
        """
        Parameters
        ----------
        y_true : array-like, true labels
        y_pred : array-like, predicted labels (binary)
        y_prob : array-like, predicted probabilities
        protected_attribute : array-like, group membership
        attribute_name : str, name of the protected attribute
        """
        # Your implementation here
        pass

    def base_rates(self):
        """Return base rates (positive class proportion) per group."""
        pass

    def demographic_parity(self):
        """Return positive prediction rate per group."""
        pass

    def equalized_odds(self):
        """Return TPR and FPR per group."""
        pass

    def predictive_parity(self):
        """Return precision per group."""
        pass

    def disparate_impact(self, reference_group):
        """Return disparate impact ratio per group relative to reference."""
        pass

    def auc_by_group(self):
        """Return AUC per group."""
        pass

    def full_report(self, reference_group=None):
        """Print a complete fairness audit report."""
        pass

    def find_equalized_thresholds(self, metric='tpr', target=0.70):
        """
        Find per-group thresholds to equalize a metric.

        Parameters
        ----------
        metric : str, 'tpr' or 'fpr'
        target : float, target value for the metric
        """
        pass
```
a) Implement all methods of the FairnessAuditor class.
b) Test it on the Metro General readmission dataset: instantiate the auditor and call full_report().
c) Use find_equalized_thresholds to find thresholds for equal TPR at 0.70, then create a second FairnessAuditor with the adjusted predictions and run full_report() again.
d) Add a summary_table method that returns a single DataFrame with one row per group and columns for: base rate, positive prediction rate, TPR, FPR, precision, AUC. This is the table you would show a stakeholder.
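Most of the auditor's methods reduce to one per-group computation, so a shared helper keeps the class small. A sketch (`rates_by_group` is our name; inside the class you would call it from the individual methods):

```python
import numpy as np
import pandas as pd

def rates_by_group(y_true, y_pred, group):
    """Per-group base rate, positive prediction rate, TPR, FPR, and precision."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    out = {}
    for g in np.unique(group):
        m = group == g
        yt, yp = y_true[m], y_pred[m]
        out[g] = {
            'base_rate': yt.mean(),
            'pos_pred_rate': yp.mean(),
            'TPR': yp[yt == 1].mean() if (yt == 1).any() else np.nan,
            'FPR': yp[yt == 0].mean() if (yt == 0).any() else np.nan,
            'precision': yt[yp == 1].mean() if (yp == 1).any() else np.nan,
        }
    return pd.DataFrame(out).T

# Tiny check
df = rates_by_group(np.array([1, 1, 0, 0, 1, 0]),
                    np.array([1, 0, 1, 0, 1, 1]),
                    np.array(['A', 'A', 'A', 'B', 'B', 'B']))
```

The `summary_table` in part (d) is essentially this DataFrame with an AUC column appended.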
Exercise 8: Fairness Across Multiple Attributes (Conceptual + Code)
In practice, fairness must be evaluated across multiple protected attributes simultaneously (intersectional fairness).
a) Conceptual: Explain why a model can satisfy equalized odds across racial groups AND across gender groups, but still violate equalized odds for the intersection (e.g., Black women vs. white men). This is called the fairness gerrymandering problem.
b) Code: Using the Metro General dataset, add a gender attribute (simulate 52% female, 48% male, with a small difference in readmission base rates). Run the fairness audit separately for:
- Race (4 groups)
- Gender (2 groups)
- Race x Gender intersection (8 groups)
```python
# Your task:
# 1. Add gender to the dataset
# 2. Create an intersection variable: race + '_' + gender
# 3. Run fairness_audit for each attribute
# 4. Identify: are there any intersectional groups that show
#    disparity not visible in the single-attribute audits?
```
c) What practical challenges arise when auditing intersectional fairness? Consider sample size in particular (what happens to the "Asian Female" group if Asian patients are 5% and females are 52% of the data?).
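The sample-size concern in part (c) is easy to quantify directly. A sketch with illustrative proportions (5% Asian as quoted in part (c), 52% female as in part (b); the race split is an assumption, not the chapter's):

```python
import numpy as np
import pandas as pd

n = 10000  # assumed dataset size
rng = np.random.default_rng(42)
race = rng.choice(['White', 'Black', 'Hispanic', 'Asian'], size=n,
                  p=[0.55, 0.20, 0.20, 0.05])
gender = rng.choice(['Female', 'Male'], size=n, p=[0.52, 0.48])
cells = (pd.Series(race) + '_' + pd.Series(gender)).value_counts()
# Expected 'Asian_Female' cell: n * 0.05 * 0.52 = 260 patients -- enough
# to estimate a base rate, but TPR/FPR estimates (which condition further
# on the true label) will rest on only a few dozen positives.
```

This is why intersectional audits often report confidence intervals or collapse the smallest cells.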
Exercise 9: The Fairness-Accuracy Frontier (Code)
For the Metro General model, generate the fairness-accuracy Pareto frontier by sweeping the fairness constraint strength.
```python
# Your task:
# 1. For lambda values from 0.0 to 1.0 (step 0.05), train a model with
#    reweighting where the weight strength is scaled by lambda
#    (lambda=0 means no fairness correction; lambda=1 means full correction)
# 2. For each lambda, record:
#    - Overall accuracy
#    - Overall AUC
#    - TPR disparity (max TPR - min TPR across groups)
#    - FPR disparity (max FPR - min FPR across groups)
# 3. Plot two curves:
#    - X-axis: TPR disparity, Y-axis: Overall accuracy
#    - X-axis: FPR disparity, Y-axis: Overall AUC
# 4. Identify the "knee" of the curve: the point where further
#    fairness improvement costs the most accuracy.
```
a) At what point on the curve does the accuracy cost become steep? Is there a "sweet spot" where you get significant fairness improvement with minimal accuracy loss?
b) How would you present this tradeoff to Metro General's Chief Medical Officer? Draft a 3-sentence summary.
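One natural reading of "weight strength scaled by lambda" is linear interpolation between uniform weights and the full fairness weights; a sketch (helper name ours):

```python
import numpy as np

def scaled_weights(fair_w, lam):
    """Interpolate between uniform weights (lam=0) and full fairness weights (lam=1)."""
    fair_w = np.asarray(fair_w, dtype=float)
    return (1.0 - lam) * np.ones_like(fair_w) + lam * fair_w

w = np.array([0.75, 1.25, 1.5])
half = scaled_weights(w, 0.5)   # [0.875, 1.125, 1.25]
```

Pass `scaled_weights(fair_w, lam)` as `sample_weight` at each step of the sweep; since the endpoints recover the unweighted and fully reweighted models, the curve connects Exercise 4's two configurations.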
Exercise 10: Aequitas Quick Start (Applied)
The Aequitas toolkit (from the University of Chicago's Center for Data Science and Public Policy) is an open-source bias and fairness audit library.
```python
# pip install aequitas

# Your task:
# 1. Format the Metro General test data for Aequitas
#    (needs columns: 'score', 'label_value', and protected attributes)
# 2. Use Aequitas Group() to compute group-level metrics
# 3. Use Aequitas Bias() to compute bias metrics
# 4. Use Aequitas Fairness() to make fairness determinations
# 5. Compare Aequitas results with your manual calculations
#    from the chapter. Do they match?
```
Note: If you cannot install Aequitas in your environment, write the code assuming it is available and describe what you would expect the output to show based on your manual analysis.
These exercises accompany Chapter 33: Fairness, Bias, and Responsible ML. Return to the chapter for full context.