Exercises — Chapter 29: Algorithmic Fairness and Bias in Compliance Systems
Exercise 1: Calculating Demographic Parity Ratios
Background
A UK mortgage lender uses an automated decisioning system for mortgage applications. A quarterly outcome review has produced the following data on application approval rates broken down by applicant demographic group.
| Demographic Group | Applications | Approvals | Approval Rate |
|---|---|---|---|
| White British | 4,200 | 3,486 | 83.0% |
| South Asian | 820 | 615 | 75.0% |
| Black African/Caribbean | 640 | 429 | 67.0% |
| Mixed/Other | 390 | 297 | 76.2% |
Tasks
(a) Calculate the demographic parity ratio for each non-reference group, using White British as the reference group. Show your working.
(b) Identify which groups, if any, fail the four-fifths rule (parity ratio below 0.80). State the specific ratios.
(c) The lender's compliance team argues that the approval rate differences reflect genuine creditworthiness differences and are not the result of model bias. What additional data and analysis would you need to assess whether this argument is correct? List at least four specific pieces of evidence you would seek.
(d) Suppose the model's calibration analysis shows that its probability-of-default scores are equally accurate across all four demographic groups (for example, a score of 0.15 corresponds to a 15% actual default rate for all groups). Does this information support the argument that the observed disparities are justified? Explain your reasoning with reference to the impossibility theorem.
(e) The lender is considering introducing group-specific approval thresholds to bring the parity ratios above 0.80. Currently the standard decision threshold is a probability-of-default score of 0.18 (if predicted probability of default > 0.18, decline; otherwise approve). If the current approval rate for Black African/Caribbean applicants is 67.0% and the target is to bring this to at least 83.0% × 0.80 = 66.4%, is threshold adjustment required? What regulatory and ethical considerations should the lender document before making this change?
Worked Solution
(a) Parity ratios, with White British (83.0%) as reference:
- South Asian: 75.0% / 83.0% = 0.904
- Black African/Caribbean: 67.0% / 83.0% = 0.807
- Mixed/Other: 76.2% / 83.0% = 0.918
(b) No groups fail the four-fifths rule — all parity ratios are above 0.80. South Asian (0.904), Black African/Caribbean (0.807), and Mixed/Other (0.918) all clear the threshold, though Black African/Caribbean is close to the boundary and warrants continued monitoring.
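The arithmetic in (a) and (b) can be checked with a few lines of Python. This sketch uses the rounded approval rates from the review table so the printed ratios match the worked figures; it is a verification aid, not part of the lender's system:

```python
# Approval rates from the quarterly review table (rounded)
rates = {
    "White British": 0.830,              # reference group
    "South Asian": 0.750,
    "Black African/Caribbean": 0.670,
    "Mixed/Other": 0.762,
}

reference_rate = rates["White British"]
for group, rate in rates.items():
    if group == "White British":
        continue
    ratio = rate / reference_rate
    verdict = "FAIL" if ratio < 0.80 else "pass"
    print(f"{group}: parity ratio = {ratio:.3f} ({verdict})")
```

This prints ratios of 0.904, 0.807, and 0.918, all passing the four-fifths rule.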
(c) Additional evidence needed:
1. True positive rate and false positive rate by group, to assess whether the model is making different types of errors for different groups.
2. Base rate analysis: do actual default rates in historical data differ across groups? If so, are those differences explained by genuine financial factors or by historical lending practices that constrained certain groups' ability to build credit history?
3. Training data demographic composition: how are the four groups represented in the training dataset?
4. Feature contribution analysis: which model features are driving the approval/decline decision for each group? Are any features acting as proxies for protected characteristics (e.g., credit history length, which can disadvantage newer UK residents)?
(d) Good calibration does not establish that the disparities are justified. A model can be perfectly calibrated, with a given score implying the same probability of default for every group, and still produce a 0.807 parity ratio if the score distribution for Black African/Caribbean applicants is shifted toward higher predicted default probabilities relative to the reference group. The question then becomes: does that distributional shift reflect genuine credit risk differences, or historical biases embedded in the features used to compute the scores? Calibration confirms the model is internally consistent, not that the data it was trained on was free from discriminatory patterns. The impossibility theorem sharpens the point: when base rates differ across groups, calibration and equalised error rates cannot all be satisfied simultaneously, so calibration is typically achieved at the cost of other fairness criteria.
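The coexistence of calibration and a parity gap can be made concrete with a small synthetic simulation. Everything below is invented for illustration (the group labels, Beta score distributions, and sample size are assumptions of this sketch): both groups' scores are calibrated by construction, yet the 0.18 decision threshold from part (e) produces a parity ratio well below 0.80.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 50_000

# Calibrated by construction: each applicant defaults with probability equal to
# their score, so a score of 0.15 corresponds to a 15% actual default rate in
# both groups, exactly as described in part (d).
scores_ref = rng.beta(2, 18, n)   # reference group, mean score ~0.10
scores_cmp = rng.beta(4, 16, n)   # comparison group, shifted toward higher risk, ~0.20
defaults_ref = rng.random(n) < scores_ref
defaults_cmp = rng.random(n) < scores_cmp

# Apply the decision rule: decline if predicted probability of default > 0.18
approval_ref = float(np.mean(scores_ref <= 0.18))
approval_cmp = float(np.mean(scores_cmp <= 0.18))
print(f"Reference approval rate:  {approval_ref:.3f}")
print(f"Comparison approval rate: {approval_cmp:.3f}")
print(f"Parity ratio:             {approval_cmp / approval_ref:.3f}")
```

Despite identical calibration in both groups, the parity ratio comes out near 0.5: the disparity is driven entirely by the shifted score distribution, which is exactly the question calibration cannot answer.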
(e) The current Black African/Caribbean approval rate of 67.0% already exceeds the four-fifths target of 66.4% (83.0% × 0.80), so no threshold adjustment is mathematically required. However, the proximity to the boundary (a 0.807 parity ratio) means ongoing threshold monitoring is appropriate. If the lender were nevertheless to pursue threshold adjustment, the key documentation should include: (i) the legal basis under the Equality Act 2010's positive action provisions; (ii) confirmation that the purpose is to remediate potential indirect discrimination, not to create preferential treatment based on a protected characteristic; (iii) an ongoing review mechanism to assess whether the adjustment is achieving its remediation purpose; and (iv) legal advice confirming the approach is proportionate to the identified disparity.
Exercise 2: Drafting a Regulatory Notification
Background
You are the Chief Compliance Officer of Verdant Bank UK (Maya Osei's role from the chapter). Your quarterly KYC analytics report has identified a 3.8× rejection rate differential between customers with African names and customers with Anglo-Saxon names. You have confirmed the finding is statistically robust and is attributable to the automated document verification system procured from a third-party vendor. You have taken immediate operational steps (halted automated rejections; implemented manual review queue). You are now preparing a voluntary briefing note to the FCA under your Consumer Duty obligations.
Task
Draft a regulatory briefing note to the FCA addressing the following elements:
- Executive Summary — A two-paragraph summary of the finding and the immediate steps taken
- Description of the Issue — The nature of the 3.8× differential, how it was identified, and its regulatory significance under the Consumer Duty and Equality Act 2010
- Root Cause Hypothesis — Your current working hypothesis about the cause (based on the chapter's analysis of KYC bias sources)
- Immediate Remediation Steps — Steps already taken
- Planned Remediation Programme — What you are committing to do, and on what timeline
- Ongoing Monitoring — How you will ensure the problem does not recur undetected
The briefing note should be approximately 600–800 words. It should be written in a professional regulatory communication style: factual, direct, and demonstrating genuine engagement with the regulatory obligations rather than defensive minimisation.
Guidance Notes
A well-drafted regulatory notification demonstrates four qualities: candour (the firm reports what it knows without minimising or obscuring the severity); accountability (the firm accepts responsibility even where a vendor is involved); proportionality (the remediation programme is calibrated to the severity of the finding); and forward commitment (the firm commits to specific, measurable outcomes and timelines rather than vague good-faith aspirations). The FCA's supervisory approach to Consumer Duty self-reporting places significant weight on the quality of the firm's own analysis — a firm that identifies and articulates the regulatory issue accurately is demonstrating the governance standards the Duty requires.
Exercise 3: Designing a Fairness Monitoring Programme
Background
Following the resolution of the KYC differential identified in Exercise 2, Verdant Bank's Board has requested that the compliance team design a comprehensive, ongoing fairness monitoring programme for all algorithmic systems used in customer-facing processes. This will cover: the KYC document verification system; the credit limit determination model used for Verdant's credit card product; and the fraud transaction monitoring system.
Task
Design a fairness monitoring programme covering the following components:
(a) Scope and Inventory — What information should a firm capture about each algorithmic system in scope? List the minimum data elements required for a fairness monitoring programme inventory.
(b) Metrics and Thresholds — For each of the three system types (KYC verification, credit decisioning, fraud monitoring), specify: (i) which fairness metrics are most appropriate and why; (ii) the trigger thresholds you would set; and (iii) the frequency of assessment.
(c) Governance and Escalation — Design an escalation pathway for fairness findings. Who should receive findings at different severity levels? What actions should be required at each level?
(d) Vendor Engagement — For each vendor-supplied system, what information should the firm request from the vendor as part of the monitoring programme? How should vendor-provided information be validated?
(e) Documentation and Regulatory Reporting — What records should the monitoring programme generate? Under what circumstances should findings be reported to the FCA proactively, and in what form?
Your answer should be structured as a programme design document, not a prose essay. Use headings and lists where appropriate. The programme should be practical and implementable, not merely aspirational.
Exercise 4: Threshold Adjustment Implementation
Coding Exercise
The following exercise asks you to implement a demographic parity threshold adjustment function in Python. This is a practical implementation of one of the remediation approaches discussed in Section 6 of the chapter.
Background
A KYC model outputs a continuous risk score for each applicant between 0 and 1, where higher scores indicate higher assessed risk. The standard decision threshold is 0.40: applicants with scores above 0.40 are rejected, applicants with scores at or below 0.40 are approved. This threshold produces a four-fifths rule violation for two demographic groups.
Task
Implement the following function:
```python
import numpy as np
import pandas as pd


def apply_demographic_parity_thresholds(
    df: pd.DataFrame,
    score_col: str,
    group_col: str,
    reference_group: str,
    default_threshold: float = 0.40,
    target_parity_ratio: float = 0.80,
    max_iterations: int = 100,
    tolerance: float = 0.005,
) -> tuple[pd.DataFrame, dict[str, float]]:
    """
    Adjust decision thresholds per demographic group to achieve a target
    demographic parity ratio relative to a reference group.

    For each non-reference group whose current parity ratio falls below
    target_parity_ratio, the function finds a higher (more lenient) threshold
    that brings the group's approval rate to at least
    reference_group_approval_rate * target_parity_ratio.

    Parameters
    ----------
    df : DataFrame with columns [score_col, group_col]
    score_col : column containing continuous risk scores (0-1)
    group_col : column containing demographic group labels
    reference_group : the baseline group whose threshold is not adjusted
    default_threshold : the standard decision threshold (reject if score > threshold)
    target_parity_ratio : minimum acceptable approval rate ratio vs. reference group
    max_iterations : maximum search iterations for threshold finding
    tolerance : convergence tolerance for approval rate matching

    Returns
    -------
    result_df : original df with added 'approved' column (1 = approved, 0 = rejected)
        and 'threshold_applied' column showing the threshold used for each row
    thresholds : dict mapping group name to the threshold applied for that group
    """
    # TODO: implement this function
    pass


def evaluate_adjustment(
    df: pd.DataFrame,
    group_col: str,
    approved_col: str,
    reference_group: str,
) -> pd.DataFrame:
    """
    Evaluate the demographic parity outcomes after threshold adjustment.
    Returns a summary DataFrame showing approval rates and parity ratios.
    """
    # TODO: implement this function
    pass
```
Requirements for apply_demographic_parity_thresholds:
- For the reference group, apply the default threshold unchanged
- For each non-reference group, compute the current approval rate using the default threshold
- Compute the current parity ratio vs. the reference group
- If the parity ratio is below target_parity_ratio, use binary search or iterative search over the threshold range [0, 1] to find the lowest threshold that achieves an approval rate >= reference_group_approval_rate * target_parity_ratio (the lowest such threshold is the most restrictive one that still meets the target, since raising the threshold approves more applicants)
- Return the result DataFrame with applied decisions and a dictionary of per-group thresholds
Requirements for evaluate_adjustment:
- Compute the approval rate for each group
- Compute the parity ratio for each non-reference group vs. the reference group
- Flag groups still below 0.80 parity ratio
- Return a summary DataFrame
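As a hint for the search step (a sketch only, not the full solution): because the approval rate mean(scores <= t) is non-decreasing in the threshold t, a bisection over [0, 1] finds the lowest threshold meeting a target approval rate. The helper name `find_threshold` and the interval-based stopping rule are choices made for this sketch, not requirements of the exercise.

```python
import numpy as np

def find_threshold(scores: np.ndarray, target_rate: float,
                   max_iterations: int = 100, tolerance: float = 0.005) -> float:
    """Bisection for (approximately) the lowest threshold t such that the
    approval rate mean(scores <= t) is at least target_rate.

    Invariant: the approval rate at `hi` always meets the target, so `hi`
    is a valid answer whenever the loop stops.
    """
    lo, hi = 0.0, 1.0  # the rate at 1.0 is 100%, so hi satisfies the target from the start
    for _ in range(max_iterations):
        mid = (lo + hi) / 2
        if np.mean(scores <= mid) >= target_rate:
            hi = mid  # mid is lenient enough; try a lower (stricter) threshold
        else:
            lo = mid  # mid is too strict; move up
        if hi - lo < tolerance:
            break
    return hi
```

In apply_demographic_parity_thresholds, a helper like this would be called only for groups whose parity ratio falls below the target, with target_rate set to the reference group's approval rate times target_parity_ratio; groups already at or above the target keep the default threshold.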
Test your implementation with the following synthetic data:
```python
np.random.seed(99)
n = 3000

# Generate scores: Group A (reference) has the lowest mean scores (lowest risk);
# Group B has slightly elevated scores; Group C has substantially elevated scores.
# Under the default threshold of 0.40, both Group B and Group C fall below the
# four-fifths parity target.
scores_a = np.clip(np.random.beta(2, 5, n), 0, 1)    # mean ~0.29
scores_b = np.clip(np.random.beta(2.5, 4, n), 0, 1)  # mean ~0.38
scores_c = np.clip(np.random.beta(3.5, 3, n), 0, 1)  # mean ~0.54

df_test = pd.DataFrame({
    "risk_score": np.concatenate([scores_a, scores_b, scores_c]),
    "group": ["Group_A"] * n + ["Group_B"] * n + ["Group_C"] * n,
})

# Run adjustment
result_df, thresholds = apply_demographic_parity_thresholds(
    df=df_test,
    score_col="risk_score",
    group_col="group",
    reference_group="Group_A",
    default_threshold=0.40,
    target_parity_ratio=0.80,
)
print("Adjusted thresholds:", thresholds)

summary = evaluate_adjustment(result_df, "group", "approved", "Group_A")
print(summary)
```
Expected output (approximate):
The thresholds dictionary should show the unchanged threshold of 0.40 for Group_A, a threshold modestly above 0.40 for Group_B, and a substantially higher threshold for Group_C to compensate for its shifted score distribution. All parity ratios in the summary should be >= 0.80.
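The direction of this expected output can be checked analytically: because the synthetic scores are Beta-distributed, the approval rate at a threshold t is simply the Beta CDF at t. This check assumes scipy is available; it is not required by the exercise itself.

```python
from scipy.stats import beta

DEFAULT_THRESHOLD = 0.40

# Theoretical approval rates at the default threshold for each group's
# Beta parameters from the test data
rate_a = beta.cdf(DEFAULT_THRESHOLD, 2, 5)    # Group_A (reference), ~0.77
rate_b = beta.cdf(DEFAULT_THRESHOLD, 2.5, 4)  # Group_B, ~0.56
rate_c = beta.cdf(DEFAULT_THRESHOLD, 3.5, 3)  # Group_C, ~0.24

print(f"Parity ratio B/A: {rate_b / rate_a:.3f}")  # below 0.80
print(f"Parity ratio C/A: {rate_c / rate_a:.3f}")  # far below 0.80
```

Both theoretical ratios fall below 0.80, so a correct implementation should raise the thresholds for both Group_B and Group_C.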
Reflection question: After implementing the function, write a paragraph (approximately 200 words) discussing the ethical and regulatory considerations of this approach. When is group-specific threshold adjustment defensible? When is it insufficient as a remediation strategy?
Exercise 5: Designing a Vendor Contract Fairness Clause
Background
Meridian Financial (from Case Study 02) has completed its credit decisioning system procurement from VantageDecision. Priya Chandrasekaran has been asked to draft the fairness schedule that will be appended to the master services agreement. This schedule will govern VantageDecision's fairness reporting obligations and Meridian's rights in the event of fairness violations.
Task
Draft a vendor contract fairness schedule covering the following provisions:
(a) Definitions — Define the key terms that the schedule will use: Demographic Parity Ratio, Four-Fifths Violation, Protected Attribute, Reference Group, Disaggregated Performance Report, Fairness Remediation Plan.
(b) Reporting Obligations — Specify VantageDecision's reporting obligations, including: frequency; the demographic attributes for which disaggregated performance must be reported; the specific metrics that must be included; and the format in which data must be delivered.
(c) Violation Thresholds and Notification — Define what constitutes a reportable fairness event and what VantageDecision must do when one is identified, including notification timelines and the form of the notification.
(d) Remediation Process — Specify the process that must follow a confirmed four-fifths violation, including: root cause analysis obligations; remediation plan content requirements; and timeline commitments.
(e) Meridian's Rights — Specify Meridian's rights in the event of: (i) a four-fifths violation that persists for more than 12 months; (ii) VantageDecision's failure to provide required disaggregated reporting on time; (iii) discovery that VantageDecision provided materially inaccurate disaggregated performance data.
(f) Cooperation Obligations — Specify each party's obligations to cooperate with: regulatory investigations; Meridian's own fairness monitoring programme; and third-party audits of the model's demographic performance.
Your schedule should be written in a precise, legally structured style appropriate for a commercial contract appendix. It should be specific and enforceable, not aspirational. Each provision should be numbered and clearly delineated.
Guidance Notes
Contract fairness clauses are most effective when they specify exactly what data must be provided, at what frequency, in what format, and what happens if the data reveals a problem. Vague commitments to "use best efforts to ensure fairness" or "cooperate in good faith on fairness matters" provide no meaningful protection. The schedule should reflect the regulatory reality: Meridian bears regulatory responsibility for the outcomes VantageDecision's model produces, and the contractual framework must give Meridian the information and leverage it needs to discharge that responsibility.
Pay particular attention to the definitions section. Fairness concepts are often contested in disputes, and clear contractual definitions prevent disagreement about whether a violation has occurred. The Four-Fifths Violation definition, in particular, should specify the reference group, the measurement period, the minimum group size for calculation, and how the parity ratio is computed.