Case Study 1: Meridian Financial — Credit Scoring Fairness and ECOA Compliance

Context

Meridian Financial, the mid-size consumer lending institution from Case Studies in Chapters 24 and 28, processes 15,000 credit card applications per day using a gradient-boosted tree ensemble (XGBoost, 500 trees, 200 features). The model scores applicants on a 0-1 probability-of-default scale. Applications scoring below 0.12 are auto-approved; applications scoring above 0.35 are auto-declined; applications between 0.12 and 0.35 are routed to human underwriters.

The model does not use any protected attribute (race, gender, age, national origin) as a feature. It was validated by the model risk management (MRM) team under SR 11-7 guidance (Chapter 28, Case Study 2), passing all data validation, behavioral tests, and model validation gates. AUC on the held-out set is 0.83. The 90-day default rate on approved applications is 2.1%. By every accuracy metric, the model is excellent.

The fairness problem surfaced during a routine regulatory examination. An OCC examiner requested a disaggregated analysis of approval rates by race, using HMDA (Home Mortgage Disclosure Act) methodology adapted for consumer lending. The data science team had never performed this analysis.

The Audit

Phase 1: Baseline Fairness Assessment

The team computed approval rates by race using the FairnessMetrics class and Fairlearn's MetricFrame. Race was not a model feature, so the team joined model predictions with demographic data from the applicant database (collected under ECOA for monitoring purposes but excluded from model features).

Racial Group    n Applications    Approval Rate    Default Rate (approved)
White                    8,250            74.2%                       1.9%
Black                    2,700            51.3%                       2.3%
Hispanic                 2,400            58.7%                       2.1%
Asian                    1,200            71.5%                       1.7%
Other                      450            63.1%                       2.0%

Four-fifths rule check: The highest approval rate is 74.2% (White), so the four-fifths threshold is $0.742 \times 0.8 = 0.594$. Both the Black approval rate (51.3%) and the Hispanic approval rate (58.7%) fall below this threshold. The disparate impact ratio is $0.513 / 0.742 = 0.691$ for Black applicants and $0.587 / 0.742 = 0.791$ for Hispanic applicants, both below the 0.80 standard, with the Black ratio well below it.
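The four-fifths screen reduces to comparing each group's approval rate against the best-off group's. A minimal standalone sketch, with the rates hard-coded from the table above (the FairnessMetrics class is the book's; in practice the per-group rates would come from Fairlearn's MetricFrame rather than a literal dict):

```python
# Approval rates hard-coded from the audit table above.
approval_rates = {
    "White": 0.742, "Black": 0.513, "Hispanic": 0.587,
    "Asian": 0.715, "Other": 0.631,
}

def disparate_impact_ratios(rates):
    """Each group's approval rate relative to the highest-rate group."""
    best = max(rates.values())
    return {group: rate / best for group, rate in rates.items()}

ratios = disparate_impact_ratios(approval_rates)
flagged = [g for g, r in ratios.items() if r < 0.80]  # four-fifths rule
```

Note that by this computation the Hispanic ratio (about 0.79) also lands just under 0.80, not only the Black ratio.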

Equalized odds analysis: Among applicants who would not default ($Y = 0$, the "qualified" applicants), the approval rates differed:

Group       TPR (approve qualified)    FPR (approve unqualified)
White                          0.82                         0.09
Black                          0.63                         0.07
Hispanic                       0.70                         0.08

The equal opportunity difference (max TPR - min TPR) was $0.82 - 0.63 = 0.19$ — a substantial gap. Among creditworthy Black applicants, the model correctly identified only 63% as creditworthy, compared to 82% of creditworthy White applicants. This is the most consequential finding: qualified applicants are being denied at unequal rates.
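Given the per-group rates above, both gap metrics are one-liners; this sketch hard-codes the table values:

```python
# TPR/FPR values hard-coded from the equalized-odds table above
# ("positive" prediction = approval; qualified = would not default).
tpr = {"White": 0.82, "Black": 0.63, "Hispanic": 0.70}
fpr = {"White": 0.09, "Black": 0.07, "Hispanic": 0.08}

equal_opportunity_diff = max(tpr.values()) - min(tpr.values())
# Equalized odds constrains both rates; report the larger of the two gaps.
equalized_odds_diff = max(equal_opportunity_diff,
                          max(fpr.values()) - min(fpr.values()))
```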

Calibration analysis: Using check_group_calibration(), the team found that the model was reasonably well-calibrated for White and Asian applicants (mean predicted probability matched observed default rate within 2 percentage points at every score decile) but poorly calibrated for Black applicants (systematic overestimation of default risk by 3-5 percentage points in the 0.15-0.35 score range — precisely the range where human underwriters make decisions).
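check_group_calibration() is the book's helper; a minimal stand-in, assuming it bins scores into deciles and compares mean predicted risk to the observed default rate within each group, might look like:

```python
import numpy as np

def group_calibration_error(scores, outcomes, groups, n_bins=10):
    """Per-group mean absolute gap between mean predicted probability and
    observed outcome rate, averaged over score deciles (an illustrative
    sketch of what check_group_calibration plausibly computes)."""
    edges = np.quantile(scores, np.linspace(0, 1, n_bins + 1))
    result = {}
    for g in np.unique(groups):
        member = groups == g
        errors = []
        for lo, hi in zip(edges[:-1], edges[1:]):
            in_bin = member & (scores >= lo) & (scores <= hi)
            if in_bin.sum() >= 30:  # skip bins too sparse to estimate
                errors.append(abs(float(scores[in_bin].mean()
                                        - outcomes[in_bin].mean())))
        result[str(g)] = float(np.mean(errors)) if errors else float("nan")
    return result
```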

Phase 2: Root Cause Analysis

The team identified three contributing factors:

Factor 1: Proxy features. SHAP analysis (previewing Chapter 35) revealed that zip code was the third most important feature overall, but the most important feature for Black applicants. The model assigned higher default risk to applicants from majority-Black zip codes — not because zip code directly predicts default, but because historical redlining concentrated Black applicants in zip codes with fewer banking services, lower property values, and thinner credit files. The model learned the statistical footprint of neighborhood-level structural inequality.
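Per-group SHAP importance can be computed by aggregating a precomputed attribution matrix (e.g., from shap.TreeExplainer(model).shap_values(X) for the XGBoost model) separately on each demographic slice. A sketch, with the SHAP matrix taken as given:

```python
import numpy as np

def importance_by_group(shap_values, groups, feature_names):
    """Rank features by mean |SHAP value| within each demographic group.

    shap_values: (n_samples, n_features) attribution matrix, e.g. from
    shap.TreeExplainer(model).shap_values(X). The per-group slicing is
    the point here; the SHAP computation itself is taken as given."""
    rankings = {}
    for g in np.unique(groups):
        mean_abs = np.abs(shap_values[groups == g]).mean(axis=0)
        order = np.argsort(mean_abs)[::-1]  # descending importance
        rankings[str(g)] = [feature_names[i] for i in order]
    return rankings
```

Comparing the rankings across groups is what surfaces a proxy: a feature of middling global importance that dominates for one group.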

Factor 2: Training label bias. The model was trained on historical approval/default outcomes. But historical approvals were themselves products of a prior (less sophisticated) scoring model and human underwriter judgment. If prior underwriters were less likely to approve Black applicants in the uncertainty zone (0.15-0.35), the training data had fewer Black approvals in that range, and the model learned to replicate that pattern. This is the selective labels problem: the model can only learn from the outcomes of applicants who were approved, and the approval process itself was biased.

Factor 3: Feature quality differential. Credit bureau data (FICO scores, credit history length, utilization ratio) is thinner for applicants with less access to traditional credit products. Black and Hispanic applicants disproportionately had thinner credit files (fewer trade lines, shorter credit history), which the model interpreted as higher risk — conflating "less data" with "more risk."

Phase 3: Intervention

The team implemented a layered intervention strategy, ordered by increasing invasiveness:

Layer 1: Post-processing threshold adjustment. Using find_equalized_odds_thresholds(), the team found group-specific thresholds that reduced the equal opportunity difference from 0.19 to 0.04:

Group       Original Threshold    Adjusted Threshold
White                     0.35                  0.33
Black                     0.35                  0.41
Hispanic                  0.35                  0.38

The adjusted thresholds lowered the bar for Black and Hispanic applicants in the underwriter zone and raised it slightly for White applicants. Overall accuracy (AUC) decreased from 0.83 to 0.81, still above the risk team's 0.78 floor. The four-fifths ratio improved from 0.691 to 0.82, clearing the 0.80 threshold.
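find_equalized_odds_thresholds() is the book's helper. Its core idea can be sketched as choosing, per group, the decline cutoff that hits one shared approval rate among non-defaulters (approve when the default score is at or below the cutoff); a full implementation would also trade off FPR and business constraints:

```python
import numpy as np

def threshold_for_tpr(scores, defaulted, target_tpr):
    """Decline-above cutoff whose approval rate among non-defaulters
    (applicants with defaulted == 0, approved when score <= cutoff)
    is approximately target_tpr on this sample."""
    qualified_scores = scores[defaulted == 0]
    return float(np.quantile(qualified_scores, target_tpr))

def group_thresholds(scores, defaulted, groups, target_tpr=0.80):
    """Solve for each group's cutoff separately at one shared TPR target."""
    return {str(g): threshold_for_tpr(scores[groups == g],
                                      defaulted[groups == g], target_tpr)
            for g in np.unique(groups)}
```

A group whose qualified applicants are scored higher (riskier) than they should be ends up with a higher cutoff, which is exactly the direction of the adjustment described above.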

Layer 2: Pre-processing reweighing. For the next model retraining, the team applied compute_reweighing_weights() to the training data, upweighting positive outcomes for Black and Hispanic applicants. The retrained model achieved AUC 0.82 (slightly lower than the original 0.83 but higher than the post-processed version) with a four-fifths ratio of 0.84 and equal opportunity difference of 0.06 — without requiring group-specific thresholds.
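compute_reweighing_weights() presumably follows the standard Kamiran-Calders scheme: each (group, label) cell is weighted by $P(A=a)\,P(Y=y) / P(A=a, Y=y)$, which makes group membership and the outcome label statistically independent in the reweighted training data. A self-contained sketch:

```python
from collections import Counter

def reweighing_weights(groups, labels):
    """Kamiran-Calders reweighing: weight each example by
    P(A=a) * P(Y=y) / P(A=a, Y=y). Underrepresented (group, label)
    cells get weights above 1, overrepresented cells below 1.
    (A sketch of what compute_reweighing_weights plausibly does.)"""
    n = len(labels)
    count_a = Counter(groups)
    count_y = Counter(labels)
    count_ay = Counter(zip(groups, labels))
    return [(count_a[a] / n) * (count_y[y] / n) / (count_ay[(a, y)] / n)
            for a, y in zip(groups, labels)]
```

The weights are then passed to the learner (e.g., XGBoost's sample_weight) at retraining time.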

Layer 3: Feature engineering. The team replaced raw zip code with a "debiased" version: the residual of zip code encoding after regressing out race (a technique analogous to disparate impact removal). This reduced the proxy effect while preserving the legitimate neighborhood-level signal (e.g., local economic conditions that affect all residents regardless of race). The model retrained with debiased zip code achieved AUC 0.82 with a four-fifths ratio of 0.86.
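The residualization step can be sketched as an ordinary least-squares projection: regress the zip-code encoding on race indicators and keep the residual, which is exactly orthogonal to those indicators. The function below is an illustrative stand-in, not the team's code:

```python
import numpy as np

def residualize(feature, race_indicators):
    """Project out the race-explainable component of a feature column via
    least squares and keep the residual, which is orthogonal (up to float
    precision) to the race indicators and to the intercept."""
    X = np.column_stack([np.ones(len(feature)), race_indicators])
    beta, *_ = np.linalg.lstsq(X, feature, rcond=None)
    return feature - X @ beta
```

The residual retains within-group variation in the encoding (e.g., local economic conditions) while removing the component that merely restates group membership.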

Phase 4: Organizational Infrastructure

Metric selection document. The fairness review board (FRB) — composed of the Chief Risk Officer, General Counsel, VP of Data Science, a consumer advocate (external), and the model's senior data scientist — selected the following criteria:

  • Primary binding constraint: Four-fifths rule (demographic parity ratio $\geq 0.80$) for race, gender, age, and national origin. This is legally mandated under ECOA.
  • Secondary monitoring metric: Equal opportunity difference $< 0.10$. This ensures that the model is not systematically denying creditworthy applicants from any group.
  • Tertiary monitoring metric: Calibration by group (mean absolute calibration error $< 0.03$ per decile). This ensures that scores mean the same thing across groups.

Monitoring configuration. The team deployed the FairnessMonitorConfig with weekly metric computation and the following alert thresholds:

Metric                     Warning    Critical
Four-fifths ratio           < 0.83      < 0.80
Equal opportunity diff.     > 0.08      > 0.10
Max calibration error       > 0.03      > 0.05

A critical alert triggers automatic model hold (no retraining proceeds) and FRB escalation within 48 hours. A warning triggers investigation by the data science team within one week.
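FairnessMonitorConfig is the book's class; an illustrative stand-in for the threshold table and the warning/critical escalation logic:

```python
from dataclasses import dataclass

@dataclass
class AlertThresholds:
    """Illustrative stand-in for the monitoring thresholds above."""
    four_fifths_warn: float = 0.83
    four_fifths_crit: float = 0.80
    eo_diff_warn: float = 0.08
    eo_diff_crit: float = 0.10
    calib_warn: float = 0.03
    calib_crit: float = 0.05

def alert_level(t, four_fifths, eo_diff, calib_err):
    """Map one week's metrics to the escalation tier."""
    if (four_fifths < t.four_fifths_crit or eo_diff > t.eo_diff_crit
            or calib_err > t.calib_crit):
        return "critical"  # automatic model hold + FRB escalation in 48h
    if (four_fifths < t.four_fifths_warn or eo_diff > t.eo_diff_warn
            or calib_err > t.calib_warn):
        return "warning"   # data science investigation within one week
    return "ok"
```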

Adverse action integration. Under ECOA, denied applicants must receive an adverse action notice listing the top reasons for denial. The team verified that the adverse action reasons generated by the model did not differ systematically across racial groups. The proxy detection analysis (Exercise 31.27) flagged zip code and employer name as features that could produce group-correlated explanations; these were excluded from adverse action reasons in favor of direct financial indicators (income, debt-to-income ratio, delinquency history).
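A simple screen for the adverse-action check is to compare how often each denial reason appears per group and flag reasons with large between-group gaps. This is coarser than the proxy detection analysis in Exercise 31.27; the names below are illustrative:

```python
from collections import Counter

def reason_rate_gaps(reasons_by_group):
    """Max between-group gap in the frequency of each adverse action
    reason. Large gaps suggest a reason is serving as a group-correlated
    explanation and deserves closer review."""
    rates = {}
    for group, reasons in reasons_by_group.items():
        counts = Counter(reasons)
        rates[group] = {r: c / len(reasons) for r, c in counts.items()}
    all_reasons = {r for per_group in rates.values() for r in per_group}
    return {r: max(per_group.get(r, 0.0) for per_group in rates.values())
               - min(per_group.get(r, 0.0) for per_group in rates.values())
            for r in all_reasons}
```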

Outcome

Six months after implementation, the quarterly FRB review showed:

  • Four-fifths ratio: 0.85 (stable, above 0.80 threshold)
  • Equal opportunity difference: 0.05 (below 0.10 monitoring threshold)
  • Calibration error: 0.02 per decile (below 0.03 monitoring threshold)
  • AUC: 0.82 (above the 0.78 floor)
  • 90-day default rate: 2.2% (0.1 percentage point increase from baseline — within risk tolerance)
  • Zero regulatory examination findings related to fair lending — a first in three examination cycles

The OCC examiner noted in the examination report that Meridian's fairness monitoring infrastructure was "among the most comprehensive observed in a mid-size lending institution," and recommended it as a model for peer institutions.

Lessons

  1. Removing protected attributes does not prevent discrimination. Proxy features — zip code, employer, credit history patterns — carry the statistical signal of structural inequality. Fairness requires examining outcomes, not feature lists.

  2. The four-fifths rule is a floor, not a ceiling. Passing the four-fifths rule does not mean the model is fair. Meridian added equal opportunity and calibration monitoring because the four-fifths rule alone did not capture the most consequential harm (denying creditworthy applicants at unequal rates).

  3. Layered interventions are more effective than single-point fixes. Post-processing provided immediate relief; reweighing improved the next model version; feature debiasing addressed the root cause. Each layer operates at a different timescale and addresses a different mechanism.

  4. Organizational infrastructure is as important as technical infrastructure. The FRB, the metric selection document, the monitoring configuration, and the quarterly review cycle are what sustain fairness over time — long after the initial audit is complete.

  5. The accuracy cost was modest. AUC decreased from 0.83 to 0.82; the default rate increased by 0.1 percentage points. The fairness-accuracy tradeoff was far less severe than anticipated. This is consistent with the empirical literature: the first several points of fairness improvement are often nearly free.