Case Study 34.1: The Algorithm That Made the Right Decision for the Wrong Reasons

The Situation

Organization: Cornerstone Financial Group (fictional composite institution)
Division: Retail Banking — Credit Decisioning
Challenge: A credit scoring algorithm is discovered to be achieving good performance through a mechanism that, on examination, violates the ethical principles the institution claims to hold
Timeline: Q1–Q3 2024


Background

Cornerstone's retail credit division replaced its rules-based credit scoring system with a gradient-boosted machine learning model in 2022. The model's performance was strong: default prediction accuracy improved by 14% relative to the prior system; credit losses fell; analyst review time decreased as the model handled a higher proportion of decisions automatically.

Eighteen months after deployment, Cornerstone engaged Priya Nair's RegTech Advisory team to conduct a model governance review as part of preparation for the EU AI Act compliance program. Priya's scope: review the model's technical documentation, validation records, and ongoing monitoring reports.


What the Review Found

The review was proceeding normally until Priya's colleague, data scientist Elena Reyes, ran a feature importance analysis using SHAP values on the full production scoring population.

The SHAP analysis revealed that three features collectively accounted for 41% of the model's predictive power:

  1. Time of application submission (18% of model contribution): applications submitted between 2 AM and 5 AM had significantly higher default rates. The model weighted this heavily.

  2. Mobile device model age (13% of model contribution): applications submitted from older mobile devices (identified by device fingerprint) had higher default rates. The model treated this as a proxy for borrower financial position.

  3. Application completion time (10% of model contribution): applications completed in under 4 minutes had higher default rates. The model had learned that very fast completion correlated with less careful review of terms.

Elena flagged these findings. The features were technically valid predictors — they genuinely correlated with default in the training data. But their use raised ethical questions that the original model documentation had not addressed.
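A feature's "share of model contribution" in an analysis like Elena's is typically derived from mean absolute SHAP values, normalized across all features. The sketch below shows that calculation only; the feature names and values are illustrative, not Cornerstone's actual model (the two unflagged features are invented placeholders).

```python
# Hypothetical sketch: turning mean |SHAP| values into contribution shares.
# All feature names and numbers are illustrative assumptions.
mean_abs_shap = {
    "time_of_application": 0.18,
    "device_model_age": 0.13,
    "application_completion_time": 0.10,
    "income_verification": 0.22,       # placeholder feature
    "credit_history_length": 0.37,     # placeholder feature
}

total = sum(mean_abs_shap.values())
shares = {feature: value / total for feature, value in mean_abs_shap.items()}

flagged = ["time_of_application", "device_model_age", "application_completion_time"]
flagged_share = sum(shares[f] for f in flagged)
print(f"Flagged features account for {flagged_share:.0%} of model contribution")
```

With these illustrative inputs, the three flagged features account for 41% of total contribution, matching the headline figure in the review.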


The Ethical Analysis

Priya convened a review team: the model risk team, compliance, legal, the head of retail credit, and Cornerstone's Chief Ethics Officer, Dr. Amara Osei.

Time of application submission. Why do 2–5 AM applications default more often? The training data correlation was real. But what is it measuring? Several hypotheses: (1) distressed borrowers apply late at night when immediate need is greatest; (2) people in financial difficulty lose regular sleep patterns; (3) the middle-of-the-night application pattern correlates with psychological states associated with poor financial decision-making.

If hypothesis (1) or (2) is correct, the model is using financial distress itself as a predictor of default — effectively penalizing people who are already in difficult circumstances for being in difficult circumstances. The model was correctly predicting default, but doing so by identifying borrowers at their most vulnerable and giving them worse terms or rejections.

Mobile device age. Why do applications from older devices default more often? The correlation likely reflects that phone model age is a proxy for income — people with less disposable income use older phones longer. The model was using device age as a socioeconomic proxy.

This is a sophisticated version of the geographic proxy problem: the model was not using income directly (Cornerstone considered this appropriate), but it was using an indirect proxy that achieved a similar discriminatory effect. Worse: it was using a proxy that customers could not reasonably anticipate.

Application completion time. Applications completed fastest defaulted at higher rates. The model learned this as a legitimate predictor of credit risk. But it was also effectively penalizing borrowers who understood the forms well enough to complete them quickly, or who were applying for small amounts they clearly needed rather than deliberating over large ones.

Dr. Osei's assessment: "These features are predictive. They are also ethically troubling in ways that the original model design process did not consider. We should not have features in production that penalize borrowers for being financially distressed, that proxy for socioeconomic status in ways customers don't expect, or that penalize efficient form completion."


The Tension Between Accuracy and Ethical Practice

The head of retail credit pushed back. "If we remove these features, our default prediction deteriorates. We'll approve more loans to people who default. That's bad for the borrowers, not just the bank. Are you saying we should give more credit to people who are going to struggle to repay it?"

This was the hardest ethical question of the review. A consequentialist argument: the model's current features, ethically questionable as they are, produce better credit outcomes overall. Removing them would mean more defaults — more borrowers in financial distress, more collections activity, more credit harm to vulnerable people.

Priya's response: "Two things. First, the consequentialist calculation assumes that 'better for the bank' and 'better for borrowers' are the same thing. A borrower who is rejected because they applied at 3 AM might be someone who needed the credit and was denied it because of when they submitted their application. That's harm too — just not harm that shows up in our default rate.

"Second, the argument 'accurate discrimination is better for the discriminated-against group' is one of the oldest justifications for unjust treatment in financial services. It was the argument for redlining. The question isn't only whether removing these features changes aggregate outcomes. The question is whether using these features is compatible with treating borrowers as full persons, not as data points to be profiled."


The Resolution

After three weeks of discussion, Cornerstone's credit risk committee, in consultation with legal and compliance, made three decisions:

Feature 1 (time of application): Removed. The committee concluded that penalizing applications for timing — particularly the 2–5 AM window most likely to capture financial distress — was incompatible with the Consumer Duty principle of good outcomes for all customers. The model was retrained without this feature. Default prediction accuracy declined by 2.1 percentage points.

Feature 2 (device age): Removed. Using device age as a socioeconomic proxy was judged to be inconsistent with Cornerstone's fairness commitments, even if the correlation was technically lawful. Retrained model: default prediction accuracy declined by 1.8 percentage points.

Feature 3 (completion time): Retained with modification. The credit risk committee concluded that very fast completion (under 2 minutes) correlated with automated form-filling tools sometimes used in application fraud — a legitimate detection signal. The original under-4-minute threshold was too broad. The feature was respecified as a fraud signal for very rapid completions only (under 90 seconds), not as a credit risk signal.
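The respecification of Feature 3 can be sketched as a simple routing change: completion time is no longer passed to the credit risk model at all, and instead feeds only a fraud-review flag with the narrower 90-second threshold. The field names and the structure below are illustrative assumptions, not Cornerstone's actual system.

```python
from dataclasses import dataclass

# Threshold from the committee's respecification: very rapid completions
# only. This value is taken from the case narrative; everything else here
# is a hypothetical sketch.
FRAUD_REVIEW_THRESHOLD_SECONDS = 90


@dataclass
class Application:
    completion_seconds: float


def fraud_signal(app: Application) -> bool:
    """Flag the application for fraud review.

    This flag is routed to the fraud team only; it is deliberately NOT
    included in the credit risk model's feature set.
    """
    return app.completion_seconds < FRAUD_REVIEW_THRESHOLD_SECONDS
```

The design point is separation of concerns: the same raw datum (completion time) can be a legitimate fraud signal while being an illegitimate credit risk feature, so it is consumed by one pipeline and excluded from the other.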

Model performance after retraining: Default prediction accuracy declined from the peak — but remained 9.7% better than the pre-ML rules-based system. The model still significantly outperformed the prior approach.
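The before/after comparison is a relative-improvement calculation against the pre-ML baseline. The sketch below uses an invented absolute baseline accuracy purely to reproduce the relative figures cited in the case (14% at peak, 9.7% after retraining); no absolute numbers appear in the source.

```python
# Illustrative arithmetic only. The absolute accuracy values are assumptions
# chosen to match the relative improvements reported in the case study.
baseline_accuracy = 0.700                        # pre-ML rules-based system (assumed)
peak_accuracy = baseline_accuracy * 1.14         # original ML model (+14%)
retrained_accuracy = baseline_accuracy * 1.097   # after feature removal (+9.7%)


def relative_improvement(model: float, baseline: float) -> float:
    """Improvement of `model` over `baseline`, as a fraction of baseline."""
    return (model - baseline) / baseline


print(f"Peak vs baseline:      {relative_improvement(peak_accuracy, baseline_accuracy):.1%}")
print(f"Retrained vs baseline: {relative_improvement(retrained_accuracy, baseline_accuracy):.1%}")
```

Framed this way, the governance question in Discussion Question 4 is whether the 4.3-point gap between peak and retrained improvement is the cost of an ethics failure or the price of correcting one.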

Documentation: The ethics review was documented and included in Cornerstone's model governance record and EU AI Act conformity assessment documentation. The documentation made explicit: three features were identified as ethically problematic; two were removed; one was respecified. This transparency was a deliberate choice — Cornerstone wanted its AI Act documentation to reflect genuine ethical engagement, not just technical compliance.


Discussion Questions

1. The credit risk head argued that removing ethically questionable but accurate features would hurt borrowers who would receive more credit they could not repay. Is this a valid consequentialist argument? How should an ethical analysis weigh aggregate credit outcomes against the rights-based concerns of individual borrowers who are penalized for time-of-application?

2. Mobile device age was used as a proxy for socioeconomic status. The feature was not illegal — income is not a protected characteristic under the Equality Act. How should an institution decide which proxy variables are acceptable to use, when there is no specific legal prohibition but the feature effectively encodes a characteristic that the institution would not use directly?

3. The review found that a model performing strongly on aggregate metrics contained ethically problematic features. Cornerstone's original model development and validation process had not identified these concerns. What changes to the model development and validation process would have surfaced these ethical questions before the model went to production?

4. After retraining, the model's default prediction declined but still outperformed the pre-ML system. Does this outcome — better than before but worse than the peak — represent a success or a failure of ethics-by-design? How should Cornerstone communicate this outcome to its Board and, if asked, to the FCA?

5. Cornerstone documented the ethics review and its findings transparently in its EU AI Act conformity assessment. This transparency included acknowledging that the production model had contained ethically problematic features for eighteen months. Evaluate the risks and benefits of this level of transparency. Is radical transparency about past ethical failures a compliance advantage or a regulatory liability?