Case Study 2: Meridian Financial — Regulatory Monitoring for Fair Lending Compliance
Context
Meridian Financial, the mid-size consumer lending institution from Case Study 2 of Chapter 28, has deployed a gradient-boosted credit scoring model that passed all pre-deployment validation checks: data quality, behavioral tests, adverse impact analysis, and score stability (PSI < 0.10). The model performed well at launch.
But regulatory guidance — specifically SR 11-7, the supervisory guidance on model risk management issued by the Federal Reserve and adopted by the OCC as Bulletin 2011-12 — does not end at deployment. The guidance requires ongoing monitoring of all models used in credit decisions. Concretely, the model risk management (MRM) team requires:
- Monthly stability reporting: PSI of the score distribution and key features, compared to the development sample.
- Quarterly performance assessment: Model discrimination (AUC, KS statistic) on the most recent quarter's data, with ground truth labels (default/non-default) for loans that have matured.
- Continuous fair lending monitoring: Adverse impact ratios (AIR) for all ECOA-protected groups, updated monthly with the latest decision data.
- Trigger-based re-validation: Automatic full re-validation when any trigger threshold is exceeded (score PSI > 0.20, default rate deviation > 0.5 percentage points, AIR below 0.80 for any group).
The data science team must build a monitoring system that satisfies these requirements while also providing operational observability for the ML serving infrastructure.
The Monitoring Architecture
Layer 1: System Health
The credit scoring API serves 50,000-80,000 scoring requests per business day. System health monitoring is standard SRE practice:
| SLI | SLO | Rationale |
|---|---|---|
| Availability | 99.95% | Credit decisions cannot be delayed; applicants are waiting |
| Latency p99 | 200ms | Embedded in the application workflow; <200ms is imperceptible |
| Error rate | < 0.01% | Every error requires manual underwriting, costing $45/application |
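The error-rate SLO translates directly into a cost ceiling. A back-of-the-envelope sketch, using the volumes stated above (variable names are illustrative):

```python
# Error budget implied by the SLOs above, at the low end of the
# stated 50,000-80,000 requests/day volume.
requests_per_day = 50_000
max_error_rate = 0.0001  # 0.01% error-rate SLO
cost_per_error = 45      # manual underwriting cost per failed request ($)

allowed_errors_per_day = requests_per_day * max_error_rate
daily_cost_at_slo = allowed_errors_per_day * cost_per_error

print(f"Allowed errors/day at SLO: {allowed_errors_per_day:.0f}")
print(f"Worst-case daily cost at SLO: ${daily_cost_at_slo:.0f}")
```

At the SLO boundary this is roughly five errors and a few hundred dollars of manual underwriting per day; anything beyond that is an SLO breach.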
Layer 2: Data Quality
The credit model consumes features from four sources (application database, credit bureau, transaction history, employment verification). Each source has a data contract from Chapter 28. Monitoring extends these contracts to continuous operation:
Feature freshness. Credit bureau data is refreshed nightly. Transaction history is refreshed hourly. Employment verification is refreshed weekly. The monitoring system checks that each source is within its freshness SLA on every scoring request.
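A freshness gate of this kind can be sketched as a lookup against per-source SLAs. The SLA values mirror the text; the source names and function are illustrative, not Meridian's actual implementation:

```python
from datetime import datetime, timedelta

# Freshness SLAs from the text: nightly bureau refresh, hourly
# transactions, weekly employment verification.
FRESHNESS_SLA = {
    "credit_bureau": timedelta(days=1),
    "transaction_history": timedelta(hours=1),
    "employment_verification": timedelta(days=7),
}


def stale_sources(last_refreshed: dict, now: datetime) -> list:
    """Return sources whose last refresh is older than their SLA
    (or that have no recorded refresh at all)."""
    return [
        source
        for source, sla in FRESHNESS_SLA.items()
        if source not in last_refreshed
        or now - last_refreshed[source] > sla
    ]
```

A scoring request would be rejected, or routed to manual review, whenever `stale_sources` returns a non-empty list.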
Feature drift. Monthly PSI computation for all 200 features, comparing the current month's applicant population to the development sample. The MRM team requires this report for regulatory files.
Population shift. The demographic composition of applicants changes over time — economic conditions, marketing campaigns, and geographic expansion all affect who applies. The monitoring system tracks the distribution of key demographic variables (income brackets, geographic regions, credit score tiers) and flags significant shifts.
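The PSI computation behind both the feature-drift and score-stability checks can be sketched as follows. Quantile bins are fixed on the development sample, then current data is scored against them; this is one common convention, and bin counts and epsilon handling vary by shop:

```python
import numpy as np


def population_stability_index(
    expected: np.ndarray,
    actual: np.ndarray,
    n_bins: int = 10,
) -> float:
    """PSI of `actual` relative to `expected`, with quantile bins
    fixed on the expected (development) sample."""
    edges = np.quantile(expected, np.linspace(0.0, 1.0, n_bins + 1))

    def bin_fractions(values: np.ndarray) -> np.ndarray:
        # Assign each value to a bin; out-of-range values are
        # clipped into the first/last bin.
        idx = np.clip(
            np.searchsorted(edges, values, side="right") - 1,
            0,
            n_bins - 1,
        )
        return np.bincount(idx, minlength=n_bins) / len(values)

    eps = 1e-6  # guard against empty bins
    e = bin_fractions(expected) + eps
    a = bin_fractions(actual) + eps
    return float(np.sum((a - e) * np.log(a / e)))
```

The usual rule of thumb reads PSI < 0.10 as stable and PSI > 0.25 as a significant shift, consistent with the thresholds used throughout this case study.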
```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class RegulatoryMonitoringReport:
    """Monthly regulatory monitoring report for credit models.

    Designed to satisfy SR 11-7 ongoing monitoring requirements.
    Produced automatically and reviewed by the MRM team before
    filing.

    Attributes:
        model_name: Registered model name and version.
        report_period: Month covered by the report (YYYY-MM).
        generated_at: Timestamp of report generation.
        score_psi: PSI of the score distribution vs. development.
        feature_psi: PSI for each feature vs. development.
        adverse_impact_ratios: AIR for each protected group.
        performance_metrics: Model discrimination metrics (if
            ground truth is available for matured loans).
        default_rate_observed: Observed default rate for matured
            loans in this period.
        default_rate_predicted: Model-predicted default rate for
            the same cohort.
        trigger_violations: List of trigger thresholds exceeded.
        recommendation: MRM team recommendation.
    """

    model_name: str
    report_period: str
    generated_at: datetime
    score_psi: float
    feature_psi: Dict[str, float] = field(default_factory=dict)
    adverse_impact_ratios: Dict[str, float] = field(
        default_factory=dict
    )
    performance_metrics: Dict[str, float] = field(
        default_factory=dict
    )
    default_rate_observed: Optional[float] = None
    default_rate_predicted: Optional[float] = None
    trigger_violations: List[str] = field(default_factory=list)
    recommendation: str = ""

    def check_triggers(self) -> List[str]:
        """Check all regulatory trigger thresholds.

        Returns:
            List of trigger descriptions for any violations.
        """
        violations = []

        # Trigger 1: Score distribution instability
        if self.score_psi > 0.20:
            violations.append(
                f"Score PSI {self.score_psi:.3f} exceeds "
                f"threshold 0.20 — model score distribution "
                f"has shifted significantly from development."
            )

        # Trigger 2: Default rate deviation
        if (
            self.default_rate_observed is not None
            and self.default_rate_predicted is not None
        ):
            deviation = abs(
                self.default_rate_observed
                - self.default_rate_predicted
            )
            if deviation > 0.005:  # 0.5 percentage points
                violations.append(
                    f"Default rate deviation {deviation:.4f} "
                    f"(observed: {self.default_rate_observed:.4f}, "
                    f"predicted: {self.default_rate_predicted:.4f}) "
                    f"exceeds threshold 0.005."
                )

        # Trigger 3: Adverse impact ratio
        for group, air in self.adverse_impact_ratios.items():
            if air < 0.80:
                violations.append(
                    f"Adverse impact ratio for {group} is "
                    f"{air:.3f}, below the four-fifths threshold "
                    f"of 0.80."
                )

        # Trigger 4: Feature instability (sorted descending so the
        # report shows the worst offenders first)
        high_psi_features = sorted(
            (
                (name, psi)
                for name, psi in self.feature_psi.items()
                if psi > 0.25
            ),
            key=lambda item: item[1],
            reverse=True,
        )
        if len(high_psi_features) > 5:
            violations.append(
                f"{len(high_psi_features)} features have PSI > 0.25, "
                f"indicating broad population shift. "
                f"Top: {high_psi_features[:3]}"
            )

        self.trigger_violations = violations
        return violations

    def requires_revalidation(self) -> bool:
        """Determine if the model requires full re-validation.

        Returns:
            True if any trigger threshold is exceeded.
        """
        if not self.trigger_violations:
            self.check_triggers()
        return len(self.trigger_violations) > 0

    def generate_summary(self) -> str:
        """Generate executive summary for MRM review.

        Returns:
            Formatted summary string.
        """
        status = (
            "TRIGGER EXCEEDED — REVALIDATION REQUIRED"
            if self.requires_revalidation()
            else "WITHIN THRESHOLDS"
        )
        lines = [
            f"Model: {self.model_name}",
            f"Period: {self.report_period}",
            f"Status: {status}",
            f"Score PSI: {self.score_psi:.4f}",
            f"Features with PSI > 0.10: "
            f"{sum(1 for v in self.feature_psi.values() if v > 0.10)}/"
            f"{len(self.feature_psi)}",
        ]
        if self.adverse_impact_ratios:
            min_air_group = min(
                self.adverse_impact_ratios,
                key=self.adverse_impact_ratios.get,
            )
            min_air_value = self.adverse_impact_ratios[min_air_group]
            lines.append(
                f"Minimum AIR: {min_air_value:.3f} ({min_air_group})"
            )
        if self.trigger_violations:
            lines.append("\nTrigger Violations:")
            for v in self.trigger_violations:
                lines.append(f"  - {v}")
        return "\n".join(lines)
```
Layer 3: Fair Lending Monitoring
Fair lending monitoring is the regulatory layer that has no analogue in a recommendation system like StreamRec. The monitoring system must continuously track whether the model's decisions have a disparate impact on protected groups.
Adverse Impact Ratio (AIR) monitoring. For each ECOA-protected group (race, sex, age, national origin, marital status), compute:
$$\text{AIR} = \frac{\text{Approval rate for protected group}}{\text{Approval rate for control group}}$$
The four-fifths rule requires AIR $\geq$ 0.80. Meridian's internal standard is $\geq$ 0.85. The monitoring system computes AIR monthly using the latest decision data and fires an alert if any group drops below 0.85.
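Plugging illustrative numbers into the definition (the rates here are hypothetical, not Meridian's):

```python
# Hypothetical monthly approval rates.
control_approval = 0.62    # control group
protected_approval = 0.51  # protected group

air = protected_approval / control_approval  # ~0.823
# Above the 0.80 regulatory floor, but below Meridian's internal
# 0.85 standard: this would fire an internal investigation alert.
```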
Demographic parity monitoring. Beyond AIR, the system tracks the score distribution for each protected group. If the score distributions diverge (measured by KS statistic), the model's treatment of different groups may be drifting — even if the overall AIR remains above threshold.
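The group-level distribution check reduces to a two-sample KS statistic over score arrays. A minimal sketch, implemented directly to stay dependency-light (it computes the same statistic as `scipy.stats.ks_2samp`):

```python
import numpy as np


def ks_statistic(scores_a: np.ndarray, scores_b: np.ndarray) -> float:
    """Two-sample KS statistic: the maximum vertical gap between
    the two empirical CDFs."""
    a, b = np.sort(scores_a), np.sort(scores_b)
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))
```

An alert on the KS statistic between each protected group's score distribution and the control group's (with a threshold the team would tune empirically) flags divergence even while AIR stays above threshold.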
Reason code monitoring. Under FCRA, every adverse action must include reason codes explaining the top factors. The monitoring system tracks reason code frequency by protected group. If a specific reason code (e.g., "high debt-to-income") appears disproportionately for one group, it may indicate a proxy variable issue — the model is using a facially neutral feature that correlates with protected status.
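Reason-code drift by group can be tracked with simple per-group frequency tables. The sketch below uses (group, reason_code) pairs as a stand-in for the adverse-action log; the data shape is illustrative:

```python
from collections import Counter


def reason_code_rates(adverse_actions) -> dict:
    """Fraction of each group's adverse actions that carry each
    reason code. `adverse_actions` is an iterable of
    (group, reason_code) pairs."""
    actions = list(adverse_actions)
    per_group_totals = Counter(group for group, _ in actions)
    pair_counts = Counter(actions)
    return {
        (group, code): count / per_group_totals[group]
        for (group, code), count in pair_counts.items()
    }
```

Comparing these rates across groups month over month surfaces a reason code that appears disproportionately for one group, like the "high debt-to-income" example above.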
```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

import numpy as np


@dataclass
class FairLendingMonitor:
    """Continuous fair lending monitoring for credit decisions.

    Tracks adverse impact ratios, score distributions by group,
    and reason code patterns. Designed for monthly reporting
    and continuous trigger-based alerting.

    Attributes:
        protected_groups: Mapping from group name to (group_col,
            protected_value, control_value).
        air_threshold: Minimum acceptable AIR (internal standard).
        regulatory_threshold: Regulatory minimum AIR (four-fifths).
    """

    protected_groups: Dict[str, Tuple[str, str, str]]
    air_threshold: float = 0.85
    regulatory_threshold: float = 0.80

    def compute_air(
        self,
        decisions: np.ndarray,
        group_labels: np.ndarray,
        protected_value: str,
        control_value: str,
    ) -> Dict[str, float]:
        """Compute adverse impact ratio for a single group.

        Args:
            decisions: Binary array (1 = approved, 0 = denied).
            group_labels: Array of group membership labels.
            protected_value: Value identifying the protected group.
            control_value: Value identifying the control group.

        Returns:
            Dictionary with approval rates and AIR.
        """
        protected_mask = group_labels == protected_value
        control_mask = group_labels == control_value

        protected_approval = (
            decisions[protected_mask].mean()
            if protected_mask.sum() > 0
            else 0.0
        )
        control_approval = (
            decisions[control_mask].mean()
            if control_mask.sum() > 0
            else 0.0
        )
        air = (
            protected_approval / control_approval
            if control_approval > 0
            else 0.0
        )
        return {
            "protected_approval_rate": float(protected_approval),
            "control_approval_rate": float(control_approval),
            "air": float(air),
            "protected_n": int(protected_mask.sum()),
            "control_n": int(control_mask.sum()),
        }

    def compute_all_air(
        self,
        decisions: np.ndarray,
        group_data: Dict[str, np.ndarray],
    ) -> Dict[str, Dict[str, float]]:
        """Compute AIR for all protected groups.

        Args:
            decisions: Binary array (1 = approved, 0 = denied).
            group_data: Mapping from group column name to array
                of group labels.

        Returns:
            Nested dict: group_name -> AIR metrics.
        """
        results = {}
        for group_name, (col, prot, ctrl) in (
            self.protected_groups.items()
        ):
            if col not in group_data:
                continue
            results[group_name] = self.compute_air(
                decisions, group_data[col], prot, ctrl
            )
        return results

    def check_violations(
        self, air_results: Dict[str, Dict[str, float]]
    ) -> List[str]:
        """Check for AIR threshold violations.

        Args:
            air_results: Output from compute_all_air.

        Returns:
            List of violation descriptions.
        """
        violations = []
        for group_name, metrics in air_results.items():
            air = metrics["air"]
            if air < self.regulatory_threshold:
                violations.append(
                    f"REGULATORY VIOLATION: {group_name} AIR "
                    f"{air:.3f} < {self.regulatory_threshold}. "
                    f"Immediate action required."
                )
            elif air < self.air_threshold:
                violations.append(
                    f"INTERNAL THRESHOLD: {group_name} AIR "
                    f"{air:.3f} < {self.air_threshold}. "
                    f"Investigation required."
                )
        return violations


# Meridian Financial fair lending configuration
meridian_fair_lending = FairLendingMonitor(
    protected_groups={
        "race_black": ("race", "black", "white"),
        "race_hispanic": ("race", "hispanic", "white"),
        "race_asian": ("race", "asian", "white"),
        "sex_female": ("sex", "female", "male"),
        "age_young": ("age_group", "18-24", "25-64"),
        "age_senior": ("age_group", "65+", "25-64"),
        "marital_single": (
            "marital_status", "single", "married"
        ),
    },
    air_threshold=0.85,
    regulatory_threshold=0.80,
)
```
Layer 4: Model Performance (Delayed Ground Truth)
Credit scoring has a long feedback loop: the ground truth (default/non-default) is not known for 6-18 months after the lending decision. This means real-time model performance monitoring is impossible. Instead, the team uses three strategies:
Early warning proxies. 30-day delinquency rates are available much sooner than default rates and are strongly correlated. The monitoring system tracks 30-day delinquency for each monthly cohort and compares it to the model's predicted probability of 30-day delinquency (derived from the default model using historical transition rates).
Vintage analysis. Each monthly cohort of approved loans is tracked separately over time. If the October cohort's 30-day delinquency rate is 1.8% while the model predicted 1.2%, and the September cohort was 1.3% vs. predicted 1.2%, the October cohort is underperforming — possible concept drift from economic conditions.
Backtesting on matured loans. Every quarter, the team evaluates the model on loans that have reached maturity (12+ months). This provides true performance metrics (AUC, calibration, discrimination) but with a 12-month delay.
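The vintage comparison is mechanically simple: per-cohort observed vs. predicted early-delinquency rates, flagged on an absolute deviation. A sketch using the numbers from the text; the 0.5-point threshold mirrors the default-rate trigger, and the cohort labels are illustrative:

```python
def underperforming_cohorts(cohorts: dict, threshold: float = 0.005) -> list:
    """Return cohort labels whose observed 30-day delinquency rate
    exceeds the predicted rate by more than `threshold` (absolute).

    `cohorts` maps cohort label -> (observed_rate, predicted_rate).
    """
    return [
        label
        for label, (observed, predicted) in sorted(cohorts.items())
        if observed - predicted > threshold
    ]


# The September/October example from the text:
vintages = {
    "2023-09": (0.013, 0.012),  # +0.1 pt: within tolerance
    "2023-10": (0.018, 0.012),  # +0.6 pt: flagged
}
```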
The Incident: Geographic Expansion and Covariate Shift
Seven months after deployment, Meridian expanded its lending operations from 12 states to 18. The 6 new states had a different demographic and economic profile: lower median incomes, higher unemployment, and a different industry mix.
Detection
The monthly regulatory monitoring report for Month 7 flagged three triggers:
- Score PSI = 0.24 (threshold 0.20) — the score distribution had shifted. More applicants were being scored in the 0.20-0.40 range (the manual review zone) and fewer in the auto-approve range (< 0.12).
- Feature PSI: 14 of 200 features had PSI > 0.10, and 4 had PSI > 0.25. The top contributors were `median_neighborhood_income`, `state_unemployment_rate`, `industry_sector`, and `years_at_current_address`.
- AIR for race_hispanic = 0.83 — below Meridian's internal threshold of 0.85 but above the regulatory threshold of 0.80. The 6 new states had a higher proportion of Hispanic applicants, and the model's treatment of this population was less favorable.
The monitoring system fired alerts for all three triggers. The MRM team initiated a trigger-based re-validation.
Investigation
The data science team's investigation proceeded in stages:
Stage 1: Root cause identification. The feature PSI analysis localized the shift: it was concentrated in geographic and economic features, not in applicant-specific features like FICO score or debt-to-income. This was covariate shift — the applicant population had changed — not concept drift.
Stage 2: Impact assessment. The team evaluated the current model on the new-state applicants separately. AUC was 0.74 (vs. 0.81 on the original-state population). The model was less discriminating in the new states because it had never seen these economic conditions during training.
Stage 3: Fair lending deep dive. The AIR drop for Hispanic applicants was traced to the median_neighborhood_income feature. In the 6 new states, Hispanic applicants were disproportionately located in lower-income neighborhoods. The model used median_neighborhood_income as a risk signal (higher income → lower risk), but in the new states, this feature was acting as a stronger proxy for ethnicity than it was in the original states, where income and ethnicity were less correlated.
Stage 4: Reason code analysis. The monitoring system's reason code tracking showed that "low median neighborhood income" had become the #2 reason code for adverse actions among Hispanic applicants in the new states, up from #7 in the original states. This shift was invisible in the aggregate reason code ranking — it only appeared when sliced by group and geography.
Response
The team implemented a three-part response:
- Short-term mitigation (Week 1). Raised the auto-decline threshold from 0.35 to 0.45 for new-state applicants, routing more borderline cases to human underwriters. This reduced automated adverse impact while the model was being updated.
- Model retraining (Weeks 2-4). Retrained the model on a dataset including 6 months of application data from the new states (collected during the expansion ramp-up period). Removed `median_neighborhood_income` as a direct feature and replaced it with two less proxy-sensitive alternatives: `applicant_income` and `debt_to_income_ratio`, which capture economic risk without geographic correlation to ethnicity.
- Monitoring enhancement (Week 3). Added geographic segmentation to all monitoring metrics. The monitoring system now computes PSI, AIR, and performance metrics separately for original-state and new-state applicants, in addition to the aggregate. This ensures that a population shift in one region does not mask problems in another.
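The segmentation enhancement generalizes to any per-group metric: compute it on the aggregate and on every segment, then alert on the worst segment rather than the aggregate. A minimal sketch for approval rates (the same pattern applies to PSI and AIR; names are illustrative):

```python
import numpy as np


def metric_by_segment(values: np.ndarray, segments: np.ndarray) -> dict:
    """A metric (here: the mean of `values`) computed on the
    aggregate and separately per segment, so a localized shift
    cannot hide behind the overall number."""
    out = {"aggregate": float(values.mean())}
    for seg in np.unique(segments):
        out[str(seg)] = float(values[segments == seg].mean())
    return out
```

Applied to approval decisions segmented by original-state vs. new-state, this is exactly the view that would have surfaced the new-state AIR drop before it moved the aggregate.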
Outcome
| Metric | Pre-Expansion | Post-Expansion (Month 7) | After Retraining (Month 9) |
|---|---|---|---|
| Score PSI (aggregate) | 0.04 | 0.24 | 0.07 |
| AUC (aggregate) | 0.81 | 0.78 | 0.80 |
| AUC (new states) | N/A | 0.74 | 0.79 |
| AIR race_hispanic | 0.91 | 0.83 | 0.90 |
| Features with PSI > 0.10 | 2 | 14 | 3 |
The retrained model restored performance in both the original and new states. The geographic segmentation in monitoring ensured that future expansions would be detected immediately.
The OCC examiner, reviewing the quarterly monitoring reports, specifically praised the trigger-based re-validation process: "The institution detected the population shift within one monthly reporting cycle, initiated re-validation within the regulatory trigger framework, and deployed a remediated model within 8 weeks. The monitoring infrastructure and response process are consistent with SR 11-7 expectations."
Lessons Learned
- Geographic expansion is a covariate shift event that monitoring must detect. When the applicant population changes — through expansion, marketing changes, economic shifts, or policy changes — feature distributions shift. If the monitoring system computes only aggregate PSI, geographic concentration effects can mask the shift in affected features.
- Fair lending monitoring must be segmented by the same dimensions as the business. Aggregate AIR can be above threshold while a specific geography or channel is below. Monitoring that does not segment by geography, channel, and product cannot detect localized fairness problems.
- Proxy variable strength varies by population. A feature that is a weak proxy for ethnicity in one population can be a strong proxy in another. `median_neighborhood_income` was weakly correlated with Hispanic ethnicity in the original 12 states (correlation ~0.15) but strongly correlated in the 6 new states (correlation ~0.38). The model learned the relationship from the original states and applied it to the new states, where the proxy effect was amplified.
- Regulatory monitoring and operational monitoring are the same infrastructure with different reporting cadences. Every check that satisfies a regulatory requirement — PSI stability, AIR monitoring, performance tracking — is also a check that any well-run ML system should perform. The regulatory overlay adds documentation requirements and examiner-ready formatting, but the underlying monitoring logic is identical to what StreamRec implements for operational reasons.