Case Study 2: Meridian Financial — Regulatory Model Validation for Credit Scoring
Context
Meridian Financial, the mid-size consumer lending institution introduced in Case Study 2 of Chapter 24, operates under regulatory model risk management guidance: the Federal Reserve's SR 11-7 ("Guidance on Model Risk Management") and the OCC's companion Bulletin 2011-12. This guidance requires that every model used for credit decisions undergo rigorous validation before deployment and ongoing monitoring afterward. An examiner can — and frequently does — ask to see the validation evidence for any production model.
Meridian's current credit scoring model is a gradient-boosted tree ensemble (XGBoost, 500 trees, 200 features) trained on 3 years of application data. The model scores applicants on a 0-1 probability-of-default scale. Applications scoring below 0.12 are auto-approved; applications scoring above 0.35 are auto-declined; applications between 0.12 and 0.35 are routed to human underwriters for manual review.
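The score-band routing above is simple enough to state directly in code. A minimal sketch (the function name and decision labels are illustrative, not Meridian's actual system):

```python
def route_application(pd_score: float) -> str:
    """Route an application by the model's probability-of-default score.

    Bands follow the case study: below 0.12 auto-approve, above 0.35
    auto-decline, anything in between goes to a human underwriter.
    """
    if pd_score < 0.12:
        return "AUTO_APPROVE"
    if pd_score > 0.35:
        return "AUTO_DECLINE"
    return "MANUAL_REVIEW"
```

Note that the boundary scores (exactly 0.12 or 0.35) fall into manual review, the conservative choice for an ambiguous band edge.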
The model risk management (MRM) team has tasked the data science team with building a validation infrastructure that satisfies three requirements:
- Examiner-ready documentation: Every validation check must produce auditable evidence — not just a pass/fail status, but the data, methodology, and results, stored for 7 years.
- Pre-deployment gating: No model reaches production without passing the full validation suite.
- Ongoing monitoring: Production models must be re-validated quarterly and whenever a trigger event occurs (PSI > 0.20 on the score distribution, default rate deviation > 0.5 percentage points from expected).
The Validation Framework
Tier 1: Data Validation
The credit model's training data comes from four sources: the application database, the credit bureau (Experian/TransUnion), the internal transaction history, and a third-party employment verification service. Each source has a data contract specifying schema, freshness, and quality requirements.
The data validation suite contains 47 expectations organized by source:
| Source | Expectations | Critical Checks |
|---|---|---|
| Application DB | 15 | Income > 0, employment_years $\geq$ 0, requested_amount in [$500, $50,000] |
| Credit bureau | 14 | FICO in [300, 850], utilization_ratio in [0, 1], n_accounts $\geq$ 0 |
| Transaction history | 10 | Transaction_count > 0 for active accounts, monthly_balance $\geq$ 0 |
| Employment verification | 8 | Salary > 0 when employed, tenure_months $\geq$ 0 |
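The critical checks in the table reduce to simple predicates over each record. A sketch of a few of them as a single check function (field names such as `requested_amount` and `utilization_ratio` are assumptions based on the table, not Meridian's actual schema):

```python
def check_application_record(rec: dict) -> list:
    """Return the names of failed critical checks for one applicant record.

    Thresholds mirror the expectations table: positive income,
    non-negative employment years, requested amount in [$500, $50,000],
    FICO in [300, 850], utilization ratio in [0, 1].
    """
    failures = []
    if not rec.get("income", 0) > 0:
        failures.append("income_positive")
    if not rec.get("employment_years", -1) >= 0:
        failures.append("employment_years_nonneg")
    if not 500 <= rec.get("requested_amount", 0) <= 50_000:
        failures.append("requested_amount_in_range")
    if not 300 <= rec.get("fico", 0) <= 850:
        failures.append("fico_in_range")
    if not 0 <= rec.get("utilization_ratio", -1) <= 1:
        failures.append("utilization_in_range")
    return failures
```

Returning the list of failed check names, rather than a bare boolean, is what makes the result auditable: the failure names feed directly into the examiner-ready evidence trail.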
Regulatory addition: the suite includes a representativeness check that verifies the training data's demographic distribution does not deviate significantly from the applicant population. If the training data underrepresents any ECOA-protected group by more than 20% relative to the application population, the validation fails and the data science team must investigate sampling bias.
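One way to read "underrepresents by more than 20% relative to the application population" is as a relative shortfall of the training share against the population share. A sketch under that interpretation (the function and group labels are illustrative):

```python
def representativeness_check(train_share: dict, population_share: dict,
                             max_underrep: float = 0.20) -> dict:
    """Flag groups underrepresented in training data relative to the
    applicant population.

    Shares are fractions per group. Relative underrepresentation is
    1 - (train share / population share); a flag of True means the
    group's shortfall exceeds max_underrep and validation should fail.
    """
    flags = {}
    for group, pop in population_share.items():
        train = train_share.get(group, 0.0)
        underrep = 1.0 - (train / pop) if pop > 0 else 0.0
        flags[group] = underrep > max_underrep
    return flags
```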
Tier 2: Behavioral Tests
The MRM team specified 18 behavioral tests derived from regulatory requirements:
Minimum Functionality (6 tests):
- AUC $\geq$ 0.78 overall
- AUC $\geq$ 0.72 for each income quartile (4 tests)
- Hosmer-Lemeshow calibration $p$-value $> 0.05$
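The AUC floors above (0.78 overall, 0.72 per income quartile) can be checked with a single gate function. A self-contained sketch using a pure-Python AUC via the Mann-Whitney statistic (the function names are illustrative; the Hosmer-Lemeshow check is omitted for brevity):

```python
def auc(y_true, y_score):
    """AUC via the Mann-Whitney statistic (positive class = default)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need both defaulters and non-defaulters")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def min_functionality_pass(y_true, y_score, quartile_ids,
                           overall_floor=0.78, quartile_floor=0.72):
    """Apply the overall and per-income-quartile AUC floors."""
    if auc(y_true, y_score) < overall_floor:
        return False
    for q in set(quartile_ids):
        idx = [i for i, qq in enumerate(quartile_ids) if qq == q]
        q_auc = auc([y_true[i] for i in idx], [y_score[i] for i in idx])
        if q_auc < quartile_floor:
            return False
    return True
```

Checking AUC per income quartile, not just overall, catches models that rank well in aggregate while performing poorly for a specific income segment.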
Invariance — Fair Lending (6 tests):
- Gender invariance: changing gender must not change the score by more than 0.01
- Marital status invariance: same constraint
- Race/ethnicity invariance: same constraint (tested via constructed matched pairs)
- National origin invariance: applicant born in the U.S. vs. not must not change the score by more than 0.01
- Zip code partial invariance: changing zip code (a geographic proxy for race) while holding all financial factors constant must not change the score by more than 0.02
- Age partial invariance: changing age within an adult range (25-65) while holding tenure and other factors constant should have limited effect (< 0.03) — age can be legitimately predictive via ECOA exceptions, but the model should not be overly sensitive
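Each invariance test follows the same counterfactual pattern: flip one protected attribute, hold everything else fixed, and bound the score movement. A minimal sketch (the `model_score` callable and field names are assumptions; the tolerance defaults to the 0.01 used for the full-invariance checks):

```python
def invariance_test(model_score, applicant: dict, attribute: str,
                    alt_value, tolerance: float = 0.01) -> bool:
    """Fair-lending invariance check via a constructed matched pair.

    model_score is any callable mapping an applicant dict to a PD score.
    Returns True if flipping `attribute` to `alt_value` moves the score
    by at most `tolerance`.
    """
    counterfactual = dict(applicant, **{attribute: alt_value})
    delta = abs(model_score(counterfactual) - model_score(applicant))
    return delta <= tolerance
```

The partial-invariance tests (zip code, age) reuse the same function with the looser tolerances listed above (0.02 and 0.03).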
Directional (6 tests):
- Higher income → lower predicted default probability
- Higher FICO score → lower predicted default probability
- Higher debt-to-income ratio → higher predicted default probability
- Longer employment tenure → lower predicted default probability
- More delinquencies → higher predicted default probability
- Higher credit utilization → higher predicted default probability
Each directional test uses constructed pairs: two applicants identical on all features except the one being varied. The model must move in the expected direction for $\geq 95\%$ of test pairs (a higher threshold than StreamRec's 70%, reflecting the regulatory environment).
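The constructed-pair procedure above can be sketched as a pass-rate computation over (low, high) pairs (names and the pair representation are illustrative):

```python
def directional_pass_rate(model_score, pairs, expected_sign: int) -> float:
    """Fraction of constructed pairs moving in the expected direction.

    Each pair is (low_applicant, high_applicant), identical except for
    the varied feature. expected_sign is +1 if a higher feature value
    should raise the PD score (e.g. debt-to-income), -1 if it should
    lower it (e.g. income, FICO). The gate requires a rate >= 0.95.
    """
    hits = 0
    for low, high in pairs:
        delta = model_score(high) - model_score(low)
        hits += 1 if delta * expected_sign > 0 else 0
    return hits / len(pairs)
```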
```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class RegulatoryValidationReport:
    """Structured report for regulatory model validation.

    Designed to satisfy SR 11-7 / OCC 2011-12 documentation
    requirements. Every field is required for examiner review.

    Attributes:
        model_name: Registered model name and version.
        validation_date: Date of validation.
        validator: Name and title of the independent validator.
        model_developer: Name of the model development team.
        data_period: Training data date range.
        holdout_period: Holdout data date range.
        data_validation_results: Pass/fail and details for each
            data expectation.
        behavioral_test_results: Pass/fail and details for each
            behavioral test.
        metric_comparison: Champion vs. challenger on all metrics.
        fair_lending_results: Adverse impact ratios by protected group.
        stability_results: PSI for score distribution and key features.
        overall_decision: APPROVE, CONDITIONAL, or REJECT.
        conditions: List of conditions (for CONDITIONAL decisions).
        findings: List of model risk findings.
        reviewer_signature: MRM reviewer who approved the report.
    """

    model_name: str
    validation_date: str
    validator: str
    model_developer: str
    data_period: str
    holdout_period: str
    data_validation_results: List[Dict[str, Any]] = field(
        default_factory=list
    )
    behavioral_test_results: List[Dict[str, Any]] = field(
        default_factory=list
    )
    metric_comparison: Dict[str, Dict[str, float]] = field(
        default_factory=dict
    )
    fair_lending_results: Dict[str, float] = field(default_factory=dict)
    stability_results: Dict[str, float] = field(default_factory=dict)
    overall_decision: str = "REJECT"
    conditions: List[str] = field(default_factory=list)
    findings: List[str] = field(default_factory=list)
    reviewer_signature: str = ""

    def is_deployable(self) -> bool:
        """Check if the model is approved for deployment.

        Returns:
            True if decision is APPROVE or CONDITIONAL (with conditions
            addressed).
        """
        return self.overall_decision in ("APPROVE", "CONDITIONAL")

    def summary(self) -> str:
        """Generate executive summary for MRM committee review."""
        n_data_pass = sum(
            1 for r in self.data_validation_results if r.get("passed")
        )
        n_behavioral_pass = sum(
            1 for r in self.behavioral_test_results if r.get("passed")
        )
        return (
            f"Model Validation Report: {self.model_name}\n"
            f"Date: {self.validation_date}\n"
            f"Decision: {self.overall_decision}\n"
            f"Data Validation: {n_data_pass}/"
            f"{len(self.data_validation_results)} passed\n"
            f"Behavioral Tests: {n_behavioral_pass}/"
            f"{len(self.behavioral_test_results)} passed\n"
            f"Findings: {len(self.findings)}\n"
            f"Conditions: {len(self.conditions)}\n"
        )
```
Tier 3: Model Validation Gate
The validation gate for credit models includes all standard checks (metrics, behavioral tests, latency, size) plus regulatory-specific checks:
Adverse impact analysis. For each ECOA-protected group (race, sex, age, national origin, marital status), the gate computes the adverse impact ratio:
$$\text{AIR} = \frac{\text{Approval rate for protected group}}{\text{Approval rate for control group}}$$
The four-fifths rule requires AIR $\geq 0.80$. Meridian applies a stricter internal standard of AIR $\geq 0.85$ to provide a safety margin.
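The AIR computation and Meridian's stricter gate are a few lines of arithmetic (function names are illustrative):

```python
def adverse_impact_ratio(approved_protected: int, total_protected: int,
                         approved_control: int, total_control: int) -> float:
    """AIR = protected-group approval rate / control-group approval rate."""
    protected_rate = approved_protected / total_protected
    control_rate = approved_control / total_control
    return protected_rate / control_rate

def air_gate(air: float, threshold: float = 0.85) -> bool:
    """Meridian's internal standard, stricter than the 0.80 four-fifths rule."""
    return air >= threshold
```

For example, 60 approvals out of 100 protected-group applicants against 75 of 100 control-group applicants gives AIR = 0.60 / 0.75 = 0.80: it satisfies the four-fifths rule but fails Meridian's internal 0.85 gate.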
Score distribution stability. PSI of the model's score distribution between the development sample and the validation sample must be below 0.10. PSI above 0.10 suggests the model's behavior has shifted between development and validation, which may indicate overfitting or data quality issues.
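PSI compares two binned distributions of the model score. A self-contained sketch over pre-binned counts (the bin counts and epsilon guard are illustrative choices):

```python
import math

def psi(expected_counts, actual_counts, eps: float = 1e-6) -> float:
    """Population Stability Index over pre-binned score counts.

    PSI = sum_i (a_i - e_i) * ln(a_i / e_i), where e_i and a_i are the
    expected (development) and actual (validation) proportions in bin i.
    eps guards against empty bins, which would make the log blow up.
    """
    e_total = sum(expected_counts)
    a_total = sum(actual_counts)
    total = 0.0
    for e, a in zip(expected_counts, actual_counts):
        e_p = max(e / e_total, eps)
        a_p = max(a / a_total, eps)
        total += (a_p - e_p) * math.log(a_p / e_p)
    return total
```

Identical distributions give PSI = 0; the gate here requires PSI below 0.10 between development and validation samples, while the production drift trigger fires at 0.20.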
Reason code coverage. Under FCRA, every adverse action (denial) must include a notice listing the top factors. The gate verifies that the top 5 reason codes account for $\geq 90\%$ of adverse actions. Low coverage suggests the model's decisions are driven by many small features rather than a few interpretable factors — a red flag for both regulators and consumers.
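The coverage check counts how many adverse actions are explained by the most frequent reason codes. A sketch with the simplifying assumption that each denial contributes its single primary reason (production adverse action notices can list up to four):

```python
from collections import Counter

def top_k_reason_coverage(adverse_action_reasons, k: int = 5) -> float:
    """Share of adverse actions whose primary reason code is among the
    top k codes overall. The gate requires this to be >= 0.90.
    """
    counts = Counter(adverse_action_reasons)
    top_k_total = sum(c for _, c in counts.most_common(k))
    return top_k_total / len(adverse_action_reasons)
```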
Concentration analysis. The gate checks that no single feature contributes more than 30% of the model's total importance. High feature concentration is not inherently problematic, but it creates fragility: if that feature's data source degrades, the model's performance degrades disproportionately.
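The concentration check normalizes feature importances and flags anything above the 30% share (the importance source, e.g. XGBoost gain, is left abstract here):

```python
def concentration_check(importances: dict, max_share: float = 0.30) -> list:
    """Return features whose share of total importance exceeds max_share.

    importances maps feature name to a raw importance value; shares are
    computed against the sum so the scale of the raw values is irrelevant.
    """
    total = sum(importances.values())
    return [f for f, v in importances.items() if v / total > max_share]
```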
Tier 4: Ongoing Monitoring
After deployment, the production model is re-validated on a quarterly cadence. The quarterly validation runs the full suite (data validation, behavioral tests, metric evaluation) against the most recent quarter's data. Additionally, three trigger-based validations fire automatically:
- Score drift trigger: PSI of the monthly score distribution exceeds 0.20
- Performance trigger: Observed 90-day default rate deviates from predicted by more than 0.5 percentage points
- Data trigger: Any data validation checkpoint fails for 3 consecutive days
Each trigger generates a RegulatoryValidationReport and notifies both the data science team and the MRM team. If the quarterly or trigger-based validation produces a REJECT decision, the model must be retrained or replaced within 30 days.
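The three trigger conditions can be evaluated in one place. A sketch (function and trigger names are illustrative; note the 0.5-percentage-point performance threshold becomes 0.005 in rate terms):

```python
def evaluate_triggers(score_psi: float,
                      predicted_default_rate: float,
                      observed_default_rate: float,
                      consecutive_data_failures: int) -> list:
    """Return which trigger-based re-validations should fire."""
    fired = []
    if score_psi > 0.20:                      # score drift trigger
        fired.append("score_drift")
    # 0.5 percentage points = 0.005 in absolute rate terms
    if abs(observed_default_rate - predicted_default_rate) > 0.005:
        fired.append("performance")
    if consecutive_data_failures >= 3:        # data trigger
        fired.append("data")
    return fired
```

Any non-empty result would generate a RegulatoryValidationReport and notify both teams, as described above.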
Outcome
The validation framework was implemented over 8 weeks. In its first year of operation:
| Metric | Result |
|---|---|
| Models submitted for validation | 12 |
| Models approved on first submission | 7 |
| Models approved with conditions | 3 |
| Models rejected | 2 |
| Trigger-based re-validations | 4 |
| Regulatory examination findings related to model validation | 0 |
The two rejected models failed for different reasons. The first failed the adverse impact analysis: AIR for Hispanic applicants was 0.76, below the 0.85 internal threshold. Investigation revealed that the model was using a zip-code-derived feature (median neighborhood income) that served as a proxy for ethnicity. Removing the feature and retraining produced a model with AIR = 0.89, which passed on resubmission. The second rejection was a stability failure: PSI of 0.18 between development and validation score distributions. The root cause was a credit bureau data format change during the validation period that affected 3 features.
The three conditional approvals included conditions such as: "Monitor AIR for age group 18-21 monthly; re-validate if AIR drops below 0.85" and "Recompute reason code coverage after Q2 data refresh." All conditions were tracked in a remediation log and closed within the specified timeframes.
The most significant outcome was the regulatory examination in Month 10. The OCC examiner reviewed the validation framework, tested three models' validation reports, and produced zero findings related to model validation — a first for Meridian in three examination cycles. The examiner specifically cited the behavioral test suite and the automated adverse impact analysis as best practices.
Lessons Learned
- Regulatory requirements are a superset of good engineering practice. Every check that the MRM team required — data validation, behavioral testing, metric comparison, fairness analysis, stability monitoring — is also a check that any well-run ML system should have. The regulatory overlay adds documentation rigor and retention requirements, but the underlying validation logic is identical.
- The behavioral test suite is the most valuable artifact for regulatory communication. Examiners understand test suites. "We run 18 behavioral tests including 6 fair lending invariance checks, and all 18 must pass before a model can be deployed" is a sentence that resonates with regulators. Abstract metrics like AUC do not.
- Adverse impact analysis must run on realistic data, not synthetic test cases. Early versions of the invariance tests used synthetic applicant profiles. The examiner pointed out that synthetic data may not capture the correlation structure of real applicant features, and required that the analysis be rerun on stratified samples from the actual applicant population.
- The validation gate is a negotiation tool, not just a technical check. When the data science team proposed a model with a new feature set, the MRM team reviewed the validation gate configuration and negotiated tighter thresholds for specific checks. The gate configuration became a living document that encoded the agreed-upon risk appetite between the two teams.