28 min read

> "All models are wrong, but some are useful — and in credit risk, the distinction between the two can determine whether a bank survives a downturn or does not."

In This Chapter

Part 3: Risk Management and Regulatory Reporting
Opening Narrative: Verdant Bank's IRB Ambition
15.1 Credit Risk Fundamentals: The Expected Loss Framework
15.2 The Basel Framework for Credit Risk
15.3 Scorecard Models: The Workhorse of Retail Credit Risk
15.4 Python Implementation: Credit Scorecard
15.5 Credit Risk Model Validation
15.6 SR 11-7: The Gold Standard for Model Risk Management
15.7 Machine Learning in Credit Risk: Promise and Peril
15.8 Retail vs. Wholesale Credit Risk Modelling
15.9 Through-the-Cycle vs. Point-in-Time PD Estimates
15.10 Model Risk Framework: Governance Architecture
Summary

Exercises Quiz Case Study 01 Case Study 02 Key Takeaways Further Reading

Chapter 15: Credit Risk Modelling and Model Risk Management

Part 3: Risk Management and Regulatory Reporting

"All models are wrong, but some are useful — and in credit risk, the distinction between the two can determine whether a bank survives a downturn or does not."

— Adapted from George E. P. Box

Opening Narrative: Verdant Bank's IRB Ambition

Maya Osei did not come to banking through the traditional route. She had spent her twenties at a legal aid charity in South London, helping small business owners navigate insolvency proceedings. By 32, she had worked her way into a Chief Compliance Officer role at Verdant Bank — a digital challenger bank licensed by the Prudential Regulation Authority in 2022 — and what she remembered most from those early years was the particular look on a business owner's face when they realised the credit decision had been made before they even sat down. The model had already spoken.

Verdant's credit risk function was, by challenger bank standards, relatively sophisticated. The retail portfolio ran on a vendor-provided scorecard. The SME portfolio — a deliberate strategic choice to serve the underbanked small business community — ran on something more ad hoc: a hybrid of relationship manager judgment, a rudimentary scoring tool built during the bank's Series B fundraising period, and a set of internal credit policies that had accumulated like geological strata since the bank's founding.

In early 2024, Verdant's Chief Risk Officer placed a project on Maya's desk: the bank intended to apply for IRB (Internal Ratings-Based) approval from the PRA for its SME portfolio. Not full Advanced IRB — that would require years of data the bank did not yet have — but Foundation IRB, where the bank would estimate Probability of Default internally while using supervisory estimates for Loss Given Default and Exposure at Default.

The implication was clear. Verdant would need to build a credit risk model that could withstand regulatory scrutiny. And that meant Maya needed to understand not just what the model did, but every assumption embedded in its architecture.

This chapter follows Maya's journey through that build — and the broader landscape of credit risk modelling, model validation, and model risk management that her journey illuminates.

15.1 Credit Risk Fundamentals: The Expected Loss Framework

Credit risk is the risk that a counterparty — a borrower, bond issuer, or trading counterpart — will fail to meet its financial obligations. It is the oldest recognised form of financial risk and, by most measures, the most significant contributor to bank failures historically.

The modern regulatory treatment of credit risk rests on a deceptively simple formula:

$$EL = PD \times LGD \times EAD$$

Where:

EL = Expected Loss: the average loss anticipated over a defined horizon, typically one year.
PD = Probability of Default: the likelihood that a borrower will default within the measurement horizon.
LGD = Loss Given Default: the proportion of the exposure that will not be recovered if default occurs, expressed as a percentage.
EAD = Exposure at Default: the total value at risk at the moment of default, accounting for any undrawn credit facilities.

This formula is elegant and powerful. It also conceals enormous complexity in each of its three components.

15.1.1 Probability of Default (PD)

PD is a forward-looking estimate. For a one-year horizon, it answers: "Given everything we know about this borrower today, what is the probability they will default over the next twelve months?"

PD estimates can be derived from several sources:

Historical default data: The most common approach in retail credit. If 2,000 borrowers with a given credit score defaulted out of 100,000 with that score over a one-year observation window, the empirical PD is 2.0%.

Agency ratings mappings: For corporate borrowers, published default rates from Moody's, S&P, or Fitch provide historical benchmarks. Moody's, for instance, publishes average one-year default rates by rating grade, updated annually. A Baa3-rated borrower historically defaulted at around 0.2% per year over the post-war period.

Market-implied PDs: Using credit default swap (CDS) spreads or bond yields to back out implied default probabilities. These are point-in-time measures reflecting current market sentiment.

Statistical models: Logistic regression, survival analysis, or machine learning models trained on borrower characteristics and historical default outcomes.

A critical distinction — one that will recur throughout this chapter — is between point-in-time (PIT) and through-the-cycle (TTC) PD estimates.

A PIT estimate reflects the borrower's current creditworthiness given the current economic environment. It is more volatile: it will rise sharply in a recession and fall in an expansion. A TTC estimate attempts to smooth through the cycle, reflecting average creditworthiness across a full economic cycle. TTC estimates are more stable but less responsive to current conditions.

Regulators generally prefer TTC estimates for capital purposes under Basel — they want capital requirements to be relatively stable and not amplify economic cycles (procyclicality). However, IFRS 9 accounting standards, which we will examine later, explicitly require PIT estimates for Expected Credit Loss (ECL) calculations because accounting is supposed to reflect current economic reality.

This creates a genuine tension for banks that must simultaneously satisfy capital regulation (Basel, preferring TTC) and accounting standards (IFRS 9, requiring PIT) from a single model infrastructure. Maya would encounter this tension directly as she scoped Verdant's modelling program.

15.1.2 Loss Given Default (LGD)

LGD represents the economic loss rate given that a default has occurred. It is the complement of the recovery rate: LGD = 1 - Recovery Rate.

LGD depends on several factors:

Collateral: A mortgage-backed loan secured on property has substantially lower LGD than an unsecured personal loan. The recovery comes from realising the collateral.
Seniority: Senior secured creditors recover more than subordinated unsecured creditors in insolvency proceedings.
Legal jurisdiction: UK insolvency law (Insolvency Act 1986, as amended) provides different recovery outcomes from US Chapter 11 or German insolvency law.
Time to recovery: Workout costs and the time value of money erode recoveries. A loan that takes four years to work out through the courts generates more costs than one resolved in six months.
Economic conditions: LGDs tend to be higher in recessions when asset values are depressed and buyers for distressed assets are scarce.

Under Advanced IRB, banks estimate their own LGDs. Under Foundation IRB — Verdant's target — the PRA prescribes supervisory estimates: 45% LGD for senior unsecured exposures, 75% for subordinated exposures. This simplification is one reason Foundation IRB is more accessible as a starting point.

15.1.3 Exposure at Default (EAD)

EAD measures how much the bank is owed at the moment of default. For term loans with fixed repayment schedules, this is relatively straightforward: it is the outstanding principal at the time of default, potentially adjusted for any accrued interest.

For revolving credit facilities — overdrafts, credit cards, revolving credit facilities common in SME banking — EAD is harder to estimate. Borrowers who are deteriorating tend to draw more heavily on their credit lines before default. This "usage pull" means that EAD for revolving facilities typically exceeds the current drawn balance.

The regulatory measure for this is the Credit Conversion Factor (CCF): the fraction of the undrawn commitment expected to be drawn before default. Empirically, CCFs for revolving SME facilities can range from 40% to 80% depending on facility type and economic conditions.

EAD = Current Drawn Amount + (CCF × Undrawn Commitment)

Under Advanced IRB, banks estimate their own CCFs. Under Foundation IRB, the PRA prescribes supervisory CCFs, typically 75% for committed revolving facilities.

15.1.4 Expected Loss in Practice

With the three components in hand, Expected Loss provides the basis for:

Loan pricing: A loan priced below EL (plus costs of capital and operations) destroys economic value.
Provisions: Under IFRS 9, provisions must reflect Expected Credit Loss over 12 months (Stage 1) or lifetime (Stages 2 and 3).
Capital requirements: Regulatory capital is sized to cover Unexpected Loss — losses that exceed the expected.
Portfolio management: Concentration limits, sector limits, and risk appetite calibration.

15.2 The Basel Framework for Credit Risk

The Basel Committee on Banking Supervision has developed an internationally harmonised framework for credit risk capital requirements through successive accords: Basel I (1988), Basel II (2004), Basel III (2010 onward, with significant revisions in 2017 known informally as Basel 3.1 or Basel IV).

The current framework offers banks a choice between two broad approaches:

15.2.1 The Standardised Approach (SA)

Under the Standardised Approach, risk weights are assigned to exposures based on external credit ratings or exposure category, without reference to internal models.

For example, under Basel III as implemented in the UK's Capital Requirements Regulation (CRR2):

Exposure Type	Risk Weight (SA)
Sovereigns (AAA to AA-)	0%
Corporates (AAA to AA-)	20%
Corporates (unrated)	100%
Retail (qualifying revolving)	75%
Residential mortgages (LTV ≤ 50%)	20%
SME (treated as retail)	75%
Defaulted exposures	150%

Risk-Weighted Assets (RWA) = Exposure × Risk Weight

Capital Requirement = RWA × 8% (minimum Tier 1 + Tier 2 under Pillar 1)

The SA is operationally straightforward but crude. A corporate exposure rated BB+ and one rated AA- both carry risk weights that do not fully differentiate their credit quality, particularly for unrated borrowers where a flat 100% applies regardless of the borrower's actual creditworthiness.

15.2.2 The Internal Ratings-Based (IRB) Approach

Under IRB, banks use their own estimates of PD (and under Advanced IRB, LGD and EAD as well) to calculate risk-weighted assets through a regulatory formula specified in the Basel framework.

The IRB formula is sophisticated, incorporating:

The bank's internal PD estimate
Supervisory or bank-estimated LGD
EAD
A supervisory asset correlation factor (which varies by exposure class)
A maturity adjustment

The intuition is that the formula converts the bank's internal risk estimates into capital requirements calibrated to a 99.9% confidence interval over a one-year horizon — meaning capital should be sufficient to absorb losses in all but the worst 1-in-1000-year scenarios.

Foundation IRB (F-IRB): Banks estimate PD internally. LGD and EAD use supervisory estimates. This requires regulatory approval but less data than Advanced IRB.

Advanced IRB (A-IRB): Banks estimate PD, LGD, and EAD internally. This requires extensive historical data (typically seven years of LGD/EAD data, five years of PD data minimum under the EBA's guidelines), a mature model governance framework, and ongoing regulatory oversight.

The IRB approach can generate substantially lower capital requirements than the SA for high-quality portfolios — which creates a strong financial incentive for well-run banks to invest in IRB capabilities. However, following the 2017 Basel 3.1 revisions, a capital floor was introduced: IRB-calculated RWAs cannot fall below 72.5% of the SA RWA for the same portfolio. This floor limits the capital advantage of IRB and was specifically designed to constrain "model optimism."

15.2.3 IRB Data Requirements and the Long-Run PD Challenge

The EBA's Guidelines on PD estimation, LGD estimation and the treatment of defaulted exposures (EBA/GL/2017/16) specify the data requirements for IRB models in granular detail.

For PD estimation, the guidelines require:

A minimum of five years of historical default data
Representative data: the data period must include at least one economic downturn
Long-run average default rates must be used, not simply recent performance
The definition of default must align with Article 178 of CRR2 (90 days past due, or the bank has assessed the borrower as unlikely to pay)

For Verdant Bank, this created an immediate problem: the bank had been operational since 2022 and had less than three years of SME credit data. Maya's first task was to understand whether this was disqualifying for F-IRB or whether there were data acquisition strategies — purchasing external data, using industry default rate studies, or applying for a regulatory waiver under Article 180(2) — that could bridge the gap.

15.3 Scorecard Models: The Workhorse of Retail Credit Risk

Before tackling IRB's statistical requirements, Maya needed to understand how credit scoring models actually work. The dominant technology in retail credit — and increasingly in SME credit — is the credit scorecard.

15.3.1 The Scorecard Concept

A credit scorecard assigns a numerical score to each applicant based on their characteristics. Higher scores indicate lower credit risk. The score is then used to make accept/decline decisions, set credit limits, determine pricing, and monitor ongoing creditworthiness.

The archetypal scorecard is built on logistic regression with Weight of Evidence (WoE) transformation. The process involves:

Variable selection: Identifying candidate predictors from application data (income, employment status, credit bureau history) and internal data (account behavior, relationship tenure).
Binning: Continuous variables are discretised into bins. "Annual income" might be binned as: <£15k, £15k–£25k, £25k–£40k, £40k–£60k, £60k+.
Weight of Evidence (WoE) transformation: Each bin is converted to a WoE value, which measures how strongly that bin predicts good vs bad borrowers:

$$WoE_i = \ln\left(\frac{\text{Distribution of Goods}_i}{\text{Distribution of Bads}_i}\right)$$

Where "Goods" are non-defaulting borrowers and "Bads" are defaulting borrowers.

Information Value (IV): IV measures the overall predictive power of a variable:

$$IV = \sum_{i}(P(\text{Goods}_i) - P(\text{Bads}_i)) \times WoE_i$$

IV guidelines for variable selection: - IV < 0.02: Not predictive (exclude) - 0.02 ≤ IV < 0.1: Weak predictor (consider) - 0.1 ≤ IV < 0.3: Medium predictor (include) - 0.3 ≤ IV < 0.5: Strong predictor (include) - IV > 0.5: Very strong, potentially suspicious (verify for data leakage)

Logistic regression: The model is fitted on WoE-transformed variables. The logistic regression naturally produces probability outputs (PD estimates).
Scorecard scaling: The regression output is scaled to an integer scorecard, typically with a defined "odds at base score" and "Points to Double the Odds" (PDO) calibration.

15.3.2 Scorecard Scaling Mathematics

The relationship between score and odds (defined as P(Good)/P(Bad)) is:

$$\text{Score} = \text{Base Score} + PDO \times \log_2(\text{Odds})$$

If base score = 600, PDO = 20, and the base odds = 50:1 (i.e., 50 goods per bad at base score), then a borrower with odds of 100:1 scores 620, and a borrower with odds of 25:1 scores 580.

The PD can be recovered from the score:

$$PD = \frac{1}{1 + \text{Odds}} = \frac{1}{1 + e^{(\text{Score} - \text{Offset}) / \text{Factor}}}$$

Where Offset and Factor are the scaling parameters derived from the base score and PDO calibration.

15.4 Python Implementation: Credit Scorecard

The following implementation demonstrates a production-grade credit scorecard class with WoE transformation, logistic regression fitting, score generation, and PD output.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split
from dataclasses import dataclass, field
from typing import Optional
from datetime import date
import warnings

warnings.filterwarnings('ignore')


@dataclass
class ScorecardVariable:
    """Represents a single variable in a credit scorecard."""
    name: str
    bins: list        # Cutpoints (right edges) or category labels
    woe_values: list  # WoE value per bin
    points: list      # Scorecard points per bin (integers)
    iv: float = 0.0   # Information Value of this variable

    def get_bin_index(self, value) -> int:
        """Map a raw value to its bin index."""
        if isinstance(self.bins[0], str):
            # Categorical variable
            try:
                return self.bins.index(str(value))
            except ValueError:
                return len(self.bins) - 1  # catch-all last bin
        else:
            # Numeric variable — bins are right edges
            for i, cutpoint in enumerate(self.bins):
                if value <= cutpoint:
                    return i
            return len(self.bins) - 1  # above the highest cutpoint

    def get_woe(self, value) -> float:
        idx = self.get_bin_index(value)
        return self.woe_values[idx]

    def get_points(self, value) -> int:
        idx = self.get_bin_index(value)
        return self.points[idx]


class WoETransformer:
    """
    Weight of Evidence transformer for scorecard development.
    Bins continuous variables and computes WoE and IV.
    """

    def __init__(self, n_bins: int = 10, min_bin_size: float = 0.05):
        self.n_bins = n_bins
        self.min_bin_size = min_bin_size
        self.bin_stats: dict = {}

    def fit(self, X: pd.DataFrame, y: pd.Series) -> 'WoETransformer':
        """Compute WoE and IV for all features."""
        total_goods = (y == 0).sum()
        total_bads = (y == 1).sum()

        for col in X.columns:
            self.bin_stats[col] = self._compute_woe_iv(
                X[col], y, total_goods, total_bads
            )
        return self

    def _compute_woe_iv(
        self, series: pd.Series, target: pd.Series,
        total_goods: int, total_bads: int
    ) -> dict:
        """Compute WoE and IV for a single feature."""
        # Bin the variable
        try:
            binned, bin_edges = pd.qcut(
                series, q=self.n_bins, retbins=True,
                duplicates='drop'
            )
        except ValueError:
            # Fallback for low-cardinality variables
            binned = series.astype(str)
            bin_edges = None

        results = []
        unique_bins = binned.unique()

        for bin_val in sorted(unique_bins, key=str):
            mask = binned == bin_val
            n_good = ((target == 0) & mask).sum()
            n_bad = ((target == 1) & mask).sum()

            # Apply small constant to avoid log(0)
            dist_good = max(n_good, 0.5) / total_goods
            dist_bad = max(n_bad, 0.5) / total_bads

            woe = np.log(dist_good / dist_bad)
            iv_component = (dist_good - dist_bad) * woe

            results.append({
                'bin': bin_val,
                'n_good': n_good,
                'n_bad': n_bad,
                'dist_good': dist_good,
                'dist_bad': dist_bad,
                'woe': woe,
                'iv': iv_component
            })

        total_iv = sum(r['iv'] for r in results)

        return {
            'bin_edges': bin_edges,
            'stats': pd.DataFrame(results),
            'iv': total_iv
        }

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """Apply WoE transformation to features."""
        X_woe = pd.DataFrame(index=X.index)

        for col in X.columns:
            if col not in self.bin_stats:
                raise ValueError(f"Column {col} not seen during fit.")

            bin_info = self.bin_stats[col]
            bin_edges = bin_info['bin_edges']
            stats_df = bin_info['stats']
            woe_map = dict(zip(stats_df['bin'], stats_df['woe']))

            if bin_edges is not None:
                # Numeric: use qcut with same edges
                binned = pd.cut(X[col], bins=bin_edges,
                                include_lowest=True, labels=False)
                # Map back to bin labels used during fit
                bin_labels = stats_df['bin'].unique()

                def get_woe(idx):
                    if pd.isna(idx):
                        return 0.0  # impute with neutral WoE
                    return stats_df.iloc[int(idx)]['woe'] if int(idx) < len(stats_df) else 0.0

                X_woe[col + '_woe'] = binned.map(get_woe)
            else:
                # Categorical
                X_woe[col + '_woe'] = X[col].astype(str).map(
                    woe_map
                ).fillna(0.0)

        return X_woe

    def information_values(self) -> pd.Series:
        """Return IV for each feature, sorted descending."""
        ivs = {col: self.bin_stats[col]['iv']
               for col in self.bin_stats}
        return pd.Series(ivs).sort_values(ascending=False)


class CreditScorecard:
    """
    Production credit scorecard with logistic regression,
    WoE transformation, and score calibration.

    Parameters
    ----------
    base_score : int
        Score at the base odds (default 600).
    pdo : int
        Points to Double the Odds (default 20).
    target_odds : float
        Good-to-bad odds at the base score (default 50.0).
    min_iv : float
        Minimum IV threshold for variable inclusion (default 0.02).
    """

    def __init__(
        self,
        base_score: int = 600,
        pdo: int = 20,
        target_odds: float = 50.0,
        min_iv: float = 0.02,
        n_bins: int = 10
    ):
        self.base_score = base_score
        self.pdo = pdo
        self.target_odds = target_odds
        self.min_iv = min_iv
        self.n_bins = n_bins

        # Derived scaling parameters
        self.factor = pdo / np.log(2)
        self.offset = base_score - self.factor * np.log(target_odds)

        # Components (set during fit)
        self.woe_transformer = WoETransformer(n_bins=n_bins)
        self.logistic_model: Optional[LogisticRegression] = None
        self.selected_features: list = []
        self.variables: list[ScorecardVariable] = []
        self._is_fitted = False

    def fit(self, X: pd.DataFrame, y: pd.Series) -> 'CreditScorecard':
        """
        Full scorecard development pipeline:
        1. WoE transformation and IV-based variable selection
        2. Logistic regression on WoE-transformed features
        3. Score scaling and variable point assignment
        """
        # Step 1: WoE transformation
        self.woe_transformer.fit(X, y)
        iv_series = self.woe_transformer.information_values()

        # Step 2: Variable selection by IV
        self.selected_features = iv_series[
            iv_series >= self.min_iv
        ].index.tolist()

        if not self.selected_features:
            raise ValueError(
                "No features met the minimum IV threshold. "
                "Review your data or lower min_iv."
            )

        X_woe = self.woe_transformer.transform(X)[
            [f + '_woe' for f in self.selected_features]
        ]

        # Step 3: Logistic regression
        self.logistic_model = LogisticRegression(
            C=1.0, solver='lbfgs', max_iter=1000,
            random_state=42
        )
        self.logistic_model.fit(X_woe, y)

        # Step 4: Build ScorecardVariable objects with scaled points
        self._build_scorecard_variables(X)

        self._is_fitted = True
        return self

    def _build_scorecard_variables(self, X: pd.DataFrame):
        """Convert logistic coefficients to integer scorecard points."""
        coef_dict = dict(zip(
            [f + '_woe' for f in self.selected_features],
            self.logistic_model.coef_[0]
        ))
        intercept = self.logistic_model.intercept_[0]

        # Contribution of each bin to the log-odds
        # Points_ij = -(beta_j * WoE_ij + intercept/p) * factor + offset/p
        # where p = number of characteristics
        p = len(self.selected_features)

        self.variables = []
        for feat in self.selected_features:
            bin_stats = self.woe_transformer.bin_stats[feat]['stats']
            coef = coef_dict[feat + '_woe']

            bins = list(bin_stats['bin'])
            woe_vals = list(bin_stats['woe'])
            points = []

            for woe in woe_vals:
                # Score contribution from this characteristic
                raw_points = -(coef * woe + intercept / p) * self.factor + self.offset / p
                points.append(int(round(raw_points)))

            iv = self.woe_transformer.bin_stats[feat]['iv']
            self.variables.append(
                ScorecardVariable(
                    name=feat,
                    bins=bins,
                    woe_values=woe_vals,
                    points=points,
                    iv=iv
                )
            )

    def score(self, features: dict) -> int:
        """
        Generate an integer credit score for a single applicant.

        Parameters
        ----------
        features : dict
            Raw feature values keyed by feature name.

        Returns
        -------
        int
            Integer credit score (higher = lower risk).
        """
        if not self._is_fitted:
            raise RuntimeError("Model must be fitted before scoring.")

        total_points = 0
        for var in self.variables:
            raw_value = features.get(var.name)
            if raw_value is None:
                continue  # skip missing (could apply mean imputation)
            total_points += var.get_points(raw_value)

        return total_points

    def score_batch(self, X: pd.DataFrame) -> pd.Series:
        """Score a DataFrame of applicants."""
        scores = []
        for _, row in X.iterrows():
            scores.append(self.score(row.to_dict()))
        return pd.Series(scores, index=X.index, name='credit_score')

    def pd_from_score(self, score: int) -> float:
        """
        Convert a credit score to a Probability of Default.

        PD = 1 / (1 + odds)
        odds = exp((score - offset) / factor)
        """
        odds = np.exp((score - self.offset) / self.factor)
        pd_estimate = 1.0 / (1.0 + odds)
        return round(pd_estimate, 6)

    def predict_pd(self, X: pd.DataFrame) -> pd.Series:
        """Return PD estimates for a DataFrame of borrowers."""
        scores = self.score_batch(X)
        return scores.map(self.pd_from_score)

    def scorecard_table(self) -> pd.DataFrame:
        """Return a formatted scorecard table for documentation."""
        rows = []
        for var in self.variables:
            for i, (bin_label, woe_val, pts) in enumerate(
                zip(var.bins, var.woe_values, var.points)
            ):
                rows.append({
                    'Characteristic': var.name,
                    'Bin': str(bin_label),
                    'WoE': round(woe_val, 4),
                    'Points': pts,
                    'IV': round(var.iv, 4)
                })
        return pd.DataFrame(rows)


# ---------------------------------------------------------------------------
# Validation Metrics
# ---------------------------------------------------------------------------

def calculate_information_value(
    feature: pd.Series, target: pd.Series, bins: int = 10
) -> float:
    """
    Calculate Weight of Evidence and Information Value for a feature.

    Parameters
    ----------
    feature : pd.Series
        The predictor variable.
    target : pd.Series
        Binary target (1 = default, 0 = non-default).
    bins : int
        Number of quantile bins.

    Returns
    -------
    float
        Information Value.
    """
    total_bads = (target == 1).sum()
    total_goods = (target == 0).sum()

    if total_bads == 0 or total_goods == 0:
        return 0.0

    try:
        binned = pd.qcut(feature, q=bins, duplicates='drop')
    except ValueError:
        binned = pd.cut(feature, bins=bins)

    iv_total = 0.0
    for bin_label in binned.unique():
        mask = binned == bin_label
        n_bad = ((target == 1) & mask).sum()
        n_good = ((target == 0) & mask).sum()

        dist_bad = max(n_bad, 0.5) / total_bads
        dist_good = max(n_good, 0.5) / total_goods

        woe = np.log(dist_good / dist_bad)
        iv_total += (dist_good - dist_bad) * woe

    return round(iv_total, 4)


def gini_coefficient(y_true: np.ndarray, y_scores: np.ndarray) -> float:
    """
    Gini coefficient for credit model discrimination.
    Gini = 2 * AUC - 1.
    Ranges from 0 (no discrimination) to 1 (perfect discrimination).
    """
    auc = roc_auc_score(y_true, y_scores)
    return round(2 * auc - 1, 4)


def ks_statistic(y_true: np.ndarray, y_scores: np.ndarray) -> float:
    """
    Kolmogorov-Smirnov statistic.
    Maximum separation between cumulative good and bad distributions.
    """
    fpr, tpr, _ = roc_curve(y_true, y_scores)
    ks = np.max(np.abs(tpr - fpr))
    return round(float(ks), 4)


def population_stability_index(
    expected: pd.Series,
    actual: pd.Series,
    bins: int = 10
) -> float:
    """
    Population Stability Index (PSI) for model monitoring.

    Detects shifts in the score or feature distribution between the
    development (expected) and current (actual) populations.

    PSI < 0.10  : No significant shift
    0.10-0.25   : Minor shift, investigate
    PSI > 0.25  : Major shift, model may no longer be valid

    Parameters
    ----------
    expected : pd.Series
        Score/feature distribution from the development sample.
    actual : pd.Series
        Score/feature distribution from the current monitoring period.
    bins : int
        Number of bins for PSI calculation.

    Returns
    -------
    float
        PSI value.
    """
    # Create bins from the expected (development) distribution
    try:
        _, bin_edges = pd.qcut(expected, q=bins, retbins=True,
                                duplicates='drop')
    except ValueError:
        bin_edges = np.percentile(expected, np.linspace(0, 100, bins + 1))

    bin_edges[0] = -np.inf
    bin_edges[-1] = np.inf

    # Calculate proportions in each bin
    expected_counts = pd.cut(expected, bins=bin_edges).value_counts(
        sort=False, normalize=True
    )
    actual_counts = pd.cut(actual, bins=bin_edges).value_counts(
        sort=False, normalize=True
    )

    # Align indices and replace zeros
    expected_pct = expected_counts.reindex(actual_counts.index).fillna(0)
    actual_pct = actual_counts.fillna(0)

    # Avoid division by zero
    expected_pct = expected_pct.clip(lower=1e-6)
    actual_pct = actual_pct.clip(lower=1e-6)

    psi = ((actual_pct - expected_pct) * np.log(
        actual_pct / expected_pct
    )).sum()

    return round(float(psi), 4)


# ---------------------------------------------------------------------------
# Model Validation Report
# ---------------------------------------------------------------------------

@dataclass
class ModelValidationReport:
    """
    Structured output of a credit model validation exercise.
    Aligned with SR 11-7 and EBA/GL/2017/16 requirements.
    """
    model_id: str
    model_name: str
    validation_date: date
    validator: str
    gini: float
    auc: float
    ks_statistic: float
    psi: float
    validation_outcome: str  # 'approved', 'conditional', 'rejected'
    conditions: list[str] = field(default_factory=list)
    findings: list[str] = field(default_factory=list)
    next_review_date: Optional[date] = None

    def is_acceptable(self) -> bool:
        """
        Basic acceptability thresholds — bank should calibrate
        these to its own risk appetite and portfolio.
        """
        return (
            self.gini >= 0.30 and
            self.ks_statistic >= 0.20 and
            self.psi < 0.25 and
            self.validation_outcome in ('approved', 'conditional')
        )

    def summary(self) -> str:
        lines = [
            f"Model Validation Report",
            f"=======================",
            f"Model ID       : {self.model_id}",
            f"Model Name     : {self.model_name}",
            f"Validated By   : {self.validator}",
            f"Validation Date: {self.validation_date}",
            f"",
            f"Discrimination Metrics:",
            f"  Gini         : {self.gini:.4f}",
            f"  AUC-ROC      : {self.auc:.4f}",
            f"  KS Statistic : {self.ks_statistic:.4f}",
            f"",
            f"Stability:",
            f"  PSI          : {self.psi:.4f}",
            f"",
            f"Outcome: {self.validation_outcome.upper()}",
        ]
        if self.conditions:
            lines.append("Conditions:")
            for c in self.conditions:
                lines.append(f"  - {c}")
        if self.findings:
            lines.append("Findings:")
            for f_item in self.findings:
                lines.append(f"  - {f_item}")
        return "\n".join(lines)


# ---------------------------------------------------------------------------
# Example: Verdant Bank SME Scorecard Development
# ---------------------------------------------------------------------------

def verdant_bank_example():
    """
    Illustrative example of Verdant Bank building an SME credit scorecard.
    Uses synthetic data for demonstration.
    """
    np.random.seed(42)
    n_borrowers = 5000

    # Simulate SME credit application data
    data = pd.DataFrame({
        'years_in_business': np.random.choice(
            [1, 2, 3, 5, 7, 10, 15, 20], size=n_borrowers,
            p=[0.15, 0.12, 0.10, 0.15, 0.13, 0.15, 0.12, 0.08]
        ),
        'annual_revenue_gbp': np.random.lognormal(
            mean=11.5, sigma=1.2, size=n_borrowers
        ).clip(10_000, 5_000_000),
        'credit_bureau_score': np.random.normal(
            650, 80, size=n_borrowers
        ).clip(300, 900).astype(int),
        'debt_service_ratio': np.random.beta(2, 5, size=n_borrowers),
        'sector_risk_code': np.random.choice(
            ['low', 'medium', 'high'], size=n_borrowers,
            p=[0.4, 0.4, 0.2]
        )
    })

    # Synthetic default flag — correlated with risk factors
    log_odds = (
        -3.5
        + 0.5 * (data['credit_bureau_score'] < 600).astype(int)
        - 0.3 * np.log1p(data['years_in_business'])
        + 0.8 * (data['sector_risk_code'] == 'high').astype(int)
        + 1.2 * (data['debt_service_ratio'] > 0.6).astype(int)
    )
    default_prob = 1 / (1 + np.exp(-log_odds))
    data['default'] = (np.random.uniform(size=n_borrowers) < default_prob).astype(int)

    print(f"Dataset: {len(data)} SME borrowers, "
          f"default rate: {data['default'].mean():.1%}")

    # Train/test split
    X = data.drop(columns=['default'])
    y = data['default']
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=42, stratify=y
    )

    # Encode categorical
    X_train = X_train.copy()
    X_test = X_test.copy()
    sector_map = {'low': 0, 'medium': 1, 'high': 2}
    X_train['sector_risk_code'] = X_train['sector_risk_code'].map(sector_map)
    X_test['sector_risk_code'] = X_test['sector_risk_code'].map(sector_map)

    # Build scorecard
    scorecard = CreditScorecard(
        base_score=600, pdo=20, target_odds=50.0, min_iv=0.01
    )
    scorecard.fit(X_train, y_train)

    # Evaluate on test set
    test_scores = scorecard.score_batch(X_test)
    test_pds = scorecard.predict_pd(X_test)

    gini = gini_coefficient(y_test.values, test_pds.values)
    ks = ks_statistic(y_test.values, test_pds.values)
    auc = roc_auc_score(y_test.values, test_pds.values)

    # PSI: compare train vs test score distribution
    train_scores = scorecard.score_batch(X_train)
    psi = population_stability_index(train_scores, test_scores)

    print(f"\nValidation Results (Hold-Out Test Set):")
    print(f"  Gini        : {gini:.4f}")
    print(f"  AUC-ROC     : {auc:.4f}")
    print(f"  KS Statistic: {ks:.4f}")
    print(f"  PSI (train vs test): {psi:.4f}")

    # Build validation report
    report = ModelValidationReport(
        model_id='VB-SME-001',
        model_name='Verdant Bank SME Credit Scorecard v1.0',
        validation_date=date(2024, 6, 30),
        validator='Priya Nair, Big 4 Advisory',
        gini=gini,
        auc=auc,
        ks_statistic=ks,
        psi=psi,
        validation_outcome='conditional',
        conditions=[
            'Re-validate after 12 months or 5% PSI shift, whichever is earlier.',
            'Override policy must be documented and monitored monthly.',
            'Model cannot be used for exposures >£500k without credit committee approval.'
        ],
        findings=[
            'Gini of 0.42 is acceptable for an initial SME model — above minimum threshold of 0.30.',
            'PSI of 0.08 indicates stable population between development and test samples.',
            'Data period (2022-2024) does not include a recession — model must be stress-tested against GFC-era proxies.'
        ],
        next_review_date=date(2025, 6, 30)
    )

    print(f"\n{report.summary()}")
    print(f"\nScorecard Table (first 10 rows):")
    print(scorecard.scorecard_table().head(10).to_string(index=False))

    return scorecard, report


if __name__ == '__main__':
    scorecard, report = verdant_bank_example()

15.5 Credit Risk Model Validation

Building a model is the beginning, not the end. Regulators, auditors, and sound risk management practice all require that models be validated — independently, rigorously, and on a continuous basis.

15.5.1 The Three Dimensions of Validation

The EBA's guidelines (EBA/GL/2017/16) and the Federal Reserve's SR 11-7 (which we examine in depth in Section 15.6) both articulate three core dimensions of model validation:

1. Conceptual soundness: Is the model's underlying theory appropriate for the problem? Does the choice of logistic regression make sense for a binary default prediction problem? Are the variable selection methods statistically justified? Is the model aligned with the bank's actual lending population?

2. Ongoing monitoring: Is the model's performance being tracked in production? Are there automated alerts when model performance deteriorates? Is the population it is scoring changing in ways that might render the model less accurate?

3. Outcomes analysis (backtesting): Once sufficient time has passed, do the model's predicted default rates correspond to actual default rates? A model that predicted 3% PD for a cohort should be tested against that cohort's actual default experience twelve months later.

15.5.2 Discrimination Metrics

The primary metrics for assessing how well a credit model separates defaulters from non-defaulters are:

Gini Coefficient: $$\text{Gini} = 2 \times \text{AUC} - 1$$

The Gini coefficient ranges from -1 (perfect inverse prediction) through 0 (random) to +1 (perfect prediction). In practice, retail credit scorecards typically achieve Ginis of 0.40–0.65. Wholesale/corporate models often have lower Ginis (0.25–0.45) because corporate defaults are rarer and driven by idiosyncratic factors less captured by cross-sectional models.

Industry benchmark thresholds:

Gini Range	Interpretation
< 0.20	Poor — likely not acceptable to regulators
0.20–0.30	Marginal — may require supplementary controls
0.30–0.45	Acceptable for regulatory purposes
0.45–0.60	Good — typical for mature retail scorecards
> 0.60	Excellent — verify for data leakage

AUC-ROC (Area Under the Receiver Operating Characteristic Curve):

The AUC measures the probability that a randomly selected defaulter will receive a higher risk score (lower credit score) than a randomly selected non-defaulter. AUC = 0.5 is random; AUC = 1.0 is perfect.

AUC and Gini are related: Gini = 2 × AUC - 1. They convey the same information.

Kolmogorov-Smirnov (KS) Statistic:

The KS statistic is the maximum vertical distance between the cumulative distribution of scores for defaulters and non-defaulters. It indicates the score level at which discrimination is sharpest.

KS Range	Interpretation
< 0.20	Poor
0.20–0.30	Acceptable
0.30–0.50	Good
> 0.50	Very good

15.5.3 Calibration Metrics

Discrimination tells us whether the model ranks borrowers correctly. Calibration tells us whether the predicted probabilities are accurate in absolute terms.

Hosmer-Lemeshow Test: Groups predictions into deciles and compares predicted vs actual default rates within each group using a chi-squared statistic. A p-value below 0.05 indicates significant miscalibration.

Binomial Test: For each rating grade, tests whether the observed default rate is statistically consistent with the predicted PD, accounting for the sample size.

Brier Score: Mean squared error of probability predictions: $$\text{Brier} = \frac{1}{n}\sum_{i=1}^{n}(f_i - o_i)^2$$

Where $f_i$ is the predicted probability and $o_i$ is the actual outcome. Lower is better; 0 is perfect.

15.5.4 Population Stability Index (PSI)

The PSI measures whether the population being scored has shifted relative to the development population. It is critical for production monitoring.

$$PSI = \sum_{i=1}^{n}\left(A_i - E_i\right) \times \ln\left(\frac{A_i}{E_i}\right)$$

Where $A_i$ is the actual proportion in bin $i$ (current population) and $E_i$ is the expected proportion (development population).

PSI Value	Action
< 0.10	No action required
0.10–0.25	Investigate — monitor closely
> 0.25	Model likely invalid — consider redevelopment or overlay

The PSI is computed on both the score distribution and on individual input variables. A PSI above threshold on an input variable may indicate data quality issues or genuine population change in that characteristic.

15.6 SR 11-7: The Gold Standard for Model Risk Management

The Federal Reserve's Supervisory Guidance on Model Risk Management (SR 11-7), issued in 2011 and jointly adopted by the OCC, provides the most comprehensive regulatory framework for managing the risk that models produce incorrect outputs that lead to poor decisions.

While SR 11-7 is a US Federal Reserve document, it has become the de facto global standard. The PRA in the UK, the ECB's TRIM (Targeted Review of Internal Models) exercise, and the EBA's guidelines all draw heavily on its principles.

15.6.1 SR 11-7's Definition of Model Risk

SR 11-7 defines model risk as "the potential for adverse consequences from decisions based on incorrect or misused model outputs and reports." It identifies two primary sources:

Fundamental errors in the model: Wrong assumptions, incorrect mathematics, programming errors, inappropriate proxy data.
Misuse of the model: Applying a model outside its intended range, over-relying on outputs without appropriate judgment, ignoring model limitations.

15.6.2 The Three Pillars of SR 11-7

Pillar 1: Model Development, Implementation, and Use

Models must be fit for purpose — appropriate for the decision they support.
Development documentation must be complete: data sources, methodology, assumptions, limitations.
Testing must be rigorous, including out-of-time and out-of-sample validation.
Models must be implemented accurately — code must produce what the documentation describes.

Pillar 2: Model Validation

SR 11-7 requires that validation be performed by staff independent of the model development team. Independence is a hard requirement: the validator cannot have developed or approved the model they are validating.

Validation must include: - Evaluation of conceptual soundness - Ongoing monitoring protocols - Outcomes analysis against actual experience - Specific focus on key model assumptions and limitations

Pillar 3: Model Governance

The board and senior management are responsible for establishing a model risk management framework, including:

Model inventory: A comprehensive register of all models in use, their purpose, status, and validation history.
Model risk appetite: An explicit statement of acceptable levels of model risk.
Tiering: Models classified by materiality — more material models receive more intensive validation.
Remediation tracking: Open validation findings tracked to closure.
Ongoing reporting: Regular model risk reports to senior management and the board.

15.6.3 Applying SR 11-7 to Verdant's IRB Programme

When Priya Nair was engaged to advise Verdant Bank on their IRB readiness, her first assessment tool was an SR 11-7 gap analysis mapped to the bank's current model governance documentation.

The gap analysis revealed:

Model inventory: Verdant had an informal spreadsheet listing 12 models, but it was not maintained. The retail scorecard update from Q3 2023 had not been logged.
Independence: The Head of Risk Analytics had validated several models she had co-developed. SR 11-7 prohibits this.
Documentation: The SME scoring tool had no formal development documentation — it had been built "by committee" during the 2022 fundraising period.
Monitoring: PSI was not calculated for any model. The retail scorecard was being applied to a population that had shifted materially since its 2021 development.

Maya's response to this gap analysis was to commission a formal Model Risk Policy, establish a Model Validation Committee (meeting quarterly, chaired by the CRO), and engage Priya's firm to conduct initial independent validations of all material models before the IRB application was submitted.

15.7 Machine Learning in Credit Risk: Promise and Peril

The past decade has seen substantial interest in applying machine learning algorithms to credit risk. Random forests, gradient boosting machines (GBMs, particularly XGBoost and LightGBM), and neural networks have demonstrated discrimination metrics that frequently exceed traditional scorecards on comparable datasets.

15.7.1 Why ML Models Often Outperform Scorecards

Traditional scorecards use linear combinations of WoE-transformed variables. They cannot capture non-linear relationships or complex interactions between variables without explicit feature engineering.

ML models, particularly tree-based ensembles, can: - Detect non-linear relationships automatically - Capture interactions between features without manual specification - Handle large numbers of input features, including alternative data sources - Produce higher Gini/AUC metrics on held-out test sets

15.7.2 The Interpretability Challenge

For credit decisions, interpretability is not merely an analytical preference — it is a regulatory and legal requirement.

In the UK, the Consumer Duty (FCA, 2023) requires that firms can explain credit decisions to customers. The Equality Act 2010 prohibits discriminatory lending. Under GDPR Article 22, individuals subject to automated decisions have the right to explanation.

A gradient boosting model with 1,000 trees and 20 features does not naturally produce a "reason code" — the human-readable explanation of why an applicant was declined ("Insufficient income relative to loan amount" or "Recent credit delinquencies"). Scorecard models do: each characteristic's point contribution is observable.

Techniques exist to address ML interpretability: - SHAP (SHapley Additive exPlanations): Computes the marginal contribution of each feature to an individual prediction. - LIME (Local Interpretable Model-agnostic Explanations): Fits a simple local model around each prediction. - Partial Dependence Plots (PDPs): Show the marginal relationship between a feature and the prediction.

However, regulators, particularly the PRA in the context of IRB, have been cautious about ML models. The EBA Discussion Paper on ML in Credit Risk (EBA/DP/2021/04) noted that while ML models can improve discrimination, they create additional model risk through complexity and the difficulty of ensuring conceptual soundness.

Rafael Torres, who consulted with Verdant Bank following his time at Meridian Capital, offered a pragmatic perspective drawn from his experience implementing ML credit models at a larger institution:

"The model that gets regulatory approval is the model that the regulator can understand. At Meridian, we built a gradient boosting model that outperformed our scorecard by 8 Gini points. The PRA asked us to explain how it handled a scenario where an otherwise creditworthy borrower had a single recent missed payment. We couldn't give them a clean answer. We ran the scorecard instead."

15.7.3 Fairness in Credit ML Models

Machine learning models trained on historical credit data may perpetuate or amplify historical discrimination. If certain demographic groups have historically been denied credit (and therefore have sparser credit bureau data), an ML model trained on bureau data may systematically disadvantage those groups — even without explicitly using protected characteristics.

The FCA's work on algorithmic bias (FCA, 2022) and the FRB's SR 11-7 application to ML models both require that banks test for disparate impact — whether the model has materially different outcomes for protected groups that cannot be justified by legitimate credit risk factors.

Metrics for fairness assessment in credit models include: - Demographic parity: Do accept rates differ across protected groups? - Equalized odds: Do true positive (approvals of creditworthy borrowers) and false positive (approvals of defaulting borrowers) rates differ across groups? - Individual fairness: Do similarly situated applicants receive similar outcomes?

15.7.4 The Model Risk Implications of ML

SR 11-7 and its international equivalents were written before ML models became common in banking. However, the OFR (Office of Financial Research) and Fed have clarified that SR 11-7 applies fully to ML models — and in practice, ML models often present greater model risk than scorecards because:

Complexity: More parameters, more potential for overfitting.
Instability: Small changes in training data can produce large changes in model outputs.
Non-monotonicity: A gradient boosting model may predict lower PD for higher debt burdens in some feature space regions — a conceptually unsound result that requires human review.
Data dependency: ML models are more sensitive to data quality issues, distributional shifts, and missing values.

15.8 Retail vs. Wholesale Credit Risk Modelling

The techniques and challenges of credit risk modelling differ substantially between retail and wholesale portfolios.

15.8.1 Retail Credit Risk

Retail portfolios (mortgages, personal loans, credit cards, SME banking) are characterised by: - Large numbers of obligors: Retail portfolios may contain hundreds of thousands or millions of borrowers. - Standardised products: Retail loans are broadly homogeneous, enabling statistical modelling. - Bureau data availability: In mature markets, credit bureaus (Experian, Equifax, TransUnion in the UK) provide extensive credit history data. - Relatively short observation windows: Retail defaults often manifest within 12–24 months of credit deterioration.

For retail, statistical scorecards with logistic regression (or increasingly, ML ensembles) are the standard approach. Application scorecards (at origination), behavioural scorecards (ongoing monitoring), and collection scorecards (managing distress) address different parts of the credit lifecycle.

15.8.2 Wholesale Credit Risk

Wholesale portfolios (large corporates, financial institutions, sovereigns) are characterised by: - Small numbers of obligors: A portfolio may contain dozens or hundreds of counterparties. - Heterogeneous exposures: Each borrower may have a unique structure, requiring judgment alongside statistical models. - Rating agency analogues: Internal ratings systems often map to agency rating scales. - Low default frequency: Large corporates default rarely, making model validation by backtesting difficult — there may be only 5–10 defaults in a portfolio per year. - Analyst-driven process: Relationship managers and credit analysts play a more prominent role; models inform rather than replace judgment.

For wholesale, shadow-rating models (mapping borrower financials to an implied rating grade), Merton-based structural models (KMV-style distance-to-default), and expert judgment overlays are common.

The validation challenge is acute for wholesale models: with very few defaults, the statistical power to distinguish a good model from a bad one is low. Long observation periods (10+ years) are needed, and through-the-cycle calibration (mapping ratings to long-run average default rates) is standard.

15.9 Through-the-Cycle vs. Point-in-Time PD Estimates

One of the most consequential technical choices in credit risk modelling is whether to produce TTC or PIT PD estimates — or a blend.

15.9.1 Through-the-Cycle (TTC)

TTC PDs are calibrated to reflect average creditworthiness across a full economic cycle. They are stable — a borrower's TTC PD changes only when the borrower's fundamental creditworthiness changes, not when the macroeconomic environment improves or deteriorates.

TTC characteristics: - Low procyclicality: capital requirements do not rise sharply in recessions - Appropriate for capital purposes under Basel - Less informative for current risk assessment - Require a long historical dataset spanning at least one full cycle

15.9.2 Point-in-Time (PIT)

PIT PDs reflect current creditworthiness given the current economic environment. They move with the cycle — rising in recessions, falling in expansions. The same borrower has a higher PIT PD in 2009 than in 2007, even with identical financials, because the macroeconomic environment has deteriorated.

PIT characteristics: - Forward-looking, incorporates current conditions - Required for IFRS 9 Expected Credit Loss calculations - More procyclical — provisions and EL estimates rise sharply in downturns - Can be derived by applying a macro adjustment to a TTC model

15.9.3 Practical Approaches to the TTC/PIT Distinction

In practice, many banks produce a hybrid: a model with TTC characteristics at the rating assignment stage, with a separate macro overlay applied to generate PIT estimates for accounting purposes.

The macro overlay might be implemented as: $$PD_{PIT}(t) = PD_{TTC} \times \exp(\beta_0 + \beta_1 \times \Delta GDP_t + \beta_2 \times U_t)$$

Where $\Delta GDP_t$ is GDP growth and $U_t$ is unemployment rate at time $t$.

This "satellite model" approach — a core TTC model for capital, with a macro satellite for accounting — is common at larger banks but requires careful documentation, governance, and validation of both components.

15.10 Model Risk Framework: Governance Architecture

A mature model risk framework comprises several interconnected components:

15.10.1 Model Inventory

The model inventory is the foundation of model risk governance. Every model in production use must be registered, including:

Model ID and name
Business purpose and decisions it supports
Owner (business line) and developer
Validator and validation status
Last validation date and next scheduled review
Materiality tier (1 = most material)
Open findings and conditions
Data sources and external dependencies

15.10.2 Model Tiering

Not all models receive equal governance attention. Tiering allocates validation resources to the models that matter most. Criteria for tiering include:

Financial materiality (capital impact, provision impact)
Regulatory reliance (IRB models are Tier 1 by definition)
Number of decisions made (a retail origination scorecard scoring 50,000 applications per month is more material than a small-portfolio wholesale model)
Complexity (ML models require more intensive validation than simple scorecards)

15.10.3 Model Risk Appetite

The board should approve a model risk appetite statement that articulates:

Acceptable levels of model error (e.g., capital RWA within X% of true value)
Maximum tolerable override rates
Minimum Gini thresholds for different model tiers
PSI thresholds triggering mandatory review
Acceptable time limits between validations

15.10.4 Ongoing Monitoring

Monthly or quarterly monitoring reports should cover:

Performance metrics (Gini, KS) on recent data
PSI for scores and key input variables
Override rates and override performance (do overridden loans perform worse?)
Volume and portfolio composition changes
Data quality indicators

15.10.5 Model Change Management

Changes to models — recalibration, redevelopment, parameter updates — must follow a formal change management process:

Change request and rationale documented
Impact assessment (will the change materially alter outputs?)
Independent review of material changes
Approval by Model Validation Committee
Implementation testing
Post-implementation monitoring

15.10.6 IFRS 9 and the ECL Framework

IFRS 9 Financial Instruments (effective 2018) introduced a forward-looking Expected Credit Loss (ECL) model that replaced the IAS 39 incurred loss approach.

Under IFRS 9, financial instruments are allocated to three stages:

Stage 1: Performing loans — provision = 12-month ECL (EL over the next 12 months). PD used is a 12-month PIT estimate.
Stage 2: Significant increase in credit risk (SICR) — provision = Lifetime ECL. The threshold for SICR may include absolute and relative triggers (e.g., 200bps increase in PD, or downgrade of two rating notches).
Stage 3: Credit-impaired (effectively defaulted) — provision = Lifetime ECL. Effective Interest Rate (EIR) applied on the net carrying amount.

The ECL formula for each loan in a period: $$ECL = PD \times LGD \times EAD \times D$$

Where $D$ is a discount factor reflecting the time value of money (discounted at the EIR).

IFRS 9 ECL modelling requires: - Forward-looking PIT PD estimates incorporating macroeconomic scenarios - Multiple macroeconomic scenarios probability-weighted (not just base case) - Lifetime PD paths (not just 12-month) for Stage 2/3 - Stage allocation models (to determine SICR)

The interaction between IFRS 9 provisions and Basel capital requirements creates important dynamics. Excess provisions (where provisions exceed regulatory expected loss) can be added back to Tier 2 capital (up to 0.6% of credit RWA under CRR2). Provision shortfalls are deducted from CET1.

Summary

This chapter has traced the arc from credit risk fundamentals — the Expected Loss formula and its three components — through the regulatory architecture of Basel's Standardised and IRB approaches, the technical mechanics of scorecard development, the statistical metrics of model validation, and the governance framework of SR 11-7 model risk management.

For Maya at Verdant Bank, the journey produced a sobering recognition: building a credit risk model was not the hard part. The hard part was building the governance, data, and validation infrastructure that would make the model credible — to regulators, to auditors, to her own board, and ultimately to the small business owners who would either receive credit or be declined on the basis of its outputs.

Credit risk models are consequential instruments. Their errors have real human costs. The discipline of model risk management — independent validation, ongoing monitoring, explicit governance — exists not to bureaucratise but to acknowledge that reality honestly.

Next: Chapter 16 — Stress Testing and Scenario Analysis: ICAAP, CCAR, and the Role of Macroeconomic Models