Case Study 2: Credit Scoring Regularization — Weight Decay and Correlated Financial Features
Context
Meridian Financial's data science team is building a neural network for credit scoring to complement their existing logistic regression model. The motivation is clear: the logistic regression model achieves an AUC of 0.76, but the team believes nonlinear feature interactions (e.g., the joint effect of debt-to-income ratio and payment history) could push accuracy higher. Regulatory requirements under the Equal Credit Opportunity Act (ECOA) and Fair Credit Reporting Act (FCRA) demand that the model's behavior be explainable and stable.
The challenge: Meridian's feature set contains clusters of highly correlated financial variables. Credit utilization, total balance, number of open accounts, and available credit all capture overlapping aspects of a borrower's credit exposure. The correlation matrix reveals pairwise correlations as high as 0.94 among these features.
This case study examines what happens when a neural network is trained on correlated features without proper regularization, why the resulting model fails regulatory review, and how weight decay resolves the problem.
The Data
Meridian's credit scoring dataset includes 50,000 historical loan applications with 25 features and a binary target (default within 24 months: yes/no). The default rate is 8%.
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from typing import Dict, List, Tuple
def generate_credit_data(
n_samples: int = 50000, seed: int = 42
) -> Tuple[pd.DataFrame, np.ndarray]:
"""Generate synthetic credit scoring data with correlated features.
Creates 25 features organized in 5 groups of 5. Within each group,
features are highly correlated (rho > 0.85). Between groups,
correlations are low (rho < 0.2).
Args:
n_samples: Number of loan applications.
seed: Random seed.
Returns:
Tuple of (feature DataFrame, binary default labels).
"""
rng = np.random.RandomState(seed)
# Generate 5 latent factors (one per group)
latent = rng.randn(n_samples, 5)
# Generate 25 features: 5 groups of 5 correlated features
features = np.zeros((n_samples, 25))
feature_names = []
group_names = [
"credit_exposure", # utilization, balance, accounts, credit_limit, debt
"payment_history", # on_time_pct, late_30, late_60, late_90, collections
"credit_age", # oldest_account, avg_age, newest_account, total_accounts, closed_pct
"inquiry_activity", # hard_inquiries, soft_inquiries, new_accounts_6m, apps_denied, credit_seeking
"income_stability", # income, employment_years, income_variance, dti, housing_cost_ratio
]
for g in range(5):
for f in range(5):
noise_scale = 0.15 + 0.1 * rng.rand() # 0.15-0.25 noise
features[:, g * 5 + f] = latent[:, g] + noise_scale * rng.randn(n_samples)
feature_names.append(f"{group_names[g]}_{f+1}")
# Target: logistic function of latent factors with nonlinear interactions
logit = (
-2.5 # base rate (8% default)
+ 0.8 * latent[:, 0] # credit exposure increases default
- 0.6 * latent[:, 1] # good payment history decreases default
- 0.3 * latent[:, 2] # credit age decreases default
+ 0.4 * latent[:, 3] # inquiry activity increases default
- 0.5 * latent[:, 4] # income stability decreases default
+ 0.3 * latent[:, 0] * latent[:, 3] # interaction: exposure x inquiries
)
prob = 1.0 / (1.0 + np.exp(-logit))
target = rng.binomial(1, prob)
df = pd.DataFrame(features, columns=feature_names)
return df, target
# Generate data and split
df, target = generate_credit_data()
X_train, X_test, y_train, y_test = train_test_split(
df.values, target, test_size=0.2, random_state=42, stratify=target
)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
# Verify correlations within the first group (credit exposure)
corr_matrix = pd.DataFrame(X_train[:, :5]).corr()
print("Within-group correlations (credit exposure):")
print(corr_matrix.round(3))
Within-group correlations (credit exposure):
0 1 2 3 4
0 1.000 0.932 0.917 0.938 0.925
1 0.932 1.000 0.906 0.921 0.911
2 0.917 0.906 1.000 0.912 0.899
3 0.938 0.921 0.912 1.000 0.918
4 0.925 0.911 0.899 0.918 1.000
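The docstring's claim of low between-group correlation can be spot-checked with a self-contained sketch that reproduces the latent-factor construction (a standalone toy with its own seed and noise scale, not the `generate_credit_data` output itself):

```python
import numpy as np

rng = np.random.RandomState(0)
latent = rng.randn(20000, 5)  # 5 independent latent factors
# 25 features: each column is its group's factor plus independent noise
X = np.column_stack([
    latent[:, g] + 0.2 * rng.randn(20000)
    for g in range(5) for _ in range(5)
])
corr = np.abs(np.corrcoef(X, rowvar=False))
pairs = [(i, j) for i in range(25) for j in range(i + 1, 25)]
within = [corr[i, j] for i, j in pairs if i // 5 == j // 5]
between = [corr[i, j] for i, j in pairs if i // 5 != j // 5]
print(f"min within-group  |rho|: {min(within):.3f}")
print(f"max between-group |rho|: {max(between):.3f}")
```

Because the groups share a latent factor but the factors themselves are independent, within-group correlations land near 0.95 while between-group correlations stay near zero.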
The Problem: Unregularized Training
class CreditMLP(nn.Module):
"""Credit scoring MLP.
Args:
input_dim: Number of features.
hidden_dims: Hidden layer sizes.
dropout_rate: Dropout probability.
use_bn: Whether to use batch normalization.
"""
def __init__(
self,
input_dim: int = 25,
hidden_dims: List[int] = None,
dropout_rate: float = 0.0,
use_bn: bool = False,
) -> None:
super().__init__()
if hidden_dims is None:
hidden_dims = [64, 32]
layers = []
prev_dim = input_dim
for dim in hidden_dims:
layers.append(nn.Linear(prev_dim, dim))
if use_bn:
layers.append(nn.BatchNorm1d(dim))
layers.append(nn.ReLU())
if dropout_rate > 0:
layers.append(nn.Dropout(dropout_rate))
prev_dim = dim
layers.append(nn.Linear(prev_dim, 1))
self.network = nn.Sequential(*layers)
# He initialization
for m in self.modules():
if isinstance(m, nn.Linear):
nn.init.kaiming_normal_(m.weight, nonlinearity="relu")
nn.init.zeros_(m.bias)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.network(x)
def inspect_first_layer_weights(
model: CreditMLP, feature_names: List[str], group_size: int = 5
) -> pd.DataFrame:
"""Extract and display first-layer weights by feature group.
Args:
model: Trained credit scoring model.
feature_names: Names of input features.
group_size: Number of features per correlated group.
Returns:
DataFrame of first-layer weight statistics per group.
"""
weights = model.network[0].weight.detach().cpu().numpy() # (hidden, input)
group_stats = []
for g in range(len(feature_names) // group_size):
start = g * group_size
end = start + group_size
group_w = weights[:, start:end]
group_stats.append({
"group": g,
"mean_abs_weight": np.abs(group_w).mean(),
"max_abs_weight": np.abs(group_w).max(),
"weight_std": group_w.std(),
"max_weight": group_w.max(),
"min_weight": group_w.min(),
"range": group_w.max() - group_w.min(),
})
return pd.DataFrame(group_stats)
Training without regularization:
# Unregularized model
model_unreg = CreditMLP(input_dim=25, hidden_dims=[64, 32], dropout_rate=0.0)
optimizer_unreg = optim.Adam(model_unreg.parameters(), lr=1e-3, weight_decay=0.0)
# ... train for 100 epochs ...
The first-layer weights reveal the pathology:
Unregularized model — First layer weight statistics:
Group 0 (credit_exposure): max=4.21, min=-3.87, range=8.08, std=1.92
Group 1 (payment_history): max=3.56, min=-3.12, range=6.68, std=1.64
Group 2 (credit_age): max=2.98, min=-2.71, range=5.69, std=1.38
Group 3 (inquiry_activity): max=3.41, min=-3.05, range=6.46, std=1.55
Group 4 (income_stability): max=3.15, min=-2.89, range=6.04, std=1.47
Within each correlated group, the model has learned large positive weights for some features and large negative weights for others. This is the correlated feature cancellation problem: the model exploits the near-perfect correlation to create high-magnitude, opposing weights that cancel on the training data but are individually fragile.
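A toy calculation (hypothetical weights, not taken from the trained network) shows why such cancelling weights are fragile:

```python
import numpy as np

rng = np.random.RandomState(0)
x1 = rng.randn(1000)
x2 = x1 + 0.1 * rng.randn(1000)   # near-duplicate of x1

# Large opposing weights that cancel on the training distribution
big = 4.0 * x1 - 4.0 * x2
# Small shared weights capturing the same underlying signal
small = 0.05 * x1 + 0.05 * x2

# Shift x2 slightly (e.g., a bureau reporting change)
x2_shift = x2 + 0.2
big_shift = 4.0 * x1 - 4.0 * x2_shift
small_shift = 0.05 * x1 + 0.05 * x2_shift

print(f"large-weight output shift: {np.abs(big_shift - big).mean():.3f}")   # 0.800
print(f"small-weight output shift: {np.abs(small_shift - small).mean():.3f}")  # 0.010
```

The same 0.2 perturbation moves the large-weight model's output 80 times further, because each feature's contribution scales with its weight magnitude.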
The Regulatory Concern
Meridian's compliance team reviewed the model and flagged two issues:
1. Instability under distribution shift. When the model was evaluated on a holdout from a different time period (6 months later, with slightly different economic conditions), the AUC dropped from 0.795 to 0.741. The logistic regression model dropped from 0.760 to 0.752 — far more stable.
2. Explainability failure. Under FCRA, denied applicants must receive an "adverse action notice" explaining which factors most negatively influenced the decision. With opposing large weights on correlated features, the feature attribution is contradictory — "high credit utilization" might appear as a positive factor for one applicant and a negative factor for another with nearly identical profiles, depending on minor differences in the correlated features.
def stability_test(
model: nn.Module,
X_test: np.ndarray,
noise_std: float = 0.1,
n_trials: int = 100,
device: str = "cpu",
) -> Dict[str, float]:
"""Test prediction stability under small input perturbations.
Adds Gaussian noise to correlated features and measures
the standard deviation of output predictions.
Args:
model: Trained model.
X_test: Test features.
noise_std: Standard deviation of noise.
n_trials: Number of noise trials.
device: Computation device.
Returns:
Dict with stability metrics.
"""
model.eval()
X_tensor = torch.tensor(X_test, dtype=torch.float32).to(device)
with torch.no_grad():
base_pred = torch.sigmoid(model(X_tensor)).cpu().numpy().flatten()
perturbed_preds = []
for _ in range(n_trials):
X_noisy = X_test.copy()
# Perturb only the first group (correlated features)
X_noisy[:, :5] += np.random.randn(len(X_test), 5) * noise_std
X_noisy_tensor = torch.tensor(X_noisy, dtype=torch.float32).to(device)
with torch.no_grad():
pred = torch.sigmoid(model(X_noisy_tensor)).cpu().numpy().flatten()
perturbed_preds.append(pred)
perturbed_preds = np.array(perturbed_preds) # (n_trials, n_samples)
pred_std = perturbed_preds.std(axis=0)
return {
"mean_pred_std": pred_std.mean(),
"max_pred_std": pred_std.max(),
"pct_unstable": (pred_std > 0.05).mean() * 100, # >5% std
}
The Solution: Decoupled Weight Decay
The team trained three regularized variants:
# Model A: L2 regularization via Adam
model_l2 = CreditMLP(input_dim=25, hidden_dims=[64, 32], dropout_rate=0.2, use_bn=True)
optimizer_l2 = optim.Adam(model_l2.parameters(), lr=1e-3, weight_decay=1e-4)
# Model B: Decoupled weight decay via AdamW (moderate)
model_adamw_mod = CreditMLP(input_dim=25, hidden_dims=[64, 32], dropout_rate=0.2, use_bn=True)
optimizer_adamw_mod = optim.AdamW(model_adamw_mod.parameters(), lr=1e-3, weight_decay=1e-2)
# Model C: Decoupled weight decay via AdamW (strong)
model_adamw_strong = CreditMLP(input_dim=25, hidden_dims=[64, 32], dropout_rate=0.2, use_bn=True)
optimizer_adamw_strong = optim.AdamW(model_adamw_strong.parameters(), lr=1e-3, weight_decay=5e-2)
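The training loop elided above can be sketched as follows (a minimal, self-contained version with stand-in random data; the batch size and epoch count here are illustrative assumptions, and the case study trains for 100 epochs):

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
# Stand-in data; in the case study this would be the scaled X_train / y_train
X = torch.randn(512, 25)
y = (torch.rand(512) < 0.08).float()  # ~8% positive rate

model = nn.Sequential(nn.Linear(25, 64), nn.ReLU(), nn.Linear(64, 1))
optimizer = optim.AdamW(model.parameters(), lr=1e-3, weight_decay=1e-2)
criterion = nn.BCEWithLogitsLoss()  # sigmoid + BCE in one numerically stable op

model.train()
for epoch in range(5):  # sketch; the case study uses 100 epochs
    for start in range(0, len(X), 128):
        xb, yb = X[start:start + 128], y[start:start + 128]
        optimizer.zero_grad()
        loss = criterion(model(xb).squeeze(1), yb)
        loss.backward()
        optimizer.step()
print(f"final batch loss: {loss.item():.4f}")
```

Because the model outputs raw logits, `BCEWithLogitsLoss` is used here rather than applying `sigmoid` followed by `BCELoss`.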
Results
Weight Magnitude Comparison
First-layer weight statistics (Group 0 — credit exposure):
Unregularized: max=4.21, min=-3.87, range=8.08, std=1.92
L2 (Adam, λ=1e-4): max=2.15, min=-1.98, range=4.13, std=0.89
AdamW (λ=1e-2): max=0.82, min=-0.74, range=1.56, std=0.38
AdamW (λ=5e-2): max=0.41, min=-0.37, range=0.78, std=0.19
Weight decay compresses the weights uniformly. The correlated-feature cancellation disappears: instead of large opposing weights, all five features in the group receive similarly small positive or negative weights, reflecting their shared underlying signal.
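The same compression can be seen in closed form with ridge regression on two near-duplicate features (an illustrative stand-in for the neural network case): the unpenalized fit is free to pick a large opposing split between the duplicates, while the penalty spreads a small, similar weight across the pair.

```python
import numpy as np

rng = np.random.RandomState(0)
x = rng.randn(500)
X = np.column_stack([x, x + 0.01 * rng.randn(500)])  # rho close to 1
y = x + 0.1 * rng.randn(500)  # target driven by the shared factor

def fit(lam: float) -> np.ndarray:
    # Closed-form (regularized) least squares: (X'X + lam*I)^-1 X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)

w_ols, w_ridge = fit(0.0), fit(1.0)
print("lambda=0:", np.round(w_ols, 3))    # split between duplicates is poorly determined
print("lambda=1:", np.round(w_ridge, 3))  # similar weights summing to ~1
```

The penalty barely affects the well-identified "sum" direction of the two weights but strongly shrinks the ill-identified "difference" direction — exactly the direction that correlated-feature cancellation exploits.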
Performance and Stability
results = {
"model": [
"Logistic Regression",
"MLP (unregularized)",
"MLP + L2 (Adam)",
"MLP + AdamW (λ=1e-2)",
"MLP + AdamW (λ=5e-2)",
],
"val_auc": [0.760, 0.795, 0.783, 0.791, 0.782],
"shifted_auc": [0.752, 0.741, 0.766, 0.781, 0.778],
"auc_drop": [0.008, 0.054, 0.017, 0.010, 0.004],
"mean_pred_std": [0.008, 0.062, 0.031, 0.014, 0.009],
"pct_unstable": [0.2, 18.4, 7.1, 1.8, 0.6],
}
| Model | Val AUC | Shifted AUC | AUC Drop | Mean Pred Std | Unstable (%) |
|---|---|---|---|---|---|
| Logistic Regression | 0.760 | 0.752 | 0.008 | 0.008 | 0.2 |
| MLP (unregularized) | 0.795 | 0.741 | 0.054 | 0.062 | 18.4 |
| MLP + L2 (Adam) | 0.783 | 0.766 | 0.017 | 0.031 | 7.1 |
| MLP + AdamW ($\lambda$=1e-2) | 0.791 | 0.781 | 0.010 | 0.014 | 1.8 |
| MLP + AdamW ($\lambda$=5e-2) | 0.782 | 0.778 | 0.004 | 0.009 | 0.6 |
Analysis
The unregularized MLP achieves the highest in-distribution AUC (0.795) but suffers the largest drop under distribution shift (5.4 points) and the worst stability (18.4% of predictions have >5% standard deviation under perturbation). This model would not survive regulatory review.
AdamW with $\lambda = 10^{-2}$ provides the best trade-off: it matches the unregularized model's in-distribution AUC within 0.004 points while being nearly as stable as logistic regression under distribution shift. Only 1.8% of predictions are unstable, compared to 18.4% for the unregularized model.
The L2-via-Adam approach is intermediate — it provides some stability improvement but less than AdamW at comparable performance. This confirms Loshchilov and Hutter's finding that decoupled weight decay provides more uniform regularization across parameters.
Lessons for Credit Scoring
1. Correlated features are the norm in financial data, not the exception. Any credit scoring model using bureau data will encounter clusters of correlated features. Regularization is not optional — it is a prerequisite for stability.
2. In-distribution accuracy is a misleading metric for regulated models. The unregularized model "won" on validation AUC but would fail in deployment. The relevant metric is performance under realistic distribution shifts — different time periods, different economic conditions, different customer segments.
3. AdamW should be the default optimizer for tabular neural networks. When L2 regularization is folded into Adam's gradient, the penalty term is divided by the same adaptive denominator as the data gradient, so weights with large gradient histories — often those attached to correlated features — receive less effective decay. Decoupled weight decay applies the same shrinkage factor to every parameter.
4. Weight magnitudes are auditable. Regulators and compliance teams cannot inspect loss landscapes, but they can inspect weight magnitudes. A model whose largest weight is 0.82 is inherently more explainable than one whose largest weight is 4.21, because each feature's contribution to the prediction is more tightly bounded.
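A compliance-facing audit of this kind can be as simple as the following sketch (`audit_weight_magnitudes` is an illustrative helper, not part of Meridian's codebase):

```python
import torch.nn as nn

def audit_weight_magnitudes(model: nn.Module) -> dict:
    """Largest absolute weight per Linear layer, for audit reports."""
    return {
        name or "linear": module.weight.detach().abs().max().item()
        for name, module in model.named_modules()
        if isinstance(module, nn.Linear)
    }

# Stand-in model; in practice, pass the trained CreditMLP
model = nn.Sequential(nn.Linear(25, 64), nn.ReLU(), nn.Linear(64, 1))
print(audit_weight_magnitudes(model))
```

Tracked across retraining cycles, a report like this gives reviewers a simple, bounded quantity to monitor instead of opaque internal representations.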