Case Study 1: Should We Approve the Loan? A Decision Tree for Credit Risk


Tier 3 — Illustrative/Composite Example: Lakeview Community Credit Union is a fictional institution. This case study is built from widely reported patterns in how small financial institutions approach credit risk assessment. The data, model outputs, and regulatory considerations described here are composites of real-world practices documented in industry literature. No specific institution is represented, and all names, figures, and scenarios are invented for pedagogical purposes.


The Setting

Lakeview Community Credit Union serves about 14,000 members in a mid-sized Midwestern city. For decades, loan officers like Denise Marsh evaluated applications using a combination of credit scores, income verification, and — if she's being honest — gut instinct built over 22 years on the job.

Denise is good at her job. But the credit union's board has noticed something uncomfortable: approval rates vary significantly across loan officers. One officer approves 68% of applications; another approves 54%. Their portfolios have similar default rates, which means either one is too lenient or the other is too strict — or both are making arbitrary decisions that happen to work out.

The credit union's new director of analytics, Terrence, proposes a data-driven approach: build a model that can flag high-risk applications based on historical data, then use the model's logic to create consistent guidelines that all loan officers follow.

Terrence isn't trying to replace Denise. He's trying to make the decision process transparent, consistent, and defensible. And he knows that the model he builds needs to be something Denise can understand, question, and trust. A black box won't work here.

He chooses a decision tree.

The Question

Terrence frames three connected questions:

  1. Can we predict which loans will default using information available at the time of application? (Predictive — can we build a model that works?)
  2. What are the most important factors distinguishing loans that default from those that don't? (Descriptive — what does the model tell us about risk?)
  3. Can we create simple, consistent decision rules that all loan officers can follow? (Practical — can we translate the model into policy?)

Notice that Terrence isn't asking "should we approve this loan?" directly. He's asking whether the data supports creating a structured decision process — and what that process should look like.

The Data

Terrence pulls five years of loan records: 3,847 loans issued between 2018 and 2023. For each loan, he has:

Features (known at application time):

  - credit_score: Applicant's credit score (300-850)
  - annual_income: Self-reported annual income
  - debt_to_income: Monthly debt payments divided by monthly income
  - loan_amount: Amount requested
  - employment_years: Years at current employer
  - loan_purpose: Reason for the loan (home improvement, debt consolidation, auto, personal)
  - has_mortgage: Whether the applicant has a mortgage (binary)
  - num_credit_lines: Number of open credit lines

Target variable:

  - defaulted: Whether the borrower defaulted within 2 years (1 = yes, 0 = no)

The default rate across all loans is 11.2% — so 88.8% of loans were repaid successfully. This class imbalance is important and will affect how Terrence evaluates the model.

Building the Model

Step 1: Data Preparation

import pandas as pd
from sklearn.model_selection import train_test_split

loans = pd.read_csv('lakeview_loans.csv')
print(f"Total loans: {len(loans)}")
print(f"Default rate: {loans['defaulted'].mean():.1%}")

features = ['credit_score', 'annual_income', 'debt_to_income',
            'loan_amount', 'employment_years', 'has_mortgage',
            'num_credit_lines']
X = loans[features]
y = loans['defaulted']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

Terrence uses stratify=y to ensure both training and test sets maintain the same 11.2% default rate. Without stratification, the test set might randomly have a different default rate, making evaluation misleading.
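A quick way to confirm what stratification buys is to compare default rates across the two splits. The sketch below uses synthetic labels with the same 11.2% positive rate, since lakeview_loans.csv is invented:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the loan labels: ~11.2% defaults.
rng = np.random.default_rng(42)
y = (rng.random(3847) < 0.112).astype(int)
X = rng.normal(size=(3847, 3))

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# With stratify=y, the two rates match to within rounding;
# without it, the test rate can drift by a percentage point or more.
print(f"train default rate: {y_tr.mean():.3f}")
print(f"test default rate:  {y_te.mean():.3f}")
```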

He excludes loan_purpose for now to avoid dealing with encoding categorical variables, a step he plans to revisit once the basic pipeline is working.
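When he does fold loan_purpose back in, one common route is one-hot encoding, which pd.get_dummies handles in a line. The sample rows below are invented for illustration:

```python
import pandas as pd

# Hypothetical sample of the categorical column Terrence set aside.
purposes = pd.DataFrame({
    'loan_purpose': ['auto', 'debt consolidation', 'home improvement',
                     'auto', 'personal']
})

# One-hot encoding turns one categorical column into one binary
# column per category, which a decision tree can then split on.
encoded = pd.get_dummies(purposes, columns=['loan_purpose'])
print(encoded.columns.tolist())
```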

Step 2: The Unpruned Tree

Terrence's first tree uses default settings:

from sklearn.tree import DecisionTreeClassifier

dt_full = DecisionTreeClassifier(random_state=42)
dt_full.fit(X_train, y_train)

print(f"Training accuracy: {dt_full.score(X_train, y_train):.3f}")
print(f"Test accuracy:     {dt_full.score(X_test, y_test):.3f}")
print(f"Tree depth:        {dt_full.get_depth()}")
print(f"Number of leaves:  {dt_full.get_n_leaves()}")
Training accuracy: 1.000
Test accuracy:     0.831
Tree depth:        24
Number of leaves:  487

A tree with 487 leaves and depth 24 is not something Denise can stick on her wall. And 83.1% test accuracy sounds decent until you remember the base rate: if you predicted "no default" for every single application, you'd be right 88.8% of the time. Measured by accuracy alone, this overfit tree is worse than doing nothing.
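That base-rate comparison is worth making explicit. scikit-learn's DummyClassifier gives the "do nothing" baseline directly; the sketch below uses synthetic labels, since the Lakeview data is fictional:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic labels with roughly the same 11.2% default rate.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = (rng.random(1000) < 0.112).astype(int)

# A "model" that always predicts the majority class (no default).
baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)

# Its accuracy equals the majority-class share: the bar any real
# model's accuracy gets compared against.
print(f"baseline accuracy: {baseline.score(X, y):.3f}")
```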

Step 3: The Pruned Tree

Terrence tries a shallow, interpretable tree:

dt_pruned = DecisionTreeClassifier(
    max_depth=4,
    min_samples_leaf=30,
    class_weight='balanced',
    random_state=42
)
dt_pruned.fit(X_train, y_train)

The class_weight='balanced' is critical. It tells the algorithm to penalize misclassifying defaulted loans more heavily, proportional to how rare they are. Without this, the tree would learn to predict "no default" for everyone — technically accurate but useless.
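Under the hood, 'balanced' sets each class's weight to n_samples / (n_classes * n_samples_in_class), so at an 11.2% default rate the rare class is weighted roughly eight times heavier. A quick sketch of the arithmetic:

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# 1,000 loans at the case study's 11.2% default rate.
y = np.array([0] * 888 + [1] * 112)

weights = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)

# weight = n_samples / (n_classes * count):
#   class 0 (no default): 1000 / (2 * 888) ≈ 0.563
#   class 1 (default):    1000 / (2 * 112) ≈ 4.464
print(dict(zip([0, 1], weights.round(3))))
```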

from sklearn.tree import export_text

print(export_text(dt_pruned, feature_names=features))
|--- credit_score <= 642.50
|   |--- debt_to_income <= 0.38
|   |   |--- employment_years <= 2.50
|   |   |   |--- class: default
|   |   |--- employment_years > 2.50
|   |   |   |--- class: no default
|   |--- debt_to_income > 0.38
|   |   |--- class: default
|--- credit_score > 642.50
|   |--- debt_to_income <= 0.45
|   |   |--- loan_amount <= 28500.00
|   |   |   |--- class: no default
|   |   |--- loan_amount > 28500.00
|   |   |   |--- class: no default
|   |--- debt_to_income > 0.45
|   |   |--- annual_income <= 52000.00
|   |   |   |--- class: default
|   |   |--- annual_income > 52000.00
|   |   |   |--- class: no default

Now this Terrence can work with. He translates the tree into plain English:

The Model's Logic:

  1. If the applicant's credit score is at or below 642:
     - If their debt-to-income ratio is above 38%, flag as high risk.
     - If their debt-to-income ratio is at or below 38% BUT they've been at their job 2.5 years or less, still flag as high risk.
     - If debt-to-income is low AND they've been employed more than 2.5 years, approve.

  2. If the credit score is above 642:
     - If their debt-to-income ratio is at or below 45%, approve (regardless of loan amount).
     - If debt-to-income is above 45% AND income is at or below $52,000, flag as high risk.
     - If debt-to-income is above 45% BUT income is above $52,000, approve.
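Translated one step further, the rules become a short function that could serve as the policy document itself. The thresholds come from the tree printout; the function name is invented for illustration:

```python
def flag_application(credit_score, debt_to_income,
                     employment_years, annual_income):
    """Return 'high risk' or 'approve' per the pruned tree's rules."""
    if credit_score <= 642:
        if debt_to_income > 0.38:
            return 'high risk'
        # Low DTI but short job tenure is still risky.
        return 'high risk' if employment_years <= 2.5 else 'approve'
    else:
        if debt_to_income <= 0.45:
            return 'approve'
        # High DTI is only acceptable with higher income.
        return 'high risk' if annual_income <= 52000 else 'approve'

print(flag_application(610, 0.42, 5, 48000))   # high risk
print(flag_application(700, 0.30, 1, 40000))   # approve
```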

Denise reads this and immediately nods. "This makes sense. High debt-to-income with a low credit score is dangerous. And for borderline credit scores, employment stability is a real indicator — I've seen that pattern for years."

That reaction — an experienced practitioner saying "this matches what I've observed" — is one of the most powerful validations a model can receive.

Evaluation: Beyond Accuracy

Terrence knows accuracy isn't enough. He generates a classification report:

from sklearn.metrics import classification_report

y_pred = dt_pruned.predict(X_test)
print(classification_report(y_test, y_pred,
                            target_names=['No Default', 'Default']))
              precision    recall  f1-score   support

  No Default       0.95      0.81      0.87      1024
     Default       0.30      0.64      0.41       131

    accuracy                           0.79      1155

The test accuracy (79%) is actually lower than the naive baseline (88.8%). But look at the recall for Default: 0.64. The model catches 64% of loans that will default — versus 0% if you just approved everyone. That's the whole point.
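The recall figure comes straight from the confusion matrix. The counts below are reconstructed to be consistent with the report above (the underlying data is fictional, so these are illustrative): 131 true defaults, 84 caught, 47 missed, and 196 safe loans flagged by mistake.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Labels and predictions matching the report's support and metrics.
y_true = np.array([0] * 1024 + [1] * 131)
y_pred = np.array([0] * 828 + [1] * 196 + [1] * 84 + [0] * 47)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"defaults caught: {tp} of {tp + fn}")                  # 84 of 131
print(f"recall for Default: {recall_score(y_true, y_pred):.2f}")  # 0.64
```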

Terrence explains this to the board: "We're trading some overall accuracy for the ability to identify risky loans. The model won't catch every default, but it catches most of them. The loans it flags can go to senior review — a human still makes the final call."

The Comparison: Random Forest

Terrence also trains a random forest for comparison:

from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=300,
    max_depth=10,
    class_weight='balanced',
    random_state=42
)
rf.fit(X_train, y_train)

The random forest achieves a recall of 0.72 for defaults (compared to 0.64 for the single tree) and an overall accuracy of 0.82. It's better across the board — but Terrence can't draw it on a whiteboard.

His solution: use the decision tree as the policy document (the simple rules loan officers follow) and the random forest as the risk scoring system (a more nuanced score attached to each application). When the two disagree — the tree says "approve" but the forest says "risky" — the application goes to a senior officer for review.
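That routing rule is simple to express in code. A minimal sketch on synthetic data, where `tree` and `forest` stand in for the fitted dt_pruned and rf from above:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data just to have two fitted models.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=500) > 1.2).astype(int)

tree = DecisionTreeClassifier(max_depth=3, random_state=42).fit(X, y)
forest = RandomForestClassifier(n_estimators=50, random_state=42).fit(X, y)

tree_says = tree.predict(X)
forest_says = forest.predict(X)

# Agreement -> follow the policy; disagreement -> senior review.
needs_review = tree_says != forest_says
print(f"sent to senior review: {needs_review.sum()} of {len(X)}")
```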

Lessons Learned

1. The model you show stakeholders doesn't have to be the model you deploy. The shallow decision tree was the communication tool. The random forest was the production tool. Both served essential purposes.

2. Class imbalance changes everything. Without class_weight='balanced', both models would have learned to approve everyone and call it a day. When the rare class is the one you actually care about (defaults, fraud, disease), you need to explicitly tell the model to pay attention to it.

3. Accuracy is misleading with imbalanced classes. The model with 79% accuracy was more useful than a trivial model with 88.8% accuracy. This is why Chapter 29 exists — you need better metrics.

4. Domain expertise validates the model. Denise's reaction to the tree rules — "this matches what I've observed" — was as important as the test metrics. A model that contradicts experienced practitioners should be scrutinized carefully, not trusted blindly.

5. Transparency builds trust. Denise didn't fight the model because she could read it. She could see that it captured patterns she already knew, and she could identify edge cases where she'd override it. A black-box model would have faced much more resistance.

Discussion Questions

  1. The tree's first split is on credit_score <= 642.50. Why do you think credit score was chosen over other features? Does this mean credit score causes default risk?

  2. Terrence excluded loan_purpose (a categorical variable) from the initial model. How might he include it? What challenges would that introduce?

  3. The model flags 36% of applications that would default as "safe." What are the costs of these false negatives? How might Terrence adjust the model to catch more defaults, and what trade-off would that involve?

  4. If the credit union started using this model and rejected all flagged applicants, how might the data change over time? Would the model remain valid? (This phenomenon is known as a "feedback loop," and the resulting loss of validity as "model decay.")

  5. Is it fair to use credit_score as a feature, given that credit scores themselves contain historical biases related to race and income? How might Terrence address this concern?