Case Study 2: ZestFinance -- When Credit Scoring Meets Machine Learning


Introduction

For most of the twentieth century, credit decisions in the United States were governed by a single number: the FICO score. Developed by Fair, Isaac and Company in 1989, the FICO score compresses a consumer's credit history into a three-digit number (300 to 850) that predicts the likelihood of default. It is used in approximately 90 percent of U.S. lending decisions.

The FICO score works. It is transparent, standardized, and well understood by regulators. But it has limitations that create both business costs and social consequences. The CFPB estimates that roughly 26 million Americans are "credit invisible" -- they have no credit bureau record at all -- and another 19 million have "thin" or stale files with too little data for reliable scoring, a combined total of about 45 million adults. These populations skew disproportionately young, immigrant, and lower-income. They are not inherently risky borrowers; they are unknown borrowers, and the traditional system cannot distinguish between the two.

ZestFinance (now Zest AI) was founded in 2009 by Douglas Merrill, a former CIO of Google, with a radical premise: machine learning models trained on thousands of data points could make better, fairer, and more inclusive credit decisions than the traditional FICO-based approach. The company's journey over the subsequent fifteen years illuminates both the promise and the peril of applying classification models to high-stakes, regulated decisions.


The Business Problem

Traditional credit scoring models use approximately 15 to 50 features, almost all derived from a consumer's credit bureau file: payment history, credit utilization, length of credit history, types of credit, and recent credit inquiries. These features are predictive but limited. They capture what happened with past credit accounts but say little about the borrower's broader financial behavior, stability, or capacity to repay.

ZestFinance's hypothesis was that ML models could use a far broader set of features -- thousands of variables, including how a borrower navigates the loan application process, cash flow patterns from bank statements, employment stability indicators, and other "alternative data" -- to build a more accurate picture of creditworthiness. More accurate models would produce two business benefits simultaneously:

  1. Reduced default rates among approved borrowers (better identification of risky applicants)
  2. Increased approval rates for creditworthy borrowers who are rejected by traditional models (better identification of good applicants with thin or no credit files)

This is the classification problem at its most consequential: predict whether a loan applicant will default (binary classification, positive class = default), using features that go far beyond the traditional credit bureau data.

Business Insight. ZestFinance's insight connects directly to Chapter 7's discussion of feature engineering. The traditional credit scoring industry had spent decades optimizing a narrow set of features. ZestFinance's competitive advantage was not a better algorithm -- it was a broader feature set informed by a different theory of what predicts creditworthiness.


The Modeling Approach

Feature Engineering: The Core Innovation

ZestFinance's models reportedly use over 1,000 features per applicant, derived from multiple data sources:

Application behavior. How the applicant fills out the loan application itself. Do they complete it in one sitting or over multiple sessions? Do they use consistent personal information? How long do they spend on each section? ZestFinance argued that application behavior contains signal about the applicant's organization, attention to detail, and potential for fraud.

Cash flow analysis. When applicants grant access to bank account data, ML models analyze income regularity, spending patterns, savings behavior, and the timing of deposits and withdrawals. A borrower who consistently maintains a buffer above zero balance may be a better risk than one whose account frequently hits zero, regardless of income level.
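As a sketch of how such cash-flow features might be engineered (the metric names, thresholds, and dollar figures here are hypothetical illustrations, not ZestFinance's actual variables):

```python
from statistics import mean, pstdev

def cash_flow_features(daily_balances, deposit_amounts):
    """Illustrative cash-flow features from a period of bank data.

    daily_balances: end-of-day balances in dollars
    deposit_amounts: individual deposit amounts over the period
    """
    # Fraction of days the account sat below a (hypothetical) $25 buffer
    days_near_zero = sum(1 for b in daily_balances if b < 25.0)
    return {
        "near_zero_rate": days_near_zero / len(daily_balances),
        # Lowest buffer maintained over the period
        "min_balance": min(daily_balances),
        # Income regularity: low deposit variability suggests stable pay
        "deposit_volatility": pstdev(deposit_amounts) / mean(deposit_amounts),
    }

# A borrower who keeps a buffer vs. one who repeatedly hits zero
steady = cash_flow_features([300, 280, 350, 310, 290], [1500, 1500])
choppy = cash_flow_features([5, 0, 120, 10, 0], [900, 2100])
```

The two applicants may have identical incomes, yet the engineered features separate them sharply, which is exactly the signal the paragraph above describes.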

Public records and alternative data. Property records, business filings, educational records (with consent), and other public data sources that provide context about financial stability.

Traditional credit bureau data. ZestFinance does not abandon FICO-relevant data; it supplements it. The ML model considers all traditional features alongside the expanded feature set.

Model Architecture

ZestFinance uses ensemble models -- gradient boosted trees (XGBoost and proprietary variants) -- as the backbone of its credit scoring system. The choice aligns with the guidance in Section 7.6: for structured tabular data, gradient boosting consistently delivers state-of-the-art performance.
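A minimal illustration of this modeling choice, using scikit-learn's GradientBoostingClassifier on synthetic data -- ZestFinance's actual features, training data, and hyperparameters are proprietary, so everything below is a stand-in:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

# Synthetic stand-in for an underwriting dataset: 1,000 applicants,
# 40 features, positive class (1) = default, ~15% default rate
X, y = make_classification(n_samples=1000, n_features=40,
                           n_informative=15, weights=[0.85],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

model = GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                   learning_rate=0.05, random_state=0)
model.fit(X_tr, y_tr)

# Score = predicted probability of default; rank-ordering quality
# is measured by AUC, a standard credit-scoring metric
scores = model.predict_proba(X_te)[:, 1]
auc = roc_auc_score(y_te, scores)
```

Shallow trees, a small learning rate, and many boosting rounds is the conventional recipe for tabular risk data; the lender then converts the probability into an approve/deny decision by choosing a cutoff.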

The company trains multiple specialized models and combines their outputs. For example, a fraud detection model, a capacity-to-repay model, and a willingness-to-repay model may each contribute to the final credit decision. This multi-model architecture reflects the reality that "creditworthiness" is not a single concept but a composite of several distinct risks.
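The multi-model combination might be sketched as a weighted blend of sub-model risk probabilities; the weights and cutoff below are purely illustrative, not ZestFinance's actual parameters:

```python
def combined_decision(fraud_p, capacity_p, willingness_p,
                      weights=(0.2, 0.4, 0.4), cutoff=0.15):
    """Hypothetical blend of three specialized sub-model outputs.

    Each input is a probability of the corresponding risk (fraud,
    inability to repay, unwillingness to repay). The blended score
    is compared against an approval cutoff.
    """
    risk = (weights[0] * fraud_p +
            weights[1] * capacity_p +
            weights[2] * willingness_p)
    return ("deny" if risk > cutoff else "approve"), risk

decision, risk = combined_decision(0.01, 0.05, 0.04)
```

A real system would likely learn the combination (a stacked meta-model) rather than fix weights by hand, but the structure -- several distinct risks feeding one decision -- is the point.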


The Fairness Challenge

The most significant challenge ZestFinance confronted was not technical but ethical and regulatory. Machine learning models applied to credit decisions operate under strict legal constraints, most notably the Equal Credit Opportunity Act (ECOA) and the Fair Housing Act, which prohibit discrimination based on race, color, national origin, sex, religion, marital status, and age.

The Proxy Variable Problem

Even when protected characteristics (race, gender, age) are excluded from the model's features, the model may still discriminate if it uses features that are correlated with protected characteristics. This is the proxy variable problem.

Consider zip code as a feature. In the United States, residential patterns are heavily segregated by race and income. A model that uses zip code as a predictor may effectively use it as a proxy for race, producing discriminatory outcomes even though race is never directly used.

ZestFinance encountered this problem directly. Some of its alternative data features -- application completion time, device type used for the application, time of day the application was submitted -- had correlations with demographic characteristics. Applicants using older devices or applying during non-business hours were disproportionately from lower-income and minority populations. If the model learned to penalize these features, it could produce disparate impact -- a legal concept where a facially neutral practice disproportionately harms a protected group.

Caution. The proxy variable problem is not unique to ZestFinance. It exists in every classification model that uses real-world data, because real-world data reflects real-world inequities. When features like purchase patterns, geographic data, or digital behavior correlate with protected characteristics, any model trained on those features risks encoding and amplifying societal biases. We will explore this in depth in Chapter 25 (Bias in AI Systems).

ZestFinance's Fairness Approach: ZAML

ZestFinance developed a proprietary framework called ZAML (Zest Automated Machine Learning) that incorporated fairness constraints into the modeling process. The framework included several innovations:

Adverse impact ratio testing. For each model, ZestFinance calculated the approval rate for each demographic group and ensured that the ratio between the lowest-approved group and the highest-approved group met regulatory thresholds (typically, a ratio at or above 0.80 is considered acceptable under the "four-fifths rule," a threshold borrowed from federal employment discrimination guidelines).
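The adverse impact ratio itself is a simple calculation, sketched here with hypothetical approval counts:

```python
def adverse_impact_ratio(approvals_by_group):
    """approvals_by_group: {group: (approved, total_applicants)}.

    Returns the four-fifths-rule ratio (lowest group approval rate
    divided by the highest) and the per-group rates.
    """
    rates = {g: a / n for g, (a, n) in approvals_by_group.items()}
    return min(rates.values()) / max(rates.values()), rates

# Hypothetical counts: 70% vs. 58% approval rates
ratio, rates = adverse_impact_ratio({
    "group_a": (700, 1000),
    "group_b": (580, 1000),
})
```

Here the ratio is 0.58/0.70 ≈ 0.83, above the 0.80 threshold -- though, as the discussion questions later explore, passing a single threshold is not the same as being fair.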

Feature-level fairness auditing. Each feature was evaluated not just for predictive power but for its correlation with protected characteristics. Features with high predictive power but high demographic correlation were flagged for review. In some cases, removing a single proxy feature reduced predictive accuracy by 0.5 to 1.0 percentage points but improved the adverse impact ratio by 3 to 5 percentage points -- a tradeoff the company deemed worthwhile.
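A feature-level audit of this kind can be sketched as a scan for features that are both predictive and demographically correlated. The thresholds, feature names, and synthetic data below are illustrative, not ZestFinance's actual criteria:

```python
import numpy as np

def audit_features(X, y, protected, names,
                   power_min=0.10, proxy_max=0.30):
    """Flag features that are predictive (|corr with label| >= power_min)
    but also demographically correlated (|corr with protected| >= proxy_max).
    """
    flagged = []
    for j, name in enumerate(names):
        power = abs(np.corrcoef(X[:, j], y)[0, 1])
        proxy = abs(np.corrcoef(X[:, j], protected)[0, 1])
        if power >= power_min and proxy >= proxy_max:
            flagged.append((name, round(power, 2), round(proxy, 2)))
    return flagged

rng = np.random.default_rng(0)
protected = rng.integers(0, 2, 2000)            # protected-class indicator
proxy = protected + rng.normal(0, 0.5, 2000)    # e.g. device age: a proxy
clean = rng.normal(0, 1, 2000)                  # demographically neutral signal
y = (0.8 * proxy + 0.8 * clean
     + rng.normal(0, 1, 2000) > 0.8).astype(int)
X = np.column_stack([proxy, clean])

flags = audit_features(X, y, protected, ["device_age", "cash_buffer"])
```

In this synthetic setup both features predict default, but only the proxy is flagged for review -- the decision of whether to drop it is then the accuracy-versus-fairness tradeoff described above.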

Interpretable explanations. The Equal Credit Opportunity Act requires lenders to provide specific reasons for adverse credit decisions ("reason codes"). ZestFinance developed methods to generate reason codes from its complex ML models, translating gradient-boosted tree predictions into the standardized adverse action notice format that borrowers receive.
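ZestFinance's reason-code method is proprietary; the sketch below uses a simple additive scorecard to illustrate the translation from per-feature contributions to ranked adverse action reasons. The feature names, weights, and reason texts are hypothetical (a tree ensemble would need a SHAP-style attribution step to produce the contributions):

```python
# Hypothetical mapping from model features to adverse action language
REASON_TEXT = {
    "utilization": "Proportion of credit limits in use is too high",
    "near_zero":   "Bank balance frequently near zero",
    "inquiries":   "Too many recent credit inquiries",
    "history_len": "Length of credit history is too short",
}

def reason_codes(weights, applicant, population_mean, top_k=2):
    """Rank features by their contribution toward a 'default' score:
    weight * (applicant value - population mean)."""
    contrib = {f: weights[f] * (applicant[f] - population_mean[f])
               for f in weights}
    # Largest positive contributions push the score toward "default"
    worst = sorted(contrib, key=contrib.get, reverse=True)[:top_k]
    return [REASON_TEXT[f] for f in worst if contrib[f] > 0]

weights = {"utilization": 2.0, "near_zero": 1.5,
           "inquiries": 0.8, "history_len": -0.5}
pop_mean = {"utilization": 0.3, "near_zero": 0.1,
            "inquiries": 2.0, "history_len": 8.0}
applicant = {"utilization": 0.9, "near_zero": 0.6,
             "inquiries": 1.0, "history_len": 10.0}
codes = reason_codes(weights, applicant, pop_mean)
```

The output is the standardized form regulators require: the top factors that most hurt this specific applicant's score, in plain language.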


The Regulatory Landscape

ZestFinance's story cannot be understood apart from the regulatory environment. Credit lending in the United States is one of the most heavily regulated domains for ML deployment.

Key Regulations

Equal Credit Opportunity Act (ECOA, 1974). Prohibits discrimination in credit decisions and requires lenders to provide specific, understandable reasons for denying credit. This creates a tension with complex ML models: if the model uses 1,000 features and gradient boosting to reach a decision, how do you explain to a rejected applicant why they were denied?

Fair Credit Reporting Act (FCRA, 1970, amended). Regulates the use of consumer data in credit decisions, including requirements for accuracy, dispute resolution, and permissible purposes. Alternative data sources used by ZestFinance may or may not fall under FCRA's scope, creating legal ambiguity.

Disparate impact doctrine. Under ECOA and the Fair Housing Act, lending practices that have a disproportionate adverse effect on protected groups can be challenged even if the lender did not intend to discriminate. This doctrine applies to ML models regardless of whether protected characteristics are used as explicit features.

State-level regulations. Individual states have additional consumer protection laws that may restrict the use of certain data types or modeling techniques in credit decisions.

The Regulatory Response

Regulators have taken an increasingly active interest in ML-based credit scoring. The Consumer Financial Protection Bureau (CFPB) has issued guidance on the use of alternative data and AI in lending, emphasizing that innovation does not exempt lenders from existing anti-discrimination requirements. The CFPB's position is that ML models are subject to the same fair lending scrutiny as traditional models -- and that the opacity of complex models does not excuse lenders from understanding and explaining their models' behavior.

In 2022, the CFPB issued a circular (Circular 2022-03) stating that lenders using complex algorithms must still provide specific and accurate adverse action notices, rejecting the argument that algorithmic complexity makes specific explanations infeasible. This guidance directly impacted companies like ZestFinance that had argued their models were too complex for traditional reason-code approaches.


Results and Market Impact

ZestFinance (rebranded as Zest AI in 2019) transitioned from its original direct-lending business to a technology platform, licensing its ML credit scoring technology to banks and credit unions. The company has reported several outcomes:

Improved prediction. Zest AI has claimed that its ML models reduce default rates by 20 to 30 percent compared to traditional logistic regression models at the same approval rate, or increase approval rates by 15 to 25 percent at the same default rate. These improvements are particularly pronounced for "thin file" applicants with limited credit history.
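Zest AI's published numbers cannot be reproduced here, but the comparison they describe -- approval rate achievable at a fixed default rate -- can be sketched with synthetic scores. The applicants, scores, and 5 percent default cap below are hypothetical:

```python
def approval_rate_at_default(scores, defaulted, max_default_rate=0.05):
    """Greedy sketch: approve applicants from lowest predicted risk
    upward, stopping when one more approval would push the realized
    default rate among approvals over the cap."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    approved = defaults = 0
    for i in order:
        if (defaults + defaulted[i]) / (approved + 1) > max_default_rate:
            break
        approved += 1
        defaults += defaulted[i]
    return approved / len(scores)

# 100 applicants, 10% base default rate, defaulters interleaved
defaulted = ([0] * 9 + [1]) * 10
sharp = [0.9 if d else 0.1 for d in defaulted]  # well-separating model
blunt = [0.5] * 100                              # no-signal model

rate_sharp = approval_rate_at_default(sharp, defaulted)
rate_blunt = approval_rate_at_default(blunt, defaulted)
```

Holding the default cap fixed, the model that rank-orders risk well can approve far more applicants than the uninformative one -- the same trade Zest AI's claims quantify from the other direction (lower defaults at the same approval rate).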

Fairness improvements. The company has published case studies showing that its ZAML framework can simultaneously improve predictive accuracy and reduce adverse impact on minority groups. In one cited example, an ML model achieved a 15 percent increase in approvals for Hispanic applicants while maintaining the same overall default rate.

Industry adoption. Multiple U.S. banks and credit unions have adopted Zest AI's technology for credit card, personal loan, and auto lending decisions. The company reports that its models have been used to evaluate over 200 million credit applications.

Scrutiny and debate. Despite these results, ZestFinance/Zest AI has also faced criticism. Consumer advocacy groups have raised concerns about the use of behavioral data (application behavior, device type) as potential proxies for protected characteristics. Academic researchers have questioned whether post-hoc fairness adjustments are sufficient to address structural biases embedded in the training data.


Lessons for Business Leaders

Lesson 1: Classification in Regulated Industries Requires Interpretability Infrastructure

The technical challenge of building an accurate classification model is significant. The organizational challenge of making that model explainable, auditable, and defensible to regulators is often larger. ZestFinance invested as much in its explainability framework (ZAML) as in its core model development. Any organization deploying ML in a regulated domain should budget accordingly.

Lesson 2: Accuracy and Fairness Are Not Always in Tension

A common assumption is that fairness constraints necessarily reduce model accuracy. ZestFinance's experience suggests this is not always true. By expanding the feature set and removing proxy variables, it is sometimes possible to improve both accuracy and fairness simultaneously. The key insight is that features correlated with protected characteristics may be noisy proxies for creditworthiness -- removing them and replacing them with cleaner signals can improve both performance and equity.

Lesson 3: The Target Variable Embeds Assumptions About Risk

Default prediction models are trained on historical default data. But historical default data reflects historical lending practices -- including discriminatory practices. If certain groups were historically denied credit (and thus never had the opportunity to default or repay), the training data may systematically underestimate their creditworthiness. This is a form of selection bias that is particularly insidious in credit scoring.

Business Insight. The lesson for any business deploying classification models is this: your training data reflects the world as it was, including its inequities. If your model is trained on biased historical data and deployed to make decisions about the future, it will perpetuate the biases of the past. This is not a technical problem; it is a design choice. Addressing it requires awareness, measurement, and deliberate intervention. Chapters 25 and 26 provide frameworks for doing so.

Lesson 4: The Competitive Landscape Is Shifting

ZestFinance proved that ML-based credit scoring could outperform traditional methods. But the innovation has not remained proprietary for long. Major credit bureaus (Experian, TransUnion, Equifax) have launched their own ML-based scoring products. FICO has released FICO Score XD and UltraFICO, which incorporate alternative data. Fintech competitors (Upstart, LendingClub, SoFi) have built their own ML underwriting models.

The competitive implication: first-mover advantage in ML-based credit scoring has eroded. The durable advantage lies not in the model itself but in the data infrastructure, the fairness framework, the regulatory expertise, and the organizational capability to continuously improve and monitor the model in production.


Discussion Questions

Discussion Question 1. ZestFinance uses over 1,000 features, including application behavior (how the applicant fills out the form). Some critics argue that this is inappropriately invasive. Others argue that it enables fairer lending by evaluating applicants on more dimensions than traditional credit data. Where do you stand? What principles should guide decisions about which data is appropriate to use in credit scoring?

Discussion Question 2. The "four-fifths rule" used in adverse impact analysis requires that the selection rate for any protected group be at least 80 percent of the rate for the highest-selected group. Consider a model where the approval rate for White applicants is 70 percent and for Black applicants is 58 percent. The ratio is 58/70 = 0.83, above the 0.80 threshold. Does this mean the model is "fair"? What are the limitations of using a single statistical threshold to define fairness? (Preview: Chapter 25 explores this question in depth.)

Discussion Question 3. ZestFinance found that removing certain proxy variables (like device type) slightly reduced model accuracy but meaningfully improved fairness. Under what circumstances should a business accept reduced accuracy to improve fairness? Who should make this decision -- data scientists, business leaders, regulators, or affected communities?

Discussion Question 4. Traditional FICO scores are interpretable by design: each component has a known contribution. ML models with 1,000+ features are not. The CFPB requires that denied applicants receive specific reasons for denial. Is it possible to provide genuinely meaningful explanations for complex ML model decisions, or are post-hoc explanations (like SHAP values) fundamentally different from built-in interpretability? Does the distinction matter?

Discussion Question 5. Consider the selection bias problem: historical lending data reflects historical lending practices, including discrimination. If you were designing a new ML credit scoring system from scratch, how would you address this? Consider: (a) data augmentation, (b) fairness-constrained optimization, (c) human-in-the-loop override, (d) ongoing monitoring and adjustment. Which approach (or combination) do you believe is most effective?

Discussion Question 6. ZestFinance transitioned from being a direct lender to being a technology platform that licenses its models to banks. What are the advantages and risks of this pivot? How does the regulatory responsibility change when the model developer is not the lender? Who is liable if a licensed model produces discriminatory outcomes -- the technology provider or the lending institution?


This case study is based on publicly available information from Zest AI corporate communications, regulatory filings, published research papers, media reporting (including coverage in American Banker, Bloomberg, and The Wall Street Journal), and CFPB regulatory guidance. Specific model architectures and proprietary performance metrics that are not publicly disclosed have been presented as reasonable estimates based on published claims and industry benchmarks.