Case Study 2: When a Model Encodes a Proxy — The Bias Hidden in the Math

DataField.Dev

A study of the failure mode that haunts every predictive model in insurance: a variable that is perfectly legal and genuinely predictive, yet stands in for a protected characteristic the law forbids you to price on. This is discrimination laundered through correlation, invisible inside the model's accuracy. The case is a clearly labeled composite built around a real, well-documented industry and regulatory concern. The pattern (predictive models reproducing historical bias through correlated proxies) is real and is the subject of active NAIC and state regulatory work; the specific scenario below is constructed for teaching and contains no real carrier and no fabricated statistic.

Background

Picture a carrier that has done everything this chapter recommends. Its analytics team builds a gradient boosting model for risk selection, validates it rigorously out-of-sample, and confirms strong lift and a healthy Gini (§32.6). The model never uses race, religion, or national origin — those variables are not in the data at all, and the team is scrupulous about it. By every internal measure the model is excellent: it sorts good risks from bad, it improves the loss ratio, it passes the carrier's governance review. The underwriters trust it. The actuaries file the rates that flow from it. For two years it is the carrier's quiet competitive advantage.

Then a state regulator, examining the carrier as part of a broader push on algorithmic fairness, asks a question the carrier's own validation never posed: not "is the model accurate?" but "does the model produce systematically different outcomes for protected groups, and if so, why?" The carrier runs the analysis it was never asked to run before — comparing the model's scores and prices across demographic groups it can approximate from external data — and finds a disparity. Risks in certain communities are scored worse, and priced higher, in a pattern that tracks race far more closely than any legitimate risk story explains.

The model never saw race. And yet it found it.

The insurance / underwriting issue

This is proxy discrimination (Chapter 35's term), and it is the deepest danger in predictive modeling. A proxy is a permitted, predictive variable that is correlated with a prohibited one, so that pricing on the proxy reproduces — sometimes amplifies — the very discrimination the law forbids, without the prohibited variable ever appearing in the model. The mechanism is exactly the one that makes models powerful, turned against fairness.

Consider how it happens, mechanically, in terms of this chapter. A GBM (§32.3) is built to find whatever correlations predict loss, automatically, including ones no human chose. If the historical loss data carries the imprint of past discrimination (because, say, certain neighborhoods were historically under-served, under-maintained, or redlined, Chapter 35's term), then variables correlated with those neighborhoods will be predictive of the recorded losses, and the model will seize on them. ZIP code, certain features engineered from address (§32.5), credit-based scores, even shopping or vehicle data can each act as a partial proxy for race or national origin. The model is not malicious; it has no concept of race. It is doing precisely what it was built to do: find the patterns in the data. But the data encodes history, and history includes discrimination, so the model launders that discrimination into a price that looks, on its face, purely risk-based.
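The mechanism can be seen in a few lines of simulation. What follows is a minimal sketch on purely synthetic data, not a description of any carrier's model: the group flag, the two territories, the 0.8 and 0.2 shares, and the loss means are all arbitrary illustrative parameters chosen for this example, and a per-territory average stands in for the GBM. The point is only that the "model" reads territory alone, yet its prices differ by group.

```python
import random
from statistics import mean

rng = random.Random(0)  # fixed seed so the sketch is repeatable

def simulate_policy():
    # Hypothetical protected-group flag. It is generated only so outcomes can
    # be audited afterwards; the pricing "model" below never reads it.
    group = rng.random() < 0.5
    # Territory is correlated with group (arbitrary illustrative shares).
    territory = "B" if rng.random() < (0.8 if group else 0.2) else "A"
    # Recorded loss depends on territory only, standing in for a history in
    # which territory B was under-served. The means are arbitrary.
    loss = rng.expovariate(1 / (1500 if territory == "B" else 1000))
    return group, territory, loss

policies = [simulate_policy() for _ in range(20000)]

# The "model": average recorded loss per territory. Its only input is territory.
rate = {
    t: mean(loss for _, terr, loss in policies if terr == t)
    for t in ("A", "B")
}

# Audit step: average modeled price per group, using the flag the model never saw.
price_by_group = {
    g: mean(rate[terr] for grp, terr, _ in policies if grp == g)
    for g in (False, True)
}

print("rate by territory:", rate)
print("average price by group:", price_by_group)
```

Nothing in `rate` mentions group, and the rates are accurate for the recorded losses. The gap in `price_by_group` comes entirely from where each group lives in the simulated data.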

The reason this failure is so dangerous is that every diagnostic in §32.6 can come back clean. Lift can be excellent — the proxy genuinely predicts the (historically distorted) losses. The Gini can be high. Out-of-sample validation can pass. The model can be, by every accuracy measure the team thought to check, a triumph. Accuracy and fairness are different questions, and a model optimized purely for the first can fail the second invisibly. This is the sharp edge of the chapter's warning in §32.5: bias in, bias out. A model trained on biased history will reproduce the bias no matter how well it is built — and no amount of predictive power detects the problem, because the problem is not a prediction error. The model predicts the biased data correctly. That is the trouble.
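One way to see why the diagnostics stay clean is to look at what a Gini calculation actually consumes. The sketch below is an illustration under stated assumptions: the data is synthetic, the feature `x` is a hypothetical proxy whose distribution is shifted by group, the claim propensity is an arbitrary logistic function of `x`, and the true propensity is used as a stand-in for a well-fitted model's score. The Gini here is computed as 2·AUC − 1 for claim/no-claim outcomes, which is one common convention and may differ from the exact definition used in §32.6.

```python
import math
import random
from statistics import mean

def gini(actual, score):
    """Gini = 2*AUC - 1 for 0/1 outcomes, with AUC from the rank-sum of positives.

    Note the signature: outcomes and scores only. Group never enters.
    Ties are ignored, which is acceptable here because scores are continuous.
    """
    order = sorted(range(len(score)), key=lambda i: score[i])
    n_pos = sum(actual)
    n_neg = len(actual) - n_pos
    rank_sum = sum(rank for rank, i in enumerate(order, start=1) if actual[i])
    auc = (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
    return 2 * auc - 1

rng = random.Random(1)
groups, scores, claims = [], [], []
for _ in range(20000):
    g = rng.random() < 0.5                  # hypothetical flag, used for audit only
    x = rng.gauss(1.0 if g else 0.0, 1.0)   # hypothetical proxy feature, shifted by group
    p = 1 / (1 + math.exp(-(x - 2.0)))      # recorded claim propensity follows x (arbitrary form)
    groups.append(g)
    scores.append(p)                        # stand-in for a well-fitted model's score
    claims.append(1 if rng.random() < p else 0)

model_gini = gini(claims, scores)
score_by_group = {
    g: mean(s for s, grp in zip(scores, groups) if grp == g)
    for g in (False, True)
}

print("Gini:", round(model_gini, 3))
print("mean score by group:", score_by_group)
```

Because `gini` takes only outcomes and scores, no value it returns can change with how policies are distributed across groups. The disparity only becomes visible in `score_by_group`, a calculation the accuracy review never performs.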

What it shows

First, it shows that "the model never used race" is not a defense — it is barely even relevant. Proxy discrimination operates entirely through legal variables. Removing the protected characteristic from the data does nothing to remove its shadow, which lives in every variable correlated with it. A carrier that believes "we didn't include race, so we can't be discriminating" has fundamentally misunderstood how models work. This is why §32.1's Compliance Corner insists that "the algorithm chose it" is never a regulatory defense, and why several states now require a documented disparate-impact test before a model goes live — a test for the exact failure the carrier's accuracy metrics could not see.

Second, it shows that fairness must be tested for directly, because accuracy will not reveal it. The carrier's validation regime was excellent at the question it asked — does the model predict well? — and silent on the question it didn't — does the model produce fair outcomes? The two require entirely different analyses. A model can ace every diagnostic in §32.6 and still encode a proxy, because lift and Gini measure separation, not justice. The discipline the chapter teaches — validate before you trust — has to be extended: validate for accuracy and test for disparate impact, as separate, non-negotiable steps.
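A separate disparate-impact step might begin by asking whether a price gap between groups survives once a legitimate rating factor is held fixed. The following is a rough sketch on synthetic data, not a regulatory test: the band names, base rates, territory relativities, and group shares are arbitrary assumptions, and the group flag stands in for a label approximated from external data. Which factors count as legitimate, and what size of gap requires action, are legal and regulatory questions this sketch does not settle.

```python
import random
from collections import defaultdict
from statistics import mean

rng = random.Random(2)
BASE = {"low": 800.0, "mid": 1000.0, "high": 1300.0}  # arbitrary rates by legitimate band
TERRITORY_FACTOR = {"A": 1.0, "B": 1.5}               # arbitrary relativity on the suspected proxy

rows = []
for _ in range(12000):
    group = rng.random() < 0.5            # approximated group flag, audit use only
    band = rng.choice(list(BASE))         # legitimate factor, independent of group here
    territory = "B" if rng.random() < (0.8 if group else 0.2) else "A"
    rows.append((group, band, BASE[band] * TERRITORY_FACTOR[territory]))

def price_ratio_by_band(rows):
    """Mean price for group True divided by mean price for group False, per band."""
    cells = defaultdict(list)
    for group, band, price in rows:
        cells[(band, group)].append(price)
    bands = sorted({band for _, band, _ in rows})
    return {b: mean(cells[(b, True)]) / mean(cells[(b, False)]) for b in bands}

ratios = price_ratio_by_band(rows)
for band, ratio in ratios.items():
    print(f"{band}: group price ratio = {ratio:.3f}")
```

A ratio close to 1 in every band would suggest the legitimate factor accounts for any overall gap. Ratios well above 1 within each band, as this simulation is built to produce, indicate that something other than the legitimate factor is driving the difference and deserves a closer look.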

Third, it shows why the human in the loop is irreplaceable for exactly the reasons §32.5 and §32.7 give. A proxy is most often caught not by the algorithm — which is blind to the concept — but by a human who asks why a variable is predictive and whether its predictive power has an illegitimate source. The underwriter and actuary who can look at an engineered feature and say "that's a proxy for protected territory, pull it" are doing work no accuracy metric can do. This is the deepest argument for the modeling triangle (§32.7): the data scientist optimizes the Gini, and someone else must be in the room to ask whether the Gini was bought with a proxy.
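A simple screen can help decide which features the humans in the room should question first. The sketch below, on synthetic data, ranks candidate features by their correlation with an approximated group flag. The feature names `territory_index` and `roof_age` are hypothetical, as are their distributions, and Pearson correlation is just one convenient measure assumed for illustration. A high correlation does not by itself show that a feature's predictive power is illegitimate; it only flags where to ask why.

```python
import math
import random
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length numeric sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(3)
group = []
features = {"territory_index": [], "roof_age": []}  # hypothetical candidate features
for _ in range(10000):
    g = 1 if rng.random() < 0.5 else 0              # approximated group flag, audit only
    group.append(g)
    features["territory_index"].append(rng.gauss(float(g), 1.0))  # shifted by group
    features["roof_age"].append(rng.uniform(0, 30))               # unrelated to group

# Rank features by absolute correlation with the group flag, strongest first.
screen = sorted(
    ((name, pearson(values, group)) for name, values in features.items()),
    key=lambda item: abs(item[1]),
    reverse=True,
)
for name, corr in screen:
    print(f"{name}: correlation with group flag = {corr:+.3f}")
```

The screen produces a review queue, not a verdict. The underwriter and actuary still have to judge whether a flagged feature has an independent risk rationale or is predictive only because it tracks the protected characteristic.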

Outcome

In our composite, the carrier, facing the regulator and its own findings, is required to remediate: identify and remove or constrain the proxy variables, retest for disparate impact, and rebuild its governance to include a fairness review as a standing step, not an afterthought. The model's pure predictive accuracy declines slightly; its defensibility, and its fairness, improve substantially. The carrier learns the hard way the lesson the chapter offers for free: a model you cannot defend on fairness is not deployable, however accurate it is.

The broader, real outcome is the regulatory movement this composite reflects. The NAIC has active work on artificial intelligence and big data in insurance; states including Colorado have enacted requirements (Colorado's SB21-169) aimed squarely at insurers' use of external data and algorithms that could produce unfair discrimination; and "price optimization" — using models to charge what a customer will tolerate rather than what their risk warrants — has been restricted or banned in numerous states. These are all institutional responses to the failure mode this case describes: the recognition that a model's accuracy is not a license, and that fairness must be tested for, governed, and defended. Chapter 35 takes up this entire terrain; this case is the modeling-side preview of it.

Lesson

The lesson is the one an honest treatment of predictive modeling cannot omit: a model can be a triumph of accuracy and a failure of fairness at the same time, and the accuracy will hide the failure. Everything that makes a model powerful — its hunger for predictive correlations, its indifference to why a variable predicts, its ability to find patterns no human chose — is exactly what lets it encode a proxy and launder discrimination into a price that passes every accuracy check. The defenses are not algorithmic. They are human and institutional: test for disparate impact as a separate step from accuracy; interrogate every feature for an illegitimate source; keep an underwriter and actuary in the loop whose job is to ask the why the algorithm cannot; and document it all, because "the model chose it" is a confession, not a defense. For the underwriter, this is the chapter's hardest and most important truth: your judgment is most irreplaceable precisely where the model is most confident — and most blind. The full reckoning is Chapter 35; the warning starts here.

Discussion questions

1. Explain, in your own words, how a model that never sees race can nonetheless discriminate by race. What is the mechanism, and what is the name for the variable that carries it? (§32.5; Ch.35)
2. The carrier's model passed every diagnostic in §32.6 (strong lift, high Gini, clean out-of-sample validation) and still encoded a proxy. Explain why accuracy metrics are structurally incapable of detecting this failure. (§32.6)
3. "We didn't include race, so we can't be discriminating." Dismantle this argument. Why is removing the protected characteristic from the data nearly irrelevant to proxy discrimination? (§32.5)
4. How does this case strengthen the argument for the actuary–underwriter–data-scientist triangle of §32.7? Which corner is best placed to catch a proxy, and why can't the algorithm do it? (§32.7)
5. Contrast this case with Case Study 1 (the GLM revolution). Case Study 1 celebrates a model's accuracy; this one warns about it. How do you hold both lessons at once as a working underwriter, embracing the model's power while refusing to let accuracy stand in for fairness? (§32.2, §32.5, §32.6)
6. Connect to Harbor Steel. The Harbor Steel override (this chapter's Underwriting File) is justified by adding a fact the model lacked. How is that a different kind of human intervention from catching a proxy (one completes the model's information, the other questions its inputs' legitimacy), and why does a good underwriter need to be capable of both? (§32.5, §32.7, The Underwriting File)