Prerequisites
- Chapter 1
- Chapter 3
- Chapter 6
- Chapter 9
- Chapter 10
- Chapter 11
- Chapter 31
Learning Objectives
- Explain how predictive models extend the rating tables of Chapter 11, and state precisely what a model adds and what it cannot supply.
- Describe a generalized linear model for insurance pricing — the Poisson-frequency and gamma-severity structure — and interpret a fitted relativity as a price.
- Contrast a gradient boosting machine with a GLM for risk selection, and judge when the extra predictive power is worth the loss of transparency.
- Explain how neural networks enable image-based underwriting, and name the exposures and failure modes that come with them.
- Evaluate a model with the underwriter's diagnostics — lift, the Gini coefficient, and a backtest — and read what each does and does not prove.
- Locate the actuary, the underwriter, and the data scientist in the model lifecycle, and decide, with documentation, when to override a model's recommendation.
In This Chapter
- Overview
- Learning Paths
- 32.1 From rating tables to predictive models
- 32.2 Generalized linear models: the industry workhorse
- 32.3 Gradient boosting and machine learning for risk selection
- 32.4 Neural networks and image-based underwriting
- 32.5 Feature engineering for insurance data
- 32.6 Model validation, backtesting, lift, and the Gini
- 32.7 The actuary–underwriter–data-scientist triangle (and when to override)
- 🗂️ The Underwriting File
- Conclusion
- Key Terms
- Spaced Review
Chapter 32: Predictive Modeling for Underwriting: GLMs, Machine Learning, and the Algorithms That Price Risk
"All models are wrong, but some are useful." — George E. P. Box, statistician. The line is quoted so often it has become a cliché, but on an underwriting desk it is a working instruction: a model that prices risk is a deliberate simplification of a world too complex to price exactly, and your job is not to ask whether it is right but whether it is useful enough for the decision in front of you — and to know the cases where it is not.
Overview
The submission on your screen has already been scored. Before you opened it, a model read forty characteristics of the risk — the construction, the loss history, the location, a dozen things pulled from third-party data you never typed — and returned a number: a 7 out of 10, decline-leaning. Twenty years ago that number did not exist; you built the price yourself from a rate manual, a loss run, and your judgment. Today the number is the first thing you see, and the temptation is to treat it as the answer rather than as an input. The whole of this chapter is about resisting that temptation intelligently — not by ignoring the model, which would waste a genuinely powerful tool, and not by deferring to it, which would surrender the judgment that is your reason for existing, but by understanding the model well enough to know what it can see, what it cannot, and when the careful reading of a file should overrule a score.
You will not build these models. That is the actuary's and the data scientist's craft, and it takes years. But you must understand them the way a pilot understands the autopilot: well enough to know what it is optimizing, where its assumptions break, what its warning lights mean, and when to take the controls. An underwriter who cannot read a model is, increasingly, an underwriter who cannot defend a price — because the price came from the model, and "the system said so" is not a defense you can give to a broker, a regulator, or your own underwriting committee.
This chapter opens the box. We trace the path from the rating tables of Chapter 11 to the predictive models that now sit on top of them. We meet the generalized linear model — the industry's workhorse, the engine behind most modern personal-lines and small-commercial pricing — and see how it splits a price into a frequency piece and a severity piece. We step up to gradient boosting and ask what the extra predictive power costs in transparency. We look at neural networks reading roof images and satellite tiles. We talk about the unglamorous work that actually decides whether a model is any good — feature engineering — and the diagnostics that tell you whether to trust it: lift, the Gini coefficient, and the backtest. And we end where the book has been heading all along: the moment the model scores Harbor Steel a 7, and you, reading what the model cannot see, write it at a 6.
In this chapter, you will learn to:
- Trace the path from a rating table to a predictive model, and say exactly what the model adds.
- Read a generalized linear model (GLM) as a structured set of relativities — a Poisson frequency model times a gamma severity model — and interpret a coefficient as a price.
- Decide when a gradient boosting machine (GBM) earns its accuracy and when its opacity disqualifies it.
- Recognize what neural networks and image-based underwriting can and cannot do.
- Judge feature engineering — why the inputs, not the algorithm, usually decide the result.
- Validate a model with lift, the Gini, and a backtest, and know what each fails to prove.
- Place yourself in the pricing-model lifecycle and document an override that will survive an audit.
Learning Paths
- 🏠 Personal Lines: GLMs are your world; auto and home are priced by them at scale (§32.2), and the validation diagnostics (§32.6) are how those rate filings are defended to a regulator. Watch how a relativity becomes a number on a declarations page.
- 🏢 Commercial Lines: Models advise but rarely decide on complex commercial risks; your weight is on §32.3 (selection scoring), §32.7 (when to override), and the Harbor Steel file, where a model's "7" meets an underwriter's judgment.
- 📊 Analytics: This is your chapter. Read all of it, then go deeper than it can: the GLM/GBM contrast (§32.2–32.3), feature engineering (§32.5), and the lift/Gini/backtest machinery (§32.6) are your daily tools. Note where the chapter chooses interpretability over a fraction of a Gini point, and why.
- 📜 Certification: Predictive modeling now appears across the AINS, AU, and CPCU material and is central to the data-and-analytics designations. The key terms here (GLM, lift, Gini, model validation) recur; learn them precisely.
32.1 From rating tables to predictive models
Start with what you already know how to do, because a predictive model is not a break from it — it is the same idea, industrialized. In Chapter 11 you built a price the classical way: a base rate for the class, then a set of rating factors (Chapter 11's term — the relativities that turn a characteristic into a multiplier on price), applied one after another. A personal-auto rate might start at a base, multiply by 1.4 for a young driver, by 0.9 for a multi-car discount, by 1.2 for the territory, and so on. Each factor was estimated, historically, by a univariate analysis: take all the young drivers, compare their loss experience to the average, and read off the relativity.
That method has a flaw that took the industry decades to fix, and understanding the flaw is the key to understanding why predictive models exist. The factors are correlated. Young drivers also tend to drive older cars, live in certain territories, and buy lower limits. When you estimate the "young driver" factor by looking only at young drivers, you are unknowingly bundling into it the effect of the cars they drive and the places they live. Apply that factor and the territory factor and the vehicle factor, and you have double-counted the overlap. The classical one-way method cannot disentangle correlated effects — and almost every real rating variable is correlated with several others.
A predictive model solves exactly this. It estimates all the factors simultaneously, holding the others constant, so that each relativity measures the effect of that variable after accounting for every other variable in the model. The young-driver factor becomes "the effect of youth, for two drivers identical in every other respect." That is the single most important thing a model buys you: multivariate estimation, the disentangling of correlated effects that a one-way rate table cannot perform. Everything else — the speed, the scale, the third-party data — is secondary to this one statistical gain.
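The double-counting is easier to believe once you see it in numbers. The sketch below is a constructed illustration, not a production method: the book of business, the exposures, and the "true" relativities (2.0 for young drivers, 1.5 for urban territory) are all hypothetical. It compares one-way relativities with relativities fitted jointly by a simple iterative scheme of the minimum-bias type, which for this multiplicative setup solves the same balance equations as a Poisson model with a log link.

```python
# Constructed example: young drivers are concentrated in urban territory.
# True frequency = 0.05 x 2.0 (young) x 1.5 (urban). All figures are hypothetical.
BASE, TRUE_YOUNG, TRUE_URBAN = 0.05, 2.0, 1.5
ages, territories = ["young", "adult"], ["urban", "rural"]
exposure = {("young", "urban"): 800, ("young", "rural"): 200,
            ("adult", "urban"): 300, ("adult", "rural"): 700}

def true_freq(age, terr):
    return (BASE * (TRUE_YOUNG if age == "young" else 1.0)
                 * (TRUE_URBAN if terr == "urban" else 1.0))

claims = {cell: exposure[cell] * true_freq(*cell) for cell in exposure}

def freq(cells):
    """Claims per unit of exposure across a group of cells."""
    return sum(claims[c] for c in cells) / sum(exposure[c] for c in cells)

# One-way analysis: compare each group to its complement, ignoring the other factor.
one_way_young = (freq([c for c in exposure if c[0] == "young"])
                 / freq([c for c in exposure if c[0] == "adult"]))
one_way_urban = (freq([c for c in exposure if c[1] == "urban"])
                 / freq([c for c in exposure if c[1] == "rural"]))

# Joint fit: alternate between the age factors and the territory factors until
# modeled claims match observed claims for every level of both variables.
a = {age: 1.0 for age in ages}
b = {terr: 1.0 for terr in territories}
for _ in range(200):
    for age in ages:
        a[age] = (sum(claims[(age, t)] for t in territories)
                  / sum(exposure[(age, t)] * b[t] for t in territories))
    for terr in territories:
        b[terr] = (sum(claims[(g, terr)] for g in ages)
                   / sum(exposure[(g, terr)] * a[g] for g in ages))

joint_young = a["young"] / a["adult"]
joint_urban = b["urban"] / b["rural"]

print(f"one-way: young {one_way_young:.2f}, urban {one_way_urban:.2f}")
print(f"joint:   young {joint_young:.2f}, urban {joint_urban:.2f}")
```

In this toy book the one-way factors each absorb part of the other variable's effect, so multiplying them together overstates the young urban driver; the joint fit recovers the relativities the data were built from.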
📋 At the Desk Here is the honest picture of where models sit in the workflow, because the marketing oversells it. A predictive model does not replace the rate manual; in most carriers it produces the rate manual, or a set of relativities the actuaries load into the rating engine, which then prints the number the underwriting workstation (Chapter 31's term) shows you. For high-volume personal and small-commercial lines, the model's output may bind the policy with no human in the loop (the straight-through processing of Chapter 20). For complex commercial lines, the model produces a score or an indicated price that lands on your desk as a recommendation, and you decide. The failure mode at the desk is forgetting which regime you are in — treating a complex-commercial score, built on thin and noisy data, with the same deference you would give a personal-auto model built on ten million policies. The model's authority should scale with the data behind it, and a surprising number of expensive mistakes come from forgetting that.
The transition from tables to models is also a transition in who owns the price. In the rate-manual era, an underwriter could reconstruct a price by hand and argue with any piece of it. In the model era, the price emerges from a fitted object that no single person can recompute in their head. This is a real loss, and the chapter does not pretend otherwise — it is why the diagnostics in §32.6 and the override discipline in §32.7 matter so much. The model gives you a better price on average and a worse ability to explain any single one. The professional response is not to reject the trade but to rebuild, through documentation and validation, the explainability you used to get for free.
⚖️ Compliance Corner Whatever the model does, the rate it produces is still a filed rate, and the rate-regulation rules of Chapter 4 still apply in full. A GLM or a machine-learning model used for pricing in most states must be filed and justified to the regulator like any other rating plan — and regulators increasingly demand to see the variables, not just the output, because a model can launder a prohibited factor into a price through correlation. We treat that danger — proxy discrimination, Chapter 35's term — at length in Chapter 35; flag here only that "the model chose it" is never, by itself, a regulatory defense. A variable that is unfairly discriminatory (Chapter 4's term) does not become acceptable because an algorithm selected it for predictive power. The filing must show the variables are risk-related and permitted, and several states now require a documented test for disparate impact before a model goes live.
32.2 Generalized linear models: the industry workhorse
If you learn one modeling concept from this book, learn this one, because the generalized linear model prices more insurance than any other technique on earth. A generalized linear model (GLM) is a statistical model that relates a set of predictor variables to an outcome through a link function and an assumed error distribution, estimating all the predictors' effects simultaneously. That is a mouthful; unpack it through the underwriting question it answers, which is always: given these characteristics, what loss should we expect, and therefore what should we charge?
The GLM splits that question into two, mirroring the structure of risk you learned in Chapter 6 — frequency and severity (Chapter 6's terms). It fits one model for how often losses occur and a second for how large they are when they do, then multiplies them to get the expected loss, the pure premium (Chapter 10's term) that anchors the price.
- The frequency model predicts the number of claims per exposure. Claim counts are non-negative integers, often mostly zeros, so the GLM uses a Poisson distribution (the standard distribution for counts of rare events) with a log link — meaning the model predicts the logarithm of the expected count, which conveniently turns the relativities into multipliers.
- The severity model predicts the size of a claim given that one occurred. Claim sizes are positive, right-skewed (many small, a few enormous), so the GLM uses a gamma distribution (the standard distribution for positive, skewed amounts), again with a log link.
Multiply the two — expected frequency times expected severity — and you have the modeled pure premium for that risk. This frequency–severity split is not a mathematical nicety; it is underwriting insight. A young driver may have higher frequency but ordinary severity; a luxury vehicle may have ordinary frequency but high severity; a coastal location may drive both. Splitting the model lets you see which dimension of risk a characteristic affects, which is exactly the diagnosis an underwriter wants.
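The arithmetic of the split is short enough to write out. This is a minimal sketch with hypothetical numbers: the base frequency, base severity, and the relativities for each profile are invented for illustration and do not come from any fitted model.

```python
# Hypothetical reference risk.
base_frequency = 0.08     # expected claims per car-year
base_severity = 4500.0    # expected cost per claim, in dollars

def pure_premium(freq_rel, sev_rel):
    """Expected loss per car-year = expected frequency x expected severity."""
    return (base_frequency * freq_rel) * (base_severity * sev_rel)

# Hypothetical relativities from separate frequency and severity models.
profiles = {
    "reference risk": (1.00, 1.00),
    "young driver":   (1.85, 1.00),   # more claims, ordinary claim size
    "luxury vehicle": (1.00, 1.60),   # ordinary claim count, larger claims
}

for name, (freq_rel, sev_rel) in profiles.items():
    print(f"{name:15s} pure premium = {pure_premium(freq_rel, sev_rel):8.2f}")
```

Two profiles can land on similar pure premiums for different reasons, and the split tells you which dimension of risk is driving the number.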
The log link is what makes a GLM feel familiar to anyone who has used a rate table, and it is worth seeing why. Because the model predicts the log of the expected loss, the individual effects add up on the log scale — and adding on the log scale is the same as multiplying on the normal scale. So the fitted model reproduces exactly the multiplicative structure of a classical rate manual: a base rate, times a relativity for each factor. The difference is that the GLM's relativities were all estimated together, each one controlling for the others.
A GLM AS A MULTIPLICATIVE RATE TABLE: personal auto frequency [constructed teaching example]
predicted relative frequency = base × (driver-age factor) × (territory factor) × (vehicle factor) × ...
  base (reference risk)             1.00
  driver age 22 (vs. 45)          × 1.85   ← effect of youth, HOLDING territory & vehicle constant
  territory: dense urban          × 1.30
  vehicle: high-performance       × 1.20
  prior coverage: continuous      × 0.92
  ─────────────────────────────────────
  predicted relative freq ≈ 1.00 × 1.85 × 1.30 × 1.20 × 0.92 ≈ 2.66× the reference risk
Read it: this risk is modeled to have about 2.66 times the claim frequency of the reference driver, with each factor's contribution disentangled from the others. A separate gamma severity model produces a parallel set of severity relativities; frequency × severity = the modeled pure premium.
Read the diagram as an underwriter, not a statistician. Each multiplier is a relativity — a price signal you can interpret, argue with, and defend. The driver-age factor of 1.85 says "youth, by itself, after controlling for where they live and what they drive, multiplies expected frequency by 1.85." That is a statement you can take to a regulator, test against your own loss experience, and override if your book tells a different story. This interpretability is the GLM's great virtue and the reason it remains the workhorse even as flashier methods arrive: a GLM gives you a price you can explain, factor by factor.
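If the claim that "adding on the log scale is multiplying on the normal scale" feels abstract, a few lines of arithmetic make it concrete. The sketch below takes the four relativities from the constructed table, converts each to the log-scale coefficient a GLM would store, sums them, and exponentiates the total. The dictionary keys are labels chosen for illustration.

```python
import math

# Relativities from the constructed teaching table (the base is 1.00).
relativities = {
    "driver age 22 (vs. 45)": 1.85,
    "territory: dense urban": 1.30,
    "vehicle: high-performance": 1.20,
    "prior coverage: continuous": 0.92,
}

# A log-link GLM stores each effect as a coefficient on the log scale.
coefficients = {name: math.log(rel) for name, rel in relativities.items()}

# The model ADDS coefficients to form the linear predictor...
linear_predictor = sum(coefficients.values())
combined_from_logs = math.exp(linear_predictor)

# ...which is the same as MULTIPLYING the relativities directly.
combined_from_product = math.prod(relativities.values())

print(f"exp(sum of coefficients) = {combined_from_logs:.4f}")
print(f"product of relativities  = {combined_from_product:.4f}")
```

Running the multiplication yourself is also the quickest sanity check on any relativity table you are handed.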
📋 At the Desk When the actuaries deliver a GLM, the artifact you actually work with is a table of relativities and a few diagnostics. You should be able to do three things with it without help. First, sanity-check the signs: does every relativity point the way underwriting intuition says it should? A factor that makes a clearly worse risk cheaper is a red flag — usually a sign of correlation with something not in the model, or of a data error. Second, find the big movers: which two or three factors swing the price most? Those are the ones a broker will push back on and the ones you must be able to justify. Third, spot the thin cells: a relativity estimated from a handful of claims is noise wearing the costume of signal, and the credibility lessons of Chapter 10 apply to a model coefficient exactly as they apply to a single risk's loss run. A GLM does not repeal credibility theory; it just hides the small samples inside a coefficient where you have to go looking for them.
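The three desk checks can be sketched as a small script over a relativity table. Everything here is hypothetical: the factors, the claim counts, the `expected_direction` column (your own underwriting intuition, written down), and the `MIN_CLAIMS` cutoff, which is an arbitrary illustration and not a credibility standard.

```python
import math

# Hypothetical relativity table, as an actuary might deliver it.
table = [
    {"factor": "driver age 22", "relativity": 1.85, "claims": 4200, "expected_direction": "up"},
    {"factor": "territory: dense urban", "relativity": 1.30, "claims": 9800, "expected_direction": "up"},
    {"factor": "prior coverage: continuous", "relativity": 0.92, "claims": 15000, "expected_direction": "down"},
    {"factor": "vehicle: antique", "relativity": 0.55, "claims": 12, "expected_direction": "down"},
    {"factor": "prior at-fault accident", "relativity": 0.90, "claims": 3100, "expected_direction": "up"},
]
MIN_CLAIMS = 100  # hypothetical thin-cell cutoff, chosen only for illustration

# 1. Sanity-check the signs: does each relativity point the way intuition says?
wrong_sign = [row["factor"] for row in table
              if (row["relativity"] > 1.0) != (row["expected_direction"] == "up")]

# 2. Find the big movers: rank by distance from 1.00 on the log scale,
#    so a 0.50 and a 2.00 count as equally large swings.
ranked = sorted(table, key=lambda row: abs(math.log(row["relativity"])), reverse=True)
big_movers = [row["factor"] for row in ranked[:3]]

# 3. Spot the thin cells: relativities resting on very few claims.
thin_cells = [row["factor"] for row in table if row["claims"] < MIN_CLAIMS]

print("wrong sign:", wrong_sign)
print("big movers:", big_movers)
print("thin cells:", thin_cells)
```

Notice that one of the biggest movers in this toy table is also a thin cell, which is exactly the combination that deserves a question before it reaches a broker.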
Here is a compact, illustrative sketch of what fitting a frequency GLM looks like in code — not so you can build one, but so the words "Poisson" and "log link" attach to something concrete. Read the comments; skip the syntax if you like.
# Illustrative ONLY: a frequency GLM for auto, Poisson family, log link.
# This is the shape of the actuary's work, shown so you can read their output.
import numpy as np
import statsmodels.api as sm
import statsmodels.formula.api as smf

# `policies` is assumed to be a pandas DataFrame with one row per policy (it is not built here).
# claims = observed claim COUNT per policy; exposure = car-years (the exposure unit, Ch.6)
# We model the RATE of claims by including log(exposure) as an "offset".
model = smf.glm(
    formula="claims ~ C(driver_age_band) + C(territory) + C(vehicle_symbol) + prior_coverage",
    data=policies,
    family=sm.families.Poisson(link=sm.families.links.Log()),  # log is also the Poisson default
    offset=np.log(policies["exposure"]),
).fit()

# The fitted coefficients are on the LOG scale; exponentiate to read them as relativities.
relativities = np.exp(model.params)  # e.g. the age-22 band -> ~1.85 (a multiplier on frequency)
print(relativities)
How to read that output: every number in relativities is a multiplier, just like the diagram. The offset term (log(exposure)) is the trick that turns a count model into a rate model: it tells the GLM that a policy in force for two car-years has had twice the opportunity to produce a claim as one in force for one, so the model predicts claims per unit of exposure, which is what you price on. A separate gamma model on the claim amounts gives the severity relativities. You will never type this; you will, regularly, read its results and be asked whether you believe them.
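The offset can be checked by hand without fitting anything. This is a minimal sketch assuming a hypothetical base rate of 0.05 claims per car-year and the 1.85 young-driver relativity from the teaching example; `expected_claims` is an illustrative helper, not a library function.

```python
import math

def expected_claims(exposure_years, coefficients):
    """Log-link count model: log(mu) = log(exposure) + sum of coefficients."""
    linear_predictor = math.log(exposure_years) + sum(coefficients)
    return math.exp(linear_predictor)

intercept = math.log(0.05)     # hypothetical base rate: 0.05 claims per car-year
young_driver = math.log(1.85)  # coefficient for the age-22 band, on the log scale

one_year = expected_claims(1.0, [intercept, young_driver])
two_years = expected_claims(2.0, [intercept, young_driver])

print(f"expected claims, 1 car-year : {one_year:.4f}")
print(f"expected claims, 2 car-years: {two_years:.4f}")
print(f"claims per car-year (rate)  : {two_years / 2.0:.4f}")
```

Doubling the exposure doubles the expected count and leaves the rate unchanged, which is all the offset is doing.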
The GLM's limits are as important as its powers, and an underwriter who sells it without them has failed this book's first instruction. A GLM assumes the multiplicative structure is roughly correct: that factors multiply rather than interact in more complex ways. When two variables genuinely interact (the effect of a sports car depends on the driver's age, say), the modeler must add that interaction by hand; the GLM will not discover it. A GLM also assumes the modeler chose the right variables and the right form for each; it is only as good as the feature engineering behind it (§32.5). And like every model, it is fitted to the past and assumes the future resembles it, an assumption that climate change, social inflation, and the nuclear-verdict trend (Chapter 23's term) are quietly breaking in line after line. The GLM is a superb tool for a stable, data-rich, multiplicatively structured problem. The further a risk departs from that ideal (novel, thin-data, regime-changing), the more the model becomes a starting point for judgment rather than a substitute for it.
32.3 Gradient boosting and machine learning for risk selection
The GLM's one real weakness — that it cannot find interactions and nonlinearities on its own — is exactly what the next family of models is built to fix, and the trade it offers is the central modeling decision of modern underwriting. A gradient boosting machine (GBM) is a machine-learning method that builds a prediction by combining hundreds or thousands of small decision trees, each one trained to correct the errors of the ones before it. You do not need the mathematics; you need the intuition and the trade-off.
The intuition: a single decision tree splits the data with simple yes/no questions ("is the roof over 25 years old? is the building in a coastal county? is the prior loss count above two?") and makes a prediction in each resulting bucket. One tree is weak and crude. But if you build a tree, look at where it was wrong, build a second tree that focuses on those errors, then a third focused on the remaining errors, and continue for a thousand rounds, the ensemble becomes extraordinarily accurate. Crucially, because trees ask questions in sequence ("if coastal and old roof and prior losses…"), a GBM discovers interactions and nonlinearities automatically — the very things a GLM must be told about by hand. For risk selection — ranking risks from best to worst so the good ones can be accepted and the bad ones declined or surcharged — a well-built GBM will almost always out-predict a GLM.
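The "build a tree, look at where it was wrong, build another" loop can be sketched in a few dozen lines. This is a deliberately stripped-down illustration under stated assumptions: the data (roof age against loss frequency) are hypothetical, the loss is squared error, and each "tree" is a single yes/no split (a stump). Real GBMs grow deeper trees, which is what lets them capture interactions; stumps on one variable show only the error-correcting mechanics and the nonlinearity.

```python
# Hypothetical data: roof age in years and observed loss frequency.
ages = [2, 5, 8, 12, 15, 18, 22, 26, 30, 35]
losses = [0.02, 0.02, 0.03, 0.03, 0.04, 0.06, 0.10, 0.15, 0.22, 0.30]

def mse(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def fit_stump(xs, residuals):
    """Find the single yes/no split on x that best fits the current errors."""
    best = None
    for threshold in sorted(set(xs))[:-1]:          # keep both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean

def boost(xs, ys, rounds=50, learning_rate=0.3):
    base = sum(ys) / len(ys)                        # start from the overall average
    stumps, preds = [], [base] * len(ys)
    history = [mse(ys, preds)]
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # where are we still wrong?
        stump = fit_stump(xs, residuals)                 # a small tree aimed at those errors
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(mse(ys, preds))

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)
    return predict, history

predict, mse_history = boost(ages, losses)
print(f"training error before boosting: {mse_history[0]:.6f}")
print(f"training error after boosting:  {mse_history[-1]:.6f}")
```

The falling error here is training error only. It shows the mechanism working; it says nothing yet about whether the model will hold up on risks it has not seen.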
So why is the GLM still the workhorse for pricing? Because of what the GBM costs you, and the cost is not small.
GLM vs. GBM: the underwriter's trade-off [constructed teaching example]
                       GLM                              GBM (gradient boosting)
predictive power       good                             often better, esp. with rich data
interpretability       HIGH: read every relativity      LOW: a "black box" of 1000s of trees
finds interactions     only if you add them by hand     automatically
regulatory filing      well-understood, accepted        harder to justify; some states resist
stability over time    transparent, easy to monitor     can be brittle; needs careful monitoring
best used for          PRICING (the filed rate)         SELECTION, triage, flagging, tie-breaking
The rule of thumb: GLM where you must EXPLAIN the price; GBM where you must RANK the risk.
Many carriers use both: a GBM to triage and flag, a GLM to set the filed rate.
Read that table as a decision, not a description. The GBM's accuracy is real and worth having. But a pricing model must be filed, explained, and defended — to regulators, to brokers, to your own governance — and a thousand-tree ensemble cannot be explained the way a relativity table can. You cannot tell a regulator "the model charged this insured more because, after accounting for the interaction between trees 412 and 887, the gradient pointed up." For selection — deciding which submissions to look at first, which to fast-track, which to flag for a human — the explainability bar is lower and the GBM's power is pure upside. For the filed price, most carriers still reach for the GLM, sometimes informed by what the GBM revealed.
🤖 Model vs. Judgment The honest tension is this: the more accurate model is the harder one to question, and that is precisely backwards from what an underwriter wants. With a GLM, when your judgment disagrees with the price, you can locate the factor you disagree with and argue it. With a GBM, the score arrives as a single number with no factor to grab — and the temptation is either to defer to it (because it is "more accurate") or to ignore it (because you can't see inside it). Both are failures. The discipline is to use the GBM's score as one input — a strong, well-validated opinion from a colleague who has read a million files and cannot explain their reasoning — and to weight it against what you can see that it cannot. A GBM that flags a risk you know to be improving (the new plant manager, the signed roof contract) is not wrong; it is uninformed about the very facts that change the answer. Your job is to supply those facts and document that you did.
There is a family of tools — interpretability methods, with names like SHAP values and partial dependence plots — that pry open a GBM after the fact, attributing each prediction to its contributing features. They help, and a good analytics team will hand you a "why" alongside the score. But they are approximations of a model that remains fundamentally opaque, and they can themselves mislead when features are correlated. Treat them as a flashlight in a dark room, not a window: better than nothing, not the same as seeing.
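To see what one of these flashlights actually computes, here is a minimal sketch of a partial dependence calculation. The `model` function is a hypothetical stand-in for an opaque fitted model, and the four-risk portfolio is invented; a real analytics team would use a library implementation on the real model.

```python
def model(roof_age, coastal):
    """Stand-in for an opaque fitted model (hypothetical scoring rule)."""
    return 0.02 + 0.004 * roof_age + (0.10 if coastal and roof_age > 20 else 0.0)

# Hypothetical portfolio of risks.
portfolio = [
    {"roof_age": 5, "coastal": False},
    {"roof_age": 12, "coastal": True},
    {"roof_age": 25, "coastal": False},
    {"roof_age": 30, "coastal": True},
]

def partial_dependence(feature, value):
    """Average prediction when `feature` is forced to `value` for every risk."""
    total = 0.0
    for risk in portfolio:
        altered = dict(risk, **{feature: value})   # copy the risk, overwrite one field
        total += model(altered["roof_age"], altered["coastal"])
    return total / len(portfolio)

pd_curve = {age: partial_dependence("roof_age", age) for age in (10, 20, 30)}
for age, avg_score in pd_curve.items():
    print(f"roof age {age:2d}: average predicted frequency {avg_score:.3f}")
```

The method forces every risk in the portfolio to the same roof age, including combinations that may never occur in the real book. That is one way these tools can mislead when features are correlated.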
⚠️ Underwriting Trap The seductive trap of machine learning is overfitting — building a model so flexible that it memorizes the noise in the historical data rather than the signal, then performs beautifully on the past and badly on the future. A GBM with too many trees, or trees grown too deep, will fit the training data almost perfectly and fail on next year's risks, because it learned the accidents of the sample, not the structure of the risk. The losses from an overfit pricing model arrive exactly the way the losses from a soft-market underpricing arrive (Chapter 11's theme) — two and three years later, when the risks the model loved turn out to have been chosen for noise. The defense is the validation discipline of §32.6: never judge a model on the data it was trained on, only on data it has never seen. An underwriter who hears "the model is 98% accurate on our historical book" should ask the one question that matters: on data it was not trained on? If the answer is unclear, the number is worthless.
32.4 Neural networks and image-based underwriting
Above the GBM, in flexibility and in opacity both, sits the neural network — the technology behind the image recognition, language models, and pattern-finding that have made "AI" a headline word. For most tabular insurance pricing (rows of policies, columns of characteristics), neural networks rarely beat a well-tuned GBM and are far harder to interpret, so they are not the pricing tool of choice. Where they change the game is with data that has no natural rows and columns: images, satellite tiles, text, documents. That is a genuinely new exposure for underwriting, and it deserves a careful look.
Consider the underwriting question that image models answer. In Chapter 31 you saw third-party data flood the submission — including aerial and satellite imagery of the risk. But a photograph is just pixels until something interprets it. A neural network trained on millions of labeled roof images can look at an aerial photo of a building and estimate the roof's material, its condition, its age, the presence of tarps or patches or ponding — the very COPE observations (Chapter 9's term) an inspector would record, produced in seconds, at scale, for every property in a portfolio. The same technology reads property photos for condition, satellite imagery for wildfire fuel and defensible space, and dashcam footage for driving behavior. This is image-based underwriting, and it is one of the genuinely transformative applications of machine learning in the field.
📄 Read the Submission
FIGURE 32.1: "What the model sees in the roof" [constructed teaching example]
THE SUBMISSION: An aerial-image model scores a commercial roof from a recent satellite tile, as part of pre-filling a property submission.
THE CONTEXT: The model returns: roof material = built-up/modified bitumen; estimated age = 25+ yrs; condition flags = ponding in two areas, one visible patch; no tarps. Confidence: high.
WHAT IT SHOWS: A defensible, scalable corroboration of an aging, end-of-life roof, exactly the kind of observation that used to require an inspection and weeks of wait.
WHAT IT DOESN'T: It does not know the roof is under a signed replacement contract; it cannot see the interior, the deck condition, or whether the ponding has caused unseen damage. The image is a snapshot, possibly months old, of the outside only.
THE DECISION: Use the image to CORROBORATE the manual read and to price the roof as aging, but require the inspection and the warranty before treating the roof risk as resolved.
THE LESSON: An image model extends the inspector's reach; it does not replace the inspector's judgment, the interior, or the documents that change the risk.
That figure is the whole lesson of image-based underwriting in one block. The model genuinely sees something — and genuinely scales the seeing to an entire portfolio in a way no inspection force ever could. But it sees the outside, in a snapshot, with no knowledge of context or documents. For Harbor Steel, an image model will confirm what you already suspect about the thirty-year-old roof. What it will never see is the signed roof-replacement contract the broker has attached — the single fact that turns the roof from a decline-driver into a managed, time-limited subjectivity. Hold that gap; it is the heart of the override this chapter is building toward.
⚠️ Underwriting Trap Neural networks fail in ways that are alien — failure modes a human inspector would never produce, which makes them hard to anticipate. An image model can be confidently wrong: fooled by shadows, by the angle of the sun, by a tarp it reads as a roof material, by a stitched-together satellite tile months out of date. It can carry the biases of its training data — if it was trained mostly on suburban single-family roofs, it may be unreliable on industrial or rural structures, and you will not be told this in the score. And because the output looks authoritative — a clean number, high confidence — it invites exactly the over-trust it has not earned. The disciplined posture is to treat an image model's output as a strong lead requiring confirmation, never as a finding of fact, and to insist that high-consequence decisions (declines, large surcharges) rest on something a human can verify.
The deeper point, which Chapter 36 will carry forward, is that image and sensor models change what data underwriting runs on more than they change the judgment at its core. A roof score, a wildfire-fuel score, a telematics-derived driving score — each is a new, richer input. None of them decides the case. The underwriter who treats them as inputs to a judgment will be more powerful than the one who had only the application; the underwriter who treats them as the judgment itself will be automated, and will deserve to be, right up until the snapshot is wrong and there is no one who looked.
32.5 Feature engineering for insurance data
Here is the secret the algorithm-obsessed coverage of "AI in insurance" almost always misses: the model is rarely what decides the result. The inputs are. Feature engineering is the work of constructing, transforming, and selecting the input variables — the features — that a model learns from, and it is where most of the real predictive power, and most of the real danger, actually live. A mediocre algorithm on excellent features beats a brilliant algorithm on raw, poorly-constructed ones almost every time.
Understand what a "feature" is by example. The raw data might contain a date of construction and a date of loss. Neither, raw, is very predictive. But engineer them — compute "building age at time of loss," "years since last claim," "claim count in the trailing five years," "roof age relative to expected roof life" — and you have created features that carry real signal. Feature engineering is the translation of domain knowledge into model inputs: it is precisely where the underwriter's expertise enters the modeling process, and the single best reason an underwriter should be in the room when a model is built.
RAW DATA → ENGINEERED FEATURES: what the underwriter knows, made machine-readable [illustrative]
raw fields the data has           engineered features the model should learn from
─────────────────────────────     ──────────────────────────────────────────────
year built, today's date       →  building age; age relative to typical roof life
list of prior claim dates      →  claim FREQUENCY (count / exposure); years since last
list of prior claim amounts    →  claim SEVERITY (avg, max); large-loss flag (> threshold)
street address                 →  distance to coast; distance to fire station (the PPC idea)
industry / class code          →  hot-work hazard flag; products-liability exposure flag
payroll, revenue               →  exposure base (Ch.21); loss ratio per $1M revenue
The underwriter's job in the room: name the features that capture how the risk ACTUALLY behaves (the things you would look for in a loss run or an inspection) so the model can learn them.
Notice what that table is really doing: it is turning the contents of this book — frequency and severity, COPE, exposure bases, hot-work hazard, loss-run reading — into columns a model can consume. That is why feature engineering is the underwriter's natural contribution to the analytics team. A data scientist knows how to fit the model; an underwriter knows that "two fires in five years" is a frequency signal, that "\$1.2M then \$180K" is a severity story, that a thirty-year-old roof in a windstorm zone is an interaction, and that hot-work is the hazard a class code only hints at. Those insights, encoded as features, are what make a model underwrite rather than merely correlate.
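A few rows of that translation can be sketched directly. The record below is hypothetical, as are the field names, the `LARGE_LOSS_THRESHOLD`, and the `TYPICAL_ROOF_LIFE_YEARS` assumption; the point is only to show raw dates and amounts becoming the quantities an underwriter would look for.

```python
from datetime import date

LARGE_LOSS_THRESHOLD = 250_000   # hypothetical cutoff for a "large loss"
TYPICAL_ROOF_LIFE_YEARS = 25     # hypothetical expected roof life

def engineer_features(raw, as_of):
    """Turn raw policy fields into features a model can learn from."""
    claims = sorted(raw["claims"])                       # (loss_date, amount), oldest first
    trailing = [c for c in claims if 0 <= (as_of - c[0]).days <= 5 * 365]
    amounts = [amount for _, amount in claims]
    return {
        "building_age": as_of.year - raw["year_built"],
        "roof_age_vs_life": (as_of.year - raw["roof_year"]) / TYPICAL_ROOF_LIFE_YEARS,
        "claims_trailing_5y": len(trailing),
        "years_since_last_claim": (as_of - claims[-1][0]).days / 365.25 if claims else None,
        "avg_severity": sum(amounts) / len(amounts) if amounts else 0.0,
        "max_severity": max(amounts, default=0),
        "large_loss_flag": any(a > LARGE_LOSS_THRESHOLD for a in amounts),
    }

# Hypothetical raw record: two claims, $1.2M then $180K, on a 1995 building.
raw_record = {
    "year_built": 1995,
    "roof_year": 1995,
    "claims": [(date(2021, 3, 14), 1_200_000), (date(2023, 9, 2), 180_000)],
}

features = engineer_features(raw_record, as_of=date(2025, 1, 1))
for name, value in features.items():
    print(f"{name:24s} {value}")
```

Every threshold in a sketch like this is an underwriting decision in disguise, which is why the person who reads loss runs should be in the room when those thresholds are chosen.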
📋 At the Desk A practical way to earn your seat in the modeling room: when the analytics team shows you a model's important variables, do not just check the signs (§32.2); propose the features they're missing. You have read thousands of loss runs; you know which patterns predict trouble and which look scary but don't. If a commercial-property model has no feature for "trailing large-loss flag" or no interaction between "roof age" and "windstorm zone," say so. The most valuable thing an underwriter brings to a model is not a critique of the algorithm (the data scientists own that) but the domain knowledge that turns raw fields into features the model can actually learn from. This is collaboration, not oversight, and it is how the actuary–underwriter–data-scientist triangle of §32.7 produces a model better than any one corner could build alone.
Feature engineering is also where the chapter's ethical fault line first opens, because the choice of features is a choice about fairness. A feature like "distance to coast" is risk-related and defensible. A feature like "credit-based insurance score" (Chapter 8's term) is predictive but contested, restricted in some states, and the subject of a live fairness debate. A feature that is a proxy for a protected class — ZIP code standing in for race, say — can inject unfair discrimination into a model that never names the protected class at all. We hold that thread for Chapter 35, which is built around it; flag here only that the most consequential decisions in a pricing model are often made not by the algorithm but by the humans choosing which features it is allowed to see. Garbage in, garbage out is the old saying; bias in, bias out is the modern, sharper version, and feature engineering is where you stop it or let it through.
32.6 Model validation, backtesting, lift, and the Gini
A model that has not been validated is not a tool; it is a rumor. Model validation is the process of testing whether a model actually predicts well on data it did not learn from — and backtesting is validation against historical data, checking how the model would have performed had it been in force. This is the discipline that separates a model you can stake a price on from one that merely flatters the past, and an underwriter must understand it well enough to ask the right questions, because the validation results are the entire basis for trusting the score on your screen.
The cardinal rule, from which everything else follows: never judge a model on the data it was trained on. A model has, by construction, seen its training data and bent itself to fit it; its performance there is meaningless as a guide to the future. The honest test is out-of-sample — the modeler holds back a portion of the data (a test set, or better, a holdout from a later time period), fits the model on the rest, and measures performance on the part it never saw. A model that performs well in training and badly out-of-sample is overfit (§32.3) and dangerous. A model that performs comparably on both has likely learned signal, not noise. When you are told a model's accuracy, the only number worth hearing is the out-of-sample one.
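A minimal sketch of that out-of-sample discipline, assuming a toy book and a deliberately crude stand-in "model" (the average loss per class code). The function names, the field names, and the six rows of data are all invented for illustration; a real validation would use the carrier's own model and a far larger holdout.

```python
from collections import defaultdict


def split_by_year(rows, first_holdout_year):
    """Train on earlier years, hold out the later period the model never sees."""
    train = [r for r in rows if r["year"] < first_holdout_year]
    holdout = [r for r in rows if r["year"] >= first_holdout_year]
    return train, holdout


def fit_class_means(train):
    """Stand-in model: predict the average training loss for each class code."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in train:
        sums[r["cls"]] += r["loss"]
        counts[r["cls"]] += 1
    return {c: sums[c] / counts[c] for c in sums}


def mean_abs_error(rows, class_means):
    """Average absolute gap between predicted and actual loss."""
    return sum(abs(class_means[r["cls"]] - r["loss"]) for r in rows) / len(rows)


# Constructed teaching data, not real experience.
rows = [
    {"year": 2020, "cls": "A", "loss": 100},
    {"year": 2020, "cls": "B", "loss": 300},
    {"year": 2021, "cls": "A", "loss": 140},
    {"year": 2021, "cls": "B", "loss": 260},
    {"year": 2022, "cls": "A", "loss": 180},
    {"year": 2022, "cls": "B", "loss": 400},
]

train, holdout = split_by_year(rows, first_holdout_year=2022)
model = fit_class_means(train)
train_error = mean_abs_error(train, model)      # flattering: the model saw this
holdout_error = mean_abs_error(holdout, model)  # the number worth hearing
```

In this constructed example the error on the held-out year is several times the training error, which is the pattern that should make you ask whether the model learned signal or merely memorized its past.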
Now the two diagnostics an underwriter actually needs to read, because they are the ones that appear in filings, governance decks, and the conversation about whether to trust the model. Together they go by lift/Gini.
Lift measures how well a model separates good risks from bad. The test is intuitive: take the book, sort every risk by its model-predicted loss cost from best to worst, cut the sorted list into ten equal groups (deciles), and compare the actual loss experience of the best decile to the worst. A model with strong lift will show the worst decile running at many times the loss ratio of the best — it has genuinely sorted the risks. A model with no lift shows every decile running about the same — it has sorted nothing.
A LIFT CHART — actual loss ratio by model-predicted decile [constructed teaching example]
predicted-best decile 1 ██░░░░░░░░░░░░░░░░░░ ~45% loss ratio
decile 2 ███░░░░░░░░░░░░░░░░░ ~58%
decile 3 ████░░░░░░░░░░░░░░░░ ~66%
decile 4 █████░░░░░░░░░░░░░░░ ~74%
decile 5 ██████░░░░░░░░░░░░░░ ~82%
decile 6 ███████░░░░░░░░░░░░░ ~90%
decile 7 █████████░░░░░░░░░░░ ~104%
decile 8 ███████████░░░░░░░░░ ~120%
decile 9 █████████████░░░░░░░ ~145%
predicted-worst decile 10 █████████████████░░░ ~190%
Read it: the model SEPARATES. The risks it predicted would be worst ran at ~190% loss ratio; the ones
it predicted best ran at ~45%. A monotone climb like this is what "good lift" looks like — the model
has sorted risk. A FLAT chart (every decile near 100%) would mean the model predicts nothing useful.
Read the lift chart as the single most useful picture of a model an underwriter ever sees. It answers the question that matters: does this model actually tell good risks from bad? If the worst decile runs at 190% and the best at 45%, the model has real discriminating power, and accepting the good deciles while declining or surcharging the worst will improve your book's combined ratio (Chapter 3's term — the number that tells the truth). If the chart is flat, the model is decoration. Notice that lift says nothing about whether the price level is right — a model can sort risks perfectly and still be priced too low across the board — which is why lift is necessary but not sufficient. It proves separation, not adequacy.
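The sort-cut-compare procedure behind a lift chart can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a production diagnostic: the function name and the record fields are hypothetical, and the data is built to reproduce the constructed teaching chart above rather than drawn from any real book.

```python
def loss_ratio_by_decile(risks, n_bins=10):
    """Sort risks by predicted loss cost, cut into equal groups,
    and return each group's ACTUAL loss ratio, best-predicted first."""
    ranked = sorted(risks, key=lambda r: r["predicted"])
    n = len(ranked)
    ratios = []
    for b in range(n_bins):
        group = ranked[b * n // n_bins:(b + 1) * n // n_bins]
        losses = sum(r["loss"] for r in group)
        premium = sum(r["premium"] for r in group)
        ratios.append(losses / premium)
    return ratios


# Rebuild the teaching chart: ten risks per decile, premium 100 each,
# actual loss equal to that decile's loss-ratio percentage.
chart_pct = [45, 58, 66, 74, 82, 90, 104, 120, 145, 190]
risks = [
    {"predicted": pct + j * 0.01, "loss": pct, "premium": 100}
    for pct in chart_pct
    for j in range(10)
]
risks.reverse()  # input order should not matter; the function sorts

ratios = loss_ratio_by_decile(risks)
worst_to_best = ratios[-1] / ratios[0]  # how many times worse decile 10 runs
```

Note what the function never looks at: whether the overall rate level is adequate. It ranks and compares, nothing more, which is exactly the limit on lift described above.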
The Gini coefficient compresses that whole chart into a single number between 0 and 1: roughly, how unequally the model concentrates losses across the sorted risks. A Gini of 0 means the model sorts no better than random; a Gini near 1 would mean near-perfect separation. There is no universal "good" Gini — it depends on the line, the data, and what the incumbent model achieves — but it gives a single, comparable figure for "is model B better than model A?" When two models are compared, the higher Gini usually wins, provided the gain is real out-of-sample and not bought at an unacceptable cost in interpretability or fairness.
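One common way to compute that number, sketched here under assumptions: sort the risks from predicted-best to predicted-worst, trace the cumulative share of actual losses, and measure how far that curve bows below the diagonal. The function below is a hypothetical helper, it assumes equal exposure per risk, and it returns the raw (unnormalized) figure; many practitioners instead divide by the Gini of a perfect ordering, which this sketch does not do.

```python
def gini(predicted, actual):
    """Raw Gini of actual losses when risks are ordered by predicted loss.

    Assumes equal exposure per risk and a positive total of actual losses.
    0 means the ordering is no better than random; larger is better;
    a negative value means the model ranks risks backwards.
    """
    order = sorted(range(len(predicted)), key=lambda i: predicted[i])
    total = sum(actual)
    n = len(actual)
    cumulative = 0.0
    area = 0.0
    for i in order:
        previous = cumulative
        cumulative += actual[i] / total
        # Trapezoid under the cumulative-loss curve for this slice of the book.
        area += (previous + cumulative) / 2 / n
    return 1 - 2 * area


# Constructed examples: four risks, all of the loss on one of them.
good_model = gini(predicted=[1, 2, 3, 4], actual=[0, 0, 0, 10])
no_signal = gini(predicted=[1, 2, 3, 4], actual=[5, 5, 5, 5])
backwards = gini(predicted=[4, 3, 2, 1], actual=[0, 0, 0, 10])
```

The model that puts the one large loss at the bottom of its ranking scores well above zero, the book with identical losses scores zero, and the model that ranks the loss-maker as its best risk scores below zero.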
📋 At the Desk The four questions to ask whenever a model is put in front of you, in order. One: out-of-sample? Was the lift/Gini measured on data the model never saw — ideally a later time period, since insurance regimes drift? Two: does the lift hold across segments? A model with great overall lift can have no lift, or reverse lift, in an important sub-book (a state, an industry, a size band) — and that sub-book is where it will quietly lose money. Three: is it stable? Does the lift persist when the model is refit on a different year, or does it swing? Four: what's the price level? Lift proves the model ranks risk; it does not prove the overall rate is adequate (Chapter 11's discipline) — that is a separate check against your loss ratios. A model that aces all four is one you can build a price on. A model that aces only the first is a science-fair project.
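Question two, whether the lift holds across segments, is easy to sketch. The version below is a simplified illustration, not a standard diagnostic: it splits each segment into a predicted-better half and a predicted-worse half rather than deciles, it assumes every half has some premium and some loss, and the `min_ratio` cutoff of 1.2 is an arbitrary assumption chosen for the example.

```python
from collections import defaultdict


def segment_lift(risks, min_ratio=1.2):
    """For each segment, compare the actual loss ratio of the predicted-worse
    half to the predicted-better half and flag segments with little or no lift."""

    def loss_ratio(group):
        return sum(r["loss"] for r in group) / sum(r["premium"] for r in group)

    by_segment = defaultdict(list)
    for r in risks:
        by_segment[r["segment"]].append(r)

    report = {}
    for segment, rows in by_segment.items():
        ranked = sorted(rows, key=lambda r: r["predicted"])
        half = len(ranked) // 2
        ratio = loss_ratio(ranked[half:]) / loss_ratio(ranked[:half])
        report[segment] = {"ratio": ratio, "has_lift": ratio >= min_ratio}
    return report


# Constructed teaching data: the model sorts inland risks well, coastal not at all.
book = [
    {"segment": "inland", "predicted": 1, "loss": 40, "premium": 100},
    {"segment": "inland", "predicted": 2, "loss": 60, "premium": 100},
    {"segment": "inland", "predicted": 3, "loss": 110, "premium": 100},
    {"segment": "inland", "predicted": 4, "loss": 140, "premium": 100},
    {"segment": "coastal", "predicted": 1, "loss": 90, "premium": 100},
    {"segment": "coastal", "predicted": 2, "loss": 110, "premium": 100},
    {"segment": "coastal", "predicted": 3, "loss": 95, "premium": 100},
    {"segment": "coastal", "predicted": 4, "loss": 105, "premium": 100},
]

report = segment_lift(book)
```

Here the inland half-to-half ratio is 2.5 while the coastal ratio is 1.0: a model that would look respectable on the pooled book has sorted nothing in the coastal segment.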
🔍 Check Your Understanding
1. A vendor reports their model is "94% accurate." Why is that number, by itself, nearly useless to you, and what two things must you ask before it means anything? (§32.6)
2. A pricing model shows excellent lift overall but, when cut by region, shows flat lift in your largest coastal state. What does that tell you, and why is it dangerous to deploy the model statewide on the strength of the overall number? (§32.6)
3. Explain in one sentence why strong lift does not, by itself, mean the model's prices are adequate. (§32.6, Ch.11)
32.7 The actuary–underwriter–data-scientist triangle (and when to override)
We arrive at the human question the whole chapter has been circling: in a world where a model prices the risk, who decides, and when does a person overrule the algorithm? The answer is not a person but a triangle — three roles, each owning part of the truth, that a modern carrier must hold in balance.
THE MODELING TRIANGLE — three roles, one price [schematic]
                         ACTUARY
                  (owns the rate level,
                  credibility, the filed
                  plan, rate adequacy)
                          /  \
                         /    \
                        /      \
    DATA SCIENTIST ──────────────── UNDERWRITER
    (owns model build,              (owns the risk decision,
    features, validation,           the context the model can't
    the algorithm)                  see, the override + its defense)
The model is best when all three corners contribute: the data scientist builds it, the actuary owns
its rate level and filing, and the underwriter supplies the domain knowledge going in and the judgment
coming out. No single corner should own the price alone.
Read the triangle as a division of authority. The data scientist builds the model — the algorithm, the features (with the underwriter's help, §32.5), the validation (§32.6). The actuary owns the rate level and the filing: whether the price, in aggregate, is adequate (Chapter 11's discipline) and defensible to the regulator. The underwriter owns the risk decision: whether to accept this risk, on what terms, at what price — informed by the model but not dictated by it. The model is at its best when all three corners pull together. It fails when one corner dominates: a data scientist optimizing Gini with no underwriter to catch a proxy variable, an actuary filing a rate with no data scientist to validate it, an underwriter overriding so often the model becomes theater.
Now the decision that defines the modern underwriter: when do you override the model? The model's score is a strong, well-validated prior — the distilled experience of a million files. You do not override it casually; an underwriter who overrides whenever they "have a feeling" has simply reintroduced all the inconsistency and bias the model was built to remove, and a carrier that lets them is wasting the model. But you do override it when you can see something the model cannot, and you can articulate what that something is. The test is not "do I disagree?" but "can I name the specific fact the model lacks, and would a reasonable reviewer agree it changes the answer?"
🤖 Model vs. Judgment Three situations justify an override, and they are worth memorizing because they are defensible and the rest usually are not. First, the model is missing a material fact. It scored the risk without the signed roof-replacement contract, the new management, the just-installed sprinkler upgrade — facts that exist but never reached the model's inputs. You are not overruling the model's reasoning; you are completing its information. Second, the model is out of its domain. The risk is novel, the data thin, the situation one the model was never trained on — a new industry, an unusual structure, a combination the historical book did not contain. The model is extrapolating, and a model outside its training distribution is guessing with false confidence. Third, the model is demonstrably wrong on this case — a data error in the inputs, a misread image, a miscoded class. In all three, the override is not "my gut beats your math"; it is "I have information or context your math did not have." Write that information down. An undocumented override is indistinguishable, to an auditor, from caprice — and indistinguishable, to a regulator, from bias. The documented override is the single most important professional artifact in model-era underwriting.
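One way to make "write that information down" hard to skip is to give the override a shape that cannot be created without a reason and a named fact. The sketch below is hypothetical throughout: the `OverrideRecord` and `OverrideReason` names, the fields, and the example account are illustrative assumptions, not a description of any carrier's workbench.

```python
from dataclasses import dataclass
from enum import Enum


class OverrideReason(Enum):
    """The three defensible justifications for an override."""
    MISSING_MATERIAL_FACT = "model is missing a material fact"
    OUT_OF_DOMAIN = "model is out of its domain"
    DEMONSTRABLE_ERROR = "model is demonstrably wrong on this case"


@dataclass(frozen=True)
class OverrideRecord:
    account: str
    model_score: int
    underwriter_score: int
    reason: OverrideReason
    facts: tuple[str, ...]  # the specific facts the model lacked

    def __post_init__(self):
        # "I have a feeling" is not on the list: a reason must be one of the three.
        if not isinstance(self.reason, OverrideReason):
            raise ValueError("reason must be one of the three OverrideReason values")
        # An override that names no fact is indistinguishable from caprice.
        if not any(fact.strip() for fact in self.facts):
            raise ValueError("an override must name at least one specific fact")


record = OverrideRecord(
    account="Example Fabrication Co.",  # hypothetical account
    model_score=7,
    underwriter_score=6,
    reason=OverrideReason.MISSING_MATERIAL_FACT,
    facts=("signed roof-replacement contract", "sprinkler upgrade installed"),
)
```

The design choice is the point: the record refuses to exist without the two things an auditor or regulator will ask for, which category of justification applies and what, specifically, the model did not know.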
This is where the pricing-model lifecycle closes the loop. The pricing-model lifecycle is the full arc a model travels — from data and feature engineering, to building, to validation, to filing and deployment, to monitoring in production, to refresh or retirement — and the underwriter's overrides are a crucial signal inside it. When underwriters consistently override the model in one direction on a particular kind of risk, that is not insubordination; it is information the model needs. Those overrides, logged and analyzed, tell the actuaries and data scientists where the model is blind — which feature is missing, which segment it misjudges — and feed the next version. The override is not the end of the conversation between human and model; it is how the model learns what only the underwriters can see.
THE PRICING-MODEL LIFECYCLE — and where the underwriter lives in it [schematic]
DATA → FEATURE ENGINEERING → BUILD → VALIDATE → FILE → DEPLOY → MONITOR → REFRESH/RETIRE
▲ │
│ │
└──────────── underwriter overrides, logged ──────────────┘
(the override log tells the model where it's blind)
A model is not a thing you build once; it is a loop. The underwriter contributes domain knowledge at
the front (features) and judgment at the back (overrides) — and the overrides, fed back, improve the
next version. Judgment and analytics are not rivals here; they are two halves of the same loop.
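The feedback arrow in that loop can be sketched as a simple tally over the override log. This is an illustration under assumptions, not a monitoring standard: the function name, the log layout, and both cutoffs (`min_count`, `min_share`) are hypothetical, and it assumes a higher score means a worse risk.

```python
from collections import defaultdict


def one_sided_segments(log, min_count=3, min_share=0.8):
    """Find segments where underwriters keep overriding in the same direction.

    log: iterable of (segment, model_score, underwriter_score) tuples.
    Returns {segment: "lower" | "higher"}, the direction underwriters moved the score.
    """
    tallies = defaultdict(lambda: {"lower": 0, "higher": 0})
    for segment, model_score, uw_score in log:
        if uw_score < model_score:
            tallies[segment]["lower"] += 1
        elif uw_score > model_score:
            tallies[segment]["higher"] += 1

    flagged = {}
    for segment, tally in tallies.items():
        total = tally["lower"] + tally["higher"]
        if total < min_count:
            continue  # too few overrides to read anything into
        direction = max(tally, key=tally.get)
        if tally[direction] / total >= min_share:
            flagged[segment] = direction
    return flagged


# Constructed teaching log, not real override data.
override_log = [
    ("hot_work", 7, 6), ("hot_work", 8, 6), ("hot_work", 7, 5), ("hot_work", 7, 6),
    ("habitational", 5, 6), ("habitational", 6, 5),
    ("habitational", 4, 5), ("habitational", 6, 4),
    ("new_class", 7, 5),
]

blind_spots = one_sided_segments(override_log)
```

In the constructed log, underwriters lower the score on every hot-work override, so that segment is flagged as a place where the model may be missing a feature; the mixed habitational overrides and the single new-class override are not.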
That loop is the chapter's deepest point, and the answer to the anxiety that a model will replace the underwriter. A model that is never overridden is a model no one is watching — and a model no one is watching drifts, encodes yesterday's world, and quietly accumulates the errors that surface as losses two years later. The underwriter who overrides well — rarely, for nameable reasons, with documentation — is not the model's adversary but its essential complement, the one role that keeps the loop honest. The future does not belong to the model or to the underwriter. It belongs to the loop, and to the underwriter who knows how to live inside it.
🔍 Check Your Understanding
1. Name the three situations that justify overriding a model's recommendation. What do all three have in common, and what justification does not make the list? (§32.7)
2. Why is an undocumented override dangerous even when the underwriter turns out to be right? Answer for both the auditor and the regulator. (§32.7)
3. The chapter says underwriter overrides are "information the model needs." Explain how a logged override feeds the pricing-model lifecycle and improves the next version. (§32.7)
🗂️ The Underwriting File
The model renders its verdict — and you overrule it, on the record. It is time to see what your carrier's predictive model makes of Harbor Steel & Fabrication. The submission has already been pre-filled and enriched (Chapter 31): the third-party property and peril data is in, the satellite roof imagery has corroborated the manual read of an aging, end-of-life roof, and now the commercial-property risk-selection model — a gradient boosting machine, validated with the lift and Gini diagnostics of this chapter — returns its score.
The model scores Harbor Steel a 7 out of 10, decline-leaning. Read what drove the score, because it is not stupid. The model can see the two fire losses in five years (a frequency signal), the \$1.2M 2023 loss (a severity signal), the thirty-year-old built-up roof confirmed by the image model, the named-windstorm zone, the fire protection class 4, the welding/hot-work occupancy class, and the pending products-liability claim. Every one of those is a real, adverse feature. A purely model-driven shop declines this risk and moves to the next submission. The score is defensible on the inputs the model had.
Now read what the model cannot see — the facts that change the answer. The two fires tell a story the model reads as "frequency" but you read as management: the 2021 fire was electrical and the 2023 fire was hot-work, both predating corrective action, and the broker, Meridian, has attached the controls that address them. There is a signed roof-replacement contract — the single fact that converts the roof from a decline-driver into a time-limited, ACV-endorsed subjectivity (the structure you drafted in Chapter 12). There is a hot-work permit program being put in place. The model scored the risk as it was; you are pricing the risk as it will be under the conditions you attach. The model is not wrong about the history — it is uninformed about the corrective controls, which is override justification number one from §32.7: a missing material fact.
So you override — to a 6 — and you document why. This is the model-override anchor the whole book has been building toward, and it pays off here exactly as the discipline of §32.7 prescribes. You do not override because you "like" the account or because the broker is a good one; you override because you can name the specific facts the model lacked (the signed roof contract, the hot-work program, the management change behind the loss history) and because a reasonable reviewer, shown those facts, would agree they change the grade from a decline-leaning 7 to a writable-with-conditions 6. You write that reasoning into the file — the facts, the controls, the subjectivities they become — so that the override survives an audit and a regulator's question alike. And you log it, because if underwriters keep overriding this model on hot-work accounts with documented corrective controls, that is information the model's next version needs: a feature for "post-loss corrective controls in place" that the model is currently blind to.
What this layer settles, and what it does not. It settles the model-versus-judgment question for Harbor Steel: the score is an input, not the decision, and the documented override resolves the tension in favor of a defensible 6. It does not repeal any of the residual risk — the aging sprinklers, the pending bracket claim, the catastrophe tail are all still there, priced and conditioned exactly as the earlier chapters built them. The model did not make the decision; it sharpened it. Running disposition: model-vs-judgment resolved; override to a 6 logged with documented reasons; the account remains quote-with-conditions, now with the analytics on the record beside the underwriter's judgment. The capstone in Chapter 40 will assemble this override into the complete file and defend it to the committee — which is, in the end, the one skill this book exists to teach.
Conclusion
Predictive models have moved from advisor to author across much of insurance pricing, and an underwriter who cannot read them cannot defend a price. We traced the path from the correlated, one-way rating tables of Chapter 11 to the multivariate generalized linear model, which disentangles correlated effects and splits a price into an interpretable frequency model (Poisson) times a severity model (gamma) — a price you can explain, factor by factor. We stepped up to the gradient boosting machine, which out-predicts the GLM by finding interactions automatically but pays for it in opacity, and drew the working rule: GLM where you must explain the price, GBM where you must rank the risk. We saw neural networks extend the inspector's reach into images and satellite tiles — and fail in alien ways that demand human confirmation. We named the quiet truth that feature engineering, not the algorithm, usually decides the result, and that it is the underwriter's natural seat at the modeling table — and the place where bias enters or is stopped. We learned to validate a model with lift, the Gini, and an out-of-sample backtest, and to ask the four questions that separate a tool from a rumor. And we located ourselves in the actuary–underwriter–data-scientist triangle and in the pricing-model lifecycle, where the disciplined, documented override is the most important professional artifact an underwriter produces.
Two themes ran through all of it. Technology augments underwriters; it does not replace them — the model is a powerful input that cannot see context, cannot read documents, cannot complete its own information, and drifts when no one overrides it. And underwriting is judgment — the score is a strong prior, but the decision, and its defense, belong to the person who can name what the model cannot see. The Harbor Steel model scored a 7; you wrote a 6, and you wrote down why.
In the next chapter we turn to the sharpest edge of adverse selection — fraud and misrepresentation — and to the data and red flags that catch it, including the disclosure gap in the very Harbor Steel application the model just scored. The model told you what the data said. Chapter 33 asks whether the data was true.
Key Terms
- Generalized linear model (GLM) — a statistical model relating predictors to an outcome through a link function and an error distribution, estimating all effects simultaneously; in insurance, typically a Poisson-frequency model and a gamma-severity model whose product is the modeled pure premium. The interpretable industry workhorse for pricing.
- Gradient boosting machine (GBM) — a machine-learning method that combines many small decision trees, each correcting the errors of the last, to produce highly accurate predictions that automatically capture interactions — at the cost of interpretability; favored for risk selection over filed pricing.
- Feature engineering — the construction, transformation, and selection of the input variables a model learns from; where domain knowledge enters the model, where most predictive power lives, and where bias is either stopped or let through.
- Model validation / backtesting — testing whether a model predicts well on data it did not learn from; backtesting checks how a model would have performed historically. The cardinal rule: never judge a model on its training data.
- Lift / Gini — diagnostics of a model's discriminating power: lift sorts risks into deciles and compares the actual loss experience of the best to the worst; the Gini compresses that separation into a single number from 0 (random) toward 1 (perfect). Both measure ranking, not price adequacy.
- The pricing-model lifecycle — the full arc of a model from data and feature engineering through build, validation, filing, deployment, monitoring, and refresh or retirement; the underwriter contributes features at the front and logged overrides at the back, which feed the next version.
Spaced Review
- Explain the one statistical thing a GLM buys you that a classical one-way rate table cannot, and why that matters for a set of correlated rating factors. (§32.1, §32.2)
- A model shows excellent lift on the data it was trained on. What is the single question that decides whether that lift means anything, and what is the danger if the answer is "we didn't check"? (§32.6)
- (From earlier.) Harbor Steel's loss history is "two fires in five years." In the language of Chapter 6, which dimension of risk is a count of fires, and which is the \$1.2M size of the 2023 fire? Why does a GLM model them separately? (§32.2, Ch.6)
- (From earlier.) The model scored a risk using a feature that is a proxy for a protected class. Why does "the algorithm selected it for predictive power" fail as a regulatory defense, and which chapter owns the full treatment? (§32.1, §32.5, Ch.4, Ch.35 preview)
- (The recurring pricing-discipline question.) You override a model's decline and write a risk it scored a 7. Under what conditions does that override help the combined ratio rather than hurt it — and what one artifact makes the difference between a disciplined override and a reckless one? (§32.7, Ch.3, Ch.11)