Appendix G — Data Science and Python Toolkit

The chapters that teach the analytics — the mathematics of risk and price in Chapter 10, the build-up of a rate in Chapter 11, data-driven underwriting in Chapter 31, and predictive modeling in Chapter 32 — show you the ideas through meaning and a worked number, the way the book promised. This appendix is the companion code reference: the same ideas expressed in a little Python and SQL, so a reader on the 📊 Analytics path can see the shape of the work and a reader who needs to talk to a data scientist can read what one writes. You can skip every line of it and still own the underwriting; the code illustrates the craft, it does not replace it.

Read this first, before any snippet — three rules that hold for the whole appendix.

  1. Every dataset here is invented. The numbers in the DataFrames, the loss runs, the rates, the model outputs — all of it is a constructed teaching example, chosen to make the arithmetic legible, never taken from a real insurer's experience. When you see a loss ratio come out to 0.62, read it as a teaching figure, not a benchmark.
  2. This code is for learning, not for production. It is written to be readable, not robust. It has no error handling, no data validation, no reserve development, no leakage controls worth the name, no regulatory filing rigor, and no tests. A real pricing pipeline — the kind that goes in front of a state regulator under the not excessive, not inadequate, not unfairly discriminatory standard — is built and validated by actuaries and data scientists over months. Treat these as sketches of the concept, the way an ASCII diagram is a sketch of a building.
  3. The underwriter still owns the decision. A model produces a number; a person decides whether to accept the risk and on what terms (the theme that runs through the whole book — see Chapter 32, §32.7). Nothing in this appendix changes that. The code that scores a risk is the easy part; knowing when the score is wrong is the job.

The snippets do not run at build time, and you do not need to run them to learn from them. Each one is introduced by the underwriting question it answers and followed by how to read the output, because a calculation you cannot interpret is worse than none — it is false confidence with a number attached.

A note on libraries: the Python assumes the ordinary open-source stack a working insurance analyst uses — pandas and numpy for data, statsmodels or scikit-learn for the models. The exact APIs of these libraries change between versions; treat the calls as idiomatic illustrations, not a guarantee that a particular argument exists in the version on your machine. Inline math uses single-dollar delimiters; currency in the prose is escaped as \$.


G.1 Loss ratio and combined ratio from a DataFrame (Chapters 3, 10, 11)

The underwriting question. Is this business making money? You have a slice of a book — a set of policies with their earned premium, their incurred losses, and their expenses — and you want the two numbers the book calls the scoreboard: the loss ratio (the share of premium consumed by losses) and the combined ratio (losses plus expenses; above 100% is an underwriting loss before investment income). The chapters define these in words (Chapter 3 introduces them; Chapter 10 sits the loss ratio inside the frequency/severity math); here is the same thing in code over a table of policies.

import pandas as pd

# --- Constructed teaching example: a tiny book of business. ---
# Each row is one policy-year. All figures are illustrative, not a real insurer's.
# 'incurred_loss' already includes loss adjustment expense (LAE) for simplicity;
# in a real extract LAE is often a separate column you must add in.
book = pd.DataFrame({
    "policy_id":     ["P-001", "P-002", "P-003", "P-004", "P-005"],
    "segment":       ["property", "property", "auto", "auto", "property"],
    "earned_premium":[120_000,    85_000,    60_000,  40_000,  95_000],
    "incurred_loss": [ 54_000,    91_000,    18_000,  52_000,  33_000],   # losses + LAE
    "uw_expense":    [ 30_000,    21_000,    15_000,  10_000,  24_000],   # commission + overhead
})

# Loss ratio and combined ratio are RATIOS OF SUMS, never the average of per-policy ratios.
# Summing first weights each policy by its premium, which is what you want.
total_premium = book["earned_premium"].sum()
total_loss    = book["incurred_loss"].sum()
total_expense = book["uw_expense"].sum()

loss_ratio     = total_loss / total_premium
expense_ratio  = total_expense / total_premium
combined_ratio = loss_ratio + expense_ratio

print(f"Loss ratio:     {loss_ratio:6.1%}")
print(f"Expense ratio:  {expense_ratio:6.1%}")
print(f"Combined ratio: {combined_ratio:6.1%}")

# The same calculation, sliced by segment — this is where the truth usually hides.
by_segment = (
    book.groupby("segment")[["earned_premium", "incurred_loss", "uw_expense"]]
        .sum()
        .assign(
            loss_ratio=lambda d: d["incurred_loss"] / d["earned_premium"],
            combined_ratio=lambda d: (d["incurred_loss"] + d["uw_expense"]) / d["earned_premium"],
        )
)
print(by_segment[["loss_ratio", "combined_ratio"]])

How to read the output. On this constructed book the loss ratio is \$230,000 / \$400,000 = 57.5%, the expense ratio is \$100,000 / \$400,000 = 25.0%, and the combined ratio is 82.5% — below 100%, so this little book made an underwriting profit: it paid out about 82.5 cents for every premium dollar and kept the rest. The two warnings the book has pressed apply directly. First, the segment cut is the honest view: the blended 82.5% can hide a profitable property segment subsidizing a bleeding auto segment (run the by_segment frame and you will see the auto loss ratio is far worse than property). Second, a loss ratio is only as good as the reserves inside incurred_loss — these are estimates of claims not yet fully paid, not facts, so a "good" loss ratio on immature business can deteriorate as the claims develop (the trend-and- development point from Chapter 10, §10.4). The combined ratio tells the truth, but only about the business old enough for the truth to have arrived.

⚠️ Underwriting Trap. Averaging the per-policy loss ratios (book.eval('incurred_loss/earned_premium').mean()) gives a different, wrong number, because it weights a \$40,000 policy the same as a \$120,000 one. Always sum the dollars, then divide. The mispriced book that blows up three years later sometimes starts as an innocent spreadsheet that averaged the ratios.


G.2 The frequency / severity split (Chapters 6, 10)

The underwriting question. What is driving the losses — too many claims, or claims that are too big? The book's structure of expected loss is frequency × severity (Chapter 6 defines the two dimensions; Chapter 10 treats them as distributions). Pulling them apart matters because the two have different cures: a frequency problem is often a controls-and-selection problem, while a severity problem is a limits-and- reinsurance problem. The same total loss ratio can come from "many small claims" or "one catastrophic claim," and you price and structure those very differently.

import pandas as pd

# --- Constructed teaching example: one segment's exposure and its individual claims. ---
exposures = 1_000          # e.g., 1,000 insured vehicle-years (illustrative)

# Each row is one claim. A 1,000-exposure book that produced 7 claims this period.
claims = pd.DataFrame({
    "claim_id": [f"C-{i:03d}" for i in range(1, 8)],
    "paid":     [2_400, 5_100, 1_800, 240_000, 3_300, 900, 7_500],  # illustrative dollars
})

claim_count = len(claims)
total_losses = claims["paid"].sum()

frequency = claim_count / exposures            # claims per exposure unit
severity  = total_losses / claim_count         # average dollars per claim
pure_premium = total_losses / exposures        # expected loss per exposure unit
# Identity worth memorizing: pure premium == frequency * severity.
assert abs(pure_premium - frequency * severity) < 1e-6

print(f"Frequency:     {frequency:.3f} claims per exposure   ({claim_count} claims / {exposures} exposures)")
print(f"Severity:      ${severity:,.0f} average per claim")
print(f"Pure premium:  ${pure_premium:,.2f} per exposure     (frequency x severity)")

# Severity is fat-tailed: the MEAN is not the TYPICAL claim. Show both.
print(f"Median (typical) claim: ${claims['paid'].median():,.0f}")
print(f"Largest claim:          ${claims['paid'].max():,.0f}  "
      f"({claims['paid'].max() / total_losses:.0%} of all dollars)")

How to read the output. Frequency is 7 / 1,000 = 0.007 claims per exposure — seven claims for every thousand exposures. Severity is \$261,000 / 7 ≈ \$37,286 per claim on average — but look at the next two lines, because that average is a lie of the kind Chapter 10 (§10.1) warns about. The median claim is only about \$3,300, and a single \$240,000 claim is more than 90% of every dollar paid. This is the signature of a right-skewed severity distribution: the mean loss is dragged far above the typical loss by one rare large one. The underwriting reading: this is a severity story, not a frequency story. The cure is not tighter selection of small risks; it is a deductible that sheds the small claims you do not care about and a limit (and reinsurance behind it) sized for the large one you do. Pricing off the \$37,286 mean as if every claim looked like that would badly misstate both the risk and the right structure.

📋 At the Desk. When you split frequency from severity on real data, do it after developing the losses to ultimate and trending them (Chapter 10, §10.4) — an open claim reserved at \$50,000 may settle at \$200,000, and immature severity understates the tail. The split above uses paid-to-date figures only for legibility.


G.3 A frequency / severity GLM, conceptually (Chapters 11, 32)

The underwriting question. How does each risk characteristic move the expected cost, holding the others constant? A rating table assigns a relativity to each factor; a generalized linear model (GLM) estimates all those relativities at once from data, so you are not double-counting two factors that travel together. The industry workhorse (Chapter 32, §32.2) is a pair: a Poisson model for frequency and a gamma model for severity, whose product is the modeled pure premium that feeds the rate built in Chapter 11. The code below is deliberately schematic — it shows the shape of fitting these two models, not a filing-ready pricing model.

import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

# --- Constructed teaching example. One row per policy-year. ---
# Features: territory (A/B/C) and a 0/1 protection-class flag (1 = better protected).
# 'exposure' is the policy-years at risk; 'claim_count' the number of claims;
# 'total_loss' the dollars of loss. All invented.
rng = np.random.default_rng(7)
n = 5_000
df = pd.DataFrame({
    "territory":  rng.choice(["A", "B", "C"], size=n, p=[0.5, 0.3, 0.2]),
    "protected":  rng.integers(0, 2, size=n),
    "exposure":   rng.uniform(0.5, 1.0, size=n),   # fraction of a year on risk
})
# (A real model fits OBSERVED claims; here the data is synthetic so the demo runs in principle.)
base_rate = 0.10
terr_mult = df["territory"].map({"A": 1.0, "B": 1.4, "C": 1.9})
prot_mult = np.where(df["protected"] == 1, 0.7, 1.0)
lam = base_rate * terr_mult * prot_mult * df["exposure"]          # expected claims
df["claim_count"] = rng.poisson(lam)
df["total_loss"]  = df["claim_count"] * rng.gamma(shape=2.0, scale=3_000, size=n)

# ---- FREQUENCY model: Poisson, with exposure as an OFFSET. ----
# The offset (log of exposure) tells the model we are modeling a RATE per exposure-year,
# so a half-year policy is not mistaken for a low-risk one.
freq_model = smf.glm(
    formula="claim_count ~ C(territory) + protected",
    data=df,
    family=sm.families.Poisson(),
    offset=np.log(df["exposure"]),
).fit()

# ---- SEVERITY model: Gamma with a log link, fit on policies that HAD a claim. ----
sev_df = df[df["claim_count"] > 0].copy()
sev_df["avg_severity"] = sev_df["total_loss"] / sev_df["claim_count"]
sev_model = smf.glm(
    formula="avg_severity ~ C(territory) + protected",
    data=sev_df,
    family=sm.families.Gamma(link=sm.families.links.Log()),
).fit()

# Relativities are exp(coefficient): a multiplier on the base, exactly like a rating factor.
print("Frequency relativities (exp of coefficients):")
print(np.exp(freq_model.params).round(3))
print("\nSeverity relativities (exp of coefficients):")
print(np.exp(sev_model.params).round(3))

# Modeled pure premium = predicted frequency-per-exposure x predicted severity.
df["pred_freq"] = freq_model.predict(df, offset=np.zeros(n))   # per 1.0 exposure
df["pred_sev"]  = sev_model.predict(df)
df["pred_pure_premium"] = df["pred_freq"] * df["pred_sev"]
print(f"\nMean modeled pure premium: ${df['pred_pure_premium'].mean():,.2f} per exposure-year")

How to read the output. Because the GLM uses a log link, the natural way to read it is exp(coefficient), which turns each estimate into a multiplicative relativity — exactly the rating factor of Chapter 11, §11.3, where 1.00 is the baseline, above 1.00 a debit, below 1.00 a credit. In this constructed example you would expect the frequency model to recover something close to the multipliers baked into the data: Territory B around 1.4 and Territory C around 1.9 relative to Territory A, and the protected flag around 0.7 (better-protected risks have fewer claims). The severity relativities read the same way, for claim size. Multiply a risk's predicted frequency by its predicted severity and you have its modeled pure premium — the expected-loss core that Chapter 11 then loads with expenses and profit to reach an indicated rate.

What the GLM can do: estimate every factor's effect simultaneously, so two correlated variables do not both get full credit for the same signal; and stay interpretable, factor by factor, which is why it is the form regulators are most comfortable seeing in a rate filing. What it cannot do: find an interaction or nonlinearity you did not put in the formula (it models what you specify, not what you forgot); price a factor you are not allowed to use, or its proxies, without inheriting the fairness problem of Chapter 35; or tell you the relativities are stable — a thin cell (few claims in Territory C) gives a wobbly estimate, the credibility humility of Chapter 10, §10.5. The split into two models is itself a modeling choice: frequency and severity respond to different drivers and are modeled separately precisely so you can see which is which.

🤖 Model vs. Judgment. A GLM relativity of 1.9 on Territory C is an average over everything in that cell. The underwriter still reads the individual account the model cannot see — the well-managed risk in a bad territory — and decides whether the 1.9 is the right answer for this submission or a debit to override with documented reasons. The model prices the class; you underwrite the risk.


G.4 A gradient-boosting risk score (Chapter 32)

The underwriting question. Of the submissions in front of me, which deserve the closest look? Where the GLM is the interpretable workhorse for filed pricing, a gradient-boosting machine (GBM) is often the tool for risk selection and triage (Chapter 32, §32.3): it combines many small decision trees, each correcting the last, and captures interactions and nonlinearities automatically. The price of that power is low interpretability — a "black box" you must validate by behavior, not by reading coefficients. Used well, it ranks a stack of risks so the underwriter's attention goes where it pays. Used badly, it is a confident number nobody can explain to a regulator.

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# --- Constructed teaching example: submissions with a few features and a 0/1 outcome. ---
# 'bad_risk' = 1 means the account went on to run unprofitably (a label you only have in hindsight).
rng = np.random.default_rng(11)
n = 4_000
X = pd.DataFrame({
    "building_age":   rng.integers(0, 60, n),
    "prior_claims":   rng.poisson(0.8, n),
    "sprinklered":    rng.integers(0, 2, n),
    "distance_to_coast_mi": rng.uniform(0, 50, n),
})
# Synthetic "truth": older, more prior claims, unsprinklered, coastal => more likely bad.
logit = (-2.5
         + 0.03 * X["building_age"]
         + 0.6  * X["prior_claims"]
         - 0.8  * X["sprinklered"]
         - 0.04 * X["distance_to_coast_mi"])
prob = 1 / (1 + np.exp(-logit))
y = (rng.uniform(size=n) < prob).astype(int)   # 'bad_risk' label

# ALWAYS split before training. The model is judged on data it never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

model = GradientBoostingClassifier(
    n_estimators=200,      # number of small trees
    max_depth=3,           # each tree is shallow ("weak learner")
    learning_rate=0.05,    # how much each tree is allowed to correct
    random_state=42,
)
model.fit(X_train, y_train)

# The "risk score" is the model's predicted probability of a bad outcome, 0..1.
X_test = X_test.copy()
X_test["risk_score"] = model.predict_proba(X_test)[:, 1]

# Feature importances hint at WHAT drives the score (not WHY for any one account).
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print("Relative feature importance:")
print(importances.round(3))
print("\nFive submissions, scored highest-risk first:")
print(X_test.sort_values("risk_score", ascending=False).head().round(2))

How to read the output. The risk_score is a number between 0 and 1 — the model's estimated probability that the account turns out badly. It is a triage signal, not a verdict: a high score says "look harder here," the way the model in the book's anchor case scored Harbor Steel a 7/10 and flagged it for a decline. The feature_importances_ tell you which inputs the model leaned on overall — here you would expect prior_claims, building_age, and the coastal distance to rank high — but importance is a property of the model across all accounts, not an explanation of any single score. That distinction is the whole reason Chapter 32 (§32.7) insists the underwriter stays in the loop.

What the GBM can do: find interactions you never specified (old and unsprinklered and coastal may be far worse than the sum of the three) and rank a large stack of risks consistently and fast. What it cannot do: explain itself in the factor-by-factor way a GLM does, which is why it more often rides behind the scenes as a selection/triage score than as the filed price (a regulator can examine a relativity table; a black box is harder to defend under the unfair-discrimination standard of Chapter 4 and the bias concerns of Chapter 35); and it cannot know it is wrong about the account in front of you. The score is the start of the conversation, not the end of it — and a model trained on a hindsight label like bad_risk will faithfully reproduce whatever bias lived in that history, which is the entire warning of Chapter 32, §32.6 and Chapter 35.

⚠️ Underwriting Trap. A GBM will happily memorize the training data and look brilliant on it (model.score(X_train, y_train) near perfect) while ranking new business no better than a coin. Never judge a model on the data it learned from — only on the held-out X_test. The next section is how you do that honestly.


G.5 Model validation: a lift and Gini sketch (Chapter 32)

The underwriting question. Does the model actually separate good risks from bad — on business it has never seen? A model is worthless until it is tested out of sample (Chapter 32, §32.6). The two diagnostics the book names are lift — sort risks by predicted loss cost into bands (deciles) and compare the actual experience of the worst band to the best — and the Gini coefficient, which compresses that whole separation into one number from 0 (random) toward 1 (perfect ranking). Both measure ranking power, not price adequacy: a model can rank risks perfectly and still be miscalibrated on the absolute dollars, so lift tells you the model sorts, not that the level is right.

import numpy as np
import pandas as pd

# --- Constructed teaching example: held-out predictions vs. what actually happened. ---
# 'predicted' = the model's score on out-of-sample risks; 'actual_loss' = realized loss cost.
rng = np.random.default_rng(23)
m = 5_000
predicted = rng.uniform(0, 1, m)
# A decent-but-imperfect model: actual loss is correlated with the score, plus noise.
actual_loss = 500 * predicted + rng.normal(0, 120, m)
actual_loss = np.clip(actual_loss, 0, None)
val = pd.DataFrame({"predicted": predicted, "actual_loss": actual_loss})

# ---- LIFT: rank into deciles by PREDICTED score, then look at ACTUAL loss in each. ----
val["decile"] = pd.qcut(val["predicted"], 10, labels=False)   # 0 = lowest predicted, 9 = highest
lift_table = val.groupby("decile")["actual_loss"].mean()
best, worst = lift_table.iloc[0], lift_table.iloc[-1]
print("Mean ACTUAL loss by predicted decile (0 = best, 9 = worst):")
print(lift_table.round(1))
print(f"\nDecile lift (worst / best): {worst / best:.2f}x")

# ---- GINI: one number for ranking power, via the area under the Lorenz-style curve. ----
def gini_from_predictions(pred, actual):
    order = np.argsort(pred)                       # sort by predicted, ascending
    actual_sorted = np.asarray(actual)[order]
    cum = np.cumsum(actual_sorted) / actual_sorted.sum()
    # Trapezoidal area between the cumulative-loss curve and the 45-degree line.
    lorenz_area = np.trapz(cum, dx=1.0 / len(cum))
    return 1 - 2 * lorenz_area

gini = abs(gini_from_predictions(val["predicted"], val["actual_loss"]))
print(f"Gini coefficient: {gini:.3f}")

How to read the output. Read the lift table top to bottom: the mean actual loss should climb steadily from decile 0 (the risks the model called best) to decile 9 (the risks it called worst). If it does, the model is doing its one job — putting the worse risks in the worse bands. The decile lift of, say, "the worst decile lost about 8x what the best decile lost" is the plain-language version a manager understands in one sentence. The Gini then compresses the whole curve into a single number: closer to 0 means the model ranks no better than chance; closer to 1 means near-perfect separation. There is no universal "good" Gini — it depends on the line, the signal available, and what you are comparing against — so read it relative to the current rating plan, never as an absolute grade.

The discipline the book insists on: both of these measure sorting, not pricing. A model can ace lift and Gini and still charge the wrong absolute premium, because ranking is silent about level — you also need a calibration check (do the predicted dollars match the realized dollars, in total and by band?) before the model touches a rate. And both are computed on held-out data here; computed on training data they flatter the model and mean nothing. A model that looks excellent in validation can still fail in production when the world shifts under it (a hard market, a new peril, a law change), which is why Chapter 32, §32.6 treats validation as something you do continuously, not once.

📋 At the Desk. When a vendor or a colleague shows you a model, ask three questions before you trust the score: What data did you validate on, and was it truly out of sample? Show me the lift chart. Is it calibrated, or only ranked? If the answers are vague, the number is decoration.


G.6 Pulling a loss run and a book-of-business cut with SQL (Chapters 8, 29, 31)

The underwriting question. Get me the history before I read the application. Before any model runs, the data has to be pulled — and most of it lives in a database, not a spreadsheet. Two queries cover most of what an underwriter actually asks for: a loss run for a single account (the claim-by-claim history of Chapter 8, §8.2, the most valuable document in a commercial file), and a book-of-business cut that summarizes a segment the way Chapter 29 (§29.4) and Chapter 31 demand. The SQL below is standard and illustrative; real schemas, table names, and column conventions differ at every carrier.

First, the loss run — every claim on one account, ordered so the story is legible (cause and trajectory, not just total dollars):

-- Constructed teaching example. Schema and table names are illustrative.
-- One account's claim history across all lines, most recent first.
SELECT
    c.claim_id,
    c.line_of_business,
    c.date_of_loss,
    c.cause_of_loss,
    c.status,                                  -- 'open' or 'closed'
    c.paid_amount,
    c.reserve_amount,
    (c.paid_amount + c.reserve_amount) AS incurred_amount   -- paid + case reserve
FROM claims AS c
JOIN policies AS p
    ON c.policy_id = p.policy_id
WHERE p.account_id = 'ACCT-HARBOR-STEEL'        -- the account you are underwriting
  AND c.date_of_loss >= DATEADD(year, -5, CURRENT_DATE)   -- the standard 5-year look-back
ORDER BY
    c.date_of_loss DESC;

How to read the output. This returns one row per claim, with paid, reserve, and the incurred (paid + reserve) that matters for the loss ratio of §G.1. Read it the way Chapter 8 teaches: not as a total to sum, but as a narrative. Three small fender-benders read very differently from one \$1.2M fire. An open claim with a large reserve is a future number, not a settled one — it can develop up or down. And the cause_of_loss column is where the underwriting lives: two fires that are both electrical and both predate the current plant manager tell a story about management that the dollar total alone conceals (the Harbor Steel anchor, exactly). The query gets you the rows; reading them is the craft.

Next, the book-of-business cut — the same data rolled up to judge a segment rather than an account:

-- Constructed teaching example. A book-of-business summary by segment.
-- 'earned_premium' lives on the policy; losses are summed from claims.
SELECT
    p.segment,
    COUNT(DISTINCT p.policy_id)                       AS policy_count,
    SUM(p.earned_premium)                             AS earned_premium,
    SUM(COALESCE(c.incurred_amount, 0))               AS incurred_loss,
    -- Loss ratio = incurred losses / earned premium, computed as a ratio of SUMS.
    SUM(COALESCE(c.incurred_amount, 0))
        / NULLIF(SUM(p.earned_premium), 0)            AS loss_ratio
FROM policies AS p
LEFT JOIN (
    -- Pre-aggregate claims to ONE row per policy so the premium is not double-counted
    -- when a policy has several claims (a classic fan-out bug).
    SELECT
        policy_id,
        SUM(paid_amount + reserve_amount) AS incurred_amount
    FROM claims
    WHERE date_of_loss >= DATEADD(year, -1, CURRENT_DATE)
    GROUP BY policy_id
) AS c
    ON p.policy_id = c.policy_id
WHERE p.status = 'in_force'
GROUP BY p.segment
ORDER BY loss_ratio DESC;

How to read the output. One row per segment, with its policy count, earned premium, incurred loss, and the loss ratio that judges it — sorted worst-first so the segment bleeding money is at the top of your screen. This is the §G.1 calculation done in the database over the whole book, and it is the portfolio view of Chapter 29: a book-level average hides a profitable core subsidizing a deteriorating segment, and the only way to act on the part rather than the whole is to cut it like this. Two technical points carry an underwriting lesson. The LEFT JOIN keeps policies with zero claims in the denominator — drop them and the loss ratio is overstated, because you would be dividing losses by only the premium of the policies that happened to have a claim. And the claims are pre-aggregated to one row per policy inside the subquery: join the raw claims table directly and a policy with three claims gets its premium counted three times — the "fan-out" bug that quietly corrupts more book analyses than any modeling error. As always, the figures are only as good as the reserves inside incurred_amount and the recency of the look-back window; the SQL is exact, but what it counts is still an estimate.

⚖️ Compliance Corner. Pulling the data is the easy part; you are still bound by what you may use. A field can sit in the database and be off-limits in a decision — a protected characteristic, or a credit-based element restricted in your state, or third-party data governed by the FCRA's adverse-action and dispute rules (Chapter 8, §8.6; Chapter 35). The query does not know the law. You do. A column being available is never the same as a factor being usable.


G.7 Where this fits — and where it stops

These seven sketches trace the analytic spine of the book end to end: pull the history (§G.6), split the loss into frequency and severity (§G.2), judge the result with loss and combined ratios (§G.1), model the relativities with a GLM (§G.3), rank the risks with a GBM (§G.4), and prove the model actually separates good from bad before you trust it (§G.5). That is the pricing-model lifecycle of Chapter 32 in miniature — data and features at the front, validation at the back, the underwriter's domain knowledge and logged overrides on both ends.

What this appendix is not is a pricing system. There is no loss development, no trending to the future cost level, no credibility blending of the model against the class, no premium-versus-loss calibration, no fairness testing for proxy discrimination, no monitoring for the day the world shifts under the model, and no regulatory filing rigor — every one of which a real implementation needs, and several of which the book treats as chapters in their own right. Hold the same posture toward this code that the whole book holds toward every technique: know precisely what it can do, and in the same breath what it cannot. A model gives you a number. Reading the number — knowing when it is right, when it is wrong, and how to defend the decision you make either way — is the underwriting. That part is, and remains, yours.

🔍 Check Your Understanding. 1. Why must a loss ratio be computed as a ratio of summed dollars rather than the average of per-policy ratios? (Premium-weighting: averaging the ratios treats a tiny policy and a huge one as equals.) 2. A model posts an excellent Gini on held-out data but the realized dollars come in 20% above predicted across every band. Is the model usable as filed? (No — Gini measures ranking, not level; it is miscalibrated and needs recalibration before it prices anything.) 3. In the book-of-business SQL, what does the LEFT JOIN protect against that an inner join would get wrong? (It keeps zero-claim policies in the premium denominator, so the loss ratio is not overstated.)


Every dataset, rate, score, and loss run in this appendix is a constructed teaching example. The code is written to be read and learned from, not deployed; it omits the validation, development, calibration, fairness testing, and governance a production pricing model requires. For the concepts behind it, see Chapter 10 (the mathematics of risk and price), Chapter 11 (pricing and rating), Chapter 31 (data-driven underwriting), and Chapter 32 (predictive modeling).