Case Study 2: Data Warehouse Aggregation Patterns
Background
Continental Insurance Group (CIG) is a property and casualty insurer operating in 22 states. Their data warehouse, built on DB2 for LUW 11.5, receives nightly feeds from the policy administration system, the claims system, the billing platform, and three external data providers. The warehouse contains 8 years of history:
- POLICIES: 4.2 million rows (active and historical)
- CLAIMS: 2.8 million rows
- CLAIM_PAYMENTS: 11 million rows
- PREMIUMS_WRITTEN: 38 million rows (monthly premium records)
- AGENTS: 6,200 rows
- REGIONS: 85 rows (state/territory groupings)
CIG's actuarial team runs complex aggregation queries daily to compute loss ratios, reserve adequacy, and premium trends. The analytics team produces monthly management reports that slice data by state, product line, policy year, and agent tier. Finally, the finance team produces quarterly regulatory filings that require aggregated data at very specific dimensional intersections.
Each team had built their own reporting queries independently, resulting in a collection of over 200 stored procedures, many of which produced subtly different numbers for the same metric. The CIO hired a data engineering team to consolidate and optimize the reporting layer.
The Challenges
Challenge 1: Inconsistent Loss Ratio Calculations
The loss ratio — claims paid divided by premiums earned — is the most important metric in insurance. Three different reports produced three different loss ratios for the same period:
- The actuarial report showed 68.2%
- The finance report showed 71.4%
- The management dashboard showed 65.8%
Investigation revealed three root causes:
Root Cause A: Different NULL handling. The actuarial query used SUM(claim_amount) directly. Some claims had NULL amounts (pending adjudication). The actuarial query implicitly excluded these because SUM ignores NULLs. The finance query used SUM(COALESCE(claim_amount, estimated_amount)) which substituted an estimated amount for NULLs — a more conservative (higher) number. The dashboard used SUM(COALESCE(claim_amount, 0)) which treated pending claims as zero — an optimistic (lower) number.
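The divergence is easy to reproduce. The sketch below uses an in-memory SQLite table (schema and values invented for illustration) to run the three variants side by side. Note that for a plain SUM, excluding NULLs and zero-filling happen to produce the same total; the two treatments diverge once averages, counts, or all-pending groups enter the picture, so the demo includes AVG as well:

```python
import sqlite3

# Invented toy data: two adjudicated claims and one pending (NULL) claim.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (claim_amount REAL, estimated_amount REAL)")
conn.executemany("INSERT INTO claims VALUES (?, ?)",
                 [(100.0, 90.0), (200.0, 210.0), (None, 150.0)])

row = conn.execute("""
    SELECT SUM(claim_amount),                             -- actuarial: NULLs silently dropped
           SUM(COALESCE(claim_amount, estimated_amount)), -- finance: substitute the estimate
           SUM(COALESCE(claim_amount, 0)),                -- dashboard: pending counts as zero
           AVG(claim_amount),                             -- average over the 2 non-NULL rows
           AVG(COALESCE(claim_amount, 0))                 -- average over all 3 rows
    FROM claims
""").fetchone()

print(row)  # (300.0, 450.0, 300.0, 150.0, 100.0)
```

The finance variant (450.0) is visibly higher than the other two sums; the two AVG columns show how exclusion versus zero-fill changes a denominator even when the sums agree.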
Root Cause B: Different GROUP BY granularity. The actuarial report computed the loss ratio at the policy-year level and then averaged across years. The finance report computed it at the aggregate level (total claims divided by total premiums). These produce different results because averaging ratios is not the same as computing the ratio of averages.
Consider a simple example:
- Year 1: Claims = $100, Premiums = $200, Loss Ratio = 50%
- Year 2: Claims = $900, Premiums = $1000, Loss Ratio = 90%
- Average of ratios: (50% + 90%) / 2 = 70%
- Ratio of totals: $1000 / $1200 = 83.3%
Neither is "wrong" — they answer different questions. But they must be labeled clearly.
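The arithmetic can be checked in a few lines of plain Python, using the numbers from the example above:

```python
# The two aggregation conventions from the worked example.
years = [
    {"claims": 100, "premiums": 200},   # Year 1: 50% loss ratio
    {"claims": 900, "premiums": 1000},  # Year 2: 90% loss ratio
]

avg_of_ratios = sum(y["claims"] / y["premiums"] for y in years) / len(years)
ratio_of_totals = sum(y["claims"] for y in years) / sum(y["premiums"] for y in years)

print(f"{avg_of_ratios:.1%}")    # 70.0%
print(f"{ratio_of_totals:.1%}")  # 83.3%
```

The gap grows with the spread in premium volume: the ratio of totals weights each year by its premiums, while the average of ratios weights every year equally.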
Root Cause C: Different date filters. The actuarial report filtered by policy effective date (when was the policy written?). The finance report filtered by accounting period (when was the premium booked?). The dashboard filtered by calendar date of claim payment. Each captured a different slice of the same data.
Challenge 2: The 85-State CUBE
CIG needed a quarterly regulatory filing that showed premium volume and loss experience broken down by state, product line, and policy year. The initial approach was:
SELECT state, product_line, policy_year,
       SUM(premium) AS total_premium,
       SUM(claims_paid) AS total_claims,
       CASE WHEN SUM(premium) > 0
            THEN SUM(claims_paid) / SUM(premium) * 100
            ELSE 0
       END AS loss_ratio_pct
  FROM FACT_LOSS_EXPERIENCE
 GROUP BY CUBE (state, product_line, policy_year);
With 22 states, 8 product lines, and 8 policy years, CUBE produced 2^3 = 8 grouping levels, but the cardinality within each level was the issue: 22 x 8 x 8 = 1,408 detail rows, plus all the subtotals. The query itself was not slow — the FACT_LOSS_EXPERIENCE table was pre-aggregated and only had about 50,000 rows.
The problem was downstream. The regulatory filing required specific combinations that did not match the CUBE output:
- State-level totals (all product lines, all years) -- provided by CUBE
- Product-level totals (all states, all years) -- provided by CUBE
- State + product (all years) -- provided by CUBE
- But not state + year (all products) -- the filing did not require this
- But not product + year (all states) -- not required
- But not year-level totals (all states, all products) -- not required
CUBE produced 3 grouping combinations that were not needed, wasting computation and cluttering the output. Worse, the finance team had to write post-processing logic to filter out the unwanted combinations.
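A quick way to see the surplus is to enumerate the grouping sets directly. This sketch (plain Python; column names as in the query above) lists the 2^3 subsets that CUBE generates and subtracts the five the filing actually needs:

```python
from itertools import combinations

dims = ("state", "product_line", "policy_year")

# CUBE(a, b, c) groups by every subset of the columns: 2**3 = 8 grouping sets.
cube_sets = {frozenset(c) for r in range(len(dims) + 1)
             for c in combinations(dims, r)}

# The filing requires only five of them.
needed = {frozenset(s) for s in [
    ("state", "product_line", "policy_year"),  # full detail
    ("state", "product_line"),                 # state + product across years
    ("state",),                                # state total
    ("product_line",),                         # product total
    (),                                        # grand total
]}

unneeded = cube_sets - needed
print(len(cube_sets), len(unneeded))  # 8 3
```

The three surplus sets are (state, policy_year), (product_line, policy_year), and (policy_year) alone.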
Challenge 3: Multi-Pass Aggregation for Reserve Analysis
The actuarial team needed a "triangle" report — a standard insurance analysis that shows how claim payments develop over time. For each policy year and development month, they needed cumulative paid amounts. This required aggregating CLAIM_PAYMENTS (11 million rows) by policy year and development period, then computing running cumulative sums.
The original approach used a correlated subquery:
SELECT a.policy_year,
       a.development_month,
       (SELECT SUM(cp.amount)
          FROM CLAIM_PAYMENTS cp
          JOIN CLAIMS c ON cp.claim_id = c.claim_id
         WHERE YEAR(c.policy_effective_date) = a.policy_year
           AND MONTHS_BETWEEN(cp.payment_date, c.loss_date) <= a.development_month
       ) AS cumulative_paid
  FROM (
        SELECT DISTINCT policy_year, development_month
          FROM DEVELOPMENT_PERIODS
       ) a
 ORDER BY a.policy_year, a.development_month;
This was catastrophically slow — each row in the outer query triggered a full aggregation of the 11-million-row payments table. With 8 policy years and 96 development months, that was 768 separate aggregation passes.
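The cost model is worth making explicit. Using the row counts quoted in the text, a rough back-of-the-envelope comparison:

```python
# Approximate rows touched by each plan, using the figures from the case study.
payments = 11_000_000          # rows in CLAIM_PAYMENTS
cells = 8 * 96                 # policy years x development months = 768 outer rows

correlated_cost = cells * payments   # one full aggregation pass per outer row
single_pass_cost = payments          # one GROUP BY pass, then a tiny window step

print(f"{correlated_cost:,} vs {single_pass_cost:,} rows touched")
print(f"{correlated_cost // single_pass_cost}x more work")  # 768x
```

This is only a row-count sketch (it ignores indexes and caching), but it explains the 40-minute runtime: the correlated plan does the same 11-million-row aggregation 768 times.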
Solutions
Solution to Challenge 1: A Single Source of Truth
The data engineering team created a canonical set of aggregate building blocks as views:
-- Canonical claims aggregate: one row per policy-year
CREATE VIEW V_CLAIMS_BY_POLICY_YEAR AS
SELECT YEAR(p.effective_date) AS policy_year,
       p.state_code,
       p.product_line,
       COUNT(DISTINCT c.claim_id) AS claim_count,
       -- Three versions of claim amount, clearly named
       SUM(c.paid_amount) AS paid_amount_excl_pending,
       SUM(COALESCE(c.paid_amount, c.reserve_amount)) AS paid_amount_incl_reserves,
       SUM(COALESCE(c.paid_amount, 0)) AS paid_amount_nulls_as_zero,
       -- COUNT(c.claim_id) rather than COUNT(*): a LEFT JOIN row for a
       -- policy with no claims must not be counted as a pending claim
       COUNT(c.claim_id) - COUNT(c.paid_amount) AS pending_claim_count
  FROM POLICIES p
  LEFT JOIN CLAIMS c ON p.policy_id = c.policy_id
                    AND c.status IN ('O', 'C')
 GROUP BY YEAR(p.effective_date), p.state_code, p.product_line;
By computing all three NULL-handling variants in one view and naming them clearly, every downstream report could pick the appropriate column. No more implicit NULL behavior surprises. The view also tracked how many claims were still pending, making the NULL handling transparent to consumers.
The team established a rule: loss ratios are always computed from totals, not as averages of ratios, unless explicitly requested. The canonical formula became:
CASE WHEN SUM(premium) > 0
     THEN DECIMAL(SUM(claims) * 100.0 / SUM(premium), 7, 2)
     ELSE NULL
END AS loss_ratio_pct
Solution to Challenge 2: GROUPING SETS Instead of CUBE
The team replaced the CUBE with explicit GROUPING SETS that matched the regulatory filing requirements exactly:
SELECT CASE GROUPING(f.state_code)
            WHEN 1 THEN 'ALL'
            ELSE f.state_code
       END AS state,
       CASE GROUPING(f.product_line)
            WHEN 1 THEN 'ALL'
            ELSE f.product_line
       END AS product,
       CASE GROUPING(f.policy_year)
            WHEN 1 THEN 'ALL'
            ELSE CAST(f.policy_year AS VARCHAR(4))
       END AS policy_year,
       SUM(f.premium) AS total_premium,
       SUM(f.claims_paid) AS total_claims,
       CASE WHEN SUM(f.premium) > 0
            THEN DECIMAL(SUM(f.claims_paid) * 100.0 / SUM(f.premium), 7, 2)
            ELSE NULL
       END AS loss_ratio_pct
  FROM FACT_LOSS_EXPERIENCE f
 GROUP BY GROUPING SETS (
       (f.state_code, f.product_line, f.policy_year),  -- full detail
       (f.state_code, f.product_line),                 -- state+product across years
       (f.state_code),                                 -- state total
       (f.product_line),                               -- product total
       ()                                              -- grand total
 )
 ORDER BY GROUPING(f.state_code), f.state_code,
          GROUPING(f.product_line), f.product_line,
          GROUPING(f.policy_year), f.policy_year;
This produced exactly the 5 grouping levels the filing required, with no wasted combinations. The GROUPING() function labeled summary rows, and the output mapped directly to the filing template.
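The relationship between GROUPING SETS and plain GROUP BY is worth seeing spelled out: a GROUPING SETS query is logically equivalent to a UNION ALL of one GROUP BY per set. The toy example below (SQLite, with an invented table and values, since SQLite lacks GROUPING SETS) builds the same five result sets by hand, with the literal 'ALL' standing in for what GROUPING() labels in the DB2 query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fact (state TEXT, product TEXT, yr INTEGER, premium REAL)")
rows = [(s, p, y, 100.0)
        for s in ("AL", "GA") for p in ("auto", "home") for y in (2021, 2022)]
conn.executemany("INSERT INTO fact VALUES (?, ?, ?, ?)", rows)

# One UNION ALL branch per grouping set; 'ALL' marks a rolled-up dimension.
result = conn.execute("""
    SELECT state, product, CAST(yr AS TEXT), SUM(premium) FROM fact GROUP BY 1, 2, 3
    UNION ALL
    SELECT state, product, 'ALL', SUM(premium) FROM fact GROUP BY 1, 2
    UNION ALL
    SELECT state, 'ALL', 'ALL', SUM(premium) FROM fact GROUP BY 1
    UNION ALL
    SELECT 'ALL', product, 'ALL', SUM(premium) FROM fact GROUP BY 2
    UNION ALL
    SELECT 'ALL', 'ALL', 'ALL', SUM(premium) FROM fact
""").fetchall()

print(len(result))  # 8 detail + 4 state+product + 2 state + 2 product + 1 grand = 17
```

The UNION ALL form works but scans the table five times; native GROUPING SETS lets the optimizer share a single pass, which is why the DB2 version is preferable where available.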
Solution to Challenge 3: Aggregate-Then-Cumulate
The data engineering team restructured the triangle report in two stages.
Stage 1: Pre-aggregate with a simple GROUP BY.
CREATE TABLE CLAIM_DEVELOPMENT_SUMMARY AS (
    SELECT YEAR(c.policy_effective_date) AS policy_year,
           MONTHS_BETWEEN(cp.payment_date, c.loss_date) AS dev_month,
           SUM(cp.amount) AS incremental_paid,
           COUNT(*) AS payment_count
      FROM CLAIM_PAYMENTS cp
      JOIN CLAIMS c ON cp.claim_id = c.claim_id
     WHERE cp.status = 'C'
     GROUP BY YEAR(c.policy_effective_date),
              MONTHS_BETWEEN(cp.payment_date, c.loss_date)
) DATA INITIALLY DEFERRED REFRESH DEFERRED;
REFRESH TABLE CLAIM_DEVELOPMENT_SUMMARY;
This MQT reduced 11 million payment rows to approximately 600 rows (8 years x ~75 active development periods). The aggregation runs once during the nightly batch.
Stage 2: Compute cumulative sums using a window function on the small MQT.
SELECT policy_year,
       dev_month,
       incremental_paid,
       SUM(incremental_paid) OVER (
           PARTITION BY policy_year
           ORDER BY dev_month
           ROWS UNBOUNDED PRECEDING
       ) AS cumulative_paid
  FROM CLAIM_DEVELOPMENT_SUMMARY
 ORDER BY policy_year, dev_month;
The cumulative sum runs against 600 rows instead of 11 million. Total execution time: under 1 second. The original correlated subquery approach took over 40 minutes.
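Stage 2 behaves the same on any engine with window functions. A self-contained sketch (SQLite 3.25+, with a handful of invented summary rows) shows the same running-total logic:

```python
import sqlite3

# Toy stand-in for CLAIM_DEVELOPMENT_SUMMARY: a few incremental-paid rows.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE summary
                (policy_year INTEGER, dev_month INTEGER, incremental_paid REAL)""")
conn.executemany("INSERT INTO summary VALUES (?, ?, ?)", [
    (2021, 12, 100.0), (2021, 24, 50.0), (2021, 36, 25.0),
    (2022, 12, 200.0), (2022, 24, 80.0),
])

# Running total per policy year: each year's partition restarts the sum.
rows = conn.execute("""
    SELECT policy_year, dev_month, incremental_paid,
           SUM(incremental_paid) OVER (
               PARTITION BY policy_year
               ORDER BY dev_month
               ROWS UNBOUNDED PRECEDING
           ) AS cumulative_paid
    FROM summary
    ORDER BY policy_year, dev_month
""").fetchall()

for r in rows:
    print(r)
# (2021, 12, 100.0, 100.0) ... (2021, 36, 25.0, 175.0) ... (2022, 24, 80.0, 280.0)
```

The cumulative column never resets within a year and restarts at each new policy year, which is exactly the triangle shape the actuaries need.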
Key Takeaways
- NULL handling in aggregation is a business decision, not a technical one. Different treatments of NULLs (exclude, substitute, zero-fill) produce different numbers. Make the business decision explicit, name the variants clearly, and document which variant each report uses.
- Averaging ratios is not the same as computing the ratio of totals. This is a mathematical fact that catches even experienced analysts. Establish a convention (usually ratio of totals) and enforce it.
- GROUPING SETS is more precise than CUBE. CUBE is convenient but produces every possible combination. When a report needs specific combinations, GROUPING SETS avoids wasted computation and simplifies post-processing.
- Two-stage aggregation (pre-aggregate then re-aggregate) is the fundamental data warehouse pattern. Build MQTs or summary tables at an intermediate grain, then run reporting queries against those summaries instead of the raw fact tables. This transforms query times from minutes to seconds.
- Correlated subqueries with aggregation scale multiplicatively. If the outer query produces N rows and each triggers a full scan of M rows, the cost is O(N x M). Replace this pattern with a single GROUP BY pass followed by a window function or a join.
Discussion Questions
- CIG's actuarial, finance, and management teams each had legitimate reasons for their different NULL treatments. How would you design a reporting layer that serves all three without duplicating logic or confusing consumers?
- The regulatory filing requires specific grouping levels. If the regulator changes the filing format to require additional dimensions, how much effort is involved in updating the GROUPING SETS query versus how much would be involved if the team had used raw CUBE and post-processing?
- The CLAIM_DEVELOPMENT_SUMMARY MQT uses REFRESH DEFERRED, meaning it must be explicitly refreshed. What would happen if the nightly refresh job failed silently? How would you detect stale MQTs and alert the operations team?