Case Study 2: Uncertainty Quantification in Climate Projections

What Does "90% Confident" Mean for a 2050 Temperature Prediction?


Background

The Pacific Climate Research Consortium (PCRC) is tasked with providing regional climate projections to inform infrastructure planning for coastal municipalities. The central question: What is the probability distribution of 2050 global mean temperature increase relative to the pre-industrial baseline, and how should this uncertainty be communicated to policymakers?

PCRC aggregates projections from the Coupled Model Intercomparison Project Phase 6 (CMIP6), an international collaboration in which 49 modeling groups run climate simulations under standardized greenhouse gas concentration scenarios. Under the SSP2-4.5 ("middle of the road") scenario, 30 models produce projections for the 2041-2060 period.

The methodological challenge: these 30 models are not independent random samples from a population. They share code components, parameterizations, and calibration data. Some models are closely related "cousins" from the same modeling center. The standard statistical assumption of i.i.d. sampling is violated.

The Data

import numpy as np
from scipy import stats

# CMIP6 model projections: 2050 global mean temperature increase (°C)
# relative to 1850-1900 baseline, SSP2-4.5 scenario
# (Simplified from IPCC AR6 WG1 Table 4.5)
model_projections = np.array([
    1.82, 2.14, 2.31, 2.05, 2.48, 1.93, 2.71, 2.19, 2.43, 2.61,
    3.02, 2.78, 2.12, 2.34, 1.74, 2.87, 2.52, 2.23, 1.98, 2.39,
    3.18, 2.63, 2.27, 2.08, 2.76, 1.65, 2.41, 2.69, 2.53, 2.21,
])

n_models = len(model_projections)
print(f"Number of models: {n_models}")
print(f"Mean projection:  {model_projections.mean():.2f}°C")
print(f"Std deviation:    {model_projections.std(ddof=1):.2f}°C")
print(f"Range:            [{model_projections.min():.2f}, {model_projections.max():.2f}]°C")
print(f"Median:           {np.median(model_projections):.2f}°C")
print(f"IQR:              [{np.percentile(model_projections, 25):.2f}, "
      f"{np.percentile(model_projections, 75):.2f}]°C")

Step 1: The Gaussian Assumption and Its Limits

The simplest approach treats the model projections as a random sample from a Gaussian distribution.

import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# MLE for Gaussian parameters
mu_hat = model_projections.mean()
sigma_hat = model_projections.std(ddof=1)

# Confidence interval for the mean (t-distribution, accounting for unknown sigma)
t_crit = stats.t.ppf(0.95, df=n_models - 1)  # 0.95 quantile -> two-sided 90% interval
se = sigma_hat / np.sqrt(n_models)

ci_90_mean = (mu_hat - t_crit * se, mu_hat + t_crit * se)

# Prediction interval: where will the NEXT model's projection fall?
# (accounts for both parameter uncertainty and data variability)
pi_half_width = t_crit * sigma_hat * np.sqrt(1 + 1 / n_models)
pi_90 = (mu_hat - pi_half_width, mu_hat + pi_half_width)

# Probability of exceeding thresholds
p_exceed_2 = 1 - stats.norm.cdf(2.0, loc=mu_hat, scale=sigma_hat)
p_exceed_2_5 = 1 - stats.norm.cdf(2.5, loc=mu_hat, scale=sigma_hat)
p_exceed_3 = 1 - stats.norm.cdf(3.0, loc=mu_hat, scale=sigma_hat)

print(f"Gaussian fit: μ = {mu_hat:.3f}°C, σ = {sigma_hat:.3f}°C")
print(f"90% CI for mean:           [{ci_90_mean[0]:.2f}, {ci_90_mean[1]:.2f}]°C")
print(f"90% prediction interval:   [{pi_90[0]:.2f}, {pi_90[1]:.2f}]°C")
print()
print("Exceedance probabilities (Gaussian model):")
print(f"  P(ΔT > 2.0°C) = {p_exceed_2:.3f}")
print(f"  P(ΔT > 2.5°C) = {p_exceed_2_5:.3f}")
print(f"  P(ΔT > 3.0°C) = {p_exceed_3:.3f}")

Common Misconception: "The 90% confidence interval for the mean [2.25, 2.48] tells us that there is a 90% chance the temperature increase will be between 2.25°C and 2.48°C." This confuses two distinct quantities. The confidence interval for the mean tells us where the average of all models' projections likely falls. The prediction interval [1.73, 3.01] tells us where an individual projection might fall. For policy purposes, the prediction interval (or the full distribution) is usually more relevant, because the policy question is about the actual temperature, not the average of models.
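
This coverage difference can be checked directly by simulation. The sketch below uses synthetic Gaussian data whose parameters loosely echo the ensemble (illustrative values, not the CMIP6 analysis itself) and counts how often each interval captures its target:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu_true, sigma_true, n = 2.37, 0.37, 30   # illustrative "true" values
t90 = stats.t.ppf(0.95, df=n - 1)

n_trials = 20_000
ci_covers_mean = ci_covers_draw = pi_covers_draw = 0
for _ in range(n_trials):
    x = rng.normal(mu_true, sigma_true, size=n)
    m, s = x.mean(), x.std(ddof=1)
    half_ci = t90 * s / np.sqrt(n)               # CI half-width (for the mean)
    half_pi = t90 * s * np.sqrt(1 + 1 / n)       # PI half-width (for a new draw)
    new_draw = rng.normal(mu_true, sigma_true)   # the "next model's" projection
    ci_covers_mean += abs(m - mu_true) <= half_ci
    ci_covers_draw += abs(m - new_draw) <= half_ci
    pi_covers_draw += abs(m - new_draw) <= half_pi

print(f"CI contains the true mean: {ci_covers_mean / n_trials:.1%}")  # close to 90%
print(f"CI contains a new draw:    {ci_covers_draw / n_trials:.1%}")  # far below 90%
print(f"PI contains a new draw:    {pi_covers_draw / n_trials:.1%}")  # close to 90%
```

Read as a statement about the next projection, the confidence interval is badly miscalibrated; the prediction interval is the one with approximately 90% coverage for individual draws.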

Step 2: Bootstrap Confidence Intervals

Because the Gaussian assumption may not hold, we use the bootstrap to construct distribution-free confidence intervals.

import numpy as np

def bootstrap_climate_analysis(
    projections: np.ndarray,
    n_bootstrap: int = 50_000,
    random_state: int = 42,
) -> dict[str, object]:
    """Bootstrap analysis of climate model projections.

    Args:
        projections: Array of model projections (°C).
        n_bootstrap: Number of bootstrap resamples.
        random_state: Random seed.

    Returns:
        Dictionary with bootstrap estimates and CIs.
    """
    rng = np.random.RandomState(random_state)
    n = len(projections)

    boot_means = np.zeros(n_bootstrap)
    boot_medians = np.zeros(n_bootstrap)
    boot_p90 = np.zeros(n_bootstrap)
    boot_p_exceed_2_5 = np.zeros(n_bootstrap)
    boot_iqr = np.zeros(n_bootstrap)

    for b in range(n_bootstrap):
        sample = projections[rng.choice(n, size=n, replace=True)]
        boot_means[b] = sample.mean()
        boot_medians[b] = np.median(sample)
        boot_p90[b] = np.percentile(sample, 90)
        boot_p_exceed_2_5[b] = np.mean(sample > 2.5)
        boot_iqr[b] = np.percentile(sample, 75) - np.percentile(sample, 25)

    results = {}
    for name, boot_dist in [
        ("mean", boot_means),
        ("median", boot_medians),
        ("90th_percentile", boot_p90),
        ("P(>2.5°C)", boot_p_exceed_2_5),
        ("IQR", boot_iqr),
    ]:
        results[name] = {
            "estimate": np.mean(boot_dist),
            "ci_5": np.percentile(boot_dist, 5),
            "ci_95": np.percentile(boot_dist, 95),
        }

    return results

results = bootstrap_climate_analysis(model_projections)

print("Bootstrap 90% Confidence Intervals:\n")
print(f"{'Statistic':<20s} {'Estimate':>10s} {'90% CI':>25s}")
print("-" * 60)
for name, vals in results.items():
    print(f"{name:<20s} {vals['estimate']:>10.3f} "
          f"[{vals['ci_5']:>8.3f}, {vals['ci_95']:>8.3f}]")

Step 3: The Independence Problem

The standard bootstrap assumes the data points are i.i.d. — but climate models are not independent. Models from the same institution share structural components. Some use identical ocean or atmosphere sub-models. This dependence structure means the effective sample size is smaller than 30.

import numpy as np
from scipy import stats

# Model "families" — models sharing significant code or calibration
# (Simplified from Knutti et al., 2013 model genealogy)
model_families = {
    "NCAR": [0, 1],      # indices of models from NCAR
    "GFDL": [2, 3, 4],
    "UKMO": [5, 6],
    "IPSL": [7, 8, 9],
    "MPI": [10, 11],
    "ECMWF": [12, 13],
    "CNRM": [14, 15],
    "MIROC": [16, 17, 18],
    "CESM": [19, 20, 21],
    "GISS": [22, 23],
    "BCC": [24, 25],
    "CanESM": [26, 27, 28, 29],
}

n_families = len(model_families)
print(f"Number of model families:      {n_families}")
print(f"Number of individual models:   {n_models}")

# Family-averaged projections (one "independent" data point per family)
family_means = []
for family_name, indices in model_families.items():
    family_mean = model_projections[indices].mean()
    family_means.append(family_mean)
    n_members = len(indices)
    print(f"  {family_name:>8s}: {n_members} models, "
          f"mean = {family_mean:.2f}°C")

family_means = np.array(family_means)
print(f"\nFamily-averaged mean: {family_means.mean():.3f}°C")
print(f"Family-averaged std:  {family_means.std(ddof=1):.3f}°C")

# CI using family means (more conservative, accounts for dependence)
se_family = family_means.std(ddof=1) / np.sqrt(n_families)
ci_90_family = (
    family_means.mean() - stats.t.ppf(0.95, df=n_families - 1) * se_family,
    family_means.mean() + stats.t.ppf(0.95, df=n_families - 1) * se_family,
)
print(f"90% CI (family-averaged): [{ci_90_family[0]:.2f}, {ci_90_family[1]:.2f}]°C")
print(f"90% CI (all models):      [{ci_90_mean[0]:.2f}, {ci_90_mean[1]:.2f}]°C")
print(f"\nThe family-averaged CI is wider — correctly reflecting the")
print(f"reduced effective sample size due to model dependence.")
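
A complementary remedy that keeps all 30 data points is a cluster (block) bootstrap: resample whole families with replacement, so each resample preserves the within-family dependence. A minimal sketch, restating the data so the block is self-contained (the grouping is the simplified genealogy above):

```python
import numpy as np

# Same projections as above, with family index lists
model_projections = np.array([
    1.82, 2.14, 2.31, 2.05, 2.48, 1.93, 2.71, 2.19, 2.43, 2.61,
    3.02, 2.78, 2.12, 2.34, 1.74, 2.87, 2.52, 2.23, 1.98, 2.39,
    3.18, 2.63, 2.27, 2.08, 2.76, 1.65, 2.41, 2.69, 2.53, 2.21,
])
families = [
    [0, 1], [2, 3, 4], [5, 6], [7, 8, 9], [10, 11], [12, 13],
    [14, 15], [16, 17, 18], [19, 20, 21], [22, 23], [24, 25],
    [26, 27, 28, 29],
]

rng = np.random.default_rng(42)
n_fam = len(families)
boot_means = np.empty(20_000)
for b in range(boot_means.size):
    # Resample whole families with replacement; keep their members together
    picked = rng.integers(n_fam, size=n_fam)
    sample = np.concatenate([model_projections[families[i]] for i in picked])
    boot_means[b] = sample.mean()

lo, hi = np.percentile(boot_means, [5, 95])
print(f"Cluster-bootstrap 90% CI for the mean: [{lo:.2f}, {hi:.2f}]°C")
```

Comparing this interval with the naive per-model bootstrap shows directly how much the dependence structure matters for this ensemble; with strong within-family correlation the cluster interval would be markedly wider.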

Research Insight: Abramowitz et al. (2019) and Knutti et al. (2017) argue that the CMIP multi-model ensemble should not be treated as a probabilistic sample from a population. Models were not designed to sample the space of possible climate responses — they were designed to be the best possible simulation of the climate system. The spread of the ensemble reflects structural uncertainty in modeling choices, not sampling variability. This means standard bootstrap confidence intervals may not have their nominal coverage properties. PCRC addresses this by reporting both the naive and family-averaged uncertainty, and by supplementing the ensemble analysis with observational constraints.

Step 4: Communicating Uncertainty to Policymakers

The PCRC must translate statistical uncertainty into language that infrastructure planners can act on. This requires careful attention to what different uncertainty statements mean.

import numpy as np
from scipy import stats

# Summary for policy communication
print("=" * 65)
print("PACIFIC CLIMATE RESEARCH CONSORTIUM")
print("Regional Climate Projection Summary: SSP2-4.5, 2041-2060")
print("=" * 65)
print()

# Best estimate and uncertainty range
print("TEMPERATURE INCREASE RELATIVE TO PRE-INDUSTRIAL BASELINE")
print()
print(f"  Best estimate (multi-model mean):     {mu_hat:.1f}°C")
print(f"  Likely range (66% of models):         "
      f"[{np.percentile(model_projections, 17):.1f}, "
      f"{np.percentile(model_projections, 83):.1f}]°C")
print(f"  Very likely range (90% of models):    "
      f"[{np.percentile(model_projections, 5):.1f}, "
      f"{np.percentile(model_projections, 95):.1f}]°C")
print()

# Risk-relevant probabilities
print("RISK-RELEVANT PROBABILITIES")
print()
thresholds = [1.5, 2.0, 2.5, 3.0]
for t in thresholds:
    p_exceed = np.mean(model_projections > t)
    # Percentile-bootstrap uncertainty on the probability itself
    rng = np.random.RandomState(42)
    boot_probs = np.array([
        np.mean(model_projections[rng.choice(n_models, size=n_models, replace=True)] > t)
        for _ in range(10_000)
    ])
    p_lo, p_hi = np.percentile(boot_probs, [5, 95])

    descriptor = ""
    if p_exceed > 0.9:
        descriptor = "(very likely)"
    elif p_exceed > 0.66:
        descriptor = "(likely)"
    elif p_exceed > 0.33:
        descriptor = "(about as likely as not)"
    elif p_exceed > 0.1:
        descriptor = "(unlikely)"
    else:
        descriptor = "(very unlikely)"

    print(f"  P(ΔT > {t}°C) = {p_exceed:.0%}  "
          f"[90% CI: {p_lo:.0%} - {p_hi:.0%}]  {descriptor}")

print()
print("METHODOLOGICAL NOTES")
print("  - Based on 30 CMIP6 models under SSP2-4.5")
print("  - Uncertainty ranges account for model spread, not")
print("    observational constraints or scenario uncertainty")
print("  - Likelihood language follows IPCC AR6 calibrated scale")

Step 5: The Interpretation Gap

The deepest challenge in this case study is not computational but interpretive. When PCRC reports "P(temperature increase > 2.5°C) = 37%", what does this mean?

The frequentist interpretation is awkward: "If we could somehow re-run Earth's climate many times under SSP2-4.5, 37% of the time the temperature increase would exceed 2.5°C." This does not make physical sense — there is one Earth, one climate trajectory.

The Bayesian interpretation is more natural: "Given our current understanding of climate physics, as encoded in the 30 models, we assign a 37% degree of belief to the temperature increase exceeding 2.5°C." But this raises the question of whose belief, and what information is conditioned on.

The ensemble interpretation (most common in climate science) is pragmatic: "37% of the models project an increase exceeding 2.5°C." This avoids the philosophical question but introduces a different problem: the models are not equally plausible, and their spread may not correctly represent the true uncertainty.

import numpy as np

# Demonstrating the interpretation gap with a concrete example
print("THE INTERPRETATION GAP")
print("=" * 50)
print()
print("A city planner asks: 'Should I design the seawall")
print("for a 2.5°C or 3.0°C scenario?'\n")

# Decision under uncertainty
seawall_costs = {
    "2.0°C design": {"cost": 50, "damage_if_2.5": 200, "damage_if_3.0": 500},
    "2.5°C design": {"cost": 80, "damage_if_2.5": 0, "damage_if_3.0": 300},
    "3.0°C design": {"cost": 120, "damage_if_2.5": 0, "damage_if_3.0": 0},
}

# Using our estimated probabilities
p_below_2_5 = np.mean(model_projections <= 2.5)
p_between_2_5_3 = np.mean((model_projections > 2.5) & (model_projections <= 3.0))
p_above_3 = np.mean(model_projections > 3.0)

print(f"Probability estimates:")
print(f"  P(ΔT ≤ 2.5°C)     = {p_below_2_5:.2f}")
print(f"  P(2.5 < ΔT ≤ 3.0) = {p_between_2_5_3:.2f}")
print(f"  P(ΔT > 3.0°C)     = {p_above_3:.2f}")
print()

print(f"{'Design':<15s} {'Build Cost':>12s} {'Expected Damage':>17s} {'Total Expected':>16s}")
print("-" * 65)
for design, costs in seawall_costs.items():
    expected_damage = (
        p_between_2_5_3 * costs.get("damage_if_2.5", 0) +
        p_above_3 * costs.get("damage_if_3.0", 0)
    )
    total = costs["cost"] + expected_damage
    print(f"{design:<15s} ${costs['cost']:>10d}M ${expected_damage:>15.0f}M ${total:>14.0f}M")

print()
print("The optimal design depends on the probability estimates.")
print("Underestimating uncertainty leads to underinvestment in")
print("resilience. This is why rigorous uncertainty quantification")
print("— not just point estimates — matters for policy decisions.")
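
To make that dependence concrete, here is a hypothetical sensitivity sweep: the cost figures are the illustrative ones above, P(2.5 < ΔT ≤ 3.0) is pinned to an arbitrary 0.13, and only the tail probability P(ΔT > 3.0°C) varies:

```python
# Hypothetical sensitivity sweep (illustrative numbers, not ensemble output)
designs = {
    "2.0°C design": (50, 200, 500),   # (build cost, damage if 2.5<dT<=3.0, damage if dT>3.0)
    "2.5°C design": (80, 0, 300),
    "3.0°C design": (120, 0, 0),
}
p_mid = 0.13  # arbitrary fixed value of P(2.5 < dT <= 3.0), for illustration

optima = {}
for p_hi in (0.01, 0.05, 0.20):
    totals = {
        name: build + p_mid * dmg_mid + p_hi * dmg_hi
        for name, (build, dmg_mid, dmg_hi) in designs.items()
    }
    optima[p_hi] = min(totals, key=totals.get)
    print(f"P(ΔT > 3.0°C) = {p_hi:.2f} → optimal: {optima[p_hi]}")
```

As the tail probability grows from 1% to 20%, the expected-cost-minimizing choice flips from the 2.0°C design to the 2.5°C design to the 3.0°C design, which is exactly why the width of the tail-probability interval matters.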

Analysis and Discussion

Key findings:

  1. The multi-model mean (2.37°C) is more certain than any individual projection. The 90% CI for the mean is approximately [2.25, 2.48], while individual model projections span [1.65, 3.18]. The distinction between parameter uncertainty and predictive uncertainty (Section 3.11) is critical for correct interpretation.

  2. Model dependence widens the honest uncertainty range. When model families are averaged to approximate independent samples, the 90% confidence interval widens (here by roughly 10-15%), even though family averaging reduces the scatter of the points: the effective sample size drops from 30 models to 12 families. Ignoring model dependence leads to overconfident uncertainty estimates.

  3. Bootstrap confidence intervals for tail probabilities are wide. $P(\Delta T > 3.0°\text{C})$ is estimated at roughly 7% (2 of 30 models), but the bootstrap 90% CI for this probability is approximately [0%, 13%]. Tail probabilities are inherently harder to estimate from small samples — a direct consequence of the $1/\sqrt{N}$ convergence rate of Monte Carlo estimation (Section 3.12).

  4. The choice of statistical framework matters for communication. The Bayesian interpretation ("37% degree of belief") is more intuitive for policymakers than the frequentist interpretation ("37% of the time in repeated Earths"). But the ensemble interpretation ("37% of models") is the most transparent about what was actually computed. Honest communication requires stating the interpretation alongside the number.
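
The width of the tail-probability interval in finding 3 follows from the binomial standard error: an exceedance fraction estimated from $N$ models has standard error $\sqrt{p(1-p)/N}$. A quick sketch with illustrative values (the helper name is ours):

```python
import math

def tail_prob_se(p: float, n: int) -> float:
    """Standard error of an empirical exceedance probability p estimated from n draws."""
    return math.sqrt(p * (1 - p) / n)

# With only 30 models, a tail probability near 7% carries an SE of
# roughly 5 percentage points, comparable to the estimate itself.
for n in (30, 100, 1000):
    print(f"N = {n:>4d}: SE(p_hat) = {tail_prob_se(0.07, n):.3f}")
```

Quadrupling the ensemble size only halves this standard error, which is why tail probabilities stay uncertain even in large multi-model ensembles.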

Production Reality: The IPCC Assessment Reports use a calibrated uncertainty language that maps numerical probabilities to verbal descriptors (virtually certain: >99%, very likely: >90%, likely: >66%, about as likely as not: 33-66%, unlikely: <33%, very unlikely: <10%, exceptionally unlikely: <1%). This framework explicitly acknowledges that probabilities in climate science are not the same as frequencies in a casino — they encode expert judgment about complex physical systems. The statistical tools from this chapter (bootstrap, Monte Carlo, confidence intervals) provide the computational machinery, but the interpretation requires domain judgment.
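
The calibrated scale lends itself to a small lookup helper. This sketch encodes the AR6 thresholds listed above (the function name is ours, not an IPCC artifact):

```python
def ipcc_descriptor(p: float) -> str:
    """Map a probability to IPCC AR6 calibrated likelihood language."""
    scale = [
        (0.99, "virtually certain"),
        (0.90, "very likely"),
        (0.66, "likely"),
        (0.33, "about as likely as not"),
        (0.10, "unlikely"),
        (0.01, "very unlikely"),
    ]
    for threshold, label in scale:
        if p > threshold:
            return label
    return "exceptionally unlikely"

print(ipcc_descriptor(0.37))   # about as likely as not
print(ipcc_descriptor(0.95))   # very likely
```

Ordering the thresholds from strongest to weakest means each probability gets the strongest descriptor it qualifies for, matching how the IPCC applies the scale.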

Connection to the Chapter

This case study demonstrates:

- Gaussian MLE and its limitations: fitting a Gaussian to model projections and recognizing when the assumptions fail (Section 3.7)
- Bootstrap confidence intervals: distribution-free uncertainty quantification for arbitrary statistics, including tail probabilities (Section 3.12)
- Confidence vs. credible vs. prediction intervals: the critical distinction between uncertainty about the mean, uncertainty about a parameter, and uncertainty about a future observation (Sections 3.10, 3.11)
- The frequentist-Bayesian tension in practice: different interpretations of "90% confident" have real consequences for policy decisions (Section 3.10)
- Prediction is not Causation (Theme 2): the models predict temperature given a scenario, but the policy question is causal — "if we reduce emissions (intervention), what happens to temperature?" This distinction becomes central in Part III.