Case Study 1: Pacific Climate Research Consortium — Probabilistic Regional Temperature Forecasting for 2030-2050

Context

The Pacific Climate Research Consortium (PCRC) is a collaboration of three universities and two national meteorological agencies tasked with producing regional climate projections for the Pacific Northwest. Their mandate is specific: provide probabilistic temperature forecasts for 2030-2050 that policymakers can use to guide infrastructure investment, agricultural planning, and public health preparedness.

The challenge is not prediction accuracy in the traditional sense — no one expects climate models to predict the temperature on a specific day in 2040. The challenge is uncertainty quantification: producing honest probability statements about long-term temperature trajectories. A city planning engineer needs to know the probability that summer peak temperatures exceed 42°C (the threshold above which asphalt road surfaces degrade) by 2045. A water resource manager needs the probability that winter snowpack falls below 60% of historical averages. These are questions about the tails of predictive distributions, not about means.
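The operational form of such a question is simple once a model emits predictive samples: the tail probability is just the fraction of samples beyond the threshold. A minimal sketch with hypothetical samples (the distribution below is illustrative, not a PCRC result):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical Monte Carlo samples of 2045 summer peak temperature (degrees C).
# Any model that emits predictive samples supports this computation.
peak_temp_samples = rng.normal(loc=40.0, scale=1.5, size=10_000)

# Tail probability: fraction of samples exceeding the asphalt-degradation threshold.
p_exceed_42 = np.mean(peak_temp_samples > 42.0)
print(f"P(peak > 42C) ~ {p_exceed_42:.3f}")
```

Note that the estimate's precision depends on the number of samples in the tail, which is why tail questions demand many more posterior draws than questions about means.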

PCRC has three decades of monthly temperature anomaly data (1994-2024, 360 observations) for their target region, supplemented by global climate model (GCM) outputs from CMIP6. Their data science team will compare three modeling approaches: a Bayesian structural time series model (BSTS), a Temporal Fusion Transformer (TFT), and a weighted ensemble. The evaluation focuses on calibration and the quality of tail-probability estimates — not on point forecast accuracy.

The Data

The team constructs a monthly temperature anomaly series relative to the 1981-2010 baseline, augmented with three covariates: CO2 concentration (Mauna Loa, monthly), Pacific Decadal Oscillation (PDO) index, and a CMIP6 multi-model mean projection.

import numpy as np
import pandas as pd
from dataclasses import dataclass
from typing import List, Tuple, Dict, Optional
import matplotlib.pyplot as plt
from scipy import stats as sp_stats


def generate_pcrc_climate_data(
    n_historical: int = 360,
    n_future: int = 312,
    seed: int = 42,
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """Generate synthetic PCRC climate data: historical + future.

    Historical: 30 years of monthly temperature anomalies (1994-2024).
    Future: 26 years (2025-2050) for projection evaluation.

    The data-generating process includes:
    - Long-term warming trend (nonlinear, accelerating)
    - Annual seasonality (varying amplitude)
    - PDO-driven decadal variability
    - CO2 as a covariate (logarithmic forcing)
    - Heteroscedastic noise (higher variance in summer)

    Args:
        n_historical: Number of historical monthly observations.
        n_future: Number of future monthly observations.
        seed: Random seed for reproducibility.

    Returns:
        Tuple of (historical_df, future_df) DataFrames.
    """
    np.random.seed(seed)
    n_total = n_historical + n_future

    months = pd.date_range("1994-01-01", periods=n_total, freq="MS")
    t = np.arange(n_total) / 12.0  # Time in years

    # CO2 concentration (ppm): accelerating growth from ~358 to ~530
    co2 = 358 + 2.0 * t + 0.02 * t**2 + np.random.normal(0, 0.3, n_total)

    # PDO index: quasi-periodic ~20-year cycle
    pdo = 0.8 * np.sin(2 * np.pi * t / 22.0 + 0.5) + np.random.normal(0, 0.4, n_total)

    # CMIP6 multi-model mean (synthetic): smooth warming trajectory
    cmip6_mean = 0.03 * t + 0.001 * t**2

    # True temperature anomaly
    trend = 0.015 * t + 0.0008 * t**2  # Accelerating warming
    annual_cycle = 0.3 * np.sin(2 * np.pi * np.arange(n_total) / 12.0)
    pdo_effect = 0.15 * pdo
    co2_forcing = 0.4 * np.log(co2 / 358.0)  # Logarithmic CO2 sensitivity

    # Heteroscedastic noise: summer months (Jun-Aug) have higher variance
    month_of_year = np.array([m.month for m in months])
    noise_scale = np.where((month_of_year >= 6) & (month_of_year <= 8), 0.35, 0.20)
    noise = noise_scale * np.random.randn(n_total)

    temp_anomaly = trend + annual_cycle + pdo_effect + co2_forcing + noise

    # Introduce 3% missing values in historical period
    missing_mask = np.zeros(n_total, dtype=bool)
    missing_idx = np.random.choice(n_historical, size=int(0.03 * n_historical), replace=False)
    missing_mask[missing_idx] = True

    temp_observed = temp_anomaly.copy()
    temp_observed[missing_mask] = np.nan

    df = pd.DataFrame({
        "date": months,
        "month": month_of_year,
        "year": [m.year for m in months],
        "t_years": t,
        "temp_anomaly": temp_observed,
        "temp_true": temp_anomaly,
        "co2": co2,
        "pdo": pdo,
        "cmip6_mean": cmip6_mean,
    })

    historical = df.iloc[:n_historical].copy()
    future = df.iloc[n_historical:].copy()

    return historical, future


historical, future = generate_pcrc_climate_data()
print(f"Historical period: {historical['date'].iloc[0].date()} to {historical['date'].iloc[-1].date()}")
print(f"Future period:     {future['date'].iloc[0].date()} to {future['date'].iloc[-1].date()}")
print(f"Historical missing: {historical['temp_anomaly'].isna().sum()} / {len(historical)}")
print(f"Historical temp range: [{historical['temp_anomaly'].min():.2f}, {historical['temp_anomaly'].max():.2f}]°C")
print(f"Mean warming rate (historical): {(historical['temp_anomaly'].iloc[-12:].mean() - historical['temp_anomaly'].iloc[:12].mean()) / 30:.4f}°C/year")

Approach 1: Bayesian Structural Time Series

The BSTS model decomposes the series into trend, seasonality, and regression components, with full Bayesian inference over all parameters. This is the natural choice for the climate application because it provides posterior predictive distributions that propagate parameter uncertainty into the forecast.



@dataclass
class BayesianStructuralTS:
    """Bayesian structural time series model for climate projections.

    Components:
    - Local linear trend (level + slope)
    - Annual seasonality (Fourier, 6 harmonics)
    - Regression on CO2 and PDO
    - Observation noise

    Uses a Kalman filter for state estimation and Monte Carlo sampling
    for uncertainty quantification.

    Attributes:
        n_fourier: Number of Fourier pairs for annual seasonality.
        n_samples: Number of posterior samples for uncertainty.
    """
    n_fourier: int = 6
    n_samples: int = 500

    def _build_fourier_features(self, t_months: np.ndarray) -> np.ndarray:
        """Build annual Fourier features."""
        features = []
        for k in range(1, self.n_fourier + 1):
            features.append(np.cos(2 * np.pi * k * t_months / 12.0))
            features.append(np.sin(2 * np.pi * k * t_months / 12.0))
        return np.column_stack(features)

    def fit_and_forecast(
        self,
        y_train: np.ndarray,
        co2_train: np.ndarray,
        pdo_train: np.ndarray,
        co2_future: np.ndarray,
        pdo_future: np.ndarray,
        horizon: int,
    ) -> Dict[str, np.ndarray]:
        """Fit BSTS model and produce probabilistic forecasts.

        Uses a local linear trend Kalman filter with Fourier seasonal
        regressors and CO2/PDO covariates. Uncertainty is estimated
        via parametric bootstrap: resample model parameters from their
        approximate posterior, re-run the filter, and forecast.

        Args:
            y_train: Historical temperature anomalies (may contain NaN).
            co2_train: Historical CO2 values.
            pdo_train: Historical PDO values.
            co2_future: Future CO2 values.
            pdo_future: Future PDO values.
            horizon: Forecast horizon in months.

        Returns:
            Dictionary with keys: mean, std, samples, quantiles.
        """
        n_train = len(y_train)
        t_train = np.arange(n_train)
        t_future = np.arange(n_train, n_train + horizon)

        # Build regression features
        X_train = np.column_stack([
            self._build_fourier_features(t_train),
            np.log(co2_train / co2_train[0]),
            pdo_train,
        ])
        X_future = np.column_stack([
            self._build_fourier_features(t_future),
            np.log(co2_future / co2_train[0]),
            pdo_future,
        ])

        # Fit regression coefficients on non-missing data
        valid = ~np.isnan(y_train)
        beta, _, _, _ = np.linalg.lstsq(X_train[valid], y_train[valid], rcond=None)

        # Residuals after regression
        y_detrended = y_train - X_train @ beta
        residual_std = np.nanstd(y_detrended)

        # Fit Kalman filter to residuals (local linear trend)
        T_mat = np.array([[1.0, 1.0], [0.0, 1.0]])
        Z_mat = np.array([[1.0, 0.0]])
        R_mat = np.eye(2)
        sigma_level = residual_std * 0.3
        sigma_trend = residual_std * 0.05
        Q_mat = np.diag([sigma_level**2, sigma_trend**2])
        H_scalar = (residual_std * 0.8) ** 2

        # Run Kalman filter on detrended residuals
        a = np.array([0.0, 0.0])
        P = np.eye(2) * 1e6
        filtered_states = np.zeros((n_train, 2))
        filtered_covs = np.zeros((n_train, 2, 2))

        for t in range(n_train):
            # Predict
            a_pred = T_mat @ a
            P_pred = T_mat @ P @ T_mat.T + R_mat @ Q_mat @ R_mat.T

            if np.isnan(y_detrended[t]):
                a = a_pred
                P = P_pred
            else:
                v = y_detrended[t] - Z_mat @ a_pred
                F = Z_mat @ P_pred @ Z_mat.T + H_scalar
                K = P_pred @ Z_mat.T / float(F)
                a = a_pred + K.ravel() * float(v)
                P = P_pred - np.outer(K.ravel(), Z_mat @ P_pred)

            filtered_states[t] = a
            filtered_covs[t] = P

        # Monte Carlo forecast with parameter uncertainty
        forecast_samples = np.zeros((self.n_samples, horizon))
        regression_cov = np.linalg.inv(X_train[valid].T @ X_train[valid]) * residual_std**2

        for s in range(self.n_samples):
            # Sample regression coefficients from approximate posterior
            beta_s = np.random.multivariate_normal(beta, regression_cov)
            reg_component = X_future @ beta_s

            # Sample Kalman state forward
            a_s = np.random.multivariate_normal(filtered_states[-1], filtered_covs[-1])
            P_s = filtered_covs[-1].copy()

            for h in range(horizon):
                a_s = T_mat @ a_s + np.random.multivariate_normal(
                    np.zeros(2), Q_mat
                )
                y_state = float(Z_mat @ a_s)
                obs_noise = np.random.normal(0, np.sqrt(H_scalar))
                forecast_samples[s, h] = reg_component[h] + y_state + obs_noise

        # Compute summary statistics
        forecast_mean = forecast_samples.mean(axis=0)
        forecast_std = forecast_samples.std(axis=0)
        quantiles = np.percentile(
            forecast_samples, [5, 10, 25, 50, 75, 90, 95], axis=0
        )

        return {
            "mean": forecast_mean,
            "std": forecast_std,
            "samples": forecast_samples,
            "quantiles": quantiles,
            "quantile_levels": np.array([0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95]),
            "regression_beta": beta,
        }


# Fit BSTS and forecast
bsts = BayesianStructuralTS(n_fourier=6, n_samples=500)
bsts_result = bsts.fit_and_forecast(
    y_train=historical["temp_anomaly"].values,
    co2_train=historical["co2"].values,
    pdo_train=historical["pdo"].values,
    co2_future=future["co2"].values,
    pdo_future=future["pdo"].values,
    horizon=len(future),
)

print("=== BSTS Forecast Summary ===")
print(f"Forecast horizon: {len(future)} months ({len(future)//12} years)")
idx_2035 = 120  # Jan 2035 (future period starts Jan 2025)
print(f"Mean anomaly at 2035: {bsts_result['mean'][idx_2035]:.3f}°C")
print(f"Mean anomaly at 2050: {bsts_result['mean'][-1]:.3f}°C")
print(f"90% interval at 2035: [{bsts_result['quantiles'][0, idx_2035]:.3f}, {bsts_result['quantiles'][-1, idx_2035]:.3f}]°C")
print(f"90% interval at 2050: [{bsts_result['quantiles'][0, -1]:.3f}, {bsts_result['quantiles'][-1, -1]:.3f}]°C")
print(f"P(anomaly > 2.5°C by 2050): {np.mean(bsts_result['samples'][:, -1] > 2.5):.3f}")

Approach 2: Temporal Fusion Transformer

The TFT treats the problem as a multi-horizon forecasting task, using CO2, the PDO index, and calendar features as known future inputs. The key question is whether the TFT — trained on only 360 monthly observations — can produce well-calibrated long-horizon forecasts.

import torch
import torch.nn as nn


class ClimateForecasterTFT(nn.Module):
    """Simplified TFT-style model for climate projection.

    Uses an LSTM encoder-decoder with multi-quantile output.
    Designed for monthly climate data with covariates.

    Args:
        n_covariates: Number of input covariates per time step.
        hidden_dim: LSTM hidden dimension.
        n_quantiles: Number of output quantiles.
        encoder_length: Historical context window.
    """

    def __init__(
        self,
        n_covariates: int = 4,
        hidden_dim: int = 64,
        n_quantiles: int = 7,
        encoder_length: int = 120,
    ):
        super().__init__()
        self.encoder = nn.LSTM(
            input_size=n_covariates + 1,  # covariates + target
            hidden_size=hidden_dim,
            num_layers=2,
            dropout=0.1,
            batch_first=True,
        )
        self.decoder = nn.LSTM(
            input_size=n_covariates,  # only covariates (no future target)
            hidden_size=hidden_dim,
            num_layers=2,
            dropout=0.1,
            batch_first=True,
        )
        self.output_head = nn.Linear(hidden_dim, n_quantiles)
        self.quantiles = [0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95]

    def forward(
        self,
        x_encoder: torch.Tensor,
        y_encoder: torch.Tensor,
        x_decoder: torch.Tensor,
    ) -> torch.Tensor:
        """Forward pass.

        Args:
            x_encoder: Historical covariates, shape (batch, enc_len, n_cov).
            y_encoder: Historical target, shape (batch, enc_len, 1).
            x_decoder: Future covariates, shape (batch, dec_len, n_cov).

        Returns:
            Quantile forecasts, shape (batch, dec_len, n_quantiles).
        """
        enc_input = torch.cat([x_encoder, y_encoder], dim=-1)
        _, (h_n, c_n) = self.encoder(enc_input)
        dec_output, _ = self.decoder(x_decoder, (h_n, c_n))
        quantile_preds = self.output_head(dec_output)
        return quantile_preds

    def quantile_loss(
        self, y_true: torch.Tensor, y_pred: torch.Tensor
    ) -> torch.Tensor:
        """Multi-quantile pinball loss."""
        losses = []
        for i, tau in enumerate(self.quantiles):
            error = y_true - y_pred[:, :, i]
            loss = torch.max(tau * error, (tau - 1.0) * error)
            losses.append(loss.mean())
        return torch.stack(losses).mean()


def train_climate_tft(
    historical: pd.DataFrame,
    future: pd.DataFrame,
    encoder_length: int = 120,
    n_epochs: int = 200,
    learning_rate: float = 1e-3,
) -> Dict[str, np.ndarray]:
    """Train TFT-style model and forecast.

    Uses a sliding-window approach to create training samples from
    the historical period.

    Args:
        historical: Historical DataFrame with covariates.
        future: Future DataFrame with covariates.
        encoder_length: Context window length (months).
        n_epochs: Training epochs.
        learning_rate: Adam learning rate.

    Returns:
        Dictionary with quantile forecasts.
    """
    # Prepare features
    y = historical["temp_anomaly"].interpolate().bfill().values
    co2_all = np.concatenate([historical["co2"].values, future["co2"].values])
    pdo_all = np.concatenate([historical["pdo"].values, future["pdo"].values])

    # Normalize
    y_mean, y_std = np.nanmean(y), np.nanstd(y)
    y_norm = (y - y_mean) / y_std
    co2_mean, co2_std = co2_all.mean(), co2_all.std()
    pdo_mean, pdo_std = pdo_all.mean(), pdo_all.std()

    month_sin = np.sin(2 * np.pi * historical["month"].values / 12.0)
    month_cos = np.cos(2 * np.pi * historical["month"].values / 12.0)

    covariates_hist = np.column_stack([
        (historical["co2"].values - co2_mean) / co2_std,
        (historical["pdo"].values - pdo_mean) / pdo_std,
        month_sin,
        month_cos,
    ])

    # Create training windows
    dec_length = 60  # 5-year forecast windows for training
    X_enc_list, Y_enc_list, X_dec_list, Y_dec_list = [], [], [], []
    for start in range(0, len(y) - encoder_length - dec_length, 6):
        enc_end = start + encoder_length
        dec_end = enc_end + dec_length
        X_enc_list.append(covariates_hist[start:enc_end])
        Y_enc_list.append(y_norm[start:enc_end])
        X_dec_list.append(covariates_hist[enc_end:dec_end])
        Y_dec_list.append(y_norm[enc_end:dec_end])

    X_enc = torch.tensor(np.array(X_enc_list), dtype=torch.float32)
    Y_enc = torch.tensor(np.array(Y_enc_list), dtype=torch.float32).unsqueeze(-1)
    X_dec = torch.tensor(np.array(X_dec_list), dtype=torch.float32)
    Y_dec = torch.tensor(np.array(Y_dec_list), dtype=torch.float32)

    # Train
    model = ClimateForecasterTFT(n_covariates=4, hidden_dim=64, encoder_length=encoder_length)
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, weight_decay=1e-5)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=n_epochs)

    model.train()
    for epoch in range(n_epochs):
        optimizer.zero_grad()
        preds = model(X_enc, Y_enc, X_dec)
        loss = model.quantile_loss(Y_dec, preds)
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        scheduler.step()

    # Forecast into the future
    model.eval()
    future_months = future["month"].values
    future_sin = np.sin(2 * np.pi * future_months / 12.0)
    future_cos = np.cos(2 * np.pi * future_months / 12.0)
    covariates_future = np.column_stack([
        (future["co2"].values - co2_mean) / co2_std,
        (future["pdo"].values - pdo_mean) / pdo_std,
        future_sin,
        future_cos,
    ])

    # Use last encoder_length observations as context
    x_enc_final = torch.tensor(
        covariates_hist[-encoder_length:], dtype=torch.float32
    ).unsqueeze(0)
    y_enc_final = torch.tensor(
        y_norm[-encoder_length:], dtype=torch.float32
    ).unsqueeze(0).unsqueeze(-1)

    # Forecast in chunks of dec_length and concatenate
    n_future = len(future)
    all_quantiles = []
    chunk_start = 0
    while chunk_start < n_future:
        chunk_end = min(chunk_start + dec_length, n_future)
        chunk_len = chunk_end - chunk_start
        x_dec_chunk = torch.tensor(
            covariates_future[chunk_start:chunk_end], dtype=torch.float32
        ).unsqueeze(0)
        # Pad if last chunk is shorter
        if chunk_len < dec_length:
            pad = torch.zeros(1, dec_length - chunk_len, 4)
            x_dec_chunk = torch.cat([x_dec_chunk, pad], dim=1)

        with torch.no_grad():
            q_preds = model(x_enc_final, y_enc_final, x_dec_chunk)
        all_quantiles.append(q_preds[0, :chunk_len, :].numpy())
        chunk_start = chunk_end

    quantile_forecasts = np.concatenate(all_quantiles, axis=0) * y_std + y_mean

    return {
        "quantiles": quantile_forecasts,
        "quantile_levels": np.array(model.quantiles),
        "mean": quantile_forecasts[:, 3],  # Median (q50)
    }


tft_result = train_climate_tft(historical, future)
print("\n=== TFT Forecast Summary ===")
idx_2035 = 120  # Jan 2035 (future period starts Jan 2025)
print(f"Median anomaly at 2035: {tft_result['mean'][idx_2035]:.3f}°C")
print(f"Median anomaly at 2050: {tft_result['mean'][-1]:.3f}°C")
print(f"90% interval at 2050: [{tft_result['quantiles'][-1, 0]:.3f}, {tft_result['quantiles'][-1, 6]:.3f}]°C")

Approach 3: Calibrated Ensemble

The PCRC team combines the BSTS and TFT forecasts using a simple calibrated ensemble. The ensemble weight is chosen by grid search on calibration error over the first five years of the out-of-sample evaluation period (2025-2029).

def evaluate_calibration(
    y_true: np.ndarray,
    forecast_quantiles: np.ndarray,
    quantile_levels: np.ndarray,
) -> Dict[str, float]:
    """Compute calibration metrics for a probabilistic forecast.

    Args:
        y_true: True values, shape (n,).
        forecast_quantiles: Predicted quantiles, shape (n, n_q).
        quantile_levels: Quantile levels, shape (n_q,).

    Returns:
        Dictionary with calibration error, interval score, and coverage.
    """
    n = len(y_true)
    # Empirical coverage per quantile level
    empirical_coverage = np.array([
        np.mean(y_true <= forecast_quantiles[:, j])
        for j in range(len(quantile_levels))
    ])

    calibration_error = np.mean(np.abs(empirical_coverage - quantile_levels))

    # 90% interval score (Gneiting & Raftery, 2007)
    alpha = 0.10
    lower_idx = np.argmin(np.abs(quantile_levels - alpha / 2))
    upper_idx = np.argmin(np.abs(quantile_levels - (1 - alpha / 2)))
    lower = forecast_quantiles[:, lower_idx]
    upper = forecast_quantiles[:, upper_idx]
    width = upper - lower
    penalty_lower = (2.0 / alpha) * np.maximum(lower - y_true, 0)
    penalty_upper = (2.0 / alpha) * np.maximum(y_true - upper, 0)
    interval_score = np.mean(width + penalty_lower + penalty_upper)

    coverage_90 = np.mean((y_true >= lower) & (y_true <= upper))

    return {
        "calibration_error": calibration_error,
        "interval_score": interval_score,
        "coverage_90": coverage_90,
        "empirical_coverage": empirical_coverage,
    }


def build_calibrated_ensemble(
    bsts_quantiles: np.ndarray,
    tft_quantiles: np.ndarray,
    y_true: np.ndarray,
    quantile_levels: np.ndarray,
) -> Tuple[np.ndarray, float]:
    """Find optimal ensemble weight via grid search on calibration error.

    Args:
        bsts_quantiles: BSTS quantile forecasts, shape (n, n_q).
        tft_quantiles: TFT quantile forecasts, shape (n, n_q).
        y_true: True values for calibration, shape (n,).
        quantile_levels: Quantile levels.

    Returns:
        (optimal_weight, best_calibration_error).
    """
    best_w, best_error = 0.5, np.inf
    for w in np.linspace(0.0, 1.0, 21):
        blended = w * bsts_quantiles + (1 - w) * tft_quantiles
        cal = evaluate_calibration(y_true, blended, quantile_levels)
        if cal["calibration_error"] < best_error:
            best_w = w
            best_error = cal["calibration_error"]
    return best_w, best_error


# Evaluate on the future period (using "true" future for demonstration)
y_future_true = future["temp_true"].values

# BSTS calibration
bsts_quantiles_future = bsts_result["quantiles"].T  # shape (n_future, 7)
bsts_cal = evaluate_calibration(
    y_future_true, bsts_quantiles_future, bsts_result["quantile_levels"]
)

# TFT calibration
tft_cal = evaluate_calibration(
    y_future_true, tft_result["quantiles"], tft_result["quantile_levels"]
)

# Ensemble
w_opt, _ = build_calibrated_ensemble(
    bsts_quantiles_future, tft_result["quantiles"],
    y_future_true[:60],  # Calibrate on first 5 years only
    bsts_result["quantile_levels"],
)
ensemble_quantiles = w_opt * bsts_quantiles_future + (1 - w_opt) * tft_result["quantiles"]
ensemble_cal = evaluate_calibration(
    y_future_true, ensemble_quantiles, bsts_result["quantile_levels"]
)

print("\n=== Calibration Comparison ===")
print(f"{'Model':<18s} {'Cal. Error':>12s} {'90% Coverage':>14s} {'Interval Score':>16s}")
print("-" * 63)
print(f"{'BSTS':<18s} {bsts_cal['calibration_error']:>12.4f} {bsts_cal['coverage_90']:>14.3f} {bsts_cal['interval_score']:>16.3f}")
print(f"{'TFT':<18s} {tft_cal['calibration_error']:>12.4f} {tft_cal['coverage_90']:>14.3f} {tft_cal['interval_score']:>16.3f}")
ensemble_label = f"Ensemble (w={w_opt:.2f})"
print(f"{ensemble_label:<18s} {ensemble_cal['calibration_error']:>12.4f} {ensemble_cal['coverage_90']:>14.3f} {ensemble_cal['interval_score']:>16.3f}")

# Key policy-relevant probability
p_exceed_2_5 = np.mean(bsts_result["samples"][:, -1] > 2.5)
print(f"\nPolicy-relevant output:")
print(f"  P(anomaly > 2.5°C by 2050) [BSTS]: {p_exceed_2_5:.3f}")

Results and Analysis

The PCRC team summarizes the comparison:

Model      Calibration Error   90% Coverage        Interval Score
BSTS       Low                 Close to 0.90       Moderate
TFT        Moderate            Often below 0.90    Can be narrower
Ensemble   Lowest              Closest to 0.90     Best overall

Three findings shape the final recommendation:

  1. BSTS produces the most honest uncertainty. The Bayesian framework propagates parameter uncertainty into the forecast, producing intervals that widen naturally with the horizon. The 90% intervals achieve near-nominal coverage because the model's structural assumptions (linear trend, Fourier seasonality, logarithmic CO2 response) match the data-generating process reasonably well. The interval at 2050 is roughly three times as wide as at 2030, reflecting genuinely increasing uncertainty.

  2. TFT produces sharper forecasts but risks overconfidence. The TFT's quantile forecasts are narrower than the BSTS intervals — attractive for communication but potentially dangerous for policy. Trained on only 360 monthly observations with 60-month internal forecast windows, the TFT cannot reliably learn the tail behavior needed for 25-year projections. The 90% coverage falls below 0.80 in some evaluation windows, indicating overconfidence. The TFT's real strength is in medium-term forecasting (12-60 months) where the training data density is sufficient.

  3. The calibrated ensemble combines the best of both. By blending with a weight calibrated on the first 5 years of out-of-sample data, the ensemble inherits the BSTS's honest uncertainty while benefiting from the TFT's ability to capture nonlinear covariate interactions. The interval score — which rewards both sharpness and coverage — favors the ensemble.
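The windowed coverage check behind finding 2 (empirical 90% coverage computed over sliding evaluation windows) can be sketched on synthetic data; the window length, interval width, and series below are illustrative, not PCRC values:

```python
import numpy as np


def rolling_coverage(y_true, lower, upper, window=60):
    """Empirical coverage of [lower, upper] in each sliding window."""
    hits = (y_true >= lower) & (y_true <= upper)
    n = len(y_true)
    return np.array([hits[s:s + window].mean() for s in range(n - window + 1)])


# Synthetic example: intervals that are too narrow cover less than 90%.
rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, 312)
half_width = 1.2  # a nominal 90% interval for N(0,1) needs half-width ~1.645
cov = rolling_coverage(y, -half_width * np.ones(312), half_width * np.ones(312))
print(f"Min window coverage: {cov.min():.2f} (nominal 0.90)")
```

Applied to the TFT's 90% intervals against the evaluation series, the same function exposes the windows where coverage drops below 0.80.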

Communicating Uncertainty to Policymakers

The PCRC team translates the ensemble forecast into policy-relevant statements:

  • Temperature trajectory: "Regional mean temperature anomaly is projected to reach 1.8°C to 2.9°C above the 1981-2010 baseline by 2050 (90% prediction interval), with a median projection of 2.3°C."
  • Threshold exceedance: "The probability that the anomaly exceeds 2.5°C by 2050 is approximately 0.35. The probability it exceeds 3.0°C is approximately 0.10."
  • Decision-relevant framing: "Infrastructure designed for a 2.0°C world has a 75% chance of being adequate through 2050. Infrastructure designed for a 3.0°C world has a 90% chance."

The key lesson: probabilistic forecasts are useful only if they are calibrated and communicated in terms the audience can act on. A median temperature projection is almost useless for a civil engineer — but a probability distribution over peak temperatures, translated into infrastructure survival probabilities, is directly actionable. The BSTS model's honest uncertainty quantification is the foundation on which this communication rests.
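Exceedance probabilities like those quoted above can be derived from quantile forecasts (the form the ensemble emits) by interpolating the implied CDF. A minimal sketch; the quantile values below are illustrative, not the actual ensemble output:

```python
import numpy as np


def exceedance_prob(threshold, quantile_values, quantile_levels):
    """Approximate P(X > threshold) by linearly interpolating the CDF
    implied by (level, value) quantile pairs, clipped to the levels
    the forecast actually covers."""
    # np.interp requires increasing x-values; quantile values rise with level.
    cdf_at_threshold = np.interp(
        threshold, quantile_values, quantile_levels,
        left=quantile_levels[0], right=quantile_levels[-1],
    )
    return 1.0 - cdf_at_threshold


levels = np.array([0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95])
# Illustrative 2050 anomaly quantiles (degrees C), not the ensemble's output.
values = np.array([1.6, 1.8, 2.0, 2.3, 2.6, 2.9, 3.1])
print(f"P(anomaly > 2.5C) ~ {exceedance_prob(2.5, values, levels):.2f}")
```

Because the forecast only specifies quantiles between 0.05 and 0.95, exceedance estimates outside that range are clipped; statements about rarer events would require the sample-based BSTS output rather than the quantile grid.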

Discussion Questions

  1. The BSTS model assumes a logarithmic relationship between CO2 and temperature forcing. If the true relationship is more complex (e.g., including aerosol interactions), how would this manifest in the calibration diagnostics?

  2. The TFT was trained on only 360 observations. Would performance improve if the team incorporated global temperature data (not just regional) as additional training series? What assumptions would this require about the relationship between global and regional temperatures?

  3. The ensemble weight was calibrated on the first 5 years of out-of-sample data. If climate dynamics are non-stationary, should the weight be adaptive? How would you implement adaptive ensemble weighting using the ACI framework from Section 23.11?