

Chapter 23: Advanced Time Series and Temporal Models — State-Space Models, Temporal Fusion Transformers, and Probabilistic Forecasting

"Prediction is very difficult, especially about the future." — Niels Bohr (attributed)


Learning Objectives

By the end of this chapter, you will be able to:

  1. Implement state-space models (Kalman filter, hidden Markov models) and explain their Bayesian interpretation
  2. Apply deep learning architectures (TFT, N-BEATS, DeepAR) to time series forecasting
  3. Build probabilistic forecasts that quantify prediction uncertainty using conformal prediction and Bayesian methods
  4. Handle common time series challenges: missing data, irregular sampling, multiple seasonalities, regime changes
  5. Select the appropriate model family based on data characteristics and forecasting requirements

23.1 Beyond ARIMA: Why This Chapter Exists

If you have completed an intermediate data science course, you know ARIMA. You know how to decompose a time series into trend, seasonality, and residuals. You may have used Prophet to produce forecasts with uncertainty intervals. These methods are genuinely useful — and for many problems, they are sufficient.

This chapter exists because your problems may not be among those "many problems."

Consider the Pacific Climate Research Consortium's challenge from our running example. They need to forecast regional temperatures over horizons of 20 to 50 years. The underlying process is nonlinear, non-stationary, and driven by latent physical dynamics (ocean circulation, greenhouse gas concentrations, volcanic forcing) that ARIMA cannot represent. A point forecast is worse than useless for policymakers — it creates a false sense of precision. What they need is a probabilistic forecast: a statement like "the probability that regional mean temperature exceeds 2.5°C above pre-industrial levels by 2060 is 0.73, with a 90% prediction interval of [1.8°C, 3.4°C]."

Or consider StreamRec's engagement forecasting problem. Daily active users exhibit day-of-week seasonality, month-of-year seasonality, holiday effects, promotional spikes, and regime changes when the platform launches a new content vertical. The series is short enough (3 years of daily data, ~1,100 observations) that deep learning models risk overfitting, yet complex enough that single-seasonality ARIMA cannot capture the patterns. A structural time series model — which decomposes the series into explicit trend, seasonal, and regression components — provides both interpretability and proper uncertainty quantification.

This chapter covers three families of models that go beyond the intermediate toolkit:

  1. State-space models (Sections 23.2–23.4): The Kalman filter and hidden Markov models, which formalize the idea that the observed data are generated by unobserved (latent) states evolving over time.
  2. Deep learning for time series (Sections 23.5–23.8): The Temporal Fusion Transformer (TFT), N-BEATS, and DeepAR — architectures designed specifically for temporal data, not sequence models repurposed from NLP.
  3. Probabilistic forecasting and calibration (Sections 23.9–23.12): The key differentiator from the intermediate book. Point predictions are insufficient when decisions depend on uncertainty. We will cover quantile regression, conformal prediction for distribution-free intervals, and calibration — the discipline of ensuring that your uncertainty estimates are honest.

The unifying theme is the one we have been building since Chapter 20: Know How Your Model Is Wrong. Every forecasting model makes assumptions about the data-generating process. State-space models assume a particular latent structure. Deep learning models assume that patterns in the training window will persist. Probabilistic forecasts claim to capture the true range of outcomes. In every case, we must ask: are these assumptions met? How would we know if they were not?


23.2 State-Space Models: The General Framework

A state-space model (SSM) describes a time series through two equations:

  1. Transition equation (state evolution): How the unobserved latent state $\boldsymbol{\alpha}_t$ evolves over time.
  2. Observation equation (measurement): How the observed data $y_t$ relates to the latent state.

In general form:

$$\boldsymbol{\alpha}_{t+1} = f(\boldsymbol{\alpha}_t, \boldsymbol{u}_t) + \boldsymbol{\eta}_t, \quad \boldsymbol{\eta}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{Q}_t)$$

$$y_t = h(\boldsymbol{\alpha}_t, \boldsymbol{x}_t) + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, H_t)$$

where $\boldsymbol{u}_t$ and $\boldsymbol{x}_t$ are optional exogenous inputs, $\boldsymbol{\eta}_t$ is the state noise, and $\varepsilon_t$ is the observation noise with variance $H_t$ (the notation we carry into the linear-Gaussian case of Section 23.3).

The key insight: the observed data $y_t$ is a noisy, partial view of a richer underlying reality $\boldsymbol{\alpha}_t$. The state captures what we cannot directly measure — the "true" level of a process, its trend, its seasonal component, and any latent regime it occupies. Estimation consists of inferring the latent states from the noisy observations.
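Before any filtering machinery, it helps to see this generative story in code. The sketch below (noise scales are illustrative) simulates the simplest special case, a local level model in which both $f$ and $h$ are the identity, and confirms that $y_t$ is a noisy view of the latent state:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
sigma_state, sigma_obs = 0.1, 0.5  # illustrative noise scales

# Transition: alpha_{t+1} = alpha_t + eta_t (random-walk latent level)
eta = rng.normal(0.0, sigma_state, size=n)
alpha = np.cumsum(eta)

# Observation: y_t = alpha_t + eps_t (noisy partial view of the state)
y = alpha + rng.normal(0.0, sigma_obs, size=n)

# The observations scatter widely around the slowly moving latent level
print(f"std of state increments:      {np.std(np.diff(alpha)):.3f}")
print(f"std of observation residuals: {np.std(y - alpha):.3f}")
```

Estimation reverses this simulation: given only `y`, recover a distribution over `alpha`.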

Why State-Space Models Are Bayesian

State-space models are inherently Bayesian. At each time step $t$, the model maintains a belief distribution over the latent state:

$$p(\boldsymbol{\alpha}_t \mid y_{1:t})$$

This is a posterior: the distribution of the latent state $\boldsymbol{\alpha}_t$ given all observations up to time $t$. The predict-update cycle of the Kalman filter (Section 23.3) is precisely Bayesian inference applied recursively:

  • Predict step: Apply the transition model to propagate the prior forward (today's posterior becomes tomorrow's prior).
  • Update step: Incorporate the new observation $y_{t+1}$ via the likelihood to obtain the new posterior.

This is the same sequential updating we derived in Chapter 20, Section 20.8. The difference is that here the parameter of interest (the latent state) itself changes over time, making the problem inherently sequential rather than merely conveniently sequential.
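For a scalar local level state, both steps have one-line closed forms, and the correspondence with the Normal-Normal conjugate update is explicit. The numbers below are illustrative:

```python
def predict(mean, var, q):
    """Predict: propagate the posterior through a random-walk transition.
    Today's posterior N(mean, var) becomes tomorrow's prior N(mean, var + q)."""
    return mean, var + q

def update(mean, var, y, r):
    """Update: condition the prior on a new observation y ~ N(state, r).
    This is the standard Normal-Normal conjugate (precision-weighted) update."""
    k = var / (var + r)          # weight placed on the data
    return mean + k * (y - mean), (1 - k) * var

# One predict-update cycle with illustrative values
m, v = 0.0, 1.0                    # current posterior
m, v = predict(m, v, q=0.2)        # uncertainty grows: v becomes 1.2
m, v = update(m, v, y=0.8, r=0.5)  # observation pulls the mean toward y
print(f"posterior mean {m:.3f}, variance {v:.3f}")
```

The Kalman filter of Section 23.3 is exactly this loop, generalized to vector states and matrices.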

Mathematical Foundation: The connection between state-space models and Bayesian inference was formalized by Ho and Lee (1964) and later by Harrison and Stevens (1976). The Kalman filter is the optimal Bayesian filter for linear-Gaussian state-space models. For nonlinear or non-Gaussian models, approximations (extended Kalman filter, unscented Kalman filter, particle filters) are required.

What Fits into the State-Space Framework?

The framework is remarkably general. Many models you already know are special cases:

| Model | State $\boldsymbol{\alpha}_t$ | Transition | Observation |
|---|---|---|---|
| Local level (random walk + noise) | $\mu_t$ (level) | $\mu_{t+1} = \mu_t + \eta_t$ | $y_t = \mu_t + \varepsilon_t$ |
| Local linear trend | $[\mu_t, \nu_t]^T$ (level, slope) | $\mu_{t+1} = \mu_t + \nu_t + \eta_t^{(1)}$; $\nu_{t+1} = \nu_t + \eta_t^{(2)}$ | $y_t = \mu_t + \varepsilon_t$ |
| Structural time series | $[\mu_t, \nu_t, \gamma_t^{(1)}, \ldots]^T$ | Trend + seasonal components | $y_t = \mu_t + \sum_j \gamma_t^{(j)} + \varepsilon_t$ |
| ARIMA(p, d, q) | Stacked AR/MA states | Companion-form matrix | First element of state |
| Hidden Markov model | Discrete $s_t \in \{1, \ldots, K\}$ | Transition matrix $\mathbf{A}$ | $p(y_t \mid s_t)$ |

The structural time series model is particularly important for applied work. Prophet, which you may have used in the intermediate course, is a structural time series model with a piecewise-linear trend, Fourier seasonality, and holiday regressors — fitted by MAP estimation rather than full Bayesian inference. Understanding the state-space framework lets you see what Prophet assumes, where those assumptions break down, and how to build something better when they do.


23.3 The Kalman Filter: Bayesian Updating for Linear Dynamical Systems

The Kalman filter solves the state estimation problem for the linear-Gaussian state-space model:

$$\boldsymbol{\alpha}_{t+1} = \mathbf{T}_t \boldsymbol{\alpha}_t + \mathbf{R}_t \boldsymbol{\eta}_t, \quad \boldsymbol{\eta}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{Q}_t)$$

$$y_t = \mathbf{Z}_t \boldsymbol{\alpha}_t + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, H_t)$$

where $\mathbf{T}_t$ is the transition matrix, $\mathbf{Z}_t$ is the observation matrix, $\mathbf{R}_t$ selects which state components receive noise, and $\mathbf{Q}_t$ and $H_t$ are the state and observation noise covariances.

The Predict-Update Equations

Given the filtering distribution at time $t$:

$$\boldsymbol{\alpha}_t \mid y_{1:t} \sim \mathcal{N}(\mathbf{a}_t, \mathbf{P}_t)$$

The Kalman filter proceeds in two steps:

Predict (propagate the state forward through the transition model):

$$\mathbf{a}_{t+1|t} = \mathbf{T}_t \mathbf{a}_t$$

$$\mathbf{P}_{t+1|t} = \mathbf{T}_t \mathbf{P}_t \mathbf{T}_t^T + \mathbf{R}_t \mathbf{Q}_t \mathbf{R}_t^T$$

The predicted mean $\mathbf{a}_{t+1|t}$ is our best guess of the state at $t+1$ before seeing $y_{t+1}$. The predicted covariance $\mathbf{P}_{t+1|t}$ adds the state noise $\mathbf{R}_t \mathbf{Q}_t \mathbf{R}_t^T$ to the propagated covariance; for the random-walk-style transitions used in this chapter, it exceeds $\mathbf{P}_t$ — uncertainty grows between observations.

Update (incorporate the new observation):

$$\boldsymbol{v}_{t+1} = y_{t+1} - \mathbf{Z}_{t+1} \mathbf{a}_{t+1|t} \quad \text{(innovation)}$$

$$F_{t+1} = \mathbf{Z}_{t+1} \mathbf{P}_{t+1|t} \mathbf{Z}_{t+1}^T + H_{t+1} \quad \text{(innovation variance)}$$

$$\mathbf{K}_{t+1} = \mathbf{P}_{t+1|t} \mathbf{Z}_{t+1}^T F_{t+1}^{-1} \quad \text{(Kalman gain)}$$

$$\mathbf{a}_{t+1} = \mathbf{a}_{t+1|t} + \mathbf{K}_{t+1} \boldsymbol{v}_{t+1}$$

$$\mathbf{P}_{t+1} = (\mathbf{I} - \mathbf{K}_{t+1} \mathbf{Z}_{t+1}) \mathbf{P}_{t+1|t}$$

The Kalman gain $\mathbf{K}_{t+1}$ controls how much the new observation shifts the state estimate. When the observation is precise (small $H_{t+1}$), the gain is large — trust the data. When the observation is noisy (large $H_{t+1}$), the gain is small — trust the prior. This is exactly the precision-weighted averaging from Chapter 20's Normal-Normal conjugate model, generalized to the multivariate, time-varying case.
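This behavior is easy to check numerically. For a scalar state with predicted variance $P$, the gain reduces to $K = P/(P + H)$; sweeping the observation noise (values illustrative) shows the filter shifting trust between data and prior:

```python
P = 1.0  # predicted state variance (illustrative)
for H in [0.01, 1.0, 100.0]:
    K = P / (P + H)  # scalar Kalman gain
    print(f"H = {H:6.2f}  ->  gain K = {K:.3f}")
# Precise observation (H small): K near 1, the update trusts the data.
# Noisy observation (H large):   K near 0, the update trusts the prior.
```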

Common Misconception: The Kalman filter is sometimes described as an "algorithm" in the same category as gradient descent or k-means. It is not an optimization algorithm — it is an exact Bayesian inference procedure. For the linear-Gaussian model, the Kalman filter computes the exact posterior. There is no approximation, no convergence criterion, no hyperparameter to tune (beyond the model parameters $\mathbf{T}, \mathbf{Z}, \mathbf{Q}, H$ themselves). This is why it was the first algorithm used for real-time spacecraft navigation — it is provably optimal under its assumptions.

Implementation: A Local Linear Trend Model

Let us implement the Kalman filter for a local linear trend model and apply it to simulated climate data.

import numpy as np
from dataclasses import dataclass, field
from typing import Tuple, List, Optional
import matplotlib.pyplot as plt


@dataclass
class KalmanFilter:
    """Kalman filter for a general linear-Gaussian state-space model.

    State equation: alpha_{t+1} = T @ alpha_t + R @ eta_t,  eta_t ~ N(0, Q)
    Obs equation:   y_t = Z @ alpha_t + eps_t,              eps_t ~ N(0, H)

    Attributes:
        T: Transition matrix (m x m).
        Z: Observation matrix (1 x m) or (p x m) for multivariate.
        R: State noise selection matrix (m x r).
        Q: State noise covariance (r x r).
        H: Observation noise variance (scalar for univariate).
        a0: Initial state mean (m,).
        P0: Initial state covariance (m x m).
    """
    T: np.ndarray
    Z: np.ndarray
    R: np.ndarray
    Q: np.ndarray
    H: float
    a0: np.ndarray
    P0: np.ndarray
    _log_likelihood: float = field(default=0.0, init=False, repr=False)

    def filter(
        self, y: np.ndarray
    ) -> Tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
        """Run the Kalman filter forward pass.

        Args:
            y: Observations of shape (n,). NaN values are treated as missing.

        Returns:
            a_filt: Filtered state means, shape (n, m).
            P_filt: Filtered state covariances, shape (n, m, m).
            a_pred: Predicted state means, shape (n, m).
            P_pred: Predicted state covariances, shape (n, m, m).
        """
        n = len(y)
        m = len(self.a0)

        a_filt = np.zeros((n, m))
        P_filt = np.zeros((n, m, m))
        a_pred = np.zeros((n, m))
        P_pred = np.zeros((n, m, m))
        log_lik = 0.0

        a = self.a0.copy()
        P = self.P0.copy()

        for t in range(n):
            # --- Predict ---
            a_pred[t] = self.T @ a
            P_pred[t] = self.T @ P @ self.T.T + self.R @ self.Q @ self.R.T

            if np.isnan(y[t]):
                # Missing observation: skip update, prediction becomes filtered
                a_filt[t] = a_pred[t]
                P_filt[t] = P_pred[t]
            else:
                # --- Update ---
                v = y[t] - self.Z @ a_pred[t]       # Innovation
                F = self.Z @ P_pred[t] @ self.Z.T + self.H  # Innovation var
                F_scalar = F.item()                  # F has shape (1, 1)
                K = (P_pred[t] @ self.Z.T).ravel() / F_scalar  # Kalman gain (m,)

                a_filt[t] = a_pred[t] + K * v.item()
                P_filt[t] = P_pred[t] - np.outer(K, self.Z @ P_pred[t])

                # Accumulate log-likelihood for parameter estimation
                log_lik += -0.5 * (
                    np.log(2 * np.pi) + np.log(F_scalar) + v.item()**2 / F_scalar
                )

            a = a_filt[t]
            P = P_filt[t]

        self._log_likelihood = log_lik
        return a_filt, P_filt, a_pred, P_pred

    def smooth(
        self,
        a_filt: np.ndarray,
        P_filt: np.ndarray,
        a_pred: np.ndarray,
        P_pred: np.ndarray,
    ) -> Tuple[np.ndarray, np.ndarray]:
        """Rauch-Tung-Striebel (RTS) smoother: backward pass.

        Uses all observations y_{1:n} to refine state estimates.

        Args:
            a_filt: Filtered state means from filter(), shape (n, m).
            P_filt: Filtered state covariances, shape (n, m, m).
            a_pred: Predicted state means, shape (n, m).
            P_pred: Predicted state covariances, shape (n, m, m).

        Returns:
            a_smooth: Smoothed state means, shape (n, m).
            P_smooth: Smoothed state covariances, shape (n, m, m).
        """
        n, m = a_filt.shape
        a_smooth = np.zeros_like(a_filt)
        P_smooth = np.zeros_like(P_filt)

        a_smooth[-1] = a_filt[-1]
        P_smooth[-1] = P_filt[-1]

        for t in range(n - 2, -1, -1):
            # Smoother gain
            L = P_filt[t] @ self.T.T @ np.linalg.inv(P_pred[t + 1])
            a_smooth[t] = a_filt[t] + L @ (a_smooth[t + 1] - a_pred[t + 1])
            P_smooth[t] = P_filt[t] + L @ (P_smooth[t + 1] - P_pred[t + 1]) @ L.T

        return a_smooth, P_smooth

    @property
    def log_likelihood(self) -> float:
        """Log-likelihood from the most recent filter() call."""
        return self._log_likelihood

    def forecast(
        self,
        a_last: np.ndarray,
        P_last: np.ndarray,
        horizon: int,
    ) -> Tuple[np.ndarray, np.ndarray]:
        """Produce h-step-ahead forecasts with uncertainty.

        Args:
            a_last: Filtered state mean at the last observation.
            P_last: Filtered state covariance at the last observation.
            horizon: Number of steps ahead.

        Returns:
            y_pred: Predicted observation means, shape (horizon,).
            y_var: Predicted observation variances, shape (horizon,).
        """
        y_pred = np.zeros(horizon)
        y_var = np.zeros(horizon)
        a = a_last.copy()
        P = P_last.copy()

        for h in range(horizon):
            a = self.T @ a
            P = self.T @ P @ self.T.T + self.R @ self.Q @ self.R.T
            y_pred[h] = (self.Z @ a).item()
            y_var[h] = (self.Z @ P @ self.Z.T + self.H).item()

        return y_pred, y_var


def build_local_linear_trend(
    sigma_level: float, sigma_trend: float, sigma_obs: float
) -> KalmanFilter:
    """Construct a local linear trend state-space model.

    State: [level, trend]
    Transition: level_{t+1} = level_t + trend_t + eta_level
                trend_{t+1} = trend_t + eta_trend
    Observation: y_t = level_t + eps_t

    Args:
        sigma_level: Standard deviation of level noise.
        sigma_trend: Standard deviation of trend noise.
        sigma_obs: Standard deviation of observation noise.

    Returns:
        Configured KalmanFilter instance.
    """
    T = np.array([[1.0, 1.0],
                  [0.0, 1.0]])
    Z = np.array([[1.0, 0.0]])
    R = np.eye(2)
    Q = np.diag([sigma_level**2, sigma_trend**2])
    H = sigma_obs**2
    a0 = np.array([0.0, 0.0])
    P0 = np.eye(2) * 1e6  # Diffuse initialization

    return KalmanFilter(T=T, Z=Z, R=R, Q=Q, H=H, a0=a0, P0=P0)


# --- Demonstration: Simulated climate-like temperature anomaly ---
np.random.seed(42)
n = 360  # 30 years of monthly data

# True process: slow warming trend + noise
true_trend = 0.005  # 0.005°C per month ≈ 0.06°C per year
true_level = np.zeros(n)
true_level[0] = 0.0
for t in range(1, n):
    true_level[t] = true_level[t - 1] + true_trend + np.random.normal(0, 0.02)

y = true_level + np.random.normal(0, 0.15, n)  # Noisy observations

# Introduce 10% missing data (simulating sensor gaps)
missing_mask = np.random.rand(n) < 0.10
y_with_gaps = y.copy()
y_with_gaps[missing_mask] = np.nan

# Fit the Kalman filter
kf = build_local_linear_trend(sigma_level=0.03, sigma_trend=0.005, sigma_obs=0.15)
a_filt, P_filt, a_pred, P_pred = kf.filter(y_with_gaps)
a_smooth, P_smooth = kf.smooth(a_filt, P_filt, a_pred, P_pred)

# Forecast 60 months (5 years) ahead
y_forecast, y_forecast_var = kf.forecast(a_filt[-1], P_filt[-1], horizon=60)

print("Kalman Filter Results (last 5 observations):")
print(f"{'Time':>6s}  {'Observed':>10s}  {'Filtered':>10s}  {'Smoothed':>10s}  {'±2σ (filt)':>12s}")
for t in range(n - 5, n):
    obs_str = f"{y_with_gaps[t]:.4f}" if not np.isnan(y_with_gaps[t]) else "MISSING"
    filt_std = np.sqrt(P_filt[t, 0, 0])
    print(f"{t:>6d}  {obs_str:>10s}  {a_filt[t, 0]:>10.4f}  {a_smooth[t, 0]:>10.4f}  {2*filt_std:>12.4f}")

print(f"\nLog-likelihood: {kf.log_likelihood:.2f}")
print(f"\nForecast (5 years ahead):")
print(f"  Mean at h=60: {y_forecast[-1]:.4f}°C anomaly")
print(f"  95% interval: [{y_forecast[-1] - 1.96*np.sqrt(y_forecast_var[-1]):.4f}, "
      f"{y_forecast[-1] + 1.96*np.sqrt(y_forecast_var[-1]):.4f}]")

Several features of this implementation merit attention:

  1. Missing data handling. When y[t] is NaN, the filter skips the update step. The prediction becomes the filtered estimate, and uncertainty does not shrink. This is a natural consequence of Bayesian updating: no data means no update. Missing data is not imputed; it is propagated as increased uncertainty.

  2. Diffuse initialization. We set $\mathbf{P}_0 = 10^6 \mathbf{I}$, reflecting extreme uncertainty about the initial state. The filter converges rapidly — after a few observations, the initial conditions are forgotten.

  3. The smoother. The Rauch-Tung-Striebel (RTS) backward pass uses future observations to refine past state estimates. Smoothed estimates are always at least as precise as filtered estimates: $\text{Var}(\boldsymbol{\alpha}_t \mid y_{1:n}) \leq \text{Var}(\boldsymbol{\alpha}_t \mid y_{1:t})$.

  4. Forecast uncertainty grows. The forecast method shows that prediction intervals widen with the horizon — the system honestly reports that it knows less about the distant future. For the climate application, this growing uncertainty is precisely the information policymakers need.
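The variance inequality in point 3 can be verified directly with a self-contained scalar local level model (variances illustrative): the RTS backward pass never increases the posterior variance.

```python
import numpy as np

rng = np.random.default_rng(5)
n, q, r = 100, 0.05, 0.5                 # illustrative state / observation variances
y = np.cumsum(rng.normal(0, np.sqrt(q), n)) + rng.normal(0, np.sqrt(r), n)

a_f = np.zeros(n); p_f = np.zeros(n)     # filtered means / variances
a_p = np.zeros(n); p_p = np.zeros(n)     # one-step predicted means / variances
a, p = 0.0, 1e6                          # diffuse start
for t in range(n):
    a_p[t], p_p[t] = a, p + q            # predict
    k = p_p[t] / (p_p[t] + r)            # scalar Kalman gain
    a_f[t] = a_p[t] + k * (y[t] - a_p[t])
    p_f[t] = (1 - k) * p_p[t]
    a, p = a_f[t], p_f[t]

a_s = a_f.copy(); p_s = p_f.copy()       # RTS backward pass (scalar T = 1)
for t in range(n - 2, -1, -1):
    L = p_f[t] / p_p[t + 1]              # smoother gain
    a_s[t] = a_f[t] + L * (a_s[t + 1] - a_p[t + 1])
    p_s[t] = p_f[t] + L**2 * (p_s[t + 1] - p_p[t + 1])

print("smoothed variance <= filtered variance everywhere:",
      bool(np.all(p_s <= p_f + 1e-12)))
```

In the interior of the series the inequality is strict: future observations always carry some information about past states.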

Production Reality: In production systems, the Kalman filter's model parameters ($\mathbf{Q}$, $H$, and sometimes $\mathbf{T}$) are typically estimated by maximizing the log-likelihood that the filter computes as a byproduct. This is implemented in statsmodels.tsa.statespace and tensorflow_probability.sts. Our implementation stores _log_likelihood for exactly this purpose. In practice, you rarely write the filter from scratch — but understanding the predict-update cycle is essential for diagnosing failures, which happen frequently in production time series systems.
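As a hedged sketch of that estimation loop, the following maximizes the log-likelihood of a scalar local level model over its two noise parameters with scipy.optimize.minimize. The data, starting values, and log-sigma parametrization are all illustrative; production code would reach for statsmodels.tsa.statespace rather than a hand-rolled filter.

```python
import numpy as np
from scipy.optimize import minimize

def local_level_neg_loglik(log_sigmas, y):
    """Negative log-likelihood of a scalar local level model via the Kalman filter."""
    q, r = np.exp(2 * log_sigmas)     # state / observation variances
    a, p = y[0], 1e6                  # diffuse-ish initialization
    nll = 0.0
    for t in range(1, len(y)):
        p = p + q                     # predict
        f = p + r                     # innovation variance
        v = y[t] - a                  # innovation
        nll += 0.5 * (np.log(2 * np.pi * f) + v * v / f)
        k = p / f                     # Kalman gain
        a = a + k * v                 # update mean
        p = (1 - k) * p               # update variance
    return nll

rng = np.random.default_rng(7)
n = 500
true_sigma_eta, true_sigma_eps = 0.1, 0.5    # illustrative "true" noise scales
level = np.cumsum(rng.normal(0, true_sigma_eta, n))
y = level + rng.normal(0, true_sigma_eps, n)

res = minimize(local_level_neg_loglik, x0=np.log([0.3, 0.3]),
               args=(y,), method="Nelder-Mead")
sigma_eta_hat, sigma_eps_hat = np.exp(res.x)
print(f"sigma_eta: estimated {sigma_eta_hat:.3f}, true {true_sigma_eta}")
print(f"sigma_eps: estimated {sigma_eps_hat:.3f}, true {true_sigma_eps}")
```

Optimizing in log-sigma space keeps the variances positive without explicit constraints, which is the usual trick in these likelihoods.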


23.4 Hidden Markov Models: Discrete Latent States

The Kalman filter assumes continuous, Gaussian latent states. But many temporal processes have discrete regimes: a financial market switches between bull and bear states; a user alternates between active engagement and dormancy; a climate system transitions between El Niño and La Niña phases.

A hidden Markov model (HMM) replaces the continuous state with a discrete state variable $s_t \in \{1, 2, \ldots, K\}$ that evolves according to a transition matrix $\mathbf{A}$:

$$P(s_{t+1} = j \mid s_t = i) = A_{ij}$$

Each state generates observations according to a state-specific emission distribution:

$$y_t \mid s_t = k \sim p_k(y_t \mid \boldsymbol{\theta}_k)$$

For Gaussian emissions, $p_k(y_t \mid \boldsymbol{\theta}_k) = \mathcal{N}(y_t \mid \mu_k, \sigma_k^2)$.

The Three Fundamental HMM Problems

  1. Filtering (forward algorithm): Compute $P(s_t = k \mid y_{1:t})$ — what state is the system in right now?
  2. Decoding (Viterbi algorithm): Find $\arg\max_{s_{1:T}} P(s_{1:T} \mid y_{1:T})$ — what is the most likely sequence of states?
  3. Learning (Baum-Welch / EM): Estimate $\mathbf{A}$, $\boldsymbol{\theta}_k$, and the initial state distribution $\boldsymbol{\pi}$ from data.

The forward algorithm is the discrete analogue of the Kalman filter's predict-update cycle:

$$\alpha_t(j) = p(y_t \mid s_t = j) \sum_{i=1}^K \alpha_{t-1}(i) \, A_{ij}$$

where $\alpha_t(j) = P(s_t = j, y_{1:t})$. The sum over $i$ is the predict step (marginalizing over the previous state); the multiplication by the emission probability is the update step (incorporating the observation). The connection to Bayesian inference is the same — the only difference is that we are working with discrete distributions instead of Gaussians.

from scipy.special import logsumexp


@dataclass
class GaussianHMM:
    """Hidden Markov model with Gaussian emissions.

    Attributes:
        n_states: Number of hidden states K.
        means: Emission means, shape (K,).
        stds: Emission standard deviations, shape (K,).
        trans_mat: Transition matrix, shape (K, K). Rows sum to 1.
        init_dist: Initial state distribution, shape (K,). Sums to 1.
    """
    n_states: int
    means: np.ndarray
    stds: np.ndarray
    trans_mat: np.ndarray
    init_dist: np.ndarray

    def _log_emission(self, y: float, k: int) -> float:
        """Log emission probability p(y | s_t = k)."""
        from scipy.stats import norm
        return norm.logpdf(y, loc=self.means[k], scale=self.stds[k])

    def forward(self, y: np.ndarray) -> Tuple[np.ndarray, float]:
        """Forward algorithm (filtering) in log space.

        Args:
            y: Observations, shape (T,).

        Returns:
            log_alpha: Log forward variables, shape (T, K).
            log_lik: Log-likelihood of the observation sequence.
        """
        T = len(y)
        K = self.n_states
        log_alpha = np.zeros((T, K))

        # t = 0
        for k in range(K):
            log_alpha[0, k] = (
                np.log(self.init_dist[k] + 1e-300) + self._log_emission(y[0], k)
            )

        # t = 1, ..., T-1
        log_A = np.log(self.trans_mat + 1e-300)
        for t in range(1, T):
            for j in range(K):
                log_alpha[t, j] = (
                    logsumexp(log_alpha[t - 1] + log_A[:, j])
                    + self._log_emission(y[t], j)
                )

        log_lik = logsumexp(log_alpha[-1])
        return log_alpha, log_lik

    def viterbi(self, y: np.ndarray) -> np.ndarray:
        """Viterbi algorithm: most likely state sequence.

        Args:
            y: Observations, shape (T,).

        Returns:
            states: Most likely state sequence, shape (T,).
        """
        T = len(y)
        K = self.n_states
        log_delta = np.zeros((T, K))
        psi = np.zeros((T, K), dtype=int)

        log_A = np.log(self.trans_mat + 1e-300)

        # t = 0
        for k in range(K):
            log_delta[0, k] = (
                np.log(self.init_dist[k] + 1e-300) + self._log_emission(y[0], k)
            )

        # Forward pass
        for t in range(1, T):
            for j in range(K):
                candidates = log_delta[t - 1] + log_A[:, j]
                psi[t, j] = np.argmax(candidates)
                log_delta[t, j] = candidates[psi[t, j]] + self._log_emission(y[t], j)

        # Backtrack
        states = np.zeros(T, dtype=int)
        states[-1] = np.argmax(log_delta[-1])
        for t in range(T - 2, -1, -1):
            states[t] = psi[t + 1, states[t + 1]]

        return states

    def filter_probs(self, y: np.ndarray) -> np.ndarray:
        """Compute filtering probabilities P(s_t = k | y_{1:t}).

        Args:
            y: Observations, shape (T,).

        Returns:
            probs: Filtering probabilities, shape (T, K).
        """
        log_alpha, _ = self.forward(y)
        # Normalize each row
        log_norm = logsumexp(log_alpha, axis=1, keepdims=True)
        return np.exp(log_alpha - log_norm)


# --- Demo: Regime detection in engagement data ---
np.random.seed(123)

# True regime: alternating between "high engagement" and "low engagement"
T_sim = 500
true_states = np.zeros(T_sim, dtype=int)
state = 0
for t in range(1, T_sim):
    if np.random.rand() < 0.03:  # 3% chance of switching
        state = 1 - state
    true_states[t] = state

# Emissions: state 0 = high engagement (mean 0.72), state 1 = low (mean 0.35)
true_means = np.array([0.72, 0.35])
true_stds = np.array([0.08, 0.10])
y_engagement = np.array([
    np.random.normal(true_means[s], true_stds[s]) for s in true_states
])

# Fit HMM (using known parameters for demonstration — in practice, use EM)
hmm = GaussianHMM(
    n_states=2,
    means=true_means,
    stds=true_stds,
    trans_mat=np.array([[0.97, 0.03], [0.03, 0.97]]),
    init_dist=np.array([0.5, 0.5]),
)

decoded_states = hmm.viterbi(y_engagement)
filter_probs = hmm.filter_probs(y_engagement)
_, log_lik = hmm.forward(y_engagement)

accuracy = np.mean(decoded_states == true_states)
print(f"Viterbi decoding accuracy: {accuracy:.3f}")
print(f"Log-likelihood: {log_lik:.2f}")
print(f"Mean filtering P(high engagement) when truly high: "
      f"{filter_probs[true_states == 0, 0].mean():.3f}")
print(f"Mean filtering P(high engagement) when truly low:  "
      f"{filter_probs[true_states == 1, 0].mean():.3f}")

Implementation Note: For production HMM work, use hmmlearn or pomegranate. Our from-scratch implementation is pedagogical. In particular, the Baum-Welch (EM) algorithm for parameter estimation — which we omit here for space — requires the backward algorithm in addition to the forward algorithm and iterates until convergence. hmmlearn.GaussianHMM.fit() handles this, including the tricky initialization that determines whether EM converges to a good local optimum.

Regime-Switching Models for Time Series

HMMs become time series models when the emission distributions are themselves time series models. A Markov-switching autoregression (Hamilton, 1989) replaces the Gaussian emission with a regime-dependent AR model:

$$y_t = \phi_{k,0} + \phi_{k,1} y_{t-1} + \cdots + \phi_{k,p} y_{t-p} + \varepsilon_t \quad \text{when } s_t = k, \qquad \varepsilon_t \sim \mathcal{N}(0, \sigma_k^2)$$

Each regime has its own AR parameters $(\phi_{k,0}, \ldots, \phi_{k,p})$ and noise variance $\sigma_k^2$. This is valuable for modeling processes that behave differently under different conditions: economic expansions vs. recessions, normal vs. anomalous sensor readings, or organic vs. promotional engagement spikes in the StreamRec context.
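A short simulation (all parameters illustrative) makes the regime dependence concrete: a two-regime Markov-switching AR(1) with a persistent, low-noise regime and a less persistent, high-noise regime.

```python
import numpy as np

rng = np.random.default_rng(11)
T = 1000
A = np.array([[0.98, 0.02],    # regime transition matrix: rows sum to 1
              [0.05, 0.95]])
phi = np.array([0.9, 0.3])     # regime-specific AR(1) coefficients (illustrative)
sigma = np.array([0.1, 0.8])   # regime-specific noise scales (illustrative)

s = np.zeros(T, dtype=int)     # latent regime path
y = np.zeros(T)                # observed series
for t in range(1, T):
    s[t] = rng.choice(2, p=A[s[t - 1]])                        # Markov switch
    y[t] = phi[s[t]] * y[t - 1] + rng.normal(0, sigma[s[t]])   # regime-dependent AR(1)

# Within-regime sample standard deviations reflect the two noise levels
for k in range(2):
    print(f"regime {k}: {np.mean(s == k):.2f} of time, sd(y) = {np.std(y[s == k]):.2f}")
```

Fitting such a model (e.g., with statsmodels' MarkovAutoregression) reverses the simulation: it infers the regime path and the per-regime parameters from `y` alone.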


23.5 Structural Time Series Models: The Bayesian Approach

A structural time series model (STM) decomposes the observed series into interpretable components:

$$y_t = \mu_t + \gamma_t + \boldsymbol{\beta}^T \mathbf{x}_t + \varepsilon_t$$

where $\mu_t$ is the trend (level + slope), $\gamma_t$ is seasonality, $\mathbf{x}_t$ are exogenous regressors, and $\varepsilon_t$ is observation noise. Each component is a state-space model, and their sum defines the complete model.

The Bayesian approach, implemented in tensorflow_probability.sts and orbit-ml, places priors on all model parameters and performs full posterior inference via MCMC or variational inference. This yields several advantages over point estimation (e.g., Prophet's MAP):

  1. Uncertainty propagation. Uncertainty in the model parameters propagates into the forecast intervals, producing wider (and more honest) intervals than plug-in estimates.
  2. Automatic regularization. Priors prevent overfitting to noise, especially for short series.
  3. Model comparison. The marginal likelihood enables principled comparison between models with different components.

import tensorflow_probability as tfp
import tensorflow as tf

# NOTE: This code requires tensorflow and tensorflow_probability.
# pip install tensorflow tensorflow_probability

sts = tfp.sts


def build_bayesian_structural_model(
    y: np.ndarray,
    seasonal_periods: List[int],
    num_fourier_terms: Optional[List[int]] = None,
) -> sts.Sum:
    """Build a Bayesian structural time series model.

    Args:
        y: Observed time series, shape (T,) or (T, 1).
        seasonal_periods: List of seasonal periods (e.g., [7, 365.25]).
        num_fourier_terms: Fourier terms per seasonal component.
            Defaults to max(3, period // 4) for each.

    Returns:
        TFP STS model (sum of components).
    """
    if num_fourier_terms is None:
        num_fourier_terms = [max(3, int(p) // 4) for p in seasonal_periods]

    observed = tf.constant(y.reshape(-1, 1), dtype=tf.float32)

    components = []

    # Local linear trend
    trend = sts.LocalLinearTrend(observed_time_series=observed, name="trend")
    components.append(trend)

    # Seasonal components (Fourier representation via SmoothSeasonal)
    for period, n_terms in zip(seasonal_periods, num_fourier_terms):
        seasonal = sts.SmoothSeasonal(
            period=period,
            frequency_multipliers=list(range(1, int(n_terms) + 1)),
            observed_time_series=observed,
            name=f"seasonal_{int(period)}",
        )
        components.append(seasonal)

    model = sts.Sum(components, observed_time_series=observed)
    return model


def fit_and_forecast_bayesian_sts(
    y_train: np.ndarray,
    seasonal_periods: List[int],
    forecast_horizon: int,
    num_variational_steps: int = 400,
    num_samples: int = 200,
) -> dict:
    """Fit a Bayesian STS model and produce probabilistic forecasts.

    Args:
        y_train: Training time series.
        seasonal_periods: Seasonal periods to model.
        forecast_horizon: Steps ahead to forecast.
        num_variational_steps: Optimization steps for variational inference.
        num_samples: Posterior samples to draw.

    Returns:
        Dictionary with forecast mean, std, and samples.
    """
    observed = tf.constant(y_train.reshape(-1, 1), dtype=tf.float32)
    model = build_bayesian_structural_model(y_train, seasonal_periods)

    # Variational inference (faster than MCMC for prototyping)
    variational_posteriors = tfp.sts.build_factored_surrogate_posterior(model=model)

    loss_curve = tfp.vi.fit_surrogate_posterior(
        target_log_prob_fn=model.joint_distribution(observed_time_series=observed).log_prob,
        surrogate_posterior=variational_posteriors,
        optimizer=tf.optimizers.Adam(learning_rate=0.1),
        num_steps=num_variational_steps,
    )

    # Draw posterior parameter samples
    param_samples = variational_posteriors.sample(num_samples)

    # Forecast
    forecast_dist = tfp.sts.forecast(
        model=model,
        observed_time_series=observed,
        parameter_samples=param_samples,
        num_steps_forecast=forecast_horizon,
    )

    forecast_mean = forecast_dist.mean().numpy().squeeze()
    forecast_std = forecast_dist.stddev().numpy().squeeze()
    forecast_samples = forecast_dist.sample(50).numpy().squeeze()

    return {
        "mean": forecast_mean,
        "std": forecast_std,
        "samples": forecast_samples,
        "loss_curve": loss_curve.numpy(),
    }

Research Insight: Harvey (1989) established structural time series models as a rigorous statistical framework. Scott and Varian (2014) introduced the Bayesian structural time series (BSTS) model for causal impact analysis (the CausalImpact R package), which uses the same framework to estimate what a time series would have done in the absence of an intervention. This causal application — using the STM as a counterfactual predictor — connects directly to the potential outcomes framework from Chapter 16.


23.6 Deep Learning for Time Series: The Landscape

The intermediate book treated time series as a statistics problem: ARIMA, exponential smoothing, Prophet. This section treats time series as a deep learning problem. The motivating question is: when does the pattern complexity in the data justify the model complexity of a neural network?

The answer, backed by results from the M4 (Makridakis et al., 2020) and M5 competitions, is nuanced:

  • For individual, short series (< 1,000 observations), classical methods (ETS, ARIMA, Theta) often win.
  • For multiple related series (global models), deep learning excels because it can learn shared patterns across series.
  • For long series with complex patterns (multiple seasonalities, non-linear interactions with covariates), deep learning is often the only practical approach.

We will cover three architectures that represent distinct philosophies:

Architecture   Philosophy                           Key Innovation
─────────────────────────────────────────────────────────────────────────────────────────
N-BEATS        Pure DL, no hand-crafted features    Basis expansion with backward/forward residual links
DeepAR         Autoregressive probabilistic         Parametric distribution output, global training
TFT            Attention-based, interpretable       Variable selection + temporal attention + multi-horizon

23.7 N-BEATS: Pure Deep Learning Forecasting

N-BEATS (Neural Basis Expansion Analysis for Time Series; Oreshkin et al., 2020) is a pure deep learning approach to univariate time series forecasting. Its philosophy is radical: no feature engineering, no decomposition, no domain knowledge — just stacked fully connected layers with a clever output structure.

Architecture

N-BEATS is organized into stacks, each containing multiple blocks. Each block:

  1. Takes a lookback window $\mathbf{x} \in \mathbb{R}^L$ as input.
  2. Passes it through a stack of fully connected layers with ReLU activations.
  3. Produces two outputs via basis expansion:
     • Backcast $\hat{\mathbf{x}} \in \mathbb{R}^L$: the block's reconstruction of the input.
     • Forecast $\hat{\mathbf{y}} \in \mathbb{R}^H$: the block's partial forecast for the horizon.

The key innovation is the doubly residual architecture: the input to the next block is the original input minus the backcast (the residual that this block could not explain). The final forecast is the sum of all blocks' partial forecasts.
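In symbols, writing block $i$'s input as $\mathbf{x}^{(i)}$ (with $\mathbf{x}^{(1)} = \mathbf{x}$), the doubly residual connections are:

$$\mathbf{x}^{(i+1)} = \mathbf{x}^{(i)} - \hat{\mathbf{x}}^{(i)}, \qquad \hat{\mathbf{y}} = \sum_{i=1}^{B} \hat{\mathbf{y}}^{(i)}$$

where $B$ is the total number of blocks across all stacks.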

For the interpretable variant, blocks are constrained to specific basis functions:

  • Trend blocks use polynomial bases: $\hat{\mathbf{y}}_{\text{trend}} = \sum_{p=0}^{P} \theta_p \, \mathbf{t}^p$ where $\mathbf{t} = [1, 2, \ldots, H]^T / H$.
  • Seasonality blocks use Fourier bases: $\hat{\mathbf{y}}_{\text{seasonal}} = \sum_{k=1}^{K} [a_k \cos(2\pi k \mathbf{t}) + b_k \sin(2\pi k \mathbf{t})]$, where $\mathbf{t}$ is the same normalized time grid (already scaled to $[0, 1)$, so no further division by $H$ is needed).
import torch
import torch.nn as nn
from typing import List, Tuple


class NBEATSBlock(nn.Module):
    """Single N-BEATS block with fully connected layers and basis expansion.

    Args:
        lookback: Length of the input lookback window L.
        horizon: Length of the forecast horizon H.
        hidden_dim: Width of hidden layers.
        n_layers: Number of hidden layers.
        basis_type: 'generic', 'trend', or 'seasonal'.
        degree: Polynomial degree (trend) or number of harmonics (seasonal).
    """

    def __init__(
        self,
        lookback: int,
        horizon: int,
        hidden_dim: int = 256,
        n_layers: int = 4,
        basis_type: str = "generic",
        degree: int = 3,
    ):
        super().__init__()
        self.lookback = lookback
        self.horizon = horizon
        self.basis_type = basis_type

        # Shared FC stack
        layers = [nn.Linear(lookback, hidden_dim), nn.ReLU()]
        for _ in range(n_layers - 1):
            layers.extend([nn.Linear(hidden_dim, hidden_dim), nn.ReLU()])
        self.fc_stack = nn.Sequential(*layers)

        if basis_type == "generic":
            self.theta_b = nn.Linear(hidden_dim, lookback)
            self.theta_f = nn.Linear(hidden_dim, horizon)
        elif basis_type == "trend":
            self.degree = degree
            n_coeffs = degree + 1
            self.theta_b = nn.Linear(hidden_dim, n_coeffs)
            self.theta_f = nn.Linear(hidden_dim, n_coeffs)
            # Pre-compute polynomial basis
            self.register_buffer(
                "T_back",
                torch.stack(
                    [(torch.arange(lookback, dtype=torch.float32) / lookback) ** p
                     for p in range(n_coeffs)]
                ),  # (n_coeffs, L)
            )
            self.register_buffer(
                "T_fore",
                torch.stack(
                    [(torch.arange(horizon, dtype=torch.float32) / horizon) ** p
                     for p in range(n_coeffs)]
                ),  # (n_coeffs, H)
            )
        elif basis_type == "seasonal":
            self.n_harmonics = degree
            n_coeffs = 2 * degree  # sin + cos
            self.theta_b = nn.Linear(hidden_dim, n_coeffs)
            self.theta_f = nn.Linear(hidden_dim, n_coeffs)
            t_b = torch.arange(lookback, dtype=torch.float32) / lookback
            t_f = torch.arange(horizon, dtype=torch.float32) / horizon
            basis_b, basis_f = [], []
            for k in range(1, degree + 1):
                basis_b.extend([torch.cos(2 * torch.pi * k * t_b),
                                torch.sin(2 * torch.pi * k * t_b)])
                basis_f.extend([torch.cos(2 * torch.pi * k * t_f),
                                torch.sin(2 * torch.pi * k * t_f)])
            self.register_buffer("S_back", torch.stack(basis_b))  # (n_coeffs, L)
            self.register_buffer("S_fore", torch.stack(basis_f))  # (n_coeffs, H)

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """Forward pass.

        Args:
            x: Input lookback window, shape (batch, L).

        Returns:
            backcast: Shape (batch, L).
            forecast: Shape (batch, H).
        """
        h = self.fc_stack(x)  # (batch, hidden_dim)

        if self.basis_type == "generic":
            backcast = self.theta_b(h)
            forecast = self.theta_f(h)
        elif self.basis_type == "trend":
            theta_b = self.theta_b(h)  # (batch, n_coeffs)
            theta_f = self.theta_f(h)
            backcast = theta_b @ self.T_back  # (batch, L)
            forecast = theta_f @ self.T_fore  # (batch, H)
        elif self.basis_type == "seasonal":
            theta_b = self.theta_b(h)
            theta_f = self.theta_f(h)
            backcast = theta_b @ self.S_back
            forecast = theta_f @ self.S_fore

        return backcast, forecast


class NBEATS(nn.Module):
    """N-BEATS model: stacked blocks with doubly residual connections.

    Args:
        lookback: Lookback window length.
        horizon: Forecast horizon.
        stack_configs: List of (basis_type, n_blocks, hidden_dim, degree) tuples.
    """

    def __init__(
        self,
        lookback: int,
        horizon: int,
        stack_configs: List[Tuple[str, int, int, int]],
    ):
        super().__init__()
        self.blocks = nn.ModuleList()
        for basis_type, n_blocks, hidden_dim, degree in stack_configs:
            for _ in range(n_blocks):
                self.blocks.append(
                    NBEATSBlock(lookback, horizon, hidden_dim, basis_type=basis_type, degree=degree)
                )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """Forward pass with doubly residual connections.

        Args:
            x: Input lookback window, shape (batch, L).

        Returns:
            forecast: Shape (batch, H).
        """
        residual = x
        forecast = torch.zeros(x.shape[0], self.blocks[0].horizon, device=x.device)

        for block in self.blocks:
            backcast, block_forecast = block(residual)
            residual = residual - backcast       # Residual connection (input)
            forecast = forecast + block_forecast  # Residual connection (output)

        return forecast


# Interpretable N-BEATS configuration
model_nbeats = NBEATS(
    lookback=60,  # 60 time steps of history
    horizon=12,   # 12 steps ahead
    stack_configs=[
        ("trend", 3, 256, 3),      # 3 trend blocks, polynomial degree 3
        ("seasonal", 3, 256, 5),   # 3 seasonal blocks, 5 harmonics
    ],
)

# Verify architecture
x_dummy = torch.randn(32, 60)
y_dummy = model_nbeats(x_dummy)
print(f"N-BEATS input: {x_dummy.shape}, output: {y_dummy.shape}")
total_params = sum(p.numel() for p in model_nbeats.parameters())
print(f"Total parameters: {total_params:,}")
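To show the training mechanics end-to-end, here is a compact, self-contained sketch: a minimal generic-basis variant of the doubly residual architecture, trained with MSE on a synthetic trend-plus-seasonality series. Everything here (class names, data, hyperparameters) is illustrative, not the chapter's pipeline.

```python
import numpy as np
import torch
import torch.nn as nn

torch.manual_seed(0)


class TinyBlock(nn.Module):
    """Generic-basis N-BEATS block: FC stack -> (backcast, forecast)."""

    def __init__(self, lookback: int, horizon: int, hidden: int = 64):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(lookback, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.backcast = nn.Linear(hidden, lookback)
        self.forecast = nn.Linear(hidden, horizon)

    def forward(self, x):
        h = self.fc(x)
        return self.backcast(h), self.forecast(h)


class TinyNBEATS(nn.Module):
    """Doubly residual stack of generic blocks."""

    def __init__(self, lookback: int, horizon: int, n_blocks: int = 3):
        super().__init__()
        self.horizon = horizon
        self.blocks = nn.ModuleList(
            TinyBlock(lookback, horizon) for _ in range(n_blocks)
        )

    def forward(self, x):
        residual = x
        forecast = torch.zeros(x.shape[0], self.horizon)
        for blk in self.blocks:
            back, fore = blk(residual)
            residual = residual - back    # what this block could not explain
            forecast = forecast + fore    # accumulate partial forecasts
        return forecast


# Synthetic trend + seasonality series (illustrative)
t = np.arange(400, dtype=np.float32)
series = (0.01 * t + np.sin(2 * np.pi * t / 12)).astype(np.float32)

lookback, horizon = 24, 6
n_windows = len(series) - lookback - horizon
X = np.stack([series[i:i + lookback] for i in range(n_windows)])
Y = np.stack([series[i + lookback:i + lookback + horizon] for i in range(n_windows)])
X_t, Y_t = torch.from_numpy(X), torch.from_numpy(Y)

model = TinyNBEATS(lookback, horizon)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

losses = []
for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X_t), Y_t)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(f"training MSE: {losses[0]:.4f} -> {losses[-1]:.4f}")
```

The same loop works for the full `NBEATS` model above; only the window construction and hyperparameters change.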

Common Misconception: "N-BEATS cannot handle exogenous variables." This is true for the original architecture, which is purely univariate. However, N-BEATSx (Olivares et al., 2023) extends the architecture to accept exogenous features by concatenating them with the lookback window or feeding them through separate encoder paths. For problems with important covariates (e.g., weather data for energy demand forecasting), TFT (Section 23.8) is typically preferred.


23.8 The Temporal Fusion Transformer

The Temporal Fusion Transformer (TFT; Lim et al., 2021) is the most architecturally complete deep learning model for multi-horizon time series forecasting. It addresses a limitation that all prior DL time series models share: the inability to distinguish between different types of inputs and the lack of interpretability.

Why TFT Matters

TFT was designed to answer a practical question: "Given a complex forecasting problem with static metadata, historical time-varying inputs, and known future information (like holidays), can a single architecture handle all of them while remaining interpretable?"

The architecture has four key innovations:

  1. Variable selection networks — learned attention over input features, allowing the model to discover which variables matter at each time step.
  2. Static covariate encoders — time-invariant features (e.g., item category, geographic region) influence the network's behavior through context vectors.
  3. Temporal self-attention — multi-head attention over time steps, with an interpretable attention pattern that reveals which past time steps the model considers most important.
  4. Quantile outputs — the model directly outputs quantile forecasts (e.g., 10th, 50th, 90th percentiles), providing probabilistic predictions without distributional assumptions.

Architecture Walkthrough

Input Types:
┌──────────────────────────────────────────────────────┐
│ Static metadata         (e.g., item_id, category)    │
│ Historical observed     (e.g., past engagement)      │
│ Historical known        (e.g., past day-of-week)     │
│ Future known            (e.g., future holidays)      │
└──────────────────────────────────────────────────────┘
         │
         ▼
┌──────────────────────────────────────────────────────┐
│          Variable Selection Networks                  │
│   Learned soft-attention over input features          │
│   Output: weighted combination of transformed inputs  │
└──────────────────────────────────────────────────────┘
         │
         ▼
┌──────────────────────────────────────────────────────┐
│          Static Covariate Encoders                    │
│   4 context vectors: c_s, c_e, c_c, c_h              │
│   Feed into: variable selection, LSTM init,           │
│              enrichment, state init                    │
└──────────────────────────────────────────────────────┘
         │
         ▼
┌──────────────────────────────────────────────────────┐
│          Temporal Processing (LSTM encoder-decoder)   │
│   Encoder: processes historical inputs                │
│   Decoder: processes known future inputs              │
│   Gated skip connections throughout                   │
└──────────────────────────────────────────────────────┘
         │
         ▼
┌──────────────────────────────────────────────────────┐
│          Temporal Self-Attention (Interpretable)      │
│   Multi-head attention over decoder outputs           │
│   Attention weights reveal temporal importance        │
└──────────────────────────────────────────────────────┘
         │
         ▼
┌──────────────────────────────────────────────────────┐
│          Quantile Output Layer                        │
│   Dense → [q_10, q_50, q_90] for each future step   │
└──────────────────────────────────────────────────────┘

Implementation with PyTorch Forecasting

Building TFT from scratch is a substantial engineering effort (the paper's architecture has ~15 distinct components). In practice, the pytorch-forecasting library provides a well-tested implementation:

from typing import List, Optional

import pandas as pd
import pytorch_lightning as pl
import pytorch_forecasting as ptf
from pytorch_forecasting import TemporalFusionTransformer, TimeSeriesDataSet
from pytorch_forecasting.data import GroupNormalizer


def prepare_tft_dataset(
    df: pd.DataFrame,
    target: str = "engagement_rate",
    group_col: str = "content_category",
    time_col: str = "time_idx",
    max_encoder_length: int = 90,
    max_prediction_length: int = 30,
    known_reals: Optional[List[str]] = None,
    known_categoricals: Optional[List[str]] = None,
    static_categoricals: Optional[List[str]] = None,
) -> TimeSeriesDataSet:
    """Prepare a TimeSeriesDataSet for TFT training.

    Args:
        df: DataFrame with time series data. Must contain time_idx,
            group_col, target, and any covariates.
        target: Target variable column name.
        group_col: Column identifying each time series.
        time_col: Integer time index column.
        max_encoder_length: Lookback window (historical context).
        max_prediction_length: Forecast horizon.
        known_reals: Real-valued features known in the future.
        known_categoricals: Categorical features known in the future.
        static_categoricals: Time-invariant categorical features.

    Returns:
        Configured TimeSeriesDataSet.
    """
    if known_reals is None:
        known_reals = []
    if known_categoricals is None:
        known_categoricals = ["day_of_week", "month"]
    if static_categoricals is None:
        static_categoricals = [group_col]

    dataset = TimeSeriesDataSet(
        df,
        time_idx=time_col,
        target=target,
        group_ids=[group_col],
        max_encoder_length=max_encoder_length,
        max_prediction_length=max_prediction_length,
        static_categoricals=static_categoricals,
        time_varying_known_categoricals=known_categoricals,
        time_varying_known_reals=known_reals,
        time_varying_unknown_reals=[target],
        target_normalizer=GroupNormalizer(groups=[group_col]),
        add_relative_time_idx=True,
        add_target_scales=True,
        add_encoder_length=True,
    )
    return dataset


def train_tft(
    training_dataset: TimeSeriesDataSet,
    validation_dataset: TimeSeriesDataSet,
    hidden_size: int = 64,
    attention_head_size: int = 4,
    dropout: float = 0.1,
    learning_rate: float = 1e-3,
    max_epochs: int = 50,
    quantiles: Optional[List[float]] = None,
) -> TemporalFusionTransformer:
    """Train a Temporal Fusion Transformer.

    Args:
        training_dataset: Training TimeSeriesDataSet.
        validation_dataset: Validation TimeSeriesDataSet.
        hidden_size: Hidden state dimension.
        attention_head_size: Number of attention heads.
        dropout: Dropout rate.
        learning_rate: Learning rate for Adam optimizer.
        max_epochs: Maximum training epochs.
        quantiles: Output quantiles (default: seven levels from 0.02 to 0.98).

    Returns:
        Trained TFT model.
    """
    if quantiles is None:
        quantiles = [0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98]

    train_loader = training_dataset.to_dataloader(
        train=True, batch_size=64, num_workers=0
    )
    val_loader = validation_dataset.to_dataloader(
        train=False, batch_size=256, num_workers=0
    )

    model = TemporalFusionTransformer.from_dataset(
        training_dataset,
        hidden_size=hidden_size,
        attention_head_size=attention_head_size,
        dropout=dropout,
        hidden_continuous_size=32,
        learning_rate=learning_rate,
        loss=ptf.metrics.QuantileLoss(quantiles=quantiles),
        log_interval=10,
        reduce_on_plateau_patience=4,
    )

    trainer = pl.Trainer(
        max_epochs=max_epochs,
        accelerator="auto",
        gradient_clip_val=0.1,
        enable_progress_bar=True,
    )
    trainer.fit(model, train_dataloaders=train_loader, val_dataloaders=val_loader)

    return model


def extract_tft_interpretations(
    model: TemporalFusionTransformer,
    dataset: TimeSeriesDataSet,
) -> dict:
    """Extract interpretability outputs from a trained TFT.

    Returns variable importances and attention patterns.

    Args:
        model: Trained TFT model.
        dataset: Dataset to interpret.

    Returns:
        Dictionary with variable_importances and attention_weights.
    """
    loader = dataset.to_dataloader(train=False, batch_size=256, num_workers=0)
    raw_predictions = model.predict(loader, mode="raw", return_x=True)
    # interpret_output expects the raw network output, not the full
    # (output, x, ...) tuple returned when return_x=True
    interpretation = model.interpret_output(raw_predictions.output, reduction="mean")

    return {
        "variable_importances": {
            "encoder": interpretation["encoder_variables"],
            "decoder": interpretation["decoder_variables"],
            "static": interpretation["static_variables"],
        },
        "attention_weights": interpretation["attention"],
    }

Production Reality: TFT's attention weights are its most valuable practical feature. In a StreamRec deployment, the variable selection weights might reveal that day_of_week explains 40% of the engagement variance while a promotional flag — which the team expected to be important — explains only 3%. The temporal attention pattern might show that the model attends strongly to the same day one week ago and one year ago, confirming the dual seasonality hypothesis. These interpretability outputs transform TFT from a black-box forecaster into a hypothesis-generating tool.


23.8.1 DeepAR: Autoregressive Probabilistic Forecasting

DeepAR (Salinas et al., 2020) takes a different approach. Instead of outputting quantiles directly, it models the conditional distribution of each future time step autoregressively:

$$p(y_{t+1} \mid y_{1:t}, \mathbf{x}_{1:t+1}) = \text{Distribution}(\theta_{t+1})$$

$$\theta_{t+1} = g(\mathbf{h}_{t+1}), \quad \mathbf{h}_{t+1} = \text{LSTM}(\mathbf{h}_t, [y_t, \mathbf{x}_{t+1}])$$

The LSTM hidden state $\mathbf{h}_t$ summarizes the history. At each step, the network outputs the parameters $\theta_{t+1}$ of a chosen distribution family (Gaussian, negative binomial, Student-t, etc.). Training maximizes the log-likelihood:

$$\mathcal{L} = \sum_{t} \log p(y_t \mid \theta_t)$$

At inference time, forecasts are generated by ancestral sampling: sample $\hat{y}_{t+1} \sim p(y_{t+1} \mid \theta_{t+1})$, feed $\hat{y}_{t+1}$ back into the LSTM, generate $\hat{y}_{t+2}$, and so on. Repeating this process $M$ times yields $M$ sample paths — the empirical distribution of these paths is the probabilistic forecast.

class DeepARModel(nn.Module):
    """Simplified DeepAR model with Gaussian output.

    Args:
        input_dim: Number of input features per time step.
        hidden_dim: LSTM hidden state dimension.
        n_layers: Number of LSTM layers.
    """

    def __init__(self, input_dim: int, hidden_dim: int = 128, n_layers: int = 2):
        super().__init__()
        self.lstm = nn.LSTM(
            input_size=input_dim + 1,  # +1 for autoregressive input (y_{t-1})
            hidden_size=hidden_dim,
            num_layers=n_layers,
            dropout=0.1 if n_layers > 1 else 0.0,
            batch_first=True,
        )
        self.mu_head = nn.Linear(hidden_dim, 1)
        self.sigma_head = nn.Sequential(
            nn.Linear(hidden_dim, 1),
            nn.Softplus(),  # Ensure positive standard deviation
        )

    def forward(
        self, x: torch.Tensor, y_past: torch.Tensor
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        """Forward pass during training (teacher forcing).

        Args:
            x: Covariates, shape (batch, T, input_dim).
            y_past: Target values, shape (batch, T). Used for autoregressive input.

        Returns:
            mu: Predicted means, shape (batch, T, 1).
            sigma: Predicted stds, shape (batch, T, 1).
        """
        # Autoregressive input: shift y by 1, pad with 0 at the start
        y_shifted = torch.cat(
            [torch.zeros(y_past.shape[0], 1, device=y_past.device), y_past[:, :-1]],
            dim=1,
        ).unsqueeze(-1)  # (batch, T, 1)

        lstm_input = torch.cat([x, y_shifted], dim=-1)  # (batch, T, input_dim+1)
        h, _ = self.lstm(lstm_input)  # (batch, T, hidden_dim)
        mu = self.mu_head(h)       # (batch, T, 1)
        sigma = self.sigma_head(h)  # (batch, T, 1)
        return mu, sigma

    def gaussian_nll(
        self, mu: torch.Tensor, sigma: torch.Tensor, y_true: torch.Tensor
    ) -> torch.Tensor:
        """Gaussian negative log-likelihood loss.

        Args:
            mu: Predicted means, shape (batch, T, 1).
            sigma: Predicted stds, shape (batch, T, 1).
            y_true: True values, shape (batch, T).

        Returns:
            Scalar loss.
        """
        y = y_true.unsqueeze(-1)
        nll = 0.5 * torch.log(2 * torch.pi * sigma**2) + (y - mu)**2 / (2 * sigma**2)
        return nll.mean()

    @torch.no_grad()
    def sample_forecast(
        self,
        x_hist: torch.Tensor,
        y_hist: torch.Tensor,
        x_future: torch.Tensor,
        n_samples: int = 100,
    ) -> np.ndarray:
        """Generate probabilistic forecasts via ancestral sampling.

        Args:
            x_hist: Historical covariates, shape (1, T_hist, input_dim).
            y_hist: Historical targets, shape (1, T_hist).
            x_future: Future covariates, shape (1, H, input_dim).
            n_samples: Number of sample paths.

        Returns:
            samples: Shape (n_samples, H).
        """
        self.eval()
        H = x_future.shape[1]
        all_samples = np.zeros((n_samples, H))

        for s in range(n_samples):
            # Encode history
            y_shifted = torch.cat(
                [torch.zeros(1, 1, device=y_hist.device), y_hist[:, :-1]], dim=1
            ).unsqueeze(-1)
            hist_input = torch.cat([x_hist, y_shifted], dim=-1)
            _, (h_state, c_state) = self.lstm(hist_input)

            # Autoregressive decoding
            y_prev = y_hist[:, -1:].unsqueeze(-1)  # (1, 1, 1)
            for t in range(H):
                step_input = torch.cat(
                    [x_future[:, t:t+1, :], y_prev], dim=-1
                )  # (1, 1, input_dim+1)
                out, (h_state, c_state) = self.lstm(step_input, (h_state, c_state))
                mu = self.mu_head(out).squeeze()
                sigma = self.sigma_head(out).squeeze()
                y_sample = torch.normal(mu, sigma)
                all_samples[s, t] = y_sample.item()
                y_prev = y_sample.reshape(1, 1, 1)

        return all_samples


# Verify DeepAR architecture
model_deepar = DeepARModel(input_dim=5, hidden_dim=128, n_layers=2)
x_test = torch.randn(16, 90, 5)
y_test = torch.randn(16, 90)
mu, sigma = model_deepar(x_test, y_test)
print(f"DeepAR output shapes — mu: {mu.shape}, sigma: {sigma.shape}")
print(f"sigma range: [{sigma.min().item():.4f}, {sigma.max().item():.4f}]")
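The closed-form expression in `gaussian_nll` can be checked against the log-density from `torch.distributions` — a standalone sanity check on synthetic tensors:

```python
import torch

torch.manual_seed(0)
mu = torch.randn(8, 5, 1)
sigma = torch.rand(8, 5, 1) + 0.5   # keep strictly positive
y = torch.randn(8, 5, 1)

# Closed form used in gaussian_nll: 0.5 log(2*pi*sigma^2) + (y - mu)^2 / (2 sigma^2)
nll_manual = (0.5 * torch.log(2 * torch.pi * sigma**2)
              + (y - mu) ** 2 / (2 * sigma**2)).mean()

# Reference: mean negative log-density of a Normal distribution
nll_ref = -torch.distributions.Normal(mu, sigma).log_prob(y).mean()

print(bool(torch.allclose(nll_manual, nll_ref, atol=1e-6)))
```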

The key advantage of DeepAR over TFT is simplicity: the LSTM backbone is straightforward, and the probabilistic output is a natural consequence of the autoregressive formulation. The key disadvantage is that the autoregressive generation loop is sequential, making inference slower for long horizons, and error accumulation in the sampling process can degrade forecast quality.


23.9 Probabilistic Forecasting: The Key Differentiator

A point forecast is a single number: "We predict 1,847 daily active users tomorrow." A probabilistic forecast is a distribution: "We predict daily active users tomorrow to be approximately $\mathcal{N}(1847, 120^2)$, with a 90% prediction interval of [1,650, 2,044]."

The distinction matters when decisions depend on uncertainty. A climate policymaker does not care whether the best-guess temperature increase is 2.3°C or 2.5°C — they care about the probability that it exceeds 3.0°C, because that threshold triggers different infrastructure investments. A StreamRec operations team does not care whether expected DAU is 1,847 or 1,852 — they care about the probability that DAU drops below 1,500, because that threshold triggers a content acquisition response.
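Decisions like these reduce to threshold probabilities, which fall directly out of a sample-based forecast. A sketch with synthetic sample paths (the 1,500-DAU threshold is from the example above; the samples are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(7)
# Pretend these are 1,000 sampled forecasts of tomorrow's DAU
# drawn from a probabilistic model (synthetic for illustration)
dau_samples = rng.normal(loc=1847, scale=120, size=1000)

threshold = 1500
p_below = float(np.mean(dau_samples < threshold))   # exceedance probability
interval = np.percentile(dau_samples, [5, 95])      # central 90% interval

print(f"P(DAU < {threshold}) ~ {p_below:.3f}")
print(f"90% prediction interval: [{interval[0]:.0f}, {interval[1]:.0f}]")
```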

Quantile Regression

The most common approach to probabilistic forecasting is quantile regression: instead of predicting the conditional mean $\mathbb{E}[y_t \mid \mathbf{x}_t]$, predict the conditional quantiles $Q_\tau(y_t \mid \mathbf{x}_t)$ for $\tau \in \{0.01, 0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95, 0.99\}$.

The quantile loss (pinball loss) for quantile level $\tau$ is:

$$\rho_\tau(y, \hat{q}) = \begin{cases} \tau (y - \hat{q}) & \text{if } y \geq \hat{q} \\ (1 - \tau)(\hat{q} - y) & \text{if } y < \hat{q} \end{cases}$$

This asymmetric loss penalizes under-predictions more heavily for high quantiles (e.g., $\tau = 0.9$) and over-predictions more heavily for low quantiles (e.g., $\tau = 0.1$). Any model — linear, tree-based, neural network — can be trained with quantile loss to produce quantile forecasts.

def quantile_loss(
    y_true: torch.Tensor, y_pred: torch.Tensor, quantiles: List[float]
) -> torch.Tensor:
    """Quantile loss (pinball loss) for multiple quantile levels.

    Args:
        y_true: True values, shape (batch, horizon).
        y_pred: Predicted quantiles, shape (batch, horizon, n_quantiles).
        quantiles: List of quantile levels, e.g., [0.1, 0.5, 0.9].

    Returns:
        Scalar loss (mean over all dimensions and quantiles).
    """
    losses = []
    for i, tau in enumerate(quantiles):
        error = y_true - y_pred[:, :, i]
        loss = torch.max(tau * error, (tau - 1) * error)
        losses.append(loss)
    return torch.stack(losses).mean()


# Demonstration
y_true = torch.tensor([[100.0, 110.0, 105.0]])
y_pred = torch.tensor([[[90.0, 100.0, 115.0],   # h=1: q10, q50, q90
                         [95.0, 108.0, 120.0],   # h=2
                         [92.0, 103.0, 118.0]]])  # h=3
quantiles = [0.1, 0.5, 0.9]
loss = quantile_loss(y_true, y_pred, quantiles)
print(f"Quantile loss: {loss.item():.4f}")

Mathematical Foundation: Quantile regression minimizes the expected quantile loss $\mathbb{E}[\rho_\tau(Y, q)]$, and the unique minimizer is the $\tau$-th quantile of $Y$. This is the distributional analogue of the fact that mean squared error is minimized by the conditional mean, and mean absolute error is minimized by the conditional median. Koenker and Bassett (1978) established this result; Koenker (2005) is the definitive reference.
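The Koenker–Bassett result can be verified numerically: minimizing the empirical pinball loss over a grid of candidate values recovers the sample quantile. A standalone check on skewed synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(scale=2.0, size=5000)   # skewed, so quantiles differ from the mean
tau = 0.9


def pinball(y: np.ndarray, q: float, tau: float) -> float:
    """Empirical pinball loss at candidate quantile q."""
    e = y - q
    return float(np.mean(np.maximum(tau * e, (tau - 1) * e)))


grid = np.linspace(0.0, 10.0, 2001)
losses = np.array([pinball(y, q, tau) for q in grid])
q_star = grid[losses.argmin()]

print(f"argmin of empirical pinball loss: {q_star:.3f}")
print(f"np.quantile(y, {tau}):            {np.quantile(y, tau):.3f}")
```

Both values agree (up to grid resolution) with the theoretical 0.9-quantile of an Exponential(2) distribution, $-2 \ln(0.1) \approx 4.605$.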

Prediction Intervals vs. Confidence Intervals

A prediction interval contains future observations with a stated probability. A confidence interval contains a fixed parameter with a stated long-run coverage frequency. These are different objects:

  • A 90% prediction interval for tomorrow's temperature says: "There is a 90% probability that tomorrow's temperature falls in this range."
  • A 90% confidence interval for the trend parameter says: "If we repeated the estimation many times, 90% of intervals would contain the true trend."

In forecasting, we care about prediction intervals. The Kalman filter produces exact prediction intervals (for linear-Gaussian models). Quantile regression produces empirical quantile-based intervals. Bayesian models produce posterior predictive intervals. All of these should be evaluated by their calibration — do the intervals actually achieve their nominal coverage?
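The difference shows up clearly in simulation: as $n$ grows, the confidence interval for the mean shrinks toward zero width, while the prediction interval converges to the spread of the data. A standalone sketch using the normal approximation (plug-in $s$ and a $z$ critical value rather than $t$, for simplicity):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
z = stats.norm.ppf(0.95)  # two-sided 90% critical value

results = {}
for n in [30, 300, 3000]:
    x = rng.normal(loc=50, scale=10, size=n)
    s = x.std(ddof=1)
    ci_half = z * s / np.sqrt(n)            # 90% CI for the mean: shrinks with n
    pi_half = z * s * np.sqrt(1 + 1 / n)    # 90% PI for a new observation: does not
    results[n] = (ci_half, pi_half)
    print(f"n={n:5d}  CI half-width={ci_half:6.2f}  PI half-width={pi_half:6.2f}")
```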


23.10 Calibration: Are Your Uncertainty Estimates Honest?

A probabilistic forecast is calibrated if events predicted to occur with probability $p$ actually occur with frequency $p$. More precisely, for quantile forecasts:

$$\text{If } \hat{q}_\tau \text{ is the predicted } \tau\text{-quantile, then calibration requires } P(y \leq \hat{q}_\tau) = \tau$$

Calibration is the minimum requirement for probabilistic forecasting. A model that claims 90% prediction intervals but achieves only 75% coverage is worse than a model that claims 80% and achieves 80% — because the first model gives you false confidence.

Measuring Calibration

The Probability Integral Transform (PIT) test evaluates calibration. If $F_t$ is the predicted CDF at time $t$, then the PIT values $u_t = F_t(y_t)$ should be uniformly distributed on $[0, 1]$ under calibration:

from scipy import stats as sp_stats


def pit_calibration_test(
    y_true: np.ndarray,
    forecast_quantiles: np.ndarray,
    quantile_levels: np.ndarray,
) -> dict:
    """Assess forecast calibration using the PIT and empirical coverage.

    Args:
        y_true: Actual observations, shape (n,).
        forecast_quantiles: Predicted quantiles, shape (n, n_quantiles).
        quantile_levels: Quantile levels, shape (n_quantiles,).

    Returns:
        Dictionary with empirical coverage per level, PIT histogram,
        and Kolmogorov-Smirnov test for uniformity.
    """
    n = len(y_true)
    n_q = len(quantile_levels)

    # Empirical coverage
    empirical_coverage = np.zeros(n_q)
    for j, tau in enumerate(quantile_levels):
        empirical_coverage[j] = np.mean(y_true <= forecast_quantiles[:, j])

    # PIT values (interpolated)
    pit_values = np.zeros(n)
    for i in range(n):
        # Linearly interpolate CDF at y_true[i]
        pit_values[i] = np.interp(
            y_true[i],
            forecast_quantiles[i],
            quantile_levels,
            left=0.0,
            right=1.0,
        )

    # KS test for uniformity
    ks_stat, ks_pvalue = sp_stats.kstest(pit_values, "uniform")

    # Calibration error: mean absolute deviation from nominal
    calibration_error = np.mean(np.abs(empirical_coverage - quantile_levels))

    return {
        "quantile_levels": quantile_levels,
        "empirical_coverage": empirical_coverage,
        "calibration_error": calibration_error,
        "pit_values": pit_values,
        "ks_statistic": ks_stat,
        "ks_pvalue": ks_pvalue,
    }


# Demonstration with synthetic well-calibrated and miscalibrated forecasts
np.random.seed(42)
n_test = 500
y_true_demo = np.random.normal(100, 15, n_test)
quantile_levels = np.array([0.05, 0.10, 0.25, 0.50, 0.75, 0.90, 0.95])

# Well-calibrated: use the true distribution
well_cal_quantiles = np.column_stack([
    sp_stats.norm.ppf(q, loc=100, scale=15) * np.ones(n_test)
    for q in quantile_levels
])

# Overconfident: use too-narrow intervals (sigma=8 instead of 15)
overconf_quantiles = np.column_stack([
    sp_stats.norm.ppf(q, loc=100, scale=8) * np.ones(n_test)
    for q in quantile_levels
])

cal_good = pit_calibration_test(y_true_demo, well_cal_quantiles, quantile_levels)
cal_bad = pit_calibration_test(y_true_demo, overconf_quantiles, quantile_levels)

print("=== Well-Calibrated Model ===")
print(f"Mean calibration error: {cal_good['calibration_error']:.4f}")
print(f"KS test p-value (H0: uniform PIT): {cal_good['ks_pvalue']:.4f}")
for q, emp in zip(quantile_levels, cal_good['empirical_coverage']):
    print(f"  Nominal {q:.2f} → Empirical {emp:.3f}")

print("\n=== Overconfident Model ===")
print(f"Mean calibration error: {cal_bad['calibration_error']:.4f}")
print(f"KS test p-value (H0: uniform PIT): {cal_bad['ks_pvalue']:.6f}")
for q, emp in zip(quantile_levels, cal_bad['empirical_coverage']):
    print(f"  Nominal {q:.2f} → Empirical {emp:.3f}")

The overconfident model claims 90% central coverage (between the 5th and 95th quantiles) but achieves only about 62%; the 95th quantile covers roughly 81% of observations instead of 95%. This is the most dangerous failure mode in probabilistic forecasting: the model looks precise but is wrong in a way that understates risk.

Common Misconception: "Sharpness is good." Sharpness (narrow prediction intervals) is desirable only conditional on calibration. An interval of $\pm 0.01$ is sharper than $\pm 10$, but if the true coverage is 30% instead of 90%, the narrow interval is actively harmful. The correct objective is Gneiting et al.'s (2007) mantra: maximize sharpness subject to calibration.
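Gneiting's mantra can be operationalized with the interval (Winkler) score, which adds the interval width (sharpness) to a miscoverage penalty scaled by $2/\alpha$, so a narrow-but-miscalibrated interval scores badly. A minimal sketch; the function name is ours:

```python
import numpy as np


def interval_score(y: np.ndarray, lower: np.ndarray, upper: np.ndarray,
                   alpha: float = 0.1) -> np.ndarray:
    """Winkler/interval score for central (1 - alpha) prediction intervals.

    Width rewards sharpness; misses are penalized at rate 2/alpha, so the
    score jointly reflects both calibration and sharpness. Lower is better.
    """
    width = upper - lower
    below = (lower - y) * (y < lower)  # Shortfall when y lands under the interval
    above = (y - upper) * (y > upper)  # Shortfall when y lands over the interval
    return width + (2.0 / alpha) * (below + above)


y = np.array([5.0, 12.0])
lo, hi = np.array([0.0, 0.0]), np.array([10.0, 10.0])
print(interval_score(y, lo, hi, alpha=0.1))  # [10. 50.]: covered vs. missed by 2
```

Averaging this score over a test set gives a single number that a sharp-but-overconfident model cannot win on.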


23.11 Conformal Prediction for Time Series

Conformal prediction provides distribution-free prediction intervals with finite-sample coverage guarantees. Unlike quantile regression (which requires the model to learn the quantiles correctly) or Bayesian intervals (which require the model to be well-specified), conformal prediction wraps around any point forecaster and produces intervals with guaranteed coverage — under exchangeability assumptions.

The challenge for time series: the standard conformal prediction framework assumes exchangeability (roughly, that the data points are interchangeable), which time series data violate due to temporal dependence.
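For context, the split-conformal recipe that ACI generalizes takes only a few lines: hold out a calibration set, take the finite-sample-corrected $(1-\alpha)$ quantile of the absolute residuals, and use it as the interval half-width. A sketch valid under exchangeability (the function name is ours):

```python
import numpy as np


def split_conformal_interval(cal_residuals: np.ndarray, y_pred: np.ndarray,
                             alpha: float = 0.1):
    """Split conformal intervals from held-out calibration residuals.

    Uses the ceil((n + 1)(1 - alpha)) / n empirical quantile of |residuals|,
    which yields finite-sample coverage >= 1 - alpha under exchangeability.
    """
    n = len(cal_residuals)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)
    q = np.quantile(np.abs(cal_residuals), level)
    return y_pred - q, y_pred + q


rng = np.random.default_rng(0)
cal = rng.normal(0, 1, 1000)  # Stand-in for held-out forecast residuals
lower, upper = split_conformal_interval(cal, np.array([5.0]), alpha=0.1)
print(f"90% interval: [{lower[0]:.2f}, {upper[0]:.2f}]")
```

Under temporal dependence this static quantile can drift out of calibration, which is exactly the gap ACI closes.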

Adaptive Conformal Inference (ACI)

Gibbs and Candès (2021) introduced Adaptive Conformal Inference for time series. The key idea: adjust the conformal quantile over time to maintain coverage despite distribution shift.

The algorithm:

  1. At time $t$, produce a point forecast $\hat{y}_t$ and a nonconformity score $R_t = |y_t - \hat{y}_t|$.
  2. Maintain a running quantile $\hat{q}_t$ of past nonconformity scores.
  3. Update the quantile adaptively: if the interval failed to cover (was too narrow), increase $\hat{q}_t$; if it covered (was wide enough), decrease slightly.

The update rule uses a learning rate $\gamma$ and target coverage $1 - \alpha$. Define the error indicator $\text{err}_t = \mathbb{1}\{|y_t - \hat{y}_t| > \hat{q}_t\}$, which equals 1 exactly when the interval $[\hat{y}_t - \hat{q}_t, \hat{y}_t + \hat{q}_t]$ fails to cover $y_t$. Then:

$$\hat{q}_{t+1} = \hat{q}_t + \gamma (\text{err}_t - \alpha)$$

If the current interval failed to cover ($\text{err}_t = 1$), the correction is $+\gamma(1 - \alpha) > 0$: the interval widens. If it covered ($\text{err}_t = 0$), the correction is $-\gamma \alpha < 0$: the interval narrows slightly. Over time, this drives the long-run miscoverage rate toward $\alpha$. (Gibbs and Candès state the update on the miscoverage level $\alpha_t$ rather than on the interval width; the width form used here moves in the opposite direction but pursues the same coverage target.)

@dataclass
class AdaptiveConformalForecaster:
    """Adaptive Conformal Inference (ACI) wrapper for any point forecaster.

    Wraps a point forecast model and produces prediction intervals with
    adaptive coverage guarantees, even under distribution shift.

    Attributes:
        alpha: Target miscoverage rate (e.g., 0.1 for 90% intervals).
        gamma: Adaptation rate. Higher = faster adaptation, more variable width.
        initial_q: Initial interval half-width.
    """
    alpha: float = 0.1
    gamma: float = 0.005
    initial_q: float = 1.0

    def run(
        self,
        y_true: np.ndarray,
        y_pred: np.ndarray,
    ) -> dict:
        """Run ACI on a sequence of predictions and observations.

        Args:
            y_true: Actual observations, shape (T,).
            y_pred: Point forecasts, shape (T,).

        Returns:
            Dictionary with lower/upper bounds, adaptive q, and coverage.
        """
        T = len(y_true)
        q_t = self.initial_q
        lower = np.zeros(T)
        upper = np.zeros(T)
        q_history = np.zeros(T)
        covered = np.zeros(T, dtype=bool)

        for t in range(T):
            lower[t] = y_pred[t] - q_t
            upper[t] = y_pred[t] + q_t
            q_history[t] = q_t

            covered[t] = lower[t] <= y_true[t] <= upper[t]
            err_t = 1.0 - float(covered[t])

            # Adaptive update
            q_t = max(0.0, q_t + self.gamma * (err_t - self.alpha))

        rolling_coverage = np.convolve(
            covered.astype(float), np.ones(50) / 50, mode="valid"
        )

        return {
            "lower": lower,
            "upper": upper,
            "q_history": q_history,
            "covered": covered,
            "overall_coverage": covered.mean(),
            "rolling_coverage": rolling_coverage,
        }


# Demonstration with a regime change
np.random.seed(42)
T_total = 600

# Regime 1 (t=0..299): mean=50, std=5
# Regime 2 (t=300..599): mean=65, std=10 (distribution shift!)
y_regime = np.concatenate([
    np.random.normal(50, 5, 300),
    np.random.normal(65, 10, 300),
])

# A naive forecaster trained on regime 1 (doesn't know about the shift)
y_pred_naive = np.full(T_total, 50.0)  # Always predicts 50
y_pred_naive[300:] = 55.0  # Partially adapts but lags

aci = AdaptiveConformalForecaster(alpha=0.10, gamma=0.01, initial_q=10.0)
result = aci.run(y_regime, y_pred_naive)

print(f"Target coverage: {1 - aci.alpha:.0%}")
print(f"Overall coverage: {result['overall_coverage']:.1%}")
print(f"Regime 1 coverage: {result['covered'][:300].mean():.1%}")
print(f"Regime 2 coverage: {result['covered'][300:].mean():.1%}")
print(f"Interval width at t=0:   {result['q_history'][0] * 2:.2f}")
print(f"Interval width at t=299: {result['q_history'][299] * 2:.2f}")
print(f"Interval width at t=350: {result['q_history'][350] * 2:.2f}")
print(f"Interval width at t=599: {result['q_history'][599] * 2:.2f}")

The key result: even though the forecaster is badly miscalibrated after the regime change (it under-predicts by about 10 units on average), ACI widens the intervals to compensate. Coverage may dip temporarily during the transition, but the adaptive mechanism restores it. This is a distribution-free guarantee — it does not require the forecaster to be Gaussian, linear, or even particularly good. It requires only that the adaptation rate $\gamma$ is appropriate for the rate of distributional change.

Advanced Sidebar: The theoretical coverage guarantee for ACI is asymptotic under mixing conditions. For finite samples, Gibbs and Candès (2022) prove that the miscoverage rate satisfies $|\bar{\alpha}_T - \alpha| \leq O(\gamma + 1/(T\gamma))$, where $\bar{\alpha}_T$ is the empirical miscoverage rate over $T$ steps. Setting $\gamma = T^{-1/2}$ balances the two terms and yields coverage error $O(T^{-1/2})$. In practice, $\gamma \in [0.001, 0.05]$ works well for series of length 500 or more.


23.12 Handling Temporal Challenges

Real-world time series rarely conform to textbook assumptions. This section addresses four challenges that arise constantly in practice.

Multiple Seasonalities

StreamRec's daily engagement data exhibits at least three seasonal patterns: day-of-week (period 7), month-of-year (period ~30.4), and annual (period ~365.25). Classical ARIMA handles one seasonal period; Prophet handles multiple via Fourier terms. The state-space and deep learning approaches handle them naturally.

Fourier features are the standard approach for encoding multiple seasonalities as covariates:

def fourier_features(
    t: np.ndarray, periods: List[float], n_terms: List[int]
) -> np.ndarray:
    """Generate Fourier features for multiple seasonal periods.

    Args:
        t: Time index array, shape (T,).
        periods: List of seasonal periods (e.g., [7, 365.25]).
        n_terms: Number of Fourier terms per period (e.g., [3, 10]).

    Returns:
        Features array, shape (T, 2 * sum(n_terms)). Columns are
        [cos_1_p1, sin_1_p1, cos_2_p1, ..., cos_1_p2, sin_1_p2, ...].
    """
    features = []
    for period, K in zip(periods, n_terms):
        for k in range(1, K + 1):
            features.append(np.cos(2 * np.pi * k * t / period))
            features.append(np.sin(2 * np.pi * k * t / period))
    return np.column_stack(features)


# Example: 3 years of daily data
t = np.arange(1095)
X_fourier = fourier_features(
    t,
    periods=[7, 30.4, 365.25],
    n_terms=[3, 5, 10],
)
print(f"Fourier features shape: {X_fourier.shape}")
# Output: (1095, 36) — 2*(3+5+10) = 36 features
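As a sanity check, least squares on these columns recovers a purely periodic signal exactly. The sketch below (ours, not from the text) rebuilds a small weekly design inline so it runs standalone:

```python
import numpy as np

t = np.arange(700)
# Noiseless weekly signal: one fundamental plus one second-harmonic term
y = 3.0 * np.sin(2 * np.pi * t / 7) + 1.5 * np.cos(2 * np.pi * 2 * t / 7)

# Fourier design for period 7 with K = 3 harmonics (same column layout as above)
X = np.column_stack(
    [f(2 * np.pi * k * t / 7) for k in range(1, 4) for f in (np.cos, np.sin)]
)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"Max reconstruction error: {np.max(np.abs(X @ coef - y)):.1e}")
```

With noisy data the same regression gives the least-squares seasonal fit, which is exactly how these columns act as covariates in a larger model.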

Missing Data and Irregular Sampling

The Kalman filter handles missing data naturally (skip the update step). For deep learning models, the standard approaches are:

  1. Indicator masking: Add a binary "missing" indicator as an input feature. Replace the missing value with 0 or the series mean. The model learns to discount masked positions.
  2. Time-aware encoding: For irregularly sampled data, encode the time gap $\Delta_t = t_{i} - t_{i-1}$ as an additional feature. GRU-D (Che et al., 2018) modifies the GRU cell to incorporate $\Delta_t$ directly.
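A minimal sketch combining approach 1 with the time-gap encoding from approach 2, for a unit-spaced series; the helper name is ours:

```python
import numpy as np


def masking_features(y: np.ndarray) -> np.ndarray:
    """Stack [filled value, missing indicator, time since last observation].

    Sketch of indicator masking plus a unit-step time-gap feature: delta[i]
    counts how many steps have passed since the last observed value.
    """
    missing = np.isnan(y)
    filled = np.where(missing, np.nanmean(y), y)  # Mean-impute masked positions
    delta = np.zeros(len(y))
    for i in range(1, len(y)):
        # Gap keeps growing while the previous position was missing
        delta[i] = 1.0 + (delta[i - 1] if missing[i - 1] else 0.0)
    return np.column_stack([filled, missing.astype(float), delta])


y = np.array([1.0, np.nan, np.nan, 4.0, 5.0])
feats = masking_features(y)
print(feats)  # delta column reads [0, 1, 2, 3, 1]
```

The model then receives all three columns per position, letting it learn how much to trust an imputed value as the gap grows — the same signal GRU-D consumes inside the cell.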

Changepoint Detection

A changepoint is a time at which the statistical properties of the series change abruptly. The PELT (Pruned Exact Linear Time) algorithm detects changepoints by minimizing a penalized cost:

def detect_changepoints_pelt(
    y: np.ndarray, penalty: float = 10.0, min_segment: int = 10
) -> List[int]:
    """Detect changepoints using the PELT algorithm (simplified).

    Minimizes sum of segment costs + penalty * number of changepoints.
    Uses Gaussian cost (change in mean).

    Args:
        y: Time series, shape (T,).
        penalty: Penalty per changepoint (higher = fewer changepoints).
        min_segment: Minimum segment length.

    Returns:
        List of changepoint indices.
    """
    # In practice, use the `ruptures` library:
    # import ruptures as rpt
    # algo = rpt.Pelt(model="rbf", min_size=min_segment).fit(y)
    # changepoints = algo.predict(pen=penalty)

    # Simplified dynamic programming (for pedagogy — O(n^2))
    n = len(y)
    cost_cache = {}

    def segment_cost(start: int, end: int) -> float:
        """Gaussian cost: variance of segment."""
        key = (start, end)
        if key not in cost_cache:
            seg = y[start:end]
            cost_cache[key] = np.sum((seg - seg.mean()) ** 2)
        return cost_cache[key]

    # DP: f[t] = minimum cost for y[0:t]
    f = np.full(n + 1, np.inf)
    f[0] = -penalty  # Will add penalty for the first segment
    cp_back = [[] for _ in range(n + 1)]

    for t in range(min_segment, n + 1):
        for s in range(0, t - min_segment + 1):
            candidate = f[s] + segment_cost(s, t) + penalty
            if candidate < f[t]:
                f[t] = candidate
                cp_back[t] = (cp_back[s] + [s]) if s > 0 else []

    return cp_back[n]


# Demonstration
np.random.seed(42)
y_cp = np.concatenate([
    np.random.normal(10, 2, 100),
    np.random.normal(20, 2, 100),  # Changepoint at t=100
    np.random.normal(15, 3, 100),  # Changepoint at t=200
])
detected = detect_changepoints_pelt(y_cp, penalty=50.0, min_segment=20)
print(f"Detected changepoints: {detected}")
print(f"True changepoints: [100, 200]")

Implementation Note: For production changepoint detection, use the ruptures library (Truong et al., 2020). It implements PELT, binary segmentation, and bottom-up approaches with multiple cost functions (Gaussian, Poisson, kernel-based). Our DP implementation above is $O(n^2)$ and pedagogical; PELT achieves $O(n)$ expected time through pruning.

Regime-Switching Models

When changepoints are not abrupt but reflect ongoing state transitions, combine HMMs with time series components (Section 23.4). The statsmodels.tsa.regime_switching.MarkovRegression class provides a ready-to-use implementation of Hamilton's (1989) Markov-switching regression.


23.13 Walk-Forward Validation: Proper Evaluation for Time Series

Standard cross-validation is invalid for time series because it violates the temporal ordering: a fold might use future data to predict the past. Walk-forward validation (expanding window or sliding window) respects the arrow of time.
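For comparison, scikit-learn's TimeSeriesSplit provides expanding-window splits out of the box (this sketch assumes scikit-learn is available); we implement our own generator below to control the horizon and step directly:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(10).reshape(-1, 1)
tscv = TimeSeriesSplit(n_splits=3, test_size=2)  # Expanding train, fixed test
for train_idx, test_idx in tscv.split(X):
    print(f"train={train_idx.tolist()} test={test_idx.tolist()}")
```

Passing max_train_size converts the expanding window into a fixed-size sliding window, mirroring the expanding flag in the generator below.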

from typing import Generator


def walk_forward_splits(
    n: int,
    min_train: int,
    forecast_horizon: int,
    step: int = 1,
    expanding: bool = True,
) -> Generator[Tuple[np.ndarray, np.ndarray], None, None]:
    """Generate walk-forward train/test splits for time series.

    Args:
        n: Total number of observations.
        min_train: Minimum training set size.
        forecast_horizon: Number of steps to forecast (test set size).
        step: Number of steps to advance between splits.
        expanding: If True, training window grows; if False, fixed size.

    Yields:
        (train_indices, test_indices) tuples.
    """
    for start_test in range(min_train, n - forecast_horizon + 1, step):
        end_test = start_test + forecast_horizon
        if expanding:
            train_idx = np.arange(0, start_test)
        else:
            train_idx = np.arange(max(0, start_test - min_train), start_test)
        test_idx = np.arange(start_test, end_test)
        yield train_idx, test_idx


def evaluate_forecaster_walk_forward(
    y: np.ndarray,
    forecaster_fn,
    min_train: int = 365,
    forecast_horizon: int = 30,
    step: int = 30,
    quantile_levels: Optional[np.ndarray] = None,
) -> dict:
    """Evaluate a forecaster using walk-forward validation.

    Args:
        y: Full time series, shape (T,).
        forecaster_fn: Callable(y_train) -> dict with keys
            'point' (shape (H,)) and optionally 'quantiles' (shape (H, n_q)).
        min_train: Minimum training observations.
        forecast_horizon: Steps to forecast per fold.
        step: Steps between folds.
        quantile_levels: Quantile levels for probabilistic evaluation.

    Returns:
        Dictionary with MAE, RMSE, MASE, coverage (if quantiles provided).
    """
    if quantile_levels is None:
        quantile_levels = np.array([0.05, 0.25, 0.5, 0.75, 0.95])

    all_errors = []
    all_abs_errors = []
    all_coverages = {q: [] for q in quantile_levels}
    n_folds = 0

    for train_idx, test_idx in walk_forward_splits(
        len(y), min_train, forecast_horizon, step
    ):
        y_train = y[train_idx]
        y_test = y[test_idx]

        result = forecaster_fn(y_train)
        point_forecast = result["point"][:len(y_test)]

        errors = y_test - point_forecast
        all_errors.extend(errors.tolist())
        all_abs_errors.extend(np.abs(errors).tolist())

        if "quantiles" in result:
            q_forecast = result["quantiles"][:len(y_test)]
            for j, q in enumerate(quantile_levels):
                covered = (y_test <= q_forecast[:, j]).mean()
                all_coverages[q].append(covered)

        n_folds += 1

    # MASE denominator: MAE of naive forecast (random walk) on training
    naive_errors = np.abs(np.diff(y[:min_train]))
    naive_mae = naive_errors.mean()

    results = {
        "n_folds": n_folds,
        "mae": np.mean(all_abs_errors),
        "rmse": np.sqrt(np.mean(np.array(all_errors) ** 2)),
        "mase": np.mean(all_abs_errors) / naive_mae,
    }

    if all_coverages[quantile_levels[0]]:
        results["calibration"] = {
            q: np.mean(covs) for q, covs in all_coverages.items()
        }

    return results


# Demonstration
np.random.seed(42)
n_demo = 730  # 2 years daily
y_demo = (
    100
    + 0.05 * np.arange(n_demo)                                   # Trend
    + 10 * np.sin(2 * np.pi * np.arange(n_demo) / 365.25)        # Annual cycle
    + np.random.normal(0, 5, n_demo)                             # Noise
)


def naive_forecaster(y_train: np.ndarray) -> dict:
    """Last-value naive forecast (random walk) with residual-based quantiles."""
    H = 30
    point = np.full(H, y_train[-1])
    residuals = np.diff(y_train[-365:])  # Recent one-step changes (signed)
    offsets = np.quantile(residuals, [0.05, 0.25, 0.5, 0.75, 0.95])
    quantiles = point[:, None] + offsets[None, :]  # Shape (H, 5)
    return {"point": point, "quantiles": quantiles}


wf_result = evaluate_forecaster_walk_forward(
    y_demo, naive_forecaster, min_train=365, forecast_horizon=30, step=30
)
print(f"Walk-forward results ({wf_result['n_folds']} folds):")
print(f"  MAE:  {wf_result['mae']:.2f}")
print(f"  RMSE: {wf_result['rmse']:.2f}")
print(f"  MASE: {wf_result['mase']:.2f}")

The MASE (Mean Absolute Scaled Error) is the preferred metric for time series evaluation because it is scale-independent and uses the naive forecast as the denominator. MASE < 1 means the model outperforms the naive baseline; MASE > 1 means the naive is better.
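A worked toy example of the scaling (numbers are ours):

```python
import numpy as np


def mase(y_train: np.ndarray, y_test: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean Absolute Scaled Error: forecast MAE / in-sample naive MAE."""
    naive_mae = np.mean(np.abs(np.diff(y_train)))  # One-step random-walk MAE
    return float(np.mean(np.abs(y_test - y_pred)) / naive_mae)


y_train = np.array([10.0, 12.0, 11.0, 13.0])  # Naive MAE = mean(|2|,|-1|,|2|) = 5/3
y_test = np.array([14.0, 15.0])
y_pred = np.array([13.0, 13.0])               # Forecast MAE = (1 + 2) / 2 = 1.5
print(f"MASE: {mase(y_train, y_test, y_pred):.3f}")  # 1.5 / (5/3) = 0.900
```

Because both numerator and denominator are in the units of the series, MASE compares fairly across series of different scales, unlike MAE or RMSE.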


23.14 Decision Framework: Classical vs. Deep Learning

When should you use a state-space model, when should you use deep learning, and when should you use Prophet and go home early?

| Criterion | Classical (ARIMA, ETS, STM) | Deep Learning (TFT, DeepAR, N-BEATS) |
| --- | --- | --- |
| Data volume | Works with 50–500 observations | Needs 1,000+ (per series) or many related series |
| Number of series | Independent per-series fitting | Excels at learning across 100+ series |
| Pattern complexity | Single seasonality, linear trend | Multiple seasonalities, nonlinear interactions |
| Covariates | Limited (regression components) | Handles many static + dynamic covariates |
| Interpretability | Component decomposition is transparent | TFT offers attention weights; N-BEATS trend/seasonal |
| Uncertainty | Exact (Kalman) or well-calibrated (Bayesian STM) | Requires quantile loss or distributional output |
| Compute | Seconds | Minutes to hours (GPU recommended) |
| Regime changes | Markov-switching, changepoint models | Can learn from data if sufficient examples |
| Missing data | Kalman filter handles naturally | Requires masking strategy |

Research Insight: The M5 competition (Makridakis et al., 2022) showed that hybrid approaches — using LightGBM with time series features and ensemble learning — outperformed both pure classical and pure deep learning methods. The top entries combined statistical decomposition (for interpretable components) with ML models (for nonlinear residuals). This aligns with our theme: Fundamentals > Frontier. Understanding exponential smoothing, seasonal decomposition, and ARIMA residual analysis makes you a better deep learning forecaster, because you can pre-process intelligently and diagnose failures.
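The "time series features" in those entries are largely lag and rolling-window statistics; a minimal pandas sketch of the idea (column names are ours, not the M5 code):

```python
import numpy as np
import pandas as pd

# Synthetic daily series standing in for one sales/engagement series
rng = np.random.default_rng(0)
df = pd.DataFrame({"y": rng.normal(100, 10, 200)})

# Lag features: the series' own past values as covariates
for lag in (1, 7, 28):
    df[f"lag_{lag}"] = df["y"].shift(lag)

# Rolling means over past values only (shift(1) prevents target leakage)
for window in (7, 28):
    df[f"roll_mean_{window}"] = df["y"].shift(1).rolling(window).mean()

df = df.dropna()  # Drop warm-up rows lacking a full 28-day history
print(df.columns.tolist())
print(df.shape)
```

A gradient-boosting model trained on such a table is the "ML on time series features" half of the hybrid; the decomposition half supplies detrended, deseasonalized inputs.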

The StreamRec Decision

For StreamRec's engagement forecasting problem (Chapter Project), the decision framework suggests:

  1. Start with a Bayesian structural time series model. The data (3 years of daily engagement, 8 content categories) is short enough that deep learning may overfit. The dual seasonality (weekly + annual) is well-handled by the STM's Fourier components. Bayesian inference provides honest uncertainty.

  2. Try TFT as a global model. Train a single TFT across all 8 content categories, using the category as a static covariate. The global model can leverage cross-category patterns that per-series models miss. The variable selection mechanism will reveal which features drive engagement.

  3. Use conformal prediction to wrap the best model. Regardless of which model wins on point accuracy, ACI produces distribution-free intervals that adapt to regime changes (e.g., when a new content vertical launches).

  4. Evaluate with walk-forward validation. No backtesting shortcuts. Expanding-window walk-forward with 30-day horizons and monthly steps.


23.15 Progressive Project: Temporal Engagement Forecaster for StreamRec

This section defines the project milestone for Chapter 23: building a temporal engagement forecaster for the StreamRec platform.

Problem Setup

StreamRec tracks daily engagement rates (completion rate) across 8 content categories. The data spans 3 years (1,095 days) with the following characteristics:

  • Day-of-week seasonality (weekends are 15-20% higher)
  • Annual seasonality (summer dip, winter peak, holiday spikes)
  • Category-level differences in baseline and trend
  • A regime change at month 24 when StreamRec launched a podcasting vertical
  • ~5% missing values (system outages, data pipeline failures)

Milestone M8b: Temporal Forecasting

Objective: Compare Prophet (baseline), TFT (global DL), and Bayesian structural time series models for 30-day engagement forecasting.

def generate_streamrec_engagement(
    n_days: int = 1095,
    n_categories: int = 8,
    seed: int = 42,
) -> pd.DataFrame:
    """Generate synthetic StreamRec daily engagement data.

    Creates realistic engagement time series with multiple seasonality,
    category effects, a regime change, and missing values.

    Args:
        n_days: Number of days to simulate.
        n_categories: Number of content categories.
        seed: Random seed.

    Returns:
        DataFrame with columns: date, time_idx, category, engagement_rate,
        day_of_week, month, is_weekend, is_holiday.
    """
    np.random.seed(seed)
    categories = [
        "comedy", "drama", "documentary", "action",
        "sci_fi", "horror", "romance", "podcast",
    ][:n_categories]

    base_rates = {
        "comedy": 0.48, "drama": 0.42, "documentary": 0.31, "action": 0.39,
        "sci_fi": 0.27, "horror": 0.22, "romance": 0.35, "podcast": 0.0,
    }

    dates = pd.date_range("2022-01-01", periods=n_days, freq="D")
    records = []

    for cat in categories:
        base = base_rates[cat]
        for i, date in enumerate(dates):
            t = i  # time index
            dow = date.dayofweek
            month = date.month

            # Trend
            trend = 0.0001 * t  # Slow upward trend

            # Day-of-week seasonality
            weekend_effect = 0.04 * (1 if dow >= 5 else 0)

            # Annual seasonality
            annual = 0.03 * np.sin(2 * np.pi * t / 365.25 + 1.5)

            # Regime change for podcast (launches at month 24)
            if cat == "podcast":
                if t < 730:  # Before launch
                    base_t = 0.0
                else:
                    base_t = 0.30 + 0.0003 * (t - 730)  # Ramp up
            else:
                base_t = base

            # Holiday spikes (simplified)
            is_holiday = (month == 12 and date.day >= 20) or (month == 1 and date.day <= 3)
            holiday_effect = 0.05 if is_holiday else 0.0

            # Noise
            noise = np.random.normal(0, 0.03)

            rate = np.clip(base_t + trend + weekend_effect + annual + holiday_effect + noise, 0, 1)

            # Missing data (~5%)
            if np.random.rand() < 0.05:
                rate = np.nan

            records.append({
                "date": date,
                "time_idx": t,
                "category": cat,
                "engagement_rate": rate,
                "day_of_week": dow,
                "month": month,
                "is_weekend": int(dow >= 5),
                "is_holiday": int(is_holiday),
            })

    df = pd.DataFrame(records)
    return df


# Generate data and preview
engagement_df = generate_streamrec_engagement()
print(f"Dataset shape: {engagement_df.shape}")
print(f"Date range: {engagement_df['date'].min()} to {engagement_df['date'].max()}")
print(f"Categories: {engagement_df['category'].nunique()}")
print(f"Missing values: {engagement_df['engagement_rate'].isna().sum()} "
      f"({engagement_df['engagement_rate'].isna().mean():.1%})")
print(f"\nMean engagement by category:")
print(engagement_df.groupby("category")["engagement_rate"].mean().sort_values(ascending=False).to_string())

Your tasks:

  1. Baseline (Prophet): Fit a Prophet model per category. Use the default yearly and weekly seasonality plus holiday effects. Produce 30-day forecasts with uncertainty intervals.

  2. Bayesian STM: Implement a Bayesian structural time series model (using tensorflow_probability.sts or orbit-ml) with local linear trend, weekly seasonality (7 Fourier terms), and annual seasonality (10 Fourier terms). Compare the posterior predictive intervals with Prophet's.

  3. TFT (global model): Train a single TFT across all categories using pytorch-forecasting. Use category as a static covariate, day-of-week and month as known future inputs. Evaluate the variable importance weights.

  4. Conformal wrapper: Apply ACI (Section 23.11) to the best point forecaster. Compare the conformal intervals with the model's native intervals.

  5. Walk-forward evaluation: Use expanding-window walk-forward with min_train=365, forecast_horizon=30, step=30. Report MAE, RMSE, MASE, and calibration (90% interval coverage) for each model.

  6. Report: Produce a comparison table and a recommendation. For each model, state: (a) point accuracy, (b) interval calibration, (c) interpretability of components, (d) computational cost.


23.16 Chapter Summary

This chapter extended the time series toolkit from the intermediate course in three directions.

State-space models provide the theoretical foundation. The Kalman filter is Bayesian inference for linear dynamical systems — the predict-update cycle is exactly the posterior-becomes-prior sequential updating from Chapter 20, applied to latent states that evolve over time. Hidden Markov models extend this to discrete regimes. Structural time series models decompose observed data into interpretable trend, seasonal, and regression components.

Deep learning architectures handle the pattern complexity that classical methods cannot. N-BEATS proves that pure neural networks can match classical methods on univariate benchmarks. DeepAR adds probabilistic outputs via autoregressive density estimation. TFT integrates variable selection, static metadata, temporal attention, and quantile forecasts into a single interpretable architecture. The key insight from the forecasting competitions is that global models — trained across many related series — often outperform per-series classical models, because they leverage cross-series structure.

Probabilistic forecasting is the key differentiator from the intermediate treatment. Point predictions are insufficient when decisions depend on uncertainty. Quantile regression, Bayesian posterior predictive intervals, and conformal prediction each provide prediction intervals — but only calibrated intervals are trustworthy. The PIT test and coverage analysis are essential diagnostic tools. Conformal prediction, particularly the ACI variant, provides distribution-free coverage guarantees that adapt to distributional shift.

The decision framework is not "classical or deep learning" but "what does the problem need?" Short series with simple seasonality favor classical methods. Many related series with complex patterns favor deep learning. Uncertainty-critical applications demand Bayesian methods or conformal wrappers. The best practitioner — the senior data scientist — knows all three families and chooses based on the data, the stakes, and the constraints.

The theme that runs through every section: Know How Your Model Is Wrong. The Kalman filter is wrong when the system is nonlinear. N-BEATS is wrong when there are important covariates. TFT is wrong when the training series is too short. Quantile regression is wrong when the quantile crossings are severe. Conformal prediction is wrong when the adaptation rate does not match the rate of distributional change. Knowing how each model fails is more valuable than knowing how it succeeds — because it tells you when to trust the forecast and when to hedge.