Chapter 23: Quiz
Test your understanding of advanced time series and temporal models. Answers follow each question.
Question 1
What is a state-space model? State the two equations that define it and explain what the latent state represents.
Answer
A state-space model describes a time series through two equations:

1. **Transition equation** (state evolution): $\boldsymbol{\alpha}_{t+1} = f(\boldsymbol{\alpha}_t, \boldsymbol{u}_t) + \boldsymbol{\eta}_t$ — how the unobserved latent state evolves over time.
2. **Observation equation** (measurement): $y_t = h(\boldsymbol{\alpha}_t, \mathbf{x}_t) + \varepsilon_t$ — how the observed data relates to the latent state.

The latent state $\boldsymbol{\alpha}_t$ captures what cannot be directly measured: the "true" level of a process, its trend, seasonal components, and any regime it occupies. The observed data $y_t$ is a noisy, partial view of this richer underlying reality. Estimation consists of inferring the latent states from the noisy observations.

Question 2
Why is the Kalman filter described as "inherently Bayesian"? What corresponds to the prior and the posterior in the predict-update cycle?
Answer
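Before the prose, a minimal numeric sketch of the predict-update cycle for a scalar local-level model (the noise variances `q` and `h` and all data here are illustrative, not from the chapter's code):

```python
import random

def kalman_filter_1d(ys, a0=0.0, p0=1.0, q=0.1, h=1.0):
    """Exact Bayesian filtering for a scalar local-level model:
    state:       alpha[t+1] = alpha[t] + eta,  eta ~ N(0, q)
    observation: y[t] = alpha[t] + eps,        eps ~ N(0, h)
    Returns filtered means and variances of p(alpha_t | y_{1:t})."""
    a, p = a0, p0
    means, variances = [], []
    for y in ys:
        a_pred, p_pred = a, p + q        # predict: today's posterior -> tomorrow's prior
        k = p_pred / (p_pred + h)        # Kalman gain
        a = a_pred + k * (y - a_pred)    # update: precision-weighted average
        p = (1 - k) * p_pred
        means.append(a)
        variances.append(p)
    return means, variances

# Track a noisy random walk.
random.seed(0)
level, ys = 0.0, []
for _ in range(200):
    level += random.gauss(0.0, 0.3)
    ys.append(level + random.gauss(0.0, 1.0))
means, variances = kalman_filter_1d(ys)
```

The filtered variance contracts toward a steady state: conditioning on more data never increases uncertainty in this linear-Gaussian setting.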
The Kalman filter is inherently Bayesian because it maintains a belief distribution (posterior) over the latent state at each time step: $p(\boldsymbol{\alpha}_t \mid y_{1:t})$.

- **Predict step** (prior formation): The current posterior $p(\boldsymbol{\alpha}_t \mid y_{1:t})$ is propagated forward through the transition model to form the predictive distribution $p(\boldsymbol{\alpha}_{t+1} \mid y_{1:t})$. Today's posterior becomes tomorrow's prior.
- **Update step** (posterior computation): The new observation $y_{t+1}$ is incorporated via the likelihood to obtain the new posterior $p(\boldsymbol{\alpha}_{t+1} \mid y_{1:t+1})$. This is exactly sequential Bayesian updating.

For the linear-Gaussian case, the Kalman filter computes the exact posterior — there is no approximation. The Kalman gain performs precision-weighted averaging, trusting the data more when observation noise is small and the prior more when it is large.

Question 3
What does the Kalman gain $\mathbf{K}_t$ control, and how does it behave when observation noise $H_t$ is very large vs. very small?
Answer
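The two limiting behaviors asked about above can be checked directly with a scalar update step (all numbers illustrative):

```python
def kalman_update(a_pred, p_pred, y, h):
    """One scalar update. The gain k = p_pred / (p_pred + h) is the
    prediction variance divided by the total variance."""
    k = p_pred / (p_pred + h)
    return a_pred + k * (y - a_pred), (1 - k) * p_pred, k

# Prior prediction: mean 0, variance 1; observation y = 5.
a_precise, _, k_precise = kalman_update(0.0, 1.0, 5.0, h=1e-6)  # tiny obs noise
a_noisy, _, k_noisy = kalman_update(0.0, 1.0, 5.0, h=1e6)       # huge obs noise
```

With tiny $H_t$ the gain is essentially 1 and the estimate jumps to the observation; with huge $H_t$ the gain is essentially 0 and the estimate stays at the prior.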
The Kalman gain controls how much the new observation shifts the state estimate.

- **When $H_t$ is very small** (precise observation): The gain is large, approaching 1. The filter trusts the data heavily and the updated state estimate moves close to the observation.
- **When $H_t$ is very large** (noisy observation): The gain is small, approaching 0. The filter trusts the prior prediction and largely ignores the new observation.

This is precision-weighted averaging: the Kalman gain is proportional to the prediction uncertainty and inversely proportional to the total uncertainty (prediction plus observation). It is the multivariate, time-varying generalization of the Normal-Normal conjugate update from Bayesian statistics.

Question 4
How does the Kalman filter handle missing observations? Why is this approach natural from a Bayesian perspective?
Answer
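The skip-the-update behavior can be sketched in a few lines (scalar local-level model; `None` marks a missing observation — an illustrative convention, not the chapter's API):

```python
def kalman_step(a, p, y, q=0.1, h=1.0):
    """One predict-update cycle for a scalar local-level model.
    If y is None (missing), the update is skipped: posterior = prior,
    so uncertainty is carried forward rather than imputed away."""
    a_pred, p_pred = a, p + q            # predict
    if y is None:
        return a_pred, p_pred            # no observation -> no likelihood
    k = p_pred / (p_pred + h)            # update
    return a_pred + k * (y - a_pred), (1 - k) * p_pred

a, p = kalman_step(0.0, 0.5, 1.0)        # normal update
a, p_gap1 = kalman_step(a, p, None)      # gap: variance grows by q
a, p_gap2 = kalman_step(a, p_gap1, None)
```

During the gap the state mean is held and the variance grows by `q` each step — missing data shows up as widening uncertainty, not as an imputed value.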
When an observation is missing, the Kalman filter simply skips the update step. The predicted state becomes the filtered state, and the predicted covariance becomes the filtered covariance (no uncertainty reduction occurs).

This is natural from a Bayesian perspective because the update step requires a likelihood — if there is no observation, there is no likelihood to incorporate, so the posterior equals the prior. The consequence is that uncertainty grows during gaps: $P_{t|t}$ does not shrink without data. When observations resume, the filter continues the predict-update cycle normally. No imputation is needed; missing data is propagated as increased uncertainty.

Question 5
Name the three fundamental problems for hidden Markov models and the algorithm that solves each one.
Answer
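Of the three algorithms named in this answer, decoding is the easiest to sketch end to end. A minimal Viterbi implementation over a toy two-regime HMM with discrete emissions (all probabilities illustrative):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """Decode argmax over s_{1:T} of P(s_{1:T} | y_{1:T}) by dynamic
    programming in O(T K^2) time (raw probabilities, not log-space)."""
    V = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max((V[t - 1][r] * trans_p[r][s] * emit_p[s][obs[t]], r)
                             for r in states)
            V[t][s] = prob
            back[t][s] = prev
    path = [max(V[-1], key=V[-1].get)]    # best final state
    for t in range(len(obs) - 1, 0, -1):  # follow back-pointers
        path.append(back[t][path[-1]])
    return path[::-1]

states = ["calm", "vol"]
start = {"calm": 0.6, "vol": 0.4}
trans = {"calm": {"calm": 0.8, "vol": 0.2}, "vol": {"calm": 0.2, "vol": 0.8}}
emit = {"calm": {"small": 0.9, "large": 0.1}, "vol": {"small": 0.2, "large": 0.8}}
path = viterbi(["small", "small", "large", "large"], states, start, trans, emit)
```

A production implementation would work in log-space to avoid underflow on long sequences; the recursion is otherwise identical.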
1. **Filtering** (forward algorithm): Compute $P(s_t = k \mid y_{1:t})$ — the probability of being in each state given observations up to time $t$. The forward algorithm computes this recursively in $O(TK^2)$ time.
2. **Decoding** (Viterbi algorithm): Find $\arg\max_{s_{1:T}} P(s_{1:T} \mid y_{1:T})$ — the single most likely sequence of hidden states given all observations. The Viterbi algorithm uses dynamic programming to solve this in $O(TK^2)$ time.
3. **Learning** (Baum-Welch / EM algorithm): Estimate the model parameters (transition matrix $\mathbf{A}$, emission parameters $\boldsymbol{\theta}_k$, initial distribution $\boldsymbol{\pi}$) from data. The Baum-Welch algorithm is the Expectation-Maximization algorithm applied to HMMs, alternating between computing expected state occupancies (E-step, using forward-backward) and updating parameters (M-step).

Question 6
What is the "doubly residual" architecture in N-BEATS, and why is it important?
Answer
N-BEATS uses two residual connections per block:

1. **Input residual (backcast):** The input to the next block is the original input minus the current block's backcast — the residual that this block could not reconstruct. Each subsequent block works on what the previous blocks failed to explain.
2. **Output residual (forecast):** The final forecast is the sum of all blocks' partial forecasts. Each block contributes its own piece of the prediction.

This is important because it enables a natural division of labor among blocks. In the interpretable variant, trend blocks capture the trend component (subtracting it from the input), and seasonality blocks capture the seasonal component from the remaining residual. The doubly residual structure also improves gradient flow during training, similar to skip connections in ResNets.

Question 7
Explain the key architectural differences between TFT, N-BEATS, and DeepAR. When would you choose each one?
Answer
| Feature | N-BEATS | DeepAR | TFT |
|---------|---------|--------|-----|
| **Input** | Univariate lookback window | Univariate + covariates | Static + historical + future covariates |
| **Architecture** | Stacked FC blocks | LSTM with distribution output | Variable selection + LSTM + self-attention |
| **Output** | Point forecast (or quantile) | Parametric distribution (sampled) | Multi-quantile forecast |
| **Interpretability** | Trend/seasonal decomposition | Low | Variable importance + attention weights |

**Choose N-BEATS** when you have univariate data, no important covariates, and want a simple, fast model. It excels on competition benchmarks for pure univariate forecasting.

**Choose DeepAR** when you have many related series and want a probabilistic global model with a parametric distribution assumption. It is simpler than TFT and works well when the distribution family is appropriate.

**Choose TFT** when you have rich metadata (static features, known future inputs), need interpretability (variable importance, attention patterns), and want multi-horizon quantile forecasts. It is the most architecturally complete but also the most complex.

Question 8
What is the quantile loss (pinball loss)? Why is it asymmetric, and what does the asymmetry achieve?
Answer
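The asymmetry asked about above is easy to verify numerically — both the 9:1 penalty ratio at $\tau = 0.9$ and the fact that the loss-minimizing prediction is the empirical quantile (toy integer data, illustrative only):

```python
def pinball_loss(y, q_hat, tau):
    """Quantile (pinball) loss: weight tau on under-predictions,
    (1 - tau) on over-predictions."""
    if y >= q_hat:
        return tau * (y - q_hat)
    return (1 - tau) * (q_hat - y)

# For tau = 0.9, under-predicting by 2 costs 9x more than over-predicting by 2.
under = pinball_loss(10.0, 8.0, 0.9)   # actual above the prediction
over = pinball_loss(8.0, 10.0, 0.9)    # actual below the prediction

# The minimizer over a sample is the empirical tau-quantile.
data = list(range(1, 101))
best = min(data, key=lambda q: sum(pinball_loss(y, q, 0.9) for y in data))
```

On the 100-point sample, the minimizing prediction sits at the 90th percentile, exactly as the theory says.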
The quantile loss for quantile level $\tau$ is:

$$\rho_\tau(y, \hat{q}) = \begin{cases} \tau(y - \hat{q}) & \text{if } y \geq \hat{q} \\ (1-\tau)(\hat{q} - y) & \text{if } y < \hat{q} \end{cases}$$

The asymmetry ensures that the loss-minimizing prediction is the $\tau$-th quantile of $y$, not the mean. For $\tau = 0.9$, underestimates (where $y > \hat{q}$) are penalized with weight 0.9, while overestimates are penalized with weight 0.1. This forces the model to produce a high value — one that 90% of observations fall below. For $\tau = 0.1$, the situation is reversed: overestimates are penalized more heavily, forcing the model to produce a low value. For $\tau = 0.5$, the loss is symmetric — half the absolute error — and minimizing it yields the median.

Question 9
What is the difference between a prediction interval and a confidence interval? Which one is appropriate for forecasting?
Answer
- A **prediction interval** contains future observations with a stated probability. It accounts for both parameter uncertainty and irreducible noise. Example: "There is a 90% probability that tomorrow's temperature falls in [15, 25]°C."
- A **confidence interval** contains a fixed parameter (like a trend slope) with a stated long-run coverage frequency. It reflects only parameter estimation uncertainty. Example: "If we repeated the estimation many times, 90% of intervals would contain the true trend slope."

Prediction intervals are always wider than confidence intervals for the same nominal level because they include observation noise. **Prediction intervals are appropriate for forecasting** because we are making statements about future observations, not about fixed parameters. A confidence interval on the forecast mean excludes the observation noise and therefore understates the true range of future outcomes.

Question 10
Define forecast calibration. Why is a sharp but miscalibrated model potentially worse than a well-calibrated but wide model?
Answer
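The calibration-vs-sharpness tradeoff discussed in this answer comes down to two diagnostics that are trivial to compute (hypothetical helper names; the data below is a toy example):

```python
def empirical_coverage(y, lower, upper):
    """Fraction of outcomes inside their prediction intervals; compare
    against the nominal level to detect miscalibration."""
    return sum(l <= yi <= u for yi, l, u in zip(y, lower, upper)) / len(y)

def sharpness(lower, upper):
    """Average interval width: sharper means narrower."""
    return sum(u - l for l, u in zip(lower, upper)) / len(lower)

y = [1.0, 2.0, 3.0, 4.0, 5.0]
narrow_cov = empirical_coverage(y, [1.5] * 5, [4.5] * 5)  # sharp but misses tails
wide_cov = empirical_coverage(y, [0.0] * 5, [6.0] * 5)    # wide but covers all
```

A "90%" interval whose empirical coverage comes out near 0.6 is the overconfident case the answer warns about; the Gneiting principle says to shrink width only while coverage stays at the nominal level.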
A probabilistic forecast is **calibrated** if events predicted to occur with probability $p$ actually occur with frequency $p$. For quantile forecasts: if $\hat{q}_\tau$ is the predicted $\tau$-quantile, calibration requires $P(y \leq \hat{q}_\tau) = \tau$.

A sharp but miscalibrated model is dangerous because it gives false confidence. If a model claims 90% prediction intervals but achieves only 60% coverage, decisions based on those intervals systematically underestimate risk. A policymaker relying on overconfident climate projections might underinvest in adaptation. A platform relying on overconfident engagement forecasts might understaff during actual demand surges.

A well-calibrated but wide model is honest: it correctly communicates "we are uncertain." This is always preferable because the decision-maker can account for the uncertainty. The correct objective is Gneiting et al.'s (2007) principle: **maximize sharpness subject to calibration**. Sharpness without calibration is not precision — it is overconfidence.

Question 11
What is the Probability Integral Transform (PIT), and how is it used to assess calibration?
Answer
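A minimal simulation of the PIT diagnostic, assuming Gaussian predictive distributions (the overconfident forecaster simply uses too small a sigma; all values illustrative):

```python
import math
import random

def normal_cdf(x, mu, sigma):
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def pit_values(y, means, sigmas):
    """PIT: evaluate each predicted (here Gaussian) CDF at the realized
    observation. Calibrated forecasts give Uniform(0, 1) values."""
    return [normal_cdf(yi, m, s) for yi, m, s in zip(y, means, sigmas)]

def tail_mass(us):
    """Fraction of PIT values in the outer deciles: ~0.2 if uniform,
    much larger if forecasts are overconfident (U-shaped histogram)."""
    return sum(u < 0.1 or u > 0.9 for u in us) / len(us)

random.seed(1)
ys = [random.gauss(0.0, 1.0) for _ in range(2000)]
u_calibrated = pit_values(ys, [0.0] * 2000, [1.0] * 2000)     # correct sigma
u_overconfident = pit_values(ys, [0.0] * 2000, [0.5] * 2000)  # sigma too small
```

The overconfident forecaster piles PIT mass into the extreme deciles — the U-shape that signals underdispersion.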
If $F_t$ is the predicted CDF at time $t$, the PIT value is $u_t = F_t(y_t)$ — the predicted CDF evaluated at the actual observation. Under perfect calibration, the PIT values $u_1, \ldots, u_T$ are uniformly distributed on $[0, 1]$. This is because, for a correctly specified continuous CDF, $F_t(Y_t) \sim \text{Uniform}(0, 1)$.

To assess calibration:

1. Compute PIT values for all test observations.
2. Plot a PIT histogram. A uniform histogram indicates good calibration. A U-shaped histogram indicates underdispersion (overconfidence). An inverse-U shape indicates overdispersion.
3. Apply the Kolmogorov-Smirnov test for uniformity. A small p-value rejects the null hypothesis of calibration.

The PIT provides a single, comprehensive diagnostic for the entire predictive distribution, not just specific quantiles.

Question 12
How does Adaptive Conformal Inference (ACI) produce prediction intervals, and why does it work under distribution shift?
Answer
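The ACI update fits in one line of code, so the whole mechanism can be simulated directly (a point forecaster that always predicts 0 for standard-normal data; `gamma`, `q0`, and the data are illustrative):

```python
import random

def aci(y, y_pred, alpha=0.1, gamma=0.05, q0=1.0):
    """Adaptive Conformal Inference: widen the interval half-width q by
    gamma*(1 - alpha) after a miss, shrink it by gamma*alpha after a
    cover. Returns coverage-failure indicators and interval widths."""
    q, errs, widths = q0, [], []
    for yi, yhat in zip(y, y_pred):
        widths.append(q)
        err = 1 if abs(yi - yhat) > q else 0  # coverage failure indicator
        errs.append(err)
        q += gamma * (err - alpha)            # the ACI update
    return errs, widths

random.seed(2)
n = 3000
y = [random.gauss(0.0, 1.0) for _ in range(n)]
errs, widths = aci(y, [0.0] * n, alpha=0.1, gamma=0.05, q0=0.5)
miscoverage = sum(errs) / n
```

Despite starting with a far-too-narrow interval, the realized miscoverage lands near the 10% target and the half-width settles near the true 90% quantile of the absolute error (about 1.64 here) — the model-free self-correction the answer describes.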
ACI maintains an adaptive interval half-width $\hat{q}_t$ that is updated after each observation:

$$\hat{q}_{t+1} = \hat{q}_t + \gamma(\text{err}_t - \alpha)$$

where $\text{err}_t = \mathbb{1}\{|y_t - \hat{y}_t| > \hat{q}_t\}$ is the coverage failure indicator and $\alpha$ is the target miscoverage rate. If the interval fails to cover ($\text{err}_t = 1$), $\hat{q}_t$ changes by $\gamma(1-\alpha) > 0$ — the interval widens. If coverage succeeds ($\text{err}_t = 0$), $\hat{q}_t$ changes by $-\gamma\alpha < 0$ — the interval narrows slightly.

This works under distribution shift because the adaptation is model-free: it does not depend on the forecaster being correctly specified. If a regime change causes the forecaster to produce larger errors, the interval widens automatically to compensate. The coverage guarantee is $|\bar{\alpha}_T - \alpha| \leq O(\gamma + 1/(T\gamma))$, which holds for any underlying process as long as the adaptation rate $\gamma$ is chosen appropriately relative to the rate of distributional change.

Question 13
Why is standard $k$-fold cross-validation invalid for time series? What should be used instead?
Answer
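A minimal split generator for the correct procedure (hypothetical helper, not a library API; both the expanding and sliding variants are shown):

```python
def walk_forward_splits(n, initial_train, horizon, sliding=False):
    """Yield (train, test) index lists in which the training window
    always precedes the test window. Expanding window by default;
    sliding=True keeps a fixed-size training window."""
    end = initial_train
    while end + horizon <= n:
        start = end - initial_train if sliding else 0
        yield list(range(start, end)), list(range(end, end + horizon))
        end += horizon  # advance by one forecast horizon per fold

splits = list(walk_forward_splits(n=10, initial_train=4, horizon=2))
```

Every fold satisfies `max(train) < min(test)` — the invariant that random $k$-fold assignment destroys.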
Standard $k$-fold cross-validation randomly assigns observations to folds, which violates temporal ordering. A fold might use future data (e.g., January 2025) to predict past data (e.g., June 2024). This introduces look-ahead bias: the model has access to patterns (trends, regime changes, seasonal shifts) that would not be available at the time of the forecast, leading to optimistically biased performance estimates.

**Walk-forward validation** (expanding window or sliding window) should be used instead. It respects the arrow of time:

- The training set always precedes the test set.
- The model is re-fit (or updated) at each fold using only past data.
- Forecasts are made for a fixed horizon ahead.

The expanding window variant grows the training set over time; the sliding window variant uses a fixed-size window. Both produce unbiased estimates of out-of-sample forecast performance.

Question 14
What is MASE (Mean Absolute Scaled Error), and why is it preferred over MAE and RMSE for time series evaluation?
Answer
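The definition translates directly into a few lines (toy numbers, illustrative only):

```python
def mase(y_train, y_test, y_pred):
    """MASE: test-set MAE of the model scaled by the in-sample MAE of
    the one-step naive forecast y_hat[t] = y[t-1]."""
    naive_mae = sum(abs(a - b) for a, b in zip(y_train[1:], y_train[:-1]))
    naive_mae /= len(y_train) - 1
    model_mae = sum(abs(a - b) for a, b in zip(y_test, y_pred)) / len(y_test)
    return model_mae / naive_mae

# Training series drifts by 1 per step, so the naive MAE is 1;
# the model's test MAE is 0.5, giving MASE = 0.5 (beats naive).
score = mase([1.0, 2.0, 3.0, 4.0, 5.0], [6.0, 7.0], [5.5, 6.5])
```

Note the scaling denominator is computed on the training set, so the metric stays comparable across models evaluated on the same series.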
MASE is defined as:

$$\text{MASE} = \frac{\text{MAE of the model}}{\text{MAE of the naive (random walk) forecast on the training set}}$$

where the naive forecast is $\hat{y}_{t+1} = y_t$. MASE is preferred because:

1. **Scale-independent:** Unlike MAE or RMSE, MASE can be compared across series with different units or magnitudes. MASE = 0.8 means 20% better than naive, regardless of whether the series is measured in degrees, dollars, or counts.
2. **Interpretable benchmark:** MASE < 1 means the model outperforms the naive baseline; MASE > 1 means the naive forecast is better. This provides an immediate sanity check.
3. **Symmetric treatment of errors:** Unlike MAPE (Mean Absolute Percentage Error), MASE does not penalize positive and negative errors differently.
4. **Appropriate for intermittent demand:** MAPE is undefined when $y_t = 0$; MASE handles zeros naturally.

Question 15
What is a structural time series model? How does it relate to Prophet?
Answer
A structural time series model (STM) decomposes the observed series into interpretable additive components:

$$y_t = \mu_t + \gamma_t + \boldsymbol{\beta}^T \mathbf{x}_t + \varepsilon_t$$

where $\mu_t$ is the trend (level + slope), $\gamma_t$ is seasonality (one or more seasonal components), $\mathbf{x}_t$ are exogenous regressors, and $\varepsilon_t$ is observation noise. Each component is modeled as a separate state-space model, and their sum defines the complete model.

Prophet is a structural time series model with a piecewise-linear trend, Fourier seasonal components, and holiday regressors — fitted by MAP estimation rather than full Bayesian inference. Understanding the state-space framework reveals what Prophet assumes (piecewise-linear trend, additive components, specific Fourier orders) and where those assumptions break down. The Bayesian approach (used in `tensorflow_probability.sts` and `orbit-ml`) goes further by placing priors on all parameters and performing full posterior inference, which propagates parameter uncertainty into forecast intervals and provides automatic regularization for short series.

Question 16
Explain the ancestral sampling procedure in DeepAR for generating probabilistic forecasts. What is the key disadvantage of this approach compared to direct multi-horizon methods like TFT?
Answer
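The feedback loop and its error accumulation can be demonstrated without an LSTM: the sketch below substitutes a toy AR(1) sampler for the network (all parameters illustrative), but keeps the defining feature of ancestral sampling — each draw is fed back as the next step's input:

```python
import random

def ancestral_sample(y_last, phi, sigma, horizon, n_paths, seed=0):
    """DeepAR-style ancestral sampling with a toy AR(1) standing in for
    the LSTM: each sampled value is fed back as the next input, so
    sampling noise compounds over the horizon."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        y, path = y_last, []
        for _ in range(horizon):
            y = phi * y + rng.gauss(0.0, sigma)  # sample, then feed back
            path.append(y)
        paths.append(path)
    return paths

def spread(paths, h):
    """Cross-path variance at horizon step h."""
    vals = [p[h] for p in paths]
    m = sum(vals) / len(vals)
    return sum((v - m) ** 2 for v in vals) / len(vals)

paths = ancestral_sample(1.0, phi=0.9, sigma=0.5, horizon=20, n_paths=500)
```

The cross-path spread grows with the horizon step — the compounding that a direct multi-horizon model like TFT avoids. Note also that the inner loop is inherently sequential.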
DeepAR generates forecasts by ancestral sampling:

1. Encode the historical sequence through the LSTM to obtain the hidden state at the end of the history.
2. For each sample path $s = 1, \ldots, M$:
   - At step $h = 1$: the LSTM outputs distribution parameters $\theta_{T+1}$; sample $\hat{y}_{T+1}^{(s)} \sim p(y \mid \theta_{T+1})$.
   - At step $h = 2$: feed $\hat{y}_{T+1}^{(s)}$ back into the LSTM as the autoregressive input; sample $\hat{y}_{T+2}^{(s)}$.
   - Continue for all $H$ steps.
3. The $M$ sample paths form the empirical predictive distribution.

The key disadvantage is **error accumulation**: sampled values are fed back as inputs, so errors compound over the horizon. A bad sample at step 1 biases all subsequent steps in that sample path. The generation loop is also sequential (it cannot be parallelized across horizon steps), making inference slower for long horizons. TFT avoids this by outputting all horizon steps simultaneously (direct multi-step forecasting), which eliminates error accumulation at the cost of not modeling the autoregressive structure across future time steps.

Question 17
A Markov-switching autoregression uses regime-dependent AR parameters. How does this differ from fitting separate AR models to pre-segmented data (e.g., after detecting changepoints)?
Answer
The key differences are:

1. **Joint estimation:** The Markov-switching model estimates the AR parameters, regime-specific noise variances, and the transition matrix simultaneously. This means uncertainty about which regime a time point belongs to is incorporated into the parameter estimates. Pre-segmented AR models assume the changepoints are known with certainty, which underestimates parameter uncertainty.
2. **Soft assignment:** The Markov-switching model assigns probabilistic regime memberships — a time point might have 70% probability of being in regime 1 and 30% in regime 2, especially near transitions. Changepoint detection produces hard boundaries.
3. **Recurring regimes:** The Markov-switching model allows the process to return to a previous regime (the system can switch back and forth). Changepoint detection typically assumes one-directional changes — once a changepoint occurs, the old regime does not return.
4. **Forecasting:** The Markov-switching model can produce regime-weighted forecasts that account for the probability of future regime changes. A changepoint-based approach forecasts only from the current regime.

The tradeoff: Markov-switching models are more principled but harder to fit (EM can converge to local optima), while changepoint detection plus separate AR models is simpler and more robust when regimes truly do not recur.

Question 18
The chapter identifies a decision framework for choosing between classical and deep learning models for time series. Under what conditions does each family have a clear advantage?
Answer
**Classical methods (ARIMA, ETS, STM) have a clear advantage when:**

- The series is short (50-500 observations) — deep learning overfits with limited data.
- The pattern is simple (single seasonality, linear trend) — no need for the model complexity.
- Interpretability is critical — component decomposition is transparent.
- Exact uncertainty quantification is required — the Kalman filter produces exact Bayesian posteriors for linear-Gaussian models.
- Computational resources are limited — classical models fit in seconds.

**Deep learning (TFT, DeepAR, N-BEATS) has a clear advantage when:**

- There are many related series (100+) — global models leverage cross-series patterns that per-series classical models cannot exploit.
- Patterns are complex — multiple seasonalities, nonlinear interactions with covariates, high-dimensional covariate spaces.
- Rich metadata exists — static features (e.g., category, region) and known future inputs (e.g., holidays, promotions).
- The forecast horizon is long and the series is long enough (1,000+ observations per series) to train reliably.

The M5 competition showed that the best approaches are often hybrids: statistical decomposition for interpretable components combined with ML models for nonlinear residuals.

Question 19
What is the Rauch-Tung-Striebel (RTS) smoother, and why are smoothed state estimates always at least as precise as filtered estimates?
Answer
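The forward-filter / backward-smooth structure can be sketched for the scalar local-level case (transition matrix $T = 1$; parameter values illustrative):

```python
def filter_and_smooth(ys, a0=0.0, p0=1.0, q=0.1, h=1.0):
    """Forward Kalman filter followed by a backward RTS smoother for a
    scalar local-level model (transition T = 1)."""
    af, pf, ap, pp = [], [], [], []  # filtered / one-step-predicted moments
    a, p = a0, p0
    for y in ys:                     # forward pass
        a_pred, p_pred = a, p + q
        ap.append(a_pred)
        pp.append(p_pred)
        k = p_pred / (p_pred + h)
        a = a_pred + k * (y - a_pred)
        p = (1 - k) * p_pred
        af.append(a)
        pf.append(p)
    asm, psm = af[:], pf[:]          # backward pass starts at t = n
    for t in range(len(ys) - 2, -1, -1):
        L = pf[t] / pp[t + 1]        # smoother gain P_{t|t} T' P_{t+1|t}^{-1}
        asm[t] = af[t] + L * (asm[t + 1] - ap[t + 1])
        psm[t] = pf[t] + L * L * (psm[t + 1] - pp[t + 1])
    return af, pf, asm, psm

af, pf, asm, psm = filter_and_smooth([0.1, 0.5, 0.3, 0.9, 0.7])
```

Because the correction term in the variance recursion is never positive, the smoothed variance can only shrink relative to the filtered one, matching the inequality in the answer.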
The RTS smoother is a backward pass that, given all observations $y_{1:n}$, refines past state estimates. Starting from the last filtered estimate and working backward, it computes:

$$\hat{\boldsymbol{\alpha}}_{t|n} = \hat{\boldsymbol{\alpha}}_{t|t} + \mathbf{L}_t (\hat{\boldsymbol{\alpha}}_{t+1|n} - \hat{\boldsymbol{\alpha}}_{t+1|t})$$

where $\mathbf{L}_t = \mathbf{P}_{t|t} \mathbf{T}^T \mathbf{P}_{t+1|t}^{-1}$ is the smoother gain. Smoothed estimates are always at least as precise because:

$$\text{Var}(\boldsymbol{\alpha}_t \mid y_{1:n}) \leq \text{Var}(\boldsymbol{\alpha}_t \mid y_{1:t})$$

The filtered estimate uses only past and current observations. The smoother additionally uses future observations to refine the estimate. Since conditioning on more information can only reduce (or maintain) variance, the smoothed covariance is always less than or equal to the filtered covariance. This is most visible during data gaps: the filter's uncertainty grows during the gap, but the smoother uses observations after the gap to "fill in" the missing period with reduced uncertainty.

Question 20
The chapter states: "Know How Your Model Is Wrong" as the unifying theme. For each of the following models, state one key assumption that can fail and how you would detect the failure.
(a) Kalman filter
(b) N-BEATS
(c) TFT
(d) Quantile regression
(e) Adaptive Conformal Inference