Chapter 23: Further Reading

Essential Sources

1. James Durbin and Siem Jan Koopman, Time Series Analysis by State Space Methods, 2nd edition (Oxford University Press, 2012)

The definitive reference on state-space models for time series. Durbin and Koopman develop the linear Gaussian state-space framework from first principles, deriving the Kalman filter, the disturbance smoother, and maximum likelihood estimation via the prediction error decomposition. The book's greatest contribution is demonstrating that virtually every classical time series model — ARIMA, exponential smoothing, structural time series, dynamic regression — is a special case of the state-space framework. Chapter 3 (the Kalman filter) and Chapter 4 (the smoother) formalize the predict-update equations used in Section 23.3 of this chapter. Chapter 8 (non-Gaussian models) introduces particle filters and importance sampling for the nonlinear extensions that Exercise 23.29 explores.
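The predict-update recursion that Chapters 3 and 4 derive can be sketched compactly. The code below uses standard textbook notation (state mean x, covariance P, transition F, observation H, noise covariances Q and R) with a toy local level model; the function name kalman_step is illustrative and does not correspond to any particular library's API.

```python
import numpy as np

def kalman_step(x, P, y, F, H, Q, R):
    """One predict-update cycle of the linear Gaussian Kalman filter."""
    # Predict: propagate the state mean and covariance through the transition model.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update: fold in the new observation y via the Kalman gain.
    S = H @ P_pred @ H.T + R              # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)   # Kalman gain
    innov = y - H @ x_pred                # prediction error (innovation)
    x_new = x_pred + K @ innov
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Local level model: a scalar random walk observed with noise.
F = np.array([[1.0]]); H = np.array([[1.0]])
Q = np.array([[0.1]]); R = np.array([[1.0]])
x, P = np.array([0.0]), np.array([[10.0]])  # diffuse-ish initial state
for y in [1.2, 0.9, 1.1]:
    x, P = kalman_step(x, P, np.array([y]), F, H, Q, R)
```

Note how the posterior variance P shrinks with each observation while the state mean tracks the data; the innovations computed here are exactly the quantities that enter the prediction error decomposition of the likelihood.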

Reading guidance: Chapters 1-4 are essential reading for anyone serious about time series. The notation is clean and the derivations are complete but accessible. Chapter 9 (structural time series models) provides the theoretical foundation for Section 23.5 and for understanding what Prophet assumes under the hood. For practitioners, Chapter 12 (software and applications) includes worked examples using the authors' SsfPack library. The R KFAS package and Python's statsmodels.tsa.statespace module both implement the algorithms from this book. If you read one book on time series after the intermediate course, this should be it — not because it covers deep learning (it does not), but because it provides the mathematical foundation that makes every other approach comprehensible.

2. Bryan Lim, Sercan O. Arik, Nicolas Loeff, and Tomas Pfister, "Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting" (International Journal of Forecasting, 2021)

The paper introducing TFT, the architecture covered in Section 23.8. Lim et al. solve a practical problem that earlier deep learning forecasting architectures ignored: real-world forecasting requires handling static metadata (e.g., item category), past observed inputs, known future inputs (e.g., holidays), and unknown future targets — all within a single architecture that remains interpretable. The four key innovations — variable selection networks, static covariate encoders, temporal self-attention, and quantile output layers — are each independently useful design patterns. The paper demonstrates that TFT matches or exceeds state-of-the-art deep learning methods on four datasets while providing variable importance weights and attention patterns that practitioners can inspect and trust.

Reading guidance: Section 3 (architecture) is the core contribution and deserves careful reading alongside the architectural diagram in this chapter's Section 23.8. The variable selection network (Section 3.3) and the gated residual network (Section 3.2) are the building blocks that Exercise 23.10 asks you to implement. Section 4 (experiments) compares TFT against DeepAR, MQRNN, and traditional methods on electricity, traffic, retail, and volatility datasets — the results contextualize when TFT's additional complexity is justified. The interpretability analysis (Section 5) demonstrates the practical value of attention-based explainability. For implementation, the pytorch-forecasting library provides a production-ready TFT that follows the paper's architecture closely. The paper's 20-page supplement contains ablation studies that show which architectural components contribute most to performance.
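As a rough guide to what Exercise 23.10 involves, here is a minimal numpy sketch of a gated residual network forward pass following the paper's equations (dense layer with ELU, second dense layer, GLU gating, residual connection, layer normalization). The weight names and the simplifications here (no context vector, matching input/output dimensions, no dropout) are illustrative, not the paper's exact parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)

def elu(x):
    return np.where(x > 0, x, np.expm1(x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def grn(a, W1, b1, W2, b2, W4, b4, W5, b5):
    """Gated residual network forward pass (no context vector, d_in == d_out)."""
    eta2 = elu(a @ W2 + b2)            # first dense layer + ELU nonlinearity
    eta1 = eta2 @ W1 + b1              # second dense layer
    gate = sigmoid(eta1 @ W4 + b4)     # GLU gate: how much of the update passes
    glu = gate * (eta1 @ W5 + b5)      # gated linear unit
    return layer_norm(a + glu)         # residual connection + layer norm

d = 8
W1, W2, W4, W5 = (rng.standard_normal((d, d)) * 0.1 for _ in range(4))
b1, b2, b4, b5 = (np.zeros(d) for _ in range(4))
a = rng.standard_normal((4, d))        # batch of 4 inputs
out = grn(a, W1, b1, W2, b2, W4, b4, W5, b5)
```

The gating is the key design choice: when the gate saturates near zero, the block degenerates to the identity, letting the network skip nonlinear processing for inputs that do not need it.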

3. Boris N. Oreshkin, Dmitri Carpov, Nicolas Chapados, and Yoshua Bengio, "N-BEATS: Neural Basis Expansion Analysis for Interpretable Time Series Forecasting" (ICLR, 2020)

The paper that demonstrated pure deep learning can match classical methods on univariate forecasting without any time-series-specific inductive biases. N-BEATS' philosophy is radical: no feature engineering, no decomposition, no domain knowledge — just stacked fully connected layers with a clever basis expansion output and doubly residual connections. The generic variant achieves top-3 performance on the M4 competition (which includes 100,000 series from diverse domains), while the interpretable variant constrains blocks to polynomial (trend) and Fourier (seasonal) bases, enabling automatic decomposition. The doubly residual architecture — where each block subtracts its backcast from the input and adds its forecast to the output — is an elegant solution to the multi-scale decomposition problem.
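The doubly residual mechanism is simple enough to sketch directly. The toy blocks below are untrained random MLPs, and the helper names (make_block, n_beats_forward) are illustrative, but the stacking loop follows the subtract-backcast / add-forecast pattern described above.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_block(lookback, horizon, hidden=16):
    """A toy N-BEATS block: a small MLP emitting a backcast and a forecast."""
    W_in = rng.standard_normal((lookback, hidden)) * 0.1
    W_back = rng.standard_normal((hidden, lookback)) * 0.1
    W_fore = rng.standard_normal((hidden, horizon)) * 0.1
    def block(x):
        h = np.maximum(x @ W_in, 0.0)   # shared fully connected trunk (ReLU)
        return h @ W_back, h @ W_fore   # (backcast, forecast)
    return block

def n_beats_forward(x, blocks, horizon):
    """Doubly residual stacking: each block subtracts its backcast from the
    residual input and adds its partial forecast to the running output."""
    residual = x.copy()
    forecast = np.zeros(horizon)
    for block in blocks:
        backcast, partial = block(residual)
        residual = residual - backcast   # remove what this block explained
        forecast = forecast + partial    # accumulate partial forecasts
    return forecast

lookback, horizon = 12, 3
x = np.sin(np.linspace(0, 4 * np.pi, lookback))
blocks = [make_block(lookback, horizon) for _ in range(3)]
yhat = n_beats_forward(x, blocks, horizon)
```

In the interpretable variant, W_back and W_fore would be replaced by fixed polynomial or Fourier basis matrices, so each block's partial forecast becomes a readable trend or seasonal component.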

Reading guidance: Section 2 (architecture) is concise and precise — the entire model can be understood in 5 pages. Figure 1 provides the clearest visualization of the doubly residual architecture. Section 3 (interpretable variant) introduces the polynomial and Fourier basis constraints that make decomposition possible without explicit feature engineering. The M4 competition results in Section 4 provide the empirical evidence that pure DL can compete with statistical ensembles, but the paper is honest about limitations: N-BEATS is univariate-only (N-BEATSx, published later by Olivares et al., 2023, adds covariates) and requires large lookback windows. For implementation, the neuralforecast library by Nixtla provides optimized N-BEATS with automatic hyperparameter selection.

4. David Salinas, Valentin Flunkert, Jan Gasthaus, and Tim Januschowski, "DeepAR: Probabilistic Forecasting with Autoregressive Recurrent Networks" (International Journal of Forecasting, 2020)

The paper that established autoregressive neural network forecasting as a practical tool at scale. DeepAR's key contribution is combining three ideas: (1) autoregressive RNN generation, (2) parametric distribution output (Gaussian, negative binomial, or other families), and (3) global training across multiple related series. The approach was developed at Amazon for demand forecasting across millions of products and demonstrated that a single global model trained on all series outperforms per-series classical methods, especially for intermittent or sparse series where individual models lack sufficient data. The probabilistic output — via ancestral sampling of full forecast paths — naturally captures cross-horizon correlations that independent quantile predictions miss.
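The ancestral sampling idea can be illustrated without the RNN itself: replace the network with any function mapping the previous value to distribution parameters, then sample forward one step at a time, feeding each sample back in. The helper names below (sample_paths, toy_step) are illustrative, and the AR(1)-style Gaussian stands in for the learned conditional distribution.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_paths(history, horizon, n_paths, step_fn):
    """Ancestral sampling of full forecast paths: each step conditions on the
    path's own previous sample, preserving cross-horizon correlations."""
    paths = np.empty((n_paths, horizon))
    for i in range(n_paths):
        prev = history[-1]
        for t in range(horizon):
            mu, sigma = step_fn(prev)     # the network would output these
            prev = rng.normal(mu, sigma)  # draw a sample, feed it back in
            paths[i, t] = prev
    return paths

# Stand-in for the trained RNN: an AR(1)-style conditional Gaussian.
def toy_step(prev):
    return 0.8 * prev, 0.1

paths = sample_paths(history=np.array([1.0]), horizon=5, n_paths=500,
                     step_fn=toy_step)
# Quantiles across sampled paths give per-horizon prediction intervals.
q10, q90 = np.quantile(paths, [0.1, 0.9], axis=0)
```

Because each path is internally coherent, statistics of path aggregates (e.g., the sum over the horizon, relevant for total demand) come out correctly; independent per-horizon quantile forecasts cannot provide this.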

Reading guidance: Section 2 (model) covers the architecture implemented in Section 23.8.1 of this chapter. The training procedure (Section 2.2), including teacher forcing and the negative log-likelihood loss, is implemented in the DeepARModel class. Section 3 (experiments) compares DeepAR against ETS, ARIMA, and classical baselines on electricity, traffic, and parts demand datasets — the results show that global training is the key advantage, not the neural network architecture per se. Section 2.3 (negative binomial likelihood) is essential reading for count data applications, extending Exercise 23.8. For implementation, Amazon's GluonTS library provides the canonical DeepAR implementation. The pytorch-forecasting library also includes a DeepAR variant compatible with the TFT data pipeline.

5. Tilmann Gneiting and Adrian E. Raftery, "Strictly Proper Scoring Rules, Prediction, and Estimation" (Journal of the American Statistical Association, 2007)

The theoretical foundation for evaluating probabilistic forecasts. Gneiting and Raftery formalize the principle that dominates Section 23.10: maximize sharpness subject to calibration. A scoring rule is proper if the expected score is optimized when the forecaster reports the true predictive distribution, and strictly proper if only the true distribution achieves the optimum. The paper proves that the continuous ranked probability score (CRPS), the logarithmic score, and the quantile score (pinball loss) are all strictly proper, while popular alternatives (e.g., percentage of observations within the interval) are not. The practical implication is profound: when you evaluate probabilistic forecasts with a strictly proper scoring rule, the forecaster has no incentive to hedge or misreport uncertainty.
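Both the pinball loss and a sample-based CRPS estimate fit in a few lines. The CRPS sketch below uses the energy form E|X - y| - ½E|X - X'|, one standard estimator from forecast samples; the function names are illustrative.

```python
import numpy as np

def pinball_loss(y, q_pred, tau):
    """Quantile (pinball) score: strictly proper for the tau-quantile."""
    diff = y - q_pred
    return np.mean(np.maximum(tau * diff, (tau - 1.0) * diff))

def crps_from_samples(y, samples):
    """Empirical CRPS via the energy form E|X - y| - 0.5 * E|X - X'|,
    where X, X' are independent draws from the forecast distribution."""
    samples = np.asarray(samples, dtype=float)
    term1 = np.mean(np.abs(samples - y))
    term2 = 0.5 * np.mean(np.abs(samples[:, None] - samples[None, :]))
    return term1 - term2

rng = np.random.default_rng(3)
y = 0.0
sharp = rng.normal(0.0, 1.0, size=2000)    # forecast matching the truth
biased = rng.normal(2.0, 1.0, size=2000)   # shifted, miscalibrated forecast
```

Comparing the two forecasters on the same observation shows the propriety in action: the forecast that matches the data-generating distribution scores strictly better, so there is nothing to gain from hedging.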

Reading guidance: Section 2 (calibration and sharpness) provides the conceptual framework that Section 23.10 of this chapter implements. The definition of calibration via the PIT (Section 2.1) and the examples of miscalibration patterns (Figure 1) are the basis for the pit_calibration_test function. Section 3 (scoring rules) derives the CRPS, which generalizes MAE to probabilistic forecasts and is the most commonly used metric in weather forecasting. Section 4 (connections to estimation) shows that MLE is equivalent to minimizing the logarithmic score — connecting forecast evaluation to the likelihood-based methods used throughout Part IV. For the conformal prediction coverage guarantees in Section 23.11, the key companion paper is Gibbs and Candès, "Adaptive Conformal Inference Under Distribution Shift" (NeurIPS, 2021), which proves the finite-sample bounds referenced in the chapter's Advanced Sidebar.
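The PIT idea behind a calibration check like pit_calibration_test can be sketched as follows: evaluate each forecast's CDF (here, an empirical CDF built from forecast samples) at the realized value and inspect whether the resulting values look uniform. The helper name pit_values and the two simulated forecasters are illustrative, not the chapter's actual implementation.

```python
import numpy as np

def pit_values(y_obs, forecast_samples):
    """PIT: evaluate each forecast's empirical CDF at the realized value.
    A calibrated forecaster yields approximately uniform PIT values."""
    return np.array([np.mean(s <= y) for y, s in zip(y_obs, forecast_samples)])

rng = np.random.default_rng(4)
n = 1000
truth = rng.normal(0.0, 1.0, size=n)
# Calibrated forecaster: samples from the same distribution as the data.
calibrated = [rng.normal(0.0, 1.0, size=200) for _ in range(n)]
# Overconfident forecaster: intervals too narrow -> U-shaped PIT histogram.
overconfident = [rng.normal(0.0, 0.3, size=200) for _ in range(n)]
u_good = pit_values(truth, calibrated)
u_bad = pit_values(truth, overconfident)
```

A histogram of u_good is roughly flat, while u_bad piles up near 0 and 1 — the classic U-shape signaling that too many observations fall in the forecaster's tails.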