Chapter 23: Key Takeaways

  1. State-space models are the theoretical foundation of modern time series analysis, and the Kalman filter is exact Bayesian inference for linear dynamical systems. The predict-update cycle is sequential Bayesian updating: the predicted state is the prior, the new observation provides the likelihood, and the filtered state is the posterior. The Kalman gain performs precision-weighted averaging — trusting the observation when it is precise and the prediction when the observation is noisy. Hidden Markov models extend this framework to discrete latent states, enabling regime detection and switching models. Understanding the state-space framework reveals that ARIMA, exponential smoothing, and Prophet are all special cases — they differ in their assumptions about the latent state structure, not in their fundamental mechanism.
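
The predict-update cycle described above can be sketched for the simplest case, a one-dimensional local-level model. The function name `kalman_step` and the noise variances `Q` and `R` are illustrative choices, not notation from the chapter:

```python
def kalman_step(m, P, y, Q=0.1, R=1.0):
    """One predict-update cycle for a 1-D local-level model:
    x_t = x_{t-1} + w_t (var Q),  y_t = x_t + v_t (var R)."""
    # Predict: the prior for time t is the last posterior plus process noise.
    m_pred, P_pred = m, P + Q
    # Update: the gain is the prediction's share of total variance --
    # precision-weighted averaging of prediction and observation.
    K = P_pred / (P_pred + R)
    m_new = m_pred + K * (y - m_pred)   # posterior mean
    P_new = (1.0 - K) * P_pred          # posterior variance
    return m_new, P_new

# A precise observation (tiny R) drives the gain toward 1, so the
# filtered state tracks the observation almost exactly:
m, P = kalman_step(0.0, 1.0, y=2.0, Q=0.0, R=1e-6)  # m is close to 2.0
```

When `R` is large instead, the gain shrinks toward 0 and the filter trusts its own prediction, which is exactly the precision-weighting behavior described above.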

  2. Deep learning for time series is justified by data volume and pattern complexity, not by default. N-BEATS demonstrates that pure neural networks can match classical methods on univariate benchmarks via a doubly residual architecture with interpretable basis expansions. DeepAR adds probabilistic forecasting through autoregressive density estimation, producing sample paths that capture joint uncertainty across the forecast horizon. TFT is the most architecturally complete, integrating variable selection, static covariate encoders, temporal attention, and quantile outputs into a single interpretable model. The key advantage of deep learning is global modeling — training across many related series to leverage cross-series patterns — which is how TFT outperforms per-series classical methods on StreamRec's multi-category engagement data. The key risk is overfitting on short series.
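
DeepAR's autoregressive sampling can be sketched without a trained network: each sample path feeds its own draws back into the one-step predictive density, so the spread across paths captures joint uncertainty over the horizon. Here `toy_model` is a hypothetical stand-in for the learned network, returning a Gaussian mean and scale:

```python
import numpy as np

def sample_paths(last_value, predict, horizon, n_samples, seed=0):
    """DeepAR-style ancestral sampling (sketch): condition each next
    step on the path's own previous draw, not on a point forecast."""
    rng = np.random.default_rng(seed)
    paths = np.empty((n_samples, horizon))
    for s in range(n_samples):
        y = last_value
        for t in range(horizon):
            mu, sigma = predict(y)       # one-step predictive density params
            y = rng.normal(mu, sigma)    # draw, then condition on the draw
            paths[s, t] = y
    return paths

# Toy stand-in for the learned network: mean reversion toward 100.
toy_model = lambda y: (0.9 * y + 10.0, 2.0)
paths = sample_paths(120.0, toy_model, horizon=7, n_samples=2000)
p10, p90 = np.quantile(paths, [0.10, 0.90], axis=0)  # per-step bands
```

Because the per-step errors compound through the recursion, the 10-90% band widens along the horizon, which is the joint-uncertainty behavior a per-step interval cannot give you.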

  3. Probabilistic forecasts are not optional when decisions depend on uncertainty — and the minimum requirement for any probabilistic forecast is calibration. A point prediction of 1,847 daily active users is useless if the operations team needs to know the probability of falling below 1,500. Quantile regression, Bayesian posterior predictive intervals, and conformal prediction each provide prediction intervals — but only calibrated intervals are trustworthy. The Probability Integral Transform test and reliability diagrams are essential diagnostics. An overconfident model (sharp but miscalibrated) is worse than a conservative model (wide but honest), because overconfidence leads to systematically understated risk. The correct objective is Gneiting's principle: maximize sharpness subject to calibration.
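
The PIT diagnostic is cheap to compute for Gaussian forecasts: evaluate each forecast CDF at the realized value and check that the results look uniform. The sketch below, with illustrative names `pit_values` and `tail_mass`, shows how overconfidence (too-sharp intervals) shows up as PIT mass piling into the tails:

```python
import math
import numpy as np

def pit_values(y, mu, sigma):
    """PIT: evaluate each Gaussian forecast CDF at the realized value.
    Calibrated forecasts give PIT values uniform on [0, 1]."""
    z = (np.asarray(y) - mu) / sigma
    return np.array([0.5 * (1.0 + math.erf(v / math.sqrt(2.0))) for v in z])

def tail_mass(u):
    """Fraction of PIT values in the outer 10% (below 0.05 or above 0.95).
    Roughly 0.10 if calibrated; much larger if overconfident."""
    return float(np.mean((u < 0.05) | (u > 0.95)))

rng = np.random.default_rng(0)
truth = rng.normal(0.0, 1.0, 5000)
u_calibrated = pit_values(truth, mu=0.0, sigma=1.0)     # correct width
u_overconfident = pit_values(truth, mu=0.0, sigma=0.5)  # too sharp
```

A reliability diagram makes the same comparison visually: the overconfident model's PIT histogram is U-shaped, the honest-but-wide model's is hump-shaped, and only the calibrated model's is flat.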

  4. Conformal prediction provides distribution-free prediction intervals that adapt to distribution shift — the only method covered here with formal coverage guarantees (finite-sample under exchangeability; long-run average coverage via ACI when exchangeability fails). Adaptive Conformal Inference wraps any point forecaster and produces intervals that widen when errors increase and narrow when they decrease, achieving target coverage on average regardless of whether the underlying model is Gaussian, linear, or correctly specified. The adaptation rate $\gamma$ controls the tradeoff between responsiveness (fast adaptation to regime changes) and stability (smooth interval widths). For production systems where the data-generating process may shift without warning — a new content vertical launches, a pandemic changes user behavior, a sensor fails — conformal wrapping is the only reliable guarantee of interval coverage.
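
The ACI update is a single line: after each observation, nudge the running miscoverage target $\alpha_t$ down after a miss (widening the next interval) and up after a hit. A minimal sketch, assuming symmetric absolute-residual intervals; the function name `aci_intervals` and the warmup scheme are illustrative:

```python
import numpy as np

def aci_intervals(y, yhat, alpha=0.1, gamma=0.01, warmup=50):
    """Adaptive Conformal Inference (sketch): maintain a running target
    alpha_t; a miss lowers alpha_t (wider next interval), a hit raises it."""
    resid = np.abs(y - yhat)
    alpha_t, cover = alpha, []
    lo = np.empty(len(y) - warmup)
    hi = np.empty(len(y) - warmup)
    for i, t in enumerate(range(warmup, len(y))):
        # Interval half-width: empirical residual quantile at level 1 - alpha_t.
        q = np.quantile(resid[:t], np.clip(1.0 - alpha_t, 0.0, 1.0))
        lo[i], hi[i] = yhat[t] - q, yhat[t] + q
        miss = float(y[t] < lo[i] or y[t] > hi[i])
        cover.append(1.0 - miss)
        alpha_t += gamma * (alpha - miss)   # the online correction step
    return lo, hi, float(np.mean(cover))

# Wrap a (perfect-mean, unknown-noise) point forecaster:
rng = np.random.default_rng(0)
signal = np.sin(np.arange(1000) / 20.0)
y = signal + rng.normal(0.0, 1.0, 1000)
lo, hi, coverage = aci_intervals(y, signal, alpha=0.1, gamma=0.01)
```

Note that nothing in the update assumes Gaussian errors or a correct model; larger `gamma` reacts faster to regime changes at the cost of choppier interval widths.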

  5. Walk-forward validation is the only valid evaluation protocol for time series models — standard cross-validation introduces look-ahead bias that inflates performance estimates. The expanding-window walk-forward protocol respects the arrow of time: the model at each fold uses only past data for training and produces forecasts for a future window. MASE (Mean Absolute Scaled Error) is the preferred metric because it is scale-independent and benchmarks against the naive random-walk forecast — MASE < 1 means the model adds value, MASE > 1 means you should use the naive baseline. Backtesting shortcuts (training on all data, then evaluating on subsets) and random train/test splits are methodological errors that can make a poor model appear excellent.
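
The protocol and the metric fit in a few lines. A minimal sketch, with illustrative names `walk_forward` and `mase`, using the one-step naive forecast as the MASE scaling baseline:

```python
import numpy as np

def mase(y_true, y_pred, y_train):
    """MAE of the forecast, scaled by the in-sample MAE of the one-step
    naive (random-walk) forecast; < 1 means the model beats naive."""
    scale = np.mean(np.abs(np.diff(y_train)))
    return float(np.mean(np.abs(y_true - y_pred)) / scale)

def walk_forward(y, fit_predict, initial, horizon, step):
    """Expanding-window walk-forward: every fold trains strictly on the
    past and is scored on the window immediately after it."""
    scores = []
    for start in range(initial, len(y) - horizon + 1, step):
        train, test = y[:start], y[start:start + horizon]
        scores.append(mase(test, fit_predict(train, horizon), train))
    return np.array(scores)

# Naive baseline: repeat the last observed value across the horizon.
naive = lambda train, h: np.repeat(train[-1], h)
```

On a pure linear trend the naive forecaster's MASE is exactly 4.0 at a 7-step horizon (per-step errors 1 through 7 average to 4, and the scaling denominator is 1), a handy sanity check that the pipeline leaks no future data.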

  6. The choice between classical and deep learning models depends on the data, not on fashion — and the best practitioners know both families. Short series with simple patterns favor classical methods: the Kalman filter, exponential smoothing, structural time series models. Many related series with complex patterns favor deep learning: TFT, DeepAR, or N-BEATS as global models. Hybrid approaches — statistical decomposition for interpretable components, ML for nonlinear residuals — won the M4 forecasting competition and represent the current state of the art. The theme that unifies every section of this chapter is Know How Your Model Is Wrong: the Kalman filter fails on nonlinear systems, N-BEATS cannot incorporate covariates, TFT overfits on short series, quantile regression suffers crossing pathologies, and conformal prediction requires an appropriate adaptation rate. Knowing the failure modes is more valuable than knowing the success modes.
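
The hybrid idea — interpretable statistical components plus a learned residual model — can be sketched in miniature. Here the "ML" component is just an AR(1) fit standing in for whatever residual learner you would actually use; `hybrid_forecast` and its decomposition choices are illustrative assumptions, not the competition-winning architecture:

```python
import numpy as np

def hybrid_forecast(y, season=7, horizon=7):
    """Hybrid sketch: statistical components (linear trend + seasonal
    means) plus a learned residual model (AR(1) stand-in for the ML part)."""
    t = np.arange(len(y))
    slope, intercept = np.polyfit(t, y, 1)            # trend component
    detrended = y - (intercept + slope * t)
    seas = np.array([detrended[i::season].mean() for i in range(season)])
    resid = detrended - seas[t % season]              # what statistics missed
    phi = np.polyfit(resid[:-1], resid[1:], 1)[0]     # residual dynamics
    tf = np.arange(len(y), len(y) + horizon)
    carry = resid[-1] * phi ** np.arange(1, horizon + 1)
    return intercept + slope * tf + seas[tf % season] + carry
```

The decomposition stays auditable — you can plot trend, seasonality, and residual carryover separately — while the residual model absorbs whatever structure the statistical components miss, which is the division of labor the hybrid approach relies on.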