Chapter 23: Exercises
Exercises are graded by difficulty:

- One star (*): Apply the technique from the chapter to a new dataset or scenario
- Two stars (**): Extend the technique or combine it with a previous chapter's methods
- Three stars (***): Derive a result, implement from scratch, or design a system component
- Four stars (****): Research-level problems that connect to open questions in the field
State-Space Models and the Kalman Filter
Exercise 23.1 (*)
Consider the local level (random walk plus noise) state-space model:
$$\mu_{t+1} = \mu_t + \eta_t, \quad \eta_t \sim \mathcal{N}(0, \sigma_\eta^2)$$
$$y_t = \mu_t + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, \sigma_\varepsilon^2)$$
with $\sigma_\eta = 0.5$, $\sigma_\varepsilon = 2.0$, and diffuse initialization ($P_0 = 10^6$).
(a) Write out the Kalman filter matrices $\mathbf{T}$, $\mathbf{Z}$, $\mathbf{R}$, $\mathbf{Q}$, and $H$ for this model.
(b) Simulate 200 observations from this model with $\mu_0 = 10$. Run the KalmanFilter class from the chapter and plot the filtered state estimate $\hat{\mu}_t$ with $\pm 2\sigma$ bands overlaid on the true $\mu_t$ and noisy observations $y_t$.
(c) After the filter converges (say, $t > 20$), the steady-state Kalman gain is approximately $K^* = \sigma_\eta / (\sigma_\eta + \sigma_\varepsilon)$. Verify this empirically by inspecting the Kalman gain over time. Explain intuitively why the gain is closer to 0 than to 1 when $\sigma_\varepsilon \gg \sigma_\eta$.
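For part (c), the gain sequence can be inspected even without the chapter's KalmanFilter class, because the variance recursion (and hence the gain) does not depend on the data. A minimal scalar sketch under the model as stated (function names here are illustrative):

```python
import random

def local_level_filter(y, sigma_eta=0.5, sigma_eps=2.0, P0=1e6, mu0=0.0):
    """Scalar Kalman filter for the local level model; returns estimates and gains."""
    q, h = sigma_eta**2, sigma_eps**2
    mu, P = mu0, P0
    mus, gains = [], []
    for obs in y:
        P_pred = P + q                  # predict: random-walk state, variance grows
        K = P_pred / (P_pred + h)       # gain blends prediction and observation
        mu = mu + K * (obs - mu)
        P = (1 - K) * P_pred
        mus.append(mu)
        gains.append(K)
    return mus, gains

# Simulate 200 observations with mu_0 = 10, then filter them.
random.seed(0)
mu_true, ys = 10.0, []
for _ in range(200):
    ys.append(mu_true + random.gauss(0, 2.0))
    mu_true += random.gauss(0, 0.5)

mus, gains = local_level_filter(ys)
print(round(gains[-1], 4))  # steady-state gain, approx. 0.2207
```

Note the converged gain (≈0.221) is close to, but not exactly, the approximation $\sigma_\eta / (\sigma_\eta + \sigma_\varepsilon) = 0.2$.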
Exercise 23.2 (*)
The chapter's Kalman filter implementation handles missing data by skipping the update step.
(a) Using the simulated climate data from Section 23.3, increase the missing-data fraction from 10% to 50%. Re-run the filter and smoother. How does the filtered uncertainty ($\sqrt{P_{t,00}}$) change in regions with long gaps?
(b) Plot the smoothed estimate alongside the filtered estimate for a contiguous block of 20 missing observations. Explain why the smoother is more accurate than the filter during the gap.
(c) At what missing-data fraction does the Kalman filter's 5-year forecast interval width double compared to the 10% baseline?
Exercise 23.3 (**)
Parameter estimation via maximum likelihood.
The chapter's KalmanFilter stores the log-likelihood as a byproduct of the forward pass.
(a) Write a function fit_local_linear_trend(y, sigma_level_range, sigma_trend_range, sigma_obs_range) that performs a grid search over the three noise parameters, calling build_local_linear_trend and kf.filter() for each combination, and returns the parameters that maximize the log-likelihood.
(b) Generate 500 observations from a local linear trend model with $\sigma_\mu = 0.05$, $\sigma_\nu = 0.01$, $\sigma_\varepsilon = 0.20$. Use your grid search to recover these parameters. How close are the estimated values?
(c) Compare your grid-search result with statsmodels.tsa.statespace.structural.UnobservedComponents. Do the estimated parameters agree?
import statsmodels.api as sm
model = sm.tsa.UnobservedComponents(y, level="local linear trend")
result = model.fit(disp=False)
print(result.summary())
Exercise 23.4 (**)
Multi-sensor fusion with the Kalman filter.
A climate monitoring station has two temperature sensors with different noise characteristics: sensor A has $\sigma_A = 0.3^\circ$C, sensor B has $\sigma_B = 1.0^\circ$C. The true temperature follows a local level model with $\sigma_\eta = 0.1^\circ$C.
(a) Formulate this as a multivariate observation state-space model:
$$\mathbf{y}_t = \begin{bmatrix} y_t^A \\ y_t^B \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \mu_t + \boldsymbol{\varepsilon}_t, \quad \boldsymbol{\varepsilon}_t \sim \mathcal{N}\left(\mathbf{0}, \begin{bmatrix} \sigma_A^2 & 0 \\ 0 & \sigma_B^2 \end{bmatrix}\right)$$
Write out all Kalman filter matrices for this model.
(b) Simulate 300 observations. Run the Kalman filter and verify that the fused estimate is more precise than either sensor alone. Plot the filtered uncertainty alongside the single-sensor Kalman filter uncertainties.
(c) At $t = 150$, sensor B fails permanently ($y_t^B$ becomes NaN for $t \geq 150$). Modify the filter to handle partial missing data (sensor A still reports). Show that the filtered uncertainty increases when sensor B drops out but remains smaller than sensor-A-only.
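For part (c), one standard approach is to delete the rows of $\mathbf{Z}$ and the rows/columns of $\mathbf{H}$ for the missing sensors before the update. A sketch of a single fused update step with numpy (names are ours, not the chapter's):

```python
import numpy as np

def fused_update(mu_pred, P_pred, y, Z, H):
    """One Kalman update for a possibly partially missing observation vector."""
    obs = ~np.isnan(y)                     # mask of sensors that reported
    if not obs.any():
        return mu_pred, P_pred             # all missing: skip the update entirely
    Zo, Ho = Z[obs], H[np.ix_(obs, obs)]   # keep only observed rows/cols
    yo = y[obs]
    S = Zo @ P_pred @ Zo.T + Ho            # innovation covariance
    K = P_pred @ Zo.T @ np.linalg.inv(S)   # Kalman gain
    mu = mu_pred + K @ (yo - Zo @ mu_pred)
    P = (np.eye(len(P_pred)) - K @ Zo) @ P_pred
    return mu, P

Z = np.array([[1.0], [1.0]])               # both sensors observe the level
H = np.diag([0.3**2, 1.0**2])              # sensor A and B noise variances
mu_pred, P_pred = np.array([20.0]), np.array([[1.0]])

# Both sensors report:
mu2, P2 = fused_update(mu_pred, P_pred, np.array([20.4, 19.1]), Z, H)
# Sensor B failed (NaN):
mu1, P1 = fused_update(mu_pred, P_pred, np.array([20.4, np.nan]), Z, H)
```

With both sensors the posterior variance is smaller than with sensor A alone, which is in turn far smaller than the prior, matching the claim in part (c).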
Hidden Markov Models
Exercise 23.5 (*)
Using the GaussianHMM class from the chapter, consider a 3-state model representing user engagement regimes: "highly active" ($\mu_0 = 0.85, \sigma_0 = 0.05$), "moderately active" ($\mu_1 = 0.50, \sigma_1 = 0.08$), and "dormant" ($\mu_2 = 0.10, \sigma_2 = 0.06$).
(a) Define a transition matrix where the system tends to stay in each state (self-transition probability 0.95) with small probabilities of switching. Simulate 1,000 observations.
(b) Run the Viterbi algorithm to decode the most likely state sequence. Compute the state-level accuracy.
(c) Plot the filtering probabilities $P(s_t = k \mid y_{1:t})$ for all three states over time. Identify transitions where the filter is "uncertain" (maximum filtering probability < 0.7). What do these uncertain periods correspond to in the raw data?
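A minimal simulator for parts (a) and (b), independent of the chapter's GaussianHMM class (function name is ours):

```python
import numpy as np

def simulate_hmm(T, A, means, stds, rng):
    """Sample a state path from transition matrix A, then Gaussian emissions."""
    K = len(means)
    states = np.empty(T, dtype=int)
    states[0] = rng.integers(K)
    for t in range(1, T):
        states[t] = rng.choice(K, p=A[states[t - 1]])
    y = rng.normal(means[states], stds[states])
    return states, y

# Sticky transition matrix: 0.95 self-transition, remainder split evenly.
A = np.full((3, 3), 0.025) + np.eye(3) * 0.925
means = np.array([0.85, 0.50, 0.10])
stds = np.array([0.05, 0.08, 0.06])
states, y = simulate_hmm(1000, A, means, stds, np.random.default_rng(42))
```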
Exercise 23.6 (**)
Regime-switching for StreamRec engagement.
Extend the HMM to model regime-switching in StreamRec's comedy category engagement data.
(a) Generate 500 days of comedy engagement using the generate_streamrec_engagement function from Section 23.15. Extract the comedy time series (dropping NaN).
(b) Fit a 2-state Gaussian HMM using hmmlearn.GaussianHMM with 100 EM iterations. Report the estimated means, standard deviations, and transition matrix.
from hmmlearn import hmm
model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=100, random_state=42)
model.fit(y_comedy.reshape(-1, 1))
states = model.predict(y_comedy.reshape(-1, 1))
(c) Do the two states correspond to weekday and weekend engagement? Compute the fraction of time each decoded state overlaps with weekends. Discuss whether a 2-state HMM is the right model for day-of-week effects.
Deep Learning Architectures
Exercise 23.7 (*)
N-BEATS interpretable decomposition.
Using the NBEATS class from Section 23.7:
(a) Generate a synthetic series: $y_t = 0.02t + 5\sin(2\pi t / 50) + \varepsilon_t$ where $\varepsilon_t \sim \mathcal{N}(0, 1)$, for $t = 0, \ldots, 499$.
(b) Build an interpretable N-BEATS model with 1 trend stack (polynomial degree 2) and 1 seasonal stack (4 harmonics). Train it using lookback $L = 60$ and horizon $H = 12$ with MSE loss for 100 epochs.
(c) Extract the trend and seasonal backcasts from the trained model on a test window. Plot them separately. Does the trend stack capture the linear growth? Does the seasonal stack capture the sinusoidal pattern?
Exercise 23.8 (**)
DeepAR: comparing output distributions.
The chapter's DeepARModel uses a Gaussian output distribution. Many real-world count series (e.g., daily transaction counts) are better modeled by a negative binomial.
(a) Modify DeepARModel to output negative binomial parameters instead of Gaussian. Replace mu_head and sigma_head with heads that output $\mu > 0$ (via Softplus) and $\alpha > 0$ (the overdispersion parameter, via Softplus).
(b) The negative binomial log-likelihood is:
$$\log p(y \mid \mu, \alpha) = \log\Gamma(y + \alpha^{-1}) - \log\Gamma(\alpha^{-1}) - \log\Gamma(y + 1) + \alpha^{-1}\log\frac{\alpha^{-1}}{\alpha^{-1} + \mu} + y \log\frac{\mu}{\alpha^{-1} + \mu}$$
Implement this as a loss function. Verify that it reduces to the Poisson log-likelihood as $\alpha \to 0$.
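A scalar reference implementation using only the standard library is useful as ground truth when unit-testing the PyTorch loss head (function names here are illustrative):

```python
from math import lgamma, log

def nb_log_pmf(y, mu, alpha):
    """Negative binomial log-likelihood with mean mu and overdispersion alpha."""
    r = 1.0 / alpha  # inverse overdispersion, alpha^{-1} in the formula above
    return (lgamma(y + r) - lgamma(r) - lgamma(y + 1)
            + r * log(r / (r + mu)) + y * log(mu / (r + mu)))

def poisson_log_pmf(y, mu):
    return y * log(mu) - mu - lgamma(y + 1)

# As alpha -> 0 the negative binomial collapses to the Poisson:
print(nb_log_pmf(3, 2.0, 1e-8))   # approx. equal to poisson_log_pmf(3, 2.0)
print(poisson_log_pmf(3, 2.0))
```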
(c) Generate count data from a Poisson process with rate $\lambda_t = 50 + 20\sin(2\pi t / 7)$ (weekly seasonality). Train both the Gaussian and negative binomial DeepAR models. Compare the sample forecast distributions at a peak ($t$ on a weekend) and a trough ($t$ mid-week). Which model produces more realistic integer-valued samples?
Exercise 23.9 (**)
TFT variable importance analysis.
Using pytorch-forecasting, train a TFT on the StreamRec engagement data from Section 23.15 with the following covariates:
- Static: category
- Known future: day_of_week, month, is_weekend, is_holiday
- Unknown historical: engagement_rate
(a) Extract and report the encoder variable importances. Which feature receives the highest attention weight?
(b) Extract the temporal attention weights for the "comedy" category. Plot the attention pattern over the encoder length. Does the model attend more strongly to recent time steps, or does it show weekly periodicity in the attention pattern?
(c) Remove is_weekend from the covariate set and retrain. Does the variable importance of day_of_week increase to compensate? Does point accuracy (MAE) change?
Exercise 23.10 (***)
Implement a minimal TFT variable selection network from scratch.
The variable selection network (VSN) in TFT computes softmax attention over input features:
$$\mathbf{v}_t = \text{Softmax}(\mathbf{W}_v \, [\xi_1^{(t)}, \ldots, \xi_K^{(t)}] + \mathbf{b}_v)$$
$$\tilde{\mathbf{x}}_t = \sum_{k=1}^K v_t^{(k)} \, \text{GRN}(\xi_k^{(t)})$$
where $\text{GRN}$ is a Gated Residual Network.
(a) Implement a GatedResidualNetwork module in PyTorch:
import torch
import torch.nn as nn

class GatedResidualNetwork(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, output_dim: int,
                 context_dim: int = 0, dropout: float = 0.1):
        super().__init__()
        self.fc1 = nn.Linear(input_dim + context_dim, hidden_dim)
        self.elu = nn.ELU()
        self.fc2 = nn.Linear(hidden_dim, output_dim)
        self.dropout = nn.Dropout(dropout)
        self.gate = nn.Linear(hidden_dim, output_dim)
        self.sigmoid = nn.Sigmoid()
        self.layer_norm = nn.LayerNorm(output_dim)
        self.skip = nn.Linear(input_dim, output_dim) if input_dim != output_dim else nn.Identity()
Complete the forward method with gated skip connections: $\text{output} = \text{LayerNorm}(\text{skip}(x) + \text{gate} \odot \text{fc2}(\text{ELU}(\text{fc1}(x))))$.
(b) Implement a VariableSelectionNetwork that applies a GRN to each input variable, computes softmax weights, and returns the weighted combination.
(c) Test on synthetic data: create 5 input features where only features 1 and 3 are predictive of the target. Train the VSN as a standalone predictor. Verify that the learned attention weights $v^{(1)}$ and $v^{(3)}$ dominate.
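For part (c), a synthetic dataset where only two of five features carry signal might look like this (coefficients and seed are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
# Only features 1 and 3 are predictive; the other three are pure noise.
y = 2.0 * X[:, 1] - 1.5 * X[:, 3] + 0.1 * rng.normal(size=n)
```

A well-trained VSN on this data should place almost all of its softmax mass on $v^{(1)}$ and $v^{(3)}$.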
Probabilistic Forecasting
Exercise 23.11 (*)
Quantile loss by hand.
Compute the quantile loss $\rho_\tau(y, \hat{q})$ for the following cases:
| $\tau$ | $y$ | $\hat{q}$ | $\rho_\tau$ |
|---|---|---|---|
| 0.10 | 100 | 95 | ? |
| 0.10 | 100 | 110 | ? |
| 0.90 | 100 | 95 | ? |
| 0.90 | 100 | 110 | ? |
| 0.50 | 100 | 95 | ? |
| 0.50 | 100 | 110 | ? |
Verify your answers using the quantile_loss function from Section 23.9 (adapted for scalar inputs). Why is the loss for $\tau = 0.10$ higher when $\hat{q} = 110$ (over the true value) than when $\hat{q} = 95$ (under the true value)?
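The quantile (pinball) loss fits in one line; a scalar version for checking the table by hand, following the standard definition that the chapter's quantile_loss is assumed to match:

```python
def pinball(tau, y, q_hat):
    """Quantile loss: weight tau on under-prediction, 1 - tau on over-prediction."""
    err = y - q_hat
    return tau * err if err >= 0 else (tau - 1) * err

for tau in (0.10, 0.50, 0.90):
    for q_hat in (95, 110):
        print(tau, q_hat, pinball(tau, 100, q_hat))
```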
Exercise 23.12 (**)
Quantile crossing. A common pathology of independent quantile regression is quantile crossing: the predicted 90th percentile is lower than the predicted 50th percentile.
(a) Demonstrate quantile crossing by training three separate linear quantile regression models (for $\tau \in \{0.1, 0.5, 0.9\}$) on a small, noisy dataset:
from sklearn.linear_model import QuantileRegressor
import numpy as np
np.random.seed(42)
X = np.random.uniform(0, 10, (30, 1))
y = 3 * X.ravel() + np.random.normal(0, 2 + 0.5 * X.ravel()) # Heteroscedastic
Plot the three fitted quantile lines. Do any crossings occur?
(b) Implement a simple fix: isotonic rearrangement. After predicting all quantiles at a test point, sort them to enforce monotonicity. Apply this to your predictions from part (a).
(c) Explain why TFT's multi-quantile output head is less susceptible to crossing than independent quantile regression models. (Hint: consider shared hidden representations.)
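The rearrangement fix in part (b) is a one-liner once all quantile predictions are stacked; a sketch with numpy (array layout is our choice):

```python
import numpy as np

def rearrange(pred_quantiles):
    """Sort predicted quantiles at each test point to enforce monotonicity.

    pred_quantiles: shape (n_quantiles, n_points), rows ordered by tau.
    """
    return np.sort(pred_quantiles, axis=0)

# Crossed predictions at the first test point: the 0.9 quantile dips below the 0.5.
preds = np.array([[1.0, 2.0],   # tau = 0.1
                  [3.0, 5.0],   # tau = 0.5
                  [2.5, 6.0]])  # tau = 0.9
fixed = rearrange(preds)
```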
Exercise 23.13 (**)
Comparing prediction interval methods.
Generate 500 observations from a time series with a changepoint at $t = 300$:
$$y_t = \begin{cases} 50 + 0.02t + 5\sin(2\pi t / 30) + \varepsilon_t, & t < 300, \quad \varepsilon_t \sim \mathcal{N}(0, 3) \\ 80 + 0.01t + 5\sin(2\pi t / 30) + \varepsilon_t, & t \geq 300, \quad \varepsilon_t \sim \mathcal{N}(0, 6) \end{cases}$$
(a) Fit a Kalman filter (local linear trend + Fourier seasonal component) to the full series. Compute 90% prediction intervals. What is the empirical coverage before and after the changepoint?
(b) Apply Adaptive Conformal Inference with $\alpha = 0.10$ and $\gamma = 0.01$ to a naive rolling-mean forecaster. Compute the empirical coverage before and after the changepoint.
(c) Which method recovers faster from the changepoint? Plot the interval widths over time for both methods. Discuss the tradeoff between model-based intervals (Kalman) and wrapper-based intervals (ACI).
Calibration
Exercise 23.14 (*)
Using the pit_calibration_test function from Section 23.10:
(a) Generate 1,000 observations from $y \sim \mathcal{N}(0, 1)$. Produce "forecasts" from three models:

- Model A (well-calibrated): $\hat{F}(y) = \Phi(y; 0, 1)$
- Model B (biased): $\hat{F}(y) = \Phi(y; 0.5, 1)$ (mean shifted by 0.5)
- Model C (overconfident): $\hat{F}(y) = \Phi(y; 0, 0.5)$ (variance too small)
(b) Compute the PIT histogram (20 bins) and KS test p-value for each model. Which model is detected as miscalibrated?
(c) Plot the reliability diagram: nominal quantile level ($x$-axis) vs. empirical coverage ($y$-axis). A perfectly calibrated model lies on the diagonal. Characterize the miscalibration pattern for Models B and C.
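The PIT values and a Kolmogorov-Smirnov statistic against the uniform distribution need nothing beyond numpy and the standard library; a sketch of part (b)'s machinery (the chapter's pit_calibration_test is assumed to wrap something similar):

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(x, mu=0.0, sigma=1.0):
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

def ks_uniform(u):
    """KS distance between the empirical CDF of PIT values and the uniform CDF."""
    u = np.sort(u)
    n = len(u)
    grid = np.arange(1, n + 1) / n
    return max(np.abs(grid - u).max(), np.abs(u - (grid - 1 / n)).max())

rng = np.random.default_rng(1)
y = rng.normal(0, 1, 1000)
pit_a = np.array([norm_cdf(v, 0.0, 1.0) for v in y])  # well-calibrated Model A
pit_c = np.array([norm_cdf(v, 0.0, 0.5) for v in y])  # overconfident Model C
```

Model C's PIT values pile up near 0 and 1, so its KS distance is far larger than Model A's.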
Exercise 23.15 (**)
Recalibration via isotonic regression.
When a model is miscalibrated, isotonic regression on the PIT values can correct the calibration.
(a) Generate 2,000 observations from $y \sim \mathcal{N}(0, 1)$. Create an overconfident forecast (as in Model C above). Split into calibration (first 1,000) and test (last 1,000) sets.
(b) On the calibration set, compute PIT values $u_i = \hat{F}(y_i)$. Fit an isotonic regression mapping $u_i \to \tilde{u}_i$ where $\tilde{u}_i$ should be uniform. Use sklearn.isotonic.IsotonicRegression.
(c) Apply the learned isotonic mapping to the test set PIT values. Recompute the KS test. Does the recalibration improve the KS p-value? Plot PIT histograms before and after recalibration.
Conformal Prediction
Exercise 23.16 (*)
Run the AdaptiveConformalForecaster from Section 23.11 with three different adaptation rates: $\gamma \in \{0.001, 0.01, 0.05\}$.
(a) Use the regime-change data from the chapter demonstration ($n = 600$, regime shift at $t = 300$). For each $\gamma$, report the overall coverage, regime-1 coverage, and regime-2 coverage.
(b) Plot the adaptive interval half-width $\hat{q}_t$ over time for all three values of $\gamma$. Which adapts fastest to the regime change? Which has the smoothest intervals?
(c) Derive the steady-state interval width $\hat{q}^*$ for a stationary series where the point forecaster has a fixed absolute error distribution with CDF $G$. Show that $\hat{q}^* = G^{-1}(1 - \alpha)$, the $(1-\alpha)$-quantile of the absolute error distribution.
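A minimal width-tracking ACI update in the style described above (a standalone sketch, not the chapter's AdaptiveConformalForecaster, which is assumed to adapt a half-width $\hat{q}_t$ in the same way):

```python
import random

def aci(ys, forecasts, alpha=0.10, gamma=0.01, q0=1.0):
    """Adapt the interval half-width online; returns coverage flags and widths."""
    q = q0
    covered, widths = [], []
    for y, f in zip(ys, forecasts):
        inside = abs(y - f) <= q
        covered.append(inside)
        widths.append(q)
        # Grow the interval after a miss, shrink it slightly after a hit.
        q += gamma * ((0 if inside else 1) - alpha)
    return covered, widths

random.seed(0)
ys = [random.gauss(0, 1) for _ in range(5000)]
covered, widths = aci(ys, [0.0] * 5000, gamma=0.05)
print(round(sum(covered) / len(covered), 3))  # close to the 0.90 target
```

For a stationary series, $\hat{q}_t$ fluctuates around the $(1-\alpha)$-quantile of the absolute error distribution, consistent with part (c).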
Exercise 23.17 (***)
Conformal prediction with asymmetric intervals.
The standard ACI from the chapter produces symmetric intervals $[\hat{y}_t - \hat{q}_t, \hat{y}_t + \hat{q}_t]$. For skewed forecasting errors, asymmetric intervals are more efficient.
(a) Modify AdaptiveConformalForecaster to maintain separate upper and lower quantiles $\hat{q}_t^+$ and $\hat{q}_t^-$, each adapted with its own error indicator:
$$\hat{q}_{t+1}^+ = \hat{q}_t^+ + \gamma \left(\mathbb{1}\{y_t > \hat{y}_t + \hat{q}_t^+\} - \alpha/2\right)$$
$$\hat{q}_{t+1}^- = \hat{q}_t^- + \gamma \left(\mathbb{1}\{y_t < \hat{y}_t - \hat{q}_t^-\} - \alpha/2\right)$$
(b) Test on a series with positive skew: $y_t = 50 + \varepsilon_t$ where $\varepsilon_t \sim \text{Exponential}(1) - 1$ (shifted to have mean 0). Use $\hat{y}_t = 50$ as the point forecast. Compare the symmetric and asymmetric interval widths.
(c) Verify that the asymmetric intervals achieve the target $1-\alpha$ total coverage while having a smaller average interval width than the symmetric version.
Walk-Forward Validation and Changepoint Detection
Exercise 23.18 (*)
Using walk_forward_splits from Section 23.13:
(a) Generate 1,000 observations from a seasonal time series: $y_t = 100 + 0.05t + 10\sin(2\pi t / 365) + \varepsilon_t$, $\varepsilon_t \sim \mathcal{N}(0, 5)$.
(b) Implement two forecasters:

- Naive: $\hat{y}_{t+h} = y_t$ (last observation)
- Seasonal naive: $\hat{y}_{t+h} = y_{t+h-365}$ (same day last year)
(c) Evaluate both using walk-forward validation with min_train=365, forecast_horizon=30, step=30. Report MAE, RMSE, and MASE for each. Explain why the seasonal naive should win.
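If the chapter's walk_forward_splits is not at hand, a stand-in with the same assumed signature takes a few lines (the original's exact semantics may differ):

```python
def walk_forward_splits(n, min_train, forecast_horizon, step):
    """Yield (train_idx, test_idx) index pairs for expanding-window validation."""
    t = min_train
    while t + forecast_horizon <= n:
        yield list(range(t)), list(range(t, t + forecast_horizon))
        t += step

splits = list(walk_forward_splits(1000, 365, 30, 30))
print(len(splits))  # 21 folds for these settings
```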
Exercise 23.19 (**)
Changepoint detection + forecasting pipeline.
(a) Generate a 600-observation series with two changepoints (mean shifts at $t = 200$ and $t = 400$). Use detect_changepoints_pelt to detect them. Experiment with penalty values in $\{10, 50, 100, 200\}$. Which penalty correctly identifies two changepoints?
(b) Build a forecasting pipeline that: (i) detects the most recent changepoint, (ii) fits a Kalman filter only on data after the last changepoint, and (iii) produces a 30-step forecast.
(c) Compare this changepoint-aware forecaster with a Kalman filter fit to the entire series using walk-forward validation. Does the changepoint-aware approach improve forecast accuracy after the second regime change?
Regime Switching
Exercise 23.20 (**)
Markov-switching autoregression.
(a) Using statsmodels.tsa.regime_switching.MarkovAutoregression, fit a 2-regime switching AR(1) model to the y_regime series from the ACI demonstration in Section 23.11:
import statsmodels.api as sm
mod = sm.tsa.MarkovAutoregression(y_regime, k_regimes=2, order=1)
res = mod.fit()
print(res.summary())
(b) Extract the smoothed regime probabilities. Plot them alongside the true regime labels. How quickly does the model detect the regime change at $t = 300$?
(c) Use the model to produce 1-step-ahead forecasts. At the regime transition, does the forecast adapt faster than the naive ACI approach? Why?
Exercise 23.21 (***)
Model selection: how many regimes?
(a) Generate data from a 3-regime process:

- Regime 0: $\mathcal{N}(0, 1)$, duration ~50 steps
- Regime 1: $\mathcal{N}(5, 2)$, duration ~30 steps
- Regime 2: $\mathcal{N}(-3, 1.5)$, duration ~40 steps
Generate 500 observations with transitions governed by a $3 \times 3$ transition matrix.
(b) Fit Gaussian HMMs with $K \in \{2, 3, 4, 5\}$ states using hmmlearn. For each $K$, report the log-likelihood, AIC, and BIC. Which $K$ does AIC select? Which does BIC select?
(c) Explain why BIC penalizes model complexity more heavily than AIC, and why this matters for HMMs where overfitting to noise can create spurious regimes.
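For part (b), the free-parameter count of a K-state Gaussian HMM (1-D emissions, diagonal covariance) enters the AIC/BIC formulas directly; small helpers under that assumption:

```python
from math import log

def hmm_n_params(K):
    """Transitions K(K-1), initial distribution K-1, plus one mean and one variance per state."""
    return K * (K - 1) + (K - 1) + 2 * K

def aic(loglik, K):
    return 2 * hmm_n_params(K) - 2 * loglik

def bic(loglik, K, n):
    return hmm_n_params(K) * log(n) - 2 * loglik
```

Note that BIC's $\log n$ multiplier exceeds AIC's factor of 2 whenever $n > e^2 \approx 7.4$, which is the formal root of part (c).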
Integration and Application
Exercise 23.22 (**)
Climate forecasting: Bayesian structural time series.
Using the build_bayesian_structural_model function from Section 23.5:
(a) Generate 30 years (360 months) of simulated temperature anomaly data with: a linear warming trend of $0.02^\circ$C/year, annual seasonality (amplitude $0.5^\circ$C), and observation noise $\sigma = 0.3^\circ$C.
(b) Fit a Bayesian STS model with local linear trend and annual seasonality. Produce a 120-month (10-year) probabilistic forecast.
(c) Plot the forecast with 50%, 80%, and 95% prediction intervals. Does the interval width grow linearly, faster, or slower with the forecast horizon? Explain the growth rate in terms of the state-space model's uncertainty propagation.
Exercise 23.23 (**)
StreamRec engagement: model comparison.
This exercise follows the progressive project outline from Section 23.15.
(a) Generate the StreamRec engagement data. Extract the "comedy" category as a univariate daily series.
(b) Implement three forecasters:
1. Prophet (baseline): from prophet import Prophet
2. Kalman filter (local linear trend + Fourier seasonal): using the chapter's implementation with weekly ($K=3$) and annual ($K=5$) Fourier features as regressors
3. Rolling mean with day-of-week adjustment
(c) Evaluate all three using walk-forward validation (min_train=365, forecast_horizon=30, step=30). Report MAE, RMSE, and MASE. Which model performs best? Why?
Exercise 23.24 (***)
End-to-end probabilistic forecasting pipeline.
Build a complete forecasting pipeline for the StreamRec "comedy" category:
(a) Data preprocessing: handle missing values (forward fill, then linear interpolation), add Fourier features (weekly + annual), add binary weekend and holiday indicators.
(b) Model training: fit a DeepAR model (Gaussian output) with lookback $L = 90$ and horizon $H = 30$. Use 80% of data for training and 20% for walk-forward evaluation.
(c) Calibration assessment: on the walk-forward test folds, compute the PIT histogram and empirical coverage at $\tau \in \{0.10, 0.25, 0.50, 0.75, 0.90\}$. Is the model well-calibrated?
(d) Conformal wrapping: apply ACI to the DeepAR point forecasts ($\alpha = 0.10$, $\gamma = 0.01$). Compare the ACI coverage with the DeepAR native intervals. Which is closer to the target 90% coverage?
Exercise 23.25 (***)
N-BEATS vs. TFT: global model comparison.
(a) Generate the full StreamRec engagement data (all 8 categories, 1,095 days). Create training, validation, and test splits (70/15/15 by time).
(b) Train a global N-BEATS model (generic configuration: 2 stacks, 3 blocks each, 256 hidden units) across all categories. The input is a concatenation of the lookback window plus category-level encodings.
(c) Train a TFT model using pytorch-forecasting with category as a static covariate.
(d) Compare on the test set using MAE and MASE per category. Does TFT outperform N-BEATS on categories with strong covariate effects (e.g., weekend patterns)? Does N-BEATS outperform TFT on categories with simpler dynamics?
Exercise 23.26 (***)
Backtesting pitfalls.
A junior data scientist backtests a TFT model by:

1. Training on all available data (3 years)
2. Re-fitting the model on the first 2 years
3. Forecasting months 25-36 and computing MAE
(a) Identify the methodological error. Why does step 1 introduce look-ahead bias?
(b) The same data scientist uses the TFT's validation loss to select hyperparameters and then reports the validation MAE as the model's expected performance. Explain why this estimate is optimistically biased.
(c) Design a correct backtesting protocol with nested walk-forward validation: an outer loop for evaluation and an inner loop for hyperparameter selection. Describe the data splits precisely.
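One way to make part (c) concrete is a split generator; the signature and fold layout below are our choices, given as a sketch:

```python
def nested_walk_forward(n, min_train, horizon, step, inner_folds=3):
    """Outer expanding-window evaluation; inner folds carved from each outer train set.

    Hyperparameters are selected on the inner folds only, so the outer test
    window is never seen during model selection. Requires
    min_train > inner_folds * horizon.
    """
    t = min_train
    while t + horizon <= n:
        outer_train = range(t)
        outer_test = range(t, t + horizon)
        inner = []
        for k in range(inner_folds, 0, -1):
            split = t - k * horizon
            inner.append((range(split), range(split, split + horizon)))
        yield inner, outer_train, outer_test
        t += step
```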
Exercise 23.27 (****)
Derive the steady-state Kalman gain for the local level model.
For the local level model ($T = Z = R = 1$, state noise $Q = \sigma_\eta^2$, observation noise $H = \sigma_\varepsilon^2$):
(a) Write the Riccati equation for the predicted state variance $P_{t+1|t}$:
$$P_{t+1|t} = P_{t|t-1} - \frac{P_{t|t-1}^2}{P_{t|t-1} + \sigma_\varepsilon^2} + \sigma_\eta^2$$
(b) In steady state, $P_{t+1|t} = P_{t|t-1} = P^*$. Solve the resulting quadratic equation for $P^*$.
(c) Show that the steady-state Kalman gain is:
$$K^* = \frac{P^*}{P^* + \sigma_\varepsilon^2} = \frac{-\sigma_\eta^2 + \sqrt{\sigma_\eta^4 + 4\sigma_\eta^2 \sigma_\varepsilon^2}}{2\sigma_\varepsilon^2}$$
Verify numerically for $\sigma_\eta = 0.5$, $\sigma_\varepsilon = 2.0$.
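The numerical check fits in a few lines: solve the quadratic from part (b) in closed form, then confirm that iterating the Riccati recursion from a diffuse start converges to the same value:

```python
from math import sqrt

sigma_eta2, sigma_eps2 = 0.5**2, 2.0**2

# Closed form: P*^2 - sigma_eta^2 P* - sigma_eta^2 sigma_eps^2 = 0, positive root.
P_star = (sigma_eta2 + sqrt(sigma_eta2**2 + 4 * sigma_eta2 * sigma_eps2)) / 2
K_star = P_star / (P_star + sigma_eps2)

# Iterate the Riccati equation from a diffuse start; it should converge to P*.
P = 1e6
for _ in range(200):
    P = P - P**2 / (P + sigma_eps2) + sigma_eta2
print(round(K_star, 4), round(P, 4))  # gain approx. 0.2207
```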
Exercise 23.28 (****)
Connections: Kalman filter, exponential smoothing, and ARIMA.
(a) Show that the local level Kalman filter with known parameters is equivalent to simple exponential smoothing (SES) with smoothing parameter $\alpha = K^*$ (the steady-state Kalman gain). Write out the SES recursion and verify the equivalence.
(b) Show that the local linear trend Kalman filter is equivalent to Holt's linear trend method.
(c) The ARIMA(0,1,1) model is $y_t - y_{t-1} = \varepsilon_t + \theta \varepsilon_{t-1}$. Show that this is the innovation form of the local level state-space model, and express $\theta$ in terms of $K^*$.
(d) Discuss the practical implications: if these models are mathematically equivalent, why would a practitioner prefer the state-space formulation? Give at least three reasons.
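Part (a) can also be checked numerically: run the steady-state-gain Kalman recursion and SES side by side on the same series with the same initialization, and the two sequences coincide exactly (a sketch with illustrative names):

```python
import random

def ses(y, alpha, l0):
    """Simple exponential smoothing: l_t = l_{t-1} + alpha * (y_t - l_{t-1})."""
    levels, l = [], l0
    for obs in y:
        l = l + alpha * (obs - l)
        levels.append(l)
    return levels

def steady_state_kf(y, K, mu0):
    """Local level filter with the gain frozen at its steady-state value."""
    mus, mu = [], mu0
    for obs in y:
        mu = mu + K * (obs - mu)   # predict step is the identity for a random walk
        mus.append(mu)
    return mus

random.seed(7)
y = [10 + random.gauss(0, 1) for _ in range(100)]
K_star = 0.2207                    # steady-state gain for sigma_eta=0.5, sigma_eps=2
assert ses(y, K_star, y[0]) == steady_state_kf(y, K_star, y[0])
```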
Exercise 23.29 (****)
Particle filter for nonlinear state-space models.
The Kalman filter is exact only for linear-Gaussian models. For nonlinear transition or observation functions, the particle filter provides a Monte Carlo approximation.
(a) Implement a bootstrap particle filter for a nonlinear state-space model:
$$\alpha_{t+1} = 0.5 \alpha_t + \frac{25 \alpha_t}{1 + \alpha_t^2} + 8\cos(1.2t) + \eta_t, \quad \eta_t \sim \mathcal{N}(0, 10)$$
$$y_t = \frac{\alpha_t^2}{20} + \varepsilon_t, \quad \varepsilon_t \sim \mathcal{N}(0, 1)$$
Use $N = 1000$ particles with systematic resampling.
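A compact bootstrap particle filter for this model with systematic resampling might look like the following (a sketch; variable names are ours, not the chapter's):

```python
import numpy as np

def systematic_resample(weights, rng):
    """Pick particle indices via evenly spaced pointers on the weight CDF."""
    N = len(weights)
    positions = (rng.random() + np.arange(N)) / N
    cdf = np.cumsum(weights)
    cdf[-1] = 1.0                      # guard against floating-point round-off
    return np.searchsorted(cdf, positions)

def bootstrap_pf(y, N=1000, seed=0):
    rng = np.random.default_rng(seed)
    particles = rng.normal(0, np.sqrt(10), N)   # diffuse-ish initialization
    means = []
    for t, obs in enumerate(y):
        # Propagate through the nonlinear transition.
        particles = (0.5 * particles + 25 * particles / (1 + particles**2)
                     + 8 * np.cos(1.2 * t) + rng.normal(0, np.sqrt(10), N))
        # Weight by the Gaussian observation likelihood (log-space for stability).
        log_w = -0.5 * (obs - particles**2 / 20) ** 2
        w = np.exp(log_w - log_w.max())
        w /= w.sum()
        means.append(np.sum(w * particles))
        particles = particles[systematic_resample(w, rng)]
    return np.array(means)

# Quick smoke run on data simulated from the same model:
rng = np.random.default_rng(1)
alpha, ys = 0.0, []
for t in range(100):
    alpha = (0.5 * alpha + 25 * alpha / (1 + alpha**2)
             + 8 * np.cos(1.2 * t) + rng.normal(0, np.sqrt(10)))
    ys.append(alpha**2 / 20 + rng.normal(0, 1))
est = bootstrap_pf(ys)
```

Because $y_t$ depends on $\alpha_t^2$, the observation is uninformative about the sign of the state, which is exactly where the particle filter's multimodal posterior pays off against the EKF in part (c).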
(b) Simulate 100 observations. Run the particle filter and plot the particle-based filtering mean $\hat{\alpha}_t = \frac{1}{N}\sum_{i=1}^N w_t^{(i)} \alpha_t^{(i)}$ with $\pm 2\sigma$ bands.
(c) Compare with the extended Kalman filter (EKF), which linearizes the transition and observation functions. In what regions of the state space does the EKF fail while the particle filter succeeds? Why?
Exercise 23.30 (****)
Conformal prediction: theoretical coverage guarantees.
Gibbs and Candès (2022) prove that ACI achieves long-run coverage error $|\bar{\alpha}_T - \alpha| \leq O(\gamma + 1/(T\gamma))$.
(a) Verify this bound empirically. For a stationary process with $T = 2000$, run ACI with $\gamma \in \{0.001, 0.003, 0.01, 0.03, 0.1\}$. Plot $|\bar{\alpha}_T - \alpha|$ vs. $\gamma$. Does the U-shaped curve predicted by the bound appear?
(b) The optimal $\gamma^* = T^{-1/2}$. For $T = 2000$, compute $\gamma^* \approx 0.022$. Confirm that this value approximately minimizes the coverage error in your experiment.
(c) For a non-stationary process (e.g., the regime-change series), is $\gamma = T^{-1/2}$ still optimal, or should $\gamma$ be larger? Run the experiment and discuss.