Exercises: Chapter 25
Time Series Analysis and Forecasting
Exercise 1: Decomposition by Hand (Conceptual)
Consider the following monthly revenue data for a small business (in thousands of dollars):
| Month | Jan | Feb | Mar | Apr | May | Jun | Jul | Aug | Sep | Oct | Nov | Dec |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2022 | 100 | 95 | 110 | 120 | 130 | 140 | 145 | 135 | 125 | 115 | 105 | 150 |
| 2023 | 110 | 105 | 120 | 135 | 145 | 155 | 160 | 150 | 140 | 130 | 115 | 165 |
a) Is there a trend in this data? Estimate the annual growth rate.
b) Is there seasonality? Which months appear to be peaks and troughs?
c) Should you use additive or multiplicative decomposition? Justify your answer by examining whether the seasonal fluctuations are constant in absolute terms or proportional to the level.
d) The December values (150 and 165) are unusually high compared to the surrounding months. Is this part of the seasonal pattern, or would you consider it a holiday effect? How does this distinction matter for forecasting?
e) If you were using this data to forecast January 2024 revenue, what would your naive forecast be? Your seasonal naive forecast? Which would you trust more, and why?
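The arithmetic for parts (a) and (e) can be checked numerically. A minimal sketch using the table values (variable names are illustrative):

```python
# Monthly revenue from the table, in thousands of dollars
rev_2022 = [100, 95, 110, 120, 130, 140, 145, 135, 125, 115, 105, 150]
rev_2023 = [110, 105, 120, 135, 145, 155, 160, 150, 140, 130, 115, 165]

total_2022 = sum(rev_2022)            # annual revenue, 2022
total_2023 = sum(rev_2023)            # annual revenue, 2023
growth = total_2023 / total_2022 - 1  # approximate annual growth rate

naive_jan_2024 = rev_2023[-1]          # naive: last observed value (Dec 2023)
seasonal_naive_jan_2024 = rev_2023[0]  # seasonal naive: Jan 2023 value
```

Comparing `growth`, `naive_jan_2024`, and `seasonal_naive_jan_2024` against your hand estimates is a quick sanity check before answering part (e).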
Exercise 2: Stationarity Diagnosis (Code)
Run the following code to generate three time series with different stationarity properties.
```python
import numpy as np
import pandas as pd
from statsmodels.tsa.stattools import adfuller
from statsmodels.graphics.tsaplots import plot_acf

np.random.seed(42)
n = 500

# Series A: White noise
series_a = pd.Series(np.random.normal(0, 1, n))

# Series B: Random walk
series_b = pd.Series(np.cumsum(np.random.normal(0, 1, n)))

# Series C: Stationary AR(1) with phi = 0.85
series_c = pd.Series(np.zeros(n))
for t in range(1, n):
    series_c.iloc[t] = 0.85 * series_c.iloc[t-1] + np.random.normal(0, 1)
```
a) Before running the ADF test, predict which series are stationary and which are not. Explain your reasoning.
b) Run the ADF test on all three series. Were your predictions correct?
c) Apply first differencing to Series B and run the ADF test again. Is the differenced series stationary? Explain intuitively why differencing a random walk produces white noise.
d) Plot the ACF for Series C. How many lags show significant autocorrelation? Does this match what you would expect from an AR(1) process with phi = 0.85?
e) A colleague argues that Series C is non-stationary because "it has autocorrelation." Explain why this argument is wrong. What is the relationship between autocorrelation and stationarity?
Exercise 3: ARIMA Parameter Selection (Code)
Use the following dataset to practice ARIMA parameter selection from ACF/PACF plots.
```python
import numpy as np
import pandas as pd
from statsmodels.tsa.arima_process import ArmaProcess
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
import matplotlib.pyplot as plt

np.random.seed(42)

# Generate an ARMA(2,1) process
ar_params = np.array([1, -0.75, 0.25])  # AR lag polynomial: 1 - 0.75*L + 0.25*L^2
ma_params = np.array([1, 0.65])         # MA lag polynomial: 1 + 0.65*L
arma_process = ArmaProcess(ar_params, ma_params)
series = pd.Series(arma_process.generate_sample(nsample=400))
```
a) Plot the ACF and PACF of this series (use 30 lags). Based on the patterns, what ARIMA order (p, d, q) would you select? Document your reasoning --- which rule from the ACF/PACF cheat sheet are you applying?
b) Fit three candidate models: ARIMA(2,0,1), ARIMA(1,0,1), and ARIMA(2,0,2). Compare their AIC values. Which model does AIC prefer?
c) Run auto_arima (with seasonal=False) on the same data. Does it select the same model as AIC in part (b)?
d) For the best model, examine the residuals. Plot the ACF of the residuals and run the Ljung-Box test. Do the residuals look like white noise? What would it mean if they did not?
Exercise 4: Walk-Forward Validation (Code)
Implement walk-forward validation from scratch (without using TimeSeriesSplit) for the following problem.
```python
import numpy as np
import pandas as pd

np.random.seed(42)
n_months = 60
dates = pd.date_range(start='2020-01-01', periods=n_months, freq='MS')

# Monthly sales with trend and seasonality
trend = np.linspace(500, 800, n_months)
seasonal = 50 * np.sin(2 * np.pi * np.arange(n_months) / 12)
noise = np.random.normal(0, 20, n_months)
sales = pd.Series(trend + seasonal + noise, index=dates, name='monthly_sales')
```
a) Implement an expanding-window walk-forward validation loop with:
- Minimum training size: 24 months
- Forecast horizon: 3 months
- Step size: 3 months (non-overlapping test windows)

How many evaluation windows does this produce?
b) For each window, fit a SARIMA(1,1,1)(1,1,1,12) model and compute MAE and MAPE. Report results for each window and the overall average.
c) Now implement a fixed-window (sliding) version where the training window is always exactly 24 months. Compare the results to the expanding-window version. Which performs better? Why might a fixed window outperform an expanding window for some time series?
d) A colleague suggests using 3-fold KFold cross-validation instead of walk-forward validation "because it uses more of the data for evaluation." Generate both the KFold and walk-forward MAPE estimates. Which is lower? Which is more honest? Explain the discrepancy.
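The window bookkeeping for part (a) reduces to index arithmetic. A minimal sketch (the function name and the end-exclusive index convention are illustrative choices, not from the chapter):

```python
def walk_forward_splits(n_obs, min_train=24, horizon=3, step=3):
    """Return (train_end, test_start, test_end) triples for expanding-window
    walk-forward validation; all indices are end-exclusive, so the training
    set is [0, train_end) and the test set is [test_start, test_end)."""
    splits = []
    train_end = min_train
    while train_end + horizon <= n_obs:
        splits.append((train_end, train_end, train_end + horizon))
        train_end += step
    return splits

splits = walk_forward_splits(60)  # 60 monthly observations, as in the setup
len(splits)  # -> 12 non-overlapping 3-month test windows
```

For the fixed-window variant in part (c), the only change is that the training set becomes `[train_end - 24, train_end)` instead of `[0, train_end)`.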
Exercise 5: Prophet Tuning (Code)
Use the monthly sales data from Exercise 4 to explore Prophet's sensitivity to its key hyperparameters.
a) Fit Prophet with changepoint_prior_scale values of [0.001, 0.01, 0.05, 0.1, 0.5, 1.0]. For each, forecast the last 12 months (train on the first 48) and compute MAPE. Plot the MAPE vs. changepoint_prior_scale. What do you observe?
b) For the best changepoint_prior_scale, experiment with seasonality_prior_scale values of [0.1, 1.0, 5.0, 10.0, 50.0]. Does this parameter have a large effect?
c) Add a custom "back-to-school" seasonality that peaks in August-September. Use Prophet's add_seasonality method. Does this improve the forecast? Under what circumstances would adding custom seasonalities help vs. hurt?
d) Prophet's predict method, applied to a dataframe built with make_future_dataframe, produces point forecasts and uncertainty intervals. Plot the forecasts with uncertainty intervals for the 12-month test period. Do the actual values fall within the 80% and 95% intervals at the expected rates (i.e., approximately 80% and 95% of the time)? Note that Prophet's default interval_width is 0.8; obtaining 95% intervals requires refitting with interval_width=0.95.
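The coverage check in part (d) is library-agnostic: given actuals and interval bounds (Prophet stores them in the `yhat_lower` and `yhat_upper` columns of the forecast dataframe), count the fraction of actuals falling inside. A minimal sketch with toy numbers:

```python
import numpy as np

def interval_coverage(actual, lower, upper):
    """Fraction of actual values falling inside [lower, upper]."""
    actual, lower, upper = map(np.asarray, (actual, lower, upper))
    return float(np.mean((actual >= lower) & (actual <= upper)))

# Toy check: 3 of the 4 actuals fall inside their intervals
actual = [10.0, 12.0, 9.0, 15.0]
lower  = [ 9.0, 11.5, 9.5, 13.0]
upper  = [11.0, 13.0, 10.5, 16.0]
interval_coverage(actual, lower, upper)  # -> 0.75
```

With only 12 test months, observed coverage will be noisy (each month moves the rate by about 8 percentage points), so do not over-interpret small deviations from 80% or 95%.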
Exercise 6: TurbineTech Anomaly Detection via Forecasting (Code)
Build a complete anomaly detection pipeline for TurbineTech sensor data using time series forecasting.
```python
import numpy as np
import pandas as pd

np.random.seed(42)
n_days = 365
dates = pd.date_range(start='2024-01-01', periods=n_days, freq='D')

# Normal operating vibration with seasonal pattern
baseline = 2.5
seasonal = 0.4 * np.sin(2 * np.pi * np.arange(n_days) / 365.25)
noise = np.random.normal(0, 0.15, n_days)
vibration = baseline + seasonal + noise

# Three anomaly types injected at known locations
# Type 1: Gradual degradation starting day 200
vibration[200:] += np.linspace(0, 1.5, 165)
# Type 2: Sudden spike on day 150
vibration[150] += 2.0
# Type 3: Increased variance starting day 280
vibration[280:] += np.random.normal(0, 0.5, 85)

sensor_data = pd.Series(vibration, index=dates, name='vibration_mm_s')
```
a) Train an ARIMA model on the first 120 days (known-good data). Generate rolling 7-day-ahead forecasts for the remaining 245 days. At each step, retrain on all data up to the forecast origin.
b) Compute the forecast residuals (actual - predicted) for each 7-day window. Define an anomaly as a residual exceeding 2 standard deviations of the training-period residuals. How many anomaly days does the detector find?
c) Evaluate the detector: what percentage of the three injected anomalies does it catch? Does it detect the gradual degradation, the sudden spike, or the increased variance? Which is hardest to detect, and why?
d) A maintenance engineer says: "I don't care about statistical significance. Tell me when vibration exceeds 3.5 mm/s." Compare the forecast-based anomaly detector to this simple threshold approach. Which catches the degradation earlier? Which produces more false alarms?
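The flagging rule in part (b) can be sketched independently of the forecasting model. This version uses the absolute residual (one reasonable reading of "exceeding 2 standard deviations"; a one-sided rule is also defensible for vibration, where only increases matter):

```python
import numpy as np

def flag_anomalies(actual, predicted, train_residuals, n_sigma=2.0):
    """Flag points where |actual - predicted| exceeds n_sigma times the
    standard deviation of the training-period residuals."""
    threshold = n_sigma * np.std(train_residuals)
    residuals = np.asarray(actual) - np.asarray(predicted)
    return np.abs(residuals) > threshold

# Toy check: training residuals with sd 0.1 give a threshold of 0.2,
# so a residual of 0.0 passes and a residual of 0.5 is flagged
train_res = np.array([0.1, -0.1, 0.1, -0.1])
flag_anomalies([2.5, 3.0], [2.5, 2.5], train_res)  # -> [False, True]
```

In the full pipeline, `predicted` comes from the rolling 7-day-ahead forecasts of part (a) and `train_residuals` from the in-sample fit on the first 120 days.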
Exercise 7: Forecast Comparison Dashboard (Code)
Build a comprehensive comparison of forecasting methods on the StreamFlow churn data.
```python
import numpy as np
import pandas as pd

np.random.seed(42)
n_months = 48
dates = pd.date_range(start='2021-01-01', periods=n_months, freq='MS')

base_churn = 0.082
trend = np.linspace(0, -0.012, n_months)
seasonal = 0.008 * np.sin(2 * np.pi * np.arange(n_months) / 12)
noise = np.random.normal(0, 0.003, n_months)
churn = pd.Series(
    (base_churn + trend + seasonal + noise).round(4),
    index=dates, name='churn_rate'
)
```
a) Implement four forecasting methods:
1. Naive: Forecast = last observed value
2. Seasonal Naive: Forecast = value from 12 months ago
3. SARIMA: Use auto_arima with m=12
4. Prophet: Default settings with yearly seasonality
Use walk-forward validation with 5 expanding-window splits. Report MAE and MAPE for each method across all folds.
b) Create a summary table showing each method's average MAPE, best-fold MAPE, and worst-fold MAPE. Which method would you recommend, and why?
c) Compute the Diebold-Mariano test statistic (or a simpler paired t-test on the forecast errors) to determine whether the best model's errors are statistically significantly different from the second-best model's errors. Is the difference meaningful for business purposes?
d) StreamFlow's VP of Finance needs the churn rate forecast for the next 6 months to plan the retention budget. Using the full 48-month history, generate the forecast with the best method from part (b). Report the point forecast and 95% prediction interval for each of the 6 months. How much budget uncertainty does the prediction interval imply, given that StreamFlow has 2.4M subscribers at $15/month ARPU?
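The budget-translation step in part (d) is simple arithmetic: a churn rate applied to the subscriber base and ARPU gives monthly revenue at risk. A sketch with a hypothetical interval of ±0.5 percentage points (the actual endpoints must come from your fitted model's prediction interval):

```python
subscribers = 2_400_000
arpu = 15.0  # dollars per subscriber per month

def monthly_churn_revenue(churn_rate):
    """Revenue lost in one month to churned subscribers."""
    return subscribers * churn_rate * arpu

# Hypothetical 95% interval of [6.5%, 7.5%] churn for one month
low = monthly_churn_revenue(0.065)
high = monthly_churn_revenue(0.075)
high - low  # -> 360000.0 dollars of monthly budget uncertainty
```

Each percentage point of churn uncertainty is worth $360K per month here, which is why the width of the prediction interval, not just the point forecast, drives the retention budget.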
Exercise 8: Temporal Feature Engineering for ML (Code)
Convert the TurbineTech time series forecasting problem into a supervised learning problem suitable for gradient boosting.
a) Using the sensor data from Exercise 6, create the following features:
- Lag features: lags 1, 3, 7, 14, 28
- Rolling statistics: 7-day and 14-day rolling mean, standard deviation, min, and max
- Rate of change: difference from lag 1, difference from lag 7
- Calendar features: day of week, month, quarter

How many total features do you create? How many rows are lost to the lag/rolling calculations?
b) Split the data temporally: train on the first 250 days (after feature creation), test on the remaining days. Train an XGBoost regressor and compare its MAE to the ARIMA model from Exercise 6.
c) Which features are most important (use model.feature_importances_)? Do the lag features or the rolling features contribute more? Is this what you expected?
d) Critical question: In part (b), you used all lag features during training, including lag_1 (yesterday's value). In production, if you want to forecast 7 days ahead, you do not have yesterday's value for day 7. How does this change your feature set? Re-run the model using only features that would be available at a 7-day forecast horizon. How much does the MAE increase?
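The leakage issue in part (d) can be handled at feature-construction time by parameterizing on the forecast horizon. A reduced sketch (only lags and two rolling statistics, on a toy series; the full exercise adds min, max, rate-of-change, and calendar features):

```python
import numpy as np
import pandas as pd

def make_features(series, horizon=1):
    """Build lag and rolling features for supervised forecasting, keeping
    only lags >= horizon so every feature is known at forecast time."""
    df = pd.DataFrame({'y': series})
    for lag in [1, 3, 7, 14, 28]:
        if lag >= horizon:
            df[f'lag_{lag}'] = series.shift(lag)
    # Rolling windows are computed over values already known at forecast time
    known = series.shift(horizon)
    for window in [7, 14]:
        df[f'roll_mean_{window}'] = known.rolling(window).mean()
        df[f'roll_std_{window}'] = known.rolling(window).std()
    return df.dropna()

s = pd.Series(np.arange(40, dtype=float))
make_features(s, horizon=1)  # all five lags available
make_features(s, horizon=7)  # lag_1 and lag_3 are dropped, rolling shifts by 7
```

Note that shifting the rolling window by `horizon` matters as much as dropping short lags: a 7-day mean ending yesterday is just as unavailable at a 7-day horizon as yesterday's raw value.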
These exercises support Chapter 25: Time Series Analysis and Forecasting. Return to the chapter for reference.