Quiz: Chapter 25
Time Series Analysis and Forecasting
Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.
Question 1 (Multiple Choice)
A data scientist builds a time series forecasting model and evaluates it using 5-fold cross-validation with random shuffling. The MAPE on the test folds is 2.3%. When the model is deployed and evaluated on genuinely future data, the MAPE is 8.7%. The most likely cause is:
- A) The model overfit to the training data
- B) The random cross-validation leaked future information into the training folds
- C) The model was not trained on enough data
- D) The MAPE metric is unreliable for time series
Answer: B) The random cross-validation leaked future information into the training folds. In time series, observations are temporally ordered and correlated. Random shuffling allows future values (which carry information about trends, seasonality, and regime changes) to appear in the training set while their temporal neighbors appear in the test set. The model learns patterns it should not have access to at prediction time. Walk-forward validation is the correct approach for time series evaluation.
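As a sketch of the correct setup, scikit-learn's `TimeSeriesSplit` produces folds that respect temporal order (the 100-point series below is illustrative):

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# A hypothetical series of 100 daily observations.
y = np.arange(100)

# TimeSeriesSplit guarantees every training fold precedes its test fold,
# unlike KFold(shuffle=True), which mixes past and future observations.
tscv = TimeSeriesSplit(n_splits=5)
for train_idx, test_idx in tscv.split(y):
    # Training indices always end before test indices begin.
    assert train_idx.max() < test_idx.min()
    print(f"train: 0..{train_idx.max()}  test: {test_idx.min()}..{test_idx.max()}")
```

Note that each successive fold trains on a longer prefix of the series, which is exactly the expanding-window behavior discussed in Question 8.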
Question 2 (Multiple Choice)
The Augmented Dickey-Fuller (ADF) test returns a p-value of 0.42. This means:
- A) The series is stationary
- B) The series is non-stationary
- C) You fail to reject the null hypothesis of a unit root --- the series is likely non-stationary
- D) You reject the null hypothesis --- the series is likely stationary
Answer: C) You fail to reject the null hypothesis of a unit root --- the series is likely non-stationary. The ADF test's null hypothesis is that the series has a unit root (is non-stationary). A high p-value (0.42 >> 0.05) means there is insufficient evidence to reject this null. The appropriate next step is to apply differencing and re-test.
Question 3 (Short Answer)
Explain the difference between additive and multiplicative decomposition. Give a real-world example where multiplicative decomposition would be more appropriate than additive.
Answer: Additive decomposition models the series as Y(t) = Trend + Seasonal + Residual, assuming the seasonal fluctuations have a constant absolute magnitude regardless of the series level. Multiplicative decomposition models Y(t) = Trend * Seasonal * Residual, assuming seasonal fluctuations are proportional to the level. Retail sales are a common multiplicative example: if a store doing $1M/year sells $100K in a typical month and $120K in December (a 20% seasonal lift), then at the $2M/year level a multiplicative model predicts roughly $240K for December (20% above the new $200K monthly baseline), whereas an additive model would predict only $220K (the same $20K absolute bump).
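A useful bridge between the two: taking logs of a multiplicative series yields an additive one, since log(T * S * R) = log(T) + log(S) + log(R). A small numpy sketch with a synthetic series (growth rate and seasonal amplitude are illustrative):

```python
import numpy as np

t = np.arange(120)                                 # 10 years of monthly data
trend = 100 * 1.02 ** t                            # exponential growth
seasonal = 1 + 0.2 * np.sin(2 * np.pi * t / 12)    # +/-20% seasonal swing
y = trend * seasonal                               # multiplicative structure

# Raw series: the seasonal amplitude grows with the level.
first_year_range = y[:12].max() - y[:12].min()
last_year_range = y[-12:].max() - y[-12:].min()
print(first_year_range < last_year_range)          # True

# After a log transform the structure is additive with constant amplitude,
# so additive tools can be applied to log(y) and the result exponentiated.
ly = np.log(y)
r1 = ly[:12].max() - ly[:12].min()
r2 = ly[-12:].max() - ly[-12:].min()
print(np.isclose(r1, r2))                          # True
```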
Question 4 (Multiple Choice)
In an ARIMA(2, 1, 1) model, the "2" means:
- A) The series is differenced twice
- B) Two lagged values of the differenced series are used as predictors
- C) Two lagged forecast errors are included
- D) The model has two seasonal periods
Answer: B) Two lagged values of the differenced series are used as predictors. The first parameter (p=2) is the AR order --- the number of autoregressive terms. After differencing once (d=1), the model predicts the current differenced value using the two most recent differenced values (AR component) plus one lagged forecast error (MA component, q=1).
Question 5 (Multiple Choice)
A time series shows a slowly decaying ACF and a PACF that cuts off sharply after lag 3. This pattern suggests:
- A) An MA(3) process
- B) An AR(3) process
- C) An ARMA(3,3) process
- D) A non-stationary series requiring differencing
Answer: B) An AR(3) process. The diagnostic rule is: slowly decaying ACF + sharp PACF cutoff = AR process, with the AR order equal to the lag where PACF cuts off. The slowly decaying ACF reflects the indirect correlations propagated through the autoregressive structure. The PACF, which removes intermediate effects, isolates the direct dependence at each lag, cutting off at p=3.
Question 6 (Short Answer)
Why does auto_arima use AIC (Akaike Information Criterion) rather than the training-set error to select the best (p, d, q) parameters? What would happen if it selected the model with the lowest training MSE?
Answer: AIC penalizes model complexity: AIC = 2k - 2*ln(L), where k is the number of parameters and L is the likelihood. A model with more parameters will always fit the training data better, so selecting by training MSE would always choose the most complex model (highest p and q), leading to overfitting. AIC balances goodness of fit against the number of parameters, favoring simpler models that explain the data well without memorizing noise. This is analogous to the bias-variance tradeoff in ML.
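The formula is simple enough to sketch directly; the log-likelihoods and orders below are made-up numbers for illustration only:

```python
def aic(log_likelihood, k):
    """Akaike Information Criterion: lower is better."""
    return 2 * k - 2 * log_likelihood

# Hypothetical fits: the bigger model fits slightly better (higher likelihood)
# but pays a penalty of 2 per extra parameter.
simple = aic(log_likelihood=-520.0, k=3)    # e.g. an ARIMA(1,1,1)-sized model
complex_ = aic(log_likelihood=-519.2, k=7)  # e.g. an ARIMA(3,1,3)-sized model

print(simple, complex_)  # 1046.0 1052.4
# The simpler model wins: a 0.8 gain in log-likelihood does not justify
# four extra parameters.
```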
Question 7 (Multiple Choice)
A Prophet model is fit with changepoint_prior_scale=0.001. The trend line is nearly straight. When the value is increased to changepoint_prior_scale=1.0, the trend follows every bump in the data. The appropriate interpretation is:
- A) Higher values are always better because they capture more patterns
- B) Lower values are always better because they prevent overfitting
- C) The parameter controls the tradeoff between trend smoothness and flexibility, and the best value depends on whether the underlying trend truly has sharp changes
- D) This parameter only affects holidays, not the trend
Answer: C) The parameter controls the tradeoff between trend smoothness and flexibility, and the best value depends on whether the underlying trend truly has sharp changes. changepoint_prior_scale is a regularization parameter for the trend. Very low values force a smooth (nearly linear) trend, which underfits if the true trend has real changes. Very high values allow the trend to follow noise, which overfits. The right value depends on whether the series has genuine structural breaks (price changes, market shifts) or is inherently smooth.
Question 8 (Multiple Choice)
In walk-forward (expanding window) validation with 5 folds, the first fold has a training set of 20 observations and the fifth fold has a training set of 100 observations. The MAPE across folds is: [6.8%, 3.4%, 2.2%, 1.8%, 1.6%]. Which interpretation is correct?
- A) The model is overfitting on later folds
- B) The model is underfitting on earlier folds because the training set is too small
- C) Both B and the fact that more data generally improves parameter estimates for time series models
- D) The results indicate a bug --- MAPE should be constant across folds
Answer: C) Both B and the fact that more data generally improves parameter estimates for time series models. Time series models (especially ARIMA and Prophet) need sufficient history to estimate trend, seasonality, and autocorrelation parameters reliably. With only 20 observations, a model with 12-month seasonality has less than 2 full cycles to learn from. Performance improving with more training data is expected and healthy --- it is not a sign of overfitting. Reporting the range (1.6% to 6.8%) alongside the mean gives stakeholders a realistic picture of forecast reliability.
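The expanding-window mechanics can be sketched with a naive last-value forecaster; the series, seed, and fold sizes are illustrative, chosen to mirror the 20-to-100 training-set growth above:

```python
import numpy as np

def mape(actual, forecast):
    return np.mean(np.abs((actual - forecast) / actual)) * 100

rng = np.random.default_rng(7)
y = 100 + np.cumsum(rng.normal(0.5, 1.0, size=120))  # gently trending series

# Expanding window: each fold trains on all history so far and tests on the
# next block, so later folds see progressively more training data.
fold_size = 20
scores = []
for end in range(20, 120, fold_size):
    train, test = y[:end], y[end:end + fold_size]
    forecast = np.full(len(test), train[-1])  # naive last-value forecast
    scores.append(mape(test, forecast))

# Five folds with training sizes 20, 40, 60, 80, 100, as in the question.
print([round(s, 1) for s in scores])
```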
Question 9 (Short Answer)
A TurbineTech engineer asks: "If the ARIMA model is trained on normal operating data and then the actual vibration exceeds the 95% forecast confidence interval, does that mean the bearing is definitely failing?" How would you respond?
Answer: No. Exceeding the 95% confidence interval means the observed value is statistically unusual given the model's assumptions, but it does not diagnose the cause. It could indicate bearing degradation, a sensor malfunction, an unusual but benign operating condition (e.g., extreme wind), or simply the 5% of observations that fall outside any 95% interval by definition. The forecast-based alarm should trigger an investigation, not an automatic diagnosis. Domain expertise from the maintenance engineering team is needed to interpret the alert and determine the appropriate response.
Question 10 (Multiple Choice)
MAPE (Mean Absolute Percentage Error) is problematic when:
- A) The time series has a strong trend
- B) The time series has seasonal patterns
- C) The time series contains values near or equal to zero
- D) The forecast horizon is longer than 12 months
Answer: C) The time series contains values near or equal to zero. MAPE divides each absolute error by the actual value: |actual - predicted| / |actual|. When actual values are near zero, this division produces extremely large or infinite values, making the aggregate MAPE meaningless. For series that cross zero (e.g., profit/loss, temperature in Celsius) or have values near zero (e.g., churn rate approaching 0), use MAE or RMSE instead, or use symmetric MAPE (sMAPE), which divides by the average of the absolute actual and predicted values.
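A small numpy sketch of the failure mode (the values are illustrative):

```python
import numpy as np

def mape(a, f):
    return np.mean(np.abs((a - f) / a)) * 100

def smape(a, f):
    return np.mean(np.abs(a - f) / ((np.abs(a) + np.abs(f)) / 2)) * 100

actual = np.array([100.0, 50.0, 0.01])   # one near-zero actual value
forecast = np.array([101.0, 49.0, 1.0])

# The near-zero actual dominates: its term alone is 0.99/0.01 = 99 (9900%),
# swamping the two perfectly reasonable forecasts.
print(round(mape(actual, forecast), 1))   # 3301.0
# sMAPE divides by the average magnitude, so each term is bounded by 200%.
print(round(smape(actual, forecast), 1))
```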
Question 11 (Multiple Choice)
A data scientist creates lag features for a gradient boosting model to forecast daily sales. The features include lag_1 (yesterday's sales), lag_7 (last week's sales), and a 7-day rolling mean. The model achieves excellent accuracy in backtesting. In production, the model is asked to forecast sales 7 days into the future. What is the problem?
- A) Gradient boosting cannot be used for time series
- B) The lag_1 and rolling mean features are not available at the 7-day forecast horizon
- C) The model needs more lag features
- D) The 7-day rolling mean creates data leakage
Answer: B) The lag_1 and rolling mean features are not available at the 7-day forecast horizon. When forecasting 7 days ahead, you do not know tomorrow's sales (lag_1) or the sales for any of the next 6 days needed for the rolling mean. The model was trained and backtested under conditions where these features were available (because the "future" data already existed in the test set). In production, only features available at the time of prediction can be used. For a 7-day horizon, the minimum usable lag is lag_7.
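A pandas sketch of the distinction, using a toy daily series (column names and dates are hypothetical):

```python
import numpy as np
import pandas as pd

sales = pd.DataFrame(
    {"sales": np.arange(30, dtype=float)},
    index=pd.date_range("2024-01-01", periods=30, freq="D"),
)

horizon = 7  # forecasting 7 days ahead

# WRONG for a 7-day horizon: lag_1 uses the value from 1 day before the
# target, which does not exist yet at prediction time.
sales["lag_1"] = sales["sales"].shift(1)

# SAFE: every feature is shifted by at least the forecast horizon, so it
# only uses data available on the day the prediction is made.
sales["lag_7"] = sales["sales"].shift(horizon)
sales["roll_mean_7"] = sales["sales"].shift(horizon).rolling(7).mean()

print(sales.tail(3))
```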
Question 12 (Short Answer)
Explain why SARIMA and Prophet are both limited for complex multivariate time series (e.g., predicting one sensor reading based on 50 other sensors). What class of models addresses this limitation?
Answer: SARIMA is fundamentally a univariate model --- it uses only the history of the target variable itself (and optionally a few exogenous variables via SARIMAX). Prophet similarly models one target with optional regressors but cannot capture complex cross-variable dynamics. When dozens of input series interact non-linearly over multiple time scales, these models lack the capacity to learn the interactions. Deep learning models --- specifically LSTMs, Temporal Convolutional Networks, and Transformer-based architectures --- can process multivariate input sequences and learn non-linear temporal dependencies across many variables simultaneously.
Question 13 (Multiple Choice)
Which of the following is the correct temporal train/test split for a forecasting problem?
- A) Randomly assign 80% of observations to training and 20% to testing
- B) Use the first 80% of observations for training and the last 20% for testing
- C) Use even-numbered months for training and odd-numbered months for testing
- D) Use alternating weeks for training and testing
Answer: B) Use the first 80% of observations for training and the last 20% for testing. The training set must precede the test set in time. Options A, C, and D all violate temporal ordering by mixing earlier and later data in both sets. This would allow the model to use future information during training, producing optimistically biased evaluation metrics that do not reflect real forecasting performance.
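A minimal numpy sketch of option B:

```python
import numpy as np

y = np.arange(100)  # 100 chronologically ordered observations

split = int(len(y) * 0.8)
train, test = y[:split], y[split:]  # no shuffling: past trains, future tests

assert train.max() < test.min()     # every training point precedes the test set
print(len(train), len(test))        # 80 20
```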
Question 14 (Short Answer)
StreamFlow's VP of Finance asks: "The Prophet forecast says next month's churn rate will be 7.2%. How confident are you?" Explain how you would communicate forecast uncertainty to a non-technical stakeholder without using statistical jargon.
Answer: I would say: "Our best estimate is 7.2%, but forecasts always have a range. Based on the model's past accuracy, next month's churn rate will most likely fall between 6.8% and 7.6%. In practical terms, that means we should budget for retention costs assuming churn could be as high as 7.6% (the cautious scenario) while planning revenue assuming it could be as low as 6.8% (the optimistic scenario). The further out we forecast, the wider this range gets --- our 6-month forecast has a much wider band than the 1-month forecast."
Question 15 (Multiple Choice)
First differencing a time series removes:
- A) Seasonal patterns
- B) Linear trend
- C) Autocorrelation
- D) Random noise
Answer: B) Linear trend. First differencing computes Y'(t) = Y(t) - Y(t-1), which transforms a series with a linear trend into a series that fluctuates around a constant mean. It does not remove seasonal patterns (seasonal differencing, Y(t) - Y(t-m), is needed for that), does not eliminate autocorrelation (the differenced series typically still has autocorrelation), and cannot remove noise (noise is, by definition, the unpredictable component).
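A quick numpy check (the slope and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.arange(500)
y = 2.0 * t + rng.normal(0, 1, size=500)  # linear trend (slope 2) + noise

dy = np.diff(y)  # first difference: Y(t) - Y(t-1)

# The trend is gone: the differenced series fluctuates around the slope,
# a constant mean, rather than drifting upward.
print(np.isclose(dy.mean(), 2.0, atol=0.1))                      # True
print(abs(dy[:250].mean() - dy[250:].mean()) < 0.1)              # True
```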
This quiz supports Chapter 25: Time Series Analysis and Forecasting. Return to the chapter to review concepts.