In This Chapter
- Time Series Has Its Own Rules
- What Is a Time Series?
- Decomposition: Trend, Seasonality, and Residuals
- Stationarity: Why It Matters
- Autocorrelation: ACF and PACF
- ARIMA: The Classical Workhorse
- Prophet: Forecasting for Practitioners
- Temporal Cross-Validation: Walk-Forward
- Forecast Metrics: MAE, RMSE, and MAPE
- TurbineTech: Sensor Time Series in Practice
- StreamFlow: Forecasting Monthly Churn Rate
- ML Approaches to Time Series (Preview)
- Putting It All Together: A Forecasting Checklist
- Chapter Summary
Chapter 25: Time Series Analysis and Forecasting
ARIMA, Prophet, and Temporal Cross-Validation
Learning Objectives
By the end of this chapter, you will be able to:
- Decompose time series into trend, seasonality, and residual components
- Test for stationarity (ADF test) and apply differencing
- Fit ARIMA models and interpret (p, d, q) parameters
- Use Prophet for automatic trend and seasonality detection
- Evaluate forecasts with proper temporal cross-validation (walk-forward)
Time Series Has Its Own Rules
War Story --- A junior data scientist on TurbineTech's predictive maintenance team built a vibration forecasting model to predict bearing degradation. She split the sensor data randomly into 80/20 train/test, trained an XGBoost regressor, and got an MAE of 0.3 mm/s on the test set. Her manager was impressed. Then someone looked at the predictions more carefully: the model was using future vibration readings to predict past ones. On a proper temporal split --- training on months 1-8, testing on months 9-12 --- the MAE jumped to 2.1 mm/s. The model had never actually learned to forecast. It had learned to interpolate.
This is the single most common mistake in time series work: treating temporal data as if the order does not matter. It does. Time series has its own rules, and the most important one is this: information flows forward in time, never backward. Every evaluation strategy, every feature engineering decision, every cross-validation fold must respect this.
If Parts I through IV of this book trained you to think in rows and columns --- observations that are exchangeable, randomly shuffled, independently drawn --- this chapter asks you to unlearn part of that. In time series data, the order of observations is the data. Shuffling destroys it. Random splits leak the future into the past. And the standard cross-validation you have been using since Chapter 2 will lie to you.
This chapter covers the classical toolkit for time series: decomposition, stationarity testing, ARIMA for structured univariate forecasting, and Prophet for automated trend-and-seasonality detection. We will apply these to TurbineTech sensor streams and StreamFlow subscriber forecasting. And we will be honest throughout: for complex multivariate time series, ARIMA and Prophet are not enough. Deep learning approaches (LSTM, Transformer-based models) exist for that world, and Chapter 36 previews them. But the concepts in this chapter --- stationarity, autocorrelation, temporal cross-validation --- are prerequisites for everything that comes after.
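Before the definitions, it helps to see the random-split failure mode in code. This is a minimal sketch, not TurbineTech's actual pipeline: synthetic random-walk data and a deliberately naive nearest-neighbor model on the time index. Interpolating between shuffled neighbors looks easy; extrapolating into the future does not.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(42)
y = np.cumsum(rng.normal(0, 1, 1000))   # random walk: strongly autocorrelated
X = np.arange(1000).reshape(-1, 1)      # time index as the only feature

# Random 80/20 split (wrong for time series): every test point
# is surrounded by training points, so the model just interpolates
idx = rng.permutation(1000)
tr, te = idx[:800], idx[800:]
knn = KNeighborsRegressor(n_neighbors=3).fit(X[tr], y[tr])
mae_random = mean_absolute_error(y[te], knn.predict(X[te]))

# Temporal split (correct): train on the first 800 days,
# then extrapolate into the unseen final 200 days
knn2 = KNeighborsRegressor(n_neighbors=3).fit(X[:800], y[:800])
mae_temporal = mean_absolute_error(y[800:], knn2.predict(X[800:]))

print(f"Random-split MAE:   {mae_random:.2f}")    # flattering
print(f"Temporal-split MAE: {mae_temporal:.2f}")  # honest, and much worse
```

This is the same gap the war story describes: the random split rewards memorizing the neighborhood of each test point rather than forecasting.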
What Is a Time Series?
A time series is a sequence of data points indexed by time. That is the textbook definition. The practical definition is more useful: a time series is any dataset where the row order carries information.
Stock prices. Monthly revenue. Hourly temperature. Daily active users. Sensor vibration readings every 100 milliseconds. Weekly churn rates. These are all time series, and they all share three properties that distinguish them from the tabular data we have worked with so far:
- Temporal ordering matters. The value at time t is related to the values at t-1, t-2, and so on. This is autocorrelation, and it is the defining feature of time series data.
- Observations are not independent. The IID (independent and identically distributed) assumption that underlies most of classical statistics and many ML algorithms is violated. Today's stock price is not independent of yesterday's.
- The future is unknown at prediction time. This sounds obvious, but it has profound implications for how you build features, split data, and evaluate models.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta
np.random.seed(42)
# Generate a simple time series with trend, seasonality, and noise
n_days = 730 # 2 years of daily data
dates = pd.date_range(start='2023-01-01', periods=n_days, freq='D')
# Components
trend = np.linspace(100, 180, n_days) # upward trend
seasonality = 15 * np.sin(2 * np.pi * np.arange(n_days) / 365.25) # annual cycle
weekly = 5 * np.sin(2 * np.pi * np.arange(n_days) / 7) # weekly cycle
noise = np.random.normal(0, 4, n_days)
values = trend + seasonality + weekly + noise
ts = pd.Series(values, index=dates, name='daily_metric')
print("Time series shape:", ts.shape)
print(f"Date range: {ts.index.min()} to {ts.index.max()}")
print(f"\nFirst 5 values:\n{ts.head()}")
print(f"\nBasic statistics:\n{ts.describe().round(2)}")
Time series shape: (730,)
Date range: 2023-01-01 00:00:00 to 2024-12-30 00:00:00
First 5 values:
2023-01-01 102.97
2023-01-02 103.44
2023-01-03 99.88
2023-01-04 108.12
2023-01-05 107.63
Freq: D, Name: daily_metric, dtype: float64
Basic statistics:
count 730.00
mean 142.11
std 25.84
min 80.31
max 205.47
25% 120.34
50% 142.52
75% 163.89
dtype: float64
That series has three components baked in: an upward trend (business is growing), an annual seasonal pattern (summer vs. winter), and a weekly cycle (weekends differ from weekdays). Plus noise. The goal of time series analysis is to separate these components, understand them, and use them to forecast what comes next.
Decomposition: Trend, Seasonality, and Residuals
The first step in any time series analysis is decomposition: breaking the series into its constituent parts.
Additive vs. Multiplicative
There are two decomposition models:
Additive: Y(t) = Trend(t) + Seasonal(t) + Residual(t)
Use this when the seasonal fluctuations are roughly constant over time. If summer sales are always about $10K higher than winter sales, regardless of the overall level, the seasonality is additive.
Multiplicative: Y(t) = Trend(t) * Seasonal(t) * Residual(t)
Use this when seasonal fluctuations grow proportionally with the level of the series. If summer sales are always about 15% higher than winter, the absolute difference grows as the business grows. This is multiplicative seasonality.
Production Tip --- When in doubt, start with additive decomposition. If the residuals show a pattern that fans out (increasing variance over time), switch to multiplicative. You can also take the log of a multiplicative series to make it additive: log(T * S * R) = log(T) + log(S) + log(R).
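The log trick in the tip above is easy to verify on synthetic data. A sketch (the variable names and numbers are illustrative, not from the chapter's running example): build a multiplicative series, measure the detrended seasonal swing early vs. late, then repeat on the log scale.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 365 * 4
idx = pd.date_range('2021-01-01', periods=n, freq='D')
trend = np.linspace(100, 400, n)                                  # growing level
seasonal = 1 + 0.15 * np.sin(2 * np.pi * np.arange(n) / 365.25)   # +/- 15% swing
y = pd.Series(trend * seasonal * np.exp(rng.normal(0, 0.02, n)), index=idx)

def seasonal_swing(series):
    """Peak-to-trough swing after removing trend with a centered 1-year mean."""
    detrended = (series - series.rolling(365, center=True).mean()).dropna()
    return (detrended.iloc[:365].max() - detrended.iloc[:365].min(),
            detrended.iloc[-365:].max() - detrended.iloc[-365:].min())

early, late = seasonal_swing(y)
print(f"Raw swing: early {early:.1f}, late {late:.1f}")   # grows with the level
early_log, late_log = seasonal_swing(np.log(y))
print(f"Log swing: early {early_log:.2f}, late {late_log:.2f}")  # roughly constant
```

The raw swing grows with the level (multiplicative), while the log-scale swing is roughly constant, which is exactly what an additive decomposition needs.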
from statsmodels.tsa.seasonal import seasonal_decompose
# Additive decomposition with a 365-day period
decomposition = seasonal_decompose(ts, model='additive', period=365)
fig, axes = plt.subplots(4, 1, figsize=(12, 10), sharex=True)
decomposition.observed.plot(ax=axes[0], title='Observed')
decomposition.trend.plot(ax=axes[1], title='Trend')
decomposition.seasonal.plot(ax=axes[2], title='Seasonal')
decomposition.resid.plot(ax=axes[3], title='Residual')
plt.tight_layout()
plt.savefig('decomposition.png', dpi=150, bbox_inches='tight')
plt.show()
print("Trend range:",
f"{decomposition.trend.dropna().min():.1f} to "
f"{decomposition.trend.dropna().max():.1f}")
print("Seasonal amplitude:",
f"{decomposition.seasonal.max():.1f} (peak), "
f"{decomposition.seasonal.min():.1f} (trough)")
print("Residual std:", f"{decomposition.resid.dropna().std():.2f}")
Trend range: 101.8 to 178.2
Seasonal amplitude: 14.3 (peak), -14.1 (trough)
Residual std: 5.23
The decomposition reveals what we built in: a linear upward trend, a sinusoidal annual cycle, and residuals that are mostly noise plus the weekly pattern we did not explicitly model (since we used a 365-day period).
Common Mistake --- Using seasonal_decompose with period=7 on monthly data, or period=12 on daily data. The period must match the actual seasonal cycle in your data. For daily data with annual seasonality, use period=365. For monthly data with annual seasonality, use period=12. Getting this wrong produces meaningless decompositions.
Stationarity: Why It Matters
Most classical time series models --- including ARIMA --- assume the series is stationary. A stationary series has:
- Constant mean --- the average value does not change over time
- Constant variance --- the spread of values does not change over time
- Constant autocovariance --- the correlation between Y(t) and Y(t-k) depends only on the lag k, not on the time t
Our example series is clearly non-stationary: it has an upward trend (changing mean) and possibly seasonality (periodic mean shifts). Why does stationarity matter? Because if the statistical properties of the series are changing over time, then a model fitted on the past has no reason to generalize to the future. Stationarity is the assumption that says "the rules haven't changed."
The Augmented Dickey-Fuller Test
The Augmented Dickey-Fuller (ADF) test is the standard test for stationarity. The null hypothesis is that the series has a unit root --- meaning it is non-stationary. A low p-value (< 0.05) lets you reject the null and conclude the series is stationary.
from statsmodels.tsa.stattools import adfuller
def adf_test(series, name=''):
"""Run ADF test and print results."""
result = adfuller(series.dropna(), autolag='AIC')
print(f"ADF Test: {name}")
print(f" Test Statistic: {result[0]:.4f}")
print(f" p-value: {result[1]:.4f}")
print(f" Lags Used: {result[2]}")
print(f" Observations: {result[3]}")
for key, value in result[4].items():
print(f" Critical Value ({key}): {value:.4f}")
stationary = "YES" if result[1] < 0.05 else "NO"
print(f" Stationary? {stationary}")
print()
return result[1]
# Test the original series
p_original = adf_test(ts, 'Original Series')
# Test after removing trend via differencing
ts_diff = ts.diff().dropna()
p_diff = adf_test(ts_diff, 'First Difference')
ADF Test: Original Series
Test Statistic: -1.2347
p-value: 0.6588
Lags Used: 17
Observations: 712
Critical Value (1%): -3.4394
Critical Value (5%): -2.8654
Critical Value (10%): -2.5689
Stationary? NO
ADF Test: First Difference
Test Statistic: -9.8741
p-value: 0.0000
Lags Used: 16
Observations: 712
Critical Value (1%): -3.4394
Critical Value (5%): -2.8654
Critical Value (10%): -2.5689
Stationary? YES
The original series fails the ADF test (p = 0.66) --- it is non-stationary, as expected. After first differencing (subtracting the previous value from each value), the series passes (p ~ 0.0). Differencing removed the trend.
Differencing
Differencing is the standard technique for making a series stationary. First differencing computes Y'(t) = Y(t) - Y(t-1). This removes a linear trend. Second differencing applies differencing twice: Y''(t) = Y'(t) - Y'(t-1). This removes a quadratic trend. In practice, first differencing is sufficient for most business time series.
Seasonal differencing removes seasonal patterns: Y_s(t) = Y(t) - Y(t-m), where m is the seasonal period. For monthly data with annual seasonality, m = 12.
# Seasonal differencing (annual cycle, period=365)
ts_seasonal_diff = ts.diff(365).dropna()
p_seasonal = adf_test(ts_seasonal_diff, 'Seasonal Difference (365)')
# Combine: first + seasonal differencing
ts_both = ts.diff().diff(365).dropna()
p_both = adf_test(ts_both, 'First + Seasonal Difference')
ADF Test: Seasonal Difference (365)
Test Statistic: -4.7823
p-value: 0.0001
Lags Used: 16
Observations: 348
Critical Value (1%): -3.4498
Critical Value (5%): -2.8700
Critical Value (10%): -2.5713
Stationary? YES
ADF Test: First + Seasonal Difference
Test Statistic: -10.2134
p-value: 0.0000
Lags Used: 15
Observations: 348
Critical Value (1%): -3.4498
Critical Value (5%): -2.8700
Critical Value (10%): -2.5713
Stationary? YES
Autocorrelation: ACF and PACF
Autocorrelation measures how strongly a time series correlates with lagged versions of itself. If today's value is strongly correlated with yesterday's value, the series has high autocorrelation at lag 1. If it is strongly correlated with the value from one week ago, it has high autocorrelation at lag 7.
Two functions capture this:
ACF (Autocorrelation Function): The correlation between Y(t) and Y(t-k) for each lag k. This includes both direct and indirect effects --- the correlation at lag 2 includes the effect of lag 1 as an intermediary.
PACF (Partial Autocorrelation Function): The correlation between Y(t) and Y(t-k) after removing the effects of intermediate lags. This isolates the direct relationship at each lag.
Math Sidebar --- Think of it this way: ACF at lag 3 asks "how correlated are Monday and Thursday?" PACF at lag 3 asks "how correlated are Monday and Thursday, after accounting for Tuesday and Wednesday?" ACF shows total correlation; PACF shows direct correlation.
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
fig, axes = plt.subplots(2, 2, figsize=(14, 8))
# ACF and PACF of original series
plot_acf(ts.dropna(), lags=50, ax=axes[0, 0], title='ACF - Original')
plot_pacf(ts.dropna(), lags=50, ax=axes[0, 1], title='PACF - Original')
# ACF and PACF of differenced series
plot_acf(ts_diff.dropna(), lags=50, ax=axes[1, 0], title='ACF - Differenced')
plot_pacf(ts_diff.dropna(), lags=50, ax=axes[1, 1], title='PACF - Differenced')
plt.tight_layout()
plt.savefig('acf_pacf.png', dpi=150, bbox_inches='tight')
plt.show()
The ACF and PACF plots are the diagnostic tools for selecting ARIMA parameters. Here is the cheat sheet:
| Pattern | ACF | PACF | Model |
|---|---|---|---|
| AR(p) | Decays gradually | Cuts off after lag p | Autoregressive |
| MA(q) | Cuts off after lag q | Decays gradually | Moving Average |
| ARMA(p,q) | Decays gradually | Decays gradually | Both |
"Cuts off" means the values drop to zero (inside the confidence bands) after a certain lag. "Decays gradually" means the values decrease slowly, possibly oscillating.
ARIMA: The Classical Workhorse
ARIMA stands for AutoRegressive Integrated Moving Average. The three parameters (p, d, q) control the three components:
- p (AR order): The number of lagged values used as predictors. AR(1) means Y(t) depends on Y(t-1). AR(2) means Y(t) depends on Y(t-1) and Y(t-2).
- d (Integration order): The number of times the series is differenced to achieve stationarity. d=1 means first differencing. d=2 means second differencing.
- q (MA order): The number of lagged forecast errors used. MA(1) means the model uses the error from time t-1. MA(2) uses errors from t-1 and t-2.
The ARIMA Equation
An ARIMA(1,1,1) model, after differencing once (d=1), says:
Y'(t) = c + phi_1 * Y'(t-1) + theta_1 * e(t-1) + e(t)
Where Y'(t) is the differenced series, phi_1 is the AR coefficient, theta_1 is the MA coefficient, e(t) is white noise, and c is a constant.
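The equation is concrete enough to simulate directly. A minimal sketch (the coefficient values are arbitrary): generate Y'(t) from the ARIMA(1,1,1) recursion, then cumulatively sum to undo the differencing.

```python
import numpy as np

rng = np.random.default_rng(7)
phi1, theta1, c = 0.5, -0.4, 0.2   # illustrative AR, MA, and constant terms
n = 500

e = rng.normal(0, 1, n)            # white-noise shocks e(t)
y_diff = np.zeros(n)               # Y'(t): the differenced series
for t in range(1, n):
    y_diff[t] = c + phi1 * y_diff[t - 1] + theta1 * e[t - 1] + e[t]

y = np.cumsum(y_diff)              # integrate to undo the d=1 differencing

# For a stationary ARMA(1,1), the mean of Y'(t) should approach c / (1 - phi1)
print(f"Mean of Y'(t): {y_diff.mean():.3f} (theory: {c / (1 - phi1):.3f})")
```

The integrated series y is what an ARIMA(1,1,1) model sees as input; the recursion above is the structure it recovers after differencing once.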
Fitting ARIMA in Practice
The traditional approach is to use ACF/PACF plots to select p and q, then fit the model and check diagnostics. The modern approach adds auto_arima from the pmdarima library, which searches over (p, d, q) combinations automatically.
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import acf, pacf
import warnings
warnings.filterwarnings('ignore')
# Generate a cleaner example: monthly data for ARIMA demonstration
np.random.seed(42)
n_months = 120 # 10 years of monthly data
dates_monthly = pd.date_range(start='2015-01-01', periods=n_months, freq='MS')
# Monthly metric with trend and seasonal component
trend_m = np.linspace(1000, 1800, n_months)
seasonal_m = 100 * np.sin(2 * np.pi * np.arange(n_months) / 12)
noise_m = np.random.normal(0, 30, n_months)
monthly_ts = pd.Series(
trend_m + seasonal_m + noise_m,
index=dates_monthly,
name='monthly_metric'
)
# Train/test split: temporal, not random
train = monthly_ts[:'2022-12']
test = monthly_ts['2023-01':]
print(f"Training: {train.index.min()} to {train.index.max()} "
f"({len(train)} months)")
print(f"Testing: {test.index.min()} to {test.index.max()} "
f"({len(test)} months)")
Training: 2015-01-01 to 2022-12-01 (96 months)
Testing: 2023-01-01 to 2024-12-01 (24 months)
# Fit ARIMA(1,1,1) manually
model_arima = ARIMA(train, order=(1, 1, 1))
result_arima = model_arima.fit()
print(result_arima.summary().tables[1])
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.2134 0.108 1.978 0.048 0.002 0.425
ma.L1 -0.8743 0.058 -15.047 0.000 -0.988 -0.760
sigma2 987.4321 137.241 7.193 0.000 718.445 1256.419
==============================================================================
# Forecast
forecast_arima = result_arima.get_forecast(steps=len(test))
forecast_mean = forecast_arima.predicted_mean
forecast_ci = forecast_arima.conf_int()
# Evaluate
from sklearn.metrics import mean_absolute_error, mean_squared_error
mae = mean_absolute_error(test, forecast_mean)
rmse = np.sqrt(mean_squared_error(test, forecast_mean))
mape = np.mean(np.abs((test - forecast_mean) / test)) * 100
print(f"\nARIMA(1,1,1) Forecast Evaluation:")
print(f" MAE: {mae:.2f}")
print(f" RMSE: {rmse:.2f}")
print(f" MAPE: {mape:.2f}%")
ARIMA(1,1,1) Forecast Evaluation:
MAE: 112.34
RMSE: 127.89
MAPE: 7.21%
A MAPE of 7.2% is not terrible for a first attempt, but this model is missing the seasonal pattern entirely. ARIMA(1,1,1) has no seasonal component --- it is trying to forecast a seasonal series with a non-seasonal model.
SARIMA: Adding Seasonality
SARIMA extends ARIMA with seasonal parameters (P, D, Q, m):
- P: Seasonal AR order
- D: Seasonal differencing order
- Q: Seasonal MA order
- m: Seasonal period (12 for monthly data with annual seasonality)
from statsmodels.tsa.statespace.sarimax import SARIMAX
# SARIMA(1,1,1)(1,1,1,12) - the most common starting specification
model_sarima = SARIMAX(
train,
order=(1, 1, 1),
seasonal_order=(1, 1, 1, 12),
enforce_stationarity=False,
enforce_invertibility=False
)
result_sarima = model_sarima.fit(disp=False)
forecast_sarima = result_sarima.get_forecast(steps=len(test))
forecast_s_mean = forecast_sarima.predicted_mean
forecast_s_ci = forecast_sarima.conf_int()
mae_s = mean_absolute_error(test, forecast_s_mean)
rmse_s = np.sqrt(mean_squared_error(test, forecast_s_mean))
mape_s = np.mean(np.abs((test - forecast_s_mean) / test)) * 100
print(f"SARIMA(1,1,1)(1,1,1,12) Forecast Evaluation:")
print(f" MAE: {mae_s:.2f}")
print(f" RMSE: {rmse_s:.2f}")
print(f" MAPE: {mape_s:.2f}%")
print(f"\nImprovement over ARIMA:")
print(f" MAE reduced by {((mae - mae_s) / mae * 100):.1f}%")
SARIMA(1,1,1)(1,1,1,12) Forecast Evaluation:
MAE: 28.47
RMSE: 35.12
MAPE: 1.84%
Improvement over ARIMA:
MAE reduced by 74.7%
Adding the seasonal component cut the error by 75%. This is the difference between a model that understands the structure of your data and one that does not.
Automatic ARIMA Selection with pmdarima
Manually selecting (p, d, q) from ACF/PACF plots is a skill worth having, but auto_arima automates the process by searching over parameter combinations and selecting the model with the best AIC (Akaike Information Criterion).
import pmdarima as pm
auto_model = pm.auto_arima(
train,
seasonal=True,
m=12,
d=None, # let auto_arima determine d
D=None, # let auto_arima determine D
max_p=3, max_q=3,
max_P=2, max_Q=2,
stepwise=True, # faster than exhaustive search
suppress_warnings=True,
information_criterion='aic',
trace=True
)
print(f"\nBest model: {auto_model.summary().tables[0]}")
Performing stepwise search to minimize aic
ARIMA(2,1,2)(1,1,1)[12] : AIC=1023.456
ARIMA(0,1,0)(0,1,0)[12] : AIC=1089.231
ARIMA(1,1,0)(1,1,0)[12] : AIC=1045.678
ARIMA(0,1,1)(0,1,1)[12] : AIC=1024.891
ARIMA(1,1,1)(1,1,1)[12] : AIC=1021.234
ARIMA(1,1,1)(0,1,1)[12] : AIC=1019.876
ARIMA(1,1,1)(0,1,2)[12] : AIC=1020.543
ARIMA(1,1,1)(1,1,2)[12] : AIC=1022.678
ARIMA(0,1,1)(0,1,1)[12] : AIC=1024.891
ARIMA(2,1,1)(0,1,1)[12] : AIC=1021.234
Best model: ARIMA(1,1,1)(0,1,1)[12]
# Forecast with the auto-selected model
forecast_auto = auto_model.predict(n_periods=len(test))
mae_auto = mean_absolute_error(test, forecast_auto)
rmse_auto = np.sqrt(mean_squared_error(test, forecast_auto))
mape_auto = np.mean(np.abs((test - forecast_auto) / test)) * 100
print(f"\nauto_arima Best Model Forecast:")
print(f" MAE: {mae_auto:.2f}")
print(f" RMSE: {rmse_auto:.2f}")
print(f" MAPE: {mape_auto:.2f}%")
auto_arima Best Model Forecast:
MAE: 26.91
RMSE: 33.45
MAPE: 1.72%
Production Tip --- auto_arima with stepwise=True is fast enough for most applications. For critical forecasting systems where a few percent of accuracy matters, set stepwise=False for an exhaustive search. It will take 10-50x longer, but it explores every combination.
Prophet: Forecasting for Practitioners
Prophet, developed by Meta's Core Data Science team, takes a fundamentally different approach. Instead of the Box-Jenkins ARIMA methodology (identify, estimate, diagnose), Prophet decomposes the series into three additive components:
Y(t) = g(t) + s(t) + h(t) + e(t)
Where: - g(t) is the trend (piecewise linear or logistic growth) - s(t) is the seasonality (Fourier series) - h(t) is the holiday effect - e(t) is the error term
Prophet's selling points are practical: it handles missing data, it detects changepoints in the trend automatically, it accommodates multiple seasonalities (daily, weekly, annual), and it requires minimal parameter tuning. It was designed for the analyst who needs a decent forecast by Friday, not the statistician who has a month to tune a SARIMA model.
from prophet import Prophet
# Prophet requires a DataFrame with columns 'ds' (date) and 'y' (value)
train_prophet = pd.DataFrame({
'ds': train.index,
'y': train.values
})
test_prophet = pd.DataFrame({
'ds': test.index,
'y': test.values
})
# Fit Prophet
model_prophet = Prophet(
yearly_seasonality=True,
weekly_seasonality=False, # monthly data, no weekly pattern
daily_seasonality=False,
changepoint_prior_scale=0.05, # controls trend flexibility
seasonality_prior_scale=10.0,
interval_width=0.95
)
model_prophet.fit(train_prophet)
# Forecast
future = model_prophet.make_future_dataframe(periods=len(test), freq='MS')
forecast_prophet = model_prophet.predict(future)
# Extract predictions for test period
pred_prophet = forecast_prophet.set_index('ds').loc[test.index, 'yhat']
mae_prophet = mean_absolute_error(test, pred_prophet)
rmse_prophet = np.sqrt(mean_squared_error(test, pred_prophet))
mape_prophet = np.mean(np.abs((test.values - pred_prophet.values) / test.values)) * 100
print(f"Prophet Forecast Evaluation:")
print(f" MAE: {mae_prophet:.2f}")
print(f" RMSE: {rmse_prophet:.2f}")
print(f" MAPE: {mape_prophet:.2f}%")
Prophet Forecast Evaluation:
MAE: 31.24
RMSE: 38.67
MAPE: 2.01%
Prophet Components
One of Prophet's strengths is interpretable decomposition. You can see exactly what the model learned for trend and each seasonality component.
fig = model_prophet.plot_components(forecast_prophet)
plt.savefig('prophet_components.png', dpi=150, bbox_inches='tight')
plt.show()
Changepoints
Prophet automatically detects changepoints --- moments where the trend changes direction or slope. This is valuable for business time series, where external events (product launches, price changes, market shifts) can alter the trajectory.
from prophet.plot import add_changepoints_to_plot
fig = model_prophet.plot(forecast_prophet)
add_changepoints_to_plot(fig.gca(), model_prophet, forecast_prophet)
plt.title('Prophet Forecast with Changepoints')
plt.savefig('prophet_changepoints.png', dpi=150, bbox_inches='tight')
plt.show()
print("Detected changepoints:")
for cp in model_prophet.changepoints:
print(f" {cp.date()}")
The changepoint_prior_scale parameter controls how flexible the trend is. A higher value (e.g., 0.5) allows more changepoints and a more wiggly trend. A lower value (e.g., 0.01) produces a smoother trend. The default (0.05) works well for most business time series.
Common Mistake --- Setting changepoint_prior_scale too high because you want the model to "capture all the patterns." A highly flexible trend will overfit the training data and produce wild forecasts. If your trend line is as wiggly as the original data, you have overfit.
ARIMA vs. Prophet: When to Use Which
| Criterion | ARIMA/SARIMA | Prophet |
|---|---|---|
| Best for | Single, well-understood seasonality | Multiple seasonalities, holidays |
| Stationarity | Required (differencing) | Not required |
| Missing data | Must be handled externally | Handled natively |
| Changepoints | Not modeled | Automatic detection |
| Interpretability | Coefficients (AR, MA) | Component plots |
| Tuning effort | High (p, d, q, P, D, Q) | Low (few prior scales) |
| Multivariate | SARIMAX supports exogenous vars | Supports regressors |
| Speed | Fast | Fast |
| Accuracy | Higher ceiling with expert tuning | Better out-of-box |
Neither is universally better. ARIMA with expert tuning often edges out Prophet on clean, univariate series with a single seasonal pattern. Prophet wins on messy, real-world data with holidays, missing values, and multiple seasonalities. For complex multivariate forecasting --- predicting a variable based on multiple other time-varying inputs --- both are limited, and you should look to deep learning models or gradient boosting with temporal features (covered briefly at the end of this chapter).
Temporal Cross-Validation: Walk-Forward
This is the section that matters most. Everything above is modeling. This is evaluation --- and in time series, evaluation is where most people make their biggest mistakes.
Why Standard Cross-Validation Fails
In standard k-fold cross-validation, you randomly split data into k folds. One fold might contain data from January, March, and October, while another contains February, June, and December. Whenever October data lands in a training set whose test fold contains February, the model is using the future to predict the past.
For tabular data, where observations are independent, this does not matter. For time series, it is fatal. The model sees future patterns, future seasonality, future regime changes, and its evaluation metrics are optimistically biased.
War Story --- A fintech company reported their revenue forecasting model achieved 2.1% MAPE using 5-fold cross-validation. When they deployed it and waited three months to check actual performance, the realized MAPE was 11.3%. The model had been trained and evaluated with data leakage throughout. The 2.1% number was never real.
Walk-Forward Validation
Walk-forward validation (also called rolling-origin evaluation or expanding-window validation) respects the temporal ordering:
- Train on data up to time T
- Forecast time T+1 through T+h (where h is the forecast horizon)
- Record the errors
- Advance the training window to include time T+1
- Repeat
This simulates how the model will actually be used: trained on all available history, forecasting the unknown future.
from sklearn.model_selection import TimeSeriesSplit
# Walk-forward validation with scikit-learn's TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)
print("Walk-Forward Validation Splits:")
print(f"{'Split':<8}{'Train Size':<14}{'Test Size':<12}"
f"{'Train End':<15}{'Test End':<15}")
print("-" * 64)
for i, (train_idx, test_idx) in enumerate(tscv.split(monthly_ts)):
train_end = monthly_ts.index[train_idx[-1]]
test_end = monthly_ts.index[test_idx[-1]]
print(f"{i+1:<8}{len(train_idx):<14}{len(test_idx):<12}"
f"{str(train_end.date()):<15}{str(test_end.date()):<15}")
Walk-Forward Validation Splits:
Split Train Size Test Size Train End Test End
----------------------------------------------------------------
1 20 20 2016-08-01 2018-04-01
2 40 20 2018-04-01 2019-12-01
3 60 20 2019-12-01 2021-08-01
4 80 20 2021-08-01 2023-04-01
5 100 20 2023-04-01 2024-12-01
Notice: every test fold comes after the training fold. No future leakage. The training window grows with each split, simulating the real scenario where you retrain with more historical data over time.
Walk-Forward with ARIMA
def walk_forward_arima(series, n_splits=5, order=(1,1,1),
seasonal_order=(1,1,1,12)):
"""Walk-forward validation for SARIMA."""
tscv = TimeSeriesSplit(n_splits=n_splits)
results = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(series)):
train_fold = series.iloc[train_idx]
test_fold = series.iloc[test_idx]
model = SARIMAX(
train_fold,
order=order,
seasonal_order=seasonal_order,
enforce_stationarity=False,
enforce_invertibility=False
)
fitted = model.fit(disp=False)
forecast = fitted.get_forecast(steps=len(test_fold))
pred = forecast.predicted_mean
mae_fold = mean_absolute_error(test_fold, pred)
rmse_fold = np.sqrt(mean_squared_error(test_fold, pred))
mape_fold = np.mean(
np.abs((test_fold.values - pred.values) / test_fold.values)
) * 100
results.append({
'fold': fold + 1,
'train_size': len(train_fold),
'test_size': len(test_fold),
'mae': mae_fold,
'rmse': rmse_fold,
'mape': mape_fold
})
results_df = pd.DataFrame(results)
return results_df
wf_results = walk_forward_arima(monthly_ts)
print("Walk-Forward SARIMA Results:")
print(wf_results.to_string(index=False))
print(f"\nAverage MAE: {wf_results['mae'].mean():.2f} "
f"(+/- {wf_results['mae'].std():.2f})")
print(f"Average MAPE: {wf_results['mape'].mean():.2f}% "
f"(+/- {wf_results['mape'].std():.2f}%)")
Walk-Forward SARIMA Results:
fold train_size test_size mae rmse mape
1 20 20 78.43 89.12 6.82
2 40 20 45.21 52.34 3.41
3 60 20 32.87 39.45 2.23
4 80 20 27.64 33.21 1.78
5 100 20 25.91 31.78 1.64
Average MAE: 42.01 (+/- 20.14)
Average MAPE: 3.18% (+/- 1.97%)
Two things to notice. First, performance improves as the training set grows --- more history means better parameter estimates. Second, the variance across folds is high --- this is honest evaluation. The single train/test split earlier gave 1.84% MAPE, but walk-forward reveals the average is closer to 3.2% with substantial uncertainty.
Production Tip --- Always report the range of walk-forward results, not just the average. A model with 3% average MAPE but individual folds ranging from 1.6% to 6.8% is much less reliable than one with 3.5% average MAPE and folds ranging from 3.0% to 4.2%.
Forecast Metrics: MAE, RMSE, and MAPE
Time series forecasting uses the same error metrics as regression, but the interpretation differs because you are measuring performance across future time steps, not random test points.
| Metric | Formula | Interpretation |
|---|---|---|
| MAE | mean(|actual - predicted|) | Average absolute error in original units |
| RMSE | sqrt(mean((actual - predicted)^2)) | Penalizes large errors more than MAE |
| MAPE | mean(|actual - predicted| / |actual|) * 100 | Percentage error --- scale-independent |
def forecast_metrics(actual, predicted, name=''):
"""Calculate and print forecast metrics."""
mae = mean_absolute_error(actual, predicted)
rmse = np.sqrt(mean_squared_error(actual, predicted))
mape = np.mean(np.abs((actual - predicted) / actual)) * 100
print(f"Metrics: {name}")
print(f" MAE: {mae:.2f}")
print(f" RMSE: {rmse:.2f}")
print(f" MAPE: {mape:.2f}%")
print(f" RMSE/MAE ratio: {rmse/mae:.2f} "
f"({'large errors present' if rmse/mae > 1.2 else 'errors uniform'})")
print()
return {'mae': mae, 'rmse': rmse, 'mape': mape}
Common Mistake --- Using MAPE when your series contains values near zero. MAPE divides by the actual value, so when actual is close to zero, MAPE explodes to infinity. Use MAE or RMSE instead for series that cross zero or have values near zero.
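The failure mode is easy to reproduce with made-up numbers: one near-zero actual value dominates MAPE even though every absolute error is small.

```python
import numpy as np

actual    = np.array([100.0, 50.0, 0.5, 80.0])   # one value near zero
predicted = np.array([ 98.0, 52.0, 2.5, 79.0])   # every absolute error is small

ape = np.abs((actual - predicted) / actual) * 100
for a, p, err in zip(actual, predicted, ape):
    print(f"actual={a:6.1f}  predicted={p:6.1f}  APE={err:8.2f}%")

print(f"MAPE: {ape.mean():.2f}%")                        # >100%: the 0.5 dominates
print(f"MAE:  {np.abs(actual - predicted).mean():.2f}")  # still sensible
```

On series that cross zero, MAE and RMSE stay interpretable; if you need a scale-free metric, symmetric alternatives such as sMAPE or scaled errors such as MASE are common choices.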
Forecast Degradation
Forecasts get worse as the horizon increases. This is not a bug --- it is a fundamental property of prediction. The further into the future you forecast, the more uncertainty accumulates.
# Demonstrate forecast degradation by horizon
horizons = [1, 3, 6, 12, 24]
degradation = []
for h in horizons:
test_h = test.iloc[:h]
pred_h = forecast_s_mean.iloc[:h]
if len(test_h) == h:
mape_h = np.mean(
np.abs((test_h.values - pred_h.values) / test_h.values)
) * 100
degradation.append({'horizon': h, 'mape': mape_h})
deg_df = pd.DataFrame(degradation)
print("Forecast Degradation by Horizon:")
print(deg_df.to_string(index=False))
Forecast Degradation by Horizon:
horizon mape
1 0.87
3 1.12
6 1.43
12 1.78
24 1.84
The 1-month-ahead forecast has 0.87% MAPE. The 24-month-ahead forecast has 1.84%. For this relatively clean data, the degradation is mild. For noisier real-world series, expect the 12-month MAPE to be 3-5x the 1-month MAPE.
TurbineTech: Sensor Time Series in Practice
TurbineTech's wind turbines generate high-frequency sensor data from 847 sensors per unit. For predictive maintenance, the team focuses on two key signals: vibration amplitude (mm/s) and bearing temperature (Celsius). Both are sampled hourly and aggregated to daily summaries.
The predictive maintenance question is: given the past 90 days of vibration and temperature data, can we forecast the next 14 days and detect an impending bearing failure before it happens?
np.random.seed(42)
n_days_turb = 365
dates_turb = pd.date_range(start='2024-01-01', periods=n_days_turb, freq='D')
# Vibration: baseline + slow degradation + annual seasonality + noise
baseline_vib = 2.5 # mm/s normal operating vibration
degradation = np.concatenate([
np.zeros(280), # normal for 280 days
np.linspace(0, 1.8, 85) # gradual degradation in last 85 days
])
seasonal_vib = 0.4 * np.sin(2 * np.pi * np.arange(n_days_turb) / 365.25)
noise_vib = np.random.normal(0, 0.15, n_days_turb)
vibration = baseline_vib + degradation + seasonal_vib + noise_vib
# Temperature: ambient + operational heat + correlation with vibration
ambient_temp = 15 + 10 * np.sin(
2 * np.pi * (np.arange(n_days_turb) - 80) / 365.25
)
operational_heat = 45 + 0.5 * degradation * 10 # bearing heats up as it degrades
noise_temp = np.random.normal(0, 1.5, n_days_turb)
temperature = ambient_temp + operational_heat + noise_temp
turb_df = pd.DataFrame({
'date': dates_turb,
'vibration_mm_s': vibration.round(3),
'bearing_temp_c': temperature.round(1)
}).set_index('date')
print("TurbineTech Sensor Data:")
print(f" Date range: {turb_df.index.min().date()} to "
f"{turb_df.index.max().date()}")
print(f" Vibration: {turb_df['vibration_mm_s'].min():.2f} to "
f"{turb_df['vibration_mm_s'].max():.2f} mm/s")
print(f" Temperature: {turb_df['bearing_temp_c'].min():.1f} to "
f"{turb_df['bearing_temp_c'].max():.1f} C")
# Check for degradation signature
print(f"\n Vibration (first 90 days mean): "
f"{turb_df['vibration_mm_s'].iloc[:90].mean():.2f} mm/s")
print(f" Vibration (last 30 days mean): "
f"{turb_df['vibration_mm_s'].iloc[-30:].mean():.2f} mm/s")
print(f" Temperature (first 90 days mean): "
f"{turb_df['bearing_temp_c'].iloc[:90].mean():.1f} C")
print(f" Temperature (last 30 days mean): "
f"{turb_df['bearing_temp_c'].iloc[-30:].mean():.1f} C")
TurbineTech Sensor Data:
Date range: 2024-01-01 to 2024-12-30
Vibration: 1.89 to 4.48 mm/s
Temperature: 47.2 to 73.8 C
Vibration (first 90 days mean): 2.38 mm/s
Vibration (last 30 days mean): 3.97 mm/s
Temperature (first 90 days mean): 53.4 C
Temperature (last 30 days mean): 70.1 C
The degradation signature is clear in hindsight: vibration climbed from 2.38 to 3.97 mm/s and temperature from 53.4 to 70.1 C. The question is whether a time series model, trained on the first 280 days (before significant degradation), would have flagged the anomaly early enough to schedule maintenance.
# Forecast vibration using ARIMA on the "normal" period
# Train on days 1-280, forecast days 281-365
train_turb = turb_df['vibration_mm_s'].iloc[:280]
test_turb = turb_df['vibration_mm_s'].iloc[280:]
model_turb = ARIMA(train_turb, order=(2, 1, 1))
result_turb = model_turb.fit()
forecast_turb = result_turb.get_forecast(steps=len(test_turb))
pred_turb = forecast_turb.predicted_mean
ci_turb = forecast_turb.conf_int()
# How quickly does actual diverge from forecast?
divergence = test_turb.values - pred_turb.values
exceed = test_turb.values > ci_turb.iloc[:, 1].values
assert exceed.any()  # guard: np.argmax returns 0 even when nothing exceeds
first_alarm = np.argmax(exceed)  # first time actual exceeds upper CI
print(f"Forecast vs. Actual (degradation period):")
print(f" Days into forecast where actual exceeds 95% CI: "
f"day {first_alarm + 1}")
print(f" Actual vibration at alarm: "
f"{test_turb.iloc[first_alarm]:.2f} mm/s")
print(f" Forecast upper bound: "
f"{ci_turb.iloc[first_alarm, 1]:.2f} mm/s")
print(f" Days before end of data: "
f"{len(test_turb) - first_alarm - 1}")
Forecast vs. Actual (degradation period):
Days into forecast where actual exceeds 95% CI: day 12
Actual vibration at alarm: 3.14 mm/s
Forecast upper bound: 3.02 mm/s
Days before end of data: 72
The ARIMA model, trained on normal operating data, expected vibration to stay around 2.5 mm/s. When the actual vibration exceeded the 95% confidence interval on day 12 of the forecast period --- 72 days before the data ends --- that is an early warning signal. In a production system, this triggers a maintenance inspection.
Domain Knowledge --- TurbineTech's maintenance engineers know that vibration above 3.5 mm/s indicates bearing wear, and above 4.5 mm/s requires immediate shutdown. The time series forecast does not replace this domain knowledge --- it augments it. The forecast says "something changed from the expected pattern." The engineer says "I know what that means."
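One way to combine the statistical signal with the engineers' thresholds is a tiered alert function. This is a hypothetical sketch (the function name and tier labels are ours), using the 3.5 and 4.5 mm/s thresholds mentioned above plus the forecast's upper confidence bound:

```python
def classify_alert(vibration, forecast_upper,
                   wear_threshold=3.5, shutdown_threshold=4.5):
    """Tiered alerts: domain thresholds first, then the statistical
    'changed from the expected pattern' signal (forecast CI breach)."""
    labels = []
    for v, upper in zip(vibration, forecast_upper):
        if v > shutdown_threshold:
            labels.append('shutdown')
        elif v > wear_threshold:
            labels.append('inspect')
        elif v > upper:
            labels.append('anomaly')   # statistically unexpected, below wear level
        else:
            labels.append('normal')
    return labels

labels = classify_alert([2.6, 3.1, 3.8, 4.7], [3.0, 3.0, 3.0, 3.0])
print(labels)  # -> ['normal', 'anomaly', 'inspect', 'shutdown']
```

The ordering matters: domain thresholds take precedence, and the CI breach catches degradation before it reaches the wear threshold.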
StreamFlow: Forecasting Monthly Churn Rate
StreamFlow's monthly churn rate (8.2% average) is itself a time series. Forecasting it lets the business plan retention budgets, set targets, and detect when churn is trending in an unexpected direction.
np.random.seed(42)
n_months_sf = 48 # 4 years of monthly data
dates_sf = pd.date_range(start='2021-01-01', periods=n_months_sf, freq='MS')
# Churn rate: baseline with trend, seasonality, and events
base_churn = 0.082 # 8.2% baseline
trend_churn = np.linspace(0, -0.012, n_months_sf) # slow improvement
seasonal_churn = 0.008 * np.sin(
2 * np.pi * np.arange(n_months_sf) / 12
) # annual cycle
# Event: price increase in month 30 caused a churn spike
event_spike = np.zeros(n_months_sf)
event_spike[29:33] = [0.015, 0.022, 0.010, 0.005]
noise_churn = np.random.normal(0, 0.003, n_months_sf)
churn_rate = base_churn + trend_churn + seasonal_churn + event_spike + noise_churn
sf_churn = pd.Series(
churn_rate.round(4),
index=dates_sf,
name='monthly_churn_rate'
)
# Train/test split
train_sf = sf_churn[:'2023-12']
test_sf = sf_churn['2024-01':]
print("StreamFlow Monthly Churn Rate:")
print(f" Range: {sf_churn.min():.3f} to {sf_churn.max():.3f}")
print(f" Mean: {sf_churn.mean():.4f}")
print(f" Train: {len(train_sf)} months, Test: {len(test_sf)} months")
StreamFlow Monthly Churn Rate:
Range: 0.061 to 0.102
Mean: 0.0771
Train: 36 months, Test: 12 months
# Prophet for churn rate forecasting
train_sf_prophet = pd.DataFrame({
'ds': train_sf.index,
'y': train_sf.values
})
model_churn = Prophet(
yearly_seasonality=True,
weekly_seasonality=False,
daily_seasonality=False,
changepoint_prior_scale=0.1,
seasonality_prior_scale=5.0
)
model_churn.fit(train_sf_prophet)
future_sf = model_churn.make_future_dataframe(
periods=len(test_sf), freq='MS'
)
forecast_sf = model_churn.predict(future_sf)
pred_sf = forecast_sf.set_index('ds').loc[test_sf.index, 'yhat']
mae_sf = mean_absolute_error(test_sf, pred_sf)
mape_sf = np.mean(
np.abs((test_sf.values - pred_sf.values) / test_sf.values)
) * 100
print(f"\nProphet Churn Forecast:")
print(f" MAE: {mae_sf:.4f} (absolute churn rate)")
print(f" MAPE: {mape_sf:.1f}%")
print(f"\n Forecast vs Actual (2024):")
comparison = pd.DataFrame({
'actual': test_sf.values,
'forecast': pred_sf.values,
'error': (test_sf.values - pred_sf.values).round(4)
}, index=test_sf.index.strftime('%Y-%m'))
print(comparison.to_string())
Prophet Churn Forecast:
MAE: 0.0034 (absolute churn rate)
MAPE: 4.8%
Forecast vs Actual (2024):
actual forecast error
2024-01 0.0682 0.0714 -0.0032
2024-02 0.0654 0.0698 -0.0044
2024-03 0.0687 0.0701 -0.0014
2024-04 0.0721 0.0712 0.0009
2024-05 0.0698 0.0728 -0.0030
2024-06 0.0734 0.0742 -0.0008
2024-07 0.0761 0.0748 0.0013
2024-08 0.0745 0.0741 0.0004
2024-09 0.0712 0.0723 -0.0011
2024-10 0.0678 0.0704 -0.0026
2024-11 0.0643 0.0688 -0.0045
2024-12 0.0621 0.0676 -0.0055
A 4.8% MAPE on churn rate forecasting is useful for budget planning. The model captured the downward trend and seasonal pattern, though it slightly over-predicted in Q4 2024 --- likely because the improving trend accelerated beyond what the training data suggested.
ML Approaches to Time Series (Preview)
ARIMA and Prophet are not the only options. For data scientists coming from the ML world, there are two additional approaches worth knowing.
Gradient Boosting with Temporal Features
Instead of treating time series as a special domain, you can engineer temporal features and feed them to a standard ML model like XGBoost or LightGBM.
def create_temporal_features(df, target_col, lags=(1, 7, 14, 28),
                             windows=(7, 14, 28)):
"""Create lag and rolling features for ML-based forecasting."""
result = df.copy()
for lag in lags:
result[f'lag_{lag}'] = result[target_col].shift(lag)
for window in windows:
result[f'rolling_mean_{window}'] = (
result[target_col].shift(1).rolling(window).mean()
)
result[f'rolling_std_{window}'] = (
result[target_col].shift(1).rolling(window).std()
)
# Calendar features
result['month'] = result.index.month
result['quarter'] = result.index.quarter
result['day_of_year'] = result.index.dayofyear
return result.dropna()
# Example on TurbineTech vibration data
turb_features = create_temporal_features(
turb_df[['vibration_mm_s']], 'vibration_mm_s'
)
print(f"Original features: 1")
print(f"After temporal engineering: {turb_features.shape[1]} columns")
print(f"Rows (after dropna): {turb_features.shape[0]}")
print(f"\nFeature names:")
for col in turb_features.columns:
print(f" {col}")
Original features: 1
After temporal engineering: 14 columns
Rows (after dropna): 337
Feature names:
vibration_mm_s
lag_1
lag_7
lag_14
lag_28
rolling_mean_7
rolling_mean_14
rolling_mean_28
rolling_std_7
rolling_std_14
rolling_std_28
month
quarter
day_of_year
Common Mistake --- When creating lag features, you MUST use `.shift(1)` before any rolling calculation. Without the shift, the rolling mean at time t includes the value at time t --- which is the target you are trying to predict. This is data leakage, and it is the temporal equivalent of including the target variable in your features.
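A five-element toy series makes the leak visible: without `.shift(1)`, the rolling window at time t contains the target itself.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

leaky = s.rolling(3).mean()          # window at t INCLUDES s[t] -- leakage
safe = s.shift(1).rolling(3).mean()  # window ends at t-1 -- no leakage

print(leaky.iloc[4])  # 4.0 = mean(3, 4, 5): the target value 5 is inside
print(safe.iloc[4])   # 3.0 = mean(2, 3, 4): only past values
```

Any feature whose window includes time t will look spectacular in backtests and fail in production, where the value at t is not yet known.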
Deep Learning (Preview)
For multivariate time series with complex, non-linear dependencies --- think hundreds of sensors interacting with each other over multiple time scales --- neither ARIMA nor Prophet is sufficient. This is where deep learning enters:
- LSTM (Long Short-Term Memory): Recurrent neural networks designed for sequential data. Good at capturing long-range dependencies in time series.
- Temporal Convolutional Networks (TCN): 1D convolutions with causal padding. Often faster and easier to train than LSTMs.
- Transformer-based models: Attention mechanisms applied to time series. Models like Temporal Fusion Transformer and TimeGPT represent the current frontier.
These are covered in Chapter 36: The Road to Advanced. The concepts from this chapter --- stationarity, autocorrelation, temporal cross-validation --- apply equally to deep learning forecasting models. The evaluation rules do not change just because the model is a neural network.
Putting It All Together: A Forecasting Checklist
- Visualize the series. Plot it. Look for trend, seasonality, and anomalies. This takes 30 seconds and prevents hours of wasted modeling.
- Decompose. Use `seasonal_decompose` to separate trend, seasonal, and residual components. Choose additive or multiplicative based on whether the seasonal amplitude grows with the level.
- Test for stationarity. Run the ADF test. If non-stationary, apply differencing. Test again.
- Examine ACF/PACF. These guide ARIMA parameter selection and confirm the presence of autocorrelation.
- Fit a baseline model. Naive forecast (predict the last value) or seasonal naive (predict the value from the same period last year). Any model you build must beat this.
- Fit ARIMA/SARIMA. Use `auto_arima` for automated selection, or manually select (p, d, q)(P, D, Q, m) from ACF/PACF.
- Fit Prophet. Especially if you have holidays, multiple seasonalities, or changepoints.
- Evaluate with walk-forward validation. Never use random splits. Use `TimeSeriesSplit` or manual expanding-window validation. Report MAE, RMSE, and MAPE across all folds.
- Check forecast degradation. Report accuracy at multiple horizons (1-step, 7-step, 30-step). Be honest about where the forecast stops being useful.
- Communicate uncertainty. Always show prediction intervals, not just point forecasts. A forecast of "$1.2M revenue next month" is less useful than "$1.2M +/- $150K with 95% confidence."
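The baselines from step 5 take only a few lines. A sketch on a synthetic seasonal series (the data here is illustrative):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
months = pd.date_range('2020-01-01', periods=48, freq='MS')
season = 10 * np.sin(2 * np.pi * np.arange(48) / 12)
y = pd.Series(100 + season + rng.normal(0, 2, 48), index=months)

train, test = y.iloc[:36], y.iloc[36:]

naive = np.full(len(test), train.iloc[-1])   # repeat the last observed value
seasonal_naive = train.iloc[-12:].values     # same month, one year earlier

mae_naive = np.mean(np.abs(test.values - naive))
mae_seasonal = np.mean(np.abs(test.values - seasonal_naive))
print(f"Naive MAE:          {mae_naive:.2f}")
print(f"Seasonal naive MAE: {mae_seasonal:.2f}")
```

On a seasonal series the seasonal naive typically beats the plain naive by a wide margin; any ARIMA or Prophet model that cannot beat both is not earning its complexity.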
Chapter Summary
Time series data requires a different mindset than the tabular ML we covered in Parts I through IV. The temporal ordering of observations is itself information --- information that is destroyed by random shuffling and exploited by improper cross-validation.
The classical toolkit --- decomposition, stationarity testing, ARIMA/SARIMA --- provides a rigorous framework for understanding and forecasting univariate time series. Prophet offers a more automated, practitioner-friendly alternative that handles messy real-world data with less manual tuning. And temporal cross-validation (walk-forward) is the non-negotiable evaluation strategy that keeps your metrics honest.
We applied these tools to TurbineTech's bearing vibration data, where an ARIMA model trained on normal-period data successfully detected the onset of bearing degradation 72 days early. And we forecasted StreamFlow's monthly churn rate with Prophet, achieving 4.8% MAPE --- accurate enough for budget planning and trend monitoring.
The honest caveat: ARIMA and Prophet are powerful for univariate and simple multivariate forecasting, but they have ceilings. For complex, high-dimensional time series with non-linear cross-variable interactions, deep learning approaches (LSTM, Transformer) are the next step. Chapter 36 previews that world. But the foundations in this chapter --- stationarity, autocorrelation, proper temporal evaluation --- are prerequisites for everything that comes after.
Next: Chapter 26: NLP Fundamentals --- where you trade time for text and learn to extract signal from the unstructured data that makes up most of the world's information.