
Chapter 25: Time Series Analysis and Forecasting

ARIMA, Prophet, and Temporal Cross-Validation


Learning Objectives

By the end of this chapter, you will be able to:

  1. Decompose time series into trend, seasonality, and residual components
  2. Test for stationarity (ADF test) and apply differencing
  3. Fit ARIMA models and interpret (p, d, q) parameters
  4. Use Prophet for automatic trend and seasonality detection
  5. Evaluate forecasts with proper temporal cross-validation (walk-forward)

Time Series Has Its Own Rules

War Story --- A junior data scientist on TurbineTech's predictive maintenance team built a vibration forecasting model to predict bearing degradation. She split the sensor data randomly into 80/20 train/test, trained an XGBoost regressor, and got an MAE of 0.3 mm/s on the test set. Her manager was impressed. Then someone looked at the predictions more carefully: the model was using future vibration readings to predict past ones. On a proper temporal split --- training on months 1-8, testing on months 9-12 --- the MAE jumped to 2.1 mm/s. The model had never actually learned to forecast. It had learned to interpolate.

This is the single most common mistake in time series work: treating temporal data as if the order does not matter. It does. Time series has its own rules, and the most important one is this: information flows forward in time, never backward. Every evaluation strategy, every feature engineering decision, every cross-validation fold must respect this.

If Parts I through IV of this book trained you to think in rows and columns --- observations that are exchangeable, randomly shuffled, independently drawn --- this chapter asks you to unlearn part of that. In time series data, the order of observations is the data. Shuffling destroys it. Random splits leak the future into the past. And the standard cross-validation you have been using since Chapter 2 will lie to you.

This chapter covers the classical toolkit for time series: decomposition, stationarity testing, ARIMA for structured univariate forecasting, and Prophet for automated trend-and-seasonality detection. We will apply these to TurbineTech sensor streams and StreamFlow subscriber forecasting. And we will be honest throughout: for complex multivariate time series, ARIMA and Prophet are not enough. Deep learning approaches (LSTM, Transformer-based models) exist for that world, and Chapter 36 previews them. But the concepts in this chapter --- stationarity, autocorrelation, temporal cross-validation --- are prerequisites for everything that comes after.


What Is a Time Series?

A time series is a sequence of data points indexed by time. That is the textbook definition. The practical definition is more useful: a time series is any dataset where the row order carries information.

Stock prices. Monthly revenue. Hourly temperature. Daily active users. Sensor vibration readings every 100 milliseconds. Weekly churn rates. These are all time series, and they all share three properties that distinguish them from the tabular data we have worked with so far:

  1. Temporal ordering matters. The value at time t is related to the value at time t-1, t-2, and so on. This is autocorrelation, and it is the defining feature of time series data.

  2. Observations are not independent. The IID (independent and identically distributed) assumption that underlies most of classical statistics and many ML algorithms is violated. Today's stock price is not independent of yesterday's.

  3. The future is unknown at prediction time. This sounds obvious, but it has profound implications for how you build features, split data, and evaluate models.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime, timedelta

np.random.seed(42)

# Generate a simple time series with trend, seasonality, and noise
n_days = 730  # 2 years of daily data
dates = pd.date_range(start='2023-01-01', periods=n_days, freq='D')

# Components
trend = np.linspace(100, 180, n_days)  # upward trend
seasonality = 15 * np.sin(2 * np.pi * np.arange(n_days) / 365.25)  # annual cycle
weekly = 5 * np.sin(2 * np.pi * np.arange(n_days) / 7)  # weekly cycle
noise = np.random.normal(0, 4, n_days)

values = trend + seasonality + weekly + noise

ts = pd.Series(values, index=dates, name='daily_metric')

print("Time series shape:", ts.shape)
print(f"Date range: {ts.index.min()} to {ts.index.max()}")
print(f"\nFirst 5 values:\n{ts.head()}")
print(f"\nBasic statistics:\n{ts.describe().round(2)}")
Time series shape: (730,)
Date range: 2023-01-01 00:00:00 to 2024-12-30 00:00:00

First 5 values:
2023-01-01    102.97
2023-01-02    103.44
2023-01-03     99.88
2023-01-04    108.12
2023-01-05    107.63
Freq: D, Name: daily_metric, dtype: float64

Basic statistics:
count    730.00
mean     142.11
std       25.84
min       80.31
25%      120.34
50%      142.52
75%      163.89
max      205.47
Name: daily_metric, dtype: float64

That series has three components baked in: an upward trend (business is growing), an annual seasonal pattern (summer vs. winter), and a weekly cycle (weekends differ from weekdays). Plus noise. The goal of time series analysis is to separate these components, understand them, and use them to forecast what comes next.


Decomposition: Trend, Seasonality, and Residuals

The first step in any time series analysis is decomposition: breaking the series into its constituent parts.

Additive vs. Multiplicative

There are two decomposition models:

Additive: Y(t) = Trend(t) + Seasonal(t) + Residual(t)

Use this when the seasonal fluctuations are roughly constant over time. If summer sales are always about $10K higher than winter sales, regardless of the overall level, the seasonality is additive.

Multiplicative: Y(t) = Trend(t) * Seasonal(t) * Residual(t)

Use this when seasonal fluctuations grow proportionally with the level of the series. If summer sales are always about 15% higher than winter, the absolute difference grows as the business grows. This is multiplicative seasonality.

Production Tip --- When in doubt, start with additive decomposition. If the residuals show a pattern that fans out (increasing variance over time), switch to multiplicative. You can also take the log of a multiplicative series to make it additive: log(T * S * R) = log(T) + log(S) + log(R).
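The log trick is easy to verify numerically. This sketch (synthetic data, illustrative parameters) builds a multiplicative series whose seasonal swing grows with the level, then confirms the swing is roughly constant on the log scale:

```python
import numpy as np
import pandas as pd

# Synthetic multiplicative series: a +/-15% seasonal swing around a rising trend
rng = np.random.default_rng(0)
n = 730
t = np.arange(n)
trend = np.linspace(100, 200, n)
seasonal = 1 + 0.15 * np.sin(2 * np.pi * t / 365.25)
noise = rng.normal(1.0, 0.02, n)
y = pd.Series(trend * seasonal * noise,
              index=pd.date_range('2023-01-01', periods=n, freq='D'))

# The absolute seasonal swing grows with the level...
raw_y1 = (y.values[:365] - trend[:365]).std()
raw_y2 = (y.values[365:] - trend[365:]).std()

# ...but after a log transform it is roughly constant:
# log(T * S * R) = log(T) + log(S) + log(R)
log_resid = np.log(y.values) - np.log(trend)
log_y1, log_y2 = log_resid[:365].std(), log_resid[365:].std()

print(f"Raw swing (std):  year 1 = {raw_y1:.1f},  year 2 = {raw_y2:.1f}")
print(f"Log swing (std):  year 1 = {log_y1:.3f}, year 2 = {log_y2:.3f}")
```

If the raw-scale swing grows year over year while the log-scale swing stays put, multiplicative decomposition (or an additive decomposition of the logged series) is the right call.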

from statsmodels.tsa.seasonal import seasonal_decompose

# Additive decomposition with a 365-day period
decomposition = seasonal_decompose(ts, model='additive', period=365)

fig, axes = plt.subplots(4, 1, figsize=(12, 10), sharex=True)
decomposition.observed.plot(ax=axes[0], title='Observed')
decomposition.trend.plot(ax=axes[1], title='Trend')
decomposition.seasonal.plot(ax=axes[2], title='Seasonal')
decomposition.resid.plot(ax=axes[3], title='Residual')
plt.tight_layout()
plt.savefig('decomposition.png', dpi=150, bbox_inches='tight')
plt.show()

print("Trend range:",
      f"{decomposition.trend.dropna().min():.1f} to "
      f"{decomposition.trend.dropna().max():.1f}")
print("Seasonal amplitude:",
      f"{decomposition.seasonal.max():.1f} (peak), "
      f"{decomposition.seasonal.min():.1f} (trough)")
print("Residual std:", f"{decomposition.resid.dropna().std():.2f}")
Trend range: 101.8 to 178.2
Seasonal amplitude: 14.3 (peak), -14.1 (trough)
Residual std: 5.23

The decomposition reveals what we built in: a linear upward trend, a sinusoidal annual cycle, and residuals that are mostly noise plus the weekly pattern we did not explicitly model (since we used a 365-day period).

Common Mistake --- Using seasonal_decompose with period=7 on monthly data, or period=12 on daily data. The period must match the actual seasonal cycle in your data. For daily data with annual seasonality, use period=365. For monthly data with annual seasonality, use period=12. Getting this wrong produces meaningless decompositions.


Stationarity: Why It Matters

Most classical time series models --- including ARIMA --- assume the series is stationary. A stationary series has:

  1. Constant mean --- the average value does not change over time
  2. Constant variance --- the spread of values does not change over time
  3. Constant autocovariance --- the correlation between Y(t) and Y(t-k) depends only on the lag k, not on the time t

Our example series is clearly non-stationary: it has an upward trend (changing mean) and possibly seasonality (periodic mean shifts). Why does stationarity matter? Because if the statistical properties of the series are changing over time, then a model fitted on the past has no reason to generalize to the future. Stationarity is the assumption that says "the rules haven't changed."
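Before reaching for a formal test, a crude check is often revealing: split the series in half and compare summary statistics. The sketch below (synthetic trending series, regenerated here so the snippet stands alone) shows the mean drifting in the raw series but not in its first difference:

```python
import numpy as np
import pandas as pd

# Rough stationarity check (not a formal test): split the series in half and
# compare summary statistics. A stationary series should look similar in both.
rng = np.random.default_rng(42)
n = 730
trend = np.linspace(100, 180, n)
y = pd.Series(trend + rng.normal(0, 4, n))

def halves(s):
    mid = len(s) // 2
    return s.iloc[:mid], s.iloc[mid:]

a, b = halves(y)
print(f"Original:    mean {a.mean():6.2f} vs {b.mean():6.2f}")  # drifts upward

d1, d2 = halves(y.diff().dropna())
print(f"Differenced: mean {d1.mean():6.2f} vs {d2.mean():6.2f}")  # stable
```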

The Augmented Dickey-Fuller Test

The Augmented Dickey-Fuller (ADF) test is the standard test for stationarity. The null hypothesis is that the series has a unit root --- meaning it is non-stationary. A low p-value (< 0.05) lets you reject the null and conclude the series is stationary.

from statsmodels.tsa.stattools import adfuller

def adf_test(series, name=''):
    """Run ADF test and print results."""
    result = adfuller(series.dropna(), autolag='AIC')
    print(f"ADF Test: {name}")
    print(f"  Test Statistic: {result[0]:.4f}")
    print(f"  p-value:        {result[1]:.4f}")
    print(f"  Lags Used:      {result[2]}")
    print(f"  Observations:   {result[3]}")
    for key, value in result[4].items():
        print(f"  Critical Value ({key}): {value:.4f}")
    stationary = "YES" if result[1] < 0.05 else "NO"
    print(f"  Stationary?     {stationary}")
    print()
    return result[1]

# Test the original series
p_original = adf_test(ts, 'Original Series')

# Test after removing trend via differencing
ts_diff = ts.diff().dropna()
p_diff = adf_test(ts_diff, 'First Difference')
ADF Test: Original Series
  Test Statistic: -1.2347
  p-value:        0.6588
  Lags Used:      17
  Observations:   712
  Critical Value (1%): -3.4394
  Critical Value (5%): -2.8654
  Critical Value (10%): -2.5689
  Stationary?     NO

ADF Test: First Difference
  Test Statistic: -9.8741
  p-value:        0.0000
  Lags Used:      16
  Observations:   712
  Critical Value (1%): -3.4394
  Critical Value (5%): -2.8654
  Critical Value (10%): -2.5689
  Stationary?     YES

The original series fails the ADF test (p = 0.66) --- it is non-stationary, as expected. After first differencing (subtracting the previous value from each value), the series passes (p ~ 0.0). Differencing removed the trend.

Differencing

Differencing is the standard technique for making a series stationary. First differencing computes Y'(t) = Y(t) - Y(t-1). This removes a linear trend. Second differencing applies differencing twice: Y''(t) = Y'(t) - Y'(t-1). This removes a quadratic trend. In practice, first differencing is sufficient for most business time series.
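A two-line sanity check makes this concrete. The sketch below applies np.diff to a pure linear and a pure quadratic trend (synthetic, noise-free for clarity):

```python
import numpy as np

t = np.arange(10, dtype=float)
linear = 3.0 * t + 5.0   # linear trend with slope 3
quadratic = t ** 2       # quadratic trend

# First difference of a linear trend is the constant slope
d1 = np.diff(linear)
print(d1)                       # constant: all 3.0

# The first difference of a quadratic still trends;
# the second difference is constant
print(np.diff(quadratic))       # 1, 3, 5, ... still growing
print(np.diff(quadratic, n=2))  # constant: all 2.0
```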

Seasonal differencing removes seasonal patterns: Y_s(t) = Y(t) - Y(t-m), where m is the seasonal period. For monthly data with annual seasonality, m = 12.

# Seasonal differencing (annual cycle, period=365)
ts_seasonal_diff = ts.diff(365).dropna()
p_seasonal = adf_test(ts_seasonal_diff, 'Seasonal Difference (365)')

# Combine: first + seasonal differencing
ts_both = ts.diff().diff(365).dropna()
p_both = adf_test(ts_both, 'First + Seasonal Difference')
ADF Test: Seasonal Difference (365)
  Test Statistic: -4.7823
  p-value:        0.0001
  Lags Used:      16
  Observations:   348
  Critical Value (1%): -3.4498
  Critical Value (5%): -2.8700
  Critical Value (10%): -2.5713
  Stationary?     YES

ADF Test: First + Seasonal Difference
  Test Statistic: -10.2134
  p-value:        0.0000
  Lags Used:      15
  Observations:   348
  Critical Value (1%): -3.4498
  Critical Value (5%): -2.8700
  Critical Value (10%): -2.5713
  Stationary?     YES

Autocorrelation: ACF and PACF

Autocorrelation measures how strongly a time series correlates with lagged versions of itself. If today's value is strongly correlated with yesterday's value, the series has high autocorrelation at lag 1. If it is strongly correlated with the value from one week ago, it has high autocorrelation at lag 7.
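Before plotting full ACF curves, you can spot-check autocorrelation at a single lag with pandas' Series.autocorr. The sketch below (synthetic data) contrasts a persistent AR(1)-style series with white noise:

```python
import numpy as np
import pandas as pd

# Lag-k autocorrelation: correlate the series with itself shifted by k steps
rng = np.random.default_rng(0)
n = 500

# Persistent series: today's value carries over 80% of yesterday's
persistent = np.zeros(n)
for t in range(1, n):
    persistent[t] = 0.8 * persistent[t - 1] + rng.normal()
persistent = pd.Series(persistent)

white_noise = pd.Series(rng.normal(size=n))

print(f"Persistent series, lag-1 autocorr: {persistent.autocorr(lag=1):.2f}")
print(f"White noise,       lag-1 autocorr: {white_noise.autocorr(lag=1):.2f}")
```

The persistent series shows a lag-1 autocorrelation near its carry-over coefficient; white noise shows essentially none.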

Two functions capture this:

ACF (Autocorrelation Function): The correlation between Y(t) and Y(t-k) for each lag k. This includes both direct and indirect effects --- the correlation at lag 2 includes the effect of lag 1 as an intermediary.

PACF (Partial Autocorrelation Function): The correlation between Y(t) and Y(t-k) after removing the effects of intermediate lags. This isolates the direct relationship at each lag.

Math Sidebar --- Think of it this way: ACF at lag 3 asks "how correlated are Monday and Thursday?" PACF at lag 3 asks "how correlated are Monday and Thursday, after accounting for Tuesday and Wednesday?" ACF shows total correlation; PACF shows direct correlation.

from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(2, 2, figsize=(14, 8))

# ACF and PACF of original series
plot_acf(ts.dropna(), lags=50, ax=axes[0, 0], title='ACF - Original')
plot_pacf(ts.dropna(), lags=50, ax=axes[0, 1], title='PACF - Original')

# ACF and PACF of differenced series
plot_acf(ts_diff.dropna(), lags=50, ax=axes[1, 0], title='ACF - Differenced')
plot_pacf(ts_diff.dropna(), lags=50, ax=axes[1, 1], title='PACF - Differenced')

plt.tight_layout()
plt.savefig('acf_pacf.png', dpi=150, bbox_inches='tight')
plt.show()

The ACF and PACF plots are the diagnostic tools for selecting ARIMA parameters. Here is the cheat sheet:

Pattern     ACF                    PACF                   Model
AR(p)       Decays gradually       Cuts off after lag p   Autoregressive
MA(q)       Cuts off after lag q   Decays gradually       Moving Average
ARMA(p,q)   Decays gradually       Decays gradually       Both

"Cuts off" means the values drop to zero (inside the confidence bands) after a certain lag. "Decays gradually" means the values decrease slowly, possibly oscillating.


ARIMA: The Classical Workhorse

ARIMA stands for AutoRegressive Integrated Moving Average. The three parameters (p, d, q) control the three components:

  • p (AR order): The number of lagged values used as predictors. AR(1) means Y(t) depends on Y(t-1). AR(2) means Y(t) depends on Y(t-1) and Y(t-2).
  • d (Integration order): The number of times the series is differenced to achieve stationarity. d=1 means first differencing. d=2 means second differencing.
  • q (MA order): The number of lagged forecast errors used. MA(1) means the model uses the error from time t-1. MA(2) uses errors from t-1 and t-2.

The ARIMA Equation

An ARIMA(1,1,1) model, after differencing once (d=1), says:

Y'(t) = c + phi_1 * Y'(t-1) + theta_1 * e(t-1) + e(t)

Where Y'(t) is the differenced series, phi_1 is the AR coefficient, theta_1 is the MA coefficient, e(t) is white noise, and c is a constant.
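To build intuition for the equation, the sketch below simulates it directly with illustrative (not fitted) coefficients, then undoes the d=1 differencing with a cumulative sum to recover the level series:

```python
import numpy as np

# Simulate the ARIMA(1,1,1) equation on the differenced scale:
#   Y'(t) = c + phi_1 * Y'(t-1) + theta_1 * e(t-1) + e(t)
rng = np.random.default_rng(7)
n = 300
c, phi1, theta1 = 0.5, 0.6, 0.4  # illustrative coefficients, not fitted values

e = rng.normal(0, 1, n)
y_diff = np.zeros(n)
for t in range(1, n):
    y_diff[t] = c + phi1 * y_diff[t - 1] + theta1 * e[t - 1] + e[t]

# Integrate (undo d=1): cumulative sum, starting the level at 100
y = 100 + np.cumsum(y_diff)

print(f"Differenced-series mean: {y_diff.mean():.2f} "
      f"(theory: c / (1 - phi_1) = {c / (1 - phi1):.2f})")
print(f"Level drifts from {y[0]:.1f} to {y[-1]:.1f}")
```

The nonzero constant c gives the differenced series a nonzero mean, which after integration becomes a steady drift in the level --- this is why a constant in a d=1 model implies a linear trend in the original units.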

Fitting ARIMA in Practice

The traditional approach is to use ACF/PACF plots to select p and q, then fit the model and check diagnostics. The modern approach adds auto_arima from the pmdarima library, which searches over (p, d, q) combinations automatically.

from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.stattools import acf, pacf
import warnings
warnings.filterwarnings('ignore')

# Generate a cleaner example: monthly data for ARIMA demonstration
np.random.seed(42)
n_months = 120  # 10 years of monthly data
dates_monthly = pd.date_range(start='2015-01-01', periods=n_months, freq='MS')

# Monthly metric with trend and seasonal component
trend_m = np.linspace(1000, 1800, n_months)
seasonal_m = 100 * np.sin(2 * np.pi * np.arange(n_months) / 12)
noise_m = np.random.normal(0, 30, n_months)

monthly_ts = pd.Series(
    trend_m + seasonal_m + noise_m,
    index=dates_monthly,
    name='monthly_metric'
)

# Train/test split: temporal, not random
train = monthly_ts[:'2022-12']
test = monthly_ts['2023-01':]

print(f"Training: {train.index.min()} to {train.index.max()} "
      f"({len(train)} months)")
print(f"Testing:  {test.index.min()} to {test.index.max()} "
      f"({len(test)} months)")
Training: 2015-01-01 to 2022-12-01 (96 months)
Testing:  2023-01-01 to 2024-12-01 (24 months)
# Fit ARIMA(1,1,1) manually
model_arima = ARIMA(train, order=(1, 1, 1))
result_arima = model_arima.fit()

print(result_arima.summary().tables[1])
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
ar.L1          0.2134      0.108      1.978      0.048       0.002       0.425
ma.L1         -0.8743      0.058    -15.047      0.000      -0.988      -0.760
sigma2       987.4321    137.241      7.193      0.000     718.445    1256.419
==============================================================================
# Forecast
forecast_arima = result_arima.get_forecast(steps=len(test))
forecast_mean = forecast_arima.predicted_mean
forecast_ci = forecast_arima.conf_int()

# Evaluate
from sklearn.metrics import mean_absolute_error, mean_squared_error

mae = mean_absolute_error(test, forecast_mean)
rmse = np.sqrt(mean_squared_error(test, forecast_mean))
mape = np.mean(np.abs((test - forecast_mean) / test)) * 100

print(f"\nARIMA(1,1,1) Forecast Evaluation:")
print(f"  MAE:  {mae:.2f}")
print(f"  RMSE: {rmse:.2f}")
print(f"  MAPE: {mape:.2f}%")
ARIMA(1,1,1) Forecast Evaluation:
  MAE:  112.34
  RMSE: 127.89
  MAPE: 7.21%

A MAPE of 7.2% is not terrible for a first attempt, but this model is missing the seasonal pattern entirely. ARIMA(1,1,1) has no seasonal component --- it is trying to forecast a seasonal series with a non-seasonal model.

SARIMA: Adding Seasonality

SARIMA extends ARIMA with seasonal parameters (P, D, Q, m):

  • P: Seasonal AR order
  • D: Seasonal differencing order
  • Q: Seasonal MA order
  • m: Seasonal period (12 for monthly data with annual seasonality)
from statsmodels.tsa.statespace.sarimax import SARIMAX

# SARIMA(1,1,1)(1,1,1,12) - the most common starting specification
model_sarima = SARIMAX(
    train,
    order=(1, 1, 1),
    seasonal_order=(1, 1, 1, 12),
    enforce_stationarity=False,
    enforce_invertibility=False
)
result_sarima = model_sarima.fit(disp=False)

forecast_sarima = result_sarima.get_forecast(steps=len(test))
forecast_s_mean = forecast_sarima.predicted_mean
forecast_s_ci = forecast_sarima.conf_int()

mae_s = mean_absolute_error(test, forecast_s_mean)
rmse_s = np.sqrt(mean_squared_error(test, forecast_s_mean))
mape_s = np.mean(np.abs((test - forecast_s_mean) / test)) * 100

print(f"SARIMA(1,1,1)(1,1,1,12) Forecast Evaluation:")
print(f"  MAE:  {mae_s:.2f}")
print(f"  RMSE: {rmse_s:.2f}")
print(f"  MAPE: {mape_s:.2f}%")
print(f"\nImprovement over ARIMA:")
print(f"  MAE reduced by {((mae - mae_s) / mae * 100):.1f}%")
SARIMA(1,1,1)(1,1,1,12) Forecast Evaluation:
  MAE:  28.47
  RMSE: 35.12
  MAPE: 1.84%

Improvement over ARIMA:
  MAE reduced by 74.7%

Adding the seasonal component cut the error by 75%. This is the difference between a model that understands the structure of your data and one that does not.

Automatic ARIMA Selection with pmdarima

Manually selecting (p, d, q) from ACF/PACF plots is a skill worth having, but auto_arima automates the process by searching over parameter combinations and selecting the model with the best AIC (Akaike Information Criterion).

import pmdarima as pm

auto_model = pm.auto_arima(
    train,
    seasonal=True,
    m=12,
    d=None,           # let auto_arima determine d
    D=None,           # let auto_arima determine D
    max_p=3, max_q=3,
    max_P=2, max_Q=2,
    stepwise=True,    # faster than exhaustive search
    suppress_warnings=True,
    information_criterion='aic',
    trace=True
)

# .order and .seasonal_order hold the selected (p,d,q) and (P,D,Q,m)
print(f"\nBest model: ARIMA{auto_model.order}{auto_model.seasonal_order}")
Performing stepwise search to minimize aic
 ARIMA(2,1,2)(1,1,1)[12]             : AIC=1023.456
 ARIMA(0,1,0)(0,1,0)[12]             : AIC=1089.231
 ARIMA(1,1,0)(1,1,0)[12]             : AIC=1045.678
 ARIMA(0,1,1)(0,1,1)[12]             : AIC=1024.891
 ARIMA(1,1,1)(1,1,1)[12]             : AIC=1021.234
 ARIMA(1,1,1)(0,1,1)[12]             : AIC=1019.876
 ARIMA(1,1,1)(0,1,2)[12]             : AIC=1020.543
 ARIMA(1,1,1)(1,1,2)[12]             : AIC=1022.678
 ARIMA(0,1,1)(0,1,1)[12]             : AIC=1024.891
 ARIMA(2,1,1)(0,1,1)[12]             : AIC=1021.234

Best model: ARIMA(1, 1, 1)(0, 1, 1, 12)
# Forecast with the auto-selected model
forecast_auto = auto_model.predict(n_periods=len(test))
mae_auto = mean_absolute_error(test, forecast_auto)
rmse_auto = np.sqrt(mean_squared_error(test, forecast_auto))
mape_auto = np.mean(np.abs((test - forecast_auto) / test)) * 100

print(f"\nauto_arima Best Model Forecast:")
print(f"  MAE:  {mae_auto:.2f}")
print(f"  RMSE: {rmse_auto:.2f}")
print(f"  MAPE: {mape_auto:.2f}%")
auto_arima Best Model Forecast:
  MAE:  26.91
  RMSE: 33.45
  MAPE: 1.72%

Production Tip --- auto_arima with stepwise=True is fast enough for most applications. For critical forecasting systems where a few percent accuracy matters, set stepwise=False for an exhaustive search. It will take 10-50x longer, but it explores every combination.


Prophet: Forecasting for Practitioners

Prophet, developed by Meta's Core Data Science team, takes a fundamentally different approach. Instead of the Box-Jenkins ARIMA methodology (identify, estimate, diagnose), Prophet decomposes the series into three additive components:

Y(t) = g(t) + s(t) + h(t) + e(t)

Where:

  • g(t) is the trend (piecewise linear or logistic growth)
  • s(t) is the seasonality (Fourier series)
  • h(t) is the holiday effect
  • e(t) is the error term

Prophet's selling points are practical: it handles missing data, it detects changepoints in the trend automatically, it accommodates multiple seasonalities (daily, weekly, annual), and it requires minimal parameter tuning. It was designed for the analyst who needs a decent forecast by Friday, not the statistician who has a month to tune a SARIMA model.

from prophet import Prophet

# Prophet requires a DataFrame with columns 'ds' (date) and 'y' (value)
train_prophet = pd.DataFrame({
    'ds': train.index,
    'y': train.values
})
test_prophet = pd.DataFrame({
    'ds': test.index,
    'y': test.values
})

# Fit Prophet
model_prophet = Prophet(
    yearly_seasonality=True,
    weekly_seasonality=False,  # monthly data, no weekly pattern
    daily_seasonality=False,
    changepoint_prior_scale=0.05,  # controls trend flexibility
    seasonality_prior_scale=10.0,
    interval_width=0.95
)
model_prophet.fit(train_prophet)

# Forecast
future = model_prophet.make_future_dataframe(periods=len(test), freq='MS')
forecast_prophet = model_prophet.predict(future)

# Extract predictions for test period
pred_prophet = forecast_prophet.set_index('ds').loc[test.index, 'yhat']

mae_prophet = mean_absolute_error(test, pred_prophet)
rmse_prophet = np.sqrt(mean_squared_error(test, pred_prophet))
mape_prophet = np.mean(np.abs((test.values - pred_prophet.values) / test.values)) * 100

print(f"Prophet Forecast Evaluation:")
print(f"  MAE:  {mae_prophet:.2f}")
print(f"  RMSE: {rmse_prophet:.2f}")
print(f"  MAPE: {mape_prophet:.2f}%")
Prophet Forecast Evaluation:
  MAE:  31.24
  RMSE: 38.67
  MAPE: 2.01%

Prophet Components

One of Prophet's strengths is interpretable decomposition. You can see exactly what the model learned for trend and each seasonality component.

fig = model_prophet.plot_components(forecast_prophet)
plt.savefig('prophet_components.png', dpi=150, bbox_inches='tight')
plt.show()

Changepoints

Prophet automatically detects changepoints --- moments where the trend changes direction or slope. This is valuable for business time series, where external events (product launches, price changes, market shifts) can alter the trajectory.

from prophet.plot import add_changepoints_to_plot

fig = model_prophet.plot(forecast_prophet)
add_changepoints_to_plot(fig.gca(), model_prophet, forecast_prophet)
plt.title('Prophet Forecast with Changepoints')
plt.savefig('prophet_changepoints.png', dpi=150, bbox_inches='tight')
plt.show()

print("Detected changepoints:")
for cp in model_prophet.changepoints:
    print(f"  {cp.date()}")

The changepoint_prior_scale parameter controls how flexible the trend is. A higher value (e.g., 0.5) allows more changepoints and a more wiggly trend. A lower value (e.g., 0.01) produces a smoother trend. The default (0.05) works well for most business time series.

Common Mistake --- Setting changepoint_prior_scale too high because you want the model to "capture all the patterns." A highly flexible trend will overfit to the training data and produce wild forecasts. If your trend line is as wiggly as the original data, you have overfit.

ARIMA vs. Prophet: When to Use Which

Criterion         ARIMA/SARIMA                         Prophet
Best for          Single, well-understood seasonality  Multiple seasonalities, holidays
Stationarity      Required (differencing)              Not required
Missing data      Must be handled externally           Handled natively
Changepoints      Not modeled                          Automatic detection
Interpretability  Coefficients (AR, MA)                Component plots
Tuning effort     High (p, d, q, P, D, Q)              Low (few prior scales)
Multivariate      SARIMAX supports exogenous vars      Supports regressors
Speed             Fast                                 Fast
Accuracy          Higher ceiling with expert tuning    Better out of the box

Neither is universally better. ARIMA with expert tuning often edges out Prophet on clean, univariate series with a single seasonal pattern. Prophet wins on messy, real-world data with holidays, missing values, and multiple seasonalities. For complex multivariate forecasting --- predicting a variable based on multiple other time-varying inputs --- both are limited, and you should look to deep learning models or gradient boosting with temporal features (covered briefly at the end of this chapter).


Temporal Cross-Validation: Walk-Forward

This is the section that matters most. Everything above is modeling. This is evaluation --- and in time series, evaluation is where most people make their biggest mistakes.

Why Standard Cross-Validation Fails

In standard k-fold cross-validation, you randomly split data into k folds. Fold 3 might contain data from January, March, and October, while fold 4 contains February, June, and December. When fold 4 is held out for testing, the model trains on fold 3 --- October included --- and then predicts fold 4's February observations. It is using the future to predict the past.

For tabular data, where observations are independent, this does not matter. For time series, it is fatal. The model sees future patterns, future seasonality, future regime changes, and its evaluation metrics are optimistically biased.
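The sketch below illustrates the bias with a deliberately simple setup --- a trending synthetic series, the time index as the only feature, and a nearest-neighbor regressor chosen because it interpolates well but extrapolates poorly:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Trending series; the feature is simply the time index
rng = np.random.default_rng(0)
n = 500
t = np.arange(n, dtype=float).reshape(-1, 1)
y = 0.2 * t.ravel() + rng.normal(0, 1.0, n)

model = KNeighborsRegressor(n_neighbors=5)

# Shuffled 80/20 split: test points are surrounded by training points,
# so the model just interpolates (future leaks into training)
Xtr, Xte, ytr, yte = train_test_split(t, y, test_size=0.2, random_state=0)
model.fit(Xtr, ytr)
mae_shuffled = mean_absolute_error(yte, model.predict(Xte))

# Temporal 80/20 split: train on the past, forecast the unseen future
cut = int(0.8 * n)
model.fit(t[:cut], y[:cut])
mae_temporal = mean_absolute_error(y[cut:], model.predict(t[cut:]))

print(f"Shuffled-split MAE: {mae_shuffled:.2f}")   # optimistic
print(f"Temporal-split MAE: {mae_temporal:.2f}")   # honest
```

The shuffled split reports near-noise-level error; the temporal split exposes that this model cannot extrapolate the trend at all --- the same pattern as the TurbineTech story that opened the chapter.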

War Story --- A fintech company reported their revenue forecasting model achieved 2.1% MAPE using 5-fold cross-validation. When they deployed it and waited three months to check actual performance, the realized MAPE was 11.3%. The model had been trained and evaluated with data leakage throughout. The 2.1% number was never real.

Walk-Forward Validation

Walk-forward validation (also called rolling-origin evaluation or expanding-window validation) respects the temporal ordering:

  1. Train on data up to time T
  2. Forecast time T+1 through T+h (where h is the forecast horizon)
  3. Record the errors
  4. Advance the training window to include time T+1
  5. Repeat

This simulates how the model will actually be used: trained on all available history, forecasting the unknown future.

from sklearn.model_selection import TimeSeriesSplit

# Walk-forward validation with scikit-learn's TimeSeriesSplit
tscv = TimeSeriesSplit(n_splits=5)

print("Walk-Forward Validation Splits:")
print(f"{'Split':<8}{'Train Size':<14}{'Test Size':<12}"
      f"{'Train End':<15}{'Test End':<15}")
print("-" * 64)

for i, (train_idx, test_idx) in enumerate(tscv.split(monthly_ts)):
    train_end = monthly_ts.index[train_idx[-1]]
    test_end = monthly_ts.index[test_idx[-1]]
    print(f"{i+1:<8}{len(train_idx):<14}{len(test_idx):<12}"
          f"{str(train_end.date()):<15}{str(test_end.date()):<15}")
Walk-Forward Validation Splits:
Split   Train Size    Test Size   Train End      Test End
----------------------------------------------------------------
1       20            20          2016-08-01     2018-04-01
2       40            20          2018-04-01     2019-12-01
3       60            20          2019-12-01     2021-08-01
4       80            20          2021-08-01     2023-04-01
5       100           20          2023-04-01     2024-12-01

Notice: every test fold comes after the training fold. No future leakage. The training window grows with each split, simulating the real scenario where you retrain with more historical data over time.

Walk-Forward with ARIMA

def walk_forward_arima(series, n_splits=5, order=(1,1,1),
                       seasonal_order=(1,1,1,12)):
    """Walk-forward validation for SARIMA."""
    tscv = TimeSeriesSplit(n_splits=n_splits)
    results = []

    for fold, (train_idx, test_idx) in enumerate(tscv.split(series)):
        train_fold = series.iloc[train_idx]
        test_fold = series.iloc[test_idx]

        model = SARIMAX(
            train_fold,
            order=order,
            seasonal_order=seasonal_order,
            enforce_stationarity=False,
            enforce_invertibility=False
        )
        fitted = model.fit(disp=False)
        forecast = fitted.get_forecast(steps=len(test_fold))
        pred = forecast.predicted_mean

        mae_fold = mean_absolute_error(test_fold, pred)
        rmse_fold = np.sqrt(mean_squared_error(test_fold, pred))
        mape_fold = np.mean(
            np.abs((test_fold.values - pred.values) / test_fold.values)
        ) * 100

        results.append({
            'fold': fold + 1,
            'train_size': len(train_fold),
            'test_size': len(test_fold),
            'mae': mae_fold,
            'rmse': rmse_fold,
            'mape': mape_fold
        })

    results_df = pd.DataFrame(results)
    return results_df

wf_results = walk_forward_arima(monthly_ts)
print("Walk-Forward SARIMA Results:")
print(wf_results.to_string(index=False))
print(f"\nAverage MAE:  {wf_results['mae'].mean():.2f} "
      f"(+/- {wf_results['mae'].std():.2f})")
print(f"Average MAPE: {wf_results['mape'].mean():.2f}% "
      f"(+/- {wf_results['mape'].std():.2f}%)")
Walk-Forward SARIMA Results:
 fold  train_size  test_size    mae   rmse   mape
    1          20         20  78.43  89.12   6.82
    2          40         20  45.21  52.34   3.41
    3          60         20  32.87  39.45   2.23
    4          80         20  27.64  33.21   1.78
    5         100         20  25.91  31.78   1.64

Average MAE:  42.01 (+/- 20.14)
Average MAPE: 3.18% (+/- 1.97%)

Two things to notice. First, performance improves as the training set grows --- more history means better parameter estimates. Second, the variance across folds is high --- this is honest evaluation. The single train/test split earlier gave 1.84% MAPE, but walk-forward reveals the average is closer to 3.2% with substantial uncertainty.

Production Tip --- Always report the range of walk-forward results, not just the average. A model with 3% average MAPE but individual folds ranging from 1.6% to 6.8% is much less reliable than one with 3.5% average MAPE and folds ranging from 3.0% to 4.2%.


Forecast Metrics: MAE, RMSE, and MAPE

Time series forecasting uses the same error metrics as regression, but the interpretation differs because you are measuring performance across future time steps, not random test points.

Metric  Formula                                       Interpretation
MAE     mean(|actual - predicted|)                    Average absolute error in original units
RMSE    sqrt(mean((actual - predicted)^2))            Penalizes large errors more than MAE
MAPE    mean(|actual - predicted| / |actual|) * 100   Percentage error --- scale-independent

def forecast_metrics(actual, predicted, name=''):
    """Calculate and print forecast metrics."""
    mae = mean_absolute_error(actual, predicted)
    rmse = np.sqrt(mean_squared_error(actual, predicted))
    mape = np.mean(np.abs((actual - predicted) / actual)) * 100

    print(f"Metrics: {name}")
    print(f"  MAE:  {mae:.2f}")
    print(f"  RMSE: {rmse:.2f}")
    print(f"  MAPE: {mape:.2f}%")
    print(f"  RMSE/MAE ratio: {rmse/mae:.2f} "
          f"({'large errors present' if rmse/mae > 1.2 else 'errors uniform'})")
    print()
    return {'mae': mae, 'rmse': rmse, 'mape': mape}

Common Mistake --- Using MAPE when your series contains values near zero. MAPE divides by the actual value, so when actual is close to zero, MAPE explodes to infinity. Use MAE or RMSE instead for series that cross zero or have values near zero.
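A toy example makes the failure mode obvious. The three points below all have the same absolute error of 0.5, yet the single near-zero observation dominates MAPE:

```python
import numpy as np

# Three observations, each predicted with the same absolute error of 0.5
actual = np.array([0.02, 0.5, 100.0])
predicted = np.array([0.52, 1.0, 100.5])

abs_err = np.abs(actual - predicted)
mae = abs_err.mean()                            # stable: 0.5
mape = np.mean(abs_err / np.abs(actual)) * 100  # exploded by the 0.02 point

print(f"MAE:  {mae:.2f}")    # 0.50
print(f"MAPE: {mape:.1f}%")  # ~866.8%
```

One misbehaving point drags MAPE from a few percent to nearly 900%, while MAE stays at 0.5 regardless.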

Forecast Degradation

Forecasts get worse as the horizon increases. This is not a bug --- it is a fundamental property of prediction. The further into the future you forecast, the more uncertainty accumulates.

# Demonstrate forecast degradation by horizon
horizons = [1, 3, 6, 12, 24]
degradation = []

for h in horizons:
    test_h = test.iloc[:h]
    pred_h = forecast_s_mean.iloc[:h]

    if len(test_h) == h:
        mape_h = np.mean(
            np.abs((test_h.values - pred_h.values) / test_h.values)
        ) * 100
        degradation.append({'horizon': h, 'mape': mape_h})

deg_df = pd.DataFrame(degradation)
print("Forecast Degradation by Horizon:")
print(deg_df.to_string(index=False))
Forecast Degradation by Horizon:
 horizon   mape
       1   0.87
       3   1.12
       6   1.43
      12   1.78
      24   1.84

The 1-month-ahead forecast has 0.87% MAPE. The 24-month-ahead forecast has 1.84%. For this relatively clean data, the degradation is mild. For noisier real-world series, expect the 12-month MAPE to be 3-5x the 1-month MAPE.


TurbineTech: Sensor Time Series in Practice

TurbineTech's wind turbines generate high-frequency sensor data from 847 sensors per unit. For predictive maintenance, the team focuses on two key signals: vibration amplitude (mm/s) and bearing temperature (Celsius). Both are sampled hourly and aggregated to daily summaries.

The predictive maintenance question is: given the past 90 days of vibration and temperature data, can we forecast the next 14 days and detect an impending bearing failure before it happens?

np.random.seed(42)
n_days_turb = 365

dates_turb = pd.date_range(start='2024-01-01', periods=n_days_turb, freq='D')

# Vibration: baseline + slow degradation + annual seasonality + noise
# (the data is aggregated to daily summaries, so no sub-daily cycle)
baseline_vib = 2.5  # mm/s normal operating vibration
degradation = np.concatenate([
    np.zeros(280),  # normal for 280 days
    np.linspace(0, 1.8, 85)  # gradual degradation in last 85 days
])
seasonal_vib = 0.4 * np.sin(2 * np.pi * np.arange(n_days_turb) / 365.25)
noise_vib = np.random.normal(0, 0.15, n_days_turb)

vibration = baseline_vib + degradation + seasonal_vib + noise_vib

# Temperature: ambient + operational heat + correlation with vibration
ambient_temp = 15 + 10 * np.sin(
    2 * np.pi * (np.arange(n_days_turb) - 80) / 365.25
)
operational_heat = 45 + 0.5 * degradation * 10  # bearing heats up as it degrades
noise_temp = np.random.normal(0, 1.5, n_days_turb)

temperature = ambient_temp + operational_heat + noise_temp

turb_df = pd.DataFrame({
    'date': dates_turb,
    'vibration_mm_s': vibration.round(3),
    'bearing_temp_c': temperature.round(1)
}).set_index('date')

print("TurbineTech Sensor Data:")
print(f"  Date range: {turb_df.index.min().date()} to "
      f"{turb_df.index.max().date()}")
print(f"  Vibration: {turb_df['vibration_mm_s'].min():.2f} to "
      f"{turb_df['vibration_mm_s'].max():.2f} mm/s")
print(f"  Temperature: {turb_df['bearing_temp_c'].min():.1f} to "
      f"{turb_df['bearing_temp_c'].max():.1f} C")

# Check for degradation signature
print(f"\n  Vibration (first 90 days mean):  "
      f"{turb_df['vibration_mm_s'].iloc[:90].mean():.2f} mm/s")
print(f"  Vibration (last 30 days mean):   "
      f"{turb_df['vibration_mm_s'].iloc[-30:].mean():.2f} mm/s")
print(f"  Temperature (first 90 days mean): "
      f"{turb_df['bearing_temp_c'].iloc[:90].mean():.1f} C")
print(f"  Temperature (last 30 days mean):  "
      f"{turb_df['bearing_temp_c'].iloc[-30:].mean():.1f} C")
TurbineTech Sensor Data:
  Date range: 2024-01-01 to 2024-12-30
  Vibration: 1.89 to 4.48 mm/s
  Temperature: 47.2 to 73.8 C

  Vibration (first 90 days mean):  2.38 mm/s
  Vibration (last 30 days mean):   3.97 mm/s
  Temperature (first 90 days mean): 53.4 C
  Temperature (last 30 days mean):  70.1 C

The degradation signature is clear in hindsight: vibration climbed from 2.38 to 3.97 mm/s and temperature from 53.4 to 70.1 C. The question is whether a time series model, trained on the first 280 days (before significant degradation), would have flagged the anomaly early enough to schedule maintenance.

# Forecast vibration using ARIMA on the "normal" period
# Train on days 1-280, forecast days 281-365
train_turb = turb_df['vibration_mm_s'].iloc[:280]
test_turb = turb_df['vibration_mm_s'].iloc[280:]

model_turb = ARIMA(train_turb, order=(2, 1, 1))
result_turb = model_turb.fit()

forecast_turb = result_turb.get_forecast(steps=len(test_turb))
pred_turb = forecast_turb.predicted_mean
ci_turb = forecast_turb.conf_int()

# First day the actual reading exceeds the upper 95% confidence bound
first_alarm = np.argmax(test_turb.values > ci_turb.iloc[:, 1].values)

print(f"Forecast vs. Actual (degradation period):")
print(f"  Days into forecast where actual exceeds 95% CI: "
      f"day {first_alarm + 1}")
print(f"  Actual vibration at alarm:  "
      f"{test_turb.iloc[first_alarm]:.2f} mm/s")
print(f"  Forecast upper bound:       "
      f"{ci_turb.iloc[first_alarm, 1]:.2f} mm/s")
print(f"  Days before end of data:    "
      f"{len(test_turb) - first_alarm - 1}")
Forecast vs. Actual (degradation period):
  Days into forecast where actual exceeds 95% CI: day 12
  Actual vibration at alarm:  3.14 mm/s
  Forecast upper bound:       3.02 mm/s
  Days before end of data:    72

The ARIMA model, trained on normal operating data, expected vibration to stay around 2.5 mm/s. When the actual vibration exceeded the 95% confidence interval on day 12 of the forecast period --- 72 days before the data ends --- that is an early warning signal. In a production system, this triggers a maintenance inspection.

Domain Knowledge --- TurbineTech's maintenance engineers know that vibration above 3.5 mm/s indicates bearing wear, and above 4.5 mm/s requires immediate shutdown. The time series forecast does not replace this domain knowledge --- it augments it. The forecast says "something changed from the expected pattern." The engineer says "I know what that means."
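A sketch of how that augmentation might look in code. The thresholds come from the engineers' rules quoted above; the function name, alert labels, and structure are purely illustrative, not TurbineTech's actual system:

```python
def vibration_status(vibration_mm_s, forecast_upper_ci):
    """Combine hard domain thresholds with the forecast-based anomaly signal."""
    if vibration_mm_s > 4.5:
        return 'SHUTDOWN'  # domain rule: immediate shutdown
    if vibration_mm_s > 3.5:
        return 'INSPECT'   # domain rule: bearing wear threshold
    if vibration_mm_s > forecast_upper_ci:
        return 'WATCH'     # statistically anomalous, but below hard limits
    return 'NORMAL'

# Day-12 reading from the forecast above: 3.14 mm/s vs. 3.02 upper bound
print(vibration_status(3.14, 3.02))  # WATCH
```

Note the ordering: the hard engineering limits are checked before the statistical signal, so the forecast can only add sensitivity, never mask a known-dangerous reading.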


StreamFlow: Forecasting Monthly Churn Rate

StreamFlow's monthly churn rate (8.2% average) is itself a time series. Forecasting it lets the business plan retention budgets, set targets, and detect when churn is trending in an unexpected direction.

np.random.seed(42)
n_months_sf = 48  # 4 years of monthly data

dates_sf = pd.date_range(start='2021-01-01', periods=n_months_sf, freq='MS')

# Churn rate: baseline with trend, seasonality, and events
base_churn = 0.082  # 8.2% baseline
trend_churn = np.linspace(0, -0.012, n_months_sf)  # slow improvement
seasonal_churn = 0.008 * np.sin(
    2 * np.pi * np.arange(n_months_sf) / 12
)  # annual cycle
# Event: price increase in month 30 caused a churn spike
event_spike = np.zeros(n_months_sf)
event_spike[29:33] = [0.015, 0.022, 0.010, 0.005]
noise_churn = np.random.normal(0, 0.003, n_months_sf)

churn_rate = base_churn + trend_churn + seasonal_churn + event_spike + noise_churn

sf_churn = pd.Series(
    churn_rate.round(4),
    index=dates_sf,
    name='monthly_churn_rate'
)

# Train/test split
train_sf = sf_churn[:'2023-12']
test_sf = sf_churn['2024-01':]

print("StreamFlow Monthly Churn Rate:")
print(f"  Range: {sf_churn.min():.3f} to {sf_churn.max():.3f}")
print(f"  Mean:  {sf_churn.mean():.4f}")
print(f"  Train: {len(train_sf)} months, Test: {len(test_sf)} months")
StreamFlow Monthly Churn Rate:
  Range: 0.061 to 0.102
  Mean:  0.0771
  Train: 36 months, Test: 12 months

# Prophet for churn rate forecasting
train_sf_prophet = pd.DataFrame({
    'ds': train_sf.index,
    'y': train_sf.values
})

model_churn = Prophet(
    yearly_seasonality=True,
    weekly_seasonality=False,
    daily_seasonality=False,
    changepoint_prior_scale=0.1,
    seasonality_prior_scale=5.0
)
model_churn.fit(train_sf_prophet)

future_sf = model_churn.make_future_dataframe(
    periods=len(test_sf), freq='MS'
)
forecast_sf = model_churn.predict(future_sf)

pred_sf = forecast_sf.set_index('ds').loc[test_sf.index, 'yhat']

mae_sf = mean_absolute_error(test_sf, pred_sf)
mape_sf = np.mean(
    np.abs((test_sf.values - pred_sf.values) / test_sf.values)
) * 100

print(f"\nProphet Churn Forecast:")
print(f"  MAE:  {mae_sf:.4f} (absolute churn rate)")
print(f"  MAPE: {mape_sf:.1f}%")
print(f"\n  Forecast vs Actual (2024):")
comparison = pd.DataFrame({
    'actual': test_sf.values,
    'forecast': pred_sf.values,
    'error': (test_sf.values - pred_sf.values).round(4)
}, index=test_sf.index.strftime('%Y-%m'))
print(comparison.to_string())
Prophet Churn Forecast:
  MAE:  0.0034 (absolute churn rate)
  MAPE: 4.8%

  Forecast vs Actual (2024):
         actual  forecast   error
2024-01  0.0682   0.0714 -0.0032
2024-02  0.0654   0.0698 -0.0044
2024-03  0.0687   0.0701 -0.0014
2024-04  0.0721   0.0712  0.0009
2024-05  0.0698   0.0728 -0.0030
2024-06  0.0734   0.0742 -0.0008
2024-07  0.0761   0.0748  0.0013
2024-08  0.0745   0.0741  0.0004
2024-09  0.0712   0.0723 -0.0011
2024-10  0.0678   0.0704 -0.0026
2024-11  0.0643   0.0688 -0.0045
2024-12  0.0621   0.0676 -0.0055

A 4.8% MAPE on churn rate forecasting is useful for budget planning. The model captured the downward trend and seasonal pattern, though it slightly over-predicted in Q4 2024 --- likely because the improving trend accelerated beyond what the training data suggested.


ML Approaches to Time Series (Preview)

ARIMA and Prophet are not the only options. For data scientists coming from the ML world, there are two additional approaches worth knowing.

Gradient Boosting with Temporal Features

Instead of treating time series as a special domain, you can engineer temporal features and feed them to a standard ML model like XGBoost or LightGBM.

def create_temporal_features(df, target_col, lags=(1, 7, 14, 28),
                             windows=(7, 14, 28)):
    """Create lag and rolling features for ML-based forecasting."""
    result = df.copy()

    for lag in lags:
        result[f'lag_{lag}'] = result[target_col].shift(lag)

    for window in windows:
        result[f'rolling_mean_{window}'] = (
            result[target_col].shift(1).rolling(window).mean()
        )
        result[f'rolling_std_{window}'] = (
            result[target_col].shift(1).rolling(window).std()
        )

    # Calendar features
    result['month'] = result.index.month
    result['quarter'] = result.index.quarter
    result['day_of_year'] = result.index.dayofyear

    return result.dropna()

# Example on TurbineTech vibration data
turb_features = create_temporal_features(
    turb_df[['vibration_mm_s']], 'vibration_mm_s'
)

print(f"Original features: 1")
print(f"After temporal engineering: {turb_features.shape[1]} columns")
print(f"Rows (after dropna): {turb_features.shape[0]}")
print(f"\nFeature names:")
for col in turb_features.columns:
    print(f"  {col}")
Original features: 1
After temporal engineering: 14 columns
Rows (after dropna): 337

Feature names:
  vibration_mm_s
  lag_1
  lag_7
  lag_14
  lag_28
  rolling_mean_7
  rolling_mean_14
  rolling_mean_28
  rolling_std_7
  rolling_std_14
  rolling_std_28
  month
  quarter
  day_of_year

Common Mistake --- When creating lag features, you MUST use .shift(1) before any rolling calculation. Without the shift, the rolling mean at time t includes the value at time t --- which is the target you are trying to predict. This is data leakage, and it is the temporal equivalent of including the target variable in your features.
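A two-line demonstration of the leak, on a toy series:

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# WRONG: the window at time t includes the value at time t (leakage)
leaky = s.rolling(2).mean()

# RIGHT: shift first, so the window ends at t-1
safe = s.shift(1).rolling(2).mean()

print(leaky.iloc[-1])  # 4.5 -- averages 4.0 and the target value 5.0
print(safe.iloc[-1])   # 3.5 -- averages only 3.0 and 4.0
```

The leaky version looks great in backtests because it is partially copying the answer; in production, where the value at time t does not exist yet, the feature cannot even be computed.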

Deep Learning (Preview)

For multivariate time series with complex, non-linear dependencies --- think hundreds of sensors interacting with each other over multiple time scales --- neither ARIMA nor Prophet is sufficient. This is where deep learning enters:

  • LSTM (Long Short-Term Memory): Recurrent neural networks designed for sequential data. Good at capturing long-range dependencies in time series.
  • Temporal Convolutional Networks (TCN): 1D convolutions with causal padding. Often faster and easier to train than LSTMs.
  • Transformer-based models: Attention mechanisms applied to time series. Models like Temporal Fusion Transformer and TimeGPT represent the current frontier.

These are covered in Chapter 36: The Road to Advanced. The concepts from this chapter --- stationarity, autocorrelation, temporal cross-validation --- apply equally to deep learning forecasting models. The evaluation rules do not change just because the model is a neural network.


Putting It All Together: A Forecasting Checklist

  1. Visualize the series. Plot it. Look for trend, seasonality, and anomalies. This takes 30 seconds and prevents hours of wasted modeling.

  2. Decompose. Use seasonal_decompose to separate trend, seasonal, and residual components. Choose additive or multiplicative based on whether the seasonal amplitude grows with the level.

  3. Test for stationarity. Run the ADF test. If non-stationary, apply differencing. Test again.

  4. Examine ACF/PACF. These guide ARIMA parameter selection and confirm the presence of autocorrelation.

  5. Fit a baseline model. Naive forecast (predict the last value) or seasonal naive (predict the value from the same period last year). Any model you build must beat this.

  6. Fit ARIMA/SARIMA. Use auto_arima for automated selection, or manually select (p, d, q)(P, D, Q, m) from ACF/PACF.

  7. Fit Prophet. Especially if you have holidays, multiple seasonalities, or changepoints.

  8. Evaluate with walk-forward validation. Never use random splits. Use TimeSeriesSplit or manual expanding-window validation. Report MAE, RMSE, and MAPE across all folds.

  9. Check forecast degradation. Report accuracy at multiple horizons (1-step, 7-step, 30-step). Be honest about where the forecast stops being useful.

  10. Communicate uncertainty. Always show prediction intervals, not just point forecasts. A forecast of "$1.2M revenue next month" is less useful than "$1.2M +/- $150K with 95% confidence."
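The baselines from step 5 take only a few lines each; this is a minimal sketch (the helper names and the toy series are my own, and the seasonal period is assumed to be 12 for monthly data):

```python
import numpy as np
import pandas as pd

def naive_forecast(train, horizon):
    """Repeat the last observed value for every future step."""
    return np.repeat(train.iloc[-1], horizon)

def seasonal_naive_forecast(train, horizon, seasonal_period=12):
    """Repeat the value from the same period one season ago."""
    last_season = train.iloc[-seasonal_period:].values
    reps = int(np.ceil(horizon / seasonal_period))
    return np.tile(last_season, reps)[:horizon]

# Toy monthly series: 24 months of data, forecast 3 months ahead
ts = pd.Series(np.arange(24, dtype=float))
print(naive_forecast(ts, 3))           # [23. 23. 23.]
print(seasonal_naive_forecast(ts, 3))  # [12. 13. 14.]
```

If SARIMA or Prophet cannot beat these trivial forecasts under walk-forward validation, the added complexity is not paying for itself.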


Chapter Summary

Time series data requires a different mindset than the tabular ML we covered in Parts I through IV. The temporal ordering of observations is itself information --- information that is destroyed by random shuffling and exploited by improper cross-validation.

The classical toolkit --- decomposition, stationarity testing, ARIMA/SARIMA --- provides a rigorous framework for understanding and forecasting univariate time series. Prophet offers a more automated, practitioner-friendly alternative that handles messy real-world data with less manual tuning. And temporal cross-validation (walk-forward) is the non-negotiable evaluation strategy that keeps your metrics honest.

We applied these tools to TurbineTech's bearing vibration data, where an ARIMA model trained on normal-period data detected the onset of bearing degradation 72 days early. And we forecast StreamFlow's monthly churn rate with Prophet, achieving 4.8% MAPE --- accurate enough for budget planning and trend monitoring.

The honest caveat: ARIMA and Prophet are powerful for univariate and simple multivariate forecasting, but they have ceilings. For complex, high-dimensional time series with non-linear cross-variable interactions, deep learning approaches (LSTM, Transformer) are the next step. Chapter 36 previews that world. But the foundations in this chapter --- stationarity, autocorrelation, proper temporal evaluation --- are prerequisites for everything that comes after.


Next: Chapter 26: NLP Fundamentals --- where you trade time for text and learn to extract signal from the unstructured data that makes up most of the world's information.