Chapter 16: Time Series Forecasting

"The point of forecasting isn't to be right. It's to be usefully wrong — to quantify uncertainty so that decision-makers can plan for ranges, not points."

— Professor Diane Okonkwo, MBA 7620


When the Forecast Was Wrong Every Single Month

Professor Okonkwo projects a chart onto the screen. Two lines — one blue, one orange — trace across twelve months. The blue line represents Athena Retail Group's monthly demand forecast for their top-selling athletic footwear category. The orange line represents actual sales.

The two lines never touch. Not once. In January, the forecast was 12 percent too high. In March, it was 8 percent too low. June was off by 15 percent — a miss that left warehouses overstocked heading into a critical promotional period. October underestimated demand by 11 percent, causing stockouts during back-to-school season.

"This forecast," Okonkwo says, letting the chart speak for itself, "was wrong every single month."

She lets the silence hold for three seconds. Tom Kowalski, who built regression-based forecasting models in his fintech days, shifts in his seat. He has seen charts like this before.

"But was it useful?" Okonkwo asks.

She clicks, and a shaded band appears around the blue line — the 80 percent prediction interval. The orange line of actual sales falls within that band for ten of the twelve months. Eighty-three percent coverage on an 80 percent interval.

"Actual sales fell within the forecast's uncertainty interval 83 percent of the time, close to what an 80 percent interval should deliver." She looks around the room. "The point forecast was wrong every month. The interval forecast was well-calibrated. The supply chain team used those intervals to set safety stock levels, negotiate with suppliers for flexible delivery windows, and build contingency plans for the upper and lower bounds. That planning saved Athena $4.2 million in the fiscal year."

NK Adeyemi types: Forecast wrong = okay, if the wrongness is quantified?

"Why can't we just tell the supply chain team exactly how many units to order?" NK asks aloud.

Okonkwo nods as if she had been waiting for the question. "Because the future is uncertain, and pretending otherwise is more dangerous than admitting it. A point forecast of 12,000 units gives the supply chain team a false sense of precision. A forecast of 10,500 to 13,800 units with 80 percent confidence gives them the information they actually need to make decisions — how much safety stock to hold, what the worst-case inventory carrying cost looks like, and when to trigger contingency orders."

She advances to the next slide: a quote attributed to statistician George Box.

All models are wrong, but some are useful.

"This chapter," Okonkwo says, "is about building useful models of the future. Not accurate crystal balls — those don't exist. Useful probabilistic models that help decision-makers plan for uncertainty."

Tom writes in his notebook: Time series != regression with a date column. Different structure, different rules.

He is about to learn just how different.


What Makes Time Series Data Special

Every dataset we have encountered so far in this textbook — customer churn in Chapter 7, demand prediction in Chapter 8, customer segments in Chapter 9 — could, in principle, have its rows shuffled without destroying the fundamental relationships. A customer's attributes predict churn regardless of whether that customer appears first or last in the dataset.

Time series data is different. The order matters. Shuffling the rows destroys the most important information: the temporal structure. The value at time t is related to the value at time t - 1, and to t - 2, and to t - 52 (if you are looking at weekly data with annual seasonality). These temporal dependencies are the entire point.

Definition: A time series is a sequence of data points indexed by time, where the ordering carries information. The fundamental assumption of time series analysis is that past patterns contain information about future patterns — and that the data-generating process has some degree of regularity.
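To see this concretely, the following sketch (synthetic data, numpy only) compares the lag-1 autocorrelation of a series before and after shuffling its rows. Shuffling destroys the temporal signal that forecasting methods depend on:

```python
import numpy as np

rng = np.random.default_rng(42)

# Build a series where each value depends on the previous one
n = 500
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.8 * y[t - 1] + rng.normal(0, 1)

def lag1_autocorr(series):
    """Correlation between the series and itself shifted by one step."""
    return np.corrcoef(series[:-1], series[1:])[0, 1]

# The ordered series carries strong temporal structure (close to 0.8)
print(f"Lag-1 autocorrelation, original: {lag1_autocorr(y):.2f}")

# Shuffling the rows destroys it (close to zero)
shuffled = rng.permutation(y)
print(f"Lag-1 autocorrelation, shuffled: {lag1_autocorr(shuffled):.2f}")
```

The attributes of each row are unchanged by the shuffle; only the ordering is lost, and with it essentially all of the predictive information.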

The Four Components of a Time Series

Any business time series can be decomposed into four components. Understanding these components is essential for choosing the right forecasting method and interpreting the results correctly.

1. Trend. The long-term direction of the series. Is demand growing, shrinking, or flat over months and years? Athena's athletic footwear sales have shown a steady upward trend of approximately 6 percent annually over the past five years, driven by the broader athleisure movement and store expansion.

2. Seasonality. Regular, predictable patterns that repeat over known time periods. Retail sales exhibit strong weekly seasonality (higher on weekends), monthly seasonality (higher after paydays), and annual seasonality (holiday peaks in November-December, back-to-school in August-September). Seasonality is defined by a fixed and known period — the pattern repeats every 7 days, every 12 months, or every 52 weeks.

3. Cyclicality. Longer-term fluctuations that are not tied to a fixed calendar period. Business cycles, fashion cycles, and economic expansions and contractions create cyclical patterns that typically span multiple years. Unlike seasonality, cyclical patterns do not have a fixed period — a recession might last 18 months or 36 months.

Caution

Students frequently confuse seasonality and cyclicality. The distinction is practical: seasonal patterns have a fixed, known period (every December, every Saturday), while cyclical patterns have variable, unknown periods (recessions happen, but not on a schedule). Most business forecasting focuses on trend and seasonality because they are predictable. Cyclical patterns are real but much harder to forecast.

4. Noise (Residuals). The irregular, unpredictable variation that remains after removing trend, seasonality, and cyclical effects. Noise includes random demand fluctuations, one-time events, measurement errors, and anything else that is genuinely unpredictable. A good forecasting model captures as much of the trend and seasonality as possible while accepting that noise cannot be forecast.

Additive vs. Multiplicative Decomposition

These four components can combine in two ways:

Additive: Value = Trend + Seasonality + Noise. The seasonal swings are roughly constant in absolute terms. If December sales are always about 5,000 units higher than the annual average, regardless of whether the average is 20,000 or 40,000, the pattern is additive.

Multiplicative: Value = Trend × Seasonality × Noise. The seasonal swings grow proportionally with the trend. If December sales are always about 25 percent higher than the annual average, meaning the absolute swing grows as the business grows, the pattern is multiplicative.

Business Insight: Most retail and consumer demand series are multiplicative — the holiday bump gets bigger as the business grows. A forecasting model that assumes additive seasonality when the reality is multiplicative will systematically underestimate peaks and overestimate troughs as the business scales. When in doubt about which decomposition to use, plot the seasonal swings over time and check whether they are roughly constant (additive) or growing (multiplicative).

Decomposition in Python

Decomposition is the first step in any time series analysis. It tells you what you are working with before you start modeling.

import pandas as pd
import numpy as np
from statsmodels.tsa.seasonal import seasonal_decompose
import matplotlib.pyplot as plt

# Load Athena's weekly sales data (or generate synthetic data)
np.random.seed(42)
dates = pd.date_range(start='2022-01-01', periods=156, freq='W')  # 3 years of weekly data

# Create components
trend = np.linspace(10000, 16000, 156)  # Steady growth
weekly_index = np.arange(156)
yearly_seasonality = 2000 * np.sin(2 * np.pi * weekly_index / 52)  # Annual cycle
noise = np.random.normal(0, 500, 156)

sales = trend + yearly_seasonality + noise
sales = np.maximum(sales, 0)  # No negative sales

df = pd.DataFrame({'date': dates, 'weekly_sales': sales})
df = df.set_index('date')

# Decompose
result = seasonal_decompose(df['weekly_sales'], model='additive', period=52)

fig, axes = plt.subplots(4, 1, figsize=(12, 10), sharex=True)
result.observed.plot(ax=axes[0], title='Observed')
result.trend.plot(ax=axes[1], title='Trend')
result.seasonal.plot(ax=axes[2], title='Seasonal')
result.resid.plot(ax=axes[3], title='Residual')
plt.tight_layout()
plt.savefig('decomposition.png', dpi=150, bbox_inches='tight')
plt.show()

Code Explanation: We generate three years of synthetic weekly sales data with an upward trend, annual seasonality, and random noise. seasonal_decompose from statsmodels separates these components, producing four subplots. The trend subplot shows the long-term growth trajectory. The seasonal subplot shows the repeating annual cycle. The residual subplot shows whatever the model could not explain — ideally, this looks like random noise with no remaining pattern.


Stationarity: The Assumption That Makes Everything Work

Before we can apply most classical forecasting methods, we need to understand a concept that sounds technical but has a very practical meaning: stationarity.

Definition: A time series is stationary if its statistical properties — mean, variance, and autocorrelation — do not change over time. In plain English, a stationary series "looks the same" regardless of which time window you examine. It has no trend, no changing variance, and no seasonal pattern.

Most real-world business time series are not stationary. Sales grow over time (changing mean). Revenue volatility increases as a company scales (changing variance). Demand peaks every holiday season (seasonality). These are precisely the features that make business data interesting — and they are precisely the features that violate the stationarity assumption.

Why does stationarity matter? Because the mathematics of ARIMA and related models assume that the statistical patterns in the training data will persist into the future. If the data is non-stationary — if the mean is drifting upward, for example — then the model is fitting to a moving target. Making the data stationary through transformation is like finding a stable foundation before building a house.

Making a Series Stationary: Differencing

The most common technique for achieving stationarity is differencing — computing the change from one period to the next rather than modeling the raw values.

If your weekly sales are 10,000, 10,500, 10,200, 11,000, the first differences are +500, -300, +800. The original series might have an upward trend (non-stationary), but the differences might fluctuate around a stable mean (stationary).

First differencing removes a linear trend. Second differencing (differencing the differences) removes a quadratic trend. Seasonal differencing (subtracting the value from the same period last year) removes a seasonal pattern.

# First differencing to remove trend
df['sales_diff'] = df['weekly_sales'].diff()

# Seasonal differencing to remove annual seasonality (period = 52 weeks)
df['sales_seasonal_diff'] = df['weekly_sales'].diff(52)

fig, axes = plt.subplots(3, 1, figsize=(12, 8), sharex=True)
df['weekly_sales'].plot(ax=axes[0], title='Original Series (Non-Stationary)')
df['sales_diff'].plot(ax=axes[1], title='First Difference (Trend Removed)')
df['sales_seasonal_diff'].dropna().plot(ax=axes[2], title='Seasonal Difference (Seasonality Removed)')
plt.tight_layout()
plt.show()

Business Insight: You do not need to master the statistical tests for stationarity (Augmented Dickey-Fuller, KPSS) to use time series forecasting effectively. Instead, develop visual intuition: plot the series and ask two questions. First, does the mean appear to be shifting over time? If yes, apply first differencing. Second, does a seasonal pattern repeat? If yes, apply seasonal differencing. Most business forecasting tools (including Prophet) handle these transformations internally.


ARIMA by Intuition

ARIMA — AutoRegressive Integrated Moving Average — sounds intimidating. It is, in fact, composed of three intuitive ideas that anyone who has observed business patterns can understand.

The AR Part: Past Values Predict Future Values

The AutoRegressive component says: "The best predictor of tomorrow's sales is today's sales, adjusted by some amount."

Think about daily traffic to a website. If yesterday you had 50,000 visitors, a reasonable guess for today is something close to 50,000, perhaps adjusted up or down based on the day of the week. An AR model formalizes this intuition: the current value is a weighted combination of recent past values.

An AR(1) model uses one past value: y(t) = constant + weight × y(t-1) + noise. An AR(2) model uses two past values. The "order" of the AR model tells you how far back it looks.

The I Part: Differencing for Stationarity

The Integrated component is simply the differencing we discussed above. An ARIMA model with I = 1 means we difference the data once before applying the AR and MA components. I = 0 means no differencing (the data is already stationary). I = 2 means double differencing.

The MA Part: Past Errors Predict Future Values

The Moving Average component says: "If the model's forecast missed in one direction yesterday, it will probably miss in the same direction today — so we should correct for that."

This is different from a "moving average" in the financial sense. In ARIMA, the MA component models the relationship between today's value and the errors (residuals) from recent predictions. If the model predicted 10,000 units yesterday but actual sales were 10,800 (an error of +800), the MA component adjusts today's forecast upward to correct for that systematic under-prediction.

An MA(1) model uses the error from one period ago. An MA(2) model uses errors from two periods ago.

Putting It Together: ARIMA(p, d, q)

ARIMA is described by three parameters:

  • p = order of the AR component (how many past values)
  • d = degree of differencing (how many times we difference)
  • q = order of the MA component (how many past errors)

An ARIMA(1, 1, 1) model says: "Difference the data once, then predict based on one past value and one past error."

For seasonal data, there is a seasonal extension called SARIMA — ARIMA(p, d, q)(P, D, Q, s) — which adds seasonal AR, differencing, and MA components with a seasonal period s.

from statsmodels.tsa.arima.model import ARIMA
import warnings
warnings.filterwarnings('ignore')

# Fit an ARIMA(1, 1, 1) model
model = ARIMA(df['weekly_sales'].dropna(), order=(1, 1, 1))
fitted = model.fit()

# Print model summary
print(fitted.summary())

# Forecast the next 12 weeks
forecast = fitted.get_forecast(steps=12)
forecast_mean = forecast.predicted_mean
confidence_intervals = forecast.conf_int(alpha=0.20)  # 80% interval

print("\n12-Week Forecast with 80% Prediction Intervals:")
for i in range(12):
    print(f"Week {i+1}: {forecast_mean.iloc[i]:,.0f} "
          f"[{confidence_intervals.iloc[i, 0]:,.0f} - "
          f"{confidence_intervals.iloc[i, 1]:,.0f}]")

Code Explanation: We fit a simple ARIMA(1,1,1) model to the weekly sales data. The get_forecast method produces both point forecasts and prediction intervals. Notice that we use an 80% prediction interval — this is the same standard Okonkwo introduced in the opening example. The intervals widen as we forecast further into the future, reflecting increasing uncertainty over longer horizons.

When ARIMA Works and When It Doesn't

ARIMA excels with short-to-medium term forecasts on relatively smooth, well-behaved series. It struggles with:

  • Multiple seasonalities (e.g., daily data with both weekly and annual patterns)
  • Many external factors (promotions, holidays, weather)
  • Structural breaks (sudden shifts in the data-generating process)
  • Very long-range forecasts (the uncertainty intervals quickly become so wide they are useless)

For these common business scenarios, we need more flexible tools.


Exponential Smoothing: The Elegant Workhorse

Before Prophet and LSTM, before ARIMA became the standard reference, exponential smoothing methods were the backbone of production forecasting systems. They remain widely used today, particularly in supply chain planning, and understanding them provides important intuition about how all forecasting methods balance responsiveness with stability.

The Core Idea: Recent Data Matters More

Simple exponential smoothing starts from a beautifully simple premise: when forecasting the next value, give more weight to recent observations and progressively less weight to older ones. The parameter alpha (between 0 and 1) controls how fast the weights decay.

  • Alpha close to 1 means the model reacts quickly to recent changes but is volatile. It "chases" the data.
  • Alpha close to 0 means the model is sluggish and heavily influenced by historical patterns. It is stable but slow to respond to genuine shifts.

Business Insight: The alpha parameter in exponential smoothing captures a fundamental business tension: responsiveness vs. stability. A fashion retailer needs high alpha because trends change fast. A utility company forecasting electricity baseload needs low alpha because demand patterns are stable. The "right" alpha is a business question as much as a statistical one.
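The smoothing recursion itself is only a few lines. This hand-rolled sketch (not the statsmodels implementation) shows the responsiveness-versus-stability trade-off on a series whose level suddenly jumps:

```python
import numpy as np

def simple_exp_smooth(series, alpha):
    """Running level estimate: level = alpha * obs + (1 - alpha) * prior level."""
    level = series[0]
    levels = [level]
    for obs in series[1:]:
        level = alpha * obs + (1 - alpha) * level
        levels.append(level)
    return np.array(levels)

rng = np.random.default_rng(5)
# Flat demand around 100 that jumps to a new level of ~130 halfway through
demand = np.concatenate([rng.normal(100, 5, 50), rng.normal(130, 5, 50)])

fast = simple_exp_smooth(demand, alpha=0.8)   # responsive but jittery
slow = simple_exp_smooth(demand, alpha=0.1)   # stable but slow to adapt

print(f"Level estimate at t=60: alpha=0.8 -> {fast[60]:.0f}, "
      f"alpha=0.1 -> {slow[60]:.0f}")
```

Ten periods after the jump, the high-alpha smoother has essentially caught up to the new level, while the low-alpha smoother is still well below it: exactly the fashion-retailer-versus-utility tension described above.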

Three Flavors of Exponential Smoothing

Simple Exponential Smoothing (SES): Handles data with no trend and no seasonality. Rarely applicable in business, where trends and seasons are ubiquitous, but serves as the foundation for the more useful variants.

Double Exponential Smoothing (Holt's method): Adds a trend component. The model estimates both the level (where the series is) and the slope (where the series is going), each with its own smoothing parameter. Useful for series with trend but no seasonality.

Triple Exponential Smoothing (Holt-Winters): Adds a seasonal component on top of Holt's method. The model estimates level, trend, and seasonal factors, each with its own smoothing parameter. Comes in additive and multiplicative variants, mirroring the decomposition discussion above.

from statsmodels.tsa.holtwinters import ExponentialSmoothing

# Holt-Winters with additive trend and seasonality
hw_model = ExponentialSmoothing(
    df['weekly_sales'].dropna(),
    seasonal_periods=52,
    trend='add',
    seasonal='add',  # Use 'mul' for multiplicative
    use_boxcox=False
).fit(optimized=True)

# Forecast next 26 weeks (half a year)
hw_forecast = hw_model.forecast(26)

print("Holt-Winters Smoothing Parameters:")
print(f"  Alpha (level):    {hw_model.params['smoothing_level']:.3f}")
print(f"  Beta  (trend):    {hw_model.params['smoothing_trend']:.3f}")
print(f"  Gamma (seasonal): {hw_model.params['smoothing_seasonal']:.3f}")

Code Explanation: The ExponentialSmoothing class fits a Holt-Winters model. Setting optimized=True lets the algorithm find the best alpha, beta, and gamma values automatically by minimizing forecast error on the training data. The three smoothing parameters tell you how much the model relies on recent data for each component — a high gamma, for example, means the seasonal pattern adapts quickly to recent seasons.


Facebook Prophet: The Tool That Changed Business Forecasting

In 2017, Facebook's Core Data Science team (led by Sean J. Taylor and Ben Letham) open-sourced a forecasting tool called Prophet. Within two years, it became one of the most widely used forecasting tools in industry. Understanding why requires understanding what Prophet gets right about business forecasting.

Prophet succeeded not because it was the most accurate forecasting algorithm — it often is not — but because it solved the workflow problems that made forecasting painful in practice:

  1. It handles missing data and outliers gracefully. Business time series are messy. Store closures, system outages, and data pipeline failures create gaps. Prophet does not crash or produce garbage when it encounters them.

  2. It handles multiple seasonalities automatically. Daily, weekly, and yearly seasonal patterns can coexist without manual specification. This alone saves hours of preprocessing.

  3. It makes adding holidays and special events trivial. Black Friday, Christmas, back-to-school, local holidays — Prophet lets you specify these as inputs and estimates their effects automatically.

  4. It produces interpretable components. The trend, seasonality, and holiday components can be plotted separately, making it easy to explain the forecast to non-technical stakeholders.

  5. It generates uncertainty intervals by default. Every Prophet forecast comes with prediction intervals, encouraging probabilistic thinking.

  6. Analyst expertise is encoded as priors, not code. An analyst who knows that "demand usually grows 3-8% annually" can express that knowledge through Prophet's parameters without writing complex statistical code.

Business Insight: Prophet's popularity is a lesson in product design for data science tools. The most sophisticated model in the world is useless if practitioners cannot configure it, stakeholders cannot understand it, and the IT team cannot deploy it. Prophet won the market by being "good enough" statistically and excellent operationally. This is a recurring pattern in business AI: practical usability often matters more than theoretical optimality.

How Prophet Works (Intuition)

Prophet fits an additive model with three components:

y(t) = g(t) + s(t) + h(t) + error

  • g(t) is the trend — either linear or logistic growth, with automatic detection of changepoints where the growth rate shifts
  • s(t) is seasonality — modeled using Fourier series (fancy sine and cosine waves that capture periodic patterns)
  • h(t) is the holiday/event effect — additive bumps or dips on specific dates

Changepoints are one of Prophet's most powerful features. Real business trends are rarely straight lines. A product launch accelerates growth. A competitor enters the market and flattens it. A recession causes a decline. Prophet automatically identifies these inflection points and adjusts the trend accordingly.

The Complete Prophet Workflow for Athena

This is the production-quality workflow Ravi Mehta's team built for Athena's supply chain forecasting. We will build it step by step.

# ============================================================
# ATHENA RETAIL GROUP — DEMAND FORECASTING WITH PROPHET
# Chapter 16: Time Series Forecasting
# ============================================================

import pandas as pd
import numpy as np
from prophet import Prophet
from prophet.diagnostics import cross_validation, performance_metrics
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')

# ---- Step 1: Generate Synthetic Daily Sales Data ----
# In production, this would be pulled from Athena's data warehouse

np.random.seed(42)
n_days = 365 * 3  # 3 years of daily data
dates = pd.date_range(start='2023-01-01', periods=n_days, freq='D')

# Trend: steady growth with a changepoint at month 18
trend = np.where(
    np.arange(n_days) < 365 * 1.5,
    100 + 0.05 * np.arange(n_days),                    # Slow growth phase
    100 + 0.05 * (365 * 1.5) + 0.12 * (np.arange(n_days) - 365 * 1.5)  # Accelerated growth
)

# Weekly seasonality: weekends are higher
day_of_week = np.array([d.weekday() for d in dates])
weekly_effect = np.where(day_of_week >= 5, 25, -5)  # Weekend boost

# Yearly seasonality: holiday peak, summer dip
day_of_year = np.array([d.timetuple().tm_yday for d in dates])
yearly_effect = (
    30 * np.sin(2 * np.pi * day_of_year / 365.25)
    + 15 * np.cos(4 * np.pi * day_of_year / 365.25)
)

# Promotional effects: random promotions ~twice per month
promo_days = np.random.choice([0, 1], size=n_days, p=[0.93, 0.07])
promo_effect = promo_days * np.random.uniform(30, 80, n_days)

# Holiday effects (Thanksgiving week, Christmas week, Black Friday)
holiday_effect = np.zeros(n_days)
for i, d in enumerate(dates):
    if d.month == 11 and 22 <= d.day <= 28:  # Thanksgiving week
        holiday_effect[i] = 60
    if d.month == 12 and 20 <= d.day <= 31:  # Christmas period
        holiday_effect[i] = 80
    if d.month == 11 and d.day == 29:  # Black Friday (approx)
        holiday_effect[i] = 150

# Noise
noise = np.random.normal(0, 15, n_days)

# Combine
sales = trend + weekly_effect + yearly_effect + promo_effect + holiday_effect + noise
sales = np.maximum(sales, 0)

# Build the DataFrame in Prophet's required format
df_prophet = pd.DataFrame({
    'ds': dates,
    'y': sales,
    'promo': promo_days
})

print(f"Dataset: {len(df_prophet)} days of sales data")
print(f"Date range: {df_prophet['ds'].min().date()} to {df_prophet['ds'].max().date()}")
print(f"Mean daily sales: {df_prophet['y'].mean():.0f} units")
print(f"Sales range: {df_prophet['y'].min():.0f} - {df_prophet['y'].max():.0f} units")

Code Explanation: We build synthetic daily sales data that mirrors realistic retail patterns: an upward trend that accelerates at month 18 (perhaps due to a store expansion or marketing push), weekly seasonality (higher weekend sales), yearly seasonality (holiday peaks and summer dips), promotional effects, and holiday effects. The data uses Prophet's required column names: ds for the date and y for the target value.

# ---- Step 2: Define Holidays ----
# Prophet allows explicit holiday specification

holidays = pd.DataFrame({
    'holiday': (
        ['thanksgiving'] * 3 + ['black_friday'] * 3 +
        ['christmas'] * 3 + ['new_years'] * 3 +
        ['independence_day'] * 3 + ['labor_day'] * 3 +
        ['memorial_day'] * 3
    ),
    'ds': pd.to_datetime([
        # Thanksgiving (4th Thursday of November)
        '2023-11-23', '2024-11-28', '2025-11-27',
        # Black Friday
        '2023-11-24', '2024-11-29', '2025-11-28',
        # Christmas
        '2023-12-25', '2024-12-25', '2025-12-25',
        # New Year's
        '2023-01-01', '2024-01-01', '2025-01-01',
        # Independence Day
        '2023-07-04', '2024-07-04', '2025-07-04',
        # Labor Day (1st Monday of September)
        '2023-09-04', '2024-09-02', '2025-09-01',
        # Memorial Day (last Monday of May)
        '2023-05-29', '2024-05-27', '2025-05-26',
    ]),
    'lower_window': [0] * 21,  # Days before the holiday to include
    'upper_window': [1] * 21,  # Days after the holiday to include
})

# Extend holiday windows for major retail holidays
holidays.loc[holidays['holiday'] == 'thanksgiving', 'lower_window'] = -1
holidays.loc[holidays['holiday'] == 'thanksgiving', 'upper_window'] = 1
holidays.loc[holidays['holiday'] == 'black_friday', 'upper_window'] = 3  # Weekend after
holidays.loc[holidays['holiday'] == 'christmas', 'lower_window'] = -5  # Shopping week
holidays.loc[holidays['holiday'] == 'christmas', 'upper_window'] = 1

print(f"Defined {holidays['holiday'].nunique()} holiday types, "
      f"{len(holidays)} total holiday instances")
print(holidays.groupby('holiday').size())

Code Explanation: Prophet accepts a DataFrame of holidays with flexible windows. The lower_window and upper_window parameters let you specify that a holiday's effect extends beyond the single day. Black Friday's effect extends three days past (through Cyber Monday). Christmas shopping affects the five days leading up to December 25. This level of domain knowledge is exactly what differentiates a competent forecast from a naive one.

# ---- Step 3: Build and Fit the Prophet Model ----

model = Prophet(
    holidays=holidays,
    yearly_seasonality=True,
    weekly_seasonality=True,
    daily_seasonality=False,       # Not meaningful for daily aggregation
    changepoint_prior_scale=0.05,  # Controls trend flexibility
    seasonality_prior_scale=10,    # Controls seasonality flexibility
    holidays_prior_scale=10,       # Controls holiday effect flexibility
    interval_width=0.80,           # 80% prediction intervals
    growth='linear',
)

# Add external regressor: promotions
model.add_regressor('promo', prior_scale=10, mode='additive')

# Add custom seasonality if needed (e.g., monthly paycheck cycle)
model.add_seasonality(
    name='monthly',
    period=30.5,
    fourier_order=5,   # Complexity of the seasonal pattern
    prior_scale=0.1
)

# Fit the model
model.fit(df_prophet)
print("Model fitted successfully.")

Code Explanation: The Prophet model is configured with several important settings. changepoint_prior_scale=0.05 controls how flexible the trend is — lower values produce smoother trends, higher values let the trend change more abruptly. interval_width=0.80 sets the prediction interval width to 80 percent, matching Athena's standard. The promo column is added as an external regressor — this tells Prophet that promotional activity is a known future input that affects sales. We also add a custom monthly seasonality to capture paycheck-cycle effects.

# ---- Step 4: Generate Forecasts ----

# Create future dataframe for 90 days ahead
future = model.make_future_dataframe(periods=90)

# For the external regressor, we need to provide future values
# In practice, Athena's marketing team provides the promotion schedule
future_promos = np.random.choice([0, 1], size=len(future), p=[0.93, 0.07])
future['promo'] = future_promos

# Generate forecast
forecast = model.predict(future)

# Display key forecast columns
forecast_display = forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail(10)
forecast_display.columns = ['Date', 'Point Forecast', 'Lower 80%', 'Upper 80%']
print("\nLast 10 Days of Forecast:")
print(forecast_display.to_string(index=False, float_format='{:.0f}'.format))

# ---- Step 5: Visualize the Forecast ----

# Prophet's built-in plot
fig1 = model.plot(forecast)
plt.title("Athena Athletic Footwear — Daily Demand Forecast")
plt.xlabel("Date")
plt.ylabel("Daily Sales (Units)")
plt.tight_layout()
plt.savefig('forecast_plot.png', dpi=150, bbox_inches='tight')
plt.show()

# Component plot: see trend, seasonality, and holidays separately
fig2 = model.plot_components(forecast)
plt.tight_layout()
plt.savefig('components_plot.png', dpi=150, bbox_inches='tight')
plt.show()

Code Explanation: Prophet produces two essential visualizations. The forecast plot shows historical data as black dots, the forecast as a blue line, and the 80% prediction interval as a shaded band. The component plot shows each component separately — trend, weekly seasonality, yearly seasonality, holiday effects, and the promotion regressor. The component plot is particularly valuable for stakeholder communication because it answers "why" questions: "Why is the forecast higher in December?" Because the yearly seasonality component peaks then. "Why did the trend accelerate in mid-2024?" Because Prophet detected a changepoint there.

# ---- Step 6: Cross-Validation (Walk-Forward) ----

# Prophet's built-in cross-validation
# Initial training period: 365 days
# Forecast horizon: 30 days
# Spacing between cutoff dates: 90 days
df_cv = cross_validation(
    model,
    initial='365 days',
    period='90 days',
    horizon='30 days'
)

# Calculate performance metrics
df_metrics = performance_metrics(df_cv, rolling_window=1)

print("\nCross-Validation Results (30-Day Horizon):")
print(f"  MAE  (Mean Absolute Error):     {df_metrics['mae'].mean():.1f} units")
print(f"  RMSE (Root Mean Squared Error):  {df_metrics['rmse'].mean():.1f} units")
print(f"  MAPE (Mean Absolute % Error):    {df_metrics['mape'].mean() * 100:.1f}%")
print(f"  Coverage (80% interval):         {df_metrics['coverage'].mean() * 100:.1f}%")

Code Explanation: Cross-validation for time series is called walk-forward validation — and it is fundamentally different from the random k-fold cross-validation used in standard machine learning. Here, we train on 365 days, forecast the next 30 days, then slide the window forward by 90 days and repeat. This mimics real forecasting: you always train on the past and predict the future, never accidentally using future data to predict the past. The coverage metric is particularly important — it tells us whether the 80% interval actually contains 80% of the observed values.

# ---- Step 7: Comparison with Baselines ----

def evaluate_baselines(df, forecast_horizon=30):
    """Compare Prophet against simple baselines."""
    train = df.iloc[:-forecast_horizon]
    test = df.iloc[-forecast_horizon:]

    results = {}

    # Baseline 1: Naive (last value repeated)
    naive_forecast = np.full(forecast_horizon, train['y'].iloc[-1])
    naive_mae = np.mean(np.abs(test['y'].values - naive_forecast))
    results['Naive (Last Value)'] = naive_mae

    # Baseline 2: Seasonal Naive (same day last year)
    if len(train) > 365:
        seasonal_naive = train['y'].iloc[-365:-365 + forecast_horizon].values
        seasonal_mae = np.mean(np.abs(test['y'].values - seasonal_naive))
        results['Seasonal Naive'] = seasonal_mae

    # Baseline 3: Moving Average (28-day)
    ma_forecast = np.full(forecast_horizon, train['y'].iloc[-28:].mean())
    ma_mae = np.mean(np.abs(test['y'].values - ma_forecast))
    results['Moving Avg (28-day)'] = ma_mae

    # Prophet forecast for the test period (relies on the `forecast`
    # frame produced earlier, which must cover the test dates)
    prophet_forecast = forecast[
        forecast['ds'].isin(test['ds'])
    ]['yhat'].values[:forecast_horizon]
    if len(prophet_forecast) == forecast_horizon:
        prophet_mae = np.mean(np.abs(test['y'].values - prophet_forecast))
        results['Prophet'] = prophet_mae

    print("\nModel Comparison (MAE on last 30 days):")
    print("-" * 45)
    for name, mae in sorted(results.items(), key=lambda x: x[1]):
        print(f"  {name:<25} MAE: {mae:>8.1f}")

    best = min(results, key=results.get)
    print(f"\n  Best model: {best}")

    return results

baseline_results = evaluate_baselines(df_prophet)

Code Explanation: Always compare a sophisticated model against simple baselines. The naive forecast simply repeats the last observed value. The seasonal naive forecast uses the same day from last year. The moving average uses the average of the last 28 days. If Prophet cannot beat these baselines, it is adding complexity without adding value. In practice, Prophet typically beats naive methods by 15-30% on series with strong seasonality and external effects — but on very noisy or intermittent series, simpler methods sometimes win.

# ---- Step 8: Business Impact Calculation ----

def calculate_forecast_impact(baseline_mae, prophet_mae,
                               avg_daily_sales=200,
                               holding_cost_pct=0.25,
                               unit_cost=45.00,
                               stockout_cost_multiplier=3.0,
                               stores=340):
    """
    Translate forecast accuracy improvement into dollar savings.
    Athena context: 340 stores, athletic footwear category.
    """
    improvement_pct = (baseline_mae - prophet_mae) / baseline_mae * 100

    # Reduced safety stock (proportional to forecast error reduction)
    error_reduction_ratio = prophet_mae / baseline_mae
    daily_safety_stock_savings = (
        avg_daily_sales * (1 - error_reduction_ratio)
        * unit_cost * holding_cost_pct / 365
    )
    annual_safety_stock_savings = daily_safety_stock_savings * 365 * stores

    # Reduced stockouts (fewer extreme misses); assumes stockouts
    # affect roughly 2% of baseline daily demand
    annual_stockout_savings = (
        avg_daily_sales * 0.02 * (1 - error_reduction_ratio)
        * unit_cost * stockout_cost_multiplier * 365 * stores
    )

    total_savings = annual_safety_stock_savings + annual_stockout_savings

    print("\n" + "=" * 55)
    print("  FORECAST IMPROVEMENT — BUSINESS IMPACT ESTIMATE")
    print("=" * 55)
    print(f"  Baseline MAE:         {baseline_mae:.1f} units/day")
    print(f"  Prophet MAE:          {prophet_mae:.1f} units/day")
    print(f"  Accuracy improvement: {improvement_pct:.1f}%")
    print(f"  Stores:               {stores}")
    print("-" * 55)
    print(f"  Safety stock savings: ${annual_safety_stock_savings:>12,.0f}")
    print(f"  Stockout reduction:   ${annual_stockout_savings:>12,.0f}")
    print(f"  TOTAL ANNUAL SAVINGS: ${total_savings:>12,.0f}")
    print("=" * 55)

    return total_savings

# Use the baseline comparison results
if 'Moving Avg (28-day)' in baseline_results and 'Prophet' in baseline_results:
    savings = calculate_forecast_impact(
        baseline_mae=baseline_results['Moving Avg (28-day)'],
        prophet_mae=baseline_results['Prophet']
    )

Code Explanation: This function translates the abstract concept of "lower MAE" into dollars. The logic is straightforward: better forecasts mean less safety stock (because the uncertainty band is narrower) and fewer stockouts (because extreme misses are less frequent). The $6.1 million annual savings Ravi's team eventually achieved at Athena came from exactly this kind of calculation, aggregated across all product categories and stores.

Athena Update: Ravi's supply chain team rolled out Prophet-based forecasting across all 340 stores and 50,000 SKUs over a six-month period. The key challenge was not the model itself — it was the data pipeline. Athena's legacy POS system (the 15-year-old system mentioned in Chapter 1) produced daily sales data with a 48-hour lag and frequent missing values. The team invested three months building a data pipeline that cleaned, validated, and transformed POS data into Prophet-ready format before they wrote a single line of modeling code. As Ravi told his team: "The model is maybe 20 percent of the work. The data pipeline is 80 percent."


Feature-Based Forecasting: External Regressors

Pure time series models forecast the future based solely on the past behavior of the series itself. But in business, we often know things about the future that should influence our forecast.

  • The marketing team has approved a major promotional campaign for the first two weeks of March.
  • The weather forecast predicts a severe winter storm that will suppress foot traffic for three days.
  • A competitor is launching a new product that will likely cannibalize some demand.
  • The Federal Reserve just raised interest rates, which typically dampens consumer spending with a 2-3 month lag.

These are external regressors — variables external to the time series that influence its behavior. Prophet (as we saw above with the promo variable) and ARIMA (through ARIMAX, the extension with exogenous variables) can incorporate external regressors to improve forecast accuracy.

Choosing External Regressors

Not every potential regressor is worth including. Good external regressors have three properties:

  1. Causal plausibility. There should be a logical reason why the variable affects demand. Temperature affects ice cream sales. Promotions affect retail traffic. A stock market index probably does not directly affect demand for athletic footwear.

  2. Predictive power. The regressor should actually improve forecast accuracy when added to the model. Test this with cross-validation: does adding the variable reduce the forecast error? If not, remove it — it is adding complexity without benefit.

  3. Future availability. You need to know the value of the regressor for the forecast period. This is the crucial practical constraint. The weather forecast is available 7-10 days ahead. The promotional calendar is known months ahead. GDP growth is not known until after the quarter ends — so it cannot be used for operational forecasting, only for strategic planning.

Business Insight: The most commonly used external regressors in retail demand forecasting are: (1) promotion indicators (binary: is there a promotion running?), (2) price changes, (3) holiday indicators, (4) marketing spend, and (5) weather variables (temperature, precipitation). Of these, promotions and holidays typically have the largest impact on accuracy. Weather matters more for some categories (beverages, seasonal apparel) than others (electronics, furniture).

# Adding multiple external regressors to Prophet

# Generate additional synthetic regressors
df_prophet['temperature'] = (
    60 + 25 * np.sin(2 * np.pi * np.arange(len(df_prophet)) / 365.25 - np.pi/2)
    + np.random.normal(0, 5, len(df_prophet))
)
df_prophet['marketing_spend'] = np.random.uniform(500, 2000, len(df_prophet))

# Rebuild model with additional regressors
model_ext = Prophet(
    holidays=holidays,
    yearly_seasonality=True,
    weekly_seasonality=True,
    daily_seasonality=False,
    interval_width=0.80,
)
model_ext.add_regressor('promo', prior_scale=10)
model_ext.add_regressor('temperature', prior_scale=5)
model_ext.add_regressor('marketing_spend', prior_scale=5, standardize=True)

model_ext.fit(df_prophet)

# For prediction, regressor values must be supplied for every row of the
# future dataframe. Historical rows keep their actual values; only the
# 30 new future days need assumed values.
future_ext = model_ext.make_future_dataframe(periods=30)
n_hist = len(df_prophet)
future_ext['promo'] = np.concatenate([
    df_prophet['promo'].values,
    np.random.choice([0, 1], size=30, p=[0.93, 0.07])
])
future_ext['temperature'] = np.concatenate([
    df_prophet['temperature'].values,
    60 + 25 * np.sin(2 * np.pi * np.arange(n_hist, n_hist + 30) / 365.25 - np.pi/2)
])
future_ext['marketing_spend'] = np.concatenate([
    df_prophet['marketing_spend'].values,
    np.random.uniform(500, 2000, 30)
])

forecast_ext = model_ext.predict(future_ext)
print("Extended model with external regressors fitted and forecast generated.")

Caution

Adding external regressors can reduce forecast accuracy if the regressors are noisy, weakly correlated, or introduce overfitting. Always validate with cross-validation. A useful rule of thumb: if adding a regressor does not improve MAPE by at least 1-2 percentage points on the cross-validation set, remove it. The marginal complexity is not worth it.


LSTM for Time Series: When Deep Learning Makes Sense

In Chapter 13, we introduced recurrent neural networks (RNNs) and their improved variant, Long Short-Term Memory (LSTM) networks. LSTMs are specifically designed to learn patterns in sequential data, making them a natural candidate for time series forecasting.

The LSTM Intuition for Time Series

An LSTM processes a time series one step at a time, maintaining an internal "memory" that can store information about patterns observed in prior steps. At each time step, the LSTM decides:

  • What to remember from its memory (the "forget gate")
  • What new information to add to memory (the "input gate")
  • What to output as a prediction (the "output gate")

This architecture lets LSTMs learn complex, nonlinear temporal patterns that simpler models miss — in theory. In practice, LSTMs for time series forecasting are tricky.
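One source of that trickiness is purely mechanical: before any LSTM can be trained, the series must be reframed as supervised (X, y) windows with the three-dimensional shape the network expects. A minimal sketch with NumPy (the 28-day lookback is illustrative, not a recommendation):

```python
import numpy as np

def make_windows(series, lookback=28, horizon=1):
    """Frame a 1-D series as supervised (X, y) pairs for sequence models."""
    X, y = [], []
    for i in range(len(series) - lookback - horizon + 1):
        X.append(series[i:i + lookback])                       # history window
        y.append(series[i + lookback:i + lookback + horizon])  # target(s)
    # LSTMs expect inputs shaped (samples, timesteps, features)
    return np.array(X)[..., np.newaxis], np.array(y)

series = np.arange(100.0)  # stand-in for daily demand
X, y = make_windows(series)
print(X.shape, y.shape)  # (72, 28, 1) (72, 1)
```

Each row of X holds 28 consecutive days, and the matching row of y holds the day that follows; the model never sees a target from inside its own input window.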

Tom's LSTM Lesson

Tom Kowalski is excited about LSTMs. His computer science background makes him comfortable with neural networks, and the idea of a model that "learns" temporal patterns without being told what to look for appeals to his engineering instincts.

He builds an LSTM model for Athena's demand data. The architecture is clean: two LSTM layers with 50 units each, a dropout layer for regularization, and a dense output layer. He trains it on two years of daily data and evaluates on the third year.

The training results are spectacular. The LSTM fits the training data almost perfectly — the loss curve drops to near zero, and the in-sample predictions track the actual values with eerie precision.

Then Tom runs it on the test data. The results are mediocre. The LSTM's MAPE on the test set is 14.2 percent — worse than Prophet's 11.8 percent. The model clearly overfit to the training data, memorizing specific patterns (including noise) rather than learning the generalizable structure.

Tom spends a weekend trying to fix it: adjusting the architecture, tuning hyperparameters, adding more dropout. He reduces overfitting but cannot match Prophet's performance.

On Monday, he shows the results to Ravi. "Prophet beat my LSTM," he says, slightly deflated.

Ravi is unsurprised. "LSTMs can be powerful for time series, but they need two things that Prophet doesn't: a lot of data, and careful engineering. Prophet was designed for business forecasting with human-interpretable components. Your LSTM was designed for general sequence learning. For our problem — daily SKU-level demand with strong seasonality and holidays — Prophet's inductive biases are more appropriate."

Tom writes in his notebook: Complexity is not a virtue. Match the tool to the problem.

Business Insight: In a 2020 study, Makridakis et al. (the M4 forecasting competition) found that statistical methods outperformed machine learning methods (including LSTMs) on the majority of individual time series. Deep learning methods improved when forecasting many related series simultaneously (a technique called "global models"), but for the typical business scenario of forecasting one or a few series, simpler methods are hard to beat. The lesson for business leaders: do not assume that "more sophisticated" means "more accurate." Always benchmark against simple baselines.

When LSTMs Do Make Sense for Time Series

LSTMs can outperform simpler methods when:

  • You have thousands of related time series and can train a single model across all of them (global modeling). Amazon, Walmart, and other large retailers have seen success here.
  • The series has complex nonlinear patterns that cannot be captured by trend + seasonality + holidays.
  • You have very long histories (thousands of data points per series) to prevent overfitting.
  • The series involves irregular, event-driven patterns that external regressors alone cannot capture.

For most business forecasting tasks — quarterly revenue, monthly demand, weekly foot traffic — Prophet or Holt-Winters will match or beat an LSTM with a fraction of the engineering effort.


Ensemble Forecasting: Combining Models for Better Results

If different models make different kinds of errors, combining them can cancel out some of those errors. This is the logic behind ensemble forecasting, which is widely used in production forecasting systems.

Three Ensemble Strategies

1. Simple Averaging. Take the forecasts from multiple models and average them. Surprisingly effective. Research consistently shows that a simple average of 3-5 diverse models often beats any individual model.

2. Weighted Averaging. Give more weight to models that performed better in cross-validation. If Prophet had a MAPE of 10 percent and ARIMA had a MAPE of 15 percent, weight Prophet more heavily. The weights can be inversely proportional to error or optimized on a validation set.

3. Stacking (Meta-Learning). Use the individual model forecasts as features in a second-level model (often a simple linear regression). The stacking model learns how to optimally combine the base forecasts, potentially capturing complex interactions (e.g., "Prophet is better for holiday periods but ARIMA is better for normal periods").

def ensemble_forecast(forecasts_dict, method='weighted', cv_errors=None):
    """
    Combine multiple forecasts into an ensemble.

    Parameters
    ----------
    forecasts_dict : dict
        {model_name: forecast_array} for each model
    method : str
        'simple' (equal weights) or 'weighted' (inverse-error weights)
    cv_errors : dict, optional
        {model_name: cv_mape} for weighted methods

    Returns
    -------
    np.array : ensemble forecast
    """
    forecasts = np.array(list(forecasts_dict.values()))
    model_names = list(forecasts_dict.keys())

    if method == 'simple':
        weights = np.ones(len(model_names)) / len(model_names)
    elif method == 'weighted' and cv_errors is not None:
        # Inverse-error weighting: better models get higher weights
        errors = np.array([cv_errors[name] for name in model_names])
        inverse_errors = 1.0 / errors
        weights = inverse_errors / inverse_errors.sum()
    else:
        # No CV errors supplied: fall back to equal weights
        weights = np.ones(len(model_names)) / len(model_names)

    ensemble = np.average(forecasts, axis=0, weights=weights)

    print("Ensemble Weights:")
    for name, w in zip(model_names, weights):
        print(f"  {name:<20} {w:.3f}")

    return ensemble

# Example: combine Prophet, Holt-Winters, and ARIMA forecasts
# (In production, each model would generate a forecast for the same horizon)
example_forecasts = {
    'Prophet': np.random.normal(200, 10, 30),       # 30-day forecast
    'Holt-Winters': np.random.normal(195, 12, 30),
    'ARIMA': np.random.normal(205, 15, 30),
}

example_cv_errors = {
    'Prophet': 0.10,        # 10% MAPE
    'Holt-Winters': 0.12,   # 12% MAPE
    'ARIMA': 0.15,           # 15% MAPE
}

ensemble = ensemble_forecast(
    example_forecasts,
    method='weighted',
    cv_errors=example_cv_errors
)

Code Explanation: The ensemble function combines multiple model forecasts using weights derived from cross-validation performance. Models with lower error (MAPE) receive higher weights. In this example, Prophet gets the highest weight because it had the lowest MAPE. The resulting ensemble forecast is a weighted average that typically performs better than any individual model because errors from different models tend to cancel out.
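The third strategy, stacking, is not covered by the function above. A minimal sketch, using ordinary least squares as the meta-model on a synthetic validation window (the model names are placeholders, and the noise levels are invented):

```python
import numpy as np

rng = np.random.default_rng(42)
actual = 200 + rng.normal(0, 5, 60)  # validation-window actuals

# Base forecasts for the same window, each with its own bias and noise
base = np.column_stack([
    actual + rng.normal(0, 8, 60),    # stand-in for Prophet
    actual + rng.normal(3, 10, 60),   # stand-in for Holt-Winters
    actual + rng.normal(-4, 12, 60),  # stand-in for ARIMA
])

# Meta-model: least-squares combination weights (no intercept)
weights, *_ = np.linalg.lstsq(base, actual, rcond=None)
stacked = base @ weights

print("learned weights:", np.round(weights, 3))
print("stacked MAE:", round(np.mean(np.abs(actual - stacked)), 2))
```

In production the weights would be learned on a held-out validation period and then applied to fresh forecasts, never fit on the same data being evaluated.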

Business Insight: The M4 forecasting competition (2018, with 100,000 time series) found that the top-performing methods were almost all ensembles. The winning method combined a statistical model (exponential smoothing) with a neural network. The second-place method was a simple combination of several standard approaches. For business practitioners, the message is clear: instead of spending weeks trying to find the single best model, spend that time building three or four reasonable models and combining them.


Forecast Uncertainty: The Executive Communication Challenge

Every forecast is wrong. The question is: how wrong, and in which direction? Communicating uncertainty to executives who want a single number is one of the most important — and most neglected — skills in business forecasting.

Prediction Intervals vs. Point Forecasts

A point forecast says: "We expect to sell 12,000 units next week."

A prediction interval says: "We expect to sell between 10,500 and 13,800 units next week, with 80 percent confidence."

The second statement is more honest, more useful, and harder to present on a slide. This tension is real and must be managed deliberately.

Why Prediction Intervals Widen Over Time

Every forecast becomes more uncertain the further ahead it looks. A one-week forecast might have a range of plus or minus 10 percent. A twelve-week forecast might have a range of plus or minus 30 percent. This is not a flaw in the model — it is an accurate reflection of reality.

Definition: A prediction interval specifies a range within which the actual value is expected to fall with a stated probability. An 80% prediction interval means we expect the actual value to fall within the interval 80% of the time. A wider interval is more likely to be correct but less useful for planning. A narrower interval is more useful but less likely to be correct. The choice of interval width should reflect the decision context.
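Whether an interval is honest can be checked empirically, exactly as Okonkwo did with Athena's chart: count how often the actuals land inside the band. A minimal sketch with made-up monthly numbers:

```python
import numpy as np

def interval_coverage(actual, lower, upper):
    """Fraction of actual values that fall inside the prediction interval."""
    actual = np.asarray(actual, dtype=float)
    inside = (actual >= np.asarray(lower)) & (actual <= np.asarray(upper))
    return inside.mean()

# Twelve monthly actuals against an illustrative 80% band;
# 10 of 12 fall inside, i.e. ~83% empirical coverage
actual = [105, 98, 110, 120, 95, 130, 102, 99, 115, 140, 108, 101]
lower  = [90, 90, 95, 100, 97, 100, 95, 90, 100, 118, 95, 95]
upper  = [115, 110, 120, 125, 118, 125, 115, 110, 125, 145, 120, 115]
print(round(interval_coverage(actual, lower, upper), 2))  # 0.83
```

Coverage far below the nominal level means the intervals are too narrow (overconfident); coverage far above means they are too wide to be useful for planning.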

Scenario-Based Planning

For strategic decisions, prediction intervals can be translated into scenarios:

  • Optimistic scenario (upper bound of the 80% interval): What if demand is stronger than expected? Do we have the supply chain capacity?
  • Expected scenario (point forecast): Our best single estimate.
  • Conservative scenario (lower bound of the 80% interval): What if demand is weaker than expected? Can we manage the inventory carrying costs?
  • Stress scenario (upper bound of the 95% interval): What is the most extreme plausible demand? This is relevant for capacity planning.

def create_executive_forecast_summary(forecast_df, category_name):
    """
    Create an executive-friendly forecast summary with scenarios.
    """
    # Weekly aggregation of the daily forecast; takes the last ~13 weeks
    # (assumes the forecast frame covers at least 90 days)
    forecast_future = forecast_df[forecast_df['ds'] > forecast_df['ds'].iloc[-91]]
    forecast_future = forecast_future.copy()
    forecast_future['week'] = forecast_future['ds'].dt.isocalendar().week.astype(int)
    forecast_future['year'] = forecast_future['ds'].dt.year

    weekly = forecast_future.groupby(['year', 'week']).agg({
        'yhat': 'sum',
        'yhat_lower': 'sum',
        'yhat_upper': 'sum'
    }).reset_index()

    print(f"\n{'='*65}")
    print(f"  DEMAND FORECAST — {category_name.upper()}")
    print(f"  Next 12 Weeks | 80% Confidence Interval")
    print(f"{'='*65}")
    print(f"  {'Week':<8} {'Conservative':>14} {'Expected':>12} {'Optimistic':>12}")
    print(f"  {'':<8} {'(Lower 80%)':>14} {'(Point)':>12} {'(Upper 80%)':>12}")
    print(f"  {'-'*50}")

    for _, row in weekly.head(12).iterrows():
        print(f"  Wk {int(row['week']):<4} "
              f"{row['yhat_lower']:>13,.0f} "
              f"{row['yhat']:>12,.0f} "
              f"{row['yhat_upper']:>12,.0f}")

    totals = weekly.head(12)[['yhat', 'yhat_lower', 'yhat_upper']].sum()
    print(f"  {'-'*50}")
    print(f"  {'TOTAL':<8} {totals['yhat_lower']:>13,.0f} "
          f"{totals['yhat']:>12,.0f} "
          f"{totals['yhat_upper']:>12,.0f}")
    print(f"{'='*65}")

    return weekly.head(12)

# Generate the executive summary
summary = create_executive_forecast_summary(forecast, "Athletic Footwear")

Athena Update: When Ravi first presented prediction intervals to Athena's supply chain VP, the response was skeptical: "Just give me one number." Ravi pushed back: "One number means you either over-order or under-order. Three numbers — low, expected, high — mean you can plan for each scenario." He created dashboards that showed the point forecast as a bold line with the interval as a shaded band, and added scenario-based inventory recommendations: "If demand hits the upper bound, here is the contingency order we should trigger." Within six months, the supply chain team became the strongest internal advocates for probabilistic forecasting. The VP who initially wanted a single number began asking, "What does the confidence interval look like?" in every planning meeting.

Communicating Uncertainty: Practical Guidelines

  1. Lead with the decision, not the model. Instead of "Our ARIMA(2,1,1) model with seasonal differencing produces a MAPE of 11.3%," say "Based on our analysis, we recommend ordering between 10,500 and 13,800 units, with a midpoint estimate of 12,000."

  2. Use language executives understand. "80% confidence" means "In 8 out of 10 similar situations, the actual number falls within this range." Avoid statistical jargon.

  3. Anchor on scenarios, not intervals. "Best case: 13,800 units. Most likely: 12,000 units. Downside: 10,500 units" is more actionable than "12,000 ± 1,600 units."

  4. Show historical accuracy. "Over the past year, our forecasts have been within 12% of actual sales 80% of the time" builds credibility.

  5. Be explicit about what the model does not capture. "This forecast assumes no major competitor launches, no significant supply chain disruptions, and continued current macroeconomic conditions. If any of these change, we should revise."


Forecast Evaluation: Measuring What Matters

A forecast is only as good as its evaluation methodology. The wrong metric — or worse, no systematic evaluation at all — is a recipe for "forecast accuracy theater," where the forecasting team claims success while the supply chain team quietly scrambles.

The Three Essential Metrics

MAE (Mean Absolute Error): The average of the absolute differences between forecasted and actual values. MAE is in the same units as the data (units, dollars, etc.), making it intuitive. If your MAE is 500 units, you are off by 500 units on average.

RMSE (Root Mean Squared Error): Similar to MAE but penalizes large errors more heavily. RMSE is always greater than or equal to MAE. A large gap between RMSE and MAE indicates that the model has occasional large misses — which may be operationally costly even if the average error is acceptable.

MAPE (Mean Absolute Percentage Error): The average of the absolute percentage errors. MAPE is scale-independent, making it easy to compare across product categories with different sales volumes. A MAPE of 10 percent means the forecast is off by 10 percent on average.

Caution

MAPE has a critical flaw: it blows up when actual values are close to zero. If actual sales are 2 units and the forecast is 5 units, the MAPE is 150% — a huge percentage error for a trivial absolute error of 3 units. For products with low or intermittent demand (which describes the majority of SKUs in a large retail catalog), MAPE is misleading. Use Weighted MAPE (WMAPE) instead, which weights errors by the magnitude of actual sales: WMAPE = Sum(|Actual - Forecast|) / Sum(Actual).
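The three metrics, plus the WMAPE fix from the caution above, take only a few lines. The numbers in the example reproduce the low-volume SKU problem: a trivial 3-unit miss on a 2-unit SKU dominates MAPE but barely moves WMAPE.

```python
import numpy as np

def forecast_metrics(actual, forecast):
    """MAE, RMSE, MAPE, and WMAPE for paired actual/forecast arrays."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    abs_err = np.abs(actual - forecast)
    return {
        'MAE': abs_err.mean(),
        'RMSE': np.sqrt(((actual - forecast) ** 2).mean()),
        'MAPE': (abs_err / np.abs(actual)).mean(),      # explodes near zero
        'WMAPE': abs_err.sum() / np.abs(actual).sum(),  # volume-weighted
    }

# One low-volume SKU (2 units) alongside two normal-volume ones
m = forecast_metrics(actual=[2, 100, 100], forecast=[5, 95, 110])
print(f"MAPE:  {m['MAPE']:.0%}")   # 55% -- inflated by the 2-unit SKU
print(f"WMAPE: {m['WMAPE']:.0%}")  # 9%  -- reflects actual volume
```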

Walk-Forward Validation: The Only Valid Approach

In standard machine learning, we randomly split data into training and test sets. In time series, this is invalid because it allows "information leakage" from the future into the past.

Walk-forward validation (also called rolling-origin evaluation or backtesting) is the correct approach:

  1. Train on data from time 1 to time T.
  2. Forecast from time T+1 to time T+h (where h is the forecast horizon).
  3. Record the errors.
  4. Advance the training window by some increment (e.g., one month).
  5. Repeat.

This produces multiple forecast-vs-actual comparisons across different time periods, giving you a robust estimate of how the model will perform in production.

def walk_forward_evaluation(df, model_builder, horizons=[7, 14, 30],
                            n_splits=8, min_train_days=365):
    """
    Walk-forward validation across multiple horizons.

    Parameters
    ----------
    df : pd.DataFrame
        Must have columns 'ds' and 'y'
    model_builder : callable
        Function that returns a fitted Prophet model
    horizons : list of int
        Forecast horizons in days to evaluate
    n_splits : int
        Number of walk-forward splits
    min_train_days : int
        Minimum training period in days
    """
    total_days = len(df)
    max_horizon = max(horizons)
    available_days = total_days - min_train_days - max_horizon
    step_size = available_days // n_splits

    results = {h: [] for h in horizons}

    for i in range(n_splits):
        cutoff_idx = min_train_days + i * step_size
        train = df.iloc[:cutoff_idx]

        # Build and fit model
        model = model_builder()
        model.fit(train)

        # Evaluate at each horizon
        for h in horizons:
            future = model.make_future_dataframe(periods=h)
            # make_future_dataframe returns only 'ds', so any external
            # regressor columns must be re-supplied; default them to 0
            # (i.e., assume no promotions during the forecast horizon)
            for col in train.columns:
                if col not in ['ds', 'y']:
                    future[col] = 0

            pred = model.predict(future)
            test_dates = df['ds'].iloc[cutoff_idx:cutoff_idx + h]
            test_actual = df[df['ds'].isin(test_dates)]['y'].values
            test_pred = pred[pred['ds'].isin(test_dates)]['yhat'].values

            if len(test_actual) > 0 and len(test_pred) > 0:
                min_len = min(len(test_actual), len(test_pred))
                mape = np.mean(
                    np.abs(test_actual[:min_len] - test_pred[:min_len])
                    / np.maximum(test_actual[:min_len], 1)
                )
                results[h].append(mape)

    print("\nWalk-Forward Evaluation Results:")
    print(f"  {'Horizon':<15} {'Mean MAPE':>12} {'Std MAPE':>12} {'Splits':>8}")
    print("  " + "-" * 47)
    for h in horizons:
        if results[h]:
            mean_mape = np.mean(results[h]) * 100
            std_mape = np.std(results[h]) * 100
            print(f"  {h:>3} days{'':<8} {mean_mape:>10.1f}% "
                  f"{std_mape:>10.1f}% {len(results[h]):>8}")

    return results

Code Explanation: This function performs walk-forward validation across multiple forecast horizons. The key insight is that forecast accuracy degrades with horizon length. A model might achieve 8% MAPE on 7-day forecasts but 18% MAPE on 30-day forecasts. Reporting accuracy by horizon gives stakeholders realistic expectations: "We can forecast next week's demand within 8%, but next month's demand only within 18%."

Evaluation by Horizon: Why It Matters

Athena's supply chain team uses forecasts at three horizons:

  • 7 days: For store replenishment orders. High accuracy needed.
  • 30 days: For warehouse inventory planning. Moderate accuracy acceptable.
  • 90 days: For supplier negotiations and capacity planning. Lower accuracy expected, scenarios used instead of point forecasts.

A single MAPE number averaged across all horizons hides critical information. Ravi's team reports accuracy separately by horizon and sets different targets for each:

  Horizon     MAPE Target    Use Case
  ---------   -----------    ---------------------
  7 days      < 10%          Store replenishment
  14 days     < 15%          Regional distribution
  30 days     < 20%          Warehouse planning
  90 days     < 30%          Supplier negotiation

Hierarchical Forecasting: Athena's Scaling Challenge

Recall Athena's scaling challenge: forecasting at SKU-store-week granularity across 340 stores and 50,000 SKUs. That is 17 million individual series, and most of them are sparse (many SKUs sell zero or one units per week at any given store).

Forecasting sparse, intermittent demand at the individual level is extremely difficult. The signal-to-noise ratio is too low. The solution is hierarchical forecasting: forecast at an aggregated level where the data is dense and the patterns are clear, then disaggregate to the granular level using proportional allocation.

The Hierarchy

Total Company
└── Region (e.g., Northeast, Southeast, Midwest, West)
    └── Category (e.g., Athletic Footwear, Casual Apparel)
        └── Store
            └── SKU

The Strategy

  1. Forecast at the category-region level. At this level, there are enough data points per series to identify strong seasonal patterns, trend, and promotional effects. Prophet excels here.

  2. Disaggregate to store level using historical proportions. If Store #142 historically accounts for 2.3 percent of Northeast Athletic Footwear sales, apply that proportion to the region-category forecast.

  3. Disaggregate to SKU level using historical mix ratios. If SKU #A7842 historically accounts for 0.8 percent of Athletic Footwear sales at Store #142, apply that proportion.

  4. Reconcile to ensure that bottom-up totals match top-down forecasts. This reconciliation step is mathematically nuanced but operationally critical.
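Steps 2 and 3 are simple proportional arithmetic. A sketch with made-up shares (the store and SKU identifiers echo the examples above, but every figure is illustrative):

```python
# Step 1 output (illustrative): one week of Northeast Athletic Footwear
region_forecast = 40_000  # units

# Step 2: store-level disaggregation by historical share of the region
store_share_142 = 0.023                                 # Store #142's share
store_forecast_142 = region_forecast * store_share_142  # ~920 units

# Step 3: SKU-level disaggregation by historical mix within the store
sku_mix_a7842 = 0.008                                   # SKU #A7842's share
sku_forecast = store_forecast_142 * sku_mix_a7842       # ~7.36 units

print(f"Store #142: {store_forecast_142:.0f} units, "
      f"SKU #A7842: {sku_forecast:.2f} units")
```

Fractional SKU-level forecasts like 7.36 units are normal at this granularity; they feed reorder-point calculations rather than being ordered literally.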

Business Insight: Hierarchical forecasting is how nearly all large retailers actually forecast. Walmart, Target, Amazon, and Costco all use some form of hierarchical approach because the alternative — training millions of individual models — is computationally expensive and statistically unstable for sparse series. The key decision is which level to forecast at. Too high (total company) and you miss important regional and category-specific patterns. Too low (individual SKU-store) and you are modeling noise. The sweet spot is usually the level where each series has at least 50-100 non-zero observations per year.


Common Pitfalls in Business Forecasting

Professor Okonkwo dedicates the final portion of the lecture to a topic she considers more important than any algorithm: the ways forecasting goes wrong in practice.

Pitfall 1: Overfitting to History

Tom's LSTM experience illustrates the most common technical pitfall. A model that fits the training data perfectly has memorized the past, including its noise and one-time events. It will fail on new data.

The antidote: always evaluate on out-of-sample data using walk-forward validation. If the model's accuracy on training data is dramatically better than its accuracy on test data, it is overfit.

Pitfall 2: Ignoring Structural Breaks

A structural break occurs when the data-generating process changes fundamentally. A new competitor enters the market. A pandemic closes stores. Regulations reshape an industry. The model was trained on pre-break data, and post-break data follows different rules.

No model can forecast a structural break it has never seen. The defense is not algorithmic but procedural: monitor forecast accuracy continuously, and when errors suddenly spike, investigate whether the underlying dynamics have changed. We will explore this in depth in Case Study 2.

Pitfall 3: False Precision

"Our forecast for Q3 is $47,832,419."

This level of precision is absurd for a quarterly revenue forecast. It implies certainty to the nearest dollar when the true uncertainty spans millions. Yet business teams produce numbers like this routinely, because spreadsheets display many decimal places and nobody bothers to round.

The fix: round forecasts to a level of precision consistent with their accuracy. If your MAPE is 10 percent, report the forecast as "$48M" or "$43-53M," not "$47,832,419."

Pitfall 4: Forecast Accuracy Theater

"Our forecast accuracy improved from 85% to 88% this quarter."

This sounds good. But accuracy measured how? Over what horizon? At what level of aggregation? Forecasts measured at the total company level are always more accurate than forecasts at the SKU-store level (errors cancel when you aggregate). Comparing a total-company metric to a SKU-level target is like comparing a team batting average to an individual batting average — it is a category error.

"Forecast accuracy theater" occurs when forecasting teams report metrics that make them look good without reflecting the accuracy that matters operationally. The supply chain does not reorder at the company level — it reorders at the SKU-distribution-center level. That is where accuracy must be measured.

Caution

Be skeptical when someone reports a single accuracy number without specifying: (1) the metric used (MAE, MAPE, WMAPE?), (2) the forecast horizon (1 week, 1 month, 1 quarter?), (3) the level of aggregation (total, region, category, SKU?), and (4) the evaluation methodology (in-sample or walk-forward?). Without these specifics, the number is meaningless.
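The aggregation effect behind this pitfall is easy to demonstrate: the same set of forecasts scores far better at the company level, where positive and negative SKU errors cancel. The numbers below are toy values chosen to make the cancellation visible:

```python
import numpy as np

def wmape(actual, forecast):
    """Weighted MAPE: total absolute error divided by total actual volume."""
    actual = np.asarray(actual, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    return np.abs(actual - forecast).sum() / actual.sum()

# Two SKUs over three periods; one over-forecast, one under-forecast
actual_sku   = np.array([[100, 120,  90], [80, 70, 110]])
forecast_sku = np.array([[110, 105, 100], [72, 85,  95]])

sku_level   = wmape(actual_sku.ravel(), forecast_sku.ravel())
total_level = wmape(actual_sku.sum(axis=0), forecast_sku.sum(axis=0))
print(f"SKU-level WMAPE:     {sku_level:.1%}")
print(f"Company-level WMAPE: {total_level:.1%}")
# Opposite-signed errors cancel when summed, so the company-level
# number looks dramatically better than the SKU-level number
```

The supply chain reorders at the SKU level, so the SKU-level number is the one that matters operationally — exactly the point of the caution above.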

Pitfall 5: Confusing Forecasts with Targets

"We forecast $50 million in revenue next quarter" is a statistical estimate.

"Our target is $50 million in revenue next quarter" is a business aspiration.

These are fundamentally different things, but they get conflated constantly. When the sales team's bonus depends on hitting $50 million, the "forecast" mysteriously converges to $50 million regardless of what the data says. This is political forecasting, not statistical forecasting, and it is a major source of supply chain and financial planning errors.

NK types: Forecast = what we expect to happen. Target = what we want to happen. Keep them separate.

"Exactly," Okonkwo confirms, as if reading NK's screen. "The moment your forecast becomes aspirational, it stops being useful."


The Athena Supply Chain Forecasting System: Full Picture

Athena Update: By the end of Phase 3, Athena's demand forecasting system has the following architecture:

Data Pipeline:

  • Daily POS data from 340 stores, cleaned and validated within 24 hours
  • External data feeds: promotional calendar, holiday calendar, weather forecasts, local event schedules
  • Historical sales data enriched with promotional indicators, price change flags, and markdown schedules

Modeling:

  • Prophet models at the category-region level (approximately 200 primary forecast series)
  • Holt-Winters models as a secondary method for ensemble weighting
  • Hierarchical disaggregation to store-SKU level using historical proportions
  • Ensemble of Prophet and Holt-Winters, weighted by walk-forward cross-validation performance

Output:

  • Daily forecasts at SKU-store level, with 80% prediction intervals
  • Weekly forecast summary dashboards for regional planners
  • Monthly scenario analysis (optimistic / expected / conservative) for supply chain leadership
  • Automated alerts when forecast accuracy degrades below threshold

Results:

  • 22% improvement in forecast accuracy (WMAPE) compared to the previous moving-average approach
  • $6.1 million annual reduction in inventory carrying costs (from reduced safety stock)
  • 31% reduction in stockout incidents for the top 200 SKUs
  • Supply chain team adoption rate: 94% of planners using the new system daily

What It Took:

  • 3 months of data pipeline development
  • 2 months of model development and validation
  • 6 months of phased rollout (one region at a time)
  • 1 full-time data engineer, 1 data scientist, and 0.5 FTE from the supply chain planning team
  • Total cost: approximately $450,000 in the first year (primarily personnel)
  • Payback period: 27 days
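The ensemble step in Athena's modeling layer — weighting Prophet and Holt-Winters by walk-forward cross-validation performance — can be implemented with simple inverse-error weighting. The CV error values below are hypothetical, and this is one common weighting scheme rather than Athena's exact method:

```python
import numpy as np

def inverse_error_weights(cv_errors):
    """Weight each model inversely to its cross-validation error,
    normalized so the weights sum to 1."""
    inv = 1.0 / np.asarray(cv_errors, dtype=float)
    return inv / inv.sum()

def ensemble_forecast(forecasts, weights):
    """Weighted average of per-model forecast arrays."""
    return np.average(np.asarray(forecasts, dtype=float), axis=0, weights=weights)

# Hypothetical walk-forward WMAPEs: Prophet = 8.0%, Holt-Winters = 12.0%
weights = inverse_error_weights([8.0, 12.0])
# -> [0.6, 0.4]: the better model gets the larger weight

# Hypothetical two-period forecasts from each model
blended = ensemble_forecast([[100, 110], [90, 105]], weights)
print(weights, blended)  # -> [0.6 0.4] [ 96. 108.]
```

The design choice here is deliberate: weights come from out-of-sample (walk-forward) error, not training fit, so a model that merely memorized history does not dominate the blend.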


Looking Back, Looking Forward

Professor Okonkwo returns to the chart she showed at the beginning of the lecture. The forecast that was wrong every single month but useful every single month.

"Three things to carry out of this room," she says.

"First: forecasts are probabilistic statements about the future, not promises. The moment you strip out the uncertainty interval, you have destroyed the most valuable part of the forecast."

"Second: simpler models are not inferior models. Prophet beat Tom's LSTM today. Holt-Winters beats neural networks in many production systems. Match the tool to the problem, not the other way around."

"Third: the hardest part of forecasting is not the model. It is the data pipeline that feeds it, the evaluation methodology that validates it, and the communication strategy that gets executives to act on it."

Tom closes his notebook. He writes one final note: Humility. The model that works is the model that ships.

NK, who came in expecting a chapter about algorithms, leaves thinking about dashboards, intervals, and the politics of prediction.

Ravi sends them both a message that evening: Great lecture today. When you are building forecasts for Athena this summer, remember: the supply chain team does not care about your MAPE. They care about whether they have enough product on the shelves and not too much in the warehouse. Everything else is a means to that end.


Chapter Summary

Time series forecasting occupies a unique position in business AI: it is one of the most impactful applications (demand planning, financial forecasting, capacity planning) and one of the most commonly done poorly. This chapter covered the essential concepts — time series components, stationarity, ARIMA, exponential smoothing, Prophet, LSTMs, ensembles, and evaluation — with an emphasis on practical application over mathematical formalism.

The key lessons:

  • All forecasts are wrong. The goal is to be usefully wrong — quantifying uncertainty so decision-makers can plan for ranges.
  • Simple models often outperform complex ones. Always benchmark against naive methods.
  • Prophet succeeded because it solved workflow problems, not because it was the most accurate algorithm.
  • External regressors (promotions, holidays, weather) can significantly improve forecasts but must be validated and available for the future period.
  • Evaluate forecasts by horizon, by level of aggregation, and using walk-forward validation. A single accuracy number without context is meaningless.
  • The hardest part of production forecasting is the data pipeline, not the model.

In Chapter 17, we will move from predicting numerical futures to generating entirely new content — entering the world of large language models and the generative AI revolution that has reshaped every industry conversation since 2022.

In Chapter 34, we will return to forecasting from a financial perspective, using Athena's demand forecasting system as a worked example for calculating the ROI of AI investments.


Chapter 16 draws on concepts from Chapter 8 (regression foundations) and Chapter 13 (neural network architecture). The Prophet workflow introduced here will be referenced in Chapter 34 (Measuring AI ROI) when we calculate the financial return on Athena's forecasting investment.