Case Study 2: The 2020 Forecasting Crisis — When All Models Broke
Introduction
In the first week of March 2020, the demand forecasting models at virtually every major consumer goods company, retailer, airline, hotel chain, restaurant group, and financial institution were operating normally. Accuracy metrics were within historical ranges. Seasonal patterns were tracking as expected. The models were doing exactly what they were designed to do: predict the future based on the past.
Within three weeks, every one of those models was producing useless output.
The COVID-19 pandemic did not just create forecast errors. It invalidated the fundamental assumption underlying all time series forecasting: that the future will resemble the past. Consumer behavior changed faster and more dramatically than at any point in modern economic history. Categories that had been stable for decades saw demand triple overnight. Categories that had been growing for years dropped to near zero. And the models — trained on years of stable, predictable data — had nothing in their history to prepare for what happened.
This case study examines how the pandemic broke demand forecasting, what organizations learned, and how forecasting practices changed in response. It is, in the language of this chapter, the most comprehensive structural break in the modern history of business forecasting.
What Happened: The Anatomy of a Forecasting Collapse
Phase 1: The Panic Buying Shock (March 2020)
The first phase was the most violent. As COVID-19 spread and lockdowns were announced, consumers engaged in panic buying at a scale not seen since the oil crises of the 1970s — and far exceeding those events in speed and breadth.
Toilet paper. The canonical example. U.S. toilet paper sales in the week ending March 14, 2020, were 734 percent above the same week in 2019 (according to Nielsen data). No demand forecasting model had a scenario for a 7x demand spike in a category that had been one of the most stable and predictable in all of consumer goods. Toilet paper demand had an annual coefficient of variation (a measure of volatility) of approximately 3-5 percent. In a single week, it experienced a 734 percent deviation.
Hand sanitizer. Sales increased by over 600 percent. Manufacturers that produced a few million units per month suddenly faced demand for tens of millions. Supply chains that measured lead time in months could not respond to demand that appeared in days.
Cleaning products. Lysol, Clorox, and similar brands saw demand increases of 200-400 percent. Clorox's CEO later told investors that the company experienced "a year's worth of demand in a matter of weeks."
Canned goods, frozen food, flour. Home cooking surged as restaurants closed. Flour sales increased by over 200 percent. Yeast sales increased by over 600 percent. Frozen pizza sales surged by 92 percent in a single week.
What the models saw: Every one of these categories had forecasting models trained on years of smooth, predictable data. The models interpreted the initial demand spike as an outlier — a one-time anomaly to be smoothed over or ignored. ARIMA models, which rely on recent past values to predict the future, began chasing the spike with a lag. Prophet models detected changepoints, but dated them too late and underestimated their magnitude. Exponential smoothing models, depending on their alpha parameters, either overreacted (high alpha) or barely registered the change (low alpha).
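To make the exponential smoothing behavior concrete, here is a minimal sketch, using synthetic numbers loosely modeled on the toilet paper example rather than real sales data, of how the alpha parameter governs the reaction to a one-week spike:

```python
def exponential_smoothing(series, alpha):
    """One-step-ahead forecasts: f[t+1] = alpha * y[t] + (1 - alpha) * f[t]."""
    forecasts = [series[0]]  # initialize with the first observation
    for y in series[:-1]:
        forecasts.append(alpha * y + (1 - alpha) * forecasts[-1])
    return forecasts

# Ten weeks of stable demand (~100 units), a one-week panic spike, then a
# return to slightly elevated levels. All values are illustrative.
demand = [100, 102, 99, 101, 100, 98, 103, 100, 734, 110, 105]

for alpha in (0.1, 0.9):
    f = exponential_smoothing(demand, alpha)
    print(f"alpha={alpha}: forecast for week 10 = {f[9]:.0f}, actual = {demand[9]}")
# alpha=0.1: forecast approx. 164 (the model barely registers the spike)
# alpha=0.9: forecast approx. 671 (the model chases the spike just as
#            demand has already normalized)
```

Neither parameter choice helps: the low-alpha model misses the spike, and the high-alpha model over-forecasts the week after it ends.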
Business Insight: The pandemic exposed a design assumption baked into every standard forecasting model: that demand distributions are approximately normal and that extreme deviations are temporary. These assumptions are correct in 99 percent of weeks. But the 1 percent of weeks when they fail can cause more damage than the 99 percent of weeks when they hold. This is the domain of "fat-tailed" risk that Nassim Nicholas Taleb has written about extensively — and that most operational forecasting systems are not designed to handle.
Phase 2: The Category Rotation (April-June 2020)
After the initial panic buying subsided, demand did not return to normal. Instead, consumer behavior rotated into entirely new patterns:
Work-from-home categories surged: Home office furniture (up more than 200 percent), computer monitors, webcams, home networking equipment. These categories had been growing steadily at 5-10 percent annually. The pandemic compressed three to five years of growth into three months.
At-home entertainment exploded: Video streaming subscriptions, gaming consoles, puzzles, home fitness equipment. Peloton, which had been a niche luxury product, became a mainstream brand virtually overnight.
Travel and hospitality collapsed: U.S. airline passenger counts dropped by 96 percent from the 2019 baseline. Hotel occupancy rates fell below 25 percent — levels not seen since the Great Depression. Restaurant revenue dropped by 50-70 percent depending on market and format.
Fuel demand cratered: U.S. gasoline demand fell by 30-50 percent. Oil prices briefly went negative for the first time in history.
Apparel and luxury declined sharply: With no office, no social events, and no travel, demand for business attire, formalwear, and luxury goods dropped by 30-50 percent.
What the models could not capture: The challenge was not just magnitude but structure. The models were trained on a world where toilet paper demand was stable and office furniture grew slowly. In the new world, toilet paper was volatile and office furniture was surging. The correlations between categories, between regions, and between demand signals changed simultaneously. A model that had learned "when the economy is strong, luxury goods sales rise" was now operating in an economy where GDP was falling and luxury goods sales were falling — but home improvement sales were exploding. The relationships the models had learned were not just wrong in degree; they were wrong in kind.
Phase 3: The Whiplash (July 2020-2021)
The third phase was perhaps the most challenging for forecasters. Demand oscillated unpredictably as lockdowns were imposed, relaxed, reimposed, and relaxed again in different geographies and on different timelines.
A restaurant forecasting model might see demand recover by 40 percent when outdoor dining was permitted, then drop by 60 percent when a winter surge triggered new restrictions, then recover again when vaccines became available — all within a few months. Each time, the model had to decide: is this the new normal, or a temporary fluctuation?
The bullwhip effect amplified the chaos. When retailers saw demand spikes, they doubled their orders to suppliers. When demand normalized, they canceled orders. Suppliers, seeing the cancellations, cut production. When demand resumed, the supply was not available. This classic supply chain dynamic, well-documented in the operations management literature, played out at pandemic scale across thousands of product categories simultaneously.
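The dynamic is easy to reproduce in simulation. The sketch below is a deliberately simplified multi-tier model, with illustrative parameters rather than calibrated ones, in which each tier forecasts from the orders it receives and follows an order-up-to inventory policy; the amplification emerges from the structure alone:

```python
def simulate_tier(incoming, alpha=0.5, cover_weeks=2.0):
    """Orders a tier places upstream under a simple order-up-to policy.

    The tier forecasts demand by exponential smoothing and tries to hold
    cover_weeks of forecast demand as inventory. Replenishment is assumed
    to arrive instantly, which understates the real-world chaos.
    """
    forecast = incoming[0]
    inventory = cover_weeks * forecast
    orders = []
    for d in incoming:
        forecast = alpha * d + (1 - alpha) * forecast
        target = cover_weeks * forecast
        order = max(0.0, d + (target - inventory))
        inventory = inventory - d + order  # instant replenishment
        orders.append(order)
    return orders

# Consumer demand: stable, one panic-buying week, then slightly elevated.
consumer = [100.0] * 8 + [300.0] + [110.0] * 8
retailer = simulate_tier(consumer)
wholesaler = simulate_tier(retailer)
factory = simulate_tier(wholesaler)

for name, series in [("consumer", consumer), ("retailer", retailer),
                     ("wholesaler", wholesaler), ("factory", factory)]:
    print(f"{name:>10}: peak {max(series):>6.0f}, low {min(series):>5.0f}")
# Peaks grow at each tier (roughly 300 -> 500 -> 900 -> 1700 here), and
# near-zero orders (cancellations) appear upstream after the spike passes.
```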
How Organizations Responded
The Immediate Response: Human Override
In the first weeks of the pandemic, most organizations effectively abandoned their forecasting models and switched to human judgment. Experienced demand planners, supply chain leaders, and category managers made allocation decisions based on:
- Real-time POS data (where available)
- Government announcements (lockdown orders, reopening timelines)
- Supply availability (what could physically be produced and shipped)
- Intuition and experience
This was operationally necessary but deeply uncomfortable for organizations that had invested millions in automated forecasting systems. It also revealed something important: the models could not handle the situation, but neither could humans. Human forecasts during the pandemic were not more accurate than the broken models — they were simply more adaptive. Humans could reason about unprecedented situations ("if the governor announces a lockdown tomorrow, demand for X will spike and demand for Y will drop"), while models could only learn from data they had already seen.
The Short-Term Fix: Regime-Based Modeling
Within months, more sophisticated organizations implemented regime-based or regime-switching approaches:
- Multiple model states. Instead of a single model trained on all historical data, teams built separate models for "normal," "lockdown," and "reopening" regimes. When a region entered lockdown, the system switched to the lockdown model, which was trained on data from prior lockdowns.
- Shortened training windows. Instead of training on 2-3 years of history (which included a pre-pandemic world that no longer existed), teams shortened training windows to 4-8 weeks, making models responsive to the current regime at the cost of reduced statistical stability.
- Indicator-based adjustments. Google Mobility data (which tracked changes in consumer movement to retail locations, workplaces, transit stations, and residences) became a widely used leading indicator. When mobility data showed a 40 percent decline in retail visits, demand models adjusted accordingly. A minimal sketch combining all three ideas follows this list.
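The sketch below is illustrative only: the regime labels, window lengths, and mobility elasticity are assumptions chosen to show the mechanics, not values from any production system.

```python
from statistics import mean

def regime_forecast(history, regime, mobility_change_pct):
    """history: weekly demand, most recent week last.
    regime: one of 'normal', 'lockdown', 'reopening' (assumed labels).
    mobility_change_pct: e.g. -40 for a 40 percent drop in retail visits."""
    # Shortened training windows in volatile regimes: two years of history
    # in normal times, only the most recent weeks during disruption.
    window = {"normal": 104, "lockdown": 4, "reopening": 8}[regime]
    baseline = mean(history[-window:])
    # Indicator-based adjustment: assume demand moves with the mobility
    # signal at a hypothetical elasticity of 0.5 outside the normal regime.
    elasticity = 0.0 if regime == "normal" else 0.5
    return baseline * (1 + elasticity * mobility_change_pct / 100)

# Usage: stable weekly history, a lockdown announced, mobility down 40 percent.
history = [100] * 104
print(regime_forecast(history, "lockdown", mobility_change_pct=-40))  # 80.0
```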
The Long-Term Transformation: Resilience Over Accuracy
By 2021-2022, the lessons of the pandemic had begun reshaping forecasting practices in fundamental ways:
1. Scenario planning replaced point forecasting. Organizations that had previously relied on a single demand plan began maintaining multiple scenarios — each with associated probability weights and contingency plans. The question shifted from "What will demand be?" to "What are the three most plausible demand outcomes, and are we prepared for each?"
2. Monitoring systems became as important as forecasting models. The most valuable capability during the pandemic was not the ability to predict demand accurately but the ability to detect demand shifts quickly. Real-time dashboards that compared actual sales against the forecast — with automated alerts when deviations exceeded thresholds — proved more valuable than better algorithms. (A minimal sketch of this alert logic follows the list.)
3. Supply chain flexibility became a strategic priority. Before the pandemic, supply chains were optimized for cost efficiency: long lead times, concentrated suppliers, lean inventory. After the pandemic, leading companies began investing in supply chain resilience: shorter lead times, diversified suppliers, higher safety stock for critical items, and flexible manufacturing capacity.
4. Human-in-the-loop processes were redesigned. The pandemic revealed that the standard forecasting workflow — model produces a forecast, planner reviews and adjusts, plan is executed — was too slow and too dependent on institutional assumptions. Post-pandemic, many organizations redesigned the human-in-the-loop process to focus planner attention on exceptions and scenarios rather than routine adjustments.
5. External data became non-negotiable. Before the pandemic, external data (POS data, mobility data, search trends, weather) was a "nice to have" for many organizations. After the pandemic, it was considered essential. The organizations that recovered fastest were those with real-time access to demand signals beyond their own shipment data.
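Returning to point 2 above, the following is a minimal sketch of a deviation alert; the two-week window and 30 percent threshold are illustrative assumptions, and a real system would tune them per category:

```python
def check_deviation(actuals, forecasts, window=2, threshold=0.30):
    """Alert when the mean absolute percentage deviation between actuals
    and forecasts over the last `window` periods exceeds `threshold`."""
    recent = list(zip(actuals[-window:], forecasts[-window:]))
    deviation = sum(abs(a - f) / max(a, 1e-9) for a, f in recent) / len(recent)
    return deviation > threshold, deviation

# Usage: a stable series, then a panic-buying surge the model did not see.
actuals = [100, 103, 99, 310, 420]
forecasts = [101, 100, 101, 100, 102]
alert, dev = check_deviation(actuals, forecasts)
print(f"alert={alert}, recent deviation={dev:.0%}")  # alert=True, 72%
```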
What the Models Could Not Learn
The pandemic was not, strictly speaking, an unpredictable event. Epidemiologists had warned of pandemic risk for decades. What was truly unpredictable was the specific nature of the behavioral response: which categories would surge, which would collapse, how quickly consumers would adapt, and how government policy would evolve.
No amount of historical data could have prepared a model for a world in which:
- Working from home shifted from an exception to the default for 40 percent of the workforce overnight
- International air travel dropped by 90 percent in a month
- Consumers simultaneously stockpiled groceries and stopped buying gasoline
- The correlation between employment levels and consumer spending temporarily inverted (stimulus payments sustained spending even as unemployment spiked)
These were not tail risks in the statistical sense — they were shifts in the causal structure of the economy. Time series models learn correlations from historical data. When the causal structure changes, the correlations change, and the models fail.
Caution: The lesson of 2020 is not "forecasting is useless." It is "forecasting is essential for normal times and insufficient for structural breaks." The appropriate response is not to abandon forecasting but to supplement it with monitoring, scenario planning, and organizational agility. A forecast that is accurate 95 percent of the time is enormously valuable — but the organization must have a plan for the 5 percent of the time when it is not.
The Aftermath: How Forecasting Changed
By 2023-2025, post-pandemic forecasting practices at leading organizations incorporated several new principles:
Structural break detection. Models now include automated monitoring for signs that the data-generating process has changed — sudden shifts in forecast error, changes in the correlation structure between categories, or external signals (pandemic indicators, major policy changes) that suggest the current model may be invalid.
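One common way to implement the forecast-error monitoring described above is a two-sided CUSUM (cumulative sum) test on standardized errors, which accumulates evidence of a sustained shift rather than reacting to single outliers. The drift and threshold values below are conventional defaults used here for illustration:

```python
def cusum_breaks(errors, mean=0.0, std=1.0, drift=0.5, threshold=5.0):
    """Return indices where a sustained shift in forecast errors is flagged.
    mean/std describe the errors' pre-break calibration."""
    s_pos = s_neg = 0.0
    flags = []
    for i, e in enumerate(errors):
        z = (e - mean) / std
        s_pos = max(0.0, s_pos + z - drift)  # evidence of an upward shift
        s_neg = max(0.0, s_neg - z - drift)  # evidence of a downward shift
        if s_pos > threshold or s_neg > threshold:
            flags.append(i)
            s_pos = s_neg = 0.0  # reset after an alarm
    return flags

# Usage: noise around zero, then persistent under-forecasting from week 10.
errors = [0.2, -0.5, 0.1, 0.3, -0.2, 0.0, 0.4, -0.1, 0.2, -0.3] + [3.0] * 6
print(cusum_breaks(errors))  # [12, 15]: first alarm two weeks after the break
```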
Faster model retraining. Pre-pandemic, many organizations retrained their forecasting models quarterly or annually. Post-pandemic, leading organizations retrain weekly or even daily, using shorter lookback windows when conditions are volatile.
Ensemble with human judgment. Rather than choosing between "trust the model" and "trust the planner," leading organizations combine both — using the model's statistical forecast as a baseline and the planner's judgment as an adjustment, with the weight between them varying based on the model's recent accuracy and the planner's track record.
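A minimal sketch of one way to implement this varying weight, using inverse recent error as the weighting rule; the rule and the numbers are illustrative assumptions, not a description of any particular organization's method:

```python
def blend(model_fc, planner_fc, model_recent_mae, planner_recent_mae):
    """Inverse-error weighting: the source with the better recent track
    record (lower mean absolute error) gets proportionally more weight."""
    w_model = 1.0 / max(model_recent_mae, 1e-9)
    w_planner = 1.0 / max(planner_recent_mae, 1e-9)
    return (w_model * model_fc + w_planner * planner_fc) / (w_model + w_planner)

# Stable period: the model has been sharper, so the blend stays near it.
print(blend(100, 140, model_recent_mae=5, planner_recent_mae=15))   # 110.0
# Disruption: the model's recent errors balloon and judgment dominates.
print(blend(100, 140, model_recent_mae=40, planner_recent_mae=10))  # 132.0
```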
Resilience metrics. Forecast accuracy (MAPE, WMAPE) is still tracked, but new metrics have been added: forecast bias (systematic over- or under-prediction), forecast volatility (how much the forecast changes from week to week), and recovery time (how quickly the model returns to acceptable accuracy after a disruption).
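These metrics are straightforward to compute; the definitions below are one reasonable formulation, offered as a sketch since exact conventions vary by organization:

```python
def forecast_bias(actuals, forecasts):
    """Signed relative error: positive means systematic over-forecasting."""
    return sum(f - a for a, f in zip(actuals, forecasts)) / sum(actuals)

def forecast_volatility(forecast_history):
    """Mean absolute week-over-week change in successive forecasts
    issued for the same future period."""
    changes = [abs(b - a) for a, b in zip(forecast_history, forecast_history[1:])]
    return sum(changes) / len(changes)

def recovery_time(weekly_ape, threshold=0.20):
    """Weeks from the first breach of the error threshold until the
    absolute percentage error falls back under it for good."""
    breach = next((i for i, e in enumerate(weekly_ape) if e > threshold), None)
    if breach is None:
        return 0  # never breached
    for i in range(breach, len(weekly_ape)):
        if all(e <= threshold for e in weekly_ape[i:]):
            return i - breach
    return None  # not yet recovered

# Usage: hypothetical weekly errors around a demand shock in week 2.
print(recovery_time([0.05, 0.08, 0.60, 0.45, 0.30, 0.15, 0.10, 0.08]))  # 3
```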
Lessons for Chapter 16
1. Stationarity is an assumption, not a fact. The chapter discusses stationarity as a mathematical requirement for ARIMA and related models. The pandemic demonstrated that stationarity is also a business assumption — an assumption that the world will continue to operate by the same rules. When that assumption breaks, the model breaks.
2. Prediction intervals matter most when they are hardest to estimate. In stable times, an 80 percent prediction interval is a useful planning tool. In a crisis, the actual outcome may fall outside the 99 percent interval — because the model's uncertainty estimate, like its point forecast, was calibrated on a world that no longer exists.
3. The value of forecasting is not the forecast — it is the process. Organizations with mature forecasting processes — those that routinely evaluated model accuracy, monitored for anomalies, and maintained contingency plans — recovered faster than organizations that treated forecasting as a "set it and forget it" function. The discipline of systematic thinking about the future proved more valuable than any specific model.
4. External regressors can be leading indicators of structural breaks. Google Mobility data, real-time POS data, and social media signals provided early warnings of demand shifts weeks before those shifts appeared in historical shipment data. Organizations that had invested in external data infrastructure before the pandemic were better positioned to detect and respond to the crisis.
5. Tom's LSTM lesson scales. Just as Tom's LSTM overfit to training data in the chapter's narrative, every forecasting model in the world overfit to a pre-pandemic training set in 2020. The principle is the same: a model that has only seen stable, predictable data cannot extrapolate to extreme, unprecedented conditions. Robustness to structural breaks requires design choices — shorter training windows, ensemble methods, human-in-the-loop oversight — not just better algorithms.
Discussion Questions
- If you were building a demand forecasting system today, what specific design features would you include to make it more resilient to structural breaks? Consider both technical (model architecture, monitoring) and organizational (processes, decision-making) features.
- During the pandemic, many organizations effectively abandoned their forecasting models and relied on human judgment. Under what circumstances is this the right decision? How do you know when to "trust the model" versus "trust the human"?
- The chapter argues that prediction intervals are more valuable than point forecasts. Did this argument hold during the pandemic? Under what circumstances do even prediction intervals become meaningless?
- Some researchers have argued that the pandemic should have been "forecastable" using scenario analysis and epidemiological models. Do you agree? What is the difference between "a pandemic is possible" (which was well-known) and "this specific pandemic will cause these specific demand shifts" (which was not)?
- How would you design a forecasting system that automatically detects structural breaks and alerts human decision-makers? What signals would you monitor, and what thresholds would trigger an alert?
This case study draws on publicly available data from Nielsen, IRI, the U.S. Census Bureau, the Bureau of Transportation Statistics, Google Mobility Reports, and industry publications. Organizational responses are based on published accounts from company earnings calls, supply chain industry conferences, and academic studies of the pandemic's impact on operations management.