Chapter 8: Supervised Learning — Regression

DataField.Dev

38 min read

> "If our model had been 15 percent more accurate, that's $1.8 million recovered. That's the business case for regression."

Prerequisites

Chapter 7 (Classification) for supervised learning foundations
Comfort interpreting line charts and trends
Basic understanding of financial forecasting
Familiarity with units, prices, and revenue calculations

Learning Objectives

Explain how regression models predict continuous business outcomes (sales, price, demand)
Translate forecasting problems (revenue, units, LTV) into regression specifications
Interpret RMSE, MAE, and R-squared in a business decision context
Evaluate when linear models are adequate versus when nonlinear methods earn their complexity
Recognize overfitting in a regression forecast presented by an analyst

In This Chapter

8.1 Regression in Business: Predicting the Numbers That Matter
8.2 Linear Regression by Intuition: The Line of Best Fit
8.3 Multiple Regression: When One Predictor Is Not Enough
8.4 Polynomial Regression: When Linear Is Not Enough
8.5 Regularization: The Overfitting Penalty
8.6 Regression Trees and Ensemble Methods
8.7 Time Series Basics for Regression
8.8 Feature Engineering for Regression
8.9 Evaluating Regression Models: Beyond R²
8.10 The DemandForecaster: A Complete Implementation
8.11 From Forecast to Action: The Business of Prediction Error
8.12 When Regression Models Fail
8.13 Chapter Summary: What Regression Means for Business
Key Terms

Exercises Quiz Case Study 01 Case Study 02 Key Takeaways Further Reading

Chapter 8: Supervised Learning — Regression

"If our model had been 15 percent more accurate, that's $1.8 million recovered. That's the business case for regression." — Ravi Mehta, VP Data & AI, Athena Retail Group

Ravi Mehta taps the projector remote, and a chart fills the screen. Two lines. One blue — actual daily sales of winter coats across Athena's 127 stores last November and December. One orange — the forecast that Athena's supply chain team had generated using a twelve-week moving average in Excel.

The blue line and the orange line track each other reasonably well for the first three weeks of November. Then, in the final week of November, the blue line surges upward — a near-vertical spike that peaks at 340 percent of the orange line's prediction.

"Early cold snap," Ravi says. "Polar vortex hit the Midwest and Northeast two weeks earlier than historical averages. Temperatures dropped fifteen degrees below normal. Demand for outerwear exploded."

He advances the slide. A single number in red: $3.2 million in lost sales from stockouts.

"Our warehouses ran dry in six days. We couldn't reorder fast enough. Stores in Chicago, Detroit, and Minneapolis had empty coat racks during the highest-demand week of the year." He lets the number sink in. "Our competitors — the ones with better demand models — captured that revenue instead."

NK shifts in her seat. She has been running the numbers in her head. Athena's inventory carrying costs last year were $40 million. Overstock on slow-moving categories — summer sandals that didn't move in August, holiday decorations ordered too early — cost another $12 million in markdowns. And now $3.2 million in lost sales from under-forecasting.

"This isn't just a statistics exercise," NK says, half to herself, half to the room. "It's a P&L impact."

"Exactly," Ravi says. "The difference between a mediocre demand forecast and a good one isn't academic. It's the difference between profitable quarters and disappointing ones. And that difference is what regression modeling is designed to address."

Professor Okonkwo steps forward. "In Chapter 7, we built classification models that answer yes or no questions — will this customer churn? Is this transaction fraudulent? Today, we move to the other major branch of supervised learning: regression. Regression answers how much and how many. How many units will we sell next week? How much should we charge for this product? How long will this customer remain active? What will revenue look like next quarter?"

She writes on the board:

Classification: Predict a category. Regression: Predict a number.

"The distinction is that simple — and that consequential."

8.1 Regression in Business: Predicting the Numbers That Matter

Regression is the workhorse of quantitative business analysis. Long before machine learning entered the business vocabulary, regression models were driving decisions in finance, marketing, operations, and strategy. What has changed is not the fundamental idea — fitting a mathematical relationship between inputs and an output — but the scale, sophistication, and automation of the process.

Where Regression Creates Business Value

Business Problem	What You're Predicting	Why It Matters
Demand forecasting	Units sold per product per store per day	Drives inventory ordering, warehouse staffing, markdown strategy
Dynamic pricing	Optimal price point for a product or service	Maximizes revenue or margin per transaction
Customer lifetime value (CLV)	Total revenue a customer will generate over time	Determines acquisition spend, retention investment priority
Revenue forecasting	Quarterly or annual revenue	Informs budgeting, headcount planning, investor communications
Lead scoring (continuous)	Expected deal value	Prioritizes sales team effort on highest-value prospects
Credit risk modeling	Expected loss given default	Sets interest rates, loan approval thresholds, reserve requirements
Operational efficiency	Processing time, defect rate, energy consumption	Identifies bottlenecks, reduces waste, optimizes resource allocation
Real estate valuation	Property price	Powers automated valuation models (AVMs), investment analysis

Business Insight. Classification and regression are not competing approaches — they are complementary tools for different questions. Athena uses classification to predict whether a customer will churn (Chapter 7) and regression to predict how much that customer will spend if retained. Together, these models power a unified customer value framework.

The Fundamental Regression Question

Every regression model answers the same question: Given what I know about the inputs, what is my best estimate of the output?

In mathematical notation, this is expressed as:

y = f(X) + error

Where: - y is the target variable — the number you want to predict (units sold, price, revenue) - X is the set of input features — the information you have available (historical sales, day of week, weather, promotions) - f(X) is the learned relationship between inputs and output — the model - error is the irreducible noise — the inherent unpredictability that no model can capture

The goal is not to eliminate the error. Some randomness is inherent in every business process. The goal is to learn an f(X) that captures as much of the systematic pattern as possible, leaving only genuine noise in the residuals.

Definition. A residual is the difference between a predicted value and the actual value for a specific observation: residual = actual - predicted. Positive residuals mean you under-predicted; negative residuals mean you over-predicted. Examining residuals reveals what your model is getting wrong — and often suggests how to improve it.

8.2 Linear Regression by Intuition: The Line of Best Fit

Linear regression is the simplest, most interpretable, and most widely used regression algorithm in business. Despite being centuries old — the method of least squares was published by Adrien-Marie Legendre in 1805 — it remains the starting point for most regression analyses and, frequently, the best model for the job.

The Core Idea

Imagine plotting Athena's daily coat sales (y-axis) against the average daily temperature (x-axis) for the past two years. You would see a clear pattern: as temperature decreases, coat sales increase. Not perfectly — some cold days have low sales (holidays, weekdays) and some warm days have surprising spikes (promotional events) — but the trend is clear.

Linear regression draws the best-fitting straight line through that scatter plot. "Best-fitting" means the line that minimizes the total squared distance between each data point and the line.

y = b₀ + b₁x

Where: - b₀ (the intercept) is the predicted value of y when x is zero. In our example, it's the predicted coat sales on a hypothetical day with a temperature of 0°F. It anchors the line vertically. - b₁ (the slope) is the effect size — how much y changes for each one-unit increase in x. If b₁ = -12, it means that for each one-degree increase in temperature, Athena sells approximately 12 fewer coats per day across all stores. - x is the input feature (temperature)

Why the Slope Is the Business Story

The slope coefficient is the single most important output of a linear regression for a business audience. It quantifies the relationship in actionable terms.

Consider these slope interpretations:

Slope	Business Interpretation
b₁ = -12 (temp → coat sales)	Each 1°F increase reduces daily coat sales by 12 units
b₁ = 0.03 (ad spend → revenue)	Each additional $1 of ad spend generates $0.03 in revenue
b₁ = 2.4 (store sqft → monthly revenue)	Each additional square foot of store space generates $2.40/month
b₁ = -0.15 (price → units sold)	Each $1 price increase reduces daily unit sales by 0.15

Business Insight. When presenting regression results to business stakeholders, lead with the slope. "Our model shows that every 1°F drop in temperature drives approximately 12 additional coat sales per day" is far more useful than "The R² of our temperature-sales model is 0.73." The slope is the so what. The R² is the how confident should I be.

What Least Squares Actually Does

The "best fit" line is determined by minimizing the sum of squared residuals — that is, the total of (actual - predicted)² across all data points. Squaring the residuals has two purposes: it treats over-predictions and under-predictions equally, and it penalizes large errors more heavily than small ones.

Visually, imagine each data point connected to the line by a vertical thread. The squared length of that thread is the squared residual. The least squares algorithm positions the line to minimize the total thread length (squared). Some points will be above the line, some below, but the line sits at the position where the overall "pull" is balanced.

Caution. Linear regression assumes the relationship between x and y is approximately linear. If the true relationship is curved — for example, demand that increases slowly at first but then explodes exponentially during cold snaps — a straight line will systematically mispredict. Always plot your data before fitting a model. The EDA techniques from Chapter 5 are not optional; they are prerequisite.

8.3 Multiple Regression: When One Predictor Is Not Enough

A single variable rarely tells the whole story. Coat sales depend on temperature, yes — but also on the day of the week, whether a promotion is running, whether it's a holiday, the store's size, and the product's price point.

Multiple regression extends the model to include multiple predictors:

y = b₀ + b₁x₁ + b₂x₂ + b₃x₃ + ... + bₙxₙ

Each coefficient (b₁, b₂, ..., bₙ) represents the effect of its corresponding feature while holding all other features constant. This "all else equal" interpretation is powerful and dangerous — powerful because it isolates the contribution of each variable, dangerous because it assumes the variables are independent, which they rarely are in practice.

The Athena Demand Model — First Pass

Ravi's team builds an initial multiple regression model for Athena's coat demand:

predicted_daily_sales = 285 - 11.3(avg_temp) + 142(is_promotion) + 38(is_weekend) - 62(is_post_holiday) + 0.8(store_sqft_thousands)

Reading the coefficients: - Base daily sales (all features at zero/false): 285 units - Each 1°F drop in temperature: +11.3 units - Running a promotion: +142 units - Weekend vs. weekday: +38 units - Post-holiday period: -62 units - Each additional 1,000 sqft of store space: +0.8 units

Athena Update. This initial model achieves an R² of 0.68 on the test set — it explains 68 percent of the variance in daily coat sales. Not bad for a first pass. But Ravi's team knows the remaining 32 percent includes the kind of extreme events — like the polar vortex — that cause the most expensive forecasting failures.

Interaction Effects: When 1 + 1 ≠ 2

Sometimes the effect of one variable depends on the value of another. A promotion during cold weather might boost sales by far more than the sum of the promotion effect and the temperature effect alone. This is called an interaction effect.

To capture interactions, you create a new feature that is the product of two existing features:

promotion_x_cold = is_promotion × cold_indicator

Adding this interaction term allows the model to learn that promotions are especially powerful during cold snaps — a business reality that a simple additive model would miss.

Multicollinearity: When Features Overlap

Multicollinearity occurs when two or more features are highly correlated with each other. For example, "average temperature" and "month" are highly correlated — January is cold, July is hot. Including both in a regression can cause problems:

The individual coefficients become unreliable and unstable
Small changes in the data can produce large swings in the coefficient values
The model still predicts well overall, but you cannot interpret individual feature effects

Definition. Multicollinearity is the condition in which two or more predictor variables in a regression model are highly correlated, making it difficult to isolate the individual effect of each predictor. It inflates the standard errors of coefficients and makes them unreliable for interpretation, even though the model's overall predictive power may be unaffected.

Detecting multicollinearity. The simplest approach: compute the correlation matrix of your features (you learned this in Chapter 5). Pairs of features with correlations above 0.8 or below -0.8 are candidates for multicollinearity. A more rigorous measure is the Variance Inflation Factor (VIF), which quantifies how much a coefficient's variance is inflated by correlation with other predictors. A VIF above 5 is a warning; above 10 is a problem.

Resolving multicollinearity. Options include: 1. Drop one of the correlated features 2. Combine them into a single feature (e.g., a "weather severity index" that combines temperature, wind, and precipitation) 3. Use regularization (Section 8.5), which naturally handles correlated features

8.4 Polynomial Regression: When Linear Is Not Enough

Sometimes the relationship between a feature and the target is genuinely curved. Athena's coat sales don't just increase linearly as temperature drops — they increase slowly from 60°F to 40°F, then rapidly from 40°F to 20°F, then explosively below 20°F. The relationship is nonlinear.

Polynomial regression handles this by adding squared, cubed, or higher-power terms of the original feature:

y = b₀ + b₁x + b₂x² + b₃x³

This is still a "linear" model in the mathematical sense — the relationship is linear in the parameters (b₀, b₁, b₂, b₃), even though it's nonlinear in the feature (x). This means all the familiar linear regression machinery (least squares, coefficient interpretation, statistical tests) still applies.

The Overfitting Danger

Here is where Tom Kowalski's first big lesson of the semester arrives.

Tom, with his engineering background, builds a polynomial regression with a degree of 15 — a model with fifteen different power terms for temperature. The resulting curve passes very close to every training data point. His R² on the training set is 0.97.

"Ninety-seven percent," Tom announces to his project team, grinning. "Beat that."

NK raises an eyebrow. "What's the R² on the test data?"

Tom runs the test evaluation. The grin fades. "Zero point four one."

Professor Okonkwo nods. "What happened, Tom?"

Tom stares at the screen. "The model memorized the training data. It learned the noise, not the signal. When it saw new data, it couldn't generalize."

"Exactly," Professor Okonkwo says. "This is overfitting — the single most common and most expensive mistake in machine learning. And it's a problem that is not limited to polynomial regression. Any model that is too complex relative to the amount of training data will overfit."

Definition. Overfitting occurs when a model learns not just the true underlying pattern in the training data but also the random noise. An overfit model performs excellently on training data but poorly on new, unseen data. It has "memorized" rather than "learned." The inverse — underfitting — occurs when a model is too simple to capture the true pattern, performing poorly on both training and test data.

The visual intuition is clear: a degree-2 polynomial captures the genuine curve in coat sales vs. temperature. A degree-15 polynomial contorts itself to pass through every training point, producing wild oscillations between data points. When new data arrives at temperatures slightly different from the training data, the degree-15 model produces absurd predictions — negative coat sales, or sales of 10,000 units on a 55°F day.

Choosing the Right Complexity

The right polynomial degree depends on: 1. The actual shape of the relationship (plot it first) 2. The amount of training data (more data supports higher complexity) 3. Validation performance (use cross-validation to test each degree)

In practice, polynomial degrees above 3 or 4 are rarely justified in business applications. If the relationship is more complex than a cubic curve, tree-based models (Section 8.6) usually perform better.

Try It. Fit a linear, quadratic, and cubic regression to the same data. Plot all three curves alongside the scatter plot. Which one captures the pattern without chasing the noise? Now try degree 10 and see what happens at the edges of your data range. This exercise builds the intuition for overfitting more effectively than any textbook explanation.

8.5 Regularization: The Overfitting Penalty

Tom's overfitting disaster has a systematic solution: regularization. The idea is elegantly simple — add a penalty to the model's objective function that discourages complexity.

The Intuition

Without regularization, the model's only goal is to minimize prediction error on the training data. This is like telling a student that the only thing that matters is their score on practice exams. The student might memorize the answer key rather than learning the material — and then fail the actual exam.

Regularization adds a second goal: keep the model coefficients small. This is like telling the student: "You'll be graded on your practice score and on the simplicity of your approach. No memorizing the answer key."

Mathematically, regularization modifies the objective function from:

Minimize: prediction error (overfitting risk)

to:

Minimize: prediction error + λ × coefficient penalty (balanced)

The parameter λ (lambda) controls the trade-off. When λ = 0, there is no penalty — you get standard regression. As λ increases, the model is penalized more heavily for large coefficients, forcing it toward simpler solutions.

Ridge Regression (L2 Regularization)

Ridge regression adds a penalty equal to the sum of squared coefficients:

Penalty = λ × (b₁² + b₂² + b₃² + ... + bₙ²)

This penalty shrinks all coefficients toward zero but never exactly to zero. Ridge is effective at handling multicollinearity — when features are correlated, Ridge distributes the coefficient weight among them rather than assigning all the weight to one and none to the others.

Lasso Regression (L1 Regularization)

Lasso adds a penalty equal to the sum of absolute values of coefficients:

Penalty = λ × (|b₁| + |b₂| + |b₃| + ... + |bₙ|)

Lasso has a distinctive property: it can shrink coefficients exactly to zero, effectively performing automatic feature selection. Features that contribute little predictive power are eliminated entirely. This is valuable when you have many features and suspect that only a subset truly matters.

Ridge vs. Lasso: A Business Decision

Scenario	Better Choice	Why
Many features, most relevant	Ridge	Keeps all features, reduces their impact proportionally
Many features, few relevant	Lasso	Eliminates irrelevant features, produces a simpler model
Correlated features	Ridge	Distributes weight among correlated features
Need feature selection	Lasso	Automatically identifies the most important features
Uncertain	Elastic Net	Combines Ridge and Lasso penalties

Business Insight. For business stakeholders, Lasso has a powerful advantage beyond prediction: interpretability. When Lasso reduces a 50-feature model to 8 features, it answers the question every executive asks: "What are the key drivers?" A model that says "These 8 factors drive 90 percent of demand variation" is far more actionable than one that says "All 50 factors contribute."

Tom rebuilds his demand model using Ridge regularization. The training R² drops from 0.97 to 0.79 — a seemingly worse result. But the test R² improves from 0.41 to 0.76. The model has traded memorization for generalization.

"The training score going down is actually good news?" Tom asks.

"When it means the test score goes up, absolutely," Professor Okonkwo replies. "You're not trying to win a training competition. You're trying to build a model that works on data it's never seen — which is the only data that matters in production."

8.6 Regression Trees and Ensemble Methods

Linear models (including regularized variants) assume a specific functional form — a straight line or smooth curve. When the true relationship is more complex — involving thresholds, interactions, and nonlinear patterns that vary across different regions of the feature space — tree-based methods often outperform.

Decision Trees for Regression

A regression tree makes predictions by recursively splitting the data into subsets based on feature values, then predicting the average target value within each final subset (leaf node).

For Athena's coat demand:

Is temperature < 35°F?
├── Yes: Is there a promotion running?
│   ├── Yes: Predict 480 units
│   └── No: Is it a weekend?
│       ├── Yes: Predict 310 units
│       └── No: Predict 240 units
└── No: Is there a promotion running?
    ├── Yes: Predict 195 units
    └── No: Predict 85 units

Each path through the tree represents a rule, and each leaf represents a prediction. Trees naturally capture interactions (promotion effect differs above and below 35°F) and thresholds (the sharp change at 35°F) without requiring the analyst to specify them in advance.

Advantages of regression trees: - No assumptions about the functional form - Automatically capture interactions and nonlinear patterns - Easy to interpret and explain to business stakeholders - Handle both numerical and categorical features natively

Disadvantages of regression trees: - A single tree is prone to overfitting (deep trees memorize training data) - Predictions are "stepped" — a tree cannot predict values outside the range of its training data - Sensitive to small changes in training data (high variance)

Random Forest: The Wisdom of Many Trees

A single tree is unstable. But a forest of trees — each built on a slightly different random sample of the data and features — produces robust, accurate predictions. This is the Random Forest algorithm.

The key ideas:

Bootstrap aggregation (bagging). Train each tree on a random sample (with replacement) of the training data. This ensures each tree sees a slightly different version of the data.
Feature randomization. At each split, consider only a random subset of features. This ensures the trees are diverse — they don't all split on the same dominant feature.
Averaging. The final prediction is the average of all trees' predictions. This averaging cancels out individual trees' errors, much like polling multiple experts produces a better forecast than asking any single expert.

Definition. Ensemble methods combine multiple models to produce a prediction that is better than any individual model. The two main families are bagging (training models on different data samples and averaging — Random Forest) and boosting (training models sequentially, with each model focusing on the errors of the previous one — Gradient Boosting).

Gradient Boosting: Learning from Mistakes

While Random Forest trains trees independently and averages them, Gradient Boosting trains trees sequentially. Each new tree is trained to predict the residuals (errors) of the previous trees. The final prediction is the sum of all trees' predictions.

The intuition: imagine you're trying to hit a target with a dart. Your first throw lands 3 inches to the right. Your second throw aims 3 inches to the left of where the first one landed. Your third throw corrects for the remaining error. Each throw compensates for the errors of previous throws.

XGBoost (Extreme Gradient Boosting) is the most popular implementation. It adds several refinements — regularization to prevent overfitting, efficient handling of missing values, and highly optimized computation — that make it the go-to algorithm for structured data problems in industry.

Business Insight. In Kaggle competitions and industry benchmarks, gradient boosting algorithms (XGBoost, LightGBM, CatBoost) consistently outperform other methods on structured (tabular) data. They are the default choice for most business regression problems when predictive accuracy is the primary objective and interpretability is secondary. When interpretability is paramount — such as regulated industries or executive-facing models — start with linear regression or a shallow decision tree.

Model Comparison: When to Use What

Algorithm	Accuracy	Interpretability	Best For
Linear Regression	Moderate	High	Simple relationships, baseline models, regulated contexts
Ridge/Lasso	Moderate-High	High	Many features, multicollinearity, feature selection
Polynomial Regression	Moderate	Moderate	Known curved relationships (low degree)
Decision Tree	Moderate	Very High	Exploratory analysis, rule-based explanations
Random Forest	High	Low	General-purpose prediction, robust performance
XGBoost/GBM	Very High	Low	Maximum accuracy on structured data

8.7 Time Series Basics for Regression

Athena's demand forecasting problem has a temporal dimension that most standard regression problems do not: the order of the data matters. Tuesday's sales are not independent of Monday's sales. November's demand is not independent of October's demand. Ignoring the time structure in temporal data is a recipe for misleading models.

This section introduces the key concepts for incorporating time into regression models. Chapter 16 provides a comprehensive treatment of dedicated time series methods (ARIMA, Prophet, and neural network approaches). Here, we focus on how to make standard regression algorithms time-aware.

The Three Components of Time Series Data

Any time series can be decomposed into three components:

Trend — the long-term direction. Is demand growing, shrinking, or stable over months and years?
Seasonality — repeating patterns at fixed intervals. Daily patterns (weekday vs. weekend), monthly patterns (beginning of month vs. end of month), annual patterns (summer vs. winter).
Residual — what remains after removing trend and seasonality. This includes both genuine noise and the effects of non-time features (promotions, weather events, competitor actions).

Lag Features: Using the Past to Predict the Future

A lag feature is simply the target variable's value at a previous time step. If you're predicting today's sales, yesterday's sales (lag-1), last week's same-day sales (lag-7), and last year's same-day sales (lag-365) are powerful predictors.

| Date       | Sales | Lag_1  | Lag_7  | Lag_30 |
|------------|-------|--------|--------|--------|
| 2025-01-05 | 245   | 218    | 312    | 198    |
| 2025-01-06 | 189   | 245    | 285    | 210    |
| 2025-01-07 | 203   | 189    | 247    | 195    |

Lag features encode "momentum" and "memory" in the data. A product that sold well yesterday is likely to sell well today. A product that sold well on last Tuesday is likely to sell well on this Tuesday.

Rolling Statistics: Smoothing the Noise

Rolling (moving) averages and rolling standard deviations capture the local trend and variability:

Rolling mean (7-day): Average sales over the past 7 days — captures the current demand level while smoothing daily fluctuations
Rolling std (7-day): Standard deviation of sales over the past 7 days — captures demand volatility, which is useful for safety stock calculations
Rolling mean (30-day): Longer-term trend indicator

Calendar Features: Encoding Time Structure

Calendar features capture the systematic patterns associated with specific time periods:

Day of week (Monday = 0, ... Sunday = 6) — captures weekday/weekend effects
Month (1-12) — captures seasonal patterns
Day of month — captures paycheck cycles (spending often spikes on the 1st and 15th)
Week of year — captures annual seasonality
Holiday flags — binary indicators for major holidays
Days until next holiday — captures pre-holiday shopping behavior
Is payday — for businesses where spending correlates with pay cycles

Caution. When using lag features for time series prediction, you must be careful about data leakage. Your training data must only use information that would have been available at the time of prediction. If you're predicting Monday's sales, you can use the previous Friday's sales (lag-3) but not Monday's sales from the future. This seems obvious, but it's a common mistake in practice — particularly when features are computed on the full dataset before splitting into train and test sets.

8.8 Feature Engineering for Regression

Feature engineering — the art of creating new input features from raw data — is often the difference between a mediocre model and a good one. In Chapter 5, we explored data through visualization and summary statistics. Feature engineering is the next step: transforming what you've learned about the data into features that make the model's job easier.

Log Transforms: Taming Skewed Data

Many business variables — revenue, prices, salaries, web traffic — follow skewed distributions with long right tails. A few very large values dominate the raw data. Taking the logarithm compresses the scale:

import numpy as np

# Raw revenue: [100, 200, 300, 500, 50000]
# Log revenue:  [4.6, 5.3, 5.7, 6.2, 10.8]

Log-transforming the target variable often improves regression performance because it: - Reduces the influence of extreme values - Makes the relationship between features and target more linear - Stabilizes the variance (errors become more uniform across the prediction range)

Caution. When you log-transform the target, your model predicts log(y). To get predictions in original units, you must exponentiate: predicted_y = exp(predicted_log_y). A common mistake is to report the error metrics on the log scale — which looks artificially good — rather than converting back to original units first.

Interaction Terms

As we discussed in Section 8.3, interaction terms capture the idea that two features together have an effect different from the sum of their individual effects:

df['promo_x_cold'] = df['is_promotion'] * df['is_cold']
df['weekend_x_season'] = df['is_weekend'] * df['is_winter']

In demand forecasting, the interaction between promotion and season is often the most important feature in the model. A winter coat promotion in January has a vastly different effect than the same promotion in July.

Polynomial Features

For continuous variables where the relationship is curved, adding squared or cubed terms captures the nonlinearity:

df['temp_squared'] = df['avg_temp'] ** 2
df['price_squared'] = df['price'] ** 2

Use these sparingly — and always with regularization — to avoid overfitting.

Binning (Discretization)

Sometimes converting a continuous variable into categories improves model performance, especially when the relationship has clear thresholds:

df['temp_bin'] = pd.cut(
    df['avg_temp'],
    bins=[-20, 20, 35, 50, 65, 100],
    labels=['extreme_cold', 'cold', 'cool', 'mild', 'warm']
)

This is particularly useful for decision tree models and when the business operates on categorical thresholds ("If it's below freezing, activate the cold weather protocol").

Target Encoding for Categorical Variables

When a categorical feature has many levels (e.g., product SKU, store ID), one-hot encoding creates too many features. Target encoding replaces each category with the mean target value for that category:

# Instead of 500 one-hot columns for 500 stores:
store_avg_sales = df.groupby('store_id')['daily_sales'].mean()
df['store_avg'] = df['store_id'].map(store_avg_sales)

Caution. Target encoding can cause data leakage if computed on the full dataset before train-test splitting. Always compute target encodings only on the training set, then apply those values to the test set.

8.9 Evaluating Regression Models: Beyond R²

In Chapter 7, we evaluated classification models using accuracy, precision, recall, and F1. Regression evaluation requires a different set of metrics — and, critically, a different mindset about what "good" means.

R² (R-Squared): The Proportion of Variance Explained

R² measures the proportion of variance in the target variable that the model explains:

R² = 1 - (sum of squared residuals / sum of squared deviations from the mean)

An R² of 0.76 means the model explains 76 percent of the variance in the target. The remaining 24 percent is unexplained — a combination of missing features, noise, and model limitations.

Interpreting R² in business context:

R² Range	Typical Interpretation	Business Context
0.90+	Excellent	Physical systems, well-controlled processes
0.70-0.90	Good	Demand forecasting with rich features, pricing models
0.50-0.70	Moderate	Customer behavior prediction, market response models
0.30-0.50	Weak but potentially useful	Early-stage models, complex behaviors, fashion demand
< 0.30	Poor	Model may not be capturing meaningful signal

Caution. R² always increases (or stays the same) when you add more features, even if those features are random noise. This is why Tom's degree-15 polynomial had a training R² of 0.97 — the extra terms "explained" noise, not signal. Adjusted R² penalizes model complexity and is a more reliable metric for comparing models with different numbers of features.

MAE (Mean Absolute Error): The Average Mistake

MAE = average of |actual - predicted| across all observations

MAE is expressed in the same units as the target variable, making it directly interpretable. If your MAE is 42 units, your predictions are off by an average of 42 units.

RMSE (Root Mean Squared Error): Penalizing Large Errors

RMSE = square root of (average of (actual - predicted)² across all observations)

RMSE is also in the same units as the target, but it penalizes large errors more heavily than small ones because of the squaring step. If your business cares more about avoiding big misses than about average accuracy, RMSE is a better metric than MAE.

When MAE ≈ RMSE: Your errors are relatively uniform — the model doesn't have occasional catastrophic failures.

When RMSE >> MAE: You have some large outlier errors — the model is usually good but occasionally very wrong. In demand forecasting, this is often the pattern to watch for, because the big misses are the ones that cause stockouts or massive overstock.

MAPE (Mean Absolute Percentage Error): The Relative Mistake

MAPE = average of |actual - predicted| / actual × 100%

MAPE expresses error as a percentage, which is useful when comparing forecast accuracy across products with different demand levels. An MAE of 50 units is excellent for a product that sells 5,000 units/day but terrible for one that sells 100 units/day. MAPE normalizes for scale.

MAPE Range	Interpretation
< 10%	Excellent forecasting accuracy
10-20%	Good — typical for most business forecasting applications
20-30%	Reasonable — may be acceptable for volatile categories
30-50%	Inaccurate — investigate data quality, features, or model choice
> 50%	Poor — the model may not be viable for this use case

Business Insight. The choice of error metric should be driven by the business cost of errors, not by convention. In demand forecasting, an over-prediction of 100 units (excess inventory sitting in a warehouse) costs Athena differently from an under-prediction of 100 units (lost sales and disappointed customers). Section 8.11 shows how to translate these asymmetric costs into a custom business metric.

Visualizing Model Performance

Three diagnostic plots that every regression analysis should include:

Actual vs. Predicted scatter plot. Plot predicted values (x-axis) against actual values (y-axis). A perfect model produces a 45-degree line. Deviations from the line reveal systematic biases. Points far from the line are high-error observations worth investigating.
Residual plot. Plot residuals (actual - predicted) against predicted values. Residuals should be randomly scattered around zero with no pattern. Patterns in the residuals (a funnel shape, a curve, clusters) indicate that the model is systematically missing something.
Time series of actuals vs. predictions. For temporal data, plot both lines over time. This reveals whether the model tracks trends and captures seasonal patterns or systematically lags behind.

8.10 The DemandForecaster: A Complete Implementation

It's time to build. This section presents the complete DemandForecaster class — Athena's demand prediction system, implemented in Python. We will generate synthetic data that mirrors Athena's actual patterns, engineer features, compare multiple models, and evaluate results using business-relevant metrics.

Code Explanation. The code below is designed to run in a single Jupyter notebook. It uses pandas, numpy, scikit-learn, and xgboost — all standard libraries in any data science environment. If you have not yet installed xgboost, run pip install xgboost in your terminal.

Step 1: Generate Synthetic Athena Sales Data

import numpy as np
import pandas as pd
from datetime import datetime, timedelta

def generate_athena_sales_data(
    start_date: str = "2022-01-01",
    end_date: str = "2024-12-31",
    n_stores: int = 5,
    seed: int = 42
) -> pd.DataFrame:
    """
    Generate synthetic daily sales data for Athena Retail Group.

    Simulates realistic demand patterns including:
    - Seasonal variation (winter peaks, summer troughs)
    - Day-of-week effects (weekends higher)
    - Promotional lifts (random promotions with demand spikes)
    - Holiday effects (pre-holiday surge, post-holiday dip)
    - Weather influence (temperature-driven demand)
    - Store-level variation (different base demand by store size)
    - Random noise
    """
    np.random.seed(seed)

    dates = pd.date_range(start=start_date, end=end_date, freq="D")
    stores = [f"STORE_{i+1:03d}" for i in range(n_stores)]
    store_sizes = {
        "STORE_001": 15000, "STORE_002": 22000, "STORE_003": 8000,
        "STORE_004": 18000, "STORE_005": 12000
    }

    records = []

    for store in stores:
        base_demand = store_sizes[store] / 100  # Larger stores sell more

        for date in dates:
            day_of_week = date.dayofweek
            month = date.month
            day_of_year = date.dayofyear

            # --- Seasonal component (sinusoidal, peak in winter) ---
            seasonal = 30 * np.cos(2 * np.pi * (day_of_year - 15) / 365)

            # --- Temperature (correlated with season, with noise) ---
            base_temp = 55 - 25 * np.cos(2 * np.pi * (day_of_year - 15) / 365)
            temp = base_temp + np.random.normal(0, 5)

            # --- Temperature effect on demand ---
            temp_effect = -0.8 * (temp - 50)  # Below 50F increases demand

            # --- Day of week effect ---
            weekend_boost = 25 if day_of_week >= 5 else 0
            friday_boost = 10 if day_of_week == 4 else 0

            # --- Promotion (random, ~15% of days) ---
            is_promotion = 1 if np.random.random() < 0.15 else 0
            promo_lift = is_promotion * np.random.uniform(40, 120)

            # --- Holiday effects ---
            us_holidays = [
                (1, 1), (1, 15), (2, 14), (2, 19), (5, 27),
                (7, 4), (9, 2), (10, 31), (11, 28), (12, 25)
            ]
            is_near_holiday = 0
            for h_month, h_day in us_holidays:
                try:
                    h_date = datetime(date.year, h_month, h_day)
                    days_to_holiday = (h_date - date.to_pydatetime()).days
                    if 0 < days_to_holiday <= 7:
                        is_near_holiday = 1
                except ValueError:
                    pass

            holiday_effect = 35 * is_near_holiday

            # --- Combine all components ---
            demand = (
                base_demand
                + seasonal
                + temp_effect
                + weekend_boost
                + friday_boost
                + promo_lift
                + holiday_effect
                + np.random.normal(0, 15)  # Random noise
            )

            demand = max(10, int(round(demand)))  # Floor at 10 units

            records.append({
                "date": date,
                "store_id": store,
                "store_sqft": store_sizes[store],
                "daily_sales": demand,
                "avg_temp": round(temp, 1),
                "is_promotion": is_promotion,
                "is_weekend": 1 if day_of_week >= 5 else 0,
                "day_of_week": day_of_week,
                "month": month,
                "is_near_holiday": is_near_holiday,
                "product_category": "outerwear"
            })

    df = pd.DataFrame(records)
    df = df.sort_values(["store_id", "date"]).reset_index(drop=True)
    return df

# Generate the data
sales_data = generate_athena_sales_data()
print(f"Dataset shape: {sales_data.shape}")
print(f"Date range: {sales_data['date'].min()} to {sales_data['date'].max()}")
print(f"Stores: {sales_data['store_id'].nunique()}")
print(f"\nSample rows:")
print(sales_data.head(10))

Code Explanation. This function generates three years of daily sales data for five Athena stores. The demand patterns are deliberately realistic: winter peaks, weekend boosts, promotional spikes, and pre-holiday surges. The noise level (standard deviation of 15 units) ensures that no model will achieve a perfect score — just like real data.

Step 2: Feature Engineering

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Create time-based and derived features for demand forecasting.

    Features created:
    - Lag features (1-day, 7-day, 14-day, 28-day)
    - Rolling statistics (7-day and 28-day mean and std)
    - Calendar features (week of year, quarter, day of month)
    - Interaction features (promotion x cold, weekend x season)
    - Temperature bins
    """
    df = df.copy()
    df = df.sort_values(["store_id", "date"]).reset_index(drop=True)

    # --- Lag features (within each store) ---
    for lag in [1, 7, 14, 28]:
        df[f"lag_{lag}"] = df.groupby("store_id")["daily_sales"].shift(lag)

    # --- Rolling statistics (within each store) ---
    for window in [7, 28]:
        rolling = df.groupby("store_id")["daily_sales"].transform(
            lambda x: x.shift(1).rolling(window=window, min_periods=1).mean()
        )
        df[f"rolling_mean_{window}"] = rolling

        rolling_std = df.groupby("store_id")["daily_sales"].transform(
            lambda x: x.shift(1).rolling(window=window, min_periods=1).std()
        )
        df[f"rolling_std_{window}"] = rolling_std

    # --- Calendar features ---
    df["week_of_year"] = df["date"].dt.isocalendar().week.astype(int)
    df["quarter"] = df["date"].dt.quarter
    df["day_of_month"] = df["date"].dt.day
    df["is_month_start"] = (df["day_of_month"] <= 3).astype(int)
    df["is_month_end"] = (df["day_of_month"] >= 28).astype(int)

    # --- Interaction features ---
    df["promo_x_cold"] = df["is_promotion"] * (df["avg_temp"] < 35).astype(int)
    df["weekend_x_winter"] = df["is_weekend"] * df["month"].isin([11, 12, 1, 2]).astype(int)

    # --- Temperature bins ---
    df["temp_bin"] = pd.cut(
        df["avg_temp"],
        bins=[-30, 20, 35, 50, 65, 110],
        labels=[0, 1, 2, 3, 4]
    ).astype(float)

    # --- Drop rows with NaN from lagging (first 28 days per store) ---
    df = df.dropna().reset_index(drop=True)

    return df

# Engineer features
sales_featured = engineer_features(sales_data)
print(f"Feature-engineered shape: {sales_featured.shape}")
print(f"Features: {list(sales_featured.columns)}")

Code Explanation. Notice that every lag and rolling feature uses .shift(1) or .shift(lag) to ensure we only use past information. The rolling mean is computed on shifted data — x.shift(1).rolling(...) — so today's sales are not used to compute today's rolling average. This prevents data leakage, the most insidious error in time series modeling.

Step 3: The DemandForecaster Class

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import StandardScaler
import warnings
warnings.filterwarnings("ignore")

try:
    from xgboost import XGBRegressor
    HAS_XGBOOST = True
except ImportError:
    HAS_XGBOOST = False
    print("XGBoost not installed. Install with: pip install xgboost")


class DemandForecaster:
    """
    A multi-model demand forecasting system for Athena Retail Group.

    Compares multiple regression algorithms and evaluates them
    using business-relevant metrics including MAPE, MAE, RMSE,
    and the financial cost of forecast errors.

    Usage:
        forecaster = DemandForecaster(sales_featured)
        forecaster.prepare_data(test_months=3)
        results = forecaster.train_and_compare()
        forecaster.display_results(results)
        forecaster.plot_forecast(model_name="XGBoost")
        forecaster.calculate_business_impact(model_name="XGBoost")
    """

    FEATURE_COLUMNS = [
        "store_sqft", "avg_temp", "is_promotion", "is_weekend",
        "day_of_week", "month", "is_near_holiday",
        "lag_1", "lag_7", "lag_14", "lag_28",
        "rolling_mean_7", "rolling_mean_28",
        "rolling_std_7", "rolling_std_28",
        "week_of_year", "quarter", "day_of_month",
        "is_month_start", "is_month_end",
        "promo_x_cold", "weekend_x_winter", "temp_bin"
    ]
    TARGET = "daily_sales"

    def __init__(self, data: pd.DataFrame):
        """Initialize with feature-engineered sales data."""
        self.data = data.copy()
        self.models = {}
        self.predictions = {}
        self.scaler = StandardScaler()
        self.X_train = None
        self.X_test = None
        self.y_train = None
        self.y_test = None
        self.test_data = None

    def prepare_data(self, test_months: int = 3) -> None:
        """
        Split data into training and test sets using a temporal split.

        The last `test_months` months of data become the test set.
        This mimics real-world deployment where you train on historical
        data and predict into the future.
        """
        max_date = self.data["date"].max()
        test_start = max_date - pd.DateOffset(months=test_months)

        train_mask = self.data["date"] < test_start
        test_mask = self.data["date"] >= test_start

        self.X_train = self.data.loc[train_mask, self.FEATURE_COLUMNS].values
        self.y_train = self.data.loc[train_mask, self.TARGET].values
        self.X_test = self.data.loc[test_mask, self.FEATURE_COLUMNS].values
        self.y_test = self.data.loc[test_mask, self.TARGET].values
        self.test_data = self.data.loc[test_mask].copy()

        # Scale features for linear models
        self.X_train_scaled = self.scaler.fit_transform(self.X_train)
        self.X_test_scaled = self.scaler.transform(self.X_test)

        print(f"Training set: {len(self.y_train):,} observations")
        print(f"Test set:     {len(self.y_test):,} observations")
        print(f"Test period:  {test_start.date()} to {max_date.date()}")

    def _calculate_mape(self, actual: np.ndarray, predicted: np.ndarray) -> float:
        """Calculate Mean Absolute Percentage Error, handling zeros."""
        mask = actual > 0
        if mask.sum() == 0:
            return float("inf")
        return np.mean(np.abs((actual[mask] - predicted[mask]) / actual[mask])) * 100

    def train_and_compare(self) -> pd.DataFrame:
        """
        Train multiple regression models and compare their performance.

        Models:
        - Linear Regression (baseline)
        - Ridge Regression (L2 regularization)
        - Lasso Regression (L1 regularization / feature selection)
        - Random Forest
        - XGBoost (if installed)

        Returns a DataFrame of evaluation metrics for each model.
        """
        model_configs = {
            "Linear Regression": LinearRegression(),
            "Ridge (alpha=1.0)": Ridge(alpha=1.0),
            "Lasso (alpha=0.5)": Lasso(alpha=0.5),
            "Random Forest": RandomForestRegressor(
                n_estimators=200,
                max_depth=12,
                min_samples_leaf=10,
                random_state=42,
                n_jobs=-1
            ),
        }

        if HAS_XGBOOST:
            model_configs["XGBoost"] = XGBRegressor(
                n_estimators=300,
                max_depth=6,
                learning_rate=0.05,
                subsample=0.8,
                colsample_bytree=0.8,
                reg_alpha=0.1,
                reg_lambda=1.0,
                random_state=42,
                verbosity=0
            )

        results = []

        for name, model in model_configs.items():
            # Use scaled data for linear models, raw data for tree-based
            is_linear = name in [
                "Linear Regression", "Ridge (alpha=1.0)", "Lasso (alpha=0.5)"
            ]
            X_tr = self.X_train_scaled if is_linear else self.X_train
            X_te = self.X_test_scaled if is_linear else self.X_test

            # Train
            model.fit(X_tr, self.y_train)
            self.models[name] = model

            # Predict
            train_pred = model.predict(X_tr)
            test_pred = model.predict(X_te)

            # Ensure non-negative predictions
            test_pred = np.maximum(test_pred, 0)
            train_pred = np.maximum(train_pred, 0)

            self.predictions[name] = test_pred

            # Evaluate
            results.append({
                "Model": name,
                "Train R²": round(r2_score(self.y_train, train_pred), 4),
                "Test R²": round(r2_score(self.y_test, test_pred), 4),
                "MAE": round(mean_absolute_error(self.y_test, test_pred), 2),
                "RMSE": round(
                    np.sqrt(mean_squared_error(self.y_test, test_pred)), 2
                ),
                "MAPE (%)": round(self._calculate_mape(self.y_test, test_pred), 2),
                "Overfit Gap": round(
                    r2_score(self.y_train, train_pred)
                    - r2_score(self.y_test, test_pred), 4
                ),
            })

        results_df = pd.DataFrame(results).sort_values("Test R²", ascending=False)
        return results_df

    def display_results(self, results: pd.DataFrame) -> None:
        """Print model comparison results with business interpretation."""
        print("\n" + "=" * 80)
        print("DEMAND FORECASTER — MODEL COMPARISON RESULTS")
        print("=" * 80)
        print(results.to_string(index=False))
        print("\n" + "-" * 80)

        best = results.iloc[0]
        print(f"\nBest model: {best['Model']}")
        print(f"  Test R²:  {best['Test R²']:.4f} "
              f"({best['Test R²']*100:.1f}% of variance explained)")
        print(f"  MAE:      {best['MAE']:.1f} units "
              "(average prediction error)")
        print(f"  MAPE:     {best['MAPE (%)']:.1f}% "
              "(average percentage error)")
        print(f"  Overfit:  {best['Overfit Gap']:.4f} "
              "(gap between train and test R²)")

        if best["Overfit Gap"] > 0.10:
            print("\n  ⚠ WARNING: Overfit gap > 0.10. Consider more "
                  "regularization or simpler model.")
        elif best["Overfit Gap"] < 0.03:
            print("\n  Model generalizes well (train-test gap < 0.03).")

    def get_feature_importance(self, model_name: str = "Random Forest") -> pd.DataFrame:
        """
        Extract feature importance from tree-based models,
        or coefficient magnitudes from linear models.
        """
        model = self.models[model_name]

        if hasattr(model, "feature_importances_"):
            importances = model.feature_importances_
        elif hasattr(model, "coef_"):
            importances = np.abs(model.coef_)
        else:
            raise ValueError(f"Cannot extract importances from {model_name}")

        importance_df = pd.DataFrame({
            "Feature": self.FEATURE_COLUMNS,
            "Importance": importances
        }).sort_values("Importance", ascending=False)

        importance_df["Cumulative %"] = (
            importance_df["Importance"].cumsum()
            / importance_df["Importance"].sum()
            * 100
        ).round(1)

        return importance_df

    def plot_forecast(self, model_name: str = "XGBoost",
                      store_id: str = "STORE_001",
                      last_n_days: int = 90) -> None:
        """
        Plot actual vs. predicted sales for a specific store.

        Requires matplotlib. Displays the last `last_n_days` of the
        test period for visual clarity.
        """
        import matplotlib.pyplot as plt

        test_df = self.test_data.copy()
        test_df["predicted"] = self.predictions[model_name]

        store_df = test_df[test_df["store_id"] == store_id].tail(last_n_days)

        fig, axes = plt.subplots(2, 1, figsize=(14, 8), height_ratios=[3, 1])

        # --- Top panel: Actual vs Predicted ---
        axes[0].plot(
            store_df["date"], store_df["daily_sales"],
            label="Actual", color="#2196F3", linewidth=1.5, alpha=0.8
        )
        axes[0].plot(
            store_df["date"], store_df["predicted"],
            label="Predicted", color="#FF5722", linewidth=1.5,
            linestyle="--", alpha=0.8
        )

        # Highlight promotions
        promo_days = store_df[store_df["is_promotion"] == 1]
        axes[0].scatter(
            promo_days["date"], promo_days["daily_sales"],
            color="#4CAF50", marker="^", s=60, label="Promotion Day",
            zorder=5
        )

        axes[0].set_title(
            f"Demand Forecast: {store_id} — {model_name}",
            fontsize=14, fontweight="bold"
        )
        axes[0].set_ylabel("Daily Sales (units)")
        axes[0].legend(loc="upper right")
        axes[0].grid(True, alpha=0.3)

        # --- Bottom panel: Residuals ---
        residuals = store_df["daily_sales"] - store_df["predicted"]
        colors = ["#F44336" if r < 0 else "#4CAF50" for r in residuals]
        axes[1].bar(store_df["date"], residuals, color=colors, alpha=0.6, width=1)
        axes[1].axhline(y=0, color="black", linewidth=0.5)
        axes[1].set_ylabel("Residual (units)")
        axes[1].set_xlabel("Date")
        axes[1].grid(True, alpha=0.3)

        plt.tight_layout()
        plt.show()

    def calculate_business_impact(
        self,
        model_name: str = "XGBoost",
        unit_cost: float = 45.0,
        unit_price: float = 89.0,
        holding_cost_pct: float = 0.25,
        stockout_penalty_multiplier: float = 1.5
    ) -> dict:
        """
        Calculate the financial impact of forecast errors.

        Parameters:
        - unit_cost: Cost to purchase/produce one unit ($)
        - unit_price: Selling price per unit ($)
        - holding_cost_pct: Annual holding cost as % of unit cost
        - stockout_penalty_multiplier: Lost sale cost as multiple of margin
          (>1 to account for lost goodwill, customer churn)

        Returns a dictionary of business impact metrics.
        """
        actual = self.y_test
        predicted = self.predictions[model_name]

        margin = unit_price - unit_cost  # $44 per unit
        daily_holding_cost = unit_cost * holding_cost_pct / 365

        over_predictions = np.maximum(predicted - actual, 0)
        under_predictions = np.maximum(actual - predicted, 0)

        # Cost of over-ordering: holding cost + potential markdown
        overstock_cost = np.sum(over_predictions * daily_holding_cost)

        # Cost of under-ordering: lost margin + goodwill penalty
        stockout_cost = np.sum(
            under_predictions * margin * stockout_penalty_multiplier
        )

        total_cost = overstock_cost + stockout_cost
        total_actual_sales = np.sum(actual)
        cost_per_unit = total_cost / total_actual_sales if total_actual_sales > 0 else 0

        # Compare to naive baseline (predict yesterday's sales)
        naive_pred = np.roll(actual, 1)
        naive_pred[0] = actual[0]
        naive_over = np.maximum(naive_pred - actual, 0)
        naive_under = np.maximum(actual - naive_pred, 0)
        naive_overstock = np.sum(naive_over * daily_holding_cost)
        naive_stockout = np.sum(naive_under * margin * stockout_penalty_multiplier)
        naive_total = naive_overstock + naive_stockout

        savings = naive_total - total_cost

        impact = {
            "model": model_name,
            "test_period_days": len(actual),
            "total_units_sold": int(total_actual_sales),
            "overstock_cost": round(overstock_cost, 2),
            "stockout_cost": round(stockout_cost, 2),
            "total_forecast_cost": round(total_cost, 2),
            "cost_per_unit_sold": round(cost_per_unit, 4),
            "naive_baseline_cost": round(naive_total, 2),
            "savings_vs_naive": round(savings, 2),
            "savings_pct": round(savings / naive_total * 100, 1)
                          if naive_total > 0 else 0,
        }

        print("\n" + "=" * 60)
        print(f"BUSINESS IMPACT ANALYSIS — {model_name}")
        print("=" * 60)
        print(f"Test period:           {impact['test_period_days']} days")
        print(f"Total units sold:      {impact['total_units_sold']:,}")
        print(f"")
        print(f"Overstock cost:        ${impact['overstock_cost']:,.2f}")
        print(f"Stockout cost:         ${impact['stockout_cost']:,.2f}")
        print(f"Total forecast cost:   ${impact['total_forecast_cost']:,.2f}")
        print(f"Cost per unit sold:    ${impact['cost_per_unit_sold']:.4f}")
        print(f"")
        print(f"Naive baseline cost:   ${impact['naive_baseline_cost']:,.2f}")
        print(f"Savings vs. naive:     ${impact['savings_vs_naive']:,.2f} "
              f"({impact['savings_pct']}%)")
        print("=" * 60)

        return impact

Code Explanation. The DemandForecaster class encapsulates the entire forecasting workflow: data preparation (with a temporal train-test split), model training and comparison, feature importance analysis, forecast visualization, and business impact calculation. The temporal split is critical — we train on earlier data and test on later data, just as you would in production. Random splits would allow the model to "peek" at future patterns, producing unrealistically optimistic results.

Step 4: Run the Complete Pipeline

# Initialize and run
forecaster = DemandForecaster(sales_featured)
forecaster.prepare_data(test_months=3)

# Train all models and compare
results = forecaster.train_and_compare()
forecaster.display_results(results)

# Feature importance from Random Forest
print("\nFeature Importance (Random Forest):")
importance = forecaster.get_feature_importance("Random Forest")
print(importance.head(10).to_string(index=False))

# Visualize forecast (uncomment in Jupyter notebook)
# forecaster.plot_forecast(model_name="XGBoost", store_id="STORE_001")

# Business impact
impact = forecaster.calculate_business_impact(model_name="XGBoost")

Expected output (approximate):

Training set: 4,930 observations
Test set:     460 observations
Test period:  2024-10-01 to 2024-12-31

================================================================================
DEMAND FORECASTER — MODEL COMPARISON RESULTS
================================================================================
              Model  Train R²  Test R²    MAE   RMSE  MAPE (%)  Overfit Gap
            XGBoost    0.9412   0.8856  12.43  16.81      7.82       0.0556
      Random Forest    0.9587   0.8734  13.21  17.95      8.31       0.0853
   Ridge (alpha=1.0)   0.7823   0.7654  18.92  24.13     12.14       0.0169
Linear Regression      0.7831   0.7612  19.14  24.56     12.38       0.0219
   Lasso (alpha=0.5)   0.7456   0.7389  20.45  25.87     13.21       0.0067

Athena Update. The XGBoost model achieves a MAPE of approximately 8 percent — meaning its predictions are off by an average of 8 percent. For stable categories like basic outerwear, this is excellent. Ravi's team targets a MAPE below 15 percent for fashion items and below 10 percent for staple items. The model meets the staple target and, when deployed across Athena's pilot categories, contributes to an 18 percent reduction in overstock costs — worth approximately $2.2 million annually.

Step 5: Examining What the Model Learned

# Lasso coefficients reveal which features matter
lasso_model = forecaster.models.get("Lasso (alpha=0.5)")
if lasso_model is not None:
    lasso_features = pd.DataFrame({
        "Feature": forecaster.FEATURE_COLUMNS,
        "Coefficient": lasso_model.coef_
    })
    # Show non-zero coefficients (features Lasso selected)
    selected = lasso_features[lasso_features["Coefficient"] != 0]
    eliminated = lasso_features[lasso_features["Coefficient"] == 0]

    print(f"\nLasso selected {len(selected)} of "
          f"{len(forecaster.FEATURE_COLUMNS)} features:")
    print(selected.sort_values("Coefficient", key=abs,
                                ascending=False).to_string(index=False))
    print(f"\nLasso eliminated {len(eliminated)} features:")
    print(list(eliminated["Feature"]))

Code Explanation. Lasso's feature selection capability is on display here. By driving some coefficients exactly to zero, Lasso tells you which features are not pulling their weight. In a typical run, Lasso will eliminate redundant features like is_month_start and is_month_end while keeping the powerful lag features, temperature, and promotion indicators. This is precisely the kind of insight that Ravi can present to Athena's supply chain leadership: "Our model's accuracy depends on eight key factors — here they are, ranked by importance."

8.11 From Forecast to Action: The Business of Prediction Error

A forecast is not the end of the process. It is the beginning. The question that matters is not "How accurate is the model?" but "What should we do with this prediction?"

The Asymmetry of Errors

In most business forecasting problems, the cost of over-predicting is different from the cost of under-predicting. This asymmetry should drive every decision downstream of the model.

For Athena's demand forecasting:

Error Type	What Happens	Financial Impact
Over-prediction (ordered too much)	Excess inventory sits in warehouse, ties up capital, may require markdowns	Holding cost: ~$0.03/unit/day. Markdown loss if unsold: up to 50% of unit cost
Under-prediction (ordered too little)	Stockouts, lost sales, disappointed customers, competitor capture	Lost margin: $44/unit. Lost customer goodwill: hard to quantify but real

For outerwear, under-prediction is typically 3 to 5 times more costly than over-prediction per unit. A coat sitting in a warehouse costs Athena about $0.03 per day in holding costs. A coat not available when a customer wants it costs $44 in lost margin — plus the intangible cost of a customer who walks out and may not return.

Safety Stock: Hedging Against Forecast Error

Because forecasts are never perfect, businesses maintain safety stock — extra inventory above the forecast level — to buffer against under-prediction.

The appropriate safety stock level depends on:

Forecast accuracy (lower MAPE = less safety stock needed)
Demand variability (higher variability = more safety stock)
Service level target (99% in-stock rate requires more buffer than 95%)
Lead time (longer replenishment time = more buffer)
Cost asymmetry (expensive stockouts justify more buffer)

A standard formula:

Safety stock = z × sigma × sqrt(lead_time_days)

Where: - z is the z-score for the desired service level (1.65 for 95%, 2.33 for 99%) - sigma is the standard deviation of forecast error - lead_time_days is the number of days between ordering and receiving inventory

def calculate_safety_stock(
    forecast_errors: np.ndarray,
    service_level: float = 0.95,
    lead_time_days: int = 7
) -> dict:
    """
    Calculate safety stock based on forecast error distribution.

    Parameters:
    - forecast_errors: array of (actual - predicted) values
    - service_level: desired probability of not stocking out (0 to 1)
    - lead_time_days: days between order and receipt

    Returns safety stock recommendation.
    """
    from scipy import stats

    error_std = np.std(forecast_errors)
    z_score = stats.norm.ppf(service_level)
    safety_stock = z_score * error_std * np.sqrt(lead_time_days)

    return {
        "forecast_error_std": round(error_std, 2),
        "z_score": round(z_score, 2),
        "lead_time_days": lead_time_days,
        "safety_stock_units": round(safety_stock, 0),
        "service_level": f"{service_level*100:.0f}%"
    }

# Calculate safety stock for the best model
best_model = "XGBoost"
if best_model in forecaster.predictions:
    errors = forecaster.y_test - forecaster.predictions[best_model]
    ss = calculate_safety_stock(errors, service_level=0.95, lead_time_days=7)

    print(f"\nSafety Stock Recommendation ({best_model}):")
    for key, val in ss.items():
        print(f"  {key}: {val}")

Order Quantity: Turning Forecasts into Purchases

The final step is translating the forecast into an actual order:

Order quantity = forecast + safety stock - current inventory

This simple formula hides substantial complexity: - How often do you reorder? (Daily, weekly, monthly?) - What are the minimum order quantities from suppliers? - Are there volume discounts that change the optimal order size? - How reliable are your lead times?

These operational questions are beyond the scope of this chapter, but they underscore a critical point: the model is one component of a decision system, not the decision itself. The best demand model in the world is useless without a supply chain that can act on its predictions.

Business Insight. Ravi frames the demand forecaster's value proposition to Athena's board not in terms of R² or MAPE, but in dollars: "An 18 percent reduction in overstock costs plus a 12 percent reduction in stockout frequency translates to approximately $3.6 million in annual savings across pilot categories. Scaling to all categories, the projected annual impact is $9 to $12 million." That is the language that secures budget for ML infrastructure.

8.12 When Regression Models Fail

No chapter on regression would be complete without an honest assessment of where these models break down. Understanding failure modes is as important as understanding success patterns.

Failure Mode 1: Extrapolation Beyond Training Data

Regression models learn patterns within the range of their training data. When asked to predict outside that range — temperatures below anything in the training set, promotional discounts deeper than any historical precedent — they extrapolate, and extrapolation is unreliable.

Athena's polar vortex problem is a textbook example. The training data included temperatures as low as 5°F. The polar vortex brought -10°F to -15°F. The linear model extrapolated, predicting modestly higher demand. The actual demand was dramatically higher — a nonlinear response that the model had never seen.

Failure Mode 2: Regime Changes

Models trained on pre-pandemic data failed spectacularly during and after COVID-19. Consumer behavior, supply chain dynamics, and economic conditions shifted so fundamentally that historical patterns became irrelevant. This is concept drift at its most extreme.

Failure Mode 3: Omitted Variable Bias

If an important driver of the target variable is not included as a feature, the model's coefficients for the included features will be biased. A demand model that omits competitor pricing may attribute competitor-driven demand changes to other features, producing misleading results.

Failure Mode 4: The Promotional Problem

At Athena, promotional demand is the hardest category to forecast. Promotions interact with dozens of factors — timing, depth of discount, product category, weather, competitor promotions, marketing spend, channel, and customer segment. Each promotion is somewhat unique, making it difficult for any model to generalize.

Ravi's team addresses this with a separate promotional model that uses promotion-specific features (discount percentage, promotion type, marketing channel) in addition to the standard demand features. Even so, MAPE for promotional periods is typically double the MAPE for non-promotional periods.

Failure Mode 5: Confusing Correlation with Causation

Regression coefficients quantify correlation, not causation. A regression might show that ice cream sales and drowning deaths are positively correlated — not because ice cream causes drowning, but because both increase in hot weather (the omitted variable).

In business contexts, this matters enormously. A regression showing that customers who receive more emails have higher lifetime value does not mean that sending more emails causes higher value. It may mean that the system sends more emails to already-engaged customers. Acting on the correlation (blasting all customers with more emails) could backfire.

Caution. Never use a regression coefficient as evidence of causation without careful consideration of confounding variables, selection bias, and the possibility of reverse causation. If you need causal inference, you need experimental design (A/B testing) or specialized causal methods — not standard regression. This distinction is one of the most important in applied data science.

8.13 Chapter Summary: What Regression Means for Business

Professor Okonkwo surveys the room at the end of the session. NK has filled three pages of notes. Tom has rebuilt his model four times and finally has one that generalizes. Ravi has pulled up Athena's actual demand forecasting dashboard on his laptop, comparing the results to the class's model.

"Let me leave you with the business summary," Professor Okonkwo says.

"Regression is the most widely deployed machine learning technique in business. Not because it's glamorous — there are no headline-grabbing breakthroughs here. But because predicting how much and how many is the core of operational planning. Inventory, pricing, budgeting, staffing, capacity planning — every operational decision depends on numerical forecasts.

"What you've learned today is a progression. Start simple — linear regression gives you interpretable coefficients and a baseline. Add complexity only when the data demands it — regularization when you have too many features, polynomial terms when the relationship is curved, tree-based methods when the patterns are complex and interacting.

"But never forget: the model is a tool. The goal is better decisions. An 18 percent reduction in overstock costs is not a statistical achievement — it's money that Athena can reinvest in growth, in employees, in customer experience. The model earns its place by its impact on the business, not by its R²."

NK glances at the $3.2 million number still on the screen from Ravi's opening slide. A model that is 15 percent more accurate recovers $1.8 million. The cost of building and deploying that model is a fraction of that amount. The ROI case is not subtle.

"In Chapter 9," Professor Okonkwo continues, "we move from supervised to unsupervised learning — algorithms that find patterns without being told what to look for. But the business logic remains the same: the question comes first. The algorithm serves the question."

Key Terms

Term	Definition
Regression	Supervised learning technique for predicting continuous numerical values
Linear regression	Algorithm that fits a straight line (or hyperplane) to minimize squared residuals
Multiple regression	Linear regression with multiple predictor variables
Polynomial regression	Regression using polynomial features (x², x³) to capture curved relationships
Residual	Difference between actual and predicted value (actual - predicted)
R² (R-squared)	Proportion of variance in the target explained by the model (0 to 1)
MAE	Mean Absolute Error — average of absolute prediction errors
RMSE	Root Mean Squared Error — square root of average squared errors
MAPE	Mean Absolute Percentage Error — average percentage error
Overfitting	Model learns noise in training data, performing well on training but poorly on new data
Underfitting	Model is too simple to capture the true pattern
Regularization	Technique that penalizes model complexity to prevent overfitting
Ridge (L2)	Regularization that shrinks coefficients toward zero using squared penalty
Lasso (L1)	Regularization that can set coefficients exactly to zero, performing feature selection
Multicollinearity	High correlation among predictor variables, destabilizing coefficients
Random Forest	Ensemble of decision trees trained on random data/feature subsets
Gradient Boosting	Ensemble method where trees are trained sequentially on residuals
XGBoost	Optimized gradient boosting implementation, industry standard for tabular data
Lag feature	Target variable's value at a previous time step, used as a predictor
Safety stock	Extra inventory held above forecast to buffer against forecast errors
Data leakage	Using future information during training that wouldn't be available at prediction time
Feature engineering	Creating new input features from raw data to improve model performance
Interaction effect	When the combined effect of two features differs from the sum of their individual effects
Concept drift	Change in the relationship between features and target over time

Next chapter: Chapter 9: Unsupervised Learning — Discovering patterns without labels, from customer segmentation to anomaly detection.