Case Study 1: End-to-End — From Raw Data to Deployed Prediction


Tier 3 — Illustrative/Composite Example: GreenLeaf Energy is a fictional company. This case study is built from widely reported patterns in how utilities and energy companies use machine learning for demand forecasting. The data, model results, and business outcomes described here are composites for pedagogical purposes. All names, figures, and scenarios are invented.


The Setting

GreenLeaf Energy is a mid-sized utility company serving about 200,000 households across a metropolitan area. Every day, their operations team must decide how much electricity to purchase from the wholesale market for the following day. Buy too little, and they face expensive spot-market purchases at peak rates. Buy too much, and they waste money on electricity that nobody uses.

Currently, the operations team uses a simple rule: buy the same amount as the same day last year, plus 3% for growth. This works okay most of the time, but it fails badly during heat waves, cold snaps, holiday weeks, and special events. Those failures cost GreenLeaf an estimated $2.3 million per year in over-purchasing and spot-market penalties.

Dana Rivera, a recently hired data analyst, proposes a machine learning approach: predict tomorrow's electricity demand using weather forecasts, historical usage patterns, and calendar features. Her manager agrees to a three-month pilot.

The Question

Dana frames a clear prediction problem: Given today's conditions and tomorrow's weather forecast, how much electricity will our service area consume tomorrow?

This is a regression problem — the target is a continuous number (megawatt-hours consumed). Success means predictions that are closer to actual demand than the current "same day last year + 3%" rule.
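The success criterion can be made concrete with a small sketch. All numbers here are invented for illustration; they are not GreenLeaf's figures:

```python
import numpy as np

# Toy sketch of the success criterion: the model "wins" if its mean
# absolute error beats the "same day last year + 3%" rule.
actual = np.array([52_000, 48_500, 51_200, 49_800])      # MWh, invented
last_year = np.array([50_000, 47_000, 49_000, 48_000])   # same dates last year
baseline_pred = last_year * 1.03                          # the current rule
model_pred = np.array([51_500, 48_900, 50_800, 49_500])   # hypothetical model

baseline_mae = np.abs(actual - baseline_pred).mean()
model_mae = np.abs(actual - model_pred).mean()
assert model_mae < baseline_mae  # the model must beat the rule to succeed
```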

The Data

Dana assembles three years of daily data (1,095 rows) with the following features:

Historical demand:

  • demand_yesterday: Yesterday's actual demand (MWh)
  • demand_last_week: Demand 7 days ago
  • demand_last_year: Demand on the same date last year
  • avg_demand_7day: Rolling 7-day average demand

Weather forecasts (from the National Weather Service):

  • forecast_high_temp: Tomorrow's forecasted high temperature (°F)
  • forecast_low_temp: Tomorrow's forecasted low temperature (°F)
  • forecast_humidity: Tomorrow's forecasted average humidity (%)
  • forecast_precipitation: Tomorrow's forecasted precipitation (inches)

Calendar features:

  • day_of_week: Monday through Sunday (categorical)
  • month: January through December (categorical)
  • is_holiday: Whether tomorrow is a federal holiday (binary)
  • is_weekend: Whether tomorrow is Saturday or Sunday (binary)

Target:

  • demand_tomorrow: Actual electricity demand the following day (MWh)
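The lagged-demand features and the target can all be derived from a single raw daily series with pandas shift and rolling operations. A minimal sketch, assuming a hypothetical raw demand column (the values are invented):

```python
import pandas as pd

# Hypothetical sketch: deriving the lag features from one raw daily series.
# Column names match the case study; the raw data layout is an assumption.
daily = pd.DataFrame({
    "date": pd.date_range("2021-01-01", periods=10, freq="D"),
    "demand": [5200, 5100, 5300, 5250, 5400, 4800, 4700, 5350, 5280, 5420],
})
daily["demand_yesterday"] = daily["demand"].shift(1)
daily["demand_last_week"] = daily["demand"].shift(7)
# Shift before rolling so the 7-day average covers only *past* days;
# including today's value would not be available at prediction time.
daily["avg_demand_7day"] = daily["demand"].shift(1).rolling(7).mean()
# The target is simply the next day's demand.
daily["demand_tomorrow"] = daily["demand"].shift(-1)
```

The shift-before-rolling detail matters: any feature must be computable from information available on the day the prediction is made.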

Building the Pipeline

Step 1: Explore and Understand

Before building the pipeline, Dana spends a week exploring the data in a Jupyter notebook. She discovers:

  • Demand has strong weekly seasonality (weekdays > weekends)
  • Demand has U-shaped temperature dependence (high in summer for AC, high in winter for heating, low in spring/fall)
  • The current "last year + 3%" method has an MAE of about 2,100 MWh

This exploration informs her feature engineering and model choices, but she's careful not to include any of these insights in the pipeline in a way that would leak information.
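The weekly-seasonality finding is the kind of thing a quick groupby can confirm. A sketch on synthetic data (GreenLeaf's data is fictional, so the numbers here are generated to show the pattern, not to match it):

```python
import numpy as np
import pandas as pd

# Exploration sketch on synthetic data: weekday demand is generated higher
# than weekend demand, and a groupby on day-of-week recovers the pattern.
rng = np.random.default_rng(0)
dates = pd.date_range("2022-01-01", periods=365, freq="D")
base = np.where(dates.dayofweek < 5, 5400.0, 4900.0)  # weekdays higher
demand = base + rng.normal(0, 100, len(dates))
df = pd.DataFrame({"date": dates, "demand": demand})

by_day = df.groupby(df["date"].dt.dayofweek)["demand"].mean()
weekday_mean = by_day.iloc[:5].mean()   # Mon-Fri (dayofweek 0-4)
weekend_mean = by_day.iloc[5:].mean()   # Sat-Sun (dayofweek 5-6)
assert weekday_mean > weekend_mean      # the weekday > weekend pattern
```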

Step 2: Load the Data and Define Features

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV, TimeSeriesSplit
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import joblib

RANDOM_STATE = 42

# Load data
energy = pd.read_csv('greenleaf_daily_demand.csv', parse_dates=['date'])
energy = energy.sort_values('date').reset_index(drop=True)

# Define features
numeric_features = ['demand_yesterday', 'demand_last_week', 'demand_last_year',
                    'avg_demand_7day', 'forecast_high_temp', 'forecast_low_temp',
                    'forecast_humidity', 'forecast_precipitation']
categorical_features = ['day_of_week', 'month']
binary_features = ['is_holiday', 'is_weekend']

all_features = numeric_features + categorical_features + binary_features
X = energy[all_features]
y = energy['demand_tomorrow']

Step 3: Time-Aware Splitting

Dana faces a subtlety: electricity demand is time series data. She can't use a random train/test split, because the model would then train on days that come after the ones it is tested on, leaking future information into its evaluation. Instead, she uses a chronological split:

# Use the last 6 months as the test set
split_date = energy['date'].max() - pd.Timedelta(days=180)
train_mask = energy['date'] <= split_date

X_train = X[train_mask]
X_test = X[~train_mask]
y_train = y[train_mask]
y_test = y[~train_mask]

print(f"Training: {len(X_train)} days")
print(f"Testing:  {len(X_test)} days")

For cross-validation, she uses TimeSeriesSplit instead of regular k-fold, which ensures that each fold's training data always precedes its validation data in time:

tscv = TimeSeriesSplit(n_splits=5)

Step 4: Define Preprocessing and Tune Models

Dana bundles scaling and encoding into a ColumnTransformer, then compares two model families with grid search:

preprocessor = ColumnTransformer(transformers=[
    ('num', StandardScaler(), numeric_features),
    ('cat', OneHotEncoder(drop='first', handle_unknown='ignore'),
     categorical_features),
    ('bin', 'passthrough', binary_features)
])

# Try two model types
model_configs = {
    'Ridge Regression': {
        'pipeline': Pipeline([
            ('preprocessor', preprocessor),
            ('model', Ridge())
        ]),
        'params': {
            'model__alpha': [0.01, 0.1, 1, 10, 100]
        }
    },
    'Random Forest': {
        'pipeline': Pipeline([
            ('preprocessor', preprocessor),
            ('model', RandomForestRegressor(random_state=RANDOM_STATE))
        ]),
        'params': {
            'model__n_estimators': [100, 200, 300],
            'model__max_depth': [5, 10, 15, None],
            'model__min_samples_leaf': [1, 5, 10]
        }
    }
}
best_models = {}
for name, config in model_configs.items():
    gs = GridSearchCV(
        config['pipeline'], config['params'],
        cv=tscv,
        scoring='neg_mean_absolute_error',
        n_jobs=-1
    )
    gs.fit(X_train, y_train)
    best_models[name] = gs
    print(f"\n{name}:")
    print(f"  Best MAE: {-gs.best_score_:.0f} MWh")
    print(f"  Best params: {gs.best_params_}")

Note: scikit-learn scorers follow a higher-is-better convention, so error metrics are negated; neg_mean_absolute_error returns the negative of the MAE. Dana negates it back for reporting.
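The sign convention is easy to demonstrate on toy data (a standalone sketch, not part of Dana's pipeline):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import get_scorer, mean_absolute_error

# neg_mean_absolute_error is literally -1 * MAE, so that "higher is
# better" holds uniformly across all scikit-learn scorers.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0.0, 1.1, 1.9, 3.2])
model = LinearRegression().fit(X, y)

scorer = get_scorer("neg_mean_absolute_error")
score = scorer(model, X, y)
mae = mean_absolute_error(y, model.predict(X))
assert np.isclose(score, -mae)  # scorer output is the negated MAE
assert score <= 0
```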

Step 5: Evaluate on the Test Set

for name, gs in best_models.items():
    y_pred = gs.best_estimator_.predict(X_test)
    mae = mean_absolute_error(y_test, y_pred)
    rmse = np.sqrt(mean_squared_error(y_test, y_pred))
    r2 = r2_score(y_test, y_pred)
    print(f"\n{name}:")
    print(f"  MAE:  {mae:.0f} MWh")
    print(f"  RMSE: {rmse:.0f} MWh")
    print(f"  R²:   {r2:.3f}")

# Compare to baseline
baseline_pred = X_test['demand_last_year'] * 1.03
baseline_mae = mean_absolute_error(y_test, baseline_pred)
print(f"\nBaseline (last year + 3%): MAE = {baseline_mae:.0f} MWh")

Results:

Model                       MAE (MWh)   RMSE (MWh)   R²
Baseline (last year + 3%)   2,100       2,850        0.71
Ridge Regression            1,450       1,920        0.84
Random Forest               1,180       1,560        0.89

The random forest reduces MAE by 44% compared to the baseline — from 2,100 to 1,180 MWh. Each MWh of over-prediction costs about $15 in wasted purchasing, and each MWh of under-prediction costs about $45 in spot-market penalties (asymmetric costs). Since the baseline's forecasting errors cost an estimated $2.3 million per year, Dana estimates that cutting error by 44% would save approximately $1.1 million annually.

Step 6: Save and Deploy

best_pipeline = best_models['Random Forest'].best_estimator_
joblib.dump(best_pipeline, 'demand_forecast_pipeline.joblib')

# Create a deployment-ready prediction function.
# Load the pipeline once at startup, not on every call.
PIPELINE = joblib.load('demand_forecast_pipeline.joblib')

def predict_tomorrow_demand(features_dict):
    """Predict tomorrow's electricity demand (MWh) from a dictionary of features."""
    features_df = pd.DataFrame([features_dict])
    prediction = PIPELINE.predict(features_df)[0]
    return round(prediction, 0)
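A quick way to sanity-check this deployment pattern is a save/load round trip on a toy pipeline. The data and pipeline below are synthetic stand-ins, not GreenLeaf's:

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Sketch: a saved pipeline reproduces its predictions after a joblib
# round trip, preprocessing included.
rng = np.random.default_rng(0)
X = pd.DataFrame({"a": rng.normal(size=50), "b": rng.normal(size=50)})
y = 2 * X["a"] - X["b"] + rng.normal(0, 0.1, 50)
pipe = Pipeline([("scale", StandardScaler()), ("model", Ridge())]).fit(X, y)

path = os.path.join(tempfile.mkdtemp(), "pipe.joblib")
joblib.dump(pipe, path)
reloaded = joblib.load(path)
assert np.allclose(pipe.predict(X), reloaded.predict(X))
```

Because the scaler travels inside the pipeline, the reloaded object accepts raw feature values; no one downstream has to re-apply the preprocessing by hand.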

What Dana Did Right

1. She established a baseline before modeling. The "last year + 3%" rule gave her a concrete target to beat. Without a baseline, "MAE = 1,180" is just a number. With a baseline, it's "44% better than what we were doing."

2. She respected time in her data splits. Random splits would have allowed the model to train on December 2025 data and predict November 2025 — information leakage through time. Using chronological splits and TimeSeriesSplit for cross-validation prevented this.

3. She used a pipeline for everything. The ColumnTransformer and Pipeline ensured that scaling was fit only on training data and encoding handled unseen categories gracefully. No leakage, even with cross-validation.

4. She compared multiple models systematically. Instead of just trying one model, she compared Ridge regression and Random Forest with GridSearchCV, using the same preprocessing and the same cross-validation strategy. The comparison was fair.

5. She translated the model's performance into business value. The manager doesn't care about MAE in MWh. She cares about dollars saved. Dana did the translation: the baseline's errors cost an estimated $2.3 million per year, and cutting MAE by roughly 44% (920 MWh per day, at $15-45 per MWh of error) works out to approximately $1.1M in annual savings.

6. She saved the complete pipeline. Anyone can load the joblib file and make predictions without knowing the preprocessing details. The deployment team doesn't need to understand StandardScaler or OneHotEncoder — they just pass in a dictionary of features and get a prediction.
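Point 2 is easy to see directly: printing the fold indices that TimeSeriesSplit produces shows that every training window ends before its validation window begins. A sketch with ten dummy samples:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# With 10 samples and 3 splits, the folds are expanding windows:
# train [0..3] / val [4,5], train [0..5] / val [6,7], train [0..7] / val [8,9]
X = np.arange(10).reshape(-1, 1)
for train_idx, val_idx in TimeSeriesSplit(n_splits=3).split(X):
    assert train_idx.max() < val_idx.min()  # train strictly precedes validation
    print(train_idx, val_idx)
```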

Lessons Learned

1. The pipeline IS the model. Dana's deliverable wasn't a Random Forest. It was a pipeline that transforms raw input data into a prediction. The preprocessing steps are inseparable from the model — they define what the model sees and how it sees it.

2. Time series data requires time-aware splits. Standard random cross-validation assumes samples are independent. Time series data violates this assumption — tomorrow's demand is correlated with today's. TimeSeriesSplit respects temporal order.

3. Business value drives adoption. Dana's model was technically sound, but what got it deployed was the $1.1M savings estimate. Translating model performance into stakeholder language is not optional — it's the bridge between analysis and impact.

Discussion Questions

  1. Dana's model uses forecast_high_temp from the National Weather Service. What happens to the model's performance if the weather forecast is wrong? How might she account for forecast uncertainty?

  2. The model performs well on average but might fail during extreme events (heat waves, ice storms). How could Dana test for this? What modifications might help?

  3. Dana used TimeSeriesSplit for cross-validation. How does this differ from regular k-fold? Why is the difference important for time series data?

  4. If GreenLeaf expanded to a new city, could Dana use the same pipeline? What would she need to change?

  5. The costs of over-purchasing and under-purchasing are asymmetric ($15 vs. $45 per MWh). How might Dana modify the model or the prediction to account for this asymmetry? (Hint: think about biasing the prediction upward.)
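One way to think about question 5 (a sketch of one possible answer, not Dana's actual method): with asymmetric costs, the expected-cost-minimizing point forecast is a quantile of the demand distribution rather than its mean. For an over-cost of $15/MWh and an under-cost of $45/MWh, the optimal quantile is 45 / (45 + 15) = 0.75, so the prediction should be biased upward toward the 75th percentile. The demand samples below are synthetic:

```python
import numpy as np

# With asymmetric penalties, forecast a quantile, not the mean.
c_over, c_under = 15.0, 45.0
q_star = c_under / (c_under + c_over)  # optimal quantile = 0.75

def expected_cost(pred, demand_samples):
    """Average cost of a point forecast against sampled demand outcomes."""
    over = np.maximum(pred - demand_samples, 0) * c_over    # over-purchased MWh
    under = np.maximum(demand_samples - pred, 0) * c_under  # spot-market MWh
    return (over + under).mean()

rng = np.random.default_rng(42)
samples = rng.normal(50_000, 1_200, 100_000)  # synthetic demand outcomes (MWh)

cost_mean = expected_cost(samples.mean(), samples)
cost_q75 = expected_cost(np.quantile(samples, q_star), samples)
assert cost_q75 < cost_mean  # the upward-biased forecast costs less on average
```

In practice this upward bias could be implemented by training a quantile regressor, or by adding a correction to the point forecast calibrated on held-out residuals.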