In This Chapter
- How Machine Learning Thinks Differently from Statistics
- The $4.2 Million Model That Answered the Wrong Question
- Three Questions Data Can Answer
- The ML Mindset: Why Prediction Is Different
- Meet the Anchors: Three Problems, Three Industries
- The Taxonomy of ML Problems
- Framing: The Most Underrated Skill in Data Science
- "All Models Are Wrong, Some Are Useful"
- The R-Squared Illusion
- Putting It All Together: The Prediction Workflow (Preview)
- Key Vocabulary
- Progressive Project M0: Setting Up the StreamFlow Churn Project
- Bridge to Chapter 2
Chapter 1: From Analysis to Prediction
How Machine Learning Thinks Differently from Statistics
Learning Objectives
By the end of this chapter, you will be able to:
- Distinguish between descriptive analytics, inferential statistics, and predictive modeling
- Explain why prediction requires a fundamentally different workflow than explanation
- Identify the bias-variance tradeoff at a conceptual level
- Frame a business problem as a machine learning task
- Recognize the types of ML problems (classification, regression, clustering, ranking)
The $4.2 Million Model That Answered the Wrong Question
War Story --- In 2019, a mid-size insurance company hired a consulting firm to build a model predicting which policyholders would file claims in the next quarter. The team was excellent. They assembled 14 months of historical data, engineered 340 features, ran a grid search across gradient boosted trees, random forests, and regularized logistic regression, and delivered a model with an AUC of 0.91 and an R-squared of 0.74 on their holdout set. The executive presentation was polished. The model was technically superb. And it was completely useless.
The business question was never "who will file a claim?" It was "which policyholders should we proactively contact to reduce claim severity?" Those are different questions. The first is prediction. The second is causal inference combined with actionability. The model could identify high-risk policyholders with impressive accuracy, but it could not tell the company which of those policyholders would respond to outreach --- and outreach was the only lever the company had. The $4.2 million project was shelved within six months.
This is the most common failure mode in data science: building a technically excellent model that answers the wrong question.
That war story should make you uncomfortable. It should make you uncomfortable because the team did everything right --- by the metrics they chose. The data was clean. The code was production-grade. The evaluation was rigorous. And yet the project failed, because the gap between "predict accurately" and "create business value" is wider than most people realize.
This chapter is about that gap. It is about the difference between analyzing what happened and predicting what will happen --- and why the transition from one to the other requires you to rethink nearly everything you know about working with data.
Three Questions Data Can Answer
Every data problem you will encounter falls into one of three categories. Understanding which category your problem belongs to is the single most important decision you will make on any project.
Descriptive Analytics: What Happened?
Descriptive analytics summarizes historical data. It answers backward-looking questions:
- What was our revenue last quarter?
- How many customers churned in January?
- What is the average time-to-resolution for support tickets?
The tools are familiar: aggregation, visualization, summary statistics, dashboards. Descriptive analytics is where most organizations live. It is necessary, it is valuable, and it is not machine learning.
import pandas as pd
# Descriptive analytics: what happened?
churn_summary = df.groupby('month').agg(
    total_subscribers=('subscriber_id', 'nunique'),
    churned=('churned', 'sum'),
    churn_rate=('churned', 'mean')
).reset_index()
print(churn_summary.tail(6))
month total_subscribers churned churn_rate
7 2024-08 2412000 197784 0.082
8 2024-09 2389000 193209 0.081
9 2024-10 2378000 199752 0.084
10 2024-11 2356000 197904 0.084
11 2024-12 2341000 193893 0.083
12 2025-01 2318000 192594 0.083
This tells you churn is hovering around 8.2%. It does not tell you why, and it does not tell you who is about to leave.
Inferential Statistics: Why Did It Happen?
Inferential statistics draws conclusions about populations from samples. It answers explanatory questions:
- Does offering a discount reduce churn? (causal inference)
- Is the relationship between usage hours and retention statistically significant? (hypothesis testing)
- What factors are associated with hospital readmission? (regression analysis)
The tools are hypothesis tests, confidence intervals, regression coefficients, and p-values. Inferential statistics cares deeply about why --- about identifying causal mechanisms and quantifying uncertainty about population parameters.
import statsmodels.api as sm
# Inferential statistics: why did it happen?
X = df[['hours_watched', 'support_tickets', 'tenure_months', 'num_devices']]
X = sm.add_constant(X)
y = df['churned']
logit_model = sm.Logit(y, X).fit()
print(logit_model.summary2().tables[1])
Coef. Std.Err. z P>|z| [0.025 0.975]
const 1.234 0.089 13.87 0.000 1.060 1.408
hours_watched -0.043 0.003 -14.33 0.000 -0.049 -0.037
support_tickets 0.187 0.012 15.58 0.000 0.163 0.211
tenure_months -0.028 0.002 -14.00 0.000 -0.032 -0.024
num_devices -0.156 0.021 -7.43 0.000 -0.197 -0.115
This tells you that each additional support ticket is associated with higher churn probability (coefficient = 0.187, p < 0.001). The coefficient is interpretable. The confidence interval is meaningful. You can make statements about the population.
But notice what it does not tell you: which specific subscriber is going to churn next month.
Predictive Modeling: What Will Happen?
Predictive modeling generates forecasts about unseen data. It answers forward-looking questions:
- Which of our 2.4 million subscribers will cancel in the next 30 days?
- Will this patient be readmitted within 30 days of discharge?
- Will this wind turbine's bearing fail in the next 72 hours?
The tools are machine learning algorithms, train/test splits, cross-validation, and metrics like AUC, precision, recall, and RMSE. Predictive modeling does not care why something happens --- at least, not primarily. It cares whether the prediction is accurate on data the model has never seen.
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
# Predictive modeling: what will happen?
X = df[feature_columns]
y = df['churned_within_30_days']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
model = GradientBoostingClassifier(
    n_estimators=200,
    max_depth=5,
    learning_rate=0.1,
    random_state=42
)
model.fit(X_train, y_train)
y_pred_proba = model.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_pred_proba)
print(f"Test AUC: {auc:.3f}")
Test AUC: 0.847
This model can rank subscribers by churn risk. It cannot tell you why any individual subscriber is likely to churn (though interpretation techniques exist, which we will cover in Chapter 19). It can, however, generate a list of the 10,000 subscribers most likely to cancel next month --- and that list is actionable.
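Producing that list is a short step from the model above: score every subscriber, sort by predicted probability, and keep the top k. A minimal sketch on synthetic stand-in data (the subscriber IDs and features are invented here, and we take the top 100 rather than 10,000 to keep it small):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for StreamFlow's data: ~8% positive class
X, y = make_classification(n_samples=5000, n_features=10,
                           weights=[0.92, 0.08], random_state=42)
X_train, X_score, y_train, _ = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

model = GradientBoostingClassifier(random_state=42).fit(X_train, y_train)

# Rank current subscribers by predicted churn risk and keep the top k
risk = pd.DataFrame({
    'subscriber_id': np.arange(len(X_score)),  # hypothetical IDs
    'churn_risk': model.predict_proba(X_score)[:, 1],
})
top_k = risk.nlargest(100, 'churn_risk')  # the retention team's call list
print(top_k.head())
```

The model never needs to explain itself to be useful here: the sorted probabilities alone define who gets a retention call.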
The Critical Distinction
Here is the insight that separates practitioners from students:
The same data can answer descriptive, inferential, and predictive questions --- but the approach, the evaluation criteria, and the definition of "correct" are completely different for each.
A model with a high R-squared can be terrible at prediction. A model that predicts beautifully can be uninterpretable. An analysis that identifies causal mechanisms may not generalize to new data. You must decide which question you are answering before you write a single line of code.
Common Mistake --- Treating R-squared as a measure of predictive performance. R-squared measures how much variance in the training data your model explains. A model that memorizes the training data will have a perfect R-squared and abysmal predictive performance. The right question is not "how well does this model fit the data I have?" but "how well does this model predict data it has never seen?"
The ML Mindset: Why Prediction Is Different
If you have a statistics background, the shift to machine learning requires reprogramming some deeply held beliefs. Here are the three biggest mental model changes.
1. Interpretability vs. Accuracy
In statistics, you want to understand the data-generating process. You want to know which variables matter, what direction their effects go, and whether those effects are statistically significant. A complex model that predicts well but cannot be interpreted is, in the statistical tradition, suspect.
In machine learning, you want to make accurate predictions on new data. Full stop. If a 500-tree ensemble with 200 features outperforms a logistic regression, you use the ensemble. You may later interpret the ensemble (SHAP values, partial dependence plots), but interpretation is a secondary concern, not the primary objective.
This does not mean interpretability is unimportant. In healthcare, finance, and criminal justice, you may be required to explain your model's decisions. But the order of operations is different: in statistics, you start with an interpretable model and check if it predicts well enough; in ML, you start with the model that predicts best and figure out how to explain it.
2. In-Sample vs. Out-of-Sample
Statistics evaluates models on the data used to fit them. The t-statistic, the F-statistic, the p-value --- all are computed on the same data that produced the coefficients. There is nothing wrong with this for the purpose statistics is designed for: understanding the data-generating process.
Machine learning evaluates models on data the model has never seen. This is the defining feature of the ML workflow. If you evaluate on training data, you are testing memory, not learning. The entire apparatus of ML evaluation --- train/test splits, cross-validation, holdout sets --- exists because of this single principle.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error
import numpy as np
# The leaky abstraction: in-sample vs. out-of-sample
model = LinearRegression()
model.fit(X_train, y_train)
# In-sample performance (what a statistician might report)
y_train_pred = model.predict(X_train)
r2_train = r2_score(y_train, y_train_pred)
# Out-of-sample performance (what an ML practitioner reports)
y_test_pred = model.predict(X_test)
r2_test = r2_score(y_test, y_test_pred)
print(f"Training R-squared: {r2_train:.3f}")
print(f"Test R-squared: {r2_test:.3f}")
Training R-squared: 0.812
Test R-squared: 0.643
That gap --- 0.812 vs. 0.643 --- is the difference between explaining and predicting. If you report the training R-squared, you are lying to yourself. Get comfortable with the test set number. It is always lower, and it is always more honest.
3. Bias-Variance: The Tradeoff That Rules Everything
Every predictive model's error can be decomposed into three components:
Total Error = Bias^2 + Variance + Irreducible Noise
- Bias is the error from overly simplistic assumptions. A linear model fit to nonlinear data has high bias. It underfits --- it misses real patterns because the model is not flexible enough to capture them.
- Variance is the error from excessive sensitivity to the training data. A model that memorizes every quirk of the training set has high variance. It overfits --- it captures noise as if it were signal, and those "patterns" do not generalize.
- Irreducible noise is the error from randomness in the world. No model can predict it. Some patients will be readmitted for reasons no feature captures. Some subscribers will cancel because of life events no data records.
The tradeoff: reducing bias (making the model more flexible) typically increases variance, and reducing variance (constraining the model) typically increases bias. The art of machine learning is finding the sweet spot where total error is minimized.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
# Generate data with a true nonlinear relationship
np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 100)).reshape(-1, 1)
y_true = 3 * np.sin(X).ravel() + 0.5 * X.ravel()
y = y_true + np.random.normal(0, 1.5, 100)
# Shuffle before splitting: X is sorted, so a positional split would make
# the test set pure extrapolation beyond the training range
shuffle_idx = np.random.permutation(len(X))
X_train, X_test = X[shuffle_idx[:70]], X[shuffle_idx[70:]]
y_train, y_test = y[shuffle_idx[:70]], y[shuffle_idx[70:]]
# Demonstrate bias-variance tradeoff with polynomial degree
degrees = [1, 4, 15]
fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for ax, degree in zip(axes, degrees):
    model = make_pipeline(
        PolynomialFeatures(degree),
        LinearRegression()
    )
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    X_plot = np.linspace(0, 10, 300).reshape(-1, 1)
    ax.scatter(X_train, y_train, alpha=0.5, s=20, label='Train')
    ax.scatter(X_test, y_test, alpha=0.5, s=20, label='Test')
    ax.plot(X_plot, model.predict(X_plot), 'r-', linewidth=2)
    ax.set_title(f'Degree {degree}\nTrain MSE: {train_mse:.2f}, Test MSE: {test_mse:.2f}')
    ax.legend()
    ax.set_ylim(-5, 15)
plt.tight_layout()
plt.savefig('bias_variance_demo.png', dpi=150, bbox_inches='tight')
plt.show()
In this example:
- Degree 1 (high bias, low variance): The line cannot capture the sinusoidal pattern. Train and test error are both high. This is underfitting.
- Degree 4 (balanced): The curve captures the real pattern without memorizing noise. Train error is slightly higher than degree 15, but test error is lowest. This is the sweet spot.
- Degree 15 (low bias, high variance): The polynomial contorts to pass through every training point. Train error is near zero, but test error explodes. This is overfitting.
Production Tip --- In practice, you rarely think about bias-variance in mathematical terms. Instead, you think about it operationally: "Is my model too simple for the patterns in this data?" (bias) or "Is my model memorizing quirks in the training data that won't generalize?" (variance). Learning curves --- plotting train and validation error as a function of training set size --- are the single best diagnostic tool. We will build them in Chapter 16.
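As a preview of that diagnostic, here is a minimal learning-curve sketch using scikit-learn's learning_curve; the dataset is synthetic and the scoring choice is a placeholder, so treat it as a shape of things to come rather than a recipe:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

# Synthetic stand-in data; a real diagnostic would use your project's features
X, y = make_classification(n_samples=2000, n_features=20, random_state=42)

# Cross-validated train and validation scores at increasing training-set sizes
train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5, scoring='accuracy'
)

# A persistent gap between the curves suggests variance (overfitting);
# curves that converge at a low score suggest bias (underfitting)
print("train:", train_scores.mean(axis=1).round(3))
print("valid:", val_scores.mean(axis=1).round(3))
```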
Meet the Anchors: Three Problems, Three Industries
Throughout this book, we will return to three recurring examples. They are chosen to represent different industries, different data types, and different ML problem types. You will know them well by the end.
StreamFlow: SaaS Churn Prediction
The company: StreamFlow is a B2C subscription streaming analytics platform. Think of it as a service that helps content creators understand their audiences --- dashboards, trend analysis, audience segmentation. The platform has 2.4 million subscribers paying between $9.99/month (Basic) and $49.99/month (Pro).
The numbers:
- Annual recurring revenue: ~$530M
- Monthly churn rate: 8.2% (industry benchmark: 5-7%)
- Customer acquisition cost: $62 per subscriber
- Average revenue per user: $18.40/month
- Customer lifetime value at current churn: $224
The problem: At 8.2% monthly churn, StreamFlow loses approximately 197,000 subscribers every month. Replacing them costs $12.2M in acquisition spending. The VP of Product wants a model that identifies subscribers likely to churn in the next 30 days, so the retention team can intervene with targeted offers.
The data:
- Subscription events (signups, upgrades, downgrades, cancellations)
- Usage logs (hours watched, features used, devices, genres, time of day)
- Support ticket history (count, category, resolution time, satisfaction rating)
- Billing history (payment method, failed payments, plan changes)
- Demographic features (age bucket, country, signup channel)
ML problem type: Binary classification (will this subscriber churn within 30 days: yes/no)
StreamFlow is the progressive project for this book. You will build the churn prediction system piece by piece, chapter by chapter, from problem framing through deployment.
Try It --- Before reading further, write down three features you think would predict churn at StreamFlow. Do not overthink it. What would a human customer success manager look at? We will revisit your list in Chapter 6.
Metro General Hospital: Readmission Prediction
The hospital: Metro General is a 450-bed urban teaching hospital affiliated with a major university medical school. It serves a diverse patient population, including a high proportion of Medicare and Medicaid patients.
The numbers:
- Annual admissions: ~28,000
- 30-day readmission rate: 17.3% (national average: 15.6%)
- CMS penalty (excess readmissions): $2.1M/year
- Average cost per readmission: $14,400
- Annual readmission cost burden: ~$70M
The problem: The Centers for Medicare & Medicaid Services (CMS) penalizes hospitals with excess readmission rates under the Hospital Readmissions Reduction Program (HRRP). Metro General's 17.3% rate is above the national average. The Chief Medical Officer wants to identify patients at high risk of readmission at the time of discharge, so the care coordination team can schedule follow-up appointments, arrange home health visits, and ensure medication adherence.
The data:
- Electronic health records (diagnoses, procedures, lab results, vital signs)
- Demographic information (age, sex, insurance type, zip code)
- Admission history (number of prior admissions in 12 months, length of stay)
- Discharge details (discharge disposition, medications prescribed)
- Social determinants (lives alone, transportation access --- often missing)
ML problem type: Binary classification (will this patient be readmitted within 30 days: yes/no)
The tension: This problem is where prediction and explanation collide. A model that predicts readmission well might use features like "number of prior admissions" --- which is predictive but not actionable (you cannot change a patient's history). A model built for intervention needs to surface actionable factors: medication complexity, lack of follow-up appointment, inadequate home support. We will explore this tension deeply in Case Study 2 and return to it throughout the book.
TurbineTech: Manufacturing Predictive Maintenance
The company: TurbineTech manufactures and operates wind turbines across North America. Their fleet of 1,200 turbines generates power for utility companies under long-term contracts. Each turbine is instrumented with 847 sensors measuring vibration, temperature, rotational speed, pitch angle, power output, and environmental conditions.
The numbers:
- Fleet size: 1,200 turbines across 38 wind farms
- Sensors per turbine: 847 (sampling at 1-second to 1-minute intervals)
- Data volume: ~2.3 TB/day across the fleet
- Average cost of unplanned downtime: $48,000/turbine/day (lost revenue + emergency repair)
- Average cost of scheduled maintenance: $8,000/turbine
- Planned maintenance window: 4-6 hours
- Unplanned failure repair: 3-14 days
The problem: Unplanned turbine failures are catastrophically expensive. A main bearing failure can take a turbine offline for two weeks and cost $500,000+ in repairs. TurbineTech wants to predict failures 72 hours in advance, giving maintenance crews time to schedule a controlled shutdown and repair during low-wind periods.
The data:
- Sensor streams (vibration spectra, temperature readings, rotational speed, power curves)
- Maintenance logs (scheduled and unscheduled maintenance events, parts replaced)
- Environmental data (wind speed, ambient temperature, humidity, icing conditions)
- SCADA system alerts (alarm codes, fault conditions)
- Component lifecycle data (hours of operation, manufacturer, installation date)
ML problem type: This one is more complex. It could be framed as:
- Binary classification: Will this turbine fail within 72 hours? (yes/no)
- Regression: How many hours until failure? (remaining useful life)
- Anomaly detection: Is this turbine behaving abnormally? (detect deviation from normal)
The "right" framing depends on the business context, and different framings lead to different models, different features, and different evaluation criteria. We will revisit this throughout the book.
Common Mistake --- Jumping to "what algorithm should I use?" before deciding "what type of problem is this?" The algorithm is a detail. The problem type is the architecture. Get the architecture wrong and no algorithm will save you.
The Taxonomy of ML Problems
Now that you have three concrete examples, let us formalize the types of problems machine learning can solve.
Supervised Learning
In supervised learning, you have labeled data: each training example includes both the input features and the correct answer (the target variable). The model learns the mapping from features to target.
Classification predicts a discrete label:
- Will this subscriber churn? (binary classification)
- What genre is this content? (multiclass classification)
- Which disease does this patient have? (multiclass, potentially multi-label)
Regression predicts a continuous value:
- How many hours will this subscriber watch next month?
- What will this patient's blood pressure be at the 30-day follow-up?
- How many hours of remaining useful life does this bearing have?
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestRegressor
# Classification: predict a label
clf = LogisticRegression(random_state=42, max_iter=1000)
clf.fit(X_train, y_train_class) # y_train_class: 0 or 1
predictions = clf.predict(X_test) # [0, 1, 1, 0, ...]
probabilities = clf.predict_proba(X_test) # [[0.82, 0.18], [0.34, 0.66], ...]
# Regression: predict a number
reg = RandomForestRegressor(n_estimators=100, random_state=42)
reg.fit(X_train, y_train_cont) # y_train_cont: 12.5, 3.2, 8.7, ...
predictions = reg.predict(X_test) # [11.2, 4.8, 7.3, ...]
The key distinction: in classification, predict_proba is often more useful than predict. A probability of 0.73 carries more information than a label of 1. You can threshold probabilities at different cutoffs depending on the business cost of false positives vs. false negatives --- a concept we will explore deeply in Chapter 16.
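To illustrate, here is a minimal sketch of moving the decision threshold; the data is synthetic and the three cutoffs are arbitrary, chosen only to show how the size of the flagged group changes:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: ~10% positive class
X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
proba = clf.predict_proba(X_test)[:, 1]

# predict() hard-codes a 0.5 cutoff; choosing the cutoff yourself trades
# false positives against false negatives
for threshold in (0.3, 0.5, 0.7):
    flagged = int((proba >= threshold).sum())
    print(f"threshold {threshold}: {flagged} flagged as likely churners")
```

Lowering the threshold flags more subscribers (more false alarms, fewer missed churners); raising it does the opposite.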
Unsupervised Learning
In unsupervised learning, you have no labels. The model discovers structure in the data on its own.
Clustering groups similar data points:
- Segment subscribers into behavioral groups for targeted marketing
- Identify patient cohorts with similar health profiles
- Group wind turbines with similar operating characteristics
Dimensionality reduction compresses high-dimensional data into fewer dimensions:
- Visualize 847 sensor readings in 2D to spot anomalies
- Reduce 500 features to 50 principal components before modeling
- Compress text into dense vector representations
Anomaly detection identifies data points that do not belong:
- Flag sensor readings that deviate from normal operating patterns
- Detect fraudulent transactions in billing data
- Identify unusual patient lab values
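To make the first of these concrete, here is a minimal clustering sketch; the three behavioral features are invented for illustration, and k=4 is an arbitrary choice:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical behavioral features for 1,000 subscribers
usage = np.column_stack([
    rng.gamma(2.0, 10.0, 1000),   # hours_watched
    rng.poisson(12, 1000),        # sessions_per_month
    rng.integers(1, 8, 1000),     # distinct_genres
])

# No labels anywhere: KMeans discovers groups from the features alone
scaled = StandardScaler().fit_transform(usage)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(scaled)
print(np.bincount(kmeans.labels_))  # subscribers per behavioral segment
```

Note the scaling step: KMeans uses Euclidean distance, so features on wildly different scales would otherwise dominate the clusters.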
We cover unsupervised learning in Part IV (Chapters 20-24).
Other Problem Types
Two additional problem types appear frequently in industry:
Ranking orders items by relevance or priority:
- Rank subscribers by churn risk (highest risk first)
- Rank diagnostic possibilities by probability
- Rank maintenance tasks by urgency
Recommendation suggests items a user might want:
- Recommend content to StreamFlow subscribers
- Suggest follow-up tests for a patient
- Prioritize turbine inspections
These are often framed as supervised problems under the hood (predict a relevance score, then sort), but the evaluation metrics and loss functions differ. We cover recommender systems in Chapter 24.
Production Tip --- In my experience, 80% of real-world ML problems are binary classification or regression. If you master those two, you can handle most of what industry throws at you. The other problem types are important but less frequent. Focus your energy accordingly.
Framing: The Most Underrated Skill in Data Science
We now arrive at the skill that separates senior data scientists from everyone else: problem framing. This is the process of translating a vague business question into a precise machine learning task. It is the most underrated skill in the field because it is invisible --- it happens before any code is written, and it does not appear in Kaggle competitions.
The Framing Checklist
Every ML problem requires you to answer six questions before writing code:
- What is the business question? (In plain language, what decision does this model support?)
- What is the prediction target? (What exactly are we predicting? Define it precisely.)
- What is the observation unit? (What does one row in the training data represent?)
- What is the prediction horizon? (How far into the future are we predicting?)
- What features are available at prediction time? (This is where most leakage happens.)
- How will the prediction be used? (What action will someone take based on the model output?)
Let us apply this to StreamFlow:
| Question | Answer |
|---|---|
| Business question | Which subscribers should we target with retention offers? |
| Prediction target | Binary: will this subscriber cancel within 30 days? |
| Observation unit | One subscriber at one point in time (subscriber-month) |
| Prediction horizon | 30 days from the prediction date |
| Features available at prediction time | Everything known before the prediction date: historical usage, support history, billing status, tenure. NOT anything that happens after the prediction date. |
| How prediction is used | Top 5% highest-risk subscribers receive a retention offer (discount, feature unlock, personalized outreach) |
Framing Traps
Framing errors are the most expensive mistakes in data science because they compound. A wrong frame leads to wrong features, wrong evaluation, and a model that optimizes the wrong thing.
Trap 1: Target leakage. Your features contain information that would not be available at prediction time. If you include "cancellation_reason" as a feature for churn prediction, your model will be perfect --- and useless, because you only know the cancellation reason after the customer has already churned.
Trap 2: Wrong observation unit. For StreamFlow, if you define the observation unit as "subscriber" (one row per subscriber, ever), you conflate subscribers who have been around for 3 years with those who joined last week. You need "subscriber at a point in time" --- one row per subscriber per prediction period.
Trap 3: Mismatch between prediction and action. Metro General can predict readmission. But if the care coordination team can only handle 50 follow-up cases per day, and the model flags 300, the prediction is not actionable at scale. Framing must account for operational capacity.
Trap 4: Wrong time horizon. Predicting churn "at some point in the future" is too vague. Predicting churn "within 30 days" gives the retention team a window to act. Predicting churn "within 24 hours" might be too late. The time horizon must match the operational response time.
# WRONG: target leakage --- using future information as a feature
X_leaky = df[['hours_watched', 'support_tickets', 'cancellation_reason_encoded']]
# cancellation_reason is only known AFTER churn happens!
# RIGHT: only features available at prediction time
X_clean = df[['hours_watched_last_30d', 'support_tickets_last_90d',
              'tenure_months', 'plan_type', 'num_devices',
              'days_since_last_login']]
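Trap 2 can be sketched in code as well. The column names and tiny event log below are hypothetical; the point is the shape of the training table — one row per subscriber per month, with each feature computed only from earlier periods:

```python
import pandas as pd

# Hypothetical event log: one row per viewing session
events = pd.DataFrame({
    'subscriber_id': [1, 1, 1, 2, 2],
    'date': pd.to_datetime(['2025-01-05', '2025-01-20', '2025-02-03',
                            '2025-01-10', '2025-02-15']),
    'hours_watched': [2.0, 1.5, 0.5, 4.0, 3.0],
})
events['month'] = events['date'].dt.to_period('M')

# One row per subscriber-month: the correct observation unit
obs = (events.groupby(['subscriber_id', 'month'])['hours_watched']
             .sum()
             .reset_index(name='hours_this_month'))

# Shift so each row sees only the PREVIOUS month's usage --- a feature
# that would genuinely be available at prediction time
obs['hours_prev_month'] = obs.groupby('subscriber_id')['hours_this_month'].shift(1)
print(obs)
```

The shift leaves the first month of each subscriber as missing, which is honest: at that point, no history existed yet.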
War Story --- A healthcare analytics team built a readmission model that achieved 0.94 AUC on their test set. Extraordinary performance. Too extraordinary. After a week of debugging, they discovered that one of their features was "discharge_disposition_code" --- which included codes like "discharged to home health" and "discharged to skilled nursing facility." These codes are assigned at discharge based on the patient's assessed risk --- meaning the model was effectively learning to predict the attending physician's risk assessment, which already incorporated readmission risk. The feature was removed, AUC dropped to 0.76, and the model actually became useful because it was now surfacing new information the clinical team did not already know.
"All Models Are Wrong, Some Are Useful"
George Box's famous aphorism is the unofficial motto of machine learning. It is worth unpacking, because it encapsulates the ML mindset.
In statistics, you often care whether your model is "true" in some sense --- whether it correctly specifies the data-generating process. You test assumptions: normality, homoscedasticity, independence. Violating these assumptions means your inferences (p-values, confidence intervals) may be invalid.
In machine learning, your model is always wrong. Always. The question is whether it is useful --- whether it makes predictions accurate enough to drive better decisions than the current alternative.
The current alternative might be:
- No model at all: The retention team guesses which subscribers to call.
- A simple heuristic: Flag anyone who has not logged in for 14 days.
- An existing model: The rule-based churn score built by the analytics team in 2022.
Your model does not need to be perfect. It needs to be better than the status quo by enough to justify the cost of building and maintaining it. This is a pragmatic, engineering-oriented worldview, and it is the ML mindset.
Production Tip --- Always build a "stupid baseline" before building anything sophisticated. For classification, the baseline is "predict the majority class for every observation." For regression, it is "predict the mean." For StreamFlow's 8.2% churn rate, predicting "no churn" for every subscriber gives you 91.8% accuracy. Any model that does not meaningfully beat this baseline is not worth deploying. We will formalize baselines in Chapter 2.
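A sketch of that baseline using scikit-learn's DummyClassifier, with synthetic labels drawn at roughly the stated churn rate:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

rng = np.random.default_rng(42)
# Synthetic labels at roughly StreamFlow's 8.2% churn rate
y = (rng.random(10_000) < 0.082).astype(int)
X = np.zeros((10_000, 1))  # features are irrelevant to this baseline

# "Predict the majority class for everyone"
baseline = DummyClassifier(strategy='most_frequent').fit(X, y)
accuracy = baseline.score(X, y)
print(f"Majority-class baseline accuracy: {accuracy:.3f}")  # approx. 1 - churn rate
```

The baseline's high accuracy is exactly why accuracy alone is a poor metric for imbalanced problems: this "model" never identifies a single churner.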
The R-Squared Illusion
We need to address R-squared directly, because it is the most commonly misunderstood metric by practitioners transitioning from statistics to ML.
R-squared (R^2), the coefficient of determination, measures the proportion of variance in the dependent variable explained by the model. It ranges from 0 to 1 (or negative, for truly terrible models on test data). In statistical contexts, an R-squared of 0.75 is considered "good."
Here is the problem: R-squared computed on training data tells you nothing about predictive performance.
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
# Generate data
np.random.seed(42)
n = 50
X = np.random.uniform(0, 10, n).reshape(-1, 1)
y = 2 * X.ravel() + np.random.normal(0, 3, n)
X_train, X_test = X[:35], X[35:]
y_train, y_test = y[:35], y[35:]
# Overfit with a high-degree polynomial
poly = PolynomialFeatures(degree=20)
X_train_poly = poly.fit_transform(X_train)
X_test_poly = poly.transform(X_test)
model = LinearRegression()
model.fit(X_train_poly, y_train)
print(f"Training R-squared: {r2_score(y_train, model.predict(X_train_poly)):.4f}")
print(f"Test R-squared: {r2_score(y_test, model.predict(X_test_poly)):.4f}")
Training R-squared: 0.9987
Test R-squared: -142.3741
A training R-squared of 0.9987. A test R-squared of negative 142. The model is worse than predicting the mean on unseen data --- catastrophically worse. This is not a contrived example. This is what happens when you evaluate models the way statistics textbooks teach you to evaluate models.
The fix is simple: always report test set performance. Always. No exceptions.
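For contrast, here is a sketch that fits a plain degree-1 model to the same synthetic data. Because a straight line matches the true data-generating process (y = 2x plus noise), training and test R-squared land close together, which is what honest generalization looks like:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Same synthetic data as above: y = 2x + noise
np.random.seed(42)
n = 50
X = np.random.uniform(0, 10, n).reshape(-1, 1)
y = 2 * X.ravel() + np.random.normal(0, 3, n)
X_train, X_test = X[:35], X[35:]
y_train, y_test = y[:35], y[35:]

# A degree-1 model: no polynomial expansion, no overfitting
model = LinearRegression().fit(X_train, y_train)
print(f"Training R-squared: {r2_score(y_train, model.predict(X_train)):.4f}")
print(f"Test R-squared: {r2_score(y_test, model.predict(X_test)):.4f}")
```

The gap between training and test performance is the diagnostic: a small gap suggests the model is learning signal; a chasm like the one above means it memorized noise.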
Putting It All Together: The Prediction Workflow (Preview)
Here is a preview of the full ML workflow that we will develop across this book. For now, internalize the high-level structure:
1. Frame the problem (this chapter) --- Define the business question, target, observation unit, and time horizon.
2. Collect and prepare data (Part II) --- Extract features with SQL, engineer new features, handle categoricals and missing values.
3. Establish a baseline (Chapter 2) --- Build the simplest possible model. If you cannot beat it, rethink the problem.
4. Model and iterate (Part III) --- Try multiple algorithms, tune hyperparameters, evaluate honestly.
5. Evaluate rigorously (Chapter 16) --- Go beyond a single metric. Understand precision-recall tradeoffs, calibration, and fairness.
6. Deploy (Part VI) --- Move from notebook to production API with monitoring.
7. Monitor and maintain (Chapter 32) --- Models decay. Data changes. Your work is not done when the model ships.
Common Mistake --- Spending 80% of your time on step 4 (modeling) and 5% on step 1 (framing). In production data science, the ratio should be roughly reversed. The right frame with a simple model will almost always beat the wrong frame with a sophisticated model.
Key Vocabulary
Before moving on, make sure you are comfortable with these terms. We will use them constantly.
| Term | Definition |
|---|---|
| Supervised learning | Learning from labeled data where the target variable is known |
| Unsupervised learning | Learning from unlabeled data to discover hidden structure |
| Features | The input variables (columns) used to make predictions; also called predictors, covariates, or independent variables |
| Target variable | The output variable (column) we are trying to predict; also called the label, response, or dependent variable |
| Training set | The subset of data used to fit the model's parameters |
| Test set | The subset of data withheld from training, used to evaluate generalization |
| Overfitting | When a model learns noise in the training data and fails to generalize |
| Underfitting | When a model is too simple to capture the real patterns in the data |
| Bias-variance tradeoff | The fundamental tension between model simplicity (bias) and model flexibility (variance) |
| Generalization | A model's ability to perform well on data it was not trained on |
| Model | A mathematical function that maps features to predictions, learned from data |
| Prediction vs. inference | Prediction asks "what will happen?" Inference asks "why does it happen?" |
Progressive Project M0: Setting Up the StreamFlow Churn Project
It is time to start building. Throughout this book, you will construct a complete churn prediction system for StreamFlow. Each chapter adds a piece. By the end, you will have a deployed model with monitoring, explanations, and a fairness audit.
Step 1: Create the Project Repository
```bash
# In your terminal (not Jupyter)
mkdir streamflow-churn
cd streamflow-churn
git init
python -m venv venv
source venv/bin/activate    # Linux/Mac
# venv\Scripts\activate     # Windows
pip install pandas numpy scikit-learn matplotlib seaborn jupyter
```
Step 2: Create the Notebook
Create a Jupyter notebook called 01_problem_framing.ipynb with the following structure:
```python
# Cell 1: Project Header
"""
StreamFlow Churn Prediction System
===================================
Chapter 1: Problem Framing

Business Question: Which StreamFlow subscribers will cancel
their subscription within the next 30 days?

Author: [Your Name]
Date: [Today's Date]
"""

# Cell 2: Imports
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Consistent style for the project
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette('husl')
pd.set_option('display.max_columns', 50)
pd.set_option('display.float_format', '{:.3f}'.format)
```
Step 3: Define the Problem Frame
```python
# Cell 3: Problem Framing Document
problem_frame = {
    'business_question': 'Which subscribers should receive retention offers?',
    'prediction_target': 'Binary: churned_within_30_days (1 = canceled, 0 = retained)',
    'observation_unit': 'subscriber-month (one row per subscriber per prediction period)',
    'prediction_horizon': '30 days from prediction date',
    'features_available': [
        'Subscription: plan_type, tenure_months, plan_changes',
        'Usage: hours_last_7d, hours_last_30d, sessions_last_30d, devices_used',
        'Support: tickets_last_90d, avg_resolution_time, satisfaction_rating',
        'Billing: payment_method, failed_payments_last_90d',
        'Demographics: age_bucket, country, signup_channel'
    ],
    'action_on_prediction': (
        'Top 5% highest-risk subscribers receive targeted retention offer. '
        'Estimated cost per offer: $15. Estimated save rate: 20%. '
        'Break-even if saved subscriber stays 2+ additional months.'
    ),
    'success_metric': (
        'Primary: AUC-ROC (model discrimination). '
        'Secondary: Precision at top 5% (are the flagged subscribers actually at risk?). '
        'Business: Net revenue saved (offers sent * save rate * CLV - offer cost).'
    ),
    'baseline': (
        'Predict majority class (no churn) for all subscribers. '
        'Accuracy = 91.8%. This is the bar to clear.'
    )
}

# Print the frame as a readable document
for key, value in problem_frame.items():
    if isinstance(value, list):
        print(f"\n{key.upper()}:")
        for item in value:
            print(f"  - {item}")
    else:
        print(f"\n{key.upper()}:\n  {value}")
```
Step 4: Calculate the Business Case
```python
# Cell 4: Back-of-the-envelope business case
subscribers = 2_400_000
monthly_churn_rate = 0.082
avg_revenue_per_user = 18.40         # $/month
customer_acquisition_cost = 62.00    # $
retention_offer_cost = 15.00         # $ per offer
estimated_save_rate = 0.20           # 20% of contacted subscribers stay

# Monthly churn impact
monthly_churners = int(subscribers * monthly_churn_rate)
monthly_lost_revenue = monthly_churners * avg_revenue_per_user
monthly_acquisition_cost = monthly_churners * customer_acquisition_cost

print(f"Monthly churners: {monthly_churners:,}")
print(f"Monthly lost revenue: ${monthly_lost_revenue:,.0f}")
print(f"Cost to replace: ${monthly_acquisition_cost:,.0f}")

# If we target the top 5% of subscribers (120,000)
target_count = int(subscribers * 0.05)

# Assume the model correctly identifies 60% of actual churners in the top 5%
true_positives_in_target = int(monthly_churners * 0.60)
saves = int(min(true_positives_in_target, target_count) * estimated_save_rate)

revenue_saved = saves * avg_revenue_per_user * 6  # Assume a saved subscriber stays 6 months
offer_cost = target_count * retention_offer_cost
net_value = revenue_saved - offer_cost

print(f"\n--- Model Impact Estimate ---")
print(f"Subscribers targeted: {target_count:,}")
print(f"Estimated saves: {saves:,}")
print(f"Revenue saved (6-month horizon): ${revenue_saved:,.0f}")
print(f"Offer cost: ${offer_cost:,.0f}")
print(f"Net value per month: ${net_value:,.0f}")
print(f"Net value per year: ${net_value * 12:,.0f}")
```

```
Monthly churners: 196,800
Monthly lost revenue: $3,621,120
Cost to replace: $12,201,600

--- Model Impact Estimate ---
Subscribers targeted: 120,000
Estimated saves: 23,616
Revenue saved (6-month horizon): $2,607,206
Offer cost: $1,800,000
Net value per month: $807,206
Net value per year: $9,686,477
```
That is the business case: a well-performing churn model could generate approximately $9.7M in net annual value. This number will become more precise as we build the model, but having a rough business case before you start modeling is essential. It tells you how good the model needs to be to justify the investment.
Try It --- Modify the assumptions in the business case calculation. What happens if the save rate is 10% instead of 20%? What if the model only captures 30% of churners in the top 5%? At what save rate does the project break even? Understanding the sensitivity of the business case to model performance is a critical skill.
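As a starting point for that exercise, here is a sketch that sweeps the save rate and solves for the break-even point under the Cell 4 assumptions (the 60% churner capture rate and 6-month retention horizon are carried over unchanged; since captured churners number fewer than the offers sent, the `min()` clamp from Cell 4 is dropped for simplicity):

```python
# Sensitivity sketch under the Cell 4 assumptions
subscribers = 2_400_000
monthly_churn_rate = 0.082
avg_revenue_per_user = 18.40
retention_offer_cost = 15.00

target_count = int(subscribers * 0.05)                              # 120,000 offers
true_positives_in_target = int(subscribers * monthly_churn_rate * 0.60)  # 118,080 churners captured
offer_cost = target_count * retention_offer_cost                    # $1.8M per month

# Net value at several candidate save rates
for save_rate in [0.05, 0.10, 0.15, 0.20]:
    saves = int(true_positives_in_target * save_rate)
    revenue_saved = saves * avg_revenue_per_user * 6  # 6-month retention horizon
    print(f"Save rate {save_rate:.0%}: net ${revenue_saved - offer_cost:,.0f}/month")

# Break-even: revenue saved equals offer cost, so
# save_rate = offer_cost / (captured churners * ARPU * 6)
break_even = offer_cost / (true_positives_in_target * avg_revenue_per_user * 6)
print(f"Break-even save rate: {break_even:.1%}")  # about 13.8%
```

A 13.8% break-even against an assumed 20% save rate leaves uncomfortably little margin, which is exactly the kind of insight this sensitivity check is meant to surface.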
Bridge to Chapter 2
You now have the mental framework for thinking about machine learning problems. You can distinguish prediction from explanation, identify the type of ML problem, frame a business question as an ML task, and understand the bias-variance tradeoff at a conceptual level.
But framing is just the first step. In Chapter 2, we will map the complete ML workflow --- from the problem frame you just created through data preparation, baseline modeling, evaluation, deployment, and monitoring. We will introduce the concept of data leakage (the silent killer of ML projects), build a stupid baseline that every model must beat, and begin to structure the StreamFlow project for reproducibility.
The framing work you did in this chapter is not a warm-up exercise. It is the foundation. Every decision you make for the rest of this book --- which features to engineer, which algorithm to choose, which metric to optimize, how to evaluate fairness --- flows from the problem frame. Get it right, and the rest follows. Get it wrong, and you end up with a $4.2 million model that answers the wrong question.
Next: Chapter 2 --- The Machine Learning Workflow: Problem Framing, Data Pipeline, Modeling, Evaluation, Deployment