Learning Objectives
- Define what a model is and explain why all models are deliberate simplifications of reality
- Distinguish between prediction and explanation as two goals of modeling
- Differentiate supervised learning from unsupervised learning and identify when each is appropriate
- Identify features and targets in a supervised learning problem
- Split data into training and test sets using train_test_split and explain why this is essential
- Explain overfitting and underfitting using intuitive analogies and visual examples
- Describe the bias-variance tradeoff and its implications for model complexity
- Build a baseline model and explain why starting simple matters
- Frame the progressive project as a prediction problem — predicting vaccination rates from country indicators
In This Chapter
- Chapter Overview
- 25.1 What Is a Model, Really?
- 25.2 Prediction vs. Explanation: Why Are You Building This Model?
- 25.3 Supervised vs. Unsupervised Learning
- 25.4 Features and Targets: The Language of Supervised Learning
- 25.5 Training and Testing: Why You Can't Grade Your Own Exam
- 25.6 Overfitting and Underfitting: The Goldilocks Problem
- 25.7 The Bias-Variance Tradeoff
- 25.8 Generalization: The Whole Point
- 25.9 Baseline Models: Always Start Here
- 25.10 The Machine Learning Workflow: A Preview
- 25.11 The scikit-learn API: Your Modeling Toolkit
- 25.12 Project Milestone: Framing the Vaccination Prediction Problem
- 25.13 Common Misconceptions About Models
- 25.14 Ethical Considerations: When Models Can Harm
- 25.15 Chapter Summary
- Connections to What You've Learned
- Looking Ahead
Chapter 25: What Is a Model? Prediction, Explanation, and the Bias-Variance Tradeoff
"All models are wrong, but some are useful." — George Box, statistician
Chapter Overview
You have spent 24 chapters learning to collect data, clean it, visualize it, summarize it, test hypotheses about it, and measure relationships within it. All of that work has been about understanding the data you have.
Now we do something different. We use the data we have to say something about data we haven't seen yet.
That is what a model does. A model takes patterns you've found in existing data and uses them to make predictions about new data — data that hasn't arrived, or hasn't been measured, or belongs to a situation you haven't encountered before. Will this customer leave? Will this patient respond to treatment? Will this country's vaccination rate be high or low? A model is your best guess, built from evidence.
But here is the thing about models that takes most people a long time to fully appreciate: every model is wrong. Not wrong in the sense of being useless — wrong in the sense of being incomplete. A model of housing prices doesn't capture every factor that determines what someone will pay for a house. A model of disease risk doesn't capture every variable that influences whether a person gets sick. A weather model doesn't simulate every molecule in the atmosphere.
And that's fine. That's the point. A model is a deliberate simplification of reality. You throw away complexity on purpose, keeping only the patterns that matter most, so that you can make useful predictions without drowning in detail. The art of modeling is deciding what to keep and what to throw away.
This chapter introduces the fundamental ideas behind modeling — ideas that will shape everything you do in the remaining chapters of this book and throughout your career in data science. We'll talk about what models are for (prediction vs. explanation), how they learn (supervised vs. unsupervised), what can go wrong (overfitting and underfitting), and the deep tension at the heart of it all (the bias-variance tradeoff).
In this chapter, you will learn to:
- Define what a model is and explain why all models are deliberate simplifications (all paths)
- Distinguish between prediction and explanation as modeling goals (all paths)
- Differentiate supervised from unsupervised learning (all paths)
- Identify features and targets in a prediction problem (all paths)
- Split data into training and test sets and explain why this matters (all paths)
- Explain overfitting and underfitting using analogies and visual examples (all paths)
- Describe the bias-variance tradeoff and its implications (standard + deep dive)
- Build a baseline model and explain why baselines come first (all paths)
- Frame the vaccination prediction problem for the progressive project (all paths)
Threshold Concept Alert: This chapter contains a threshold concept — "a model is a deliberate simplification of reality" — that changes how you think about data science. Once you understand that every model trades accuracy for simplicity, and that this trade is intentional, you'll stop expecting models to be perfect and start asking the right question: "Is this model useful enough for the decision I need to make?"
25.1 What Is a Model, Really?
You already know what a model is. You've been using them your entire life.
When you check a weather forecast, you're reading the output of a model. When you estimate how long your commute will take based on traffic patterns, you're running a model in your head. When you look at a map to navigate a city, you're using a model — a simplified representation of a complicated physical space, with most of the detail stripped away so you can find your way.
A model, in the broadest sense, is a simplified representation of something complex. It captures the essential patterns while ignoring details that don't matter for the task at hand.
The Map Analogy
Think about a map of your city. It shows streets, landmarks, and distances. It doesn't show every tree, every crack in the sidewalk, every car parked on every block. If it did, it would be the same size as the city itself — and completely useless.
The statistician George Box famously said, "All models are wrong, but some are useful." A map is wrong — it's flat, it's out of scale, it leaves out most of reality. But it's useful for navigation. A different map might be wrong in different ways but useful for different purposes — a topographic map for hiking, a subway map for transit, a satellite image for environmental analysis.
Data science models work the same way. A model of housing prices might say: "Price depends on square footage, number of bedrooms, and location." That's wrong — price also depends on the paint color, the neighbor's dog, the buyer's mood, and a thousand other things. But the model captures enough of the pattern to make useful predictions. That is what we mean by "all models are wrong, but some are useful."
From Description to Prediction
In Parts 1 through 4 of this book, you learned to describe data:
- Summary statistics told you about central tendency and spread
- Visualizations revealed patterns and distributions
- Hypothesis tests told you whether observed patterns were likely due to chance
- Correlations measured the strength of relationships between variables
All of that work was backward-looking. You had data, and you described what was in it.
Modeling is forward-looking. You have data, and you use it to say something about new situations. The shift from "what happened" to "what will happen" (or "what would happen if") is one of the most important transitions in data science. It's the difference between a historian and a forecaster — and you're about to become a forecaster.
Two Kinds of Models: Statistical and Machine Learning
You'll hear people distinguish between "statistical models" and "machine learning models," and the boundary between them is genuinely blurry. For our purposes:
- Statistical models tend to emphasize understanding relationships (explanation). They come from statistics traditions and focus on interpretable coefficients, confidence intervals, and hypothesis tests. Linear regression is the classic example.
- Machine learning models tend to emphasize predictive accuracy (prediction). They come from computer science traditions and focus on making good predictions on new data, sometimes at the expense of interpretability. Random forests and neural networks are examples.
But this is a spectrum, not a binary. Linear regression is used in both traditions. Many machine learning practitioners care about interpretation. Many statisticians care about prediction. Don't get hung up on the labels — focus on the ideas.
25.2 Prediction vs. Explanation: Why Are You Building This Model?
Before you build any model, you need to answer a fundamental question: What is this model for?
There are two main answers, and they lead to very different modeling strategies.
Prediction: "What Will Happen?"
A prediction model is built to forecast outcomes on new data. The question is: given what I know about a new case, what outcome should I expect?
- Given a house's features, what price will it sell for?
- Given a patient's symptoms, what disease do they likely have?
- Given a country's GDP and healthcare spending, what vaccination rate should we expect?
For prediction, you care about accuracy. You want the model's forecasts to be as close to reality as possible. You might not care why the model works — you care that it does work. A prediction model is like a GPS: you don't need to understand satellite physics to follow the directions.
Explanation: "Why Does It Happen?"
An explanation model is built to understand the mechanisms behind outcomes. The question is: which factors actually drive the outcome, and how?
- Does education cause higher income, or do other factors explain the correlation?
- Does a drug actually reduce blood pressure, or is the improvement a placebo effect?
- Does GDP drive vaccination rates, or does the relationship work the other way?
For explanation, you care about interpretability. You want to understand the model's coefficients, know which variables matter, and make causal claims (carefully). An explanation model is like a medical diagnosis: you don't just want to know that the patient is sick — you want to know why, so you can prescribe the right treatment.
Why the Distinction Matters
These goals can conflict. A highly accurate prediction model might be a black box — it makes great predictions but you can't understand why. A highly interpretable explanation model might sacrifice some accuracy for clarity.
Consider these two approaches to predicting house prices:
The prediction approach: Feed 500 features (square footage, location, number of windows, distance to nearest coffee shop, month of sale, seller's asking price history...) into a complex algorithm. Result: very accurate predictions, but you can't explain why any particular house got the price it did.
The explanation approach: Use three features (square footage, bedrooms, neighborhood) in a simple regression. Result: slightly less accurate, but you can say clearly, "Each additional bedroom adds approximately $15,000 to the price, all else being equal."
Neither approach is better in absolute terms. The right choice depends on your goal. If you're a real estate investor trying to find undervalued properties, prediction accuracy matters most. If you're a city planner trying to understand what drives housing costs, explanation matters most.
In this book, we'll build models that lean toward the explainable end of the spectrum — linear regression, logistic regression, and decision trees. These are models where you can look inside and understand what's happening. But we'll always evaluate them on their predictive accuracy too, because a model that explains everything but predicts nothing isn't much use either.
A Warning from Chapter 24
Remember what you learned about correlation and causation? That warning applies here too, with even more force. A predictive model that uses GDP to predict vaccination rates is not proving that GDP causes higher vaccination. The model is just exploiting a pattern in the data. The pattern might be causal, or it might be confounded, or it might be coincidental. The model doesn't know and doesn't care — it just predicts.
When we build explanatory models, we'll need to be much more careful about these distinctions. For now, remember: prediction is not explanation, and a good prediction doesn't prove a causal mechanism.
25.3 Supervised vs. Unsupervised Learning
Now that you know why you might build a model (prediction or explanation), let's talk about how models learn. There are two fundamental approaches, and the distinction is simple once you see it.
Supervised Learning: Learning from Examples with Answers
Supervised learning is like studying with an answer key. You have a dataset where you know both the inputs and the correct outputs. You show the model many examples of "here's the input, here's the right answer," and the model learns the pattern that connects inputs to outputs. Then you give it new inputs (without answers) and ask it to predict.
Think of it like learning to grade essays. Someone gives you 100 essays that have already been graded (A, B, C, D, F). You study the essays and the grades, looking for patterns: longer essays tend to get higher grades, essays with clear thesis statements tend to get higher grades, essays with many spelling errors tend to get lower grades. Eventually, you develop a sense for what makes a good essay — and you can grade new essays you've never seen.
In supervised learning:
- The features (also called predictors, inputs, or independent variables) are the characteristics you use to make predictions. For house prices: square footage, bedrooms, location.
- The target (also called the response, output, label, or dependent variable) is what you're trying to predict. For house prices: the sale price.
The "supervised" part means you're supervising the learning process by providing the correct answers. The model learns by comparing its predictions to the known answers and adjusting until it gets better.
Supervised learning comes in two flavors:
- Regression: Predicting a continuous number (price, temperature, vaccination rate)
- Classification: Predicting a category (spam or not spam, disease type, high vs. low vaccination)
We'll cover regression in Chapter 26 and classification in Chapter 27.
Unsupervised Learning: Finding Structure Without Answers
Unsupervised learning is like exploring a new city without a guidebook. You have data, but no labels — no "right answers." You're looking for patterns, groups, or structure in the data itself.
Think of it like organizing your closet. Nobody tells you the categories — you look at your clothes and discover the structure yourself. "These are all winter clothes. These are work clothes. These are workout clothes." You're finding groups that the data naturally forms.
Common unsupervised learning tasks include:
- Clustering: Grouping similar items together (customer segments, document topics, gene expression patterns)
- Dimensionality reduction: Finding simpler representations of complex data (reducing 100 variables to 5 that capture most of the information)
- Anomaly detection: Finding unusual data points that don't fit any pattern
Unsupervised learning is powerful, but we won't focus on it in this book. Our remaining chapters concentrate on supervised learning — regression and classification — because those are the tasks where you have clear questions ("Can we predict this?") and clear ways to measure success ("How close are our predictions?").
Which One Are We Doing?
For the progressive project, we're doing supervised learning. We have data about countries — GDP per capita, healthcare spending, education levels — and we know each country's vaccination rate. We'll use the country indicators (features) to predict vaccination rates (target). This is supervised learning because we have the answers (actual vaccination rates) to learn from.
Specifically, when we predict a numerical vaccination rate (like 87.3%), we're doing regression. When we classify countries as "high vaccination" or "low vaccination," we're doing classification. Same data, different framing — and you'll see both in the coming chapters.
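To make the two framings concrete, here is a minimal sketch using a few example countries (the rates are illustrative values, and the 80% cutoff is an arbitrary choice for the demonstration):

```python
import pandas as pd

# Vaccination rates for four example countries (illustrative values)
rates = pd.Series([97, 84, 54, 98],
                  index=['Norway', 'Brazil', 'Nigeria', 'Japan'])

# Regression framing: the target is the number itself
y_regression = rates

# Classification framing: the target is a category derived from the
# number, using an arbitrary cutoff of 80%
y_classification = (rates >= 80).map({True: 'high', False: 'low'})
print(y_classification)
```

Same underlying data, two different targets; which framing you choose determines whether you reach for a regression or a classification model.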
25.4 Features and Targets: The Language of Supervised Learning
Before we go further, let's make sure the vocabulary is clear. In supervised learning, you'll encounter several sets of terms that mean essentially the same thing:
| What You're Predicting | What You're Using to Predict |
|---|---|
| Target | Features |
| Response | Predictors |
| Output | Inputs |
| Label | Attributes |
| Dependent variable | Independent variables |
| y | X |
Different textbooks and communities prefer different terms. In this book, we'll use features and target most of the time, because those are the terms used by scikit-learn, the Python library we'll use for modeling.
Identifying Features and Targets
The first step in any supervised learning problem is identifying what you want to predict (the target) and what you'll use to predict it (the features).
Let's practice:
Problem: Predict house prices
- Target: Sale price (dollars)
- Features: Square footage, bedrooms, bathrooms, lot size, neighborhood, year built

Problem: Classify emails as spam or not spam
- Target: Spam or not spam (category)
- Features: Number of exclamation marks, presence of suspicious words, sender reputation, email length

Problem: Predict vaccination rate from country indicators
- Target: Vaccination rate (percentage)
- Features: GDP per capita, healthcare spending per capita, education index, urbanization rate
Notice that choosing features is a design decision. You decide which variables to include based on your domain knowledge, data availability, and modeling goals. Including more features isn't always better — a lesson we'll learn the hard way when we discuss overfitting.
The Feature Matrix and Target Vector
In scikit-learn (and in most machine learning frameworks), the convention is:
- X is the feature matrix — a table where each row is one observation and each column is one feature
- y is the target vector — a single column with the value you're trying to predict
```python
import pandas as pd

# Imagine a small dataset of countries
data = pd.DataFrame({
    'country': ['Norway', 'Brazil', 'Nigeria', 'Japan'],
    'gdp_per_capita': [75000, 8700, 2100, 40000],
    'health_spending': [7500, 930, 70, 4400],
    'vaccination_rate': [97, 84, 54, 98]
})

# Features (X): what we use to predict
X = data[['gdp_per_capita', 'health_spending']]

# Target (y): what we want to predict
y = data['vaccination_rate']
```
This X-and-y convention will appear in every modeling chapter from here on. Get comfortable with it.
25.5 Training and Testing: Why You Can't Grade Your Own Exam
Here's a scenario. You're a student who wants to know how well you understand a subject. You study from a textbook, then you test yourself — using the exact same problems from the textbook. You get 100% on every one of them.
Do you actually understand the subject? Maybe. But maybe you just memorized the answers. The only way to really know is to test yourself on new problems — ones you haven't seen before.
This is the most important idea in predictive modeling: you must evaluate your model on data it has never seen.
The Train-Test Split
The standard approach is to divide your data into two parts:
- Training set: The data your model learns from. This is the textbook.
- Test set: The data you evaluate your model on. This is the exam.
The model sees the training data and learns patterns from it. Then you check how well those patterns work on the test data — data the model has never seen. If the model performs well on the test set, you have evidence that it has learned genuine patterns that generalize to new situations. If it performs well on training data but poorly on test data, it has memorized rather than learned.
```python
from sklearn.model_selection import train_test_split

# Split: 80% for training, 20% for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training samples: {len(X_train)}")
print(f"Test samples: {len(X_test)}")
```
The train_test_split function from scikit-learn randomly divides your data. The test_size=0.2 means 20% goes to testing, 80% to training. The random_state=42 ensures you get the same split every time you run the code (reproducibility).
Why Random Splitting?
You might wonder: why not just use the first 80% for training and the last 20% for testing? Because your data might be sorted in some meaningful way — by date, by country name, by some other variable. If you split sequentially, your training and test sets might represent different populations. Random splitting ensures that both sets are representative samples of the full dataset.
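A quick sketch of the difference (the toy data here is invented for illustration; `train_test_split`'s `shuffle` parameter controls the behavior):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data sorted by year, as real data often arrives
data = pd.DataFrame({
    'year': np.repeat([2000, 2005, 2010, 2015, 2020], 4),
    'value': np.arange(20),
})

# Sequential split: shuffle=False keeps the original (sorted) order,
# so the test set is just the last rows
seq_train, seq_test = train_test_split(data, test_size=0.2, shuffle=False)

# Random split: the default shuffles the rows first
rand_train, rand_test = train_test_split(data, test_size=0.2, random_state=42)

print("Sequential test years:", sorted(seq_test['year'].unique()))
print("Random test years:    ", sorted(rand_test['year'].unique()))
```

With the sequential split, every test row comes from 2020, so you would be evaluating a model trained on one era against a different one; the random split mixes years into both sets.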
How Much Data for Testing?
Common splits are:
- 80/20: The most common. 80% train, 20% test.
- 70/30: More test data, useful when you want more confidence in your evaluation.
- 90/10: Less test data, useful when you don't have much data and want to maximize training.
There's no single right answer. The tradeoff is:
- More training data means the model has more examples to learn from (good)
- More test data means you have a more reliable evaluation (also good)
For the datasets we'll work with in this book, 80/20 works well.
The Cardinal Rule
Here is the rule you must never break:
Never let your model see the test data during training.
If the model has seen the test data, the test results are meaningless. It's like peeking at the answer key before the exam. This might sound obvious, but violating this rule (called data leakage) is one of the most common mistakes in machine learning, and it can be subtle. We'll talk more about this in later chapters.
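One classic subtle violation is preprocessing. If you fit a scaler (or imputer, or encoder) on the full dataset before splitting, test-set statistics leak into training. A minimal sketch of the wrong and right order, using synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic feature column (values are arbitrary)
X = np.random.default_rng(0).normal(50, 10, size=(100, 1))
X_train, X_test = train_test_split(X, test_size=0.2, random_state=42)

# Leaky: the scaler's mean and std are computed using test rows too
leaky_scaler = StandardScaler().fit(X)

# Correct: fit preprocessing on the training set only...
scaler = StandardScaler().fit(X_train)
X_train_scaled = scaler.transform(X_train)
# ...then apply the SAME fitted scaler to the test set
X_test_scaled = scaler.transform(X_test)
```

The leak here is small, which is exactly why it is easy to miss; the habit to build is that anything fitted to data gets fitted to training data only.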
25.6 Overfitting and Underfitting: The Goldilocks Problem
Now we come to one of the most important — and most intuitive — ideas in all of modeling. It's the reason we split data into training and test sets. It's the reason simple models sometimes beat complex ones. And it's the reason data scientists spend so much time worrying about model complexity.
The Memorizing Student
Imagine two students preparing for a history exam:
Student A memorizes every detail from the textbook — every date, every name, every footnote. On the practice problems (from the textbook), Student A scores 100%. But on the actual exam, which asks questions phrased differently and requires applying concepts to new scenarios, Student A struggles. They memorized the specifics but didn't learn the underlying patterns.
Student B reads the textbook but focuses on themes and patterns: "Revolutions tend to happen when economic inequality meets political repression." On practice problems, Student B gets maybe 85% — not perfect, because they skip some specific details. But on the exam, Student B scores about the same — 83%. They learned the underlying patterns, which work on new questions too.
Student A is overfitting. They learned the training data too well, including its noise and quirks, and their knowledge doesn't transfer to new data.
Student B is fitting appropriately. They learned the real patterns and can apply them to new situations.
There's also a Student C who barely reads the textbook, decides "history is just one thing after another," and predicts that every event is caused by economics. Student C gets 40% on practice problems and 40% on the exam. Consistent, but consistently bad. Student C is underfitting — using a model that's too simple to capture the real patterns.
Overfitting: Too Complex
Overfitting occurs when a model learns the training data too well — it captures not just the real patterns but also the random noise, flukes, and quirks that are specific to that particular dataset. An overfit model performs well on training data but poorly on new data.
Think of connecting dots on a scatter plot. If you draw a straight line through a cloud of points, you'll miss some points but capture the general trend. If you draw a wiggly curve that passes through every single point, you've perfectly fit the training data — but that wiggly curve is responding to noise, and it will make terrible predictions on new points.
Overfitting is the enemy of generalization. Here's what it looks like in practice:
```python
# Signs of overfitting:
# Training score: 0.99 (nearly perfect on training data)
# Test score: 0.45 (terrible on new data)
# The gap between training and test performance is large
```
Underfitting: Too Simple
Underfitting occurs when a model is too simple to capture the real patterns in the data. It performs poorly on both training data and new data.
Back to the dot analogy: if the true relationship between X and Y is curved, but you insist on drawing a straight line, you'll underfit. The line misses the pattern even in the training data.
```python
# Signs of underfitting:
# Training score: 0.35 (poor even on training data)
# Test score: 0.30 (poor on new data too)
# Both scores are low — the model is too simple
```
The Sweet Spot
Good modeling is about finding the sweet spot between overfitting and underfitting:
As model complexity increases from left to right:

| | Underfitting (too simple) | Sweet spot (just right) | Overfitting (too complex) |
|---|---|---|---|
| Training performance | Poor | Good | Excellent |
| Test performance | Poor | Good | Poor |
This is the Goldilocks problem of modeling. Too simple and you miss real patterns. Too complex and you capture noise. Just right and you capture the real patterns without the noise.
Visualizing the Tradeoff
Imagine fitting polynomial curves to a set of data points:
- Degree 1 (straight line): Underfits if the real relationship is curved. Misses the pattern.
- Degree 3 (gentle curve): Captures the overall shape. Fits well on both training and test data.
- Degree 15 (wild oscillations): Passes through every training point but oscillates wildly between them. Terrible on new data.
Here's how you might visualize this:
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Noisy samples from a sine curve
np.random.seed(42)
X = np.sort(np.random.uniform(0, 10, 20)).reshape(-1, 1)
y = 2 * np.sin(X).ravel() + np.random.normal(0, 0.5, 20)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
titles = ['Underfitting (degree 1)',
          'Good fit (degree 3)',
          'Overfitting (degree 15)']

for ax, degree, title in zip(axes, [1, 3, 15], titles):
    poly = PolynomialFeatures(degree)
    X_poly = poly.fit_transform(X)
    model = LinearRegression().fit(X_poly, y)
    X_plot = np.linspace(0, 10, 200).reshape(-1, 1)
    y_plot = model.predict(poly.transform(X_plot))
    ax.scatter(X, y, color='steelblue', label='Data')
    ax.plot(X_plot, y_plot, color='coral', linewidth=2)
    ax.set_title(title)
    ax.set_ylim(-4, 4)

plt.tight_layout()
plt.savefig('overfitting_demo.png', dpi=150)
plt.show()
```
If you run this code, you'll see exactly what overfitting and underfitting look like. The degree-1 model misses the curve. The degree-3 model captures it nicely. The degree-15 model goes haywire trying to hit every point.
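The same experiment can be made numeric rather than visual. This sketch (same sine-plus-noise setup, toy data invented for illustration) prints the training and test error for each degree so you can watch the gap open up:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Noisy samples from a sine curve, split into train and test sets
rng = np.random.default_rng(42)
X = np.sort(rng.uniform(0, 10, 60)).reshape(-1, 1)
y = 2 * np.sin(X).ravel() + rng.normal(0, 0.5, 60)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

results = {}
for degree in [1, 3, 15]:
    poly = PolynomialFeatures(degree)
    model = LinearRegression().fit(poly.fit_transform(X_tr), y_tr)
    mse_tr = mean_squared_error(y_tr, model.predict(poly.transform(X_tr)))
    mse_te = mean_squared_error(y_te, model.predict(poly.transform(X_te)))
    results[degree] = (mse_tr, mse_te)
    print(f"degree {degree:2d}: train MSE {mse_tr:.2f}, test MSE {mse_te:.2f}")
```

Training error keeps falling as the degree rises; test error falls at first and then climbs again, which is the overfitting gap in numbers.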
25.7 The Bias-Variance Tradeoff
Overfitting and underfitting are symptoms. The bias-variance tradeoff is the underlying disease — or, more accurately, the underlying tension that makes modeling fundamentally challenging.
Two Types of Error
Every model's prediction errors come from two sources:
Bias is the error from overly simplistic assumptions. A model with high bias pays too little attention to the training data and misses important patterns. It's the error of underfitting.
Think of bias as stubbornness. A model with high bias has already decided what the answer should look like and ignores evidence to the contrary. If you insist that the relationship between study hours and exam scores is a flat line (constant prediction), you have high bias — you're ignoring the obvious pattern that more studying leads to better scores.
Variance is the error from being too sensitive to the specific training data. A model with high variance pays too much attention to the training data, including its noise. It's the error of overfitting.
Think of variance as fickleness. A model with high variance changes its predictions dramatically based on which particular data points it was trained on. Train it on a slightly different sample and you get a completely different model. That instability means it can't generalize.
The Tradeoff
Here's the tension: reducing bias increases variance, and reducing variance increases bias. You can't minimize both simultaneously.
- Simple model: high bias, low variance (consistently wrong)
- Complex model: low bias, high variance (sometimes right, sometimes wildly wrong)
- Best model: moderate bias, moderate variance (usually approximately right)
Think of a dartboard:
- High bias, low variance: All darts land in a tight cluster, but the cluster is far from the bullseye. Consistent but consistently off.
- Low bias, high variance: Darts are scattered all over the board, but their average position is near the bullseye. Sometimes close, sometimes way off.
- Low bias, low variance: Darts cluster tightly near the bullseye. This is what we want, but it's hard to achieve.
Why the Tradeoff Exists
The tradeoff exists because of noise in real data. Real-world data is messy — there are measurement errors, random fluctuations, and variables you haven't measured. A simple model ignores this noise (good — low variance) but also misses real patterns (bad — high bias). A complex model captures real patterns (good — low bias) but also fits the noise (bad — high variance).
The total prediction error is:
Total Error = Bias² + Variance + Irreducible Noise
You can reduce bias by making the model more complex. You can reduce variance by making it simpler. But you can't reduce the irreducible noise — that's the inherent randomness in the data that no model can capture.
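You can estimate bias and variance empirically. This sketch (a simulation with an invented sine-curve "truth", not from the chapter's data) trains the same two model classes on many freshly resampled training sets and measures how their predictions at one fixed point behave:

```python
import numpy as np

rng = np.random.default_rng(0)

def true_f(x):
    # The "true" relationship, which real modelers never get to see
    return 2 * np.sin(x)

X = np.linspace(0, 10, 20)   # fixed inputs, reused every repeat
x0 = 5.0                     # the point where we measure error
noise_sd = 0.5
n_repeats = 500

preds_simple, preds_complex = [], []
for _ in range(n_repeats):
    # A fresh noisy training set each time
    y = true_f(X) + rng.normal(0, noise_sd, X.size)
    # Simple model: degree-1 polynomial (high bias, low variance)
    preds_simple.append(np.polyval(np.polyfit(X, y, 1), x0))
    # Complex model: degree-9 polynomial (low bias, high variance)
    preds_complex.append(np.polyval(np.polyfit(X, y, 9), x0))

for name, preds in [('degree 1', preds_simple), ('degree 9', preds_complex)]:
    p = np.array(preds)
    bias_sq = (p.mean() - true_f(x0)) ** 2
    print(f"{name}: bias^2 = {bias_sq:.3f}, variance = {p.var():.3f}")
```

The straight line is stubborn (its predictions barely move between repeats, but their average is far from the truth), while the degree-9 model is fickle (its average is close to the truth, but individual predictions swing with each training set).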
Practical Implications
The bias-variance tradeoff has several practical implications:
- Start simple. A simple model gives you a baseline. If it performs well enough, you're done. If not, you know how much improvement is needed and can add complexity strategically.
- More data helps with variance. With more training data, complex models become more stable because they have more examples to learn from. This is why large datasets are so valuable in machine learning.
- Watch the gap. The gap between training performance and test performance is your overfitting detector. A large gap means high variance — the model is memorizing training data.
- Complexity is a dial, not a switch. You don't choose between "simple" and "complex." You adjust complexity gradually, watching how training and test performance change, looking for the sweet spot.
- Domain knowledge helps. Knowing something about the problem lets you choose the right level of complexity. If you know the relationship is roughly linear, a linear model is a good starting point. If you know it's highly nonlinear, you might need something more flexible.
25.8 Generalization: The Whole Point
Everything we've discussed — train-test splits, overfitting, the bias-variance tradeoff — serves one goal: generalization.
Generalization is a model's ability to perform well on data it has never seen. A model that generalizes well has learned real patterns, not noise. It works not just on the specific data it was trained on, but on new data from the same process.
This is the difference between memorizing and learning. A model that memorizes can reproduce the training data perfectly. A model that learns can handle new situations.
How Do You Know If Your Model Generalizes?
The test set is your primary tool. If the model performs well on test data it has never seen, that's evidence of generalization. But even test set performance has limits:
- If you test many different models and choose the one with the best test performance, you've effectively used the test set for model selection, which can lead to overly optimistic estimates.
- If your test data comes from a different population than your training data (different time period, different demographics), even good test performance doesn't guarantee generalization to the new population.
We'll discuss more sophisticated evaluation strategies (like cross-validation) in Chapter 29. For now, the train-test split is your foundation.
The Generalization Mindset
Adopting a generalization mindset means:
- You never celebrate a model just because it does well on training data
- You always ask "But how does it do on data it hasn't seen?"
- You're suspicious of models that seem too good to be true
- You value consistency (similar performance on training and test data) over raw accuracy on training data
This mindset will serve you well not just in data science but in any field where you're trying to learn general lessons from specific examples.
25.9 Baseline Models: Always Start Here
Before you build any sophisticated model, you should build a baseline model — the simplest possible model that gives you a reference point.
Why Baselines Matter
Imagine someone tells you: "My machine learning model predicts customer purchases with 92% accuracy!" Impressive? Maybe. But what if 92% of customers in the dataset make a purchase? Then a model that just predicts "purchase" for everyone would also get 92% accuracy — with no machine learning at all.
A baseline model tells you what "no skill" looks like. If your fancy model can't beat the baseline, it's not actually learning anything useful.
Common Baselines
For regression (predicting numbers):
- Mean baseline: Always predict the average of the training data. If the average house price is $350,000, predict $350,000 for every house.
- Median baseline: Always predict the median. More robust to outliers.
For classification (predicting categories):
- Most-frequent baseline: Always predict the most common category. If 70% of emails are not spam, always predict "not spam."
- Random baseline: Predict randomly in proportion to class frequencies.
```python
import numpy as np

# Regression baseline: always predict the mean of the training targets
y_train = np.array([85, 92, 78, 91, 88, 76, 95, 82])
baseline_prediction = y_train.mean()
print(f"Baseline prediction: {baseline_prediction:.1f}")

# Every new observation gets this same prediction.
# To judge a model, compare its errors to the baseline's errors.
```
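scikit-learn also ships these baselines ready-made in its `sklearn.dummy` module. A brief sketch, using made-up toy data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier, DummyRegressor

# Regression baseline: always predict the training mean
X_train = np.arange(8).reshape(-1, 1)   # features are ignored by dummy models
y_train = np.array([85, 92, 78, 91, 88, 76, 95, 82])

reg = DummyRegressor(strategy="mean").fit(X_train, y_train)
print(reg.predict([[0]]))               # the training mean, for any input

# Classification baseline: always predict the most common class
X_labels = np.arange(10).reshape(-1, 1)
y_labels = np.array(["not spam"] * 7 + ["spam"] * 3)

clf = DummyClassifier(strategy="most_frequent").fit(X_labels, y_labels)
print(clf.predict([[0]]))               # always "not spam"
```

Because dummy models follow the same fit/predict/score API as every other scikit-learn model, comparing your real model against a baseline takes only a couple of extra lines.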
The Baseline Conversation
When you present a model to stakeholders, the first question should always be: "How does this compare to the baseline?" If your model predicts vaccination rates with a mean absolute error of 8 percentage points, and the baseline (always predict the global average) has a mean absolute error of 12 percentage points, your model is reducing error by a third. That's meaningful.
But if your model's error is 11.5 and the baseline's error is 12, you've barely improved on "just guess the average." Maybe the complexity isn't worth it.
Baselines keep you honest.
25.10 The Machine Learning Workflow: A Preview
Before we get into specific algorithms (starting in Chapter 26), let's preview the general workflow for building a supervised learning model. Every model you build will follow these steps:
Step 1: Frame the Problem
- What are you predicting? (target)
- What will you use to predict it? (features)
- Is this regression or classification?
- What does success look like? (evaluation metric)
Step 2: Prepare the Data
- Clean the data (you learned this in Part 2)
- Select and engineer features
- Handle missing values
- Split into training and test sets
Step 3: Choose a Model
- Start with a simple baseline
- Choose an appropriate algorithm (linear regression, logistic regression, decision tree, etc.)
- Consider the bias-variance tradeoff
Step 4: Train the Model
- Fit the model to the training data
- In scikit-learn: model.fit(X_train, y_train)
Step 5: Evaluate the Model
- Test on data the model hasn't seen
- In scikit-learn: model.score(X_test, y_test)
- Compare to the baseline
- Check for overfitting (training vs. test performance)
Step 6: Iterate
- If underfitting: try a more complex model or add features
- If overfitting: simplify the model, remove features, or get more data
- Repeat until satisfied
This workflow is simple to state and endlessly nuanced in practice. The next five chapters will walk you through it with specific algorithms, specific metrics, and specific code.
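The six steps can be sketched end to end in a few lines. This is only an illustration on synthetic data (LinearRegression gets its proper introduction in Chapter 26):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Steps 1-2: framed as regression; a tiny synthetic dataset stands in for real data
rng = np.random.default_rng(42)
X = rng.uniform(0, 100, size=(60, 2))            # two features
y = 20 + 0.6 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(0, 5, size=60)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 3: a baseline first, then a candidate model
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
model = LinearRegression().fit(X_train, y_train)  # Step 4: train

# Step 5: evaluate both on the held-out test set
baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))
model_mae = mean_absolute_error(y_test, model.predict(X_test))
print(f"Baseline MAE: {baseline_mae:.1f}")
print(f"Model MAE:    {model_mae:.1f}")

# Step 6: iterate only if the model fails to beat the baseline
```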
25.11 The scikit-learn API: Your Modeling Toolkit
All of the models we build in this book will use scikit-learn (also written as sklearn), Python's most popular machine learning library. One of scikit-learn's great strengths is its consistent API — every model follows the same pattern:
```python
from sklearn.some_module import SomeModel  # placeholder, not a real import

# 1. Create the model
model = SomeModel()

# 2. Train the model on training data
model.fit(X_train, y_train)

# 3. Make predictions on new data
predictions = model.predict(X_test)

# 4. Evaluate the model
score = model.score(X_test, y_test)
```
This consistency means that once you learn one model, switching to a different model often requires changing just one line of code — the import statement. The .fit(), .predict(), and .score() methods work the same way for linear regression, logistic regression, decision trees, random forests, and dozens of other algorithms.
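To see that consistency in action, here is a sketch (synthetic data, for illustration only) that swaps one model for another while the three method calls stay identical:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a strong linear signal
rng = np.random.default_rng(7)
X = rng.uniform(0, 10, size=(100, 1))
y = 2 * X[:, 0] + rng.normal(0, 1, size=100)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# The same fit / predict / score calls work for both models
scores = {}
for model in [LinearRegression(),
              DecisionTreeRegressor(max_depth=3, random_state=0)]:
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    scores[type(model).__name__] = model.score(X_test, y_test)
    print(f"{type(model).__name__}: R^2 = {scores[type(model).__name__]:.3f}")
```

Only the constructor (and its import) changed between the two models; the rest of the loop body is untouched.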
Let's install scikit-learn (if you haven't already) and verify it works:
```python
# Install first if needed: pip install scikit-learn
import sklearn
print(f"scikit-learn version: {sklearn.__version__}")
```
We'll use scikit-learn extensively starting in Chapter 26. For now, just know that it exists and that its consistent API will make your life much easier.
25.12 Project Milestone: Framing the Vaccination Prediction Problem
Let's apply everything from this chapter to our progressive project. We've been working with a dataset of country indicators throughout this book. Now we're going to use that data to build a predictive model.
The Question
Can we predict a country's vaccination rate from its economic and social indicators?
Framing the Problem
Let's walk through the workflow:
Step 1: Frame the problem
- Target: Vaccination rate (a continuous number from 0 to 100)
- Features: GDP per capita, healthcare spending per capita, education index, urbanization rate
- Type: Supervised learning, regression (because the target is continuous)
- Success metric: How close are our predictions to actual vaccination rates?
Step 2: Prepare the data
```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the country indicators dataset
df = pd.read_csv('country_indicators.csv')

# Select features and target
features = ['gdp_per_capita', 'health_spending_pct',
            'education_index', 'urban_population_pct']
target = 'vaccination_rate'

# Drop rows with missing values in our columns
model_df = df[features + [target]].dropna()

# Separate features and target
X = model_df[features]
y = model_df[target]

print(f"Dataset shape: {X.shape}")
print(f"Features: {list(X.columns)}")
print(f"Target: {target}")
print("\nTarget summary:")
print(y.describe())
```
Step 3: Split into training and test sets
```python
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(f"Training set: {X_train.shape[0]} countries")
print(f"Test set: {X_test.shape[0]} countries")
```
Step 4: Establish a baseline
```python
import numpy as np

# Baseline: always predict the mean vaccination rate
baseline = y_train.mean()
print(f"\nBaseline prediction: {baseline:.1f}%")
print("(Predict this for every country)")

# Baseline error on the test set
baseline_errors = y_test - baseline
baseline_mae = np.abs(baseline_errors).mean()
print(f"Baseline MAE: {baseline_mae:.1f} percentage points")
```
The Mean Absolute Error (MAE) tells you, on average, how many percentage points your predictions are off. If the baseline MAE is 15 percentage points, that means the "just guess the average" strategy is off by about 15 points on average. Any model we build needs to beat that.
What Comes Next
In Chapter 26, we'll build a linear regression model to predict vaccination rates. We'll compare its performance to this baseline and see how much improvement we get. In Chapter 27, we'll reframe the problem as classification — predicting whether a country has "high" or "low" vaccination — and use logistic regression.
But the framework from this chapter — features, targets, train-test splits, baselines, overfitting awareness — will carry through every model we build.
25.13 Common Misconceptions About Models
Before we move on, let's address some misconceptions that trip up beginners:
Misconception 1: "More Complex = Better"
Not necessarily. A simple model that generalizes well is often better than a complex model that overfits. The best model is the simplest one that captures the important patterns. This principle is sometimes called parsimony or Occam's razor.
Misconception 2: "The Model Finds the Truth"
A model finds patterns in data. Those patterns might reflect reality, or they might reflect biases in the data, measurement errors, or confounding variables. A model is only as good as the data it's trained on. Garbage in, garbage out.
Misconception 3: "High Training Accuracy = Good Model"
Training accuracy tells you how well the model fits the data it's already seen. That's like testing a student with the same problems they studied. What matters is test accuracy — performance on new data.
Misconception 4: "I Need Big Data for Machine Learning"
Not always. For simple models like linear regression, a few hundred data points can be plenty. The amount of data you need depends on the complexity of the model and the number of features. More complex models need more data, but simple models can work with surprisingly little.
Misconception 5: "Machine Learning Is a Black Box"
Some machine learning models are hard to interpret (neural networks, for example). But many are highly interpretable — linear regression tells you exactly how each feature contributes to the prediction. We'll focus on interpretable models in this book precisely because understanding why a model makes its predictions is as important as the predictions themselves.
Misconception 6: "Machine Learning Will Replace Human Judgment"
Machine learning augments human judgment; it doesn't replace it. You still need domain knowledge to choose features, interpret results, and decide whether the model's predictions make sense. A model that predicts "this patient has a 73% probability of disease X" is useful, but the doctor still makes the diagnosis.
25.14 Ethical Considerations: When Models Can Harm
We can't discuss models without discussing their potential for harm. Models trained on historical data can perpetuate historical biases. A hiring model trained on past hiring decisions might learn to discriminate if past decisions were discriminatory. A criminal justice model trained on arrest data might learn racial biases embedded in policing patterns.
Three Questions to Ask About Any Model
1. Who does this model affect? If a model influences decisions about people (hiring, lending, healthcare, criminal justice), the stakes are much higher than if it predicts weather or product demand.
2. What biases might be in the training data? Historical data reflects historical realities, including discrimination and inequality. A model trained on this data can learn and amplify those patterns.
3. What happens when the model is wrong? Every model makes mistakes. But the consequences of mistakes are not symmetric. A false positive in cancer screening (telling a healthy person they might have cancer) causes anxiety. A false negative (missing actual cancer) can be fatal. Understanding the cost of errors matters.
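The asymmetry of error costs can be made concrete with a tiny sketch. Every number here is hypothetical, chosen only to show how unequal costs change the picture:

```python
# Hypothetical screening scenario (illustrative counts and costs, not real figures)
false_positives = 40   # healthy people incorrectly flagged for follow-up
false_negatives = 5    # actual cases the model missed

cost_per_fp = 1        # relative cost: anxiety plus one follow-up test
cost_per_fn = 100      # relative cost: a missed diagnosis

total_cost = false_positives * cost_per_fp + false_negatives * cost_per_fn
print(total_cost)  # the 5 missed cases dominate: 40 + 500 = 540
```

Even though false positives outnumber false negatives eight to one, the handful of missed cases accounts for most of the total cost. This is why raw accuracy alone can be a misleading measure.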
We'll return to these ethical considerations throughout the modeling chapters, especially in Chapter 27 when we discuss classification errors and in Chapter 29 when we discuss model evaluation.
25.15 Chapter Summary
This chapter laid the conceptual foundation for everything that follows. Here's what you now understand:
A model is a deliberate simplification of reality. It captures important patterns while ignoring details that don't matter for the task at hand. All models are wrong — the question is whether they're useful.
Models serve two purposes: prediction (what will happen?) and explanation (why does it happen?). These goals sometimes conflict, and the right choice depends on your objective.
Supervised learning uses labeled data (features + known answers) to learn patterns that can predict outcomes for new, unlabeled data. Unsupervised learning finds structure in data without labels.
Training and test splits are essential because you must evaluate models on data they haven't seen. Never test on training data — that's grading your own exam.
Overfitting means learning noise instead of signal (too complex). Underfitting means missing real patterns (too simple). The bias-variance tradeoff is the fundamental tension between these two errors.
Generalization — performing well on new data — is the whole point. Everything else (the splits, the baselines, the complexity tuning) is in service of generalization.
Always start with a baseline. If your fancy model can't beat "just predict the average," it isn't learning anything useful.
In the next chapter, we'll build your first real predictive model: linear regression. You've already seen correlations between variables — now you'll use those relationships to make actual predictions. The concepts from this chapter — features, targets, training, testing, overfitting, baselines — will all come into play.
Connections to What You've Learned
| Concept from This Chapter | Foundation from Earlier |
|---|---|
| Features and targets | Variables and data types (Chapter 5) |
| Model as simplification | Descriptive statistics as summary (Chapter 19) |
| Train-test split | Random sampling (Chapter 22) |
| Overfitting noise | Distinguishing signal from noise (Chapter 20) |
| Prediction vs. explanation | Correlation vs. causation (Chapter 24) |
| Baseline model | Mean and median (Chapter 19) |
Looking Ahead
| Next Chapter | What You'll Learn |
|---|---|
| Chapter 26: Linear Regression | Your first predictive model — fitting lines, interpreting coefficients, making predictions |
| Chapter 27: Logistic Regression | Predicting categories instead of numbers — classification with probability outputs |
| Chapter 28: Decision Trees | A visual, intuitive approach to both regression and classification |
| Chapter 29: Evaluating Models | How to properly measure model performance with cross-validation and multiple metrics |
| Chapter 30: ML Workflow | Putting it all together — the complete machine learning pipeline |
You're at the threshold of the most exciting part of data science. Let's build some models.