Chapter 26: Linear Regression — Your First Predictive Model

"Essentially, all models are wrong, but some are useful. However, the approximate nature of the model must always be borne in mind." — George Box, Empirical Model-Building and Response Surfaces


Chapter Overview

In Chapter 24, you learned to measure the correlation between two variables. You could say, "GDP and vaccination rates have a strong positive correlation — r = 0.75." That's a description. It tells you the variables are related.

In Chapter 25, you learned what a model is: a deliberate simplification of reality that can make predictions on new data. You set up features, targets, and train-test splits.

Now you put the two together. You take the relationship between GDP and vaccination rate — the correlation you measured — and you turn it into a prediction machine. Given a country's GDP, what vaccination rate should you expect? That's what linear regression answers.

Linear regression is the "Hello, World" of machine learning. It's the first model most people learn, and for good reason: it's simple enough to understand completely, powerful enough to be genuinely useful, and foundational enough that nearly every other model builds on it in some way. When data scientists say they "start simple," they often mean they start with linear regression.

By the end of this chapter, you'll have built a model that predicts vaccination rates from country indicators, compared it to the baseline from Chapter 25, and understood exactly why the model makes the predictions it does.

In this chapter, you will learn to:

  1. Explain the intuition behind linear regression as fitting a line through data (all paths)
  2. Interpret slopes and intercepts in real-world context (all paths)
  3. Describe how least squares works to find the best-fitting line (all paths)
  4. Compute and interpret R-squared as a measure of model fit (all paths)
  5. Use scikit-learn's LinearRegression to fit, predict, and evaluate (all paths)
  6. Extend simple regression to multiple features (all paths)
  7. Identify multicollinearity and explain its effects (standard + deep dive)
  8. Apply linear regression to predict vaccination rates (all paths)
  9. Compare model performance to the Chapter 25 baseline (all paths)

26.1 The Intuition: Finding the Best Line

Let's start with a picture. Imagine a scatter plot of countries: GDP per capita on the x-axis, vaccination rate on the y-axis. The points form a rough cloud that trends upward — richer countries tend to have higher vaccination rates.

Now imagine drawing a straight line through that cloud. Not just any line — the best line. The one that comes closest to all the points. That line is your linear regression model.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Simulated country data
np.random.seed(42)
n = 80

gdp = np.random.lognormal(9.5, 1.0, n)
vaccination = 40 + 6 * np.log(gdp / 1000) + np.random.normal(0, 8, n)
vaccination = np.clip(vaccination, 15, 100)

plt.figure(figsize=(10, 6))
plt.scatter(gdp / 1000, vaccination, alpha=0.6, color='steelblue')
plt.xlabel('GDP per Capita (thousands $)')
plt.ylabel('Vaccination Rate (%)')
plt.title('What Line Best Fits This Data?')
plt.show()

When you look at this scatter plot, your eye naturally draws an imaginary line through the middle of the cloud. Linear regression does the same thing — but with mathematical precision. It finds the specific line that minimizes prediction errors.

What Does the Line Mean?

A line has two parameters:

  • Slope: How much the y-value changes when the x-value increases by one unit. "For each additional $1,000 in GDP per capita, vaccination rate increases by approximately ___ percentage points."

  • Intercept: The y-value when x equals zero. "If a country had $0 GDP per capita (hypothetically), the model predicts a vaccination rate of ___ percent."

Together, the slope and intercept define a prediction rule:

Predicted vaccination rate = intercept + slope * GDP per capita

This is the equation of a line: y = b + mx (or equivalently, y = mx + b). You've known this equation since algebra class. Linear regression just finds the best values of m and b for your data.
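To make the rule concrete, here is the line equation as a tiny function. The slope and intercept below are illustrative placeholders, not values fitted to any data:

```python
def predict_vaccination(gdp_thousands, intercept=55.0, slope=0.45):
    """Apply the line equation y = b + m*x (illustrative coefficients)."""
    return intercept + slope * gdp_thousands

# A hypothetical country with $20K GDP per capita:
print(predict_vaccination(20))  # 55.0 + 0.45 * 20 = 64.0
```

Fitting a regression model just means replacing the guessed 55.0 and 0.45 with the values that best match your data.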


26.2 Residuals: Measuring How Wrong Each Prediction Is

No line can pass through every point (unless all your points happen to lie perfectly on a line, which never happens with real data). So every prediction has some error. The error for a single data point is called a residual:

Residual = Actual value - Predicted value

If a country has an actual vaccination rate of 92% and your model predicts 85%, the residual is 92 - 85 = +7. Positive residual: the model underestimated.

If another country has an actual rate of 60% and your model predicts 68%, the residual is 60 - 68 = -8. Negative residual: the model overestimated.

# Visualize residuals
from sklearn.linear_model import LinearRegression

X = (gdp / 1000).reshape(-1, 1)
y = vaccination

model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

plt.figure(figsize=(10, 6))
plt.scatter(X, y, alpha=0.6, color='steelblue', label='Actual')
plt.plot(X, y_pred, color='coral', linewidth=2, label='Regression line')

# Draw residual lines for a few points
for i in range(0, len(X), 8):
    plt.vlines(X[i], y[i], y_pred[i], colors='gray',
               linestyles='dashed', alpha=0.5)

plt.xlabel('GDP per Capita (thousands $)')
plt.ylabel('Vaccination Rate (%)')
plt.title('Linear Regression with Residuals')
plt.legend()
plt.show()

The dashed gray lines are the residuals — the vertical distances between each actual point and the regression line. Some are positive (point above the line), some are negative (point below the line).


26.3 Least Squares: Finding the Best Line

What makes a line "best"? There are many possible lines you could draw through a scatter plot. Linear regression defines "best" using the least squares criterion: the best line is the one that minimizes the sum of squared residuals.

Why squared? Two reasons:

  1. Squaring prevents cancellation. Positive and negative residuals would cancel out if you just summed them. A line through the middle of the data would have residuals that sum to zero — but so would many terrible lines.

  2. Squaring penalizes large errors more. A residual of 10 contributes 100 to the sum of squares, while a residual of 5 contributes only 25. This means the line is pulled toward reducing big errors, which is usually what we want.

Sum of Squared Residuals = sum of (actual - predicted)²

The least squares line minimizes this sum.

You don't need to know the calculus behind finding this minimum (though it's elegant if you're curious). What matters is the intuition: linear regression finds the line that, on average, comes closest to all the data points, with extra penalty for being far from any single point.
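You can watch least squares at work by comparing the fitted line's sum of squared residuals against a few hand-picked alternative slopes (the synthetic data and candidate slopes below are arbitrary, chosen purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x = rng.uniform(1, 50, 100)                # feature, e.g. GDP in thousands
y = 50 + 0.5 * x + rng.normal(0, 5, 100)   # target with noise

fit = LinearRegression().fit(x.reshape(-1, 1), y)

def sum_squared_residuals(slope, intercept):
    """Sum of squared residuals for a candidate line."""
    return np.sum((y - (intercept + slope * x)) ** 2)

best = sum_squared_residuals(fit.coef_[0], fit.intercept_)
for m in [0.3, 0.6, 0.9]:                  # arbitrary candidate slopes
    print(f"slope {m}: SSR = {sum_squared_residuals(m, fit.intercept_):,.0f}")
print(f"least-squares slope {fit.coef_[0]:.3f}: SSR = {best:,.0f}")
```

No other line produces a smaller sum; that is exactly the sense in which the least-squares line is "best."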

# The coefficients that minimize squared residuals
print(f"Slope: {model.coef_[0]:.4f}")
print(f"Intercept: {model.intercept_:.4f}")
print(f"\nPrediction formula:")
print(f"Vaccination = {model.intercept_:.1f} + "
      f"{model.coef_[0]:.3f} * GDP_per_capita_thousands")

26.4 Interpreting the Coefficients

The slope and intercept aren't just numbers — they tell a story about the relationship in your data. Learning to read that story is one of the most valuable skills in data science.

Interpreting the Slope

The slope tells you: for each one-unit increase in the feature, the target changes by [slope] units, on average.

If the slope is 0.45 and the feature is GDP per capita in thousands of dollars:

"For each additional $1,000 in GDP per capita, the model predicts vaccination rate increases by 0.45 percentage points, on average."

Notice the careful language: "the model predicts" and "on average." We're not making a causal claim (remember Chapter 24). We're describing the pattern the model has learned from the data.

Interpreting the Intercept

The intercept is the predicted value when all features equal zero. This is sometimes meaningful and sometimes not:

  • Meaningful: If modeling the relationship between study hours (0 is possible) and test scores, the intercept represents the predicted score for someone who didn't study at all.

  • Not meaningful: If modeling the relationship between GDP per capita (never truly 0 in practice) and vaccination rates, the intercept is a mathematical artifact — the point where the line crosses the y-axis, but it doesn't correspond to a real-world scenario.

# Interpret the coefficients
slope = model.coef_[0]
intercept = model.intercept_

print(f"Intercept: {intercept:.1f}")
print(f"  -> Hypothetical vaccination rate at $0 GDP")
print(f"\nSlope: {slope:.3f}")
print(f"  -> Each additional $1K GDP per capita is")
print(f"     associated with {slope:.2f} percentage points")
print(f"     higher vaccination rate")

# Make specific predictions
gdp_values = [5, 15, 30, 50]
for g in gdp_values:
    pred = intercept + slope * g
    print(f"\n  GDP ${g}K -> predicted vaccination: {pred:.1f}%")

26.5 R-Squared: How Good Is the Fit?

You've drawn the best line. But how good is "best"? Is the line close to the points or far from them? Does the relationship explain most of the variation in the data, or just a sliver?

R-squared (R²) answers this question. It measures the proportion of variance in the target that is explained by the features.

  • R² = 1.0: The model explains all the variation. Every point falls exactly on the line. (This essentially never happens with real data.)
  • R² = 0.0: The model explains none of the variation. The line is no better than just predicting the mean for everyone.
  • R² = 0.65: The model explains 65% of the variation. 35% is left unexplained (due to other factors, noise, or nonlinearity).

from sklearn.metrics import r2_score, mean_absolute_error

y_pred = model.predict(X)
r2 = r2_score(y, y_pred)
mae = mean_absolute_error(y, y_pred)

print(f"R-squared: {r2:.3f}")
print(f"  -> The model explains {r2*100:.1f}% of the variation")
print(f"     in vaccination rates")
print(f"\nMean Absolute Error: {mae:.1f} percentage points")
print(f"  -> On average, predictions are off by {mae:.1f} points")

What's a "Good" R²?

This depends entirely on the domain:

Domain                  Typical R²    Why
Physics experiments     0.95+         Controlled conditions, few variables
Engineering models      0.80-0.95     Well-understood systems
Social science          0.30-0.60     Human behavior is complex
Economics / health      0.20-0.50     Many unmeasured factors
Stock prices            0.01-0.10     Extremely noisy

An R² of 0.45 in a social science context is quite good. In a physics lab, it would be terrible. Context matters.

R² and the Baseline Connection

Remember the baseline model from Chapter 25? The baseline always predicts the mean. R² is directly related to the baseline: it measures how much better your model is than the "just predict the mean" approach.

  • R² = 0 means your model is no better than predicting the mean
  • R² = 0.6 means your model's squared error is 60% lower than the baseline's
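This connection can be checked directly: R² equals one minus the ratio of the model's squared error to the baseline's. A minimal verification on synthetic data (the numbers here are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, 200)
y = 3 + 2 * x + rng.normal(0, 2, 200)

model = LinearRegression().fit(x.reshape(-1, 1), y)
y_pred = model.predict(x.reshape(-1, 1))

ss_res = np.sum((y - y_pred) ** 2)     # model's squared error
ss_tot = np.sum((y - y.mean()) ** 2)   # baseline's squared error
r2_manual = 1 - ss_res / ss_tot

print(f"manual R²:        {r2_manual:.4f}")
print(f"sklearn r2_score: {r2_score(y, y_pred):.4f}")
```

The two numbers agree: R² is literally the fraction of the baseline's squared error that the model eliminates.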

26.6 Using scikit-learn: The Full Workflow

Let's put it all together with scikit-learn's consistent API. This is the workflow you'll use for every model in this book.

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score

# Step 1: Prepare data
np.random.seed(42)
n = 150

df = pd.DataFrame({
    'gdp_thousands': np.random.lognormal(2.5, 1.0, n),
    'health_spending': np.random.uniform(2, 12, n),
})

df['vaccination'] = (
    50 + 0.4 * df['gdp_thousands'] +
    2.5 * df['health_spending'] +
    np.random.normal(0, 6, n)
).clip(20, 100)

# Step 2: Define features and target
X = df[['gdp_thousands']]
y = df['vaccination']

# Step 3: Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Step 4: Establish baseline
baseline_pred = y_train.mean()
baseline_mae = mean_absolute_error(y_test,
    [baseline_pred] * len(y_test))
print(f"Baseline MAE: {baseline_mae:.2f}")

# Step 5: Train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Step 6: Evaluate
train_r2 = model.score(X_train, y_train)
test_r2 = model.score(X_test, y_test)
test_pred = model.predict(X_test)
test_mae = mean_absolute_error(y_test, test_pred)

print(f"\nLinear Regression Results:")
print(f"  Training R²: {train_r2:.3f}")
print(f"  Test R²:     {test_r2:.3f}")
print(f"  Test MAE:    {test_mae:.2f}")
print(f"  Baseline MAE: {baseline_mae:.2f}")
print(f"  Improvement over baseline: "
      f"{(1 - test_mae/baseline_mae)*100:.1f}%")

Notice the workflow: prepare, split, baseline, train, evaluate. And notice the key comparison: training R² vs. test R² (checking for overfitting) and model MAE vs. baseline MAE (checking for usefulness).


26.7 Multiple Linear Regression: Adding More Features

So far we've used one feature (GDP) to predict vaccination rates. But we have more information available — healthcare spending, education index, urbanization rate. Can we use all of them?

Multiple linear regression extends the model from one feature to many:

Simple:   y = b + m₁ * x₁
Multiple: y = b + m₁ * x₁ + m₂ * x₂ + m₃ * x₃ + ...

Each feature gets its own coefficient (slope), and the model combines them. The interpretation of each coefficient is: "the effect of this feature, holding all other features constant."

# Add more features
df['education_index'] = np.random.uniform(0.3, 0.95, n)
df['urban_pct'] = np.random.uniform(20, 95, n)

# Recalculate vaccination to include all features
df['vaccination'] = (
    30 +
    0.3 * df['gdp_thousands'] +
    2.0 * df['health_spending'] +
    25 * df['education_index'] +
    0.1 * df['urban_pct'] +
    np.random.normal(0, 5, n)
).clip(20, 100)

# Multiple regression
X_multi = df[['gdp_thousands', 'health_spending',
              'education_index', 'urban_pct']]
y = df['vaccination']

X_train, X_test, y_train, y_test = train_test_split(
    X_multi, y, test_size=0.2, random_state=42
)

multi_model = LinearRegression()
multi_model.fit(X_train, y_train)

# Compare single vs. multiple regression
print("Multiple Regression Results:")
print(f"  Training R²: {multi_model.score(X_train, y_train):.3f}")
print(f"  Test R²:     {multi_model.score(X_test, y_test):.3f}")

# Coefficients
coef_df = pd.DataFrame({
    'Feature': X_multi.columns,
    'Coefficient': multi_model.coef_
}).sort_values('Coefficient', ascending=False)

print(f"\nIntercept: {multi_model.intercept_:.2f}")
print("\nFeature coefficients:")
print(coef_df.to_string(index=False))

Interpreting Multiple Regression Coefficients

Each coefficient tells you the expected change in the target for a one-unit increase in that feature, holding all other features constant. This "holding constant" part is crucial:

"An increase of one unit in the education index is associated with approximately 25 percentage points higher vaccination rate, holding GDP, health spending, and urbanization constant."

The "holding constant" qualifier makes interpretation meaningful. Without it, the education coefficient might be confounded by GDP (richer countries have both higher education and higher vaccination).
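The confounding problem is easy to demonstrate on synthetic data where we control the truth. Below, education is deliberately built to track GDP, and the true education effect is set to 20; a simple regression on education alone absorbs part of GDP's effect, while the multiple regression recovers something closer to the truth (all numbers are invented for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 500
gdp = rng.normal(20, 5, n)                     # GDP per capita, thousands
edu = 0.02 * gdp + rng.normal(0, 0.05, n)      # education tracks GDP
y = 30 + 0.5 * gdp + 20 * edu + rng.normal(0, 2, n)  # true edu effect: 20

simple = LinearRegression().fit(edu.reshape(-1, 1), y)
multiple = LinearRegression().fit(np.column_stack([gdp, edu]), y)

print(f"education coefficient, simple regression:    {simple.coef_[0]:.1f}")
print(f"education coefficient, holding GDP constant: {multiple.coef_[1]:.1f}")
```

The simple-regression coefficient is inflated because education is standing in for the omitted GDP effect; adding GDP to the model removes that bias.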


26.8 Multicollinearity: When Features Are Correlated

Here's a trap that catches many beginners. If two features are highly correlated with each other (say, GDP per capita and healthcare spending per capita — rich countries spend more on healthcare), the model has trouble separating their individual effects.

This is called multicollinearity, and it causes two problems:

  1. Individual coefficients become unreliable. The model might assign a large positive coefficient to GDP and a small negative coefficient to health spending, or vice versa — even though both have positive relationships with vaccination rates. The combined prediction is still good, but the individual coefficients don't tell a clear story.

  2. Coefficients become unstable. Small changes in the data can cause large swings in the individual coefficients. The variance of coefficient estimates increases.

# Check for multicollinearity
correlation_matrix = X_multi.corr()
print("Feature correlation matrix:")
print(correlation_matrix.round(2))

What to do about multicollinearity:

  • For prediction: Often nothing. Multicollinearity doesn't affect prediction accuracy much — it affects coefficient interpretation. If you only care about making good predictions, you can often ignore it.

  • For explanation: Be careful. If you're interpreting coefficients to understand which factors matter, multicollinearity can mislead you. Consider removing one of the correlated features, or using techniques like variance inflation factors (VIF) to diagnose the problem.

This is another instance of the prediction vs. explanation distinction from Chapter 25. The right response depends on your goal.
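If you want to diagnose multicollinearity more precisely than eyeballing a correlation matrix, VIF can be computed with nothing but LinearRegression itself: regress each feature on the others and convert the resulting R² via VIF = 1 / (1 - R²). A sketch on synthetic features, where health spending is deliberately constructed from GDP to force collinearity:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 200
gdp = rng.lognormal(2.5, 1.0, n)
health = 0.5 * gdp + rng.normal(0, 1, n)   # built from GDP: collinear
urban = rng.uniform(20, 95, n)             # independent of the others

X = pd.DataFrame({'gdp': gdp, 'health': health, 'urban': urban})

def vif(X, col):
    """Variance inflation factor: 1 / (1 - R²) of `col` vs. the rest."""
    others = X.drop(columns=[col])
    r2 = LinearRegression().fit(others, X[col]).score(others, X[col])
    return 1 / (1 - r2)

for col in X.columns:
    print(f"VIF({col}) = {vif(X, col):.1f}")
```

A common rule of thumb treats VIF above roughly 5 to 10 as a warning sign; here gdp and health blow past it while urban stays near 1.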


26.9 Feature Scaling: Do Units Matter?

When you have multiple features measured in different units (GDP in thousands of dollars, education index from 0 to 1, urbanization as a percentage), the coefficients have different scales. A coefficient of 0.3 for GDP means something very different from a coefficient of 25 for education index, because the features are measured on different scales.

Feature scaling standardizes the features so they're on comparable scales:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Fit on scaled data
scaled_model = LinearRegression()
scaled_model.fit(X_train_scaled, y_train)

# Now coefficients are comparable
print("Standardized coefficients:")
for feat, coef in zip(X_multi.columns, scaled_model.coef_):
    print(f"  {feat}: {coef:.3f}")

After scaling, all features have mean 0 and standard deviation 1. Now the coefficients are directly comparable: the feature with the largest absolute coefficient has the strongest relationship with the target.

Important: For basic LinearRegression, scaling doesn't change the model's predictions — only the coefficient values. But for other models (like regularized regression or neural networks), scaling is essential. It's a good habit to develop.

Also important: Always fit the scaler on the training data only, then apply it to the test data. This prevents data leakage.
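The first point is easy to verify empirically. In this self-contained sketch (synthetic features, arbitrary coefficients), the raw and standardized fits produce identical predictions even though their coefficients differ:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.uniform(0, 100, (150, 3))          # three features, different scales
X[:, 1] /= 100                             # squeeze one into [0, 1]
y = X @ np.array([0.3, 25.0, 0.1]) + rng.normal(0, 2, 150)

raw = LinearRegression().fit(X, y)

scaler = StandardScaler().fit(X)           # fit the scaler on training data only
scaled = LinearRegression().fit(scaler.transform(X), y)

same = np.allclose(raw.predict(X), scaled.predict(scaler.transform(X)))
print(f"identical predictions: {same}")
print(f"raw coefficients:      {raw.coef_.round(2)}")
print(f"scaled coefficients:   {scaled.coef_.round(2)}")
```

Scaling rescales the coefficients (each is multiplied by its feature's standard deviation) without changing what the model predicts.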


26.10 When Linear Regression Fails

Linear regression assumes a linear relationship between features and target. What happens when the relationship isn't linear?

# Nonlinear relationship example
np.random.seed(42)
x = np.random.uniform(0, 10, 100)
y_nonlinear = 3 * np.sin(x) + np.random.normal(0, 0.5, 100)

# Fit linear regression to nonlinear data
X_nl = x.reshape(-1, 1)
model_nl = LinearRegression().fit(X_nl, y_nonlinear)

plt.figure(figsize=(10, 5))
plt.scatter(x, y_nonlinear, alpha=0.6, color='steelblue')
x_line = np.linspace(0, 10, 100).reshape(-1, 1)
plt.plot(x_line, model_nl.predict(x_line),
         color='coral', linewidth=2, label='Linear regression')
plt.title(f'Linear Regression on Nonlinear Data (R² = '
          f'{model_nl.score(X_nl, y_nonlinear):.3f})')
plt.legend()
plt.show()

The line cuts through the middle of the sine wave, missing the pattern entirely. The R² is near zero. This is underfitting — the model is too simple for the data.

Recognizing Nonlinearity

How do you know if linear regression is appropriate? Two tools:

  1. Scatter plots. Before fitting, always plot your features against the target. If the relationship is clearly curved, linear regression won't work well.

  2. Residual plots. After fitting, plot residuals vs. predicted values. If the residuals show a pattern (like a curve), the model is missing something systematic.

# Residual plot for the linear model
y_pred = model_nl.predict(X_nl)
residuals = y_nonlinear - y_pred

plt.figure(figsize=(10, 5))
plt.scatter(y_pred, residuals, alpha=0.6, color='steelblue')
plt.axhline(y=0, color='coral', linestyle='--')
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residual Plot (Pattern = Problem)')
plt.show()

A good residual plot shows random scatter around zero — no pattern. If you see a pattern, the model is missing something.

What to Do About Nonlinearity

Several options:

  • Transform the feature. If the relationship between GDP and vaccination is logarithmic (rapid improvement at low GDP, diminishing returns at high GDP), try using log(GDP) as the feature.
  • Add polynomial features. Use x² or x³ to capture curves.
  • Use a different model. Decision trees (Chapter 28) can capture nonlinear relationships naturally.

# Log transformation for diminishing returns
# (X is the single-feature GDP DataFrame and y the vaccination target
#  defined earlier in this chapter)
X_log = np.log(X + 1)  # +1 guards against log(0)
model_log = LinearRegression().fit(X_log, y)
print(f"R² with log(GDP): {model_log.score(X_log, y):.3f}")
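The polynomial-features option can be sketched with scikit-learn's PolynomialFeatures in a pipeline. The sine-shaped data from above is recreated here so the block is self-contained, and degree 7 is an arbitrary choice for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

np.random.seed(42)
x = np.random.uniform(0, 10, 100).reshape(-1, 1)
y_nl = 3 * np.sin(x.ravel()) + np.random.normal(0, 0.5, 100)

linear = LinearRegression().fit(x, y_nl)
poly = make_pipeline(PolynomialFeatures(degree=7),
                     LinearRegression()).fit(x, y_nl)

print(f"Linear fit R²:   {linear.score(x, y_nl):.3f}")
print(f"Degree-7 fit R²: {poly.score(x, y_nl):.3f}")
```

Because the polynomial features include x itself, the polynomial training fit can never be worse than the straight line; whether it generalizes is a separate question, and Chapter 25's train-test split applies here too.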

26.11 Project Milestone: Predicting Vaccination Rates

Now let's bring everything together for the progressive project. This is the payoff: using what we've learned to build a real model.

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
import matplotlib.pyplot as plt

# Simulated country indicators dataset
np.random.seed(42)
n = 150

countries = pd.DataFrame({
    'country': [f'Country_{i}' for i in range(n)],
    'gdp_per_capita': np.random.lognormal(9, 1.2, n),
    'health_spending_pct': np.random.uniform(2, 12, n),
    'education_index': np.random.uniform(0.3, 0.95, n),
    'urban_pct': np.random.uniform(15, 95, n)
})

countries['vaccination_rate'] = (
    20 +
    5 * np.log(countries['gdp_per_capita'] / 1000) +
    1.5 * countries['health_spending_pct'] +
    30 * countries['education_index'] +
    0.08 * countries['urban_pct'] +
    np.random.normal(0, 6, n)
).clip(15, 100)

# Define features and target
features = ['gdp_per_capita', 'health_spending_pct',
            'education_index', 'urban_pct']
X = countries[features]
y = countries['vaccination_rate']

# Split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Baseline
baseline_mae = mean_absolute_error(
    y_test, [y_train.mean()] * len(y_test))

# Model
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Results
print("=== Vaccination Rate Prediction ===\n")
print(f"Training R²:  {model.score(X_train, y_train):.3f}")
print(f"Test R²:      {model.score(X_test, y_test):.3f}")
print(f"Test MAE:     {mean_absolute_error(y_test, y_pred):.2f}")
print(f"Baseline MAE: {baseline_mae:.2f}")
print(f"\nModel improves on baseline by "
      f"{(1-mean_absolute_error(y_test,y_pred)/baseline_mae)*100:.1f}%")

print(f"\nIntercept: {model.intercept_:.2f}")
print("\nCoefficients:")
for feat, coef in zip(features, model.coef_):
    print(f"  {feat}: {coef:.4f}")

Interpreting the Results

Let's read the output like a data scientist:

  1. Training vs. Test R²: If these are close (say, 0.72 vs. 0.68), the model is generalizing well. No significant overfitting.

  2. Test MAE vs. Baseline MAE: If the model's MAE is 5.2 and the baseline's is 10.8, the model is more than twice as good as "just guess the average." That's meaningful.

  3. Coefficients: The education index coefficient (~30) is the largest, suggesting that education is the strongest predictor of vaccination rates in this model. But remember: coefficient size depends on feature scale, so compare standardized coefficients for a fair comparison.

Visualizing Predictions vs. Actuals

plt.figure(figsize=(8, 8))
plt.scatter(y_test, y_pred, alpha=0.6, color='steelblue')
plt.plot([y.min(), y.max()], [y.min(), y.max()],
         'r--', linewidth=2, label='Perfect predictions')
plt.xlabel('Actual Vaccination Rate')
plt.ylabel('Predicted Vaccination Rate')
plt.title('Predicted vs. Actual Vaccination Rates')
plt.legend()
plt.show()

Points close to the diagonal line represent good predictions. Points far from it represent errors. This "predicted vs. actual" plot is one of the most useful diagnostic tools in regression.


26.12 What Linear Regression Cannot Do

Before we celebrate, let's be honest about linear regression's limitations:

  1. It assumes linearity. If the true relationship is curved, the model will underfit. Use scatter plots and residual plots to check.

  2. It's sensitive to outliers. Because it minimizes squared residuals, a single extreme point can pull the line dramatically. A country with a very unusual GDP-vaccination relationship could distort the entire model.

  3. It doesn't prove causation. The model uses GDP to predict vaccination rates, but that doesn't mean increasing GDP will increase vaccination rates. Remember Chapter 24.

  4. It extrapolates dangerously. The model was trained on GDP values between, say, $500 and $80,000. Predicting vaccination rates for a country with a GDP of $200,000 means extrapolating beyond the data — and the linear assumption may not hold.

  5. It can't capture interactions. Maybe GDP matters more in countries with low education than in countries with high education. Basic linear regression treats each feature independently. (Interaction terms can address this, but that's a more advanced topic.)
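As a glimpse of that more advanced topic: an interaction term is just a new feature built by multiplying two existing ones. In this synthetic sketch (all coefficients invented), GDP's effect is constructed to shrink as education rises, and adding the gdp × edu product lets an otherwise linear model capture that:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
n = 300
gdp = rng.uniform(1, 50, n)
edu = rng.uniform(0.3, 0.95, n)
# GDP matters more where education is low (built-in interaction)
y = 40 + 0.8 * gdp * (1 - edu) + 30 * edu + rng.normal(0, 3, n)

X_base = np.column_stack([gdp, edu])
X_int = np.column_stack([gdp, edu, gdp * edu])   # add interaction feature

base = LinearRegression().fit(X_base, y)
inter = LinearRegression().fit(X_int, y)

print(f"R² without interaction: {base.score(X_base, y):.3f}")
print(f"R² with interaction:    {inter.score(X_int, y):.3f}")
```

The negative coefficient on the product term is the model's way of saying "GDP's slope decreases as education increases."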

Despite these limitations, linear regression remains one of the most widely used models in science, business, and public policy. Its interpretability is a superpower — when you can explain exactly what the model is doing and why, stakeholders trust it more. And trust matters.


26.13 Chapter Summary

You've built your first predictive model. Here's what you now understand:

Linear regression finds the straight line (or hyperplane, in multiple dimensions) that minimizes the sum of squared residuals — the squared differences between predicted and actual values.

The slope tells you how much the target changes for a one-unit increase in a feature. The intercept is the predicted value when all features are zero.

R-squared measures how much of the target's variance the model explains. It ranges from 0 (no better than predicting the mean) to 1 (perfect prediction).

Multiple regression extends the model to use several features simultaneously, with each coefficient interpreted as "the effect of this feature, holding other features constant."

Multicollinearity occurs when features are correlated with each other, making individual coefficients unreliable (though predictions may still be fine).

Feature scaling standardizes features to comparable scales, making coefficients directly comparable.

Always compare to the baseline. Your model's MAE means nothing without the context of what "just predicting the mean" would achieve.

In Chapter 27, we'll change the question. Instead of predicting a number (vaccination rate), we'll predict a category (high vs. low vaccination). This is classification, and the model we'll use — logistic regression — builds directly on the linear regression you've just learned.


Connections to What You've Learned

Concept from This Chapter                   Foundation from Earlier
Regression line through scatter plot        Scatter plots (Chapters 15-16), correlation (Chapter 24)
R-squared and explained variance            Variance (Chapter 19), correlation coefficient (Chapter 24)
Train-test split and baseline comparison    Model evaluation framework (Chapter 25)
Coefficient interpretation                  Correlation vs. causation (Chapter 24)
Feature matrix X and target vector y        DataFrames and Series (Chapter 7)
Overfitting check (train vs. test R²)       Bias-variance tradeoff (Chapter 25)

Looking Ahead

Next Chapter                       What You'll Learn
Chapter 27: Logistic Regression    Predicting categories with probability outputs
Chapter 28: Decision Trees         A visual, nonlinear alternative to regression
Chapter 29: Evaluating Models      Cross-validation, precision, recall, and choosing metrics
Chapter 30: ML Workflow            The complete pipeline from data to deployment