Chapter 6: Supervised Learning: Regression and Classification
"All models are wrong, but some are useful." --- George E. P. Box, Empirical Model-Building and Response Surfaces (1987)
In This Chapter
Supervised learning is the workhorse of applied machine learning. From predicting tomorrow's stock price to detecting fraudulent transactions, from diagnosing diseases in medical images to routing autonomous vehicles, the vast majority of production AI systems rely on supervised learning at their core. In this chapter, you will learn the fundamental algorithms that power these systems---starting with the elegant simplicity of linear regression, progressing through the probabilistic framing of logistic regression, and culminating in the raw predictive power of ensemble methods like random forests and gradient boosting.
As we saw in Chapter 4, probability and statistics give us the language to reason about uncertainty. In Chapter 5, we studied the optimization techniques that allow us to find the best parameters for a model. Now we bring these ideas together: supervised learning is the art and science of learning a mapping from inputs to outputs, guided by labeled examples. By the end of this chapter, you will have a deep understanding of the most important supervised learning algorithms and the practical judgment to select the right tool for any given problem.
6.1 The Supervised Learning Framework
At its core, supervised learning is deceptively simple. We are given a training set of $n$ input-output pairs:
$$\mathcal{D} = \{(\mathbf{x}_1, y_1), (\mathbf{x}_2, y_2), \ldots, (\mathbf{x}_n, y_n)\}$$
where each $\mathbf{x}_i \in \mathbb{R}^d$ is a feature vector (also called an input, predictor, or independent variable) and each $y_i$ is the corresponding target (also called the label, response, or dependent variable). Our goal is to learn a function $f: \mathbb{R}^d \to \mathcal{Y}$ that generalizes well to unseen data---that is, the function should make accurate predictions on new inputs that were not part of the training set.
The nature of the target $y$ determines the type of supervised learning problem:
- Regression: When $y \in \mathbb{R}$ is continuous. Example: predicting house prices, estimating temperature, forecasting revenue.
- Classification: When $y \in \{0, 1, \ldots, K-1\}$ is discrete. Example: spam detection ($K=2$), digit recognition ($K=10$), disease diagnosis.
6.1.1 The Learning Pipeline
Every supervised learning project follows a common pipeline, regardless of the specific algorithm:
- Data collection and preprocessing: Gather labeled examples, handle missing values, encode categorical features, and normalize or standardize numerical features (as we discussed in Chapter 2).
- Feature engineering: Transform raw inputs into representations that make patterns easier for the model to learn.
- Model selection: Choose a hypothesis class $\mathcal{H}$ (e.g., linear models, decision trees, neural networks).
- Training: Use an optimization algorithm (Chapter 5) to find the parameters $\boldsymbol{\theta}^*$ that minimize a loss function $\mathcal{L}$ on the training data.
- Evaluation: Assess the learned model on held-out test data to estimate its generalization performance.
- Deployment and monitoring: Put the model into production and track its performance over time.
6.1.2 Loss Functions
The loss function $\mathcal{L}(\boldsymbol{\theta})$ quantifies how well our model's predictions match the true targets. For regression, the most common choice is the mean squared error (MSE):
$$\mathcal{L}_{\text{MSE}}(\boldsymbol{\theta}) = \frac{1}{n} \sum_{i=1}^{n} (y_i - f(\mathbf{x}_i; \boldsymbol{\theta}))^2$$
where:
- $n$ is the number of training examples
- $y_i$ is the true target for the $i$-th example
- $f(\mathbf{x}_i; \boldsymbol{\theta})$ is the model's prediction for input $\mathbf{x}_i$ with parameters $\boldsymbol{\theta}$
For classification, we typically use the cross-entropy loss (also called log loss):
$$\mathcal{L}_{\text{CE}}(\boldsymbol{\theta}) = -\frac{1}{n} \sum_{i=1}^{n} \left[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \right]$$
where $\hat{p}_i = P(y_i = 1 \mid \mathbf{x}_i; \boldsymbol{\theta})$ is the model's predicted probability that the $i$-th example belongs to class 1.
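To make these definitions concrete, here is a minimal NumPy sketch of both losses (the helper names mse_loss and binary_cross_entropy are illustrative, not from a library):
import numpy as np
def mse_loss(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Mean squared error for regression."""
    return float(np.mean((y_true - y_pred) ** 2))
def binary_cross_entropy(y_true: np.ndarray, p_hat: np.ndarray, eps: float = 1e-12) -> float:
    """Binary cross-entropy; eps guards against log(0)."""
    p_hat = np.clip(p_hat, eps, 1 - eps)
    return float(-np.mean(y_true * np.log(p_hat) + (1 - y_true) * np.log(1 - p_hat)))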
6.1.3 Empirical Risk Minimization
The theoretical foundation of supervised learning is empirical risk minimization (ERM). The true risk of a function $f$ is its expected loss over the data-generating distribution:
$$R(f) = \mathbb{E}_{(\mathbf{x}, y) \sim P}[\ell(f(\mathbf{x}), y)]$$
where $\ell$ is a loss function and $P$ is the (unknown) joint distribution of inputs and targets. Since we cannot compute $R(f)$ directly, we approximate it with the empirical risk over the training set:
$$\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(\mathbf{x}_i), y_i)$$
ERM selects the function that minimizes the empirical risk: $f^* = \arg\min_{f \in \mathcal{H}} \hat{R}(f)$. The gap between the true risk and the empirical risk is governed by the complexity of the hypothesis class $\mathcal{H}$ and the amount of training data. This gap is closely related to the bias-variance tradeoff (Section 6.7), and understanding it is essential for building models that generalize well.
6.1.4 Train, Validation, and Test Splits
A critical principle in supervised learning is that we must evaluate our model on data it has never seen during training. The standard approach is to split the dataset into three parts:
- Training set (typically 60--80%): Used to fit the model parameters.
- Validation set (typically 10--20%): Used to tune hyperparameters and select among competing models.
- Test set (typically 10--20%): Used once, at the very end, to estimate final generalization performance.
from sklearn.model_selection import train_test_split
# First split: separate test set
X_train_val, X_test, y_train_val, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# Second split: separate validation set from training set
X_train, X_val, y_train, y_val = train_test_split(
X_train_val, y_train_val, test_size=0.25, random_state=42 # 0.25 * 0.8 = 0.2
)
When data is scarce, k-fold cross-validation provides a more robust estimate of performance. We will explore cross-validation strategies in greater detail in Chapter 8.
6.1.5 A Historical Note
Supervised learning has deep roots. The method of least squares was independently developed by Legendre (1805) and Gauss (1809) for fitting astronomical observations. Fisher's linear discriminant analysis (1936) was among the first classification algorithms. The perceptron (Rosenblatt, 1958) introduced the idea of learning from data through iterative updates---a precursor to the neural networks we will explore in Part III. What unites these methods across two centuries is the same fundamental idea: given examples of inputs and outputs, learn a mapping that generalizes to new inputs.
6.2 Linear Regression
Linear regression is the starting point for nearly all of machine learning. Despite its simplicity, it remains one of the most widely used algorithms in practice, and its mathematical framework provides the foundation for understanding more complex models.
6.2.1 The Model
A linear regression model assumes that the target $y$ is a linear function of the features:
$$\hat{y} = f(\mathbf{x}; \mathbf{w}, b) = \mathbf{w}^\top \mathbf{x} + b = w_1 x_1 + w_2 x_2 + \cdots + w_d x_d + b$$
where:
- $\mathbf{w} \in \mathbb{R}^d$ is the weight vector (also called coefficients)
- $b \in \mathbb{R}$ is the bias term (also called the intercept)
- $\mathbf{x} \in \mathbb{R}^d$ is the input feature vector
For notational convenience, we often absorb the bias into the weight vector by appending a 1 to each feature vector. Define $\tilde{\mathbf{x}} = [1, x_1, x_2, \ldots, x_d]^\top$ and $\boldsymbol{\theta} = [b, w_1, w_2, \ldots, w_d]^\top$. Then:
$$\hat{y} = \boldsymbol{\theta}^\top \tilde{\mathbf{x}}$$
For the full dataset, we can write this compactly in matrix form. Let $\mathbf{X} \in \mathbb{R}^{n \times (d+1)}$ be the design matrix with each row being $\tilde{\mathbf{x}}_i^\top$, and $\mathbf{y} \in \mathbb{R}^n$ be the vector of targets:
$$\hat{\mathbf{y}} = \mathbf{X} \boldsymbol{\theta}$$
6.2.2 Ordinary Least Squares (OLS)
The OLS objective is to minimize the sum of squared residuals:
$$\mathcal{L}(\boldsymbol{\theta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|^2 = (\mathbf{y} - \mathbf{X}\boldsymbol{\theta})^\top (\mathbf{y} - \mathbf{X}\boldsymbol{\theta})$$
The intuition is straightforward: we want the total squared distance between our predictions and the true values to be as small as possible. As we saw in Chapter 5, we find the minimum by taking the gradient and setting it to zero:
$$\nabla_{\boldsymbol{\theta}} \mathcal{L} = -2\mathbf{X}^\top(\mathbf{y} - \mathbf{X}\boldsymbol{\theta}) = \mathbf{0}$$
Solving for $\boldsymbol{\theta}$, we obtain the normal equations:
$$\mathbf{X}^\top \mathbf{X} \boldsymbol{\theta} = \mathbf{X}^\top \mathbf{y}$$
If $\mathbf{X}^\top \mathbf{X}$ is invertible (which requires $n \geq d+1$ and no perfect multicollinearity among features), the closed-form solution is:
$$\boldsymbol{\theta}^* = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{y}$$
This is the famous OLS estimator. As we discussed in Chapter 3, the matrix $(\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top$ is the Moore-Penrose pseudoinverse of $\mathbf{X}$.
Worked Example: Suppose we have three data points with a single feature:
| $x$ | $y$ |
|---|---|
| 1 | 2.1 |
| 2 | 3.9 |
| 3 | 6.2 |
We want to fit $\hat{y} = \theta_0 + \theta_1 x$. The design matrix and target vector are:
$$\mathbf{X} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3 \end{bmatrix}, \quad \mathbf{y} = \begin{bmatrix} 2.1 \\ 3.9 \\ 6.2 \end{bmatrix}$$
Computing step by step:
$$\mathbf{X}^\top \mathbf{X} = \begin{bmatrix} 3 & 6 \\ 6 & 14 \end{bmatrix}, \quad \mathbf{X}^\top \mathbf{y} = \begin{bmatrix} 12.2 \\ 28.5 \end{bmatrix}$$
$$(\mathbf{X}^\top \mathbf{X})^{-1} = \frac{1}{3 \cdot 14 - 6 \cdot 6}\begin{bmatrix} 14 & -6 \\ -6 & 3 \end{bmatrix} = \frac{1}{6}\begin{bmatrix} 14 & -6 \\ -6 & 3 \end{bmatrix}$$
$$\boldsymbol{\theta}^* = \frac{1}{6}\begin{bmatrix} 14 & -6 \\ -6 & 3 \end{bmatrix}\begin{bmatrix} 12.2 \\ 28.5 \end{bmatrix} = \frac{1}{6}\begin{bmatrix} 170.8 - 171.0 \\ -73.2 + 85.5 \end{bmatrix} = \frac{1}{6}\begin{bmatrix} -0.2 \\ 12.3 \end{bmatrix} = \begin{bmatrix} -0.0333 \\ 2.05 \end{bmatrix}$$
So our fitted model is $\hat{y} = -0.033 + 2.05x$. The slope of 2.05 tells us that for each unit increase in $x$, we expect $y$ to increase by approximately 2.05 units.
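We can check this arithmetic numerically; here is a short NumPy sketch that solves the normal equations for the same three points (the names X_demo and y_demo are just for this example):
import numpy as np
X_demo = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # design matrix with bias column
y_demo = np.array([2.1, 3.9, 6.2])
# Solve the normal equations X^T X theta = X^T y
theta = np.linalg.solve(X_demo.T @ X_demo, X_demo.T @ y_demo)
print(theta)  # approximately [-0.0333, 2.05]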
6.2.3 Gradient Descent for Linear Regression
While the closed-form OLS solution is elegant, it requires computing the matrix inverse $(\mathbf{X}^\top \mathbf{X})^{-1}$, which has $O(d^3)$ complexity. When $d$ is large (thousands or millions of features), this becomes computationally prohibitive. As we explored in Chapter 5, gradient descent offers a scalable alternative.
The gradient of the MSE loss with respect to $\boldsymbol{\theta}$ is:
$$\nabla_{\boldsymbol{\theta}} \mathcal{L} = -\frac{2}{n} \mathbf{X}^\top (\mathbf{y} - \mathbf{X}\boldsymbol{\theta})$$
The gradient descent update rule is:
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \alpha \nabla_{\boldsymbol{\theta}} \mathcal{L}(\boldsymbol{\theta}^{(t)})$$
where $\alpha > 0$ is the learning rate. In practice, we often use stochastic gradient descent (SGD) or mini-batch gradient descent, which approximate the full gradient using a random subset of the data at each step. This reduces the per-iteration cost from $O(nd)$ to $O(bd)$, where $b$ is the batch size.
import numpy as np
def gradient_descent_linear_regression(
X: np.ndarray,
y: np.ndarray,
learning_rate: float = 0.01,
n_iterations: int = 1000,
) -> np.ndarray:
"""Fit linear regression using batch gradient descent.
Args:
X: Design matrix of shape (n_samples, n_features).
y: Target vector of shape (n_samples,).
learning_rate: Step size for gradient updates.
n_iterations: Number of gradient descent iterations.
Returns:
Learned parameter vector of shape (n_features,).
"""
n_samples = X.shape[0]
theta = np.zeros(X.shape[1])
for _ in range(n_iterations):
predictions = X @ theta
residuals = y - predictions
gradient = (-2 / n_samples) * (X.T @ residuals)
theta = theta - learning_rate * gradient
return theta
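As a quick usage sketch on the worked example from Section 6.2.2 (with the bias column added by hand), the estimate should approach the closed-form OLS solution given a suitable learning rate and enough iterations:
X_demo = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # bias column plus one feature
y_demo = np.array([2.1, 3.9, 6.2])
theta = gradient_descent_linear_regression(X_demo, y_demo, learning_rate=0.05, n_iterations=5000)
print(theta)  # approaches the OLS solution [-0.0333, 2.05]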
6.2.4 Evaluating Regression Models
Beyond MSE, several metrics help us assess regression model quality:
Root Mean Squared Error (RMSE): Same units as the target, more interpretable than MSE.
$$\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$$
Mean Absolute Error (MAE): More robust to outliers than MSE.
$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}|y_i - \hat{y}_i|$$
Coefficient of Determination ($R^2$): The proportion of variance in the target explained by the model.
$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2}$$
where $\bar{y}$ is the mean of the targets. An $R^2$ of 1.0 means perfect prediction; an $R^2$ of 0.0 means the model is no better than predicting the mean.
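All three metrics are available in scikit-learn; a brief sketch, assuming y_test holds the true targets and y_pred the model's predictions:
import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.3f}, MAE: {mae:.3f}, R^2: {r2:.3f}")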
6.2.5 Regularization: Ridge and Lasso
When the number of features is large relative to the number of training examples, OLS tends to overfit. Regularization adds a penalty term to the loss function that discourages overly complex models.
Ridge Regression ($L_2$ regularization) adds a penalty proportional to the squared magnitude of the weights:
$$\mathcal{L}_{\text{Ridge}}(\boldsymbol{\theta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|^2 + \lambda \|\boldsymbol{\theta}\|_2^2$$
where $\lambda > 0$ is the regularization strength. The closed-form solution becomes:
$$\boldsymbol{\theta}^*_{\text{Ridge}} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1} \mathbf{X}^\top \mathbf{y}$$
The addition of $\lambda \mathbf{I}$ ensures the matrix is always invertible. Ridge regression is also known as Tikhonov regularization.
Lasso Regression ($L_1$ regularization) uses the absolute value of the weights:
$$\mathcal{L}_{\text{Lasso}}(\boldsymbol{\theta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|^2 + \lambda \|\boldsymbol{\theta}\|_1$$
The key difference is that Lasso tends to produce sparse solutions---many weights are driven exactly to zero. This makes Lasso useful for feature selection: the features with non-zero weights are the ones the model considers most important.
Elastic Net combines both penalties:
$$\mathcal{L}_{\text{ElasticNet}}(\boldsymbol{\theta}) = \|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|^2 + \lambda_1 \|\boldsymbol{\theta}\|_1 + \lambda_2 \|\boldsymbol{\theta}\|_2^2$$
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
# Ordinary Least Squares
ols = LinearRegression().fit(X_train, y_train)
# Ridge Regression (L2)
ridge = Ridge(alpha=1.0).fit(X_train, y_train)
# Lasso Regression (L1)
lasso = Lasso(alpha=0.1).fit(X_train, y_train)
# Elastic Net (L1 + L2)
elastic = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_train, y_train)
6.3 Polynomial Regression and Non-Linear Features
Linear regression assumes a linear relationship between features and target. In many real-world problems, this assumption is too restrictive. Polynomial regression extends linear regression by introducing polynomial features, allowing the model to capture non-linear patterns while still using the same linear algebra machinery.
6.3.1 Polynomial Feature Expansion
The idea is beautifully simple: transform the input features into a higher-dimensional space using polynomial terms, then fit a linear model in that expanded space. For a single feature $x$, a polynomial of degree $p$ gives us:
$$\hat{y} = \theta_0 + \theta_1 x + \theta_2 x^2 + \cdots + \theta_p x^p$$
This is still a linear model---linear in the parameters $\boldsymbol{\theta}$, even though it is non-linear in the original feature $x$. This distinction is crucial.
For multiple features, the expansion includes all interaction terms up to degree $p$. For example, with two features $x_1, x_2$ and degree 2:
$$\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_1 x_2 + \theta_5 x_2^2$$
The number of features grows combinatorially: for $d$ original features and degree $p$, the number of polynomial features (including the constant term) is $\binom{d + p}{p}$. For $d = 10$ and $p = 3$, this gives 286 features.
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
# Create a polynomial regression pipeline (degree 3)
poly_model = make_pipeline(
PolynomialFeatures(degree=3, include_bias=False),
LinearRegression()
)
poly_model.fit(X_train, y_train)
6.3.2 The Danger of High-Degree Polynomials
A polynomial of sufficiently high degree can pass through every training point exactly, achieving zero training error. However, such a model will perform terribly on new data because it has memorized the noise in the training set rather than learning the true underlying pattern. This phenomenon is called overfitting, and it is one of the central challenges in machine learning.
Worked Example: Consider fitting polynomials of degree 1, 3, and 15 to 20 data points generated from $y = \sin(x) + \epsilon$:
- Degree 1 (underfitting): The straight line misses the sinusoidal pattern entirely. Training MSE: 0.45, Test MSE: 0.52.
- Degree 3 (good fit): The cubic curve captures the overall shape. Training MSE: 0.08, Test MSE: 0.11.
- Degree 15 (overfitting): The curve wiggles wildly between data points. Training MSE: 0.001, Test MSE: 3.72.
We see that the degree-15 polynomial has the lowest training error but the highest test error. This is a classic example of the bias-variance tradeoff, which we will examine in detail in Section 6.7.
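The experiment behind this example is straightforward to reproduce; the following sketch fits the three polynomial degrees to noisy sine data (the exact MSE values will differ with the random seed and noise level):
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
rng = np.random.default_rng(0)
X_all = rng.uniform(0, 2 * np.pi, size=(40, 1))
y_all = np.sin(X_all).ravel() + rng.normal(scale=0.3, size=40)
X_tr, X_te, y_tr, y_te = X_all[:20], X_all[20:], y_all[:20], y_all[20:]
for degree in (1, 3, 15):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X_tr, y_tr)
    train_mse = mean_squared_error(y_tr, model.predict(X_tr))
    test_mse = mean_squared_error(y_te, model.predict(X_te))
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")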
6.4 Logistic Regression and Classification
Despite its name, logistic regression is a classification algorithm. It is one of the most important models in all of machine learning: simple enough to understand completely, powerful enough to serve as a strong baseline in production, and foundational to understanding neural networks (which we will explore in Chapters 11--14).
6.4.1 From Linear Regression to Classification
The naive approach to binary classification would be to use linear regression and threshold the output at 0.5. The problem with this approach is that linear regression produces unbounded outputs in $(-\infty, +\infty)$, while we need probabilities in $[0, 1]$.
The solution is to pass the linear output through the sigmoid function (also called the logistic function):
$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
where:
- $z = \mathbf{w}^\top \mathbf{x} + b$ is the linear combination (called the logit)
- $\sigma(z)$ maps any real number to the interval $(0, 1)$
The sigmoid function has several elegant properties:
- $\sigma(0) = 0.5$
- $\sigma(z) \to 1$ as $z \to +\infty$ and $\sigma(z) \to 0$ as $z \to -\infty$
- $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ (the derivative has a beautiful closed form)
- $\sigma(-z) = 1 - \sigma(z)$ (symmetry)
6.4.2 The Logistic Regression Model
The logistic regression model defines the probability that an input belongs to class 1:
$$P(y = 1 \mid \mathbf{x}; \boldsymbol{\theta}) = \sigma(\mathbf{w}^\top \mathbf{x} + b) = \frac{1}{1 + e^{-(\mathbf{w}^\top \mathbf{x} + b)}}$$
The prediction rule is:
$$\hat{y} = \begin{cases} 1 & \text{if } P(y = 1 \mid \mathbf{x}) \geq 0.5 \\ 0 & \text{otherwise} \end{cases}$$
which is equivalent to classifying based on the sign of the logit: $\hat{y} = 1$ if $\mathbf{w}^\top \mathbf{x} + b \geq 0$.
6.4.3 Maximum Likelihood Estimation
We train logistic regression by maximum likelihood estimation (MLE), as we discussed in Chapter 4. Given the training data, the likelihood is:
$$L(\boldsymbol{\theta}) = \prod_{i=1}^{n} \hat{p}_i^{y_i} (1 - \hat{p}_i)^{1 - y_i}$$
Taking the negative log-likelihood gives us the binary cross-entropy loss:
$$\mathcal{L}(\boldsymbol{\theta}) = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)\right]$$
Unlike the MSE loss for linear regression, this has no closed-form solution. We must use gradient-based optimization. The gradient is:
$$\nabla_{\boldsymbol{\theta}} \mathcal{L} = \frac{1}{n}\mathbf{X}^\top (\hat{\mathbf{p}} - \mathbf{y})$$
where $\hat{\mathbf{p}}$ is the vector of predicted probabilities. Notice the remarkable similarity to the gradient for linear regression---the only difference is that the residuals are now $\hat{\mathbf{p}} - \mathbf{y}$ instead of $\hat{\mathbf{y}} - \mathbf{y}$.
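Here is a minimal sketch of training logistic regression with this gradient, mirroring the linear regression code in Section 6.2.3 and assuming the design matrix already includes a bias column:
import numpy as np
def sigmoid(z: np.ndarray) -> np.ndarray:
    return 1.0 / (1.0 + np.exp(-z))
def gradient_descent_logistic_regression(
    X: np.ndarray,
    y: np.ndarray,
    learning_rate: float = 0.1,
    n_iterations: int = 5000,
) -> np.ndarray:
    """Fit logistic regression by minimizing binary cross-entropy."""
    n_samples = X.shape[0]
    theta = np.zeros(X.shape[1])
    for _ in range(n_iterations):
        p_hat = sigmoid(X @ theta)
        gradient = (X.T @ (p_hat - y)) / n_samples
        theta = theta - learning_rate * gradient
    return theta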
Worked Example: Suppose we have a simple dataset for classifying whether a student passes an exam based on hours studied:
| Hours ($x$) | Pass ($y$) |
|---|---|
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 1 |
| 5 | 1 |
| 6 | 1 |
After training (via gradient descent), suppose we obtain $w = 1.5$ and $b = -5.0$. Then for a student who studies 3.5 hours:
$$z = 1.5 \times 3.5 + (-5.0) = 0.25$$ $$P(\text{pass} \mid x=3.5) = \sigma(0.25) = \frac{1}{1 + e^{-0.25}} \approx 0.562$$
The model predicts a 56.2% probability of passing, which would be classified as "pass" (since $0.562 > 0.5$).
6.4.4 Decision Boundaries
The decision boundary is the set of points where the model is equally uncertain about both classes---that is, where $P(y = 1 \mid \mathbf{x}) = 0.5$. For logistic regression, this occurs when $\mathbf{w}^\top \mathbf{x} + b = 0$, which defines a hyperplane in the feature space.
In two dimensions, the decision boundary is a straight line:
$$w_1 x_1 + w_2 x_2 + b = 0 \quad \Longrightarrow \quad x_2 = -\frac{w_1}{w_2}x_1 - \frac{b}{w_2}$$
This linear decision boundary is both a strength and a limitation. It works well when the classes are approximately linearly separable, but it fails when the true boundary is non-linear. We will see how other algorithms handle non-linear boundaries in the sections that follow.
6.4.5 Multiclass Classification
Logistic regression extends naturally to $K > 2$ classes through softmax regression (also called multinomial logistic regression). For each class $k$, we define a score:
$$z_k = \mathbf{w}_k^\top \mathbf{x} + b_k$$
The probability of class $k$ is given by the softmax function:
$$P(y = k \mid \mathbf{x}) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}$$
The softmax function ensures that all probabilities are positive and sum to one. The loss becomes the categorical cross-entropy:
$$\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{K} \mathbb{1}[y_i = k] \log P(y_i = k \mid \mathbf{x}_i)$$
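In code, the softmax is usually computed with a max-subtraction for numerical stability; a small NumPy sketch:
import numpy as np
def softmax(z: np.ndarray) -> np.ndarray:
    """Softmax over the last axis, with max-subtraction for numerical stability."""
    z_shifted = z - np.max(z, axis=-1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=-1, keepdims=True)
print(softmax(np.array([2.0, 1.0, 0.1])))  # probabilities that sum to 1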
6.4.6 Evaluating Classification Models
For classification, accuracy alone is often misleading (especially with imbalanced classes). Key metrics include:
Confusion Matrix: A table showing true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN).
Precision: Of all predicted positives, how many are actually positive?
$$\text{Precision} = \frac{TP}{TP + FP}$$
Recall (Sensitivity): Of all actual positives, how many did we correctly identify?
$$\text{Recall} = \frac{TP}{TP + FN}$$
F1 Score: The harmonic mean of precision and recall.
$$F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
ROC-AUC: The area under the Receiver Operating Characteristic curve, which plots the true positive rate against the false positive rate at various thresholds. An AUC of 1.0 indicates perfect classification; 0.5 indicates random guessing.
from sklearn.metrics import (
accuracy_score, precision_score, recall_score,
f1_score, roc_auc_score, confusion_matrix,
classification_report,
)
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]
print(classification_report(y_test, y_pred))
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")
6.5 Support Vector Machines
Support Vector Machines (SVMs) take a fundamentally different approach to classification. Instead of modeling probabilities, SVMs find the hyperplane that maximizes the margin---the distance between the decision boundary and the nearest data points from each class. This geometric perspective leads to a model with strong theoretical guarantees and excellent practical performance.
6.5.1 The Maximum Margin Classifier
Consider a binary classification problem where the classes are linearly separable. There are infinitely many hyperplanes that correctly separate the two classes. The SVM selects the one with the largest margin, which is the distance between the hyperplane and the closest data points (called support vectors).
Formally, the hyperplane is defined by $\mathbf{w}^\top \mathbf{x} + b = 0$. The distance from a point $\mathbf{x}_i$ to this hyperplane is:
$$\frac{|{\mathbf{w}^\top \mathbf{x}_i + b}|}{\|\mathbf{w}\|}$$
The margin is twice the distance from the hyperplane to the nearest point. The SVM optimization problem is:
$$\max_{\mathbf{w}, b} \frac{2}{\|\mathbf{w}\|} \quad \text{subject to} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 \quad \forall i$$
This is equivalent to:
$$\min_{\mathbf{w}, b} \frac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 \quad \forall i$$
where we use $y_i \in \{-1, +1\}$ (note the SVM convention differs from logistic regression's $\{0, 1\}$).
6.5.2 Soft Margin and the C Parameter
In practice, data is rarely perfectly separable. The soft margin SVM allows some misclassifications by introducing slack variables $\xi_i \geq 0$:
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} \xi_i \quad \text{subject to} \quad y_i(\mathbf{w}^\top \mathbf{x}_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0$$
where:
- $\xi_i$ measures how much the $i$-th point violates the margin ($\xi_i = 0$ for correctly classified points outside the margin)
- $C > 0$ controls the tradeoff between maximizing the margin and minimizing misclassifications
A large $C$ penalizes misclassifications heavily (small margin, low training error). A small $C$ allows more violations (large margin, potentially higher training error but better generalization).
6.5.3 The Kernel Trick
The real power of SVMs lies in the kernel trick, which allows them to learn non-linear decision boundaries without explicitly computing the high-dimensional feature transformation.
The key insight is that the SVM optimization problem depends on the data only through dot products $\mathbf{x}_i^\top \mathbf{x}_j$. If we replace these dot products with a kernel function $K(\mathbf{x}_i, \mathbf{x}_j) = \boldsymbol{\phi}(\mathbf{x}_i)^\top \boldsymbol{\phi}(\mathbf{x}_j)$, we implicitly map the data into a higher-dimensional space where it may be linearly separable.
Common kernels include:
| Kernel | Formula | Use Case |
|---|---|---|
| Linear | $K(\mathbf{x}_i, \mathbf{x}_j) = \mathbf{x}_i^\top \mathbf{x}_j$ | Linearly separable data |
| Polynomial | $K(\mathbf{x}_i, \mathbf{x}_j) = (\gamma \mathbf{x}_i^\top \mathbf{x}_j + r)^p$ | Moderate non-linearity |
| RBF (Gaussian) | $K(\mathbf{x}_i, \mathbf{x}_j) = \exp(-\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2)$ | General non-linear problems |
| Sigmoid | $K(\mathbf{x}_i, \mathbf{x}_j) = \tanh(\gamma \mathbf{x}_i^\top \mathbf{x}_j + r)$ | Neural network-like behavior |
The RBF kernel is the most popular choice. The parameter $\gamma$ controls the "reach" of each training example: a large $\gamma$ means each point influences only its immediate neighborhood (risk of overfitting), while a small $\gamma$ means each point has broad influence (risk of underfitting).
from sklearn.svm import SVC
# Linear SVM
svm_linear = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
# RBF SVM
svm_rbf = SVC(kernel="rbf", C=10.0, gamma=0.1).fit(X_train, y_train)
# Polynomial SVM
svm_poly = SVC(kernel="poly", degree=3, C=1.0).fit(X_train, y_train)
6.5.4 The Kernel Trick in Depth
To build deeper intuition for why the kernel trick works, consider a two-dimensional dataset where the positive class forms a circle centered at the origin and the negative class surrounds it. No linear boundary can separate these classes in the original space. However, if we map each point $\mathbf{x} = (x_1, x_2)$ to a three-dimensional space using $\boldsymbol{\phi}(\mathbf{x}) = (x_1^2, x_2^2, \sqrt{2}\, x_1 x_2)$, the circular boundary becomes a linear hyperplane in the new space.
The polynomial kernel $K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^\top \mathbf{x}_j)^2$ computes exactly this dot product in the lifted space without ever constructing the explicit mapping. For the RBF kernel, the implicit feature space is infinite-dimensional, corresponding to all possible polynomial features of all degrees. This is why the RBF kernel is so powerful---it can represent arbitrarily complex decision boundaries.
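It is easy to verify numerically that the degree-2 polynomial kernel matches the explicit mapping $\boldsymbol{\phi}$ described above; a short sketch with two arbitrary points:
import numpy as np
def phi(x: np.ndarray) -> np.ndarray:
    """Explicit degree-2 feature map for 2D inputs."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])
a = np.array([1.0, 2.0])
b = np.array([3.0, -1.0])
kernel_value = (a @ b) ** 2          # (x_i^T x_j)^2 computed in the original space
explicit_value = phi(a) @ phi(b)     # dot product in the lifted space
print(kernel_value, explicit_value)  # both equal 1.0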
Mercer's Theorem guarantees that any symmetric, positive semi-definite kernel function corresponds to a dot product in some (possibly infinite-dimensional) Hilbert space. This provides the mathematical foundation for the kernel trick: as long as $K$ is a valid kernel, the SVM optimization in the lifted space is well-defined, even if we never explicitly compute the feature map.
Worked Example: Consider the XOR dataset with four points: $(0,0) \to -1$, $(1,1) \to -1$, $(0,1) \to +1$, $(1,0) \to +1$. No linear boundary can separate these classes. Using a polynomial kernel of degree 2 with $\gamma = 1$ and $r = 1$:
$$K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i^\top \mathbf{x}_j + 1)^2$$
For the points $(0,1)$ and $(1,0)$: $K = (0 \cdot 1 + 1 \cdot 0 + 1)^2 = 1$. For $(0,0)$ and $(1,1)$: $K = (0 \cdot 1 + 0 \cdot 1 + 1)^2 = 1$. For $(0,1)$ and $(1,1)$: $K = (0 \cdot 1 + 1 \cdot 1 + 1)^2 = 4$. The kernel matrix reveals structure that allows a linear separator in the lifted space, solving the XOR problem.
6.5.5 Visualizing Decision Boundaries
Understanding how different algorithms partition the feature space is essential for developing intuition about their strengths and weaknesses. The decision boundary is the surface in feature space where the model's prediction changes from one class to another.
For a two-dimensional feature space, we can visualize decision boundaries by evaluating the model on a dense grid of points and coloring each point by its predicted class. The following code illustrates this approach:
import numpy as np
from sklearn.svm import SVC
from sklearn.datasets import make_moons
def compute_decision_boundary(
model,
X: np.ndarray,
resolution: int = 200,
padding: float = 0.5,
) -> tuple[np.ndarray, np.ndarray, np.ndarray]:
"""Compute decision boundary on a 2D grid.
Args:
model: A fitted scikit-learn classifier.
X: Feature matrix of shape (n_samples, 2).
resolution: Number of grid points per axis.
padding: Extra space around the data range.
Returns:
Tuple of (xx, yy, Z) where xx and yy define the grid
and Z contains the predicted class at each grid point.
"""
x_min, x_max = X[:, 0].min() - padding, X[:, 0].max() + padding
y_min, y_max = X[:, 1].min() - padding, X[:, 1].max() + padding
xx, yy = np.meshgrid(
np.linspace(x_min, x_max, resolution),
np.linspace(y_min, y_max, resolution),
)
grid = np.c_[xx.ravel(), yy.ravel()]
Z = model.predict(grid).reshape(xx.shape)
return xx, yy, Z
# Compare linear vs RBF decision boundaries on the moons dataset
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
svm_linear = SVC(kernel="linear", C=1.0).fit(X, y)
svm_rbf = SVC(kernel="rbf", C=10.0, gamma=0.5).fit(X, y)
# The linear SVM produces a straight line that cannot separate
# the interleaving crescents, while the RBF SVM wraps a smooth
# non-linear boundary around each class.
Different algorithms produce characteristically different decision boundaries:
- Logistic regression: Always produces linear (straight-line or hyperplane) boundaries.
- SVM with RBF kernel: Produces smooth, curving boundaries whose complexity depends on $C$ and $\gamma$.
- Decision trees: Produce axis-aligned, staircase-like boundaries (because each split tests one feature against a threshold).
- Random forests: Produce smoother, axis-aligned boundaries due to averaging over many trees.
- k-Nearest Neighbors: Produce highly irregular boundaries that follow the local density of training points.
Visualizing decision boundaries on validation data can reveal whether a model is underfitting (too-simple boundary that misses patterns) or overfitting (too-complex boundary that follows noise).
6.5.6 SVMs for Regression (SVR)
SVMs can also be used for regression via Support Vector Regression (SVR). Instead of minimizing the squared error, SVR defines an $\epsilon$-insensitive tube around the predictions and only penalizes points that fall outside this tube. This makes SVR robust to small amounts of noise.
The SVR optimization problem is:
$$\min_{\mathbf{w}, b, \boldsymbol{\xi}, \boldsymbol{\xi}^*} \frac{1}{2}\|\mathbf{w}\|^2 + C \sum_{i=1}^{n} (\xi_i + \xi_i^*)$$
subject to:
$$y_i - \mathbf{w}^\top \mathbf{x}_i - b \leq \epsilon + \xi_i, \quad \mathbf{w}^\top \mathbf{x}_i + b - y_i \leq \epsilon + \xi_i^*, \quad \xi_i, \xi_i^* \geq 0$$
where $\epsilon$ defines the width of the insensitive tube, and $\xi_i, \xi_i^*$ are slack variables for points outside the tube. The kernel trick applies to SVR just as it does to classification SVMs, enabling non-linear regression.
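In scikit-learn, SVR exposes the tube width and penalty directly; a brief sketch, assuming a regression train/test split as in Section 6.1.4:
from sklearn.svm import SVR
# epsilon controls the width of the insensitive tube; C penalizes points outside it
svr_rbf = SVR(kernel="rbf", C=10.0, epsilon=0.1, gamma="scale")
svr_rbf.fit(X_train, y_train)
y_pred = svr_rbf.predict(X_test)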
6.6 Tree-Based Methods
Decision trees and their ensemble extensions represent a fundamentally different approach to supervised learning. Instead of finding a global function that maps inputs to outputs, tree-based methods recursively partition the feature space into regions and make simple predictions within each region. They are among the most practically important algorithms in AI engineering.
6.6.1 Decision Trees
A decision tree makes predictions by asking a sequence of yes/no questions about the features. Starting at the root node, each internal node tests a feature against a threshold, each branch corresponds to the outcome of the test, and each leaf node contains a prediction.
Building a Decision Tree: The tree is built top-down by greedily selecting the best feature and threshold to split on at each node. "Best" is defined by an impurity measure that quantifies how mixed the classes are in a node.
For classification, common impurity measures are:
Gini Impurity: $$G = 1 - \sum_{k=1}^{K} p_k^2$$
where $p_k$ is the proportion of class $k$ in the node. For a pure node (all one class), $G = 0$. For a balanced binary split, $G = 0.5$.
Entropy (Information Gain): $$H = -\sum_{k=1}^{K} p_k \log_2 p_k$$
A split is evaluated by computing the weighted average impurity of the child nodes and selecting the split that reduces impurity the most:
$$\text{Information Gain} = H(\text{parent}) - \sum_{j} \frac{n_j}{n} H(\text{child}_j)$$
For regression trees, the splitting criterion is typically the variance reduction: we split on the feature and threshold that minimizes the weighted sum of squared errors in the resulting child nodes.
Worked Example: Suppose we have the following dataset for predicting whether to play tennis:
| Outlook | Temperature | Play? |
|---|---|---|
| Sunny | Hot | No |
| Sunny | Mild | No |
| Overcast | Hot | Yes |
| Rainy | Mild | Yes |
| Rainy | Cool | Yes |
| Overcast | Cool | Yes |
The root set has 4 Yes and 2 No. The Gini impurity is:
$$G = 1 - \left(\frac{4}{6}\right)^2 - \left(\frac{2}{6}\right)^2 = 1 - 0.444 - 0.111 = 0.444$$
If we split on Outlook:
- Sunny: 0 Yes, 2 No. $G_{\text{Sunny}} = 0$
- Overcast: 2 Yes, 0 No. $G_{\text{Overcast}} = 0$
- Rainy: 2 Yes, 0 No. $G_{\text{Rainy}} = 0$
Weighted Gini after split: $\frac{2}{6}(0) + \frac{2}{6}(0) + \frac{2}{6}(0) = 0$
This is a perfect split! Outlook alone can separate the classes perfectly in this small example.
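The impurity arithmetic above is easy to verify in code; a short sketch using the class counts from the tennis table (the helper gini is illustrative):
import numpy as np
def gini(counts: list[int]) -> float:
    """Gini impurity from class counts in a node."""
    p = np.array(counts) / sum(counts)
    return float(1.0 - np.sum(p ** 2))
parent_gini = gini([4, 2])                               # 0.444
children = [gini([0, 2]), gini([2, 0]), gini([2, 0])]    # all 0.0
weighted_gini = sum((2 / 6) * g for g in children)       # 0.0
print(parent_gini, weighted_gini)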
6.6.2 Controlling Tree Complexity
Without constraints, a decision tree will keep splitting until every leaf is pure, perfectly memorizing the training data. To prevent overfitting, we use several strategies:
- Maximum depth (max_depth): Limit how deep the tree can grow.
- Minimum samples per leaf (min_samples_leaf): Require a minimum number of samples in each leaf.
- Minimum samples to split (min_samples_split): Require a minimum number of samples to consider a split.
- Maximum number of features (max_features): Consider only a random subset of features at each split (important for random forests).
- Pruning: Grow the full tree and then remove branches that do not improve validation performance.
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
# Classification tree with constraints
clf_tree = DecisionTreeClassifier(
max_depth=5,
min_samples_leaf=10,
min_samples_split=20,
random_state=42,
)
clf_tree.fit(X_train, y_train)
# Regression tree
reg_tree = DecisionTreeRegressor(
max_depth=5,
min_samples_leaf=10,
random_state=42,
)
reg_tree.fit(X_train, y_train)
6.6.3 Random Forests
A single decision tree is prone to high variance---small changes in the training data can produce dramatically different trees. Random forests address this by building an ensemble of decision trees and averaging their predictions. This is an application of the principle of bagging (bootstrap aggregating):
- Create $B$ bootstrap samples (random samples with replacement) from the training data.
- Fit a decision tree to each bootstrap sample, with an additional twist: at each split, only consider a random subset of $m$ features (typically $m = \sqrt{d}$ for classification or $m = d/3$ for regression).
- For classification, take the majority vote across all trees. For regression, take the average prediction.
The combination of bootstrap sampling and random feature selection ensures that the individual trees are decorrelated, which means their errors tend to cancel out when averaged. The resulting ensemble has much lower variance than any single tree.
$$\hat{y}_{\text{RF}} = \frac{1}{B}\sum_{b=1}^{B} f_b(\mathbf{x})$$
Why it works: If we have $B$ trees, each with variance $\sigma^2$ and pairwise correlation $\rho$, the variance of the average is:
$$\text{Var}\left(\hat{y}_{\text{RF}}\right) = \rho \sigma^2 + \frac{1 - \rho}{B}\sigma^2$$
By making $\rho$ small (through randomization) and $B$ large, we can dramatically reduce variance without increasing bias.
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
# Classification random forest
rf_clf = RandomForestClassifier(
n_estimators=100,
max_depth=10,
min_samples_leaf=5,
max_features="sqrt",
random_state=42,
n_jobs=-1, # Use all CPU cores
)
rf_clf.fit(X_train, y_train)
# Feature importance
importances = rf_clf.feature_importances_
6.6.4 Gradient Boosting
While random forests reduce variance by averaging independent trees, gradient boosting reduces bias by building trees sequentially, where each new tree corrects the errors of the ensemble so far. This is an application of boosting, one of the most powerful ideas in machine learning.
The algorithm works as follows:
- Initialize the model with a constant prediction (e.g., the mean of the targets).
- For $m = 1, 2, \ldots, M$:
  a. Compute the pseudo-residuals: the negative gradient of the loss function with respect to the current predictions.
  b. Fit a decision tree $h_m(\mathbf{x})$ to the pseudo-residuals.
  c. Update the model: $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta \cdot h_m(\mathbf{x})$
where $\eta$ is the learning rate (shrinkage parameter). The final model is:
$$F_M(\mathbf{x}) = F_0(\mathbf{x}) + \eta \sum_{m=1}^{M} h_m(\mathbf{x})$$
For squared error loss, the pseudo-residuals are simply the actual residuals $y_i - F_{m-1}(\mathbf{x}_i)$, giving this algorithm its intuitive appeal: each tree learns to predict the errors of the current model.
Key hyperparameters:
- n_estimators ($M$): Number of boosting stages (trees).
- learning_rate ($\eta$): Shrinkage applied to each tree. Smaller values need more trees but often generalize better.
- max_depth: Depth of individual trees (typically 3--8 for boosting, much shallower than random forest trees).
- subsample: Fraction of training data used for each tree (stochastic gradient boosting).
6.6.5 Gradient Boosting Internals: A Deeper Look
To build solid intuition for gradient boosting, let us walk through the first few iterations step by step on a small regression dataset. Suppose we have five data points and are using squared error loss.
Initialization: $F_0(\mathbf{x}) = \bar{y}$ (the mean of all targets). If $\mathbf{y} = [2, 4, 6, 8, 10]$, then $F_0 = 6.0$ for all inputs.
Iteration 1: Compute the residuals (pseudo-residuals for squared error loss):
$$r_i^{(1)} = y_i - F_0(\mathbf{x}_i) = [-4, -2, 0, 2, 4]$$
Fit a shallow decision tree $h_1(\mathbf{x})$ to these residuals. Suppose $h_1$ produces predictions $[-3.5, -1.5, 0, 1.5, 3.5]$. With learning rate $\eta = 0.1$:
$$F_1(\mathbf{x}_i) = F_0(\mathbf{x}_i) + 0.1 \cdot h_1(\mathbf{x}_i) = [5.65, 5.85, 6.0, 6.15, 6.35]$$
Iteration 2: Compute new residuals:
$$r_i^{(2)} = y_i - F_1(\mathbf{x}_i) = [-3.65, -1.85, 0, 1.85, 3.65]$$
Fit another tree $h_2$ to these residuals and update. Each iteration shaves off a fraction of the remaining error.
The key insight is that gradient boosting performs functional gradient descent in the space of functions. Just as ordinary gradient descent takes small steps in parameter space, gradient boosting takes small steps in function space by adding weak learners. The learning rate $\eta$ controls the step size, and smaller values require more trees but typically generalize better because each step is more conservative.
For loss functions other than squared error, the pseudo-residuals are the negative gradient of the loss with respect to the current prediction:
$$r_i^{(m)} = -\frac{\partial \ell(y_i, F_{m-1}(\mathbf{x}_i))}{\partial F_{m-1}(\mathbf{x}_i)}$$
For binary cross-entropy loss, the pseudo-residuals become $r_i^{(m)} = y_i - \sigma(F_{m-1}(\mathbf{x}_i))$, where $\sigma$ is the sigmoid function---the difference between the true label and the predicted probability.
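The whole procedure fits in a few lines of code. The following sketch implements gradient boosting for squared error loss from scratch, using shallow scikit-learn regression trees as weak learners (an illustrative implementation, not an optimized one):
import numpy as np
from sklearn.tree import DecisionTreeRegressor
def fit_gradient_boosting(X, y, n_estimators=100, learning_rate=0.1, max_depth=3):
    """Gradient boosting for squared error loss, built from scratch."""
    f0 = float(np.mean(y))                 # constant initial prediction F_0
    current = np.full(len(y), f0)
    trees = []
    for _ in range(n_estimators):
        residuals = y - current            # pseudo-residuals for squared error
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        current += learning_rate * tree.predict(X)   # small step in function space
        trees.append(tree)
    return f0, trees
def predict_gradient_boosting(f0, trees, X, learning_rate=0.1):
    """Sum the initial prediction and the shrunken contribution of each tree."""
    return f0 + learning_rate * sum(tree.predict(X) for tree in trees)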
6.6.6 XGBoost: Optimized Gradient Boosting
XGBoost (eXtreme Gradient Boosting) is an optimized implementation of gradient boosting that has dominated machine learning competitions for years. It incorporates several improvements:
- Regularized objective: Adds $L_1$ and $L_2$ penalties on the tree leaf weights.
- Second-order approximation: Uses both the gradient and the Hessian (second derivative) of the loss for more accurate split finding.
- Efficient split finding: Uses a histogram-based algorithm for approximate split finding.
- Built-in handling of missing values: Learns the optimal direction for missing values at each split.
- Parallelization: Parallelizes tree construction for faster training.
The XGBoost objective for the $m$-th tree is:
$$\mathcal{L}^{(m)} = \sum_{i=1}^{n}\left[g_i h_m(\mathbf{x}_i) + \frac{1}{2} h_i h_m(\mathbf{x}_i)^2\right] + \Omega(h_m)$$
where $g_i$ and $h_i$ are the first and second derivatives (gradient and Hessian) of the loss with respect to the current prediction (here $h_i$, indexed by a data point, denotes the Hessian, while $h_m$ denotes the $m$-th tree), and $\Omega$ is the regularization term:
$$\Omega(h_m) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2$$
where $T$ is the number of leaves and $w_j$ is the weight of the $j$-th leaf.
from sklearn.ensemble import GradientBoostingClassifier
# scikit-learn gradient boosting
gb_clf = GradientBoostingClassifier(
n_estimators=200,
learning_rate=0.1,
max_depth=5,
subsample=0.8,
random_state=42,
)
gb_clf.fit(X_train, y_train)
For production workloads, you would typically use the dedicated XGBoost library:
# Note: requires `pip install xgboost`
import xgboost as xgb
xgb_clf = xgb.XGBClassifier(
n_estimators=200,
learning_rate=0.1,
max_depth=5,
subsample=0.8,
colsample_bytree=0.8,
reg_alpha=0.1, # L1 regularization
reg_lambda=1.0, # L2 regularization
random_state=42,
use_label_encoder=False,
eval_metric="logloss",
)
xgb_clf.fit(X_train, y_train)
6.6.7 Feature Importance Analysis
Understanding which features drive a model's predictions is critical for interpretability, debugging, and gaining domain insights. Tree-based methods offer several approaches to measuring feature importance.
Impurity-based importance (also called Gini importance or mean decrease in impurity) measures how much each feature reduces impurity across all splits in all trees:
$$\text{Importance}(j) = \sum_{\text{tree } t} \sum_{\text{node } v \text{ splitting on } j} \frac{n_v}{n} \Delta I(v)$$
where $n_v$ is the number of samples reaching node $v$, $n$ is the total number of samples, and $\Delta I(v)$ is the impurity decrease at that node. This is fast to compute because it is a byproduct of training. However, it has a known bias: features with more unique values tend to appear more important, even if they are uninformative (such as an ID column with unique values for every sample).
Permutation importance provides an unbiased alternative. For each feature $j$, randomly shuffle its values across all samples and measure how much the model's performance degrades:
$$\text{PermImportance}(j) = \text{Score}_{\text{original}} - \text{Score}_{\text{shuffled}}$$
Features that are truly important will cause a large drop in performance when shuffled. This method works with any model, not just trees, and evaluates importance on held-out data, making it more trustworthy.
from sklearn.inspection import permutation_importance
# Compute permutation importance on the test set
perm_result = permutation_importance(
rf_clf, X_test, y_test,
n_repeats=10,
random_state=42,
n_jobs=-1,
)
# Display sorted feature importances
import pandas as pd
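# feature_names is assumed to hold the column names of X_test
# (e.g., data.feature_names for a scikit-learn dataset or X.columns for a DataFrame)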
importances_df = pd.DataFrame({
"feature": feature_names,
"mean_importance": perm_result.importances_mean,
"std_importance": perm_result.importances_std,
}).sort_values("mean_importance", ascending=False)
print(importances_df.head(10))
SHAP (SHapley Additive exPlanations) values provide a game-theoretic approach to feature importance. For each prediction, SHAP assigns a contribution value to each feature, based on the concept of Shapley values from cooperative game theory. SHAP values sum to the difference between the model's prediction and the average prediction, giving a complete and consistent attribution. We will explore SHAP and other interpretability methods in greater depth in Chapter 20.
Best practice: Always compare multiple importance methods. If a feature consistently ranks high across impurity-based importance, permutation importance, and SHAP, you can be confident it is genuinely important.
6.6.8 Handling Imbalanced Classes
In many real-world classification problems, one class is far more common than the other. Consider fraud detection, where only 0.1% of transactions are fraudulent, or medical diagnosis, where a rare disease affects 1 in 10,000 patients. Standard algorithms trained on imbalanced data tend to predict the majority class almost exclusively, achieving high accuracy but failing to detect the rare events that matter most.
Several strategies address class imbalance:
Class weights: Most scikit-learn classifiers accept a class_weight parameter that adjusts the loss function to penalize misclassification of the minority class more heavily. Setting class_weight="balanced" automatically weights classes inversely proportional to their frequency:
$$w_k = \frac{n}{K \cdot n_k}$$
where $n$ is the total number of samples, $K$ is the number of classes, and $n_k$ is the number of samples in class $k$.
from sklearn.ensemble import RandomForestClassifier
# Balanced class weights for imbalanced data
rf_balanced = RandomForestClassifier(
n_estimators=200,
class_weight="balanced",
random_state=42,
)
rf_balanced.fit(X_train, y_train)
Oversampling the minority class: Techniques like SMOTE (Synthetic Minority Over-sampling Technique) create synthetic examples of the minority class by interpolating between existing minority samples. SMOTE selects a minority sample, finds its $k$ nearest minority neighbors, and creates a new sample at a random point along the line connecting the sample to one of its neighbors.
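A minimal sketch using the imbalanced-learn library (a separate package; the parameters shown are illustrative):
# Note: requires `pip install imbalanced-learn`
from imblearn.over_sampling import SMOTE
smote = SMOTE(k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
# Fit any classifier on the resampled training data;
# always evaluate on the original (untouched) test set.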
Undersampling the majority class: Randomly remove examples from the majority class to balance the dataset. This is simple but discards potentially useful data.
Threshold tuning: Instead of using the default 0.5 threshold for classification, optimize the threshold using the precision-recall curve on the validation set. For example, if recall is paramount, you might lower the threshold to catch more positive cases at the cost of more false positives.
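A sketch of threshold selection using the precision-recall curve, assuming a fitted probabilistic classifier model and the validation split from Section 6.1.4 (here we pick the threshold that maximizes F1):
import numpy as np
from sklearn.metrics import precision_recall_curve
y_val_prob = model.predict_proba(X_val)[:, 1]
precisions, recalls, thresholds = precision_recall_curve(y_val, y_val_prob)
f1_scores = 2 * precisions * recalls / (precisions + recalls + 1e-12)
# The last precision/recall pair has no corresponding threshold, so drop it
best_threshold = thresholds[np.argmax(f1_scores[:-1])]
y_test_pred = (model.predict_proba(X_test)[:, 1] >= best_threshold).astype(int)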
Evaluation with appropriate metrics: As we discussed in Section 6.4.6 and will explore further in Chapter 8, accuracy is misleading for imbalanced datasets. Use precision, recall, F1 score, or AUC-PR instead.
6.6.9 Comparing Ensemble Methods
| Property | Random Forest | Gradient Boosting |
|---|---|---|
| Training | Parallel (fast) | Sequential (slower) |
| Reduces | Variance | Bias |
| Tree depth | Deep trees | Shallow trees |
| Overfitting risk | Lower | Higher |
| Tuning difficulty | Easy | Moderate |
| Missing values | Needs imputation | XGBoost handles natively |
| Typical performance | Good | Often best on tabular data |
In practice, gradient boosting (especially XGBoost, LightGBM, or CatBoost) tends to produce the best results on tabular/structured data, while random forests are easier to tune and more robust to hyperparameter choices. We will explore advanced ensemble strategies in Chapter 9.
6.7 The Bias-Variance Tradeoff
The bias-variance tradeoff is one of the most fundamental concepts in machine learning. It explains why models can fail in two fundamentally different ways and provides a framework for choosing model complexity.
6.7.1 Decomposition of Expected Error
For any model $\hat{f}(\mathbf{x})$ trained on a random sample from the true distribution, the expected squared error at a point $\mathbf{x}$ can be decomposed as:
$$\mathbb{E}\left[(y - \hat{f}(\mathbf{x}))^2\right] = \underbrace{\left(\mathbb{E}[\hat{f}(\mathbf{x})] - f(\mathbf{x})\right)^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}\left[\left(\hat{f}(\mathbf{x}) - \mathbb{E}[\hat{f}(\mathbf{x})]\right)^2\right]}_{\text{Variance}} + \underbrace{\sigma^2_\epsilon}_{\text{Irreducible error}}$$
where:
- Bias measures how far the average prediction is from the true value. High bias means the model is too simple to capture the true pattern (underfitting).
- Variance measures how much the predictions fluctuate across different training sets. High variance means the model is too sensitive to the specific training data (overfitting).
- Irreducible error ($\sigma^2_\epsilon$) is the inherent noise in the data that no model can eliminate.
The total error is the sum of all three terms. The key insight is that reducing bias typically increases variance, and vice versa. This is the tradeoff.
6.7.2 Mathematical Derivation of the Decomposition
The derivation of the bias-variance decomposition is instructive and connects several ideas from Chapter 4. Assume the true data-generating process is $y = f(\mathbf{x}) + \epsilon$ where $\epsilon$ is zero-mean noise with variance $\sigma^2_\epsilon$, independent of $\mathbf{x}$. The expectation $\mathbb{E}[\cdot]$ is taken over both the randomness in the training set $\mathcal{D}$ (which determines $\hat{f}$) and the noise $\epsilon$.
We begin by expanding the squared error at a fixed test point $\mathbf{x}$:
$$\mathbb{E}\left[(y - \hat{f}(\mathbf{x}))^2\right] = \mathbb{E}\left[(f(\mathbf{x}) + \epsilon - \hat{f}(\mathbf{x}))^2\right]$$
Let $\bar{f}(\mathbf{x}) = \mathbb{E}[\hat{f}(\mathbf{x})]$ denote the average prediction across all possible training sets. We add and subtract $\bar{f}(\mathbf{x})$:
$$= \mathbb{E}\left[(\epsilon + f(\mathbf{x}) - \bar{f}(\mathbf{x}) + \bar{f}(\mathbf{x}) - \hat{f}(\mathbf{x}))^2\right]$$
Expanding the square and using the fact that $\mathbb{E}[\epsilon] = 0$ and that $\epsilon$ is independent of $\hat{f}$:
$$= \mathbb{E}[\epsilon^2] + (f(\mathbf{x}) - \bar{f}(\mathbf{x}))^2 + \mathbb{E}[(\hat{f}(\mathbf{x}) - \bar{f}(\mathbf{x}))^2]$$
The cross terms vanish because: (1) $\mathbb{E}[\epsilon \cdot (\cdot)] = 0$ since $\epsilon$ is independent and zero-mean, and (2) $\mathbb{E}[\hat{f}(\mathbf{x}) - \bar{f}(\mathbf{x})] = 0$ by definition of $\bar{f}$.
We see that the three terms are exactly:
$$= \underbrace{\sigma^2_\epsilon}_{\text{Irreducible error}} + \underbrace{(f(\mathbf{x}) - \bar{f}(\mathbf{x}))^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[(\hat{f}(\mathbf{x}) - \bar{f}(\mathbf{x}))^2]}_{\text{Variance}}$$
This derivation makes the tradeoff crystal clear: bias is the squared difference between the true function and the average model prediction, while variance is the expected squared deviation of the model from its own average. No model can reduce the irreducible error.
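The decomposition can also be estimated empirically: repeatedly draw training sets, fit a model on each, and compare the spread of predictions at a fixed test point to their average. A sketch follows (the simulation settings are illustrative and the exact numbers will vary):
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
rng = np.random.default_rng(0)
x_test = np.array([[1.5]])  # a single fixed test point
def estimate_bias_variance(degree, n_trials=200, n_train=30, noise=0.3):
    """Estimate bias^2 and variance of a polynomial model at x_test."""
    preds = []
    for _ in range(n_trials):
        X = rng.uniform(0, 2 * np.pi, size=(n_train, 1))
        y = np.sin(X).ravel() + rng.normal(scale=noise, size=n_train)
        model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
        preds.append(model.fit(X, y).predict(x_test)[0])
    preds = np.array(preds)
    bias_sq = (preds.mean() - np.sin(x_test[0, 0])) ** 2
    variance = preds.var()
    return bias_sq, variance
for degree in (1, 3, 12):
    print(degree, estimate_bias_variance(degree))  # bias^2 falls, variance rises with degree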
6.7.3 The Tradeoff in Practice
Let us revisit the polynomial regression example from Section 6.3:
- Low-degree polynomial (high bias, low variance): The model is too simple. It consistently misses the true pattern (high bias), but its predictions are stable across different training sets (low variance). This is underfitting.
- High-degree polynomial (low bias, high variance): The model is complex enough to capture any pattern (low bias), but it is highly sensitive to the specific training data (high variance). This is overfitting.
- Optimal degree (balanced): The model is complex enough to capture the true pattern but not so complex that it memorizes noise.
Different algorithms have different positions on the bias-variance spectrum:
| Model | Bias | Variance | Typical Fix |
|---|---|---|---|
| Linear regression | High | Low | Add features, polynomial terms |
| Decision tree (deep) | Low | High | Pruning, max_depth |
| Random forest | Low | Low-Medium | More trees (diminishing returns) |
| Gradient boosting | Low | Medium | Early stopping, lower learning rate |
| k-Nearest Neighbors (small $k$) | Low | High | Increase $k$ |
6.7.4 Strategies to Manage the Tradeoff
- Regularization: Add a penalty to the loss function that discourages complex models (Ridge, Lasso, weight decay).
- Cross-validation: Use $k$-fold cross-validation to estimate the generalization error for different model complexities and choose the one that minimizes validation error.
- Ensemble methods: Bagging (random forests) reduces variance; boosting reduces bias.
- Early stopping: For iterative algorithms (gradient boosting, neural networks), stop training when validation error starts to increase.
- More data: With more training data, the variance of the model decreases, allowing us to use more complex models without overfitting.
Worked Example: Consider fitting polynomial regression models of degree 1 through 15 on a dataset of $n = 50$ points with true relationship $y = 3x^2 - 2x + 1 + \epsilon$ where $\epsilon \sim \mathcal{N}(0, 1)$.
Using 5-fold cross-validation:
| Degree | Train MSE | Validation MSE | Interpretation |
|---|---|---|---|
| 1 | 15.2 | 16.1 | Underfitting (high bias) |
| 2 | 1.05 | 1.12 | Optimal |
| 3 | 1.03 | 1.15 | Slightly overcomplex |
| 5 | 0.98 | 1.35 | Overfitting begins |
| 10 | 0.42 | 4.87 | Severe overfitting |
| 15 | 0.08 | 28.3 | Extreme overfitting |
We see that degree 2 achieves the best validation MSE, which makes sense since the true function is a degree-2 polynomial. As the degree increases beyond 2, training error continues to decrease but validation error explodes---the classic signature of overfitting.
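Selecting the degree by cross-validation takes only a few lines; a sketch, assuming the training split from Section 6.1.4 (exact MSE values depend on the data):
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
for degree in (1, 2, 3, 5, 10, 15):
    poly_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(
        poly_model, X_train, y_train, cv=5, scoring="neg_mean_squared_error"
    )
    print(f"degree {degree:2d}: validation MSE = {-scores.mean():.3f}")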
6.7.5 Learning Curves
A learning curve plots model performance as a function of training set size. It provides direct insight into whether a model is suffering from high bias or high variance:
- High bias: Both training and validation errors are high, and they converge to a similar (high) value. Adding more data will not help---you need a more complex model.
- High variance: Training error is low but validation error is high, with a large gap between them. Adding more data will help close the gap.
from sklearn.model_selection import learning_curve
import numpy as np
train_sizes, train_scores, val_scores = learning_curve(
estimator=model,
X=X_train,
y=y_train,
train_sizes=np.linspace(0.1, 1.0, 10),
cv=5,
scoring="neg_mean_squared_error",
n_jobs=-1,
)
6.8 Putting It All Together: Model Selection
With so many algorithms to choose from, how do you decide which one to use for a given problem? Here is a practical decision framework for AI engineers.
6.8.1 Start Simple
Always start with a simple baseline:
- Regression: Linear regression (or Ridge/Lasso for regularization).
- Classification: Logistic regression.
These models are fast to train, easy to interpret, and often surprisingly competitive. They also serve as a benchmark for more complex models.
6.8.2 When to Use What
Linear models (linear/logistic regression, Ridge, Lasso):
- Many features relative to samples
- Need for interpretability
- When the relationship is approximately linear
- As a baseline for any problem
Support Vector Machines:
- Medium-sized datasets (up to ~100K samples)
- When non-linear boundaries are needed (use RBF kernel)
- When the number of features is large relative to the number of samples (linear SVM)
Decision Trees:
- When interpretability is paramount
- When features are a mix of numerical and categorical
- As building blocks for ensembles
Random Forests:
- General-purpose algorithm for tabular data
- When you want good performance with minimal tuning
- When feature importance rankings are needed
Gradient Boosting (XGBoost/LightGBM/CatBoost):
- When maximum predictive performance on tabular data is needed
- When you have time for hyperparameter tuning
- Competition-winning algorithm for structured data
6.8.3 A Systematic Comparison
The following workflow works well in practice:
- Split data into train/validation/test.
- Train a logistic regression (or linear regression) baseline.
- Train a random forest with default hyperparameters.
- Train a gradient boosting model with light tuning.
- Compare validation performance using appropriate metrics.
- Tune the most promising model's hyperparameters using cross-validation.
- Evaluate the final model on the test set once.
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Candidate models: a linear baseline plus two tree ensembles
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=100, random_state=42
    ),
}

# Compare models with 5-fold cross-validation on the training data
for name, model in models.items():
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring="f1")
    print(f"{name}: F1 = {scores.mean():.4f} (+/- {scores.std():.4f})")
6.8.4 Hyperparameter Tuning
Once you have selected a model, tune its hyperparameters using grid search or randomized search with cross-validation:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint, uniform

# Sampling distributions for each hyperparameter; scipy's uniform(loc, scale)
# draws from the interval [loc, loc + scale]
param_distributions = {
    "n_estimators": randint(50, 500),
    "max_depth": randint(3, 15),
    "min_samples_leaf": randint(1, 20),
    "learning_rate": uniform(0.01, 0.29),   # 0.01 to 0.30
    "subsample": uniform(0.5, 0.5),         # 0.5 to 1.0
}

# Try 50 random hyperparameter combinations, each scored with 5-fold CV
search = RandomizedSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=50,
    cv=5,
    scoring="f1",
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(f"Best parameters: {search.best_params_}")
print(f"Best F1 score: {search.best_score_:.4f}")
6.8.5 Practical Comparison on a Real Dataset
To solidify these ideas, let us compare the algorithms we have studied on a realistic classification task. The following example uses the breast cancer dataset (569 samples, 30 features, binary classification) to illustrate how different models perform and how to interpret the results.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
)

# Load dataset
data = load_breast_cancer()
X, y = data.data, data.target

# Define models with pipelines (scaling for models that need it)
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=5000, C=1.0)
    ),
    "SVM (RBF)": make_pipeline(
        StandardScaler(), SVC(kernel="rbf", C=10.0, gamma="scale")
    ),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(
        n_estimators=200, max_depth=10, random_state=42
    ),
    "Gradient Boosting": GradientBoostingClassifier(
        n_estimators=200, learning_rate=0.1, max_depth=3, random_state=42
    ),
}

# Evaluate each model with stratified 10-fold cross-validation
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name:25s}: {scores.mean():.4f} +/- {scores.std():.4f}")
Typical results on this dataset:
| Model | Accuracy (mean +/- std) |
|---|---|
| Logistic Regression | 0.9719 +/- 0.0217 |
| SVM (RBF) | 0.9754 +/- 0.0199 |
| Decision Tree | 0.9333 +/- 0.0287 |
| Random Forest | 0.9649 +/- 0.0251 |
| Gradient Boosting | 0.9684 +/- 0.0233 |
Several observations emerge from this comparison:
- Linear models are surprisingly competitive. Logistic regression and SVM perform among the best, suggesting that the decision boundary for this dataset is approximately linear in the original feature space.
- A single decision tree underperforms due to its high variance. The tree memorizes specific splits that do not generalize well.
- Ensemble methods improve on individual trees but do not dramatically outperform the linear models here, because the underlying relationship does not require complex non-linear boundaries.
- The mean +/- standard deviation intervals overlap substantially, so the differences between most models may not be statistically significant. As we will discuss in Chapter 8, a formal statistical test (such as McNemar's test or the corrected paired t-test) is needed to determine whether one model is genuinely better than another.
This example illustrates a crucial practical lesson: always start with a simple baseline. If a linear model performs well, the added complexity of an ensemble may not be justified. Complex models earn their place only when they provide a meaningful improvement on validation data.
6.9 Summary
In this chapter, you have learned the fundamental algorithms of supervised learning---the backbone of applied machine learning and AI engineering. Let us summarize the key ideas:
Linear Regression provides the foundation: model the target as a linear combination of features, optimize via OLS or gradient descent, and regularize with Ridge ($L_2$), Lasso ($L_1$), or Elastic Net to prevent overfitting. Polynomial feature expansion allows linear models to capture non-linear relationships.
Logistic Regression extends the linear framework to classification by passing the linear output through the sigmoid function and optimizing the cross-entropy loss via maximum likelihood. It produces calibrated probability estimates and serves as the building block for neural networks.
Support Vector Machines find the maximum-margin decision boundary and use the kernel trick to handle non-linear data. The RBF kernel is the most versatile choice for non-linear classification.
Decision Trees partition the feature space through a sequence of axis-aligned splits. They are interpretable but prone to overfitting.
Random Forests build ensembles of decorrelated trees via bagging and random feature selection, dramatically reducing variance.
Gradient Boosting (and its optimized implementations like XGBoost) builds trees sequentially, with each tree correcting the errors of the ensemble, achieving state-of-the-art performance on tabular data.
The Bias-Variance Tradeoff is the central tension in model selection: too-simple models underfit (high bias), too-complex models overfit (high variance), and the art of machine learning is finding the right balance.
In the next chapter, we will explore unsupervised learning methods that find patterns in data without labeled targets. In Chapter 8, we will dive deeper into model evaluation and selection strategies, and in Chapter 9, we will study advanced ensemble and stacking techniques. The supervised learning algorithms you have mastered here will serve you throughout the rest of this book and throughout your career as an AI engineer.