In This Chapter
- Maximum Margin Classifiers and the Kernel Trick
- The Algorithm That Refused to Die
- The Maximum Margin Principle
- Soft Margins and the C Parameter
- Feature Scaling: Non-Negotiable for SVMs
- The Kernel Trick: When Linear Is Not Enough
- Choosing Your Kernel
- Tuning C and Gamma Together
- SVM for Regression: SVR
- Practical Considerations
- Multi-class Classification with SVMs
- A Complete Example: Putting It All Together
- SVMs vs. the Competition: An Honest Assessment
- Common Pitfalls
- Chapter Summary
Chapter 12: Support Vector Machines
Maximum Margin Classifiers and the Kernel Trick
Learning Objectives
By the end of this chapter, you will be able to:
- Explain the maximum margin principle geometrically
- Understand soft margins and the C parameter's role
- Apply the kernel trick for non-linear classification
- Choose between linear, polynomial, and RBF kernels
- Recognize when SVMs are (and aren't) the right choice
The Algorithm That Refused to Die
War Story --- In 2012, a team at a medical imaging startup was classifying breast cancer biopsies from microscope images. They had 569 samples --- 357 benign, 212 malignant --- and 30 features extracted from each image. The dataset was small, the features were dense, and the classes were not trivially separable. They tried logistic regression (91% accuracy), a random forest (94%), and then, almost as an afterthought, a support vector machine with an RBF kernel. It hit 97.2% accuracy with a sensitivity of 98.1% on the malignant class. On this dataset, with this sample size, the SVM was the best classifier in the room.
That dataset was the Wisconsin Breast Cancer Dataset, and it has been an SVM showcase ever since. Not because SVMs are always the best --- they are not --- but because on small-to-medium datasets with complex decision boundaries, SVMs have a geometric elegance and predictive power that other algorithms struggle to match.
Here is the honest truth about support vector machines in 2026: gradient boosting has eaten their lunch. If you have 50,000 rows of tabular data and need a classification model by Friday, you are going to use XGBoost or LightGBM. They are faster to train, easier to tune, handle mixed feature types natively, and produce comparable or better accuracy on most real-world datasets.
So why dedicate an entire chapter to SVMs?
Three reasons. First, the ideas behind SVMs --- margins, support vectors, the kernel trick --- are among the most beautiful and foundational in all of machine learning. Understanding them makes you a better practitioner of every algorithm. Second, SVMs still win in specific regimes: small datasets, high-dimensional feature spaces, problems where you need a clean decision boundary. Third, you will encounter SVMs in legacy systems, in research papers, and in interviews. You need to know what they are, how they work, and when to reach for them.
This chapter is about all three.
The Maximum Margin Principle
The Problem: Many Lines Can Separate Two Classes
Imagine you have a two-dimensional dataset with two classes --- red dots and blue dots --- that are perfectly separable by a straight line. Not a complicated dataset. Just dots on a plane, with a gap between the two groups.
Here is the critical observation: there are infinitely many lines that separate the two classes perfectly. You could draw a line that barely misses the closest red dot. You could draw one that barely misses the closest blue dot. You could draw one that cuts diagonally through the gap. All of them achieve 100% training accuracy. All of them correctly classify every point in the training set.
So which line is "best"?
Logistic regression answers this question by maximizing the likelihood of the observed labels --- it finds the line that assigns the highest probabilities to the correct classes. That is a valid answer. But SVMs take a different approach, one rooted in geometry rather than probability.
The SVM Answer: Maximize the Margin
The margin is the distance between the decision boundary and the closest data point from either class. The SVM finds the line (or hyperplane, in higher dimensions) that maximizes this margin.
Why is a larger margin better? Intuition first: a decision boundary with a wide margin is "more confident." It has more room for error. If a new test point falls slightly to the wrong side of where you expected, a wide-margin classifier still gets it right because there is a buffer zone. A narrow-margin classifier, on the other hand, is living on the edge --- one slightly noisy test point and it misclassifies.
The formal justification comes from statistical learning theory: maximizing the margin minimizes an upper bound on the generalization error. You do not need to understand the proof to use SVMs, but it is worth knowing that the margin principle is not just geometric intuition --- it has theoretical backing.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC
# Generate a simple two-class dataset
X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.2, random_state=42)
# Fit a linear SVM
svm = SVC(kernel='linear', C=1e6) # Very large C = hard margin
svm.fit(X, y)
# The decision boundary
w = svm.coef_[0]
b = svm.intercept_[0]
# The support vectors
print(f"Number of support vectors: {len(svm.support_vectors_)}")
print(f"Support vectors per class: {svm.n_support_}")
print(f"Weight vector: [{w[0]:.4f}, {w[1]:.4f}]")
print(f"Intercept: {b:.4f}")
Number of support vectors: 3
Support vectors per class: [1 2]
Weight vector: [-0.7514, -0.9160]
Intercept: 2.3847
Three points. Out of 100, only three data points determined the entire decision boundary. Those three points are the support vectors --- and they are the key to everything.
Support Vectors: The Points That Matter
A support vector is a data point that lies exactly on the margin boundary. These are the closest points from each class to the decision boundary. Remove a support vector, and the decision boundary moves. Remove any other point, and the boundary stays exactly the same.
This is a remarkable property. In logistic regression, every training point contributes to the model parameters. In an SVM, only the support vectors matter. The other points could be anywhere --- as long as they are on the correct side of the margin, they have zero influence on the model.
The practical implications:
- Sparsity. The model depends only on a small subset of training points. This makes SVMs memory-efficient at prediction time.
- Robustness to outliers far from the boundary. Points deep inside their own class territory cannot affect the decision boundary.
- Sensitivity to points near the boundary. The points that are hardest to classify are exactly the ones that define the model. Add noise to a support vector and the boundary shifts.
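The "only support vectors matter" claim is directly testable: drop a point that is not a support vector, refit, and the boundary does not move. A small sketch, reusing the same blobs setup as the listing above:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.2, random_state=42)
svm = SVC(kernel='linear', C=1e6).fit(X, y)   # hard margin via very large C

# Remove one point that is NOT a support vector, then refit
non_sv_idx = np.setdiff1d(np.arange(len(X)), svm.support_)[0]
mask = np.arange(len(X)) != non_sv_idx
svm_pruned = SVC(kernel='linear', C=1e6).fit(X[mask], y[mask])

# The boundary is (numerically) identical without that point
print(np.allclose(svm.coef_, svm_pruned.coef_, atol=1e-3))

# Remove a support vector instead: the boundary typically shifts
mask_sv = np.arange(len(X)) != svm.support_[0]
svm_no_sv = SVC(kernel='linear', C=1e6).fit(X[mask_sv], y[mask_sv])
print(np.abs(svm.coef_ - svm_no_sv.coef_).max())  # typically nonzero
```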
The Geometry of the Margin
Let's be precise. In two dimensions, the SVM finds a line of the form:
w1 * x1 + w2 * x2 + b = 0
This line is the decision boundary. The two margin boundaries (gutters) are:
w1 * x1 + w2 * x2 + b = +1   (one class side)
w1 * x1 + w2 * x2 + b = -1   (other class side)
The margin width is 2 / ||w||, where ||w|| is the length of the weight vector. Maximizing the margin is equivalent to minimizing ||w||. This gives us the optimization problem:
Minimize (1/2) ||w||^2
subject to y_i (w . x_i + b) >= 1 for all i
Each constraint says: every training point must be on the correct side of the margin. The factor y_i (which is +1 or -1 for the two classes) flips the inequality direction so a single constraint covers both classes.
This is a convex quadratic optimization problem. It has a unique global minimum. No local minima to get trapped in. No random initialization that changes the answer. One dataset, one solution. This mathematical cleanness is part of what makes SVMs appealing.
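The relationship margin width = 2 / ||w|| can be read directly off a fitted model, and the support vectors can be checked against the gutter condition |w . x + b| = 1. A quick numerical sketch, refitting the same blobs setup as above:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=100, centers=2, cluster_std=1.2, random_state=42)
svm = SVC(kernel='linear', C=1e6).fit(X, y)   # very large C = hard margin

w = svm.coef_[0]
margin_width = 2 / np.linalg.norm(w)
print(f"||w|| = {np.linalg.norm(w):.4f}, margin width = {margin_width:.4f}")

# Support vectors sit on the gutters, so |w . x + b| should be (nearly) 1
sv_scores = svm.decision_function(svm.support_vectors_)
print(np.abs(sv_scores).round(3))
```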
Soft Margins and the C Parameter
Reality Check: Data Is Not Perfectly Separable
The hard-margin SVM we just described works beautifully when the two classes are perfectly separable with a gap between them. In real data, this almost never happens. Classes overlap. There is noise. Some points from class A are sitting right in the middle of class B.
If we insist on classifying every training point correctly (hard margin), two things can go wrong:
- No solution exists. If the classes overlap, no hyperplane can perfectly separate them. The optimization problem is infeasible.
- Overfitting. Even if a separating hyperplane exists, forcing it to classify every point correctly might produce a very narrow margin that generalizes poorly.
The fix is the soft margin: allow some points to be on the wrong side of the margin --- or even on the wrong side of the decision boundary --- but penalize them for it.
Slack Variables
For each training point, we introduce a slack variable (xi_i, pronounced "ksi-i") that measures how much the point violates the margin:
- xi_i = 0: the point is on the correct side of the margin (no violation)
- 0 < xi_i < 1: the point is inside the margin but on the correct side of the boundary (margin violation)
- xi_i >= 1: the point is on the wrong side of the decision boundary (misclassification)
The optimization problem becomes:
Minimize (1/2) ||w||^2 + C * sum(xi_i)
subject to y_i (w . x_i + b) >= 1 - xi_i and xi_i >= 0 for all i
The first term wants a wide margin (small ||w||). The second term wants few violations (small slack). The parameter C controls the tradeoff.
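scikit-learn does not expose the slack variables directly, but for a fitted model they can be recovered from the decision function as xi_i = max(0, 1 - y_i * (w . x_i + b)), with labels recoded to +/-1. A sketch:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           n_informative=2, n_clusters_per_class=1,
                           flip_y=0.1, random_state=42)
y_pm = 2 * y - 1                      # recode {0, 1} labels as {-1, +1}

svm = SVC(kernel='linear', C=1.0).fit(X, y)
f = svm.decision_function(X)          # signed score w . x + b for each point
xi = np.maximum(0, 1 - y_pm * f)      # slack: zero iff outside the margin

print(f"xi = 0 (correct side of margin): {(xi == 0).sum()}")
print(f"0 < xi < 1 (margin violations):  {((xi > 0) & (xi < 1)).sum()}")
print(f"xi >= 1 (misclassifications):    {(xi >= 1).sum()}")
```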
The C Parameter: Your Most Important Tuning Knob
The C parameter is the penalty for margin violations. It is the single most important hyperparameter in an SVM.
- Large C: "I really care about classifying training points correctly." The SVM tolerates a narrow margin to reduce violations. Risk: overfitting.
- Small C: "I really care about a wide margin." The SVM tolerates more violations to keep the margin wide. Risk: underfitting.
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
# Generate a dataset with some noise
X, y = make_classification(
    n_samples=200, n_features=2, n_redundant=0,
    n_informative=2, n_clusters_per_class=1,
    flip_y=0.1, random_state=42
)
# Try different C values
for C in [0.001, 0.01, 0.1, 1, 10, 100, 1000]:
    svm = SVC(kernel='linear', C=C, random_state=42)
    scores = cross_val_score(svm, X, y, cv=5, scoring='accuracy')
    n_sv = svm.fit(X, y).n_support_.sum()
    print(f"C={C:<8} Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f}) "
          f"Support vectors: {n_sv}")
C=0.001 Accuracy: 0.820 (+/- 0.037) Support vectors: 173
C=0.01 Accuracy: 0.870 (+/- 0.029) Support vectors: 112
C=0.1 Accuracy: 0.890 (+/- 0.037) Support vectors: 56
C=1 Accuracy: 0.895 (+/- 0.037) Support vectors: 34
C=10 Accuracy: 0.895 (+/- 0.034) Support vectors: 27
C=100 Accuracy: 0.890 (+/- 0.037) Support vectors: 24
C=1000 Accuracy: 0.885 (+/- 0.037) Support vectors: 22
Notice the pattern:
- At C=0.001, the SVM barely tries to classify correctly. It has 173 support vectors out of 200 points --- nearly every point is a support vector. The margin is wide but accuracy is poor.
- At C=1, the balance is right. Accuracy peaks at 89.5% with 34 support vectors.
- At C=1000, the SVM is trying too hard to classify every training point. Fewer support vectors (tighter margin) but cross-validation accuracy starts to dip --- classic overfitting.
Practitioner's Rule of Thumb --- Start with C=1.0 and search over [0.001, 0.01, 0.1, 1, 10, 100]. The number of support vectors is a useful diagnostic: if the majority of your training points are support vectors, C is probably too small; if you have very few, C might be too large. But always let cross-validation decide.
Feature Scaling: Non-Negotiable for SVMs
Before we go any further, a critical practical point: SVMs require feature scaling. This is not optional. It is not "recommended." It is mandatory.
The reason is geometric. The margin is computed as a distance in feature space. If one feature ranges from 0 to 1 and another ranges from 0 to 1,000,000, the margin calculation is dominated entirely by the large-scale feature. The SVM effectively ignores the small-scale feature.
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_breast_cancer
X, y = load_breast_cancer(return_X_y=True)
# WITHOUT scaling
svm_no_scale = SVC(kernel='rbf', C=1.0, random_state=42)
scores_no_scale = cross_val_score(svm_no_scale, X, y, cv=5, scoring='accuracy')
# WITH scaling
svm_scaled = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', C=1.0, random_state=42))
])
scores_scaled = cross_val_score(svm_scaled, X, y, cv=5, scoring='accuracy')
print(f"Without scaling: {scores_no_scale.mean():.3f} (+/- {scores_no_scale.std():.3f})")
print(f"With scaling: {scores_scaled.mean():.3f} (+/- {scores_scaled.std():.3f})")
Without scaling: 0.627 (+/- 0.052)
With scaling: 0.963 (+/- 0.018)
A 34-point accuracy improvement just from scaling. This is not a subtle effect. An unscaled SVM is a broken SVM.
Always use a Pipeline. The scaler must be fit on the training fold only --- never on the test fold. Wrapping the scaler and SVM in a Pipeline ensures this happens automatically during cross-validation. If you scale the entire dataset before splitting, you have data leakage.
The Kernel Trick: When Linear Is Not Enough
The Problem with Linearity
Everything we have discussed so far assumes the decision boundary is a straight line (or hyperplane). But many real-world problems are not linearly separable. Consider two concentric circles: red dots in a ring around blue dots in the center. No straight line can separate them.
The obvious solution: transform the features into a higher-dimensional space where the classes are linearly separable. For the concentric circles example, if you add a third feature z = x1^2 + x2^2 (the squared distance from the origin), the inner circle has small z values and the outer circle has large z values. A horizontal plane in this 3D space separates them perfectly.
The less obvious problem: computing these transformations is expensive. If you have d features and apply all degree-2 polynomial combinations, you get O(d^2) features. Degree-3 gives O(d^3). For a dataset with 1,000 features, a degree-3 polynomial transformation produces roughly a billion features. That is not going to fit in memory.
The Trick: We Don't Need the Transformation
Here is where SVMs get elegant. The key insight, and the reason SVMs were a breakthrough:
The Kernel Trick --- The SVM optimization problem can be reformulated so that it depends only on dot products between data points, not on the data points themselves. If we can compute the dot product between two points in the transformed space without actually transforming them, we get the benefits of a high-dimensional decision boundary at the cost of computing in the original space.
A kernel function K(x_i, x_j) computes the dot product between x_i and x_j in some (potentially infinite-dimensional) transformed space, without ever explicitly computing the transformation.
Let's make this concrete:
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
# Concentric circles: not linearly separable
X, y = make_circles(n_samples=300, noise=0.1, factor=0.5, random_state=42)
# Linear SVM: fails
linear_svm = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='linear', C=1.0, random_state=42))
])
linear_scores = cross_val_score(linear_svm, X, y, cv=5, scoring='accuracy')
# RBF kernel SVM: succeeds
rbf_svm = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', C=1.0, random_state=42))
])
rbf_scores = cross_val_score(rbf_svm, X, y, cv=5, scoring='accuracy')
print(f"Linear kernel: {linear_scores.mean():.3f} (+/- {linear_scores.std():.3f})")
print(f"RBF kernel: {rbf_scores.mean():.3f} (+/- {rbf_scores.std():.3f})")
Linear kernel: 0.470 (+/- 0.040)
RBF kernel: 0.887 (+/- 0.029)
The linear SVM cannot do better than random guessing on concentric circles (50% is chance for balanced binary). The RBF kernel SVM draws a circular decision boundary --- in the original 2D space --- by implicitly working in an infinite-dimensional feature space. It never computes that infinite-dimensional transformation. It just computes kernel values.
The Kernel Matrix
For a dataset of n training points, the SVM computes the kernel matrix (or Gram matrix): an n-by-n matrix where entry (i, j) is K(x_i, x_j). This matrix captures all the pairwise similarity information the SVM needs.
This is also why SVMs scale poorly. The kernel matrix has n^2 entries. For 10,000 training points, that is 100 million entries. For 100,000 points, 10 billion. This is the fundamental computational bottleneck that limits SVMs to small-to-medium datasets.
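You can build the Gram matrix yourself with scikit-learn's pairwise helpers and see both its structure and its quadratic memory cost. A sketch:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel

X, _ = make_moons(n_samples=500, noise=0.2, random_state=42)

# Entry (i, j) of the Gram matrix is K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
K = rbf_kernel(X, X, gamma=1.0)
print(K.shape)                           # (500, 500): n^2 entries
print(bool(np.allclose(K, K.T)))         # symmetric
print(bool(np.allclose(np.diag(K), 1)))  # K(x, x) = exp(0) = 1

# Memory for a dense float64 Gram matrix grows as n^2
for n in [1_000, 10_000, 100_000]:
    print(f"n = {n:>7,}: {n * n * 8 / 1e9:.2f} GB")
```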
Choosing Your Kernel
scikit-learn offers four kernel options for SVC. Here is what each one does and when to use it.
Linear Kernel
K(x_i, x_j) = x_i . x_j
The dot product. No transformation at all. The decision boundary is a hyperplane in the original feature space.
When to use it:
- High-dimensional data where d >> n (text classification, genomics)
- When you suspect the decision boundary is approximately linear
- When you need speed (linear kernels are much faster than RBF)
- As a baseline before trying non-linear kernels
from sklearn.svm import LinearSVC
# For linear kernels, LinearSVC is faster than SVC(kernel='linear')
# It uses liblinear instead of libsvm and scales better
linear_model = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', LinearSVC(C=1.0, max_iter=10000, random_state=42))
])
LinearSVC vs. SVC(kernel='linear') --- They solve the same problem differently. LinearSVC uses liblinear (optimized for the linear case, scales to millions of samples). SVC(kernel='linear') uses libsvm (general purpose, computes the kernel matrix, O(n^2) memory). For linear problems, always prefer LinearSVC. It is not just faster --- it is a different algorithm that avoids the kernel matrix entirely.
Polynomial Kernel
K(x_i, x_j) = (gamma * x_i . x_j + coef0)^degree
Computes the dot product in a polynomial feature space. Degree 2 captures all pairwise interactions. Degree 3 captures all triple interactions.
Hyperparameters:
- degree: the polynomial degree (default 3)
- gamma: scaling factor for the dot product
- coef0: independent term (default 0)
When to use it:
- When you know the relationship involves feature interactions
- When you want a non-linear boundary that is "smoother" than RBF
- Rarely in practice --- RBF usually works as well or better and has fewer hyperparameters
poly_svm = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='poly', degree=3, C=1.0, random_state=42))
])
Radial Basis Function (RBF) Kernel
K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
The RBF kernel --- also called the Gaussian kernel --- measures the similarity between two points as a function of the Euclidean distance between them. Points that are close in feature space have kernel values near 1; points that are far apart have kernel values near 0.
The RBF kernel implicitly maps data into an infinite-dimensional feature space. You read that correctly: infinite-dimensional. The kernel trick lets us compute dot products in this space without ever representing it.
The gamma parameter controls how far the influence of a single training example reaches:
- Large gamma: Each point has a small radius of influence. The decision boundary can be very wiggly --- it conforms closely to each training point. Risk: overfitting.
- Small gamma: Each point has a large radius of influence. The decision boundary is smoother. Risk: underfitting.
from sklearn.datasets import make_moons
from sklearn.model_selection import cross_val_score
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
for gamma in [0.01, 0.1, 1, 10, 100]:
    svm = Pipeline([
        ('scaler', StandardScaler()),
        ('svm', SVC(kernel='rbf', C=1.0, gamma=gamma, random_state=42))
    ])
    scores = cross_val_score(svm, X, y, cv=5, scoring='accuracy')
    n_sv = SVC(kernel='rbf', C=1.0, gamma=gamma).fit(
        StandardScaler().fit_transform(X), y
    ).n_support_.sum()
    print(f"gamma={gamma:<6} Accuracy: {scores.mean():.3f} (+/- {scores.std():.3f}) "
          f"Support vectors: {n_sv}")
gamma=0.01 Accuracy: 0.857 (+/- 0.029) Support vectors: 203
gamma=0.1 Accuracy: 0.870 (+/- 0.040) Support vectors: 112
gamma=1 Accuracy: 0.893 (+/- 0.029) Support vectors: 73
gamma=10 Accuracy: 0.883 (+/- 0.040) Support vectors: 119
gamma=100 Accuracy: 0.657 (+/- 0.085) Support vectors: 256
At gamma=100, the SVM overfits badly --- every point defines its own tiny island of influence, and the model cannot generalize. At gamma=0.01, the boundary is too smooth to capture the moon-shaped structure. The sweet spot is in the middle.
Practitioner's Rule of Thumb --- scikit-learn's default gamma is 'scale', which sets gamma = 1 / (n_features * X.var()). This is a reasonable starting point. Search over [0.001, 0.01, 0.1, 1, 10] during hyperparameter tuning.
Sigmoid Kernel
K(x_i, x_j) = tanh(gamma * x_i . x_j + coef0)
Included for completeness. The sigmoid kernel is rarely used --- it is not guaranteed to produce a valid (positive semi-definite) kernel matrix, and there is almost no scenario where it outperforms RBF. Skip it.
Summary: Kernel Selection Flowchart
Start
  |
  v
Is n_features >> n_samples? (e.g., text, genomics)
  |
  +--> YES --> Use LinearSVC
  |
  +--> NO --> Is the relationship likely linear?
                |
                +--> YES --> Start with LinearSVC
                |            (if inadequate, try RBF)
                |
                +--> NO --> Use SVC(kernel='rbf');
                            tune C and gamma
In practice, 90% of SVM usage falls into two buckets: LinearSVC for high-dimensional sparse data, and SVC(kernel='rbf') for everything else.
Tuning C and Gamma Together
For the RBF kernel, you have two hyperparameters that interact with each other. C controls the penalty for margin violations. Gamma controls the radius of influence of each training point. You must tune them jointly.
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_moons
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
X, y = make_moons(n_samples=300, noise=0.2, random_state=42)
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', random_state=42))
])
param_grid = {
    'svm__C': [0.01, 0.1, 1, 10, 100],
    'svm__gamma': [0.01, 0.1, 1, 10]
}
grid = GridSearchCV(pipe, param_grid, cv=5, scoring='accuracy', refit=True)
grid.fit(X, y)
print(f"Best parameters: {grid.best_params_}")
print(f"Best CV accuracy: {grid.best_score_:.3f}")
# Show the interaction
import pandas as pd
results = pd.DataFrame(grid.cv_results_)
pivot = results.pivot_table(
    values='mean_test_score',
    index='param_svm__C',
    columns='param_svm__gamma'
)
print("\nAccuracy by (C, gamma):")
print(pivot.round(3).to_string())
Best parameters: {'svm__C': 10, 'svm__gamma': 1}
Best CV accuracy: 0.900
Accuracy by (C, gamma):
param_svm__gamma   0.01    0.1      1     10
param_svm__C
0.01              0.533  0.530  0.610  0.600
0.1               0.857  0.857  0.870  0.633
1                 0.857  0.870  0.893  0.883
10                0.857  0.873  0.900  0.887
100               0.857  0.873  0.897  0.870
The interaction is visible. At low C and low gamma, the model underfits (wide margin, smooth boundary). At high C and high gamma, it overfits (narrow margin, wiggly boundary). The best performance is in the middle diagonal.
C and gamma are on a log scale. Always search over powers of 10 (or at least powers of 3). The difference between C=1 and C=2 is negligible. The difference between C=1 and C=10 can be dramatic.
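A convenient way to generate such log-scale grids is np.logspace; a minimal sketch (the grid endpoints here are illustrative):

```python
import numpy as np

# Log-spaced grids: 6 values from 1e-3 to 1e2 for C, 5 from 1e-3 to 1e1 for gamma
param_grid = {
    'svm__C': np.logspace(-3, 2, 6).tolist(),     # 0.001 ... 100
    'svm__gamma': np.logspace(-3, 1, 5).tolist(), # 0.001 ... 10
}
print(param_grid)
```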
SVM for Regression: SVR
SVMs are not just for classification. Support Vector Regression (SVR) adapts the same margin-based framework to continuous targets. Instead of maximizing the margin between classes, SVR defines an epsilon-insensitive tube around the regression line: predictions within epsilon of the true value incur zero loss; predictions outside the tube are penalized.
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
import numpy as np
# Synthetic regression data
np.random.seed(42)
X = np.sort(5 * np.random.rand(200, 1), axis=0)
y = np.sin(X).ravel() + np.random.normal(0, 0.1, X.shape[0])
# Linear SVR
linear_svr = Pipeline([
    ('scaler', StandardScaler()),
    ('svr', SVR(kernel='linear', C=1.0, epsilon=0.1))
])
# RBF SVR
rbf_svr = Pipeline([
    ('scaler', StandardScaler()),
    ('svr', SVR(kernel='rbf', C=10, gamma=0.1, epsilon=0.1))
])
linear_scores = cross_val_score(linear_svr, X, y, cv=5, scoring='r2')
rbf_scores = cross_val_score(rbf_svr, X, y, cv=5, scoring='r2')
print(f"Linear SVR R2: {linear_scores.mean():.3f} (+/- {linear_scores.std():.3f})")
print(f"RBF SVR R2: {rbf_scores.mean():.3f} (+/- {rbf_scores.std():.3f})")
Linear SVR R2: 0.342 (+/- 0.267)
RBF SVR R2: 0.907 (+/- 0.035)
SVR has three hyperparameters: C (violation penalty), gamma (kernel width, for non-linear kernels), and epsilon (tube width). More hyperparameters to tune, same general guidance: use grid search, scale your features, prefer RBF unless you have a reason for linear.
In practice, SVR is used less than SVC. Gradient boosting regressors (Chapter 14) are usually faster and easier to tune for regression tasks. SVR remains useful for small datasets with complex non-linear relationships --- the same niche as SVC.
Practical Considerations
When SVMs Shine
- Small to medium datasets (n < 10,000). The kernel matrix is n-by-n. Below ~10,000 samples, this is manageable. Above that, you are waiting.
- High-dimensional feature spaces. Text classification with TF-IDF produces thousands of features. SVMs with linear kernels handle this well because the margin is defined in the feature space, and high-dimensional spaces have more room for separation.
- Clear margin of separation. If the classes have a natural gap, SVMs exploit it elegantly.
- When you need a clean decision boundary. SVMs produce a single boundary defined by support vectors. No ensemble averaging, no probabilistic thresholding. Just geometry.
When SVMs Struggle
- Large datasets (n > 50,000). The kernel matrix becomes prohibitively expensive. LinearSVC scales well, but SVC with a non-linear kernel does not.
- Noisy, overlapping classes. SVMs try to find a boundary. If the classes overlap significantly, the boundary becomes meaningless and ensemble methods (which aggregate many weak decisions) perform better.
- Mixed feature types. SVMs need numeric features. Categorical variables must be encoded first. Tree-based methods handle mixed types natively.
- Interpretability requirements. You cannot easily extract "feature importance" from an SVM. The model is defined by support vectors and kernel evaluations, not by per-feature weights (except in the linear case). If stakeholders need to understand why the model makes a decision, logistic regression or a decision tree is a better choice.
- Probability calibration. SVC does not natively produce well-calibrated probabilities. The probability=True flag uses Platt scaling (fitting a sigmoid to the SVM scores), but this adds computational cost and the probabilities are approximate. If you need calibrated probabilities, logistic regression is more natural.
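If you do need probabilities, one common workaround (a sketch, not the only option) is to wrap the SVM in CalibratedClassifierCV, which fits a Platt-style sigmoid on held-out decision scores:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

base = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', C=1.0)),   # note: no probability=True needed
])

# Fits the SVM on CV folds and a sigmoid (Platt) calibrator on held-out scores
calibrated = CalibratedClassifierCV(base, method='sigmoid', cv=5)
calibrated.fit(X, y)

proba = calibrated.predict_proba(X[:5])
print(proba.round(3))    # each row sums to 1
```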
Scaling to Larger Datasets
When you want SVM-like behavior on larger datasets, you have options:
from sklearn.svm import LinearSVC
from sklearn.kernel_approximation import RBFSampler, Nystroem
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
# Option 1: LinearSVC (scales to millions for linear problems)
pipe_linear = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', LinearSVC(C=1.0, max_iter=10000, random_state=42))
])
# Option 2: Kernel approximation + linear model
# RBFSampler approximates the RBF kernel using random Fourier features
pipe_approx = Pipeline([
    ('scaler', StandardScaler()),
    ('kernel_approx', RBFSampler(gamma=1.0, n_components=300, random_state=42)),
    ('svm', LinearSVC(C=1.0, max_iter=10000, random_state=42))
])
# Option 3: SGDClassifier with hinge loss (online SVM)
# Scales to datasets that don't fit in memory
pipe_sgd = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SGDClassifier(loss='hinge', alpha=0.001, max_iter=1000, random_state=42))
])
SGDClassifier(loss='hinge') is a linear SVM trained with stochastic gradient descent. It sees one sample at a time and scales to arbitrarily large datasets. The tradeoff is that you lose the kernel trick (linear only) and the solution is approximate. But for large-scale linear classification, it is the go-to.
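Out-of-core training with partial_fit looks like this in miniature; the in-memory chunks below stand in for batches read from disk, and in a real pipeline you would also scale incrementally (e.g. with StandardScaler.partial_fit):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

# Stand-in for a stream; real chunks would come from disk or a generator
X, y = make_classification(n_samples=10_000, n_features=20, random_state=42)
classes = np.unique(y)   # partial_fit needs the full label set up front

sgd = SGDClassifier(loss='hinge', alpha=0.001, random_state=42)
for start in range(0, len(X), 1_000):
    sgd.partial_fit(X[start:start + 1_000], y[start:start + 1_000],
                    classes=classes)

print(f"Accuracy after one pass over the stream: {sgd.score(X, y):.3f}")
```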
Multi-class Classification with SVMs
SVMs are inherently binary classifiers. scikit-learn handles multi-class problems using one of two strategies:
- One-vs-Rest (OvR): Train one SVM per class. Each SVM separates one class from all others. Predict the class whose SVM gives the highest decision value. LinearSVC uses this by default.
- One-vs-One (OvO): Train one SVM for every pair of classes. For k classes, this is k(k-1)/2 SVMs. Each SVM votes for one of the two classes it was trained on. Predict the class with the most votes. SVC uses this by default.
from sklearn.datasets import load_iris
from sklearn.svm import SVC, LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
X, y = load_iris(return_X_y=True)
# SVC: One-vs-One (default)
svc_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', C=1.0, random_state=42))
])
svc_scores = cross_val_score(svc_pipe, X, y, cv=5, scoring='accuracy')
# LinearSVC: One-vs-Rest (default)
lsvc_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', LinearSVC(C=1.0, max_iter=10000, random_state=42))
])
lsvc_scores = cross_val_score(lsvc_pipe, X, y, cv=5, scoring='accuracy')
print(f"SVC (OvO, RBF): {svc_scores.mean():.3f} (+/- {svc_scores.std():.3f})")
print(f"LinearSVC (OvR, linear): {lsvc_scores.mean():.3f} (+/- {lsvc_scores.std():.3f})")
SVC (OvO, RBF): 0.973 (+/- 0.024)
LinearSVC (OvR, linear): 0.960 (+/- 0.022)
For k classes, OvO trains k(k-1)/2 models but each on a smaller subset. OvR trains k models but each on the full dataset. When k is large (say, 100 classes), OvO trains 4,950 models --- but each is fast because the training set is small.
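Both strategies can also be applied explicitly through the sklearn.multiclass wrappers, which expose the underlying per-class models. A sketch on iris (k = 3, where k(k-1)/2 and k happen to coincide):

```python
from sklearn.datasets import load_iris
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import LinearSVC

X, y = load_iris(return_X_y=True)

ovo = OneVsOneClassifier(LinearSVC(max_iter=10000)).fit(X, y)
ovr = OneVsRestClassifier(LinearSVC(max_iter=10000)).fit(X, y)

# k = 3: OvO trains k(k-1)/2 = 3 pairwise models, OvR trains k = 3 models
print(len(ovo.estimators_), len(ovr.estimators_))   # 3 3
```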
A Complete Example: Putting It All Together
Let's walk through a complete SVM workflow on a real dataset.
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC, LinearSVC
from sklearn.metrics import classification_report, confusion_matrix
# Load data
data = load_breast_cancer()
X, y = data.data, data.target
feature_names = data.feature_names
print(f"Dataset shape: {X.shape}")
print(f"Class distribution: {np.bincount(y)} (0=malignant, 1=benign)")
print(f"Feature range: min={X.min():.1f}, max={X.max():.1f}")
Dataset shape: (569, 30)
Class distribution: [212 357] (0=malignant, 1=benign)
Feature range: min=6.98, max=4254.0
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
# Step 1: Quick baseline with LinearSVC
linear_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', LinearSVC(C=1.0, max_iter=10000, random_state=42))
])
linear_scores = cross_val_score(linear_pipe, X_train, y_train, cv=5, scoring='accuracy')
print(f"LinearSVC baseline: {linear_scores.mean():.3f} (+/- {linear_scores.std():.3f})")
LinearSVC baseline: 0.974 (+/- 0.011)
# Step 2: Try RBF kernel with grid search
rbf_pipe = Pipeline([
('scaler', StandardScaler()),
('svm', SVC(kernel='rbf', random_state=42))
])
param_grid = {
'svm__C': [0.1, 1, 10, 100],
'svm__gamma': ['scale', 0.01, 0.1, 1]
}
grid = GridSearchCV(
rbf_pipe, param_grid, cv=5, scoring='accuracy',
refit=True, n_jobs=-1
)
grid.fit(X_train, y_train)
print(f"Best RBF params: {grid.best_params_}")
print(f"Best RBF CV accuracy: {grid.best_score_:.3f}")
Best RBF params: {'svm__C': 10, 'svm__gamma': 'scale'}
Best RBF CV accuracy: 0.978
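Before trusting the winner, it is worth glancing at the whole C x gamma surface, not just the best cell --- a flat neighborhood around the optimum is reassuring, a sharp spike is not. Here is a sketch that rebuilds the same grid search (on the full dataset, for brevity) and pivots the CV scores into a table:

```python
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('svm', SVC(kernel='rbf', random_state=42))])
grid = GridSearchCV(pipe,
                    {'svm__C': [0.1, 1, 10, 100],
                     'svm__gamma': ['scale', 0.01, 0.1, 1]},
                    cv=5, scoring='accuracy', n_jobs=-1)
grid.fit(X, y)

results = pd.DataFrame(grid.cv_results_)
# Cast gamma to string so the mixed 'scale'/numeric values sort cleanly
results['param_svm__gamma'] = results['param_svm__gamma'].astype(str)
surface = results.pivot_table(index='param_svm__C',
                              columns='param_svm__gamma',
                              values='mean_test_score')
print(surface.round(3))
```

Reading the table row by row shows the C-gamma interaction directly: large gamma needs small C (and vice versa) to stay in the well-performing band.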
# Step 3: Evaluate on test set
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print(classification_report(y_test, y_pred, target_names=['Malignant', 'Benign']))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))
              precision    recall  f1-score   support

   Malignant       0.98      0.95      0.96        42
      Benign       0.97      0.99      0.98        72

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
Confusion Matrix:
[[40 2]
[ 1 71]]
# Step 4: Inspect the model
svm_model = best_model.named_steps['svm']
print(f"Number of support vectors: {svm_model.n_support_}")
print(f"Total support vectors: {svm_model.n_support_.sum()} out of {len(X_train)} training points")
print(f"Percentage of data as support vectors: "
f"{svm_model.n_support_.sum() / len(X_train) * 100:.1f}%")
Number of support vectors: [39 44]
Total support vectors: 83 out of 455 training points
Percentage of data as support vectors: 18.2%
18% of training points are support vectors. That is a reasonable number --- not so many that the margin is trivially wide, not so few that the model is overly rigid. The model uses 83 points out of 455 to define its decision boundary, and achieves 97% test accuracy.
SVMs vs. the Competition: An Honest Assessment
Here is where SVMs stand relative to the algorithms you will learn in this part of the book:
| Criterion | SVM (RBF) | Logistic Regression | Random Forest | Gradient Boosting |
|---|---|---|---|---|
| Small datasets (n < 1K) | Excellent | Good | Good | Moderate |
| Medium datasets (1K-10K) | Good | Good | Excellent | Excellent |
| Large datasets (n > 50K) | Poor (slow) | Excellent | Good | Excellent |
| High-dimensional sparse | Excellent (linear) | Excellent | Moderate | Moderate |
| Non-linear boundaries | Excellent | Poor | Excellent | Excellent |
| Mixed feature types | Poor (encoding needed) | Poor (encoding needed) | Excellent | Excellent |
| Interpretability | Poor | Excellent | Moderate | Poor |
| Calibrated probabilities | Poor | Excellent | Moderate | Moderate |
| Training time | Slow (non-linear) | Fast | Moderate | Moderate |
| Prediction time | Fast (sparse SVs) | Fast | Moderate | Fast |
| Hyperparameter sensitivity | High | Low | Low | Moderate |
The honest summary: for most tabular data problems in 2026, gradient boosting or a well-tuned random forest will match or beat an SVM with less effort. SVMs are the right choice in specific niches --- small data, high dimensions, clean boundaries --- and are always the right choice for understanding how machine learning really works.
Common Pitfalls
1. Forgetting to Scale Features
We said it once, we will say it again. Unscaled features produce garbage SVM models. Use StandardScaler or MinMaxScaler inside a Pipeline.
2. Using SVC on Large Datasets
If n > 10,000, do not use SVC with a non-linear kernel. Use LinearSVC, kernel approximation (RBFSampler), or a different algorithm. Fitting SVC(kernel='rbf') on 100,000 samples will take hours and consume gigabytes of memory for the kernel matrix.
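Here is a sketch of the kernel-approximation route: RBFSampler maps the inputs into a randomized feature space in which a linear model approximates an RBF-kernel SVM, with training time that scales linearly in n. The gamma=0.1 and n_components=300 values are illustrative, not tuned; on a dataset this small the exact SVC is fine, but the pattern is what matters at scale.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.kernel_approximation import RBFSampler
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)

# Scale -> random Fourier features -> linear SVM.
# Trains in O(n) instead of the O(n^2)-ish cost of kernel SVC.
approx_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('rbf', RBFSampler(gamma=0.1, n_components=300, random_state=42)),
    ('svm', LinearSVC(C=1.0, max_iter=10000, random_state=42)),
])
scores = cross_val_score(approx_pipe, X, y, cv=5, scoring='accuracy')
print(f"RBFSampler + LinearSVC: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

More n_components means a better approximation of the exact RBF kernel at proportionally higher cost; a few hundred is a common starting point.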
3. Not Tuning C and Gamma Together
C and gamma interact. Tuning one while holding the other at default is leaving performance on the table. Always do a joint grid search.
4. Over-relying on probability=True
Setting probability=True in SVC enables probability estimates via Platt scaling, but it adds significant overhead (it runs an internal 5-fold cross-validation) and the probabilities are not as reliable as those from logistic regression. If you need probabilities, consider whether logistic regression or a calibrated ensemble would serve you better.
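If you do need probabilities from an SVM, one common alternative is to calibrate explicitly with CalibratedClassifierCV, which fits a sigmoid (Platt-style) mapping on the decision function via cross-validation. A sketch, reusing the chapter's split and hyperparameters:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

base = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', LinearSVC(C=1.0, max_iter=10000, random_state=42)),
])
# Sigmoid calibration on the margin scores; cv=5 mirrors SVC's internal CV
clf = CalibratedClassifierCV(base, method='sigmoid', cv=5)
clf.fit(X_train, y_train)
proba = clf.predict_proba(X_test)
print(proba[:3].round(3))  # each row sums to 1
```

This makes the calibration step visible and tunable instead of hiding it inside SVC --- though the probabilities are still only as good as the calibration data allows.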
5. Ignoring the Number of Support Vectors
The number of support vectors is a diagnostic you should always check. If more than 50% of your training points are support vectors, the SVM is not finding meaningful structure (or C is far too small). If fewer than 1% are, the model might be too rigid. In a well-fitting SVM, the support vector percentage is typically 10-30%.
Chapter Summary
Support vector machines find the decision boundary with the maximum margin between classes. The key ideas:
- The margin is the distance between the boundary and the closest points. Maximizing it improves generalization.
- Support vectors are the points on the margin boundary. Only they determine the model.
- Soft margins (controlled by C) allow misclassifications to prevent overfitting.
- The kernel trick lets SVMs find non-linear boundaries by computing dot products in a transformed space without ever computing the transformation.
- RBF kernel is the default choice for non-linear problems. Tune C and gamma jointly.
- LinearSVC is the right tool for high-dimensional or large-scale linear problems.
- Feature scaling is mandatory. An unscaled SVM is a broken SVM.
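The kernel trick bullet can be made concrete with the one kernel whose feature map is small enough to write out by hand: the degree-2 polynomial kernel (x . z + 1)^2 in two dimensions. Here is a numpy sketch verifying that one kernel evaluation equals an explicit dot product in the 6-dimensional transformed space:

```python
import numpy as np

def phi(v):
    # Explicit feature map for the degree-2 polynomial kernel (x.z + 1)^2
    # in 2-D: six coordinates the kernel trick never has to materialize.
    x1, x2 = v
    return np.array([1.0,
                     np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, x2 ** 2,
                     np.sqrt(2) * x1 * x2])

x = np.array([1.0, 2.0])
z = np.array([3.0, -1.0])

kernel = (x @ z + 1) ** 2    # kernel trick: one dot product, one square
explicit = phi(x) @ phi(z)   # explicit map: dot product in 6-D space

print(kernel, explicit)      # equal up to floating-point rounding
```

The RBF kernel plays the same game with an infinite-dimensional feature map, which is exactly why the trick matters: the kernel evaluation stays cheap no matter how large the implicit space is.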
SVMs are not the algorithm you will reach for on most tabular data problems in 2026. That job belongs to gradient boosting (Chapter 14). But the ideas --- margins, support vectors, kernels, the relationship between model complexity and generalization --- are foundational. Every time you think about the tradeoff between fitting the training data and generalizing to new data, you are thinking in the language SVMs formalized.
Next chapter: Chapter 13 --- Tree-Based Methods, where we trade geometry for partitioning and meet the algorithms that dominate modern tabular ML.