Case Study 1: Handwritten Digit Classification --- Where SVMs Still Win
Background
Before deep learning conquered computer vision in 2012, the best models on the MNIST handwritten digit dataset were support vector machines. A carefully tuned SVM with an RBF kernel could achieve 98.6% accuracy on the 10-class digit classification problem --- a result that held the state of the art for years.
Deep learning has since surpassed that number (99.7%+ with convolutional neural networks), but the SVM result remains instructive. It demonstrates exactly the conditions under which SVMs excel: moderate dataset size, high-dimensional features, and a problem where the geometric structure of the data matters.
This case study uses scikit-learn's digits dataset --- a smaller version of MNIST with 8x8 pixel images (64 features) and 1,797 samples --- to walk through a complete SVM classification pipeline. The smaller scale lets us focus on methodology without waiting 20 minutes for a kernel matrix to compute.
The Data
import numpy as np
import pandas as pd
from sklearn.datasets import load_digits
import matplotlib.pyplot as plt
digits = load_digits()
X, y = digits.data, digits.target
print(f"Dataset shape: {X.shape}")
print(f"Number of classes: {len(np.unique(y))}")
print("Samples per class:")
for digit in range(10):
    print(f"  Digit {digit}: {np.sum(y == digit)}")
print(f"\nFeature range: [{X.min():.0f}, {X.max():.0f}]")
print(f"Features represent: 8x8 pixel intensities (0-16 grayscale)")
Dataset shape: (1797, 64)
Number of classes: 10
Samples per class:
Digit 0: 178
Digit 1: 182
Digit 2: 177
Digit 3: 183
Digit 4: 181
Digit 5: 182
Digit 6: 181
Digit 7: 179
Digit 8: 174
Digit 9: 180
Feature range: [0, 16]
Features represent: 8x8 pixel intensities (0-16 grayscale)
1,797 samples, 64 features, 10 classes, roughly balanced. This is a textbook SVM scenario: the dataset is small enough that the kernel matrix (1,797 x 1,797 = ~3.2 million entries) is trivial to compute, and the feature space is high-dimensional enough that a non-linear kernel can exploit complex structure.
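As a sanity check on that claim, a minimal sketch (using sklearn.metrics.pairwise.rbf_kernel, separate from the case study pipeline) materializes the full Gram matrix and measures its footprint; the gamma value here is arbitrary, since the point is size, not accuracy:

```python
# Minimal sketch: materialize the full 1,797 x 1,797 RBF kernel matrix.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.metrics.pairwise import rbf_kernel

X = load_digits().data                 # shape (1797, 64)
K = rbf_kernel(X, gamma=0.001)         # full n x n Gram matrix

print(K.shape)                         # (1797, 1797)
print(f"{K.nbytes / 1e6:.1f} MB")      # ~25.8 MB as float64
```

At roughly 26 MB in float64, the kernel matrix faces no memory pressure at this scale, which is exactly why the dataset suits kernel methods.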
Why This Problem Suits SVMs
Three properties make digit classification a strong SVM use case:
- Moderate sample size. 1,797 samples means the kernel matrix fits comfortably in memory. No scaling concerns.
- High-dimensional feature space. 64 features (pixels) create a space where non-linear kernels can find expressive decision boundaries. Digits that look similar in raw pixel space (3 vs. 8, 4 vs. 9) may become separable in the kernel-transformed space.
- Clear class structure. Handwritten digits have consistent geometric structure --- the strokes follow patterns. SVMs are good at finding boundaries that respect this structure, especially when the boundaries are not simple hyperplanes.
Establishing Baselines
Before tuning an SVM, establish reference points.
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import LinearSVC, SVC
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Baseline 1: Logistic Regression
lr_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression(max_iter=5000, random_state=42))
])
lr_scores = cross_val_score(lr_pipe, X_train, y_train, cv=5, scoring='accuracy')

# Baseline 2: KNN
knn_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', KNeighborsClassifier(n_neighbors=5))
])
knn_scores = cross_val_score(knn_pipe, X_train, y_train, cv=5, scoring='accuracy')

# Baseline 3: Linear SVM
lsvc_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearSVC(C=1.0, max_iter=10000, random_state=42))
])
lsvc_scores = cross_val_score(lsvc_pipe, X_train, y_train, cv=5, scoring='accuracy')
print("5-Fold CV Accuracy:")
print(f" Logistic Regression: {lr_scores.mean():.3f} (+/- {lr_scores.std():.3f})")
print(f" KNN (k=5): {knn_scores.mean():.3f} (+/- {knn_scores.std():.3f})")
print(f" Linear SVM: {lsvc_scores.mean():.3f} (+/- {lsvc_scores.std():.3f})")
5-Fold CV Accuracy:
Logistic Regression: 0.961 (+/- 0.009)
KNN (k=5): 0.981 (+/- 0.009)
Linear SVM: 0.957 (+/- 0.010)
KNN at 98.1% is the baseline to beat. Logistic regression and the linear SVM are competitive but slightly behind. The question: can a non-linear SVM (RBF kernel) close or eliminate the gap with KNN?
Tuning the RBF SVM
from sklearn.model_selection import GridSearchCV
rbf_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('svm', SVC(kernel='rbf', random_state=42))
])

param_grid = {
    'svm__C': [0.1, 1, 10, 100],
    'svm__gamma': ['scale', 0.001, 0.01, 0.1]
}

grid = GridSearchCV(
    rbf_pipe, param_grid, cv=5,
    scoring='accuracy', refit=True, n_jobs=-1
)
grid.fit(X_train, y_train)

print(f"Best parameters: {grid.best_params_}")
print(f"Best CV accuracy: {grid.best_score_:.4f}")

# Show the full grid
results = pd.DataFrame(grid.cv_results_)
pivot = results.pivot_table(
    values='mean_test_score',
    index='param_svm__C',
    columns='param_svm__gamma'
)
print("\nCV Accuracy by (C, gamma):")
print(pivot.round(4).to_string())
Best parameters: {'svm__C': 10, 'svm__gamma': 'scale'}
Best CV accuracy: 0.9903
CV Accuracy by (C, gamma):
param_svm__gamma 0.001 0.01 0.1 scale
param_svm__C
0.1 0.9345 0.9749 0.9749 0.9763
1 0.9749 0.9889 0.9826 0.9868
10 0.9805 0.9889 0.9756 0.9903
100 0.9812 0.9882 0.9582 0.9896
99.03% cross-validation accuracy. That beats KNN (98.1%) and logistic regression (96.1%). The best parameters are C=10 with gamma='scale' (which computes gamma as 1 / (n_features * X.var())).
Notice the grid: at gamma=0.1, high C values start to overfit (95.82% at C=100). At gamma=0.001, performance plateaus --- the kernel is too smooth. The sweet spot is in the middle.
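Because the SVC sits after a StandardScaler in the pipeline, gamma='scale' is evaluated on the standardized features, not the raw pixels. A minimal sketch (recomputing sklearn's formula by hand on the scaled digits data) shows what value the winning setting actually works out to:

```python
# Sketch: reproduce sklearn's gamma='scale' formula, 1 / (n_features * X.var()),
# on the standardized features the pipeline's SVC actually receives.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.preprocessing import StandardScaler

X = load_digits().data
X_scaled = StandardScaler().fit_transform(X)

gamma = 1.0 / (X_scaled.shape[1] * X_scaled.var())
print(f"gamma = {gamma:.4f}")   # roughly 1/64, since scaled features have ~unit variance
```

The result lands near 0.016 --- between the 0.01 and 0.1 grid points, consistent with the sweet spot being in the middle of the grid.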
Test Set Evaluation
from sklearn.metrics import classification_report, confusion_matrix
best_model = grid.best_estimator_
y_pred = best_model.predict(X_test)
print(f"Test accuracy: {best_model.score(X_test, y_test):.4f}")
print()
print(classification_report(y_test, y_pred))
Test accuracy: 0.9889
precision recall f1-score support
0 1.00 1.00 1.00 36
1 1.00 1.00 1.00 36
2 1.00 1.00 1.00 35
3 0.97 1.00 0.99 37
4 1.00 1.00 1.00 36
5 0.95 1.00 0.97 37
6 1.00 1.00 1.00 36
7 1.00 0.97 0.99 36
8 1.00 0.97 0.99 35
9 0.97 0.94 0.96 36
accuracy 0.99 360
macro avg 0.99 0.99 0.99 360
weighted avg 0.99 0.99 0.99 360
98.9% test accuracy. Only 4 errors out of 360 test samples. Let's look at what was confused:
cm = confusion_matrix(y_test, y_pred)
print("Confusion matrix (rows=true, cols=predicted):")
print(cm)
# Find misclassifications
errors = np.where(y_pred != y_test)[0]
print(f"\nMisclassified samples: {len(errors)}")
for idx in errors:
    print(f"  Sample {idx}: true={y_test[idx]}, predicted={y_pred[idx]}")
Confusion matrix (rows=true, cols=predicted):
[[36 0 0 0 0 0 0 0 0 0]
[ 0 36 0 0 0 0 0 0 0 0]
[ 0 0 35 0 0 0 0 0 0 0]
[ 0 0 0 37 0 0 0 0 0 0]
[ 0 0 0 0 36 0 0 0 0 0]
[ 0 0 0 0 0 37 0 0 0 0]
[ 0 0 0 0 0 0 36 0 0 0]
[ 0 0 0 0 0 0 0 35 0 1]
[ 0 0 0 1 0 0 0 0 34 0]
[ 0 0 0 0 0 2 0 0 0 34]]
Misclassified samples: 4
Sample 82: true=7, predicted=9
Sample 222: true=8, predicted=3
Sample 307: true=9, predicted=5
Sample 329: true=9, predicted=5
The errors are reasonable: 7 confused with 9 (similar shapes), 8 confused with 3, 9 confused with 5. These are the genuinely ambiguous digits, especially in low-resolution 8x8 images.
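To eyeball ambiguous digits like these, each 64-feature row can be reshaped back into its 8x8 image. A self-contained sketch (a crude ASCII rendering rather than matplotlib, purely illustrative):

```python
# Illustrative sketch: reshape one sample's 64 features into its 8x8 image
# and render it as ASCII (denser character = darker pixel).
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()
img = digits.data[0].reshape(8, 8)     # first sample, a handwritten 0

chars = " .:-=+*#%@"                   # 10 intensity levels for pixel values 0-16
for row in img:
    print("".join(chars[int(v) * 9 // 16] for v in row))
print(f"label: {digits.target[0]}")
```

Substituting the misclassified indices from the test set for `digits.data[0]` makes it easy to see why a low-resolution 9 can pass for a 5.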
Model Diagnostics
svm_model = best_model.named_steps['svm']
print(f"Total support vectors: {svm_model.n_support_.sum()}")
print(f"Training set size: {len(X_train)}")
print(f"Support vector percentage: {svm_model.n_support_.sum() / len(X_train) * 100:.1f}%")
print(f"\nSupport vectors per class:")
for digit, n_sv in enumerate(svm_model.n_support_):
    print(f"  Digit {digit}: {n_sv}")
Total support vectors: 462
Training set size: 1437
Support vector percentage: 32.1%
Support vectors per class:
Digit 0: 26
Digit 1: 51
Digit 2: 45
Digit 3: 57
Digit 4: 43
Digit 5: 50
Digit 6: 36
Digit 7: 47
Digit 8: 60
Digit 9: 47
32% of training points are support vectors. This is a healthy number for a 10-class problem --- recall that one-vs-one trains 45 binary classifiers, and different classifiers select different support vectors. Class 8 has the most support vectors (60), which makes sense: 8 is the most visually complex digit and overlaps with 3, 6, and 9.
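The one-vs-one structure can be made visible directly. A small sketch (refitting an untuned SVC on the full dataset with decision_function_shape='ovo', so the counts differ from the tuned pipeline):

```python
# Sketch: expose SVC's one-vs-one structure. With 10 classes there are
# 10 * 9 / 2 = 45 pairwise binary classifiers under the hood.
from sklearn.datasets import load_digits
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
svm = SVC(kernel='rbf', decision_function_shape='ovo').fit(X, y)

print(svm.decision_function(X[:1]).shape)  # (1, 45): one score per class pair
print(len(svm.n_support_))                 # 10: support-vector counts per class
```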
Key Takeaways from This Case Study
- SVMs excel on small, high-dimensional datasets. 1,797 samples with 64 features is the SVM sweet spot. The kernel matrix is trivial to compute, and the RBF kernel can exploit non-linear structure that linear models miss.
- The RBF kernel outperformed the linear kernel by 3.3 percentage points (99.0% vs. 95.7%). For digit classification, the decision boundaries between classes are not linear --- they require curves that separate similar-looking digits.
- Grid search over C and gamma is essential. The best parameters (C=10, gamma='scale') were not the defaults. Without tuning, the default SVM achieves ~98.7% --- still good, but the grid search found the extra 0.3 points.
- Support vector analysis reveals class difficulty. Digit 8 required the most support vectors, confirming what visual inspection suggests: 8 is the hardest digit to distinguish from its neighbors.
- For this dataset size, SVMs are competitive with any algorithm. On the full MNIST (70,000 samples, 784 features), you would want a neural network. On 1,797 samples, the SVM is hard to beat.
Discussion Questions
- If this dataset had 100,000 samples instead of 1,797, would you still use an SVM? What alternative would you choose, and why?
- The features are raw pixel intensities. If you applied PCA to reduce from 64 to 20 features, would you expect the SVM to perform better, worse, or about the same? Why?
- This dataset has balanced classes. If digit 9 had only 20 samples while all other digits had 180, how would you modify the SVM approach?
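For the class-imbalance question, one common lever (a hedged starting point, not a full answer) is SVC's class_weight='balanced' option; the subsampling and seed below are purely illustrative:

```python
# Hypothetical sketch for the imbalance scenario: subsample digit 9 down to
# 20 examples, then let class_weight='balanced' rescale C per class by
# inverse class frequency, so errors on the rare digit cost more.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

rng = np.random.default_rng(0)
nines = np.flatnonzero(y == 9)
drop = rng.choice(nines, size=len(nines) - 20, replace=False)
keep = np.setdiff1d(np.arange(len(y)), drop)
X_imb, y_imb = X[keep], y[keep]

svm = SVC(kernel='rbf', class_weight='balanced').fit(X_imb, y_imb)
print(np.bincount(y_imb))   # digit 9 now has only 20 samples
```

Stratified splitting and per-class metrics (recall on digit 9 in particular) would matter more than overall accuracy in that setting.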
This case study supports Chapter 12: Support Vector Machines. Return to the chapter for the full treatment of SVM theory and practice.