Case Study 2: SVM vs. Gradient Boosting --- The Showdown on Tabular Data


Background

This is the match most practitioners want to see. On one side: the support vector machine, armed with geometric elegance and the kernel trick. On the other: gradient boosting, the reigning champion of tabular data competitions, beloved by Kaggle winners and production ML teams alike.

The question is not "which is better in theory?" The question is: on a real dataset, with reasonable tuning effort, which wins, and by how much?

We will use the heart disease dataset --- a tabular classification problem with mixed features, moderate size, and clinical relevance. This is the kind of problem data scientists encounter constantly: predict a binary outcome from a handful of structured features. It is gradient boosting's home turf. Let's see if the SVM can compete.


The Data

import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml

# Heart disease dataset (Statlog version)
# 270 samples, 13 features, binary target
heart = fetch_openml(name='heart-statlog', version=1, as_frame=True, parser='auto')
X = heart.data
y = (heart.target == '2').astype(int)  # 1 = disease, 0 = no disease

print(f"Shape: {X.shape}")
print(f"Class distribution: {y.value_counts().to_dict()}")
print(f"\nFeatures:")
print(X.dtypes.to_string())
print(f"\nSample statistics:")
print(X.describe().round(2).to_string())
Shape: (270, 13)
Class distribution: {0: 150, 1: 120}

Features:
age         float64
resting_bp  float64
cholesterol float64
max_hr      float64
oldpeak     float64
sex         category
chest_pain  category
fasting_bs  category
rest_ecg    category
exang       category
slope       category
thal        category
vessels     category

Sample statistics:
         age  resting_bp  cholesterol  max_hr  oldpeak
count  270.0      270.00       270.00  270.00   270.00
mean    54.4      131.34       246.69  149.68     1.05
std      9.0       17.81        51.78   23.17     1.15
min     29.0       94.00       126.00   71.00     0.00
25%     48.0      120.00       211.00  133.75     0.00
50%     55.0      130.00       243.50  153.50     0.80
75%     61.0      140.00       280.00  166.00     1.60
max     77.0      200.00       564.00  202.00     6.20

270 samples. 13 features: 5 numeric, 8 categorical. Binary target (heart disease present/absent). This is small data --- the SVM should be competitive here.


Data Preparation

SVMs require numeric features on a common scale. Some gradient boosting implementations (e.g. LightGBM, or scikit-learn's HistGradientBoostingClassifier) handle categorical features natively, but the classic GradientBoostingClassifier used below does not. So all four models share one preprocessing pipeline: scale the numeric columns, one-hot encode the categorical ones.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# Identify column types
numeric_features = X.select_dtypes(include=['float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['category']).columns.tolist()

print(f"Numeric features ({len(numeric_features)}): {numeric_features}")
print(f"Categorical features ({len(categorical_features)}): {categorical_features}")

# Preprocessor: scale numeric, one-hot encode categorical
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(drop='if_binary', sparse_output=False), categorical_features)
    ]
)

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

print(f"\nTrain: {X_train.shape[0]} samples")
print(f"Test:  {X_test.shape[0]} samples")
Numeric features (5): ['age', 'resting_bp', 'cholesterol', 'max_hr', 'oldpeak']
Categorical features (8): ['sex', 'chest_pain', 'fasting_bs', 'rest_ecg', 'exang', 'slope', 'thal', 'vessels']

Train: 216 samples
Test:  54 samples

The Contenders

Contender 1: Linear SVM

from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV, cross_val_score

linear_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('svm', LinearSVC(max_iter=10000, random_state=42))
])

linear_param_grid = {'svm__C': [0.01, 0.1, 1, 10, 100]}

linear_grid = GridSearchCV(
    linear_pipe, linear_param_grid, cv=5,
    scoring='accuracy', refit=True, n_jobs=-1
)
linear_grid.fit(X_train, y_train)

print(f"Linear SVM --- Best C: {linear_grid.best_params_['svm__C']}")
print(f"Linear SVM --- Best CV accuracy: {linear_grid.best_score_:.3f}")
Linear SVM --- Best C: 0.1
Linear SVM --- Best CV accuracy: 0.838

Contender 2: RBF SVM

from sklearn.svm import SVC

rbf_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('svm', SVC(kernel='rbf', random_state=42))
])

rbf_param_grid = {
    'svm__C': [0.1, 1, 10, 100],
    'svm__gamma': ['scale', 0.01, 0.1, 1]
}

rbf_grid = GridSearchCV(
    rbf_pipe, rbf_param_grid, cv=5,
    scoring='accuracy', refit=True, n_jobs=-1
)
rbf_grid.fit(X_train, y_train)

print(f"RBF SVM --- Best params: {rbf_grid.best_params_}")
print(f"RBF SVM --- Best CV accuracy: {rbf_grid.best_score_:.3f}")
RBF SVM --- Best params: {'svm__C': 10, 'svm__gamma': 'scale'}
RBF SVM --- Best CV accuracy: 0.847

Contender 3: Gradient Boosting

from sklearn.ensemble import GradientBoostingClassifier

# Using scikit-learn's GradientBoostingClassifier as a stand-in
# (LightGBM would be faster on larger data, but overkill here)
gb_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('gb', GradientBoostingClassifier(random_state=42))
])

gb_param_grid = {
    'gb__n_estimators': [50, 100, 200],
    'gb__max_depth': [3, 5, 7],
    'gb__learning_rate': [0.01, 0.1, 0.2]
}

gb_grid = GridSearchCV(
    gb_pipe, gb_param_grid, cv=5,
    scoring='accuracy', refit=True, n_jobs=-1
)
gb_grid.fit(X_train, y_train)

print(f"Gradient Boosting --- Best params: {gb_grid.best_params_}")
print(f"Gradient Boosting --- Best CV accuracy: {gb_grid.best_score_:.3f}")
Gradient Boosting --- Best params: {'gb__learning_rate': 0.1, 'gb__max_depth': 3, 'gb__n_estimators': 100}
Gradient Boosting --- Best CV accuracy: 0.843

Contender 4: Logistic Regression (Baseline)

from sklearn.linear_model import LogisticRegression

lr_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('lr', LogisticRegression(max_iter=5000, random_state=42))
])

lr_param_grid = {'lr__C': [0.01, 0.1, 1, 10, 100]}

lr_grid = GridSearchCV(
    lr_pipe, lr_param_grid, cv=5,
    scoring='accuracy', refit=True, n_jobs=-1
)
lr_grid.fit(X_train, y_train)

print(f"Logistic Regression --- Best C: {lr_grid.best_params_['lr__C']}")
print(f"Logistic Regression --- Best CV accuracy: {lr_grid.best_score_:.3f}")
Logistic Regression --- Best C: 0.1
Logistic Regression --- Best CV accuracy: 0.838

The Scoreboard

from sklearn.metrics import accuracy_score, f1_score, classification_report

models = {
    'Logistic Regression': lr_grid,
    'Linear SVM': linear_grid,
    'RBF SVM': rbf_grid,
    'Gradient Boosting': gb_grid,
}

print(f"{'Model':<25} {'CV Accuracy':>12} {'Test Accuracy':>14} {'Test F1':>10}")
print("-" * 65)

for name, grid in models.items():
    y_pred = grid.predict(X_test)
    test_acc = accuracy_score(y_test, y_pred)
    test_f1 = f1_score(y_test, y_pred)
    print(f"{name:<25} {grid.best_score_:>12.3f} {test_acc:>14.3f} {test_f1:>10.3f}")
Model                      CV Accuracy  Test Accuracy    Test F1
-----------------------------------------------------------------
Logistic Regression              0.838          0.852      0.818
Linear SVM                       0.838          0.852      0.818
RBF SVM                          0.847          0.870      0.842
Gradient Boosting                0.843          0.852      0.826

Analysis: What Happened?

The results are close. All four models are within 2 percentage points of each other on test accuracy. But look at the pattern:

  1. RBF SVM won --- by a slim margin. 87.0% test accuracy vs. 85.2% for the others. On 54 test samples, that difference is one sample. Not statistically significant. But the SVM did not lose.

  2. Gradient boosting did not dominate. On 270 samples, gradient boosting cannot build the deep ensemble it needs. With only 216 training points, a 200-tree ensemble is memorizing noise. The grid search correctly selected a conservative model (100 trees, depth 3, learning rate 0.1).

  3. The linear models (logistic regression and linear SVM) were competitive. With only 270 samples and 13 features, there may not be enough non-linear structure for the RBF kernel to exploit. The linear boundary is almost as good.

  4. The SVM's advantage is modest. One or two extra correct predictions out of 54. On a dataset this small, rerun with a different random seed and the rankings might shuffle.
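Point 4 is easy to check empirically. A sketch of the reshuffling effect, using scikit-learn's built-in breast cancer data subsampled to 270 rows as a stand-in (the point is small-sample variance, not this particular dataset), with untuned stand-in hyperparameters:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Subsample to 270 rows to mimic the heart data's size
X_bc, y_bc = load_breast_cancer(return_X_y=True)
idx = np.random.RandomState(0).choice(len(y_bc), size=270, replace=False)
X_bc, y_bc = X_bc[idx], y_bc[idx]

svm_wins = 0
for seed in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_bc, y_bc, test_size=0.2, stratify=y_bc, random_state=seed
    )
    svm = Pipeline([
        ("scaler", StandardScaler()),
        ("svm", SVC(kernel="rbf", C=10, gamma="scale")),
    ]).fit(X_tr, y_tr)
    gb = GradientBoostingClassifier(random_state=seed).fit(X_tr, y_tr)
    svm_acc, gb_acc = svm.score(X_te, y_te), gb.score(X_te, y_te)
    if svm_acc > gb_acc:
        svm_wins += 1
    print(f"seed={seed}: SVM {svm_acc:.3f}  GB {gb_acc:.3f}")

print(f"SVM wins {svm_wins}/10 random splits")
```

With 54-sample test sets, the per-split winner flips depending on the seed; neither model wins every split.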


The Deeper Lesson: Dataset Size Matters

This showdown has a clear moral: the advantage of gradient boosting grows with dataset size.

# Simulate the effect of dataset size
# (Using synthetic data to illustrate the principle)
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

results = []
for n_samples in [100, 300, 1000, 3000, 10000]:
    X_sim, y_sim = make_classification(
        n_samples=n_samples, n_features=15, n_informative=10,
        n_redundant=3, n_clusters_per_class=2,
        flip_y=0.1, random_state=42
    )

    svm_pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('svm', SVC(kernel='rbf', C=10, gamma='scale', random_state=42))
    ])

    gb_model = GradientBoostingClassifier(
        n_estimators=100, max_depth=5, learning_rate=0.1, random_state=42
    )

    svm_scores = cross_val_score(svm_pipe, X_sim, y_sim, cv=5, scoring='accuracy')
    gb_scores = cross_val_score(gb_model, X_sim, y_sim, cv=5, scoring='accuracy')

    results.append({
        'n_samples': n_samples,
        'SVM (RBF)': svm_scores.mean(),
        'Gradient Boosting': gb_scores.mean(),
        'Winner': 'SVM' if svm_scores.mean() > gb_scores.mean() else 'GB'
    })

results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))
 n_samples  SVM (RBF)  Gradient Boosting Winner
       100      0.850              0.810    SVM
       300      0.883              0.877    SVM
      1000      0.907              0.913     GB
      3000      0.918              0.932     GB
     10000      0.924              0.945     GB

The crossover point is around 500-1,000 samples. Below that, the SVM's margin-based geometry handles small data gracefully. Above that, gradient boosting's ability to build deep, additive models pays off.

The practical rule: If you have fewer than ~1,000 samples and clean numeric features, try an SVM. If you have more than ~5,000 samples, gradient boosting will almost certainly win. Between 1,000 and 5,000, both are worth trying.
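The rule of thumb can be written down directly. The thresholds are heuristics from this case study, not hard cutoffs, and the function is our own illustrative helper:

```python
# Hypothetical helper encoding the rule of thumb above. The 1,000 and
# 5,000 thresholds are heuristics, not hard rules.
def suggest_model(n_samples: int, all_numeric: bool) -> str:
    if n_samples < 1000 and all_numeric:
        return "try SVM first"
    if n_samples > 5000:
        return "gradient boosting"
    return "try both"

print(suggest_model(270, False))   # try both (small, but mixed features)
print(suggest_model(500, True))    # try SVM first
print(suggest_model(20000, True))  # gradient boosting
```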


When to Pick Which

Factor                    Favor SVM                          Favor Gradient Boosting
------------------------  ---------------------------------  ------------------------------
Dataset size              n < 1,000                          n > 5,000
Feature types             All numeric                        Mixed (numeric + categorical)
Feature scaling           Already scaled or easy to scale    Prefer no scaling requirement
Interpretability          Not needed                         Feature importance helpful
Probability calibration   Not critical                       Needed
Training time             Acceptable for small n             Needs to be fast at any n
Decision boundary         Clean geometric separation         Complex, ensemble-based
Hyperparameter tuning     Willing to tune C and gamma        Fewer critical hyperparameters

The Verdict

On the heart disease dataset (270 samples, mixed features), the RBF SVM and gradient boosting are effectively tied. The SVM edges ahead by a statistically insignificant margin. In a real clinical setting, the difference between these models would be invisible --- you would choose based on interpretability, deployment constraints, and what the clinical team trusts.
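"Statistically insignificant" is worth quantifying. A normal-approximation (Wald) 95% interval on each test accuracy, using the scoreboard numbers and the 54-sample test set:

```python
import math

# 95% Wald interval: acc +/- 1.96 * sqrt(acc * (1 - acc) / n)
n_test = 54
for name, acc in [("RBF SVM", 0.870), ("Others", 0.852)]:
    half_width = 1.96 * math.sqrt(acc * (1 - acc) / n_test)
    print(f"{name:>8}: {acc:.3f} +/- {half_width:.3f} "
          f"-> [{acc - half_width:.3f}, {acc + half_width:.3f}]")
```

Both intervals are roughly +/- 0.09 wide and overlap almost completely, which is the quantitative version of "effectively tied."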

The broader point: SVMs are not obsolete. They occupy a specific ecological niche --- small data, numeric features, complex boundaries --- where they remain competitive. Gradient boosting is the generalist that wins almost everywhere else. A good practitioner knows both and chooses based on the problem, not on what's trendy.


Discussion Questions

  1. On this 270-sample dataset, all models performed similarly. If you could collect 10,000 more patient records, which model would you expect to benefit most? Why?

  2. The heart disease dataset has 8 categorical features that required one-hot encoding for the SVM. Does this encoding put the SVM at a disadvantage compared to a tree-based method that handles categories natively? Why or why not?

  3. In a clinical setting, a doctor might ask: "Why did the model flag this patient?" Which of the four models in this comparison can best answer that question? How would you handle this requirement if the best-performing model is the least interpretable?

  4. The test set has only 54 samples. How confident are you in the test accuracy numbers reported here? What would you do to get a more reliable estimate?


This case study supports Chapter 12: Support Vector Machines. Return to the chapter for the full treatment of SVM theory and practice.