Case Study 2: SVM vs. Gradient Boosting --- The Showdown on Tabular Data
Background
This is the match most practitioners want to see. On one side: the support vector machine, armed with geometric elegance and the kernel trick. On the other: gradient boosting, the reigning champion of tabular data competitions, beloved by Kaggle winners and production ML teams alike.
The question is not "which is better in theory?" The question is: on a real dataset, with reasonable tuning effort, which wins, and by how much?
We will use the heart disease dataset --- a tabular classification problem with mixed features, moderate size, and clinical relevance. This is the kind of problem data scientists encounter constantly: predict a binary outcome from a handful of structured features. It is gradient boosting's home turf. Let's see if the SVM can compete.
The Data
import numpy as np
import pandas as pd
from sklearn.datasets import fetch_openml
# Heart disease dataset (Statlog version)
# 270 samples, 13 features, binary target
heart = fetch_openml(name='heart-statlog', version=1, as_frame=True, parser='auto')
X = heart.data
y = (heart.target == '2').astype(int) # 1 = disease, 0 = no disease
print(f"Shape: {X.shape}")
print(f"Class distribution: {y.value_counts().to_dict()}")
print(f"\nFeatures:")
print(X.dtypes.to_string())
print(f"\nSample statistics:")
print(X.describe().round(2).to_string())
Shape: (270, 13)
Class distribution: {0: 150, 1: 120}
Features:
age float64
resting_bp float64
cholesterol float64
max_hr float64
oldpeak float64
sex category
chest_pain category
fasting_bs category
rest_ecg category
exang category
slope category
thal category
vessels category
Sample statistics:
age resting_bp cholesterol max_hr oldpeak
count 270.0 270.00 270.00 270.00 270.00
mean 54.4 131.34 246.69 149.68 1.05
std 9.0 17.81 51.78 23.17 1.15
min 29.0 94.00 126.00 71.00 0.00
25% 48.0 120.00 211.00 133.75 0.00
50% 55.0 130.00 243.50 153.50 0.80
75% 61.0 140.00 280.00 166.00 1.60
max 77.0 200.00 564.00 202.00 6.20
270 samples. 13 features: 5 numeric, 8 categorical. Binary target (heart disease present/absent). This is small data --- the SVM should be competitive here.
Data Preparation
SVMs need numeric features and scaling. Some gradient boosting implementations (LightGBM, scikit-learn's HistGradientBoostingClassifier) handle categorical features natively, but the plain GradientBoostingClassifier we use below does not. So every model here shares the same preprocessor: scale the numeric columns, one-hot encode the categorical ones.
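Before preparing the real features, a quick sketch of the first claim. This uses synthetic data, and the column choice and the 10,000x blow-up factor are illustrative assumptions, not anything from the heart dataset: one noise column with a huge scale dominates the RBF kernel's distance computation, and accuracy collapses until we standardize.

```python
# Sketch: why SVMs need feature scaling (synthetic, illustrative data).
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# shuffle=False keeps informative columns first; the last column is pure noise
X_demo, y_demo = make_classification(n_samples=400, n_features=10,
                                     n_informative=5, shuffle=False,
                                     random_state=0)
X_demo[:, -1] *= 10_000  # blow up a non-informative column's scale

raw_acc = cross_val_score(SVC(kernel='rbf'), X_demo, y_demo, cv=5).mean()
scaled = Pipeline([('scaler', StandardScaler()), ('svm', SVC(kernel='rbf'))])
scaled_acc = cross_val_score(scaled, X_demo, y_demo, cv=5).mean()

print(f"unscaled: {raw_acc:.3f}, scaled: {scaled_acc:.3f}")
```

Tree ensembles are immune to this failure mode, which is one reason gradient boosting is forgiving of sloppy preprocessing.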
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# Identify column types
numeric_features = X.select_dtypes(include=['float64']).columns.tolist()
categorical_features = X.select_dtypes(include=['category']).columns.tolist()
print(f"Numeric features ({len(numeric_features)}): {numeric_features}")
print(f"Categorical features ({len(categorical_features)}): {categorical_features}")
# Preprocessor: scale numeric, one-hot encode categorical
preprocessor = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), numeric_features),
        ('cat', OneHotEncoder(drop='if_binary', sparse_output=False), categorical_features)
    ]
)
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(f"\nTrain: {X_train.shape[0]} samples")
print(f"Test: {X_test.shape[0]} samples")
Numeric features (5): ['age', 'resting_bp', 'cholesterol', 'max_hr', 'oldpeak']
Categorical features (8): ['sex', 'chest_pain', 'fasting_bs', 'rest_ecg', 'exang', 'slope', 'thal', 'vessels']
Train: 216 samples
Test: 54 samples
The Contenders
Contender 1: Linear SVM
from sklearn.svm import LinearSVC
from sklearn.model_selection import GridSearchCV, cross_val_score
linear_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('svm', LinearSVC(max_iter=10000, random_state=42))
])
linear_param_grid = {'svm__C': [0.01, 0.1, 1, 10, 100]}
linear_grid = GridSearchCV(
    linear_pipe, linear_param_grid, cv=5,
    scoring='accuracy', refit=True, n_jobs=-1
)
linear_grid.fit(X_train, y_train)
print(f"Linear SVM --- Best C: {linear_grid.best_params_['svm__C']}")
print(f"Linear SVM --- Best CV accuracy: {linear_grid.best_score_:.3f}")
Linear SVM --- Best C: 0.1
Linear SVM --- Best CV accuracy: 0.838
Contender 2: RBF SVM
from sklearn.svm import SVC
rbf_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('svm', SVC(kernel='rbf', random_state=42))
])
rbf_param_grid = {
    'svm__C': [0.1, 1, 10, 100],
    'svm__gamma': ['scale', 0.01, 0.1, 1]
}
rbf_grid = GridSearchCV(
    rbf_pipe, rbf_param_grid, cv=5,
    scoring='accuracy', refit=True, n_jobs=-1
)
rbf_grid.fit(X_train, y_train)
print(f"RBF SVM --- Best params: {rbf_grid.best_params_}")
print(f"RBF SVM --- Best CV accuracy: {rbf_grid.best_score_:.3f}")
RBF SVM --- Best params: {'svm__C': 10, 'svm__gamma': 'scale'}
RBF SVM --- Best CV accuracy: 0.847
Contender 3: Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
# Using scikit-learn's GradientBoostingClassifier as a stand-in
# (LightGBM would be faster on larger data, but overkill here)
gb_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('gb', GradientBoostingClassifier(random_state=42))
])
gb_param_grid = {
    'gb__n_estimators': [50, 100, 200],
    'gb__max_depth': [3, 5, 7],
    'gb__learning_rate': [0.01, 0.1, 0.2]
}
gb_grid = GridSearchCV(
    gb_pipe, gb_param_grid, cv=5,
    scoring='accuracy', refit=True, n_jobs=-1
)
gb_grid.fit(X_train, y_train)
print(f"Gradient Boosting --- Best params: {gb_grid.best_params_}")
print(f"Gradient Boosting --- Best CV accuracy: {gb_grid.best_score_:.3f}")
Gradient Boosting --- Best params: {'gb__learning_rate': 0.1, 'gb__max_depth': 3, 'gb__n_estimators': 100}
Gradient Boosting --- Best CV accuracy: 0.843
Contender 4: Logistic Regression (Baseline)
from sklearn.linear_model import LogisticRegression
lr_pipe = Pipeline([
    ('preprocessor', preprocessor),
    ('lr', LogisticRegression(max_iter=5000, random_state=42))
])
lr_param_grid = {'lr__C': [0.01, 0.1, 1, 10, 100]}
lr_grid = GridSearchCV(
    lr_pipe, lr_param_grid, cv=5,
    scoring='accuracy', refit=True, n_jobs=-1
)
lr_grid.fit(X_train, y_train)
print(f"Logistic Regression --- Best C: {lr_grid.best_params_['lr__C']}")
print(f"Logistic Regression --- Best CV accuracy: {lr_grid.best_score_:.3f}")
Logistic Regression --- Best C: 0.1
Logistic Regression --- Best CV accuracy: 0.838
The Scoreboard
from sklearn.metrics import accuracy_score, f1_score, classification_report
models = {
'Logistic Regression': lr_grid,
'Linear SVM': linear_grid,
'RBF SVM': rbf_grid,
'Gradient Boosting': gb_grid,
}
print(f"{'Model':<25} {'CV Accuracy':>12} {'Test Accuracy':>14} {'Test F1':>10}")
print("-" * 65)
for name, grid in models.items():
    y_pred = grid.predict(X_test)
    test_acc = accuracy_score(y_test, y_pred)
    test_f1 = f1_score(y_test, y_pred)
    print(f"{name:<25} {grid.best_score_:>12.3f} {test_acc:>14.3f} {test_f1:>10.3f}")
Model CV Accuracy Test Accuracy Test F1
-----------------------------------------------------------------
Logistic Regression 0.838 0.852 0.818
Linear SVM 0.838 0.852 0.818
RBF SVM 0.847 0.870 0.842
Gradient Boosting 0.843 0.852 0.826
Analysis: What Happened?
The results are close. All four models are within 2 percentage points of each other on test accuracy. But look at the pattern:
- RBF SVM won, by a slim margin: 87.0% test accuracy vs. 85.2% for the others. On 54 test samples, that difference is one sample. Not statistically significant. But the SVM did not lose.
- Gradient boosting did not dominate. With only 216 training points, it cannot build the deep ensemble it thrives on; a 200-tree ensemble would be memorizing noise. The grid search correctly selected a conservative model (100 trees, depth 3, learning rate 0.1).
- The linear models (logistic regression and linear SVM) were competitive. With only 270 samples and 13 features, there may not be enough non-linear structure for the RBF kernel to exploit. The linear boundary is almost as good.
- The SVM's advantage is modest: one or two extra correct predictions out of 54. On a dataset this small, rerun with a different random seed and the rankings might shuffle.
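The seed-sensitivity point is easy to check directly. A minimal sketch, using synthetic data at the same 270-sample scale as a stand-in for the heart dataset (the dataset itself and the fixed hyperparameters are assumptions): refit both finalists across 20 different train/test splits and count who wins.

```python
# Sketch: how often does the "winner" flip across random train/test splits?
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_sim, y_sim = make_classification(n_samples=270, n_features=13,
                                   n_informative=8, random_state=0)

svm = Pipeline([('scaler', StandardScaler()),
                ('svm', SVC(kernel='rbf', C=10, gamma='scale'))])
gb = GradientBoostingClassifier(n_estimators=100, max_depth=3, random_state=0)

wins = {'SVM': 0, 'GB': 0, 'tie': 0}
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_sim, y_sim, test_size=0.2, stratify=y_sim, random_state=seed)
    acc_svm = accuracy_score(y_te, svm.fit(X_tr, y_tr).predict(X_te))
    acc_gb = accuracy_score(y_te, gb.fit(X_tr, y_tr).predict(X_te))
    if acc_svm > acc_gb:
        wins['SVM'] += 1
    elif acc_gb > acc_svm:
        wins['GB'] += 1
    else:
        wins['tie'] += 1

print(wins)  # the split of wins, losses, and ties across 20 seeds
```

On data this small, expect the tally to be mixed rather than one-sided: a single-split "victory" is mostly noise.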
The Deeper Lesson: Dataset Size Matters
This showdown has a clear moral: the advantage of gradient boosting grows with dataset size.
# Simulate the effect of dataset size
# (Using synthetic data to illustrate the principle)
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
results = []
for n_samples in [100, 300, 1000, 3000, 10000]:
    X_sim, y_sim = make_classification(
        n_samples=n_samples, n_features=15, n_informative=10,
        n_redundant=3, n_clusters_per_class=2,
        flip_y=0.1, random_state=42
    )
    svm_pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('svm', SVC(kernel='rbf', C=10, gamma='scale', random_state=42))
    ])
    gb_model = GradientBoostingClassifier(
        n_estimators=100, max_depth=5, learning_rate=0.1, random_state=42
    )
    svm_scores = cross_val_score(svm_pipe, X_sim, y_sim, cv=5, scoring='accuracy')
    gb_scores = cross_val_score(gb_model, X_sim, y_sim, cv=5, scoring='accuracy')
    results.append({
        'n_samples': n_samples,
        'SVM (RBF)': svm_scores.mean(),
        'Gradient Boosting': gb_scores.mean(),
        'Winner': 'SVM' if svm_scores.mean() > gb_scores.mean() else 'GB'
    })
results_df = pd.DataFrame(results)
print(results_df.to_string(index=False))
n_samples SVM (RBF) Gradient Boosting Winner
100 0.850 0.810 SVM
300 0.883 0.877 SVM
1000 0.907 0.913 GB
3000 0.918 0.932 GB
10000 0.924 0.945 GB
The crossover point is around 500-1,000 samples. Below that, the SVM's margin-based geometry handles small data gracefully. Above that, gradient boosting's ability to build deep, additive models pays off.
The practical rule: If you have fewer than ~1,000 samples and clean numeric features, try an SVM. If you have more than ~5,000 samples, gradient boosting will almost certainly win. Between 1,000 and 5,000, both are worth trying.
When to Pick Which
| Factor | Favor SVM | Favor Gradient Boosting |
|---|---|---|
| Dataset size | n < 1,000 | n > 5,000 |
| Feature types | All numeric | Mixed (numeric + categorical) |
| Feature scaling | Already scaled or easy to scale | Not required (trees are scale-invariant) |
| Interpretability | Not needed | Feature importance helpful |
| Probability calibration | Not critical | Needed |
| Training time | Acceptable for small n | Needs to be fast at any n |
| Decision boundary | Clean geometric separation | Complex, ensemble-based |
| Hyperparameter tuning | Willing to tune C and gamma | Fewer critical hyperparameters |
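The probability-calibration row deserves a note. An SVM's `decision_function` returns signed margins, not probabilities; if a deployment needs risk scores, scikit-learn's `CalibratedClassifierCV` can wrap the SVM pipeline and fit a sigmoid (Platt scaling) on held-out folds. A minimal sketch on synthetic stand-in data:

```python
# Sketch: calibrated probabilities from an SVM via CalibratedClassifierCV.
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_sim, y_sim = make_classification(n_samples=300, n_features=13, random_state=0)

svm = Pipeline([('scaler', StandardScaler()),
                ('svm', SVC(kernel='rbf', C=10, gamma='scale'))])
# method='sigmoid' fits Platt scaling on the SVM's decision scores
calibrated = CalibratedClassifierCV(svm, method='sigmoid', cv=5)
calibrated.fit(X_sim, y_sim)

proba = calibrated.predict_proba(X_sim[:5])
print(proba.round(3))  # one (P(class 0), P(class 1)) row per sample
```

This is cheaper and usually better calibrated than `SVC(probability=True)`, which runs its own internal cross-validation on every fit. Gradient boosting's probabilities come for free but still benefit from a calibration check.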
The Verdict
On the heart disease dataset (270 samples, mixed features), the RBF SVM and gradient boosting are effectively tied. The SVM edges ahead by a statistically insignificant margin. In a real clinical setting, the difference between these models would be invisible --- you would choose based on interpretability, deployment constraints, and what the clinical team trusts.
The broader point: SVMs are not obsolete. They occupy a specific ecological niche --- small data, numeric features, complex boundaries --- where they remain competitive. Gradient boosting is the generalist that wins almost everywhere else. A good practitioner knows both and chooses based on the problem, not on what's trendy.
Discussion Questions
- On this 270-sample dataset, all models performed similarly. If you could collect 10,000 more patient records, which model would you expect to benefit most? Why?
- The heart disease dataset has 8 categorical features that required one-hot encoding for the SVM. Does this encoding put the SVM at a disadvantage compared to a tree-based method that handles categories natively? Why or why not?
- In a clinical setting, a doctor might ask: "Why did the model flag this patient?" Which of the four models in this comparison can best answer that question? How would you handle this requirement if the best-performing model is the least interpretable?
- The test set has only 54 samples. How confident are you in the test accuracy numbers reported here? What would you do to get a more reliable estimate?
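One concrete tool for the last question is a bootstrap confidence interval over the test predictions. The sketch below uses simulated predictions as a stand-in for the real 54-sample test split; the 87% accuracy figure is taken from the scoreboard, everything else is assumed for illustration.

```python
# Sketch: bootstrap CI for accuracy measured on a small (54-sample) test set.
import numpy as np

rng = np.random.default_rng(0)
n_test = 54
y_true = rng.integers(0, 2, size=n_test)
correct = rng.random(n_test) < 0.87          # simulate an ~87%-accurate model
y_pred = np.where(correct, y_true, 1 - y_true)

boot_accs = []
for _ in range(10_000):
    idx = rng.integers(0, n_test, size=n_test)   # resample test cases
    boot_accs.append((y_true[idx] == y_pred[idx]).mean())

lo, hi = np.percentile(boot_accs, [2.5, 97.5])
print(f"accuracy = {(y_true == y_pred).mean():.3f}, 95% CI ~ [{lo:.3f}, {hi:.3f}]")
```

The interval spans many percentage points, which is the honest way to read a 1.8-point gap between models: with 54 test samples, the noise is larger than the signal.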
This case study supports Chapter 12: Support Vector Machines. Return to the chapter for the full treatment of SVM theory and practice.