# Chapter 12 Exercises: Support Vector Machines
## Exercise 1: Margin and Support Vectors (Conceptual)
Consider a 2D dataset with the following six points:
| Point | x1 | x2 | Class |
|---|---|---|---|
| A | 1 | 3 | +1 |
| B | 2 | 2 | +1 |
| C | 3 | 3 | +1 |
| D | 5 | 1 | -1 |
| E | 6 | 2 | -1 |
| F | 7 | 3 | -1 |
a) Sketch these points on a 2D plane. Is this dataset linearly separable?
b) Which points are most likely to be support vectors? Explain your reasoning without fitting a model.
c) If you remove point A, does the maximum margin decision boundary change? What about point B?
d) If you add a new point G at (4, 2) with class +1, what happens to the margin? Does it widen, narrow, or stay the same?
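One way to sanity-check your answer to (b) without fitting a model: compute each point's distance to the nearest point of the opposite class. The closest cross-class pair is the strongest support-vector candidate. This heuristic (and the helper names below) are ours, a rough check rather than a substitute for actually fitting an SVM:

```python
import numpy as np

# Points from the table above, grouped by class
pts = {'A': (1, 3), 'B': (2, 2), 'C': (3, 3),
       'D': (5, 1), 'E': (6, 2), 'F': (7, 3)}
pos = {'A', 'B', 'C'}   # class +1
neg = {'D', 'E', 'F'}   # class -1

def dist(p, q):
    """Euclidean distance between two named points."""
    return float(np.hypot(pts[p][0] - pts[q][0], pts[p][1] - pts[q][1]))

# Each point's distance to the nearest opposite-class point
nearest = {p: min(dist(p, q) for q in (neg if p in pos else pos))
           for p in pts}
for name, d in sorted(nearest.items(), key=lambda kv: kv[1]):
    print(f"{name}: nearest opposite-class point at distance {d:.2f}")
```

Points that top this sorted list sit closest to the class boundary; compare the ranking with your geometric intuition from the sketch.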
## Exercise 2: The C Parameter (Conceptual + Code)
a) In your own words, explain what happens geometrically as C increases from 0.001 to 1000 in a soft-margin SVM. Address: margin width, number of support vectors, and risk of overfitting.
b) Run the following code and answer the questions below:
```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

X, y = make_classification(
    n_samples=300, n_features=2, n_redundant=0,
    n_informative=2, n_clusters_per_class=1,
    flip_y=0.15, random_state=42
)

for C in [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]:
    pipe = Pipeline([
        ('scaler', StandardScaler()),
        ('svm', SVC(kernel='linear', C=C, random_state=42))
    ])
    scores = cross_val_score(pipe, X, y, cv=5, scoring='accuracy')
    pipe.fit(X, y)
    n_sv = pipe.named_steps['svm'].n_support_.sum()
    print(f"C={C:<8} Accuracy: {scores.mean():.3f}  Support vectors: {n_sv}")
```
- At which C value does accuracy peak?
- Describe the relationship between C and the number of support vectors.
- At C=0.001, why are there so many support vectors?
## Exercise 3: Feature Scaling Impact (Code)
Using the wine dataset from scikit-learn, demonstrate the impact of feature scaling on SVM performance.
```python
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)
```
a) Train an `SVC(kernel='rbf', C=1.0, random_state=42)` without scaling. Report the mean 5-fold cross-validation accuracy.
b) Train the same model with `StandardScaler` in a `Pipeline`. Report the mean 5-fold cross-validation accuracy.
c) Repeat parts (a) and (b) using MinMaxScaler instead of StandardScaler. Is there a meaningful difference between the two scalers?
d) Print the min and max of each feature (before scaling) to explain why the unscaled model performs so poorly.
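A starter sketch for parts (a) and (b); one possible layout, not the only one. The variable names are ours, and the comments state the expected mechanism for you to verify:

```python
from sklearn.datasets import load_wine
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

X, y = load_wine(return_X_y=True)

# (a) Unscaled: the RBF kernel is distance-based, so wide-range
# features (e.g. proline) dominate the kernel computation.
raw = SVC(kernel='rbf', C=1.0, random_state=42)
raw_acc = cross_val_score(raw, X, y, cv=5).mean()

# (b) Scaled: every feature contributes on a comparable scale.
scaled = Pipeline([('scaler', StandardScaler()),
                   ('svm', SVC(kernel='rbf', C=1.0, random_state=42))])
scaled_acc = cross_val_score(scaled, X, y, cv=5).mean()

print(f"Unscaled: {raw_acc:.3f}   Scaled: {scaled_acc:.3f}")
```

Swap in `MinMaxScaler` for part (c), and inspect `X.min(axis=0)` and `X.max(axis=0)` for part (d).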
## Exercise 4: Kernel Selection (Code)
Generate the following synthetic datasets and determine which kernel is best for each:
```python
from sklearn.datasets import make_blobs, make_circles, make_moons

# Dataset A: linearly separable blobs
X_a, y_a = make_blobs(n_samples=300, centers=2, cluster_std=1.5, random_state=42)

# Dataset B: concentric circles
X_b, y_b = make_circles(n_samples=300, noise=0.1, factor=0.5, random_state=42)

# Dataset C: interleaving moons
X_c, y_c = make_moons(n_samples=300, noise=0.15, random_state=42)
```
For each dataset:
a) Fit SVMs with linear, polynomial (degree 2 and 3), and RBF kernels. Use C=1.0 and default gamma. Report 5-fold cross-validation accuracy for each.
b) Which kernel performs best on each dataset? Explain why, in terms of the shape of the decision boundary needed.
c) For Dataset B, explain why the linear kernel performs near chance level (~50%).
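A loop like the following covers part (a) for Dataset B; extend it to the other two datasets yourself. The dictionary layout and kernel labels are our choices, not the only reasonable ones:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score

X_b, y_b = make_circles(n_samples=300, noise=0.1, factor=0.5, random_state=42)

kernels = {
    'linear': SVC(kernel='linear', C=1.0),
    'poly-2': SVC(kernel='poly', degree=2, C=1.0),
    'poly-3': SVC(kernel='poly', degree=3, C=1.0),
    'rbf':    SVC(kernel='rbf', C=1.0),
}
results = {}
for name, clf in kernels.items():
    pipe = Pipeline([('scaler', StandardScaler()), ('svm', clf)])
    results[name] = cross_val_score(pipe, X_b, y_b, cv=5).mean()
    print(f"{name:<8} mean CV accuracy: {results[name]:.3f}")
```

For part (b), relate each winner to the boundary shape: circles need a closed curve, which a degree-2 polynomial (via the x1², x2² terms) or RBF kernel can draw but a line cannot.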
## Exercise 5: Grid Search for C and Gamma (Code)
Using the breast cancer dataset:
```python
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
```
a) Split into 80% train / 20% test with stratify=y and random_state=42.
b) Build a Pipeline with StandardScaler and SVC(kernel='rbf', random_state=42).
c) Run a GridSearchCV over:
- C: [0.01, 0.1, 1, 10, 100]
- gamma: [0.001, 0.01, 0.1, 1]
Use 5-fold CV and scoring='accuracy'.
d) Report the best parameters and best CV score.
e) Evaluate on the test set. Print the classification report and confusion matrix.
f) How many support vectors does the best model have? What percentage of training points are support vectors?
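The whole exercise fits one script; here is a sketch of parts (a) through (d) and (f), leaving the classification report and confusion matrix of part (e) to you. The `svm__` prefix in the parameter grid is how `GridSearchCV` addresses a named pipeline step:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('svm', SVC(kernel='rbf', random_state=42))])
grid = GridSearchCV(
    pipe,
    param_grid={'svm__C': [0.01, 0.1, 1, 10, 100],
                'svm__gamma': [0.001, 0.01, 0.1, 1]},
    cv=5, scoring='accuracy')
grid.fit(X_tr, y_tr)

print("Best params:", grid.best_params_)
print(f"Best CV accuracy: {grid.best_score_:.3f}")
print(f"Test accuracy:    {grid.score(X_te, y_te):.3f}")

# (f) Support-vector count of the refitted best model
n_sv = grid.best_estimator_.named_steps['svm'].n_support_.sum()
print(f"Support vectors: {n_sv} ({100 * n_sv / len(X_tr):.1f}% of training set)")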
## Exercise 6: LinearSVC vs. SVC Scaling (Code)
This exercise demonstrates why LinearSVC is preferred for large datasets.
a) Generate a dataset with 50,000 samples:
```python
from sklearn.datasets import make_classification

X_large, y_large = make_classification(
    n_samples=50000, n_features=20, n_informative=15,
    n_redundant=5, random_state=42
)
```
b) Time how long it takes to fit `LinearSVC(C=1.0, max_iter=10000, random_state=42)` with a StandardScaler pipeline. Use `time.perf_counter()` (or the `%%timeit` magic if you are in a notebook).
c) Time how long it takes to fit SVC(kernel='linear', C=1.0, random_state=42) with the same pipeline. (Warning: this may take minutes. If it takes more than 5 minutes, stop it and note that.)
d) Compare the two models' cross-validation accuracy on a 5,000-sample subset. Are the results similar?
e) Explain why the timing difference exists, even though both are fitting a linear SVM.
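A timing harness for parts (b) and (c) might look like the sketch below. Note one loud assumption: it uses 5,000 samples so it finishes quickly; for the actual exercise, raise `n_samples` back to 50,000 and expect the `SVC` fit to slow down far more than the `LinearSVC` fit:

```python
import time
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# Smaller sample than the exercise asks for, so this sketch runs fast;
# scale n_samples up to 50000 to see the real gap.
X_large, y_large = make_classification(
    n_samples=5000, n_features=20, n_informative=15,
    n_redundant=5, random_state=42)

def timed_fit(model):
    """Fit a scaled pipeline around `model` and return wall-clock seconds."""
    pipe = Pipeline([('scaler', StandardScaler()), ('svm', model)])
    start = time.perf_counter()
    pipe.fit(X_large, y_large)
    return time.perf_counter() - start

t_linear = timed_fit(LinearSVC(C=1.0, max_iter=10000, random_state=42))
t_svc = timed_fit(SVC(kernel='linear', C=1.0, random_state=42))
print(f"LinearSVC: {t_linear:.2f}s   SVC(kernel='linear'): {t_svc:.2f}s")
```

For part (e), recall that `SVC` solves the kernelized dual problem (roughly quadratic-or-worse in the number of samples), while `LinearSVC` uses liblinear's specialized linear solver.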
## Exercise 7: SVM vs. Logistic Regression (Applied)
Using the digits dataset (a smaller version of MNIST):
```python
from sklearn.datasets import load_digits

X, y = load_digits(return_X_y=True)
print(f"Shape: {X.shape}, Classes: {len(set(y))}")
```
a) Split into train/test (80/20, stratify, random_state=42).
b) Train and evaluate three models:
- LogisticRegression(max_iter=5000, random_state=42)
- LinearSVC(C=1.0, max_iter=10000, random_state=42)
- SVC(kernel='rbf', C=10, gamma='scale', random_state=42)
All with StandardScaler in a pipeline. Report test accuracy for each.
c) Which model wins? Why does the SVM with RBF kernel perform well here? (Think about the dataset's characteristics: sample size, dimensionality, number of classes.)
d) For the RBF SVM, how many support vectors are there? What fraction of training points?
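The three-model comparison of parts (a) and (b) can be sketched as a loop over pipelines; the dictionary keys are our labels, and the analysis questions in (c) and (d) are left to you:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

models = {
    'logreg':     LogisticRegression(max_iter=5000, random_state=42),
    'linear_svm': LinearSVC(C=1.0, max_iter=10000, random_state=42),
    'rbf_svm':    SVC(kernel='rbf', C=10, gamma='scale', random_state=42),
}
acc = {}
for name, model in models.items():
    pipe = Pipeline([('scaler', StandardScaler()), ('clf', model)])
    pipe.fit(X_tr, y_tr)
    acc[name] = pipe.score(X_te, y_te)
    print(f"{name:<12} test accuracy: {acc[name]:.3f}")
```

For (d), the fitted RBF model exposes its support-vector counts per class via `n_support_` on the `'clf'` step of its pipeline.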
## Exercise 8: SVR for Regression (Code)
Generate the following noisy 1-D regression data:
```python
import numpy as np

np.random.seed(42)
X = np.sort(5 * np.random.rand(300, 1), axis=0)
y = np.sin(X).ravel() + 0.5 * np.sin(3 * X).ravel() + np.random.normal(0, 0.15, 300)
```
a) Fit an SVR(kernel='rbf') with default parameters (C=1.0, epsilon=0.1, gamma='scale') inside a scaling pipeline. Report 5-fold R-squared.
b) Use GridSearchCV to tune C over [0.1, 1, 10, 100] and gamma over [0.01, 0.1, 1, 10]. Report the best parameters and best R-squared.
c) Plot the true data and the predictions from the best model. Does the SVR capture the oscillating pattern?
d) How does the epsilon parameter affect the number of support vectors? Try epsilon values [0.01, 0.1, 0.5, 1.0] with C=10 and gamma=1. Explain the trend.
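Part (d) can be explored with a short sweep like the one below. The loop structure is ours; the mechanism to look for is that epsilon defines a no-penalty tube around the regression function, and only points on or outside the tube become support vectors:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Same data generation as the exercise
np.random.seed(42)
X = np.sort(5 * np.random.rand(300, 1), axis=0)
y = np.sin(X).ravel() + 0.5 * np.sin(3 * X).ravel() + np.random.normal(0, 0.15, 300)

sv_counts = {}
for eps in [0.01, 0.1, 0.5, 1.0]:
    pipe = Pipeline([('scaler', StandardScaler()),
                     ('svr', SVR(kernel='rbf', C=10, gamma=1, epsilon=eps))])
    pipe.fit(X, y)
    # Points inside the epsilon-tube incur no loss and are not support vectors
    sv_counts[eps] = len(pipe.named_steps['svr'].support_)
    print(f"epsilon={eps:<5} support vectors: {sv_counts[eps]}/300")
```

Since the noise standard deviation is 0.15, a tube narrower than that (epsilon=0.01) leaves almost every point outside it; widening the tube should shed support vectors rapidly.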
## Exercise 9: StreamFlow Application (Applied)
Scenario: You are building a churn classifier for StreamFlow. The marketing team has flagged a segment of 2,000 subscribers with 8 behavioral features (hours watched, login frequency, support tickets, etc.) and a binary churn label. The team wants to know: can an SVM catch the non-linear patterns that logistic regression might miss?
Using a synthetic stand-in:
```python
from sklearn.datasets import make_classification

X_stream, y_stream = make_classification(
    n_samples=2000, n_features=8, n_informative=5,
    n_redundant=2, n_clusters_per_class=2,
    weights=[0.85, 0.15], flip_y=0.05, random_state=42
)
```
a) Note the class imbalance (85/15). Split into train/test (80/20, stratify, random_state=42).
b) Train a LinearSVC and an SVC(kernel='rbf') with scaling. Tune the RBF SVM's C and gamma via grid search. Use scoring='f1' since the classes are imbalanced.
c) Compare both models using precision, recall, F1, and the confusion matrix on the test set.
d) Does the RBF kernel improve over the linear SVM for this problem? Is the improvement large enough to justify the added tuning complexity?
e) Given that this is a 2,000-sample dataset, would you deploy an SVM or recommend gradient boosting? Justify your answer considering dataset size, interpretability requirements, and maintenance costs.
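A sketch covering the split and the F1-scored grid search of parts (a) and (b); the parameter ranges are our assumptions, so widen them if the best value lands on a grid edge. With `scoring='f1'`, scikit-learn scores the positive class (label 1), which here is the 15% churn minority:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.metrics import classification_report

X_stream, y_stream = make_classification(
    n_samples=2000, n_features=8, n_informative=5,
    n_redundant=2, n_clusters_per_class=2,
    weights=[0.85, 0.15], flip_y=0.05, random_state=42)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_stream, y_stream, test_size=0.2, stratify=y_stream, random_state=42)

pipe = Pipeline([('scaler', StandardScaler()),
                 ('svm', SVC(kernel='rbf', random_state=42))])
# F1 on the churn class, since plain accuracy would be dominated
# by the 85% majority of non-churners.
grid = GridSearchCV(pipe,
                    param_grid={'svm__C': [0.1, 1, 10, 100],
                                'svm__gamma': [0.01, 0.1, 1]},
                    cv=5, scoring='f1')
grid.fit(X_tr, y_tr)
print("Best params:", grid.best_params_)
print(classification_report(y_te, grid.predict(X_te)))
```

Repeat the evaluation with a `LinearSVC` pipeline for the comparison in part (c); parts (d) and (e) are judgment calls for you to argue from the numbers.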
## Exercise 10: When Not to Use SVMs (Conceptual)
For each of the following scenarios, explain whether an SVM is a good choice. If not, suggest a better alternative and explain why.
a) Classifying email as spam or not-spam based on TF-IDF features from 500,000 emails.
b) Predicting house prices from 15 mixed features (numeric and categorical) with 50,000 samples.
c) Classifying cell types from gene expression data with 20,000 features and 500 samples.
d) Building a real-time fraud detection system that must process 10,000 transactions per second and return probabilities.
e) A medical diagnosis model where the doctor needs to understand why the model flagged a patient.
These exercises support Chapter 12: Support Vector Machines. Return to the chapter to review concepts before attempting them.