Quiz: Chapter 12
Support Vector Machines
Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.
Question 1 (Multiple Choice)
What does the maximum margin principle seek to maximize?
- A) The number of correctly classified training points
- B) The distance between the decision boundary and the closest data points from either class
- C) The total number of support vectors
- D) The log-likelihood of the observed class labels
Answer: B) The distance between the decision boundary and the closest data points from either class. The SVM finds the hyperplane with the widest margin --- the largest gap between the boundary and the nearest training points. This is fundamentally a geometric objective, not a probabilistic one.
Question 2 (Multiple Choice)
Which of the following statements about support vectors is TRUE?
- A) All training points are support vectors
- B) Support vectors are the points farthest from the decision boundary
- C) Removing a non-support-vector training point does not change the decision boundary
- D) The number of support vectors always equals the number of features
Answer: C) Removing a non-support-vector training point does not change the decision boundary. Support vectors are the training points that lie on or inside the margin (in a soft-margin SVM this includes margin violators), i.e., the points closest to the decision boundary. Only they determine the model; every other point could be removed without moving the boundary.
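This property is easy to check empirically: drop a training point that is not in the fitted model's `support_` and refit. A minimal sketch on a synthetic blob dataset (dataset and C value are illustrative, not from the chapter); a linear kernel is used so the boundary coefficients are directly comparable:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.svm import SVC

X, y = make_blobs(n_samples=60, centers=2, random_state=0)
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# Find a training index that is NOT a support vector, then drop that point
non_sv = next(i for i in range(len(X)) if i not in clf.support_)
X2, y2 = np.delete(X, non_sv, axis=0), np.delete(y, non_sv)

clf2 = SVC(kernel="linear", C=1.0).fit(X2, y2)
print(np.abs(clf.coef_ - clf2.coef_).max())  # difference is numerically negligible
```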
Question 3 (Short Answer)
Explain the role of the C parameter in a soft-margin SVM. What happens when C is very large? Very small?
Answer: C controls the tradeoff between margin width and tolerance for misclassifications. When C is very large, the SVM penalizes violations heavily, producing a narrow margin that tries to classify every training point correctly --- risking overfitting. When C is very small, the SVM prioritizes a wide margin and tolerates many misclassifications --- risking underfitting. The optimal C is found via cross-validation.
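That cross-validation search can be sketched in a few lines. The synthetic dataset and the C grid below are illustrative choices, not values from the chapter:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Scale inside the pipeline so each CV fold fits the scaler on training data only
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))
grid = GridSearchCV(pipe, {"svc__C": [0.01, 0.1, 1, 10, 100]}, cv=5)
grid.fit(X, y)
print("best C:", grid.best_params_["svc__C"])
```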
Question 4 (Multiple Choice)
A data scientist trains an SVM on a dataset with features ranging from 0-1 (age normalized) and 0-500,000 (salary in dollars). The model performs poorly. The most likely cause is:
- A) The dataset is too large
- B) The kernel function is inappropriate
- C) The features are not scaled to a common range
- D) The C parameter is too large
Answer: C) The features are not scaled to a common range. SVMs compute margins as distances in feature space. When features have drastically different scales, the margin calculation is dominated by the large-scale feature (salary), effectively ignoring the small-scale feature (age). Scaling is mandatory for SVMs.
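The standard fix is a scaler fit on the training data. A small illustration with made-up age/salary rows, showing that StandardScaler puts both columns on a comparable scale:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up rows: [normalized age, salary in dollars]
X = np.array([[0.2,  40_000.0],
              [0.8,  45_000.0],
              [0.3, 300_000.0],
              [0.6, 120_000.0]])

# Before scaling, Euclidean distances are dominated entirely by the salary column
X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # each column now has mean ~0
print(X_scaled.std(axis=0))   # ...and unit variance
```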
Question 5 (Multiple Choice)
The kernel trick allows SVMs to:
- A) Train faster on large datasets
- B) Find non-linear decision boundaries by computing dot products in a transformed space without explicitly performing the transformation
- C) Automatically select the best features
- D) Produce calibrated probability estimates
Answer: B) Find non-linear decision boundaries by computing dot products in a transformed space without explicitly performing the transformation. The kernel function computes the inner product between data points in a (potentially infinite-dimensional) feature space, enabling non-linear boundaries without the computational cost of the actual transformation.
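One concrete instance of the trick: for 2-D points, the degree-2 polynomial kernel (x . z)^2 equals an ordinary dot product under the explicit map phi(x) = (x1^2, sqrt(2) x1 x2, x2^2). A quick numerical check (the example vectors are arbitrary):

```python
import numpy as np

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])

# Kernel trick: square the original dot product directly
k_implicit = (x @ z) ** 2

# Versus explicitly mapping into the degree-2 feature space first
def phi(v):
    return np.array([v[0] ** 2, np.sqrt(2) * v[0] * v[1], v[1] ** 2])

k_explicit = phi(x) @ phi(z)
print(k_implicit, k_explicit)  # identical values: 16.0 16.0
```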
Question 6 (Short Answer)
Why does SVC(kernel='rbf') become impractical for datasets with more than ~10,000 samples? What alternatives exist?
Answer: SVC with a non-linear kernel computes an n-by-n kernel matrix, requiring O(n^2) memory and O(n^2) to O(n^3) time. For 10,000 samples that is 100 million entries; for 100,000 it is 10 billion. Alternatives include LinearSVC (avoids the kernel matrix, scales linearly), kernel approximation methods like RBFSampler or Nystroem paired with a linear model, and SGDClassifier(loss='hinge') for online learning.
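A sketch of the kernel-approximation route: Nystroem features feeding a linear solver, so no n-by-n kernel matrix is ever built. The dataset size, gamma, and n_components below are illustrative choices:

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import Nystroem
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

approx_rbf = make_pipeline(
    StandardScaler(),
    Nystroem(gamma=0.1, n_components=100, random_state=0),  # approximate RBF feature map
    LinearSVC(),  # linear solver: memory scales with n, not n^2
)
acc = approx_rbf.fit(X, y).score(X, y)
print(acc)
```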
Question 7 (Multiple Choice)
Which kernel would you try FIRST for classifying documents based on TF-IDF features (high-dimensional sparse vectors)?
- A) RBF kernel
- B) Polynomial kernel (degree 3)
- C) Linear kernel
- D) Sigmoid kernel
Answer: C) Linear kernel. Text data represented as TF-IDF vectors is high-dimensional and sparse, and in such high-dimensional spaces the classes are frequently already linearly separable, so a linear boundary suffices. Linear kernels are also much faster than non-linear alternatives, which matters when the feature space is large.
Question 8 (Multiple Choice)
In an RBF kernel SVM, increasing gamma while holding C constant will:
- A) Make the decision boundary smoother
- B) Increase the number of support vectors
- C) Reduce each training point's radius of influence, making the boundary more complex
- D) Have no effect on the decision boundary
Answer: C) Reduce each training point's radius of influence, making the boundary more complex. High gamma means each support vector's influence decays rapidly with distance, allowing the boundary to conform more closely to individual training points. This can lead to overfitting, where the boundary wraps tightly around each point.
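The effect shows up directly in the support-vector count. A minimal sketch on a noisy two-moons dataset (the dataset and the two gamma values are illustrative): with C fixed, a very large gamma shrinks each point's radius of influence and far more training points end up as support vectors.

```python
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=200, noise=0.3, random_state=0)

# Same C, two very different gammas: watch the support-vector count grow
for gamma in (0.1, 100.0):
    n_sv = SVC(kernel="rbf", C=1.0, gamma=gamma).fit(X, y).n_support_.sum()
    print(gamma, n_sv)
```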
Question 9 (Short Answer)
What is the difference between LinearSVC and SVC(kernel='linear') in scikit-learn? When should you prefer one over the other?
Answer: Both find a linear decision boundary, but they use different optimization algorithms. LinearSVC uses liblinear, which avoids computing the kernel matrix and scales to millions of samples. SVC(kernel='linear') uses libsvm, which computes the full n-by-n kernel matrix and is O(n^2) in memory. Prefer LinearSVC for datasets larger than a few thousand samples, and for any linear SVM in production.
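Side by side, the two estimators reach a similar fit on small data; the difference shows up in memory and training time as n grows. A sketch on an illustrative synthetic dataset:

```python
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC, SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

lin = LinearSVC().fit(X, y)           # liblinear: no kernel matrix, scales linearly
svc = SVC(kernel="linear").fit(X, y)  # libsvm: builds the n-by-n kernel matrix
print(lin.score(X, y), svc.score(X, y))
```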
Question 10 (Multiple Choice)
A soft-margin SVM has a slack variable xi_i = 0.7 for a particular training point. This means:
- A) The point is correctly classified and outside the margin
- B) The point is inside the margin but on the correct side of the decision boundary
- C) The point is misclassified (on the wrong side of the decision boundary)
- D) The point is a support vector with zero margin violation
Answer: B) The point is inside the margin but on the correct side of the decision boundary. A slack value of 0 means the point is on or outside the margin (no violation). A value strictly between 0 and 1 means the point is inside the margin but still correctly classified. A value of exactly 1 places the point on the decision boundary itself, and a value greater than 1 means it has crossed to the wrong side and is misclassified.
Question 11 (Multiple Choice)
You fit an SVM and find that 92% of your training points are support vectors. This most likely indicates:
- A) The model is overfitting
- B) The model is performing optimally
- C) The C parameter is too small (under-penalizing violations)
- D) The kernel is too complex
Answer: C) The C parameter is too small (under-penalizing violations). When nearly all points are support vectors, the margin is extremely wide and the SVM is not discriminating well between points near the boundary and points far from it. This typically means C is too small, so violations are cheap and the SVM does not try hard enough to classify correctly.
Question 12 (Short Answer)
A colleague says: "SVMs are obsolete because gradient boosting beats them on every dataset." Is this claim accurate? Provide a nuanced response.
Answer: The claim is an oversimplification. Gradient boosting generally outperforms SVMs on large tabular datasets and is easier to tune, which is why it dominates in industry. However, SVMs remain competitive on small datasets (n < 1,000), high-dimensional sparse data (text classification), and problems with clean decision boundaries. SVMs also have theoretical guarantees (margin bounds) and produce sparse models defined by support vectors. The underlying ideas --- margins, kernels, the bias-complexity tradeoff --- are foundational regardless of which algorithm you deploy.
Question 13 (Multiple Choice)
Which of the following is the correct workflow for using an SVM with cross-validation?
- A) Scale the entire dataset, then split into folds, then train and evaluate
- B) Split into folds, scale each fold independently, then train and evaluate
- C) Use a Pipeline that scales and trains within each fold, so the scaler is fit only on training data
- D) Scale the training set, use the same fixed scaling parameters from the training set on the test set manually
Answer: C) Use a Pipeline that scales and trains within each fold, so the scaler is fit only on training data. Scaling the entire dataset before splitting (option A) leaks information from the test fold into the training fold. A Pipeline ensures that the scaler sees only training data in each fold, preventing data leakage automatically.
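The leak-free workflow from option C takes only a few lines; the synthetic dataset below is illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# The scaler is refit on the training portion of every fold automatically
pipe = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(pipe, X, y, cv=5)
print(scores.mean())
```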
Question 14 (Multiple Choice)
For a 10-class classification problem, SVC (using the default one-vs-one strategy) trains:
- A) 10 binary classifiers
- B) 20 binary classifiers
- C) 45 binary classifiers
- D) 100 binary classifiers
Answer: C) 45 binary classifiers. One-vs-one trains one classifier for every pair of classes. For k=10 classes, that is k(k-1)/2 = 10(9)/2 = 45 classifiers. Each is trained on only the samples from the two classes it compares.
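The k(k-1)/2 count can be confirmed on a 10-class toy problem: with decision_function_shape="ovo", SVC exposes one decision-function column per pairwise classifier. The dataset parameters are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

k = 10
X, y = make_classification(n_samples=500, n_classes=k, n_informative=8,
                           random_state=0)

clf = SVC(decision_function_shape="ovo").fit(X, y)
print(k * (k - 1) // 2)                # 45
print(clf.decision_function(X).shape)  # (500, 45): one column per pair of classes
```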
Question 15 (Short Answer)
Why does SVC(probability=True) add significant computational cost? When might you use it anyway?
Answer: Setting probability=True triggers Platt scaling, which fits a sigmoid function to the SVM's decision values. To estimate these parameters reliably, scikit-learn runs an internal 5-fold cross-validation, roughly increasing training time by 5x. You might use it when downstream tasks require probability estimates (e.g., ranking by confidence, setting custom thresholds, or calibrating predictions), and when the dataset is small enough that the extra cost is acceptable.
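A minimal sketch of the resulting API, on an illustrative synthetic dataset: the internal cross-validation runs inside fit, and predict_proba then returns one probability per class.

```python
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)

# probability=True triggers Platt scaling via internal cross-validation here
clf = SVC(probability=True, random_state=0).fit(X, y)
proba = clf.predict_proba(X[:5])
print(proba.shape)        # (5, 2): one column per class
print(proba.sum(axis=1))  # each row sums to 1
```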
This quiz covers Chapter 12: Support Vector Machines. Return to the chapter for full context.