Chapter 6 Quiz: Supervised Learning --- Regression and Classification
Test your understanding of the key concepts from this chapter. Try to answer each question before revealing the solution.
Question 1
What is the fundamental difference between regression and classification in supervised learning?
Show Answer
In **regression**, the target variable $y$ is continuous (e.g., house price, temperature), and the model predicts a real-valued output. In **classification**, the target variable $y$ is discrete (e.g., spam/not-spam, digit 0--9), and the model predicts a class label or probability distribution over classes.

Question 2
Write the closed-form solution for ordinary least squares (OLS) linear regression. Under what conditions does this solution exist?
Show Answer
The OLS solution is: $$\boldsymbol{\theta}^* = (\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \mathbf{y}$$ This solution exists when $\mathbf{X}^\top \mathbf{X}$ is invertible, which requires: (1) the number of training examples $n$ is at least $d+1$ (where $d$ is the number of features), and (2) there is no perfect multicollinearity among the features (no feature is a perfect linear combination of other features).
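A minimal NumPy sketch of the normal equations (the data here is synthetic and illustrative; in practice `np.linalg.lstsq` or `np.linalg.solve` is numerically preferable to forming an explicit inverse):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, d))])  # prepend an intercept column
true_theta = np.array([2.0, 1.0, -0.5, 0.3])
y = X @ true_theta + 0.1 * rng.normal(size=n)

# Closed-form OLS via the normal equations (requires X^T X to be invertible)
theta_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically preferable: solve the least-squares problem directly
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(theta_closed, theta_lstsq))  # True
```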
Question 3
Why is gradient descent preferred over the OLS closed-form solution in some scenarios?
Show Answer
The closed-form solution requires computing the matrix inverse $(\mathbf{X}^\top \mathbf{X})^{-1}$, which has $O(d^3)$ time complexity (plus $O(nd^2)$ just to form $\mathbf{X}^\top \mathbf{X}$). When the number of features $d$ is very large (thousands or millions), this becomes computationally prohibitive. Gradient descent has $O(nd)$ per-iteration cost and can be made even more efficient with stochastic or mini-batch variants ($O(bd)$ per iteration for batch size $b$). Additionally, gradient descent can handle streaming data and is easily parallelizable.
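A short sketch of batch gradient descent for linear regression under squared error (the learning rate, iteration count, and synthetic data are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

theta = np.zeros(d)
lr = 0.1
for _ in range(500):
    grad = (2.0 / n) * X.T @ (X @ theta - y)  # gradient of the mean squared error
    theta -= lr * grad                        # O(nd) per iteration, no matrix inverse
```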
Question 4
What is the key difference between Ridge regression and Lasso regression? When would you prefer one over the other?
Show Answer
**Ridge** ($L_2$) adds $\lambda\|\boldsymbol{\theta}\|_2^2$ to the loss and shrinks all coefficients toward zero but never to exactly zero. **Lasso** ($L_1$) adds $\lambda\|\boldsymbol{\theta}\|_1$ and can drive coefficients to exactly zero, performing automatic feature selection. Prefer **Lasso** when you believe only a few features are truly relevant (sparse model) or when you want interpretable feature selection. Prefer **Ridge** when many features contribute small effects, or when features are correlated (Lasso tends to arbitrarily select one from a group of correlated features). Use **Elastic Net** when you want the benefits of both.
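A small scikit-learn comparison illustrating Lasso's sparsity; the dataset and regularization strengths are illustrative, not a recipe:

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=200)  # only 2 relevant features

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print((np.abs(ridge.coef_) < 1e-8).sum())  # 0: Ridge shrinks but keeps every feature
print((np.abs(lasso.coef_) < 1e-8).sum())  # ~8: Lasso zeroes out the irrelevant features
```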
Question 5
What does an $R^2$ value of 0.85 mean? Can $R^2$ be negative?
Show Answer
An $R^2$ of 0.85 means that 85% of the variance in the target variable is explained by the model. The remaining 15% is due to factors not captured by the model. Yes, $R^2$ can be negative. This occurs when the model's predictions are worse than simply predicting the mean of the target for every input. A negative $R^2$ indicates a very poor model---even a constant prediction would be better. This can happen when evaluating a model on test data that differs significantly from the training data, or when the model severely overfits.
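$R^2$ follows directly from its definition; the toy numbers below show that a worse-than-mean predictor yields a negative value:

```python
import numpy as np

def r2(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
print(r2(y, np.array([1.1, 1.9, 3.2, 3.8])))  # 0.98: good fit
print(r2(y, np.full(4, y.mean())))            # exactly 0: the mean predictor
print(r2(y, np.array([4.0, 3.0, 2.0, 1.0])))  # -3.0: worse than predicting the mean
```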
Question 6
Why does logistic regression use the cross-entropy loss instead of mean squared error?
Show Answer
There are three main reasons:

1. **Convexity**: The cross-entropy loss with a sigmoid output is convex in the parameters, so any local minimum is a global minimum. MSE with a sigmoid output is non-convex, with many local minima.
2. **Gradient behavior**: With MSE and sigmoid, the gradient can become very small when the prediction is confidently wrong (due to the flat tails of the sigmoid), causing extremely slow learning. Cross-entropy avoids this because the gradient is proportional to the prediction error $(\hat{p} - y)$, maintaining strong gradients even for large errors.
3. **Probabilistic interpretation**: Cross-entropy corresponds to maximum likelihood estimation under a Bernoulli model, providing a principled probabilistic foundation.
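A quick numeric check of the gradient argument, using toy values: for a confidently wrong prediction (large negative logit, true label 1), the MSE-through-sigmoid gradient nearly vanishes while the cross-entropy gradient stays close to 1 in magnitude:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, y = -8.0, 1.0  # confidently wrong: p = sigmoid(-8) ~ 0.0003, true label is 1
p = sigmoid(z)

grad_mse = 2 * (p - y) * p * (1 - p)  # d/dz of (p - y)^2: squashed by the sigmoid tails
grad_ce = p - y                       # d/dz of cross-entropy: proportional to the error

print(grad_mse)  # ~ -6.7e-4: nearly zero despite a huge error
print(grad_ce)   # ~ -0.9997: still a strong learning signal
```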
Question 7
What is the decision boundary of a logistic regression model? How does it relate to the model parameters?
Show Answer
The decision boundary is the set of points where $P(y=1|\mathbf{x}) = 0.5$, which corresponds to $\mathbf{w}^\top \mathbf{x} + b = 0$. This defines a **hyperplane** in the feature space. The weight vector $\mathbf{w}$ is normal (perpendicular) to this hyperplane, determining its orientation. The bias $b$ determines its offset from the origin. Points on one side of the hyperplane are classified as class 1, and points on the other side as class 0.

Question 8
Explain the difference between the sigmoid function and the softmax function. When is each used?
Show Answer
The **sigmoid** function $\sigma(z) = 1/(1+e^{-z})$ maps a single real number to $(0,1)$ and is used for **binary** classification. It takes one logit as input and outputs one probability. The **softmax** function $\text{softmax}(\mathbf{z})_k = e^{z_k}/\sum_j e^{z_j}$ maps a vector of $K$ real numbers to a probability distribution over $K$ classes. It is used for **multiclass** classification. All outputs are positive and sum to 1. For $K=2$, softmax reduces to sigmoid.
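A NumPy sketch of both functions (with the standard max-subtraction trick for numerical stability) and a check that 2-class softmax matches sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()

z = 1.3
print(sigmoid(z))                      # binary: a single probability
print(softmax(np.array([z, 0.0]))[0])  # 2-class softmax with logits (z, 0) equals sigmoid(z)
```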
Question 9
A classifier has the following confusion matrix. Calculate precision, recall, F1, and accuracy.
| | Predicted Pos | Predicted Neg |
|---|---|---|
| Actual Pos | 40 | 10 |
| Actual Neg | 20 | 130 |
Show Answer
- **TP** = 40, **FN** = 10, **FP** = 20, **TN** = 130
- **Precision** = TP/(TP+FP) = 40/(40+20) = 40/60 = 0.667
- **Recall** = TP/(TP+FN) = 40/(40+10) = 40/50 = 0.800
- **F1** = 2 * (0.667 * 0.800)/(0.667 + 0.800) = 2 * 0.5336/1.467 = 0.727
- **Accuracy** = (TP+TN)/(TP+TN+FP+FN) = 170/200 = 0.850
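The same arithmetic in a few lines of plain Python:

```python
tp, fn, fp, tn = 40, 10, 20, 130

precision = tp / (tp + fp)                          # 0.667
recall = tp / (tp + fn)                             # 0.800
f1 = 2 * precision * recall / (precision + recall)  # 0.727
accuracy = (tp + tn) / (tp + tn + fp + fn)          # 0.850

print(precision, recall, f1, accuracy)
```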
Question 10
What is the "kernel trick" in SVMs? Why is it computationally advantageous?
Show Answer
The kernel trick allows SVMs to learn non-linear decision boundaries without explicitly computing the high-dimensional feature transformation $\boldsymbol{\phi}(\mathbf{x})$. Instead, we replace all dot products $\mathbf{x}_i^\top \mathbf{x}_j$ in the SVM formulation with a kernel function $K(\mathbf{x}_i, \mathbf{x}_j) = \boldsymbol{\phi}(\mathbf{x}_i)^\top \boldsymbol{\phi}(\mathbf{x}_j)$. This is computationally advantageous because:

1. We never need to explicitly compute or store the high-dimensional feature vectors. The RBF kernel, for example, implicitly maps to an infinite-dimensional space, which would be impossible to compute explicitly.
2. The kernel function is typically $O(d)$ to evaluate, regardless of the dimensionality of the implicit feature space.
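A minimal RBF kernel in NumPy; each evaluation is $O(d)$ even though the implicit feature space is infinite-dimensional (the `gamma` value is illustrative):

```python
import numpy as np

def rbf_kernel(x1, x2, gamma=0.5):
    # K(x1, x2) = exp(-gamma * ||x1 - x2||^2): an O(d) computation
    return np.exp(-gamma * np.sum((x1 - x2) ** 2))

x1 = np.array([1.0, 2.0, 3.0])
x2 = np.array([1.5, 1.0, 2.0])
print(rbf_kernel(x1, x2))  # the implicit dot product phi(x1).phi(x2), never computed explicitly
```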
Question 11
What does the $C$ parameter control in an SVM? What happens when $C$ is very large versus very small?
Show Answer
$C$ controls the tradeoff between maximizing the margin and minimizing classification errors on the training data.

- **Large $C$**: Heavy penalty for misclassifications. The model tries hard to classify every training point correctly, resulting in a narrow margin. This can lead to overfitting.
- **Small $C$**: Mild penalty for misclassifications. The model tolerates more errors in exchange for a wider margin, leading to a simpler model. This can lead to underfitting.

Think of $C$ as the inverse of a regularization parameter: large $C$ = less regularization, small $C$ = more regularization.

Question 12
How does a decision tree decide which feature and threshold to split on?
Show Answer
At each node, the tree evaluates all possible features and all possible thresholds for each feature. For each candidate split, it computes the weighted average impurity of the resulting child nodes using an impurity measure (Gini impurity or entropy for classification, variance for regression). The split that produces the greatest reduction in impurity (or equivalently, the greatest information gain) is selected. This is a greedy algorithm---it makes the locally optimal choice at each step, which does not guarantee a globally optimal tree.
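A simplified sketch of this greedy search for a single feature, using Gini impurity and midpoints between sorted unique values as candidate thresholds (real implementations add stopping criteria and handle many features):

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Try every midpoint between sorted unique values of one feature."""
    values = np.unique(x)  # np.unique returns sorted values
    best_thr, best_impurity = None, np.inf
    for thr in (values[:-1] + values[1:]) / 2:
        left, right = y[x <= thr], y[x > thr]
        # weighted average impurity of the two children
        w_imp = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if w_imp < best_impurity:
            best_thr, best_impurity = thr, w_imp
    return best_thr, best_impurity

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))  # (3.5, 0.0): this threshold separates the classes perfectly
```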
Question 13
What is the Gini impurity for a node where 70% of samples belong to class A and 30% to class B?
Show Answer
$$G = 1 - \sum_k p_k^2 = 1 - (0.7^2 + 0.3^2) = 1 - (0.49 + 0.09) = 1 - 0.58 = 0.42$$ The Gini impurity is 0.42. For comparison, a pure node has $G = 0$, and a perfectly balanced binary node has $G = 0.5$ (maximum impurity for two classes).

Question 14
Why are random forests generally more robust than individual decision trees?
Show Answer
Random forests reduce variance through two mechanisms of randomization:

1. **Bootstrap aggregating (bagging)**: Each tree is trained on a random bootstrap sample, so different trees see different subsets of the data.
2. **Random feature selection**: At each split, only a random subset of features is considered, which decorrelates the trees.

When the trees are averaged (regression) or combined by majority vote (classification), the uncorrelated errors tend to cancel out. This dramatically reduces the variance of the ensemble compared to any single tree, while keeping the bias approximately the same. By the law of large numbers, the more independent trees we average, the lower the variance of the ensemble.

Question 15
Explain the key difference between bagging (as in random forests) and boosting (as in gradient boosting).
Show Answer
**Bagging** trains multiple models **independently** (in parallel) on bootstrap samples and combines them by averaging or voting. It primarily reduces **variance** by decorrelating the models' errors.

**Boosting** trains models **sequentially**, where each new model focuses on correcting the errors made by the previous ensemble. It primarily reduces **bias** by iteratively improving the ensemble's predictions.

In bagging, each model is typically a strong learner (deep tree). In boosting, each model is typically a weak learner (shallow tree), and the ensemble's strength comes from combining many weak learners.

Question 16
In gradient boosting, what are "pseudo-residuals"? Why is the method called "gradient" boosting?
Show Answer
Pseudo-residuals are the negative gradients of the loss function with respect to the current model's predictions. For squared error loss, the pseudo-residuals are simply the actual residuals ($y_i - F(\mathbf{x}_i)$), but for other loss functions (e.g., cross-entropy, absolute error), they take different forms. The method is called "gradient" boosting because each new tree is fit to the negative gradient of the loss function, effectively performing gradient descent in function space. Just as standard gradient descent steps in the direction that reduces the loss most steeply in parameter space, gradient boosting adds a function (tree) in the direction that reduces the loss most steeply in the space of predictions.
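A small illustration with toy labels and predictions: for squared error (with the conventional $\tfrac{1}{2}$ factor) the negative gradient is the plain residual, while for binary log loss with log-odds outputs it is $y - \sigma(F)$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

y = np.array([1.0, 0.0, 1.0])
F = np.array([0.2, 0.5, -1.0])  # current ensemble predictions (log-odds for log loss)

# Squared error L = 0.5 * (y - F)^2: the negative gradient is the plain residual
pseudo_res_sq = y - F

# Binary log loss with F as log-odds: the negative gradient is y - sigmoid(F)
pseudo_res_log = y - sigmoid(F)

print(pseudo_res_sq)   # [ 0.8 -0.5  2. ]
print(pseudo_res_log)  # [ 0.45 -0.62  0.73] (approx.)
```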
Question 17
What is the purpose of the learning rate (shrinkage parameter) in gradient boosting?
Show Answer
The learning rate $\eta$ (typically 0.01--0.3) scales the contribution of each new tree: $F_m(\mathbf{x}) = F_{m-1}(\mathbf{x}) + \eta \cdot h_m(\mathbf{x})$. A smaller learning rate means each tree makes a smaller correction, which has two benefits: (1) it regularizes the model, reducing the risk of overfitting, and (2) it gives the subsequent trees more "room" to make corrections, often leading to better final performance. The tradeoff is that a smaller learning rate requires more trees (more boosting iterations) to achieve the same training loss, increasing computational cost. In practice, the combination of a small learning rate with many trees (and early stopping) tends to produce the best generalization.
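A compact from-scratch boosting loop showing the shrinkage update $F_m = F_{m-1} + \eta\, h_m$; the depth, learning rate, and number of rounds are illustrative:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=300)

eta, n_rounds = 0.1, 100
F = np.full_like(y, y.mean())  # F_0: a constant initial prediction
for _ in range(n_rounds):
    residuals = y - F                                  # pseudo-residuals for squared error
    h = DecisionTreeRegressor(max_depth=3).fit(X, residuals)
    F += eta * h.predict(X)                            # shrunken update: F_m = F_{m-1} + eta * h_m

print(np.mean((y - F) ** 2))  # training MSE after n_rounds shrunken corrections
```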
Question 18
Write the bias-variance decomposition of the expected squared error. Explain each term intuitively.
Show Answer
$$\mathbb{E}[(y - \hat{f}(\mathbf{x}))^2] = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

- **Bias$^2$** $= (\mathbb{E}[\hat{f}(\mathbf{x})] - f(\mathbf{x}))^2$: How far the average prediction (across many training sets) is from the true value. High bias means the model class is too restrictive to capture the true pattern. Intuition: a straight line fit to a quadratic curve will always have high bias.
- **Variance** $= \mathbb{E}[(\hat{f}(\mathbf{x}) - \mathbb{E}[\hat{f}(\mathbf{x})])^2]$: How much the predictions fluctuate when trained on different samples from the same distribution. High variance means the model is overly sensitive to the specific training data. Intuition: a degree-20 polynomial fit to 25 points will produce wildly different curves for different samples.
- **Irreducible Error** $= \sigma^2_\epsilon$: Noise inherent in the data that no model can eliminate. Even the true function $f(\mathbf{x})$ cannot predict $y$ perfectly because of this noise.
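A quick simulation sketch that estimates bias$^2$ and variance at a single test point by refitting a deliberately too-simple model on many resampled training sets (the true function, noise level, and model are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: x ** 2          # true function
x0 = 1.5                      # test point
n, trials = 30, 500

preds = []
for _ in range(trials):
    X = rng.uniform(-2, 2, size=n)
    y = f(X) + 0.3 * rng.normal(size=n)  # a fresh noisy training sample
    coefs = np.polyfit(X, y, deg=1)      # deliberately too simple: a straight line
    preds.append(np.polyval(coefs, x0))

preds = np.array(preds)
print("bias^2:", (preds.mean() - f(x0)) ** 2)  # large: a line cannot fit a parabola
print("variance:", preds.var())                # small: the line barely changes per sample
```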
Question 19
How can you tell from a learning curve whether a model suffers from high bias or high variance?
Show Answer
**High bias** (underfitting): Both training error and validation error are high, and they converge to a similar value as training set size increases. The gap between them is small. Adding more training data will NOT significantly help---the model is too simple.

**High variance** (overfitting): Training error is low but validation error is much higher, with a large gap between them. As training set size increases, the gap narrows. Adding more training data WILL help, as it constrains the model and reduces the gap. Alternatively, you can reduce model complexity, add regularization, or use an ensemble method.

Question 20
You train a random forest with 100 trees and get a test accuracy of 92%. You then train with 1000 trees and get 92.1%. What explains this diminishing return?
Show Answer
Random forests reduce variance by averaging predictions from multiple trees. The variance of the average of $B$ identically distributed predictions (with pairwise correlation $\rho$ and individual variance $\sigma^2$) is: $$\rho\sigma^2 + \frac{1-\rho}{B}\sigma^2$$ As $B$ increases, only the second term decreases, and it approaches zero. However, the first term $\rho\sigma^2$ remains constant regardless of $B$. Once the second term is small relative to the first, adding more trees provides negligible improvement. Beyond a certain number of trees, the ensemble's error is dominated by the correlation between trees and bias, neither of which is reduced by adding more trees.
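Plugging illustrative numbers into this formula shows the diminishing return directly ($\rho$ and $\sigma^2$ here are assumed values, not measured ones):

```python
def ensemble_variance(rho, sigma2, B):
    # variance of the average of B correlated, identically distributed predictions
    return rho * sigma2 + (1 - rho) * sigma2 / B

rho, sigma2 = 0.4, 1.0
print(ensemble_variance(rho, sigma2, 100))   # 0.406
print(ensemble_variance(rho, sigma2, 1000))  # 0.4006: the floor rho * sigma2 dominates
```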
Question 21
Name three advantages and three disadvantages of decision trees compared to linear models.
Show Answer
**Advantages of decision trees**:

1. **No feature scaling required**: Trees are invariant to monotonic transformations of features.
2. **Handle non-linear relationships**: Can capture complex interactions without explicit feature engineering.
3. **Highly interpretable**: The tree structure can be visualized and explained to non-technical stakeholders.

**Disadvantages of decision trees**:

1. **High variance**: Small changes in data can produce very different trees (unstable).
2. **Greedy splitting**: Locally optimal splits may not lead to a globally optimal tree.
3. **Axis-aligned boundaries**: Standard trees can only split parallel to feature axes, requiring many splits to approximate diagonal or curved boundaries.

Question 22
When would you choose an SVM over logistic regression for a binary classification problem?
Show Answer
You might prefer an SVM when:

1. **Non-linear decision boundaries are needed**: SVMs with the RBF kernel can learn complex boundaries through the kernel trick, whereas logistic regression produces only linear boundaries (unless you manually add polynomial features).
2. **The data is high-dimensional with few samples**: SVMs are effective when $d \gg n$ (e.g., text classification) because the margin-maximization objective acts as a regularizer.
3. **You want robustness to outliers**: The SVM's hinge loss only penalizes misclassified points and points within the margin. Points far from the boundary do not affect the model (unlike logistic regression, where all points influence the fit).

However, logistic regression is preferred when you need calibrated probability outputs, when you have a very large dataset (SVMs scale poorly beyond ~100K samples), or when interpretability of coefficients is important.

Question 23
What is the difference between max_features in a decision tree and in a random forest? Why is it important for random forests?
Show Answer
In a **single decision tree**, `max_features` limits the number of features considered at each split, but this is less commonly used. In a **random forest**, `max_features` is critical: at each split, the tree considers only a random subset of features (typically $\sqrt{d}$ for classification or $d/3$ for regression). This ensures that the trees in the forest are **decorrelated**. Without this, all trees would tend to split on the same strong features, making them highly correlated. Averaging correlated trees provides much less variance reduction than averaging decorrelated ones. The `max_features` parameter directly controls the correlation-variance tradeoff: smaller values produce more diverse (decorrelated) trees but potentially weaker individual trees.
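In scikit-learn this corresponds to the `max_features` argument of `RandomForestClassifier`; a short sketch on synthetic data (the dataset and candidate values are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=25, n_informative=5, random_state=0)

for mf in ["sqrt", 0.5, None]:  # None considers all features: more correlated trees
    rf = RandomForestClassifier(n_estimators=200, max_features=mf, random_state=0)
    score = cross_val_score(rf, X, y, cv=5).mean()
    print(mf, round(score, 3))
```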
Question 24
Explain why gradient boosting typically uses shallow trees (depth 3--8) while random forests use deep trees.
Show Answer
**Random forests** use deep trees because each tree needs to be a strong learner on its own. Deep trees have low bias but high variance. The forest reduces variance through averaging, so starting with low-bias trees is ideal.

**Gradient boosting** uses shallow trees because each tree only needs to capture the residual pattern left by the previous ensemble. Shallow trees are weak learners with high bias, which is fine because boosting's sequential correction process systematically reduces bias. Using deep trees in boosting would risk overfitting at each stage, as each tree would try to fit the residuals too aggressively.

The depth controls the **interaction order** of features: a tree of depth $d$ can capture interactions among up to $d$ features.

Question 25
You have a dataset with 50 features and 10,000 samples. Your gradient boosting model achieves 95% training accuracy but only 78% test accuracy. List four specific strategies to close this gap, ordered from simplest to most complex.