Chapter 6 Exercises: Supervised Learning --- Regression and Classification
These exercises progress from foundational understanding through implementation to applied problem-solving. Exercises marked with (Code) require implementation. Exercises marked with (Math) require derivation or calculation. Exercises marked with (Conceptual) test understanding of key ideas.
Section A: Linear Regression (Exercises 1--10)
Exercise 1 (Math)
Given the following dataset, compute the OLS solution for simple linear regression $\hat{y} = \theta_0 + \theta_1 x$ by hand using the normal equations.
| $x$ | $y$ |
|---|---|
| 1 | 3 |
| 2 | 5 |
| 3 | 7 |
| 4 | 9 |
| 5 | 11 |
Verify that the fitted line passes through the point $(\bar{x}, \bar{y})$.
Exercise 2 (Math)
Starting from the MSE loss function $\mathcal{L}(\boldsymbol{\theta}) = \frac{1}{n}\|\mathbf{y} - \mathbf{X}\boldsymbol{\theta}\|^2$, derive the gradient $\nabla_{\boldsymbol{\theta}} \mathcal{L}$ and show that setting it to zero yields the normal equations $\mathbf{X}^\top \mathbf{X}\boldsymbol{\theta} = \mathbf{X}^\top \mathbf{y}$. State clearly which matrix calculus identities you use (refer back to Chapter 3 if needed).
Exercise 3 (Math)
Prove that the Ridge regression solution $\boldsymbol{\theta}^*_{\text{Ridge}} = (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})^{-1}\mathbf{X}^\top \mathbf{y}$ always exists for $\lambda > 0$, even when $\mathbf{X}^\top \mathbf{X}$ is singular. Hint: show that $\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I}$ is positive definite by considering $\mathbf{v}^\top (\mathbf{X}^\top \mathbf{X} + \lambda \mathbf{I})\mathbf{v}$ for any non-zero $\mathbf{v}$.
Exercise 4 (Code)
Implement linear regression from scratch using NumPy (both the closed-form OLS solution and gradient descent). Use the California Housing dataset from scikit-learn. Compare:
1. Your OLS implementation vs. sklearn.linear_model.LinearRegression
2. Your gradient descent implementation (experiment with learning rates 0.001, 0.01, 0.1)
3. Plot the loss curve for each learning rate over 1000 iterations
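One possible starting point for this comparison is sketched below. It is not the reference solution: the bias handling, default learning rate, and iteration count are assumptions, and the features are standardized so that the suggested learning rates converge (see also Exercise 7).

```python
# Starter sketch for Exercise 4 (not a full solution). Assumes scikit-learn is
# installed; fetch_california_housing downloads the data on first use.
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler

data = fetch_california_housing()
X = StandardScaler().fit_transform(data.data)      # standardize so GD converges
X = np.hstack([np.ones((X.shape[0], 1)), X])       # prepend a bias column
y = data.target

# Closed-form OLS via the normal equations (lstsq is more stable than a matrix inverse)
theta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

def gradient_descent(X, y, lr=0.01, n_iters=1000):
    """Minimize MSE = (1/n)||y - X theta||^2 by batch gradient descent."""
    n, d = X.shape
    theta = np.zeros(d)
    losses = []
    for _ in range(n_iters):
        residual = X @ theta - y
        theta -= lr * (2.0 / n) * X.T @ residual    # gradient of the MSE
        losses.append(np.mean(residual ** 2))
    return theta, losses
```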
Exercise 5 (Code)
Using the California Housing dataset, fit Ridge regression with $\lambda \in \{0.001, 0.01, 0.1, 1.0, 10.0, 100.0\}$. Plot the magnitude of each coefficient as a function of $\lambda$ (this is called a regularization path). What happens to the coefficients as $\lambda$ increases? Which features are the last to shrink to near-zero?
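A minimal sketch of the regularization path, assuming scikit-learn's Ridge (whose regularization strength is called alpha rather than $\lambda$) and standardized features:

```python
# Sketch for Exercise 5: Ridge regularization path.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge

data = fetch_california_housing()
X = StandardScaler().fit_transform(data.data)
y = data.target

lambdas = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
coefs = [Ridge(alpha=lam).fit(X, y).coef_ for lam in lambdas]   # alpha plays the role of lambda

plt.plot(lambdas, np.abs(coefs))        # one line per feature
plt.xscale('log')
plt.xlabel('lambda (alpha)')
plt.ylabel('|coefficient|')
plt.legend(data.feature_names, fontsize=8)
plt.show()
```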
Exercise 6 (Code)
Repeat Exercise 5 using Lasso regression. Compare the regularization paths for Ridge and Lasso. Which coefficients become exactly zero under Lasso? Explain why this happens based on the geometry of the $L_1$ penalty (recall the "diamond" constraint region from Chapter 5).
Exercise 7 (Conceptual)
Explain why feature scaling (standardization or normalization) is important for gradient descent but not for the OLS closed-form solution. Under what circumstances would you still want to scale features before applying OLS?
Exercise 8 (Math)
The $R^2$ metric can be negative. Construct a specific numerical example (with at least 4 data points) where a linear regression model achieves $R^2 < 0$. (Note that an OLS fit with an intercept always satisfies $R^2 \geq 0$ on its own training data, so your example must evaluate the model on data it was not fit to, or use a fixed, mis-specified line.) What does a negative $R^2$ mean in practical terms?
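For reference, the usual definition is

$$R^2 = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2},$$

where $\bar{y}$ is the mean of the targets in the evaluation set.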
Exercise 9 (Code)
Implement mini-batch stochastic gradient descent for linear regression. Your function should accept a batch_size parameter. Compare the convergence behavior for batch sizes of 1, 16, 64, and $n$ (full batch) on the California Housing dataset. Plot the loss curves and discuss the tradeoff between convergence speed and stability.
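A sketch of the mini-batch update loop; the epoch count, learning rate, and per-epoch reshuffling scheme are choices, not requirements, and X is assumed to already contain a bias column with standardized features:

```python
# Sketch of a mini-batch SGD loop for Exercise 9.
import numpy as np

def minibatch_sgd(X, y, batch_size=16, lr=0.01, n_epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    losses = []
    for _ in range(n_epochs):
        perm = rng.permutation(n)                       # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = perm[start:start + batch_size]
            Xb, yb = X[idx], y[idx]
            grad = (2.0 / len(idx)) * Xb.T @ (Xb @ theta - yb)
            theta -= lr * grad
        losses.append(np.mean((X @ theta - y) ** 2))    # full-data loss once per epoch
    return theta, losses
```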
Exercise 10 (Conceptual)
A colleague proposes adding 50 polynomial features (degree-50 polynomial of a single input variable) to a linear regression model but regularizing heavily with Ridge ($\lambda = 1000$). Could this work? Discuss the interplay between model complexity and regularization.
Section B: Polynomial Regression and Overfitting (Exercises 11--14)
Exercise 11 (Code)
Generate 50 data points from $y = \sin(2\pi x) + \epsilon$, where $x \sim \text{Uniform}(0, 1)$ and $\epsilon \sim \mathcal{N}(0, 0.3)$. Fit polynomial models of degree 1, 3, 5, 9, and 15. For each:
1. Plot the fitted curve overlaid on the data
2. Report training MSE and test MSE (use a separate test set of 200 points)
3. Identify the degree that gives the best test MSE
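The following sketch generates the data and fits a single degree; it interprets $\mathcal{N}(0, 0.3)$ as a noise standard deviation of 0.3, which is an assumption:

```python
# Sketch for Exercise 11: data generation plus one polynomial fit.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def make_data(n):
    x = rng.uniform(0, 1, size=(n, 1))
    y = np.sin(2 * np.pi * x).ravel() + rng.normal(0, 0.3, size=n)  # std 0.3 assumed
    return x, y

x_train, y_train = make_data(50)
x_test, y_test = make_data(200)

degree = 9
model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
model.fit(x_train, y_train)
print('train MSE:', mean_squared_error(y_train, model.predict(x_train)))
print('test  MSE:', mean_squared_error(y_test, model.predict(x_test)))
```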
Exercise 12 (Code)
Extend Exercise 11 by adding Ridge regularization ($\lambda = 0.01$) to the degree-15 polynomial. Compare the resulting fit to the un-regularized degree-15 polynomial and the degree-3 polynomial. What do you observe?
Exercise 13 (Math)
For polynomial regression of degree $p$ with $d$ original features, derive the formula for the number of polynomial features (including interaction terms). Calculate this number for $d = 5, p = 4$. Discuss the practical implications of this combinatorial explosion.
Exercise 14 (Conceptual)
Explain why a polynomial of degree $n-1$ (where $n$ is the number of training points, assumed to have distinct $x$ values) can always achieve zero training error. Under what conditions would this actually be a good model?
Section C: Logistic Regression and Classification (Exercises 15--21)
Exercise 15 (Math)
Show that the sigmoid function satisfies $\sigma(-z) = 1 - \sigma(z)$ and $\sigma'(z) = \sigma(z)(1 - \sigma(z))$. Then use these identities to simplify the gradient of the binary cross-entropy loss for logistic regression.
Exercise 16 (Math)
Derive the gradient of the binary cross-entropy loss for logistic regression:
$$\nabla_{\boldsymbol{\theta}} \mathcal{L} = \frac{1}{n}\mathbf{X}^\top(\hat{\mathbf{p}} - \mathbf{y})$$
Show each step of the derivation, starting from $\mathcal{L} = -\frac{1}{n}\sum_{i=1}^{n}[y_i \log \hat{p}_i + (1-y_i)\log(1-\hat{p}_i)]$.
Exercise 17 (Code)
Implement logistic regression from scratch using gradient descent with NumPy. Include:
1. The sigmoid function
2. The cross-entropy loss function
3. The gradient computation
4. A fit method that performs gradient descent
5. A predict_proba method that returns probabilities
6. A predict method that returns class labels
Test your implementation on the Breast Cancer Wisconsin dataset from scikit-learn and compare against sklearn.linear_model.LogisticRegression.
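A skeleton of the estimator is given below; the class and method names mirror the list above, but the hyperparameter defaults are arbitrary:

```python
# Skeleton for Exercise 17: binary logistic regression by batch gradient descent.
import numpy as np

class LogisticRegressionGD:
    def __init__(self, lr=0.1, n_iters=2000):
        self.lr = lr
        self.n_iters = n_iters

    @staticmethod
    def _sigmoid(z):
        # np.exp can overflow for large |z|; standardizing the inputs avoids this
        return 1.0 / (1.0 + np.exp(-z))

    @staticmethod
    def _loss(y, p, eps=1e-12):
        p = np.clip(p, eps, 1 - eps)                    # avoid log(0)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    def fit(self, X, y):
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])   # bias column
        self.theta_ = np.zeros(Xb.shape[1])
        self.losses_ = []
        for _ in range(self.n_iters):
            p = self._sigmoid(Xb @ self.theta_)
            grad = Xb.T @ (p - y) / len(y)              # gradient from Exercise 16
            self.theta_ -= self.lr * grad
            self.losses_.append(self._loss(y, p))
        return self

    def predict_proba(self, X):
        Xb = np.hstack([np.ones((X.shape[0], 1)), X])
        return self._sigmoid(Xb @ self.theta_)

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X) >= threshold).astype(int)
```

On the Breast Cancer data the raw feature scales differ by several orders of magnitude, so standardize the inputs before fitting; otherwise the exponentials in the sigmoid overflow.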
Exercise 18 (Code)
Using the Iris dataset (3 classes), implement softmax regression (multinomial logistic regression) from scratch. Verify your implementation against scikit-learn's LogisticRegression(multi_class='multinomial'). Report the confusion matrix and per-class precision, recall, and F1 scores.
Exercise 19 (Code)
Create a synthetic 2D classification dataset where the classes are not linearly separable (e.g., concentric circles using sklearn.datasets.make_circles). Fit logistic regression and visualize the decision boundary. Then add polynomial features (degree 2 and degree 4) and visualize the resulting decision boundaries. Discuss the improvement.
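A generic decision-boundary plotting helper is sketched below; it works for any fitted classifier with a predict method on 2-D inputs, so it can be reused in Exercises 23 and 24:

```python
# Reusable decision-boundary plot for 2-D classification problems.
import numpy as np
import matplotlib.pyplot as plt

def plot_decision_boundary(clf, X, y, ax=None, steps=300):
    ax = ax or plt.gca()
    x_min, x_max = X[:, 0].min() - 0.5, X[:, 0].max() + 0.5
    y_min, y_max = X[:, 1].min() - 0.5, X[:, 1].max() + 0.5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, steps),
                         np.linspace(y_min, y_max, steps))
    zz = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    ax.contourf(xx, yy, zz, alpha=0.3)                  # predicted class regions
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', s=20)
```

For the polynomial-feature variants, pass a pipeline such as make_pipeline(PolynomialFeatures(2), LogisticRegression()) as clf so the feature expansion is applied consistently at prediction time.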
Exercise 20 (Conceptual)
A binary classifier on an imbalanced dataset (95% negative, 5% positive) achieves 95% accuracy. Is this a good model? What metrics would you use instead? Calculate the precision, recall, and F1 score for a model that always predicts "negative." (Some of these quantities involve a zero denominator; state the convention you use to handle that case.)
Exercise 21 (Code)
Generate a synthetic binary classification dataset with sklearn.datasets.make_classification. Compute the ROC curve and Precision-Recall curve for logistic regression. Plot both curves. Calculate AUC-ROC and average precision. When is the Precision-Recall curve more informative than the ROC curve?
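A sketch of the metric computation; the class imbalance (weights=[0.9, 0.1]) is an added assumption that makes the ROC-vs-PR comparison more interesting:

```python
# Sketch for Exercise 21: ROC and Precision-Recall curves from predicted scores.
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (roc_curve, precision_recall_curve,
                             roc_auc_score, average_precision_score)

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

fpr, tpr, _ = roc_curve(y_te, scores)
prec, rec, _ = precision_recall_curve(y_te, scores)
print('AUC-ROC:', roc_auc_score(y_te, scores))
print('Average precision:', average_precision_score(y_te, scores))

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(fpr, tpr); ax1.set_xlabel('FPR'); ax1.set_ylabel('TPR'); ax1.set_title('ROC')
ax2.plot(rec, prec); ax2.set_xlabel('Recall'); ax2.set_ylabel('Precision'); ax2.set_title('PR')
plt.show()
```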
Section D: Support Vector Machines (Exercises 22--25)
Exercise 22 (Math)
Consider two data points: $\mathbf{x}_1 = (1, 1)$ with $y_1 = +1$ and $\mathbf{x}_2 = (3, 3)$ with $y_2 = -1$. Find the maximum-margin hyperplane by:
1. Writing down the SVM optimization problem
2. Finding the optimal $\mathbf{w}$ and $b$
3. Computing the margin
4. Identifying the support vectors
Exercise 23 (Code)
Using sklearn.datasets.make_moons (a non-linearly separable dataset), compare the decision boundaries of:
1. Linear SVM (kernel='linear')
2. Polynomial SVM (kernel='poly', degree 3)
3. RBF SVM (kernel='rbf')
Visualize all three decision boundaries on the same figure. Which kernel performs best?
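A sketch of the three-kernel comparison; the noise level and the default SVC hyperparameters (C=1, default gamma) are assumptions:

```python
# Sketch for Exercise 23: compare linear, polynomial, and RBF SVMs on make_moons.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
xx, yy = np.meshgrid(np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 300),
                     np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 300))
grid = np.c_[xx.ravel(), yy.ravel()]

fig, axes = plt.subplots(1, 3, figsize=(14, 4))
for kwargs, ax in zip([dict(kernel='linear'),
                       dict(kernel='poly', degree=3),
                       dict(kernel='rbf')], axes):
    clf = SVC(**kwargs).fit(X, y)
    ax.contourf(xx, yy, clf.predict(grid).reshape(xx.shape), alpha=0.3)
    ax.scatter(X[:, 0], X[:, 1], c=y, edgecolors='k', s=20)
    ax.set_title(f"{kwargs['kernel']} (train acc {clf.score(X, y):.2f})")
plt.show()
```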
Exercise 24 (Code)
Investigate the effect of the $C$ parameter on the SVM decision boundary. Using a 2D dataset:
1. Fit RBF SVMs with $C \in \{0.01, 0.1, 1.0, 10.0, 100.0\}$
2. Visualize the decision boundary for each $C$
3. Report the number of support vectors for each $C$
4. Discuss the relationship between $C$, the margin, and the number of support vectors
Exercise 25 (Conceptual)
Explain why SVMs are memory-efficient at prediction time compared to algorithms like k-Nearest Neighbors. What determines the prediction-time complexity of an SVM? Can you think of a scenario where an SVM would have poor prediction-time performance?
Section E: Tree-Based Methods and Ensembles (Exercises 26--32)
Exercise 26 (Math)
Given the following dataset, build a decision tree of depth 1 (a "stump") for classification. Calculate the Gini impurity for each possible split on each feature, and identify the optimal split.
| $x_1$ | $x_2$ | $y$ |
|---|---|---|
| 2 | 3 | 0 |
| 1 | 4 | 0 |
| 3 | 1 | 1 |
| 4 | 2 | 1 |
| 3 | 3 | 0 |
| 5 | 1 | 1 |
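As a reminder, the Gini impurity of a node with class proportions $p_k$ is

$$G = 1 - \sum_{k} p_k^2,$$

and a candidate split is typically scored by the weighted impurity of its two children, $\frac{n_L}{n} G_L + \frac{n_R}{n} G_R$.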
Exercise 27 (Code)
Fit a decision tree classifier to the Iris dataset with varying max_depth from 1 to 20. Plot training accuracy and validation accuracy (5-fold CV) as a function of max_depth. At what depth does overfitting begin? Use sklearn.tree.export_text or sklearn.tree.plot_tree to visualize the tree at the optimal depth.
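A sketch of the depth sweep; random_state is fixed only for reproducibility:

```python
# Sketch for Exercise 27: depth sweep with 5-fold CV on Iris.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier, plot_tree

X, y = load_iris(return_X_y=True)
depths = range(1, 21)
train_acc, cv_acc = [], []
for d in depths:
    clf = DecisionTreeClassifier(max_depth=d, random_state=0)
    cv_acc.append(cross_val_score(clf, X, y, cv=5).mean())
    train_acc.append(clf.fit(X, y).score(X, y))

plt.plot(depths, train_acc, label='train')
plt.plot(depths, cv_acc, label='5-fold CV')
plt.xlabel('max_depth'); plt.ylabel('accuracy'); plt.legend(); plt.show()

best_depth = depths[int(np.argmax(cv_acc))]
plot_tree(DecisionTreeClassifier(max_depth=best_depth, random_state=0).fit(X, y), filled=True)
plt.show()
```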
Exercise 28 (Code)
Implement a simplified random forest from scratch:
1. Create bootstrap samples from the training data
2. Fit a DecisionTreeClassifier on each bootstrap sample with max_features='sqrt'
3. Make predictions by majority vote
4. Compare your implementation to sklearn.ensemble.RandomForestClassifier
Test on the Wine dataset from scikit-learn.
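A skeleton of the bootstrap-and-vote logic; the helper names fit_forest and predict_forest are illustrative, not required:

```python
# Skeleton for Exercise 28: a simplified random forest via bootstrap samples and majority vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def fit_forest(X, y, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    trees = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)            # bootstrap sample (with replacement)
        tree = DecisionTreeClassifier(max_features='sqrt',
                                      random_state=int(rng.integers(10**6)))
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def predict_forest(trees, X):
    votes = np.stack([t.predict(X) for t in trees])  # shape (n_trees, n_samples)
    # majority vote per sample (assumes integer class labels, as in the Wine dataset)
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)
```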
Exercise 29 (Code)
Investigate the effect of n_estimators on random forest performance. Using the Digits dataset:
1. Fit random forests with $n \in \{1, 5, 10, 25, 50, 100, 200, 500\}$ estimators
2. Plot accuracy vs. number of estimators
3. At what point do additional trees stop providing significant improvement?
4. Measure the training time for each value of $n$
Exercise 30 (Code)
Compare gradient boosting performance for different learning rates and numbers of estimators. Using the California Housing dataset:
1. Fit GradientBoostingRegressor with learning rates $\eta \in \{0.01, 0.05, 0.1, 0.3\}$ and $n\_estimators \in \{50, 100, 200, 500\}$
2. Create a heatmap of test RMSE for each combination
3. Verify the inverse relationship between learning rate and optimal number of estimators
Exercise 31 (Code)
Build a complete model comparison pipeline for a classification task:
1. Load the Breast Cancer Wisconsin dataset
2. Preprocess the data (standardize features, train/test split)
3. Train and evaluate: Logistic Regression, SVM (linear), SVM (RBF), Decision Tree, Random Forest, Gradient Boosting
4. Use 5-fold cross-validation for each model
5. Report mean and standard deviation of accuracy, precision, recall, and F1
6. Visualize results with a grouped bar chart
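A sketch of the cross-validation loop covering steps 3 through 5; pandas is assumed only for the summary table, and the hyperparameters shown are library defaults, not recommendations:

```python
# Sketch for Exercise 31: compare several classifiers with 5-fold cross-validation.
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import cross_validate
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

X, y = load_breast_cancer(return_X_y=True)
models = {
    'LogReg':  LogisticRegression(max_iter=5000),
    'SVM-lin': SVC(kernel='linear'),
    'SVM-rbf': SVC(kernel='rbf'),
    'Tree':    DecisionTreeClassifier(),
    'RF':      RandomForestClassifier(n_estimators=200),
    'GB':      GradientBoostingClassifier(),
}
scoring = ['accuracy', 'precision', 'recall', 'f1']

rows = []
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)       # scaling happens inside each CV fold
    cv = cross_validate(pipe, X, y, cv=5, scoring=scoring)
    rows.append({'model': name,
                 **{m: cv[f'test_{m}'].mean() for m in scoring},
                 **{f'{m}_std': cv[f'test_{m}'].std() for m in scoring}})
print(pd.DataFrame(rows).set_index('model').round(3))
```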
Exercise 32 (Code)
Extract and compare feature importances from a Random Forest and Gradient Boosting model trained on the same dataset (use the Diabetes dataset from scikit-learn). Do the models agree on which features are most important? Plot the importances side by side.
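A sketch of the side-by-side comparison; since the Diabetes target is continuous, the regressor variants of both ensembles are assumed:

```python
# Sketch for Exercise 32: compare feature importances of two ensembles.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

data = load_diabetes()
X, y, names = data.data, data.target, data.feature_names

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)
gb = GradientBoostingRegressor(random_state=0).fit(X, y)

idx = np.arange(len(names))
width = 0.4
plt.bar(idx - width / 2, rf.feature_importances_, width, label='Random Forest')
plt.bar(idx + width / 2, gb.feature_importances_, width, label='Gradient Boosting')
plt.xticks(idx, names, rotation=45)
plt.ylabel('importance'); plt.legend(); plt.tight_layout(); plt.show()
```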
Section F: Bias-Variance Tradeoff (Exercises 33--35)
Exercise 33 (Code)
Empirically demonstrate the bias-variance decomposition for polynomial regression:
1. Generate 100 different training sets (each with 30 points) from $y = x^3 - 2x + \epsilon$, $\epsilon \sim \mathcal{N}(0, 1)$
2. For degrees 1, 2, 3, 5, 10, 15, fit a polynomial to each training set
3. At a fixed test point $x_0 = 0.5$, compute:
   - Average prediction across the 100 models (estimate of $\mathbb{E}[\hat{f}(x_0)]$)
   - Bias$^2$: $(\mathbb{E}[\hat{f}(x_0)] - f(x_0))^2$
   - Variance: $\mathbb{E}[(\hat{f}(x_0) - \mathbb{E}[\hat{f}(x_0)])^2]$
4. Plot bias$^2$, variance, and total error as a function of polynomial degree
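A sketch of the simulation loop at the single test point. The input range $\text{Uniform}(-2, 2)$ for the training sets is an assumption, since the exercise does not specify one:

```python
# Sketch for Exercise 33: bias^2 and variance of polynomial fits at x0.
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x**3 - 2*x                        # true (noise-free) function

x0, n_sets, n_points = 0.5, 100, 30

for degree in [1, 2, 3, 5, 10, 15]:
    preds = np.empty(n_sets)
    for s in range(n_sets):
        x = rng.uniform(-2, 2, n_points)     # assumed input range
        y = f(x) + rng.normal(0, 1, n_points)
        coeffs = np.polyfit(x, y, degree)    # high degrees may trigger a RankWarning
        preds[s] = np.polyval(coeffs, x0)
    bias2 = (preds.mean() - f(x0))**2
    variance = preds.var()
    # expected test error at x0 is approximately bias^2 + variance + noise variance (1.0 here)
    print(f"degree {degree:2d}: bias^2={bias2:.4f}  variance={variance:.4f}")
```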
Exercise 34 (Code)
Generate learning curves for three models (linear regression, decision tree with max_depth=3, and random forest with 100 trees) on the California Housing dataset. For each model:
1. Plot training and validation error vs. training set size
2. Diagnose whether each model suffers from high bias or high variance
3. For the model with high variance, suggest and implement a fix
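A sketch using sklearn.model_selection.learning_curve; the regression variants of the tree models are assumed because the housing target is continuous, and RMSE is used as the error metric:

```python
# Sketch for Exercise 34: learning curves for three regressors.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import learning_curve
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor

X, y = fetch_california_housing(return_X_y=True)
models = {
    'Linear regression': LinearRegression(),
    'Tree (depth 3)':    DecisionTreeRegressor(max_depth=3),
    'Random forest':     RandomForestRegressor(n_estimators=100),  # this one is slow
}

for name, model in models.items():
    sizes, train_scores, val_scores = learning_curve(
        model, X, y, cv=5, scoring='neg_root_mean_squared_error',
        train_sizes=np.linspace(0.1, 1.0, 8))
    plt.plot(sizes, -train_scores.mean(axis=1), '--', label=f'{name} (train)')
    plt.plot(sizes, -val_scores.mean(axis=1), label=f'{name} (val)')
plt.xlabel('training set size'); plt.ylabel('RMSE'); plt.legend(); plt.show()
```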
Exercise 35 (Conceptual)
You are given a model that achieves:
- Training RMSE: 0.5
- Validation RMSE: 5.2
- Test RMSE: 5.1
Diagnose the problem and propose three specific strategies to improve performance. Justify each strategy in terms of the bias-variance tradeoff.