Chapter 6 Key Takeaways

The Big Picture

Supervised learning is the foundation of applied machine learning. Given labeled training data $\{(\mathbf{x}_i, y_i)\}$, you learn a mapping $f$ from inputs to outputs that generalizes to unseen data. The two core tasks are regression (continuous targets) and classification (discrete targets).


Core Algorithms at a Glance

Linear Regression

  • Models $\hat{y} = \mathbf{w}^\top \mathbf{x} + b$ and minimizes mean squared error.
  • The OLS closed-form solution is $\boldsymbol{\theta}^* = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$, where a column of ones is appended to $\mathbf{X}$ so the bias $b$ folds into $\boldsymbol{\theta}$.
  • Gradient descent is preferred when the number of features is large, since the closed form requires solving a $d \times d$ system.
  • Regularization prevents overfitting: Ridge ($L_2$) shrinks all coefficients; Lasso ($L_1$) drives some to exactly zero for feature selection.
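The closed-form and ridge solutions above can be sketched in a few lines of NumPy. This is an illustrative toy (synthetic data, and for simplicity the ridge penalty here also shrinks the bias term, which production implementations typically exclude):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2*x1 - 3*x2 + 1 + noise
n = 200
X = rng.normal(size=(n, 2))
y = 2 * X[:, 0] - 3 * X[:, 1] + 1 + 0.1 * rng.normal(size=n)

# Append a column of ones so the bias b is folded into theta.
Xb = np.hstack([X, np.ones((n, 1))])

# OLS via the normal equations: theta* = (X^T X)^{-1} X^T y
theta_ols = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Ridge adds lambda * I to the normal equations, shrinking coefficients.
lam = 1.0
theta_ridge = np.linalg.solve(Xb.T @ Xb + lam * np.eye(3), Xb.T @ y)

print(theta_ols)    # close to the true [2, -3, 1]
print(theta_ridge)  # slightly shrunk toward zero
```

Note the use of `np.linalg.solve` rather than an explicit matrix inverse, which is both faster and numerically safer.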

Polynomial Regression

  • Extends linear regression to non-linear relationships by adding polynomial feature terms.
  • The model remains linear in its parameters, so the same OLS/gradient descent machinery applies.
  • High-degree polynomials overfit badly without regularization.
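A minimal sketch of the "linear in parameters" point: build the polynomial feature map by hand and fit it with ordinary least squares (synthetic quadratic data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

# Non-linear target: y = 0.5*x^2 - x + 2 + noise
x = rng.uniform(-3, 3, size=100)
y = 0.5 * x**2 - x + 2 + 0.1 * rng.normal(size=100)

# Polynomial feature map: [1, x, x^2]. The model is non-linear in x
# but still linear in the coefficients, so plain least squares applies.
X_poly = np.column_stack([np.ones_like(x), x, x**2])
coef, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(coef)  # close to the true [2, -1, 0.5]
```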

Logistic Regression

  • Classification via the sigmoid function: $P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^\top\mathbf{x} + b)$.
  • Trained by maximizing likelihood (equivalently, minimizing cross-entropy loss).
  • Produces calibrated probability estimates, not just class labels.
  • Decision boundary is a hyperplane in feature space.
  • Extends to multiclass via softmax regression.
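The sigmoid-plus-cross-entropy recipe can be sketched with batch gradient descent, using the standard gradient $\frac{1}{n}\mathbf{X}^\top(\mathbf{p} - \mathbf{y})$. Toy two-blob data; learning rate and iteration count are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two well-separated Gaussian blobs.
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
Xb = np.hstack([X, np.ones((200, 1))])  # bias column

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Minimize cross-entropy by gradient descent; the gradient is
# X^T (p - y) / n, where p are the predicted probabilities.
theta = np.zeros(3)
for _ in range(500):
    p = sigmoid(Xb @ theta)
    theta -= 0.1 * Xb.T @ (p - y) / len(y)

probs = sigmoid(Xb @ theta)          # probability estimates, not just labels
acc = np.mean((probs > 0.5) == y)
print(f"train accuracy: {acc:.2f}")
```

Thresholding `probs` at 0.5 recovers the hyperplane decision boundary; the probabilities themselves are what make logistic regression useful when calibrated scores matter.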

Support Vector Machines

  • Find the maximum-margin hyperplane separating classes.
  • The kernel trick enables non-linear decision boundaries without explicit feature transformation.
  • The $C$ parameter controls the margin-error tradeoff.
  • Memory-efficient at prediction time (depends only on support vectors).
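The margin-error tradeoff controlled by $C$ can be sketched with subgradient descent on the primal hinge-loss objective $\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \max(0, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b))$. This is a linear-SVM toy only; real solvers work in the dual, which is what enables the kernel trick:

```python
import numpy as np

rng = np.random.default_rng(3)

# Labels in {-1, +1}, as SVMs conventionally use.
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)

# Subgradient descent on the soft-margin primal objective.
# C trades margin width against hinge-loss violations.
C, lr = 1.0, 0.01
w, b = np.zeros(2), 0.0
for _ in range(200):
    margins = y * (X @ w + b)
    mask = margins < 1                      # points violating the margin
    grad_w = w - C * (y[mask, None] * X[mask]).sum(axis=0)
    grad_b = -C * y[mask].sum()
    w -= lr * grad_w
    b -= lr * grad_b

on_margin = np.sum(y * (X @ w + b) <= 1)    # points on or inside the margin
acc = np.mean(np.sign(X @ w + b) == y)
print(f"accuracy: {acc:.2f}, points on/inside margin: {on_margin}")
```

The final classifier depends only on the points on or inside the margin (the support vectors); the rest could be deleted without changing the solution.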

Decision Trees

  • Recursively partition feature space via axis-aligned splits.
  • Splitting criterion: Gini impurity or entropy (classification), variance reduction (regression).
  • Highly interpretable but prone to overfitting without pruning or depth constraints.
  • No feature scaling required; handle mixed feature types naturally.
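The Gini-based splitting criterion is easy to compute directly. A minimal sketch for one feature (exhaustive threshold search, as a real tree would do per feature at each node):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Best axis-aligned threshold on one feature by weighted Gini."""
    best_t, best_score = None, np.inf
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))  # splits perfectly at t = 3.0 with impurity 0.0
```

No scaling of `x` is needed: only the ordering of values matters, which is why trees are insensitive to monotone feature transformations.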

Random Forests

  • Ensemble of decorrelated decision trees via bagging and random feature selection.
  • Primarily reduces variance while maintaining low bias.
  • Robust to hyperparameter choices; hard to "break" with bad settings.
  • Provide feature importance rankings.
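Bagging plus random feature selection can be sketched with depth-1 stumps standing in for full trees (a deliberately crude toy; real forests grow deep trees and subsample features at every split, not once per tree):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data: both features are informative about the class.
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

def fit_stump(X, y):
    """Depth-1 tree on one randomly chosen feature (feature subsampling)."""
    j = rng.integers(X.shape[1])
    t = np.median(X[:, j])
    left = int(np.round(y[X[:, j] <= t].mean()))  # majority label per side
    return j, t, left, 1 - left

def predict_stump(stump, X):
    j, t, left, right = stump
    return np.where(X[:, j] <= t, left, right)

# Bagging: each stump is trained on a bootstrap sample (drawn with
# replacement), which decorrelates the ensemble members.
stumps = []
for _ in range(25):
    idx = rng.integers(len(y), size=len(y))
    stumps.append(fit_stump(X[idx], y[idx]))

# Majority vote over the ensemble averages away individual stumps' errors.
votes = np.mean([predict_stump(s, X) for s in stumps], axis=0)
acc = np.mean((votes > 0.5) == y)
print(f"ensemble accuracy: {acc:.2f}")
```

Averaging many high-variance, decorrelated learners is exactly the variance-reduction mechanism the bullet points describe.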

Gradient Boosting (XGBoost)

  • Sequential ensemble: each new tree corrects the errors of the previous ensemble.
  • Primarily reduces bias by fitting pseudo-residuals.
  • Uses shallow trees (weak learners) with a learning rate for regularization.
  • State-of-the-art for tabular data; dominates ML competitions.
  • Key hyperparameters: learning rate, number of trees, max depth, subsample fraction.
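The fit-the-residuals loop can be sketched for squared-error regression with stump weak learners (a toy version of the idea; XGBoost adds second-order gradients, regularized tree growth, and much more):

```python
import numpy as np

rng = np.random.default_rng(5)

x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + 0.1 * rng.normal(size=200)

def fit_stump(x, y):
    """Depth-1 regression stump minimizing squared error over a grid."""
    best = None
    for t in np.linspace(-3, 3, 25):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= t, left.mean(), right.mean())
        sse = np.sum((y - pred) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

# Boosting: each stump fits the pseudo-residuals of the current ensemble,
# and its contribution is scaled by a learning rate (shrinkage).
lr, F = 0.3, np.zeros_like(y)
for _ in range(100):
    resid = y - F                     # pseudo-residuals for squared loss
    t, lv, rv = fit_stump(x, resid)
    F += lr * np.where(x <= t, lv, rv)

rmse = np.sqrt(np.mean((y - F) ** 2))
print(f"train RMSE: {rmse:.3f}")
```

Each round reduces the remaining error (bias), which is why boosting pairs weak, shallow learners with a small learning rate rather than deep trees.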

The Bias-Variance Tradeoff

$$\text{Expected Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

  • High bias (underfitting): Model is too simple. Both train and test errors are high. Fix: increase model complexity, add features, reduce regularization.
  • High variance (overfitting): Model is too complex. Train error is low but test error is high. Fix: reduce model complexity, add regularization, get more data, use ensembles.
  • Use learning curves and cross-validation to diagnose and manage the tradeoff.
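The tradeoff is easy to see empirically by sweeping model complexity. A sketch using polynomial degree as the complexity knob (synthetic data; the specific degrees are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)

# Small noisy training set, larger held-out test set.
x = rng.uniform(-1, 1, size=30)
y = np.sin(3 * x) + 0.2 * rng.normal(size=30)
x_test = rng.uniform(-1, 1, size=200)
y_test = np.sin(3 * x_test) + 0.2 * rng.normal(size=200)

def poly_fit_mse(deg):
    """Fit a degree-`deg` polynomial; return (train MSE, test MSE)."""
    coef = np.polyfit(x, y, deg)
    train = np.mean((np.polyval(coef, x) - y) ** 2)
    test = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
    return train, test

# Degree 1 underfits (high bias: both errors high); degree 12 overfits
# (high variance: train error keeps falling while test error does not).
for deg in (1, 4, 12):
    tr, te = poly_fit_mse(deg)
    print(f"degree {deg:2d}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

Plotting these errors against degree gives the classic U-shaped test curve; a validation curve in scikit-learn automates the same sweep.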

Evaluation Metrics

Regression

  • RMSE: Root mean squared error (same units as target; penalizes large errors)
  • MAE: Mean absolute error (robust to outliers)
  • $R^2$: Proportion of variance explained (1.0 = perfect; 0.0 = no better than predicting the mean; can be negative on held-out data)
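All three regression metrics fall out of a few array operations. A quick worked example on made-up numbers:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 10.0])

# RMSE: same units as the target; squaring penalizes large errors.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))   # sqrt(0.375) ~ 0.612

# MAE: linear in the error, so more robust to outliers.
mae = np.mean(np.abs(y_true - y_pred))            # 0.5

# R^2: 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)           # 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 20.0
r2 = 1 - ss_res / ss_tot                          # 0.925

print(rmse, mae, r2)
```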

Classification

  • Accuracy: Fraction correct (misleading with imbalanced classes)
  • Precision: Of predicted positives, how many are truly positive
  • Recall: Of actual positives, how many are correctly identified
  • F1 Score: Harmonic mean of precision and recall
  • ROC-AUC: Area under the ROC curve (threshold-independent)
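Precision, recall, and F1 come straight from the confusion-matrix counts. A worked example on a small imbalanced label vector (4 positives, 6 negatives):

```python
import numpy as np

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # 3 true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # 1 false positive
fn = np.sum((y_pred == 0) & (y_true == 1))  # 1 false negative

precision = tp / (tp + fp)                  # 3/4 = 0.75
recall = tp / (tp + fn)                     # 3/4 = 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75

accuracy = np.mean(y_true == y_pred)        # 8/10 = 0.80
print(precision, recall, f1, accuracy)
```

ROC-AUC, by contrast, needs the model's scores rather than hard labels, since it integrates over all possible thresholds.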

Practical Decision Framework

  1. Always start with a simple baseline (linear/logistic regression).
  2. Tree-based ensembles (random forest, gradient boosting) are the go-to for tabular data.
  3. SVMs work well for medium-sized datasets with complex boundaries.
  4. Feature scaling is required for linear models, SVMs, and gradient descent---not for tree-based methods.
  5. Regularization is almost always beneficial; tune the strength via cross-validation.
  6. Evaluate on held-out test data exactly once for the final performance estimate.
  7. Use cross-validation for model selection and hyperparameter tuning.

Common Pitfalls to Avoid

  1. Evaluating on training data: Always use a separate test set or cross-validation.
  2. Data leakage: Never use information from the test set during training or preprocessing (e.g., fitting the scaler on the full dataset).
  3. Ignoring class imbalance: Use stratified splits, class weights, and appropriate metrics (F1, ROC-AUC).
  4. Ignoring the bias-variance tradeoff: A model with very low training error is not necessarily a good model.
  5. Using accuracy alone for imbalanced data: A model that always predicts the majority class can have high accuracy but is useless.
  6. Overfitting to the validation set: If you tune hyperparameters extensively on the validation set, your validation performance may be optimistically biased. Reserve the test set for final evaluation only.
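Pitfall 2 (data leakage) is worth seeing concretely. A sketch of the scaler example: split first, then compute preprocessing statistics on the training portion only (synthetic data; in scikit-learn a `Pipeline` enforces this ordering automatically):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(5.0, 2.0, size=(100, 3))

# Split FIRST, then fit preprocessing on the training portion only.
X_train, X_test = X[:80], X[80:]

# Correct: standardization statistics come from the training set alone.
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train_s = (X_train - mu) / sigma
X_test_s = (X_test - mu) / sigma   # test set reuses TRAIN statistics

# Leaky (wrong): statistics computed on the full dataset let test-set
# information influence preprocessing, biasing the evaluation.
mu_leak = X.mean(axis=0)

print(np.allclose(mu, mu_leak))    # generally False: the estimates differ
```

The leak here is subtle because the model never sees test labels; the test *features* alone are enough to contaminate the estimate.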

Looking Ahead

  • Chapter 7: Unsupervised learning---finding structure without labels.
  • Chapter 8: Model evaluation, cross-validation, and selection strategies in depth.
  • Chapter 9: Advanced ensemble methods, stacking, and blending.
  • Chapters 11--14: Neural networks and deep learning build directly on the logistic regression and gradient descent foundations from this chapter.