Chapter 6 Key Takeaways

The Big Picture

Supervised learning is the foundation of applied machine learning. Given labeled training data $\{(\mathbf{x}_i, y_i)\}$, you learn a mapping $f$ from inputs to outputs that generalizes to unseen data. The two core tasks are regression (continuous targets) and classification (discrete targets).


Core Algorithms at a Glance

Linear Regression

  • Models $\hat{y} = \mathbf{w}^\top \mathbf{x} + b$ and minimizes mean squared error.
  • The OLS closed-form solution is $\boldsymbol{\theta}^* = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$, where a column of ones is appended to $\mathbf{X}$ so the bias $b$ folds into $\boldsymbol{\theta}$.
  • Gradient descent is preferred when the number of features is large, since the closed form requires solving a $d \times d$ system.
  • Regularization prevents overfitting: Ridge ($L_2$) shrinks all coefficients; Lasso ($L_1$) drives some to exactly zero for feature selection.
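The closed-form and ridge solutions above can be sketched in a few lines of NumPy. This is an illustrative toy (synthetic data, and for simplicity the ridge penalty here also shrinks the bias term, which production implementations typically exclude):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y = 2*x1 - 3*x2 + 1 + noise
n = 200
X = rng.normal(size=(n, 2))
y = 2 * X[:, 0] - 3 * X[:, 1] + 1 + 0.1 * rng.normal(size=n)

# Append a column of ones so the bias b is folded into theta.
Xb = np.hstack([X, np.ones((n, 1))])

# OLS via the normal equations: theta* = (X^T X)^{-1} X^T y
theta_ols = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# Ridge adds lambda * I to the normal equations, shrinking coefficients.
lam = 1.0
theta_ridge = np.linalg.solve(Xb.T @ Xb + lam * np.eye(3), Xb.T @ y)

print(theta_ols)    # close to the true [2, -3, 1]
print(theta_ridge)  # slightly shrunk toward zero
```

Note the use of `np.linalg.solve` rather than an explicit matrix inverse, which is both faster and numerically safer.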

Polynomial Regression

  • Extends linear regression to non-linear relationships by adding polynomial feature terms.
  • The model remains linear in its parameters, so the same OLS/gradient descent machinery applies.
  • High-degree polynomials overfit badly without regularization.
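A minimal sketch of the "linear in parameters" point: build the polynomial feature map by hand and fit it with ordinary least squares (synthetic quadratic data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)

# Non-linear target: y = 0.5*x^2 - x + 2 + noise
x = rng.uniform(-3, 3, size=100)
y = 0.5 * x**2 - x + 2 + 0.1 * rng.normal(size=100)

# Polynomial feature map: [1, x, x^2]. The model is non-linear in x
# but still linear in the coefficients, so plain least squares applies.
X_poly = np.column_stack([np.ones_like(x), x, x**2])
coef, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(coef)  # close to the true [2, -1, 0.5]
```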

Logistic Regression

  • Classification via the sigmoid function: $P(y=1|\mathbf{x}) = \sigma(\mathbf{w}^\top\mathbf{x} + b)$.
  • Trained by maximizing likelihood (equivalently, minimizing cross-entropy loss).
  • Produces calibrated probability estimates, not just class labels.
  • Decision boundary is a hyperplane in feature space.
  • Extends to multiclass via softmax regression.
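The sigmoid-plus-cross-entropy recipe can be sketched with batch gradient descent, using the standard gradient $\frac{1}{n}\mathbf{X}^\top(\mathbf{p} - \mathbf{y})$. Toy two-blob data; learning rate and iteration count are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Two well-separated Gaussian blobs.
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)
Xb = np.hstack([X, np.ones((200, 1))])  # bias column

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Minimize cross-entropy by gradient descent; the gradient is
# X^T (p - y) / n, where p are the predicted probabilities.
theta = np.zeros(3)
for _ in range(500):
    p = sigmoid(Xb @ theta)
    theta -= 0.1 * Xb.T @ (p - y) / len(y)

probs = sigmoid(Xb @ theta)          # probability estimates, not just labels
acc = np.mean((probs > 0.5) == y)
print(f"train accuracy: {acc:.2f}")
```

Thresholding `probs` at 0.5 recovers the hyperplane decision boundary; the probabilities themselves are what make logistic regression useful when calibrated scores matter.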

Support Vector Machines

  • Find the maximum-margin hyperplane separating classes.
  • The kernel trick enables non-linear decision boundaries without explicit feature transformation.
  • The $C$ parameter controls the margin-error tradeoff.
  • Memory-efficient at prediction time (depends only on support vectors).
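The margin-error tradeoff controlled by $C$ can be sketched with subgradient descent on the primal hinge-loss objective $\frac{1}{2}\|\mathbf{w}\|^2 + C\sum_i \max(0, 1 - y_i(\mathbf{w}^\top\mathbf{x}_i + b))$. This is a linear-SVM toy only; real solvers work in the dual, which is what enables the kernel trick:

```python
import numpy as np

rng = np.random.default_rng(3)

# Labels in {-1, +1}, as SVMs conventionally use.
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)

# Subgradient descent on the soft-margin primal objective.
# C trades margin width against hinge-loss violations.
C, lr = 1.0, 0.01
w, b = np.zeros(2), 0.0
for _ in range(200):
    margins = y * (X @ w + b)
    mask = margins < 1                      # points violating the margin
    grad_w = w - C * (y[mask, None] * X[mask]).sum(axis=0)
    grad_b = -C * y[mask].sum()
    w -= lr * grad_w
    b -= lr * grad_b

on_margin = np.sum(y * (X @ w + b) <= 1)    # points on or inside the margin
acc = np.mean(np.sign(X @ w + b) == y)
print(f"accuracy: {acc:.2f}, points on/inside margin: {on_margin}")
```

The final classifier depends only on the points on or inside the margin (the support vectors); the rest could be deleted without changing the solution.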

Decision Trees

  • Recursively partition feature space via axis-aligned splits.
  • Splitting criterion: Gini impurity or entropy (classification), variance reduction (regression).
  • Highly interpretable but prone to overfitting without pruning or depth constraints.
  • No feature scaling required; handle mixed feature types naturally.
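The Gini-based splitting criterion is easy to compute directly. A minimal sketch for one feature (exhaustive threshold search, as a real tree would do per feature at each node):

```python
import numpy as np

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(x, y):
    """Best axis-aligned threshold on one feature by weighted Gini."""
    best_t, best_score = None, np.inf
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        if score < best_score:
            best_t, best_score = t, score
    return best_t, best_score

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
print(best_split(x, y))  # splits perfectly at t = 3.0 with impurity 0.0
```

No scaling of `x` is needed: only the ordering of values matters, which is why trees are insensitive to monotone feature transformations.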

Random Forests

  • Ensemble of decorrelated decision trees via bagging and random feature selection.
  • Primarily reduces variance while maintaining low bias.
  • Robust to hyperparameter choices; hard to "break" with bad settings.
  • Provide feature importance rankings.
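Bagging plus random feature selection can be sketched with depth-1 stumps standing in for full trees (a deliberately crude toy; real forests grow deep trees and subsample features at every split, not once per tree):

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy data: both features are informative about the class.
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

def fit_stump(X, y):
    """Depth-1 tree on one randomly chosen feature (feature subsampling)."""
    j = rng.integers(X.shape[1])
    t = np.median(X[:, j])
    left = int(np.round(y[X[:, j] <= t].mean()))  # majority label per side
    return j, t, left, 1 - left

def predict_stump(stump, X):
    j, t, left, right = stump
    return np.where(X[:, j] <= t, left, right)

# Bagging: each stump is trained on a bootstrap sample (drawn with
# replacement), which decorrelates the ensemble members.
stumps = []
for _ in range(25):
    idx = rng.integers(len(y), size=len(y))
    stumps.append(fit_stump(X[idx], y[idx]))

# Majority vote over the ensemble averages away individual stumps' errors.
votes = np.mean([predict_stump(s, X) for s in stumps], axis=0)
acc = np.mean((votes > 0.5) == y)
print(f"ensemble accuracy: {acc:.2f}")
```

Averaging many high-variance, decorrelated learners is exactly the variance-reduction mechanism the bullet points describe.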

Gradient Boosting (XGBoost)

  • Sequential ensemble: each new tree corrects the errors of the previous ensemble.
  • Primarily reduces bias by fitting pseudo-residuals.
  • Uses shallow trees (weak learners) with a learning rate for regularization.
  • State-of-the-art for tabular data; dominates ML competitions.
  • Key hyperparameters: learning rate, number of trees, max depth, subsample fraction.
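The fit-the-residuals loop can be sketched for squared-error regression with stump weak learners (a toy version of the idea; XGBoost adds second-order gradients, regularized tree growth, and much more):

```python
import numpy as np

rng = np.random.default_rng(5)

x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + 0.1 * rng.normal(size=200)

def fit_stump(x, y):
    """Depth-1 regression stump minimizing squared error over a grid."""
    best = None
    for t in np.linspace(-3, 3, 25):
        left, right = y[x <= t], y[x > t]
        if len(left) == 0 or len(right) == 0:
            continue
        pred = np.where(x <= t, left.mean(), right.mean())
        sse = np.sum((y - pred) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

# Boosting: each stump fits the pseudo-residuals of the current ensemble,
# and its contribution is scaled by a learning rate (shrinkage).
lr, F = 0.3, np.zeros_like(y)
for _ in range(100):
    resid = y - F                     # pseudo-residuals for squared loss
    t, lv, rv = fit_stump(x, resid)
    F += lr * np.where(x <= t, lv, rv)

rmse = np.sqrt(np.mean((y - F) ** 2))
print(f"train RMSE: {rmse:.3f}")
```

Each round reduces the remaining error (bias), which is why boosting pairs weak, shallow learners with a small learning rate rather than deep trees.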

The Bias-Variance Tradeoff

$$\text{Expected Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$

  • High bias (underfitting): Model is too simple. Both train and test errors are high. Fix: increase model complexity, add features, reduce regularization.
  • High variance (overfitting): Model is too complex. Train error is low but test error is high. Fix: reduce model complexity, add regularization, get more data, use ensembles.
  • Use learning curves and cross-validation to diagnose and manage the tradeoff.
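The tradeoff is easy to see empirically by sweeping model complexity. A sketch using polynomial degree as the complexity knob (synthetic data; the specific degrees are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(6)

# Small noisy training set, larger held-out test set.
x = rng.uniform(-1, 1, size=30)
y = np.sin(3 * x) + 0.2 * rng.normal(size=30)
x_test = rng.uniform(-1, 1, size=200)
y_test = np.sin(3 * x_test) + 0.2 * rng.normal(size=200)

def poly_fit_mse(deg):
    """Fit a degree-`deg` polynomial; return (train MSE, test MSE)."""
    coef = np.polyfit(x, y, deg)
    train = np.mean((np.polyval(coef, x) - y) ** 2)
    test = np.mean((np.polyval(coef, x_test) - y_test) ** 2)
    return train, test

# Degree 1 underfits (high bias: both errors high); degree 12 overfits
# (high variance: train error keeps falling while test error does not).
for deg in (1, 4, 12):
    tr, te = poly_fit_mse(deg)
    print(f"degree {deg:2d}: train MSE {tr:.3f}, test MSE {te:.3f}")
```

Plotting these errors against degree gives the classic U-shaped test curve; a validation curve in scikit-learn automates the same sweep.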

Evaluation Metrics

Regression

  • RMSE: Root mean squared error (same units as target; penalizes large errors)
  • MAE: Mean absolute error (robust to outliers)
  • $R^2$: Proportion of variance explained (1.0 = perfect; 0.0 = no better than predicting the mean; can be negative on held-out data)
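All three regression metrics fall out of a few array operations. A quick worked example on made-up numbers:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 7.5, 10.0])

# RMSE: same units as the target; squaring penalizes large errors.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))   # sqrt(0.375) ~ 0.612

# MAE: linear in the error, so more robust to outliers.
mae = np.mean(np.abs(y_true - y_pred))            # 0.5

# R^2: 1 - (residual sum of squares) / (total sum of squares).
ss_res = np.sum((y_true - y_pred) ** 2)           # 1.5
ss_tot = np.sum((y_true - y_true.mean()) ** 2)    # 20.0
r2 = 1 - ss_res / ss_tot                          # 0.925

print(rmse, mae, r2)
```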

Classification

  • Accuracy: Fraction correct (misleading with imbalanced classes)
  • Precision: Of predicted positives, how many are truly positive
  • Recall: Of actual positives, how many are correctly identified
  • F1 Score: Harmonic mean of precision and recall
  • ROC-AUC: Area under the ROC curve (threshold-independent)
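Precision, recall, and F1 come straight from the confusion-matrix counts. A worked example on a small imbalanced label vector (4 positives, 6 negatives):

```python
import numpy as np

y_true = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 1, 0, 1, 0, 0, 0, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # 3 true positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # 1 false positive
fn = np.sum((y_pred == 0) & (y_true == 1))  # 1 false negative

precision = tp / (tp + fp)                  # 3/4 = 0.75
recall = tp / (tp + fn)                     # 3/4 = 0.75
f1 = 2 * precision * recall / (precision + recall)  # 0.75

accuracy = np.mean(y_true == y_pred)        # 8/10 = 0.80
print(precision, recall, f1, accuracy)
```

ROC-AUC, by contrast, needs the model's scores rather than hard labels, since it integrates over all possible thresholds.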

Practical Decision Framework

  1. Always start with a simple baseline (linear/logistic regression).
  2. Tree-based ensembles (random forest, gradient boosting) are the go-to for tabular data.
  3. SVMs work well for medium-sized datasets with complex boundaries.
  4. Feature scaling is required for linear models, SVMs, and gradient descent---not for tree-based methods.
  5. Regularization is almost always beneficial; tune the strength via cross-validation.
  6. Evaluate on held-out test data exactly once for the final performance estimate.
  7. Use cross-validation for model selection and hyperparameter tuning.

Common Pitfalls to Avoid

  1. Evaluating on training data: Always use a separate test set or cross-validation.
  2. Data leakage: Never use information from the test set during training or preprocessing (e.g., fitting the scaler on the full dataset).
  3. Ignoring class imbalance: Use stratified splits, class weights, and appropriate metrics (F1, ROC-AUC).
  4. Ignoring the bias-variance tradeoff: A model with very low training error is not necessarily a good model.
  5. Using accuracy alone for imbalanced data: A model that always predicts the majority class can have high accuracy but is useless.
  6. Overfitting to the validation set: If you tune hyperparameters extensively on the validation set, your validation performance may be optimistically biased. Reserve the test set for final evaluation only.
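Pitfall 2 (data leakage) is worth seeing concretely. A sketch of the scaler example: split first, then compute preprocessing statistics on the training portion only (synthetic data; in scikit-learn a `Pipeline` enforces this ordering automatically):

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(5.0, 2.0, size=(100, 3))

# Split FIRST, then fit preprocessing on the training portion only.
X_train, X_test = X[:80], X[80:]

# Correct: standardization statistics come from the training set alone.
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0)
X_train_s = (X_train - mu) / sigma
X_test_s = (X_test - mu) / sigma   # test set reuses TRAIN statistics

# Leaky (wrong): statistics computed on the full dataset let test-set
# information influence preprocessing, biasing the evaluation.
mu_leak = X.mean(axis=0)

print(np.allclose(mu, mu_leak))    # generally False: the estimates differ
```

The leak here is subtle because the model never sees test labels; the test *features* alone are enough to contaminate the estimate.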

Looking Ahead

  • Chapter 7: Unsupervised learning---finding structure without labels.
  • Chapter 8: Model evaluation, cross-validation, and selection strategies in depth.
  • Chapter 9: Advanced ensemble methods, stacking, and blending.
  • Chapters 11--14: Neural networks and deep learning build directly on the logistic regression and gradient descent foundations from this chapter.