Further Reading — Chapter 4: The Math Behind ML
Visual and Video Resources
3Blue1Brown — Essence of Linear Algebra (YouTube series) Grant Sanderson's animated series on linear algebra is the single best visual introduction to vectors, matrices, dot products, and transformations. If the linear algebra section of this chapter felt abstract, watch episodes 1 through 7. The geometric intuition will stick. youtube.com/playlist?list=PLZHQObOWTQDPD3MizzM2xVFitgF8hE_ab
3Blue1Brown — Essence of Calculus (YouTube series) The companion series to the linear algebra playlist. Episodes on derivatives and the chain rule are particularly relevant for understanding gradient descent. Sanderson's animation of gradient descent in 3D is worth the 20 minutes alone. youtube.com/playlist?list=PLZHQObOWTQDMsr9K-rj53DwVRMYO3t5Yr
StatQuest with Josh Starmer — Gradient Descent, Step-by-Step Josh Starmer explains gradient descent with characteristic clarity and no unnecessary jargon. His step-by-step numerical examples are especially useful if the numpy implementation in this chapter moved too quickly. He also has excellent videos on cross-entropy and regularization. youtube.com/c/joshstarmer
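Starmer's step-by-step numerical style translates directly into code. Here is a minimal sketch of vanilla gradient descent on a one-dimensional quadratic — an illustration for intuition, not the chapter's actual numpy implementation (the function and learning rate are arbitrary choices):

```python
# Minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
# The minimum is at w = 3; gradient descent should walk there step by step.
def gradient_descent(grad, w0, lr=0.1, steps=100):
    w = w0
    for _ in range(steps):
        w = w - lr * grad(w)  # move against the gradient
    return w

w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```

With a learning rate of 0.1, each step shrinks the distance to the minimum by a factor of 0.8, so 100 steps land essentially on top of w = 3.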
Textbooks
Deisenroth, Faisal, and Ong — Mathematics for Machine Learning (Cambridge University Press, 2020) The most recommended math-for-ML textbook, and for good reason. Chapters 2 (linear algebra), 5 (vector calculus), 6 (probability), and 7 (optimization) cover everything in this chapter at greater depth with excellent diagrams. The book is freely available as a PDF from the authors' website. Read this if you want to go one level deeper without reaching graduate-level abstraction. mml-book.github.io
Boyd and Vandenberghe — Convex Optimization (Cambridge University Press, 2004) The definitive reference on optimization theory. Chapters 1-3 cover convexity, gradient descent, and the geometry of optimization with mathematical precision. This is a graduate-level text — start with the "Mathematics for Machine Learning" book above and come here when you want the full theory behind why convex loss functions guarantee convergence to a global minimum. web.stanford.edu/~boyd/cvxbook/
Goodfellow, Bengio, and Courville — Deep Learning, Chapter 4: "Numerical Computation" and Chapter 5: "Machine Learning Basics" (MIT Press, 2016) The "Deep Learning bible" has two chapters that provide excellent coverage of the math in this chapter, with emphasis on the connections to neural networks. Chapter 4 covers gradient-based optimization in careful practical detail. Even if you never build a neural network, these chapters are worth reading for their treatment of numerical stability and optimizer behavior. deeplearningbook.org
Articles and Blog Posts
Ruder, Sebastian — "An Overview of Gradient Descent Optimization Algorithms" (2016) A comprehensive survey of gradient descent variants: SGD, momentum, Nesterov, AdaGrad, RMSProp, Adam, and more. This chapter covered vanilla gradient descent. When you encounter Adam (the optimizer used by most deep learning frameworks) in later chapters, Ruder's article is the best place to understand how it works and how it improves on what you learned here. ruder.io/optimizing-gradient-descent/
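For orientation before reading Ruder's survey: Adam combines a running mean of gradients (momentum) with a running mean of squared gradients (per-parameter step scaling), both bias-corrected. A hedged numpy sketch of the update rule, using the commonly cited default hyperparameters; the quadratic test function and learning rate here are arbitrary illustrations:

```python
import numpy as np

def adam_minimize(grad, w0, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    w = np.asarray(w0, dtype=float)
    m = np.zeros_like(w)  # running mean of gradients (first moment)
    v = np.zeros_like(w)  # running mean of squared gradients (second moment)
    for t in range(1, steps + 1):
        g = grad(w)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g**2
        m_hat = m / (1 - beta1**t)          # bias correction for early steps
        v_hat = v / (1 - beta2**t)
        w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w

# Illustration: minimize f(w) = ||w - [1, -2]||^2
target = np.array([1.0, -2.0])
w_opt = adam_minimize(lambda w: 2 * (w - target), np.zeros(2))
```

The division by the square root of the second moment is what makes Adam adaptive: parameters with consistently large gradients take proportionally smaller steps.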
Hastie, Tibshirani, and Friedman — The Elements of Statistical Learning, Chapter 3: "Linear Methods for Regression" (Springer, 2009) Sections 3.4 (ridge regression) and 3.5 (lasso) provide the authoritative treatment of L1 and L2 regularization with full mathematical derivations. The geometric interpretation of the constraint regions (Figure 3.11 in the book) is the most reproduced diagram in ML education. The book is freely available as a PDF. hastie.su.domains/ElemStatLearn/
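The sparsity mechanism Hastie et al. derive can be seen in miniature in the lasso's soft-thresholding operator. In the special case of an orthonormal design, the lasso solution is the least-squares coefficient shrunk toward zero and clipped at zero — a sketch under that assumption, with illustrative coefficients, not a general lasso solver:

```python
import numpy as np

def soft_threshold(beta_ls, lam):
    """Lasso solution under an orthonormal design: shrink each least-squares
    coefficient toward zero by lam and clip at zero. The clipping is what
    produces exact zeros, i.e. sparsity. (Ridge, by contrast, would divide
    by (1 + lam): coefficients shrink but never reach exactly zero.)"""
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)

beta_ls = np.array([2.5, -0.3, 0.8, -1.7])
beta_lasso = soft_threshold(beta_ls, lam=1.0)  # small coefficients become exactly 0
```

With lam=1.0 the two small coefficients are zeroed out entirely while the large ones survive, shrunk by 1.0 — the diamond-constraint geometry in algebraic form.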
Interactive and Hands-On
Seeing Theory — Brown University An interactive visualization of probability concepts. The Bayesian inference chapter lets you drag sliders and watch priors update into posteriors in real time. If Bayes' theorem still feels mechanical after reading this chapter, 10 minutes on this site will make it intuitive. seeing-theory.brown.edu
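The slider-driven updates on Seeing Theory reduce, in the discrete case, to a single line of arithmetic: posterior ∝ prior × likelihood. A two-hypothesis example with illustrative numbers (the disease-testing figures below are made up for the demonstration, not taken from the chapter):

```python
# Bayes' theorem for a diagnostic test: P(disease | positive).
prior_disease = 0.01            # P(disease): 1% base rate (illustrative)
p_pos_given_disease = 0.95      # test sensitivity (illustrative)
p_pos_given_healthy = 0.05      # false-positive rate (illustrative)

# Total probability of a positive result, over both hypotheses.
p_pos = (p_pos_given_disease * prior_disease
         + p_pos_given_healthy * (1 - prior_disease))

posterior = p_pos_given_disease * prior_disease / p_pos  # ≈ 0.16
```

Despite the accurate test, the posterior is only about 16% — the low prior dominates, which is exactly the effect the site lets you feel by dragging the prior slider.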
Distill.pub — "Momentum" and "Why Momentum Really Works" Distill articles combine rigorous mathematics with interactive visualizations. Their treatment of momentum in gradient descent shows you what happens in the loss landscape when you add momentum — something impossible to convey in a static textbook. The publication ceased in 2021, but the archived articles remain some of the best ML explanations ever produced. distill.pub
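The velocity term the Distill article animates is a few lines of code. A sketch of heavy-ball momentum on an ill-conditioned quadratic — the setting their visualizations use; the specific function, learning rate, and momentum coefficient here are illustrative assumptions:

```python
import numpy as np

def momentum_descent(grad, w0, lr=0.01, beta=0.9, steps=300):
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v - lr * grad(w)  # velocity: a decaying sum of past gradients
        w = w + v
    return w

# f(w) = 10*w0^2 + 0.5*w1^2 -- steep in one direction, shallow in the other.
grad = lambda w: np.array([20.0 * w[0], 1.0 * w[1]])
w_opt = momentum_descent(grad, [1.0, 1.0])
```

The accumulated velocity cancels the zig-zagging along the steep axis and builds up speed along the shallow one, which is why momentum shines precisely on ill-conditioned loss surfaces.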
Khan Academy — Linear Algebra and Multivariable Calculus If the linear algebra or calculus content in this chapter assumed too much background, Khan Academy provides patient, step-by-step instruction from the foundations. Start with "Vectors and spaces" for linear algebra and "Gradient and directional derivatives" for calculus. No prerequisites beyond high school algebra. khanacademy.org
Papers
Tibshirani, Robert — "Regression Shrinkage and Selection via the Lasso" (Journal of the Royal Statistical Society, 1996) The original paper introducing L1 regularization. If you want to understand why the diamond constraint region produces sparse solutions at a formal mathematical level, this is the primary source. Surprisingly readable for an academic paper. Section 2 contains the geometric argument that has taught a generation of statisticians and data scientists.
Robbins, Herbert and Monro, Sutton — "A Stochastic Approximation Method" (Annals of Mathematical Statistics, 1951) The paper that introduced stochastic gradient descent. A historical curiosity worth skimming — the core idea (update parameters using noisy gradient estimates from small batches) is the same one that powers every modern deep learning optimizer more than seventy years later.
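The Robbins-Monro idea in its modern form fits in a dozen lines: estimate the gradient from a small random batch instead of the full dataset. A sketch on synthetic linear-regression data — the data, batch size, and learning rate are illustrative assumptions, not a prescription:

```python
import numpy as np

# Synthetic regression problem: y = X @ true_w + small noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 2))
true_w = np.array([3.0, -1.0])
y = X @ true_w + 0.01 * rng.normal(size=1000)

w = np.zeros(2)
lr, batch_size = 0.1, 32
for _ in range(500):
    idx = rng.integers(0, len(X), size=batch_size)  # sample a random minibatch
    Xb, yb = X[idx], y[idx]
    grad = (2 / batch_size) * Xb.T @ (Xb @ w - yb)  # noisy gradient of batch MSE
    w -= lr * grad
```

Each individual gradient is noisy, but the noise averages out over many cheap updates — the insight that makes training on datasets too large for full-batch gradients possible.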
What to Read Next
If you are working through this book in order, the math from this chapter will become concrete in the chapters ahead. You do not need to master every concept before proceeding — the notation and ideas will reinforce themselves through repeated use. Specifically:
- Chapter 11 (Linear Models Revisited) applies everything from Sections 4.2-4.5 directly
- Chapter 12 (SVMs) uses hinge loss and the margin concept from Section 4.4
- Chapter 14 (Gradient Boosting) uses custom loss functions and gradient computation
- Chapter 16 (Model Evaluation) returns to probability calibration from Section 4.1
- Chapter 18 (Hyperparameter Tuning) uses cross-validation to choose regularization strength from Section 4.5