Chapter 6: Further Reading
Essential Sources
1. Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Deep Learning (MIT Press, 2016)
The standard graduate-level reference for deep learning. Chapter 6 (Deep Feedforward Networks), and in particular Section 6.5 (Back-Propagation), covers the same material as this chapter at a similar mathematical level but with more theoretical depth. The treatment of the universal approximation theorem (Section 6.4.1) is concise and precise. Chapter 8 (Optimization for Training Deep Models) extends the SGD treatment into the territory covered by our Chapter 7. The book is freely available online at deeplearningbook.org. Read Chapter 6 alongside this chapter — it provides the complementary perspective of three of the field's founders. The notation differs slightly from ours (Goodfellow uses $h$ for hidden activations where we use $a$), so cross-referencing builds fluency with multiple conventions, which is important when reading papers.
2. Andrej Karpathy, "Yes You Should Understand Backprop" (Medium, 2016) and the micrograd repository (GitHub, 2020)
Two resources from the same author that approach neural networks from a programmer's perspective. The blog post argues — with examples of subtle backpropagation bugs — that understanding backprop at the implementation level is not optional for practitioners. The micrograd repository is a complete automatic differentiation engine in approximately 100 lines of Python, implementing scalar-valued reverse-mode autodiff with a Value class, operations, and topological-sort-based backward pass. Exercise 6.19 in this chapter is directly inspired by micrograd. Build it yourself, then read Karpathy's implementation to compare.
Reading guidance: Start with the blog post (15-minute read). Then read the micrograd source code — it is short enough to read in one sitting. If you want more depth, Karpathy's neural networks lecture series (available on YouTube) walks through the implementation step by step.
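To make the micrograd description concrete, here is a minimal sketch of scalar reverse-mode autodiff with the same three ingredients named above — a Value class, overloaded operations, and a topological-sort backward pass. This is not Karpathy's code, just an illustration in the same spirit; read his implementation for the full operation set.

```python
import math

class Value:
    """A scalar that records how it was computed, so gradients can flow back."""
    def __init__(self, data, _children=()):
        self.data = data
        self.grad = 0.0
        self._backward = lambda: None
        self._prev = set(_children)

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))
        def _backward():            # d(a+b)/da = 1, d(a+b)/db = 1
            self.grad += out.grad
            other.grad += out.grad
        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))
        def _backward():            # d(a*b)/da = b, d(a*b)/db = a
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad
        out._backward = _backward
        return out

    def tanh(self):
        t = math.tanh(self.data)
        out = Value(t, (self,))
        def _backward():            # d tanh(a)/da = 1 - tanh(a)^2
            self.grad += (1 - t * t) * out.grad
        out._backward = _backward
        return out

    def backward(self):
        # Topological order guarantees a node's gradient is complete
        # before it is propagated to the node's inputs.
        topo, visited = [], set()
        def build(v):
            if v not in visited:
                visited.add(v)
                for child in v._prev:
                    build(child)
                topo.append(v)
        build(self)
        self.grad = 1.0
        for v in reversed(topo):
            v._backward()

# Example: gradient of tanh(x*y + x) with respect to x and y.
x, y = Value(0.5), Value(2.0)
out = (x * y + x).tanh()
out.backward()
```

Note the `+=` in every backward closure: a node used twice must accumulate gradient from both uses, one of the subtle bugs the blog post warns about.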
3. Xavier Glorot and Yoshua Bengio, "Understanding the Difficulty of Training Deep Feedforward Neural Networks" (AISTATS, 2010); Kaiming He et al., "Delving Deep into Rectifiers" (ICCV, 2015)
The two papers that established the modern understanding of weight initialization. Glorot and Bengio (2010) derive the Xavier initialization by analyzing variance propagation through sigmoid and tanh networks — the same analysis presented in Section 6.9. He et al. (2015) extend this analysis to ReLU networks, deriving the $\text{Var}(w) = 2/n_\text{in}$ formula and demonstrating that it enables training of networks with 30+ layers without batch normalization. Together, these two papers explain why initialization is not an arbitrary choice but a mathematical requirement for stable training.
Reading guidance: Read Glorot and Bengio first — Sections 1-4 cover the variance analysis and the initialization proposal. Then read He et al. Sections 2.2 (the ReLU-aware derivation) and 3 (experiments showing that proper initialization enables training of very deep networks). The experimental results in both papers are as important as the theory: they show what happens in practice when initialization is wrong.
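The two variance formulas can be sketched in a few lines. The helper names below are our own; the point is that with $\text{Var}(w) = 2/n_\text{in}$, pre-activation variance stays roughly constant through a stack of ReLU layers instead of shrinking or exploding.

```python
import numpy as np

rng = np.random.default_rng(0)

def xavier_init(n_in, n_out):
    # Glorot & Bengio (2010): Var(w) = 2 / (n_in + n_out)
    std = np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_in, n_out))

def he_init(n_in, n_out):
    # He et al. (2015), for ReLU layers: Var(w) = 2 / n_in
    # (the factor 2 compensates for ReLU zeroing half the inputs)
    std = np.sqrt(2.0 / n_in)
    return rng.normal(0.0, std, size=(n_in, n_out))

# Push a signal through 10 ReLU layers with He initialization:
# its scale stays on the order of the input's, rather than vanishing.
x = rng.normal(size=(1000, 256))
h = x
for _ in range(10):
    h = np.maximum(0.0, h @ he_init(256, 256))
```

Replacing `he_init` with `xavier_init` in the loop shrinks the signal layer by layer — the effect the He et al. experiments document for deep ReLU stacks.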
4. George Cybenko, "Approximation by Superpositions of a Sigmoidal Function" (Mathematics of Control, Signals, and Systems, 1989); Kurt Hornik, Maxwell Stinchcombe, and Halbert White, "Multilayer Feedforward Networks Are Universal Approximators" (Neural Networks, 1989)
The two foundational papers on the universal approximation theorem. Cybenko (1989) proved the result for sigmoid activations using the Hahn-Banach theorem; Hornik, Stinchcombe, and White (1989) extended it to arbitrary squashing (bounded, nondecreasing) activation functions using the Stone-Weierstrass theorem. Both proofs are existence proofs — they demonstrate that an approximating network exists but provide no constructive method for finding it.
Reading guidance: These are mathematical papers and more demanding than the other recommendations. Read for the theorem statements and the discussion of what the results do and do not imply, rather than for the proof techniques (unless you have a background in functional analysis). For a modern perspective on depth vs. width in approximation, see Eldan and Shamir, "The Power of Depth for Feedforward Neural Networks" (COLT, 2016), which proves that certain functions require exponentially many neurons in a single hidden layer but only polynomially many in a network with two hidden layers.
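A toy experiment makes the theorem's statement tangible: a superposition of sigmoids $\sum_j c_j\,\sigma(w_j x + b_j)$ can fit a continuous function well. Since the proofs are non-constructive, the sketch below just draws random $(w_j, b_j)$ and solves for the output coefficients $c_j$ by least squares; the unit count, target function, and ranges are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.linspace(-3.0, 3.0, 400)
target = np.sin(x)                      # a continuous function to approximate

# One hidden layer of 50 sigmoid units with random weights and biases.
w = rng.uniform(-4.0, 4.0, size=50)
b = rng.uniform(-4.0, 4.0, size=50)
features = sigmoid(np.outer(x, w) + b)  # shape (400, 50)

# Fitting only the output coefficients is a linear least-squares problem.
c, *_ = np.linalg.lstsq(features, target, rcond=None)
approx = features @ c
mse = np.mean((approx - target) ** 2)
```

The fit is far better than any constant predictor, which is the qualitative content of the theorem; what it does not tell you is how many units a given accuracy requires, which is exactly the question the depth-vs-width literature addresses.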
5. David E. Rumelhart, Geoffrey E. Hinton, and Ronald J. Williams, "Learning Representations by Back-Propagating Errors" (Nature, 1986)
The paper that introduced backpropagation for neural networks to the wider scientific community. At three pages, it is remarkably concise — the entire backpropagation algorithm, its application to learning internal representations, and experimental results on XOR and symmetry detection, all in a single Nature letter. The clarity of the writing stands out: Rumelhart, Hinton, and Williams explain the chain rule application in terms that remain the standard exposition 40 years later. Reading the original paper gives you both the historical context (this was the breakthrough that revived neural networks after the Minsky-Papert critique) and an appreciation for how much of modern deep learning was already implicit in this three-page letter.
Reading guidance: Read the entire paper — it is only three pages. Pay attention to Figure 1, which shows the computational graph and gradient flow that we formalized in Section 6.7. Note the hidden representation learned for the XOR problem (their Figure 3): this is the earliest clear demonstration that backpropagation discovers useful internal representations, the insight that eventually led to deep learning.
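The spirit of the paper's XOR experiment is easy to reproduce. The sketch below trains a tiny network on XOR by backpropagation; the architecture (2-4-1, tanh hidden layer, sigmoid output, squared-error loss) and hyperparameters are our own illustrative choices, not those of the original paper.

```python
import numpy as np

rng = np.random.default_rng(1)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

# Small random initialization breaks the symmetry between hidden units.
W1 = rng.normal(0.0, 1.0, (2, 4)); b1 = np.zeros(4)
W2 = rng.normal(0.0, 1.0, (4, 1)); b2 = np.zeros(1)

losses, lr = [], 0.5
for _ in range(5000):
    # Forward pass.
    h = np.tanh(X @ W1 + b1)
    out = 1.0 / (1.0 + np.exp(-(h @ W2 + b2)))
    losses.append(np.mean((out - y) ** 2))
    # Backward pass: chain rule through the sigmoid, then the tanh.
    d_out = 2.0 * (out - y) / len(X) * out * (1.0 - out)
    dW2 = h.T @ d_out;  db2 = d_out.sum(axis=0)
    d_h = (d_out @ W2.T) * (1.0 - h ** 2)
    dW1 = X.T @ d_h;    db1 = d_h.sum(axis=0)
    # Gradient descent step.
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2
```

After training, inspecting `np.tanh(X @ W1 + b1)` shows the learned hidden representation — the counterpart of the paper's Figure 3, where the hidden units have re-encoded the inputs so that XOR becomes linearly separable.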