Chapter 11: Further Reading

Textbooks

Neural Network Foundations

  • Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapters 6 (Deep Feedforward Networks), 7 (Regularization), and 8 (Optimization) provide a thorough mathematical treatment of the topics in this chapter. The derivation of backpropagation through computational graphs in Chapter 6 complements our matrix-calculus approach. Available free at deeplearningbook.org.

  • Bishop, C. M. & Bishop, H. (2024). Deep Learning: Foundations and Concepts. Springer. A modern, comprehensive textbook covering neural networks from first principles. Chapters 5-7 cover single-layer networks, deep networks, and gradient-based optimization with exceptional mathematical clarity.

  • Nielsen, M. A. (2015). Neural Networks and Deep Learning. Determination Press. A freely available online book (neuralnetworksanddeeplearning.com) that builds neural networks from scratch with a pedagogical approach similar to this chapter. The interactive visualizations of backpropagation are particularly instructive.

Implementation-Focused

  • Stevens, E., Antiga, L., & Viehmann, T. (2020). Deep Learning with PyTorch. Manning Publications. A practical guide to PyTorch that covers tensors, autograd, and nn.Module in depth. Chapters 5-8 are directly relevant to the PyTorch implementation material in this chapter.

  • Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2023). Dive into Deep Learning. Cambridge University Press. An interactive textbook (d2l.ai) with executable code in PyTorch, TensorFlow, and JAX. The MLP chapters build networks from scratch in a style very similar to our approach.

Online Courses and Video Series

  • 3Blue1Brown: Neural Networks (YouTube). Grant Sanderson's four-part series on neural networks provides stunning visual intuition for forward passes, gradient descent, and backpropagation. The animations of how hidden layers transform feature space are unmatched. Start here for visual understanding.

  • Andrej Karpathy: "The spelled-out intro to neural networks and backpropagation: building micrograd" (YouTube, 2022). A two-hour live-coding session where Karpathy builds a minimal automatic differentiation engine from scratch, then uses it to train a neural network. Directly complements Exercise 35 in this chapter.

  • Stanford CS231n: Convolutional Neural Networks for Visual Recognition. Lectures 3-5 cover backpropagation, neural network training, and the transition from linear classifiers to multi-layer networks. The course notes on backpropagation (cs231n.github.io) are among the best available.

  • MIT 6.S191: Introduction to Deep Learning. A fast-paced course that covers neural network foundations in the first two lectures. Freely available at introtodeeplearning.com with TensorFlow/Keras lab notebooks.

Key Papers

Historical Foundations

  • McCulloch, W. S. & Pitts, W. (1943). "A logical calculus of the ideas immanent in nervous activity." Bulletin of Mathematical Biophysics, 5(4), 115-133. The paper that started it all: the first mathematical model of an artificial neuron. While the McCulloch-Pitts neuron is far simpler than modern artificial neurons, the core idea---modeling computation as weighted sums followed by thresholding---remains the foundation of the field.

  • Rosenblatt, F. (1958). "The Perceptron: A probabilistic model for information storage and organization in the brain." Psychological Review, 65(6), 386-408. The original perceptron paper, introducing the first trainable artificial neuron. The perceptron learning rule described here is the ancestor of all modern neural network training algorithms.

  • Minsky, M. L. & Papert, S. A. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press. The book that proved a single-layer perceptron cannot represent XOR (or any other function that is not linearly separable) and contributed to the first "AI winter." Understanding this limitation motivates the multi-layer architectures developed in this chapter.

Backpropagation

  • Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). "Learning representations by back-propagating errors." Nature, 323(6088), 533-536. The paper that popularized backpropagation and demonstrated that hidden layers could learn useful internal representations. While backpropagation had been discovered independently by several researchers, this paper established it as the standard training algorithm.

Activation Functions

  • Nair, V. & Hinton, G. E. (2010). "Rectified Linear Units Improve Restricted Boltzmann Machines." ICML. The paper that popularized ReLU, demonstrating its advantages over sigmoid and tanh for training deep networks. The simplicity of ReLU ($\max(0, z)$) belies its enormous impact on the field.

  • Hendrycks, D. & Gimpel, K. (2016). "Gaussian Error Linear Units (GELUs)." arXiv:1606.08415. Introduces GELU, which has become the standard activation function in transformer architectures (BERT, GPT). GELU provides a smooth, non-monotonic alternative to ReLU; a short comparison of the two is sketched after this list.
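
For concreteness, the sketch below evaluates both activations on the same inputs using PyTorch's built-in implementations (torch.relu and torch.nn.functional.gelu). It is a minimal illustration, not code from this chapter.

    import torch
    import torch.nn.functional as F

    z = torch.linspace(-3.0, 3.0, steps=7)   # [-3, -2, -1, 0, 1, 2, 3]

    # ReLU: max(0, z). Exactly zero for negative inputs, identity for positive inputs.
    relu_out = torch.relu(z)                 # tensor([0., 0., 0., 0., 1., 2., 3.])

    # GELU: z * Phi(z), where Phi is the standard normal CDF.
    # Smooth and non-monotonic: small negative outputs for moderately negative inputs.
    gelu_out = F.gelu(z)

Note how GELU, unlike ReLU, passes a small negative signal for inputs just below zero; this is the smooth, non-monotonic behavior referred to above.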

Universal Approximation

  • Cybenko, G. (1989). "Approximation by superpositions of a sigmoidal function." Mathematics of Control, Signals and Systems, 2(4), 303-314. The original universal approximation theorem for sigmoid networks. Proves that a single hidden layer with sigmoid activation can approximate any continuous function on a compact subset of $\mathbb{R}^n$ to arbitrary accuracy; a formal statement is sketched after this list.

  • Hornik, K. (1991). "Approximation capabilities of multilayer feedforward networks." Neural Networks, 4(2), 251-257. Extends the universal approximation theorem beyond sigmoids to arbitrary bounded, nonconstant activation functions, showing that it is the multilayer architecture itself, rather than the particular choice of nonlinearity, that confers universal approximation power. (The further extension to arbitrary non-polynomial activations, which covers ReLU, is due to Leshno et al., 1993.)
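
For reference, here is a paraphrase of Cybenko's statement (the notation is ours, not necessarily the chapter's): for any continuous sigmoidal $\sigma$, any continuous function $f$ on a compact set $K \subset \mathbb{R}^n$, and any $\varepsilon > 0$, there exist an integer $N$ and parameters $\alpha_i, b_i \in \mathbb{R}$ and $\mathbf{w}_i \in \mathbb{R}^n$ such that

    $\left| f(\mathbf{x}) - \sum_{i=1}^{N} \alpha_i \, \sigma(\mathbf{w}_i^\top \mathbf{x} + b_i) \right| < \varepsilon \quad \text{for all } \mathbf{x} \in K.$

In network terms: a single hidden layer of $N$ sigmoid units followed by a linear output layer approximates $f$ uniformly on $K$.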

Weight Initialization

  • He, K., Zhang, X., Ren, S., & Sun, J. (2015). "Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification." ICCV. Derives He initialization for ReLU networks and demonstrates its critical importance for training very deep networks. This paper also introduced PReLU (parametric ReLU).

  • Glorot, X. & Bengio, Y. (2010). "Understanding the difficulty of training deep feedforward neural networks." AISTATS. Analyzes activation and gradient flow in deep networks and derives Xavier/Glorot initialization for sigmoid and tanh networks. Both initialization schemes are available in PyTorch's torch.nn.init module; see the sketch after this list.
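
As a practical note, the sketch below is a minimal illustration with arbitrary layer sizes, not code from this chapter; in practice you would apply one scheme per layer, matched to its activation.

    import torch.nn as nn

    layer = nn.Linear(256, 128)   # arbitrary example sizes

    # He (Kaiming) initialization: std = sqrt(2 / fan_in), derived for ReLU layers.
    nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')

    # Xavier (Glorot) initialization: variance 2 / (fan_in + fan_out),
    # derived for sigmoid/tanh layers. (Shown here only for comparison;
    # it overwrites the He-initialized weights above.)
    nn.init.xavier_uniform_(layer.weight)

    nn.init.zeros_(layer.bias)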

PyTorch Resources

  • PyTorch Documentation: torch.nn module (https://pytorch.org/docs/stable/nn.html). The authoritative reference for all neural network building blocks. Pay particular attention to nn.Linear, nn.ReLU, nn.Module, and the various loss functions.

  • PyTorch Documentation: Autograd Mechanics (https://pytorch.org/docs/stable/notes/autograd.html). A detailed explanation of how autograd builds and traverses computational graphs. Essential reading for understanding requires_grad, backward(), and torch.no_grad().

  • PyTorch Tutorials: "Learning PyTorch with Examples" (https://pytorch.org/tutorials/beginner/pytorch_with_examples.html). Builds neural networks using raw tensors, then autograd, then nn.Module, mirroring the progression in Section 11.7 of this chapter. A minimal example combining these building blocks is sketched after this list.
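
To tie these resources together, here is a minimal sketch using the building blocks they document (nn.Linear, nn.ReLU, nn.Module, a loss function, backward(), and torch.no_grad()). The layer sizes and data are placeholders, and the code follows the tutorial style rather than any specific listing from this chapter.

    import torch
    import torch.nn as nn

    class TinyMLP(nn.Module):
        """A two-layer MLP assembled from the nn building blocks cited above."""
        def __init__(self, in_features=4, hidden=16, out_features=3):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(in_features, hidden),
                nn.ReLU(),
                nn.Linear(hidden, out_features),
            )

        def forward(self, x):
            return self.net(x)

    model = TinyMLP()
    x = torch.randn(8, 4)             # placeholder batch of 8 examples
    y = torch.randint(0, 3, (8,))     # placeholder class labels

    loss = nn.CrossEntropyLoss()(model(x), y)
    loss.backward()                   # autograd traverses the graph built during the forward pass

    with torch.no_grad():             # no graph construction, e.g. during evaluation
        preds = model(x).argmax(dim=1)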

Looking Ahead

The neural network foundations from this chapter connect directly to:

  • Chapter 12 (Training Deep Networks): Production-quality training pipelines, advanced optimizers, learning rate schedules, normalization, and debugging.
  • Chapter 13 (Regularization): Dropout, data augmentation, weight decay, and techniques for preventing overfitting.
  • Chapter 14 (Convolutional Neural Networks): Local connectivity and weight sharing for spatial data; the convolution operation as a specialized linear layer.
  • Chapter 15 (Recurrent Neural Networks): Weight sharing across time steps; the LSTM cell, whose gating mechanisms build on the sigmoid and tanh activation functions introduced in this chapter.