Chapter 15: Further Reading
Textbooks
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. Chapter 10 (Sequence Modeling: Recurrent and Recursive Nets) provides a thorough mathematical treatment of RNNs, BPTT, and gated architectures. The analysis of gradient flow through time is particularly rigorous.
- Zhang, A., Lipton, Z. C., Li, M., & Smola, A. J. (2023). Dive into Deep Learning. Cambridge University Press. Chapters 9-10 (d2l.ai) cover RNNs with executable code in PyTorch. The progressive implementation, from character-level language models to modern seq2seq architectures, is excellent for building intuition.
- Jurafsky, D. & Martin, J. H. (2024). Speech and Language Processing. 3rd edition (draft). Chapters on RNNs and sequence models from the NLP perspective, with strong coverage of language modeling, neural machine translation, and practical applications.
- Prince, S. J. D. (2023). Understanding Deep Learning. MIT Press. Chapters 12-13 cover recurrent networks and sequence-to-sequence models with clear diagrams and mathematical exposition.
Key Papers
Foundational Architectures
- Hochreiter, S. & Schmidhuber, J. (1997). "Long Short-Term Memory." Neural Computation, 9(8), 1735-1780. The original LSTM paper. Introduced the cell state and gating mechanism that mitigates the vanishing gradient problem (a minimal code sketch of the gate equations follows this list). One of the most cited papers in deep learning.
- Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation." arXiv:1406.1078. Introduced the GRU as a simpler alternative to the LSTM and proposed the encoder-decoder framework for sequence-to-sequence tasks.
- Sutskever, I., Vinyals, O., & Le, Q. V. (2014). "Sequence to Sequence Learning with Neural Networks." NeurIPS. Demonstrated that seq2seq models with LSTMs could achieve state-of-the-art machine translation and established the encoder-decoder paradigm that remains foundational.
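To make the gating mechanism in these papers concrete, here is a minimal PyTorch sketch of one LSTM step. The function name `lstm_cell_step` and the fused weight layout are illustrative choices, not taken from the cited papers; `torch.nn.LSTMCell` is the library version of the same computation.

```python
import torch

def lstm_cell_step(x, h_prev, c_prev, W_x, W_h, b):
    """One step of the standard LSTM recurrence.
    Illustrative shapes: x (B, D), h_prev and c_prev (B, H),
    W_x (D, 4H), W_h (H, 4H), b (4H,)."""
    gates = x @ W_x + h_prev @ W_h + b       # fused projection for all four gates
    i, f, g, o = gates.chunk(4, dim=-1)      # input gate, forget gate, candidate, output gate
    i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
    g = torch.tanh(g)                        # candidate cell update
    c = f * c_prev + i * g                   # additive cell-state update preserves gradient flow
    h = o * torch.tanh(c)                    # hidden state exposed to the next step/layer
    return h, c

# Toy usage with D=8, H=16 (so the fused projection has width 4H=64).
x, h, c = torch.randn(2, 8), torch.zeros(2, 16), torch.zeros(2, 16)
W_x, W_h, b = torch.randn(8, 64), torch.randn(16, 64), torch.zeros(64)
h, c = lstm_cell_step(x, h, c, W_x, W_h, b)
```

The additive update of `c` is the design choice that distinguishes the LSTM from a vanilla RNN's purely multiplicative recurrence.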
Gradient Flow and Training
- Bengio, Y., Simard, P., & Frasconi, P. (1994). "Learning Long-Term Dependencies with Gradient Descent is Difficult." IEEE Transactions on Neural Networks, 5(2), 157-166. The seminal analysis of the vanishing gradient problem in RNNs. Provided the mathematical framework for understanding why vanilla RNNs fail on long sequences.
- Pascanu, R., Mikolov, T., & Bengio, Y. (2013). "On the Difficulty of Training Recurrent Neural Networks." ICML. Extended the analysis of gradient problems to include exploding gradients. Proposed gradient clipping as a practical solution (see the sketch after this list) and analyzed the loss surface geometry.
- Saxe, A. M., McClelland, J. L., & Ganguli, S. (2014). "Exact solutions to the nonlinear dynamics of learning in deep linear networks." ICLR. Provides theoretical insights into training dynamics relevant to understanding gradient flow in deep recurrent networks.
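Gradient clipping in the spirit of Pascanu et al. is a one-line addition to an ordinary training step. The sketch below uses an arbitrary model, dummy data, and a placeholder loss purely for illustration; the essential line is the call to `torch.nn.utils.clip_grad_norm_` before the optimizer update.

```python
import torch
from torch import nn

# Illustrative setup: a small stacked LSTM and an Adam optimizer.
model = nn.LSTM(input_size=32, hidden_size=64, num_layers=2, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 100, 32)        # (batch, time, features), dummy data
output, _ = model(x)
loss = output.pow(2).mean()        # placeholder loss just to produce gradients

optimizer.zero_grad()
loss.backward()
# Rescale the global gradient norm to at most 1.0 before the parameter update,
# so a single exploding-gradient step cannot blow up training.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```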
Attention and Seq2Seq
- Bahdanau, D., Cho, K., & Bengio, Y. (2015). "Neural Machine Translation by Jointly Learning to Align and Translate." ICLR. Introduced additive attention for seq2seq models. Eliminated the fixed-length encoding bottleneck and dramatically improved translation quality on long sentences.
- Luong, M.-T., Pham, H., & Manning, C. D. (2015). "Effective Approaches to Attention-based Neural Machine Translation." EMNLP. Proposed multiplicative (dot-product) attention and compared different attention mechanisms (a minimal dot-product sketch follows this list). Established the global vs. local attention taxonomy.
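The contrast between the two papers is easiest to see in code: Bahdanau scores each encoder state with a small feed-forward network, while Luong's multiplicative variant scores it with a dot product. Below is a minimal sketch of the dot-product version; the function name and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def dot_product_attention(decoder_state, encoder_states):
    """Luong-style (multiplicative) attention over encoder states.
    decoder_state: (B, H); encoder_states: (B, T, H)."""
    scores = torch.bmm(encoder_states, decoder_state.unsqueeze(-1)).squeeze(-1)  # (B, T)
    weights = F.softmax(scores, dim=-1)        # alignment over source positions
    context = torch.bmm(weights.unsqueeze(1), encoder_states).squeeze(1)         # (B, H)
    return context, weights

# Toy usage: one decoder step attending over a 12-token source sentence.
dec = torch.randn(4, 64)
enc = torch.randn(4, 12, 64)
context, weights = dot_product_attention(dec, enc)
```

The returned context vector replaces the single fixed encoder summary, which is what removes the bottleneck on long sentences.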
Regularization and Practical Techniques
- Gal, Y. & Ghahramani, Z. (2016). "A Theoretically Grounded Application of Dropout to Recurrent Neural Networks." NeurIPS. Introduced variational dropout for RNNs, which reuses the same dropout mask across time steps (see the sketch after this list). Addresses the problem of naive per-step dropout destroying temporal information.
- Merity, S., Keskar, N. S., & Socher, R. (2018). "Regularizing and Optimizing LSTM Language Models." ICLR. The AWD-LSTM paper: introduced weight dropout, variable-length backpropagation through time, and other techniques that made LSTMs state-of-the-art on word-level language modeling benchmarks such as Penn Treebank and WikiText-2 at the time.
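The core idea of variational dropout, sampling one mask and reusing it at every time step, fits in a few lines. This is a simplified sketch of the recurrent-connection part only (Gal & Ghahramani's full scheme also ties masks on inputs and embeddings); the function name and shapes are illustrative.

```python
import torch

def variational_dropout_rollout(rnn_cell, inputs, h0, p=0.5, training=True):
    """Unroll an RNN cell, applying the *same* dropout mask to the hidden
    state at every time step. inputs: (T, B, D); h0: (B, H).
    Works with nn.RNNCell / nn.GRUCell (single-tensor hidden state)."""
    h = h0
    mask = torch.ones_like(h0)
    if training and p > 0:
        # Sample the mask once, with inverted-dropout scaling.
        mask = torch.bernoulli(torch.full_like(h0, 1 - p)) / (1 - p)
    outputs = []
    for x_t in inputs:                  # iterate over time steps
        h = rnn_cell(x_t, h * mask)     # reuse the same mask each step
        outputs.append(h)
    return torch.stack(outputs), h

# Toy usage with a GRU cell over a 10-step sequence.
cell = torch.nn.GRUCell(input_size=32, hidden_size=64)
xs = torch.randn(10, 4, 32)
outs, h_T = variational_dropout_rollout(cell, xs, torch.zeros(4, 64), p=0.3)
```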
Bidirectional and Deep RNNs
- Schuster, M. & Paliwal, K. K. (1997). "Bidirectional Recurrent Neural Networks." IEEE Transactions on Signal Processing, 45(11), 2673-2681. The original bidirectional RNN paper. Showed that processing sequences in both directions improves performance on tasks where future context is informative (see the PyTorch example after this list).
- Graves, A. (2013). "Generating Sequences With Recurrent Neural Networks." arXiv:1308.0850. Demonstrated deep RNNs for handwriting generation and introduced several practical techniques for training deep recurrent architectures.
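In PyTorch, bidirectional and deep recurrence are constructor arguments. The example below, with illustrative sizes, shows how the forward and backward hidden states are concatenated per time step, doubling the output feature size.

```python
import torch
from torch import nn

# Two-layer bidirectional LSTM; forward and backward passes run independently
# and their per-step hidden states are concatenated in the output.
birnn = nn.LSTM(input_size=32, hidden_size=64, num_layers=2,
                batch_first=True, bidirectional=True)
x = torch.randn(8, 50, 32)                 # (batch, time, features)
output, (h_n, c_n) = birnn(x)
print(output.shape)   # torch.Size([8, 50, 128])  -- 2 directions * hidden_size
print(h_n.shape)      # torch.Size([4, 8, 64])    -- num_layers * 2 directions
```

Because the backward pass needs the whole sequence, bidirectional models suit tagging and classification rather than autoregressive generation.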
Modern Alternatives
- Gu, A., Goel, K., & Re, C. (2022). "Efficiently Modeling Long Sequences with Structured State Spaces." ICLR. Introduced S4, the Structured State Space sequence model, bridging RNNs and linear dynamical systems. Achieves strong performance on very long sequences (the linear recurrence at its core is sketched after this list).
- Gu, A. & Dao, T. (2023). "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." arXiv:2312.00752. Mamba adds data-dependent (selective) gating to state space models, achieving Transformer-level performance with linear scaling in sequence length.
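At their core, these models run a linear recurrence over a hidden state. The sketch below shows that recurrence as a plain sequential scan; it is only an illustration of the underlying computation, not the papers' actual parameterizations (HiPPO initialization, the convolutional training mode, or Mamba's input-dependent parameters), and all names are illustrative.

```python
import torch

def ssm_scan(A, B, C, x):
    """Discrete linear state-space recurrence, run step by step:
        h_t = A h_{t-1} + B x_t,   y_t = C h_t
    A: (N, N); B: (N, D); C: (D_out, N); x: (T, D)."""
    h = torch.zeros(A.shape[0])
    ys = []
    for t in range(x.shape[0]):
        h = A @ h + B @ x[t]       # linear state update; no nonlinearity between steps
        ys.append(C @ h)           # read out the observation
    return torch.stack(ys)         # (T, D_out)

# Toy usage: a 16-dimensional state driven by a 50-step, 4-channel input.
N, D = 16, 4
y = ssm_scan(torch.randn(N, N) * 0.1, torch.randn(N, D),
             torch.randn(D, N), torch.randn(50, D))
```

Because the recurrence is linear, it can also be unrolled as a convolution at training time, which is the source of these models' parallelism.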
Online Resources
- Andrej Karpathy. "The Unreasonable Effectiveness of Recurrent Neural Networks." (Blog post, 2015). http://karpathy.github.io/2015/05/21/rnn-effectiveness/ A classic and accessible introduction to RNNs through character-level language modeling. Shows surprisingly coherent text generated by LSTMs trained on Shakespeare, Wikipedia, and LaTeX.
- Christopher Olah. "Understanding LSTM Networks." (Blog post, 2015). https://colah.github.io/posts/2015-08-Understanding-LSTMs/ The definitive visual guide to LSTM internals. Clear diagrams of gate operations and information flow that have become standard reference material.
- PyTorch RNN Tutorial. https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html Official PyTorch tutorial for building character-level RNNs, covering both classification and generation tasks (a minimal character-level model is sketched after this list).
- Stanford CS224n: Natural Language Processing with Deep Learning. http://web.stanford.edu/class/cs224n/ Comprehensive course covering RNNs, LSTMs, attention, and Transformers with excellent lecture materials.
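As a companion to the char-rnn post and the PyTorch tutorial above, here is a minimal character-level next-character model. The toy corpus, class name, and sizes are all illustrative, and a real run would loop this step over an optimizer and a larger corpus.

```python
import torch
from torch import nn

class CharRNN(nn.Module):
    """Embed characters, run a GRU, and predict the next character."""
    def __init__(self, vocab_size, hidden_size=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.rnn = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, vocab_size)

    def forward(self, idx, h=None):
        out, h = self.rnn(self.embed(idx), h)
        return self.head(out), h            # logits over the next character

# Toy corpus and integer encoding.
text = "hello world"
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
idx = torch.tensor([[stoi[c] for c in text[:-1]]])       # inputs
target = torch.tensor([[stoi[c] for c in text[1:]]])     # next-char targets

model = CharRNN(len(vocab))
logits, _ = model(idx)
loss = nn.functional.cross_entropy(logits.reshape(-1, len(vocab)),
                                   target.reshape(-1))
```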
Looking Ahead
- Chapter 16 (Attention Mechanisms): Full treatment of attention---the concept previewed in Section 15.11---including additive, multiplicative, and multi-head attention.
- Chapters 17-18 (Transformers): How self-attention replaces recurrence entirely, enabling parallelism and scaling to massive datasets.
- State Space Models: Modern architectures (Mamba, S4) that combine RNN-like linear-time inference with Transformer-like training parallelism.