Chapter 9: Further Reading

Essential Sources

1. Sepp Hochreiter and Jürgen Schmidhuber, "Long Short-Term Memory" (Neural Computation, 1997)

The paper that introduced the LSTM architecture and, in doing so, made it possible to train recurrent networks on tasks requiring memory over hundreds of time steps. Hochreiter and Schmidhuber identified the vanishing gradient problem not just empirically but theoretically — their analysis of the "constant error flow" requirement leads directly to the cell state design. The cell state is updated additively, with multiplicative gates controlling the flow of information in and out. The original paper uses notation that differs from modern conventions (the forget gate was added later by Gers et al., 2000), but the core insight — that additive updates create gradient highways — is stated with remarkable clarity.

Reading guidance: Start with Section 1, which frames the problem: existing gradient-based methods cannot learn dependencies spanning more than about 10 time steps because the error signal either vanishes or explodes. The constant error carousel (Section 2) is the key conceptual contribution — it is the direct precursor to the cell state. Sections 3-4 describe the full architecture and the gating mechanism. The experiments (Section 5) are simple by modern standards (embedded Reber grammars, adding problems) but are carefully designed to isolate long-range dependency learning. For the addition of the forget gate, read Gers et al., "Learning to Forget: Continual Prediction with LSTM" (Neural Computation, 2000). For the most thorough empirical evaluation of LSTM variants (which gates matter, which can be removed), see Greff et al., "LSTM: A Search Space Odyssey" (IEEE TNNLS, 2017) — a systematic ablation study that confirms that the forget gate and the output activation function are the most critical components.
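The gated, additive cell-state update described above is compact enough to sketch directly. The following minimal NumPy sketch of one LSTM step (including the forget gate of Gers et al., 2000) is illustrative only — weight shapes are arbitrary and biases are omitted for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step. The cell state c is updated *additively* --
    the 'constant error carousel': gradients can flow through c
    without being squashed at every step."""
    Wf, Wi, Wg, Wo = params  # each maps [x; h_prev] to the hidden size
    z = np.concatenate([x, h_prev])
    f = sigmoid(Wf @ z)       # forget gate (added by Gers et al., 2000)
    i = sigmoid(Wi @ z)       # input gate
    g = np.tanh(Wg @ z)       # candidate cell update
    o = sigmoid(Wo @ z)       # output gate
    c = f * c_prev + i * g    # additive update: the gradient highway
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
params = [rng.normal(scale=0.1, size=(d_h, d_in + d_h)) for _ in range(4)]
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c, params)
```

Note that the multiplicative gates touch only what enters and leaves the cell; the update to `c` itself is a sum, which is the point of the design.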

2. Kyunghyun Cho, Bart van Merriënboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio, "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation" (EMNLP, 2014)

This paper introduced two ideas: the Gated Recurrent Unit (GRU) and the encoder-decoder architecture for sequence-to-sequence learning. The GRU simplifies the LSTM by merging the cell state and hidden state and coupling the forget and input gates into a single update gate. The encoder-decoder architecture — an encoder RNN that reads the input sequence and a decoder RNN that generates the output sequence — became the standard framework for machine translation before the transformer era. The paper also introduced the concept of learning phrase-level representations for translation, which anticipates the subword tokenization methods (BPE, SentencePiece) now standard in NLP.

Reading guidance: Section 2.1 describes the GRU equations (called "hidden unit" in the paper). Compare these carefully with the LSTM equations: the update gate $z$ plays the role of the forget gate (via $1 - z$) and the input gate simultaneously, while the reset gate $r$ controls how much of the previous hidden state participates in the candidate computation. Section 3 describes the encoder-decoder architecture. The encoder reads the source sentence and produces a fixed-length context vector; the decoder generates the target sentence conditioned on this vector. This is the architecture that Bahdanau et al. (2014) then augmented with attention. For the GRU-vs-LSTM comparison, also read Chung et al., "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling" (NIPS Workshop, 2014), which finds that neither architecture consistently outperforms the other across tasks.
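To make the comparison with the LSTM concrete, here is a minimal NumPy sketch of one GRU step, written in the convention implied above (the update gate $z$ admits the candidate while $1 - z$ retains the previous state). Weight names and shapes are illustrative; biases are omitted:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wz, Wr, Wh):
    """One GRU step (Cho et al., 2014): no separate cell state."""
    zcat = np.concatenate([x, h_prev])
    z = sigmoid(Wz @ zcat)                            # update gate
    r = sigmoid(Wr @ zcat)                            # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([x, r * h_prev]))  # candidate
    # z plays the input-gate role, (1 - z) the forget-gate role:
    return (1 - z) * h_prev + z * h_tilde

rng = np.random.default_rng(0)
d_in, d_h = 3, 4
Wz, Wr, Wh = (rng.normal(scale=0.1, size=(d_h, d_in + d_h)) for _ in range(3))
h = gru_step(rng.normal(size=d_in), np.zeros(d_h), Wz, Wr, Wh)
```

Compare with the LSTM: there is no output gate and no separate cell state, and the two LSTM gates controlling retention and admission are coupled through a single $z$.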

3. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural Machine Translation by Jointly Learning to Align and Translate" (ICLR, 2015)

The paper that introduced the attention mechanism to neural machine translation — and, by extension, to all of deep learning. Bahdanau et al. identified the information bottleneck in the encoder-decoder architecture: the fixed-length context vector cannot encode all the information in a long source sentence. Their solution was to let the decoder "attend" to all encoder hidden states at each decoding step, computing a dynamic context vector as a weighted sum of encoder outputs. The attention weights are learned end-to-end and provide an interpretable alignment between source and target positions. This paper is the direct precursor to the transformer's self-attention mechanism.

Reading guidance: Section 2 reviews the encoder-decoder approach and its limitation. Section 3 introduces the attention mechanism (called the "alignment model" in the paper). The key equation, given in Section 3.1, is the softmax over alignment scores: $\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_k \exp(e_{ik})}$, where $e_{ij} = a(s_{i-1}, h_j)$ is the alignment score computed by a feedforward network (this is what we call Bahdanau or additive attention). Figure 3 is one of the most important figures in the attention literature: it shows the learned alignment between English and French sentences, revealing that the model has learned word-order differences between languages. Section 4 presents the experimental results on English-French translation, showing that attention dramatically improves performance on long sentences — exactly the bottleneck it was designed to solve. For the multiplicative alternative, read Luong et al., "Effective Approaches to Attention-based Neural Machine Translation" (EMNLP, 2015), which introduces dot-product and general (bilinear) attention scores and compares global vs. local attention. The scaled dot-product form (dividing by $\sqrt{d}$) appears in the transformer paper.
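The score-softmax-context pipeline follows directly from the equations above. A minimal NumPy sketch of additive attention for a single decoding step — matrix names $W_a$, $U_a$, $v_a$ follow the paper's alignment model, but all shapes are illustrative:

```python
import numpy as np

def additive_attention(s_prev, H, Wa, Ua, va):
    """e_ij = va^T tanh(Wa s_{i-1} + Ua h_j); alpha = softmax(e);
    context = sum_j alpha_j h_j -- a dynamic weighted sum of
    encoder states, replacing the fixed-length context vector."""
    e = np.tanh(s_prev @ Wa.T + H @ Ua.T) @ va  # scores, shape (T_src,)
    a = np.exp(e - e.max())                     # numerically stable softmax
    a /= a.sum()                                # attention weights alpha_ij
    return a @ H, a

rng = np.random.default_rng(0)
T_src, d_h, d_s, d_a = 5, 6, 4, 8
H = rng.normal(size=(T_src, d_h))   # all encoder hidden states
s_prev = rng.normal(size=d_s)       # previous decoder state s_{i-1}
Wa = rng.normal(size=(d_a, d_s))
Ua = rng.normal(size=(d_a, d_h))
va = rng.normal(size=d_a)
context, alpha = additive_attention(s_prev, H, Wa, Ua, va)
```

Because `alpha` is recomputed at every decoding step from the current decoder state, each output position gets its own context vector — this is what removes the fixed-length bottleneck.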

4. Yoshua Bengio, Patrice Simard, and Paolo Frasconi, "Learning Long-Term Dependencies with Gradient Descent Is Difficult" (IEEE Transactions on Neural Networks, 1994)

The theoretical foundation for understanding why vanilla RNNs fail on long-range dependencies. Bengio et al. proved that for a class of recurrent networks with bounded nonlinearities (including $\tanh$ and sigmoid), the gradient of the loss with respect to early hidden states decays exponentially with the time lag. The paper analyzes the eigenvalue structure of the Jacobian product chain and shows that the gradient vanishes when the spectral radius of the recurrent weight matrix (after accounting for the nonlinearity derivative) is less than 1 — which is the typical regime. The paper also explores alternative training approaches (simulated annealing, discrete error propagation) that avoid the gradient-based bottleneck, though these did not become practically dominant.

Reading guidance: Sections 2-3 contain the core theoretical result. The key insight is Theorem 1, which bounds the gradient norm: $\| \frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_k} \| \leq (\gamma \sigma_{\max})^{T-k}$, where $\gamma$ is the maximum derivative of the activation function and $\sigma_{\max}$ is the largest singular value of $\mathbf{W}_{hh}$. Section 4 provides experimental confirmation on simple tasks where the dependency length is controlled. This paper should be read alongside Hochreiter and Schmidhuber (1997) — Bengio et al. identified the problem; Hochreiter and Schmidhuber engineered the solution. For a modern perspective on the same issue, see Pascanu et al., "On the Difficulty of Training Recurrent Neural Networks" (ICML, 2013), which provides updated analysis and connects gradient clipping to the exploding gradient problem.
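The exponential decay in Theorem 1 is easy to observe numerically. The sketch below backpropagates through a $\tanh$ RNN, $\mathbf{h}_t = \tanh(\mathbf{W}\mathbf{h}_{t-1})$, whose step Jacobian is $\mathrm{diag}(1 - \mathbf{h}_t^2)\,\mathbf{W}$, and tracks the spectral norm of the Jacobian product; the weight scale is chosen small enough that $\gamma \sigma_{\max} < 1$ (the vanishing regime, with $\gamma = 1$ for $\tanh$):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
# Small weight scale keeps sigma_max(W) below 1, so Bengio et al.'s
# bound guarantees the gradient norm shrinks geometrically.
W = rng.normal(scale=0.15, size=(d, d))
h = rng.normal(size=d)

J = np.eye(d)    # accumulates d h_T / d h_0
norms = []
for _ in range(50):
    h = np.tanh(W @ h)
    J = np.diag(1.0 - h**2) @ W @ J   # step Jacobian: diag(1 - h_t^2) W
    norms.append(np.linalg.norm(J, 2))  # spectral norm of the product
```

After 50 steps the norm is vanishingly small, matching the $(\gamma \sigma_{\max})^{T-k}$ bound; rerunning with a larger weight scale pushes the chain into the exploding regime that Pascanu et al. address with gradient clipping.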

5. Ilya Sutskever, Oriol Vinyals, and Quoc V. Le, "Sequence to Sequence Learning with Neural Networks" (NIPS, 2014)

The paper that demonstrated that a purely neural seq2seq system could achieve competitive machine translation quality, without the complex pipelines of phrase-based statistical MT. Sutskever et al. used a deep LSTM encoder to read the source sentence and a deep LSTM decoder to generate the target sentence. Three engineering decisions proved important: (1) reversing the source sentence, so that the first words of the source sit close to the first words of the target in the computational graph, reducing the effective distance for gradient flow; (2) depth, with four stacked LSTM layers in both encoder and decoder; and (3) ensembling several independently trained models. This paper, together with Bahdanau et al. (2014), established the encoder-decoder paradigm that dominated NLP until the transformer.

Reading guidance: Section 2 describes the architecture — note the reversed source sentence trick (Section 3.1), which improves BLEU by several points and illustrates how even input preprocessing can affect gradient flow in RNNs. Section 3.2 describes the training details, including teacher forcing (feeding the ground-truth target tokens as decoder inputs during training). Table 1 shows the key result: an ensemble of neural seq2seq models achieves 34.8 BLEU on English-French translation, surpassing the phrase-based SMT baseline at 33.3 BLEU. The analysis of sentence length (Figure 3) shows that performance drops for sentences longer than about 35 tokens — exactly the information bottleneck that Bahdanau et al. solved, within months, with attention; the two papers are best read together to appreciate how rapid that progress was.
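The two training-time tricks above — source reversal and teacher forcing — are simple enough to state as code. A toy Python sketch (the token strings and the `<sos>`/`<eos>` markers are illustrative, not from the paper):

```python
# Toy illustration of two Sutskever et al. training-time tricks.
source = ["the", "cat", "sat"]
target = ["le", "chat", "noir"]  # made-up token strings

# (1) Source reversal: the first source words now sit close to the
# first target words in the unrolled computation graph, shortening
# the path gradients must travel for the early-word dependencies.
encoder_input = list(reversed(source))

# (2) Teacher forcing: during training the decoder is fed the
# ground-truth target shifted right by one position, and at each
# step it is trained to predict the next ground-truth token.
decoder_input = ["<sos>"] + target
decoder_target = target + ["<eos>"]

print(encoder_input)  # ['sat', 'cat', 'the']
```

At inference time there is no ground truth to feed, so the decoder consumes its own previous prediction instead — the train/inference mismatch this creates is known as exposure bias.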