Chapter 15: Key Takeaways
The Big Picture
Recurrent Neural Networks (RNNs) process sequential data by maintaining a hidden state that carries information from one time step to the next. The vanilla RNN introduced this idea but suffers from vanishing/exploding gradients, limiting its ability to learn long-range dependencies. Gated architectures---LSTM and GRU---solve this problem with learned gates that control information flow through the sequence. While Transformers have largely replaced RNNs in NLP, recurrent models remain relevant for time series, streaming applications, and resource-constrained settings.
Section-by-Section Summary
| Section | Key Concept | What to Remember |
|---|---|---|
| 15.1 Sequential Data | Why sequences are different | Order matters; models need variable-length handling, parameter sharing, and memory |
| 15.2 Vanilla RNN | Recurrence relation | $\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b})$; same weights at every step |
| 15.3 Vanishing/Exploding Gradients | BPTT gradient products | Product of Jacobians shrinks/explodes exponentially; limits vanilla RNN to ~10-20 step memory |
| 15.4 LSTM | Gated cell state | Forget, input, and output gates; additive cell update $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$ prevents vanishing gradients |
| 15.5 GRU | Simplified gating | Two gates (update, reset) instead of three; linear interpolation between old and new state; fewer parameters, comparable performance |
| 15.6 Bidirectional RNNs | Two-direction processing | Forward + backward RNNs for full context; only for tasks where full sequence is available (not autoregressive generation) |
| 15.7 Deep RNNs | Stacking layers | 2-4 layers typical; dropout between layers (not across time); residual connections for deeper stacks |
| 15.8 Seq2Seq | Encoder-decoder | Encoder compresses input to context vector; decoder generates output autoregressively; information bottleneck limits long sequences |
| 15.9 Teacher Forcing | Training acceleration | Use ground-truth as decoder input during training; causes exposure bias; mitigate with scheduled sampling |
| 15.10 Beam Search | Better decoding | Maintain top-k hypotheses; length normalization prevents short-sequence bias; beam width 4-10 typical |
| 15.11 Attention Preview | Dynamic context | Decoder attends to all encoder states; eliminates bottleneck; leads to Transformers |
| 15.12 Practical Considerations | Implementation details | Pack sequences for efficiency; detach hidden states between chunks; gradient clipping essential |
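Beam search (row 15.10) is easiest to recall as code. The sketch below is a minimal, framework-free illustration with length normalization; `step_fn`, `bos_id`, and `eos_id` are hypothetical stand-ins for a trained decoder's step function and special-token IDs, not code from this chapter.

```python
def beam_search(step_fn, bos_id, eos_id, beam_width=4, max_len=50, alpha=0.7):
    """Minimal beam search sketch. step_fn(tokens) is assumed to return a
    sequence of log-probabilities over the vocabulary for the next token."""
    beams = [([bos_id], 0.0)]          # each hypothesis: (tokens, summed log-prob)
    finished = []

    for _ in range(max_len):
        candidates = []
        for tokens, score in beams:
            log_probs = step_fn(tokens)
            # Expand each beam with its top beam_width next tokens.
            top = sorted(range(len(log_probs)), key=lambda i: log_probs[i],
                         reverse=True)[:beam_width]
            for tok in top:
                candidates.append((tokens + [tok], score + log_probs[tok]))

        # Keep only the best beam_width candidates overall.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for tokens, score in candidates[:beam_width]:
            if tokens[-1] == eos_id:
                finished.append((tokens, score))
            else:
                beams.append((tokens, score))
        if not beams:
            break

    # Length normalization: divide by len^alpha so short outputs are not favored.
    pool = finished or beams
    return max(pool, key=lambda c: c[1] / (len(c[0]) ** alpha))[0]
```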
Core Equations
Vanilla RNN
$$\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b}_h)$$
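As a sketch, this recurrence translates almost line-for-line into code. The tensor names below mirror the equation's symbols; this is an illustrative loop (random weights, no batching or training), equivalent in spirit to what `torch.nn.RNN` computes internally.

```python
import torch

d, n, T = 8, 16, 5                  # input size, hidden size, sequence length
W_hh = torch.randn(n, n) * 0.1
W_xh = torch.randn(n, d) * 0.1
b_h  = torch.zeros(n)

x = torch.randn(T, d)               # one sequence of T input vectors
h = torch.zeros(n)                  # initial hidden state
for t in range(T):
    # The same weights are reused at every time step (parameter sharing).
    h = torch.tanh(W_hh @ h + W_xh @ x[t] + b_h)
```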
LSTM (complete)
$$\mathbf{f}_t = \sigma(\mathbf{W}_f [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f) \quad \text{(forget gate)}$$
$$\mathbf{i}_t = \sigma(\mathbf{W}_i [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i) \quad \text{(input gate)}$$
$$\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c) \quad \text{(candidate)}$$
$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t \quad \text{(cell update)}$$
$$\mathbf{o}_t = \sigma(\mathbf{W}_o [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o) \quad \text{(output gate)}$$
$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t) \quad \text{(hidden state)}$$
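Transcribing these equations into a single-step function makes the gating structure concrete. This is a minimal sketch (random weights, no batching), not `torch.nn.LSTMCell`; the explicit `torch.cat` implements the $[\mathbf{h}_{t-1}, \mathbf{x}_t]$ concatenation.

```python
import torch

d, n = 8, 16
W_f, W_i, W_c, W_o = (torch.randn(n, n + d) * 0.1 for _ in range(4))
b_f, b_i, b_c, b_o = (torch.zeros(n) for _ in range(4))

def lstm_step(h_prev, c_prev, x_t):
    hx = torch.cat([h_prev, x_t])             # [h_{t-1}, x_t]
    f = torch.sigmoid(W_f @ hx + b_f)         # forget gate
    i = torch.sigmoid(W_i @ hx + b_i)         # input gate
    c_tilde = torch.tanh(W_c @ hx + b_c)      # candidate cell state
    c = f * c_prev + i * c_tilde              # additive cell update
    o = torch.sigmoid(W_o @ hx + b_o)         # output gate
    h = o * torch.tanh(c)                     # new hidden state
    return h, c
```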
GRU
$$\mathbf{z}_t = \sigma(\mathbf{W}_z [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_z) \quad \text{(update gate)}$$
$$\mathbf{r}_t = \sigma(\mathbf{W}_r [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_r) \quad \text{(reset gate)}$$
$$\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}_h [\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_h) \quad \text{(candidate)}$$
$$\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t \quad \text{(interpolation)}$$
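The same style of single-step sketch for the GRU, with bias terms included to match the parameter counts below; names mirror the symbols rather than any library API.

```python
import torch

d, n = 8, 16
W_z, W_r, W_h = (torch.randn(n, n + d) * 0.1 for _ in range(3))
b_z, b_r, b_h = (torch.zeros(n) for _ in range(3))

def gru_step(h_prev, x_t):
    hx = torch.cat([h_prev, x_t])
    z = torch.sigmoid(W_z @ hx + b_z)                              # update gate
    r = torch.sigmoid(W_r @ hx + b_r)                              # reset gate
    h_tilde = torch.tanh(W_h @ torch.cat([r * h_prev, x_t]) + b_h) # candidate
    return (1 - z) * h_prev + z * h_tilde                          # interpolation
```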
Parameter Count Reference
| Architecture | Weight Matrices | Parameters (input $d$, hidden $n$) |
|---|---|---|
| Vanilla RNN | 2 ($\mathbf{W}_{xh}, \mathbf{W}_{hh}$) | $n(d + n) + n$ |
| LSTM | 4 (one per gate + candidate) | $4[n(d + n) + n]$ |
| GRU | 3 (update, reset, candidate) | $3[n(d + n) + n]$ |
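These formulas are easy to sanity-check numerically. One caveat: `torch.nn.RNN`/`LSTM`/`GRU` use two bias vectors per gate (`bias_ih` and `bias_hh`), so the library counts come out slightly higher than the single-bias formulas in the table.

```python
import torch.nn as nn

d, n = 32, 64
print("Vanilla RNN:", n * (d + n) + n)          # 6208
print("LSTM:      ", 4 * (n * (d + n) + n))     # 24832
print("GRU:       ", 3 * (n * (d + n) + n))     # 18624

# PyTorch's count is the formula plus one extra bias vector per gate (4n here).
lstm = nn.LSTM(input_size=d, hidden_size=n)
print("nn.LSTM:   ", sum(p.numel() for p in lstm.parameters()))  # 24832 + 4*n = 25088
```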
Common Pitfalls
- **Forgetting to detach hidden states:** When training on long sequences in chunks, always call `.detach()` on the hidden state between chunks. Otherwise, the computation graph grows unboundedly, causing memory issues and stale gradients (see the sketch after this list).
- **Not using packed sequences:** Processing padding tokens wastes computation and can bias the model. Always use `pack_padded_sequence`/`pad_packed_sequence` for variable-length inputs.
- **Ignoring gradient clipping:** RNNs are prone to exploding gradients, especially early in training. Always clip gradients (max norm 1.0-5.0).
- **Using bidirectional RNNs for autoregressive tasks:** Bidirectional processing requires the full sequence upfront. Do not use it for language modeling, text generation, or any task requiring left-to-right generation.
- **Random train/test splits for time series:** Time series data requires chronological splitting. Random splits create data leakage because the model sees future data during training.
- **Confusing hidden state shapes in bidirectional multi-layer LSTMs:** The hidden state has shape `(num_layers * num_directions, batch, hidden_size)`. For a 2-layer bidirectional LSTM, this is `(4, batch, hidden_size)`.
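The first three pitfalls are easiest to see together in code. The sketch below uses a placeholder model, data, and loss; only the `detach()`, clipping, and packing calls are the point.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

lstm = nn.LSTM(input_size=10, hidden_size=32, batch_first=True)
optimizer = torch.optim.Adam(lstm.parameters(), lr=1e-3)

# --- Pitfalls 1 and 3: chunked training with detach() and gradient clipping ---
long_seq = torch.randn(4, 1000, 10)            # (batch, time, features), placeholder data
hidden = None
for chunk in long_seq.split(100, dim=1):       # process 100 steps at a time
    optimizer.zero_grad()
    out, hidden = lstm(chunk, hidden)
    loss = out.pow(2).mean()                   # placeholder loss
    loss.backward()
    nn.utils.clip_grad_norm_(lstm.parameters(), max_norm=1.0)
    optimizer.step()
    # Detach so the computation graph does not span all previous chunks.
    hidden = tuple(h.detach() for h in hidden)

# --- Pitfall 2: packed sequences skip padding positions ---
padded = torch.randn(3, 7, 10)                 # 3 sequences padded to length 7
lengths = torch.tensor([7, 5, 2])              # true lengths, sorted descending
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=True)
packed_out, _ = lstm(packed)
out, out_lengths = pad_packed_sequence(packed_out, batch_first=True)
```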
LSTM vs. GRU Decision Guide
| Factor | Choose LSTM | Choose GRU |
|---|---|---|
| Long dependencies (>100 steps) | Preferred | May work |
| Training speed | Slower | Faster (~20%) |
| Model size | Larger (4x RNN) | Smaller (3x RNN) |
| Small dataset | Either | Slightly preferred (fewer params) |
| Well-established best practice | Yes (more studied) | Growing adoption |
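In PyTorch the two are close to drop-in replacements, which makes it cheap to try both on a given task; the main interface difference is that the LSTM carries a `(hidden, cell)` tuple while the GRU carries a single hidden tensor.

```python
import torch
import torch.nn as nn

x = torch.randn(8, 20, 32)                           # (batch, time, features)

lstm = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
out_lstm, (h_n, c_n) = lstm(x)                       # hidden state and cell state

gru = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
out_gru, h_n = gru(x)                                # hidden state only
```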
Looking Ahead
- Chapter 16: Attention mechanisms in depth---the concept previewed in Section 15.11 becomes a full framework
- Chapters 17-18: Transformers replace recurrence with self-attention; the encoder-decoder pattern from seq2seq reappears
- Modern hybrids: State space models (Mamba, S4) combine RNN-like recurrence with Transformer-like performance