Chapter 15: Key Takeaways

The Big Picture

Recurrent Neural Networks (RNNs) process sequential data by maintaining a hidden state that carries information from one time step to the next. The vanilla RNN introduced this idea but suffers from vanishing/exploding gradients, limiting its ability to learn long-range dependencies. Gated architectures---LSTM and GRU---mitigate this problem with learned gates that control information flow through the sequence. While Transformers have largely replaced RNNs in NLP, recurrent models remain relevant for time series, streaming applications, and resource-constrained settings.

Section-by-Section Summary

| Section | Key Concept | What to Remember |
|---|---|---|
| 15.1 Sequential Data | Why sequences are different | Order matters; models need variable-length handling, parameter sharing, and memory |
| 15.2 Vanilla RNN | Recurrence relation | $\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b}_h)$; same weights at every step |
| 15.3 Vanishing/Exploding Gradients | BPTT gradient products | Product of Jacobians shrinks/explodes exponentially; limits vanilla RNN to ~10-20 step memory |
| 15.4 LSTM | Gated cell state | Forget, input, and output gates; additive cell update $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$ prevents vanishing gradients |
| 15.5 GRU | Simplified gating | Two gates (update, reset) instead of three; linear interpolation between old and new state; fewer parameters, comparable performance |
| 15.6 Bidirectional RNNs | Two-direction processing | Forward + backward RNNs for full context; only for tasks where the full sequence is available (not autoregressive generation) |
| 15.7 Deep RNNs | Stacking layers | 2-4 layers typical; dropout between layers (not across time); residual connections for deeper stacks |
| 15.8 Seq2Seq | Encoder-decoder | Encoder compresses input to a context vector; decoder generates output autoregressively; information bottleneck limits long sequences |
| 15.9 Teacher Forcing | Training acceleration | Use ground truth as decoder input during training; causes exposure bias; mitigate with scheduled sampling |
| 15.10 Beam Search | Better decoding | Maintain top-k hypotheses; length normalization prevents short-sequence bias; beam width 4-10 typical |
| 15.11 Attention Preview | Dynamic context | Decoder attends to all encoder states; eliminates the bottleneck; leads to Transformers |
| 15.12 Practical Considerations | Implementation details | Pack sequences for efficiency; detach hidden states between chunks; gradient clipping essential |

Core Equations

Vanilla RNN

$$\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b}_h)$$
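
As a concrete reference, here is a minimal PyTorch sketch of this recurrence; the weight names mirror the equation above and the toy dimensions are arbitrary, so treat it as an illustration rather than a library implementation.

```python
import torch

def rnn_step(h_prev, x_t, W_hh, W_xh, b_h):
    """One vanilla RNN step: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    return torch.tanh(h_prev @ W_hh.T + x_t @ W_xh.T + b_h)

# Toy dimensions: input size d = 8, hidden size n = 16, batch of 4.
d, n, batch = 8, 16, 4
W_hh = 0.1 * torch.randn(n, n)
W_xh = 0.1 * torch.randn(n, d)
b_h = torch.zeros(n)

h = torch.zeros(batch, n)                  # initial hidden state
for x_t in torch.randn(5, batch, d):       # unroll over 5 time steps
    h = rnn_step(h, x_t, W_hh, W_xh, b_h)  # same weights at every step
```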

LSTM (complete)

$$\mathbf{f}_t = \sigma(\mathbf{W}_f [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f) \quad \text{(forget gate)}$$
$$\mathbf{i}_t = \sigma(\mathbf{W}_i [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i) \quad \text{(input gate)}$$
$$\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c) \quad \text{(candidate)}$$
$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t \quad \text{(cell update)}$$
$$\mathbf{o}_t = \sigma(\mathbf{W}_o [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o) \quad \text{(output gate)}$$
$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t) \quad \text{(hidden state)}$$
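
The same style of direct transcription works for an LSTM step. The sketch below assumes one concatenated weight matrix per gate, which matches the equations but not the internal parameter layout of PyTorch's nn.LSTM.

```python
import torch

def lstm_step(h_prev, c_prev, x_t, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM step; each W_* acts on the concatenation [h_{t-1}, x_t]."""
    hx = torch.cat([h_prev, x_t], dim=-1)
    f = torch.sigmoid(hx @ W_f.T + b_f)     # forget gate
    i = torch.sigmoid(hx @ W_i.T + b_i)     # input gate
    c_tilde = torch.tanh(hx @ W_c.T + b_c)  # candidate cell
    c = f * c_prev + i * c_tilde            # additive cell update
    o = torch.sigmoid(hx @ W_o.T + b_o)     # output gate
    h = o * torch.tanh(c)                   # hidden state
    return h, c
```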

GRU

$$\mathbf{z}_t = \sigma(\mathbf{W}_z [\mathbf{h}_{t-1}, \mathbf{x}_t]) \quad \text{(update gate)}$$
$$\mathbf{r}_t = \sigma(\mathbf{W}_r [\mathbf{h}_{t-1}, \mathbf{x}_t]) \quad \text{(reset gate)}$$
$$\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}_h [\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t])$$
$$\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$$
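
And the corresponding GRU step, again transcribed directly from the equations (biases omitted, matching the formulas above). Either step function plugs into the same unrolling loop shown for the vanilla RNN.

```python
import torch

def gru_step(h_prev, x_t, W_z, W_r, W_h):
    """One GRU step; weights act on the concatenation [h_{t-1}, x_t]."""
    hx = torch.cat([h_prev, x_t], dim=-1)
    z = torch.sigmoid(hx @ W_z.T)          # update gate
    r = torch.sigmoid(hx @ W_r.T)          # reset gate
    h_tilde = torch.tanh(torch.cat([r * h_prev, x_t], dim=-1) @ W_h.T)
    return (1 - z) * h_prev + z * h_tilde  # interpolate old and new state
```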

Parameter Count Reference

| Architecture | Weight Matrices | Parameters (input $d$, hidden $n$) |
|---|---|---|
| Vanilla RNN | 2 ($\mathbf{W}_{xh}$, $\mathbf{W}_{hh}$) | $n(d + n) + n$ |
| LSTM | 4 (one per gate + candidate) | $4[n(d + n) + n]$ |
| GRU | 3 (update, reset, candidate) | $3[n(d + n) + n]$ |
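
These formulas assume a single bias vector per gate. As a quick sanity check, the sketch below compares them with PyTorch's built-in layers, which keep two bias vectors per gate (bias_ih and bias_hh) and therefore report slightly larger totals.

```python
import torch.nn as nn

d, n = 32, 64  # toy sizes: input dimension d, hidden dimension n

# Single-bias-per-gate formulas from the table above.
print("formula RNN :", n * (d + n) + n)        # 6208
print("formula LSTM:", 4 * (n * (d + n) + n))  # 24832
print("formula GRU :", 3 * (n * (d + n) + n))  # 18624

# PyTorch layers add one extra bias vector per gate, so counts come out higher.
for name, layer in [("RNN ", nn.RNN(d, n)), ("LSTM", nn.LSTM(d, n)), ("GRU ", nn.GRU(d, n))]:
    print(f"nn.{name}:", sum(p.numel() for p in layer.parameters()))
```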

Common Pitfalls

  1. Forgetting to detach hidden states: When training on long sequences in chunks (truncated BPTT), always call .detach() on the hidden state between chunks. Otherwise, the computation graph keeps growing across chunks, inflating memory use and backpropagating through arbitrarily old steps. See the training-loop sketch after this list, which also illustrates pitfalls 2, 3, and 6.

  2. Not using packed sequences: Processing padding tokens wastes computation and can bias the model. Always use pack_padded_sequence / pad_packed_sequence for variable-length inputs.

  3. Ignoring gradient clipping: RNNs are prone to exploding gradients, especially early in training. Always clip gradients (max norm 1.0-5.0).

  4. Using bidirectional RNNs for autoregressive tasks: Bidirectional processing requires the full sequence upfront. Do not use for language modeling, text generation, or any task requiring left-to-right generation.

  5. Random train/test splits for time series: Time series data requires chronological splitting. Random splits create data leakage because the model sees future data during training.

  6. Confusing hidden state shapes in bidirectional multi-layer LSTMs: The hidden state has shape (num_layers * num_directions, batch, hidden_size). For a 2-layer bidirectional LSTM, this is (4, batch, hidden_size).
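
The minimal training-loop sketch below illustrates pitfalls 1, 2, 3, and 6 together. All names and the random toy data are assumptions made for illustration; it sketches truncated backpropagation through time with an nn.LSTM, not this chapter's reference code.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Hypothetical setup: a 2-layer LSTM tagger trained on chunks of padded batches.
d, n, num_classes, batch, T = 8, 16, 5, 4, 12
model = nn.LSTM(d, n, num_layers=2, batch_first=True)
head = nn.Linear(n, num_classes)
optimizer = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=1e-3)
criterion = nn.CrossEntropyLoss(ignore_index=-100)

def toy_chunks(num_chunks=3):
    """Random padded batches standing in for consecutive chunks of long sequences."""
    for _ in range(num_chunks):
        lengths = torch.randint(T // 2, T + 1, (batch,))
        yield torch.randn(batch, T, d), lengths, torch.randint(0, num_classes, (batch, T))

hidden = None
for inputs, lengths, targets in toy_chunks():
    optimizer.zero_grad()

    # Pitfall 2: pack so padded time steps are skipped instead of processed.
    packed = pack_padded_sequence(inputs, lengths, batch_first=True, enforce_sorted=False)
    packed_out, hidden = model(packed, hidden)
    out, _ = pad_packed_sequence(packed_out, batch_first=True, total_length=T)

    # Mask padded positions out of the loss as well.
    mask = torch.arange(T)[None, :] < lengths[:, None]
    loss = criterion(head(out).flatten(0, 1), targets.masked_fill(~mask, -100).flatten())
    loss.backward()

    # Pitfall 3: clip gradients before the update to guard against explosion.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

    # Pitfall 1: detach so the graph does not keep growing across chunks.
    # Pitfall 6: for this 2-layer unidirectional LSTM, each tensor is (2, batch, n).
    hidden = tuple(h.detach() for h in hidden)
```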

LSTM vs. GRU Decision Guide

| Factor | Choose LSTM | Choose GRU |
|---|---|---|
| Long dependencies (>100 steps) | Preferred | May work |
| Training speed | Slower | Faster (~20%) |
| Model size | Larger (4x RNN) | Smaller (3x RNN) |
| Small dataset | Either | Slightly preferred (fewer params) |
| Well-established best practice | Yes (more studied) | Growing adoption |

Looking Ahead

  • Chapter 16: Attention mechanisms in depth---the concept previewed in Section 15.11 becomes a full framework
  • Chapters 17-18: Transformers replace recurrence with self-attention; the encoder-decoder pattern from seq2seq reappears
  • Modern hybrids: State space models (Mamba, S4) combine RNN-like recurrence with Transformer-like performance