Chapter 15: Key Takeaways

The Big Picture

Recurrent Neural Networks (RNNs) process sequential data by maintaining a hidden state that carries information from one time step to the next. The vanilla RNN introduced this idea but suffers from vanishing/exploding gradients, limiting its ability to learn long-range dependencies. Gated architectures---LSTM and GRU---mitigate this problem with learned gates that control information flow through the sequence. While Transformers have largely replaced RNNs in NLP, recurrent models remain relevant for time series, streaming applications, and resource-constrained settings.

Section-by-Section Summary

| Section | Key Concept | What to Remember |
|---|---|---|
| 15.1 Sequential Data | Why sequences are different | Order matters; models need variable-length handling, parameter sharing, and memory |
| 15.2 Vanilla RNN | Recurrence relation | $\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b}_h)$; same weights at every step |
| 15.3 Vanishing/Exploding Gradients | BPTT gradient products | Product of Jacobians shrinks/explodes exponentially; limits vanilla RNN to ~10-20 step memory |
| 15.4 LSTM | Gated cell state | Forget, input, and output gates; additive cell update $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$ prevents vanishing gradients |
| 15.5 GRU | Simplified gating | Two gates (update, reset) instead of three; linear interpolation between old and new state; fewer parameters, comparable performance |
| 15.6 Bidirectional RNNs | Two-direction processing | Forward + backward RNNs for full context; only for tasks where the full sequence is available (not autoregressive generation) |
| 15.7 Deep RNNs | Stacking layers | 2-4 layers typical; dropout between layers (not across time); residual connections for deeper stacks |
| 15.8 Seq2Seq | Encoder-decoder | Encoder compresses input to a context vector; decoder generates output autoregressively; information bottleneck limits long sequences |
| 15.9 Teacher Forcing | Training acceleration | Use ground truth as decoder input during training; causes exposure bias; mitigate with scheduled sampling |
| 15.10 Beam Search | Better decoding | Maintain top-k hypotheses; length normalization prevents short-sequence bias; beam width 4-10 typical |
| 15.11 Attention Preview | Dynamic context | Decoder attends to all encoder states; eliminates the bottleneck; leads to Transformers |
| 15.12 Practical Considerations | Implementation details | Pack sequences for efficiency; detach hidden states between chunks; gradient clipping essential |

Core Equations

Vanilla RNN

$$\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b}_h)$$
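
As a concrete reference, here is a minimal PyTorch sketch of this recurrence; the weight names mirror the equation above and the toy dimensions are arbitrary, so treat it as an illustration rather than a library implementation.

```python
import torch

def rnn_step(h_prev, x_t, W_hh, W_xh, b_h):
    """One vanilla RNN step: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    return torch.tanh(h_prev @ W_hh.T + x_t @ W_xh.T + b_h)

# Toy dimensions: input size d = 8, hidden size n = 16, batch of 4.
d, n, batch = 8, 16, 4
W_hh = 0.1 * torch.randn(n, n)
W_xh = 0.1 * torch.randn(n, d)
b_h = torch.zeros(n)

h = torch.zeros(batch, n)                  # initial hidden state
for x_t in torch.randn(5, batch, d):       # unroll over 5 time steps
    h = rnn_step(h, x_t, W_hh, W_xh, b_h)  # same weights at every step
```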

LSTM (complete)

$$\mathbf{f}_t = \sigma(\mathbf{W}_f [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f) \quad \text{(forget gate)}$$
$$\mathbf{i}_t = \sigma(\mathbf{W}_i [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i) \quad \text{(input gate)}$$
$$\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c) \quad \text{(candidate)}$$
$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t \quad \text{(cell update)}$$
$$\mathbf{o}_t = \sigma(\mathbf{W}_o [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o) \quad \text{(output gate)}$$
$$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t) \quad \text{(hidden state)}$$
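
The same style of direct transcription works for an LSTM step. The sketch below assumes one concatenated weight matrix per gate, which matches the equations but not the internal parameter layout of PyTorch's nn.LSTM.

```python
import torch

def lstm_step(h_prev, c_prev, x_t, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    """One LSTM step; each W_* acts on the concatenation [h_{t-1}, x_t]."""
    hx = torch.cat([h_prev, x_t], dim=-1)
    f = torch.sigmoid(hx @ W_f.T + b_f)     # forget gate
    i = torch.sigmoid(hx @ W_i.T + b_i)     # input gate
    c_tilde = torch.tanh(hx @ W_c.T + b_c)  # candidate cell
    c = f * c_prev + i * c_tilde            # additive cell update
    o = torch.sigmoid(hx @ W_o.T + b_o)     # output gate
    h = o * torch.tanh(c)                   # hidden state
    return h, c
```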

GRU

$$\mathbf{z}_t = \sigma(\mathbf{W}_z [\mathbf{h}_{t-1}, \mathbf{x}_t]) \quad \text{(update gate)}$$
$$\mathbf{r}_t = \sigma(\mathbf{W}_r [\mathbf{h}_{t-1}, \mathbf{x}_t]) \quad \text{(reset gate)}$$
$$\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}_h [\mathbf{r}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t])$$
$$\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$$
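
And the corresponding GRU step, again transcribed directly from the equations (biases omitted, matching the formulas above). Either step function plugs into the same unrolling loop shown for the vanilla RNN.

```python
import torch

def gru_step(h_prev, x_t, W_z, W_r, W_h):
    """One GRU step; weights act on the concatenation [h_{t-1}, x_t]."""
    hx = torch.cat([h_prev, x_t], dim=-1)
    z = torch.sigmoid(hx @ W_z.T)          # update gate
    r = torch.sigmoid(hx @ W_r.T)          # reset gate
    h_tilde = torch.tanh(torch.cat([r * h_prev, x_t], dim=-1) @ W_h.T)
    return (1 - z) * h_prev + z * h_tilde  # interpolate old and new state
```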

Parameter Count Reference

| Architecture | Weight Matrices | Parameters (input $d$, hidden $n$) |
|---|---|---|
| Vanilla RNN | 2 ($\mathbf{W}_{xh}$, $\mathbf{W}_{hh}$) | $n(d + n) + n$ |
| LSTM | 4 (one per gate + candidate) | $4[n(d + n) + n]$ |
| GRU | 3 (update, reset, candidate) | $3[n(d + n) + n]$ |
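
These formulas assume a single bias vector per gate. As a quick sanity check, the sketch below compares them with PyTorch's built-in layers, which keep two bias vectors per gate (bias_ih and bias_hh) and therefore report slightly larger totals.

```python
import torch.nn as nn

d, n = 32, 64  # toy sizes: input dimension d, hidden dimension n

# Single-bias-per-gate formulas from the table above.
print("formula RNN :", n * (d + n) + n)        # 6208
print("formula LSTM:", 4 * (n * (d + n) + n))  # 24832
print("formula GRU :", 3 * (n * (d + n) + n))  # 18624

# PyTorch layers add one extra bias vector per gate, so counts come out higher.
for name, layer in [("RNN ", nn.RNN(d, n)), ("LSTM", nn.LSTM(d, n)), ("GRU ", nn.GRU(d, n))]:
    print(f"nn.{name}:", sum(p.numel() for p in layer.parameters()))
```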

Common Pitfalls

  1. Forgetting to detach hidden states: When training on long sequences in chunks (truncated BPTT), always call .detach() on the hidden state between chunks. Otherwise, the computation graph keeps growing across chunks, inflating memory use and backpropagating through arbitrarily old steps. See the training-loop sketch after this list, which also illustrates pitfalls 2, 3, and 6.

  2. Not using packed sequences: Processing padding tokens wastes computation and can bias the model. Always use pack_padded_sequence / pad_packed_sequence for variable-length inputs.

  3. Ignoring gradient clipping: RNNs are prone to exploding gradients, especially early in training. Always clip gradients (max norm 1.0-5.0).

  4. Using bidirectional RNNs for autoregressive tasks: Bidirectional processing requires the full sequence upfront. Do not use for language modeling, text generation, or any task requiring left-to-right generation.

  5. Random train/test splits for time series: Time series data requires chronological splitting. Random splits create data leakage because the model sees future data during training.

  6. Confusing hidden state shapes in bidirectional multi-layer LSTMs: The hidden state has shape (num_layers * num_directions, batch, hidden_size). For a 2-layer bidirectional LSTM, this is (4, batch, hidden_size).
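
The minimal training-loop sketch below illustrates pitfalls 1, 2, 3, and 6 together. All names and the random toy data are assumptions made for illustration; it sketches truncated backpropagation through time with an nn.LSTM, not this chapter's reference code.

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

# Hypothetical setup: a 2-layer LSTM tagger trained on chunks of padded batches.
d, n, num_classes, batch, T = 8, 16, 5, 4, 12
model = nn.LSTM(d, n, num_layers=2, batch_first=True)
head = nn.Linear(n, num_classes)
optimizer = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=1e-3)
criterion = nn.CrossEntropyLoss(ignore_index=-100)

def toy_chunks(num_chunks=3):
    """Random padded batches standing in for consecutive chunks of long sequences."""
    for _ in range(num_chunks):
        lengths = torch.randint(T // 2, T + 1, (batch,))
        yield torch.randn(batch, T, d), lengths, torch.randint(0, num_classes, (batch, T))

hidden = None
for inputs, lengths, targets in toy_chunks():
    optimizer.zero_grad()

    # Pitfall 2: pack so padded time steps are skipped instead of processed.
    packed = pack_padded_sequence(inputs, lengths, batch_first=True, enforce_sorted=False)
    packed_out, hidden = model(packed, hidden)
    out, _ = pad_packed_sequence(packed_out, batch_first=True, total_length=T)

    # Mask padded positions out of the loss as well.
    mask = torch.arange(T)[None, :] < lengths[:, None]
    loss = criterion(head(out).flatten(0, 1), targets.masked_fill(~mask, -100).flatten())
    loss.backward()

    # Pitfall 3: clip gradients before the update to guard against explosion.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()

    # Pitfall 1: detach so the graph does not keep growing across chunks.
    # Pitfall 6: for this 2-layer unidirectional LSTM, each tensor is (2, batch, n).
    hidden = tuple(h.detach() for h in hidden)
```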

LSTM vs. GRU Decision Guide

| Factor | Choose LSTM | Choose GRU |
|---|---|---|
| Long dependencies (>100 steps) | Preferred | May work |
| Training speed | Slower | Faster (~20%) |
| Model size | Larger (4x RNN) | Smaller (3x RNN) |
| Small dataset | Either | Slightly preferred (fewer params) |
| Well-established best practice | Yes (more studied) | Growing adoption |

Looking Ahead

  • Chapter 16: Attention mechanisms in depth---the concept previewed in Section 15.11 becomes a full framework
  • Chapters 17-18: Transformers replace recurrence with self-attention; the encoder-decoder pattern from seq2seq reappears
  • Modern hybrids: State space models (Mamba, S4) combine RNN-like recurrence with Transformer-like performance