Chapter 9: Exercises

Exercises are graded by difficulty:

  • One star (*): Apply the technique from the chapter to a new dataset or scenario
  • Two stars (**): Extend the technique or combine it with a previous chapter's methods
  • Three stars (***): Derive a result, implement from scratch, or design a system component
  • Four stars (****): Research-level problems that connect to open questions in the field


Vanilla RNN Fundamentals

Exercise 9.1 (*)

Consider a vanilla RNN with input size $d = 3$, hidden size $h = 4$, and initial hidden state $\mathbf{h}_0 = \mathbf{0}$.

(a) How many learnable parameters does this RNN cell have (excluding the output layer)? Show your calculation.

(b) Given the input sequence $\mathbf{x}_1 = [1, 0, 0]^T$, $\mathbf{x}_2 = [0, 1, 0]^T$, and the weight matrices:

$$\mathbf{W}_{xh} = \begin{bmatrix} 0.1 & 0.2 & 0.0 \\ 0.0 & 0.1 & 0.3 \\ 0.2 & 0.0 & 0.1 \\ 0.1 & 0.1 & 0.1 \end{bmatrix}, \quad \mathbf{W}_{hh} = 0.5 \cdot \mathbf{I}_4, \quad \mathbf{b}_h = \mathbf{0}$$

compute $\mathbf{h}_1$ and $\mathbf{h}_2$ by hand. Use $\tanh$ activation.

(c) If the sequence length were 1,000 instead of 2, would the parameter count change? Why or why not?
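Part (b) can be checked numerically; a minimal sketch using the matrices given above:

```python
import torch

W_xh = torch.tensor([[0.1, 0.2, 0.0],
                     [0.0, 0.1, 0.3],
                     [0.2, 0.0, 0.1],
                     [0.1, 0.1, 0.1]])
W_hh = 0.5 * torch.eye(4)
b_h = torch.zeros(4)

h = torch.zeros(4)  # h_0 = 0
for t, x in enumerate([torch.tensor([1.0, 0.0, 0.0]),
                       torch.tensor([0.0, 1.0, 0.0])], start=1):
    h = torch.tanh(W_xh @ x + W_hh @ h + b_h)
    print(f"h_{t} = {h}")
```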


Exercise 9.2 (*)

A vanilla RNN with hidden size $h = 128$ is applied to two tasks:

  • Task A: Sentiment classification of movie reviews (average length: 200 tokens)
  • Task B: Named entity recognition in tweets (average length: 30 tokens)

(a) For each task, explain whether you would use the final hidden state or all hidden states, and why.

(b) For Task A, estimate the effective "depth" of the unrolled network. Why might this be problematic?

(c) Would a bidirectional RNN be appropriate for Task A? For Task B? Explain.


Exercise 9.3 (**)

Using PyTorch's nn.RNN, train a vanilla RNN on a synthetic copy task: the input is a sequence of $N$ random integers from $\{0, 1, \ldots, 9\}$, followed by a delimiter token and $N$ padding zeros; the target is to emit the same $N$ integers, in order, after the delimiter (see generate_copy_task below).

(a) Train with sequence length 5. Report the accuracy after 500 epochs.

(b) Increase the sequence length to 20. What happens to performance? To the gradient norms?

(c) Plot the gradient norm of weight_hh_l0 as a function of training step for both sequence lengths. What does this illustrate?

import torch
import torch.nn as nn

def generate_copy_task(batch_size: int, seq_len: int, vocab_size: int = 10):
    """Generate a batch of copy task data.

    Input:  [x1, x2, ..., xN, DELIM, 0, 0, ..., 0]  (length 2N + 1)
    Target: [0, 0, ..., 0, DELIM, x1, x2, ..., xN]  (length 2N + 1)
    DELIM = vocab_size (index 10)
    """
    delim = vocab_size
    x = torch.randint(0, vocab_size, (batch_size, seq_len))

    # Input: sequence + delimiter + zeros
    input_seq = torch.cat([
        x,
        torch.full((batch_size, 1), delim, dtype=torch.long),
        torch.zeros(batch_size, seq_len, dtype=torch.long),
    ], dim=1)

    # Target: zeros + delimiter + sequence
    target_seq = torch.cat([
        torch.zeros(batch_size, seq_len, dtype=torch.long),
        torch.full((batch_size, 1), delim, dtype=torch.long),
        x,
    ], dim=1)

    return input_seq, target_seq

Vanishing/Exploding Gradients

Exercise 9.4 (**)

(a) For a vanilla RNN with $\mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$, compute the eigenvalues of $\mathbf{W}_{hh}^n$ in terms of the eigenvalues $\lambda_1, \ldots, \lambda_h$ of $\mathbf{W}_{hh}$.

(b) If $\mathbf{W}_{hh}$ has spectral radius $\rho = 0.95$, how many time steps $n$ does it take for the largest eigenvalue magnitude of $\mathbf{W}_{hh}^n$ to drop below $10^{-3}$? Below $10^{-6}$?

(c) If $\rho = 1.05$, how many steps until the largest eigenvalue magnitude exceeds $10^3$? What does this imply about gradient clipping thresholds?
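For parts (b) and (c), the step counts follow from $\rho^n < \epsilon \iff n > \log \epsilon / \log \rho$ (and similarly for growth); a quick numerical check:

```python
import math

def steps_until(rho: float, threshold: float) -> int:
    """Smallest n at which rho**n crosses the threshold
    (decays below it for rho < 1, grows above it for rho > 1)."""
    return math.ceil(math.log(threshold) / math.log(rho))

print(steps_until(0.95, 1e-3))
print(steps_until(0.95, 1e-6))
print(steps_until(1.05, 1e3))
```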


Exercise 9.5 (***)

Derive the full BPTT gradient for a vanilla RNN. Starting from the loss:

$$\mathcal{L} = \sum_{t=1}^{T} \mathcal{L}_t(\mathbf{W}_{hy}\mathbf{h}_t + \mathbf{b}_y, \mathbf{y}_t)$$

where $\mathbf{h}_t = \tanh(\mathbf{W}_{xh}\mathbf{x}_t + \mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{b}_h)$:

(a) Write out $\frac{\partial \mathcal{L}}{\partial \mathbf{W}_{hh}}$ as a sum over time steps.

(b) Show that each term in the sum involves a product of Jacobians $\prod_{j=k+1}^{t} \frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}}$.

(c) Bound $\left\| \prod_{j=k+1}^{t} \frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}} \right\|$ using the spectral norm of $\mathbf{W}_{hh}$ and the fact that $|\tanh'(z)| \leq 1$.

(d) Explain why ReLU activations do not solve the vanishing gradient problem for RNNs (even though they help in feedforward networks).
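As a starting point for parts (b) and (c): differentiating the update rule gives the one-step Jacobian

$$\frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}} = \text{diag}\left(\mathbf{1} - \mathbf{h}_j \odot \mathbf{h}_j\right) \mathbf{W}_{hh}$$

and since every entry of the diagonal factor lies in $[0, 1]$, the product of Jacobians satisfies

$$\left\| \prod_{j=k+1}^{t} \frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}} \right\| \leq \|\mathbf{W}_{hh}\|^{t-k},$$

which shrinks geometrically whenever $\|\mathbf{W}_{hh}\| < 1$.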


Exercise 9.6 (**)

Implement gradient clipping and compare three strategies:

(a) No clipping

(b) Clip by global norm: torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

(c) Clip by value: torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)

Train a vanilla RNN on the copy task (Exercise 9.3) with sequence length 15. Plot training loss curves for all three strategies. Which converges? Which diverges?
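Both clipping calls slot into the training loop between backward() and step(); a minimal sketch of the pattern (the stand-in model and loss are placeholders, not the copy-task setup):

```python
import torch
import torch.nn as nn

model = nn.RNN(8, 16)            # stand-in model
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(15, 4, 8)        # (seq_len, batch, input_size)
out, _ = model(x)
loss = out.pow(2).mean()

opt.zero_grad()
loss.backward()
# Strategy (b): rescale all gradients so their global norm is <= 1.0
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# Strategy (c) would instead clamp each gradient entry to [-0.5, 0.5]:
# torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
opt.step()

total_norm = torch.norm(torch.stack([p.grad.norm() for p in model.parameters()]))
```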


LSTM and GRU

Exercise 9.7 (*)

For an LSTM with input size $d = 32$ and hidden size $h = 128$:

(a) Compute the total number of parameters in one LSTM cell (weights + biases for all four gates).

(b) Compute the total number of parameters for a 3-layer stacked LSTM.

(c) How does this compare to a GRU with the same $d$ and $h$? Compute the ratio.
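Counts for parts (a) and (b) can be cross-checked against PyTorch, with one caveat: nn.LSTM uses two bias vectors per layer (bias_ih and bias_hh), so it reports $4(dh + h^2 + 2h)$ parameters per layer rather than the single-bias $4(dh + h^2 + h)$:

```python
import torch.nn as nn

d, h = 32, 128
lstm = nn.LSTM(input_size=d, hidden_size=h, num_layers=1)
n_params = sum(p.numel() for p in lstm.parameters())
assert n_params == 4 * (d * h + h * h + 2 * h)

# Stacked: layers 2 and 3 take the previous layer's h-dim output as input.
stacked = nn.LSTM(input_size=d, hidden_size=h, num_layers=3)
n_stacked = sum(p.numel() for p in stacked.parameters())
assert n_stacked == 4 * (d * h + h * h + 2 * h) + 2 * 4 * (h * h + h * h + 2 * h)
```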


Exercise 9.8 (**)

The forget gate bias initialization trick (setting $\mathbf{b}_f = 1$) is widely used but not always explained.

(a) What is $\sigma(1)$? What does this imply about the initial behavior of the forget gate?

(b) What would happen if $\mathbf{b}_f = -2$ initially? How would this affect the LSTM's ability to remember long-term information during early training?

(c) Train two LSTMs on a sequence classification task of your choice — one with $\mathbf{b}_f = 0$ and one with $\mathbf{b}_f = 1$. Plot the validation accuracy curves. Does the initialization matter in practice?
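For part (c), note that PyTorch stores the gates in the order (input, forget, cell, output), so the forget-gate slice of each bias vector is [h:2h]. A sketch of the initialization; since the two bias vectors bias_ih and bias_hh are added together, zero both and put the $+1$ on one of them:

```python
import torch.nn as nn

h = 64
lstm = nn.LSTM(input_size=32, hidden_size=h, num_layers=1)

# PyTorch gate order is (i, f, g, o); the two bias vectors add, so zero
# both and set the forget slice of one for an effective b_f = 1.
for name, param in lstm.named_parameters():
    if "bias" in name:
        param.data.zero_()
lstm.bias_ih_l0.data[h:2 * h].fill_(1.0)
```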


Exercise 9.9 (***)

Implement a GRU cell from scratch (no nn.GRU). Your implementation should:

  • Accept input_size and hidden_size as constructor arguments
  • Implement the update gate, reset gate, and candidate hidden state
  • Combine the weight matrices for all gates into a single matrix for efficiency (as we did for the LSTM in the chapter)
  • Include proper weight initialization

Verify that your implementation produces the same output (up to floating-point precision) as nn.GRUCell with the same weights.

class MyGRUCell(nn.Module):
    """GRU cell implementation from scratch.

    Args:
        input_size: Dimension of input vectors.
        hidden_size: Dimension of hidden state.
    """

    def __init__(self, input_size: int, hidden_size: int) -> None:
        super().__init__()
        self.hidden_size = hidden_size
        # Your implementation here
        pass

    def forward(self, x_t: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        """Process one time step.

        Args:
            x_t: shape (batch, input_size).
            h_prev: shape (batch, hidden_size).

        Returns:
            h_t: shape (batch, hidden_size).
        """
        # Your implementation here
        pass

Exercise 9.10 (**)

Compare the gradient flow through an LSTM and a vanilla RNN experimentally.

(a) Train both on a task where the label depends on the first token of the sequence (a "remember the first input" task) with sequences of length 50, 100, and 200.

(b) At each sequence length, record the gradient norm with respect to the parameters at the first time step (use hooks or manual backpropagation).

(c) Plot gradient norm vs. sequence length for both architectures. At what sequence length does the vanilla RNN gradient become negligible ($< 10^{-6}$)?
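One way to measure part (b) without hooks: make the input itself a leaf tensor and read off the gradient that reaches the first time step. A sketch with untrained models (the helper name and dimensions are illustrative):

```python
import torch
import torch.nn as nn

def first_step_grad_norm(cell: nn.Module, seq_len: int, d: int = 16) -> float:
    x = torch.randn(seq_len, 1, d, requires_grad=True)
    out, _ = cell(x)                  # works for both nn.RNN and nn.LSTM
    out[-1].sum().backward()          # loss depends only on the last output
    return x.grad[0].norm().item()    # gradient reaching the first time step

torch.manual_seed(0)
rnn = nn.RNN(16, 32)
lstm = nn.LSTM(16, 32)
for T in (50, 100, 200):
    print(T, first_step_grad_norm(rnn, T), first_step_grad_norm(lstm, T))
```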


Exercise 9.11 (*)

In the LSTM equations, the forget gate and input gate are computed independently. The GRU couples them: $\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$.

(a) What constraint does this coupling impose? Under what conditions would an LSTM behave differently from a GRU because of this?

(b) Propose a modification to the GRU that decouples the update gate into separate "forget" and "input" components. How many additional parameters does this add?


Bidirectional RNNs

Exercise 9.12 (*)

(a) A bidirectional LSTM with hidden size $h = 256$ and 2 layers is applied to a sequence of length 80. What is the dimension of the representation at each position?

(b) How many parameters does this bidirectional LSTM have (input size $d = 128$)?

(c) Why is a bidirectional RNN inappropriate for autoregressive language modeling?


Exercise 9.13 (**)

Implement a bidirectional LSTM for sequence labeling (e.g., POS tagging). Given a sequence of tokens, predict a label for each token.

(a) Use nn.LSTM(bidirectional=True) and verify that the output at each position has dimension $2h$.

(b) Add a CRF (conditional random field) layer on top of the BiLSTM outputs. How does this differ from independently classifying each position?

(c) Train on a synthetic tagging task where the correct tag for a token depends on both its left and right context. Compare unidirectional LSTM, bidirectional LSTM, and bidirectional LSTM + CRF.


Sequence-to-Sequence and Teacher Forcing

Exercise 9.14 (**)

(a) Implement scheduled sampling for the seq2seq model in Section 9.7. Start with teacher_forcing_ratio = 1.0 and decay it linearly to 0.0 over training.

(b) Compare three training strategies: (i) full teacher forcing, (ii) no teacher forcing, (iii) scheduled sampling. Which converges fastest? Which has the best inference performance?

(c) Explain the exposure bias problem. How does scheduled sampling address it?
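For part (a), the linear decay and the per-step coin flip can be as simple as the following sketch (total_steps is whatever your training run uses):

```python
import random

def teacher_forcing_ratio(step: int, total_steps: int) -> float:
    """Linearly decay from 1.0 at step 0 to 0.0 at the final step."""
    return max(0.0, 1.0 - step / total_steps)

def use_teacher_forcing(step: int, total_steps: int) -> bool:
    return random.random() < teacher_forcing_ratio(step, total_steps)
```

Inside the decoder loop, feed the ground-truth token when use_teacher_forcing(...) is true and the model's own argmax prediction otherwise.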


Exercise 9.15 (**)

Implement beam search for the seq2seq model.

(a) Implement beam search with beam width $B \in \{1, 3, 5, 10\}$.

(b) Add length normalization: $\text{score}(\mathbf{y}) = \frac{1}{|\mathbf{y}|^\alpha} \sum_t \log P(y_t | \cdot)$ with $\alpha \in \{0, 0.5, 1.0\}$.

(c) On a synthetic sequence reversal task, measure BLEU score vs. beam width. At what beam width does performance saturate?

from typing import List, Tuple

def beam_search(
    decoder: nn.Module,
    initial_hidden: Tuple[torch.Tensor, torch.Tensor],
    sos_token: int,
    eos_token: int,
    beam_width: int = 5,
    max_length: int = 50,
    length_alpha: float = 0.7,
) -> List[int]:
    """Beam search decoding for a seq2seq model.

    Args:
        decoder: The decoder module.
        initial_hidden: (h_0, c_0) from the encoder.
        sos_token: Start-of-sequence token index.
        eos_token: End-of-sequence token index.
        beam_width: Number of candidates to maintain.
        max_length: Maximum output sequence length.
        length_alpha: Length normalization exponent.

    Returns:
        Best output sequence as a list of token indices.
    """
    # Your implementation here
    pass

Exercise 9.16 (*)

A seq2seq model with encoder hidden size 512 and decoder hidden size 512 is trained for machine translation. The encoder processes an input sentence of 40 tokens. The decoder generates an output of 35 tokens.

(a) Without attention, what is the dimension of the context vector? How many bits of information (assuming 32-bit floats) does this represent?

(b) Is this sufficient to encode 40 tokens, each from a vocabulary of 30,000? Justify your answer using information theory (Chapter 4).

(c) How does attention solve this bottleneck?
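For part (b), start by comparing the raw bit budget of the context vector with the information content of the source sentence (a back-of-the-envelope sketch; keep in mind that the raw float capacity far overstates what a trained encoder can actually exploit):

```python
import math

context_bits = 512 * 32                  # 512 floats at 32 bits each
sentence_bits = 40 * math.log2(30_000)   # 40 tokens, vocabulary of 30,000

print(context_bits, round(sentence_bits))
```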


Attention Mechanisms

Exercise 9.17 (**)

Implement the Bahdanau attention mechanism and visualize the attention weights.

(a) Train a seq2seq model with Bahdanau attention on a synthetic date-format conversion task (e.g., "March 15, 2024" $\to$ "2024-03-15").

(b) For 10 test examples, plot the attention weight matrix $[\alpha_{tj}]$ as a heatmap. What patterns do you observe?

(c) Do the attention weights align with human intuition about which input positions are relevant for each output position?
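A sketch of the additive (Bahdanau) scoring step for part (a); the module and weight names (W_s, W_h, v) are illustrative choices, not the chapter's:

```python
import torch
import torch.nn as nn

class BahdanauAttention(nn.Module):
    def __init__(self, dec_dim: int, enc_dim: int, attn_dim: int):
        super().__init__()
        self.W_s = nn.Linear(dec_dim, attn_dim, bias=False)
        self.W_h = nn.Linear(enc_dim, attn_dim, bias=False)
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, s_prev, enc_outputs):
        # s_prev: (batch, dec_dim); enc_outputs: (batch, T_s, enc_dim)
        scores = self.v(torch.tanh(
            self.W_s(s_prev).unsqueeze(1) + self.W_h(enc_outputs)))  # (B, T_s, 1)
        alpha = torch.softmax(scores.squeeze(-1), dim=1)             # (B, T_s)
        context = torch.bmm(alpha.unsqueeze(1), enc_outputs)         # (B, 1, enc_dim)
        return context.squeeze(1), alpha
```

The rows of the returned alpha, collected across decoder steps, are exactly the matrix $[\alpha_{tj}]$ to plot in part (b).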


Exercise 9.18 (***)

Compare Bahdanau and Luong attention on the same task.

(a) Implement both attention mechanisms in a seq2seq model for sequence reversal.

(b) Train both on sequences of length 20, 50, and 100. Compare validation accuracy and training time.

(c) Analyze the learned attention patterns. Are there qualitative differences between additive and multiplicative attention?

(d) Which attention mechanism is more computationally efficient? Analyze the FLOPs for both as a function of encoder length $T_s$, decoder hidden size $d_s$, encoder hidden size $d_h$, and attention dimension $a$.


Exercise 9.19 (***)

Scaled dot-product attention divides the query-key dot products by $\sqrt{d}$. This exercise derives why.

(a) Assume $\mathbf{q}, \mathbf{k} \in \mathbb{R}^d$ with entries drawn independently from $\mathcal{N}(0, 1)$. Compute $\mathbb{E}[\mathbf{q}^T \mathbf{k}]$ and $\text{Var}(\mathbf{q}^T \mathbf{k})$.

(b) Show that without scaling, as $d$ increases, the softmax of the dot products concentrates on the maximum value (i.e., produces near-one-hot distributions). Why is this problematic for gradient flow?

(c) Show that dividing by $\sqrt{d}$ normalizes the variance of the dot product to 1, preventing softmax saturation.
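Parts (a) and (c) can be sanity-checked empirically: with i.i.d. standard normal entries the dot product has mean 0 and variance $d$, and dividing by $\sqrt{d}$ brings the variance back to 1. A quick simulation:

```python
import torch

torch.manual_seed(0)
d, n = 64, 100_000
q = torch.randn(n, d)
k = torch.randn(n, d)

dots = (q * k).sum(dim=1)
print(dots.mean().item(), dots.var().item())   # approx 0 and approx d
print((dots / d ** 0.5).var().item())          # approx 1
```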


Climate and StreamRec Applications

Exercise 9.20 (**)

Extend the Climate LSTM from Section 9.10 with the following modifications:

(a) Multivariate input: Add precipitation and wind speed as additional input features (modify generate_climate_series to produce 3 channels). Retrain and compare validation loss.

(b) Multi-step autoregressive forecasting: Instead of predicting all 12 future months at once, predict one month at a time and feed predictions back as input. Compare with the direct multi-output approach.

(c) Bidirectional encoding: Use a bidirectional LSTM to encode the input window, then use the concatenated final states to produce the forecast. Does this help?


Exercise 9.21 (**)

Extend the StreamRec Session LSTM from Section 9.11.

(a) Add temporal features: encode the time gap between consecutive interactions (in minutes) as an additional input alongside the item embedding. Does this improve Hit@10?

(b) Add item metadata: concatenate the item's category embedding (from a separate nn.Embedding) with the item ID embedding. Retrain and evaluate.

(c) Replace the final hidden state with an attention-based aggregation over all LSTM hidden states. Use Luong dot-product attention with a learnable query vector. Compare Hit@10 with the final-state baseline.
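For part (c), the attention-based aggregation replaces the final hidden state with a weighted sum over all hidden states; with a learnable query and dot-product scores this takes only a few lines (a sketch; the module name AttnPool is made up here):

```python
import torch
import torch.nn as nn

class AttnPool(nn.Module):
    """Luong-style dot-product pooling with a learnable query vector."""

    def __init__(self, hidden_size: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(hidden_size) / hidden_size ** 0.5)

    def forward(self, states: torch.Tensor) -> torch.Tensor:
        # states: (batch, T, hidden_size) -- all LSTM hidden states
        scores = states @ self.query                       # (batch, T)
        alpha = torch.softmax(scores, dim=1)               # attention weights
        return (alpha.unsqueeze(-1) * states).sum(dim=1)   # (batch, hidden_size)
```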


Exercise 9.22 (**)

Ablation study. Using the StreamRec Session LSTM:

(a) Vary the hidden size: $h \in \{32, 64, 128, 256, 512\}$. Plot Hit@10 and training time vs. hidden size. Where is the diminishing returns threshold?

(b) Vary the number of layers: $L \in \{1, 2, 3, 4\}$. Does adding layers beyond 2 help?

(c) Vary the maximum session length: $T \in \{5, 10, 20, 40\}$. How does truncation length affect performance? What does this tell you about the relevant context window for next-item prediction?


Advanced Topics

Exercise 9.23 (***)

Orthogonal initialization for $\mathbf{W}_{hh}$ (Saxe et al., 2014) is claimed to mitigate the vanishing/exploding gradient problem in vanilla RNNs.

(a) Initialize $\mathbf{W}_{hh}$ as a random orthogonal matrix using nn.init.orthogonal_. What are the eigenvalues of an orthogonal matrix? Why might this help?

(b) Train a vanilla RNN with orthogonal $\mathbf{W}_{hh}$ on the copy task (Exercise 9.3) with sequence length 50. Compare with Xavier initialization.

(c) Does orthogonal initialization fully solve the vanishing gradient problem? Why or why not? (Hint: consider the effect of the $\tanh$ nonlinearity.)
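For part (a), a quick empirical check that an orthogonal matrix has all eigenvalue magnitudes equal to 1, so repeated multiplication by it neither shrinks nor amplifies the linear part of the recurrence:

```python
import torch
import torch.nn as nn

W = torch.empty(64, 64)
nn.init.orthogonal_(W)

eigvals = torch.linalg.eigvals(W)   # complex eigenvalues on the unit circle
print(eigvals.abs().min().item(), eigvals.abs().max().item())
```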


Exercise 9.24 (****)

Explore the relationship between RNNs and dynamical systems.

(a) A vanilla RNN with no input ($\mathbf{x}_t = \mathbf{0}$ for all $t$) is an autonomous dynamical system: $\mathbf{h}_t = \tanh(\mathbf{W}_{hh} \mathbf{h}_{t-1})$. For $h = 2$ and various $\mathbf{W}_{hh}$ matrices, simulate this system for 1,000 steps and plot the trajectories $(\mathbf{h}_t[0], \mathbf{h}_t[1])$. Identify fixed points, limit cycles, and chaotic behavior depending on the spectral radius of $\mathbf{W}_{hh}$.

(b) How do the fixed points of this dynamical system relate to the information the RNN can "remember" from the distant past?

(c) Read Sussillo and Barak, "Opening the Black Box: Low-Dimensional Dynamics in High-Dimensional Recurrent Neural Networks" (Neural Computation, 2013). Summarize the main insight in 2-3 paragraphs.


Exercise 9.25 (****)

State Space Models (SSMs) such as S4 (Gu et al., 2022) and Mamba (Gu and Dao, 2023) are a modern alternative to both RNNs and transformers for sequence modeling.

(a) An SSM is defined by: $\mathbf{h}_t = \mathbf{A}\mathbf{h}_{t-1} + \mathbf{B}\mathbf{x}_t$, $\mathbf{y}_t = \mathbf{C}\mathbf{h}_t$. Compare this to the vanilla RNN equation. What does the SSM gain by making the state transition linear?

(b) The key insight of S4 is that the linear recurrence can be computed in parallel using a scan operation, unlike the sequential RNN. Explain why a nonlinear recurrence (like the vanilla RNN) cannot be parallelized in the same way.

(c) Mamba adds input-dependent (selective) state transitions: $\mathbf{A}_t, \mathbf{B}_t, \mathbf{C}_t$ depend on $\mathbf{x}_t$. How does this relate to the gating mechanism in LSTMs?

(d) Implement a simple 1D SSM (state dimension $n = 16$, input dimension $d = 1$) and apply it to the climate temperature forecasting task from Section 9.10. Compare with the LSTM in terms of validation loss and training time.
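For part (d), the core recurrence is just a linear state update. A minimal, sequential (unoptimized) sketch with randomly initialized matrices; a real S4 or Mamba layer parameterizes $\mathbf{A}$ far more carefully and computes the recurrence with a parallel scan:

```python
import torch
import torch.nn as nn

class SimpleSSM(nn.Module):
    """h_t = A h_{t-1} + B x_t,  y_t = C h_t  (naive sequential scan)."""

    def __init__(self, state_dim: int = 16, input_dim: int = 1):
        super().__init__()
        # Initialized near the identity for stability; illustrative only.
        self.A = nn.Parameter(0.9 * torch.eye(state_dim)
                              + 0.01 * torch.randn(state_dim, state_dim))
        self.B = nn.Parameter(torch.randn(state_dim, input_dim) * 0.1)
        self.C = nn.Parameter(torch.randn(input_dim, state_dim) * 0.1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, input_dim)
        batch, T, _ = x.shape
        h = torch.zeros(batch, self.A.shape[0], device=x.device)
        ys = []
        for t in range(T):
            h = h @ self.A.T + x[:, t] @ self.B.T
            ys.append(h @ self.C.T)
        return torch.stack(ys, dim=1)   # (batch, T, input_dim)
```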


Exercise 9.26 (***)

Implement a full seq2seq model with Bahdanau attention for a synthetic task of your choice (date conversion, sequence reversal, simple arithmetic, or character-level transliteration).

Your implementation should include:

  • Encoder (bidirectional LSTM)
  • Bahdanau attention layer
  • Decoder (LSTM with attention)
  • Teacher forcing with scheduled sampling
  • Beam search decoding
  • Attention weight visualization

Train for at least 50 epochs and report: training loss curve, validation accuracy, and attention heatmaps for 5 representative test examples.


Exercise 9.27 (**)

Regularization for RNNs. Standard dropout (randomly zeroing activations) cannot be applied to the recurrent connections of an RNN, because the noise changes at every time step, destroying the hidden state's ability to carry information forward.

(a) Explain why applying standard dropout to the hidden-to-hidden connection $\mathbf{h}_{t-1} \to \mathbf{h}_t$ is problematic.

(b) Variational dropout (Gal and Ghahramani, 2016) applies the same dropout mask at every time step within a sequence. Implement this for an LSTM and compare with standard dropout on a sequence classification task.

(c) PyTorch's nn.LSTM has a dropout parameter. Does it apply dropout to recurrent connections or only to the output between layers? Check the documentation and verify experimentally.
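The key difference for part (b) is when the mask is sampled: once per sequence, not once per time step. A sketch of the mask logic, applied here to the hidden state feeding the next step of a manually unrolled nn.LSTMCell (the helper name is made up):

```python
import torch
import torch.nn as nn

def variational_unroll(cell: nn.LSTMCell, x: torch.Tensor, p: float = 0.3):
    """Unroll an LSTMCell, dropping h_t with ONE mask reused at every step."""
    batch, T, _ = x.shape
    h = torch.zeros(batch, cell.hidden_size)
    c = torch.zeros(batch, cell.hidden_size)
    # Sampled once per sequence -- this is the "variational" part.
    mask = torch.bernoulli(torch.full((batch, cell.hidden_size), 1 - p)) / (1 - p)
    outputs = []
    for t in range(T):
        h, c = cell(x[:, t], (h * mask, c))
        outputs.append(h)
    return torch.stack(outputs, dim=1)
```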


Exercise 9.28 (**)

Peephole connections (Gers and Schmidhuber, 2000) modify the LSTM by allowing the gates to look at the cell state directly:

$$\mathbf{f}_t = \sigma(\mathbf{W}_f [\mathbf{h}_{t-1}; \mathbf{x}_t] + \mathbf{W}_{cf} \odot \mathbf{c}_{t-1} + \mathbf{b}_f)$$

(a) How many additional parameters do peephole connections add to an LSTM with hidden size $h$?

(b) Implement peephole connections by modifying the LSTMCell class from the chapter.

(c) Greff et al. (2017) found that peephole connections rarely improve performance. Can you reproduce this finding on a task of your choice?


Exercise 9.29 (**)

Computational complexity analysis. For a single-layer LSTM with input size $d$, hidden size $h$, and sequence length $T$:

(a) Derive the FLOPs for the forward pass as a function of $d$, $h$, and $T$.

(b) Derive the memory usage during training (including stored activations for backpropagation).

(c) Compare with a transformer's self-attention computation: FLOPs $= O(T^2 \cdot d)$, memory $= O(T^2)$ for the attention matrix. At what sequence length $T$ does the transformer's cost exceed the LSTM's?

(d) Plot the crossover point for $h = d = 256$. How does FlashAttention (which reduces memory to $O(T)$) change this analysis?


Exercise 9.30 (****)

Open question. Transformers have largely replaced RNNs for sequence modeling. But the theoretical understanding of why transformers generalize better is incomplete.

(a) Read Bhatt et al., "On the Approximation Power of Transformers" (ICML, 2022) or a similar recent paper on transformer expressivity.

(b) Summarize the key results: what function classes can transformers approximate that RNNs cannot (and vice versa)?

(c) Design an experiment to test one of these theoretical predictions. Construct a synthetic task that should separate transformer and LSTM performance based on the theory. Run the experiment and report whether the empirical results match the theory.