Chapter 15: Exercises
Conceptual Exercises
Exercise 15.1: Sequence Data Classification
For each of the following data types, identify the most appropriate RNN task configuration (one-to-one, one-to-many, many-to-one, many-to-many aligned, or many-to-many unaligned). Justify your choice.
a) Predicting whether a movie review is positive or negative
b) Generating a musical melody from a genre label
c) Translating an English sentence to German
d) Predicting the part of speech for each word in a sentence
e) Predicting the next word in a sentence given the previous words
Exercise 15.2: Parameter Counting
A vanilla RNN has input_size=100, hidden_size=256, and output_size=50. Calculate the total number of trainable parameters, including all biases.
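Hint: you can check your hand count against PyTorch. The sketch below assumes the 50-dimensional output is produced by a Linear layer on top of the RNN; note that nn.RNN stores two bias vectors per layer (b_ih and b_hh), so it reports hidden_size more parameters than a single-bias formulation.

```python
# Quick parameter-count check (the Linear output layer is an assumption).
import torch.nn as nn

rnn = nn.RNN(input_size=100, hidden_size=256)
out = nn.Linear(256, 50)
total = sum(p.numel() for p in rnn.parameters()) + sum(p.numel() for p in out.parameters())
print(total)  # compare against your hand calculation; nn.RNN adds one extra bias vector
```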
Exercise 15.3: Vanishing Gradients
Consider a vanilla RNN with hidden_size=2 and the weight matrix $\mathbf{W}_{hh} = \begin{pmatrix} 0.5 & 0 \\ 0 & 0.3 \end{pmatrix}$.
a) Compute $\mathbf{W}_{hh}^{10}$ (i.e., the matrix raised to the 10th power).
b) What happens to the gradient signal after 50 time steps?
c) What would change if the diagonal values were 1.5 and 1.3 instead?
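Hint: parts (a) and (b) can be checked with a quick NumPy computation (a sketch, not required for the exercise):

```python
# Check the matrix powers numerically.
import numpy as np

W_hh = np.array([[0.5, 0.0],
                 [0.0, 0.3]])
print(np.linalg.matrix_power(W_hh, 10))   # part (a): diagonal entries 0.5**10 and 0.3**10
print(np.linalg.matrix_power(W_hh, 50))   # part (b): the repeated product is effectively zero
```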
Exercise 15.4: LSTM Gate Analysis
An LSTM cell has the following gate values at time step $t$:
- Forget gate: $\mathbf{f}_t = [0.9, 0.1, 0.5]$
- Input gate: $\mathbf{i}_t = [0.2, 0.8, 0.5]$
- Candidate cell: $\tilde{\mathbf{c}}_t = [1.0, -0.5, 0.3]$
- Previous cell state: $\mathbf{c}_{t-1} = [2.0, 1.0, -1.0]$
a) Compute the new cell state $\mathbf{c}_t$.
b) Interpret what the forget gate is doing for each dimension.
c) Which dimension is receiving the most new information?
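Hint: the cell-state update is $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$, computed elementwise. A quick NumPy check of part (a):

```python
# Elementwise cell-state update: c_t = f_t * c_prev + i_t * c_tilde.
import numpy as np

f = np.array([0.9, 0.1, 0.5])
i = np.array([0.2, 0.8, 0.5])
c_tilde = np.array([1.0, -0.5, 0.3])
c_prev = np.array([2.0, 1.0, -1.0])
print(f * c_prev + i * c_tilde)  # compare with your hand calculation
```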
Exercise 15.5: GRU vs. LSTM
Compare the parameter count of an LSTM and a GRU with input_size=128 and hidden_size=256. Show your calculations.
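Hint: you can verify your calculations by counting parameters directly in PyTorch (both nn.LSTM and nn.GRU keep separate b_ih and b_hh bias vectors):

```python
# Count parameters of a single-layer nn.LSTM and nn.GRU directly.
import torch.nn as nn

for Cell in (nn.LSTM, nn.GRU):
    layer = Cell(input_size=128, hidden_size=256)
    print(Cell.__name__, sum(p.numel() for p in layer.parameters()))
```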
Exercise 15.6: Bidirectional RNN Output
A bidirectional LSTM with hidden_size=128 processes a sequence of length 20. What is the dimensionality of the output at each time step? What is the shape of the final hidden state for a 2-layer bidirectional LSTM?
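Hint: the shapes can be inspected directly. In the sketch below, the input size (32) and batch size (4) are arbitrary assumptions; only the sequence length, hidden size, layer count, and directionality come from the exercise.

```python
# Inspect the output and hidden-state shapes of a 2-layer bidirectional LSTM.
import torch
import torch.nn as nn

lstm = nn.LSTM(input_size=32, hidden_size=128, num_layers=2, bidirectional=True)
x = torch.randn(20, 4, 32)            # (seq_len, batch, input_size)
output, (h_n, c_n) = lstm(x)
print(output.shape)                   # per-step output: forward and backward halves concatenated
print(h_n.shape)                      # final hidden state: num_layers * num_directions in dim 0
```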
Exercise 15.7: Teacher Forcing Analysis
Explain why teacher forcing can lead to poor performance at inference time. Propose and describe three different strategies to mitigate this issue.
Exercise 15.8: Beam Search Comparison
Given the following token probabilities at each decoding step, for a four-token vocabulary {A, B, C, D}:

| Step 1 | P | Step 2 (given A) | P | Step 2 (given B) | P |
|---|---|---|---|---|---|
| A | 0.4 | A | 0.1 | A | 0.6 |
| B | 0.35 | B | 0.3 | B | 0.2 |
| C | 0.2 | C | 0.5 | C | 0.15 |
| D | 0.05 | D | 0.1 | D | 0.05 |
a) What sequence does greedy decoding produce?
b) Enumerate all candidates after step 2 with beam width $k=2$.
c) Which beam width $k$ is needed to find the globally optimal two-token sequence?
Exercise 15.9: Information Bottleneck
Explain the information bottleneck problem in the basic seq2seq model. Why does performance degrade for longer sequences? How does attention resolve this?
Exercise 15.10: Gradient Clipping
A model has a gradient vector $\mathbf{g} = [3.0, 4.0]$ and the clipping threshold is $\theta = 2.5$.
a) Compute $\|\mathbf{g}\|$.
b) Compute the clipped gradient.
c) Verify that the clipped gradient has norm $\theta$.
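Hint: a quick NumPy check of all three parts (a sketch, not required for the exercise):

```python
# Numerical check of parts (a)-(c).
import numpy as np

g = np.array([3.0, 4.0])
theta = 2.5
norm = np.linalg.norm(g)                              # part (a)
g_clipped = g * theta / norm if norm > theta else g   # part (b): rescale only if the norm exceeds theta
print(norm, g_clipped, np.linalg.norm(g_clipped))     # part (c): the clipped norm should equal theta
```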
Coding Exercises
Exercise 15.11: Custom RNN Cell
Implement a vanilla RNN cell from scratch (without using nn.RNNCell) that supports:
- Configurable activation function (tanh or relu)
- Optional layer normalization
- Proper weight initialization (Xavier uniform)
Test it on a synthetic sequence classification task.
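A minimal starting skeleton is sketched below; the class name, argument names, defaults, and the placement of layer normalization are suggestions, not a full solution, and the synthetic classification task is left to you.

```python
# Sketch of a from-scratch RNN cell (no nn.RNNCell) with configurable
# activation, optional layer normalization, and Xavier-uniform initialization.
import torch
import torch.nn as nn

class MyRNNCell(nn.Module):
    def __init__(self, input_size, hidden_size, activation="tanh", use_layer_norm=False):
        super().__init__()
        self.W_xh = nn.Linear(input_size, hidden_size, bias=True)
        self.W_hh = nn.Linear(hidden_size, hidden_size, bias=False)
        self.act = torch.tanh if activation == "tanh" else torch.relu
        self.norm = nn.LayerNorm(hidden_size) if use_layer_norm else nn.Identity()
        nn.init.xavier_uniform_(self.W_xh.weight)
        nn.init.xavier_uniform_(self.W_hh.weight)

    def forward(self, x, h):
        # x: (batch, input_size), h: (batch, hidden_size)
        return self.act(self.norm(self.W_xh(x) + self.W_hh(h)))
```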
Exercise 15.12: LSTM Character-Level Language Model
Build a character-level language model using PyTorch's nn.LSTM. Train it on a text corpus (you may use a portion of Shakespeare or any other text). The model should:
- Accept a character and predict the next character
- Use temperature-controlled sampling for generation
- Generate coherent text after training
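A minimal sketch of the temperature-controlled sampling step is shown below; the surrounding model, vocabulary, and training loop are assumed and not shown.

```python
# Sample the next character index from the model's logits at a given temperature.
import torch

def sample_next(logits, temperature=1.0):
    # logits: (vocab_size,) raw scores for the next character
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```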
Exercise 15.13: GRU for Sequence Classification
Implement a GRU-based model for classifying sequences of varying length. Use packed sequences to handle padding correctly. Compare the training speed and accuracy against an equivalent LSTM model.
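The packing step looks roughly like the sketch below (the sizes are toy assumptions); remember to pass the true sequence lengths, not the padded length.

```python
# Pack a padded batch, run it through a GRU, and unpack the outputs.
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

gru = nn.GRU(input_size=8, hidden_size=16, batch_first=True)
x = torch.randn(3, 5, 8)                      # padded batch: (batch, max_len, features)
lengths = torch.tensor([5, 3, 2])             # true lengths before padding

packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
packed_out, h_n = gru(packed)                 # the GRU skips the padded positions
out, _ = pad_packed_sequence(packed_out, batch_first=True)   # back to (batch, max_len, hidden)
```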
Exercise 15.14: Bidirectional Sentiment Analysis
Build a bidirectional LSTM for sentiment analysis on movie reviews. Compare the performance of:
- Unidirectional LSTM
- Bidirectional LSTM
- Bidirectional GRU
Use the same hidden size for all models and report accuracy and training time.
Exercise 15.15: Seq2Seq for Number Reversal
Build a seq2seq model that learns to reverse a sequence of digits. For example, input "1 2 3 4 5" should produce "5 4 3 2 1". Train with and without teacher forcing and compare convergence speed.
Exercise 15.16: Beam Search Implementation
Extend the beam search implementation from the chapter to support:
- Minimum output length constraint
- N-gram blocking (prevent repeated n-grams)
- Diverse beam search (penalize similar beams)
Test on the number reversal task from Exercise 15.15.
Exercise 15.17: Scheduled Sampling
Implement scheduled sampling for the seq2seq model where the teacher forcing ratio decays according to:
- Linear decay
- Exponential decay
- Inverse sigmoid decay: $\frac{k}{k + \exp(i/k)}$, where $i$ is the epoch
Compare the three schedules on the number reversal task.
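A sketch of the three schedules as functions of the epoch $i$ is given below; the constant $k$, the linear slope, and the floor are assumptions you will need to tune.

```python
# Three teacher-forcing decay schedules; each returns a ratio in [0, 1].
import math

def linear_decay(i, start=1.0, slope=0.05, floor=0.0):
    return max(floor, start - slope * i)

def exponential_decay(i, k=0.9):
    return k ** i

def inverse_sigmoid_decay(i, k=10.0):
    return k / (k + math.exp(i / k))
```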
Exercise 15.18: Gradient Flow Visualization
Write code to track and visualize the gradient magnitude at each time step during BPTT for:
- Vanilla RNN
- LSTM
- GRU
Use a sequence of length 100 and plot the gradient norms. This will visually demonstrate the vanishing gradient problem and how gated architectures solve it.
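One way to record per-time-step gradient norms is to retain gradients on each intermediate hidden state; a sketch for the vanilla RNN case follows (toy sizes are assumptions; repeat with nn.LSTMCell and nn.GRUCell and plot the resulting norms).

```python
# Record the gradient norm at every time step of a length-100 sequence.
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=8, hidden_size=16)
x = torch.randn(1, 100, 8)                    # (batch, seq_len, features)
h = torch.zeros(1, 16)
states = []
for t in range(x.size(1)):
    h = cell(x[:, t], h)
    h.retain_grad()                           # keep .grad on this intermediate state
    states.append(h)
states[-1].sum().backward()                   # gradient injected at the final step only
norms = [s.grad.norm().item() for s in states]
print(norms[:5], norms[-5:])                  # early norms shrink toward zero for the vanilla RNN
```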
Exercise 15.19: Truncated BPTT
Implement truncated backpropagation through time (TBPTT) for training a language model. Compare the training behavior with:
- Full BPTT (truncation length = sequence length)
- TBPTT with truncation length 20
- TBPTT with truncation length 5
Measure both training speed and final perplexity.
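The core of TBPTT is detaching the carried hidden state at each truncation boundary; a minimal sketch with toy data is shown below (the model, loss, and shapes are placeholders, not the language-model setup itself).

```python
# TBPTT training loop: carry the hidden state across chunks, but detach it so
# gradients stop at the truncation boundary and the graph does not grow unboundedly.
import torch
import torch.nn as nn

model = nn.LSTM(input_size=10, hidden_size=32, batch_first=True)
head = nn.Linear(32, 10)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(list(model.parameters()) + list(head.parameters()))

data = torch.randn(4, 100, 10)                       # (batch, seq_len, features)
trunc = 20                                           # truncation length
hidden = None
for t in range(0, data.size(1), trunc):
    chunk = data[:, t:t + trunc]
    if hidden is not None:
        hidden = tuple(h.detach() for h in hidden)   # stop gradients at the boundary
    out, hidden = model(chunk, hidden)
    loss = criterion(head(out), chunk)               # toy objective: reconstruct the input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```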
Exercise 15.20: Multi-Layer RNN with Residual Connections
Implement a deep LSTM (4 layers) with residual connections between layers. Compare training stability and convergence against a deep LSTM without residual connections. Use gradient norm tracking to illustrate the difference.
Applied Exercises
Exercise 15.21: Stock Price Direction Prediction
Using historical stock price data (open, high, low, close, volume), build an LSTM model that predicts whether the next day's closing price will be higher or lower than today's. Use proper train/validation/test splits (time-based, not random). Report accuracy, precision, and recall.
Exercise 15.22: Name Generation
Train a character-level RNN (your choice of LSTM or GRU) on a dataset of names from a specific language/culture. The model should be able to generate new, plausible-sounding names. Implement temperature-controlled sampling and analyze how temperature affects output quality.
Exercise 15.23: Spell Checker with Seq2Seq
Build a seq2seq model that corrects misspelled words. Create a synthetic dataset by introducing common typos (character swaps, insertions, deletions) to correctly spelled words. Evaluate using character-level accuracy and word-level accuracy.
Exercise 15.24: Music Sequence Modeling
Represent musical notes as sequences of MIDI note numbers and durations. Train an LSTM to generate simple melodies by predicting the next note given previous notes. Evaluate by generating samples and analyzing their musical properties (key consistency, rhythmic regularity).
Exercise 15.25: Anomaly Detection in Time Series
Build an LSTM autoencoder (encoder-decoder where input = output) for time series anomaly detection. Train on normal data, then detect anomalies as sequences with high reconstruction error. Test on a dataset with known anomalies and report precision/recall.
Research-Oriented Exercises
Exercise 15.26: Peephole LSTM
Implement the peephole LSTM variant where the gates also look at the cell state:
$$\mathbf{f}_t = \sigma(\mathbf{W}_f [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{W}_{cf} \odot \mathbf{c}_{t-1} + \mathbf{b}_f)$$
Compare performance against the standard LSTM on a sequence memorization task.
Exercise 15.27: Minimal Gate Unit
Implement the Minimal Gate Unit (MGU) which uses a single gate:
$$\mathbf{f}_t = \sigma(\mathbf{W}_f [\mathbf{h}_{t-1}, \mathbf{x}_t])$$
$$\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}_h [\mathbf{f}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t])$$
$$\mathbf{h}_t = (1 - \mathbf{f}_t) \odot \mathbf{h}_{t-1} + \mathbf{f}_t \odot \tilde{\mathbf{h}}_t$$
Compare parameter count and performance against GRU and LSTM.
Exercise 15.28: Attention Visualization
Implement additive (Bahdanau) attention for the seq2seq model. Train on a simple translation task (e.g., reversing sequences or simple word substitution). Visualize the attention weights as a heatmap and interpret the patterns.
Exercise 15.29: Ablation Study
Conduct an ablation study on the LSTM by systematically removing each gate:
- LSTM without forget gate ($\mathbf{f}_t = 1$ always)
- LSTM without input gate ($\mathbf{i}_t = 1$ always)
- LSTM without output gate ($\mathbf{o}_t = 1$ always)
Compare on a long-range dependency task (e.g., copy task with long delays). Report which gate is most critical.
Exercise 15.30: RNN vs. Transformer Comparison
For a sentiment analysis task, compare:
- Single-layer LSTM
- 2-layer bidirectional LSTM
- A small Transformer encoder (2 layers, 4 heads)
Control for parameter count. Compare accuracy, training time, and inference speed. Discuss the trade-offs.
Exercise 15.31: Variational Dropout for RNNs
Implement variational dropout (same mask across time steps) for an LSTM. Compare against standard dropout on a text classification task. Measure both accuracy and calibration.
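The key idea is to sample one dropout mask per sequence and reuse it at every time step; the sketch below shows one way to build such a mask (the shapes and the point of application are assumptions).

```python
# Build a single inverted-dropout mask to reuse across all time steps of a sequence.
import torch

def variational_dropout_mask(batch_size, hidden_size, p, device="cpu"):
    keep = 1.0 - p
    return torch.bernoulli(torch.full((batch_size, hidden_size), keep, device=device)) / keep

# Inside the unrolled loop, apply the same mask at every step, e.g. h = h * mask.
```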
Exercise 15.32: Hidden State Analysis
Train an LSTM on a bracket-matching task (determining whether a sequence of parentheses is balanced). Analyze the hidden states using PCA visualization. Can you identify individual hidden units that track the nesting depth?
Exercise 15.33: Sequence-to-Sequence with Copy Mechanism
Implement a pointer network (copy mechanism) for the seq2seq model that can copy tokens directly from the input. Test on a task that requires copying (e.g., text summarization where rare words must be preserved).
Exercise 15.34: Multi-Task RNN
Build an RNN that simultaneously performs:
- Sentiment classification (many-to-one)
- Part-of-speech tagging (many-to-many)
Share the bottom LSTM layers and have separate task-specific output heads. Compare against single-task models.
Exercise 15.35: Continuous-Time RNN
Implement an ODE-RNN where the hidden state evolves according to a neural ODE between observations:
$$\frac{d\mathbf{h}}{dt} = f_\theta(\mathbf{h}(t))$$
Apply to irregularly-sampled time series data and compare against a standard LSTM that ignores time gaps.