Chapter 15: Exercises

Conceptual Exercises

Exercise 15.1: Sequence Data Classification

For each of the following data types, identify the most appropriate RNN task configuration (one-to-one, one-to-many, many-to-one, many-to-many aligned, or many-to-many unaligned). Justify your choice.

a) Predicting whether a movie review is positive or negative
b) Generating a musical melody from a genre label
c) Translating an English sentence to German
d) Predicting the part of speech for each word in a sentence
e) Predicting the next word in a sentence given the previous words

Exercise 15.2: Parameter Counting

A vanilla RNN has input_size=100, hidden_size=256, and output_size=50. Calculate the total number of trainable parameters, including all biases.
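After working the count by hand, you can check it against PyTorch. The sketch below assumes the output is produced by a linear layer on top of the hidden state; note that nn.RNN stores two bias vectors per layer (b_ih and b_hh), so subtract hidden_size if your formulation uses a single hidden bias.

```python
import torch.nn as nn

# Quick check of the hand calculation (sketch; assumes a linear output head).
rnn = nn.RNN(input_size=100, hidden_size=256, batch_first=True)
head = nn.Linear(256, 50)

n_params = sum(p.numel() for p in rnn.parameters()) + \
           sum(p.numel() for p in head.parameters())
# nn.RNN keeps two bias vectors (b_ih, b_hh); drop one hidden_size if your
# equations use a single combined bias.
print(n_params)
```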

Exercise 15.3: Vanishing Gradients

Consider a vanilla RNN with hidden_size=2 and the weight matrix $\mathbf{W}_{hh} = \begin{pmatrix} 0.5 & 0 \\ 0 & 0.3 \end{pmatrix}$.

a) Compute $\mathbf{W}_{hh}^{10}$ (i.e., the matrix raised to the 10th power).
b) What happens to the gradient signal after 50 time steps?
c) What would change if the diagonal values were 1.5 and 1.3 instead?
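Because the matrix is diagonal, you can answer part (a) by raising each diagonal entry to the 10th power; a two-line NumPy check:

```python
import numpy as np

# For a diagonal matrix, W^k simply raises each diagonal entry to the k-th power.
W = np.diag([0.5, 0.3])
print(np.linalg.matrix_power(W, 10))  # part (a)
print(np.linalg.matrix_power(W, 50))  # entries shrink toward zero after 50 steps
```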

Exercise 15.4: LSTM Gate Analysis

An LSTM cell has the following gate values at time step $t$:

- Forget gate: $\mathbf{f}_t = [0.9, 0.1, 0.5]$
- Input gate: $\mathbf{i}_t = [0.2, 0.8, 0.5]$
- Candidate cell: $\tilde{\mathbf{c}}_t = [1.0, -0.5, 0.3]$
- Previous cell state: $\mathbf{c}_{t-1} = [2.0, 1.0, -1.0]$

a) Compute the new cell state $\mathbf{c}_t$.
b) Interpret what the forget gate is doing for each dimension.
c) Which dimension is receiving the most new information?
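You can verify your answer to part (a) numerically from the element-wise cell update $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$; a minimal check:

```python
import torch

# Element-wise cell-state update: c_t = f_t * c_{t-1} + i_t * c_tilde_t
f = torch.tensor([0.9, 0.1, 0.5])
i = torch.tensor([0.2, 0.8, 0.5])
c_tilde = torch.tensor([1.0, -0.5, 0.3])
c_prev = torch.tensor([2.0, 1.0, -1.0])
print(f * c_prev + i * c_tilde)
```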

Exercise 15.5: GRU vs. LSTM

Compare the parameter count of an LSTM and a GRU with input_size=128 and hidden_size=256. Show your calculations.
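After deriving the counts by hand, you can sanity-check them against PyTorch. As in Exercise 15.2, this is only a check: PyTorch's recurrent layers keep separate b_ih and b_hh bias vectors, which may differ slightly from a single-bias formulation.

```python
import torch.nn as nn

def count(module):
    return sum(p.numel() for p in module.parameters())

lstm = nn.LSTM(input_size=128, hidden_size=256)
gru = nn.GRU(input_size=128, hidden_size=256)
print(count(lstm), count(gru))  # compare against your hand calculation
```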

Exercise 15.6: Bidirectional RNN Output

A bidirectional LSTM with hidden_size=128 processes a sequence of length 20. What is the dimensionality of the output at each time step? What is the shape of the final hidden state for a 2-layer bidirectional LSTM?
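A shape check is a quick way to confirm your answer; the sketch below assumes batch_first=True, a batch of 4, and an arbitrary input size of 32.

```python
import torch
import torch.nn as nn

# Hypothetical shape check: batch of 4, sequence length 20, input size 32.
lstm = nn.LSTM(input_size=32, hidden_size=128, num_layers=2,
               bidirectional=True, batch_first=True)
x = torch.randn(4, 20, 32)
output, (h_n, c_n) = lstm(x)
print(output.shape)  # per-time-step output: (batch, seq_len, 2 * hidden_size)
print(h_n.shape)     # final hidden state: (num_layers * 2, batch, hidden_size)
```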

Exercise 15.7: Teacher Forcing Analysis

Explain why teacher forcing can lead to poor performance at inference time. Propose and describe three different strategies to mitigate this issue.

Exercise 15.8: Beam Search Comparison

Given the following partial table of next-token probabilities over the vocabulary {A, B, C, <eos>} (step-2 distributions are shown only for the prefixes A and B):

Token    Step 1    Step 2 (given A)    Step 2 (given B)
A        0.4       0.1                 0.6
B        0.35      0.3                 0.2
C        0.2       0.5                 0.05
<eos>    0.05      0.1                 0.15

a) What sequence does greedy decoding produce?
b) Enumerate all candidates after step 2 with beam width $k=2$.
c) Which beam width $k$ is needed to find the globally optimal two-token sequence?

Exercise 15.9: Information Bottleneck

Explain the information bottleneck problem in the basic seq2seq model. Why does performance degrade for longer sequences? How does attention resolve this?

Exercise 15.10: Gradient Clipping

A model has a gradient vector $\mathbf{g} = [3.0, 4.0]$ and the clipping threshold is $\theta = 2.5$.

a) Compute $\|\mathbf{g}\|$. b) Compute the clipped gradient. c) Verify that the clipped gradient has norm $\theta$.
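Recall that norm clipping rescales $\mathbf{g}$ to $\theta \, \mathbf{g} / \|\mathbf{g}\|$ whenever $\|\mathbf{g}\| > \theta$. A short check of your arithmetic:

```python
import torch

g = torch.tensor([3.0, 4.0])
theta = 2.5
norm = g.norm()                                   # part (a)
clipped = g * theta / norm if norm > theta else g  # part (b)
print(norm, clipped, clipped.norm())               # part (c): norm equals theta
```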


Coding Exercises

Exercise 15.11: Custom RNN Cell

Implement a vanilla RNN cell from scratch (without using nn.RNNCell) that supports:

- Configurable activation function (tanh or relu)
- Optional layer normalization
- Proper weight initialization (Xavier uniform)

Test it on a synthetic sequence classification task.
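If you need a synthetic task, one option (purely a suggestion; make_synthetic_batch is a hypothetical helper, not chapter code) is to classify whether the sum of a random sequence is positive:

```python
import torch

def make_synthetic_batch(batch_size=32, seq_len=15, input_size=8):
    """Hypothetical toy task: label is 1 if the sum of all inputs is positive."""
    x = torch.randn(batch_size, seq_len, input_size)
    y = (x.sum(dim=(1, 2)) > 0).long()
    return x, y

x, y = make_synthetic_batch()
print(x.shape, y.shape)  # (32, 15, 8), (32,)
```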

Exercise 15.12: LSTM Character-Level Language Model

Build a character-level language model using PyTorch's nn.LSTM. Train it on a text corpus (you may use a portion of Shakespeare or any other text). The model should:

- Accept a character and predict the next character
- Use temperature-controlled sampling for generation (see the sketch below)
- Generate coherent text after training
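As a starting point for the sampling step, here is a minimal sketch of temperature-controlled sampling from a vector of logits (the function name and interface are assumptions, not chapter code):

```python
import torch

def sample_from_logits(logits, temperature=1.0):
    """Sample an index from logits; lower temperature -> closer to greedy."""
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# Example with a fake 5-character vocabulary.
next_char = sample_from_logits(torch.randn(5), temperature=0.8)
print(next_char)
```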

Exercise 15.13: GRU for Sequence Classification

Implement a GRU-based model for classifying sequences of varying length. Use packed sequences to handle padding correctly. Compare the training speed and accuracy against an equivalent LSTM model.
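The core of the padding handling is the pack_padded_sequence / pad_packed_sequence pattern; a minimal sketch (the dimensions are arbitrary examples):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pack_padded_sequence, pad_packed_sequence

gru = nn.GRU(input_size=16, hidden_size=32, batch_first=True)

x = torch.randn(3, 10, 16)          # batch of 3, padded to length 10
lengths = torch.tensor([10, 7, 4])  # true lengths before padding

packed = pack_padded_sequence(x, lengths, batch_first=True, enforce_sorted=False)
packed_out, h_n = gru(packed)
out, _ = pad_packed_sequence(packed_out, batch_first=True)
print(out.shape, h_n.shape)         # (3, 10, 32), (1, 3, 32)
```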

Exercise 15.14: Bidirectional Sentiment Analysis

Build a bidirectional LSTM for sentiment analysis on movie reviews. Compare the performance of:

- Unidirectional LSTM
- Bidirectional LSTM
- Bidirectional GRU

Use the same hidden size for all models and report accuracy and training time.

Exercise 15.15: Seq2Seq for Number Reversal

Build a seq2seq model that learns to reverse a sequence of digits. For example, input "1 2 3 4 5" should produce "5 4 3 2 1". Train with and without teacher forcing and compare convergence speed.
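A toy data generator is enough for this task; the sketch below (a hypothetical helper) produces random digit sequences and their reversals as integer tensors:

```python
import torch

def make_reversal_pair(min_len=3, max_len=10, vocab_size=10):
    """Hypothetical helper: returns a random digit sequence and its reversal."""
    length = torch.randint(min_len, max_len + 1, (1,)).item()
    src = torch.randint(0, vocab_size, (length,))
    tgt = torch.flip(src, dims=[0])
    return src, tgt

src, tgt = make_reversal_pair()
print(src.tolist(), tgt.tolist())
```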

Exercise 15.16: Beam Search Implementation

Extend the beam search implementation from the chapter to support:

- Minimum output length constraint
- N-gram blocking (prevent repeated n-grams)
- Diverse beam search (penalize similar beams)

Test on the number reversal task from Exercise 15.15.

Exercise 15.17: Scheduled Sampling

Implement scheduled sampling for the seq2seq model where the teacher forcing ratio decays according to one of the following schedules (sketched below):

- Linear decay
- Exponential decay
- Inverse sigmoid decay: $\frac{k}{k + \exp(i/k)}$, where $i$ is the epoch

Compare the three schedules on the number reversal task.
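The three schedules can be written as small functions of the epoch index; the constants below are arbitrary examples you will want to tune:

```python
import math

def linear_decay(epoch, start=1.0, slope=0.05, floor=0.0):
    return max(floor, start - slope * epoch)

def exponential_decay(epoch, gamma=0.9):
    return gamma ** epoch

def inverse_sigmoid_decay(epoch, k=10.0):
    return k / (k + math.exp(epoch / k))

for e in (0, 10, 20, 40):
    print(e, linear_decay(e), exponential_decay(e), inverse_sigmoid_decay(e))
```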

Exercise 15.18: Gradient Flow Visualization

Write code to track and visualize the gradient magnitude at each time step during BPTT for:

- Vanilla RNN
- LSTM
- GRU

Use a sequence of length 100 and plot the gradient norms. This will visually demonstrate the vanishing gradient problem and how gated architectures solve it.
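One simple way to expose the effect is to measure the gradient of a last-step loss with respect to each input time step, since that gradient passes through every hidden-to-hidden Jacobian between the two steps. A sketch, assuming a toy loss on the final output (swap nn.RNN for nn.LSTM or nn.GRU to compare):

```python
import torch
import torch.nn as nn

# Gradient of a last-step loss w.r.t. x[:, t] flows through all steps from t to T,
# so its norm shrinking for early t illustrates vanishing gradients.
rnn = nn.RNN(input_size=8, hidden_size=32, batch_first=True)
x = torch.randn(2, 100, 8, requires_grad=True)

output, _ = rnn(x)
loss = output[:, -1].sum()   # toy loss depending only on the last time step
loss.backward()

grad_norms = x.grad.norm(dim=-1).mean(dim=0)  # one value per time step
print(grad_norms[:5])        # early steps: small for a vanilla RNN
print(grad_norms[-5:])       # late steps: larger
```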

Exercise 15.19: Truncated BPTT

Implement truncated backpropagation through time (TBPTT) for training a language model. Compare the training behavior with:

- Full BPTT (truncation length = sequence length)
- TBPTT with truncation length 20
- TBPTT with truncation length 5

Measure both training speed and final perplexity.
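The essential mechanic of TBPTT is detaching the hidden state between chunks so gradients stop at the truncation boundary; a minimal sketch of the training-loop pattern (the model, data, and loss here are placeholders):

```python
import torch
import torch.nn as nn

# Sketch of the TBPTT chunking pattern; lstm/head/loss stand in for your model.
lstm, head = nn.LSTM(16, 64, batch_first=True), nn.Linear(64, 16)
optimizer = torch.optim.Adam(list(lstm.parameters()) + list(head.parameters()))
x = torch.randn(8, 200, 16)   # one long (toy) sequence batch
trunc_len = 20

hidden = None
for start in range(0, x.size(1), trunc_len):
    chunk = x[:, start:start + trunc_len]
    out, hidden = lstm(chunk, hidden)
    hidden = tuple(h.detach() for h in hidden)  # cut the graph at the boundary
    loss = head(out).pow(2).mean()              # placeholder loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```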

Exercise 15.20: Multi-Layer RNN with Residual Connections

Implement a deep LSTM (4 layers) with residual connections between layers. Compare training stability and convergence against a deep LSTM without residual connections. Use gradient norm tracking to illustrate the difference.


Applied Exercises

Exercise 15.21: Stock Price Direction Prediction

Using historical stock price data (open, high, low, close, volume), build an LSTM model that predicts whether the next day's closing price will be higher or lower than today's. Use proper train/validation/test splits (time-based, not random). Report accuracy, precision, and recall.

Exercise 15.22: Name Generation

Train a character-level RNN (your choice of LSTM or GRU) on a dataset of names from a specific language/culture. The model should be able to generate new, plausible-sounding names. Implement temperature-controlled sampling and analyze how temperature affects output quality.

Exercise 15.23: Spell Checker with Seq2Seq

Build a seq2seq model that corrects misspelled words. Create a synthetic dataset by introducing common typos (character swaps, insertions, deletions) to correctly spelled words. Evaluate using character-level accuracy and word-level accuracy.

Exercise 15.24: Music Sequence Modeling

Represent musical notes as sequences of MIDI note numbers and durations. Train an LSTM to generate simple melodies by predicting the next note given previous notes. Evaluate by generating samples and analyzing their musical properties (key consistency, rhythmic regularity).

Exercise 15.25: Anomaly Detection in Time Series

Build an LSTM autoencoder (encoder-decoder where input = output) for time series anomaly detection. Train on normal data, then detect anomalies as sequences with high reconstruction error. Test on a dataset with known anomalies and report precision/recall.


Research-Oriented Exercises

Exercise 15.26: Peephole LSTM

Implement the peephole LSTM variant where the gates also look at the cell state:

$$\mathbf{f}_t = \sigma(\mathbf{W}_f [\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{W}_{cf} \odot \mathbf{c}_{t-1} + \mathbf{b}_f)$$

Compare performance against the standard LSTM on a sequence memorization task.

Exercise 15.27: Minimal Gate Unit

Implement the Minimal Gate Unit (MGU) which uses a single gate:

$$\mathbf{f}_t = \sigma(\mathbf{W}_f [\mathbf{h}_{t-1}, \mathbf{x}_t])$$
$$\tilde{\mathbf{h}}_t = \tanh(\mathbf{W}_h [\mathbf{f}_t \odot \mathbf{h}_{t-1}, \mathbf{x}_t])$$
$$\mathbf{h}_t = (1 - \mathbf{f}_t) \odot \mathbf{h}_{t-1} + \mathbf{f}_t \odot \tilde{\mathbf{h}}_t$$

Compare parameter count and performance against GRU and LSTM.

Exercise 15.28: Attention Visualization

Implement additive (Bahdanau) attention for the seq2seq model. Train on a simple translation task (e.g., reversing sequences or simple word substitution). Visualize the attention weights as a heatmap and interpret the patterns.

Exercise 15.29: Ablation Study

Conduct an ablation study on the LSTM by systematically removing each gate:

- LSTM without forget gate ($\mathbf{f}_t = 1$ always)
- LSTM without input gate ($\mathbf{i}_t = 1$ always)
- LSTM without output gate ($\mathbf{o}_t = 1$ always)

Compare on a long-range dependency task (e.g., copy task with long delays). Report which gate is most critical.

Exercise 15.30: RNN vs. Transformer Comparison

For a sentiment analysis task, compare:

- Single-layer LSTM
- 2-layer bidirectional LSTM
- A small Transformer encoder (2 layers, 4 heads)

Control for parameter count. Compare accuracy, training time, and inference speed. Discuss the trade-offs.

Exercise 15.31: Variational Dropout for RNNs

Implement variational dropout (same mask across time steps) for an LSTM. Compare against standard dropout on a text classification task. Measure both accuracy and calibration.
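The key difference from standard dropout is that a single Bernoulli mask is sampled per sequence and reused at every time step. A sketch of that masking step, applied to the per-step hidden states of a manually unrolled loop (the helper below is an illustration, not chapter code):

```python
import torch

def variational_dropout_mask(batch_size, hidden_size, p=0.3):
    """Sample one dropout mask per sequence, reused across all time steps."""
    keep = 1.0 - p
    return torch.bernoulli(torch.full((batch_size, hidden_size), keep)) / keep

# Usage inside a manually unrolled RNN loop (h_t has shape (batch, hidden_size)):
mask = variational_dropout_mask(batch_size=4, hidden_size=32)
h_t = torch.randn(4, 32)
h_t = h_t * mask   # apply the same mask at every time step during training
```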

Exercise 15.32: Hidden State Analysis

Train an LSTM on a bracket-matching task (determining whether a sequence of parentheses is balanced). Analyze the hidden states using PCA visualization. Can you identify individual hidden units that track the nesting depth?

Exercise 15.33: Sequence-to-Sequence with Copy Mechanism

Implement a pointer network (copy mechanism) for the seq2seq model that can copy tokens directly from the input. Test on a task that requires copying (e.g., text summarization where rare words must be preserved).

Exercise 15.34: Multi-Task RNN

Build an RNN that simultaneously performs:

- Sentiment classification (many-to-one)
- Part-of-speech tagging (many-to-many)

Share the bottom LSTM layers and have separate task-specific output heads. Compare against single-task models.

Exercise 15.35: Continuous-Time RNN

Implement an ODE-RNN where the hidden state evolves according to a neural ODE between observations:

$$\frac{d\mathbf{h}}{dt} = f_\theta(\mathbf{h}(t))$$

Apply to irregularly-sampled time series data and compare against a standard LSTM that ignores time gaps.