Chapter 9: Quiz

Test your understanding of recurrent networks and sequence modeling. Answers follow each question.


Question 1

What is the fundamental difference between how an MLP and an RNN process a sequence of length $T$?

Answer An MLP processes a fixed-size input vector — to handle a sequence, it must flatten or concatenate all $T$ elements into a single vector, requiring $O(T \cdot d)$ input parameters and losing the ability to generalize across positions. An RNN processes the sequence **one element at a time**, maintaining a hidden state $\mathbf{h}_t$ that is updated at each step using the **same** weight matrices. This parameter sharing across time makes the parameter count independent of $T$ and allows the RNN to process sequences of arbitrary length.

Question 2

Write the recurrence equation for a vanilla RNN and identify the three weight matrices.

Answer
$$\mathbf{h}_t = \tanh(\mathbf{W}_{xh} \mathbf{x}_t + \mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{b}_h)$$
$$\mathbf{y}_t = \mathbf{W}_{hy} \mathbf{h}_t + \mathbf{b}_y$$

- $\mathbf{W}_{xh} \in \mathbb{R}^{h \times d}$: input-to-hidden weights
- $\mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$: hidden-to-hidden (recurrent) weights
- $\mathbf{W}_{hy} \in \mathbb{R}^{o \times h}$: hidden-to-output weights

The same $\mathbf{W}_{xh}$ and $\mathbf{W}_{hh}$ are reused at every time step.
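The recurrence can be sketched in a few lines of NumPy; the dimensions, random initialization, and toy input below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
d, h, o, T = 4, 8, 3, 10            # input, hidden, output dims; sequence length

# The three weight matrices from the recurrence (illustrative random init).
W_xh = rng.normal(0, 0.1, (h, d))   # input-to-hidden
W_hh = rng.normal(0, 0.1, (h, h))   # hidden-to-hidden (recurrent)
W_hy = rng.normal(0, 0.1, (o, h))   # hidden-to-output
b_h, b_y = np.zeros(h), np.zeros(o)

xs = rng.normal(size=(T, d))        # a toy input sequence
h_t = np.zeros(h)                   # h_0
ys = []
for x_t in xs:                      # the SAME weights are reused at every step
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_t + b_h)
    ys.append(W_hy @ h_t + b_y)

print(len(ys), ys[0].shape)         # T outputs, each of dimension o
```

Note that the parameter count is independent of $T$: the loop runs longer for longer sequences, but no new weights appear.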

Question 3

Explain backpropagation through time (BPTT). Why is it called that?

Answer BPTT is the application of standard backpropagation to the **unrolled** computation graph of an RNN. When we unroll the recurrence for $T$ time steps, the RNN becomes a $T$-layer deep feedforward network with shared weights. Backpropagation through this unrolled graph requires computing gradients backward through all $T$ steps — hence "through time." The gradient of the loss with respect to $\mathbf{W}_{hh}$ involves a sum over contributions from every time step, each requiring the chain rule through all subsequent steps back to the current one.

Question 4

What is the vanishing gradient problem? State the mathematical condition under which it occurs.

Answer The gradient of the loss at time $T$ with respect to the hidden state at time $k$ involves a product of Jacobian matrices:
$$\frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_k} = \prod_{t=k+1}^{T} \text{diag}(\tanh'(\mathbf{z}_t)) \cdot \mathbf{W}_{hh}$$
The vanishing gradient problem occurs when the largest singular value of each per-step Jacobian is bounded by some $\sigma < 1$. In this case, the product norm decreases **exponentially** as $(T - k)$ increases: $\| \frac{\partial \mathbf{h}_T}{\partial \mathbf{h}_k} \| \leq \sigma^{T-k} \to 0$. Since $|\tanh'(z)| \leq 1$, the $\tanh$ derivative contracts the gradient at every step, making vanishing gradients the default behavior for vanilla RNNs.
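The exponential decay is easy to observe numerically. The sketch below multiplies per-step Jacobians for a small, arbitrarily initialized $\mathbf{W}_{hh}$ (the dimension and weight scale are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
h = 8
W_hh = rng.normal(0, 0.1, (h, h))   # small recurrent weights (illustrative)

# Accumulate the product of per-step Jacobians diag(tanh'(z_t)) @ W_hh and
# track its spectral norm. tanh'(z) = 1 - tanh(z)^2 <= 1 at every step.
J = np.eye(h)
state = rng.normal(size=h)
norms = []
for _ in range(50):
    z = W_hh @ state                         # toy pre-activation
    J = np.diag(1 - np.tanh(z) ** 2) @ W_hh @ J
    state = np.tanh(z)
    norms.append(np.linalg.norm(J, 2))       # spectral norm of the product

print(norms[0], norms[-1])   # norm after 1 step vs. after 50 steps
```

After 50 steps the gradient signal through this path is numerically negligible, which is exactly the "default behavior" the answer describes.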

Question 5

Gradient clipping prevents exploding gradients. Does it also solve the vanishing gradient problem? Why or why not?

Answer **No.** Gradient clipping rescales gradients that exceed a threshold, preventing numerical overflow and training instability from exploding gradients. However, it cannot amplify vanishing gradients — if the gradient norm is $10^{-8}$, clipping does nothing to make it larger. The vanishing gradient problem requires an **architectural** solution (such as the LSTM's additive cell state updates), not a gradient manipulation technique.
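A minimal sketch of clip-by-global-norm makes the asymmetry concrete (the helper name and threshold are arbitrary):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their global norm is <= max_norm."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))
    return [g * scale for g in grads], total

# An exploding gradient is rescaled down to the threshold...
big, _ = clip_by_global_norm([np.full(4, 100.0)], max_norm=5.0)
# ...but a vanishing gradient passes through untouched: clipping only shrinks.
tiny, _ = clip_by_global_norm([np.full(4, 1e-8)], max_norm=5.0)

print(np.linalg.norm(big[0]))    # clipped to the threshold
print(np.linalg.norm(tiny[0]))   # still ~2e-8
```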

Question 6

What is truncated BPTT? What tradeoff does it make?

Answer Truncated BPTT limits backpropagation to $k$ time steps instead of the full sequence length $T$. The forward pass proceeds through the entire sequence (the hidden state carries information forward across segment boundaries), but gradients are only computed within windows of $k$ steps. The tradeoff: it makes training computationally feasible for very long sequences, but it **cannot learn dependencies that span more than $k$ steps** through gradient descent, even if the hidden state can represent them in the forward pass.

Question 7

Name the three gates in an LSTM and describe the role of each in one sentence.

Answer

1. **Forget gate** ($\mathbf{f}_t$): Decides which elements of the cell state $\mathbf{c}_{t-1}$ to discard (values near 0) or retain (values near 1).
2. **Input gate** ($\mathbf{i}_t$): Controls how much of the new candidate information $\tilde{\mathbf{c}}_t$ to add to the cell state.
3. **Output gate** ($\mathbf{o}_t$): Controls how much of the cell state (after $\tanh$) to expose as the hidden state $\mathbf{h}_t$ for the current output and the next time step.

Question 8

Write the LSTM cell state update equation and explain why it solves the vanishing gradient problem.

Answer
$$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$$
This is an **additive** update: the new cell state is the old cell state scaled by the forget gate plus new information scaled by the input gate. The gradient through this path is:
$$\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \text{diag}(\mathbf{f}_t)$$
When $\mathbf{f}_t \approx 1$, this is approximately the **identity matrix** — no weight matrix $\mathbf{W}_{hh}$ appears in this gradient path. The gradient can flow through the cell state for hundreds of time steps without multiplicative attenuation, creating a "gradient highway."
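A single LSTM step can be sketched in NumPy; the stacked-weight layout and dimensions here are one common convention, chosen for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W stacks all four gate blocks: shape (4h, d + h)."""
    hdim = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    f = sigmoid(z[0*hdim:1*hdim])           # forget gate
    i = sigmoid(z[1*hdim:2*hdim])           # input gate
    o = sigmoid(z[2*hdim:3*hdim])           # output gate
    c_tilde = np.tanh(z[3*hdim:4*hdim])     # candidate cell state
    c = f * c_prev + i * c_tilde            # additive update (gradient highway)
    h = o * np.tanh(c)                      # exposed hidden state
    return h, c

rng = np.random.default_rng(0)
d, hdim = 4, 8
W = rng.normal(0, 0.1, (4 * hdim, d + hdim))
b = np.zeros(4 * hdim)
h, c = lstm_step(rng.normal(size=d), np.zeros(hdim), np.zeros(hdim), W, b)
print(h.shape, c.shape)
```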

Question 9

What is the standard initialization trick for the LSTM forget gate bias, and why is it used?

Answer The forget gate bias is initialized to a **positive value** (typically 1.0 or 2.0), so that $\sigma(\mathbf{b}_f) \approx \sigma(1) \approx 0.73$ at the start of training. This biases the LSTM to **remember by default**: the cell state is mostly preserved at each step, ensuring that long-range gradient flow is enabled from the beginning. Without this trick ($\mathbf{b}_f = 0$, so $\sigma(0) = 0.5$), the LSTM forgets half the cell state at each step during early training, weakening the gradient highway before the network has learned which information is worth remembering.
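The numbers in the answer are quick to verify. The sketch below compares how much of the cell state survives 20 steps under each bias setting (the 20-step horizon is an arbitrary illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Fraction of the cell state retained per step, and after 20 steps,
# for forget-gate bias 0 vs. 1 (before any learning has occurred).
for b_f in (0.0, 1.0):
    keep = sigmoid(b_f)
    print(f"b_f={b_f}: keeps {keep:.3f}/step, {keep**20:.2e} after 20 steps")
```

With $b_f = 0$, under one millionth of the signal survives 20 steps; with $b_f = 1$, about 0.2% does, roughly three orders of magnitude more gradient flow.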

Question 10

How does a GRU differ from an LSTM structurally? Name two specific differences.

Answer

1. **Merged state:** The GRU has a single hidden state $\mathbf{h}_t$, while the LSTM has both a hidden state $\mathbf{h}_t$ and a separate cell state $\mathbf{c}_t$. The GRU does not distinguish between long-term memory (cell state) and short-term output (hidden state).
2. **Coupled gates:** The GRU uses an interpolation $\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$, where the update gate $\mathbf{z}_t$ controls both forgetting and writing. In the LSTM, the forget gate $\mathbf{f}_t$ and input gate $\mathbf{i}_t$ are independent: the LSTM can simultaneously retain all old information and add new information, while the GRU cannot.

Additionally, the GRU has 2 gates (update, reset) versus the LSTM's 3 (forget, input, output), so its recurrent block has $3h(h+d)$ weight parameters versus the LSTM's $4h(h+d)$.
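For comparison with the LSTM step above, a single GRU step in NumPy (biases omitted and dimensions illustrative for brevity):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h_prev, Wz, Wr, Wh):
    """One GRU step; each weight matrix maps concat(x, h) -> h."""
    z = sigmoid(Wz @ np.concatenate([x, h_prev]))            # update gate
    r = sigmoid(Wr @ np.concatenate([x, h_prev]))            # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([x, r * h_prev]))  # candidate
    return (1 - z) * h_prev + z * h_tilde  # one gate both forgets and writes

rng = np.random.default_rng(0)
d, hdim = 4, 8
Wz, Wr, Wh = (rng.normal(0, 0.1, (hdim, d + hdim)) for _ in range(3))
h = gru_step(rng.normal(size=d), np.zeros(hdim), Wz, Wr, Wh)
print(h.shape)   # a single merged state -- no separate cell state
```

Note the structural contrast: three weight blocks instead of four, and one state vector instead of two.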

Question 11

When is a bidirectional RNN appropriate? When is it inappropriate?

Answer

**Appropriate:** When the entire input sequence is available at once and the representation at each position should capture both past and future context. Examples: sequence classification, sequence labeling (POS tagging, NER), encoding for seq2seq models.

**Inappropriate:** For **autoregressive generation** (language modeling, text generation, speech synthesis), where future tokens are not available at prediction time; using future context during training but not during inference creates a train-test mismatch. Also inappropriate for **streaming/online** applications where data arrives sequentially and predictions must be made before future data is observed.

Question 12

In a seq2seq model without attention, what is the "information bottleneck" problem?

Answer The encoder compresses the entire input sequence into a single fixed-size context vector $\mathbf{c} = \mathbf{h}_T^{\text{enc}} \in \mathbb{R}^h$. This vector must encode all the information the decoder needs to generate the output sequence. For short inputs, this compression is manageable. For long inputs (e.g., a 100-word sentence encoded into a 512-dimensional vector), significant information is lost — the last few input tokens tend to dominate the context vector, and information about early tokens is severely attenuated. This causes seq2seq performance to degrade substantially as input length increases.

Question 13

Explain the difference between teacher forcing and free-running during seq2seq training.

Answer

**Teacher forcing:** At each decoder step $t$, the input is the **ground-truth** previous token $y_{t-1}$, regardless of what the decoder predicted. This provides stable, informative gradients and trains faster, but creates **exposure bias**: the decoder is trained on perfect inputs but tested on its own (imperfect) predictions.

**Free-running:** At each step $t$, the input is the decoder's own **predicted** previous token $\hat{y}_{t-1}$. This matches inference behavior, but errors compound: a single wrong prediction can derail the entire output sequence, making training slow and unstable.

**Scheduled sampling** bridges both: start with teacher forcing and gradually increase the probability of using model predictions during training.
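All three regimes fit in one toy decoder loop. Here `step_fn` is a stand-in for the real decoder, and the "model" is a dummy function invented purely to make the control flow visible:

```python
import random

def decode_train(step_fn, targets, p_teacher, bos=0):
    """Toy decoder loop: with probability p_teacher, feed the ground-truth
    previous token; otherwise feed the model's own last prediction.
    p_teacher=1.0 is teacher forcing, 0.0 is free-running, values in
    between correspond to scheduled sampling."""
    prev, outputs = bos, []
    for y_true in targets:
        y_hat = step_fn(prev)
        outputs.append(y_hat)
        prev = y_true if random.random() < p_teacher else y_hat
    return outputs

random.seed(0)
echo = lambda tok: tok + 1   # dummy "model": always predicts prev + 1
out_tf = decode_train(echo, [5, 6, 7], p_teacher=1.0)  # teacher forcing
out_fr = decode_train(echo, [5, 6, 7], p_teacher=0.0)  # free-running
print(out_tf)   # [1, 6, 7] -- later steps see the ground truth
print(out_fr)   # [1, 2, 3] -- the step-1 error compounds through the sequence
```

The free-running output shows error compounding directly: the first wrong prediction is fed back in and derails every subsequent step.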

Question 14

What is beam search? How does beam width $B$ affect the quality-cost tradeoff?

Answer Beam search is a decoding algorithm that maintains the top $B$ candidate sequences at each step. At every decoding step, each of the $B$ candidates is expanded by all $V$ vocabulary tokens, yielding $B \times V$ candidates, and the top $B$ by cumulative log-probability are retained. $B = 1$ is greedy search. Larger $B$ explores more of the output space, generally improving quality (up to a point) at a **linear** cost in computation ($B \times$ the cost of greedy decoding). Length normalization divides the log-probability by the sequence length to avoid penalizing longer outputs. In practice, $B = 4$ or $B = 5$ is typical; beyond $B \approx 10$, quality gains are negligible for most tasks.
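A toy beam search over an invented two-token model; the probabilities are contrived purely to show a wider beam recovering a higher-probability sequence that greedy search misses:

```python
import math

def beam_search(next_log_probs, B, steps, bos=0):
    """Toy beam search. next_log_probs(seq) -> {token: log_prob}.
    Keeps the B highest-scoring partial sequences at each step."""
    beams = [((bos,), 0.0)]
    for _ in range(steps):
        candidates = []
        for seq, score in beams:
            for tok, lp in next_log_probs(seq).items():
                candidates.append((seq + (tok,), score + lp))
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:B]
    return beams

def toy_model(seq):
    # Token 1 is locally tempting, but choosing 2 first unlocks a better path.
    if seq[-1] == 2:
        return {1: math.log(0.95), 2: math.log(0.05)}
    return {1: math.log(0.6), 2: math.log(0.4)}

greedy = beam_search(toy_model, B=1, steps=2)   # B=1 is greedy search
wide = beam_search(toy_model, B=3, steps=2)
print(greedy[0][0])   # (0, 1, 1): probability 0.36
print(wide[0][0])     # (0, 2, 1): probability 0.38 -- greedy missed it
```

Each step expands every beam by every token and keeps the top $B$ by cumulative log-probability, exactly the procedure described above (length normalization is omitted since both candidates have equal length).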

Question 15

Write the Bahdanau (additive) attention score function and explain each component.

Answer
$$e_{tj} = \mathbf{v}^T \tanh(\mathbf{W}_s \mathbf{s}_{t-1} + \mathbf{W}_h \mathbf{h}_j^{\text{enc}})$$

- $\mathbf{s}_{t-1} \in \mathbb{R}^{d_s}$: decoder hidden state at the previous step (the "query": what information does the decoder need?)
- $\mathbf{h}_j^{\text{enc}} \in \mathbb{R}^{d_h}$: encoder hidden state at position $j$ (the "key/value": what information is available?)
- $\mathbf{W}_s \in \mathbb{R}^{a \times d_s}$: projects the decoder state into a shared attention space
- $\mathbf{W}_h \in \mathbb{R}^{a \times d_h}$: projects the encoder state into the same attention space
- $\tanh$: nonlinear combination of the two projections
- $\mathbf{v} \in \mathbb{R}^a$: learned vector that reduces the $a$-dimensional attention representation to a scalar score

The scores are normalized via softmax to produce attention weights $\alpha_{tj}$, which weight the encoder hidden states to form a context vector $\mathbf{c}_t = \sum_j \alpha_{tj} \mathbf{h}_j^{\text{enc}}$.
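The full pipeline (scores, softmax, context) is a handful of NumPy operations; the dimensions and random weights below are illustrative:

```python
import numpy as np

def bahdanau_attention(s_prev, H_enc, W_s, W_h, v):
    """Additive attention: e_tj = v^T tanh(W_s s_{t-1} + W_h h_j)."""
    scores = np.tanh(s_prev @ W_s.T + H_enc @ W_h.T) @ v   # (T_enc,)
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                   # softmax weights
    context = alpha @ H_enc                                # weighted sum of h_j
    return context, alpha

rng = np.random.default_rng(0)
T_enc, d_h, d_s, a = 6, 8, 8, 5
H_enc = rng.normal(size=(T_enc, d_h))   # encoder hidden states (keys/values)
s_prev = rng.normal(size=d_s)           # previous decoder state (query)
W_s = rng.normal(size=(a, d_s))
W_h = rng.normal(size=(a, d_h))
v = rng.normal(size=a)

context, alpha = bahdanau_attention(s_prev, H_enc, W_s, W_h, v)
print(alpha.sum(), context.shape)   # weights sum to 1; context is d_h-dim
```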

Question 16

How does Luong (multiplicative) attention differ from Bahdanau (additive) attention? Which is more computationally efficient?

Answer Luong attention uses a **multiplicative** score function instead of an additive one. Luong et al. proposed three variants:

- **Dot:** $e_{tj} = \mathbf{s}_t^T \mathbf{h}_j^{\text{enc}}$ (no learnable parameters in the score)
- **General:** $e_{tj} = \mathbf{s}_t^T \mathbf{W}_a \mathbf{h}_j^{\text{enc}}$ (bilinear form)
- **Concat:** an additive form similar to Bahdanau's, applied to the concatenated states

The **dot-product** variant is the most computationally efficient: it has no learnable parameters, and the scores over all encoder positions reduce to a single batched matrix multiply. Bahdanau attention requires two matrix multiplications ($\mathbf{W}_s$, $\mathbf{W}_h$), a $\tanh$, and a dot product with $\mathbf{v}$ — roughly $2\times$ the computation. The scaled dot product $e_{tj} = \frac{\mathbf{s}_t^T \mathbf{h}_j^{\text{enc}}}{\sqrt{d}}$ is a later refinement of the dot variant and is the form used in transformers (Chapter 10).
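The efficiency claim is visible in code: each multiplicative score is a single matrix-vector product over the encoder states (dimensions and random values illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
T_enc, d = 6, 8
H_enc = rng.normal(size=(T_enc, d))   # encoder states, one row per position
s_t = rng.normal(size=d)              # current decoder state
W_a = rng.normal(size=(d, d))

e_general = H_enc @ (W_a.T @ s_t)     # bilinear: s^T W_a h_j for every j
e_dot = H_enc @ s_t                   # parameter-free dot product
e_scaled = e_dot / np.sqrt(d)         # transformer-style scaled variant

print(e_general.shape, e_dot.shape, e_scaled.shape)   # one score per position
```

Compare with the Bahdanau sketch above: no projection into a separate attention space, no $\tanh$, no reduction vector $\mathbf{v}$.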

Question 17

Trace the conceptual chain from RNNs to transformers in three steps. For each step, identify the specific problem solved.

Answer

1. **RNN $\to$ LSTM (1997):** Vanilla RNNs suffer from vanishing gradients, preventing them from learning long-range dependencies. LSTMs solve this by introducing a cell state with additive updates, creating a gradient highway through the sequence.
2. **LSTM $\to$ Attention (2014):** The seq2seq encoder-decoder architecture compresses the entire input into a fixed-size context vector, creating an information bottleneck. Attention solves this by letting the decoder look at **all** encoder hidden states at each step, dynamically weighting them by relevance.
3. **Attention $\to$ Transformer (2017):** RNNs with attention still process sequences **sequentially** (each hidden state depends on the previous one), preventing parallel computation. The transformer replaces recurrence entirely with **self-attention** (every position attends to every other position directly) and uses positional encoding to convey order. This enables full parallelization during training and direct long-range connections.

Question 18

Name three scenarios where LSTMs remain a better choice than transformers in 2025.

Answer

1. **Streaming / online inference:** LSTMs have $O(h^2)$ per-step inference cost independent of history length. Transformers need $O(T \cdot d)$ attention compute per new token even with a KV cache, and the cache itself grows as $O(T \cdot d)$ in memory, making them more expensive for continuous streams.
2. **Edge deployment / resource constraints:** LSTMs avoid the attention computation entirely ($O(T^2)$ time; FlashAttention reduces the memory footprint to $O(T)$ but not the time), making them feasible on microcontrollers and mobile devices with limited compute and memory.
3. **Quick baselines:** LSTMs are well-understood, train in hours, and have stable hyperparameters. For a new sequence task, an LSTM baseline establishes a performance floor quickly. If a transformer does not substantially beat it, the added complexity is unjustified.

Question 19

In the StreamRec session LSTM, the model achieves Hit@10 $\approx$ 15% on 5,000 items. A random baseline would achieve Hit@10 $\approx$ 0.2%. Explain what this metric measures and why the LSTM's performance is meaningful.

Answer **Hit@10** measures the fraction of test examples where the true next item appears in the model's top 10 predictions. With 5,000 items, random guessing would place the true item in the top 10 with probability $10/5000 = 0.2\%$. The LSTM's 15% means it is roughly **75x better than random** — it has learned meaningful sequential patterns in user browsing behavior (e.g., within-category browsing, transition patterns between categories). This is meaningful because it demonstrates that the order of item interactions carries predictive information that a non-sequential model would miss. The metric also serves as a concrete baseline for the transformer comparison in Chapter 10.
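The metric and the random baseline are both easy to reproduce. The sketch below scores 1,000 hypothetical test examples with pure noise and confirms Hit@10 lands near 0.2% (the dataset sizes mirror the question; the data itself is synthetic):

```python
import numpy as np

def hit_at_k(scores, true_items, k=10):
    """Fraction of rows whose true item appears among the top-k scores."""
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean([t in row for t, row in zip(true_items, topk)]))

rng = np.random.default_rng(0)
n_examples, n_items = 1000, 5000
scores = rng.normal(size=(n_examples, n_items))        # random "model"
true_items = rng.integers(0, n_items, size=n_examples)

rate = hit_at_k(scores, true_items)
print(rate)   # expected around 10/5000 = 0.002
```

A trained model's 15% against this ~0.2% floor is the 75x figure quoted above.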

Question 20

An LSTM with $h = 256$ and $d = 64$ has approximately 330,000 parameters. A transformer with 4 layers, 4 heads, $d = 256$, and an FFN dimension of 1024 has approximately 3.3 million parameters — 10x more. Under what circumstances is the smaller LSTM preferable despite having fewer parameters?

Answer The LSTM is preferable when:

1. **Data is limited.** With fewer parameters, the LSTM is less prone to overfitting on small datasets. The transformer's capacity advantage only materializes with sufficient training data.
2. **Sequence length is moderate** ($T \leq 100$). For short sequences, the LSTM's sequential processing overhead is small, and the transformer's $O(T^2)$ attention computation offers no efficiency gain.
3. **Inference is latency-sensitive and sequential.** Processing one new token costs the LSTM a fixed $O(h(h+d)) \approx 330\text{K}$ multiply-adds. The transformer's per-token cost includes an $O(T \cdot d)$ attention term that grows with history length.
4. **Deployment environment is constrained.** 330K parameters occupy about 1.3 MB at float32 and fit comfortably in on-chip cache on many processors; 3.3M parameters (about 13 MB) may not, causing cache misses and slower inference.
5. **The task does not require long-range dependencies.** If the relevant context for prediction is within the last 20-50 tokens, the LSTM can capture it effectively, and the transformer's ability to attend to any position is unnecessary.
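Both parameter counts in the question can be checked with a few lines of arithmetic. The transformer tally below covers only attention projections and FFN weights, so it lands slightly under the quoted 3.3M; embeddings, biases, and layer norms make up the difference:

```python
# LSTM: 4 gate blocks, each mapping concat(x, h) -> h, plus 4 bias vectors.
h, d = 256, 64
lstm_params = 4 * (h * (h + d) + h)
print(f"LSTM: {lstm_params:,}")              # ~330K

# Transformer layer: 4 attention projections (Q, K, V, output), each d x d,
# plus a 2-layer FFN of width ffn. Embeddings, biases, norms omitted.
d_model, ffn, layers = 256, 1024, 4
per_layer = 4 * d_model ** 2 + 2 * d_model * ffn
tf_params = layers * per_layer
print(f"Transformer (weights only): {tf_params:,}")   # ~3.1M
```

The roughly 10x gap between the two models is visible directly from the two totals.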