Chapter 15: Quiz
Test your understanding of Recurrent Neural Networks and Sequence Modeling. Each question has one correct answer unless stated otherwise.
Question 1
What is the primary advantage of parameter sharing across time steps in an RNN?
- A) It reduces the number of training examples needed
- B) It allows the model to handle variable-length sequences and generalize across positions
- C) It eliminates the need for backpropagation
- D) It prevents overfitting entirely
Answer
**B)** Parameter sharing means the same weights are applied at every time step, which allows the model to process sequences of any length and generalize patterns learned at one position to other positions. Without sharing, each position would need its own weights, making variable-length inputs impossible and preventing generalization.
Question 2
In a vanilla RNN, what causes the vanishing gradient problem?
- A) Using too many training examples
- B) The repeated multiplication of the weight matrix Jacobians during backpropagation through time
- C) The use of the softmax activation function
- D) Having too many output classes
Answer
**B)** During BPTT, the gradient involves a product of Jacobians $\prod_{j=k+1}^{t} \frac{\partial \mathbf{h}_j}{\partial \mathbf{h}_{j-1}}$. When the spectral radius of $\mathbf{W}_{hh}$ is less than 1, this product shrinks exponentially toward zero as the number of time steps increases, preventing learning of long-range dependencies.
Question 3
Which gate in an LSTM cell is responsible for deciding what information to discard from the cell state?
- A) Input gate
- B) Output gate
- C) Forget gate
- D) Update gate
Answer
**C)** The forget gate $\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f)$ produces values between 0 and 1 that multiply the previous cell state element-wise. A value near 0 means "forget this information" and a value near 1 means "keep this information."
Question 4
Why is the cell state update in an LSTM additive rather than multiplicative?
- A) To reduce computational cost
- B) To allow gradients to flow through time without vanishing
- C) To simplify the implementation
- D) To make the model bidirectional
Answer
**B)** The cell state update $\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t$ is additive. This means the gradient $\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \mathbf{f}_t$, which can remain close to 1 when the forget gate is open. This prevents the exponential decay of gradients that plagues vanilla RNNs.
Question 5
How does a GRU differ from an LSTM?
- A) A GRU has 3 gates; an LSTM has 2 gates
- B) A GRU uses a separate cell state; an LSTM does not
- C) A GRU merges the cell state into the hidden state and uses 2 gates instead of 3
- D) A GRU cannot process sequences longer than 100 steps
Answer
**C)** The GRU simplifies the LSTM by combining the cell state and hidden state into a single vector and using two gates (update and reset) instead of three (forget, input, output). This reduces the parameter count while maintaining comparable performance.
Question 6
In a GRU, what does the update gate $\mathbf{z}_t$ control?
- A) Which parts of the input to use
- B) The balance between keeping the previous hidden state and accepting the new candidate
- C) The learning rate for that time step
- D) Whether to skip the current time step
Answer
**B)** The update gate controls the interpolation: $\mathbf{h}_t = (1 - \mathbf{z}_t) \odot \mathbf{h}_{t-1} + \mathbf{z}_t \odot \tilde{\mathbf{h}}_t$. When $\mathbf{z}_t \approx 0$, the hidden state is copied forward; when $\mathbf{z}_t \approx 1$, it is replaced with the new candidate.
Question 7
When is a bidirectional RNN NOT appropriate?
- A) Sentiment analysis
- B) Named entity recognition
- C) Autoregressive language generation
- D) Part-of-speech tagging
Answer
**C)** Autoregressive language generation requires producing tokens left-to-right, where each token depends only on previous tokens. A bidirectional RNN requires the full sequence to be available, which contradicts the autoregressive setting. Bidirectional RNNs are appropriate for tasks where the entire input is known before prediction.
Question 8
What is the output dimensionality of a bidirectional LSTM with hidden_size=128 at each time step?
- A) 128
- B) 256
- C) 64
- D) 512
Answer
**B)** A bidirectional LSTM concatenates the forward and backward hidden states at each time step, resulting in a dimensionality of $128 \times 2 = 256$.
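A minimal PyTorch check (the input size, batch size, and sequence length below are arbitrary):

```python
import torch
import torch.nn as nn

# Bidirectional LSTM: forward and backward hidden states are concatenated.
lstm = nn.LSTM(input_size=50, hidden_size=128, batch_first=True, bidirectional=True)

x = torch.randn(4, 10, 50)          # (batch, seq_len, input_size)
output, (h_n, c_n) = lstm(x)

print(output.shape)  # torch.Size([4, 10, 256])  -> 128 * 2 directions
print(h_n.shape)     # torch.Size([2, 4, 128])   -> one final state per direction
```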
Question 9
In a sequence-to-sequence model, what is the "context vector"?
- A) The average of all input embeddings
- B) The final hidden state of the encoder, passed to initialize the decoder
- C) The output of the first decoder step
- D) A learnable parameter vector
Answer
**B)** In the basic seq2seq architecture, the encoder processes the entire input sequence and its final hidden state (or a transformation of it) serves as the context vector. This context vector is used to initialize the decoder's hidden state, providing it with a compressed representation of the input.
Question 10
What is the main limitation of the basic seq2seq model that attention mechanisms address?
- A) It cannot handle sequences longer than 512 tokens
- B) The entire input must be compressed into a single fixed-length context vector
- C) It requires too many parameters
- D) It cannot be trained with backpropagation
Answer
**B)** The basic seq2seq model compresses the entire input sequence into a single fixed-length vector, creating an information bottleneck. For long sequences, this vector cannot capture all the relevant information. Attention allows the decoder to access all encoder hidden states at each step, eliminating this bottleneck.
Question 11
What is teacher forcing?
- A) A regularization technique that adds noise to the input
- B) Using the ground-truth previous token as the decoder input during training
- C) A method for increasing the beam width during inference
- D) Forcing the model to use a specific learning rate schedule
Answer
**B)** Teacher forcing provides the ground-truth target token as input to the decoder at each time step during training, rather than the model's own prediction. This prevents error accumulation and speeds up convergence but introduces exposure bias.
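A self-contained sketch of the idea with arbitrary sizes; the module names (`decoder_cell`, `out_proj`, `embed`) are illustrative, not a specific seq2seq API:

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden = 1000, 32, 64        # illustrative sizes
embed = nn.Embedding(vocab_size, emb_dim)
decoder_cell = nn.GRUCell(emb_dim, hidden)
out_proj = nn.Linear(hidden, vocab_size)

targets = torch.randint(0, vocab_size, (8, 12))   # ground-truth token ids (batch, T)
h = torch.zeros(8, hidden)                        # e.g. the encoder's context vector

logits = []
for t in range(targets.size(1) - 1):
    inp = embed(targets[:, t])        # teacher forcing: feed the ground-truth token,
    h = decoder_cell(inp, h)          # not the model's own previous prediction
    logits.append(out_proj(h))        # predict token t+1
logits = torch.stack(logits, dim=1)   # (batch, T-1, vocab_size)

loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size),
                                   targets[:, 1:].reshape(-1))
```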
Question 12
What is exposure bias?
- A) The model's tendency to memorize the training data
- B) A mismatch between training (seeing ground truth) and inference (seeing own predictions)
- C) A bias in the data collection process
- D) The tendency for early layers to receive smaller gradients
Answer
**B)** Exposure bias occurs because teacher forcing always provides correct previous tokens during training, but at inference time the model must use its own (potentially incorrect) predictions. This distribution mismatch can cause error accumulation during inference.
Question 13
In beam search with beam width $k=3$, how many candidate sequences are maintained at each step?
- A) 1
- B) 3
- C) $3 \times |V|$ where $|V|$ is the vocabulary size
- D) All possible sequences
Answer
**B)** Beam search maintains the top $k$ candidates at each step. It evaluates $k \times |V|$ expansions, then prunes back to the top $k=3$ candidates before proceeding.
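One pruning step can be written with `topk` over the $k \times |V|$ expanded scores (a sketch; `beam_scores` and `step_log_probs` are illustrative names filled with random values):

```python
import torch

k, V = 3, 10000                                  # beam width, vocabulary size
beam_scores = torch.randn(k)                     # running log-probability of each beam
step_log_probs = torch.log_softmax(torch.randn(k, V), dim=-1)

# Expand every beam by every vocabulary token: k * V candidate scores ...
candidate_scores = (beam_scores.unsqueeze(1) + step_log_probs).view(-1)

# ... then prune back to the top k before the next step.
top_scores, top_idx = candidate_scores.topk(k)
beam_idx = torch.div(top_idx, V, rounding_mode="floor")   # which beam each survivor extends
token_idx = top_idx % V                                   # which token extends it
```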
Question 14
Why is length normalization important in beam search?
- A) To ensure all output sequences have the same length
- B) To counteract the bias toward shorter sequences since each token adds negative log probability
- C) To speed up the search algorithm
- D) To prevent the model from generating unknown tokens
Answer
**B)** Without normalization, beam search prefers shorter sequences because the joint probability $P(y_1, \ldots, y_T)$ is a product of terms less than 1 (or equivalently, each token adds a negative log probability). Dividing by $|y|^\alpha$ normalizes the score by sequence length.
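A minimal sketch of the normalized score; values of $\alpha$ around 0.6–0.7 are commonly reported:

```python
def length_normalized_score(log_probs, alpha=0.6):
    """Sum of token log-probabilities divided by |y|^alpha."""
    return sum(log_probs) / (len(log_probs) ** alpha)

# Unnormalized, the shorter hypothesis wins (-2.0 > -2.4);
# normalized, the longer one does (about -1.24 > -1.32).
short = [-1.0, -1.0]
long_ = [-0.8, -0.8, -0.8]
print(length_normalized_score(short))   # ~ -1.32
print(length_normalized_score(long_))   # ~ -1.24
```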
Question 15
Gradient clipping addresses which problem?
- A) Vanishing gradients
- B) Exploding gradients
- C) Both vanishing and exploding gradients
- D) Overfitting
Answer
**B)** Gradient clipping rescales the gradient when its norm exceeds a threshold, preventing updates from being too large. It addresses exploding gradients but does nothing for vanishing gradients (which require architectural changes like LSTM/GRU gates).
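In PyTorch, clipping is a single call placed between `backward()` and the optimizer step (a self-contained sketch with a placeholder loss):

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=50, hidden_size=100, batch_first=True)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.randn(8, 20, 50)
output, _ = model(x)
loss = output.pow(2).mean()          # placeholder loss for illustration

optimizer.zero_grad()
loss.backward()

# Rescale gradients so their global norm is at most max_norm;
# has no effect when the norm is already below the threshold.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

optimizer.step()
```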
Question 16
What is the purpose of `pack_padded_sequence` in PyTorch?
- A) To compress the model's parameters
- B) To ensure the RNN does not process padding tokens, improving correctness and efficiency
- C) To reduce the vocabulary size
- D) To enable GPU acceleration
Answer
**B)** `pack_padded_sequence` creates a packed representation that tells the RNN which elements are padding, so they are skipped during processing. This prevents padding tokens from affecting the hidden state and also improves computational efficiency.
Question 17
For a many-to-one task (e.g., sentiment classification), which hidden state is typically used for the final prediction?
- A) The hidden state at $t=0$
- B) The average of all hidden states
- C) The hidden state at the final time step $t=T$
- D) All of the above are equally common approaches
Answer
**D)** While using the final hidden state ($t=T$) is the most common approach, averaging all hidden states (mean pooling) and even max pooling across time steps are also widely used and can sometimes yield better results. The best choice depends on the specific task.
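The readout strategies differ only in how the per-step outputs are reduced (a sketch; shapes assume `batch_first=True` and no padding):

```python
import torch
import torch.nn as nn

rnn = nn.GRU(input_size=50, hidden_size=100, batch_first=True)
x = torch.randn(4, 12, 50)                 # (batch, seq_len, input_size)
output, h_n = rnn(x)                       # output: (4, 12, 100)

last_step = output[:, -1, :]               # final hidden state (== h_n[-1] here)
mean_pool = output.mean(dim=1)             # average over time steps
max_pool  = output.max(dim=1).values       # max over time steps

# With padded batches, index the true last step of each sequence instead.
# Any of these (4, 100) vectors can feed a classification head.
```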
Question 18
An LSTM with `input_size=50` and `hidden_size=100` has approximately how many parameters (excluding biases)?
- A) 15,000
- B) 30,000
- C) 60,000
- D) 120,000
Answer
**C)** The LSTM has four sets of weight matrices (for forget, input, cell candidate, and output gates). Each set processes a concatenated input of size $50 + 100 = 150$ and maps to `hidden_size=100`. So the total is $4 \times 150 \times 100 = 60{,}000$ parameters (excluding biases).
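This matches PyTorch's parameterization, which stores the input-to-hidden and hidden-to-hidden weights separately but with the same total count:

```python
import torch.nn as nn

lstm = nn.LSTM(input_size=50, hidden_size=100)

weight_params = sum(p.numel() for name, p in lstm.named_parameters()
                    if name.startswith("weight"))
bias_params = sum(p.numel() for name, p in lstm.named_parameters()
                  if name.startswith("bias"))

print(weight_params)  # 60000 = 4*100*50 (weight_ih_l0) + 4*100*100 (weight_hh_l0)
print(bias_params)    # 800   = 4*100 (bias_ih_l0) + 4*100 (bias_hh_l0)
```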
Question 19
What is truncated backpropagation through time (TBPTT)?
- A) A method that removes certain time steps from the sequence
- B) Limiting the number of time steps over which gradients are backpropagated
- C) A technique for reducing the model size
- D) A way to increase the batch size
Answer
**B)** Truncated BPTT limits backpropagation to a fixed number of recent time steps rather than propagating gradients through the entire sequence. This reduces memory usage and computation cost, but it limits the model's ability to learn very long-range dependencies.
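A common implementation pattern is to process the sequence in chunks and detach the hidden state at each chunk boundary so gradients stop there (a sketch; the chunk length and placeholder loss are arbitrary):

```python
import torch
import torch.nn as nn

rnn = nn.GRU(input_size=50, hidden_size=100, batch_first=True)
optimizer = torch.optim.Adam(rnn.parameters(), lr=1e-3)

x = torch.randn(8, 200, 50)     # one long batch of sequences
chunk_len = 35                  # backpropagate through at most 35 steps
h = None

for start in range(0, x.size(1), chunk_len):
    chunk = x[:, start:start + chunk_len]
    output, h = rnn(chunk, h)
    loss = output.pow(2).mean()          # placeholder loss for illustration

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    h = h.detach()   # carry the state forward, but cut the gradient path
```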
Question 20
In a stacked (deep) RNN with 3 layers, where should dropout be applied?
- A) Only to the input layer
- B) Between layers (vertically) but not across time steps (horizontally)
- C) At every time step in every layer
- D) Only to the output layer
Answer
**B)** Standard practice is to apply dropout between RNN layers (to the output of each layer before passing to the next) but not across time steps within a layer, as that would disrupt the temporal information flow in the hidden state. PyTorch's `dropout` parameter in `nn.LSTM` implements this behavior.
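In PyTorch this is a single constructor argument:

```python
import torch.nn as nn

# Dropout with p=0.3 is applied to the outputs of layers 1 and 2 before they
# feed into the next layer; the top layer's output and the recurrent
# (time-step to time-step) connections are untouched.
lstm = nn.LSTM(input_size=50, hidden_size=100, num_layers=3,
               dropout=0.3, batch_first=True)
```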
Question 21
What advantage do attention mechanisms provide over the basic seq2seq context vector?
- A) They require fewer parameters
- B) They allow the decoder to dynamically focus on different parts of the input at each step
- C) They eliminate the need for an encoder
- D) They guarantee the output length matches the input length
Answer
**B)** Attention mechanisms compute a weighted combination of all encoder hidden states at each decoder step, allowing the model to focus on the most relevant parts of the input. This is more flexible and information-preserving than compressing the entire input into a single context vector.
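A minimal dot-product attention step over the encoder outputs (a sketch with random tensors; real implementations also mask padded positions):

```python
import torch
import torch.nn.functional as F

batch, src_len, hidden = 4, 15, 100
encoder_outputs = torch.randn(batch, src_len, hidden)   # all encoder hidden states
decoder_state   = torch.randn(batch, hidden)            # current decoder hidden state

# Score each encoder position against the decoder state (dot-product attention).
scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(2)).squeeze(2)  # (batch, src_len)
weights = F.softmax(scores, dim=1)

# The context vector is a different weighted sum of encoder states at every decoder step.
context = torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)       # (batch, hidden)
```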
Question 22
Which of the following correctly describes the relationship between the LSTM's forget gate and gradient flow?
- A) The forget gate blocks all gradient flow when set to 0
- B) The gradient through the cell state is scaled by the forget gate: $\frac{\partial c_t}{\partial c_{t-1}} = f_t$
- C) The forget gate has no effect on gradient flow
- D) The forget gate amplifies gradients when set to values greater than 1
Answer
**B)** The gradient of the cell state with respect to the previous cell state is exactly the forget gate value: $\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \mathbf{f}_t$. When the forget gate is close to 1, gradients pass through almost unchanged; when it is close to 0, gradients are attenuated. Since the forget gate is sigmoid-activated, its values always lie in $(0, 1)$.
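This can be verified numerically with autograd on the cell-state update alone (a standalone sketch, independent of `nn.LSTM` internals):

```python
import torch

c_prev = torch.randn(5, requires_grad=True)   # previous cell state
f = torch.sigmoid(torch.randn(5))             # forget gate (values in (0, 1))
i = torch.sigmoid(torch.randn(5))             # input gate
c_tilde = torch.tanh(torch.randn(5))          # candidate cell state

c_t = f * c_prev + i * c_tilde                # additive LSTM cell update
c_t.sum().backward()

# The gradient of c_t w.r.t. c_{t-1} is exactly the forget gate value.
print(torch.allclose(c_prev.grad, f))         # True
```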
Question 23
Scheduled sampling is a technique that:
- A) Schedules which batches to process first during training
- B) Gradually reduces the teacher forcing ratio during training to bridge the gap between training and inference
- C) Schedules the learning rate decay
- D) Determines the order in which sequences are processed
Answer
**B)** Scheduled sampling starts training with a high teacher forcing ratio (mostly using ground-truth inputs) and gradually reduces it, transitioning toward free-running mode (using the model's own predictions). This helps mitigate exposure bias by gradually exposing the model to its own errors during training.
Question 24
What is the "beam search curse"?
- A) Beam search always produces shorter sequences than greedy decoding
- B) Very large beam widths can paradoxically decrease output quality
- C) Beam search cannot be used with attention mechanisms
- D) Beam search requires exponential computation time
Answer
**B)** Counter-intuitively, very large beam widths can lead to worse outputs: the search finds sequences that are higher-probability under the model but lower-quality by human standards, often degenerately short ones. This reveals a mismatch between the model's probability distribution and actual quality.
Question 25
When processing sequences of different lengths in a batch, what is the correct order of operations?
- A) Pad all sequences to the same length, then pack, then pass through the RNN, then unpack
- B) Truncate all sequences to the shortest length, then process
- C) Process each sequence individually (batch size 1)
- D) Sort sequences by length, then pad, then process
Answer
**A)** The standard procedure is: (1) pad sequences to the maximum length in the batch, (2) use `pack_padded_sequence` to create a packed representation, (3) pass through the RNN, and (4) use `pad_packed_sequence` to recover the padded output. Sorting by length (D) can be done for efficiency but is not required with `enforce_sorted=False`.
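The full pipeline as a sketch (the sequence lengths below are arbitrary):

```python
import torch
import torch.nn as nn
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# Three sequences of different lengths, each (seq_len, input_size).
seqs = [torch.randn(n, 50) for n in (7, 4, 5)]
lengths = torch.tensor([7, 4, 5])

# (1) Pad to the longest sequence in the batch.
padded = pad_sequence(seqs, batch_first=True)            # (3, 7, 50)

# (2) Pack, so the RNN skips the padding positions.
packed = pack_padded_sequence(padded, lengths, batch_first=True, enforce_sorted=False)

# (3) Run the RNN on the packed batch.
lstm = nn.LSTM(input_size=50, hidden_size=100, batch_first=True)
packed_out, (h_n, c_n) = lstm(packed)

# (4) Unpack back to a padded tensor for downstream layers.
output, out_lengths = pad_packed_sequence(packed_out, batch_first=True)  # (3, 7, 100)
```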
Question 26
Which of the following is NOT a valid strategy for initializing the hidden state $\mathbf{h}_0$ of an RNN?
- A) Initialize to zeros
- B) Initialize as a learnable parameter
- C) Initialize from the output of another network
- D) All of the above are valid strategies
Answer
**D)** All three are valid strategies. Zero initialization is the most common default. A learnable $\mathbf{h}_0$ can be beneficial when the initial state matters. Initializing from another network's output is used in encoder-decoder architectures where the decoder's initial state comes from the encoder.
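A learnable $\mathbf{h}_0$ is typically just an `nn.Parameter` expanded to the batch size (a minimal sketch):

```python
import torch
import torch.nn as nn

class GRUWithLearnedInit(nn.Module):
    def __init__(self, input_size=50, hidden_size=100):
        super().__init__()
        self.rnn = nn.GRU(input_size, hidden_size, batch_first=True)
        # Learnable h_0, shared across the batch: (num_layers, 1, hidden_size).
        self.h0 = nn.Parameter(torch.zeros(1, 1, hidden_size))

    def forward(self, x):
        h0 = self.h0.expand(-1, x.size(0), -1).contiguous()
        return self.rnn(x, h0)

out, h_n = GRUWithLearnedInit()(torch.randn(4, 12, 50))
```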
Question 27
State space models like S4 and Mamba relate to RNNs because:
- A) They are identical to vanilla RNNs
- B) They use recurrent-like sequential processing but with linear dynamics that enable parallel training
- C) They have completely replaced RNNs in all applications
- D) They cannot process sequences at all
Answer
**B)** State space models maintain a recurrent hidden state like RNNs, but their state updates are linear (with structured or input-dependent parameters). That linearity allows training to be parallelized across the sequence, for example via convolutions or parallel scans, while inference can still run as a constant-memory, step-by-step recurrence.