Chapter 21: Quiz
Multiple Choice Questions
Question 1
In the autoregressive factorization $P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_{<t})$, which of the following is true?
A) This factorization is an approximation that assumes conditional independence.
B) This factorization is exact and follows from the chain rule of probability.
C) This factorization requires that the tokens are independent.
D) This factorization only works for natural language, not other sequence types.
Answer: B
The chain rule of probability allows any joint distribution to be decomposed into a product of conditionals. This is an identity, not an approximation. No independence assumptions are made.
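As a concrete instance of this identity, for $T = 3$ the chain rule gives:
$$P(x_1, x_2, x_3) = P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_1, x_2).$$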
Question 2
What is the primary purpose of the causal mask in a decoder-only Transformer?
A) To reduce the computational cost of attention.
B) To prevent each position from attending to future positions.
C) To ensure that the model can handle variable-length sequences.
D) To add positional information to the token representations.
Answer: B
The causal mask prevents each position from attending to future positions (positions with higher indices). This ensures the autoregressive property: the prediction at position $t$ depends only on tokens at positions $1, \ldots, t$.
Question 3
In the causal mask matrix for a sequence of length 6, how many positions can the token at position 4 (1-indexed) attend to?
A) 3
B) 4
C) 5
D) 6
Answer: B
The token at position 4 can attend to positions 1, 2, 3, and 4 (itself and all preceding positions), for a total of 4 positions.
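A minimal sketch of how such a mask is commonly constructed in PyTorch (the variable names are illustrative):

```python
import torch

T = 6
# Boolean causal mask: entry (i, j) is True when position i may attend to position j (j <= i).
causal_mask = torch.tril(torch.ones(T, T)).bool()

# Row index 3 corresponds to the token at position 4 (1-indexed);
# it contains exactly four True entries, i.e. positions 1 through 4.
print(causal_mask.int())
print(causal_mask[3].sum().item())  # 4
```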
Question 4
Which of the following is NOT a difference between GPT-1 and GPT-2?
A) GPT-2 uses pre-norm (layer norm before attention) instead of post-norm.
B) GPT-2 has a larger vocabulary.
C) GPT-2 uses sinusoidal positional encodings instead of learned ones.
D) GPT-2 has a longer context window (1024 vs. 512 tokens).
Answer: C
Both GPT-1 and GPT-2 use learned positional embeddings, not sinusoidal ones. The switch from post-norm to pre-norm, the larger vocabulary, and the longer context window are all real differences.
Question 5
What is "weight tying" in the context of a language model?
A) Using the same learning rate for all layers.
B) Sharing the token embedding matrix with the output projection matrix.
C) Tying the key and value projection matrices together.
D) Forcing all attention heads to have the same weights.
Answer: B
Weight tying means using the same matrix for both the input token embeddings and the output projection (language modeling head). This reduces the number of parameters and provides implicit regularization.
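A minimal PyTorch sketch of the idea (the sizes are illustrative GPT-2-like values, not prescribed by the text):

```python
import torch.nn as nn

vocab_size, d_model = 50257, 768  # illustrative sizes

token_emb = nn.Embedding(vocab_size, d_model)          # input embeddings: (vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)   # output projection: (vocab_size, d_model)

# Weight tying: both modules share one parameter tensor, saving vocab_size * d_model parameters.
lm_head.weight = token_emb.weight
```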
Question 6
When using temperature scaling with $\tau = 0.1$ on the logits before sampling, the resulting distribution will be:
A) Nearly uniform across all tokens.
B) Very peaked, with almost all probability mass on the most likely token.
C) Identical to the original distribution.
D) Inverted, with the least likely token becoming most likely.
Answer: B
As $\tau \to 0$, the softmax distribution becomes increasingly peaked. With $\tau = 0.1$, the logits are divided by 0.1 (multiplied by 10), which greatly amplifies the differences between logits, making the distribution very peaked around the highest-logit token.
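A quick illustration in PyTorch (the toy logits are chosen only for the example):

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # toy logits for a 4-token vocabulary

for tau in (1.0, 0.1):
    probs = torch.softmax(logits / tau, dim=-1)
    print(f"tau={tau}: {probs.tolist()}")
# At tau = 1.0 the distribution is moderately spread out;
# at tau = 0.1 almost all of the mass sits on the first (highest-logit) token.
```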
Question 7
In nucleus (top-p) sampling with $p = 0.9$, which set of tokens is sampled from?
A) Always the top 90% of the vocabulary by count.
B) The smallest set of tokens whose cumulative probability exceeds 0.9.
C) A random 90% subset of the vocabulary.
D) The tokens whose individual probability exceeds 0.9.
Answer: B
Nucleus sampling sorts tokens by probability and includes the minimum number of tokens needed for their cumulative probability to exceed $p$. This set adapts in size depending on the distribution's shape.
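A compact sketch of one common way to implement this (the function name and details are illustrative):

```python
import torch

def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    """Sample from the smallest prefix of the sorted distribution whose total mass exceeds p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep a token if the cumulative mass *before* it is still below p; always keep the top token.
    keep = (cumulative - sorted_probs) < p
    keep[0] = True
    kept = sorted_probs * keep
    kept = kept / kept.sum()                        # renormalize over the nucleus
    choice = torch.multinomial(kept, num_samples=1)
    return sorted_idx[choice].item()
```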
Question 8
What is the computational advantage of KV caching during autoregressive generation?
A) It reduces the number of layers that need to be computed.
B) It eliminates the need for the softmax operation.
C) It avoids recomputing keys and values for previously processed tokens.
D) It reduces the vocabulary size during generation.
Answer: C
KV caching stores the key and value tensors computed for all previous tokens. At each generation step, only the new token's key and value need to be computed, avoiding redundant recomputation of KV pairs for the entire context.
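A shape-level sketch of the cache for a single attention layer (projections and masking are omitted, and all names are illustrative):

```python
import torch

class KVCache:
    """Store keys/values for past tokens so each step only computes the new token's pair."""
    def __init__(self):
        self.k = None  # shape grows to (t, n_heads, d_head)
        self.v = None

    def append(self, new_k: torch.Tensor, new_v: torch.Tensor):
        self.k = new_k if self.k is None else torch.cat([self.k, new_k], dim=0)
        self.v = new_v if self.v is None else torch.cat([self.v, new_v], dim=0)
        return self.k, self.v

n_heads, d_head = 20, 64
cache = KVCache()
for _ in range(3):                             # three generation steps
    k_t = torch.randn(1, n_heads, d_head)      # only the newest token's key ...
    v_t = torch.randn(1, n_heads, d_head)      # ... and value are computed this step
    k_all, v_all = cache.append(k_t, v_t)      # attention reads the full cached sequence
```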
Question 9
For a GPT-2-sized model (36 layers, 20 heads, head dimension 64) generating with a context of 1024 tokens in float16, approximately how much memory does the KV cache require per sequence?
A) ~19 MB
B) ~94 MB
C) ~189 MB
D) ~378 MB
Answer: C
KV cache $= 2 \times 36 \times 1024 \times 20 \times 64 \times 2$ bytes (a factor of 2 for keys and values, 36 layers, 1024 positions, 20 heads, head dimension 64, and 2 bytes per float16 value) $= 188,743,680$ bytes $\approx 189$ MB.
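The same arithmetic, spelled out in plain Python:

```python
kv_factor  = 2     # one key tensor and one value tensor
n_layers   = 36
seq_len    = 1024
n_heads    = 20
d_head     = 64
bytes_fp16 = 2

total_bytes = kv_factor * n_layers * seq_len * n_heads * d_head * bytes_fp16
print(total_bytes)        # 188743680
print(total_bytes / 1e6)  # ~188.7 MB per sequence
```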
Question 10
Which activation function is used in GPT-2's feed-forward network?
A) ReLU
B) Sigmoid
C) GELU
D) Tanh
Answer: C
GPT-2 uses the Gaussian Error Linear Unit (GELU) activation function, which provides a smooth approximation to ReLU and has been found to work well in Transformer models.
Question 11
If a language model has a perplexity of 1 on a test set, what does this mean?
A) The model is perfectly uncertain about every token.
B) The model perfectly predicts every token in the test set.
C) The model has not been trained.
D) The model assigns uniform probability to all tokens.
Answer: B
Perplexity $= \exp(\text{average cross-entropy loss})$. A perplexity of 1 means the cross-entropy loss is 0, which means the model assigns probability 1 to every correct next token. This would represent perfect prediction.
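A minimal PyTorch sketch of the relationship (the toy logits and targets are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(3, 5)            # 3 positions, 5-token toy vocabulary
targets = torch.tensor([1, 4, 2])     # the true next tokens

loss = F.cross_entropy(logits, targets)   # average negative log-likelihood per token
perplexity = torch.exp(loss)
# perplexity equals 1 only when loss is 0, i.e. every correct token receives probability 1.
```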
Question 12
Which innovation was the PRIMARY contribution of GPT-3?
A) The Transformer architecture.
B) Pre-training followed by fine-tuning.
C) In-context learning (few-shot prompting without gradient updates).
D) Byte-pair encoding tokenization.
Answer: C
GPT-3's key contribution was demonstrating that a sufficiently large language model can learn new tasks from just a few examples provided in the prompt (in-context learning), without any gradient updates or fine-tuning.
Question 13
In the pre-norm Transformer block, the residual connection is:
A) $h' = \text{LayerNorm}(h + \text{Attention}(h))$
B) $h' = h + \text{Attention}(\text{LayerNorm}(h))$
C) $h' = \text{LayerNorm}(h) + \text{Attention}(h)$
D) $h' = h + \text{LayerNorm}(\text{Attention}(h))$
Answer: B
In the pre-norm architecture, layer normalization is applied to the input before it enters the sub-layer (attention or FFN), and the residual connection adds the original input to the sub-layer output.
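A minimal sketch of a pre-norm block in PyTorch; it uses nn.MultiheadAttention for brevity, whereas a GPT implementation would typically write the attention by hand:

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """h + Sublayer(LayerNorm(h)) for both the attention and the feed-forward sub-layers."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, h: torch.Tensor, attn_mask: torch.Tensor) -> torch.Tensor:
        x = self.ln1(h)  # normalize first, then add the residual to the sub-layer output
        h = h + self.attn(x, x, x, attn_mask=attn_mask, need_weights=False)[0]
        h = h + self.ffn(self.ln2(h))
        return h
```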
Question 14
During training, the causal language model processes a chunk of $T$ tokens. How many next-token predictions does it make in a single forward pass?
A) 1
B) $T - 1$
C) $T$
D) $T^2$
Answer: C
The model makes $T$ predictions: position 0 predicts token 1, position 1 predicts token 2, ..., position $T-1$ predicts token $T$. All predictions are computed in parallel thanks to the causal mask. (Note: depending on how the input/target split is set up, this could also be $T-1$; both answers are defensible depending on implementation.)
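A sketch of the usual input/target shift (the token ids are made up for the example):

```python
import torch

tokens = torch.tensor([15, 7, 42, 3, 9, 28])  # a chunk of T + 1 = 6 token ids

inputs  = tokens[:-1]   # the T tokens fed to the model
targets = tokens[1:]    # the "next token" at each of those T positions

# One forward pass over `inputs` yields T logit vectors, i.e. T next-token predictions,
# all computed in parallel under the causal mask.
```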
Question 15
What is the purpose of gradient clipping during language model training?
A) To speed up training by skipping small gradients.
B) To prevent exploding gradients by capping the gradient norm.
C) To implement L2 regularization.
D) To ensure the model uses only positive gradients.
Answer: B
Gradient clipping caps the norm of the gradient vector to a maximum value. This prevents exploding gradients, which can cause training instability, especially in deep models.
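In PyTorch this is typically one line between the backward pass and the optimizer step (the stand-in model and the max norm below are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)                       # stand-in for the language model
loss = model(torch.randn(4, 10)).pow(2).mean()  # stand-in loss
loss.backward()

# Rescale the gradients so their global L2 norm does not exceed max_norm.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# optimizer.step() would follow here.
```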
Question 16
Greedy decoding tends to produce which type of text?
A) Highly creative and diverse text.
B) Repetitive and generic text.
C) Randomly incoherent text.
D) Text that closely matches the training data verbatim.
Answer: B
Greedy decoding always selects the highest-probability token, which tends to produce repetitive, generic text. It often gets stuck in loops (e.g., repeating the same phrase) because it always follows the single highest-probability path.
Question 17
In the FFN of a GPT-style model with hidden dimension $d = 768$, the inner dimension is typically:
A) $d / 4 = 192$
B) $d = 768$
C) $2d = 1536$
D) $4d = 3072$
Answer: D
The standard expansion ratio for the feed-forward network in GPT-style models is 4x, so the inner dimension is $4d = 3072$ for $d = 768$.
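A sketch of the corresponding module in PyTorch, combining the 4x expansion with the GELU activation from Question 10:

```python
import torch.nn as nn

d_model = 768
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),  # expand: 768 -> 3072
    nn.GELU(),                        # GPT-2's activation
    nn.Linear(4 * d_model, d_model),  # project back: 3072 -> 768
)
```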
Question 18
What advantage does Rotary Positional Embedding (RoPE) have over learned absolute positional embeddings?
A) It uses fewer parameters.
B) It enables the model to generalize to longer sequences than seen during training.
C) It eliminates the need for attention.
D) It is simpler to implement.
Answer: B
RoPE encodes relative position information through rotation, meaning the attention score between two tokens depends only on their relative distance, not their absolute positions. This allows the model to potentially generalize to sequence lengths longer than those seen during training.
Question 19
Which of the following is a correct ordering of training stages for a model like ChatGPT?
A) RLHF, then pre-training, then SFT.
B) SFT, then pre-training, then RLHF.
C) Pre-training, then SFT, then RLHF.
D) Pre-training, then RLHF, then SFT.
Answer: C
The standard pipeline is: (1) pre-training with next-token prediction on a large corpus, (2) supervised fine-tuning (SFT) on instruction-response pairs, and (3) reinforcement learning from human feedback (RLHF) to align the model with human preferences.
Question 20
In top-k sampling with $k = 1$, the generation strategy is equivalent to:
A) Uniform random sampling.
B) Nucleus sampling with $p = 0$.
C) Greedy decoding.
D) Temperature sampling with $\tau = 1$.
Answer: C
With $k = 1$, only the single most likely token is included in the sampling set, which is exactly greedy decoding (always selecting the most probable token).
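A small sketch that makes the equivalence concrete (the helper name is illustrative):

```python
import torch

def top_k_sample(logits: torch.Tensor, k: int) -> int:
    """Sample among the k highest-logit tokens."""
    topk_logits, topk_idx = torch.topk(logits, k)
    probs = torch.softmax(topk_logits, dim=-1)
    return topk_idx[torch.multinomial(probs, num_samples=1)].item()

logits = torch.tensor([2.0, 0.5, -1.0, 1.0])
# With k = 1 the candidate set is a single token, so sampling always returns the argmax:
assert top_k_sample(logits, k=1) == torch.argmax(logits).item()
```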
Question 21
What is the feed-forward inner dimension in a GPT model where $d_{\text{model}} = 1600$ and the standard 4x expansion ratio is used?
A) 400
B) 1600
C) 3200
D) 6400
Answer: D
With a 4x expansion ratio, $d_{ff} = 4 \times 1600 = 6400$.
Question 22
If we increase the temperature from 1.0 to 2.0, what happens to the entropy of the token distribution?
A) Entropy decreases.
B) Entropy stays the same.
C) Entropy increases.
D) The effect depends on the original distribution.
Answer: C
Higher temperature flattens the distribution (makes it more uniform), which increases entropy. Entropy is maximized for a uniform distribution and minimized for a delta distribution.
Question 23
In the context of autoregressive generation, what does "degenerate repetition" refer to?
A) The model generates tokens from a degenerate probability distribution.
B) The model gets stuck in a loop, repeating the same phrase or pattern.
C) The model generates increasingly shorter responses.
D) The model's performance degenerates over many generation steps.
Answer: B
Degenerate repetition refers to the well-known failure mode where autoregressive models, especially with greedy or low-temperature decoding, get stuck repeating the same word, phrase, or pattern indefinitely.
Question 24
Which of the following is NOT a strategy to combat repetitive generation?
A) Temperature scaling.
B) Repetition penalty.
C) Causal masking.
D) Nucleus sampling.
Answer: C
Causal masking is a fundamental architectural feature that ensures the autoregressive property---it is not a strategy to combat repetition. Temperature scaling, repetition penalty, and nucleus sampling are all strategies that help prevent or mitigate repetitive generation.
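A sketch of one common repetition-penalty formulation, which divides or multiplies the logits of already-generated tokens by a penalty factor (the function name and default value are illustrative):

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor, generated_ids: list, penalty: float = 1.2) -> torch.Tensor:
    """Down-weight tokens that already appear in the generated context."""
    logits = logits.clone()
    for idx in set(generated_ids):
        # Positive logits are divided, negative logits multiplied, so the token always becomes less likely.
        logits[idx] = logits[idx] / penalty if logits[idx] > 0 else logits[idx] * penalty
    return logits
```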
Question 25
For a model with block size 512, what happens if you try to generate token 513 without any truncation or sliding window?
A) The model automatically extends its context window.
B) The model crashes or produces an error.
C) The model wraps around and starts overwriting earlier positions.
D) The model ignores the extra token.
Answer: B
If the model uses learned positional embeddings of maximum length 512, attempting to access position 513 will cause an index-out-of-bounds error. The model cannot process sequences longer than its block size without explicit handling (such as truncating the context to the most recent 512 tokens).
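A sketch of the usual workaround, cropping the context before each forward pass (names are illustrative):

```python
block_size = 512

def crop_context(token_ids: list) -> list:
    """Keep only the most recent block_size tokens so positional embeddings are never indexed out of range."""
    return token_ids[-block_size:]

# In a generation loop: logits = model(crop_context(generated_so_far)) at each step.
```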