Chapter 21: Quiz
Multiple Choice Questions
Question 1
In the autoregressive factorization $P(x_1, x_2, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_{<t})$, which of the following is true?
A) This factorization is an approximation that assumes conditional independence.
B) This factorization is exact and follows from the chain rule of probability.
C) This factorization requires that the tokens are independent.
D) This factorization only works for natural language, not other sequence types.
Answer: B
The chain rule of probability allows any joint distribution to be decomposed into a product of conditionals. This is an identity, not an approximation. No independence assumptions are made.
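As a concrete instance of this identity, for $T = 3$ the chain rule gives:
$$P(x_1, x_2, x_3) = P(x_1)\, P(x_2 \mid x_1)\, P(x_3 \mid x_1, x_2).$$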
Question 2
What is the primary purpose of the causal mask in a decoder-only Transformer?
A) To reduce the computational cost of attention.
B) To prevent each position from attending to future positions.
C) To ensure that the model can handle variable-length sequences.
D) To add positional information to the token representations.
Answer: B
The causal mask prevents each position from attending to future positions (positions with higher indices). This ensures the autoregressive property: the prediction at position $t$ depends only on tokens at positions $1, \ldots, t$.
Question 3
In the causal mask matrix for a sequence of length 6, how many positions can the token at position 4 (1-indexed) attend to?
A) 3
B) 4
C) 5
D) 6
Answer: B
The token at position 4 can attend to positions 1, 2, 3, and 4 (itself and all preceding positions), for a total of 4 positions.
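A minimal sketch of how such a mask is commonly constructed in PyTorch (the variable names are illustrative):

```python
import torch

T = 6
# Boolean causal mask: entry (i, j) is True when position i may attend to position j (j <= i).
causal_mask = torch.tril(torch.ones(T, T)).bool()

# Row index 3 corresponds to the token at position 4 (1-indexed);
# it contains exactly four True entries, i.e. positions 1 through 4.
print(causal_mask.int())
print(causal_mask[3].sum().item())  # 4
```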
Question 4
Which of the following is NOT a difference between GPT-1 and GPT-2?
A) GPT-2 uses pre-norm (layer norm before attention) instead of post-norm.
B) GPT-2 has a larger vocabulary.
C) GPT-2 uses sinusoidal positional encodings instead of learned ones.
D) GPT-2 has a longer context window (1024 vs. 512 tokens).
Answer: C
Both GPT-1 and GPT-2 use learned positional embeddings, not sinusoidal ones. The switch from post-norm to pre-norm, the larger vocabulary, and the longer context window are all real differences.
Question 5
What is "weight tying" in the context of a language model?
A) Using the same learning rate for all layers.
B) Sharing the token embedding matrix with the output projection matrix.
C) Tying the key and value projection matrices together.
D) Forcing all attention heads to have the same weights.
Answer: B
Weight tying means using the same matrix for both the input token embeddings and the output projection (language modeling head). This reduces the number of parameters and provides implicit regularization.
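A minimal PyTorch sketch of the idea (the sizes are illustrative GPT-2-like values, not prescribed by the text):

```python
import torch.nn as nn

vocab_size, d_model = 50257, 768  # illustrative sizes

token_emb = nn.Embedding(vocab_size, d_model)          # input embeddings: (vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)   # output projection: (vocab_size, d_model)

# Weight tying: both modules share one parameter tensor, saving vocab_size * d_model parameters.
lm_head.weight = token_emb.weight
```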
Question 6
When using temperature scaling with $\tau = 0.1$ on the logits before sampling, the resulting distribution will be:
A) Nearly uniform across all tokens.
B) Very peaked, with almost all probability mass on the most likely token.
C) Identical to the original distribution.
D) Inverted, with the least likely token becoming most likely.
Answer: B
As $\tau \to 0$, the softmax distribution becomes increasingly peaked. With $\tau = 0.1$, the logits are divided by 0.1 (multiplied by 10), which greatly amplifies the differences between logits, making the distribution very peaked around the highest-logit token.
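A quick illustration in PyTorch (the toy logits are chosen only for the example):

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0])  # toy logits for a 4-token vocabulary

for tau in (1.0, 0.1):
    probs = torch.softmax(logits / tau, dim=-1)
    print(f"tau={tau}: {probs.tolist()}")
# At tau = 1.0 the distribution is moderately spread out;
# at tau = 0.1 almost all of the mass sits on the first (highest-logit) token.
```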
Question 7
In nucleus (top-p) sampling with $p = 0.9$, which set of tokens is sampled from?
A) Always the top 90% of the vocabulary by count.
B) The smallest set of tokens whose cumulative probability exceeds 0.9.
C) A random 90% subset of the vocabulary.
D) The tokens whose individual probability exceeds 0.9.
Answer: B
Nucleus sampling sorts tokens by probability and includes the minimum number of tokens needed for their cumulative probability to exceed $p$. This set adapts in size depending on the distribution's shape.
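A compact sketch of one common way to implement this (the function name and details are illustrative):

```python
import torch

def nucleus_sample(logits: torch.Tensor, p: float = 0.9) -> int:
    """Sample from the smallest prefix of the sorted distribution whose total mass exceeds p."""
    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    # Keep a token if the cumulative mass *before* it is still below p; always keep the top token.
    keep = (cumulative - sorted_probs) < p
    keep[0] = True
    kept = sorted_probs * keep
    kept = kept / kept.sum()                        # renormalize over the nucleus
    choice = torch.multinomial(kept, num_samples=1)
    return sorted_idx[choice].item()
```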
Question 8
What is the computational advantage of KV caching during autoregressive generation?
A) It reduces the number of layers that need to be computed.
B) It eliminates the need for the softmax operation.
C) It avoids recomputing keys and values for previously processed tokens.
D) It reduces the vocabulary size during generation.
Answer: C
KV caching stores the key and value tensors computed for all previous tokens. At each generation step, only the new token's key and value need to be computed, avoiding redundant recomputation of KV pairs for the entire context.
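A shape-level sketch of the cache for a single attention layer (projections and masking are omitted, and all names are illustrative):

```python
import torch

class KVCache:
    """Store keys/values for past tokens so each step only computes the new token's pair."""
    def __init__(self):
        self.k = None  # shape grows to (t, n_heads, d_head)
        self.v = None

    def append(self, new_k: torch.Tensor, new_v: torch.Tensor):
        self.k = new_k if self.k is None else torch.cat([self.k, new_k], dim=0)
        self.v = new_v if self.v is None else torch.cat([self.v, new_v], dim=0)
        return self.k, self.v

n_heads, d_head = 20, 64
cache = KVCache()
for _ in range(3):                             # three generation steps
    k_t = torch.randn(1, n_heads, d_head)      # only the newest token's key ...
    v_t = torch.randn(1, n_heads, d_head)      # ... and value are computed this step
    k_all, v_all = cache.append(k_t, v_t)      # attention reads the full cached sequence
```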
Question 9
For a GPT-2-sized model (36 layers, 20 heads, head dimension 64) generating with a context of 1024 tokens in float16, approximately how much memory does the KV cache require per sequence?
A) ~19 MB
B) ~94 MB
C) ~189 MB
D) ~378 MB
Answer: C
KV cache $= 2 \times 36 \times 1024 \times 20 \times 64 \times 2$ bytes (a factor of 2 for keys and values, 36 layers, 1024 positions, 20 heads, head dimension 64, and 2 bytes per float16 value) $= 188,743,680$ bytes $\approx 189$ MB.
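The same arithmetic, spelled out in plain Python:

```python
kv_factor  = 2     # one key tensor and one value tensor
n_layers   = 36
seq_len    = 1024
n_heads    = 20
d_head     = 64
bytes_fp16 = 2

total_bytes = kv_factor * n_layers * seq_len * n_heads * d_head * bytes_fp16
print(total_bytes)        # 188743680
print(total_bytes / 1e6)  # ~188.7 MB per sequence
```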
Question 10
Which activation function is used in GPT-2's feed-forward network?
A) ReLU
B) Sigmoid
C) GELU
D) Tanh
Answer: C
GPT-2 uses the Gaussian Error Linear Unit (GELU) activation function, which provides a smooth approximation to ReLU and has been found to work well in Transformer models.
Question 11
If a language model has a perplexity of 1 on a test set, what does this mean?
A) The model is perfectly uncertain about every token.
B) The model perfectly predicts every token in the test set.
C) The model has not been trained.
D) The model assigns uniform probability to all tokens.
Answer: B
Perplexity $= \exp(\text{average cross-entropy loss})$. A perplexity of 1 means the cross-entropy loss is 0, which means the model assigns probability 1 to every correct next token. This would represent perfect prediction.
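A minimal PyTorch sketch of the relationship (the toy logits and targets are illustrative):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(3, 5)            # 3 positions, 5-token toy vocabulary
targets = torch.tensor([1, 4, 2])     # the true next tokens

loss = F.cross_entropy(logits, targets)   # average negative log-likelihood per token
perplexity = torch.exp(loss)
# perplexity equals 1 only when loss is 0, i.e. every correct token receives probability 1.
```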
Question 12
Which innovation was the PRIMARY contribution of GPT-3?
A) The Transformer architecture.
B) Pre-training followed by fine-tuning.
C) In-context learning (few-shot prompting without gradient updates).
D) Byte-pair encoding tokenization.
Answer: C
GPT-3's key contribution was demonstrating that a sufficiently large language model can learn new tasks from just a few examples provided in the prompt (in-context learning), without any gradient updates or fine-tuning.
Question 13
In the pre-norm Transformer block, the residual connection is:
A) $h' = \text{LayerNorm}(h + \text{Attention}(h))$
B) $h' = h + \text{Attention}(\text{LayerNorm}(h))$
C) $h' = \text{LayerNorm}(h) + \text{Attention}(h)$
D) $h' = h + \text{LayerNorm}(\text{Attention}(h))$
Answer: B
In the pre-norm architecture, layer normalization is applied to the input before it enters the sub-layer (attention or FFN), and the residual connection adds the original input to the sub-layer output.
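A minimal sketch of a pre-norm block in PyTorch; it uses nn.MultiheadAttention for brevity, whereas a GPT implementation would typically write the attention by hand:

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """h + Sublayer(LayerNorm(h)) for both the attention and the feed-forward sub-layers."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, h: torch.Tensor, attn_mask: torch.Tensor) -> torch.Tensor:
        x = self.ln1(h)  # normalize first, then add the residual to the sub-layer output
        h = h + self.attn(x, x, x, attn_mask=attn_mask, need_weights=False)[0]
        h = h + self.ffn(self.ln2(h))
        return h
```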
Question 14
During training, the causal language model processes a chunk of $T$ tokens. How many next-token predictions does it make in a single forward pass?
A) 1
B) $T - 1$
C) $T$
D) $T^2$
Answer: C
The model makes $T$ predictions: position 0 predicts token 1, position 1 predicts token 2, ..., position $T-1$ predicts token $T$. All predictions are computed in parallel thanks to the causal mask. (Note: depending on how the input/target split is set up, this could also be $T-1$; both answers are defensible depending on implementation.)
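A sketch of the usual input/target shift (the token ids are made up for the example):

```python
import torch

tokens = torch.tensor([15, 7, 42, 3, 9, 28])  # a chunk of T + 1 = 6 token ids

inputs  = tokens[:-1]   # the T tokens fed to the model
targets = tokens[1:]    # the "next token" at each of those T positions

# One forward pass over `inputs` yields T logit vectors, i.e. T next-token predictions,
# all computed in parallel under the causal mask.
```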
Question 15
What is the purpose of gradient clipping during language model training?
A) To speed up training by skipping small gradients.
B) To prevent exploding gradients by capping the gradient norm.
C) To implement L2 regularization.
D) To ensure the model uses only positive gradients.
Answer: B
Gradient clipping caps the norm of the gradient vector to a maximum value. This prevents exploding gradients, which can cause training instability, especially in deep models.
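In PyTorch this is typically one line between the backward pass and the optimizer step (the stand-in model and the max norm below are illustrative):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 10)                       # stand-in for the language model
loss = model(torch.randn(4, 10)).pow(2).mean()  # stand-in loss
loss.backward()

# Rescale the gradients so their global L2 norm does not exceed max_norm.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
# optimizer.step() would follow here.
```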
Question 16
Greedy decoding tends to produce which type of text?
A) Highly creative and diverse text.
B) Repetitive and generic text.
C) Randomly incoherent text.
D) Text that closely matches the training data verbatim.
Answer: B
Greedy decoding always selects the highest-probability token, which tends to produce repetitive, generic text. It often gets stuck in loops (e.g., repeating the same phrase) because it always follows the single highest-probability path.
Question 17
In the FFN of a GPT-style model with hidden dimension $d = 768$, the inner dimension is typically:
A) $d / 4 = 192$
B) $d = 768$
C) $2d = 1536$
D) $4d = 3072$
Answer: D
The standard expansion ratio for the feed-forward network in GPT-style models is 4x, so the inner dimension is $4d = 3072$ for $d = 768$.
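A sketch of the corresponding module in PyTorch, combining the 4x expansion with the GELU activation from Question 10:

```python
import torch.nn as nn

d_model = 768
ffn = nn.Sequential(
    nn.Linear(d_model, 4 * d_model),  # expand: 768 -> 3072
    nn.GELU(),                        # GPT-2's activation
    nn.Linear(4 * d_model, d_model),  # project back: 3072 -> 768
)
```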
Question 18
What advantage does Rotary Positional Embedding (RoPE) have over learned absolute positional embeddings?
A) It uses fewer parameters.
B) It enables the model to generalize to longer sequences than seen during training.
C) It eliminates the need for attention.
D) It is simpler to implement.
Answer: B
RoPE encodes relative position information through rotation, meaning the attention score between two tokens depends only on their relative distance, not their absolute positions. This allows the model to potentially generalize to sequence lengths longer than those seen during training.
Question 19
Which of the following is a correct ordering of training stages for a model like ChatGPT?
A) RLHF, then pre-training, then SFT.
B) SFT, then pre-training, then RLHF.
C) Pre-training, then SFT, then RLHF.
D) Pre-training, then RLHF, then SFT.
Answer: C
The standard pipeline is: (1) pre-training with next-token prediction on a large corpus, (2) supervised fine-tuning (SFT) on instruction-response pairs, and (3) reinforcement learning from human feedback (RLHF) to align the model with human preferences.
Question 20
In top-k sampling with $k = 1$, the generation strategy is equivalent to:
A) Uniform random sampling.
B) Nucleus sampling with $p = 0$.
C) Greedy decoding.
D) Temperature sampling with $\tau = 1$.
Answer: C
With $k = 1$, only the single most likely token is included in the sampling set, which is exactly greedy decoding (always selecting the most probable token).
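A small sketch that makes the equivalence concrete (the helper name is illustrative):

```python
import torch

def top_k_sample(logits: torch.Tensor, k: int) -> int:
    """Sample among the k highest-logit tokens."""
    topk_logits, topk_idx = torch.topk(logits, k)
    probs = torch.softmax(topk_logits, dim=-1)
    return topk_idx[torch.multinomial(probs, num_samples=1)].item()

logits = torch.tensor([2.0, 0.5, -1.0, 1.0])
# With k = 1 the candidate set is a single token, so sampling always returns the argmax:
assert top_k_sample(logits, k=1) == torch.argmax(logits).item()
```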
Question 21
What is the feed-forward inner dimension in a GPT model where $d_{\text{model}} = 1600$ and the standard 4x expansion ratio is used?
A) 400
B) 1600
C) 3200
D) 6400
Answer: D
With a 4x expansion ratio, $d_{ff} = 4 \times 1600 = 6400$.
Question 22
If we increase the temperature from 1.0 to 2.0, what happens to the entropy of the token distribution?
A) Entropy decreases.
B) Entropy stays the same.
C) Entropy increases.
D) The effect depends on the original distribution.
Answer: C
Higher temperature flattens the distribution (makes it more uniform), which increases entropy. Entropy is maximized for a uniform distribution and minimized for a delta distribution.
Question 23
In the context of autoregressive generation, what does "degenerate repetition" refer to?
A) The model generates tokens from a degenerate probability distribution.
B) The model gets stuck in a loop, repeating the same phrase or pattern.
C) The model generates increasingly shorter responses.
D) The model's performance degenerates over many generation steps.
Answer: B
Degenerate repetition refers to the well-known failure mode where autoregressive models, especially with greedy or low-temperature decoding, get stuck repeating the same word, phrase, or pattern indefinitely.
Question 24
Which of the following is NOT a strategy to combat repetitive generation?
A) Temperature scaling.
B) Repetition penalty.
C) Causal masking.
D) Nucleus sampling.
Answer: C
Causal masking is a fundamental architectural feature that ensures the autoregressive property---it is not a strategy to combat repetition. Temperature scaling, repetition penalty, and nucleus sampling are all strategies that help prevent or mitigate repetitive generation.
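A sketch of one common repetition-penalty formulation, which divides or multiplies the logits of already-generated tokens by a penalty factor (the function name and default value are illustrative):

```python
import torch

def apply_repetition_penalty(logits: torch.Tensor, generated_ids: list, penalty: float = 1.2) -> torch.Tensor:
    """Down-weight tokens that already appear in the generated context."""
    logits = logits.clone()
    for idx in set(generated_ids):
        # Positive logits are divided, negative logits multiplied, so the token always becomes less likely.
        logits[idx] = logits[idx] / penalty if logits[idx] > 0 else logits[idx] * penalty
    return logits
```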
Question 25
For a model with block size 512, what happens if you try to generate token 513 without any truncation or sliding window?
A) The model automatically extends its context window.
B) The model crashes or produces an error.
C) The model wraps around and starts overwriting earlier positions.
D) The model ignores the extra token.
Answer: B
If the model uses learned positional embeddings of maximum length 512, attempting to access position 513 will cause an index-out-of-bounds error. The model cannot process sequences longer than its block size without explicit handling (such as truncating the context to the most recent 512 tokens).
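A sketch of the usual workaround, cropping the context before each forward pass (names are illustrative):

```python
block_size = 512

def crop_context(token_ids: list) -> list:
    """Keep only the most recent block_size tokens so positional embeddings are never indexed out of range."""
    return token_ids[-block_size:]

# In a generation loop: logits = model(crop_context(generated_so_far)) at each step.
```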