Chapter 19: Quiz
Test your understanding of the Transformer architecture. Try to answer each question before revealing the solution.
Question 1
What is the primary motivation for removing recurrence from the Transformer architecture?
Show Answer
The primary motivation is **parallelization**. RNNs process tokens sequentially --- the hidden state at position $t$ depends on position $t-1$ --- which prevents parallel computation across sequence positions. Self-attention computes interactions between all positions simultaneously, allowing full parallelization during training. This leads to dramatically faster training on modern hardware (GPUs/TPUs).

A secondary benefit is shorter path lengths for long-range dependencies: in self-attention, any two positions interact in a single step, versus $O(n)$ steps in an RNN.

Question 2
In the sinusoidal positional encoding formula $PE_{(pos, 2i)} = \sin(pos / 10000^{2i/d_{\text{model}}})$, what role does the denominator $10000^{2i/d_{\text{model}}}$ play?
Show Answer
The denominator controls the **frequency** (or wavelength) of the sinusoid for each dimension. As $i$ increases (moving to higher dimensions), the denominator grows, making the sinusoid oscillate more slowly (lower frequency, longer wavelength). This creates a spectrum of frequencies:

- Low-order dimensions ($i$ small): high frequency, capturing fine-grained positional differences
- High-order dimensions ($i$ large): low frequency, capturing coarse positional patterns

The base of 10,000 was chosen so that the wavelengths range from $2\pi$ to $10000 \cdot 2\pi$, covering a wide range of scales.
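As a quick illustration, here is a minimal sketch of how this table could be built in PyTorch (assuming an even $d_{\text{model}}$ and the even/odd sine and cosine convention from the question):

```python
import torch

def sinusoidal_pe(max_len: int, d_model: int) -> torch.Tensor:
    """Build the (max_len, d_model) sinusoidal positional encoding table."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimension indices
    div = torch.pow(10000.0, i / d_model)                           # 10000^(2i/d_model)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos / div)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(pos / div)   # odd dimensions: cosine
    return pe

pe = sinusoidal_pe(max_len=128, d_model=512)
# Dimension 0 completes a cycle every ~6 positions; the highest dimensions barely move over 128 positions.
```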
Question 3
What is the difference between padding masking and causal masking in the Transformer?
Show Answer
**Padding masking** prevents attention from attending to padding tokens (positions that contain no real data but exist to make sequences in a batch the same length). It is applied in both the encoder and decoder.

**Causal masking** (look-ahead masking) prevents decoder positions from attending to future positions in the target sequence. It enforces the autoregressive property: position $t$ can only attend to positions $1, \ldots, t$. It is applied only in the decoder's self-attention, not in the encoder.

They serve completely different purposes: padding masking handles variable-length inputs, while causal masking preserves the autoregressive generation order.
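A minimal sketch of the two masks, assuming boolean masks where `True` means "may attend" and a hypothetical `PAD_IDX` of 0:

```python
import torch

PAD_IDX = 0  # hypothetical padding token id

def padding_mask(token_ids: torch.Tensor) -> torch.Tensor:
    """True where attention is allowed; False at padding key positions.
    token_ids: (batch, seq_len) -> mask: (batch, 1, 1, seq_len), broadcastable over heads and queries."""
    return (token_ids != PAD_IDX)[:, None, None, :]

def causal_mask(seq_len: int) -> torch.Tensor:
    """Lower-triangular mask: position t may attend only to positions <= t."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

tokens = torch.tensor([[5, 7, 2, PAD_IDX, PAD_IDX],
                       [3, 9, 4, 6, 1]])          # (batch=2, seq_len=5)
scores = torch.randn(2, 8, 5, 5)                  # (batch, heads, queries, keys)

mask = padding_mask(tokens) & causal_mask(5)      # a decoder's self-attention combines both
scores = scores.masked_fill(~mask, float("-inf")) # blocked positions get ~zero weight after softmax
attn = scores.softmax(dim=-1)
```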
Question 4
Why does the Transformer scale the embeddings by $\sqrt{d_{\text{model}}}$ before adding positional encodings?
Show Answer
The embedding layer is typically initialized with weights whose standard deviation is roughly proportional to $1/\sqrt{d_{\text{model}}}$ (depending on the initialization scheme), so individual embedding values are small. Meanwhile, the sinusoidal positional encodings have values bounded in $[-1, 1]$. If the model dimension is large (e.g., 512), the embedding values would be much smaller in magnitude than the positional encoding values, causing the positional signal to dominate. Multiplying embeddings by $\sqrt{d_{\text{model}}}$ scales them up so that the embedding and positional encoding signals are on comparable scales, allowing the model to effectively use both sources of information.
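A sketch of where this scaling sits in the input pipeline (the `pe` tensor below is only a stand-in for the sinusoidal table from Question 2):

```python
import math
import torch
import torch.nn as nn

d_model, vocab_size, max_len = 512, 32000, 128
embedding = nn.Embedding(vocab_size, d_model)
pe = torch.randn(max_len, d_model)          # stand-in; the real table is bounded in [-1, 1]

tokens = torch.randint(0, vocab_size, (2, 16))     # (batch, seq_len)
x = embedding(tokens) * math.sqrt(d_model)         # scale embeddings up to a magnitude comparable to the PE
x = x + pe[: tokens.size(1)]                       # add positional encoding (broadcast over the batch)
```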
Question 5
In the Transformer decoder, what are the three sublayers in each block, and what is the purpose of each?
Show Answer
1. **Masked multi-head self-attention**: Allows each decoder position to attend to all previous decoder positions (but not future ones, due to the causal mask). This lets the decoder use its own previous outputs as context.
2. **Multi-head cross-attention**: Allows the decoder to attend to the encoder's output. Queries come from the decoder; keys and values come from the encoder. This is how the decoder "reads" the source sequence.
3. **Position-wise feed-forward network**: Applies a two-layer MLP independently to each position, providing nonlinear transformation capacity and serving as a form of per-position processing.

Each sublayer is wrapped with a residual connection and layer normalization.

Question 6
What is the purpose of residual connections in the Transformer?
Show Answer
Residual connections serve two main purposes:

1. **Gradient flow**: They create a direct path for gradients to flow backward through the network. Without them, gradients must pass through every attention and FFN computation, which can cause vanishing or exploding gradients in deep networks.
2. **Incremental learning**: Each sublayer only needs to learn the *residual* (the difference from the identity function), which is generally easier than learning the full transformation. This allows the model to be deeper without degrading performance.

The residual stream also provides a useful conceptual framework: each sublayer reads from and writes additive updates to a shared information stream.

Question 7
What is the difference between pre-norm and post-norm configurations, and which is generally preferred for training stability?
Show Answer
**Post-norm** (original Transformer): Layer normalization is applied *after* the residual addition:

$$\text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))$$

**Pre-norm**: Layer normalization is applied *before* the sublayer:

$$\text{output} = x + \text{Sublayer}(\text{LayerNorm}(x))$$

**Pre-norm is generally preferred** for training stability because the residual path remains an unmodified identity mapping (the gradient flows through the addition directly). Post-norm places the layer norm on the gradient path, which can impede gradient flow. Pre-norm models are more robust to learning rate choices and train more stably, especially for deeper models. GPT-2, GPT-3, and most modern large language models use pre-norm.
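A minimal sketch of the two orderings as sublayer wrappers (dropout omitted for brevity):

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """x + Sublayer(LayerNorm(x)) -- the residual path is a pure identity."""
    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.sublayer(self.norm(x))

class PostNormBlock(nn.Module):
    """LayerNorm(x + Sublayer(x)) -- the original Transformer ordering."""
    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.norm(x + self.sublayer(x))

# Example with an FFN sublayer; the same wrappers apply to attention sublayers.
ffn = nn.Sequential(nn.Linear(512, 2048), nn.ReLU(), nn.Linear(2048, 512))
x = torch.randn(2, 16, 512)
y = PreNormBlock(512, ffn)(x)
```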
Question 8
Why is the feed-forward network called "position-wise"?
Show Answer
It is called "position-wise" because the same feed-forward network is applied **independently** to each position (token) in the sequence. There is no interaction between different positions within the FFN. The same weight matrices $W_1$ and $W_2$ are shared across all positions (parameter sharing), but each position's computation is independent.

Interactions between positions happen exclusively in the attention layers, not in the FFN. This separation of concerns --- attention for inter-position communication, FFN for per-position processing --- is a key design principle of the Transformer.
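A sketch that makes the independence concrete (the final check passes because each position's output depends only on that position's input):

```python
import torch
import torch.nn as nn

class PositionwiseFFN(nn.Module):
    """Both linear layers act on the last dimension only, so every position is
    transformed independently by the same W1/W2 weights."""
    def __init__(self, d_model: int = 512, d_ff: int = 2048):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)
        self.w2 = nn.Linear(d_ff, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (batch, seq_len, d_model)
        return self.w2(torch.relu(self.w1(x)))

ffn = PositionwiseFFN()
x = torch.randn(2, 16, 512)
assert torch.allclose(ffn(x)[:, 3], ffn(x[:, 3]), atol=1e-6)  # position 3 ignores all other positions
```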
Question 9
What is teacher forcing, and why is it used during Transformer training?
Show Answer
**Teacher forcing** means providing the correct (ground-truth) previous tokens as input to the decoder during training, rather than the model's own predictions. For example, when training to translate "Hello world" to "Bonjour monde," the decoder always receives the correct prefix as input, even if it would have predicted different tokens.

It is used because:

1. **Parallelization**: With teacher forcing and causal masking, all output positions can be computed in a single forward pass (no need for sequential generation).
2. **Training stability**: The model does not compound its own errors during training, which leads to faster and more stable convergence.
3. **Efficiency**: Sequential generation would require $T$ forward passes for a sequence of length $T$; teacher forcing requires just one.

The downside is exposure bias: at inference time, the model uses its own predictions (which may contain errors), creating a mismatch with training conditions.
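A sketch of the usual way teacher forcing is set up for a batch (the shift-by-one convention and the token ids here are illustrative; `model`, `src`, and `criterion` are placeholders):

```python
import torch

# tgt: (batch, seq_len) ground-truth target ids, e.g. [BOS, y1, y2, y3, EOS]
tgt = torch.tensor([[1, 45, 88, 12, 2]])   # hypothetical ids: 1 = BOS, 2 = EOS

decoder_input = tgt[:, :-1]   # [BOS, y1, y2, y3]   -- ground truth fed to the decoder
labels        = tgt[:, 1:]    # [y1, y2, y3, EOS]   -- what it must predict at each position

# With the causal mask, one forward pass yields predictions for every position at once:
# logits = model(src, decoder_input)               # (batch, seq_len - 1, vocab_size)
# loss = criterion(logits.reshape(-1, vocab_size), labels.reshape(-1))
```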
Question 10
What is the computational complexity of self-attention with respect to sequence length $n$ and model dimension $d$?
Show Answer
Self-attention has complexity $O(n^2 \cdot d)$. The breakdown:

- Computing $QK^T$: matrix multiplication of $(n \times d)$ with $(d \times n)$ = $O(n^2 \cdot d)$
- Applying softmax: $O(n^2)$
- Computing the weighted sum with $V$: matrix multiplication of $(n \times n)$ with $(n \times d)$ = $O(n^2 \cdot d)$

The $n^2$ factor comes from every position attending to every other position. This quadratic scaling is the main limitation of standard Transformers for very long sequences, motivating efficient attention variants like Linformer ($O(n)$), Performer ($O(n)$), and Flash Attention (same complexity but optimized memory access).
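A toy single-head sketch that shows exactly where the $n \times n$ matrix appears:

```python
import torch

n, d = 1024, 64                      # sequence length and per-head dimension
Q = torch.randn(n, d)
K = torch.randn(n, d)
V = torch.randn(n, d)

scores = Q @ K.T / d**0.5            # (n, n): the quadratic term, n^2 * d multiply-adds
attn = scores.softmax(dim=-1)        # (n, n): n^2 work
out = attn @ V                       # (n, d): another n^2 * d multiply-adds

print(scores.shape)                  # torch.Size([1024, 1024]); doubling n quadruples this matrix
```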
Question 11
In the original Transformer, what are the hyperparameters for the base model?
Show Answer
The original Transformer base model uses:

- $d_{\text{model}} = 512$ (model dimension)
- $d_{\text{ff}} = 2048$ (feed-forward inner dimension, $= 4 \times d_{\text{model}}$)
- $h = 8$ (number of attention heads)
- $d_k = d_v = 64$ (head dimension, $= d_{\text{model}} / h$)
- $N = 6$ (number of layers for both encoder and decoder)
- Dropout rate $= 0.1$
- Label smoothing $\epsilon = 0.1$
- Warm-up steps $= 4000$

The base model has approximately 65 million parameters.

Question 12
Why does the Transformer use layer normalization instead of batch normalization?
Show Answer
There are several reasons:

1. **Variable sequence lengths**: Batch normalization normalizes across the batch dimension, but sequences in a batch often have different lengths. Normalization statistics would be computed over a mix of real tokens and padding, leading to unreliable estimates.
2. **Batch size independence**: Layer normalization normalizes across the feature dimension for each individual example, so it works the same regardless of batch size. This is important because Transformer training often uses small effective batch sizes.
3. **Sequence position independence**: Batch normalization would compute different statistics for each position, which makes it sensitive to the sequence length distribution.
4. **Inference consistency**: Layer normalization behaves identically during training and inference (no running statistics to maintain), simplifying deployment.
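A small illustration of what the two normalizations average over, using PyTorch's `nn.LayerNorm` and `nn.BatchNorm1d` (the BatchNorm usage here is only for contrast; it is not what Transformers do):

```python
import torch
import torch.nn as nn

x = torch.randn(4, 16, 512)                        # (batch, seq_len, d_model)

layer_norm = nn.LayerNorm(512)
y = layer_norm(x)                                  # statistics over the 512 features of each token,
                                                   # independent of batch size and of other positions

batch_norm = nn.BatchNorm1d(512)
z = batch_norm(x.transpose(1, 2)).transpose(1, 2)  # statistics over batch and positions per feature,
                                                   # which would mix real tokens with padding
```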
Question 13
What is weight tying, and why is it used in Transformers?
Show Answer
**Weight tying** means sharing the weight matrix between the decoder's embedding layer and the output projection layer. The embedding maps token IDs to vectors (lookup), while the output projection maps vectors to logits over the vocabulary (linear transformation). Since these perform conceptually inverse operations, it makes sense to share their weights.

Benefits:

1. **Fewer parameters**: The embedding matrix for a 32K vocabulary with $d_{\text{model}} = 512$ has 16M parameters. Weight tying eliminates this redundancy.
2. **Better generalization**: The shared representation acts as a regularizer, as the embedding must serve both input and output purposes.
3. **Empirically effective**: Studies show it improves performance, especially for smaller models.

Weight tying was used in the original Transformer and is common in many subsequent models (GPT-2, T5, etc.).
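In PyTorch the standard idiom is a single assignment; a sketch:

```python
import torch
import torch.nn as nn

vocab_size, d_model = 32000, 512

embedding = nn.Embedding(vocab_size, d_model)          # ids -> vectors
lm_head = nn.Linear(d_model, vocab_size, bias=False)   # vectors -> logits

lm_head.weight = embedding.weight                      # tie: both layers share one (vocab, d_model) matrix

hidden = torch.randn(2, 16, d_model)
logits = lm_head(hidden)                               # equivalent to hidden @ embedding.weight.T
```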
Question 14
Explain the warm-up phase of the Transformer's learning rate schedule. Why is it necessary?
Show Answer
The warm-up phase **linearly increases** the learning rate from 0 to its peak value over a specified number of steps (typically 4,000). After the warm-up, the learning rate **decays** proportionally to the inverse square root of the step number.

The warm-up is necessary because:

1. **Random initialization**: At the start of training, model parameters are random, so the Adam optimizer's second-moment estimates are unreliable. Large learning rates at this stage can cause divergent, poorly-directed parameter updates.
2. **Adam's adaptive estimates**: Adam needs several steps to accumulate accurate moving averages of gradients and squared gradients. The warm-up gives it time to calibrate before making large updates.
3. **Attention instability**: Attention weights early in training can be highly volatile. Gradual increases in learning rate prevent the model from making catastrophically large steps during this unstable phase.

Without warm-up, Transformer training often diverges or converges to poor solutions.
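The original paper's schedule can be written as $lr(\text{step}) = d_{\text{model}}^{-0.5} \cdot \min(\text{step}^{-0.5},\ \text{step} \cdot \text{warmup}^{-1.5})$; a small sketch:

```python
def transformer_lr(step: int, d_model: int = 512, warmup: int = 4000) -> float:
    """Linear warm-up to the peak, then inverse-square-root decay."""
    step = max(step, 1)   # guard against step = 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Rises linearly for 4,000 steps, peaks near 7e-4, then decays as 1/sqrt(step):
for s in (1, 1000, 4000, 16000, 100000):
    print(s, f"{transformer_lr(s):.2e}")
```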
Question 15
What is label smoothing, and what value is used in the original Transformer?
Show Answer
**Label smoothing** replaces hard one-hot target distributions with softer distributions. Instead of assigning probability 1.0 to the correct token and 0.0 to all others, it assigns:

- $1 - \epsilon$ to the correct token
- $\epsilon / (V - 1)$ to each incorrect token

where $\epsilon$ is the smoothing parameter and $V$ is the vocabulary size. The original Transformer uses $\epsilon = 0.1$.

Effects:

- **Hurts perplexity**: The model learns not to assign 100% probability to correct tokens, so its cross-entropy loss is higher than without smoothing.
- **Improves BLEU**: The softer training signal prevents overconfidence and improves generalization, leading to better translation quality as measured by BLEU.
- **Regularization**: Acts as a form of regularization by preventing the model from becoming too confident about any single prediction.
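In PyTorch this is available directly on the loss; a sketch (note that `nn.CrossEntropyLoss` spreads $\epsilon$ over all $V$ classes rather than only the incorrect ones, which is nearly identical for large vocabularies):

```python
import torch
import torch.nn as nn

vocab_size = 32000
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)   # available in PyTorch >= 1.10

logits = torch.randn(8, vocab_size)                    # 8 token predictions
targets = torch.randint(0, vocab_size, (8,))
loss = criterion(logits, targets)   # each target becomes ~0.9 on the true class, 0.1 spread over the rest
```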
Question 16
How many sublayers does each encoder block have, and how many does each decoder block have?
Show Answer
Each **encoder block** has **2 sublayers**:

1. Multi-head self-attention (unmasked)
2. Position-wise feed-forward network

Each **decoder block** has **3 sublayers**:

1. Masked multi-head self-attention (with causal mask)
2. Multi-head cross-attention (queries from decoder, keys/values from encoder)
3. Position-wise feed-forward network

Each sublayer in both encoder and decoder is wrapped with a residual connection and layer normalization.

Question 17
What are the three main architectural variants of the Transformer, and what tasks is each best suited for?
Show Answer
1. **Encoder-only** (e.g., BERT): Uses only the encoder with bidirectional (unmasked) self-attention. Best for understanding tasks: text classification, named entity recognition, extractive question answering, sentence embedding.
2. **Decoder-only** (e.g., GPT, LLaMA): Uses only the decoder with causal (masked) self-attention. Best for generative tasks: text generation, code generation, conversational AI. This variant dominates modern large language models.
3. **Encoder-decoder** (e.g., T5, BART, original Transformer): Uses both encoder and decoder with cross-attention connecting them. Best for sequence-to-sequence tasks: machine translation, summarization, abstractive question answering.

Question 18
In the FFN, why is the intermediate dimension $d_{\text{ff}}$ typically $4 \times d_{\text{model}}$?
Show Answer
The expansion to $4 \times d_{\text{model}}$ followed by a projection back creates an expand-and-contract structure, with the residual stream's $d_{\text{model}}$ acting as the bottleneck. The data is projected into a higher-dimensional space where:

1. **More expressive nonlinearity**: The ReLU/GELU activation operates in a space with 4x more dimensions, allowing it to carve out more complex decision boundaries.
2. **Larger capacity**: More neurons mean more patterns can be stored and retrieved (in the key-value memory interpretation).
3. **Practical trade-off**: The factor of 4 was found empirically to provide a good balance between model capacity and computational cost. The FFN already accounts for roughly two-thirds of each layer's parameters, so making it much larger would be prohibitively expensive.

The projection back to $d_{\text{model}}$ ensures the residual stream dimension remains consistent across layers.

Question 19
What would happen if you removed positional encoding entirely from the Transformer?
Show Answer
Without positional encoding, the Transformer would treat the input as a **bag of tokens** (a set, not a sequence). Self-attention is permutation-equivariant: if you permute the input tokens, the output is permuted in the same way, but the relative attention patterns are unchanged.

Consequences:

1. **Loss of word order**: "Dog bites man" and "Man bites dog" would produce identical representations (up to permutation).
2. **Poor performance on order-sensitive tasks**: Tasks like translation, where word order carries meaning, would suffer dramatically.
3. **Some tasks might still work**: Bag-of-words classification tasks (e.g., sentiment analysis) might still perform reasonably, since they rely more on which words appear than their order.
4. **The model could partially compensate**: Content-based attention patterns might help the model infer some ordering from semantic cues, but this would be limited and unreliable.

Question 20
Why is gradient clipping (max norm = 1.0) commonly used in Transformer training?
Show Answer
Gradient clipping prevents **gradient explosions** that can destabilize training:

1. **Attention sharpness**: The softmax in attention can produce very peaked distributions, leading to large gradients through certain paths.
2. **Deep networks**: With 6+ layers, each containing multiple sublayers, gradient magnitudes can compound.
3. **Adam interaction**: While Adam normalizes gradients by their second moment, extreme gradients can still cause instability, especially early in training when the moment estimates are noisy.
4. **Cross-attention coupling**: The interaction between encoder and decoder gradients through cross-attention can amplify gradient magnitudes.

The max norm of 1.0 ensures that the total gradient norm never exceeds 1.0. If it does, all gradients are scaled down proportionally, preserving their direction while limiting their magnitude. This is a safety mechanism that prevents catastrophic parameter updates while allowing normal-magnitude gradients to pass through unchanged.
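Where clipping typically sits in the training step (a toy model stands in for the full Transformer in this sketch):

```python
import torch
from torch import nn

model = nn.Linear(512, 512)   # stand-in for a full Transformer
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

x, y = torch.randn(8, 512), torch.randn(8, 512)
loss = nn.functional.mse_loss(model(x), y)
loss.backward()

# Rescale all gradients so their combined L2 norm is at most 1.0, preserving their direction.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```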
Question 21
Explain the concept of "exposure bias" in Transformer training with teacher forcing.
Show Answer
**Exposure bias** refers to the mismatch between training and inference conditions:

- **During training**: The decoder always sees the correct previous tokens (teacher forcing). It never encounters its own mistakes.
- **During inference**: The decoder uses its own predictions as input for subsequent positions. If it makes an error, all subsequent predictions are conditioned on that error.

This means the model is never trained to recover from its own mistakes, so errors can compound during generation. For example, if the model generates an incorrect word early in a translation, subsequent words may be nonsensical because the model has never seen such erroneous contexts during training.

Mitigations include:

- **Scheduled sampling**: Gradually mixing model predictions with ground truth during training
- **Beam search**: Maintaining multiple candidate sequences to reduce the impact of individual errors
- **Sequence-level training**: Using reinforcement learning objectives like REINFORCE to train on complete generated sequences

Question 22
How does the Transformer handle variable-length sequences within a batch?
Show Answer
The Transformer handles variable-length sequences through **padding and masking**:

1. **Padding**: All sequences in a batch are padded to the same length (typically the length of the longest sequence in the batch) using a special padding token.
2. **Masking**: A padding mask prevents attention from attending to the padded positions, and the loss ignores them (e.g., via `ignore_index`), so padding affects neither the model's representations nor the training signal.

Question 23
What is the register_buffer mechanism used for the positional encoding, and why is it needed?
Show Answer
`register_buffer` registers a tensor as a **buffer** of the module. This means:

1. **Not a parameter**: The tensor is not included in `model.parameters()` and does not receive gradient updates during training. This is correct for sinusoidal positional encodings, which are deterministic and should not be learned.
2. **Saved with the model**: When you call `model.state_dict()` or `torch.save(model)`, the buffer is included. This ensures the positional encoding is preserved when loading a saved model.
3. **Device management**: When you call `model.to(device)` or `model.cuda()`, the buffer is moved to the appropriate device along with the model's parameters. Without `register_buffer`, you would need to manually move the positional encoding tensor to the correct device.
4. **Part of the module**: The buffer appears in the module's state dict and is properly handled by `model.eval()` and `model.train()` (though normalization layers, not buffers, are the main concern there).

Without `register_buffer`, the positional encoding would be a plain Python attribute that could end up on the wrong device or be lost during serialization.
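A sketch of the usual pattern (filling in the sinusoidal values is omitted here; only the buffer mechanics are shown):

```python
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    def __init__(self, d_model: int = 512, max_len: int = 5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)   # would be filled with the sinusoidal values
        self.register_buffer("pe", pe)       # in state_dict, moved by .to(device), never trained

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, seq_len, d_model)
        return x + self.pe[: x.size(1)]

module = PositionalEncoding()
print(sum(p.numel() for p in module.parameters()))   # 0: the buffer is not a learnable parameter
print("pe" in module.state_dict())                   # True: it is still saved and reloaded with the model
```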
Question 24
In the Transformer training loop, why do we reshape the logits and targets before computing the loss?
Show Answer
The logits from the Transformer have shape `(batch_size, seq_len, vocab_size)` and the targets have shape `(batch_size, seq_len)`. However, `nn.CrossEntropyLoss` expects:

- **Input**: `(N, C)` where `N` is the number of examples and `C` is the number of classes
- **Target**: `(N,)` where each value is a class index

So we reshape:

- `logits.reshape(-1, vocab_size)` -> shape `(batch_size * seq_len, vocab_size)`
- `targets.reshape(-1)` -> shape `(batch_size * seq_len,)`

This "flattens" the batch and sequence dimensions, treating each token prediction as an independent classification problem. The `ignore_index=PAD_IDX` argument ensures that padding positions do not contribute to the loss, so this flattening does not introduce incorrect training signals from padding tokens.
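A sketch with random tensors in place of real model outputs (`PAD_IDX` is a hypothetical padding index):

```python
import torch
import torch.nn as nn

batch_size, seq_len, vocab_size = 2, 16, 32000
PAD_IDX = 0   # hypothetical padding index

logits = torch.randn(batch_size, seq_len, vocab_size)   # decoder output
targets = torch.randint(0, vocab_size, (batch_size, seq_len))

criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
loss = criterion(
    logits.reshape(-1, vocab_size),   # (batch_size * seq_len, vocab_size)
    targets.reshape(-1),              # (batch_size * seq_len,)
)
```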
Question 25
What distinguishes the Transformer from a standard attention-augmented RNN encoder-decoder?