Chapter 10: Quiz
Test your understanding of the transformer architecture. Answers follow each question.
Question 1
What is the formula for scaled dot-product attention? Why is the scaling factor $1/\sqrt{d_k}$ necessary?
Answer
$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\sqrt{d_k}}\right)\mathbf{V}$$ The scaling factor $1/\sqrt{d_k}$ is necessary because the dot product of two random vectors in $\mathbb{R}^{d_k}$ has variance $d_k$ (assuming zero-mean, unit-variance components). Without scaling, as $d_k$ grows, the dot products grow in magnitude, pushing the softmax into saturated regions where its gradient is nearly zero. Dividing by $\sqrt{d_k}$ normalizes the variance of the dot products to 1, keeping the softmax in its well-behaved regime and enabling stable gradient flow during training.
Question 2
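A minimal NumPy sketch of the formula from the previous answer (single head, no masking; the tiny dimensions and random inputs are illustrative only):

```python
import numpy as np

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for a single head, no masking."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, n) scaled similarities
    scores -= scores.max(axis=-1, keepdims=True)    # stabilize the softmax
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)              # each row sums to 1
    return w @ V                                    # convex combination of value rows

rng = np.random.default_rng(0)
n, d_k = 4, 64
Q, K, V = (rng.normal(size=(n, d_k)) for _ in range(3))
out = attention(Q, K, V)
```

Because the softmax weights are non-negative and sum to 1, each output row is a weighted average of the value rows.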
In self-attention, what are the three roles each position plays, and how are they computed from the input?
Answer
Each position simultaneously serves as a **query** (what information it is looking for), a **key** (what information it advertises to other positions), and a **value** (the information it provides when attended to). These are computed by projecting the input embedding $\mathbf{x}$ through three learned linear transformations: $\mathbf{q} = \mathbf{x}\mathbf{W}^Q$, $\mathbf{k} = \mathbf{x}\mathbf{W}^K$, $\mathbf{v} = \mathbf{x}\mathbf{W}^V$. The separation of key and value is important: the key determines *when* a position is attended to (matching with queries), while the value determines *what information* is transmitted.
Question 3
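The three projections from the previous answer, sketched in NumPy (random weights and toy sizes are placeholders for learned parameters):

```python
import numpy as np

rng = np.random.default_rng(1)
d_model, d_k = 16, 8
X = rng.normal(size=(5, d_model))                        # five token embeddings
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))
Q, K, V = X @ W_Q, X @ W_K, X @ W_V                      # every position gets all three roles

# position 0's query matched against position 3's key decides *when* 3 is read;
# position 3's value row is *what* gets read if the match is strong
score_03 = Q[0] @ K[3] / np.sqrt(d_k)
```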
Why does multi-head attention use $h$ separate attention heads instead of a single head with dimension $d_{\text{model}}$?
Answer
A single attention head produces one attention distribution per position — one way of relating positions to one another. But sequences contain multiple types of relationships simultaneously (syntactic, semantic, positional, etc.). The softmax constraint (weights sum to 1) means attending strongly to one position requires attending weakly to others, so a single head cannot capture multiple relationship types simultaneously. Multi-head attention runs $h$ independent attention functions in parallel, each with dimension $d_k = d_{\text{model}}/h$, allowing different heads to specialize in different relationship types. The total parameter count is the same as a single full-dimension head, but the representational capacity is greater because different heads can capture different aspects of the input.
Question 4
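The head split described above is just a reshape: the same $d_{\text{model}} \times d_{\text{model}}$ projection is reinterpreted as $h$ independent $d_k$-dimensional query sets (a sketch with made-up sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d_model, h = 6, 32, 4
d_k = d_model // h                                   # each head works in a smaller subspace
X = rng.normal(size=(n, d_model))
W_Q = rng.normal(size=(d_model, d_model))            # same total parameters as one big head

# (n, d_model) -> (n, h, d_k) -> (h, n, d_k): h independent sets of queries
Q = (X @ W_Q).reshape(n, h, d_k).transpose(1, 0, 2)
# K and V are reshaped the same way; each head then runs its own
# softmax(Q K^T / sqrt(d_k)) V, and the h outputs are concatenated back to d_model
```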
What is the purpose of positional encoding in a transformer, and why is it needed?
Answer
Self-attention is a **set operation** — it computes pairwise similarities and weighted averages without any notion of position or order. If the input sequence is permuted, the outputs are the same permutation of the original outputs (permutation equivariance). Without positional encoding, the transformer treats "the dog bit the man" identically to "the man bit the dog." Positional encoding injects position information into the input embeddings, typically by adding a position-dependent vector, so that the attention mechanism can learn position-dependent and relative-position-dependent behaviors through the Q/K dot products.
Question 5
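One common position-dependent vector is the sinusoidal scheme from the original transformer paper, $PE_{(pos,2i)} = \sin(pos/10000^{2i/d})$ and $PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d})$, which can be sketched as:

```python
import numpy as np

def sinusoidal_pe(n_pos, d_model):
    """pe[pos, 2i] = sin(pos / 10000^(2i/d)), pe[pos, 2i+1] = cos(same)."""
    pos = np.arange(n_pos)[:, None]
    freq = 1.0 / 10000.0 ** (np.arange(0, d_model, 2) / d_model)
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(pos * freq)    # even dimensions
    pe[:, 1::2] = np.cos(pos * freq)    # odd dimensions
    return pe

# injected by simple addition: X = token_embeddings + sinusoidal_pe(n, d_model)
pe = sinusoidal_pe(50, 16)
```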
Explain the rotation matrix property of sinusoidal positional encodings and why it is useful.
Answer
For any fixed offset $k$ and frequency $\omega$, the sinusoidal encoding at position $pos + k$ can be expressed as a rotation matrix applied to the encoding at position $pos$: $$\begin{bmatrix} \sin(\omega(pos+k)) \\ \cos(\omega(pos+k)) \end{bmatrix} = \begin{bmatrix} \cos(\omega k) & \sin(\omega k) \\ -\sin(\omega k) & \cos(\omega k) \end{bmatrix} \begin{bmatrix} \sin(\omega \cdot pos) \\ \cos(\omega \cdot pos) \end{bmatrix}$$ This means the relationship between any two positions at a fixed distance is the same linear transformation regardless of absolute position. The model can learn relative positional relationships (e.g., "two positions apart") through linear projections in Q and K, rather than having to memorize every pair of absolute positions independently.
Question 6
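The rotation identity above can be verified numerically (the particular $\omega$, $pos$, and $k$ values are arbitrary):

```python
import numpy as np

omega, pos, k = 0.37, 11.0, 3.0                 # arbitrary frequency, position, offset
R = np.array([[ np.cos(omega * k), np.sin(omega * k)],
              [-np.sin(omega * k), np.cos(omega * k)]])
enc = lambda p: np.array([np.sin(omega * p), np.cos(omega * p)])

shifted = R @ enc(pos)                          # rotate the encoding at `pos` by offset k
assert np.allclose(shifted, enc(pos + k))       # identical to the encoding at pos + k
```

Note that $R$ depends only on the offset $k$, not on $pos$ — which is exactly the relative-position property.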
What are the two sub-layers in a transformer block, and what complementary roles do they serve?
Answer
The two sub-layers are **multi-head self-attention** and a **position-wise feed-forward network (FFN)**. They serve complementary roles: attention performs **inter-token computation** — it routes information between positions, allowing each position to aggregate information from other positions. The FFN performs **intra-token computation** — it applies an independent nonlinear transformation at each position, processing the information that attention has gathered. Research on transformer circuits has shown that attention heads move information while FFN layers store and retrieve knowledge, functioning as a learned key-value memory.
Question 7
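The "position-wise" property from the previous answer is easy to see in code: the same two-layer MLP is applied to every row (token) independently (ReLU and the sizes here are illustrative choices):

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Position-wise FFN: the same MLP applied to each row independently."""
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2    # ReLU nonlinearity

rng = np.random.default_rng(3)
n, d_model, d_ff = 5, 16, 64                         # d_ff is typically ~4x d_model
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
X = rng.normal(size=(n, d_model))
Y = ffn(X, W1, b1, W2, b2)

# intra-token: row i of Y depends only on row i of X, unlike attention
assert np.allclose(Y[2], ffn(X[2:3], W1, b1, W2, b2)[0])
```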
What is the difference between pre-LN and post-LN transformers? Why does pre-LN dominate in practice?
Answer
**Post-LN** applies layer normalization after the residual addition: $\mathbf{x} = \text{LayerNorm}(\mathbf{x} + \text{SubLayer}(\mathbf{x}))$. **Pre-LN** applies it before the sub-layer: $\mathbf{x} = \mathbf{x} + \text{SubLayer}(\text{LayerNorm}(\mathbf{x}))$. Pre-LN dominates because it provides a cleaner gradient path: the residual connection delivers gradients directly from the loss to any layer without passing through LayerNorm operations. Post-LN can cause gradient explosion during early training, requiring careful learning rate warmup. Pre-LN is more robust to hyperparameter choices and trains stably without warmup. Post-LN can achieve slightly better final performance when properly tuned, but the training stability of pre-LN is preferred in practice.
Question 8
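The two orderings from the previous answer, side by side (the sub-layer is a stand-in lambda; a real block would use attention or an FFN):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean, unit variance (no learned scale/shift here)."""
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def post_ln(x, sublayer):
    return layer_norm(x + sublayer(x))     # LN sits on the residual path

def pre_ln(x, sublayer):
    return x + sublayer(layer_norm(x))     # residual path is a clean identity skip

rng = np.random.default_rng(4)
x = rng.normal(size=(3, 8))
sub = lambda h: 0.5 * h                    # stand-in for attention / FFN
y_post, y_pre = post_ln(x, sub), pre_ln(x, sub)
```

In `pre_ln`, the raw input `x` reaches the output through pure addition, which is the direct gradient path the answer describes.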
What is the "residual stream" interpretation of the transformer, and what does it imply?
Answer
The residual stream view (Elhage et al., 2021) treats the transformer as a shared communication channel that flows through the network. Each attention layer and FFN layer **reads from** the stream, computes a contribution, and **writes back** to the stream via addition. The output at any position is the input embedding plus all contributions from all layers. This implies the transformer is not a strict pipeline — layer $L$ can access the original input and all previous layers' outputs simultaneously, because all are additively combined in the residual stream. This broadcasting architecture is more flexible than a sequential pipeline, where each layer can only access the immediately preceding layer's output.
Question 9
What is causal masking, and why is it required in the decoder but not the encoder?
Answer
Causal masking prevents position $i$ from attending to positions $j > i$ — each position can only see itself and previous positions. It is implemented by setting the upper-triangular entries of the attention score matrix to $-\infty$ before softmax, so future positions receive zero attention weight. It is required in the decoder to preserve the **autoregressive property**: during generation, the prediction at position $t$ can only depend on positions $1, \ldots, t$ (not on positions $t+1, \ldots, n$, which have not been generated yet). The encoder does not need causal masking because it processes an already-complete input sequence — every position should see every other position for maximum contextualization.
Question 10
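The $-\infty$ masking trick from the previous answer, in NumPy (all-zero scores are used so the surviving weights are easy to read):

```python
import numpy as np

n = 4
scores = np.zeros((n, n))                            # stand-in for raw attention scores
future = np.triu(np.ones((n, n), dtype=bool), k=1)   # j > i: strictly upper triangle
scores[future] = -np.inf                             # masked *before* the softmax

w = np.exp(scores - scores.max(-1, keepdims=True))
w /= w.sum(-1, keepdims=True)
# row i now attends only to positions 0..i:
# row 0 puts weight 1.0 on itself; row 3 spreads 0.25 over all four positions
```

`exp(-inf)` evaluates to exactly 0, so masked positions contribute nothing to the normalization.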
What is cross-attention, and how does it differ from self-attention in terms of where Q, K, V come from?
Answer
In **self-attention**, Q, K, and V all come from the same sequence: $\mathbf{Q} = \mathbf{X}\mathbf{W}^Q$, $\mathbf{K} = \mathbf{X}\mathbf{W}^K$, $\mathbf{V} = \mathbf{X}\mathbf{W}^V$. In **cross-attention**, queries come from one sequence (the decoder) and keys/values come from another (the encoder): $\mathbf{Q} = \mathbf{X}_{\text{dec}}\mathbf{W}^Q$, $\mathbf{K} = \mathbf{X}_{\text{enc}}\mathbf{W}^K$, $\mathbf{V} = \mathbf{X}_{\text{enc}}\mathbf{W}^V$. Cross-attention allows the decoder to "read" the encoder's output — each decoder position asks "what in the source is relevant to what I am generating?" and the encoder representations provide the answers through the key-value mechanism.
Question 11
Name the three standard transformer architecture variants and give an example model for each.
Answer
1. **Encoder-decoder**: The full architecture with encoder (bidirectional self-attention) and decoder (causal self-attention + cross-attention). Examples: T5, BART, mBART. Used for sequence-to-sequence tasks like translation and summarization.
2. **Encoder-only**: Only the encoder half, with bidirectional self-attention over the full input. Examples: BERT, RoBERTa, DeBERTa. Used for classification, NER, and retrieval.
3. **Decoder-only**: Only the decoder half, with causal self-attention. Examples: GPT-2/3/4, LLaMA, Mistral, Claude. Used for text generation and, increasingly, as general-purpose LLMs for all tasks.
Question 12
What is the time complexity of self-attention per layer, and what is the dominant term?
Answer
The time complexity is $O(n^2 d)$, where $n$ is the sequence length and $d$ is the representation dimension. The dominant computation is the score matrix $\mathbf{Q}\mathbf{K}^\top \in \mathbb{R}^{n \times n}$, which requires $O(n^2 d)$ FLOPs. The attention-weighted value sum $\text{softmax}(\cdot) \cdot \mathbf{V}$ also costs $O(n^2 d)$. The quadratic dependence on $n$ is why long context windows are expensive: doubling the sequence length quadruples the attention cost.
Question 13
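The scaling claim from the previous answer as back-of-the-envelope arithmetic (the $2n^2d$-per-matmul convention counts a multiply-add as two FLOPs):

```python
def attention_flops(n, d):
    """Approximate FLOPs for one attention layer's two big products:
    Q K^T and the attention-weighted value sum, each ~2 * n^2 * d."""
    return 2 * n * n * d + 2 * n * n * d

# doubling the sequence length quadruples the cost
assert attention_flops(4096, 64) == 4 * attention_flops(2048, 64)
```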
Why is naive attention implementation memory-bound rather than compute-bound on modern GPUs?
Answer
The naive implementation materializes the full $n \times n$ attention score matrix in GPU HBM (high-bandwidth memory), then reads it back for the softmax, then reads it again for multiplication with V. The arithmetic intensity (FLOPs per byte transferred) of these operations is low — the GPU spends more time transferring data between HBM and compute cores than it spends computing. This makes the operation memory-bound. Flash attention addresses this by computing attention in tiles that fit in fast on-chip SRAM, never materializing the full attention matrix in HBM, which brings the computation much closer to the GPU's compute limits.
Question 14
How does flash attention reduce memory usage from $O(n^2)$ to $O(n)$ without changing the result?
Answer
Flash attention processes Q, K, V in blocks that fit in GPU SRAM. For each block of queries, it iterates over blocks of keys and values, computing the attention scores, applying the online softmax algorithm (which maintains a running maximum and running sum), and accumulating the output — all in SRAM. The online softmax trick allows the normalization constant to be computed incrementally across blocks without needing the full score vector. Only the final output (of size $O(n \times d)$) is written to HBM. The $n \times n$ attention matrix is never fully materialized, reducing peak memory from $O(n^2)$ to $O(n)$.
Question 15
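The running-max / running-sum trick from the previous answer, sketched for a single query row (this is the numerical core only, not the tiled GPU kernel):

```python
import numpy as np

def online_softmax_row(scores, values, block=4):
    """Compute softmax(scores) @ values while streaming over blocks of keys,
    never holding the full softmax vector at once."""
    m, s = -np.inf, 0.0                      # running max and running normalizer
    acc = np.zeros(values.shape[-1])
    for i in range(0, len(scores), block):
        sc, v = scores[i:i + block], values[i:i + block]
        m_new = max(m, sc.max())
        corr = np.exp(m - m_new)             # rescale earlier partial results
        s = s * corr + np.exp(sc - m_new).sum()
        acc = acc * corr + np.exp(sc - m_new) @ v
        m = m_new
    return acc / s                           # normalize once at the end

rng = np.random.default_rng(5)
scores, values = rng.normal(size=16), rng.normal(size=(16, 8))
full = np.exp(scores - scores.max()); full /= full.sum()
assert np.allclose(online_softmax_row(scores, values), full @ values)
```

The final assertion confirms the blockwise result matches the ordinary softmax exactly (up to floating-point error), which is why flash attention changes memory traffic but not the output.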
What is the KV-cache, and why does it exist?
Answer
The KV-cache stores the key and value projections from all previously processed positions during autoregressive generation. Without it, generating token $t+1$ would require recomputing the keys and values for all $t$ previous positions, making the total cost $O(T^3 d)$ for a $T$-token sequence. With the KV-cache, only the new token's query needs to be computed; the keys and values for positions $1, \ldots, t$ are retrieved from the cache, reducing per-step cost from $O(t^2 d)$ to $O(td)$ and total cost from $O(T^3 d)$ to $O(T^2 d)$. The trade-off is memory: the cache stores $2 \times N \times h \times d_k$ values per token per sequence (where $N$ is the number of layers and $h$ the number of heads), which can be several GB for large models with long contexts.
Question 16
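A toy decode loop illustrating the caching pattern from the previous answer (single layer, single head, random weights standing in for a trained model):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 8
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))
K_cache, V_cache = [], []

for t in range(5):                       # autoregressive decoding, one token per step
    x_t = rng.normal(size=d)             # hidden state of the newly generated token
    K_cache.append(x_t @ W_K)            # K, V computed once per token, then reused
    V_cache.append(x_t @ W_V)
    K, V = np.stack(K_cache), np.stack(V_cache)

    q_t = x_t @ W_Q                      # only the new token needs a query
    s = q_t @ K.T / np.sqrt(d)           # attend over the cached keys: O(t*d)
    w = np.exp(s - s.max())
    out_t = (w / w.sum()) @ V
```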
What does the comparison table in Section 4 of "Attention Is All You Need" show, and why is the "sequential operations" row the most important?
Answer
The table compares self-attention, recurrence, and convolution on three criteria: complexity per layer, sequential operations, and maximum path length. The sequential operations row is most important because it determines **parallelizability**: self-attention requires $O(1)$ sequential operations (all pairwise interactions are computed in a single matrix multiply), while recurrence requires $O(n)$ sequential operations (each hidden state depends on the previous one). This means self-attention can fully exploit GPU parallelism — all positions are processed simultaneously — while RNNs cannot, making transformers dramatically faster to train despite their higher per-layer FLOP count.
Question 17
True or False: A transformer with positional encoding is permutation invariant.
Answer
**False.** A transformer *without* positional encoding is permutation **equivariant** (not invariant — the outputs permute with the inputs). Adding positional encoding breaks this equivariance because the positional signal changes when positions are permuted. With positional encoding, the transformer is sensitive to input order, which is essential for sequence tasks where order matters. Note the distinction: permutation **invariant** means the output does not change at all under permutation; permutation **equivariant** means the output permutes consistently with the input.
Question 18
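Both claims from the previous answer can be checked numerically with a weight-free attention stand-in (the random "positional encoding" here is any position-dependent signal, not a realistic scheme):

```python
import numpy as np

def self_attention(X, d_scale):
    """Bare attention over rows of X: no learned weights, no position info."""
    s = X @ X.T / np.sqrt(d_scale)
    w = np.exp(s - s.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ X

rng = np.random.default_rng(7)
X = rng.normal(size=(5, 8))
perm = rng.permutation(5)

# without positional encoding: permuting inputs just permutes outputs (equivariance)
out, out_perm = self_attention(X, 8), self_attention(X[perm], 8)
assert np.allclose(out_perm, out[perm])

# adding a position-dependent signal breaks the equivariance
pe = rng.normal(size=(5, 8))
out_pe, out_pe_perm = self_attention(X + pe, 8), self_attention(X[perm] + pe, 8)
assert not np.allclose(out_pe_perm, out_pe[perm])
```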
How does Grouped-Query Attention (GQA) reduce KV-cache memory? What is the trade-off?
Answer
Standard multi-head attention uses separate key and value projections for each of $h$ heads, requiring the KV-cache to store $2 \times h \times d_k$ values per token. GQA groups query heads into $g$ groups (where $g < h$), and each group shares a single set of key-value projections. This reduces KV-cache memory by a factor of $h/g$. For example, LLaMA 2 70B uses $h = 64$ query heads with $g = 8$ KV groups, achieving an $8\times$ reduction in KV-cache memory. The trade-off is a small decrease in model quality, but empirical results show the degradation is minimal when $g$ is not too small (e.g., $g = 8$ for $h = 64$).
Question 19
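The sharing scheme from the previous answer, sketched with toy sizes: $g$ sets of keys/values are broadcast to $h$ query heads, so only the $g$ sets need to live in the cache:

```python
import numpy as np

rng = np.random.default_rng(8)
h, g, n, d_k = 8, 2, 6, 4
Q = rng.normal(size=(h, n, d_k))          # one query projection per head
K = rng.normal(size=(g, n, d_k))          # only g sets of keys/values are cached
V = rng.normal(size=(g, n, d_k))

K_shared = np.repeat(K, h // g, axis=0)   # each group of h//g query heads shares K, V
V_shared = np.repeat(V, h // g, axis=0)

s = Q @ K_shared.transpose(0, 2, 1) / np.sqrt(d_k)
w = np.exp(s - s.max(-1, keepdims=True)); w /= w.sum(-1, keepdims=True)
out = w @ V_shared                        # (h, n, d_k); cache shrank by h/g = 4x
```

With $g = 1$ this reduces to Multi-Query Attention; with $g = h$ it is ordinary multi-head attention.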
A transformer encoder for StreamRec session modeling uses causal masking. A colleague suggests removing the causal mask to allow bidirectional attention, arguing that "more information is always better." Evaluate this argument.
Answer
The argument is wrong if the model is used for **next-item prediction** during serving. At inference time, when predicting what the user will engage with next, the model does not have access to future items — the user has not clicked them yet. If the model was trained with bidirectional attention, it learned to use future context that is unavailable at inference time, creating a **train-serve skew** that degrades predictions. Causal masking ensures the training setup matches the inference setup. However, if the task is session-level classification (e.g., predicting user satisfaction from the entire session), bidirectional attention is correct because the full session is available at inference time. The choice depends on whether the model predicts at each timestep (causal) or from the complete sequence (bidirectional).
Question 20
Compare the transformer, CNN, and LSTM in terms of inductive bias, data efficiency, and best use cases in 2025.