Quiz: The Attention Mechanism
Test your understanding of the attention mechanism concepts covered in Chapter 18. Each question has a hidden answer --- try to answer first before revealing it.
Question 1
What is the fundamental limitation of standard seq2seq models (without attention) that motivated the development of attention mechanisms?
Show Answer
The **information bottleneck**: the entire input sequence is compressed into a single fixed-length context vector $\mathbf{c} = \mathbf{h}_T$. This vector has finite capacity, so as the input sequence grows longer, the model loses information. Empirically, translation quality degrades sharply for sentences longer than 20--30 words.
Question 2
In Bahdanau attention, what are the three steps to compute the context vector $\mathbf{c}_t$ at decoder time step $t$?
Show Answer
1. **Compute alignment scores:** $e_{tj} = \mathbf{v}_a^\top \tanh(\mathbf{W}_a \mathbf{s}_{t-1} + \mathbf{U}_a \mathbf{h}_j)$ for each encoder position $j$.
2. **Normalize with softmax:** $\alpha_{tj} = \text{softmax}(e_{tj})$ to get attention weights that sum to 1.
3. **Compute weighted sum:** $\mathbf{c}_t = \sum_{j=1}^{T} \alpha_{tj} \mathbf{h}_j$ to produce the context vector.
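
A minimal NumPy sketch of these three steps for a single decoder time step; the dimensions and random weights below are illustrative placeholders, not values from the chapter.

```python
import numpy as np

# Additive (Bahdanau) attention for one decoder step (illustrative sketch).
T, enc_dim, dec_dim, attn_dim = 6, 8, 8, 16   # made-up sizes
rng = np.random.default_rng(0)
H = rng.normal(size=(T, enc_dim))             # encoder states h_1..h_T
s_prev = rng.normal(size=(dec_dim,))          # previous decoder state s_{t-1}
W_a = rng.normal(size=(attn_dim, dec_dim))
U_a = rng.normal(size=(attn_dim, enc_dim))
v_a = rng.normal(size=(attn_dim,))

# 1. Alignment scores e_tj for every encoder position j
e = np.tanh(W_a @ s_prev + H @ U_a.T) @ v_a   # shape (T,)

# 2. Softmax: attention weights alpha_tj that sum to 1
alpha = np.exp(e - e.max())
alpha /= alpha.sum()

# 3. Weighted sum of encoder states: context vector c_t
c_t = alpha @ H                               # shape (enc_dim,)
```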
Question 3
What is the key architectural difference between Bahdanau and Luong attention in terms of which decoder hidden state is used?
Show Answer
Bahdanau attention uses the **previous** decoder hidden state $\mathbf{s}_{t-1}$ to compute attention at step $t$, while Luong attention uses the **current** decoder hidden state $\mathbf{s}_t$. This means Bahdanau attention is computed before the decoder RNN step, while Luong attention is computed after.
Question 4
Name the three scoring functions proposed by Luong et al. and write their mathematical formulas.
Show Answer
1. **Dot:** $e_{tj} = \mathbf{s}_t^\top \mathbf{h}_j$ (requires same dimensionality)
2. **General (bilinear):** $e_{tj} = \mathbf{s}_t^\top \mathbf{W}_a \mathbf{h}_j$ (learnable matrix)
3. **Concat:** $e_{tj} = \mathbf{v}_a^\top \tanh(\mathbf{W}_a [\mathbf{s}_t ; \mathbf{h}_j])$ (similar to Bahdanau)
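
The three scores can be written in a few lines of NumPy. The shapes and the random matrices `W_a`, `W_c`, and `v_a` below are illustrative; `W_c` plays the role of $\mathbf{W}_a$ in the concat formula.

```python
import numpy as np

# The three Luong scoring functions for one decoder state s_t against all
# encoder states H (illustrative shapes and random weights).
T, d = 5, 8
rng = np.random.default_rng(0)
s_t = rng.normal(size=(d,))          # current decoder state
H = rng.normal(size=(T, d))          # encoder states h_1..h_T
W_a = rng.normal(size=(d, d))        # bilinear matrix for "general"
W_c = rng.normal(size=(d, 2 * d))    # matrix applied to [s_t ; h_j] for "concat"
v_a = rng.normal(size=(d,))

e_dot = H @ s_t                                    # e_tj = s_t^T h_j
e_general = (H @ W_a.T) @ s_t                      # e_tj = s_t^T W_a h_j
concat = np.concatenate([np.tile(s_t, (T, 1)), H], axis=1)
e_concat = np.tanh(concat @ W_c.T) @ v_a           # e_tj = v_a^T tanh(W_a [s_t ; h_j])
```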
Question 5
In the Query-Key-Value (QKV) framework, what do Q, K, and V represent conceptually?
Show Answer
- **Query (Q):** What you are looking for --- represents the current position's information need.
- **Key (K):** An index/label for each position --- represents what information each position advertises.
- **Value (V):** The actual content each position carries --- represents the information to be retrieved.

Attention computes similarity between queries and keys to determine how to weight the values.
Question 6
Why does scaled dot-product attention divide by $\sqrt{d_k}$?
Show Answer
When query and key vectors have i.i.d. components with mean 0 and variance 1, their dot product has variance $d_k$ (the dimension). As $d_k$ grows, the dot products become large in magnitude, pushing softmax into saturated regions where gradients are extremely small. Dividing by $\sqrt{d_k}$ normalizes the variance back to 1, keeping softmax inputs in a well-conditioned range.
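
A quick numerical check of this argument (illustrative, not from the chapter): sample random queries and keys and compare the variance of the raw and scaled dot products.

```python
import numpy as np

# Empirical check: dot-product variance grows like d_k; dividing by sqrt(d_k)
# brings it back to roughly 1.
rng = np.random.default_rng(0)
for d_k in (4, 64, 1024):
    q = rng.normal(size=(10_000, d_k))     # i.i.d. components, mean 0, variance 1
    k = rng.normal(size=(10_000, d_k))
    raw = (q * k).sum(axis=1)              # unscaled dot products
    print(d_k, raw.var().round(1), (raw / np.sqrt(d_k)).var().round(3))
# Prints variances near d_k for the raw scores and near 1.0 after scaling.
```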
Question 7
What is the difference between self-attention and cross-attention?
Show Answer
- **Self-attention:** The queries, keys, and values all come from the **same** sequence. Each position attends to every other position in the same sequence.
- **Cross-attention:** Queries come from one sequence (e.g., the decoder), while keys and values come from a **different** sequence (e.g., the encoder). This allows the decoder to attend to encoder representations.
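
A small sketch of the difference using PyTorch's built-in scaled dot-product attention (requires PyTorch 2.0+); the tensor shapes here are illustrative.

```python
import torch
import torch.nn.functional as F

# The only difference is where Q and K/V come from (shapes illustrative).
dec = torch.randn(1, 5, 64)   # decoder-side representations (batch, tgt_len, dim)
enc = torch.randn(1, 9, 64)   # encoder-side representations (batch, src_len, dim)

self_out = F.scaled_dot_product_attention(dec, dec, dec)    # Q, K, V from the same sequence
cross_out = F.scaled_dot_product_attention(dec, enc, enc)   # Q from decoder, K/V from encoder
print(self_out.shape, cross_out.shape)                      # both (1, 5, 64): one output per query
```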
Question 8
In multi-head attention with $d_{\text{model}} = 512$ and $h = 8$ heads, what is the dimension $d_k$ per head?
Show Answer
$d_k = d_{\text{model}} / h = 512 / 8 = 64$

Each head operates on a 64-dimensional subspace of the 512-dimensional model space.
Question 9
What are the two main types of attention masks, and when is each used?
Show Answer
1. **Padding mask:** Used when batching sequences of different lengths. Prevents attention from attending to padding tokens (positions beyond the actual sequence length).
2. **Causal (look-ahead) mask:** Used in autoregressive models (e.g., language models). Prevents each position from attending to future positions --- position $i$ can only attend to positions $\leq i$.
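
A sketch of how both masks are commonly built as boolean tensors in PyTorch; the batch of sequence lengths and the convention (True = position may be attended to) are assumptions for illustration.

```python
import torch

# Illustrative construction of the two mask types (True = position may be attended to).
lengths = torch.tensor([5, 3, 7])                 # actual lengths in a padded batch
max_len = 7

# Padding mask: (batch, 1, max_len), False at padded key positions
padding_mask = (torch.arange(max_len)[None, :] < lengths[:, None]).unsqueeze(1)

# Causal mask: (max_len, max_len), True only where key position j <= query position i
causal_mask = torch.tril(torch.ones(max_len, max_len, dtype=torch.bool))

# Combined mask that could be applied to (batch, query, key) attention scores
combined = padding_mask & causal_mask             # broadcasts to (batch, max_len, max_len)
print(combined.shape)                             # torch.Size([3, 7, 7])
```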
Question 10
What is the time complexity of self-attention with respect to sequence length $n$ and dimension $d$?
Show Answer
$O(n^2 d)$. The dominant operation is computing $\mathbf{Q}\mathbf{K}^\top$, which produces an $n \times n$ matrix, each element requiring $O(d)$ operations. The space complexity is $O(n^2)$ for storing the attention matrix.
Question 11
How does the maximum path length between any two tokens compare between self-attention and an RNN?
Show Answer
- **Self-attention:** $O(1)$ --- any token can directly attend to any other token in a single layer.
- **RNN:** $O(n)$ --- information must propagate through $n$ sequential time steps to travel from one end of the sequence to the other.

This shorter path length is one of the key advantages of attention for modeling long-range dependencies.
Question 12
True or False: Attention weights can be directly interpreted as feature importance scores that explain why a model made a particular prediction.
Show Answer
**False.** While attention weights show what the model "looks at," research by Jain and Wallace (2019) demonstrated that attention weights often do not correlate with gradient-based feature importance, and alternative attention distributions can produce the same predictions. Attention weights are *descriptive* but not necessarily *explanatory*.
Question 13
Explain the "attention as a differentiable dictionary lookup" analogy. What makes it different from a traditional dictionary?
Show Answer
In a traditional dictionary, you provide an exact key and retrieve the corresponding value (hard lookup). In attention, you provide a query, compute its *similarity* to all keys, and return a *weighted combination* of all values. The differences are:
1. **Soft matching:** Instead of exact match, similarity is computed via dot products.
2. **Weighted output:** Instead of returning one value, returns a weighted sum of all values.
3. **Differentiable:** The entire operation is differentiable, enabling gradient-based learning.
4. **Trainable:** The query, key, and value projections are learned parameters.
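
The analogy can be made concrete in a few lines of NumPy. The toy keys and values below are made up for illustration: a hard lookup returns the single best-matching value, while attention returns a similarity-weighted blend of all of them.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Toy "dictionary": 4 entries, each with a key vector and a value vector.
keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
values = np.array([[10.0], [20.0], [30.0], [40.0]])

query = np.array([0.9, 0.1])

# Hard lookup: pick the value of the single best-matching key.
hard = values[np.argmax(keys @ query)]

# Soft (attention) lookup: weight *all* values by query-key similarity.
weights = softmax(keys @ query)       # a distribution over entries
soft = weights @ values               # differentiable weighted blend
print(hard, soft)
```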
Question 14
Why does multi-head attention use separate projection matrices ($\mathbf{W}^Q$, $\mathbf{W}^K$, $\mathbf{W}^V$) for queries, keys, and values rather than using the input directly?
Show Answer
Separate projections allow the model to learn different roles for different purposes:
- $\mathbf{W}^Q$ learns what to search for (queries)
- $\mathbf{W}^K$ learns what to advertise (keys)
- $\mathbf{W}^V$ learns what to communicate (values)

These are fundamentally different roles. Without separate projections, these roles would be conflated. Each head's projections also enable learning different relationship types (syntactic, semantic, positional, etc.).
Question 15
What is the output projection $\mathbf{W}^O$ in multi-head attention, and why is it necessary?
Show Answer
$\mathbf{W}^O \in \mathbb{R}^{hd_v \times d_{\text{model}}}$ is a linear projection applied after concatenating all head outputs. It is necessary to:
1. Map the concatenated output (dimension $h \cdot d_v$) back to $d_{\text{model}}$ dimensions.
2. Allow the model to learn how to combine information from different heads.
3. Mix information across heads, since heads compute attention independently.

Without $\mathbf{W}^O$, the heads could not interact, limiting the expressiveness of multi-head attention.
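
A compact PyTorch sketch of the full multi-head computation, showing where the head split, the concatenation, and $\mathbf{W}^O$ fit. The dimensions are illustrative and biases are omitted.

```python
import torch

# Multi-head self-attention forward pass (illustrative shapes, no biases).
B, n, d_model, h = 2, 10, 512, 8
d_k = d_model // h                                     # 64 per head

x = torch.randn(B, n, d_model)
W_Q, W_K, W_V, W_O = (torch.randn(d_model, d_model) for _ in range(4))

def split_heads(t):                                    # (B, n, d_model) -> (B, h, n, d_k)
    return t.view(B, n, h, d_k).transpose(1, 2)

Q, K, V = split_heads(x @ W_Q), split_heads(x @ W_K), split_heads(x @ W_V)

scores = Q @ K.transpose(-2, -1) / d_k ** 0.5          # (B, h, n, n)
attn = scores.softmax(dim=-1)
heads = attn @ V                                       # (B, h, n, d_k)

# Concatenate the heads, then mix them with the output projection W^O.
concat = heads.transpose(1, 2).reshape(B, n, d_model)  # (B, n, h * d_k)
out = concat @ W_O                                     # (B, n, d_model)
print(out.shape)                                       # torch.Size([2, 10, 512])
```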
Question 16
How is a causal mask typically applied to attention scores before softmax?
Show Answer
Positions that should be masked (future positions) are set to $-\infty$ in the attention scores before applying softmax:
$$\text{scores}_{ij} \leftarrow \begin{cases} \text{scores}_{ij} & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}$$
Since $\exp(-\infty) = 0$, softmax assigns zero weight to future positions. In practice, this is implemented using `masked_fill` with a boolean upper-triangular matrix.
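
A sketch of the usual PyTorch idiom (shapes illustrative); in practice a large negative constant is sometimes used instead of a literal $-\infty$.

```python
import torch

# Causal masking of attention scores before softmax (illustrative sketch).
n = 5
scores = torch.randn(n, n)                        # raw attention scores for one head

# Boolean upper-triangular matrix: True where key position j > query position i.
future = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)

masked = scores.masked_fill(future, float("-inf"))
weights = masked.softmax(dim=-1)                  # rows sum to 1; zero weight on the future
print(weights[0])                                 # only the first entry is non-zero
```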
Question 17
What is the difference between global attention and local attention as proposed by Luong et al.?
Show Answer
- **Global attention:** At each decoder step, attend to **all** encoder hidden states. The attention weights are computed over the entire input sequence.
- **Local attention:** First predict an aligned position $p_t$ in the source, then attend only to a **window** of encoder hidden states around $p_t$. This reduces computation from $O(T)$ to $O(w)$ where $w$ is the window size.

Local attention is a compromise between hard attention (one position) and soft/global attention (all positions).
Question 18
If you scale dot-product attention by $1/\sqrt{d_k}$ and the keys have been normalized to unit length, what is the geometric interpretation of the attention scores?
Show Answer
When keys are unit-normalized, the dot product $\mathbf{q}^\top \mathbf{k} / \sqrt{d_k}$ is proportional to the **cosine similarity** between the query and key vectors (scaled by $\|\mathbf{q}\| / \sqrt{d_k}$). Higher attention scores correspond to keys that are more aligned in direction with the query in the embedding space. This means attention is measuring angular similarity in the projected space.
Question 19
Name three types of relationships that different heads in multi-head attention have been observed to learn.
Show Answer
Research by Clark et al. (2019) on BERT found heads that specialize in:
1. **Positional relationships:** Attending to the next or previous token
2. **Syntactic relationships:** Approximating dependency parse relations (e.g., subject attending to its verb)
3. **Coreference relationships:** Pronouns attending to their antecedents
4. **Separator/structural patterns:** Attending to special tokens like [SEP]
5. **Broad/diffuse attention:** Acting as "bag of words" aggregators over the entire sequence

(Any three of these are acceptable.)
Question 20
Why is self-attention alone insufficient for sequence modeling, requiring the addition of positional encodings?
Show Answer
Self-attention is **permutation equivariant**: if you permute the input tokens, the output is permuted in the same way, but the attention weights and values are unchanged. This means pure self-attention treats the input as a **set**, not a **sequence** --- it has no notion of order. Positional encodings inject position information into the token representations, breaking this permutation symmetry and allowing the model to distinguish between different orderings of the same tokens.
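
A quick way to see this empirically (illustrative code, PyTorch 2.0+): permute the input rows of a plain self-attention layer, without positional encodings, and check that the output rows are permuted the same way.

```python
import torch
import torch.nn.functional as F

# Self-attention without positional information is permutation equivariant:
# permuting the inputs just permutes the outputs.
torch.manual_seed(0)
x = torch.randn(1, 6, 32)                      # (batch, seq, dim), no positional encoding
perm = torch.randperm(6)

out = F.scaled_dot_product_attention(x, x, x)
out_perm = F.scaled_dot_product_attention(x[:, perm], x[:, perm], x[:, perm])

print(torch.allclose(out[:, perm], out_perm, atol=1e-5))   # True
```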
Question 21
What is Flash Attention, and what problem does it solve?
Show Answer
Flash Attention (Dao et al., 2022) is an IO-aware implementation of exact attention that avoids materializing the full $n \times n$ attention matrix in GPU high-bandwidth memory (HBM). Instead, it computes attention in blocks using tiling, keeping intermediate results in fast SRAM. It computes the **exact same** result as standard attention but with:
- Reduced memory usage from $O(n^2)$ to $O(n)$
- Faster wall-clock time due to fewer HBM reads/writes
- The ability to process longer sequences without running out of memory
Question 22
In the context vector formula $\mathbf{c}_t = \sum_{j=1}^{T} \alpha_{tj} \mathbf{h}_j$, what happens when one attention weight $\alpha_{tk} = 1$ and all others are 0?
Show Answer
When $\alpha_{tk} = 1$ and all other weights are 0, the context vector becomes exactly $\mathbf{c}_t = \mathbf{h}_k$, i.e., the encoder hidden state at position $k$. This is equivalent to **hard attention** --- the model is selecting a single input position. Soft attention (where weights are distributed) computes an interpolation between multiple encoder states, enabling smoother gradient flow.
Question 23
Calculate the total number of parameters in a multi-head attention layer with $d_{\text{model}} = 256$, $h = 4$ heads, and no biases.
Show Answer
The four projection matrices are:
- $\mathbf{W}^Q \in \mathbb{R}^{256 \times 256}$: 65,536 parameters
- $\mathbf{W}^K \in \mathbb{R}^{256 \times 256}$: 65,536 parameters
- $\mathbf{W}^V \in \mathbb{R}^{256 \times 256}$: 65,536 parameters
- $\mathbf{W}^O \in \mathbb{R}^{256 \times 256}$: 65,536 parameters

Total: $4 \times 256^2 = 4 \times 65{,}536 = \mathbf{262{,}144}$ parameters.

Note: The number of heads affects how the dimensions are split internally but does not change the total parameter count.
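
This count can be checked against PyTorch's built-in layer with biases disabled; `nn.MultiheadAttention` stores the Q/K/V projections as one fused weight plus an output projection, but the total is the same.

```python
import torch.nn as nn

# Parameter count for multi-head attention with d_model=256, h=4, no biases.
mha = nn.MultiheadAttention(embed_dim=256, num_heads=4, bias=False)
total = sum(p.numel() for p in mha.parameters())
print(total)        # 262144 = 4 * 256 * 256
```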
Question 24
What would happen if you applied attention without the softmax normalization (just using raw scores as weights)?
Show Answer
Without softmax:
1. **Weights would not sum to 1**, so the context vector's magnitude would depend on the scale of the scores, making training unstable.
2. **Weights could be negative** (if scores are negative), which would subtract value information rather than aggregate it.
3. **No probabilistic interpretation** --- attention weights would not form a distribution over positions.
4. **Gradient flow** would be altered --- softmax provides a natural gradient that encourages competition between positions.

The model might still learn something, but training would be much harder and less stable.
Question 25
For a sequence of length $n = 4096$ with $d_k = 64$, how many elements are in the attention matrix, and approximately how much memory does it require in float32?
Show Answer
The attention matrix has $n \times n = 4096 \times 4096 = 16{,}777{,}216$ elements. In float32 (4 bytes per element): $16{,}777{,}216 \times 4 = 67{,}108{,}864$ bytes $\approx 64$ MB.

For a model with $h = 16$ heads and batch size $B = 32$: $64 \text{ MB} \times 16 \times 32 = 32{,}768 \text{ MB} \approx 32$ GB. This illustrates why the quadratic memory cost of attention is a major practical concern for long sequences.
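
The arithmetic is easy to reproduce in plain Python (using the binary MB/GB convention of the answer above):

```python
# Memory for the float32 attention matrix at n = 4096, with 16 heads and batch size 32.
n, bytes_per_float = 4096, 4
per_matrix = n * n * bytes_per_float          # 67,108,864 bytes = 64 MB
total = per_matrix * 16 * 32                  # heads * batch size
print(per_matrix / 2**20, "MB per head,", total / 2**30, "GB per batch")
# -> 64.0 MB per head, 32.0 GB per batch
```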
Question 26
Explain why the fused QKV projection (using a single linear layer instead of three) is more efficient on GPUs.