Exercises: The Attention Mechanism
These exercises progress from foundational understanding to research-level implementation. Exercises marked with a single asterisk (*) are intermediate, and those marked with a double asterisk (**) are advanced.
Section A: Conceptual Foundations (Exercises 1--8)
Exercise 1: The Information Bottleneck
Consider a standard seq2seq encoder--decoder model (without attention) that uses a context vector $\mathbf{c} \in \mathbb{R}^{256}$.
(a) If the encoder processes a 50-word English sentence, approximately how many bits of information per word can the context vector store? Assume each dimension of $\mathbf{c}$ effectively carries about 8 bits (float32 nominally stores 32 bits per dimension, but assume redundancy reduces the usable information to roughly 8 bits).
(b) A typical English word carries about 10--12 bits of information (Shannon, 1951). Does the context vector have sufficient capacity? What happens as sentence length increases to 200 words?
(c) Explain why the attention mechanism does not have this same bottleneck.
Exercise 2: Attention Weight Properties
Prove that attention weights $\alpha_{tj}$ computed via softmax satisfy the following properties:
(a) Non-negativity: $\alpha_{tj} \geq 0$ for all $t, j$.
(b) Normalization: $\sum_{j=1}^{T} \alpha_{tj} = 1$ for each $t$.
(c) Show that if one alignment score $e_{tk}$ is much larger than all others (i.e., $e_{tk} \gg e_{tj}$ for $j \neq k$), then $\alpha_{tk} \approx 1$ and $\alpha_{tj} \approx 0$ for $j \neq k$. What does this imply about the context vector $\mathbf{c}_t$?
Exercise 3: Bahdanau vs. Luong
Compare Bahdanau and Luong attention along the following dimensions:
(a) Count the number of learnable parameters in Bahdanau attention (additive) and Luong attention (general/bilinear), assuming the encoder and decoder both have hidden dimension $d = 512$ and the alignment dimension is $d_a = 256$.
(b) Derive the computational complexity (in FLOPs) for computing all alignment scores for a single decoder step, given encoder sequence length $T$.
(c) Under what circumstances would you prefer Bahdanau over Luong, and vice versa?
Exercise 4: The Scaling Factor
(a) Let $\mathbf{q}, \mathbf{k} \in \mathbb{R}^{d_k}$ where each component is i.i.d. $\mathcal{N}(0, 1)$. Derive $\text{Var}(\mathbf{q}^\top \mathbf{k})$.
(b) Compute the standard deviation of unscaled dot products for $d_k \in \{64, 128, 256, 512, 1024\}$.
(c) Write a PyTorch script that empirically verifies your derivation by sampling many random vectors and computing dot product statistics.
(d) Show that after scaling by $1/\sqrt{d_k}$, the variance becomes 1 regardless of $d_k$.
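A minimal sketch to get started on part (c), assuming the i.i.d. standard normal setup from part (a); the sample count here is an arbitrary choice.

# Sketch for part (c): empirically check dot-product variance.
import torch

torch.manual_seed(0)
num_samples = 100_000

for d_k in [64, 128, 256, 512, 1024]:
    q = torch.randn(num_samples, d_k)   # i.i.d. N(0, 1) queries
    k = torch.randn(num_samples, d_k)   # i.i.d. N(0, 1) keys
    dots = (q * k).sum(dim=-1)          # unscaled dot products
    scaled = dots / d_k**0.5            # scaled by 1/sqrt(d_k)
    print(f"d_k={d_k:5d}  var(q.k)={dots.var():9.1f}  var(scaled)={scaled.var():.3f}")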
Exercise 5: Softmax Temperature
The scaled dot-product attention can be generalized using a temperature parameter $\tau$:
$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\tau}\right)\mathbf{V}$$
(a) What happens to the attention distribution as $\tau \to 0^+$? As $\tau \to \infty$?
(b) Show that the standard scaling factor $\tau = \sqrt{d_k}$ is optimal (in the sense of maintaining unit variance) when $\mathbf{Q}$ and $\mathbf{K}$ have unit-variance entries.
(c) Implement scaled dot-product attention with a learnable temperature parameter. Train it on a simple task and observe how $\tau$ evolves during training.
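One possible starting point for part (c) is sketched below; parameterizing $\tau$ through its logarithm (to keep it positive) and initializing it at $\sqrt{d_k}$ are design choices, not requirements.

# Sketch for part (c): scaled dot-product attention with a learnable temperature.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TemperatureAttention(nn.Module):
    def __init__(self, d_k: int):
        super().__init__()
        # Initialize tau at the standard value sqrt(d_k); training may move it.
        self.log_tau = nn.Parameter(torch.tensor(float(d_k) ** 0.5).log())

    def forward(self, q, k, v):
        tau = self.log_tau.exp()                   # keep tau strictly positive
        scores = q @ k.transpose(-2, -1) / tau     # (..., n_q, n_k)
        weights = F.softmax(scores, dim=-1)
        return weights @ v, weights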
Exercise 6: Self-Attention and Permutation Equivariance
(a) Prove that self-attention (without positional encodings) is permutation equivariant. That is, if we permute the rows of $\mathbf{X}$ by a permutation matrix $\mathbf{P}$, the output is permuted by the same $\mathbf{P}$:
$$\text{SelfAttn}(\mathbf{PX}) = \mathbf{P} \cdot \text{SelfAttn}(\mathbf{X})$$
(b) Why is permutation equivariance a problem for sequence modeling? How do positional encodings break this symmetry?
(c) Is self-attention permutation invariant? Explain the difference between equivariance and invariance.
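Before attempting the proof in part (a), a quick numerical check (a sanity check, not a proof) can build intuition; the sketch below assumes a bare single-head self-attention with random projection matrices and no positional encodings.

# Numerical check of permutation equivariance for part (a).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d = 5, 8
X = torch.randn(n, d)
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def self_attn(X):
    scores = (X @ Wq) @ (X @ Wk).T / d**0.5
    return F.softmax(scores, dim=-1) @ (X @ Wv)

P = torch.eye(n)[torch.randperm(n)]   # random permutation matrix
print(torch.allclose(self_attn(P @ X), P @ self_attn(X), atol=1e-5))  # expect True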
Exercise 7: Attention as Kernel Smoothing
The attention mechanism can be viewed as a form of Nadaraya--Watson kernel regression. In kernel regression, the prediction at query point $q$ is:
$$\hat{f}(q) = \frac{\sum_i K(q, x_i) y_i}{\sum_i K(q, x_i)}$$
(a) Show that scaled dot-product attention is equivalent to kernel regression with the softmax kernel $K(\mathbf{q}, \mathbf{k}) = \exp(\mathbf{q}^\top \mathbf{k} / \sqrt{d_k})$, where the keys play the role of the inputs $x_i$ and the values play the role of the targets $y_i$.
(b) What other kernels could we use? What would a Gaussian kernel look like in the attention framework?
(c) Why is the softmax kernel particularly convenient for neural network training?
Exercise 8: Multi-Head Parameter Count
For a Transformer model with the following configuration:
- $d_{\text{model}} = 768$
- $h = 12$ attention heads
- $d_k = d_v = 64$
(a) Calculate the total number of parameters in a single multi-head attention layer (including all Q, K, V, and output projections, but excluding biases).
(b) How does this compare to a single-head attention with $d_k = 768$?
(c) A BERT-base model has 12 Transformer layers. How many total parameters are in the attention layers alone?
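A small arithmetic helper for checking your answers (it assumes the bias-free setting stated in part (a)):

# Parameter-count arithmetic for parts (a)-(c), excluding biases.
d_model, h, d_k, d_v = 768, 12, 64, 64
qkv = 3 * d_model * (h * d_k)       # W_Q, W_K, W_V projections
out = (h * d_v) * d_model           # output projection W_O
per_layer = qkv + out
print(per_layer, 12 * per_layer)    # one attention layer, then 12 BERT-base layers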
Section B: Implementation Exercises (Exercises 9--18)
Exercise 9: Bahdanau Attention from Scratch
Implement Bahdanau attention in PyTorch without using any high-level library functions for the attention computation itself.
(a) Create a BahdanauAttention class with the full forward pass.
(b) Write a test that verifies attention weights sum to 1 for each query.
(c) Integrate it into a simple encoder--decoder model for sequence reversal (given [1, 2, 3, 4], output [4, 3, 2, 1]).
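The additive score from part (a) can be sketched functionally as follows; tensor names and dimensions are illustrative only, and this is not a full solution to the exercise.

# Sketch of the additive score e_tj = v^T tanh(W1 s_{t-1} + W2 h_j).
import torch

batch, T, d, d_a = 4, 7, 512, 256
s = torch.randn(batch, d)              # previous decoder hidden state s_{t-1}
H = torch.randn(batch, T, d)           # encoder outputs h_1, ..., h_T
W1, W2 = torch.randn(d_a, d), torch.randn(d_a, d)
v = torch.randn(d_a)

e = torch.tanh((s @ W1.T).unsqueeze(1) + H @ W2.T) @ v   # (batch, T) alignment scores
alpha = torch.softmax(e, dim=-1)                          # attention weights, sum to 1
context = torch.bmm(alpha.unsqueeze(1), H).squeeze(1)     # (batch, d) context vector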
Exercise 10: Luong Attention Variants
Implement all three Luong scoring methods (dot, general, concat) in a single class.
(a) Verify that all three produce valid attention distributions (non-negative, sum to 1).
(b) Compare the number of learnable parameters for each method with $d = 256$.
(c) Time the forward pass for each method with sequence length 100 and batch size 32. Which is fastest?
Exercise 11: Scaled Dot-Product Attention with Masking
Implement scaled dot-product attention that supports both padding masks and causal masks.
(a) Verify that masked positions receive exactly zero attention weight after softmax.
(b) Test with a causal mask that the attention weight matrix is lower-triangular.
(c) Handle the edge case where an entire row is masked (all keys are masked for a given query). What should the output be?
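A common masking idiom is sketched below; the convention that mask entries equal to True mean "keep" is an assumption, and the fully masked edge case from part (c) is deliberately left unhandled.

# Sketch of masking via -inf scores (True in the mask means "may attend").
import torch
import torch.nn.functional as F

def masked_attention(q, k, v, mask=None):
    scores = q @ k.transpose(-2, -1) / q.size(-1) ** 0.5
    if mask is not None:
        # exp(-inf) = 0, so disallowed positions get exactly zero weight.
        scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Causal mask for part (b): position i may attend only to positions j <= i.
n = 5
causal = torch.tril(torch.ones(n, n, dtype=torch.bool))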
Exercise 12: Multi-Head Attention*
Implement multi-head attention from scratch (not using nn.MultiheadAttention).
(a) Verify that your implementation matches nn.MultiheadAttention numerically (up to floating-point precision) when given the same weights.
(b) Implement the "fused QKV projection" optimization and measure the speedup.
(c) Add a method to extract and return attention weights for all heads.
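For part (b), the fused projection amounts to replacing three linear layers with one; a minimal sketch (shapes chosen arbitrarily for illustration):

# Sketch of the fused QKV projection: one matmul produces Q, K, and V.
import torch
import torch.nn as nn

d_model, num_heads = 768, 12
qkv_proj = nn.Linear(d_model, 3 * d_model)

x = torch.randn(32, 128, d_model)                        # (batch, seq, d_model)
q, k, v = qkv_proj(x).chunk(3, dim=-1)                   # each (batch, seq, d_model)
# Reshape to (batch, heads, seq, head_dim) before computing per-head attention.
q = q.view(32, 128, num_heads, d_model // num_heads).transpose(1, 2)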
Exercise 13: Relative Position Attention*
Implement relative position attention as described in Shaw et al. (2018).
(a) Instead of absolute positional encodings, add learnable relative position biases to the attention scores.
(b) The relative position bias for query at position $i$ and key at position $j$ depends only on $i - j$ (clipped to a maximum distance).
(c) Verify that your implementation handles sequences of different lengths correctly.
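The clipped relative-position index from part (b) is the fiddly part; one possible construction is sketched below (the sign convention, the maximum distance, and the bias-table shape are all illustrative choices).

# Sketch of clipped relative-position indices into a learnable bias table.
import torch

def relative_positions(seq_len: int, max_dist: int) -> torch.Tensor:
    pos = torch.arange(seq_len)
    rel = pos[None, :] - pos[:, None]                 # rel[i, j] = j - i
    rel = rel.clamp(-max_dist, max_dist) + max_dist   # shift into [0, 2 * max_dist]
    return rel                                        # indices into the bias table

bias_table = torch.nn.Parameter(torch.zeros(2 * 4 + 1))   # one bias per clipped offset, max_dist = 4
bias = bias_table[relative_positions(10, 4)]               # (10, 10) additive bias for the scores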
Exercise 14: Attention Visualization
Write a function that produces attention heatmap visualizations.
(a) Generate a heatmap for single-head attention on a synthetic sequence.
(b) Create a grid of heatmaps showing all heads in a multi-head attention layer.
(c) Animate attention weights across decoder time steps to show how attention shifts during sequence generation.
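A minimal heatmap sketch for part (a), assuming matplotlib is available; the random weights are a stand-in for real attention output.

# Sketch for part (a): plot an attention matrix as a heatmap.
import matplotlib.pyplot as plt
import torch

attn = torch.softmax(torch.randn(10, 10), dim=-1)   # stand-in attention weights

fig, ax = plt.subplots()
im = ax.imshow(attn.numpy(), cmap="viridis", aspect="auto")
ax.set_xlabel("Key position")
ax.set_ylabel("Query position")
fig.colorbar(im, ax=ax, label="Attention weight")
plt.show()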
Exercise 15: Attention-Based Sequence Classifier*
Build a complete text classifier using self-attention (no RNN).
(a) Embed input tokens, apply 2 layers of multi-head self-attention with residual connections, then average-pool the output and classify.
(b) Train on a sentiment classification dataset (e.g., IMDB reviews or SST-2).
(c) Compare the classification accuracy with an LSTM-based classifier from Chapter 16.
(d) Visualize which words receive the highest attention for positive vs. negative reviews.
Exercise 16: Causal Language Model with Attention*
Implement a simple causal (autoregressive) language model using only self-attention (no recurrence).
(a) Use multi-head self-attention with a causal mask.
(b) Train on a small text corpus and generate text by sampling from the model.
(c) Compare perplexity with an LSTM language model trained on the same data.
Exercise 17: Efficient Attention**
Implement two efficient attention variants:
(a) Sparse attention: Each position only attends to positions within a sliding window of size $w$ plus a set of global positions. Implement the masking pattern.
(b) Linear attention: Replace softmax with the kernel $\phi(\mathbf{x}) = \text{elu}(\mathbf{x}) + 1$ and use the associativity of matrix multiplication to achieve $O(n)$ complexity. Verify the complexity empirically.
(c) Compare the speed and accuracy of both variants against standard attention on a sequence classification task with varying sequence lengths.
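The core reordering trick for part (b) is sketched below: because $\phi(\mathbf{Q})\phi(\mathbf{K})^\top \mathbf{V}$ can be computed as $\phi(\mathbf{Q})\,(\phi(\mathbf{K})^\top \mathbf{V})$, the cost is linear in sequence length. This is a sketch under the stated kernel, not a tuned implementation.

# Sketch of linear attention with phi(x) = elu(x) + 1 for part (b).
import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    # q, k: (n, d_k); v: (n, d_v)
    q, k = F.elu(q) + 1, F.elu(k) + 1                  # non-negative feature maps
    kv = k.transpose(-2, -1) @ v                        # (d_k, d_v), built once
    normalizer = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)  # (n, 1)
    return (q @ kv) / normalizer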
Exercise 18: Attention Gradient Analysis**
Analyze the gradients flowing through the attention mechanism.
(a) For a simple single-head attention with $n = 3$ and $d_k = 2$, manually compute $\frac{\partial \text{output}}{\partial \mathbf{Q}}$, $\frac{\partial \text{output}}{\partial \mathbf{K}}$, and $\frac{\partial \text{output}}{\partial \mathbf{V}}$.
(b) Verify your manual computation using torch.autograd.
(c) Explain why the gradient with respect to $\mathbf{V}$ is simpler than the gradients with respect to $\mathbf{Q}$ and $\mathbf{K}$.
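For part (b), torch.autograd offers two handy checks; the sketch below uses double precision, which gradcheck expects for numerical stability.

# Sketch for part (b): verify hand-derived gradients with autograd.
import torch

def attention(q, k, v):
    scores = q @ k.T / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

q, k, v = (torch.randn(3, 2, dtype=torch.double, requires_grad=True) for _ in range(3))
print(torch.autograd.gradcheck(attention, (q, k, v)))   # expect True

# For an element-by-element comparison with your manual derivation:
jac = torch.autograd.functional.jacobian(attention, (q, k, v))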
Section C: Analysis and Research Exercises (Exercises 19--28)
Exercise 19: Attention and the Rank Bottleneck*
(a) The matrix of attention scores $\mathbf{Q}\mathbf{K}^\top / \sqrt{d_k}$ (the input to the softmax) has rank at most $d_k$. Prove this, and explain why the attention matrix $\mathbf{A} = \text{softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d_k})$ is therefore a row-wise softmax of a low-rank matrix.
(b) If we have $n = 100$ tokens and $d_k = 64$, what does this rank constraint imply about the expressiveness of attention?
(c) Multi-head attention partially addresses this. Explain how having $h$ heads with $d_k = d_{\text{model}}/h$ interacts with the rank bottleneck.
Exercise 20: Attention Without Softmax*
Explore what happens when we replace softmax with other normalization functions.
(a) Implement attention with L1 normalization (divide scores by their sum of absolute values) instead of softmax. Does it still work?
(b) Implement attention with sigmoid activation (no normalization --- weights do not sum to 1). What are the implications?
(c) Recent work (e.g., "Sigmoid Attention" by Ramapuram et al., 2024) has revisited non-softmax attention. Discuss the pros and cons.
Exercise 21: Cross-Attention Analysis*
In a Transformer decoder, cross-attention attends from the decoder to the encoder.
(a) If the encoder has sequence length $T_{\text{enc}}$ and the decoder has length $T_{\text{dec}}$, what is the shape of the cross-attention matrix?
(b) For machine translation from German to English, what patterns would you expect to see in the cross-attention matrix for a well-trained model?
(c) Implement a simple encoder-decoder with cross-attention and train it on a synthetic task (e.g., sorting numbers). Visualize the cross-attention patterns.
Exercise 22: Attention Entropy*
The entropy of the attention distribution measures how "spread out" the attention is.
(a) Define the attention entropy for a single query attending to $m$ keys:
$$H = -\sum_{j=1}^{m} \alpha_j \log \alpha_j$$
What is the maximum possible entropy? What is the minimum?
(b) Compute the average attention entropy across all heads and layers for a pre-trained Transformer model (you can use a Hugging Face model). Which layers have higher entropy?
(c) Is there a correlation between attention entropy and layer depth? What might this tell us about how information flows through the network?
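One way to approach part (b) is sketched below. It assumes the Hugging Face transformers library; the bert-base-uncased checkpoint and the example sentence are arbitrary choices.

# Sketch for part (b): average attention entropy per layer of a pre-trained model.
import torch
from transformers import AutoModel, AutoTokenizer

name = "bert-base-uncased"
tok, model = AutoTokenizer.from_pretrained(name), AutoModel.from_pretrained(name)

inputs = tok("Attention entropy measures how spread out the attention is.", return_tensors="pt")
with torch.no_grad():
    attns = model(**inputs, output_attentions=True).attentions  # tuple of (1, heads, n, n)

for layer, a in enumerate(attns):
    entropy = -(a * (a + 1e-12).log()).sum(-1).mean()   # average over heads and queries
    print(f"layer {layer:2d}: mean entropy = {entropy:.3f}")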
Exercise 23: Attention and Graph Neural Networks*
Self-attention can be viewed as a message-passing operation on a fully connected graph.
(a) Express self-attention in terms of graph neural network (GNN) notation: nodes, edges, messages, and aggregation.
(b) How does masking correspond to changing the graph structure?
(c) What is the key difference between GAT (Graph Attention Networks) and Transformer self-attention?
Exercise 24: Attention Complexity Benchmark*
Empirically measure the computational cost of attention.
(a) Time the forward pass of scaled dot-product attention for sequence lengths $n \in \{128, 256, 512, 1024, 2048, 4096\}$ and $d_k = 64$. Plot the results and verify the $O(n^2)$ scaling.
(b) Measure memory consumption for the same sequence lengths using torch.cuda.max_memory_allocated() (if GPU is available) or estimate from tensor sizes.
(c) At what sequence length does attention become the bottleneck compared to the feed-forward layers in a Transformer (which have $O(n \cdot d^2)$ complexity)?
Exercise 25: Hard Attention**
Unlike soft attention (which computes a weighted average), hard attention selects a single position.
(a) Implement hard attention using torch.argmax on the attention scores. Why is this not differentiable?
(b) Implement hard attention using the Straight-Through Estimator (STE): use argmax in the forward pass but pass gradients through the softmax in the backward pass.
(c) Implement hard attention using the Gumbel-softmax trick. Compare the gradient estimates from STE and Gumbel-softmax.
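The straight-through trick in part (b) can be written in a few lines; the sketch below is one common formulation, and the Gumbel-softmax variant in part (c) is available as torch.nn.functional.gumbel_softmax.

# Sketch for part (b): hard one-hot selection forward, soft gradients backward.
import torch
import torch.nn.functional as F

def hard_attention_ste(scores, v):
    soft = F.softmax(scores, dim=-1)
    hard = F.one_hot(soft.argmax(dim=-1), scores.size(-1)).to(soft.dtype)
    # Forward value equals hard; gradients flow only through the soft term.
    weights = hard + soft - soft.detach()
    return weights @ v

# Part (c): F.gumbel_softmax(scores, tau=1.0, hard=True) gives the Gumbel-softmax variant.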
Exercise 26: Analyzing Attention Head Redundancy**
Some attention heads may be redundant or unimportant.
(a) Implement "attention head pruning" by zeroing out entire heads one at a time and measuring the change in model output (e.g., loss on a validation set).
(b) Compute the pairwise cosine similarity between the output projections of different heads. Are some heads learning similar functions?
(c) Michel et al. (2019) showed that many heads can be pruned with little loss in performance. Replicate this finding on a small model.
Exercise 27: Mixture of Attention Heads**
Instead of concatenating head outputs, implement a mixture-of-experts approach.
(a) Compute a gating score for each head and take a weighted sum instead of concatenation.
(b) Train on a language modeling task and observe which heads receive higher gating scores.
(c) Does this approach outperform standard concatenation-based multi-head attention? Discuss why or why not.
Exercise 28: Attention Distillation**
Knowledge distillation can be applied to attention patterns.
(a) Train a large multi-head attention model (teacher) on a classification task.
(b) Train a smaller model (student) to match both the teacher's output and the teacher's attention distributions (using KL divergence on attention weights).
(c) Compare the student's performance with and without attention distillation. Does mimicking the teacher's attention help?
Section D: Mathematical Proofs (Exercises 29--32)
Exercise 29: Softmax Invariance
Prove that softmax is invariant to constant shifts: for any constant $c$,
$$\text{softmax}(\mathbf{z} + c\mathbf{1}) = \text{softmax}(\mathbf{z})$$
Explain why this property is exploited in numerically stable softmax implementations.
Exercise 30: Attention and Convolution
Show that a 1D convolution with kernel size $k$ can be expressed as a special case of masked attention where:
(a) The mask allows each position to attend only to positions within a window of size $k$.
(b) The attention weights are fixed (not input-dependent).
What additional capability does attention have that convolution lacks?
Exercise 31: Expressiveness of Multi-Head Attention*
Prove that multi-head attention with $h$ heads, each of head dimension $d_k$, can represent any function that a single attention head of dimension $d_k$ can represent, but that the converse is not true.
Hint: Consider the output projection $\mathbf{W}^O$ and show that it can "select" a single head's output.
Exercise 32: Lipschitz Continuity of Attention**
(a) Is the attention function $\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V})$ Lipschitz continuous with respect to $\mathbf{Q}$?
(b) If so, derive an upper bound on the Lipschitz constant in terms of $\|\mathbf{K}\|$ and $\|\mathbf{V}\|$.
(c) What does this tell us about the stability of attention to small perturbations in the queries?
Coding Challenge: Build a Mini-Transformer Attention Block
Combine everything from this chapter to build a complete attention block:
- Multi-head self-attention with causal masking
- Layer normalization and residual connections
- A feed-forward network with ReLU activation
Train your block as a single-layer language model on a small text dataset and report perplexity. This exercise prepares you for Chapter 19, where we will build the full Transformer architecture.
# Starter code
import torch
import torch.nn as nn

torch.manual_seed(42)


class AttentionBlock(nn.Module):
    """A single Transformer-style attention block.

    Combines multi-head self-attention with a feed-forward network,
    layer normalization, and residual connections.

    Args:
        d_model: Model dimension.
        num_heads: Number of attention heads.
        d_ff: Feed-forward hidden dimension.
        dropout: Dropout rate.
    """

    def __init__(
        self,
        d_model: int,
        num_heads: int,
        d_ff: int,
        dropout: float = 0.1,
    ) -> None:
        super().__init__()
        # TODO: Implement the full attention block
        pass

    def forward(
        self,
        x: torch.Tensor,
        mask: torch.Tensor | None = None,
    ) -> torch.Tensor:
        """Forward pass.

        Args:
            x: Input of shape (batch_size, seq_len, d_model).
            mask: Optional attention mask.

        Returns:
            Output of shape (batch_size, seq_len, d_model).
        """
        # TODO: Implement the forward pass
        pass