Chapter 10: Exercises

Exercises are graded by difficulty:

- One star (*): Apply the technique from the chapter to a new dataset or scenario
- Two stars (**): Extend the technique or combine it with a previous chapter's methods
- Three stars (***): Derive a result, implement from scratch, or design a system component
- Four stars (****): Research-level problems that connect to open questions in the field


Attention Fundamentals

Exercise 10.1 (*)

Given query $\mathbf{q} = [1, 0, 1, 0]$, keys $\mathbf{k}_1 = [1, 1, 0, 0]$, $\mathbf{k}_2 = [0, 0, 1, 1]$, $\mathbf{k}_3 = [1, 0, 1, 0]$, and values $\mathbf{v}_1 = [1, 0]$, $\mathbf{v}_2 = [0, 1]$, $\mathbf{v}_3 = [1, 1]$:

(a) Compute the raw dot-product scores $\mathbf{q}^\top \mathbf{k}_i$ for each key.

(b) Compute the scaled scores (divide by $\sqrt{d_k}$ where $d_k = 4$).

(c) Apply softmax to the scaled scores to get attention weights $\alpha_1, \alpha_2, \alpha_3$.

(d) Compute the attention output $\sum_i \alpha_i \mathbf{v}_i$.

(e) Which key receives the most attention? Why does this make geometric sense?
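After working parts (a)-(d) by hand, you can sanity-check your numbers with a few lines of PyTorch (a sketch mirroring the vectors above):

```python
import torch

q = torch.tensor([1., 0., 1., 0.])
K = torch.tensor([[1., 1., 0., 0.],
                  [0., 0., 1., 1.],
                  [1., 0., 1., 0.]])
V = torch.tensor([[1., 0.],
                  [0., 1.],
                  [1., 1.]])

scores = K @ q                       # raw dot products q^T k_i
scaled = scores / 4 ** 0.5           # divide by sqrt(d_k) with d_k = 4
alpha = torch.softmax(scaled, dim=0)  # attention weights
output = alpha @ V                    # weighted sum of values
print(scores, alpha, output)
```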


Exercise 10.2 (*)

The scaling factor $1/\sqrt{d_k}$ normalizes the variance of dot products.

(a) If $q_j, k_j \sim \mathcal{N}(0, 1)$ i.i.d., what is $\mathbb{E}[q_j k_j]$ and $\text{Var}(q_j k_j)$?

(b) Using your answer to (a), derive $\text{Var}(\mathbf{q}^\top \mathbf{k}) = d_k$ for $\mathbf{q}, \mathbf{k} \in \mathbb{R}^{d_k}$.

(c) Verify empirically: generate 10,000 pairs of random vectors in $\mathbb{R}^{64}$ and $\mathbb{R}^{512}$, compute dot products, and compare the standard deviations against $\sqrt{64}$ and $\sqrt{512}$.

import torch

torch.manual_seed(42)
d_values = [64, 512]
for d in d_values:
    q = torch.randn(10000, d)
    k = torch.randn(10000, d)
    dots = (q * k).sum(dim=-1)
    scaled_dots = dots / (d ** 0.5)
    print(f"d={d}: raw std={dots.std():.2f} (expected {d**0.5:.2f}), "
          f"scaled std={scaled_dots.std():.2f} (expected 1.00)")

Exercise 10.3 (*)

Explain the difference between self-attention and cross-attention in terms of where Q, K, and V come from. For each of the following scenarios, state which type is used:

(a) An encoder layer processing a source sentence in a translation model.

(b) A decoder layer attending to the encoder's output in a translation model.

(c) A decoder layer attending to previously generated tokens.

(d) A BERT model computing contextualized representations.

(e) A Vision Transformer (ViT) processing image patches.


Exercise 10.4 (*)

For a transformer with $d_{\text{model}} = 512$ and $h = 8$ attention heads:

(a) What is $d_k = d_v$ for each head?

(b) What are the shapes of $\mathbf{W}_i^Q$, $\mathbf{W}_i^K$, $\mathbf{W}_i^V$ for one head?

(c) What is the shape of $\mathbf{W}^O$ (the output projection)?

(d) How many total parameters are in one multi-head attention layer (excluding biases)?

(e) How does this compare to a single-head attention layer with $d_k = 512$? Show that the parameter count is the same.
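A quick arithmetic check for (d) and (e), counting weight matrices only (no biases): stacking the per-head projections across all $h$ heads gives the same totals as the single-head case.

```python
d_model, h = 512, 8
d_k = d_model // h  # 64 per head

# Per head, W_Q, W_K, W_V are each (d_model, d_k). Across h heads they
# total 3 * d_model * d_model parameters; W_O is (d_model, d_model).
multi_head_params = 3 * d_model * d_model + d_model * d_model

# Single-head attention with d_k = 512 uses the same four
# d_model x d_model matrices.
single_head_params = 4 * d_model * d_model

print(multi_head_params, single_head_params)
```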


Positional Encoding

Exercise 10.5 (**)

(a) Implement sinusoidal positional encoding and plot the first 100 positions for dimensions 0, 1, 50, and 51 of a $d_{\text{model}} = 128$ encoding. Describe what you observe about the frequencies.

import torch
import math
import matplotlib.pyplot as plt

def compute_sinusoidal_pe(max_len: int, d_model: int) -> torch.Tensor:
    """Compute sinusoidal positional encoding matrix.

    Args:
        max_len: Maximum sequence length.
        d_model: Model dimension.

    Returns:
        Tensor of shape (max_len, d_model).
    """
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2, dtype=torch.float)
        * (-math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = compute_sinusoidal_pe(100, 128)
# Plot dimensions 0, 1, 50, 51
# YOUR VISUALIZATION CODE HERE

(b) Compute the cosine similarity between $PE_0$ and $PE_k$ for $k = 1, 2, \ldots, 99$. How does similarity change with distance?

(c) For a fixed offset $k = 5$, compute $PE_{pos+5} - PE_{pos}$ for $pos = 0, 10, 20, \ldots, 90$. Are these differences approximately constant? What does this imply about the model's ability to learn relative positions?


Exercise 10.6 (**)

(a) Verify the rotation matrix property of sinusoidal positional encodings. For a single frequency $\omega$, show numerically that:

$$\begin{bmatrix} \sin(\omega(pos + k)) \\ \cos(\omega(pos + k)) \end{bmatrix} = \begin{bmatrix} \cos(\omega k) & \sin(\omega k) \\ -\sin(\omega k) & \cos(\omega k) \end{bmatrix} \begin{bmatrix} \sin(\omega \cdot pos) \\ \cos(\omega \cdot pos) \end{bmatrix}$$

for $\omega = 1/10000^{0/128} = 1$, $pos = 7$, and $k = 3$.

(b) Explain why this property means the model can learn relative positional relationships through linear projections in Q and K.

(c) Why is this property not preserved when we use learned positional embeddings?
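The numerical check in (a) takes only a few lines; this sketch evaluates both sides of the identity directly:

```python
import math
import torch

omega = 1.0 / 10000 ** (0 / 128)  # = 1.0 for dimension pair 0
pos, k = 7, 3

# Left-hand side: the encoding at position pos + k.
lhs = torch.tensor([math.sin(omega * (pos + k)),
                    math.cos(omega * (pos + k))])

# Right-hand side: rotation matrix for offset k applied to position pos.
R = torch.tensor([[math.cos(omega * k), math.sin(omega * k)],
                  [-math.sin(omega * k), math.cos(omega * k)]])
rhs = R @ torch.tensor([math.sin(omega * pos), math.cos(omega * pos)])

print(torch.allclose(lhs, rhs, atol=1e-6))
```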


Architecture and Design

Exercise 10.7 (**)

Consider a transformer block with $d_{\text{model}} = 256$, $h = 8$ heads, and $d_{ff} = 1024$.

(a) Compute the total parameter count for one transformer block, broken down by component:

- Multi-head attention (QKV projection + output projection)
- FFN (two linear layers)
- Layer normalization (2 instances)

(b) What fraction of the parameters are in the FFN? Why is this fraction significant?

(c) If we increase the model to $d_{\text{model}} = 1024$ and $d_{ff} = 4096$ (keeping $h = 8$), by what factor does the parameter count increase? Is it linear or quadratic in $d_{\text{model}}$?


Exercise 10.8 (**)

Compare pre-LN and post-LN transformers.

(a) Write the mathematical expressions for a single sub-layer in both the pre-LN and post-LN variants. Be explicit about the order of operations.

(b) Consider a 6-layer pre-LN transformer. Trace the gradient path from the loss to the input of the first layer. Identify every operation the gradient passes through. Count how many LayerNorm Jacobians it encounters.

(c) Repeat (b) for a 6-layer post-LN transformer. How many LayerNorm Jacobians does the gradient encounter on the direct residual path?

(d) Explain why the difference in (b) and (c) makes pre-LN easier to train without learning rate warmup.


Exercise 10.9 (**)

The causal mask prevents the decoder from attending to future positions.

(a) Write out the $4 \times 4$ causal mask matrix (using 0 for allowed positions and $-\infty$ for masked positions).

(b) After adding this mask to attention scores $\mathbf{S}$ and applying softmax row-wise, what is the structure of the resulting attention weight matrix? Which entries are zero?

(c) A bidirectional model (like BERT) does not use a causal mask. Explain why this means BERT cannot be used directly for text generation, even though it is trained on a language modeling objective (masked LM is different from autoregressive LM).

(d) Prefix LM (used in some models like T5 in certain configurations) applies causal masking only to the output portion, while the prefix (input) is fully bidirectional. Write the mask matrix for a sequence where positions 1-3 are the prefix and positions 4-6 are the target.
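One way to construct both mask types in PyTorch, as a starting point (torch.triu keeps the strict upper triangle, which corresponds to future positions):

```python
import torch

neg_inf = float("-inf")

# (a) 4x4 causal mask: position i may attend only to positions j <= i.
causal = torch.triu(torch.full((4, 4), neg_inf), diagonal=1)
print(causal)

# (d) Prefix-LM mask for a length-6 sequence with a 3-token prefix:
# start from a causal mask, then open up the prefix columns so every
# position can attend to the full prefix bidirectionally.
L, prefix = 6, 3
mask = torch.triu(torch.full((L, L), neg_inf), diagonal=1)
mask[:, :prefix] = 0.0
print(mask)
```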


Complexity and Efficiency

Exercise 10.10 (**)

(a) For a self-attention layer with sequence length $n$ and dimension $d$, derive the FLOPs for:

- Computing $\mathbf{Q}\mathbf{K}^\top$ (the attention score matrix)
- Computing $\text{softmax}(\cdot)$ (approximate)
- Computing $\text{Attn} \cdot \mathbf{V}$ (the weighted value sum)

(b) For $n = 4096$ and $d = 1024$, compute the total FLOPs and the memory required for the attention score matrix (in float32 bytes).

(c) At what sequence length $n$ does the attention computation (FLOPs) exceed the FFN computation? Assume $d_{ff} = 4d$.

(d) For the same $n$ and $d$, compute the memory required for the KV-cache for a 32-layer, 32-head model with $d_k = 64$ in float16. How does this compare to the model parameters themselves?
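A starting point for the arithmetic in (b), using the convention that a matmul of an $(n, m)$ with an $(m, p)$ matrix costs roughly $2nmp$ FLOPs (one multiply and one add per term):

```python
n, d = 4096, 1024

flops_qk = 2 * n * n * d   # Q @ K^T: (n, d) x (d, n)
flops_av = 2 * n * n * d   # Attn @ V: (n, n) x (n, d)
softmax_flops = 5 * n * n  # rough: max, subtract, exp, sum, divide

score_bytes = n * n * 4    # float32 attention score matrix
print(f"QK^T FLOPs: {flops_qk:.3e}")
print(f"score matrix: {score_bytes / 2**20:.0f} MiB")
```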


Exercise 10.11 (**)

Flash attention and standard attention compute the same result but have different memory profiles.

(a) What is the peak memory usage of the naive attention algorithm in terms of $n$ and $d$? Which intermediate tensor dominates?

(b) Flash attention reduces peak memory to $O(n)$ by never materializing the full attention matrix. Explain conceptually how tiling and the online softmax trick achieve this.

(c) In PyTorch 2.0+, F.scaled_dot_product_attention automatically selects the best attention backend. Write code that compares the wall-clock time of:

- Naive attention (manual matmul + softmax + matmul)
- F.scaled_dot_product_attention

for sequence lengths $n \in \{256, 1024, 4096, 8192\}$ with $d = 64$ and batch size 8.

import torch
import torch.nn.functional as F
import time

def benchmark_attention(seq_len: int, d: int = 64, batch: int = 8) -> dict:
    """Compare naive vs. PyTorch optimized attention timing.

    Args:
        seq_len: Sequence length.
        d: Head dimension.
        batch: Batch size.

    Returns:
        Dictionary with timing results in milliseconds.
    """
    device = "cuda" if torch.cuda.is_available() else "cpu"
    Q = torch.randn(batch, seq_len, d, device=device)
    K = torch.randn(batch, seq_len, d, device=device)
    V = torch.randn(batch, seq_len, d, device=device)

    # Warmup
    for _ in range(5):
        _ = F.scaled_dot_product_attention(Q, K, V)

    if device == "cuda":
        torch.cuda.synchronize()

    # Naive attention
    start = time.perf_counter()
    for _ in range(20):
        scores = torch.matmul(Q, K.transpose(-2, -1)) / (d ** 0.5)
        weights = F.softmax(scores, dim=-1)
        out_naive = torch.matmul(weights, V)
    if device == "cuda":
        torch.cuda.synchronize()
    naive_ms = (time.perf_counter() - start) / 20 * 1000

    # Optimized attention
    start = time.perf_counter()
    for _ in range(20):
        out_opt = F.scaled_dot_product_attention(Q, K, V)
    if device == "cuda":
        torch.cuda.synchronize()
    opt_ms = (time.perf_counter() - start) / 20 * 1000

    return {"seq_len": seq_len, "naive_ms": naive_ms, "optimized_ms": opt_ms,
            "speedup": naive_ms / opt_ms}

for n in [256, 1024, 4096, 8192]:
    result = benchmark_attention(n)
    print(f"n={result['seq_len']:5d} | Naive: {result['naive_ms']:7.2f} ms | "
          f"Optimized: {result['optimized_ms']:7.2f} ms | "
          f"Speedup: {result['speedup']:.2f}x")

Implementation

Exercise 10.12 (***)

Implement the online softmax algorithm from scratch. Your implementation should:

(a) Process a 1D tensor of scores in blocks of size $B$.

(b) Maintain a running maximum and running sum across blocks.

(c) Produce output identical to torch.softmax(scores, dim=0).

(d) Verify numerical equivalence for random scores of length 1024 with block sizes $B \in \{16, 64, 256\}$.

import torch

def online_softmax(scores: torch.Tensor, block_size: int) -> torch.Tensor:
    """Compute softmax using the online algorithm (block-wise processing).

    This is the core mathematical trick behind flash attention.

    Args:
        scores: 1D tensor of attention scores.
        block_size: Number of scores to process per block.

    Returns:
        Softmax probabilities, same shape as input.
    """
    n = scores.size(0)
    num_blocks = (n + block_size - 1) // block_size

    # Pass 1: compute global max and sum using running statistics
    running_max = torch.tensor(float("-inf"))
    running_sum = torch.tensor(0.0)

    for b in range(num_blocks):
        start = b * block_size
        end = min(start + block_size, n)
        block = scores[start:end]

        block_max = block.max()
        new_max = torch.max(running_max, block_max)

        # Rescale previous sum to new max
        running_sum = running_sum * torch.exp(running_max - new_max)
        running_sum = running_sum + torch.exp(block - new_max).sum()
        running_max = new_max

    # Pass 2: compute normalized probabilities
    probs = torch.exp(scores - running_max) / running_sum
    return probs

# Test
torch.manual_seed(0)
scores = torch.randn(1024)
ref = torch.softmax(scores, dim=0)
for B in [16, 64, 256]:
    result = online_softmax(scores, block_size=B)
    print(f"Block size {B:3d}: max error = {(result - ref).abs().max():.2e}")

Exercise 10.13 (***)

Implement a complete transformer decoder block with causal self-attention and cross-attention. Your implementation should:

(a) Include three sub-layers: masked self-attention, cross-attention, and FFN.

(b) Use pre-LN for all three sub-layers.

(c) Accept encoder output as input to the cross-attention sub-layer.

import torch
import torch.nn as nn
from typing import Optional

class TransformerDecoderBlock(nn.Module):
    """Single transformer decoder block with pre-LN architecture.

    Sub-layers:
        1. Masked multi-head self-attention (causal)
        2. Multi-head cross-attention (decoder queries, encoder keys/values)
        3. Position-wise feed-forward network
    """

    def __init__(
        self,
        d_model: int,
        num_heads: int,
        d_ff: int,
        dropout: float = 0.1,
    ) -> None:
        super().__init__()
        # YOUR IMPLEMENTATION HERE
        pass

    def forward(
        self,
        x: torch.Tensor,
        encoder_output: torch.Tensor,
        causal_mask: Optional[torch.Tensor] = None,
        encoder_mask: Optional[torch.Tensor] = None,
    ) -> torch.Tensor:
        """Forward pass through one decoder block.

        Args:
            x: Decoder input of shape (batch, tgt_len, d_model).
            encoder_output: Encoder output of shape (batch, src_len, d_model).
            causal_mask: Causal attention mask for self-attention.
            encoder_mask: Padding mask for cross-attention.

        Returns:
            Decoder output of shape (batch, tgt_len, d_model).
        """
        # YOUR IMPLEMENTATION HERE
        pass

(d) Test your decoder block with random encoder output ($\text{src\_len} = 10$, $\text{tgt\_len} = 8$, $d_{\text{model}} = 64$, $h = 4$) and verify the output shape.


Exercise 10.14 (***)

Implement a complete transformer for character-level language modeling.

(a) Use a decoder-only architecture with causal masking.

(b) Train on a simple corpus (e.g., the first 10,000 characters of a book from Project Gutenberg, or a synthetic corpus).

(c) Generate text by sampling from the model autoregressively. Use temperature-controlled sampling.

(d) Plot the training loss curve and show sample generations at epoch 1, 10, and 50. How does generation quality improve with training?


Exercise 10.15 (***)

Implement KV-caching for autoregressive generation with your decoder-only transformer from Exercise 10.14.

(a) Modify the attention mechanism to accept and return a KV-cache (stored key and value tensors from previous positions).

(b) Implement a generate function that uses the KV-cache to avoid redundant computation during token-by-token generation.

(c) Benchmark the wall-clock time for generating 100 tokens with and without KV-caching. What speedup do you observe?

import torch
import torch.nn as nn

@torch.no_grad()
def generate_with_kv_cache(
    model: nn.Module,
    prompt: torch.Tensor,
    max_new_tokens: int = 100,
    temperature: float = 1.0,
) -> torch.Tensor:
    """Generate tokens autoregressively with KV-cache.

    Args:
        model: Decoder-only transformer model.
        prompt: Initial token IDs of shape (1, prompt_len).
        max_new_tokens: Number of tokens to generate.
        temperature: Sampling temperature.

    Returns:
        Generated token IDs of shape (1, prompt_len + max_new_tokens).
    """
    # YOUR IMPLEMENTATION HERE
    pass

Conceptual and Analytical

Exercise 10.16 (**)

The feed-forward network in a transformer block has been interpreted as a key-value memory (Geva et al., 2021).

(a) In the FFN $\text{GELU}(\mathbf{x}\mathbf{W}_1)\mathbf{W}_2$, identify what plays the role of "keys" and what plays the role of "values."

(b) If $d_{\text{model}} = 512$ and $d_{ff} = 2048$, how many "memory slots" does the FFN have?

(c) How does this interpretation explain why increasing $d_{ff}$ improves model capacity for knowledge-intensive tasks?


Exercise 10.17 (**)

The "residual stream" view (Elhage et al., 2021) treats the transformer as a series of layers that read from and write to a shared communication channel.

(a) Write the output of a 3-layer pre-LN transformer encoder as a single expression showing all residual additions explicitly. Use $A_l$ for the attention sub-layer and $F_l$ for the FFN sub-layer of layer $l$.

(b) In this view, can layer 3's attention mechanism access the original input embedding? Explain.

(c) Can layer 3 access the output of layer 1's FFN? Explain.

(d) Why does this communication pattern make the transformer more flexible than a strict sequential pipeline?


Exercise 10.18 (**)

You are designing a transformer for StreamRec session modeling. The session has up to 100 items, each represented by a 64-dimensional embedding. You have a GPU with 16 GB of memory and a latency budget of 10 ms per session at inference.

(a) Choose $d_{\text{model}}$, $h$, $N$ (number of layers), and $d_{ff}$. Justify your choices with parameter count and FLOP estimates.

(b) Should you use sinusoidal or learned positional embeddings? Justify.

(c) Should you use causal or bidirectional attention? What are the trade-offs for recommendation?

(d) Estimate the inference latency for your architecture on a modern GPU (assume ~150 TFLOPS for float16).


Exercise 10.19 (***)

Prove that self-attention is permutation equivariant in the absence of positional encodings.

(a) Let $\mathbf{P}_\sigma$ be the permutation matrix corresponding to permutation $\sigma$. Show that $\text{Attention}(\mathbf{P}_\sigma \mathbf{Q}, \mathbf{P}_\sigma \mathbf{K}, \mathbf{P}_\sigma \mathbf{V}) = \mathbf{P}_\sigma \text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V})$.

(b) Explain why this property means that self-attention treats its input as a set, not a sequence.

(c) How do positional encodings break this permutation equivariance?
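Before attempting the proof in (a), it can help to confirm the identity numerically on random inputs (a sketch; shapes are arbitrary):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
n, d = 6, 8
Q, K, V = torch.randn(n, d), torch.randn(n, d), torch.randn(n, d)

perm = torch.randperm(n)
P = torch.eye(n)[perm]  # permutation matrix P_sigma

def attention(Q, K, V):
    return F.softmax(Q @ K.T / d ** 0.5, dim=-1) @ V

# Permuting the inputs should permute the outputs the same way:
lhs = attention(P @ Q, P @ K, P @ V)
rhs = P @ attention(Q, K, V)
print(torch.allclose(lhs, rhs, atol=1e-6))
```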


Exercise 10.20 (***)

Analyze the attention patterns of a trained transformer.

(a) Train the TransformerEncoder from Section 10.9 on the PatternSequenceDataset. After training, extract attention weights for 10 test examples using return_attention=True.

(b) For each attention head in each layer, compute the average attention entropy:

$$H(\alpha) = -\sum_{j=1}^{n} \alpha_j \log \alpha_j$$

(c) Heads with low entropy produce focused (peaked) attention; heads with high entropy produce diffuse (uniform) attention. Classify each head as "focused" or "diffuse" based on a threshold of your choosing.

(d) Do different layers show different entropy profiles? What does this suggest about the division of labor across layers?


Applications and Extensions

Exercise 10.21 (**)

Implement a Vision Transformer (ViT) for CIFAR-10 classification.

(a) Split each $32 \times 32 \times 3$ image into $4 \times 4$ patches, producing 64 patches of dimension $4 \times 4 \times 3 = 48$.

(b) Linearly project each patch to $d_{\text{model}} = 128$. Prepend a learnable [CLS] token.

(c) Add learned positional embeddings for 65 positions (64 patches + 1 CLS).

(d) Pass through 4 transformer encoder blocks with 4 heads.

(e) Classify using the [CLS] token output.

(f) Train for 50 epochs and compare accuracy against the CNN from Chapter 8.


Exercise 10.22 (***)

Implement sliding-window (local) attention.

(a) Modify the scaled_dot_product_attention function to accept a window size $w$ and mask all positions outside the window $[i - w/2, i + w/2]$.

(b) Train two versions of the TransformerEncoder on the PatternSequenceDataset: one with full attention and one with window size $w = 11$ (5 positions on each side).

(c) How does window attention affect accuracy when the embedded pattern has length 5? Length 15? Why?


Exercise 10.23 (***)

Implement the softmax temperature parameter and analyze its effect.

(a) Modify the attention computation to use temperature $\tau$:

$$\text{Attention}_\tau(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\top}{\tau \sqrt{d_k}}\right)\mathbf{V}$$

(b) For a fixed set of attention scores, plot the attention weight distribution for $\tau \in \{0.1, 0.5, 1.0, 2.0, 5.0\}$. What happens as $\tau \to 0$? As $\tau \to \infty$?

(c) Train the pattern classification model with $\tau = 0.5$ and $\tau = 2.0$. Compare convergence speed and final accuracy.

(d) When might you want $\tau < 1$ (sharper attention)? When might $\tau > 1$ (softer attention)?
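For part (b), the limiting behavior is easy to see numerically on a fixed score vector (a sketch; the scores are illustrative):

```python
import torch

scores = torch.tensor([2.0, 1.0, 0.5, 0.1])
for tau in [0.1, 0.5, 1.0, 2.0, 5.0]:
    weights = torch.softmax(scores / tau, dim=0)
    # Small tau sharpens toward a one-hot on the max score;
    # large tau flattens toward the uniform distribution.
    print(f"tau={tau}: {weights}")
```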


Exercise 10.24 (***)

The attention score matrix $\mathbf{A} = \text{softmax}(\mathbf{Q}\mathbf{K}^\top / \sqrt{d_k})$ is row-stochastic: each row sums to 1, but the columns do not necessarily sum to 1, so it is not doubly stochastic.

(a) For a trained model on the pattern classification task, compute the column sums of the attention matrix averaged over 100 examples. Which positions have the highest column sums?

(b) Positions with high column sums are "popular" — many positions attend to them. In natural language, what kinds of tokens tend to have high column sums? (Hint: think about function words vs. content words.)

(c) Apply Sinkhorn normalization (alternately normalizing rows and columns) to produce a doubly stochastic attention matrix. Does this improve or hurt classification accuracy?


Exercise 10.25 (***)

Implement relative positional encoding.

(a) Instead of adding absolute positional encodings to the input, modify the attention score computation to include a learnable relative position bias:

$$\text{score}(i, j) = \frac{\mathbf{q}_i^\top \mathbf{k}_j + b_{i-j}}{\sqrt{d_k}}$$

where $b_{i-j}$ is a learnable scalar indexed by the relative distance $i - j$.

(b) Implement this modification, using a learnable parameter vector of length $2L - 1$ for maximum sequence length $L$.

(c) Train on the pattern classification task and compare with sinusoidal positional encoding. Which works better for patterns at random positions? Why?


StreamRec Progressive Project

Exercise 10.26 (***)

Extend the transformer session model to handle multi-modal item features.

(a) Instead of a single item embedding, concatenate: item ID embedding (64-dim), category embedding (16-dim), time-since-last-interaction embedding (16-dim), and engagement-type embedding (16-dim). Use a linear projection to produce a single $d_{\text{model}}$-dimensional representation.

(b) Add a time-aware positional encoding: instead of integer positions, use the actual timestamp differences between items. Modify the sinusoidal encoding to use continuous time values.

(c) Train and compare against the basic transformer session model from Section 10.13.


Exercise 10.27 (***)

Implement attention visualization for the StreamRec session model.

(a) For 5 test sessions, extract and plot the attention weights from each head of each layer as heatmaps.

(b) Identify at least two qualitatively different attention patterns (e.g., recency-focused, category-focused, periodic).

(c) For a session where the model makes a correct prediction that the LSTM got wrong, show the attention weights and explain what information the transformer used that the LSTM could not access.


Exercise 10.28 (****)

Design and implement an experiment comparing the LSTM session model (Chapter 9) and the transformer session model under controlled conditions.

(a) Use the same training data, hyperparameter budget (total parameters within 10% of each other), and evaluation protocol.

(b) Systematically vary session length from 10 to 200 items. At what length does the transformer begin to outperform the LSTM?

(c) Introduce "planted" long-range dependencies: an item at position $t_0$ predicts an item at position $t_0 + k$ for $k \in \{5, 10, 20, 50\}$. At what dependency range $k$ does the LSTM fail while the transformer succeeds?

(d) Measure inference latency for both models as a function of session length. At what session length does the transformer become slower than the LSTM?

(e) Write a summary recommending which model to use under which conditions for StreamRec.


Exercise 10.29 (****)

Implement a simple version of Grouped-Query Attention (GQA).

(a) Modify MultiHeadAttention to support $h_q$ query heads and $h_{kv}$ key-value heads, where $h_{kv}$ divides $h_q$. Each group of $h_q / h_{kv}$ query heads shares one set of key-value projections.

(b) Verify that GQA with $h_{kv} = h_q$ is equivalent to standard multi-head attention, and GQA with $h_{kv} = 1$ is equivalent to multi-query attention.

(c) Train the StreamRec session model with $h_q = 8$ and $h_{kv} \in \{1, 2, 4, 8\}$. Report accuracy and KV-cache memory for each configuration.

(d) What is the optimal $h_{kv}$ for StreamRec, and why might this differ from LLM settings?
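The core indexing trick for (a) can be sketched in a few lines: project to $h_{kv}$ key-value heads, then expand them so each group of query heads shares one KV head (shapes here are illustrative, not tied to the StreamRec model):

```python
import torch

batch, seq, d_model = 2, 10, 64
h_q, h_kv = 8, 2
d_head = d_model // h_q

# Queries use all h_q heads; keys/values use only h_kv heads.
q = torch.randn(batch, h_q, seq, d_head)
k = torch.randn(batch, h_kv, seq, d_head)
v = torch.randn(batch, h_kv, seq, d_head)

# Each group of h_q // h_kv query heads shares one KV head:
# expand the KV heads to match the query head count.
group = h_q // h_kv
k = k.repeat_interleave(group, dim=1)  # (batch, h_q, seq, d_head)
v = v.repeat_interleave(group, dim=1)

scores = q @ k.transpose(-2, -1) / d_head ** 0.5
out = torch.softmax(scores, dim=-1) @ v
print(out.shape)
```

With h_kv = h_q the expansion is a no-op (standard MHA); with h_kv = 1 all query heads share a single KV head (multi-query attention), which is the equivalence asked for in (b).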


Exercise 10.30 (****)

The universal approximation theorem for transformers (Yun et al., 2020) states that transformers are universal approximators of continuous permutation-equivariant sequence-to-sequence functions. However, the proof requires $O(n)$ layers for sequence length $n$.

(a) State the theorem precisely. What assumptions are required about $d_{\text{model}}$ and $d_{ff}$?

(b) In practice, transformers use $O(1)$ layers (6-96 layers regardless of sequence length). What does this imply about the class of functions that practical transformers can approximate?

(c) Design an experiment to test whether a fixed-depth transformer can learn a function that theoretically requires $O(n)$ layers. Suggestion: try to learn a function that requires "counting" — e.g., output 1 if the number of token A in the sequence exceeds the number of token B, and 0 otherwise. Vary sequence length and report accuracy.

(d) If the counting task fails at long sequences, propose an architectural modification that might help. (Hint: consider the relationship between layer depth and computational steps.)