Case Study 2: Analyzing Transformer Components
Overview
In this case study, we perform a systematic analysis of the Transformer's internal components. Rather than treating the Transformer as a black box, we probe each component --- positional encodings, attention patterns, feed-forward network activations, and residual stream contributions --- to build intuition for how information flows through the architecture.
This case study applies the theoretical understanding from Sections 19.2--19.12 through targeted experiments and visualizations.
Experiment 1: Positional Encoding Analysis
Similarity Structure
We compute the dot product between positional encoding vectors for all pairs of positions to verify that:
1. Each position has a unique encoding.
2. Nearby positions have more similar encodings than distant positions.
3. The similarity depends primarily on the relative distance, not the absolute position.
```python
import torch
import math


def sinusoidal_pe(max_len: int, d_model: int) -> torch.Tensor:
    """Compute sinusoidal positional encodings."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe


pe = sinusoidal_pe(100, 64)
similarity = pe @ pe.T  # (100, 100) dot product matrix
```
Findings:
- The diagonal has the highest values (each position is most similar to itself).
- Off-diagonal values decay smoothly with distance.
- The pattern is approximately shift-invariant: similarity[i, j] depends mainly on |i - j|.
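These properties can be checked directly from the similarity matrix computed above. The following is a minimal sketch that continues from the previous code; the specific offsets inspected are our own illustrative choices.

```python
import torch

# 1. Each position is most similar to itself: the row-wise argmax lies on the diagonal.
assert (similarity.argmax(dim=1) == torch.arange(similarity.shape[0])).all()

# 2. Similarity decays with distance: compare a nearby offset with a distant one.
near = torch.diagonal(similarity, offset=1).mean().item()
far = torch.diagonal(similarity, offset=50).mean().item()
print(f"mean similarity at |i - j| = 1: {near:.2f}, at |i - j| = 50: {far:.2f}")

# 3. Approximate shift invariance: entries along a fixed offset vary very little.
band = torch.diagonal(similarity, offset=10)
print(f"offset-10 band: mean = {band.mean().item():.2f}, std = {band.std().item():.2f}")
```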
Frequency Analysis
Each pair of dimensions $(2i, 2i+1)$ oscillates at frequency $\omega_i = 1/10000^{2i/d}$:
- Dimensions 0--1: highest frequency, changing appreciably from one position to the next.
- Dimensions $d-2$, $d-1$: lowest frequency, varying only slightly even across hundreds of positions.
This multi-scale encoding gives the model access to both fine-grained and coarse-grained position information.
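To make these scales concrete, the short sketch below tabulates the frequency and wavelength (positions per full oscillation) of each sine/cosine pair for $d = 64$; the loop and printout format are ours.

```python
import math

d_model = 64
for dim in range(0, d_model, 2):
    # The (dim, dim+1) sine/cosine pair oscillates at omega = 1 / 10000^(dim / d_model).
    omega = 1.0 / (10000.0 ** (dim / d_model))
    wavelength = 2 * math.pi / omega  # positions per full oscillation
    print(f"dims {dim}-{dim + 1}: omega = {omega:.2e}, wavelength = {wavelength:.1f} positions")
```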
Experiment 2: Attention Head Specialization
We train a Transformer on the sequence reversal task and then analyze the attention patterns in each head across all layers.
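For reference, a minimal sketch of how such a reversal dataset can be generated; the vocabulary size and sequence length here are illustrative choices, not necessarily those used in the experiment.

```python
import torch

def make_reversal_batch(batch_size: int = 100, seq_len: int = 10, vocab_size: int = 20):
    """Random token sequences paired with their reversals (illustrative parameters)."""
    src = torch.randint(0, vocab_size, (batch_size, seq_len))
    tgt = src.flip(dims=[1])  # the target is simply the input sequence reversed
    return src, tgt

src, tgt = make_reversal_batch()
```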
Methodology
For each head in each layer, we compute the attention weights on a batch of 100 test examples and categorize the head's behavior:
- Positional heads: High attention to a fixed relative position (e.g., always attend to position $i-1$).
- Content heads: Attention varies significantly across different input sequences.
- Identity heads: Strong diagonal attention (attend to self).
- Global heads: Diffuse, nearly uniform attention.
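One way to operationalize these categories is sketched below. The attention tensor layout, the function name classify_head, and all thresholds are illustrative assumptions rather than the exact criteria used in the experiment.

```python
import torch

def classify_head(attn: torch.Tensor,
                  diag_thresh: float = 0.5,
                  pos_thresh: float = 0.5,
                  var_thresh: float = 1e-3) -> str:
    """Heuristically label one head's behavior.

    attn: attention weights of shape (batch, seq_len, seq_len), rows summing to 1.
    All thresholds are illustrative.
    """
    _, seq_len, _ = attn.shape
    mean_attn = attn.mean(dim=0)           # average pattern over the batch
    variability = attn.var(dim=0).mean()   # how much the pattern depends on the input

    if variability > var_thresh:
        return "content"                   # pattern changes with the input sequence

    if torch.diagonal(mean_attn).mean() > diag_thresh:
        return "identity"                  # mass concentrated on the diagonal

    # Positional head: mass concentrated at a single relative offset.
    offset_mass = torch.stack([
        torch.diagonal(mean_attn, offset=o).mean()
        for o in range(-(seq_len - 1), seq_len) if o != 0
    ])
    if offset_mass.max() > pos_thresh:
        return "positional"

    return "global"                        # diffuse, nearly uniform attention
```

Applying a heuristic of this kind to every (layer, head) pair over the 100 test examples yields labels like those summarized next.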
Results (2-layer model, 4 heads each)
| Layer | Head | Pattern | Description |
|---|---|---|---|
| 0 | 0 | Positional | Attends to next token |
| 0 | 1 | Identity | Attends to self |
| 0 | 2 | Content | Input-dependent |
| 0 | 3 | Positional | Attends to first token |
| 1 | 0 | Content | Reversal pattern |
| 1 | 1 | Content | Gathers from multiple positions |
| 1 | 2 | Global | Broad context aggregation |
| 1 | 3 | Content | Reversal pattern |
Layer 0 heads tend to learn simpler, more local patterns, while layer 1 heads learn the task-specific reversal logic.
Experiment 3: Residual Stream Analysis
We measure the contribution of each sublayer to the residual stream by computing the norm of each sublayer's output before it is added to the residual.
Setup
For a trained Transformer processing a batch of inputs, we hook into each sublayer and record:
- $\|\text{attention\_output}\|$: the norm of the attention sublayer's contribution
- $\|\text{ffn\_output}\|$: the norm of the FFN sublayer's contribution
- $\|\text{residual}\|$: the norm of the residual stream
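A minimal sketch of this instrumentation using PyTorch forward hooks is shown below. The module attribute names (model.layers, layer.attn, layer.ffn) and the variables model and batch are assumptions about how the trained model and data are organized, and each sublayer is assumed to return a single tensor.

```python
import torch

norms = {}

def make_hook(name):
    def hook(module, inputs, output):
        # Mean L2 norm of this sublayer's output vectors over the batch and positions,
        # recorded before the output is added to the residual stream.
        norms[name] = output.detach().flatten(0, -2).norm(dim=-1).mean().item()
    return hook

handles = []
for i, layer in enumerate(model.layers):       # `model` is the trained Transformer (assumed structure)
    handles.append(layer.attn.register_forward_hook(make_hook(f"layer{i}.attention")))
    handles.append(layer.ffn.register_forward_hook(make_hook(f"layer{i}.ffn")))

with torch.no_grad():
    model(batch)                               # `batch` is a batch of test inputs

for h in handles:
    h.remove()                                 # always detach hooks when done

print(norms)
```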
Findings
| Layer | Residual norm | Attention contribution | FFN contribution |
|---|---|---|---|
| 0 | 12.4 | 2.1 (17% of residual) | 3.8 (31% of residual) |
| 1 | 14.8 | 1.9 (13% of residual) | 2.7 (18% of residual) |
Key observations:
- The residual stream grows gradually as sublayers add their contributions.
- FFN contributions are typically larger than attention contributions, consistent with the FFN having more parameters.
- The relative contribution of sublayers decreases in deeper layers, suggesting the model refines rather than rewrites its representation.
Experiment 4: Layer Normalization Effects
We compare training with and without layer normalization, and pre-norm versus post-norm placement.
Training Stability (learning rate = 5e-4)
| Configuration | Converges? | Final Loss | Training Time |
|---|---|---|---|
| Pre-norm | Yes | 0.05 | 30 epochs |
| Post-norm | Yes | 0.07 | 35 epochs |
| No layer norm | Unstable | Diverges | N/A |
Gradient Flow
With pre-norm, the gradient norm remains stable across layers (within a factor of 2). With post-norm, gradients in early layers are 3--5x smaller than in late layers, explaining the slower convergence.
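For reference, the two placements differ only in where normalization sits relative to the residual addition. A minimal sketch, with a generic sublayer standing in for either attention or the FFN:

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """x + sublayer(norm(x)): the residual path is a clean identity connection."""
    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        return x + self.sublayer(self.norm(x))

class PostNormBlock(nn.Module):
    """norm(x + sublayer(x)): normalization is applied after the residual addition."""
    def __init__(self, d_model: int, sublayer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.sublayer = sublayer

    def forward(self, x):
        return self.norm(x + self.sublayer(x))
```

Because the pre-norm residual path is an unnormalized identity connection from input to output, gradients reach early layers without passing through a stack of normalization layers, which is consistent with the more uniform gradient norms observed above.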
Experiment 5: Feed-Forward Network as Key-Value Memory
Following Geva et al. (2021), we interpret the FFN as a key-value memory:
$$\text{FFN}(x) = W_2 \cdot \text{ReLU}(W_1 x) = \sum_i \max(0, \mathbf{k}_i \cdot x) \cdot \mathbf{v}_i$$
where $\mathbf{k}_i$ is the $i$-th row of $W_1$ (the "key") and $\mathbf{v}_i$ is the $i$-th column of $W_2$ (the "value").
We analyze which "keys" are activated for different inputs:
- Some neurons activate for specific token patterns (e.g., digits above 5).
- Some neurons activate for specific positions (e.g., the first position).
- Many neurons are active for most inputs (general-purpose computation).
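The sketch below illustrates this key-value reading of the FFN weights and the key-activation analysis. The attribute path model.layers[0].ffn and the variable residual_stream are assumptions about the model's organization, and bias terms are ignored for simplicity.

```python
import torch

W1 = model.layers[0].ffn.w1.weight   # (d_ff, d_model): rows are the keys k_i (assumed attribute names)
W2 = model.layers[0].ffn.w2.weight   # (d_model, d_ff): columns are the values v_i

x = residual_stream                  # (batch, seq_len, d_model) activations entering the FFN (assumed)

# Key activations m_i(x) = max(0, k_i . x) for every key.
activations = torch.relu(x @ W1.T)   # (batch, seq_len, d_ff)

# The FFN output is the activation-weighted sum of values: sum_i m_i(x) * v_i.
ffn_out = activations @ W2.T         # equals W2 @ ReLU(W1 x), biases aside

# Which keys fire, and for what fraction of tokens?
fire_rate = (activations > 0).float().mean(dim=(0, 1))
top_keys = fire_rate.topk(10).indices
print("most frequently activated keys:", top_keys.tolist())
```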
This experiment connects the Transformer's FFN to the broader concept of learned memory systems discussed in Chapter 18.
Key Takeaways
- Positional encodings create a smooth similarity structure where nearby positions are more similar, and the pattern depends on relative rather than absolute position.
- Attention heads specialize naturally through training, with early layers learning simpler patterns and later layers learning task-specific logic.
- The residual stream is the primary information pathway --- sublayers contribute incremental updates rather than wholesale transformations.
- Pre-norm layer placement provides more stable gradients and faster convergence than post-norm, especially for deeper models.
- FFN layers act as learned key-value memories that store and retrieve information based on the content of the residual stream.
The full analysis code is available in code/case-study-code.py.