Case Study 2: Analyzing Transformer Components

Overview

In this case study, we perform a systematic analysis of the Transformer's internal components. Rather than treating the Transformer as a black box, we probe each component --- positional encodings, attention patterns, feed-forward network activations, and residual stream contributions --- to build intuition for how information flows through the architecture.

This case study applies the theoretical understanding from Sections 19.2--19.12 through targeted experiments and visualizations.


Experiment 1: Positional Encoding Analysis

Similarity Structure

We compute the dot product between positional encoding vectors for all pairs of positions to verify that:

  1. Each position has a unique encoding.
  2. Nearby positions have more similar encodings than distant positions.
  3. The similarity depends primarily on the relative distance, not absolute position.

import torch
import math

def sinusoidal_pe(max_len: int, d_model: int) -> torch.Tensor:
    """Compute sinusoidal positional encodings."""
    pe = torch.zeros(max_len, d_model)
    position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model)
    )
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe

pe = sinusoidal_pe(100, 64)
similarity = pe @ pe.T  # (100, 100) dot product matrix

Findings:

  - The diagonal has the highest values (each position is most similar to itself).
  - Off-diagonal values decay smoothly with distance.
  - The pattern is approximately shift-invariant: similarity[i, j] depends mainly on |i - j|.
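
As a quick sanity check of the shift-invariance observation, the snippet below (an illustrative addition that reuses the pe and similarity tensors computed above, not part of the original analysis code) groups similarity values by offset and reports how much they vary within each offset.

# Group similarity[i, j] by the offset k = j - i; under shift-invariance the
# values within each group should be nearly constant.
for k in range(0, 10):
    vals = torch.diagonal(similarity, offset=k)
    print(f"|i - j| = {k}: mean = {vals.mean().item():.1f}, std = {vals.std().item():.1f}")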

Frequency Analysis

Each pair of dimensions (2i, 2i+1) oscillates at frequency $\omega_i = 1/10000^{2i/d}$:

  - Dimensions 0--1: highest frequency, changing with every position.
  - Dimensions d-2, d-1: lowest frequency, changing slowly across hundreds of positions.

This multi-scale encoding gives the model access to both fine-grained and coarse-grained position information.
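
To make the multi-scale claim concrete, the short sketch below (an illustrative addition) prints the wavelength $2\pi/\omega_i = 2\pi \cdot 10000^{2i/d}$, in positions, for a few dimension pairs of the $d = 64$ encoding used above.

# Wavelength of each (sin, cos) pair in positions: 2*pi * 10000^(2i/d).
d_model = 64
for i in [0, 8, 16, 31]:  # pair index i covers dimensions (2i, 2i+1)
    wavelength = 2 * math.pi * 10000 ** (2 * i / d_model)
    print(f"dims ({2 * i}, {2 * i + 1}): wavelength ~ {wavelength:.1f} positions")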


Experiment 2: Attention Head Specialization

We train a Transformer on the sequence reversal task and then analyze the attention patterns in each head across all layers.

Methodology

For each head in each layer, we compute the attention weights on a batch of 100 test examples and categorize the head's behavior into one of four patterns (a categorization sketch follows the list):

  1. Positional heads: High attention to a fixed relative position (e.g., always attend to position $i-1$).
  2. Content heads: Attention varies significantly across different input sequences.
  3. Identity heads: Strong diagonal attention (attend to self).
  4. Global heads: Diffuse, nearly uniform attention.
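
A minimal sketch of one way to implement this categorization is shown below. It assumes the attention weights for a single head have already been collected into a tensor of shape (batch, seq_len, seq_len); the thresholds are illustrative choices, not the values used in the original experiment.

import torch

def categorize_head(attn: torch.Tensor) -> str:
    """Heuristically label one head's behavior from its attention weights.

    attn: (batch, seq_len, seq_len) attention weights for a single head.
    """
    batch, seq_len, _ = attn.shape
    mean_attn = attn.mean(dim=0)              # average pattern over the batch
    variability = attn.std(dim=0).mean()      # how input-dependent the head is
    if variability > 0.1:
        return "content"                      # pattern changes with the input
    if torch.diagonal(mean_attn).mean() > 0.5:
        return "identity"                     # strong self-attention
    # Check for a dominant relative offset (e.g., always attend to position i-1).
    offsets = [torch.diagonal(mean_attn, offset=k).mean() for k in range(-seq_len + 1, seq_len)]
    if max(offsets) > 0.5:
        return "positional"
    return "global"                           # diffuse, near-uniform attention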

Results (2-layer model, 4 heads each)

Layer   Head   Pattern      Description
0       0      Positional   Attends to next token
0       1      Identity     Attends to self
0       2      Content      Input-dependent
0       3      Positional   Attends to first token
1       0      Content      Reversal pattern
1       1      Content      Gathers from multiple positions
1       2      Global       Broad context aggregation
1       3      Content      Reversal pattern

Layer 0 heads tend to learn simpler, more local patterns, while layer 1 heads learn the task-specific reversal logic.


Experiment 3: Residual Stream Analysis

We measure the contribution of each sublayer to the residual stream by computing the norm of each sublayer's output before it is added to the residual.

Setup

For a trained Transformer processing a batch of inputs, we hook into each sublayer and record:

  - $\|\text{attention\_output}\|$: the norm of the attention sublayer's contribution
  - $\|\text{ffn\_output}\|$: the norm of the FFN sublayer's contribution
  - $\|\text{residual}\|$: the norm of the residual stream
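
A sketch of the hook-based setup is shown below. The attribute names (model.layers, self_attn, ffn) are hypothetical stand-ins for whatever submodules the trained model actually exposes, and the trained model itself is assumed to exist.

import torch

norms = {}  # e.g. "layer0.attn" -> list of recorded output norms

def make_hook(name):
    def hook(module, inputs, output):
        out = output[0] if isinstance(output, tuple) else output
        # Average L2 norm of this sublayer's output, i.e. its contribution
        # to the residual stream before the addition.
        norms.setdefault(name, []).append(out.norm(dim=-1).mean().item())
    return hook

# Hypothetical module names; substitute the attributes of your own model.
for i, layer in enumerate(model.layers):
    layer.self_attn.register_forward_hook(make_hook(f"layer{i}.attn"))
    layer.ffn.register_forward_hook(make_hook(f"layer{i}.ffn"))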

Findings

Layer 0:
  Residual norm: 12.4
  Attention contribution: 2.1 (17% of residual)
  FFN contribution: 3.8 (31% of residual)

Layer 1:
  Residual norm: 14.8
  Attention contribution: 1.9 (13% of residual)
  FFN contribution: 2.7 (18% of residual)

Key observations:

  - The residual stream grows gradually as sublayers add their contributions.
  - FFN contributions are typically larger than attention contributions, consistent with the FFN having more parameters.
  - The relative contribution of sublayers decreases in deeper layers, suggesting the model refines rather than rewrites its representation.
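
For reference, in a standard Transformer block with $d_{\text{ff}} = 4\,d_{\text{model}}$, the FFN's two weight matrices hold about $2\,d_{\text{model}} d_{\text{ff}} = 8\,d_{\text{model}}^2$ parameters, while the four attention projections ($W_Q$, $W_K$, $W_V$, $W_O$) hold about $4\,d_{\text{model}}^2$, so the FFN has roughly twice as many parameters per layer.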


Experiment 4: Layer Normalization Effects

We compare training with and without layer normalization, and pre-norm versus post-norm placement.
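
Concretely, the two placements differ only in where normalization sits relative to the residual addition. For a sublayer (attention or FFN) acting on the residual stream $x_l$:

$$\text{Post-norm:}\ x_{l+1} = \text{LayerNorm}\big(x_l + \text{Sublayer}(x_l)\big), \qquad \text{Pre-norm:}\ x_{l+1} = x_l + \text{Sublayer}\big(\text{LayerNorm}(x_l)\big)$$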

Training Stability (learning rate = 5e-4)

Configuration    Converges?   Final Loss   Training Time
Pre-norm         Yes          0.05         30 epochs
Post-norm        Yes          0.07         35 epochs
No layer norm    Unstable     Diverges     N/A

Gradient Flow

With pre-norm, the gradient norm remains stable across layers (within a factor of 2). With post-norm, gradients in early layers are 3--5x smaller than in late layers, explaining the slower convergence.
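
One way to reproduce this measurement, sketched below under the assumption that the model exposes its blocks as model.layers and that a loss has already been computed, is to sum each block's parameter gradient norms after the backward pass.

import torch

# Illustrative sketch: `model` and `loss` are assumed to already exist.
loss.backward()
for i, layer in enumerate(model.layers):
    squared = [p.grad.norm() ** 2 for p in layer.parameters() if p.grad is not None]
    grad_norm = torch.stack(squared).sum().sqrt()
    print(f"layer {i}: gradient norm = {grad_norm.item():.3f}")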


Experiment 5: Feed-Forward Network as Key-Value Memory

Following Geva et al. (2021), we interpret the FFN as a key-value memory:

$$\text{FFN}(x) = W_2 \cdot \text{ReLU}(W_1 x) = \sum_i \max(0, \mathbf{k}_i \cdot x) \cdot \mathbf{v}_i$$

where $\mathbf{k}_i$ is the $i$-th row of $W_1$ (the "key") and $\mathbf{v}_i$ is the $i$-th column of $W_2$ (the "value").

We analyze which "keys" are activated for different inputs:

  - Some neurons activate for specific token patterns (e.g., digits above 5).
  - Some neurons activate for specific positions (e.g., the first position).
  - Many neurons are active for most inputs (general-purpose computation).
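
A minimal sketch of this analysis for a single FFN layer is shown below; W1 and x are stand-ins for the trained first-layer weight matrix and a residual-stream vector, and the random values at the end are purely illustrative (the real analysis uses trained weights).

import torch

def top_activated_keys(W1: torch.Tensor, x: torch.Tensor, k: int = 5):
    """Return the indices of the k FFN neurons ("keys") most activated by x.

    W1: (d_ff, d_model) first FFN weight matrix; each row is a key k_i.
    x:  (d_model,) residual-stream vector at one position.
    """
    activations = torch.relu(W1 @ x)   # max(0, k_i . x) for every neuron i
    return torch.topk(activations, k).indices

# Example with random stand-in weights.
W1 = torch.randn(256, 64)
x = torch.randn(64)
print(top_activated_keys(W1, x))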

This experiment connects the Transformer's FFN to the broader concept of learned memory systems discussed in Chapter 18.


Key Takeaways

  1. Positional encodings create a smooth similarity structure where nearby positions are more similar, and the pattern depends on relative rather than absolute position.

  2. Attention heads specialize naturally through training, with early layers learning simpler patterns and later layers learning task-specific logic.

  3. The residual stream is the primary information pathway --- sublayers contribute incremental updates rather than wholesale transformations.

  4. Pre-norm layer placement provides more stable gradients and faster convergence than post-norm, especially for deeper models.

  5. FFN layers act as learned key-value memories that store and retrieve information based on the content of the residual stream.

The full analysis code is available in code/case-study-code.py.