The Transformer architecture replaced recurrence with attention as the primary mechanism for sequence modeling. By processing all positions in parallel through self-attention, the Transformer is faster to train, easier to scale, and better at capturing long-range dependencies than RNN-based models. The architecture from "Attention Is All You Need" (2017) remains the foundation of virtually every modern language model.
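As a concrete sketch of that parallelism, the single-head scaled dot-product attention below computes attention weights for every position in one batched matrix product. This is a minimal PyTorch illustration; the function name and the separate weight arguments are assumptions for clarity, not the paper's reference code.

```python
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Minimal single-head scaled dot-product self-attention sketch.

    x: (batch, seq_len, d_model); w_q / w_k / w_v: (d_model, d_k) projections.
    Names and shapes are illustrative assumptions, not from a specific codebase.
    """
    q = x @ w_q                                      # queries
    k = x @ w_k                                      # keys
    v = x @ w_v                                      # values
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5    # (batch, seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)              # every position attends to all others in parallel
    return weights @ v
```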
Architecture Components
Positional Encoding
Self-attention is permutation-equivariant: it has no inherent notion of token order, so position information must be injected explicitly.
Sinusoidal positional encodings use sine and cosine functions at different frequencies to give each position a unique fingerprint (see the sketch below).
Learned positional embeddings are more flexible but cannot extrapolate beyond the maximum sequence length seen during training.
Positional encodings are added (not concatenated) to token embeddings.
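A minimal sketch of the sinusoidal encoding, assuming an even $d_{\text{model}}$; the function name is a placeholder, and the base of 10000 follows the original formulation. The result is added directly to the token embeddings.

```python
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    # Assumes an even d_model; returns a (max_len, d_model) table of position fingerprints.
    position = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)            # (max_len, 1)
    div_term = torch.pow(10000.0, torch.arange(0, d_model, 2).float() / d_model)  # one frequency per dimension pair
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position / div_term)   # even dimensions: sine
    pe[:, 1::2] = torch.cos(position / div_term)   # odd dimensions: cosine
    return pe

# Usage: x = token_embeddings + sinusoidal_positional_encoding(seq_len, d_model)  (added, not concatenated)
```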
Layer Normalization
Normalizes across the feature dimension: $\text{LayerNorm}(\mathbf{x}) = \gamma \odot \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$.
Pre-norm (normalizing the input to each sublayer) trains more stably than post-norm (normalizing after the residual addition), especially in deep stacks; see the sketch below.
Independent of batch size, unlike batch normalization.
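The following sketch shows the pre-norm residual wiring around an arbitrary sublayer (attention or FFN), using PyTorch's nn.LayerNorm. The class and argument names are assumptions for illustration, not a specific library's API.

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    # Illustrative pre-norm residual block: x + sublayer(LayerNorm(x)).
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)   # learned gamma (scale) and beta (shift) per feature
        self.sublayer = sublayer            # e.g. an attention or feed-forward module

    def forward(self, x):
        # Pre-norm: normalize, transform, then add the residual.
        # Post-norm (the original paper's ordering) would instead be norm(x + self.sublayer(x)).
        return x + self.sublayer(self.norm(x))
```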
Feed-Forward Network
Two-layer MLP applied independently at each position: $\text{FFN}(x) = W_2 \cdot \text{GELU}(W_1 x + b_1) + b_2$.
Inner dimension is typically $4 \times d_{\text{model}}$, creating an expand-then-compress structure around the nonlinearity.
The primary source of nonlinearity and the majority of parameters per layer.
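A position-wise FFN sketch under these conventions; the class name and the default $4\times$ expansion ratio are illustrative assumptions.

```python
import torch.nn as nn

class FeedForward(nn.Module):
    # Position-wise FFN sketch: expand to d_ff (default 4 * d_model), apply GELU, project back.
    def __init__(self, d_model, d_ff=None):
        super().__init__()
        d_ff = d_ff or 4 * d_model
        self.net = nn.Sequential(
            nn.Linear(d_model, d_ff),   # W_1, b_1: expand
            nn.GELU(),                  # nonlinearity
            nn.Linear(d_ff, d_model),   # W_2, b_2: project back to d_model
        )

    def forward(self, x):
        # Applied to each position independently; no mixing across the sequence dimension.
        return self.net(x)
```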