Decoder-only Transformers, trained with the autoregressive next-token prediction objective, form the foundation of modern large language models. The GPT family demonstrated that this deceptively simple framework---predict the next word given all previous words---when scaled to billions of parameters and trained on massive text corpora, produces models with remarkable language understanding and generation capabilities. The architecture has remained remarkably stable across generations; the primary lever has been scale.
Autoregressive Language Modeling
Autoregressive factorization: $P(\mathbf{x}) = \prod_{t=1}^{T} P(x_t \mid x_1, \ldots, x_{t-1})$. This is exact (no approximations) and enables straightforward sequential generation.
Training objective: Minimize the cross-entropy loss $\mathcal{L} = -\frac{1}{N} \sum_{t=1}^{N} \log P_\theta(x_t \mid \mathbf{x}_{<t})$ over the $N$ token positions in the corpus (see the code sketch after this list).
Perplexity: $\text{PPL} = \exp(\mathcal{L})$. Lower is better. Interpretable as the effective number of equally likely next tokens.
Parallel training: Despite sequential generation, training processes all positions simultaneously using causal masking.
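A minimal sketch of the loss and perplexity computation, assuming PyTorch; the tensor names `logits` and `tokens` are illustrative, not from the text. Shifting the targets by one position lets a single forward pass score every position in parallel.

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    # tokens: (batch, T) token ids; logits: (batch, T, vocab) from one forward pass.
    # Position t predicts token t+1, so drop the last logit and the first token.
    shift_logits = logits[:, :-1, :]    # (batch, T-1, vocab)
    shift_targets = tokens[:, 1:]       # (batch, T-1)
    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_targets.reshape(-1),
    )

# Perplexity is the exponentiated mean cross-entropy:
# loss = next_token_loss(logits, tokens)
# ppl = torch.exp(loss)
```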
Causal (Left-to-Right) Masking
Each position can only attend to itself and previous positions: $M_{ij} = \begin{cases} 0 & \text{if } j \leq i \\ -\infty & \text{if } j > i \end{cases}$
Prevents "information leakage" from future tokens during training.
Without causal masking, the model can "cheat" by copying answers from future positions, achieving low training loss but failing at generation.
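To make the mask concrete, here is a small PyTorch sketch (function names are illustrative) that builds the additive mask $M$ defined above and adds it to raw attention scores before the softmax, so future positions receive zero probability.

```python
import torch

def causal_mask(T: int) -> torch.Tensor:
    # 0 where attention is allowed (j <= i), -inf where it would look ahead (j > i).
    mask = torch.full((T, T), float("-inf"))
    return torch.triu(mask, diagonal=1)

def masked_attention_weights(scores: torch.Tensor) -> torch.Tensor:
    # scores: raw attention logits of shape (..., T, T).
    T = scores.size(-1)
    return torch.softmax(scores + causal_mask(T), dim=-1)
```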
GPT Architecture Evolution
| Feature | GPT-1 (2018) | GPT-2 (2019) | GPT-3 (2020) |
|---|---|---|---|
| Parameters | 117M | 1.5B | 175B |
| Layers | 12 | 48 | 96 |
| Hidden dim | 768 | 1600 | 12288 |
| Context length | 512 | 1024 | 2048 |
| Layer norm | Post-norm | Pre-norm | Pre-norm |
| Key innovation | Pre-train + fine-tune | Zero-shot transfer | In-context learning |
Core Architecture (Shared Across All GPT Models)
Token embedding + learned positional embedding
Stack of $L$ decoder blocks (masked self-attention + FFN)
Final layer norm
Language modeling head (linear projection to vocabulary, weight-tied with token embedding)
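A schematic PyTorch sketch of this stack, not a faithful reproduction of any GPT release: the hyperparameter names, the 4x FFN expansion, and the use of nn.MultiheadAttention with a boolean causal mask are conventional assumptions made for brevity.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Pre-norm decoder block: masked self-attention + GELU feed-forward,
    each wrapped in a residual connection."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # 4x expansion is the usual convention
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # Boolean causal mask: True marks positions that must NOT be attended to (j > i).
        causal = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=causal, need_weights=False)
        x = x + attn_out
        x = x + self.ffn(self.ln2(x))
        return x

class GPT(nn.Module):
    """Token + learned positional embeddings, a stack of decoder blocks,
    final layer norm, and a weight-tied language-modeling head."""
    def __init__(self, vocab_size: int, d_model: int, n_heads: int,
                 n_layers: int, max_len: int):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        self.blocks = nn.ModuleList(
            [DecoderBlock(d_model, n_heads) for _ in range(n_layers)]
        )
        self.final_ln = nn.LayerNorm(d_model)
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)
        self.lm_head.weight = self.tok_emb.weight  # weight tying

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        T = tokens.size(1)
        pos = torch.arange(T, device=tokens.device)
        h = self.tok_emb(tokens) + self.pos_emb(pos)  # (batch, T, d_model)
        for block in self.blocks:
            h = block(h)
        return self.lm_head(self.final_ln(h))         # (batch, T, vocab_size)
```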
Key Design Decisions
Pre-norm (GPT-2+): Layer norm before each sublayer, not after. More stable training.
GELU activation: Smoother than ReLU, standard in GPT models.
Weight tying: Input embedding and output projection share the same matrix.
Residual scaling: GPT-2 scales residual connections by $1/\sqrt{N}$ at initialization.
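Applied to the sketch above, the residual scaling could look like the following. Treating each block as contributing two residual writes (the attention output projection and the second FFN linear), and hence counting $N$ as twice the number of blocks, is an assumption about how $N$ is measured, not something stated in the text.

```python
import math
import torch

@torch.no_grad()
def apply_gpt2_residual_scaling(model: "GPT") -> None:
    # Shrink the projections that write into the residual stream by 1/sqrt(N)
    # at initialization, so the variance of the residual sum stays roughly
    # constant with depth.
    n_residual = 2 * len(model.blocks)        # assumption: two residual writes per block
    scale = 1.0 / math.sqrt(n_residual)
    for block in model.blocks:
        block.attn.out_proj.weight.mul_(scale)  # attention output projection
        block.ffn[2].weight.mul_(scale)         # second FFN linear (after GELU)
```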