- Optimizer: AdamW with betas (0.9, 0.98), weight decay 0.01
- Learning rate: peak 5e-4, linear warmup for 4,000 steps, then cosine decay
- Batch size: effective 256--2048 (with gradient accumulation)
- Gradient clipping: max_norm=1.0
- Dropout: 0.1 on attention and feed-forward layers
- Mixed precision
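
Below is a minimal sketch of how these settings might be wired together in PyTorch. The model, data, total step count, and per-device batch size are placeholders, and fp16 via `GradScaler` is an assumption for the mixed-precision item (the exact format is not specified above); dropout would be set inside the model definition and is not shown here.

```python
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

# Placeholders: substitute the real model, dataloader, and schedule lengths.
model = torch.nn.Linear(512, 512).cuda()          # assumes a CUDA device
train_loader = [(torch.randn(8, 512).cuda(), torch.randn(8, 512).cuda())] * 64
total_steps = 100_000                              # assumed total optimizer steps
warmup_steps = 4_000
accum_steps = 8        # effective batch = per-device batch x accum_steps
peak_lr = 5e-4

optimizer = AdamW(model.parameters(), lr=peak_lr,
                  betas=(0.9, 0.98), weight_decay=0.01)

def lr_lambda(step):
    # Linear warmup to the peak LR, then cosine decay toward zero.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = LambdaLR(optimizer, lr_lambda)
scaler = torch.cuda.amp.GradScaler()  # fp16 assumed; bf16 would not need a scaler

for step, (inputs, targets) in enumerate(train_loader):
    with torch.cuda.amp.autocast():
        loss = torch.nn.functional.mse_loss(model(inputs), targets)
    # Scale the loss so gradients average over the accumulation window.
    scaler.scale(loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        scaler.unscale_(optimizer)                 # clip on unscaled gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
        scheduler.step()                           # one LR step per optimizer update
```

Note that the scheduler advances once per optimizer update, so the 4,000 warmup steps count effective (accumulated) batches rather than forward passes.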