2.3 Training Configuration

- Optimizer: AdamW with weight decay 0.01.
- Learning rate: sweep over {5e-6, 1e-5, 2e-5, 5e-5} with a cosine schedule.
- Warmup: 10% of total steps.
- Batch size: effective batch size of 16-64 (use gradient accumulation as needed).
- Maximum sequence length: 2048 tokens (or 4096 if resources allow).
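
As a concrete illustration, here is a minimal PyTorch sketch of this configuration. It assumes Hugging Face `transformers` for the warmup/cosine schedule; the `model` argument, `total_steps`, and the specific per-device batch size are hypothetical placeholders, not values fixed by the text above.

```python
# Minimal sketch of the training configuration above (PyTorch +
# Hugging Face transformers). `model` and `total_steps` are
# hypothetical placeholders supplied by the surrounding training loop.
import torch
from transformers import get_cosine_schedule_with_warmup

LEARNING_RATES = [5e-6, 1e-5, 2e-5, 5e-5]  # candidates for the sweep

def make_optimizer_and_scheduler(model, lr, total_steps):
    # AdamW with weight decay 0.01, as specified above.
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    # Cosine schedule with warmup over 10% of total steps.
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.10 * total_steps),
        num_training_steps=total_steps,
    )
    return optimizer, scheduler

# Effective batch size of 16-64 via gradient accumulation, e.g.:
per_device_batch_size = 4
gradient_accumulation_steps = 8  # 4 * 8 = effective batch size of 32
max_seq_len = 2048               # or 4096 if resources allow
```

In a sweep, one training run per value in `LEARNING_RATES` would be launched, keeping the other settings fixed.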
