- Parallelism: DDP (NCCL backend, NVLink)
- Local batch size per GPU: 8
- Global batch size: 64
- Learning rate: $3 \times 10^{-4} \times (64/8) = 2.4 \times 10^{-3}$ (linear scaling, 8x base)
- Warmup: 2,000 steps (approximately 5% of total steps)
- Optimizer: LAMB (required for global batch size > 2
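
A minimal sketch of how this configuration could be wired up in PyTorch. It assumes a `torchrun` launch (which sets the `LOCAL_RANK` environment variable) and the third-party `torch-optimizer` package for LAMB, since PyTorch core does not ship a LAMB implementation; the constant LR after warmup is also an assumption, as the section does not specify a decay schedule.

```python
# Hypothetical setup sketch; names below mirror the list above, not any
# confirmed training script. Requires: pip install torch-optimizer
import os

import torch
import torch.distributed as dist
import torch_optimizer  # third-party LAMB implementation (assumption)
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.optim.lr_scheduler import LambdaLR

BASE_LR = 3e-4        # base learning rate at the per-GPU batch size
LOCAL_BATCH = 8       # batch size per GPU
GLOBAL_BATCH = 64     # local batch x 8 GPUs
WARMUP_STEPS = 2_000  # ~5% of total steps, per the list above


def setup(model: torch.nn.Module):
    # NCCL is the standard backend for multi-GPU CUDA training.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)
    model = DDP(model.cuda(local_rank), device_ids=[local_rank])

    # Linear scaling rule: 3e-4 * (64 / 8) = 2.4e-3 (8x base).
    scaled_lr = BASE_LR * (GLOBAL_BATCH / LOCAL_BATCH)

    optimizer = torch_optimizer.Lamb(model.parameters(), lr=scaled_lr)

    # Linear warmup from 0 to scaled_lr over the first 2,000 steps,
    # then hold the learning rate constant (assumption).
    scheduler = LambdaLR(
        optimizer,
        lr_lambda=lambda step: min(1.0, (step + 1) / WARMUP_STEPS),
    )
    return model, optimizer, scheduler
```

Launched with something like `torchrun --nproc_per_node=8 train.py`, eight processes each consume a local batch of 8, giving the global batch of 64 that the 8x linear scaling factor reflects.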