- Optimizer: AdamW with weight decay 0.01
- Learning rate: 1e-5 to 5e-5 for pretrained layers, 10x higher for the new head
- Warmup: 100--500 steps
- Epochs: 3--10 (far fewer than training from scratch)
- Gradient clipping: max_norm=1.0
- Use parameter groups to set different learning rates for backbone vs. head
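
The recipe above can be sketched in PyTorch. This is a minimal illustration, not a full training loop: `FineTuneModel`, its layer sizes, and the concrete learning rates (2e-5 backbone, 2e-4 head) are hypothetical stand-ins chosen from the ranges listed above.

```python
import torch
from torch import nn

class FineTuneModel(nn.Module):
    """Hypothetical model: a pretrained backbone plus a freshly initialized head."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(128, 64)  # stand-in for pretrained layers
        self.head = nn.Linear(64, 10)       # stand-in for the new head

    def forward(self, x):
        return self.head(torch.relu(self.backbone(x)))

model = FineTuneModel()

# Parameter groups: low LR for the pretrained backbone, 10x higher for the new head.
optimizer = torch.optim.AdamW(
    [
        {"params": model.backbone.parameters(), "lr": 2e-5},
        {"params": model.head.parameters(), "lr": 2e-4},
    ],
    weight_decay=0.01,
)

# Linear warmup over the first 100 steps, then constant LR.
# LambdaLR multiplies each group's base LR by the returned factor,
# so the 10x backbone/head ratio is preserved throughout warmup.
warmup_steps = 100
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps)
)

# One training step with gradient clipping at max_norm=1.0.
x = torch.randn(32, 128)
y = torch.randint(0, 10, (32,))
loss = nn.functional.cross_entropy(model(x), y)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
```

In a real run the step above sits inside the epoch loop (3--10 epochs), and `clip_grad_norm_` is called after `backward()` but before `optimizer.step()` so the clipped gradients are what the optimizer applies.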