never mix tokenizers from different pre-trained models (a pairing sketch follows this list).
3. **Domain-specific pre-training** (further pre-training on domain text before fine-tuning) can significantly improve results for specialized domains (sketched below).
4. **Gradient accumulation** enables effective large batch sizes on limited GPU memory (sketched below).
5. *
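A minimal sketch of keeping the tokenizer and model paired, assuming Hugging Face Transformers; the checkpoint name is an illustrative placeholder, not one named in this article. Loading both from the same checkpoint guarantees the vocabulary and special tokens match the model's embedding table.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

checkpoint = "bert-base-uncased"  # assumed checkpoint, for illustration only
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint)

# A tokenizer from a different model family would map the same text to
# different token ids, which this model's embeddings would misinterpret.
batch = tokenizer("some input text", return_tensors="pt")
outputs = model(**batch)
```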
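A minimal sketch of domain-adaptive pre-training with Hugging Face Transformers: further training a masked language model on raw domain text before the usual fine-tuning step. The corpus file `domain_corpus.txt`, the base checkpoint, and all hyperparameters are assumptions for illustration.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

checkpoint = "bert-base-uncased"  # assumed base model
tokenizer = AutoTokenizer.from_pretrained(checkpoint)  # same checkpoint as the model
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# Load raw domain text (placeholder path) and tokenize it for MLM training.
dataset = load_dataset("text", data_files={"train": "domain_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# The collator randomly masks 15% of tokens, the standard MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="domain-adapted-bert",  # assumed output path
    num_train_epochs=1,
    per_device_train_batch_size=16,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()
# The saved checkpoint can then be fine-tuned on the downstream task as usual.
```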
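And a minimal sketch of gradient accumulation in plain PyTorch; the toy model, random data, and accumulation factor are illustrative assumptions. Gradients from several small micro-batches are summed before each optimizer step, so the update behaves like one large batch without ever holding that batch in GPU memory.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(16, 2)  # stand-in for a large model
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# 8 micro-batches of 4 examples behave like one batch of 32.
micro_batch, accum_steps = 4, 8
data = TensorDataset(torch.randn(256, 16), torch.randint(0, 2, (256,)))
loader = DataLoader(data, batch_size=micro_batch)

optimizer.zero_grad()
for step, (x, y) in enumerate(loader):
    loss = loss_fn(model(x), y)
    # Divide by accum_steps so the summed gradient averages over the
    # effective batch rather than growing with it.
    (loss / accum_steps).backward()
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```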