3.1 Automated Evaluation

Evaluate on the held-out test set using:

- **Perplexity**: Compare base model vs. fine-tuned model on domain text.
- **Generation quality**: ROUGE-1, ROUGE-L, and BERTScore against reference answers.
- **Exact match / F1**: For extractive QA-style questions.
- **Domain terminology accuracy**: Check