Conduct human evaluation on at least 50 test examples.

- For each example, a human rater (you, a teammate, or a recruited evaluator) scores responses from the base model and the fine-tuned model on:
  - **Correctness** (1-5): Is the information factually accurate?
  - **Helpfulness** (1-5): Does the res