- Create a test set of at least 50 questions with ground-truth answers and source documents.
- Evaluate with the following metrics:
  - **Answer quality**: Human evaluation on a 1-5 scale for correctness, completeness, and clarity. Also compute automated metrics (ROUGE-L, BERTScore against reference ans