- Recruit 2-3 evaluators (can be teammates or volunteers).
- Have each evaluator interact with the system on 20 tasks (covering all scenario types).
- Collect ratings (1-5) for:
  - **Correctness**: Is the response factually accurate?
  - **Relevance**: Does the response address the user's actual question