A standardized test or suite of tests used to evaluate and compare the performance of AI models, tools, or code under the same conditions. Examples for coding tasks include **HumanEval**, **MBPP**, and **SWE-bench**. (Ch. 3, Appendix A)