- Within 1-2% of reported performance on the same dataset and evaluation protocol.
- Consistent across at least three random seeds.
- Achievable within the reported compute budget (within 2x).
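
The three criteria above could be checked mechanically once the runs are finished. The following is a minimal sketch, not a definitive implementation: the function name `check_reproduction` and all of its parameters are hypothetical. It also makes two assumptions the text does not state: "within 1-2%" is read as a relative gap with 2% as the cutoff, and "consistent" is read as every individual seed staying inside that same tolerance.

```python
from statistics import mean


def check_reproduction(reported_score, seed_scores, reported_compute, used_compute,
                       rel_tol=0.02, min_seeds=3, compute_factor=2.0):
    """Return a pass/fail flag per criterion (hypothetical helper)."""
    if reported_score <= 0 or reported_compute <= 0:
        raise ValueError("reported_score and reported_compute must be positive")
    if not seed_scores:
        raise ValueError("seed_scores must contain at least one run")

    # Assumption: "within 1-2%" is a relative gap; 2% is the loose end of that range.
    mean_gap = abs(mean(seed_scores) - reported_score) / reported_score
    # Assumption: "consistent" means every single seed also stays within rel_tol.
    worst_gap = max(abs(s - reported_score) / reported_score for s in seed_scores)

    checks = {
        "performance": mean_gap <= rel_tol,
        "seeds": len(seed_scores) >= min_seeds and worst_gap <= rel_tol,
        "compute": used_compute <= compute_factor * reported_compute,
    }
    # The reproduction counts only if all three criteria hold.
    checks["reproduced"] = all(checks.values())
    return checks


if __name__ == "__main__":
    # Illustrative numbers only; compute can be in any unit (e.g. GPU-hours)
    # as long as both values use the same one.
    print(check_reproduction(0.900, [0.895, 0.902, 0.889], 100.0, 150.0))
```

Returning one flag per criterion, rather than a single boolean, makes it easier to see which condition failed when a reproduction falls short. A stricter or looser reading of "consistent" (for example, a bound on the standard deviation across seeds) would only change how `worst_gap` is computed.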