benchmark datasets are curated; production data has missing values, label noise, and inconsistencies. **(2) Static evaluation** — benchmarks are fixed; production data undergoes distribution shift over time. **(3) Unlimited tuning** — papers tune hyperparameters extensively on the benchmark; product