Do results make sense?

- Are they robust to different methodological choices?
- Do they replicate on holdout data?
- Would a domain expert find the conclusions reasonable?
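
The robustness and holdout questions can be partly automated. The following is a minimal sketch, not a prescribed procedure: it assumes the "result" is a difference between two groups, treats mean versus median as the stand-in for a methodological choice, and uses a single random holdout split. The names `effect`, `split`, and `sanity_check` are hypothetical, and the domain-expert question still needs a human.

```python
import random
import statistics


def effect(treated, control, center=statistics.mean):
    """Difference in central tendency between two groups."""
    return center(treated) - center(control)


def split(values, holdout_fraction, rng):
    """Shuffle a copy of the data and split it into (analysis, holdout)."""
    shuffled = list(values)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]


def sanity_check(treated, control, holdout_fraction=0.3, seed=0):
    """Estimate the effect several ways and report whether the signs agree.

    Assumption: agreement in sign is used as a crude proxy for "robust"
    and "replicates"; a real analysis would also compare magnitudes
    and uncertainty.
    """
    rng = random.Random(seed)
    t_main, t_hold = split(treated, holdout_fraction, rng)
    c_main, c_hold = split(control, holdout_fraction, rng)

    estimates = {
        # Two methodological choices on the analysis split.
        "main_mean": effect(t_main, c_main, statistics.mean),
        "main_median": effect(t_main, c_main, statistics.median),
        # Same estimator as "main_mean", applied to the holdout split.
        "holdout_mean": effect(t_hold, c_hold, statistics.mean),
    }
    same_sign = len({value > 0 for value in estimates.values()}) == 1
    return estimates, same_sign


if __name__ == "__main__":
    # Synthetic data for illustration only.
    data_rng = random.Random(42)
    treated = [data_rng.gauss(2.0, 1.0) for _ in range(200)]
    control = [data_rng.gauss(0.0, 1.0) for _ in range(200)]
    estimates, same_sign = sanity_check(treated, control)
    for name, value in estimates.items():
        print(f"{name}: {value:.3f}")
    print("all estimates agree in sign:", same_sign)
```

If the estimates disagree in sign, or the holdout estimate is far from the analysis estimate, that is a prompt to revisit the analysis rather than proof that the result is wrong.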