- [ ] Define your standard evaluation battery for new tools in your domain (like Raj's six-task battery for coding)
- [ ] Commit to running your battery on any new tool that seems relevant before forming a strong opinion about it