undertrained: they used too many parameters relative to the amount of training data. The key finding is that model size and training data should be scaled in roughly equal proportion: