hard labels

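one-hot encoded targets where the correct class has probability 1 and all others have probability 0. Because the cross-entropy loss against such a target approaches zero only as the predicted probability of the correct class approaches 1, the model is pushed to make the correct-class logit ever larger than the rest, which causes two problems: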