Do not use label smoothing for knowledge distillation (the soft teacher labels already provide smoothing).
- Common values: 0.1 for most tasks, 0.05 for tasks with very clean labels, 0.2 for noisy labels.
- Label smoothing was a key component in the original Transformer paper (Chapter 14 will discus