- Allows higher learning rates without divergence
- Acts as a mild regularizer (due to batch noise)
- Reduces sensitivity to weight initialization
- Smooths the loss landscape
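These properties match batch normalization. As a minimal sketch (assuming that is the technique being described), the core operation is to normalize each feature over the batch, then apply a learnable scale `gamma` and shift `beta`; the function name and parameters here are illustrative, not from the original text:

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature over the batch dimension, then scale and shift.

    x: array of shape (batch, features). eps avoids division by zero.
    """
    mean = x.mean(axis=0)              # per-feature mean over the batch
    var = x.var(axis=0)                # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(64, 4))  # batch of 64, 4 features
y = batch_norm(x)
print(y.mean(axis=0))  # per-feature means, close to 0
print(y.std(axis=0))   # per-feature stds, close to 1
```

The batch statistics vary from mini-batch to mini-batch, which is the source of the "batch noise" regularization effect noted above; normalizing away the input scale is what reduces sensitivity to weight initialization.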