which jostles the system out of shallow local optima and saddle points. Learning rate schedules start with large steps (which can cross valleys) and gradually reduce the step size (for precise convergence). And techniques like "dropout" randomly deactivate portions of the network during training, preventing the network from leaning too heavily on any single pathway and thereby reducing overfitting.
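To make the last two ideas concrete, here is a minimal sketch in PyTorch (the framework, network shape, and hyperparameters are illustrative assumptions, not prescribed by the text): a cosine learning-rate schedule shrinks the step size over the course of training, while a dropout layer randomly zeroes a fraction of hidden activations on each training pass.

```python
import torch
import torch.nn as nn

# A tiny illustrative network with a dropout layer between the hidden
# and output layers (sizes chosen arbitrarily for this sketch).
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),   # randomly zeroes 50% of activations each training pass
    nn.Linear(64, 1),
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# Cosine annealing: begin with large steps, decay smoothly toward zero.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

loss_fn = nn.MSELoss()
x, y = torch.randn(32, 20), torch.randn(32, 1)  # dummy batch for the sketch

model.train()  # dropout is only active in training mode
for epoch in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    optimizer.step()
    scheduler.step()  # shrink the learning rate per the schedule

model.eval()  # at inference dropout is a no-op (PyTorch rescales at train time)
```

Note that dropout is switched off at evaluation time: the full network is used for predictions, and because PyTorch rescales the surviving activations during training, no further correction is needed at inference.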