(1) **Smoothness** — GELU is differentiable at $z = 0$, avoiding the sharp corner of ReLU; this smoothness can benefit optimization. (2) **No dead neurons** — GELU has a non-zero gradient for slightly negative inputs, so neurons are not permanently killed. (3) **Probabilistic interpretation** — GELU can be viewed as a stochastic gate taken in expectation: $\mathrm{GELU}(z) = z\,\Phi(z)$, where $\Phi$ is the standard normal CDF, is the expected value of multiplying $z$ by a Bernoulli($\Phi(z)$) mask, so inputs are kept with a probability that grows with their magnitude.
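As a minimal sketch of these properties, the exact GELU $z\,\Phi(z)$ can be written with the standard error function; the helper name `gelu` is illustrative, not from any particular library:

```python
import math

def gelu(z: float) -> float:
    # Exact GELU: z * Phi(z), with Phi the standard normal CDF
    # expressed via erf: Phi(z) = 0.5 * (1 + erf(z / sqrt(2))).
    return z * 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Smooth through the origin: gelu(0.0) == 0.0, with no ReLU-style corner.
print(gelu(0.0))

# Slightly negative inputs pass a small negative value instead of being
# clamped to zero, so the gradient there is non-zero ("no dead neurons").
print(gelu(-0.5))
```

Numerically differentiating `gelu` near $z = -0.5$ gives a small but non-zero slope, in contrast to ReLU's identically zero gradient on the negative side.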