the model is selecting a single input position. Soft attention, where the weights form a distribution over all positions, instead computes an interpolation of the encoder states; because every state contributes to the output, the operation is differentiable end to end and gradients flow back to all positions rather than just one.
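As a minimal sketch of the soft variant (dot-product scoring is an illustrative choice here; additive scoring works the same way), the weighted interpolation can be written as:

```python
import numpy as np

def soft_attention(query, encoder_states):
    """Soft attention as a convex combination of encoder states.

    query:          (d,)   decoder query vector
    encoder_states: (T, d) one vector per input position
    Returns the context vector (d,) and the attention weights (T,).
    """
    # One score per input position (illustrative dot-product scoring).
    scores = encoder_states @ query                # (T,)
    # Softmax turns scores into a probability distribution over positions.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                       # (T,), sums to 1
    # Interpolation: every state contributes, so gradients reach them all,
    # unlike hard attention, which commits to a single position.
    context = weights @ encoder_states             # (d,)
    return context, weights

rng = np.random.default_rng(0)
states = rng.normal(size=(5, 4))   # T=5 positions, d=4
q = rng.normal(size=4)
ctx, w = soft_attention(q, states)
print(w)          # a distribution over the 5 positions
print(ctx.shape)  # (4,)
```

Because the weights sum to one, the context vector always lies in the convex hull of the encoder states; hard attention corresponds to collapsing this distribution onto a single vertex.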