a learnable bias term added to the attention logits based on the relative spatial offset between tokens, parameterized by a table of size $(2M-1) \times (2M-1)$, since the relative offset along each axis of an $M \times M$ window ranges over $[-(M-1), M-1]$, giving $2M-1$ possible values. This differs from ViT's **absolute position embeddings**, which are learnable vectors added to each patch embedding based on its absolute position in the input.
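
The indexing scheme above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular library's implementation: the window size `M = 4` and the random `bias_table` are assumptions for demonstration, and a real model would keep the table as a learnable parameter (typically one per attention head).

```python
import numpy as np

M = 4  # window size (assumed for illustration)

# One bias per possible relative offset: offsets along each axis lie in
# [-(M-1), M-1], so the table has (2M-1) x (2M-1) entries.
bias_table = np.random.randn(2 * M - 1, 2 * M - 1)

# (row, col) coordinates of each of the M^2 tokens in the window.
coords = np.stack(
    np.meshgrid(np.arange(M), np.arange(M), indexing="ij"), axis=-1
).reshape(-1, 2)

# Pairwise relative offsets, shifted by M-1 to be valid table indices.
rel = coords[:, None, :] - coords[None, :, :] + (M - 1)  # (M^2, M^2, 2)

# Bias matrix added to the (M^2, M^2) attention logits.
bias = bias_table[rel[..., 0], rel[..., 1]]
print(bias.shape)  # (16, 16)
```

Every token pair with the same spatial offset shares one table entry, so the bias is translation-invariant within the window; in particular, all diagonal entries (offset zero) map to the single center cell of the table.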