Glossary

Token pruning/merging

Reduce sequence length by removing or combining less important tokens.

Linear attention

Approximate softmax attention with kernel feature maps, reducing cost to $O(Nd^2)$.

FlashAttention

An IO-aware attention implementation that does not reduce FLOPs but dramatically improves wall-clock time and memory usage.
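As a minimal sketch of the linear-attention idea (not any particular library's implementation; the ELU+1 feature map is one common choice and an assumption here): replacing the softmax with a kernel $\phi$ lets attention be computed as $\phi(Q)\,(\phi(K)^\top V)$, where associativity means the $d \times d$ matrix $\phi(K)^\top V$ is formed first, giving $O(Nd^2)$ cost instead of $O(N^2 d)$:

```python
import numpy as np

def feature_map(x):
    # ELU + 1: a positive feature map (an assumed choice for illustration)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(Q, K, V):
    """Kernelized attention: phi(Q) @ (phi(K).T @ V), row-normalized.

    Associativity lets us build the small (d, d) matrix phi(K).T @ V
    first, so the total cost is O(N * d^2) rather than O(N^2 * d).
    """
    Qf, Kf = feature_map(Q), feature_map(K)   # (N, d) feature-mapped queries/keys
    kv = Kf.T @ V                             # (d, d) summary -- O(N d^2)
    z = Kf.sum(axis=0)                        # (d,) normalizer accumulator
    return (Qf @ kv) / (Qf @ z)[:, None]      # (N, d) -- O(N d^2)

rng = np.random.default_rng(0)
N, d = 6, 4
Q, K, V = rng.normal(size=(3, N, d))
print(linear_attention(Q, K, V).shape)  # (6, 4)
```

The result is identical to explicitly forming the $N \times N$ kernel attention matrix and normalizing its rows; only the order of multiplication (and hence the cost) changes.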
