- **What:** Networks that use self-attention to process entire sequences in parallel, allowing every element to attend to every other element.
- **When:** NLP (BERT, GPT), increasingly time series, vision (ViT), and multi-modal tasks.
- **Why they matter:** Transformers are the architecture behind GPT and most of today's large language models.
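
To make the "every element attends to every other element" idea concrete, here is a minimal sketch of single-head scaled dot-product self-attention in NumPy. It is illustrative only, not a production implementation: the function name, weight matrices, and toy dimensions are assumptions, and real Transformers add multiple heads, masking, and learned parameters.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    # Project the input sequence into queries, keys, and values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = K.shape[-1]
    # Every position scores every other position in one matrix multiply,
    # which is what lets the whole sequence be processed in parallel.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over key positions turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of all value vectors.
    return weights @ V

# Toy usage (hypothetical sizes): a sequence of 4 tokens, 8-dim embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one contextualized vector per token
```

Note that the `scores` matrix is sequence-length by sequence-length, which is why attention cost grows quadratically with sequence length.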