Part V: Beyond Text — Multimodal and Generative AI

"The future of AI is not text-only — it is multimodal." — Common industry observation, 2024


The transformer architecture was born in the world of machine translation, but its ambitions quickly outgrew language. Part V explores how the ideas from Part IV extend beyond text to encompass images, audio, video, and their combinations.

We begin with vision transformers, which demonstrated that the same attention-based architecture that processes text can also achieve state-of-the-art results in computer vision — challenging the decades-long dominance of convolutional neural networks. We then study diffusion models, the generative framework behind Stable Diffusion, DALL-E, and the explosion of AI image generation. Multimodal models like CLIP and LLaVA show how to build systems that understand both images and text simultaneously. Speech and audio AI covers the transformer revolution in voice recognition, text-to-speech, and music generation. We close with video understanding and generation, one of the most active frontiers in AI research.

Each chapter in Part V builds on the transformer foundations from Part IV while introducing domain-specific challenges and techniques. By the end, you will understand how a single architectural paradigm — attention over learned representations — has unified AI across modalities.
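To make the "single paradigm" claim concrete: a text sequence and an image both become a set of vectors (token embeddings and patch embeddings, respectively), and the identical attention operation processes either one. Below is a minimal, framework-agnostic NumPy sketch, not code from any chapter; the embedding dimension, sequence length, and 16x16 patching of a 224x224 image are illustrative assumptions.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Numerically stable softmax over each row
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d_model = 64  # illustrative embedding dimension

# "Text": a sequence of 10 token embeddings
text_tokens = rng.standard_normal((10, d_model))

# "Image": a 224x224 image cut into 16x16 patches -> 14*14 = 196 patch embeddings
image_patches = rng.standard_normal((196, d_model))

# The same operation, unchanged, handles both modalities
text_out = attention(text_tokens, text_tokens, text_tokens)
image_out = attention(image_patches, image_patches, image_patches)
print(text_out.shape, image_out.shape)  # (10, 64) (196, 64)
```

The domain-specific work, covered chapter by chapter, lies almost entirely in how each modality is turned into that set of vectors before attention is applied.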

Chapters in This Part

Chapter  Title                                           Key Question
26       Vision Transformers and Modern Computer Vision  How do transformers process images?
27       Diffusion Models and Image Generation           How do diffusion models create photorealistic images from noise?
28       Multimodal Models and Vision-Language AI        How do we build systems that understand both images and text?
29       Speech, Audio, and Music AI                     How do transformers process and generate audio?
30       Video Understanding and Generation              How do we extend AI to the temporal dimension of video?

What You Will Be Able to Do After Part V

  • Implement and fine-tune Vision Transformers for image classification
  • Understand the mathematics and implementation of diffusion models
  • Build multimodal systems that process text and images jointly
  • Work with speech recognition and text-to-speech systems
  • Understand video transformer architectures and temporal modeling

Prerequisites

  • Part IV (transformer architecture and attention mechanisms)
  • Chapter 14 (CNN concepts for comparison with ViT)
  • PyTorch and HuggingFace proficiency
