Part V: Beyond Text — Multimodal and Generative AI

"The future of AI is not text-only — it is multimodal." — Common industry observation, 2024


The transformer architecture was born in the world of machine translation, but its ambitions quickly outgrew language. Part V explores how the ideas from Part IV extend beyond text to encompass images, audio, video, and their combinations.

We begin with vision transformers, which demonstrated that the same attention-based architecture that processes text can also achieve state-of-the-art results in computer vision — challenging the decades-long dominance of convolutional neural networks. We then study diffusion models, the generative framework behind Stable Diffusion, DALL-E, and the explosion of AI image generation. Multimodal models like CLIP and LLaVA show how to build systems that understand both images and text simultaneously. Speech and audio AI covers the transformer revolution in voice recognition, text-to-speech, and music generation. We close with video understanding and generation, one of the most active frontiers in AI research.

Each chapter in Part V builds on the transformer foundations from Part IV while introducing domain-specific challenges and techniques. By the end, you will understand how a single architectural paradigm — attention over learned representations — has unified AI across modalities.
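To make the "single paradigm" claim concrete: a text sequence and an image both become a set of vectors (token embeddings and patch embeddings, respectively), and the identical attention operation processes either one. Below is a minimal, framework-agnostic NumPy sketch, not code from any chapter; the embedding dimension, sequence length, and 16x16 patching of a 224x224 image are illustrative assumptions.

```python
import numpy as np

def attention(q, k, v):
    # Scaled dot-product attention: softmax(Q K^T / sqrt(d)) V
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    # Numerically stable softmax over each row
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d_model = 64  # illustrative embedding dimension

# "Text": a sequence of 10 token embeddings
text_tokens = rng.standard_normal((10, d_model))

# "Image": a 224x224 image cut into 16x16 patches -> 14*14 = 196 patch embeddings
image_patches = rng.standard_normal((196, d_model))

# The same operation, unchanged, handles both modalities
text_out = attention(text_tokens, text_tokens, text_tokens)
image_out = attention(image_patches, image_patches, image_patches)
print(text_out.shape, image_out.shape)  # (10, 64) (196, 64)
```

The domain-specific work, covered chapter by chapter, lies almost entirely in how each modality is turned into that set of vectors before attention is applied.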

Chapters in This Part

Chapter  Title                                           Key Question
26       Vision Transformers and Modern Computer Vision  How do transformers process images?
27       Diffusion Models and Image Generation           How do diffusion models create photorealistic images from noise?
28       Multimodal Models and Vision-Language AI        How do we build systems that understand both images and text?
29       Speech, Audio, and Music AI                     How do transformers process and generate audio?
30       Video Understanding and Generation              How do we extend AI to the temporal dimension of video?

What You Will Be Able to Do After Part V

  • Implement and fine-tune Vision Transformers for image classification
  • Understand the mathematics and implementation of diffusion models
  • Build multimodal systems that process text and images jointly
  • Work with speech recognition and text-to-speech systems
  • Understand video transformer architectures and temporal modeling

Prerequisites

  • Part IV (transformer architecture and attention mechanisms)
  • Chapter 14 (CNN concepts for comparison with ViT)
  • PyTorch and HuggingFace proficiency
