Part VI: AI Systems Engineering
"The hard part of AI is not building the model — it is everything else." — Adapted from the "Hidden Technical Debt in Machine Learning Systems" paper
Training a model is only the beginning. The real challenge of AI engineering lies in building systems that are reliable, scalable, maintainable, and useful in production. Part VI bridges the gap between AI research and AI engineering, covering the systems-level thinking that separates demos from products.
We begin with retrieval-augmented generation (RAG), a pattern that has become the standard approach for building LLM applications that need to access external knowledge. AI agents and tool use show how to build systems where language models can take actions in the world — calling APIs, executing code, and reasoning about multi-step plans. Inference optimization covers the techniques that make it economically feasible to serve large models: quantization, distillation, KV caching, and specialized serving frameworks. MLOps and LLMOps provide the operational backbone for managing AI systems in production. We close with distributed training, explaining how modern models are trained across hundreds or thousands of GPUs.
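To give a concrete flavor of the retrieve-then-generate pattern that Chapter 31 develops in full, here is a toy sketch. The `embed` and `generate` functions below are hypothetical stand-ins for a real embedding model and LLM, used only to show the shape of the pipeline, not a production implementation.

```python
# Toy sketch of the retrieve-then-generate pattern behind RAG.
# embed() and generate() are stand-ins for a real embedding model and LLM.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in embedding: hash words into a fixed-size, normalized vector."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

documents = [
    "KV caching stores attention keys and values to speed up decoding.",
    "Quantization reduces model weights to lower-precision formats.",
    "Vector databases index embeddings for fast similarity search.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    scores = doc_vectors @ embed(query)
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def generate(query: str, context: list[str]) -> str:
    """Stand-in for an LLM call: assemble the retrieval-augmented prompt."""
    return f"Context: {' '.join(context)}\nQuestion: {query}\nAnswer: ..."

print(generate("What does KV caching do?", retrieve("What does KV caching do?")))
```

Every production RAG system elaborates on this skeleton: a learned embedding model replaces the hashing trick, a vector database replaces the in-memory matrix, and an LLM replaces the string template.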
If Part IV teaches you how transformers work, Part VI teaches you how to make them work in the real world.
Chapters in This Part
| Chapter | Title | Key Question |
|---|---|---|
| 31 | Retrieval-Augmented Generation (RAG) | How do we give LLMs access to external knowledge? |
| 32 | AI Agents and Tool Use | How do we build systems where LLMs can take actions? |
| 33 | Inference Optimization and Model Serving | How do we serve models efficiently at scale? |
| 34 | MLOps and LLMOps | How do we operationalize AI systems in production? |
| 35 | Distributed Training and Scaling | How do we train models across multiple GPUs and nodes? |
What You Will Be Able to Do After Part VI
- Build production RAG systems with embedding models and vector databases
- Design and implement AI agent architectures with tool use
- Apply quantization, distillation, and caching to optimize inference (see the toy quantization sketch after this list)
- Set up experiment tracking, CI/CD, and monitoring for ML systems
- Understand data parallelism, model parallelism, and fully sharded data parallel (FSDP) training
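As a preview of the inference-optimization material in Chapter 33, the following is a minimal sketch of symmetric int8 post-training quantization using a single per-tensor scale. The function names are illustrative; real implementations use per-channel scales, calibration data, and fused low-precision kernels.

```python
# Toy sketch of symmetric int8 post-training quantization.
# A single per-tensor scale is assumed for simplicity.
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float weights to int8 with one symmetric scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max reconstruction error:", np.abs(w - w_hat).max())
# Storage drops from 4 bytes to 1 byte per weight, at the cost of a
# small, bounded rounding error.
```

The trade-off shown here, a 4x reduction in memory for a bounded loss in precision, is the basic economics behind most inference-optimization techniques in this part.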
Prerequisites
- Part IV (transformer architecture and LLMs)
- Software engineering experience (APIs, databases, deployment)
- Familiarity with Docker and cloud computing (helpful but not required)