Part VI: AI Systems Engineering
"The hard part of AI is not building the model — it is everything else." — Adapted from the "Hidden Technical Debt in Machine Learning Systems" paper
Training a model is only the beginning. The real challenge of AI engineering lies in building systems that are reliable, scalable, maintainable, and useful in production. Part VI bridges the gap between AI research and AI engineering, covering the systems-level thinking that separates demos from products.
We begin with retrieval-augmented generation (RAG), a pattern that has become the standard approach for building LLM applications that need to access external knowledge. AI agents and tool use show how to build systems where language models can take actions in the world — calling APIs, executing code, and reasoning about multi-step plans. Inference optimization covers the techniques that make it economically feasible to serve large models: quantization, distillation, KV caching, and specialized serving frameworks. MLOps and LLMOps provide the operational backbone for managing AI systems in production. We close with distributed training, explaining how modern models are trained across hundreds or thousands of GPUs.
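To give a concrete flavor of the retrieve-then-generate pattern that Chapter 31 develops in full, here is a toy sketch. The `embed` and `generate` functions below are hypothetical stand-ins for a real embedding model and LLM, used only to show the shape of the pipeline, not a production implementation.

```python
# Toy sketch of the retrieve-then-generate pattern behind RAG.
# embed() and generate() are stand-ins for a real embedding model and LLM.
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Stand-in embedding: hash words into a fixed-size, normalized vector."""
    vec = np.zeros(dim)
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

documents = [
    "KV caching stores attention keys and values to speed up decoding.",
    "Quantization reduces model weights to lower-precision formats.",
    "Vector databases index embeddings for fast similarity search.",
]
doc_vectors = np.stack([embed(d) for d in documents])

def retrieve(query: str, k: int = 1) -> list[str]:
    """Return the k documents most similar to the query (cosine similarity)."""
    scores = doc_vectors @ embed(query)
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

def generate(query: str, context: list[str]) -> str:
    """Stand-in for an LLM call: assemble the retrieval-augmented prompt."""
    return f"Context: {' '.join(context)}\nQuestion: {query}\nAnswer: ..."

print(generate("What does KV caching do?", retrieve("What does KV caching do?")))
```

Every production RAG system elaborates on this skeleton: a learned embedding model replaces the hashing trick, a vector database replaces the in-memory matrix, and an LLM replaces the string template.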
If Part IV teaches you how transformers work, Part VI teaches you how to make them work in the real world.
Chapters in This Part
| Chapter | Title | Key Question |
|---|---|---|
| 31 | Retrieval-Augmented Generation (RAG) | How do we give LLMs access to external knowledge? |
| 32 | AI Agents and Tool Use | How do we build systems where LLMs can take actions? |
| 33 | Inference Optimization and Model Serving | How do we serve models efficiently at scale? |
| 34 | MLOps and LLMOps | How do we operationalize AI systems in production? |
| 35 | Distributed Training and Scaling | How do we train models across multiple GPUs and nodes? |
What You Will Be Able to Do After Part VI
- Build production RAG systems with embedding models and vector databases
- Design and implement AI agent architectures with tool use
- Apply quantization, distillation, and caching to optimize inference (see the toy quantization sketch after this list)
- Set up experiment tracking, CI/CD, and monitoring for ML systems
- Understand data parallelism, model parallelism, and fully sharded data parallel (FSDP) training
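As a preview of the inference-optimization material in Chapter 33, the following is a minimal sketch of symmetric int8 post-training quantization using a single per-tensor scale. The function names are illustrative; real implementations use per-channel scales, calibration data, and fused low-precision kernels.

```python
# Toy sketch of symmetric int8 post-training quantization.
# A single per-tensor scale is assumed for simplicity.
import numpy as np

def quantize_int8(weights: np.ndarray) -> tuple[np.ndarray, float]:
    """Map float weights to int8 with one symmetric scale factor."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the int8 representation."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
print("max reconstruction error:", np.abs(w - w_hat).max())
# Storage drops from 4 bytes to 1 byte per weight, at the cost of a
# small, bounded rounding error.
```

The trade-off shown here, a 4x reduction in memory for a bounded loss in precision, is the basic economics behind most inference-optimization techniques in this part.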
Prerequisites
- Part IV (transformer architecture and LLMs)
- Software engineering experience (APIs, databases, deployment)
- Familiarity with Docker and cloud computing (helpful but not required)