Further Reading: Chapter 31
Model Deployment
Official Documentation
1. FastAPI Documentation --- fastapi.tiangolo.com The primary reference for everything FastAPI. Start with the "First Steps" tutorial, which builds a minimal API in 10 lines. Then read "Path Parameters," "Request Body" (Pydantic models), "Response Model," and "Handling Errors" in sequence. The "Dependencies" section covers dependency injection for database connections and authentication --- patterns you will need when your API grows beyond a single endpoint. The documentation is exceptionally well-written, with working examples for every feature.
2. Pydantic Documentation --- docs.pydantic.dev
The v2 documentation covers model definition, field validation, custom validators, and serialization. The "Field" page explains Field(...) with constraints (ge, le, min_length, max_length). The "Types" page covers Literal, Annotated, and custom types. For ML APIs, the "Model Config" section is useful for controlling JSON serialization behavior. Pydantic v2 is a full rewrite with significant performance improvements over v1.
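The Field constraints mentioned above (ge, le, min_length, max_length) compose like this; the model and field names below are invented for illustration:

```python
from pydantic import BaseModel, Field, ValidationError

class HouseFeatures(BaseModel):
    # Constrained fields, as described on the "Field" page
    bedrooms: int = Field(ge=0, le=20)
    area_sqm: float = Field(gt=0)
    city: str = Field(min_length=1, max_length=64)

ok = HouseFeatures(bedrooms=3, area_sqm=85.5, city="Lyon")

try:
    HouseFeatures(bedrooms=-1, area_sqm=85.5, city="Lyon")
except ValidationError as e:
    # Each failed constraint is reported with its location and error type
    errors = e.errors()
```

In a FastAPI endpoint the same ValidationError is translated into a 422 response for the client, which is why schema constraints belong in the Pydantic model rather than in handler code.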
3. Docker Documentation --- docs.docker.com
Start with "Get Started" for Docker fundamentals (images, containers, volumes, networks). The "Dockerfile Reference" covers every instruction (FROM, COPY, RUN, CMD, HEALTHCHECK, USER). The "Best Practices" guide covers multi-stage builds, layer caching, .dockerignore, and security considerations. For ML deployments, the "Docker Compose" section covers multi-container local development.
4. Uvicorn Documentation --- www.uvicorn.org
Uvicorn is the ASGI server that runs FastAPI. The documentation covers worker configuration (--workers), logging, SSL/TLS, and deployment behind a reverse proxy (Nginx, Traefik). For production, the guidance on using Gunicorn as a process manager with Uvicorn workers is essential: gunicorn app:app -w 4 -k uvicorn.workers.UvicornWorker.
Model Serving Foundations
5. Designing Machine Learning Systems --- Chip Huyen (2022) Chapter 7 ("Model Deployment and Prediction Service") is the most comprehensive treatment of batch vs. real-time serving, model compression, edge deployment, and serving infrastructure. Huyen covers the tradeoffs between embedding models in applications, deploying them as microservices, and using managed serving platforms. The discussion of online prediction vs. batch prediction vs. streaming prediction is the best in any textbook. O'Reilly.
6. Building Machine Learning Powered Applications --- Emmanuel Ameisen (2020) Chapters 9--11 cover the full deployment pipeline: packaging models, building APIs, testing in production, and monitoring. Ameisen writes from a practitioner perspective with concrete examples from industry. The chapter on testing ML applications is particularly strong, covering property-based tests, data validation tests, and integration tests that go beyond the standard unit test approach. O'Reilly.
7. "Machine Learning Systems Design" --- Chip Huyen (Stanford CS 329S, Course Notes) Lecture notes from the Stanford course covering system design patterns for ML in production. The model serving module covers REST vs. gRPC, batch vs. online, latency optimization, and model versioning. Available freely at huyenchip.com/machine-learning-systems-design. More concise than the book, with useful system design diagrams.
FastAPI for ML
8. "Deploying ML Models with FastAPI" --- Sebastián Ramírez (FastAPI Creator), PyCon 2022 A conference talk by the creator of FastAPI demonstrating how to serve ML models with Pydantic validation, background tasks, and async endpoints. Ramírez covers patterns specific to ML workloads: loading models at startup (not per-request), handling long-running predictions with background tasks, and using dependency injection for model versioning. Available on YouTube.
9. "Serving ML Models in Production with FastAPI" --- Real Python Tutorial
A step-by-step tutorial building a scikit-learn model serving API with FastAPI. Covers project structure, Pydantic schemas, error handling, testing with TestClient, and Docker deployment. The tutorial is accessible for beginners and follows software engineering best practices. Available at realpython.com.
10. "Full Stack Machine Learning" --- Mark Treveil et al. (2020) Chapters on model serving cover the spectrum from Flask (simple but limited) through FastAPI (modern, typed) to TensorFlow Serving and TorchServe (framework-specific). The comparison helps you understand when FastAPI is the right choice (scikit-learn, XGBoost, custom models) vs. when framework-specific servers are better (large deep learning models with GPU inference). O'Reilly.
Docker for Data Science
11. "Docker for Data Science" --- Joshua Cook (2022) A data-scientist-friendly introduction to Docker covering image creation, volume mounts for data, GPU passthrough, and multi-container setups for ML workflows. The Jupyter-in-Docker and model-serving-in-Docker chapters are directly relevant. The book does not assume prior Docker experience. Apress.
12. "Best Practices for Writing Dockerfiles" --- Docker Official Guide
The authoritative guide to Dockerfile optimization. Covers multi-stage builds, minimizing layer count, using .dockerignore, choosing minimal base images (slim, alpine), and avoiding common mistakes (running as root, not pinning dependency versions, copying unnecessary files). Available at docs.docker.com/develop/develop-images/dockerfile_best-practices.
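A Dockerfile applying several of the guide's recommendations (multi-stage build, slim base image, non-root user) might look like the following sketch; the file layout, image tag, and user name are assumptions:

```dockerfile
# Stage 1: build wheels in a throwaway layer
FROM python:3.11-slim AS builder
WORKDIR /app
COPY requirements.txt .
RUN pip wheel --no-cache-dir -r requirements.txt -w /wheels

# Stage 2: minimal runtime image without build tooling
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /wheels /wheels
RUN pip install --no-cache-dir /wheels/* && rm -rf /wheels
COPY app/ ./app/
# Run as a non-root user (avoids the "running as root" mistake)
RUN useradd --create-home appuser
USER appuser
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000"]
```

Copying requirements.txt before the application code means the dependency layer is cached across code-only changes, which keeps rebuilds fast.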
Cloud Deployment
13. AWS ECS Documentation --- docs.aws.amazon.com/ecs The reference for deploying containers on AWS. The "Getting Started with Fargate" tutorial walks through task definitions, services, and load balancers. The "Best Practices" section covers container health checks, auto-scaling policies, and logging with CloudWatch. For ML workloads, the documentation on task resource allocation (CPU, memory) and placement strategies is relevant.
14. Google Cloud Run Documentation --- cloud.google.com/run/docs
The simplest path from a Docker container to a production endpoint. The "Quickstart" deploys a container in under five minutes. The "Tips" section covers cold start mitigation (--min-instances), concurrency settings, memory configuration, and connecting to databases. Cloud Run's scale-to-zero model is ideal for low-traffic ML endpoints where cost matters.
15. "Deploying Machine Learning Models on AWS" --- AWS Machine Learning Blog A series of blog posts covering deployment options: SageMaker endpoints (managed), ECS/Fargate (containers), Lambda (serverless), and App Runner (simplified containers). The posts compare latency, cost, and complexity for each option. The SageMaker vs. DIY comparison is particularly useful for deciding when a managed platform is worth the higher cost.
Deployment Patterns and Strategies
16. "Continuous Delivery for Machine Learning" --- Danilo Sato, Arif Wider, and Christoph Windheuser (2019) A ThoughtWorks article introducing CD4ML: applying continuous delivery principles to ML systems. Covers model versioning, automated testing for ML, deployment pipelines, and canary deployments. The article includes a reference architecture diagram that connects experiment tracking (MLflow), model serving (Docker + API), and monitoring into an end-to-end pipeline. Available at martinfowler.com/articles/cd4ml.html.
17. "Blue-Green Deployments" --- Martin Fowler (2010) The original blog post explaining blue-green deployments. Written for general software, but the pattern applies directly to ML model deployments. The post clarifies the distinction between blue-green (full traffic switch) and canary (gradual traffic shift) and when to use each. Available at martinfowler.com/bliki/BlueGreenDeployment.html.
18. Reliable Machine Learning --- Cathy Chen et al. (2022) Chapters on deployment cover model release processes, traffic management (canary, shadow mode, A/B testing), rollback strategies, and deployment automation. The Google-internal perspective provides patterns for large-scale deployments that are applicable (in simplified form) to smaller teams. O'Reilly.
Latency Optimization
19. "Optimizing ML Inference Latency" --- AWS re:Invent 2023 Talk A technical talk covering inference optimization techniques: model quantization, ONNX Runtime, batching strategies, caching, and hardware selection. The talk includes benchmarks showing latency improvements from switching from Python model serving to ONNX Runtime for scikit-learn and XGBoost models. Available on YouTube.
20. ONNX Runtime Documentation --- onnxruntime.ai ONNX (Open Neural Network Exchange) provides a standardized model format with an optimized inference engine. For scikit-learn models, converting to ONNX and serving with ONNX Runtime can reduce inference latency by 2--10x compared to native Python inference. The documentation includes conversion tutorials for scikit-learn, XGBoost, and LightGBM. Relevant when your API needs sub-10ms inference.
How to Use This List
If you are deploying your first model, start with the FastAPI documentation (item 1) and the Docker documentation (item 3). Build a minimal API, containerize it, and run it locally. That exercise alone will teach you more than reading all 20 items.
If you need to choose between batch and real-time deployment, read Huyen (item 5, Chapter 7) for the framework and Case Study 2 in this chapter for a worked example.
If your latency requirements are tight (< 50 ms), read the ONNX Runtime documentation (item 20) and the AWS inference optimization talk (item 19). Most latency problems are solved by profiling, not by switching frameworks.
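Before switching serving stacks, measure. A minimal profiling sketch using only the standard library (the predict function here is a stub; swap in your real model call):

```python
import statistics
import time

def predict(features):
    # Stand-in for a real model call
    return sum(features) / len(features)

def latency_profile(fn, payload, n=1000):
    """Call fn n times and report p50/p95 latency in milliseconds."""
    samples = []
    for _ in range(n):
        start = time.perf_counter()
        fn(payload)
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    return {
        "p50_ms": statistics.median(samples),
        "p95_ms": samples[int(0.95 * len(samples)) - 1],
    }

stats = latency_profile(predict, [0.1] * 32)
```

If the model call itself is already fast, the bottleneck is elsewhere (serialization, network, feature lookup), and no inference engine will fix it.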
If you are designing a deployment pipeline for a team, read the CD4ML article (item 16) and the Reliable Machine Learning book (item 18). These provide the organizational and process patterns that make deployment sustainable, not just possible.
This reading list supports Chapter 31: Model Deployment. Return to the chapter to review concepts before diving in.