Key Takeaways: Chapter 31
Model Deployment
- Your model is not deployed until someone else can use it without you being in the room. A model in a Jupyter notebook is a prototype. A model behind an HTTP endpoint, running in a container, with validated input schemas and structured responses, is a deployed model. The difference is not sophistication --- it is the difference between a demonstration and a product.
- FastAPI is the standard for building ML prediction APIs in Python. It combines type-hint-based request validation (via Pydantic), automatic interactive documentation (Swagger UI at /docs), and high performance (async, runs on Uvicorn). A minimal prediction endpoint is fewer than 30 lines of code. A production-ready endpoint with schemas, health checks, SHAP explanations, and error handling is still under 200.
- Pydantic schemas are contracts, not conveniences. A request schema defines exactly what the client must send: field names, types, ranges, and allowed values. A response schema defines exactly what the server returns. If the client sends a string where you expect a float, FastAPI returns a 422 error with a detailed message --- before your model code runs. This catches bugs at the API boundary, not deep inside the inference pipeline.
- Docker makes your deployment reproducible across environments. A Dockerfile specifies the OS, Python version, installed packages, application code, and startup command. The resulting image runs identically on your laptop, a colleague's laptop, a CI server, and a cloud instance. Use multi-stage builds to keep images small, copy requirements.txt before application code for layer caching, run as a non-root user for security, and include a HEALTHCHECK instruction.
- Batch prediction and real-time prediction are complementary, not competing. Real-time APIs serve interactive use cases where a human or system is waiting (latency < 1 second). Batch jobs serve scheduled, high-volume scoring where nobody is waiting (hours are fine). Many production systems use both: a real-time API for the customer portal and a nightly batch job for marketing campaigns. The choice depends on the business requirement, not the model.
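The Docker practices above can be sketched in one Dockerfile. This is an illustrative skeleton, not the chapter's exact file: the Python version, port, app module, and the assumption that the app exposes a /health route are all placeholders.

```dockerfile
# Stage 1: install dependencies in an isolated prefix
FROM python:3.11-slim AS builder
WORKDIR /app
# Copy requirements.txt first so this layer is cached until dependencies change
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# Stage 2: slim runtime image without build artifacts
FROM python:3.11-slim
WORKDIR /app
COPY --from=builder /install /usr/local
COPY . .
# Run as a non-root user for security
RUN useradd --create-home apiuser
USER apiuser
EXPOSE 8000
# Assumes the application serves a /health endpoint
HEALTHCHECK CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

The two-stage split keeps compilers and pip caches out of the final image, and copying requirements.txt before the application code means routine code changes do not invalidate the dependency layer.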
- SHAP is often the latency bottleneck, not the model. A scikit-learn predict_proba call takes microseconds. A SHAP TreeExplainer call takes 10--200 milliseconds. For real-time endpoints with strict latency requirements, consider pre-computing SHAP values in batch, caching explanations, or serving them from a separate asynchronous endpoint.
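One way to keep SHAP off the hot path is to cache explanations keyed by the feature vector. A minimal sketch with a stub in place of the real TreeExplainer call (the stub and its arithmetic are stand-ins, not SHAP's API):

```python
from functools import lru_cache

def slow_explain(features: tuple) -> tuple:
    # Stand-in for a SHAP TreeExplainer call, which can take 10-200 ms
    return tuple(f * 0.1 for f in features)

@lru_cache(maxsize=10_000)
def cached_explain(features: tuple) -> tuple:
    # Repeated feature vectors hit the cache instead of recomputing SHAP;
    # tuples are used because lru_cache keys must be hashable
    return slow_explain(features)

first = cached_explain((12.0, 70.0))   # computed
second = cached_explain((12.0, 70.0))  # served from cache
```

This only helps when identical inputs recur; for mostly unique inputs, pre-computing in batch or deferring to an asynchronous explanation endpoint is the better fit.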
- Model versioning in production requires explicit tags, not "latest." Every Docker image should be tagged with a model version (churn-api:v2.3.1), and every prediction response should include the model version. This ensures traceability: any prediction can be linked back to a specific model, MLflow run, and training data version. Rolling back is a tag swap, not a debugging session.
- Canary deployments protect you from production surprises. Route 5--10% of traffic to the new model version while monitoring accuracy, latency, and error rates. If metrics hold for 24--48 hours, increase the split. If they degrade, roll back. A canary catches problems that holdout evaluation misses: data distribution shifts, edge cases in production data, and latency regressions.
- Cloud deployment is infrastructure, not magic. AWS ECS Fargate and Google Cloud Run both run Docker containers. ECS gives you fine-grained control over networking, scaling policies, and multi-container orchestration. Cloud Run gives you simplicity: a single command deploys your container with automatic HTTPS and scale-to-zero. Choose based on complexity requirements, not brand preference.
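The Cloud Run "single command" looks roughly like this; the project, region, and image names are placeholders:

```shell
# Google Cloud Run: deploys a container with automatic HTTPS and scale-to-zero
gcloud run deploy churn-api \
  --image gcr.io/my-project/churn-api:v2.3.1 \
  --region us-central1 \
  --allow-unauthenticated
```

The ECS Fargate equivalent is more involved by design: you register a task definition, create or update a service, and wire up the load balancer and scaling policies yourself, which is exactly the fine-grained control the bullet describes.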
- The preprocessing contract is the most dangerous single point of failure. If the real-time API and the batch job encode features differently, predictions will be silently wrong. No error, no crash, no alert --- just incorrect scores. Extract encoding logic into a shared module that both paths import. Test it once. Trust it everywhere.
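The shared-module pattern can be as small as one function. A sketch, with illustrative feature names that are not from the chapter:

```python
# shared_features.py -- single source of truth for feature encoding.
# Both the real-time API and the batch job import this module, so the
# two serving paths cannot silently drift apart.
CONTRACT_TIERS = {"basic": 0, "standard": 1, "premium": 2}

def encode(record: dict) -> list:
    """Turn a raw customer record into the model's feature vector."""
    return [
        float(record["tenure_months"]),
        float(record["monthly_charges"]),
        float(CONTRACT_TIERS[record["contract_tier"]]),  # KeyError on unknown tier: fail loudly
    ]

# Identical input through either path yields identical features
customer = {"tenure_months": 12, "monthly_charges": 70.0, "contract_tier": "premium"}
api_row = encode(customer)    # real-time path
batch_row = encode(customer)  # batch path
```

Note the deliberate choice to raise on an unknown contract tier rather than default to a value: a crash at the boundary is recoverable, a silently wrong encoding is not.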
If You Remember One Thing
A model with an AUC of 0.88 that is deployed, monitored, and serving predictions is infinitely more valuable than a model with an AUC of 0.92 that lives in a notebook on someone's laptop. Deployment is not the last step of a data science project. It is the first step of a data science product. FastAPI gives you the interface. Docker gives you the portability. Cloud platforms give you the scale. Pydantic gives you the safety net. Everything else is engineering detail --- important, but solvable. The model that never gets deployed solves nothing.
These takeaways summarize Chapter 31: Model Deployment. Return to the chapter for full context.