Key Takeaways: Chapter 31

Model Deployment


  1. Your model is not deployed until someone else can use it without you being in the room. A model in a Jupyter notebook is a prototype. A model behind an HTTP endpoint, running in a container, with validated input schemas and structured responses, is a deployed model. The difference is not sophistication --- it is the difference between a demonstration and a product.

  2. FastAPI is the standard for building ML prediction APIs in Python. It combines type-hint-based request validation (via Pydantic), automatic interactive documentation (Swagger UI at /docs), and high performance (async, runs on Uvicorn). A minimal prediction endpoint is fewer than 30 lines of code. A production-ready endpoint with schemas, health checks, SHAP explanations, and error handling is still under 200.

  3. Pydantic schemas are contracts, not conveniences. A request schema defines exactly what the client must send: field names, types, ranges, and allowed values. A response schema defines exactly what the server returns. If the client sends a string where you expect a float, FastAPI returns a 422 error with a detailed message --- before your model code runs. This catches bugs at the API boundary, not deep inside the inference pipeline.
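A sketch of a schema-as-contract, with hypothetical field names and ranges. A well-formed request validates; a malformed one raises a `ValidationError`, which FastAPI converts into the 422 response before any model code runs:

```python
from typing import Literal
from pydantic import BaseModel, Field, ValidationError

class PredictionRequest(BaseModel):
    tenure_months: int = Field(ge=0, le=600)   # range constraint
    monthly_charges: float = Field(gt=0)       # must be positive
    contract_type: Literal["month-to-month", "one-year", "two-year"]

# A well-formed request validates cleanly.
ok = PredictionRequest(
    tenure_months=24, monthly_charges=59.99, contract_type="one-year"
)

# A malformed request fails at the API boundary.
try:
    PredictionRequest(
        tenure_months=24, monthly_charges="expensive", contract_type="one-year"
    )
except ValidationError as exc:
    print(len(exc.errors()), "validation error(s)")
```

The same schema also drives the auto-generated documentation, so the contract the client reads is the contract the server enforces.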

  4. Docker makes your deployment reproducible across environments. A Dockerfile specifies the OS, Python version, installed packages, application code, and startup command. The resulting image runs identically on your laptop, a colleague's laptop, a CI server, and a cloud instance. Use multi-stage builds to keep images small, copy requirements.txt before application code for layer caching, run as a non-root user for security, and include a HEALTHCHECK instruction.
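A Dockerfile sketch following those practices; the base image tag, app module name (`main:app`), port, and health-check path are assumptions, not prescriptions:

```dockerfile
# --- Build stage: install dependencies into a virtualenv -------------
FROM python:3.11-slim AS builder
WORKDIR /app
# Copy requirements first so this layer stays cached until deps change.
COPY requirements.txt .
RUN python -m venv /opt/venv && /opt/venv/bin/pip install -r requirements.txt

# --- Runtime stage: copy only what is needed to run ------------------
FROM python:3.11-slim
COPY --from=builder /opt/venv /opt/venv
ENV PATH="/opt/venv/bin:$PATH"
WORKDIR /app
COPY . .
# Run as a non-root user for security.
RUN useradd --create-home appuser
USER appuser
EXPOSE 8000
HEALTHCHECK --interval=30s --timeout=3s \
  CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```

The build stage keeps compilers and pip caches out of the final image, and the HEALTHCHECK lets the orchestrator restart a container whose process is up but whose endpoint is not responding.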

  5. Batch prediction and real-time prediction are complementary, not competing. Real-time APIs serve interactive use cases where a human or system is waiting (latency < 1 second). Batch jobs serve scheduled, high-volume scoring where nobody is waiting (hours are fine). Many production systems use both: a real-time API for the customer portal and a nightly batch job for marketing campaigns. The choice depends on the business requirement, not the model.

  6. SHAP is often the latency bottleneck, not the model. A scikit-learn predict_proba call takes microseconds. A SHAP TreeExplainer call takes 10--200 milliseconds. For real-time endpoints with strict latency requirements, consider pre-computing SHAP values in batch, caching explanations, or serving them from a separate asynchronous endpoint.
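The caching option can be sketched in pure Python. Here `compute_shap_values` is a hypothetical stand-in for an expensive call such as `shap.TreeExplainer(model).shap_values(features)`, replaced by a cheap deterministic function so the sketch is self-contained:

```python
from functools import lru_cache

# Hypothetical stand-in for the expensive explainer call.
def compute_shap_values(features: tuple) -> tuple:
    return tuple(round(f * 0.1, 4) for f in features)

# lru_cache requires hashable arguments, so the feature vector is
# passed as a tuple rather than a DataFrame row.
@lru_cache(maxsize=10_000)
def cached_shap_values(features: tuple) -> tuple:
    return compute_shap_values(features)

# The first call pays the full cost; a repeat call with the same
# feature vector is served from the cache.
a = cached_shap_values((12.0, 70.0, 1.0))
b = cached_shap_values((12.0, 70.0, 1.0))
print(cached_shap_values.cache_info())
```

Caching helps most when the same entities are scored repeatedly; for unique feature vectors, pre-computing in batch or deferring to an async endpoint is the better fit.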

  7. Model versioning in production requires explicit tags, not "latest." Every Docker image should be tagged with a model version (churn-api:v2.3.1), and every prediction response should include the model version. This ensures traceability: any prediction can be linked back to a specific model, MLflow run, and training data version. Rolling back is a tag swap, not a debugging session.
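A sketch of stamping the version into every response, assuming the version is injected via an environment variable set at deploy time to match the image tag (the variable name and default are illustrative):

```python
import os

# Set at deploy time to match the Docker image tag, e.g. churn-api:v2.3.1.
MODEL_VERSION = os.environ.get("MODEL_VERSION", "v2.3.1")

def build_response(churn_probability: float) -> dict:
    # Every prediction carries the version that produced it, so any
    # score can be traced back to a specific image and MLflow run.
    return {
        "churn_probability": churn_probability,
        "model_version": MODEL_VERSION,
    }

print(build_response(0.42))
```

Because the version rides along in logs and downstream tables, an audit question like "which model scored this customer in March?" becomes a lookup, not an investigation.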

  8. Canary deployments protect you from production surprises. Route 5--10% of traffic to the new model version while monitoring accuracy, latency, and error rates. If metrics hold for 24--48 hours, increase the split. If they degrade, roll back. A canary catches problems that holdout evaluation misses: data distribution shifts, edge cases in production data, and latency regressions.
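The traffic split can be sketched with hash-based routing. This is one common approach (not the chapter's specific mechanism): hashing a stable request or user ID into a bucket gives sticky routing, so the same caller always hits the same version during the canary window:

```python
import hashlib

CANARY_PERCENT = 10  # route ~10% of traffic to the new version

def route_request(request_id: str, canary_percent: int = CANARY_PERCENT) -> str:
    # Hash the ID into a stable bucket 0-99; unlike a pure random
    # split, the same ID always lands on the same version.
    bucket = int(hashlib.sha256(request_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "stable"

ids = [f"user-{i}" for i in range(1000)]
canary_share = sum(route_request(i) == "canary" for i in ids) / len(ids)
print(f"{canary_share:.1%} routed to canary")
```

Ramping up is then a one-line change to `canary_percent`, and rolling back is setting it to zero.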

  9. Cloud deployment is infrastructure, not magic. AWS ECS Fargate and Google Cloud Run both run Docker containers. ECS gives you fine-grained control over networking, scaling policies, and multi-container orchestration. Cloud Run gives you simplicity: a single command deploys your container with automatic HTTPS and scale-to-zero. Choose based on complexity requirements, not brand preference.

  10. The preprocessing contract is the most dangerous single point of failure. If the real-time API and the batch job encode features differently, predictions will be silently wrong. No error, no crash, no alert --- just incorrect scores. Extract encoding logic into a shared module that both paths import. Test it once. Trust it everywhere.
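A minimal sketch of that shared module, with illustrative feature names and encodings. Both serving paths import and call the same function, so divergence is impossible by construction:

```python
# shared_features.py -- the single encoding module imported by both
# the real-time API and the batch job.

CONTRACT_CODES = {"month-to-month": 0, "one-year": 1, "two-year": 2}

def encode_features(record: dict) -> list:
    """Turn a raw customer record into the model's feature vector."""
    return [
        float(record["tenure_months"]),
        float(record["monthly_charges"]),
        float(CONTRACT_CODES[record["contract_type"]]),
    ]

record = {"tenure_months": 12, "monthly_charges": 70.0,
          "contract_type": "one-year"}
api_features = encode_features(record)    # real-time path
batch_features = encode_features(record)  # batch path
assert api_features == batch_features
```

A unit test pinning the exact output vector for a known record turns a silent encoding drift into a loud CI failure.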


If You Remember One Thing

A model with an AUC of 0.88 that is deployed, monitored, and serving predictions is infinitely more valuable than a model with an AUC of 0.92 that lives in a notebook on someone's laptop. Deployment is not the last step of a data science project. It is the first step of a data science product. FastAPI gives you the interface. Docker gives you the portability. Cloud platforms give you the scale. Pydantic gives you the safety net. Everything else is engineering detail --- important, but solvable. The model that never gets deployed solves nothing.


These takeaways summarize Chapter 31: Model Deployment. Return to the chapter for full context.