In This Chapter
- Your Model Is Not Deployed
- REST APIs: The Universal Interface
- FastAPI: The Right Tool for Model Serving
- Pydantic: Contracts Between Client and Server
- Testing the API
- Batch Prediction vs. Real-Time Prediction
- Prediction Latency: Where the Time Goes
- Docker: Packaging for Portability
- docker-compose: Multi-Container Development
- Model Versioning in Production
- Cloud Deployment: AWS and GCP Walkthrough
- Bringing It Together: The Complete Deployment Pipeline
- Progressive Project M10: Deploy the StreamFlow Churn Model
- Summary
Chapter 31: Model Deployment
REST APIs with FastAPI, Containerization with Docker, and Basic Cloud Deployment
Learning Objectives
By the end of this chapter, you will be able to:
- Wrap a scikit-learn model in a FastAPI REST API
- Define request/response schemas with Pydantic
- Containerize the API with Docker
- Deploy to a cloud platform (basic AWS/GCP walkthrough)
- Handle prediction latency, batch vs. real-time, and model versioning in production
Your Model Is Not Deployed
War Story --- A data scientist at a mid-size insurance company spent three months building a claims fraud detection model. It had an AUC of 0.94 on the holdout set. The SHAP values were intuitive. The stakeholders were excited. She presented the results to the VP of Claims in a slide deck with a confusion matrix and a feature importance chart. The VP asked one question: "How does our claims processing system call this model?" The data scientist opened her laptop, launched a Jupyter notebook, ran three cells, and showed the output. The VP stared at her. "I cannot put your laptop in the data center." The model was never deployed. It lived in a notebook on a shared drive until the data scientist left the company eight months later.
Your model is not deployed until someone else can use it without you being in the room.
That sentence is the entire philosophy of this chapter. A model in a notebook is a prototype. A model behind an API endpoint, running in a container, accessible over HTTP, with health checks and structured request/response schemas --- that is a deployed model. The difference between the two is not sophistication. It is engineering discipline.
This chapter covers the three layers of deployment:
- Model serving --- wrapping the model in a REST API with FastAPI
- Containerization --- packaging the API and all its dependencies in a Docker container
- Cloud deployment --- pushing the container to a cloud platform where it runs without your laptop
We will use the StreamFlow churn model throughout. By the end of this chapter, anyone on the internet (or your corporate network) will be able to send an HTTP request with customer features and receive a churn probability and the top three SHAP explanations in return. No Jupyter notebook required.
REST APIs: The Universal Interface
A REST API (Representational State Transfer Application Programming Interface) is a way for two systems to communicate over HTTP. You have used REST APIs every time you made a request to a web application. When your browser loads a page, it sends an HTTP GET request to a server and receives HTML in return. When a mobile app submits a form, it sends an HTTP POST request with JSON data and receives a JSON response.
Model serving uses exactly the same pattern. Your model is the server. The client sends a POST request with input features as JSON. The server runs the model, and sends back a JSON response with the prediction.
Client                                       Server (Your Model)
  |                                             |
  |  POST /predict                              |
  |  {"tenure": 14, "monthly_charges": 89}      |
  | ------------------------------------------> |
  |                                             |  Load features
  |                                             |  Run model.predict_proba()
  |                                             |  Compute SHAP values
  |  200 OK                                     |
  |  {"churn_probability": 0.73,                |
  |   "top_reasons": [...]}                     |
  | <------------------------------------------ |
Why REST? Because it is the lingua franca of software systems. Every programming language, every platform, every cloud service speaks HTTP. Your ML model does not care whether the caller is a Python script, a Java microservice, a React frontend, or a curl command. It receives JSON, returns JSON. That universality is the point.
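To make the pattern concrete, here is a minimal Python client sketch using only the standard library. The URL and payload fields are illustrative; they assume the churn API built later in this chapter is running on localhost:8000.

```python
# Minimal stdlib client for the /predict endpoint (illustrative).
import json
import urllib.request


def build_predict_request(
    payload: dict, url: str = "http://localhost:8000/predict"
) -> urllib.request.Request:
    """Construct a POST request carrying the feature payload as JSON."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending it requires a running server:
# with urllib.request.urlopen(build_predict_request({"tenure": 14})) as resp:
#     print(json.loads(resp.read()))
```

Any other language follows the same recipe: serialize features to JSON, POST, parse the JSON response.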
FastAPI: The Right Tool for Model Serving
FastAPI is a modern Python web framework built on two ideas: type hints and speed. It was created by Sebastián Ramírez in 2018 and has become a de facto standard for building ML serving APIs in Python. Here is why:
- Automatic validation. FastAPI uses Pydantic models to validate incoming requests. If a client sends a string where you expected a float, FastAPI returns a clear 422 error before your code ever runs.
- Automatic documentation. FastAPI generates an interactive Swagger UI at /docs and a ReDoc page at /redoc. No extra work required.
- Asynchronous support. FastAPI runs on Uvicorn (an ASGI server), which handles concurrent requests efficiently.
- Performance. FastAPI is one of the fastest Python web frameworks, comparable to Node.js and Go for I/O-bound workloads.
Installation
pip install fastapi uvicorn scikit-learn joblib numpy pandas shap
The Minimal Prediction API
Let us start with the absolute minimum: a FastAPI app that loads a model and returns predictions.
# app.py --- Minimal prediction API
from fastapi import FastAPI
import joblib
import numpy as np

app = FastAPI(title="StreamFlow Churn Predictor", version="1.0.0")

# Load the model at startup (not per request)
model = joblib.load("model/churn_model.joblib")

@app.get("/health")
def health_check():
    """Health check endpoint for load balancers and monitoring."""
    return {"status": "healthy"}

@app.post("/predict")
def predict(features: dict):
    """Accept raw features, return churn probability."""
    X = np.array([list(features.values())])
    probability = model.predict_proba(X)[0, 1]
    return {"churn_probability": round(float(probability), 4)}
# Start the server
uvicorn app:app --host 0.0.0.0 --port 8000
# Test with curl
curl -X POST http://localhost:8000/predict \
-H "Content-Type: application/json" \
-d '{"tenure": 14, "monthly_charges": 89.5, "total_charges": 1253.0}'
This works, but it has a critical flaw: it accepts any dictionary. If a client sends {"color": "blue"}, the model will crash with a confusing numpy error. We need input validation.
Pydantic: Contracts Between Client and Server
Pydantic is a data validation library that uses Python type hints to define data schemas. FastAPI is built on Pydantic, so they work together seamlessly.
A request schema defines what the client must send. A response schema defines what the server will return. These schemas are contracts: if the client violates the request schema, FastAPI rejects the request with a detailed error message before your model code runs.
Defining the Schemas
# schemas.py --- Request and response schemas
from pydantic import BaseModel, Field

class ChurnPredictionRequest(BaseModel):
    """Input features for churn prediction.

    Each field has a type, a description, and example values
    that appear in the auto-generated documentation.
    """
    tenure: int = Field(
        ..., ge=0, le=120,
        description="Months the customer has been subscribed",
        examples=[14]
    )
    monthly_charges: float = Field(
        ..., ge=0,
        description="Current monthly charge in dollars",
        examples=[89.50]
    )
    total_charges: float = Field(
        ..., ge=0,
        description="Total charges to date in dollars",
        examples=[1253.00]
    )
    contract_type: str = Field(
        ...,
        description="Contract type: 'month-to-month', 'one-year', 'two-year'",
        examples=["month-to-month"]
    )
    payment_method: str = Field(
        ...,
        description="Payment method used by the customer",
        examples=["electronic_check"]
    )
    num_support_tickets: int = Field(
        ..., ge=0,
        description="Number of support tickets filed in last 6 months",
        examples=[3]
    )
    internet_service: str = Field(
        ...,
        description="Internet service type: 'dsl', 'fiber_optic', 'none'",
        examples=["fiber_optic"]
    )
    streaming_services: int = Field(
        ..., ge=0,
        description="Number of streaming services subscribed to (0-4)",
        examples=[2]
    )
    paperless_billing: bool = Field(
        ...,
        description="Whether the customer uses paperless billing",
        examples=[True]
    )
    senior_citizen: bool = Field(
        ...,
        description="Whether the customer is a senior citizen",
        examples=[False]
    )

class ShapReason(BaseModel):
    """A single SHAP explanation for a prediction."""
    feature: str = Field(..., description="Feature name")
    value: float = Field(..., description="Feature value for this customer")
    shap_contribution: float = Field(
        ..., description="SHAP contribution to churn probability"
    )
    direction: str = Field(
        ..., description="'increases_risk' or 'decreases_risk'"
    )

class ChurnPredictionResponse(BaseModel):
    """Output from the churn prediction endpoint."""
    churn_probability: float = Field(
        ..., ge=0, le=1,
        description="Probability of churning within 30 days"
    )
    risk_tier: str = Field(
        ...,
        description="'low' (<0.3), 'medium' (0.3-0.6), 'high' (>0.6)"
    )
    top_reasons: list[ShapReason] = Field(
        ...,
        description="Top 3 features driving the prediction"
    )
    model_version: str = Field(
        ...,
        description="Version identifier of the model that made this prediction"
    )
Practical Tip --- The Field(...) syntax with the ellipsis means the field is required. Use Field(default=...) for optional fields with defaults. The ge and le constraints catch impossible values (negative tenure, probability above 1.0) at the API boundary, before they pollute your model.
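You can see the contract in action without running a server by instantiating a model directly. The TinyRequest model below is a cut-down, illustrative stand-in for ChurnPredictionRequest, not part of the real schemas.py:

```python
# A minimal sketch of Field constraints rejecting bad input.
# TinyRequest is illustrative, not part of the actual API.
from pydantic import BaseModel, Field, ValidationError

class TinyRequest(BaseModel):
    tenure: int = Field(..., ge=0, le=120)
    monthly_charges: float = Field(..., ge=0)

ok = TinyRequest(tenure=14, monthly_charges=89.5)   # passes validation

try:
    TinyRequest(tenure=-5, monthly_charges=89.5)    # violates ge=0
except ValidationError as exc:
    # The error names the offending field, so clients can fix their payload
    print(exc.errors()[0]["loc"])
```

FastAPI performs exactly this validation on every incoming request and turns the ValidationError into a 422 response.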
The Full API with Schemas
# app.py --- Full prediction API with Pydantic schemas
from fastapi import FastAPI, HTTPException
import joblib
import numpy as np
import pandas as pd
import shap

from schemas import ChurnPredictionRequest, ChurnPredictionResponse, ShapReason

app = FastAPI(
    title="StreamFlow Churn Predictor",
    description="Predicts 30-day churn probability for StreamFlow subscribers.",
    version="1.0.0",
)

# --- Model Loading ---
MODEL_VERSION = "v2.3.1"
model = joblib.load("model/churn_model.joblib")
explainer = shap.TreeExplainer(model)

# Feature order must match training
FEATURE_ORDER = [
    "tenure", "monthly_charges", "total_charges", "contract_type",
    "payment_method", "num_support_tickets", "internet_service",
    "streaming_services", "paperless_billing", "senior_citizen",
]

# Encoding maps (must match training preprocessing)
CONTRACT_MAP = {"month-to-month": 0, "one-year": 1, "two-year": 2}
PAYMENT_MAP = {
    "electronic_check": 0, "mailed_check": 1,
    "bank_transfer": 2, "credit_card": 3,
}
INTERNET_MAP = {"none": 0, "dsl": 1, "fiber_optic": 2}

def encode_features(request: ChurnPredictionRequest) -> pd.DataFrame:
    """Convert request to a DataFrame matching the training schema."""
    data = {
        "tenure": request.tenure,
        "monthly_charges": request.monthly_charges,
        "total_charges": request.total_charges,
        "contract_type": CONTRACT_MAP.get(request.contract_type, -1),
        "payment_method": PAYMENT_MAP.get(request.payment_method, -1),
        "num_support_tickets": request.num_support_tickets,
        "internet_service": INTERNET_MAP.get(request.internet_service, -1),
        "streaming_services": request.streaming_services,
        "paperless_billing": int(request.paperless_billing),
        "senior_citizen": int(request.senior_citizen),
    }
    # Check for unknown categorical values
    for field, value in [
        ("contract_type", data["contract_type"]),
        ("payment_method", data["payment_method"]),
        ("internet_service", data["internet_service"]),
    ]:
        if value == -1:
            raise HTTPException(
                status_code=422,
                detail=f"Unknown value for {field}: {getattr(request, field)}",
            )
    return pd.DataFrame([data], columns=FEATURE_ORDER)

def get_risk_tier(probability: float) -> str:
    """Classify churn probability into risk tiers."""
    if probability < 0.3:
        return "low"
    elif probability < 0.6:
        return "medium"
    else:
        return "high"

def get_top_shap_reasons(
    features_df: pd.DataFrame, n: int = 3
) -> list[ShapReason]:
    """Compute SHAP values and return top N contributors."""
    shap_values = explainer.shap_values(features_df)
    # Older shap versions return a list [class_0, class_1] for binary
    # classification; newer versions return a (samples, features, classes) array
    if isinstance(shap_values, list):
        sv = shap_values[1][0]  # Class 1 (churn), first sample
    elif shap_values.ndim == 3:
        sv = shap_values[0, :, 1]  # First sample, all features, class 1
    else:
        sv = shap_values[0]
    feature_names = FEATURE_ORDER
    feature_values = features_df.iloc[0].to_dict()
    # Pair features with SHAP values and sort by absolute contribution
    contributions = []
    for fname, shap_val in zip(feature_names, sv):
        contributions.append({
            "feature": fname,
            "value": float(feature_values[fname]),
            "shap_contribution": round(float(shap_val), 4),
            "direction": "increases_risk" if shap_val > 0 else "decreases_risk",
        })
    contributions.sort(key=lambda x: abs(x["shap_contribution"]), reverse=True)
    return [ShapReason(**c) for c in contributions[:n]]

# --- Endpoints ---

@app.get("/health")
def health_check():
    """Health check endpoint. Returns 200 if the service is running."""
    return {"status": "healthy", "model_version": MODEL_VERSION}

@app.post("/predict", response_model=ChurnPredictionResponse)
def predict(request: ChurnPredictionRequest):
    """
    Predict churn probability for a single customer.

    Returns the probability, a risk tier, the top 3 SHAP reasons,
    and the model version.
    """
    # Encode features
    features_df = encode_features(request)
    # Predict
    probability = float(model.predict_proba(features_df)[0, 1])
    # Explain
    top_reasons = get_top_shap_reasons(features_df, n=3)
    return ChurnPredictionResponse(
        churn_probability=round(probability, 4),
        risk_tier=get_risk_tier(probability),
        top_reasons=top_reasons,
        model_version=MODEL_VERSION,
    )
Start the server and open http://localhost:8000/docs in a browser:
uvicorn app:app --host 0.0.0.0 --port 8000 --reload
The --reload flag restarts the server when you change code. Use it during development; remove it in production.
The Swagger UI at /docs is automatically generated from your Pydantic schemas. Every field name, type, description, and example value appears in an interactive form. You can submit test requests directly from the browser. This is not a luxury --- it is documentation that never goes stale because it is generated from the code.
Testing the API
Before we containerize anything, we need to verify that the API works correctly. FastAPI provides a TestClient that makes this straightforward.
# test_api.py --- API tests
from fastapi.testclient import TestClient
from app import app

client = TestClient(app)

def test_health_check():
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json()["status"] == "healthy"

def test_predict_valid_request():
    payload = {
        "tenure": 14,
        "monthly_charges": 89.50,
        "total_charges": 1253.00,
        "contract_type": "month-to-month",
        "payment_method": "electronic_check",
        "num_support_tickets": 3,
        "internet_service": "fiber_optic",
        "streaming_services": 2,
        "paperless_billing": True,
        "senior_citizen": False,
    }
    response = client.post("/predict", json=payload)
    assert response.status_code == 200
    data = response.json()
    assert 0 <= data["churn_probability"] <= 1
    assert data["risk_tier"] in ("low", "medium", "high")
    assert len(data["top_reasons"]) == 3
    assert data["model_version"] is not None

def test_predict_missing_field():
    payload = {"tenure": 14}  # Missing required fields
    response = client.post("/predict", json=payload)
    assert response.status_code == 422  # Validation error

def test_predict_invalid_type():
    payload = {
        "tenure": "not_a_number",  # Should be int
        "monthly_charges": 89.50,
        "total_charges": 1253.00,
        "contract_type": "month-to-month",
        "payment_method": "electronic_check",
        "num_support_tickets": 3,
        "internet_service": "fiber_optic",
        "streaming_services": 2,
        "paperless_billing": True,
        "senior_citizen": False,
    }
    response = client.post("/predict", json=payload)
    assert response.status_code == 422

def test_predict_out_of_range():
    payload = {
        "tenure": -5,  # ge=0 constraint violated
        "monthly_charges": 89.50,
        "total_charges": 1253.00,
        "contract_type": "month-to-month",
        "payment_method": "electronic_check",
        "num_support_tickets": 3,
        "internet_service": "fiber_optic",
        "streaming_services": 2,
        "paperless_billing": True,
        "senior_citizen": False,
    }
    response = client.post("/predict", json=payload)
    assert response.status_code == 422
pytest test_api.py -v
Every test should pass. If a test fails, fix the API, not the test. These tests are your safety net when you change the code later.
Practical Tip --- Always test the error paths, not just the happy path. A production API will receive garbage input from misconfigured clients, automated scanners, and confused developers. Your API should fail gracefully with clear error messages, not crash with a Python traceback.
Batch Prediction vs. Real-Time Prediction
Not every prediction needs to happen in real time. The choice between batch and real-time serving depends on the business requirement.
Real-Time (Online) Prediction
The client sends a request and waits for the response. Latency matters --- the customer is waiting.
| Characteristic | Detail |
|---|---|
| Latency requirement | < 100 ms per request typical |
| Infrastructure | API server (FastAPI + Uvicorn) behind a load balancer |
| When to use | User-facing features, fraud detection, recommendation widgets |
| StreamFlow example | Customer opens the "My Account" page; the app calls /predict and displays a retention offer if churn risk is high |
Batch Prediction
A scheduled job runs the model on all (or a subset of) records. Latency does not matter --- nobody is waiting.
| Characteristic | Detail |
|---|---|
| Latency requirement | Hours are acceptable |
| Infrastructure | Scheduled script, Spark job, or Airflow DAG |
| When to use | Nightly email campaigns, weekly reports, pre-computing scores for all customers |
| StreamFlow example | Every night at 2 AM, score all 2.3 million subscribers; write results to a database table; the marketing team queries the table the next morning |
Adding a Batch Endpoint
You can support both patterns in the same API by adding a batch endpoint that accepts a list of customers:
from pydantic import BaseModel, Field

class BatchPredictionRequest(BaseModel):
    """Multiple customers in a single request."""
    customers: list[ChurnPredictionRequest] = Field(
        ...,
        description="List of customer records to score",
        min_length=1,
        max_length=1000,
    )

class BatchPredictionResponse(BaseModel):
    """Predictions for all customers in the batch."""
    predictions: list[ChurnPredictionResponse]
    batch_size: int
    processing_time_ms: float

@app.post("/predict/batch", response_model=BatchPredictionResponse)
def predict_batch(request: BatchPredictionRequest):
    """Score multiple customers in a single request.

    More efficient than calling /predict in a loop because
    the model runs vectorized inference on the full batch.
    """
    import time
    start = time.time()

    # Encode all customers into a single DataFrame
    rows = []
    for customer in request.customers:
        features_df = encode_features(customer)
        rows.append(features_df)
    batch_df = pd.concat(rows, ignore_index=True)

    # Vectorized prediction (much faster than one-at-a-time)
    probabilities = model.predict_proba(batch_df)[:, 1]

    # Build responses
    predictions = []
    for i, customer in enumerate(request.customers):
        prob = float(probabilities[i])
        features_df = pd.DataFrame([batch_df.iloc[i]])
        reasons = get_top_shap_reasons(features_df, n=3)
        predictions.append(ChurnPredictionResponse(
            churn_probability=round(prob, 4),
            risk_tier=get_risk_tier(prob),
            top_reasons=reasons,
            model_version=MODEL_VERSION,
        ))

    elapsed_ms = (time.time() - start) * 1000
    return BatchPredictionResponse(
        predictions=predictions,
        batch_size=len(predictions),
        processing_time_ms=round(elapsed_ms, 2),
    )
Performance Note --- The batch endpoint is not just a convenience wrapper. Scikit-learn's predict_proba is vectorized: scoring 1000 customers in one call is dramatically faster than scoring 1000 customers in 1000 separate calls. If your downstream system can collect requests and send them in batches, use the batch endpoint.
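To make the gap concrete, here is a small benchmark sketch on synthetic data. The model, data shapes, and sizes are illustrative (not the StreamFlow model), and exact timings vary by machine, but the vectorized call is reliably faster by one to two orders of magnitude.

```python
# Benchmark sketch: 1000 single-row predict_proba calls vs. one
# vectorized call on the same rows. Synthetic data, illustrative only.
import time
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 10))
y = (X[:, 0] + rng.normal(size=2000) > 0).astype(int)
model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)

batch = X[:1000]

start = time.perf_counter()
for row in batch:
    model.predict_proba(row.reshape(1, -1))   # 1000 separate calls
loop_s = time.perf_counter() - start

start = time.perf_counter()
probs = model.predict_proba(batch)            # one vectorized call
vec_s = time.perf_counter() - start

print(f"loop: {loop_s:.3f}s  vectorized: {vec_s:.4f}s  "
      f"speedup: {loop_s / vec_s:.0f}x")
```

The per-call overhead (input checking, array setup, tree traversal bookkeeping) dominates single-row inference, which is exactly what batching amortizes.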
Prediction Latency: Where the Time Goes
When a real-time prediction takes 500 ms instead of 50 ms, the problem is rarely the model itself. Scikit-learn's predict_proba on 10 features takes microseconds. The latency comes from everything around the model.
| Component | Typical Time | What Helps |
|---|---|---|
| Network round-trip | 5--50 ms | Deploy close to the caller; reduce payload size |
| JSON parsing and validation | 1--5 ms | Pydantic is fast; this is rarely the bottleneck |
| Feature encoding | 1--10 ms | Pre-compute expensive features; cache lookups |
| Model inference | 0.1--10 ms | Use a smaller model; ONNX Runtime for complex models |
| SHAP computation | 10--200 ms | Limit to top N features; cache explainer; use approximate SHAP |
| JSON serialization | 1--5 ms | Keep response payload small |
| Total | ~20--280 ms | |
SHAP is often the slowest component. For real-time endpoints where sub-50ms latency is required, consider:
- Pre-computing SHAP values in a batch job and caching them
- Using approximate SHAP (shap.Explainer with algorithm="auto")
- Dropping SHAP from the real-time endpoint and serving explanations from a separate, async endpoint
import time
from fastapi import Request

@app.middleware("http")
async def add_latency_header(request: Request, call_next):
    """Add response time to every response for monitoring."""
    start = time.time()
    response = await call_next(request)
    elapsed_ms = (time.time() - start) * 1000
    response.headers["X-Response-Time-Ms"] = f"{elapsed_ms:.2f}"
    return response
This middleware adds an X-Response-Time-Ms header to every response. Use it to track latency trends. When the number starts climbing, you know where to look.
Docker: Packaging for Portability
You have a working API on your laptop. Now you need it to work on every other machine. Docker solves this by packaging your application, its dependencies, and its runtime environment into a single, portable container.
Docker Vocabulary
| Term | Definition |
|---|---|
| Image | A read-only template containing your application code, dependencies, and OS libraries. Think of it as a snapshot of a configured machine. |
| Container | A running instance of an image. You can run multiple containers from the same image. |
| Dockerfile | A text file with instructions for building an image. Each instruction creates a layer. |
| Registry | A place to store and share images. Docker Hub is public. AWS ECR, GCP Artifact Registry, and Azure ACR are private. |
| docker-compose | A tool for defining and running multi-container applications with a YAML file. |
Project Structure for Deployment
Before writing the Dockerfile, organize the project:
streamflow-churn-api/
    app.py                   # FastAPI application
    schemas.py               # Pydantic models
    test_api.py              # API tests
    model/
        churn_model.joblib   # Trained model artifact
    requirements.txt         # Pinned dependencies
    Dockerfile               # Container build instructions
    docker-compose.yml       # Local orchestration
    .dockerignore            # Files to exclude from the build
requirements.txt
Pin every dependency. "Latest" is not a version.
fastapi==0.115.0
uvicorn[standard]==0.30.0
scikit-learn==1.5.0
joblib==1.4.2
numpy==1.26.4
pandas==2.2.2
shap==0.45.0
pydantic==2.8.0
Practical Tip --- Generate pinned requirements from your working environment:
pip freeze > requirements.txt. Then review and remove packages you do not actually need. A bloated image is a slow image.
The Dockerfile
# --- Stage 1: Build ---
FROM python:3.11-slim AS builder
WORKDIR /app
# Install dependencies first (layer caching optimization)
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt
# --- Stage 2: Runtime ---
FROM python:3.11-slim
WORKDIR /app
# Copy installed packages from builder
COPY --from=builder /install /usr/local
# Copy application code
COPY app.py .
COPY schemas.py .
COPY model/ model/
# Create a non-root user (security best practice)
RUN useradd --create-home appuser
USER appuser
# Expose the port
EXPOSE 8000
# Health check
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"
# Start the server
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
Let us walk through the key decisions:
Multi-stage build. The first stage (builder) installs Python packages. The second stage (runtime) copies only the installed packages and application code. This keeps the final image small by excluding pip caches, build tools, and compiler headers.
Layer caching. We copy requirements.txt and install dependencies before copying the application code. Docker caches layers. If you change app.py but not requirements.txt, Docker reuses the cached dependency layer instead of reinstalling everything. This turns a 5-minute rebuild into a 10-second rebuild.
Non-root user. Running as root inside a container is a security vulnerability. If an attacker exploits a bug in your application, they get root access to the container. Running as appuser limits the damage.
HEALTHCHECK. The health check tells Docker (and any orchestrator like Kubernetes or ECS) whether the container is alive and ready to serve traffic. If the health check fails three consecutive times, the orchestrator restarts the container.
Building and Running
# Build the image
docker build -t streamflow-churn-api:v1.0 .
# Run the container
docker run -d --name churn-api -p 8000:8000 streamflow-churn-api:v1.0
# Verify it works
curl http://localhost:8000/health
# View logs
docker logs churn-api
# Stop and remove
docker stop churn-api && docker rm churn-api
The -p 8000:8000 flag maps port 8000 on your host to port 8000 in the container. The -d flag runs the container in the background.
.dockerignore
Just as .gitignore keeps files out of git, .dockerignore keeps files out of the Docker build context:
__pycache__/
*.pyc
.git/
.env
.venv/
test_api.py
*.md
.pytest_cache/
mlruns/
notebooks/
Without a .dockerignore, Docker copies your entire directory into the build context --- including your git history, virtual environment, and test files. This slows down the build and bloats the image.
docker-compose: Multi-Container Development
For local development, docker-compose lets you define and run the API alongside other services (a database, a monitoring dashboard, a model registry) with a single command.
# docker-compose.yml
version: "3.8"

services:
  churn-api:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8000:8000"
    environment:
      - MODEL_PATH=model/churn_model.joblib
      - LOG_LEVEL=info
    volumes:
      - ./model:/app/model  # Mount model directory for hot-swapping
    healthcheck:
      test: ["CMD", "python", "-c",
             "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
      interval: 30s
      timeout: 5s
      retries: 3
    restart: unless-stopped
# Start the service
docker-compose up -d
# View logs
docker-compose logs -f churn-api
# Rebuild after code changes
docker-compose up -d --build
# Stop everything
docker-compose down
The volumes mount lets you swap the model file without rebuilding the container. During development, this is convenient. In production, bake the model into the image so the container is self-contained and reproducible.
Model Versioning in Production
A deployed model is not a static artifact. You retrain on new data. You fix bugs in the preprocessing pipeline. You tune hyperparameters. You need to know which version of the model is currently serving predictions, and you need the ability to roll back if something goes wrong.
Versioning Strategy
Every model artifact should include:
- A semantic version (v2.3.1) embedded in the filename or metadata
- A link to the MLflow run that produced it (see Chapter 30)
- The training data version and date
- The git commit hash of the code used to train it
# At the top of app.py
import os
MODEL_VERSION = os.getenv("MODEL_VERSION", "v2.3.1")
MODEL_PATH = os.getenv("MODEL_PATH", "model/churn_model.joblib")
model = joblib.load(MODEL_PATH)
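An alternative that keeps the version metadata physically attached to the artifact is to save a small bundle dict instead of the bare estimator. The field names and the DummyClassifier below are illustrative stand-ins, not the chapter's actual artifact format:

```python
# Sketch: bundle the model with its provenance in a single artifact.
# DummyClassifier and the metadata fields are illustrative.
import tempfile
from pathlib import Path

import joblib
from sklearn.dummy import DummyClassifier

model = DummyClassifier(strategy="most_frequent").fit([[0], [1]], [0, 1])
bundle = {
    "model": model,
    "version": "v2.3.1",
    "git_commit": "a1b2c3d",          # hash of the training code (illustrative)
    "feature_order": ["tenure", "monthly_charges"],
}

path = Path(tempfile.gettempdir()) / "churn_model_bundle.joblib"
joblib.dump(bundle, path)

# At serving time, one load yields the model and its provenance together
loaded = joblib.load(path)
serving_model = loaded["model"]
MODEL_VERSION = loaded["version"]
```

With this layout, the version reported by /health can never drift from the model actually loaded, because both come from the same file.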
Tag your Docker images with the model version:
docker build -t streamflow-churn-api:v2.3.1 .
docker build -t streamflow-churn-api:latest .
Warning --- Never deploy the latest tag in production. "Latest" is not a version. It is a prayer. Use explicit version tags so you know exactly what is running and can roll back to a specific version.
Canary Deployment
A canary deployment routes a small percentage of traffic (e.g., 5%) to the new model version while the old version handles the rest. If the new model's metrics (accuracy, latency, error rate) look good after a few hours, you gradually increase the traffic split until 100% goes to the new version.
                +-----------+
                |   Load    |
 Traffic --->   | Balancer  |
                +-----+-----+
                      |
                +-----+-----+
                |           |
                v           v
          +-----+----+  +---+--------+
          |  Model   |  |   Model    |
          |  v2.3.1  |  |  v2.4.0    |
          |  (95%)   |  |   (5%)     |
          +----------+  +------------+
This is safer than a hard cutover. If the new model has a bug --- it crashes on a specific input pattern, its predictions are wildly different, its latency is 10x higher --- only 5% of users are affected.
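The traffic split itself is normally configured at the load balancer, not in application code, but a toy routing sketch makes the 95/5 behavior concrete (the version names and weights are illustrative):

```python
# Toy weighted canary routing: each request goes to a version with
# probability proportional to its configured weight.
import random

WEIGHTS = {"v2.3.1": 0.95, "v2.4.0": 0.05}  # illustrative canary split

def route(weights: dict, rng: random.Random) -> str:
    """Pick a model version according to the traffic weights."""
    return rng.choices(list(weights), weights=list(weights.values()))[0]

rng = random.Random(42)  # seeded for reproducibility
counts = {version: 0 for version in WEIGHTS}
for _ in range(10_000):
    counts[route(WEIGHTS, rng)] += 1
print(counts)  # roughly 9,500 vs. 500 requests
```

Promoting the canary is then just a weight change (0.05 to 0.25 to 1.0), which is why load balancers expose it as configuration rather than code.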
Blue-Green Deployment
A blue-green deployment runs two identical environments: "blue" (the current production version) and "green" (the new version). You switch all traffic from blue to green at once. If something goes wrong, you switch back.
Before switch:                      After switch:
Traffic --> Blue  (v2.3.1)          Traffic --> Green (v2.4.0)
            Green (v2.4.0)                      Blue  (v2.3.1)
            (idle)                              (idle/standby)
Blue-green is simpler than canary but riskier: all traffic switches at once. Use it when you have strong confidence in the new version (comprehensive test suite, staging validation) and need a clean cutover.
Cloud Deployment: AWS and GCP Walkthrough
Your containerized API is ready. Now it needs to run somewhere other than your laptop.
AWS: Elastic Container Service (ECS) with Fargate
AWS ECS runs Docker containers without managing servers. Fargate is the "serverless" mode --- you specify CPU and memory, and AWS handles the rest.
Step 1: Push your image to ECR (Elastic Container Registry).
# Authenticate Docker with ECR
aws ecr get-login-password --region us-east-1 | \
docker login --username AWS --password-stdin 123456789.dkr.ecr.us-east-1.amazonaws.com
# Create a repository
aws ecr create-repository --repository-name streamflow-churn-api
# Tag and push the image
docker tag streamflow-churn-api:v2.3.1 \
123456789.dkr.ecr.us-east-1.amazonaws.com/streamflow-churn-api:v2.3.1
docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/streamflow-churn-api:v2.3.1
Step 2: Create an ECS task definition.
The task definition tells ECS how to run your container: which image to use, how much CPU and memory, which ports to expose, and where to send logs.
{
  "family": "streamflow-churn-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "containerDefinitions": [
    {
      "name": "churn-api",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/streamflow-churn-api:v2.3.1",
      "portMappings": [
        {"containerPort": 8000, "protocol": "tcp"}
      ],
      "healthCheck": {
        "command": ["CMD-SHELL",
          "python -c \"import urllib.request; urllib.request.urlopen('http://localhost:8000/health')\""],
        "interval": 30,
        "timeout": 5,
        "retries": 3
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/streamflow-churn-api",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}
Step 3: Create an ECS service with an Application Load Balancer (ALB).
The service maintains the desired number of running tasks and registers them with the ALB. If a container fails its health check, ECS replaces it automatically.
# Create the service (simplified; typically done via CloudFormation or Terraform)
aws ecs create-service \
--cluster streamflow-prod \
--service-name churn-api \
--task-definition streamflow-churn-api:1 \
--desired-count 2 \
--launch-type FARGATE \
--network-configuration "awsvpcConfiguration={subnets=[subnet-abc123],securityGroups=[sg-abc123],assignPublicIp=ENABLED}"
With desired-count 2, ECS runs two instances of your API behind the load balancer. If one crashes, the other continues serving while ECS launches a replacement.
GCP: Cloud Run
Google Cloud Run is the simplest path from a Docker image to a running service. It scales to zero (you pay nothing when there is no traffic) and scales up automatically.
Step 1: Push to Artifact Registry.
# Configure Docker for GCP
gcloud auth configure-docker us-central1-docker.pkg.dev
# Tag and push
docker tag streamflow-churn-api:v2.3.1 \
us-central1-docker.pkg.dev/my-project/streamflow/churn-api:v2.3.1
docker push us-central1-docker.pkg.dev/my-project/streamflow/churn-api:v2.3.1
Step 2: Deploy to Cloud Run.
gcloud run deploy churn-api \
--image us-central1-docker.pkg.dev/my-project/streamflow/churn-api:v2.3.1 \
--platform managed \
--region us-central1 \
--port 8000 \
--memory 1Gi \
--cpu 1 \
--min-instances 1 \
--max-instances 10 \
--allow-unauthenticated
Cloud Run gives you a URL like https://churn-api-abc123-uc.a.run.app. That is your production endpoint. HTTPS, autoscaling, and health checks are built in.
The --min-instances 1 flag keeps one instance warm at all times, eliminating cold start latency. Without it, Cloud Run scales to zero and the first request after a period of inactivity takes 5--15 seconds to cold-start the container.
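If you instead run with `--min-instances 0` to save cost, clients should tolerate the occasional slow first request. A common pattern (sketched here with the stdlib; the function names are our own) is retry with exponential backoff and a client timeout generous enough to cover the cold start:

```python
import time
import urllib.error
import urllib.request

def backoff_delays(retries: int, base: float = 1.0, cap: float = 30.0) -> list:
    """Exponential backoff schedule: 1s, 2s, 4s, ... capped at `cap` seconds."""
    return [min(base * (2 ** i), cap) for i in range(retries)]

def get_with_retry(url: str, retries: int = 4, timeout: float = 20.0) -> bytes:
    """GET a URL, retrying on errors to ride out a scale-from-zero cold start."""
    last_err = None
    for delay in backoff_delays(retries):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, TimeoutError) as err:
            last_err = err
            time.sleep(delay)  # wait before the next attempt
    raise RuntimeError(f"endpoint still unavailable after {retries} attempts") from last_err
```

The 20-second timeout comfortably covers the 5--15 second cold start quoted above; the backoff keeps a fleet of retrying clients from hammering the service while it warms up.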
Theme: Real World =/= Kaggle --- On Kaggle, `model.predict_proba(X_test)` is the last line of code. In the real world, it is the first line of a different codebase --- one that handles HTTP routing, input validation, error handling, containerization, load balancing, health checks, auto-scaling, and versioned rollbacks. The prediction is the easy part. The deployment is the engineering.
Bringing It Together: The Complete Deployment Pipeline
Here is the end-to-end workflow from trained model to production endpoint:
1. Train model (Chapter 14, 18)
|
2. Track experiment in MLflow (Chapter 30)
|
3. Register model in MLflow Model Registry
|
4. Export model artifact (joblib/pickle)
|
5. Build FastAPI app with Pydantic schemas (this chapter)
|
6. Write tests, run pytest
|
7. Write Dockerfile, build image
|
8. Test container locally: docker run + curl
|
9. Push image to registry (ECR / Artifact Registry)
|
10. Deploy to cloud (ECS / Cloud Run)
|
11. Monitor (Chapter 32)
Steps 5 through 10 are this chapter. Steps 1 through 4 are what you have already done. Step 11 is next.
Progressive Project M10: Deploy the StreamFlow Churn Model
This milestone brings your progressive project model from a notebook into a deployable API.
Tasks
M10a: Build the FastAPI endpoint.
- Create `app.py` with a `/predict` endpoint that accepts your progressive project features
- Define `ChurnPredictionRequest` and `ChurnPredictionResponse` Pydantic schemas
- Load your best model from Chapter 18 (or the MLflow Model Registry from Chapter 30)
- Return churn probability, risk tier, and top 3 SHAP reasons
- Add a `/health` endpoint
M10b: Write tests.
- Test the health check endpoint
- Test a valid prediction request
- Test that missing fields return 422
- Test that invalid types return 422
- Run all tests with `pytest`
M10c: Containerize with Docker.
- Write a `Dockerfile` with multi-stage build
- Write `requirements.txt` with pinned versions
- Write a `.dockerignore`
- Build the image: `docker build -t streamflow-churn-api:v1.0 .`
- Run the container: `docker run -p 8000:8000 streamflow-churn-api:v1.0`
- Test with curl or the Swagger UI at `http://localhost:8000/docs`
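If you prefer testing the running container from Python rather than curl, the stdlib is enough. The sketch below builds the POST request the API expects; the feature names (`tenure_months`, `monthly_hours`) are placeholders -- substitute the fields from your own Pydantic schema:

```python
import json
import urllib.request

def build_predict_request(features: dict,
                          host: str = "http://localhost:8000") -> urllib.request.Request:
    """Package a feature dict as a JSON POST to the /predict endpoint."""
    body = json.dumps(features).encode("utf-8")
    return urllib.request.Request(
        f"{host}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_predict_request({"tenure_months": 14, "monthly_hours": 42.5})
# With the container running, urllib.request.urlopen(req) sends the request
# and returns the JSON prediction body.
print(req.full_url, req.get_method())
```

A malformed payload (missing field, wrong type) should come back as a 422, exactly as your M10b tests assert.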
M10d: (Optional) Deploy to a cloud platform.
If you have an AWS or GCP account, push your image to a registry and deploy to ECS Fargate or Cloud Run. Verify the endpoint responds to requests from your local machine. Record the public URL.
Deliverables
- `app.py`, `schemas.py`, `test_api.py` --- the FastAPI application and tests
- `Dockerfile`, `requirements.txt`, `.dockerignore` --- the container configuration
- A screenshot or terminal output showing a successful prediction from the running container
- (Optional) The public URL of your deployed endpoint
Summary
A model is not deployed until someone else can use it without you being in the room. FastAPI provides the REST interface: type-safe request/response schemas with Pydantic, automatic API documentation, and the performance to serve real-time predictions. Docker provides the portability: a container that runs identically on your laptop, your colleague's laptop, a CI server, and a production cloud instance. Cloud platforms --- ECS Fargate on AWS, Cloud Run on GCP --- provide the infrastructure: auto-scaling, health checks, load balancing, and managed HTTPS.
The choices you make at deployment time --- real-time vs. batch, canary vs. blue-green, SHAP at inference time vs. pre-computed --- are engineering decisions that depend on the business requirements, not the model architecture. A model with an AUC of 0.88 that is deployed, monitored, and serving predictions is infinitely more valuable than a model with an AUC of 0.92 that lives in a notebook on someone's laptop.
Deploy early. Deploy often. And never deploy `latest`.
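That last rule is easy to enforce mechanically. A sketch of a guard you could run in a deploy script (the helper name and the accepted tag format are our own assumptions, not part of any deployment tool):

```python
import re

# Accept only immutable tags: semver like v2.3.1, optionally suffixed
# with a git SHA (e.g. v2.3.1-9fceb02).
TAG_PATTERN = re.compile(r"^v\d+\.\d+\.\d+(-[0-9a-f]{7,40})?$")

def validate_image_tag(image: str) -> str:
    """Reject 'latest' and other mutable tags; return the image if pinned."""
    if ":" not in image:
        raise ValueError(f"{image!r} has no tag; Docker would default to ':latest'")
    tag = image.rsplit(":", 1)[1]
    if tag == "latest" or not TAG_PATTERN.match(tag):
        raise ValueError(
            f"refusing to deploy mutable tag {tag!r}; pin a version like v2.3.1")
    return image

validate_image_tag("streamflow-churn-api:v2.3.1")  # passes
```

A check like this, run in CI before `docker push`, is what keeps the reproducibility chain below intact.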
Theme: Reproducibility --- Every deployed container should be traceable back to a specific Docker image tag, which maps to a specific model version, which maps to a specific MLflow run, which maps to specific hyperparameters and a specific data version. If any link in that chain is broken, you cannot reproduce the deployment. The chain is the point.
Next chapter: Chapter 32: Monitoring in Production --- detecting data drift, performance degradation, and the moment your deployed model starts silently failing.