Chapter 31: Model Deployment

REST APIs with FastAPI, Containerization with Docker, and Basic Cloud Deployment


Learning Objectives

By the end of this chapter, you will be able to:

  1. Wrap a scikit-learn model in a FastAPI REST API
  2. Define request/response schemas with Pydantic
  3. Containerize the API with Docker
  4. Deploy to a cloud platform (basic AWS/GCP walkthrough)
  5. Handle prediction latency, batch vs. real-time, and model versioning in production

Your Model Is Not Deployed

War Story --- A data scientist at a mid-size insurance company spent three months building a claims fraud detection model. It had an AUC of 0.94 on the holdout set. The SHAP values were intuitive. The stakeholders were excited. She presented the results to the VP of Claims in a slide deck with a confusion matrix and a feature importance chart. The VP asked one question: "How does our claims processing system call this model?" The data scientist opened her laptop, launched a Jupyter notebook, ran three cells, and showed the output. The VP stared at her. "I cannot put your laptop in the data center." The model was never deployed. It lived in a notebook on a shared drive until the data scientist left the company eight months later.

Your model is not deployed until someone else can use it without you being in the room.

That sentence is the entire philosophy of this chapter. A model in a notebook is a prototype. A model behind an API endpoint, running in a container, accessible over HTTP, with health checks and structured request/response schemas --- that is a deployed model. The difference between the two is not sophistication. It is engineering discipline.

This chapter covers the three layers of deployment:

  1. Model serving --- wrapping the model in a REST API with FastAPI
  2. Containerization --- packaging the API and all its dependencies in a Docker container
  3. Cloud deployment --- pushing the container to a cloud platform where it runs without your laptop

We will use the StreamFlow churn model throughout. By the end of this chapter, anyone on the internet (or your corporate network) will be able to send an HTTP request with customer features and receive a churn probability and the top three SHAP explanations in return. No Jupyter notebook required.


REST APIs: The Universal Interface

A REST API (Representational State Transfer Application Programming Interface) is a way for two systems to communicate over HTTP. You have used REST APIs every time you made a request to a web application. When your browser loads a page, it sends an HTTP GET request to a server and receives HTML in return. When a mobile app submits a form, it sends an HTTP POST request with JSON data and receives a JSON response.

Model serving uses exactly the same pattern. Your model is the server. The client sends a POST request with input features as JSON. The server runs the model and sends back a JSON response with the prediction.

Client                                Server (Your Model)
  |                                        |
  |  POST /predict                         |
  |  {"tenure": 14, "monthly_charges": 89} |
  |  ------------------------------------> |
  |                                        |  Load features
  |                                        |  Run model.predict_proba()
  |                                        |  Compute SHAP values
  |  200 OK                                |
  |  {"churn_probability": 0.73,           |
  |   "top_reasons": [...]}                |
  |  <------------------------------------ |

Why REST? Because it is the lingua franca of software systems. Every programming language, every platform, every cloud service speaks HTTP. Your ML model does not care whether the caller is a Python script, a Java microservice, a React frontend, or a curl command. It receives JSON, returns JSON. That universality is the point.
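To make the exchange concrete, here is the client side sketched with nothing beyond the Python standard library. The helper name is ours, and the URL assumes the server built later in this chapter is running locally:

```python
import json
import urllib.request


def predict_churn(features: dict, url: str = "http://localhost:8000/predict") -> dict:
    """POST customer features as JSON and return the parsed prediction.

    Assumes the /predict endpoint built in this chapter is running at `url`.
    """
    body = json.dumps(features).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())
```

Any language with an HTTP client can do the same thing; that interchangeability is exactly what REST buys you.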


FastAPI: The Right Tool for Model Serving

FastAPI is a modern Python web framework built on two ideas: type hints and speed. It was created by Sebastian Ramirez in 2018 and has become the standard for building ML serving APIs in Python. Here is why:

  1. Automatic validation. FastAPI uses Pydantic models to validate incoming requests. If a client sends a string where you expected a float, FastAPI returns a clear 422 error before your code ever runs.
  2. Automatic documentation. FastAPI generates an interactive Swagger UI at /docs and a ReDoc page at /redoc. No extra work required.
  3. Asynchronous support. FastAPI runs on Uvicorn (an ASGI server), which handles concurrent requests efficiently.
  4. Performance. FastAPI is one of the fastest Python web frameworks, comparable to Node.js and Go for I/O-bound workloads.

Installation

pip install fastapi uvicorn scikit-learn joblib numpy pandas shap

The Minimal Prediction API

Let us start with the absolute minimum: a FastAPI app that loads a model and returns predictions.

# app.py --- Minimal prediction API
from fastapi import FastAPI
import joblib
import numpy as np

app = FastAPI(title="StreamFlow Churn Predictor", version="1.0.0")

# Load the model at startup (not per request)
model = joblib.load("model/churn_model.joblib")

@app.get("/health")
def health_check():
    """Health check endpoint for load balancers and monitoring."""
    return {"status": "healthy"}

@app.post("/predict")
def predict(features: dict):
    """Accept raw features, return churn probability."""
    X = np.array([list(features.values())])
    probability = model.predict_proba(X)[0, 1]
    return {"churn_probability": round(float(probability), 4)}

# Start the server
uvicorn app:app --host 0.0.0.0 --port 8000
# Test with curl
curl -X POST http://localhost:8000/predict \
  -H "Content-Type: application/json" \
  -d '{"tenure": 14, "monthly_charges": 89.5, "total_charges": 1253.0}'

This works, but it has a critical flaw: it accepts any dictionary. If a client sends {"color": "blue"}, the model will crash with a confusing numpy error. We need input validation.
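The flaw is worse than a crash. Because the endpoint builds the feature vector from dict values in whatever order the client sent them, two requests carrying identical data can silently produce different predictions. A minimal sketch of the failure mode, reusing the endpoint's own encoding:

```python
import numpy as np

# Identical customer data, different JSON key order
a = {"tenure": 14, "monthly_charges": 89.5}
b = {"monthly_charges": 89.5, "tenure": 14}

# The minimal endpoint builds the feature vector from dict values as-is,
# so feature positions follow whatever order the client happened to send
Xa = np.array([list(a.values())])
Xb = np.array([list(b.values())])

# Same customer, swapped feature columns: the model sees different inputs
assert not np.array_equal(Xa, Xb)
```

No error is raised in the second case; the model just scores the wrong numbers. That is the failure mode validation exists to prevent.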


Pydantic: Contracts Between Client and Server

Pydantic is a data validation library that uses Python type hints to define data schemas. FastAPI is built on Pydantic, so they work together seamlessly.

A request schema defines what the client must send. A response schema defines what the server will return. These schemas are contracts: if the client violates the request schema, FastAPI rejects the request with a detailed error message before your model code runs.

Defining the Schemas

# schemas.py --- Request and response schemas
from pydantic import BaseModel, Field
from typing import Optional


class ChurnPredictionRequest(BaseModel):
    """Input features for churn prediction.

    Each field has a type, a description, and example values
    that appear in the auto-generated documentation.
    """
    tenure: int = Field(
        ..., ge=0, le=120,
        description="Months the customer has been subscribed",
        examples=[14]
    )
    monthly_charges: float = Field(
        ..., ge=0,
        description="Current monthly charge in dollars",
        examples=[89.50]
    )
    total_charges: float = Field(
        ..., ge=0,
        description="Total charges to date in dollars",
        examples=[1253.00]
    )
    contract_type: str = Field(
        ...,
        description="Contract type: 'month-to-month', 'one-year', 'two-year'",
        examples=["month-to-month"]
    )
    payment_method: str = Field(
        ...,
        description="Payment method used by the customer",
        examples=["electronic_check"]
    )
    num_support_tickets: int = Field(
        ..., ge=0,
        description="Number of support tickets filed in last 6 months",
        examples=[3]
    )
    internet_service: str = Field(
        ...,
        description="Internet service type: 'dsl', 'fiber_optic', 'none'",
        examples=["fiber_optic"]
    )
    streaming_services: int = Field(
        ..., ge=0,
        description="Number of streaming services subscribed to (0-4)",
        examples=[2]
    )
    paperless_billing: bool = Field(
        ...,
        description="Whether the customer uses paperless billing",
        examples=[True]
    )
    senior_citizen: bool = Field(
        ...,
        description="Whether the customer is a senior citizen",
        examples=[False]
    )


class ShapReason(BaseModel):
    """A single SHAP explanation for a prediction."""
    feature: str = Field(..., description="Feature name")
    value: float = Field(..., description="Feature value for this customer")
    shap_contribution: float = Field(
        ..., description="SHAP contribution to churn probability"
    )
    direction: str = Field(
        ..., description="'increases_risk' or 'decreases_risk'"
    )


class ChurnPredictionResponse(BaseModel):
    """Output from the churn prediction endpoint."""
    churn_probability: float = Field(
        ..., ge=0, le=1,
        description="Probability of churning within 30 days"
    )
    risk_tier: str = Field(
        ...,
        description="'low' (<0.3), 'medium' (0.3-0.6), 'high' (>0.6)"
    )
    top_reasons: list[ShapReason] = Field(
        ...,
        description="Top 3 features driving the prediction"
    )
    model_version: str = Field(
        ...,
        description="Version identifier of the model that made this prediction"
    )

Practical Tip --- The Field(...) syntax with the ellipsis means the field is required. Use Field(default=...) for optional fields with defaults. The ge and le constraints catch impossible values (negative tenure, probability above 1.0) at the API boundary, before they pollute your model.
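To see the contract in action, here is a standalone sketch using a cut-down schema with just the tenure field (assuming Pydantic v2, as pinned later in this chapter):

```python
from pydantic import BaseModel, Field, ValidationError


class MiniRequest(BaseModel):
    """Cut-down request schema: just the tenure field and its bounds."""
    tenure: int = Field(..., ge=0, le=120)


# Valid input passes through untouched
assert MiniRequest(tenure=14).tenure == 14

# Out-of-range input is rejected before any model code could run
rejected = False
try:
    MiniRequest(tenure=-5)
except ValidationError:
    rejected = True
assert rejected
```

FastAPI does this for you on every request: a ValidationError at the boundary becomes a 422 response with a field-by-field explanation of what was wrong.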

The Full API with Schemas

# app.py --- Full prediction API with Pydantic schemas
from fastapi import FastAPI, HTTPException
import joblib
import numpy as np
import pandas as pd
import shap
from schemas import ChurnPredictionRequest, ChurnPredictionResponse, ShapReason

app = FastAPI(
    title="StreamFlow Churn Predictor",
    description="Predicts 30-day churn probability for StreamFlow subscribers.",
    version="1.0.0",
)

# --- Model Loading ---
MODEL_VERSION = "v2.3.1"
model = joblib.load("model/churn_model.joblib")
explainer = shap.TreeExplainer(model)

# Feature order must match training
FEATURE_ORDER = [
    "tenure", "monthly_charges", "total_charges", "contract_type",
    "payment_method", "num_support_tickets", "internet_service",
    "streaming_services", "paperless_billing", "senior_citizen",
]

# Encoding maps (must match training preprocessing)
CONTRACT_MAP = {"month-to-month": 0, "one-year": 1, "two-year": 2}
PAYMENT_MAP = {
    "electronic_check": 0, "mailed_check": 1,
    "bank_transfer": 2, "credit_card": 3,
}
INTERNET_MAP = {"none": 0, "dsl": 1, "fiber_optic": 2}


def encode_features(request: ChurnPredictionRequest) -> pd.DataFrame:
    """Convert request to a DataFrame matching the training schema."""
    data = {
        "tenure": request.tenure,
        "monthly_charges": request.monthly_charges,
        "total_charges": request.total_charges,
        "contract_type": CONTRACT_MAP.get(request.contract_type, -1),
        "payment_method": PAYMENT_MAP.get(request.payment_method, -1),
        "num_support_tickets": request.num_support_tickets,
        "internet_service": INTERNET_MAP.get(request.internet_service, -1),
        "streaming_services": request.streaming_services,
        "paperless_billing": int(request.paperless_billing),
        "senior_citizen": int(request.senior_citizen),
    }

    # Check for unknown categorical values
    for field, value in [
        ("contract_type", data["contract_type"]),
        ("payment_method", data["payment_method"]),
        ("internet_service", data["internet_service"]),
    ]:
        if value == -1:
            raise HTTPException(
                status_code=422,
                detail=f"Unknown value for {field}: {getattr(request, field)}",
            )

    return pd.DataFrame([data], columns=FEATURE_ORDER)


def get_risk_tier(probability: float) -> str:
    """Classify churn probability into risk tiers."""
    if probability < 0.3:
        return "low"
    elif probability < 0.6:
        return "medium"
    else:
        return "high"


def get_top_shap_reasons(
    features_df: pd.DataFrame, n: int = 3
) -> list[ShapReason]:
    """Compute SHAP values and return top N contributors."""
    shap_values = explainer.shap_values(features_df)

    # For binary classification, shap_values may be a list [class_0, class_1]
    if isinstance(shap_values, list):
        sv = shap_values[1][0]  # Class 1 (churn), first sample
    else:
        sv = shap_values[0]

    feature_names = FEATURE_ORDER
    feature_values = features_df.iloc[0].to_dict()

    # Pair features with SHAP values and sort by absolute contribution
    contributions = []
    for fname, shap_val in zip(feature_names, sv):
        contributions.append({
            "feature": fname,
            "value": float(feature_values[fname]),
            "shap_contribution": round(float(shap_val), 4),
            "direction": "increases_risk" if shap_val > 0 else "decreases_risk",
        })

    contributions.sort(key=lambda x: abs(x["shap_contribution"]), reverse=True)
    return [ShapReason(**c) for c in contributions[:n]]


# --- Endpoints ---

@app.get("/health")
def health_check():
    """Health check endpoint. Returns 200 if the service is running."""
    return {"status": "healthy", "model_version": MODEL_VERSION}


@app.post("/predict", response_model=ChurnPredictionResponse)
def predict(request: ChurnPredictionRequest):
    """
    Predict churn probability for a single customer.

    Returns the probability, a risk tier, the top 3 SHAP reasons,
    and the model version.
    """
    # Encode features
    features_df = encode_features(request)

    # Predict
    probability = float(model.predict_proba(features_df)[0, 1])

    # Explain
    top_reasons = get_top_shap_reasons(features_df, n=3)

    return ChurnPredictionResponse(
        churn_probability=round(probability, 4),
        risk_tier=get_risk_tier(probability),
        top_reasons=top_reasons,
        model_version=MODEL_VERSION,
    )

Start the server and open http://localhost:8000/docs in a browser:

uvicorn app:app --host 0.0.0.0 --port 8000 --reload

The --reload flag restarts the server when you change code. Use it during development; remove it in production.

The Swagger UI at /docs is automatically generated from your Pydantic schemas. Every field name, type, description, and example value appears in an interactive form. You can submit test requests directly from the browser. This is not a luxury --- it is documentation that never goes stale because it is generated from the code.


Testing the API

Before we containerize anything, we need to verify that the API works correctly. FastAPI provides a TestClient that makes this straightforward.

# test_api.py --- API tests
from fastapi.testclient import TestClient
from app import app

client = TestClient(app)


def test_health_check():
    response = client.get("/health")
    assert response.status_code == 200
    assert response.json()["status"] == "healthy"


def test_predict_valid_request():
    payload = {
        "tenure": 14,
        "monthly_charges": 89.50,
        "total_charges": 1253.00,
        "contract_type": "month-to-month",
        "payment_method": "electronic_check",
        "num_support_tickets": 3,
        "internet_service": "fiber_optic",
        "streaming_services": 2,
        "paperless_billing": True,
        "senior_citizen": False,
    }
    response = client.post("/predict", json=payload)
    assert response.status_code == 200

    data = response.json()
    assert 0 <= data["churn_probability"] <= 1
    assert data["risk_tier"] in ("low", "medium", "high")
    assert len(data["top_reasons"]) == 3
    assert data["model_version"] is not None


def test_predict_missing_field():
    payload = {"tenure": 14}  # Missing required fields
    response = client.post("/predict", json=payload)
    assert response.status_code == 422  # Validation error


def test_predict_invalid_type():
    payload = {
        "tenure": "not_a_number",  # Should be int
        "monthly_charges": 89.50,
        "total_charges": 1253.00,
        "contract_type": "month-to-month",
        "payment_method": "electronic_check",
        "num_support_tickets": 3,
        "internet_service": "fiber_optic",
        "streaming_services": 2,
        "paperless_billing": True,
        "senior_citizen": False,
    }
    response = client.post("/predict", json=payload)
    assert response.status_code == 422


def test_predict_out_of_range():
    payload = {
        "tenure": -5,  # ge=0 constraint violated
        "monthly_charges": 89.50,
        "total_charges": 1253.00,
        "contract_type": "month-to-month",
        "payment_method": "electronic_check",
        "num_support_tickets": 3,
        "internet_service": "fiber_optic",
        "streaming_services": 2,
        "paperless_billing": True,
        "senior_citizen": False,
    }
    response = client.post("/predict", json=payload)
    assert response.status_code == 422

# Run the tests
pytest test_api.py -v

Every test should pass. If a test fails, fix the API, not the test. These tests are your safety net when you change the code later.

Practical Tip --- Always test the error paths, not just the happy path. A production API will receive garbage input from misconfigured clients, automated scanners, and confused developers. Your API should fail gracefully with clear error messages, not crash with a Python traceback.


Batch Prediction vs. Real-Time Prediction

Not every prediction needs to happen in real time. The choice between batch and real-time serving depends on the business requirement.

Real-Time (Online) Prediction

The client sends a request and waits for the response. Latency matters --- the customer is waiting.

Characteristic        Detail
Latency requirement   < 100 ms per request is typical
Infrastructure        API server (FastAPI + Uvicorn) behind a load balancer
When to use           User-facing features, fraud detection, recommendation widgets
StreamFlow example    Customer opens the "My Account" page; the app calls /predict
                      and displays a retention offer if churn risk is high

Batch Prediction

A scheduled job runs the model on all (or a subset of) records. Latency does not matter --- nobody is waiting.

Characteristic        Detail
Latency requirement   Hours are acceptable
Infrastructure        Scheduled script, Spark job, or Airflow DAG
When to use           Nightly email campaigns, weekly reports, pre-computing
                      scores for all customers
StreamFlow example    Every night at 2 AM, score all 2.3 million subscribers;
                      write results to a database table; the marketing team
                      queries the table the next morning
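The nightly job itself can be a short script. A sketch under obvious assumptions: synthetic data and a toy model stand in for the subscriber table and the trained artifact, and the output path is hypothetical:

```python
# batch_score.py --- nightly batch scoring sketch (paths and columns hypothetical)
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Stand-ins for the real artifacts: in production the model comes from
# joblib.load(...) and the subscriber table from the warehouse
rng = np.random.default_rng(42)
X_train = rng.normal(size=(500, 3))
y_train = (X_train[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X_train, y_train)

subscribers = pd.DataFrame(rng.normal(size=(1000, 3)), columns=["f1", "f2", "f3"])
subscribers.insert(0, "customer_id", range(1000))

# One vectorized pass over every row -- no HTTP, nobody waiting
feature_cols = ["f1", "f2", "f3"]
subscribers["churn_probability"] = model.predict_proba(
    subscribers[feature_cols].to_numpy()
)[:, 1]

# Write scores where the downstream team can query them
subscribers[["customer_id", "churn_probability"]].to_csv(
    "churn_scores.csv", index=False
)
```

Schedule it with cron or Airflow, and the marketing team never touches the model; they query a table of fresh scores.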

Adding a Batch Endpoint

You can support both patterns in the same API by adding a batch endpoint that accepts a list of customers:

from pydantic import BaseModel, Field


class BatchPredictionRequest(BaseModel):
    """Multiple customers in a single request."""
    customers: list[ChurnPredictionRequest] = Field(
        ...,
        description="List of customer records to score",
        min_length=1,
        max_length=1000,
    )


class BatchPredictionResponse(BaseModel):
    """Predictions for all customers in the batch."""
    predictions: list[ChurnPredictionResponse]
    batch_size: int
    processing_time_ms: float


@app.post("/predict/batch", response_model=BatchPredictionResponse)
def predict_batch(request: BatchPredictionRequest):
    """Score multiple customers in a single request.

    More efficient than calling /predict in a loop because
    the model runs vectorized inference on the full batch.
    """
    import time
    start = time.time()

    # Encode all customers into a single DataFrame
    rows = []
    for customer in request.customers:
        features_df = encode_features(customer)
        rows.append(features_df)
    batch_df = pd.concat(rows, ignore_index=True)

    # Vectorized prediction (much faster than one-at-a-time)
    probabilities = model.predict_proba(batch_df)[:, 1]

    # Build responses
    predictions = []
    for i, customer in enumerate(request.customers):
        prob = float(probabilities[i])
        features_df = batch_df.iloc[[i]]  # one-row DataFrame, columns preserved
        reasons = get_top_shap_reasons(features_df, n=3)
        predictions.append(ChurnPredictionResponse(
            churn_probability=round(prob, 4),
            risk_tier=get_risk_tier(prob),
            top_reasons=reasons,
            model_version=MODEL_VERSION,
        ))

    elapsed_ms = (time.time() - start) * 1000

    return BatchPredictionResponse(
        predictions=predictions,
        batch_size=len(predictions),
        processing_time_ms=round(elapsed_ms, 2),
    )

Performance Note --- The batch endpoint is not just a convenience wrapper. Scikit-learn's predict_proba is vectorized: scoring 1000 customers in one call is dramatically faster than scoring 1000 customers in 1000 separate calls. If your downstream system can collect requests and send them in batches, use the batch endpoint.
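The claim is easy to verify. A sketch with a synthetic stand-in for the churn model: both paths produce identical numbers, but the loop pays Python call overhead a thousand times over:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the churn model
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 10))
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression().fit(X, y)

# One vectorized call over the whole batch
batch = model.predict_proba(X)[:, 1]

# 1000 separate calls: identical numbers, ~1000x the per-call overhead
loop = np.array([model.predict_proba(X[i:i + 1])[0, 1] for i in range(len(X))])

assert np.allclose(batch, loop)
```

Wrap both in a timer on your own machine and the difference is typically two to three orders of magnitude.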


Prediction Latency: Where the Time Goes

When a real-time prediction takes 500 ms instead of 50 ms, the problem is rarely the model itself. Scikit-learn's predict_proba on 10 features takes microseconds. The latency comes from everything around the model.

Component                    Typical Time   What Helps
Network round-trip           5--50 ms       Deploy close to the caller; reduce payload size
JSON parsing and validation  1--5 ms        Pydantic is fast; this is rarely the bottleneck
Feature encoding             1--10 ms       Pre-compute expensive features; cache lookups
Model inference              0.1--10 ms     Use a smaller model; ONNX Runtime for complex models
SHAP computation             10--200 ms     Limit to top N features; cache explainer; use approximate SHAP
JSON serialization           1--5 ms        Keep response payload small
Total                        ~20--280 ms

SHAP is often the slowest component. For real-time endpoints where sub-50 ms latency is required, consider:

  1. Pre-computing SHAP values in a batch job and caching them
  2. Using approximate SHAP (shap.Explainer with algorithm="auto")
  3. Dropping SHAP from the real-time endpoint and serving explanations from a separate, async endpoint

To find out where the time goes in your own deployment, measure it with middleware:

import time
from fastapi import Request


@app.middleware("http")
async def add_latency_header(request: Request, call_next):
    """Add response time to every response for monitoring."""
    start = time.time()
    response = await call_next(request)
    elapsed_ms = (time.time() - start) * 1000
    response.headers["X-Response-Time-Ms"] = f"{elapsed_ms:.2f}"
    return response

This middleware adds an X-Response-Time-Ms header to every response. Use it to track latency trends. When the number starts climbing, you know where to look.


Docker: Packaging for Portability

You have a working API on your laptop. Now you need it to work on every other machine. Docker solves this by packaging your application, its dependencies, and its runtime environment into a single, portable container.

Docker Vocabulary

Term            Definition
Image           A read-only template containing your application code, dependencies,
                and OS libraries. Think of it as a snapshot of a configured machine.
Container       A running instance of an image. You can run multiple containers
                from the same image.
Dockerfile      A text file with instructions for building an image. Each
                instruction creates a layer.
Registry        A place to store and share images. Docker Hub is public. AWS ECR,
                GCP Artifact Registry, and Azure ACR are private.
docker-compose  A tool for defining and running multi-container applications with
                a YAML file.

Project Structure for Deployment

Before writing the Dockerfile, organize the project:

streamflow-churn-api/
    app.py                  # FastAPI application
    schemas.py              # Pydantic models
    test_api.py             # API tests
    model/
        churn_model.joblib  # Trained model artifact
    requirements.txt        # Pinned dependencies
    Dockerfile              # Container build instructions
    docker-compose.yml      # Local orchestration
    .dockerignore           # Files to exclude from the build

requirements.txt

Pin every dependency. "Latest" is not a version.

fastapi==0.115.0
uvicorn[standard]==0.30.0
scikit-learn==1.5.0
joblib==1.4.2
numpy==1.26.4
pandas==2.2.2
shap==0.45.0
pydantic==2.8.0

Practical Tip --- Generate pinned requirements from your working environment: pip freeze > requirements.txt. Then review and remove packages you do not actually need. A bloated image is a slow image.

The Dockerfile

# --- Stage 1: Build ---
FROM python:3.11-slim AS builder

WORKDIR /app

# Install dependencies first (layer caching optimization)
COPY requirements.txt .
RUN pip install --no-cache-dir --prefix=/install -r requirements.txt

# --- Stage 2: Runtime ---
FROM python:3.11-slim

WORKDIR /app

# Copy installed packages from builder
COPY --from=builder /install /usr/local

# Copy application code
COPY app.py .
COPY schemas.py .
COPY model/ model/

# Create a non-root user (security best practice)
RUN useradd --create-home appuser
USER appuser

# Expose the port
EXPOSE 8000

# Health check
HEALTHCHECK --interval=30s --timeout=5s --retries=3 \
    CMD python -c "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"

# Start the server
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]

Let us walk through the key decisions:

Multi-stage build. The first stage (builder) installs Python packages. The second stage (runtime) copies only the installed packages and application code. This keeps the final image small by excluding pip caches, build tools, and compiler headers.

Layer caching. We copy requirements.txt and install dependencies before copying the application code. Docker caches layers. If you change app.py but not requirements.txt, Docker reuses the cached dependency layer instead of reinstalling everything. This turns a 5-minute rebuild into a 10-second rebuild.

Non-root user. Running as root inside a container is a security vulnerability. If an attacker exploits a bug in your application, they get root access to the container. Running as appuser limits the damage.

HEALTHCHECK. The health check tells Docker (and any orchestrator like Kubernetes or ECS) whether the container is alive and ready to serve traffic. If the health check fails three consecutive times, the orchestrator restarts the container.

Building and Running

# Build the image
docker build -t streamflow-churn-api:v1.0 .

# Run the container
docker run -d --name churn-api -p 8000:8000 streamflow-churn-api:v1.0

# Verify it works
curl http://localhost:8000/health

# View logs
docker logs churn-api

# Stop and remove
docker stop churn-api && docker rm churn-api

The -p 8000:8000 flag maps port 8000 on your host to port 8000 in the container. The -d flag runs the container in the background.

.dockerignore

Just as .gitignore keeps files out of git, .dockerignore keeps files out of the Docker build context:

__pycache__/
*.pyc
.git/
.env
.venv/
test_api.py
*.md
.pytest_cache/
mlruns/
notebooks/

Without a .dockerignore, Docker copies your entire directory into the build context --- including your git history, virtual environment, and test files. This slows down the build and bloats the image.


docker-compose: Multi-Container Development

For local development, docker-compose lets you define and run the API alongside other services (a database, a monitoring dashboard, a model registry) with a single command.

# docker-compose.yml
version: "3.8"

services:
  churn-api:
    build:
      context: .
      dockerfile: Dockerfile
    ports:
      - "8000:8000"
    environment:
      - MODEL_PATH=model/churn_model.joblib
      - LOG_LEVEL=info
    volumes:
      - ./model:/app/model  # Mount model directory for hot-swapping
    healthcheck:
      test: ["CMD", "python", "-c",
             "import urllib.request; urllib.request.urlopen('http://localhost:8000/health')"]
      interval: 30s
      timeout: 5s
      retries: 3
    restart: unless-stopped

# Start the service
docker-compose up -d

# View logs
docker-compose logs -f churn-api

# Rebuild after code changes
docker-compose up -d --build

# Stop everything
docker-compose down

The volumes mount lets you swap the model file without rebuilding the container. During development, this is convenient. In production, bake the model into the image so the container is self-contained and reproducible.


Model Versioning in Production

A deployed model is not a static artifact. You retrain on new data. You fix bugs in the preprocessing pipeline. You tune hyperparameters. You need to know which version of the model is currently serving predictions, and you need the ability to roll back if something goes wrong.

Versioning Strategy

Every model artifact should include:

  1. A semantic version (v2.3.1) embedded in the filename or metadata
  2. A link to the MLflow run that produced it (see Chapter 30)
  3. The training data version and date
  4. The git commit hash of the code used to train it

Read the version and model path from environment variables so they can be set at deploy time without rebuilding the image:

# At the top of app.py
import os

MODEL_VERSION = os.getenv("MODEL_VERSION", "v2.3.1")
MODEL_PATH = os.getenv("MODEL_PATH", "model/churn_model.joblib")

model = joblib.load(MODEL_PATH)
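One simple way to keep the metadata attached to the artifact itself is to save a bundle rather than a bare model. A sketch (the field names and values are ours, and a plain dict stands in for the fitted estimator):

```python
import joblib

# Save the model together with its provenance (hypothetical values;
# the inner dict stands in for a fitted scikit-learn estimator)
bundle = {
    "model": {"weights": [0.1, 0.2]},
    "version": "v2.3.1",
    "training_data_date": "2024-06-01",
    "git_commit": "a1b2c3d",
}
joblib.dump(bundle, "churn_bundle.joblib")

# At serving time, the model and its version load in one step
loaded = joblib.load("churn_bundle.joblib")
MODEL_VERSION = loaded["version"]
assert MODEL_VERSION == "v2.3.1"
```

With a bundle, the version served in API responses can never drift out of sync with the artifact on disk, because they travel together.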

Tag your Docker images with the model version:

docker build -t streamflow-churn-api:v2.3.1 .
docker build -t streamflow-churn-api:latest .

Warning --- Never deploy the latest tag in production. "Latest" is not a version. It is a prayer. Use explicit version tags so you know exactly what is running and can roll back to a specific version.

Canary Deployment

A canary deployment routes a small percentage of traffic (e.g., 5%) to the new model version while the old version handles the rest. If the new model's metrics (accuracy, latency, error rate) look good after a few hours, you gradually increase the traffic split until 100% goes to the new version.

                    +-----------+
                    |  Load     |
       Traffic ---> |  Balancer |
                    +-----+-----+
                          |
                    +-----+-----+
                    |           |
                    v           v
            +-------+--+  +---+--------+
            | Model    |  | Model      |
            | v2.3.1   |  | v2.4.0     |
            | (95%)    |  | (5%)       |
            +----------+  +------------+

This is safer than a hard cutover. If the new model has a bug --- it crashes on a specific input pattern, its predictions are wildly different, its latency is 10x higher --- only 5% of users are affected.
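Traffic splitting normally lives in the load balancer, but the routing logic is simple enough to sketch. Hashing the customer ID (rather than choosing randomly per request) pins each customer to one version for the whole rollout; the function name is ours:

```python
import hashlib


def choose_model_version(customer_id: str, canary_pct: int = 5) -> str:
    """Deterministically route canary_pct% of customers to the new version."""
    digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100
    return "v2.4.0" if bucket < canary_pct else "v2.3.1"


# The split is stable: the same customer always sees the same version
assert choose_model_version("cust-42") == choose_model_version("cust-42")

# And roughly canary_pct% of a large population lands on the canary
share = sum(
    choose_model_version(f"cust-{i}") == "v2.4.0" for i in range(10_000)
) / 10_000
print(f"canary share: {share:.1%}")
```

Deterministic routing also makes debugging tractable: if a customer reports a bad prediction, you know exactly which model version they saw.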

Blue-Green Deployment

A blue-green deployment runs two identical environments: "blue" (the current production version) and "green" (the new version). You switch all traffic from blue to green at once. If something goes wrong, you switch back.

    Before switch:               After switch:
    Traffic --> Blue (v2.3.1)    Traffic --> Green (v2.4.0)
                Green (v2.4.0)              Blue (v2.3.1)
                (idle)                      (idle/standby)

Blue-green is simpler than canary but riskier: all traffic switches at once. Use it when you have strong confidence in the new version (comprehensive test suite, staging validation) and need a clean cutover.


Cloud Deployment: AWS and GCP Walkthrough

Your containerized API is ready. Now it needs to run somewhere other than your laptop.

AWS: Elastic Container Service (ECS) with Fargate

AWS ECS runs Docker containers without managing servers. Fargate is the "serverless" mode --- you specify CPU and memory, and AWS handles the rest.

Step 1: Push your image to ECR (Elastic Container Registry).

# Authenticate Docker with ECR
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 123456789.dkr.ecr.us-east-1.amazonaws.com

# Create a repository
aws ecr create-repository --repository-name streamflow-churn-api

# Tag and push the image
docker tag streamflow-churn-api:v2.3.1 \
  123456789.dkr.ecr.us-east-1.amazonaws.com/streamflow-churn-api:v2.3.1

docker push 123456789.dkr.ecr.us-east-1.amazonaws.com/streamflow-churn-api:v2.3.1

Step 2: Create an ECS task definition.

The task definition tells ECS how to run your container: which image to use, how much CPU and memory, which ports to expose, and where to send logs.

{
  "family": "streamflow-churn-api",
  "networkMode": "awsvpc",
  "requiresCompatibilities": ["FARGATE"],
  "cpu": "512",
  "memory": "1024",
  "executionRoleArn": "arn:aws:iam::123456789:role/ecsTaskExecutionRole",
  "containerDefinitions": [
    {
      "name": "churn-api",
      "image": "123456789.dkr.ecr.us-east-1.amazonaws.com/streamflow-churn-api:v2.3.1",
      "portMappings": [
        {"containerPort": 8000, "protocol": "tcp"}
      ],
      "healthCheck": {
        "command": ["CMD-SHELL",
                    "python -c \"import urllib.request; urllib.request.urlopen('http://localhost:8000/health')\""],
        "interval": 30,
        "timeout": 5,
        "retries": 3
      },
      "logConfiguration": {
        "logDriver": "awslogs",
        "options": {
          "awslogs-group": "/ecs/streamflow-churn-api",
          "awslogs-region": "us-east-1",
          "awslogs-stream-prefix": "ecs"
        }
      }
    }
  ]
}

Step 3: Create an ECS service with an Application Load Balancer (ALB).

The service maintains the desired number of running tasks and registers them with the ALB. If a container fails its health check, ECS replaces it automatically.

# Create the service (simplified; typically done via CloudFormation or Terraform)
aws ecs create-service \
  --cluster streamflow-prod \
  --service-name churn-api \
  --task-definition streamflow-churn-api:1 \
  --desired-count 2 \
  --launch-type FARGATE \
  --network-configuration "awsvpcConfiguration={subnets=[subnet-abc123],securityGroups=[sg-abc123],assignPublicIp=ENABLED}"

With --desired-count 2, ECS runs two instances of your API behind the load balancer. If one crashes, the other continues serving while ECS launches a replacement.

GCP: Cloud Run

Google Cloud Run is the simplest path from a Docker image to a running service. It scales to zero (you pay nothing when there is no traffic) and scales up automatically.

Step 1: Push to Artifact Registry.

# Configure Docker for GCP
gcloud auth configure-docker us-central1-docker.pkg.dev

# Tag and push
docker tag streamflow-churn-api:v2.3.1 \
  us-central1-docker.pkg.dev/my-project/streamflow/churn-api:v2.3.1

docker push us-central1-docker.pkg.dev/my-project/streamflow/churn-api:v2.3.1

Step 2: Deploy to Cloud Run.

gcloud run deploy churn-api \
  --image us-central1-docker.pkg.dev/my-project/streamflow/churn-api:v2.3.1 \
  --platform managed \
  --region us-central1 \
  --port 8000 \
  --memory 1Gi \
  --cpu 1 \
  --min-instances 1 \
  --max-instances 10 \
  --allow-unauthenticated

Cloud Run gives you a URL like https://churn-api-abc123-uc.a.run.app. That is your production endpoint. HTTPS, autoscaling, and health checks are built in.

The --min-instances 1 flag keeps one instance warm at all times, eliminating cold start latency. Without it, Cloud Run scales to zero and the first request after a period of inactivity takes 5--15 seconds to cold-start the container.
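Once the service is up, smoke-test it from your local machine. A sketch, assuming the URL from the previous step; the payload field names are placeholders, since your Pydantic schema defines the real ones:

```shell
# Substitute the URL Cloud Run printed for your service.
URL="https://churn-api-abc123-uc.a.run.app"

# Health check: print the HTTP status code (expect 200).
# `|| true` keeps a scripted smoke test going if the first
# request hits a cold start or the service is unreachable.
curl -s --max-time 10 -o /dev/null -w "%{http_code}\n" "$URL/health" || true

# A prediction request (field names are placeholders for your schema).
curl -s --max-time 10 -X POST "$URL/predict" \
  -H "Content-Type: application/json" \
  -d '{"tenure_months": 14, "monthly_spend": 29.99}' || true
```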

Theme: Real World =/= Kaggle --- On Kaggle, model.predict_proba(X_test) is the last line of code. In the real world, it is the first line of a different codebase --- one that handles HTTP routing, input validation, error handling, containerization, load balancing, health checks, auto-scaling, and versioned rollbacks. The prediction is the easy part. The deployment is the engineering.


Bringing It Together: The Complete Deployment Pipeline

Here is the end-to-end workflow from trained model to production endpoint:

1. Train model (Chapters 14 and 18)
   |
2. Track experiment in MLflow (Chapter 30)
   |
3. Register model in MLflow Model Registry
   |
4. Export model artifact (joblib/pickle)
   |
5. Build FastAPI app with Pydantic schemas (this chapter)
   |
6. Write tests, run pytest
   |
7. Write Dockerfile, build image
   |
8. Test container locally: docker run + curl
   |
9. Push image to registry (ECR / Artifact Registry)
   |
10. Deploy to cloud (ECS / Cloud Run)
    |
11. Monitor (Chapter 32)

Steps 5 through 10 are this chapter. Steps 1 through 4 are what you have already done. Step 11 is next.


Progressive Project M10: Deploy the StreamFlow Churn Model

This milestone brings your progressive project model from a notebook into a deployable API.

Tasks

M10a: Build the FastAPI endpoint.

  1. Create app.py with a /predict endpoint that accepts your progressive project features
  2. Define ChurnPredictionRequest and ChurnPredictionResponse Pydantic schemas
  3. Load your best model from Chapter 18 (or the MLflow Model Registry from Chapter 30)
  4. Return churn probability, risk tier, and top 3 SHAP reasons
  5. Add a /health endpoint

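As a starting point for M10a, here is a minimal sketch of the two schemas, assuming pydantic is installed. The request fields are illustrative placeholders; substitute your own progressive-project features:

```python
from typing import List

from pydantic import BaseModel, Field

class ChurnPredictionRequest(BaseModel):
    # Illustrative features -- use your progressive-project columns.
    tenure_months: int = Field(..., ge=0)
    monthly_spend: float = Field(..., ge=0)
    support_tickets_90d: int = Field(..., ge=0)

class ChurnPredictionResponse(BaseModel):
    churn_probability: float = Field(..., ge=0, le=1)
    risk_tier: str                  # e.g. "low" / "medium" / "high"
    top_reasons: List[str]          # top 3 SHAP feature names

# Pydantic validates on construction; bad input raises a
# ValidationError, which FastAPI converts to a 422 response.
req = ChurnPredictionRequest(tenure_months=14, monthly_spend=29.99,
                             support_tickets_90d=2)
print(req.tenure_months)
```

The Field constraints (ge=0, le=1) give you free input validation: a negative tenure or an out-of-range probability is rejected before any model code runs.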
M10b: Write tests.

  1. Test the health check endpoint
  2. Test a valid prediction request
  3. Test that missing fields return 422
  4. Test that invalid types return 422
  5. Run all tests with pytest

M10c: Containerize with Docker.

  1. Write a Dockerfile with multi-stage build
  2. Write requirements.txt with pinned versions
  3. Write a .dockerignore
  4. Build the image: docker build -t streamflow-churn-api:v1.0 .
  5. Run the container: docker run -p 8000:8000 streamflow-churn-api:v1.0
  6. Test with curl or the Swagger UI at http://localhost:8000/docs

M10d: (Optional) Deploy to a cloud platform.

If you have an AWS or GCP account, push your image to a registry and deploy to ECS Fargate or Cloud Run. Verify the endpoint responds to requests from your local machine. Record the public URL.

Deliverables

  • app.py, schemas.py, test_api.py --- the FastAPI application and tests
  • Dockerfile, requirements.txt, .dockerignore --- the container configuration
  • A screenshot or terminal output showing a successful prediction from the running container
  • (Optional) The public URL of your deployed endpoint

Summary

A model is not deployed until someone else can use it without you being in the room. FastAPI provides the REST interface: type-safe request/response schemas with Pydantic, automatic API documentation, and the performance to serve real-time predictions. Docker provides the portability: a container that runs identically on your laptop, your colleague's laptop, a CI server, and a production cloud instance. Cloud platforms --- ECS Fargate on AWS, Cloud Run on GCP --- provide the infrastructure: auto-scaling, health checks, load balancing, and managed HTTPS.

The choices you make at deployment time --- real-time vs. batch, canary vs. blue-green, SHAP at inference time vs. pre-computed --- are engineering decisions that depend on the business requirements, not the model architecture. A model with an AUC of 0.88 that is deployed, monitored, and serving predictions is infinitely more valuable than a model with an AUC of 0.92 that lives in a notebook on someone's laptop.

Deploy early. Deploy often. And never deploy an image tagged latest.

Theme: Reproducibility --- Every deployed container should be traceable back to a specific Docker image tag, which maps to a specific model version, which maps to a specific MLflow run, which maps to specific hyperparameters and a specific data version. If any link in that chain is broken, you cannot reproduce the deployment. The chain is the point.


Next chapter: Chapter 32: Monitoring in Production --- detecting data drift, performance degradation, and the moment your deployed model starts silently failing.