Quiz: Chapter 31
Model Deployment
Instructions: Answer all questions. Multiple-choice questions have one correct answer unless otherwise stated. Short-answer questions should be answered in 2-4 sentences.
Question 1 (Multiple Choice)
What is the primary purpose of defining Pydantic schemas for a FastAPI prediction endpoint?
- A) To improve model accuracy by constraining input ranges
- B) To validate incoming requests and reject malformed input before it reaches the model
- C) To compress the request payload for faster network transfer
- D) To encrypt sensitive customer data during transit
Answer: B) To validate incoming requests and reject malformed input before it reaches the model. Pydantic schemas act as a contract between the client and the server. They enforce type correctness (e.g., tenure must be an integer), range constraints (e.g., tenure >= 0), and completeness (all required fields must be present). If validation fails, FastAPI returns a 422 error with a detailed message, and the model code never executes.
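A minimal sketch of such a contract, using Pydantic directly (the field names and bounds here are illustrative, not the chapter's exact schema):

```python
# Illustrative request schema; "tenure" and "monthly_charges" are
# example fields, and the constraints are chosen for demonstration.
from pydantic import BaseModel, Field, ValidationError

class ChurnRequest(BaseModel):
    tenure: int = Field(..., ge=0)             # months as a customer, must be >= 0
    monthly_charges: float = Field(..., gt=0)  # must be strictly positive

# Valid input parses cleanly.
ok = ChurnRequest(tenure=12, monthly_charges=79.5)

# Invalid input raises ValidationError before any model code could run;
# FastAPI translates this exception into the 422 response automatically.
try:
    ChurnRequest(tenure=-5, monthly_charges=79.5)
except ValidationError as e:
    print(len(e.errors()), "validation error(s)")
```
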
Question 2 (Multiple Choice)
In a Dockerfile for an ML prediction API, why is COPY requirements.txt . placed before COPY app.py .?
- A) Python requires requirements.txt to be present before any .py files
- B) Docker builds layers from top to bottom, and placing rarely-changed files first enables layer caching
- C) The requirements.txt file must be smaller than app.py for Docker to work correctly
- D) FastAPI needs the dependencies listed before the application code is loaded
Answer: B) Docker builds layers from top to bottom, and placing rarely-changed files first enables layer caching. Docker caches each layer (each instruction in the Dockerfile). If requirements.txt has not changed since the last build, Docker reuses the cached layer that installed the dependencies. The COPY app.py . layer, which changes more frequently, is rebuilt without re-installing all dependencies. This optimization turns a multi-minute rebuild into a seconds-long rebuild.
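The ordering described above can be sketched as a minimal Dockerfile (the base image, port, and uvicorn command are illustrative assumptions):

```dockerfile
FROM python:3.11-slim
WORKDIR /app

# Rarely changes: this layer and the pip install below are reused from
# cache on every rebuild where requirements.txt is unchanged.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Changes often: only this layer and the ones after it are rebuilt.
COPY app.py .

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```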
Question 3 (Short Answer)
Explain the difference between a Docker image and a Docker container.
Answer: A Docker image is a read-only template containing the application code, installed dependencies, and OS libraries --- it is a static snapshot. A Docker container is a running instance of an image --- it is a process with its own filesystem, network, and memory. You can run multiple containers from the same image, just as you can run multiple processes from the same executable. Stopping a container does not delete the image; rebuilding an image does not affect running containers.
Question 4 (Multiple Choice)
Which deployment pattern routes a small percentage of production traffic to a new model version while the majority continues to be served by the old version?
- A) Blue-green deployment
- B) A/B testing
- C) Canary deployment
- D) Rolling deployment
Answer: C) Canary deployment. In a canary deployment, a small percentage of traffic (often 5--10%) is routed to the new model version while the remaining traffic continues to hit the old version. If the new version's metrics (accuracy, latency, error rate) look acceptable after a monitoring period, the traffic percentage is gradually increased until the new version handles 100% of traffic. If metrics degrade, the canary is rolled back with minimal user impact.
Question 5 (Short Answer)
A colleague says: "I tagged my Docker image as latest and deployed it. I will just update the image and redeploy when I retrain the model." Explain why this is a bad practice and what they should do instead.
Answer: The latest tag is mutable --- it points to whichever image was most recently tagged, not to a specific version. If a problem occurs in production, you cannot determine which model version is running, and you cannot roll back to a known-good state because latest has already been overwritten. Instead, use explicit version tags (e.g., churn-api:v2.3.1) that correspond to the model version and MLflow run ID. This ensures every deployed image is traceable and any previous version can be redeployed instantly.
Question 6 (Multiple Choice)
When should you prefer batch prediction over a real-time API endpoint?
- A) When the model is too complex to run in less than 1 second
- B) When predictions are needed for a large set of records on a schedule and no user is waiting for the result
- C) When the model requires GPU acceleration
- D) When the prediction endpoint must be highly available
Answer: B) When predictions are needed for a large set of records on a schedule and no user is waiting for the result. Batch prediction is appropriate when you need to score an entire customer base (e.g., nightly churn scores for a marketing campaign) and latency is not a concern. Real-time APIs are for user-facing features where someone is actively waiting for the result. The choice depends on the business requirement, not the model complexity.
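The batch pattern can be sketched in a few lines; `score_churn` below is a hypothetical stand-in for a real model's `predict_proba` call, and the field names are invented for illustration:

```python
# Batch pattern: score every record in one pass on a schedule and write
# the results to storage -- no caller is waiting on per-request latency.

def score_churn(record):
    # Hypothetical stand-in for model.predict_proba(features); a real
    # job would load the trained model from an artifact store instead.
    return min(1.0, 0.02 * int(record["support_tickets"]))

def run_nightly_batch(customers):
    """Score all customers at once, returning rows to bulk-insert."""
    return [
        {"customer_id": c["customer_id"], "churn_score": score_churn(c)}
        for c in customers
    ]

customers = [
    {"customer_id": "a1", "support_tickets": "3"},
    {"customer_id": "b2", "support_tickets": "0"},
]
scores = run_nightly_batch(customers)
```
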
Question 7 (Multiple Choice)
What does the HEALTHCHECK instruction in a Dockerfile accomplish?
- A) It verifies that the Docker image was built without errors
- B) It runs a periodic command inside the container to verify the application is responsive, and marks the container as unhealthy if the check fails
- C) It monitors the host machine's CPU and memory usage
- D) It validates that all Python dependencies are correctly installed
Answer: B) It runs a periodic command inside the container to verify the application is responsive, and marks the container as unhealthy if the check fails. Container orchestrators (Docker, Kubernetes, ECS) use the health check status to decide whether to route traffic to the container and whether to restart it. A typical health check for an ML API calls the /health endpoint and expects a 200 response. If the check fails a configurable number of times, the container is marked unhealthy and replaced.
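A typical check of this kind might look as follows; the interval, retry count, and port are example values, and this assumes curl is available in the image:

```dockerfile
# Poll the API's /health route; after 3 consecutive failures the
# container is marked unhealthy and the orchestrator can replace it.
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD curl --fail http://localhost:8000/health || exit 1
```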
Question 8 (Short Answer)
Why is it important to run the application as a non-root user inside a Docker container? What line in the Dockerfile accomplishes this?
Answer: Running as root inside a container means that if an attacker exploits a vulnerability in the application, they gain root access to the container's filesystem and potentially the host system (depending on the container runtime configuration). Running as a non-root user (RUN useradd --create-home appuser followed by USER appuser) limits the damage: the attacker can only access files owned by that unprivileged user. This is a standard security best practice for all containerized applications, not just ML services.
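In context, the two lines from the answer sit near the end of the Dockerfile, after dependencies are installed as root:

```dockerfile
# Create an unprivileged user and switch to it; everything after the
# USER line, including the container's main process, runs as appuser.
RUN useradd --create-home appuser
USER appuser
```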
Question 9 (Multiple Choice)
In the following FastAPI code, what happens if a client sends a POST to /predict with {"tenure": -5} when the schema defines tenure: int = Field(..., ge=0)?
- A) The model runs with tenure = -5 and returns a prediction
- B) FastAPI returns a 422 Unprocessable Entity error with details about the constraint violation
- C) FastAPI returns a 400 Bad Request error
- D) The server crashes with an unhandled exception
Answer: B) FastAPI returns a 422 Unprocessable Entity error with details about the constraint violation. Pydantic validates the request body against the schema before the endpoint function executes. The ge=0 constraint (greater than or equal to zero) rejects -5, and FastAPI returns a structured error response explaining that the value is below the minimum. The model code never runs.
Question 10 (Multiple Choice)
What is the primary advantage of a multi-stage Docker build for an ML prediction API?
- A) It allows you to use multiple programming languages in the same container
- B) It reduces the final image size by excluding build-time dependencies and caches from the runtime image
- C) It automatically optimizes the model for faster inference
- D) It enables running multiple API instances in the same container
Answer: B) It reduces the final image size by excluding build-time dependencies and caches from the runtime image. The first stage installs Python packages (which may require compilers, header files, and pip caches). The second stage copies only the installed packages and application code, leaving behind everything that was needed only for building. A smaller image means faster pulls from the registry, faster deployments, and a smaller attack surface.
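One common shape for such a build is sketched below; the `--prefix` install-and-copy pattern and the image names are illustrative assumptions, not the chapter's exact Dockerfile:

```dockerfile
# Stage 1: build environment; pip caches and any compilers live here.
FROM python:3.11-slim AS builder
COPY requirements.txt .
RUN pip install --prefix=/install --no-cache-dir -r requirements.txt

# Stage 2: runtime image copies only the installed packages; build
# tools, header files, and caches from the builder stage are discarded.
FROM python:3.11-slim
COPY --from=builder /install /usr/local
WORKDIR /app
COPY app.py .
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
```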
Question 11 (Short Answer)
Explain why SHAP computation is often the latency bottleneck in a real-time prediction API, and describe two strategies to mitigate this.
Answer: SHAP computation requires evaluating the model on many feature permutations to calculate each feature's contribution, which is orders of magnitude slower than a single predict_proba call. For tree-based models, TreeExplainer is optimized but still takes 10--200 ms per prediction. Two mitigation strategies: (1) pre-compute SHAP values in a nightly batch job and cache them, serving pre-computed explanations from a lookup table instead of computing them at request time; (2) serve explanations from a separate asynchronous endpoint so the main prediction response is not blocked by SHAP computation.
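Strategy (1) can be sketched with a plain dict standing in for the lookup table (in production this might be Redis or a database); the customer IDs and SHAP values below are invented for illustration:

```python
# Serve explanations from a precomputed cache. A nightly batch job
# would fill this table; the request path never pays the SHAP cost.
PRECOMPUTED_SHAP = {
    "cust_001": {"tenure": -0.12, "monthly_charges": 0.31},
}

def get_explanation(customer_id):
    """Return cached SHAP values instantly, or None on a cache miss.

    A miss can be deferred to the separate asynchronous explanation
    endpoint (strategy 2) rather than blocking the prediction response.
    """
    return PRECOMPUTED_SHAP.get(customer_id)
```
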
Question 12 (Multiple Choice)
You have deployed a churn prediction API on AWS ECS with desired-count 2. One container starts returning 500 errors due to a corrupted model file. What happens?
- A) Both containers stop serving traffic until the issue is manually resolved
- B) The load balancer stops routing traffic to the unhealthy container, and ECS launches a replacement; the healthy container continues serving
- C) ECS automatically retrains the model and redeploys
- D) The unhealthy container continues receiving traffic until you manually remove it
Answer: B) The load balancer stops routing traffic to the unhealthy container, and ECS launches a replacement; the healthy container continues serving. The health check (which hits the /health endpoint) will fail for the unhealthy container. The ALB removes it from the target group, and ECS launches a new task to maintain the desired count of 2. This is the value of running multiple instances behind a load balancer: single-container failures do not cause outages.
Question 13 (Short Answer)
A data scientist proposes deploying the model by sharing a Jupyter notebook that a colleague can run manually whenever a prediction is needed. Give three specific reasons why this approach fails in a production setting.
Answer: (1) It requires a human in the loop --- no other system can call the model programmatically, which rules out integration with downstream applications and automation. (2) The notebook environment is not reproducible: cell execution order, hidden state from previous runs, and untracked dependency versions mean results may differ between machines or sessions. (3) There is no input validation, error handling, health monitoring, or scaling --- a notebook cannot serve concurrent requests, handle malformed input gracefully, or recover from crashes without manual intervention.
Question 14 (Multiple Choice)
Which of the following is NOT a valid reason to prefer Google Cloud Run over AWS ECS Fargate for deploying an ML prediction API?
- A) Cloud Run can scale to zero instances, so you pay nothing when there is no traffic
- B) Cloud Run requires fewer configuration steps for a basic deployment
- C) Cloud Run automatically optimizes your model's hyperparameters
- D) Cloud Run provides a managed HTTPS endpoint without additional setup
Answer: C) Cloud Run automatically optimizes your model's hyperparameters. Cloud Run is a container hosting platform; it has no knowledge of ML models or hyperparameters. Its advantages over ECS Fargate for simple deployments include scale-to-zero pricing, simpler configuration (a single gcloud run deploy command vs. task definitions, services, and load balancers), and automatic HTTPS. For complex deployments with multiple services and fine-grained networking, ECS Fargate offers more control.
Question 15 (Short Answer)
Your production churn prediction API handles both real-time requests (from the customer portal) and batch scoring requests (from the nightly marketing pipeline). The batch job sends 50,000 records in a single request, which takes 90 seconds and blocks the API for real-time requests. How would you solve this?
Answer: Separate the two workloads. Run the real-time API on its own service with low latency requirements, and run the batch scoring as a separate job (an Airflow DAG, a scheduled Lambda, or a dedicated batch container) that reads from and writes to a database or data warehouse. If the batch job must use the same model, it can load the model directly from the artifact store (MLflow, S3) and run predict_proba on the full DataFrame without going through the API at all. Never route batch workloads through a latency-sensitive real-time endpoint.
This quiz covers Chapter 31: Model Deployment. Return to the chapter for full context.