Chapter 29: Exercises — DevOps and Deployment
Tier 1: Recall and Understanding (Exercises 1–6)
Exercise 1: DevOps Vocabulary Match
Match each DevOps term with its correct definition:
| Term | Definition |
|---|---|
| 1. CI | A. Defining infrastructure through version-controlled config files |
| 2. CD | B. A lightweight, isolated runtime environment sharing the host OS kernel |
| 3. IaC | C. Automatically building and testing code on every commit |
| 4. Container | D. A deployment strategy that routes a small percentage of traffic to a new version |
| 5. Canary | E. Keeping the codebase in a deployable state at all times |
| 6. Blue-green | F. Maintaining two identical environments and switching traffic between them |
Answer key: 1-C, 2-E, 3-A, 4-B, 5-D, 6-F
Exercise 2: Dockerfile Instruction Ordering
Given the following Dockerfile instructions in random order, arrange them in the correct sequence for optimal layer caching:
```dockerfile
COPY . .
RUN pip install --no-cache-dir -r requirements.txt
EXPOSE 8000
FROM python:3.12-slim
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
WORKDIR /app
COPY requirements.txt .
```
Write the correctly ordered Dockerfile.
Exercise 3: Environment Variable Categories
Classify each of the following configuration values as one of: (a) can be committed to Git, (b) should be an environment variable, or (c) must be stored in a secrets manager.
```
APP_NAME=MyVibeApp
DATABASE_URL=postgresql://admin:s3cret@prod-db:5432/myapp
LOG_LEVEL=INFO
AWS_SECRET_ACCESS_KEY=AKIAIOSFODNN7EXAMPLE
ALLOWED_ORIGINS=https://myapp.com
STRIPE_SECRET_KEY=sk_live_abc123
MAX_UPLOAD_SIZE=10485760
JWT_SECRET=a-very-long-random-string
```
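As an illustration of how the categories can surface in application code, here is a minimal sketch using only `os.environ`. The function name `load_config` and the `MissingConfigError` class are hypothetical; the idea is that committable values may carry a default in source, while a sensitive value has no default at all, so a missing one stops the app at startup instead of silently falling back. Reading directly from a secrets manager SDK is not modeled here.

```python
import os
from collections.abc import Mapping


class MissingConfigError(RuntimeError):
    """Raised when a required value is absent at startup (hypothetical)."""


def load_config(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Sketch: safe defaults live in code; sensitive values must come from outside."""
    config = {
        # Safe to commit: a default in source that the environment may override.
        "app_name": env.get("APP_NAME", "MyVibeApp"),
    }
    # Sensitive: deliberately no default, so it can never leak via Git.
    for key in ("JWT_SECRET",):
        value = env.get(key)
        if not value:
            raise MissingConfigError(f"{key} is not set")
        config[key.lower()] = value
    return config
```

Passing the mapping in explicitly (instead of always reading `os.environ`) keeps the function easy to test.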
Exercise 4: Log Level Selection
For each scenario, select the appropriate log level (DEBUG, INFO, WARNING, ERROR, CRITICAL):
- A user successfully logged in
- An API endpoint received a request with a deprecated parameter
- The database connection pool is exhausted and new requests are failing
- A background job processed 1,000 records in 45 seconds
- The application ran out of memory and the main process is shutting down
- A variable holds the value user_id=12345 and you need to trace a bug
- A third-party API returned a 429 (rate limited) status; the request will be retried
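Why the choice matters: Python's standard `logging` module orders the levels numerically (DEBUG=10, INFO=20, WARNING=30, ERROR=40, CRITICAL=50), and a logger discards any record below its level. The sketch below captures records in memory to show that a logger set to WARNING, a common production threshold, silently drops DEBUG and INFO messages. The `ListHandler` class and the messages are illustrative only.

```python
import logging


class ListHandler(logging.Handler):
    """Collects records in memory so they can be inspected."""

    def __init__(self) -> None:
        super().__init__()
        self.records: list[logging.LogRecord] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record)


logger = logging.getLogger("exercise4")
logger.propagate = False          # keep the demo out of the root logger
handler = ListHandler()
logger.addHandler(handler)
logger.setLevel(logging.WARNING)  # a typical production threshold

logger.debug("trace detail")       # dropped: 10 < 30
logger.info("routine event")       # dropped: 20 < 30
logger.warning("degraded but ok")  # kept
logger.error("request failed")     # kept

emitted = [r.levelname for r in handler.records]
print(emitted)  # ['WARNING', 'ERROR']
```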
Exercise 5: CI/CD Pipeline Stages
List the six standard stages of a CI/CD pipeline in the correct order. For each stage, provide one specific tool or action that belongs to that stage.
Exercise 6: Docker Compose Service Dependencies
Given a web application that depends on PostgreSQL and Redis, explain what the following docker-compose.yml snippet does and why the condition: service_healthy option is important:
```yaml
services:
  web:
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_started
```
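With `condition: service_healthy`, Compose starts `web` only after the `db` service's healthcheck reports healthy; `service_started` only waits for the `cache` container to start, which says nothing about whether Redis is already accepting connections. Because a dependency can also restart or become briefly unreachable later, applications commonly retry on their side too. The sketch below shows such a retry loop with exponential backoff; `wait_for` and the `check` callable (for example, a function that tries to open a database connection) are hypothetical names.

```python
import time
from collections.abc import Callable


def wait_for(check: Callable[[], bool], attempts: int = 5,
             base_delay: float = 0.5,
             sleep: Callable[[float], None] = time.sleep) -> int:
    """Call `check` until it returns True, doubling the delay between tries.

    Returns the 1-based attempt number that succeeded; raises TimeoutError
    if every attempt fails. `sleep` is injectable so tests run instantly.
    """
    delay = base_delay
    for attempt in range(1, attempts + 1):
        if check():
            return attempt
        if attempt < attempts:
            sleep(delay)
            delay *= 2  # exponential backoff
    raise TimeoutError(f"dependency not ready after {attempts} attempts")
```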
Tier 2: Application (Exercises 7–12)
Exercise 7: Write a Dockerfile
Write a complete, production-ready Dockerfile for a Python Flask application with the following requirements:
- Uses Python 3.12
- Uses multi-stage build
- Installs dependencies from requirements.txt
- Runs as non-root user
- Exposes port 5000
- Uses Gunicorn as the production server with 4 workers
- Includes a health check
Exercise 8: GitHub Actions Workflow
Write a GitHub Actions workflow file (.github/workflows/ci.yml) for a Python project that:
1. Triggers on push to main and on pull requests to main
2. Sets up Python 3.12
3. Installs dependencies from requirements.txt
4. Runs `ruff check .` for linting
5. Runs pytest with coverage reporting
6. Runs only on the ubuntu-latest runner
Exercise 9: Environment Configuration Class
Using Pydantic Settings, write a Python Settings class that:
- Loads from environment variables and .env files
- Has fields for: database_url (required), redis_url (default redis://localhost:6379/0), secret_key (required), debug (default False), environment (default development), log_level (default INFO)
- Includes a validator that raises an error if debug is True when environment is production
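The exercise targets Pydantic Settings; to show just the cross-field rule without that dependency, here is a standard-library stand-in using a frozen dataclass. It is a sketch under stated assumptions: it does not read `.env` files, the environment-variable names are simply the upper-cased field names, and `DEBUG` is parsed with a small hand-written truthiness check rather than Pydantic's coercion.

```python
import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    database_url: str                          # required
    secret_key: str                            # required
    redis_url: str = "redis://localhost:6379/0"
    debug: bool = False
    environment: str = "development"
    log_level: str = "INFO"

    def __post_init__(self) -> None:
        # Cross-field rule from the exercise: never debug in production.
        if self.debug and self.environment == "production":
            raise ValueError("debug must be False when environment is production")

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Settings":
        """Build settings from environment variables (names upper-cased)."""
        return cls(
            database_url=env["DATABASE_URL"],  # KeyError if missing
            secret_key=env["SECRET_KEY"],
            redis_url=env.get("REDIS_URL", cls.redis_url),
            debug=env.get("DEBUG", "false").lower() in {"1", "true", "yes"},
            environment=env.get("ENVIRONMENT", cls.environment),
            log_level=env.get("LOG_LEVEL", cls.log_level),
        )
```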
Exercise 10: Health Check Endpoint
Write a FastAPI health check endpoint that:
1. Has a basic /health route that returns {"status": "healthy"}
2. Has a /health/ready route that checks:
- Database connectivity (simulate with a function call)
- Redis connectivity (simulate with a function call)
- Disk space availability (simulate with a function call)
3. Returns HTTP 200 if all checks pass, HTTP 503 if any fail
4. Includes the timestamp and duration of each check in the response
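The core of a readiness endpoint is independent of the web framework: run each check, time it, treat an exception as a failure, and map the overall result to 200 or 503. The sketch below isolates that logic; `run_checks` and the response field names are assumptions. In FastAPI, a route could return the result as `JSONResponse(status_code=status, content=body)`.

```python
import time
from collections.abc import Callable
from datetime import datetime, timezone


def run_checks(checks: dict[str, Callable[[], bool]]) -> tuple[int, dict]:
    """Run named readiness checks and return (http_status, response_body)."""
    results = {}
    for name, check in checks.items():
        started = time.perf_counter()
        error = None
        try:
            ok = bool(check())
        except Exception as exc:  # a crashing check counts as a failure
            ok, error = False, str(exc)
        result = {
            "ok": ok,
            "duration_ms": round((time.perf_counter() - started) * 1000, 2),
            "checked_at": datetime.now(timezone.utc).isoformat(),
        }
        if error is not None:
            result["error"] = error
        results[name] = result
    healthy = all(r["ok"] for r in results.values())
    body = {"status": "ready" if healthy else "not_ready", "checks": results}
    return (200 if healthy else 503), body
```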
Exercise 11: Docker Compose Multi-Service
Write a docker-compose.yml file for an application with:
- A Python web service built from a local Dockerfile
- PostgreSQL 16 with persistent storage and a health check
- Redis 7 with Alpine base
- An Nginx reverse proxy on ports 80 and 443
- All services on a shared custom network
- Environment variables loaded from a .env file
Exercise 12: Structured Logging Setup
Write a Python module that configures structured logging using the structlog library. The module should:
1. Output JSON in production, human-readable format in development
2. Include timestamp, log level, and logger name in every entry
3. Support correlation IDs via context variables
4. Provide a middleware function for FastAPI that logs every request with method, path, status code, and duration
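structlog is a third-party library, so as a dependency-free illustration of the same mechanics (one JSON object per line in production, readable text in development, and a correlation ID carried in a context variable), here is a sketch using only the standard library. This is not the structlog API; the field names and `configure_logging` are assumptions.

```python
import contextvars
import json
import logging
from datetime import datetime, timezone

# Current request's correlation ID. asyncio tasks copy the context when they
# are created, so concurrent requests do not overwrite each other's value.
correlation_id = contextvars.ContextVar("correlation_id", default=None)


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            "correlation_id": correlation_id.get(),
        }
        return json.dumps(entry)


def configure_logging(environment: str) -> None:
    """JSON output in production, human-readable lines elsewhere."""
    handler = logging.StreamHandler()
    if environment == "production":
        handler.setFormatter(JsonFormatter())
    else:
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)-8s %(name)s: %(message)s")
        )
    root = logging.getLogger()
    root.handlers[:] = [handler]  # replace any previously installed handlers
    root.setLevel(logging.INFO)
```

A request middleware would call `token = correlation_id.set(request_id)` when a request arrives and `correlation_id.reset(token)` when it finishes, so every log line emitted in between carries the same ID.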
Tier 3: Analysis (Exercises 13–18)
Exercise 13: Dockerfile Optimization
The following Dockerfile is functional but poorly optimized. Identify at least five problems and rewrite it with corrections:
```dockerfile
FROM python:3.12
WORKDIR /app
COPY . .
RUN pip install -r requirements.txt
RUN apt-get update && apt-get install -y curl vim nano htop
EXPOSE 8000
CMD python main.py
```
Exercise 14: CI/CD Pipeline Failure Analysis
A CI/CD pipeline has the following stages: lint, test, build, deploy. The pipeline has been failing intermittently with the following pattern:
- Lint: always passes
- Test: passes 95% of the time
- Build: passes 100% of the time when tests pass
- Deploy: fails about 10% of the time with "connection refused" errors
Analyze this pattern and:
1. Identify the most likely root cause for the test failures
2. Identify the most likely root cause for the deploy failures
3. Propose specific fixes for each issue
4. Calculate the overall pipeline success rate
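For part 4, assuming stage outcomes are independent and a run succeeds only if every stage passes, the overall success rate is the product of the per-stage pass rates. Build contributes 1.0 because it only runs after tests have passed. Retries are not modeled.

```python
from math import prod

# Per-stage pass probabilities from the exercise.
stage_pass_rates = {"lint": 1.00, "test": 0.95, "build": 1.00, "deploy": 0.90}

overall = prod(stage_pass_rates.values())
print(f"overall success rate: {overall:.1%}")  # overall success rate: 85.5%
```

In other words, roughly one run in seven fails even though no single stage looks alarming on its own.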
Exercise 15: Cloud Platform Comparison
You are deploying a new application with the following requirements:
- Expected traffic: 1,000 requests per minute (RPM) initially, growing to 50,000 RPM over 12 months
- Budget: $50/month initially, willing to scale to $500/month
- Tech stack: Python FastAPI backend, React frontend, PostgreSQL, Redis
- Team size: 2 developers, no dedicated ops
- Compliance: No special compliance requirements
Compare the following deployment options and recommend one with justification:
1. Heroku
2. Railway
3. AWS (ECS Fargate + RDS)
4. Google Cloud Run + Cloud SQL
5. Self-managed VPS (DigitalOcean/Linode)
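Before comparing platforms, it helps to translate the traffic targets into per-second rates and an implied growth curve. The sketch assumes smooth compound monthly growth, which real traffic rarely follows, and these are average rates; capacity should be sized for peaks.

```python
# Convert the exercise's traffic targets into per-second rates.
start_rpm, end_rpm, months = 1_000, 50_000, 12

start_rps = start_rpm / 60
end_rps = end_rpm / 60
# Compound monthly growth needed to go from start to end in `months` months.
monthly_growth = (end_rpm / start_rpm) ** (1 / months) - 1

print(f"{start_rps:.1f} RPS -> {end_rps:.1f} RPS")
print(f"about {monthly_growth:.1%} growth per month")
```

That is roughly 17 requests per second at launch and about 833 by month 12, or close to 38.5% compound growth every month.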
Exercise 16: Monitoring Dashboard Design
Design a monitoring dashboard for a production web application. Specify:
1. The four golden signals and how you would measure each
2. At least six specific metrics you would display
3. Three alert rules with thresholds and notification channels
4. A proposed dashboard layout (describe sections and their contents)
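To make the four golden signals (latency, traffic, errors, saturation) concrete, the sketch below summarizes one window of request samples. It is illustrative only: `Request`, `golden_signals`, and the nearest-rank percentile are assumptions, real systems usually compute these in a metrics backend, and saturation is passed in because it comes from host or container metrics rather than request logs.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    status: int
    duration_ms: float


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile (pct in 0-100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def golden_signals(window: list[Request], window_seconds: float,
                   cpu_utilization: float) -> dict[str, float]:
    """Summarize one window of request samples into the four signals."""
    durations = [r.duration_ms for r in window]
    errors = sum(1 for r in window if r.status >= 500)
    return {
        "latency_p50_ms": percentile(durations, 50),
        "latency_p99_ms": percentile(durations, 99),
        "traffic_rps": len(window) / window_seconds,
        "error_rate": errors / len(window),
        "saturation_cpu": cpu_utilization,
    }
```

Reporting p99 alongside p50 matters because an average hides the slow tail that users actually notice.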
Exercise 17: Rollback Scenario Analysis
Your team deployed version 2.5.0 of an e-commerce application at 2:00 PM. At 2:15 PM, you notice:
- Error rate increased from 0.1% to 5%
- Average response time increased from 200ms to 1,500ms
- The errors are all 500 Internal Server Error on the /api/checkout endpoint
- Database query logs show a new query that is taking 8 seconds due to a missing index
- Version 2.5.0 included a database migration that added a new column and modified the checkout query
Describe your step-by-step incident response, including:
1. Immediate actions
2. Rollback strategy (considering the database migration)
3. Communication plan
4. Post-incident tasks
Exercise 18: Security Audit of Deployment Configuration
Review the following deployment configuration and identify all security issues:
```yaml
# docker-compose.yml
services:
  web:
    image: myapp:latest
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql://admin:password123@db:5432/prod
      - SECRET_KEY=mysecretkey
      - DEBUG=true
      - AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
      - AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  db:
    image: postgres:latest
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_PASSWORD=password123
    volumes:
      - ./data:/var/lib/postgresql/data
```
Tier 4: Synthesis (Exercises 19–24)
Exercise 19: Complete CI/CD Pipeline
Design and write a complete CI/CD pipeline (GitHub Actions) for a full-stack application (Python backend + React frontend) that:
1. Runs backend linting, type checking, and tests in parallel with frontend linting and tests
2. Builds Docker images for both backend and frontend
3. Pushes images to GitHub Container Registry
4. Deploys to a staging environment on push to develop
5. Deploys to production on push to main with a manual approval gate
6. Sends a Slack notification on deployment success or failure
7. Includes proper caching for pip and npm dependencies
Exercise 20: Infrastructure as Code
Write a Terraform configuration that:
1. Provisions an AWS VPC with two public and two private subnets across two availability zones
2. Places an Application Load Balancer in the public subnets
3. Runs your application container as an ECS Fargate service in the private subnets
4. Provisions an RDS PostgreSQL instance in the private subnets
5. Defines security groups that restrict database access to only the ECS service
6. Uses variables for all configurable values
7. Outputs the load balancer DNS name
Exercise 21: Monitoring and Alerting System
Build a complete monitoring solution for a Python web application:
1. Write Prometheus metric instrumentation (counters, histograms, gauges)
2. Create a Prometheus configuration to scrape the application
3. Write three Prometheus alerting rules (high error rate, high latency, service down)
4. Write a Grafana dashboard definition in JSON (or describe it in detail) with at least four panels
5. Write a Python script that simulates load and generates metrics data
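Prometheus exposes a histogram as cumulative `_bucket` series labeled with `le` ("less than or equal") upper bounds, ending in a `+Inf` bucket whose value equals `_count`, plus a `_sum` series. In practice the `prometheus_client` library's `Histogram` handles this for you; the toy class below is a hand-rolled illustration of the bucket semantics only, not that library's API, and the bucket bounds are arbitrary.

```python
import bisect
import math


class CumulativeHistogram:
    """Toy model of Prometheus histogram semantics (not prometheus_client)."""

    def __init__(self, bounds: list[float]) -> None:
        self.bounds = sorted(bounds) + [math.inf]  # implicit +Inf bucket
        self.counts = [0] * len(self.bounds)       # per-bucket, non-cumulative
        self.total = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        # First bucket whose upper bound is >= value ("le" semantics).
        idx = bisect.bisect_left(self.bounds, value)
        self.counts[idx] += 1
        self.total += value
        self.count += 1

    def exposition(self, name: str) -> list[str]:
        """Render lines in the style of the Prometheus text format."""
        lines, running = [], 0
        for bound, n in zip(self.bounds, self.counts):
            running += n  # each bucket includes all smaller buckets
            le = "+Inf" if math.isinf(bound) else repr(bound)
            lines.append(f'{name}_bucket{{le="{le}"}} {running}')
        lines.append(f"{name}_sum {self.total}")
        lines.append(f"{name}_count {self.count}")
        return lines
```

Cumulative buckets are what let PromQL's `histogram_quantile()` estimate percentiles by interpolating across bucket boundaries.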
Exercise 22: Blue-Green Deployment Script
Write a Python script that automates a blue-green deployment:
1. Determines which environment (blue or green) is currently active
2. Deploys the new version to the inactive environment
3. Runs health checks against the new deployment
4. Switches the load balancer to point to the new environment
5. Keeps the old environment running for 30 minutes as a fallback
6. Provides a manual rollback command
7. Logs all actions with timestamps
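The heart of such a script is a small state machine: track which color is live, deploy to the other one, switch only if health checks pass, and remember the previous color for rollback. In the sketch below, `Router` is a hypothetical in-memory stand-in for a real load balancer, and `deploy_fn` and `healthy_fn` are injected placeholders for platform-specific calls. The 30-minute fallback window (a scheduled teardown of the old color) is not modeled.

```python
import logging
from collections.abc import Callable
from dataclasses import dataclass, field

log = logging.getLogger("bluegreen")


@dataclass
class Router:
    """Hypothetical stand-in for a load balancer's target switch."""
    active: str = "blue"
    history: list[str] = field(default_factory=list)

    @property
    def inactive(self) -> str:
        return "green" if self.active == "blue" else "blue"

    def switch_to(self, color: str) -> None:
        self.history.append(self.active)
        log.info("switching traffic %s -> %s", self.active, color)
        self.active = color


def deploy(router: Router, version: str,
           deploy_fn: Callable[[str, str], None],
           healthy_fn: Callable[[str], bool]) -> bool:
    """Deploy `version` to the idle color; switch only if it is healthy."""
    target = router.inactive
    log.info("deploying %s to %s", version, target)
    deploy_fn(target, version)
    if not healthy_fn(target):
        log.error("%s failed health checks; traffic stays on %s",
                  target, router.active)
        return False
    router.switch_to(target)  # the old color keeps running as the fallback
    return True


def rollback(router: Router) -> None:
    """Point traffic back at the previously active color."""
    if not router.history:
        raise RuntimeError("nothing to roll back to")
    previous = router.history.pop()
    log.info("rolling back %s -> %s", router.active, previous)
    router.active = previous
```

For timestamped output, configure logging with something like `logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")`.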
Exercise 23: Multi-Environment Configuration System
Design and implement a configuration management system that:
1. Supports development, staging, and production environments
2. Uses a hierarchy: defaults < environment-specific < environment variables < command-line arguments
3. Validates all configuration at startup
4. Redacts secrets in log output
5. Provides a CLI command to display current configuration (with secrets masked)
6. Supports hot-reloading of non-secret configuration values
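Points 2 and 4 can be sketched with two small functions: merge flat dictionaries left to right so later layers win, and mask values whose keys look secret before anything is logged. The helper names and `SECRET_MARKERS` are assumptions; keyword matching on key names is a heuristic that can produce false positives, and nested configuration is not handled. Skipping `None` values lets unset command-line options (which `argparse` defaults to `None`) fall through to lower layers.

```python
from collections.abc import Mapping
from typing import Any

SECRET_MARKERS = ("secret", "password", "token", "key")


def merge_layers(*layers: Mapping[str, Any]) -> dict[str, Any]:
    """Merge config layers left to right; later layers win on conflicts."""
    merged: dict[str, Any] = {}
    for layer in layers:
        # None means "not provided", so it never overrides a lower layer.
        merged.update({k: v for k, v in layer.items() if v is not None})
    return merged


def redact(config: Mapping[str, Any]) -> dict[str, Any]:
    """Return a copy safe for logs: values of secret-looking keys are masked."""
    return {
        k: "****" if any(m in k.lower() for m in SECRET_MARKERS) else v
        for k, v in config.items()
    }
```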
Exercise 24: Disaster Recovery Plan
Write a complete disaster recovery plan for a production web application that includes:
1. Backup strategy for the database (frequency, retention, testing)
2. Backup strategy for uploaded files and assets
3. Recovery procedure for each of the following scenarios:
   - Database corruption
   - Application server failure
   - Complete data center outage
   - Accidental data deletion by a user
4. Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for each scenario
5. A runbook with step-by-step commands for the most critical recovery procedures
Tier 5: Evaluation and Critical Thinking (Exercises 25–30)
Exercise 25: DevOps Tool Evaluation
Your company is choosing between three DevOps approaches for a new project:
- Option A: GitHub Actions + Docker + Heroku
- Option B: GitLab CI + Docker + AWS ECS
- Option C: Jenkins + Kubernetes + GCP
Evaluate each option across the following dimensions for a team of 5 developers building a medium-complexity web application:
1. Setup complexity and time to first deployment
2. Ongoing maintenance burden
3. Scalability (handling growth from 100 to 100,000 users)
4. Cost at three scales: startup, growth, and enterprise
5. Learning curve for the team
6. Vendor lock-in risk
Provide a recommendation with your reasoning.
Exercise 26: AI-Generated DevOps Configurations
An AI assistant generated the following Dockerfile for a production Python application:
```dockerfile
FROM python:3.12
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
ENV PYTHONUNBUFFERED=1
ENV DEBUG=True
EXPOSE 8000
CMD ["python", "manage.py", "runserver", "0.0.0.0:8000"]
```
Critically evaluate this Dockerfile:
1. Identify every issue (security, performance, reliability)
2. Rate the AI's output quality on a scale of 1 to 10
3. Write an improved prompt that would produce a better result
4. Rewrite the Dockerfile with all issues fixed
Exercise 27: Monitoring Strategy Debate
Two senior engineers disagree about monitoring strategy:
Engineer A argues: "We should alert on every metric that deviates from normal. Better to have too many alerts than to miss a real incident."
Engineer B argues: "We should only alert on user-facing symptoms. If users are not affected, it is not worth waking someone up."
Write a 500-word essay evaluating both positions, including:
1. The strengths and weaknesses of each approach
2. The concept of alert fatigue and its consequences
3. How to find the right balance
4. Your recommended alerting philosophy with specific examples
Exercise 28: Microservices vs. Monolith Deployment
Your team is debating whether to deploy their application as a monolith or as microservices. The application has four logical components: user authentication, product catalog, order processing, and notification service.
Analyze the deployment implications of each approach:
1. Container and orchestration complexity
2. CI/CD pipeline complexity
3. Monitoring and debugging difficulty
4. Deployment risk and rollback complexity
5. Resource efficiency
6. Team coordination requirements
Provide a recommendation based on team size (2 developers), current traffic (low), and projected growth (moderate).
Exercise 29: Post-Mortem Analysis
Read the following incident timeline and write a blameless post-mortem:
Timeline:
- 09:00: Developer pushes new feature to main branch
- 09:05: CI/CD pipeline passes all tests
- 09:10: Automatic deployment to production begins
- 09:15: Deployment completes; health checks pass
- 09:45: Customer reports checkout page showing "Internal Server Error"
- 09:50: On-call engineer investigates; sees elevated error rate in Grafana
- 10:00: Root cause identified: new code calls a third-party API that was not mocked in tests; the API changed its response format
- 10:05: Engineer initiates rollback to previous version
- 10:15: Rollback complete; error rate returns to normal
- 10:20: Customer confirms checkout is working
Your post-mortem should include:
1. Incident summary
2. Impact assessment
3. Root cause analysis
4. Timeline of events
5. What went well
6. What went poorly
7. Action items with owners and due dates
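For the impact assessment, it helps to compute the key intervals from the timeline rather than eyeballing them. The sketch below assumes customer impact began when the deployment completed at 09:15 (errors may have started slightly earlier during rollout); the event names and `minutes_between` are illustrative.

```python
from datetime import datetime

# Event times taken from the incident timeline (same day assumed).
TIMELINE = {
    "deploy_complete": "09:15",
    "first_report": "09:45",
    "root_cause_found": "10:00",
    "rollback_complete": "10:15",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two named timeline events."""
    fmt = "%H:%M"
    delta = (datetime.strptime(TIMELINE[end], fmt)
             - datetime.strptime(TIMELINE[start], fmt))
    return int(delta.total_seconds() // 60)


time_to_detect = minutes_between("deploy_complete", "first_report")      # 30
time_to_diagnose = minutes_between("first_report", "root_cause_found")   # 15
impact_window = minutes_between("deploy_complete", "rollback_complete")  # 60
```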
Exercise 30: Future of DevOps with AI
Write a 500-word analysis of how AI coding assistants will change DevOps practices over the next five years. Consider:
1. Which DevOps tasks are most likely to be fully automated by AI?
2. Which tasks will still require human judgment?
3. How will the role of the DevOps engineer evolve?
4. What new risks does AI-generated infrastructure code introduce?
5. How should teams adapt their workflows to take advantage of AI while managing risks?