Chapter 29: Exercises — DevOps and Deployment

Tier 1: Recall and Understanding (Exercises 1–6)

Exercise 1: DevOps Vocabulary Match

Match each DevOps term with its correct definition:

Term             Definition
1. CI            A. Defining infrastructure through version-controlled config files
2. CD            B. A lightweight, isolated runtime environment sharing the host OS kernel
3. IaC           C. Automatically building and testing code on every commit
4. Container     D. A deployment strategy that routes a small percentage of traffic to a new version
5. Canary        E. Keeping the codebase in a deployable state at all times
6. Blue-green    F. Maintaining two identical environments and switching traffic between them

Expected answer: 1-C, 2-E, 3-A, 4-B, 5-D, 6-F


Exercise 2: Dockerfile Instruction Ordering

Given the following Dockerfile instructions in random order, arrange them in the correct sequence for optimal layer caching:

COPY . .
RUN pip install --no-cache-dir -r requirements.txt
EXPOSE 8000
FROM python:3.12-slim
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
WORKDIR /app
COPY requirements.txt .

Write the correctly ordered Dockerfile.


Exercise 3: Environment Variable Categories

Classify each of the following configuration values as one of: (a) can be committed to Git, (b) should be an environment variable, or (c) must be stored in a secrets manager.

  1. APP_NAME=MyVibeApp
  2. DATABASE_URL=postgresql://admin:s3cret@prod-db:5432/myapp
  3. LOG_LEVEL=INFO
  4. AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
  5. ALLOWED_ORIGINS=https://myapp.com
  6. STRIPE_SECRET_KEY=sk_live_abc123
  7. MAX_UPLOAD_SIZE=10485760
  8. JWT_SECRET=a-very-long-random-string

Exercise 4: Log Level Selection

For each scenario, select the appropriate log level (DEBUG, INFO, WARNING, ERROR, CRITICAL):

  1. A user successfully logged in
  2. An API endpoint received a request with a deprecated parameter
  3. The database connection pool is exhausted and new requests are failing
  4. A background job processed 1,000 records in 45 seconds
  5. The application ran out of memory and the main process is shutting down
  6. A variable holds the value user_id=12345 and you need to trace a bug
  7. A third-party API returned a 429 (rate limited) status; the request will be retried
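
As a reference point for this exercise, Python's standard logging module orders the five levels by numeric severity, and a logger drops any record below its configured threshold. The sketch below illustrates only that ordering; the logger name exercise4 and the WARNING threshold are hypothetical choices, and the code deliberately does not classify the scenarios above.

```python
import logging

# The five standard levels, from least to most severe.
LEVELS = ["DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"]


def numeric_levels():
    """Return (name, numeric value) pairs as defined by the logging module."""
    return [(name, getattr(logging, name)) for name in LEVELS]


# Hypothetical logger name and threshold, used only for this illustration.
logger = logging.getLogger("exercise4")
logger.setLevel(logging.WARNING)


def is_emitted(level_name):
    """True if a record at this level passes the logger's WARNING threshold."""
    return logger.isEnabledFor(getattr(logging, level_name))


if __name__ == "__main__":
    for name, value in numeric_levels():
        print(f"{name:<8} {value:>2}  emitted at WARNING: {is_emitted(name)}")
```

Choosing a level is therefore also a choice about which environments will ever see the message: a production logger set to WARNING silently discards DEBUG and INFO records.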

Exercise 5: CI/CD Pipeline Stages

List the six standard stages of a CI/CD pipeline in the correct order. For each stage, provide one specific tool or action that belongs to that stage.


Exercise 6: Docker Compose Service Dependencies

Given a web application that depends on PostgreSQL and Redis, explain what the following docker-compose.yml snippet does and why the condition: service_healthy option is important:

services:
  web:
    depends_on:
      db:
        condition: service_healthy
      cache:
        condition: service_started

Tier 2: Application (Exercises 7–12)

Exercise 7: Write a Dockerfile

Write a complete, production-ready Dockerfile for a Python Flask application with the following requirements:

  - Uses Python 3.12
  - Uses a multi-stage build
  - Installs dependencies from requirements.txt
  - Runs as a non-root user
  - Exposes port 5000
  - Uses Gunicorn as the production server with 4 workers
  - Includes a health check


Exercise 8: GitHub Actions Workflow

Write a GitHub Actions workflow file (.github/workflows/ci.yml) for a Python project that:

  1. Triggers on push to main and on pull requests to main
  2. Sets up Python 3.12
  3. Installs dependencies from requirements.txt
  4. Runs ruff check . for linting
  5. Runs pytest with coverage reporting
  6. Runs only on the ubuntu-latest runner


Exercise 9: Environment Configuration Class

Using Pydantic Settings, write a Python Settings class that:

  - Loads from environment variables and .env files
  - Has the following fields: database_url (required), redis_url (default redis://localhost:6379/0), secret_key (required), debug (default False), environment (default development), log_level (default INFO)
  - Includes a validation method that raises an error if debug is True when environment is production


Exercise 10: Health Check Endpoint

Write a FastAPI health check endpoint that:

  1. Has a basic /health route that returns {"status": "healthy"}
  2. Has a /health/ready route that checks:
     - Database connectivity (simulate with a function call)
     - Redis connectivity (simulate with a function call)
     - Disk space availability (simulate with a function call)
  3. Returns HTTP 200 if all checks pass, HTTP 503 if any fail
  4. Includes the timestamp and duration of each check in the response
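
One possible starting point for the simulated checks is a framework-free helper that times each check and aggregates the results. This is a minimal sketch, not a full solution: the function names check_database, check_disk_space, run_check, and readiness are hypothetical, the checks are placeholders, and wiring the result into FastAPI routes is left to the exercise.

```python
import time
from datetime import datetime, timezone


def check_database():
    """Simulated database check (placeholder; always succeeds)."""
    return True


def check_disk_space():
    """Simulated disk check (placeholder; always fails, to exercise the 503 path)."""
    return False


def run_check(name, check):
    """Run one check and record its outcome, start timestamp, and duration."""
    started_at = datetime.now(timezone.utc).isoformat()
    start = time.perf_counter()
    try:
        ok = bool(check())
    except Exception:
        ok = False  # a check that raises counts as a failed check
    duration_ms = (time.perf_counter() - start) * 1000
    return {
        "name": name,
        "ok": ok,
        "timestamp": started_at,
        "duration_ms": round(duration_ms, 3),
    }


def readiness(checks):
    """Aggregate checks into (status_code, body): 200 if all pass, else 503."""
    results = [run_check(name, fn) for name, fn in checks.items()]
    all_ok = all(result["ok"] for result in results)
    body = {"status": "ready" if all_ok else "not ready", "checks": results}
    return (200 if all_ok else 503), body
```

Keeping the aggregation separate from the web framework makes the pass/fail logic easy to unit test before any route exists.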


Exercise 11: Docker Compose Multi-Service

Write a docker-compose.yml file for an application with:

  - A Python web service built from a local Dockerfile
  - PostgreSQL 16 with persistent storage and a health check
  - Redis 7 with Alpine base
  - An Nginx reverse proxy on ports 80 and 443
  - All services on a shared custom network
  - Environment variables loaded from a .env file


Exercise 12: Structured Logging Setup

Write a Python module that configures structured logging using the structlog library. The module should:

  1. Output JSON in production, human-readable format in development
  2. Include timestamp, log level, and logger name in every entry
  3. Support correlation IDs via context variables
  4. Provide a middleware function for FastAPI that logs every request with method, path, status code, and duration
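
The correlation-ID requirement rests on Python's contextvars module, which gives each request or task its own value for a variable. The sketch below shows that mechanism with the standard logging module rather than structlog, so it is an illustration of the idea and not the structlog configuration the exercise asks for; the logger name exercise12.sketch and the helper names are hypothetical.

```python
import contextvars
import io
import logging
import uuid

# Holds the correlation ID for the current request or task; "-" when unset.
correlation_id = contextvars.ContextVar("correlation_id", default="-")


class CorrelationIdFilter(logging.Filter):
    """Copy the current correlation ID onto every log record."""

    def filter(self, record):
        record.correlation_id = correlation_id.get()
        return True


def make_logger(stream):
    """Build a logger whose lines start with the current correlation ID."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(correlation_id)s %(levelname)s %(message)s")
    )
    handler.addFilter(CorrelationIdFilter())
    logger = logging.getLogger("exercise12.sketch")  # hypothetical name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


def handle_request(logger, request_id=None):
    """Simulate one request: set the ID, log, then restore the previous value."""
    token = correlation_id.set(request_id or uuid.uuid4().hex)
    try:
        logger.info("request handled")
    finally:
        correlation_id.reset(token)


if __name__ == "__main__":
    buffer = io.StringIO()
    handle_request(make_logger(buffer), "abc123")
    print(buffer.getvalue(), end="")
```

Resetting the variable in a finally block matters: without it, an ID from one request could leak into log lines written later in the same context.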


Tier 3: Analysis (Exercises 13–18)

Exercise 13: Dockerfile Optimization

The following Dockerfile is functional but poorly optimized. Identify at least five problems and rewrite it with corrections:

FROM python:3.12

WORKDIR /app

COPY . .

RUN pip install -r requirements.txt

RUN apt-get update && apt-get install -y curl vim nano htop

EXPOSE 8000

CMD python main.py

Exercise 14: CI/CD Pipeline Failure Analysis

A CI/CD pipeline has the following stages: lint, test, build, deploy. The pipeline has been failing intermittently with the following pattern:

  - Lint: always passes
  - Test: passes 95% of the time
  - Build: passes 100% of the time when tests pass
  - Deploy: fails about 10% of the time with "connection refused" errors

Analyze this pattern and:

  1. Identify the most likely root cause for the test failures
  2. Identify the most likely root cause for the deploy failures
  3. Propose specific fixes for each issue
  4. Calculate the overall pipeline success rate
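
For the final calculation, recall that a sequential pipeline succeeds only if every stage succeeds, so per-stage pass rates multiply (each rate being conditional on the earlier stages having passed). The sketch below shows the arithmetic with deliberately different, hypothetical rates so that the exercise's own answer is left to the reader; the function name pipeline_success_rate is an assumption for illustration.

```python
import math


def pipeline_success_rate(stage_pass_rates):
    """Multiply per-stage pass rates for a strictly sequential pipeline.

    Each rate is the probability that a stage passes given that all
    earlier stages passed.
    """
    return math.prod(stage_pass_rates)


# Hypothetical rates for a different three-stage pipeline,
# not the pipeline described in this exercise.
example_rate = pipeline_success_rate([1.0, 0.99, 0.98])

if __name__ == "__main__":
    print(f"Overall success rate: {example_rate:.2%}")
```

Even stages that each look reliable on their own compound into a noticeably lower end-to-end rate, which is why intermittent failures are worth fixing at the source.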


Exercise 15: Cloud Platform Comparison

You are deploying a new application with the following requirements:

  - Expected traffic: 1,000 requests per minute (RPM) initially, growing to 50,000 RPM over 12 months
  - Budget: $50/month initially, willing to scale to $500/month
  - Tech stack: Python FastAPI backend, React frontend, PostgreSQL, Redis
  - Team size: 2 developers, no dedicated ops
  - Compliance: no special compliance requirements

Compare the following deployment options and recommend one with justification:

  1. Heroku
  2. Railway
  3. AWS (ECS Fargate + RDS)
  4. Google Cloud Run + Cloud SQL
  5. Self-managed VPS (DigitalOcean/Linode)


Exercise 16: Monitoring Dashboard Design

Design a monitoring dashboard for a production web application. Specify:

  1. The four golden signals and how you would measure each
  2. At least six specific metrics you would display
  3. Three alert rules with thresholds and notification channels
  4. A proposed dashboard layout (describe sections and their contents)


Exercise 17: Rollback Scenario Analysis

Your team deployed version 2.5.0 of an e-commerce application at 2:00 PM. At 2:15 PM, you notice:

  - Error rate increased from 0.1% to 5%
  - Average response time increased from 200ms to 1,500ms
  - The errors are all 500 Internal Server Error on the /api/checkout endpoint
  - Database query logs show a new query that is taking 8 seconds due to a missing index
  - Version 2.5.0 included a database migration that added a new column and modified the checkout query

Describe your step-by-step incident response, including:

  1. Immediate actions
  2. Rollback strategy (considering the database migration)
  3. Communication plan
  4. Post-incident tasks


Exercise 18: Security Audit of Deployment Configuration

Review the following deployment configuration and identify all security issues:

# docker-compose.yml
services:
  web:
    image: myapp:latest
    ports:
      - "8000:8000"
    environment:
      - DATABASE_URL=postgresql://admin:password123@db:5432/prod
      - SECRET_KEY=mysecretkey
      - DEBUG=true
      - AWS_ACCESS_KEY_ID=AKIAIOSFODNN7EXAMPLE
      - AWS_SECRET_ACCESS_KEY=wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY

  db:
    image: postgres:latest
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_PASSWORD=password123
    volumes:
      - ./data:/var/lib/postgresql/data

Tier 4: Synthesis (Exercises 19–24)

Exercise 19: Complete CI/CD Pipeline

Design and write a complete CI/CD pipeline (GitHub Actions) for a full-stack application (Python backend + React frontend) that:

  1. Runs backend linting, type checking, and tests in parallel with frontend linting and tests
  2. Builds Docker images for both backend and frontend
  3. Pushes images to GitHub Container Registry
  4. Deploys to a staging environment on push to develop
  5. Deploys to production on push to main with a manual approval gate
  6. Sends a Slack notification on deployment success or failure
  7. Includes proper caching for pip and npm dependencies


Exercise 20: Infrastructure as Code

Write a Terraform configuration that provisions:

  1. An AWS VPC with two public and two private subnets across two availability zones
  2. An Application Load Balancer in the public subnets
  3. An ECS Fargate service running your application container in the private subnets
  4. An RDS PostgreSQL instance in the private subnets
  5. Security groups that restrict database access to only the ECS service

The configuration should also:

  6. Use variables for all configurable values
  7. Output the load balancer DNS name


Exercise 21: Monitoring and Alerting System

Build a complete monitoring solution for a Python web application:

  1. Write Prometheus metric instrumentation (counters, histograms, gauges)
  2. Create a Prometheus configuration to scrape the application
  3. Write three Prometheus alerting rules (high error rate, high latency, service down)
  4. Design a Grafana dashboard JSON (or describe it in detail) with at least four panels
  5. Write a Python script that simulates load and generates metrics data


Exercise 22: Blue-Green Deployment Script

Write a Python script that automates a blue-green deployment:

  1. Determines which environment (blue or green) is currently active
  2. Deploys the new version to the inactive environment
  3. Runs health checks against the new deployment
  4. Switches the load balancer to point to the new environment
  5. Keeps the old environment running for 30 minutes as a fallback
  6. Provides a manual rollback command
  7. Logs all actions with timestamps


Exercise 23: Multi-Environment Configuration System

Design and implement a configuration management system that:

  1. Supports development, staging, and production environments
  2. Uses a hierarchy: defaults < environment-specific < environment variables < command-line arguments
  3. Validates all configuration at startup
  4. Redacts secrets in log output
  5. Provides a CLI command to display current configuration (with secrets masked)
  6. Supports hot-reloading of non-secret configuration values
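
The precedence hierarchy and the secret-masking requirement can be prototyped with the standard library before any real design work. The sketch below is one possible starting point under stated assumptions: the key names, the SECRET_KEYS set, and the functions build_config and redacted are all hypothetical, and validation, the CLI, and hot-reloading are not covered.

```python
from collections import ChainMap

# Hypothetical key names treated as secret for masking purposes.
SECRET_KEYS = {"secret_key", "database_url"}


def build_config(defaults, env_specific, env_vars, cli_args):
    """Layer the four sources so that later arguments override earlier ones."""
    # ChainMap searches its maps left to right, so the
    # highest-priority source (command-line arguments) goes first.
    return ChainMap(cli_args, env_vars, env_specific, defaults)


def redacted(config):
    """Return a plain dict that is safe to log, with secret values masked."""
    return {
        key: ("***" if key in SECRET_KEYS else value)
        for key, value in config.items()
    }


# Illustrative values only.
config = build_config(
    defaults={"log_level": "INFO", "debug": False, "secret_key": "dev-only"},
    env_specific={"log_level": "WARNING"},
    env_vars={"secret_key": "value-from-environment"},
    cli_args={"debug": True},
)

if __name__ == "__main__":
    print(redacted(config))
```

A ChainMap keeps each layer intact instead of merging them, which makes it straightforward to report which source supplied a given value when debugging configuration problems.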


Exercise 24: Disaster Recovery Plan

Write a complete disaster recovery plan for a production web application that includes:

  1. Backup strategy for the database (frequency, retention, testing)
  2. Backup strategy for uploaded files and assets
  3. Recovery procedure for each of the following scenarios:
     - Database corruption
     - Application server failure
     - Complete data center outage
     - Accidental data deletion by a user
  4. Recovery Time Objective (RTO) and Recovery Point Objective (RPO) for each scenario
  5. A runbook with step-by-step commands for the most critical recovery procedures


Tier 5: Evaluation and Critical Thinking (Exercises 25–30)

Exercise 25: DevOps Tool Evaluation

Your company is choosing between three DevOps approaches for a new project:

  Option A: GitHub Actions + Docker + Heroku
  Option B: GitLab CI + Docker + AWS ECS
  Option C: Jenkins + Kubernetes + GCP

Evaluate each option across the following dimensions for a team of 5 developers building a medium-complexity web application:

  1. Setup complexity and time to first deployment
  2. Ongoing maintenance burden
  3. Scalability (handling growth from 100 to 100,000 users)
  4. Cost at three scales: startup, growth, and enterprise
  5. Learning curve for the team
  6. Vendor lock-in risk

Provide a recommendation with your reasoning.


Exercise 26: AI-Generated DevOps Configurations

An AI assistant generated the following Dockerfile for a production Python application:

FROM python:3.12

WORKDIR /app

COPY requirements.txt .
RUN pip install -r requirements.txt

COPY . .

ENV PYTHONUNBUFFERED=1
ENV DEBUG=True

EXPOSE 8000

CMD ["python", "manage.py", "runserver", "0.0.0.0:8000"]

Critically evaluate this Dockerfile:

  1. Identify every issue (security, performance, reliability)
  2. Rate the AI's output quality on a scale of 1-10
  3. Write an improved prompt that would produce a better result
  4. Rewrite the Dockerfile with all issues fixed


Exercise 27: Monitoring Strategy Debate

Two senior engineers disagree about monitoring strategy:

Engineer A argues: "We should alert on every metric that deviates from normal. Better to have too many alerts than to miss a real incident."

Engineer B argues: "We should only alert on user-facing symptoms. If users are not affected, it is not worth waking someone up."

Write a 500-word essay evaluating both positions, including:

  1. The strengths and weaknesses of each approach
  2. The concept of alert fatigue and its consequences
  3. How to find the right balance
  4. Your recommended alerting philosophy with specific examples


Exercise 28: Microservices vs. Monolith Deployment

Your team is debating whether to deploy their application as a monolith or as microservices. The application has four logical components: user authentication, product catalog, order processing, and notification service.

Analyze the deployment implications of each approach:

  1. Container and orchestration complexity
  2. CI/CD pipeline complexity
  3. Monitoring and debugging difficulty
  4. Deployment risk and rollback complexity
  5. Resource efficiency
  6. Team coordination requirements

Provide a recommendation based on team size (2 developers), current traffic (low), and projected growth (moderate).


Exercise 29: Post-Mortem Analysis

Read the following incident timeline and write a blameless post-mortem:

Timeline:

  - 09:00  Developer pushes new feature to main branch
  - 09:05  CI/CD pipeline passes all tests
  - 09:10  Automatic deployment to production begins
  - 09:15  Deployment completes; health checks pass
  - 09:45  Customer reports checkout page showing "Internal Server Error"
  - 09:50  On-call engineer investigates; sees elevated error rate in Grafana
  - 10:00  Root cause identified: new code calls a third-party API that was not mocked in tests; the API changed its response format
  - 10:05  Engineer initiates rollback to previous version
  - 10:15  Rollback complete; error rate returns to normal
  - 10:20  Customer confirms checkout is working

Your post-mortem should include:

  1. Incident summary
  2. Impact assessment
  3. Root cause analysis
  4. Timeline of events
  5. What went well
  6. What went poorly
  7. Action items with owners and due dates


Exercise 30: Future of DevOps with AI

Write a 500-word analysis of how AI coding assistants will change DevOps practices over the next five years. Consider:

  1. Which DevOps tasks are most likely to be fully automated by AI?
  2. Which tasks will still require human judgment?
  3. How will the role of the DevOps engineer evolve?
  4. What new risks does AI-generated infrastructure code introduce?
  5. How should teams adapt their workflows to take advantage of AI while managing risks?