Chapter 36: Exercises
AI Coding Agents and Autonomous Workflows
These exercises are organized into five tiers of increasing difficulty, from foundational recall to open-ended creation. Each exercise includes the Bloom's taxonomy level it targets.
Tier 1: Recall and Comprehension (Exercises 1--6)
Exercise 1: Agent vs. Assistant Comparison
Bloom's Level: Remember
List five characteristics that distinguish an AI coding agent from a conversational AI assistant. For each characteristic, provide a one-sentence explanation of why it matters for software development tasks.
Exercise 2: Agent Loop Identification
Bloom's Level: Understand
Read the following pseudocode and identify which phase of the plan-act-observe loop each line corresponds to:
1. response = llm.generate(prompt_with_context)
2. files = search_codebase("*.py", project_root)
3. plan = parse_plan(response)
4. result = execute_command("pytest tests/")
5. memory.update(result)
6. if all_tests_pass(result): mark_complete()
Label each line as Plan, Act, Observe, or Control (loop management). Explain your reasoning for each.
Exercise 3: Tool Classification
Bloom's Level: Understand
Classify each of the following tools into one of four categories: Read-Only, Write, Execute, or Communication. Then rank them from lowest to highest risk.
- read_file(path)
- write_file(path, content)
- delete_file(path)
- run_command(cmd)
- git_commit(message)
- web_search(query)
- send_notification(message)
- deploy_to_production()
Exercise 4: Guardrail Matching
Bloom's Level: Remember
Match each guardrail type with the risk it primarily mitigates:
| Guardrail | Risk |
|---|---|
| A. Permission allowlists | 1. Infinite loops consuming API credits |
| B. Sandboxing | 2. Agent modifying system files |
| C. Cost limits | 3. Generated code containing secrets |
| D. Output validation | 4. Damage escaping to the host system |
Exercise 5: Memory Types
Bloom's Level: Understand
For each scenario, identify whether the agent needs working memory, short-term memory, or long-term memory:
a) The agent needs to remember the output of a file it read three steps ago in the current task.
b) The agent needs to know the project's preferred testing framework across multiple sessions.
c) The agent needs to track which steps of its current plan have been completed.
d) The agent needs to remember that a particular approach failed in a previous task last week.
e) The agent needs the contents of the current system prompt.
Exercise 6: Error Classification
Bloom's Level: Understand
Classify each error scenario and describe the appropriate recovery strategy (retry, fallback, escalate, or abort):
a) An API call returns a 429 (rate limit) error.
b) The agent generates Python code with a syntax error.
c) The agent tries to write to a directory that does not exist.
d) The agent's task requires modifying a file that has been deleted since the task started.
e) The agent has spent $50 on a task with a $10 budget.
f) A test fails with an assertion error after the agent's code change.
Tier 2: Application (Exercises 7--12)
Exercise 7: Tool Definition
Bloom's Level: Apply
Write a complete tool definition (name, description, parameter schema, and implementation) for a search_and_replace tool that:
- Takes a file path, a search string, and a replacement string
- Replaces all occurrences of the search string in the file
- Returns the number of replacements made
- Handles errors gracefully (file not found, permission denied)
Include type hints and a docstring.
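As a starting point, a sketch of the expected shape is below. The schema format follows the common JSON-Schema style used by many LLM tool APIs; the exact format and the error-reporting convention (a dict with an "error" key) are suggestions, not requirements.

```python
# Hypothetical skeleton for the search_and_replace tool.
# The schema style mirrors common LLM tool-calling APIs; adapt it
# to whatever agent framework you are using.

TOOL_SCHEMA = {
    "name": "search_and_replace",
    "description": "Replace all occurrences of a string in a file.",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "File to edit"},
            "search": {"type": "string", "description": "String to find"},
            "replace": {"type": "string", "description": "Replacement"},
        },
        "required": ["path", "search", "replace"],
    },
}


def search_and_replace(path: str, search: str, replace: str) -> dict:
    """Replace all occurrences of `search` with `replace` in `path`.

    Returns a dict with either a "replacements" count or an "error"
    key, so the agent loop can inspect failures instead of crashing.
    """
    try:
        with open(path, encoding="utf-8") as f:
            text = f.read()
    except FileNotFoundError:
        return {"error": f"file not found: {path}"}
    except PermissionError:
        return {"error": f"permission denied: {path}"}
    count = text.count(search)
    if count:
        with open(path, "w", encoding="utf-8") as f:
            f.write(text.replace(search, replace))
    return {"replacements": count}
```

Returning errors as data rather than raising keeps the tool's contract uniform: the agent always receives a dict it can reason about.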
Exercise 8: Permission System
Bloom's Level: Apply
Implement a PermissionChecker class that:
- Accepts a configuration of allowed directories, blocked commands, and allowed file extensions
- Has a check_file_access(path, mode) method that returns True/False for read or write access
- Has a check_command(command) method that returns True/False
- Logs all permission checks (both allowed and denied)
Write at least five test cases for your implementation.
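A minimal sketch of the interface, assuming read access is extension-agnostic while writes must match an allowed extension; harden and extend it for your implementation.

```python
# Minimal sketch of the PermissionChecker interface. The path logic
# is a starting point, not a hardened implementation: note the use of
# os.path.abspath to resolve ".." tricks before checking prefixes.
import logging
import os

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("permissions")


class PermissionChecker:
    def __init__(self, allowed_dirs, blocked_commands, allowed_extensions):
        self.allowed_dirs = [os.path.abspath(d) for d in allowed_dirs]
        self.blocked_commands = blocked_commands
        self.allowed_extensions = allowed_extensions

    def check_file_access(self, path: str, mode: str) -> bool:
        real = os.path.abspath(path)
        in_allowed = any(real.startswith(d + os.sep) for d in self.allowed_dirs)
        ext_ok = mode == "read" or any(
            real.endswith(e) for e in self.allowed_extensions
        )
        allowed = in_allowed and ext_ok
        log.info("file %s %s -> %s", mode, path, "ALLOW" if allowed else "DENY")
        return allowed

    def check_command(self, command: str) -> bool:
        allowed = not any(b in command for b in self.blocked_commands)
        log.info("command %r -> %s", command, "ALLOW" if allowed else "DENY")
        return allowed
```

Substring matching on blocked commands is deliberately naive; one of your test cases should probe how easily it can be bypassed.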
Exercise 9: Conversation Summarizer
Bloom's Level: Apply
Write a function summarize_conversation(history: list[dict], max_tokens: int) that:
- Takes a conversation history (list of message dicts with "role" and "content" keys)
- Estimates the token count of each message (approximate: 1 token per 4 characters)
- If the total exceeds max_tokens, summarizes older messages while keeping recent ones intact
- Returns a new history that fits within the token budget
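The character-based estimate the exercise specifies can be sketched as a small helper; it is a crude heuristic, not a real tokenizer, and is only meant to keep the summarizer's budget arithmetic simple.

```python
# Rough token estimate: about 1 token per 4 characters.
# max(1, ...) ensures even an empty message costs something,
# since each message also carries role/formatting overhead.

def estimate_tokens(message: dict) -> int:
    """Approximate token count of one {"role", "content"} message."""
    return max(1, len(message["content"]) // 4)


history = [
    {"role": "user", "content": "Fix the failing test in test_utils.py"},
    {"role": "assistant", "content": "I will read the file first."},
]
total = sum(estimate_tokens(m) for m in history)
```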
Exercise 10: Retry Logic
Bloom's Level: Apply
Implement a RetryPolicy class that supports three strategies:
- Immediate retry (up to N times)
- Exponential backoff (base delay, max delay, max retries)
- Retry with modification (a callback that modifies the action before retrying)
Write a test that demonstrates each strategy handling a simulated flaky operation.
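Testing retry strategies requires an operation that fails deterministically a few times before succeeding. The sketch below shows such a simulator plus the simplest of the three strategies; the class and function names are suggestions.

```python
# A deterministic "flaky" operation for testing retry strategies:
# it raises for the first `failures` calls, then succeeds.

class FlakyOperation:
    def __init__(self, failures: int):
        self.failures = failures
        self.calls = 0

    def __call__(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise RuntimeError(f"transient failure #{self.calls}")
        return "ok"


def retry_immediate(op, max_retries: int):
    """Immediate retry, up to `max_retries` extra attempts."""
    for attempt in range(max_retries + 1):
        try:
            return op()
        except RuntimeError:
            if attempt == max_retries:
                raise  # budget exhausted; surface the last error
```

Because FlakyOperation counts its calls, your tests can assert not just the final outcome but exactly how many attempts each strategy made.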
Exercise 11: Agent State Tracker
Bloom's Level: Apply
Create a TaskState class that tracks:
- The original goal
- The current plan (list of steps)
- Completed steps with their results
- Files read and modified
- Errors encountered
- Current iteration number
Include methods to serialize/deserialize the state to/from JSON, so it can be persisted between sessions.
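One lightweight way to get the JSON round-trip is a dataclass plus the stdlib json module; the field names below mirror the bullet list and are only a suggestion.

```python
# Sketch of TaskState serialization using dataclasses.asdict for the
# outbound direction and keyword expansion for the inbound one.
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TaskState:
    goal: str
    plan: list = field(default_factory=list)
    completed: list = field(default_factory=list)   # (step, result) pairs
    files_read: list = field(default_factory=list)
    files_modified: list = field(default_factory=list)
    errors: list = field(default_factory=list)
    iteration: int = 0

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, data: str) -> "TaskState":
        return cls(**json.loads(data))
```

This only works while every field is JSON-serializable; if you later store richer objects (exceptions, timestamps), you will need custom encoding.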
Exercise 12: Cost Monitor
Bloom's Level: Apply
Implement a CostMonitor class that:
- Tracks token usage (input and output separately)
- Calculates cost based on configurable per-token pricing
- Enforces a budget limit
- Raises a BudgetExceededError when the limit is reached
- Provides a summary() method showing total tokens, cost, and remaining budget
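A minimal sketch of the expected interface follows; the per-million-token pricing convention and all numbers used with it are illustrative, not real prices.

```python
# Minimal CostMonitor sketch. Prices are configured per million
# tokens and converted to per-token rates on construction.

class BudgetExceededError(Exception):
    pass


class CostMonitor:
    def __init__(self, input_price_per_mtok: float,
                 output_price_per_mtok: float, budget: float):
        self.input_price = input_price_per_mtok / 1_000_000
        self.output_price = output_price_per_mtok / 1_000_000
        self.budget = budget
        self.input_tokens = 0
        self.output_tokens = 0

    @property
    def cost(self) -> float:
        return (self.input_tokens * self.input_price
                + self.output_tokens * self.output_price)

    def record(self, input_tokens: int, output_tokens: int) -> None:
        # Record first, then check, so the overage is visible in summary().
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        if self.cost >= self.budget:
            raise BudgetExceededError(
                f"spent ${self.cost:.4f} of ${self.budget}")

    def summary(self) -> dict:
        return {
            "input_tokens": self.input_tokens,
            "output_tokens": self.output_tokens,
            "cost": self.cost,
            "remaining": self.budget - self.cost,
        }
```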
Tier 3: Analysis (Exercises 13--18)
Exercise 13: Agent Trace Analysis
Bloom's Level: Analyze
Given the following agent trace, identify:
a) Where the agent made a suboptimal decision
b) Where the agent could have been more efficient
c) What guardrails should have been in place
Iteration 1: READ src/main.py → 500 lines
Iteration 2: READ src/utils.py → 200 lines
Iteration 3: READ src/helpers.py → 150 lines
Iteration 4: READ src/config.py → 50 lines
Iteration 5: READ src/models.py → 300 lines
Iteration 6: READ tests/test_main.py → 100 lines
Iteration 7: WRITE src/utils.py → Added new function (25 lines)
Iteration 8: RUN pytest tests/ → 3 failures
Iteration 9: READ src/utils.py → Re-read the file it just wrote
Iteration 10: WRITE src/utils.py → Fixed import statement
Iteration 11: RUN pytest tests/ → 1 failure
Iteration 12: READ tests/test_utils.py → 80 lines
Iteration 13: WRITE tests/test_utils.py → Updated expected values
Iteration 14: RUN pytest tests/ → All pass
Iteration 15: RUN rm -rf /tmp/cache → Cleared a cache directory
Write a 500-word analysis.
Exercise 14: Autonomy Level Assessment
Bloom's Level: Analyze
For each of the following tasks, recommend an appropriate autonomy level (0--4 from Section 36.1) and justify your recommendation:
a) Formatting code according to a style guide
b) Implementing a new payment processing feature
c) Updating documentation to reflect API changes
d) Fixing a security vulnerability in an authentication module
e) Refactoring a module to use a new design pattern
f) Adding logging statements to existing functions
Exercise 15: Workflow Decomposition
Bloom's Level: Analyze
Decompose the following task into a hierarchical plan suitable for an agent. Identify which steps can be parallelized and which must be sequential:
"Migrate the user authentication system from session-based authentication to JWT tokens. The system currently uses Flask-Login with server-side sessions stored in Redis. The new system should use JWTs with refresh tokens, maintain backward compatibility during the migration period, and include comprehensive tests."
Exercise 16: Failure Mode Analysis
Bloom's Level: Analyze
For the "Issue-to-PR" workflow described in Section 36.4, identify at least eight potential failure modes. For each, describe:
- What could go wrong
- How the agent should detect the failure
- What recovery strategy is appropriate
- Whether human intervention is needed
Exercise 17: Memory Strategy Comparison
Bloom's Level: Analyze
Compare the following memory strategies for an agent working on a large codebase (500+ files):
a) Keep the full conversation history in context
b) Use summarization to compress older messages
c) Use a project knowledge base (CLAUDE.md) plus minimal conversation history
d) Use a vector database to retrieve relevant context on demand
For each strategy, analyze the tradeoffs in terms of: accuracy, cost, latency, and scalability.
Exercise 18: Guardrail Gap Analysis
Bloom's Level: Analyze
Review the following guardrail configuration and identify at least five gaps or weaknesses:
GUARDRAILS = {
"blocked_commands": ["rm -rf /", "sudo rm"],
"allowed_extensions": [".py", ".js", ".md"],
"max_iterations": 100,
"max_file_size_bytes": 1_000_000,
"allowed_paths": ["/home/user/project/"]
}
For each gap, explain the risk and propose a fix.
Tier 4: Synthesis and Evaluation (Exercises 19--24)
Exercise 19: Design a Code Review Agent
Bloom's Level: Create
Design a complete code review agent that:
- Reads a pull request diff
- Analyzes each change for bugs, security issues, performance problems, and style violations
- Generates inline comments at specific locations in the code
- Provides an overall summary with a recommendation (approve, request changes, or comment)
Write the tool definitions, the prompt template for the LLM, and the main agent loop. You do not need to implement the LLM call itself.
Exercise 20: Build a Test Generator Agent
Bloom's Level: Create
Build an agent that:
- Receives a Python source file
- Analyzes the functions and classes in the file
- Generates pytest test cases for each function
- Runs the tests
- Iterates to fix any tests that fail due to incorrect expectations
Implement the complete agent with at least three tools (read_file, write_file, run_tests). Include guardrails that prevent the agent from modifying the source file (it should only write test files).
Exercise 21: Evaluate Agent Strategies
Bloom's Level: Evaluate
Design an experiment to compare two agent planning strategies:
- Strategy A: Plan the entire task upfront, then execute all steps
- Strategy B: Plan one step at a time, observing results before planning the next
Define:
- A set of at least five test tasks of varying complexity
- Metrics to compare the strategies (completion rate, quality, efficiency, cost)
- Expected hypotheses about which strategy will perform better and why
- How you would analyze the results
Exercise 22: Agent Safety Audit
Bloom's Level: Evaluate
Conduct a safety audit of the simple coding agent built in Section 36.10. Your audit should:
- Identify at least ten potential safety issues
- Rate each issue by severity (critical, high, medium, low)
- Propose a mitigation for each issue
- Prioritize the mitigations by implementation order
- Estimate the engineering effort for each mitigation
Exercise 23: Design an Agent Evaluation Framework
Bloom's Level: Create
Design an evaluation framework for coding agents that includes:
- A task taxonomy (at least five task categories)
- Metrics for each category
- A scoring rubric
- A process for creating and maintaining the evaluation benchmark
- Statistical methods for comparing agent performance across runs
Write the framework as a specification document (500-800 words).
Exercise 24: Human-in-the-Loop Protocol
Bloom's Level: Create
Design a human-in-the-loop protocol for a coding agent used by a team of five developers. Your protocol should define:
- What actions require approval and from whom
- How approval requests are communicated (Slack, email, in-tool)
- Maximum wait time for approvals
- What happens when the approver is unavailable
- How the protocol differs for different risk levels
- How the protocol evolves as trust in the agent increases
Tier 5: Open-Ended and Research (Exercises 25--30)
Exercise 25: Multi-Agent System Design
Bloom's Level: Create
Design a multi-agent system where three agents collaborate to complete a feature request:
- Architect Agent: Analyzes requirements and designs the solution
- Developer Agent: Implements the code
- Reviewer Agent: Reviews the implementation and requests changes
Define the communication protocol between agents, the tools each agent needs, and the workflow for handling disagreements between agents. Implement the communication protocol as a Python class.
Exercise 26: Adaptive Guardrails
Bloom's Level: Create
Design and implement an adaptive guardrail system that:
- Starts with strict permissions
- Tracks the agent's behavior over time
- Gradually relaxes permissions for actions the agent has consistently used safely
- Tightens permissions if the agent triggers a guardrail
- Maintains an audit log of all permission changes
Implement this as a Python class with at least five unit tests.
Exercise 27: Agent Memory Architecture
Bloom's Level: Create
Design and implement a three-tier memory system for a coding agent:
- Tier 1: Working memory (current context window)
- Tier 2: Session memory (persisted within a task as a JSON file)
- Tier 3: Project memory (persisted across tasks as a knowledge base)
Include methods for promoting information from lower tiers to higher tiers (e.g., a key finding during a task gets promoted to the project knowledge base).
Exercise 28: Benchmark Creation
Bloom's Level: Create
Create a benchmark of ten coding tasks for evaluating agent performance. Each task should include:
- A natural language description
- A starter repository (file structure and contents)
- A test suite that the agent's solution must pass
- A difficulty rating (easy, medium, hard)
- Expected metrics (iterations, tokens, time)
Implement the benchmark runner as a Python script that executes each task, runs the test suite, and reports results.
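One way to encode a task is as plain data that the runner materializes into a working directory; the field names below mirror the bullet list above, and the sample task content is invented for illustration.

```python
# Hypothetical task encoding plus a helper that writes the starter
# repo to disk. The runner would materialize each task, let the agent
# work, then run the task's test suite.
import os

TASK = {
    "id": "task-001",
    "description": ("Add a function slugify(s) to utils.py that lowercases "
                    "s and replaces spaces with hyphens."),
    "starter_repo": {
        "utils.py": "# utilities\n",
        "tests/test_utils.py": (
            "from utils import slugify\n"
            "def test_slugify():\n"
            "    assert slugify('Hello World') == 'hello-world'\n"
        ),
    },
    "difficulty": "easy",
    "expected": {"iterations": 3, "tokens": 5_000, "seconds": 60},
}


def materialize(task: dict, root: str) -> None:
    """Write the task's starter repository into `root`."""
    for rel, content in task["starter_repo"].items():
        path = os.path.join(root, rel)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "w", encoding="utf-8") as f:
            f.write(content)
```

Keeping tasks as data (rather than code) makes it easy to add the remaining nine tasks and to version the benchmark alongside its runner.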
Exercise 29: Research Analysis
Bloom's Level: Evaluate
Read three recent papers or technical blog posts on coding agents (from 2024-2025). For each:
- Summarize the key contribution in 100 words
- Identify the agent architecture used
- Evaluate the strengths and limitations of their approach
- Compare their approach with the principles discussed in this chapter
Write a 1000-word comparative analysis.
Exercise 30: Build a Production-Ready Agent Feature
Bloom's Level: Create
Choose one of the following features and implement it as a production-quality Python module with full test coverage:
a) Context Window Manager: Automatically manages the context window by summarizing old messages, tracking token usage, and ensuring the most relevant information is always in context.
b) Agent Debugger: A tool that replays agent traces step-by-step, allows you to inspect the agent's reasoning at each step, and identifies where the agent's reasoning diverged from the optimal path.
c) Agent Cost Optimizer: Analyzes agent traces to identify wasteful patterns (unnecessary file reads, redundant tool calls, overly broad searches) and suggests optimizations.
Include type hints, docstrings, error handling, and at least ten unit tests.