Chapter 36: Exercises

AI Coding Agents and Autonomous Workflows

These exercises are organized into five tiers of increasing difficulty, from foundational recall to open-ended creation. Each exercise includes the Bloom's taxonomy level it targets.


Tier 1: Recall and Comprehension (Exercises 1--6)

Exercise 1: Agent vs. Assistant Comparison

Bloom's Level: Remember

List five characteristics that distinguish an AI coding agent from a conversational AI assistant. For each characteristic, provide a one-sentence explanation of why it matters for software development tasks.

Exercise 2: Agent Loop Identification

Bloom's Level: Understand

Read the following pseudocode and identify which phase of the plan-act-observe loop each line corresponds to:

1. response = llm.generate(prompt_with_context)
2. files = search_codebase("*.py", project_root)
3. plan = parse_plan(response)
4. result = execute_command("pytest tests/")
5. memory.update(result)
6. if all_tests_pass(result): mark_complete()

Label each line as Plan, Act, Observe, or Control (loop management). Explain your reasoning for each.

Exercise 3: Tool Classification

Bloom's Level: Understand

Classify each of the following tools into one of four categories: Read-Only, Write, Execute, or Communication. Then rank them from lowest to highest risk.

  • read_file(path)
  • write_file(path, content)
  • delete_file(path)
  • run_command(cmd)
  • git_commit(message)
  • web_search(query)
  • send_notification(message)
  • deploy_to_production()

Exercise 4: Guardrail Matching

Bloom's Level: Remember

Match each guardrail type with the risk it primarily mitigates:

Guardrail                    Risk
A. Permission allowlists     1. Infinite loops consuming API credits
B. Sandboxing                2. Agent modifying system files
C. Cost limits               3. Generated code containing secrets
D. Output validation         4. Damage escaping to the host system

Exercise 5: Memory Types

Bloom's Level: Understand

For each scenario, identify whether the agent needs working memory, short-term memory, or long-term memory:

a) The agent needs to remember the output of a file it read three steps ago in the current task.
b) The agent needs to know the project's preferred testing framework across multiple sessions.
c) The agent needs to track which steps of its current plan have been completed.
d) The agent needs to remember that a particular approach failed in a previous task last week.
e) The agent needs the contents of the current system prompt.

Exercise 6: Error Classification

Bloom's Level: Understand

Classify each error scenario and describe the appropriate recovery strategy (retry, fallback, escalate, or abort):

a) An API call returns a 429 (rate limit) error.
b) The agent generates Python code with a syntax error.
c) The agent tries to write to a directory that does not exist.
d) The agent's task requires modifying a file that has been deleted since the task started.
e) The agent has spent $50 on a task with a $10 budget.
f) A test fails with an assertion error after the agent's code change.


Tier 2: Application (Exercises 7--12)

Exercise 7: Tool Definition

Bloom's Level: Apply

Write a complete tool definition (name, description, parameter schema, and implementation) for a search_and_replace tool that:

  • Takes a file path, a search string, and a replacement string
  • Replaces all occurrences of the search string in the file
  • Returns the number of replacements made
  • Handles errors gracefully (file not found, permission denied)

Include type hints and a docstring.
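
A minimal happy-path sketch to compare your solution against (error handling here is deliberately simplistic; a full answer would distinguish error cases rather than collapse them into one sentinel value):

```python
from pathlib import Path


def search_and_replace(path: str, search: str, replacement: str) -> int:
    """Replace all occurrences of `search` in the file at `path`.

    Returns the number of replacements made, or -1 on error.
    (The exercise asks for graceful handling; a fuller version would
    report file-not-found and permission-denied separately.)
    """
    try:
        text = Path(path).read_text()
    except (FileNotFoundError, PermissionError):
        return -1
    count = text.count(search)
    if count:
        Path(path).write_text(text.replace(search, replacement))
    return count
```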

Exercise 8: Permission System

Bloom's Level: Apply

Implement a PermissionChecker class that:

  • Accepts a configuration of allowed directories, blocked commands, and allowed file extensions
  • Has a check_file_access(path, mode) method that returns True/False for read or write access
  • Has a check_command(command) method that returns True/False
  • Logs all permission checks (both allowed and denied)

Write at least five test cases for your implementation.
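
A starting-point sketch, assuming simple path-prefix and substring checks (your solution may well choose stricter enforcement, e.g. shell-aware command parsing):

```python
import logging
from pathlib import Path

logger = logging.getLogger("permissions")


class PermissionChecker:
    """Minimal sketch: resolves paths before comparing so `..` tricks
    and symlinked prefixes cannot bypass the allowed-directory check."""

    def __init__(self, allowed_dirs, blocked_commands, allowed_extensions):
        self.allowed_dirs = [Path(d).resolve() for d in allowed_dirs]
        self.blocked_commands = list(blocked_commands)
        self.allowed_extensions = set(allowed_extensions)

    def check_file_access(self, path: str, mode: str) -> bool:
        p = Path(path).resolve()
        ok = (any(p.is_relative_to(d) for d in self.allowed_dirs)
              and p.suffix in self.allowed_extensions)
        logger.info("file %s %s: %s", mode, p, "ALLOW" if ok else "DENY")
        return ok

    def check_command(self, command: str) -> bool:
        # Naive substring matching; one of your tests should probe its limits.
        ok = not any(b in command for b in self.blocked_commands)
        logger.info("cmd %r: %s", command, "ALLOW" if ok else "DENY")
        return ok
```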

Exercise 9: Conversation Summarizer

Bloom's Level: Apply

Write a function summarize_conversation(history: list[dict], max_tokens: int) that:

  • Takes a conversation history (list of message dicts with "role" and "content" keys)
  • Estimates the token count of each message (approximate: 1 token per 4 characters)
  • If the total exceeds max_tokens, summarizes older messages while keeping recent ones intact
  • Returns a new history that fits within the token budget
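
A sketch of the trimming logic; the "summary" here is a stub placeholder, since a real implementation would call an LLM to compress the dropped messages:

```python
def estimate_tokens(message: dict) -> int:
    # Rough heuristic from the exercise: ~1 token per 4 characters.
    return max(1, len(message["content"]) // 4)


def summarize_conversation(history: list[dict], max_tokens: int) -> list[dict]:
    """Drop oldest messages until the budget fits, replacing them with a
    single stub summary. (The stub itself costs a few tokens; a careful
    solution would account for that too.)"""
    total = sum(estimate_tokens(m) for m in history)
    kept = list(history)
    dropped = 0
    while kept and total > max_tokens:
        total -= estimate_tokens(kept.pop(0))
        dropped += 1
    if dropped:
        kept.insert(0, {"role": "system",
                        "content": f"[{dropped} earlier messages summarized]"})
    return kept
```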

Exercise 10: Retry Logic

Bloom's Level: Apply

Implement a RetryPolicy class that supports three strategies:

  • Immediate retry (up to N times)
  • Exponential backoff (base delay, max delay, max retries)
  • Retry with modification (a callback that modifies the action before retrying)

Write a test that demonstrates each strategy handling a simulated flaky operation.
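
As a warm-up before building the full class, here is a single function that mixes all three ideas (bounded retries, exponential backoff, an optional modification callback); the injectable `sleep` parameter is a testing convenience, not part of the exercise spec:

```python
import time


def retry_with_backoff(action, max_retries=3, base_delay=0.1, max_delay=2.0,
                       modify=None, sleep=time.sleep):
    """Run `action` until it succeeds or retries are exhausted.

    Between attempts, wait with exponential backoff (capped at
    `max_delay`) and optionally transform the action via `modify`.
    """
    delay = base_delay
    last_exc = None
    for attempt in range(max_retries + 1):
        try:
            return action()
        except Exception as exc:
            last_exc = exc
            if attempt == max_retries:
                break
            if modify is not None:
                action = modify(action)
            sleep(delay)
            delay = min(delay * 2, max_delay)
    raise last_exc
```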

Exercise 11: Agent State Tracker

Bloom's Level: Apply

Create a TaskState class that tracks:

  • The original goal
  • The current plan (list of steps)
  • Completed steps with their results
  • Files read and modified
  • Errors encountered
  • Current iteration number

Include methods to serialize/deserialize the state to/from JSON, so it can be persisted between sessions.
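
A dataclass makes the serialization requirement nearly free; a minimal sketch (field names are one reasonable choice, not prescribed by the exercise):

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class TaskState:
    """Minimal sketch of the state the exercise asks you to track."""
    goal: str
    plan: list = field(default_factory=list)
    completed: list = field(default_factory=list)  # e.g. {"step": ..., "result": ...}
    files_read: list = field(default_factory=list)
    files_modified: list = field(default_factory=list)
    errors: list = field(default_factory=list)
    iteration: int = 0

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, data: str) -> "TaskState":
        return cls(**json.loads(data))
```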

Exercise 12: Cost Monitor

Bloom's Level: Apply

Implement a CostMonitor class that:

  • Tracks token usage (input and output separately)
  • Calculates cost based on configurable per-token pricing
  • Enforces a budget limit
  • Raises a BudgetExceededError when the limit is reached
  • Provides a summary() method showing total tokens, cost, and remaining budget
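
A compact sketch; the per-million-token default prices below are illustrative placeholders, not real vendor pricing:

```python
class BudgetExceededError(RuntimeError):
    pass


class CostMonitor:
    """Tracks input/output tokens separately and enforces a dollar budget."""

    def __init__(self, budget_usd: float,
                 input_price_per_m: float = 3.0,
                 output_price_per_m: float = 15.0):
        self.budget = budget_usd
        self.input_price = input_price_per_m / 1_000_000
        self.output_price = output_price_per_m / 1_000_000
        self.input_tokens = 0
        self.output_tokens = 0

    @property
    def cost(self) -> float:
        return (self.input_tokens * self.input_price
                + self.output_tokens * self.output_price)

    def record(self, input_tokens: int, output_tokens: int) -> None:
        self.input_tokens += input_tokens
        self.output_tokens += output_tokens
        if self.cost > self.budget:
            raise BudgetExceededError(f"${self.cost:.4f} > ${self.budget:.2f}")

    def summary(self) -> dict:
        return {"input_tokens": self.input_tokens,
                "output_tokens": self.output_tokens,
                "cost_usd": round(self.cost, 4),
                "remaining_usd": round(self.budget - self.cost, 4)}
```

One design decision worth noticing: record() updates the counters before checking the budget, so the summary stays accurate even after the error fires.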


Tier 3: Analysis (Exercises 13--18)

Exercise 13: Agent Trace Analysis

Bloom's Level: Analyze

Given the following agent trace, identify:

a) Where the agent made a suboptimal decision
b) Where the agent could have been more efficient
c) What guardrails should have been in place

Iteration 1: READ src/main.py → 500 lines
Iteration 2: READ src/utils.py → 200 lines
Iteration 3: READ src/helpers.py → 150 lines
Iteration 4: READ src/config.py → 50 lines
Iteration 5: READ src/models.py → 300 lines
Iteration 6: READ tests/test_main.py → 100 lines
Iteration 7: WRITE src/utils.py → Added new function (25 lines)
Iteration 8: RUN pytest tests/ → 3 failures
Iteration 9: READ src/utils.py → Re-read the file it just wrote
Iteration 10: WRITE src/utils.py → Fixed import statement
Iteration 11: RUN pytest tests/ → 1 failure
Iteration 12: READ tests/test_utils.py → 80 lines
Iteration 13: WRITE tests/test_utils.py → Updated expected values
Iteration 14: RUN pytest tests/ → All pass
Iteration 15: RUN rm -rf /tmp/cache → Cleared a cache directory

Write a 500-word analysis.

Exercise 14: Autonomy Level Assessment

Bloom's Level: Analyze

For each of the following tasks, recommend an appropriate autonomy level (0--4 from Section 36.1) and justify your recommendation:

a) Formatting code according to a style guide
b) Implementing a new payment processing feature
c) Updating documentation to reflect API changes
d) Fixing a security vulnerability in an authentication module
e) Refactoring a module to use a new design pattern
f) Adding logging statements to existing functions

Exercise 15: Workflow Decomposition

Bloom's Level: Analyze

Decompose the following task into a hierarchical plan suitable for an agent. Identify which steps can be parallelized and which must be sequential:

"Migrate the user authentication system from session-based authentication to JWT tokens. The system currently uses Flask-Login with server-side sessions stored in Redis. The new system should use JWTs with refresh tokens, maintain backward compatibility during the migration period, and include comprehensive tests."

Exercise 16: Failure Mode Analysis

Bloom's Level: Analyze

For the "Issue-to-PR" workflow described in Section 36.4, identify at least eight potential failure modes. For each, describe: - What could go wrong - How the agent should detect the failure - What recovery strategy is appropriate - Whether human intervention is needed

Exercise 17: Memory Strategy Comparison

Bloom's Level: Analyze

Compare the following memory strategies for an agent working on a large codebase (500+ files):

a) Keep the full conversation history in context
b) Use summarization to compress older messages
c) Use a project knowledge base (CLAUDE.md) plus minimal conversation history
d) Use a vector database to retrieve relevant context on demand

For each strategy, analyze the tradeoffs in terms of: accuracy, cost, latency, and scalability.

Exercise 18: Guardrail Gap Analysis

Bloom's Level: Analyze

Review the following guardrail configuration and identify at least five gaps or weaknesses:

GUARDRAILS = {
    "blocked_commands": ["rm -rf /", "sudo rm"],
    "allowed_extensions": [".py", ".js", ".md"],
    "max_iterations": 100,
    "max_file_size_bytes": 1_000_000,
    "allowed_paths": ["/home/user/project/"]
}

For each gap, explain the risk and propose a fix.
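
One mechanical way to start the gap hunt is to probe the blocklist with near-miss variants. The sketch below assumes the commands are enforced by naive substring matching, a common first-pass implementation (the actual enforcement code is not shown in the exercise, so treat this as an assumption):

```python
GUARDRAILS = {
    "blocked_commands": ["rm -rf /", "sudo rm"],
    "allowed_extensions": [".py", ".js", ".md"],
    "max_iterations": 100,
    "max_file_size_bytes": 1_000_000,
    "allowed_paths": ["/home/user/project/"],
}


def command_blocked(cmd: str) -> bool:
    # Naive substring check -- exactly the kind of enforcement whose
    # near-miss variants your analysis should probe.
    return any(blocked in cmd for blocked in GUARDRAILS["blocked_commands"])


# Probe with variants of the blocked commands and see which slip through.
probes = ["rm -rf /", "rm -rf ~", "rm -rf .", "sudo /bin/rm -r project"]
verdicts = {p: command_blocked(p) for p in probes}
```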


Tier 4: Synthesis and Evaluation (Exercises 19--24)

Exercise 19: Design a Code Review Agent

Bloom's Level: Create

Design a complete code review agent that:

  • Reads a pull request diff
  • Analyzes each change for bugs, security issues, performance problems, and style violations
  • Generates inline comments at specific locations in the code
  • Provides an overall summary with a recommendation (approve, request changes, or comment)

Write the tool definitions, the prompt template for the LLM, and the main agent loop. You do not need to implement the LLM call itself.

Exercise 20: Build a Test Generator Agent

Bloom's Level: Create

Build an agent that:

  • Receives a Python source file
  • Analyzes the functions and classes in the file
  • Generates pytest test cases for each function
  • Runs the tests
  • Iterates to fix any tests that fail due to incorrect expectations

Implement the complete agent with at least three tools (read_file, write_file, run_tests). Include guardrails that prevent the agent from modifying the source file (it should only write test files).

Exercise 21: Evaluate Agent Strategies

Bloom's Level: Evaluate

Design an experiment to compare two agent planning strategies:

  • Strategy A: Plan the entire task upfront, then execute all steps
  • Strategy B: Plan one step at a time, observing results before planning the next

Define:

  • A set of at least five test tasks of varying complexity
  • Metrics to compare the strategies (completion rate, quality, efficiency, cost)
  • Your hypotheses about which strategy will perform better, and why
  • How you would analyze the results

Exercise 22: Agent Safety Audit

Bloom's Level: Evaluate

Conduct a safety audit of the simple coding agent built in Section 36.10. Your audit should:

  • Identify at least ten potential safety issues
  • Rate each issue by severity (critical, high, medium, low)
  • Propose a mitigation for each issue
  • Prioritize the mitigations by implementation order
  • Estimate the engineering effort for each mitigation

Exercise 23: Design an Agent Evaluation Framework

Bloom's Level: Create

Design an evaluation framework for coding agents that includes:

  • A task taxonomy (at least five task categories)
  • Metrics for each category
  • A scoring rubric
  • A process for creating and maintaining the evaluation benchmark
  • Statistical methods for comparing agent performance across runs

Write the framework as a specification document (500-800 words).

Exercise 24: Human-in-the-Loop Protocol

Bloom's Level: Create

Design a human-in-the-loop protocol for a coding agent used by a team of five developers. Your protocol should define:

  • What actions require approval and from whom
  • How approval requests are communicated (Slack, email, in-tool)
  • Maximum wait time for approvals
  • What happens when the approver is unavailable
  • How the protocol differs for different risk levels
  • How the protocol evolves as trust in the agent increases


Tier 5: Open-Ended and Research (Exercises 25--30)

Exercise 25: Multi-Agent System Design

Bloom's Level: Create

Design a multi-agent system where three agents collaborate to complete a feature request:

  • Architect Agent: Analyzes requirements and designs the solution
  • Developer Agent: Implements the code
  • Reviewer Agent: Reviews the implementation and requests changes

Define the communication protocol between agents, the tools each agent needs, and the workflow for handling disagreements between agents. Implement the communication protocol as a Python class.
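
A starting-point sketch for the protocol layer; the message fields and agent names below are illustrative choices, not requirements of the exercise:

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class AgentMessage:
    """One possible message shape for inter-agent communication."""
    sender: str      # e.g. "architect", "developer", "reviewer"
    recipient: str
    kind: str        # e.g. "design", "implementation", "review", "objection"
    payload: dict = field(default_factory=dict)


class MessageBus:
    """Minimal in-memory bus: each agent polls its own FIFO queue."""

    def __init__(self):
        self.queues: dict = {}

    def send(self, msg: AgentMessage) -> None:
        self.queues.setdefault(msg.recipient, []).append(msg)

    def receive(self, agent: str) -> Optional[AgentMessage]:
        queue = self.queues.get(agent, [])
        return queue.pop(0) if queue else None
```

A synchronous in-memory bus keeps the exercise tractable; your disagreement-handling workflow can be layered on top as a message kind (e.g. an "objection" that routes back to the Architect Agent).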

Exercise 26: Adaptive Guardrails

Bloom's Level: Create

Design and implement an adaptive guardrail system that:

  • Starts with strict permissions
  • Tracks the agent's behavior over time
  • Gradually relaxes permissions for actions the agent has consistently used safely
  • Tightens permissions if the agent triggers a guardrail
  • Maintains an audit log of all permission changes

Implement this as a Python class with at least five unit tests.

Exercise 27: Agent Memory Architecture

Bloom's Level: Create

Design and implement a three-tier memory system for a coding agent:

  • Tier 1: Working memory (current context window)
  • Tier 2: Session memory (persisted within a task as a JSON file)
  • Tier 3: Project memory (persisted across tasks as a knowledge base)

Include methods for promoting information from lower tiers to higher tiers (e.g., a key finding during a task gets promoted to the project knowledge base).

Exercise 28: Benchmark Creation

Bloom's Level: Create

Create a benchmark of ten coding tasks for evaluating agent performance. Each task should include:

  • A natural language description
  • A starter repository (file structure and contents)
  • A test suite that the agent's solution must pass
  • A difficulty rating (easy, medium, hard)
  • Expected metrics (iterations, tokens, time)

Implement the benchmark runner as a Python script that executes each task, runs the test suite, and reports results.

Exercise 29: Research Analysis

Bloom's Level: Evaluate

Read three recent papers or technical blog posts on coding agents (from 2024-2025). For each:

  • Summarize the key contribution in 100 words
  • Identify the agent architecture used
  • Evaluate the strengths and limitations of their approach
  • Compare their approach with the principles discussed in this chapter

Write a 1000-word comparative analysis.

Exercise 30: Build a Production-Ready Agent Feature

Bloom's Level: Create

Choose one of the following features and implement it as a production-quality Python module with full test coverage:

a) Context Window Manager: Automatically manages the context window by summarizing old messages, tracking token usage, and ensuring the most relevant information is always in context.

b) Agent Debugger: A tool that replays agent traces step-by-step, allows you to inspect the agent's reasoning at each step, and identifies where the agent's reasoning diverged from the optimal path.

c) Agent Cost Optimizer: Analyzes agent traces to identify wasteful patterns (unnecessary file reads, redundant tool calls, overly broad searches) and suggests optimizations.

Include type hints, docstrings, error handling, and at least ten unit tests.