Chapter 32 Exercises: AI Agents and Tool Use
Section 1: The Agent Paradigm and ReAct Pattern
Exercise 32.1: Basic ReAct Agent
Implement a minimal ReAct agent that uses a search tool (simulated with a dictionary lookup) and a calculate tool to answer multi-step questions. The agent should produce explicit Thought/Action/Observation traces. Test it on: "What is the population of France multiplied by 3?"
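A minimal starting skeleton is sketched below. The llm() callable, the KNOWLEDGE dictionary, and the Action-parsing regex are placeholder assumptions; swap in your own model call and data.

import ast
import operator
import re

KNOWLEDGE = {"population of france": "About 68 million people (68,000,000)."}

def search(query: str) -> str:
    # Simulated search: return the first knowledge entry whose key appears in the query.
    for key, value in KNOWLEDGE.items():
        if key in query.lower():
            return value
    return "No results found."

def calculate(expression: str) -> str:
    # Evaluate simple arithmetic like "68000000 * 3" without using eval().
    ops = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
    def ev(node):
        if isinstance(node, ast.Constant):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in ops:
            return ops[type(node.op)](ev(node.left), ev(node.right))
        raise ValueError("unsupported expression")
    return str(ev(ast.parse(expression, mode="eval").body))

TOOLS = {"search": search, "calculate": calculate}

def react_loop(question: str, llm, max_steps: int = 6) -> str:
    # llm(prompt) -> str should return the next "Thought: ...\nAction: tool(arg)"
    # continuation, or "Action: finish(answer)". It is intentionally left abstract here.
    trace = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(trace)
        trace += step + "\n"
        match = re.search(r"Action.*?:\s*(\w+)\((.*)\)", step, re.S)
        if not match:
            continue
        tool, arg = match.group(1), match.group(2).strip().strip('"')
        if tool == "finish":
            return arg
        result = TOOLS.get(tool, lambda a: f"Unknown tool: {tool}")(arg)
        trace += f"Observation: {result}\n"
    return "No answer within step budget."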
Exercise 32.2: Trace Analysis
Given the following ReAct trace, identify the error and explain how the agent should recover:
Question: What year was the Eiffel Tower completed?
Thought 1: I need to search for information about the Eiffel Tower.
Action 1: search("Eiffel Tower height")
Observation 1: The Eiffel Tower is 330 meters tall.
Thought 2: The tower is 330 meters tall. The answer is 330.
Action 2: finish("330")
Rewrite the trace to correctly answer the question.
Exercise 32.3: ReAct vs. Chain-of-Thought Comparison
Implement both a chain-of-thought (reasoning only, no tools) and a ReAct agent for answering the question: "Which country has a larger GDP: the country where the Mona Lisa is displayed, or the country where sushi originated?" Compare the outputs and discuss when each approach is preferable.
Exercise 32.4: Observation Parsing
Write a robust observation parser that can handle multiple tool output formats: plain text, JSON, error messages, and empty responses. The parser should normalize all outputs into a consistent format with fields: status (success/error), content (string), and metadata (dict). Include error handling for malformed outputs.
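A sketch of the normalization logic, assuming tool outputs arrive as Python values or strings (possibly JSON-encoded) and that plain-text outputs beginning with an error marker should be treated as failures; the error heuristics are an assumption to tune per tool.

import json
from typing import Any

def parse_observation(raw: Any) -> dict:
    # Normalize any tool output into {"status", "content", "metadata"}.
    if raw is None or (isinstance(raw, str) and not raw.strip()):
        return {"status": "error", "content": "Empty tool output", "metadata": {}}
    if isinstance(raw, (dict, list)):
        return {"status": "success", "content": json.dumps(raw), "metadata": {"format": "json"}}
    text = str(raw).strip()
    # Try JSON first; fall back to plain text.
    try:
        parsed = json.loads(text)
        return {"status": "success", "content": json.dumps(parsed), "metadata": {"format": "json"}}
    except (json.JSONDecodeError, ValueError):
        pass
    if text.lower().startswith(("error", "exception", "traceback")):
        return {"status": "error", "content": text, "metadata": {"format": "text"}}
    return {"status": "success", "content": text, "metadata": {"format": "text"}}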
Exercise 32.5: ReAct with Backtracking
Extend a basic ReAct agent to support backtracking: if the agent reaches a dead end (e.g., a search returns no results), it should be able to revise its plan and try an alternative approach. Implement a max_backtracks parameter and demonstrate the agent recovering from a failed search.
Section 2: Function Calling and Tool Use
Exercise 32.6: Tool Schema Design
Design JSON schemas for the following five tools: (a) a weather API that supports multiple locations and units, (b) a calculator that handles arithmetic expressions, (c) a file reader that supports multiple formats, (d) a database query tool that accepts SQL, and (e) a translation tool that supports 10 languages. Each schema should include comprehensive descriptions, parameter types, enums, and required fields.
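For orientation, here is one way the weather tool's schema might look in an OpenAI-style function-calling shape; the exact field names depend on the provider, so treat this as an illustrative example rather than the required answer.

weather_tool_schema = {
    "name": "get_weather",
    "description": "Get current weather conditions for one or more locations.",
    "parameters": {
        "type": "object",
        "properties": {
            "locations": {
                "type": "array",
                "items": {"type": "string"},
                "description": "City names, e.g. ['Paris', 'Tokyo'].",
            },
            "units": {
                "type": "string",
                "enum": ["celsius", "fahrenheit"],
                "description": "Temperature units for the response.",
            },
        },
        "required": ["locations"],
    },
}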
Exercise 32.7: Tool Dispatcher
Build a tool dispatcher class that: (a) registers tools with their schemas and handler functions, (b) validates incoming tool call arguments against schemas, (c) executes the appropriate handler, (d) returns structured results with execution time and status. Test with at least three different tools.
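A possible skeleton for the dispatcher, assuming schemas in the JSON-Schema-like shape shown above and using only hand-rolled validation; a fuller implementation might lean on jsonschema or pydantic instead.

import time

class ToolDispatcher:
    def __init__(self):
        self._tools = {}  # name -> (schema, handler)

    def register(self, schema: dict, handler) -> None:
        self._tools[schema["name"]] = (schema, handler)

    def _validate(self, schema: dict, args: dict) -> list[str]:
        # Minimal validation: required fields present, no unknown fields.
        props = schema["parameters"]["properties"]
        required = schema["parameters"].get("required", [])
        errors = [f"missing required argument: {r}" for r in required if r not in args]
        errors += [f"unknown argument: {a}" for a in args if a not in props]
        return errors

    def dispatch(self, name: str, args: dict) -> dict:
        if name not in self._tools:
            return {"status": "error", "error": f"unknown tool: {name}"}
        schema, handler = self._tools[name]
        errors = self._validate(schema, args)
        if errors:
            return {"status": "error", "error": "; ".join(errors)}
        start = time.perf_counter()
        try:
            result = handler(**args)
            status = "success"
        except Exception as exc:  # surface tool failures as structured results
            result, status = str(exc), "error"
        return {"status": status, "result": result,
                "elapsed_s": round(time.perf_counter() - start, 4)}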
Exercise 32.8: Parallel Tool Calls
Implement an agent that detects when multiple tool calls are independent and executes them in parallel using asyncio. Compare the latency of sequential vs. parallel execution for the query: "What is the weather in New York, London, and Tokyo?" where each tool call takes 1 second.
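A sketch of the timing comparison, assuming each weather lookup is an async stub that sleeps for one second in place of a real API call.

import asyncio
import time

async def get_weather(city: str) -> str:
    await asyncio.sleep(1)  # stands in for a real 1-second API call
    return f"Weather in {city}: 20 degrees C, clear."

async def sequential(cities):
    return [await get_weather(c) for c in cities]

async def parallel(cities):
    return await asyncio.gather(*(get_weather(c) for c in cities))

def compare():
    cities = ["New York", "London", "Tokyo"]
    for runner in (sequential, parallel):
        start = time.perf_counter()
        asyncio.run(runner(cities))
        print(f"{runner.__name__}: {time.perf_counter() - start:.2f}s")

# compare()  # expected: roughly 3s sequential vs roughly 1s parallel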
Exercise 32.9: Tool Error Recovery
Create a tool wrapper that implements retry logic with exponential backoff, fallback tools, and graceful degradation. If get_weather fails after 3 retries, the wrapper should try get_weather_backup, and if that also fails, return a structured error with suggestions for the agent.
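One way to structure the wrapper, assuming the primary and backup tools are plain callables and that a structured-error dict is the acceptable last resort; get_weather and get_weather_backup in the usage comment are the tools named in the exercise, not real APIs.

import time

def with_recovery(primary, backup=None, max_retries: int = 3, base_delay: float = 0.5):
    # Returns a callable that retries `primary` with exponential backoff,
    # then falls back to `backup`, then returns a structured error.
    def call(*args, **kwargs):
        last_error = "no attempts made"
        for attempt in range(max_retries):
            try:
                return {"status": "success", "result": primary(*args, **kwargs)}
            except Exception as exc:
                last_error = str(exc)
                time.sleep(base_delay * (2 ** attempt))
        if backup is not None:
            try:
                return {"status": "success", "result": backup(*args, **kwargs),
                        "note": "primary failed, used backup"}
            except Exception as exc:
                last_error = str(exc)
        return {"status": "error", "error": last_error,
                "suggestions": ["try again later", "ask the user for the data directly"]}
    return call

# wrapped = with_recovery(get_weather, backup=get_weather_backup)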
Exercise 32.10: Dynamic Tool Generation
Build a system where the agent can create new tools at runtime. Given a natural language description of a tool (e.g., "A tool that converts temperatures between Celsius and Fahrenheit"), the system should generate the Python function, register it as a tool, and make it available for subsequent agent calls.
Section 3: Planning and Task Decomposition
Exercise 32.11: Plan Generation
Implement a planning agent that takes a complex task description and produces a structured plan as a list of steps with dependencies. The plan should include: step ID, description, dependencies (list of step IDs), estimated tool calls, and priority. Test with: "Research the top 3 AI papers of 2024, summarize each, and write a comparative analysis."
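A plan-step structure and a toy plan for the research task might look like this; the field names are one reasonable choice, not the only one.

from dataclasses import dataclass, field

@dataclass
class PlanStep:
    step_id: str
    description: str
    dependencies: list[str] = field(default_factory=list)
    estimated_tool_calls: int = 1
    priority: int = 1  # lower number = higher priority

plan = [
    PlanStep("s1", "Search for the top 3 AI papers of 2024", estimated_tool_calls=3),
    PlanStep("s2", "Summarize each paper", dependencies=["s1"], estimated_tool_calls=3),
    PlanStep("s3", "Write a comparative analysis", dependencies=["s2"], priority=2),
]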
Exercise 32.12: Plan Execution Engine
Build a plan execution engine that takes a structured plan (from Exercise 32.11) and executes it, respecting dependencies. Steps with no unmet dependencies should execute in parallel. Track execution status (pending, running, completed, failed) for each step and handle failures by marking dependent steps as blocked.
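A sketch of the dependency-aware scheduling loop, assuming the PlanStep structure from the previous exercise and a run_step() stub standing in for real tool execution; the ready steps are run one at a time here, whereas the full solution would dispatch them in parallel.

def execute_plan(plan, run_step):
    # run_step(step) -> True on success, False on failure (a stub for real execution).
    status = {s.step_id: "pending" for s in plan}
    progressed = True
    while progressed:
        progressed = False
        for step in plan:
            if status[step.step_id] != "pending":
                continue
            dep_states = [status[d] for d in step.dependencies]
            if any(d in ("failed", "blocked") for d in dep_states):
                status[step.step_id] = "blocked"   # propagate failure downstream
                progressed = True
            elif all(d == "completed" for d in dep_states):
                status[step.step_id] = "running"   # ready steps could run concurrently here
                status[step.step_id] = "completed" if run_step(step) else "failed"
                progressed = True
    return status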
Exercise 32.13: Adaptive Planning
Implement a plan-and-execute agent that revises its plan after each step. After executing step N, the agent should evaluate whether the remaining plan is still valid given the results of step N. If not, it should generate a revised plan. Demonstrate with a research task where an initial search reveals an unexpected topic that requires plan adjustment.
Exercise 32.14: Plan Evaluation
Write an evaluator that scores a plan on four dimensions: (a) completeness (does it cover all aspects of the task?), (b) efficiency (minimal redundant steps), (c) feasibility (all steps are achievable with available tools), and (d) ordering (dependencies are correctly specified). Test with both well-formed and flawed plans.
Exercise 32.15: Hierarchical Task Decomposition
Implement a recursive task decomposition system where complex tasks are broken into subtasks, and each subtask is further decomposed until reaching atomic actions (single tool calls). Visualize the resulting task tree. Limit recursion depth to 3 levels and set a maximum of 5 subtasks per level.
Section 4: Memory Systems
Exercise 32.16: Conversation Summarization
Implement a context window manager that monitors token usage and automatically summarizes older messages when the context approaches a configurable limit (e.g., 4096 tokens). Use a sliding window approach where messages older than N turns are summarized. Compare the agent's performance with and without summarization on a 20-turn conversation.
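A sketch of the trigger logic, assuming a crude token estimate of len(text) // 4 and an LLM-backed summarize() callable that is left as a stub.

def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough heuristic; swap in a real tokenizer

class ContextWindowManager:
    def __init__(self, summarize, max_tokens: int = 4096, keep_recent: int = 6):
        self.summarize = summarize      # summarize(list_of_messages) -> str (LLM stub)
        self.max_tokens = max_tokens
        self.keep_recent = keep_recent
        self.messages: list[dict] = []  # {"role": ..., "content": ...}

    def add(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        if self._total_tokens() > self.max_tokens:
            self._compress()

    def _total_tokens(self) -> int:
        return sum(estimate_tokens(m["content"]) for m in self.messages)

    def _compress(self) -> None:
        # Summarize everything older than the last `keep_recent` messages.
        old, recent = self.messages[:-self.keep_recent], self.messages[-self.keep_recent:]
        if not old:
            return
        summary = self.summarize(old)
        self.messages = [{"role": "system",
                          "content": f"Summary of earlier turns: {summary}"}] + recent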
Exercise 32.17: Vector-Based Long-Term Memory
Build a long-term memory system using a vector store (FAISS or ChromaDB) that: (a) stores agent experiences as embeddings, (b) retrieves relevant memories given a query, (c) scores memories by a weighted combination of recency, relevance, and importance. Test by storing 50 memories and querying with 5 different contexts.
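The combined retrieval score is usually a weighted sum; a sketch of that piece, assuming cosine similarity is computed elsewhere and passed in as relevance, recency decays exponentially, and each memory stores an importance value (the vector-store plumbing is omitted).

import math
import time

def memory_score(memory: dict, relevance: float, now: float | None = None,
                 w_rel: float = 0.5, w_rec: float = 0.3, w_imp: float = 0.2,
                 half_life_s: float = 3600.0) -> float:
    # memory = {"created_at": unix_time, "importance": 0..1, ...}
    # relevance = cosine similarity of the query and memory embeddings, 0..1.
    now = now or time.time()
    age = max(0.0, now - memory["created_at"])
    recency = math.exp(-math.log(2) * age / half_life_s)  # 1.0 when fresh, halves every half_life_s
    return w_rel * relevance + w_rec * recency + w_imp * memory["importance"]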
Exercise 32.18: Working Memory Scratchpad
Implement a structured scratchpad that the agent uses to track intermediate results during multi-step tasks. The scratchpad should support: adding key-value pairs, updating existing entries, querying by key, and serializing to a compact string for inclusion in the prompt. Demonstrate on a data analysis task with 5 intermediate steps.
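A minimal scratchpad shape, under the assumption that values are short strings or numbers that fit comfortably in the prompt.

class Scratchpad:
    def __init__(self):
        self._entries: dict[str, str] = {}

    def set(self, key: str, value) -> None:
        self._entries[key] = str(value)   # add or update

    def get(self, key: str, default: str = "") -> str:
        return self._entries.get(key, default)

    def render(self) -> str:
        # Compact serialization for inclusion in the prompt.
        lines = [f"- {k}: {v}" for k, v in self._entries.items()]
        return "Scratchpad:\n" + "\n".join(lines) if lines else "Scratchpad: (empty)"

# pad = Scratchpad(); pad.set("rows_loaded", 10432); print(pad.render())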
Exercise 32.19: Memory Consolidation
Build a memory consolidation system that periodically processes short-term memories and extracts durable facts for long-term storage. For example, after a conversation about a user's preferences, the system should extract "User prefers Python over JavaScript" and store it as a long-term memory. Use an LLM to perform the extraction.
Exercise 32.20: Memory-Augmented Agent
Combine the components from Exercises 32.16-32.19 into a complete memory-augmented agent. The agent should use working memory for the current task, retrieve from long-term memory when relevant, and consolidate new learnings after each task. Evaluate on a series of 5 related tasks to show memory improving performance over time.
Section 5: Multi-Agent Systems
Exercise 32.21: Two-Agent Debate
Implement a debate system with two agents: a "proponent" and an "opponent." Given a proposition (e.g., "AI will replace most software engineering jobs within 10 years"), each agent takes turns presenting arguments and counterarguments for 3 rounds. A "judge" agent evaluates the arguments and declares a winner with reasoning.
Exercise 32.22: Hierarchical Multi-Agent System
Build a manager-worker multi-agent system where: (a) a manager agent receives a complex task, (b) decomposes it into subtasks, (c) assigns each subtask to a specialized worker agent, (d) collects results, and (e) synthesizes a final output. Implement at least 3 worker agents with different specializations (e.g., research, analysis, writing).
Exercise 32.23: Agent Communication Protocol
Design and implement a message-passing protocol for multi-agent systems. Each message should include: sender, recipient, message type (request, response, broadcast, error), content, and metadata (timestamp, priority, thread ID). Build a message router that delivers messages to the correct agent and handles message queuing.
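A sketch of the message format and a minimal router, assuming agents are registered by name and delivery works by appending to per-agent queues; the "*" recipient convention for broadcasts is an assumption of this sketch.

from dataclasses import dataclass, field
from collections import deque
import time
import uuid

@dataclass
class Message:
    sender: str
    recipient: str              # agent name, or "*" for broadcast
    msg_type: str               # "request" | "response" | "broadcast" | "error"
    content: str
    metadata: dict = field(default_factory=lambda: {
        "timestamp": time.time(), "priority": 1, "thread_id": str(uuid.uuid4())})

class MessageRouter:
    def __init__(self):
        self.queues: dict[str, deque] = {}

    def register(self, agent_name: str) -> None:
        self.queues.setdefault(agent_name, deque())

    def send(self, msg: Message) -> None:
        targets = list(self.queues) if msg.recipient == "*" else [msg.recipient]
        for name in targets:
            if name in self.queues and name != msg.sender:
                self.queues[name].append(msg)

    def receive(self, agent_name: str):
        q = self.queues.get(agent_name)
        return q.popleft() if q else None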
Exercise 32.24: Consensus Mechanism
Implement a consensus mechanism where 3 agents independently answer the same question, then vote on the best answer. If no consensus is reached (all 3 answers differ), initiate a discussion round where agents explain their reasoning and vote again. Track how often consensus improves answer quality vs. a single agent.
Exercise 32.25: Agent Specialization Benchmark
Create a benchmark that compares: (a) a single general-purpose agent, (b) a team of 3 specialized agents, and (c) a team of 3 specialized agents with a manager, on a set of 10 diverse tasks (mix of research, coding, analysis, and writing). Measure task completion rate, quality (scored 1-5 by an LLM judge), total token cost, and latency.
Section 6: Safety, Evaluation, and Production
Exercise 32.26: Safety Guardrails
Implement a safety layer that wraps an agent's tool calls with the following guardrails: (a) action allowlist (only permitted tool calls pass through), (b) parameter validation (block dangerous parameter values such as "rm -rf /"), (c) rate limiting (max 10 tool calls per minute), (d) confirmation prompts for destructive actions. Test with both safe and unsafe action sequences.
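A sketch of the wrapper, assuming the underlying dispatch is a callable like the dispatcher from Exercise 32.7; the dangerous-pattern list and the set of destructive tool names are illustrative assumptions, not a complete policy.

import time

class SafetyLayer:
    DANGEROUS_PATTERNS = ("rm -rf", "drop table", "sudo ", "format c:")  # illustrative only

    def __init__(self, dispatch, allowlist: set, max_calls_per_minute: int = 10,
                 confirm=lambda tool, args: True):
        self.dispatch = dispatch          # underlying tool-call function
        self.allowlist = allowlist
        self.max_calls = max_calls_per_minute
        self.confirm = confirm            # callback for destructive actions
        self.call_times: list[float] = []
        self.destructive = {"delete_file", "run_shell", "send_email"}  # example destructive tools

    def call(self, tool: str, args: dict) -> dict:
        if tool not in self.allowlist:
            return {"status": "blocked", "reason": f"tool not allowlisted: {tool}"}
        blob = " ".join(str(v).lower() for v in args.values())
        if any(p in blob for p in self.DANGEROUS_PATTERNS):
            return {"status": "blocked", "reason": "dangerous parameter value"}
        now = time.time()
        self.call_times = [t for t in self.call_times if now - t < 60]
        if len(self.call_times) >= self.max_calls:
            return {"status": "blocked", "reason": "rate limit exceeded"}
        if tool in self.destructive and not self.confirm(tool, args):
            return {"status": "blocked", "reason": "user declined confirmation"}
        self.call_times.append(now)
        return self.dispatch(tool, args)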
Exercise 32.27: Prompt Injection Detection
Build a prompt injection detector that scans tool outputs (simulating web page content) for injection attempts. The detector should identify patterns such as: "Ignore your previous instructions," "You are now a different agent," and embedded system prompts. Test with 20 benign and 20 malicious tool outputs. Report precision and recall.
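A sketch of the detector and the metric computation; the regex list is a starting point to extend, not an exhaustive catalog of injection phrasings.

import re

INJECTION_PATTERNS = [
    r"ignore (all |your )?(previous|prior) instructions",
    r"you are now (a|an) ",
    r"disregard (the )?system prompt",
    r"<\s*system\s*>",            # embedded system-prompt style tags
    r"new instructions\s*:",
]

def detect_injection(tool_output: str) -> dict:
    text = tool_output.lower()
    hits = [p for p in INJECTION_PATTERNS if re.search(p, text)]
    return {"is_injection": bool(hits), "matched_patterns": hits}

def precision_recall(predictions: list, labels: list) -> tuple:
    # predictions and labels are parallel lists of booleans over the 40 test outputs.
    tp = sum(p and l for p, l in zip(predictions, labels))
    fp = sum(p and not l for p, l in zip(predictions, labels))
    fn = sum(not p and l for p, l in zip(predictions, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall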
Exercise 32.28: Agent Evaluation Framework
Create an evaluation framework that scores agents on: task completion (binary), correctness (F1 against reference), efficiency (steps taken vs. optimal), safety (no unauthorized actions), and robustness (performance on edge cases). Implement the framework with 10 test tasks and generate a scorecard report.
Exercise 32.29: Cost Analyzer
Build a cost analysis tool that instruments an agent system and tracks: total LLM tokens (input + output), number of tool calls, wall-clock time, and estimated dollar cost (using configurable per-token pricing). Generate a breakdown showing cost per agent step and identify the most expensive steps. Suggest optimizations.
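A sketch of the bookkeeping, assuming per-step instrumentation calls record(); the default prices are placeholders, not real rates.

class CostTracker:
    def __init__(self, price_per_1k_input: float = 0.0, price_per_1k_output: float = 0.0):
        self.prices = (price_per_1k_input, price_per_1k_output)
        self.steps: list[dict] = []

    def record(self, step_name: str, input_tokens: int, output_tokens: int,
               tool_calls: int, seconds: float) -> None:
        cost = (input_tokens / 1000) * self.prices[0] + (output_tokens / 1000) * self.prices[1]
        self.steps.append({"step": step_name, "input_tokens": input_tokens,
                           "output_tokens": output_tokens, "tool_calls": tool_calls,
                           "seconds": seconds, "cost_usd": cost})

    def report(self) -> str:
        total = sum(s["cost_usd"] for s in self.steps)
        lines = [f"{s['step']}: ${s['cost_usd']:.4f} ({s['seconds']:.1f}s, "
                 f"{s['tool_calls']} tool calls)" for s in self.steps]
        lines.append(f"Total: ${total:.4f}")
        if self.steps:
            most = max(self.steps, key=lambda s: s["cost_usd"])
            lines.append(f"Most expensive step: {most['step']}")
        return "\n".join(lines)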
Exercise 32.30: End-to-End Agent System
Build a complete agent system that combines all concepts from this chapter: (a) ReAct reasoning loop, (b) function calling with at least 5 tools, (c) planning for complex tasks, (d) short-term and long-term memory, (e) safety guardrails, and (f) cost tracking. Deploy it as a FastAPI endpoint with streaming responses and evaluate on 5 diverse tasks. Document the architecture, trade-offs, and lessons learned.