Chapter 32: Further Reading
Foundational Papers
- Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. (2023). "ReAct: Synergizing Reasoning and Acting in Language Models." ICLR 2023. The paper that established the ReAct pattern, demonstrating that interleaving reasoning traces with actions outperforms both chain-of-thought and action-only approaches on knowledge-intensive and decision-making tasks. Essential reading for understanding the foundation of modern agent design; a minimal sketch of the loop appears after this list. https://arxiv.org/abs/2210.03629
- Schick, T., Dwivedi-Yu, J., Dessì, R., et al. (2023). "Toolformer: Language Models Can Teach Themselves to Use Tools." NeurIPS 2023. Demonstrates that language models can be fine-tuned to decide when and how to use external tools (calculator, search, calendar, etc.) by learning from self-generated examples. Shows that tool use can be learned rather than only prompt-engineered. https://arxiv.org/abs/2302.04761
- Wei, J., Wang, X., Schuurmans, D., et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." NeurIPS 2022. The foundational chain-of-thought paper that precedes and motivates the ReAct pattern. Understanding CoT is essential for understanding why explicit reasoning traces improve agent performance. https://arxiv.org/abs/2201.11903
- Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K., and Yao, S. (2023). "Reflexion: Language Agents with Verbal Reinforcement Learning." NeurIPS 2023. Introduces self-reflection as a mechanism for agents to improve from experience, storing verbal feedback in memory to guide future attempts. A key paper for understanding agent learning and self-improvement. https://arxiv.org/abs/2303.11366
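The ReAct pattern is compact enough to sketch directly. The following is a minimal, illustrative loop, not any framework's implementation: call_llm is a hypothetical stand-in for a chat-completion call, the two tools are stubs, and the Thought/Action/Observation transcript format follows the paper.

```python
import re

def call_llm(prompt: str) -> str:
    """Hypothetical LLM call; swap in a real client (openai, anthropic, ...)."""
    raise NotImplementedError

TOOLS = {
    "search": lambda q: f"(stub) top search result for {q!r}",
    "calculate": lambda e: str(eval(e, {"__builtins__": {}})),  # demo only; never eval untrusted input
}

def react(question: str, max_steps: int = 5) -> str:
    # The transcript accumulates Thought/Action/Observation lines, as in the paper.
    prompt = (
        "Interleave Thought, Action, and Observation lines to answer.\n"
        "Available actions: search[query], calculate[expr], finish[answer].\n"
        f"Question: {question}\n"
    )
    for _ in range(max_steps):
        step = call_llm(prompt)  # expected to end with "Action: tool[input]"
        prompt += step + "\n"
        m = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
        if m is None:
            continue
        tool, arg = m.groups()
        if tool == "finish":
            return arg  # the model's final answer
        observation = TOOLS.get(tool, lambda _: f"unknown tool: {tool}")(arg)
        prompt += f"Observation: {observation}\n"  # feed the result back into the trace
    return "No answer within the step budget."
```

Production frameworks add structured tool schemas, retries, and guardrails, but the control flow is essentially this loop.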
Multi-Agent Systems
- Du, Y., Li, S., Torralba, A., Tenenbaum, J. B., and Mordatch, I. (2023). "Improving Factuality and Reasoning in Language Models through Multiagent Debate." arXiv preprint arXiv:2305.14325. Demonstrates that multiple LLM instances debating and refining each other's responses produce more factual and well-reasoned outputs than a single instance. Provides theoretical grounding for multi-agent debate architectures; a schematic of the debate loop appears after this list. https://arxiv.org/abs/2305.14325
- Park, J. S., O'Brien, J. C., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. (2023). "Generative Agents: Interactive Simulacra of Human Behavior." UIST 2023. A landmark paper that creates a simulated town populated by 25 LLM-powered agents with memory, reflection, and social interaction capabilities. Demonstrates the potential of multi-agent systems for complex emergent behavior.
- Li, G., Hammoud, H., Itani, H., Khizbullin, D., and Ghanem, B. (2023). "CAMEL: Communicative Agents for 'Mind' Exploration of Large Language Model Society." NeurIPS 2023. Explores role-playing between communicative agents and demonstrates how structured agent-to-agent communication can solve complex tasks through collaborative problem-solving.
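The debate procedure of Du et al. reduces to a similarly short loop: agents answer independently, then each revises after reading the others' answers. The sketch below again assumes a hypothetical call_llm helper and omits the final aggregation step (majority vote or a judge prompt).

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in; replace with a real chat-completion call."""
    raise NotImplementedError

def debate(question: str, n_agents: int = 3, n_rounds: int = 2) -> list[str]:
    # Round 0: each agent answers independently.
    answers = [call_llm(f"Answer concisely: {question}") for _ in range(n_agents)]
    for _ in range(n_rounds):
        revised = []
        for i, own in enumerate(answers):
            others = "\n".join(a for j, a in enumerate(answers) if j != i)
            revised.append(call_llm(
                f"Question: {question}\n"
                f"Your previous answer: {own}\n"
                f"Other agents answered:\n{others}\n"
                "Considering these, give an updated answer, correcting any errors."
            ))
        answers = revised
    return answers  # aggregate e.g. by majority vote or a final judge prompt
```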
Agent Frameworks and Tools
- Chase, H. (2022). "LangChain: Building Applications with LLMs through Composability." https://langchain.com/. The most widely adopted framework for building LLM applications and agents. Documentation includes extensive tutorials on tool use, agent loops, memory, and chain composition.
- LangGraph Documentation. https://langchain-ai.github.io/langgraph/. Built on LangChain, LangGraph provides stateful, graph-based agent workflows with built-in persistence, human-in-the-loop support, and complex control flow. Particularly useful for production agent deployments.
- LlamaIndex Documentation. https://docs.llamaindex.ai/. Focuses on data-augmented agents that combine RAG with tool use. Strong support for building agents that reason over structured and unstructured data sources.
- Microsoft AutoGen. https://microsoft.github.io/autogen/. Multi-agent conversation framework with conversable agents, code execution sandboxes, and group chat coordination. Includes extensive examples of multi-agent collaboration patterns.
- Anthropic Tool Use Documentation. https://docs.anthropic.com/en/docs/build-with-claude/tool-use. Official documentation for Claude's function calling API, including schema design, parallel tool calls, and best practices for tool integration. A round-trip sketch appears after this list.
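To make the tool-use documentation above concrete, here is one round trip with the Anthropic Messages API as a sketch: declare a JSON-schema tool, let the model request it, run it, and return a tool_result block. The model ID and the weather lookup are placeholders; check the documentation for current model names.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# A tool is described to the model as a JSON-schema object.
weather_tool = {
    "name": "get_weather",
    "description": "Get the current weather for a city.",
    "input_schema": {
        "type": "object",
        "properties": {"city": {"type": "string", "description": "City name"}},
        "required": ["city"],
    },
}

messages = [{"role": "user", "content": "What's the weather in Paris?"}]
response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; use a current model ID
    max_tokens=1024,
    tools=[weather_tool],
    messages=messages,
)

# If the model asked for the tool, execute it and send back a tool_result block.
for block in response.content:
    if block.type == "tool_use":
        result = f"(stub) 18 degrees and cloudy in {block.input['city']}"  # hypothetical lookup
        messages.append({"role": "assistant", "content": response.content})
        messages.append({
            "role": "user",
            "content": [{
                "type": "tool_result",
                "tool_use_id": block.id,
                "content": result,
            }],
        })
        final = client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=1024,
            tools=[weather_tool],
            messages=messages,
        )
        print(final.content[0].text)  # assumes the reply starts with a text block
```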
Agent Evaluation and Benchmarks
- Jimenez, C. E., Yang, J., Wettig, A., Yao, S., Pei, K., Press, O., and Narasimhan, K. (2024). "SWE-bench: Can Language Models Resolve Real-World GitHub Issues?" ICLR 2024. A widely used benchmark for coding agents, testing the ability to fix real bugs in real open-source repositories. Results provide insight into the practical capabilities and limitations of code agents. https://swe-bench.github.io/
- Mialon, G., Dessì, R., Lomeli, M., et al. (2023). "GAIA: A Benchmark for General AI Assistants." arXiv preprint arXiv:2311.12983. A benchmark of diverse tasks requiring tools and multi-step reasoning, designed to be easy for humans but challenging for AI. Provides a practical evaluation of general agent capabilities. https://arxiv.org/abs/2311.12983
- Liu, X., Yu, H., Zhang, H., et al. (2023). "AgentBench: Evaluating LLMs as Agents." ICLR 2024. A comprehensive benchmark spanning operating system interaction, database operations, web browsing, and more. Evaluates agent capabilities across multiple domains with standardized metrics.
Agent Safety and Alignment
- Nakano, R., Hilton, J., Balaji, S., et al. (2022). "WebGPT: Browser-Assisted Question-Answering with Human Feedback." arXiv preprint arXiv:2112.09332. An early exploration of web-browsing agents with human feedback, addressing accuracy, citation, and safety considerations in agents that interact with the web. https://arxiv.org/abs/2112.09332
- Ruan, Y., Dong, H., Wang, A., Pitis, S., and Ba, J. (2024). "Identifying the Risks of LM Agents with an LM-Emulated Sandbox." arXiv preprint arXiv:2309.15817. Proposes methods for identifying risks in LLM agents by emulating environments, enabling safety testing without real-world consequences. https://arxiv.org/abs/2309.15817
- Perez, E., Ringer, S., Lukošiūtė, K., et al. (2022). "Discovering Language Model Behaviors with Model-Written Evaluations." arXiv preprint arXiv:2212.09251. Demonstrates methods for systematically evaluating LLM behaviors, including sycophancy, power-seeking, and instruction following, all relevant for understanding agent alignment. https://arxiv.org/abs/2212.09251
Model Context Protocol (MCP)
- Anthropic. (2024). "Model Context Protocol." https://modelcontextprotocol.io/. The official specification for MCP, an open standard for connecting AI agents to external data sources and tools. Includes the protocol specification, server implementations, and integration guides. A minimal server sketch follows.
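As a sense of scale, a tool-serving MCP server can be very small. The sketch below uses the FastMCP helper from the official Python SDK (pip install mcp) as of this writing; the add tool is a toy example, and the SDK's API may evolve, so treat the specification site above as authoritative.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("demo-server")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers."""  # the docstring becomes the tool description shown to clients
    return a + b

if __name__ == "__main__":
    mcp.run()  # serves over stdio by default, for local agent clients
```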
Online Resources and Tutorials
- LangChain Agent Tutorials: https://python.langchain.com/docs/how_to/#agents. Practical tutorials covering agent construction with various frameworks and patterns.
- Lilian Weng's "LLM Powered Autonomous Agents" (2023): https://lilianweng.github.io/posts/2023-06-23-agent/. An excellent blog post providing a comprehensive overview of agent architectures, planning, memory, and tool use, with clear diagrams and references.
- Andrew Ng's "Agentic Design Patterns" (2024): A series of articles and lectures covering reflection, tool use, planning, and multi-agent collaboration as fundamental design patterns for AI agents.
- Chip Huyen's "Building LLM Applications for Production" (2023): https://huyenchip.com/2023/04/11/llm-engineering.html. Practical guidance on production engineering for LLM applications, including agent systems.
Software Libraries
- LangChain (langchain): Framework for building LLM applications with tools, agents, memory, and chain composition. pip install langchain.
- LangGraph (langgraph): Stateful graph-based agent framework built on LangChain; see the sketch after this list. pip install langgraph.
- LlamaIndex (llama-index): Data-augmented LLM framework with strong agent and RAG capabilities. pip install llama-index.
- AutoGen (autogen): Microsoft's multi-agent conversation framework. pip install autogen.
- CrewAI (crewai): Role-based multi-agent framework. pip install crewai.
- Anthropic SDK (anthropic): Official Python SDK for Claude with tool use support. pip install anthropic.
- OpenAI SDK (openai): Official Python SDK for GPT models with function calling. pip install openai.
- E2B (e2b): Managed sandboxed environments for AI agent code execution. pip install e2b.
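To illustrate the LangGraph entry above, the sketch below wires one stub node into a typed state graph. A real agent node would call a model and tools; the API shown reflects recent langgraph releases and may drift between versions, so treat it as illustrative.

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    question: str
    answer: str

def answer_node(state: State) -> dict:
    # Stub: a real agent node would call a model (and possibly tools) here.
    return {"answer": f"(stub) answering: {state['question']}"}

graph = StateGraph(State)
graph.add_node("answer", answer_node)
graph.add_edge(START, "answer")
graph.add_edge("answer", END)
app = graph.compile()

print(app.invoke({"question": "What is ReAct?", "answer": ""}))
```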