Chapter 21 Further Reading: AI-Powered Workflows
Retrieval-Augmented Generation — Foundations
1. Lewis, P., Perez, E., Piktus, A., et al. (2020). "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." Advances in Neural Information Processing Systems (NeurIPS). The foundational paper that introduced the RAG framework. Lewis and colleagues at Meta AI Research demonstrated that combining a pre-trained retrieval model with a pre-trained language model produced state-of-the-art results on knowledge-intensive tasks while significantly reducing hallucination. Technically dense but essential reading for anyone who wants to understand the theoretical basis of RAG.
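The retrieve-then-generate pattern Lewis and colleagues formalized can be sketched in a few lines. The word-overlap scorer and prompt template below are illustrative stand-ins, not the dense retriever and generator used in the original paper:

```python
# Minimal sketch of the retrieve-then-generate RAG pattern.
# The overlap scorer and prompt template are illustrative stand-ins,
# not the DPR retriever + BART generator of the original paper.

def retrieve(query, documents, k=2):
    """Rank documents by naive word-overlap score and return the top k."""
    q_words = set(query.lower().split())
    return sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )[:k]

def build_prompt(query, passages):
    """Augment the query with retrieved passages for grounded generation."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "RAG combines a retriever with a generator.",
    "Vector databases store embeddings.",
    "Paris is the capital of France.",
]
query = "What does RAG combine?"
prompt = build_prompt(query, retrieve(query, docs))
print(prompt)
```

The assembled prompt would then be sent to the language model; grounding the generation in retrieved passages is what reduces hallucination.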
2. Gao, Y., Xiong, Y., Gao, X., et al. (2024). "Retrieval-Augmented Generation for Large Language Models: A Survey." arXiv preprint arXiv:2312.10997. The most comprehensive academic survey of RAG techniques as of 2024. Covers the full taxonomy of RAG architectures — naive RAG, advanced RAG, and modular RAG — along with retrieval strategies, augmentation methods, and evaluation frameworks. An excellent reference for teams designing or evaluating RAG systems. Updated periodically; check for the latest version.
3. Ram, O., Levine, Y., Dalmedigos, I., et al. (2023). "In-Context Retrieval-Augmented Language Models." Transactions of the Association for Computational Linguistics. Explores how retrieval-augmented approaches compare to in-context learning (providing examples in the prompt) for grounding LLM responses. The paper provides evidence that retrieval augmentation is more robust than in-context learning for factual accuracy, particularly as the knowledge domain becomes more specialized.
Embeddings and Vector Search
4. Reimers, N., & Gurevych, I. (2019). "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP). The paper that introduced Sentence-BERT (SBERT), the architecture behind the sentence-transformers library used in this chapter's code examples. Reimers and Gurevych showed how to produce semantically meaningful sentence embeddings efficiently, enabling the vector similarity search that underpins RAG retrieval. The sentence-transformers library remains the most accessible open-source embedding solution for prototyping and small-to-medium deployments.
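The vector similarity search that SBERT embeddings enable reduces to cosine similarity over a matrix of corpus vectors. The toy 4-dimensional vectors below stand in for the 384-dimensional embeddings a sentence-transformers model would produce; the ranking logic is the same:

```python
import numpy as np

# Toy 4-dimensional vectors stand in for the 384-dimensional
# embeddings a sentence-transformers model would produce.
corpus_vecs = np.array([
    [0.9, 0.1, 0.0, 0.0],   # "refund policy"
    [0.0, 0.8, 0.2, 0.0],   # "shipping times"
    [0.1, 0.0, 0.9, 0.1],   # "account security"
])
query_vec = np.array([0.85, 0.15, 0.05, 0.0])  # query closest to "refund policy"

def cosine_top_k(query, matrix, k=1):
    """Return indices of the k corpus vectors most similar to the query."""
    sims = matrix @ query / (np.linalg.norm(matrix, axis=1) * np.linalg.norm(query))
    return np.argsort(-sims)[:k]

print(cosine_top_k(query_vec, corpus_vecs))  # top match is index 0, "refund policy"
```

In a real pipeline, `corpus_vecs` comes from encoding your document chunks once at indexing time, and `query_vec` from encoding each incoming query.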
5. Muennighoff, N., Tazi, N., Magne, L., & Reimers, N. (2023). "MTEB: Massive Text Embedding Benchmark." Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics (EACL). The definitive benchmark for comparing embedding models across retrieval, classification, clustering, and semantic similarity tasks. The associated MTEB leaderboard (huggingface.co/spaces/mteb/leaderboard) is the most reliable resource for selecting an embedding model based on empirical performance rather than marketing claims. Consult it before choosing an embedding model for your RAG system.
6. Douze, M., Guzhva, A., Deng, C., et al. (2024). "The FAISS Library." arXiv preprint arXiv:2401.08281. Documentation and technical overview of FAISS (Facebook AI Similarity Search), the open-source library from Meta AI that implements the approximate nearest neighbor algorithms used by many vector databases. Understanding FAISS provides insight into the indexing strategies (IVF, HNSW, PQ) discussed in the chapter. Not required reading for business leaders, but valuable for technical teams implementing vector search at scale.
AI Agents and Tool Use
7. Yao, S., Zhao, J., Yu, D., et al. (2023). "ReAct: Synergizing Reasoning and Acting in Language Models." International Conference on Learning Representations (ICLR). The paper that formalized the ReAct pattern — the observe-think-act loop described in this chapter. Yao and colleagues demonstrated that LLMs that alternate between reasoning (thinking out loud about what to do) and acting (calling tools, retrieving information) significantly outperform models that only reason or only act. The ReAct pattern has become the default architecture for production AI agents.
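The observe-think-act loop is simple enough to sketch. Here the scripted "thoughts" stand in for real LLM reasoning calls; the tool registry and the loop structure are the part ReAct formalizes:

```python
# Minimal sketch of a ReAct-style observe-think-act loop.
# The scripted steps stand in for live LLM reasoning output.

def calculator(expression: str) -> str:
    """A tool the agent can act with; eval() is acceptable for toy input."""
    return str(eval(expression))

TOOLS = {"calculator": calculator}

# Each step is (thought, action, action_input), mimicking model output.
SCRIPTED_STEPS = [
    ("I need to compute the subtotal first.", "calculator", "19 * 3"),
    ("Now add shipping.", "calculator", "57 + 5"),
    ("I have the answer.", "finish", "62"),
]

def react_loop(steps):
    observations = []
    for thought, action, action_input in steps:   # think
        if action == "finish":
            return action_input
        result = TOOLS[action](action_input)      # act
        observations.append(result)               # observe
    return observations[-1]

print(react_loop(SCRIPTED_STEPS))  # "62"
```

In production, each iteration sends the accumulated thoughts and observations back to the LLM, which decides the next action; that feedback loop is what lets the agent correct course mid-task.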
8. Schick, T., Dwivedi-Yu, J., Dessì, R., et al. (2023). "Toolformer: Language Models Can Teach Themselves to Use Tools." Advances in Neural Information Processing Systems (NeurIPS). A Meta AI paper showing that language models can learn to use external tools (calculators, search engines, translation APIs) with minimal supervision. The paper provides theoretical grounding for the function calling capabilities that are now standard in commercial LLMs from OpenAI, Anthropic, and Google.
9. Wang, L., Ma, C., Feng, X., et al. (2024). "A Survey on Large Language Model based Autonomous Agents." Frontiers of Computer Science. A comprehensive survey of the rapidly evolving agent landscape. Covers agent architectures, planning strategies, memory systems, multi-agent collaboration, and evaluation methods. Useful for teams considering agent deployments and wanting to understand the current state of the art and its limitations.
Orchestration Frameworks
10. LangChain Documentation and Cookbook (2024). langchain.com. The official documentation for LangChain, the most widely adopted orchestration framework for LLM applications. The cookbook section provides practical, copy-and-modify examples for RAG pipelines, agent systems, and multi-step workflows. Start with the "RAG" and "Agents" sections. The API has stabilized significantly since the framework's early days, making the documentation more reliable than it was in 2023.
11. Liu, J. (2024). LlamaIndex: Data Framework for LLM Applications. llamaindex.ai. The documentation and conceptual guide for LlamaIndex, the RAG-focused orchestration framework. LlamaIndex's documentation is particularly strong on advanced indexing strategies (tree indices, knowledge graph indices) and query routing — capabilities that go beyond basic RAG. Recommended for teams building sophisticated retrieval systems over complex document collections.
RAG Evaluation
12. Es, S., James, J., Espinosa-Anke, L., & Schockaert, S. (2024). "RAGAS: Automated Evaluation of Retrieval Augmented Generation." Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (EACL). The paper introducing the RAGAS evaluation framework discussed in the chapter. The framework's metrics — faithfulness, answer relevance, context precision, and context recall — provide a structured approach to RAG evaluation. The accompanying open-source library (github.com/explodinggradients/ragas) enables automated evaluation at scale. Essential reading for teams that need to measure and improve RAG quality systematically.
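To make the metrics concrete, here is an illustrative word-overlap version of context recall: the fraction of ground-truth statements attributable to the retrieved context. RAGAS itself uses an LLM judge rather than word overlap, so treat this only as a sketch of what the metric measures:

```python
# Illustrative word-overlap version of RAGAS-style context recall.
# RAGAS uses an LLM judge, not word overlap; this only shows the idea:
# what fraction of ground-truth statements does the retrieved context support?

def toy_context_recall(ground_truth_statements, retrieved_context, threshold=0.5):
    context_words = set(retrieved_context.lower().split())
    supported = 0
    for statement in ground_truth_statements:
        words = statement.lower().split()
        overlap = sum(w in context_words for w in words) / len(words)
        if overlap >= threshold:
            supported += 1
    return supported / len(ground_truth_statements)

context = "the warranty covers parts for two years and labor for one year"
statements = [
    "the warranty covers parts for two years",  # supported by the context
    "returns are accepted within thirty days",  # missing from the context
]
print(toy_context_recall(statements, context))  # 0.5
```

A low context recall points to a retrieval problem (the right documents never reached the prompt), while low faithfulness points to a generation problem — distinguishing the two is the main value of the framework.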
13. Zheng, L., Chiang, W. L., Sheng, Y., et al. (2023). "Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena." Advances in Neural Information Processing Systems (NeurIPS). An empirical study of using LLMs to evaluate LLM outputs — the "LLM-as-a-judge" approach used by RAGAS and other evaluation frameworks. The paper demonstrates that strong LLMs (GPT-4 class) show high agreement with human judges for most tasks but identifies systematic biases, including position bias and verbosity bias. Important reading for anyone using automated evaluation: know the limitations of your evaluator.
Enterprise RAG Deployments
14. Morgan Stanley. (2023). "Morgan Stanley Wealth Management Deploys GPT-4 Powered Assistant to Financial Advisors." Press Release, September 2023. The primary source for Case Study 2. Morgan Stanley's public disclosures about their AI assistant deployment — including the compliance architecture, phased rollout strategy, and measured business impact — provide a rare window into enterprise RAG deployment at a major financial institution.
15. Anthropic. (2024). "Building Effective RAG Systems." Anthropic Documentation and Research Blog. Anthropic's practical guidance on building RAG systems, including prompt engineering for grounded generation, chunking strategies, and evaluation methods. Particularly valuable for its discussion of "responsible RAG" — designing systems that acknowledge uncertainty, cite sources, and avoid overconfident claims. Relevant to both the Notion and Morgan Stanley case studies.
16. Pinecone. (2024). "The RAG Handbook: A Practical Guide to Retrieval-Augmented Generation." pinecone.io/learn. A comprehensive practitioner-oriented guide covering the full RAG pipeline, from document processing through production deployment. Pinecone is a vector database vendor, so the guide naturally emphasizes its own product, but the architectural patterns, benchmarking methods, and production tips are broadly applicable. The "chunking strategies" and "evaluation" sections are particularly well-done.
Chunking and Document Processing
17. Barnett, S., Kurniawan, S., Thudumu, S., et al. (2024). "Seven Failure Points When Engineering a Retrieval Augmented Generation System." arXiv preprint arXiv:2401.05856. An empirical study identifying the most common failure modes in RAG systems, including missing content, incorrect chunking, search index misconfiguration, and wrong extraction. Each failure point is illustrated with real examples and accompanied by mitigation strategies. Essential reading for teams moving from RAG prototype to production.
18. Kamradt, G. (2024). "Chunking Strategies for LLM Applications." Greg Kamradt's Blog and YouTube Channel. A practitioner's guide to chunking strategies, with side-by-side comparisons of fixed-size, recursive, semantic, and document-aware chunking on real-world documents. Kamradt's experiments — measuring retrieval quality across chunking strategies — provide empirical support for the chapter's argument that chunking strategy matters more than model choice. Accessible and practical.
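Fixed-size chunking with overlap is the baseline that Kamradt's comparisons start from, and it can be sketched in a few lines. Sizes here are in characters for simplicity; production systems usually count tokens instead:

```python
# Minimal fixed-size chunker with overlap, the baseline strategy
# that semantic and document-aware chunkers are measured against.
# Sizes are in characters for simplicity; real systems count tokens.

def chunk_fixed(text: str, size: int = 40, overlap: int = 10):
    """Split text into fixed-size chunks, each sharing `overlap`
    characters with its predecessor so content cut at a boundary
    remains retrievable from at least one chunk."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

doc = "Chunking strategy often matters more than model choice in RAG."
chunks = chunk_fixed(doc, size=30, overlap=8)
for c in chunks:
    print(repr(c))
```

The overlap parameter trades index size for recall: larger overlaps duplicate more text across chunks but make it less likely that a relevant sentence is split across a boundary and missed at retrieval time.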
Data Governance and Knowledge Management
19. Davenport, T. H., & Prusak, L. (1998). Working Knowledge: How Organizations Manage What They Know. Harvard Business School Press. The classic text on organizational knowledge management. While written before the LLM era, Davenport and Prusak's frameworks for knowledge creation, codification, and transfer are directly relevant to RAG system design. The book's central argument — that knowledge management is fundamentally an organizational challenge, not a technology challenge — resonates strongly with this chapter's emphasis on governance.
20. Shankar, V., Caverly, M., & Zimmerman, M. (2024). "Governing AI Knowledge Systems: Challenges and Best Practices." Harvard Business Review. A practitioner-oriented article on governance frameworks for AI-powered knowledge systems, including RAG. Covers document ownership models, freshness policies, quality auditing, and change management processes. Directly relevant to Athena's knowledge base governance process described in the chapter.
Broader Context
21. Mollick, E. (2024). Co-Intelligence: Living and Working with AI. Portfolio. Previously recommended in Chapter 1, Mollick's book is equally relevant here for its practical treatment of how to work effectively with AI systems. The chapters on "AI as a creative partner" and "AI in organizations" provide frameworks for thinking about where RAG-powered tools fit into individual and organizational workflows.
22. OpenAI. (2024). "Function Calling and Tool Use." OpenAI API Documentation. The technical documentation for OpenAI's function calling API — the mechanism by which LLMs interact with external tools, as discussed in the agent and tool-use sections of this chapter. The documentation includes examples, best practices, and parameter schemas. Even if you use a different LLM provider, OpenAI's documentation provides the clearest explanation of function calling concepts.
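The core mechanic the documentation describes — the model returns a tool name plus JSON arguments, and the application validates and dispatches them — can be sketched without any provider SDK. The `get_weather` tool and the model reply below are hypothetical stand-ins:

```python
import json

# Illustrative tool definition in the JSON-Schema style that
# function-calling APIs use. The get_weather tool and the model's
# reply below are hypothetical; real calls go through a provider SDK.

tool_schema = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stub in place of a real weather API

REGISTRY = {"get_weather": get_weather}

# A function-calling model replies with a tool name and JSON arguments;
# the application parses, dispatches, and returns the result to the model.
model_reply = {"name": "get_weather", "arguments": '{"city": "Oslo"}'}
args = json.loads(model_reply["arguments"])
result = REGISTRY[model_reply["name"]](**args)
print(result)  # Sunny in Oslo
```

Note that the LLM never executes anything itself: it only emits the structured request, and your code remains the gatekeeper for what actually runs — which is where validation and permissioning belong.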
23. Chase, H. (2024). "Building LLM Applications." LangChain Blog and Talks. Harrison Chase, LangChain's creator, has published extensively on patterns for building LLM applications. His talks at AI Engineer Summit and other conferences provide practical insights into RAG pipeline design, agent architectures, and the trade-offs between framework abstractions and custom code. Particularly valuable for technical leaders deciding between framework adoption and custom development.
For foundational concepts in prompt engineering that inform RAG prompt construction, see the Further Reading in Chapters 19 and 20. For cloud infrastructure supporting production RAG, see Chapter 23. For AI governance frameworks relevant to RAG deployment, see Chapter 27. For data privacy considerations in RAG over sensitive documents, see Chapter 29.