Chapter 17 Further Reading: Generative AI — Large Language Models


The Transformer Architecture and Technical Foundations

1. Vaswani, A., Shazeer, N., Parmar, N., et al. (2017). "Attention Is All You Need." Advances in Neural Information Processing Systems, 30. The paper that launched the transformer revolution. Vaswani et al. introduced the self-attention mechanism and demonstrated that a model based entirely on attention — without recurrence or convolution — could achieve state-of-the-art results on machine translation. The paper is technical but the core ideas (query-key-value attention, multi-head attention, positional encoding) are accessible to readers with the background from Chapters 13-14 of this textbook. One of the most cited papers in computer science history.
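
The query-key-value mechanism at the paper's core can be sketched in a few lines of NumPy. This is a single attention head with no masking, learned projections, or batching, so it is a toy illustration of the formula softmax(QK^T / sqrt(d_k))V rather than the paper's full multi-head architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # similarity of each query to each key
    weights = softmax(scores, axis=-1)  # each row is a distribution over keys
    return weights @ V                  # weighted average of the value vectors

# Three tokens, embedding dimension 4 (random stand-ins for learned vectors)
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4): one output vector per query token
```

Multi-head attention simply runs several such heads in parallel on learned linear projections of Q, K, and V, then concatenates the results.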

2. Alammar, J. (2018-2023). "The Illustrated Transformer." jalammar.github.io. The single best visual explanation of how transformers work. Alammar's illustrated guides walk through self-attention, multi-head attention, and the full transformer architecture using clear diagrams and step-by-step examples. If you want to develop deeper intuition for the mechanism behind LLMs without reading mathematical notation, start here. Alammar's companion posts on GPT-2, BERT, and word embeddings are equally valuable.

3. Jurafsky, D., & Martin, J. H. (2024). Speech and Language Processing (3rd edition, draft). Stanford University. The definitive textbook on natural language processing, freely available online. The chapters on transformers, pre-training, and fine-tuning provide rigorous but accessible treatment of the technical foundations discussed in Chapter 17. Particularly useful for readers who want to go deeper on attention mechanisms, tokenization, and training procedures without jumping directly to research papers.


How LLMs Are Built and Trained

4. Ouyang, L., Wu, J., Jiang, X., et al. (2022). "Training Language Models to Follow Instructions with Human Feedback." Advances in Neural Information Processing Systems, 35. The foundational paper on RLHF (Reinforcement Learning from Human Feedback) from the OpenAI team. Describes the three-phase training process (pre-training, supervised fine-tuning, RLHF) that became standard for aligning LLMs with human intentions. Technical but essential for understanding why LLMs behave the way they do — including why they sometimes prioritize helpfulness over accuracy.
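
The reward-modeling stage at the heart of RLHF rests on a simple pairwise loss: the reward model is trained to score the human-preferred response above the rejected one. A toy NumPy sketch of that loss (illustrative only, not OpenAI's implementation; the function and variable names are invented):

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    """Pairwise (Bradley-Terry) loss on two scalar reward scores:
    -log sigmoid(r_chosen - r_rejected), written as log1p(exp(-delta))
    for numerical stability."""
    return np.log1p(np.exp(-(r_chosen - r_rejected)))

# If the preferred response already scores much higher, the loss is near zero;
# if the model ranks the pair the wrong way round, the loss is large.
print(preference_loss(2.0, -1.0))  # small: ranking agrees with the human label
print(preference_loss(-1.0, 2.0))  # large: ranking disagrees
```

Minimizing this loss over many human-labeled response pairs yields the reward model that the subsequent reinforcement-learning phase then optimizes against.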

5. Bai, Y., Kadavath, S., Kundu, S., et al. (2022). "Constitutional AI: Harmlessness from AI Feedback." arXiv preprint arXiv:2212.08073. Anthropic's paper introducing Constitutional AI (CAI), an alternative to RLHF that uses a set of written principles ("constitution") to guide model behavior. CAI represents a different philosophical approach to alignment — one that relies more on explicit principles and less on human rater preferences. Relevant for understanding the differences in approach among major LLM providers.

6. Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). "Scaling Laws for Neural Language Models." arXiv preprint arXiv:2001.08361. The OpenAI paper that demonstrated predictable, power-law relationships between model scale (parameters, data, compute) and performance. These "scaling laws" provided the intellectual justification for the multi-billion-dollar race to build ever-larger models. Understanding scaling laws is essential for business leaders evaluating whether the next generation of models will meaningfully improve on the current one — and whether that improvement is worth the investment.
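
The headline result can be reproduced in a few lines. The sketch below uses the paper's approximate fitted constants for loss as a function of parameter count alone (data and compute assumed unconstrained); the exact values matter less than the power-law shape:

```python
# Parameter-only scaling law from Kaplan et al. (2020): L(N) ~ (N_c / N)**alpha_N.
# Constants are the paper's approximate fitted values; treat as illustrative.
ALPHA_N = 0.076
N_C = 8.8e13  # "critical" parameter count from the fit

def predicted_loss(n_params: float) -> float:
    """Predicted cross-entropy loss for a model with n_params parameters."""
    return (N_C / n_params) ** ALPHA_N

# A power law means each doubling of parameters cuts loss by the same *ratio*
# (here 2**0.076, roughly 5%): steady relative gains, diminishing absolute ones.
for n in (1e9, 1e10, 1e11, 1e12):
    print(f"{n:.0e} params -> predicted loss {predicted_loss(n):.3f}")
```

This constant-ratio behavior is why each successive model generation costs far more to train yet improves by a smaller absolute margin, which is exactly the investment question the entry above raises.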


Capabilities, Limitations, and Evaluation

7. Bubeck, S., Chandrasekaran, V., Eldan, R., et al. (2023). "Sparks of Artificial General Intelligence: Early Experiments with GPT-4." arXiv preprint arXiv:2303.12712. A systematic evaluation of GPT-4's capabilities across mathematics, coding, vision, reasoning, and more. The "AGI" framing in the title is controversial, but the detailed capability analysis is valuable for understanding what frontier LLMs can and cannot do. Read alongside the limitations discussion in this chapter for a balanced view.

8. Maynez, J., Narayan, S., Bohnet, B., & McDonald, R. (2020). "On Faithfulness and Factuality in Abstractive Summarization." Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. An early and rigorous study of hallucination in neural text generation. Maynez et al. categorize types of hallucination (intrinsic vs. extrinsic) and measure their frequency in summarization tasks. Provides the theoretical framework for understanding why LLMs fabricate information and how hallucination can be detected and measured — essential context for the business risks discussed in Chapter 17.

9. Ji, Z., Lee, N., Frieske, R., et al. (2023). "Survey of Hallucination in Natural Language Generation." ACM Computing Surveys, 55(12), 1-38. The most comprehensive academic survey of hallucination in language models. Covers causes, detection methods, and mitigation strategies across different types of NLG tasks. Particularly useful for technical teams developing hallucination detection systems for enterprise deployments.
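
To make the intrinsic/extrinsic distinction concrete, here is a deliberately crude extrinsic check: flag capitalized summary tokens (a rough proxy for named entities) that never appear in the source. The example strings are invented, and the detectors surveyed by Ji et al. use far stronger tools (NLI models, QA-based consistency checks); this is only a sketch of the idea:

```python
import re

def unsupported_tokens(source: str, summary: str) -> set:
    """Toy extrinsic-hallucination check: capitalized tokens that appear
    in the summary but nowhere in the source document."""
    tokens = lambda text: set(re.findall(r"\b[A-Z][a-z]+\b", text))
    return tokens(summary) - tokens(source)

source = "Acme reported quarterly revenue of $2M, led by CEO Jane Doe."
summary = "Acme, led by Jane Doe and CFO John Smith, reported revenue of $2M."
print(sorted(unsupported_tokens(source, summary)))  # ['John', 'Smith']
```

"John Smith" is an extrinsic hallucination: content absent from the source entirely. An intrinsic hallucination (e.g., calling Jane Doe the CFO) reuses source content but contradicts it, and a surface check like this cannot catch it.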

10. Wei, J., Tay, Y., Bommasani, R., et al. (2022). "Emergent Abilities of Large Language Models." Transactions on Machine Learning Research. The paper that introduced the concept of "emergent abilities" — capabilities that appear in LLMs only at certain scale thresholds. A fascinating and contested finding. Read alongside Schaeffer et al. (2023), "Are Emergent Abilities of Large Language Models a Mirage?" which argues that many supposedly emergent capabilities are measurement artifacts. Together, these papers illustrate the ongoing debate about whether scale will continue to produce new capabilities.


Business Applications and Enterprise Deployment

11. Mollick, E. (2024). Co-Intelligence: Living and Working with AI. Portfolio. Wharton professor Ethan Mollick's practical guide to working with LLMs, based on extensive classroom and organizational experimentation. Mollick's approach is empirical and pragmatic — he tests what works rather than theorizing about what should work. Essential reading for any business leader deploying LLMs, particularly the chapters on organizational adoption, prompt engineering for business tasks, and the surprisingly narrow gap between novice and expert users.

12. Dell'Acqua, F., McFowland, E., Mollick, E. R., et al. (2023). "Navigating the Jagged Technological Frontier: Field Experimental Evidence of the Effects of AI on Knowledge Worker Productivity and Quality." Harvard Business School Working Paper, No. 24-013. A rigorous field experiment conducted with Boston Consulting Group consultants using GPT-4. Found that consultants using AI completed 12.2% more tasks, 25.1% faster, and with 40% higher quality — but only for tasks within the AI's capability frontier. For tasks outside the frontier, AI-using consultants performed worse than those working without AI, because the AI's confident-but-wrong outputs led them astray. This "jagged frontier" finding is one of the most important empirical results for business AI deployment.

13. Eloundou, T., Manning, S., Mishkin, P., & Rock, D. (2023). "GPTs Are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models." arXiv preprint arXiv:2303.10130. An OpenAI-sponsored analysis estimating that approximately 80% of the US workforce could have at least 10% of their work tasks affected by LLMs. The task-level analysis (rather than job-level) provides a nuanced framework for thinking about how LLMs will reshape specific roles — relevant for Chapter 38 on the future of work and for any business leader planning workforce strategy.


Domain-Specific Models

14. Wu, S., Irsoy, O., Lu, S., et al. (2023). "BloombergGPT: A Large Language Model for Finance." arXiv preprint arXiv:2303.17564. The paper describing Bloomberg's 50-billion-parameter finance-specific LLM. Provides detailed information on training data composition, training methodology, and evaluation results. Essential reading for the Case Study 2 discussion and for any business leader considering domain-specific model training. The data mixture strategy (roughly 50/50 financial and general data) is particularly instructive.

15. Singhal, K., Azizi, S., Tu, T., et al. (2023). "Large Language Models Encode Clinical Knowledge." Nature, 620, 172-180. Google's paper on Med-PaLM, a medical-domain LLM that reached a passing score on US medical licensing exam style questions. Demonstrates the fine-tuning approach (adapting a general-purpose model for a specific domain) as an alternative to full domain-specific pre-training. The comparison of fine-tuning vs. pre-training approaches provides useful data for the build-vs-buy decision discussed in Chapter 17.


Safety, Ethics, and Regulation

16. Weidinger, L., Mellor, J., Rauh, M., et al. (2021). "Ethical and Social Risks of Harm from Language Models." arXiv preprint arXiv:2112.04359. A comprehensive taxonomy of risks from language models, developed by researchers at DeepMind. Covers discrimination and exclusion, information hazards, misinformation, malicious uses, human-computer interaction harms, and environmental costs. Provides the conceptual framework for the risk assessment that should precede any enterprise LLM deployment.

17. Perez, E., Huang, S., Song, F., et al. (2022). "Red Teaming Language Models with Language Models." arXiv preprint arXiv:2202.03286. A DeepMind paper demonstrating the use of AI systems to systematically discover failure modes in other AI systems. The red-teaming methodology described here has become standard practice for enterprise LLM evaluation. Relevant for teams building evaluation frameworks for customer-facing LLM deployments.

18. European Parliament. (2024). Regulation (EU) 2024/1689 of the European Parliament and of the Council laying down harmonised rules on artificial intelligence (Artificial Intelligence Act). The full text of the EU AI Act — the world's first comprehensive AI regulatory framework. The provisions on transparency (requiring disclosure of AI-generated content), high-risk AI systems, and general-purpose AI models directly affect enterprise LLM deployment in the European Union and, given the "Brussels Effect," globally. Chapter 28 of this textbook covers the regulatory landscape in detail, but business leaders deploying LLMs should familiarize themselves with the primary source.


Prompt Engineering and Practical Techniques

19. White, J., Fu, Q., Hays, S., et al. (2023). "A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT." arXiv preprint arXiv:2302.11382. A systematic catalog of prompt engineering patterns — reusable techniques for structuring prompts to achieve specific outcomes. Includes patterns for output customization, error identification, and prompt improvement. A practical reference that complements the prompt engineering techniques explored in Chapters 19-20.

20. Wei, J., Wang, X., Schuurmans, D., et al. (2022). "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." Advances in Neural Information Processing Systems, 35. The paper that introduced chain-of-thought prompting: supplying the model with worked examples that reason step by step before giving a final answer. (The popular zero-shot variant, simply appending "Let's think step by step," followed shortly afterward in Kojima et al., 2022.) This simple technique dramatically improved LLM performance on reasoning tasks and became one of the most widely adopted prompt engineering methods. Understanding why it works (and when it does not) is essential for Chapter 19's treatment of prompt engineering.
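
A minimal illustration of the two prompt styles follows. The worked "Roger" exemplar is adapted from the paper; the second question and all variable names are invented for this sketch, and no model call is made:

```python
question = ("A store has 23 apples. It sells 9, then receives 2 crates "
            "of 12 apples each. How many apples does it have now?")

# Zero-shot chain-of-thought: append a reasoning cue to the bare question.
zero_shot_cot = f"Q: {question}\nA: Let's think step by step."

# Few-shot chain-of-thought (the Wei et al. style): prepend a worked example
# whose answer spells out its intermediate reasoning.
few_shot_cot = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 tennis balls each. "
    "How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n"
    f"Q: {question}\nA:"
)
print(few_shot_cot)
```

The exemplar teaches the model the output format: reason first, then state the answer. On multi-step arithmetic like the question above, that intermediate reasoning is where most of the accuracy gain comes from.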


Industry Analysis and Market Landscape

21. Stanford Institute for Human-Centered Artificial Intelligence (HAI). (2025). AI Index Report. Stanford HAI. The most comprehensive annual compilation of data on AI research, deployment, investment, and policy. The 2025 report includes extensive data on LLM capabilities, costs, adoption, and impact. An indispensable reference for staying current on the rapidly evolving LLM landscape. Published annually — always check for the latest edition.

22. Epoch AI. (2024-2025). "Trends in Machine Learning." epochai.org. A research organization that systematically tracks trends in AI compute, data, parameters, and performance. Their datasets on training compute trends and model capabilities provide the empirical foundation for the scaling laws discussion in Chapter 17. The interactive visualizations of compute growth and cost trends are particularly useful for business presentations.

23. Bommasani, R., Hudson, D. A., Adeli, E., et al. (2021). "On the Opportunities and Risks of Foundation Models." arXiv preprint arXiv:2108.07258. A comprehensive Stanford report examining the broad implications of "foundation models" — large models trained on diverse data that can be adapted to many tasks. At over 200 pages, this is the most thorough treatment of LLMs' societal implications, covering capabilities, applications, risks, economics, and ethics. The business strategy implications are explored in depth.


Practical Guides for Technical Teams

24. OpenAI. (2024-2026). "API Reference and Best Practices." platform.openai.com/docs. OpenAI's official documentation for its API, including detailed guidance on prompt design, JSON mode, function calling, fine-tuning, and error handling. The "Best Practices" section is particularly valuable for teams building production LLM applications. Updates frequently — bookmark and check regularly.
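
As a sketch of what the documentation covers, the snippet below builds (but does not send) a Chat Completions request body with JSON mode enabled. Field names follow the documented schema at the time of writing and the model name is illustrative; always verify against the live docs, since the API evolves:

```python
import json

# Request body for a structured-extraction call with JSON mode, which
# constrains the model's reply to be valid JSON. No network request is made.
payload = {
    "model": "gpt-4o-mini",  # illustrative model name; check current offerings
    "messages": [
        {"role": "system", "content": "Extract the invoice fields as JSON."},
        {"role": "user",
         "content": "Invoice #812, total $1,240.50, due 2025-03-01."},
    ],
    "response_format": {"type": "json_object"},  # enables JSON mode
    "temperature": 0,  # minimize sampling variation for extraction tasks
}
print(json.dumps(payload, indent=2))
```

In production code this body would be sent with an API key via the official client library; the docs' "Best Practices" section covers the retry, timeout, and error-handling logic that a sketch like this omits.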

25. Anthropic. (2024-2026). "Anthropic API Documentation and Prompt Engineering Guide." docs.anthropic.com. Anthropic's documentation includes one of the most thorough prompt engineering guides in the industry, with detailed examples, anti-patterns, and domain-specific advice. Even if you are using a different provider's models, the prompt engineering principles are broadly applicable. The sections on reducing hallucination and handling ambiguity are especially relevant to the deployment challenges discussed in Chapter 17.


Each item in this reading list was selected because it directly supports concepts introduced in Chapter 17 and developed throughout the textbook. Items marked with specific chapter references connect to more detailed treatment later in the course. The LLM landscape evolves rapidly — supplement these readings with the latest reports from Stanford HAI, Epoch AI, and the major LLM providers.