Chapter 23: Further Reading

Foundational Papers

In-Context Learning

Brown, T. B., et al. (2020). "Language models are few-shot learners." NeurIPS. The GPT-3 paper that demonstrated in-context learning at scale. Showed that sufficiently large language models can perform diverse tasks from just a few examples in the prompt, without gradient updates. Essential reading for understanding the origins of modern prompt engineering.
Xie, S. M., et al. (2022). "An explanation of in-context learning as implicit Bayesian inference." ICLR. Proposes that ICL performs implicit Bayesian inference, where the model infers a latent concept from demonstrations and generates accordingly. Provides a rigorous theoretical framework for understanding why demonstrations help.
Dai, D., et al. (2023). "Why can GPT learn in-context? Language models implicitly perform gradient descent as meta-optimizers." ACL. Establishes a mathematical equivalence between the forward pass of Transformers during ICL and gradient descent on an implicit linear model. Each attention head can be interpreted as one optimization step.
Olsson, C., et al. (2022). "In-context learning and induction heads." Transformer Circuits Thread. Identifies specific attention head patterns (induction heads) that implement the copying mechanism underlying ICL. A key contribution to mechanistic interpretability of language models.

Chain-of-Thought and Reasoning

Wei, J., et al. (2022). "Chain-of-thought prompting elicits reasoning in large language models." NeurIPS. Introduced chain-of-thought prompting, showing that including intermediate reasoning steps in demonstrations dramatically improves performance on arithmetic, commonsense, and symbolic reasoning tasks.
Kojima, T., et al. (2022). "Large language models are zero-shot reasoners." NeurIPS. Discovered that appending "Let's think step by step" to a prompt triggers chain-of-thought reasoning without demonstrations. A remarkably simple technique with broad applicability.
Wang, X., et al. (2023). "Self-consistency improves chain of thought reasoning in language models." ICLR. Proposed self-consistency: sampling multiple reasoning paths and taking a majority vote. Provides a principled way to improve CoT accuracy by approximating marginalization over reasoning chains.
Yao, S., et al. (2023). "Tree of thoughts: Deliberate problem solving with large language models." NeurIPS. Generalized CoT from linear chains to tree structures with branching, evaluation, and backtracking. Enables planning and exploration for complex reasoning tasks.

Prompt Engineering and Optimization

Liu, J., et al. (2022). "What makes good in-context examples for GPT-3?" DeeLIO Workshop. Systematic study of example selection for few-shot prompting. Shows that similarity-based selection significantly outperforms random selection and provides practical guidelines for demonstration design.
Min, S., et al. (2022). "Rethinking the role of demonstrations: What makes in-context learning work?" EMNLP. Surprising finding that randomly assigning labels to demonstrations does not always destroy ICL performance, suggesting demonstrations serve purposes beyond providing input-output mappings.
Zhou, Y., et al. (2023). "Large language models are human-level prompt engineers." ICLR. Introduced Automatic Prompt Engineer (APE), which uses LLMs to generate and select effective prompts. Demonstrates that prompt optimization can be partially automated.
Khattab, O., et al. (2023). "DSPy: Compiling declarative language model calls into self-improving pipelines." arXiv. A framework that compiles high-level task descriptions into optimized prompts. Introduces the concept of "teleprompters" for automated prompt optimization and represents a shift toward programming with language models rather than writing prompts manually.

Surveys and Tutorials

Liu, P., et al. (2023). "Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing." ACM Computing Surveys. A comprehensive survey covering the full landscape of prompting methods, from template-based approaches to in-context learning. Provides a useful taxonomy and comparison of different techniques.
Sahoo, P., et al. (2024). "A systematic survey of prompt engineering in large language models: Techniques and applications." arXiv:2402.07927. An up-to-date survey covering modern prompt engineering techniques including CoT variants, ToT, self-consistency, and their applications across diverse domains.
Schulhoff, S., et al. (2024). "The Prompt Report: A systematic survey of prompting techniques." arXiv:2406.06608. An extensive survey cataloging over 50 distinct prompting techniques with a structured taxonomy. Useful as a reference guide for practitioners.

Security and Safety

Perez, F., & Ribeiro, I. (2022). "Ignore this title and HackAPrompt: Exposing systemic weaknesses of LLMs through a global scale prompt hacking competition." EMNLP. Documents real-world prompt injection attacks from a large-scale competition. Provides a taxonomy of attack vectors and insights into model vulnerabilities.
Greshake, K., et al. (2023). "Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection." AISec Workshop. Demonstrates indirect prompt injection attacks where malicious instructions are embedded in content processed by LLM-integrated applications. Essential reading for anyone building production LLM systems.
Wallace, E., et al. (2024). "The instruction hierarchy: Training LLMs to prioritize privileged instructions." arXiv. Proposes training LLMs to respect an instruction hierarchy where system prompts take priority over user inputs. A promising direction for mitigating prompt injection.

Evaluation

Zheng, L., et al. (2023). "Judging LLM-as-a-judge with MT-Bench and Chatbot Arena." NeurIPS. Introduces the LLM-as-judge evaluation paradigm and the MT-Bench benchmark. Shows that strong LLMs can serve as reliable evaluators for open-ended generation tasks.
Turpin, M., et al. (2023). "Language models don't always say what they think: Unfaithful explanations in chain-of-thought prompting." NeurIPS. Demonstrates that CoT reasoning chains may not faithfully represent the model's internal computation, raising important questions about the interpretability of generated reasoning.

Retrieval-Augmented Generation

Lewis, P., et al. (2020). "Retrieval-augmented generation for knowledge-intensive NLP tasks." NeurIPS. The foundational RAG paper combining retrieval with generation. Demonstrates that grounding generation in retrieved documents improves factual accuracy and reduces hallucination.
Gao, Y., et al. (2024). "Retrieval-augmented generation for large language models: A survey." arXiv:2312.10997. A comprehensive survey of RAG methods covering retrieval strategies, augmentation techniques, and evaluation approaches. Useful for understanding the full RAG landscape.

Practical Resources

OpenAI Prompt Engineering Guide. https://platform.openai.com/docs/guides/prompt-engineering Practical guidelines for prompt engineering with OpenAI models. While vendor-specific, many principles generalize across models.
Anthropic Prompt Engineering Guide. https://docs.anthropic.com/en/docs/build-with-claude/prompt-engineering Anthropic's documentation on effective prompting for Claude models. Covers techniques like XML tags for structure and thinking prompts.

Looking Ahead

The prompt engineering concepts from this chapter connect directly to upcoming topics:

Chapter 24 (Fine-Tuning LLMs): When prompting alone is insufficient, fine-tuning provides deeper behavioral customization. Understanding when to escalate from prompting to fine-tuning is a critical engineering decision.
Chapter 25 (Alignment: RLHF and DPO): The instruction-following ability that makes prompt engineering possible is itself created through alignment techniques. Understanding alignment deepens your understanding of why prompts work.
Chapter 26 (Vision Transformers): Multimodal prompting extends the concepts in this chapter to image-text models, where prompts guide both visual and textual understanding.