Chapter 2: Further Reading — LLM Mechanics and Behavior

The following resources extend the conceptual framework from Chapter 2. They are organized by focus area. Not every resource requires a technical background; the annotations note the expected level.


Foundational Concepts: How Language Models Work

"Attention Is All You Need" — Vaswani et al. (2017) The original transformer paper that introduced the architecture underlying virtually all modern large language models. This is a technical research paper and requires familiarity with machine learning concepts to read in full, but the abstract and introduction convey the core insight: attention mechanisms allow models to relate different positions in a sequence to each other, enabling the contextual understanding that distinguishes transformers from earlier architectures. The abstract and discussion sections are worth reading even if the mathematics is not accessible. Level: Technical

"The Illustrated Transformer" — Jay Alammar A visual, step-by-step explanation of the transformer architecture using diagrams and intuitive examples rather than equations. One of the most widely recommended introductions to transformer internals for non-mathematicians. Particularly good on the attention mechanism and how tokens relate to each other in context. Level: Accessible | Available at: jalammar.github.io
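The attention idea that Alammar illustrates reduces to a few lines of arithmetic. The sketch below is a toy scaled dot-product attention over hand-made 2-D vectors; the vectors and function names are invented for illustration and are not drawn from any real model. Each position's output is a weighted average of every position's value vector, with weights set by how well the query matches each key.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability, then normalize exponentials.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Toy scaled dot-product attention over small lists of vectors.

    Each output is a weighted average of the value vectors; the weights
    come from the (scaled) dot product between the query and each key.
    """
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Three toy "token" vectors; in self-attention each position attends
# over all positions, including itself.
vecs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = attention(vecs, vecs, vecs)
```

Because the weights form a convex combination, each output component stays within the range of the corresponding value components — the mechanism mixes existing information rather than inventing new values.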

"What Is ChatGPT Doing… and Why Does It Work?" — Stephen Wolfram (2023) A long, detailed essay by the founder of Wolfram Research explaining language model mechanics from first principles, using concrete examples and minimal jargon. Wolfram's perspective is distinctive and occasionally idiosyncratic, but the piece is exceptionally good at conveying the next-token prediction mechanism and why it produces the results it does. Level: Accessible–Intermediate | Available at: writings.stephenwolfram.com
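Wolfram's central point, that generation is repeated next-token prediction from observed statistics, can be made concrete with a toy bigram model. The corpus and function names below are invented for illustration; real models predict over vocabularies of tens of thousands of tokens using learned weights rather than raw counts.

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    # Count, for each word, which words follow it in the corpus.
    counts = defaultdict(Counter)
    tokens = corpus.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    # "Generate" by picking the continuation seen most often in training.
    if word not in counts:
        return None
    return counts[word].most_common(1)[0][0]

corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)
print(predict_next(model, "the"))  # prints "cat": it follows "the" most often
```

Even this trivial model shows the key property Wolfram emphasizes: the output is whatever is statistically most plausible given what came before, with no notion of truth attached.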


Tokenization

"Let's Build the GPT Tokenizer" — Andrej Karpathy (2024) A video tutorial walking through the implementation of a Byte Pair Encoding tokenizer from scratch. Karpathy is an exceptional educator, and the tutorial gives a ground-level understanding of why tokens are the units they are, how the vocabulary is built, and why tokenization affects what models find easy or hard. Level: Intermediate (some Python) | Available on YouTube and karpathy.ai
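The merge loop at the heart of the tutorial can be sketched in a few lines. This is a toy character-level version, with an invented training string; real tokenizers such as the one Karpathy builds start from bytes and run tens of thousands of merges. The algorithm repeatedly finds the most frequent adjacent pair of symbols and fuses it into a new vocabulary entry.

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count adjacent symbol pairs across the current token sequence.
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def bpe_train(text, num_merges):
    """Toy Byte Pair Encoding: repeatedly merge the most frequent
    adjacent pair into a single new symbol, recording each merge."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        merges.append(pair)
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens, merges

tokens, merges = bpe_train("low lower lowest", 3)
```

Running this on the sample string merges "l"+"o" and then "lo"+"w" first, because those pairs recur in every word — which is exactly why frequent character sequences end up as single tokens while rare ones stay fragmented.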

OpenAI Tokenizer Tool An interactive tool that allows you to paste any text and see exactly how it is tokenized — which fragments become which tokens, and how many tokens a given text requires. Essential for developing practical intuition about token counts. Level: Accessible | Available at: platform.openai.com/tokenizer


Training Cutoffs and Knowledge Limitations

"TruthfulQA: Measuring How Models Mimic Human Falsehoods" — Lin et al. (2022) A research paper that benchmarks language models on questions where humans commonly hold false beliefs (misconceptions, folk wisdom, commonly misattributed quotes). The findings demonstrate that larger models are not necessarily more truthful — they are better at generating plausible-sounding text, which sometimes means generating the false-but-common answer. A foundational paper for understanding the relationship between model scale and accuracy. Level: Technical–Accessible

"Measuring Massive Multitask Language Understanding" — Hendrycks et al. (2021) The MMLU benchmark evaluates model knowledge across 57 domains, from elementary mathematics to professional medicine and law. The results illustrate which domains models tend to be accurate in and where performance degrades — providing a useful map of where to apply more vs. less skepticism. Level: Technical–Accessible


Hallucination and the Fluency–Accuracy Gap

"Survey of Hallucination in Natural Language Generation" — Ji et al. (2023) A comprehensive academic survey of hallucination across NLP tasks — the types, causes, detection methods, and mitigation strategies. More technical than most resources in this list, but the taxonomy of hallucination types (intrinsic vs. extrinsic) and the discussion of evaluation methods provide a useful framework for thinking about when and why models produce false content. Level: Technical

"Language Models (Mostly) Know What They Know" — Kadavath et al. (2022) A research paper examining whether language models have calibrated self-knowledge — whether they are aware of what they do and do not know. The findings are nuanced: models show some ability to express calibrated uncertainty when prompted explicitly, but this self-knowledge is imperfect and domain-dependent. Relevant to questions about when to ask a model to assess its own confidence. Level: Technical

"On the Dangers of Stochastic Parrots: Can Language Models Be Too Big?" — Bender et al. (2021) An influential and somewhat controversial paper examining the risks of large language models, including the argument that they generate plausible text by pattern-matching rather than genuine understanding. A useful counterweight to over-optimistic perspectives on model capability, and a good introduction to critical literature on AI fluency and its limits. Level: Accessible


Context Windows and Memory

"Lost in the Middle: How Language Models Use Long Contexts" — Liu et al. (2023) A research paper examining how model performance changes depending on where relevant information is positioned within a long context window. The finding — that models tend to perform better on information at the beginning and end of a context window than on information in the middle — has direct implications for how you structure prompts and documents. Level: Technical–Accessible
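The paper's practical upshot can be expressed as a simple ordering heuristic. The function below is illustrative and not taken from the paper: it ranks documents by a relevance score you supply, then places the strongest at the edges of the context and relegates the weakest to the middle, where (per the paper's findings) attention is least reliable.

```python
def assemble_context(documents, relevance_scores):
    """Order documents so the highest-scoring ones sit at the edges of
    the context window rather than the middle.

    Heuristic sketch: rank by score, alternate placements between the
    front and the back, and let the weakest documents fall in between.
    """
    ranked = sorted(zip(relevance_scores, documents), reverse=True)
    front, back = [], []
    for i, (_, doc) in enumerate(ranked):
        (front if i % 2 == 0 else back).append(doc)
    # Reverse the back half so its strongest document ends up last.
    return front + back[::-1]

docs = ["weak note", "key fact", "aside", "strong evidence"]
scores = [0.2, 0.9, 0.1, 0.8]
ordered = assemble_context(docs, scores)
# "key fact" leads, "strong evidence" closes, weaker items sit between
```

The same logic applies to hand-written prompts: state the critical instruction or fact near the beginning, restate or place supporting material at the end, and avoid burying must-use information in the middle of a long context.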

"Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks" — Lewis et al. (2020) The foundational paper on retrieval-augmented generation (RAG) — a technique that addresses training cutoff and context window limitations by retrieving relevant documents and inserting them into the context at inference time. Understanding RAG helps explain how some AI tools appear to have more current knowledge than their training cutoff would suggest. Level: Technical
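The basic RAG loop (retrieve relevant text, then insert it into the prompt) can be sketched with a deliberately crude retriever. Everything here is illustrative: the word-overlap scoring, the document strings, and the prompt template are invented, and production systems use embedding-based retrieval rather than bag-of-words matching.

```python
import re

def word_set(text):
    # Lowercase and keep only word characters: a crude normalizer.
    return set(re.findall(r"\w+", text.lower()))

def retrieve(query, documents, k=2):
    # Rank documents by word overlap with the query; keep the top k.
    q = word_set(query)
    scored = sorted(documents, key=lambda d: len(q & word_set(d)),
                    reverse=True)
    return scored[:k]

def build_prompt(query, documents):
    # Insert the retrieved text into the context so the model can
    # answer from material it was never trained on.
    context = "\n".join(retrieve(query, documents))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

docs = [
    "The 2024 release added streaming support.",
    "Cats are popular pets.",
    "Streaming support requires version 2.4 or later.",
]
prompt = build_prompt("When was streaming support added?", docs)
```

The key observation is that the model never "learns" the retrieved facts: they arrive through the context window at inference time, which is why a RAG-backed tool can answer questions about events after its training cutoff.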


Emergent Capabilities

"Emergent Abilities of Large Language Models" — Wei et al. (2022) The key research paper documenting the emergence phenomenon — the finding that certain capabilities appear sharply as models scale past certain parameter thresholds, rather than improving gradually. The paper presents case studies of specific capabilities and discusses the implications for capability prediction. Level: Technical–Accessible

"Sparks of Artificial General Intelligence: Early Experiments with GPT-4" — Bubeck et al. (2023) A long Microsoft Research report examining GPT-4's capabilities across a wide range of tasks. Whether or not you accept the "sparks of AGI" framing, the case studies are illuminating — they show both the impressive range of GPT-4's capabilities and the characteristic ways in which it fails. Good empirical grounding for intuitions about emergent capability. Level: Accessible–Intermediate


Practical and Applied Perspectives

"Prompt Engineering Guide" — DAIR.AI A comprehensive, continuously updated guide to prompting techniques grounded in empirical findings. Covers chain-of-thought prompting, few-shot learning, role assignment, and many other techniques with explanations of why they work mechanistically. More practical than theoretical, but grounded in research. Level: Accessible | Available at: promptingguide.ai
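Few-shot prompting, one of the guide's core techniques, is mechanically just string assembly: worked input/output pairs followed by the new input, so the model infers the task from the pattern. The example pairs and template below are invented for illustration.

```python
def few_shot_prompt(examples, query):
    """Assemble a few-shot prompt from (input, output) example pairs.

    The trailing "Output:" cue invites the model to continue the
    pattern established by the worked examples.
    """
    parts = [f"Input: {x}\nOutput: {y}" for x, y in examples]
    parts.append(f"Input: {query}\nOutput:")
    return "\n\n".join(parts)

examples = [
    ("The movie was wonderful", "positive"),
    ("Terrible service, never again", "negative"),
]
prompt = few_shot_prompt(examples, "The food was great")
```

Nothing in the assembled string tells the model what "sentiment classification" is; the examples alone establish the task, which is why the guide stresses choosing examples that unambiguously pin down the pattern you want continued.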

"AI Snake Oil" — Arvind Narayanan and Sayash Kapoor (2024) A book and ongoing blog by two Princeton researchers examining AI hype, limitations, and the gap between AI capability claims and reality. An excellent antidote to both over-optimism and reflexive dismissal, with careful analysis of where AI tools do and do not deliver value. The chapter on predictive AI versus generative AI is particularly useful for calibrating expectations. Level: Accessible | Available at: aisnakeoil.com (blog); book available through standard channels