Chapter 2: Quiz — How Language Models Think

Test your understanding of the conceptual framework from Chapter 2. Each question is followed by its answer. Try to work out your own answer before reading it.


Question 1

A language model has a 128,000-token context window. A user provides a document of approximately 80,000 words. Roughly how many tokens does this document represent, and does it fit within the context window?

Answer At approximately 0.75 words per token (or about 1.33 tokens per word), an 80,000-word document represents roughly 106,000 tokens. This fits within a 128,000-token context window — but it leaves only about 22,000 tokens for the prompt, any system instructions, and the model's response. At this scale, there is very little room for additional context, and the model's ability to generate a lengthy response is constrained.
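The arithmetic can be sketched as a small helper. The 1.33 tokens-per-word figure is the chapter's rule of thumb, not an exact conversion; real counts depend on the tokenizer and the text.

```python
# Heuristic only: ~1.33 tokens per English word (~0.75 words per token).
TOKENS_PER_WORD = 1.33

def estimate_tokens(word_count: int) -> int:
    """Back-of-the-envelope token estimate for English prose."""
    return round(word_count * TOKENS_PER_WORD)

def remaining_budget(word_count: int, context_window: int) -> int:
    """Tokens left for instructions and the response after the document."""
    return context_window - estimate_tokens(word_count)

doc_tokens = estimate_tokens(80_000)           # roughly 106,400 tokens
leftover = remaining_budget(80_000, 128_000)   # roughly 21,600 tokens
```

The leftover budget has to cover the system prompt, the user's instructions, and the entire generated response, which is why an 80,000-word document leaves so little headroom.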

Question 2

What is the fundamental operation that a language model performs when generating a response?

Answer The fundamental operation is next-token prediction: given all the tokens seen so far (the prompt plus any tokens the model has already generated), predict the most probable next token. This operation is performed iteratively — one token at a time — until the model determines the response is complete (or until a length limit is reached). All higher-level behaviors, including apparent reasoning and creativity, emerge from this probabilistic token-by-token generation process.
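The loop can be sketched in a few lines. Here `model` is a stand-in callable that returns a probability distribution over the next token; the token ids and the `eos` convention are illustrative, not any real API.

```python
import random

def sample(probs):
    """Draw one token id from a {token_id: probability} distribution."""
    ids, weights = zip(*probs.items())
    return random.choices(ids, weights=weights, k=1)[0]

def generate(model, prompt_tokens, max_tokens=256, eos=0):
    """Autoregressive loop: predict one token, append it, repeat."""
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):      # hard length limit
        probs = model(tokens)        # distribution over the next token
        next_tok = sample(probs)
        tokens.append(next_tok)      # the output becomes part of the input
        if next_tok == eos:          # end-of-sequence token: stop
            break
    return tokens
```

Note that each generated token is appended to `tokens` before the next prediction, which is why earlier output shapes later output.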

Question 3

A colleague claims: "I asked the AI about the library update that came out last month and it gave me a very detailed answer, so it clearly has current information." What is wrong with this reasoning?

Answer The reasoning confuses detail with currency. Language models produce detailed, confident-sounding answers based on the statistical patterns in their training data, and all of that training data predates the training cutoff. A model can give an extremely detailed, authoritative-sounding description of a library, but only of the version that existed before the cutoff. The detail of the response does not indicate that the model has access to current information; it indicates that the library was well-represented in its training data. The colleague should verify the answer against current documentation regardless of how detailed it seems.

Question 4

What is "temperature" in the context of language model generation, and what happens to output quality at very low and very high temperature settings?

Answer Temperature is a parameter that controls how the model samples from its probability distribution of possible next tokens. At very low temperature (approaching zero), the model consistently selects the highest-probability token, producing deterministic, predictable, and often repetitive output. At very high temperature, the model samples more liberally from lower-probability tokens, producing more varied and creative output, but also increasing the likelihood of incoherence, logical errors, and unexpected tangents. Most practical applications use a moderate temperature that balances coherence with variation. Importantly, high temperature does not improve accuracy — it introduces more randomness, which can help with creative tasks but tends to hurt precision tasks.
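Mechanically, temperature divides the model's raw scores (logits) before they are converted to probabilities. A self-contained sketch:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores into probabilities, scaled by temperature.
    Low temperature sharpens the distribution toward the top token;
    high temperature flattens it toward uniform."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)                            # subtract max for stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.1)   # top token dominates
hot = softmax_with_temperature(logits, 10.0)   # nearly uniform
```

At low temperature the top token's probability approaches 1 (near-deterministic output); at high temperature the probabilities flatten, so low-probability tokens are sampled often enough to produce tangents and incoherence.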

Question 5

Elena is working on a long consulting engagement and notices the AI tool starting to ignore constraints she established early in the conversation. What is the most likely technical explanation?

Answer The most likely explanation is context window overflow or truncation. The conversation has grown long enough that the earliest content — where the constraints were established — has been pushed outside the model's active context window. The model literally cannot see that content anymore. It is not ignoring the constraints; it simply has no access to them. The appropriate response is to re-provide the critical constraints, either by pasting them again or by including a structured context summary at the start of each new session.
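A simplified sketch of why the earliest messages drop out first (real systems vary; some summarize or trim the middle of the conversation instead, so this is illustrative only):

```python
def fit_to_window(messages, window, count_tokens):
    """Keep the most recent messages that fit in the token window.
    The oldest messages (often where constraints were set) fall off first."""
    kept, used = [], 0
    for msg in reversed(messages):   # walk from newest to oldest
        cost = count_tokens(msg)
        if used + cost > window:
            break                    # this message and everything older is dropped
        kept.append(msg)
        used += cost
    return list(reversed(kept))
```

Once Elena's constraint message no longer fits, it is not in the input at all, which is why re-pasting the constraints restores the behavior.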

Question 6

What is a "hallucination" in the context of language models, and what causes it?

Answer In AI usage, a hallucination refers to the generation of plausible-sounding but factually incorrect content — invented papers, wrong dates, misattributed quotes, non-existent features, or other fabrications presented with confidence. Hallucinations are caused by the same mechanism that drives all language model output: next-token prediction. When the model encounters a prompt, it generates the most statistically probable continuation. Sometimes the probability distribution leads toward incorrect content — because the model is pattern-matching on related but not identical material, or because correct information in that domain was sparse or ambiguous in the training data. The model has no internal truth-verification mechanism that separates accurate from inaccurate content; it produces what is statistically probable, not what is factually true.

Question 7

What does the term "emergent capability" mean when applied to large language models?

Answer An emergent capability is one that appears as a language model scales up in size and training but was not explicitly designed or trained for. Examples include multi-step reasoning, analogical thinking, and certain forms of in-context learning. These capabilities are called emergent because they arise from scale rather than from any deliberate architectural or training choice: they appear to emerge from the model's learning to predict text across a vast range of human-written content. Emergent capabilities are uneven and somewhat unpredictable: a model may demonstrate sophisticated reasoning on certain types of problems while failing on structurally similar problems that require slightly different pattern-matching.

Question 8

You are using an AI tool to research best practices for a specific software security topic. You find the response detailed, well-organized, and consistent with what you already know. Should you trust this response without verification? Explain your reasoning.

Answer No. There are two important reasons: First, the training cutoff may mean that best practices have evolved since the model's training data was collected — security is a particularly fast-moving field. Second, the fluency-accuracy gap means that a well-organized, detailed response is not a reliable indicator of accuracy. The response may contain subtle errors or outdated recommendations that are difficult to spot precisely because the surrounding content is accurate. For security specifically, the consequences of acting on incorrect information are high. The appropriate approach is to use the AI response as a starting point and verify critical recommendations against current authoritative sources (official documentation, recent security research, current standards bodies).

Question 9

A developer adds a "think step by step" instruction to a prompt for a complex reasoning task. From a mechanistic perspective, why does this tend to improve output quality?

Answer Adding "think step by step" improves output quality because of how sequential token generation works. When the model generates intermediate reasoning steps, those tokens become part of the context for generating subsequent tokens — including the final answer. Intermediate reasoning creates a kind of scaffold: each reasoning step constrains the probability distribution for the next step in a way that pushes toward logically coherent conclusions. Without the instruction, the model jumps directly to generating a final answer, which may be the statistically probable answer without intermediate logical grounding. Visible reasoning also helps the user identify where the model's logic went wrong if the final answer is incorrect.
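The scaffolding effect can be made concrete with a toy two-call sketch, where `generate` stands in for any text-generation function (a hypothetical helper, not a real API). The point is that the reasoning text is part of the context when the final answer is produced:

```python
def answer_with_reasoning(generate, question):
    """Two-call sketch: first elicit intermediate steps, then request the
    final answer with those steps in context, so they constrain it."""
    steps = generate(question + "\n\nThink step by step.")
    final = generate(
        question + "\n\nReasoning so far:\n" + steps + "\n\nFinal answer:"
    )
    return steps, final
```

In a single chat completion the same thing happens implicitly: the reasoning tokens the model emits are prepended to its own context before it generates the answer tokens.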

Question 10

In your own words, explain the difference between a model's "context window" and its "memory." Why does this distinction matter practically?

Answer The context window is the set of tokens the model can process in a single inference — everything it can currently "see." This includes the current prompt, the conversation history visible in the session, and any injected content. Memory, in the way humans use the word, implies retention of information across time — the ability to recall something from a previous conversation, a previous day, or a previous project. Language models have no such memory. The context window is ephemeral: when a session ends, the context disappears entirely. A new session starts with a clean slate. This matters practically because you cannot rely on the model to remember context from previous sessions, to maintain a running understanding of a long-term project, or to build up knowledge of you and your preferences over time without explicit mechanisms to re-inject that context.

Question 11

What is tokenization, and why do code files often consume more tokens than equivalently sized prose documents?

Answer Tokenization is the process of breaking input text into tokens — the fundamental units a language model processes. Tokens are not words; they are subword fragments determined by a tokenization algorithm (such as Byte Pair Encoding) trained on the language corpus. Code files often consume more tokens per character than prose because code contains dense sequences of special characters (brackets, parentheses, semicolons, operators), precise whitespace and indentation that are tokenized separately, short variable names that may not correspond to common subwords, and syntax-specific punctuation. A line of code that takes up the same number of characters as a prose sentence may require significantly more tokens, which has implications for context window usage and API costs.
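A toy illustration, using a deliberately crude tokenizer that keeps words whole but splits out every punctuation mark and whitespace run. Real BPE tokenizers behave differently in detail, but punctuation-dense code shows the same effect:

```python
import re

def naive_tokenize(text):
    """Crude stand-in for a subword tokenizer: words stay whole,
    but each punctuation mark and whitespace run is its own token."""
    return re.findall(r"\w+|[^\w\s]|\s+", text)

prose = "The quick brown fox jumps over the lazy dog today."
code = "if (x[i] != y) { return f(a, b); }"

prose_density = len(naive_tokenize(prose)) / len(prose)
code_density = len(naive_tokenize(code)) / len(code)   # higher tokens/char
```

Even with this crude scheme, the code sample costs noticeably more tokens per character than the prose sample, because nearly every bracket, operator, and separator becomes its own token.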

Question 12

Why is it problematic to use expressed confidence as a signal of accuracy when evaluating AI output?

Answer Language models were trained on human-written text, which means they learned to produce text with the same confidence markers that human experts use — authoritative phrasing, definitive statements, lack of hedging. However, these confidence markers are stylistic patterns, not calibrated signals of accuracy. The model produces fluent, authoritative-sounding text because that is what high-quality writing looks like in the training data, not because it has verified its claims. Research has shown that the correlation between a model's expressed confidence and the accuracy of its claims is weak — models do not reliably hedge more when they are more likely to be wrong. Treating confident tone as a reliability signal leads to accepting incorrect information that happens to be delivered with authority.

Question 13

Raj uses GitHub Copilot on a project that relies on a library that had a major breaking-change release eight months ago. The model's training cutoff predates this release. What specific type of Copilot behavior should Raj anticipate, and what practice should he adopt?

Answer Raj should anticipate that Copilot will generate code that targets the pre-release version of the library — using the old API, the deprecated method names, the old import patterns, or the old configuration structure. The suggestions will likely look syntactically correct and may even follow good coding patterns; the problem will not be obvious from the code's appearance. The failure will manifest at runtime. Raj should adopt the practice of running any AI-suggested library calls against the current documentation before trusting them, particularly for libraries he knows have undergone recent changes. He might also add a note to his prompt specifying the library version he is targeting, which may help if the model has any knowledge of the newer version (though this is not guaranteed if the version postdates the training cutoff).

Question 14

What is the "brilliant student who read everything but experienced nothing" analogy meant to capture about language models?

Answer The analogy is meant to capture the gap between apparent knowledge and genuine understanding or experience. A language model has processed an enormous amount of human-written text — more than any human could read. It can speak fluently about almost any topic because it has learned the patterns of how those topics are discussed. But it has never directly experienced anything: no sensation, no consequence, no trial and error, no failure. When it gives advice on how to negotiate a contract, it is reflecting the patterns of how such advice is written, not the experience of having negotiated contracts. This produces a characteristic failure mode: the model is excellent at synthesizing and articulating what others have written, but it lacks the grounding that comes from direct experience — which means it can miss context that would be obvious to someone who has actually done the thing.

Question 15

A content creator notices that when she asks an AI tool to write a marketing email, she always gets the same structure: a hook, a problem statement, a solution, and a call to action. Even when she explicitly asks for a different approach, the model gravitates back to this structure. What mechanism explains this, and what might she do about it?

Answer This behavior reflects the statistical patterns in the model's training data. Marketing emails and copywriting guides in the training data likely overwhelmingly follow the hook-problem-solution-CTA structure, making this the highest-probability output pattern for that type of request. The model gravitates toward it even when asked for something different because alternative structures are less represented in its training data and therefore less probable. To overcome this, she could: (1) provide a concrete example of the structure she wants, rather than describing it abstractly; (2) explicitly show the model the structure to avoid and then show what she wants instead; (3) describe the structural elements she wants in very specific terms; or (4) ask the model to generate several distinct structural options and evaluate them. Specificity and examples are more effective than abstract instructions when trying to override strongly learned patterns.