Chapter 2: Key Takeaways — How Language Models Think
-
Tokens are the fundamental unit of language model processing. Text is not processed as words or characters but as tokens — subword fragments determined by a tokenization algorithm. Understanding tokens is a prerequisite to understanding context windows, costs, and generation behavior.
-
A rough rule of thumb: 1 token equals approximately 3–4 characters or 0.75 words in English. Code, special characters, and technical content tend to be more token-dense than natural prose.
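The rule of thumb above can be sketched as a quick estimator. This is a heuristic only, assuming the chapter's ~4-characters and ~0.75-words figures; real counts come from the model's own tokenizer.

```python
# Rough token-count estimator based on the chapter's rule of thumb
# (~4 characters or ~0.75 words per English token). These ratios are
# assumptions for illustration; a real tokenizer gives exact counts.

def estimate_tokens(text: str) -> int:
    """Average the character-based and word-based estimates."""
    by_chars = len(text) / 4             # ~4 characters per token
    by_words = len(text.split()) / 0.75  # ~0.75 words per token
    return round((by_chars + by_words) / 2)

print(estimate_tokens("Understanding tokens is a prerequisite."))  # → 8
```

Note that code and technical content will usually tokenize more densely than this estimate suggests.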
-
The fundamental operation is next-token prediction. At its core, a language model does one thing: predict the most probable next token given all tokens seen so far. All higher-level behavior — apparent reasoning, creativity, explanation — emerges from this operation performed iteratively.
-
Generation is sequential and probabilistic. The model generates one token at a time, in order. It cannot revise previous tokens, only continue forward. Each generated token becomes part of the context for the next.
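The loop described above can be made concrete. The "model" here is a toy bigram table — an assumption made purely so the sketch is self-contained — but the loop structure is the real point: sample one token, append it to the context, never revise.

```python
# Minimal sketch of autoregressive generation: one token at a time,
# each sampled token appended to the context before predicting the
# next. The bigram table below is a toy stand-in for a real model.
import random

BIGRAMS = {  # toy next-token distributions: token -> [(next, prob)]
    "the": [("cat", 0.6), ("dog", 0.4)],
    "cat": [("sat", 1.0)],
    "dog": [("ran", 1.0)],
    "sat": [("<eos>", 1.0)],
    "ran": [("<eos>", 1.0)],
}

def generate(context: list, max_tokens: int = 10) -> list:
    for _ in range(max_tokens):
        choices = BIGRAMS.get(context[-1])
        if choices is None:
            break
        tokens, probs = zip(*choices)
        nxt = random.choices(tokens, weights=probs)[0]  # sample forward only
        if nxt == "<eos>":
            break
        context.append(nxt)  # the new token is now part of the context
    return context

print(generate(["the"]))  # e.g. ['the', 'cat', 'sat'] or ['the', 'dog', 'ran']
```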
-
Temperature controls sampling behavior. Low temperature produces more deterministic, predictable output. High temperature produces varied, creative, but potentially less coherent output. Temperature is not a quality dial — it is a determinism-versus-variation trade-off.
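The mechanism is simple: raw scores (logits) are divided by the temperature before being converted to probabilities. A minimal sketch, with made-up logit values for illustration:

```python
# Temperature scaling: dividing logits by T sharpens the distribution
# (T < 1) or flattens it (T > 1) before sampling.
import math

def softmax_with_temperature(logits, temperature):
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 0.5))  # sharper: top token dominates
print(softmax_with_temperature(logits, 2.0))  # flatter: more variation
```

At very low temperature the top token is chosen almost every time, which is why output becomes near-deterministic.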
-
Every model has a training cutoff. All knowledge the model has was learned from data collected before a specific date. The model has no information about events, developments, or changes that occurred after that date.
-
The model does not know what it does not know. After its training cutoff, the model does not experience a gap in its knowledge as an absence. It will generate confident, plausible-sounding responses about post-cutoff topics based on pre-cutoff patterns.
-
Training cutoffs create the "frozen knowledge" problem. Software libraries, APIs, regulations, best practices, and organizational information all change over time. AI output on time-sensitive topics reflects the state of the world at the training cutoff, not today.
-
The context window is all the model can see at once. Everything the model can consider when generating a response — your prompt, the conversation history, any injected documents, system instructions — must fit within the context window. What is outside the window is invisible to the model.
-
Context window and memory are not the same thing. The model has no persistent memory across sessions. A new session always starts with an empty context. Information from previous sessions is not available unless explicitly re-provided.
-
Long conversations cause context dropout. In extended conversations, early content eventually scrolls out of the context window. Constraints, decisions, and context established early in a session may no longer be visible to the model by the time you are working through later parts of the conversation.
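Context dropout can be sketched as a fixed token budget filled from the most recent turn backwards. The per-turn token counts and budget here are assumptions for illustration; real systems count tokens with the model's tokenizer.

```python
# Sketch of why early turns "scroll out": a fixed token budget is
# filled from the newest turn backwards, and whatever does not fit
# is invisible to the model.

def visible_turns(turns, budget):
    """Return the suffix of the conversation that fits in the budget."""
    kept, used = [], 0
    for text, tokens in reversed(turns):
        if used + tokens > budget:
            break  # everything earlier than this point is dropped
        kept.append((text, tokens))
        used += tokens
    return list(reversed(kept))

turns = [("constraint: use Python 3.9", 8),
         ("long design discussion", 900),
         ("latest question", 12)]
print(visible_turns(turns, budget=915))
# the early constraint no longer fits and is silently dropped
```

This is why constraints stated early in a long session may need to be restated later.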
-
The interface may show more than the model can see. A chat interface typically displays the full conversation history. The model's active context window may be much shorter. Do not assume the model can see everything visible in your interface.
-
Fluency does not imply accuracy. Language models produce confident, well-structured, authoritative-sounding text as a consequence of training on high-quality text — not as a consequence of fact-checking their output. The same fluent tone accompanies correct and incorrect statements.
-
Hallucinations occur across all domains. Models will confabulate — generate plausible-sounding but factually incorrect content — in well-represented domains as well as obscure ones. The familiarity of a subject is not a reliable proxy for the accuracy of AI output about it.
-
The fluency-accuracy gap is the most practically dangerous AI characteristic. It is dangerous precisely because incorrect output looks identical to reliable output: the same fluent, confident prose accompanies both. Detection requires external verification, not assessment of the output itself.
-
Chain-of-thought prompting works for a mechanistic reason. Asking a model to "think step by step" produces better reasoning because the intermediate reasoning tokens become part of the context for generating the final answer. The reasoning scaffold constrains subsequent token probabilities toward logical consistency.
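A minimal sketch of the technique, showing a direct prompt versus a chain-of-thought prompt. The question and prompt wording are illustrative assumptions; the structural point is that the model's own reasoning tokens sit in the context before the final answer is generated.

```python
# Direct prompt vs. chain-of-thought prompt. The reasoning the model
# emits after "Reasoning:" becomes context that conditions the final
# answer toward consistency with the intermediate steps.

question = ("A train leaves at 3:40 and the trip takes 95 minutes. "
            "When does it arrive?")

direct_prompt = f"{question}\nAnswer:"

cot_prompt = (
    f"{question}\n"
    "Think step by step, then give the final answer on its own line.\n"
    "Reasoning:"
)

print(cot_prompt)
```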
-
Emergent capabilities arise from scale. Capabilities like multi-step reasoning, analogical thinking, and certain forms of in-context learning were not explicitly designed — they emerged as models scaled. This makes capability prediction difficult and makes testing on your actual task essential.
-
Capability is uneven. A model may handle a complex task brilliantly in one formulation and fail on a structurally similar task in a slightly different formulation. Do not generalize from performance on one task to expected performance on all similar tasks.
-
Language model "thinking" is mechanistically different from human thought. The model has no experience, no persistent self-model, no embodied knowledge, and no memory across sessions. What looks like understanding is pattern-matching at enormous scale. This distinction has real implications for what the model can and cannot reliably do.
-
The "brilliant student who read everything" is a useful conceptual frame. The model is excellent at synthesizing, articulating, and reasoning within patterns from its training data. It is less reliable on things that require direct experience, real-time information, or knowledge that was underrepresented or absent from training data.
-
Context management is a skill you must develop. Because the model has no persistent memory, you are responsible for ensuring it has the context it needs at any given moment. For extended projects, this means maintaining and re-providing context documents at the start of each session.
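One way to practice this is to keep a project context document and prepend it to the first prompt of every session. The file name and contents below are hypothetical, but the pattern is the point: the model starts empty, so you re-provide the context.

```python
# Sketch of session-start context re-provision: a maintained context
# document is prepended to each new session's first prompt, because
# the model retains nothing from previous sessions.
from pathlib import Path

def start_session_prompt(context_file: str, task: str) -> str:
    """Prepend the project context document to this session's first prompt."""
    context = Path(context_file).read_text(encoding="utf-8")
    return (
        "Project context (re-provided each session; the model has no "
        "memory of previous sessions):\n"
        f"{context}\n\n"
        f"Task for this session:\n{task}"
    )

# Hypothetical context document for illustration.
Path("project_context.md").write_text(
    "Constraints: Python 3.9, no external services.\n"
    "Decisions so far: REST API, SQLite storage.\n",
    encoding="utf-8",
)
print(start_session_prompt("project_context.md", "Add pagination to /items."))
```

Updating this document as decisions accumulate is what keeps later sessions consistent with earlier ones.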
-
For time-sensitive information, treat AI output as a starting point, not a conclusion. Use AI to understand the shape of the problem and the vocabulary of the domain, then verify specific implementation details against authoritative current sources.
-
Ask the model to surface its own uncertainty. Including prompts like "note anything about this that might be outdated or that I should verify" often produces useful hedging that the model would not include unprompted.
-
Your working framework for AI output should include two filters: (1) Is this time-sensitive, and does it need to be checked against current sources? (2) Is this a claim with real-world consequences if wrong, and does it need independent verification regardless of age?
-
Understanding the mechanics predicts the failure modes. Training cutoffs explain why AI gives outdated information. Context windows explain why AI "forgets" constraints. Probabilistic generation explains why output varies. Fluency-accuracy gaps explain why confident output needs verification. Each failure mode you understand is one you can anticipate and prevent.