Chapter 2: Exercises — How Language Models Think
These exercises are designed to move your understanding of LLM mechanics from conceptual to operational. Each one asks you to apply a principle from the chapter in a practical context. Some require access to an AI tool; others require only reflection and analysis.
Section A: Token Awareness
Exercise 1: Token Estimation Practice
Before consulting any tools, estimate the approximate token count for each of the following inputs:
- A 500-word blog post
- A Python function with 40 lines of code
- A 10-item bulleted list with brief descriptions
- A single sentence: "Please summarize this document in three bullet points."
- A 3,000-word academic paper
Now use a tokenization tool (OpenAI's Tokenizer at platform.openai.com/tokenizer, or any available tool) to check your estimates. Where were you furthest off? What patterns explain the discrepancies? Code tends to tokenize differently than prose — what did you observe?
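If you want a quick baseline for your estimates, a common rule of thumb is roughly four characters of English prose per token. The helper below is a sketch built on that heuristic; the ratio is an assumption, not a property of any particular model, and code or non-English text will deviate from it:

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate using the ~4-characters-per-token heuristic
    for English prose. Treat the result as a ballpark figure, not a
    measurement -- always check against a real tokenizer."""
    return max(1, round(len(text) / chars_per_token))

sentence = "Please summarize this document in three bullet points."
print(estimate_tokens(sentence))  # ~14 by this heuristic
```

Comparing this heuristic's output against a real tokenizer is itself a useful version of Exercise 1: the gap between the two numbers shows you where the heuristic breaks down.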
Exercise 2: The Cost of Your Prompt
Take a recent complex prompt you sent to an AI tool — one with detailed instructions, examples, or a document pasted in. Estimate or measure its token count. Now calculate: if you were using an API at $X per million tokens, what did that prompt cost? How does that change your thinking about what to include in a prompt?
If you do not have an API account, use the following scenario: You paste a 2,000-word document plus a 150-word instruction into a prompt. Using a rate of $3 per million input tokens, what does this cost? How many such prompts could you send for $1?
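The scenario above can be worked out directly. This sketch assumes roughly 1.33 tokens per English word, a common rough ratio; your actual count will vary by tokenizer:

```python
WORDS = 2000 + 150          # document plus instruction
TOKENS_PER_WORD = 1.33      # rough English ratio (assumption)
RATE_PER_MILLION = 3.00     # dollars per million input tokens

tokens = WORDS * TOKENS_PER_WORD                 # ~2,860 tokens
cost = tokens / 1_000_000 * RATE_PER_MILLION     # ~$0.0086 per prompt
print(f"~{tokens:.0f} tokens, ${cost:.4f} per prompt")
print(f"~{int(1 / cost)} such prompts per dollar")
```

At this rate a single prompt costs well under a cent, and you could send it over a hundred times for a dollar. The point of the exercise is not the absolute cost but the habit of knowing it.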
Exercise 3: The Token Split
The following words are likely to be split into multiple tokens by most models. For each one, predict how it might be split (e.g., "tokenization" → "token" + "ization"), and explain what that means for how the model "sees" the word:
- "microservices"
- "deserialization"
- "antidisestablishmentarianism"
- "COVID-19"
- "GPT-4"
- "TypeScript"
Does the splitting have implications for how you should write technical prompts?
Section B: Next-Token Prediction and Generation
Exercise 4: Predict the Output
Before sending each of the following prompts to an AI tool, write down your prediction of the first sentence of the response. Then send the prompt and compare. For which prompts were you most accurate? What does that tell you about your understanding of the model's pattern-matching?
- "The most important lesson I learned about leadership is..."
- "Here is a simple Python function that reverses a string:"
- "The main difference between REST and GraphQL is..."
- "Dear [Client Name], I am writing to follow up on our meeting last week..."
Exercise 5: Temperature in Action
If you have access to a tool that allows temperature adjustment (such as the OpenAI Playground), send the same creative prompt at temperature 0.1, temperature 0.7, and temperature 1.5. Record the outputs. How do they differ in vocabulary, structure, predictability, and creativity? How do they differ in accuracy or coherence?
If you do not have temperature control, try submitting the same creative prompt five times in a normal AI interface and note the variation across responses. What varies? What stays constant?
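Under the hood, temperature rescales the model's scores for candidate next tokens before sampling. The toy illustration below uses made-up logits for four hypothetical tokens to show the effect: low temperature sharpens the distribution toward the top choice, high temperature flattens it.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by temperature, then apply softmax.
    Low temperature -> near-deterministic top-token choice;
    high temperature -> flatter, more varied sampling."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0, 0.5]  # made-up scores for four candidate tokens
for t in (0.1, 0.7, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At temperature 0.1 the top token absorbs nearly all the probability mass; at 1.5 the alternatives become genuinely competitive, which is why high-temperature outputs read as more varied but also less predictable.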
Exercise 6: Chain-of-Thought Experiment
Take a moderately complex reasoning problem — a multi-step math word problem, a logical puzzle, or a strategic question with multiple factors. Send it twice:
- Version 1: "Answer this question: [problem]"
- Version 2: "Think through this step by step before giving your final answer: [problem]"
Compare the quality and accuracy of the two responses. Does the chain-of-thought prompt change the answer? Does it change the quality of reasoning? Does it affect your ability to evaluate the answer?
Section C: Training Cutoffs and Frozen Knowledge
Exercise 7: Identify the Cutoff
Ask an AI tool about events in three time periods:
- Something well before its training cutoff (a historical event from several years ago)
- Something that might be near its training cutoff (a technology update from roughly the training cutoff period)
- Something after its training cutoff (a recent development you know about)
How does the model respond to each? Does it signal uncertainty about the more recent material? Does it confabulate for the post-cutoff event, or does it correctly acknowledge it does not know?
Exercise 8: The Outdated Library Problem
Choose a software library or framework you are familiar with that has had a significant update in the last two years (adding new features, changing APIs, or deprecating methods). Ask an AI tool to explain how to use a feature that changed in the recent update.
Evaluate the response: Is the model describing the old version, the new version, or a confused mix? How would you know if you were not already familiar with the library? What does this tell you about using AI for documentation research?
Exercise 9: The Frozen Knowledge Audit
Make a list of five questions you have asked or might ask an AI tool in your work. For each one, assess: is this question time-sensitive? Does the correct answer depend on information that might have changed since the model's training cutoff? For the time-sensitive ones, what would be the consequence of acting on outdated information? How would you verify?
Section D: Context Windows
Exercise 10: The Conversation Length Test
Start a conversation with an AI tool and establish some specific constraints early — a fictional scenario with specific rules, a writing style guide, or a list of project requirements. Continue the conversation for as long as possible, adding new content and requests. Note the point at which the model begins to ignore or contradict the early constraints. How many exchanges passed before context dropout was visible? What type of information was lost first?
Exercise 11: The Context Re-anchor
Following on from Exercise 10 (or starting a fresh long conversation): when you observe context dropout, try two different recovery strategies:
- Strategy A: Simply continue and see if the problem resolves itself
- Strategy B: Re-paste the original context or constraints and explicitly note that these still apply
Which strategy is more effective? How many tokens did the re-anchor cost? Was the overhead worth it?
Exercise 12: The Context Budget
You are working on a complex analysis task that requires:
- A 1,500-word background document
- A 500-word instruction set
- A 200-word example output
- A 100-word current question
Estimate the total token budget this requires. If the model has a 16,000-token context window, how much space remains for the model's response? At what point would you need to reconsider your approach? What would you trim first and why?
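A worked sketch of this budget, again assuming roughly 1.33 tokens per English word (the ratio is an approximation, not a model-specific figure):

```python
TOKENS_PER_WORD = 1.33  # rough English ratio (assumption)
WINDOW = 16_000         # context window from the exercise

components = {
    "background document": 1500,  # word counts from the exercise
    "instruction set": 500,
    "example output": 200,
    "current question": 100,
}

input_tokens = sum(words * TOKENS_PER_WORD for words in components.values())
remaining = WINDOW - input_tokens
print(f"~{input_tokens:.0f} input tokens, ~{remaining:.0f} left for the response")
```

By this estimate the inputs consume only about a fifth of the window, so a 16,000-token model is comfortable here; the arithmetic matters more when the background document is ten times longer or the window is shared with a long conversation history.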
Section E: Fluency-Accuracy Gap
Exercise 13: The Confidence Calibration Test
Find five factual claims — some true, some false, all plausible-sounding — and present them to an AI tool as questions: "Is it true that...?" Rate the model's expressed confidence on a scale of 1-5 for each response. Then verify the actual accuracy.
Is there a correlation between the model's expressed confidence and the actual accuracy of its answer? What does this tell you about using tone or assertiveness as a proxy for reliability?
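Once you have a confidence rating and a verified verdict for each claim, the correlation question can be answered numerically. This is a minimal sketch using a plain Pearson correlation over hypothetical, made-up results (a negative value would mean confident answers were actually less likely to be correct):

```python
def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical results: confidence ratings (1-5) vs. verified correctness (0/1)
confidence = [5, 5, 4, 2, 5]
correct    = [1, 0, 1, 1, 0]
print(round(pearson(confidence, correct), 2))  # -0.56 for this made-up sample
```

In this invented sample the correlation is negative: the model's most confident answers were no more likely to be right. Your real data may look different, which is exactly what the exercise is testing.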
Exercise 14: The Hallucination Hunt
Ask an AI tool to:
- Cite three academic papers on a topic in your domain
- Describe a software library you know very well, including its current version number and recent features
- Name the author and year of a specific quote you provide
For each response, verify the claims. Do the papers exist? Are the version number and feature description accurate? Is the attribution correct? Note not just whether errors occurred, but how the errors were phrased — were they hedged or presented with confidence?
Exercise 15: Designing Verification Checkpoints
For each of the following AI use cases, design a specific verification checkpoint — a check you would run before acting on the output:
- Using AI to research competitor pricing
- Using AI to draft a contract clause
- Using AI to generate code that calls an external API
- Using AI to summarize a research paper and extract key findings
- Using AI to recommend a technology stack for a new project
For each checkpoint, identify: What source would you use to verify? How long would verification take? What would you do if the verification revealed an error?
Reflection Exercises
Exercise 16: Your Personal Failure Audit
Think of a time you used an AI tool and the output was wrong, misleading, or problematic in some way. Now that you understand the mechanics — tokens, training cutoffs, context windows, fluency-accuracy gaps — which mechanism was most likely responsible for the failure? What would you do differently?
Exercise 17: The "Brilliant Student" Mapping
Apply the "brilliant student who read everything but experienced nothing" analogy to your specific domain or workflow. What tasks in your work would this student handle exceptionally well? What tasks would they struggle with? What would be their characteristic failure modes in your context?
Exercise 18: Mechanics in Your Workflow
Write a brief inventory (one paragraph each) of:
- How training cutoffs affect your use cases
- How context windows affect your typical sessions
- Where you are most likely to encounter the fluency-accuracy gap in your work
Keep this inventory somewhere accessible. Revisit it after three months of deliberate practice and update it with what you have learned.