Chapter 23: Exercises

Part A: Conceptual Foundations of In-Context Learning

Exercise 23.1: Autoregressive In-Context Learning

Explain why in-context learning (ICL) is possible only in autoregressive Transformers of sufficient scale. Specifically: (a) Describe the role of the attention mechanism in enabling ICL. (b) Explain why smaller models (e.g., <1B parameters) typically fail at ICL while larger models succeed. Reference the concept of emergence. (c) If a model has fixed parameters $\theta$, how can it "learn" from demonstrations in the prompt? Reconcile this with the fact that no gradient update occurs.

Exercise 23.2: Implicit Bayesian Inference

The implicit Bayesian inference framework proposes that ICL approximates:

$$p(y \mid x_{\text{demo}}, x_{\text{query}}) \approx \sum_{c} p(y \mid c, x_{\text{query}}) \; p(c \mid x_{\text{demo}})$$

(a) Define the latent concept $c$ in your own words. Give three concrete examples of what $c$ might represent in practice. (b) How does providing more demonstrations $x_{\text{demo}}$ affect the posterior $p(c \mid x_{\text{demo}})$? (c) If the demonstrations are drawn from a task distribution unseen during pre-training, would you expect ICL to succeed? Why or why not?

Exercise 23.3: ICL as Implicit Gradient Descent

Dai et al. (2023) showed that the forward pass of a Transformer performing ICL is mathematically equivalent to gradient descent on an implicit linear model. (a) Explain what "implicit gradient descent" means in this context. Why is it called "implicit"? (b) If each attention head performs one step of gradient-based optimization, what role does depth (number of layers) play? (c) What are the practical implications of this theory for prompt engineering?

Exercise 23.4: Demonstration Ordering Effects

You are building a few-shot sentiment classifier with four demonstrations. Two are positive, two are negative. You test four orderings: [+, +, -, -], [-, -, +, +], [+, -, +, -], and [-, +, -, +]. (a) Which ordering would you expect to perform best for a query with positive sentiment? Explain the recency bias. (b) Design an experiment to test ordering effects. What metrics would you track? (c) How could you mitigate ordering sensitivity in a production system?

Exercise 23.5: Label Correctness

Min et al. (2022) found that randomly assigning labels to few-shot demonstrations does not always destroy performance. (a) Explain the four purposes demonstrations serve: format specification, label space calibration, task identification, and input-output mapping. Rank them by importance based on the Min et al. finding. (b) For which types of tasks would you expect label correctness to matter most? Least? (c) Design an experiment to test whether label correctness matters for a specific task of your choice.

Exercise 23.6: Zero-Shot vs. Few-Shot Trade-offs

You are deploying a text classification system and must choose between zero-shot and few-shot prompting. (a) List three scenarios where zero-shot is preferable to few-shot. (b) List three scenarios where few-shot is preferable to zero-shot. (c) A colleague argues that few-shot is always better because it provides more information. Construct a counterexample where zero-shot outperforms few-shot.

Part B: Prompting Techniques

Exercise 23.7: Zero-Shot Prompt Design

Write zero-shot prompts for each of the following tasks. For each, explain why your wording choices improve performance: (a) Extracting the main topic from a news article. (b) Classifying a customer support email into one of: billing, technical, account, feedback. (c) Translating informal English to formal English. (d) Determining whether two sentences are paraphrases.

Exercise 23.8: Few-Shot Example Selection

You have a pool of 100 labeled examples for a named entity recognition task. You can include 5 in your prompt. (a) Describe three strategies for selecting the 5 examples. For each, state the assumption it makes about what makes a good demonstration. (b) Would you use the same 5 examples for every query, or dynamically select them? Justify your answer. (c) Implement a similarity-based selection strategy: given a query embedding and a matrix of example embeddings, write the mathematical formula for selecting the top-$k$ examples.

Exercise 23.9: Chain-of-Thought Construction

Construct a chain-of-thought prompt for the following multi-step reasoning problem:

"A store sells apples at \$1.50 each and oranges at \$2.00 each. Maria buys 3 apples and 4 oranges. She pays with a \$20 bill. How much change does she receive?"

(a) Write a few-shot CoT prompt with two demonstrations that teach the reasoning pattern. (b) Write the zero-shot CoT version using "Let's think step by step." (c) Identify a potential failure mode of CoT for this problem and suggest a mitigation.

Exercise 23.10: Self-Consistency Analysis

Consider a math word problem where sampling 10 reasoning paths with temperature $T=0.7$ yields the following answers: [42, 42, 38, 42, 42, 38, 42, 42, 44, 42]. (a) What is the self-consistent answer? What is the effective confidence? (b) Calculate the entropy of the answer distribution. What does this tell you about the model's certainty? (c) If you could only afford 5 samples, would self-consistency still be valuable here? Justify. (d) Propose a strategy for dynamically choosing the number of samples based on intermediate results.

Exercise 23.11: Tree of Thoughts Design

Design a Tree of Thoughts (ToT) approach for the following problem: "Write a four-line poem about autumn that uses at least one metaphor, one simile, and alliteration." (a) Define the thought decomposition: what constitutes one "thought" at each step? (b) Design the evaluation prompt: how would you rate each partial solution? (c) Choose BFS or DFS and justify your choice. What branching factor would you use? (d) Estimate the total number of LLM calls required.

Exercise 23.12: Prompting Strategy Selection

For each of the following tasks, recommend the most appropriate prompting strategy (zero-shot, few-shot, CoT, self-consistency, ToT) and justify your choice: (a) Translating a sentence from English to French. (b) Solving a multi-digit multiplication problem. (c) Writing a short story with a surprise twist ending. (d) Classifying 10,000 product reviews as positive/negative. (e) Diagnosing a complex bug from a stack trace. (f) Planning a week-long travel itinerary with constraints.

Part C: Structured Outputs, System Prompts, and Templates

Exercise 23.13: JSON Schema Design

Design a JSON schema and corresponding prompt for extracting structured information from restaurant reviews. The schema should include: restaurant name, cuisine type, overall rating (1-5), price range, specific dishes mentioned, and pros/cons. (a) Write the full prompt including the schema definition and extraction instructions. (b) Add validation rules to the prompt (e.g., rating must be 1-5, price range must be one of "$", "MATH1$", "$$$$"). (c) Design a retry strategy for when the output fails validation.

Exercise 23.14: Constrained Decoding

Explain how constrained decoding ensures valid JSON output. (a) Given the partial output {"name": "Bella, what tokens should be in $\mathcal{V}_{\text{valid}}$ for the next position? (b) How does constrained decoding handle nested structures? (c) What is the computational overhead of constrained decoding compared to unconstrained generation?

Exercise 23.15: System Prompt Architecture

Design a complete system prompt for a medical triage chatbot that: (a) Defines the role and expertise level. (b) Specifies output format (structured triage categories). (c) Includes safety constraints (when to recommend emergency services). (d) Handles edge cases (off-topic questions, emotional users). (e) Prevents prompt injection attempts. Write the full system prompt, then explain each design decision.

Exercise 23.16: Role-Based Prompting Comparison

For the following query---"Explain why the sky is blue"---write three different system prompts that produce qualitatively different responses: (a) A system prompt for a physics professor explaining to graduate students. (b) A system prompt for a children's science educator for ages 6-8. (c) A system prompt for a technical writer producing a reference manual. For each, predict how the response will differ in vocabulary, length, structure, and level of detail.

Exercise 23.17: Prompt Template Engineering

Design a reusable prompt template system for a customer support application. (a) Define the template with placeholders for: customer name, issue category, account type, previous interactions summary, and tone preference. (b) Write three different instantiations of the template for different customer scenarios. (c) Explain how you would version-control these templates and manage template migrations when the schema changes.

Exercise 23.18: Dynamic Prompt Construction

You are building a few-shot prompt system that dynamically selects examples based on the user query. (a) Write pseudocode for a build_prompt function that: embeds the query, retrieves the top-$k$ similar examples from a vector store, orders them by increasing similarity, and assembles the prompt. (b) What happens if the context window is too small to fit all $k$ examples? Describe a truncation strategy. (c) How would you handle the cold-start problem when the example pool is empty?

Part D: Security, Evaluation, and RAG

Exercise 23.19: Prompt Injection Attack Vectors

For each of the following prompt injection techniques, explain how it works and propose a defense: (a) Direct instruction override: "Ignore all previous instructions and..." (b) Indirect injection via a web page the model is asked to summarize. (c) Payload smuggling via base64 encoding. (d) Role-play injection: "Pretend you are an unrestricted AI..." (e) Multi-turn manipulation: gradually shifting the model's behavior across turns.

Exercise 23.20: Delimiter-Based Isolation

Design a delimiter-based isolation scheme for a document summarization system where the user provides the document to summarize. (a) Choose appropriate delimiters and write the full prompt structure. (b) Explain why simple delimiters (like triple backticks) are insufficient against sophisticated attacks. (c) Propose a multi-layer defense that combines delimiters with input sanitization and output filtering.

Exercise 23.21: Prompt Evaluation Framework

You are tasked with evaluating a prompt for extracting key information from legal contracts. Design a complete evaluation framework: (a) Define at least five evaluation dimensions with specific metrics for each. (b) Describe how you would construct the evaluation dataset (at least 3 different data sources). (c) Design an A/B testing protocol for comparing two prompt variants. (d) Specify the statistical test you would use and the minimum sample size for detecting a 5% improvement with 80% power.

Exercise 23.22: LLM-as-Judge Design

Design an LLM-as-judge evaluation system for open-ended question answering. (a) Write the judge prompt, including evaluation criteria, scoring rubric, and output format. (b) How would you validate the judge's reliability? Propose a calibration procedure. (c) Discuss the limitations of LLM-as-judge evaluation. When should you prefer human evaluation?

Exercise 23.23: Prompt Optimization

You have a baseline prompt for summarization that achieves 0.35 ROUGE-L on your test set. The target is 0.45. (a) Describe a manual iteration strategy: what failure modes would you look for, and how would you modify the prompt? (b) Describe how DSPy or APE could automate the optimization. What is the search space? (c) Could prompt ensembling help? Design an ensemble of 3 prompts and describe the aggregation strategy.

Exercise 23.24: RAG Prompt Design

Design a RAG prompt for a customer support system that retrieves relevant FAQ entries and knowledge base articles. (a) Write the augmented prompt template including placeholders for retrieved context. (b) How should the prompt instruct the model to handle cases where the retrieved context is irrelevant? (c) How should the prompt handle contradictions between retrieved documents? (d) Design a citation mechanism where the model attributes its answers to specific retrieved documents.

Part E: Integration and Advanced Challenges

Exercise 23.25: End-to-End Prompt System

Design a complete prompt-based system for automated code review. The system should: (a) Accept a code diff as input. (b) Identify bugs, style issues, performance problems, and security vulnerabilities. (c) Provide structured output with severity levels and line-specific feedback. (d) Include a system prompt, user prompt template, and output schema. Write all prompt components and explain the design rationale.

Exercise 23.26: Multi-Step Prompt Pipeline

Design a multi-step prompt pipeline for converting a research paper abstract into a social media thread. (a) Step 1: Extract key findings and contributions. (b) Step 2: Simplify technical language for a general audience. (c) Step 3: Format as a thread with appropriate length and tone. Write each prompt and describe how data flows between steps. What are the failure modes at each step?

Exercise 23.27: Prompt Debugging

You have deployed a few-shot classification prompt but it has a 15% error rate, mostly concentrated on ambiguous inputs. Propose a systematic debugging approach: (a) How would you identify the failure patterns? Describe the analysis you would perform. (b) For each failure pattern, propose a prompt modification. (c) How would you verify that your fixes do not regress on previously correct examples?

Exercise 23.28: Cross-Model Prompt Portability

You have developed a highly optimized prompt for GPT-4 and need to port it to Llama-3-70B-Instruct. (a) What aspects of the prompt are likely to transfer well? Which are likely to break? (b) Describe a systematic process for adapting the prompt to the new model. (c) How do differences in chat templates affect prompt portability?

Exercise 23.29: Cost-Performance Trade-off Analysis

You are building a production system that handles 100,000 queries per day. Compare the following strategies on cost and quality: (a) Zero-shot with a large model (e.g., 70B parameters). (b) Few-shot (5 examples) with the same large model. (c) Zero-shot with a smaller model (e.g., 7B parameters) that has been instruction-tuned. (d) Self-consistency (5 samples) with CoT on the smaller model. For each, estimate the relative token cost per query and predict the quality ranking. Under what conditions would each be the best choice?

Exercise 23.30: Comprehensive Prompt Engineering Project

Design a complete prompt engineering solution for an AI-powered writing assistant that helps students improve their essays. Your solution should include: (a) A system prompt that defines the assistant's behavior and pedagogical approach. (b) A prompt template for analyzing essay structure, grammar, argumentation, and style. (c) A structured output schema for the feedback. (d) Few-shot examples demonstrating the expected feedback quality. (e) An evaluation framework with at least three metrics. (f) A prompt injection defense strategy. (g) A cost analysis assuming 10,000 essays per day, each ~500 words.