Chapter 23: Quiz

Multiple Choice Questions

Question 1

In-context learning (ICL) refers to a model's ability to perform tasks based on demonstrations in the prompt. Which statement about ICL is TRUE?

A) ICL updates the model's parameters using backpropagation on the demonstrations. B) ICL requires at least 100 demonstrations to work effectively. C) ICL performs inference with fixed parameters, relying on the context to guide generation. D) ICL only works with encoder-only models like BERT.

Answer: C In-context learning uses fixed parameters $\theta$---no gradient updates occur. The model "learns" purely from the demonstrations and instructions provided in the context window. This works with decoder-only (autoregressive) models, and typically requires only a handful of demonstrations.

Question 2

According to the implicit Bayesian inference framework for ICL, what does the model approximate when processing in-context demonstrations?

A) The gradient of the loss function with respect to model parameters. B) A posterior distribution over latent concepts given the demonstrations. C) The exact joint probability of all demonstrations. D) A uniform distribution over possible outputs.

Answer: B Xie et al. (2022) proposed that ICL performs implicit Bayesian inference: the model infers a posterior $p(c \mid x_{\text{demo}})$ over latent concepts $c$ from the demonstrations, then generates according to $p(y \mid c, x_{\text{query}})$.

Question 3

Which of the following is NOT a factor that affects in-context learning performance?

A) The number of demonstrations provided. B) The order of demonstrations in the prompt. C) The GPU type used for inference. D) The quality and diversity of demonstrations.

Answer: C The GPU type affects inference speed but not the quality of ICL. Number of demonstrations, ordering, quality, and diversity are all well-documented factors that significantly influence ICL performance.

Question 4

A zero-shot prompt for sentiment analysis reads: "Classify this review as positive or negative." How could you improve this prompt?

A) Add 50 demonstrations to make it few-shot. B) Specify the output format: "Respond with exactly one word: 'positive' or 'negative'." C) Remove the task description to give the model more freedom. D) Use a smaller model for faster processing.

Answer: B Specifying the exact output format reduces ambiguity and improves reliability. The model knows exactly what form the answer should take, preventing responses like "The sentiment is positive" or "This review expresses a positive opinion."

Question 5

In few-shot prompting, research by Min et al. (2022) found that randomly assigning labels to demonstrations:

A) Completely destroys performance on all tasks. B) Does not always destroy performance, because demonstrations serve multiple purposes beyond mapping. C) Improves performance by preventing overfitting to examples. D) Has no effect on performance whatsoever.

Answer: B Surprisingly, random labels do not always destroy ICL performance. Demonstrations serve multiple purposes: format specification, label space calibration, task identification, and input-output mapping. Even with wrong labels, the model still learns format and label space, though correct labels do consistently outperform random ones.

Question 6

What is the primary advantage of chain-of-thought (CoT) prompting over standard few-shot prompting?

A) CoT reduces the number of tokens needed in the prompt. B) CoT increases the effective computational depth by generating intermediate reasoning steps. C) CoT eliminates the need for demonstrations. D) CoT guarantees the correct answer on all reasoning tasks.

Answer: B CoT increases computational depth by having the model generate step-by-step reasoning. Each generated token can attend to previously generated reasoning tokens, creating a computation chain deeper than a single forward pass. This is particularly helpful for multi-step reasoning tasks.

Question 7

Zero-shot chain-of-thought prompting uses which trigger phrase?

A) "Show your work." B) "The answer is:" C) "Let's think step by step." D) "Please reason carefully about this problem."

Answer: C Kojima et al. (2022) discovered that appending "Let's think step by step" to a prompt triggers chain-of-thought reasoning without any demonstrations. This remarkably simple technique improves performance on arithmetic, symbolic, and commonsense reasoning tasks.

Question 8

Self-consistency extends CoT by:

A) Verifying the reasoning chain using an external calculator. B) Sampling multiple reasoning paths and returning the majority-vote answer. C) Training the model on its own generated reasoning chains. D) Using two different models and comparing their answers.

Answer: B Self-consistency samples $n$ independent reasoning paths using temperature $T > 0$, extracts the final answer from each, and returns the most frequent answer via majority vote. The intuition is that correct reasoning paths converge on the same answer.

Question 9

In self-consistency with $n$ samples, what is the relationship between $n$ and performance?

A) Performance always increases linearly with $n$. B) Performance generally improves with more samples but has diminishing returns, typically beyond ~20 samples. C) Performance decreases with more samples due to noise accumulation. D) The optimal $n$ is always exactly 5.

Answer: B More samples generally improve the quality of the majority vote, but with diminishing returns. Typically, 5--40 samples suffice, and gains become marginal beyond ~20 for most tasks. The cost scales linearly with $n$.

Question 10

In Tree of Thoughts (ToT), what capability does the tree structure provide that a linear chain of thought does not?

A) Faster generation speed. B) Ability to backtrack from unpromising reasoning paths. C) Smaller memory footprint. D) Deterministic output.

Answer: B Unlike linear CoT, ToT can evaluate partial solutions, prune unpromising branches, and backtrack to explore alternatives. This is inspired by classical AI search algorithms and is particularly valuable for problems requiring planning and exploration.

Question 11

Which statement about structured output generation is FALSE?

A) Constrained decoding modifies token probabilities to maintain valid syntax. B) JSON mode guarantees that the output will conform to any arbitrary JSON schema. C) Even with structured output constraints, validation is recommended in production. D) Providing the expected schema in the prompt improves structured output quality.

Answer: B JSON mode ensures valid JSON syntax but does not guarantee conformity to a specific schema. The output may be valid JSON but have wrong field names, missing fields, or incorrect types. Schema validation on top of JSON mode is necessary for production systems.

Question 12

Constrained decoding for JSON generation works by:

A) Post-processing the output to fix JSON syntax errors. B) Setting the probability of tokens that would create invalid JSON to zero during generation. C) Training the model exclusively on JSON data. D) Using regular expressions to filter the output.

Answer: B Constrained decoding modifies the token probability distribution at each step, setting the probability of tokens that would break valid JSON syntax to zero and renormalizing the remaining probabilities. This guarantees syntactically valid output.

Question 13

A well-designed system prompt should include all of the following EXCEPT:

A) Role definition and expertise level. B) API keys and authentication tokens for external services. C) Output format specification. D) Behavioral constraints on what the model should and should not do.

Answer: B System prompts should never contain secrets like API keys, passwords, or authentication tokens. These could be extracted through prompt injection attacks. System prompts should contain role definitions, task specifications, behavioral constraints, output format, and tone guidance.

Question 14

Role-based prompting (e.g., "You are a senior data scientist...") improves performance because:

A) It changes the model's architecture at runtime. B) It activates relevant knowledge domains acquired during pre-training. C) It increases the model's parameter count. D) It disables safety guardrails.

Answer: B Assigning a role activates domain-relevant knowledge and communication patterns learned during pre-training. Research shows role prompts can improve performance by 5--15% on domain-specific tasks compared to generic prompts.

Question 15

Which prompt injection type embeds malicious instructions in content the model processes rather than in direct user input?

A) Direct injection. B) Indirect injection. C) Payload smuggling. D) Role-play injection.

Answer: B Indirect injection embeds malicious instructions in external content (web pages, documents, emails) that the model is asked to process. The model may follow these embedded instructions because it cannot reliably distinguish between legitimate content and injected instructions.

Question 16

The dual-LLM defense against prompt injection uses:

A) One LLM for generation and one for detecting injection attempts. B) Two identical LLMs that must agree on the output. C) One LLM for encoding and one for decoding. D) Two LLMs trained on different datasets.

Answer: A The dual-LLM approach uses one model to process user input and generate responses, and a separate, isolated model to analyze the input for injection attempts. This separation prevents a single compromised interaction from affecting the detection system.

Question 17

When evaluating prompt quality, "consistency" refers to:

A) Whether the prompt produces grammatically correct output. B) Whether the prompt produces similar outputs for similar inputs across repeated runs. C) Whether the prompt follows the same format as other prompts in the system. D) Whether the prompt has consistent indentation and formatting.

Answer: B Consistency measures the variance in outputs across repeated runs with the same or similar inputs. A consistent prompt produces predictable, reliable results. This is especially important for production systems where unpredictable behavior is unacceptable.

Question 18

In A/B testing of prompt variants, the null hypothesis $H_0: \mu_A = \mu_B$ states that:

A) Both prompts are identical in wording. B) The mean performance of prompt A equals the mean performance of prompt B. C) Both prompts produce the same output for every input. D) Both prompts use the same number of tokens.

Answer: B The null hypothesis in A/B testing states that there is no difference in mean performance between the two prompt variants. Statistical tests determine whether observed differences are significant or could arise by chance.

Question 19

LLM-as-judge evaluation is most appropriate for:

A) Tasks with clear, objectively verifiable answers (e.g., math problems). B) Open-ended tasks where automated metrics are insufficient (e.g., creative writing). C) Tasks where the ground truth is a single token (e.g., binary classification). D) Tasks where latency is the primary concern.

Answer: B LLM-as-judge is particularly valuable for open-ended tasks like summarization, creative writing, and conversational quality, where traditional automated metrics (BLEU, ROUGE) correlate poorly with human judgment. For tasks with objectively verifiable answers, direct comparison to ground truth is simpler and more reliable.

Question 20

DSPy optimizes prompts by:

A) Manually iterating through prompt variants based on human intuition. B) Compiling high-level task descriptions into optimized prompts using automated search. C) Training a separate neural network to generate prompts. D) Using genetic algorithms to evolve prompt templates.

Answer: B DSPy (Khattab et al., 2023) compiles high-level declarative task descriptions into optimized prompts using a set of teleprompters (optimizers). It automates the prompt optimization process that would otherwise require manual iteration.

Question 21

Retrieval-Augmented Generation (RAG) addresses which fundamental limitation of static prompts?

A) The inability to generate structured output. B) The model's training data cutoff and limited context window for knowledge. C) The inability to perform multi-step reasoning. D) The high latency of large model inference.

Answer: B RAG addresses the limitation that static prompts rely solely on knowledge from the model's training data, which has a cutoff date and cannot cover all domains. RAG dynamically retrieves relevant documents and includes them in the prompt, providing up-to-date and domain-specific knowledge.

Question 22

In the prompting decision framework, what is the recommended starting point for a new task?

A) Tree of Thoughts, as it is the most powerful technique. B) Self-consistency with 20 samples for maximum accuracy. C) Zero-shot prompting, then escalate complexity only if needed. D) Fine-tuning, as it always produces the best results.

Answer: C The principle of parsimony recommends starting with the simplest approach: zero-shot prompting. If performance is adequate, stop. Otherwise, progressively add complexity: few-shot examples, chain-of-thought, self-consistency, structured output, retrieval, or fine-tuning.

Question 23

Which of the following is TRUE about prompt templates?

A) Templates should embed all data directly in the prompt string without parameterization. B) Templates should separate prompt structure from data via placeholders filled at runtime. C) Templates are only useful for zero-shot prompts. D) Templates should be hardcoded and never version-controlled.

Answer: B Prompt templates provide a parameterized structure with placeholders (e.g., {document_type}, {content}) that are filled at runtime. This separation of concerns makes prompts maintainable, reusable, version-controlled, and testable---essential for production systems.

Question 24

A RAG system retrieves documents and includes them in the prompt. Compared to fine-tuning, RAG offers:

A) Deeper behavioral customization of the model. B) Instant knowledge updates by modifying the index, without retraining. C) Lower per-query computational cost. D) Better performance on tasks requiring stylistic changes.

Answer: B RAG's primary advantage over fine-tuning is that knowledge can be updated instantly by modifying the document index, without any model retraining. Fine-tuning requires retraining to incorporate new knowledge, while RAG simply indexes new documents.

Question 25

Which defense strategy is LEAST effective against prompt injection attacks?

A) Simple pattern matching to filter known injection phrases. B) Instruction hierarchy training that prioritizes system prompts. C) Output validation before returning results to the user. D) Sandboxing the model's capabilities to a minimum required set.

Answer: A Simple pattern matching (e.g., blocking "ignore all previous instructions") is the least effective defense because natural language is too flexible---attackers can paraphrase, use different languages, encode instructions, or use indirect methods. Instruction hierarchy training, output validation, and sandboxing provide more robust protection.