Chapter 24: Quiz

Multiple Choice Questions

Question 1

Which scenario BEST justifies fine-tuning over prompting?

A) You need to classify 10 emails per day into 3 categories.
B) You need the model to consistently output XML in a specific schema across millions of API calls.
C) You want to quickly prototype a summarization system.
D) The task changes frequently and you need flexibility.

Answer: B. Fine-tuning is justified when you need high consistency, have high volume (reducing per-query prompt costs), and the task is relatively stable. Millions of API calls with strict formatting requirements is the ideal case for fine-tuning.


Question 2

In LoRA, the weight update is parameterized as $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$. What is the key assumption?

A) The weight matrix $W_0$ is sparse.
B) The weight update during fine-tuning has low intrinsic rank.
C) The model's attention heads are redundant.
D) The gradient of the loss function is constant.

Answer: B. LoRA is motivated by the observation that fine-tuning updates $\Delta W = W_{\text{finetuned}} - W_{\text{pretrained}}$ have low intrinsic rank. This means the adaptation can be captured by a low-rank decomposition $BA$ with $r \ll \min(d, k)$.


Question 3

In LoRA initialization, $B$ is set to zero and $A$ is initialized randomly. Why is $B = 0$ critical?

A) It ensures the LoRA matrices are orthogonal.
B) It ensures $\Delta W = BA = 0$ at the start, so training begins from the exact pre-trained model.
C) It reduces the memory footprint of the LoRA parameters.
D) It prevents gradient vanishing in the backward pass.

Answer: B. Setting $B = 0$ ensures that $\Delta W = BA = 0$ at initialization, meaning the model starts from the exact pre-trained weights. This is critical for stable training: the fine-tuning process can smoothly adapt from the pre-trained starting point.


Question 4

What is the LoRA scaling factor and its purpose?

A) $\alpha / r$; it normalizes the LoRA update so that changing rank does not change the update magnitude.
B) $r / \alpha$; it amplifies the LoRA update for small ranks.
C) $\alpha \times r$; it increases expressiveness proportionally to rank.
D) $1 / r$; it prevents the LoRA update from becoming too large.

Answer: A. The scaling factor $\alpha / r$ ensures that when the rank $r$ is changed, the effective magnitude of the LoRA update remains controlled. A common practice is $\alpha = 2r$, giving a scaling factor of 2 regardless of rank.
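
To make the mechanics of Questions 2-4 concrete, here is a minimal sketch of a LoRA-wrapped linear layer in PyTorch (a hypothetical LoRALinear class for illustration, not the PEFT library's implementation), showing the $BA$ parameterization, the $B = 0$ initialization, and the $\alpha / r$ scaling:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = x W_0^T + (alpha / r) * x (BA)^T."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)           # W_0 stays frozen
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)  # A: small random init
        self.B = nn.Parameter(torch.zeros(d, r))         # B: zeros, so BA = 0 at the start
        self.scaling = alpha / r                         # the alpha / r scaling factor

    def forward(self, x):
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)
```

At initialization the second term is exactly zero, so the wrapped layer reproduces the pre-trained layer; with $\alpha = 2r$ the scaling stays at 2 for any rank.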


Question 5

For a 7B parameter model in full fp32, the approximate total memory for fine-tuning (parameters + gradients + Adam states) is:

A) 28 GB
B) 56 GB
C) 84 GB
D) 112 GB or more

Answer: D. Full fine-tuning requires: parameters (4 bytes $\times$ 7B = 28 GB) + gradients (28 GB) + Adam states with two moment estimates (56 GB) = 112 GB minimum, plus activation memory, which adds 10-50 GB more.
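
The arithmetic is easy to reproduce (a rough estimate assuming everything is held in fp32 and using decimal gigabytes; activation memory is excluded):

```python
params = 7e9          # 7B parameters
bytes_per_value = 4   # fp32

weights   = params * bytes_per_value        # 28 GB
gradients = params * bytes_per_value        # 28 GB
adam      = params * bytes_per_value * 2    # first and second moments: 56 GB

print((weights + gradients + adam) / 1e9)   # 112.0 GB, before activations
```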


Question 6

QLoRA's NF4 data type is specifically designed for:

A) Integers that follow a uniform distribution.
B) Floating-point numbers that follow a normal distribution.
C) Sparse matrices with many zero values.
D) Embedding vectors with high dimensionality.

Answer: B. NF4 (NormalFloat 4-bit) is information-theoretically optimal for normally distributed values. Since pre-trained neural network weights are approximately normally distributed, NF4 provides better quantization quality than standard 4-bit integer formats.


Question 7

"Double quantization" in QLoRA refers to:

A) Quantizing both weights and activations to 4-bit.
B) Quantizing the quantization constants themselves to reduce memory overhead.
C) Running the quantization process twice for better accuracy.
D) Using two different quantization schemes for different layers.

Answer: B. Double quantization quantizes the quantization scaling constants themselves from fp32 to fp8, cutting their overhead from roughly 0.5 bits per parameter to about 0.13 bits (total storage drops from approximately 4.5 to about 4.1 bits per parameter). This small optimization provides meaningful savings at scale.
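
Both features are exposed through the bitsandbytes integration in Hugging Face Transformers; a typical QLoRA-style loading configuration looks roughly like this (the model name is illustrative):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat 4-bit weights
    bnb_4bit_use_double_quant=True,      # quantize the quantization constants too
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",          # illustrative model id
    quantization_config=bnb_config,
)
```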


Question 8

Which PEFT method has ZERO inference latency overhead after deployment?

A) Adapter layers.
B) Prefix tuning.
C) LoRA (after weight merging).
D) Prompt tuning.

Answer: C. After training, LoRA weights can be merged into the base model: $W_{\text{merged}} = W_0 + \frac{\alpha}{r} BA$. The merged model has the same architecture as the original, so there is zero additional inference overhead. Adapters add extra layers, and prefix/prompt tuning add extra tokens.


Question 9

When applying LoRA to a Transformer, which configuration generally performs best?

A) Applying LoRA only to the query (Q) projection.
B) Applying LoRA to Q and V projections only.
C) Applying LoRA to all attention matrices (Q, K, V, O).
D) Applying LoRA only to the feed-forward network layers.

Answer: C. Research shows that applying LoRA to all four attention weight matrices (Q, K, V, and O) generally outperforms applying it to a subset. For a given parameter budget, it is often better to use a lower rank across more layers than a higher rank on fewer layers.
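
With the PEFT library this is a matter of listing the projections in the LoRA configuration (the module names below follow the Llama-style naming convention and differ between architectures):

```python
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # all four attention projections
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

# peft_model = get_peft_model(base_model, lora_config)  # wraps the targeted layers with LoRA
```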


Question 10

In instruction tuning, "response-only loss masking" means:

A) Removing all loss computation during training.
B) Computing the loss only on the assistant's response tokens, not the instruction tokens.
C) Masking random tokens in the response for denoising.
D) Only training on responses that receive positive human feedback.

Answer: B. Response-only loss masking computes the cross-entropy loss only on the tokens that comprise the assistant's response, not the instruction/prompt tokens. This prevents the model from wasting capacity learning to reproduce instructions and focuses training on generating good responses.
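
A common implementation copies the input IDs into the labels and replaces the prompt positions with -100, the index that PyTorch's cross-entropy loss ignores. A simplified sketch, leaving out padding and batching:

```python
IGNORE_INDEX = -100  # ignored by cross-entropy loss

def mask_prompt_tokens(input_ids, prompt_length):
    """Return labels that produce loss only on the response tokens."""
    labels = list(input_ids)
    labels[:prompt_length] = [IGNORE_INDEX] * prompt_length
    return labels

# Example: a 5-token prompt followed by a 3-token response
print(mask_prompt_tokens([11, 22, 33, 44, 55, 66, 77, 88], prompt_length=5))
# [-100, -100, -100, -100, -100, 66, 77, 88]
```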


Question 11

Which instruction dataset format includes "instruction", "input", and "output" fields?

A) ShareGPT format.
B) Alpaca format.
C) OpenAI chat format.
D) JSONL format.

Answer: B. The Alpaca format structures each example as a JSON object with "instruction" (the task description), "input" (optional additional context), and "output" (the desired response) fields. It was popularized by the Stanford Alpaca project.
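
A single Alpaca-format example, written here as a Python dictionary that maps one-to-one onto the JSON object (the contents are illustrative):

```python
example = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "LoRA freezes the pre-trained weights and trains two small low-rank matrices ...",
    "output": "LoRA adapts a frozen model by training a small low-rank update to its weights.",
}
```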


Question 12

What is catastrophic forgetting in the context of fine-tuning?

A) The model forgets the fine-tuning data after a few inference calls.
B) The model loses pre-trained capabilities when fine-tuned on a specific task.
C) The model fails to converge during fine-tuning.
D) The model's parameters overflow GPU memory.

Answer: B. Catastrophic forgetting occurs when fine-tuning causes the model to lose capabilities acquired during pre-training. The model's parameters shift to optimize for the fine-tuning task at the expense of previously learned general knowledge and abilities.


Question 13

Which strategy does NOT help mitigate catastrophic forgetting?

A) Using a smaller learning rate.
B) Using LoRA instead of full fine-tuning.
C) Increasing the number of training epochs.
D) Mixing task-specific data with general-purpose data.

Answer: C. Increasing training epochs makes catastrophic forgetting worse, as the model overfits more to the fine-tuning data and drifts further from the pre-trained weights. Lower learning rates, PEFT methods, and data mixing all help preserve general capabilities.


Question 14

The Elastic Weight Consolidation (EWC) penalty uses the Fisher information matrix to:

A) Speed up convergence during fine-tuning.
B) Discourage large changes to parameters that are important for pre-training tasks.
C) Reduce the memory footprint of the optimizer.
D) Automatically select which layers to freeze.

Answer: B. EWC adds a penalty $\frac{\lambda}{2} \sum_i F_i (\theta_i - \theta_{0,i})^2$ that penalizes changes to parameters where $F_i$ (the Fisher information diagonal) is large, meaning those parameters are important for the pre-training distribution.
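
As a rough sketch of how the penalty is added to the task loss (assuming fisher is a precomputed diagonal Fisher estimate and theta0 a snapshot of the pre-trained parameters, both keyed by parameter name):

```python
def ewc_penalty(model, theta0, fisher, lam=0.1):
    """Compute (lambda / 2) * sum_i F_i * (theta_i - theta0_i)^2."""
    penalty = 0.0
    for name, param in model.named_parameters():
        if param.requires_grad:
            penalty = penalty + (fisher[name] * (param - theta0[name]) ** 2).sum()
    return 0.5 * lam * penalty

# total_loss = task_loss + ewc_penalty(model, theta0, fisher)
```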


Question 15

For fine-tuning, the typical learning rate range is:

A) $10^{-1}$ to $10^{-2}$ (same as pre-training).
B) $10^{-2}$ to $10^{-3}$.
C) $10^{-5}$ to $5 \times 10^{-5}$.
D) $10^{-8}$ to $10^{-7}$.

Answer: C. Fine-tuning uses learning rates of $10^{-5}$ to $5 \times 10^{-5}$, roughly an order of magnitude smaller than typical pre-training rates ($10^{-4}$ to $3 \times 10^{-4}$). This prevents catastrophic forgetting by making small, controlled updates to the pre-trained weights.


Question 16

Gradient checkpointing trades what for reduced memory?

A) Model accuracy for memory.
B) Training speed for memory (recomputes activations during backward pass).
C) Number of layers for memory.
D) Batch size for memory.

Answer: B. Gradient checkpointing saves memory by not storing intermediate activations during the forward pass. Instead, it recomputes them during the backward pass, trading approximately 33% more computation time for a significant reduction in activation memory.
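
In Hugging Face Transformers this trade-off is a one-line switch on a loaded model (the KV cache is usually disabled at the same time, since it is incompatible with checkpointing during training):

```python
model.gradient_checkpointing_enable()   # recompute activations in the backward pass
model.config.use_cache = False          # disable the KV cache while training
```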


Question 17

In the context of LoRA, what does "merging" refer to?

A) Combining multiple LoRA adapters from different tasks.
B) Adding the LoRA update $\frac{\alpha}{r}BA$ to the base weights to produce a single weight matrix.
C) Merging the training and validation datasets.
D) Combining the gradients from multiple GPUs.

Answer: B. Merging computes $W_{\text{merged}} = W_0 + \frac{\alpha}{r}BA$, producing a single weight matrix that incorporates the LoRA adaptation. After merging, the LoRA matrices are no longer needed, and inference runs on a standard model architecture with no overhead.
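
With the PEFT library the merge is a single call on the trained adapter model (model name and paths are illustrative):

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")   # illustrative
peft_model = PeftModel.from_pretrained(base, "path/to/lora-adapter")       # illustrative path

merged_model = peft_model.merge_and_unload()    # folds (alpha / r) * BA into W_0
merged_model.save_pretrained("merged-model")    # standard checkpoint, no adapter files needed
```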


Question 18

How many models must be held in GPU memory simultaneously during PPO-based RLHF?

A) 1 (the policy model).
B) 2 (policy and reference models).
C) 3 (policy, reference, and reward models).
D) 4 (policy, reference, reward, and value models).

Answer: D. PPO-based RLHF requires four models: the policy model (being trained), the reference model (the frozen SFT model), the reward model (frozen), and the value model (being trained). This is one motivation for DPO, which requires only two models (Chapter 25).


Question 19

Which of the following is TRUE about prompt tuning?

A) It modifies all model parameters.
B) It prepends trainable continuous embeddings only to the input layer.
C) It prepends trainable vectors to every attention layer.
D) It replaces the model's vocabulary embeddings.

Answer: B. Prompt tuning (Lester et al., 2021) prepends trainable soft prompt embeddings only to the input, not to every layer. This uses far fewer parameters ($p \times d$) than prefix tuning ($L \times 2 \times p \times d$), but is generally less expressive.
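
The parameter counts are easy to compare for a concrete configuration; the numbers below assume an illustrative setup with $p = 20$ virtual tokens, hidden size $d = 4096$, and $L = 32$ layers:

```python
p, d, L = 20, 4096, 32   # virtual tokens, hidden size, layers (illustrative)

prompt_tuning = p * d            # soft prompt at the input only:          81,920 parameters
prefix_tuning = L * 2 * p * d    # key and value prefixes at every layer:  5,242,880 parameters
print(prompt_tuning, prefix_tuning)
```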


Question 20

The "self-instruct" method for creating training data involves:

A) Having human experts write all instruction-response pairs.
B) Using a strong LLM to generate instruction-response pairs, then filtering for quality.
C) Extracting instruction-response pairs from pre-training data.
D) Automatically labeling unlabeled data using clustering.

Answer: B. Self-instruct (Wang et al., 2023) uses a strong LLM to generate diverse instruction-response pairs from seed examples. The generated data is then filtered for quality, correctness, and diversity before being used for fine-tuning.


Question 21

When fine-tuning with the TRL library's SFTTrainer, the tokenizer's apply_chat_template() method is used to:

A) Compress the input tokens for efficiency.
B) Format conversations according to the model's expected chat template structure.
C) Remove special tokens from the input.
D) Apply data augmentation to the training examples.

Answer: B. Different models (Llama, Mistral, etc.) expect conversations in specific formats with particular special tokens and delimiters. apply_chat_template() converts a list of message dictionaries into the correctly formatted string for the specific model being fine-tuned.
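
A typical call converts a list of message dictionaries into the model-specific prompt string (the model name is illustrative):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")  # illustrative

messages = [
    {"role": "user", "content": "Explain LoRA in one sentence."},
    {"role": "assistant", "content": "LoRA trains small low-rank update matrices on top of frozen weights."},
]

text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)   # the conversation wrapped in this model's special tokens and delimiters
```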


Question 22

For a given parameter budget, which approach typically works best with LoRA?

A) High rank on a few layers.
B) Low rank across many layers.
C) Medium rank on only the output projection.
D) Variable rank per layer (higher for deeper layers).

Answer: B. Research shows that for a fixed parameter budget, distributing a lower rank across more layers generally outperforms concentrating a higher rank on fewer layers. This is because fine-tuning adaptations are needed throughout the model, and each layer benefits from some degree of adaptation.


Question 23

Which evaluation dimension is MOST specific to fine-tuned models (as opposed to base models)?

A) Perplexity on a language modeling benchmark.
B) General capability retention on benchmarks like MMLU.
C) Token generation speed.
D) Vocabulary coverage.

Answer: B. General capability retention is uniquely important for fine-tuned models because catastrophic forgetting can degrade performance on tasks the model previously handled well. Base models do not have this concern since they have not been fine-tuned.


Question 24

In multi-task fine-tuning, what is the purpose of task weights $w_t$ in the loss $\mathcal{L} = \sum_t w_t \mathcal{L}_t$?

A) To ensure all tasks converge at the same rate.
B) To balance contributions so rare or difficult tasks are not dominated by common or easy ones.
C) To reduce the total training time.
D) To prevent gradient overflow.

Answer: B. Task weights balance the contribution of each task to the total loss. Without weighting, tasks with more data or easier optimization can dominate the gradient signal, causing the model to underperform on rare or difficult tasks.


Question 25

Which statement about the hybrid prompting-then-fine-tuning workflow is TRUE?

A) You should always skip the prompting phase and go directly to fine-tuning.
B) Data collected from the prompting phase can be used to create the fine-tuning dataset.
C) Once a model is fine-tuned, prompting techniques become useless.
D) The prompting phase is only useful if you have no training data.

Answer: B. The hybrid approach starts with prompting to validate feasibility, then collects data from the prompting system (inputs, outputs, and human corrections) to create a high-quality fine-tuning dataset. This is a practical, iterative workflow that leverages the strengths of both approaches.