Chapter 24: Exercises
Conceptual Exercises
Exercise 24.1: Fine-Tuning vs. Prompting Decision
For each of the following scenarios, recommend whether to use prompting, fine-tuning, or a hybrid approach. Justify your choice using the decision framework from Section 24.2. (a) A legal firm needs to classify 50 contracts per day into 5 categories. They have 200 labeled examples. (b) A startup needs to generate product descriptions in a specific brand voice. They have 10,000 examples. (c) A research team wants to quickly prototype a sentiment analyzer for a new language. (d) A hospital needs a model that extracts medication names and dosages from clinical notes with >99% precision. (e) An e-commerce platform processes 1 million product queries per day with a 10-word prompt overhead per query.
Exercise 24.2: Cost Analysis
A company uses a prompted LLM API with a system prompt of 500 tokens, 5 few-shot examples of 200 tokens each, and an average user query of 50 tokens plus response of 100 tokens. Input tokens cost \$0.01/1K and output tokens cost \$0.03/1K. (a) Calculate the per-query cost with prompting. (b) If fine-tuning eliminates the system prompt and few-shot examples (reducing input to just the query), calculate the per-query cost. (c) If fine-tuning costs \$500, how many queries are needed to break even? (d) If the company processes 10,000 queries per day, how many days until break-even?
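As a starting point for the arithmetic, here is a minimal Python sketch; the token counts and prices come straight from the exercise, while the helper function and variable names are only illustrative:

```python
# Per-query cost helper for parts (a)-(d); prices are per 1K tokens.
def per_query_cost(input_tokens, output_tokens,
                   input_price_per_1k=0.01, output_price_per_1k=0.03):
    return (input_tokens / 1000) * input_price_per_1k \
         + (output_tokens / 1000) * output_price_per_1k

# (a) Prompted: system prompt + 5 few-shot examples + user query as input.
prompted = per_query_cost(500 + 5 * 200 + 50, 100)
# (b) Fine-tuned: only the query remains as input.
finetuned = per_query_cost(50, 100)
# (c), (d) Break-even: one-time fine-tuning cost divided by the per-query saving.
queries_to_break_even = 500 / (prompted - finetuned)
days_to_break_even = queries_to_break_even / 10_000
```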
Exercise 24.3: Memory Requirements
Calculate the GPU memory requirements for full fine-tuning of a 13B parameter model with: (a) Full fp32 precision (parameters + gradients + Adam optimizer states + estimated activations). (b) Mixed precision (bf16 compute, fp32 optimizer). (c) Gradient checkpointing enabled (assume activation memory is reduced by a factor of 4). (d) Explain why even option (c) typically requires multiple GPUs.
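One possible scaffold for the estimate is sketched below; it assumes 4-byte fp32 values, 2-byte bf16 values, and 8 bytes of Adam state per parameter (one common accounting convention; yours may differ in detail), and the activation figure is a deliberate placeholder to be replaced by your own estimate:

```python
# Rough memory estimate for full fine-tuning of a 13B model.
GB = 1024 ** 3
n = 13e9
act = 60.0  # placeholder activation estimate in GB; derive your own for part (a)

def full_ft_memory_gb(n_params, weight_bytes, grad_bytes, optimizer_bytes, activation_gb):
    state = n_params * (weight_bytes + grad_bytes + optimizer_bytes)
    return state / GB + activation_gb

# (a) fp32: 4-byte weights, 4-byte gradients, two fp32 Adam moments (8 bytes).
fp32_total = full_ft_memory_gb(n, 4, 4, 8, act)
# (b) Mixed precision: bf16 weights + fp32 master copy (2 + 4), bf16 gradients (2),
#     fp32 Adam moments (8).
mixed_total = full_ft_memory_gb(n, 2 + 4, 2, 8, act)
# (c) Gradient checkpointing: same parameter/optimizer state, activations divided by 4.
checkpointed_total = full_ft_memory_gb(n, 2 + 4, 2, 8, act / 4)
```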
Exercise 24.4: LoRA Mathematics
For a weight matrix $W_0 \in \mathbb{R}^{4096 \times 4096}$ with LoRA rank $r = 16$ and $\alpha = 32$: (a) Calculate the number of trainable parameters in the LoRA adaptation. (b) What is the compression ratio compared to full fine-tuning of this matrix? (c) Compute the effective scaling factor $\alpha / r$. (d) If the model has 32 layers and LoRA is applied to the Q, K, V, O projections (each $4096 \times 4096$), what is the total number of trainable LoRA parameters? What percentage of the full model (13B) is this?
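A sketch of the parameter counting, using the fact that a rank-$r$ adapter on a $d_{\text{out}} \times d_{\text{in}}$ matrix adds $r(d_{\text{in}} + d_{\text{out}})$ trainable parameters:

```python
# Parameter counting for parts (a)-(d).
d, r, alpha = 4096, 16, 32

lora_params_per_matrix = r * (d + d)             # (a) B is d x r, A is r x d
full_params_per_matrix = d * d
compression_ratio = full_params_per_matrix / lora_params_per_matrix   # (b)
scaling = alpha / r                              # (c)

n_layers, matrices_per_layer = 32, 4             # Q, K, V, O projections
total_lora_params = n_layers * matrices_per_layer * lora_params_per_matrix  # (d)
fraction_of_13b_model = total_lora_params / 13e9
```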
Exercise 24.5: LoRA Initialization
(a) Explain why $B$ is initialized to zero and $A$ is initialized with random Gaussian values. What would happen if both were initialized randomly? (b) What would happen if both were initialized to zero? (c) Prove that $\Delta W = BA = 0$ at initialization with the standard scheme. (d) How does this initialization relate to the concept of "starting from the pre-trained model"?
Exercise 24.6: QLoRA Memory Analysis
For a 7B parameter model: (a) Calculate the memory for the base model in NF4 (4.5 bits per parameter). (b) If LoRA rank is 16 and applied to all attention matrices (Q, K, V, O) across 32 layers with $d = 4096$, calculate the LoRA parameter memory in fp16. (c) Calculate the optimizer state memory for the LoRA parameters (Adam with fp32 states). (d) Sum up the total memory and verify it fits on a 24GB GPU.
Exercise 24.7: Rank Selection Trade-offs
You are fine-tuning a 7B model on three different tasks:
- Task A: Sentiment classification (binary, simple task)
- Task B: Medical report summarization (domain-specific, moderate complexity)
- Task C: Code generation in a new programming language (complex, large distribution shift)
(a) Recommend a LoRA rank for each task and justify your choice. (b) For Task B, design an experiment to find the optimal rank. (c) If your parameter budget is fixed at 10M trainable parameters, how would you allocate rank across layers for Task C?
Exercise 24.8: Adapter vs. LoRA Comparison
(a) For an adapter with bottleneck dimension $m = 64$ and hidden dimension $d = 4096$, calculate the number of parameters per adapter module (including bias terms). (b) For a LoRA with rank $r = 16$ on the same layer, calculate the number of parameters. (c) Compare the inference latency overhead: why does LoRA have zero overhead after merging while adapters always add latency? (d) In what scenario might adapters be preferred over LoRA?
Exercise 24.9: Prefix Tuning vs. Prompt Tuning
(a) For a model with $L = 32$ layers, hidden dimension $d = 4096$, and prefix length $p = 20$:
- Calculate the number of trainable parameters for prefix tuning.
- Calculate the number of trainable parameters for prompt tuning.
- What is the ratio between the two?
(b) Explain why prefix tuning is more expressive than prompt tuning. (c) At what model size does prompt tuning approach full fine-tuning performance? Why?
Exercise 24.10: Instruction Tuning Dataset Design
You are creating an instruction-tuning dataset for a customer support model. (a) List 10 distinct task types the dataset should cover, with 2 example instances of each. (b) For each task type, write one instruction in the Alpaca format. (c) Describe a quality filtering pipeline with at least 5 filtering criteria. (d) How would you handle class imbalance across task types?
Programming Exercises
Exercise 24.11: Implement LoRA from Scratch
Implement the LoRA layer as a PyTorch module without using the peft library (a starter sketch follows the item list):
(a) Implement the LoRALayer class with proper initialization ($B = 0$, $A \sim \mathcal{N}(0, \sigma^2)$).
(b) Implement the forward pass: $h = W_0 x + \frac{\alpha}{r} BAx$.
(c) Write a function that replaces all nn.Linear layers in a model with LoRA-augmented versions.
(d) Verify that the output is identical to the original model at initialization.
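A minimal starter sketch for parts (a)-(c) is given below; the class and function names are illustrative, and the Kaiming-style initialization of $A$ mirrors what the peft library uses (a small Gaussian also satisfies part (a)). Since $B = 0$ at initialization, part (d) can be checked by comparing outputs on the same input before and after wrapping.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen nn.Linear and adds a rank-r update (alpha / r) * B @ A."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False
        self.r, self.alpha = r, alpha
        # (a) A is random, B is zero, so B @ A = 0 at initialization.
        self.A = nn.Parameter(torch.empty(r, base.in_features))
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        nn.init.kaiming_uniform_(self.A, a=math.sqrt(5))

    def forward(self, x):
        # (b) h = W0 x + (alpha / r) * B A x
        return self.base(x) + (self.alpha / self.r) * ((x @ self.A.T) @ self.B.T)

def add_lora_to_linears(model: nn.Module, r: int = 16, alpha: int = 32):
    """(c) Recursively replace every nn.Linear with a LoRA-augmented version."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, LoRALinear(child, r, alpha))
        else:
            add_lora_to_linears(child, r, alpha)
    return model
```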
Exercise 24.12: LoRA Weight Merging
Implement LoRA weight merging and unmerging: (a) Write a function that merges LoRA weights into the base model: $W_{\text{merged}} = W_0 + \frac{\alpha}{r} BA$. (b) Write a function that unmerges (restores the original weights). (c) Verify that the merged model produces the same output as the LoRA model. (d) Measure the inference speed before and after merging.
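A sketch of parts (a) and (b), assuming the LoRALinear wrapper from the previous exercise; after merging, compare layer.base(x) against the unmerged wrapper's output, since the wrapper's forward would otherwise add the update a second time:

```python
import torch

@torch.no_grad()
def merge_lora(layer):
    """(a) Fold the LoRA update into the base weight: W <- W0 + (alpha / r) * B @ A."""
    layer.base.weight += (layer.alpha / layer.r) * (layer.B @ layer.A)

@torch.no_grad()
def unmerge_lora(layer):
    """(b) Restore the original weight by subtracting the same update."""
    layer.base.weight -= (layer.alpha / layer.r) * (layer.B @ layer.A)
```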
Exercise 24.13: Custom Data Collator
Implement a data collator for instruction tuning that: (a) Pads sequences to the same length within a batch. (b) Masks the loss for instruction tokens (only compute loss on response tokens). (c) Handles variable-length conversations. (d) Test with a small batch and verify the loss mask is correct.
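One possible shape for the collator is sketched below; it assumes each example stores input_ids (prompt followed by response tokens) and prompt_len, and uses -100 as the ignore index expected by PyTorch's cross-entropy loss:

```python
import torch

def collate_instruction_batch(examples, pad_token_id):
    """Pad to the longest sequence in the batch and mask loss on prompt tokens."""
    max_len = max(len(ex["input_ids"]) for ex in examples)
    input_ids, attention_mask, labels = [], [], []
    for ex in examples:
        ids = ex["input_ids"]
        pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * pad)
        attention_mask.append([1] * len(ids) + [0] * pad)
        # (b) -100 on the instruction tokens and padding; loss only on the response.
        labels.append([-100] * ex["prompt_len"] + ids[ex["prompt_len"]:] + [-100] * pad)
    return {
        "input_ids": torch.tensor(input_ids),
        "attention_mask": torch.tensor(attention_mask),
        "labels": torch.tensor(labels),
    }
```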
Exercise 24.14: Learning Rate Finder
Implement a learning rate finder for fine-tuning: (a) Start with a very small learning rate and increase exponentially. (b) Record the loss at each step. (c) Plot loss vs. learning rate. (d) Identify the optimal learning rate range (where loss decreases most rapidly). Apply to a LoRA fine-tuning setup and compare with the default learning rate.
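A sketch of the sweep for parts (a) and (b); loss_fn(model, batch) and data_iter are placeholder names for whatever loss computation and batch iterator your setup provides:

```python
import copy
import math
import torch

def lr_range_test(model, loss_fn, data_iter, lr_min=1e-7, lr_max=1.0, num_steps=100):
    """Exponentially increase the learning rate and record the loss at each step."""
    model = copy.deepcopy(model)          # leave the real model untouched
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr_min)
    gamma = (lr_max / lr_min) ** (1 / num_steps)
    lrs, losses = [], []
    for step in range(num_steps):
        lr = lr_min * gamma ** step
        for group in optimizer.param_groups:
            group["lr"] = lr
        loss = loss_fn(model, next(data_iter))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        lrs.append(lr)
        losses.append(loss.item())
        if math.isnan(losses[-1]) or losses[-1] > 4 * min(losses):
            break                          # stop once the loss diverges
    return lrs, losses
```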
Exercise 24.15: Catastrophic Forgetting Measurement
Design and implement an experiment to measure catastrophic forgetting: (a) Fine-tune a model on a specific task (e.g., sentiment classification). (b) Before and after fine-tuning, evaluate on three general benchmarks (e.g., HellaSwag, ARC, MMLU). (c) Plot the general benchmark performance vs. fine-tuning epochs. (d) Compare full fine-tuning vs. LoRA on forgetting metrics.
Exercise 24.16: EWC Implementation
Implement Elastic Weight Consolidation (EWC) for fine-tuning: (a) Compute the diagonal Fisher information matrix from the pre-training loss. (b) Add the EWC penalty to the fine-tuning loss. (c) Compare training curves with and without EWC. (d) Evaluate the trade-off between task performance and general capability retention.
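A sketch of parts (a) and (b); this uses the empirical diagonal Fisher (mean squared gradient), and loss_fn / data_iter are placeholder names:

```python
import torch

def diagonal_fisher(model, loss_fn, data_iter, num_batches=50):
    """(a) Estimate the diagonal Fisher information as the mean squared gradient."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()
              if p.requires_grad}
    for _ in range(num_batches):
        model.zero_grad()
        loss_fn(model, next(data_iter)).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2 / num_batches
    return fisher

def ewc_penalty(model, fisher, reference_params, lam=1.0):
    """(b) Quadratic penalty pulling parameters toward their pre-fine-tuning values."""
    penalty = 0.0
    for n, p in model.named_parameters():
        if n in fisher:
            penalty = penalty + (fisher[n] * (p - reference_params[n]) ** 2).sum()
    return (lam / 2) * penalty
```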
Exercise 24.17: Multi-LoRA Serving
Implement a simple multi-LoRA serving system: (a) Load a base model and three different LoRA adapters. (b) Implement adapter switching at inference time. (c) Measure the latency of adapter switching. (d) Compare with serving three separate merged models.
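A sketch of parts (a)-(c) using the peft adapter-management API; the model ID and adapter paths are placeholders, and the method names follow recent peft versions (check the current documentation):

```python
import time
import torch
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf",  # placeholder base model
                                            torch_dtype=torch.bfloat16,
                                            device_map="auto")
# (a) Attach three adapters to the same base model.
model = PeftModel.from_pretrained(base, "adapters/task_a", adapter_name="task_a")
model.load_adapter("adapters/task_b", adapter_name="task_b")
model.load_adapter("adapters/task_c", adapter_name="task_c")

# (b), (c) Switch the active adapter and time the switch.
for name in ["task_a", "task_b", "task_c"]:
    start = time.perf_counter()
    model.set_adapter(name)
    print(name, f"{(time.perf_counter() - start) * 1e3:.2f} ms")
```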
Exercise 24.18: Data Mixing for Fine-Tuning
Implement a data mixing strategy: (a) Combine task-specific data with general instruction data, where $\alpha$ is the fraction of task-specific examples in the mixture. (b) Fine-tune with $\alpha \in \{0.5, 0.7, 0.9, 1.0\}$ ($\alpha = 1.0$ means task-only data). (c) Evaluate each model on both the target task and general benchmarks. (d) Plot the Pareto frontier of task performance vs. general capability.
Exercise 24.19: QLoRA Fine-Tuning Pipeline
Build a complete QLoRA fine-tuning pipeline using HuggingFace libraries (a configuration sketch follows the item list):
(a) Load a 7B model in 4-bit quantization with BitsAndBytesConfig.
(b) Apply LoRA to all attention layers.
(c) Prepare an instruction dataset with chat template.
(d) Train for 1 epoch and evaluate on a held-out set.
(e) Save and reload the adapter.
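A configuration sketch for parts (a) and (b) is shown below; the model ID is a placeholder, and the q_proj/k_proj/v_proj/o_proj module names match Llama/Mistral-style architectures (other families may name their projections differently):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "mistralai/Mistral-7B-v0.1"   # placeholder; any 7B causal LM

# (a) 4-bit NF4 quantization with double quantization and bf16 compute.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(model_id,
                                             quantization_config=bnb_config,
                                             device_map="auto")
model = prepare_model_for_kbit_training(model)

# (b) LoRA on all attention projections.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# (c)-(e): format the data with the tokenizer's chat template, train with
# transformers.Trainer or trl's SFTTrainer, save the adapter with
# model.save_pretrained(...), and reload it with PeftModel.from_pretrained(...).
```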
Exercise 24.20: Chat Template Investigation
For three different model families (e.g., Llama, Mistral, Gemma): (a) Print the chat template for each tokenizer. (b) Format the same conversation with each template and compare the results. (c) Demonstrate what happens when you use the wrong template (e.g., Llama template on Mistral). (d) Implement a template validator that checks for common formatting errors.
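A sketch of parts (a) and (b); the model IDs are examples (some are gated on the Hugging Face Hub) and the conversation is arbitrary:

```python
from transformers import AutoTokenizer

model_ids = ["meta-llama/Llama-3.1-8B-Instruct",    # example IDs; substitute as needed
             "mistralai/Mistral-7B-Instruct-v0.3",
             "google/gemma-2-9b-it"]

conversation = [
    {"role": "user", "content": "What is your return policy?"},
    {"role": "assistant", "content": "You can return items within 30 days."},
    {"role": "user", "content": "Does that include sale items?"},
]

for model_id in model_ids:
    tok = AutoTokenizer.from_pretrained(model_id)
    print("=" * 30, model_id)
    print(tok.chat_template)                                      # (a) raw Jinja template
    print(tok.apply_chat_template(conversation, tokenize=False,
                                  add_generation_prompt=True))    # (b) formatted text
```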
Challenge Exercises
Exercise 24.21: TIES-Merging Implementation
Implement the TIES-Merging algorithm for combining multiple LoRA adapters: (a) Train three LoRA adapters on different tasks. (b) Implement the trim, elect-sign, and disjoint-merge steps. (c) Evaluate the merged model on all three tasks. (d) Compare with simple linear interpolation.
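A sketch of part (b) operating on flattened task vectors (fine-tuned weights minus base weights); the 20% density used for trimming is an arbitrary choice:

```python
import torch

def ties_merge(task_vectors, density=0.2):
    """Trim, elect sign, and disjoint-merge a list of 1-D task vectors."""
    trimmed = []
    for tv in task_vectors:
        # Trim: keep only the top `density` fraction of entries by magnitude.
        k = max(1, int(density * tv.numel()))
        threshold = tv.abs().kthvalue(tv.numel() - k + 1).values
        trimmed.append(torch.where(tv.abs() >= threshold, tv, torch.zeros_like(tv)))
    stacked = torch.stack(trimmed)
    # Elect: per coordinate, pick the sign with the larger total magnitude.
    elected_sign = torch.sign(stacked.sum(dim=0))
    elected_sign[elected_sign == 0] = 1.0
    # Merge: average only the surviving entries that agree with the elected sign.
    agree = (torch.sign(stacked) == elected_sign) & (stacked != 0)
    counts = agree.sum(dim=0).clamp(min=1)
    return (stacked * agree).sum(dim=0) / counts
```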
Exercise 24.22: Progressive Fine-Tuning
Implement progressive fine-tuning where layers are gradually unfrozen: (a) Start with only the last 2 layers trainable. (b) Every $N$ steps, unfreeze 2 more layers. (c) Compare with standard LoRA (all layers from the start). (d) Measure catastrophic forgetting at each stage.
Exercise 24.23: Hyperparameter Search for LoRA
Conduct a systematic hyperparameter search for LoRA fine-tuning: (a) Define a search space: rank $\in \{4, 8, 16, 32, 64\}$, alpha $\in \{1\times r, 2\times r\}$, learning rate $\in \{1\text{e-5}, 2\text{e-5}, 5\text{e-5}\}$, target modules $\in \{\text{QV only}, \text{QKVO}, \text{all linear}\}$. (b) Implement grid search or random search. (c) Train for 500 steps each and evaluate on a validation set. (d) Analyze which hyperparameters matter most.
Exercise 24.24: Data Efficiency Analysis
Study how the amount of fine-tuning data affects performance: (a) Fine-tune with 100, 500, 1000, 5000, and 10000 examples. (b) Plot learning curves (validation loss vs. training set size). (c) At what point does adding more data show diminishing returns? (d) Compare the data efficiency of full fine-tuning vs. LoRA.
Exercise 24.25: Continual Fine-Tuning
Simulate continual fine-tuning across three sequential tasks: (a) Fine-tune on Task 1, evaluate on Tasks 1, 2, 3. (b) Fine-tune on Task 2 (starting from Task 1 checkpoint), evaluate on all. (c) Fine-tune on Task 3, evaluate on all. (d) Compare: (i) sequential fine-tuning, (ii) separate LoRA per task, (iii) training on all tasks simultaneously.
Exercise 24.26: Model Merging with DARE
Implement the DARE (Drop And REscale) merging algorithm: (a) Train two LoRA adapters on different tasks. (b) Implement the DARE algorithm: randomly drop parameters, rescale, merge. (c) Sweep the drop probability $p \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$. (d) Evaluate each merged model on both tasks.
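A sketch of part (b); each delta is a task vector (fine-tuned weight minus base weight), entries are dropped with probability $p$, and the survivors are rescaled by $1/(1-p)$:

```python
import torch

def dare_merge(base_weight, deltas, p=0.5):
    """Drop-and-rescale merge of several task deltas into a base weight tensor."""
    merged = base_weight.clone()
    for delta in deltas:
        keep = torch.bernoulli(torch.full_like(delta, 1.0 - p))   # 1 = keep, 0 = drop
        merged += keep * delta / (1.0 - p)                        # rescale survivors
    return merged
```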
Exercise 24.27: Instruction Dataset Quality Analysis
Analyze the quality of an instruction dataset: (a) Compute diversity metrics: unique instruction templates, n-gram diversity, task type distribution. (b) Identify near-duplicate examples using embedding similarity. (c) Use an LLM to rate the quality of 100 random examples. (d) Propose filtering criteria and measure the effect of filtering on model quality.
Exercise 24.28: Full Fine-Tuning Comparison
Compare full fine-tuning with LoRA on a single task: (a) Fine-tune both with the same data and number of steps. (b) Compare: final loss, task accuracy, general benchmark retention, training time, memory usage. (c) At what dataset size does LoRA start to underperform full fine-tuning? (d) Can increasing LoRA rank close the gap?
Exercise 24.29: SFT with Response Masking
Implement SFT training with and without response-only masking: (a) Implement a training loop where loss is computed on all tokens. (b) Implement a training loop where loss is computed only on assistant response tokens. (c) Compare the two approaches on instruction-following quality. (d) Analyze which approach leads to better generalization.
Exercise 24.30: Fine-Tuning for Structured Output
Fine-tune a model to produce consistent JSON output: (a) Create a dataset of 500 examples with text inputs and structured JSON outputs. (b) Fine-tune with LoRA targeting the output projection. (c) Measure the JSON validity rate before and after fine-tuning. (d) Compare with prompt engineering approaches for the same task.
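A small helper for part (c), measuring the fraction of generations that parse as valid JSON:

```python
import json

def json_validity_rate(outputs):
    """Fraction of model output strings that parse as valid JSON."""
    valid = 0
    for text in outputs:
        try:
            json.loads(text)
            valid += 1
        except json.JSONDecodeError:
            pass
    return valid / len(outputs) if outputs else 0.0
```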