Chapter 24: Key Takeaways

When to Fine-Tune

  1. Fine-tuning modifies model parameters to internalize new behaviors. Unlike prompting (fixed parameters, behavior guided by context), fine-tuning creates persistent changes that affect all future interactions. Choose fine-tuning when the task requires deep customization, consistent behavior, or domain expertise that prompting cannot achieve.

  2. A systematic decision framework balances cost, quality, and complexity. Favor prompting for simple tasks with few examples and rapidly changing requirements. Favor fine-tuning for complex tasks with hundreds of examples or more, strict consistency needs, and stable requirements. The hybrid approach (prompt first, collect data, then fine-tune) is often optimal.

  3. Fine-tuning reduces per-query costs at the expense of upfront training cost. By eliminating long system prompts and few-shot examples, fine-tuned models use fewer input tokens per query. This becomes cost-effective at high query volumes.
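
As a rough illustration of the break-even point, the sketch below compares the per-query token savings against a one-time training cost. Every price, token count, and cost figure is an assumption chosen for illustration, not a real quote.

```python
# Illustrative break-even estimate: prompting with a long preamble vs. a
# fine-tuned model that needs no system prompt or few-shot examples.
# All numbers below are assumptions for illustration only.

PROMPT_OVERHEAD_TOKENS = 2_000       # system prompt + few-shot examples per query
PRICE_PER_1K_INPUT_TOKENS = 0.0005   # assumed price in $ per 1K input tokens
FINE_TUNING_COST = 500.0             # assumed one-time training cost in $

saving_per_query = PROMPT_OVERHEAD_TOKENS / 1_000 * PRICE_PER_1K_INPUT_TOKENS
break_even_queries = FINE_TUNING_COST / saving_per_query

print(f"saving per query:  ${saving_per_query:.4f}")            # $0.0010
print(f"break-even volume: {break_even_queries:,.0f} queries")  # 500,000 queries
```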

Full Fine-Tuning

  1. Full fine-tuning updates all parameters and achieves the best performance. The training objective is standard cross-entropy loss on the fine-tuning data, using learning rates 10-100x smaller than pre-training to prevent catastrophic forgetting.

  2. Memory requirements are prohibitive for large models. A 7B-parameter model in fp32 requires roughly 120 GB or more for parameters, gradients, optimizer states, and activations (see the estimate below). Mixed precision, gradient checkpointing, and distributed training (DeepSpeed ZeRO, FSDP) can reduce but not eliminate these demands.
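
The 120+ GB figure follows from a back-of-the-envelope count of the per-parameter state that full fine-tuning keeps in GPU memory. The sketch below assumes fp32 everywhere and Adam's two optimizer states per parameter; activation memory is omitted because it depends on batch size and sequence length.

```python
# Rough memory estimate for full fine-tuning of a 7B model in fp32 with Adam.
# Activations are excluded (they depend on batch size and sequence length).

params = 7e9
bytes_per_value = 4  # fp32

weights    = params * bytes_per_value        # 28 GB of parameters
gradients  = params * bytes_per_value        # 28 GB of gradients
adam_state = params * bytes_per_value * 2    # 56 GB: first and second moments

total_gb = (weights + gradients + adam_state) / 1e9
print(f"~{total_gb:.0f} GB before activations")  # ~112 GB; activations push this past 120 GB
```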

LoRA and QLoRA

  1. LoRA constrains weight updates to a low-rank subspace. The update $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ (where $r \ll \min(d, k)$) reduces trainable parameters by 100x or more. This is motivated by the low intrinsic dimensionality of fine-tuning updates; a minimal implementation sketch follows this list.

  2. LoRA initialization ensures training starts from the pre-trained model. Setting $B = 0$ guarantees $\Delta W = 0$ at initialization. The scaling factor $\alpha / r$ controls update magnitude across different rank settings.

  3. Apply LoRA broadly at low rank rather than narrowly at high rank. Applying LoRA to all attention matrices (Q, K, V, O) across all layers with a moderate rank (8-16) typically outperforms a higher rank applied to fewer layers for the same parameter budget.

  4. QLoRA enables fine-tuning on consumer hardware. By storing the base model in 4-bit NF4 precision and training LoRA parameters in higher precision, QLoRA reduces memory to approximately 6 GB for a 7B model, fitting on a single 24 GB GPU (see the configuration sketch after this list).

  5. LoRA weights can be merged for zero-overhead inference. After training, compute $W_{\text{merged}} = W_0 + \frac{\alpha}{r}BA$ to obtain a standard model. Multiple LoRA adapters can be stored compactly and swapped for multi-task serving.
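
The points above about initialization, scaling, and merging are easiest to see in code. The sketch below is a minimal PyTorch illustration of the idea, not the implementation used by the peft library; the 0.01-scaled random initialization of A is an arbitrary choice for this example.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper around a frozen linear layer (illustrative sketch)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)                           # freeze W_0
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)    # small random init for A
        self.B = nn.Parameter(torch.zeros(d_out, r))          # B = 0  =>  Delta W = 0 at start
        self.scaling = alpha / r                              # the alpha / r factor

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # W_0 x + (alpha/r) * B A x, with BA never materialized during training
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scaling

    @torch.no_grad()
    def merge(self) -> nn.Linear:
        """Fold the update into the base weights: W_merged = W_0 + (alpha/r) B A."""
        self.base.weight += self.scaling * (self.B @ self.A)
        return self.base
```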
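
With the Hugging Face stack, the recommendations above (broad low-rank placement, 4-bit NF4 storage) are usually expressed as configuration. The sketch below assumes the transformers, peft, and bitsandbytes libraries; the model id is only an example, and the q_proj/k_proj/v_proj/o_proj names are Llama-style module names that vary by architecture.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# QLoRA-style loading: the frozen base model is stored in 4-bit NF4
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",      # example model id
    quantization_config=bnb_config,
    device_map="auto",
)

# Broad, low-rank placement: all attention projections at a moderate rank
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # typically well under 1% of total parameters
```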

Other PEFT Methods

  1. Adapter layers insert trainable bottleneck modules. They add new computation paths (down-project, nonlinearity, up-project, residual) but, because the nonlinearity prevents merging into the base weights, they introduce inference latency that cannot be eliminated (a minimal module sketch follows this list).

  2. Prefix tuning and prompt tuning add trainable vectors to attention layers or the input. Prefix tuning prepends trainable key and value prefixes at every layer ($L \times 2 \times p \times d$ parameters) and is more expressive; prompt tuning adds vectors only at the input ($p \times d$ parameters) but approaches full fine-tuning performance at large model scales.
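
A minimal PyTorch sketch of the bottleneck adapter pattern, assuming a zero-initialized up-projection so the adapted model starts out equivalent to the base model (an assumption of this example; implementations differ):

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project -> nonlinearity -> up-project -> residual."""

    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, d_model)
        nn.init.zeros_(self.up.weight)   # start as identity so training begins at the base model
        nn.init.zeros_(self.up.bias)

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # Residual path keeps the original activation; the bottleneck adds a learned correction
        return hidden + self.up(self.act(self.down(hidden)))
```

To make the prefix/prompt parameter counts concrete with illustrative values ($L = 32$ layers, prefix length $p = 30$, hidden size $d = 4096$): prefix tuning trains $32 \times 2 \times 30 \times 4096 \approx 7.9$M parameters, while prompt tuning trains only $30 \times 4096 \approx 0.12$M.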

Instruction Tuning

  1. Instruction tuning creates models that follow diverse natural language instructions. The training data should be diverse in task type, instruction phrasing, and complexity level. Standard formats include Alpaca, ShareGPT, and OpenAI chat formats.

  2. Dataset quality trumps quantity. Careful curation, filtering for correctness and helpfulness, deduplication, and complexity balancing produce better models than simply scaling up noisy data. Self-instruct and evol-instruct are effective data generation strategies.

  3. Chat templates are model-specific and critical. Using the wrong template (e.g., Llama template on Mistral) severely degrades performance. Always use tokenizer.apply_chat_template() for correct formatting.
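
The sketch below shows the intended usage with a Hugging Face tokenizer; the model id and the Alpaca-style record are placeholders for illustration.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.2")  # example model id

# An Alpaca-style record, converted to chat messages before templating
record = {
    "instruction": "Summarize the following text in one sentence.",
    "input": "LoRA constrains weight updates to a low-rank subspace ...",
    "output": "LoRA fine-tunes models by training small low-rank update matrices.",
}
messages = [
    {"role": "user", "content": f"{record['instruction']}\n\n{record['input']}"},
    {"role": "assistant", "content": record["output"]},
]

# The tokenizer inserts the model-specific control tokens; never hand-write them.
text = tokenizer.apply_chat_template(messages, tokenize=False)
print(text)
```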

Training and Evaluation

  1. SFT training should mask the loss on instruction tokens. Computing loss only on assistant response tokens focuses training on generation quality rather than instruction reproduction (see the masking sketch after this list).

  2. Evaluation must cover task performance and general capability retention. A model that excels on the target task but has lost general abilities may be less useful overall. Evaluate on standard benchmarks (MMLU, HellaSwag, ARC) alongside task-specific metrics.

  3. Catastrophic forgetting is the primary risk of fine-tuning. Mitigations include PEFT methods (which freeze most parameters), low learning rates, few epochs, data mixing (general data alongside task data), and regularization (EWC, weight decay).
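
As a sketch of the loss-masking point above (item 1), the helper below sets instruction-token labels to -100, the index that PyTorch's cross-entropy loss ignores by default; the function name and single-sequence layout are assumptions of this example.

```python
import torch

IGNORE_INDEX = -100  # ignored by PyTorch's cross-entropy loss by default

def mask_prompt_tokens(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Return labels that incur loss only on the assistant response tokens."""
    labels = input_ids.clone()
    labels[:prompt_len] = IGNORE_INDEX   # instruction/prompt tokens contribute no loss
    return labels

# Example: a 10-token sequence whose first 6 tokens are the prompt
input_ids = torch.arange(10)
print(mask_prompt_tokens(input_ids, prompt_len=6))
# tensor([-100, -100, -100, -100, -100, -100, 6, 7, 8, 9])
```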

Advanced Topics

  1. Multi-task fine-tuning improves generalization. Training on multiple tasks simultaneously with appropriate task weighting helps the model maintain breadth while developing depth on specific tasks.

  2. Model merging enables combining specialized capabilities. Techniques like linear interpolation, TIES-Merging, and DARE allow merging multiple LoRA adapters trained on different tasks into a single model.
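
A minimal sketch of the simplest of these techniques, linear interpolation of checkpoints; TIES-Merging and DARE add sign resolution and sparsification on top of this idea. The function name and the equal weights in the usage comment are assumptions of this example.

```python
import torch

def linear_merge(state_dicts: list[dict], weights: list[float]) -> dict:
    """Weighted average of model state dicts (plain linear interpolation)."""
    assert abs(sum(weights) - 1.0) < 1e-6, "merge weights should sum to 1"
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Usage: merge two task-specialized fine-tunes of the same base model
# merged = linear_merge([model_a.state_dict(), model_b.state_dict()], [0.5, 0.5])
```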