In This Chapter
- 24.1 Introduction: Beyond Prompting
- 24.2 When to Fine-Tune vs. When to Prompt
- 24.3 Full Fine-Tuning
- 24.4 Parameter-Efficient Fine-Tuning (PEFT)
- 24.5 LoRA: Low-Rank Adaptation
- 24.6 QLoRA: Quantized LoRA
- 24.7 Adapter Layers
- 24.8 Prefix Tuning and Prompt Tuning
- 24.9 Instruction Tuning and Dataset Creation
- 24.10 Supervised Fine-Tuning with HuggingFace TRL
- 24.11 Evaluation of Fine-Tuned Models
- 24.12 Catastrophic Forgetting
- 24.13 Advanced Topics
- 24.14 Summary
- References
Chapter 24: Fine-Tuning Large Language Models
Part IV: Attention, Transformers, and Language Models
24.1 Introduction: Beyond Prompting
Chapter 23 demonstrated that prompt engineering can accomplish a remarkable range of tasks. But prompting has fundamental limitations. When an application requires deep behavioral customization, consistent adherence to complex formatting rules, domain-specific expertise that the base model lacks, or performance that exceeds what few-shot demonstrations can achieve, fine-tuning becomes necessary.
Fine-tuning adapts a pre-trained language model to a specific task or domain by continuing the training process on a curated dataset. Unlike prompting, fine-tuning modifies the model's parameters, enabling it to internalize new behaviors, knowledge, and capabilities that persist across all future interactions.
This chapter provides a comprehensive treatment of modern fine-tuning techniques for large language models. We begin with the decision framework for when to fine-tune versus prompt, then cover full fine-tuning and its limitations. The bulk of the chapter focuses on parameter-efficient fine-tuning (PEFT) methods---LoRA, QLoRA, adapter layers, and prefix tuning---which make fine-tuning accessible even with limited hardware. We then cover instruction tuning, supervised fine-tuning with HuggingFace TRL, evaluation of fine-tuned models, and the critical problem of catastrophic forgetting.
What You Will Learn
By the end of this chapter, you will be able to:
- Determine when fine-tuning is appropriate versus prompting
- Implement full fine-tuning of a language model with PyTorch
- Explain the mathematical foundations of LoRA and apply it in practice
- Use QLoRA to fine-tune large models on consumer hardware
- Implement adapter layers and prefix tuning
- Create high-quality instruction-tuning datasets
- Perform supervised fine-tuning using the TRL library
- Evaluate fine-tuned models rigorously
- Diagnose and mitigate catastrophic forgetting
Prerequisites
This chapter assumes familiarity with:
- Transformer architecture and attention mechanisms (Chapter 20)
- Language model pre-training (Chapter 21)
- Transfer learning concepts (Chapter 22)
- Prompt engineering basics (Chapter 23)
- PyTorch training loops and optimizers
24.2 When to Fine-Tune vs. When to Prompt
24.2.1 The Decision Framework
The choice between prompting and fine-tuning involves multiple factors. Here is a systematic framework:
| Factor | Favor Prompting | Favor Fine-Tuning |
|---|---|---|
| Task complexity | Simple, well-defined tasks | Complex, nuanced behaviors |
| Data availability | Few or no labeled examples | Hundreds to thousands of examples |
| Latency requirements | Generous (long prompts OK) | Strict (short prompts needed) |
| Consistency needs | Moderate variability acceptable | High consistency required |
| Domain specialization | General knowledge sufficient | Deep domain expertise needed |
| Deployment constraints | API access only | Custom model hosting possible |
| Budget | Limited compute budget | Training budget available |
| Update frequency | Task changes frequently | Task is relatively stable |
24.2.2 Cost Analysis
Fine-tuning has an upfront training cost but reduces per-query costs by eliminating long prompts:
$$\text{Cost}_{\text{prompt}} = N_{\text{queries}} \times (C_{\text{input}} \times L_{\text{prompt}} + C_{\text{output}} \times L_{\text{response}})$$
$$\text{Cost}_{\text{finetune}} = C_{\text{train}} + N_{\text{queries}} \times (C_{\text{input}} \times L_{\text{short\_prompt}} + C_{\text{output}} \times L_{\text{response}})$$
Fine-tuning becomes cost-effective when:
$$N_{\text{queries}} > \frac{C_{\text{train}}}{C_{\text{input}} \times (L_{\text{prompt}} - L_{\text{short\_prompt}})}$$
For high-volume applications, the crossover point can be reached quickly, making fine-tuning more economical.
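To make the crossover concrete, the following sketch evaluates the break-even formula numerically; the training cost, token price, and prompt lengths are purely illustrative.

```python
def breakeven_queries(
    train_cost: float,        # one-time fine-tuning cost (dollars)
    input_price: float,       # dollars per input token
    long_prompt_tokens: int,  # prompt length required when relying on prompting
    short_prompt_tokens: int, # prompt length after fine-tuning
) -> float:
    """Number of queries at which fine-tuning becomes cheaper than prompting."""
    saved_per_query = input_price * (long_prompt_tokens - short_prompt_tokens)
    return train_cost / saved_per_query

# Illustrative numbers: a $500 training run, $2 per million input tokens,
# and a 3,000-token few-shot prompt replaced by a 200-token prompt.
n = breakeven_queries(500.0, 2e-6, 3000, 200)
print(f"Fine-tuning pays off after ~{n:,.0f} queries")  # roughly 89,000 queries
```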
24.2.3 The Hybrid Approach
In practice, the best approach is often hybrid:
- Start with prompting to validate the task is feasible
- Collect data from the prompting system (inputs, outputs, human corrections)
- Fine-tune a smaller model on the collected data
- Use the fine-tuned model for production, falling back to the prompted large model for edge cases
This approach combines rapid prototyping (prompting) with efficient production deployment (fine-tuning).
24.2.4 Practical Examples of the Decision
To make the framework concrete, consider these scenarios:
Scenario 1: Customer support classification. You need to classify incoming support tickets into 10 categories with 95%+ accuracy. You have 50,000 labeled examples. Decision: Fine-tune. The volume of labeled data, consistency requirements, and high accuracy threshold all favor fine-tuning. A LoRA fine-tuned model will be faster and cheaper per query than few-shot prompting.
Scenario 2: One-off research summarization. A researcher needs to summarize 50 academic papers. Decision: Prompt. The task is a one-time effort, the volume is low, and the format (summarization) is well within the capabilities of prompted models. Fine-tuning would be overkill.
Scenario 3: Medical report structuring. A hospital needs to extract structured information from radiology reports into a specific JSON schema, with domain-specific terminology. Decision: Fine-tune. The domain specialization, strict formatting requirements, and high-stakes nature of the task all favor fine-tuning. The model needs to internalize medical vocabulary and the specific JSON schema reliably.
Scenario 4: Internal chatbot. A company wants a chatbot that answers questions about internal policies, using a set of policy documents as context. Decision: RAG + prompting first, fine-tune later if needed. Retrieval-augmented generation (as we previewed in Chapter 23) handles the knowledge grounding, and prompting handles the interaction style. Fine-tune only if the prompting approach fails to meet quality requirements.
These examples illustrate that the decision is rarely binary---it depends on the specific combination of data availability, quality requirements, deployment constraints, and budget. The hybrid approach outlined in Section 24.2.3 is almost always the safest starting point.
24.3 Full Fine-Tuning
24.3.1 Definition
Full fine-tuning updates all parameters of the pre-trained model on the downstream dataset. Given a pre-trained model with parameters $\theta_0$, we optimize:
$$\theta^* = \arg\min_\theta \; \mathcal{L}(\theta; \mathcal{D}_{\text{finetune}})$$
where $\mathcal{D}_{\text{finetune}}$ is the fine-tuning dataset and $\mathcal{L}$ is typically the cross-entropy loss for language modeling:
$$\mathcal{L}(\theta) = -\frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} \sum_{t=1}^{|y|} \log p_\theta(y_t \mid x, y_{ Learning rate. Fine-tuning uses a much smaller learning rate than pre-training---typically $10^{-5}$ to $5 \times 10^{-5}$, compared to $10^{-4}$ to $3 \times 10^{-4}$ for pre-training. This prevents catastrophic forgetting by making small updates to the pre-trained weights. Learning rate schedule. A linear warmup followed by cosine decay is standard: $$\eta(t) = \begin{cases} \eta_{\max} \cdot \frac{t}{T_{\text{warmup}}} & \text{if } t < T_{\text{warmup}} \\ \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})(1 + \cos(\frac{t - T_{\text{warmup}}}{T - T_{\text{warmup}}} \pi)) & \text{otherwise} \end{cases}$$ Batch size. Larger batch sizes provide more stable gradients but require more memory. Gradient accumulation can simulate large batches: $$\theta \leftarrow \theta - \eta \cdot \frac{1}{K} \sum_{k=1}^{K} \nabla_\theta \mathcal{L}(\theta; B_k)$$ where $K$ is the number of accumulation steps and $B_k$ are micro-batches. Number of epochs. Fine-tuning typically requires 1--5 epochs. More epochs risk overfitting to the fine-tuning data, especially with small datasets. Full fine-tuning of a model with $P$ parameters requires storing: For a 7-billion parameter model in float32, this exceeds the memory of most single GPUs. Mixed-precision training (fp16/bf16) approximately halves these requirements but still demands substantial hardware. Mixed-precision training. Store weights in fp32 but compute forward and backward passes in fp16/bf16: $$\text{Memory} \approx 4P + 2P + 2P + 8P = 16P \text{ bytes (with fp32 optimizer)}$$ Gradient checkpointing. Trade compute for memory by recomputing activations during the backward pass instead of storing them. Reduces activation memory by a factor of $\sqrt{L}$ (where $L$ is the number of layers) at the cost of ~33% more computation. DeepSpeed ZeRO. Partition optimizer states, gradients, and parameters across multiple GPUs. ZeRO Stage 3 distributes all three, enabling training of models that would not fit on a single GPU. FSDP (Fully Sharded Data Parallel). PyTorch's native implementation of model sharding, which distributes model parameters and optimizer states across GPUs. Despite its effectiveness, full fine-tuning has significant drawbacks: These limitations motivate parameter-efficient fine-tuning methods. For completeness, here is a minimal full fine-tuning implementation in PyTorch. While PEFT methods are preferred for most use cases, understanding full fine-tuning helps build intuition for why PEFT works: Note the key differences from pre-training: the learning rate is 10-100x lower ($2 \times 10^{-5}$ vs. $3 \times 10^{-4}$), gradient clipping is applied to prevent large updates, and the number of epochs is small (1-5). These choices collectively prevent the fine-tuning process from straying too far from the pre-trained initialization, as we saw in Section 24.3.2. Parameter-efficient fine-tuning modifies only a small subset of model parameters while keeping the majority frozen. This dramatically reduces memory requirements, storage costs, and the risk of catastrophic forgetting. Formally, PEFT methods decompose the fine-tuned parameters as: $$\theta_{\text{finetuned}} = \theta_0 + \Delta\theta$$ where $\theta_0$ are the frozen pre-trained parameters and $\Delta\theta$ is a small, trainable perturbation with $|\Delta\theta| \ll |\theta_0|$. 
The key insight is that the downstream task likely resides in a low-dimensional subspace of the full parameter space. We do not need to update all parameters to capture task-specific behavior.

To build intuition for this claim, consider an analogy. A pre-trained language model is like a Swiss Army knife---it has many tools (capabilities), but none are perfectly optimized for any specific task. Full fine-tuning rebuilds the entire knife from scratch, producing a specialized tool. PEFT methods, by contrast, make small adjustments---sharpening one blade, adjusting the angle of another---that specialize the tool for the task at hand while preserving its general utility.

The mathematical justification comes from the low intrinsic dimensionality result of Aghajanyan et al. (2021), who showed that the effective dimensionality of fine-tuning updates is often as low as a few hundred, even for models with billions of parameters. This means that the subspace of "useful" parameter changes is vanishingly small compared to the full parameter space, and PEFT methods exploit this structure directly.

PEFT methods can be categorized by where and how they introduce trainable parameters: addition-based, reparameterization-based, and selection-based methods. Each category has its own trade-offs. Addition-based methods are the most flexible but add parameters to the model. Reparameterization-based methods like LoRA are the most popular because they add no parameters at inference time (after merging). Selection-based methods are the simplest conceptually but offer the least control over what is adapted.

LoRA (Hu et al., 2022) is motivated by the observation that the weight updates during fine-tuning have low intrinsic rank. That is, the difference $\Delta W = W_{\text{finetuned}} - W_{\text{pretrained}}$ can be well-approximated by a low-rank matrix. Aghajanyan et al. (2021) showed that pre-trained models have a low intrinsic dimensionality---they can be fine-tuned effectively in a much lower-dimensional subspace than the full parameter space.

For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA constrains the weight update to be low-rank:

$$W = W_0 + \Delta W = W_0 + BA$$

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with rank $r \ll \min(d, k)$.

The intuition is elegant: rather than learning a full $d \times k$ update matrix (which has $dk$ free parameters), we factorize it into two much smaller matrices that form a bottleneck of dimension $r$. The input first projects down to $r$ dimensions through $A$, then projects back up to $d$ dimensions through $B$. This is analogous to a very thin autoencoder applied to the weight update---the bottleneck forces the update to lie in a low-dimensional subspace.

The forward pass becomes:

$$h = W_0 x + \Delta W x = W_0 x + BAx$$

where:
- $W_0 x$ is the original pre-trained computation (frozen, no gradients computed)
- $BAx$ is the low-rank adaptation (trainable)
- $x \in \mathbb{R}^k$ is the input activation
- $h \in \mathbb{R}^d$ is the output activation

During training, $W_0$ is frozen and only $A$ and $B$ are updated. The number of trainable parameters per LoRA layer is:

$$|\Delta\theta|_{\text{LoRA}} = r \times (d + k)$$

compared to $d \times k$ for full fine-tuning. For typical values ($d = k = 4096$, $r = 16$), this is a reduction factor of $\frac{d \times k}{r \times (d + k)} = \frac{4096^2}{16 \times 8192} = 128\times$.

Worked example. Consider a Llama-7B model where the attention query projection has $d = k = 4096$. Full fine-tuning of this single weight matrix requires updating $4096 \times 4096 = 16,777,216$ parameters. With LoRA at rank $r = 16$, we instead train $16 \times (4096 + 4096) = 131,072$ parameters---a 128x reduction. Applied to all attention matrices (Q, K, V, O) across all 32 layers, the total LoRA parameters are $4 \times 32 \times 131,072 = 16,777,216$---compared to $4 \times 32 \times 16,777,216 = 2,147,483,648$ for full fine-tuning of those same matrices. LoRA trains less than 1% of the full parameter count while achieving comparable task performance.

LoRA uses a specific initialization scheme: $A$ is initialized randomly (typically Kaiming uniform) and $B$ is initialized to zero. This ensures that $\Delta W = BA = 0$ at the start of training, so the model begins from the exact pre-trained weights. This is critical for stable training.

LoRA introduces a scaling factor $\alpha$:

$$h = W_0 x + \frac{\alpha}{r} BAx$$

The ratio $\frac{\alpha}{r}$ controls the magnitude of the LoRA update relative to the pre-trained weights. A common practice is to set $\alpha = 2r$, making the effective scaling factor 2. When the rank $r$ is increased, the per-element contribution of the LoRA update is automatically reduced.

LoRA is typically applied to the attention weight matrices: $W_Q$, $W_K$, $W_V$, and $W_O$. Research suggests that adapting all of the attention projections (and often the feed-forward projections as well, as in the end-to-end example in Section 24.10) works better than concentrating the same parameter budget on one or two matrices at a higher rank.

The rank $r$ controls the expressiveness of the adaptation: small ranks (4--16) are usually sufficient for style and format adaptations, while larger ranks (32--64) help with more complex behavioral changes. The optimal rank depends on the task complexity and the distance between the pre-training and fine-tuning distributions.

To solidify understanding, Section 24.5.7 gives a minimal LoRA layer implementation.

After training, the LoRA weights can be merged into the base model:

$$W_{\text{merged}} = W_0 + \frac{\alpha}{r} BA$$

This eliminates the inference overhead of the separate LoRA path, resulting in the same architecture and latency as the original model. Multiple LoRA adaptations can be stored and swapped efficiently. A single 7B-parameter base model can serve dozens of specialized tasks by hot-swapping LoRA adapters (typically 10-50 MB each), rather than maintaining separate model deployments.

QLoRA (Dettmers et al., 2023) combines LoRA with aggressive quantization to enable fine-tuning of large models on consumer hardware. The key innovation is that the base model is stored in 4-bit precision while LoRA parameters are trained in higher precision.

4-bit NormalFloat (NF4). QLoRA introduces the NF4 data type, which is information-theoretically optimal for normally distributed weights. Pre-trained neural network weights are approximately normally distributed, so NF4 provides better representation than standard 4-bit integer quantization.

Double quantization. The quantization constants themselves are quantized (from fp32 to fp8), further reducing memory overhead:

$$\text{Memory per parameter} = 4\text{ bits} + \frac{32}{64} \text{ bits (first quant)} + \frac{8}{256} \text{ bits (double quant)} \approx 4.5 \text{ bits}$$

Paged optimizers.
QLoRA uses paged optimizers that automatically offload optimizer states from GPU to CPU memory when GPU memory is exhausted, then page them back when needed.

For a 7B parameter model, the resulting memory requirements are summarized in the table in Section 24.6.3. QLoRA makes it possible to fine-tune a 7B model on a single consumer GPU (e.g., RTX 3090 with 24GB) and a 65B model on a single 48GB GPU.

To understand why QLoRA works so well, it helps to understand the NF4 data type. Standard 4-bit integer quantization divides the range $[\min(w), \max(w)]$ into 16 equally-spaced bins. But neural network weights are approximately normally distributed, with most values near zero and few at the extremes. Uniform quantization wastes bins on the sparse tails.

NF4 instead places quantization bins at the quantiles of the standard normal distribution. The 16 quantization levels are chosen so that each bin contains an equal probability mass under the normal distribution. Formally, the NF4 levels are:

$$q_i = \Phi^{-1}\left(\frac{i + 0.5}{16}\right) \quad \text{for } i = 0, 1, \ldots, 15$$

where $\Phi^{-1}$ is the inverse CDF (quantile function) of the standard normal. This means bins near zero (where most weights cluster) are narrow, providing fine precision, while bins in the tails are wide, which is acceptable because few weights fall there. The result is an information-theoretically optimal quantization for normally distributed data, as demonstrated by Dettmers et al.

Remarkably, QLoRA matches the performance of full fine-tuning and 16-bit LoRA on most benchmarks, and the quantization of the base model introduces minimal degradation in practice. The practical impact is dramatic: a researcher with a single consumer GPU (24GB VRAM) can fine-tune a 7B parameter model that would otherwise require 120+ GB for full fine-tuning. This democratization of fine-tuning capability has been one of the most significant developments in making LLM customization accessible to a broad community.

Adapter layers (Houlsby et al., 2019) insert small, trainable bottleneck modules between the existing layers of the pre-trained model. Each adapter consists of a down-projection, a nonlinearity, and an up-projection, wrapped in a residual connection. The adapter computation is:

$$h = h + W_{\text{up}} \cdot \sigma(W_{\text{down}} \cdot h)$$

where $h$ is the hidden representation and $\sigma$ is the activation function.

Adapters are typically placed after the attention sublayer, after the feed-forward sublayer, or both. Houlsby et al. (2019) found that placing adapters after both sublayers performs best, but the serial adapter configuration (after the FFN only) achieves comparable performance with fewer parameters.

LoRA has become the dominant PEFT method primarily because it introduces zero inference latency after merging and is simpler to implement. For reference, Section 24.7.4 gives a minimal adapter layer implementation. The zero initialization of the up-projection ensures that at the start of training, the adapter contributes nothing (like LoRA's zero initialization of $B$), so the model begins from the exact pre-trained checkpoint. During training, only the adapter parameters are updated while the main Transformer layers remain frozen.

Prefix tuning (Li & Liang, 2021) prepends trainable continuous vectors (the "prefix") to the key and value matrices at each attention layer. The prefix is analogous to a learned "soft prompt" that operates in the activation space rather than the token space. Formally, at each layer $l$, the attention computation becomes:

$$\text{Attention}(Q, [P_K^l; K], [P_V^l; V])$$

where $P_K^l, P_V^l \in \mathbb{R}^{p \times d}$ are the trainable prefix matrices and $p$ is the prefix length.
The total number of trainable parameters is:

$$|\Delta\theta|_{\text{prefix}} = L \times 2 \times p \times d$$

where $L$ is the number of layers.

Prompt tuning (Lester et al., 2021) is a simplified version that only prepends trainable embeddings to the input, rather than to every layer:

$$\text{input} = [e_1, e_2, \ldots, e_p, x_1, x_2, \ldots, x_n]$$

where $e_i \in \mathbb{R}^d$ are the trainable "soft prompt" tokens. Prompt tuning uses far fewer parameters than prefix tuning ($p \times d$ vs. $L \times 2 \times p \times d$) but is generally less expressive. However, Lester et al. showed that as model size increases, prompt tuning approaches the performance of full fine-tuning.

IA3 (Liu et al., 2022) takes a radically minimal approach to parameter-efficient adaptation. Instead of adding new weight matrices, IA3 introduces learned vectors that scale the existing keys, values, and feed-forward activations element-wise:

$$\text{Attention: } K' = l_K \odot K, \quad V' = l_V \odot V$$
$$\text{FFN: } h' = l_{ff} \odot \text{FFN}(h)$$

where $l_K, l_V \in \mathbb{R}^{d_k}$ and $l_{ff} \in \mathbb{R}^{d_{ff}}$ are learned scaling vectors, and $\odot$ denotes element-wise multiplication. The total number of trainable parameters is:

$$|\Delta\theta|_{\text{IA3}} = L \times (2 d_k + d_{ff})$$

For a model with $d_k = 64$, $d_{ff} = 11008$, and $L = 32$ layers, IA3 adds only $32 \times (128 + 11008) = 356,352$ trainable parameters---far fewer than even LoRA. IA3 works best for few-shot fine-tuning scenarios where the adaptation is relatively simple, such as adjusting the model's output style or domain vocabulary.

In practice, the choice of PEFT method depends on several factors:

If you need maximum quality with no inference overhead: Use LoRA. It offers near-full-fine-tuning quality and can be merged into the base model for zero-overhead inference. This is the default recommendation for most use cases.

If memory is extremely constrained: Use QLoRA (Section 24.6). It enables fine-tuning of large models on consumer hardware with minimal quality loss.

If you need to serve many task-specific variants: LoRA adapters are small (typically 10-50 MB for a 7B model) and can be swapped at serving time. Use a single base model with multiple LoRA adapters.

If you have very few training examples (10-100): Consider prompt tuning or IA3, which have fewer parameters and are less prone to overfitting on tiny datasets.

If you need to fine-tune frequently with minimal infrastructure: Prompt tuning requires the least engineering overhead---no model weight modifications, just learned embedding vectors.

The empirical ranking on most benchmarks is: full fine-tuning $\geq$ LoRA $>$ adapters $\approx$ prefix tuning $>$ prompt tuning $>$ IA3, though the gaps narrow significantly as model size increases. At the scale of 70B+ parameters, even prompt tuning approaches full fine-tuning performance.

Instruction tuning fine-tunes a language model on a diverse collection of tasks formatted as natural language instructions. The goal is to produce a model that can follow arbitrary instructions, including instructions for tasks not seen during training. The training data takes the form of an instruction, an optional input, and a target output (see the example under Section 24.9.1).

Several standard formats are used for instruction tuning; the most common are the Alpaca format, the ShareGPT/conversation format, and the OpenAI chat format, illustrated in Section 24.9.2.

Diversity. The dataset should cover a wide range of tasks: summarization, classification, extraction, generation, reasoning, coding, math, creative writing, and more. Diversity in task type, complexity, and domain is key to generalization.

Quality. Each example should represent a high-quality response: accurate, well-structured, appropriate in length, and helpful. Low-quality responses teach the model bad habits.

Complexity distribution. Include examples ranging from simple (one-line answers) to complex (multi-step reasoning, long-form generation). This teaches the model to calibrate response complexity to the question.

Instruction variation. Express the same task in multiple ways to improve robustness: "Summarize this," "Give me the key points," "Write a brief overview of," etc.

Human annotation. The gold standard: human experts write instructions and responses. Expensive but highest quality. Used in InstructGPT (Ouyang et al., 2022) and Dolly.

Self-instruct. Use a strong LLM to generate instruction-response pairs, then filter for quality. Introduced by Wang et al. (2023) and used to create the Alpaca dataset.

Evol-instruct. Start with simple instructions and evolve them into more complex versions using an LLM.
Used to create the WizardLM dataset.

Distillation from stronger models. Generate training data using a stronger model (e.g., GPT-4) and use it to fine-tune a weaker model. Legal and ethical considerations apply---check the API terms of service.

Beyond content quality, several practical data preparation steps significantly affect fine-tuning outcomes:

Tokenization alignment. Ensure your training data uses the same tokenizer and chat template as the base model. A mismatch---for example, training a Llama model on data formatted with ChatML tags---can cause the model to learn to produce the wrong format tokens, leading to degraded instruction following.

Length distribution. Analyze the token length distribution of your training data. If most examples are short (under 100 tokens) but the task requires generating long responses, the model may learn a bias toward brevity. Include long-form examples to calibrate the model's response length. Conversely, if many examples are unnecessarily verbose, the model will learn to be verbose.

Decontamination. Remove any examples that overlap with benchmark datasets you plan to use for evaluation. Benchmark contamination inflates evaluation scores and masks true performance. Use n-gram overlap detection to identify potential contamination.

Balanced representation. If your dataset covers multiple task types, ensure no single task dominates. A dataset that is 90% classification and 10% generation will produce a model biased toward short, label-like responses. Use stratified sampling or upsampling of minority tasks to balance representation.

Data formatting consistency. Inconsistent formatting---sometimes using "Answer:", sometimes "Response:", sometimes no prefix---confuses the model. Choose a single format and apply it uniformly across the entire dataset.

Not all generated data is suitable for training; filter out low-quality, duplicated, or contaminated examples before use.

TRL (Transformer Reinforcement Learning) is HuggingFace's library for training language models, covering supervised fine-tuning (SFT), reward modeling, and reinforcement learning from human feedback (RLHF).

The supervised fine-tuning pipeline consists of preparing and formatting the dataset, configuring the model (optionally with quantization and LoRA), setting the training hyperparameters, and running the trainer (see the end-to-end example in Section 24.10).

Choosing the right hyperparameters is crucial for successful fine-tuning. Here is a systematic approach:

Learning rate. Start with the values in the table in Section 24.10.3 and adjust based on the training loss curve. If the loss decreases very slowly, the learning rate is too low. If the loss is noisy or spikes, the learning rate is too high. A reliable starting point: use $2 \times 10^{-4}$ for LoRA and $2 \times 10^{-5}$ for full fine-tuning.

LoRA rank selection. The rank $r$ should match the complexity of the adaptation. A practical approach is to start at $r = 16$ and increase the rank only if the model underfits the task. The relationship between rank and capacity can be understood through the singular value decomposition. If the true weight update $\Delta W$ has singular values $\sigma_1 \geq \sigma_2 \geq \ldots$, the optimal rank-$r$ approximation captures the top $r$ singular values. The approximation error is:

$$\|\Delta W - \Delta W_r\|_F^2 = \sum_{i=r+1}^{\min(d,k)} \sigma_i^2$$

If the singular values decay rapidly (as Aghajanyan et al. showed they do for fine-tuning updates), even a small rank captures most of the update's "signal."

Number of epochs. For instruction tuning with 10K-50K examples, 2-3 epochs is typical. For smaller datasets (1K-5K), 3-5 epochs may be needed. Always monitor the validation loss and stop when it begins to increase, as discussed in the general machine learning principles of Part II.
For instruction tuning, it is common practice to only compute the loss on the assistant's response tokens, not the instruction tokens. This prevents the model from wasting capacity learning to reproduce the instructions:

$$\mathcal{L} = -\sum_{t \in \text{response}} \log p_\theta(y_t \mid y_{<t}, x)$$

TRL's `SFTTrainer` supports this through response-template-based masking or dataset formatting.

Modern models use specific chat templates to format conversations. Different models use different templates (ChatML, Llama, Mistral, etc.), and using the wrong template can severely degrade performance.

After training, the LoRA adapter can be merged back into the base model for deployment. The merged model is identical in architecture to the original---no additional inference overhead, no adapter loading logic, and full compatibility with existing serving infrastructure. This merge-and-deploy pattern is one of the key practical advantages of LoRA over adapter-based methods.

Fine-tuned models should be evaluated on multiple dimensions:

Task performance. How well does the model perform on the target task? Use task-specific metrics: accuracy for classification, ROUGE/BERTScore for summarization, pass@k for code generation.

General capability retention. Has the model retained its general capabilities? Evaluate on standard benchmarks (MMLU, HellaSwag, ARC) to check for catastrophic forgetting.

Instruction following. Does the model correctly follow diverse instructions? Evaluate on instruction-following benchmarks like IFEval or MT-Bench.

Safety. Has the model maintained its safety guardrails? Test with standard safety benchmarks and red-teaming.

A comprehensive evaluation strategy includes all four dimensions, measured before and after fine-tuning. Several pitfalls are common:

Overfitting to the training format. The model may only respond well to inputs that match the training format exactly. Test with varied input formats.

Benchmark contamination. If benchmark data leaked into the training set, results are inflated. Use held-out or freshly-created evaluation sets.

Cherry-picking examples. A few impressive examples do not constitute evaluation. Use systematic, quantitative evaluation on representative test sets.

Ignoring base model regression. A model that excels at the target task but has lost general capabilities may be less useful overall.

Section 24.11.4 presents a framework for evaluating fine-tuned models that covers both task-specific and general capabilities. The evaluator provides a structured way to track both task performance and general capability retention during and after fine-tuning. By comparing perplexity on general text before and after fine-tuning, you can quantify the degree of catastrophic forgetting, as discussed in Section 24.12.

Catastrophic forgetting occurs when fine-tuning causes the model to lose capabilities acquired during pre-training. The model's parameters shift to optimize for the fine-tuning task at the expense of previously learned knowledge.
Formally, if $\theta_0$ are the pre-trained parameters and $\theta^*$ are the fine-tuned parameters, catastrophic forgetting occurs when:

$$\mathcal{L}_{\text{pre-train}}(\theta^*) \gg \mathcal{L}_{\text{pre-train}}(\theta_0)$$

That is, the fine-tuned model performs significantly worse on the pre-training distribution.

Lower learning rate. Use a learning rate 10--100x smaller than the pre-training rate. This limits the magnitude of weight updates.

Few training epochs. Fine-tune for 1-3 epochs rather than training until convergence. Early stopping based on validation loss is essential.

PEFT methods. LoRA, adapters, and prefix tuning inherently limit catastrophic forgetting because they freeze most parameters. The pre-trained weights remain unchanged, and only the small adaptation parameters are trained.

Data mixing. Include a fraction of general-purpose data alongside the task-specific data:

$$\mathcal{D}_{\text{train}} = \alpha \cdot \mathcal{D}_{\text{task}} + (1-\alpha) \cdot \mathcal{D}_{\text{general}}$$

where $\alpha \in [0.5, 0.9]$ controls the mix ratio. This approach, used in the Orca and Tulu papers, helps maintain general capabilities.

Elastic Weight Consolidation (EWC). Add a penalty that discourages large changes to parameters important for pre-training:

$$\mathcal{L}_{\text{EWC}} = \mathcal{L}_{\text{task}} + \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta_{0,i})^2$$

where $F_i$ is the Fisher information matrix diagonal, approximating the importance of parameter $i$ for the pre-training task.

Progressive fine-tuning. Gradually unfreeze layers from top to bottom during training. Start by training only the top layers, then progressively unfreeze deeper layers. This allows lower layers (which encode more general features) to remain stable longer.

Regularization. Apply weight decay, dropout, and gradient clipping to prevent excessive parameter changes:

$$\theta \leftarrow \theta - \eta (\nabla_\theta \mathcal{L} + \lambda \theta)$$

To detect forgetting during fine-tuning, monitor performance on a diverse evaluation suite throughout training, not just at the end. Set up a monitoring pipeline that evaluates the model every $N$ steps on both the target task and a small set of general benchmarks.

If you observe more than a 5% degradation on general benchmarks while fine-tuning, this is a signal that the learning rate is too high, the dataset is too narrow, or the number of training steps is excessive. Switching to a PEFT method, reducing the learning rate, or mixing in general-purpose data are the most reliable remedies.

A practical rule of thumb: if you are fine-tuning with LoRA at rank 16 or below and the training dataset is at least 5,000 examples, catastrophic forgetting is rarely a problem. The frozen pre-trained weights act as an anchor that preserves the model's general capabilities. This is one of the strongest practical arguments for PEFT methods, beyond their memory savings.

Fine-tuning on multiple tasks simultaneously can improve generalization and reduce forgetting. The training data combines examples from all tasks:

$$\mathcal{L}_{\text{multi}} = \sum_{t=1}^T w_t \cdot \mathcal{L}_t(\theta)$$

where:
- $T$ is the number of tasks
- $w_t$ is the weight for task $t$
- $\mathcal{L}_t(\theta)$ is the loss on task $t$

Careful weight selection is important---rare or difficult tasks may need higher weights to avoid being dominated by common or easy tasks. Several strategies exist for setting task weights:

Proportional weighting. Set $w_t$ proportional to the dataset size of task $t$. This is the simplest approach and is equivalent to uniform random sampling from the combined dataset.

Inverse proportional weighting. Set $w_t$ inversely proportional to dataset size, giving rare tasks more influence. This helps if the primary goal is balanced performance across all tasks.

Temperature-based sampling. Sample from task $t$ with probability proportional to $|\mathcal{D}_t|^{1/T_{\text{mix}}}$, where $T_{\text{mix}}$ is a temperature parameter. When $T_{\text{mix}} = 1$, this is proportional sampling; as $T_{\text{mix}} \to \infty$, it approaches uniform sampling across tasks.

Dynamic weighting. Adjust weights during training based on each task's validation loss. Tasks that are underperforming receive higher weights. This approach requires maintaining separate validation sets for each task.

The Tulu 2 (Ivison et al., 2024) and FLAN (Longpre et al., 2023) papers provide extensive ablation studies on multi-task mixing strategies and demonstrate that thoughtful task mixing leads to models that outperform single-task fine-tuning on almost all individual tasks. The intuition is that multi-task learning provides a regularization effect: the model must learn representations that are useful across tasks, preventing overfitting to the idiosyncrasies of any single task.

In practice, fine-tuning is not a one-time event. As new data arrives or requirements change, the model needs to be updated. Continual fine-tuning presents additional challenges, and several strategies exist for addressing them.

Multiple LoRA adapters trained for different tasks can be merged:

Linear interpolation:
$$W_{\text{merged}} = W_0 + \sum_{i=1}^n \alpha_i \Delta W_i$$

TIES merging: Trim small values, resolve sign conflicts, then merge.

DARE: Drop parameters randomly, rescale the remaining ones, and merge:

$$\Delta W_i' = \frac{1}{1-p} \cdot m_i \odot \Delta W_i$$

where $m_i$ is a binary mask with drop probability $p$ and the factor $\frac{1}{1-p}$ rescales the surviving parameters to maintain the expected magnitude. DARE was motivated by the observation that most individual LoRA parameters contribute little to the final performance---they can be dropped randomly with minimal impact, and the rescaling ensures the overall contribution is preserved. This sparsification step also reduces interference between tasks during merging, because fewer parameters conflict.

Section 24.13.4 gives a practical example of merging two LoRA adapters using linear interpolation. The optimal mixing weight $\alpha$ can be found by evaluating the merged model on a validation set that covers both tasks. In practice, values between 0.3 and 0.7 work well, with the exact optimum depending on the similarity of the tasks.

Model merging is particularly powerful because it requires no additional training---just arithmetic on the adapter parameters. This makes it possible to combine capabilities from multiple specialized fine-tunes into a single model that performs well across all tasks, provided the tasks do not conflict significantly in their parameter requirements.

Model merging does not always work. It tends to fail when the task-specific updates conflict or interfere with one another. In such cases, multi-task fine-tuning (training a single adapter on data from all tasks simultaneously) typically outperforms post-hoc merging. TIES merging and DARE partially address these issues by resolving sign conflicts and reducing interference, but they are not a complete solution. The choice between merging and multi-task training depends on whether you have access to all task data simultaneously (favoring multi-task training) or only to the trained adapters (requiring merging).

Fine-tuning large language models enables deep customization that prompting alone cannot achieve. In this chapter, we covered the decision between prompting and fine-tuning, full fine-tuning and its memory costs, the parameter-efficient methods LoRA, QLoRA, adapters, prefix and prompt tuning, and IA3, instruction tuning and dataset creation, supervised fine-tuning with HuggingFace TRL, evaluation of fine-tuned models, and catastrophic forgetting and its mitigations.

The field of fine-tuning is evolving rapidly. New PEFT methods continue to emerge, training recipes are being refined, and the interplay between fine-tuning and alignment (Chapter 25) is becoming better understood. The fundamental insight, however, remains: pre-trained language models contain vast latent capability, and fine-tuning is the key that unlocks task-specific performance from that general foundation.

The next chapter explores what happens after supervised fine-tuning: aligning language models with human preferences through RLHF and DPO.

24.3.2 Training Considerations
24.3.3 Memory Requirements
| Component | Memory | Example (7B model, fp32) |
|---|---|---|
| Model parameters | $4P$ bytes (fp32) | 28 GB |
| Gradients | $4P$ bytes | 28 GB |
| Optimizer states (Adam) | $8P$ bytes | 56 GB |
| Activations | Variable | 10--50 GB |
| Total | ~$16P$+ bytes | ~120+ GB |
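The totals in this table follow from simple arithmetic on the parameter count. The sketch below reproduces them (in decimal GB, matching the table); activation memory is left out because it depends on batch size and sequence length.

```python
def full_finetune_memory_gb(
    num_params: float,
    bytes_weights: int = 4,   # fp32 weights
    bytes_grads: int = 4,     # fp32 gradients
    bytes_optim: int = 8,     # Adam first and second moments
) -> dict:
    """Rough full fine-tuning memory estimate in decimal GB, excluding activations."""
    gb = 1e9
    parts = {
        "weights": num_params * bytes_weights / gb,
        "gradients": num_params * bytes_grads / gb,
        "optimizer": num_params * bytes_optim / gb,
    }
    parts["total"] = sum(parts.values())
    return parts

print(full_finetune_memory_gb(7e9))
# {'weights': 28.0, 'gradients': 28.0, 'optimizer': 56.0, 'total': 112.0}
# Adding 10-50 GB of activations pushes the total past 120 GB.
```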
24.3.4 Techniques for Reducing Memory
24.3.5 Limitations of Full Fine-Tuning
24.3.6 Full Fine-Tuning Implementation
import torch
from torch.utils.data import DataLoader
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    get_cosine_schedule_with_warmup,
)
def full_finetune(
model_name: str,
train_dataset,
num_epochs: int = 3,
learning_rate: float = 2e-5,
batch_size: int = 4,
gradient_accumulation_steps: int = 8,
max_grad_norm: float = 1.0,
warmup_steps: int = 100,
) -> AutoModelForCausalLM:
"""Full fine-tuning of a causal language model.
Args:
model_name: HuggingFace model identifier.
train_dataset: Training dataset yielding input_ids and labels.
num_epochs: Number of training epochs.
learning_rate: Peak learning rate.
batch_size: Per-device batch size.
gradient_accumulation_steps: Steps to accumulate before update.
max_grad_norm: Maximum gradient norm for clipping.
warmup_steps: Linear warmup steps.
Returns:
The fine-tuned model.
"""
model = AutoModelForCausalLM.from_pretrained(
model_name, torch_dtype=torch.bfloat16
).cuda()
optimizer = torch.optim.AdamW(
model.parameters(), lr=learning_rate, weight_decay=0.01
)
dataloader = DataLoader(
train_dataset, batch_size=batch_size, shuffle=True
)
    # One optimizer/scheduler step per `gradient_accumulation_steps` micro-batches
    total_steps = (len(dataloader) // gradient_accumulation_steps) * num_epochs
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=warmup_steps,
        num_training_steps=total_steps,
    )
model.train()
step = 0
for epoch in range(num_epochs):
for batch_idx, batch in enumerate(dataloader):
input_ids = batch["input_ids"].cuda()
labels = batch["labels"].cuda()
outputs = model(input_ids=input_ids, labels=labels)
loss = outputs.loss / gradient_accumulation_steps
loss.backward()
if (batch_idx + 1) % gradient_accumulation_steps == 0:
torch.nn.utils.clip_grad_norm_(
model.parameters(), max_grad_norm
)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
step += 1
if step % 50 == 0:
print(
f"Step {step}, Loss: {loss.item() * gradient_accumulation_steps:.4f}, "
f"LR: {scheduler.get_last_lr()[0]:.2e}"
)
return model
24.4 Parameter-Efficient Fine-Tuning (PEFT)
24.4.1 The PEFT Philosophy
24.4.2 Taxonomy of PEFT Methods
24.5 LoRA: Low-Rank Adaptation
24.5.1 Motivation
24.5.2 Mathematical Formulation
24.5.3 Initialization
24.5.4 Scaling Factor
24.5.5 Which Layers to Apply LoRA
24.5.6 Rank Selection
24.5.7 LoRA Implementation from Scratch
import torch
import torch.nn as nn
import math
class LoRALinear(nn.Module):
"""A linear layer augmented with LoRA adaptation.
Implements the formula: h = W_0 x + (alpha/r) * B A x
Args:
in_features: Input dimension (k).
out_features: Output dimension (d).
rank: LoRA rank (r).
alpha: LoRA scaling factor.
"""
def __init__(
self,
in_features: int,
out_features: int,
rank: int = 16,
alpha: float = 32.0,
) -> None:
super().__init__()
self.rank = rank
self.scaling = alpha / rank
# Frozen pre-trained weight (would be loaded from checkpoint)
self.weight = nn.Parameter(
torch.randn(out_features, in_features), requires_grad=False
)
# LoRA matrices
self.lora_A = nn.Parameter(torch.randn(rank, in_features))
self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
# Initialize A with Kaiming uniform
nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
# B is already initialized to zero
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Forward pass with LoRA.
Args:
x: Input tensor of shape (..., in_features).
Returns:
Output tensor of shape (..., out_features).
"""
# Original linear transformation (frozen)
h = x @ self.weight.T
# LoRA adaptation (trainable)
h = h + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
return h
def merge_weights(self) -> None:
"""Merge LoRA weights into the base weight matrix.
After merging, the LoRA path is no longer needed and
forward becomes a simple matrix multiplication.
"""
self.weight.data += self.scaling * (self.lora_B @ self.lora_A)
# Optionally delete LoRA parameters to free memory
A key implementation detail: the pre-trained weight is stored with `requires_grad=False`, so no gradients are computed or stored for the $d \times k$ matrix---only for the much smaller $A$ and $B$ matrices.
24.5.8 Merging LoRA Weights
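As a quick sanity check that merging is exact, the sketch below folds a low-rank update into a frozen `nn.Linear` and verifies that the merged layer reproduces the two-path forward pass. The dimensions and the random $A$, $B$ matrices stand in for trained LoRA factors, and the $\alpha/r$ scaling convention matches the `LoRALinear` class above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, k, r, alpha = 32, 16, 4, 8.0
scaling = alpha / r

base = nn.Linear(k, d, bias=False)    # stands in for the frozen pre-trained W_0
A = torch.randn(r, k) * 0.01          # trained LoRA A
B = torch.randn(d, r) * 0.01          # trained LoRA B

x = torch.randn(5, k)
h_two_path = base(x) + (x @ A.T @ B.T) * scaling  # W_0 x + (alpha/r) B A x

# Merge: W <- W_0 + (alpha/r) B A, then run a plain linear forward pass
with torch.no_grad():
    base.weight += scaling * (B @ A)
h_merged = base(x)

print(torch.allclose(h_two_path, h_merged, atol=1e-5))  # True
```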
24.6 QLoRA: Quantized LoRA
24.6.1 Motivation
24.6.2 Key Innovations
24.6.3 Memory Savings
| Method | GPU Memory |
|---|---|
| Full fine-tuning (fp32) | ~120 GB |
| Full fine-tuning (fp16) | ~60 GB |
| LoRA (fp16 base) | ~16 GB |
| QLoRA (NF4 base) | ~6 GB |
24.6.4 How NF4 Quantization Works
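The quantile construction described in this section is easy to reproduce. The sketch below computes 16 levels from the formula above and uses them to round a toy weight vector. Note that the production NF4 codebook in bitsandbytes is constructed slightly differently (it is symmetric and includes an exact zero), so this is an illustration of the idea rather than the exact data type.

```python
import torch
from torch.distributions import Normal

normal = Normal(0.0, 1.0)

# Equal-probability-mass levels: q_i = Phi^{-1}((i + 0.5) / 16)
probs = (torch.arange(16) + 0.5) / 16
levels = normal.icdf(probs)
levels = levels / levels.abs().max()  # normalize to [-1, 1]; quantized weights
                                      # are stored relative to a per-block scale
print(levels)

def quantize(w: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
    """Map each (pre-scaled) weight to its nearest quantization level."""
    idx = (w.unsqueeze(-1) - levels).abs().argmin(dim=-1)
    return levels[idx]

w = torch.randn(8)
print(quantize(w / w.abs().max(), levels))
```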
24.6.5 Performance
24.7 Adapter Layers
24.7.1 Architecture
24.7.2 Placement
24.7.3 Comparison with LoRA
| Aspect | LoRA | Adapters |
|---|---|---|
| Architecture modification | None (merged at inference) | Adds new layers |
| Inference latency | Zero overhead (after merging) | Small overhead |
| Parameter count | $2 \times r \times d$ per layer | $2 \times m \times d + m + d$ per adapter |
| Training stability | Very stable | Stable |
| Multi-task serving | Swap LoRA weights | Swap adapter modules |
24.7.4 Adapter Implementation
import torch
import torch.nn as nn
class AdapterLayer(nn.Module):
"""A bottleneck adapter layer.
Projects from d dimensions down to m, applies a nonlinearity,
and projects back up to d, with a residual connection.
Args:
d_model: Model hidden dimension.
bottleneck: Adapter bottleneck dimension.
"""
def __init__(self, d_model: int, bottleneck: int = 64) -> None:
super().__init__()
self.down = nn.Linear(d_model, bottleneck)
self.act = nn.GELU()
self.up = nn.Linear(bottleneck, d_model)
# Initialize up-projection to near-zero
nn.init.zeros_(self.up.weight)
nn.init.zeros_(self.up.bias)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Forward pass with residual connection."""
return x + self.up(self.act(self.down(x)))
24.8 Prefix Tuning and Prompt Tuning
24.8.1 Prefix Tuning
24.8.2 Prompt Tuning
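To make the soft-prompt idea concrete, here is a minimal sketch of a prompt-tuning module: a small matrix of trainable embeddings is prepended to the frozen input embeddings before they enter the Transformer. The class name, initialization scale, and dimensions are illustrative rather than taken from any particular library.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable soft-prompt embeddings prepended to the input embeddings."""

    def __init__(self, prompt_length: int, d_model: int) -> None:
        super().__init__()
        # p trainable vectors e_1, ..., e_p in R^d
        self.prompt = nn.Parameter(torch.randn(prompt_length, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        """Prepend the soft prompt to (batch, seq_len, d_model) embeddings."""
        batch_size = input_embeds.shape[0]
        prompt = self.prompt.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# Only soft_prompt.parameters() are passed to the optimizer; the base model,
# including its embedding table, stays frozen.
soft_prompt = SoftPrompt(prompt_length=20, d_model=768)
dummy_embeds = torch.randn(2, 10, 768)
print(soft_prompt(dummy_embeds).shape)  # torch.Size([2, 30, 768])
```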
24.8.3 IA3: Infused Adapter by Inhibiting and Amplifying Inner Activations
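The scaling vectors themselves are equally simple to write down. The sketch below shows IA3-style element-wise scaling for the keys, values, and FFN activation of a single layer; the surrounding attention wiring is omitted and the class name is illustrative.

```python
import torch
import torch.nn as nn

class IA3Scaling(nn.Module):
    """IA3-style learned scaling vectors for K, V, and the FFN activation of one layer."""

    def __init__(self, d_k: int, d_ff: int) -> None:
        super().__init__()
        # Initialized to ones so the model starts from an identity rescaling
        self.l_k = nn.Parameter(torch.ones(d_k))
        self.l_v = nn.Parameter(torch.ones(d_k))
        self.l_ff = nn.Parameter(torch.ones(d_ff))

    def scale_kv(self, k: torch.Tensor, v: torch.Tensor):
        # K' = l_K * K and V' = l_V * V (element-wise over the last dimension)
        return k * self.l_k, v * self.l_v

    def scale_ffn(self, ffn_hidden: torch.Tensor) -> torch.Tensor:
        # h' = l_ff * FFN(h) (element-wise)
        return ffn_hidden * self.l_ff

ia3 = IA3Scaling(d_k=64, d_ff=11008)
print(sum(p.numel() for p in ia3.parameters()))  # 2*64 + 11008 = 11136 per layer
```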
24.8.4 Comparison
| Method | Parameters per task | Where added | Performance | Inference overhead |
|---|---|---|---|---|
| Full fine-tuning | All | Everywhere | Best | None |
| LoRA | $2rLd'$ | Attention weights | Near-best | None (after merging) |
| Prefix tuning | $2Lpd$ | Attention KV | Good | Small (extra KV) |
| Prompt tuning | $pd$ | Input only | Good (large models) | Minimal (extra tokens) |
| Adapter layers | $2Lmd$ | After sublayers | Good | Small (extra layers) |
| IA3 | $L(2d_k + d_{ff})$ | Scaling vectors | Moderate | Negligible |
24.8.5 How to Choose a PEFT Method
24.9 Instruction Tuning and Dataset Creation
24.9.1 What Is Instruction Tuning?
Instruction: Summarize the following article in three bullet points.
Input: [article text]
Output: [three bullet points]
24.9.2 Dataset Formats
{
"instruction": "Classify the sentiment of the following review.",
"input": "The food was delicious and the service was excellent.",
"output": "Positive"
}
{
"conversations": [
{"from": "human", "value": "What causes seasons on Earth?"},
{"from": "assistant", "value": "Seasons are caused by..."}
]
}
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain photosynthesis."},
{"role": "assistant", "content": "Photosynthesis is..."}
]
}
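Because a training pipeline usually expects one specific layout, it is common to convert between these formats. The sketch below converts an Alpaca-style record into the chat-messages format shown above; the helper name and default system prompt are illustrative.

```python
def alpaca_to_messages(
    example: dict,
    system_prompt: str = "You are a helpful assistant.",
) -> dict:
    """Convert an Alpaca-format example into a chat-messages record."""
    if example.get("input"):
        user_content = f"{example['instruction']}\n\n{example['input']}"
    else:
        user_content = example["instruction"]
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": example["output"]},
        ]
    }

record = {
    "instruction": "Classify the sentiment of the following review.",
    "input": "The food was delicious and the service was excellent.",
    "output": "Positive",
}
print(alpaca_to_messages(record))
```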
24.9.3 Creating High-Quality Instruction Datasets
24.9.4 Data Generation Strategies
24.9.5 Data Preparation Best Practices
24.9.6 Data Quality Filtering
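A typical filtering pass checks at least response length, exact duplicates, and n-gram overlap with evaluation data. The sketch below implements those three checks; the thresholds, field names, and 8-gram window are illustrative choices, not prescribed values.

```python
import hashlib

def ngrams(text: str, n: int = 8) -> set:
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def filter_examples(
    examples: list[dict],
    eval_texts: list[str],
    min_chars: int = 20,
    max_chars: int = 8000,
) -> list[dict]:
    """Keep examples that pass simple length, duplicate, and contamination checks."""
    eval_grams = set().union(*(ngrams(t) for t in eval_texts)) if eval_texts else set()
    seen_hashes, kept = set(), []
    for ex in examples:
        text = ex["instruction"] + "\n" + ex["output"]
        # 1. Length filter: drop trivially short or extremely long responses
        if not (min_chars <= len(ex["output"]) <= max_chars):
            continue
        # 2. Exact-duplicate filter
        digest = hashlib.md5(text.encode()).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        # 3. Decontamination: drop examples sharing an n-gram with the eval set
        if ngrams(text) & eval_grams:
            continue
        kept.append(ex)
    return kept
```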
24.10 Supervised Fine-Tuning with HuggingFace TRL
24.10.1 The TRL Library
TRL provides high-level abstractions built on the `transformers` and `peft` libraries.
24.10.2 SFT Pipeline Overview
24.10.3 Key Hyperparameters for SFT
| Hyperparameter | Typical Range | Notes |
|---|---|---|
| Learning rate | 1e-5 to 5e-5 | Lower for larger models |
| Batch size | 4 to 32 | Use gradient accumulation |
| Epochs | 1 to 5 | Monitor for overfitting |
| Max sequence length | 512 to 4096 | Depends on task |
| Warmup ratio | 0.03 to 0.1 | Fraction of total steps |
| Weight decay | 0.0 to 0.1 | Mild regularization |
| LoRA rank | 8 to 64 | Higher for complex tasks |
| LoRA alpha | 16 to 128 | Often set to 2 * rank |
| LoRA dropout | 0.0 to 0.1 | Mild regularization |
24.10.4 Hyperparameter Selection Guide
Sequence packing. When training examples vary in length, packing multiple short examples into a single sequence (up to `max_seq_length`) improves GPU utilization. TRL's `SFTTrainer` supports this with the `packing=True` option. Without packing, short examples waste compute on padding tokens.
24.10.5 Training Loss Masking
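Response-only loss masking can be implemented by setting the label of every prompt token to -100, which PyTorch's cross-entropy loss ignores. A minimal sketch for a single prompt/response pair is shown below; the helper name is illustrative, and TRL's response-template masking achieves the same effect without building labels by hand.

```python
def build_masked_labels(tokenizer, prompt: str, response: str) -> dict:
    """Tokenize prompt + response and mask the prompt tokens out of the loss."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    input_ids = prompt_ids + response_ids + [tokenizer.eos_token_id]
    # -100 is ignored by torch.nn.CrossEntropyLoss, so only the response
    # tokens (and the EOS token) contribute to the loss.
    labels = [-100] * len(prompt_ids) + response_ids + [tokenizer.eos_token_id]
    return {"input_ids": input_ids, "labels": labels}
```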
24.10.6 Chat Templates
The `tokenizer.apply_chat_template()` method handles this:
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"},
{"role": "assistant", "content": "Machine learning is..."}
]
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
24.10.7 Complete PEFT Fine-Tuning Example
Here is a complete, end-to-end example of fine-tuning a model with LoRA using the `peft` and `trl` libraries. This example demonstrates QLoRA fine-tuning: the base model is loaded in 4-bit precision (NF4 quantization), LoRA adapters are applied to all linear layers, and training uses 8-bit paged Adam. On a single GPU with 24GB of memory, this configuration can fine-tune models up to approximately 7B parameters.
import torch
from datasets import load_dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
# 1. Load model with 4-bit quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
model_name = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# 2. Prepare model for quantized training
model = prepare_model_for_kbit_training(model)
# 3. Configure LoRA
lora_config = LoraConfig(
r=16, # Rank
lora_alpha=32, # Scaling factor (alpha/r = 2)
target_modules=[ # Apply to all attention matrices
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 13,631,488 || all params: 1,249,283,072
# || trainable%: 1.0911
# 4. Load and format dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train[:5000]")
def format_instruction(example):
"""Format an Alpaca example into a training prompt."""
if example["input"]:
text = (
f"### Instruction:\n{example['instruction']}\n\n"
f"### Input:\n{example['input']}\n\n"
f"### Response:\n{example['output']}"
)
else:
text = (
f"### Instruction:\n{example['instruction']}\n\n"
f"### Response:\n{example['output']}"
)
return {"text": text}
dataset = dataset.map(format_instruction)
# 5. Configure training
training_args = TrainingArguments(
output_dir="./lora-llama",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4, # Effective batch size = 16
learning_rate=2e-4,
warmup_ratio=0.03,
lr_scheduler_type="cosine",
logging_steps=25,
save_strategy="epoch",
bf16=True,
optim="paged_adamw_8bit",
max_grad_norm=0.3,
)
# 6. Train with SFTTrainer
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
args=training_args,
tokenizer=tokenizer,
dataset_text_field="text",
max_seq_length=512,
packing=True, # Pack multiple examples into one sequence
)
trainer.train()
# 7. Save the LoRA adapter (not the full model)
trainer.model.save_pretrained("./lora-llama/final")
24.10.8 Merging and Deploying LoRA Adapters
from peft import AutoPeftModelForCausalLM
# Load the base model + LoRA adapter
model = AutoPeftModelForCausalLM.from_pretrained(
"./lora-llama/final",
device_map="auto",
torch_dtype=torch.bfloat16,
)
# Merge LoRA weights into the base model
merged_model = model.merge_and_unload()
# Save the merged model for deployment
merged_model.save_pretrained("./llama-merged")
tokenizer.save_pretrained("./llama-merged")
24.11 Evaluation of Fine-Tuned Models
24.11.1 Evaluation Dimensions
24.11.2 Benchmarking Strategy
24.11.3 Common Pitfalls
24.11.4 Practical Evaluation Implementation
from typing import Callable
import numpy as np
class FineTuneEvaluator:
"""Comprehensive evaluator for fine-tuned language models.
Evaluates task performance, general capability retention,
and instruction following quality.
"""
def __init__(self, model, tokenizer, base_model=None):
self.model = model
self.tokenizer = tokenizer
self.base_model = base_model # For regression testing
def evaluate_task(
self,
test_data: list[dict],
metric_fn: Callable,
) -> dict:
"""Evaluate performance on the target task.
Args:
test_data: List of dicts with 'input' and 'expected'.
metric_fn: Function that computes a metric from
(predictions, references).
Returns:
Dict with metric scores.
"""
predictions = []
for example in test_data:
input_ids = self.tokenizer(
example["input"], return_tensors="pt"
).input_ids.to(self.model.device)
output = self.model.generate(
input_ids, max_new_tokens=256,
do_sample=False, # Greedy for reproducibility
)
pred = self.tokenizer.decode(
output[0][input_ids.shape[1]:],
skip_special_tokens=True,
)
predictions.append(pred)
references = [ex["expected"] for ex in test_data]
return metric_fn(predictions, references)
def evaluate_perplexity(self, texts: list[str]) -> float:
"""Compute perplexity on general text.
Lower perplexity on general text indicates better
retention of pre-trained capabilities.
"""
import torch
total_loss = 0.0
total_tokens = 0
for text in texts:
encodings = self.tokenizer(
text, return_tensors="pt", truncation=True,
max_length=512,
)
input_ids = encodings.input_ids.to(self.model.device)
with torch.no_grad():
outputs = self.model(
input_ids=input_ids, labels=input_ids
)
total_loss += outputs.loss.item() * input_ids.shape[1]
total_tokens += input_ids.shape[1]
avg_loss = total_loss / total_tokens
return np.exp(avg_loss)
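A usage sketch, assuming `model` and `tokenizer` hold the fine-tuned model and its tokenizer loaded earlier; the exact-match metric and the tiny test set are illustrative.

```python
def exact_match(predictions: list[str], references: list[str]) -> dict:
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return {"exact_match": correct / len(references)}

test_data = [
    {"input": "Classify the sentiment: 'Great service!'\nAnswer:", "expected": "Positive"},
]
general_texts = [
    "Photosynthesis converts light energy into chemical energy stored in glucose."
]

evaluator = FineTuneEvaluator(model, tokenizer)
print(evaluator.evaluate_task(test_data, exact_match))
print(evaluator.evaluate_perplexity(general_texts))
```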
24.12 Catastrophic Forgetting
24.12.1 The Problem
24.12.2 Causes
24.12.3 Mitigation Strategies
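The EWC penalty can be sketched directly from its formula: estimate a diagonal Fisher term on general text, snapshot the pre-trained parameters, and add the quadratic penalty to the task loss. The snippet below is a schematic sketch assuming a HuggingFace-style causal LM whose forward call accepts `input_ids` and `labels`; `theta_0` is assumed to be a snapshot of the pre-trained parameters taken before fine-tuning.

```python
import torch

def estimate_fisher_diagonal(model, general_batches) -> dict:
    """Approximate diag(F) by averaging squared gradients of the LM loss."""
    fisher = {
        name: torch.zeros_like(p)
        for name, p in model.named_parameters() if p.requires_grad
    }
    for batch in general_batches:          # each batch: a tensor of token ids
        model.zero_grad()
        loss = model(input_ids=batch, labels=batch).loss
        loss.backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                fisher[name] += p.grad.detach() ** 2
    return {name: f / max(len(general_batches), 1) for name, f in fisher.items()}

def ewc_penalty(model, theta_0: dict, fisher: dict, lam: float = 0.1) -> torch.Tensor:
    """(lambda / 2) * sum_i F_i * (theta_i - theta_{0,i})^2 over trainable parameters."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, p in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (p - theta_0[name]) ** 2).sum()
    return 0.5 * lam * penalty

# During fine-tuning: total_loss = task_loss + ewc_penalty(model, theta_0, fisher)
```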
24.12.4 Diagnosing Catastrophic Forgetting
24.13 Advanced Topics
24.13.1 Multi-Task Fine-Tuning
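The temperature-based sampling rule is easy to compute directly. A small sketch with illustrative dataset sizes:

```python
def task_sampling_probs(dataset_sizes: dict, temperature: float = 2.0) -> dict:
    """Sample task t with probability proportional to |D_t|^(1 / T_mix)."""
    weights = {task: size ** (1.0 / temperature) for task, size in dataset_sizes.items()}
    total = sum(weights.values())
    return {task: w / total for task, w in weights.items()}

sizes = {"summarization": 50_000, "classification": 5_000, "code": 500}
print(task_sampling_probs(sizes, temperature=1.0))  # proportional sampling
print(task_sampling_probs(sizes, temperature=5.0))  # closer to uniform across tasks
```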
24.13.2 Continual Fine-Tuning
24.13.3 Merging Fine-Tuned Models
24.13.4 Model Merging in Practice
from peft import PeftModel
import torch
# Load base model
from transformers import AutoModelForCausalLM
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
# Load first adapter (e.g., trained for coding tasks)
model_1 = PeftModel.from_pretrained(base, "coding-lora-adapter")
delta_1 = {}
for name, param in model_1.named_parameters():
if "lora" in name:
delta_1[name] = param.data.clone()
# Load second adapter (e.g., trained for math tasks) on a fresh copy of the base
# model, since PeftModel.from_pretrained injects adapter modules into the model
# object it is given
base_math = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
model_2 = PeftModel.from_pretrained(base_math, "math-lora-adapter")
delta_2 = {}
for name, param in model_2.named_parameters():
if "lora" in name:
delta_2[name] = param.data.clone()
# Merge with a weighted average of the LoRA tensors. Averaging lora_A and lora_B
# separately is a common heuristic; it only approximates averaging the full
# updates delta_W = B @ A.
alpha = 0.5  # Equal weighting
base_merged = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
merged = PeftModel.from_pretrained(base_merged, "coding-lora-adapter")
for name, param in merged.named_parameters():
if "lora" in name and name in delta_2:
param.data = alpha * delta_1[name] + (1 - alpha) * delta_2[name]
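The DARE drop-and-rescale step from Section 24.13.3 can be applied to these deltas before averaging. A minimal sketch, assuming `delta_1` is the dictionary of LoRA tensors built above; applying the mask to each LoRA factor independently is a simplification of dropping entries of the full update.

```python
import torch

def dare_sparsify(delta: dict, drop_prob: float = 0.9) -> dict:
    """Randomly drop entries of each tensor and rescale survivors by 1 / (1 - p)."""
    sparsified = {}
    for name, tensor in delta.items():
        mask = (torch.rand_like(tensor) > drop_prob).float()
        sparsified[name] = mask * tensor / (1.0 - drop_prob)
    return sparsified

delta_1_dare = dare_sparsify(delta_1, drop_prob=0.9)
```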
24.13.5 When Model Merging Fails
24.14 Summary
References