In This Chapter
- 24.1 Introduction: Beyond Prompting
- 24.2 When to Fine-Tune vs. When to Prompt
- 24.3 Full Fine-Tuning
- 24.4 Parameter-Efficient Fine-Tuning (PEFT)
- 24.5 LoRA: Low-Rank Adaptation
- 24.6 QLoRA: Quantized LoRA
- 24.7 Adapter Layers
- 24.8 Prefix Tuning and Prompt Tuning
- 24.9 Instruction Tuning and Dataset Creation
- 24.10 Supervised Fine-Tuning with HuggingFace TRL
- 24.11 Evaluation of Fine-Tuned Models
- 24.12 Catastrophic Forgetting
- 24.13 Advanced Topics
- 24.14 Summary
- References
Chapter 24: Fine-Tuning Large Language Models
Part IV: Attention, Transformers, and Language Models
24.1 Introduction: Beyond Prompting
Chapter 23 demonstrated that prompt engineering can accomplish a remarkable range of tasks. But prompting has fundamental limitations. When an application requires deep behavioral customization, consistent adherence to complex formatting rules, domain-specific expertise that the base model lacks, or performance that exceeds what few-shot demonstrations can achieve, fine-tuning becomes necessary.
Fine-tuning adapts a pre-trained language model to a specific task or domain by continuing the training process on a curated dataset. Unlike prompting, fine-tuning modifies the model's parameters, enabling it to internalize new behaviors, knowledge, and capabilities that persist across all future interactions.
This chapter provides a comprehensive treatment of modern fine-tuning techniques for large language models. We begin with the decision framework for when to fine-tune versus prompt, then cover full fine-tuning and its limitations. The bulk of the chapter focuses on parameter-efficient fine-tuning (PEFT) methods---LoRA, QLoRA, adapter layers, and prefix tuning---which make fine-tuning accessible even with limited hardware. We then cover instruction tuning, supervised fine-tuning with HuggingFace TRL, evaluation of fine-tuned models, and the critical problem of catastrophic forgetting.
What You Will Learn
By the end of this chapter, you will be able to:
- Determine when fine-tuning is appropriate versus prompting
- Implement full fine-tuning of a language model with PyTorch
- Explain the mathematical foundations of LoRA and apply it in practice
- Use QLoRA to fine-tune large models on consumer hardware
- Implement adapter layers and prefix tuning
- Create high-quality instruction-tuning datasets
- Perform supervised fine-tuning using the TRL library
- Evaluate fine-tuned models rigorously
- Diagnose and mitigate catastrophic forgetting
Prerequisites
This chapter assumes familiarity with:
- Transformer architecture and attention mechanisms (Chapter 20)
- Language model pre-training (Chapter 21)
- Transfer learning concepts (Chapter 22)
- Prompt engineering basics (Chapter 23)
- PyTorch training loops and optimizers
24.2 When to Fine-Tune vs. When to Prompt
24.2.1 The Decision Framework
The choice between prompting and fine-tuning involves multiple factors. Here is a systematic framework:
| Factor | Favor Prompting | Favor Fine-Tuning |
|---|---|---|
| Task complexity | Simple, well-defined tasks | Complex, nuanced behaviors |
| Data availability | Few or no labeled examples | Hundreds to thousands of examples |
| Latency requirements | Generous (long prompts OK) | Strict (short prompts needed) |
| Consistency needs | Moderate variability acceptable | High consistency required |
| Domain specialization | General knowledge sufficient | Deep domain expertise needed |
| Deployment constraints | API access only | Custom model hosting possible |
| Budget | Limited compute budget | Training budget available |
| Update frequency | Task changes frequently | Task is relatively stable |
24.2.2 Cost Analysis
Fine-tuning has an upfront training cost but reduces per-query costs by eliminating long prompts:
$$\text{Cost}_{\text{prompt}} = N_{\text{queries}} \times (C_{\text{input}} \times L_{\text{prompt}} + C_{\text{output}} \times L_{\text{response}})$$
$$\text{Cost}_{\text{finetune}} = C_{\text{train}} + N_{\text{queries}} \times (C_{\text{input}} \times L_{\text{short\_prompt}} + C_{\text{output}} \times L_{\text{response}})$$
Fine-tuning becomes cost-effective when:
$$N_{\text{queries}} > \frac{C_{\text{train}}}{C_{\text{input}} \times (L_{\text{prompt}} - L_{\text{short\_prompt}})}$$
For high-volume applications, the crossover point can be reached quickly, making fine-tuning more economical.
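To make the crossover concrete, the following sketch evaluates the break-even formula numerically; the training cost, token price, and prompt lengths are purely illustrative.

```python
def breakeven_queries(
    train_cost: float,        # one-time fine-tuning cost (dollars)
    input_price: float,       # dollars per input token
    long_prompt_tokens: int,  # prompt length required when relying on prompting
    short_prompt_tokens: int, # prompt length after fine-tuning
) -> float:
    """Number of queries at which fine-tuning becomes cheaper than prompting."""
    saved_per_query = input_price * (long_prompt_tokens - short_prompt_tokens)
    return train_cost / saved_per_query

# Illustrative numbers: a $500 training run, $2 per million input tokens,
# and a 3,000-token few-shot prompt replaced by a 200-token prompt.
n = breakeven_queries(500.0, 2e-6, 3000, 200)
print(f"Fine-tuning pays off after ~{n:,.0f} queries")  # roughly 89,000 queries
```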
24.2.3 The Hybrid Approach
In practice, the best approach is often hybrid:
- Start with prompting to validate the task is feasible
- Collect data from the prompting system (inputs, outputs, human corrections)
- Fine-tune a smaller model on the collected data
- Use the fine-tuned model for production, falling back to the prompted large model for edge cases
This approach combines rapid prototyping (prompting) with efficient production deployment (fine-tuning).
24.2.4 Practical Examples of the Decision
To make the framework concrete, consider these scenarios:
Scenario 1: Customer support classification. You need to classify incoming support tickets into 10 categories with 95%+ accuracy. You have 50,000 labeled examples. Decision: Fine-tune. The volume of labeled data, consistency requirements, and high accuracy threshold all favor fine-tuning. A LoRA fine-tuned model will be faster and cheaper per query than few-shot prompting.
Scenario 2: One-off research summarization. A researcher needs to summarize 50 academic papers. Decision: Prompt. The task is a one-time effort, the volume is low, and the format (summarization) is well within the capabilities of prompted models. Fine-tuning would be overkill.
Scenario 3: Medical report structuring. A hospital needs to extract structured information from radiology reports into a specific JSON schema, with domain-specific terminology. Decision: Fine-tune. The domain specialization, strict formatting requirements, and high-stakes nature of the task all favor fine-tuning. The model needs to internalize medical vocabulary and the specific JSON schema reliably.
Scenario 4: Internal chatbot. A company wants a chatbot that answers questions about internal policies, using a set of policy documents as context. Decision: RAG + prompting first, fine-tune later if needed. Retrieval-augmented generation (as we previewed in Chapter 23) handles the knowledge grounding, and prompting handles the interaction style. Fine-tune only if the prompting approach fails to meet quality requirements.
These examples illustrate that the decision is rarely binary---it depends on the specific combination of data availability, quality requirements, deployment constraints, and budget. The hybrid approach outlined in Section 24.2.3 is almost always the safest starting point.
24.3 Full Fine-Tuning
24.3.1 Definition
Full fine-tuning updates all parameters of the pre-trained model on the downstream dataset. Given a pre-trained model with parameters $\theta_0$, we optimize:
$$\theta^* = \arg\min_\theta \; \mathcal{L}(\theta; \mathcal{D}_{\text{finetune}})$$
where $\mathcal{D}_{\text{finetune}}$ is the fine-tuning dataset and $\mathcal{L}$ is typically the cross-entropy loss for language modeling:
$$\mathcal{L}(\theta) = -\frac{1}{|\mathcal{D}|} \sum_{(x,y) \in \mathcal{D}} \sum_{t=1}^{|y|} \log p_\theta(y_t \mid x, y_{ Learning rate. Fine-tuning uses a much smaller learning rate than pre-training---typically $10^{-5}$ to $5 \times 10^{-5}$, compared to $10^{-4}$ to $3 \times 10^{-4}$ for pre-training. This prevents catastrophic forgetting by making small updates to the pre-trained weights. Learning rate schedule. A linear warmup followed by cosine decay is standard: $$\eta(t) = \begin{cases} \eta_{\max} \cdot \frac{t}{T_{\text{warmup}}} & \text{if } t < T_{\text{warmup}} \\ \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})(1 + \cos(\frac{t - T_{\text{warmup}}}{T - T_{\text{warmup}}} \pi)) & \text{otherwise} \end{cases}$$ Batch size. Larger batch sizes provide more stable gradients but require more memory. Gradient accumulation can simulate large batches: $$\theta \leftarrow \theta - \eta \cdot \frac{1}{K} \sum_{k=1}^{K} \nabla_\theta \mathcal{L}(\theta; B_k)$$ where $K$ is the number of accumulation steps and $B_k$ are micro-batches. Number of epochs. Fine-tuning typically requires 1--5 epochs. More epochs risk overfitting to the fine-tuning data, especially with small datasets. Full fine-tuning of a model with $P$ parameters requires storing: For a 7-billion parameter model in float32, this exceeds the memory of most single GPUs. Mixed-precision training (fp16/bf16) approximately halves these requirements but still demands substantial hardware. Mixed-precision training. Store weights in fp32 but compute forward and backward passes in fp16/bf16: $$\text{Memory} \approx 4P + 2P + 2P + 8P = 16P \text{ bytes (with fp32 optimizer)}$$ Gradient checkpointing. Trade compute for memory by recomputing activations during the backward pass instead of storing them. Reduces activation memory by a factor of $\sqrt{L}$ (where $L$ is the number of layers) at the cost of ~33% more computation. DeepSpeed ZeRO. Partition optimizer states, gradients, and parameters across multiple GPUs. ZeRO Stage 3 distributes all three, enabling training of models that would not fit on a single GPU. FSDP (Fully Sharded Data Parallel). PyTorch's native implementation of model sharding, which distributes model parameters and optimizer states across GPUs. Despite its effectiveness, full fine-tuning has significant drawbacks: These limitations motivate parameter-efficient fine-tuning methods. For completeness, here is a minimal full fine-tuning implementation in PyTorch. While PEFT methods are preferred for most use cases, understanding full fine-tuning helps build intuition for why PEFT works: Note the key differences from pre-training: the learning rate is 10-100x lower ($2 \times 10^{-5}$ vs. $3 \times 10^{-4}$), gradient clipping is applied to prevent large updates, and the number of epochs is small (1-5). These choices collectively prevent the fine-tuning process from straying too far from the pre-trained initialization, as we saw in Section 24.3.2. Parameter-efficient fine-tuning modifies only a small subset of model parameters while keeping the majority frozen. This dramatically reduces memory requirements, storage costs, and the risk of catastrophic forgetting. Formally, PEFT methods decompose the fine-tuned parameters as: $$\theta_{\text{finetuned}} = \theta_0 + \Delta\theta$$ where $\theta_0$ are the frozen pre-trained parameters and $\Delta\theta$ is a small, trainable perturbation with $|\Delta\theta| \ll |\theta_0|$. 
The key insight is that the downstream task likely resides in a low-dimensional subspace of the full parameter space. We do not need to update all parameters to capture task-specific behavior.

To build intuition for this claim, consider an analogy. A pre-trained language model is like a Swiss Army knife---it has many tools (capabilities), but none are perfectly optimized for any specific task. Full fine-tuning rebuilds the entire knife from scratch, producing a specialized tool. PEFT methods, by contrast, make small adjustments---sharpening one blade, adjusting the angle of another---that specialize the tool for the task at hand while preserving its general utility.

The mathematical justification comes from the low intrinsic dimensionality result of Aghajanyan et al. (2021), who showed that the effective dimensionality of fine-tuning updates is often as low as a few hundred, even for models with billions of parameters. This means that the subspace of "useful" parameter changes is vanishingly small compared to the full parameter space, and PEFT methods exploit this structure directly.

PEFT methods can be categorized by where and how they introduce trainable parameters: addition-based, reparameterization-based, and selection-based methods. Each category has its own trade-offs. Addition-based methods are the most flexible but add parameters to the model. Reparameterization-based methods like LoRA are the most popular because they add no parameters at inference time (after merging). Selection-based methods are the simplest conceptually but offer the least control over what is adapted.

LoRA (Hu et al., 2022) is motivated by the observation that the weight updates during fine-tuning have low intrinsic rank. That is, the difference $\Delta W = W_{\text{finetuned}} - W_{\text{pretrained}}$ can be well-approximated by a low-rank matrix. Aghajanyan et al. (2021) showed that pre-trained models have a low intrinsic dimensionality---they can be fine-tuned effectively in a much lower-dimensional subspace than the full parameter space.

For a pre-trained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA constrains the weight update to be low-rank:

$$W = W_0 + \Delta W = W_0 + BA$$

where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with rank $r \ll \min(d, k)$.

The intuition is elegant: rather than learning a full $d \times k$ update matrix (which has $dk$ free parameters), we factorize it into two much smaller matrices that form a bottleneck of dimension $r$. The input first projects down to $r$ dimensions through $A$, then projects back up to $d$ dimensions through $B$. This is analogous to a very thin autoencoder applied to the weight update---the bottleneck forces the update to lie in a low-dimensional subspace.

The forward pass becomes:

$$h = W_0 x + \Delta W x = W_0 x + BAx$$

where:
- $W_0 x$ is the original pre-trained computation (frozen, no gradients computed)
- $BAx$ is the low-rank adaptation (trainable)
- $x \in \mathbb{R}^k$ is the input activation
- $h \in \mathbb{R}^d$ is the output activation

During training, $W_0$ is frozen and only $A$ and $B$ are updated. The number of trainable parameters per LoRA layer is:

$$|\Delta\theta|_{\text{LoRA}} = r \times (d + k)$$

compared to $d \times k$ for full fine-tuning. For typical values ($d = k = 4096$, $r = 16$), this is a reduction factor of $\frac{d \times k}{r \times (d + k)} = \frac{4096^2}{16 \times 8192} = 128\times$.

Worked example. Consider a Llama-7B model where the attention query projection has $d = k = 4096$. Full fine-tuning of this single weight matrix requires updating $4096 \times 4096 = 16,777,216$ parameters. With LoRA at rank $r = 16$, we instead train $16 \times (4096 + 4096) = 131,072$ parameters---a 128x reduction. Applied to all attention matrices (Q, K, V, O) across all 32 layers, the total LoRA parameters are $4 \times 32 \times 131,072 = 16,777,216$---compared to $4 \times 32 \times 16,777,216 = 2,147,483,648$ for full fine-tuning of those same matrices. LoRA trains less than 1% of the full parameter count while achieving comparable task performance.

LoRA uses a specific initialization scheme: $A$ is initialized randomly (typically Kaiming uniform) and $B$ is initialized to zero. This ensures that $\Delta W = BA = 0$ at the start of training, so the model begins from the exact pre-trained weights. This is critical for stable training.

LoRA introduces a scaling factor $\alpha$:

$$h = W_0 x + \frac{\alpha}{r} BAx$$

The ratio $\frac{\alpha}{r}$ controls the magnitude of the LoRA update relative to the pre-trained weights. A common practice is to set $\alpha = 2r$, making the effective scaling factor 2. When the rank $r$ is increased, the per-element contribution of the LoRA update is automatically reduced.

LoRA is typically applied to the attention weight matrices: $W_Q$, $W_K$, $W_V$, and $W_O$. Research suggests that adapting all of the attention projections (and often the feed-forward projections as well, as in the end-to-end example in Section 24.10) works better than concentrating the same parameter budget on one or two matrices at a higher rank.

The rank $r$ controls the expressiveness of the adaptation: small ranks (4--16) are usually sufficient for style and format adaptations, while larger ranks (32--64) help with more complex behavioral changes. The optimal rank depends on the task complexity and the distance between the pre-training and fine-tuning distributions.

To solidify understanding, Section 24.5.7 gives a minimal LoRA layer implementation.

After training, the LoRA weights can be merged into the base model:

$$W_{\text{merged}} = W_0 + \frac{\alpha}{r} BA$$

This eliminates the inference overhead of the separate LoRA path, resulting in the same architecture and latency as the original model. Multiple LoRA adaptations can be stored and swapped efficiently. A single 7B-parameter base model can serve dozens of specialized tasks by hot-swapping LoRA adapters (typically 10-50 MB each), rather than maintaining separate model deployments.

QLoRA (Dettmers et al., 2023) combines LoRA with aggressive quantization to enable fine-tuning of large models on consumer hardware. The key innovation is that the base model is stored in 4-bit precision while LoRA parameters are trained in higher precision.

4-bit NormalFloat (NF4). QLoRA introduces the NF4 data type, which is information-theoretically optimal for normally distributed weights. Pre-trained neural network weights are approximately normally distributed, so NF4 provides better representation than standard 4-bit integer quantization.

Double quantization. The quantization constants themselves are quantized (from fp32 to fp8), further reducing memory overhead:

$$\text{Memory per parameter} = 4\text{ bits} + \frac{32}{64} \text{ bits (first quant)} + \frac{8}{256} \text{ bits (double quant)} \approx 4.5 \text{ bits}$$

Paged optimizers.
QLoRA uses paged optimizers that automatically offload optimizer states from GPU to CPU memory when GPU memory is exhausted, then page them back when needed.

For a 7B parameter model, the resulting memory requirements are summarized in the table in Section 24.6.3. QLoRA makes it possible to fine-tune a 7B model on a single consumer GPU (e.g., RTX 3090 with 24GB) and a 65B model on a single 48GB GPU.

To understand why QLoRA works so well, it helps to understand the NF4 data type. Standard 4-bit integer quantization divides the range $[\min(w), \max(w)]$ into 16 equally-spaced bins. But neural network weights are approximately normally distributed, with most values near zero and few at the extremes. Uniform quantization wastes bins on the sparse tails.

NF4 instead places quantization bins at the quantiles of the standard normal distribution. The 16 quantization levels are chosen so that each bin contains an equal probability mass under the normal distribution. Formally, the NF4 levels are:

$$q_i = \Phi^{-1}\left(\frac{i + 0.5}{16}\right) \quad \text{for } i = 0, 1, \ldots, 15$$

where $\Phi^{-1}$ is the inverse CDF (quantile function) of the standard normal. This means bins near zero (where most weights cluster) are narrow, providing fine precision, while bins in the tails are wide, which is acceptable because few weights fall there. The result is an information-theoretically optimal quantization for normally distributed data, as demonstrated by Dettmers et al.

Remarkably, QLoRA matches the performance of full fine-tuning and 16-bit LoRA on most benchmarks, and the quantization of the base model introduces minimal degradation in practice. The practical impact is dramatic: a researcher with a single consumer GPU (24GB VRAM) can fine-tune a 7B parameter model that would otherwise require 120+ GB for full fine-tuning. This democratization of fine-tuning capability has been one of the most significant developments in making LLM customization accessible to a broad community.

Adapter layers (Houlsby et al., 2019) insert small, trainable bottleneck modules between the existing layers of the pre-trained model. Each adapter consists of a down-projection, a nonlinearity, and an up-projection, wrapped in a residual connection. The adapter computation is:

$$h = h + W_{\text{up}} \cdot \sigma(W_{\text{down}} \cdot h)$$

where $h$ is the hidden representation and $\sigma$ is the activation function.

Adapters are typically placed after the attention sublayer, after the feed-forward sublayer, or both. Houlsby et al. (2019) found that placing adapters after both sublayers performs best, but the serial adapter configuration (after the FFN only) achieves comparable performance with fewer parameters.

LoRA has become the dominant PEFT method primarily because it introduces zero inference latency after merging and is simpler to implement. For reference, Section 24.7.4 gives a minimal adapter layer implementation. The zero initialization of the up-projection ensures that at the start of training, the adapter contributes nothing (like LoRA's zero initialization of $B$), so the model begins from the exact pre-trained checkpoint. During training, only the adapter parameters are updated while the main Transformer layers remain frozen.

Prefix tuning (Li & Liang, 2021) prepends trainable continuous vectors (the "prefix") to the key and value matrices at each attention layer. The prefix is analogous to a learned "soft prompt" that operates in the activation space rather than the token space. Formally, at each layer $l$, the attention computation becomes:

$$\text{Attention}(Q, [P_K^l; K], [P_V^l; V])$$

where $P_K^l, P_V^l \in \mathbb{R}^{p \times d}$ are the trainable prefix matrices and $p$ is the prefix length.
The total number of trainable parameters is:

$$|\Delta\theta|_{\text{prefix}} = L \times 2 \times p \times d$$

where $L$ is the number of layers.

Prompt tuning (Lester et al., 2021) is a simplified version that only prepends trainable embeddings to the input, rather than to every layer:

$$\text{input} = [e_1, e_2, \ldots, e_p, x_1, x_2, \ldots, x_n]$$

where $e_i \in \mathbb{R}^d$ are the trainable "soft prompt" tokens. Prompt tuning uses far fewer parameters than prefix tuning ($p \times d$ vs. $L \times 2 \times p \times d$) but is generally less expressive. However, Lester et al. showed that as model size increases, prompt tuning approaches the performance of full fine-tuning.

IA3 (Liu et al., 2022) takes a radically minimal approach to parameter-efficient adaptation. Instead of adding new weight matrices, IA3 introduces learned vectors that scale the existing keys, values, and feed-forward activations element-wise:

$$\text{Attention: } K' = l_K \odot K, \quad V' = l_V \odot V$$
$$\text{FFN: } h' = l_{ff} \odot \text{FFN}(h)$$

where $l_K, l_V \in \mathbb{R}^{d_k}$ and $l_{ff} \in \mathbb{R}^{d_{ff}}$ are learned scaling vectors, and $\odot$ denotes element-wise multiplication. The total number of trainable parameters is:

$$|\Delta\theta|_{\text{IA3}} = L \times (2 d_k + d_{ff})$$

For a model with $d_k = 64$, $d_{ff} = 11008$, and $L = 32$ layers, IA3 adds only $32 \times (128 + 11008) = 356,352$ trainable parameters---far fewer than even LoRA. IA3 works best for few-shot fine-tuning scenarios where the adaptation is relatively simple, such as adjusting the model's output style or domain vocabulary.

In practice, the choice of PEFT method depends on several factors:

If you need maximum quality with no inference overhead: Use LoRA. It offers near-full-fine-tuning quality and can be merged into the base model for zero-overhead inference. This is the default recommendation for most use cases.

If memory is extremely constrained: Use QLoRA (Section 24.6). It enables fine-tuning of large models on consumer hardware with minimal quality loss.

If you need to serve many task-specific variants: LoRA adapters are small (typically 10-50 MB for a 7B model) and can be swapped at serving time. Use a single base model with multiple LoRA adapters.

If you have very few training examples (10-100): Consider prompt tuning or IA3, which have fewer parameters and are less prone to overfitting on tiny datasets.

If you need to fine-tune frequently with minimal infrastructure: Prompt tuning requires the least engineering overhead---no model weight modifications, just learned embedding vectors.

The empirical ranking on most benchmarks is: full fine-tuning $\geq$ LoRA $>$ adapters $\approx$ prefix tuning $>$ prompt tuning $>$ IA3, though the gaps narrow significantly as model size increases. At the scale of 70B+ parameters, even prompt tuning approaches full fine-tuning performance.

Instruction tuning fine-tunes a language model on a diverse collection of tasks formatted as natural language instructions. The goal is to produce a model that can follow arbitrary instructions, including instructions for tasks not seen during training. The training data takes the form of an instruction, an optional input, and a target output (see the example under Section 24.9.1).

Several standard formats are used for instruction tuning; the most common are the Alpaca format, the ShareGPT/conversation format, and the OpenAI chat format, illustrated in Section 24.9.2.

Diversity. The dataset should cover a wide range of tasks: summarization, classification, extraction, generation, reasoning, coding, math, creative writing, and more. Diversity in task type, complexity, and domain is key to generalization.

Quality. Each example should represent a high-quality response: accurate, well-structured, appropriate in length, and helpful. Low-quality responses teach the model bad habits.

Complexity distribution. Include examples ranging from simple (one-line answers) to complex (multi-step reasoning, long-form generation). This teaches the model to calibrate response complexity to the question.

Instruction variation. Express the same task in multiple ways to improve robustness: "Summarize this," "Give me the key points," "Write a brief overview of," etc.

Human annotation. The gold standard: human experts write instructions and responses. Expensive but highest quality. Used in InstructGPT (Ouyang et al., 2022) and Dolly.

Self-instruct. Use a strong LLM to generate instruction-response pairs, then filter for quality. Introduced by Wang et al. (2023) and used to create the Alpaca dataset.

Evol-instruct. Start with simple instructions and evolve them into more complex versions using an LLM.
Used to create the WizardLM dataset.

Distillation from stronger models. Generate training data using a stronger model (e.g., GPT-4) and use it to fine-tune a weaker model. Legal and ethical considerations apply---check the API terms of service.

Beyond content quality, several practical data preparation steps significantly affect fine-tuning outcomes:

Tokenization alignment. Ensure your training data uses the same tokenizer and chat template as the base model. A mismatch---for example, training a Llama model on data formatted with ChatML tags---can cause the model to learn to produce the wrong format tokens, leading to degraded instruction following.

Length distribution. Analyze the token length distribution of your training data. If most examples are short (under 100 tokens) but the task requires generating long responses, the model may learn a bias toward brevity. Include long-form examples to calibrate the model's response length. Conversely, if many examples are unnecessarily verbose, the model will learn to be verbose.

Decontamination. Remove any examples that overlap with benchmark datasets you plan to use for evaluation. Benchmark contamination inflates evaluation scores and masks true performance. Use n-gram overlap detection to identify potential contamination.

Balanced representation. If your dataset covers multiple task types, ensure no single task dominates. A dataset that is 90% classification and 10% generation will produce a model biased toward short, label-like responses. Use stratified sampling or upsampling of minority tasks to balance representation.

Data formatting consistency. Inconsistent formatting---sometimes using "Answer:", sometimes "Response:", sometimes no prefix---confuses the model. Choose a single format and apply it uniformly across the entire dataset.

Not all generated data is suitable for training; filter out low-quality, duplicated, or contaminated examples before use.

TRL (Transformer Reinforcement Learning) is HuggingFace's library for training language models, covering supervised fine-tuning (SFT), reward modeling, and reinforcement learning from human feedback (RLHF).

The supervised fine-tuning pipeline consists of preparing and formatting the dataset, configuring the model (optionally with quantization and LoRA), setting the training hyperparameters, and running the trainer (see the end-to-end example in Section 24.10).

Choosing the right hyperparameters is crucial for successful fine-tuning. Here is a systematic approach:

Learning rate. Start with the values in the table in Section 24.10.3 and adjust based on the training loss curve. If the loss decreases very slowly, the learning rate is too low. If the loss is noisy or spikes, the learning rate is too high. A reliable starting point: use $2 \times 10^{-4}$ for LoRA and $2 \times 10^{-5}$ for full fine-tuning.

LoRA rank selection. The rank $r$ should match the complexity of the adaptation. A practical approach is to start at $r = 16$ and increase the rank only if the model underfits the task. The relationship between rank and capacity can be understood through the singular value decomposition. If the true weight update $\Delta W$ has singular values $\sigma_1 \geq \sigma_2 \geq \ldots$, the optimal rank-$r$ approximation captures the top $r$ singular values. The approximation error is:

$$\|\Delta W - \Delta W_r\|_F^2 = \sum_{i=r+1}^{\min(d,k)} \sigma_i^2$$

If the singular values decay rapidly (as Aghajanyan et al. showed they do for fine-tuning updates), even a small rank captures most of the update's "signal."

Number of epochs. For instruction tuning with 10K-50K examples, 2-3 epochs is typical. For smaller datasets (1K-5K), 3-5 epochs may be needed. Always monitor the validation loss and stop when it begins to increase, as discussed in the general machine learning principles of Part II.
For instruction tuning, it is common practice to only compute the loss on the assistant's response tokens, not the instruction tokens. This prevents the model from wasting capacity learning to reproduce the instructions:

$$\mathcal{L} = -\sum_{t \in \text{response}} \log p_\theta(y_t \mid y_{<t}, x)$$

TRL's `SFTTrainer` supports this through response-template-based masking or dataset formatting.

Modern models use specific chat templates to format conversations. Different models use different templates (ChatML, Llama, Mistral, etc.), and using the wrong template can severely degrade performance.

After training, the LoRA adapter can be merged back into the base model for deployment. The merged model is identical in architecture to the original---no additional inference overhead, no adapter loading logic, and full compatibility with existing serving infrastructure. This merge-and-deploy pattern is one of the key practical advantages of LoRA over adapter-based methods.

Fine-tuned models should be evaluated on multiple dimensions:

Task performance. How well does the model perform on the target task? Use task-specific metrics: accuracy for classification, ROUGE/BERTScore for summarization, pass@k for code generation.

General capability retention. Has the model retained its general capabilities? Evaluate on standard benchmarks (MMLU, HellaSwag, ARC) to check for catastrophic forgetting.

Instruction following. Does the model correctly follow diverse instructions? Evaluate on instruction-following benchmarks like IFEval or MT-Bench.

Safety. Has the model maintained its safety guardrails? Test with standard safety benchmarks and red-teaming.

A comprehensive evaluation strategy includes all four dimensions, measured before and after fine-tuning. Several pitfalls are common:

Overfitting to the training format. The model may only respond well to inputs that match the training format exactly. Test with varied input formats.

Benchmark contamination. If benchmark data leaked into the training set, results are inflated. Use held-out or freshly-created evaluation sets.

Cherry-picking examples. A few impressive examples do not constitute evaluation. Use systematic, quantitative evaluation on representative test sets.

Ignoring base model regression. A model that excels at the target task but has lost general capabilities may be less useful overall.

Section 24.11.4 presents a framework for evaluating fine-tuned models that covers both task-specific and general capabilities. The evaluator provides a structured way to track both task performance and general capability retention during and after fine-tuning. By comparing perplexity on general text before and after fine-tuning, you can quantify the degree of catastrophic forgetting, as discussed in Section 24.12.

Catastrophic forgetting occurs when fine-tuning causes the model to lose capabilities acquired during pre-training. The model's parameters shift to optimize for the fine-tuning task at the expense of previously learned knowledge.
Formally, if $\theta_0$ are the pre-trained parameters and $\theta^*$ are the fine-tuned parameters, catastrophic forgetting occurs when:

$$\mathcal{L}_{\text{pre-train}}(\theta^*) \gg \mathcal{L}_{\text{pre-train}}(\theta_0)$$

That is, the fine-tuned model performs significantly worse on the pre-training distribution.

Lower learning rate. Use a learning rate 10--100x smaller than the pre-training rate. This limits the magnitude of weight updates.

Few training epochs. Fine-tune for 1-3 epochs rather than training until convergence. Early stopping based on validation loss is essential.

PEFT methods. LoRA, adapters, and prefix tuning inherently limit catastrophic forgetting because they freeze most parameters. The pre-trained weights remain unchanged, and only the small adaptation parameters are trained.

Data mixing. Include a fraction of general-purpose data alongside the task-specific data:

$$\mathcal{D}_{\text{train}} = \alpha \cdot \mathcal{D}_{\text{task}} + (1-\alpha) \cdot \mathcal{D}_{\text{general}}$$

where $\alpha \in [0.5, 0.9]$ controls the mix ratio. This approach, used in the Orca and Tulu papers, helps maintain general capabilities.

Elastic Weight Consolidation (EWC). Add a penalty that discourages large changes to parameters important for pre-training:

$$\mathcal{L}_{\text{EWC}} = \mathcal{L}_{\text{task}} + \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta_{0,i})^2$$

where $F_i$ is the Fisher information matrix diagonal, approximating the importance of parameter $i$ for the pre-training task.

Progressive fine-tuning. Gradually unfreeze layers from top to bottom during training. Start by training only the top layers, then progressively unfreeze deeper layers. This allows lower layers (which encode more general features) to remain stable longer.

Regularization. Apply weight decay, dropout, and gradient clipping to prevent excessive parameter changes:

$$\theta \leftarrow \theta - \eta (\nabla_\theta \mathcal{L} + \lambda \theta)$$

To detect forgetting during fine-tuning, monitor performance on a diverse evaluation suite throughout training, not just at the end. Set up a monitoring pipeline that evaluates the model every $N$ steps on both the target task and a small set of general benchmarks.

If you observe more than a 5% degradation on general benchmarks while fine-tuning, this is a signal that the learning rate is too high, the dataset is too narrow, or the number of training steps is excessive. Switching to a PEFT method, reducing the learning rate, or mixing in general-purpose data are the most reliable remedies.

A practical rule of thumb: if you are fine-tuning with LoRA at rank 16 or below and the training dataset is at least 5,000 examples, catastrophic forgetting is rarely a problem. The frozen pre-trained weights act as an anchor that preserves the model's general capabilities. This is one of the strongest practical arguments for PEFT methods, beyond their memory savings.

Fine-tuning on multiple tasks simultaneously can improve generalization and reduce forgetting. The training data combines examples from all tasks:

$$\mathcal{L}_{\text{multi}} = \sum_{t=1}^T w_t \cdot \mathcal{L}_t(\theta)$$

where:
- $T$ is the number of tasks
- $w_t$ is the weight for task $t$
- $\mathcal{L}_t(\theta)$ is the loss on task $t$

Careful weight selection is important---rare or difficult tasks may need higher weights to avoid being dominated by common or easy tasks. Several strategies exist for setting task weights:

Proportional weighting. Set $w_t$ proportional to the dataset size of task $t$. This is the simplest approach and is equivalent to uniform random sampling from the combined dataset.

Inverse proportional weighting. Set $w_t$ inversely proportional to dataset size, giving rare tasks more influence. This helps if the primary goal is balanced performance across all tasks.

Temperature-based sampling. Sample from task $t$ with probability proportional to $|\mathcal{D}_t|^{1/T_{\text{mix}}}$, where $T_{\text{mix}}$ is a temperature parameter. When $T_{\text{mix}} = 1$, this is proportional sampling; as $T_{\text{mix}} \to \infty$, it approaches uniform sampling across tasks.

Dynamic weighting. Adjust weights during training based on each task's validation loss. Tasks that are underperforming receive higher weights. This approach requires maintaining separate validation sets for each task.

The Tulu 2 (Ivison et al., 2024) and FLAN (Longpre et al., 2023) papers provide extensive ablation studies on multi-task mixing strategies and demonstrate that thoughtful task mixing leads to models that outperform single-task fine-tuning on almost all individual tasks. The intuition is that multi-task learning provides a regularization effect: the model must learn representations that are useful across tasks, preventing overfitting to the idiosyncrasies of any single task.

In practice, fine-tuning is not a one-time event. As new data arrives or requirements change, the model needs to be updated. Continual fine-tuning presents additional challenges, and several strategies exist for addressing them.

Multiple LoRA adapters trained for different tasks can be merged:

Linear interpolation:
$$W_{\text{merged}} = W_0 + \sum_{i=1}^n \alpha_i \Delta W_i$$

TIES merging: Trim small values, resolve sign conflicts, then merge.

DARE: Drop parameters randomly, rescale the remaining ones, and merge:

$$\Delta W_i' = \frac{1}{1-p} \cdot m_i \odot \Delta W_i$$

where $m_i$ is a binary mask with drop probability $p$ and the factor $\frac{1}{1-p}$ rescales the surviving parameters to maintain the expected magnitude. DARE was motivated by the observation that most individual LoRA parameters contribute little to the final performance---they can be dropped randomly with minimal impact, and the rescaling ensures the overall contribution is preserved. This sparsification step also reduces interference between tasks during merging, because fewer parameters conflict.

Section 24.13.4 gives a practical example of merging two LoRA adapters using linear interpolation. The optimal mixing weight $\alpha$ can be found by evaluating the merged model on a validation set that covers both tasks. In practice, values between 0.3 and 0.7 work well, with the exact optimum depending on the similarity of the tasks.

Model merging is particularly powerful because it requires no additional training---just arithmetic on the adapter parameters. This makes it possible to combine capabilities from multiple specialized fine-tunes into a single model that performs well across all tasks, provided the tasks do not conflict significantly in their parameter requirements.

Model merging does not always work. It tends to fail when the task-specific updates conflict or interfere with one another. In such cases, multi-task fine-tuning (training a single adapter on data from all tasks simultaneously) typically outperforms post-hoc merging. TIES merging and DARE partially address these issues by resolving sign conflicts and reducing interference, but they are not a complete solution. The choice between merging and multi-task training depends on whether you have access to all task data simultaneously (favoring multi-task training) or only to the trained adapters (requiring merging).

Fine-tuning large language models enables deep customization that prompting alone cannot achieve. In this chapter, we covered the decision between prompting and fine-tuning, full fine-tuning and its memory costs, the parameter-efficient methods LoRA, QLoRA, adapters, prefix and prompt tuning, and IA3, instruction tuning and dataset creation, supervised fine-tuning with HuggingFace TRL, evaluation of fine-tuned models, and catastrophic forgetting and its mitigations.

The field of fine-tuning is evolving rapidly. New PEFT methods continue to emerge, training recipes are being refined, and the interplay between fine-tuning and alignment (Chapter 25) is becoming better understood. The fundamental insight, however, remains: pre-trained language models contain vast latent capability, and fine-tuning is the key that unlocks task-specific performance from that general foundation.

The next chapter explores what happens after supervised fine-tuning: aligning language models with human preferences through RLHF and DPO.

24.3.2 Training Considerations
24.3.3 Memory Requirements
| Component | Memory | Example (7B model, fp32) |
|---|---|---|
| Model parameters | $4P$ bytes (fp32) | 28 GB |
| Gradients | $4P$ bytes | 28 GB |
| Optimizer states (Adam) | $8P$ bytes | 56 GB |
| Activations | Variable | 10--50 GB |
| Total | ~$16P$+ bytes | ~120+ GB |
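The totals in this table follow from simple arithmetic on the parameter count. The sketch below reproduces them (in decimal GB, matching the table); activation memory is left out because it depends on batch size and sequence length.

```python
def full_finetune_memory_gb(
    num_params: float,
    bytes_weights: int = 4,   # fp32 weights
    bytes_grads: int = 4,     # fp32 gradients
    bytes_optim: int = 8,     # Adam first and second moments
) -> dict:
    """Rough full fine-tuning memory estimate in decimal GB, excluding activations."""
    gb = 1e9
    parts = {
        "weights": num_params * bytes_weights / gb,
        "gradients": num_params * bytes_grads / gb,
        "optimizer": num_params * bytes_optim / gb,
    }
    parts["total"] = sum(parts.values())
    return parts

print(full_finetune_memory_gb(7e9))
# {'weights': 28.0, 'gradients': 28.0, 'optimizer': 56.0, 'total': 112.0}
# Adding 10-50 GB of activations pushes the total past 120 GB.
```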
24.3.4 Techniques for Reducing Memory
24.3.5 Limitations of Full Fine-Tuning
24.3.6 Full Fine-Tuning Implementation
import torch
from torch.utils.data import DataLoader
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    get_cosine_schedule_with_warmup,
)
def full_finetune(
model_name: str,
train_dataset,
num_epochs: int = 3,
learning_rate: float = 2e-5,
batch_size: int = 4,
gradient_accumulation_steps: int = 8,
max_grad_norm: float = 1.0,
warmup_steps: int = 100,
) -> AutoModelForCausalLM:
"""Full fine-tuning of a causal language model.
Args:
model_name: HuggingFace model identifier.
train_dataset: Training dataset yielding input_ids and labels.
num_epochs: Number of training epochs.
learning_rate: Peak learning rate.
batch_size: Per-device batch size.
gradient_accumulation_steps: Steps to accumulate before update.
max_grad_norm: Maximum gradient norm for clipping.
warmup_steps: Linear warmup steps.
Returns:
The fine-tuned model.
"""
model = AutoModelForCausalLM.from_pretrained(
model_name, torch_dtype=torch.bfloat16
).cuda()
optimizer = torch.optim.AdamW(
model.parameters(), lr=learning_rate, weight_decay=0.01
)
dataloader = DataLoader(
train_dataset, batch_size=batch_size, shuffle=True
)
    # One optimizer/scheduler step per `gradient_accumulation_steps` micro-batches
    total_steps = (len(dataloader) // gradient_accumulation_steps) * num_epochs
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=warmup_steps,
        num_training_steps=total_steps,
    )
model.train()
step = 0
for epoch in range(num_epochs):
for batch_idx, batch in enumerate(dataloader):
input_ids = batch["input_ids"].cuda()
labels = batch["labels"].cuda()
outputs = model(input_ids=input_ids, labels=labels)
loss = outputs.loss / gradient_accumulation_steps
loss.backward()
if (batch_idx + 1) % gradient_accumulation_steps == 0:
torch.nn.utils.clip_grad_norm_(
model.parameters(), max_grad_norm
)
optimizer.step()
scheduler.step()
optimizer.zero_grad()
step += 1
if step % 50 == 0:
print(
f"Step {step}, Loss: {loss.item() * gradient_accumulation_steps:.4f}, "
f"LR: {scheduler.get_last_lr()[0]:.2e}"
)
return model
24.4 Parameter-Efficient Fine-Tuning (PEFT)
24.4.1 The PEFT Philosophy
24.4.2 Taxonomy of PEFT Methods
24.5 LoRA: Low-Rank Adaptation
24.5.1 Motivation
24.5.2 Mathematical Formulation
24.5.3 Initialization
24.5.4 Scaling Factor
24.5.5 Which Layers to Apply LoRA
24.5.6 Rank Selection
24.5.7 LoRA Implementation from Scratch
import torch
import torch.nn as nn
import math
class LoRALinear(nn.Module):
"""A linear layer augmented with LoRA adaptation.
Implements the formula: h = W_0 x + (alpha/r) * B A x
Args:
in_features: Input dimension (k).
out_features: Output dimension (d).
rank: LoRA rank (r).
alpha: LoRA scaling factor.
"""
def __init__(
self,
in_features: int,
out_features: int,
rank: int = 16,
alpha: float = 32.0,
) -> None:
super().__init__()
self.rank = rank
self.scaling = alpha / rank
# Frozen pre-trained weight (would be loaded from checkpoint)
self.weight = nn.Parameter(
torch.randn(out_features, in_features), requires_grad=False
)
# LoRA matrices
self.lora_A = nn.Parameter(torch.randn(rank, in_features))
self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
# Initialize A with Kaiming uniform
nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
# B is already initialized to zero
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Forward pass with LoRA.
Args:
x: Input tensor of shape (..., in_features).
Returns:
Output tensor of shape (..., out_features).
"""
# Original linear transformation (frozen)
h = x @ self.weight.T
# LoRA adaptation (trainable)
h = h + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
return h
def merge_weights(self) -> None:
"""Merge LoRA weights into the base weight matrix.
After merging, the LoRA path is no longer needed and
forward becomes a simple matrix multiplication.
"""
self.weight.data += self.scaling * (self.lora_B @ self.lora_A)
# Optionally delete LoRA parameters to free memory
A key implementation detail: the pre-trained weight is stored with `requires_grad=False`, so no gradients are computed or stored for the $d \times k$ matrix---only for the much smaller $A$ and $B$ matrices.
24.5.8 Merging LoRA Weights
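As a quick sanity check that merging is exact, the sketch below folds a low-rank update into a frozen `nn.Linear` and verifies that the merged layer reproduces the two-path forward pass. The dimensions and the random $A$, $B$ matrices stand in for trained LoRA factors, and the $\alpha/r$ scaling convention matches the `LoRALinear` class above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d, k, r, alpha = 32, 16, 4, 8.0
scaling = alpha / r

base = nn.Linear(k, d, bias=False)    # stands in for the frozen pre-trained W_0
A = torch.randn(r, k) * 0.01          # trained LoRA A
B = torch.randn(d, r) * 0.01          # trained LoRA B

x = torch.randn(5, k)
h_two_path = base(x) + (x @ A.T @ B.T) * scaling  # W_0 x + (alpha/r) B A x

# Merge: W <- W_0 + (alpha/r) B A, then run a plain linear forward pass
with torch.no_grad():
    base.weight += scaling * (B @ A)
h_merged = base(x)

print(torch.allclose(h_two_path, h_merged, atol=1e-5))  # True
```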
24.6 QLoRA: Quantized LoRA
24.6.1 Motivation
24.6.2 Key Innovations
24.6.3 Memory Savings
| Method | GPU Memory |
|---|---|
| Full fine-tuning (fp32) | ~120 GB |
| Full fine-tuning (fp16) | ~60 GB |
| LoRA (fp16 base) | ~16 GB |
| QLoRA (NF4 base) | ~6 GB |
24.6.4 How NF4 Quantization Works
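The quantile construction described in this section is easy to reproduce. The sketch below computes 16 levels from the formula above and uses them to round a toy weight vector. Note that the production NF4 codebook in bitsandbytes is constructed slightly differently (it is symmetric and includes an exact zero), so this is an illustration of the idea rather than the exact data type.

```python
import torch
from torch.distributions import Normal

normal = Normal(0.0, 1.0)

# Equal-probability-mass levels: q_i = Phi^{-1}((i + 0.5) / 16)
probs = (torch.arange(16) + 0.5) / 16
levels = normal.icdf(probs)
levels = levels / levels.abs().max()  # normalize to [-1, 1]; quantized weights
                                      # are stored relative to a per-block scale
print(levels)

def quantize(w: torch.Tensor, levels: torch.Tensor) -> torch.Tensor:
    """Map each (pre-scaled) weight to its nearest quantization level."""
    idx = (w.unsqueeze(-1) - levels).abs().argmin(dim=-1)
    return levels[idx]

w = torch.randn(8)
print(quantize(w / w.abs().max(), levels))
```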
24.6.5 Performance
24.7 Adapter Layers
24.7.1 Architecture
24.7.2 Placement
24.7.3 Comparison with LoRA
| Aspect | LoRA | Adapters |
|---|---|---|
| Architecture modification | None (merged at inference) | Adds new layers |
| Inference latency | Zero overhead (after merging) | Small overhead |
| Parameter count | $2 \times r \times d$ per layer | $2 \times m \times d + m + d$ per adapter |
| Training stability | Very stable | Stable |
| Multi-task serving | Swap LoRA weights | Swap adapter modules |
24.7.4 Adapter Implementation
import torch
import torch.nn as nn
class AdapterLayer(nn.Module):
"""A bottleneck adapter layer.
Projects from d dimensions down to m, applies a nonlinearity,
and projects back up to d, with a residual connection.
Args:
d_model: Model hidden dimension.
bottleneck: Adapter bottleneck dimension.
"""
def __init__(self, d_model: int, bottleneck: int = 64) -> None:
super().__init__()
self.down = nn.Linear(d_model, bottleneck)
self.act = nn.GELU()
self.up = nn.Linear(bottleneck, d_model)
# Initialize up-projection to near-zero
nn.init.zeros_(self.up.weight)
nn.init.zeros_(self.up.bias)
def forward(self, x: torch.Tensor) -> torch.Tensor:
"""Forward pass with residual connection."""
return x + self.up(self.act(self.down(x)))
24.8 Prefix Tuning and Prompt Tuning
24.8.1 Prefix Tuning
24.8.2 Prompt Tuning
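To make the soft-prompt idea concrete, here is a minimal sketch of a prompt-tuning module: a small matrix of trainable embeddings is prepended to the frozen input embeddings before they enter the Transformer. The class name, initialization scale, and dimensions are illustrative rather than taken from any particular library.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable soft-prompt embeddings prepended to the input embeddings."""

    def __init__(self, prompt_length: int, d_model: int) -> None:
        super().__init__()
        # p trainable vectors e_1, ..., e_p in R^d
        self.prompt = nn.Parameter(torch.randn(prompt_length, d_model) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        """Prepend the soft prompt to (batch, seq_len, d_model) embeddings."""
        batch_size = input_embeds.shape[0]
        prompt = self.prompt.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# Only soft_prompt.parameters() are passed to the optimizer; the base model,
# including its embedding table, stays frozen.
soft_prompt = SoftPrompt(prompt_length=20, d_model=768)
dummy_embeds = torch.randn(2, 10, 768)
print(soft_prompt(dummy_embeds).shape)  # torch.Size([2, 30, 768])
```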
24.8.3 IA3: Infused Adapter by Inhibiting and Amplifying Inner Activations
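The scaling vectors themselves are equally simple to write down. The sketch below shows IA3-style element-wise scaling for the keys, values, and FFN activation of a single layer; the surrounding attention wiring is omitted and the class name is illustrative.

```python
import torch
import torch.nn as nn

class IA3Scaling(nn.Module):
    """IA3-style learned scaling vectors for K, V, and the FFN activation of one layer."""

    def __init__(self, d_k: int, d_ff: int) -> None:
        super().__init__()
        # Initialized to ones so the model starts from an identity rescaling
        self.l_k = nn.Parameter(torch.ones(d_k))
        self.l_v = nn.Parameter(torch.ones(d_k))
        self.l_ff = nn.Parameter(torch.ones(d_ff))

    def scale_kv(self, k: torch.Tensor, v: torch.Tensor):
        # K' = l_K * K and V' = l_V * V (element-wise over the last dimension)
        return k * self.l_k, v * self.l_v

    def scale_ffn(self, ffn_hidden: torch.Tensor) -> torch.Tensor:
        # h' = l_ff * FFN(h) (element-wise)
        return ffn_hidden * self.l_ff

ia3 = IA3Scaling(d_k=64, d_ff=11008)
print(sum(p.numel() for p in ia3.parameters()))  # 2*64 + 11008 = 11136 per layer
```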
24.8.4 Comparison
| Method | Parameters per task | Where added | Performance | Inference overhead |
|---|---|---|---|---|
| Full fine-tuning | All | Everywhere | Best | None |
| LoRA | $2rLd'$ | Attention weights | Near-best | None (after merging) |
| Prefix tuning | $2Lpd$ | Attention KV | Good | Small (extra KV) |
| Prompt tuning | $pd$ | Input only | Good (large models) | Minimal (extra tokens) |
| Adapter layers | $2Lmd$ | After sublayers | Good | Small (extra layers) |
| IA3 | $L(2d_k + d_{ff})$ | Scaling vectors | Moderate | Negligible |
24.8.5 How to Choose a PEFT Method
24.9 Instruction Tuning and Dataset Creation
24.9.1 What Is Instruction Tuning?
Instruction: Summarize the following article in three bullet points.
Input: [article text]
Output: [three bullet points]
24.9.2 Dataset Formats
{
"instruction": "Classify the sentiment of the following review.",
"input": "The food was delicious and the service was excellent.",
"output": "Positive"
}
{
"conversations": [
{"from": "human", "value": "What causes seasons on Earth?"},
{"from": "assistant", "value": "Seasons are caused by..."}
]
}
{
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Explain photosynthesis."},
{"role": "assistant", "content": "Photosynthesis is..."}
]
}
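Because a training pipeline usually expects one specific layout, it is common to convert between these formats. The sketch below converts an Alpaca-style record into the chat-messages format shown above; the helper name and default system prompt are illustrative.

```python
def alpaca_to_messages(
    example: dict,
    system_prompt: str = "You are a helpful assistant.",
) -> dict:
    """Convert an Alpaca-format example into a chat-messages record."""
    if example.get("input"):
        user_content = f"{example['instruction']}\n\n{example['input']}"
    else:
        user_content = example["instruction"]
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": example["output"]},
        ]
    }

record = {
    "instruction": "Classify the sentiment of the following review.",
    "input": "The food was delicious and the service was excellent.",
    "output": "Positive",
}
print(alpaca_to_messages(record))
```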
24.9.3 Creating High-Quality Instruction Datasets
24.9.4 Data Generation Strategies
24.9.5 Data Preparation Best Practices
24.9.6 Data Quality Filtering
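A typical filtering pass checks at least response length, exact duplicates, and n-gram overlap with evaluation data. The sketch below implements those three checks; the thresholds, field names, and 8-gram window are illustrative choices, not prescribed values.

```python
import hashlib

def ngrams(text: str, n: int = 8) -> set:
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def filter_examples(
    examples: list[dict],
    eval_texts: list[str],
    min_chars: int = 20,
    max_chars: int = 8000,
) -> list[dict]:
    """Keep examples that pass simple length, duplicate, and contamination checks."""
    eval_grams = set().union(*(ngrams(t) for t in eval_texts)) if eval_texts else set()
    seen_hashes, kept = set(), []
    for ex in examples:
        text = ex["instruction"] + "\n" + ex["output"]
        # 1. Length filter: drop trivially short or extremely long responses
        if not (min_chars <= len(ex["output"]) <= max_chars):
            continue
        # 2. Exact-duplicate filter
        digest = hashlib.md5(text.encode()).hexdigest()
        if digest in seen_hashes:
            continue
        seen_hashes.add(digest)
        # 3. Decontamination: drop examples sharing an n-gram with the eval set
        if ngrams(text) & eval_grams:
            continue
        kept.append(ex)
    return kept
```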
24.10 Supervised Fine-Tuning with HuggingFace TRL
24.10.1 The TRL Library
TRL provides high-level abstractions built on the `transformers` and `peft` libraries.
24.10.2 SFT Pipeline Overview
24.10.3 Key Hyperparameters for SFT
| Hyperparameter | Typical Range | Notes |
|---|---|---|
| Learning rate | 1e-5 to 5e-5 | Lower for larger models |
| Batch size | 4 to 32 | Use gradient accumulation |
| Epochs | 1 to 5 | Monitor for overfitting |
| Max sequence length | 512 to 4096 | Depends on task |
| Warmup ratio | 0.03 to 0.1 | Fraction of total steps |
| Weight decay | 0.0 to 0.1 | Mild regularization |
| LoRA rank | 8 to 64 | Higher for complex tasks |
| LoRA alpha | 16 to 128 | Often set to 2 * rank |
| LoRA dropout | 0.0 to 0.1 | Mild regularization |
24.10.4 Hyperparameter Selection Guide
Sequence packing. When training examples vary in length, packing multiple short examples into a single sequence (up to `max_seq_length`) improves GPU utilization. TRL's `SFTTrainer` supports this with the `packing=True` option. Without packing, short examples waste compute on padding tokens.
24.10.5 Training Loss Masking
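Response-only loss masking can be implemented by setting the label of every prompt token to -100, which PyTorch's cross-entropy loss ignores. A minimal sketch for a single prompt/response pair is shown below; the helper name is illustrative, and TRL's response-template masking achieves the same effect without building labels by hand.

```python
def build_masked_labels(tokenizer, prompt: str, response: str) -> dict:
    """Tokenize prompt + response and mask the prompt tokens out of the loss."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    response_ids = tokenizer(response, add_special_tokens=False)["input_ids"]
    input_ids = prompt_ids + response_ids + [tokenizer.eos_token_id]
    # -100 is ignored by torch.nn.CrossEntropyLoss, so only the response
    # tokens (and the EOS token) contribute to the loss.
    labels = [-100] * len(prompt_ids) + response_ids + [tokenizer.eos_token_id]
    return {"input_ids": input_ids, "labels": labels}
```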
24.10.6 Chat Templates
The `tokenizer.apply_chat_template()` method handles this:
messages = [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is machine learning?"},
{"role": "assistant", "content": "Machine learning is..."}
]
formatted = tokenizer.apply_chat_template(messages, tokenize=False)
24.10.7 Complete PEFT Fine-Tuning Example
Here is a complete, end-to-end example of fine-tuning a model with LoRA using the `peft` and `trl` libraries. This example demonstrates QLoRA fine-tuning: the base model is loaded in 4-bit precision (NF4 quantization), LoRA adapters are applied to all linear layers, and training uses 8-bit paged Adam. On a single GPU with 24GB of memory, this configuration can fine-tune models up to approximately 7B parameters.
import torch
from datasets import load_dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
BitsAndBytesConfig,
TrainingArguments,
)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from trl import SFTTrainer
# 1. Load model with 4-bit quantization (QLoRA)
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True,
)
model_name = "meta-llama/Llama-3.2-1B"
model = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token
# 2. Prepare model for quantized training
model = prepare_model_for_kbit_training(model)
# 3. Configure LoRA
lora_config = LoraConfig(
r=16, # Rank
lora_alpha=32, # Scaling factor (alpha/r = 2)
target_modules=[ # Apply to all attention matrices
"q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
],
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# Output: trainable params: 13,631,488 || all params: 1,249,283,072
# || trainable%: 1.0911
# 4. Load and format dataset
dataset = load_dataset("tatsu-lab/alpaca", split="train[:5000]")
def format_instruction(example):
"""Format an Alpaca example into a training prompt."""
if example["input"]:
text = (
f"### Instruction:\n{example['instruction']}\n\n"
f"### Input:\n{example['input']}\n\n"
f"### Response:\n{example['output']}"
)
else:
text = (
f"### Instruction:\n{example['instruction']}\n\n"
f"### Response:\n{example['output']}"
)
return {"text": text}
dataset = dataset.map(format_instruction)
# 5. Configure training
training_args = TrainingArguments(
output_dir="./lora-llama",
num_train_epochs=3,
per_device_train_batch_size=4,
gradient_accumulation_steps=4, # Effective batch size = 16
learning_rate=2e-4,
warmup_ratio=0.03,
lr_scheduler_type="cosine",
logging_steps=25,
save_strategy="epoch",
bf16=True,
optim="paged_adamw_8bit",
max_grad_norm=0.3,
)
# 6. Train with SFTTrainer
trainer = SFTTrainer(
model=model,
train_dataset=dataset,
args=training_args,
tokenizer=tokenizer,
dataset_text_field="text",
max_seq_length=512,
packing=True, # Pack multiple examples into one sequence
)
trainer.train()
# 7. Save the LoRA adapter (not the full model)
trainer.model.save_pretrained("./lora-llama/final")
24.10.8 Merging and Deploying LoRA Adapters
from peft import AutoPeftModelForCausalLM
# Load the base model + LoRA adapter
model = AutoPeftModelForCausalLM.from_pretrained(
"./lora-llama/final",
device_map="auto",
torch_dtype=torch.bfloat16,
)
# Merge LoRA weights into the base model
merged_model = model.merge_and_unload()
# Save the merged model for deployment
merged_model.save_pretrained("./llama-merged")
tokenizer.save_pretrained("./llama-merged")
24.11 Evaluation of Fine-Tuned Models
24.11.1 Evaluation Dimensions
24.11.2 Benchmarking Strategy
24.11.3 Common Pitfalls
24.11.4 Practical Evaluation Implementation
from typing import Callable
import numpy as np
class FineTuneEvaluator:
"""Comprehensive evaluator for fine-tuned language models.
Evaluates task performance, general capability retention,
and instruction following quality.
"""
def __init__(self, model, tokenizer, base_model=None):
self.model = model
self.tokenizer = tokenizer
self.base_model = base_model # For regression testing
def evaluate_task(
self,
test_data: list[dict],
metric_fn: Callable,
) -> dict:
"""Evaluate performance on the target task.
Args:
test_data: List of dicts with 'input' and 'expected'.
metric_fn: Function that computes a metric from
(predictions, references).
Returns:
Dict with metric scores.
"""
predictions = []
for example in test_data:
input_ids = self.tokenizer(
example["input"], return_tensors="pt"
).input_ids.to(self.model.device)
output = self.model.generate(
input_ids, max_new_tokens=256,
do_sample=False, # Greedy for reproducibility
)
pred = self.tokenizer.decode(
output[0][input_ids.shape[1]:],
skip_special_tokens=True,
)
predictions.append(pred)
references = [ex["expected"] for ex in test_data]
return metric_fn(predictions, references)
def evaluate_perplexity(self, texts: list[str]) -> float:
"""Compute perplexity on general text.
Lower perplexity on general text indicates better
retention of pre-trained capabilities.
"""
import torch
total_loss = 0.0
total_tokens = 0
for text in texts:
encodings = self.tokenizer(
text, return_tensors="pt", truncation=True,
max_length=512,
)
input_ids = encodings.input_ids.to(self.model.device)
with torch.no_grad():
outputs = self.model(
input_ids=input_ids, labels=input_ids
)
total_loss += outputs.loss.item() * input_ids.shape[1]
total_tokens += input_ids.shape[1]
avg_loss = total_loss / total_tokens
return np.exp(avg_loss)
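A usage sketch, assuming `model` and `tokenizer` hold the fine-tuned model and its tokenizer loaded earlier; the exact-match metric and the tiny test set are illustrative.

```python
def exact_match(predictions: list[str], references: list[str]) -> dict:
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return {"exact_match": correct / len(references)}

test_data = [
    {"input": "Classify the sentiment: 'Great service!'\nAnswer:", "expected": "Positive"},
]
general_texts = [
    "Photosynthesis converts light energy into chemical energy stored in glucose."
]

evaluator = FineTuneEvaluator(model, tokenizer)
print(evaluator.evaluate_task(test_data, exact_match))
print(evaluator.evaluate_perplexity(general_texts))
```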
24.12 Catastrophic Forgetting
24.12.1 The Problem
24.12.2 Causes
24.12.3 Mitigation Strategies
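The EWC penalty can be sketched directly from its formula: estimate a diagonal Fisher term on general text, snapshot the pre-trained parameters, and add the quadratic penalty to the task loss. The snippet below is a schematic sketch assuming a HuggingFace-style causal LM whose forward call accepts `input_ids` and `labels`; `theta_0` is assumed to be a snapshot of the pre-trained parameters taken before fine-tuning.

```python
import torch

def estimate_fisher_diagonal(model, general_batches) -> dict:
    """Approximate diag(F) by averaging squared gradients of the LM loss."""
    fisher = {
        name: torch.zeros_like(p)
        for name, p in model.named_parameters() if p.requires_grad
    }
    for batch in general_batches:          # each batch: a tensor of token ids
        model.zero_grad()
        loss = model(input_ids=batch, labels=batch).loss
        loss.backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                fisher[name] += p.grad.detach() ** 2
    return {name: f / max(len(general_batches), 1) for name, f in fisher.items()}

def ewc_penalty(model, theta_0: dict, fisher: dict, lam: float = 0.1) -> torch.Tensor:
    """(lambda / 2) * sum_i F_i * (theta_i - theta_{0,i})^2 over trainable parameters."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, p in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (p - theta_0[name]) ** 2).sum()
    return 0.5 * lam * penalty

# During fine-tuning: total_loss = task_loss + ewc_penalty(model, theta_0, fisher)
```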
24.12.4 Diagnosing Catastrophic Forgetting
24.13 Advanced Topics
24.13.1 Multi-Task Fine-Tuning
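The temperature-based sampling rule is easy to compute directly. A small sketch with illustrative dataset sizes:

```python
def task_sampling_probs(dataset_sizes: dict, temperature: float = 2.0) -> dict:
    """Sample task t with probability proportional to |D_t|^(1 / T_mix)."""
    weights = {task: size ** (1.0 / temperature) for task, size in dataset_sizes.items()}
    total = sum(weights.values())
    return {task: w / total for task, w in weights.items()}

sizes = {"summarization": 50_000, "classification": 5_000, "code": 500}
print(task_sampling_probs(sizes, temperature=1.0))  # proportional sampling
print(task_sampling_probs(sizes, temperature=5.0))  # closer to uniform across tasks
```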
24.13.2 Continual Fine-Tuning
24.13.3 Merging Fine-Tuned Models
24.13.4 Model Merging in Practice
from peft import PeftModel
import torch
# Load base model
from transformers import AutoModelForCausalLM
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
# Load first adapter (e.g., trained for coding tasks)
model_1 = PeftModel.from_pretrained(base, "coding-lora-adapter")
delta_1 = {}
for name, param in model_1.named_parameters():
if "lora" in name:
delta_1[name] = param.data.clone()
# Load second adapter (e.g., trained for math tasks) on a fresh copy of the base
# model, since PeftModel.from_pretrained injects adapter modules into the model
# object it is given
base_math = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
model_2 = PeftModel.from_pretrained(base_math, "math-lora-adapter")
delta_2 = {}
for name, param in model_2.named_parameters():
if "lora" in name:
delta_2[name] = param.data.clone()
# Merge with a weighted average of the LoRA tensors. Averaging lora_A and lora_B
# separately is a common heuristic; it only approximates averaging the full
# updates delta_W = B @ A.
alpha = 0.5  # Equal weighting
base_merged = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
merged = PeftModel.from_pretrained(base_merged, "coding-lora-adapter")
for name, param in merged.named_parameters():
if "lora" in name and name in delta_2:
param.data = alpha * delta_1[name] + (1 - alpha) * delta_2[name]
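The DARE drop-and-rescale step from Section 24.13.3 can be applied to these deltas before averaging. A minimal sketch, assuming `delta_1` is the dictionary of LoRA tensors built above; applying the mask to each LoRA factor independently is a simplification of dropping entries of the full update.

```python
import torch

def dare_sparsify(delta: dict, drop_prob: float = 0.9) -> dict:
    """Randomly drop entries of each tensor and rescale survivors by 1 / (1 - p)."""
    sparsified = {}
    for name, tensor in delta.items():
        mask = (torch.rand_like(tensor) > drop_prob).float()
        sparsified[name] = mask * tensor / (1.0 - drop_prob)
    return sparsified

delta_1_dare = dare_sparsify(delta_1, drop_prob=0.9)
```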
24.13.5 When Model Merging Fails
24.14 Summary
References