Chapter 24: Further Reading

Foundational Papers

LoRA and Variants

  • Hu, E. J., et al. (2022). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR. The original LoRA paper demonstrating that fine-tuning updates have low intrinsic rank. Introduces the $\Delta W = BA$ parameterization with zero initialization of $B$; a minimal sketch of this parameterization follows this list. Essential reading for understanding the mathematical foundations.

  • Dettmers, T., et al. (2023). "QLoRA: Efficient Finetuning of Quantized LLMs." NeurIPS. Introduces NF4 quantization, double quantization, and paged optimizers, enabling fine-tuning of 65B models on a single GPU. Demonstrates that QLoRA matches full fine-tuning quality despite 4-bit base model quantization.

  • Aghajanyan, A., et al. (2021). "Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning." ACL. Provides the theoretical foundation for LoRA by showing that pre-trained models have low intrinsic dimensionality---they can be fine-tuned effectively in a much lower-dimensional subspace.
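
To make the $\Delta W = BA$ parameterization concrete, here is a minimal PyTorch sketch of a LoRA-wrapped linear layer. It is illustrative rather than the paper's reference code or the PEFT library implementation; the class name `LoRALinear`, the small Gaussian initialization of $A$, and the $\alpha / r$ scaling are assumptions chosen to mirror common practice.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: y = W x + (alpha / r) * B A x, with W frozen."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)  # freeze the pre-trained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # A: small random init; B: zeros, so delta W = BA is zero at the start
        # and training begins exactly at the pre-trained model.
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

# delta W = B @ A has rank at most r, so each adapted layer trains only
# r * (in_features + out_features) parameters instead of in_features * out_features.
layer = LoRALinear(nn.Linear(768, 768), r=8, alpha=16)
```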

Adapters and Prefix Tuning

  • Houlsby, N., et al. (2019). "Parameter-Efficient Transfer Learning for NLP." ICML. The original adapter paper that inserts small bottleneck modules into pre-trained Transformers. Establishes the paradigm of training only a small fraction of parameters for downstream tasks.

  • Li, X. L., & Liang, P. (2021). "Prefix-Tuning: Optimizing Continuous Prompts for Generation." ACL. Introduces prefix tuning, which prepends trainable continuous vectors to attention layers. Demonstrates competitive performance with less than 0.1% of model parameters.

  • Lester, B., et al. (2021). "The Power of Scale for Parameter-Efficient Prompt Tuning." EMNLP. Shows that prompt tuning (trainable embeddings prepended to the input, sketched after this list) approaches full fine-tuning performance as model size increases, establishing an important scaling law for PEFT methods.
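
To make the soft-prompt idea from Lester et al. concrete, here is a minimal sketch of prompt tuning: a small matrix of trainable embeddings is prepended to the frozen token embeddings before they enter the model. The class name `SoftPrompt` and the initialization scale are illustrative assumptions, not a specific library API; prefix tuning differs in that it injects trainable vectors into the attention keys and values of every layer rather than only the input.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Trainable continuous prompt prepended to the input embeddings (prompt tuning)."""

    def __init__(self, num_virtual_tokens: int, embed_dim: int):
        super().__init__()
        # The only trainable parameters: one embedding vector per virtual token.
        self.prompt = nn.Parameter(torch.randn(num_virtual_tokens, embed_dim) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, embed_dim) from the frozen embedding table.
        batch = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# 20 virtual tokens for a 768-dim model: only 20 * 768 = 15,360 trainable parameters.
soft_prompt = SoftPrompt(num_virtual_tokens=20, embed_dim=768)
```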

Instruction Tuning

  • Ouyang, L., et al. (2022). "Training Language Models to Follow Instructions with Human Feedback." NeurIPS. The InstructGPT paper describing the full pipeline from SFT to RLHF. Establishes the modern instruction-tuning paradigm and demonstrates its effectiveness for alignment.

  • Wang, Y., et al. (2023). "Self-Instruct: Aligning Language Models with Self-Generated Instructions." ACL. Introduces the self-instruct framework for generating instruction-tuning data using the model itself. The foundation for many subsequent data generation approaches.

  • Taori, R., et al. (2023). "Stanford Alpaca: An Instruction-Following LLaMA Model." GitHub. Demonstrates that fine-tuning LLaMA on 52K self-instruct examples produces a capable instruction-following model. Popularized the Alpaca dataset format, an example of which follows this list.

  • Xu, C., et al. (2024). "WizardLM: Empowering Large Language Models to Follow Complex Instructions." ICLR. Introduces Evol-Instruct, which progressively evolves simple instructions into complex ones, producing models that excel at following complicated, multi-constraint instructions.
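
Since the Alpaca dataset format comes up repeatedly in later tooling, here is a single illustrative record and one common way of flattening it into a training string. The record content and the exact template wording are assumptions for illustration; projects vary in the precise prompt text they use.

```python
# One Alpaca-style training record; "input" may be empty for instructions
# that need no additional context.
record = {
    "instruction": "Summarize the following paragraph in one sentence.",
    "input": "Parameter-efficient fine-tuning methods train a small number of "
             "additional parameters while keeping the pre-trained weights frozen.",
    "output": "PEFT methods adapt a model by training only a few new parameters "
              "on top of frozen pre-trained weights.",
}

# A typical prompt template that turns the record into a single training string.
PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input that "
    "provides further context. Write a response that completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

training_text = PROMPT_TEMPLATE.format(**record) + record["output"]
```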

Surveys and Overviews

  • Ding, N., et al. (2023). "Parameter-Efficient Fine-Tuning of Large-Scale Pre-Trained Language Models." Nature Machine Intelligence. A comprehensive survey of PEFT methods including adapters, LoRA, prefix tuning, prompt tuning, and BitFit. Provides clear comparisons and categorization of approaches.

  • Han, Z., et al. (2024). "Parameter-Efficient Fine-Tuning for Large Models: A Comprehensive Survey." arXiv:2403.14608. An up-to-date survey covering the rapidly evolving PEFT landscape, including LoRA variants, adapter variants, and emerging methods. Good reference for understanding the full taxonomy.

Advanced Fine-Tuning Methods

Model Merging

  • Yadav, P., et al. (2023). "TIES-Merging: Resolving Interference When Merging Models." NeurIPS. Proposes the Trim, Elect Sign, and Merge algorithm for combining multiple fine-tuned models. Addresses the interference problem that arises when naively averaging model weights.

  • Yu, L., et al. (2024). "Language Model Merging: A Survey." arXiv:2405.17897. Surveys model merging methods including linear interpolation, task arithmetic, TIES, and DARE. Provides a unified framework for understanding different merging strategies; the two simplest strategies are sketched after this list.
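
As a concrete anchor for the merging literature, the sketch below shows the two simplest strategies that TIES and DARE refine: uniform weight averaging and task arithmetic over task vectors (the difference between a fine-tuned checkpoint and the shared base model). It is schematic only; the function names are assumptions, and it omits the trimming, sign election, and rescaling steps that the papers above add.

```python
import torch

def merge_by_averaging(state_dicts):
    """Uniformly average the weights of several fine-tuned checkpoints."""
    merged = {}
    for name in state_dicts[0]:
        merged[name] = torch.stack([sd[name].float() for sd in state_dicts]).mean(dim=0)
    return merged

def merge_by_task_arithmetic(base_sd, finetuned_sds, scale=1.0):
    """Add scaled task vectors (fine-tuned minus base) back onto the base weights.
    TIES and DARE both start from these task vectors, then sparsify them and
    resolve sign conflicts before summing."""
    merged = {}
    for name, base_w in base_sd.items():
        task_vectors = [sd[name].float() - base_w.float() for sd in finetuned_sds]
        merged[name] = base_w.float() + scale * torch.stack(task_vectors).sum(dim=0)
    return merged
```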

Catastrophic Forgetting

  • Kirkpatrick, J., et al. (2017). "Overcoming Catastrophic Forgetting in Neural Networks." PNAS. The original Elastic Weight Consolidation (EWC) paper, which uses the Fisher information matrix to penalize changes to important parameters; the penalty is sketched after this list. A foundational reference for continual learning.

  • Luo, Y., et al. (2024). "An Empirical Study of Catastrophic Forgetting in Large Language Models During Continual Fine-Tuning." arXiv:2308.08747. An empirical study specific to LLMs documenting how different fine-tuning methods, data volumes, and hyperparameters affect catastrophic forgetting. Provides practical guidelines.
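
The EWC penalty itself is compact: the regularized objective is $L(\theta) = L_{\text{task}}(\theta) + \frac{\lambda}{2} \sum_i F_i (\theta_i - \theta_i^*)^2$, where $\theta^*$ are the parameters after the previous task and $F_i$ is the (diagonal) Fisher information estimating how important parameter $i$ was to that task. Below is a minimal sketch, assuming `fisher` and `old_params` are dictionaries of tensors saved after the previous task; the function name and default $\lambda$ are illustrative.

```python
import torch

def ewc_penalty(model, fisher, old_params, lam=1000.0):
    """Elastic Weight Consolidation penalty:
    (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, param in model.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (param - old_params[name]) ** 2).sum()
    return 0.5 * lam * penalty

# During fine-tuning on the new task:
#   loss = task_loss + ewc_penalty(model, fisher, old_params)
```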

Libraries and Tools

  • HuggingFace PEFT Library (https://huggingface.co/docs/peft). The primary library for parameter-efficient fine-tuning, supporting LoRA, QLoRA, adapters, prefix tuning, and more. Integrates seamlessly with Transformers and TRL; a short setup sketch combining PEFT and BitsAndBytes follows this list.

  • HuggingFace TRL Library (https://huggingface.co/docs/trl). Transformer Reinforcement Learning library covering SFT, reward modeling, DPO, and PPO. Provides SFTTrainer and DPOTrainer with built-in PEFT support.

  • BitsAndBytes Library (https://github.com/TimDettmers/bitsandbytes). Provides 4-bit and 8-bit quantization primitives used by QLoRA. Essential for memory-efficient fine-tuning.

  • Axolotl (https://github.com/OpenAccess-AI-Collective/axolotl). A higher-level fine-tuning framework that simplifies configuration of LoRA, QLoRA, and full fine-tuning. Supports multiple dataset formats and model architectures.
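
As a rough orientation to how these libraries fit together, the snippet below sketches a QLoRA-style setup: a base model loaded in 4-bit through BitsAndBytes, prepared for k-bit training, and wrapped with a PEFT LoRA configuration. The model name is a placeholder and argument names occasionally change across library versions, so treat this as a sketch to check against the current documentation rather than a drop-in recipe.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with double quantization, as introduced in the QLoRA paper.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Placeholder model name; any causal LM supported by Transformers works.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapters on the attention projections; rank and alpha are typical defaults.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of total parameters
```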

Looking Ahead

The fine-tuning concepts from this chapter connect directly to upcoming topics:

  • Chapter 25 (Alignment: RLHF and DPO): SFT is the first stage of the alignment pipeline. Understanding fine-tuning mechanics is essential for implementing RLHF and DPO, which build on the SFT model as a starting point.
  • Chapter 26 (Vision Transformers): Fine-tuning techniques carry over to vision models; LoRA is widely used for efficient adaptation of ViT models.
  • Chapter 31 (RAG): Fine-tuned models combined with retrieval provide the best of both worlds: deep behavioral customization from fine-tuning and dynamic knowledge from retrieval.