Chapter 27: Further Reading

Foundational Papers

The Origins of Diffusion Models

  • Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). "Deep Unsupervised Learning using Nonequilibrium Thermodynamics." ICML 2015. The paper that introduced the diffusion model framework, drawing on ideas from non-equilibrium statistical mechanics. Establishes the theoretical foundation of learning to reverse a gradual noising process.

  • Ho, J., Jain, A., & Abbeel, P. (2020). "Denoising Diffusion Probabilistic Models." NeurIPS 2020. The breakthrough paper that made diffusion models practical. Introduces the simplified noise prediction objective and demonstrates competitive image generation quality. Essential reading for understanding the core DDPM framework.

  • Song, Y. & Ermon, S. (2019). "Generative Modeling by Estimating Gradients of the Data Distribution." NeurIPS 2019. Introduces score-based generative modeling with Noise Conditional Score Networks (NCSN). Provides the score-matching perspective that was later unified with DDPM.
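A central fact shared by these papers is that the forward noising process has a simple closed form: any noisy sample x_t can be drawn directly from the clean data x_0 in one step. A minimal sketch of that closed-form sampling, using an illustrative linear beta schedule (the schedule values here are assumptions for demonstration, not taken from any specific paper's configuration):

```python
import numpy as np

# Closed-form forward diffusion: q(x_t | x_0) = N(sqrt(abar_t) x_0, (1 - abar_t) I).
# Illustrative linear beta schedule; real configurations vary by paper.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)  # abar_t = prod_{s<=t} (1 - beta_s)

def q_sample(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) and return the noise used."""
    noise = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise
    return xt, noise

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))        # toy "image"
xt, eps = q_sample(x0, t=T - 1, rng=rng)  # near-pure noise at the final step
```

The DDPM training objective amounts to regressing `eps` from `xt` and `t` with a neural network, which is exactly the simplified noise-prediction loss introduced by Ho et al.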

Theoretical Unification

  • Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). "Score-Based Generative Modeling through Stochastic Differential Equations." ICLR 2021. Unifies DDPM and score-based models through the SDE framework. Establishes the probability flow ODE and enables exact likelihood computation. A must-read for deep theoretical understanding.

Improved Training and Sampling

  • Nichol, A. & Dhariwal, P. (2021). "Improved Denoising Diffusion Probabilistic Models." ICML 2021. Introduces the cosine noise schedule, learned variance, and importance sampling for the variational bound. Demonstrates improved sample quality and log-likelihood.

  • Song, J., Meng, C., & Ermon, S. (2021). "Denoising Diffusion Implicit Models." ICLR 2021. Introduces DDIM, enabling faster sampling (e.g., 50 steps instead of 1000) and deterministic generation. Essential for understanding modern sampling strategies.

  • Lu, C., Zhou, Y., Bao, F., Chen, J., Li, C., & Zhu, J. (2022). "DPM-Solver: A Fast ODE Solver for Diffusion Probabilistic Model Sampling in Around 10 Steps." NeurIPS 2022. Develops a fast high-order ODE solver specifically designed for diffusion models, achieving high-quality samples in 10-20 steps.

  • Song, Y., Dhariwal, P., Chen, M., & Sutskever, I. (2023). "Consistency Models." ICML 2023. Introduces consistency models that can generate images in a single step by learning to map any point on the diffusion trajectory to the clean data.
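The step reduction these papers pursue is easiest to see in the deterministic DDIM update (eta = 0), which predicts x_0 from the current sample and re-noises it to the previous timestep. A shape-level sketch, with the trained noise-prediction network replaced by a zero-returning placeholder and an illustrative linear schedule (both are assumptions, not a working sampler):

```python
import numpy as np

# Deterministic DDIM step: predict x0 from x_t, then jump to t_prev.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def predict_eps(x_t, t):
    """Placeholder for a trained noise-prediction network."""
    return np.zeros_like(x_t)

def ddim_step(x_t, t, t_prev):
    eps = predict_eps(x_t, t)
    ab_t, ab_prev = alpha_bars[t], alpha_bars[t_prev]
    x0_pred = (x_t - np.sqrt(1.0 - ab_t) * eps) / np.sqrt(ab_t)
    return np.sqrt(ab_prev) * x0_pred + np.sqrt(1.0 - ab_prev) * eps

# Sample with 50 evenly spaced timesteps instead of all 1000.
timesteps = np.linspace(T - 1, 0, 50).astype(int)
x = np.random.default_rng(0).standard_normal((8, 8))
for t, t_prev in zip(timesteps[:-1], timesteps[1:]):
    x = ddim_step(x, t, t_prev)
```

Because each step is deterministic, the same initial noise always maps to the same output, which is what enables DDIM's inversion and interpolation applications.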

Scaling and Architecture

  • Dhariwal, P. & Nichol, A. (2021). "Diffusion Models Beat GANs on Image Synthesis." NeurIPS 2021. Demonstrates that diffusion models can surpass GANs in sample quality (FID) for the first time. Introduces classifier guidance and architecture improvements.

  • Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models." CVPR 2022. Introduces latent diffusion models, the foundation of Stable Diffusion. Shows that running diffusion in a compressed latent space dramatically reduces computational cost while maintaining quality.

  • Peebles, W. & Xie, S. (2023). "Scalable Diffusion Models with Transformers." ICCV 2023. Introduces the Diffusion Transformer (DiT), replacing the U-Net with a plain transformer. Demonstrates that transformers scale better than U-Nets for diffusion, paving the way for Sora.

Conditioning and Control

  • Ho, J. & Salimans, T. (2022). "Classifier-Free Diffusion Guidance." NeurIPS 2021 Workshop. Introduces classifier-free guidance, the technique that makes text-conditioned diffusion models produce high-quality, text-aligned outputs. Short but hugely influential.

  • Zhang, L. & Agrawala, M. (2023). "Adding Conditional Control to Text-to-Image Diffusion Models." ICCV 2023. Introduces ControlNet for adding spatial control (edges, depth, pose) to pre-trained diffusion models. Widely adopted for controlled generation.

  • Ye, H., Zhang, J., Liu, S., Han, X., & Yang, W. (2023). "IP-Adapter: Text Compatible Image Prompt Adapter for Text-to-Image Diffusion Models." Enables image-based conditioning through CLIP image embeddings injected via cross-attention, allowing style and content transfer from reference images.
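Classifier-free guidance itself reduces to a one-line combination of two forward passes through the same network, one with and one without the conditioning. A minimal sketch (the epsilon arrays are random stand-ins for real network outputs, and w = 7.5 is a commonly used illustrative scale, not a value prescribed by the paper):

```python
import numpy as np

# Classifier-free guidance: extrapolate from the unconditional prediction
# toward the conditional one by guidance scale w.
rng = np.random.default_rng(0)
eps_cond = rng.standard_normal((4, 4))    # stand-in: network run with text
eps_uncond = rng.standard_normal((4, 4))  # stand-in: network run with null prompt

def cfg(eps_uncond, eps_cond, w=7.5):
    # w = 1 recovers the purely conditional prediction;
    # w > 1 pushes samples harder toward the conditioning signal.
    return eps_uncond + w * (eps_cond - eps_uncond)

eps_guided = cfg(eps_uncond, eps_cond)
```

In practice the two predictions are computed in a single batched forward pass, and the guided epsilon is plugged into whichever sampler (DDIM, DPM-Solver, etc.) is being used.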

Customization and Fine-Tuning

  • Ruiz, N., Li, Y., Jampani, V., Pritch, Y., Rubinstein, M., & Aberman, K. (2023). "DreamBooth: Fine Tuning Text-to-Image Diffusion Models for Subject-Driven Generation." CVPR 2023. Fine-tunes diffusion models on a few images of a specific subject, enabling personalized generation. Foundational for subject-driven customization.

  • Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., & Chen, W. (2022). "LoRA: Low-Rank Adaptation of Large Language Models." ICLR 2022. While originally proposed for language models, LoRA has become the dominant method for efficient diffusion model fine-tuning. Understanding LoRA is essential for practical diffusion model customization.

  • Gal, R., Alaluf, Y., Atzmon, Y., Patashnik, O., Bermano, A. H., Chechik, G., & Cohen-Or, D. (2023). "An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion." ICLR 2023. Learns new concept embeddings while keeping the model frozen, providing the most parameter-efficient personalization method.
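The LoRA update that underlies much of this customization work replaces a full fine-tune of a weight matrix W with a trainable low-rank product. A minimal sketch of a single adapted layer (toy dimensions; the zero initialization of B follows the convention in the LoRA paper, so the adapted layer starts out identical to the original):

```python
import numpy as np

# LoRA: W stays frozen; only the low-rank factors A and B are trained.
# Effective weight is W + (alpha / r) * B @ A.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 8

W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable, small random init
B = np.zeros((d_out, r))                   # trainable, zero init

def lora_forward(x):
    # With B = 0 at initialization, the output matches the frozen layer exactly.
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
assert np.allclose(lora_forward(x), W @ x)  # identical before any training
```

This is why LoRA adapters are so cheap to store and share: for a diffusion model, only the A and B factors for the adapted attention layers need to be saved, typically a few megabytes rather than gigabytes.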

Video Generation

  • Ho, J., Salimans, T., Gritsenko, A., Chan, W., Norouzi, M., & Fleet, D. J. (2022). "Video Diffusion Models." NeurIPS 2022. Extends image diffusion to video with temporal attention layers. Foundational for understanding video generation with diffusion models.

  • Blattmann, A., Dockhorn, T., Kulal, S., Mendelevitch, D., Kilian, M., Lorenz, D., ... & Rombach, R. (2023). "Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets." Applies the Stable Diffusion approach to video generation, demonstrating high-quality short video synthesis.

Surveys and Tutorials

  • Yang, L., Zhang, Z., Song, Y., Hong, S., Xu, R., Zhao, Y., ... & Yang, M. H. (2023). "Diffusion Models: A Comprehensive Survey of Methods and Applications." ACM Computing Surveys. A thorough survey covering the mathematical foundations, architectures, training techniques, and applications of diffusion models.

  • Luo, C. (2022). "Understanding Diffusion Models: A Unified Perspective." arXiv:2208.11970. An excellent tutorial that derives DDPM from the variational perspective, connecting it to VAEs and score matching in a clear, accessible manner.

Looking Ahead

The diffusion model foundations from this chapter connect to several upcoming topics:

  • Chapter 28 (Multimodal Models): CLIP, which provides the text encoder for Stable Diffusion, is explored in depth.
  • Chapter 30 (Video Understanding): Video generation with diffusion models builds directly on the spatiotemporal extensions discussed in Section 27.11.5.
  • Chapter 31 (RAG): Retrieval-augmented generation can enhance diffusion models by retrieving relevant reference images during generation.