Appendix H: Bibliography
This bibliography lists the key papers, textbooks, and resources referenced throughout this book, organized by part and topic. Entries within each section are listed alphabetically by first author. Where a work is relevant to multiple chapters, it is listed under the section where it is most prominently discussed.
Part I: Foundations (Chapters 1-6)
Textbooks and General References
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. A comprehensive reference for probabilistic machine learning, Bayesian methods, and classical techniques.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. The standard graduate-level textbook for deep learning theory and practice. Freely available at https://www.deeplearningbook.org.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. A rigorous treatment of statistical learning methods. Freely available from the authors' website.
- James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). An Introduction to Statistical Learning with Applications in Python. Springer. An accessible introduction to machine learning for practitioners.
- Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. A modern and thorough treatment of probabilistic approaches to machine learning.
- Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press. The advanced companion covering deep generative models, Bayesian deep learning, and more.
Key Papers
- Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32. The foundational paper on random forests.
- Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785-794. The paper introducing the XGBoost framework.
- Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297. The original SVM paper.
Part II: Deep Learning Foundations (Chapters 7-9)
Neural Networks
- Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4), 303-314. Proof of the universal approximation theorem for sigmoid networks.
- Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 249-256. Analysis of initialization and activation functions.
- He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 1026-1034. Kaiming initialization for ReLU networks.
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770-778. The paper introducing ResNets and skip connections.
- Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580. The dropout technique.
- Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the 32nd International Conference on Machine Learning (ICML), 448-456.
- Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR). The widely used Adam optimizer.
- Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in NeurIPS, 25, 1097-1105. The AlexNet paper that catalyzed the deep learning revolution.
- LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324. The seminal LeNet paper.
- Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization. Proceedings of ICLR. The AdamW optimizer.
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536. The backpropagation algorithm.
Recurrent Networks
- Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proceedings of EMNLP, 1724-1734. The GRU architecture.
- Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780. The LSTM architecture.
Part III: Transformers and Language Models (Chapters 10-13)
Embeddings and Tokenization
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. Proceedings of ICLR Workshop. The Word2Vec paper.
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in NeurIPS, 26, 3111-3119. Skip-gram with negative sampling.
- Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. Proceedings of EMNLP, 1532-1543.
- Sennrich, R., Haddow, B., & Birch, A. (2016). Neural machine translation of rare words with subword units. Proceedings of ACL, 1715-1725. The Byte Pair Encoding (BPE) tokenization method.
The Transformer
- Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv:1607.06450.
- Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). RoFormer: Enhanced transformer with rotary position embedding. arXiv:2104.09864. The RoPE positional encoding used in LLaMA and many modern LLMs.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Advances in NeurIPS, 30, 5998-6008. The original transformer paper and one of the most influential papers in AI history.
Pre-Training and Transfer Learning
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT, 4171-4186. The BERT model.
- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.
- Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. Proceedings of NAACL-HLT, 2227-2237. The ELMo model.
- Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1-67. The T5 model.
Contrastive and Self-Supervised Learning
- Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. Proceedings of ICML, 1597-1607. The SimCLR framework.
- Grill, J.-B., Strub, F., Altche, F., et al. (2020). Bootstrap your own latent: A new approach to self-supervised learning. Advances in NeurIPS, 33, 21271-21284. BYOL: self-supervised learning without negative pairs.
- He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. Proceedings of CVPR, 9729-9738. The MoCo framework.
Part IV: Large Language Models (Chapters 14-17)
Language Model Pre-Training
- Brown, T. B., Mann, B., Ryder, N., Subbiah, M., et al. (2020). Language models are few-shot learners. Advances in NeurIPS, 33, 1877-1901. The GPT-3 paper demonstrating in-context learning.
- Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). Training compute-optimal large language models. Advances in NeurIPS, 35, 30016-30030. The Chinchilla paper on scaling laws.
- Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). Scaling laws for neural language models. arXiv:2001.08361. Empirical scaling laws relating model size, data, and compute to loss.
- Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. OpenAI Technical Report. The GPT-1 paper.
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Technical Report. The GPT-2 paper.
- Touvron, H., Lavril, T., Izacard, G., et al. (2023). LLaMA: Open and efficient foundation language models. arXiv:2302.13971. The LLaMA model family.
- Touvron, H., Martin, L., Stone, K., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288.
Fine-Tuning and Alignment
- Bai, Y., Jones, A., Ndousse, K., et al. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv:2204.05862. Anthropic's work on RLHF for safety.
- Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv:2212.08073. The Constitutional AI approach.
- Christiano, P. F., Leike, J., Brown, T., et al. (2017). Deep reinforcement learning from human preferences. Advances in NeurIPS, 30, 4299-4307. Foundational work on RLHF.
- Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient finetuning of quantized LLMs. Advances in NeurIPS, 36. QLoRA for memory-efficient fine-tuning.
- Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-rank adaptation of large language models. Proceedings of ICLR. The LoRA method.
- Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. Advances in NeurIPS, 35, 27730-27744. The InstructGPT paper detailing RLHF for instruction following.
- Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. Advances in NeurIPS, 36. The DPO paper.
Part V: Applied AI Systems (Chapters 18-25)
Prompt Engineering and In-Context Learning
- Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large language models are zero-shot reasoners. Advances in NeurIPS, 35, 22199-22213. "Let's think step by step" zero-shot chain-of-thought.
- Wei, J., Wang, X., Schuurmans, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in NeurIPS, 35, 24824-24837. The chain-of-thought prompting paper.
- Yao, S., Zhao, J., Yu, D., et al. (2023). ReAct: Synergizing reasoning and acting in language models. Proceedings of ICLR. The ReAct framework for agents.
Retrieval-Augmented Generation
- Karpukhin, V., Oguz, B., Min, S., et al. (2020). Dense passage retrieval for open-domain question answering. Proceedings of EMNLP, 6769-6781. The DPR method for dense retrieval.
- Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in NeurIPS, 33, 9459-9474. The original RAG paper.
- Robertson, S. E., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4), 333-389. The definitive reference on BM25.
Computer Vision
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. Proceedings of ICLR. The Vision Transformer (ViT).
- Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in NeurIPS, 33, 6840-6851. The DDPM paper.
- Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning transferable visual models from natural language supervision. Proceedings of ICML, 8748-8763. The CLIP model.
- Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. Proceedings of CVPR, 10684-10695. Stable Diffusion / Latent Diffusion.
Audio and Speech
- Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in NeurIPS, 33, 12449-12460.
- Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. Proceedings of ICML, 28492-28518. The Whisper ASR model.
Part VI: Production ML Systems (Chapters 26-32)
MLOps and Deployment
- Amershi, S., Begel, A., Bird, C., et al. (2019). Software engineering for machine learning: A case study. Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice, 291-300.
- Polyzotis, N., Roy, S., Whang, S. E., & Zinkevich, M. (2018). Data lifecycle challenges in production machine learning: A survey. SIGMOD Record, 47(2), 17-28.
- Sculley, D., Holt, G., Golovin, D., et al. (2015). Hidden technical debt in machine learning systems. Advances in NeurIPS, 28, 2503-2511. The influential paper on technical debt in ML systems.
Evaluation and Safety
- Liang, P., Bommasani, R., Lee, T., et al. (2023). Holistic evaluation of language models. Annals of the New York Academy of Sciences, 1525(1), 140-146. The HELM benchmark framework.
- Perez, E., Huang, S., Song, F., et al. (2022). Red teaming language models with language models. Proceedings of EMNLP, 3419-3448.
Part VII: Deployment and Operations (Chapters 33-36)
Model Compression and Serving
- Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Re, C. (2022). FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in NeurIPS, 35, 16344-16359.
- Dao, T. (2024). FlashAttention-2: Faster attention with better parallelism and work partitioning. Proceedings of ICLR.
- Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in NeurIPS, 35, 30318-30332. LLM.int8() quantization.
- Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2023). GPTQ: Accurate post-training quantization for generative pre-trained transformers. Proceedings of ICLR. The GPTQ quantization method.
- Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv:1503.02531. The knowledge distillation paper.
- Kwon, W., Li, Z., Zhuang, S., et al. (2023). Efficient memory management for large language model serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles (SOSP), 611-626. The vLLM system.
- Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast inference from transformers via speculative decoding. Proceedings of ICML, 19274-19286.
Part VIII: Ethics, Safety, and the Future (Chapters 37-40)
Responsible AI
- Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT), 610-623.
- Bommasani, R., Hudson, D. A., Adeli, E., et al. (2021). On the opportunities and risks of foundation models. arXiv:2108.07258. A comprehensive survey of foundation model risks and opportunities.
- Mitchell, M., Wu, S., Zaldivar, A., et al. (2019). Model cards for model reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT*), 220-229.
- Weidinger, L., Mellor, J., Rauh, M., et al. (2021). Ethical and social risks of harm from language models. arXiv:2112.04359. DeepMind's taxonomy of LLM risks.
AI Safety
- Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mane, D. (2016). Concrete problems in AI safety. arXiv:1606.06565. A foundational paper organizing key AI safety research directions.
- Ngo, R., Chan, L., & Mindermann, S. (2024). The alignment problem from a deep learning perspective. Proceedings of ICLR.
Online Resources and Documentation
Official Documentation
- PyTorch Documentation: https://pytorch.org/docs/stable/
- HuggingFace Transformers: https://huggingface.co/docs/transformers/
- HuggingFace Datasets: https://huggingface.co/docs/datasets/
- HuggingFace PEFT: https://huggingface.co/docs/peft/
- LangChain: https://python.langchain.com/docs/
- LlamaIndex: https://docs.llamaindex.ai/
- vLLM: https://docs.vllm.ai/
- Weights & Biases: https://docs.wandb.ai/
Courses and Tutorials
- Stanford CS224N: Natural Language Processing with Deep Learning. https://web.stanford.edu/class/cs224n/
- Stanford CS231N: Deep Learning for Computer Vision. http://cs231n.stanford.edu/
- Stanford CS329S: Machine Learning Systems Design. https://stanford-cs329s.github.io/
- Fast.ai: Practical Deep Learning for Coders. https://course.fast.ai/
- Andrej Karpathy's Neural Networks: Zero to Hero. https://karpathy.ai/zero-to-hero.html
Blogs and Research Repositories
- The Illustrated Transformer (Jay Alammar): https://jalammar.github.io/illustrated-transformer/
- Lil'Log (Lilian Weng): https://lilianweng.github.io/ -- Excellent in-depth surveys of ML topics.
- Anthropic Research: https://www.anthropic.com/research
- OpenAI Research: https://openai.com/research
- Google DeepMind Research: https://deepmind.google/research/
- Papers With Code: https://paperswithcode.com/ -- Papers with implementations and benchmark results.
- Semantic Scholar: https://www.semanticscholar.org/ -- AI-powered academic search engine.
- arXiv (cs.CL, cs.LG, cs.AI): https://arxiv.org/ -- Pre-print server for the latest research.