Appendix H: Bibliography
This bibliography lists the key papers, textbooks, and resources referenced throughout this book, organized by part and topic. Entries within each section are listed alphabetically by first author. Where a work is relevant to multiple chapters, it is listed under the section where it is most prominently discussed.
Part I: Foundations (Chapters 1-6)
Textbooks and General References
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. A comprehensive reference for probabilistic machine learning, Bayesian methods, and classical techniques.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. The standard graduate-level textbook for deep learning theory and practice. Freely available at https://www.deeplearningbook.org.
- Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning (2nd ed.). Springer. A rigorous treatment of statistical learning methods. Freely available from the authors' website.
- James, G., Witten, D., Hastie, T., Tibshirani, R., & Taylor, J. (2023). An Introduction to Statistical Learning with Applications in Python. Springer. An accessible introduction to machine learning for practitioners.
- Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. A modern and thorough treatment of probabilistic approaches to machine learning.
- Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press. The advanced companion covering deep generative models, Bayesian deep learning, and more.
Key Papers
- Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32. The foundational paper on random forests.
- Chen, T., & Guestrin, C. (2016). XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785-794. The paper introducing the XGBoost framework.
- Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297. The original SVM paper.
Part II: Deep Learning Foundations (Chapters 7-9)
Neural Networks
- Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals, and Systems, 2(4), 303-314. Proof of the universal approximation theorem for sigmoid networks.
- Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 249-256. Analysis of initialization and activation functions.
- He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 1026-1034. Kaiming initialization for ReLU networks.
- He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770-778. The paper introducing ResNets and skip connections.
- Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. arXiv:1207.0580. The dropout technique.
- Ioffe, S., & Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. Proceedings of the 32nd International Conference on Machine Learning (ICML), 448-456.
- Kingma, D. P., & Ba, J. (2015). Adam: A method for stochastic optimization. Proceedings of the 3rd International Conference on Learning Representations (ICLR). The widely used Adam optimizer.
- Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in NeurIPS, 25, 1097-1105. The AlexNet paper that catalyzed the deep learning revolution.
- LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278-2324. The seminal LeNet paper.
- Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization. Proceedings of ICLR. The AdamW optimizer.
- Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning representations by back-propagating errors. Nature, 323(6088), 533-536. The backpropagation algorithm.
Recurrent Networks
- Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder-decoder for statistical machine translation. Proceedings of EMNLP, 1724-1734. The GRU architecture.
- Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735-1780. The LSTM architecture.
Part III: Transformers and Language Models (Chapters 10-13)
Embeddings and Tokenization
- Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. Proceedings of ICLR Workshop. The Word2Vec paper.
- Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in NeurIPS, 26, 3111-3119. Skip-gram with negative sampling.
- Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. Proceedings of EMNLP, 1532-1543.
- Sennrich, R., Haddow, B., & Birch, A. (2016). Neural machine translation of rare words with subword units. Proceedings of ACL, 1715-1725. The Byte Pair Encoding (BPE) tokenization method.
The Transformer
- Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv:1607.06450.
- Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). RoFormer: Enhanced transformer with rotary position embedding. arXiv:2104.09864. The RoPE positional encoding used in LLaMA and many modern LLMs.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. Advances in NeurIPS, 30, 5998-6008. The original transformer paper and one of the most influential papers in AI history.
Pre-Training and Transfer Learning
- Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of NAACL-HLT, 4171-4186. The BERT model.
- Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., & Stoyanov, V. (2019). RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692.
- Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. Proceedings of NAACL-HLT, 2227-2237. The ELMo model.
- Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., & Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140), 1-67. The T5 model.
Contrastive and Self-Supervised Learning
- Chen, T., Kornblith, S., Norouzi, M., & Hinton, G. (2020). A simple framework for contrastive learning of visual representations. Proceedings of ICML, 1597-1607. The SimCLR framework.
- Grill, J.-B., Strub, F., Altche, F., et al. (2020). Bootstrap your own latent: A new approach to self-supervised learning. Advances in NeurIPS, 33, 21271-21284. BYOL: self-supervised learning without negative pairs.
- He, K., Fan, H., Wu, Y., Xie, S., & Girshick, R. (2020). Momentum contrast for unsupervised visual representation learning. Proceedings of CVPR, 9729-9738. The MoCo framework.
Part IV: Large Language Models (Chapters 14-17)
Language Model Pre-Training
- Brown, T. B., Mann, B., Ryder, N., Subbiah, M., et al. (2020). Language models are few-shot learners. Advances in NeurIPS, 33, 1877-1901. The GPT-3 paper demonstrating in-context learning.
- Hoffmann, J., Borgeaud, S., Mensch, A., et al. (2022). Training compute-optimal large language models. Advances in NeurIPS, 35, 30016-30030. The Chinchilla paper on scaling laws.
- Kaplan, J., McCandlish, S., Henighan, T., et al. (2020). Scaling laws for neural language models. arXiv:2001.08361. Empirical scaling laws relating model size, data, and compute to loss.
- Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. OpenAI Technical Report. The GPT-1 paper.
- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Technical Report. The GPT-2 paper.
- Touvron, H., Lavril, T., Izacard, G., et al. (2023). LLaMA: Open and efficient foundation language models. arXiv:2302.13971. The LLaMA model family.
- Touvron, H., Martin, L., Stone, K., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv:2307.09288.
Fine-Tuning and Alignment
- Bai, Y., Jones, A., Ndousse, K., et al. (2022). Training a helpful and harmless assistant with reinforcement learning from human feedback. arXiv:2204.05862. Anthropic's work on RLHF for safety.
- Bai, Y., Kadavath, S., Kundu, S., et al. (2022). Constitutional AI: Harmlessness from AI feedback. arXiv:2212.08073. The Constitutional AI approach.
- Christiano, P. F., Leike, J., Brown, T., et al. (2017). Deep reinforcement learning from human preferences. Advances in NeurIPS, 30, 4299-4307. Foundational work on RLHF.
- Dettmers, T., Pagnoni, A., Holtzman, A., & Zettlemoyer, L. (2023). QLoRA: Efficient finetuning of quantized LLMs. Advances in NeurIPS, 36. QLoRA for memory-efficient fine-tuning.
- Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-rank adaptation of large language models. Proceedings of ICLR. The LoRA method.
- Ouyang, L., Wu, J., Jiang, X., et al. (2022). Training language models to follow instructions with human feedback. Advances in NeurIPS, 35, 27730-27744. The InstructGPT paper detailing RLHF for instruction following.
- Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct preference optimization: Your language model is secretly a reward model. Advances in NeurIPS, 36. The DPO paper.
Part V: Applied AI Systems (Chapters 18-25)
Prompt Engineering and In-Context Learning
- Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large language models are zero-shot reasoners. Advances in NeurIPS, 35, 22199-22213. "Let's think step by step" zero-shot chain-of-thought.
- Wei, J., Wang, X., Schuurmans, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in NeurIPS, 35, 24824-24837. The chain-of-thought prompting paper.
- Yao, S., Zhao, J., Yu, D., et al. (2023). ReAct: Synergizing reasoning and acting in language models. Proceedings of ICLR. The ReAct framework for agents.
Retrieval-Augmented Generation
- Karpukhin, V., Oguz, B., Min, S., et al. (2020). Dense passage retrieval for open-domain question answering. Proceedings of EMNLP, 6769-6781. The DPR method for dense retrieval.
- Lewis, P., Perez, E., Piktus, A., et al. (2020). Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in NeurIPS, 33, 9459-9474. The original RAG paper.
- Robertson, S. E., & Zaragoza, H. (2009). The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval, 3(4), 333-389. The definitive reference on BM25.
Computer Vision
- Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. Proceedings of ICLR. The Vision Transformer (ViT).
- Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in NeurIPS, 33, 6840-6851. The DDPM paper.
- Radford, A., Kim, J. W., Hallacy, C., et al. (2021). Learning transferable visual models from natural language supervision. Proceedings of ICML, 8748-8763. The CLIP model.
- Rombach, R., Blattmann, A., Lorenz, D., Esser, P., & Ommer, B. (2022). High-resolution image synthesis with latent diffusion models. Proceedings of CVPR, 10684-10695. Stable Diffusion / Latent Diffusion.
Audio and Speech
- Baevski, A., Zhou, Y., Mohamed, A., & Auli, M. (2020). wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in NeurIPS, 33, 12449-12460.
- Radford, A., Kim, J. W., Xu, T., Brockman, G., McLeavey, C., & Sutskever, I. (2023). Robust speech recognition via large-scale weak supervision. Proceedings of ICML, 28492-28518. The Whisper ASR model.
Part VI: Production ML Systems (Chapters 26-32)
MLOps and Deployment
- Amershi, S., Begel, A., Bird, C., et al. (2019). Software engineering for machine learning: A case study. Proceedings of the 41st International Conference on Software Engineering: Software Engineering in Practice, 291-300.
- Polyzotis, N., Roy, S., Whang, S. E., & Zinkevich, M. (2018). Data lifecycle challenges in production machine learning: A survey. SIGMOD Record, 47(2), 17-28.
- Sculley, D., Holt, G., Golovin, D., et al. (2015). Hidden technical debt in machine learning systems. Advances in NeurIPS, 28, 2503-2511. The influential paper on technical debt in ML systems.
Evaluation and Safety
- Liang, P., Bommasani, R., Lee, T., et al. (2023). Holistic evaluation of language models. Annals of the New York Academy of Sciences, 1525(1), 140-146. The HELM benchmark framework.
- Perez, E., Huang, S., Song, F., et al. (2022). Red teaming language models with language models. Proceedings of EMNLP, 3419-3448.
Part VII: Deployment and Operations (Chapters 33-36)
Model Compression and Serving
- Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Re, C. (2022). FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in NeurIPS, 35, 16344-16359.
- Dao, T. (2024). FlashAttention-2: Faster attention with better parallelism and work partitioning. Proceedings of ICLR.
- Dettmers, T., Lewis, M., Belkada, Y., & Zettlemoyer, L. (2022). GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in NeurIPS, 35, 30318-30332. LLM.int8() quantization.
- Frantar, E., Ashkboos, S., Hoefler, T., & Alistarh, D. (2023). GPTQ: Accurate post-training quantization for generative pre-trained transformers. Proceedings of ICLR. The GPTQ quantization method.
- Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv:1503.02531. The knowledge distillation paper.
- Kwon, W., Li, Z., Zhuang, S., et al. (2023). Efficient memory management for large language model serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles (SOSP), 611-626. The vLLM system.
- Leviathan, Y., Kalman, M., & Matias, Y. (2023). Fast inference from transformers via speculative decoding. Proceedings of ICML, 19274-19286.
Part VIII: Ethics, Safety, and the Future (Chapters 37-40)
Responsible AI
- Bender, E. M., Gebru, T., McMillan-Major, A., & Shmitchell, S. (2021). On the dangers of stochastic parrots: Can language models be too big? Proceedings of the ACM Conference on Fairness, Accountability, and Transparency (FAccT), 610-623.
- Bommasani, R., Hudson, D. A., Adeli, E., et al. (2021). On the opportunities and risks of foundation models. arXiv:2108.07258. A comprehensive survey of foundation model risks and opportunities.
- Mitchell, M., Wu, S., Zaldivar, A., et al. (2019). Model cards for model reporting. Proceedings of the Conference on Fairness, Accountability, and Transparency (FAT*), 220-229.
- Weidinger, L., Mellor, J., Rauh, M., et al. (2021). Ethical and social risks of harm from language models. arXiv:2112.04359. DeepMind's taxonomy of LLM risks.
AI Safety
- Amodei, D., Olah, C., Steinhardt, J., Christiano, P., Schulman, J., & Mane, D. (2016). Concrete problems in AI safety. arXiv:1606.06565. A foundational paper organizing key AI safety research directions.
- Ngo, R., Chan, L., & Mindermann, S. (2024). The alignment problem from a deep learning perspective. Proceedings of ICLR.
Online Resources and Documentation
Official Documentation
- PyTorch Documentation: https://pytorch.org/docs/stable/
- HuggingFace Transformers: https://huggingface.co/docs/transformers/
- HuggingFace Datasets: https://huggingface.co/docs/datasets/
- HuggingFace PEFT: https://huggingface.co/docs/peft/
- LangChain: https://python.langchain.com/docs/
- LlamaIndex: https://docs.llamaindex.ai/
- vLLM: https://docs.vllm.ai/
- Weights & Biases: https://docs.wandb.ai/
Courses and Tutorials
- Stanford CS224N: Natural Language Processing with Deep Learning. https://web.stanford.edu/class/cs224n/
- Stanford CS231N: Deep Learning for Computer Vision. http://cs231n.stanford.edu/
- Stanford CS329S: Machine Learning Systems Design. https://stanford-cs329s.github.io/
- Fast.ai: Practical Deep Learning for Coders. https://course.fast.ai/
- Andrej Karpathy's Neural Networks: Zero to Hero. https://karpathy.ai/zero-to-hero.html
Blogs and Research Repositories
- The Illustrated Transformer (Jay Alammar): https://jalammar.github.io/illustrated-transformer/
- Lil'Log (Lilian Weng): https://lilianweng.github.io/ -- Excellent in-depth surveys of ML topics.
- Anthropic Research: https://www.anthropic.com/research
- OpenAI Research: https://openai.com/research
- Google DeepMind Research: https://deepmind.google/research/
- Papers With Code: https://paperswithcode.com/ -- Papers with implementations and benchmark results.
- Semantic Scholar: https://www.semanticscholar.org/ -- AI-powered academic search engine.
- arXiv (cs.CL, cs.LG, cs.AI): https://arxiv.org/ -- Pre-print server for the latest research.