Chapter 10: Further Reading
Foundational Texts
- Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press. The modern standard reference for probabilistic ML, covering Bayesian inference, graphical models, and Gaussian processes with exceptional clarity. Freely available at https://probml.github.io/pml-book/.
- Murphy, K. P. (2023). Probabilistic Machine Learning: Advanced Topics. MIT Press. The advanced companion volume covering variational inference, MCMC, deep generative models, and Bayesian deep learning. Freely available at https://probml.github.io/pml-book/.
- Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer. Chapters 1--4 provide a masterful treatment of Bayesian inference, and Chapters 6 and 10 cover Gaussian processes and variational inference. A classic that remains deeply relevant.
- Gelman, A., Carlin, J. B., Stern, H. S., Dunson, D. B., Vehtari, A., and Rubin, D. B. (2013). Bayesian Data Analysis, 3rd ed. CRC Press. The definitive applied Bayesian statistics textbook. Essential reading for anyone doing Bayesian modeling in practice. Freely available at https://stat.columbia.edu/~gelman/book/.
- Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press. The authoritative reference on Gaussian processes, covering kernels, inference, model selection, and connections to neural networks. Freely available at https://gaussianprocess.org/gpml/.
- McElreath, R. (2020). Statistical Rethinking, 2nd ed. CRC Press. An outstanding pedagogical introduction to Bayesian inference that builds intuition through examples and simulation. Highly recommended for beginners.
Key Papers
Bayesian Inference and MCMC
- Metropolis, N., Rosenbluth, A. W., Rosenbluth, M. N., Teller, A. H., and Teller, E. (1953). "Equation of State Calculations by Fast Computing Machines." Journal of Chemical Physics, 21(6), 1087--1092. The foundational paper introducing the Metropolis algorithm for Monte Carlo sampling.
- Hastings, W. K. (1970). "Monte Carlo Sampling Methods Using Markov Chains and Their Applications." Biometrika, 57(1), 97--109. Generalizes the Metropolis algorithm to asymmetric proposals, creating the Metropolis-Hastings framework (a minimal sampler in this family is sketched after this list).
- Geman, S. and Geman, D. (1984). "Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images." IEEE Transactions on Pattern Analysis and Machine Intelligence, 6(6), 721--741. Introduces Gibbs sampling to machine learning and image processing.
- Neal, R. M. (2011). "MCMC Using Hamiltonian Dynamics." Handbook of Markov Chain Monte Carlo, Chapter 5. CRC Press. The definitive tutorial on Hamiltonian Monte Carlo, explaining the physics intuition and practical implementation.
- Hoffman, M. D. and Gelman, A. (2014). "The No-U-Turn Sampler: Adaptively Setting Path Lengths in Hamiltonian Monte Carlo." Journal of Machine Learning Research, 15, 1593--1623. Introduces NUTS, which eliminates the need to manually tune the trajectory length in HMC.
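To connect the Metropolis and Hastings papers to code, the following is a minimal sketch of a random-walk Metropolis sampler for a one-dimensional target. The standard-normal target and the proposal step size are assumptions chosen purely for illustration; because the proposal is symmetric, the Hastings correction term cancels.

```python
import numpy as np

def log_target(x):
    # Unnormalized log-density of the target; a standard normal is assumed
    # here only so the example is self-contained.
    return -0.5 * x**2

def random_walk_metropolis(n_samples=5000, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    samples = np.empty(n_samples)
    x = 0.0  # arbitrary starting point
    for i in range(n_samples):
        proposal = x + step * rng.normal()        # symmetric Gaussian proposal
        log_accept = log_target(proposal) - log_target(x)
        if np.log(rng.uniform()) < log_accept:    # Metropolis acceptance test
            x = proposal
        samples[i] = x
    return samples

draws = random_walk_metropolis()
print(draws.mean(), draws.std())  # should be roughly 0 and 1
```

Tuning `step` controls the acceptance rate; HMC and NUTS (Neal 2011; Hoffman and Gelman 2014) exist precisely to avoid the slow random-walk exploration this sampler exhibits in high dimensions.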
Variational Inference
- Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. (2017). "Variational Inference: A Review for Statisticians." Journal of the American Statistical Association, 112(518), 859--877. An accessible and thorough review of variational methods, highly recommended.
- Kingma, D. P. and Welling, M. (2014). "Auto-Encoding Variational Bayes." Proceedings of the 2nd International Conference on Learning Representations. Introduces the Variational Autoencoder (VAE), connecting variational inference with deep generative models; the reparameterization trick at its core is sketched after this list.
- Kucukelbir, A., Tran, D., Ranganath, R., Gelman, A., and Blei, D. M. (2017). "Automatic Differentiation Variational Inference." Journal of Machine Learning Research, 18, 1--45. ADVI enables automatic variational inference for arbitrary probabilistic models.
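To make the reparameterization trick concrete, the sketch below computes a single-sample Monte Carlo estimate of the ELBO for a toy model with a standard-normal prior, a unit-variance Gaussian likelihood, and a Gaussian variational posterior. The model, the observation, and the variational parameter values are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_normal_pdf(x, mean, std):
    return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

def elbo_sample(y, mu, log_sigma):
    """One-sample ELBO estimate for: prior z ~ N(0, 1), likelihood y ~ N(z, 1),
    variational posterior q(z) = N(mu, sigma^2)."""
    sigma = np.exp(log_sigma)
    eps = rng.normal()
    z = mu + sigma * eps  # reparameterization: z is a deterministic function
                          # of (mu, sigma) and the noise eps
    log_lik = log_normal_pdf(y, z, 1.0)
    log_prior = log_normal_pdf(z, 0.0, 1.0)
    log_q = log_normal_pdf(z, mu, sigma)
    return log_lik + log_prior - log_q

y_obs = 1.3  # a single made-up observation
print(np.mean([elbo_sample(y_obs, mu=0.6, log_sigma=-0.3) for _ in range(1000)]))
```

Because `z` is written as a deterministic function of the variational parameters, gradients of this estimate can flow through `mu` and `log_sigma`, which is what makes VAEs and ADVI trainable by stochastic gradient ascent.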
Gaussian Processes
- Titsias, M. (2009). "Variational Learning of Inducing Variables in Sparse Gaussian Processes." Proceedings of the 12th International Conference on Artificial Intelligence and Statistics. Introduces variational sparse GP approximation with inducing points, enabling GPs to scale to larger datasets (the exact posterior these methods approximate is sketched after this list).
- Hensman, J., Fusi, N., and Lawrence, N. D. (2013). "Gaussian Processes for Big Data." Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence. Stochastic variational inference for GPs, enabling mini-batch training on millions of data points.
- Wilson, A. G. and Adams, R. P. (2013). "Gaussian Process Kernels for Pattern Discovery and Extrapolation." Proceedings of the 30th International Conference on Machine Learning. Introduces the spectral mixture kernel, which can discover and extrapolate complex patterns.
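The papers above are about scaling Gaussian processes; the exact posterior they approximate is compact enough to write out directly. The sketch below follows the standard Cholesky-based prediction equations from Rasmussen and Williams (2006), with an RBF kernel and fixed, hand-picked hyperparameters assumed purely for illustration.

```python
import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    # Squared-exponential (RBF) kernel between two sets of 1-D inputs.
    diff = A[:, None] - B[None, :]
    return variance * np.exp(-0.5 * (diff / lengthscale) ** 2)

def gp_posterior(X_train, y_train, X_test, noise=0.1):
    """Exact GP regression: posterior mean and variance at the test inputs."""
    K = rbf_kernel(X_train, X_train) + noise**2 * np.eye(len(X_train))
    K_s = rbf_kernel(X_train, X_test)
    K_ss = rbf_kernel(X_test, X_test)
    L = np.linalg.cholesky(K)                                  # O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha
    v = np.linalg.solve(L, K_s)
    var = np.diag(K_ss) - np.sum(v**2, axis=0)
    return mean, var

X = np.linspace(0, 5, 20)
y = np.sin(X) + 0.1 * np.random.default_rng(0).normal(size=20)
mean, var = gp_posterior(X, y, np.linspace(0, 5, 100))
```

The cubic cost of the Cholesky factorization is what the inducing-point and stochastic variational approaches above are designed to avoid.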
Bayesian Neural Networks
- Blundell, C., Cornebise, J., Kavukcuoglu, K., and Wierstra, D. (2015). "Weight Uncertainty in Neural Networks." Proceedings of the 32nd International Conference on Machine Learning. Introduces "Bayes by Backprop" for learning posterior distributions over neural network weights.
- Gal, Y. and Ghahramani, Z. (2016). "Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning." Proceedings of the 33rd International Conference on Machine Learning. Shows that dropout training can be interpreted as approximate variational inference, enabling uncertainty estimation in standard deep networks; a minimal MC Dropout sketch follows this list.
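As a concrete illustration of the Gal and Ghahramani result, MC Dropout keeps dropout active at prediction time and treats the spread of repeated stochastic forward passes as an uncertainty estimate. The sketch below assumes PyTorch is installed; the architecture, dropout rate, and inputs are placeholders, and the network is untrained.

```python
import torch
import torch.nn as nn

# A small placeholder regression network with dropout layers.
model = nn.Sequential(
    nn.Linear(1, 64), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.1),
    nn.Linear(64, 1),
)

def mc_dropout_predict(model, x, n_samples=100):
    """Monte Carlo Dropout: repeated stochastic forward passes, returning the
    predictive mean and standard deviation across samples."""
    model.train()  # keep dropout stochastic at prediction time
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    return preds.mean(dim=0), preds.std(dim=0)

x_new = torch.linspace(-3, 3, 50).unsqueeze(-1)
mean, std = mc_dropout_predict(model, x_new)
```

In a real model, `model.train()` also affects batch normalization, so it is safer to switch only the dropout modules into training mode.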
Bayesian Optimization
- Snoek, J., Larochelle, H., and Adams, R. P. (2012). "Practical Bayesian Optimization of Machine Learning Algorithms." Advances in Neural Information Processing Systems, 25. The paper that popularized Bayesian optimization for hyperparameter tuning; the expected improvement acquisition function central to this line of work is sketched after this list.
- Shahriari, B., Swersky, K., Wang, Z., Adams, R. P., and de Freitas, N. (2016). "Taking the Human Out of the Loop: A Review of Bayesian Optimization." Proceedings of the IEEE, 104(1), 148--175. Comprehensive survey of Bayesian optimization methods and applications.
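Expected improvement is the workhorse acquisition function in this literature. The sketch below implements it for minimization, assuming `mu` and `sigma` are the surrogate model's posterior mean and standard deviation at a set of candidate points; the numbers in the usage example are made up.

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """Expected improvement (for minimization) at candidate points with
    surrogate posterior mean `mu`, standard deviation `sigma`, and best
    observed objective value `f_best`. `xi` encourages extra exploration."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    improvement = f_best - mu - xi
    with np.errstate(divide="ignore", invalid="ignore"):
        z = improvement / sigma
        ei = improvement * norm.cdf(z) + sigma * norm.pdf(z)
    return np.where(sigma > 0, ei, 0.0)  # zero uncertainty, zero expected gain

# Pick the next candidate to evaluate.
mu = np.array([0.8, 0.5, 0.9])
sigma = np.array([0.1, 0.3, 0.05])
next_idx = np.argmax(expected_improvement(mu, sigma, f_best=0.6))
```

Libraries such as scikit-optimize and BoTorch (see Software Libraries below) wrap this loop, including surrogate fitting and acquisition optimization.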
Online Resources and Tutorials
- Stan Documentation: https://mc-stan.org/users/documentation/ --- Stan is the leading probabilistic programming language. Its documentation includes excellent tutorials on Bayesian modeling.
- PyMC Documentation: https://www.pymc.io/welcome.html --- Python library for Bayesian modeling with NUTS sampling. Excellent example gallery; a minimal PyMC model is sketched after this list.
- NumPyro: https://num.pyro.ai/ --- Lightweight probabilistic programming with JAX for fast MCMC and variational inference.
- Distill.pub, "A Visual Exploration of Gaussian Processes": https://distill.pub/2019/visual-exploration-gp/ --- Interactive visualization that builds intuition for GP kernels and inference.
- David MacKay's Information Theory, Inference, and Learning Algorithms: http://www.inference.org.uk/mackay/itila/ --- Free textbook with excellent coverage of Bayesian methods and their connection to information theory.
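To give a flavor of what these tools look like in use, the following is a minimal PyMC model: inferring the mean and standard deviation of noisy observations. It assumes PyMC (version 4 or later) is installed; the data and priors are placeholders.

```python
import numpy as np
import pymc as pm

# Placeholder data: noisy observations around an unknown mean.
rng = np.random.default_rng(0)
data = rng.normal(loc=2.0, scale=1.0, size=50)

with pm.Model():
    mu = pm.Normal("mu", mu=0.0, sigma=10.0)       # weakly informative priors
    sigma = pm.HalfNormal("sigma", sigma=5.0)
    pm.Normal("obs", mu=mu, sigma=sigma, observed=data)
    idata = pm.sample(1000, tune=1000)             # NUTS by default

print(pm.summary(idata, var_names=["mu", "sigma"]))
```

Stan and NumPyro express the same model in their own syntax, but the workflow (specify priors and likelihood, sample with NUTS, inspect posterior summaries and diagnostics) is essentially identical.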
Software Libraries
- scikit-learn (`sklearn`): Provides `GaussianNB`, `MultinomialNB`, `BernoulliNB`, `GaussianProcessRegressor`, `GaussianProcessClassifier`, and `BayesianRidge`; a short usage sketch follows this list.
- PyMC (`pymc`): Full-featured Bayesian modeling with NUTS sampling, variational inference, and model comparison. Install with `pip install pymc`.
- Stan / PyStan / CmdStanPy: State-of-the-art MCMC sampling with the NUTS algorithm. Highly optimized for complex hierarchical models.
- NumPyro (`numpyro`): JAX-based probabilistic programming with GPU/TPU acceleration. Install with `pip install numpyro`.
- GPyTorch (`gpytorch`): Scalable GP inference using GPU-accelerated linear algebra. Integrates with PyTorch. Install with `pip install gpytorch`.
- scikit-optimize (`skopt`): Bayesian optimization with GP surrogates for hyperparameter tuning. Install with `pip install scikit-optimize`.
- BoTorch (`botorch`): Bayesian optimization built on GPyTorch and PyTorch, supporting multi-objective and constrained optimization. Install with `pip install botorch`.
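As a quick orientation to the scikit-learn estimators listed above, this sketch fits `GaussianProcessRegressor` on toy data and requests predictive standard deviations. The kernel choice and the data are placeholders.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Toy 1-D regression data.
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(30, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=30)

# RBF kernel plus a learned noise term; hyperparameters are tuned during fit()
# by maximizing the marginal likelihood.
kernel = 1.0 * RBF(length_scale=1.0) + WhiteKernel(noise_level=0.1)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True).fit(X, y)

X_new = np.linspace(0, 5, 100).reshape(-1, 1)
mean, std = gpr.predict(X_new, return_std=True)  # predictive mean and std
```

For datasets beyond a few thousand points, GPyTorch's variational GPs or the sparse approximations cited under Key Papers are the usual next step.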
Advanced Topics for Further Study
- Bayesian Nonparametrics: Models with an infinite number of parameters, including Dirichlet Process Mixtures and the Indian Buffet Process. See Hjort, Holmes, Müller, and Walker (2010), Bayesian Nonparametrics. Cambridge University Press.
- Probabilistic Programming: Languages that let you write generative models as programs and perform automatic inference. Beyond PyMC and Stan, see Pyro (Uber), Edward2 (Google), and Turing.jl (Julia).
- Bayesian Deep Learning at Scale: Practical techniques for uncertainty in production deep learning, including ensembles, MC Dropout, and last-layer Bayesian methods. See Lakshminarayanan et al. (2017), "Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles."
- Causal Inference: Bayesian methods are natural for causal reasoning. See Pearl (2009), Causality, and Peters, Janzing, and Schölkopf (2017), Elements of Causal Inference.
- Conformal Prediction: Distribution-free uncertainty quantification that provides prediction sets with guaranteed coverage. See Angelopoulos and Bates (2021), "A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification." A minimal split-conformal sketch follows this list.
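Since split conformal prediction can be written in a few lines, here is a minimal sketch of the interval construction described by Angelopoulos and Bates (2021). It assumes an already-fitted point predictor and a calibration set that the predictor was not trained on; the placeholder `predict` function and the data below are only for illustration.

```python
import numpy as np

def split_conformal_interval(predict, X_cal, y_cal, X_new, alpha=0.1):
    """Split conformal prediction intervals with (at least) 1 - alpha marginal
    coverage, given any fitted point predictor and a held-out calibration set."""
    # Nonconformity scores: absolute residuals on the calibration set.
    scores = np.abs(y_cal - predict(X_cal))
    n = len(scores)
    # Finite-sample-corrected quantile level.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q = np.quantile(scores, level, method="higher")
    preds = predict(X_new)
    return preds - q, preds + q

# Usage with a placeholder "model": the true regression function, sin(x).
rng = np.random.default_rng(0)
predict = lambda X: np.sin(X)
X_cal = rng.uniform(0, 5, 200)
y_cal = np.sin(X_cal) + 0.2 * rng.normal(size=200)
lo, hi = split_conformal_interval(predict, X_cal, y_cal, np.linspace(0, 5, 10))
```

The coverage guarantee is marginal and requires only exchangeability between calibration and test points, which is what makes the method a useful complement to the Bayesian approaches covered in this chapter.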