Chapter 22: Further Reading
Essential Sources
1. Jasper Snoek, Hugo Larochelle, and Ryan P. Adams, "Practical Bayesian Optimization of Machine Learning Algorithms" (NeurIPS 2012)
The paper that established Bayesian optimization as the standard approach to hyperparameter tuning in machine learning. Snoek et al. demonstrated that GP-based Bayesian optimization with Expected Improvement consistently outperformed manual tuning, grid search, and random search on convolutional neural networks, latent Dirichlet allocation, and structured SVM models — achieving better results in fewer evaluations. The paper introduced the Spearmint software and several practical innovations: Matern-5/2 as the default kernel (over RBF), a fully Bayesian treatment of GP hyperparameters via an integrated (marginalized) acquisition function, and parallel evaluations via the "fantasized observations" method. The experimental methodology — comparing optimization algorithms on real ML pipelines rather than synthetic test functions — set the standard for the field.
Reading guidance: Read the full paper; it is only 9 pages and remarkably well-written. Section 2 (Bayesian optimization with Gaussian processes) is a concise presentation of the framework developed in Sections 22.2-22.3 of this chapter. Section 3 (practical considerations) covers topics that textbooks often omit: handling integer and categorical hyperparameters, managing GP hyperparameters, and parallelizing evaluations. The experimental results in Section 4 remain instructive as a template for how to benchmark optimization algorithms. Follow up with Shahriari et al., "Taking the Human Out of the Loop: A Review of Bayesian Optimization" (Proceedings of the IEEE, 2016) for a comprehensive survey of the field that Snoek et al. launched.
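The GP-with-EI loop that Snoek et al. present in their Section 2 can be sketched in a few lines of NumPy. This is an illustrative toy, not Spearmint: the Matern-5/2 length scale and noise level are fixed by hand (the paper marginalizes them via MCMC), and the function and variable names are our own.

```python
import numpy as np
from math import erf, sqrt

def matern52(X1, X2, length_scale=0.3):
    # Matern-5/2, the default kernel Snoek et al. recommend over RBF.
    d = np.abs(X1[:, None] - X2[None, :]) / length_scale
    return (1 + sqrt(5) * d + 5 * d ** 2 / 3) * np.exp(-sqrt(5) * d)

def gp_posterior(X_train, y_train, X_query, noise=1e-6, length_scale=0.3):
    # Zero-mean GP regression: posterior mean and std at the query points.
    K = matern52(X_train, X_train, length_scale) + noise * np.eye(len(X_train))
    Ks = matern52(X_train, X_query, length_scale)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mu = Ks.T @ alpha
    v = np.linalg.solve(L, Ks)
    var = 1.0 - np.sum(v ** 2, axis=0)   # k(x, x) = 1 for this kernel
    return mu, np.sqrt(np.maximum(var, 1e-12))

_Phi = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2))))  # normal CDF

def expected_improvement(mu, sigma, y_best):
    # EI for maximization: (mu - y_best) * Phi(z) + sigma * phi(z).
    z = (mu - y_best) / sigma
    return (mu - y_best) * _Phi(z) + sigma * np.exp(-z ** 2 / 2) / np.sqrt(2 * np.pi)

# Propose the next point by maximizing EI over a dense grid.
X_train = np.array([0.1, 0.5, 0.9])
y_train = np.array([0.2, 1.0, 0.3])
X_query = np.linspace(0.0, 1.0, 200)
mu, sigma = gp_posterior(X_train, y_train, X_query)
x_next = X_query[np.argmax(expected_improvement(mu, sigma, y_train.max()))]
```

In a real loop one would evaluate the objective at `x_next`, append the result to the training set, and repeat; Spearmint additionally optimizes the acquisition with gradients rather than a grid.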
2. Olivier Chapelle and Lihong Li, "An Empirical Evaluation of Thompson Sampling" (NeurIPS 2011)
This paper revived interest in Thompson sampling after decades of neglect. Thompson's original 1933 algorithm was considered a heuristic curiosity until Chapelle and Li demonstrated that it consistently matched or outperformed UCB1 across a wide range of Bernoulli and Gaussian bandit problems — with no tuning parameters. The paper's key contribution is empirical: a systematic comparison of Thompson sampling, UCB1, epsilon-greedy, and related baselines across problem configurations with varying numbers of arms, reward distributions, and time horizons. The results showed that Thompson sampling's empirical regret was often 2-5x lower than UCB1's, even though at the time Thompson sampling lacked the finite-time regret guarantees that UCB1 enjoyed.
Reading guidance: Start with Section 2 (algorithms) for clean descriptions of each bandit algorithm. Section 3 (experiments) provides the empirical foundation for this chapter's claim that Thompson sampling is the dominant bandit algorithm for practice. The paper also covers Thompson sampling for Gaussian bandits (not just Bernoulli), which extends directly to the revenue-weighted credit offer problem in Case Study 2. For the theoretical underpinning, follow up with Agrawal and Goyal, "Analysis of Thompson Sampling for the Multi-armed Bandit Problem" (COLT 2012), which proved the first finite-time logarithmic regret bound for Thompson sampling; the Lai-Robbins-optimal constants that this chapter references in Section 22.4 were established in follow-up work by the same authors and by Kaufmann, Korda, and Munos (2012).
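The Beta-Bernoulli Thompson sampling algorithm that Chapelle and Li evaluate fits in a dozen lines of stdlib Python. The function name and the toy arm probabilities below are our own choices for the sketch.

```python
import random

def thompson_sampling(true_probs, horizon, seed=0):
    # Beta-Bernoulli Thompson sampling: draw one sample from each arm's
    # Beta(successes + 1, failures + 1) posterior, play the argmax, and
    # update that arm's counts. Note there are no tuning parameters.
    rng = random.Random(seed)
    k = len(true_probs)
    successes = [0] * k
    failures = [0] * k
    total_reward = 0
    for _ in range(horizon):
        samples = [rng.betavariate(successes[i] + 1, failures[i] + 1)
                   for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])
        reward = 1 if rng.random() < true_probs[arm] else 0
        successes[arm] += reward
        failures[arm] += 1 - reward
        total_reward += reward
    return total_reward, successes, failures
```

Running this on arms with success probabilities 0.1, 0.5, and 0.9 for a few thousand rounds concentrates almost all pulls on the best arm, the behavior behind the paper's regret curves.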
3. Takuya Akiba, Shotaro Sano, Toshihiko Yanase, Takeru Ohta, and Masanori Koyama, "Optuna: A Next-generation Hyperparameter Optimization Framework" (KDD 2019)
The paper introducing Optuna, which has become the standard Python library for hyperparameter optimization. The key innovations are the define-by-run API (hyperparameter spaces specified inside the objective function, enabling dynamic and conditional search spaces), the integration of multiple samplers (TPE, CMA-ES, random, grid) under a unified interface, and the pruning framework that combines early stopping algorithms (successive halving, Hyperband, median pruner) with any sampler. The paper includes benchmarks showing Optuna matching or exceeding Hyperopt, SMAC, and BOHB on standard benchmarks while providing a more flexible API.
Reading guidance: Section 3 (system design) explains the define-by-run architecture that distinguishes Optuna from earlier tools. Section 4 (benchmarks) is useful for understanding when TPE outperforms other samplers. For the TPE algorithm itself, read the original paper: Bergstra et al., "Algorithms for Hyper-Parameter Optimization" (NeurIPS 2011), which derives the $\ell(x)/g(x)$ acquisition criterion from first principles. For Hyperband, read Li et al., "Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization" (JMLR 2017), which provides the theoretical framework for the successive halving brackets used in Section 22.8.
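The successive halving bracket that underlies both Hyperband and Optuna's pruners can be sketched in pure Python. The `evaluate(config, budget)` signature below is an assumption made for the sketch, not Optuna's API; Optuna instead interleaves `trial.report` and `trial.should_prune` calls inside the objective.

```python
def successive_halving(configs, evaluate, min_budget=1, eta=3):
    # One successive-halving bracket: evaluate every config at a small
    # budget, keep the top 1/eta by loss, multiply the budget by eta,
    # and repeat until a single survivor remains.
    budget = min_budget
    survivors = list(configs)
    while len(survivors) > 1:
        scores = [(evaluate(c, budget), c) for c in survivors]
        scores.sort(key=lambda t: t[0])          # lower loss is better
        keep = max(1, len(survivors) // eta)
        survivors = [c for _, c in scores[:keep]]
        budget *= eta
    return survivors[0]

# Toy demo: configs are scalars, loss improves with budget, optimum at 0.7.
best = successive_halving([i / 10 for i in range(1, 10)],
                          lambda c, b: (c - 0.7) ** 2 + 1.0 / b)
```

Hyperband's contribution on top of this is to run several such brackets with different tradeoffs between the number of configurations and the starting budget, hedging against the risk that a low initial budget misranks slow starters.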
4. Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer, "Finite-time Analysis of the Multiarmed Bandit Problem" (Machine Learning, 2002)
The foundational paper on UCB1 and finite-time regret analysis for bandits. Auer et al. proved that UCB1 achieves logarithmic regret $O(\sum_k \ln T / \Delta_k)$ without knowledge of the reward distributions or the gaps $\Delta_k$ — a significant advance over earlier algorithms that required such knowledge. The paper introduced the "optimism in the face of uncertainty" principle: always act as if the best plausible outcome will occur, where "plausible" is defined by a concentration inequality (Hoeffding's inequality for UCB1). This principle unifies bandit algorithms with the UCB acquisition function in Bayesian optimization (Section 22.3) and provides the intellectual foundation for GP-UCB (Srinivas et al., 2010).
Reading guidance: The paper is technical but well-structured. Theorem 1 (the UCB1 regret bound) and its proof are the core contribution — the proof technique (decomposing regret into exploration and exploitation terms, then bounding the exploration term using concentration inequalities) appears throughout the bandit literature. The paper also analyzes UCB2 and an epsilon-greedy variant with a decreasing exploration schedule, and introduces UCB1-TUNED, a refined variant that sharpens the confidence bound using empirical variance estimates. For the information-theoretic lower bound that establishes logarithmic regret as optimal, see Lai and Robbins, "Asymptotically Efficient Adaptive Allocation Rules" (Advances in Applied Mathematics, 1985).
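UCB1 itself is short enough to transcribe directly from Theorem 1's statement. The sketch below uses Bernoulli rewards; the arm probabilities and the function name are our own choices.

```python
import math
import random

def ucb1(true_probs, horizon, seed=0):
    # UCB1 (Auer et al., 2002): play each arm once, then always pull the
    # arm maximizing  empirical mean + sqrt(2 ln t / n_k)  -- optimism in
    # the face of uncertainty, with the bonus width set by Hoeffding's
    # inequality.
    rng = random.Random(seed)
    k = len(true_probs)
    pulls = [0] * k
    sums = [0.0] * k
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1                      # initialization: each arm once
        else:
            arm = max(range(k), key=lambda i:
                      sums[i] / pulls[i] + math.sqrt(2 * math.log(t) / pulls[i]))
        reward = 1.0 if rng.random() < true_probs[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
    return pulls, sums
```

The logarithmic regret bound shows up directly in the pull counts: suboptimal arms are pulled on the order of ln T / gap-squared times, so the best arm dominates at any reasonable horizon.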
5. Daniel Russo and Benjamin Van Roy, "Learning to Optimize via Information-Directed Sampling" (NeurIPS 2014)
This paper provides the deepest theoretical understanding of Thompson sampling and introduces Information-Directed Sampling (IDS) as a principled alternative. Russo and Van Roy prove that Thompson sampling's regret over $T$ steps is bounded by $\sqrt{\bar{\Gamma}\, H(A^*)\, T}$, where $H(A^*)$ is the entropy of the (random) optimal action and $\bar{\Gamma}$ bounds the information ratio — a result that explains Thompson sampling's strong empirical performance through an information-theoretic lens. They also show that Thompson sampling applied to GPs (GP-Thompson sampling, mentioned in Section 22.5) achieves the same regret bounds as GP-UCB. The IDS algorithm, which minimizes the ratio of squared expected regret to information gain per step, improves upon Thompson sampling in structured problems where actions provide different amounts of information about the optimal action.
Reading guidance: Sections 1-3 provide the cleanest available explanation of why Thompson sampling works from an information-theoretic perspective. The "information ratio" concept (Definition 1) quantifies the exploration-exploitation tradeoff in a single number — a unifying perspective that connects all the algorithms in this chapter. The GP-Thompson sampling results (Section 5) justify the claim in Section 22.5 that Thompson sampling is a valid acquisition function for Bayesian optimization. For a more accessible introduction to the information-theoretic perspective, see Russo et al., "A Tutorial on Thompson Sampling" (Foundations and Trends in Machine Learning, 2018), which covers both theory and practice across 97 pages with numerous worked examples.
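GP-Thompson sampling as an acquisition rule (draw one function from the GP posterior and propose its argmax) can be sketched on a discretized domain. The RBF kernel, fixed hyperparameters, and function name below are our own simplifications for illustration.

```python
import numpy as np

def rbf(X1, X2, length_scale=0.2):
    # RBF kernel; any GP kernel would do for this sketch.
    return np.exp(-0.5 * (X1[:, None] - X2[None, :]) ** 2 / length_scale ** 2)

def gp_thompson_next_point(X_obs, y_obs, X_grid, noise=1e-6, seed=0):
    # Thompson sampling for Bayesian optimization: draw ONE function from
    # the GP posterior over a grid and propose its argmax as the next query.
    rng = np.random.default_rng(seed)
    K = rbf(X_obs, X_obs) + noise * np.eye(len(X_obs))
    Ks = rbf(X_obs, X_grid)
    K_inv = np.linalg.inv(K)              # fine for a handful of observations
    mu = Ks.T @ K_inv @ y_obs
    cov = rbf(X_grid, X_grid) - Ks.T @ K_inv @ Ks
    L = np.linalg.cholesky(cov + 1e-6 * np.eye(len(X_grid)))  # jitter for PSD
    sample = mu + L @ rng.standard_normal(len(X_grid))
    return X_grid[np.argmax(sample)]
```

Each call plays one round of the exploration-exploitation tradeoff the information ratio quantifies: regions the posterior already pins down rarely win the argmax unless their mean is high, while uncertain regions occasionally produce a sample large enough to be queried.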