Chapter 22: Key Takeaways
- Bayesian optimization and multi-armed bandits are both sequential decision problems under uncertainty — they differ only in whether the action space is continuous or discrete. Bayesian optimization uses a Gaussian process surrogate to model an unknown objective function $f(\mathbf{x})$ with uncertainty, then selects the next evaluation point via an acquisition function. Multi-armed bandits maintain posterior distributions over each arm's reward and select the next action via a selection rule such as Thompson sampling or UCB. In both cases, the core challenge is the same: balancing exploration (gathering information about uncertain regions or arms) with exploitation (acting on current knowledge to maximize reward). Understanding one framework deeply makes the other immediately accessible.
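The shared loop structure can be sketched in a few lines. Everything named below — the toy objective, the nearest-neighbor stand-in for a surrogate, the candidate grid — is an illustrative assumption, not the chapter's GP machinery; only the model-score-evaluate-update loop is the point:

```python
import numpy as np

rng = np.random.default_rng(0)

def objective(x):
    return -(x - 0.3) ** 2  # toy 1-D objective, maximum at x = 0.3

# A discrete candidate grid lets the same loop read as either a
# continuous search (BO) or a many-armed discrete problem (bandits).
candidates = np.linspace(0.0, 1.0, 101)
X, y = [], []

for t in range(20):
    if len(X) < 3:
        x_next = rng.choice(candidates)  # initial random exploration
    else:
        # Crude surrogate: nearest-neighbor mean plus a distance-based
        # uncertainty bonus, scored with a UCB-style acquisition.
        dists = np.abs(candidates[:, None] - np.array(X)[None, :])
        mu = np.array(y)[np.argmin(dists, axis=1)]
        sigma = dists.min(axis=1)
        x_next = candidates[np.argmax(mu + 2.0 * sigma)]
    X.append(x_next)          # update the "posterior" with the new point
    y.append(objective(x_next))
```

Swapping the nearest-neighbor stand-in for a GP and the bonus for a real acquisition function gives Bayesian optimization; restricting the candidates to a handful of arms with per-arm posteriors gives a bandit.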
- Acquisition functions formalize the exploration-exploitation tradeoff with different biases. Probability of Improvement (PI) asks "how likely is improvement?" — simple but overly exploitative, because it ignores the magnitude of potential gains. Expected Improvement (EI) asks "how much improvement do I expect?" — the default choice, naturally balancing exploitation (the $(\mu - f^+ - \xi)\Phi(z)$ term) with exploration (the $\sigma\phi(z)$ term). Upper Confidence Bound (UCB) is $\mu + \sqrt{\beta}\sigma$ — the most interpretable, with a tunable $\beta$ controlling the tradeoff and theoretical regret guarantees from GP-UCB. For most practical purposes, EI with $\xi \approx 0.01$ is the right default.
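The three acquisition functions can be written directly from the formulas above. A minimal sketch for a maximization problem, assuming `mu` and `sigma` are the GP posterior mean and standard deviation at a candidate and `f_best` is the incumbent $f^+$; the closing comparison is an illustrative choice of numbers, not from the chapter:

```python
from math import erf, exp, pi, sqrt

def Phi(z):  # standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def phi(z):  # standard normal PDF
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def probability_of_improvement(mu, sigma, f_best, xi=0.01):
    z = (mu - f_best - xi) / sigma
    return Phi(z)

def expected_improvement(mu, sigma, f_best, xi=0.01):
    z = (mu - f_best - xi) / sigma
    # exploitation term            + exploration term
    return (mu - f_best - xi) * Phi(z) + sigma * phi(z)

def upper_confidence_bound(mu, sigma, beta=4.0):
    return mu + sqrt(beta) * sigma

# EI's exploration term in action: a high-uncertainty candidate can
# outscore a safe candidate with a slightly better mean.
ei_sure = expected_improvement(mu=0.55, sigma=0.01, f_best=0.5)
ei_unsure = expected_improvement(mu=0.45, sigma=0.30, f_best=0.5)
```

Here `ei_unsure > ei_sure`: the wide posterior's upside outweighs its lower mean, which is exactly the behavior PI misses.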
- Thompson sampling is asymptotically optimal, requires no tuning, and is the natural Bayesian solution to exploration-exploitation. For Bernoulli bandits with Beta posteriors, Thompson sampling matches the Lai-Robbins information-theoretic lower bound on regret. Its mechanism is elegant: sample from each arm's posterior, act on the highest sample. Exploration happens automatically because uncertain arms have wide posteriors that occasionally produce high samples. As data accumulates and posteriors concentrate, the algorithm converges to pure exploitation. Unlike epsilon-greedy (linear regret with fixed $\epsilon$) or UCB (logarithmic regret but with a constant-factor gap), Thompson sampling adapts its exploration rate based purely on posterior uncertainty.
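The whole mechanism fits in a few lines for the Bernoulli case. A minimal sketch assuming Beta(1, 1) priors; the arm probabilities, horizon, and seed are illustrative choices, not the chapter's experiment:

```python
import numpy as np

rng = np.random.default_rng(42)
true_p = np.array([0.3, 0.5, 0.7])  # arm 2 is best (assumed for the demo)
alpha = np.ones(3)                  # Beta posterior: successes + 1
beta = np.ones(3)                   # Beta posterior: failures + 1
pulls = np.zeros(3, dtype=int)

for t in range(2000):
    theta = rng.beta(alpha, beta)   # one posterior sample per arm
    a = int(np.argmax(theta))       # act on the highest sample
    r = rng.random() < true_p[a]    # Bernoulli reward
    alpha[a] += r                   # conjugate posterior update
    beta[a] += 1 - r
    pulls[a] += 1
```

Early on, wide posteriors let every arm win occasionally; by the end, nearly all pulls go to the best arm and its posterior mean sits close to the true rate — exploration decays with no schedule to tune.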
- The Gaussian process is a distribution over functions — its power comes from exact posterior predictive distributions. Given observations, the GP posterior mean interpolates through the data, and the posterior variance is large far from observations and small near them. The Matérn-5/2 kernel is preferred over RBF for Bayesian optimization because it avoids the overconfidence pathology of infinitely smooth functions. The GP's computational limitation — $O(n^3)$ for $n$ observations — is rarely binding for hyperparameter optimization (50-500 evaluations) but motivates alternatives (TPE, random forests) for problems with more data.
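Both properties — interpolation at the data and variance that grows with distance — fall out of the closed-form posterior. A pure-numpy sketch with a Matern-5/2 kernel; the lengthscale, noise level, and toy data are assumptions for illustration:

```python
import numpy as np

def matern52(a, b, ls=0.2):
    # Matern-5/2: (1 + s + s^2/3) exp(-s), with s = sqrt(5) * r / ls.
    s = np.sqrt(5.0) * np.abs(a[:, None] - b[None, :]) / ls
    return (1.0 + s + s**2 / 3.0) * np.exp(-s)

X = np.array([0.1, 0.4, 0.8])       # observed inputs (toy)
y = np.sin(2 * np.pi * X)           # observed values (toy)
Xs = np.array([0.1, 0.95])          # test points: on a datum / far away
noise = 1e-6                        # jitter for numerical stability

K = matern52(X, X) + noise * np.eye(len(X))
Ks = matern52(Xs, X)

# Exact GP posterior via Cholesky: mu = Ks K^-1 y,
# cov = Kss - Ks K^-1 Ks^T  (the O(n^3) step the text mentions).
L = np.linalg.cholesky(K)
mu = Ks @ np.linalg.solve(L.T, np.linalg.solve(L, y))
v = np.linalg.solve(L, Ks.T)
var = np.diag(matern52(Xs, Xs) - v.T @ v)
```

At the observed point the posterior mean reproduces the datum and the variance collapses toward zero; at the far test point the variance stays near the prior — exactly the behavior acquisition functions exploit.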
- Optuna's TPE sampler and Hyperband pruner are the practical tools for production hyperparameter optimization. TPE inverts the surrogate model to $p(\mathbf{x} \mid y)$, fitting separate density estimates for "good" and "bad" hyperparameter configurations, which scales to higher dimensions and handles categorical/conditional parameters naturally. Hyperband prunes unpromising evaluations early, saving 30-50% of compute. Together, they outperform random search by a margin that compounds at scale: a 3% relative improvement in Hit@10 from better hyperparameters translates to hundreds of thousands of additional engagements per day at StreamRec's scale.
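The density-inversion idea behind TPE can be shown in miniature: split past trials into good and bad by a quantile, estimate a density for each, and propose the point where the ratio is highest. A toy sketch with a plain Gaussian KDE standing in for Optuna's Parzen estimators; the loss function, quantile, and bandwidth are assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

def kde(samples, xs, bw=0.1):
    # Average of Gaussians centered on the samples (toy density estimate).
    z = (xs[:, None] - samples[None, :]) / bw
    return np.exp(-0.5 * z**2).mean(axis=1) / (bw * np.sqrt(2 * np.pi))

x_hist = rng.uniform(0, 1, 50)        # past hyperparameter trials
y_hist = (x_hist - 0.7) ** 2          # toy loss, minimized at x = 0.7

cut = np.quantile(y_hist, 0.25)       # gamma: best 25% are "good"
good = x_hist[y_hist <= cut]          # samples for l(x) = p(x | good)
bad = x_hist[y_hist > cut]            # samples for g(x) = p(x | bad)

cand = np.linspace(0, 1, 201)
score = kde(good, cand) / (kde(bad, cand) + 1e-12)  # maximize l(x)/g(x)
x_next = cand[np.argmax(score)]
```

The proposal lands near the loss minimum because good trials concentrate there while bad trials leave a hole — no GP inversion required, which is why the approach scales to high-dimensional and categorical spaces.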
- Adaptive experimentation bridges bandits and A/B testing, reducing regret at the cost of statistical complexity. Thompson sampling for A/B tests shifts traffic toward the winning variant during the experiment, reducing the opportunity cost of serving the inferior variant. In the chapter's simulation, adaptive allocation reduced regret by 48.5% compared to uniform allocation. The cost is that standard confidence intervals become invalid under non-uniform, outcome-dependent allocation — requiring always-valid inference methods or inverse probability weighting. Use adaptive experimentation when regret is expensive and speed matters; use fixed allocation when statistical rigor is paramount.
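The regret comparison is easy to reproduce in miniature. A sketch for a two-variant test, uniform 50/50 split versus Thompson-sampling allocation; the conversion rates, horizon, and seed are illustrative assumptions and do not reproduce the chapter's 48.5% figure:

```python
import numpy as np

rng = np.random.default_rng(7)
p = np.array([0.10, 0.16])   # assumed conversion rates; variant B is better
T = 5000
best = p.max()

def run(adaptive):
    a_cnt = np.ones(2)       # Beta(1, 1) posterior per variant: successes + 1
    b_cnt = np.ones(2)       # failures + 1
    regret = 0.0
    for t in range(T):
        if adaptive:
            # Thompson allocation: route this visitor by posterior sample.
            arm = int(np.argmax(rng.beta(a_cnt, b_cnt)))
        else:
            arm = t % 2      # fixed uniform 50/50 allocation
        r = rng.random() < p[arm]
        a_cnt[arm] += r
        b_cnt[arm] += 1 - r
        regret += best - p[arm]   # opportunity cost of this assignment
    return regret

r_uniform = run(adaptive=False)
r_adaptive = run(adaptive=True)
```

Uniform allocation pays the full gap on half of all traffic for the whole horizon; adaptive allocation pays it only until the posteriors separate — but the resulting assignment probabilities depend on outcomes, which is precisely what breaks standard confidence intervals.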
- Thompson sampling's short-term exploration investment pays off in long-term engagement. The StreamRec M8 experiment demonstrated that Thompson sampling trails greedy by 3 percentage points in the first 20 interactions but leads by 12 percentage points by step 120 — and discovers each user's true favorite category 2x more often. The Meridian Financial experiment showed that segment-level Thompson sampling improves credit offer acceptance by 28% while simultaneously reducing default rates by 26%, because exploration reveals which offers genuinely match each segment's financial behavior. In both cases, the same principle applies: uncertainty is not a problem to minimize but a signal to act on.