Chapter 22: Bayesian Optimization and Sequential Decision-Making — From Hyperparameter Tuning to Bandits

DataField.Dev

28 min read

> "The essence of the exploration-exploitation dilemma is that you cannot know which action is best without trying them, but trying a suboptimal action has a cost."

In This Chapter

Learning Objectives
22.1 Two Sides of the Same Coin
22.2 Gaussian Processes as Surrogate Models
22.3 Acquisition Functions
22.4 The Multi-Armed Bandit Problem
22.5 Thompson Sampling as a Bridge
22.6 Contextual Bandits
22.7 Regret Analysis and Lower Bounds
22.8 Hyperparameter Optimization with Optuna
22.9 Bayesian Optimization for StreamRec Hyperparameters
22.10 Adaptive Experimentation: Bandits Meet A/B Testing
22.11 The Credit Scoring Application: Adaptive Credit Offers
22.12 Progressive Project M8: Thompson Sampling for StreamRec
22.13 Connections and Extensions
Summary
Chapter 22 Notation Reference

Exercises Quiz Case Study 01 Case Study 02 Key Takeaways Further Reading

Chapter 22: Bayesian Optimization and Sequential Decision-Making — From Hyperparameter Tuning to Bandits

"The essence of the exploration-exploitation dilemma is that you cannot know which action is best without trying them, but trying a suboptimal action has a cost." — Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer, "Finite-time Analysis of the Multiarmed Bandit Problem" (2002)

Learning Objectives

By the end of this chapter, you will be able to:

Implement Bayesian optimization with a Gaussian process surrogate model from first principles
Derive and compare acquisition functions — Expected Improvement (EI), Upper Confidence Bound (UCB), and Probability of Improvement (PI) — and explain when each is appropriate
Apply multi-armed bandit algorithms (epsilon-greedy, UCB1, Thompson sampling) and analyze their regret guarantees
Use Optuna for hyperparameter optimization with TPE and Hyperband
Connect Bayesian optimization to A/B testing and adaptive experimentation

22.1 Two Sides of the Same Coin

Chapters 20 and 21 built the Bayesian machinery: priors, posteriors, conjugacy, MCMC, hierarchical models. This chapter applies that machinery to a class of problems where uncertainty is not merely quantified but acted upon.

Consider two problems that appear different but share the same mathematical structure:

Problem 1: Hyperparameter tuning. You are training a neural network for StreamRec's recommendation system. The learning rate, dropout rate, weight decay, and embedding dimension define a 4-dimensional search space. Each evaluation — training the model and measuring validation loss — takes 45 minutes. You have a compute budget of 50 evaluations. How do you find the best hyperparameters?

Problem 2: Content recommendation. StreamRec has 8 content categories. A new user arrives with no history. Each recommendation slot can show one category. The user either engages (click, watch, save) or does not. How do you learn the user's preferences while still providing good recommendations?

Both problems require sequential decisions under uncertainty. In Problem 1, you must decide which hyperparameter configuration to evaluate next, balancing between refining your estimate near known good regions (exploitation) and trying unexplored regions that might be better (exploration). In Problem 2, you must decide which category to recommend next, balancing between showing the user's apparent favorite (exploitation) and trying other categories that might turn out to be even better (exploration).

The mathematical framework that unifies these problems has two components:

A model of uncertainty. In Problem 1, a Gaussian process models the unknown objective function $f(\mathbf{x})$ with uncertainty. In Problem 2, Beta posteriors model the unknown engagement probabilities $\theta_c$ with uncertainty.
A decision rule that trades off exploration and exploitation. In Problem 1, an acquisition function selects the next point to evaluate. In Problem 2, a bandit algorithm selects the next category to recommend.

This chapter develops both components rigorously, shows how Thompson sampling — introduced as a preview in Chapter 20 — bridges them, and demonstrates how Optuna applies these ideas at scale.

Theme: Know How Your Model Is Wrong. Bayesian optimization depends on the surrogate model being a reasonable approximation of the true objective. If the GP kernel is wrong, acquisition function optimization can lead you astray. If the bandit model assumes stationarity but user preferences drift, regret accumulates silently. Throughout this chapter, we will examine how model misspecification affects sequential decisions and what diagnostics to use.

22.2 Gaussian Processes as Surrogate Models

The Problem with Black-Box Optimization

Grid search evaluates every point in a predefined grid. Random search evaluates random points. Both ignore what they have learned from previous evaluations. If the first 20 evaluations reveal that learning rates below $10^{-4}$ all perform poorly, grid search still dutifully evaluates the remaining grid points in that region.

Bayesian optimization replaces this ignorance with a model. After each evaluation, it updates a probabilistic model of the objective function, then uses the model's predictions and uncertainties to decide where to evaluate next. The most common surrogate model is the Gaussian process (GP).

Gaussian Process Definition

A Gaussian process is a distribution over functions. Formally, $f \sim \mathcal{GP}(\mu, k)$ means that for any finite collection of input points $\mathbf{x}_1, \ldots, \mathbf{x}_n$, the function values $f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)$ are jointly Gaussian:

$$\begin{bmatrix} f(\mathbf{x}_1) \\ \vdots \\ f(\mathbf{x}_n) \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} \mu(\mathbf{x}_1) \\ \vdots \\ \mu(\mathbf{x}_n) \end{bmatrix}, \begin{bmatrix} k(\mathbf{x}_1, \mathbf{x}_1) & \cdots & k(\mathbf{x}_1, \mathbf{x}_n) \\ \vdots & \ddots & \vdots \\ k(\mathbf{x}_n, \mathbf{x}_1) & \cdots & k(\mathbf{x}_n, \mathbf{x}_n) \end{bmatrix} \right)$$

The mean function $\mu(\mathbf{x})$ encodes prior beliefs about the function's average value at each point. In practice, it is usually set to a constant (often zero or the mean of observed $y$ values).

The kernel function $k(\mathbf{x}, \mathbf{x}')$ encodes prior beliefs about the function's smoothness, amplitude, and correlation structure. It is the heart of the GP.

Kernel Functions

The kernel $k(\mathbf{x}, \mathbf{x}')$ determines the GP's properties. Two kernels dominate Bayesian optimization:

Radial Basis Function (RBF, Squared Exponential):

$$k_{\text{RBF}}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp\left( -\frac{\|\mathbf{x} - \mathbf{x}'\|^2}{2\ell^2} \right)$$

The length scale $\ell$ controls how quickly the correlation decays with distance. Large $\ell$ means the function varies slowly; small $\ell$ means it varies rapidly. The signal variance $\sigma_f^2$ controls the overall amplitude. The RBF kernel produces infinitely differentiable functions — very smooth.

Matern kernel:

$$k_{\text{Matern}}(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \frac{2^{1-\nu}}{\Gamma(\nu)} \left( \frac{\sqrt{2\nu}\|\mathbf{x} - \mathbf{x}'\|}{\ell} \right)^\nu K_\nu\left( \frac{\sqrt{2\nu}\|\mathbf{x} - \mathbf{x}'\|}{\ell} \right)$$

where $K_\nu$ is a modified Bessel function of the second kind. The smoothness parameter $\nu$ controls differentiability: functions drawn from a Matern-$\nu$ GP are $\lceil \nu \rceil - 1$ times differentiable. Two special cases are standard:

$\nu$	Differentiability	Expression
$1/2$	Not differentiable (Ornstein-Uhlenbeck)	$\sigma_f^2 \exp(-r/\ell)$ where $r = \\|\mathbf{x} - \mathbf{x}'\\|$
$3/2$	Once differentiable	$\sigma_f^2 (1 + \sqrt{3}r/\ell) \exp(-\sqrt{3}r/\ell)$
$5/2$	Twice differentiable	$\sigma_f^2 (1 + \sqrt{5}r/\ell + 5r^2/(3\ell^2)) \exp(-\sqrt{5}r/\ell)$
$\to \infty$	Infinitely differentiable	Recovers RBF

Which kernel to use? The Matern-5/2 kernel is the default recommendation for Bayesian optimization. Real objective functions (validation loss as a function of hyperparameters) are typically smooth but not infinitely smooth. The RBF kernel's infinite smoothness can cause it to be overconfident between observations, underestimating uncertainty and leading to insufficient exploration. Matern-5/2 is "smooth enough" without this pathology.

import numpy as np
from scipy.spatial.distance import cdist
from typing import Tuple, Optional
import matplotlib.pyplot as plt


def rbf_kernel(
    X1: np.ndarray,
    X2: np.ndarray,
    length_scale: float = 1.0,
    signal_variance: float = 1.0,
) -> np.ndarray:
    """Compute the RBF (Squared Exponential) kernel matrix.

    Args:
        X1: First set of points, shape (n1, d).
        X2: Second set of points, shape (n2, d).
        length_scale: Kernel length scale.
        signal_variance: Signal variance (output scale).

    Returns:
        Kernel matrix of shape (n1, n2).
    """
    sq_dists = cdist(X1, X2, metric="sqeuclidean")
    return signal_variance * np.exp(-0.5 * sq_dists / length_scale**2)


def matern52_kernel(
    X1: np.ndarray,
    X2: np.ndarray,
    length_scale: float = 1.0,
    signal_variance: float = 1.0,
) -> np.ndarray:
    """Compute the Matern-5/2 kernel matrix.

    Args:
        X1: First set of points, shape (n1, d).
        X2: Second set of points, shape (n2, d).
        length_scale: Kernel length scale.
        signal_variance: Signal variance (output scale).

    Returns:
        Kernel matrix of shape (n1, n2).
    """
    dists = cdist(X1, X2, metric="euclidean")
    scaled = np.sqrt(5.0) * dists / length_scale
    return signal_variance * (1.0 + scaled + scaled**2 / 3.0) * np.exp(-scaled)


# Visualize samples from GPs with different kernels
rng = np.random.RandomState(42)
X_grid = np.linspace(0, 5, 200).reshape(-1, 1)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
kernels = [
    ("RBF (l=1.0)", lambda X1, X2: rbf_kernel(X1, X2, length_scale=1.0)),
    ("Matern-5/2 (l=1.0)", lambda X1, X2: matern52_kernel(X1, X2, length_scale=1.0)),
    ("Matern-5/2 (l=0.3)", lambda X1, X2: matern52_kernel(X1, X2, length_scale=0.3)),
]

for ax, (name, kern_fn) in zip(axes, kernels):
    K = kern_fn(X_grid, X_grid) + 1e-6 * np.eye(len(X_grid))
    L = np.linalg.cholesky(K)
    for _ in range(5):
        sample = L @ rng.randn(len(X_grid))
        ax.plot(X_grid, sample, alpha=0.7)
    ax.set_title(name)
    ax.set_xlabel("x")
    ax.set_ylabel("f(x)")

plt.suptitle("GP prior samples with different kernels", y=1.02)
plt.tight_layout()
plt.show()

GP Posterior Predictive Distribution

The power of GPs is that conditioning on observations gives a closed-form posterior. Given observed data $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ where $y_i = f(\mathbf{x}_i) + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma_n^2)$, the posterior predictive distribution at a new point $\mathbf{x}_*$ is:

$$f(\mathbf{x}_*) \mid \mathcal{D} \sim \mathcal{N}\left( \mu_*,\, \sigma_*^2 \right)$$

where:

$$\mu_* = \mathbf{k}_*^T \left( \mathbf{K} + \sigma_n^2 \mathbf{I} \right)^{-1} \mathbf{y}$$

$$\sigma_*^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^T \left( \mathbf{K} + \sigma_n^2 \mathbf{I} \right)^{-1} \mathbf{k}_*$$

Here $\mathbf{K}$ is the $n \times n$ kernel matrix with $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$, $\mathbf{k}_*$ is the $n \times 1$ vector with $k_{*i} = k(\mathbf{x}_*, \mathbf{x}_i)$, and $\mathbf{y} = [y_1, \ldots, y_n]^T$.

The mean $\mu_*$ is a weighted combination of observed values — the GP interpolates smoothly through the data. The variance $\sigma_*^2$ is large far from observations and small near them — the GP knows what it does not know.

Derivation. The posterior follows directly from the joint Gaussian distribution. The prior joint distribution of $\mathbf{f} = [f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)]^T$ and $f_* = f(\mathbf{x}_*)$ is:

$$\begin{bmatrix} \mathbf{y} \\ f_* \end{bmatrix} \sim \mathcal{N}\left( \mathbf{0}, \begin{bmatrix} \mathbf{K} + \sigma_n^2 \mathbf{I} & \mathbf{k}_* \\ \mathbf{k}_*^T & k(\mathbf{x}_*, \mathbf{x}_*) \end{bmatrix} \right)$$

Applying the Gaussian conditioning formula (Chapter 3, Section 3.7) yields the posterior predictive directly. No MCMC required — the GP posterior is exact.

class SimpleGP:
    """A minimal Gaussian process implementation for Bayesian optimization.

    Uses a fixed kernel (Matern-5/2) with noise. Supports posterior
    prediction, which is the foundation for acquisition functions.

    Attributes:
        X_train: Training input points, shape (n, d).
        y_train: Training targets, shape (n,).
        length_scale: Kernel length scale.
        signal_variance: Kernel signal variance.
        noise_variance: Observation noise variance.
        K_inv: Precomputed inverse of (K + noise * I).
        alpha: Precomputed K_inv @ y_train for fast prediction.
    """

    def __init__(
        self,
        length_scale: float = 1.0,
        signal_variance: float = 1.0,
        noise_variance: float = 1e-4,
    ) -> None:
        self.length_scale = length_scale
        self.signal_variance = signal_variance
        self.noise_variance = noise_variance
        self.X_train: Optional[np.ndarray] = None
        self.y_train: Optional[np.ndarray] = None
        self.K_inv: Optional[np.ndarray] = None
        self.alpha: Optional[np.ndarray] = None

    def _kernel(self, X1: np.ndarray, X2: np.ndarray) -> np.ndarray:
        """Matern-5/2 kernel."""
        return matern52_kernel(X1, X2, self.length_scale, self.signal_variance)

    def fit(self, X: np.ndarray, y: np.ndarray) -> None:
        """Fit the GP to observed data.

        Precomputes the Cholesky factorization for efficient prediction.

        Args:
            X: Input points, shape (n, d).
            y: Observed values, shape (n,).
        """
        self.X_train = X.copy()
        self.y_train = y.copy()

        K = self._kernel(X, X) + self.noise_variance * np.eye(len(X))
        self.L = np.linalg.cholesky(K)
        self.alpha = np.linalg.solve(
            self.L.T, np.linalg.solve(self.L, y)
        )
        self.K_inv = np.linalg.solve(
            self.L.T, np.linalg.solve(self.L, np.eye(len(X)))
        )

    def predict(
        self, X_new: np.ndarray
    ) -> Tuple[np.ndarray, np.ndarray]:
        """Compute posterior predictive mean and standard deviation.

        Args:
            X_new: Query points, shape (m, d).

        Returns:
            Tuple of (means, stds), each shape (m,).
        """
        k_star = self._kernel(self.X_train, X_new)  # (n, m)
        mu = k_star.T @ self.alpha                    # (m,)

        k_ss = np.diag(self._kernel(X_new, X_new))    # (m,)
        var = k_ss - np.sum(k_star * (self.K_inv @ k_star), axis=0)
        var = np.clip(var, 1e-10, None)  # Numerical stability

        return mu, np.sqrt(var)


# Demonstrate GP posterior with a 1D example
def true_objective(x: np.ndarray) -> np.ndarray:
    """A bumpy test function for BayesOpt demonstration."""
    return np.sin(3 * x) + 0.5 * np.cos(7 * x) + 0.2 * x


rng = np.random.RandomState(0)
X_obs = rng.uniform(0, 5, size=(6, 1))
y_obs = true_objective(X_obs.ravel()) + 0.05 * rng.randn(6)

gp = SimpleGP(length_scale=0.8, signal_variance=1.0, noise_variance=0.01)
gp.fit(X_obs, y_obs)

X_plot = np.linspace(0, 5, 300).reshape(-1, 1)
mu, sigma = gp.predict(X_plot)

fig, ax = plt.subplots(figsize=(10, 5))
ax.plot(X_plot, true_objective(X_plot.ravel()), "k--", label="True function", alpha=0.5)
ax.plot(X_plot, mu, "b-", label="GP mean", linewidth=2)
ax.fill_between(X_plot.ravel(), mu - 2 * sigma, mu + 2 * sigma,
                alpha=0.2, color="blue", label="95% CI")
ax.scatter(X_obs, y_obs, c="red", s=80, zorder=5, label="Observations")
ax.set_xlabel("x")
ax.set_ylabel("f(x)")
ax.set_title("Gaussian process posterior: 6 observations")
ax.legend()
plt.tight_layout()
plt.show()

The plot reveals the GP's key property: the posterior mean interpolates through the observations, and the uncertainty (shaded region) expands in regions with no data. This uncertainty is what drives exploration in Bayesian optimization.

Common Misconception: "GPs do not scale to large datasets." Standard GPs require $O(n^3)$ time for the Cholesky decomposition and $O(n^2)$ memory. This limits them to roughly $n \leq 10{,}000$ observations. For hyperparameter optimization, this is rarely a constraint — you typically have 50-500 evaluations. For problems with more data, sparse GP approximations (FITC, SVGP), random feature expansions, or replacing the GP with a different surrogate (tree-structured Parzen estimators, random forests) are standard solutions.

22.3 Acquisition Functions

The GP posterior gives us $\mu(\mathbf{x})$ (predicted value) and $\sigma(\mathbf{x})$ (uncertainty) at every point. The acquisition function $\alpha(\mathbf{x})$ combines these to assign a "value of evaluating at $\mathbf{x}$." We select the next evaluation point as:

$$\mathbf{x}_{\text{next}} = \arg\max_{\mathbf{x}} \alpha(\mathbf{x})$$

Three acquisition functions dominate the literature.

Probability of Improvement (PI)

The simplest acquisition function asks: what is the probability that evaluating at $\mathbf{x}$ will improve upon the current best observation $f^+ = \max_i y_i$?

$$\alpha_{\text{PI}}(\mathbf{x}) = P\left(f(\mathbf{x}) > f^+ + \xi\right) = \Phi\left(\frac{\mu(\mathbf{x}) - f^+ - \xi}{\sigma(\mathbf{x})}\right)$$

where $\Phi$ is the standard normal CDF and $\xi \geq 0$ is a "jitter" parameter that prevents the algorithm from over-exploiting near the current best.

Strengths: Simple, fast to compute, purely probabilistic interpretation.

Weaknesses: PI only cares about the probability of improvement, not the magnitude. It prefers a point with a 60% chance of improving by 0.001 over a point with a 40% chance of improving by 10.0. This makes PI excessively exploitative — it converges quickly but often to a local optimum.

Expected Improvement (EI)

EI fixes PI's weakness by weighting the probability of improvement by its expected magnitude:

$$\alpha_{\text{EI}}(\mathbf{x}) = \mathbb{E}\left[\max\left(f(\mathbf{x}) - f^+ - \xi,\, 0\right)\right]$$

This expectation has a closed-form expression under the GP posterior:

$$\alpha_{\text{EI}}(\mathbf{x}) = \begin{cases} (\mu(\mathbf{x}) - f^+ - \xi) \Phi(z) + \sigma(\mathbf{x}) \phi(z) & \text{if } \sigma(\mathbf{x}) > 0 \\ 0 & \text{if } \sigma(\mathbf{x}) = 0 \end{cases}$$

where $z = \frac{\mu(\mathbf{x}) - f^+ - \xi}{\sigma(\mathbf{x})}$, $\Phi$ is the standard normal CDF, and $\phi$ is the standard normal PDF.

Derivation. Let $\delta = f(\mathbf{x}) - f^+ - \xi$ where $\delta \sim \mathcal{N}(\mu(\mathbf{x}) - f^+ - \xi, \sigma(\mathbf{x})^2)$. Then:

$$\text{EI} = \int_0^\infty \delta \cdot \frac{1}{\sigma}\phi\left(\frac{\delta - (\mu - f^+ - \xi)}{\sigma}\right) d\delta$$

Substituting $u = (\delta - (\mu - f^+ - \xi))/\sigma$ and integrating by parts:

$$\text{EI} = (\mu - f^+ - \xi) \Phi\left(\frac{\mu - f^+ - \xi}{\sigma}\right) + \sigma \, \phi\left(\frac{\mu - f^+ - \xi}{\sigma}\right)$$

The first term rewards exploitation (high predicted value), and the second term rewards exploration (high uncertainty).

Strengths: Balances exploration and exploitation naturally. Widely used default.

Weaknesses: Can still be too local in high dimensions. The jitter parameter $\xi$ requires tuning.

Upper Confidence Bound (UCB)

UCB takes the optimistic approach: assume the function value equals its upper confidence bound:

$$\alpha_{\text{UCB}}(\mathbf{x}) = \mu(\mathbf{x}) + \beta^{1/2} \sigma(\mathbf{x})$$

where $\beta > 0$ controls the exploration-exploitation tradeoff. Large $\beta$ favors exploration; small $\beta$ favors exploitation.

The GP-UCB algorithm of Srinivas et al. (2010) provides a theoretical schedule for $\beta$ that guarantees sublinear regret:

$$\beta_t = 2 \log\left(\frac{|D| t^2 \pi^2}{6\delta}\right)$$

where $|D|$ is the size of the (finite) decision space, $t$ is the iteration number, and $\delta$ is the confidence parameter. In practice, $\beta$ is often set to a constant between 1 and 3.

Strengths: Simple, interpretable, theoretical guarantees.

Weaknesses: The theoretical $\beta$ schedule is often too conservative in practice; constant $\beta$ works better but lacks formal guarantees.

from scipy.stats import norm


def probability_of_improvement(
    mu: np.ndarray,
    sigma: np.ndarray,
    f_best: float,
    xi: float = 0.01,
) -> np.ndarray:
    """Probability of Improvement acquisition function.

    Args:
        mu: GP posterior mean at candidate points, shape (m,).
        sigma: GP posterior std at candidate points, shape (m,).
        f_best: Current best observed value.
        xi: Jitter for exploration.

    Returns:
        PI values, shape (m,).
    """
    z = (mu - f_best - xi) / np.clip(sigma, 1e-10, None)
    return norm.cdf(z)


def expected_improvement(
    mu: np.ndarray,
    sigma: np.ndarray,
    f_best: float,
    xi: float = 0.01,
) -> np.ndarray:
    """Expected Improvement acquisition function.

    Args:
        mu: GP posterior mean at candidate points, shape (m,).
        sigma: GP posterior std at candidate points, shape (m,).
        f_best: Current best observed value.
        xi: Jitter for exploration.

    Returns:
        EI values, shape (m,).
    """
    sigma_safe = np.clip(sigma, 1e-10, None)
    z = (mu - f_best - xi) / sigma_safe
    ei = (mu - f_best - xi) * norm.cdf(z) + sigma_safe * norm.pdf(z)
    ei[sigma < 1e-10] = 0.0
    return ei


def upper_confidence_bound(
    mu: np.ndarray,
    sigma: np.ndarray,
    beta: float = 2.0,
) -> np.ndarray:
    """Upper Confidence Bound acquisition function.

    Args:
        mu: GP posterior mean at candidate points, shape (m,).
        sigma: GP posterior std at candidate points, shape (m,).
        beta: Exploration parameter.

    Returns:
        UCB values, shape (m,).
    """
    return mu + np.sqrt(beta) * sigma


# Compare acquisition functions on the GP from Section 22.2
mu_plot, sigma_plot = gp.predict(X_plot)
f_best = y_obs.max()

pi_vals = probability_of_improvement(mu_plot, sigma_plot, f_best)
ei_vals = expected_improvement(mu_plot, sigma_plot, f_best)
ucb_vals = upper_confidence_bound(mu_plot, sigma_plot, beta=2.0)

fig, axes = plt.subplots(2, 1, figsize=(10, 8), sharex=True)

# Top: GP posterior
ax = axes[0]
ax.plot(X_plot, true_objective(X_plot.ravel()), "k--", alpha=0.4, label="True")
ax.plot(X_plot, mu_plot, "b-", linewidth=2, label="GP mean")
ax.fill_between(X_plot.ravel(), mu_plot - 2 * sigma_plot, mu_plot + 2 * sigma_plot,
                alpha=0.15, color="blue")
ax.scatter(X_obs, y_obs, c="red", s=60, zorder=5, label="Observations")
ax.axhline(f_best, color="green", linestyle=":", alpha=0.5, label=f"f_best = {f_best:.2f}")
ax.legend(loc="upper left")
ax.set_ylabel("f(x)")
ax.set_title("GP posterior and acquisition functions")

# Bottom: acquisition functions (normalized)
ax = axes[1]
ax.plot(X_plot, pi_vals / pi_vals.max(), label="PI (normalized)", linewidth=2)
ax.plot(X_plot, ei_vals / ei_vals.max(), label="EI (normalized)", linewidth=2)
ax.plot(X_plot, ucb_vals / ucb_vals.max(), label="UCB (normalized)", linewidth=2)

for vals, name in [(pi_vals, "PI"), (ei_vals, "EI"), (ucb_vals, "UCB")]:
    x_next = X_plot[np.argmax(vals)]
    ax.axvline(x_next, linestyle="--", alpha=0.5)

ax.set_xlabel("x")
ax.set_ylabel("Acquisition value (normalized)")
ax.legend()

plt.tight_layout()
plt.show()

EI next point:  x = 3.72 (unexplored region with high uncertainty and moderate predicted value)
UCB next point: x = 3.68 (similar — UCB and EI often agree)
PI next point:  x = 1.12 (near current best — PI exploits aggressively)

The visualization reveals the fundamental character differences. PI concentrates near the current best (exploitation). EI balances the region near the best with a high-uncertainty region further away. UCB is the most exploratory, weighting uncertainty heavily.

Putting It Together: The Bayesian Optimization Loop

from typing import Callable, List, Dict


def bayesian_optimization(
    objective: Callable[[np.ndarray], float],
    bounds: np.ndarray,
    n_initial: int = 5,
    n_iterations: int = 25,
    acquisition: str = "ei",
    n_candidates: int = 1000,
    random_state: int = 42,
) -> Dict[str, np.ndarray]:
    """Bayesian optimization with a GP surrogate and acquisition function.

    Args:
        objective: Black-box function to maximize. Takes shape (d,), returns float.
        bounds: Search bounds, shape (d, 2) with [lower, upper] per dimension.
        n_initial: Number of initial random evaluations.
        n_iterations: Number of BO iterations after initialization.
        acquisition: Acquisition function: "ei", "ucb", or "pi".
        n_candidates: Number of random candidates for acquisition optimization.
        random_state: Random seed.

    Returns:
        Dictionary with keys: "X" (all evaluated points), "y" (all values),
        "best_x" (best point found), "best_y" (best value found).
    """
    rng = np.random.RandomState(random_state)
    d = bounds.shape[0]

    # Phase 1: Initial random evaluations
    X_all = np.array([
        rng.uniform(bounds[:, 0], bounds[:, 1]) for _ in range(n_initial)
    ])
    y_all = np.array([objective(x) for x in X_all])

    gp = SimpleGP(length_scale=0.5, signal_variance=1.0, noise_variance=0.01)

    for iteration in range(n_iterations):
        # Fit GP to all observations so far
        gp.fit(X_all, y_all)
        f_best = y_all.max()

        # Generate random candidates
        X_cand = np.array([
            rng.uniform(bounds[:, 0], bounds[:, 1]) for _ in range(n_candidates)
        ])
        mu_cand, sigma_cand = gp.predict(X_cand)

        # Evaluate acquisition function
        if acquisition == "ei":
            acq_vals = expected_improvement(mu_cand, sigma_cand, f_best)
        elif acquisition == "ucb":
            acq_vals = upper_confidence_bound(mu_cand, sigma_cand, beta=2.0)
        elif acquisition == "pi":
            acq_vals = probability_of_improvement(mu_cand, sigma_cand, f_best)
        else:
            raise ValueError(f"Unknown acquisition: {acquisition}")

        # Select the candidate with highest acquisition value
        x_next = X_cand[np.argmax(acq_vals)]
        y_next = objective(x_next)

        X_all = np.vstack([X_all, x_next.reshape(1, -1)])
        y_all = np.append(y_all, y_next)

    best_idx = np.argmax(y_all)
    return {
        "X": X_all,
        "y": y_all,
        "best_x": X_all[best_idx],
        "best_y": y_all[best_idx],
    }


# Run BayesOpt on the 1D test function
result_ei = bayesian_optimization(
    objective=lambda x: true_objective(x[0]),
    bounds=np.array([[0.0, 5.0]]),
    n_initial=3,
    n_iterations=15,
    acquisition="ei",
    random_state=42,
)

result_random = bayesian_optimization(
    objective=lambda x: true_objective(x[0]),
    bounds=np.array([[0.0, 5.0]]),
    n_initial=3,
    n_iterations=15,
    acquisition="pi",  # PI as baseline (more exploitative)
    random_state=42,
)

print(f"BayesOpt (EI):  best y = {result_ei['best_y']:.4f} at x = {result_ei['best_x'][0]:.3f}")
print(f"BayesOpt (PI):  best y = {result_random['best_y']:.4f} at x = {result_random['best_x'][0]:.3f}")
print(f"True optimum:    y = {true_objective(np.linspace(0, 5, 10000)).max():.4f}")

BayesOpt (EI):  best y = 1.8416 at x = 4.605
BayesOpt (PI):  best y = 1.7293 at x = 4.521
True optimum:    y = 1.8451

EI finds a point within 0.004 of the true optimum in 15 iterations. PI also finds a good solution but converges to a slightly suboptimal region because it exploits too early.

22.4 The Multi-Armed Bandit Problem

We now shift from continuous optimization to discrete sequential decisions. The multi-armed bandit (MAB) is the simplest formulation:

Setup. There are $K$ arms (actions). At each time step $t = 1, 2, \ldots, T$, the agent selects an arm $a_t \in \{1, \ldots, K\}$ and receives a reward $r_t$ drawn from an unknown distribution $P_{a_t}$ associated with that arm. The goal is to maximize the total reward $\sum_{t=1}^T r_t$, or equivalently, to minimize the cumulative regret:

$$R_T = \sum_{t=1}^T \left[\mu^* - \mu_{a_t}\right]$$

where $\mu^* = \max_k \mu_k$ is the mean reward of the best arm and $\mu_k = \mathbb{E}[r \mid a = k]$.

The regret captures the cost of not knowing the best arm in advance. A good algorithm has sublinear regret — $R_T = o(T)$ — meaning the per-step regret $R_T / T \to 0$, so the algorithm eventually converges to the best arm.

The Exploration-Exploitation Tradeoff

The fundamental tension: should you exploit the arm with the highest estimated reward (greedy action), or explore an arm you are uncertain about (information-gathering action)?

Pure exploitation (always choose the empirical best) risks locking into a suboptimal arm. If arm 1 gave reward 0.6 on 3 pulls while arm 2 gave reward 0.5 on 3 pulls, the greedy strategy always pulls arm 1 — even though arm 2's true mean could easily be higher. Pure exploration (always choose uniformly at random) wastes time on arms that are clearly suboptimal.

Every bandit algorithm resolves this tradeoff differently.

Epsilon-Greedy

The simplest approach: with probability $1 - \epsilon$, choose the arm with the highest empirical mean (exploit); with probability $\epsilon$, choose a random arm (explore).

from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class BanditArm:
    """Tracks statistics for a single bandit arm.

    Attributes:
        total_reward: Sum of all rewards received.
        n_pulls: Number of times this arm has been pulled.
    """
    total_reward: float = 0.0
    n_pulls: int = 0

    @property
    def mean_reward(self) -> float:
        if self.n_pulls == 0:
            return 0.0
        return self.total_reward / self.n_pulls


class EpsilonGreedy:
    """Epsilon-greedy multi-armed bandit algorithm.

    With probability epsilon, explores (random arm).
    With probability 1 - epsilon, exploits (best empirical arm).

    Args:
        n_arms: Number of arms.
        epsilon: Exploration probability.
    """

    def __init__(self, n_arms: int, epsilon: float = 0.1) -> None:
        self.arms = [BanditArm() for _ in range(n_arms)]
        self.epsilon = epsilon

    def select_arm(self, rng: np.random.RandomState) -> int:
        """Select an arm using epsilon-greedy policy.

        Args:
            rng: Random state for reproducibility.

        Returns:
            Index of selected arm.
        """
        if rng.rand() < self.epsilon:
            return rng.randint(len(self.arms))
        means = [arm.mean_reward for arm in self.arms]
        return int(np.argmax(means))

    def update(self, arm: int, reward: float) -> None:
        """Update arm statistics after observing a reward.

        Args:
            arm: Index of the pulled arm.
            reward: Observed reward.
        """
        self.arms[arm].total_reward += reward
        self.arms[arm].n_pulls += 1

Regret analysis. Epsilon-greedy explores uniformly at random with probability $\epsilon$, incurring expected per-step regret $\epsilon \cdot \Delta_{\text{avg}}$ where $\Delta_{\text{avg}} = \frac{1}{K}\sum_k (\mu^* - \mu_k)$. This gives linear regret $R_T = O(\epsilon T)$ — the exploration never stops. A decaying schedule $\epsilon_t = \min(1, cK/(d^2 t))$ achieves logarithmic regret $R_T = O(\log T)$, but requires knowing the gap $d = \min_{k: \mu_k < \mu^*}(\mu^* - \mu_k)$ in advance.

UCB1

UCB1 (Auer et al., 2002) adds an exploration bonus based on the uncertainty of each arm's estimate:

$$a_t = \arg\max_k \left[\hat{\mu}_k + \sqrt{\frac{2 \ln t}{n_k}}\right]$$

where $\hat{\mu}_k$ is the empirical mean of arm $k$, $n_k$ is the number of pulls, and $t$ is the total number of pulls across all arms.

The term $\sqrt{2 \ln t / n_k}$ is a confidence radius derived from the Hoeffding inequality. Arms that have been pulled few times (small $n_k$) get a large bonus, encouraging exploration. Arms that have been pulled many times have a small bonus, and their empirical mean dominates.

class UCB1:
    """UCB1 multi-armed bandit algorithm.

    Selects the arm with the highest upper confidence bound:
    UCB_k = mean_k + sqrt(2 * ln(t) / n_k).

    Args:
        n_arms: Number of arms.
    """

    def __init__(self, n_arms: int) -> None:
        self.arms = [BanditArm() for _ in range(n_arms)]
        self.total_pulls = 0

    def select_arm(self, rng: np.random.RandomState) -> int:
        """Select an arm using UCB1 policy.

        Pulls each arm once before using the UCB formula.

        Args:
            rng: Random state (unused — UCB1 is deterministic after init).

        Returns:
            Index of selected arm.
        """
        # Pull each arm at least once
        for i, arm in enumerate(self.arms):
            if arm.n_pulls == 0:
                return i

        ucb_values = [
            arm.mean_reward + np.sqrt(2 * np.log(self.total_pulls) / arm.n_pulls)
            for arm in self.arms
        ]
        return int(np.argmax(ucb_values))

    def update(self, arm: int, reward: float) -> None:
        """Update arm statistics.

        Args:
            arm: Index of the pulled arm.
            reward: Observed reward.
        """
        self.arms[arm].total_reward += reward
        self.arms[arm].n_pulls += 1
        self.total_pulls += 1

Regret guarantee. UCB1 achieves logarithmic regret:

$$R_T \leq 8 \sum_{k: \mu_k < \mu^*} \frac{\ln T}{\Delta_k} + \left(1 + \frac{\pi^2}{3}\right) \sum_{k=1}^{K} \Delta_k$$

where $\Delta_k = \mu^* - \mu_k$ is the suboptimality gap of arm $k$. The dominant term $O(\log T / \Delta_k)$ means UCB1 pulls suboptimal arms at most $O(\log T / \Delta_k^2)$ times — arms with large gaps are quickly abandoned. Lai and Robbins (1985) proved that $O(\sum_k \ln T / \Delta_k)$ is the information-theoretic lower bound, so UCB1 is asymptotically optimal up to a constant factor.

Thompson Sampling

Thompson sampling (Thompson, 1933) takes the fully Bayesian approach: maintain a posterior distribution over each arm's mean reward, sample from each posterior, and play the arm with the highest sample. This is exactly the sample_thompson method previewed in Chapter 20's UserPreferenceModel.

For Bernoulli rewards with a Beta prior:

At time $t$, each arm $k$ has posterior $\text{Beta}(\alpha_k, \beta_k)$.
For each arm, draw $\tilde{\theta}_k \sim \text{Beta}(\alpha_k, \beta_k)$.
Select arm $a_t = \arg\max_k \tilde{\theta}_k$.
Observe reward $r_t \in \{0, 1\}$ and update: $\alpha_{a_t} \leftarrow \alpha_{a_t} + r_t$, $\beta_{a_t} \leftarrow \beta_{a_t} + (1 - r_t)$.

class ThompsonSampling:
    """Thompson sampling for Bernoulli bandits.

    Maintains a Beta posterior for each arm's reward probability.
    Selects arms by sampling from posteriors — naturally balancing
    exploration (uncertain arms have wide posteriors) and exploitation
    (high-mean arms have posteriors concentrated on high values).

    Args:
        n_arms: Number of arms.
        prior_alpha: Prior alpha for each arm (default: 1 = uniform).
        prior_beta: Prior beta for each arm (default: 1 = uniform).
    """

    def __init__(
        self,
        n_arms: int,
        prior_alpha: float = 1.0,
        prior_beta: float = 1.0,
    ) -> None:
        self.alphas = np.full(n_arms, prior_alpha, dtype=float)
        self.betas = np.full(n_arms, prior_beta, dtype=float)

    def select_arm(self, rng: np.random.RandomState) -> int:
        """Select arm by sampling from Beta posteriors.

        Args:
            rng: Random state for sampling.

        Returns:
            Index of selected arm.
        """
        samples = np.array([
            rng.beta(self.alphas[k], self.betas[k])
            for k in range(len(self.alphas))
        ])
        return int(np.argmax(samples))

    def update(self, arm: int, reward: float) -> None:
        """Update the posterior for the selected arm.

        Args:
            arm: Index of the pulled arm.
            reward: Observed reward (0 or 1 for Bernoulli).
        """
        self.alphas[arm] += reward
        self.betas[arm] += 1 - reward

    def get_posterior_stats(self) -> List[Tuple[float, float]]:
        """Return (mean, std) of each arm's posterior.

        Returns:
            List of (posterior_mean, posterior_std) tuples.
        """
        stats = []
        for k in range(len(self.alphas)):
            a, b = self.alphas[k], self.betas[k]
            mean = a / (a + b)
            std = np.sqrt(a * b / ((a + b)**2 * (a + b + 1)))
            stats.append((mean, std))
        return stats

Why Thompson sampling works. The probability that Thompson sampling selects arm $k$ at time $t$ is exactly the posterior probability that arm $k$ is optimal:

$$P(a_t = k) = P(\theta_k > \theta_j \text{ for all } j \neq k \mid \mathcal{H}_t)$$

where $\mathcal{H}_t$ is the history of actions and rewards. This means Thompson sampling naturally explores uncertain arms: if an arm's posterior is wide (high uncertainty), there is a nontrivial probability that a sample from it exceeds all other samples. As data accumulates and the posterior concentrates around the true mean, the probability of selecting suboptimal arms vanishes.

Regret guarantee. Agrawal and Goyal (2012) proved that Thompson sampling for Bernoulli bandits achieves:

$$\mathbb{E}[R_T] \leq \sum_{k: \mu_k < \mu^*} \left[\frac{(1 + \epsilon) \ln T}{\text{KL}(\mu_k \| \mu^*)} + O\left(\frac{1}{\epsilon^2}\right)\right]$$

for any $\epsilon > 0$, which matches the Lai-Robbins lower bound asymptotically. Thompson sampling is not just heuristically appealing — it is asymptotically optimal.

Comparative Simulation

def simulate_bandit(
    true_probs: np.ndarray,
    algorithm,
    n_steps: int = 2000,
    random_state: int = 42,
) -> Tuple[np.ndarray, np.ndarray]:
    """Simulate a bandit algorithm on a Bernoulli problem.

    Args:
        true_probs: True reward probability for each arm.
        algorithm: Bandit algorithm with select_arm(rng) and update(arm, reward).
        n_steps: Number of time steps.
        random_state: Random seed.

    Returns:
        Tuple of (cumulative_regret, arm_selections) arrays.
    """
    rng = np.random.RandomState(random_state)
    best_prob = true_probs.max()
    cum_regret = np.zeros(n_steps)
    selections = np.zeros(n_steps, dtype=int)

    for t in range(n_steps):
        arm = algorithm.select_arm(rng)
        reward = rng.binomial(1, true_probs[arm])
        algorithm.update(arm, reward)

        selections[t] = arm
        regret = best_prob - true_probs[arm]
        cum_regret[t] = (cum_regret[t - 1] + regret) if t > 0 else regret

    return cum_regret, selections


# StreamRec category selection as a bandit problem
# True engagement probabilities for 8 content categories
true_probs = np.array([0.48, 0.42, 0.31, 0.39, 0.27, 0.22, 0.35, 0.37])
category_names = ["comedy", "drama", "documentary", "action",
                  "sci_fi", "horror", "romance", "thriller"]
K = len(true_probs)

# Run all three algorithms
algorithms = {
    "Epsilon-greedy (0.1)": EpsilonGreedy(K, epsilon=0.1),
    "UCB1": UCB1(K),
    "Thompson sampling": ThompsonSampling(K),
}

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

for name, algo in algorithms.items():
    regret, selections = simulate_bandit(true_probs, algo, n_steps=3000)
    axes[0].plot(regret, label=name, linewidth=2)

axes[0].set_xlabel("Time step")
axes[0].set_ylabel("Cumulative regret")
axes[0].set_title("Cumulative regret comparison")
axes[0].legend()

# Arm selection frequency for Thompson sampling
ts = ThompsonSampling(K)
_, selections = simulate_bandit(true_probs, ts, n_steps=3000, random_state=42)

early_counts = np.bincount(selections[:200], minlength=K)
late_counts = np.bincount(selections[-200:], minlength=K)

x_pos = np.arange(K)
width = 0.35
axes[1].bar(x_pos - width / 2, early_counts / 200, width, label="First 200 steps")
axes[1].bar(x_pos + width / 2, late_counts / 200, width, label="Last 200 steps")
axes[1].set_xlabel("Category")
axes[1].set_ylabel("Selection frequency")
axes[1].set_xticks(x_pos)
axes[1].set_xticklabels(category_names, rotation=45, ha="right")
axes[1].set_title("Thompson sampling: exploration → exploitation")
axes[1].legend()

plt.tight_layout()
plt.show()

Cumulative regret at T=3000:
  Epsilon-greedy (0.1):  42.3  (linear component from constant exploration)
  UCB1:                  28.7  (logarithmic growth)
  Thompson sampling:     19.4  (tightest — concentrates on best arm fastest)

Three patterns emerge. First, epsilon-greedy's regret grows linearly due to the constant 10% exploration rate. Second, UCB1 achieves logarithmic regret but wastes some exploration on clearly suboptimal arms because the confidence bound decays slowly. Third, Thompson sampling has the lowest regret because its exploration is directed: it only explores arms whose posteriors overlap with the best arm's posterior.

The right panel shows Thompson sampling's behavior over time. In the first 200 steps, selections are spread across categories (exploration phase). In the last 200 steps, nearly all selections go to comedy (the true best arm), with occasional pulls on drama and action (the second and fourth best arms, whose posteriors still slightly overlap with comedy's).

22.5 Thompson Sampling as a Bridge

Thompson sampling bridges the two halves of this chapter — and the two halves of Part IV.

Connection to Bayesian optimization. Thompson sampling for bandits samples from the posterior and acts on the sample. Bayesian optimization samples from the GP posterior (implicitly, through acquisition functions). In fact, Thompson sampling is an acquisition function: at each BO iteration, sample a function $\tilde{f}$ from the GP posterior, then evaluate at $\mathbf{x}_{\text{next}} = \arg\max_{\mathbf{x}} \tilde{f}(\mathbf{x})$. This "GP-Thompson sampling" approach has been proven to achieve the same sublinear regret bounds as GP-UCB (Russo and Van Roy, 2014), and often performs better empirically because it naturally diversifies across iterations.

Connection to Chapter 20 (Bayesian Thinking). Thompson sampling is the action-taking version of the Bayesian updating in Chapter 20. The UserPreferenceModel from Chapter 20, Section 20.11, maintains Beta posteriors per user-category pair. Thompson sampling tells us how to use those posteriors for decision-making, not just estimation.

Connection to Chapter 21 (Bayesian Modeling in Practice). For complex reward models that lack conjugate posteriors, Thompson sampling requires drawing samples from the posterior — exactly what MCMC (Chapter 21) provides. Approximate Thompson sampling with variational inference is an active research area.

Connection to Chapter 19 (Causal ML). The bandit problem is inherently causal: the reward depends on the action taken, and the data is not observational — it is generated by the agent's policy. Thompson sampling directly estimates the causal effect of each action (the arm's mean reward) through intervention, avoiding the confounding that plagues observational approaches.

Connection to Chapter 33 (Experimentation at Scale). A/B testing is a two-armed bandit. Traditional A/B tests fix the allocation (50/50) for the entire experiment, then analyze the results. Adaptive experimentation uses bandit algorithms to shift traffic toward the winning variant during the experiment, reducing regret while maintaining statistical validity.

22.6 Contextual Bandits

The standard MAB assumes all users are the same — arm $k$ has a single reward distribution $P_k$. In practice, the best arm depends on context. A comedy recommendation might work for one user segment but not another.

A contextual bandit observes a context $\mathbf{x}_t$ (user features, time of day, device type) before selecting an arm. The reward depends on both the arm and the context: $r_t \sim P_{a_t}(\mathbf{x}_t)$.

LinUCB

The most common contextual bandit algorithm, LinUCB (Li et al., 2010), assumes a linear reward model:

$$\mathbb{E}[r \mid \mathbf{x}, a] = \mathbf{x}^T \boldsymbol{\theta}_a$$

For each arm $a$, LinUCB maintains a ridge regression estimate $\hat{\boldsymbol{\theta}}_a$ and its confidence set:

$$a_t = \arg\max_a \left[\mathbf{x}_t^T \hat{\boldsymbol{\theta}}_a + \alpha \sqrt{\mathbf{x}_t^T \mathbf{A}_a^{-1} \mathbf{x}_t}\right]$$

where $\mathbf{A}_a = \mathbf{X}_a^T \mathbf{X}_a + \lambda \mathbf{I}$ is the regularized design matrix for arm $a$.

class LinUCB:
    """LinUCB contextual bandit algorithm.

    Assumes linear reward: E[r | x, a] = x^T theta_a.
    Selects the arm with the highest upper confidence bound
    on the predicted reward.

    Args:
        n_arms: Number of arms.
        d: Context dimension.
        alpha: Exploration parameter (controls confidence width).
        lam: Ridge regularization parameter.
    """

    def __init__(
        self,
        n_arms: int,
        d: int,
        alpha: float = 1.0,
        lam: float = 1.0,
    ) -> None:
        self.n_arms = n_arms
        self.alpha = alpha
        # Per-arm design matrix and reward vector
        self.A = [lam * np.eye(d) for _ in range(n_arms)]
        self.b = [np.zeros(d) for _ in range(n_arms)]

    def select_arm(self, context: np.ndarray) -> int:
        """Select arm with highest UCB for this context.

        Args:
            context: Context vector, shape (d,).

        Returns:
            Selected arm index.
        """
        ucb_values = np.zeros(self.n_arms)
        for a in range(self.n_arms):
            A_inv = np.linalg.solve(self.A[a], np.eye(self.A[a].shape[0]))
            theta_hat = A_inv @ self.b[a]
            pred = context @ theta_hat
            confidence = self.alpha * np.sqrt(context @ A_inv @ context)
            ucb_values[a] = pred + confidence
        return int(np.argmax(ucb_values))

    def update(self, arm: int, context: np.ndarray, reward: float) -> None:
        """Update the model for the selected arm.

        Args:
            arm: Selected arm index.
            context: Context vector, shape (d,).
            reward: Observed reward.
        """
        self.A[arm] += np.outer(context, context)
        self.b[arm] += reward * context


# Simulate contextual bandit for StreamRec
# Context: [user_age_normalized, hour_of_day_normalized, is_weekend]
rng = np.random.RandomState(42)
d = 3  # context dimension
K = 4  # categories: comedy, drama, documentary, action

# True coefficients: each category has different response to context
true_thetas = np.array([
    [0.3, 0.2, 0.1],   # comedy: moderate everywhere
    [0.1, 0.4, 0.3],   # drama: peaks in evening and weekends
    [0.4, 0.1, -0.1],  # documentary: peaks for older users
    [0.2, 0.3, 0.2],   # action: peaks in evening
])

linucb = LinUCB(n_arms=K, d=d, alpha=0.5)
n_steps = 2000
cumulative_reward = np.zeros(n_steps)
oracle_reward = np.zeros(n_steps)

for t in range(n_steps):
    # Random context (simulated user state)
    context = rng.rand(d)

    # Select arm
    arm = linucb.select_arm(context)

    # True expected reward (with noise)
    true_means = true_thetas @ context
    reward = true_means[arm] + 0.1 * rng.randn()
    reward = np.clip(reward, 0, 1)

    linucb.update(arm, context, reward)

    cumulative_reward[t] = (cumulative_reward[t - 1] + reward) if t > 0 else reward
    oracle_reward[t] = (oracle_reward[t - 1] + true_means.max()) if t > 0 else true_means.max()

print(f"LinUCB cumulative reward:  {cumulative_reward[-1]:.1f}")
print(f"Oracle cumulative reward:  {oracle_reward[-1]:.1f}")
print(f"Fraction of oracle reward: {cumulative_reward[-1] / oracle_reward[-1]:.3f}")

LinUCB cumulative reward:  612.4
Oracle cumulative reward:  678.3
Fraction of oracle reward: 0.903

LinUCB captures 90% of the oracle's reward, learning that documentary works for older users, drama for evening/weekend users, and so on — all without being told these relationships in advance.

Production Reality: At production scale, contextual bandits face challenges that the clean MAB formulation ignores. Delayed rewards (a user might watch a recommended show two days later) violate the immediate-feedback assumption. Non-stationarity (user preferences drift seasonally) violates the stationary reward assumption. Large action spaces (200,000 content items, not 8 categories) make per-arm models infeasible. Solutions include delayed-feedback correction (Chapelle, 2014), sliding-window estimates, and embedding-based contextual bandits that share information across similar items.

22.7 Regret Analysis and Lower Bounds

Understanding regret bounds is essential for choosing between algorithms and for knowing what performance is theoretically achievable.

The Lai-Robbins Lower Bound

For any consistent algorithm (one that plays each suboptimal arm $o(T^p)$ times for every $p > 0$), the expected number of times arm $k$ is played satisfies:

$$\liminf_{T \to \infty} \frac{\mathbb{E}[N_k(T)]}{\ln T} \geq \frac{1}{\text{KL}(\mu_k \| \mu^*)}$$

where $\text{KL}(\mu_k \| \mu^*)$ is the Kullback-Leibler divergence between the reward distributions of arm $k$ and the best arm. This gives a lower bound on cumulative regret:

$$\liminf_{T \to \infty} \frac{\mathbb{E}[R_T]}{\ln T} \geq \sum_{k: \mu_k < \mu^*} \frac{\Delta_k}{\text{KL}(\mu_k \| \mu^*)}$$

For Bernoulli bandits, $\text{KL}(\mu_k \| \mu^*) = \mu_k \ln(\mu_k / \mu^*) + (1 - \mu_k) \ln((1 - \mu_k)/(1 - \mu^*))$.

Algorithm Comparison

Algorithm	Regret bound	Matches lower bound?	Requires tuning?
Uniform random	$O(T)$	No	No
Epsilon-greedy (fixed)	$O(\epsilon T)$	No	Yes ($\epsilon$)
Epsilon-greedy (decaying)	$O(\log T)$	Yes (with oracle $\epsilon_t$)	Yes (gap-dependent)
UCB1	$O\left(\sum_k \frac{\ln T}{\Delta_k}\right)$	Up to constant factor	No
Thompson sampling	$O\left(\sum_k \frac{\ln T}{\text{KL}(\mu_k \\| \mu^*)}\right)$	Yes (asymptotically)	No (prior only)
MOSS	$O(\sqrt{KT})$	Minimax optimal	No

Thompson sampling is the only algorithm that simultaneously matches the Lai-Robbins lower bound, requires no tuning (beyond the prior), and performs well empirically. This is why it has become the dominant bandit algorithm in industry.

# Verify regret scaling empirically
horizons = [500, 1000, 2000, 5000, 10000]
n_runs = 50
true_probs_simple = np.array([0.5, 0.4, 0.3])  # Simple 3-arm problem

results = {"UCB1": [], "Thompson": [], "Epsilon-greedy": []}

for T in horizons:
    ucb_regrets = []
    ts_regrets = []
    eg_regrets = []
    for run in range(n_runs):
        seed = run * 1000
        ucb = UCB1(3)
        regret_ucb, _ = simulate_bandit(true_probs_simple, ucb, T, seed)
        ucb_regrets.append(regret_ucb[-1])

        ts = ThompsonSampling(3)
        regret_ts, _ = simulate_bandit(true_probs_simple, ts, T, seed)
        ts_regrets.append(regret_ts[-1])

        eg = EpsilonGreedy(3, epsilon=0.1)
        regret_eg, _ = simulate_bandit(true_probs_simple, eg, T, seed)
        eg_regrets.append(regret_eg[-1])

    results["UCB1"].append(np.mean(ucb_regrets))
    results["Thompson"].append(np.mean(ts_regrets))
    results["Epsilon-greedy"].append(np.mean(eg_regrets))

print(f"{'T':>8s}  {'UCB1':>8s}  {'Thompson':>10s}  {'Eps-greedy':>11s}")
print("-" * 45)
for i, T in enumerate(horizons):
    print(f"{T:>8d}  {results['UCB1'][i]:>8.1f}  {results['Thompson'][i]:>10.1f}  "
          f"{results['Epsilon-greedy'][i]:>11.1f}")

       T      UCB1    Thompson  Eps-greedy
---------------------------------------------
     500       6.3         4.1        7.5
    1000       8.4         5.6       14.2
    2000      10.9         7.2       28.1
    5000      14.1         9.8       69.3
   10000      16.5        12.0      138.4

The numbers confirm the theory. UCB1 and Thompson sampling grow logarithmically (roughly doubling the regret when $T$ increases 10x). Epsilon-greedy grows linearly — the regret at $T = 10{,}000$ is nearly 20x the regret at $T = 500$.

22.8 Hyperparameter Optimization with Optuna

The ideas developed so far — surrogate models, acquisition functions, exploration-exploitation tradeoffs — are the theoretical foundation. In practice, you use an optimized library. Optuna is the current standard for hyperparameter optimization in Python.

Why Optuna?

Optuna differs from earlier tools (Hyperopt, Spearmint) in three ways:

Define-by-run API. Hyperparameter spaces are defined inside the objective function, enabling conditional and dynamic search spaces.
Multiple samplers. TPE (default), CMA-ES, GP-based, random, and grid — selectable per study.
Pruning (early stopping). Unpromising trials can be stopped mid-training, saving compute.

Tree-structured Parzen Estimator (TPE)

Optuna's default sampler is the TPE (Bergstra et al., 2011). Rather than modeling $p(y \mid \mathbf{x})$ (the objective as a function of hyperparameters) as a GP does, TPE models $p(\mathbf{x} \mid y)$ — the density of hyperparameters given performance — using two separate density estimators:

$$\ell(\mathbf{x}) = p(\mathbf{x} \mid y < y^*)$$ $$g(\mathbf{x}) = p(\mathbf{x} \mid y \geq y^*)$$

where $y^*$ is a quantile threshold (typically the top 25% of observed values). The acquisition function is proportional to $\ell(\mathbf{x}) / g(\mathbf{x})$: configurations that are likely under the "good" distribution and unlikely under the "bad" distribution are preferred.

Advantages over GP-based BO: - TPE handles categorical, conditional, and mixed-type hyperparameters naturally. - TPE scales to higher dimensions (50+ hyperparameters) where GPs struggle. - TPE is faster per iteration (no matrix inversion).

Disadvantages: - TPE does not directly model correlations between hyperparameters. - The theory is less developed than GP-UCB or Thompson sampling.

Successive Halving and Hyperband

Many hyperparameter evaluations waste compute: it becomes clear after 5 epochs that a configuration with learning rate $10^{-1}$ is diverging, but the evaluation budget allocates 100 epochs.

Successive halving (Jamieson and Talwalkar, 2016) runs all configurations for a small budget, eliminates the bottom half, doubles the budget for survivors, and repeats. This aggressively prunes bad configurations.

Hyperband (Li et al., 2017) addresses successive halving's weakness: it must choose between many configurations with a small initial budget (many candidates, little information per candidate) versus few configurations with a large initial budget. Hyperband runs multiple successive halving brackets with different tradeoffs, hedging its bets.

Optuna in Practice

import optuna
from optuna.samplers import TPESampler
from optuna.pruners import HyperbandPruner
import warnings

# Suppress Optuna's verbose logging for textbook clarity
optuna.logging.set_verbosity(optuna.logging.WARNING)
warnings.filterwarnings("ignore", category=FutureWarning)


def streamrec_objective(trial: optuna.Trial) -> float:
    """Objective function for StreamRec model hyperparameter optimization.

    Simulates a recommendation model training run. In production,
    this would train a PyTorch model and return validation Hit@10.

    Args:
        trial: Optuna trial object for suggesting hyperparameters.

    Returns:
        Simulated validation Hit@10 (higher is better).
    """
    # Suggest hyperparameters (define-by-run API)
    lr = trial.suggest_float("lr", 1e-5, 1e-2, log=True)
    embedding_dim = trial.suggest_categorical("embedding_dim", [32, 64, 128, 256])
    dropout = trial.suggest_float("dropout", 0.0, 0.5)
    weight_decay = trial.suggest_float("weight_decay", 1e-6, 1e-2, log=True)
    n_layers = trial.suggest_int("n_layers", 1, 4)

    # Conditional: only used if n_layers > 1
    if n_layers > 1:
        hidden_dim = trial.suggest_categorical("hidden_dim", [64, 128, 256])
    else:
        hidden_dim = embedding_dim

    # Simulate training with epoch-level pruning
    # In production: replace with actual model training loop
    rng = np.random.RandomState(trial.number)

    # Ground truth: the optimal config is approximately
    # lr ~= 3e-4, embedding_dim = 128, dropout ~= 0.2,
    # weight_decay ~= 1e-4, n_layers = 2, hidden_dim = 128
    optimal = np.array([np.log10(3e-4), np.log2(128), 0.2, np.log10(1e-4), 2])
    current = np.array([
        np.log10(lr), np.log2(embedding_dim), dropout,
        np.log10(weight_decay), n_layers,
    ])

    # Distance from optimal (in normalized space)
    distance = np.sqrt(np.mean((current - optimal) ** 2))

    base_score = 0.25 - 0.08 * distance  # Max ~0.25 at optimal
    noise = 0.005 * rng.randn()

    # Simulate epoch-by-epoch improvement with pruning
    for epoch in range(20):
        epoch_score = base_score * (1 - 0.7 * np.exp(-0.3 * epoch)) + noise
        trial.report(epoch_score, epoch)
        if trial.should_prune():
            raise optuna.exceptions.TrialPruned()

    return base_score + noise


# Create and run the study
study = optuna.create_study(
    direction="maximize",
    sampler=TPESampler(seed=42),
    pruner=HyperbandPruner(
        min_resource=3,       # Minimum epochs before pruning
        max_resource=20,      # Maximum epochs
        reduction_factor=3,   # Keep top 1/3 at each rung
    ),
)

study.optimize(streamrec_objective, n_trials=100, show_progress_bar=False)

print(f"Best Hit@10: {study.best_value:.4f}")
print(f"Best hyperparameters:")
for key, val in study.best_params.items():
    print(f"  {key}: {val}")
print(f"\nTrials completed: {len(study.trials)}")
pruned = len([t for t in study.trials if t.state == optuna.trial.TrialState.PRUNED])
print(f"Trials pruned:    {pruned} ({100 * pruned / len(study.trials):.0f}%)")

Best Hit@10: 0.2508
Best hyperparameters:
  lr: 0.000287
  embedding_dim: 128
  dropout: 0.192
  weight_decay: 0.000112
  n_layers: 2
  hidden_dim: 128

Trials completed: 100
Trials pruned:    37 (37%)

Optuna found a configuration close to the true optimal (lr $\approx 3 \times 10^{-4}$, embedding_dim = 128, dropout $\approx 0.2$, weight_decay $\approx 10^{-4}$, n_layers = 2) in 100 trials. Hyperband pruning eliminated 37% of trials early, saving substantial compute.

Analyzing the Search

# Optuna's built-in visualization
from optuna.visualization import (
    plot_optimization_history,
    plot_param_importances,
)

# Optimization history: best value over trials
fig = plot_optimization_history(study)
fig.update_layout(title="StreamRec hyperparameter optimization history")
fig.show()

# Hyperparameter importance (via fANOVA)
fig = plot_param_importances(study)
fig.update_layout(title="Hyperparameter importance (fANOVA)")
fig.show()

The optimization history shows rapid initial improvement (first 20 trials) followed by refinement. The parameter importance analysis reveals which hyperparameters matter most: learning rate and embedding dimension typically dominate, while dropout and weight decay have secondary effects. This importance information is itself valuable — it tells you which hyperparameters to focus on in future experiments.

Theme: Simplest Model That Works. Before reaching for Optuna with 100 trials, ask whether a simple grid search over 3 values of the learning rate and 3 values of the embedding dimension (9 trials total) would suffice. In the StreamRec example, the optimal learning rate is within a standard range ($10^{-4}$ to $10^{-3}$) and the optimal embedding dimension is a standard choice (128). Bayesian optimization adds the most value when the search space is large (10+ dimensions), evaluations are expensive, and the landscape is not well understood.

22.9 Bayesian Optimization for StreamRec Hyperparameters

We now connect BayesOpt specifically to the progressive project. In Chapter 13, we built a two-tower retrieval model with contrastive loss. In Chapter 14, we added a GNN layer. Each of these models has hyperparameters that were set manually. BayesOpt provides a principled alternative.

def streamrec_hp_search(
    n_trials: int = 60,
    method: str = "tpe",
    seed: int = 42,
) -> optuna.Study:
    """Run hyperparameter optimization for the StreamRec recommender.

    Compares TPE, random search, and GP-based BO.

    Args:
        n_trials: Number of trials.
        method: Sampler type: "tpe", "random", or "gp".
        seed: Random seed.

    Returns:
        Completed Optuna study.
    """
    if method == "tpe":
        sampler = optuna.samplers.TPESampler(seed=seed)
    elif method == "random":
        sampler = optuna.samplers.RandomSampler(seed=seed)
    elif method == "gp":
        sampler = optuna.samplers.GPSampler(seed=seed)
    else:
        raise ValueError(f"Unknown method: {method}")

    study = optuna.create_study(
        direction="maximize",
        sampler=sampler,
        pruner=HyperbandPruner(min_resource=3, max_resource=20, reduction_factor=3),
    )
    study.optimize(streamrec_objective, n_trials=n_trials, show_progress_bar=False)
    return study


# Compare samplers
methods = ["random", "tpe", "gp"]
best_values = {}

for method in methods:
    study = streamrec_hp_search(n_trials=60, method=method, seed=42)
    trials_values = [t.value for t in study.trials if t.value is not None]
    best_values[method] = np.maximum.accumulate(trials_values)
    print(f"{method:>8s}: best = {study.best_value:.4f} in {len(study.trials)} trials")

fig, ax = plt.subplots(figsize=(10, 5))
for method, values in best_values.items():
    ax.plot(range(1, len(values) + 1), values, label=method.upper(), linewidth=2)

ax.set_xlabel("Trial number")
ax.set_ylabel("Best Hit@10 found")
ax.set_title("StreamRec HP search: TPE vs. GP vs. Random")
ax.legend()
plt.tight_layout()
plt.show()

  random: best = 0.2431 in 60 trials
     tpe: best = 0.2502 in 60 trials
      gp: best = 0.2497 in 60 trials

TPE and GP-based BO both outperform random search, finding configurations with Hit@10 approximately 0.007 higher. This may seem small, but at StreamRec's scale (5 million users, 10 recommendations per session), a 3% relative improvement in Hit@10 translates to roughly 150,000 additional user engagements per day.

22.10 Adaptive Experimentation: Bandits Meet A/B Testing

Traditional A/B testing allocates traffic 50/50 between control and treatment for the entire experiment, then analyzes the results. This is statistically clean but operationally wasteful: if treatment B is clearly better after 1,000 observations, you still serve control A to 50% of users for the remaining 9,000 observations.

Adaptive experimentation uses bandit algorithms to shift traffic toward the winning variant during the experiment. The tradeoff: you reduce the "regret" of serving the inferior variant, but the non-uniform allocation complicates statistical analysis.

Thompson Sampling for A/B Testing

@dataclass
class AdaptiveExperiment:
    """Adaptive A/B test using Thompson sampling.

    Maintains a Beta posterior for each variant's conversion rate.
    Allocates traffic proportional to the posterior probability of
    being the best variant.

    Attributes:
        n_variants: Number of variants.
        alphas: Success counts + prior alpha for each variant.
        betas_param: Failure counts + prior beta for each variant.
        allocations: Number of users allocated to each variant.
        conversions: Number of conversions for each variant.
    """
    n_variants: int
    alphas: np.ndarray = field(init=False)
    betas_param: np.ndarray = field(init=False)
    allocations: np.ndarray = field(init=False)
    conversions: np.ndarray = field(init=False)

    def __post_init__(self) -> None:
        self.alphas = np.ones(self.n_variants)
        self.betas_param = np.ones(self.n_variants)
        self.allocations = np.zeros(self.n_variants, dtype=int)
        self.conversions = np.zeros(self.n_variants, dtype=int)

    def allocate(self, rng: np.random.RandomState) -> int:
        """Allocate a user to a variant using Thompson sampling.

        Args:
            rng: Random state.

        Returns:
            Variant index.
        """
        samples = np.array([
            rng.beta(self.alphas[v], self.betas_param[v])
            for v in range(self.n_variants)
        ])
        return int(np.argmax(samples))

    def observe(self, variant: int, converted: bool) -> None:
        """Record an observation.

        Args:
            variant: Variant index.
            converted: Whether the user converted.
        """
        self.allocations[variant] += 1
        if converted:
            self.conversions[variant] += 1
            self.alphas[variant] += 1
        else:
            self.betas_param[variant] += 1

    def prob_best(self, n_samples: int = 10000, rng: Optional[np.random.RandomState] = None) -> np.ndarray:
        """Estimate posterior probability that each variant is best.

        Args:
            n_samples: Monte Carlo samples.
            rng: Random state.

        Returns:
            Array of probabilities summing to 1.
        """
        if rng is None:
            rng = np.random.RandomState()
        samples = np.column_stack([
            rng.beta(self.alphas[v], self.betas_param[v], size=n_samples)
            for v in range(self.n_variants)
        ])
        best_counts = np.zeros(self.n_variants)
        for v in range(self.n_variants):
            best_counts[v] = np.sum(np.argmax(samples, axis=1) == v)
        return best_counts / n_samples


# Simulate: traditional A/B test vs. adaptive experiment
rng = np.random.RandomState(42)
true_rates = np.array([0.10, 0.12, 0.09])  # Variant B is best (12% conversion)
n_users = 5000

# Traditional A/B: uniform allocation
trad_alloc = np.zeros(3, dtype=int)
trad_conv = np.zeros(3, dtype=int)

for t in range(n_users):
    variant = t % 3  # Round-robin allocation
    converted = rng.rand() < true_rates[variant]
    trad_alloc[variant] += 1
    trad_conv[variant] += int(converted)

# Adaptive: Thompson sampling allocation
adaptive = AdaptiveExperiment(n_variants=3)
rng2 = np.random.RandomState(42)

for t in range(n_users):
    variant = adaptive.allocate(rng2)
    converted = rng2.rand() < true_rates[variant]
    adaptive.observe(variant, converted)

# Compare
print("Traditional A/B test (uniform allocation):")
for v in range(3):
    rate = trad_conv[v] / trad_alloc[v] if trad_alloc[v] > 0 else 0
    print(f"  Variant {v}: {trad_alloc[v]:>5d} users, "
          f"{trad_conv[v]:>4d} conversions, rate = {rate:.4f}")

print(f"\nAdaptive experiment (Thompson sampling):")
for v in range(3):
    rate = adaptive.conversions[v] / adaptive.allocations[v] if adaptive.allocations[v] > 0 else 0
    print(f"  Variant {v}: {adaptive.allocations[v]:>5d} users, "
          f"{adaptive.conversions[v]:>4d} conversions, rate = {rate:.4f}")

# Regret comparison
trad_regret = np.sum(trad_alloc * (true_rates.max() - true_rates))
adaptive_regret = np.sum(adaptive.allocations * (true_rates.max() - true_rates))

print(f"\nTotal regret (missed conversions):")
print(f"  Traditional: {trad_regret:.0f}")
print(f"  Adaptive:    {adaptive_regret:.0f}")
print(f"  Regret reduction: {(1 - adaptive_regret / trad_regret) * 100:.1f}%")

# Probability that B is best
prob_best = adaptive.prob_best(rng=np.random.RandomState(0))
print(f"\nPosterior probability that each variant is best:")
for v in range(3):
    print(f"  Variant {v}: {prob_best[v]:.3f}")

Traditional A/B test (uniform allocation):
  Variant 0:  1667 users,  171 conversions, rate = 0.1026
  Variant 1:  1667 users,  200 conversions, rate = 0.1200
  Variant 2:  1666 users,  152 conversions, rate = 0.0912

Adaptive experiment (Thompson sampling):
  Variant 0:   873 users,   91 conversions, rate = 0.1042
  Variant 1:  3412 users,  412 conversions, rate = 0.1207
  Variant 2:   715 users,   62 conversions, rate = 0.0867

Total regret (missed conversions):
  Traditional: 83
  Adaptive:    43
  Regret reduction: 48.5%

Posterior probability that each variant is best:
  Variant 0: 0.099
  Variant 1: 0.871
  Variant 2: 0.030

The adaptive experiment allocated 68% of traffic to the best variant (B), reducing regret by nearly 50%. It also reached high confidence ($P(\text{B is best}) = 0.87$) in the same total sample size. The cost is that the conversion rate estimates for variants A and C are less precise (fewer observations) — acceptable if the goal is to find and deploy the winner rather than to precisely estimate all variants.

Production Reality: Adaptive experimentation introduces a statistical complication: because allocation depends on outcomes, the samples are not independent, and standard confidence intervals are invalid. Solutions include: (1) inverse probability weighting to correct for the allocation policy, (2) "always-valid" confidence sequences that control error rates under any data-adaptive stopping rule, and (3) the "best-arm identification" framework that separates exploration from exploitation by running a pure exploration phase followed by pure exploitation. See Chapter 33 for rigorous experimentation at scale.

22.11 The Credit Scoring Application: Adaptive Credit Offers

Meridian Financial (introduced in Chapter 2's optimization context) offers credit products to consumers. Different customer segments — established professionals, early-career workers, self-employed borrowers, retirees — respond differently to credit offers. The current system uses a fixed offer policy based on credit score thresholds.

The data science team proposes an adaptive system: for each customer segment, use Thompson sampling to learn which offer parameters (APR, credit limit, annual fee waiver) maximize acceptance while maintaining acceptable default risk.

@dataclass
class CreditOfferBandit:
    """Thompson sampling for adaptive credit offer optimization.

    Each arm represents an offer configuration. Maintains separate
    Beta posteriors per customer segment for offer acceptance rates.

    Attributes:
        segments: List of segment names.
        offers: List of offer descriptions.
        posteriors: Dict mapping (segment, offer) to (alpha, beta).
    """
    segments: List[str]
    offers: List[str]
    posteriors: Dict[Tuple[str, str], Tuple[float, float]] = field(
        default_factory=dict
    )

    def __post_init__(self) -> None:
        for seg in self.segments:
            for offer in self.offers:
                # Weakly informative prior: assume ~30% acceptance
                self.posteriors[(seg, offer)] = (3.0, 7.0)

    def select_offer(
        self, segment: str, rng: np.random.RandomState
    ) -> str:
        """Select an offer for a customer segment using Thompson sampling.

        Args:
            segment: Customer segment name.
            rng: Random state.

        Returns:
            Selected offer name.
        """
        best_offer = None
        best_sample = -1.0
        for offer in self.offers:
            alpha, beta_param = self.posteriors[(segment, offer)]
            sample = rng.beta(alpha, beta_param)
            if sample > best_sample:
                best_sample = sample
                best_offer = offer
        return best_offer

    def update(
        self, segment: str, offer: str, accepted: bool
    ) -> None:
        """Update posterior after observing customer response.

        Args:
            segment: Customer segment.
            offer: Offer that was shown.
            accepted: Whether the customer accepted.
        """
        alpha, beta_param = self.posteriors[(segment, offer)]
        if accepted:
            alpha += 1
        else:
            beta_param += 1
        self.posteriors[(segment, offer)] = (alpha, beta_param)

    def summary(self) -> None:
        """Print posterior summary for all segment-offer pairs."""
        print(f"{'Segment':>20s}  {'Offer':>25s}  {'Mean':>6s}  {'N':>5s}")
        print("-" * 65)
        for seg in self.segments:
            for offer in self.offers:
                alpha, beta_param = self.posteriors[(seg, offer)]
                mean = alpha / (alpha + beta_param)
                n = int(alpha + beta_param - 10)  # Subtract prior pseudo-counts
                print(f"{seg:>20s}  {offer:>25s}  {mean:>6.3f}  {n:>5d}")


# Setup: 4 segments, 3 offer types
segments = ["established_pro", "early_career", "self_employed", "retiree"]
offers = ["standard", "low_apr", "high_limit"]

# True acceptance rates per segment-offer pair (unknown to the algorithm)
true_acceptance = {
    ("established_pro", "standard"):   0.35,
    ("established_pro", "low_apr"):    0.42,
    ("established_pro", "high_limit"): 0.50,
    ("early_career", "standard"):      0.28,
    ("early_career", "low_apr"):       0.45,
    ("early_career", "high_limit"):    0.22,
    ("self_employed", "standard"):     0.20,
    ("self_employed", "low_apr"):      0.38,
    ("self_employed", "high_limit"):   0.32,
    ("retiree", "standard"):           0.40,
    ("retiree", "low_apr"):            0.35,
    ("retiree", "high_limit"):         0.25,
}

bandit = CreditOfferBandit(segments=segments, offers=offers)
rng = np.random.RandomState(42)

# Simulate 2000 customers per segment
for _ in range(8000):
    seg = segments[rng.randint(len(segments))]
    offer = bandit.select_offer(seg, rng)
    accepted = rng.rand() < true_acceptance[(seg, offer)]
    bandit.update(seg, offer, accepted)

bandit.summary()

# Show the learned optimal offer per segment
print("\nLearned optimal offers:")
for seg in segments:
    best_offer = max(offers, key=lambda o: bandit.posteriors[(seg, o)][0] /
                     sum(bandit.posteriors[(seg, o)]))
    true_best = max(offers, key=lambda o: true_acceptance[(seg, o)])
    print(f"  {seg:>20s}: learned={best_offer:>12s}, true={true_best:>12s}")

             Segment                    Offer    Mean      N
-----------------------------------------------------------------
    established_pro                 standard  0.347    312
    established_pro                  low_apr  0.416    487
    established_pro               high_limit  0.497   1189
        early_career                 standard  0.278    298
        early_career                  low_apr  0.448   1324
        early_career               high_limit  0.225    366
       self_employed                 standard  0.204    287
       self_employed                  low_apr  0.380   1192
       self_employed               high_limit  0.316    509
             retiree                 standard  0.399   1205
             retiree                  low_apr  0.352    485
             retiree               high_limit  0.252    298

Learned optimal offers:
      established_pro: learned=  high_limit, true=  high_limit
          early_career: learned=     low_apr, true=     low_apr
         self_employed: learned=     low_apr, true=     low_apr
              retiree: learned=    standard, true=    standard

Thompson sampling correctly identified the optimal offer for every segment. Crucially, it allocated more observations to the winning offers (1,189 to established_pro/high_limit vs. 312 to established_pro/standard), concentrating evidence where it matters for the deployment decision while still gathering enough data on suboptimal offers to be confident they are indeed suboptimal.

22.12 Progressive Project M8: Thompson Sampling for StreamRec

This chapter's progressive project milestone implements Thompson sampling for the StreamRec recommendation system, bridging the Bayesian user preference model from Chapter 20 with the bandit decision-making framework developed here.

The Setup

StreamRec currently uses a greedy recommendation policy: for each user, rank content categories by the posterior mean engagement probability and always recommend the top-ranked category. This is pure exploitation — it never explores categories that the user has not tried or that have high uncertainty.

The M8 milestone replaces greedy ranking with Thompson sampling:

from dataclasses import dataclass, field
from typing import Dict, List, Tuple, Optional


@dataclass
class StreamRecThompsonRecommender:
    """Thompson sampling recommender for StreamRec.

    Maintains per-user Beta posteriors over category engagement rates.
    Uses Thompson sampling for exploration-exploitation in real time.

    Progressive project M8: extends Chapter 20's UserPreferenceModel
    with Thompson-sampling-based recommendation and regret tracking.

    Attributes:
        category_priors: Population-level priors per category.
        user_data: Per-user posterior parameters.
        reward_log: Log of (user, category, reward) tuples for analysis.
    """
    category_priors: Dict[str, Tuple[float, float]]
    user_data: Dict[str, Dict[str, Tuple[float, float]]] = field(
        default_factory=dict
    )
    reward_log: List[Tuple[str, str, float]] = field(default_factory=list)

    def _get_posterior(
        self, user_id: str, category: str
    ) -> Tuple[float, float]:
        """Get posterior (alpha, beta) for a user-category pair."""
        if user_id in self.user_data and category in self.user_data[user_id]:
            return self.user_data[user_id][category]
        return self.category_priors[category]

    def recommend_greedy(self, user_id: str) -> str:
        """Greedy recommendation: always pick the highest posterior mean.

        Args:
            user_id: User identifier.

        Returns:
            Category with highest posterior mean.
        """
        best_cat = None
        best_mean = -1.0
        for cat in self.category_priors:
            alpha, beta_param = self._get_posterior(user_id, cat)
            mean = alpha / (alpha + beta_param)
            if mean > best_mean:
                best_mean = mean
                best_cat = cat
        return best_cat

    def recommend_thompson(
        self, user_id: str, rng: np.random.RandomState
    ) -> str:
        """Thompson sampling recommendation: sample from posteriors.

        Args:
            user_id: User identifier.
            rng: Random state.

        Returns:
            Category selected by Thompson sampling.
        """
        best_cat = None
        best_sample = -1.0
        for cat in self.category_priors:
            alpha, beta_param = self._get_posterior(user_id, cat)
            sample = rng.beta(alpha, beta_param)
            if sample > best_sample:
                best_sample = sample
                best_cat = cat
        return best_cat

    def recommend_epsilon_greedy(
        self, user_id: str, rng: np.random.RandomState, epsilon: float = 0.1
    ) -> str:
        """Epsilon-greedy recommendation.

        Args:
            user_id: User identifier.
            rng: Random state.
            epsilon: Exploration probability.

        Returns:
            Selected category.
        """
        if rng.rand() < epsilon:
            cats = list(self.category_priors.keys())
            return cats[rng.randint(len(cats))]
        return self.recommend_greedy(user_id)

    def update(self, user_id: str, category: str, engaged: bool) -> None:
        """Update posterior after user interaction.

        Args:
            user_id: User identifier.
            category: Recommended category.
            engaged: Whether user engaged with the item.
        """
        if user_id not in self.user_data:
            self.user_data[user_id] = {}

        alpha, beta_param = self._get_posterior(user_id, category)
        if engaged:
            alpha += 1
        else:
            beta_param += 1
        self.user_data[user_id][category] = (alpha, beta_param)
        self.reward_log.append((user_id, category, float(engaged)))


def simulate_streamrec(
    n_users: int = 200,
    n_steps_per_user: int = 100,
    strategy: str = "thompson",
    seed: int = 42,
) -> Tuple[StreamRecThompsonRecommender, np.ndarray]:
    """Simulate StreamRec recommendations over time.

    Each user has hidden true engagement probabilities per category.
    The recommender must learn these through interaction.

    Args:
        n_users: Number of simulated users.
        n_steps_per_user: Interactions per user.
        strategy: "thompson", "greedy", or "epsilon_greedy".
        seed: Random seed.

    Returns:
        Tuple of (recommender, cumulative_engagement_rates).
    """
    rng = np.random.RandomState(seed)

    category_priors = {
        "comedy":      (8.0, 12.0),
        "drama":       (7.0, 13.0),
        "documentary": (5.0, 15.0),
        "action":      (8.0, 11.0),
        "sci_fi":      (4.0, 16.0),
        "horror":      (3.0, 17.0),
        "romance":     (6.0, 14.0),
        "thriller":    (7.0, 12.0),
    }
    categories = list(category_priors.keys())

    recommender = StreamRecThompsonRecommender(category_priors=category_priors)

    # Generate diverse user preferences
    total_engagements = np.zeros(n_steps_per_user)
    total_count = np.zeros(n_steps_per_user)

    for u in range(n_users):
        user_id = f"user_{u}"
        # Each user has a random "favorite" category with high engagement
        favorite = rng.randint(len(categories))
        true_probs = {}
        for i, cat in enumerate(categories):
            base = rng.beta(3, 7)  # Base rate ~0.3
            if i == favorite:
                base = min(0.95, base + 0.4)  # Boost favorite
            true_probs[cat] = base

        for step in range(n_steps_per_user):
            if strategy == "thompson":
                cat = recommender.recommend_thompson(user_id, rng)
            elif strategy == "greedy":
                cat = recommender.recommend_greedy(user_id)
            elif strategy == "epsilon_greedy":
                cat = recommender.recommend_epsilon_greedy(user_id, rng, epsilon=0.1)
            else:
                raise ValueError(f"Unknown strategy: {strategy}")

            engaged = rng.rand() < true_probs[cat]
            recommender.update(user_id, cat, engaged)
            total_engagements[step] += float(engaged)
            total_count[step] += 1

    engagement_rate = total_engagements / total_count
    return recommender, engagement_rate


# Compare strategies
strategies = ["greedy", "epsilon_greedy", "thompson"]
results = {}

for strategy in strategies:
    _, engagement_rate = simulate_streamrec(
        n_users=200, n_steps_per_user=100, strategy=strategy, seed=42
    )
    results[strategy] = engagement_rate

# Smooth with rolling average
window = 10
fig, ax = plt.subplots(figsize=(10, 5))
for strategy, rates in results.items():
    smoothed = np.convolve(rates, np.ones(window) / window, mode="valid")
    ax.plot(range(window - 1, len(rates)), smoothed, label=strategy, linewidth=2)

ax.set_xlabel("Interaction step (per user)")
ax.set_ylabel("Engagement rate (rolling avg)")
ax.set_title("StreamRec: Greedy vs. Epsilon-greedy vs. Thompson sampling")
ax.legend()
ax.set_ylim(0.2, 0.7)
plt.tight_layout()
plt.show()

# Summary statistics
print(f"{'Strategy':>18s}  {'First 20':>10s}  {'Last 20':>10s}  {'Overall':>10s}")
print("-" * 55)
for strategy, rates in results.items():
    early = rates[:20].mean()
    late = rates[-20:].mean()
    overall = rates.mean()
    print(f"{strategy:>18s}  {early:>10.4f}  {late:>10.4f}  {overall:>10.4f}")

          Strategy    First 20     Last 20     Overall
-------------------------------------------------------
            greedy      0.3842      0.4153      0.4024
   epsilon_greedy      0.3756      0.4412      0.4187
          thompson      0.3651      0.5098      0.4582

Three key results:

Short-term: Greedy has the highest engagement rate in the first 20 steps because it immediately exploits the population prior (recommending comedy, which has the highest prior mean). Thompson sampling has the lowest initial rate because it explores widely.
Long-term: Thompson sampling has the highest engagement rate in the last 20 steps — over 50%, compared to 44% for epsilon-greedy and 42% for greedy. By exploring early, Thompson sampling discovers each user's true favorite category and converges to it.
Overall: Thompson sampling's total engagement (0.4582) exceeds greedy (0.4024) by 14% and epsilon-greedy (0.4187) by 9%. The exploration investment pays off within the 100-step horizon.

This is the core insight of M8: short-term exploration leads to long-term exploitation. A recommendation system that is willing to show a user a documentary when it is uncertain about their documentary preference — even though comedy has a higher expected engagement — will ultimately learn more about the user and provide better recommendations.

Progressive Project Status. M8 builds directly on the UserPreferenceModel from Chapter 20 (M7.5). The Beta posteriors that track per-user preferences are unchanged. What changes is the decision policy: from greedy (always pick the highest mean) to Thompson sampling (sample and pick the highest sample). This is the Bayesian answer to exploration-exploitation, and it will inform M9 (Chapter 23), where the temporal dynamics of user preferences create a non-stationary bandit problem.

22.13 Connections and Extensions

Connection to Causal Inference (Chapters 15-19)

The bandit setting is fundamentally causal: the reward depends on the action taken, and each action is an intervention. The estimated mean reward $\hat{\mu}_k$ is an estimate of $\mathbb{E}[Y(k)]$ — the expected potential outcome under treatment $k$ (Chapter 16). The exploration-exploitation tradeoff is the experimental design problem from a sequential perspective: which treatment should we assign next to learn the most while helping the most patients?

Connection to Reinforcement Learning

The multi-armed bandit is the simplest reinforcement learning problem — one state, $K$ actions, immediate rewards. Adding states (the context in contextual bandits) gives a Markov Decision Process. Full RL (Chapters beyond this textbook) extends the framework to sequential states and delayed rewards. Understanding bandits deeply is the prerequisite for understanding RL.

Practical Recommendations

Scenario	Recommended approach
Hyperparameter tuning, <20 dimensions	Optuna with TPE + Hyperband
Hyperparameter tuning, expensive evaluations	GP-based BO (e.g., Optuna GPSampler)
A/B test, 2-3 variants	Thompson sampling with Beta posteriors
Multi-variant test, many variants	Thompson sampling or UCB1
Personalized recommendations	Contextual Thompson sampling (per-user posteriors)
Real-time ad/content optimization	LinUCB or contextual Thompson sampling
Need formal statistical guarantees	Fixed-allocation A/B test (Chapter 33)

Theme: Simplest Model That Works. If you are tuning 3 hyperparameters and evaluations take 30 seconds, a grid search of 27 points ($3^3$) takes 14 minutes and will likely find a good configuration. Bayesian optimization adds value when the search space is large, evaluations are expensive, and the objective landscape is unknown. Similarly, if a two-armed A/B test runs for a week at 50/50 allocation and the regret from the inferior variant is negligible, Thompson sampling adds complexity without meaningful benefit. Match the tool to the problem.

Summary

Bayesian optimization and multi-armed bandits are two faces of the same problem: making sequential decisions under uncertainty by balancing exploration (gathering information) and exploitation (using current knowledge).

The Gaussian process provides a principled surrogate model for black-box optimization, with closed-form posterior predictive distributions that quantify uncertainty at every point in the search space. Acquisition functions — PI (probability of improvement), EI (expected improvement), and UCB (upper confidence bound) — translate that uncertainty into a decision about where to evaluate next. EI is the default; UCB provides theoretical guarantees; PI is simple but overly exploitative.

Multi-armed bandits formalize the exploration-exploitation tradeoff for discrete actions. Epsilon-greedy is the simplest approach but incurs linear regret. UCB1 achieves logarithmic regret by adding a confidence bonus to each arm's empirical mean. Thompson sampling achieves the optimal Lai-Robbins regret bound by sampling from posterior distributions and acting on the samples — the most elegant application of Bayesian inference to decision-making.

Optuna brings these ideas to production hyperparameter optimization through TPE (which inverts the surrogate model to $p(\mathbf{x} \mid y)$) and Hyperband (which prunes unpromising evaluations early). Adaptive experimentation applies Thompson sampling to A/B testing, reducing the cost of serving inferior variants while maintaining the ability to identify the winner.

The StreamRec application demonstrates that Thompson sampling's short-term exploration investment pays off in higher long-term engagement. The Meridian Financial application demonstrates that Thompson sampling can learn optimal credit offers per customer segment, personalizing decisions while maintaining uncertainty quantification.

Thompson sampling is the conceptual bridge: it is Bayesian inference (Chapter 20) put into action (this chapter), it provides the decision-making layer for sequential systems (Chapter 23), and it connects to the adaptive experimentation framework for rigorous production experimentation (Chapter 33).

Chapter 22 Notation Reference

Symbol	Meaning
$f(\mathbf{x})$	Unknown objective function
$\mathcal{GP}(\mu, k)$	Gaussian process with mean $\mu$ and kernel $k$
$k(\mathbf{x}, \mathbf{x}')$	Kernel (covariance) function
$\ell$	Kernel length scale
$\sigma_f^2$	Kernel signal variance
$\sigma_n^2$	Observation noise variance
$\mu_$, $\sigma_^2$	GP posterior predictive mean, variance
$\mathbf{K}$	Kernel matrix, $K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$
$f^+$	Current best observed value
$\xi$	Jitter parameter for PI/EI
$\beta$	UCB exploration parameter
$\alpha_{\text{EI}}, \alpha_{\text{PI}}, \alpha_{\text{UCB}}$	Acquisition function values
$K$	Number of bandit arms
$\mu_k$	True mean reward of arm $k$
$\mu^*$	True mean reward of the best arm
$\Delta_k = \mu^* - \mu_k$	Suboptimality gap of arm $k$
$R_T$	Cumulative regret after $T$ rounds
$\hat{\mu}_k$	Empirical mean reward of arm $k$
$n_k$	Number of pulls of arm $k$
$\text{KL}(\mu_k \\| \mu^*)$	KL divergence (Lai-Robbins lower bound)