Chapter 22: Quiz

Test your understanding of Bayesian optimization and sequential decision-making. Answers follow each question.


Question 1

What is a Gaussian process? State its defining property in one sentence.

Answer A Gaussian process is a distribution over functions such that any finite collection of function values $f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n)$ is jointly Gaussian, fully specified by a mean function $\mu(\mathbf{x})$ and a kernel (covariance) function $k(\mathbf{x}, \mathbf{x}')$. The defining property is this marginalization consistency: every finite-dimensional marginal of the GP is a multivariate Gaussian distribution, which makes conditioning (computing the posterior after observing data) analytically tractable.
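The finite-marginal property can be checked numerically. Below is a pure-Python sketch with an assumed RBF kernel and two arbitrary inputs: sampling $(f(x_1), f(x_2))$ through the Cholesky factor of the $2 \times 2$ kernel matrix reproduces the covariance the kernel prescribes.

```python
import math
import random

def rbf(x1, x2, ell=1.0):
    """Squared-exponential kernel k(x, x') = exp(-(x - x')^2 / (2 ell^2))."""
    return math.exp(-((x1 - x2) ** 2) / (2.0 * ell ** 2))

# Any finite set of inputs induces a multivariate Gaussian over function values.
x1, x2 = 0.0, 0.7
k11, k12, k22 = rbf(x1, x1), rbf(x1, x2), rbf(x2, x2)

# Cholesky factor of the 2x2 kernel matrix K = L L^T, so (f1, f2) = L z.
l11 = math.sqrt(k11)
l21 = k12 / l11
l22 = math.sqrt(k22 - l21 ** 2)

random.seed(0)
samples = []
for _ in range(20000):
    z1, z2 = random.gauss(0, 1), random.gauss(0, 1)
    samples.append((l11 * z1, l21 * z1 + l22 * z2))

# Empirical covariance of (f(x1), f(x2)) should approach the kernel value k(x1, x2).
n = len(samples)
m1 = sum(f1 for f1, _ in samples) / n
m2 = sum(f2 for _, f2 in samples) / n
emp_cov = sum((f1 - m1) * (f2 - m2) for f1, f2 in samples) / n
print(emp_cov, k12)  # empirical covariance vs. kernel value
```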

Question 2

What does the kernel length scale $\ell$ control in a GP? What happens to the GP posterior if $\ell$ is set much too small or much too large?

Answer The length scale $\ell$ controls how quickly the correlation between function values decays with distance between input points.

- **Too small $\ell$:** The GP assumes the function varies rapidly. It fits each observation tightly, but the posterior mean reverts to the prior mean (typically zero) between observations, with large uncertainty. The GP "forgets" what it learned just a short distance from each observation. This leads to overfitting and excessive exploration in BayesOpt.
- **Too large $\ell$:** The GP assumes the function is nearly constant. It underfits, producing a posterior mean that is too smooth and may miss important local structure (bumps, valleys). Uncertainty is uniformly low, which can lead to insufficient exploration because the GP is overconfident about regions it has not observed.
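A minimal numerical sketch of what $\ell$ controls, assuming an RBF kernel and an arbitrary distance of $0.5$ between two inputs:

```python
import math

def rbf_corr(d, ell):
    """Correlation between f(x) and f(x') at distance d under an RBF kernel."""
    return math.exp(-d * d / (2.0 * ell * ell))

d = 0.5  # fixed distance between two inputs
corr_small = rbf_corr(d, ell=0.1)   # very short length scale
corr_large = rbf_corr(d, ell=5.0)   # very long length scale

print(corr_small)  # near 0: the GP "forgets" across this distance
print(corr_large)  # near 1: the GP treats the function as nearly constant
```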

Question 3

Why is the Matern-5/2 kernel generally preferred over the RBF kernel for Bayesian optimization?

Answer The RBF (squared exponential) kernel produces infinitely differentiable function samples — unrealistically smooth for most real objective functions (e.g., validation loss as a function of hyperparameters). This excessive smoothness causes the GP to be overconfident between observations, underestimating uncertainty and leading to insufficient exploration. The Matern-5/2 kernel produces twice-differentiable functions — smooth enough for practical purposes but not unrealistically so. The wider confidence bands between observations lead to more appropriate exploration. Empirically, Matern-5/2 consistently outperforms RBF for hyperparameter optimization tasks.

Question 4

Write the GP posterior predictive mean $\mu_*$ and variance $\sigma_*^2$ at a new point $\mathbf{x}_*$ given observations $(\mathbf{X}, \mathbf{y})$.

Answer
$$\mu_* = \mathbf{k}_*^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y}$$
$$\sigma_*^2 = k(\mathbf{x}_*, \mathbf{x}_*) - \mathbf{k}_*^T (\mathbf{K} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{k}_*$$
where $\mathbf{K}$ is the $n \times n$ kernel matrix ($K_{ij} = k(\mathbf{x}_i, \mathbf{x}_j)$), $\mathbf{k}_*$ is the $n \times 1$ vector with entries $k_{*i} = k(\mathbf{x}_*, \mathbf{x}_i)$, and $\sigma_n^2$ is the observation noise variance. The mean is a weighted combination of observed values. The variance starts at the prior variance $k(\mathbf{x}_*, \mathbf{x}_*)$ and is reduced by the information from nearby observations (the subtracted term).
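A pure-Python sketch of these two formulas, assuming a zero-mean GP with an RBF kernel and toy 1-D data; a small Gaussian-elimination solver stands in for the matrix inverse:

```python
import math

def rbf(a, b, ell=1.0):
    """Squared-exponential kernel k(a, b) = exp(-(a - b)^2 / (2 ell^2))."""
    return math.exp(-((a - b) ** 2) / (2.0 * ell ** 2))

def solve(A, b):
    """Solve A v = b by Gaussian elimination with partial pivoting (small n)."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix (copy)
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    v = [0.0] * n
    for r in range(n - 1, -1, -1):
        v[r] = (M[r][n] - sum(M[r][c] * v[c] for c in range(r + 1, n))) / M[r][r]
    return v

def gp_posterior(X, y, x_star, noise=1e-4):
    """Posterior mean and variance at x_star for a zero-mean GP with RBF kernel."""
    n = len(X)
    K = [[rbf(X[i], X[j]) + (noise if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    k_star = [rbf(x_star, xi) for xi in X]
    alpha = solve(K, y)       # (K + sigma_n^2 I)^{-1} y
    v = solve(K, k_star)      # (K + sigma_n^2 I)^{-1} k_*
    mu = sum(ks * a for ks, a in zip(k_star, alpha))
    var = rbf(x_star, x_star) - sum(ks * vi for ks, vi in zip(k_star, v))
    return mu, var

X, y = [0.0, 1.0, 2.0], [0.0, 1.0, 0.0]
mu_at_obs, var_at_obs = gp_posterior(X, y, 1.0)   # at an observed point
mu_far, var_far = gp_posterior(X, y, 10.0)        # far from all observations
print(mu_at_obs, var_at_obs)  # mean near the observed y, variance near 0
print(mu_far, var_far)        # mean reverts to prior 0, variance reverts to prior 1
```

At an observed point the subtracted term nearly cancels the prior variance; far from the data, $\mathbf{k}_*$ vanishes and the posterior falls back to the prior.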

Question 5

Compare Probability of Improvement (PI) and Expected Improvement (EI). Why does PI tend to be overly exploitative?

Answer **PI** measures the probability that a new evaluation will exceed the current best: $\alpha_{\text{PI}}(\mathbf{x}) = P(f(\mathbf{x}) > f^+ + \xi)$. **EI** measures the expected magnitude of improvement: $\alpha_{\text{EI}}(\mathbf{x}) = \mathbb{E}[\max(f(\mathbf{x}) - f^+ - \xi, 0)]$.

PI is overly exploitative because it treats all improvements equally regardless of magnitude. It prefers a point with a 60% chance of improving by 0.001 over a point with a 30% chance of improving by 10.0. This causes PI to cluster evaluations near the current best value, often converging to a local optimum.

EI avoids this by weighting improvements by their expected size. Its closed-form solution $(\mu - f^+ - \xi)\Phi(z) + \sigma\phi(z)$ has two terms: an exploitation term (first) that rewards high predicted value, and an exploration term (second) that rewards high uncertainty.
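The contrast can be made concrete. A sketch of the two closed forms (maximization convention, $\xi = 0$), evaluated at invented numbers for a "safe" candidate and a "bold" candidate:

```python
import math

def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def pi_acq(mu, sigma, f_best, xi=0.0):
    """Probability of Improvement: P(f(x) > f_best + xi)."""
    z = (mu - f_best - xi) / sigma
    return norm_cdf(z)

def ei_acq(mu, sigma, f_best, xi=0.0):
    """Expected Improvement (maximization), closed form."""
    z = (mu - f_best - xi) / sigma
    return (mu - f_best - xi) * norm_cdf(z) + sigma * norm_pdf(z)

# A "safe" point: tiny expected gain, but a high probability of improving.
pi_safe, ei_safe = pi_acq(0.001, 0.001, 0.0), ei_acq(0.001, 0.001, 0.0)
# A "bold" point: likely worse than the incumbent, but a large gain is possible.
pi_bold, ei_bold = pi_acq(-1.0, 5.0, 0.0), ei_acq(-1.0, 5.0, 0.0)

print(pi_safe, pi_bold)  # PI ranks the safe point higher
print(ei_safe, ei_bold)  # EI ranks the bold point higher
```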

Question 6

State the UCB acquisition function. What role does the parameter $\beta$ play?

Answer
$$\alpha_{\text{UCB}}(\mathbf{x}) = \mu(\mathbf{x}) + \sqrt{\beta}\, \sigma(\mathbf{x})$$
The parameter $\beta > 0$ controls the exploration-exploitation tradeoff:

- **Large $\beta$:** More weight on uncertainty ($\sigma$), leading to more exploration. The algorithm preferentially evaluates points where it is uncertain.
- **Small $\beta$:** More weight on the predicted mean ($\mu$), leading to more exploitation. The algorithm preferentially evaluates points with high predicted values.

The GP-UCB theory (Srinivas et al., 2010) provides a schedule $\beta_t = 2\log(|D|t^2\pi^2 / (6\delta))$ that guarantees sublinear regret, but this schedule is often too conservative in practice. A constant $\beta \in [1, 3]$ typically works well.

Question 7

Define cumulative regret for the multi-armed bandit problem. What does it mean for an algorithm to have "sublinear regret"?

Answer Cumulative regret after $T$ rounds is:
$$R_T = \sum_{t=1}^{T} \left[\mu^* - \mu_{a_t}\right]$$
where $\mu^* = \max_k \mu_k$ is the mean reward of the best arm and $\mu_{a_t}$ is the mean reward of the arm selected at time $t$.

**Sublinear regret** means $R_T = o(T)$, equivalently $R_T / T \to 0$ as $T \to \infty$. This means the average per-step regret vanishes — the algorithm eventually converges to pulling the best arm almost exclusively. Linear regret ($R_T = \Theta(T)$) means the algorithm never converges — it keeps making suboptimal choices at a constant rate. Logarithmic regret $R_T = O(\log T)$ is the best achievable rate (Lai-Robbins lower bound).
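A toy sketch of these definitions with assumed mean rewards: a policy that explores uniformly forever incurs a constant average gap per step (linear regret), while the oracle benchmark incurs zero.

```python
import random

random.seed(0)
mus = [0.2, 0.5, 0.8]          # assumed mean rewards; arm 2 is optimal
mu_star = max(mus)

def cumulative_regret(policy, T):
    """Sum of per-step gaps mu* - mu_{a_t} for the arms a policy selects."""
    return sum(mu_star - mus[policy(t)] for t in range(T))

def uniform(t):
    return random.randrange(3)  # never stops exploring

def oracle(t):
    return 2                    # always pulls the best arm

T = 10000
r_uniform = cumulative_regret(uniform, T)
r_oracle = cumulative_regret(oracle, T)
print(r_uniform / T)  # stays near the average gap (0.6 + 0.3 + 0.0) / 3 = 0.3
print(r_oracle / T)   # 0: the benchmark a sublinear-regret algorithm approaches
```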

Question 8

Why does fixed-rate epsilon-greedy have linear regret while UCB1 has logarithmic regret?

Answer **Fixed epsilon-greedy** explores uniformly at random with probability $\epsilon$ at every time step, regardless of how much information has been gathered. Even at $t = 1{,}000{,}000$, it still randomly explores 10% of the time (for $\epsilon = 0.1$), pulling clearly suboptimal arms. This constant exploration rate produces linear regret: $R_T \approx \epsilon \cdot \Delta_{\text{avg}} \cdot T$.

**UCB1** explores adaptively: the confidence bonus $\sqrt{2\ln t / n_k}$ shrinks as arm $k$ is pulled more (increasing $n_k$). Arms whose confidence bounds no longer overlap with the best arm's are effectively abandoned. The exploration rate decays over time — at $t = 1{,}000{,}000$, arms with large gaps have been thoroughly identified and are rarely pulled. This gives logarithmic regret: $R_T = O(\sum_k \ln T / \Delta_k)$.

The key difference is that UCB1's exploration is *data-dependent* (it stops exploring arms it has identified as suboptimal), while epsilon-greedy's exploration is *data-independent* (it explores the same amount regardless of evidence).
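A sketch of UCB1 on three assumed Bernoulli arms; the final pull counts illustrate the data-dependent exploration described above, with suboptimal arms receiving only a logarithmic share of pulls.

```python
import math
import random

random.seed(0)
probs = [0.2, 0.5, 0.8]        # assumed Bernoulli arms; arm 2 is best
K = len(probs)
counts = [0] * K               # n_k: number of pulls of arm k
sums = [0.0] * K               # total reward observed from arm k

def ucb1_pick(t):
    """UCB1: pull each arm once, then maximize mean + sqrt(2 ln t / n_k)."""
    for k in range(K):
        if counts[k] == 0:
            return k
    return max(range(K),
               key=lambda k: sums[k] / counts[k]
               + math.sqrt(2.0 * math.log(t) / counts[k]))

T = 2000
for t in range(1, T + 1):
    arm = ucb1_pick(t)
    reward = 1.0 if random.random() < probs[arm] else 0.0
    counts[arm] += 1
    sums[arm] += reward

print(counts)  # the best arm dominates; suboptimal arms get O(log T) pulls
```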

Question 9

Explain Thompson sampling for Bernoulli bandits in three steps.

Answer
1. **Maintain posteriors.** Each arm $k$ has a Beta posterior $\text{Beta}(\alpha_k, \beta_k)$ over its success probability $\theta_k$. Initialize with the prior (e.g., $\alpha_k = \beta_k = 1$ for uniform).
2. **Sample and act.** At each time step, draw one sample $\tilde{\theta}_k$ from each arm's posterior. Select the arm with the highest sample: $a_t = \arg\max_k \tilde{\theta}_k$.
3. **Observe and update.** Receive reward $r_t \in \{0, 1\}$. Update the selected arm's posterior: $\alpha_{a_t} \leftarrow \alpha_{a_t} + r_t$ and $\beta_{a_t} \leftarrow \beta_{a_t} + (1 - r_t)$.

The algorithm explores naturally because uncertain arms (wide posteriors) occasionally produce high samples, causing them to be selected. As data accumulates, posteriors concentrate, and the best arm's samples consistently dominate.
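The three steps map directly to code. A sketch with three assumed Bernoulli arms and a uniform $\text{Beta}(1, 1)$ prior:

```python
import random

random.seed(0)
probs = [0.2, 0.5, 0.8]               # true (unknown) success rates; arm 2 is best
alpha = [1.0, 1.0, 1.0]               # Beta posterior parameters per arm,
beta = [1.0, 1.0, 1.0]                # initialized to the uniform Beta(1, 1) prior
counts = [0, 0, 0]

for _ in range(2000):
    # Step 2: sample one theta from each arm's posterior and act on the argmax.
    theta = [random.betavariate(alpha[k], beta[k]) for k in range(3)]
    arm = max(range(3), key=lambda k: theta[k])
    # Step 3: observe the Bernoulli reward and update that arm's posterior.
    r = 1 if random.random() < probs[arm] else 0
    alpha[arm] += r
    beta[arm] += 1 - r
    counts[arm] += 1

print(counts)  # posteriors concentrate and the best arm dominates
```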

Question 10

True or False: Thompson sampling requires tuning an exploration parameter analogous to $\epsilon$ in epsilon-greedy or $\beta$ in UCB.

Answer **False.** Thompson sampling's exploration is controlled entirely by the posterior uncertainty — no tuning parameter is needed. When the posterior is wide (little data), exploration happens naturally because samples are spread across a wide range. When the posterior concentrates (much data), exploitation dominates because samples cluster near the true mean. The only input is the prior, which is a modeling choice (not a tuning parameter) and whose influence diminishes as data accumulates. This is one of Thompson sampling's key practical advantages: it is a "set and forget" algorithm that automatically adapts its exploration rate based on available evidence.

Question 11

What is the Lai-Robbins lower bound? What does it imply about the difficulty of bandit problems?

Answer The Lai-Robbins lower bound (1985) states that for any consistent bandit algorithm, the expected number of times a suboptimal arm $k$ is pulled satisfies:
$$\liminf_{T \to \infty} \frac{\mathbb{E}[N_k(T)]}{\ln T} \geq \frac{1}{\text{KL}(\mu_k \| \mu^*)}$$
This implies a cumulative regret lower bound: $R_T \geq \sum_{k: \mu_k < \mu^*} \frac{\Delta_k}{\text{KL}(\mu_k \| \mu^*)} \ln T + o(\ln T)$.

**Implications:**
1. No algorithm can do better than $O(\log T)$ regret — this is the fundamental information-theoretic limit.
2. Arms with small gaps ($\Delta_k$ small) are inherently harder to distinguish and require more pulls.
3. The KL divergence between the reward distributions (not just the gap) determines the difficulty — arms whose distributions overlap more are harder to separate.

Thompson sampling matches this lower bound asymptotically.
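For Bernoulli rewards the bound is easy to evaluate numerically. A sketch with assumed arm means, showing how smaller gaps inflate the forced-exploration term $\ln T / \text{KL}$:

```python
import math

def bernoulli_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)) in nats."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Lai-Robbins: any consistent algorithm must pull a suboptimal arm with mean
# mu_k at least on the order of ln(T) / KL(mu_k || mu*) times.
mu_star, T = 0.8, 100000
lb = {mu_k: math.log(T) / bernoulli_kl(mu_k, mu_star)
      for mu_k in (0.7, 0.5, 0.2)}
print(lb)  # the closest arm (smallest gap) forces by far the most exploration
```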

Question 12

How does a contextual bandit differ from a standard multi-armed bandit? Give a practical example.

Answer In a **standard MAB**, the reward distribution for each arm is fixed and does not depend on any external information. The optimal arm is the same for every decision instance.

In a **contextual bandit**, a context vector $\mathbf{x}_t$ is observed before each decision. The expected reward depends on both the arm and the context: $\mathbb{E}[r \mid \mathbf{x}, a]$. The optimal arm can differ across contexts.

**Example:** StreamRec recommending content categories. In the standard MAB, comedy has the highest average engagement rate and would be recommended to all users. In the contextual bandit, the context might include user age, time of day, and viewing history. For a 22-year-old at midnight, sci-fi might be optimal; for a 55-year-old on a weekend afternoon, documentary might be optimal. The contextual bandit learns these conditional relationships.

Question 13

Explain how TPE (Tree-structured Parzen Estimator) differs from GP-based Bayesian optimization in its modeling approach.

Answer **GP-based BO** models $p(y \mid \mathbf{x})$ directly — the objective value as a function of hyperparameters. It fits a Gaussian process to the (hyperparameters, objective) pairs and uses acquisition functions to select the next point.

**TPE** inverts this and models $p(\mathbf{x} \mid y)$ — the density of hyperparameters given performance. It splits observed trials into "good" ($y < y^*$, where $y^*$ is a quantile threshold) and "bad" ($y \geq y^*$) groups, then fits separate density estimators $\ell(\mathbf{x}) = p(\mathbf{x} \mid y < y^*)$ and $g(\mathbf{x}) = p(\mathbf{x} \mid y \geq y^*)$. The next point is selected to maximize $\ell(\mathbf{x}) / g(\mathbf{x})$ — configurations that look like "good" trials and unlike "bad" trials.

**Practical consequences:** TPE handles categorical, conditional, and mixed-type hyperparameters naturally (GPs require special kernels). TPE scales to higher dimensions. TPE does not model correlations between hyperparameters (each dimension is fitted independently), while GPs capture correlations through the kernel.
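A heavily simplified 1-D sketch of the selection rule, with a toy quadratic objective, an assumed $\gamma = 0.25$ quantile split, and a fixed-bandwidth Gaussian KDE standing in for TPE's Parzen estimators:

```python
import math

def kde(x, points, bw=0.1):
    """Fixed-bandwidth Gaussian kernel density estimate from 1-D points."""
    norm = len(points) * bw * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bw) ** 2) for p in points) / norm

# Observed trials: hyperparameter x in [0, 1], loss y = (x - 0.2)^2 (lower is better).
trials = [(i / 20.0, (i / 20.0 - 0.2) ** 2) for i in range(21)]
trials.sort(key=lambda t: t[1])
n_good = max(1, len(trials) // 4)          # gamma = 0.25 quantile split
good = [x for x, _ in trials[:n_good]]      # l(x): density of good trials
bad = [x for x, _ in trials[n_good:]]       # g(x): density of bad trials

# Propose the candidate that maximizes the ratio l(x) / g(x).
candidates = [i / 100.0 for i in range(101)]
x_next = max(candidates, key=lambda x: kde(x, good) / max(kde(x, bad), 1e-12))
print(x_next)  # lands near the good region around x = 0.2
```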

Question 14

What is Hyperband? What problem does it solve that standard Bayesian optimization does not?

Answer **Hyperband** (Li et al., 2017) is an early-stopping algorithm for hyperparameter optimization. It addresses the problem that many evaluations waste compute: a clearly diverging model trains for the full budget, consuming resources that could be used to try more configurations.

Hyperband works by running multiple brackets of **successive halving**: each bracket starts many configurations with a small budget (e.g., 3 epochs), evaluates them, keeps only the top $1/\eta$ fraction, and multiplies the survivors' budget by $\eta$ (halving the pool and doubling the budget when $\eta = 2$). Different brackets use different initial budgets, hedging between "many cheap evaluations" (aggressive pruning) and "few expensive evaluations" (less pruning).

Standard BO treats each evaluation as a black box — it has no mechanism to stop a bad evaluation early. Hyperband and BO are complementary: Optuna combines TPE (for selecting hyperparameters) with Hyperband (for pruning bad configurations mid-training), getting the benefits of both.
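A sketch of a single successive-halving bracket, using an invented learning-curve model `loss(quality, budget)` in place of real training runs:

```python
def loss(cfg_quality, budget):
    """Toy learning curve: converges to cfg_quality as the budget grows (assumed)."""
    return cfg_quality + 1.0 / budget

def successive_halving(qualities, budget=1, eta=2):
    """Evaluate all configs at `budget`, keep the top 1/eta, multiply budget by eta."""
    configs = list(range(len(qualities)))
    while len(configs) > 1:
        scores = {c: loss(qualities[c], budget) for c in configs}
        configs.sort(key=lambda c: scores[c])        # lower loss is better
        configs = configs[:max(1, len(configs) // eta)]
        budget *= eta
    return configs[0]

# Asymptotic losses of 8 invented configurations; index 3 is truly best.
qualities = [0.9, 0.4, 0.7, 0.1, 0.5, 0.8, 0.3, 0.6]
winner = successive_halving(qualities)
print(winner)  # -> 3
```

Hyperband proper runs several such brackets with different starting budgets, so that slow starters are not always pruned by the most aggressive schedule.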

Question 15

In the adaptive experimentation example (Section 22.10), Thompson sampling allocated 68% of traffic to the best variant. Why might this be a problem for statistical analysis?

Answer Standard statistical tests (z-tests, t-tests, chi-squared tests) assume that the allocation to each variant is fixed and independent of the outcomes. When Thompson sampling adapts the allocation based on observed rewards, this assumption is violated:

1. **Samples are not independent.** The allocation of user $t$ depends on the outcomes of users $1, \ldots, t-1$. Standard error formulas underestimate the true uncertainty.
2. **Biased estimates.** The empirical conversion rate for the winning variant is inflated because more traffic is allocated to it after favorable early results. This is a form of the "winner's curse."
3. **Invalid confidence intervals.** Standard $p$-values and confidence intervals have incorrect coverage — they are too narrow, leading to overconfident conclusions.

Solutions include inverse probability weighting (adjusting for the known allocation probabilities), always-valid confidence sequences (which control error rates under any adaptive stopping rule), and separating exploration and exploitation phases.
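A toy sketch of the inverse-probability-weighting idea, with an invented adaptive allocation rule standing in for Thompson sampling's: because the show probability at each step is known, dividing each observation by it yields a (self-normalized) estimate centered on the true rate even though the allocation adapted to the data.

```python
import random

random.seed(0)
true_p = 0.3                 # true conversion rate of the variant of interest
T = 20000
naive_num = naive_den = 0.0
ipw_num = ipw_den = 0.0
wins = 1.0                   # running successes drive the (adaptive) allocation

for t in range(T):
    # Invented adaptive rule: the probability of showing this variant depends
    # on outcomes observed so far, as Thompson sampling's allocation would.
    p_show = min(0.9, max(0.1, wins / (t + 2)))
    if random.random() < p_show:
        r = 1.0 if random.random() < true_p else 0.0
        wins += r
        naive_num += r
        naive_den += 1.0
        # IPW correction: weight each observation by 1 / (show probability).
        ipw_num += r / p_show
        ipw_den += 1.0 / p_show

naive_est = naive_num / naive_den   # can be distorted under adaptive allocation
ipw_est = ipw_num / ipw_den         # self-normalized IPW estimate
print(naive_est, ipw_est)
```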

Question 16

What is the computational complexity of GP posterior prediction? Why does this limit GPs to relatively small datasets?

Answer Fitting a GP (computing the Cholesky decomposition of $\mathbf{K} + \sigma_n^2\mathbf{I}$) costs $O(n^3)$ time and $O(n^2)$ memory, where $n$ is the number of observations. Each posterior prediction costs $O(n)$ for the mean and $O(n^2)$ for the variance.

For hyperparameter optimization, $n$ is typically 50-500 evaluations, so this is negligible. But GPs cannot be used as surrogate models for problems with thousands of observations (e.g., reinforcement learning with millions of episodes). The $O(n^3)$ cost and $O(n^2)$ memory become prohibitive around $n \approx 10{,}000$.

Alternatives for larger $n$ include:
- sparse GP approximations (FITC, SVGP) that reduce the cost to $O(nm^2)$ using $m \ll n$ inducing points,
- random Fourier features that approximate the kernel, and
- non-GP surrogates like tree-structured Parzen estimators (TPE) or random forests (SMAC).

Question 17

Explain the two terms in the Expected Improvement formula $\text{EI}(\mathbf{x}) = (\mu - f^+ - \xi)\Phi(z) + \sigma\phi(z)$ and their roles.

Answer **First term: $(\mu - f^+ - \xi)\Phi(z)$.** This is the **exploitation term**. It is large when the predicted mean $\mu$ is much higher than the current best $f^+$. The factor $\Phi(z) = P(f(\mathbf{x}) > f^+ + \xi)$ ensures this term only contributes when there is a reasonable probability of improvement. This term drives the search toward regions with high predicted values.

**Second term: $\sigma\phi(z)$.** This is the **exploration term**. It is large when the posterior standard deviation $\sigma$ is large — i.e., in regions that have not been evaluated. The factor $\phi(z)$ (the standard normal PDF) is largest when $z \approx 0$, meaning the predicted mean is near the current best — the "boundary of knowledge." This term drives the search toward uncertain regions.

Together, the two terms naturally balance exploitation (evaluating where we think the function is high) with exploration (evaluating where we are uncertain). Neither term requires an explicit exploration parameter — the balance emerges from the GP posterior.

Question 18

You are designing a recommendation system for an e-commerce site with 50,000 products. Why is a standard $K$-armed bandit (with one arm per product) impractical, and what would you use instead?

Answer A standard $K$-armed bandit with $K = 50{,}000$ arms is impractical for two reasons:

1. **Exploration cost.** Each arm must be pulled at least once (and realistically many times) to estimate its reward. With 50,000 arms, even one pull each requires 50,000 interactions before meaningful exploitation begins. Most products would be explored with far too few observations to form reliable estimates.
2. **No generalization.** Standard MABs treat each arm independently. They cannot use the fact that similar products (same category, same brand, similar features) likely have similar reward distributions.

**Better approaches:**
- **Contextual bandits** with product embeddings as the context, so that information about one product generalizes to similar products.
- **Collaborative filtering bandits** that model the reward as a function of user and product embeddings, exploiting the low-rank structure of the user-product interaction matrix.
- **Hierarchical bandits** that first select a product category (8-20 arms), then a product within the category.
- **Linear bandits** (LinUCB, Linear Thompson Sampling) that model reward as a linear function of product features, reducing the effective dimensionality from 50,000 to the feature dimension.
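A sketch of the linear-bandit option: a LinUCB-style rule with invented 2-D product features and a shared linear reward model. Because all arms share one parameter vector, every pull informs the estimates for every arm, which is exactly the generalization a standard MAB lacks.

```python
import math
import random

random.seed(0)
# Invented product feature vectors (d = 2) standing in for product embeddings.
arms = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
theta_true = (1.0, 0.5)         # reward = theta . x + noise; arm 2 is best (1.5)

# Shared ridge-regression state: A = I + sum x x^T, b = sum r x.
A = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
alpha = 0.5                      # exploration weight
counts = [0, 0, 0]

def inv2(M):
    """Inverse of a 2x2 matrix (A stays positive definite here)."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [[M[1][1] / det, -M[0][1] / det],
            [-M[1][0] / det, M[0][0] / det]]

for t in range(500):
    Ainv = inv2(A)
    theta_hat = [sum(Ainv[i][j] * b[j] for j in range(2)) for i in range(2)]

    def ucb(x):
        mean = sum(th * xi for th, xi in zip(theta_hat, x))
        var = sum(x[i] * Ainv[i][j] * x[j] for i in range(2) for j in range(2))
        return mean + alpha * math.sqrt(var)

    arm = max(range(3), key=lambda k: ucb(arms[k]))
    x = arms[arm]
    r = sum(tt * xi for tt, xi in zip(theta_true, x)) + random.gauss(0, 0.1)
    for i in range(2):
        b[i] += r * x[i]
        for j in range(2):
            A[i][j] += x[i] * x[j]
    counts[arm] += 1

print(counts)  # the highest-reward arm dominates after a few pulls
```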

Question 19

In the StreamRec M8 experiment, Thompson sampling had lower engagement in the first 20 steps but higher engagement in the last 20 steps compared to greedy. Explain this tradeoff and why it leads to higher overall engagement.

Answer **First 20 steps (Thompson < greedy):** Greedy always recommends the category with the highest population prior mean (comedy), which is a reasonable default. Thompson sampling explores: it sometimes recommends categories with lower prior means (documentary, sci-fi) because their posteriors are wide and occasional samples exceed comedy's sample. This exploration incurs short-term cost — some explored categories turn out to have low engagement for the specific user.

**Last 20 steps (Thompson > greedy):** By exploring early, Thompson sampling discovered each user's true favorite category. For a user who secretly loves sci-fi (true engagement 80%), Thompson sampling found this and now recommends sci-fi most of the time. Greedy never discovered this preference because it never recommended sci-fi (the population prior for sci-fi was only 25%). Greedy is stuck recommending comedy (engagement 48%) to a user who would prefer sci-fi (engagement 80%).

**Overall:** The exploration cost in the first 20 steps is more than compensated by the exploitation gain in the remaining 80 steps. This is the fundamental insight of explore-exploit algorithms: a small upfront investment in information gathering enables much better long-term decisions. The total engagement rate for Thompson sampling (0.4582) exceeds greedy (0.4024) by 14%.

Question 20

You need to choose between random search, Optuna (TPE), and GP-based BayesOpt for tuning a model with 15 hyperparameters and a budget of 200 evaluations. Each evaluation takes 2 hours. Which do you choose and why?

Answer **Recommended: Optuna with TPE and Hyperband pruning.** Reasoning:

1. **GP-based BO is a poor fit.** GP-based BO struggles with $d > 10$ dimensions because the GP requires exponentially many observations to cover a high-dimensional space, and the kernel matrix becomes poorly conditioned. With 15 hyperparameters, the GP surrogate would be unreliable.
2. **Random search is a reasonable baseline but suboptimal.** Bergstra and Bengio (2012) showed that random search is surprisingly effective relative to grid search, but 200 evaluations in a 15-dimensional space is still very sparse. Random search would find a reasonable configuration but is unlikely to find a near-optimal one.
3. **TPE handles high dimensions well.** TPE fits independent density estimators per dimension, so its modeling cost grows only linearly with dimensionality and it sidesteps the GP's difficulty of fitting a reliable surrogate in high dimensions. With 200 evaluations, TPE has enough data to build good density models in 15D.
4. **Hyperband saves compute.** At 2 hours per evaluation, 200 full evaluations would take 400 hours (~17 days of sequential compute). Hyperband pruning can eliminate 30-50% of evaluations early, reducing total wall-clock time to roughly 10 days while maintaining solution quality.
5. **Optuna's define-by-run API** handles the conditional hyperparameters that inevitably arise in a 15-parameter search space (e.g., architecture choices that make certain hyperparameters irrelevant).