Chapter 22: Exercises
Exercises are graded by difficulty:
- One star (*): Apply the technique from the chapter to a new dataset or scenario
- Two stars (**): Extend the technique or combine it with a previous chapter's methods
- Three stars (***): Derive a result, implement from scratch, or design a system component
- Four stars (****): Research-level problems that connect to open questions in the field
Gaussian Processes
Exercise 22.1 (*)
Consider a GP with RBF kernel ($\sigma_f^2 = 1.0$, $\ell = 1.0$, $\sigma_n^2 = 0.01$) fit to the following 4 observations:
| $x$ | $y$ |
|---|---|
| 0.0 | 0.5 |
| 1.0 | 1.2 |
| 2.0 | 0.8 |
| 4.0 | 1.5 |
(a) Compute the $4 \times 4$ kernel matrix $\mathbf{K}$ by hand (to 3 decimal places).
(b) Using SimpleGP or NumPy, compute the posterior predictive mean and standard deviation at $x_* = 3.0$. Verify that $\sigma_*$ is larger at $x_* = 3.0$ (between observations) than at $x_* = 1.0$ (at an observation).
(c) Plot the GP posterior over $[0, 5]$ with 95% confidence bands. Annotate the plot with the posterior mean and $\pm 2\sigma$ bounds at $x_* = 3.0$.
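A starting point for part (b), assuming only NumPy (the SimpleGP class named in the exercise comes from the chapter; this standalone sketch reimplements the standard GP regression equations with the stated hyperparameters):

```python
import numpy as np

def rbf(a, b, sf2=1.0, ell=1.0):
    # Squared-exponential kernel k(a, b) = sf2 * exp(-(a-b)^2 / (2 ell^2))
    return sf2 * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell ** 2))

def gp_predict(X, y, x_star, sn2=0.01):
    # Posterior predictive mean and standard deviation via a Cholesky solve
    K = rbf(X, X) + sn2 * np.eye(len(X))
    k_star = rbf(X, x_star)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mu = k_star.T @ alpha
    v = np.linalg.solve(L, k_star)
    var = rbf(x_star, x_star).diagonal() - np.sum(v ** 2, axis=0)
    return mu, np.sqrt(var)

X = np.array([0.0, 1.0, 2.0, 4.0])
y = np.array([0.5, 1.2, 0.8, 1.5])
mu, sd = gp_predict(X, y, np.array([1.0, 3.0]))
# sd at x=3.0 (between observations) should exceed sd at x=1.0 (at an observation)
```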
Exercise 22.2 (*)
Compare the RBF and Matern-5/2 kernels on the same dataset.
(a) Generate 8 observations from the function $f(x) = \sin(2\pi x) + 0.3\cos(5\pi x)$ on $[0, 2]$ with noise $\sigma_n = 0.1$.
(b) Fit a GP with RBF kernel ($\ell = 0.5$) and another with Matern-5/2 ($\ell = 0.5$). Plot both posteriors on the same figure.
(c) Which kernel provides wider confidence bands between observations? Why does this matter for Bayesian optimization?
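For part (b), a minimal NumPy sketch of both kernels; the key numerical fact to observe is that Matern-5/2 has heavier tails than RBF at the same length scale:

```python
import numpy as np

def matern52(a, b, sf2=1.0, ell=0.5):
    # Matern-5/2: sf2 * (1 + sqrt(5) r/ell + 5 r^2 / (3 ell^2)) * exp(-sqrt(5) r/ell)
    r = np.abs(a[:, None] - b[None, :])
    s = np.sqrt(5.0) * r / ell
    return sf2 * (1.0 + s + s ** 2 / 3.0) * np.exp(-s)

def rbf(a, b, sf2=1.0, ell=0.5):
    return sf2 * np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell ** 2))

x = np.array([0.0])
# At distance 2 with ell = 0.5, the Matern-5/2 correlation is orders of
# magnitude larger than the RBF correlation, which widens the posterior bands
k_m = matern52(x, np.array([2.0]))[0, 0]
k_r = rbf(x, np.array([2.0]))[0, 0]
```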
Exercise 22.3 (**)
Kernel hyperparameter selection via marginal likelihood.
The log marginal likelihood of a GP is:
$$\log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\theta}) = -\frac{1}{2} \mathbf{y}^T (\mathbf{K}_{\boldsymbol{\theta}} + \sigma_n^2 \mathbf{I})^{-1} \mathbf{y} - \frac{1}{2} \log |\mathbf{K}_{\boldsymbol{\theta}} + \sigma_n^2 \mathbf{I}| - \frac{n}{2} \log 2\pi$$
(a) Implement a function log_marginal_likelihood(X, y, length_scale, signal_var, noise_var) that computes this quantity using the Cholesky decomposition for numerical stability.
(b) For the data in Exercise 22.1, evaluate the log marginal likelihood over a grid of $\ell \in [0.1, 5.0]$ (50 values) with fixed $\sigma_f^2 = 1.0$ and $\sigma_n^2 = 0.01$. Plot the log marginal likelihood as a function of $\ell$.
(c) Identify the optimal $\ell$. Fit GPs with $\ell = 0.2$, the optimal $\ell$, and $\ell = 5.0$. Plot all three posteriors. Explain how underfitting and overfitting manifest in the GP posterior.
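Parts (a) and (b) can be sketched as follows, assuming the RBF kernel from the chapter; the Cholesky factor gives both the quadratic term (via triangular solves) and the log-determinant (via $\log|\mathbf{K}| = 2\sum_i \log L_{ii}$):

```python
import numpy as np

def log_marginal_likelihood(X, y, length_scale, signal_var, noise_var):
    # log p(y | X, theta) for a GP with RBF kernel, computed stably
    n = len(X)
    d2 = (X[:, None] - X[None, :]) ** 2
    K = signal_var * np.exp(-d2 / (2 * length_scale ** 2)) + noise_var * np.eye(n)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (-0.5 * y @ alpha                     # data-fit term
            - np.sum(np.log(np.diag(L)))         # complexity term, 0.5*log|K|
            - 0.5 * n * np.log(2 * np.pi))       # normalization constant

X = np.array([0.0, 1.0, 2.0, 4.0])
y = np.array([0.5, 1.2, 0.8, 1.5])
grid = np.linspace(0.1, 5.0, 50)
lml = [log_marginal_likelihood(X, y, l, 1.0, 0.01) for l in grid]
```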
Exercise 22.4 (**)
Kernel composition. Complex functions can be modeled by composing simple kernels.
(a) Implement a product kernel $k_{\text{prod}}(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') \cdot k_2(\mathbf{x}, \mathbf{x}')$ and a sum kernel $k_{\text{sum}}(\mathbf{x}, \mathbf{x}') = k_1(\mathbf{x}, \mathbf{x}') + k_2(\mathbf{x}, \mathbf{x}')$.
(b) Generate data from $f(x) = \sin(x) + 0.1x^2$ on $[0, 10]$. Fit a GP with: (i) RBF only, (ii) RBF + Linear kernel, (iii) RBF $\times$ Linear kernel. Compare the three posteriors.
(c) Which composition captures the trend-plus-oscillation structure best? Why?
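A sketch of part (a), with a shifted linear kernel as one plausible choice (the shift parameter `c` is an assumption, not fixed by the exercise):

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell ** 2))

def linear(a, b, c=0.0):
    # Linear (dot-product) kernel; c shifts the point all sample paths pass through
    return (a[:, None] - c) * (b[None, :] - c)

def k_sum(a, b):
    # Sum of kernels models additive structure: smooth trend + oscillation
    return rbf(a, b) + linear(a, b)

def k_prod(a, b):
    # Product of kernels lets one component modulate the other's amplitude
    return rbf(a, b) * linear(a, b)
```

Both compositions remain valid covariance functions: sums and products of positive semidefinite kernels are positive semidefinite.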
Acquisition Functions
Exercise 22.5 (*)
For the GP posterior in Exercise 22.1, compute the following acquisition functions at $x_* = 3.0$:
(a) Probability of Improvement with $\xi = 0$.
(b) Expected Improvement with $\xi = 0$.
(c) UCB with $\beta = 2$.
(d) Which acquisition function recommends evaluating at $x_* = 3.0$ most strongly? What does this tell you about its exploration behavior?
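The three acquisition functions can be sketched as below. The posterior values here are placeholders; substitute the $\mu_*$, $\sigma_*$ you computed in Exercise 22.1(b). Note that the UCB scaling convention varies across texts; this sketch assumes $\beta$ multiplies $\sigma$ directly, matching the $\beta = 2$ in part (c):

```python
from math import erf, exp, pi, sqrt

def Phi(z):
    # Standard normal CDF
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def phi(z):
    # Standard normal pdf
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def acquisitions(mu, sigma, f_best, xi=0.0, beta=2.0):
    z = (mu - f_best - xi) / sigma
    pi_acq = Phi(z)                                    # Probability of Improvement
    ei = (mu - f_best - xi) * Phi(z) + sigma * phi(z)  # Expected Improvement
    ucb = mu + beta * sigma                            # Upper Confidence Bound
    return pi_acq, ei, ucb

# Placeholder posterior; replace with your values from 22.1(b)
p, e, u = acquisitions(mu=1.1, sigma=0.4, f_best=1.5)
```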
Exercise 22.6 (**)
Sensitivity to $\xi$ and $\beta$.
Using the bayesian_optimization function from Section 22.3, run 20-step optimization on the function $f(x) = \sin(3x) + 0.5\cos(7x) + 0.2x$ over $[0, 5]$ with:
(a) EI with $\xi \in \{0, 0.01, 0.1, 1.0\}$. Plot the best value found vs. iteration for each $\xi$. Which $\xi$ converges fastest? Which avoids local optima best?
(b) UCB with $\beta \in \{0.1, 1.0, 2.0, 10.0\}$. Same analysis.
(c) Based on your results, recommend default values for $\xi$ and $\beta$, and explain the tradeoff.
Exercise 22.7 (***)
Derive the Expected Improvement formula.
Starting from the definition $\text{EI}(\mathbf{x}) = \mathbb{E}[\max(f(\mathbf{x}) - f^+ - \xi, 0)]$ where $f(\mathbf{x}) \sim \mathcal{N}(\mu, \sigma^2)$:
(a) Write the expectation as an integral from $f^+ + \xi$ to $\infty$.
(b) Substitute $u = (f - \mu) / \sigma$ and show the integral splits into two terms.
(c) Evaluate each term using the identities $\int_a^\infty u\phi(u)du = \phi(a)$ and $\int_a^\infty \phi(u)du = 1 - \Phi(a)$, arriving at:
$$\text{EI}(\mathbf{x}) = (\mu - f^+ - \xi)\Phi(z) + \sigma\phi(z) \quad \text{where } z = \frac{\mu - f^+ - \xi}{\sigma}$$
(d) Verify your derivation numerically: compute EI by Monte Carlo (10,000 samples from $\mathcal{N}(\mu, \sigma^2)$) and compare to the closed-form expression for $\mu = 1.5$, $\sigma = 0.8$, $f^+ = 1.2$, $\xi = 0.01$.
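A sketch of the verification in part (d): the Monte Carlo estimate of $\mathbb{E}[\max(f - f^+ - \xi, 0)]$ should agree with the closed form to roughly two decimal places at 10,000 samples:

```python
import numpy as np
from math import erf, exp, pi, sqrt

mu, sigma, f_best, xi = 1.5, 0.8, 1.2, 0.01

# Closed form from part (c)
z = (mu - f_best - xi) / sigma
Phi_z = 0.5 * (1.0 + erf(z / sqrt(2.0)))
phi_z = exp(-0.5 * z * z) / sqrt(2.0 * pi)
ei_closed = (mu - f_best - xi) * Phi_z + sigma * phi_z

# Monte Carlo estimate of E[max(f - f_best - xi, 0)]
rng = np.random.default_rng(0)
samples = rng.normal(mu, sigma, size=10_000)
ei_mc = float(np.mean(np.maximum(samples - f_best - xi, 0.0)))
```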
Exercise 22.8 (***)
Thompson sampling as an acquisition function for GPs.
(a) Implement GP-Thompson sampling: at each iteration, draw a function sample $\tilde{f}$ from the GP posterior by evaluating $\mu_* + \mathbf{L}_* \mathbf{z}$ where $\mathbf{L}_*$ is the Cholesky factor of the posterior covariance and $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$. Maximize $\tilde{f}$ over a grid.
(b) Run the 1D optimization from Section 22.3 using GP-Thompson sampling. Compare the trajectory (evaluated points over iterations) with EI and UCB.
(c) Why does GP-Thompson sampling naturally diversify across iterations, even without a jitter parameter?
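A minimal sketch of part (a), assuming an RBF kernel and a grid of candidate points. Note that the small numerical jitter added before the Cholesky factorization is for stability only; it is not the exploration jitter parameter part (c) refers to:

```python
import numpy as np

def thompson_draw(X, y, x_grid, ell=1.0, sn2=0.01, rng=None):
    # One posterior function sample over the grid: mu_* + L_* z
    rng = rng or np.random.default_rng(0)
    k = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell ** 2))
    K = k(X, X) + sn2 * np.eye(len(X))
    Ks = k(X, x_grid)
    Kinv_Ks = np.linalg.solve(K, Ks)
    mu = Kinv_Ks.T @ y
    cov = k(x_grid, x_grid) - Ks.T @ Kinv_Ks
    L = np.linalg.cholesky(cov + 1e-6 * np.eye(len(x_grid)))  # numerical jitter
    return mu + L @ rng.standard_normal(len(x_grid))

X = np.array([0.0, 1.0, 2.0, 4.0])
y = np.array([0.5, 1.2, 0.8, 1.5])
grid = np.linspace(0.0, 5.0, 40)
f_tilde = thompson_draw(X, y, grid)
x_next = grid[np.argmax(f_tilde)]  # maximize the sampled function over the grid
```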
Multi-Armed Bandits
Exercise 22.9 (*)
A streaming platform has 5 thumbnail variants for a show. True click-through rates are $[0.08, 0.10, 0.12, 0.09, 0.11]$.
(a) Run epsilon-greedy ($\epsilon = 0.05$), UCB1, and Thompson sampling for $T = 5000$ steps. Plot cumulative regret for each.
(b) For Thompson sampling, plot the Beta posteriors for all 5 arms at $t = 100$, $t = 500$, and $t = 5000$. How do the posteriors evolve?
(c) Compute the fraction of pulls allocated to the best arm (arm 3) for each algorithm. Which algorithm concentrates most quickly?
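One of the three algorithms, Thompson sampling with Beta(1, 1) priors, can be sketched as follows; epsilon-greedy and UCB1 differ only in the `select` step:

```python
import numpy as np

def thompson_bernoulli(true_p, T, rng=None):
    # Beta-Bernoulli Thompson sampling: sample a mean per arm, pull the argmax
    rng = rng or np.random.default_rng(0)
    K = len(true_p)
    alpha = np.ones(K)   # prior successes + 1
    beta = np.ones(K)    # prior failures + 1
    pulls = np.zeros(K, dtype=int)
    regret = 0.0
    best = max(true_p)
    for _ in range(T):
        theta = rng.beta(alpha, beta)
        a = int(np.argmax(theta))
        r = rng.random() < true_p[a]
        alpha[a] += r
        beta[a] += 1 - r
        pulls[a] += 1
        regret += best - true_p[a]
    return pulls, regret

pulls, regret = thompson_bernoulli([0.08, 0.10, 0.12, 0.09, 0.11], T=5000)
```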
Exercise 22.10 (*)
Decaying epsilon-greedy.
(a) Implement epsilon-greedy with $\epsilon_t = \min(1, 5K / (d^2 t))$ where $K$ is the number of arms and $d$ is a "minimum gap" parameter.
(b) Using the 5-arm problem from Exercise 22.9, run decaying epsilon-greedy with $d \in \{0.01, 0.02, 0.05\}$. Compare regret with fixed $\epsilon = 0.05$ and with Thompson sampling.
(c) The optimal $d$ requires knowing the true gap. Explain why this is a practical limitation.
Exercise 22.11 (**)
Batched Thompson sampling.
In production, recommendations are often made in batches (e.g., 10 recommendations per page load). Implement batched Thompson sampling:
(a) Naive approach: draw a single sample per arm, select the top-$B$ arms (where $B$ is the batch size). Why does this underexplore?
(b) Improved approach: draw $B$ independent function samples from each arm's posterior and use a matching algorithm. Implement this.
(c) Compare the two approaches on the 8-category StreamRec problem with batch size $B = 3$. Which achieves lower regret?
Exercise 22.12 (**)
Non-stationary bandits.
User preferences change over time. Implement a sliding-window Thompson sampling variant:
(a) Instead of accumulating all observations, maintain a window of the last $W$ observations per arm. Implement this by tracking a deque of (reward, timestamp) pairs and using window-based counts for the Beta posterior.
(b) Simulate a non-stationary problem: 8 arms where the best arm changes every 500 steps (rotate among the top 3). Compare standard Thompson sampling with sliding-window TS ($W = 200$).
(c) What is the tradeoff in choosing $W$? What happens if $W$ is too small or too large?
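A sketch of part (a). For brevity this version stores only rewards (a `deque(maxlen=W)` evicts the oldest entry automatically); adding timestamps as the exercise suggests is a straightforward extension:

```python
from collections import deque
import numpy as np

class SlidingWindowTS:
    # Thompson sampling where each arm's Beta posterior is built from only
    # the last `window` rewards for that arm, so stale evidence ages out.
    def __init__(self, n_arms, window=200, rng=None):
        self.windows = [deque(maxlen=window) for _ in range(n_arms)]
        self.rng = rng or np.random.default_rng(0)

    def select(self):
        samples = [
            self.rng.beta(1 + sum(w), 1 + len(w) - sum(w))
            for w in self.windows
        ]
        return int(np.argmax(samples))

    def update(self, arm, reward):
        self.windows[arm].append(int(reward))
```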
Exercise 22.13 (***)
Prove that UCB1 achieves logarithmic regret.
Follow the proof outline from Auer et al. (2002):
(a) For a suboptimal arm $k$, show that $n_k(T) \leq m + \sum_{t=m+1}^{T} \mathbb{1}\{a_t = k\}$ for any integer $m$.
(b) Show that if arm $k$ is selected at time $t$ with $n_k(t) \geq m$, then either $\hat{\mu}^*(t) \leq \mu^* - c_t(n^*)$ or $\hat{\mu}_k(t) \geq \mu_k + c_t(n_k)$ where $c_t(s) = \sqrt{2\ln t / s}$.
(c) Apply Hoeffding's inequality to bound the probability of each event and show that choosing $m = \lceil 8\ln T / \Delta_k^2 \rceil$ gives $\mathbb{E}[n_k(T)] \leq 8\ln T / \Delta_k^2 + 1 + \pi^2/3$.
(d) Sum over all suboptimal arms to obtain the regret bound.
Contextual Bandits
Exercise 22.14 (**)
Contextual Thompson sampling.
(a) Implement contextual Thompson sampling with a Bayesian linear regression model: $r = \mathbf{x}^T\boldsymbol{\theta}_a + \epsilon$, $\boldsymbol{\theta}_a \sim \mathcal{N}(\hat{\boldsymbol{\theta}}_a, \sigma^2 \mathbf{A}_a^{-1})$. At each step, sample $\tilde{\boldsymbol{\theta}}_a$ from the posterior and select $\arg\max_a \mathbf{x}^T\tilde{\boldsymbol{\theta}}_a$.
(b) Compare with LinUCB on the contextual bandit simulation from Section 22.6 (4 categories, 3 context features, 2000 steps). Plot cumulative reward for both algorithms.
(c) Which algorithm performs better when the true model is linear? What happens when the true reward function is nonlinear (e.g., $r = \sigma(\mathbf{x}^T\boldsymbol{\theta}_a)$)?
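A sketch of part (a), using a ridge-style prior ($\mathbf{A}_a = \mathbf{I} + \sum \mathbf{x}\mathbf{x}^T$, $\mathbf{b}_a = \sum r\mathbf{x}$); the noise scale `sigma` is an assumed hyperparameter:

```python
import numpy as np

class LinearTS:
    # Per-arm Bayesian linear regression; posterior theta_a ~ N(A^-1 b, sigma^2 A^-1)
    def __init__(self, n_arms, dim, sigma=0.5, rng=None):
        self.A = [np.eye(dim) for _ in range(n_arms)]
        self.b = [np.zeros(dim) for _ in range(n_arms)]
        self.sigma = sigma
        self.rng = rng or np.random.default_rng(0)

    def select(self, x):
        # Sample theta_a from each arm's posterior, pick argmax of x^T theta_a
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = self.rng.multivariate_normal(A_inv @ b, self.sigma ** 2 * A_inv)
            scores.append(x @ theta)
        return int(np.argmax(scores))

    def update(self, arm, x, r):
        self.A[arm] += np.outer(x, x)
        self.b[arm] += r * x
```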
Exercise 22.15 (**)
Disjoint vs. shared features in contextual bandits.
The LinUCB implementation in Section 22.6 uses disjoint models (separate $\boldsymbol{\theta}_a$ per arm). An alternative shares features across arms:
$$\mathbb{E}[r \mid \mathbf{x}, a] = \mathbf{z}_{a,\mathbf{x}}^T \boldsymbol{\theta}$$
where $\mathbf{z}_{a,\mathbf{x}}$ is a feature vector that depends on both the arm and the context.
(a) Construct $\mathbf{z}_{a,\mathbf{x}}$ by concatenating the context $\mathbf{x}$ with a one-hot arm encoding and their interactions.
(b) Implement shared-feature LinUCB with a single $\boldsymbol{\theta}$ and $\mathbf{A}$ matrix.
(c) Compare disjoint and shared models on a problem where 2 of 6 arms share similar coefficients. Which model learns faster?
Optuna and Hyperparameter Optimization
Exercise 22.16 (*)
Use Optuna to tune a Random Forest classifier on the Breast Cancer Wisconsin dataset (from sklearn.datasets).
(a) Define a search space: n_estimators (50-500), max_depth (2-30), min_samples_split (2-20), min_samples_leaf (1-10), max_features ("sqrt", "log2").
(b) Run 50 trials with TPESampler and report the best accuracy (5-fold CV).
(c) Plot the parameter importance using optuna.visualization.plot_param_importances.
Exercise 22.17 (**)
Conditional search spaces with Optuna.
(a) Define an objective that selects between three model types — Logistic Regression, Random Forest, and Gradient Boosting — and tunes the hyperparameters specific to each. Use Optuna's conditional parameter suggestions (if model_type == "rf": ...).
(b) Run 80 trials on a classification dataset. Which model type does Optuna converge to?
(c) Plot the optimization history colored by model type.
Exercise 22.18 (**)
Hyperband pruning analysis.
(a) Using the StreamRec objective from Section 22.8, run two studies: one with HyperbandPruner and one with no pruning (optuna.pruners.NopPruner). Use 100 trials each.
(b) Compare total "compute time" (sum of epochs across all trials) for each approach. How much compute does Hyperband save?
(c) Compare the best value found. Does Hyperband sacrifice solution quality for compute savings?
Exercise 22.19 (***)
Implement TPE from scratch.
(a) For a 1D continuous parameter, implement the TPE sampler:
1. Sort trials by objective value and split at the $\gamma$-quantile ($\gamma = 0.25$).
2. Fit a kernel density estimate (KDE) to the "good" group ($\ell(x)$) and the "bad" group ($g(x)$).
3. Sample candidates from $\ell(x)$ and select the one with the highest $\ell(x)/g(x)$ ratio.
(b) Run your TPE implementation on $f(x) = -(x - 3)^2 + 10 + 0.3\sin(10x)$ over $[0, 6]$ for 50 iterations. Compare with random search.
(c) Extend to 2D by using independent KDEs per dimension (the standard TPE factorization). Test on $f(x, y) = -(x-2)^2 - (y-3)^2 + 10$.
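The 1D sampler in part (a) can be sketched with SciPy's Gaussian KDE (this version assumes maximization, so the "good" group is the top $\gamma$ fraction of trials):

```python
import numpy as np
from scipy.stats import gaussian_kde

def tpe_suggest(xs, ys, bounds, gamma=0.25, n_candidates=64, rng=None):
    # Split trials at the gamma-quantile, fit KDEs l(x) to the good group
    # and g(x) to the bad group, return the candidate maximizing l(x)/g(x)
    rng = rng or np.random.default_rng(0)
    xs, ys = np.asarray(xs), np.asarray(ys)
    cut = np.quantile(ys, 1.0 - gamma)
    good, bad = xs[ys >= cut], xs[ys < cut]
    l, g = gaussian_kde(good), gaussian_kde(bad)
    cand = np.clip(l.resample(n_candidates, seed=rng).ravel(), *bounds)
    ratio = l(cand) / np.maximum(g(cand), 1e-12)
    return float(cand[np.argmax(ratio)])

rng = np.random.default_rng(1)
xs = rng.uniform(0.0, 6.0, size=30)
ys = -(xs - 3.0) ** 2 + 10.0 + 0.3 * np.sin(10.0 * xs)
x_next = tpe_suggest(xs, ys, bounds=(0.0, 6.0))
```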
Applied Problems
Exercise 22.20 (**)
StreamRec: Thompson sampling with non-uniform priors.
In Chapter 20, we estimated category priors from population statistics. Different user demographics have different priors.
(a) Define three demographic groups (young/18-25, middle/26-45, older/46+) with different category priors. For example, young users might have higher priors for comedy and sci-fi; older users for documentary and drama.
(b) Implement demographic-aware Thompson sampling where the prior depends on the user's demographic group.
(c) Compare with a single-prior Thompson sampling over a simulation of 500 users (equal split among demographics). Report engagement rate improvement.
Exercise 22.21 (**)
Credit scoring: risk-constrained Thompson sampling.
In the Meridian Financial example (Section 22.11), acceptance rate is the only objective. In practice, offers must balance acceptance rate with default risk.
(a) Extend CreditOfferBandit to maintain two Beta posteriors per (segment, offer) pair: one for acceptance rate and one for default rate.
(b) Implement constrained Thompson sampling: sample from both posteriors, but only select offers where the sampled default rate is below a threshold $\delta = 0.05$.
(c) Simulate with true acceptance and default rates that create a tradeoff (e.g., high-limit offers have high acceptance but also high default). Does the constrained algorithm learn to avoid risky offers?

Exercise 22.22 (***)
BayesOpt for neural architecture search.
(a) Define a search space for a feedforward neural network: number of layers (1-5), neurons per layer (16-512), activation function (ReLU, tanh, GELU), learning rate ($10^{-5}$ to $10^{-2}$), batch size (32, 64, 128, 256).
(b) Using Optuna, implement an objective that trains a PyTorch model on the MNIST dataset for 5 epochs and returns validation accuracy. Use Hyperband pruning.
(c) Run 100 trials and report the best architecture. Compare with the default architecture (2 layers, 128 neurons, ReLU, lr=$10^{-3}$, batch_size=64).
Exercise 22.23 (***)
Multi-objective Bayesian optimization.
Optuna supports multi-objective optimization via NSGA-II.
(a) Define a StreamRec objective with two goals: maximize Hit@10 and minimize model latency (simulated as a function of embedding_dim and n_layers).
(b) Run a multi-objective study with 80 trials. Visualize the Pareto front.
(c) Select a point on the Pareto front that achieves at least 95% of the best Hit@10 while minimizing latency. Justify your choice.
Exercise 22.24 (***)
Bayesian optimization vs. random search: when does BO win?
(a) Generate synthetic objective functions of increasing difficulty: (i) a smooth quadratic in 3D, (ii) the Branin function in 2D, (iii) the Hartmann function in 6D.
(b) For each function, run random search, TPE, and GP-based BO with budgets of 20, 50, and 100 evaluations. Record the best value found.
(c) Plot a heatmap: rows are functions, columns are budgets, cells show the improvement of BO over random search. When is BO most valuable? When is random search sufficient?
Research-Level Problems
Exercise 22.25 (****)
Bayesian optimization with input-dependent noise.
Standard GP-based BO assumes homoscedastic noise ($\sigma_n^2$ is constant). In hyperparameter tuning, noise varies: configurations near the optimum have low variance, while unstable configurations have high variance.
(a) Implement a heteroscedastic GP by modeling both the mean function and the log noise variance as GPs: $y = f(\mathbf{x}) + \epsilon(\mathbf{x})$ where $\epsilon(\mathbf{x}) \sim \mathcal{N}(0, \exp(g(\mathbf{x})))$.
(b) Derive the expected improvement under heteroscedastic noise (the standard EI formula changes because $\sigma_*^2$ now includes the input-dependent noise).
(c) Test on a synthetic function with noise that increases away from the optimum. Compare with standard (homoscedastic) BO.
Exercise 22.26 (****)
Thompson sampling with misspecified priors.
(a) Simulate a 10-arm Bernoulli bandit where the prior is Beta(5, 5) (centered at 0.5) but the true means are $\{0.05, 0.10, 0.15, \ldots, 0.50\}$ — all below or at the prior mean.
(b) Compare Thompson sampling's regret with UCB1 over $T = 10{,}000$ steps, averaged over 100 runs. Which algorithm is more robust to prior misspecification?
(c) Implement a "prior correction" mechanism: after every 100 steps, re-estimate the prior hyperparameters using empirical Bayes (fit a Beta distribution to the observed arm means). Does this improve Thompson sampling's robustness?
Exercise 22.27 (****)
The information-directed sampling algorithm.
Russo and Van Roy (2014) proposed Information-Directed Sampling (IDS), which selects actions to minimize the ratio of expected squared regret to information gain.
(a) For a 3-arm Bernoulli bandit, implement IDS: at each step, compute for each arm $a$:
- $\Delta_a^2$: the squared expected regret of playing arm $a$.
- $g_a$: the mutual information between the reward and the identity of the optimal arm, conditional on playing arm $a$.
Then select $a^* = \arg\min_a \Delta_a^2 / g_a$.
(b) Compare IDS with Thompson sampling on problems with 3, 10, and 50 arms. In which regime does IDS offer the largest improvement?
(c) Discuss the computational cost of IDS vs. Thompson sampling. When is the additional cost justified?
Exercise 22.28 (****)
Safe Bayesian optimization.
In some applications (e.g., tuning a production system), certain configurations could cause failures. Safe BO adds a constraint: only evaluate points where the GP predicts that a safety function $g(\mathbf{x}) \geq 0$ with high probability.
(a) Implement SafeOpt (Sui et al., 2015): maintain a GP for the objective $f$ and another for the safety constraint $g$. Only evaluate points where $\mu_g(\mathbf{x}) - \beta^{1/2}\sigma_g(\mathbf{x}) \geq 0$.
(b) Test on a 1D problem where $f(x) = \sin(3x)$ and the safety constraint is $g(x) = 1 - (x - 2.5)^2$ (unsafe region around $x = 2.5$). Show that SafeOpt avoids the unsafe region while finding the best safe point.
(c) Compare with unconstrained BO. How much does the safety constraint reduce the quality of the solution?