Chapter 10: Quiz

Test your understanding of probabilistic and Bayesian methods. Each question has a single best answer unless otherwise noted; the answer and explanation follow each question.


Question 1

In Bayes' theorem $p(\theta \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \theta) \, p(\theta)}{p(\mathcal{D})}$, which term is the marginal likelihood?

  • (a) $p(\theta \mid \mathcal{D})$
  • (b) $p(\mathcal{D} \mid \theta)$
  • (c) $p(\theta)$
  • (d) $p(\mathcal{D})$
Show Answer **(d)** $p(\mathcal{D})$ is the marginal likelihood (also called the evidence). It is computed by integrating the joint probability over all possible parameter values: $p(\mathcal{D}) = \int p(\mathcal{D} \mid \theta) p(\theta) \, d\theta$. It serves as a normalizing constant for the posterior and is used in Bayesian model comparison.
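A quick sketch of the evidence computation, using a discrete parameter so the integral becomes a sum. The two coin hypotheses and the observed counts are purely illustrative:

```python
# Marginal likelihood for a discrete parameter: the integral becomes a sum.
# Illustrative setup: a coin whose heads-probability theta is either 0.5 or
# 0.8, with a uniform prior over the two hypotheses.
prior = {0.5: 0.5, 0.8: 0.5}

def likelihood(theta, k, n):
    """Binomial likelihood of k heads in n flips (omitting the constant n-choose-k)."""
    return theta**k * (1 - theta) ** (n - k)

k, n = 7, 10  # observed: 7 heads in 10 flips

# Evidence: p(D) = sum over theta of p(D | theta) * p(theta)
evidence = sum(likelihood(t, k, n) * p for t, p in prior.items())

# Posterior: p(theta | D) = p(D | theta) * p(theta) / p(D)
posterior = {t: likelihood(t, k, n) * p / evidence for t, p in prior.items()}
print(posterior)  # normalized: values sum to 1
```

Dividing by the evidence is exactly what makes the posterior sum (or integrate) to one.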

Question 2

Naive Bayes classifiers assume that:

  • (a) All features have the same distribution
  • (b) Features are independent given the class label
  • (c) Features are independent unconditionally
  • (d) The prior probability of each class is equal
Show Answer **(b)** Naive Bayes assumes **conditional independence** of features given the class label: $p(\mathbf{x} \mid y) = \prod_j p(x_j \mid y)$. This is weaker than unconditional independence (c). The features can have different distributions (ruling out a), and class priors can be unequal (ruling out d).

Question 3

In Bayesian linear regression, the predictive variance at a test point has two components. What do they represent?

  • (a) Bias and variance of the estimator
  • (b) Training error and test error
  • (c) Aleatoric (noise) uncertainty and epistemic (parameter) uncertainty
  • (d) Prior uncertainty and likelihood uncertainty
Show Answer **(c)** The predictive variance $\sigma_*^2 = \sigma^2 + \mathbf{x}_*^\top \mathbf{S}_N \mathbf{x}_*$ consists of **aleatoric uncertainty** ($\sigma^2$, irreducible data noise) and **epistemic uncertainty** ($\mathbf{x}_*^\top \mathbf{S}_N \mathbf{x}_*$, uncertainty about model parameters that decreases with more data).

Question 4

Which of the following is the conjugate prior for a Binomial likelihood?

  • (a) Gamma distribution
  • (b) Normal distribution
  • (c) Beta distribution
  • (d) Dirichlet distribution
Show Answer **(c)** The **Beta distribution** is conjugate to the Binomial (and Bernoulli) likelihood. If the prior is Beta($\alpha$, $\beta$) and we observe $k$ successes in $n$ trials, the posterior is Beta($\alpha + k$, $\beta + n - k$). The Dirichlet (d) is the multivariate generalization, conjugate to the Multinomial.
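The conjugate update is just count addition, which a few lines of Python make concrete (the prior parameters and counts below are illustrative):

```python
# Beta-Binomial conjugate update: the posterior parameters are obtained by
# simple addition, with no integration required.
def beta_binomial_update(alpha, beta, k, n):
    """Return posterior (alpha, beta) after observing k successes in n trials."""
    return alpha + k, beta + (n - k)

# Beta(2, 2) prior, then observe 9 successes in 12 trials.
a_post, b_post = beta_binomial_update(2, 2, 9, 12)
print(a_post, b_post)              # -> 11 5
print(a_post / (a_post + b_post))  # posterior mean = alpha / (alpha + beta)
```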

Question 5

In the Metropolis-Hastings algorithm, if the proposal distribution is symmetric, the acceptance ratio simplifies to:

  • (a) $\min(1, p(\theta') / p(\theta^{(t-1)}))$ where $p$ is the posterior
  • (b) Always 1 (all proposals accepted)
  • (c) The ratio of the proposal densities
  • (d) $\min(1, p(\theta^{(t-1)}) / p(\theta'))$ where $p$ is the posterior
Show Answer **(a)** When the proposal is symmetric, $q(\theta' \mid \theta) = q(\theta \mid \theta')$, and the proposal ratio cancels. The acceptance ratio becomes $\alpha = \min(1, p(\theta' \mid \mathcal{D}) / p(\theta^{(t-1)} \mid \mathcal{D}))$. This is the simpler Metropolis algorithm. Proposals to higher-probability regions are always accepted; proposals to lower-probability regions are accepted with probability equal to the density ratio.
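A minimal random-walk Metropolis sketch, targeting a standard normal for illustration (the step size and sample count are arbitrary choices):

```python
import math
import random

random.seed(0)

def log_target(theta):
    """Unnormalized log density of a standard normal (illustrative target)."""
    return -0.5 * theta**2

def metropolis(n_steps, step_size=1.0):
    """Random-walk Metropolis with a symmetric Gaussian proposal."""
    theta = 0.0
    samples = []
    for _ in range(n_steps):
        proposal = theta + random.gauss(0.0, step_size)
        # Symmetric proposal: the acceptance ratio reduces to the density ratio.
        log_alpha = log_target(proposal) - log_target(theta)
        if random.random() < math.exp(min(0.0, log_alpha)):
            theta = proposal
        samples.append(theta)
    return samples

samples = metropolis(20000)
mean = sum(samples) / len(samples)
var = sum((x - mean) ** 2 for x in samples) / len(samples)
print(round(mean, 2), round(var, 2))  # should be near 0 and 1
```

Note that uphill moves ($\log \alpha \geq 0$) are always accepted, matching the $\min(1, \cdot)$ in the acceptance rule.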

Question 6

What is the recommended target acceptance rate for random walk Metropolis-Hastings in high dimensions?

  • (a) Approximately 10%
  • (b) Approximately 23%
  • (c) Approximately 50%
  • (d) Approximately 90%
Show Answer **(b)** Roberts et al. (1997) showed that the optimal acceptance rate for random walk Metropolis in high dimensions is approximately **23.4%**. In one dimension, the optimal rate is closer to 44%. Rates that are too high indicate the proposal step size is too small (slow exploration), while rates that are too low indicate steps are too large (most proposals rejected).

Question 7

Variational inference approximates the posterior by:

  • (a) Drawing samples from a Markov chain
  • (b) Finding the closest distribution in a tractable family by minimizing KL divergence
  • (c) Maximizing the marginal likelihood directly
  • (d) Using the Laplace approximation (Gaussian at the MAP)
Show Answer **(b)** Variational inference casts posterior approximation as an **optimization problem**: find $q^*(\theta) = \arg\min_{q \in \mathcal{Q}} \text{KL}(q \| p(\theta \mid \mathcal{D}))$. In practice, this is equivalent to maximizing the Evidence Lower Bound (ELBO). Unlike MCMC (a), it is an optimization method, not a sampling method.

Question 8

What is a known limitation of mean-field variational inference?

  • (a) It cannot handle continuous parameters
  • (b) It tends to overestimate posterior variance
  • (c) It tends to underestimate posterior variance
  • (d) It requires conjugate priors
Show Answer **(c)** Mean-field VI minimizes $\text{KL}(q \| p)$, which is mode-seeking. The fully factorized $q$ cannot capture posterior correlations and tends to **underestimate posterior variance** (produce posteriors that are too concentrated). This is the opposite of methods minimizing $\text{KL}(p \| q)$, which tend to overdisperse.

Question 9

A Gaussian process is fully specified by:

  • (a) A mean function and a kernel (covariance) function
  • (b) A mean vector and a covariance matrix
  • (c) A set of basis functions and their coefficients
  • (d) A likelihood function and a prior distribution
Show Answer **(a)** A GP is a distribution over functions defined by a **mean function** $m(\mathbf{x})$ and a **kernel function** $k(\mathbf{x}, \mathbf{x}')$. For any finite set of inputs, the corresponding function values are jointly Gaussian with mean vector and covariance matrix induced by these two functions. Answer (b) describes a specific multivariate Gaussian, not the general GP.

Question 10

The length scale parameter $\ell$ in the RBF kernel controls:

  • (a) The overall amplitude of the function
  • (b) How quickly the correlation between function values decays with input distance
  • (c) The noise level in the observations
  • (d) The number of basis functions in the approximation
Show Answer **(b)** The length scale $\ell$ determines how quickly the kernel $k(\mathbf{x}, \mathbf{x}') = \sigma_f^2 \exp(-\|\mathbf{x} - \mathbf{x}'\|^2 / (2\ell^2))$ decays with distance. A **small** $\ell$ means function values become uncorrelated over short distances (wiggly functions). A **large** $\ell$ produces smooth, slowly-varying functions. The amplitude is controlled by $\sigma_f^2$.
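The effect of $\ell$ is easy to see numerically; the distance and length scales below are illustrative:

```python
import math

def rbf(x1, x2, lengthscale, sigma_f=1.0):
    """RBF kernel k(x, x') = sigma_f^2 * exp(-(x - x')^2 / (2 l^2)), scalar inputs."""
    return sigma_f**2 * math.exp(-((x1 - x2) ** 2) / (2 * lengthscale**2))

d = 1.0  # input distance
for ell in (0.2, 1.0, 5.0):
    print(ell, round(rbf(0.0, d, ell), 4))
# Small l: correlation is essentially 0 at distance 1 (wiggly functions).
# Large l: correlation stays near 1 (smooth, slowly varying functions).
```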

Question 11

Ridge regression is equivalent to Bayesian linear regression with which prior?

  • (a) Laplace prior on weights
  • (b) Gaussian prior on weights centered at zero
  • (c) Uniform prior on weights
  • (d) Cauchy prior on weights
Show Answer **(b)** Ridge regression with penalty $\lambda \|\mathbf{w}\|^2$ is equivalent to Bayesian linear regression with a **zero-mean isotropic Gaussian prior** $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, (\sigma^2/\lambda)\mathbf{I})$. The ridge solution is the MAP estimate under this prior. Similarly, LASSO (L1 penalty) corresponds to a Laplace prior (a).
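The equivalence can be checked numerically in one dimension, where both estimates have a closed form (the data, penalty, and noise variance below are illustrative):

```python
# 1-D check that the ridge estimate equals the posterior mean of Bayesian
# linear regression with prior w ~ N(0, sigma^2 / lam).
xs = [0.5, 1.0, 1.5, 2.0]
ys = [1.1, 1.9, 3.2, 4.1]
lam, sigma2 = 0.5, 0.25  # ridge penalty and noise variance

sxx = sum(x * x for x in xs)
sxy = sum(x * y for x, y in zip(xs, ys))

# Ridge: minimize sum (y - w x)^2 + lam * w^2  ->  w = Sxy / (Sxx + lam)
w_ridge = sxy / (sxx + lam)

# Bayesian: posterior precision and mean under prior variance sigma2 / lam
post_prec = sxx / sigma2 + lam / sigma2
w_map = (sxy / sigma2) / post_prec

print(w_ridge, w_map)  # identical up to floating point
```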

Question 12

The computational bottleneck of exact GP regression is:

  • (a) Computing the kernel function values -- $\mathcal{O}(n)$
  • (b) Inverting the kernel matrix -- $\mathcal{O}(n^3)$
  • (c) Optimizing hyperparameters -- $\mathcal{O}(n \log n)$
  • (d) Sampling from the predictive distribution -- $\mathcal{O}(n^2)$
Show Answer **(b)** Exact GP regression requires inverting (or factoring) the $n \times n$ kernel matrix $\mathbf{K} + \sigma_n^2 \mathbf{I}$, which costs $\mathcal{O}(n^3)$ in time and $\mathcal{O}(n^2)$ in memory. This makes exact GPs impractical for datasets much larger than a few thousand points. Sparse approximations reduce this to $\mathcal{O}(nm^2)$ using $m$ inducing points.

Question 13

Laplace smoothing in Naive Bayes addresses which problem?

  • (a) Features with high variance
  • (b) Zero-probability estimates for unseen feature values
  • (c) Correlated features
  • (d) Imbalanced class distributions
Show Answer **(b)** Without smoothing, if a feature value never appears with a particular class in training data, $p(x_j \mid y) = 0$, which zeros out the entire class posterior. Laplace smoothing adds a pseudocount $\alpha$ to all counts, ensuring no probability is exactly zero. From a Bayesian perspective, this corresponds to a symmetric Dirichlet prior.
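A small sketch of the smoothed estimate, with illustrative counts and vocabulary size:

```python
# Laplace-smoothed estimate of p(x_j = v | y): add a pseudocount alpha to
# every count so unseen feature values never get probability exactly zero.
def smoothed_prob(count_v_given_y, count_y, n_values, alpha=1.0):
    return (count_v_given_y + alpha) / (count_y + alpha * n_values)

# A word that never appeared with class "spam" across 100 spam documents,
# with a vocabulary of 1000 words:
p_unseen = smoothed_prob(0, 100, 1000)
p_seen = smoothed_prob(30, 100, 1000)
print(p_unseen, p_seen)  # both strictly positive
```

Because the same $\alpha$ is added to every value's count, the smoothed probabilities still sum to one across the feature's possible values.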

Question 14

A 95% Bayesian credible interval means:

  • (a) If we repeated the experiment many times, 95% of intervals would contain the true parameter
  • (b) Given the data and prior, there is a 95% posterior probability the parameter lies in the interval
  • (c) The parameter is exactly within this range 95% of the time
  • (d) 95% of the data falls within this interval
Show Answer **(b)** A Bayesian credible interval has a direct probabilistic interpretation: **given the observed data and the prior**, there is a 95% posterior probability the parameter falls in the interval. This is distinct from a frequentist confidence interval (a), which is a statement about the procedure across hypothetical repetitions, not about the parameter in this particular instance.

Question 15

Which of the following is NOT a benefit of the Bayesian approach?

  • (a) Automatic uncertainty quantification
  • (b) Natural incorporation of prior knowledge
  • (c) Always computationally cheaper than frequentist methods
  • (d) Principled model comparison via the marginal likelihood
Show Answer **(c)** Bayesian methods are often **more computationally expensive** than frequentist alternatives because they require computing or approximating the full posterior distribution (via MCMC, VI, etc.) rather than just a point estimate. The other options are genuine benefits: uncertainty quantification (a), prior knowledge integration (b), and model comparison (d).

Question 16

In the ELBO decomposition $\log p(\mathcal{D}) = \text{ELBO}(q) + \text{KL}(q \| p)$, since KL divergence is always non-negative, what can we conclude?

  • (a) The ELBO is always greater than $\log p(\mathcal{D})$
  • (b) The ELBO is always equal to $\log p(\mathcal{D})$
  • (c) The ELBO is always a lower bound on $\log p(\mathcal{D})$
  • (d) The ELBO is always negative
Show Answer **(c)** Since $\text{KL}(q \| p) \geq 0$, we have $\text{ELBO}(q) \leq \log p(\mathcal{D})$. The ELBO is always a **lower bound** on the log evidence, hence the name "Evidence Lower Bound." Equality holds when $q(\theta) = p(\theta \mid \mathcal{D})$, i.e., the variational approximation exactly matches the true posterior.
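The bound can be verified directly in a discrete toy model (the prior and likelihood values are illustrative):

```python
import math

# Discrete toy model: theta takes two values.
prior = [0.5, 0.5]
lik = [0.2, 0.7]  # p(D | theta)

log_evidence = math.log(sum(l * p for l, p in zip(lik, prior)))

def elbo(q):
    """ELBO(q) = E_q[log p(D, theta) - log q(theta)] for a discrete q."""
    return sum(
        qi * (math.log(lik[i] * prior[i]) - math.log(qi))
        for i, qi in enumerate(q) if qi > 0
    )

# Any q gives a lower bound; the exact posterior attains equality.
posterior = [l * p / math.exp(log_evidence) for l, p in zip(lik, prior)]
print(elbo([0.5, 0.5]) <= log_evidence)              # True for any q
print(abs(elbo(posterior) - log_evidence) < 1e-12)   # bound is tight at q = posterior
```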

Question 17

A prior predictive check involves:

  • (a) Checking if the prior is proper (integrates to 1)
  • (b) Sampling parameters from the prior, generating data from the likelihood, and checking if the simulated data looks plausible
  • (c) Comparing the prior and posterior distributions
  • (d) Verifying that the prior is conjugate to the likelihood
Show Answer **(b)** A prior predictive check involves: (1) sampling parameter values $\theta \sim p(\theta)$ from the prior, (2) generating synthetic data $\mathcal{D} \sim p(\mathcal{D} \mid \theta)$ from the likelihood, and (3) checking whether the simulated data falls in a plausible range. This is a powerful diagnostic for detecting overly vague or overly informative priors before seeing the actual data.

Question 18

Which MCMC diagnostic compares within-chain and between-chain variance to assess convergence?

  • (a) Effective sample size (ESS)
  • (b) Autocorrelation function (ACF)
  • (c) Gelman-Rubin $\hat{R}$ statistic
  • (d) Trace plot
Show Answer **(c)** The **Gelman-Rubin $\hat{R}$ statistic** compares the variance of sample means between chains (B) to the average within-chain variance (W). Values close to 1.0 (below 1.01) indicate that chains have converged to the same distribution. Large $\hat{R}$ suggests chains are exploring different regions of parameter space.
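A simplified (non-split) version of the statistic can be computed in a few lines; the two toy chain pairs below are illustrative:

```python
# Simplified (non-split) Gelman-Rubin R-hat for m equal-length chains.
def r_hat(chains):
    m, n = len(chains), len(chains[0])
    means = [sum(c) / n for c in chains]
    grand = sum(means) / m
    B = n / (m - 1) * sum((mu - grand) ** 2 for mu in means)   # between-chain
    W = sum(sum((x - mu) ** 2 for x in c) / (n - 1)
            for c, mu in zip(chains, means)) / m               # within-chain
    var_hat = (n - 1) / n * W + B / n
    return (var_hat / W) ** 0.5

mixed = [[0.1, -0.2, 0.3, 0.0], [-0.1, 0.2, 0.1, -0.3]]  # overlapping chains
stuck = [[0.1, -0.2, 0.3, 0.0], [5.1, 4.8, 5.3, 5.0]]    # disjoint chains
print(r_hat(mixed))  # close to 1
print(r_hat(stuck))  # much larger than 1
```

Production implementations use the split-$\hat{R}$ refinement (each chain halved) to also catch within-chain trends.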

Question 19

Which Naive Bayes variant is most appropriate for document classification using word count features?

  • (a) Gaussian Naive Bayes
  • (b) Bernoulli Naive Bayes
  • (c) Multinomial Naive Bayes
  • (d) Complement Naive Bayes
Show Answer **(c)** **Multinomial Naive Bayes** models features as word counts or frequencies drawn from a multinomial distribution. This is the standard choice for bag-of-words or TF-IDF text representations. Gaussian NB (a) is for continuous features; Bernoulli NB (b) is for binary (word presence/absence) features; Complement NB (d) is a variant of Multinomial NB optimized for imbalanced datasets.

Question 20

In the Beta distribution Beta($\alpha$, $\beta$), the pseudocount interpretation means:

  • (a) $\alpha$ and $\beta$ represent the number of features in the model
  • (b) $\alpha$ and $\beta$ act as virtual observations of successes and failures before seeing real data
  • (c) $\alpha$ and $\beta$ control the learning rate of the model
  • (d) $\alpha + \beta$ represents the total number of training samples required
Show Answer **(b)** In the Beta-Binomial model, $\alpha$ and $\beta$ can be interpreted as **virtual observations**: $\alpha$ prior successes and $\beta$ prior failures. The posterior after observing $k$ successes in $n$ trials is Beta($\alpha + k$, $\beta + n - k$), which simply adds real observations to the virtual ones. The "effective sample size" of the prior is $\alpha + \beta$.

Question 21

Hamiltonian Monte Carlo (HMC) improves on random-walk Metropolis by:

  • (a) Using simpler proposal distributions
  • (b) Eliminating the need for a target distribution
  • (c) Using gradient information to make large, efficient proposals
  • (d) Always accepting proposals
Show Answer **(c)** HMC introduces auxiliary momentum variables and simulates Hamiltonian dynamics using the **gradient of the log posterior**. This allows the sampler to make large moves through parameter space while maintaining high acceptance rates, dramatically reducing the random walk behavior of standard MH. It requires the target density to be differentiable.

Question 22

Which statement about the Matern kernel is correct?

  • (a) It always produces infinitely smooth functions
  • (b) The smoothness is controlled by the parameter $\nu$, with $\nu \to \infty$ recovering the RBF kernel
  • (c) It is only defined for 1D inputs
  • (d) It cannot model periodic functions
Show Answer **(b)** The Matern kernel's smoothness is controlled by $\nu$: smaller values produce rougher functions ($\nu = 1/2$ gives an Ornstein-Uhlenbeck process), while $\nu \to \infty$ recovers the infinitely smooth RBF kernel. Common choices are $\nu = 3/2$ (once differentiable) and $\nu = 5/2$ (twice differentiable). It works in any dimension.

Question 23

In Bayesian model comparison, the Bayes factor $\text{BF}_{12}$ is defined as:

  • (a) The ratio of posterior model probabilities
  • (b) The ratio of marginal likelihoods $p(\mathcal{D} \mid M_1) / p(\mathcal{D} \mid M_2)$
  • (c) The ratio of maximum likelihoods
  • (d) The difference in BIC scores
Show Answer **(b)** The Bayes factor is the **ratio of marginal likelihoods**: $\text{BF}_{12} = p(\mathcal{D} \mid M_1) / p(\mathcal{D} \mid M_2)$. It measures the relative evidence the data provides for $M_1$ over $M_2$, independent of prior model probabilities. The posterior odds equal the Bayes factor times the prior odds: $\frac{p(M_1 \mid \mathcal{D})}{p(M_2 \mid \mathcal{D})} = \text{BF}_{12} \cdot \frac{p(M_1)}{p(M_2)}$.

Question 24

The sequential nature of Bayesian updating means:

  • (a) Data must be processed in chronological order
  • (b) Today's posterior becomes tomorrow's prior when new data arrives
  • (c) The prior must be updated after each gradient step
  • (d) Only one data point can be processed at a time
Show Answer **(b)** Bayesian updating is sequential in the sense that the **posterior after seeing data $\mathcal{D}_1$ becomes the prior for the next update with $\mathcal{D}_2$**: $p(\theta \mid \mathcal{D}_1, \mathcal{D}_2) \propto p(\mathcal{D}_2 \mid \theta) \cdot p(\theta \mid \mathcal{D}_1)$. Data can be processed in any order (not just chronological) and in batches of any size -- because the likelihood of i.i.d. (exchangeable) observations factorizes, the order of updates does not affect the final posterior.
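The batch/sequential equivalence is easy to verify with the Beta-Binomial model; the batch sizes and counts below are illustrative:

```python
# Sequential Beta-Binomial updates match a single batch update.
def update(a, b, k, n):
    """Posterior Beta parameters after k successes in n trials."""
    return a + k, b + (n - k)

# Batch: 10 successes in 16 trials at once, from a Beta(1, 1) prior.
batch = update(1, 1, 10, 16)

# Sequential: each posterior becomes the prior for the next batch.
seq = update(1, 1, 3, 5)    # first batch: 3 successes in 5 trials
seq = update(*seq, 7, 11)   # second batch: 7 successes in 11 trials
print(batch, seq)  # identical posteriors
```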

Question 25

Which of the following correctly describes the relationship between aleatoric and epistemic uncertainty?

  • (a) Both decrease with more data
  • (b) Aleatoric decreases with more data; epistemic does not
  • (c) Epistemic decreases with more data; aleatoric does not
  • (d) Neither changes with more data
Show Answer **(c)** **Epistemic uncertainty** (uncertainty about model parameters) decreases as we observe more data -- we become more confident about the model. **Aleatoric uncertainty** (inherent noise in the data-generating process) is irreducible and does not decrease with more data. In Bayesian linear regression, the epistemic term $\mathbf{x}_*^\top \mathbf{S}_N \mathbf{x}_*$ shrinks with more data while the noise term $\sigma^2$ remains constant.
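The same behavior appears in the conjugate normal-mean model, which keeps the arithmetic simple (the noise and prior variances below are illustrative):

```python
# Conjugate normal-mean model: predictive variance = aleatoric + epistemic.
# As n grows, the epistemic term shrinks while the noise floor remains.
sigma2 = 1.0      # known noise variance (aleatoric)
prior_var = 10.0  # prior variance of the unknown mean

def predictive_variance(n):
    post_var = 1.0 / (1.0 / prior_var + n / sigma2)  # epistemic, shrinks with n
    return sigma2 + post_var                         # plus irreducible noise

for n in (1, 10, 1000):
    print(n, round(predictive_variance(n), 4))
# The total approaches sigma2 = 1.0 but never drops below it.
```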

Question 26

Thompson Sampling for the multi-armed bandit problem works by:

  • (a) Always choosing the arm with the highest empirical mean
  • (b) Sampling from each arm's posterior and choosing the arm with the highest sample
  • (c) Choosing arms uniformly at random
  • (d) Choosing the arm with the widest confidence interval
Show Answer **(b)** Thompson Sampling maintains a posterior distribution for each arm's reward probability. At each round, it **samples** from each posterior and plays the arm whose sample is highest. This naturally balances exploration (arms with uncertain posteriors occasionally produce high samples) and exploitation (arms with high estimated rewards usually produce high samples). It is a Bayesian approach to the exploration-exploitation tradeoff.
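A minimal sketch for Bernoulli bandits with Beta posteriors; the true arm probabilities and the number of rounds are illustrative and unknown to the agent:

```python
import random

random.seed(1)

# Thompson Sampling for Bernoulli bandits with Beta(1, 1) priors per arm.
true_probs = [0.3, 0.5, 0.7]
alpha = [1] * 3  # prior successes + 1
beta = [1] * 3   # prior failures + 1
pulls = [0] * 3

for _ in range(2000):
    # Sample one value from each arm's posterior; play the best sample.
    samples = [random.betavariate(a, b) for a, b in zip(alpha, beta)]
    arm = samples.index(max(samples))
    reward = 1 if random.random() < true_probs[arm] else 0
    alpha[arm] += reward
    beta[arm] += 1 - reward
    pulls[arm] += 1

print(pulls)  # the best arm (index 2) should dominate
```

Early on, wide posteriors make all arms plausible (exploration); as evidence accumulates, samples from the best arm's posterior win almost every round (exploitation).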

Question 27

Which of the following is a valid approach to scaling Gaussian processes to large datasets?

  • (a) Using a larger kernel matrix
  • (b) Using inducing points (sparse GP approximation)
  • (c) Increasing the noise variance
  • (d) Removing the mean function
Show Answer **(b)** **Inducing point methods** (sparse GP approximations) select $m \ll n$ inducing points that summarize the training data, reducing the computational cost from $\mathcal{O}(n^3)$ to $\mathcal{O}(nm^2)$. Other scalable approaches include random Fourier features, structured kernel interpolation, and stochastic variational GPs. Options (a), (c), and (d) do not address the fundamental $\mathcal{O}(n^3)$ scaling issue.

Question 28

An improper prior is one that:

  • (a) Violates Bayes' theorem
  • (b) Does not integrate to a finite value
  • (c) Is not conjugate to the likelihood
  • (d) Assigns zero probability to the true parameter
Show Answer **(b)** An improper prior is a function used as a prior that **does not integrate to a finite value** (e.g., a uniform distribution on the entire real line, $p(\theta) \propto 1$ for $\theta \in \mathbb{R}$). Improper priors can still yield proper posteriors when the likelihood contains enough information, but they can also lead to improper posteriors, which is problematic.

Question 29

In probabilistic programming, which inference method is typically the default in modern frameworks like Stan and PyMC?

  • (a) Gibbs sampling
  • (b) Rejection sampling
  • (c) No-U-Turn Sampler (NUTS), a variant of HMC
  • (d) Mean-field variational inference
Show Answer **(c)** Both Stan and PyMC use **NUTS (No-U-Turn Sampler)** as their default inference algorithm. NUTS is an adaptive variant of Hamiltonian Monte Carlo that automatically tunes the trajectory length, eliminating one of HMC's key tuning parameters. It produces high-quality posterior samples for a wide range of models.

Question 30

What does the log marginal likelihood of a GP naturally balance?

  • (a) Training error and test error
  • (b) Data fit and model complexity
  • (c) Prior strength and likelihood strength
  • (d) Computation time and accuracy
Show Answer **(b)** The GP log marginal likelihood $\log p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\psi}) = -\frac{1}{2}\mathbf{y}^\top \mathbf{K}_y^{-1}\mathbf{y} - \frac{1}{2}\log|\mathbf{K}_y| - \frac{n}{2}\log(2\pi)$ contains a **data fit term** (first term, how well the model explains the data) and a **complexity penalty** (second term, the log determinant penalizes overly flexible models). This provides an automatic Occam's razor for hyperparameter selection.