Chapter 20: Quiz
Test your understanding of Bayesian thinking. Answers follow each question.
Question 1
Write Bayes' theorem for a parameter $\theta$ given data $D$. Label each term (prior, likelihood, posterior, marginal likelihood).
Answer
$$p(\theta \mid D) = \frac{p(D \mid \theta) \, p(\theta)}{p(D)}$$

- $p(\theta)$: **prior** — beliefs about $\theta$ before observing data
- $p(D \mid \theta)$: **likelihood** — probability of the data given parameter value $\theta$
- $p(D) = \int p(D \mid \theta) p(\theta) \, d\theta$: **marginal likelihood** (evidence) — normalizing constant
- $p(\theta \mid D)$: **posterior** — updated beliefs about $\theta$ after observing data

The working form is $p(\theta \mid D) \propto p(D \mid \theta) \, p(\theta)$ (posterior is proportional to likelihood times prior).

Question 2
You have a Beta(3, 7) prior on a coin's probability of heads. You observe 5 heads and 5 tails. What is the posterior distribution?
Answer
The posterior is $\text{Beta}(3 + 5, 7 + 5) = \text{Beta}(8, 12)$. The posterior mean is $8 / (8 + 12) = 0.40$, which lies between the prior mean $3/10 = 0.30$ and the MLE $5/10 = 0.50$, pulled toward the prior because the prior pseudo-count ($3 + 7 = 10$) is comparable to the sample size ($n = 10$).
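This conjugate update is easy to check numerically; a minimal sketch in plain Python (the helper name `update_beta` is illustrative):

```python
# Beta-Binomial conjugate update: a Beta(a, b) prior plus k heads and
# (n - k) tails gives a Beta(a + k, b + n - k) posterior.
def update_beta(a, b, heads, tails):
    return a + heads, b + tails

a0, b0 = 3, 7                        # Beta(3, 7) prior
a1, b1 = update_beta(a0, b0, 5, 5)   # observe 5 heads, 5 tails

prior_mean = a0 / (a0 + b0)          # 0.30
post_mean = a1 / (a1 + b1)           # Beta(8, 12) mean = 0.40
mle = 5 / 10                         # 0.50

print(a1, b1, post_mean)             # 8 12 0.4
```

The posterior mean lands between the prior mean and the MLE, exactly as the shrinkage argument above predicts.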
Question 3

True or False: With a uniform prior (Beta(1, 1)), the posterior mean for a Bernoulli rate parameter equals the MLE.
Answer
**False.** With Beta(1, 1), the posterior is Beta($1 + k, 1 + n - k$), and the posterior mean is $(1 + k) / (2 + n)$, which is *not* equal to the MLE $k/n$ (except in the limit $n \to \infty$). With $k = 7, n = 10$: MLE $= 0.70$, posterior mean $= 8/12 \approx 0.667$. The Beta(1, 1) prior adds one pseudo-count to each outcome, shrinking the estimate toward 0.5.

The **posterior mode** (MAP), by contrast, does equal the MLE under a uniform prior: the mode of Beta($1 + k, 1 + n - k$) is $k/n$.

Question 4
Explain the difference between a credible interval and a confidence interval in one sentence each.
Answer
**Credible interval:** Given the observed data, there is a 95% probability that the parameter lies within this interval (a direct probability statement about the parameter, conditional on the data).

**Confidence interval:** If the experiment were repeated many times, 95% of the intervals constructed this way would contain the true parameter (a statement about the long-run frequency properties of the procedure, not about this particular interval).

The credible interval answers the question practitioners usually want: "Where is the parameter?" The confidence interval answers a different question: "How reliable is this procedure?"

Question 5
What is a conjugate prior? Give one example of a conjugate prior-likelihood pair.
Answer
A conjugate prior is a prior distribution that, when combined with a particular likelihood via Bayes' theorem, produces a posterior belonging to the same distributional family as the prior.

Example: the **Beta distribution** is conjugate to the **Binomial likelihood**. If the prior is Beta($\alpha, \beta$) and the data consist of $k$ successes in $n$ trials, the posterior is Beta($\alpha + k, \beta + n - k$) — still a Beta distribution, with updated parameters.

Other examples: a Normal prior is conjugate to a Normal likelihood (with known variance), a Gamma prior to a Poisson likelihood, and a Dirichlet prior to a Multinomial likelihood.

Question 6
A Bayesian analysis uses a prior of $\mathcal{N}(0, 10^2)$ on a regression coefficient. The posterior mean is 3.2 with posterior standard deviation 0.8. A colleague argues: "The prior was wrong — the true value is nowhere near 0." Is this a valid criticism?
Answer
**No.** The prior is not a prediction of the true value; it encodes pre-data uncertainty. A $\mathcal{N}(0, 10^2)$ prior places substantial probability on values well above 3.2 (the 95% prior interval is approximately $[-20, 20]$, which easily includes 3.2). The data updated the prior to concentrate the posterior around 3.2, which is exactly how Bayesian inference is supposed to work.

A "wrong" prior would be one that excludes the true value entirely (e.g., a prior concentrated on negative values when the true effect is positive) or one so tight that the data cannot move the posterior. A broad, weakly informative prior that gets updated by the data is functioning correctly.

Question 7
Explain the MAP-MLE-regularization connection: what is the MAP estimate when the prior on regression coefficients is $\mathcal{N}(0, \tau^2)$?
Answer
The MAP estimate with a $\mathcal{N}(0, \tau^2)$ prior on each coefficient is equivalent to **L2-regularized (ridge) regression** with regularization parameter $\lambda = 1/\tau^2$. The MAP objective is:

$$\hat{\theta}_{\text{MAP}} = \arg\min_\theta \left[-\log p(D \mid \theta) + \frac{1}{2\tau^2}\|\theta\|_2^2\right]$$

This is the negative log-likelihood (the usual loss function) plus an L2 penalty. A tighter prior (smaller $\tau^2$) produces stronger regularization (larger $\lambda$). The special case $\tau^2 \to \infty$ (flat prior) recovers the MLE. Similarly, a Laplace$(0, b)$ prior produces L1 regularization (lasso) with $\lambda = 1/b$.
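A small NumPy sketch of the equivalence, assuming a Gaussian likelihood with unit noise variance so that $\lambda = 1/\tau^2$ (the data here are synthetic and purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
true_theta = np.array([2.0, -1.0, 0.5])
y = X @ true_theta + rng.normal(scale=1.0, size=50)

tau2 = 4.0            # prior variance on each coefficient
lam = 1.0 / tau2      # equivalent ridge penalty (unit noise variance assumed)

# Ridge / MAP closed form: (X^T X + lam I)^{-1} X^T y
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(3), X.T @ y)

# Flat-prior limit (tau2 -> inf, lam -> 0) recovers ordinary least squares / MLE
theta_mle = np.linalg.solve(X.T @ X, X.T @ y)

print(theta_map)      # shrunk toward zero relative to theta_mle
```

Shrinking $\tau^2$ pulls `theta_map` harder toward the prior mean of zero, mirroring a larger ridge penalty.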
Question 8

You have two models for binary data: $M_1$ with a Beta(50, 50) prior (concentrated near 0.5) and $M_2$ with a Beta(1, 1) prior (uniform). You observe 48 heads in 100 flips. Which model has a higher marginal likelihood, and why?
Answer
**$M_1$ has the higher marginal likelihood.** The observed data (48/100 heads) are highly consistent with $M_1$'s prior, which is concentrated around 0.5. The marginal likelihood measures the average probability of the data under the prior, and $M_1$'s prior assigns much more probability mass near $\theta = 0.48$ than $M_2$'s uniform prior does. Even though $M_2$ can "accommodate" any data, it spreads its prior probability over the entire $[0, 1]$ interval. This is the **Bayesian Occam's razor**: the model that makes a more specific (and correct) prediction is rewarded with a higher marginal likelihood.
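The Beta-Binomial marginal likelihood has a closed form, so the comparison can be verified directly; a sketch using only the standard library:

```python
from math import lgamma, log, exp, comb

def log_beta_fn(a, b):
    # log of the Beta function B(a, b) via log-gamma
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_evidence(a, b, k, n):
    # Beta-Binomial marginal likelihood:
    # p(D | M) = C(n, k) * B(a + k, b + n - k) / B(a, b)
    return log(comb(n, k)) + log_beta_fn(a + k, b + n - k) - log_beta_fn(a, b)

k, n = 48, 100
log_m1 = log_evidence(50, 50, k, n)   # concentrated Beta(50, 50) prior
log_m2 = log_evidence(1, 1, k, n)     # uniform Beta(1, 1) prior

print(log_m1 > log_m2)                # True: M1 wins
print(exp(log_m1 - log_m2))           # Bayes factor for M1 over M2 (> 1)
```

Under the uniform prior the evidence collapses to $1/(n+1)$, every outcome count equally likely a priori, which is exactly the diffuseness the Occam's razor argument penalizes.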
Question 9

What is the prior predictive distribution, and why is it useful?
Answer
The prior predictive distribution is:

$$p(\tilde{x}) = \int p(\tilde{x} \mid \theta) \, p(\theta) \, d\theta$$

It is the distribution of data you would expect to see *before observing any actual data*, averaging over all parameter values weighted by the prior.

It is useful as a **sanity check on the prior**: if the prior predictive generates data that is physically impossible or empirically absurd (e.g., negative probabilities, salaries of $1 million for entry-level positions, temperatures of 500 degrees), the prior is poorly calibrated and should be revised. This check should be performed *before* fitting the model to data.
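Prior predictive checks are usually run by simulation: draw $\theta$ from the prior, then draw data given $\theta$. A minimal sketch for a Bernoulli model with a hypothetical Beta(2, 8) prior on a click rate (the parameter values are illustrative):

```python
import random

random.seed(1)

def sample_beta(a, b):
    # Beta(a, b) draw from two Gamma draws (random.gammavariate is stdlib)
    x = random.gammavariate(a, 1.0)
    y = random.gammavariate(b, 1.0)
    return x / (x + y)

def prior_predictive_counts(a, b, n, n_sims=10_000):
    # Simulate n Bernoulli trials per dataset, theta drawn fresh each time
    counts = []
    for _ in range(n_sims):
        theta = sample_beta(a, b)
        counts.append(sum(random.random() < theta for _ in range(n)))
    return counts

sims = prior_predictive_counts(2, 8, n=50)
# Sanity check: do simulated datasets look plausible for a click rate?
print(min(sims), max(sims), sum(sims) / len(sims))  # mean near 50 * 0.2 = 10
```

If the simulated counts looked absurd for the domain (say, mostly 45+ clicks out of 50 for a typically rare event), that would be the signal to revise the prior.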
Question 10

A new user on StreamRec has made 2 interactions, engaging with 1 out of 2 drama recommendations. The population prior for drama engagement is Beta(8, 12). What is the posterior mean, and how does it compare to the MLE of 0.50?
Answer
The posterior is Beta($8 + 1, 12 + 1$) = Beta(9, 13), with mean $9/(9 + 13) = 9/22 \approx 0.409$.

The MLE is $1/2 = 0.50$, but the posterior mean of 0.409 is pulled toward the population prior mean of $8/20 = 0.40$. This is appropriate: 2 observations should not override a population average derived from thousands of users. The posterior represents a compromise between the user's sparse personal data and the population baseline. As this user accumulates more interactions, the posterior will shift toward their personal rate.

Question 11
True or False: As the sample size $n \to \infty$, the Bayesian posterior under any proper prior converges to the same distribution regardless of the prior.
Answer
**True** (with caveats). Under regularity conditions (correct model specification, an identifiable finite-dimensional parameter, and a prior with positive density at the true value), the **Bernstein-von Mises theorem** guarantees that the posterior converges to $\mathcal{N}(\hat{\theta}_{\text{MLE}}, I(\theta_0)^{-1}/n)$ regardless of the prior. The prior's influence vanishes as data accumulate. The caveats are important: if the model is misspecified, the parameter is not identifiable, or the prior assigns zero density to the true parameter, the convergence can fail.

Question 12
In the Normal-Normal conjugate model, the posterior mean is a precision-weighted average of the prior mean and the sample mean. What does "precision" mean in this context, and why is it the natural weighting?
Answer
**Precision** is the reciprocal of variance: $\text{precision} = 1/\sigma^2$. The prior precision is $1/\sigma_0^2$ and the data precision is $n/\sigma^2$. The posterior mean is:

$$\mu_n = \frac{\frac{1}{\sigma_0^2}\mu_0 + \frac{n}{\sigma^2}\bar{x}}{\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}}$$

Precision weighting is natural because higher precision means more certainty, and more certain information should receive more weight. A prior with small variance (high precision) is very certain and dominates the posterior; a dataset with many observations or small noise variance (high data precision) is highly informative and dominates instead. The posterior precision is the *sum* of the prior and data precisions — information from independent sources adds up.
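The update is a few lines of arithmetic; a sketch with illustrative numbers (a vague $\mathcal{N}(0, 100)$ prior and 25 observations averaging 3):

```python
def normal_posterior(mu0, sigma0_sq, xbar, sigma_sq, n):
    # Precision-weighted update for the Normal-Normal model (known sigma^2)
    prior_prec = 1.0 / sigma0_sq
    data_prec = n / sigma_sq
    post_prec = prior_prec + data_prec          # precisions add
    post_mean = (prior_prec * mu0 + data_prec * xbar) / post_prec
    return post_mean, 1.0 / post_prec           # posterior mean, variance

mu_n, var_n = normal_posterior(mu0=0.0, sigma0_sq=100.0,
                               xbar=3.0, sigma_sq=4.0, n=25)
print(mu_n, var_n)  # mean close to 3 (data dominate the vague prior)
```

With these numbers the data precision ($25/4 = 6.25$) dwarfs the prior precision ($0.01$), so the posterior mean sits almost exactly at the sample mean.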
Question 13

Why is the HPDI (Highest Posterior Density Interval) sometimes preferred over the equal-tailed credible interval?
Answer
The HPDI is the **narrowest interval** containing a specified probability mass (e.g., 95%); it includes only the most probable parameter values.

For **symmetric, unimodal posteriors** (like a Normal), the HPDI and the equal-tailed interval are identical. For **skewed posteriors** (common with small samples or bounded parameters), the HPDI is shorter and more informative because it covers the high-density region and excludes the low-density tail. For example, if the posterior is right-skewed, the equal-tailed interval wastes probability mass on the long right tail, while the HPDI captures the concentrated left region plus just enough of the right tail to reach 95%.
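Both intervals are easy to compute from posterior samples; a sketch using a right-skewed Beta(2, 10) posterior as the example (the sample-based HPDI here is the standard "narrowest window of sorted draws" estimator):

```python
import random

random.seed(0)

def sample_beta(a, b):
    x = random.gammavariate(a, 1.0)
    y = random.gammavariate(b, 1.0)
    return x / (x + y)

def equal_tailed(samples, mass=0.95):
    s = sorted(samples)
    lo = s[int((1 - mass) / 2 * len(s))]
    hi = s[int((1 + mass) / 2 * len(s)) - 1]
    return lo, hi

def hpdi(samples, mass=0.95):
    # Narrowest window containing `mass` of the sorted samples
    s = sorted(samples)
    k = int(mass * len(s))
    width, i = min((s[j + k] - s[j], j) for j in range(len(s) - k))
    return s[i], s[i + k]

# Right-skewed posterior, e.g. a small-sample rate estimate
draws = [sample_beta(2, 10) for _ in range(20_000)]
et = equal_tailed(draws)
hp = hpdi(draws)
print(et, hp)
print(hp[1] - hp[0] < et[1] - et[0])  # True: the HPDI is narrower
```

For a symmetric posterior the two functions return (up to Monte Carlo noise) the same interval, matching the claim above.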
Question 14

Explain why the statement "I chose a flat prior to be objective" is misleading.
Answer
A flat (uniform) prior is not "objective," for three reasons:

1. **Parameterization dependence:** A flat prior on $\theta$ is not flat on $\log \theta$ or $\theta^2$. The choice of parameterization is itself a subjective decision that changes the prior.
2. **It still contains assumptions:** A flat prior on $[0, 1]$ says $\theta = 0.01$ is exactly as likely as $\theta = 0.50$. For many problems (e.g., conversion rates, click rates), this is empirically false — extreme values near 0 or 1 are rare.
3. **Impropriety issues:** A flat prior over $(-\infty, \infty)$ is improper (it does not integrate to 1) and can lead to improper posteriors in some models (e.g., hierarchical models, mixture models).

The more honest framing is: "I chose a weakly informative prior to minimize prior influence while ensuring a proper posterior."

Question 15
What is the Bayes factor, and how does it differ from a p-value?
Answer
The **Bayes factor** $\text{BF}_{12} = p(D \mid M_1) / p(D \mid M_2)$ is the ratio of the marginal likelihoods of two models. It measures the **relative evidence** the data provide for $M_1$ versus $M_2$. A $\text{BF}_{12} = 10$ means the data are 10 times more probable under $M_1$.

Key differences from p-values:

| Bayes factor | P-value |
|:--|:--|
| Compares two specific models | Tests one model against "something else" |
| Can support the null hypothesis ($\text{BF} > 1$) | Can only fail to reject or reject the null |
| Depends on the priors under both models | Does not use priors |
| Is a continuous measure of evidence | Is typically dichotomized at $\alpha = 0.05$ |
| Does not require specifying a sample size in advance | Has coverage guarantees only for pre-specified $n$ |

The Bayes factor can say "the data support the null" — something a p-value can never say (failing to reject is not the same as supporting).

Question 16
You run a Bayesian A/B test. The posterior probability that treatment B is better than treatment A is 0.92. Your manager asks: "Is B significantly better?" How do you respond?
Answer
In Bayesian analysis, "significance" is not a binary threshold — it is a continuous measure of certainty. The response should be: "Given our data and prior assumptions, there is a 92% probability that B outperforms A. Whether this is sufficient to act depends on the decision context:"

- **If the cost of switching to B is low and reversible** (e.g., a UI change that can be reverted), 92% may be sufficient.
- **If the cost of being wrong is high** (e.g., a manufacturing process change), you may want to continue collecting data until the probability exceeds 99%.
- **To quantify the risk**, compute the expected loss: if we choose B but A is actually better, how much do we lose? If the expected loss is acceptably small, proceed.

The Bayesian framework replaces the binary significant/not-significant decision with a probability and an expected loss, which are more directly useful for decision-making.
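Both quantities fall out of the same posterior draws. A sketch with hypothetical conversion counts (120/1000 for A, 140/1000 for B, uniform Beta(1, 1) priors — these numbers are invented for illustration):

```python
import random

random.seed(42)

def sample_beta(a, b):
    x = random.gammavariate(a, 1.0)
    y = random.gammavariate(b, 1.0)
    return x / (x + y)

# Hypothetical Beta posteriors after a Beta(1, 1) prior
post_a = (1 + 120, 1 + 880)   # A: 120 conversions / 1000
post_b = (1 + 140, 1 + 860)   # B: 140 conversions / 1000

n_sims = 50_000
wins = 0
loss = 0.0
for _ in range(n_sims):
    ta = sample_beta(*post_a)
    tb = sample_beta(*post_b)
    wins += tb > ta
    loss += max(ta - tb, 0.0)   # loss incurred if we pick B but A is better

print(wins / n_sims)            # P(B > A | data)
print(loss / n_sims)            # expected loss of choosing B
```

With these counts the win probability comes out close to the 0.92 in the question, while the expected loss is a small fraction of a percentage point — often the more decision-relevant number.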
Question 17

In the StreamRec user preference model, a new user has zero interactions but the model can still make recommendations. How?
Answer
The model uses the **population prior** as the posterior for users with no observations. For each category $c$, the new user's posterior is the prior Beta($\alpha_c, \beta_c$), where the hyperparameters are estimated from the population engagement rates. This means:

1. **The posterior mean** equals the population average engagement rate for each category — a sensible default.
2. **The posterior uncertainty** is high (equal to the prior uncertainty), reflecting that we know nothing about this specific user.
3. **Thompson sampling** uses this high uncertainty productively: samples from broad priors vary substantially, so the algorithm explores many categories for new users instead of always recommending the highest population average.

As the user interacts with the system, the prior pseudo-counts are gradually diluted by real data, and the posterior transitions from "population average" to "personalized."
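A minimal sketch of this cold-start behavior; the category names, pseudo-counts, and helper names are invented for illustration, not taken from StreamRec:

```python
import random

random.seed(7)

def sample_beta(a, b):
    x = random.gammavariate(a, 1.0)
    y = random.gammavariate(b, 1.0)
    return x / (x + y)

# Hypothetical population priors per category (pseudo-counts across all users)
priors = {"drama": (8, 12), "comedy": (5, 15), "documentary": (3, 17)}

# A brand-new user starts with the population priors as their posteriors
posterior = dict(priors)

def recommend():
    # Thompson sampling: draw one engagement rate per category from the
    # current posterior, then recommend the argmax of the draws
    draws = {c: sample_beta(a, b) for c, (a, b) in posterior.items()}
    return max(draws, key=draws.get)

def record(category, engaged):
    # Conjugate update: engagement adds to alpha, non-engagement to beta
    a, b = posterior[category]
    posterior[category] = (a + engaged, b + (1 - engaged))

picks = [recommend() for _ in range(100)]
print({c: picks.count(c) for c in priors})  # broad priors -> exploration

record("drama", 1)  # real interactions sharpen the posterior over time
```

Because the priors are broad, the per-round draws overlap heavily and lower-mean categories still get recommended sometimes, which is exactly the exploration behavior described in point 3.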
Question 18

What does it mean for the prior to "wash out" with enough data? Under what conditions might it NOT wash out?
Answer
**Washing out** means the posterior becomes essentially independent of the prior as $n \to \infty$ — the data dominate the inference. Mathematically, the posterior concentrates around the MLE regardless of the prior (Bernstein-von Mises theorem).

The prior may **not** wash out when:

1. **The prior assigns zero probability to the true parameter:** if $p(\theta_0) = 0$, the posterior will also assign zero probability to $\theta_0$ regardless of the data.
2. **The model is non-identifiable:** multiple parameter values explain the data equally well (e.g., label switching in mixture models). The prior resolves the ambiguity, and that resolution persists.
3. **The prior acts on a level the data barely inform:** in hierarchical models, the hyperprior on variance components may not wash out if the number of groups is small.
4. **High-dimensional settings:** when the number of parameters grows with the sample size, the prior on each parameter never fully washes out because the effective sample size per parameter remains bounded.

Question 19
Explain why MAP estimation is not invariant under reparameterization, using a concrete example.
Answer
Consider a Beta(8, 4) posterior on $\theta \in [0, 1]$. The MAP (mode) is $\hat{\theta}_{\text{MAP}} = (8 - 1)/(8 + 4 - 2) = 7/10 = 0.70$.

Now reparameterize to log-odds: $\phi = \log(\theta / (1 - \theta))$. The density transforms as:

$$p(\phi \mid D) = p(\theta(\phi) \mid D) \left|\frac{d\theta}{d\phi}\right|$$

The Jacobian $|d\theta/d\phi| = \theta(1 - \theta)$ changes the shape of the density, so the mode shifts. The MAP of $\phi$ is NOT $\log(0.70/0.30) \approx 0.847$: the tilted density is proportional to $\theta^8 (1 - \theta)^4$, whose mode is at $\theta = 8/12$, giving $\phi_{\text{MAP}} = \log 2 \approx 0.693$.

This is problematic because MAP gives a different "best guess" depending on which parameterization you use. The posterior mean, by contrast, transforms consistently: $\mathbb{E}[g(\theta) \mid D] = \int g(\theta) \, p(\theta \mid D) \, d\theta$ for any transformation $g$. This is one reason the full posterior is preferred over point estimates.
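The two modes can be checked by a crude grid search over the (unnormalized) log-densities:

```python
from math import log, exp

a, b = 8, 4  # Beta(8, 4) posterior

def log_post_theta(t):
    # Beta(a, b) log-density in theta, up to an additive constant
    return (a - 1) * log(t) + (b - 1) * log(1 - t)

def log_post_phi(phi):
    # phi = log(t / (1 - t)); the Jacobian dt/dphi = t(1 - t)
    # multiplies the density, adding log t + log(1 - t) here
    t = 1.0 / (1.0 + exp(-phi))
    return log_post_theta(t) + log(t) + log(1 - t)

theta_grid = [i / 100000 for i in range(1, 100000)]
map_theta = max(theta_grid, key=log_post_theta)    # mode in theta

phi_grid = [-5 + i / 10000 for i in range(100001)]
map_phi = max(phi_grid, key=log_post_phi)          # mode in phi

print(map_theta, map_phi, log(0.7 / 0.3))
```

`map_theta` lands at 0.70 while `map_phi` lands near $\log 2 \approx 0.693$ rather than $\log(0.7/0.3) \approx 0.847$, confirming that the mode does not survive the change of variables.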
Question 20

When would you recommend a Bayesian approach over a frequentist approach for a production data science system? Give two specific scenarios and two scenarios where frequentist methods suffice.