Chapter 3: Quiz

Probability Theory and Statistical Inference

Test your understanding of the key concepts from this chapter. Answers follow each question.


Question 1

Which of the following is not one of the Kolmogorov axioms of probability?

(a) $P(A) \geq 0$ for all events $A$
(b) $P(\Omega) = 1$
(c) $P(A \cup B) = P(A) + P(B)$ for any events $A$ and $B$
(d) For mutually exclusive events $A_1, A_2, \ldots$, $P(\bigcup_i A_i) = \sum_i P(A_i)$

Answer: (c). The additivity axiom requires the events to be mutually exclusive. For arbitrary (not necessarily disjoint) events, the correct formula is $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.


Question 2

A continuous random variable $X$ has PDF $f(x) = 2x$ for $x \in [0, 1]$ and $f(x) = 0$ otherwise. What is $P(X > 0.5)$?

(a) 0.50 (b) 0.75 (c) 0.25 (d) 1.00

Answer: (b). $P(X > 0.5) = \int_{0.5}^{1} 2x \, dx = [x^2]_{0.5}^{1} = 1 - 0.25 = 0.75$.
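The integral can be double-checked by simulation — a minimal sketch using NumPy (the seed and sample size here are arbitrary choices, not from the text):

```python
import numpy as np

rng = np.random.default_rng(0)

# The CDF is F(x) = x^2 on [0, 1], so inverse-CDF sampling gives X = sqrt(U).
u = rng.uniform(size=1_000_000)
x = np.sqrt(u)

mc_estimate = np.mean(x > 0.5)   # Monte Carlo estimate of P(X > 0.5)
exact = 1 - 0.5**2               # the integral: 1 - 0.25 = 0.75
```

The Monte Carlo estimate should agree with the exact value 0.75 to within sampling noise.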


Question 3

If $X$ and $Y$ are independent random variables, which of the following is true?

(a) $P(X = x \mid Y = y) = P(X = x)$ for all $x, y$
(b) $\text{Cov}(X, Y) = 0$
(c) $P(X = x, Y = y) = P(X = x) \cdot P(Y = y)$ for all $x, y$
(d) All of the above

Answer: (d). All three are consequences of independence. Note, however, that (b) is necessary but not sufficient: zero covariance does not imply independence in general (it does for jointly Gaussian variables).
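The standard counterexample for the caveat in (b) is $Y = X^2$ with $X$ symmetric about zero: the covariance vanishes even though $Y$ is a deterministic function of $X$. A quick numerical sketch (seed and sample size are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=1_000_000)   # symmetric about 0
y = x**2                                 # fully determined by x

# Cov(X, Y) = E[X^3] - E[X] E[X^2] = 0 for symmetric X.
cov = np.mean(x * y) - np.mean(x) * np.mean(y)

# Yet X and Y are clearly dependent: whenever |X| > 0.5, Y > 0.25 with certainty.
dependent = np.all(y[np.abs(x) > 0.5] > 0.25)
```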


Question 4

In Bayes' theorem $p(\theta \mid D) = \frac{p(D \mid \theta) p(\theta)}{p(D)}$, the denominator $p(D) = \int p(D \mid \theta) p(\theta) d\theta$ is called the:

(a) Likelihood (b) Evidence (marginal likelihood) (c) Posterior (d) Sufficient statistic

Answer: (b). The evidence or marginal likelihood is the normalizing constant that ensures the posterior integrates to one. It is also used in Bayesian model comparison.


Question 5

The MLE for the Bernoulli parameter given $n$ observations $x_1, \ldots, x_n$ is:

(a) $\frac{n}{\sum_i x_i}$ (b) $\frac{\sum_i x_i}{n}$ (c) $\frac{\sum_i x_i}{n - 1}$ (d) $\frac{\sum_i x_i + 1}{n + 2}$

Answer: (b). The MLE is the sample proportion $\hat{p} = \bar{x} = \frac{\sum_i x_i}{n}$. Option (d) is the posterior mean under a uniform (Beta(1,1)) prior — the Laplace-smoothed estimator.


Question 6

Binary cross-entropy loss is mathematically equivalent to:

(a) The negative log-likelihood of a Gaussian model
(b) The negative log-likelihood of a Bernoulli model
(c) The KL divergence from the prior to the posterior
(d) The Fisher information

Answer: (b). Binary cross-entropy $-\frac{1}{n}\sum_i [y_i \log \hat{p}_i + (1 - y_i)\log(1 - \hat{p}_i)]$ is exactly the negative log-likelihood of a Bernoulli model with parameter $\hat{p}_i$ for each observation.
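The equivalence is easy to confirm numerically — a sketch with arbitrary illustrative labels and predictions:

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0])                # binary labels
p = np.array([0.9, 0.2, 0.7, 0.6, 0.1])      # predicted probabilities

# Binary cross-entropy as usually written.
bce = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Bernoulli pmf is p^y (1-p)^(1-y); its mean negative log-likelihood:
nll = -np.mean(np.log(p**y * (1 - p) ** (1 - y)))
```

The two quantities agree to floating-point precision.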


Question 7

Which probability distribution would be most appropriate for modeling the number of customer service tickets received per hour?

(a) Bernoulli (b) Gaussian (c) Poisson (d) Exponential

Answer: (c). The Poisson distribution models the count of events occurring in a fixed interval. The exponential would model the time between consecutive tickets.


Question 8

A distribution belongs to the exponential family if it can be written in the form $p(x \mid \eta) = h(x) \exp(\eta^\top T(x) - A(\eta))$. The function $A(\eta)$ is called the:

(a) Sufficient statistic (b) Base measure (c) Log-partition function (d) Natural parameter

Answer: (c). The log-partition function $A(\eta)$ ensures the distribution normalizes to one. Its derivatives generate the moments of the sufficient statistic.


Question 9

For the Bernoulli distribution, the natural parameter in exponential family form is:

(a) $p$ (b) $1 - p$ (c) $\log p$ (d) $\log \frac{p}{1-p}$

Answer: (d). The natural parameter is the log-odds (logit) $\eta = \log \frac{p}{1-p}$. This is why logistic regression, which models $\eta = \mathbf{w}^\top \mathbf{x}$, is the natural GLM for binary outcomes.


Question 10

$L_2$ regularization of neural network weights is equivalent to MAP estimation with which prior?

(a) Uniform prior (b) Gaussian prior centered at zero (c) Laplace prior centered at zero (d) Beta prior

Answer: (b). A Gaussian prior $p(\mathbf{w}) = \mathcal{N}(\mathbf{0}, \sigma_0^2 \mathbf{I})$ adds $-\frac{\|\mathbf{w}\|_2^2}{2\sigma_0^2}$ to the log-posterior, which is equivalent to an $L_2$ penalty $\frac{\lambda}{2}\|\mathbf{w}\|_2^2$ with $\lambda = 1/\sigma_0^2$.
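The correspondence (under the $\frac{\lambda}{2}\|\mathbf{w}\|_2^2$ penalty convention) can be checked term by term — a sketch with an arbitrary weight vector and prior scale:

```python
import numpy as np

sigma0 = 2.0
w = np.array([0.5, -1.0, 2.0])   # arbitrary illustrative weights

# Log of an isotropic Gaussian prior N(0, sigma0^2 I), dropping the constant term.
log_prior = -np.sum(w**2) / (2 * sigma0**2)

# L2 penalty (lambda/2) * ||w||^2 with lambda = 1 / sigma0^2.
lam = 1 / sigma0**2
penalty = (lam / 2) * np.sum(w**2)

# Maximizing log_prior is the same as minimizing penalty: they are negatives.
```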


Question 11

The Fisher information $I(\theta)$ for a statistical model measures:

(a) The amount of data needed to estimate $\theta$
(b) The curvature of the log-likelihood at $\theta$
(c) The bias of the MLE
(d) The prior probability of $\theta$

Answer: (b). Fisher information is the expected negative second derivative (curvature) of the log-likelihood. Higher curvature means the data is more informative about $\theta$, leading to more precise estimates.


Question 12

The Cramér-Rao lower bound states that for any unbiased estimator $\hat{\theta}$:

(a) $\text{Bias}(\hat{\theta}) \geq 1 / (n \cdot I(\theta))$
(b) $\text{Var}(\hat{\theta}) \geq 1 / (n \cdot I(\theta))$
(c) $\text{MSE}(\hat{\theta}) \leq 1 / (n \cdot I(\theta))$
(d) $\text{Var}(\hat{\theta}) \leq 1 / (n \cdot I(\theta))$

Answer: (b). The Cramér-Rao bound is a lower bound on the variance of any unbiased estimator. An estimator that achieves this bound is called efficient.


Question 13

A 95% frequentist confidence interval means:

(a) There is a 95% probability that the true parameter is in this specific interval
(b) If we repeated the experiment many times, 95% of the constructed intervals would contain the true parameter
(c) We are 95% confident that the true parameter is in this interval
(d) The posterior probability that the parameter is in this interval is 95%

Answer: (b). This is the correct frequentist interpretation. Option (a) is the Bayesian credible interval interpretation. Options (c) and (d) are common misinterpretations of frequentist confidence intervals.
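The repeated-sampling interpretation in (b) can be made concrete by simulation — a sketch assuming a known true mean, an arbitrary seed, and the normal-quantile interval (with the $z$ critical value rather than the $t$, coverage is slightly below 95% at finite $n$):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, trials = 5.0, 2.0, 50, 10_000
z = 1.96  # standard normal 97.5% quantile

# Build one 95% interval per simulated experiment.
samples = rng.normal(mu, sigma, size=(trials, n))
means = samples.mean(axis=1)
halves = z * samples.std(axis=1, ddof=1) / np.sqrt(n)

# Fraction of intervals that contain the true mean.
coverage = np.mean((means - halves <= mu) & (mu <= means + halves))
```

Across many repetitions, roughly 95% of the constructed intervals cover the true parameter, which is exactly what interpretation (b) claims.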


Question 14

The Central Limit Theorem states that the distribution of the sample mean $\bar{X}_n$ approaches a normal distribution as $n \to \infty$. Which condition is required?

(a) The original distribution must be normal
(b) The original distribution must have a finite variance
(c) The sample size must exceed 30
(d) The observations must be identically distributed (not just independent)

Answer: (b). The CLT requires finite variance (and i.i.d. observations in its basic form). The "n > 30" rule is a common heuristic, not a mathematical requirement. The original distribution need not be normal — that is the entire point of the CLT.
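A sketch of the "need not be normal" point, using a skewed exponential distribution (the seed, $n$, and number of trials are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 20_000

# Means of Exp(1) samples: the parent is skewed, but has finite variance.
means = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)

# Standardize: (mean - mu) / (sigma / sqrt(n)), with mu = sigma = 1 for Exp(1).
z = (means - 1.0) * np.sqrt(n)

# If the standardized means are approximately N(0, 1), about 95% fall in ±1.96.
frac = np.mean(np.abs(z) < 1.96)
```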


Question 15

Hoeffding's inequality provides a bound on $P(|\bar{X}_n - \mu| \geq t)$ that is:

(a) Asymptotic (valid only as $n \to \infty$)
(b) Valid only for Gaussian random variables
(c) A finite-sample bound valid for any bounded random variables
(d) Tighter than the CLT-based bound in all cases

Answer: (c). Hoeffding's inequality is a finite-sample, distribution-free bound that requires only that the random variables are bounded and independent. It is more conservative than CLT-based bounds (not tighter), which is the price of being distribution-free.


Question 16

In Monte Carlo estimation, the standard error of the estimate decreases at what rate as the number of samples $N$ increases?

(a) $1/N$ (b) $1/\sqrt{N}$ (c) $1/N^2$ (d) $1/\log N$

Answer: (b). The standard error is $\sigma / \sqrt{N}$, which decreases as $1/\sqrt{N}$. This means reducing the error by a factor of 10 requires 100 times as many samples — the fundamental limitation of Monte Carlo methods.
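The 100x-samples-for-10x-accuracy tradeoff can be demonstrated directly — a sketch estimating $\mathbb{E}[U]$ for $U \sim \text{Uniform}(0,1)$ (the seed, sample sizes, and repetition count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def std_error(n, reps=1000):
    # Spread of the Monte Carlo estimate of E[U] across independent runs.
    estimates = rng.uniform(size=(reps, n)).mean(axis=1)
    return estimates.std()

# Growing N by 100x should shrink the standard error by about 10x.
ratio = std_error(50) / std_error(5000)
```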


Question 17

In importance sampling, the importance weight for a sample $x_i$ drawn from proposal $q(x)$ when the target is $p(x)$ is:

(a) $p(x_i) \cdot q(x_i)$ (b) $p(x_i) / q(x_i)$ (c) $q(x_i) / p(x_i)$ (d) $\log p(x_i) - \log q(x_i)$

Answer: (b). The importance weight $w(x_i) = p(x_i) / q(x_i)$ corrects for the difference between the proposal and target distributions. High weights indicate that $q$ undersampled that region relative to $p$.
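A minimal importance-sampling sketch: estimating $\mathbb{E}_p[X]$ for a target $p = \mathcal{N}(3, 1)$ using a mismatched proposal $q = \mathcal{N}(0, 4)$ (both distributions and the seed are illustrative assumptions; note $q$ is chosen with heavier tails than $p$ so the weights stay bounded):

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

x = rng.normal(0.0, 2.0, size=500_000)                  # draws from q = N(0, 2^2)
w = normal_pdf(x, 3.0, 1.0) / normal_pdf(x, 0.0, 2.0)   # weights w = p / q

estimate = np.mean(w * x)   # unbiased estimate of E_p[X] = 3
```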


Question 18

Which of the following is a conjugate prior for the Poisson likelihood?

(a) Beta distribution (b) Gaussian distribution (c) Gamma distribution (d) Uniform distribution

Answer: (c). The Gamma prior is conjugate to the Poisson likelihood: if $\lambda \sim \text{Gamma}(\alpha, \beta)$ and $x_1, \ldots, x_n \sim \text{Poisson}(\lambda)$, then the posterior is $\text{Gamma}(\alpha + \sum x_i, \beta + n)$. The Beta distribution is conjugate to the Binomial/Bernoulli likelihood.
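The conjugate update can be verified against a brute-force grid posterior — a sketch with an arbitrary prior, true rate, and seed:

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 2.0, 1.0              # Gamma(shape=alpha, rate=beta) prior
x = rng.poisson(4.0, size=30)       # simulated Poisson counts

# Conjugate update: posterior is Gamma(alpha + sum(x), beta + n).
post_mean = (alpha + x.sum()) / (beta + len(x))

# Brute force: unnormalized log-posterior on a uniform grid of lambda values.
lam = np.linspace(0.01, 15, 20_000)
log_post = (alpha - 1 + x.sum()) * np.log(lam) - (beta + len(x)) * lam
post = np.exp(log_post - log_post.max())
post /= post.sum()                  # normalize (uniform spacing cancels)
grid_mean = np.sum(lam * post)
```

The closed-form posterior mean and the grid approximation should agree closely.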


Question 19

The bootstrap is most useful when:

(a) You want to compute a confidence interval for a statistic that has no closed-form standard error
(b) Your data is exactly normally distributed
(c) You have a very large sample size and want faster computation
(d) You want to estimate the bias of the MLE

Answer: (a). The bootstrap's main advantage is that it provides confidence intervals for any statistic (median, percentiles, ratios, custom metrics) without requiring distributional assumptions or closed-form formulas for the standard error.
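A minimal percentile-bootstrap sketch for the median — a statistic with no simple closed-form standard error (the data distribution, seed, and resample count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.exponential(scale=2.0, size=200)   # a skewed sample

# Resample with replacement and recompute the statistic each time.
boot_medians = np.array([
    np.median(rng.choice(data, size=len(data), replace=True))
    for _ in range(5_000)
])

# Percentile method: the middle 95% of bootstrap statistics.
lo, hi = np.percentile(boot_medians, [2.5, 97.5])
```

The same recipe works for percentiles, ratios, or any custom metric; only the statistic inside the loop changes.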


Question 20

MSE loss for regression is the negative log-likelihood of which distribution (up to a constant)?

(a) Bernoulli (b) Poisson (c) Gaussian with fixed variance (d) Exponential

Answer: (c). If we assume $y_i = f_\theta(x_i) + \epsilon_i$ with $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$, the negative log-likelihood (ignoring constants that do not depend on $\theta$) is proportional to $\sum_i (y_i - f_\theta(x_i))^2$, which is MSE. This means using MSE implicitly assumes Gaussian noise — if the noise is heavy-tailed, MSE is a suboptimal loss.
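The "up to a constant" claim can be checked directly — a sketch with arbitrary targets and predictions, assuming fixed noise variance $\sigma^2 = 1$:

```python
import numpy as np

y = np.array([1.2, 0.3, -0.5, 2.0])       # targets
pred = np.array([1.0, 0.5, -0.2, 1.5])    # model predictions
sigma = 1.0

mse_sum = np.sum((y - pred) ** 2)         # sum-of-squares loss

# Negative log-likelihood of y under N(pred, sigma^2), summed over points.
nll = np.sum(
    0.5 * np.log(2 * np.pi * sigma**2) + (y - pred) ** 2 / (2 * sigma**2)
)

# The NLL equals a prediction-independent constant plus mse_sum / (2 sigma^2).
const = len(y) * 0.5 * np.log(2 * np.pi * sigma**2)
```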