Chapter 4 Quiz

Test your understanding of the core concepts from this chapter. Try to answer each question before revealing the solution.

Question 1

Which of the following is NOT one of the Kolmogorov axioms of probability?

  • (a) $P(A) \geq 0$ for any event $A$
  • (b) $P(\Omega) = 1$
  • (c) $P(A \cup B) = P(A) + P(B)$ for any events $A, B$
  • (d) For mutually exclusive events $A_1, A_2, \ldots$: $P(\bigcup_i A_i) = \sum_i P(A_i)$
Answer **(c)**. The additivity axiom requires the events to be **mutually exclusive** (disjoint). Option (c) omits this crucial condition, making it incorrect as stated. The correct version is option (d), which specifies mutual exclusivity. For general (non-disjoint) events, we need the inclusion-exclusion formula: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
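
To see why the disjointness condition matters, here is a minimal simulation (assuming numpy) with two overlapping dice events: $P(A) + P(B)$ overcounts the union by exactly $P(A \cap B)$, and can even exceed 1.

```python
import numpy as np

rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100_000)  # fair six-sided die

A = rolls <= 4  # event A: roll is 1-4
B = rolls >= 3  # event B: roll is 3-6 (overlaps A on {3, 4})

print(np.mean(A | B))            # P(A ∪ B) = 1.0 here (every roll is in A or B)
print(np.mean(A) + np.mean(B))   # ≈ 4/6 + 4/6 ≈ 1.33 -- not even a valid probability
print(np.mean(A) + np.mean(B) - np.mean(A & B))  # ≈ 1.0, inclusion-exclusion
```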

Question 2

A medical test has 99% sensitivity and 95% specificity. The disease prevalence is 1 in 1,000. If a person tests positive, what is the approximate posterior probability of disease?

  • (a) 99%
  • (b) 95%
  • (c) About 2%
  • (d) About 50%
Answer **(c)**. Using Bayes' theorem: $$P(D \mid +) = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.05 \times 0.999} = \frac{0.00099}{0.00099 + 0.04995} \approx 0.019 \approx 2\%$$ This is a classic example of the base rate fallacy. Despite the high sensitivity, the low prevalence means most positive tests are false positives.
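
A quick numerical check of this calculation (a minimal sketch; the variable names are ours):

```python
sensitivity = 0.99   # P(+ | disease)
specificity = 0.95   # P(- | no disease)
prevalence = 0.001   # P(disease), i.e. 1 in 1,000

false_positive_rate = 1 - specificity  # 0.05

# Bayes' theorem: P(D | +) = P(+ | D) P(D) / P(+)
p_pos = sensitivity * prevalence + false_positive_rate * (1 - prevalence)
posterior = sensitivity * prevalence / p_pos
print(posterior)  # ≈ 0.0194, i.e. about 2%
```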

Question 3

Two events $A$ and $B$ are independent if and only if:

  • (a) $P(A \cap B) = 0$
  • (b) $P(A \cap B) = P(A) \cdot P(B)$
  • (c) $P(A \mid B) = P(B \mid A)$
  • (d) $P(A \cup B) = P(A) + P(B)$
Answer **(b)**. Independence means $P(A \cap B) = P(A) \cdot P(B)$, equivalently $P(A \mid B) = P(A)$. Option (a) describes mutually exclusive events, which is actually the *opposite* of independence (for events with nonzero probability). Option (d) is the addition rule for mutually exclusive events.

Question 4

Which distribution would you use to model the number of customers arriving at a store per hour?

  • (a) Bernoulli
  • (b) Gaussian
  • (c) Poisson
  • (d) Beta
Answer **(c)**. The Poisson distribution models the count of events occurring in a fixed interval of time or space, given a constant average rate. It is the natural choice for modeling arrival counts. The Bernoulli is for binary outcomes, the Gaussian is for continuous measurements, and the Beta is a distribution over probabilities.

Question 5

For a Gaussian random variable $X \sim \mathcal{N}(\mu, \sigma^2)$, which statement is TRUE?

  • (a) $P(X = \mu)$ is the maximum probability value
  • (b) The PDF value $f(\mu)$ can never exceed 1
  • (c) Approximately 95% of values fall within $\mu \pm 2\sigma$
  • (d) The variance equals $\mu^2$
Answer **(c)**. By the 68-95-99.7 rule, approximately 95.4% of values fall within 2 standard deviations of the mean. Option (a) is wrong because for continuous distributions, the probability of any single point is 0. Option (b) is wrong because the PDF *can* exceed 1 (e.g., for $\sigma < 1/\sqrt{2\pi} \approx 0.4$). Option (d) is false in general -- the variance is $\sigma^2$, a separate parameter that does not depend on $\mu$.
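
Both claims are easy to check numerically; a sketch assuming numpy:

```python
import numpy as np

# The PDF peak of N(mu, sigma^2) is 1 / (sigma * sqrt(2*pi)); it exceeds 1 for small sigma
sigma = 0.1
print(1 / (sigma * np.sqrt(2 * np.pi)))  # ≈ 3.99 > 1: densities are not probabilities

# Empirical check of the 95% rule
rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=1.0, size=1_000_000)
print(np.mean(np.abs(x) <= 2))  # ≈ 0.954
```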

Question 6

The covariance between two independent random variables $X$ and $Y$ is:

  • (a) Always positive
  • (b) Always negative
  • (c) Zero
  • (d) Equal to $\sigma_X \sigma_Y$
Answer **(c)**. If $X$ and $Y$ are independent, then $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$, so $\text{Cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] = 0$. Important: the converse is NOT true -- zero covariance does not imply independence.
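
The one-way implication shows up immediately in simulation; the classic counterexample for the converse is $Y = X^2$ with symmetric $X$ (a sketch, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
y_indep = rng.normal(size=1_000_000)  # independent of x
y_dep = x ** 2                        # fully determined by x, yet uncorrelated

print(np.cov(x, y_indep)[0, 1])  # ≈ 0: independence implies zero covariance
print(np.cov(x, y_dep)[0, 1])    # ≈ 0 as well, but x and y_dep are NOT independent
```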

Question 7

In maximum likelihood estimation, we find parameters that:

  • (a) Maximize the prior probability of the parameters
  • (b) Maximize the probability of the observed data given the parameters
  • (c) Maximize the posterior probability of the parameters
  • (d) Minimize the entropy of the parameter distribution
Answer **(b)**. MLE maximizes $p(\mathcal{D} \mid \theta)$, the likelihood -- the probability of observing the data under the assumed model with parameters $\theta$. Option (a) plays no role in MLE (priors belong to Bayesian methods). Option (c) describes MAP estimation. Option (d) is not the MLE objective.

Question 8

Why do we typically maximize the log-likelihood rather than the likelihood itself?

  • (a) The log-likelihood is always larger, making optimization easier
  • (b) Products become sums, improving numerical stability and analytical convenience
  • (c) The log-likelihood has a unique maximum while the likelihood does not
  • (d) The log-likelihood is convex for all distributions
Answer **(b)**. The likelihood of i.i.d. data is a product: $\prod_i p(x_i \mid \theta)$. Taking the log converts this to a sum: $\sum_i \log p(x_i \mid \theta)$. Sums are numerically more stable (products of small numbers underflow) and analytically easier to differentiate. The log transformation is monotonic, so the maximizer is the same. Option (d) is false -- the log-likelihood is NOT convex for all distributions (e.g., Gaussian mixtures).
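
The underflow problem is concrete. A sketch comparing the raw product with the sum of logs (assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)  # i.i.d. data from a standard normal

# Per-point densities under the model N(0, 1)
dens = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)

print(np.prod(dens))         # 0.0 -- the product of 10,000 small numbers underflows
print(np.sum(np.log(dens)))  # a finite, perfectly usable log-likelihood
```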

Question 9

MAP estimation with a Gaussian prior $p(\theta) = \mathcal{N}(0, \sigma_0^2 I)$ is equivalent to MLE with:

  • (a) L1 regularization
  • (b) L2 regularization
  • (c) Dropout
  • (d) No regularization
Answer **(b)**. The log of a Gaussian prior is $\log p(\theta) = -\|\theta\|^2 / (2\sigma_0^2) + \text{const}$, which adds an L2 penalty to the log-likelihood objective. This is exactly L2 regularization (weight decay). A Laplace prior would give L1 regularization (option a).
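
A small sketch of the correspondence for a 1-D Gaussian mean with known variance: grid-maximizing the MAP objective recovers the standard closed-form shrinkage estimate (the notation and numbers here are ours; $\sigma_0$ is the prior scale).

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, sigma0 = 20, 1.0, 0.5
x = rng.normal(loc=2.0, scale=sigma, size=n)

theta = np.linspace(-5, 5, 100_001)
log_lik = -0.5 * ((x[:, None] - theta) ** 2).sum(axis=0) / sigma**2  # constants dropped
log_prior = -0.5 * theta**2 / sigma0**2  # Gaussian prior = L2 penalty

map_grid = theta[np.argmax(log_lik + log_prior)]
map_closed = x.mean() * (n / sigma**2) / (n / sigma**2 + 1 / sigma0**2)
print(map_grid, map_closed)  # agree up to grid resolution; the prior shrinks the mean toward 0
```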

Question 10

As the amount of training data $n \to \infty$, the MAP estimate:

  • (a) Diverges from the MLE
  • (b) Converges to the prior mean
  • (c) Converges to the MLE
  • (d) Becomes undefined
Answer **(c)**. As $n$ grows, the data term $\sum_i \log p(x_i \mid \theta)$ dominates the prior term $\log p(\theta)$, and the MAP estimate converges to the MLE. The prior becomes irrelevant with enough data. This is why regularization matters most when data is scarce.
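
Using the same closed form as in Question 9, one can watch the MAP estimate approach the MLE (the sample mean) as $n$ grows; a sketch with hypothetical settings:

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, sigma0, true_mu = 1.0, 0.5, 2.0

for n in [5, 50, 500, 5000]:
    x = rng.normal(loc=true_mu, scale=sigma, size=n)
    mle = x.mean()
    map_est = mle * (n / sigma**2) / (n / sigma**2 + 1 / sigma0**2)
    print(n, round(mle, 4), round(map_est, 4))  # the MAP-MLE gap shrinks as n grows
```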

Question 11

The entropy of a fair coin (using natural logarithm) is:

  • (a) 0
  • (b) 0.5
  • (c) $\ln 2 \approx 0.693$
  • (d) 1
Answer **(c)**. $H = -[0.5 \ln 0.5 + 0.5 \ln 0.5] = -\ln 0.5 = \ln 2 \approx 0.693$ nats. Measured in bits (i.e., using $\log_2$), the entropy of a fair coin is exactly 1, which is why option (d) is a tempting distractor.
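
A two-line check in both units (assuming numpy):

```python
import numpy as np

p = np.array([0.5, 0.5])           # fair coin
print(-np.sum(p * np.log(p)))      # 0.6931... nats = ln 2
print(-np.sum(p * np.log2(p)))     # 1.0 bit
```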

Question 12

Which distribution has the maximum entropy among all discrete distributions on $\{1, 2, \ldots, K\}$?

  • (a) Bernoulli
  • (b) The distribution concentrated on a single outcome
  • (c) The uniform distribution
  • (d) The Gaussian distribution
Answer **(c)**. The uniform distribution $P(X = k) = 1/K$ for all $k$ maximizes entropy at $H = \log K$. This can be proved using the non-negativity of KL divergence or Lagrange multipliers. The distribution in (b) has *minimum* entropy ($H = 0$). The Gaussian (d) maximizes differential entropy for continuous distributions with a fixed variance, but is not discrete.
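
A quick check on $K = 4$ outcomes: any departure from uniform loses entropy (a sketch, assuming numpy; the skewed distribution is made up).

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p))  # assumes all entries are positive

print(np.log(4))                          # log K = 1.386..., the ceiling
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 1.386..., attained by uniform
print(entropy([0.7, 0.1, 0.1, 0.1]))      # ≈ 0.940 < log K
```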

Question 13

Cross-entropy $H(p, q)$ between the true distribution $p$ and the model distribution $q$ satisfies:

  • (a) $H(p, q) \geq H(q)$
  • (b) $H(p, q) \leq H(p)$
  • (c) $H(p, q) \geq H(p)$
  • (d) $H(p, q) = H(p) \cdot H(q)$
Answer **(c)**. By definition, $H(p, q) = H(p) + D_{\text{KL}}(p \| q)$, and since $D_{\text{KL}}(p \| q) \geq 0$, we have $H(p, q) \geq H(p)$, with equality if and only if $p = q$. This means the cross-entropy is always at least as large as the true entropy; the excess is the KL divergence.
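
The decomposition can be verified directly on any pair of distributions (a sketch assuming numpy; $p$ and $q$ here are made up):

```python
import numpy as np

p = np.array([0.6, 0.3, 0.1])  # "true" distribution (hypothetical)
q = np.array([0.4, 0.4, 0.2])  # model distribution (hypothetical)

H_p = -np.sum(p * np.log(p))       # entropy H(p)
H_pq = -np.sum(p * np.log(q))      # cross-entropy H(p, q)
kl = np.sum(p * np.log(p / q))     # KL(p || q)

print(H_pq, H_p + kl)  # identical: H(p, q) = H(p) + KL(p || q)
print(H_pq >= H_p)     # True
```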

Question 14

KL divergence $D_{\text{KL}}(p \| q)$ is:

  • (a) Always symmetric: $D_{\text{KL}}(p \| q) = D_{\text{KL}}(q \| p)$
  • (b) A true metric satisfying the triangle inequality
  • (c) Always non-negative
  • (d) Always finite
Answer **(c)**. KL divergence is always non-negative (Gibbs' inequality), with equality iff $p = q$. It is NOT symmetric (a), does NOT satisfy the triangle inequality (b), and can be infinite (d) when $q(x) = 0$ for some $x$ where $p(x) > 0$.
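
Asymmetry and the possibility of an infinite value are both visible in tiny examples (a sketch assuming numpy):

```python
import numpy as np

def kl(p, q):
    """KL(p || q) in nats; terms with p = 0 contribute 0 by convention."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    with np.errstate(divide="ignore"):
        return np.sum(np.where(p > 0, p * np.log(p / q), 0.0))

p = np.array([0.9, 0.1])
q = np.array([0.5, 0.5])
print(kl(p, q), kl(q, p))  # ≈ 0.368 vs ≈ 0.511: not symmetric

r = np.array([0.5, 0.5])
s = np.array([1.0, 0.0])
print(kl(r, s))  # inf: s(x) = 0 somewhere that r(x) > 0
```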

Question 15

When training a neural network classifier with cross-entropy loss, you are implicitly:

  • (a) Performing MAP estimation with a uniform prior
  • (b) Performing maximum likelihood estimation
  • (c) Minimizing the entropy of the model's predictions
  • (d) Maximizing the mutual information between inputs and outputs
Answer **(b)**. Minimizing the average cross-entropy loss $-\frac{1}{n}\sum_i \log q(y_i \mid x_i; \theta)$ is equivalent to maximizing the log-likelihood $\sum_i \log q(y_i \mid x_i; \theta)$, which is MLE. Options (a) and (b) are related (MLE can be seen as MAP with a flat prior), but (b) is the most direct and precise answer. Option (c) is incorrect -- minimizing entropy of predictions would make the model maximally confident, which is not the training objective.
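
In code, the standard classification loss literally is the average negative log-likelihood of the labels. A framework-free sketch (assuming numpy; the predicted probabilities and labels are made up):

```python
import numpy as np

# Model's predicted class probabilities for 4 examples, 3 classes (hypothetical)
q = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4],
              [0.6, 0.3, 0.1]])
y = np.array([0, 1, 2, 0])  # true labels

# Average cross-entropy loss = negative mean log-likelihood of the observed labels
loss = -np.mean(np.log(q[np.arange(len(y)), y]))
print(loss)  # minimizing this is exactly maximizing the log-likelihood
```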

Question 16

Mutual information $I(X; Y)$ equals zero if and only if:

  • (a) $X$ and $Y$ have zero covariance
  • (b) $X$ and $Y$ are independent
  • (c) $X$ and $Y$ have the same distribution
  • (d) $H(X) = H(Y)$
Answer **(b)**. $I(X; Y) = D_{\text{KL}}(p(x,y) \| p(x)p(y))$, which is zero iff $p(x,y) = p(x)p(y)$, i.e., iff $X$ and $Y$ are independent. Option (a) is insufficient -- zero covariance does not imply independence (it only captures linear dependence). Mutual information captures ALL statistical dependencies.
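
For discrete variables, mutual information can be computed by direct enumeration over the joint table (a sketch; the joint probabilities are made up):

```python
import numpy as np

# Joint distribution p(x, y) on a 2x2 table (hypothetical, dependent)
pxy = np.array([[0.4, 0.1],
                [0.1, 0.4]])
px = pxy.sum(axis=1, keepdims=True)  # marginal p(x)
py = pxy.sum(axis=0, keepdims=True)  # marginal p(y)

print(np.sum(pxy * np.log(pxy / (px * py))))  # ≈ 0.193 > 0: X, Y dependent

indep = px * py  # a genuinely independent joint with the same marginals
print(np.sum(indep * np.log(indep / (px * py))))  # 0.0
```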

Question 17

The data processing inequality states that for a Markov chain $X \to Y \to Z$:

  • (a) $H(Z) \geq H(X)$
  • (b) $I(X; Z) \leq I(X; Y)$
  • (c) $I(X; Z) \geq I(X; Y)$
  • (d) $H(X \mid Z) = 0$
Answer **(b)**. The data processing inequality states that processing cannot increase information: $I(X; Z) \leq I(X; Y)$. If $Z$ is computed from $Y$ (which is computed from $X$), then $Z$ cannot contain more information about $X$ than $Y$ does. This is why information lost in early layers of a neural network cannot be recovered by later layers.
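
One can verify the inequality exactly for a small discrete chain: push a bit $X$ through two noisy binary channels and compute both mutual informations by enumeration (a sketch; the flip probabilities are arbitrary).

```python
import numpy as np

def mi(pxy):
    """Mutual information (nats) from a joint probability table."""
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    return np.sum(pxy * np.log(pxy / (px * py)))

def flip(eps):
    """Binary symmetric channel: flips the input bit with probability eps."""
    return np.array([[1 - eps, eps], [eps, 1 - eps]])

px = np.array([0.5, 0.5])        # X ~ uniform bit
pxy = px[:, None] * flip(0.1)    # joint p(x, y): Y flips X with prob 0.1
pxz = pxy @ flip(0.2)            # joint p(x, z), valid because X -> Y -> Z

print(mi(pxy), mi(pxz))  # I(X;Y) ≈ 0.368 > I(X;Z) ≈ 0.119
```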

Question 18

The softmax function with temperature $T \to 0$ produces:

  • (a) A uniform distribution
  • (b) A distribution concentrated on the maximum logit (argmax)
  • (c) All zeros
  • (d) The identity function
Answer **(b)**. As $T \to 0$, $\text{softmax}(z_i / T)$ concentrates all probability on the index with the largest logit, effectively becoming a "hard" argmax. As $T \to \infty$, softmax approaches the uniform distribution. This temperature parameter is used in language model decoding to control the diversity of generated text.
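
The limiting behavior is easy to see numerically (a sketch assuming numpy; the logits are made up):

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, float) / T
    z -= z.max()  # log-sum-exp shift for stability (see Question 25)
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.5])
print(softmax(logits, T=0.01))   # ≈ [1, 0, 0]: a hard argmax
print(softmax(logits, T=1.0))    # the usual softmax
print(softmax(logits, T=100.0))  # ≈ [1/3, 1/3, 1/3]: near uniform
```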

Question 19

Which of the following is an example of a conjugate prior?

  • (a) Gaussian prior for the mean of a Poisson distribution
  • (b) Beta prior for the parameter of a Bernoulli distribution
  • (c) Uniform prior for the variance of a Gaussian distribution
  • (d) Exponential prior for the parameter of a Categorical distribution
Answer **(b)**. The Beta distribution is the conjugate prior for the Bernoulli (and Binomial) likelihood, meaning the posterior is also a Beta distribution. Conjugate priors are analytically convenient because the posterior has the same functional form as the prior, just with updated parameters. The Beta-Bernoulli conjugacy is one of the most important examples in Bayesian statistics.
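
Conjugacy makes the posterior update a one-liner: a $\text{Beta}(\alpha, \beta)$ prior plus $h$ heads and $t$ tails gives $\text{Beta}(\alpha + h, \beta + t)$. A sketch (the prior parameters and data are made up):

```python
import numpy as np

alpha, beta = 2.0, 2.0                       # Beta prior (hypothetical)
flips = np.array([1, 0, 1, 1, 0, 1, 1, 1])   # observed Bernoulli data (hypothetical)

heads = flips.sum()
tails = len(flips) - heads
alpha_post, beta_post = alpha + heads, beta + tails  # posterior is still a Beta

print(alpha_post, beta_post)                  # 8.0, 4.0
print(alpha_post / (alpha_post + beta_post))  # posterior mean ≈ 0.667
```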

Question 20

The perplexity of a language model is defined as $2^{H}$ (or $e^{H}$ with natural log), where $H$ is the model's average per-token cross-entropy on the test data. A lower perplexity indicates:

  • (a) The model is more uncertain about its predictions
  • (b) The model assigns higher probability to the test data
  • (c) The model has more parameters
  • (d) The training data is larger
Answer **(b)**. Perplexity can be interpreted as the effective number of equally likely choices the model considers at each step. Lower perplexity means the model is more confident and accurate -- it assigns higher probability to the observed data. A perfect model that always predicts the correct next token with probability 1 would have perplexity 1. Perplexity is a standard evaluation metric for language models.
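
Computing perplexity from per-token model probabilities takes a few lines (a sketch; the probabilities are made up):

```python
import numpy as np

# Model's probability for each observed token in a test sequence (hypothetical)
p_tokens = np.array([0.2, 0.5, 0.1, 0.3, 0.25])

cross_entropy = -np.mean(np.log(p_tokens))  # average per-token NLL, in nats
print(np.exp(cross_entropy))                # ≈ 4.2 "equally likely choices" per token

perfect = np.ones(5)                        # a model that always assigns probability 1
print(np.exp(-np.mean(np.log(perfect))))    # perplexity 1.0
```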

Question 21

You observe data $\{3, 5, 7, 4, 6\}$ assumed to come from $\mathcal{N}(\mu, \sigma^2)$. The MLE estimate of $\mu$ is:

  • (a) 4
  • (b) 5
  • (c) 6
  • (d) 25
Answer **(b)**. The MLE for the mean of a Gaussian is the sample mean: $\hat{\mu} = (3 + 5 + 7 + 4 + 6) / 5 = 25 / 5 = 5$.
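
As a sanity check, grid-maximizing the Gaussian log-likelihood over $\mu$ lands exactly on the sample mean (a sketch assuming numpy):

```python
import numpy as np

x = np.array([3.0, 5.0, 7.0, 4.0, 6.0])
mu = np.linspace(0, 10, 100_001)
log_lik = -0.5 * ((x[:, None] - mu) ** 2).sum(axis=0)  # sigma^2 fixed; constants dropped
print(mu[np.argmax(log_lik)], x.mean())  # both 5.0
```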

Question 22

Which statement about the relationship between MLE and regularization is TRUE?

  • (a) MLE naturally includes regularization
  • (b) MLE with a Gaussian prior gives L1 regularization
  • (c) MAP estimation can be viewed as regularized MLE
  • (d) Regularization increases the likelihood of the data
Answer **(c)**. MAP estimation adds $\log p(\theta)$ to the log-likelihood, which acts as a regularization term. The specific form of regularization depends on the prior: Gaussian prior gives L2, Laplace prior gives L1. Plain MLE (option a) has no regularization, and regularization generally *decreases* the training likelihood (option d) in exchange for better generalization.

Question 23

The differential entropy of a continuous random variable can be:

  • (a) Only positive
  • (b) Only non-negative
  • (c) Negative
  • (d) Complex-valued
Answer **(c)**. Unlike discrete entropy, differential entropy can be negative. For example, a uniform distribution on $[0, a]$ has differential entropy $\ln a$, which is negative when $a < 1$. This is one of the subtle differences between discrete and continuous entropy.
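
Evaluating $-\int f \ln f \, dx$ for the uniform example is immediate because the density is constant (a sketch assuming numpy):

```python
import numpy as np

a = 0.5
f = 1 / a               # uniform density on [0, a] is constant: f(x) = 1/a = 2
h = -f * np.log(f) * a  # -∫ f ln f dx, exact here since f is constant
print(h, np.log(a))     # both ≈ -0.693: the differential entropy is negative
```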

Question 24

In the context of variational autoencoders, the KL divergence term $D_{\text{KL}}(q(z|x) \| p(z))$ in the ELBO serves as:

  • (a) A reconstruction loss
  • (b) A regularizer that keeps the encoder close to the prior
  • (c) A classifier for the latent space
  • (d) A measure of the decoder quality
Answer **(b)**. The KL term penalizes the encoder distribution $q(z|x)$ for deviating from the prior $p(z)$ (typically a standard Gaussian). This regularization prevents the encoder from memorizing the training data and ensures the latent space has a smooth, usable structure. The reconstruction term is $\mathbb{E}_{q(z|x)}[\log p(x|z)]$.
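
For a diagonal Gaussian encoder and a standard normal prior, this KL term has the closed form $\tfrac{1}{2} \sum_j (\mu_j^2 + \sigma_j^2 - 1 - \ln \sigma_j^2)$, which is what VAE implementations compute. A sketch (the encoder outputs here are made up):

```python
import numpy as np

# Hypothetical encoder outputs q(z|x) = N(mu, diag(sigma^2)) for one input
mu = np.array([0.5, -1.0, 0.1])
log_var = np.array([-0.2, 0.3, 0.0])  # log sigma^2, as networks typically parameterize it

kl = 0.5 * np.sum(mu**2 + np.exp(log_var) - 1.0 - log_var)
print(kl)  # ≥ 0; equals 0 only when mu = 0 and sigma = 1 (encoder matches the prior)
```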

Question 25

The log-sum-exp trick is used to:

  • (a) Speed up matrix multiplication
  • (b) Avoid numerical overflow when computing $\log \sum_i e^{a_i}$
  • (c) Reduce memory usage in gradient computation
  • (d) Convert probabilities to logits
Answer **(b)**. The log-sum-exp trick computes $\log \sum_i \exp(a_i) = a_{\max} + \log \sum_i \exp(a_i - a_{\max})$, which avoids overflow by subtracting the maximum value before exponentiation. This is essential for numerically stable softmax computation and working with log-probabilities in machine learning.
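
A minimal implementation, compared against the naive version on large logits (a sketch assuming numpy):

```python
import numpy as np

def logsumexp(a):
    """Numerically stable log(sum(exp(a)))."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

a = np.array([1000.0, 1001.0, 1002.0])
print(np.log(np.sum(np.exp(a))))  # inf -- exp overflows
print(logsumexp(a))               # ≈ 1002.4076, computed safely
```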