Chapter 4 Quiz
Test your understanding of the core concepts from this chapter. Try to answer each question before revealing the solution.
Question 1
Which of the following is NOT one of the Kolmogorov axioms of probability?
- (a) $P(A) \geq 0$ for any event $A$
- (b) $P(\Omega) = 1$
- (c) $P(A \cup B) = P(A) + P(B)$ for any events $A, B$
- (d) For mutually exclusive events $A_1, A_2, \ldots$: $P(\bigcup_i A_i) = \sum_i P(A_i)$
Answer
**(c)**. The additivity axiom requires the events to be **mutually exclusive** (disjoint). Option (c) omits this crucial condition, making it incorrect as stated. The correct version is option (d), which specifies mutual exclusivity. For general (non-disjoint) events, we need the inclusion-exclusion formula: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.
Question 2
A medical test has 99% sensitivity and 95% specificity. The disease prevalence is 1 in 1,000. If a person tests positive, what is the approximate posterior probability of disease?
- (a) 99%
- (b) 95%
- (c) About 2%
- (d) About 50%
Answer
**(c)**. Using Bayes' theorem: $$P(D \mid +) = \frac{0.99 \times 0.001}{0.99 \times 0.001 + 0.05 \times 0.999} = \frac{0.00099}{0.00099 + 0.04995} \approx 0.019 \approx 2\%$$ This is a classic example of the base rate fallacy. Despite the high sensitivity, the low prevalence means most positive tests are false positives.
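The arithmetic is easy to check numerically; here is a minimal sketch in plain Python (variable names are illustrative):

```python
sensitivity = 0.99   # P(+ | disease)
specificity = 0.95   # P(- | no disease); false positive rate is 1 - 0.95 = 0.05
prevalence = 0.001   # P(disease) = 1 in 1,000

# Bayes' theorem: P(disease | +) = P(+ | disease) P(disease) / P(+)
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
posterior = sensitivity * prevalence / p_positive
print(posterior)     # ~0.0194, i.e. about 2%
```

Question 3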
Two events $A$ and $B$ are independent if and only if:
- (a) $P(A \cap B) = 0$
- (b) $P(A \cap B) = P(A) \cdot P(B)$
- (c) $P(A \mid B) = P(B \mid A)$
- (d) $P(A \cup B) = P(A) + P(B)$
Answer
**(b)**. Independence means $P(A \cap B) = P(A) \cdot P(B)$, equivalently $P(A \mid B) = P(A)$. Option (a) describes mutually exclusive events, which is actually the *opposite* of independence (for events with nonzero probability). Option (d) is the addition rule for mutually exclusive events.
Question 4
Which distribution would you use to model the number of customers arriving at a store per hour?
- (a) Bernoulli
- (b) Gaussian
- (c) Poisson
- (d) Beta
Answer
**(c)**. The Poisson distribution models the count of events occurring in a fixed interval of time or space, given a constant average rate. It is the natural choice for modeling arrival counts. The Bernoulli is for binary outcomes, the Gaussian is for continuous measurements, and the Beta is a distribution over probabilities.
Question 5
For a Gaussian random variable $X \sim \mathcal{N}(\mu, \sigma^2)$, which statement is TRUE?
- (a) $P(X = \mu)$ is the maximum probability value
- (b) The PDF value $f(\mu)$ can never exceed 1
- (c) Approximately 95% of values fall within $\mu \pm 2\sigma$
- (d) The variance equals $\mu^2$
Answer
**(c)**. By the 68-95-99.7 rule, approximately 95.4% of values fall within 2 standard deviations of the mean. Option (a) is wrong because for continuous distributions, the probability of any single point is 0. Option (b) is wrong because the PDF *can* exceed 1 (e.g., for $\sigma < 1/\sqrt{2\pi} \approx 0.4$). Option (d) is nonsensical -- variance is $\sigma^2$, independent of $\mu$.
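Both points about the Gaussian are easy to verify numerically; a small sketch, assuming SciPy is available:

```python
from scipy.stats import norm

# The PDF peaks at the mean with value 1 / (sigma * sqrt(2*pi)),
# which exceeds 1 whenever sigma < 1/sqrt(2*pi) ~ 0.4.
print(norm(loc=0, scale=0.1).pdf(0))   # ~3.99: a density, not a probability

# About 95.4% of the mass lies within mu +/- 2*sigma.
print(norm.cdf(2) - norm.cdf(-2))      # ~0.9545
```

Question 6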
The covariance between two independent random variables $X$ and $Y$ is:
- (a) Always positive
- (b) Always negative
- (c) Zero
- (d) Equal to $\sigma_X \sigma_Y$
Answer
**(c)**. If $X$ and $Y$ are independent, then $\mathbb{E}[XY] = \mathbb{E}[X]\mathbb{E}[Y]$, so $\text{Cov}(X,Y) = \mathbb{E}[XY] - \mathbb{E}[X]\mathbb{E}[Y] = 0$. Important: the converse is NOT true -- zero covariance does not imply independence.
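The failure of the converse is worth seeing concretely. A minimal sketch, assuming NumPy: take $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$; the variables are completely dependent, yet their covariance is zero.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.choice([-1.0, 0.0, 1.0], size=1_000_000)   # X uniform on {-1, 0, 1}
y = x ** 2                                          # Y is a deterministic function of X

# Cov(X, Y) = E[XY] - E[X]E[Y] = E[X^3] - 0 = 0 despite full dependence.
print(np.cov(x, y)[0, 1])   # close to 0 (exactly 0 in expectation)
```

Question 7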
In maximum likelihood estimation, we find parameters that:
- (a) Maximize the prior probability of the parameters
- (b) Maximize the probability of the observed data given the parameters
- (c) Maximize the posterior probability of the parameters
- (d) Minimize the entropy of the parameter distribution
Answer
**(b)**. MLE maximizes $p(\mathcal{D} \mid \theta)$, the likelihood -- the probability of observing the data under the assumed model with parameters $\theta$. Option (a) involves a prior, which plays no role in MLE (priors belong to Bayesian methods). Option (c) describes MAP estimation. Option (d) is not the MLE objective.
Question 8
Why do we typically maximize the log-likelihood rather than the likelihood itself?
- (a) The log-likelihood is always larger, making optimization easier
- (b) Products become sums, improving numerical stability and analytical convenience
- (c) The log-likelihood has a unique maximum while the likelihood does not
- (d) The log-likelihood is convex for all distributions
Answer
**(b)**. The likelihood of i.i.d. data is a product: $\prod_i p(x_i \mid \theta)$. Taking the log converts this to a sum: $\sum_i \log p(x_i \mid \theta)$. Sums are numerically more stable (products of small numbers underflow) and analytically easier to differentiate. The log transformation is monotonic, so the maximizer is the same. Option (d) is false -- the log-likelihood is NOT convex for all distributions (e.g., Gaussian mixtures).
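The numerical point is easy to demonstrate: a product of many per-sample densities underflows to zero in floating point, while the sum of their logs stays finite. A small sketch, assuming NumPy and SciPy:

```python
import numpy as np
from scipy.stats import norm

x = norm.rvs(size=2000, random_state=0)   # 2000 i.i.d. standard normal samples
densities = norm.pdf(x)

print(np.prod(densities))          # 0.0 -- the product underflows
print(np.sum(np.log(densities)))   # a finite log-likelihood (around -2.8e3 here)
```

Question 9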
MAP estimation with a Gaussian prior $p(\theta) = \mathcal{N}(0, \sigma_0^2 I)$ is equivalent to MLE with:
- (a) L1 regularization
- (b) L2 regularization
- (c) Dropout
- (d) No regularization
Answer
**(b)**. The log of a Gaussian prior is $\log p(\theta) = -\|\theta\|^2 / (2\sigma_0^2) + \text{const}$, which adds an L2 penalty to the log-likelihood objective. This is exactly L2 regularization (weight decay). A Laplace prior would give L1 regularization (option a).
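To make the correspondence concrete, here is a sketch of the negative MAP objective for linear regression; the function name and data are placeholders, and the point is that the Gaussian prior contributes exactly a weight-decay term with strength $1/(2\sigma_0^2)$.

```python
import numpy as np

def neg_map_objective(w, X, y, noise_var=1.0, prior_var=10.0):
    """Negative log-posterior (up to constants) for linear regression
    with a Gaussian prior N(0, prior_var * I) on the weights w."""
    nll = np.sum((y - X @ w) ** 2) / (2 * noise_var)   # negative log-likelihood
    l2_penalty = np.sum(w ** 2) / (2 * prior_var)      # from -log N(w; 0, prior_var I)
    return nll + l2_penalty                            # = L2-regularized least squares
```

Question 10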
As the amount of training data $n \to \infty$, the MAP estimate:
- (a) Diverges from the MLE
- (b) Converges to the prior mean
- (c) Converges to the MLE
- (d) Becomes undefined
Answer
**(c)**. As $n$ grows, the data term $\sum_i \log p(x_i \mid \theta)$ dominates the prior term $\log p(\theta)$, and the MAP estimate converges to the MLE. The prior becomes irrelevant with enough data. This is why regularization matters most when data is scarce.
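For a Gaussian likelihood with known variance and a Gaussian prior on the mean, the MAP estimate has a closed form that makes this convergence explicit. A quick sketch, assuming NumPy (the helper name is illustrative):

```python
import numpy as np

def map_mean(x, mu0=0.0, prior_var=1.0, noise_var=1.0):
    """MAP estimate of a Gaussian mean under a N(mu0, prior_var) prior."""
    n = len(x)
    precision = n / noise_var + 1.0 / prior_var
    return (np.sum(x) / noise_var + mu0 / prior_var) / precision

rng = np.random.default_rng(0)
for n in [10, 1_000, 100_000]:
    x = rng.normal(loc=3.0, scale=1.0, size=n)
    print(n, map_mean(x), np.mean(x))   # the MAP estimate approaches the MLE (sample mean)
```

Question 11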
The entropy of a fair coin (using natural logarithm) is:
- (a) 0
- (b) 0.5
- (c) $\ln 2 \approx 0.693$
- (d) 1
Answer
**(c)**. $H = -[0.5 \ln 0.5 + 0.5 \ln 0.5] = -\ln 0.5 = \ln 2 \approx 0.693$ nats. If we used $\log_2$ instead, the answer would be exactly 1 bit, so option (d) would be correct for entropy measured in bits rather than nats.
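A one-liner confirms both unit conventions; a sketch assuming SciPy:

```python
from scipy.stats import entropy

print(entropy([0.5, 0.5]))           # natural log: ln 2 ~ 0.693 nats
print(entropy([0.5, 0.5], base=2))   # base 2: exactly 1 bit
```

Question 12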
Which distribution has the maximum entropy among all discrete distributions on $\{1, 2, \ldots, K\}$?
- (a) Bernoulli
- (b) The distribution concentrated on a single outcome
- (c) The uniform distribution
- (d) The Gaussian distribution
Answer
**(c)**. The uniform distribution $P(X = k) = 1/K$ for all $k$ maximizes entropy at $H = \log K$. This can be proved using the non-negativity of KL divergence or Lagrange multipliers. The distribution in (b) has *minimum* entropy ($H = 0$). The Gaussian (d) maximizes differential entropy for continuous distributions with a fixed variance, but is not discrete.
Question 13
Cross-entropy $H(p, q)$ between the true distribution $p$ and the model distribution $q$ satisfies:
- (a) $H(p, q) \geq H(q)$
- (b) $H(p, q) \leq H(p)$
- (c) $H(p, q) \geq H(p)$
- (d) $H(p, q) = H(p) \cdot H(q)$
Answer
**(c)**. By definition, $H(p, q) = H(p) + D_{\text{KL}}(p \| q)$, and since $D_{\text{KL}}(p \| q) \geq 0$, we have $H(p, q) \geq H(p)$, with equality if and only if $p = q$. This means the cross-entropy is always at least as large as the true entropy; the excess is the KL divergence.
Question 14
KL divergence $D_{\text{KL}}(p \| q)$ is:
- (a) Always symmetric: $D_{\text{KL}}(p \| q) = D_{\text{KL}}(q \| p)$
- (b) A true metric satisfying the triangle inequality
- (c) Always non-negative
- (d) Always finite
Answer
**(c)**. KL divergence is always non-negative (Gibbs' inequality), with equality iff $p = q$. It is NOT symmetric (a), does NOT satisfy the triangle inequality (b), and can be infinite (d) when $q(x) = 0$ for some $x$ where $p(x) > 0$.
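The identity from Question 13 and the properties claimed here are easy to check on a small discrete example; a sketch assuming NumPy:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model distribution

H_p   = -np.sum(p * np.log(p))       # entropy of p
H_pq  = -np.sum(p * np.log(q))       # cross-entropy H(p, q)
kl_pq = np.sum(p * np.log(p / q))    # D_KL(p || q)
kl_qp = np.sum(q * np.log(q / p))    # D_KL(q || p)

print(np.isclose(H_pq, H_p + kl_pq))   # True: H(p, q) = H(p) + D_KL(p || q)
print(kl_pq >= 0, kl_pq == kl_qp)      # True False: non-negative, but not symmetric
```

Question 15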
When training a neural network classifier with cross-entropy loss, you are implicitly:
- (a) Performing MAP estimation with a uniform prior
- (b) Performing maximum likelihood estimation
- (c) Minimizing the entropy of the model's predictions
- (d) Maximizing the mutual information between inputs and outputs
Answer
**(b)**. Minimizing the average cross-entropy loss $-\frac{1}{n}\sum_i \log q(y_i \mid x_i; \theta)$ is equivalent to maximizing the log-likelihood $\sum_i \log q(y_i \mid x_i; \theta)$, which is MLE. Options (a) and (b) are related (MLE can be seen as MAP with a flat prior), but (b) is the most direct and precise answer. Option (c) is incorrect -- minimizing entropy of predictions would make the model maximally confident, which is not the training objective.
Question 16
Mutual information $I(X; Y)$ equals zero if and only if:
- (a) $X$ and $Y$ have zero covariance
- (b) $X$ and $Y$ are independent
- (c) $X$ and $Y$ have the same distribution
- (d) $H(X) = H(Y)$
Answer
**(b)**. $I(X; Y) = D_{\text{KL}}(p(x,y) \| p(x)p(y))$, which is zero iff $p(x,y) = p(x)p(y)$, i.e., iff $X$ and $Y$ are independent. Option (a) is insufficient -- zero covariance does not imply independence (it only captures linear dependence). Mutual information captures ALL statistical dependencies.
Question 17
The data processing inequality states that for a Markov chain $X \to Y \to Z$:
- (a) $H(Z) \geq H(X)$
- (b) $I(X; Z) \leq I(X; Y)$
- (c) $I(X; Z) \geq I(X; Y)$
- (d) $H(X \mid Z) = 0$
Answer
**(b)**. The data processing inequality states that processing cannot increase information. If $Z$ is computed from $Y$ (which is computed from $X$), then $Z$ cannot contain more information about $X$ than $Y$ does. This is why information lost in early layers of a neural network cannot be recovered by later layers.
Question 18
The softmax function with temperature $T \to 0$ produces:
- (a) A uniform distribution
- (b) A distribution concentrated on the maximum logit (argmax)
- (c) All zeros
- (d) The identity function
Answer
**(b)**. As $T \to 0$, $\text{softmax}(z_i / T)$ concentrates all probability on the index with the largest logit, effectively becoming a "hard" argmax. As $T \to \infty$, softmax approaches the uniform distribution. This temperature parameter is used in language model decoding to control the diversity of generated text.
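A quick sketch of temperature-scaled softmax (assuming NumPy) makes the two limits visible:

```python
import numpy as np

def softmax(z, T=1.0):
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()              # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 1.0, 0.5]
print(softmax(logits, T=0.01))   # ~[1, 0, 0]: effectively a hard argmax
print(softmax(logits, T=100.0))  # ~[0.33, 0.33, 0.33]: nearly uniform
```

Question 19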
Which of the following is an example of a conjugate prior?
- (a) Gaussian prior for the mean of a Poisson distribution
- (b) Beta prior for the parameter of a Bernoulli distribution
- (c) Uniform prior for the variance of a Gaussian distribution
- (d) Exponential prior for the parameter of a Categorical distribution
Answer
**(b)**. The Beta distribution is the conjugate prior for the Bernoulli (and Binomial) likelihood, meaning the posterior is also a Beta distribution. Conjugate priors are analytically convenient because the posterior has the same functional form as the prior, just with updated parameters. The Beta-Bernoulli conjugacy is one of the most important examples in Bayesian statistics.
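The conjugate update itself is just counting. A minimal sketch of the Beta-Bernoulli posterior update, assuming SciPy for the Beta distribution object:

```python
from scipy.stats import beta

a, b = 2.0, 2.0                  # prior Beta(a, b) over the heads probability
flips = [1, 0, 1, 1, 0, 1, 1]    # observed Bernoulli outcomes (1 = heads)

# Conjugacy: the posterior is Beta(a + #heads, b + #tails) -- same family, updated counts.
a_post = a + sum(flips)
b_post = b + len(flips) - sum(flips)
print(beta(a_post, b_post).mean())   # posterior mean = a_post / (a_post + b_post) ~ 0.636
```

Question 20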
The perplexity of a language model is defined as $2^{H(p)}$ (or $e^{H(p)}$ with natural log). A lower perplexity indicates:
- (a) The model is more uncertain about its predictions
- (b) The model assigns higher probability to the test data
- (c) The model has more parameters
- (d) The training data is larger
Answer
**(b)**. Perplexity can be interpreted as the effective number of equally likely choices the model considers at each step. Lower perplexity means the model is more confident and accurate -- it assigns higher probability to the observed data. A perfect model that always predicts the correct next token with probability 1 would have perplexity 1. Language model evaluation commonly uses perplexity as a key metric.
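In practice, perplexity is computed as the exponential of the average per-token negative log-likelihood on held-out text. A minimal sketch with made-up token probabilities, assuming NumPy:

```python
import numpy as np

# Probabilities the model assigned to each observed (true) next token.
token_probs = np.array([0.4, 0.1, 0.25, 0.05, 0.3])

avg_nll = -np.mean(np.log(token_probs))   # average cross-entropy, in nats
perplexity = np.exp(avg_nll)
print(perplexity)   # ~5.8: as if choosing among roughly 6 equally likely tokens per step
```

Question 21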
You observe data $\{3, 5, 7, 4, 6\}$ assumed to come from $\mathcal{N}(\mu, \sigma^2)$. The MLE estimate of $\mu$ is:
- (a) 4
- (b) 5
- (c) 6
- (d) 25
Answer
**(b)**. The MLE for the mean of a Gaussian is the sample mean: $\hat{\mu} = (3 + 5 + 7 + 4 + 6) / 5 = 25 / 5 = 5$.
Question 22
Which statement about the relationship between MLE and regularization is TRUE?
- (a) MLE naturally includes regularization
- (b) MLE with a Gaussian prior gives L1 regularization
- (c) MAP estimation can be viewed as regularized MLE
- (d) Regularization increases the likelihood of the data
Answer
**(c)**. MAP estimation adds $\log p(\theta)$ to the log-likelihood, which acts as a regularization term. The specific form of regularization depends on the prior: Gaussian prior gives L2, Laplace prior gives L1. Plain MLE (option a) has no regularization, and regularization generally *decreases* the training likelihood (option d) in exchange for better generalization.
Question 23
The differential entropy of a continuous random variable can be:
- (a) Only positive
- (b) Only non-negative
- (c) Negative
- (d) Complex-valued
Answer
**(c)**. Unlike discrete entropy, differential entropy can be negative. For example, a uniform distribution on $[0, a]$ has differential entropy $\ln a$, which is negative when $a < 1$. This is one of the subtle differences between discrete and continuous entropy.
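A quick check, assuming SciPy: the differential entropy of Uniform$[0, a]$ is $\ln a$, which goes negative once $a < 1$.

```python
import numpy as np
from scipy.stats import uniform

for a in [2.0, 1.0, 0.5]:
    # uniform(loc=0, scale=a) is the Uniform[0, a] distribution.
    print(a, uniform(loc=0, scale=a).entropy(), np.log(a))   # entropy equals ln(a)
```

Question 24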
In the context of variational autoencoders, the KL divergence term $D_{\text{KL}}(q(z|x) \| p(z))$ in the ELBO serves as:
- (a) A reconstruction loss
- (b) A regularizer that keeps the encoder close to the prior
- (c) A classifier for the latent space
- (d) A measure of the decoder quality
Answer
**(b)**. The KL term penalizes the encoder distribution $q(z|x)$ for deviating from the prior $p(z)$ (typically a standard Gaussian). This regularization prevents the encoder from memorizing the training data and ensures the latent space has a smooth, usable structure. The reconstruction term is $\mathbb{E}_{q(z|x)}[\log p(x|z)]$.
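For a diagonal Gaussian encoder $q(z|x) = \mathcal{N}(\mu, \text{diag}(\sigma^2))$ and a standard normal prior, this KL term has the closed form $\frac{1}{2}\sum_j (\sigma_j^2 + \mu_j^2 - 1 - \ln \sigma_j^2)$, which is how it is typically implemented. A minimal sketch, assuming NumPy (the function name is illustrative):

```python
import numpy as np

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)

print(kl_to_standard_normal(np.zeros(8), np.zeros(8)))   # 0.0: encoder matches the prior
print(kl_to_standard_normal(np.ones(8), np.zeros(8)))    # 4.0: deviation is penalized
```

Question 25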
The log-sum-exp trick is used to:
- (a) Speed up matrix multiplication
- (b) Avoid numerical overflow when computing $\log \sum_i e^{a_i}$
- (c) Reduce memory usage in gradient computation
- (d) Convert probabilities to logits
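Answer
**(b)**. Computing $\log \sum_i e^{a_i}$ naively overflows as soon as any $a_i$ is moderately large (in double precision, $e^{a_i}$ overflows for $a_i \gtrsim 710$). The trick factors out the maximum $m = \max_i a_i$: $\log \sum_i e^{a_i} = m + \log \sum_i e^{a_i - m}$, so every exponent is at most 0 and the computation stays finite. It appears throughout probabilistic ML, e.g., in stable softmax and mixture log-likelihood computations.

A minimal sketch, assuming NumPy (SciPy also provides this as `scipy.special.logsumexp`):

```python
import numpy as np

def log_sum_exp(a):
    """Numerically stable log(sum(exp(a)))."""
    m = np.max(a)
    return m + np.log(np.sum(np.exp(a - m)))

a = np.array([1000.0, 1001.0, 1002.0])
print(np.log(np.sum(np.exp(a))))   # inf: the naive computation overflows
print(log_sum_exp(a))              # ~1002.41: the stable result
```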