Chapter 4: Quiz

Test your understanding of information theory concepts. Answers follow each question.


Question 1

A fair 8-sided die is rolled. How much information (in bits) does the outcome provide?

Answer $I = \log_2(8) = 3$ bits. Equivalently, the entropy of a uniform distribution over 8 outcomes is $H = \log_2 8 = 3$ bits. You need 3 binary questions to identify the outcome.

Question 2

Which has higher entropy: a Bernoulli distribution with $p = 0.5$, or a Bernoulli distribution with $p = 0.9$? Explain why without computing.

Answer The $p = 0.5$ distribution has higher entropy. Entropy is maximized when all outcomes are equally likely (maximum uncertainty). At $p = 0.9$, the distribution is concentrated — you can predict the outcome with 90% accuracy, so there is less uncertainty. The fair coin ($p = 0.5$) has maximum entropy of 1 bit.
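A quick numeric check of this claim, as a sketch using only the standard library:

```python
import math

def binary_entropy(p):
    """Entropy (in bits) of a Bernoulli(p) distribution."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy(0.5))  # 1.0 bit: maximum uncertainty
print(binary_entropy(0.9))  # ≈ 0.469 bits: far more predictable
```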

Question 3

True or False: KL divergence is a distance metric (satisfies symmetry, non-negativity, and triangle inequality).

Answer **False.** KL divergence satisfies non-negativity ($D_{\text{KL}}(p \| q) \geq 0$) but violates symmetry ($D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p)$ in general) and the triangle inequality. It is a divergence, not a metric.
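To see the asymmetry numerically, here is a small sketch with two hypothetical two-outcome distributions (the specific values are illustrative):

```python
import math

def kl_bits(p, q):
    """D_KL(p || q) in bits for discrete distributions given as lists."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.9, 0.1]
q = [0.5, 0.5]
print(kl_bits(p, q))  # ≈ 0.531
print(kl_bits(q, p))  # ≈ 0.737 -> the two directions disagree: not symmetric
```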

Question 4

A classifier produces the predicted probability vector $\hat{y} = [0.7, 0.2, 0.1]$ for a 3-class problem. The true label is class 0 (0-indexed). What is the cross-entropy loss for this example?

Answer For a one-hot true label, the cross-entropy loss is $-\log \hat{y}_c$ where $c$ is the true class. Here: $-\log(0.7) \approx 0.357$ nats (or $-\log_2(0.7) \approx 0.515$ bits). In PyTorch/TensorFlow (which use natural log), the answer is $0.357$ nats.
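The same computation, sketched with only the standard library (the 0.7 entry is the probability the answer takes for the true class):

```python
import math

y_hat = [0.7, 0.2, 0.1]        # predicted probabilities
p_true = y_hat[0]              # probability assigned to the true class

loss_nats = -math.log(p_true)  # what PyTorch/TensorFlow would report
loss_bits = -math.log2(p_true)
print(loss_nats)  # ≈ 0.357
print(loss_bits)  # ≈ 0.515
```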

Question 5

A feature $X$ has zero Pearson correlation with the target $Y$. Can you conclude that $X$ carries no information about $Y$?

Answer **No.** Zero correlation means zero *linear* dependence, but there may be nonlinear dependence. For example, if $X \sim \mathcal{N}(0,1)$ and $Y = X^2$, then $\text{Corr}(X, Y) = 0$ but $I(X; Y) > 0$. The mutual information captures all dependencies (linear and nonlinear), while correlation captures only linear relationships.
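The $Y = X^2$ example can be checked empirically. This sketch estimates the sample correlation from simulated data (the seed and sample size are arbitrary choices); it shows only the correlation side of the claim, since mutual information estimation needs more machinery:

```python
import math
import random

random.seed(0)
xs = [random.gauss(0, 1) for _ in range(100_000)]
ys = [x * x for x in xs]   # Y is a deterministic function of X

def corr(a, b):
    """Sample Pearson correlation of two equal-length lists."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b)) / n
    sa = math.sqrt(sum((x - ma) ** 2 for x in a) / n)
    sb = math.sqrt(sum((y - mb) ** 2 for y in b) / n)
    return cov / (sa * sb)

# Near zero despite full (nonlinear) dependence between X and Y
print(corr(xs, ys))
```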

Question 6

State the data processing inequality. If a neural network has layers $X \to h_1 \to h_2 \to \hat{Y}$, what does the inequality imply about $I(X; h_2)$ relative to $I(X; h_1)$?

Answer The data processing inequality states: if $X \to Y \to Z$ forms a Markov chain, then $I(X; Z) \leq I(X; Y)$. For the neural network: $I(X; h_2) \leq I(X; h_1) \leq H(X)$. Each layer can only preserve or lose information about the input — it cannot create new information about $X$.

Question 7

What is the maximum entropy distribution over the integers $\{1, 2, 3, 4, 5\}$ if the only constraint is that the probabilities sum to 1?

Answer The uniform distribution: $p(x) = 1/5$ for all $x$. With no constraints beyond normalization, the maximum entropy principle prescribes the distribution that makes the fewest assumptions — the uniform distribution. Its entropy is $\log_2 5 \approx 2.322$ bits.
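A quick check that the uniform distribution beats a hypothetical skewed alternative over the same 5 outcomes:

```python
import math

def entropy_bits(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

uniform = [0.2] * 5
skewed = [0.5, 0.2, 0.1, 0.1, 0.1]   # an arbitrary non-uniform alternative
print(entropy_bits(uniform))  # ≈ 2.322 bits = log2(5)
print(entropy_bits(skewed))   # strictly less
```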

Question 8

Write the ELBO decomposition. What are the two terms, and what role does each play?

Answer $$\text{ELBO}(q) = \underbrace{\mathbb{E}_{q(\theta)}[\log p(x \mid \theta)]}_{\text{expected log-likelihood}} - \underbrace{D_{\text{KL}}(q(\theta) \| p(\theta))}_{\text{KL from prior}}$$ The first term encourages the variational distribution $q(\theta)$ to place mass where the data likelihood is high (data fit). The second term penalizes $q(\theta)$ for deviating from the prior $p(\theta)$ (regularization/complexity penalty). Maximizing the ELBO balances data fit against prior consistency.

Question 9

If $H(X) = 3$ bits, $H(Y) = 2$ bits, and $H(X, Y) = 4$ bits, what is $I(X; Y)$?

Answer $I(X; Y) = H(X) + H(Y) - H(X, Y) = 3 + 2 - 4 = 1$ bit. One bit of information is shared between $X$ and $Y$ — knowing one reduces the uncertainty about the other by 1 bit.

Question 10

True or False: Differential entropy (for continuous distributions) is always non-negative.

Answer **False.** Unlike discrete entropy, differential entropy can be negative. For example, $\text{Uniform}(0, 0.5)$ has differential entropy $h = \log(0.5) = -\log(2) < 0$. This is one of the key differences between discrete and continuous entropy.
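For $\text{Uniform}(0, w)$ the differential entropy is $h = \log w$ nats, so the sign flips with the width. A one-line sketch:

```python
import math

def uniform_diff_entropy(width):
    """Differential entropy (nats) of Uniform(0, width): h = log(width)."""
    return math.log(width)

print(uniform_diff_entropy(0.5))  # ≈ -0.693: negative
print(uniform_diff_entropy(2.0))  # ≈ +0.693: positive
```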

Question 11

Why does PyTorch's F.cross_entropy take logits rather than probabilities as input?

Answer For numerical stability. `F.cross_entropy` internally computes the log-softmax using the log-sum-exp trick, which avoids the intermediate computation of very small probabilities. If you first compute `softmax(logits)` (which might produce values like $10^{-45}$) and then take `log()`, the result is `-inf`, causing NaN gradients. By combining softmax and log into a single numerically stable operation, PyTorch avoids this underflow problem.
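The underflow can be reproduced without PyTorch. This pure-Python sketch uses a deliberately extreme logit gap (800, chosen as an assumption large enough to force float64 underflow) and contrasts the naive path with the log-sum-exp trick:

```python
import math

logits = [0.0, 800.0]  # extreme gap, chosen to force underflow

# Naive path: compute softmax, then take log.
# Even with max-subtraction, exp(-800) underflows to exactly 0.0.
exps = [math.exp(z - max(logits)) for z in logits]
probs = [e / sum(exps) for e in exps]
naive = math.log(probs[0]) if probs[0] > 0 else float("-inf")  # -inf

# Stable path: log-softmax via the log-sum-exp trick.
m = max(logits)
lse = m + math.log(sum(math.exp(z - m) for z in logits))
stable = logits[0] - lse  # exactly -800.0, finite

print(naive, stable)
```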

Question 12

A language model assigns probability 0.001 to the word that actually appears next. What is the surprisal (in bits)?

Answer $I = -\log_2(0.001) = \log_2(1000) \approx 9.97$ bits. This is a high surprisal — the model was very surprised by this word. In language modeling, perplexity is defined as $2^{H}$ where $H$ is the average cross-entropy in bits, so high surprisal per token leads to high perplexity.
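The arithmetic, plus the perplexity connection for the (hypothetical) case where every token is this surprising:

```python
import math

def surprisal_bits(p):
    """Surprisal (self-information) in bits of an event with probability p."""
    return -math.log2(p)

print(surprisal_bits(0.001))       # ≈ 9.97 bits
# If every token had surprisal ~9.97 bits, perplexity 2^H would be:
print(2 ** surprisal_bits(0.001))  # ≈ 1000
```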

Question 13

Explain the relationship between cross-entropy and KL divergence using the equation $H(p, q) = H(p) + D_{\text{KL}}(p \| q)$. Why does minimizing cross-entropy with respect to $q$ equal minimizing KL divergence?

Answer Cross-entropy decomposes into two parts: the entropy of the true distribution $H(p)$ (a constant that does not depend on the model $q$) and the KL divergence $D_{\text{KL}}(p \| q)$ (which measures how much $q$ deviates from $p$). When optimizing the model $q$ to minimize cross-entropy, the constant $H(p)$ drops out of the gradient, so the optimizer is effectively minimizing the KL divergence. This is why MLE (minimizing NLL = minimizing cross-entropy) finds the model closest to the true distribution in the KL sense.
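The decomposition can be verified numerically for any pair of discrete distributions (the two distributions below are arbitrary illustrations):

```python
import math

def entropy(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl(p, q):
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.6, 0.3, 0.1]
q = [0.5, 0.25, 0.25]
print(cross_entropy(p, q))    # H(p, q) ...
print(entropy(p) + kl(p, q))  # ... equals H(p) + KL(p || q)
```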

Question 14

In the information bottleneck framework, what does the parameter $\beta$ control?

Answer $\beta$ controls the tradeoff between compression and prediction. The IB objective is $\min I(X; T) - \beta \cdot I(T; Y)$. When $\beta$ is small, compression dominates: the representation $T$ discards most of the input information. When $\beta$ is large, prediction dominates: the representation preserves as much task-relevant information as possible. The information bottleneck curve (plotting $I(T; Y)$ vs. $I(X; T)$ for different $\beta$) characterizes the optimal compression-prediction tradeoff.

Question 15

You are monitoring a deployed model. The average entropy of the model's predicted probability distributions suddenly increases by 40% compared to the previous week. What might this indicate? Name two possible causes.

Answer An increase in predictive entropy means the model is less confident — its predicted distributions are more spread out across classes. Two possible causes:

1. **Distribution shift (data drift).** The incoming data has shifted away from the training distribution, so the model encounters unfamiliar patterns and produces uncertain predictions.
2. **Concept drift.** The relationship between features and the target has changed (e.g., user behavior changed), so the model's learned patterns no longer produce confident predictions.

Both warrant investigation. Check feature distributions for drift (using KL divergence, as in Exercise 4.25) and evaluate model accuracy on recent labeled data.
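A minimal sketch of the monitoring quantity itself — the average predictive entropy over a batch of predictions (the two example batches are hypothetical):

```python
import math

def entropy_bits(p):
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

def mean_predictive_entropy(batch):
    """Average entropy (bits) over a batch of predicted probability vectors."""
    return sum(entropy_bits(p) for p in batch) / len(batch)

confident_week = [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]]
uncertain_week = [[0.5, 0.3, 0.2], [0.4, 0.35, 0.25]]
print(mean_predictive_entropy(confident_week))  # low
print(mean_predictive_entropy(uncertain_week))  # noticeably higher -> investigate
```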

Question 16

Given $H(Y \mid X) = 0.2$ bits and $H(Y) = 1.5$ bits, how much information does $X$ provide about $Y$?

Answer $I(X; Y) = H(Y) - H(Y \mid X) = 1.5 - 0.2 = 1.3$ bits. Knowing $X$ reduces the uncertainty about $Y$ from 1.5 bits to 0.2 bits — a reduction of 1.3 bits. $X$ provides substantial information about $Y$.

Question 17

Why is the Gaussian the maximum entropy distribution for a given mean and variance? What is the practical consequence for statistical modeling?

Answer Among all distributions with a specified mean $\mu$ and variance $\sigma^2$ (and support on $\mathbb{R}$), the Gaussian $\mathcal{N}(\mu, \sigma^2)$ has the highest entropy. This means it makes the *fewest assumptions* beyond the known mean and variance. The practical consequence: when you assume Gaussian noise in a regression model, you are making the least committal distributional assumption consistent with knowing the noise's first two moments. This is why Gaussian assumptions are often a good default — not because "everything is Gaussian," but because the Gaussian encodes maximal uncertainty given what you know.

Question 18

Forward KL ($D_{\text{KL}}(p \| q)$) produces "mean-seeking" behavior, while reverse KL ($D_{\text{KL}}(q \| p)$) produces "mode-seeking" behavior. Which one is used in standard MLE/cross-entropy training? Which is used in variational inference?

Answer

- **MLE / cross-entropy training** uses forward KL: $\min_q D_{\text{KL}}(\hat{p}_{\text{data}} \| q)$. This is mean-seeking — the model $q$ tries to cover all the modes of the data distribution.
- **Variational inference** uses reverse KL: $\min_q D_{\text{KL}}(q \| p(\theta \mid x))$. This is mode-seeking — the variational distribution $q$ tends to concentrate on one mode of the posterior rather than spreading across all modes.

The choice has practical consequences: forward KL can lead to oversmoothing (the model "hedges"), while reverse KL can lead to mode collapse (the model ignores some modes of the posterior).

Question 19

In the climate model ensemble analysis, what does the "epistemic uncertainty" represent, and how is it computed using entropy?

Answer Epistemic uncertainty represents the *model disagreement* — uncertainty that arises because different climate models produce different predictions, reflecting gaps in scientific understanding. It is computed as: $$\text{Epistemic} = H(\bar{p}) - \frac{1}{M}\sum_{m=1}^{M} H(p_m)$$ where $\bar{p}$ is the ensemble mean distribution and $p_m$ are the individual model distributions. This is the total entropy minus the average within-model entropy. Epistemic uncertainty is potentially reducible with better models or more data, unlike aleatoric uncertainty (inherent randomness).

Question 20

A sufficient statistic $T(X)$ for parameter $\theta$ satisfies $I(X; \theta) = I(T(X); \theta)$. Explain this statement in plain language, then give an example.

Answer In plain language: the sufficient statistic preserves all the information the data contains about the parameter. No information about $\theta$ is lost by reducing the data $X$ to its summary $T(X)$. **Example:** For $n$ observations from $\mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$, the sample mean $\bar{X}$ is sufficient for $\mu$. The full dataset of $n$ numbers contains exactly the same information about $\mu$ as the single number $\bar{X}$. The individual observations carry additional information (about which specific values were observed), but none of that additional information is relevant to estimating $\mu$.