Chapter 4: Exercises

Exercises are graded by difficulty:

  • One star (*): Apply the technique from the chapter to a new dataset or scenario
  • Two stars (**): Extend the technique or combine it with a previous chapter's methods
  • Three stars (***): Derive a result, implement from scratch, or design a system component
  • Four stars (****): Research-level problems that connect to open questions in the field


Foundational Computations

Exercise 4.1 (*)

A weather station reports four conditions: sunny (40%), cloudy (30%), rainy (20%), and snowy (10%).

(a) Compute the entropy of this distribution in bits.

(b) What is the maximum possible entropy for a 4-outcome distribution? What distribution achieves it?

(c) How much information (in bits) do you gain when you learn it is snowing?
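A minimal NumPy sketch for checking your hand computations; the helper name `entropy_bits` is ours, not the chapter's:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy of a discrete distribution, in bits."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]  # 0 log 0 = 0 by convention
    return float(-np.sum(p * np.log2(p)))

weather = [0.40, 0.30, 0.20, 0.10]
H = entropy_bits(weather)            # part (a)
H_max = entropy_bits([0.25] * 4)     # part (b): uniform over 4 outcomes
surprise_snow = -np.log2(0.10)       # part (c): self-information of "snowy"
```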


Exercise 4.2 (*)

Consider a binary classifier that outputs $\hat{y} = 0.85$ for a positive example ($y = 1$).

(a) Compute the cross-entropy loss for this single example.

(b) Compute the cross-entropy loss if the model had output $\hat{y} = 0.99$.

(c) Compute the cross-entropy loss if the model had output $\hat{y} = 0.50$.

(d) Plot the cross-entropy loss as a function of $\hat{y} \in (0, 1)$ for a positive example ($y = 1$). Use Python and matplotlib. What happens as $\hat{y} \to 0$?
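A starter sketch for parts (a)-(c), in nats; the plot in part (d) is left to you (evaluate `bce(1, p)` on a fine grid of $p \in (0, 1)$ and pass it to matplotlib):

```python
import numpy as np

def bce(y, y_hat):
    """Binary cross-entropy for a single example, in nats."""
    return float(-(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)))

# Losses for a positive example (y = 1) at the three predicted probabilities.
losses = {p: bce(1, p) for p in (0.85, 0.99, 0.50)}
```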


Exercise 4.3 (*)

Compute the KL divergence $D_{\text{KL}}(p \| q)$ and $D_{\text{KL}}(q \| p)$ for:

  • $p = [0.5, 0.3, 0.2]$
  • $q = [0.1, 0.1, 0.8]$

Verify that $D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p)$. Which direction is larger, and why?
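A quick numerical check (the helper `kl_bits` is ours; it assumes both distributions are strictly positive, as they are here):

```python
import numpy as np

def kl_bits(p, q):
    """D_KL(p || q) in bits, for strictly positive discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * np.log2(p / q)))

p = [0.5, 0.3, 0.2]
q = [0.1, 0.1, 0.8]
d_pq = kl_bits(p, q)
d_qp = kl_bits(q, p)
```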


Exercise 4.4 (*)

Given the joint distribution:

        $Y=0$   $Y=1$
$X=0$   0.30    0.10
$X=1$   0.15    0.45

Compute:

(a) $H(X)$, $H(Y)$, $H(X, Y)$

(b) $H(X \mid Y)$, $H(Y \mid X)$

(c) $I(X; Y)$

(d) Verify the chain rule: $H(X, Y) = H(X) + H(Y \mid X)$

(e) Verify the MI identity: $I(X; Y) = H(X) + H(Y) - H(X, Y)$
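A sketch for checking all five parts at once. The conditional entropies are computed directly from the conditional distributions, so the chain-rule and MI identities serve as genuine cross-checks:

```python
import numpy as np

joint = np.array([[0.30, 0.10],
                  [0.15, 0.45]])   # rows: X = 0, 1; columns: Y = 0, 1

def H(p):
    """Entropy in bits of an array of probabilities."""
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

px = joint.sum(axis=1)             # marginal of X
py = joint.sum(axis=0)             # marginal of Y
Hx, Hy, Hxy = H(px), H(py), H(joint.ravel())

# Conditional entropies from the conditionals p(Y|X=x) and p(X|Y=y):
Hy_given_x = sum(px[i] * H(joint[i] / px[i]) for i in range(2))
Hx_given_y = sum(py[j] * H(joint[:, j] / py[j]) for j in range(2))
mi = Hx - Hx_given_y
```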


Exercise 4.5 (**)

The entropy of a Bernoulli distribution $p(X = 1) = \theta$ is $H(\theta) = -\theta \log \theta - (1 - \theta) \log(1 - \theta)$.

(a) Plot $H(\theta)$ for $\theta \in [0, 1]$. At what value of $\theta$ is entropy maximized?

(b) Compute $\frac{dH}{d\theta}$ and find the maximum analytically.

(c) In the context of binary classification, explain what it means for a model's predictive distribution to have high entropy vs. low entropy. When would you want each?


Cross-Entropy and Loss Functions

Exercise 4.6 (**)

A 5-class classifier produces the following softmax outputs for three test examples:

Example   True class   $\hat{y}_1$   $\hat{y}_2$   $\hat{y}_3$   $\hat{y}_4$   $\hat{y}_5$
A         3            0.05          0.05          0.80          0.05          0.05
B         1            0.35          0.25          0.20          0.10          0.10
C         5            0.10          0.10          0.10          0.10          0.60
```

(a) Compute the cross-entropy loss for each example.

(b) Compute the average cross-entropy loss across the three examples.

(c) For example B, the model assigns the highest probability to the correct class, but the loss is still relatively high. Explain why, using the concept of entropy.


Exercise 4.7 (**)

Prove that for a binary classification problem, minimizing the cross-entropy loss is equivalent to minimizing the KL divergence from the empirical distribution to the model distribution. Start with the average cross-entropy over the training set:

$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right]$$

Show that $\mathcal{L} = H(\hat{p}) + D_{\text{KL}}(\hat{p} \| q)$, where $\hat{p}$ is the empirical distribution and $q$ is the model's predictive distribution. Conclude that since $H(\hat{p})$ is constant with respect to the model parameters, minimizing $\mathcal{L}$ is equivalent to minimizing $D_{\text{KL}}(\hat{p} \| q)$.


Exercise 4.8 (***)

Why not MSE for classification? Consider a binary classifier with predicted probability $\hat{y}$ and true label $y \in \{0, 1\}$.

(a) Compute the gradient of the cross-entropy loss $\mathcal{L}_{\text{CE}} = -[y\log\hat{y} + (1-y)\log(1-\hat{y})]$ with respect to the logit $z$, where $\hat{y} = \sigma(z)$ and $\sigma$ is the sigmoid function.

(b) Compute the gradient of the MSE loss $\mathcal{L}_{\text{MSE}} = (y - \hat{y})^2$ with respect to $z$.

(c) Show that the cross-entropy gradient has a desirable property: its magnitude is large when the model is confidently wrong and small when the model is correct. Show that the MSE gradient can have the opposite behavior (small gradient when the model is confidently wrong). Relate this to the "vanishing gradient" problem for sigmoid activations.

(d) Implement both losses in PyTorch and train a simple logistic regression model on a binary classification dataset. Plot the training loss curves. Which converges faster?
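A NumPy sketch for parts (a)-(c): it states the analytic gradients and verifies them by central finite differences at a "confidently wrong" point (the PyTorch experiment in part (d) is left to you):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ce_loss(z, y):
    p = sigmoid(z)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def mse_loss(z, y):
    return (y - sigmoid(z)) ** 2

def ce_grad(z, y):
    """Part (a): dCE/dz = sigmoid(z) - y."""
    return sigmoid(z) - y

def mse_grad(z, y):
    """Part (b): dMSE/dz = -2 (y - p) p (1 - p), with p = sigmoid(z)."""
    p = sigmoid(z)
    return -2 * (y - p) * p * (1 - p)

# Confidently wrong: true label y = 1 but the logit is very negative.
z, y, eps = -6.0, 1.0, 1e-6
num_ce = (ce_loss(z + eps, y) - ce_loss(z - eps, y)) / (2 * eps)
num_mse = (mse_loss(z + eps, y) - mse_loss(z - eps, y)) / (2 * eps)
```

At this point the CE gradient stays near -1 while the MSE gradient is crushed by the $p(1-p)$ factor, which is the vanishing-gradient behavior part (c) asks about.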


Mutual Information

Exercise 4.9 (*)

Compute the mutual information $I(X; Y)$ for the following joint distribution:

        $Y=0$   $Y=1$   $Y=2$
$X=0$   1/6     1/6     1/6
$X=1$   1/6     1/6     1/6

What is special about this distribution? What does the MI value tell you about the relationship between $X$ and $Y$?


Exercise 4.10 (**)

Generate 10,000 samples from each of the following relationships:

(a) $Y = 2X + 3 + \epsilon$, where $X \sim \mathcal{N}(0, 1)$, $\epsilon \sim \mathcal{N}(0, 0.5)$ (linear)

(b) $Y = X^2 + \epsilon$ (quadratic, same distributions)

(c) $Y = \sin(3X) + \epsilon$ (sinusoidal, same distributions)

(d) $X, Y$ independent standard normals

For each, compute both the Pearson correlation and the mutual information (using sklearn.feature_selection.mutual_info_regression). Present the results in a table and explain why correlation and MI agree or disagree in each case.
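A data-generation sketch with the Pearson half of the comparison (we read $\mathcal{N}(0, 0.5)$ as standard deviation 0.5; the seed is arbitrary). The MI column is the exercise: feed each `(x, y)` pair to `sklearn.feature_selection.mutual_info_regression`:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(0, 1, n)
eps = rng.normal(0, 0.5, n)   # treating 0.5 as the noise std

cases = {
    "linear":      2 * x + 3 + eps,
    "quadratic":   x**2 + eps,
    "sinusoidal":  np.sin(3 * x) + eps,
    "independent": rng.normal(0, 1, n),
}
corrs = {name: float(np.corrcoef(x, y)[0, 1]) for name, y in cases.items()}
```

Already here you can see the pattern the exercise is after: only the linear case has a correlation anywhere near its (substantial) dependence.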


Exercise 4.11 (**)

Feature selection with MI. Load the UCI Adult Income dataset (available via sklearn.datasets.fetch_openml("adult", version=2)). Compute the mutual information between each feature and the income target (>50K or <=50K). Compare with the absolute Pearson correlation for continuous features and Cramér's V for categorical features. Identify at least one feature where the rankings disagree, and hypothesize why.


Exercise 4.12 (***)

MI estimation is hard. The mutual_info_classif function in scikit-learn uses a k-nearest-neighbor estimator (Kraskov et al., 2004).

(a) Implement a naive plug-in MI estimator for discrete variables: discretize continuous features into 10 equal-width bins, build the contingency table, and compute MI directly from the empirical joint distribution. Apply this to the data from Exercise 4.10.

(b) Compare your estimates with scikit-learn's kNN-based estimator. Which is more accurate? Which is more sensitive to the number of bins?

(c) Explain why MI estimation for continuous variables is fundamentally harder than for discrete variables. What role does the curse of dimensionality play?
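A sketch of the plug-in estimator from part (a), demonstrated on a quadratic and an independent pair in the style of Exercise 4.10 (seed and noise scale are our choices):

```python
import numpy as np

def plugin_mi(x, y, bins=10):
    """Plug-in MI estimate in nats from an equal-width 2D histogram."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = counts / counts.sum()
    px = pxy.sum(axis=1, keepdims=True)      # marginal of x, column vector
    py = pxy.sum(axis=0, keepdims=True)      # marginal of y, row vector
    mask = pxy > 0
    return float(np.sum(pxy[mask] * np.log(pxy[mask] / (px @ py)[mask])))

rng = np.random.default_rng(1)
n = 10_000
x = rng.normal(size=n)
y_dep = x**2 + rng.normal(scale=0.5, size=n)   # strong nonlinear dependence
y_ind = rng.normal(size=n)                      # independent of x
mi_dep = plugin_mi(x, y_dep)
mi_ind = plugin_mi(x, y_ind)
```

Note that `mi_ind` comes out slightly above zero: the plug-in estimator is positively biased, which is one of the pathologies part (b) asks you to probe as a function of the bin count.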


KL Divergence

Exercise 4.13 (**)

For two univariate Gaussians $p = \mathcal{N}(\mu_1, \sigma_1^2)$ and $q = \mathcal{N}(\mu_2, \sigma_2^2)$:

(a) Derive the closed-form expression for $D_{\text{KL}}(p \| q)$ starting from the definition:

$$D_{\text{KL}}(p \| q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$$

(b) Verify your formula numerically by computing the KL divergence via Monte Carlo estimation (sample from $p$, compute the sample mean of $\log \frac{p(x)}{q(x)}$) for $p = \mathcal{N}(1, 0.5^2)$ and $q = \mathcal{N}(0, 1)$. Show that the MC estimate converges to the closed form as the number of samples increases.
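A sketch of the numerical check in part (b); the closed form in `kl_gauss` is the answer to part (a), so treat it as a target to re-derive rather than a given. The shared $-\tfrac{1}{2}\log(2\pi)$ constant cancels in the log-ratio and is omitted:

```python
import numpy as np

def kl_gauss(mu1, s1, mu2, s2):
    """Closed-form D_KL(N(mu1, s1^2) || N(mu2, s2^2)) in nats."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

mu1, s1, mu2, s2 = 1.0, 0.5, 0.0, 1.0
exact = kl_gauss(mu1, s1, mu2, s2)

rng = np.random.default_rng(0)
x = rng.normal(mu1, s1, 200_000)                  # sample from p
log_p = -0.5 * ((x - mu1) / s1)**2 - np.log(s1)   # up to the shared constant
log_q = -0.5 * ((x - mu2) / s2)**2 - np.log(s2)
mc = float(np.mean(log_p - log_q))                # Monte Carlo estimate
```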


Exercise 4.14 (**)

Forward KL vs. Reverse KL. Consider a bimodal target distribution:

$$p(x) = 0.5 \cdot \mathcal{N}(-3, 0.5^2) + 0.5 \cdot \mathcal{N}(3, 0.5^2)$$

Approximate $p$ with a single Gaussian $q = \mathcal{N}(\mu, \sigma^2)$.

(a) Numerically find the $q$ that minimizes $D_{\text{KL}}(p \| q)$ (forward KL). Use grid search or gradient descent over $(\mu, \sigma)$.

(b) Numerically find the $q$ that minimizes $D_{\text{KL}}(q \| p)$ (reverse KL).

(c) Plot both $p$ and the two optimal $q$ distributions. Explain the "mean-seeking" vs. "mode-seeking" behavior.

(d) Why is this distinction important for variational inference?
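A coarse grid-search sketch for parts (a) and (b): both KL directions are evaluated by Riemann sums on a fixed grid, with density clipping to avoid log-of-zero. Grid resolution and search ranges are our choices; the plotting in part (c) is left to you:

```python
import numpy as np

x = np.linspace(-10, 10, 2001)
dx = x[1] - x[0]

def gauss(x, mu, s):
    return np.exp(-0.5 * ((x - mu) / s)**2) / (s * np.sqrt(2 * np.pi))

p = 0.5 * gauss(x, -3, 0.5) + 0.5 * gauss(x, 3, 0.5)   # bimodal target

def kl_grid(a, b):
    """Numerical D_KL(a || b) on the grid (densities clipped away from 0)."""
    a = np.maximum(a, 1e-300)
    b = np.maximum(b, 1e-300)
    return float(np.sum(a * np.log(a / b)) * dx)

best_f, best_r = None, None
for mu in np.linspace(-4, 4, 81):
    for s in np.linspace(0.3, 4.0, 38):
        q = gauss(x, mu, s)
        f = kl_grid(p, q)   # forward KL: mean-seeking
        r = kl_grid(q, p)   # reverse KL: mode-seeking
        if best_f is None or f < best_f[0]:
            best_f = (f, mu, s)
        if best_r is None or r < best_r[0]:
            best_r = (r, mu, s)

f_val, mu_f, s_f = best_f
r_val, mu_r, s_r = best_r
```

The forward-KL winner sits between the modes with a large variance; the reverse-KL winner locks onto a single mode (either one, by symmetry).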


Exercise 4.15 (***)

KL divergence and model calibration. A calibrated classifier satisfies $P(Y = 1 \mid \hat{y} = p) = p$ — the predicted probability matches the true frequency.

(a) Show that a perfectly calibrated model minimizes the cross-entropy loss among all models with the same discrimination ability (ranking of examples).

(b) Consider a binary classifier that outputs probabilities in 10 bins: $\hat{y} \in \{0.05, 0.15, \ldots, 0.95\}$. Given the following calibration table, compute the KL divergence between the true conditional distribution $p(Y \mid \hat{y} = b)$ and the predicted distribution $q(Y \mid \hat{y} = b) = \text{Bernoulli}(b)$ for each bin:

Bin   Predicted $P(Y=1)$   True fraction $Y=1$   $n$
1     0.05                 0.12                  200
2     0.15                 0.18                  300
3     0.25                 0.23                  350
4     0.35                 0.34                  400
5     0.45                 0.47                  450
6     0.55                 0.53                  400
7     0.65                 0.68                  350
8     0.75                 0.74                  300
9     0.85                 0.79                  200
10    0.95                 0.91                  100

(c) Which bins have the largest calibration error? How does this relate to the KL divergence values?
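A sketch for part (b): per-bin KL between two Bernoulli distributions, computed in nats (the helper `bern_kl` is ours):

```python
import numpy as np

pred = np.array([0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, 0.95])
true = np.array([0.12, 0.18, 0.23, 0.34, 0.47, 0.53, 0.68, 0.74, 0.79, 0.91])

def bern_kl(p, q):
    """D_KL(Bernoulli(p) || Bernoulli(q)) in nats, elementwise."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

kl_per_bin = bern_kl(true, pred)      # true conditional vs. predicted, per bin
worst_bin = int(np.argmax(kl_per_bin)) + 1
```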


Information Bottleneck and Data Processing Inequality

Exercise 4.16 (**)

Data processing inequality in practice. A recommender system has the pipeline:

$$\text{User history } X \to \text{Embedding } h = f(X) \to \text{Click prediction } \hat{Y} = g(h)$$

(a) State the data processing inequality for this pipeline.

(b) A junior engineer proposes adding a PCA step to reduce the embedding dimensionality: $X \to h \to h_{\text{PCA}} \to \hat{Y}$. Using the data processing inequality, explain why this can only hurt (or at best maintain) prediction accuracy. Under what conditions would the accuracy loss be negligible?

(c) The same engineer argues: "But PCA sometimes improves performance!" Is this consistent with the data processing inequality? Explain. (Hint: consider estimation error vs. information.)


Exercise 4.17 (***)

Information bottleneck visualization. Using a two-class synthetic dataset with 2D input features:

(a) Train a neural network with architecture: Input(2) → Dense(128) → Dense(64) → Dense(32) → Dense(2) → Softmax(2).

(b) After training, estimate $I(X; T_l)$ and $I(T_l; Y)$ for each hidden layer $T_l$, using binning-based MI estimation (discretize the activations into 30 bins per dimension, then compute MI from the empirical joint distribution).

(c) Create an "information plane" plot: $I(X; T_l)$ on the x-axis, $I(T_l; Y)$ on the y-axis, one point per layer. Does your network exhibit compression (movement from right to left on the x-axis as you go deeper)?

(d) Repeat with ReLU vs. sigmoid activations. Do you observe different behavior? Relate to the findings of Shwartz-Ziv and Tishby (2017) and the critique by Saxe et al. (2018).


ELBO and Variational Inference

Exercise 4.18 (**)

ELBO decomposition. Starting from the definition:

$$\log p(x) = \text{ELBO}(q) + D_{\text{KL}}(q(\theta) \| p(\theta \mid x))$$

(a) Derive this identity by starting with $D_{\text{KL}}(q(\theta) \| p(\theta \mid x))$ and expanding.

(b) Show that the ELBO can be written as:

$$\text{ELBO}(q) = \mathbb{E}_{q(\theta)}[\log p(x \mid \theta)] - D_{\text{KL}}(q(\theta) \| p(\theta))$$

(c) Interpret each term: why is the first term a "data fit" term and the second a "regularization" term?

(d) What happens to the ELBO when $q(\theta) = p(\theta \mid x)$ exactly? What is the gap?
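One way to start part (a), using $p(\theta \mid x) = p(x, \theta) / p(x)$ and the definition $\text{ELBO}(q) = \mathbb{E}_{q(\theta)}[\log p(x, \theta) - \log q(\theta)]$:

```latex
\begin{align*}
D_{\text{KL}}(q(\theta) \,\|\, p(\theta \mid x))
  &= \mathbb{E}_{q(\theta)}\!\left[\log \frac{q(\theta)}{p(\theta \mid x)}\right] \\
  &= \mathbb{E}_{q(\theta)}\!\left[\log q(\theta) - \log p(x, \theta)\right] + \log p(x) \\
  &= -\text{ELBO}(q) + \log p(x).
\end{align*}
```

Rearranging gives the identity; note that $\log p(x)$ comes out of the expectation because it does not depend on $\theta$.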


Exercise 4.19 (***)

Implementing variational inference from scratch. Consider the model:

$$\theta \sim \text{Beta}(2, 2) \quad \text{(prior)}$$

$$X_i \mid \theta \sim \text{Bernoulli}(\theta) \quad \text{(likelihood)}$$

Given observations $x_1, \ldots, x_n$ with $\sum x_i = k$ and $n = 50$, $k = 35$:

(a) Compute the exact posterior analytically (this is a conjugate model).

(b) Approximate the posterior using variational inference with $q(\theta) = \text{Beta}(\alpha, \beta)$. Derive the ELBO as a function of $\alpha, \beta$.

(c) Implement gradient ascent on the ELBO to find the optimal $(\alpha^*, \beta^*)$.

(d) Compare the variational posterior with the exact posterior. Plot both. How close are they?
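A check for part (a): the conjugate update gives the exact posterior, and a grid comparison confirms that prior $\times$ likelihood is proportional to it (the grid and helper `log_beta_pdf` are our scaffolding; parts (b)-(d) remain the exercise):

```python
import numpy as np
from math import lgamma

a0, b0, n, k = 2, 2, 50, 35
a_post, b_post = a0 + k, b0 + (n - k)   # conjugate posterior parameters
post_mean = a_post / (a_post + b_post)

def log_beta_pdf(t, a, b):
    """Log density of Beta(a, b) at t (t may be a NumPy array)."""
    return ((a - 1) * np.log(t) + (b - 1) * np.log(1 - t)
            + lgamma(a + b) - lgamma(a) - lgamma(b))

theta = np.linspace(0.01, 0.99, 99)
log_unnorm = (log_beta_pdf(theta, a0, b0)
              + k * np.log(theta) + (n - k) * np.log(1 - theta))
log_post = log_beta_pdf(theta, a_post, b_post)
gap = log_unnorm - log_post   # constant in theta iff the update is correct
```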


Maximum Entropy and Sufficient Statistics

Exercise 4.20 (**)

Maximum entropy derivation. Use the method of Lagrange multipliers to show that the maximum entropy distribution over $\{1, 2, \ldots, K\}$ with no constraints (other than normalization) is the uniform distribution.
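One way to set up the Lagrangian (verifying that the stationary point is a maximum, and normalizing, is left to the exercise):

```latex
\mathcal{L}(p, \lambda)
  = -\sum_{k=1}^{K} p_k \log p_k + \lambda\left(\sum_{k=1}^{K} p_k - 1\right),
\qquad
\frac{\partial \mathcal{L}}{\partial p_k} = -\log p_k - 1 + \lambda = 0
\;\Rightarrow\; p_k = e^{\lambda - 1}.
```

Since $e^{\lambda - 1}$ does not depend on $k$, every $p_k$ must be equal, and the normalization constraint pins down the common value.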


Exercise 4.21 (***)

Maximum entropy with moment constraints. Show that the maximum entropy distribution on $(-\infty, \infty)$ with constraints $\mathbb{E}[X] = \mu$ and $\mathbb{E}[X^2] = \mu^2 + \sigma^2$ is the Gaussian $\mathcal{N}(\mu, \sigma^2)$.

Hint: Use the method of Lagrange multipliers with the entropy functional $h(f) = -\int f(x) \log f(x) \, dx$ and constraints $\int f(x) dx = 1$, $\int x f(x) dx = \mu$, $\int x^2 f(x) dx = \mu^2 + \sigma^2$.


Exercise 4.22 (**)

Sufficient statistics and MI. For a Poisson distribution $X \sim \text{Poisson}(\lambda)$:

(a) What is the sufficient statistic for $\lambda$ given $n$ observations $x_1, \ldots, x_n$?

(b) Explain, in information-theoretic terms, why the sample mean $\bar{x}$ captures all the information about $\lambda$ that the full dataset contains.

(c) If you observe $x_1, x_2, \ldots, x_{100}$ and compute $T = \sum x_i$, then apply a nonlinear transformation $T' = \log(1 + T)$, is $T'$ still sufficient for $\lambda$? Why or why not? Use the data processing inequality to argue your answer.


Applied and Integration Exercises

Exercise 4.23 (**)

Entropy of neural network predictions across training. Train a simple MLP on the MNIST dataset (or any multiclass dataset). At each epoch:

(a) Compute the average entropy of the predicted probability distributions on the test set.

(b) Plot the average entropy vs. epoch alongside the test accuracy vs. epoch.

(c) What happens to the predictive entropy as the model trains? Is there a relationship between entropy and accuracy? When might high predictive entropy be desirable?


Exercise 4.24 (***)

Entropy-based active learning. Implement an active learning loop for binary classification:

(a) Start with 20 labeled examples from a synthetic dataset (e.g., make_moons from scikit-learn with 1000 total points).

(b) Train a logistic regression model.

(c) Compute the predictive entropy for all unlabeled examples.

(d) Select the 10 unlabeled examples with highest predictive entropy and add them to the training set (with their true labels).

(e) Repeat steps (b)-(d) for 15 iterations.

(f) Compare with random selection of 10 examples per iteration. Plot test accuracy vs. number of labeled examples for both strategies. Does entropy-based selection outperform random?
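A dependency-free sketch of the loop in parts (a)-(e), using two Gaussian blobs as a stand-in for `make_moons` and a plain-NumPy logistic regression (swap in scikit-learn's dataset and model for the actual exercise; seeds, blob placement, and learning rate are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian blobs, shuffled, then split into a pool and a test set.
n_per = 500
X = np.vstack([rng.normal([-2, 0], 1.0, (n_per, 2)),
               rng.normal([2, 0], 1.0, (n_per, 2))])
y = np.r_[np.zeros(n_per), np.ones(n_per)]
perm = rng.permutation(2 * n_per)
X, y = X[perm], y[perm]
X_pool, y_pool, X_test, y_test = X[:800], y[:800], X[800:], y[800:]

def fit_logreg(Xt, yt, iters=300, lr=0.5):
    """Logistic regression (with bias) via batch gradient descent."""
    Xb = np.c_[Xt, np.ones(len(Xt))]
    w = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1 / (1 + np.exp(-Xb @ w))
        w -= lr * Xb.T @ (p - yt) / len(yt)
    return w

def predict_proba(w, Xt):
    return 1 / (1 + np.exp(-np.c_[Xt, np.ones(len(Xt))] @ w))

labeled = list(range(20))                 # initial labeled set
for _ in range(5):                        # acquisition rounds (15 in the exercise)
    w = fit_logreg(X_pool[labeled], y_pool[labeled])
    p = predict_proba(w, X_pool)
    ent = -(p * np.log(p + 1e-12) + (1 - p) * np.log(1 - p + 1e-12))
    ent[labeled] = -np.inf                # never re-select labeled points
    labeled += list(np.argsort(ent)[-10:])

w = fit_logreg(X_pool[labeled], y_pool[labeled])
acc = float(np.mean((predict_proba(w, X_test) > 0.5) == y_test))
```

For part (f), run the same loop with `rng.choice` in place of the entropy-based `argsort` selection and compare the accuracy curves.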


Exercise 4.25 (***)

KL divergence for distribution drift monitoring. In production ML, you need to detect when the input distribution shifts.

(a) Generate 10,000 training samples from $\mathcal{N}(0, 1)$.

(b) Generate 5 "production batches" of 1,000 samples each, with increasing distribution shift: $\mathcal{N}(0, 1)$, $\mathcal{N}(0.2, 1)$, $\mathcal{N}(0.5, 1.2)$, $\mathcal{N}(1.0, 1.5)$, $\mathcal{N}(2.0, 2.0)$.

(c) For each batch, estimate the KL divergence from the training distribution to the batch distribution using histogram-based density estimation.

(d) Plot the KL divergence for each batch. At what threshold would you trigger a retraining alert? Justify your choice.
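A sketch of parts (a)-(c) using shared histogram bins and add-one smoothing so that empty bins do not produce infinite KL (we read the second parameter of each $\mathcal{N}(\cdot,\cdot)$ as the standard deviation; bin range and smoothing strength are our choices):

```python
import numpy as np

rng = np.random.default_rng(0)
train = rng.normal(0, 1, 10_000)
batches = [rng.normal(m, s, 1_000) for m, s in
           [(0, 1), (0.2, 1), (0.5, 1.2), (1.0, 1.5), (2.0, 2.0)]]

bins = np.linspace(-6, 10, 41)   # shared bins covering train and shifted batches

def hist_kl(a, b, bins, alpha=1.0):
    """Histogram D_KL(a || b) in nats, with add-alpha smoothing per bin."""
    pa = np.histogram(a, bins)[0] + alpha
    pb = np.histogram(b, bins)[0] + alpha
    pa, pb = pa / pa.sum(), pb / pb.sum()
    return float(np.sum(pa * np.log(pa / pb)))

kl = [hist_kl(train, b, bins) for b in batches]   # one value per batch
```

The smoothing constant `alpha` trades bias for robustness; part (d)'s threshold choice should account for the nonzero KL you get even on the unshifted first batch.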


Exercise 4.26 (**)

Climate model disagreement analysis. Extend the climate ensemble uncertainty decomposition from Section 4.14.

(a) Instead of 7 temperature bins, use 20 bins spanning -2 to +8 degrees C. Generate 10 CMIP6 model distributions as Gaussians with different means (ranging from 1.5 to 4.5) and standard deviations (ranging from 0.5 to 2.0), discretized into the 20 bins.

(b) Compute the total, aleatoric, and epistemic uncertainty.

(c) Compute the KL divergence from each individual model to the ensemble mean. Which models are the "outliers"?

(d) A policymaker asks: "What is the probability of more than 4 degrees of warming?" Compute this probability for (i) each individual model, (ii) the ensemble mean. How does model disagreement affect the answer?


Exercise 4.27 (***)

MI for representation evaluation. A common question in representation learning is: does the learned representation capture task-relevant information?

(a) Train an autoencoder on the MNIST dataset with a bottleneck of dimension $d = 2$.

(b) Estimate $I(\text{latent representation}; \text{digit label})$ using binning-based MI estimation.

(c) Repeat with $d \in \{2, 5, 10, 20, 50\}$. Plot MI vs. bottleneck dimension.

(d) At what bottleneck dimension does the representation capture "enough" information about the label? How does this relate to the information bottleneck principle?


Exercise 4.28 (****)

Jensen-Shannon divergence and f-divergences. The Jensen-Shannon divergence (JSD) is a symmetrized, bounded version of KL divergence:

$$\text{JSD}(p \| q) = \frac{1}{2}D_{\text{KL}}\left(p \,\middle\|\, \frac{p+q}{2}\right) + \frac{1}{2}D_{\text{KL}}\left(q \,\middle\|\, \frac{p+q}{2}\right)$$

(a) Prove that $0 \leq \text{JSD}(p \| q) \leq \log 2$ (in nats).

(b) Prove that $\text{JSD}(p \| q) = \text{JSD}(q \| p)$ (symmetry).

(c) The original GAN paper (Goodfellow et al., 2014) showed that the GAN training objective is equivalent to minimizing the JSD between the real data distribution and the generator's distribution. Explain why the JSD's boundedness is both an advantage (stable optimization) and a limitation (vanishing gradients when $p$ and $q$ have disjoint support). How did Wasserstein GANs (Arjovsky et al., 2017) address this?


Exercise 4.29 (****)

Entropy estimation for high-dimensional data. Estimating the entropy of a continuous random variable in high dimensions is notoriously difficult.

(a) Implement three entropy estimators: (i) histogram-based (discretize each dimension into $k$ bins), (ii) kNN-based (Kozachenko-Leonenko estimator), (iii) kernel density estimation (KDE)-based.

(b) Test all three on a known distribution: $X \sim \mathcal{N}(0, I_d)$ in $d$ dimensions, where the true differential entropy is $h(X) = \frac{d}{2}\log(2\pi e)$.

(c) Measure the estimation error as a function of: (i) sample size $n \in \{100, 1000, 10000\}$ and (ii) dimensionality $d \in \{1, 2, 5, 10, 20\}$.

(d) Which estimator degrades most gracefully with increasing dimensionality? Why?


Exercise 4.30 (****)

The information bottleneck for feature selection. Implement the information bottleneck method (Tishby et al., 1999) for feature selection on a tabular dataset.

(a) Given input features $X$ and target $Y$, the IB objective finds a compressed representation $T$ of $X$ that maximizes $I(T; Y) - \beta^{-1} I(T; X)$.

(b) Implement the iterative Blahut-Arimoto-style algorithm for the discrete case (discretize all features).

(c) Apply to the StreamRec data from Section 4.15. For different values of $\beta$, plot the information plane: $I(T; X)$ vs. $I(T; Y)$. This traces out the "information curve" — the optimal tradeoff between compression and prediction.

(d) Compare the features selected by the IB method (at a specific $\beta$) with those selected by greedy MI maximization (select the feature with highest MI with $Y$, then the feature with highest conditional MI given the first, etc.). Do they agree?