Chapter 4 Exercises
Section 4.1: Foundations of Probability Theory
Exercise 4.1 [F]
A bag contains 5 red marbles, 3 blue marbles, and 2 green marbles. You draw one marble at random.
(a) What is the probability of drawing a red marble?
(b) What is the probability of drawing a marble that is not green?
(c) Verify that the probabilities of all three colors sum to 1.
Exercise 4.2 [F]
Two fair six-sided dice are rolled. Let $A$ be the event "the sum is 7" and $B$ be the event "at least one die shows a 3."
(a) Compute $P(A)$ and $P(B)$.
(b) Compute $P(A \cap B)$.
(c) Are $A$ and $B$ independent? Justify your answer.
Exercise 4.3 [I]
In a dataset of 10,000 emails, 2,000 are spam. Among spam emails, 80% contain the word "free." Among non-spam emails, 10% contain the word "free."
(a) Using Bayes' theorem, compute the probability that an email is spam given that it contains the word "free."
(b) Now suppose we also know that 60% of spam emails contain the word "winner" and only 2% of non-spam emails do. Assuming the words "free" and "winner" are conditionally independent given the spam label, compute the probability that an email is spam given that it contains both words.
(c) Discuss why the conditional independence assumption in part (b) might not hold in practice.
Exercise 4.4 [I]
Prove from the axioms of probability that for any two events $A$ and $B$:
$$ P(A \cup B) = P(A) + P(B) - P(A \cap B) $$
Exercise 4.5 [A]
Three classifiers independently label a data point. Classifier 1 has accuracy 0.8, Classifier 2 has accuracy 0.85, and Classifier 3 has accuracy 0.75. You use majority voting: the final label is the one chosen by at least 2 of 3 classifiers.
(a) What is the probability that the majority vote is correct?
(b) Compare this to the accuracy of each individual classifier. What does this suggest about ensemble methods?
(c) Under what condition on individual classifier accuracy does majority voting decrease performance?
Exercise 4.6 [F]
State Bayes' theorem. Identify the prior, likelihood, posterior, and evidence terms. Give a one-sentence description of each in the context of classifying a patient as healthy or sick based on a blood test result.
Exercise 4.7 [I]
A drug test has a 99% true positive rate and a 99% true negative rate. In a population where 0.5% of people actually use the drug, a person tests positive. What is the probability they actually use the drug? Comment on the result.
Section 4.2: Random Variables and Distributions
Exercise 4.8 [F]
A random variable $X$ has the following PMF:
| $x$ | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| $P(X=x)$ | 0.1 | 0.3 | 0.4 | 0.2 |
(a) Verify that this is a valid PMF.
(b) Compute $\mathbb{E}[X]$, $\mathbb{E}[X^2]$, and $\text{Var}(X)$.
(c) Compute $\mathbb{E}[3X + 2]$.
Exercise 4.9 [F]
For a Bernoulli random variable with parameter $p$, derive the expressions for $\mathbb{E}[X]$ and $\text{Var}(X)$ from the definitions.
Exercise 4.10 [I]
The heights of a population follow a Gaussian distribution with $\mu = 170$ cm and $\sigma = 10$ cm.
(a) What fraction of the population has height between 160 and 180 cm? (Use the 68-95-99.7 rule.)
(b) Write a NumPy snippet to empirically verify your answer by generating 100,000 samples; a starter sketch is given after this exercise.
(c) What is the probability that a randomly selected person is taller than 190 cm?
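A starter sketch for part (b), using NumPy's Generator API (the seed is arbitrary); parts (a) and (c) can be checked from the same samples.

```python
import numpy as np

rng = np.random.default_rng(0)                    # seed chosen only for reproducibility
samples = rng.normal(loc=170, scale=10, size=100_000)

# Fraction within one standard deviation of the mean (part a: should be close to 0.68)
frac_within = np.mean((samples >= 160) & (samples <= 180))

# Fraction taller than 190 cm, i.e. more than two standard deviations above the mean (part c)
frac_tall = np.mean(samples > 190)

print(f"P(160 <= X <= 180) ~= {frac_within:.3f}")
print(f"P(X > 190)        ~= {frac_tall:.4f}")
```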
Exercise 4.11 [I]
Let $X \sim \mathcal{N}(\mu_X, \sigma_X^2)$ and $Y \sim \mathcal{N}(\mu_Y, \sigma_Y^2)$ be independent. Show that $Z = X + Y$ is also Gaussian and find its parameters. (Hint: use moment generating functions or the convolution of PDFs.)
Exercise 4.12 [A]
The softmax function maps a vector $\mathbf{z} \in \mathbb{R}^K$ to a probability distribution.
(a) Prove that the output of softmax is a valid probability distribution (all entries positive, sum to 1).
(b) Show that softmax is invariant to adding a constant: $\text{softmax}(\mathbf{z} + c\mathbf{1}) = \text{softmax}(\mathbf{z})$.
(c) Explain why the property in (b) is important for numerical stability, and connect it to the log-sum-exp trick discussed in Section 4.6.5.
Exercise 4.13 [F]
Match each distribution with its most appropriate AI application:
| Distribution | Application |
|---|---|
| 1. Bernoulli | A. Modeling word counts in a document |
| 2. Categorical | B. Time between server requests |
| 3. Gaussian | C. Binary spam classification label |
| 4. Poisson | D. Next-token prediction probabilities |
| 5. Exponential | E. Weight initialization in neural networks |
Exercise 4.14 [I]
Write a NumPy function that generates $n$ samples from a mixture of two Gaussians: with probability 0.3, sample from $\mathcal{N}(-2, 1)$; with probability 0.7, sample from $\mathcal{N}(3, 0.5)$. Plot the histogram and overlay the true PDF.
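One possible sketch for Exercise 4.14, assuming matplotlib and SciPy are available and reading $\mathcal{N}(3, 0.5)$ with the chapter's $\mathcal{N}(\mu, \sigma^2)$ convention, so the second component's standard deviation is $\sqrt{0.5}$.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

def sample_mixture(n, rng=None):
    """Draw n samples from the mixture 0.3*N(-2, 1) + 0.7*N(3, 0.5) (second arguments are variances)."""
    rng = np.random.default_rng() if rng is None else rng
    # Pick a component for each sample, then draw from the chosen Gaussian.
    from_first = rng.random(n) < 0.3
    return np.where(from_first,
                    rng.normal(-2.0, 1.0, size=n),            # std = sqrt(1)
                    rng.normal(3.0, np.sqrt(0.5), size=n))    # std = sqrt(0.5)

x = sample_mixture(10_000, rng=np.random.default_rng(0))
grid = np.linspace(-6, 6, 500)
true_pdf = 0.3 * norm.pdf(grid, -2, 1.0) + 0.7 * norm.pdf(grid, 3, np.sqrt(0.5))

plt.hist(x, bins=80, density=True, alpha=0.5, label="samples")
plt.plot(grid, true_pdf, label="true PDF")
plt.legend()
plt.show()
```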
Section 4.3: Statistical Estimation
Exercise 4.15 [I]
You observe the following 10 data points, assumed to be drawn i.i.d. from a Bernoulli distribution: $\{1, 0, 1, 1, 0, 1, 1, 1, 0, 1\}$.
(a) Derive the MLE for the Bernoulli parameter $p$.
(b) Compute $\hat{p}_{\text{MLE}}$ for this data.
(c) Now assume a Beta prior $p \sim \text{Beta}(\alpha=2, \beta=2)$. Derive and compute the MAP estimate $\hat{p}_{\text{MAP}}$.
(d) Compare the MLE and MAP estimates. Why is the MAP estimate pulled toward 0.5?
Exercise 4.16 [I]
Derive the MLE for the parameter $\lambda$ of a Poisson distribution given observations $x_1, \ldots, x_n$.
Exercise 4.17 [A]
Consider a linear regression model $y = \mathbf{w}^\top \mathbf{x} + \epsilon$ where $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
(a) Write down the likelihood function $p(y \mid \mathbf{x}, \mathbf{w}, \sigma^2)$.
(b) Show that the MLE for $\mathbf{w}$ is equivalent to minimizing the mean squared error loss.
(c) If we place a Gaussian prior $\mathbf{w} \sim \mathcal{N}(0, \tau^2 I)$, show that the MAP estimate is equivalent to L2-regularized (ridge) regression.
Exercise 4.18 [F]
Explain in your own words the difference between MLE and MAP estimation. When would you prefer one over the other?
Exercise 4.19 [I]
Write NumPy code that:
(a) Generates 50 samples from $\mathcal{N}(\mu=5, \sigma^2=4)$.
(b) Computes the MLE estimates $\hat{\mu}$ and $\hat{\sigma}^2$.
(c) Repeats this experiment 1,000 times and plots the distribution of $\hat{\mu}$ values. Comment on the shape of this distribution and relate it to the properties of the MLE estimator.
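A possible skeleton for Exercise 4.19, assuming matplotlib for the plot; note that NumPy's normal sampler takes the standard deviation (here 2, since $\sigma^2 = 4$).

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
n_repeats, n_samples = 1_000, 50

mu_hats = np.empty(n_repeats)
sigma2_hats = np.empty(n_repeats)
for i in range(n_repeats):
    x = rng.normal(loc=5.0, scale=2.0, size=n_samples)   # std dev 2, since sigma^2 = 4
    mu_hats[i] = x.mean()                                 # MLE for mu
    sigma2_hats[i] = np.mean((x - x.mean()) ** 2)         # MLE for sigma^2 (1/n normalization)

plt.hist(mu_hats, bins=40, density=True)
plt.xlabel(r"$\hat{\mu}$")
plt.show()
# The histogram should look approximately Gaussian around 5 with spread shrinking as sigma^2 / n,
# consistent with the asymptotic normality of the MLE.
```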
Exercise 4.20 [A]
Fisher Information and the Cramér-Rao Bound. The Fisher information for a parameter $\theta$ is:
$$ I(\theta) = -\mathbb{E}\left[\frac{\partial^2}{\partial \theta^2} \log p(X \mid \theta)\right] $$
(a) Compute the Fisher information for the Bernoulli parameter $p$.
(b) The Cramér-Rao lower bound states that the variance of any unbiased estimator satisfies $\text{Var}(\hat{\theta}) \geq 1/[nI(\theta)]$. For $n = 100$ and $p = 0.3$, what is the minimum possible variance for an unbiased estimator of $p$?
(c) Compare this to the actual variance of the MLE $\hat{p} = \bar{X}$.
Section 4.5: Information Theory
Exercise 4.21 [F]
Compute the entropy of the following distributions (in nats, using natural log):
(a) A fair coin: $P(H) = P(T) = 0.5$
(b) A biased coin: $P(H) = 0.9, P(T) = 0.1$
(c) A loaded die: $P(1) = 0.5, P(2) = P(3) = P(4) = P(5) = P(6) = 0.1$
(d) A fair die: $P(k) = 1/6$ for all $k$
Which distribution has the highest entropy? Why?
Exercise 4.22 [I]
Let $p = (0.25, 0.25, 0.25, 0.25)$ and $q = (0.1, 0.2, 0.3, 0.4)$.
(a) Compute $H(p)$ and $H(q)$.
(b) Compute $H(p, q)$ (cross-entropy of $p$ with respect to $q$).
(c) Compute $D_{\text{KL}}(p \| q)$ and $D_{\text{KL}}(q \| p)$.
(d) Verify that $D_{\text{KL}}(p \| q) = H(p, q) - H(p)$.
(e) Are the two KL divergences equal? Why or why not?
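Hand computations for Exercise 4.22 can be checked with a minimal sketch like the following (entropies in nats, matching the convention of Exercise 4.21):

```python
import numpy as np

p = np.array([0.25, 0.25, 0.25, 0.25])
q = np.array([0.1, 0.2, 0.3, 0.4])

H_p = -np.sum(p * np.log(p))            # entropy H(p)
H_pq = -np.sum(p * np.log(q))           # cross-entropy H(p, q)
kl_pq = np.sum(p * np.log(p / q))       # D_KL(p || q)
kl_qp = np.sum(q * np.log(q / p))       # D_KL(q || p)

print(H_p, H_pq, kl_pq, kl_qp)
print(np.isclose(kl_pq, H_pq - H_p))    # checks the identity in part (d)
```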
Exercise 4.23 [I]
You have a true distribution $p = (0.7, 0.2, 0.1)$ over three classes and three candidate models:
- Model A: $q_A = (0.6, 0.3, 0.1)$
- Model B: $q_B = (0.33, 0.33, 0.34)$
- Model C: $q_C = (0.8, 0.1, 0.1)$
(a) Compute the cross-entropy $H(p, q)$ for each model.
(b) Compute the KL divergence $D_{\text{KL}}(p \| q)$ for each model.
(c) Which model is closest to the true distribution? Do cross-entropy and KL divergence agree on the ranking?
Exercise 4.24 [F]
Explain why minimizing cross-entropy loss during neural network training is equivalent to minimizing the KL divergence between the true data distribution and the model's predicted distribution.
Exercise 4.25 [I]
Consider two binary random variables $X$ and $Y$ with the joint distribution:
| | $Y=0$ | $Y=1$ |
|---|---|---|
| $X=0$ | 0.3 | 0.1 |
| $X=1$ | 0.2 | 0.4 |
(a) Compute the marginal distributions $p(X)$ and $p(Y)$.
(b) Compute $H(X)$, $H(Y)$, and $H(X, Y)$.
(c) Compute $I(X; Y)$.
(d) Are $X$ and $Y$ independent? How does the mutual information confirm your answer?
Exercise 4.26 [A]
Prove that the entropy $H(X)$ is maximized when $X$ is uniformly distributed over its support. (Hint: Use the method of Lagrange multipliers from Chapter 3 or use the non-negativity of KL divergence.)
Exercise 4.27 [A]
The Information Bottleneck. Consider a Markov chain $X \to T \to Y$, where $X$ is the input, $T$ is a compressed representation, and $Y$ is the target.
(a) State the data processing inequality for this chain.
(b) The information bottleneck objective minimizes $I(X; T) - \beta I(T; Y)$ over $p(t \mid x)$. Explain intuitively what each term encourages when $\beta > 0$.
(c) How does this relate to representation learning in neural networks?
Exercise 4.28 [I]
Write a NumPy function that computes the mutual information between two discrete random variables given their joint probability table as a 2D array. Test it on the distribution from Exercise 4.25.
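A possible implementation sketch for Exercise 4.28; the function name mutual_information and the nats convention are choices, not requirements.

```python
import numpy as np

def mutual_information(joint):
    """Mutual information I(X; Y) in nats, given a 2D joint probability table."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)     # marginal p(x), as a column vector
    py = joint.sum(axis=0, keepdims=True)     # marginal p(y), as a row vector
    outer = px * py                           # product of marginals p(x)p(y)
    mask = joint > 0                          # treat 0 * log(0) as 0
    return np.sum(joint[mask] * np.log(joint[mask] / outer[mask]))

# Joint distribution from Exercise 4.25 (rows: X = 0, 1; columns: Y = 0, 1)
joint = np.array([[0.3, 0.1],
                  [0.2, 0.4]])
print(mutual_information(joint))   # strictly positive, so X and Y are dependent
```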
Applied Exercises
Exercise 4.29 [I]
Temperature scaling. Given logits $z = [2.0, 1.0, 0.5]$:
(a) Compute the softmax probabilities.
(b) Compute the entropy of the resulting distribution.
(c) Repeat for temperature values $T \in \{0.5, 1.0, 2.0, 5.0\}$. How does temperature affect entropy?
(d) Explain why low temperature is used for "greedy" generation and high temperature for "creative" generation in language models.
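A short sketch for Exercise 4.29, applying temperature by dividing the logits by $T$ before the softmax (entropy reported in nats):

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)                 # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    p = p[p > 0]
    return -np.sum(p * np.log(p))     # in nats

z = np.array([2.0, 1.0, 0.5])
for T in [0.5, 1.0, 2.0, 5.0]:
    p = softmax(z / T)                # temperature scaling: divide logits by T
    print(f"T={T}: probs={np.round(p, 3)}, entropy={entropy(p):.3f}")
```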
Exercise 4.30 [I]
Numerical stability. You have log-probabilities: $\log p_1 = -1000.5$, $\log p_2 = -1001.2$, $\log p_3 = -999.8$.
(a) Try computing $\log(p_1 + p_2 + p_3)$ naively (exponentiate, sum, take log). What happens?
(b) Use the log-sum-exp trick to compute the correct answer.
(c) Verify your answer using scipy.special.logsumexp.
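A sketch for Exercise 4.30; the naive route underflows because $e^{-1000}$ is far below the smallest positive double.

```python
import numpy as np
from scipy.special import logsumexp

logp = np.array([-1000.5, -1001.2, -999.8])

# (a) Naive: the exponentials underflow to 0, so the log becomes -inf (with a runtime warning).
naive = np.log(np.sum(np.exp(logp)))

# (b) Log-sum-exp trick: factor out the maximum before exponentiating.
m = np.max(logp)
stable = m + np.log(np.sum(np.exp(logp - m)))

# (c) Library check.
print(naive, stable, logsumexp(logp))
```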
Exercise 4.31 [A]
MLE for a neural network classifier. Consider a 3-class classification problem with a neural network that outputs logits $\mathbf{z} \in \mathbb{R}^3$, which are passed through softmax to obtain $\hat{\mathbf{y}}$.
(a) Write out the negative log-likelihood loss for a single training example with true class $c$.
(b) Compute $\frac{\partial \mathcal{L}}{\partial z_k}$ for all $k$. (Hint: this is $\hat{y}_k - \mathbb{1}[k=c]$.)
(c) Explain why this gradient is intuitive: what happens when the model is confident and correct vs. confident and wrong?
Exercise 4.32 [I]
Comparing distributions with KL divergence. Using NumPy, generate 1,000 samples from $\mathcal{N}(0, 1)$ and estimate the KL divergence to $\mathcal{N}(0.5, 1)$, $\mathcal{N}(0, 2)$, and $\mathcal{N}(1, 0.5)$. Use histogram-based density estimation. Compare your estimates to the analytical KL divergence between two Gaussians:
$$ D_{\text{KL}}(\mathcal{N}(\mu_1, \sigma_1^2) \| \mathcal{N}(\mu_2, \sigma_2^2)) = \log\frac{\sigma_2}{\sigma_1} + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2} - \frac{1}{2} $$
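One way to set up Exercise 4.32, assuming the chapter's $\mathcal{N}(\mu, \sigma^2)$ convention (second argument is a variance) and a simple histogram density estimate for the sampled distribution, with the reference density evaluated at bin centers; histogram-based KL estimates are noisy, so expect only rough agreement with the analytic values.

```python
import numpy as np
from scipy.stats import norm

def kl_hist_vs_gaussian(samples, mu_q, sigma_q, bins=50):
    """Histogram estimate of D_KL(p_hat || N(mu_q, sigma_q^2)); sigma_q is a standard deviation."""
    hist, edges = np.histogram(samples, bins=bins, density=True)
    widths = np.diff(edges)
    centers = 0.5 * (edges[:-1] + edges[1:])
    q = norm.pdf(centers, mu_q, sigma_q)
    mask = hist > 0
    # Approximate the integral of p * log(p / q) by a sum over bins.
    return np.sum(hist[mask] * np.log(hist[mask] / q[mask]) * widths[mask])

def kl_gauss(mu1, s1, mu2, s2):
    """Analytic KL between univariate Gaussians; s1, s2 are standard deviations."""
    return np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5

rng = np.random.default_rng(0)
x = rng.normal(0, 1, size=1_000)
for mu_q, var_q in [(0.5, 1.0), (0.0, 2.0), (1.0, 0.5)]:
    s_q = np.sqrt(var_q)
    print(kl_hist_vs_gaussian(x, mu_q, s_q), kl_gauss(0, 1, mu_q, s_q))
```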
Exercise 4.33 [F]
A language model assigns the following probabilities to the next word: {"the": 0.3, "a": 0.2, "an": 0.05, "cat": 0.15, "dog": 0.1, "is": 0.2}.
(a) Compute the entropy of this distribution (in bits, using $\log_2$).
(b) Compute the perplexity, defined as $2^{H(p)}$.
(c) A better model assigns: {"the": 0.6, "a": 0.15, "an": 0.02, "cat": 0.08, "dog": 0.05, "is": 0.1}. Compute its entropy and perplexity. Which model is "better" and why?
Exercise 4.34 [A]
Derive the ELBO. For a latent variable model with observed $x$ and latent $z$:
(a) Show that $\log p(x) = \mathbb{E}_{q(z|x)}[\log p(x, z) - \log q(z|x)] + D_{\text{KL}}(q(z|x) \| p(z|x))$.
(b) Argue that the first term is a lower bound on $\log p(x)$ (the Evidence Lower BOund, or ELBO).
(c) Show that the ELBO can be decomposed as $\mathbb{E}_{q(z|x)}[\log p(x|z)] - D_{\text{KL}}(q(z|x) \| p(z))$.
(d) Identify the "reconstruction" and "regularization" terms.
Exercise 4.35 [I]
Monte Carlo estimation. Use Monte Carlo sampling with $N = 10, 100, 1000, 10000$ samples to estimate:
(a) $\mathbb{E}[\sin(X)]$ where $X \sim \mathcal{N}(0, 1)$. (True answer: 0, by symmetry.)
(b) $P(X > 2)$ where $X \sim \mathcal{N}(0, 1)$. (True answer: $\approx 0.0228$.)
Plot the estimate vs. $N$ and comment on the convergence rate.
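A starting point for Exercise 4.35, assuming matplotlib; plotting the absolute error against $N$ on log-log axes makes the roughly $1/\sqrt{N}$ convergence visible.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
Ns = [10, 100, 1_000, 10_000]

est_sin, est_tail = [], []
for N in Ns:
    x = rng.normal(0, 1, size=N)
    est_sin.append(np.mean(np.sin(x)))      # Monte Carlo estimate of E[sin(X)]
    est_tail.append(np.mean(x > 2))         # Monte Carlo estimate of P(X > 2)

plt.loglog(Ns, np.abs(est_sin), "o-", label="|error| for E[sin X]")
plt.loglog(Ns, np.abs(np.array(est_tail) - 0.0228), "s-", label="|error| for P(X > 2)")
plt.xlabel("N")
plt.legend()
plt.show()
```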
Exercise 4.36 [I]
The Naive Bayes assumption. Consider classifying documents into "sports" or "politics" categories using words as features. Explain:
(a) What assumption does Naive Bayes make about the features?
(b) Give an example of two words for which this assumption clearly fails.
(c) Despite this assumption being wrong, Naive Bayes often works well. Suggest a reason why.
Exercise 4.37 [A]
Jensen's inequality and its consequences.
(a) State Jensen's inequality for convex functions.
(b) Use Jensen's inequality to prove that $D_{\text{KL}}(p \| q) \geq 0$ (Gibbs' inequality).
(c) Use Jensen's inequality to show that the arithmetic mean is at least as large as the geometric mean for positive numbers.
Exercise 4.38 [I]
Prior sensitivity analysis. For the Bernoulli MLE/MAP problem in Exercise 4.15:
(a) Compute the MAP estimate for priors $\text{Beta}(1, 1)$, $\text{Beta}(2, 2)$, $\text{Beta}(5, 5)$, and $\text{Beta}(10, 10)$.
(b) Plot the MAP estimate as a function of the prior strength $\alpha = \beta$ for $\alpha \in [0.5, 20]$.
(c) At what prior strength does the MAP estimate equal 0.6 (halfway between MLE and 0.5)?
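A possible sketch for Exercise 4.38, using the counts from Exercise 4.15 (7 ones in 10 draws) and the Beta-Bernoulli posterior mode $(k + \alpha - 1)/(n + \alpha + \beta - 2)$ as the MAP estimate:

```python
import numpy as np
import matplotlib.pyplot as plt

k, n = 7, 10                      # successes and trials from Exercise 4.15

def map_estimate(alpha, beta):
    # Posterior is Beta(alpha + k, beta + n - k); its mode is the MAP estimate.
    return (k + alpha - 1) / (n + alpha + beta - 2)

for a in [1, 2, 5, 10]:
    print(f"Beta({a},{a}): MAP = {map_estimate(a, a):.3f}")

alphas = np.linspace(0.5, 20, 200)
plt.plot(alphas, map_estimate(alphas, alphas))
plt.axhline(0.6, linestyle="--")          # target value for part (c)
plt.xlabel(r"prior strength $\alpha = \beta$")
plt.ylabel("MAP estimate")
plt.show()
```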
Exercise 4.39 [A]
Sufficient statistics. A statistic $T(\mathcal{D})$ is sufficient for a parameter $\theta$ if $p(\mathcal{D} \mid T(\mathcal{D}), \theta) = p(\mathcal{D} \mid T(\mathcal{D}))$ (the data provides no additional information about $\theta$ beyond $T$).
(a) Show that $\bar{X}$ is a sufficient statistic for $\mu$ in a Gaussian $\mathcal{N}(\mu, \sigma^2)$ with known $\sigma^2$.
(b) Show that $\sum_i x_i$ is a sufficient statistic for $p$ in a Bernoulli distribution.
(c) Explain why sufficient statistics are relevant to data compression in machine learning.
Exercise 4.40 [I]
End-to-end exercise. You are given a dataset of 100 measurements of a physical quantity. You believe the data comes from a Gaussian distribution.
(a) Generate the data: np.random.seed(42); data = np.random.normal(loc=3.5, scale=1.2, size=100).
(b) Compute MLE estimates for $\mu$ and $\sigma^2$.
(c) Compute the entropy of the estimated Gaussian distribution.
(d) Compute the cross-entropy between the estimated distribution and a reference $\mathcal{N}(3.0, 1.0)$.
(e) Compute the KL divergence from the estimated distribution to the reference distribution.
(f) Verify that cross-entropy = entropy + KL divergence.
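A skeleton for Exercise 4.40, assuming the closed forms for the differential entropy of a Gaussian and the Gaussian-to-Gaussian KL divergence (the cross-entropy then follows from the identity in part (f)):

```python
import numpy as np

# (a) Generate the data (np.random.normal takes `scale`, the standard deviation).
np.random.seed(42)
data = np.random.normal(loc=3.5, scale=1.2, size=100)

# (b) MLE estimates.
mu_hat = data.mean()
sigma2_hat = np.mean((data - mu_hat) ** 2)      # 1/n normalization

# (c) Differential entropy of N(mu_hat, sigma2_hat), in nats.
H_p = 0.5 * np.log(2 * np.pi * np.e * sigma2_hat)

# (e) KL divergence from the fitted Gaussian p to the reference q = N(3.0, 1.0).
mu_q, sigma2_q = 3.0, 1.0
kl_pq = (0.5 * np.log(sigma2_q / sigma2_hat)
         + (sigma2_hat + (mu_hat - mu_q) ** 2) / (2 * sigma2_q) - 0.5)

# (d) Cross-entropy H(p, q) for two Gaussians, and (f) the check H(p, q) = H(p) + KL(p || q).
H_pq = 0.5 * np.log(2 * np.pi * sigma2_q) + (sigma2_hat + (mu_hat - mu_q) ** 2) / (2 * sigma2_q)
print(mu_hat, sigma2_hat, H_p, H_pq, kl_pq)
print(np.isclose(H_pq, H_p + kl_pq))
```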