Chapter 3: Exercises

Probability Theory and Statistical Inference


Exercise 3.1: Conditional Probability and the Base Rate Fallacy

Difficulty: ★☆☆☆

A fraud detection model at StreamRec flags 95% of truly fraudulent accounts (sensitivity = 0.95) and incorrectly flags 2% of legitimate accounts (false positive rate = 0.02). The true fraud rate is 0.1%.

(a) Compute the probability that a flagged account is actually fraudulent using Bayes' theorem.

(b) Explain why the model's apparent accuracy (95% detection rate) is misleading without considering the base rate.

(c) What fraud rate would be necessary for a flagged account to have at least a 50% probability of being truly fraudulent, keeping the sensitivity and false positive rate fixed?
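
A quick numerical check for part (a), a sketch using only the rates stated above:

```python
sensitivity = 0.95   # P(flag | fraud)
fpr = 0.02           # P(flag | legitimate)
prior = 0.001        # P(fraud), the base rate

# Bayes' theorem: P(fraud | flagged)
posterior = sensitivity * prior / (sensitivity * prior + fpr * (1 - prior))
print(f"P(fraud | flagged) = {posterior:.4f}")
```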


Exercise 3.2: Joint and Marginal Distributions in User Behavior

Difficulty: ★☆☆☆

A StreamRec analysis reveals the following joint distribution over content type and engagement level:

              Low Engagement   Medium Engagement   High Engagement
Articles           0.15               0.10               0.05
Videos             0.08               0.12               0.15
Podcasts           0.10               0.12               0.13

(a) Compute the marginal distributions $P(\text{content type})$ and $P(\text{engagement level})$.

(b) Compute $P(\text{High Engagement} \mid \text{Videos})$ and $P(\text{Videos} \mid \text{High Engagement})$.

(c) Are content type and engagement level independent? Justify your answer quantitatively.
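
As a starting point for (a), the joint table can be encoded as a numpy array and marginalized along each axis (a sketch; the conditionals in (b) follow by dividing entries by these marginals):

```python
import numpy as np

# Joint distribution from the table above.
# Rows: Articles, Videos, Podcasts; columns: Low, Medium, High engagement.
joint = np.array([
    [0.15, 0.10, 0.05],
    [0.08, 0.12, 0.15],
    [0.10, 0.12, 0.13],
])
assert np.isclose(joint.sum(), 1.0)

p_content = joint.sum(axis=1)     # marginal over engagement levels
p_engagement = joint.sum(axis=0)  # marginal over content types
print("P(content type)     =", p_content)
print("P(engagement level) =", p_engagement)
```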


Exercise 3.3: Bayesian Updating with the Beta-Binomial Model

Difficulty: ★★☆☆

MediCore Pharmaceuticals is evaluating a new drug. Clinical experts believe the response rate is approximately 40% with moderate uncertainty, encoded as a $\text{Beta}(8, 12)$ prior.

(a) Verify that $\text{Beta}(8, 12)$ has mean 0.4. What is its standard deviation?

(b) In a pilot study, 18 out of 40 patients respond. Compute the posterior distribution.

(c) Plot the prior, likelihood (scaled), and posterior on the same axes.

(d) Compute the posterior mean. Show that it lies between the prior mean and the MLE.

(e) Compute a 95% credible interval for the response rate and compare it with the frequentist 95% Wald confidence interval.
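
For checking (b), (d), and (e), a sketch of the conjugate update plus a Monte Carlo credible interval (numpy's Beta sampler avoids the inverse CDF; the seed is arbitrary):

```python
import numpy as np

# Conjugate update: Beta(a, b) prior + Binomial data (k successes, n trials)
a0, b0 = 8, 12          # prior from the problem statement
k, n = 18, 40           # pilot study: 18 responders out of 40
a_post, b_post = a0 + k, b0 + (n - k)

prior_mean = a0 / (a0 + b0)
mle = k / n
post_mean = a_post / (a_post + b_post)
print(f"prior mean {prior_mean:.3f}, MLE {mle:.3f}, posterior mean {post_mean:.3f}")

# Monte Carlo 95% credible interval
rng = np.random.default_rng(0)
draws = rng.beta(a_post, b_post, size=200_000)
lo, hi = np.percentile(draws, [2.5, 97.5])
print(f"95% credible interval: [{lo:.3f}, {hi:.3f}]")
```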


Exercise 3.4: Exponential Family Identification

Difficulty: ★★☆☆

For each of the following distributions, write it in exponential family form $p(x \mid \eta) = h(x) \exp(\eta^\top T(x) - A(\eta))$ by identifying the natural parameter $\eta$, sufficient statistic $T(x)$, log-partition function $A(\eta)$, and base measure $h(x)$.

(a) Geometric distribution: $P(X = k) = (1 - p)^{k-1} p$, for $k = 1, 2, \ldots$

(b) Gaussian with known mean $\mu$ and unknown variance $\sigma^2$: $f(x \mid \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$

(c) Pareto distribution: $f(x \mid \alpha) = \alpha x_m^\alpha x^{-(\alpha + 1)}$ for $x \geq x_m$, where $x_m$ is known.


Exercise 3.5: MLE for the Poisson Distribution

Difficulty: ★★☆☆

StreamRec models the number of items a user views per session as a Poisson random variable $X \sim \text{Poisson}(\lambda)$.

(a) Write the log-likelihood for $n$ independent observations $x_1, \ldots, x_n$.

(b) Derive the MLE $\hat{\lambda}_{\text{MLE}}$ by differentiating and setting to zero.

(c) Verify that the second derivative confirms a maximum.

(d) For a dataset with $n = 500$ sessions and $\sum x_i = 3{,}750$, compute $\hat{\lambda}_{\text{MLE}}$.

(e) Implement in Python: simulate 500 Poisson samples with $\lambda = 7.5$, compute the MLE, and verify it matches $\bar{x}$.
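
A minimal sketch for part (e) (seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)
lam_true = 7.5
x = rng.poisson(lam_true, size=500)

lam_mle = x.mean()   # the Poisson MLE is the sample mean
print(f"MLE: {lam_mle:.3f}  (true lambda = {lam_true})")
```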


Exercise 3.6: MLE for the Exponential Distribution

Difficulty: ★★☆☆

The time between consecutive API calls to the StreamRec recommendation service is modeled as $X \sim \text{Exponential}(\lambda)$.

(a) Derive the MLE for $\lambda$.

(b) The Fisher information for the exponential is $I(\lambda) = 1/\lambda^2$. Derive the Cramér-Rao lower bound for the variance of $\hat{\lambda}$.

(c) Show that the MLE achieves the Cramér-Rao bound asymptotically.

(d) If the mean inter-arrival time is 2.5 seconds ($\hat{\lambda} = 0.4$ from $n = 1{,}000$ observations), compute the 95% confidence interval using the asymptotic normality of the MLE.
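
A check for part (d), plugging in the standard error $\hat{\lambda}/\sqrt{n}$ implied by the Fisher information from (b):

```python
import numpy as np

lam_hat, n = 0.4, 1000
# Asymptotic normality of the MLE: Var(lam_hat) ≈ 1/(n * I(lam)) = lam^2 / n
se = lam_hat / np.sqrt(n)
lo, hi = lam_hat - 1.96 * se, lam_hat + 1.96 * se
print(f"95% CI for lambda: [{lo:.4f}, {hi:.4f}]")
```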


Exercise 3.7: Cross-Entropy as Negative Log-Likelihood (Multi-Class)

Difficulty: ★★☆☆

A language model predicts the next token from a vocabulary of $K = 50{,}000$ words using a softmax output layer.

(a) Write the categorical log-likelihood for a single observation $(x, y)$ where $y$ is the true token and $\hat{p}_k = \text{softmax}(z_k)$.

(b) Show that the negative log-likelihood equals the categorical cross-entropy loss.

(c) Compute the gradient $\frac{\partial \mathcal{L}}{\partial z_k}$ of the cross-entropy loss with respect to the logit $z_k$. Show that it simplifies to $\hat{p}_k - \mathbb{1}[y = k]$.

(d) Implement this gradient computation in numpy and verify it against PyTorch's autograd.

import numpy as np
import torch
import torch.nn.functional as F

def softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable softmax."""
    z_shifted = z - np.max(z)
    exp_z = np.exp(z_shifted)
    return exp_z / exp_z.sum()

def cross_entropy_gradient(z: np.ndarray, y: int) -> np.ndarray:
    """Compute gradient of cross-entropy w.r.t. logits.

    Args:
        z: Logit vector, shape (K,).
        y: True class index.

    Returns:
        Gradient vector, shape (K,).
    """
    # YOUR CODE HERE
    pass

# Verify against PyTorch
np.random.seed(42)
K = 10  # small vocabulary for testing
z_np = np.random.randn(K)
y_true = 3

# Your gradient
grad_manual = cross_entropy_gradient(z_np, y_true)

# PyTorch gradient
z_torch = torch.tensor(z_np, dtype=torch.float32, requires_grad=True)
loss = F.cross_entropy(z_torch.unsqueeze(0), torch.tensor([y_true]))
loss.backward()
grad_pytorch = z_torch.grad.numpy()

print(f"Max absolute difference: {np.max(np.abs(grad_manual - grad_pytorch)):.2e}")

Exercise 3.8: MAP Estimation and Regularization

Difficulty: ★★☆☆

Consider logistic regression with weights $\mathbf{w} \in \mathbb{R}^d$.

(a) Write the negative log-likelihood (binary cross-entropy) for $n$ data points.

(b) Write the MAP objective with a Gaussian prior $\mathbf{w} \sim \mathcal{N}(\mathbf{0}, \sigma_0^2 \mathbf{I})$.

(c) Show that the MAP objective is equivalent to $L_2$-regularized logistic regression. What is the relationship between $\lambda$ and $\sigma_0^2$?

(d) Now consider a Laplace prior $p(w_j) \propto \exp(-|w_j| / b)$. Show that the MAP objective becomes $L_1$-regularized logistic regression.

(e) Which prior (Gaussian or Laplace) is more appropriate for a credit scoring model at Meridian Financial where most features are irrelevant? Justify your answer.


Exercise 3.9: Fisher Information for the Bernoulli Model

Difficulty: ★★☆☆

(a) Derive the Fisher information $I(p)$ for the Bernoulli distribution using the definition $I(p) = \mathbb{E}\left[\left(\frac{\partial \log p(X \mid p)}{\partial p}\right)^2\right]$.

(b) Verify the result using the alternative form $I(p) = -\mathbb{E}\left[\frac{\partial^2 \log p(X \mid p)}{\partial p^2}\right]$.

(c) Plot $I(p)$ as a function of $p \in (0, 1)$. Explain why Fisher information is minimized at $p = 0.5$ and diverges as $p \to 0$ or $p \to 1$.

(d) Using the Cramér-Rao bound, determine the minimum number of observations needed to estimate a CTR of $p = 0.05$ with a standard error of at most 0.005.
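
A check for part (d): the Cramér-Rao bound gives $\text{Var}(\hat{p}) \geq p(1-p)/n$, so the required sample size follows directly:

```python
p, se_target = 0.05, 0.005
# Cramér-Rao: Var(p_hat) >= p(1-p)/n, so n >= p(1-p) / se_target^2
n_min = p * (1 - p) / se_target**2
print(f"minimum n = {n_min:.0f}")
```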


Exercise 3.10: CLT Verification by Simulation

Difficulty: ★☆☆☆

(a) Draw $n = 1, 5, 30, 100$ samples from the following distributions and compute the sample mean. Repeat $B = 10{,}000$ times. Plot the histogram of sample means for each $n$ and overlay the CLT Gaussian approximation.

  • Uniform(0, 1)
  • Exponential($\lambda = 1$)
  • Bernoulli($p = 0.1$)

(b) For which distribution does the CLT approximation converge fastest? Slowest? Explain why.

(c) For the Bernoulli(0.1) case, what is the minimum $n$ for which the CLT Gaussian approximation is "reasonable" (e.g., Kolmogorov-Smirnov $p$-value > 0.05)?


Exercise 3.11: Hoeffding's Inequality Applied to A/B Testing

Difficulty: ★★☆☆

StreamRec is running an A/B test to compare two recommendation algorithms. User engagement is measured as a binary outcome (engaged or not).

(a) Using Hoeffding's inequality, derive the minimum sample size per group to detect a difference of $\delta = 0.01$ with confidence $1 - \alpha = 0.95$.

(b) Using the CLT-based power calculation (two-sample $z$-test with power 0.80), derive the minimum sample size for the same $\delta$ and $\alpha$, assuming $\sigma = 0.3$.

(c) Compare the two bounds. By what factor is the Hoeffding bound more conservative?

(d) In practice, StreamRec has 5 million daily active users. How long would the A/B test need to run under each bound? Assume 50/50 traffic split.


Exercise 3.12: Bootstrap vs. Parametric Confidence Intervals

Difficulty: ★★☆☆

(a) Generate 200 samples from a $\text{Gamma}(2, 3)$ distribution.

(b) Compute the 95% confidence interval for the mean using:

  • The CLT-based (Wald) interval
  • The parametric bootstrap (fit a Gamma to the sample, resample from the fitted distribution, and compute means)
  • The non-parametric bootstrap (resample the observed data with replacement and compute means)

(c) Compute the 95% confidence interval for the median using the bootstrap. Explain why the CLT-based interval is not directly applicable to the median.

(d) Repeat (b)-(c) for $n = 20$ instead of $n = 200$. How do the intervals compare?
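
A sketch of the non-parametric bootstrap from (b), assuming the shape/scale parameterization of Gamma(2, 3) (mean 6); adjust if your convention is shape/rate:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.gamma(shape=2.0, scale=3.0, size=200)

# Non-parametric bootstrap: resample the data with replacement
B = 5000
boot_means = np.array([rng.choice(x, size=x.size, replace=True).mean()
                       for _ in range(B)])
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"sample mean {x.mean():.3f}, 95% bootstrap CI [{lo:.3f}, {hi:.3f}]")
```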


Exercise 3.13: Importance Sampling for Rare Events

Difficulty: ★★★☆

In the StreamRec fraud detection system, fraudulent events occur with probability $p = 0.001$. We want to estimate the expected loss from fraud, where each fraud event costs $L \sim \text{LogNormal}(\mu=5, \sigma=2)$.

(a) Estimate $\mathbb{E}[\text{total loss per 10{,}000 transactions}]$ using naive Monte Carlo with $N = 100{,}000$ samples. Report the estimate and standard error.

(b) Now use importance sampling: oversample fraudulent events by using a proposal distribution with $q(\text{fraud}) = 0.1$ (instead of $p = 0.001$). Compute the importance-weighted estimate. Compare the standard error with (a).

(c) What is the effective sample size (ESS) of your importance sampling estimator? Explain why ESS is a better diagnostic than raw sample size.
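
A sketch of the importance-sampling estimator in (b); the weights correct for drawing fraud indicators from $q$ instead of $p$ (seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 0.001, 0.1           # true fraud rate, proposal fraud rate
N = 100_000

# Sample fraud indicators from the proposal q; losses are LogNormal(5, 2)
fraud = rng.random(N) < q
loss = np.where(fraud, rng.lognormal(mean=5.0, sigma=2.0, size=N), 0.0)

# Importance weights: p/q on fraud samples, (1-p)/(1-q) otherwise
w = np.where(fraud, p / q, (1 - p) / (1 - q))
est_per_txn = np.mean(w * loss)
est_per_10k = 10_000 * est_per_txn
print(f"IS estimate of expected loss per 10,000 transactions: {est_per_10k:.1f}")
# Analytic target for comparison: 10,000 * p * E[L] = 10 * exp(5 + 2^2/2)
```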


Exercise 3.14: Conjugate Priors and Sequential Updating

Difficulty: ★★☆☆

MediCore is monitoring adverse events for a drug in post-market surveillance. Adverse events per month are modeled as $X \sim \text{Poisson}(\lambda)$.

(a) Show that the Gamma distribution is the conjugate prior for the Poisson likelihood. If $\lambda \sim \text{Gamma}(\alpha, \beta)$ and we observe $x_1, \ldots, x_n$, derive the posterior.

(b) Start with a prior $\lambda \sim \text{Gamma}(2, 1)$ (mean = 2 events/month). Over three months, MediCore observes 3, 5, and 4 adverse events. Update the posterior sequentially (month by month) and show that the order does not matter.

(c) Plot the prior and the posterior after each month's update.

(d) After 12 months with a total of 48 events, compute a 95% credible interval for $\lambda$. Would a regulator consider this drug safe if the threshold is $\lambda < 5$?
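
A sketch of the sequential update in (b), using the rate parameterization of the Gamma prior:

```python
alpha, beta = 2.0, 1.0          # Gamma(2, 1) prior, mean = 2 events/month
for x in [3, 5, 4]:             # month-by-month conjugate updates
    alpha, beta = alpha + x, beta + 1
print(f"sequential posterior: Gamma({alpha:.0f}, {beta:.0f})")

# Batch update gives the same answer, so the order of updates does not matter
alpha_b, beta_b = 2.0 + sum([3, 5, 4]), 1.0 + 3
assert (alpha, beta) == (alpha_b, beta_b)
print(f"posterior mean: {alpha / beta:.2f} events/month")
```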


Exercise 3.15: MLE for the Multivariate Gaussian

Difficulty: ★★★☆

(a) Given $n$ i.i.d. samples $\mathbf{x}_1, \ldots, \mathbf{x}_n \in \mathbb{R}^d$ from $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, write the log-likelihood.

(b) Derive the MLE for $\boldsymbol{\mu}$ by taking the gradient with respect to $\boldsymbol{\mu}$ and setting to zero (use matrix calculus from Chapter 1).

(c) Derive the MLE for $\boldsymbol{\Sigma}$ by taking the gradient with respect to $\boldsymbol{\Sigma}^{-1}$ (the precision matrix). Show that $\hat{\boldsymbol{\Sigma}} = \frac{1}{n}\sum_{i=1}^n (\mathbf{x}_i - \bar{\mathbf{x}})(\mathbf{x}_i - \bar{\mathbf{x}})^\top$.

(d) Show that the MLE for $\boldsymbol{\Sigma}$ is biased. What is the unbiased estimator?

(e) Implement the MLE in numpy. Generate 500 samples from a 3-dimensional Gaussian with known $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, compute the MLE, and verify it is close to the true parameters.


Exercise 3.16: Loss Functions as Negative Log-Likelihoods

Difficulty: ★★★☆

For each loss function below, identify the probability distribution it corresponds to (as a negative log-likelihood), and derive the relationship.

(a) Huber loss: $L_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & |y - \hat{y}| \leq \delta \\ \delta|y - \hat{y}| - \frac{1}{2}\delta^2 & |y - \hat{y}| > \delta \end{cases}$

(b) Quantile loss (pinball loss): $L_\tau(y, \hat{y}) = \begin{cases} \tau(y - \hat{y}) & y \geq \hat{y} \\ (1 - \tau)(\hat{y} - y) & y < \hat{y} \end{cases}$

(c) Poisson deviance loss: $L(y, \hat{y}) = \hat{y} - y \log \hat{y}$

(d) For each, explain a practical scenario in the StreamRec system where this loss function would be more appropriate than MSE or cross-entropy.


Exercise 3.17: Sufficient Statistics and Data Compression

Difficulty: ★★★☆

(a) For the Bernoulli model, show that $T(\mathbf{x}) = \sum_{i=1}^n x_i$ is a sufficient statistic by applying the factorization theorem.

(b) For the Gaussian model with unknown $\mu$ and $\sigma^2$, show that $T(\mathbf{x}) = (\sum_i x_i, \sum_i x_i^2)$ is a minimal sufficient statistic.

(c) Suppose StreamRec has 100 million click/no-click events per day. Using the sufficiency result from (a), how much data compression is possible without losing any information about the click-through rate? What is the compression ratio?

(d) Explain why sufficient statistics do not exist (in a useful form) for neural network models, and what this implies about the need to store and process full datasets in deep learning.


Exercise 3.18: Monte Carlo Integration

Difficulty: ★★☆☆

(a) Estimate $\int_0^1 e^{-x^2} dx$ using Monte Carlo integration with $N = 10{,}000$ samples. Report the estimate and standard error.

(b) The integral $\int_0^\infty x^2 e^{-x} dx = 2$ (the second moment of the Exponential(1) distribution). Verify this using Monte Carlo with $N = 100{,}000$ samples.

(c) Estimate $P(X > 5)$ where $X \sim \mathcal{N}(0, 1)$ using:

  • Naive Monte Carlo
  • Importance sampling with proposal $q = \mathcal{N}(5, 1)$

Compare the standard errors of the two approaches. Explain why importance sampling is dramatically better for this rare-event estimation.
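
For part (a), a minimal Monte Carlo sketch (seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10_000
u = rng.random(N)              # U ~ Uniform(0, 1)
f = np.exp(-u**2)

est = f.mean()                 # Monte Carlo estimate of E[f(U)] = integral
se = f.std(ddof=1) / np.sqrt(N)
print(f"estimate {est:.4f} ± {se:.4f}  (true value ≈ 0.7468)")
```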


Exercise 3.19: Bayesian A/B Testing

Difficulty: ★★★☆

StreamRec is comparing two recommendation algorithms. After one week:

  • Algorithm A: 12,340 engagements out of 85,200 impressions
  • Algorithm B: 13,050 engagements out of 86,800 impressions

(a) Using a uniform prior $\text{Beta}(1, 1)$ for each algorithm's engagement rate, compute the posterior distributions.

(b) Compute $P(p_B > p_A \mid \text{data})$ by Monte Carlo sampling from both posteriors.

(c) Compute the expected loss of choosing each algorithm: $\mathbb{E}[\max(p_B - p_A, 0) \mid \text{data}]$ (loss of choosing A) and $\mathbb{E}[\max(p_A - p_B, 0) \mid \text{data}]$ (loss of choosing B).

(d) Compare with a frequentist two-proportion $z$-test at $\alpha = 0.05$. Do the two approaches lead to the same decision?

(e) Repeat (a)-(d) with an informative prior $\text{Beta}(100, 600)$ for both algorithms (expressing prior belief that engagement rate is around 14%). How does the prior affect the conclusion?
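
A sketch of (a)-(c) by posterior sampling (seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
# Beta(1, 1) prior + binomial data -> Beta(1 + successes, 1 + failures)
a_A, b_A = 1 + 12_340, 1 + (85_200 - 12_340)
a_B, b_B = 1 + 13_050, 1 + (86_800 - 13_050)

S = 200_000
p_A = rng.beta(a_A, b_A, size=S)
p_B = rng.beta(a_B, b_B, size=S)

prob_B_better = np.mean(p_B > p_A)
loss_choose_A = np.mean(np.maximum(p_B - p_A, 0.0))
print(f"P(p_B > p_A | data) = {prob_B_better:.4f}")
print(f"expected loss of choosing A: {loss_choose_A:.5f}")
```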


Exercise 3.20: The Bias-Variance Tradeoff for MLE

Difficulty: ★★★☆

(a) Show that the MLE for the Bernoulli parameter $\hat{p} = \bar{x}$ is unbiased: $\mathbb{E}[\hat{p}] = p$.

(b) Show that the MLE for the Gaussian variance $\hat{\sigma}^2 = \frac{1}{n}\sum_i (x_i - \bar{x})^2$ is biased: $\mathbb{E}[\hat{\sigma}^2] = \frac{n-1}{n}\sigma^2$.

(c) The bias of $\hat{\sigma}^2$ decreases as $n$ grows. For what value of $n$ is the bias less than 1% of the true variance?

(d) Derive the MSE of the biased MLE and of the unbiased estimator $s^2 = \frac{1}{n-1}\sum_i (x_i - \bar{x})^2$. Show that the biased MLE has strictly lower MSE for every $n$, so unbiasedness comes at a cost in mean squared error.


Exercise 3.21: Exponential Family and Generalized Linear Models

Difficulty: ★★★☆

A GLM specifies that the conditional distribution of $Y \mid \mathbf{x}$ belongs to the exponential family with natural parameter $\eta = \mathbf{w}^\top \mathbf{x}$.

(a) Show that logistic regression is a GLM with a Bernoulli response distribution. Identify the link function.

(b) Show that Poisson regression is a GLM with a Poisson response distribution. Identify the link function.

(c) For a general GLM, derive the gradient of the log-likelihood with respect to $\mathbf{w}$. Show it has the form $\frac{\partial \ell}{\partial \mathbf{w}} = \sum_{i=1}^n (y_i - \mu_i) \mathbf{x}_i$, where $\mu_i = \mathbb{E}[Y_i \mid \mathbf{x}_i]$.

(d) Why is this gradient form important for optimization? Connect to the logistic regression gradient from Chapter 2.


Exercise 3.22: Climate Model Uncertainty Quantification

Difficulty: ★★★☆

The Pacific Climate Research Consortium has 30 climate models, each projecting 2050 global temperature increase (in degrees Celsius) relative to the pre-industrial baseline:

projections = np.array([
    1.8, 2.1, 2.3, 2.0, 2.5, 1.9, 2.7, 2.2, 2.4, 2.6,
    3.0, 2.8, 2.1, 2.3, 1.7, 2.9, 2.5, 2.2, 2.0, 2.4,
    3.2, 2.6, 2.3, 2.1, 2.8, 1.6, 2.4, 2.7, 2.5, 2.2
])

(a) Compute the MLE for $\mu$ and $\sigma$ assuming the projections are Gaussian.

(b) Compute a 90% confidence interval for the mean temperature increase.

(c) Compute $P(\text{increase} > 2.5°\text{C})$ under the fitted Gaussian.

(d) Use the bootstrap to compute a 90% confidence interval for the probability $P(\text{increase} > 2.5°\text{C})$. Explain why this "CI for a probability" is important for policy communication.

(e) Critique the Gaussian assumption. What features of climate model ensembles might violate it? What alternative distributions might be more appropriate?


Exercise 3.23: Asymptotic Properties of MLE

Difficulty: ★★★☆

(a) State the three asymptotic properties of MLE: consistency, asymptotic normality, and asymptotic efficiency. For each, provide the formal statement.

(b) For the Bernoulli model with $p = 0.3$, simulate $n = 10, 50, 200, 1000$ i.i.d. samples, compute $\hat{p}_{\text{MLE}}$ for each, and repeat $B = 10{,}000$ times. Verify:

  • Consistency: $\hat{p} \to p$ as $n \to \infty$ (plot the mean of $\hat{p}$ vs. $n$)
  • Asymptotic normality: the distribution of $\sqrt{n}(\hat{p} - p)$ converges to $\mathcal{N}(0, p(1-p))$ (overlay Q-Q plots)
  • Efficiency: $\text{Var}(\hat{p})$ converges to the Cramér-Rao bound $p(1-p)/n$ (plot empirical variance vs. the CR bound)


Exercise 3.24: Implementing Rejection Sampling

Difficulty: ★★★☆

(a) Implement rejection sampling to draw samples from the Beta(2.7, 6.3) distribution using a Uniform(0, 1) proposal.

(b) What is the theoretical acceptance rate? Verify empirically.

(c) Now implement rejection sampling with a Beta(2, 5) proposal (closer to the target). What is the new acceptance rate? How does it compare?

(d) For a $d$-dimensional target distribution, explain why rejection sampling becomes impractical and motivate the need for MCMC (covered in Part IV).
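
A sketch of part (a), computing the Beta normalizing constant with math.gamma and finding the envelope constant $M$ numerically on a grid:

```python
import math
import numpy as np

a, b = 2.7, 6.3
B_ab = math.gamma(a) * math.gamma(b) / math.gamma(a + b)  # Beta function B(a, b)

def beta_pdf(x):
    return x**(a - 1) * (1 - x)**(b - 1) / B_ab

# Envelope constant M = sup of the target pdf over (0, 1)
grid = np.linspace(1e-6, 1 - 1e-6, 100_001)
M = beta_pdf(grid).max()

rng = np.random.default_rng(0)
N = 200_000
x = rng.random(N)                      # Uniform(0, 1) proposal
u = rng.random(N)
accepted = x[u < beta_pdf(x) / M]      # accept with probability pdf(x)/M

print(f"acceptance rate: {accepted.size / N:.3f}  (theoretical: {1 / M:.3f})")
print(f"sample mean: {accepted.mean():.3f}  (true mean: {a / (a + b):.3f})")
```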


Exercise 3.25: The Delta Method

Difficulty: ★★★☆

The delta method states that if $\sqrt{n}(\hat{\theta} - \theta) \xrightarrow{d} \mathcal{N}(0, \sigma^2)$, then for a differentiable function $g$:

$$\sqrt{n}(g(\hat{\theta}) - g(\theta)) \xrightarrow{d} \mathcal{N}(0, [g'(\theta)]^2 \sigma^2)$$

(a) StreamRec reports the odds ratio $\text{OR} = \frac{p}{1-p}$ where $p$ is the engagement rate. Using the delta method, derive the asymptotic distribution of $\widehat{\text{OR}} = \frac{\hat{p}}{1-\hat{p}}$.

(b) Compute the 95% confidence interval for the odds ratio when $\hat{p} = 0.12$ and $n = 10{,}000$.

(c) Compare with the bootstrap confidence interval for the odds ratio. Which do you trust more, and why?
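
A numerical sketch to check your (a)-(b) derivation against:

```python
import numpy as np

p_hat, n = 0.12, 10_000
odds = p_hat / (1 - p_hat)

# Delta method with g(p) = p/(1-p), g'(p) = 1/(1-p)^2:
# Var(OR_hat) ≈ [g'(p)]^2 * p(1-p)/n = p / ((1-p)^3 * n)
se = np.sqrt(p_hat / ((1 - p_hat)**3 * n))
lo, hi = odds - 1.96 * se, odds + 1.96 * se
print(f"OR = {odds:.4f}, 95% CI [{lo:.4f}, {hi:.4f}]")
```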


Exercise 3.26: Bayesian Model Comparison

Difficulty: ★★★★

MediCore is comparing two models for adverse event rates:

  • Model 1: $X \sim \text{Poisson}(\lambda)$ (constant rate)
  • Model 2: $X \sim \text{NegBin}(r, p)$ (overdispersed counts)

Monthly adverse event counts over 24 months: [3, 5, 2, 8, 4, 6, 1, 7, 3, 5, 9, 2, 4, 6, 3, 5, 8, 1, 4, 7, 2, 6, 5, 3].

(a) For Model 1, compute the marginal likelihood $p(D \mid M_1) = \int p(D \mid \lambda) p(\lambda) \, d\lambda$ analytically using a Gamma(2, 0.5) prior on $\lambda$.

(b) For Model 2, estimate the marginal likelihood using Monte Carlo integration with a suitable prior.

(c) Compute the Bayes factor $\text{BF}_{12} = p(D \mid M_1) / p(D \mid M_2)$. Which model does the data support?

(d) Discuss: can the frequentist likelihood ratio test perform this comparison? What are the advantages and limitations of the Bayesian approach?


Exercise 3.27: Multivariate Concentration Inequalities

Difficulty: ★★★★

(a) State Hoeffding's inequality for sums of bounded independent random variables.

(b) Consider a recommendation model whose error on a single prediction is bounded in $[-1, 1]$. Using Hoeffding's inequality, derive the minimum number of test predictions needed to guarantee that the empirical error is within $\epsilon = 0.01$ of the true error with probability at least $0.99$.

(c) Bernstein's inequality provides a tighter bound when the variance is known to be small. State Bernstein's inequality and compute the sample size from (b) assuming $\text{Var}(e_i) = 0.04$. How much tighter is the bound?

(d) Discuss why these bounds are important for the statistical validity of ML model evaluation on finite test sets. What happens when test set size is too small?


Exercise 3.28: The Expectation-Maximization (EM) Preview

Difficulty: ★★★★

Consider a Gaussian mixture model with $K = 2$ components: $p(x) = \pi_1 \mathcal{N}(x \mid \mu_1, \sigma_1^2) + \pi_2 \mathcal{N}(x \mid \mu_2, \sigma_2^2)$.

(a) Write the log-likelihood for $n$ i.i.d. observations. Explain why direct optimization is difficult (the log of a sum does not decompose).

(b) Introduce latent variables $z_i \in \{1, 2\}$ indicating which component generated observation $x_i$. Write the complete-data log-likelihood $\log p(\mathbf{x}, \mathbf{z} \mid \theta)$.

(c) Derive the E-step: compute $\gamma_{ik} = P(z_i = k \mid x_i, \theta^{(t)})$ — the posterior responsibility of component $k$ for observation $i$.

(d) Derive the M-step: update $\pi_k$, $\mu_k$, and $\sigma_k^2$ by maximizing the expected complete-data log-likelihood.

(e) Implement EM for a 2-component Gaussian mixture on simulated data. Plot the convergence of the log-likelihood and the fitted components.

Note: This exercise previews techniques that will be developed fully in Chapter 12 (Generative Models) and Chapter 20 (Bayesian Thinking).


Exercise 3.29: Monte Carlo Estimation of Pi

Difficulty: ★☆☆☆

A classic Monte Carlo demonstration:

(a) Estimate $\pi$ by sampling $N$ points uniformly in the unit square $[0, 1]^2$ and computing the fraction that fall inside the quarter-circle $x^2 + y^2 \leq 1$.

(b) Plot the estimate as a function of $N$ for $N = 10, 100, 1{,}000, 10{,}000, 100{,}000$. Overlay the true value of $\pi$.

(c) Compute the standard error of the estimate for each $N$. Verify that it decreases as $1/\sqrt{N}$.

(d) How many samples would you need for the estimate to have a standard error of less than $10^{-4}$?
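
A minimal sketch of parts (a) and (c) for a single $N$ (seed arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100_000
x, y = rng.random(N), rng.random(N)
inside = (x**2 + y**2) <= 1.0          # fraction inside the quarter-circle

pi_hat = 4 * inside.mean()
se = 4 * inside.std(ddof=1) / np.sqrt(N)
print(f"pi ≈ {pi_hat:.4f} ± {se:.4f}")
```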


Exercise 3.30: End-to-End: Drug Trial Analysis

Difficulty: ★★★★

MediCore is deciding whether to advance Drug X to Phase III trials based on Phase II data.

Phase II data: 85 responders out of 200 patients (42.5%). Regulatory threshold: The drug should demonstrate a response rate of at least 30%. Prior: Based on the drug's mechanism and similar compounds, MediCore's experts specify a $\text{Beta}(6, 10)$ prior (mean 37.5%).

(a) Compute the posterior distribution for the response rate.

(b) Compute $P(\theta > 0.30 \mid \text{data})$ — the posterior probability that the drug exceeds the regulatory threshold. Use both analytical (Beta CDF) and Monte Carlo approaches.

(c) Compute the expected value of perfect information (EVPI): how much would MediCore's expected outcome improve if they knew $\theta$ exactly? Assume the drug generates \$500M in revenue if $\theta > 0.30$ and \$0 otherwise.

(d) A frequentist would test $H_0: \theta \leq 0.30$ vs. $H_1: \theta > 0.30$ at $\alpha = 0.05$. Compute the $p$-value. Does the frequentist test agree with the Bayesian analysis?

(e) Discuss: which analysis (Bayesian or frequentist) would you present to the FDA, and why?
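
A sketch of parts (a)-(b) via Monte Carlo (the analytic Beta-CDF route should give the same number):

```python
import numpy as np

# Conjugate update: Beta(6, 10) prior + 85 responders / 115 non-responders
a_post, b_post = 6 + 85, 10 + (200 - 85)

rng = np.random.default_rng(0)
theta = rng.beta(a_post, b_post, size=500_000)
prob_exceeds = np.mean(theta > 0.30)
print(f"posterior mean: {a_post / (a_post + b_post):.4f}")
print(f"P(theta > 0.30 | data) ≈ {prob_exceeds:.4f}")
```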