Case Study 2: Information-Theoretic Analysis of Neural Network Training
Overview
When we train a neural network, we typically monitor the loss curve and accuracy. But these metrics tell only part of the story. Information theory provides a richer set of tools for understanding what the network is learning and how its predictions evolve during training. In this case study, we use entropy, cross-entropy, KL divergence, and mutual information to dissect the training process of a simple neural network, connecting the theory from Section 4.5 to the practice of deep learning.
By the end of this case study, you will have:
- Tracked entropy, cross-entropy, and KL divergence throughout training
- Visualized how the model's predicted distribution evolves toward the true distribution
- Analyzed the effect of temperature scaling on prediction calibration
- Connected the cross-entropy loss to its information-theoretic meaning
Background
The Setup
We train a simple two-layer neural network (implemented in pure NumPy) on a synthetic 3-class classification problem. At each epoch, we compute:
- Cross-entropy loss $H(p, q_\theta)$: the standard training objective
- Entropy of predictions $H(q_\theta)$: how confident is the model?
- KL divergence $D_{\text{KL}}(p \| q_\theta)$: how far is the model from the true distribution?
- Mutual information $I(X; \hat{Y})$: how much information do predictions carry about inputs? (examined qualitatively in the Discussion via the information bottleneck, rather than tracked per epoch)
The Relationship
Recall from Section 4.5 that cross-entropy decomposes as:
$$ H(p, q) = H(p) + D_{\text{KL}}(p \| q) $$
Since $H(p)$ is fixed (it depends only on the true distribution), minimizing cross-entropy is equivalent to minimizing KL divergence. But tracking both separately reveals interesting dynamics.
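The decomposition is easy to check numerically. The snippet below is a minimal sketch using two arbitrary example distributions (the values are illustrative, not taken from the case study):

import numpy as np

# Arbitrary example distributions over 3 classes (illustrative values only)
p = np.array([0.7, 0.2, 0.1])  # "true" distribution
q = np.array([0.5, 0.3, 0.2])  # model distribution

entropy_p = -np.sum(p * np.log(p))         # H(p)
cross_entropy = -np.sum(p * np.log(q))     # H(p, q)
kl_divergence = np.sum(p * np.log(p / q))  # D_KL(p || q)

# The decomposition H(p, q) = H(p) + D_KL(p || q) holds up to floating-point error
assert np.isclose(cross_entropy, entropy_p + kl_divergence)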
Implementation
Step 1: Synthetic Data Generation
We create a 3-class classification dataset with clear but overlapping class boundaries:
import numpy as np
from typing import Tuple
def generate_data(
n_samples: int = 1500,
seed: int = 42,
) -> Tuple[np.ndarray, np.ndarray]:
"""Generate a synthetic 3-class classification dataset.
Creates three clusters of 2D points with some overlap to make
the classification problem non-trivial.
Args:
n_samples: Total number of samples (divided equally among classes).
seed: Random seed for reproducibility.
Returns:
Tuple of (X, y) where X has shape (n_samples, 2)
and y has shape (n_samples,) with values in {0, 1, 2}.
"""
rng = np.random.default_rng(seed)
n_per_class = n_samples // 3
# Class 0: cluster centered at (-2, -1)
X0 = rng.normal(loc=[-2, -1], scale=1.0, size=(n_per_class, 2))
# Class 1: cluster centered at (2, -1)
X1 = rng.normal(loc=[2, -1], scale=1.0, size=(n_per_class, 2))
# Class 2: cluster centered at (0, 2)
X2 = rng.normal(loc=[0, 2], scale=1.0, size=(n_per_class, 2))
X = np.vstack([X0, X1, X2])
y = np.concatenate([
np.zeros(n_per_class, dtype=int),
np.ones(n_per_class, dtype=int),
2 * np.ones(n_per_class, dtype=int),
])
# Shuffle
perm = rng.permutation(len(y))
return X[perm], y[perm]
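As a quick usage sketch (the 80/20 split is an arbitrary choice, not prescribed by the case study), the data can be generated and split into training and test sets as follows:

X, y = generate_data(n_samples=1500, seed=42)

# Simple 80/20 train/test split; generate_data already shuffles the samples
n_train = int(0.8 * len(y))
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]

print(X_train.shape, X_test.shape)  # (1200, 2) (300, 2)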
Step 2: Neural Network in NumPy
We implement a minimal two-layer network with softmax output:
def softmax(z: np.ndarray) -> np.ndarray:
"""Compute softmax with numerical stability.
Args:
z: Logits of shape (batch_size, n_classes).
Returns:
Probabilities of shape (batch_size, n_classes).
"""
z_shifted = z - np.max(z, axis=1, keepdims=True)
exp_z = np.exp(z_shifted)
return exp_z / np.sum(exp_z, axis=1, keepdims=True)
def relu(z: np.ndarray) -> np.ndarray:
"""Rectified linear unit activation.
Args:
z: Input array.
Returns:
Element-wise max(0, z).
"""
return np.maximum(0, z)
def relu_derivative(z: np.ndarray) -> np.ndarray:
"""Derivative of ReLU.
Args:
z: Input array.
Returns:
1 where z > 0, 0 elsewhere.
"""
return (z > 0).astype(np.float64)
class SimpleNeuralNetwork:
"""A two-layer neural network for classification.
Architecture: Input -> Dense(hidden_dim, ReLU) -> Dense(n_classes, Softmax)
Attributes:
W1: First layer weights, shape (input_dim, hidden_dim).
b1: First layer biases, shape (hidden_dim,).
W2: Second layer weights, shape (hidden_dim, n_classes).
b2: Second layer biases, shape (n_classes,).
"""
def __init__(
self,
input_dim: int,
hidden_dim: int,
n_classes: int,
seed: int = 42,
) -> None:
"""Initialize network with Xavier initialization.
Args:
input_dim: Number of input features.
hidden_dim: Number of hidden units.
n_classes: Number of output classes.
seed: Random seed for weight initialization.
"""
rng = np.random.default_rng(seed)
# Xavier initialization
scale1 = np.sqrt(2.0 / (input_dim + hidden_dim))
scale2 = np.sqrt(2.0 / (hidden_dim + n_classes))
self.W1 = rng.normal(0, scale1, (input_dim, hidden_dim))
self.b1 = np.zeros(hidden_dim)
self.W2 = rng.normal(0, scale2, (hidden_dim, n_classes))
self.b2 = np.zeros(n_classes)
def forward(self, X: np.ndarray) -> np.ndarray:
"""Forward pass through the network.
Args:
X: Input data of shape (batch_size, input_dim).
Returns:
Predicted probabilities of shape (batch_size, n_classes).
"""
self._z1 = X @ self.W1 + self.b1
self._a1 = relu(self._z1)
self._z2 = self._a1 @ self.W2 + self.b2
self._probs = softmax(self._z2)
self._X = X
return self._probs
def backward(
self,
y: np.ndarray,
learning_rate: float = 0.01,
) -> float:
"""Backward pass with gradient descent update.
Args:
y: True labels of shape (batch_size,).
learning_rate: Step size for gradient descent.
Returns:
Cross-entropy loss value.
"""
batch_size = len(y)
# One-hot encode labels
y_onehot = np.zeros_like(self._probs)
y_onehot[np.arange(batch_size), y] = 1.0
# Cross-entropy loss
eps = 1e-12
loss = -np.mean(np.sum(y_onehot * np.log(self._probs + eps), axis=1))
# Output layer gradient: dL/dz2 = probs - y_onehot
dz2 = (self._probs - y_onehot) / batch_size
dW2 = self._a1.T @ dz2
db2 = np.sum(dz2, axis=0)
# Hidden layer gradient
da1 = dz2 @ self.W2.T
dz1 = da1 * relu_derivative(self._z1)
dW1 = self._X.T @ dz1
db1 = np.sum(dz1, axis=0)
# Update parameters
self.W1 -= learning_rate * dW1
self.b1 -= learning_rate * db1
self.W2 -= learning_rate * dW2
self.b2 -= learning_rate * db2
return loss
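Before training, a quick sanity check (a minimal sketch reusing X_train from the split above) confirms that the forward pass produces valid probability distributions:

net = SimpleNeuralNetwork(input_dim=2, hidden_dim=32, n_classes=3)
probs = net.forward(X_train[:5])

print(probs.shape)                          # (5, 3)
print(np.allclose(probs.sum(axis=1), 1.0))  # True: each row is a probability distribution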
Step 3: Information-Theoretic Metrics
def compute_entropy(probs: np.ndarray) -> float:
"""Compute the average entropy of predicted distributions.
Args:
probs: Predicted probabilities, shape (n_samples, n_classes).
Returns:
Mean entropy across all predictions (in nats).
"""
eps = 1e-12
return -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
def compute_cross_entropy(
y: np.ndarray,
probs: np.ndarray,
n_classes: int,
) -> float:
"""Compute cross-entropy between true labels and predictions.
Args:
y: True labels, shape (n_samples,).
probs: Predicted probabilities, shape (n_samples, n_classes).
n_classes: Number of classes.
Returns:
Cross-entropy value (in nats).
"""
eps = 1e-12
y_onehot = np.zeros((len(y), n_classes))
y_onehot[np.arange(len(y)), y] = 1.0
return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))
def compute_kl_divergence(
y: np.ndarray,
probs: np.ndarray,
n_classes: int,
) -> float:
"""Compute KL divergence from true distribution to predicted.
Since the true distribution is one-hot, H(p) = 0 per sample,
and D_KL(p || q) = H(p, q) - H(p) = H(p, q).
Args:
y: True labels, shape (n_samples,).
probs: Predicted probabilities, shape (n_samples, n_classes).
n_classes: Number of classes.
Returns:
KL divergence value (in nats).
"""
# For one-hot true distributions, H(p) = 0, so KL = cross-entropy
return compute_cross_entropy(y, probs, n_classes)
def compute_class_distribution_kl(
y: np.ndarray,
probs: np.ndarray,
n_classes: int,
) -> float:
"""Compute KL divergence between the aggregate predicted and true
class distributions.
This measures whether the model's average predictions match the
true class proportions.
Args:
y: True labels.
probs: Predicted probabilities, shape (n_samples, n_classes).
n_classes: Number of classes.
Returns:
KL divergence between aggregate distributions.
"""
eps = 1e-12
# True class distribution
true_dist = np.bincount(y, minlength=n_classes) / len(y)
# Average predicted distribution
pred_dist = np.mean(probs, axis=0)
return np.sum(true_dist * np.log((true_dist + eps) / (pred_dist + eps)))
def compute_prediction_confidence(probs: np.ndarray) -> float:
"""Compute average maximum predicted probability (confidence).
Args:
probs: Predicted probabilities, shape (n_samples, n_classes).
Returns:
Mean of the maximum probability across samples.
"""
return np.mean(np.max(probs, axis=1))
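A small consistency check (a sketch on a hand-made three-sample example) confirms the relationship between these metrics: for one-hot labels the KL divergence equals the cross-entropy, and the aggregate class-distribution KL is small when the average predictions roughly match the true class frequencies:

# Three samples, one per class, with reasonably confident predictions
y_demo = np.array([0, 1, 2])
probs_demo = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.2, 0.7],
])

ce = compute_cross_entropy(y_demo, probs_demo, n_classes=3)
kl = compute_kl_divergence(y_demo, probs_demo, n_classes=3)
print(np.isclose(ce, kl))  # True: H(p) = 0 for one-hot labels

# Small value: the average prediction is close to the true class frequencies (1/3 each)
print(compute_class_distribution_kl(y_demo, probs_demo, n_classes=3))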
Step 4: Training with Information-Theoretic Tracking
def train_and_track(
X_train: np.ndarray,
y_train: np.ndarray,
X_test: np.ndarray,
y_test: np.ndarray,
n_classes: int = 3,
hidden_dim: int = 32,
n_epochs: int = 200,
learning_rate: float = 0.05,
) -> dict:
"""Train the network while tracking information-theoretic metrics.
Args:
X_train: Training features.
y_train: Training labels.
X_test: Test features.
y_test: Test labels.
n_classes: Number of classes.
hidden_dim: Hidden layer dimension.
n_epochs: Number of training epochs.
learning_rate: Learning rate for gradient descent.
Returns:
Dictionary containing metric histories.
"""
net = SimpleNeuralNetwork(
input_dim=X_train.shape[1],
hidden_dim=hidden_dim,
n_classes=n_classes,
)
history = {
"epoch": [],
"train_loss": [],
"test_loss": [],
"train_accuracy": [],
"test_accuracy": [],
"prediction_entropy": [],
"prediction_confidence": [],
"class_dist_kl": [],
}
    for epoch in range(n_epochs):
        # Training step: forward caches the activations used by the backward pass
        net.forward(X_train)
        loss = net.backward(y_train, learning_rate)
        # Re-forward with the updated parameters for evaluation (no gradient update)
        probs_train = net.forward(X_train)
        probs_test = net.forward(X_test)
        # Metrics
        train_acc = np.mean(np.argmax(probs_train, axis=1) == y_train)
        test_loss = compute_cross_entropy(y_test, probs_test, n_classes)
        test_acc = np.mean(np.argmax(probs_test, axis=1) == y_test)
history["epoch"].append(epoch)
history["train_loss"].append(loss)
history["test_loss"].append(test_loss)
history["train_accuracy"].append(train_acc)
history["test_accuracy"].append(test_acc)
history["prediction_entropy"].append(
compute_entropy(probs_test)
)
history["prediction_confidence"].append(
compute_prediction_confidence(probs_test)
)
history["class_dist_kl"].append(
compute_class_distribution_kl(y_test, probs_test, n_classes)
)
if epoch % 50 == 0 or epoch == n_epochs - 1:
print(
f"Epoch {epoch:4d} | "
f"Loss: {loss:.4f} | "
f"Acc: {test_acc:.4f} | "
f"Entropy: {history['prediction_entropy'][-1]:.4f} | "
f"Confidence: {history['prediction_confidence'][-1]:.4f}"
)
return history
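Putting the pieces together, one possible invocation looks like the sketch below (hyperparameters follow the defaults above; the split mirrors the one from Step 1):

X, y = generate_data()
n_train = int(0.8 * len(y))

history = train_and_track(
    X_train=X[:n_train], y_train=y[:n_train],
    X_test=X[n_train:], y_test=y[n_train:],
    n_classes=3, hidden_dim=32, n_epochs=200, learning_rate=0.05,
)

# Inspect the final information-theoretic state of the model
print(f"Final test loss:     {history['test_loss'][-1]:.4f} nats")
print(f"Final pred. entropy: {history['prediction_entropy'][-1]:.4f} nats")
print(f"Final confidence:    {history['prediction_confidence'][-1]:.4f}")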
Analysis and Interpretation
Phase 1: Early Training (Epochs 0-20)
In the earliest epochs, the network's weights are random. The predicted distribution $q_\theta$ is close to uniform over the three classes, giving:
- High prediction entropy $\approx \log 3 \approx 1.099$ (maximum for 3 classes)
- High cross-entropy (the loss starts high)
- Low confidence $\approx 0.33$ (maximum probability close to $1/3$)
This is the information-theoretic interpretation of an untrained model: it carries almost no information about the input-output relationship.
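These baseline numbers are exactly what a uniform predictor yields; a two-line check (a sketch using the metric functions from Step 3):

uniform = np.full((1, 3), 1.0 / 3.0)
print(compute_entropy(uniform))                # ~1.0986 = log 3, the maximum entropy
print(compute_prediction_confidence(uniform))  # ~0.3333, no better than chance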
Phase 2: Rapid Learning (Epochs 20-100)
During the rapid learning phase:
- Cross-entropy drops sharply as the model learns the dominant patterns
- Prediction entropy decreases as the model becomes more confident
- Accuracy increases rapidly
- Confidence increases as the maximum predicted probability grows
The cross-entropy decomposes as $H(p, q) = H(p) + D_{\text{KL}}(p \| q)$. Since $H(p) = 0$ for one-hot labels, all reduction in cross-entropy corresponds directly to reduction in KL divergence.
Phase 3: Refinement (Epochs 100+)
In later epochs:
- Cross-entropy continues to decrease slowly as the model refines decision boundaries
- Prediction entropy may decrease further, potentially leading to overconfidence
- The gap between train and test loss may widen (overfitting)
This is where the information bottleneck theory (Exercise 4.27) becomes relevant: the network compresses information about the input while retaining information about the output.
Temperature Calibration Analysis
After training, we can analyze prediction calibration using temperature scaling:
def temperature_analysis(
net: SimpleNeuralNetwork,
X: np.ndarray,
y: np.ndarray,
temperatures: np.ndarray,
n_classes: int = 3,
) -> dict:
"""Analyze the effect of temperature scaling on predictions.
Args:
net: Trained neural network.
X: Input features.
y: True labels.
temperatures: Array of temperature values to test.
n_classes: Number of classes.
Returns:
Dictionary with metrics at each temperature.
"""
# Get logits (pre-softmax values)
_ = net.forward(X)
logits = net._z2
results = {
"temperature": [],
"entropy": [],
"cross_entropy": [],
"accuracy": [],
"confidence": [],
}
for T in temperatures:
scaled_probs = softmax(logits / T)
preds = np.argmax(scaled_probs, axis=1)
results["temperature"].append(T)
results["entropy"].append(compute_entropy(scaled_probs))
results["cross_entropy"].append(
compute_cross_entropy(y, scaled_probs, n_classes)
)
results["accuracy"].append(np.mean(preds == y))
results["confidence"].append(
compute_prediction_confidence(scaled_probs)
)
return results
The temperature analysis reveals:
- Low temperature ($T < 1$): Predictions become sharper (lower entropy, higher confidence). Accuracy is preserved, but the model becomes overconfident.
- High temperature ($T > 1$): Predictions become softer (higher entropy, lower confidence). The distribution approaches uniform.
- Optimal temperature: Minimizes the gap between predicted confidence and actual accuracy (calibration).
This directly connects to the temperature parameter in language model decoding (Chapter 21): low temperature for deterministic outputs, high temperature for diverse generation.
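A minimal usage sketch follows (the temperature grid is an arbitrary choice; since train_and_track keeps its network local, we train a separate network here purely for illustration):

# Train a network to analyze (illustrative; mirrors the training loop above)
net = SimpleNeuralNetwork(input_dim=2, hidden_dim=32, n_classes=3)
for _ in range(200):
    net.forward(X_train)
    net.backward(y_train, learning_rate=0.05)

temperatures = np.logspace(-1, 1, 21)  # T from 0.1 to 10
calib = temperature_analysis(net, X_test, y_test, temperatures, n_classes=3)

for T, ent, conf, acc in zip(
    calib["temperature"], calib["entropy"],
    calib["confidence"], calib["accuracy"],
):
    print(f"T = {T:5.2f} | entropy = {ent:.3f} | confidence = {conf:.3f} | acc = {acc:.3f}")

Note that accuracy is identical at every temperature: dividing the logits by a positive constant does not change the argmax, so only entropy, confidence, and cross-entropy respond to $T$.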
Discussion
Information-Theoretic View of Generalization
The cross-entropy on the test set provides an information-theoretic view of generalization:
$$ H(p_{\text{test}}, q_\theta) = H(p_{\text{test}}) + D_{\text{KL}}(p_{\text{test}} \| q_\theta) $$
When the test cross-entropy is much higher than the training cross-entropy, the model's learned distribution $q_\theta$ is far from the true test distribution -- this is overfitting, quantified in KL divergence terms.
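In code, this gap is simply the difference of the two tracked cross-entropies (a sketch using the history dictionary from Step 4; train_loss is recorded just before each parameter update, so the comparison is approximate):

# Generalization gap in nats: extra coding cost on test data relative to training data
gen_gap = history["test_loss"][-1] - history["train_loss"][-1]
print(f"Generalization gap: {gen_gap:.4f} nats")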
Connections to the Information Bottleneck
Tishby and Zaslavsky (2015) proposed that deep learning can be understood through the information bottleneck: each layer creates a compressed representation $T$ that balances two competing objectives:
- Compression: Minimize $I(X; T)$ -- discard irrelevant input information
- Prediction: Maximize $I(T; Y)$ -- retain information about the target
This framework, while debated, provides an elegant information-theoretic perspective on what neural networks learn. The entropy tracking we performed in this case study is a simplified version of this analysis.
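A full information bottleneck analysis would estimate $I(X; T)$ and $I(T; Y)$ for the hidden representation, which requires nontrivial mutual information estimators. As a much simpler stand-in (a sketch; the helper name is ours, not part of the case study's code), the snippet below computes a plug-in estimate of the mutual information between the hard predicted label and the true label from their joint counts:

def estimate_label_mutual_information(
    y_true: np.ndarray,
    y_pred: np.ndarray,
    n_classes: int,
) -> float:
    """Plug-in estimate of I(Y; Y_hat) in nats from joint label counts."""
    eps = 1e-12
    joint = np.zeros((n_classes, n_classes))
    np.add.at(joint, (y_true, y_pred), 1.0)  # contingency table of (true, predicted)
    joint /= joint.sum()
    # Marginals and their outer product (the independence baseline)
    p_true = joint.sum(axis=1, keepdims=True)
    p_pred = joint.sum(axis=0, keepdims=True)
    independent = p_true @ p_pred
    return float(np.sum(joint * np.log((joint + eps) / (independent + eps))))

# Example: information the predicted labels carry about the true labels (at most log 3 nats)
preds = np.argmax(net.forward(X_test), axis=1)
print(estimate_label_mutual_information(y_test, preds, n_classes=3))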
Practical Implications
The information-theoretic metrics we tracked provide actionable insights:
- If prediction entropy stays high late in training, the model may lack capacity or the learning rate may be too low.
- If prediction entropy drops to near zero very quickly, the model may be memorizing rather than generalizing (check for overfitting).
- If the aggregate class distribution KL is high, the model's overall predicted class frequencies differ from the true frequencies (class imbalance issue).
These diagnostics complement standard training curves and provide deeper insight into model behavior. We will revisit these ideas when studying training deep networks in Chapter 12 and model evaluation in Chapter 8.
Key Takeaways
- Cross-entropy loss has a precise information-theoretic meaning: it is the extra cost of encoding data with the model's distribution instead of the true distribution.
- Tracking prediction entropy during training reveals the model's confidence dynamics -- from uncertain (uniform) to confident (peaked) predictions.
- KL divergence quantifies how far the model is from the truth and decomposes naturally from the cross-entropy loss.
- Temperature scaling provides a post-hoc way to adjust prediction calibration, with direct connections to information-theoretic quantities.
- The information bottleneck perspective frames deep learning as a compression problem, motivating the analysis of information flow through network layers.