Case Study 2: Information-Theoretic Analysis of Neural Network Training

Overview

When we train a neural network, we typically monitor the loss curve and accuracy. But these metrics tell only part of the story. Information theory provides a richer set of tools for understanding what the network is learning and how its predictions evolve during training. In this case study, we use entropy, cross-entropy, KL divergence, and mutual information to dissect the training process of a simple neural network, connecting the theory from Section 4.5 to the practice of deep learning.

By the end of this case study, you will have:

  • Tracked entropy, cross-entropy, and KL divergence throughout training
  • Visualized how the model's predicted distribution evolves toward the true distribution
  • Analyzed the effect of temperature scaling on prediction calibration
  • Connected the cross-entropy loss to its information-theoretic meaning

Background

The Setup

We train a simple two-layer neural network (implemented in pure NumPy) on a synthetic 3-class classification problem. At each epoch, we compute:

  1. Cross-entropy loss $H(p, q_\theta)$: the standard training objective
  2. Entropy of predictions $H(q_\theta)$: how confident is the model?
  3. KL divergence $D_{\text{KL}}(p \| q_\theta)$: how far is the model from the true distribution?
  4. Prediction confidence: the average maximum predicted probability, the flip side of prediction entropy

Mutual information $I(X; \hat{Y})$ enters only qualitatively, through the information bottleneck discussion at the end of the case study; we do not estimate it directly here.

The Relationship

Recall from Section 4.5 that cross-entropy decomposes as:

$$ H(p, q) = H(p) + D_{\text{KL}}(p \| q) $$

Since $H(p)$ is fixed (it depends only on the true distribution), minimizing cross-entropy is equivalent to minimizing KL divergence. But tracking both separately reveals interesting dynamics.
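As a quick numeric check of this decomposition (toy distributions of our own choosing):

import numpy as np

p = np.array([0.7, 0.2, 0.1])  # "true" distribution
q = np.array([0.5, 0.3, 0.2])  # model's distribution

H_p = -np.sum(p * np.log(p))      # entropy H(p)
H_pq = -np.sum(p * np.log(q))     # cross-entropy H(p, q)
D_kl = np.sum(p * np.log(p / q))  # KL divergence D_KL(p || q)

# The decomposition holds up to floating-point error
assert np.isclose(H_pq, H_p + D_kl)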

Implementation

Step 1: Synthetic Data Generation

We create a 3-class classification dataset with three well-separated clusters that overlap just enough to make the problem non-trivial:

import numpy as np
from typing import Tuple


def generate_data(
    n_samples: int = 1500,
    seed: int = 42,
) -> Tuple[np.ndarray, np.ndarray]:
    """Generate a synthetic 3-class classification dataset.

    Creates three clusters of 2D points with some overlap to make
    the classification problem non-trivial.

    Args:
        n_samples: Total number of samples (divided equally among classes).
        seed: Random seed for reproducibility.

    Returns:
        Tuple of (X, y) where X has shape (n_samples, 2)
        and y has shape (n_samples,) with values in {0, 1, 2}.
    """
    rng = np.random.default_rng(seed)
    n_per_class = n_samples // 3

    # Class 0: cluster centered at (-2, -1)
    X0 = rng.normal(loc=[-2, -1], scale=1.0, size=(n_per_class, 2))
    # Class 1: cluster centered at (2, -1)
    X1 = rng.normal(loc=[2, -1], scale=1.0, size=(n_per_class, 2))
    # Class 2: cluster centered at (0, 2)
    X2 = rng.normal(loc=[0, 2], scale=1.0, size=(n_per_class, 2))

    X = np.vstack([X0, X1, X2])
    y = np.concatenate([
        np.zeros(n_per_class, dtype=int),
        np.ones(n_per_class, dtype=int),
        2 * np.ones(n_per_class, dtype=int),
    ])

    # Shuffle
    perm = rng.permutation(len(y))
    return X[perm], y[perm]
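The training loop in Step 4 expects separate train and test sets; the case study does not prescribe a split, so here is a minimal 80/20 split (safe to do by slicing, since generate_data already shuffles):

X, y = generate_data(n_samples=1500, seed=42)

# 80/20 train/test split on the pre-shuffled data
n_train = int(0.8 * len(y))
X_train, y_train = X[:n_train], y[:n_train]
X_test, y_test = X[n_train:], y[n_train:]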

Step 2: Neural Network in NumPy

We implement a minimal two-layer network with softmax output:

def softmax(z: np.ndarray) -> np.ndarray:
    """Compute softmax with numerical stability.

    Args:
        z: Logits of shape (batch_size, n_classes).

    Returns:
        Probabilities of shape (batch_size, n_classes).
    """
    z_shifted = z - np.max(z, axis=1, keepdims=True)
    exp_z = np.exp(z_shifted)
    return exp_z / np.sum(exp_z, axis=1, keepdims=True)


def relu(z: np.ndarray) -> np.ndarray:
    """Rectified linear unit activation.

    Args:
        z: Input array.

    Returns:
        Element-wise max(0, z).
    """
    return np.maximum(0, z)


def relu_derivative(z: np.ndarray) -> np.ndarray:
    """Derivative of ReLU.

    Args:
        z: Input array.

    Returns:
        1 where z > 0, 0 elsewhere.
    """
    return (z > 0).astype(np.float64)


class SimpleNeuralNetwork:
    """A two-layer neural network for classification.

    Architecture: Input -> Dense(hidden_dim, ReLU) -> Dense(n_classes, Softmax)

    Attributes:
        W1: First layer weights, shape (input_dim, hidden_dim).
        b1: First layer biases, shape (hidden_dim,).
        W2: Second layer weights, shape (hidden_dim, n_classes).
        b2: Second layer biases, shape (n_classes,).
    """

    def __init__(
        self,
        input_dim: int,
        hidden_dim: int,
        n_classes: int,
        seed: int = 42,
    ) -> None:
        """Initialize network with Xavier initialization.

        Args:
            input_dim: Number of input features.
            hidden_dim: Number of hidden units.
            n_classes: Number of output classes.
            seed: Random seed for weight initialization.
        """
        rng = np.random.default_rng(seed)

        # Xavier initialization
        scale1 = np.sqrt(2.0 / (input_dim + hidden_dim))
        scale2 = np.sqrt(2.0 / (hidden_dim + n_classes))

        self.W1 = rng.normal(0, scale1, (input_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.normal(0, scale2, (hidden_dim, n_classes))
        self.b2 = np.zeros(n_classes)

    def forward(self, X: np.ndarray) -> np.ndarray:
        """Forward pass through the network.

        Args:
            X: Input data of shape (batch_size, input_dim).

        Returns:
            Predicted probabilities of shape (batch_size, n_classes).
        """
        self._z1 = X @ self.W1 + self.b1
        self._a1 = relu(self._z1)
        self._z2 = self._a1 @ self.W2 + self.b2
        self._probs = softmax(self._z2)
        self._X = X
        return self._probs

    def backward(
        self,
        y: np.ndarray,
        learning_rate: float = 0.01,
    ) -> float:
        """Backward pass with gradient descent update.

        Args:
            y: True labels of shape (batch_size,).
            learning_rate: Step size for gradient descent.

        Returns:
            Cross-entropy loss value.
        """
        batch_size = len(y)

        # One-hot encode labels
        y_onehot = np.zeros_like(self._probs)
        y_onehot[np.arange(batch_size), y] = 1.0

        # Cross-entropy loss
        eps = 1e-12
        loss = -np.mean(np.sum(y_onehot * np.log(self._probs + eps), axis=1))

        # Output layer gradient: dL/dz2 = probs - y_onehot
        dz2 = (self._probs - y_onehot) / batch_size

        dW2 = self._a1.T @ dz2
        db2 = np.sum(dz2, axis=0)

        # Hidden layer gradient
        da1 = dz2 @ self.W2.T
        dz1 = da1 * relu_derivative(self._z1)

        dW1 = self._X.T @ dz1
        db1 = np.sum(dz1, axis=0)

        # Update parameters
        self.W1 -= learning_rate * dW1
        self.b1 -= learning_rate * db1
        self.W2 -= learning_rate * dW2
        self.b2 -= learning_rate * db2

        return loss
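Before training, a quick sanity check is worthwhile (a sketch assuming the X_train and y_train arrays from the split above): with small Xavier-initialized weights the logits sit near zero, so the softmax output should be near-uniform and the initial loss should land close to $\log 3 \approx 1.099$ nats.

net = SimpleNeuralNetwork(input_dim=2, hidden_dim=32, n_classes=3)
net.forward(X_train)

# learning_rate=0.0 computes the loss without changing any weights
loss = net.backward(y_train, learning_rate=0.0)
print(f"Initial loss: {loss:.4f} (log 3 = {np.log(3):.4f})")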

Step 3: Information-Theoretic Metrics

def compute_entropy(probs: np.ndarray) -> float:
    """Compute the average entropy of predicted distributions.

    Args:
        probs: Predicted probabilities, shape (n_samples, n_classes).

    Returns:
        Mean entropy across all predictions (in nats).
    """
    eps = 1e-12
    return -np.mean(np.sum(probs * np.log(probs + eps), axis=1))


def compute_cross_entropy(
    y: np.ndarray,
    probs: np.ndarray,
    n_classes: int,
) -> float:
    """Compute cross-entropy between true labels and predictions.

    Args:
        y: True labels, shape (n_samples,).
        probs: Predicted probabilities, shape (n_samples, n_classes).
        n_classes: Number of classes.

    Returns:
        Cross-entropy value (in nats).
    """
    eps = 1e-12
    y_onehot = np.zeros((len(y), n_classes))
    y_onehot[np.arange(len(y)), y] = 1.0
    return -np.mean(np.sum(y_onehot * np.log(probs + eps), axis=1))


def compute_kl_divergence(
    y: np.ndarray,
    probs: np.ndarray,
    n_classes: int,
) -> float:
    """Compute KL divergence from true distribution to predicted.

    Since the true distribution is one-hot, H(p) = 0 per sample,
    and D_KL(p || q) = H(p, q) - H(p) = H(p, q).

    Args:
        y: True labels, shape (n_samples,).
        probs: Predicted probabilities, shape (n_samples, n_classes).
        n_classes: Number of classes.

    Returns:
        KL divergence value (in nats).
    """
    # For one-hot true distributions, H(p) = 0, so KL = cross-entropy
    return compute_cross_entropy(y, probs, n_classes)


def compute_class_distribution_kl(
    y: np.ndarray,
    probs: np.ndarray,
    n_classes: int,
) -> float:
    """Compute KL divergence between the aggregate predicted and true
    class distributions.

    This measures whether the model's average predictions match the
    true class proportions.

    Args:
        y: True labels.
        probs: Predicted probabilities, shape (n_samples, n_classes).
        n_classes: Number of classes.

    Returns:
        KL divergence between aggregate distributions.
    """
    eps = 1e-12
    # True class distribution
    true_dist = np.bincount(y, minlength=n_classes) / len(y)
    # Average predicted distribution
    pred_dist = np.mean(probs, axis=0)

    return np.sum(true_dist * np.log((true_dist + eps) / (pred_dist + eps)))


def compute_prediction_confidence(probs: np.ndarray) -> float:
    """Compute average maximum predicted probability (confidence).

    Args:
        probs: Predicted probabilities, shape (n_samples, n_classes).

    Returns:
        Mean of the maximum probability across samples.
    """
    return np.mean(np.max(probs, axis=1))
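To see these helpers behave as expected, compare a uniform batch of predictions with a sharply peaked one (toy arrays chosen purely for illustration):

uniform = np.full((4, 3), 1.0 / 3.0)           # maximally uncertain
peaked = np.tile([0.98, 0.01, 0.01], (4, 1))   # highly confident
y_toy = np.zeros(4, dtype=int)

print(compute_entropy(uniform))   # ~1.0986 nats, i.e. log(3)
print(compute_entropy(peaked))    # ~0.1119 nats, close to zero
print(compute_prediction_confidence(uniform))   # ~0.3333
print(compute_cross_entropy(y_toy, peaked, 3))  # ~0.0202 = -log(0.98)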

Step 4: Training with Information-Theoretic Tracking

def train_and_track(
    X_train: np.ndarray,
    y_train: np.ndarray,
    X_test: np.ndarray,
    y_test: np.ndarray,
    n_classes: int = 3,
    hidden_dim: int = 32,
    n_epochs: int = 200,
    learning_rate: float = 0.05,
) -> dict:
    """Train the network while tracking information-theoretic metrics.

    Args:
        X_train: Training features.
        y_train: Training labels.
        X_test: Test features.
        y_test: Test labels.
        n_classes: Number of classes.
        hidden_dim: Hidden layer dimension.
        n_epochs: Number of training epochs.
        learning_rate: Learning rate for gradient descent.

    Returns:
        Dictionary containing metric histories.
    """
    net = SimpleNeuralNetwork(
        input_dim=X_train.shape[1],
        hidden_dim=hidden_dim,
        n_classes=n_classes,
    )

    history = {
        "epoch": [],
        "train_loss": [],
        "test_loss": [],
        "train_accuracy": [],
        "test_accuracy": [],
        "prediction_entropy": [],
        "prediction_confidence": [],
        "class_dist_kl": [],
    }

    for epoch in range(n_epochs):
        # Training step (backward() returns the loss before the update)
        net.forward(X_train)
        loss = net.backward(y_train, learning_rate)

        # Re-forward with the updated parameters for evaluation
        probs_train = net.forward(X_train)
        probs_test = net.forward(X_test)

        train_acc = np.mean(np.argmax(probs_train, axis=1) == y_train)
        test_loss = compute_cross_entropy(y_test, probs_test, n_classes)
        test_acc = np.mean(np.argmax(probs_test, axis=1) == y_test)

        history["epoch"].append(epoch)
        history["train_loss"].append(loss)
        history["test_loss"].append(test_loss)
        history["train_accuracy"].append(train_acc)
        history["test_accuracy"].append(test_acc)
        history["prediction_entropy"].append(
            compute_entropy(probs_test)
        )
        history["prediction_confidence"].append(
            compute_prediction_confidence(probs_test)
        )
        history["class_dist_kl"].append(
            compute_class_distribution_kl(y_test, probs_test, n_classes)
        )

        if epoch % 50 == 0 or epoch == n_epochs - 1:
            print(
                f"Epoch {epoch:4d} | "
                f"Loss: {loss:.4f} | "
                f"Acc: {test_acc:.4f} | "
                f"Entropy: {history['prediction_entropy'][-1]:.4f} | "
                f"Confidence: {history['prediction_confidence'][-1]:.4f}"
            )

    return history
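A minimal driver, assuming the split from Step 1 and that matplotlib is available for the plot (the plotting choices here are ours, not prescribed by the case study):

history = train_and_track(
    X_train, y_train, X_test, y_test,
    n_classes=3, hidden_dim=32, n_epochs=200, learning_rate=0.05,
)

import matplotlib.pyplot as plt

# All three curves are in nats, so they share an axis
plt.plot(history["epoch"], history["train_loss"], label="train cross-entropy")
plt.plot(history["epoch"], history["test_loss"], label="test cross-entropy")
plt.plot(history["epoch"], history["prediction_entropy"], label="prediction entropy")
plt.xlabel("epoch")
plt.ylabel("nats")
plt.legend()
plt.show()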

Analysis and Interpretation

Phase 1: Early Training (Epochs 0-20)

In the earliest epochs, the network's weights are random. The predicted distribution $q_\theta$ is close to uniform over the three classes, giving:

  • High prediction entropy $\approx \log 3 \approx 1.099$ (maximum for 3 classes)
  • High cross-entropy (the loss starts high)
  • Low confidence $\approx 0.33$ (maximum probability close to $1/3$)

This is the information-theoretic interpretation of an untrained model: it carries almost no information about the input-output relationship.

Phase 2: Rapid Learning (Epochs 20-100)

During the rapid learning phase:

  • Cross-entropy drops sharply as the model learns the dominant patterns
  • Prediction entropy decreases as the model becomes more confident
  • Accuracy increases rapidly
  • Confidence increases as the maximum predicted probability grows

The cross-entropy decomposes as $H(p, q) = H(p) + D_{\text{KL}}(p \| q)$. Since $H(p) = 0$ for one-hot labels, all reduction in cross-entropy corresponds directly to reduction in KL divergence.

Phase 3: Refinement (Epochs 100+)

In later epochs:

  • Cross-entropy continues to decrease slowly as the model refines decision boundaries
  • Prediction entropy may decrease further, potentially leading to overconfidence
  • The gap between train and test loss may widen (overfitting)

This is where the information bottleneck theory (Exercise 4.27) becomes relevant: the network compresses information about the input while retaining information about the output.

Temperature Calibration Analysis

After training, we can analyze prediction calibration using temperature scaling:

def temperature_analysis(
    net: SimpleNeuralNetwork,
    X: np.ndarray,
    y: np.ndarray,
    temperatures: np.ndarray,
    n_classes: int = 3,
) -> dict:
    """Analyze the effect of temperature scaling on predictions.

    Args:
        net: Trained neural network.
        X: Input features.
        y: True labels.
        temperatures: Array of temperature values to test.
        n_classes: Number of classes.

    Returns:
        Dictionary with metrics at each temperature.
    """
    # Get logits (pre-softmax values)
    _ = net.forward(X)
    logits = net._z2

    results = {
        "temperature": [],
        "entropy": [],
        "cross_entropy": [],
        "accuracy": [],
        "confidence": [],
    }

    for T in temperatures:
        scaled_probs = softmax(logits / T)
        preds = np.argmax(scaled_probs, axis=1)

        results["temperature"].append(T)
        results["entropy"].append(compute_entropy(scaled_probs))
        results["cross_entropy"].append(
            compute_cross_entropy(y, scaled_probs, n_classes)
        )
        results["accuracy"].append(np.mean(preds == y))
        results["confidence"].append(
            compute_prediction_confidence(scaled_probs)
        )

    return results
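A usage sketch (the temperature grid is our own choice; since train_and_track returns only the history, not the network, we train one directly here with the same settings):

net = SimpleNeuralNetwork(input_dim=2, hidden_dim=32, n_classes=3)
for _ in range(200):
    net.forward(X_train)
    net.backward(y_train, learning_rate=0.05)

temps = np.linspace(0.25, 4.0, 16)
results = temperature_analysis(net, X_test, y_test, temps, n_classes=3)

# Scaling by a positive temperature never changes the argmax, so accuracy
# stays flat; cross-entropy, by contrast, is minimized at the best-calibrated T
best = int(np.argmin(results["cross_entropy"]))
print(f"Lowest test cross-entropy at T = {results['temperature'][best]:.2f}")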

The temperature analysis reveals:

  • Low temperature ($T < 1$): Predictions become sharper (lower entropy, higher confidence). Accuracy is preserved but the model becomes overconfident.
  • High temperature ($T > 1$): Predictions become softer (higher entropy, lower confidence). The distribution approaches uniform.
  • Optimal temperature: Minimizes the gap between predicted confidence and actual accuracy (calibration).

This directly connects to the temperature parameter in language model decoding (Chapter 21): low temperature for deterministic outputs, high temperature for diverse generation.

Discussion

Information-Theoretic View of Generalization

The cross-entropy on the test set provides an information-theoretic view of generalization:

$$ H(p_{\text{test}}, q_\theta) = H(p_{\text{test}}) + D_{\text{KL}}(p_{\text{test}} \| q_\theta) $$

When the test cross-entropy is much higher than the training cross-entropy, the model's learned distribution $q_\theta$ is far from the true test distribution -- this is overfitting, quantified in KL divergence terms.
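Because the test labels are one-hot, $H(p_{\text{test}}) = 0$ and the test cross-entropy is itself a KL divergence, so this gap can be read directly from the tracked history:

# Final train/test gap, measured in nats of KL divergence
gap = history["test_loss"][-1] - history["train_loss"][-1]
print(f"Generalization gap: {gap:.4f} nats")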

Connections to the Information Bottleneck

Tishby and Zaslavsky (2015) proposed that deep learning can be understood through the information bottleneck: each layer creates a compressed representation $T$ that balances two competing objectives:

  1. Compression: Minimize $I(X; T)$ -- discard irrelevant input information
  2. Prediction: Maximize $I(T; Y)$ -- retain information about the target

This framework, while debated, provides an elegant information-theoretic perspective on what neural networks learn. The entropy tracking we performed in this case study is a simplified version of this analysis.

Practical Implications

The information-theoretic metrics we tracked provide actionable insights:

  • If prediction entropy stays high late in training, the model may lack capacity or the learning rate may be too low.
  • If prediction entropy drops to near zero very quickly, the model may be memorizing rather than generalizing (check for overfitting).
  • If the aggregate class distribution KL is high, the model's overall predicted class frequencies differ from the true frequencies (class imbalance issue).

These diagnostics complement standard training curves and provide deeper insight into model behavior. We will revisit these ideas when studying training deep networks in Chapter 12 and model evaluation in Chapter 8.

Key Takeaways

  1. Cross-entropy loss has a precise information-theoretic meaning: it is the extra cost of encoding data with the model's distribution instead of the true distribution.
  2. Tracking prediction entropy during training reveals the model's confidence dynamics -- from uncertain (uniform) to confident (peaked) predictions.
  3. KL divergence quantifies how far the model is from the truth and decomposes naturally from the cross-entropy loss.
  4. Temperature scaling provides a post-hoc way to adjust prediction calibration, with direct connections to information-theoretic quantities.
  5. The information bottleneck perspective frames deep learning as a compression problem, motivating the analysis of information flow through network layers.