Case Study 1: Analyzing Scaling Laws Empirically

Overview

In this case study, we replicate the core methodology behind neural scaling laws by training a family of small Transformer language models at different scales and analyzing the resulting loss curves. While we cannot match the compute budgets of Kaplan et al. or Hoffmann et al., we can observe the same qualitative phenomena (power-law scaling, compute-optimal allocation, and diminishing returns) at a miniature scale.

By working through this case study, you will develop hands-on intuition for how model size, dataset size, and compute interact, and you will practice the curve-fitting techniques used to derive scaling laws.

Learning Objectives

  • Train multiple Transformer language models of varying sizes on the same dataset.
  • Measure validation loss as a function of model size, dataset size, and compute.
  • Fit power-law curves to the resulting data and extract scaling exponents.
  • Determine the compute-optimal model size for a given budget.
  • Visualize scaling relationships on log-log plots.

Background

Kaplan et al. fit a pure power law in model size; here we follow the common extension that adds an explicit irreducible-loss term, under which loss $L$ relates to model size $N$ as:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N} + L_\infty$$

where $\alpha_N$ is the scaling exponent, $N_c$ is a characteristic scale, and $L_\infty$ is the irreducible loss. On a log-log plot of $L - L_\infty$ vs. $N$, this appears as a straight line with slope $-\alpha_N$.

Similarly, for dataset size:

$$L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D} + L_\infty$$
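The same linearization applies to both laws: on log-log axes the curve is straight only after the irreducible loss is subtracted. Below is a quick synthetic check with illustrative constants (loosely based on Kaplan et al.'s fitted values, with a floor added); the constants are for demonstration only, not fitted to any data.

"""Synthetic check of the log-log linearity claim."""

import numpy as np

alpha_true, N_c, L_inf = 0.076, 8.8e13, 1.69  # illustrative constants
N = np.logspace(5, 9, 20)
L = (N_c / N) ** alpha_true + L_inf

# Regressing log10(L - L_inf) on log10(N) recovers the exponent exactly;
# regressing log10(L) directly is biased toward zero by the floor.
slope_sub, _ = np.polyfit(np.log10(N), np.log10(L - L_inf), 1)
slope_raw, _ = np.polyfit(np.log10(N), np.log10(L), 1)
print(f"slope with L_inf subtracted: {slope_sub:.4f} (expect {-alpha_true})")
print(f"slope without subtraction:   {slope_raw:.4f}")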

Part 1: Dataset Preparation

We use a character-level language modeling setup on the TinyShakespeare dataset. Character-level modeling lets us train quickly while still observing meaningful scaling behavior.

"""Dataset preparation for scaling law experiments.

Downloads TinyShakespeare and creates training/validation splits
with configurable dataset sizes for scaling experiments.
"""

import os
import urllib.request
from typing import Tuple

import torch
from torch.utils.data import Dataset

torch.manual_seed(42)


def download_data(filepath: str = "shakespeare.txt") -> str:
    """Download TinyShakespeare dataset.

    Args:
        filepath: Local path to save the text file.

    Returns:
        The full text as a string.
    """
    url = (
        "https://raw.githubusercontent.com/karpathy/"
        "char-rnn/master/data/tinyshakespeare/input.txt"
    )
    if not os.path.exists(filepath):
        urllib.request.urlretrieve(url, filepath)
    with open(filepath, "r", encoding="utf-8") as f:
        return f.read()


class CharDataset(Dataset):
    """Character-level language modeling dataset.

    Args:
        text: The input text string.
        block_size: The context window length.
        max_tokens: Maximum number of tokens to use (for data scaling).
        vocab_text: Optional text to build the vocabulary from. Pass the
            full corpus so that train and validation splits share one
            character-to-index mapping.
    """

    def __init__(
        self,
        text: str,
        block_size: int = 128,
        max_tokens: int | None = None,
        vocab_text: str | None = None,
    ) -> None:
        # Build the vocabulary from the full corpus when provided; building
        # it per-split would yield inconsistent token IDs between splits.
        chars = sorted(set(vocab_text if vocab_text is not None else text))
        self.stoi = {ch: i for i, ch in enumerate(chars)}
        self.itos = {i: ch for ch, i in self.stoi.items()}
        self.vocab_size = len(chars)
        self.block_size = block_size

        data = [self.stoi[ch] for ch in text]
        if max_tokens is not None:
            data = data[:max_tokens]
        self.data = torch.tensor(data, dtype=torch.long)

    def __len__(self) -> int:
        return max(0, len(self.data) - self.block_size)

    def __getitem__(self, idx: int) -> Tuple[torch.Tensor, torch.Tensor]:
        chunk = self.data[idx : idx + self.block_size + 1]
        x = chunk[:-1]
        y = chunk[1:]
        return x, y


# Quick test
text = download_data()
dataset = CharDataset(text, block_size=128, max_tokens=100_000)
print(f"Vocabulary size: {dataset.vocab_size}")
print(f"Dataset length: {len(dataset)} samples")
print(f"Total characters: {len(text):,}")

Part 2: Model Family Definition

We define a family of GPT-style models at different scales, systematically varying the number of layers, hidden dimension, and attention heads while keeping the architecture consistent.

"""Model family definition for scaling experiments.

Defines a configurable mini-GPT architecture and a set of
model configurations spanning multiple scales.
"""

from dataclasses import dataclass
from typing import Dict, Tuple

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(42)


@dataclass
class ModelConfig:
    """Configuration for a mini-GPT model.

    Args:
        n_layers: Number of Transformer blocks.
        n_heads: Number of attention heads.
        d_model: Hidden dimension.
        block_size: Maximum sequence length.
        vocab_size: Vocabulary size.
        dropout: Dropout rate.
    """
    n_layers: int
    n_heads: int
    d_model: int
    block_size: int = 128
    vocab_size: int = 65  # TinyShakespeare character set
    dropout: float = 0.0

    @property
    def n_params(self) -> int:
        """Estimate total non-embedding parameters."""
        # Attention: Q, K, V projections + output projection
        attn = 4 * self.d_model * self.d_model
        # FFN: two linear layers with 4x expansion
        ffn = 2 * self.d_model * (4 * self.d_model)
        # Layer norms (2 per layer)
        ln = 2 * 2 * self.d_model
        # Per-layer total
        per_layer = attn + ffn + ln
        # Total non-embedding
        return self.n_layers * per_layer


# Model family spanning ~3 orders of magnitude (~100K to ~113M non-embedding params)
MODEL_CONFIGS: Dict[str, ModelConfig] = {
    "tiny":   ModelConfig(n_layers=2,  n_heads=2,  d_model=64),    # ~100K
    "small":  ModelConfig(n_layers=4,  n_heads=4,  d_model=128),   # ~800K
    "medium": ModelConfig(n_layers=6,  n_heads=6,  d_model=192),   # ~2.7M
    "large":  ModelConfig(n_layers=8,  n_heads=8,  d_model=256),   # ~6.3M
    "xlarge": ModelConfig(n_layers=12, n_heads=8,  d_model=512),   # ~38M
    "xxl":    ModelConfig(n_layers=16, n_heads=16, d_model=768),   # ~113M
}

for name, cfg in MODEL_CONFIGS.items():
    print(f"{name:8s}: {cfg.n_params:>12,} non-embedding params")


class CausalSelfAttention(nn.Module):
    """Multi-head causal self-attention.

    Args:
        config: Model configuration.
    """

    def __init__(self, config: ModelConfig) -> None:
        super().__init__()
        assert config.d_model % config.n_heads == 0
        self.n_heads = config.n_heads
        self.d_model = config.d_model
        self.head_dim = config.d_model // config.n_heads

        self.qkv = nn.Linear(config.d_model, 3 * config.d_model)
        self.proj = nn.Linear(config.d_model, config.d_model)
        self.dropout = nn.Dropout(config.dropout)

        mask = torch.tril(torch.ones(config.block_size, config.block_size))
        self.register_buffer("mask", mask.view(1, 1, config.block_size, config.block_size))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.shape
        qkv = self.qkv(x)
        q, k, v = qkv.split(self.d_model, dim=2)

        q = q.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = k.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = v.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)

        att = (q @ k.transpose(-2, -1)) * (self.head_dim ** -0.5)
        att = att.masked_fill(self.mask[:, :, :T, :T] == 0, float("-inf"))
        att = F.softmax(att, dim=-1)
        att = self.dropout(att)

        y = att @ v
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.proj(y)


class TransformerBlock(nn.Module):
    """Pre-norm Transformer block.

    Args:
        config: Model configuration.
    """

    def __init__(self, config: ModelConfig) -> None:
        super().__init__()
        self.ln1 = nn.LayerNorm(config.d_model)
        self.attn = CausalSelfAttention(config)
        self.ln2 = nn.LayerNorm(config.d_model)
        self.ffn = nn.Sequential(
            nn.Linear(config.d_model, 4 * config.d_model),
            nn.GELU(),
            nn.Linear(4 * config.d_model, config.d_model),
            nn.Dropout(config.dropout),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln1(x))
        x = x + self.ffn(self.ln2(x))
        return x


class MiniGPT(nn.Module):
    """Mini-GPT language model for scaling experiments.

    Args:
        config: Model configuration.
    """

    def __init__(self, config: ModelConfig) -> None:
        super().__init__()
        self.config = config
        self.tok_emb = nn.Embedding(config.vocab_size, config.d_model)
        self.pos_emb = nn.Embedding(config.block_size, config.d_model)
        self.blocks = nn.Sequential(
            *[TransformerBlock(config) for _ in range(config.n_layers)]
        )
        self.ln_f = nn.LayerNorm(config.d_model)
        self.head = nn.Linear(config.d_model, config.vocab_size, bias=False)

        # Weight tying
        self.head.weight = self.tok_emb.weight
        self.apply(self._init_weights)

    def _init_weights(self, module: nn.Module) -> None:
        if isinstance(module, nn.Linear):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(
        self, idx: torch.Tensor, targets: torch.Tensor | None = None
    ) -> Tuple[torch.Tensor, torch.Tensor | None]:
        B, T = idx.shape
        tok = self.tok_emb(idx)
        pos = self.pos_emb(torch.arange(T, device=idx.device))
        x = tok + pos
        x = self.blocks(x)
        x = self.ln_f(x)
        logits = self.head(x)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)), targets.view(-1)
            )
        return logits, loss

    def count_parameters(self) -> int:
        """Count total trainable parameters."""
        return sum(p.numel() for p in self.parameters() if p.requires_grad)


# Verify parameter counts (actual totals include embeddings, biases, and the
# final LayerNorm, so they run somewhat higher than the estimates)
for name, cfg in MODEL_CONFIGS.items():
    model = MiniGPT(cfg)
    actual = model.count_parameters()
    print(f"{name:8s}: estimated={cfg.n_params:>12,}  actual={actual:>12,}")

Part 3: Training Loop and Scaling Experiments

We train each model configuration and record the validation loss, then fit scaling curves.

"""Training loop for scaling law experiments.

Trains multiple model sizes and records loss curves for analysis.
"""

import time
from typing import Dict, List

import torch
from torch.utils.data import DataLoader

torch.manual_seed(42)

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"


def train_model(
    config: ModelConfig,
    train_dataset: CharDataset,
    val_dataset: CharDataset,
    max_steps: int = 2000,
    batch_size: int = 64,
    learning_rate: float = 3e-4,
    eval_interval: int = 200,
) -> Dict[str, List[float]]:
    """Train a single model and return loss history.

    Args:
        config: Model configuration.
        train_dataset: Training dataset.
        val_dataset: Validation dataset.
        max_steps: Maximum number of training steps.
        batch_size: Training batch size.
        learning_rate: Adam learning rate.
        eval_interval: Steps between evaluations.

    Returns:
        Dictionary with 'train_losses', 'val_losses', and 'steps'.
    """
    torch.manual_seed(42)
    model = MiniGPT(config).to(DEVICE)
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
    train_loader = DataLoader(
        train_dataset, batch_size=batch_size, shuffle=True, drop_last=True
    )

    history = {"train_losses": [], "val_losses": [], "steps": []}
    train_iter = iter(train_loader)
    step = 0

    while step < max_steps:
        model.train()
        try:
            x, y = next(train_iter)
        except StopIteration:
            train_iter = iter(train_loader)
            x, y = next(train_iter)

        x, y = x.to(DEVICE), y.to(DEVICE)
        _, loss = model(x, y)
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        step += 1

        if step % eval_interval == 0:
            val_loss = evaluate(model, val_dataset, batch_size)
            history["train_losses"].append(loss.item())
            history["val_losses"].append(val_loss)
            history["steps"].append(step)
            n_params = model.count_parameters()
            print(
                f"  Step {step:5d} | "
                f"Train Loss: {loss.item():.4f} | "
                f"Val Loss: {val_loss:.4f} | "
                f"Params: {n_params:,}"
            )

    return history


@torch.no_grad()
def evaluate(
    model: MiniGPT, dataset: CharDataset, batch_size: int = 64
) -> float:
    """Evaluate model on a dataset.

    Args:
        model: The model to evaluate.
        dataset: Evaluation dataset.
        batch_size: Batch size for evaluation.

    Returns:
        Average cross-entropy loss.
    """
    model.eval()
    loader = DataLoader(dataset, batch_size=batch_size, drop_last=True)
    total_loss = 0.0
    n_batches = 0

    for x, y in loader:
        x, y = x.to(DEVICE), y.to(DEVICE)
        _, loss = model(x, y)
        total_loss += loss.item()
        n_batches += 1
        # Cap eval batches for speed. Successive stride-1 windows overlap
        # heavily, so this is a quick estimate rather than a full pass.
        if n_batches >= 20:
            break

    return total_loss / max(n_batches, 1)


# Prepare data
text = download_data()
n = int(0.9 * len(text))
train_text, val_text = text[:n], text[n:]

# Run scaling experiment across model sizes
results = {}
configs_to_train = ["tiny", "small", "medium", "large"]  # Use subset for speed

for name in configs_to_train:
    cfg = MODEL_CONFIGS[name]
    print(f"\n{'='*60}")
    print(f"Training {name} model ({cfg.n_params:,} non-embedding params)")
    print(f"{'='*60}")

    train_ds = CharDataset(train_text, block_size=cfg.block_size, vocab_text=text)
    val_ds = CharDataset(val_text, block_size=cfg.block_size, vocab_text=text)

    t0 = time.time()
    history = train_model(cfg, train_ds, val_ds, max_steps=2000)
    elapsed = time.time() - t0

    results[name] = {
        "config": cfg,
        "history": history,
        "n_params": cfg.n_params,
        "final_val_loss": history["val_losses"][-1],
        "training_time": elapsed,
    }
    print(f"Training time: {elapsed:.1f}s")

Part 4: Fitting Scaling Curves

"""Scaling curve fitting and visualization.

Fits power-law curves to the empirical loss data and
produces log-log scaling plots.
"""

import numpy as np

# Extract data points
param_counts = [results[n]["n_params"] for n in configs_to_train]
final_losses = [results[n]["final_val_loss"] for n in configs_to_train]

log_params = np.log10(param_counts)
log_losses = np.log10(final_losses)

# Fit a pure power law via linear regression on log-log data:
#   log10 L = -alpha_N * log10 N + intercept
# Note: this neglects the irreducible term L_inf from the Background
# (see the three-parameter fit sketched below).
coeffs = np.polyfit(log_params, log_losses, 1)
alpha_N = -coeffs[0]
intercept = coeffs[1]  # equals alpha_N * log10(N_c) under the pure power law

print(f"Fitted scaling exponent alpha_N = {alpha_N:.4f}")
print(f"Kaplan et al. reported alpha_N ≈ 0.076")
print(f"(Exact match is not expected at this scale)")
print()

# Predict loss for larger models
for target_params in [1e7, 1e8, 1e9]:
    predicted_log_loss = coeffs[0] * np.log10(target_params) + coeffs[1]
    predicted_loss = 10 ** predicted_log_loss
    print(f"Predicted loss at N={target_params:.0e}: {predicted_loss:.4f}")

# Visualization (if matplotlib available)
try:
    import matplotlib.pyplot as plt

    fig, axes = plt.subplots(1, 2, figsize=(14, 5))

    # Log-log scaling plot
    ax = axes[0]
    ax.scatter(param_counts, final_losses, s=80, zorder=5, color="steelblue")
    fit_x = np.logspace(
        np.log10(min(param_counts)) - 0.5,
        np.log10(max(param_counts)) + 1.0,
        100,
    )
    fit_y = 10 ** (coeffs[0] * np.log10(fit_x) + coeffs[1])
    ax.plot(fit_x, fit_y, "--", color="coral", label=f"Power law (alpha={alpha_N:.3f})")
    for name, pc, fl in zip(configs_to_train, param_counts, final_losses):
        ax.annotate(name, (pc, fl), textcoords="offset points", xytext=(8, 5))
    ax.set_xscale("log")
    ax.set_yscale("log")
    ax.set_xlabel("Non-embedding Parameters")
    ax.set_ylabel("Validation Loss")
    ax.set_title("Scaling Law: Loss vs. Model Size")
    ax.legend()
    ax.grid(True, alpha=0.3)

    # Training curves for all models
    ax = axes[1]
    for name in configs_to_train:
        h = results[name]["history"]
        ax.plot(h["steps"], h["val_losses"], label=f"{name} ({results[name]['n_params']:,})")
    ax.set_xlabel("Training Steps")
    ax.set_ylabel("Validation Loss")
    ax.set_title("Training Curves by Model Size")
    ax.legend()
    ax.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.savefig("scaling_law_results.png", dpi=150, bbox_inches="tight")
    plt.show()
    print("Plot saved to scaling_law_results.png")
except ImportError:
    print("matplotlib not available; skipping visualization")

Part 5: Data Scaling Experiment

We fix the model size and vary the amount of training data to observe the data scaling law. One caveat: with a fixed budget of 2,000 steps, small data fractions are revisited for many epochs, so the measured trend mixes the pure data-scaling law with overfitting effects.

"""Data scaling experiment.

Fixes model size and varies training data to observe
the data-dependent scaling law.
"""

# Fix model config (medium)
data_config = MODEL_CONFIGS["medium"]
data_fractions = [0.05, 0.10, 0.25, 0.50, 1.0]

data_results = {}
for frac in data_fractions:
    max_tokens = int(len(train_text) * frac)
    train_ds = CharDataset(
        train_text,
        block_size=data_config.block_size,
        max_tokens=max_tokens,
        vocab_text=text,
    )
    val_ds = CharDataset(val_text, block_size=data_config.block_size, vocab_text=text)

    print(f"\nData fraction: {frac:.0%} ({max_tokens:,} tokens)")
    history = train_model(data_config, train_ds, val_ds, max_steps=2000)

    data_results[frac] = {
        "n_tokens": max_tokens,
        "final_val_loss": history["val_losses"][-1],
        "history": history,
    }

# Fit data scaling law
data_tokens = [data_results[f]["n_tokens"] for f in data_fractions]
data_losses = [data_results[f]["final_val_loss"] for f in data_fractions]

log_tokens = np.log10(data_tokens)
log_data_losses = np.log10(data_losses)

data_coeffs = np.polyfit(log_tokens, log_data_losses, 1)
alpha_D = -data_coeffs[0]

print(f"\nFitted data scaling exponent alpha_D = {alpha_D:.4f}")
print(f"Chinchilla reported alpha_D ≈ 0.28 (in the full loss model)")

Discussion Questions

  1. Exponent comparison: How does your fitted $\alpha_N$ compare to the Kaplan value of 0.076? What factors might cause the discrepancy at small scale?

  2. Chinchilla validation: Based on your model-size and data-size scaling experiments, what is the approximate optimal tokens-per-parameter ratio for your setup? Does it align more with Kaplan (~2) or Chinchilla (~20)?

  3. Extrapolation reliability: If you extrapolate your fitted curve to predict the loss of a 1B-parameter model, how confident are you in that prediction? What could go wrong?

  4. Compute budget: If you had a fixed budget of 100 GPU-hours on an A100, how would you allocate it between model size and training duration? Use your fitted curves to justify your answer.

  5. Irreducible loss: Estimate the irreducible loss for character-level Shakespeare modeling. What is the theoretical minimum cross-entropy for this domain, and how does it compare to the entropy of English text (~1.0 bits per character)? Remember that the losses reported above are in nats; see the conversion sketch below.
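For question 5, note that F.cross_entropy uses the natural logarithm, so the validation losses recorded in Part 3 are in nats per character. A small conversion sketch:

# Convert validation losses from nats/char to bits/char for comparison
# with entropy estimates quoted in bits.
import math

for name in configs_to_train:
    nats = results[name]["final_val_loss"]
    print(f"{name:8s}: {nats:.3f} nats/char = {nats / math.log(2):.3f} bits/char")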