Case Study 1: Building a Mini-GPT from Scratch
Overview
In this case study, we build a complete mini-GPT language model from scratch using PyTorch, train it on Shakespeare's works, and use it to generate new text in the Bard's style. This hands-on project covers every component of the decoder-only Transformer architecture: token and positional embeddings, multi-head causal self-attention, position-wise feed-forward networks with GELU activation, pre-norm layer normalization, weight tying, and the complete training and generation pipeline.
By working through this case study, you will solidify your understanding of every architectural decision discussed in Chapter 21 and gain practical experience with the challenges of training and generating from autoregressive models.
Learning Objectives
- Build each component of the GPT architecture from first principles.
- Understand how the components fit together into a complete model.
- Train the model on real text data and observe the learning dynamics.
- Experiment with different generation strategies and observe their effects on output quality.
- Analyze training curves and diagnose common issues.
Dataset: Shakespeare's Complete Works
We use the "tiny Shakespeare" subset of Shakespeare's works, a classic choice for character-level language modeling: the text's distinctive style makes it easy to judge whether the model has learned meaningful patterns.
import urllib.request
import os
def download_shakespeare(filepath: str = "shakespeare.txt") -> str:
"""Download Shakespeare's complete works from a public URL.
Args:
filepath: Local path to save the text file.
Returns:
The full text as a string.
"""
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
if not os.path.exists(filepath):
urllib.request.urlretrieve(url, filepath)
with open(filepath, "r", encoding="utf-8") as f:
text = f.read()
return text
text = download_shakespeare()
print(f"Total characters: {len(text):,}")
print(f"Unique characters: {len(set(text))}")
print(f"First 200 characters:\n{text[:200]}")
Step 1: Character-Level Tokenizer
For this project, we use character-level tokenization. Each unique character in the text becomes a token. This keeps the vocabulary small and makes the model easy to train on a single GPU.
class CharTokenizer:
"""A simple character-level tokenizer.
Attributes:
chars: Sorted list of unique characters.
vocab_size: Number of unique characters.
stoi: Dictionary mapping characters to indices.
itos: Dictionary mapping indices to characters.
"""
def __init__(self, text: str) -> None:
self.chars = sorted(list(set(text)))
self.vocab_size = len(self.chars)
self.stoi = {ch: i for i, ch in enumerate(self.chars)}
self.itos = {i: ch for i, ch in enumerate(self.chars)}
def encode(self, text: str) -> list[int]:
"""Encode a string into a list of token indices."""
return [self.stoi[c] for c in text]
def decode(self, indices: list[int]) -> str:
"""Decode a list of token indices into a string."""
return "".join(self.itos[i] for i in indices)
tokenizer = CharTokenizer(text)
print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Characters: {''.join(tokenizer.chars)}")
# Test encode/decode
sample = "Hello, World!"
encoded = tokenizer.encode(sample)
decoded = tokenizer.decode(encoded)
assert decoded == sample, "Encode/decode round-trip failed!"
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")
Step 2: Data Preparation
import torch
from torch.utils.data import Dataset, DataLoader
torch.manual_seed(42)
class ShakespeareDataset(Dataset):
"""Dataset that produces chunks of tokenized Shakespeare text.
Args:
data: Tensor of token indices for the full text.
block_size: Length of each training sequence.
"""
def __init__(self, data: torch.Tensor, block_size: int) -> None:
self.data = data
self.block_size = block_size
def __len__(self) -> int:
return len(self.data) - self.block_size
def __getitem__(self, idx: int) -> tuple[torch.Tensor, torch.Tensor]:
x = self.data[idx: idx + self.block_size]
y = self.data[idx + 1: idx + self.block_size + 1]
return x, y
# Encode the full text
data = torch.tensor(tokenizer.encode(text), dtype=torch.long)
# Train/validation split (90/10)
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]
block_size = 256
train_dataset = ShakespeareDataset(train_data, block_size)
val_dataset = ShakespeareDataset(val_data, block_size)
print(f"Training sequences: {len(train_dataset):,}")
print(f"Validation sequences: {len(val_dataset):,}")
Step 3: Model Architecture
We now assemble the full mini-GPT architecture from the components developed in Chapter 21, gathered here into a single, self-contained module.
import torch
import torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass
@dataclass
class GPTConfig:
"""Configuration for the mini-GPT model."""
vocab_size: int = 65 # Shakespeare character vocabulary
block_size: int = 256
n_layer: int = 6
n_head: int = 6
n_embd: int = 384
dropout: float = 0.2
class CausalSelfAttention(nn.Module):
"""Multi-head causal self-attention."""
def __init__(self, config: GPTConfig) -> None:
super().__init__()
assert config.n_embd % config.n_head == 0
self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
self.c_proj = nn.Linear(config.n_embd, config.n_embd)
self.attn_dropout = nn.Dropout(config.dropout)
self.resid_dropout = nn.Dropout(config.dropout)
self.n_head = config.n_head
self.n_embd = config.n_embd
self.register_buffer(
"bias",
torch.tril(torch.ones(config.block_size, config.block_size))
.view(1, 1, config.block_size, config.block_size),
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
B, T, C = x.size()
qkv = self.c_attn(x)
q, k, v = qkv.split(self.n_embd, dim=2)
k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
att = (q @ k.transpose(-2, -1)) * (1.0 / (k.size(-1) ** 0.5))
att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
att = F.softmax(att, dim=-1)
att = self.attn_dropout(att)
y = att @ v
y = y.transpose(1, 2).contiguous().view(B, T, C)
y = self.resid_dropout(self.c_proj(y))
return y
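As an aside, PyTorch 2.0 and later ship torch.nn.functional.scaled_dot_product_attention, which fuses the masking, softmax, and dropout above into one optimized kernel. The subclass below is a hedged sketch of that alternative, assuming PyTorch >= 2.0; it is not used in the rest of this case study.
class FusedCausalSelfAttention(CausalSelfAttention):
    """Same computation via F.scaled_dot_product_attention (PyTorch >= 2.0)."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # is_causal=True applies the lower-triangular mask inside the kernel,
        # so the explicit "bias" buffer is not needed on this path.
        y = F.scaled_dot_product_attention(
            q, k, v,
            dropout_p=self.attn_dropout.p if self.training else 0.0,
            is_causal=True,
        )
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.resid_dropout(self.c_proj(y))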
class FeedForward(nn.Module):
"""Position-wise feed-forward network with GELU."""
def __init__(self, config: GPTConfig) -> None:
super().__init__()
self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)
self.dropout = nn.Dropout(config.dropout)
def forward(self, x: torch.Tensor) -> torch.Tensor:
x = F.gelu(self.c_fc(x))
x = self.dropout(self.c_proj(x))
return x
class TransformerBlock(nn.Module):
"""Pre-norm Transformer decoder block."""
def __init__(self, config: GPTConfig) -> None:
super().__init__()
self.ln_1 = nn.LayerNorm(config.n_embd)
self.attn = CausalSelfAttention(config)
self.ln_2 = nn.LayerNorm(config.n_embd)
self.mlp = FeedForward(config)
def forward(self, x: torch.Tensor) -> torch.Tensor:
x = x + self.attn(self.ln_1(x))
x = x + self.mlp(self.ln_2(x))
return x
class MiniGPT(nn.Module):
"""Complete mini-GPT language model."""
def __init__(self, config: GPTConfig) -> None:
super().__init__()
self.config = config
self.transformer = nn.ModuleDict(dict(
wte=nn.Embedding(config.vocab_size, config.n_embd),
wpe=nn.Embedding(config.block_size, config.n_embd),
drop=nn.Dropout(config.dropout),
h=nn.ModuleList([
TransformerBlock(config) for _ in range(config.n_layer)
]),
ln_f=nn.LayerNorm(config.n_embd),
))
self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
self.transformer.wte.weight = self.lm_head.weight
self.apply(self._init_weights)
def _init_weights(self, module: nn.Module) -> None:
if isinstance(module, nn.Linear):
torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
if module.bias is not None:
torch.nn.init.zeros_(module.bias)
elif isinstance(module, nn.Embedding):
torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
def forward(
self,
idx: torch.Tensor,
targets: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor | None]:
B, T = idx.size()
pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
tok_emb = self.transformer.wte(idx)
pos_emb = self.transformer.wpe(pos)
x = self.transformer.drop(tok_emb + pos_emb)
for block in self.transformer.h:
x = block(x)
x = self.transformer.ln_f(x)
logits = self.lm_head(x)
loss = None
if targets is not None:
loss = F.cross_entropy(
logits.view(-1, logits.size(-1)), targets.view(-1)
)
return logits, loss
def count_parameters(self) -> int:
return sum(p.numel() for p in self.parameters() if p.requires_grad)
# Create the model
config = GPTConfig(vocab_size=tokenizer.vocab_size)
model = MiniGPT(config)
print(f"Model parameters: {model.count_parameters():,}")
Step 4: Training
def train(
model: MiniGPT,
train_dataset: ShakespeareDataset,
val_dataset: ShakespeareDataset,
epochs: int = 10,
batch_size: int = 64,
learning_rate: float = 3e-4,
) -> dict[str, list[float]]:
"""Train the mini-GPT model with validation.
Args:
model: The model to train.
train_dataset: Training dataset.
val_dataset: Validation dataset.
epochs: Number of training epochs.
batch_size: Batch size for training.
learning_rate: Peak learning rate.
Returns:
Dictionary with training and validation loss histories.
"""
torch.manual_seed(42)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
optimizer = torch.optim.AdamW(
model.parameters(), lr=learning_rate, weight_decay=0.01
)
train_loader = DataLoader(
train_dataset, batch_size=batch_size, shuffle=True, drop_last=True
)
val_loader = DataLoader(
val_dataset, batch_size=batch_size, shuffle=False, drop_last=True
)
history = {"train_loss": [], "val_loss": []}
for epoch in range(epochs):
# Training
model.train()
total_train_loss = 0.0
num_train_batches = 0
for x, y in train_loader:
x, y = x.to(device), y.to(device)
_, loss = model(x, targets=y)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
total_train_loss += loss.item()
num_train_batches += 1
avg_train_loss = total_train_loss / num_train_batches
# Validation
model.eval()
total_val_loss = 0.0
num_val_batches = 0
with torch.no_grad():
for x, y in val_loader:
x, y = x.to(device), y.to(device)
_, loss = model(x, targets=y)
total_val_loss += loss.item()
num_val_batches += 1
avg_val_loss = total_val_loss / num_val_batches
history["train_loss"].append(avg_train_loss)
history["val_loss"].append(avg_val_loss)
print(
f"Epoch {epoch + 1}/{epochs} | "
f"Train Loss: {avg_train_loss:.4f} | "
f"Val Loss: {avg_val_loss:.4f}"
)
return history
history = train(model, train_dataset, val_dataset)
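The cross-entropy reported above is in nats per character. Character-level results are often quoted in bits per character (bpc), obtained by dividing by ln 2; for example:
import math

# Convert the final losses from nats/char to bits/char (bpc = nats / ln 2).
print(f"Final train bpc: {history['train_loss'][-1] / math.log(2):.3f}")
print(f"Final val bpc:   {history['val_loss'][-1] / math.log(2):.3f}")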
Step 5: Text Generation
@torch.no_grad()
def generate(
model: MiniGPT,
tokenizer: CharTokenizer,
prompt: str,
max_new_tokens: int = 500,
temperature: float = 1.0,
top_k: int | None = None,
top_p: float | None = None,
) -> str:
"""Generate text from the model given a prompt.
Args:
model: Trained MiniGPT model.
tokenizer: Character tokenizer.
prompt: Starting text.
max_new_tokens: Number of characters to generate.
temperature: Sampling temperature.
top_k: Top-k filtering parameter.
top_p: Nucleus sampling parameter.
Returns:
Generated text including the prompt.
"""
device = next(model.parameters()).device
model.eval()
idx = torch.tensor(
[tokenizer.encode(prompt)], dtype=torch.long, device=device
)
for _ in range(max_new_tokens):
idx_cond = idx[:, -model.config.block_size:]
logits, _ = model(idx_cond)
logits = logits[:, -1, :] / temperature
if top_k is not None:
v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
logits[logits < v[:, [-1]]] = float("-inf")
if top_p is not None:
sorted_logits, sorted_indices = torch.sort(
logits, descending=True
)
cumulative_probs = torch.cumsum(
F.softmax(sorted_logits, dim=-1), dim=-1
)
sorted_mask = cumulative_probs > top_p
sorted_mask[..., 1:] = sorted_mask[..., :-1].clone()
sorted_mask[..., 0] = 0
indices_to_remove = sorted_mask.scatter(
1, sorted_indices, sorted_mask
)
logits[indices_to_remove] = float("-inf")
probs = F.softmax(logits, dim=-1)
idx_next = torch.multinomial(probs, num_samples=1)
idx = torch.cat((idx, idx_next), dim=1)
return tokenizer.decode(idx[0].tolist())
# Generate with different strategies
print("=" * 60)
print("GREEDY DECODING (temperature=0.01)")
print("=" * 60)
print(generate(model, tokenizer, "ROMEO:", temperature=0.01, max_new_tokens=300))
print("\n" + "=" * 60)
print("TEMPERATURE = 0.8")
print("=" * 60)
print(generate(model, tokenizer, "ROMEO:", temperature=0.8, max_new_tokens=300))
print("\n" + "=" * 60)
print("TOP-K = 10, TEMPERATURE = 0.9")
print("=" * 60)
print(generate(model, tokenizer, "ROMEO:", temperature=0.9, top_k=10, max_new_tokens=300))
print("\n" + "=" * 60)
print("NUCLEUS SAMPLING (top_p=0.92)")
print("=" * 60)
print(generate(model, tokenizer, "ROMEO:", top_p=0.92, max_new_tokens=300))
Step 6: Analysis and Visualization
import matplotlib.pyplot as plt
def plot_training_history(history: dict[str, list[float]]) -> None:
"""Plot training and validation loss curves.
Args:
history: Dictionary with 'train_loss' and 'val_loss' lists.
"""
fig, ax = plt.subplots(figsize=(10, 6))
epochs = range(1, len(history["train_loss"]) + 1)
ax.plot(epochs, history["train_loss"], "b-o", label="Training Loss")
ax.plot(epochs, history["val_loss"], "r-o", label="Validation Loss")
ax.set_xlabel("Epoch")
ax.set_ylabel("Cross-Entropy Loss")
ax.set_title("Mini-GPT Training on Shakespeare")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("training_history.png", dpi=150)
plt.show()
plot_training_history(history)
Discussion
What We Learned
- Character-level models are feasible: Even with a small character vocabulary (65 characters), our mini-GPT learns to generate coherent Shakespeare-like text.
- Architecture matters: The combination of causal masking, multi-head attention, and GELU-activated FFNs creates a powerful sequence model. Removing any component degrades performance.
- Generation strategy dramatically affects output: Greedy decoding produces repetitive but grammatically correct text, while temperature and nucleus sampling introduce creativity at the cost of occasional incoherence.
- Training dynamics: The loss drops quickly in the first few epochs as the model learns character frequencies and common words, then decreases more slowly as it learns longer-range dependencies.
Limitations
- Character-level models must learn spelling from scratch, which is inefficient.
- Our model has only about 10M parameters, limiting its ability to capture complex patterns.
- The 256-token context window limits long-range coherence.
- We did not use learning rate scheduling, mixed precision, or other production training techniques.
Extensions
- Switch to BPE tokenization for better efficiency.
- Add learning rate warmup and cosine decay (see the scheduler sketch after this list).
- Implement KV caching for faster generation.
- Scale up the model and training data.
- Fine-tune a pre-trained GPT-2 on Shakespeare instead of training from scratch.
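As an example of the second extension, a warmup-plus-cosine-decay schedule can be attached to the existing AdamW optimizer with torch.optim.lr_scheduler.LambdaLR. The step counts below are illustrative, not tuned for this dataset:
import math

def warmup_cosine(step: int, warmup_steps: int = 200,
                  max_steps: int = 5000, min_ratio: float = 0.1) -> float:
    """Return a multiplier on the peak LR: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
    return min_ratio + 0.5 * (1.0 - min_ratio) * (1.0 + math.cos(math.pi * progress))

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)
# In the training loop, call scheduler.step() after each optimizer.step().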