Case Study 1: Building a Mini-GPT from Scratch
Overview
In this case study, we build a complete mini-GPT language model from scratch using PyTorch, train it on Shakespeare's works, and use it to generate new text in the Bard's style. This hands-on project covers every component of the decoder-only Transformer architecture: token and positional embeddings, multi-head causal self-attention, position-wise feed-forward networks with GELU activation, pre-norm layer normalization, weight tying, and the complete training and generation pipeline.
By working through this case study, you will solidify your understanding of every architectural decision discussed in Chapter 21 and gain practical experience with the challenges of training and generating from autoregressive models.
Learning Objectives
- Build each component of the GPT architecture from first principles.
- Understand how the components fit together into a complete model.
- Train the model on real text data and observe the learning dynamics.
- Experiment with different generation strategies and observe their effects on output quality.
- Analyze training curves and diagnose common issues.
Dataset: Shakespeare's Complete Works
We use the "tiny Shakespeare" subset of Shakespeare's works, a classic choice for character-level language modeling: the text's distinctive style makes it easy to judge whether the model has learned meaningful patterns.
import urllib.request
import os
def download_shakespeare(filepath: str = "shakespeare.txt") -> str:
"""Download Shakespeare's complete works from a public URL.
Args:
filepath: Local path to save the text file.
Returns:
The full text as a string.
"""
url = "https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt"
if not os.path.exists(filepath):
urllib.request.urlretrieve(url, filepath)
with open(filepath, "r", encoding="utf-8") as f:
text = f.read()
return text
text = download_shakespeare()
print(f"Total characters: {len(text):,}")
print(f"Unique characters: {len(set(text))}")
print(f"First 200 characters:\n{text[:200]}")
Step 1: Character-Level Tokenizer
For this project, we use character-level tokenization. Each unique character in the text becomes a token. This keeps the vocabulary small and makes the model easy to train on a single GPU.
class CharTokenizer:
"""A simple character-level tokenizer.
Attributes:
chars: Sorted list of unique characters.
vocab_size: Number of unique characters.
stoi: Dictionary mapping characters to indices.
itos: Dictionary mapping indices to characters.
"""
def __init__(self, text: str) -> None:
self.chars = sorted(list(set(text)))
self.vocab_size = len(self.chars)
self.stoi = {ch: i for i, ch in enumerate(self.chars)}
self.itos = {i: ch for i, ch in enumerate(self.chars)}
def encode(self, text: str) -> list[int]:
"""Encode a string into a list of token indices."""
return [self.stoi[c] for c in text]
def decode(self, indices: list[int]) -> str:
"""Decode a list of token indices into a string."""
return "".join(self.itos[i] for i in indices)
tokenizer = CharTokenizer(text)
print(f"Vocabulary size: {tokenizer.vocab_size}")
print(f"Characters: {''.join(tokenizer.chars)}")
# Test encode/decode
sample = "Hello, World!"
encoded = tokenizer.encode(sample)
decoded = tokenizer.decode(encoded)
assert decoded == sample, "Encode/decode round-trip failed!"
print(f"Encoded: {encoded}")
print(f"Decoded: {decoded}")
Step 2: Data Preparation
import torch
from torch.utils.data import Dataset, DataLoader
torch.manual_seed(42)
class ShakespeareDataset(Dataset):
"""Dataset that produces chunks of tokenized Shakespeare text.
Args:
data: Tensor of token indices for the full text.
block_size: Length of each training sequence.
"""
def __init__(self, data: torch.Tensor, block_size: int) -> None:
self.data = data
self.block_size = block_size
def __len__(self) -> int:
return len(self.data) - self.block_size
def __getitem__(self, idx: int) -> tuple[torch.Tensor, torch.Tensor]:
x = self.data[idx: idx + self.block_size]
y = self.data[idx + 1: idx + self.block_size + 1]
return x, y
# Encode the full text
data = torch.tensor(tokenizer.encode(text), dtype=torch.long)
# Train/validation split (90/10)
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]
block_size = 256
train_dataset = ShakespeareDataset(train_data, block_size)
val_dataset = ShakespeareDataset(val_data, block_size)
print(f"Training sequences: {len(train_dataset):,}")
print(f"Validation sequences: {len(val_dataset):,}")
Step 3: Model Architecture
We now assemble the full mini-GPT architecture from the components developed in Chapter 21, gathered here into a single, self-contained module.
import torch
import torch.nn as nn
import torch.nn.functional as F
from dataclasses import dataclass
@dataclass
class GPTConfig:
"""Configuration for the mini-GPT model."""
vocab_size: int = 65 # Shakespeare character vocabulary
block_size: int = 256
n_layer: int = 6
n_head: int = 6
n_embd: int = 384
dropout: float = 0.2
class CausalSelfAttention(nn.Module):
"""Multi-head causal self-attention."""
def __init__(self, config: GPTConfig) -> None:
super().__init__()
assert config.n_embd % config.n_head == 0
self.c_attn = nn.Linear(config.n_embd, 3 * config.n_embd)
self.c_proj = nn.Linear(config.n_embd, config.n_embd)
self.attn_dropout = nn.Dropout(config.dropout)
self.resid_dropout = nn.Dropout(config.dropout)
self.n_head = config.n_head
self.n_embd = config.n_embd
self.register_buffer(
"bias",
torch.tril(torch.ones(config.block_size, config.block_size))
.view(1, 1, config.block_size, config.block_size),
)
def forward(self, x: torch.Tensor) -> torch.Tensor:
B, T, C = x.size()
qkv = self.c_attn(x)
q, k, v = qkv.split(self.n_embd, dim=2)
k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
att = (q @ k.transpose(-2, -1)) * (1.0 / (k.size(-1) ** 0.5))
att = att.masked_fill(self.bias[:, :, :T, :T] == 0, float("-inf"))
att = F.softmax(att, dim=-1)
att = self.attn_dropout(att)
y = att @ v
y = y.transpose(1, 2).contiguous().view(B, T, C)
y = self.resid_dropout(self.c_proj(y))
return y
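As an aside, PyTorch 2.0 and later ship torch.nn.functional.scaled_dot_product_attention, which fuses the masking, softmax, and dropout above into one optimized kernel. The subclass below is a hedged sketch of that alternative, assuming PyTorch >= 2.0; it is not used in the rest of this case study.
class FusedCausalSelfAttention(CausalSelfAttention):
    """Same computation via F.scaled_dot_product_attention (PyTorch >= 2.0)."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, C = x.size()
        q, k, v = self.c_attn(x).split(self.n_embd, dim=2)
        k = k.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        q = q.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        v = v.view(B, T, self.n_head, C // self.n_head).transpose(1, 2)
        # is_causal=True applies the lower-triangular mask inside the kernel,
        # so the explicit "bias" buffer is not needed on this path.
        y = F.scaled_dot_product_attention(
            q, k, v,
            dropout_p=self.attn_dropout.p if self.training else 0.0,
            is_causal=True,
        )
        y = y.transpose(1, 2).contiguous().view(B, T, C)
        return self.resid_dropout(self.c_proj(y))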
class FeedForward(nn.Module):
"""Position-wise feed-forward network with GELU."""
def __init__(self, config: GPTConfig) -> None:
super().__init__()
self.c_fc = nn.Linear(config.n_embd, 4 * config.n_embd)
self.c_proj = nn.Linear(4 * config.n_embd, config.n_embd)
self.dropout = nn.Dropout(config.dropout)
def forward(self, x: torch.Tensor) -> torch.Tensor:
x = F.gelu(self.c_fc(x))
x = self.dropout(self.c_proj(x))
return x
class TransformerBlock(nn.Module):
"""Pre-norm Transformer decoder block."""
def __init__(self, config: GPTConfig) -> None:
super().__init__()
self.ln_1 = nn.LayerNorm(config.n_embd)
self.attn = CausalSelfAttention(config)
self.ln_2 = nn.LayerNorm(config.n_embd)
self.mlp = FeedForward(config)
def forward(self, x: torch.Tensor) -> torch.Tensor:
x = x + self.attn(self.ln_1(x))
x = x + self.mlp(self.ln_2(x))
return x
class MiniGPT(nn.Module):
"""Complete mini-GPT language model."""
def __init__(self, config: GPTConfig) -> None:
super().__init__()
self.config = config
self.transformer = nn.ModuleDict(dict(
wte=nn.Embedding(config.vocab_size, config.n_embd),
wpe=nn.Embedding(config.block_size, config.n_embd),
drop=nn.Dropout(config.dropout),
h=nn.ModuleList([
TransformerBlock(config) for _ in range(config.n_layer)
]),
ln_f=nn.LayerNorm(config.n_embd),
))
self.lm_head = nn.Linear(config.n_embd, config.vocab_size, bias=False)
self.transformer.wte.weight = self.lm_head.weight
self.apply(self._init_weights)
def _init_weights(self, module: nn.Module) -> None:
if isinstance(module, nn.Linear):
torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
if module.bias is not None:
torch.nn.init.zeros_(module.bias)
elif isinstance(module, nn.Embedding):
torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
def forward(
self,
idx: torch.Tensor,
targets: torch.Tensor | None = None,
) -> tuple[torch.Tensor, torch.Tensor | None]:
B, T = idx.size()
pos = torch.arange(0, T, dtype=torch.long, device=idx.device)
tok_emb = self.transformer.wte(idx)
pos_emb = self.transformer.wpe(pos)
x = self.transformer.drop(tok_emb + pos_emb)
for block in self.transformer.h:
x = block(x)
x = self.transformer.ln_f(x)
logits = self.lm_head(x)
loss = None
if targets is not None:
loss = F.cross_entropy(
logits.view(-1, logits.size(-1)), targets.view(-1)
)
return logits, loss
def count_parameters(self) -> int:
return sum(p.numel() for p in self.parameters() if p.requires_grad)
# Create the model
config = GPTConfig(vocab_size=tokenizer.vocab_size)
model = MiniGPT(config)
print(f"Model parameters: {model.count_parameters():,}")
Step 4: Training
def train(
model: MiniGPT,
train_dataset: ShakespeareDataset,
val_dataset: ShakespeareDataset,
epochs: int = 10,
batch_size: int = 64,
learning_rate: float = 3e-4,
) -> dict[str, list[float]]:
"""Train the mini-GPT model with validation.
Args:
model: The model to train.
train_dataset: Training dataset.
val_dataset: Validation dataset.
epochs: Number of training epochs.
batch_size: Batch size for training.
learning_rate: Peak learning rate.
Returns:
Dictionary with training and validation loss histories.
"""
torch.manual_seed(42)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
optimizer = torch.optim.AdamW(
model.parameters(), lr=learning_rate, weight_decay=0.01
)
train_loader = DataLoader(
train_dataset, batch_size=batch_size, shuffle=True, drop_last=True
)
val_loader = DataLoader(
val_dataset, batch_size=batch_size, shuffle=False, drop_last=True
)
history = {"train_loss": [], "val_loss": []}
for epoch in range(epochs):
# Training
model.train()
total_train_loss = 0.0
num_train_batches = 0
for x, y in train_loader:
x, y = x.to(device), y.to(device)
_, loss = model(x, targets=y)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
total_train_loss += loss.item()
num_train_batches += 1
avg_train_loss = total_train_loss / num_train_batches
# Validation
model.eval()
total_val_loss = 0.0
num_val_batches = 0
with torch.no_grad():
for x, y in val_loader:
x, y = x.to(device), y.to(device)
_, loss = model(x, targets=y)
total_val_loss += loss.item()
num_val_batches += 1
avg_val_loss = total_val_loss / num_val_batches
history["train_loss"].append(avg_train_loss)
history["val_loss"].append(avg_val_loss)
print(
f"Epoch {epoch + 1}/{epochs} | "
f"Train Loss: {avg_train_loss:.4f} | "
f"Val Loss: {avg_val_loss:.4f}"
)
return history
history = train(model, train_dataset, val_dataset)
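The cross-entropy reported above is in nats per character. Character-level results are often quoted in bits per character (bpc), obtained by dividing by ln 2; for example:
import math

# Convert the final losses from nats/char to bits/char (bpc = nats / ln 2).
print(f"Final train bpc: {history['train_loss'][-1] / math.log(2):.3f}")
print(f"Final val bpc:   {history['val_loss'][-1] / math.log(2):.3f}")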
Step 5: Text Generation
@torch.no_grad()
def generate(
model: MiniGPT,
tokenizer: CharTokenizer,
prompt: str,
max_new_tokens: int = 500,
temperature: float = 1.0,
top_k: int | None = None,
top_p: float | None = None,
) -> str:
"""Generate text from the model given a prompt.
Args:
model: Trained MiniGPT model.
tokenizer: Character tokenizer.
prompt: Starting text.
max_new_tokens: Number of characters to generate.
temperature: Sampling temperature.
top_k: Top-k filtering parameter.
top_p: Nucleus sampling parameter.
Returns:
Generated text including the prompt.
"""
device = next(model.parameters()).device
model.eval()
idx = torch.tensor(
[tokenizer.encode(prompt)], dtype=torch.long, device=device
)
for _ in range(max_new_tokens):
idx_cond = idx[:, -model.config.block_size:]
logits, _ = model(idx_cond)
logits = logits[:, -1, :] / temperature
if top_k is not None:
v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
logits[logits < v[:, [-1]]] = float("-inf")
if top_p is not None:
sorted_logits, sorted_indices = torch.sort(
logits, descending=True
)
cumulative_probs = torch.cumsum(
F.softmax(sorted_logits, dim=-1), dim=-1
)
sorted_mask = cumulative_probs > top_p
sorted_mask[..., 1:] = sorted_mask[..., :-1].clone()
sorted_mask[..., 0] = 0
indices_to_remove = sorted_mask.scatter(
1, sorted_indices, sorted_mask
)
logits[indices_to_remove] = float("-inf")
probs = F.softmax(logits, dim=-1)
idx_next = torch.multinomial(probs, num_samples=1)
idx = torch.cat((idx, idx_next), dim=1)
return tokenizer.decode(idx[0].tolist())
# Generate with different strategies
print("=" * 60)
print("GREEDY DECODING (temperature=0.01)")
print("=" * 60)
print(generate(model, tokenizer, "ROMEO:", temperature=0.01, max_new_tokens=300))
print("\n" + "=" * 60)
print("TEMPERATURE = 0.8")
print("=" * 60)
print(generate(model, tokenizer, "ROMEO:", temperature=0.8, max_new_tokens=300))
print("\n" + "=" * 60)
print("TOP-K = 10, TEMPERATURE = 0.9")
print("=" * 60)
print(generate(model, tokenizer, "ROMEO:", temperature=0.9, top_k=10, max_new_tokens=300))
print("\n" + "=" * 60)
print("NUCLEUS SAMPLING (top_p=0.92)")
print("=" * 60)
print(generate(model, tokenizer, "ROMEO:", top_p=0.92, max_new_tokens=300))
Step 6: Analysis and Visualization
import matplotlib.pyplot as plt
def plot_training_history(history: dict[str, list[float]]) -> None:
"""Plot training and validation loss curves.
Args:
history: Dictionary with 'train_loss' and 'val_loss' lists.
"""
fig, ax = plt.subplots(figsize=(10, 6))
epochs = range(1, len(history["train_loss"]) + 1)
ax.plot(epochs, history["train_loss"], "b-o", label="Training Loss")
ax.plot(epochs, history["val_loss"], "r-o", label="Validation Loss")
ax.set_xlabel("Epoch")
ax.set_ylabel("Cross-Entropy Loss")
ax.set_title("Mini-GPT Training on Shakespeare")
ax.legend()
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("training_history.png", dpi=150)
plt.show()
plot_training_history(history)
Discussion
What We Learned
- Character-level models are feasible: Even with a small character vocabulary (65 characters), our mini-GPT learns to generate coherent Shakespeare-like text.
- Architecture matters: The combination of causal masking, multi-head attention, and GELU-activated FFNs creates a powerful sequence model. Removing any component degrades performance.
- Generation strategy dramatically affects output: Greedy decoding produces repetitive but grammatically correct text, while temperature and nucleus sampling introduce creativity at the cost of occasional incoherence.
- Training dynamics: The loss drops quickly in the first few epochs as the model learns character frequencies and common words, then decreases more slowly as it learns longer-range dependencies.
Limitations
- Character-level models must learn spelling from scratch, which is inefficient.
- Our model has only about 10M parameters, limiting its ability to capture complex patterns.
- The 256-token context window limits long-range coherence.
- We did not use learning rate scheduling, mixed precision, or other production training techniques.
Extensions
- Switch to BPE tokenization for better efficiency.
- Add learning rate warmup and cosine decay (see the scheduler sketch after this list).
- Implement KV caching for faster generation.
- Scale up the model and training data.
- Fine-tune a pre-trained GPT-2 on Shakespeare instead of training from scratch.
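As an example of the second extension, a warmup-plus-cosine-decay schedule can be attached to the existing AdamW optimizer with torch.optim.lr_scheduler.LambdaLR. The step counts below are illustrative, not tuned for this dataset:
import math

def warmup_cosine(step: int, warmup_steps: int = 200,
                  max_steps: int = 5000, min_ratio: float = 0.1) -> float:
    """Return a multiplier on the peak LR: linear warmup, then cosine decay."""
    if step < warmup_steps:
        return (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / max(1, max_steps - warmup_steps))
    return min_ratio + 0.5 * (1.0 - min_ratio) * (1.0 + math.cos(math.pi * progress))

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_cosine)
# In the training loop, call scheduler.step() after each optimizer.step().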