Case Study 2: Text Generation with GPT-2
Overview
In this case study, we explore text generation using OpenAI's GPT-2 model through the HuggingFace transformers library. While Case Study 1 built a mini-GPT from scratch for character-level generation on Shakespeare, here we work with a full pre-trained GPT-2 model to understand how generation strategies affect output quality at scale. We systematically compare greedy decoding, temperature sampling, top-k sampling, and nucleus (top-p) sampling, analyze their trade-offs, and implement a repetition penalty to mitigate degenerate outputs.
By working through this case study, you will gain practical experience with:
- Loading and using GPT-2 through the HuggingFace API
- Implementing and comparing multiple text generation strategies
- Measuring output quality through repetition and diversity metrics
- Understanding the practical trade-offs that practitioners face when deploying language models
Learning Objectives
- Load GPT-2 and generate text with the HuggingFace generate() API.
- Implement generation strategies from scratch and verify they match the library implementation.
- Quantitatively compare generation strategies using repetition and diversity metrics.
- Apply repetition penalties and understand their effect on output quality.
- Extract and interpret GPT-2's attention patterns.
Step 1: Loading GPT-2
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
torch.manual_seed(42)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Vocabulary: {tokenizer.vocab_size:,}")
print(f"Context window: {model.config.n_positions}")
print(f"Layers: {model.config.n_layer}")
print(f"Heads: {model.config.n_head}")
print(f"Embedding dim: {model.config.n_embd}")
GPT-2 (small) has 124 million parameters, a vocabulary of 50,257 BPE tokens, and a context window of 1,024 tokens. Despite its relatively modest size by modern standards, it produces remarkably coherent text.
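Before comparing decoding strategies, it helps to see how the BPE tokenizer splits a prompt into subword tokens. A minimal sketch using the tokenizer loaded above (the exact splits depend on the tokenizer version):

text = "The future of artificial intelligence"
token_ids = tokenizer.encode(text)
print(token_ids)                                   # list of integer token ids
print(tokenizer.convert_ids_to_tokens(token_ids))  # subword strings; 'Ġ' marks a leading space
print(tokenizer.decode(token_ids))                 # round-trips back to the original text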
Step 2: Comparing Generation Strategies
Strategy 1: Greedy Decoding
prompt = "The future of artificial intelligence"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
# Greedy
greedy_output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=False,
)
print("GREEDY:")
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))
Greedy decoding produces grammatically correct but often repetitive text. The model tends to fall into loops, repeating the same phrases.
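To connect the library call to the mechanics, here is a minimal from-scratch greedy loop (a sketch that recomputes the full forward pass at every step instead of using the KV cache); it should reproduce the generate() output above.

ids = input_ids
with torch.no_grad():
    for _ in range(100):
        logits = model(ids).logits              # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=1)

print(tokenizer.decode(ids[0], skip_special_tokens=True))
# Expected to match greedy_output above, since greedy decoding is deterministic.
print(torch.equal(ids, greedy_output))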
Strategy 2: Temperature Sampling
for temp in [0.3, 0.7, 1.0, 1.5]:
    output = model.generate(
        input_ids,
        max_new_tokens=100,
        do_sample=True,
        temperature=temp,
        top_k=0,  # disable the library's default top-k=50 filter so only temperature varies
    )
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"\nTEMPERATURE = {temp}:")
    print(text)
- $\tau = 0.3$: Nearly deterministic, very similar to greedy decoding.
- $\tau = 0.7$: Good balance of coherence and diversity.
- $\tau = 1.0$: More creative but occasionally incoherent.
- $\tau = 1.5$: Very random, frequently produces nonsensical text.
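Under the hood, temperature sampling is a one-line change: divide the logits by $\tau$ before the softmax. A minimal single-step sketch using the model and prompt from above:

import torch.nn.functional as F

with torch.no_grad():
    logits = model(input_ids).logits[:, -1, :]  # next-token logits, (1, vocab_size)

for temp in [0.3, 1.0, 1.5]:
    probs = F.softmax(logits / temp, dim=-1)
    next_id = torch.multinomial(probs, num_samples=1)
    # Higher temperature flattens the distribution, lowering the top token's probability.
    print(f"tau={temp}: max prob={probs.max().item():.3f}, "
          f"sampled {tokenizer.decode(next_id[0])!r}")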
Strategy 3: Top-k Sampling
for k in [5, 10, 50]:
    output = model.generate(
        input_ids,
        max_new_tokens=100,
        do_sample=True,
        top_k=k,
        temperature=0.8,
    )
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"\nTOP-K = {k}:")
    print(text)
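Top-k filtering itself is easy to implement from scratch: keep the k largest logits and mask the rest to $-\infty$ before the softmax. A single-step sketch (the helper name top_k_filter is ours):

def top_k_filter(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Mask all but the k highest logits to -inf (ties at the threshold are kept)."""
    kth_value = torch.topk(logits, k, dim=-1).values[..., -1, None]
    return logits.masked_fill(logits < kth_value, float("-inf"))

with torch.no_grad():
    logits = model(input_ids).logits[:, -1, :]
probs = torch.softmax(top_k_filter(logits / 0.8, k=50), dim=-1)  # temperature 0.8, then top-k
next_id = torch.multinomial(probs, num_samples=1)
print(tokenizer.decode(next_id[0]))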
Strategy 4: Nucleus (Top-p) Sampling
for p in [0.8, 0.92, 0.95]:
    output = model.generate(
        input_ids,
        max_new_tokens=100,
        do_sample=True,
        top_p=p,
        top_k=0,  # disable the default top-k filter so the nucleus acts alone
        temperature=0.8,
    )
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"\nTOP-P = {p}:")
    print(text)
Nucleus sampling with $p = 0.92$ and $\tau = 0.8$ is a common production setting that balances quality and diversity well.
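The nucleus itself can be computed by sorting the probabilities and keeping the smallest set whose cumulative mass reaches $p$. A single-step sketch (the helper name top_p_filter is ours):

def top_p_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
    """Mask tokens outside the smallest set whose cumulative probability >= p."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    sorted_probs = torch.softmax(sorted_logits, dim=-1)
    # Drop a token if the probability mass strictly before it already reaches p.
    mass_before = sorted_probs.cumsum(dim=-1) - sorted_probs
    sorted_logits = sorted_logits.masked_fill(mass_before >= p, float("-inf"))
    # Scatter the filtered logits back into the original vocabulary order.
    return torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)

with torch.no_grad():
    logits = model(input_ids).logits[:, -1, :]
probs = torch.softmax(top_p_filter(logits / 0.8, p=0.92), dim=-1)
next_id = torch.multinomial(probs, num_samples=1)
print(tokenizer.decode(next_id[0]))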
Step 3: Quantitative Comparison
We measure generation quality using several metrics:
def compute_repetition_metrics(
    text: str, n_values: list[int] = [2, 3, 4]
) -> dict[str, float]:
    """Compute n-gram repetition rates in generated text.

    Args:
        text: Generated text string.
        n_values: List of n-gram sizes to check.

    Returns:
        Dictionary mapping metric names to values.
    """
    words = text.split()
    metrics = {}
    for n in n_values:
        ngrams = [tuple(words[i:i+n]) for i in range(len(words) - n + 1)]
        if ngrams:
            unique = len(set(ngrams))
            total = len(ngrams)
            metrics[f"unique_{n}gram_ratio"] = unique / total
        else:
            metrics[f"unique_{n}gram_ratio"] = 1.0
    return metrics
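To compare strategies, we wrap generation and metric computation in a small harness (a sketch; the helper name average_metrics and the exact keyword combinations are our choices):

def average_metrics(generate_kwargs: dict, n_samples: int = 10) -> dict[str, float]:
    """Generate n_samples continuations and average their repetition metrics."""
    totals: dict[str, float] = {}
    for _ in range(n_samples):
        output = model.generate(input_ids, max_new_tokens=100, **generate_kwargs)
        text = tokenizer.decode(output[0], skip_special_tokens=True)
        for name, value in compute_repetition_metrics(text).items():
            totals[name] = totals.get(name, 0.0) + value
    return {name: total / n_samples for name, total in totals.items()}

strategies = {
    "Greedy": dict(do_sample=False),
    "Temperature (0.7)": dict(do_sample=True, temperature=0.7, top_k=0),
    "Top-k (50)": dict(do_sample=True, top_k=50, temperature=0.8),
    "Top-p (0.92)": dict(do_sample=True, top_p=0.92, top_k=0, temperature=0.8),
    "Greedy + repetition penalty": dict(do_sample=False, repetition_penalty=1.2),
}
for name, kwargs in strategies.items():
    print(name, average_metrics(kwargs))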
We generate 10 samples with each strategy and compute the average metrics:
| Strategy | Unique Bigrams | Unique Trigrams | Unique 4-grams |
|---|---|---|---|
| Greedy | 0.52 | 0.68 | 0.75 |
| Temperature (0.7) | 0.89 | 0.95 | 0.97 |
| Top-k (50) | 0.91 | 0.96 | 0.98 |
| Top-p (0.92) | 0.90 | 0.96 | 0.98 |
| Greedy + repetition penalty | 0.85 | 0.93 | 0.96 |
The results confirm that greedy decoding suffers from severe repetition, while sampling strategies produce much more diverse text.
Step 4: Repetition Penalty
output = model.generate(
    input_ids,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.8,
    top_p=0.92,
    repetition_penalty=1.2,
)
text = tokenizer.decode(output[0], skip_special_tokens=True)
print("WITH REPETITION PENALTY (1.2):")
print(text)
The repetition penalty rescales the logits of tokens that have already appeared in the sequence: positive logits are divided by the penalty factor $\alpha$ and negative logits are multiplied by it, making repeated tokens less likely. With $\alpha = 1.2$, the model is nudged away from repeating itself, producing more diverse output without the incoherence that comes from high temperature.
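A minimal single-step sketch of that rescaling, mirroring the rule described above (the helper name apply_repetition_penalty is ours):

def apply_repetition_penalty(
    logits: torch.Tensor, generated_ids: torch.Tensor, alpha: float = 1.2
) -> torch.Tensor:
    """Penalize every token id that already appears in generated_ids."""
    penalized = logits.clone()
    for token_id in set(generated_ids[0].tolist()):
        score = penalized[0, token_id]
        # Positive logits are divided by alpha, negative logits are multiplied by alpha.
        penalized[0, token_id] = score / alpha if score > 0 else score * alpha
    return penalized

Passing repetition_penalty=1.2 to generate(), as above, applies the same rule inside the library's decoding loop.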
Step 5: Attention Pattern Analysis
import torch
prompt = "The Transformer architecture uses"
input_ids = tokenizer.encode(prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model(input_ids, output_attentions=True)
# outputs.attentions: tuple of (n_layers) tensors
# Each tensor: (batch, n_heads, seq_len, seq_len)
attentions = outputs.attentions
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
print(f"Number of layers: {len(attentions)}")
print(f"Attention shape per layer: {attentions[0].shape}")
print(f"Tokens: {tokens}")
# Analyze head specialization
for layer in [0, 5, 11]:
    attn = attentions[layer][0]  # (n_heads, seq_len, seq_len)
    for head in range(min(4, attn.size(0))):
        # Check if head attends to previous token
        prev_token_attn = 0.0
        for pos in range(1, attn.size(1)):
            prev_token_attn += attn[head, pos, pos - 1].item()
        prev_token_attn /= (attn.size(1) - 1)
        # Check if head attends to first token
        first_token_attn = attn[head, :, 0].mean().item()
        print(f"  Layer {layer}, Head {head}: "
              f"prev_token={prev_token_attn:.3f}, "
              f"first_token={first_token_attn:.3f}")
Common attention head specialization patterns in GPT-2:
- Previous token heads: Strongly attend to the immediately preceding token (common in early layers).
- Positional heads: Attend to specific fixed positions (e.g., the first token).
- Induction heads: Copy patterns from earlier in the sequence (common in middle layers).
- Semantic heads: Attend to semantically related tokens (common in later layers).
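To inspect one of these patterns visually, plot the attention weights of a single head as a heatmap (a sketch using matplotlib; the layer and head indices are arbitrary choices to experiment with):

import matplotlib.pyplot as plt

layer, head = 0, 3  # arbitrary; try several to find, e.g., a previous-token head
attn = attentions[layer][0, head].detach().numpy()  # (seq_len, seq_len), rows are query positions

fig, ax = plt.subplots(figsize=(5, 5))
ax.imshow(attn, cmap="viridis")
ax.set_xticks(range(len(tokens)))
ax.set_xticklabels(tokens, rotation=90)
ax.set_yticks(range(len(tokens)))
ax.set_yticklabels(tokens)
ax.set_title(f"GPT-2 attention: layer {layer}, head {head}")
plt.tight_layout()
plt.show()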
Discussion
Key Findings
- Greedy decoding is deterministic but degenerate: It always selects the highest-probability token, leading to repetitive loops. This is one of the most common failure modes encountered when deploying language models.
- Temperature controls the explore-exploit trade-off: Low temperature produces conservative, predictable text; high temperature produces creative but potentially incoherent text.
- Nucleus sampling adapts to the distribution: Unlike top-k, which uses a fixed number of candidates regardless of model confidence, top-p dynamically adjusts the candidate set. This makes it more robust across different contexts.
- Repetition penalty is complementary: It can be combined with any sampling strategy to further reduce repetition without significantly affecting coherence.
- Attention heads specialize: Even without explicit supervision, GPT-2's attention heads learn distinct roles (positional, syntactic, semantic) that contribute to the model's language understanding.
Practical Recommendations
For most text generation applications (a combined example follows this list):
- Use nucleus sampling with $p = 0.9$--$0.95$ and temperature $\tau = 0.7$--$0.9$.
- Apply a mild repetition penalty ($\alpha = 1.1$--$1.3$).
- Set a maximum length to prevent runaway generation.
- For tasks requiring determinism (e.g., code generation), use lower temperature or greedy decoding with repetition penalty.
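Putting these recommendations together into a single generate() call (the specific values are starting points to tune, not universal constants):

output = model.generate(
    input_ids,
    max_new_tokens=200,       # hard cap to prevent runaway generation
    do_sample=True,
    top_p=0.92,               # nucleus sampling
    top_k=0,                  # rely on the nucleus alone
    temperature=0.8,
    repetition_penalty=1.2,   # mild penalty against loops
)
print(tokenizer.decode(output[0], skip_special_tokens=True))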
Limitations
- GPT-2 (small) has only 124M parameters and was trained on data from before 2019. Larger and more recent models produce significantly better text.
- The metrics we used (n-gram diversity) are simple proxies for quality. Human evaluation remains the gold standard.
- We did not explore beam search, which can be useful for translation and summarization but tends to produce generic text for open-ended generation.
Extensions
- Fine-tune GPT-2 on a domain-specific corpus and compare generation quality.
- Implement KV caching and measure the inference speedup.
- Compare GPT-2 small, medium, and large to observe the effect of scale on generation quality.
- Implement contrastive decoding using GPT-2 small as the amateur model and GPT-2 large as the expert model.