Case Study 2: Text Generation with GPT-2

Overview

In this case study, we explore text generation using OpenAI's GPT-2 model through the HuggingFace transformers library. While Case Study 1 built a mini-GPT from scratch for character-level generation on Shakespeare, here we work with a full pre-trained GPT-2 model to understand how generation strategies affect output quality at scale. We systematically compare greedy decoding, temperature sampling, top-k sampling, and nucleus (top-p) sampling, analyze their trade-offs, and implement a repetition penalty to mitigate degenerate outputs.

By working through this case study, you will gain practical experience with:

  • Loading and using GPT-2 through the HuggingFace API
  • Implementing and comparing multiple text generation strategies
  • Measuring output quality through repetition metrics and perplexity
  • Understanding the practical trade-offs that practitioners face when deploying language models

Learning Objectives

  • Load GPT-2 and generate text with the HuggingFace generate() API.
  • Implement generation strategies from scratch and verify they match the library implementation.
  • Quantitatively compare generation strategies using repetition and diversity metrics.
  • Apply repetition penalties and understand their effect on output quality.
  • Extract and interpret GPT-2's attention patterns.

Step 1: Loading GPT-2

import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

torch.manual_seed(42)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
print(f"Vocabulary: {tokenizer.vocab_size:,}")
print(f"Context window: {model.config.n_positions}")
print(f"Layers: {model.config.n_layer}")
print(f"Heads: {model.config.n_head}")
print(f"Embedding dim: {model.config.n_embd}")

GPT-2 (small) has 124 million parameters, a vocabulary of 50,257 BPE tokens, and a context window of 1,024 tokens. Despite its relatively modest size by modern standards, it produces remarkably coherent text.


Step 2: Comparing Generation Strategies

Strategy 1: Greedy Decoding

prompt = "The future of artificial intelligence"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

# Greedy
greedy_output = model.generate(
    input_ids,
    max_new_tokens=100,
    do_sample=False,
)
print("GREEDY:")
print(tokenizer.decode(greedy_output[0], skip_special_tokens=True))

Greedy decoding produces grammatically correct but often repetitive text. The model tends to fall into loops, repeating the same phrases.
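
To connect this with the learning objective of implementing strategies from scratch, here is a minimal greedy loop (a sketch without KV caching, so it recomputes the full forward pass at every step). For this prompt it should reproduce the tokens returned by generate() above, provided no end-of-text token cuts the library's output short.

generated = input_ids.clone()
with torch.no_grad():
    for _ in range(100):
        logits = model(generated).logits                       # (1, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        generated = torch.cat([generated, next_id], dim=1)

# Should match the library's greedy output token for token.
print(torch.equal(generated[0], greedy_output[0]))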

Strategy 2: Temperature Sampling

for temp in [0.3, 0.7, 1.0, 1.5]:
    output = model.generate(
        input_ids,
        max_new_tokens=100,
        do_sample=True,
        temperature=temp,
    )
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"\nTEMPERATURE = {temp}:")
    print(text)

Typical observations across these settings:

  • $\tau = 0.3$: Nearly deterministic, very similar to greedy decoding.
  • $\tau = 0.7$: Good balance of coherence and diversity.
  • $\tau = 1.0$: More creative but occasionally incoherent.
  • $\tau = 1.5$: Very random, frequently produces nonsensical text.
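
For reference, a single temperature-scaled sampling step written from scratch looks like the following sketch; the library applies the same scaling internally before sampling.

import torch.nn.functional as F

def sample_with_temperature(logits: torch.Tensor, temperature: float) -> int:
    """Sample one token id from 1-D logits after temperature scaling."""
    probs = F.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# Example: sample the token that follows the prompt at tau = 0.7.
with torch.no_grad():
    next_logits = model(input_ids).logits[0, -1, :]
print(tokenizer.decode([sample_with_temperature(next_logits, 0.7)]))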

Strategy 3: Top-k Sampling

for k in [5, 10, 50]:
    output = model.generate(
        input_ids,
        max_new_tokens=100,
        do_sample=True,
        top_k=k,
        temperature=0.8,
    )
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"\nTOP-K = {k}:")
    print(text)
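
Top-k sampling keeps only the k highest-probability tokens and renormalizes before sampling: small k behaves almost like greedy decoding, while large k approaches unrestricted sampling. A from-scratch filter over 1-D logits might look like this sketch (the library version also handles batching and edge cases):

def top_k_filter(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Set all but the k largest logits to -inf so softmax assigns them zero probability."""
    kth_value = torch.topk(logits, k).values[-1]
    return logits.masked_fill(logits < kth_value, float("-inf"))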

Strategy 4: Nucleus (Top-p) Sampling

for p in [0.8, 0.92, 0.95]:
    output = model.generate(
        input_ids,
        max_new_tokens=100,
        do_sample=True,
        top_p=p,
        temperature=0.8,
    )
    text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"\nTOP-P = {p}:")
    print(text)

Nucleus sampling with $p = 0.92$ and $\tau = 0.8$ is a common production setting that balances quality and diversity well.
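
The corresponding from-scratch nucleus filter keeps the smallest prefix of the sorted distribution whose cumulative probability exceeds p (again a sketch for 1-D logits). Both filters compose with temperature: scale the logits by $1/\tau$ first, filter, then sample from the softmax.

def top_p_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
    """Mask tokens outside the smallest set whose cumulative probability exceeds p."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True)
    cumulative = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
    remove = cumulative > p
    remove[1:] = remove[:-1].clone()    # shift right so the token that crosses p is kept
    remove[0] = False
    remove_vocab_order = torch.zeros_like(remove)
    remove_vocab_order[sorted_idx] = remove    # undo the sort back to vocabulary order
    return logits.masked_fill(remove_vocab_order, float("-inf"))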


Step 3: Quantitative Comparison

We measure generation quality using several metrics:

def compute_repetition_metrics(
    text: str, n_values: list[int] = [2, 3, 4]
) -> dict[str, float]:
    """Compute n-gram repetition rates in generated text.

    Args:
        text: Generated text string.
        n_values: List of n-gram sizes to check.

    Returns:
        Dictionary mapping metric names to values.
    """
    words = text.split()
    metrics = {}
    for n in n_values:
        ngrams = [tuple(words[i:i+n]) for i in range(len(words) - n + 1)]
        if ngrams:
            unique = len(set(ngrams))
            total = len(ngrams)
            metrics[f"unique_{n}gram_ratio"] = unique / total
        else:
            metrics[f"unique_{n}gram_ratio"] = 1.0
    return metrics
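
The overview also lists perplexity as a quality signal. One simple option is to score generated text under GPT-2 itself, which measures fluency by the model's own estimate rather than independent quality; a minimal sketch:

def compute_perplexity(text: str) -> float:
    """Perplexity of `text` under GPT-2 (lower = more fluent by the model's own estimate)."""
    ids = tokenizer.encode(text, return_tensors="pt")
    with torch.no_grad():
        loss = model(ids, labels=ids).loss   # mean cross-entropy; labels are shifted internally
    return torch.exp(loss).item()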

We generate 10 samples with each strategy and compute the average metrics:

Strategy                       Unique Bigrams   Unique Trigrams   Unique 4-grams
Greedy                         0.52             0.68              0.75
Temperature (0.7)              0.89             0.95              0.97
Top-k (50)                     0.91             0.96              0.98
Top-p (0.92)                   0.90             0.96              0.98
Greedy + repetition penalty    0.85             0.93              0.96

The results confirm that greedy decoding suffers from severe repetition, while sampling strategies produce much more diverse text.
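
For reference, such an evaluation can be run with a loop of the following shape (a sketch; the per-strategy settings are assumptions mirroring the earlier steps, and the greedy-plus-penalty row would add repetition_penalty=1.2):

strategies = {
    "greedy": dict(do_sample=False),
    "temperature_0.7": dict(do_sample=True, temperature=0.7),
    "top_k_50": dict(do_sample=True, top_k=50, temperature=0.8),
    "top_p_0.92": dict(do_sample=True, top_p=0.92, temperature=0.8),
}

for name, kwargs in strategies.items():
    ratios = []
    for _ in range(10):   # greedy is deterministic, so its ten samples are identical
        out = model.generate(input_ids, max_new_tokens=100, **kwargs)
        text = tokenizer.decode(out[0], skip_special_tokens=True)
        ratios.append(compute_repetition_metrics(text)["unique_3gram_ratio"])
    print(f"{name}: mean unique trigram ratio = {sum(ratios) / len(ratios):.2f}")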


Step 4: Repetition Penalty

output = model.generate(
    input_ids,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.8,
    top_p=0.92,
    repetition_penalty=1.2,
)
text = tokenizer.decode(output[0], skip_special_tokens=True)
print("WITH REPETITION PENALTY (1.2):")
print(text)

The repetition penalty rescales the logits of tokens that already appear in the context: positive logits are divided by the penalty factor $\alpha$ and negative logits are multiplied by it, so previously seen tokens become less likely either way. With $\alpha = 1.2$, repetition is noticeably discouraged, producing more diverse output without the incoherence that very high temperature introduces.
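
A from-scratch version of this adjustment for 1-D logits might look like the following sketch of the CTRL-style penalty (batching and efficiency are ignored):

def apply_repetition_penalty(
    logits: torch.Tensor, context_ids: torch.Tensor, alpha: float = 1.2
) -> torch.Tensor:
    """Down-weight every token id that already appears in `context_ids`."""
    logits = logits.clone()
    for token_id in set(context_ids.tolist()):
        if logits[token_id] > 0:
            logits[token_id] /= alpha   # positive logit: shrink toward zero
        else:
            logits[token_id] *= alpha   # negative logit: push further down
    return logits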


Step 5: Attention Pattern Analysis

import torch

prompt = "The Transformer architecture uses"
input_ids = tokenizer.encode(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(input_ids, output_attentions=True)

# outputs.attentions: tuple of (n_layers) tensors
# Each tensor: (batch, n_heads, seq_len, seq_len)
attentions = outputs.attentions
tokens = tokenizer.convert_ids_to_tokens(input_ids[0])

print(f"Number of layers: {len(attentions)}")
print(f"Attention shape per layer: {attentions[0].shape}")
print(f"Tokens: {tokens}")

# Analyze head specialization
for layer in [0, 5, 11]:
    attn = attentions[layer][0]  # (n_heads, seq_len, seq_len)
    for head in range(min(4, attn.size(0))):
        # Check if head attends to previous token
        prev_token_attn = 0.0
        for pos in range(1, attn.size(1)):
            prev_token_attn += attn[head, pos, pos - 1].item()
        prev_token_attn /= (attn.size(1) - 1)

        # Check if head attends to the first token (skip position 0, whose
        # attention is trivially all on token 0 under the causal mask)
        first_token_attn = attn[head, 1:, 0].mean().item()

        print(f"  Layer {layer}, Head {head}: "
              f"prev_token={prev_token_attn:.3f}, "
              f"first_token={first_token_attn:.3f}")

Common attention head specialization patterns in GPT-2:

  • Previous token heads: Strongly attend to the immediately preceding token (common in early layers).
  • Positional heads: Attend to specific fixed positions (e.g., the first token).
  • Induction heads: Copy patterns from earlier in the sequence (common in middle layers; probed in the sketch after this list).
  • Semantic heads: Attend to semantically related tokens (common in later layers).
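
The induction-head pattern can be probed directly by repeating a token sequence and checking how strongly each head's attention in the second copy points back to the token that followed the same token in the first copy. This is a rough diagnostic sketch, not the full analysis from the interpretability literature:

# Build a prompt made of a random token sequence repeated twice.
T = 20
first_copy = torch.randint(0, tokenizer.vocab_size, (1, T))
repeated_ids = torch.cat([first_copy, first_copy], dim=1)           # shape (1, 2T)

with torch.no_grad():
    rep_attn = model(repeated_ids, output_attentions=True).attentions

for layer, layer_attn in enumerate(rep_attn):
    # layer_attn: (1, n_heads, 2T, 2T). For a query at position T + t (second copy),
    # an induction head attends to key position t + 1 (the continuation in the first copy).
    second_half = layer_attn[0, :, T:, :]                           # (n_heads, T, 2T)
    scores = second_half.diagonal(offset=1, dim1=-2, dim2=-1).mean(dim=-1)
    best = scores.argmax().item()
    print(f"Layer {layer:2d}: most induction-like head = {best}, score = {scores[best]:.3f}")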

Discussion

Key Findings

  1. Greedy decoding is deterministic but degenerate: It always selects the highest-probability token, leading to repetitive loops. This degeneration is one of the most common failure modes practitioners encounter when deploying language models for open-ended generation.

  2. Temperature controls the explore-exploit trade-off: Low temperature produces conservative, predictable text; high temperature produces creative but potentially incoherent text.

  3. Nucleus sampling adapts to the distribution: Unlike top-k, which uses a fixed number of candidates regardless of model confidence, top-p dynamically adjusts the candidate set. This makes it more robust across different contexts.

  4. Repetition penalty is complementary: It can be combined with any sampling strategy to further reduce repetition without significantly affecting coherence.

  5. Attention heads specialize: Even without explicit supervision, GPT-2's attention heads learn distinct roles (positional, syntactic, semantic) that contribute to the model's language understanding.

Practical Recommendations

For most text generation applications:

  • Use nucleus sampling with $p = 0.9$--$0.95$ and temperature $\tau = 0.7$--$0.9$.
  • Apply a mild repetition penalty ($\alpha = 1.1$--$1.3$).
  • Set a maximum length to prevent runaway generation.
  • For tasks requiring determinism (e.g., code generation), use lower temperature or greedy decoding with repetition penalty.
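
Put together, these recommendations translate into a generate() call along the following lines; treat it as a starting point rather than a definitive configuration (pad_token_id is set to the EOS id because GPT-2 has no dedicated padding token):

output = model.generate(
    input_ids,
    max_new_tokens=200,                     # hard cap to prevent runaway generation
    do_sample=True,
    temperature=0.8,
    top_p=0.92,
    repetition_penalty=1.2,
    pad_token_id=tokenizer.eos_token_id,    # GPT-2 has no pad token; this also silences a warning
)
print(tokenizer.decode(output[0], skip_special_tokens=True))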

Limitations

  • GPT-2 (small) has only 124M parameters and was trained on data from before 2019. Larger and more recent models produce significantly better text.
  • The metrics we used (n-gram diversity) are simple proxies for quality. Human evaluation remains the gold standard.
  • We did not explore beam search, which can be useful for translation and summarization but tends to produce generic text for open-ended generation.

Extensions

  • Fine-tune GPT-2 on a domain-specific corpus and compare generation quality.
  • Implement KV caching and measure the inference speedup.
  • Compare GPT-2 small, medium, and large to observe the effect of scale on generation quality.
  • Implement contrastive decoding using GPT-2 small as the amateur model and GPT-2 large as the expert model.