Case Study 1: Fine-Tuning BERT for Sentiment Analysis
Overview
In this case study, we walk through a complete, production-quality pipeline for fine-tuning BERT on sentiment analysis using the Stanford Sentiment Treebank (SST-2). SST-2 is a binary classification task where movie reviews are labeled as positive or negative. It is one of the benchmark tasks in the GLUE suite and serves as an excellent introduction to the transfer learning workflow.
We will cover data exploration, tokenization, model configuration, training with hyperparameter tuning, evaluation with multiple metrics, and error analysis. This case study directly applies the concepts from Sections 20.5 and 20.6 of the chapter.
Motivation
Sentiment analysis is one of the most common NLP applications in industry. Companies use it to analyze customer reviews, social media posts, support tickets, and survey responses. Before pre-trained models, building an accurate sentiment classifier required:
- Large labeled datasets (often tens of thousands of examples)
- Careful feature engineering (n-grams, sentiment lexicons, syntax features)
- Task-specific architectures (CNNs, LSTMs with attention)
With BERT, we can achieve strong performance by fine-tuning a pre-trained model on just a few thousand labeled examples, with minimal feature engineering and a standard classification head.
Dataset: SST-2
The Stanford Sentiment Treebank v2 (SST-2) contains 67,349 training examples, 872 validation examples, and 1,821 test examples. Each example is a sentence from a movie review with a binary sentiment label (0 = negative, 1 = positive).
```python
from datasets import load_dataset

dataset = load_dataset("glue", "sst2")
print(f"Train: {len(dataset['train']):,}")
print(f"Validation: {len(dataset['validation']):,}")

# Examine some examples
for i in range(5):
    example = dataset["train"][i]
    label = "positive" if example["label"] == 1 else "negative"
    print(f"  [{label}] {example['sentence']}")
```
Class distribution analysis reveals a roughly balanced dataset with approximately 56% positive and 44% negative examples, so accuracy is a reasonable primary metric.
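The class balance quoted above can be checked directly from the label column (a quick sketch):

```python
from collections import Counter

# Label distribution in the training split (1 = positive, 0 = negative)
label_counts = Counter(dataset["train"]["label"])
total = sum(label_counts.values())
for label_id, count in sorted(label_counts.items()):
    name = "positive" if label_id == 1 else "negative"
    print(f"{name}: {count:,} ({count / total:.1%})")
```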
Step 1: Data Exploration and Preprocessing
Before tokenization, we analyze the data to inform our preprocessing choices.
```python
import numpy as np

# Analyze sentence lengths (word-level)
word_lengths = [len(ex["sentence"].split()) for ex in dataset["train"]]
print(f"Mean words: {np.mean(word_lengths):.1f}")
print(f"Median words: {np.median(word_lengths):.1f}")
print(f"95th percentile: {np.percentile(word_lengths, 95):.0f}")
print(f"Max words: {max(word_lengths)}")
```
Typical findings: the mean sentence length is around 19 words, and the 95th percentile is around 36 words. After subword tokenization, these lengths increase modestly. A max_length of 128 tokens is more than sufficient for SST-2 (BERT's maximum is 512).
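To confirm that a 128-token budget is generous after WordPiece splitting, we can tokenize a sample of training sentences and inspect the resulting lengths (a quick check; the 2,000-example sample size is arbitrary, and the counts exclude the [CLS] and [SEP] special tokens):

```python
from transformers import AutoTokenizer

# Subword lengths on a random sample of the training set (sampling keeps this fast)
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
sample = dataset["train"].shuffle(seed=42).select(range(2000))
subword_lengths = [len(tok.tokenize(ex["sentence"])) for ex in sample]
print(f"Mean subwords: {np.mean(subword_lengths):.1f}")
print(f"95th percentile subwords: {np.percentile(subword_lengths, 95):.0f}")
print(f"Max subwords: {max(subword_lengths)}")
```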
Step 2: Tokenization
We use BERT's WordPiece tokenizer. Key decisions:
- max_length = 128: Covers virtually all examples with room to spare, while being efficient.
- padding = "max_length": Pads all sequences to the same length for efficient batching.
- truncation = True: Truncates any sequence exceeding 128 tokens (rare for SST-2).
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_function(examples: dict) -> dict:
    """Tokenize a batch of sentences for BERT.

    Args:
        examples: Dictionary with a 'sentence' key containing a list of strings.

    Returns:
        Dictionary with input_ids, attention_mask, and token_type_ids.
    """
    return tokenizer(
        examples["sentence"],
        padding="max_length",
        truncation=True,
        max_length=128,
    )

tokenized_datasets = dataset.map(tokenize_function, batched=True)
tokenized_datasets.set_format("torch", columns=["input_ids", "attention_mask", "label"])
```
Let us inspect a tokenized example:
```python
example = tokenized_datasets["train"][0]
tokens = tokenizer.convert_ids_to_tokens(example["input_ids"])
print(f"Original: {dataset['train'][0]['sentence']}")
print(f"Tokens: {tokens[:20]}...")
print(f"Input IDs shape: {example['input_ids'].shape}")
```
Step 3: Model Configuration
We load a pre-trained BERT-Base model with a sequence classification head:
```python
import torch
from transformers import AutoModelForSequenceClassification

torch.manual_seed(42)

model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",
    num_labels=2,
)

# Inspect model architecture
total_params = sum(p.numel() for p in model.parameters())
trainable_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"Total parameters: {total_params:,}")
print(f"Trainable parameters: {trainable_params:,}")
# Total parameters: ~109,483,778
# The classification head adds a 768x2 linear layer
```
The model consists of:
- BERT encoder (12 layers, 768 hidden units, 12 attention heads): ~109M parameters
- Classification head (768 -> 2): 1,538 parameters (768 * 2 weights + 2 biases)
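For BERT-style checkpoints in Transformers, the classification head lives in model.classifier, so the 1,538 figure can be verified directly:

```python
# The classification head is a single linear layer: 768 inputs -> 2 logits
head_params = sum(p.numel() for p in model.classifier.parameters())
print(model.classifier)                      # Linear(in_features=768, out_features=2, bias=True)
print(f"Head parameters: {head_params:,}")   # 768 * 2 + 2 = 1,538
```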
Step 4: Training Configuration
Based on the recommendations in Section 20.5.4:
```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./sst2_results",
    eval_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=32,
    per_device_eval_batch_size=64,
    num_train_epochs=3,
    weight_decay=0.01,
    warmup_ratio=0.1,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    seed=42,
    logging_steps=100,
    fp16=torch.cuda.is_available(),
)
```
Key choices:
- Learning rate = 2e-5: The lower end of the recommended range, conservative but stable.
- Warmup ratio = 0.1: 10% of training steps with linearly increasing learning rate.
- Weight decay = 0.01: Light regularization to prevent overfitting.
- load_best_model_at_end: Automatically keeps the checkpoint with the best validation accuracy.
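It helps to translate warmup_ratio into concrete optimizer steps. With 67,349 training examples, a batch size of 32, and 3 epochs, there are about 2,105 steps per epoch and roughly 6,315 in total, so a 0.1 warmup ratio covers roughly the first 630 steps (a back-of-the-envelope check, assuming a single device and no gradient accumulation):

```python
import math

# Translate warmup_ratio into optimizer steps (single GPU, no gradient accumulation)
num_train_examples = len(dataset["train"])              # 67,349
steps_per_epoch = math.ceil(num_train_examples / 32)    # per_device_train_batch_size
total_steps = steps_per_epoch * 3                       # num_train_epochs
warmup_steps = int(0.1 * total_steps)                   # warmup_ratio
print(f"Steps per epoch: {steps_per_epoch:,}")   # 2,105
print(f"Total steps: {total_steps:,}")           # 6,315
print(f"Warmup steps: {warmup_steps:,}")         # 631
```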
Step 5: Evaluation Metrics
Beyond accuracy, we track precision, recall, and F1 to understand the model's behavior on each class:
```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

def compute_metrics(eval_pred: tuple) -> dict:
    """Compute classification metrics from model predictions.

    Args:
        eval_pred: Tuple of (logits, labels) from the Trainer.

    Returns:
        Dictionary with accuracy, precision, recall, and F1 scores.
    """
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    accuracy = accuracy_score(labels, predictions)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, predictions, average="binary"
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```
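As a quick sanity check before handing compute_metrics to the Trainer, we can call it on a pair of toy logits (the values below are made up):

```python
# Two toy examples: logits favor class 1 then class 0, and the labels agree
toy_logits = np.array([[0.2, 1.5], [2.0, -0.3]])
toy_labels = np.array([1, 0])
print(compute_metrics((toy_logits, toy_labels)))
# accuracy, precision, recall, and f1 should all be 1.0
```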
Step 6: Training
```python
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["validation"],
    compute_metrics=compute_metrics,
)

# Train the model
train_result = trainer.train()

# Print final metrics
print(f"Training loss: {train_result.training_loss:.4f}")
print(f"Training time: {train_result.metrics['train_runtime']:.1f}s")
```
Expected results after 3 epochs on SST-2:
| Metric | Value |
|---|---|
| Validation Accuracy | ~92-93% |
| Validation F1 | ~92-93% |
| Training Time (GPU) | ~15-20 minutes |
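The validation figures in the table come from the epoch-level evaluations logged during training; they can also be reproduced explicitly once training finishes (the Trainer prefixes each metric name with eval_):

```python
# Re-evaluate the best checkpoint restored by load_best_model_at_end
eval_metrics = trainer.evaluate()
for name, value in sorted(eval_metrics.items()):
    print(f"{name}: {value}")
```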
Step 7: Error Analysis
Understanding where the model fails is as important as knowing its overall accuracy:
```python
import torch

# Get predictions on validation set
predictions_output = trainer.predict(tokenized_datasets["validation"])
predictions = np.argmax(predictions_output.predictions, axis=-1)
labels = predictions_output.label_ids

# Find misclassified examples
errors = []
for i in range(len(labels)):
    if predictions[i] != labels[i]:
        errors.append({
            "sentence": dataset["validation"][i]["sentence"],
            "true_label": "positive" if labels[i] == 1 else "negative",
            "predicted": "positive" if predictions[i] == 1 else "negative",
            "confidence": float(
                torch.softmax(
                    torch.tensor(predictions_output.predictions[i]), dim=0
                ).max()
            ),
        })

print(f"Total errors: {len(errors)} / {len(labels)}")
print("\nSample errors:")
for err in errors[:10]:
    print(f"  [{err['true_label']} -> {err['predicted']}] "
          f"(conf: {err['confidence']:.3f}) {err['sentence']}")
```
Common error patterns include:
- Sarcasm and irony: "Oh, what a brilliant movie---if you enjoy watching paint dry." The literal positive words mislead the model.
- Negation: "Not the worst movie I have seen" may be misclassified as negative due to the presence of "worst."
- Mixed sentiment: "The acting was superb but the plot was a disaster" expresses both positive and negative sentiment.
- Subtle sentiment: Reviews that use understatement or domain-specific language.
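These failure modes can be probed directly by scoring a few hand-written sentences with the fine-tuned model (a small sketch; the sentences below are illustrative and not drawn from SST-2):

```python
# Hand-written probes for the failure modes listed above
probe_sentences = [
    "Oh, what a brilliant movie, if you enjoy watching paint dry.",  # sarcasm
    "Not the worst movie I have seen.",                              # negation
    "The acting was superb but the plot was a disaster.",            # mixed
]
inputs = tokenizer(probe_sentences, padding=True, truncation=True, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

model.eval()
with torch.no_grad():
    probs = torch.softmax(model(**inputs).logits, dim=-1)

for sentence, p in zip(probe_sentences, probs):
    label = "positive" if p[1] > p[0] else "negative"
    print(f"[{label} {p.max():.2f}] {sentence}")
```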
Step 8: Hyperparameter Sensitivity Analysis
We systematically evaluate the impact of key hyperparameters:
```python
# Pseudocode for hyperparameter search
learning_rates = [2e-5, 3e-5, 5e-5]
batch_sizes = [16, 32]
epochs = [2, 3, 4]

results = []
for lr in learning_rates:
    for bs in batch_sizes:
        for ep in epochs:
            # Train and evaluate with these hyperparameters
            # Record validation accuracy
            pass
```
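One way to realize this sweep with the Trainer API is sketched below; model_init gives every run fresh pre-trained weights, the output directory naming is purely illustrative, and a full 18-run grid is of course expensive:

```python
import itertools

from transformers import AutoModelForSequenceClassification, Trainer, TrainingArguments

def model_init():
    """Fresh pre-trained weights for every run in the sweep."""
    return AutoModelForSequenceClassification.from_pretrained(
        "bert-base-uncased", num_labels=2
    )

for lr, bs, ep in itertools.product(learning_rates, batch_sizes, epochs):
    run_args = TrainingArguments(
        output_dir=f"./sst2_sweep/lr{lr}_bs{bs}_ep{ep}",  # naming is illustrative
        learning_rate=lr,
        per_device_train_batch_size=bs,
        num_train_epochs=ep,
        eval_strategy="epoch",
        weight_decay=0.01,
        warmup_ratio=0.1,
        seed=42,
        report_to="none",
    )
    sweep_trainer = Trainer(
        model_init=model_init,
        args=run_args,
        train_dataset=tokenized_datasets["train"],
        eval_dataset=tokenized_datasets["validation"],
        compute_metrics=compute_metrics,
    )
    sweep_trainer.train()
    metrics = sweep_trainer.evaluate()
    results.append({"lr": lr, "batch_size": bs, "epochs": ep,
                    "accuracy": metrics["eval_accuracy"]})
```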
Typical findings for SST-2:
- Learning rate: 2e-5 and 3e-5 perform similarly; 5e-5 occasionally shows instability.
- Batch size: 32 is slightly better than 16, likely due to more stable gradient estimates.
- Epochs: 3 epochs is generally optimal; 4 epochs shows slight overfitting on the small validation set.
Step 9: Comparison with Feature Extraction
For comparison, we also evaluate the feature extraction approach:
```python
from transformers import AutoModel

# Feature extraction: freeze BERT, train only classifier
feature_model = AutoModel.from_pretrained("bert-base-uncased")
for param in feature_model.parameters():
    param.requires_grad = False

# Extract [CLS] embeddings for all training examples
# Train a simple logistic regression or small MLP on top
```
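A minimal way to carry out those two commented steps, assuming a plain logistic regression on the final-layer [CLS] vector and an illustrative helper named cls_embeddings:

```python
import numpy as np
import torch
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
feature_model = feature_model.to(device).eval()

def cls_embeddings(split, batch_size=64):
    """Return the final-layer [CLS] vector for every example in a tokenized split."""
    features = []
    for start in range(0, len(split), batch_size):
        batch = split[start : start + batch_size]
        inputs = {k: batch[k].to(device) for k in ("input_ids", "attention_mask")}
        with torch.no_grad():
            hidden = feature_model(**inputs).last_hidden_state  # (batch, seq, 768)
        features.append(hidden[:, 0, :].cpu().numpy())           # [CLS] is position 0
    return np.concatenate(features)

train_X = cls_embeddings(tokenized_datasets["train"])
val_X = cls_embeddings(tokenized_datasets["validation"])
train_y = np.asarray(tokenized_datasets["train"]["label"])
val_y = np.asarray(tokenized_datasets["validation"]["label"])

clf = LogisticRegression(max_iter=1000).fit(train_X, train_y)
print(f"Feature-extraction validation accuracy: {clf.score(val_X, val_y):.3f}")
```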
Expected feature extraction accuracy on SST-2: ~85-87%, compared to ~92-93% for fine-tuning. The 5-7 percentage point gap illustrates the value of fine-tuning, especially when sufficient labeled data is available.
Key Takeaways from This Case Study
- Fine-tuning BERT on SST-2 achieves >92% accuracy with minimal code and standard hyperparameters, far exceeding what was possible with pre-BERT approaches using similar amounts of engineering effort.
- The training recipe matters. Small changes in learning rate, warmup, and number of epochs can shift accuracy by 1-2 percentage points. Always tune these for your specific task.
- Error analysis reveals systematic failure modes. Sarcasm, negation, and mixed sentiment remain challenging even for BERT. These insights guide further improvements (data augmentation, ensemble methods, or specialized training).
- Feature extraction is a viable fast alternative when compute is limited or labeled data is very scarce, but fine-tuning is preferred when resources allow.
- The HuggingFace Trainer abstracts away boilerplate while remaining flexible enough for production use. The compute_metrics, load_best_model_at_end, and fp16 features are particularly valuable.
Extension Ideas
- Fine-tune RoBERTa instead of BERT and compare results.
- Implement gradual unfreezing: unfreeze layers one at a time during training.
- Add data augmentation (synonym replacement, back-translation) and measure the impact.
- Deploy the model using the HuggingFace pipeline API for real-time inference (see the sketch after this list).
- Evaluate on out-of-domain sentiment data (e.g., product reviews, tweets) to test generalization.
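For the deployment idea above, a minimal starting point with the pipeline API might look like the sketch below; without an id2label mapping, the pipeline reports the raw LABEL_0 / LABEL_1 names:

```python
import torch
from transformers import pipeline

# Serve the fine-tuned model from this case study with the pipeline API
sentiment = pipeline(
    "text-classification",
    model=trainer.model,
    tokenizer=tokenizer,
    device=0 if torch.cuda.is_available() else -1,
)
print(sentiment([
    "A thoughtful, beautifully acted film.",
    "Two hours of my life I will never get back.",
]))
# Each result is a dict with a 'label' (LABEL_0 / LABEL_1) and a 'score'
```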