Case Study 22-2: BERT for Fact Verification — Fine-Tuning on FEVER
Overview
The FEVER (Fact Extraction and VERification) benchmark represents the most complete and rigorous task formulation for automated fact verification. Unlike LIAR, which classifies claims in isolation, FEVER requires systems to: (1) retrieve relevant evidence from Wikipedia, (2) identify specific supporting or refuting sentences, and (3) predict a verdict (SUPPORTED, REFUTED, NOT ENOUGH INFORMATION). This end-to-end pipeline mirrors — in a structured way — what a human fact-checker actually does.
This case study walks through the FEVER task definition, the pipeline architecture that top-performing systems use, the role of BERT in evidence reasoning, training and evaluation procedures, and an honest analysis of what FEVER results do and do not tell us about real-world fact verification capabilities.
1. FEVER Dataset: Design and Construction
1.1 Construction Methodology
FEVER was constructed by having crowdworkers create claims by mutating sentences from Wikipedia. The mutation process was designed to produce a diverse range of verifiable claims:
Claim generation operations:
1. Direct paraphrase: Rephrase a Wikipedia sentence without changing its truth
2. Entity substitution: Replace an entity with a different entity (producing a false claim)
3. Negation: Add or remove negation from a factual statement
4. Numerical substitution: Change a number (year, quantity, percentage)
5. Antonym substitution: Replace a word with its antonym
6. Sentence composition: Combine information from multiple Wikipedia sentences
- For SUPPORTED claims: the original or closely paraphrased statement
- For REFUTED claims: the mutated version containing a false element
- For NOT ENOUGH INFORMATION: claims whose verification requires information not in Wikipedia
1.2 Dataset Statistics
- Total claims: 185,455
- Training set: 145,449
- Development set: 19,998
- Test set: 19,998 (gold labels initially withheld)
Label distribution (training set):
- SUPPORTED: 80,035 (55%)
- REFUTED: 29,775 (20%)
- NOT ENOUGH INFO: 35,639 (25%)
Evidence annotation: Each SUPPORTED or REFUTED claim includes:
- The set of Wikipedia articles that contain relevant evidence
- The specific sentences within those articles that constitute evidence
- Multiple annotated evidence sets for some claims (when multiple Wikipedia paths support/refute the claim)
1.3 Key Evaluation Metric: FEVER Score
The primary FEVER evaluation metric is the FEVER score, which requires:
1. Correctly predicting the label (SUPPORTED/REFUTED/NEI)
2. AND providing the complete gold evidence set (for SUPPORTED and REFUTED claims)
A prediction that correctly classifies a claim as SUPPORTED but provides different evidence sentences than the gold standard fails the FEVER metric — even if the predicted evidence is equally valid. This strictness reflects the requirement that automated systems must identify the specific Wikipedia evidence that supports their verdict, not just any plausible evidence.
FEVER score = fraction of claims that are both correctly labeled AND have fully correct evidence.
1.4 Comparison Metrics
- Label accuracy: Fraction correctly labeled, ignoring evidence
- Evidence precision/recall/F1: Quality of retrieved evidence, ignoring label
- FEVER score: The combined metric (both label and evidence correct)
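The relationship between these metrics can be made concrete with a small sketch. The function below is illustrative, not the official FEVER scorer (which additionally caps predicted evidence at five sentences); the field names and data layout are assumptions for this example.

```python
# Minimal sketch of label accuracy and FEVER score. Each prediction carries a
# label and a set of (page, sentence_id) evidence pairs; each gold entry lists
# alternative evidence sets, any ONE of which must be fully covered.

def fever_metrics(predictions, gold):
    """predictions/gold: parallel lists of dicts (see comments above)."""
    n = len(gold)
    correct_label = 0
    fever_correct = 0
    for pred, ref in zip(predictions, gold):
        label_ok = pred["label"] == ref["label"]
        correct_label += label_ok
        if ref["label"] == "NOT ENOUGH INFO":
            evidence_ok = True  # NEI claims carry no evidence requirement
        else:
            # full credit only if some gold evidence set is entirely covered
            evidence_ok = any(set(ev) <= set(pred["evidence"])
                              for ev in ref["evidence_sets"])
        fever_correct += label_ok and evidence_ok
    return {"label_accuracy": correct_label / n,
            "fever_score": fever_correct / n}
```

Note how a correctly labeled claim with the "wrong" (but perhaps equally valid) evidence counts toward label accuracy but not toward the FEVER score, which is exactly the strictness discussed above.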
2. The Three-Stage Pipeline Architecture
Top FEVER systems use a three-stage pipeline that mirrors the structure of human fact verification.
2.1 Stage 1: Document Retrieval
Goal: Given a claim, retrieve the relevant Wikipedia articles (documents).
Challenge: Wikipedia contains approximately 5.7 million English articles. A brute-force comparison of the claim to every article is impractical. Top systems use:
TF-IDF and entity-based document retrieval: The FEVER baseline ranks articles by TF-IDF similarity between the claim and article text; many top systems instead match claim entities against article titles:
1. Extract all named entities from the claim (using a NER model or entity linker)
2. For each entity, retrieve Wikipedia articles whose title matches the entity
3. Return the top-k matching articles (typically k = 5–10)
For example, for the claim "Marie Curie won the Nobel Prize in Chemistry," entity extraction yields "Marie Curie" and "Nobel Prize in Chemistry," leading directly to the relevant Wikipedia pages.
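A toy version of the title-matching step looks like the following. The titles and the matching rule are invented for illustration; real systems use an entity linker and handle disambiguation pages.

```python
# Toy sketch of title-based document retrieval: match extracted claim
# entities against Wikipedia article titles (underscores stand for spaces).

def retrieve_by_title(entities, title_index, k=5):
    """Return up to k titles whose (normalized) title contains an entity."""
    matches = []
    for entity in entities:
        for title in title_index:
            if entity.lower() in title.lower().replace("_", " "):
                matches.append(title)
    return matches[:k]

title_index = ["Marie_Curie", "Nobel_Prize_in_Chemistry", "Pierre_Curie"]
docs = retrieve_by_title(["Marie Curie", "Nobel Prize in Chemistry"], title_index)
# docs == ["Marie_Curie", "Nobel_Prize_in_Chemistry"]
```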
Limitation: Entity extraction fails for claims without named entities ("The tallest building in the world is located in the UAE"). Entity linking fails for ambiguous names. These failures cascade through the pipeline — if the right document is not retrieved, downstream stages cannot succeed.
Neural Document Retrieval (Dense Passage Retrieval, DPR): More recent systems replace TF-IDF with dense retrieval: claims and Wikipedia passages are encoded with a dual-encoder model trained on question-document relevance, and relevant passages are found as nearest neighbors in embedding space. DPR outperforms TF-IDF but at greater computational cost.
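The nearest-neighbor step can be illustrated with a toy example. The three-dimensional vectors below stand in for the output of a trained dual encoder; real DPR systems use hundreds of dimensions and approximate nearest-neighbor indexes (e.g., FAISS) rather than exhaustive search.

```python
# Toy illustration of dense retrieval: the passage whose embedding has the
# highest dot product with the claim embedding is retrieved.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def nearest_passage(claim_vec, passage_vecs):
    """Return the index of the passage embedding closest to the claim."""
    return max(range(len(passage_vecs)),
               key=lambda i: dot(claim_vec, passage_vecs[i]))

passages = [[0.9, 0.1, 0.0], [0.1, 0.8, 0.2], [0.0, 0.2, 0.9]]
claim = [0.85, 0.15, 0.05]
# nearest_passage(claim, passages) == 0
```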
2.2 Stage 2: Sentence Selection
Goal: Given retrieved documents, select the specific sentences that constitute evidence for the claim.
This is a binary classification problem: For each sentence in each retrieved document, predict whether it is evidence for this claim (yes/no).
Features used:
- BERT encoding of [CLS] CLAIM [SEP] SENTENCE [SEP] (treating it as sentence-pair classification)
- Positional features (sentence position in document, section of Wikipedia article)
- Whether the sentence contains named entities also appearing in the claim
Fine-tuning BERT for sentence selection:
Input: [CLS] claim tokens [SEP] candidate sentence tokens [SEP]
BERT encoder (12 layers)
[CLS] representation → Linear(768, 2) → Softmax
Output: P(evidence | claim, sentence)
Sentences above a threshold probability are selected as evidence. Typically, the top-5 sentences across all retrieved documents are retained.
Evaluation: Evidence precision@k, recall@k, F1.
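The selection step that follows BERT scoring can be sketched as below. The probability values are invented; in a real system they would come from the softmax head described above.

```python
# Sketch of evidence selection: keep candidate sentences whose predicted
# P(evidence) clears a threshold, capped at the top-5 across all documents.

def select_evidence(scored_sentences, threshold=0.5, k=5):
    """scored_sentences: list of (doc_id, sent_id, probability) tuples."""
    kept = [s for s in scored_sentences if s[2] >= threshold]
    kept.sort(key=lambda s: s[2], reverse=True)  # highest-confidence first
    return kept[:k]

scored = [("Marie_Curie", 0, 0.97), ("Marie_Curie", 3, 0.41),
          ("Nobel_Prize_in_Chemistry", 1, 0.88)]
# keeps the two sentences scoring >= 0.5, highest first
```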
Key finding from FEVER research: Sentence selection is the bottleneck stage. The evidence recall of top systems ranges from 80–90% — meaning 10–20% of SUPPORTED/REFUTED claims fail at this stage. When sentence selection is perfect (oracle evidence provided), verdict accuracy rises dramatically, demonstrating that the verdict prediction model can reason well if it has the right evidence.
2.3 Stage 3: Verdict Prediction (Natural Language Inference)
Goal: Given the claim and selected evidence sentences, predict SUPPORTED / REFUTED / NOT ENOUGH INFORMATION.
This is framed as a Natural Language Inference (NLI) problem: does the evidence entail, contradict, or have no relationship with the claim?
Fine-tuning BERT for NLI on FEVER:
Input: [CLS] claim [SEP] evidence sentence 1 [SEP] evidence sentence 2 ... [SEP]
BERT encoder
[CLS] representation → Linear(768, 3) → Softmax
Output: P(SUPPORTED), P(REFUTED), P(NEI)
Multi-evidence aggregation: When multiple evidence sentences are selected, they can be concatenated (respecting BERT's 512 token limit) or aggregated with an attention mechanism over individual sentence predictions.
Performance on oracle evidence: ~77–80% label accuracy when gold evidence is provided. Performance on retrieved evidence: ~60–65% label accuracy (evidence retrieval errors propagate).
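The concatenation strategy for multi-evidence aggregation can be sketched as follows. Whitespace tokenization stands in for BERT's WordPiece tokenizer here, so the token counts are only approximate.

```python
# Sketch of packing evidence sentences into a single BERT input while
# respecting the 512-token budget; overflowing evidence is dropped.

def build_nli_input(claim, evidence_sentences, max_tokens=512):
    """Return '[CLS] claim [SEP] ev1 [SEP] ev2 ... [SEP]', within budget."""
    tokens = ["[CLS]"] + claim.split() + ["[SEP]"]
    for sent in evidence_sentences:
        candidate = sent.split() + ["[SEP]"]
        if len(tokens) + len(candidate) > max_tokens:
            break  # this and later sentences would overflow the budget
        tokens += candidate
    return " ".join(tokens)
```

A design note: dropping overflow evidence is the simplest policy; the attention-based alternative mentioned above instead scores each evidence sentence separately and aggregates the per-sentence predictions.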
3. Fine-Tuning Procedure
3.1 Selecting the Pre-Trained Model
For FEVER, the choice between BERT variants matters:
| Model | Parameters | FEVER Advantage |
|---|---|---|
| BERT-base | 110M | Faster, good baseline |
| BERT-large | 340M | Higher accuracy, requires more GPU memory |
| RoBERTa-large | 355M | Typically +2–4% vs. BERT-large |
| ALBERT-xxlarge | 235M (shared) | Parameter-efficient, competitive |
RoBERTa-large is typically the practical choice for FEVER research when computational resources permit, due to its consistently superior performance.
3.2 Hyperparameter Configuration
Standard fine-tuning hyperparameters for FEVER verdict prediction:
# Typical configuration
learning_rate = 2e-5 # Adam with weight decay 0.01
batch_size = 32 # Per-GPU; can gradient-accumulate
max_seq_length = 512 # BERT max
num_epochs = 3 # Usually sufficient; more risks catastrophic forgetting
warmup_steps = 0.1 * total_steps # 10% warmup
gradient_clipping = 1.0 # Prevents gradient explosion
Catastrophic forgetting: Fine-tuning with too high a learning rate or for too many epochs causes the model to "forget" pre-trained linguistic knowledge while fitting the task. This is a general neural-network phenomenon, but it is especially costly in transfer learning, where the pre-trained representations are the whole point. The small learning rate (2e-5) and limited epochs are designed to preserve those representations while adapting to the task.
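The warmup configuration above implies a learning-rate schedule like the following sketch: a linear ramp from zero to the peak rate over the first 10% of steps, then a linear decay to zero (the schedule used by common BERT fine-tuning recipes; the function name is ours).

```python
# Linear warmup-then-decay learning-rate schedule implied by
# warmup_steps = 0.1 * total_steps in the configuration above.

def lr_at(step, total_steps, peak_lr=2e-5, warmup_frac=0.1):
    warmup_steps = int(warmup_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * step / warmup_steps  # ramp up from 0
    # linear decay over the remaining steps, reaching 0 at total_steps
    return peak_lr * (total_steps - step) / (total_steps - warmup_steps)
```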
3.3 Training Data Augmentation
Top FEVER systems augment BERT fine-tuning with:
- Negative evidence training: Training on claim-sentence pairs where the sentence is NOT evidence, teaching the model to correctly reject irrelevant sentences
- Multi-hop examples: Synthetic training examples requiring reasoning across multiple evidence sentences
- Adversarial FEVER claims: Adding FEVER 2.0 adversarial claims (designed to fool models) to training data improves robustness
3.4 Evaluation Protocol
The FEVER evaluation is conducted on a held-out test set with labels initially withheld, submitted to the FEVER evaluation server. This prevents overfitting to the test set — a critical safeguard given the many researcher degrees of freedom in NLP system design.
Evaluation includes:
- Overall FEVER score (label + evidence)
- Results disaggregated by claim type (SUPPORTED, REFUTED, NEI)
- Oracle evidence experiments (providing gold evidence to isolate verdict prediction performance)
4. Results and Interpretation
4.1 Benchmark Performance
Representative published results on the FEVER test set (approximate):
| System | FEVER Score | Label Accuracy |
|---|---|---|
| Baseline (IR + ESIM) | 31.9% | 50.9% |
| BERT (Thorne et al. 2019) | 64.2% | 71.0% |
| RoBERTa + DPR | ~74% | ~79% |
| Human performance | ~86% | ~88% |
The human-machine gap (~12–14 FEVER score points) reflects genuine capability differences, not just label noise — humans can perform multi-hop reasoning, access background knowledge, and understand pragmatic implications that current models struggle with.
4.2 Error Analysis
Where BERT-based systems succeed:
- Single-hop factual claims with direct Wikipedia evidence: "Marie Curie was born in Warsaw."
- Numerical comparisons: "X is taller than Y" when both heights are in Wikipedia
- Entity substitution claims: "Albert Einstein was born in Russia" (refuted by Wikipedia's "Germany")
Where BERT-based systems fail:
- Multi-hop reasoning: Claims requiring synthesis across multiple Wikipedia articles. Example: "The actor who played [character X] was born in [city Y]" requires linking character to actor to birthplace across multiple articles
- Common sense reasoning: Claims that require world knowledge beyond what is literally stated in Wikipedia
- Claims about absences: "No evidence has been found for..." requires recognizing the absence of information
- Temporal reasoning: Claims about events at specific times requiring temporal inference
- NOT ENOUGH INFO recognition: Models over-predict SUPPORTED/REFUTED because they see many evidence sentences but cannot determine whether the evidence is sufficient
4.3 FEVER 2.0: Adversarial Challenge
FEVER 2.0 invited researchers to submit claims that would fool a given FEVER model. Analysis of FEVER 2.0 adversarial claims reveals that successful attacks:
- Used claims that required multi-hop reasoning the model couldn't handle
- Exploited entity ambiguity (multiple entities with the same name)
- Targeted the sentence selection stage (providing misleading but plausible evidence)
- Used negation in subtle ways the model mishandled
When models were then trained on FEVER 2.0 adversarial claims, their robustness improved on those specific attack types but new adversarial claims could still fool them — a cat-and-mouse dynamic.
5. Bridging FEVER to Real-World Fact Verification
5.1 The Wikipedia Constraint
FEVER's use of Wikipedia as the sole evidence source is a deliberate simplification that enables rigorous benchmarking. But it severely limits what FEVER-trained systems can evaluate in practice:
Claims FEVER systems cannot evaluate:
- Claims about events after Wikipedia's last update
- Claims requiring information not on Wikipedia (local events, proprietary data, real-time information)
- Claims about scientific findings not yet well-represented on Wikipedia
- Claims about statements, opinions, or predictions rather than factual states
- Claims in languages where Wikipedia coverage is sparse
Real fact-checkers use search engines, primary source documents, expert interviews, original reporting, and domain-specific databases — a much richer and more complex evidence ecosystem than Wikipedia alone.
5.2 The Closed-World Assumption
FEVER treats Wikipedia as a closed world: any claim that cannot be verified using Wikipedia is NOT ENOUGH INFORMATION. But in practice, the fact that Wikipedia does not contain information about a claim does not mean the claim is false — it means Wikipedia coverage is incomplete.
This closed-world assumption trains models to label many true claims as NEI (because they are not well-covered on Wikipedia), while a real fact-checker would know that absence of Wikipedia coverage does not mean absence of evidence.
5.3 What FEVER Results Can Predict
Good FEVER scores predict:
- Ability to find relevant Wikipedia pages for entity-rich claims
- Ability to select specific supporting/refuting sentences
- Ability to identify when Wikipedia evidence clearly supports or refutes a claim

Good FEVER scores do NOT predict:
- Ability to verify claims not covered by Wikipedia
- Ability to reason about nuanced or contested claims
- Robustness to adversarial manipulation
- Performance on real social media misinformation
- Ability to evaluate health, financial, or scientific claims
6. Classroom Experimental Design
For courses with computational resources, the following experiment replicates key FEVER findings:
6.1 Simplified FEVER Experiment
Step 1: Create a small dataset (100 claims) by manually writing:
- 40 SUPPORTED claims based on Wikipedia facts
- 40 REFUTED claims by modifying the SUPPORTED claims (entity substitution, negation)
- 20 NEI claims about events not well-covered on Wikipedia
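Step 1's REFUTED-claim generation can be sketched in a few lines. The mutation functions below are deliberately naive (simple string replacement) and the example claims are invented; students will need to verify each mutated claim by hand.

```python
# Sketch of REFUTED-claim generation via entity substitution and negation,
# mirroring two of the FEVER mutation operations.

def entity_substitution(claim, entity, replacement):
    """Swap one entity for another, typically producing a false claim."""
    return claim.replace(entity, replacement)

def negate(claim):
    """Naive negation: only handles simple '... was ...' claims."""
    return claim.replace(" was ", " was not ")

supported = "Marie Curie was born in Warsaw."
refuted_a = entity_substitution(supported, "Warsaw", "Paris")
refuted_b = negate(supported)
# refuted_a == "Marie Curie was born in Paris."
# refuted_b == "Marie Curie was not born in Warsaw."
```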
Step 2: For each claim, use Wikipedia search API to retrieve 3 potentially relevant articles, then extract 5 candidate sentences per article.
Step 3: Train DistilBERT on the training split for sentence selection (binary: evidence/not evidence).
Step 4: Fine-tune DistilBERT on the training split for verdict prediction using concatenated selected sentences.
Step 5: Evaluate on the test split using label accuracy and FEVER score.
Step 6: Manually error-analyze 10 misclassified examples. What types of claims were hardest?
6.2 Expected Results on Simplified Dataset
With a training split of roughly 80 of the 100 claims (spanning all three labels), expect:
- Label accuracy ~75–80% (BERT is good at entailment when given the right evidence)
- FEVER score ~50–60% (evidence retrieval is the bottleneck)
- NEI recall ~30–50% (models struggle to recognize insufficient evidence)
7. Discussion Questions
- FEVER's "FEVER score" requires both the correct label AND the complete gold evidence set. This means a model that finds different (but equally valid) Wikipedia evidence supporting a claim gets penalized. Is this evaluation criterion too strict? What does it reward, and what does it fail to capture?
- The human-machine performance gap on FEVER is approximately 12 FEVER score points. What categories of reasoning capability would be needed to close this gap? List at least five specific capabilities that current neural models demonstrably lack.
- FEVER 2.0 shows that adversarial claims can reliably fool FEVER-trained models, and that models trained on adversarial examples become robust to those attacks but vulnerable to new attacks. What does this cat-and-mouse dynamic imply for the long-term prospects of automated fact verification?
- The SUPPORTED/REFUTED/NEI distinction in FEVER does not capture the graduated nature of real-world truth claims. Design an alternative label taxonomy for a FEVER-like dataset that better reflects the complexity of real misinformation. What challenges would your taxonomy introduce for training and evaluation?
- FEVER uses Wikipedia as its evidence source. Design a version of FEVER that uses real fact-checker databases (PolitiFact, Snopes, FactCheck.org) as evidence sources instead of Wikipedia. What would the task look like? What would the claims be? What challenges would this introduce relative to the Wikipedia-based FEVER?
8. Key Technical Concepts Illustrated
- Pipeline architecture: Multi-stage systems where errors in early stages cascade through later stages
- Transfer learning: Using pre-trained BERT representations as starting points for task-specific fine-tuning
- Natural Language Inference: The entailment-neutral-contradiction framing that underlies fact verification
- Oracle experiments: Providing perfect intermediate outputs to isolate the contribution of each pipeline stage
- Adversarial evaluation: Using adversarial examples to identify model brittleness
- FEVER score: A strict combined metric that rewards systems providing both correct labels and correct evidence
- Domain limitations: The gap between benchmark performance and real-world applicability