Case Study 22-1: The LIAR Dataset — Building and Evaluating a Fake News Classifier
Overview
The LIAR dataset (Wang, 2017) is the most widely used benchmark in fake news detection research. This case study walks through the complete process of building a fake news classifier using LIAR: obtaining and understanding the data, feature engineering, training and evaluating multiple classifiers, interpreting the confusion matrix, and drawing honest conclusions about what the model has and has not learned.
This case study demonstrates not just how to build a classifier but how to evaluate it honestly — resisting the temptation to report only favorable metrics and engaging seriously with the limitations that emerge from evaluation.
1. Dataset Description and Structure
1.1 Origins and Collection
LIAR was collected from PolitiFact.com, one of the most prominent US political fact-checking organizations. PolitiFact employs professional fact-checkers who evaluate political statements made by politicians, media figures, and public officials. Each statement receives one of six "Truth-O-Meter" ratings:
| Label | Meaning | Example |
|---|---|---|
| pants-fire | Blatantly false | "Barack Obama was born in Kenya" |
| false | Mostly or entirely false | "The US has the highest tax rate in the world" |
| barely-true | Contains a kernel of truth but leaves misleading impression | Selective use of statistics |
| half-true | Partially accurate but missing important context | Technically true but cherry-picked |
| mostly-true | Accurate with minor inaccuracies or missing minor context | Close to accurate |
| true | Accurate and complete | Verifiable factual claim |
1.2 Dataset Statistics
- Total statements: 12,836
- Training split: 10,240 statements
- Validation split: 1,284 statements
- Test split: 1,267 statements
Label distribution (approximate):

- pants-fire: 9%
- false: 18%
- barely-true: 18%
- half-true: 21%
- mostly-true: 18%
- true: 15%
The class distribution is roughly balanced across the six labels, though not perfectly so: "half-true" (≈21%) appears more than twice as often as "pants-fire" (≈9%). This mild imbalance has implications for which accuracy metrics are most informative.
1.3 Available Features
Beyond the statement text, each LIAR example includes rich metadata:
- Speaker: Named politician or media figure (1,000+ unique speakers)
- Speaker title: Current role (e.g., "U.S. Senator", "Governor")
- Party affiliation: Democrat, Republican, Independent, etc.
- State: US state the speaker represents (if applicable)
- Venue: Where the statement was made (e.g., "Twitter", "Campaign speech", "Interview")
- Subject categories: Topic tags (e.g., "economy", "immigration", "healthcare")
- Speaker history: For each speaker, counts of previous ratings at each level
The speaker history features — knowing that a speaker has made 47 "pants-fire" claims previously — are particularly informative and not available for unknown speakers.
2. Baseline Analysis: What Can Simple Models Achieve?
2.1 Majority Class Baseline
Before building any learned classifier, the majority class baseline establishes the minimum acceptable performance. Since "half-true" is the most frequent label (≈21%), a classifier that always predicts "half-true" achieves approximately 21% accuracy on six-class classification.
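This baseline can be sketched with scikit-learn's `DummyClassifier`; the toy label counts below mirror the approximate distribution above, not the real LIAR splits.

```python
# Majority-class baseline: always predict the most frequent training label.
# Label counts mirror the approximate distribution above (illustrative only).
from collections import Counter

from sklearn.dummy import DummyClassifier

y_train = (["half-true"] * 21 + ["false"] * 18 + ["barely-true"] * 18 +
           ["mostly-true"] * 18 + ["true"] * 15 + ["pants-fire"] * 9)

clf = DummyClassifier(strategy="most_frequent")
clf.fit([[0]] * len(y_train), y_train)  # feature values are ignored

majority_label, majority_count = Counter(y_train).most_common(1)[0]
baseline_acc = majority_count / len(y_train)  # ≈ 0.21 on this toy split
```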
This means the bar to beat is low: any learned classifier should substantially exceed 21%. However, accuracy is not the only relevant metric — a classifier that learns to predict "pants-fire" only when it is extremely confident (high precision on the extremes) may be more useful for a fact-checking assistant than one that maximizes overall accuracy.
2.2 Text-Only Baseline (TF-IDF + Logistic Regression)
Training logistic regression on TF-IDF unigram features from the statement text:
Accuracy on test set: ~24–26%
Macro F1: ~22–24%
The performance is modest — only slightly above the majority class baseline. This is revealing: the statement text alone, for typical political utterances ("The unemployment rate has fallen for the fourth consecutive quarter"), does not contain strong lexical signals about truth or falseness. The same words appear in statements rated "true" and "false."
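A minimal sketch of this baseline's pipeline follows; the four toy statements stand in for the 10,240-statement training split, which a real run would load from the LIAR TSV files.

```python
# Text-only baseline sketch: TF-IDF unigrams feeding logistic regression.
# Toy statements and labels are illustrative stand-ins for the real split.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

statements = [
    "The unemployment rate has fallen for the fourth consecutive quarter",
    "Barack Obama was born in Kenya",
    "The US has the highest tax rate in the world",
    "Unemployment fell to 4.2 percent in March",
]
labels = ["half-true", "pants-fire", "false", "true"]

text_clf = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, ngram_range=(1, 1))),
    ("lr", LogisticRegression(max_iter=1000)),
])
text_clf.fit(statements, labels)
pred = text_clf.predict(["Unemployment fell again in the last quarter"])[0]
```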
2.3 Metadata-Only Baseline
Training only on speaker-level features (party, speaker history counts):
Accuracy on test set: ~23–27%
Remarkably, speaker credibility history alone achieves comparable accuracy to text-based classifiers. A speaker who has previously made 12 "pants-fire" statements is statistically more likely to make another one. This highlights that LIAR is, in important ways, a speaker credibility classification problem as much as a content classification problem.
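A sketch of the metadata-only setup, where normalized speaker history counts are the only input (speakers, counts, and labels below are invented for illustration):

```python
# Metadata-only baseline sketch: speaker history counts as the features.
# One column per prior rating level, normalized by total statement count.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Columns follow the ordinal label order:
# pants-fire, false, barely-true, half-true, mostly-true, true
history = np.array([
    [12.0, 8, 4, 3, 2, 1],   # a speaker with many extreme falsehoods
    [0.0, 1, 2, 5, 8, 10],   # a mostly credible speaker
    [3.0, 6, 5, 4, 2, 1],
    [1.0, 1, 3, 4, 6, 5],
])
history = history / history.sum(axis=1, keepdims=True)
labels = ["false", "true", "false", "mostly-true"]

meta_clf = LogisticRegression(max_iter=1000).fit(history, labels)
proba = meta_clf.predict_proba(history)  # one probability per class
```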
2.4 Combined Model
Training on text + metadata features combined:
Accuracy on test set: ~27–28%
The modest improvement from combining features is consistent with published results. Wang's (2017) original paper reports 27.0% six-class accuracy for a bidirectional LSTM model with speaker metadata.
3. Building and Evaluating a Full Pipeline
3.1 Preprocessing for LIAR
LIAR's statements are short (median ~20 words) and written in formal political language. Preprocessing choices:
- Lowercase: Yes — reduces vocabulary size without losing information in this context
- Punctuation: Retain quoted material markers; remove excessive punctuation
- Stopwords: Partial removal — retain negation, quantifiers, hedging words
- No stemming/lemmatization needed: Statements are short enough that morphological normalization has minimal benefit
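These choices can be sketched as a small preprocessing function. The `KEEP` set below is an illustrative sample of negation, quantifier, and hedging words, not an exhaustive list.

```python
# Preprocessing sketch for the choices above: lowercase, strip excess
# punctuation (keeping quote markers), and partial stopword removal that
# retains negation, quantifiers, and hedging words.
import re

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

KEEP = {"not", "no", "never", "all", "some", "most", "many", "few",
        "may", "might", "could", "about", "nearly", "almost"}
STOPWORDS = set(ENGLISH_STOP_WORDS) - KEEP

def preprocess(statement):
    text = statement.lower()
    text = re.sub(r"[^\w\s'\"]", " ", text)  # drop punctuation, keep quotes
    return [tok for tok in text.split() if tok not in STOPWORDS]

tokens = preprocess('The US does NOT have nearly the highest tax rate!')
```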
3.2 Feature Engineering
TF-IDF (unigrams + bigrams) capturing content vocabulary.
Stylometric features (for short statements):

- Sentence type (declarative, comparative, superlative claim)
- Use of specific numerical claims (count of percentages, dollar amounts)
- Use of named entities (people, organizations, places — from spaCy NER)
- Hedging language count
- Certainty language count

Speaker metadata:

- Party affiliation (one-hot)
- Job title category (politician, media, other)
- Previous rating history (6 count features, normalized by total statements)
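One way to sketch the combination of these feature families is to concatenate the sparse TF-IDF matrix with dense metadata columns before fitting a single classifier. Statements, party codes, and history counts below are invented.

```python
# Combined-model sketch: sparse TF-IDF text features stacked alongside
# dense speaker-metadata columns, fed to one classifier.
import numpy as np
from scipy.sparse import csr_matrix, hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

statements = [
    "The unemployment rate has fallen for the fourth consecutive quarter",
    "Barack Obama was born in Kenya",
    "The US has the highest tax rate in the world",
    "Unemployment fell to 4.2 percent in March",
]
labels = ["half-true", "pants-fire", "false", "true"]

party_onehot = np.array([[1, 0], [0, 1], [0, 1], [1, 0]], dtype=float)
history = np.array([
    [1, 2, 3, 5, 4, 2], [9, 4, 2, 1, 0, 0],
    [3, 5, 4, 3, 1, 1], [0, 1, 2, 4, 5, 6],
], dtype=float)
history = history / history.sum(axis=1, keepdims=True)

tfidf = TfidfVectorizer(ngram_range=(1, 2))  # unigrams + bigrams
X_text = tfidf.fit_transform(statements)
X = hstack([X_text, csr_matrix(np.hstack([party_onehot, history]))])

combined_clf = LogisticRegression(max_iter=1000).fit(X, labels)
```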
3.3 Model Training
Multiple classifiers compared:
| Model | 6-Class Accuracy | Binary Accuracy | Macro F1 |
|---|---|---|---|
| Majority Baseline | 21.0% | 50.0% | — |
| TF-IDF + NaiveBayes | 23.8% | 57.3% | 21.4% |
| TF-IDF + LinearSVC | 25.6% | 62.1% | 24.1% |
| TF-IDF + Meta + RF | 26.8% | 63.4% | 25.9% |
| TF-IDF + Meta + LR | 27.4% | 64.1% | 26.8% |
(Approximate values consistent with published literature; exact results vary by random seed and preprocessing choices)
4. Confusion Matrix Analysis
The six-class confusion matrix reveals patterns that aggregate accuracy conceals.
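A sketch of computing this matrix, with rows and columns ordered along the ordinal scale; the toy predictions are invented to mimic the adjacent-category confusions analyzed in this section.

```python
# Confusion-matrix sketch: rows are true labels, columns are predictions,
# both ordered along the pants-fire..true ordinal scale.
from sklearn.metrics import confusion_matrix

ORDER = ["pants-fire", "false", "barely-true",
         "half-true", "mostly-true", "true"]

y_true = ["pants-fire", "pants-fire", "false", "barely-true",
          "half-true", "mostly-true", "mostly-true", "true"]
y_pred = ["false", "false", "false", "half-true",
          "barely-true", "true", "mostly-true", "mostly-true"]

cm = confusion_matrix(y_true, y_pred, labels=ORDER)
# cm[0, 1] counts "pants-fire" statements predicted as merely "false".
```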
4.1 Common Patterns in LIAR Confusion Matrices
True label "pants-fire" → Predicted "false": Extremely common. The model systematically underestimates the most extreme category, predicting merely "false" when the true label is "pants-fire." The linguistic markers of "pants-fire" statements may not differ sufficiently from "false" statements to be distinguishable on the basis of text alone.
True label "barely-true" → Predicted "half-true" and vice versa: The boundary between these adjacent categories is where the most confusion occurs. Both categories involve statements that are technically accurate in some aspect but create misleading impressions — a distinction that is genuinely ambiguous even for trained fact-checkers.
True label "mostly-true" → Predicted "true": Models overestimate credibility for statements that are "mostly-true", predicting "true." This is the most consequential error for practical applications: a model that labels "mostly-true" as "true" will fail to catch content that creates false impressions despite technical accuracy.
True label "true" → Predicted "mostly-true": The symmetric error.
4.2 Ordinal Nature of the Labels
The six LIAR labels form an ordinal scale (pants-fire < false < barely-true < half-true < mostly-true < true). Misclassifying "false" as "pants-fire" is less severe than misclassifying "false" as "true": the first error stays adjacent on the scale, while the second lands on the wrong side of it entirely.
Evaluating LIAR performance with ordinal-aware metrics (mean absolute error on the label ordinal, or accuracy within ±1 category) produces more favorable numbers: "within ±1" accuracy is typically 50–60% for reasonable models, compared to 25–27% exact accuracy.
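Such ordinal-aware metrics can be sketched as follows; the label lists passed at the end are toy examples.

```python
# Ordinal-aware evaluation sketch: map each label to its rank on the
# pants-fire..true scale, then compute mean absolute error and
# "within ±1 category" accuracy.
import numpy as np

ORDER = ["pants-fire", "false", "barely-true",
         "half-true", "mostly-true", "true"]
RANK = {label: i for i, label in enumerate(ORDER)}

def ordinal_metrics(y_true, y_pred):
    diff = np.abs(np.array([RANK[y] for y in y_true]) -
                  np.array([RANK[y] for y in y_pred]))
    return diff.mean(), (diff <= 1).mean()

mae, within_one = ordinal_metrics(
    ["false", "true", "half-true"],
    ["pants-fire", "mostly-true", "true"],
)
# Errors of 1, 1, and 2 categories: MAE = 4/3, within-±1 accuracy = 2/3.
```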
4.3 What the Confusion Matrix Reveals About Model Learning
The confusion matrix pattern — good at the extremes (pants-fire, true), poor in the middle — is consistent with models learning stylistic correlates of extreme categories rather than genuine truth evaluation:
- "Pants-fire" statements disproportionately contain outrageous absolute claims ("Barack Obama is the worst president in US history") with distinctive vocabulary
- "True" statements disproportionately contain verifiable factual assertions with hedging language ("According to the BLS report, unemployment fell to 4.2% in March")
- The middle categories (barely-true, half-true, mostly-true) involve contextual judgment that text features alone cannot support
5. Honest Assessment of Limitations
5.1 What the 27% Accuracy Actually Means
A 27% six-class accuracy on LIAR means the model's exact label agrees with expert judgment on roughly 27% of statements and disagrees on approximately 73% (the fact-checkers produced the labels, so their verdicts are the reference standard by construction). This is not a failure of implementation — it reflects the fundamental difficulty of the task and the limitations of text-only features for truth evaluation.
5.2 The Speaker-Text Entanglement Problem
A model trained on LIAR may learn speaker-specific vocabulary patterns rather than general misinformation signals. If a particular politician habitually uses distinctive phrases and that politician's statements are disproportionately "false," the model may learn to associate those phrases with falseness — accurately for that speaker but not generalizable.
5.3 Temporal and Topical Coverage
LIAR covers statements from 2007–2016, predominantly US political discourse. A model trained on LIAR:

- Does not generalize to post-2016 political discourse (changed vocabulary, topics, political figures)
- Does not generalize to health, financial, or scientific misinformation
- Does not generalize to non-US political contexts
- Does not generalize to social media platform text (LIAR is transcribed spoken statements, not tweets or Facebook posts)
5.4 The PolitiFact Bias
PolitiFact's fact-checking is produced by US-based journalists applying their professional judgment. This introduces:

- Selection bias: PolitiFact checks statements it considers newsworthy and checkable
- Geographic bias: US political context
- Temporal bias: The "truth" of political statements can change as facts become clearer
- Potential political bias: Debate about whether PolitiFact applies its criteria consistently across parties
A classifier trained on PolitiFact labels will inherit these biases.
6. Practical Implications for Fact-Checking
6.1 What the Model Is Useful For
Despite its limitations, a LIAR-trained classifier has practical utility as:
A prioritization tool: If a fact-checking organization receives hundreds of new political statements per day, a classifier can rank them by estimated falseness to help human fact-checkers prioritize which statements to investigate first. Even with 73% misclassification, a classifier that correctly identifies the 30% most likely to be false as "high priority" has value.
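A minimal sketch of such a prioritization queue: rank statements by the model's combined probability mass on the two most severe labels. The probability matrix below is invented; a real system would use the classifier's `predict_proba` output.

```python
# Prioritization sketch: surface the statements most likely to be
# severely false so human fact-checkers review them first.
import numpy as np

ORDER = ["pants-fire", "false", "barely-true",
         "half-true", "mostly-true", "true"]

def review_queue(statements, proba):
    """Sort statements by P(pants-fire) + P(false), highest first."""
    severity = proba[:, 0] + proba[:, 1]  # columns follow ORDER
    ranked = np.argsort(-severity)
    return [(statements[i], float(severity[i])) for i in ranked]

proba = np.array([
    [0.05, 0.10, 0.20, 0.30, 0.20, 0.15],
    [0.40, 0.30, 0.10, 0.10, 0.05, 0.05],
    [0.10, 0.25, 0.25, 0.20, 0.10, 0.10],
])
queue = review_queue(["statement A", "statement B", "statement C"], proba)
```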
A consistency check: After a fact-checker has evaluated a statement, comparing the machine prediction to the human verdict can surface cases where the statement has unusual features — perhaps the fact-checker missed a relevant context that the model (through different features) flagged.
A speaker history integrator: The speaker-level credibility tracking built into LIAR-based models (how often has this speaker been wrong before?) is a legitimate input to risk assessment, even if the text-level classification is weak.
6.2 What the Model Should Not Be Used For
- Autonomous content moderation: A 27% accurate six-class classifier should not remove or flag content without human review.
- Individual claim verdicts: The model's confidence calibration is poor; high confidence in a predicted class does not reliably indicate correctness.
- Non-political domains: The model has no basis for evaluating health, scientific, or financial claims.
7. Discussion Questions
1. LIAR achieves ~27% six-class accuracy for the best text + metadata models. What would it mean for a model to achieve 50% on this task? What would it mean for a model to achieve 90%? Is 90% on LIAR a realistic target, and what would it require?
2. The confusion matrix shows that models systematically underestimate the "pants-fire" category, predicting "false" instead. What practical consequences would this have for a model deployed to help prioritize fact-checking? Is this error direction better or worse than the alternative (over-predicting "pants-fire")?
3. Speaker history features (previous label counts) achieve comparable accuracy to text-only features. Does this mean text features are useless for this task, or does it mean something different? How might you design a study to separate the contributions of text and speaker features?
4. LIAR was published in 2017. How would you update or expand the dataset to address its known limitations? What sources would you use? What annotation methodology would you apply?
5. Imagine you are a journalist building a tool to help readers evaluate the credibility of political statements. How would you communicate the 27% six-class accuracy to users in a way that is honest about limitations while still being useful?
8. Key Methodological Lessons
- Baseline evaluation is essential: Always compare against the majority class baseline before claiming model success.
- Aggregate accuracy conceals class-level variation: Confusion matrices reveal what aggregate metrics hide.
- Metadata often matters as much as text: For credibility tasks, speaker and source information can be highly informative.
- Ordinal labels require ordinal evaluation: Standard accuracy is unnecessarily harsh for ordered label sets; ordinal metrics provide more informative evaluation.
- Honest limitations documentation is a scientific obligation: A model with known limitations that are clearly documented is more valuable than an inflated performance claim.