Appendix D: Data Sources and Datasets

This appendix provides an annotated guide to the major datasets and data sources used in misinformation research. For each source, we describe the contents, size, format, access method, licensing terms, and recommended use cases. Resources are organized by category. Because access portals change frequently, URLs are described rather than linked directly; each entry's access notes give enough detail to locate the resource via search.


D.1 Misinformation Benchmark Datasets

These curated datasets are the standard benchmarks for training and evaluating automated misinformation detection systems.

LIAR Dataset

Description: Created by William Yang Wang (2017), LIAR is one of the most widely used benchmark datasets for fake news detection. It contains 12,836 short political statements collected from PolitiFact, annotated by human fact-checkers with six-level truthfulness labels: pants-fire, false, barely-true, half-true, mostly-true, and true.

Size: 12,836 statements; split into training (10,269), validation (1,284), and test (1,283) sets. Each statement includes speaker metadata (name, job title, state, party affiliation) and the number of previous statements in each label category.

Format: Tab-separated values (TSV), one file per split.
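The TSV files ship without a header row, so column names must be supplied when loading. A minimal parsing sketch, assuming the column ordering documented in the original release (verify against the README of the copy you download):

```python
import csv
import io

# Column layout from the original LIAR release; the ordering is an
# assumption here -- confirm it against the dataset's README.
LIAR_COLUMNS = [
    "id", "label", "statement", "subject", "speaker", "job_title",
    "state", "party", "barely_true_ct", "false_ct", "half_true_ct",
    "mostly_true_ct", "pants_on_fire_ct", "context",
]

def read_liar(tsv_text):
    """Parse LIAR-style TSV text into a list of dicts keyed by column name."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), fieldnames=LIAR_COLUMNS, delimiter="\t"
    )
    return list(reader)

sample = (
    "2635.json\tfalse\tSays the Annies List political group supports "
    "third-trimester abortions on demand.\tabortion\tdwayne-bohac\t"
    "State representative\tTexas\trepublican\t0\t1\t0\t0\t0\ta mailer"
)
rows = read_liar(sample)
```

The speaker-history count columns (previous statements per label) are what enable the credibility-modeling use case below.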

Access: Available on the PolitiFact website and mirrored on GitHub repositories associated with the original paper. Search for "LIAR dataset William Yang Wang ACL 2017."

License: The dataset is released for research use. PolitiFact retains copyright over the original fact-check content.

Recommended use cases: Multi-class misinformation classification, speaker credibility modeling, metadata-enriched claim verification. Chapter 22 exercises use LIAR for training binary classifiers.


FakeNewsNet

Description: A comprehensive repository pairing news content with social context from two fact-checking sources: PolitiFact (political news) and GossipCop (entertainment news). It includes article text, metadata, images, and the associated social media sharing data from Twitter.

Size: Over 23,000 news articles (PolitiFact: ~600 real + ~400 fake; GossipCop: ~17,000 real + ~5,000 fake). Social context includes up to hundreds of thousands of tweets per article.

Format: JSON files organized by article and platform. Social context data stored in nested directories.

Access: GitHub repository maintained by the Arizona State University data mining group. Search "FakeNewsNet GitHub Shu et al." The repository includes data collection scripts (you must collect tweet content yourself via the Twitter API due to platform terms).

License: MIT license for the collection scripts. Twitter data subject to Twitter/X Developer Agreement.

Recommended use cases: Multimodal fake news detection (text + social context), temporal analysis of sharing patterns, network-based detection approaches.


FEVER (Fact Extraction and VERification)

Description: FEVER is a dataset for fact verification against textual sources, constructed by modifying Wikipedia sentences to create supported, refuted, and not-enough-information claims. It anchored the FEVER shared tasks (2018 and 2019).

Size: 185,445 annotated claims, verified against a processed June 2017 Wikipedia dump of roughly 5.4 million pages. Training-set label distribution: SUPPORTS (80,035), REFUTES (29,775), NOT ENOUGH INFO (35,639).

Format: JSONL (JSON Lines), with fields for claim, label, evidence sentences, and Wikipedia page IDs.

Access: The FEVER website (search "FEVER dataset UCL") provides download links and leaderboards. A Python data loader is available via the datasets library: datasets.load_dataset("fever", "v1.0").
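The JSONL files can also be streamed with the standard library alone. A minimal sketch; the field names (`id`, `claim`, `label`, `evidence`) and label strings follow the published FEVER schema, but verify them against the files you download:

```python
import json
from collections import Counter

def label_distribution(jsonl_lines):
    """Count FEVER verdict labels across an iterable of JSONL records."""
    counts = Counter()
    for line in jsonl_lines:
        rec = json.loads(line)
        counts[rec["label"]] += 1
    return counts

# Two illustrative records in the FEVER JSONL shape (evidence entries
# are [annotation_id, evidence_id, page_title, sentence_id] tuples).
sample = [
    '{"id": 1, "claim": "Paris is the capital of France.", '
    '"label": "SUPPORTS", "evidence": [[[100, 1, "Paris", 0]]]}',
    '{"id": 2, "claim": "The Moon is made of cheese.", '
    '"label": "REFUTES", "evidence": [[[101, 2, "Moon", 3]]]}',
]
dist = label_distribution(sample)
```

Streaming line by line matters here: the full training file is large enough that loading it wholesale into memory is wasteful.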

License: Creative Commons Attribution-ShareAlike 3.0 (same as Wikipedia).

Recommended use cases: Evidence retrieval, natural language inference, document-level fact verification pipelines. Essential for Chapters 22–24.


MultiFC (Multi-Domain Fact-Checked Claims)

Description: A large-scale dataset of 34,918 claims from 26 different fact-checking outlets covering politics, health, science, and general topics. Unlike LIAR (PolitiFact only), MultiFC captures label heterogeneity across organizations.

Size: 34,918 claims with varying label schemas per outlet (mapped to a common schema of true/false/partially true/unverifiable).
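Working across 26 outlets means collapsing heterogeneous verdict strings into the common schema. A sketch with an illustrative mapping (these entries are hypothetical; the official per-outlet mapping ships with the release):

```python
# Illustrative outlet-label mapping -- NOT the official MultiFC mapping.
LABEL_MAP = {
    "pants on fire!": "false",
    "four pinocchios": "false",
    "mostly true": "partially true",
    "half-true": "partially true",
    "true": "true",
    "unproven": "unverifiable",
}

def normalize_label(raw_label, default="unverifiable"):
    """Map an outlet-specific verdict string onto the common 4-way schema,
    falling back to a default for unmapped verdicts."""
    return LABEL_MAP.get(raw_label.strip().lower(), default)
```

Deciding the fallback behavior (drop vs. map to "unverifiable") is itself a research choice worth reporting.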

Format: CSV with columns for claim, claim URL, fact-check outlet, label, article text, and article links.

Access: GitHub repository of Augenstein et al. (2019). Search "MultiFC dataset Isabelle Augenstein."

License: Research use only.

Recommended use cases: Cross-domain generalization studies, label normalization research, multi-outlet credibility analysis.


COVID-19 Infodemic Dataset

Description: Multiple datasets collected during the COVID-19 pandemic capturing the "infodemic" — the parallel epidemic of misinformation alongside the biological pandemic. The WHO, IEEE, and research groups released several complementary datasets.

Key sources:
- CoAID (COVID-19 Healthcare Misinformation Dataset): 4,251 news articles and social media posts about COVID-19 healthcare, labeled real/fake.
- ReCOVery: 2,029 news articles about COVID-19 with reliability labels and tweet engagement data.
- CONSTRAINT 2021 shared task dataset: COVID-related tweets labeled as fake or real.

Format: CSV and JSON, varying by dataset.

Access: CoAID on GitHub (Cui & Lee, 2020); ReCOVery on GitHub (Zhou et al., 2020); CONSTRAINT on competition websites.

License: Research use; tweet IDs only for Twitter-based datasets (rehydration required).
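Rehydration means fetching full tweet objects for the released tweet IDs via the API. Lookup endpoints cap the number of IDs per request (100 for the v2 tweet lookup endpoint; verify against current API docs), so IDs must be batched first:

```python
def batch_ids(tweet_ids, batch_size=100):
    """Split a list of tweet IDs into lookup-sized batches.
    batch_size=100 matches the v2 tweet lookup limit (an assumption
    to confirm against current platform documentation)."""
    return [
        tweet_ids[i:i + batch_size]
        for i in range(0, len(tweet_ids), batch_size)
    ]

batches = batch_ids([str(n) for n in range(250)])
```

Each batch then becomes one comma-joined `ids` parameter in a lookup request; deleted or protected tweets simply drop out of the response, which is why rehydrated datasets shrink over time.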

Recommended use cases: Health misinformation research, real-time detection challenges, pandemic misinformation studies in Chapters 14–16.


PHEME Rumour Dataset

Description: PHEME contains Twitter conversation threads about nine breaking news events, with each tweet labeled as a rumour or non-rumour and (for rumours) as true, false, or unverified. It captures the temporal dynamics of rumour spreading and resolution.

Size: 5,802 tweets organized into 297 conversation threads across 9 events (e.g., Ottawa shooting, Charlie Hebdo, Ferguson unrest).

Format: JSON files organized hierarchically by event, thread (source tweet), and reply structure.
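Each thread directory includes a structure file nesting reply IDs under their parents. A recursive traversal sketch, assuming that nested-dict shape (check the structure file in the release you download before relying on it):

```python
def count_replies(structure):
    """Count tweets in a PHEME-style nested reply structure, where the
    dict maps each tweet ID to a dict of its direct replies."""
    total = 0
    for replies in structure.values():
        total += 1
        if isinstance(replies, dict):
            total += count_replies(replies)
    return total

# Source tweet "100" with replies "101" and "102"; "103" replies to "102".
thread = {"100": {"101": {}, "102": {"103": {}}}}
```

The same traversal pattern (with the tweet JSON joined in by ID) supports the stance-detection and thread-analysis use cases below.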

Access: PHEME dataset available via Figshare. Search "PHEME dataset Zubiaga Figshare." A version is also available via the datasets library.

License: Creative Commons Attribution 4.0.

Recommended use cases: Rumour stance detection, conversation thread analysis, temporal credibility assessment, early detection of misinformation.


D.2 Platform Transparency Data

Twitter/X Information Operations Archive

Description: Twitter (now X) has published data about state-backed information operations it has removed. The archive contains tweet content, account metadata, and media from accounts attributed to governments including Russia, Iran, China, Saudi Arabia, Venezuela, and others.

Size: Varies by release; the full archive contains tens of millions of tweets from thousands of accounts across multiple disclosure events.

Format: ZIP files containing CSVs (tweet data) and JSON (follower/following graphs). A companion dataset contains account-level metadata.

Access: Twitter/X Transparency Center (transparency.x.com). Navigate to "Information Operations" to download individual country datasets.

License: Data released for research and journalistic purposes under the Twitter Developer Agreement.

Recommended use cases: State-sponsored disinformation research, coordinated inauthentic behavior detection, IRA (Internet Research Agency) analysis. Chapter 21 uses IRA data for network analysis exercises.


Facebook/Meta CIB Takedown Reports

Description: Meta periodically publishes "Coordinated Inauthentic Behavior" (CIB) takedown reports documenting networks of fake accounts and Pages removed for foreign or domestic interference. Reports include operation descriptions, origin countries, targeting information, and sample content.

Format: PDF reports with accompanying CSV files of removed Page/account IDs.

Access: Meta Transparency Center (transparency.fb.com). Navigate to "Threat Reports" for downloadable materials.

License: Public disclosure for research and journalism.

Recommended use cases: Case study analysis of coordinated influence operations, platform response documentation, comparative analysis across nations and platforms.


Google Threat Analysis Group Reports

Description: Google's Threat Analysis Group (TAG) publishes reports on influence operations, phishing campaigns, and coordinated disinformation targeting Google platforms (YouTube, Gmail, Google Ads).

Format: PDF reports; some include datasets of removed YouTube channels.

Access: Google TAG blog (blog.google/threat-analysis-group). Reports are freely available; YouTube channel datasets are posted to Stanford Internet Observatory and similar partners.

License: Public disclosure.

Recommended use cases: Cross-platform analysis, video-based disinformation research.


D.3 Fact-Check Databases and APIs

ClaimBuster API

Description: ClaimBuster is an automated claim-spotting system from the University of Texas at Arlington. It scores sentences by their "check-worthiness" — the likelihood that a sentence contains a factual claim worth fact-checking.

Format: RESTful API returning JSON with claim score (0–1) and claim text.

Access: Requires free API key registration at ClaimBuster's website. Rate limits: approximately 1,000 requests/day for free tier.
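A typical workflow sends text to the scoring endpoint and keeps only sentences above a check-worthiness threshold. The endpoint path and response shape below are assumptions to confirm against ClaimBuster's API documentation after registering:

```python
# Assumed endpoint path -- verify against ClaimBuster's API docs.
API_URL = "https://idir.uta.edu/claimbuster/api/v2/score/text/"

def check_worthy(api_response, threshold=0.5):
    """Return sentences whose check-worthiness score meets the threshold.
    Assumes a response of the form {"results": [{"text": ..., "score": ...}]}."""
    return [
        r["text"]
        for r in api_response.get("results", [])
        if r["score"] >= threshold
    ]

# Illustrative response payload (not real API output).
sample = {"results": [
    {"text": "The economy grew 4% last year.", "score": 0.83},
    {"text": "What a great day!", "score": 0.12},
]}
worthy = check_worthy(sample)
```

The threshold is a precision/recall dial: raise it to hand fact-checkers fewer, higher-confidence candidates.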

License: Free for academic use.

Recommended use cases: Automated prioritization of claims for fact-checking, preprocessing pipeline for fact-check systems.


Google Fact Check Explorer API

Description: An API that searches indexed fact-checks from IFCN-certified fact-checking organizations worldwide. Returns claim, claimant, verdict, fact-checker, and URL.

Format: JSON via Google's Fact Check Tools API.

Access: Requires a Google Cloud API key with the Fact Check Tools API enabled. Free tier available with standard Google Cloud quotas.

License: Usage subject to Google API Terms of Service.

Recommended use cases: Checking whether a claim has already been fact-checked, building claim-matching systems, cross-lingual fact-check retrieval.
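The "has this claim been checked?" lookup reduces to one GET request against the claims:search method. A request-building sketch; the parameter names follow the Fact Check Tools API reference but should be verified against current documentation:

```python
from urllib.parse import urlencode

BASE = "https://factchecktools.googleapis.com/v1alpha1/claims:search"

def build_search_url(query, api_key, language_code="en"):
    """Build a claims:search request URL. Parameter names ('query',
    'key', 'languageCode') are taken from the API reference -- confirm
    them before use."""
    params = {"query": query, "key": api_key, "languageCode": language_code}
    return f"{BASE}?{urlencode(params)}"

url = build_search_url("vaccines cause autism", "YOUR_API_KEY")
```

The response nests verdicts under each claim's `claimReview` entries, one per fact-checking organization, which is what makes cross-lingual and multi-outlet retrieval possible.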


PolitiFact Data

Description: PolitiFact has published structured datasets of their fact-checks through the LIAR dataset (see Section D.1) and directly through their website. Their Truth-O-Meter assigns six ratings (True, Mostly True, Half True, Mostly False, False, Pants on Fire).

Access: The LIAR dataset provides structured access. Direct scraping of PolitiFact is governed by their Terms of Service; academic collaborations can request bulk data.

Recommended use cases: Political claim verification, speaker credibility profiling.


D.4 Social Media Data

Twitter/X Academic Research API

Description: Provides full-archive search access to the complete history of public tweets (back to 2006), higher rate limits, and access to additional fields not available in the standard API.

Access tiers:
- Free tier: 1 million tweets/month read access, 500,000 tweet lookup
- Basic tier (~$100/month): 10 million tweets/month
- Pro/Enterprise: higher limits; contact Twitter/X directly

Format: JSON via RESTful endpoints or streaming.

Access: developer.x.com. Requires account approval and description of research use.

Rate limits: Vary by endpoint; typical: 500 requests per 15-minute window.
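To stay inside a per-window cap, space requests evenly across the window: 500 requests per 15-minute window allows one request every 1.8 seconds. A pacing helper (the 500/15-min figure is the typical limit quoted above; check the limit for your specific endpoint):

```python
def min_request_interval(requests_per_window, window_seconds=900):
    """Seconds to wait between requests so a per-window rate cap is
    never exceeded (e.g., 500 requests per 15-minute window -> 1.8 s)."""
    return window_seconds / requests_per_window

# Sleep this long between calls (e.g., time.sleep(interval)) in a
# collection loop to avoid 429 responses.
interval = min_request_interval(500)
```

Even pacing is gentler than bursting to the cap and sleeping, and it keeps long-running collectors from being throttled mid-run.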

Recommended use cases: Longitudinal tracking of hashtags and narratives, network data collection, bot detection datasets.


Reddit Pushshift Archive

Description: Pushshift was a third-party service that archived all Reddit submissions and comments in real time, enabling full-history research. Following Reddit's 2023 API policy changes, Pushshift's public access was suspended. Access is now available to approved researchers via arrangement with Reddit.

Format: JSON compressed files (zst format), organized by month.

Size: Multiple terabytes covering Reddit's full history from 2005.

Access: Academic Torrents hosts archived dumps. Approved researchers can contact Reddit's research program. The Pushshift team maintains some archival access.
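Once a monthly dump is decompressed (e.g., with the zstd CLI: `zstd -d --long=31 RC_2015-01.zst`), it is newline-delimited JSON, one record per line. A filtering sketch; the field names (`subreddit`, `body`) follow the Reddit API, but verify them against the dump you are using:

```python
import json

def filter_by_subreddit(ndjson_lines, subreddit):
    """Yield Pushshift records belonging to one subreddit from an
    iterable of decompressed NDJSON lines."""
    for line in ndjson_lines:
        rec = json.loads(line)
        if rec.get("subreddit", "").lower() == subreddit.lower():
            yield rec

# Illustrative records (not real dump content).
sample = [
    '{"subreddit": "conspiracy", "body": "sample comment one"}',
    '{"subreddit": "science", "body": "sample comment two"}',
]
hits = list(filter_by_subreddit(sample, "conspiracy"))
```

Filtering as a generator keeps memory flat, which matters when a single month of comments runs to tens of gigabytes decompressed.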

Recommended use cases: Longitudinal studies of conspiracy theory communities, subreddit analysis, comment thread dynamics.


YouTube Data API

Description: The YouTube Data API allows retrieval of video metadata, comments, channel statistics, and playlist information.

Access: Google Cloud Console; requires API key. Free quota: 10,000 units/day (each search query costs ~100 units; each comment fetch ~1 unit).
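The quota arithmetic matters when planning collection: at ~100 units per search and ~1 unit per comment fetch, searches dominate the budget. A budgeting sketch using the costs quoted above (verify them against the current quota table):

```python
def remaining_comment_fetches(n_searches, daily_quota=10_000,
                              search_cost=100, comment_cost=1):
    """Comment-thread fetches left in the daily quota after spending
    part of it on search queries. Unit costs are the approximate
    figures quoted in this appendix, not guaranteed values."""
    remaining = daily_quota - n_searches * search_cost
    return max(remaining // comment_cost, 0)
```

For example, 50 searches leave room for 5,000 comment fetches; 100 searches exhaust the daily quota entirely.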

Format: JSON.

Recommended use cases: Analyzing misinformation on YouTube, tracking health misinformation channels, studying algorithmic recommendation effects.


D.5 Survey Data

Pew Research Center Data

Description: Pew Research conducts regular surveys on media consumption, trust in news, political attitudes, and social media use. Many datasets are publicly available for academic download.

Key relevant surveys:
- American Trends Panel: ongoing panel tracking media habits
- Global Attitudes Project: cross-national data on trust and information
- Journalism and Media studies: annual State of the News Media reports

Format: SPSS (.sav) and Stata (.dta) files; codebooks in PDF.

Access: Pew Research Center website (pewresearch.org). Free registration required for data downloads.

License: Free for non-commercial research with attribution.

Recommended use cases: Nationally representative analysis of misinformation exposure, trust in media, digital news consumption patterns.


Reuters Institute Digital News Report

Description: The annual Digital News Report from the Reuters Institute for the Study of Journalism surveys approximately 90,000 online news consumers across 46 countries. It measures trust in news, platform use, news avoidance, misinformation concern, and willingness to pay for news.

Format: Excel (.xlsx) cross-tabulation files; full data available on request.

Access: Reuters Institute website (reutersinstitute.politics.ox.ac.uk). Data available for download with free registration.

License: Available for academic research with citation.

Recommended use cases: Cross-national comparative analysis of news consumption and trust, longitudinal tracking of platform use.


Edelman Trust Barometer

Description: Annual global survey of trust in institutions (government, business, NGOs, media) across 28 markets, tracking approximately 32,000 respondents. Measures trust in media, credibility of sources, and belief in systemic corruption.

Format: PDF reports and underlying data available to research partners.

Access: Edelman.com. Full data access for academic partners via arrangement; aggregate results freely available.

Recommended use cases: Institutional trust analysis, cross-national trust comparisons, longitudinal trust tracking.


D.6 News Archives and Media Monitoring

GDELT (Global Database of Events, Language, and Tone)

Description: GDELT monitors the world's news media across 65 languages and encodes the events, actors, and emotional tone of virtually every news article. It is one of the largest open datasets in existence.

Size: Petabyte-scale; updated every 15 minutes with new articles.

Format: CSV files for bulk download; the full dataset is also queryable via Google BigQuery (free tier available).

Access: GDELT website (gdeltproject.org) and Google BigQuery (gdelt-bq.gdelt.com).

License: Open, free for any use.

Recommended use cases: Tracking global narrative patterns, measuring media attention to events, cross-lingual information ecosystem analysis.


Media Cloud

Description: Media Cloud is an open platform for studying online news built at MIT, Harvard, and Northeastern. It provides access to millions of news stories per week, with topic modeling, source analysis, and network visualization.

Size: 1.8+ billion stories archived since 2011.

Format: API (JSON) and bulk download.

Access: mediacloud.org; free academic access with registration.

License: API access free for research; data licensing varies.

Recommended use cases: Tracking story spread across media ecosystems, identifying partisan media patterns, measuring story lifecycle.


AllSides Media Bias Ratings

Description: AllSides maintains human-reviewed media bias ratings for hundreds of news sources across a five-point scale (Left, Lean Left, Center, Lean Right, Right). Ratings are crowdsourced, expert-reviewed, and periodically updated.

Format: CSV download available; web interface for browsing.

Access: allsides.com/media-bias/ratings. Bulk data available for academic use.

Recommended use cases: Assigning partisan lean to news sources, studying cross-partisan coverage, building training data for bias classifiers.
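For modeling, the five-point scale is usually mapped to a signed numeric score. The -2 to +2 coding below is a common research convention, not part of the AllSides data itself:

```python
# Conventional numeric encoding of the AllSides five-point scale
# (a research convention, not an AllSides-defined coding).
BIAS_SCALE = {
    "Left": -2,
    "Lean Left": -1,
    "Center": 0,
    "Lean Right": 1,
    "Right": 2,
}

def bias_score(rating):
    """Map an AllSides rating label to its signed numeric score."""
    return BIAS_SCALE[rating]

def mean_bias(ratings):
    """Average numeric bias across a collection of source ratings."""
    return sum(bias_score(r) for r in ratings) / len(ratings)
```

A signed encoding lets partisan lean enter regressions directly, while the absolute value serves as a simple extremity measure.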


NewsGuard

Description: NewsGuard rates news and information websites on nine journalistic standards (e.g., transparency, accuracy, corrections policy) on a 0–100 credibility score. Covers approximately 7,500 English-language sites.

Format: JSON and CSV; API available.

Access: NewsGuard website. Academic licenses available at reduced cost; commercial API requires subscription.

License: Commercial with academic discount programs.

Recommended use cases: Source-level credibility labeling, training data for source credibility classifiers, curriculum materials for media literacy.


D.7 Election and Political Data

MIT Election Data and Science Lab

Description: MIT's MEDSL provides clean, standardized election returns for US federal, state, and local elections back to 1976, plus a range of secondary datasets on candidates, campaign finance, and voter behavior.

Format: CSV via Harvard Dataverse.

Access: Free download at dataverse.harvard.edu/dataverse/medsl.

License: Creative Commons Attribution 4.0.

Recommended use cases: Linking election outcomes to disinformation metrics, geographic analysis of misinformation effects.


Ballotpedia API

Description: Ballotpedia provides structured data on candidates, races, ballots, and officeholders at federal, state, and local levels across the United States.

Format: JSON REST API.

Access: Requires API key; free tier for researchers, paid tiers for commercial use.

Recommended use cases: Candidate metadata enrichment, election integrity research, ballot measure analysis.


D.8 Data Access Best Practices

Ethical considerations: Many of these datasets contain real people's statements and social media activity. Researchers should:
- Comply with platform Terms of Service — do not violate rate limits or scrape prohibited content.
- Obtain IRB (Institutional Review Board) approval for research involving human subjects data.
- Anonymize or aggregate data before publication when individual-level data is not necessary.
- Avoid doxing or exposing individuals identified as spreaders of misinformation.
- Follow GDPR and relevant national privacy regulations when working with data from EU residents.

Storage and reproducibility: Store datasets with versioning information and document exact download dates, since many datasets are updated or retracted. Use DOIs or persistent identifiers where available. Archive your exact dataset versions with projects.

Citation: All datasets should be cited using the primary paper reference (not just the download URL). See Appendix H for full citations of all datasets referenced here.