Capstone 2 Data Appendix: The Misinformation Tracker

Section A: Dataset Reference Guide

Primary Dataset: oda_media.csv

The ODA media dataset contains approximately 15,000 rows of media coverage relevant to the Garza-Whitfield Senate race and the broader state political landscape. It is the primary dataset for this capstone's spread analysis and automated detection components.

Full column reference:

| Column | Type | Description | Notes |
|---|---|---|---|
| article_id | string | Unique article identifier | Format: M-XXXXX |
| date | date | Publication date | YYYY-MM-DD format |
| source | string | Publication name | e.g., "State Tribune", "PolitiFact" |
| source_type | categorical | Media type | Values: "newspaper", "TV", "digital", "wire", "factchecker" |
| state | string | State of primary coverage | Two-letter abbreviation |
| topic | string | Primary topic tag | e.g., "immigration", "healthcare", "economy" |
| headline | string | Article headline | May be null for TV transcripts |
| excerpt | string | Article excerpt (first 500 chars) | Use with headline for text analysis |
| sentiment_score | float | Pre-computed compound sentiment | Range: -1 to 1 (VADER scale) |
| candidate_mentions | string | Candidates mentioned | Comma-separated: "garza", "whitfield", "both" |
| factcheck_rating | string | Fact-checker rating if applicable | Null for most articles; see values below |

factcheck_rating value inventory:

| Value | Approximate count | Classification for model label |
|---|---|---|
| "True" | ~180 | Accurate (label = 0) |
| "Mostly True" | ~240 | Accurate (label = 0) |
| "Half True" | ~145 | Accurate (label = 0) — see note |
| "Mostly False" | ~190 | Misinformation (label = 1) |
| "False" | ~165 | Misinformation (label = 1) |
| "Pants on Fire" | ~45 | Misinformation (label = 1) |
| "One Pinocchio" | ~95 | Accurate (label = 0) |
| "Two Pinocchios" | ~130 | Accurate (label = 0) — see note |
| "Three Pinocchios" | ~110 | Misinformation (label = 1) |
| "Four Pinocchios" | ~75 | Misinformation (label = 1) |
| "Accurate" | ~85 | Accurate (label = 0) |
| "Misleading" | ~120 | Misinformation (label = 1) |

Note on boundary labels: "Half True" and "Two Pinocchios" are genuinely ambiguous. The model label assignments above follow the convention used in Sam's pipeline (where the goal is a conservative classifier that minimizes false positives). Students who prefer to exclude these boundary cases from the training corpus and test whether this changes model performance are encouraged to do so — document your choice.

Labeled example count: Approximately 1,580 rows with non-null factcheck_rating.

Class balance: Under the mapping above, approximately 705 misinformation (label = 1) and 875 accurate (label = 0) in the labeled subset — relatively balanced, which is unusual for real-world misinformation datasets (actual misinformation is rarer than accurate coverage). The dataset is balanced by construction for pedagogical purposes.
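
A minimal pandas sketch of this mapping, transcribing the table above (the file and variable names are illustrative):

import pandas as pd

media = pd.read_csv('oda_media.csv')

# Binary labels per the inventory above; "Half True" and "Two Pinocchios"
# follow the conservative convention -- drop them from this dict instead
# if you take the boundary-case exclusion route
RATING_TO_LABEL = {
    'True': 0, 'Mostly True': 0, 'Half True': 0,
    'Mostly False': 1, 'False': 1, 'Pants on Fire': 1,
    'One Pinocchio': 0, 'Two Pinocchios': 0,
    'Three Pinocchios': 1, 'Four Pinocchios': 1,
    'Accurate': 0, 'Misleading': 1,
}

labeled = media[media['factcheck_rating'].notna()].copy()
labeled['label'] = labeled['factcheck_rating'].map(RATING_TO_LABEL)
print(labeled['label'].value_counts())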


Secondary Reference: Claim-Specific Data Lookup

For the five claims described in Section 4 of the main capstone text, the following reference tables provide the data that would underlie a real tracker's spread analysis. Students may use these as starting points and supplement with oda_media.csv analysis.

Claim W-1: "Maria Garza Supports Open Borders"

Origin: Whitfield television advertisement, Campaign Week -42

Advertising data (from FCC/platform records):

- Television buy: $195,000 statewide; estimated 2.1 million impressions
- Digital buy (Facebook/Instagram): $85,000; estimated 340,000 impressions
- Total estimated reach: 2.44 million impressions (with significant overlap)

Organic spread indicators (from oda_media.csv filtered for topic="immigration" and candidate_mentions containing "garza"):

- Articles referencing the "open borders" characterization of Garza: approximately 340 in the campaign period
- Distribution by source_type: digital (45%), newspaper (28%), TV (17%), wire (10%)
- Average sentiment for these articles: -0.31 (moderately negative, consistent with contrast coverage)

Fact-checker response:

- PolitiFact rating: "False" — published Campaign Week -34
- ODA rating: A1-I1 — published the same day as PolitiFact
- Combined estimated correction audience: ~40,000 page views (PolitiFact: ~28,000; ODA: ~12,000)

Correction gap ratio: ~40,000 / ~2,440,000 = 1.64%
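
A minimal pandas sketch of the W-1 spread filter and the gap arithmetic above, assuming the column values described in Section A (the reach figures are the estimates quoted above, not derived from the dataset):

import pandas as pd

media = pd.read_csv('oda_media.csv')

# Organic spread slice for W-1: immigration coverage mentioning Garza
# ("both" also indicates a Garza mention per the column reference)
w1 = media[
    (media['topic'] == 'immigration')
    & media['candidate_mentions'].str.contains('garza|both', na=False)
]
print(len(w1), 'articles')
print(w1['source_type'].value_counts(normalize=True))
print(w1['sentiment_score'].mean())

# Correction gap ratio from the quoted reach estimates
correction_reach = 40_000    # combined fact-check page views
claim_reach = 2_440_000      # total estimated ad impressions
print(f"gap ratio: {correction_reach / claim_reach:.2%}")  # ~1.64%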

Primary sources for verification:

- Garza's official immigration platform (campaign website, updated Campaign Week -48)
- State AG's office case records: 847 cases involving undocumented persons, 2017-2021
- Garza's statement on comprehensive immigration reform: campaign speech transcript, Campaign Week -52
- State bar association immigration law committee assessment of Garza's AG record


Claim W-3: "Violent Crime Rose 40% Under Garza's Tenure as AG"

Origin: PAC television advertisement, Campaign Week -28

Advertising data:

- Television buy: $145,000 statewide; estimated 980,000 impressions
- Digital version: $42,000; estimated 195,000 impressions
- Total estimated reach: 1.175 million impressions

Underlying statistical data (from FBI Uniform Crime Reports and state AG office):

| Year | Violent Crime Rate (per 100K) | Notes |
|---|---|---|
| 2018 (Garza takes office) | 412.3 | |
| 2019 | 388.7 | Decrease during first full year |
| 2020 | 445.2 | Pandemic year; nationwide increase |
| 2021 | 458.9 | FBI methodology change; comparison unreliable |
| 2022 (last full year) | 457.6 | |
- Change from 2018-2022 (full tenure): +11.0% (verified in the sketch below)
- Change from 2019-2021 (PAC-selected period): +18.1%
- The "40%" figure appears to derive from comparing 2019's record low to a different crime category (aggravated assault) in 2021, cherry-picking both the baseline and the metric

Expert consultation note: Three criminologists contacted by ODA confirmed that no standard methodology supports a "40% increase" characterization of crime trends during Garza's tenure. Two requested attribution; one requested anonymity.

Fact-checker response:

- Metro Daily Fact-Checker: "Three Pinocchios" equivalent — Campaign Week -18
- State's largest paper: "Misleading and statistically dubious" — Campaign Week -17
- ODA rating: A1-I1 — Campaign Week -18

Combined correction audience: approximately 65,000 across these fact-checks
Correction gap ratio: ~65,000 / ~1,175,000 = 5.5%


Claim G-1: "Tom Whitfield Outsourced Jobs to China"

Origin: Garza campaign television advertisement, Campaign Week -38

Advertising data:

- Television buy: $165,000 statewide; estimated 1.6 million impressions
- Digital buy: $62,000; estimated 280,000 impressions
- Total estimated reach: 1.88 million impressions

Business record data (from Whitfield's Hardware company filings and industry reports):

| Data point | Value | Source |
|---|---|---|
| Total employees | ~340 | Company public records |
| Products from Chinese manufacturers | ~35-40% | Industry analysis; consistent with hardware retail average |
| Products from other Asian countries | ~15-20% | Industry analysis |
| U.S. manufacturing operations closed | 0 documented | Corporate filings |
| Domestic workers displaced by offshore production | 0 documented | Corporate filings; SEC disclosures |

Industry context: The National Hardware Retailers Association reports that 38% of inventory at comparable hardware retail chains is sourced from Chinese manufacturers — Whitfield's sourcing is at or slightly below industry average.

ODA rating: A2-I1 (materially misleading, high impact)

Garza campaign response: "Tom Whitfield made the choice to stock his stores with goods made in China rather than American-made products. That's outsourcing." (Documented in ODA case file)

Correction coverage: two fact-checks, combined ~48,000 readers
Correction gap ratio: ~48,000 / ~1,880,000 = 2.6%


Claim G-2: "Whitfield's Plan Would Cut Medicare by $200 Billion"

Origin: Garza campaign digital advertising and candidate statements, Campaign Week -40

Source documentation:

- Heritage Action Policy Framework (third-party document Whitfield has endorsed): contains a $200B Medicare "efficiency savings" target over 10 years
- Whitfield's public statements: supports "reforming entitlement programs" and has endorsed the Heritage Action framework in general terms
- Whitfield's explicit statements on Medicare: has not released a Medicare policy; has said he "won't cut benefits for current recipients"

The inference chain:

1. Whitfield endorsed the Heritage Action framework (documented).
2. The Heritage Action framework includes a $200B Medicare savings target (documented).
3. Therefore: "Whitfield's plan would cut Medicare by $200B" (this inference step is the contested one).

Why this is A3 (contested inference) rather than A2 (misleading):

- The inference is reasonable: if you endorse a framework, you accept its provisions.
- But "Whitfield's plan" implies he has a plan — he has only endorsed another organization's framework.
- Whether $200B in "efficiency savings" constitutes "cuts" is genuinely contested and depends on baseline definitions.
- Whitfield's own statements directly contradict this characterization.

ODA rating: A3-I2


Claim G-3: "Whitfield Said Immigrants 'Don't Belong Here'"

Origin: Edited social media video, origin account: @garza_supporters_unite (not an official campaign account), Campaign Week -35

The actual statement (transcript from campaign event recording):

"Look, my family came here the right way, and I respect everyone who does it the right way. But illegal immigrants who are here breaking the law don't belong here — they need to go back and get in line like everyone else did."

The edited version (as circulated):

"[...] don't belong here — they need to go back."

Spread data:

- Original edited clip: 8,200 shares, estimated 2.3 million impressions before correction
- Garza campaign response time: 36 hours (campaign tweeted the full transcript at Campaign Week -35, day 2 of spread)
- Unedited clip circulation: 3,100 shares, estimated 890,000 impressions

Attribution complexity notes:

- @garza_supporters_unite: account created Campaign Week -52, 2,100 followers at time of post, no documented affiliation with the official campaign
- Garza campaign digital director statement (on record with ODA): "We did not create, distribute, or instruct the creation of that video clip. We condemn the selective editing."
- ODA attribution: "Social media accounts associated with Garza supporters, with initial campaign inaction"

ODA rating: A1-I1, with attribution note


Section B: Text Analysis Technical Reference

VADER Implementation Reference

VADER (Valence Aware Dictionary and sEntiment Reasoner) is a rule-based sentiment analysis tool optimized for social media content. Install via: pip install vaderSentiment

Key parameters for this capstone:

from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

analyzer = SentimentIntensityAnalyzer()

# Basic usage
text = "Maria Garza's extreme open borders agenda would destroy our communities!"
scores = analyzer.polarity_scores(text)
# Returns: {'neg': 0.283, 'neu': 0.573, 'pos': 0.144, 'compound': -0.5423}

# Key values:
# compound: normalized sum, range -1 to 1
#   > 0.05: positive
#   < -0.05: negative
#   between: neutral
# neg, pos, neu: proportions of text in each sentiment class

Recommended flagging threshold for this capstone: abs(compound) > 0.60

This threshold is more restrictive than the standard 0.05 positive/negative cutoff. It targets content with strong emotional activation — a deliberate choice to reduce false positives at the cost of some recall.
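
A minimal sketch of applying this threshold, assuming you work from the pre-computed sentiment_score column described in Section A (the recomputation path is shown for text not already scored):

import pandas as pd
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

media = pd.read_csv('oda_media.csv')

# Flag strong emotional activation using the pre-computed compound score
media['flag_sentiment'] = media['sentiment_score'].abs() > 0.60

# For new text, recompute compound from headline + excerpt
# (headline may be null for TV transcripts, hence fillna)
analyzer = SentimentIntensityAnalyzer()
text = media['headline'].fillna('') + ' ' + media['excerpt'].fillna('')
media['compound'] = text.map(lambda t: analyzer.polarity_scores(t)['compound'])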

Watchlist terms for the Garza-Whitfield race:

WATCHLIST = [
    # Candidate names
    'garza', 'whitfield',
    # Documented claim phrases
    'open borders', 'outsourc', 'medicare', 'crime rate',
    'violent crime', 'illegal immigrant', "don't belong",
    'eliminated', 'destroy',
    # Statistical claim markers
    'percent', 'billion', 'cut',
    # General misinformation signal phrases
    'actually', 'fake', 'lie', 'false', 'truth'
]

Note: including general signal phrases like 'false' and 'truth' increases recall of meta-coverage (articles about misinformation) as well as misinformation itself — this is why the source credibility downweighting step is important.
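The simplest way to match the list is case-insensitive substring search; this is also why the stem 'outsourc' catches both "outsourced" and "outsourcing". A minimal sketch:

def watchlist_hits(text):
    """Return the WATCHLIST entries found as case-insensitive substrings."""
    lowered = text.lower()
    return [term for term in WATCHLIST if term in lowered]

watchlist_hits("Maria Garza's extreme open borders agenda")
# -> ['garza', 'open borders']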


TF-IDF and Logistic Regression Technical Notes

TF-IDF vectorizer key parameters:

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(
    max_features=5000,     # vocabulary size limit
    ngram_range=(1, 2),    # unigrams and bigrams
    min_df=3,              # ignore terms appearing in fewer than 3 documents
    stop_words='english',  # remove common English stopwords
    sublinear_tf=True      # apply log normalization to term frequency
)

Why sublinear_tf: With raw term frequency, a document where "crime" appears 10 times gives that term twice the weight it gets in a document with 5 appearances. sublinear_tf=True applies 1 + log(tf) instead, dampening the influence of repeated terms. This is generally better for political text, where certain terms (candidate names, key issues) repeat heavily.

Why min_df=3: Terms appearing in only one or two documents are almost certainly noise — proper nouns, typos, unique phrasings. Excluding them keeps the vocabulary to meaningful patterns.

Class imbalance handling:

from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(
    C=1.0,                   # inverse regularization strength (1/lambda)
    class_weight='balanced', # reweight classes inversely to frequency
    max_iter=500,
    random_state=42
)

Because the labeled ODA subset is approximately balanced by construction, class weighting has little effect here — but including class_weight='balanced' is good practice and should be used; it matters if you retrain on an imbalanced subset (for example, after excluding the boundary labels).

Cross-validation setup:

from sklearn.model_selection import cross_val_score, StratifiedKFold
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(...)),
    ('clf', LogisticRegression(...))
])

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Use stratified K-fold to preserve class balance across folds

# Report F1 for misinformation class (pos_label=1):
f1_scores = cross_val_score(pipeline, X_text, y,
                             cv=cv, scoring='f1', n_jobs=-1)

Interpreting F1: F1 is the harmonic mean of precision and recall. For misinformation detection, recall (catching actual misinformation) is generally more important than precision (avoiding false positives) — but in this pipeline, false positives go to human review where they're caught, so the precision/recall tradeoff is less critical than it would be in a fully automated system.

Expected performance range on this dataset: F1 approximately 0.65-0.78. If your model is achieving F1 above 0.85, double-check for data leakage (fitting the vectorizer on the full dataset before the cross-validation split).
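For reference, the leakage pattern to avoid looks like the first variant below; the pipeline form already shown above refits the vectorizer inside each training fold. (This sketch reuses vectorizer, clf, pipeline, X_text, y, and cv from the snippets above.)

# Leaky: the vectorizer learns vocabulary and document frequencies
# from the full dataset, including each fold's test documents
X_all = vectorizer.fit_transform(X_text)
leaky_scores = cross_val_score(clf, X_all, y, cv=cv, scoring='f1')

# Leak-free: cross_val_score clones the pipeline and refits the
# vectorizer on the training folds only
clean_scores = cross_val_score(pipeline, X_text, y, cv=cv, scoring='f1')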


Feature Interpretation

To inspect which features drive the classifier's predictions:

import numpy as np

# After fitting on full training set
tfidf = pipeline.named_steps['tfidf']
clf = pipeline.named_steps['clf']
feature_names = tfidf.get_feature_names_out()
coef = clf.coef_[0]

# Features most associated with misinformation (positive coefficients):
top_mis = np.argsort(coef)[-15:]
for idx in reversed(top_mis):
    print(f"  {feature_names[idx]}: {coef[idx]:.3f}")

# Features most associated with accurate content (negative coefficients):
top_acc = np.argsort(coef)[:15]
for idx in top_acc:
    print(f"  {feature_names[idx]}: {coef[idx]:.3f}")

Interpreting the features: Features with large positive coefficients are terms whose presence makes the classifier more likely to predict misinformation. In this dataset, expect to see terms associated with the tracked claims (e.g., "open borders," "crime rate," "outsourced"), high-intensity emotional terms, and specific named entities. Features with large negative coefficients are terms whose presence makes the classifier more likely to predict accurate content — expect to see terms associated with fact-checking and analytical reporting (e.g., "evidence shows," "research suggests," "according to").


Section C: Fact-Checker Methodology Comparison

| Organization | Rating system | Key features | URL |
|---|---|---|---|
| PolitiFact | Truth-O-Meter: True, Mostly True, Half True, Mostly False, False, Pants on Fire | Named journalist accountability; searchable database | politifact.com |
| Washington Post Fact Checker | 1-4 Pinocchios; Geppetto Checkmark; Kessler Collection (for repeated falsehoods) | Named journalist; appeals process | washingtonpost.com/news/fact-checker |
| FactCheck.org | No formal scale; narrative rating | Affiliated with Annenberg Public Policy Center | factcheck.org |
| AP Fact Check | No scale; narrative only | Wire service model; high distribution | apnews.com/ap-fact-check |
| Snopes | Ratings: True, Mostly True, Mixture, Mostly False, False, Outdated, Unproven, Labeled Satire | Strong on viral/social content | snopes.com |

For this capstone: When your spread analysis identifies fact-checker articles in oda_media.csv (filter: source_type == 'factchecker'), these represent the correction side of the correction gap. Note that oda_media.csv includes ratings from multiple fact-checkers using different scales — the factcheck_rating column normalizes these to a common vocabulary where possible.


Section D: Correction Gap Research Reference

Students building their spread analyses should be aware of the following empirical findings from misinformation research:

Correction effectiveness varies by:

- Ideological prior: Corrections are more effective for ideologically neutral claims than for claims that align with the target's existing beliefs. Politically motivated reasoners show weaker response to corrections.
- Correction format: Detailed corrections with specific counter-evidence are more effective than simple "this is false" statements. Graph-based corrections (showing actual data vs. claimed data) are often more effective than text-only corrections.
- Source credibility: Corrections from high-credibility sources are more effective than corrections from sources perceived as partisan or low-credibility.
- Timing: Corrections immediately following the original exposure are more effective than delayed corrections. "Prebunking" (warning about a misleading claim before exposure) is consistently more effective than post-hoc correction.

Correction gap benchmarks (from published research, for context):

- Same-source corrections (same platform, similar reach): typically reach 20-30% of the original claim audience
- Cross-platform corrections (fact-checker on a different platform): typically reach 5-15% of the original claim audience
- Viral corrections (corrections that themselves go viral): rare; when they occur, the correction may reach 50-100%+ of the original audience

For your correction gap estimates: You will not be able to calculate exact correction gap ratios from oda_media.csv — the dataset doesn't contain precise audience size data. You should estimate ranges based on source type. A correction in a major newspaper (source_type = 'newspaper') reaching an estimated 50,000-100,000 readers represents a much smaller audience than a television advertisement with 1-2 million impressions. Be explicit about the estimation uncertainty in your analysis.
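
A sketch of range-based estimation under explicitly assumed audience ranges: the numbers in AUDIENCE_RANGE below are placeholders for illustration, not values from oda_media.csv or from the research above. Replace them with your own sourced estimates and report the resulting uncertainty.

# (low, high) audience estimates per correction outlet type --
# placeholder assumptions for illustration only
AUDIENCE_RANGE = {
    'newspaper':   (50_000, 100_000),
    'digital':     (20_000, 80_000),
    'factchecker': (10_000, 40_000),
}

def correction_gap_range(correction_sources, claim_reach):
    """Return (low, high) correction gap ratio estimates for one claim."""
    low = sum(AUDIENCE_RANGE[s][0] for s in correction_sources)
    high = sum(AUDIENCE_RANGE[s][1] for s in correction_sources)
    return low / claim_reach, high / claim_reach

# Example: two fact-checker corrections vs. a 2.44M-impression ad campaign
low, high = correction_gap_range(['factchecker', 'factchecker'], 2_440_000)
print(f"correction gap: {low:.1%} to {high:.1%}")  # 0.8% to 3.3%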


Section E: ODA Methodology Template

Use the following template for your Deliverable 4 public rating. This is the format ODA's published ratings use.

---
claim_id: ODA-[YOUR-INITIALS]-001
date_rated: [DATE]
last_updated: [DATE]
---

## Claim

> "[EXACT CLAIM TEXT]"

**Attributed to**: [Name, title, campaign/organization]
**Original source**: [Venue, date, format (speech/ad/press release/social media)]
**Date of original claim**: [DATE]

## ODA Rating

**Accuracy Rating**: [A1 / A2 / A3 / A4]
**Impact Rating**: [I1 / I2 / I3]

### Plain-Language Explanation

[150 words or fewer. No jargon. Explain what's wrong, why it matters, and
what the evidence actually shows. Write for a reader with no political science
background who has 30 seconds to spend on this claim.]

---

## Methodology Details

### Primary Sources Consulted

1. [Source name and specific document/page]
2. [Source name and specific document/page]
3. [Continue for all primary sources]

### Expert Consultation

[Name, title, institution] was contacted on [date]. [Their relevant assessment,
quoted or paraphrased with permission level noted.]

### Campaign Response

ODA contacted [campaign/organization] on [date] with the following request:
[Brief description of what was asked.]

Response received on [date / no response received by publication date]:
[Full response or notation of non-response.]

### Reasoning Chain

[Step-by-step: what the evidence shows, how each piece connects to the rating,
where judgment calls were made.]

### What Would Change This Rating

This rating would change to [different rating] if:
[Specific evidence type that would warrant a revision.]

---

## Spread Data

**Original reach estimate**: [Number] [Method: advertising buy data / social
media engagement / news audience estimates]
**Correction reach estimate**: [Number or "not yet documented"] [Method]
**Correction gap ratio**: [X%] [Uncertainty range]

---

*This rating reflects evidence available as of [date]. ODA will issue a
correction if new evidence material to this rating becomes available.*

Section F: Equity Checklist (ODA Version)

The following is ODA's equity checklist, reproduced from Adaeze Nwosu's internal documentation. Apply this checklist in Deliverable 5.

  1. Who benefits from this system, and who bears risk? Identify specific communities on each side of this question.

  2. Are data sources equitable? Does the system's data collection represent all communities with equal rigor, or are some communities systematically underrepresented?

  3. Does the methodology systematically disadvantage any group? Consider: minimum thresholds, language requirements, geographic coverage, technology access requirements.

  4. Is the system's communication accessible? Is output available in languages spoken by all major communities? Is reading level appropriate for a general audience?

  5. Does the system's governance reflect the diversity of the communities it serves? Who makes decisions? Who has oversight? Who is excluded?

  6. Are impacts on historically marginalized communities weighted appropriately? Does a claim with concentrated impact on a specific community receive proportionate attention relative to its aggregate reach?

  7. What is the feedback mechanism? Can communities affected by the system's decisions report concerns and have them heard?

Apply each question specifically to your tracker design. Generic answers ("we are committed to equity") do not satisfy the checklist — each question requires a specific assessment and, where gaps are identified, a specific remediation.