Appendix C: Python Library Reference Guide
This appendix provides a practical reference for the Python libraries used throughout the textbook. Each section includes installation notes, core functionality, and annotated code examples oriented toward misinformation research. The examples assume Python 3.9 or later. For environment setup, see Section C.10.
C.1 NumPy: Numerical Computing
NumPy is the foundational library for numerical computing in Python. It provides the ndarray object — a fast, memory-efficient array — and a comprehensive set of mathematical functions.
Installation
pip install numpy
Core Operations
import numpy as np
# Creating arrays
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2, 3], [4, 5, 6]])
# Array properties
print(arr.shape) # (5,)
print(matrix.shape) # (2, 3)
print(arr.dtype) # int64 (default on most platforms; int32 on some Windows builds)
# Arithmetic (element-wise by default)
shares = np.array([100, 250, 50, 1000, 75])
log_shares = np.log1p(shares) # log(1+x), handles 0 gracefully
normalized = (shares - shares.mean()) / shares.std()
# Boolean masking
viral = shares[shares > 200] # array([250, 1000])
# Random number generation (misinformation simulation)
rng = np.random.default_rng(seed=42)
belief_scores = rng.normal(loc=50, scale=15, size=1000) # N(50, 15)
Mathematical Functions
# Descriptive statistics
print(np.mean(shares)) # 295.0
print(np.median(shares)) # 100.0
print(np.std(shares)) # 359.3...
print(np.percentile(shares, [25, 50, 75])) # quartiles
# Correlations
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
correlation_matrix = np.corrcoef(x, y)
r = correlation_matrix[0, 1] # 0.9986...
# Matrix operations (used in ML feature engineering)
A = np.array([[1, 2], [3, 4]])
A_inv = np.linalg.inv(A)
eigenvalues, eigenvectors = np.linalg.eig(A)
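Element-wise arithmetic also extends across arrays of different shapes through broadcasting, which matters in practice because feature matrices are usually normalized column-wise. A minimal sketch with a hypothetical engagement matrix (the column layout here is illustrative):

```python
import numpy as np

# Hypothetical engagement matrix: rows = posts, columns = [shares, likes, replies]
engagement = np.array([[100, 400, 20],
                       [250, 900, 55],
                       [50, 120, 10]], dtype=float)

# Broadcasting: the (3,) vector of column means is stretched across all rows,
# so each column is centered and scaled independently
col_means = engagement.mean(axis=0)  # shape (3,)
col_stds = engagement.std(axis=0)    # shape (3,)
z_scores = (engagement - col_means) / col_stds  # shape (3, 3)

print(z_scores.mean(axis=0))  # ~[0, 0, 0] after standardization
```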
C.2 Pandas: Data Manipulation
Pandas provides the DataFrame — a labeled, two-dimensional data structure analogous to a spreadsheet — and tools for data wrangling, cleaning, and exploratory analysis.
Installation
pip install pandas
Loading Data
import pandas as pd
# Load CSV dataset (e.g., LIAR dataset)
df = pd.read_csv("liar_dataset/train.tsv", sep="\t",
names=["id","label","statement","subject","speaker",
"job","state","party","barely_true","false",
"half_true","mostly_true","pants_fire","context"])
# Load JSON (common for social media exports)
tweets = pd.read_json("tweets.json", lines=True)
# Quick inspection
print(df.shape) # (10269, 14) for the train split (full LIAR dataset: 12,836 claims)
print(df.dtypes)
print(df.head(3))
print(df["label"].value_counts())
Filtering and Selection
# Boolean filtering
false_claims = df[df["label"].isin(["false", "pants-fire"])]
political = df[df["subject"].str.contains("politics", case=False, na=False)]
# Select columns
subset = df[["label", "statement", "speaker"]]
# Query syntax
high_confidence_false = df.query("pants_fire > 5 and barely_true < 2")
# Handling missing values
df_clean = df.dropna(subset=["statement"])
df["party"] = df["party"].fillna("unknown")
GroupBy and Aggregation
# Credibility by party
party_stats = df.groupby("party").agg(
total_claims=("label", "count"),
false_count=("label", lambda x: (x.isin(["false","pants-fire"])).sum()),
false_rate=("label", lambda x: (x.isin(["false","pants-fire"])).mean())
).reset_index().sort_values("false_rate", ascending=False)
# Pivot tables
pivot = df.pivot_table(values="id", index="party",
columns="label", aggfunc="count", fill_value=0)
Merging and Reshaping
# Merge two DataFrames on speaker ID
speaker_meta = pd.read_csv("speaker_metadata.csv")
merged = df.merge(speaker_meta, on="speaker", how="left")
# Melt wide to long format (useful for panel data)
long_df = pd.melt(merged, id_vars=["id","statement"],
value_vars=["barely_true","false","half_true","mostly_true"],
var_name="rating_type", value_name="count")
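Social media exports almost always carry timestamps, so a common wrangling step is converting them to datetimes and resampling to a regular frequency. A minimal sketch on a synthetic share-event DataFrame (the column names are illustrative, not from the LIAR dataset):

```python
import pandas as pd

# Synthetic share events (illustrative column names)
events = pd.DataFrame({
    "created_at": ["2024-01-01 09:00", "2024-01-01 17:30",
                   "2024-01-02 11:15", "2024-01-04 08:45"],
    "shares": [120, 45, 300, 80],
})
events["created_at"] = pd.to_datetime(events["created_at"])

# Daily share totals; days with no events appear as 0 after resampling
daily = (events.set_index("created_at")["shares"]
               .resample("D").sum())
print(daily)
```

The same pattern ("W" for weekly, "h" for hourly) underlies most time-series plots of claim diffusion.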
C.3 Matplotlib and Seaborn: Visualization
Matplotlib provides low-level plotting control; Seaborn builds on Matplotlib with higher-level statistical visualizations and attractive default styles.
Installation
pip install matplotlib seaborn
Common Chart Types
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid", palette="colorblind")
# --- Bar chart: false claim frequency by category ---
fig, ax = plt.subplots(figsize=(10, 5))
df["label"].value_counts().plot(kind="bar", ax=ax)
ax.set_title("Distribution of Claim Veracity Labels")
ax.set_xlabel("Label")
ax.set_ylabel("Count")
plt.tight_layout()
plt.savefig("label_distribution.png", dpi=150)
# --- Histogram: belief score distribution ---
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(belief_scores, bins=40, edgecolor="black", alpha=0.7)
ax.axvline(belief_scores.mean(), color="red", linestyle="--", label="Mean")
ax.set_xlabel("Belief Score (0–100)")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of Misinformation Belief Scores")
ax.legend()
# --- Scatter plot with regression line ---
# (assumes a survey DataFrame with media_literacy and misinformation_belief columns)
fig, ax = plt.subplots(figsize=(7, 6))
ax.scatter(df["media_literacy"], df["misinformation_belief"], alpha=0.3, s=10)
sns.regplot(x="media_literacy", y="misinformation_belief",
data=df, scatter=False, color="red", ax=ax)
ax.set_title("Media Literacy vs. Misinformation Belief (r = −0.41)")
# --- Heatmap: correlation matrix (assumes a survey DataFrame with these columns) ---
corr = df[["media_literacy","critical_thinking","sharing_intent","belief"]].corr()
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="RdYlGn",
center=0, square=True, ax=ax)
ax.set_title("Feature Correlation Matrix")
C.4 scikit-learn: Machine Learning
scikit-learn is the standard Python library for classical machine learning — preprocessing, model fitting, evaluation, and model selection.
Installation
pip install scikit-learn
Preprocessing
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
# Encode target labels
le = LabelEncoder()
y = le.fit_transform(df["label"]) # maps strings to integers
# TF-IDF features from text
vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 2),
stop_words="english", sublinear_tf=True)
X_text = vectorizer.fit_transform(df["statement"])
# Train/test split (stratified to preserve class proportions)
X_train, X_test, y_train, y_test = train_test_split(
X_text, y, test_size=0.2, random_state=42, stratify=y)
Logistic Regression
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=1000, C=1.0, solver="lbfgs",
multi_class="multinomial", random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Feature importance (top words for each class)
feature_names = vectorizer.get_feature_names_out()
for i, class_name in enumerate(le.classes_):
top_idx = clf.coef_[i].argsort()[-10:][::-1]
print(f"{class_name}: {', '.join(feature_names[top_idx])}")
Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200, max_depth=20,
random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
# Feature importances
importances = pd.Series(rf.feature_importances_,
index=feature_names).sort_values(ascending=False)
print(importances.head(20))
Evaluation Metrics
from sklearn.metrics import (classification_report, confusion_matrix,
roc_auc_score, accuracy_score,
precision_recall_fscore_support)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred, target_names=le.classes_))
# ROC AUC (for binary classification)
y_prob = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
print(f"ROC AUC: {auc:.3f}")
# Confusion matrix visualization
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", xticklabels=le.classes_,
yticklabels=le.classes_, cmap="Blues")
plt.title("Confusion Matrix")
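The single train/test split above yields one performance estimate; k-fold cross-validation averages over several splits for a more stable one. A minimal sketch with synthetic features standing in for the TF-IDF matrix (in a real pipeline, X and y would come from the preprocessing step above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the TF-IDF feature matrix and labels
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print(f"Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```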
C.5 NLTK: Natural Language Toolkit
NLTK provides classical NLP tools: tokenization, stopword removal, stemming, and more. It is particularly useful for preprocessing before TF-IDF or classical ML pipelines.
Installation
pip install nltk
python -c "import nltk; nltk.download(['punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger'])"
Core Operations
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
text = "Politicians are spreading false claims about vaccines, but experts disagree."
# Tokenization
words = word_tokenize(text)
sentences = sent_tokenize(text)
print(words)
# ['Politicians', 'are', 'spreading', 'false', 'claims', ...]
# Stopword removal
stop_words = set(stopwords.words("english"))
filtered = [w for w in words if w.lower() not in stop_words
and w.isalpha()]
print(filtered)
# ['Politicians', 'spreading', 'false', 'claims', 'experts', 'disagree']
# Stemming (aggressive, reduces to root form)
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in filtered]
# ['politician', 'spread', 'fals', 'claim', 'expert', 'disagre']
# Lemmatization (returns real dictionary words)
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w.lower(), pos="v") for w in filtered]
# ['politicians', 'spread', 'false', 'claim', 'experts', 'disagree']
# POS tagging
pos_tags = nltk.pos_tag(filtered)
# [('Politicians', 'NNS'), ('spreading', 'VBG'), ...]
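After stopword removal, a frequency distribution is a quick way to surface the most common terms before committing to TF-IDF features. NLTK's FreqDist operates on any token list and requires no downloaded corpora (the tokens below are hardcoded for illustration):

```python
from nltk import FreqDist

# Tokens as produced by a stopword-filtering step (hardcoded here)
filtered = ["politicians", "spreading", "false", "claims",
            "false", "claims", "experts", "disagree", "claims"]
fdist = FreqDist(filtered)
print(fdist.most_common(3))  # [('claims', 3), ('false', 2), ...]
```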
C.6 spaCy: Industrial-Strength NLP
spaCy provides fast, production-ready NLP pipelines with support for named entity recognition (NER), dependency parsing, and semantic similarity; coreference resolution is available through the separate spacy-experimental package.
Installation
pip install spacy
python -m spacy download en_core_web_sm # small model (faster)
python -m spacy download en_core_web_lg # large model (better accuracy)
Core Pipeline
import spacy
nlp = spacy.load("en_core_web_lg")
text = "Donald Trump claimed in 2020 that the election was stolen from him."
doc = nlp(text)
# Named Entity Recognition (NER)
for ent in doc.ents:
print(f"{ent.text:25s} {ent.label_:10s} {spacy.explain(ent.label_)}")
# Donald Trump              PERSON     People, including fictional
# 2020                      DATE       Absolute or relative dates
# (pronouns such as "him" are not tagged as entities)
# Dependency parsing
for token in doc:
print(f"{token.text:15s} {token.dep_:12s} {token.head.text}")
# claimed ROOT claimed
# election nsubjpass stolen
# Noun chunks (useful for claim extraction)
for chunk in doc.noun_chunks:
print(chunk.text, "->", chunk.root.dep_)
Semantic Similarity
# Compare sentences for similarity (requires large model with vectors)
doc1 = nlp("COVID-19 vaccines are safe and effective.")
doc2 = nlp("The coronavirus shot has been proven to work without danger.")
doc3 = nlp("Climate change is an existential threat.")
print(doc1.similarity(doc2)) # ~0.85 (high similarity — paraphrase)
print(doc1.similarity(doc3)) # ~0.55 (lower similarity — different topic)
Batch Processing
# Efficient processing of large datasets with pipe()
texts = df["statement"].tolist()
processed = []
for doc in nlp.pipe(texts, batch_size=64, disable=["parser"]):
entities = [(ent.text, ent.label_) for ent in doc.ents]
processed.append(entities)
df["entities"] = processed
C.7 Transformers (HuggingFace): Large Language Models
The transformers library provides access to thousands of pre-trained transformer models (BERT, RoBERTa, GPT, T5, etc.) for tasks like text classification, generation, and question answering.
Installation
pip install transformers torch datasets accelerate
Loading Models and the Pipeline API
from transformers import pipeline
# Zero-shot classification — useful for labeling without training data
classifier = pipeline("zero-shot-classification",
model="facebook/bart-large-mnli")
headline = "Scientists discover that drinking bleach cures COVID-19."
candidate_labels = ["misinformation", "satire", "legitimate news"]
result = classifier(headline, candidate_labels)
print(result["labels"][0], result["scores"][0])
# misinformation, score ≈ 0.9 (exact value varies by model version)
# Sentiment analysis (pin a model explicitly to avoid the default-model warning)
sentiment = pipeline("sentiment-analysis",
                     model="distilbert-base-uncased-finetuned-sst-2-english")
texts = ["This vaccine is totally safe!", "I can't believe they lied again."]
for text, res in zip(texts, sentiment(texts)):
print(f"{text[:40]} -> {res['label']} ({res['score']:.3f})")
Tokenization and Feature Extraction
from transformers import AutoTokenizer, AutoModel
import torch
model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
texts = ["The earth is flat.", "Scientists confirm vaccine safety."]
encoded = tokenizer(texts, padding=True, truncation=True,
max_length=128, return_tensors="pt")
with torch.no_grad():
outputs = model(**encoded)
# First-token embeddings (RoBERTa's <s>, analogous to BERT's [CLS]) as sentence representations
embeddings = outputs.last_hidden_state[:, 0, :] # shape: (2, 768)
Fine-Tuning Basics
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import Dataset
# Prepare dataset (train_df/test_df: DataFrames with "text" and "label" columns)
train_data = Dataset.from_pandas(train_df[["text", "label"]])
test_data = Dataset.from_pandas(test_df[["text", "label"]])
# The tokenizer must match the checkpoint being fine-tuned
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize_function(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)
train_tok = train_data.map(tokenize_function, batched=True)
test_tok = test_data.map(tokenize_function, batched=True)
# Load pre-trained model for classification
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
# Training configuration
training_args = TrainingArguments(
    output_dir="./results", num_train_epochs=3, per_device_train_batch_size=16,
    evaluation_strategy="epoch", save_strategy="epoch", load_best_model_at_end=True)
# Passing the tokenizer gives the Trainer a padding data collator
trainer = Trainer(model=model, args=training_args, tokenizer=tokenizer,
                  train_dataset=train_tok, eval_dataset=test_tok)
trainer.train()
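By default the Trainer reports only loss during evaluation; supplying a compute_metrics function adds task metrics at each evaluation step. A sketch for the binary setup above (the Trainer calls the function with an EvalPrediction exposing .predictions, the raw logits, and .label_ids):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    # eval_pred.predictions: logits of shape (n_examples, n_labels)
    logits, labels = eval_pred.predictions, eval_pred.label_ids
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds),
            "f1_macro": f1_score(labels, preds, average="macro")}

# Passed to the Trainer as: Trainer(..., compute_metrics=compute_metrics)
```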
C.8 NetworkX: Graph Analysis
NetworkX enables creation, analysis, and visualization of complex networks — essential for studying information spread, platform ecosystems, and coordinated inauthentic behavior.
Installation
pip install networkx
Graph Creation
import networkx as nx
import matplotlib.pyplot as plt
# Directed graph (retweet network)
G = nx.DiGraph()
# Add edges from retweet data
retweets = [("user_A", "user_B"), ("user_C", "user_B"),
("user_B", "user_D"), ("user_A", "user_D")]
G.add_edges_from(retweets)
# Node attributes
G.nodes["user_A"]["is_bot"] = False
G.nodes["user_B"]["followers"] = 5000
print(f"Nodes: {G.number_of_nodes()}, Edges: {G.number_of_edges()}")
print(f"Is DAG: {nx.is_directed_acyclic_graph(G)}")
Centrality Measures
# Degree centrality (normalized by max possible)
degree_centrality = nx.degree_centrality(G)
# Betweenness centrality (fraction of shortest paths through a node)
betweenness = nx.betweenness_centrality(G, normalized=True)
# PageRank (eigenvector-based importance, used by Google)
pagerank = nx.pagerank(G, alpha=0.85)
# Summary table
centrality_df = pd.DataFrame({
"node": list(G.nodes()),
"degree": [degree_centrality[n] for n in G.nodes()],
"betweenness": [betweenness[n] for n in G.nodes()],
"pagerank": [pagerank[n] for n in G.nodes()]
}).sort_values("pagerank", ascending=False)
print(centrality_df.head(10))
Community Detection
# Convert to undirected for community detection
G_undirected = G.to_undirected()
# Louvain algorithm (via python-louvain package)
import community as community_louvain
partition = community_louvain.best_partition(G_undirected)
modularity = community_louvain.modularity(partition, G_undirected)
print(f"Modularity Q = {modularity:.3f}")
# Assign communities to nodes
for node, comm in partition.items():
G.nodes[node]["community"] = comm
# Number of communities
n_communities = len(set(partition.values()))
print(f"Number of communities: {n_communities}")
Visualization
# Spring layout (force-directed)
pos = nx.spring_layout(G, k=2, seed=42)
node_colors = [partition.get(n, 0) for n in G.nodes()]
fig, ax = plt.subplots(figsize=(12, 10))
nx.draw_networkx(G, pos=pos, node_color=node_colors,
cmap=plt.cm.Set3, node_size=100, arrows=True,
with_labels=False, edge_color="gray", ax=ax)
ax.set_title(f"Retweet Network (Q={modularity:.3f})")
plt.axis("off")
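For information-spread questions, a directed retweet network directly answers "how far can content travel from a given account": the reachable set downstream of a seed node. A minimal sketch on the toy graph from this section (rebuilt here so the snippet is self-contained):

```python
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([("user_A", "user_B"), ("user_C", "user_B"),
                  ("user_B", "user_D"), ("user_A", "user_D")])

# Reach: every account downstream of the seed in the retweet graph
reach = nx.descendants(G, "user_A")
print(reach)  # members: user_B, user_D (print order may vary)

# Hop distance from the seed to each reachable account
hops = nx.shortest_path_length(G, source="user_A")
print(hops)  # user_A: 0, user_B: 1, user_D: 1
```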
C.9 scipy.stats: Statistical Tests
SciPy provides a comprehensive set of statistical tests appropriate for the experimental designs common in misinformation research.
Installation
pip install scipy
Common Statistical Tests
from scipy import stats
# --- Independent samples t-test ---
# Does an inoculation intervention reduce belief in conspiracy theories?
control_group = [65, 70, 55, 80, 72, 68, 75, 60, 71, 66]
treatment_group = [50, 45, 55, 48, 52, 44, 58, 47, 51, 53]
t_stat, p_value = stats.ttest_ind(control_group, treatment_group,
equal_var=False) # Welch's t-test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# Effect size (Cohen's d)
d = (np.mean(control_group) - np.mean(treatment_group)) / \
np.sqrt((np.var(control_group, ddof=1) + np.var(treatment_group, ddof=1)) / 2)
print(f"Cohen's d = {d:.3f}")
# --- Chi-square test of independence ---
# Is misinformation sharing independent of education level?
contingency_table = np.array([[120, 80], [60, 140], [30, 170]])
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)
print(f"χ²({dof}) = {chi2:.3f}, p = {p:.4f}")
# Cramér's V (effect size for chi-square)
n = contingency_table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(contingency_table.shape) - 1)))
print(f"Cramér's V = {cramers_v:.3f}")
# --- Pearson and Spearman correlation ---
x = [3, 5, 2, 8, 6, 4, 9, 1, 7, 5] # media literacy score
y = [70, 55, 80, 30, 45, 60, 25, 85, 35, 50] # misinformation belief
r_pearson, p_pearson = stats.pearsonr(x, y)
r_spearman, p_spearman = stats.spearmanr(x, y)
print(f"Pearson r = {r_pearson:.3f} (p = {p_pearson:.4f})")
print(f"Spearman ρ = {r_spearman:.3f} (p = {p_spearman:.4f})")
# --- One-sample t-test ---
# Is average belief different from the neutral midpoint (50)?
belief_scores_sample = [55, 62, 48, 71, 59, 44, 67, 53, 60, 58]
t, p = stats.ttest_1samp(belief_scores_sample, popmean=50)
print(f"One-sample t = {t:.3f}, p = {p:.4f}")
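Designs with more than two conditions (e.g., control vs. two intervention types) call for a one-way ANOVA rather than repeated t-tests, which inflate the Type I error rate. A sketch with hypothetical belief scores under three conditions:

```python
from scipy import stats

# Hypothetical belief scores under three experimental conditions
control = [65, 70, 55, 80, 72, 68, 75, 60]
inoculation = [50, 45, 55, 48, 52, 44, 58, 47]
debunking = [58, 62, 54, 60, 57, 63, 55, 59]

# One-way ANOVA: do the three group means differ?
f_stat, p_value = stats.f_oneway(control, inoculation, debunking)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")

# Kruskal-Wallis is the rank-based alternative when normality is doubtful
h_stat, p_kw = stats.kruskal(control, inoculation, debunking)
print(f"H = {h_stat:.3f}, p = {p_kw:.4f}")
```

A significant omnibus F would typically be followed by pairwise comparisons with a multiple-testing correction.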
C.10 Environment Setup
Virtual Environments
Always work in a virtual environment to avoid dependency conflicts across projects.
# Create a virtual environment named 'misinfo-env'
python -m venv misinfo-env
# Activate (Windows)
misinfo-env\Scripts\activate
# Activate (macOS/Linux)
source misinfo-env/bin/activate
# Deactivate
deactivate
requirements.txt
Create a requirements.txt to record all dependencies with pinned versions:
numpy==1.26.4
pandas==2.2.1
matplotlib==3.8.4
seaborn==0.13.2
scikit-learn==1.4.2
nltk==3.8.1
spacy==3.7.4
transformers==4.40.0
torch==2.3.0
datasets==2.19.1
networkx==3.3
scipy==1.13.0
python-louvain==0.16
accelerate==0.29.3
Install all dependencies at once:
pip install -r requirements.txt
Save current environment to requirements.txt:
pip freeze > requirements.txt
Troubleshooting Common Issues
torch installation fails on Windows:
# Install CPU-only version first to verify setup
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
spaCy model not found:
python -m spacy download en_core_web_sm
# If behind a firewall:
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
pandas read_csv encoding error:
# Try specifying encoding explicitly
df = pd.read_csv("file.csv", encoding="utf-8-sig") # handles BOM
df = pd.read_csv("file.csv", encoding="latin-1") # for older Windows exports
Memory error with large transformer models:
# Load in 8-bit precision to reduce memory footprint
# (requires the bitsandbytes package and a CUDA-capable GPU)
model = AutoModelForSequenceClassification.from_pretrained(
"bert-large-uncased", load_in_8bit=True, device_map="auto")
NLTK resource not found:
import nltk
nltk.download("all") # downloads everything (~3.5 GB)
# Or selectively:
for resource in ["punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"]:
nltk.download(resource)
Jupyter Notebook Setup
pip install jupyter ipykernel
python -m ipykernel install --user --name=misinfo-env --display-name "Misinfo Research"
jupyter notebook
All code examples in this textbook were developed and tested in Jupyter notebooks. The companion GitHub repository (see the textbook's online resources) contains runnable notebooks for every chapter with computational exercises.