Appendix C: Python Library Reference Guide
This appendix provides a practical reference for the Python libraries used throughout the textbook. Each section includes installation notes, core functionality, and annotated code examples oriented toward misinformation research. The examples assume Python 3.9 or later. For environment setup, see Section C.10.
C.1 NumPy: Numerical Computing
NumPy is the foundational library for numerical computing in Python. It provides the ndarray object — a fast, memory-efficient array — and a comprehensive set of mathematical functions.
Installation
pip install numpy
Core Operations
import numpy as np
# Creating arrays
arr = np.array([1, 2, 3, 4, 5])
matrix = np.array([[1, 2, 3], [4, 5, 6]])
# Array properties
print(arr.shape) # (5,)
print(matrix.shape) # (2, 3)
print(arr.dtype) # int64 (default on most platforms; int32 on some Windows builds)
# Arithmetic (element-wise by default)
shares = np.array([100, 250, 50, 1000, 75])
log_shares = np.log1p(shares) # log(1+x), handles 0 gracefully
normalized = (shares - shares.mean()) / shares.std()
# Boolean masking
viral = shares[shares > 200] # array([250, 1000])
# Random number generation (misinformation simulation)
rng = np.random.default_rng(seed=42)
belief_scores = rng.normal(loc=50, scale=15, size=1000) # N(50, 15)
Mathematical Functions
# Descriptive statistics
print(np.mean(shares)) # 295.0
print(np.median(shares)) # 100.0
print(np.std(shares)) # 359.3...
print(np.percentile(shares, [25, 50, 75])) # quartiles
# Correlations
x = np.array([1, 2, 3, 4, 5])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
correlation_matrix = np.corrcoef(x, y)
r = correlation_matrix[0, 1] # 0.9986...
# Matrix operations (used in ML feature engineering)
A = np.array([[1, 2], [3, 4]])
A_inv = np.linalg.inv(A)
eigenvalues, eigenvectors = np.linalg.eig(A)
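Element-wise arithmetic also extends across arrays of different shapes through broadcasting, which matters in practice because feature matrices are usually normalized column-wise. A minimal sketch with a hypothetical engagement matrix (the column layout here is illustrative):

```python
import numpy as np

# Hypothetical engagement matrix: rows = posts, columns = [shares, likes, replies]
engagement = np.array([[100, 400, 20],
                       [250, 900, 55],
                       [50, 120, 10]], dtype=float)

# Broadcasting: the (3,) vector of column means is stretched across all rows,
# so each column is centered and scaled independently
col_means = engagement.mean(axis=0)  # shape (3,)
col_stds = engagement.std(axis=0)    # shape (3,)
z_scores = (engagement - col_means) / col_stds  # shape (3, 3)

print(z_scores.mean(axis=0))  # ~[0, 0, 0] after standardization
```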
C.2 Pandas: Data Manipulation
Pandas provides the DataFrame — a labeled, two-dimensional data structure analogous to a spreadsheet — and tools for data wrangling, cleaning, and exploratory analysis.
Installation
pip install pandas
Loading Data
import pandas as pd
# Load CSV dataset (e.g., LIAR dataset)
df = pd.read_csv("liar_dataset/train.tsv", sep="\t",
names=["id","label","statement","subject","speaker",
"job","state","party","barely_true","false",
"half_true","mostly_true","pants_fire","context"])
# Load JSON (common for social media exports)
tweets = pd.read_json("tweets.json", lines=True)
# Quick inspection
print(df.shape) # (10269, 14) for the train split (full LIAR dataset: 12,836 claims)
print(df.dtypes)
print(df.head(3))
print(df["label"].value_counts())
Filtering and Selection
# Boolean filtering
false_claims = df[df["label"].isin(["false", "pants-fire"])]
political = df[df["subject"].str.contains("politics", case=False, na=False)]
# Select columns
subset = df[["label", "statement", "speaker"]]
# Query syntax
high_confidence_false = df.query("pants_fire > 5 and barely_true < 2")
# Handling missing values
df_clean = df.dropna(subset=["statement"])
df["party"] = df["party"].fillna("unknown")
GroupBy and Aggregation
# Credibility by party
party_stats = df.groupby("party").agg(
total_claims=("label", "count"),
false_count=("label", lambda x: (x.isin(["false","pants-fire"])).sum()),
false_rate=("label", lambda x: (x.isin(["false","pants-fire"])).mean())
).reset_index().sort_values("false_rate", ascending=False)
# Pivot tables
pivot = df.pivot_table(values="id", index="party",
columns="label", aggfunc="count", fill_value=0)
Merging and Reshaping
# Merge two DataFrames on speaker ID
speaker_meta = pd.read_csv("speaker_metadata.csv")
merged = df.merge(speaker_meta, on="speaker", how="left")
# Melt wide to long format (useful for panel data)
long_df = pd.melt(merged, id_vars=["id","statement"],
value_vars=["barely_true","false","half_true","mostly_true"],
var_name="rating_type", value_name="count")
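Social media exports almost always carry timestamps, so a common wrangling step is converting them to datetimes and resampling to a regular frequency. A minimal sketch on a synthetic share-event DataFrame (the column names are illustrative, not from the LIAR dataset):

```python
import pandas as pd

# Synthetic share events (illustrative column names)
events = pd.DataFrame({
    "created_at": ["2024-01-01 09:00", "2024-01-01 17:30",
                   "2024-01-02 11:15", "2024-01-04 08:45"],
    "shares": [120, 45, 300, 80],
})
events["created_at"] = pd.to_datetime(events["created_at"])

# Daily share totals; days with no events appear as 0 after resampling
daily = (events.set_index("created_at")["shares"]
               .resample("D").sum())
print(daily)
```

The same pattern ("W" for weekly, "h" for hourly) underlies most time-series plots of claim diffusion.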
C.3 Matplotlib and Seaborn: Visualization
Matplotlib provides low-level plotting control; Seaborn builds on Matplotlib with higher-level statistical visualizations and attractive default styles.
Installation
pip install matplotlib seaborn
Common Chart Types
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid", palette="colorblind")
# --- Bar chart: false claim frequency by category ---
fig, ax = plt.subplots(figsize=(10, 5))
df["label"].value_counts().plot(kind="bar", ax=ax)
ax.set_title("Distribution of Claim Veracity Labels")
ax.set_xlabel("Label")
ax.set_ylabel("Count")
plt.tight_layout()
plt.savefig("label_distribution.png", dpi=150)
# --- Histogram: belief score distribution ---
fig, ax = plt.subplots(figsize=(8, 4))
ax.hist(belief_scores, bins=40, edgecolor="black", alpha=0.7)
ax.axvline(belief_scores.mean(), color="red", linestyle="--", label="Mean")
ax.set_xlabel("Belief Score (0–100)")
ax.set_ylabel("Frequency")
ax.set_title("Distribution of Misinformation Belief Scores")
ax.legend()
# --- Scatter plot with regression line ---
# (assumes a survey DataFrame with media_literacy and misinformation_belief columns)
fig, ax = plt.subplots(figsize=(7, 6))
ax.scatter(df["media_literacy"], df["misinformation_belief"], alpha=0.3, s=10)
sns.regplot(x="media_literacy", y="misinformation_belief",
data=df, scatter=False, color="red", ax=ax)
ax.set_title("Media Literacy vs. Misinformation Belief (r = −0.41)")
# --- Heatmap: correlation matrix (assumes a survey DataFrame with these columns) ---
corr = df[["media_literacy","critical_thinking","sharing_intent","belief"]].corr()
fig, ax = plt.subplots(figsize=(7, 6))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="RdYlGn",
center=0, square=True, ax=ax)
ax.set_title("Feature Correlation Matrix")
C.4 scikit-learn: Machine Learning
scikit-learn is the standard Python library for classical machine learning — preprocessing, model fitting, evaluation, and model selection.
Installation
pip install scikit-learn
Preprocessing
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
# Encode target labels
le = LabelEncoder()
y = le.fit_transform(df["label"]) # maps strings to integers
# TF-IDF features from text
vectorizer = TfidfVectorizer(max_features=10000, ngram_range=(1, 2),
stop_words="english", sublinear_tf=True)
X_text = vectorizer.fit_transform(df["statement"])
# Train/test split (stratified to preserve class proportions)
X_train, X_test, y_train, y_test = train_test_split(
X_text, y, test_size=0.2, random_state=42, stratify=y)
Logistic Regression
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(max_iter=1000, C=1.0, solver="lbfgs",
multi_class="multinomial", random_state=42)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Feature importance (top words for each class)
feature_names = vectorizer.get_feature_names_out()
for i, class_name in enumerate(le.classes_):
top_idx = clf.coef_[i].argsort()[-10:][::-1]
print(f"{class_name}: {', '.join(feature_names[top_idx])}")
Random Forest Classifier
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=200, max_depth=20,
random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
y_pred_rf = rf.predict(X_test)
# Feature importances
importances = pd.Series(rf.feature_importances_,
index=feature_names).sort_values(ascending=False)
print(importances.head(20))
Evaluation Metrics
from sklearn.metrics import (classification_report, confusion_matrix,
roc_auc_score, accuracy_score,
precision_recall_fscore_support)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print(classification_report(y_test, y_pred, target_names=le.classes_))
# ROC AUC (for binary classification)
y_prob = clf.predict_proba(X_test)[:, 1]
auc = roc_auc_score(y_test, y_prob)
print(f"ROC AUC: {auc:.3f}")
# Confusion matrix visualization
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", xticklabels=le.classes_,
yticklabels=le.classes_, cmap="Blues")
plt.title("Confusion Matrix")
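The single train/test split above yields one performance estimate; k-fold cross-validation averages over several splits for a more stable one. A minimal sketch with synthetic features standing in for the TF-IDF matrix (in a real pipeline, X and y would come from the preprocessing step above):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the TF-IDF feature matrix and labels
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 20))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

clf = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(clf, X, y, cv=cv, scoring="accuracy")
print(f"Accuracy: {scores.mean():.3f} ± {scores.std():.3f}")
```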
C.5 NLTK: Natural Language Toolkit
NLTK provides classical NLP tools: tokenization, stopword removal, stemming, and more. It is particularly useful for preprocessing before TF-IDF or classical ML pipelines.
Installation
pip install nltk
python -c "import nltk; nltk.download(['punkt', 'stopwords', 'wordnet', 'averaged_perceptron_tagger'])"
Core Operations
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
text = "Politicians are spreading false claims about vaccines, but experts disagree."
# Tokenization
words = word_tokenize(text)
sentences = sent_tokenize(text)
print(words)
# ['Politicians', 'are', 'spreading', 'false', 'claims', ...]
# Stopword removal
stop_words = set(stopwords.words("english"))
filtered = [w for w in words if w.lower() not in stop_words
and w.isalpha()]
print(filtered)
# ['Politicians', 'spreading', 'false', 'claims', 'experts', 'disagree']
# Stemming (aggressive, reduces to root form)
stemmer = PorterStemmer()
stemmed = [stemmer.stem(w) for w in filtered]
# ['politician', 'spread', 'fals', 'claim', 'expert', 'disagre']
# Lemmatization (returns real dictionary words)
lemmatizer = WordNetLemmatizer()
lemmatized = [lemmatizer.lemmatize(w.lower(), pos="v") for w in filtered]
# ['politicians', 'spread', 'false', 'claim', 'experts', 'disagree']
# POS tagging
pos_tags = nltk.pos_tag(filtered)
# [('Politicians', 'NNS'), ('spreading', 'VBG'), ...]
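After stopword removal, a frequency distribution is a quick way to surface the most common terms before committing to TF-IDF features. NLTK's FreqDist operates on any token list and requires no downloaded corpora (the tokens below are hardcoded for illustration):

```python
from nltk import FreqDist

# Tokens as produced by a stopword-filtering step (hardcoded here)
filtered = ["politicians", "spreading", "false", "claims",
            "false", "claims", "experts", "disagree", "claims"]
fdist = FreqDist(filtered)
print(fdist.most_common(3))  # [('claims', 3), ('false', 2), ...]
```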
C.6 spaCy: Industrial-Strength NLP
spaCy provides fast, production-ready NLP pipelines with support for named entity recognition (NER), dependency parsing, and semantic similarity; coreference resolution is available through the separate spacy-experimental package.
Installation
pip install spacy
python -m spacy download en_core_web_sm # small model (faster)
python -m spacy download en_core_web_lg # large model (better accuracy)
Core Pipeline
import spacy
nlp = spacy.load("en_core_web_lg")
text = "Donald Trump claimed in 2020 that the election was stolen from him."
doc = nlp(text)
# Named Entity Recognition (NER)
for ent in doc.ents:
print(f"{ent.text:25s} {ent.label_:10s} {spacy.explain(ent.label_)}")
# Donald Trump              PERSON     People, including fictional
# 2020                      DATE       Absolute or relative dates
# (pronouns such as "him" are not tagged as entities)
# Dependency parsing
for token in doc:
print(f"{token.text:15s} {token.dep_:12s} {token.head.text}")
# claimed ROOT claimed
# election nsubjpass stolen
# Noun chunks (useful for claim extraction)
for chunk in doc.noun_chunks:
print(chunk.text, "->", chunk.root.dep_)
Semantic Similarity
# Compare sentences for similarity (requires large model with vectors)
doc1 = nlp("COVID-19 vaccines are safe and effective.")
doc2 = nlp("The coronavirus shot has been proven to work without danger.")
doc3 = nlp("Climate change is an existential threat.")
print(doc1.similarity(doc2)) # ~0.85 (high similarity — paraphrase)
print(doc1.similarity(doc3)) # ~0.55 (lower similarity — different topic)
Batch Processing
# Efficient processing of large datasets with pipe()
texts = df["statement"].tolist()
processed = []
for doc in nlp.pipe(texts, batch_size=64, disable=["parser"]):
entities = [(ent.text, ent.label_) for ent in doc.ents]
processed.append(entities)
df["entities"] = processed
C.7 Transformers (HuggingFace): Large Language Models
The transformers library provides access to thousands of pre-trained transformer models (BERT, RoBERTa, GPT, T5, etc.) for tasks like text classification, generation, and question answering.
Installation
pip install transformers torch datasets accelerate
Loading Models and the Pipeline API
from transformers import pipeline
# Zero-shot classification — useful for labeling without training data
classifier = pipeline("zero-shot-classification",
model="facebook/bart-large-mnli")
headline = "Scientists discover that drinking bleach cures COVID-19."
candidate_labels = ["misinformation", "satire", "legitimate news"]
result = classifier(headline, candidate_labels)
print(result["labels"][0], result["scores"][0])
# misinformation, score ≈ 0.9 (exact value varies by model version)
# Sentiment analysis (pin a model explicitly to avoid the default-model warning)
sentiment = pipeline("sentiment-analysis",
                     model="distilbert-base-uncased-finetuned-sst-2-english")
texts = ["This vaccine is totally safe!", "I can't believe they lied again."]
for text, res in zip(texts, sentiment(texts)):
print(f"{text[:40]} -> {res['label']} ({res['score']:.3f})")
Tokenization and Feature Extraction
from transformers import AutoTokenizer, AutoModel
import torch
model_name = "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
texts = ["The earth is flat.", "Scientists confirm vaccine safety."]
encoded = tokenizer(texts, padding=True, truncation=True,
max_length=128, return_tensors="pt")
with torch.no_grad():
outputs = model(**encoded)
# First-token embeddings (RoBERTa's <s>, analogous to BERT's [CLS]) as sentence representations
embeddings = outputs.last_hidden_state[:, 0, :] # shape: (2, 768)
Fine-Tuning Basics
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import Dataset
# Prepare dataset (train_df/test_df: DataFrames with "text" and "label" columns)
train_data = Dataset.from_pandas(train_df[["text", "label"]])
test_data = Dataset.from_pandas(test_df[["text", "label"]])
# The tokenizer must match the checkpoint being fine-tuned
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
def tokenize_function(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)
train_tok = train_data.map(tokenize_function, batched=True)
test_tok = test_data.map(tokenize_function, batched=True)
# Load pre-trained model for classification
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)
# Training configuration
training_args = TrainingArguments(
    output_dir="./results", num_train_epochs=3, per_device_train_batch_size=16,
    evaluation_strategy="epoch", save_strategy="epoch", load_best_model_at_end=True)
# Passing the tokenizer gives the Trainer a padding data collator
trainer = Trainer(model=model, args=training_args, tokenizer=tokenizer,
                  train_dataset=train_tok, eval_dataset=test_tok)
trainer.train()
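By default the Trainer reports only loss during evaluation; supplying a compute_metrics function adds task metrics at each evaluation step. A sketch for the binary setup above (the Trainer calls the function with an EvalPrediction exposing .predictions, the raw logits, and .label_ids):

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics(eval_pred):
    # eval_pred.predictions: logits of shape (n_examples, n_labels)
    logits, labels = eval_pred.predictions, eval_pred.label_ids
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": accuracy_score(labels, preds),
            "f1_macro": f1_score(labels, preds, average="macro")}

# Passed to the Trainer as: Trainer(..., compute_metrics=compute_metrics)
```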
C.8 NetworkX: Graph Analysis
NetworkX enables creation, analysis, and visualization of complex networks — essential for studying information spread, platform ecosystems, and coordinated inauthentic behavior.
Installation
pip install networkx
Graph Creation
import networkx as nx
import matplotlib.pyplot as plt
# Directed graph (retweet network)
G = nx.DiGraph()
# Add edges from retweet data
retweets = [("user_A", "user_B"), ("user_C", "user_B"),
("user_B", "user_D"), ("user_A", "user_D")]
G.add_edges_from(retweets)
# Node attributes
G.nodes["user_A"]["is_bot"] = False
G.nodes["user_B"]["followers"] = 5000
print(f"Nodes: {G.number_of_nodes()}, Edges: {G.number_of_edges()}")
print(f"Is DAG: {nx.is_directed_acyclic_graph(G)}")
Centrality Measures
# Degree centrality (normalized by max possible)
degree_centrality = nx.degree_centrality(G)
# Betweenness centrality (fraction of shortest paths through a node)
betweenness = nx.betweenness_centrality(G, normalized=True)
# PageRank (eigenvector-based importance, used by Google)
pagerank = nx.pagerank(G, alpha=0.85)
# Summary table
centrality_df = pd.DataFrame({
"node": list(G.nodes()),
"degree": [degree_centrality[n] for n in G.nodes()],
"betweenness": [betweenness[n] for n in G.nodes()],
"pagerank": [pagerank[n] for n in G.nodes()]
}).sort_values("pagerank", ascending=False)
print(centrality_df.head(10))
Community Detection
# Convert to undirected for community detection
G_undirected = G.to_undirected()
# Louvain algorithm (via python-louvain package)
import community as community_louvain
partition = community_louvain.best_partition(G_undirected)
modularity = community_louvain.modularity(partition, G_undirected)
print(f"Modularity Q = {modularity:.3f}")
# Assign communities to nodes
for node, comm in partition.items():
G.nodes[node]["community"] = comm
# Number of communities
n_communities = len(set(partition.values()))
print(f"Number of communities: {n_communities}")
Visualization
# Spring layout (force-directed)
pos = nx.spring_layout(G, k=2, seed=42)
node_colors = [partition.get(n, 0) for n in G.nodes()]
fig, ax = plt.subplots(figsize=(12, 10))
nx.draw_networkx(G, pos=pos, node_color=node_colors,
cmap=plt.cm.Set3, node_size=100, arrows=True,
with_labels=False, edge_color="gray", ax=ax)
ax.set_title(f"Retweet Network (Q={modularity:.3f})")
plt.axis("off")
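For information-spread questions, a directed retweet network directly answers "how far can content travel from a given account": the reachable set downstream of a seed node. A minimal sketch on the toy graph from this section (rebuilt here so the snippet is self-contained):

```python
import networkx as nx

G = nx.DiGraph()
G.add_edges_from([("user_A", "user_B"), ("user_C", "user_B"),
                  ("user_B", "user_D"), ("user_A", "user_D")])

# Reach: every account downstream of the seed in the retweet graph
reach = nx.descendants(G, "user_A")
print(reach)  # members: user_B, user_D (print order may vary)

# Hop distance from the seed to each reachable account
hops = nx.shortest_path_length(G, source="user_A")
print(hops)  # user_A: 0, user_B: 1, user_D: 1
```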
C.9 scipy.stats: Statistical Tests
SciPy provides a comprehensive set of statistical tests appropriate for the experimental designs common in misinformation research.
Installation
pip install scipy
Common Statistical Tests
from scipy import stats
# --- Independent samples t-test ---
# Does an inoculation intervention reduce belief in conspiracy theories?
control_group = [65, 70, 55, 80, 72, 68, 75, 60, 71, 66]
treatment_group = [50, 45, 55, 48, 52, 44, 58, 47, 51, 53]
t_stat, p_value = stats.ttest_ind(control_group, treatment_group,
equal_var=False) # Welch's t-test
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
# Effect size (Cohen's d)
d = (np.mean(control_group) - np.mean(treatment_group)) / \
np.sqrt((np.var(control_group, ddof=1) + np.var(treatment_group, ddof=1)) / 2)
print(f"Cohen's d = {d:.3f}")
# --- Chi-square test of independence ---
# Is misinformation sharing independent of education level?
contingency_table = np.array([[120, 80], [60, 140], [30, 170]])
chi2, p, dof, expected = stats.chi2_contingency(contingency_table)
print(f"χ²({dof}) = {chi2:.3f}, p = {p:.4f}")
# Cramér's V (effect size for chi-square)
n = contingency_table.sum()
cramers_v = np.sqrt(chi2 / (n * (min(contingency_table.shape) - 1)))
print(f"Cramér's V = {cramers_v:.3f}")
# --- Pearson and Spearman correlation ---
x = [3, 5, 2, 8, 6, 4, 9, 1, 7, 5] # media literacy score
y = [70, 55, 80, 30, 45, 60, 25, 85, 35, 50] # misinformation belief
r_pearson, p_pearson = stats.pearsonr(x, y)
r_spearman, p_spearman = stats.spearmanr(x, y)
print(f"Pearson r = {r_pearson:.3f} (p = {p_pearson:.4f})")
print(f"Spearman ρ = {r_spearman:.3f} (p = {p_spearman:.4f})")
# --- One-sample t-test ---
# Is average belief different from the neutral midpoint (50)?
belief_scores_sample = [55, 62, 48, 71, 59, 44, 67, 53, 60, 58]
t, p = stats.ttest_1samp(belief_scores_sample, popmean=50)
print(f"One-sample t = {t:.3f}, p = {p:.4f}")
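Designs with more than two conditions (e.g., control vs. two intervention types) call for a one-way ANOVA rather than repeated t-tests, which inflate the Type I error rate. A sketch with hypothetical belief scores under three conditions:

```python
from scipy import stats

# Hypothetical belief scores under three experimental conditions
control = [65, 70, 55, 80, 72, 68, 75, 60]
inoculation = [50, 45, 55, 48, 52, 44, 58, 47]
debunking = [58, 62, 54, 60, 57, 63, 55, 59]

# One-way ANOVA: do the three group means differ?
f_stat, p_value = stats.f_oneway(control, inoculation, debunking)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")

# Kruskal-Wallis is the rank-based alternative when normality is doubtful
h_stat, p_kw = stats.kruskal(control, inoculation, debunking)
print(f"H = {h_stat:.3f}, p = {p_kw:.4f}")
```

A significant omnibus F would typically be followed by pairwise comparisons with a multiple-testing correction.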
C.10 Environment Setup
Virtual Environments
Always work in a virtual environment to avoid dependency conflicts across projects.
# Create a virtual environment named 'misinfo-env'
python -m venv misinfo-env
# Activate (Windows)
misinfo-env\Scripts\activate
# Activate (macOS/Linux)
source misinfo-env/bin/activate
# Deactivate
deactivate
requirements.txt
Create a requirements.txt to record all dependencies with pinned versions:
numpy==1.26.4
pandas==2.2.1
matplotlib==3.8.4
seaborn==0.13.2
scikit-learn==1.4.2
nltk==3.8.1
spacy==3.7.4
transformers==4.40.0
torch==2.3.0
datasets==2.19.1
networkx==3.3
scipy==1.13.0
python-louvain==0.16
accelerate==0.29.3
Install all dependencies at once:
pip install -r requirements.txt
Save current environment to requirements.txt:
pip freeze > requirements.txt
Troubleshooting Common Issues
torch installation fails on Windows:
# Install CPU-only version first to verify setup
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
spaCy model not found:
python -m spacy download en_core_web_sm
# If behind a firewall:
pip install https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.7.1/en_core_web_sm-3.7.1.tar.gz
pandas read_csv encoding error:
# Try specifying encoding explicitly
df = pd.read_csv("file.csv", encoding="utf-8-sig") # handles BOM
df = pd.read_csv("file.csv", encoding="latin-1") # for older Windows exports
Memory error with large transformer models:
# Load in 8-bit precision to reduce memory footprint
# (requires the bitsandbytes package and a CUDA-capable GPU)
model = AutoModelForSequenceClassification.from_pretrained(
"bert-large-uncased", load_in_8bit=True, device_map="auto")
NLTK resource not found:
import nltk
nltk.download("all") # downloads everything (~3.5 GB)
# Or selectively:
for resource in ["punkt", "stopwords", "wordnet", "averaged_perceptron_tagger"]:
nltk.download(resource)
Jupyter Notebook Setup
pip install jupyter ipykernel
python -m ipykernel install --user --name=misinfo-env --display-name "Misinfo Research"
jupyter notebook
All code examples in this textbook were developed and tested in Jupyter notebooks. The companion GitHub repository (see the textbook's online resources) contains runnable notebooks for every chapter with computational exercises.