Appendix A: Mathematical Foundations
This appendix provides the mathematical and statistical background needed to engage with the quantitative material throughout this textbook. No advanced mathematics is assumed; every concept is introduced from first principles, illustrated with worked examples, and connected to misinformation research wherever possible. Readers comfortable with algebra will find the material accessible; those who have studied probability or statistics will find it a useful refresher.
A.1 Probability Fundamentals
Sample Spaces and Events
A sample space (denoted S or Ω) is the set of all possible outcomes of some process. When we roll a fair six-sided die, the sample space is S = {1, 2, 3, 4, 5, 6}. When we observe whether a social media post is flagged as misinformation, S = {flagged, not flagged}.
An event is any subset of the sample space — a collection of outcomes we care about. "Rolling an even number" is the event A = {2, 4, 6}. "A post is flagged" is an event. Events can be simple (one outcome) or compound (many outcomes).
The Axioms of Probability
All of probability theory rests on three axioms, stated by Andrei Kolmogorov in 1933:
- Non-negativity: For any event A, P(A) ≥ 0.
- Normalization: P(S) = 1 (something must happen).
- Additivity: If events A and B are mutually exclusive (cannot both occur), then P(A or B) = P(A) + P(B).
From these three axioms, every other rule in probability follows by logical deduction.
Basic Probability Rules
Complement Rule: The probability that A does NOT occur is:
P(not A) = 1 − P(A)
If 30% of headlines on a platform are misleading, then 70% are not misleading.
Addition Rule (General): For any two events A and B:
P(A or B) = P(A) + P(B) − P(A and B)
We subtract the intersection to avoid double-counting. If P(article is sensationalized) = 0.40 and P(article is factually wrong) = 0.25, and P(both) = 0.15, then P(sensationalized or wrong) = 0.40 + 0.25 − 0.15 = 0.50.
Multiplication Rule (General): For any two events A and B:
P(A and B) = P(A) × P(B|A)
where P(B|A) is the conditional probability of B given A (discussed below).
Independence: A and B are independent if knowing A occurred tells you nothing about B. Formally, A and B are independent if and only if P(A and B) = P(A) × P(B). Many cognitive biases arise from people treating dependent events as independent (e.g., assuming that a source being wrong once is independent of it being wrong again).
Conditional Probability
The conditional probability of B given A is the probability of B occurring, updated by the knowledge that A has occurred:
P(B|A) = P(A and B) / P(A)
This is perhaps the most important formula in applied statistics. It captures how evidence should update our beliefs.
Example: Suppose that 10% of news articles contain false claims, and 60% of false articles use emotional language, while only 20% of true articles use emotional language. Given that you encounter an emotionally written article, what is the probability it is false?
Let F = "article is false," E = "article uses emotional language."
- P(F) = 0.10, so P(not F) = 0.90
- P(E|F) = 0.60
- P(E|not F) = 0.20
- P(E) = P(E|F)×P(F) + P(E|not F)×P(not F) = 0.06 + 0.18 = 0.24
- P(F|E) = P(E|F)×P(F) / P(E) = 0.06 / 0.24 = 0.25
Even given emotional language, the article is only 25% likely to be false — still much higher than the base rate of 10%, but a reminder that emotional language is an imperfect signal.
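The arithmetic above can be checked in a few lines. This is a minimal sketch using the probabilities from the example; the variable names are ours, not standard notation:

```python
# Verify the worked example: P(false | emotional language).
# All numbers come from the example in the text.
p_false = 0.10             # base rate: P(F)
p_true = 1 - p_false       # P(not F)
p_emot_given_false = 0.60  # P(E|F)
p_emot_given_true = 0.20   # P(E|not F)

# Law of total probability: P(E) = P(E|F)P(F) + P(E|not F)P(not F)
p_emot = p_emot_given_false * p_false + p_emot_given_true * p_true

# Conditional probability: P(F|E) = P(E|F)P(F) / P(E)
p_false_given_emot = p_emot_given_false * p_false / p_emot
print(round(p_false_given_emot, 4))  # 0.25
```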
A.2 Bayes' Theorem
Derivation
Bayes' Theorem is derived directly from the definition of conditional probability. We know:
P(A|B) = P(A and B) / P(B)
P(B|A) = P(A and B) / P(A)
Solving both for P(A and B) and setting equal:
P(A|B) × P(B) = P(B|A) × P(A)
Dividing both sides by P(B):
P(A|B) = [P(B|A) × P(A)] / P(B)
This is Bayes' Theorem. In words: the probability of A given B equals the likelihood of B given A, times the prior probability of A, divided by the total probability of B.
The full form uses the law of total probability to expand P(B):
P(A|B) = [P(B|A) × P(A)] / [P(B|A)×P(A) + P(B|not A)×P(not A)]
Worked Example 1: Disease Testing
A rare disease affects 1 in 1,000 people. A test for this disease is 99% sensitive (correctly identifies 99% of sick people) and 99% specific (correctly identifies 99% of healthy people). You test positive. What is the probability you actually have the disease?
- P(disease) = 0.001, P(no disease) = 0.999
- P(positive|disease) = 0.99 (sensitivity)
- P(positive|no disease) = 0.01 (1 − specificity)
- P(positive) = (0.99)(0.001) + (0.01)(0.999) = 0.00099 + 0.00999 = 0.01098
- P(disease|positive) = (0.99 × 0.001) / 0.01098 ≈ 0.0901 = 9.0%
Despite a 99% accurate test, there is only a 9% chance you have the disease upon a positive test. This counterintuitive result — caused by the low base rate — is called the base rate fallacy. It is directly analogous to misinformation detection: even a very good classifier will produce many false positives if true misinformation is rare.
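The base-rate calculation can be written as a small reusable function. This is an illustrative sketch; `posterior` is our own helper name, not a standard API:

```python
# Bayes' Theorem for a binary hypothesis and a positive test result.
def posterior(prior, p_pos_given_true, p_pos_given_false):
    """Return P(disease | positive) from the prior and the two likelihoods."""
    # Total probability of a positive result (law of total probability)
    p_pos = p_pos_given_true * prior + p_pos_given_false * (1 - prior)
    return p_pos_given_true * prior / p_pos

# Numbers from the disease-testing example in the text
p = posterior(prior=0.001, p_pos_given_true=0.99, p_pos_given_false=0.01)
print(round(p, 3))  # roughly 0.09: a 9% chance despite the "99% accurate" test
```

Raising the prior changes the answer dramatically: with a 50% prior, the same test yields a posterior of 0.99, which is why base rates matter so much.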
Worked Example 2: Updating News Credibility
Suppose you start with a prior belief that a new online outlet is credible with probability 0.50. Over time, you observe three things: the outlet correctly predicts an election outcome (P(correct prediction | credible) = 0.70; P(correct prediction | not credible) = 0.40), uses primary sources (P(sources | credible) = 0.80; P(sources | not credible) = 0.30), and issues a correction (P(correction | credible) = 0.60; P(correction | not credible) = 0.10).
Applying Bayes sequentially (treating each observation as independent for simplicity):
After the correct prediction: P(credible) = (0.70 × 0.50) / [(0.70)(0.50) + (0.40)(0.50)] = 0.35 / 0.55 ≈ 0.636
After primary sources (carrying the unrounded 0.6364): P(credible) = (0.80 × 0.6364) / [(0.80)(0.6364) + (0.30)(0.3636)] = 0.5091 / 0.6182 ≈ 0.824
After the correction: P(credible) = (0.60 × 0.824) / [(0.60)(0.824) + (0.10)(0.176)] = 0.4944 / 0.5120 ≈ 0.966
Each piece of evidence substantially increases the credibility estimate. This is the formal backbone of lateral reading — accumulating evidence from multiple sources about a source itself.
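The sequential updates above can be sketched as a loop, treating each observation as independent exactly as the text does. The `update` helper is illustrative, not a standard library function:

```python
# Sequential Bayesian updating of a source-credibility estimate.
def update(prior, p_obs_given_h, p_obs_given_not_h):
    """One Bayes update: posterior that the hypothesis (credible) holds."""
    evidence = p_obs_given_h * prior + p_obs_given_not_h * (1 - prior)
    return p_obs_given_h * prior / evidence

belief = 0.50  # prior: 50% credible
observations = [(0.70, 0.40),  # correct election prediction
                (0.80, 0.30),  # uses primary sources
                (0.60, 0.10)]  # issues a correction
for likelihood_h, likelihood_not_h in observations:
    belief = update(belief, likelihood_h, likelihood_not_h)
print(round(belief, 3))  # approximately 0.966
```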
A.3 Basic Descriptive Statistics
Measures of Central Tendency
Mean (arithmetic average): The sum of all values divided by the number of values.
x̄ = (x₁ + x₂ + ... + xₙ) / n = (Σxᵢ) / n
The mean is sensitive to outliers. A single viral misinformation post with 10 million shares will dramatically inflate the mean shares per post.
Median: The middle value when data are ordered from smallest to largest. If n is even, it is the average of the two middle values. The median is robust to outliers and is often more informative for skewed distributions (e.g., social media engagement, which follows power-law distributions).
Mode: The most frequently occurring value. For categorical data (e.g., the most common type of misinformation), the mode is often the only meaningful average.
Measures of Spread
Variance: The average squared deviation from the mean.
Population variance: σ² = Σ(xᵢ − μ)² / N
Sample variance: s² = Σ(xᵢ − x̄)² / (n − 1)
The sample formula divides by (n − 1) rather than n to produce an unbiased estimate (Bessel's correction).
Standard Deviation: The square root of variance, restoring the original units.
σ = √σ², s = √s²
If a fact-checking organization reviews an average of 50 claims per week with a standard deviation of 12, most weeks fall between 38 and 62 claims.
Z-Score: The number of standard deviations a value lies from the mean.
z = (x − μ) / σ
A z-score of +2.0 means the value is 2 standard deviations above the mean. Z-scores allow comparison across different scales (e.g., comparing virality metrics across platforms with different user bases).
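The measures above can all be computed from scratch in a few lines. The weekly claim counts below are made up for illustration; only the formulas come from the text:

```python
# From-scratch descriptive statistics (mean, median, sample variance, z-score).
def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    s, n = sorted(xs), len(xs)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2

def sample_variance(xs):
    # Bessel's correction: divide by n - 1, not n
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def z_score(x, mu, sigma):
    return (x - mu) / sigma

claims = [38, 45, 50, 52, 55, 60]   # hypothetical weekly claim counts
print(mean(claims))                  # 50.0
print(median(claims))                # 51.0
print(round(z_score(74, 50, 12), 1)) # 2.0: two standard deviations above the mean
```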
A.4 Probability Distributions
The Normal Distribution
The normal (Gaussian) distribution is bell-shaped, symmetric, and defined by two parameters: mean μ and standard deviation σ. Its probability density function is:
f(x) = (1 / (σ√(2π))) × exp(−(x−μ)² / (2σ²))
Key properties (the empirical rule, or 68-95-99.7 rule):
- 68% of data falls within 1σ of the mean
- 95% within 2σ
- 99.7% within 3σ
Misinformation application: Belief in conspiracy theories across a population is often approximately normally distributed. Most people hold moderate skepticism; very few are at the extremes of complete acceptance or complete rejection.
The Binomial Distribution
The binomial distribution models the number of successes k in n independent trials, each with probability p of success:
P(X = k) = C(n,k) × pᵏ × (1−p)^(n−k)
where C(n,k) = n! / [k!(n−k)!] is the binomial coefficient.
Mean: μ = np
Standard deviation: σ = √(np(1−p))
Misinformation application: If a fact-checker has a 70% chance of correctly identifying any given false claim, the binomial distribution gives the probability of correctly flagging exactly 7 out of 10 claims: P(X=7) = C(10,7) × 0.7⁷ × 0.3³ ≈ 0.267.
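The binomial pmf can be checked directly; this sketch uses the standard-library `math.comb` for the binomial coefficient:

```python
# Binomial pmf: P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)
from math import comb

def binomial_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# P(exactly 7 of 10 false claims correctly flagged) with a 70% hit rate
print(round(binomial_pmf(7, 10, 0.7), 3))  # approximately 0.267
```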
The Poisson Distribution
The Poisson distribution models the number of rare events occurring in a fixed time interval or space, when events occur independently at a constant average rate λ:
P(X = k) = (λᵏ × e^(−λ)) / k!
Mean and Variance: Both equal λ.
Misinformation application: Coordinated inauthentic behavior (bot attacks) often follows Poisson-like patterns at the account level — a bot may post at a rate of λ = 50 tweets/hour. Detecting Poisson anomalies (burst behavior) is a key method in bot detection.
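The Poisson pmf follows the same pattern; this sketch uses a small rate (λ = 5 posts/hour, a made-up figure) to keep the numbers readable:

```python
# Poisson pmf: P(X = k) = λ^k * e^(−λ) / k!
from math import exp, factorial

def poisson_pmf(k, lam):
    return lam**k * exp(-lam) / factorial(k)

# Probability that a λ = 5 posts/hour account produces exactly 5 posts in an hour
print(round(poisson_pmf(5, 5), 3))  # approximately 0.175
```

A burst far above the rate, e.g. 20 posts in one hour at λ = 5, has vanishingly small Poisson probability, which is what makes burst detection informative.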
A.5 Correlation and Regression
Pearson Correlation Coefficient
The Pearson correlation coefficient r measures the strength and direction of the linear relationship between two continuous variables X and Y:
r = Σ[(xᵢ − x̄)(yᵢ − ȳ)] / √[Σ(xᵢ − x̄)² × Σ(yᵢ − ȳ)²]
r ranges from −1 (perfect negative linear relationship) to +1 (perfect positive linear relationship). r = 0 indicates no linear relationship.
Interpretation benchmarks (Cohen, 1988):
- |r| < 0.10: negligible
- 0.10 ≤ |r| < 0.30: small
- 0.30 ≤ |r| < 0.50: medium
- |r| ≥ 0.50: large
Misinformation application: Researchers have found a positive correlation (r ≈ 0.35) between social media use and exposure to misinformation, though the causal direction remains contested.
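The formula can be implemented directly. This is a sketch on made-up data (hours of social media use vs. misinformation items recalled); `pearson_r` is our own helper:

```python
# Pearson correlation coefficient, computed from the definition.
def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

hours = [1, 2, 3, 4, 5, 6]   # hypothetical hours of social media use
items = [0, 2, 1, 3, 4, 4]   # hypothetical misinformation items recalled
print(round(pearson_r(hours, items), 2))  # a strong positive correlation
```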
Correlation Is Not Causation
This point deserves emphasis. A correlation between X and Y can arise because:
1. X causes Y
2. Y causes X
3. A third variable Z causes both X and Y (confounding)
4. Coincidence (spurious correlation)
A famous spurious correlation: US per capita cheese consumption correlates strongly with deaths by bedsheet tangling (r ≈ 0.95 over certain years). The variables have no causal relationship — both simply trend over time.
In misinformation research, the correlation between low media literacy scores and susceptibility to misinformation does not by itself tell us whether improving media literacy would reduce susceptibility. Randomized controlled experiments are needed to establish causation.
Simple Linear Regression
Simple linear regression models Y as a linear function of X:
Y = β₀ + β₁X + ε
where β₀ is the intercept, β₁ is the slope, and ε is the error term. The slope β₁ = r × (σᵧ / σₓ) and R² = r² gives the proportion of variance in Y explained by X.
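A least-squares fit can be sketched directly from these definitions, writing the slope as covariance over variance (equivalent to β₁ = r·(σᵧ/σₓ)). The data are made up and exactly linear so the answer is easy to check:

```python
# Simple linear regression by ordinary least squares.
def fit_line(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Slope: sum of cross-deviations over sum of squared x-deviations
    beta1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    beta0 = my - beta1 * mx   # line passes through the point of means
    return beta0, beta1

b0, b1 = fit_line([0, 1, 2, 3], [1, 3, 5, 7])  # data lie exactly on y = 1 + 2x
print(b0, b1)  # 1.0 2.0
```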
A.6 Statistical Significance and Hypothesis Testing
Conceptual Framework
Hypothesis testing asks: "Could my observed data have arisen by chance if the null hypothesis were true?" The steps are:
- State the null hypothesis H₀ (e.g., "the intervention has no effect on belief in misinformation")
- State the alternative hypothesis H₁ (e.g., "the intervention reduces belief in misinformation")
- Choose a significance level α (commonly 0.05)
- Compute a test statistic from the data
- Calculate a p-value: the probability of observing a test statistic at least this extreme if H₀ were true
- Decision: if p < α, reject H₀; if p ≥ α, fail to reject H₀
P-Values
The p-value is the probability of obtaining results at least as extreme as those observed, assuming H₀ is true. It is NOT the probability that H₀ is true, nor the probability that the result is a fluke — a common and consequential misinterpretation.
If p = 0.03, we say the result is statistically significant at α = 0.05 — but this says nothing about practical importance.
Confidence Intervals
A 95% confidence interval (CI) for a parameter θ is an interval computed so that 95% of such intervals, across repeated sampling, would contain the true value of θ. A single CI either contains the true value or it does not; the "95%" refers to the procedure, not any specific interval.
A CI of [0.02, 0.04] for an effect size is more informative than p < 0.05 because it conveys both direction and magnitude.
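As a sketch, a 95% CI for a proportion can be computed with the normal approximation (p̂ ± 1.96·SE). The 120-of-400 figures below are made up, and this approximation is only reasonable when n·p̂ and n·(1−p̂) are both large:

```python
# Normal-approximation 95% confidence interval for a proportion.
def proportion_ci(successes, n, z=1.96):
    p_hat = successes / n
    se = (p_hat * (1 - p_hat) / n) ** 0.5   # standard error of the proportion
    return p_hat - z * se, p_hat + z * se

# e.g., 120 of 400 sampled posts flagged as misleading
low, high = proportion_ci(120, 400)
print(round(low, 3), round(high, 3))  # roughly 0.255 to 0.345
```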
A.7 Effect Sizes
Statistical significance says whether an effect is distinguishable from zero; effect size says how large the effect is.
Cohen's d
Cohen's d expresses the difference between two group means in standard deviation units:
d = (μ₁ − μ₂) / σ_pooled
where σ_pooled = √[(σ₁² + σ₂²)/2] (this simple form of pooling assumes equal group sizes).
Benchmarks (Cohen, 1988):
- d = 0.20: small
- d = 0.50: medium
- d = 0.80: large
Example: An inoculation intervention reduces belief in a conspiracy theory from 65% to 55% agreement. If the standard deviation is 20 percentage points, d = (65−55)/20 = 0.50, a medium effect.
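A minimal sketch of the calculation, using the pooled-SD formula above (which assumes equal group sizes) and the numbers from the inoculation example:

```python
# Cohen's d: standardized mean difference between two groups.
def cohens_d(mean1, mean2, sd1, sd2):
    pooled_sd = ((sd1**2 + sd2**2) / 2) ** 0.5  # assumes equal group sizes
    return (mean1 - mean2) / pooled_sd

d = cohens_d(65, 55, 20, 20)
print(d)  # 0.5: a medium effect by Cohen's benchmarks
```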
Odds Ratio
The odds ratio (OR) compares the odds of an outcome between two groups:
OR = [P(outcome|exposed) / P(no outcome|exposed)] / [P(outcome|unexposed) / P(no outcome|unexposed)]
OR = 1: no difference; OR > 1: higher odds in exposed group; OR < 1: lower odds.
Example: If 40% of people who saw a fact-check label believed the false headline (vs. 60% who did not see the label), OR = (0.40/0.60)/(0.60/0.40) = 0.667/1.50 = 0.44. The fact-check label reduces the odds of belief by 56%.
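The same calculation as a sketch, with the belief rates from the fact-check label example:

```python
# Odds ratio: ratio of the odds of an outcome in two groups.
def odds(p):
    return p / (1 - p)

def odds_ratio(p_exposed, p_unexposed):
    return odds(p_exposed) / odds(p_unexposed)

# 40% belief with the label vs. 60% without it
or_label = odds_ratio(0.40, 0.60)
print(round(or_label, 2))  # approximately 0.44
```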
Relative Risk and Absolute Risk Reduction
Relative Risk (RR) = P(outcome|exposed) / P(outcome|unexposed).
Absolute Risk Reduction (ARR) = P(outcome|unexposed) − P(outcome|exposed).
Both matter. A treatment that reduces risk from 2% to 1% has RR = 0.50 (50% relative reduction) but ARR = 1% (small absolute benefit). Misinformation often exploits relative risk framing to exaggerate effects.
A.8 Information Theory Basics
Information theory, developed by Claude Shannon in 1948, provides a mathematical framework for quantifying information.
Entropy
Shannon entropy H(X) measures the average uncertainty (or information content) of a random variable X with possible values {x₁, x₂, ..., xₙ} and probabilities {p₁, p₂, ..., pₙ}:
H(X) = −Σ pᵢ × log₂(pᵢ)
Entropy is measured in bits. Since 0 < pᵢ ≤ 1, log₂(pᵢ) ≤ 0, so each term −pᵢ × log₂(pᵢ) is non-negative.
Intuition: A fair coin has H = −[(0.5 × log₂(0.5)) + (0.5 × log₂(0.5))] = −[0.5×(−1) + 0.5×(−1)] = 1 bit. A two-headed coin has H = 0 bits (no uncertainty). A six-sided die has H = log₂(6) ≈ 2.58 bits.
Information diversity application: Information ecosystems with low entropy have highly concentrated news diets (everyone reads the same sources). High entropy indicates diverse media consumption. Researchers use entropy to quantify filter bubble effects — individuals in echo chambers have low information entropy (their news diet is predictable and homogeneous).
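A from-scratch sketch that verifies the coin and die examples, then illustrates the news-diet idea with made-up source shares:

```python
# Shannon entropy in bits: H = −Σ p_i * log2(p_i).
from math import log2

def entropy(probs):
    # Terms with p = 0 contribute nothing (the limit of p*log p is 0)
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))           # 1.0 bit: a fair coin
print(round(entropy([1/6] * 6), 2))  # about 2.58 bits: a fair die

# Hypothetical news diets: one dominant source vs. an even mix of four
concentrated = entropy([0.85, 0.05, 0.05, 0.05])
diverse = entropy([0.25, 0.25, 0.25, 0.25])
print(concentrated < diverse)  # True: the echo-chamber diet has lower entropy
```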
Mutual Information
Mutual information I(X;Y) measures how much knowing one variable reduces uncertainty about another:
I(X;Y) = H(X) + H(Y) − H(X,Y)
where H(X,Y) is the joint entropy. I(X;Y) = 0 if X and Y are independent; I(X;Y) > 0 if they share information.
Application: Mutual information can quantify how much the political slant of news content (X) predicts the political affiliation of readers (Y). High mutual information is strong evidence of partisan media selection effects.
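A sketch of I(X;Y) = H(X) + H(Y) − H(X,Y) on a made-up joint distribution over content slant and reader affiliation:

```python
# Mutual information from a 2x2 joint distribution.
from math import log2

def entropy(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

# Hypothetical joint probabilities P(slant, affiliation);
# rows: content slant (left, right), cols: reader affiliation (left, right)
joint = [[0.4, 0.1],
         [0.1, 0.4]]
px = [sum(row) for row in joint]         # marginal over slant
py = [sum(col) for col in zip(*joint)]   # marginal over affiliation
flat = [p for row in joint for p in row] # joint, flattened for H(X,Y)
mi = entropy(px) + entropy(py) - entropy(flat)
print(round(mi, 3))  # positive: slant and affiliation share information
```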
Kullback-Leibler Divergence
The KL divergence D_KL(P||Q) measures how much probability distribution P differs from a reference distribution Q:
D_KL(P||Q) = Σ P(x) × log[P(x)/Q(x)]
It is often used to measure how much a person's or community's information diet differs from some baseline (e.g., a fact-based news standard). Higher KL divergence from a reliable-source distribution signals greater exposure to off-mainstream content.
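A sketch of the KL divergence between a community's source-share distribution P and a baseline Q; both distributions below are made up:

```python
# KL divergence in bits: D_KL(P||Q) = Σ P(x) * log2(P(x)/Q(x)).
from math import log2

def kl_divergence(p, q):
    # Assumes q_i > 0 wherever p_i > 0; terms with p_i = 0 contribute nothing
    return sum(pi * log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

baseline = [0.25, 0.25, 0.25, 0.25]      # e.g., an even mix of four source types
echo_chamber = [0.70, 0.10, 0.10, 0.10]  # diet dominated by one source type
print(round(kl_divergence(echo_chamber, baseline), 3))  # about 0.643 bits
print(kl_divergence(baseline, baseline))                 # 0.0: identical
```

Note that KL divergence is asymmetric: D_KL(P||Q) generally differs from D_KL(Q||P), so the choice of reference distribution matters.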
A.9 Summary of Key Formulas
| Formula | Description |
|---|---|
| P(A and B) = P(A) × P(B|A) | Multiplication rule |
| P(A or B) = P(A) + P(B) − P(A and B) | Addition rule |
| P(B|A) = P(A and B) / P(A) | Conditional probability |
| P(A|B) = P(B|A)×P(A) / P(B) | Bayes' Theorem |
| x̄ = Σxᵢ / n | Sample mean |
| s² = Σ(xᵢ − x̄)² / (n−1) | Sample variance |
| z = (x − μ) / σ | Z-score |
| r = Σ[(xᵢ−x̄)(yᵢ−ȳ)] / √[Σ(xᵢ−x̄)²Σ(yᵢ−ȳ)²] | Pearson correlation |
| d = (μ₁−μ₂) / σ_pooled | Cohen's d |
| H(X) = −Σ pᵢ log₂(pᵢ) | Shannon entropy |
| I(X;Y) = H(X)+H(Y)−H(X,Y) | Mutual information |
A.10 Further Reading
For deeper study of probability and statistics, the following texts are recommended:
- Introductory: Freedman, Pisani, and Purves, Statistics (4th ed.) — conceptual and example-rich, minimal calculus required.
- Bayesian focus: Kruschke, Doing Bayesian Data Analysis — practical Bayesian statistics with R and Python.
- Information theory: Cover and Thomas, Elements of Information Theory — the standard graduate reference.
- For misinformation researchers: Salganik, Bit by Bit: Social Research in the Digital Age — freely available online, covers computational social science methods with excellent ethical discussion.
All concepts in this appendix are applied throughout the textbook. When you encounter probability arguments in Chapter 28, network entropy calculations in Chapter 23, or effect size reporting in Chapters 35 and 36, return here for the formal definitions.