Appendix A: Research Methods Primer — Evaluating AI Research Claims

This appendix equips you with the tools to read, evaluate, and question AI research. You do not need a statistics background. You need curiosity and a healthy dose of skepticism.


Why This Matters for AI Literacy

Every week, a new headline declares that AI has "solved" a problem, "outperformed" doctors, or "achieved" some milestone. Behind each headline is a study — and studies vary enormously in quality, rigor, and relevance. Being able to evaluate these claims is one of the most durable skills you can develop. Technologies change; the ability to ask the right questions about evidence does not.


A.1 Types of Studies You Will Encounter

Observational Studies

Researchers observe what happens without intervening. They collect data about AI systems as they are already being used and look for patterns. For example, an observational study might analyze how a content moderation system performs across different languages by examining its decisions on existing posts.

Strengths: Reflects real-world conditions; does not require controlling variables. Limitations: Cannot establish causation — only correlation. Confounding variables may explain the results.

Experimental Studies

Researchers deliberately change one variable while holding others constant. In AI research, this might mean testing a diagnostic AI on a controlled set of medical images with known ground truth. Randomized controlled trials (RCTs) are the gold standard: participants are randomly assigned to a treatment group (using the AI system) or a control group (using the existing method).

Strengths: Can establish causation when properly designed; controls for confounding variables. Limitations: Laboratory conditions may not reflect real-world deployment. Many AI questions cannot ethically be studied experimentally.

Benchmark Studies

A model is tested against a standardized dataset with known correct answers. Benchmarks like ImageNet (computer vision), GLUE/SuperGLUE (language understanding), and MMLU (general knowledge) allow researchers to compare models on the same tasks. Much of the "AI beats humans" reporting comes from benchmark results.

Strengths: Enables direct comparison between models; reproducible. Limitations: Performance on benchmarks does not guarantee real-world performance. Models can be "trained to the test." Benchmarks become outdated as models improve, and they may not measure what they claim to measure.

Meta-Analyses and Systematic Reviews

Researchers combine the results of multiple studies to reach a broader conclusion. A meta-analysis statistically pools results from many studies; a systematic review evaluates and synthesizes findings without necessarily pooling the numbers. These are generally the most reliable type of evidence because they account for variation across individual studies.

Strengths: More robust than any single study; can identify patterns across research contexts. Limitations: Only as good as the underlying studies. Publication bias (studies with positive results are more likely to be published) can skew findings.

Audit Studies

Researchers test an AI system for bias or failure by systematically probing it with controlled inputs. The Gender Shades study (Buolamwini & Gebru, discussed in Chapters 6 and 9) is a landmark example: the researchers tested commercial facial recognition systems across combinations of skin tone and gender.

Strengths: Directly measures real-world system behavior; can reveal disparities that internal testing misses. Limitations: Requires access to the system; may not capture all real-world conditions.


A.2 Sample Size and Statistical Significance — In Plain Language

Sample size is the number of observations in a study. In AI research, this could mean the number of images in a test set, the number of patients whose records were analyzed, or the number of times users interacted with a chatbot.

Why it matters: Small samples can produce misleading results. If you flip a coin ten times and get eight heads, that does not mean the coin is rigged — random variation can produce extreme results in small samples. The same logic applies to AI studies. A model that "outperforms" doctors on 50 images may be meaningless; a model that does so on 50,000 images across multiple hospitals is much more convincing.
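The coin-flip arithmetic is easy to check directly. The short Python sketch below computes the exact binomial probabilities (the numbers are illustrative, not drawn from any study):

```python
from math import comb

# Probability of getting 8 or more heads in 10 flips of a fair coin.
n, p = 10, 0.5
p_extreme = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(8, n + 1))
print(f"P(>= 8 heads in 10 flips) = {p_extreme:.3f}")  # about 0.055 -- not rare at all

# The same 80% head rate sustained over 1,000 flips would be astronomically unlikely.
n = 1000
p_extreme_big = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(800, n + 1))
print(f"P(>= 800 heads in 1000 flips) = {p_extreme_big:.2e}")
```

An extreme-looking result that arises by chance about one time in eighteen at n = 10 becomes effectively impossible at n = 1,000 — which is exactly why the 50,000-image study is more convincing than the 50-image one.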

Statistical significance measures how surprising a result would be if there were no real effect. The most common threshold is p < 0.05: if there were truly no difference, a result at least this extreme would occur by chance less than 5% of the time. But statistical significance is not the same as practical significance. A system that is statistically significantly better than doctors — by 0.1 percentage point, on one specific type of image, in laboratory conditions — may be scientifically interesting but practically irrelevant.

Effect size tells you how big the difference is, not just whether it exists. Always ask: "How much better?" not just "Is it better?"
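The gap between statistical and practical significance can be made concrete with a standard two-proportion z-test. In this sketch, all numbers are invented for illustration: the same tiny 0.1-percentage-point accuracy gap is tested at two sample sizes.

```python
from math import sqrt, erfc

def two_prop_z(p1, p2, n1, n2):
    """Two-sided p-value for a difference between two proportions (z-test)."""
    p_pool = (p1 * n1 + p2 * n2) / (n1 + n2)   # pooled proportion under the null
    se = sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return erfc(abs(z) / sqrt(2))               # standard normal, two-sided

# Hypothetical: model scores 92.1% accuracy vs. doctors' 92.0%.
# On 5,000 test images per group, the gap is indistinguishable from noise...
print(two_prop_z(0.921, 0.920, 5_000, 5_000))          # p well above 0.05
# ...but on 5,000,000 per group, the same tiny gap is "statistically significant."
print(two_prop_z(0.921, 0.920, 5_000_000, 5_000_000))  # p far below 0.05
```

The effect size (0.1 percentage point) never changed; only the sample size did. That is why "Is it significant?" is the wrong first question and "How much better?" is the right one.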


A.3 Correlation vs. Causation

This is the single most abused concept in science communication.

Correlation means two things tend to occur together. Causation means one thing actually produces the other. The classic example: ice cream sales and drowning deaths are correlated (both increase in summer), but ice cream does not cause drowning. A third variable — warm weather — drives both.

In AI research, this matters constantly:

  • "Countries that adopt AI grow their economies faster." Does AI adoption cause growth, or do already-growing economies adopt AI more readily?
  • "Students who use AI tutoring tools have higher test scores." Do the tools improve scores, or do more motivated students choose to use them?
  • "Hospitals with AI diagnostic tools have lower misdiagnosis rates." Is the AI helping, or do hospitals that can afford AI tools also have better-resourced staff, newer equipment, and healthier patient populations?

The rule: When you see an AI claim based on correlation, ask: "What else could explain this relationship?"
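The ice-cream pattern is easy to reproduce in code: simulate a hidden confounder (temperature) that drives two otherwise unrelated variables, and a strong correlation appears between them. This is a toy simulation with synthetic data, not a real dataset:

```python
import random

random.seed(0)

# A hidden confounder: daily temperature drives both ice cream sales and
# drownings. Neither outcome causes the other.
temps = [random.uniform(0, 35) for _ in range(365)]
ice_cream = [t * 10 + random.gauss(0, 20) for t in temps]
drownings = [t * 0.1 + random.gauss(0, 0.5) for t in temps]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Strongly positive correlation, despite zero causal connection.
print(pearson(ice_cream, drownings))
```

A naive analysis of `ice_cream` and `drownings` alone would find a correlation strong enough to publish — and entirely spurious. Controlling for the confounder (here, temperature) is what separates the two bulleted interpretations in each example above.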


A.4 The Replication Crisis in AI Research

AI research has a replication problem. A 2021 survey of machine learning papers found that many published results could not be reproduced — sometimes because the code was not released, sometimes because the exact training data was proprietary, and sometimes because small details about preprocessing, hyperparameter tuning, or hardware made the difference between success and failure.

Key issues include:

  • Selective reporting: Researchers may run many experiments and report only the ones that worked. This inflates apparent performance.
  • Benchmark overfitting: When the community focuses on a small set of benchmarks, researchers optimize specifically for those benchmarks. Models may perform well on MMLU but fail on practical tasks.
  • Compute inequality: Results from organizations with massive computing budgets may not be reproducible by academic labs. A paper that says "we trained for 10,000 GPU-hours" describes a result that most researchers cannot verify.
  • Lack of error analysis: Many papers report aggregate accuracy without breaking down performance by subgroup, edge case, or failure mode.
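Selective reporting is simple enough to demonstrate. The toy simulation below evaluates twenty "model variants" that are all pure chance, then reports only the best one (everything here is synthetic, invented for illustration):

```python
import random

random.seed(1)

def random_accuracy(n_items=100):
    """A 'model' whose predictions match the labels with probability 0.5."""
    return sum(random.random() < 0.5 for _ in range(n_items)) / n_items

# Run twenty chance-level experiments, as a team quietly trying variants might.
accuracies = [random_accuracy() for _ in range(20)]

print(f"average accuracy: {sum(accuracies) / len(accuracies):.2f}")  # hovers near 0.50
print(f"best accuracy:    {max(accuracies):.2f}")  # the cherry-picked headline number
```

Reporting only `max(accuracies)` makes pure noise look like progress. The fix is to report all runs (or preregister the analysis), which is why replication efforts ask for the full experimental record, not the best result.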

What you can do: When reading about an AI result, check whether the code and data are publicly available. Look for independent replications. Be skeptical of results reported only by the organization that built the system.


A.5 How to Read an AI Research Paper

You do not need to understand every equation. Here is a practical guide to extracting what matters:

1. Read the Abstract

The abstract summarizes the entire paper in one paragraph. It tells you what question the researchers asked, what method they used, and what they found. If the abstract makes a strong claim, note it — you will check it against the actual evidence.

2. Skip to the Results

Before reading the methods, look at the results. What did they actually find? Look for tables and figures showing performance metrics. Check whether the results are broken down by subgroup or context. Look for confidence intervals, not just point estimates.

3. Read the Methods

How was the study conducted? What data was used? How was the model trained and evaluated? Were the training and test sets truly separate? Was the evaluation performed by the same team that built the system, or by independent evaluators?

4. Read the Limitations Section

This is often the most honest part of a paper: it is where researchers acknowledge what their study cannot show. If the limitations section is short or absent, that is itself a red flag.

5. Check the Author Affiliations and Funding

A paper from a company evaluating its own product is not the same as an independent evaluation. This does not make the paper wrong, but it does mean you should apply extra scrutiny.


A.6 Red Flags in AI Research Reporting

Be wary when you encounter any of the following:

  • "AI outperforms humans" without specifying the task, dataset, and conditions. Performance claims are always context-dependent.
  • No comparison to a baseline or existing method. "90% accuracy" means nothing without knowing what the alternative achieves.
  • Results from the team that built the system, with no independent validation. Conflicts of interest can bias evaluation.
  • No error analysis or subgroup breakdown. Overall accuracy can mask disparities across populations.
  • Benchmark results presented as real-world evidence. Benchmark performance often does not transfer to deployment.
  • Extraordinary claims with no independent replication. Single studies, however impressive, can be wrong.
  • "Our AI is unbiased" or "bias-free." No system is bias-free; this claim reveals a lack of rigor.
  • Percentages without base rates. "The AI detected 95% of cases" is meaningless without knowing the false positive rate and the prevalence of the condition.
  • Confusion of statistical significance with practical significance. A tiny improvement can be statistically significant with a large enough sample.
  • No limitations section, or a perfunctory one. The researchers either do not understand their study's weaknesses or are unwilling to discuss them.
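The base-rate flag rewards a little arithmetic. Applying Bayes' rule to hypothetical numbers (a screening AI that "detects 95% of cases," a 5% false positive rate, a condition affecting 1% of patients) shows why the detection rate alone is meaningless:

```python
def positive_predictive_value(sensitivity, false_positive_rate, prevalence):
    """Probability that a positive flag is a true case (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = false_positive_rate * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# Hypothetical screening AI: 95% sensitivity, 5% false positive rate,
# for a condition with 1% prevalence.
ppv = positive_predictive_value(0.95, 0.05, 0.01)
print(f"{ppv:.0%} of positive flags are real cases")  # roughly 16%
```

With a rare condition, the false positives from the healthy 99% swamp the true positives from the affected 1%: most alarms are false even though the system "detects 95% of cases." That is the arithmetic the headline omits.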

A.7 A Pocket Checklist

When you encounter an AI research claim — in a paper, a news article, or a product announcement — run through these questions:

  1. What exactly was tested? (Which task, which data, which conditions?)
  2. How big was the study? (Sample size, number of test cases, diversity of conditions)
  3. What was the comparison? (Compared to what baseline, method, or human performance?)
  4. Who conducted the study? (Independent researchers, or the team that built the system?)
  5. Has it been replicated? (By anyone else, in different conditions?)
  6. What are the limitations? (What does the study not show?)
  7. Does the conclusion follow from the evidence? (Or is there a leap from "works in the lab" to "will work everywhere"?)

These seven questions will not make you a researcher, but they will make you a much better reader of research — and that is a cornerstone of AI literacy.