Appendix A: Research Methods Primer

For readers who want to understand how we know what we know — and how confident to be about it.


Throughout this book, we've made claims like "retrieval practice improves long-term retention more than rereading" and "learning styles have no credible evidence behind them." You may have wondered: How do researchers actually figure these things out? How much should I trust these findings? And when two studies seem to disagree, how do I decide what to believe?

This appendix gives you the tools to answer those questions. You don't need a statistics degree. You just need a framework for thinking about evidence.


What Is a Controlled Experiment?

A controlled experiment is the gold standard for figuring out whether one thing actually causes another. The logic is straightforward:

  1. Take a group of people (or animals, or classrooms — whoever you're studying).
  2. Randomly assign them to at least two groups. One group gets the treatment you're testing (the experimental group), and the other doesn't (the control group).
  3. Keep everything else the same. Both groups take the same test, in the same room, at the same time of day. The only thing that differs is the treatment.
  4. Measure the outcome and compare.

The key word is random assignment. When participants are randomly placed into groups, any pre-existing differences between them — motivation, prior knowledge, IQ, sleep quality — get distributed roughly evenly across both groups. That means if the experimental group outperforms the control group, the most likely explanation is the treatment itself, not some other factor.
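
To see that logic in miniature, here is a minimal sketch in Python (all numbers invented for illustration): shuffling participants before splitting them into two groups spreads a pre-existing difference, such as prior knowledge, roughly evenly across both.

    import random
    import statistics

    # Hypothetical pre-existing differences: each participant's prior-knowledge score.
    prior_knowledge = [random.gauss(50, 10) for _ in range(200)]

    # Random assignment: shuffle the participants, then split them into two groups.
    random.shuffle(prior_knowledge)
    experimental, control = prior_knowledge[:100], prior_knowledge[100:]

    # The two groups start out roughly equal on the pre-existing variable, so any
    # later difference in outcomes is plausibly due to the treatment itself.
    print(round(statistics.mean(experimental), 1), round(statistics.mean(control), 1))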

Example from learning science: Roediger and Karpicke (2006) wanted to know whether taking a practice test helps you remember material better than simply rereading it. They randomly assigned students to two conditions: one group read a passage and then took a practice test on it, while the other group spent that same time rereading the passage instead. Both groups came back for a final test days later. The testing group remembered significantly more. Because of random assignment, the researchers could be reasonably confident that it was the testing — not something else — that made the difference.

Within-Subjects vs. Between-Subjects

Sometimes researchers have the same people try both conditions (within-subjects design). For example, Sofia might practice a piece using blocked practice one week and interleaved practice the next. This controls for individual differences because each person is their own comparison. The trade-off is that what you learn in condition A might affect how you perform in condition B (called order effects), so researchers typically counterbalance the order across participants.
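
Here is a minimal sketch of what counterbalancing might look like (Python, with hypothetical participant IDs): half the sample gets one condition first, half gets the other first, so order effects wash out when you average across everyone.

    import random

    # Hypothetical participant IDs for a within-subjects study.
    participants = [f"P{i:02d}" for i in range(1, 21)]
    random.shuffle(participants)

    # Counterbalancing: half do blocked practice first, half do interleaved first,
    # so whatever carries over from the first condition affects both orders equally.
    schedule = {p: ("blocked", "interleaved") for p in participants[:10]}
    schedule.update({p: ("interleaved", "blocked") for p in participants[10:]})

    for person, order in sorted(schedule.items()):
        print(person, "->", " then ".join(order))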


Correlational vs. Causal Claims

Not everything can be tested with a controlled experiment. You can't randomly assign people to get different amounts of sleep for ten years to see what happens to their cognition. You can't randomly assign some students to have growth mindsets and others to have fixed mindsets (you can try to induce a mindset, but that's different from a lifetime of beliefs).

When researchers can't manipulate a variable, they use correlational studies — they measure two or more things and look at whether they're related.

The critical difference: Correlation tells you that two things tend to go together. It does not tell you that one causes the other.

Classic example: Students who use more learning strategies tend to get higher grades. Does that mean the strategies cause better grades? Maybe. But it could also mean that more motivated students (who would get good grades anyway) are more likely to use strategies. Or that students with more academic support at home learn both strategies and content. The correlation alone can't distinguish between these possibilities.
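
One way to make this concrete is to simulate it. In the sketch below (Python, with entirely made-up data), motivation drives both strategy use and grades; strategy use has no direct effect on grades at all, yet the two still end up clearly correlated.

    import random
    import statistics

    # Made-up data: motivation influences BOTH strategy use and grades.
    motivation = [random.gauss(0, 1) for _ in range(1000)]
    strategy_use = [m + random.gauss(0, 1) for m in motivation]
    grades = [m + random.gauss(0, 1) for m in motivation]

    def correlation(xs, ys):
        # Pearson correlation coefficient, computed by hand.
        mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
        covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)
        return covariance / (statistics.stdev(xs) * statistics.stdev(ys))

    # Strategy use never touches grades in this simulation, yet the correlation
    # comes out clearly positive (around .5) because motivation drives both.
    print(round(correlation(strategy_use, grades), 2))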

How to think about it: When you encounter a claim in this book (or anywhere), ask: Was this tested with random assignment, or is it based on correlation? If it's correlational, ask: What other explanations might account for this relationship?

We've tried to be clear throughout the book about which claims rest on experimental evidence and which are correlational. When we say something like "retrieval practice causes better retention," that's backed by controlled experiments. When we say "students who monitor their learning tend to perform better," that's correlational — though there is also experimental evidence showing that teaching students to monitor improves their performance, which strengthens the causal case.


Effect Sizes and What They Mean

When a study finds that "Group A did better than Group B," the natural question is: How much better? That's what an effect size tells you.

The most common effect size in learning science is Cohen's d. It tells you how many standard deviations apart the two groups are. Here's a rough guide:

  Cohen's d   What It Means    Everyday Analogy
  0.2         Small effect     The difference between someone who is 5'10" and 5'10.5" — real, but hard to notice
  0.5         Medium effect    The difference between someone who is 5'10" and 6'0" — noticeable
  0.8         Large effect     The difference between someone who is 5'10" and 6'2" — you'd spot it across the room

In learning science, a d of 0.5 is a big deal. Many interventions that feel transformative in the classroom produce d values of 0.3–0.6. The testing effect (retrieval practice vs. rereading), for example, typically produces effects in the range of d = 0.5–0.7, which is why researchers get so excited about it.
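
For the curious, the arithmetic behind Cohen's d is simple. The sketch below (Python, with invented exam scores chosen only to show the calculation) divides the difference between group means by the pooled standard deviation.

    import statistics

    def cohens_d(group_a, group_b):
        # Difference between means, expressed in pooled standard deviation units.
        mean_difference = statistics.mean(group_a) - statistics.mean(group_b)
        var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
        n_a, n_b = len(group_a), len(group_b)
        pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
        return mean_difference / pooled_sd

    # Invented final-test scores for a retrieval-practice group and a rereading group.
    retrieval_practice = [82, 75, 88, 79, 84, 77, 81, 86]
    rereading = [74, 70, 80, 72, 78, 69, 75, 79]
    print(round(cohens_d(retrieval_practice, rereading), 2))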

Why effect sizes matter more than "statistical significance": A study can find a "statistically significant" difference that is trivially small. If you have 10,000 participants, you can detect a difference so tiny it has no practical importance. Effect sizes tell you whether the difference is actually meaningful in the real world.

Practical Significance

Beyond statistical measures, think about practical significance. If an intervention raises exam scores by 2% on average, that might be statistically significant with a large sample but practically irrelevant. If it raises scores by 15%, that changes lives. Throughout this book, we've prioritized strategies with both statistical and practical significance.


Meta-Analyses: The Studies of Studies

A single study, no matter how well-designed, is just one data point. Maybe the researchers got lucky. Maybe their particular sample of students was unusual. Maybe there was some subtle flaw in the design nobody noticed.

A meta-analysis addresses this by combining the results of many studies on the same question. Instead of asking "What did this study find?" it asks "What do all the studies on this topic find, taken together?"

How it works:

  1. Researchers search for every study ever published (and sometimes unpublished) on a specific question — say, "Does spaced practice improve retention compared to massed practice?"
  2. They code each study's effect size and methodological features.
  3. They statistically combine the results to get an overall effect size, weighted by sample size and study quality (see the sketch after this list).
  4. They examine whether the effect is consistent or whether it varies depending on factors like the type of material, the age of the learners, or the length of the spacing interval.
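
Here is a deliberately simplified sketch of the combining step (step 3 above), written in Python with made-up studies. Real meta-analyses typically use inverse-variance weights and random-effects models rather than raw sample sizes, but the core idea is the same: bigger, more informative studies count for more.

    # Made-up effect sizes and sample sizes for four hypothetical studies.
    studies = [
        {"name": "Study A", "d": 0.62, "n": 40},
        {"name": "Study B", "d": 0.48, "n": 120},
        {"name": "Study C", "d": 0.71, "n": 35},
        {"name": "Study D", "d": 0.55, "n": 300},
    ]

    # Weight each study's effect size by its sample size and average them,
    # so the larger studies pull the overall estimate toward their results.
    overall_d = sum(s["d"] * s["n"] for s in studies) / sum(s["n"] for s in studies)
    print(round(overall_d, 2))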

Why meta-analyses are powerful: They're less susceptible to the quirks of any single study. When Dunlosky et al. (2013) rated learning strategies, they based their ratings on decades of accumulated evidence, not individual experiments. When a meta-analysis of 200+ studies says retrieval practice works, that's far more convincing than any single experiment.

But meta-analyses have limitations too:

  • Garbage in, garbage out. If most of the underlying studies are poorly designed, combining them doesn't magically produce good evidence.
  • Publication bias. Studies that find an effect are more likely to be published than studies that don't. Meta-analysts try to account for this (using techniques like funnel plots and trim-and-fill methods), but it remains a concern.
  • Apples and oranges. Combining studies that use different measures, populations, and procedures can sometimes obscure meaningful differences.

When we cite meta-analyses in this book, we try to note the number of studies and participants involved, because bigger is generally more trustworthy.


The Replication Crisis: Why We Talk About It

Starting around 2011, psychology hit a reckoning. Researchers began systematically trying to replicate — that is, redo — famous studies, and an alarming number of them failed to produce the same results.

The Open Science Collaboration (2015) attempted to replicate 100 published psychology studies. Only about 36% of the replications produced statistically significant results, and even the effects that did hold up were typically much smaller than the originals. Some classic findings — including some related to priming, ego depletion, and social psychology staples — simply didn't hold up.

What went wrong?

  • Small sample sizes. Many studies used 20–30 participants, which is too few to reliably detect real effects. Small samples produce noisy results that exaggerate effect sizes (see the simulation sketch after this list).
  • P-hacking. Some researchers (often unconsciously) tried multiple analyses until they found one that produced a "significant" result — like a student retaking a test repeatedly and only reporting the best score.
  • Publication bias. Journals overwhelmingly published positive results, creating a literature that overrepresented effects that might not be real.
  • HARKing. Hypothesizing After Results are Known — formulating your prediction after seeing the data, then writing up the paper as though you'd predicted it all along.
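
The small-samples problem is easy to see in simulation. In the sketch below (Python with numpy and scipy; all numbers invented), the true effect is a modest d = 0.2, but each simulated study has only 20 participants per group. The rare studies that reach p < .05 badly overestimate the effect.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    true_d, n, significant_effects = 0.2, 20, []

    for _ in range(5000):
        control = rng.normal(0.0, 1.0, n)        # true effect is d = 0.2
        treatment = rng.normal(true_d, 1.0, n)
        t, p = stats.ttest_ind(treatment, control)
        if p < 0.05 and t > 0:                   # only "significant" studies get noticed
            pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
            significant_effects.append((treatment.mean() - control.mean()) / pooled_sd)

    # The average "significant" effect comes out around 0.8, roughly four times
    # the true effect, because only lucky, inflated results cleared the bar.
    print(round(float(np.mean(significant_effects)), 2))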

What this means for the science of learning:

The good news is that the core findings in this book — the testing effect, spacing, interleaving, elaboration — have been replicated hundreds of times across labs, populations, and materials. They are among the most robust findings in all of cognitive psychology.

The more cautious news involves some findings we discuss honestly in the text:

  • Growth mindset interventions show smaller effects in large-scale replications than Dweck's original studies suggested (see Chapter 18's discussion).
  • Learning styles were already debunked before the replication crisis, but the crisis added further weight to the skepticism.
  • Some motivation and self-regulation findings from the 2000s are being reassessed.

Throughout this book, we've tried to flag where findings are rock-solid versus where the evidence is still evolving. Science is a process, not a pronouncement.


How to Read a Research Paper (For Non-Researchers)

You don't need to read original papers to benefit from this book, but if you ever want to dig deeper, here's a quick survival guide.

The Anatomy of a Research Paper

  1. Abstract (the 150-word summary at the top). Read this first. It tells you the research question, what was done, and what was found. If the abstract isn't interesting or relevant, you can often stop here.

  2. Introduction (the "why should you care" section). This reviews prior research and explains the gap the current study is trying to fill. Skim this to understand the context. Pay attention to what question the study is asking.

  3. Method (the "what we did" section). This tells you who participated, what they did, and how results were measured. Key questions to ask:
     • How many participants? (More is generally better. Under 50 per group is a yellow flag.)
     • Were participants randomly assigned to conditions?
     • Is the dependent variable (what they measured) a meaningful outcome?
     • Would this method work in the real world, or only in a lab?

  4. Results (the numbers). Don't panic. Look for:
     • The direction of the effect (which group did better?)
     • The effect size (d, r, or similar)
     • Whether the difference was statistically significant (p < .05 by convention, but remember: significance alone doesn't mean importance)

  5. Discussion (the "what it means" section). Researchers interpret their findings, acknowledge limitations, and suggest future directions. The limitations section is often the most honest and informative part of the paper. Read it.

Red Flags in Research

Watch for these warning signs:

  • Very small samples (fewer than 30 per group)
  • No control group ("We gave students this intervention and they improved!" — compared to what?)
  • Only self-report measures ("Students reported feeling like they learned more" — but did they actually learn more?)
  • Extraordinary claims with no replication (one study, never repeated, claiming a massive effect)
  • Conflicts of interest (study funded by the company selling the product being tested)
  • Cherry-picked outcomes (multiple outcomes measured but only the favorable ones reported)

Common Statistical Terms in Plain Language

Here are the terms you'll encounter most often in learning science research, translated into everyday language.

p-value

Technical: The probability of getting results at least this extreme if the treatment actually had no effect.

Plain English: How surprised you should be if there's really nothing going on. A p-value of .03 means: "If the treatment truly did nothing, there's only a 3% chance we'd see a difference this big just by random luck."

Convention: p < .05 is considered "statistically significant," but this is a somewhat arbitrary threshold. A p-value of .049 is not meaningfully different from .051.

Common misconception: A p-value of .03 does not mean there's a 97% chance the treatment works. That's a different question entirely (one answered by Bayesian statistics, which we won't get into here).
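
If you want an intuition for where a p-value comes from, a permutation test makes the logic visible. The sketch below (Python, with invented scores) shuffles the group labels thousands of times and asks how often pure luck produces a gap as big as the one actually observed.

    import random
    import statistics

    # Invented final-test scores for two groups of eight students each.
    treatment = [81, 77, 85, 79, 84, 76, 82, 88]
    control = [74, 79, 72, 78, 70, 75, 73, 77]
    observed_gap = statistics.mean(treatment) - statistics.mean(control)

    pooled = treatment + control
    lucky_gaps = 0
    for _ in range(10000):
        random.shuffle(pooled)                   # pretend the group labels mean nothing
        fake_gap = statistics.mean(pooled[:8]) - statistics.mean(pooled[8:])
        if fake_gap >= observed_gap:
            lucky_gaps += 1

    # The proportion of shuffles that match or beat the real gap is (roughly) the
    # p-value: how surprised you should be if there is really nothing going on.
    print(lucky_gaps / 10000)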

Confidence Interval

Technical: A range of values that, with a specified level of confidence (usually 95%), is expected to contain the true population value.

Plain English: Our best estimate, plus a margin of error. "The average improvement was 12 points, with a 95% confidence interval of 8 to 16 points" means we're fairly confident the true improvement is somewhere between 8 and 16 points.

What to look for: Narrow intervals mean more precise estimates. If the interval for a difference includes zero (that is, it stretches from negative to positive values), the effect might not be real.
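
As a rough illustration (Python, with invented improvement scores, and assuming the sample is large enough for the usual normal approximation), a 95% confidence interval for a mean is the estimate plus or minus about two standard errors.

    import statistics

    # Invented point improvements for a group of students.
    improvements = [14, 9, 11, 15, 8, 13, 12, 16, 10, 12, 14, 11, 9, 13, 15, 12]

    mean = statistics.mean(improvements)
    standard_error = statistics.stdev(improvements) / len(improvements) ** 0.5

    # Roughly 95% of intervals built this way would contain the true average improvement.
    print(round(mean - 1.96 * standard_error, 1), "to", round(mean + 1.96 * standard_error, 1))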

Effect Size (Cohen's d, r, eta-squared)

Plain English: How big the difference is, in standardized units. See the table above for Cohen's d. The correlation coefficient r ranges from -1 to +1, where 0 means no relationship. Eta-squared tells you what percentage of the variance in outcomes is explained by the treatment (eta-squared = .06 means the treatment explains about 6% of the variation in scores, which is actually considered medium in the social sciences).

Standard Deviation

Plain English: How spread out the data is. If the average exam score is 75 with a standard deviation of 10, roughly two-thirds of students scored between 65 and 85. If the standard deviation is 3, roughly two-thirds scored between 72 and 78. When researchers report means, look for the standard deviation to understand how much variability there was.
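
Here is a tiny illustration (Python, with invented scores): two exams with the same average but very different spread.

    import statistics

    exam_a = [55, 62, 68, 75, 75, 82, 88, 95]   # same mean, widely spread scores
    exam_b = [72, 73, 74, 75, 75, 76, 77, 78]   # same mean, tightly clustered scores

    print(statistics.mean(exam_a), round(statistics.stdev(exam_a), 1))
    print(statistics.mean(exam_b), round(statistics.stdev(exam_b), 1))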

Random Assignment vs. Random Sampling

These sound similar but mean different things:

  • Random assignment: Participants are randomly placed into groups (experimental vs. control). This supports causal claims.
  • Random sampling: Participants are randomly selected from a larger population. This supports generalization (the idea that findings apply beyond the specific people studied).

Most lab studies in learning science have random assignment but not random sampling (participants are usually college students at the researchers' university). This means the causal claims are strong, but whether they apply to every population requires additional evidence — which is why replication across diverse samples matters.

Statistical Significance vs. Practical Significance

A result can be statistically significant (unlikely to be due to chance) but practically insignificant (too small to matter). A study with 50,000 participants might find a statistically significant d of 0.02 — real in a statistical sense, but meaningless for any individual learner.

Conversely, a result can be practically significant but fail to reach statistical significance because the sample was too small. This is especially common in classroom studies, where you might only have 30 students per group. Always consider both statistical and practical significance.


A Decision Framework for Evaluating Evidence

When you encounter a claim about learning — in this book, in the news, from a colleague, or from someone selling a product — run through these questions:

  1. What is the claim? State it precisely.
  2. What kind of evidence supports it? Controlled experiment? Correlation? Meta-analysis? Expert opinion? Anecdote?
  3. How large is the effect? Is it big enough to matter in practice?
  4. Has it been replicated? By different researchers, with different populations, using different materials?
  5. Are there plausible alternative explanations?
  6. Who is making the claim, and do they have a financial interest in you believing it?
  7. What do other experts in the field think? Is there a scientific consensus, or is this one researcher's pet theory?

You don't need to become a methods expert to be a critical consumer of learning science. You just need to ask the right questions — which, come to think of it, is basically what metacognition is all about.


For a summary of the key studies referenced in this book, see Appendix B. For a glossary of technical terms, see Appendix G.