Appendix A: Evidence Grades and How to Read Research

Throughout this book, you will notice bracketed labels attached to specific claims: [Evidence: Strong], [Evidence: Moderate], [Evidence: Preliminary], and [Evidence: Contested]. This appendix explains what those grades mean, how they were assigned, and — perhaps more importantly — how to read the research behind them yourself.

Science is not a collection of facts waiting to be memorized. It is a process of successive approximation: claims get tested, refined, overturned, and occasionally vindicated after years of skepticism. Understanding how that process works is itself a learnable skill, and it makes you a more effective consumer of everything from textbook claims to news headlines.


The Four Evidence Grades

[Evidence: Strong]

A claim receives the Strong designation when it meets several demanding criteria: the effect has been demonstrated in multiple independent studies conducted by different research groups, the effect sizes are consistent across studies, and the finding has survived meta-analytic scrutiny — meaning that when all the available studies are pooled and analyzed together, the signal remains clear.

Strong evidence claims in this book include:

  • Retrieval practice outperforms restudying for long-term retention. The testing effect has been replicated in hundreds of studies across subjects, age groups, formats, and cultures. The comprehensive 2013 review by Dunlosky and colleagues rated retrieval practice as one of only two "high utility" techniques out of ten studied. Independent labs in Germany, Japan, Australia, and across the United States have produced consistent results.

  • Distributed practice produces better long-term retention than massed practice. The spacing effect was first documented by Hermann Ebbinghaus in 1885 and has been replicated continuously for 140 years. Cepeda and colleagues' 2006 meta-analysis of 254 studies involving 14,000 participants found the effect robust and general.

  • Sleep consolidates newly encoded memories. The role of sleep in memory consolidation is supported by neuroimaging, behavioral experiments, pharmacological studies, and work with patients with sleep disorders. It is among the most replicated findings in cognitive neuroscience.

When a claim carries [Evidence: Strong], you can act on it with confidence. The scientific community is not actively debating whether the effect exists — the debates, where they exist, are about mechanisms, optimal parameters, and boundary conditions.


[Evidence: Moderate]

A claim receives the Moderate designation when it has been replicated in several independent studies, but with meaningful variability across findings — for example, the effect size is smaller than initial reports suggested, the effect is clearly real in laboratory settings but has been harder to demonstrate in naturalistic learning environments, or there are moderating variables that limit when the effect applies.

Moderate evidence claims in this book include:

  • Interleaved practice improves long-term performance compared to blocked practice. Multiple experiments show an interleaving advantage, but the effect is less consistent for complex tasks requiring long study episodes. Some studies also show that students feel they are learning less during interleaved practice even when they actually retain more, which can lead them to abandon the strategy.

  • Elaborative interrogation (asking "why?" about material) improves learning. The evidence base is real but smaller than for retrieval practice, and the benefits depend on the learner having sufficient prior knowledge to generate meaningful elaborations. For complete novices in an unfamiliar domain, elaborative interrogation can be frustrating rather than helpful.

  • Taking notes by hand leads to better conceptual learning than typing. Mueller and Oppenheimer's 2014 studies showed this effect, but a 2019 replication attempt by Morehead, Dunlosky, and Rawson found weaker or inconsistent results. The proposed mechanism — that handwriting forces more processing and summarization — is plausible, but the evidence warrants caution rather than absolute prescription.

When a claim carries [Evidence: Moderate], treat it as a strong working hypothesis. Implement the technique thoughtfully, monitor your own results, and understand that individual variation is likely.


[Evidence: Preliminary]

A claim receives the Preliminary designation when the supporting evidence comes from only a few studies, from small samples, from recent findings not yet subjected to independent replication, or from theoretically motivated reasoning that has not yet been fully put to the empirical test.

Preliminary evidence claims in this book include:

  • Specific pre-study exercise protocols may enhance memory encoding beyond general exercise benefits. There is intriguing early research suggesting that brief bouts of moderate aerobic exercise shortly before or after study may selectively benefit memory consolidation beyond the general cognitive benefits of fitness. But the work is recent, sample sizes are small, and the optimal protocols (duration, intensity, timing) are not established.

  • Immediate post-encoding rest (brief wakeful rest) may enhance consolidation. Some studies find benefits from sitting quietly after learning rather than immediately moving to another activity. The effect is theoretically coherent (it mirrors sleep consolidation) but remains preliminary.

When a claim carries [Evidence: Preliminary], treat it as interesting and worth considering, but do not reorganize your learning system around it. It may not replicate, and if it does, the effect size may be smaller than initial reports suggest.


[Evidence: Contested]

A claim receives the Contested designation when the scientific community is actively divided — when there are credible studies on multiple sides of the question, when methodological disagreements are substantive, or when high-profile findings have failed to replicate under scrutiny.

Contested evidence claims (and claims I argue are myths despite popular acceptance) include:

  • Learning styles. The hypothesis that individuals have preferred learning styles (visual, auditory, kinesthetic) and learn better when instruction matches their style is one of the most thoroughly debunked claims in educational psychology. Studies consistently find no "meshing effect" — matching instruction to supposed style does not improve outcomes. Yet the belief is extraordinarily persistent. This is a case where the evidence is not actually contested among researchers but is widely perceived as contested by the public.

  • The 10,000-hour rule as commonly understood. Ericsson's research on deliberate practice is solid. The Gladwellian version — that 10,000 hours of any practice produces expertise — is not. The distinction between deliberate practice and mere repetition is crucial and often lost in popular discussions.

  • Growth mindset interventions at scale. The original research by Dweck and colleagues is robust. But large-scale replications of mindset interventions in school settings have produced inconsistent results. The science of mindset is real; the simple "growth mindset program fixes academic outcomes" translation has not reliably held up.

When a claim carries [Evidence: Contested], I have tried to represent the debate fairly, explain why intelligent scientists disagree, and be explicit about where my own interpretation of the evidence lies.


A Brief Guide to Evaluating Research

What Is a Meta-Analysis?

A meta-analysis is a study of studies. Researchers gather all (or nearly all) existing studies on a specific question, extract their statistical findings, and combine them mathematically to produce an overall effect estimate with greater statistical power than any single study could provide. Meta-analyses are generally more reliable than individual studies because they reduce the influence of any single unusual sample or methodology. However, meta-analyses have their own limitations: if the underlying studies share the same design flaw, pooling them just amplifies that flaw. The phrase "garbage in, garbage out" applies.
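The pooling step described above can be sketched in a few lines. This is a minimal illustration of the standard fixed-effect, inverse-variance approach; the study values below are invented for demonstration, not drawn from any real meta-analysis.

```python
# Minimal sketch of fixed-effect meta-analysis via inverse-variance
# weighting. Each study is weighted by 1/variance, so more precise
# (typically larger) studies count for more in the pooled estimate.
# The study numbers are hypothetical.

def pooled_effect(effects, variances):
    """Combine per-study effect sizes (e.g. Cohen's d) into a single
    pooled estimate plus the variance of that pooled estimate."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_variance = 1.0 / sum(weights)
    return pooled, pooled_variance

# Three hypothetical studies of the same effect:
effects = [0.55, 0.40, 0.62]    # Cohen's d reported by each study
variances = [0.04, 0.01, 0.09]  # squared standard error of each d

d, var = pooled_effect(effects, variances)
print(round(d, 3))  # pooled d, pulled toward the most precise study
```

Notice that the pooled estimate lands closer to 0.40 than the simple average of the three values would, because the second study is the most precise. This is also why the "garbage in, garbage out" caveat matters: a shared bias in the inputs is weighted right into the output.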

What Is a Replication?

A replication is an independent attempt to reproduce a finding under similar conditions. Direct replications use the same methods; conceptual replications test the same hypothesis using different methods. Both are important. A finding that replicates across different labs, different samples, and different methodologies is far more trustworthy than one that has only been demonstrated by the original research group. The absence of independent replication is the single biggest warning sign in research claims.

What Does Effect Size Mean?

Statistical significance tells you how unlikely a result of that size would be if there were no real effect; it says nothing about how large the effect is. Effect size tells you how large the result is. A finding can be statistically significant but practically meaningless — this is especially common in large-sample research where even trivial differences reach significance. The most common effect size measures you'll encounter in learning research are Cohen's d (which compares two group means) and correlation coefficients (r). As a rough guide: d = 0.2 is small, d = 0.5 is medium, d = 0.8 is large. Retrieval practice effects in laboratory studies commonly show d values of 0.4–0.6; in real classroom studies, the effect is smaller but still meaningful.
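For the curious, Cohen's d is simple enough to compute by hand: the difference between the two group means, divided by their pooled standard deviation. The sketch below uses invented final-test scores for two hypothetical study groups.

```python
# Minimal sketch of Cohen's d for two independent groups
# (e.g. a retrieval-practice group vs. a restudy group).
# The scores are invented for illustration.
import math

def cohens_d(group_a, group_b):
    """d = (mean_a - mean_b) / pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    # Sample variances (dividing by n - 1)
    var_a = sum((x - mean_a) ** 2 for x in group_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in group_b) / (n_b - 1)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (mean_a - mean_b) / pooled_sd

retrieval = [78, 85, 72, 90, 80, 75]  # final-test scores, retrieval group
restudy = [70, 74, 68, 82, 71, 69]    # final-test scores, restudy group
print(round(cohens_d(retrieval, restudy), 2))
```

The point of the exercise: d is expressed in standard-deviation units, so it can be compared across studies that used different tests and different scoring scales — which is exactly what a meta-analysis needs.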


The Replication Crisis: What It Means for This Book

Between roughly 2011 and 2020, psychology and related sciences experienced what has been called a replication crisis: when researchers systematically attempted to replicate 100 published studies in psychology, only about 36–39% produced results consistent with the original. High-profile studies in social psychology, personality psychology, and some areas of behavioral economics failed to replicate. This was painful, necessary, and ultimately healthy for science.

What does it mean for learning research specifically? Broadly speaking, the core findings covered in this book — retrieval practice, spacing, interleaving, the importance of sleep, cognitive load theory's basic claims — have fared well. They were developed over decades with larger sample sizes, more rigorous methods, and stronger theoretical grounding than many of the social-priming studies that failed most spectacularly.

Areas of learning research that have shown more replication difficulties include:

  • Short-term interventions for mindset and motivation (effects often small or absent at scale)
  • Specific neuroimaging claims about learning ("this activity activates region X, therefore...")
  • Highly specific prescriptions derived from moderate or preliminary evidence (optimal spacing intervals for specific content types, for example)

The message is not to distrust science. It is to read claims at their appropriate level of confidence, distinguish between robust core findings and preliminary details, and remain genuinely open to updating your beliefs when new evidence arrives. That, fittingly, is exactly what good learning looks like.


How to Read a Research Headline Without Being Misled

Research is frequently distorted between the laboratory and your news feed. A few practical rules:

Find the actual study. Most research headlines are based on a press release, and the press release links to the original paper. Abstracts are freely available on nearly all journal sites and on PubMed. Read the abstract. Pay particular attention to sample size, sample characteristics (college psychology students are the most studied humans on the planet — their results don't always generalize), and what was actually measured.

Note who funded it. Industry-funded research on cognitive enhancement, brain training, and educational software products should be read with particular skepticism. This is not conspiracy thinking — publication bias and motivated reasoning are well-documented phenomena in industry-sponsored research.

Ask: did they measure what they claimed? A study that measures performance on an immediate test 10 minutes after learning is not measuring the same thing as a study that measures performance on a test two weeks later. Long-term retention — which is what learning ultimately is — requires long-term measurement.

Ask: who was in the study? Many learning studies use convenience samples of undergraduate students in a single session. This does not make the findings false, but it should prompt the question: does this apply to a 45-year-old professional learning a new skill, or a high school student studying for exams?

Look for the effect size, not just the p-value. "Significantly better" in a scientific paper means statistically significant, not impressively large.


Key Journals in Learning Science

For readers who want to go deeper into the primary literature, these are the most important journals covering learning, memory, and educational psychology:

  • Psychological Science — flagship journal of the Association for Psychological Science; covers all areas of psychology including memory and learning
  • Educational Psychology Review — publishes review articles and meta-analyses on educational psychology; excellent for getting broad overviews of research areas
  • Journal of Educational Psychology — empirical studies on learning, teaching, and assessment in educational contexts
  • Memory & Cognition — covers basic and applied research on memory processes; strong on retrieval practice, spacing, and encoding
  • Applied Cognitive Psychology — translates cognitive psychology findings to real-world applications
  • Psychological Review — major theoretical articles; where new frameworks and models are often first presented

Most articles are behind paywalls, but you can typically find preprint versions on ResearchGate, the authors' institutional pages, or PsyArXiv. Many researchers will also send you a copy if you email them directly — they are almost universally pleased that anyone wants to read their work.