Case Study 20.1: The Stanford History Education Group (SHEG) Studies — How Experts Evaluate Sources vs. How Students Do

Overview

In a landmark series of studies beginning in 2016 and culminating in peer-reviewed publications in 2019 and 2022, Sam Wineburg, Sarah McGrew, and colleagues at the Stanford History Education Group (SHEG) documented a striking pattern in how different groups approach the task of evaluating online information credibility. The findings overturned several prevailing assumptions in source evaluation pedagogy and provided the empirical foundation for a fundamental rethinking of how digital literacy should be taught.

This case study examines the SHEG studies' methodology, their principal findings, the debate they generated, and their implications for source evaluation instruction.

Research Background

The Problem

By the mid-2010s, it was widely recognized that online misinformation posed a significant challenge to democratic information environments. Educational institutions had responded with various source evaluation frameworks — the CRAAP test being the most prominent in library and information science settings — designed to give students systematic tools for assessing source credibility. These frameworks shared a common underlying model: careful, attentive reading of a source's self-presentation would reveal signals of credibility or the lack thereof.

The SHEG team approached this pedagogical landscape with a question that had rarely been asked systematically: how do people who are actually good at evaluating sources do it? If we want to teach source evaluation, we should first understand what expert source evaluation actually looks like.

The Three-Group Design

The studies used a three-group comparative design that was their key methodological innovation:

Group 1: Stanford University undergraduate students — high-achieving students at a selective university who had received standard academic training, including library instruction and source evaluation education.

Group 2: Professional historians — PhDs with faculty positions in history departments, with extensive experience evaluating historical sources as part of their academic work.

Group 3: Professional fact-checkers — journalists employed at major fact-checking organizations (PolitiFact, FactCheck.org, others) as their primary job function.

This three-group design was significant because it allowed the researchers to distinguish between domain expertise (the historians had it; the fact-checkers generally did not in the specific topical areas tested), source evaluation expertise (the fact-checkers had developed this specifically; the historians had a different form of it for historical sources), and general academic training (the students had received this in quantity).

The Tasks

Participants completed three tasks designed to test different aspects of online source credibility evaluation:

Task 1: Website credibility assessment. Participants were shown the homepage of a website and asked to assess its credibility. The websites were chosen to require genuine judgment: not obviously legitimate or obviously disreputable, but genuinely ambiguous sources requiring research to evaluate.

Task 2: Social media post evaluation. Participants were shown social media posts making factual claims and asked to assess the credibility of the claims.

Task 3: Who is behind this website? Participants were given websites and asked to determine as accurately as possible who was actually responsible for producing and funding the content.

All tasks were conducted using "think aloud" protocols in which participants verbalized their reasoning as they worked, producing a rich qualitative record of verification strategies in addition to accuracy outcomes.

Principal Findings

Finding 1: Fact-Checkers Dramatically Outperformed Both Other Groups

On all three tasks, fact-checkers achieved dramatically higher accuracy rates than both professional historians and undergraduate students. On some tasks, the difference was striking: fact-checkers achieved close to perfect accuracy while historians performed at levels barely distinguishable from chance on the same tasks.

This was surprising because the tasks required no historical knowledge, so the historians faced no particular expertise disadvantage. Their decades of experience evaluating sources simply did not transfer to the domain of web credibility assessment.

Finding 2: Professional Historians Barely Outperformed College Students

The result that generated the most academic discussion was that professional historians — people with doctoral training, decades of professional experience, and sophisticated epistemological frameworks for evaluating sources — barely outperformed undergraduates on web source credibility tasks. This finding was uncomfortable for historical pedagogy because historians have traditionally seen source evaluation as one of their core disciplinary competencies.

The conclusion was not that historians lack analytical skill or intellectual sophistication. Rather, the skills historians develop for evaluating historical sources — skills appropriate for documents from earlier eras — do not map straightforwardly onto the different challenges of web source credibility assessment.

Finding 3: The Key Difference Was Verification Strategy, Not Domain Knowledge

The think-aloud protocols produced the most theoretically important finding: the dramatic performance difference between fact-checkers and the other groups was not primarily explained by differences in domain knowledge, intelligence, or analytical sophistication. It was explained almost entirely by differences in verification strategy.

Historians and students: Vertical reading. Both groups tended to read the source being evaluated carefully from within — scrolling through its content, examining its About page, scrutinizing its language and tone, noting its visual design. This is vertical reading: going deeper into the source to understand it.

Fact-checkers: Lateral reading. Fact-checkers typically left the source being evaluated within the first few seconds. Their first move was to open multiple new browser tabs to search for the source in question — to check what Wikipedia, news organizations, and other credible external sources said about the website being evaluated. They read laterally — checking the source against external references — rather than reading the source itself deeply.

This difference was not subtle. In video recordings of participant behavior, historians and students spent minutes scrolling through source material before forming judgments. Fact-checkers often had a reliable credibility assessment within one to two minutes of receiving a source, having barely looked at the source itself.

Finding 4: Source Self-Presentation Was Often Misleading

Several of the websites used in the study were chosen precisely because their self-presentation was misleading — they appeared professional, authoritative, and credible from within but were revealed by external sources to be partisan advocacy operations, fringe organizations, or coordinated disinformation operations.

Groups that relied on vertical reading — examining the source's self-presentation — frequently misidentified these sources as credible. Fact-checkers who used lateral reading quickly identified the gap between self-presentation and external reputation.

The most vivid illustration involved a website that presented as a scientific organization providing information on environmental issues. Its design was professional, its language was appropriately scientific, and it prominently displayed the logos of what appeared to be institutional partners. A WHOIS lookup and a few Wikipedia searches revealed that the organization was actually a fossil-fuel industry lobbying operation using scientific-seeming language to advocate against environmental regulations. Historians spent considerable time evaluating the website's internal content; fact-checkers identified its actual nature within two minutes.
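The ownership check described above can be sketched in code. The snippet below is a minimal illustration, not the study's actual procedure: it parses a hypothetical raw WHOIS response (the field label is a common registrar convention, and the sample organization name is invented) to pull out the registrant organization, the kind of detail a lateral reader compares against a site's self-presentation.

```python
import re
from typing import Optional

def registrant_org(whois_text: str) -> Optional[str]:
    """Extract the registrant organization from raw WHOIS output.

    WHOIS responses are unstructured text whose field names vary by
    registrar; this matches the common 'Registrant Organization:' label
    and returns None when no such field is present.
    """
    match = re.search(r"Registrant Organization:\s*(.+)", whois_text)
    return match.group(1).strip() if match else None

# Hypothetical WHOIS excerpt for a scientific-looking website.
sample = """\
Domain Name: example-science-site.org
Registrant Organization: Acme Energy Lobbying LLC
Registrant Country: US
"""

print(registrant_org(sample))  # -> Acme Energy Lobbying LLC
```

A mismatch between this field and the site's stated identity is exactly the gap between self-presentation and external reality that lateral readers look for.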

The "Wineburg Shock"

The studies' findings generated what education researchers informally termed the "Wineburg shock" — the recognition that well-intentioned source evaluation pedagogy had been teaching strategies that were not merely ineffective but actively misaligned with what effective source evaluation requires.

The problem was structural. Traditional source evaluation pedagogies — the CRAAP test, the Cornell Library evaluation guides, the MLA source evaluation frameworks widely used in K-12 education — were all built on the same underlying model: read the source carefully, look for internal credibility signals, and assess whether those signals are present. Those internal signals are exactly what sophisticated disinformation operations are designed to fake.

The shock extended beyond pedagogy. If professional historians — people who spend their careers evaluating sources — barely outperform college students on web credibility tasks, this suggests that the disciplines most associated with source evaluation expertise need to fundamentally rethink what they mean by that expertise in the digital context.

The Historians' Response

The historical profession's response to the SHEG findings has been a mixture of recognition and defensiveness. Some historians have embraced the findings, arguing that they support a fundamental revision of how historical thinking and source evaluation are taught in history education. Others have argued that the studies' specific tasks were too narrow — that they tested a specific kind of quick credibility judgment that is not representative of the fuller scope of source evaluation that historians perform.

This defense has merit as a disciplinary point: historians do perform sophisticated source evaluation over longer timeframes, triangulating multiple sources, checking factual claims against archival evidence, and building interpretive frameworks that extend far beyond a quick credibility judgment. But as a pedagogical point, it is less compelling. The challenge facing students and citizens consuming online information in real time is precisely the kind of quick credibility judgment the SHEG tasks modeled — not the extended scholarly inquiry that professional historians perform over months.

Implications for Source Evaluation Instruction

The Case for Teaching Lateral Reading

The most direct pedagogical implication of the SHEG studies is that source evaluation instruction should explicitly teach lateral reading as a strategy. This means teaching students:

  1. Not to trust source self-presentation as the primary credibility signal.
  2. To leave the source early and check what external sources say about it.
  3. To use Wikipedia as a starting point for source investigation, not as a primary source.
  4. To interpret search results about a source as credibility evidence.
  5. To move efficiently — the goal is fast, sufficient-quality credibility judgment, not exhaustive analysis.
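The checklist above can be expressed as a small decision sketch. Everything here is illustrative: the evidence fields, the flag wording, and the keyword heuristics are invented for the example, not drawn from the SHEG studies. The one faithful design choice is that the source's self-description is collected but never consulted.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class LateralEvidence:
    """External evidence about a source, gathered by reading laterally."""
    wikipedia_summary: Optional[str]  # what Wikipedia says about the source
    news_mentions: list               # how external news coverage describes it
    self_description: str             # the source's own About page

def credibility_flags(ev: LateralEvidence) -> list:
    """Return warning flags based on external evidence only."""
    flags = []
    if ev.wikipedia_summary is None and not ev.news_mentions:
        flags.append("no external coverage: treat with caution")
    for mention in ev.news_mentions:
        lowered = mention.lower()
        if "lobbying" in lowered or "front group" in lowered:
            flags.append(f"external description raises concern: {mention}")
    # ev.self_description is deliberately ignored: self-presentation is
    # not treated as a credibility signal (see Finding 4).
    return flags

ev = LateralEvidence(
    wikipedia_summary=None,
    news_mentions=["described as an industry front group"],
    self_description="An independent scientific institute",
)
print(credibility_flags(ev))
```

Ignoring `self_description` in the decision mirrors Finding 4: self-presentation is the least reliable signal, so the sketch bases its flags entirely on what others say about the source.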

Subsequent research by Caulfield, McGrew, and others has found that brief instruction in lateral reading — as little as one class period — produces meaningful improvements in web credibility assessment, particularly when combined with practice on real-world examples.

The Importance of Practice on Real Examples

The SHEG studies and subsequent research have consistently found that source evaluation skills — including lateral reading — are not developed through conceptual understanding alone. Students who can accurately describe what lateral reading is may still struggle to perform it efficiently in practice. Fluent lateral reading requires behavioral practice: repeated exercises in which students leave sources quickly, conduct search-based investigations, and make credibility judgments on unfamiliar sources.

This finding has implications for instructional design. Source evaluation cannot be effectively taught through a single lecture on SIFT or lateral reading. It requires regular, embedded practice across courses and contexts, with the aim of making these behaviors automatic enough to be performed habitually in real information consumption.

The Limits of Domain Expertise for Source Evaluation

The finding that domain expertise does not transfer to web source credibility assessment has a broader implication: we should be cautious about assuming that expertise in any field translates to expertise in evaluating sources about that field. A biologist may be excellent at evaluating the scientific validity of studies in their specialty but may not be significantly better than a layperson at determining whether a website claiming to discuss biological research is actually a credible scientific source.

This matters because subject-matter experts are frequently called upon to provide credibility judgments in public discourse — to identify whether a source is credible in their domain — but their authority for these judgments may be less than assumed.

Methodological Critiques and Limitations

The SHEG studies have attracted several methodological critiques that deserve consideration:

Limited sample sizes: The studies used relatively small samples of historians and fact-checkers. With small samples, there is a risk that the specific individuals selected were not representative of their respective groups.

Task specificity: The specific tasks designed for the studies may favor the fact-checking verification strategies. If the tasks were designed differently — for example, if they emphasized extended document analysis rather than quick credibility judgment — historians might perform relatively better.

Ecological validity: Whether the three-group comparison reflects the diversity of professional fact-checkers and historians is unclear. Fact-checkers at major organizations may not be representative of all people who fact-check, and academics at research universities may not represent all professional historians.

Confounding variables: Fact-checkers may differ from historians on variables beyond verification strategy — they may be younger, more comfortable with digital tools, or more motivated by the specific tasks used.

These critiques do not overturn the studies' core finding, which is robust enough to have held across multiple tasks and replications, but they counsel appropriate caution in extrapolating from these specific findings to broad conclusions about historical expertise and digital literacy.

Conclusions

The SHEG studies represent a genuine contribution to the evidence base for digital literacy education. Their core finding — that lateral reading dramatically outperforms vertical reading for web source credibility assessment, and that this strategy can be developed through instruction — has shaped the design of digital literacy curricula, including the SIFT framework, in the years since their publication.

The deeper lesson is that effective digital literacy instruction cannot be derived from intuition about what good source evaluation looks like. It requires empirical investigation of what actually works — asking not "what does careful source evaluation look like?" but "what does accurate source evaluation look like, and how do effective evaluators achieve it?"

The answer, it turns out, looks quite different from what most source evaluation pedagogy has taught. Effective evaluators don't read deeply — they read sideways. They don't trust what sources say about themselves — they check what others say about them. And they develop habits of efficient verification that produce reliable judgments quickly, because in the real digital information environment, slow and comprehensive verification is often no verification at all.

Discussion Questions

  1. The SHEG studies found that professional historians barely outperform undergraduates on web credibility tasks. What should this finding mean for history departments' approach to source evaluation instruction?

  2. The study shows that fact-checkers perform lateral reading habitually and automatically. How might a high school teacher create sufficient practice opportunities for students to develop comparable habits?

  3. Could the SHEG findings be explained by factors other than verification strategy — for example, that fact-checkers are simply more skeptical by disposition? How would you design a study that isolates the effect of lateral reading specifically?

  4. If the key skill is leaving the source and checking external references, what happens when the external references themselves are unreliable or unavailable? What are lateral reading's limits?

  5. The studies focused on credibility assessment tasks that required quick judgment. Do the findings apply equally to situations where a scholar has extended time for comprehensive source evaluation?