Case Study 26.2: The Social Priming Replication Failures — What the Replication Crisis Teaches Us
Overview
The replication crisis came into sharp public focus with a series of high-profile failures in social psychology. This case study examines two of the most significant replication failures: social priming (particularly the automaticity of social behavior) and ego depletion. Through these cases, we explore what the replication crisis reveals about the practice of science, why it does not justify wholesale skepticism of science, and what reforms have been implemented in response.
These cases were chosen because they are among the best-documented, most discussed, and most instructive examples in the replication crisis literature — and because the original research had significant public visibility and cultural influence.
Part 1: Background — Social Priming and Automaticity
The Social Cognition Program
From the 1980s through the 2000s, social psychology developed an increasingly influential research program on the automatic, unconscious influences on behavior. The core claim was that exposure to social concepts — words, images, situations — automatically activates associated knowledge structures and behavioral dispositions, influencing subsequent thought and action outside conscious awareness.
This research program produced results that seemed to reveal a hidden layer of human psychology: that our behavior was far more influenced by subtle environmental cues than we consciously recognized. The implications were dramatic, counterintuitive, and received enormous media attention.
The Elderly Priming Study
John Bargh, Mark Chen, and Lara Burrows published a paper in 1996 in the Journal of Personality and Social Psychology demonstrating what they called behavioral priming: after completing a scrambled-sentence task containing words associated with the elderly stereotype (words like "Florida," "forgetful," "wrinkled"), participants walked more slowly down a hallway than control participants.
This finding — that unobtrusively priming the concept of elderly people caused people to walk more slowly — was presented as evidence of automatic, unconscious behavioral influence. (The priming words were visible, but participants did not recognize the elderly theme or its connection to their behavior.) The paper was highly cited, became a textbook example, and was featured in Malcolm Gladwell's bestselling "Blink," reaching millions of non-academic readers.
Part 2: The Replication Attempt
Stéphane Doyen et al. (2012)
In 2012, a team led by Stéphane Doyen at the Université Libre de Bruxelles attempted to replicate the elderly priming walking speed effect. Their methodological approach improved on the original in an important way: they used infrared timing sensors rather than human timing (avoiding experimenter expectancy effects where the person timing participants might unconsciously walk more slowly toward primed participants).
Finding: When walking speed was measured by objective sensors, no significant difference between primed and control participants was found. However, a significant effect appeared in the condition where the experimenters knew which participants had been primed — suggesting experimenter expectancy (the Rosenthal effect) rather than behavioral priming was driving the original result.
The Registered Replication Report
As part of the replication crisis's structured response, several "Registered Replication Reports" (RRRs) were organized: pre-registered, multi-lab replications of high-profile findings conducted simultaneously across many research sites.
The RRR for social priming effects was damning. Across multiple labs, following the original protocol as closely as possible, the majority of replication attempts failed to find significant effects.
John Bargh's Response
Bargh responded critically to the replication failures, arguing that: the replication methods differed from his own in culturally or procedurally important ways; replication failures in different countries reflected genuine population differences; the replicators were insufficiently expert; and the failure to find an effect in some labs reflected experimenter skepticism about priming undermining the effect.
These responses were themselves controversial. Many scientists argued that Bargh's explanation — that the effect is so fragile it disappears unless experimenters believe in it — amounts to claiming the effect is not real in any scientifically useful sense. An effect that only occurs when experimenters believe it will is more characteristic of experimenter expectancy than genuine priming.
Part 3: Ego Depletion
The Baumeister Ego Depletion Hypothesis
Roy Baumeister and colleagues proposed in 1998 that self-control (willpower) is a limited cognitive resource that becomes depleted with use, similar to a muscle that fatigues. This became known as the "ego depletion" hypothesis (from Freudian "ego" for the self-regulatory part of the mind).
The theory was supported by dozens of studies across two decades. A typical ego depletion paradigm works as follows:
1. Participants perform a self-control-demanding task (resist eating cookies; suppress emotions; cross out letters in text while following arbitrary rules).
2. They then perform a second self-control-demanding task.
3. Performance on task 2 is compared to a control group that did not do task 1.
4. Finding: people who did the demanding task 1 performed worse on task 2, as if their "willpower" had been depleted.
The research generated an entire theoretical superstructure: glucose as the fuel for self-control, ego depletion as an evolutionary adaptation, and applied implications for everything from judicial decision-making to dieting.
Baumeister's 2011 popular book "Willpower: Rediscovering the Greatest Human Strength," co-authored with science journalist John Tierney, brought these ideas to a mass audience and achieved significant commercial success.
The Crisis: The Hagger et al. Pre-Registered Replication
In 2016, Martin Hagger and colleagues organized a massive, pre-registered multi-lab replication of ego depletion. The pre-registration meant that the hypothesis, method, and analysis plan were specified and publicly registered before data collection began.
Participants: 2,141 participants across 23 international laboratories.
Finding: The overall effect of ego depletion on performance was not significantly different from zero. The meta-analytic effect size was d = 0.04 (95% CI: -0.07 to 0.15) — effectively null.
This was a devastating replication failure for one of social psychology's most prominent research programs.
The Response and Explanation
Sripada et al. (2014) had already provided evidence that ego depletion effects may be explained by motivational factors rather than cognitive resource depletion: offering small monetary incentives to depleted participants fully restored their performance, suggesting that "depletion" affected motivation rather than a fundamental cognitive resource.
Conceptual analysis: The ego depletion paradigm had proliferated an enormous variety of task types used as both the "depleting" task and the "outcome" task — with limited standardization. Meta-analyses by Carter, Kofler, Forster, and colleagues found that effect sizes correlated with methodological flexibility: studies with more flexible analysis had larger effects. This pattern suggests the original literature may have been substantially inflated by analytical flexibility (p-hacking).
Baumeister's response: Baumeister defended the existence of ego depletion, arguing that the null replication used an ineffective depleting task and that the original literature represents genuine evidence. This remains contested, though the field has substantially moved away from the original "resource" model.
Part 4: What These Cases Reveal About Scientific Practice
The Methodological Conditions for Replication Failure
The priming and ego depletion literatures illustrate how replication failure emerges from structural features of how science was practiced:
Inadequate sample sizes: Original studies typically had small samples (N = 20-60 per condition). With small samples, chance variation is large, and any significant result represents a potentially inflated estimate of the true effect.
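This inflation mechanism — the "winner's curse" — can be seen directly in a quick simulation. The sketch below (illustrative only; the true effect size and sample size are assumptions chosen to match the era's typical studies, not values from any specific paper) runs thousands of small two-group experiments with a modest true effect and compares the effect-size estimates of the studies that happened to reach significance against the truth.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

true_d, n = 0.2, 30  # assumed: modest true effect, N = 30 per condition
observed_d, significant = [], []

for _ in range(5000):
    ctrl = rng.normal(0.0, 1.0, n)
    treat = rng.normal(true_d, 1.0, n)
    # Cohen's d with the pooled standard deviation
    pooled_sd = np.sqrt((treat.var(ddof=1) + ctrl.var(ddof=1)) / 2)
    observed_d.append((treat.mean() - ctrl.mean()) / pooled_sd)
    significant.append(stats.ttest_ind(treat, ctrl).pvalue < 0.05)

observed_d = np.array(observed_d)
significant = np.array(significant)

print(f"true d:                     {true_d}")
print(f"mean d, all studies:        {observed_d.mean():.2f}")  # close to the true 0.20
print(f"mean d, significant only:   {observed_d[significant].mean():.2f}")  # far larger
print(f"fraction significant:       {significant.mean():.2f}")  # low power
```

Averaged over all simulated studies, the estimate is unbiased; averaged over only the significant ones — the ones a journal would publish — it is inflated severalfold, because at this sample size only unusually large chance fluctuations clear the significance threshold.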
Degrees of freedom: In the 1996-2010 era, social psychologists had enormous flexibility in how they collected and analyzed data without any community norm requiring pre-specification. Simmons, Nelson, and Simonsohn's 2011 paper "False-Positive Psychology" demonstrated mathematically that plausible-sounding analytical choices could inflate Type I error to 60%. The typical research report revealed only the final analysis, hiding the choices made along the way.
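The Simmons, Nelson, and Simonsohn point can also be demonstrated by simulation. The sketch below uses a simplified subset of their researcher degrees of freedom (two correlated dependent variables, reporting whichever "works," plus one round of optional stopping); the specific parameter choices are assumptions for illustration, and the exact inflated rate will differ from their 60% figure, which combined more degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_hacked_study(n1=20, n_extra=10, r=0.5):
    """One 'study' with NO true effect, exploiting two degrees of freedom:
    (1) two correlated DVs, any of which (or their average) may be reported;
    (2) optional stopping -- peek, and collect more data if p >= .05."""
    cov = [[1, r], [r, 1]]

    def any_significant(ctrl, treat):
        for k in range(2):  # test each DV separately
            if stats.ttest_ind(ctrl[:, k], treat[:, k]).pvalue < 0.05:
                return True
        # ...and also the average of the two DVs
        return stats.ttest_ind(ctrl.mean(1), treat.mean(1)).pvalue < 0.05

    ctrl = rng.multivariate_normal([0, 0], cov, n1)
    treat = rng.multivariate_normal([0, 0], cov, n1)  # null: same distribution
    if any_significant(ctrl, treat):
        return True
    # not significant yet -- add participants and re-test
    ctrl = np.vstack([ctrl, rng.multivariate_normal([0, 0], cov, n_extra)])
    treat = np.vstack([treat, rng.multivariate_normal([0, 0], cov, n_extra)])
    return any_significant(ctrl, treat)

n_sims = 4000
false_positive_rate = sum(one_hacked_study() for _ in range(n_sims)) / n_sims
print(f"nominal alpha: 0.05, actual false positive rate: {false_positive_rate:.3f}")
```

Even this modest amount of flexibility pushes the false positive rate well above the nominal 5%, and crucially, each individual reported analysis still looks like an ordinary t-test at p < .05.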
Publication system incentives: The Journal of Personality and Social Psychology, the most prestigious journal in the field, was known to favor dramatic, counterintuitive findings. A replication confirming a known effect was difficult to publish; a study finding a surprising new priming effect was publishable.
Conceptual vagueness: Social priming hypotheses were often formulated loosely enough to accommodate any pattern of results. If priming worked, that confirmed the hypothesis; if it didn't work, the conditions for priming hadn't been met. Unfalsifiable hypotheses are resistant to correction.
Lack of adversarial collaboration: The culture of social psychology did not prioritize independent replication. Different labs worked on different phenomena rather than replicating each other's work. Replication was culturally devalued ("Just confirming what we already know?").
The Role of Pre-Registration in Diagnosis
The Hagger ego depletion replication was conducted as a pre-registered study, which meant:
- The hypothesis was specified in advance (ego depletion will be observed)
- The outcome measure was specified in advance
- The analysis plan was specified in advance
- All data would be reported regardless of the result
This design made it impossible for the null result to be a methodological artifact of post-hoc hypothesis selection or analysis flexibility. The null result was the null result.
Pre-registration serves both a prospective function (preventing p-hacking and HARKing in the pre-registered study) and a diagnostic function (when pre-registered studies fail to replicate unregistered studies, this strongly suggests the original literature was contaminated by flexibility).
Part 5: What the Replication Crisis Does and Does Not Mean
What It Means
The replication crisis demonstrates that:
- Publication bias and analytical flexibility systematically inflate reported effect sizes in scientific literature.
- Many findings that were accepted as established — appearing in textbooks, popular books, policy documents — were false positives or dramatic overestimates.
- Peer review is insufficient quality control for detecting these systematic distortions.
- The incentive structures of academic science (publish or perish, preference for positive results) contributed to these problems.
What It Does NOT Mean
The replication crisis does not imply that:
- All science is unreliable.
- We should disbelieve scientific claims in general.
- Pre-registration eliminates false positives (it dramatically reduces them).
- The specific fields experiencing replication crises (primarily psychology, nutrition, some areas of medicine) represent all of science.
Many areas of science have high replication rates. Physics, chemistry, geology, and astronomy operate with different methodological norms, larger effect sizes, and different publication cultures. The replication crisis is concentrated in areas involving:
- Small effect sizes relative to noise
- Enormous analytical flexibility
- Strong publication pressure for positive results
- Inadequate sample sizes for the phenomena being studied
The Appropriate Update
The appropriate epistemic response to the replication crisis is calibrated skepticism:
- Single studies, especially from fields known to have replication problems, should be treated as preliminary.
- Effect sizes should be halved (at minimum) as a rough prior adjustment.
- Pre-registered replications should be weighted more heavily than unregistered original studies.
- Convergent evidence from multiple independent methods is far more reliable than any single study.
- Well-established findings that have been replicated across many independent labs, using diverse methods, are genuinely reliable — regardless of the replication crisis in other areas.
Part 6: Reforms and Their Effectiveness
Since the emergence of the replication crisis, psychology and related fields have implemented a range of reforms:
Pre-registration rates: The percentage of studies in major psychology journals that were pre-registered increased from essentially zero in 2011 to approximately 25-35% in the early 2020s.
Registered Reports: Over 200 journals in psychology and related fields now offer Registered Reports formats, in which editorial acceptance is based on the study design before results are known.
Open data: Major journals now require or strongly encourage data sharing. Policies requiring code sharing are increasingly common.
Larger samples: The culture of running studies with N = 30 is fading. Power analysis requirements and increasing awareness of the winner's curse are pushing toward larger samples.
Replication studies as publishable: The previously stigmatized activity of replication is now increasingly valued. Journals that welcome replications and Registered Reports (like Advances in Methods and Practices in Psychological Science) have been established.
Discussion Questions
- The Bargh elderly priming study was included in Malcolm Gladwell's "Blink," which sold millions of copies. Gladwell has not issued a public correction or retraction for this section of the book. What obligations do popular science writers have when findings they have reported fail to replicate?
- Some social psychologists defend the original priming literature by arguing that the replication failures reflect differences in experimental conditions, cultural contexts, or experimenter expertise. How would you evaluate this defense? What evidence would distinguish between "the effect is real but fragile" and "the effect does not exist"?
- Ego depletion influenced public policy recommendations (e.g., judicial decision-making guidance, advice to schedule difficult decisions early in the day). How should policy be updated when the scientific foundation of an evidence-based policy recommendation fails to replicate?
- Pre-registration dramatically reduces false positive rates, but also involves telling journals the hypothesis and design before results are known. Some researchers worry this constrains scientific flexibility and discovery. How would you balance these concerns?
- The replication crisis has been used by anti-vaccine advocates and climate change deniers to argue that "you can't trust scientific research." Is this a valid application of the replication crisis findings? What would a consistent and accurate application of replication crisis findings look like?
- If you were designing a research grant evaluation system for a major funding agency, what features would you include to reduce the structural incentives for p-hacking, HARKing, and publication bias?