Chapter 3: How Scientists Study Attraction — Research Methods and Their Limits
Learning Objectives
- Distinguish between experimental, correlational, and observational research designs
- Explain the WEIRD problem and its implications for attraction research
- Interpret effect sizes and understand why statistical significance alone is insufficient
- Identify signs of publication bias in a body of literature
- Critically evaluate methodology in an attraction research paper
In This Chapter
- The Research Methods Toolkit
- Lab vs. Real World: The Ecological Validity Crisis
- The WEIRD Problem: Who Does This Science Think It's About?
- Effect Sizes, Statistical Significance, and the p-Value Misunderstanding
- The Replication Crisis: When the Findings Don't Hold Up
- Publication Bias: The File Drawer Problem
- Inside the Okafor-Reyes Study: Methodology in Practice
- Self-Report vs. Behavioral vs. Physiological Measures
- Ethical Review and Participant Protection
- The Nature vs. Nurture Problem in Method Choice
- How to Read a Research Paper: A Student's Guide
- A Note on Progress
- Meta-Analysis: Synthesizing Across Studies
- Summary
In the summer of her third year at Michigan, Dr. Adaeze Okafor spread forty-seven printed survey instruments across her office floor. Each one represented a different published study on attraction — measures of physical preference, mate-value self-assessments, romantic jealousy scales — and she had arranged them not chronologically but by sample composition. On one side of the room, the studies whose participants were overwhelmingly white American undergraduates from Research-I universities in the Midwest and Southeast. On the other side, everything else. The pile on the right took up most of the floor. The pile on the left fit on a single chair cushion.
"This," she told Dr. Carlos Reyes when he arrived for their planning meeting that afternoon, gesturing at the lopsided landscape, "is the methodology problem."
Reyes set down two coffees and studied the arrangement. "I know," he said, after a moment. "It's why we're here."
What they were building — what would eventually become the Global Attraction Project — grew directly out of that frustration. But more than that, it grew out of a recognition that the problem Okafor had mapped onto her office floor wasn't merely about sample composition. It was about method itself: what tools we use to study attraction, what those tools can and cannot see, and how the questions we ask constrain the answers we get. This chapter is about all of that.
Understanding how scientists study attraction requires us to hold two things in tension simultaneously. The research enterprise has produced real, replicable, genuinely illuminating findings about human desire. And the research enterprise has also produced an alarming volume of noise — underpowered studies, measurement artifacts, publication-distorted bodies of literature, and findings that say more about Northwestern University sophomores than about human beings. Learning to tell the difference is one of the most important skills this book can give you.
The Research Methods Toolkit
Every science needs instruments. In chemistry, you have spectrometers; in physics, particle colliders; in astronomy, telescopes powerful enough to detect light that left its source before Earth existed. In attraction research, the instruments are more varied and, in some ways, more problematic — because the thing you're trying to measure tends to change when it knows it's being measured.
Experimental Designs
The experiment is the gold standard of social science, and for good reason. When researchers randomly assign participants to conditions — showing some people a photo of an attractive face and others a photo of a neutral face, for instance, while measuring heart rate — they create the conditions under which causation can be inferred. Random assignment means that any differences between groups at the end of the study are likely due to the manipulation, not to pre-existing differences between the people in each group.
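As a concrete illustration, here is a minimal sketch of random assignment in code; the participant IDs, condition names, and group sizes are all invented.

```python
# Hypothetical sketch: random assignment to two conditions.
import random

random.seed(42)  # fixed seed so the illustration is reproducible
participants = [f"P{i:03d}" for i in range(1, 41)]  # 40 invented participant IDs
random.shuffle(participants)

half = len(participants) // 2
conditions = {
    "attractive_face": participants[:half],
    "neutral_face": participants[half:],
}
# Assignment ignores every attribute of the participants, so pre-existing
# differences between people are expected to balance across groups on average.
print({name: len(ids) for name, ids in conditions.items()})
```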
In attraction research, classic experiments have investigated how ambient factors affect perceived attractiveness (does a person seem more attractive when you meet them at an exciting location versus a boring one?), how familiarity breeds liking (the mere exposure effect), and whether specific cues like voice pitch or gait symmetry influence ratings of desirability. These studies have generated real knowledge.
But experimental designs in attraction research carry a peculiar burden. The moment you bring someone into a laboratory and ask them to evaluate another person's attractiveness, you have already fundamentally transformed the experience. Attraction in the wild is contextual, unfolding, embedded in social history. Attraction in a lab is a task — "Rate this face on a scale of 1 to 10" — and there is mounting evidence that these are not the same psychological process measured under different conditions. They may be different processes entirely.
🧪 Methodology Note: The Laboratory Paradox The conditions that make experiments internally valid — control, random assignment, standardization — are often inversely related to their ecological validity, or the degree to which findings generalize to real-world behavior. A study of face preferences in a quiet room with neutral lighting tells us something genuine about face preferences in a quiet room with neutral lighting. How much it tells us about attraction between people who meet at a party, share a work project, or swipe right while half-watching television is a question every experimentalist must answer honestly.
Correlational Research
Much attraction research is correlational: researchers measure two or more variables and assess the statistical relationship between them. Do people with more symmetrical faces receive more matches on dating apps? Does similarity in values predict relationship satisfaction? Are people with higher self-reported self-esteem rated as more attractive by strangers?
Correlational designs are valuable because they can capture real-world relationships without the artificiality of the lab. They are also comparatively cheap to run and can make use of large datasets. But they come with a fundamental limitation that most undergraduates encounter early and then spend the rest of their education learning to truly internalize: correlation does not imply causation.
When we find that social confidence correlates positively with attractiveness ratings, this could mean: (1) confident people are more attractive; (2) attractive people become more confident because of positive social feedback; (3) some third variable — say, growing up with warm, attentive parents — causes both confidence and traits that others find attractive; or (4) some combination of the above that is genuinely difficult to disentangle. Correlational methods, on their own, cannot adjudicate between these possibilities.
Observational Research
Some of the richest attraction research comes not from the lab or the survey but from careful, systematic observation of behavior in natural settings. Researchers have stationed observers in bars, speed-dating events, university corridors, and online forums to code naturally occurring behavior: who initiates conversation, how people position their bodies in relation to someone they find attractive, when and how touch is used, what happens in the negotiation of eye contact.
Observational research has high ecological validity — people are, more or less, doing what they would do anyway. But it raises serious challenges around reliability (two observers coding the same interaction may interpret it differently), reactivity (people who know they're being observed may behave differently), and ethics (not all naturalistic observation can be conducted with informed consent). Okafor's methodological instincts were sharpened by her graduate training in participant observation; Reyes, trained in behavioral ecology, had his own tradition of observational rigor. Their debates about observation protocols would run through the entire design phase of the Global Attraction Project.
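Reliability is usually quantified rather than merely asserted. A common index is Cohen's kappa, which corrects raw agreement between two coders for the agreement expected by chance alone. A minimal sketch, with invented coding data:

```python
# Invented coding data for two observers; Cohen's kappa corrects their raw
# percent agreement for the agreement expected by chance alone.
from sklearn.metrics import cohen_kappa_score

observer_a = ["initiates", "initiates", "none", "touch", "eye_contact", "none"]
observer_b = ["initiates", "none", "none", "touch", "eye_contact", "touch"]

kappa = cohen_kappa_score(observer_a, observer_b)
print(f"Cohen's kappa = {kappa:.2f}")  # 1.0 is perfect agreement, 0 is chance level
```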
Survey and Self-Report Methods
Surveys — standardized questionnaires administered to large samples — are the workhorse of social psychological research. For attraction research, they offer scalability (you can survey thousands of participants relatively quickly) and access to private experience (you can ask about feelings, desires, and fantasies that would be impossible to observe directly).
The problem is that people are, to put it gently, unreliable narrators of their own inner lives. We don't have direct access to many of the processes that shape our behavior. We have theories about ourselves — and those theories are heavily influenced by cultural scripts, social desirability concerns, and simple post-hoc rationalization. When someone tells you they prefer tall partners, they may genuinely believe it. Whether their actual behavior in actual romantic encounters reflects that stated preference is an empirical question — and one that, as we'll see in the Case Studies, often yields surprising answers.
Qualitative Methods
Qualitative research — in-depth interviews, focus groups, ethnographic immersion, content analysis of texts — occupies an uneasy position in the hierarchy of evidence that quantitative social science has constructed for itself. Many psychologists treat it as merely exploratory, a way to generate hypotheses that "real" (i.e., quantitative) research can then test. This dismissal misunderstands what qualitative methods are for.
When Okafor designed the qualitative interview component of the Global Attraction Project, she wasn't trying to generate data that would eventually be reduced to numbers. She was trying to understand how people in Lagos and São Paulo and Seoul made meaning of their own attraction experiences — what categories they used, how those categories mapped onto (or failed to map onto) the categories that Western attraction researchers had built their instruments around. That kind of understanding is not capturable with a Likert scale.
⚖️ Debate Point: Does "Mixed Methods" Solve the Problem? The Global Attraction Project uses a mixed-methods design — combining surveys, behavioral observation, and qualitative interviews. Many researchers advocate for mixed methods as a way to leverage the strengths of each approach while compensating for their weaknesses. The counterargument is that mixed methods can become a way of collecting a lot of data without committing to any clear theoretical framework. Okafor and Reyes argue about this regularly. Their compromise: each method is treated as providing its own kind of answer, not as a validation of the others.
Neuroimaging: The New Frontier
Since the 1990s, neuroimaging technologies — particularly functional magnetic resonance imaging (fMRI) — have offered researchers a window into the brain processes underlying attraction. Studies have shown that viewing attractive faces activates reward circuits (including the ventral tegmental area and nucleus accumbens), that romantic love recruits some of the same dopaminergic pathways as other rewarding stimuli, and that the sight of a desired person produces patterns of neural activation distinct from those produced by viewing friends or attractive strangers.
These findings are genuinely interesting. They are also frequently oversold. The problem of "reverse inference" — concluding from "this brain region activated" to "this psychological process is occurring" — is severe in attraction neuroscience. The nucleus accumbens activates for cocaine, music, money, and beautiful faces. This tells us something about the brain's reward architecture. It doesn't tell us that love is "just" an addiction, or that attraction is "hard-wired," or any of the other popular conclusions that neuroscientific results are routinely used to support.
Longitudinal and Experience-Sampling Designs
Two other methodological approaches deserve explicit attention, because they address limitations that cross-sectional laboratory studies cannot.
Longitudinal designs follow the same participants over time, sometimes for months or years. For attraction research, this matters enormously: attraction is not a one-time event but a dynamic process that unfolds through repeated interaction, shared experience, and accumulating history. Studies tracking how couples' attraction to one another changes over the first year of a relationship, for instance, can reveal developmental patterns — an initial spike tied to novelty, a consolidation tied to attachment security, a gradual modulation as partners become familiar — that a single snapshot in time would never detect. The cost is considerable: participant dropout (called attrition), expense, and the sheer organizational complexity of maintaining contact with a sample over years. But the payoff in ecological realism and temporal depth is difficult to replicate by any other means.
Experience sampling methodology (ESM) — sometimes called ecological momentary assessment (EMA) — represents a more recent innovation. Participants are prompted, typically via smartphone, to report their current feelings, thoughts, or behaviors at random or semi-random intervals throughout the day. In attraction research, this allows investigators to capture how much someone thinks about a person they're interested in, how their mood affects their ratings of strangers they encounter, or how the dynamics of a date feel in real time rather than in retrospect. ESM studies on romantic relationships have revealed that the gap between how people feel during experiences and how they remember feeling afterward can be substantial — a finding with major implications for the interpretation of retrospective self-report data, which constitutes the bulk of the attraction literature.
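To make "semi-random intervals" concrete: many ESM protocols divide the waking day into blocks and draw one random prompt time per block, so prompts are unpredictable to participants but roughly evenly spread. The sketch below assumes a six-block day from 09:00 to 21:00; every parameter is illustrative rather than drawn from any particular study.

```python
# Illustrative ESM prompt scheduler: one random prompt per two-hour block,
# six blocks per day between 09:00 and 21:00. All parameters are invented.
import random
from datetime import datetime, timedelta

def daily_prompts(day, n_blocks=6, start_hour=9, block_hours=2):
    """One random prompt time inside each block: unpredictable to the
    participant, but roughly evenly spread across the waking day."""
    day_start = day.replace(hour=start_hour, minute=0, second=0, microsecond=0)
    prompts = []
    for block in range(n_blocks):
        offset = timedelta(hours=block * block_hours,
                           minutes=random.randint(0, block_hours * 60 - 1))
        prompts.append(day_start + offset)
    return prompts

random.seed(7)
for t in daily_prompts(datetime(2024, 3, 1)):
    print(t.strftime("%H:%M"))
```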
Neither longitudinal studies nor ESM eliminate WEIRD bias or solve the ecological validity problems of laboratory research. But they move the conversation in a more realistic direction, and their growing use reflects a genuine improvement in the field's methodological ambition.
Lab vs. Real World: The Ecological Validity Crisis
Here is a discomfiting fact about attraction science: the vast majority of it was conducted in laboratories, with participants who were aware they were participating in a study, completing tasks that bear limited resemblance to actual romantic encounters, in conditions designed to minimize variability rather than capture the complex, contextual, temporally unfolding nature of real desire.
The reasons for this are understandable. Control is what allows causal inference. Naturalistic studies are expensive, ethically complex, and statistically messy. Academic career incentives reward publications, and clean laboratory findings with neat p-values are easier to publish than messy observational findings with large confidence intervals.
But the consequences are significant. Consider what we actually want to know about attraction — whether it's how people choose partners, how attraction develops over time, how compatibility is negotiated in real relationships — and then consider how far most laboratory paradigms are from capturing any of that. Rating photographs. Imagining hypothetical partners with described attributes. Completing self-report scales that ask about feelings that may not accurately reflect actual behavioral tendencies.
Reyes, despite his more positivist methodological orientation, was the first to flag this in the design discussions for the Global Attraction Project. "I keep thinking about what we're missing by controlling everything," he told Okafor during one planning session. "Every time we add a control to improve internal validity, we're subtracting something real." His background in behavioral ecology — a field that has always been committed to studying animals in their natural habitats — gives him an intuitive skepticism of purely laboratory-based conclusions.
📊 Research Spotlight: The Northgate Studies A series of longitudinal observational studies conducted in university dormitories and apartment complexes during the 1950s and 1960s found that proximity was one of the strongest predictors of friendship and romantic pairing — people were more likely to become friends and romantic partners with those who lived nearest to them, even when initial preference would have predicted otherwise. These findings, with their high ecological validity, remain among the most replicated in the literature. They also came from a much messier research design than modern experimentalists would typically accept. The lesson: methodological messiness and scientific value are not opposites.
Online Research and the New Ecological Frontier
The growth of internet-based data collection has fundamentally changed the scale at which attraction research is conducted. Platforms like MTurk (Amazon Mechanical Turk), Prolific, and various university-based research portals have made it possible to recruit samples of hundreds or thousands of participants within days, at lower cost than traditional in-person recruitment. This has allowed researchers to ask questions that simply weren't statistically powered enough to answer with small lab samples — examining rare interaction effects, testing whether findings generalize across multiple demographic groups, or following large samples over extended periods.
Online methods have also introduced their own problems. MTurk and similar platforms are not representative even of Western populations — they skew younger, more educated, and more politically liberal than the general population, and they include substantial numbers of "professional participants" who complete dozens of studies per week and may develop sophisticated awareness of research designs. Attention failures are more common when participants are at home in front of their own screens than when they are in a laboratory setting with an experimenter present. And demographic data collected online often relies entirely on self-report, with no external verification possible.
The COVID-19 pandemic accelerated the shift to online research so dramatically that in some subfields, nearly all new data collection moved to remote formats. For attraction research, this created a natural experiment in method: findings obtained with in-person laboratory procedures suddenly had to be compared to online replications conducted under very different conditions. In some cases, effects that were reliably obtained in person were noticeably smaller or absent online. In others, the effect sizes were comparable. The field is still processing what these differences mean.
The WEIRD Problem: Who Does This Science Think It's About?
In 2010, psychologists Joseph Henrich, Steven Heine, and Ara Norenzayan published a paper that should have been — and, in some quarters, was — a significant shock to the discipline. They pointed out that the samples on which the vast majority of psychological research is based are strikingly unrepresentative of humanity: they are overwhelmingly Western, Educated, Industrialized, Rich, and Democratic. They coined the acronym WEIRD and argued that people from WEIRD societies are in fact outliers on many psychological measures, not the norm.
The implications for attraction research are substantial.
When we say "research shows that men prefer youthful physical features in partners," what we usually mean is "research with primarily American and European undergraduate samples, conducted mostly by American and European researchers, using measures developed in American and European research contexts, shows this." The leap from that sentence to a universal claim about male psychology requires either evidence of cross-cultural replication (which is often thin or absent) or an evolutionary argument about why this preference would be universal (which typically generates just-so stories that overreach the evidence).
Okafor has been making this point in print and at conferences for nearly a decade. She is not arguing that no findings generalize across cultures — some do, with appropriate nuance. She is arguing that the default assumption of generalizability is unjustified and that the burden of proof lies with those who claim universality, not those who question it.
⚠️ Critical Caveat: WEIRD Bias in Attraction Research A 2017 meta-analysis of published attraction research found that approximately 67% of studies used exclusively American samples, and approximately 90% used samples drawn entirely from North America or Western Europe. Studies using samples from Africa, Southeast Asia, the Middle East, or Central/South America were not merely underrepresented — they were nearly absent. This is not a minor sampling issue. It means that the conclusions of attraction science have been built primarily on the experiences of a minority of the world's population.
The WEIRD problem intersects with intersectionality in a particularly significant way. Even within WEIRD samples, attraction research has historically been conducted with white, heterosexual, able-bodied, cisgender participants. Studies of same-sex attraction, cross-racial attraction, attraction across disability status, and attraction among people with non-binary gender identities have been either absent or treated as secondary "special case" investigations. As Jordan Ellis might note, the "universal" findings of attraction science have never been particularly universal at all — they've been findings about a slice of humanity that happens to have institutional access to research infrastructure.
Reyes raises a different but related concern during one of his methodological debates with Okafor: even when researchers do try to conduct cross-cultural research, they often import Western constructs wholesale. If you translate an American "attraction preferences" survey into Yoruba or Japanese, you may be accurately measuring whether people in Lagos or Tokyo respond similarly to the items — but you haven't asked whether those items capture the salient dimensions of attraction experience in those cultural contexts. The question is not just "do people in other cultures show the same results?" but "are we asking the right questions?"
The Intersectionality Gap in WEIRD Research
The WEIRD problem, as Henrich and colleagues framed it, focuses primarily on geography and economic development. But there is a parallel gap that operates within WEIRD contexts: the systematic exclusion of people who are marginalized along dimensions of race, gender, sexuality, disability, and class even in countries like the United States.
Consider how attraction research has treated same-sex attraction. For most of the discipline's history, same-sex attracted individuals appeared in attraction research only as "special populations" whose data were either excluded from main analyses or reported in separate subsections as if their experiences were fundamentally different from those of heterosexual participants. This framing implicitly treats heterosexual attraction as the unmarked default and queerness as a variation requiring special explanation. The practical consequence is that a large proportion of the findings presented in introductory textbooks as "what attraction research shows" are findings about heterosexual attraction, often from white, college-educated samples — even when this limitation is not explicitly stated.
Research on how race shapes attraction experiences is even more fraught. When race has appeared in attraction research, it has often been in the form of asking participants (implicitly or explicitly heterosexual) to rate the attractiveness of faces across racial groups — a paradigm that generates data about racial preference while telling us little about how racialized attraction norms are experienced from the inside by people who are racialized in various ways. The experience of navigating racialized desirability hierarchies — what it is like to be a Black woman in a dating market saturated with messages about whose beauty is valued, or an Asian man confronting persistent desexualization — is not capturable with a face-rating paradigm, and it has barely been studied at all.
Disability is even more invisible in the attraction literature. The vast majority of participants in attraction research are non-disabled. The very rare studies that have examined attraction involving disability have tended to focus on how non-disabled people perceive disabled people's attractiveness — a framing that positions disabled people as objects of perception rather than subjects with their own attraction experiences, desires, and relationship lives.
These are not peripheral complaints about the field's blind spots. They go to the heart of what attraction science is for. If the goal is to understand human desire and connection, then a field that systematically excludes the attraction experiences of most of the world's population — including the majority of people living in Western, "WEIRD" countries — has a foundational knowledge problem, not just a diversity optics problem.
Effect Sizes, Statistical Significance, and the p-Value Misunderstanding
Among the most consequential misunderstandings in popular science communication is the conflation of statistical significance with practical significance. They are not the same thing.
A p-value tells you how likely it is that you would observe a result at least as extreme as the one you found, if the null hypothesis (no real effect) were true. A p-value below .05 means that if there were no real effect, you'd see results this extreme less than 5% of the time. That's the threshold that has, by historical convention, become the dividing line between "publishable" and "not publishable" in most social science fields.
Here is the problem: statistical significance is a function of both effect size and sample size. With a large enough sample, you can find a statistically significant result for an effect so tiny it is practically meaningless. Conversely, with a small sample, you might fail to detect a genuinely important effect simply because you lacked the statistical power to find it.
This is where effect sizes become essential. Cohen's d is perhaps the most common effect size measure for comparing two group means. A d of 0.2 is conventionally considered "small," 0.5 "medium," and 0.8 "large." These thresholds are not magic — Cohen himself was explicit that they were rough guidelines, not laws of nature — but they give you a sense of the actual magnitude of an observed difference, independent of whether it crossed the .05 threshold.
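A short simulation makes the point concrete. Assuming normally distributed ratings and a tiny true effect of d = 0.1, the same effect will typically miss the .05 threshold with 50 participants per group and clear it easily with 5,000. All numbers below are simulated.

```python
# Simulated data: a tiny true effect (d = 0.1) will typically miss the .05
# threshold at n = 50 per group but clear it easily at n = 5000 per group.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d = 0.1

for n in (50, 5000):
    group_a = rng.normal(loc=0.0, scale=1.0, size=n)
    group_b = rng.normal(loc=true_d, scale=1.0, size=n)
    t_stat, p = stats.ttest_ind(group_b, group_a)
    pooled_sd = np.sqrt((group_a.var(ddof=1) + group_b.var(ddof=1)) / 2)
    d = (group_b.mean() - group_a.mean()) / pooled_sd  # Cohen's d, pooled SD
    print(f"n per group = {n:5d}: d = {d:+.2f}, p = {p:.4f}")
```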
💡 Key Insight: Reading Effect Sizes in Attraction Research Many findings in attraction research that are reported with great confidence and authority rest on surprisingly small effect sizes. This doesn't mean they're wrong — small consistent effects can be important, especially in evolutionary contexts where small differences compound over time. But it does mean that a finding explaining 3% of the variance in attractiveness ratings is not the same thing as a finding that explains most of what people find attractive. The latter kind of headline is common in popular science reporting about attraction; the former is what the data usually show.
Confidence Intervals and the Width of Our Uncertainty
Effect sizes become even more informative when paired with confidence intervals. A 95% confidence interval around a Cohen's d value tells you the range of values within which the true population effect likely falls, given your data. A study reporting d = 0.4 with a 95% CI of [0.35, 0.45] is saying something very different from a study reporting d = 0.4 with a 95% CI of [−0.1, 0.9]. In the first case, you have a precise estimate of a small-to-medium effect. In the second case, your uncertainty spans almost the entire plausible range — from a small negative effect to a large positive one. That second study tells you essentially nothing.
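For readers who want the arithmetic, here is the standard large-sample approximation for the confidence interval around d, with hypothetical sample sizes chosen to reproduce the narrow and wide intervals just described.

```python
# Approximate 95% CI for Cohen's d via the standard large-sample formula:
# SE(d) ~= sqrt((n1 + n2)/(n1 * n2) + d^2 / (2 * (n1 + n2))).
# The sample sizes below are hypothetical.
import math

def d_confidence_interval(d, n1, n2):
    se = math.sqrt((n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2)))
    return (d - 1.96 * se, d + 1.96 * se)

print(d_confidence_interval(0.4, 1600, 1600))  # large study: a narrow interval
print(d_confidence_interval(0.4, 20, 20))      # small study: interval spans zero
```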
For decades, attraction research routinely reported p-values and means without reporting confidence intervals, leaving readers unable to assess the precision of the findings. The movement toward "New Statistics" — centered on effect sizes and confidence intervals rather than binary significance testing — has begun to change this, but the change is uneven, and many older papers remain in the literature without these essential pieces of information.
There is also a less-discussed issue: what counts as a meaningful effect size in attraction contexts may differ from Cohen's general benchmarks, which were derived from looking across many different areas of psychology. An effect of d = 0.3 on a face-rating paradigm might be trivially small in absolute terms while being large relative to everything else we know about what influences face ratings. Conversely, an effect of d = 0.5 on relationship satisfaction might be medium in the statistical sense but large in practical terms — a difference of half a standard deviation in reported relationship satisfaction is not a trivial thing in a person's life. Context shapes meaning, and effect size conventions don't substitute for substantive thinking about what the numbers represent.
The Replication Crisis: When the Findings Don't Hold Up
Beginning around 2011 and accelerating through the decade, psychology confronted what has become known as the replication crisis: a wave of high-profile failures to reproduce earlier findings, combined with growing evidence that many standard practices in the field had systematically inflated the apparent success rate of published research.
The crisis was documented dramatically by the Open Science Collaboration (2015), which attempted to replicate 100 studies published in top psychological journals. The results were sobering: only about 39% of the replications showed a significant effect in the same direction as the original, and the average effect size in replications was approximately half that of the originals. This was not a random sampling of weak studies — these were papers from prestigious journals that had been treated as established findings.
Attraction research has not been spared.
Several high-profile findings about physical attractiveness, hormonal influences on partner preference, and the effects of ambient cues on attraction have either failed to replicate or replicated with substantially smaller effect sizes than originally reported. Some of these failures have been disputed by original researchers; others have prompted genuine revision of the literature.
The mechanisms behind the replication crisis are now reasonably well understood. They include:
Underpowered studies. For decades, studies in attraction research were conducted with samples of 20–60 participants. With samples this small, random variation can easily produce a significant p-value, and effect size estimates are imprecise. A study that found d = 0.8 with N = 40 might find d = 0.3 with N = 400 — not because the effect disappeared, but because the original estimate was inflated by noise. (A simulation sketch after this list makes the inflation concrete.)
Flexible analysis practices. Researchers have many degrees of freedom in data analysis: which outliers to exclude, which covariates to include, which comparisons to make. If researchers (often unconsciously) make these decisions in ways that favor significant outcomes, the published literature will systematically overstate effect sizes and success rates. This is sometimes called "p-hacking," though in most cases it involves motivated reasoning rather than intentional fraud.
Hypothesizing after results are known (HARKing). When researchers generate and test multiple hypotheses but only report the ones that worked, the published paper presents what looks like a confirmatory test of a pre-specified prediction. What it actually represents is an exploratory finding that has been post-hoc reframed.
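The first of these mechanisms, underpowered studies, is easy to demonstrate by simulation. Below, a true effect of d = 0.3 is studied thousands of times with 20 participants per group; among the minority of runs that reach p < .05, the average estimated effect lands far above the truth. All numbers are simulated.

```python
# Simulation of the "winner's curse": a true d = 0.3 studied with n = 20 per
# group. Only runs reaching p < .05 are kept, mimicking publication.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n, n_studies = 0.3, 20, 5000
significant_ds = []

for _ in range(n_studies):
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_d, 1.0, n)
    _, p = stats.ttest_ind(b, a)
    if p < 0.05:
        pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        significant_ds.append((b.mean() - a.mean()) / pooled)

print(f"true d = {true_d}")
print(f"mean d among significant runs = {np.mean(significant_ds):.2f}")
# The significance filter inflates the average published estimate far above 0.3.
```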
🧪 Methodology Note: Pre-Registration as a Partial Solution One response to the replication crisis has been the adoption of pre-registration: researchers publicly register their hypotheses, methods, and analysis plans before collecting data. This makes HARKing much harder (you can't claim you predicted something you found post-hoc if your pre-registration shows no such prediction) and constrains flexible analysis practices. The Open Science Framework (OSF) maintains a registry where researchers can time-stamp their pre-registrations. Pre-registration is not a perfect solution — registered reports can still have methodological weaknesses — but it meaningfully addresses some of the most common sources of bias.
Publication Bias: The File Drawer Problem
Imagine, for a moment, that you ran a study. You had a hypothesis — let's say, that people are more attracted to others who make eye contact during a brief conversation. You ran the study with 60 participants. The result was p = .23 — not significant. You write it up, submit it to a journal, and the editor rejects it because "the null result doesn't advance the field." You put it in your file drawer. This is the file drawer problem.
Now imagine this scenario repeated across hundreds of labs, hundreds of studies, thousands of file drawers. The studies that found significant effects get published. The studies that didn't — whether because the effect was real but the study was underpowered, because the effect is smaller than reported in the published literature, or because the effect doesn't exist at all — disappear. The meta-analyst who comes along later and tries to synthesize the literature looks at what's published and concludes: "Eye contact strongly increases attraction ratings, d = 0.6, k = 23 studies."
But what about the 40 unpublished studies in those file drawers?
Publication bias is a systemic distortion of the scientific record, and attraction research is particularly vulnerable to it for several reasons. The topic generates strong hypotheses from multiple theoretical directions — evolutionary psychology predicts X, social exchange theory predicts Y, feminist theory predicts Z. Researchers enter the field with expectations, and expectations interact with the many degrees of analytical freedom described above. A p = .23 feels, in such a context, like a failure; a p = .04 feels like success.
📊 Research Spotlight: Funnel Plot Asymmetry Meta-analysts use a tool called the funnel plot to assess publication bias. In an unbiased literature, effect sizes from large studies and small studies should cluster symmetrically around the true effect. When small studies consistently show larger effects than large studies — creating an asymmetric funnel — this suggests that small studies with small effects (or null results) were not published. Funnel plot asymmetry is detectable in many areas of attraction research, including studies of mate preference and physical attractiveness. Its presence doesn't prove the effect doesn't exist; it does mean the true effect size is likely smaller than the published literature suggests.
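The same logic can be simulated. The sketch below generates studies of varying size around a true d of 0.2, lets a file drawer swallow the small null results, and checks whether effect size rises with standard error, which is the asymmetry a funnel plot makes visible. This is simulated data, not real attraction research.

```python
# Simulated publication bias: small studies are published only if significant,
# large studies regardless. Effect size then rises with standard error, the
# asymmetry a funnel plot makes visible.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
true_d = 0.2
published_d, published_se = [], []

for _ in range(400):
    n = int(rng.integers(15, 200))  # per-group sample size varies across studies
    a = rng.normal(0.0, 1.0, n)
    b = rng.normal(true_d, 1.0, n)
    _, p = stats.ttest_ind(b, a)
    pooled = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    d = (b.mean() - a.mean()) / pooled
    se = np.sqrt(2 / n + d**2 / (4 * n))  # approximate SE of d for equal groups
    if n >= 100 or p < 0.05:  # the file drawer swallows small null results
        published_d.append(d)
        published_se.append(se)

slope = stats.linregress(published_se, published_d).slope
print(f"{len(published_d)} 'published' studies; slope of d on SE = {slope:.2f}")
```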
What Replication Failure Is Not
It's worth being precise about what a replication failure means — because popular discourse has a tendency to misread these events as proof that science is broken or that all findings are suspect. Neither conclusion is warranted.
A single failed replication of a single study is evidence that the original finding may not be robust — that it may have reflected a particular sample, a particular measurement approach, or statistical noise. It is not proof that the effect doesn't exist. Science asks for patterns of evidence, not individual studies, and a finding that fails to replicate once but succeeds in three subsequent larger pre-registered replications is more credible than the original finding, not less. The process of replication — including successful replication, failed replication, and the slow triangulation toward truth that results — is the process by which scientific knowledge is built.
What the replication crisis does justify is a recalibration of confidence. Findings from single, small, non-pre-registered studies in attraction research should be held more tentatively than findings that have accumulated across multiple independent replications with diverse methods and samples. This is not a reason to dismiss the science — it is a reason to read it carefully, which is exactly what this chapter is trying to teach you to do.
🔴 Myth Busted: "If It Was Published in a Top Journal, It Must Be True" Publication in a high-impact journal does not make a finding reliable. The journals that published many of the findings that later failed to replicate — Psychological Science, JPSP, Nature — are among the most prestigious in the field. Their prestige reflects the quality of the research they publish on average, not a guarantee that any individual study is correct. The best protection against overweighting individual findings is understanding how research programs build evidence over time — and having enough methodological literacy to evaluate what kind of evidence each study actually provides.
Inside the Okafor-Reyes Study: Methodology in Practice
By the third month of designing the Global Attraction Project, Okafor and Reyes had produced something rare in attraction research: a methods section they were both genuinely proud of and deeply anxious about. What they were attempting was ambitious by any measure — a 12-country, 5-year, mixed-methods investigation of attraction norms and behaviors, designed from the ground up to address the limitations of prior work.
The design had three components. First, a standardized survey battery administered to 300–500 participants per country (approximately 4,800 total), carefully translated and back-translated, with pilot testing in each national context to assess measurement equivalence. Second, behavioral observation sessions at naturally occurring social gatherings — community events, university mixers, family parties — with trained local observers using standardized coding schemes. Third, a subset of 20–40 participants per country completing in-depth qualitative interviews in their first language, with local research assistants conducting and transcribing.
"I need the survey to be genuinely equivalent across all sites," Reyes had said during one of their many planning calls. "Not just linguistically equivalent — conceptually equivalent. If we're measuring attraction to 'physical fitness,' we need to know that what 'physical fitness' means in Osaka isn't fundamentally different from what it means in Lagos."
Okafor's response was immediate. "But what if it is? That's the finding, Carlos. That's what the study is trying to detect. If we impose equivalence through our instruments, we might suppress the very variance we're looking for."
This is the core methodological tension in cross-cultural psychology — and their debate captures it precisely. Reyes is right that without measurement equivalence, comparing scores across cultures is comparing apples to motorcycles. But Okafor is also right that forcing equivalence can mask genuine cultural differences in the constructs themselves. Their eventual compromise: they test for measurement equivalence statistically (using confirmatory factor analysis with varying levels of constraint), report which items function equivalently across cultures and which do not, and treat non-equivalent items as a substantive finding rather than a measurement failure.
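The team's actual analysis, multi-group confirmatory factor analysis with equality constraints, is typically run in dedicated SEM software (lavaan in R, for example) and is beyond a short sketch. But a crude proxy conveys the idea: fit a one-factor model separately per country and compare item loadings. The simulated data below build in one item that behaves differently in one country, exactly the kind of non-equivalence the project treats as a finding.

```python
# Crude stand-in for invariance testing: fit a one-factor model per country
# and compare item loadings (the sign of a loading vector is arbitrary). The
# simulated data give item4 a much weaker loading in country B.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(3)

def simulate_country(loadings, n=400):
    latent = rng.normal(size=(n, 1))  # one latent "attraction" factor
    noise = rng.normal(scale=0.6, size=(n, len(loadings)))
    return latent @ loadings[None, :] + noise

country_a = simulate_country(np.array([0.8, 0.7, 0.6, 0.7]))
country_b = simulate_country(np.array([0.8, 0.7, 0.6, 0.1]))  # item4 diverges

for name, data in (("country A", country_a), ("country B", country_b)):
    fa = FactorAnalysis(n_components=1).fit(data)
    print(name, np.round(fa.components_[0], 2))
# Comparable loadings suggest the items behave equivalently across countries;
# item4's collapse in country B is the kind of non-equivalence the project
# would report as a substantive finding rather than discard as noise.
```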
The sampling controversy was sharper. Okafor had proposed stratified sampling in each country, with deliberate oversampling of rural, lower-income, and less-educated populations — groups typically excluded from attraction research even within the nations that do it. Reyes worried about the trade-offs: oversampling these groups in some countries but not others might actually create non-equivalence rather than fix it.
"You're talking about a comparison problem," Okafor said, not for the first time. "But I'm talking about a representation problem. If we sample only university-educated urban adults in Nigeria and South Africa and Brazil, we haven't solved the WEIRD problem — we've just expanded it by one letter. We'll have Western-Educated-Industrialized-Rich-Democratic people in twelve countries instead of one."
It was Reyes who eventually conceded the point — at least partially. The final design uses stratified probability sampling in all 12 countries, with representation targets for education, income, and urbanicity. It isn't a perfect solution; no solution is. But it is a principled one.
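Here is a minimal sketch of what stratified sampling with representation targets might look like, assuming a hypothetical sampling frame with education and urbanicity strata. The equal per-stratum allocation deliberately oversamples small strata, in the spirit of Okafor's proposal; real probability sampling would also carry sampling weights for population-level estimates.

```python
# Hypothetical sampling frame; equal per-stratum draws deliberately oversample
# small strata so rural and less-educated groups are not crowded out.
import numpy as np
import pandas as pd

rng = np.random.default_rng(4)
frame = pd.DataFrame({
    "person_id": np.arange(10_000),
    "education": rng.choice(["primary", "secondary", "tertiary"],
                            size=10_000, p=[0.30, 0.45, 0.25]),
    "urbanicity": rng.choice(["rural", "urban"], size=10_000, p=[0.40, 0.60]),
})

sample = frame.groupby(["education", "urbanicity"]).sample(n=50, random_state=7)
print(sample.groupby(["education", "urbanicity"]).size())
# Analyses that generalize to the population would then apply sampling weights
# proportional to each stratum's share of the frame.
```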
⚖️ Debate Point: Feminist Standpoint Epistemology Okafor approaches methodology through what scholars call feminist standpoint epistemology: the view that who we are — our social position, our lived experience — shapes what we can know and how we know it. From this perspective, the choice of who conducts research, who is studied, and whose categories frame the instruments is not a neutral technical decision but a political one with epistemic consequences. A field that has been designed by, conducted by, and studied with WEIRD populations doesn't just have a sampling problem — it has a knowledge problem. Reyes finds value in this perspective while worrying that it can slide toward dismissing quantitative evidence altogether. Their ongoing tension over this point makes for some of the most productive debates in attraction research today.
Translating Instruments: The Linguistics and Politics of Cross-Cultural Measurement
One episode from the early design phase of the Global Attraction Project illustrates the depth of the problem that Okafor and Reyes were trying to solve. The project's survey included a standard scale measuring "relationship investment" — the degree to which participants had committed resources (time, emotional energy, foregone alternatives) to a current or recent relationship, and the degree to which they intended to continue doing so. It was a well-validated scale with strong psychometric properties in American samples.
When the team's collaborator at the University of Lagos conducted a pilot translation and back-translation, a problem emerged immediately. Several items referred to "sacrificing for the relationship" — a concept that, in the American English original, carries implicit assumptions about individual autonomy and deliberate choice. In the pilot interviews, Nigerian participants tended to interpret "sacrifice" in a communal frame: something you did for family networks and social obligations, not for a dyadic romantic relationship per se. The item was measuring something, but it wasn't clear it was measuring the same thing in both contexts.
Reyes's instinct was to revise the items until they achieved back-translation equivalence — to produce a version that could be verified as linguistically equivalent to the original. Okafor's position was more radical: perhaps the very construct of "relationship investment," as measured by items anchored in American individualist assumptions about dyadic romance, was not a culturally portable concept. Perhaps trying to achieve equivalence was less important than first understanding whether the construct itself had the same meaning.
Their solution — a preliminary qualitative phase in each country to understand local constructions of relationship commitment before finalizing the survey items — added six months to the design phase and significant cost. It also produced genuinely important data: in three of the twelve countries, the qualitative interviews revealed that the most salient dimensions of relationship commitment were substantially different from those captured by the original scale. Those dimensions were added to a new set of locally developed items, which were then tested for cross-cultural comparability using a partial invariance framework.
This is, in miniature, what rigorous cross-cultural psychology looks like. It is slower, more expensive, more epistemologically humble, and more intellectually honest than simply translating existing instruments and running them in new populations. It is also, Okafor would argue, the only methodological approach that has any hope of producing knowledge that is actually about human beings rather than about Americans and their closest cultural neighbors.
Self-Report vs. Behavioral vs. Physiological Measures
One of the central challenges in attraction research is triangulating across different types of evidence. Each measurement strategy has characteristic strengths and weaknesses, and they don't always agree with each other.
Self-report measures — surveys, questionnaires, rating scales — are cheap, scalable, and give direct access to subjective experience. But they are influenced by social desirability bias (people tend to report preferences and behaviors that sound more acceptable), by limited self-insight (we often don't know why we feel what we feel), and by the demand characteristics of the research context (participants who think they're in a study about gender will answer questions differently than participants who don't).
Behavioral measures — recording who participants approach, how long they look at a face, whether they choose to continue a conversation — have better ecological validity than self-reports and are less subject to conscious distortion. But behavior in laboratory contexts is still constrained by those contexts, and the gap between approach behavior in a lab and partner choice in a real-world social context is substantial.
Physiological measures — heart rate, skin conductance, pupil dilation, genital blood flow (in sexuality research), cortisol levels, neural activation patterns — are often presented as the "objective" ground truth beneath subjective experience. This framing is misleading. Physiological responses measure biological processes that are imperfectly correlated with subjective experience. Pupil dilation responds to cognitive load, emotional arousal, and lighting conditions — and these are hard to disentangle. Skin conductance responds to general arousal, not specifically to attraction. The appeal to physiology as a privileged measure of "what people really feel" reflects a kind of naive biologism that the best researchers in the field explicitly reject.
The Demand Characteristics Problem
A persistent challenge in attraction research that deserves its own discussion is demand characteristics: the ways that participants' awareness of being in a study — and their theories about what the study is "about" — shape their responses in ways that have nothing to do with the psychological processes under investigation.
Martin Orne, who coined the term in 1962, argued that research participants are not passive subjects but active, motivated agents who try to make sense of the experimental situation and behave in ways that they perceive as helpful to the researcher. This "good subject role" is a fundamental feature of human sociality — we are attentive to social contexts and we calibrate our behavior accordingly. In an attraction research context, a participant who has correctly guessed that the study is examining how attractiveness affects perceived intelligence will rate the attractive target's essay differently than a participant who believes the study is examining essay quality per se.
Researchers have developed various strategies to minimize demand characteristics: elaborate cover stories (which raise ethical concerns), pre-screening to exclude participants who guess the hypothesis (which raises validity concerns), and measuring behavior rather than self-report (which, as we've seen, has its own limitations). None of these fully solves the problem. Demand characteristics are, to some degree, an irreducible feature of psychological research with human participants.
In the context of the Global Attraction Project, demand characteristics take on a cross-cultural dimension. The "good subject role" may manifest differently across cultural contexts — what it means to be helpful to a researcher, how much social desirability concerns shape responses, and what participants assume researchers want to hear may all vary. A participant in Stockholm might engage with a study on attraction norms very differently from a participant in Marrakech, even holding all else equal, because the social meaning of participating in academic research and the appropriate performance in that social role are culturally variable.
Ethical Review and Participant Protection
All research involving human participants at accredited institutions must be reviewed by an Institutional Review Board (IRB) — an independent committee charged with protecting participants from harm. For attraction research, this process is often more complicated than it might initially appear.
The Global Attraction Project had to navigate IRB review at twelve institutions across twelve countries with very different research ethics frameworks. What constitutes "minimal risk" varies. Protocols around anonymity differ. Informed consent procedures that work seamlessly in a Western research context — written consent forms, explicit debriefing — may not translate straightforwardly to contexts where written documents carry different social meanings or where full disclosure of the study's purpose might cause social harm.
Okafor and Reyes spent months in consultation with local ethics boards before their international partners could begin data collection. In several countries, the behavioral observation component required particularly careful negotiation: how do you observe naturally occurring behavior at a social gathering without deceiving participants? Their solution — conspicuously posted notices at event venues describing the research and offering an opt-out mechanism — was imperfect but defensible.
🔵 Ethical Lens: Deception in Attraction Research Some classic attraction studies used deception — telling participants they were in one kind of study while actually running another. For example, early research on the "arousal transfer" hypothesis had participants cross a scary suspension bridge before being interviewed by an attractive research confederate, without disclosing that the bridge crossing was part of the study design. Modern IRB standards have made such designs much harder to conduct, for good reason. But this creates a methodological bind: some attraction phenomena may only be observable when participants don't know they're being studied. Researchers must weigh the scientific value of deceptive designs against the ethical obligation to treat participants with respect.
The Nature vs. Nurture Problem in Method Choice
One underappreciated dimension of methodological choice in attraction research is how research designs can carry implicit theoretical commitments — particularly regarding the biology-culture dialectic that runs through this entire book.
Consider how you would design a study to answer the question: "Is physical symmetry universally attractive?" If you believe that symmetry preferences are evolved adaptations reflecting developmental stability, you will probably design a cross-cultural study that measures symmetry preferences using standardized stimuli across many populations, treating consistency across cultures as evidence for evolutionary origin. If you believe that symmetry preferences reflect culturally transmitted aesthetic norms that happen to be widespread, you will design a study that looks for variation in the strength of symmetry preferences across different cultural contexts and correlates that variation with cultural variables, treating variation as the signal rather than noise.
Both researchers are studying the same question. But their methodological choices — what they look for, how they measure it, what would count as confirmatory evidence — are shaped by their prior theoretical commitments. And this means that the results they obtain are not straightforwardly neutral "findings" but answers to questions that have been shaped by assumptions.
Reyes is acutely aware of this problem, perhaps because he straddles both camps. As an evolutionary psychologist, he has strong priors about the biological underpinnings of attraction. As a methodologically rigorous empiricist, he is committed to the principle that his priors should inform but not determine his conclusions. This is harder to achieve than it sounds. Confirmation bias — the tendency to seek, weight, and remember information that confirms existing beliefs — is not a moral failing but a cognitive feature of every human researcher. The methodological safeguards that good science provides (pre-registration, replication, peer review) are, in part, safeguards against the researchers' own confirmation bias.
Okafor adds another layer: even the choice of what questions to ask is not neutral. The question "Is physical symmetry universally attractive?" presupposes that symmetry is the relevant variable, that attractiveness is the outcome of interest, and that universality is the right frame for evaluation. A researcher coming from a different theoretical tradition might ask: "How do aesthetic standards of the body function as mechanisms of social inclusion and exclusion?" or "What does the cross-cultural variation in symmetry preferences tell us about the relationship between mate preference and local ecological conditions?" These are not the same question, and they generate different research programs. The choice of question is a theoretical act, and every methodological choice downstream is shaped by it.
🧪 Methodology Note: Open Science Practices Beyond pre-registration, the open science movement has advocated for several practices that improve transparency and replicability: sharing raw data and analysis code publicly (so other researchers can reproduce and audit analyses), publishing pre-prints before peer review (so findings enter the scientific conversation faster), and conducting Registered Reports (a publication format in which the journal commits to publishing the paper based on the quality of the design and planned analyses, before data collection begins — regardless of results). These practices are becoming more common in attraction research, though adoption is uneven. When you encounter papers that have used open science practices, they represent a higher evidentiary standard than papers that have not.
How to Read a Research Paper: A Student's Guide
One of the most practical skills this chapter can give you is the ability to read an attraction research paper critically — not dismissively, but with the informed skepticism that good science both requires and rewards.
When you encounter a claim about attraction in the popular press ("Scientists discover the secret to instant attraction!") or even in a peer-reviewed journal, ask these questions:
1. What is the sample? How many participants? What was their demographic profile? Is the sample WEIRD? Is it student-only? Are there important groups missing?
2. What is the design? Is this experimental (random assignment to conditions), correlational, observational, or survey-based? What does this design allow the authors to conclude?
3. What is the effect size? Not just whether p < .05, but how large is d (or r, or η²)? Does the effect size match the headline claim?
4. Has it been replicated? Is this a single study or a pattern across studies? Are there pre-registered replications? What does the meta-analytic literature show?
5. Who funded it? Not to dismiss findings from industry-funded research wholesale, but to be alert to the ways that funding sources can shape which questions get asked and which results get published.
6. How does it measure attraction? Self-report? Behavioral? Physiological? What are the limitations of that measurement approach?
7. What does the study NOT tell us? What questions remain open? What populations aren't addressed? What alternative explanations aren't ruled out?
8. Is there a pre-registration? Search the Open Science Framework (osf.io) for the study's pre-registration. If one exists, check whether the published paper matches the registered hypotheses and methods. Deviations are not automatically disqualifying, but they should be flagged and explained.
9. Where is the confidence interval? An effect size without a confidence interval is an incomplete finding. The interval tells you how much uncertainty surrounds the point estimate. Wide confidence intervals — especially those that include zero — suggest a finding that deserves considerable tentativeness.
10. What is the theoretical framework? Is the study grounded in an evolutionary framework, a social psychological framework, a feminist framework, or some combination? Understanding the theoretical commitments behind a study helps you evaluate whether the interpretation the authors offer is the only plausible one, or whether alternative theoretical frames would read the same data differently.
This may seem like a lot of questions. It is. But good critical reading of research is a skill, and like all skills, it becomes faster and more intuitive with practice. By the end of this course, these questions should feel like second nature — not because you've become cynical about science, but because you've come to understand how science actually works.
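To make questions 3 and 9 concrete, here is a minimal sketch of how Cohen's d and its confidence interval can be computed. The ratings are simulated and the condition names are invented for illustration; the standard-error formula is the standard large-sample approximation.

```python
import numpy as np

def cohens_d_with_ci(group1, group2, z=1.96):
    """Cohen's d for two independent groups, with an approximate
    large-sample 95% confidence interval."""
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    n1, n2 = len(g1), len(g2)
    # Pooled standard deviation across the two groups
    s_pooled = np.sqrt(((n1 - 1) * g1.var(ddof=1) + (n2 - 1) * g2.var(ddof=1))
                       / (n1 + n2 - 2))
    d = (g1.mean() - g2.mean()) / s_pooled
    # Standard large-sample approximation to the standard error of d
    se = np.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d, (d - z * se, d + z * se)

# Simulated attraction ratings on a 1-7 scale (hypothetical conditions)
rng = np.random.default_rng(42)
similar_partner = rng.normal(5.2, 1.0, size=40)
dissimilar_partner = rng.normal(4.8, 1.0, size=40)

d, (lo, hi) = cohens_d_with_ci(similar_partner, dissimilar_partner)
print(f"d = {d:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Notice that with 40 participants per group, the interval spans roughly ±0.45 around the point estimate, wide enough to include zero for a small true effect. That width, not the point estimate alone, is what question 9 asks you to notice.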
✅ Evidence Summary: What We Can Conclude from Attraction Research
Despite its limitations, attraction research has produced findings robust enough to stake real claims on. Proximity and familiarity consistently predict liking. Similarity in values and personality is associated with long-term relationship satisfaction. Physical attractiveness affects social outcomes, including in contexts where it "shouldn't" by equity norms. These findings hold up across replications, different methodologies, and (with appropriate nuance) different cultural contexts. The field has real, hard-won knowledge. The task is learning to distinguish that knowledge from the noise that surrounds it.
A Note on Progress
Okafor and Reyes's methodological debates are not mere academic squabbles. They represent real tensions in how knowledge about human attraction is produced, and those tensions have real consequences for what we think we know.
The field of attraction research is in a period of productive upheaval. Pre-registration is becoming more common. Sample sizes are growing. Researchers are increasingly explicit about the limitations of their designs. Cross-cultural collaborations are expanding, though not fast enough. The replication crisis, painful as it has been, has generated a culture of greater methodological honesty.
What this means for you, as a reader of this book and as a thinking person in the world, is that you should hold the findings we discuss with appropriate calibration. Some things we know with real confidence. Others we know tentatively, with caveats that matter. Some things we think we know but probably don't, or know only about a particular slice of humanity. Throughout this book, we'll try to be honest about which is which.
That honesty is itself a form of rigor. Science doesn't work by never being wrong; it works by having mechanisms to detect and correct error. Understanding those mechanisms — which is what this chapter has been about — is what allows you to engage with the science of attraction as a participant rather than a consumer.
There is one more thing worth naming: methodology is not a value-neutral enterprise. The choices researchers make about whom to study, what questions to ask, which methods to use, and how to interpret ambiguous results are all shaped by the social world in which those researchers live, including the structural inequalities, cultural assumptions, and institutional incentives that shape every human endeavor. This doesn't make science impossible or meaningless. It makes the case for doing science with self-awareness, transparency, and genuine intellectual humility: being willing to say, as Reyes does when he concedes Okafor's point about stratified sampling, "I think you're right, and I need to update my view." It is also a case for making science more inclusive, not as a diversity optics exercise, but because the knowledge problems created by excluding most of humanity from either conducting or participating in research are genuinely serious. A more diverse research enterprise is a more epistemically powerful one.
Okafor put it this way, in a keynote address she gave at a conference two years before the Global Attraction Project began: "We do not study attraction in a vacuum. We study it from somewhere, with the tools available to us from somewhere, with the assumptions we absorbed from the intellectual traditions we were trained in. The best methodologists are those who can see their own somewhere — who can hold their methods at arm's length and ask, what would this look like if the somewhere were different? That's not relativism. That's rigor."
Reyes, who was in the audience that day, says it was the moment he decided to reach out to her about a collaboration.
Meta-Analysis: Synthesizing Across Studies
No individual study, however well-designed, settles a scientific question. Knowledge in attraction research — as in all empirical social science — is built by accumulating findings across multiple studies, each with its own strengths and limitations, until patterns emerge that are robust across methods, samples, and contexts.
Meta-analysis is the statistical technique for doing this systematically. A meta-analyst identifies all available studies on a given question (ideally including unpublished studies as well as published ones), codes their methodological features, extracts or calculates effect sizes, and computes a weighted average effect across the full literature. The weighting is typically by precision, the inverse of each study's sampling variance, which is driven largely by sample size: larger, more precise studies count for more than smaller, noisier ones. The result is a single summary effect size estimate, along with an assessment of how much the effects varied across studies (heterogeneity) and, increasingly, an assessment of publication bias.
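As a concrete illustration, here is a minimal sketch of a fixed-effect (inverse-variance-weighted) summary. The five studies and their effect sizes are invented for the example; real meta-analyses also involve systematic search, coding, and modeling choices this sketch ignores.

```python
import numpy as np

def fixed_effect_summary(effects, variances, z=1.96):
    """Inverse-variance-weighted (fixed-effect) summary of study effects."""
    effects = np.asarray(effects, dtype=float)
    weights = 1.0 / np.asarray(variances, dtype=float)  # precision weights
    summary = np.sum(weights * effects) / np.sum(weights)
    se = np.sqrt(1.0 / np.sum(weights))                 # SE of the summary
    return summary, (summary - z * se, summary + z * se)

# Invented effect sizes (d) and sampling variances from five studies;
# a smaller variance means a larger, more precise study.
ds = [0.55, 0.30, 0.48, 0.12, 0.40]
vs = [0.08, 0.02, 0.05, 0.01, 0.03]

summary, (lo, hi) = fixed_effect_summary(ds, vs)
print(f"summary d = {summary:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Note how the summary (about 0.26 here) sits closer to the most precise study's d of 0.12 than the simple unweighted average (0.37) would; that pull toward precision is the weighting doing its job.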
Meta-analyses are not immune to the problems they are designed to address. A meta-analysis of a literature with severe publication bias will still produce an inflated estimate, though techniques like funnel plot analysis and p-curve can help detect and partially correct for this. A meta-analysis that mixes methodologically heterogeneous studies — face-rating studies, speed-dating studies, longitudinal relationship studies — may produce a meaningless average across apples and motorcycles. And the "garbage in, garbage out" principle applies: if the primary studies are biased, the meta-analysis will aggregate that bias rather than correct it.
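The inflation is easy to demonstrate. The following sketch, offered in the same spirit as the chapter's demo code rather than reproducing it, simulates a literature with a modest true effect, files away every non-significant result, and averages what remains.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_d, n_per_group = 0.2, 30      # modest true effect, modest samples
published = []

for _ in range(2000):              # 2000 simulated two-group studies
    treated = rng.normal(true_d, 1.0, size=n_per_group)
    control = rng.normal(0.0, 1.0, size=n_per_group)
    _, p = stats.ttest_ind(treated, control)
    # Pooled SD simplifies to this form when group sizes are equal
    s_pooled = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
    d = (treated.mean() - control.mean()) / s_pooled
    if p < 0.05 and d > 0:         # the file drawer: only these get "published"
        published.append(d)

print(f"true d: {true_d}")
print(f"mean published d: {np.mean(published):.2f}")  # roughly triple the truth
```

A meta-analysis of only the "published" studies here would confidently estimate a medium-to-large effect where the true effect is small; funnel-plot asymmetry is the visual trace this kind of selection leaves behind.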
When we cite meta-analytic findings in this book, we try to note whether those meta-analyses have addressed publication bias, the degree of heterogeneity in the underlying studies, and any important moderators that suggest the overall effect may look very different in subgroups. A meta-analysis that finds a mean d of 0.4 but with enormous heterogeneity (Q statistic highly significant, I² > 75%) is telling you that the average is not very informative — the real action is in understanding why effects are so variable across studies. That is a very different message from a meta-analysis with low heterogeneity, which does support the conclusion that a consistent effect has been observed across diverse conditions.
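For readers who want to see where those statistics come from, here is a minimal sketch of Cochran's Q and I², reusing the invented studies from the fixed-effect example above. Q is the precision-weighted sum of squared deviations from the summary effect; I² re-expresses Q as the approximate share of observed variability beyond what sampling error alone would produce.

```python
import numpy as np

def heterogeneity(effects, variances):
    """Cochran's Q and I² for a set of study effect sizes."""
    effects = np.asarray(effects, dtype=float)
    w = 1.0 / np.asarray(variances, dtype=float)
    d_bar = np.sum(w * effects) / np.sum(w)       # fixed-effect summary
    q = np.sum(w * (effects - d_bar) ** 2)        # Cochran's Q
    df = len(effects) - 1                         # expected Q under homogeneity
    i_squared = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    return q, i_squared

# Same invented studies as in the fixed-effect sketch above
ds = [0.55, 0.30, 0.48, 0.12, 0.40]
vs = [0.08, 0.02, 0.05, 0.01, 0.03]

q, i2 = heterogeneity(ds, vs)
print(f"Q = {q:.2f} (df = {len(ds) - 1}), I² = {i2:.0f}%")
```

These toy studies yield an I² of around 15%, modest heterogeneity. An I² above 75%, as in the cautionary case above, would mean most of the observed spread reflects genuine differences between studies rather than sampling noise.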
Summary
This chapter has surveyed the methodological landscape of attraction research, from experimental designs to neuroimaging, from the WEIRD problem to the replication crisis and publication bias. We've followed Okafor and Reyes through the design phase of the Global Attraction Project, using their debates to illustrate real tensions in cross-cultural research methodology. We've examined what self-report, behavioral, and physiological measures can and cannot tell us, and we've developed a practical framework for reading research papers critically.
The central takeaway is not methodological nihilism — it's methodological humility. The science of attraction has produced genuine, valuable knowledge. It has also produced an uncertain body of findings distorted by sampling bias, publication pressure, and insufficient attention to who gets studied and why. Learning to navigate that landscape is not a reason to dismiss the science. It's a reason to engage with it more carefully than the headlines invite you to do.
🔗 Connections
- Chapter 1 introduced the critical framing of attraction science; this chapter gives you the tools to apply it.
- Chapter 5 returns to the Okafor-Reyes Study to examine the ethical challenges of cross-cultural consent protocols.
- Chapter 8 revisits cross-cultural methods when the first wave of data on physical attractiveness norms comes in.
- Appendix A provides a detailed research methods primer with worked examples.
- The Python code for this chapter (code/attraction_methods_demo.py) offers hands-on demonstrations of effect size calculation, t-tests, and publication bias simulation.
The next chapter examines the theoretical frameworks that organize attraction research — evolutionary, social exchange, and feminist theories of desire — and follows Jordan's seminar discussion about whether "seduction" is a useful concept at all.