Learning Objectives
- Explain how meta-analysis works and read a forest plot
- Describe how p-hacking and publication bias distort the literature
- Evaluate a press release about attraction research using the chapter's toolkit
- Apply lessons from the Okafor-Reyes global dataset to assess generalizability
In This Chapter
- 40.1 Building on Chapter 3: A More Advanced Critical Toolkit
- 40.2 Understanding Meta-Analysis: What It Is, What It Tells Us, Its Limits
- 40.3 Forest Plots: How to Read Them
- 40.4 Publication Bias in Meta-Analysis: Funnel Plots, Egger's Test, and the Trim-and-Fill Solution
- 40.5 P-Hacking and Researcher Degrees of Freedom: A Simulation-Based Illustration
- 40.6 Pre-Registration and Open Science: The Reform Movement
- 40.7 Effect Size Revisited: What Sizes Matter in Attraction Research?
- 40.8 The WEIRD Problem Revisited with Okafor-Reyes Global Data
- 40.9 How to Evaluate a Press Release About Attraction Research
- 40.10 The 10-Step Research Quality Checklist
- 40.11 From Research to Policy: The Translation Problem
- 40.12 The Question of Personal Application: From Science to Individual Life
- 40.13 Science Literacy as a Life Skill
- 40.14 Questions to Ask Any Attraction Claim
- Conclusion: Informed Consumers, Better Science
Chapter 40: Critical Thinking About Attraction Research — Becoming an Informed Consumer
It was a Thursday morning when Dr. Adaeze Okafor opened her laptop to find forty-seven Google alerts about her research. That number, in her experience, meant trouble. The alerts all linked to the same press release, which had been picked up by news outlets in nine countries. The headline read: "Scientists Discover Men Worldwide Prefer Women Who Are Less Educated." The subheadline: "New 12-country study reveals universal mating preference."
The study being reported was hers.
She spent the next three hours reading the coverage. None of it was technically false. The study had indeed found that, in six of their twelve countries, male participants rated female profiles as more desirable when the female profile indicated a slightly lower educational attainment — specifically, when comparing a graduate degree with a bachelor's degree. The coverage omitted several things: that the effect size was small (d = 0.19), that the pattern appeared in only six of twelve countries, that the six were all WEIRD-heavy (Western, Educated, Industrialized, Rich, Democratic), that their study had preregistered the hypothesis and found mixed support, and that Okafor's own interpretation was that the finding reflected cultural scripts about gender and status rather than a universal mating preference. None of that appeared in the press release, let alone the coverage.
This chapter is about what Dr. Okafor experienced from the other side of the screen — the experience of being an informed consumer of attraction research in a media environment that systematically distorts the science. Chapter 3 introduced the foundational tools: effect size, p-values, the replication crisis, the WEIRD problem. Those tools remain essential. This chapter builds a more advanced toolkit on top of them — tools for understanding what meta-analysis can and cannot tell us, how to read a forest plot, what p-hacking looks like and why it is so difficult to detect, how pre-registration changes the game, and how to evaluate the gap between a press release and the study it claims to describe.
By the end, you will have the tools to read any attraction claim — in a journal, a press release, a popular book, or a TikTok video — with calibrated skepticism and genuine curiosity. Not cynicism. Cynicism is cheap. Calibrated skepticism is earned.
40.1 Building on Chapter 3: A More Advanced Critical Toolkit
In Chapter 3, you learned the basics: that correlation does not imply causation, that effect size matters more than statistical significance, that p-values are not the probability that the null hypothesis is true, that the replication crisis is a systemic problem rather than isolated bad actors, and that WEIRD sampling creates systematic blind spots. These foundations remain non-negotiable. If any of those concepts feel shaky, revisit Chapter 3 before continuing.
This chapter assumes those foundations and builds on them. The new tools are:
- Meta-analysis and what it adds to the picture
- Forest plots as a visual communication of meta-analytic evidence
- Publication bias and how funnel plots reveal it
- P-hacking and researcher degrees of freedom — with a simulation to make it visceral
- Pre-registration and open science as structural reforms
- Generalizability as a more nuanced concept than "does it replicate?"
- The press release gap as a specific form of scientific miscommunication
These are not separate tools but parts of a coherent critical framework. Together they allow you to move from the surface of a claim ("scientists find X about attraction") to a principled assessment of how much confidence the evidence warrants.
💡 Key Insight: The goal of scientific critical thinking is not to be unable to believe anything. It is to believe things in proportion to the evidence for them. A finding that has replicated in multiple independent samples, with adequate power, across diverse cultural contexts, using mixed methods, with a pre-registered hypothesis and a published null result — that finding deserves high confidence. A finding from a single study of 200 undergraduates, with a p-value just below .05, reported in a press release that omits the effect size — that deserves much lower confidence.
40.2 Understanding Meta-Analysis: What It Is, What It Tells Us, Its Limits
A meta-analysis is a study of studies. Rather than collecting new data, a meta-analytic researcher collects all (or a representative sample of) the studies that have been conducted on a given question, extracts their effect sizes and sample sizes, and uses statistical methods to produce an aggregate estimate of the effect. The logic is straightforward but powerful: any single study has sampling error — it might over- or under-estimate the true effect by chance. If you average across many studies, the random errors should cancel out, leaving you with a more accurate estimate than any individual study could provide. This is why meta-analyses sit at the top of most evidence hierarchies in medicine, psychology, and the behavioral sciences.
The most important output of a meta-analysis is a weighted mean effect size — an estimate of the effect in the population that weights studies by their precision (which is related to sample size — larger studies count more, because they have smaller sampling error and provide more reliable estimates). This aggregate estimate is more reliable than any individual study's estimate, assuming the pool of studies being aggregated is itself an unbiased sample of all studies conducted on the question.
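To make the weighting concrete, here is a minimal sketch of inverse-variance pooling in Python. The effect sizes and sample sizes are invented for illustration, and the variance formula for Cohen's d is the standard large-sample approximation — this is a sketch of the logic, not the chapter's code file.

```python
import numpy as np

# Invented effect sizes (Cohen's d) and per-group sample sizes for five studies.
d = np.array([0.30, 0.15, 0.45, 0.22, 0.05])
n1 = np.array([60, 120, 40, 200, 90])
n2 = np.array([60, 110, 45, 190, 85])

# Standard large-sample approximation to the variance of Cohen's d.
var_d = (n1 + n2) / (n1 * n2) + d**2 / (2 * (n1 + n2))

# Inverse-variance weights: precise (large) studies count more.
w = 1.0 / var_d
d_pooled = np.sum(w * d) / np.sum(w)
se_pooled = np.sqrt(1.0 / np.sum(w))
lo, hi = d_pooled - 1.96 * se_pooled, d_pooled + 1.96 * se_pooled
print(f"Pooled d = {d_pooled:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

With these invented numbers, the largest study carries almost five times the weight of the smallest — exactly the behavior that the square sizes in a forest plot encode (Section 40.3).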
Meta-analyses also produce a measure of heterogeneity — how much the effect sizes vary across studies beyond what would be expected from sampling error alone. This is one of the most important pieces of information a meta-analysis provides, and it is frequently underreported in popular coverage of meta-analytic findings. If all studies find roughly the same effect size, heterogeneity is low and the aggregate estimate is a good summary of a consistent phenomenon. If some studies find large positive effects while others find null or negative effects, heterogeneity is high, and the aggregate estimate may be actively misleading — it is averaging across genuinely different phenomena, not simply estimating a single underlying truth with greater precision.
The most widely used meta-analytic metric for heterogeneity is I², which represents the proportion of total variance in observed effect sizes that is due to true variation across studies rather than to sampling error. An I² of 0% means all variation across studies is attributable to sampling error (consistent underlying effect); an I² of 25% is conventionally considered low heterogeneity; 50% is moderate; and above 75% suggests substantial real heterogeneity — variation that cannot be explained away by chance. When I² is high, the appropriate next step is not to abandon the meta-analysis but to investigate moderators: variables that predict which studies find larger or smaller effects. Cultural context, sample demographics, measurement method, and study design are all common moderators in attraction meta-analyses.
Fixed-effects versus random-effects models represent the most fundamental methodological choice in meta-analysis. A fixed-effects model assumes that all studies in the meta-analysis are measuring the same true underlying effect size, and that differences across studies are entirely due to sampling error. Under this model, the aggregate estimate is the best estimate of that single true effect. A random-effects model makes a more relaxed assumption: that there is genuine variation in the true effect size across studies — different populations, contexts, and operationalizations may produce genuinely different effects — and the aggregate estimate is therefore an estimate of the average effect across a distribution of true effects. In most behavioral science meta-analyses, and certainly in most attraction research meta-analyses, the random-effects model is more appropriate: we genuinely believe that effect sizes vary across cultural contexts, age groups, and measurement approaches. Using a fixed-effects model when heterogeneity is substantial produces artificially narrow confidence intervals and overstates the precision of the aggregate estimate.
The choice between models is not merely technical — it determines what the aggregate estimate means. A fixed-effects estimate says: "Here is the single true effect." A random-effects estimate says: "Here is the average effect across the range of true effects in this literature." In attraction research, where the heterogeneity I² is typically moderate to high, the second interpretation is almost always the appropriate one.
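The following sketch — again with invented numbers, this time chosen to be heterogeneous — computes Cochran's Q, I², and the DerSimonian-Laird estimate of the between-study variance (τ²), then contrasts the fixed-effect and random-effects pooled estimates:

```python
import numpy as np

# Invented, deliberately heterogeneous study effects (d) and their variances.
d = np.array([0.60, 0.10, 0.55, 0.05, -0.10])
v = np.array([0.034, 0.017, 0.048, 0.010, 0.023])

w = 1 / v
d_fe = np.sum(w * d) / np.sum(w)           # fixed-effect pooled estimate
Q = np.sum(w * (d - d_fe) ** 2)            # Cochran's Q
df = len(d) - 1
I2 = max(0, (Q - df) / Q) * 100            # % of variance beyond sampling error

# DerSimonian-Laird estimate of tau^2, the between-study variance.
c = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0, (Q - df) / c)

w_re = 1 / (v + tau2)                      # random-effects weights
d_re = np.sum(w_re * d) / np.sum(w_re)
se_re = np.sqrt(1 / np.sum(w_re))
print(f"Q = {Q:.2f} (df = {df}), I² = {I2:.0f}%, tau² = {tau2:.3f}")
print(f"Fixed-effect d = {d_fe:.3f} (SE {np.sqrt(1 / np.sum(w)):.3f}); "
      f"random-effects d = {d_re:.3f} (SE {se_re:.3f})")
```

Note that the random-effects standard error comes out larger: that is the honest price of admitting that the true effect varies across studies.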
The limits of meta-analysis are as important as its strengths:
Garbage in, garbage out. A meta-analysis of poorly designed studies produces a precise-looking summary of unreliable evidence. If all the studies in the meta-analysis use WEIRD samples, the aggregate estimate is a very reliable estimate of what happens with WEIRD participants — not necessarily what happens with everyone. Methodological rigor within the meta-analysis cannot compensate for methodological weakness in the studies it synthesizes.
The publication bias problem. Most studies that find null results do not get published. This means the pool of published studies available for meta-analysis is systematically skewed toward positive findings. A meta-analysis of published studies will overestimate true effect sizes for precisely this reason. We return to this in Section 40.4 — it is, arguably, the most serious methodological problem in meta-analysis.
The apples and oranges problem. Studies in the same meta-analysis often measure "the same thing" in quite different ways. One study of physical attractiveness might use standardized photographs; another might use self-reported attractiveness ratings; a third might measure how long participants look at an attractive face. Are these the same construct? The meta-analytic aggregation assumes they are comparable enough to aggregate, but this assumption is not always warranted. When operationalizations vary substantially, the aggregate estimate is partially an artifact of how the construct was measured rather than purely reflecting the underlying phenomenon.
There is a fourth limit that deserves mention: the statistical model assumption. Most meta-analyses use either a fixed-effects model or a random-effects model, as described above. The random-effects model is generally more appropriate when there is genuine heterogeneity in the research literature. But the random-effects model produces wider confidence intervals — which can make the aggregate finding look less definitive, even when the evidence is genuinely strong. Researchers sometimes choose fixed-effects models partly to produce more impressive-looking results, which is a subtle form of analytical manipulation that is difficult for readers to detect without examining the reported heterogeneity statistics.
Understanding meta-analysis also requires understanding its intellectual history. The method was developed in the 1970s and 1980s partly as a response to the "vote counting" approach to literature synthesis — the practice of tallying how many studies found a significant effect vs. how many did not, and declaring the majority's verdict the conclusion. Gene Glass, who coined the term meta-analysis in 1976, recognized that vote-counting was statistically illiterate: it treated a p > .05 result in a small, underpowered study the same as a p > .05 result in a large, well-powered one, despite the very different evidential weight of the two outcomes. A study with 40 participants that fails to find a significant effect has very low statistical power — it would miss a real medium-sized effect a substantial portion of the time. Treating its null result as equivalent evidence to a null from a study of 800 participants is a serious error. Meta-analysis replaced vote-counting with proper weighting by precision — a major methodological advance. But it also created new problems, primarily the publication bias issue, that are still being addressed by the open science reform movement.
📊 Research Spotlight: When the Okafor-Reyes team began their meta-analysis of their own cross-national dataset — five years of data from twelve countries, approximately 4,800 participants — they faced all of these problems simultaneously. Within their own study, different cultural contexts had used slightly different versions of the questionnaires (adapted for translation and cultural appropriateness), which created an apples-and-oranges comparison problem even in a single study. Their I² values for many key measures were above 60%, suggesting substantial heterogeneity across cultural contexts — which led Okafor to argue that many attraction findings are not singular effects but family-of-effects phenomena that need culture-specific analysis rather than aggregate synthesis.
This led the team to a decision that was scientifically and politically fraught: whether to present the aggregate pooled effect or the country-specific effects as the primary finding. Reyes favored the pooled effect, arguing that it represented the best estimate of the overall cross-cultural pattern and would be most useful for the field. Okafor favored country-specific presentation, arguing that pooling was misleading when heterogeneity was this high and would obscure the very cross-cultural variation their study was designed to document. The final paper presented both — the pooled effect in the abstract and the country-specific effects in the results — a compromise that satisfied neither position entirely but honestly represented the tension between them. This is a microcosm of a larger debate in cross-cultural psychology about whether integration or differentiation should be the primary goal of comparative research.
40.3 Forest Plots: How to Read Them
A forest plot is the standard graphical output of a meta-analysis. Learning to read one is a foundational skill for anyone who wants to evaluate the evidence on any behavioral science question. The plot in our Python simulation (Section 40.7 and the code file) will make this concrete, but here is the conceptual framework.
A forest plot typically contains the following elements, arranged in a standardized layout:
- A list of studies on the left, labeled by author and year (and sometimes by key characteristics like sample size or country). The list is usually ordered by publication year, effect size, or sample size — the ordering itself can communicate something about the literature's trajectory.
- A vertical line at zero (or at no effect) in the center — this is the "null effect" line. The position of each study's estimate relative to this line tells you whether the study found an effect in the expected or unexpected direction.
- A square for each study whose horizontal position represents the study's point estimate of the effect size (e.g., Cohen's d or an odds ratio). Crucially, the size of the square represents the study's weight in the meta-analysis, which is proportional to its precision (and thus roughly to its sample size). A large, filled square indicates a large, high-weight study; a small, faint square indicates a small, low-weight study. This size encoding is one of the forest plot's most important features — it lets you see at a glance which studies are driving the aggregate result.
- A horizontal line (whisker) through each square representing the 95% confidence interval for that study's effect estimate. A wide whisker indicates high uncertainty; a narrow whisker indicates greater precision. Any whisker that crosses the vertical zero line means that study did not find a statistically significant effect at p < .05.
- A diamond at the bottom representing the aggregate meta-analytic estimate. The center of the diamond is positioned at the pooled effect size, and the width of the diamond represents the 95% confidence interval around that aggregate. A diamond whose edges do not cross the zero line indicates a statistically significant overall effect. A narrow diamond indicates high precision; a wide diamond indicates substantial uncertainty even in the aggregate.
The forest plot was developed as a way to communicate, visually, both the individual study evidence and the aggregate conclusion in a single display. The name comes from the appearance of the plot: many horizontal lines, each representing a study, creating a forest-like visual density. The metaphor is apt in another sense too: like a real forest, a forest plot can be beautiful and intimidating at the same time, and the expert reader can see things in it that the novice cannot.
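Here is a minimal matplotlib sketch of these elements — squares sized by weight, CI whiskers, the null line, and a pooled diamond. The studies and numbers are invented; the chapter's code file builds a fuller version.

```python
import numpy as np
import matplotlib.pyplot as plt

# Invented studies: labels, effect sizes (d), and standard errors.
labels = ["Study A", "Study B", "Study C", "Study D", "Study E"]
d = np.array([0.42, 0.10, 0.35, 0.18, 0.25])
se = np.array([0.20, 0.08, 0.15, 0.06, 0.12])

w = 1 / se**2                                  # inverse-variance weights
d_pool = np.sum(w * d) / np.sum(w)             # pooled (fixed-effect) estimate
se_pool = np.sqrt(1 / np.sum(w))

y = np.arange(len(d), 0, -1)                   # one row per study, top to bottom
fig, ax = plt.subplots(figsize=(6, 4))
ax.errorbar(d, y, xerr=1.96 * se, fmt="none", ecolor="gray")  # 95% CI whiskers
ax.scatter(d, y, s=600 * w / w.max(), marker="s", color="k")  # square area ~ weight
ci = 1.96 * se_pool                            # diamond spans the pooled 95% CI
ax.fill([d_pool - ci, d_pool, d_pool + ci, d_pool], [0, 0.3, 0, -0.3], "k")
ax.axvline(0, linestyle="--", color="gray")    # null-effect line
ax.set_yticks([*y, 0])
ax.set_yticklabels([*labels, "Pooled"])
ax.set_xlabel("Cohen's d")
plt.tight_layout()
plt.show()
```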
Reading a forest plot step by step:
Before examining any forest plot in detail, establish the questions you are asking: Are these studies all finding approximately the same effect? Is the aggregate significant? How wide are the individual confidence intervals? How large are the squares? A forest plot is not meant to be read from top to bottom like a table — it is meant to be scanned as a whole first, giving you a gestalt impression of the literature's consistency and direction, before you examine individual elements.
Step 1 — Scan the whole plot for consistency of direction. Are all the squares on the same side of the zero line? If most squares cluster to the right of zero (positive effect), the literature is fairly consistent. If squares are scattered on both sides, the evidence is mixed and the aggregate may be misleading. Smaller studies should scatter more widely around the aggregate than larger ones; gross asymmetry in that scatter suggests either heterogeneity or publication bias — a pattern examined more formally with funnel plots in Section 40.4.
Step 2 — Examine the square sizes. The large squares dominate the aggregate. If two or three large studies all show modest effects while many small studies show large effects, the aggregate will be closer to the modest end — the small studies simply don't carry much weight. This pattern is actually a publication-bias warning sign: small studies that show up in a meta-analysis are disproportionately the ones that found large effects (because small studies with modest effects often don't get published).
Step 3 — Examine the confidence intervals. How many individual study whiskers cross the zero line? A whisker that crosses zero means the individual study alone is not significant, even if the aggregate is. Seeing many whiskers that cross zero while the aggregate diamond does not cross zero is expected — that is precisely what meta-analysis is for. But if even the diamond crosses zero, the aggregate effect is not statistically significant.
Step 4 — Look at the diamond size. A narrow diamond indicates high aggregate precision, driven by many large or highly consistent studies. A wide diamond suggests substantial remaining uncertainty. Some researchers report the prediction interval — an interval that encompasses where you would expect the true effect to fall in a new, randomly sampled study — which is almost always wider than the confidence interval around the aggregate and gives a more honest sense of how variable the effect is across contexts.
Step 5 — Note the heterogeneity statistics. Most forest plots include the I² value and a Q-test result at the bottom. High I² with a significant Q-test means that the variation across studies exceeds what chance would produce, and the aggregate estimate should be interpreted cautiously. Look for subgroup analyses or moderator analyses that attempt to explain the heterogeneity.
Step 6 — Consider what is not in the plot. Forest plots show you the studies that exist in the published literature. They cannot show you the studies that were conducted and not published, the studies that were planned and not conducted, or the measurement decisions that were made differently in each study. The expert reader looks at a forest plot not just as a display of what is there but as a window into a literature whose full extent is partially invisible.
⚠️ Critical Caveat: A forest plot can make a shaky body of evidence look impressively systematic. The visual formality of the plot — the neat rows of squares and whiskers, the authoritative diamond — confers an air of scientific rigor that the underlying studies may not warrant. Always look behind the forest plot to the quality and diversity of the individual studies it aggregates. A beautiful forest plot built from twelve small, single-site WEIRD studies tells you less than three large, well-powered cross-national studies, even if it looks more comprehensive.
40.4 Publication Bias in Meta-Analysis: Funnel Plots, Egger's Test, and the Trim-and-Fill Solution
Publication bias is the tendency for positive findings to be published and null findings to be filed away — the "file drawer problem" introduced in Chapter 3. In the context of meta-analysis, publication bias is especially dangerous because meta-analyses are supposed to be definitive summaries of the evidence. If the evidence pool is biased, the meta-analysis is biased — and the statistical sophistication of the meta-analytic method may actually make the bias more difficult to detect, because the formality of the output implies thoroughness even when the inputs are systematically incomplete.
A funnel plot is the primary graphical tool for detecting publication bias. In an unbiased body of literature, studies with larger samples should cluster closely around the true effect size (because they have smaller sampling error), while studies with smaller samples should fan out more widely (because they have larger sampling error). Plotted with effect size on the horizontal axis and study size (or precision) on the vertical axis, the resulting display should look like an inverted funnel — wide at the bottom (small studies, high variance) and narrow at the top (large studies, low variance), symmetrical around the true effect. The symmetry is the key diagnostic feature: in an unbiased literature, small studies should overestimate the true effect about as often as they underestimate it, because sampling error works equally in both directions.
When there is publication bias, the lower-left quadrant of the funnel (small studies with small or negative effects) is missing — those studies were conducted, but they were not published. The funnel becomes asymmetrical: full and positive on one side, missing on the other. The visual asymmetry of the funnel plot is a warning sign that the meta-analytic estimate may be inflated because the negative and null results are systematically absent from the literature.
Egger's test (Egger et al., 1997) is a formal statistical test for funnel plot asymmetry. It uses a weighted linear regression to test whether there is a statistically significant relationship between effect size and study size across the studies in a meta-analysis. In an unbiased literature, there should be no such relationship — small studies and large studies should cluster around the same true effect. When Egger's test is significant, it indicates asymmetry consistent with publication bias. Importantly, Egger's test has its own limitations: it has low power in meta-analyses with few studies, and funnel plot asymmetry can arise from causes other than publication bias (including genuine heterogeneity and the apples-and-oranges problem). A significant Egger's test is therefore a signal to investigate, not a proof of bias.
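The sketch below simulates a censored literature and runs an Egger-style regression of the standardized effect on precision; a nonzero intercept signals asymmetry. All parameters are invented, and the code uses SciPy's linregress (its intercept_stderr field requires a reasonably recent SciPy version).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Simulate 200 studies of a true effect d = 0.20, then censor them:
# non-significant studies are "published" only 30% of the time.
n_studies = 200
se = rng.uniform(0.05, 0.40, n_studies)        # small SE = large study
d = rng.normal(0.20, se)                       # observed effect per study
significant = np.abs(d / se) > 1.96
published = significant | (rng.random(n_studies) < 0.30)
d_pub, se_pub = d[published], se[published]

# Egger's regression: standardized effect (d/se) on precision (1/se).
# In an unbiased literature the intercept is ~0; bias pushes it positive.
res = stats.linregress(1 / se_pub, d_pub / se_pub)
t_int = res.intercept / res.intercept_stderr
p_int = 2 * stats.t.sf(abs(t_int), df=len(d_pub) - 2)
print(f"Egger intercept = {res.intercept:.2f} (p = {p_int:.4f})")

naive = np.average(d_pub, weights=1 / se_pub**2)
print(f"Pooled d from published studies = {naive:.3f} (true d = 0.200)")
```

Because the missing studies are disproportionately small-and-null, the pooled estimate from the published subset drifts above the true value — exactly the inflation the funnel plot is designed to expose.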
The trim-and-fill method (Duval & Tweedie, 2000) goes a step further: it attempts to estimate the number of missing studies (studies that would fill in the empty side of the funnel), impute those missing studies, and recalculate the aggregate effect with the imputed studies included. The result is a corrected effect size estimate that accounts for the estimated publication bias. The method is useful as a sensitivity analysis — if the trim-and-fill correction substantially changes the aggregate effect, that is strong evidence that publication bias is distorting the uncorrected estimate. If the correction is small, the results are probably more robust to publication bias. Like all imputation methods, trim-and-fill is not a perfect solution; the imputed studies are statistical artifacts, not real data, and the correction assumes a specific pattern of publication bias that may not match the actual literature.
A more recent approach is p-curve analysis (Simonsohn, Nelson, & Simmons, 2014), which examines the distribution of reported p-values across studies, specifically those that just cross the significance threshold (p between .01 and .05). The logic is clever: if a true effect exists, most significant studies should have p-values well below .05 (because the studies have adequate power to detect the real effect). If only publication bias and p-hacking are operating — if the studies are finding significant results only by chance or by analytical manipulation — then p-values should pile up just below .05, because that is the threshold that determines publication. A "right-skewed" p-curve (p-values piling up at the low end) suggests a real effect; a "flat" or "left-skewed" p-curve (p-values clustering near .05) suggests that reported significance is partly or largely artifactual. P-curve analysis has become a standard tool in the open science literature and has been applied to several attraction research meta-analyses with illuminating results.
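A p-curve can be approximated by simulation: generate many studies, keep only the significant ones (as publication would), and see where their p-values fall. A minimal sketch with invented parameters:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def significant_pvalues(true_d, n=30, sims=20_000):
    """p-values of simulated two-group studies that reach p < .05."""
    x = rng.normal(0.0, 1.0, (sims, n))
    y = rng.normal(true_d, 1.0, (sims, n))
    p = stats.ttest_ind(y, x, axis=1).pvalue
    return p[p < 0.05]

for label, true_d in [("real effect (d = 0.5)", 0.5), ("null effect (d = 0)", 0.0)]:
    p = significant_pvalues(true_d)
    print(f"{label}: {np.mean(p < 0.025):.0%} of significant p-values fall below .025")
# A real effect piles p-values at the low end (right-skewed curve);
# under the null, the significant p-values are spread evenly up to .05.
```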
🧪 Methodology Note: In the Okafor-Reyes meta-analysis of their own data, they could assess a form of within-study "publication bias" — specifically, whether the measures that showed significant results in individual country samples were more likely to be highlighted in country-specific reports than measures that did not reach significance. They found modest evidence of this: significant country-level findings received about 1.7 times as many words in preliminary reports as non-significant ones. They also conducted a p-curve analysis on the subset of findings that had been reported as significant in at least three of their twelve country samples, finding a right-skewed distribution — consistent with real underlying effects — for warmth and responsiveness measures, but a notably flatter distribution for some of the status and resource-cue findings, suggesting those may be more vulnerable to p-hacking or selective reporting. This is not traditional publication bias, but it is the same underlying psychology — the human tendency to find positive results more interesting and more worth reporting.
40.5 P-Hacking and Researcher Degrees of Freedom: A Simulation-Based Illustration
P-hacking is the practice — often unconscious — of making multiple analytic decisions until a statistically significant result appears, then reporting only the final significant result as if it were the first analysis attempted. It is not, in most cases, deliberate fraud. It is the result of what psychologist Andrew Gelman calls "the garden of forking paths" — the many legitimate-seeming analytic choices that researchers face at every step of data collection and analysis, and the subtle but powerful way in which awareness of current results shapes which paths are taken.
Researcher degrees of freedom include decisions at multiple stages of the research process:
During data collection: When should I stop collecting data? The temptation to check significance as data accumulates — and to stop when significance is achieved — dramatically inflates false positive rates. If a researcher checks significance after every 10 participants and stops when p < .05 is achieved, the false positive rate for a nominal α = .05 test rises to approximately 22% rather than 5%. (A simulation after these four stages shows the inflation directly.)
During data preparation: Which participants should be excluded? Participants who failed attention checks, who responded too quickly, who gave implausibly extreme answers, who are outliers on the dependent variable — all of these are legitimate exclusion criteria. But decisions about which criteria to apply, and when to apply them, can be made after seeing the data. Excluding a few participants after observing that their responses are "unusual" (where "unusual" partly means "not in the expected direction") is difficult to distinguish from legitimate quality control.
During analysis: Which covariates should be included in the regression model? Age? Gender? Both? Neither? Including a covariate can shift the significance of a predictor. Which outcome variable should be the primary one — the full composite measure, or just the subscale that shows the expected pattern? How should the independent variable be coded — as continuous, or dichotomized at the median?
During interpretation: Which of the multiple tests conducted should be highlighted as the primary finding? If twelve hypotheses were tested and three reached p < .05, was the study designed to test those three, or is the researcher reporting the three that happened to reach significance?
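Here is a minimal, self-contained simulation of the optional-stopping problem flagged above. Both "groups" are drawn from the same distribution — the null is true by construction — yet peeking every ten participants and stopping at the first p < .05 pushes the false positive rate far above the nominal 5%. Sample sizes and peeking schedule are invented for illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_sims, start_n, batch, max_n = 2_000, 20, 10, 200
false_positives = 0

for _ in range(n_sims):
    # Both groups come from the same distribution: the null is true.
    x, y = rng.normal(size=max_n), rng.normal(size=max_n)
    n = start_n
    while n <= max_n:
        if stats.ttest_ind(x[:n], y[:n]).pvalue < 0.05:
            false_positives += 1   # stop and "publish" at first significance
            break
        n += batch                 # otherwise, collect ten more per group

print(f"False positive rate with optional stopping: "
      f"{false_positives / n_sims:.1%} (nominal rate: 5.0%)")
```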
Each of these decisions is individually justifiable. Any competent methodologist could defend any specific choice. Together, they can inflate the false positive rate to well above the nominal 5%. Simmons, Nelson, and Simonsohn (2011) — in what became one of the most-cited papers of the replication crisis era — demonstrated through simulation that a researcher making just four seemingly innocuous analytic choices (collect more data when initial results are not significant; choose whether to include a covariate; drop one of three conditions; combine dependent measures) could achieve a false positive rate of 60.7% while maintaining a nominal α = .05 threshold. Their paper, "False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant," demonstrated this with a study finding that listening to a Beatles song made participants literally younger — a patently absurd conclusion that followed, statistically, from a p-hacked analysis.
The term "researcher degrees of freedom" was introduced by Simmons and colleagues in that 2011 paper. Their argument was not that researchers are dishonest. Their argument was that the standard research workflow — collect data, explore, analyze, report — creates an incentive structure that rewards the discovery of significant results and punishes the reporting of null ones. Within that structure, the normal human cognitive tendencies toward pattern-finding and confirmation bias systematically produce inflated false positive rates without any individual actor intending to commit fraud. The process is so natural that researchers who engage in it typically do not recognize it as problematic; they experience themselves as making reasonable scientific judgment calls.
The practical magnitude of this problem is sobering. Wicherts et al. (2016) identified a set of analytic "forks" — legitimate-seeming analytic choices — in a typical experiment and found that the number of possible final analyses in a simple 2×2 factorial design was in the thousands. Across those thousands of analyses, the probability that at least one reaches p < .05 by chance, in data where the null is true, approaches certainty. This is not a theoretical concern; it is the actual landscape within which most published social psychology research has been generated.
The code file (code/meta_analysis_tools.py) includes a p-hacking simulation that makes this concrete. The simulation generates data in which the null hypothesis is true — there is literally no relationship between the variables — and shows how many of the "analyses" you can run before finding a "significant" result by chance. The result is uncomfortable: in a surprisingly high proportion of simulations, you can p-hack your way to p < .05 in a dataset where nothing real is happening. Running this simulation is, for most students, the single most attitude-changing exercise in this chapter. Reading about p-hacking produces intellectual understanding; watching a computer manufacture a "significant finding" from pure noise produces something more visceral and more lasting.
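For readers without the code file at hand, here is a simplified stand-alone version of the idea — null data, a handful of individually defensible analytic forks, and a count of how often at least one fork reaches significance. This is a sketch of the mechanism, not the chapter's actual simulation; the forks (choice of outcome measure, outlier rule) are invented examples.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n_sims, n_per_group = 2_000, 50
hacked = 0

for _ in range(n_sims):
    group = np.repeat([0, 1], n_per_group)       # two conditions; null is true
    dv1 = rng.normal(size=2 * n_per_group)       # candidate outcome measure 1
    dv2 = rng.normal(size=2 * n_per_group)       # candidate outcome measure 2
    found = False
    for dv in (dv1, dv2, (dv1 + dv2) / 2):       # fork 1: which DV to report
        for cutoff in (None, 2.5):               # fork 2: outlier exclusion rule
            keep = np.abs(dv) < cutoff if cutoff else np.ones_like(dv, bool)
            a = dv[keep & (group == 0)]
            b = dv[keep & (group == 1)]
            if stats.ttest_ind(a, b).pvalue < 0.05:
                found = True                      # at least one fork "worked"
    hacked += found

print(f"At least one 'significant' fork: {hacked / n_sims:.1%} (nominal: 5.0%)")
```

Six mildly correlated analyses per dataset are enough to multiply the false positive rate several times over — and a real research workflow offers far more than six forks.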
💡 Key Insight: Understanding p-hacking is not about assuming that researchers are frauds. Most researchers who p-hack do not know they are doing it. The human mind is remarkably good at finding patterns in noise and remarkably bad at noticing when its analytic decisions are being guided by the results those decisions produce. Pre-registration is the structural solution to this problem — and we turn to it next.
40.6 Pre-Registration and Open Science: The Reform Movement
Pre-registration is the practice of publicly committing to a research question, hypotheses, sample size, and analytic plan before collecting data. The registration is time-stamped and publicly accessible (typically through the Open Science Framework at osf.io). Any deviation from the pre-registered plan must be disclosed and labeled as exploratory rather than confirmatory.
Pre-registration solves the p-hacking problem by separating confirmatory from exploratory analysis. A pre-registered study with a significant result is genuine evidence: the hypothesis was specified in advance, the sample size was determined by a power analysis rather than by when significance was achieved, and the analysis followed the pre-specified plan. An exploratory analysis in the same study (testing hypotheses that emerged during analysis) is also valuable — but it is clearly labeled as exploratory and therefore understood to require replication before being treated as established.
The Open Science Framework (osf.io) has seen exponential growth in pre-registered studies since about 2013. By 2020, pre-registration had become a standard expectation at many journals, particularly in psychology and social neuroscience. The broader open science movement includes several related practices:
Open data: Making the raw data of a study publicly available so that other researchers can verify analyses and conduct new ones.
Open materials: Making the stimuli, questionnaires, and protocols publicly available so that exact replication is possible.
Registered Reports: A journal format in which researchers submit their introduction and methods for peer review before data collection. If the study is accepted at that stage, it will be published regardless of whether the results are significant or null. This directly addresses publication bias at its source: because the decision to publish is made before the results are known, journals cannot preferentially publish positive findings. Registered Reports have proliferated rapidly since their introduction around 2013 and are now offered by over 300 journals.
📊 Research Spotlight: The Okafor-Reyes Global Attraction Project was pre-registered in Year 1 on the Open Science Framework. The pre-registration specified the primary and secondary hypotheses for each of the twelve country samples, as well as the meta-analytic analysis plan for the full dataset. When Okafor presented the Year 3 data at a major conference and some results contradicted their hypotheses, a colleague in the audience challenged her: "If you had pre-registered the opposite hypothesis, would you have gotten this into a top journal?" Okafor's answer — "Yes, because we actually preregistered the hypothesis we had, not the one that would have been easiest to support" — was a direct application of the logic of pre-registration. Whether pre-registered results challenge or confirm prior beliefs, the evidential weight is the same.
40.7 Effect Size Revisited: What Sizes Matter in Attraction Research?
Chapter 3 introduced Cohen's d benchmarks: small (d = 0.2), medium (d = 0.5), large (d = 0.8). These benchmarks are useful as a rough orientation, but they obscure something important: what counts as a practically significant effect size depends on the domain and on the question being asked.
In attraction research, some effects that look small by Cohen's benchmarks are practically important. If a particular feature of a dating profile increases match rates by 15% (which might correspond to a small effect size), that is a practically significant effect for the millions of people using dating apps. Conversely, some statistically significant effects in large-sample studies are so small as to be practically irrelevant for understanding individual attraction.
The Okafor-Reyes team used several additional effect size metrics that are worth knowing:
Odds ratios describe how much more likely an outcome is given a predictor. In the racial preference data from Chapter 25, the odds ratio for messaging partners of the same race (compared to messaging across race) was approximately 2.3 in the US sample — same-race messaging was 2.3 times as likely. This is a medium effect by most standards.
Eta-squared (η²) is used in ANOVA designs; it represents the proportion of total variance explained. An η² of .01 is conventionally small; .06 is medium; .14 is large. Most real-world attraction predictors produce small to medium η² values.
Pearson's r describes the correlation between two continuous variables. An r of .1 is typically small; .3 is medium; .5 is large. Most self-report attraction variables correlate with outcomes in the .1–.3 range — small to medium effects that matter at the population level but are not deterministic at the individual level.
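These metrics can be translated into one another, at least approximately, which helps when comparing studies that report different statistics. A sketch using standard conversion formulas — the odds-ratio base rate below is an invented number for illustration, not a figure from the Okafor-Reyes data:

```python
import numpy as np

# r to Cohen's d (standard conversion for two-group designs).
r = 0.30
d_from_r = 2 * r / np.sqrt(1 - r**2)
print(f"r = {r:.2f}  ->  d = {d_from_r:.2f}")

# eta-squared to r for a single predictor: r = sqrt(eta^2).
eta2 = 0.06
print(f"eta^2 = {eta2:.2f}  ->  r = {np.sqrt(eta2):.2f}")

# What an odds ratio of 2.3 means at an assumed (invented) base rate:
# if the comparison behavior occurs 10% of the time, OR = 2.3 implies...
p_base = 0.10
odds = 2.3 * p_base / (1 - p_base)
p_implied = odds / (1 + odds)
print(f"OR = 2.3 at a 10% base rate  ->  implied rate of {p_implied:.1%}")
```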
⚖️ Debate Point: One of the most honest admissions in the Okafor-Reyes final year report was about what the effect sizes in their cross-national data actually implied. Even the most robust cross-cultural finding — that behavioral indicators of warmth and responsiveness consistently predicted attraction in all twelve countries — accounted for only about 8–12% of the variance in attraction ratings. Roughly ninety percent of what makes a specific person attractive to a specific other person in a specific context remained, even after five years and nearly five thousand participants, unexplained. That is not a failure of the study. It is a fact about attraction.
40.8 The WEIRD Problem Revisited with Okafor-Reyes Global Data
When the Okafor-Reyes team released their full dataset — approximately 4,800 participants across twelve countries, five years, and three data collection methods — the most immediate scientific question was: which findings from the previous attraction literature replicate with a genuinely diverse sample? The answer was illuminating, and in some cases surprising.
What was more robust cross-culturally than expected:
The research team had anticipated, based on the constructivist critiques in the literature, that fewer findings would hold up across their diverse sample than evolutionary psychology predicted. What they found instead was that a handful of core findings were genuinely more robust than the most critical constructivists had suggested.
The association between perceived warmth and responsiveness on the one hand and attraction on the other was found in all twelve country samples. In their meta-analysis of the twelve-country data, the pooled effect was d = 0.51 (95% CI: 0.43–0.59), with an I² of only 18% — indicating low heterogeneity. Warmth preferences were not simply a Western preoccupation. The negative effect of perceived contempt, dismissiveness, or public humiliation on attraction was even more consistent, with a pooled d = −0.68 and I² of 11%. These findings have a plausible functional interpretation: warmth and non-contempt signal the kind of relationship quality that would be valuable in any cooperative social context. Okafor found herself, somewhat to her own surprise, agreeing with some of the evolutionary predictions about why these cues would be cross-culturally significant.
The propinquity effect — the tendency to form romantic connections with people who are physically proximate — also replicated robustly in all twelve urban samples, with the effect somewhat weaker in rural sub-samples where physical proximity was both ubiquitous and less meaningful as a signal of regular social contact.
What was less robust than expected:
Status and resource cues in partner preferences showed significant cross-national variation that undermined the universal claims in some of the evolutionary psychology literature. In four of the twelve countries (Japan, Germany, South Korea, the United States), male participants showed the pattern typically reported in the Western literature: higher preference for physically attractive partners over economically resourceful ones, relative to female participants. In five countries (Nigeria, Morocco, Brazil, India, South Africa), the gender difference was smaller or absent, and resource cues predicted attraction similarly across genders. In the remaining three countries (Mexico, Sweden, Australia), patterns were mixed and did not reach conventional significance thresholds after multiple-comparison correction.
Okafor's interpretation of this finding became the centerpiece of her conference talk: evolutionary predictions about status and resource cues do hold robustly in some contexts, but where and how strongly they appear seems to be substantially moderated by the specific cultural ecology — the degree to which women's economic independence has been historically suppressed or enabled, the extent to which education functions as a reliable signal of status, and the degree to which gender-equal institutions have been developed. The evolutionary hypothesis can explain why the underlying preference exists; it cannot fully explain why the expression of that preference varies so dramatically across cultural contexts.
What the WEIRD problem's practical lesson suggests:
The Okafor-Reyes data do not show that cross-cultural research is impossible or that all findings are culturally relative. They show that the domain of genuine cross-cultural generalization in attraction is narrower than the literature based primarily on Western samples implied. Effects that appear cross-nationally consistent are probably reflecting something about human social cognition at the species level; effects that show substantial variation are probably indexing the operation of culturally specific norms and structures. The lesson is not "nothing generalizes" but "we need to know which things generalize and which don't, and we cannot learn that from WEIRD-only samples."
🔵 Ethical Lens: The release of the Okafor-Reyes full dataset in open access format was a deliberate methodological and ethical choice. Okafor argued at her institution's press conference that data collected from communities in twelve countries should be available to researchers in those communities — not just to American and European academics who happened to be running the project. Data colonialism — the extraction of data from Global South communities for the benefit of Global North institutions — is a live problem in social science, and the open data commitment was a partial attempt to address it.
40.9 How to Evaluate a Press Release About Attraction Research
Return to Okafor's Thursday morning. Forty-seven alerts. A headline that was not false but was deeply misleading. How should a scientifically literate reader have evaluated that press release?
Here is a seven-step press release evaluation protocol, drawing on everything in this chapter and in Chapter 3:
Step 1: Find the original study. Press releases are about studies. The study should be citable and, ideally, available. If the press release does not name the journal or provide a link to the study, this is itself a warning sign.
Step 2: Check the sample. Who were the participants? How many? From where? How were they recruited? Any finding from a sample of 50 undergraduate psychology students at a single US university should be held very tentatively — not because the finding is false but because it is not yet established.
Step 3: Find the effect size. Press releases almost never report effect sizes. Find the original paper and find the effect size. Is it small, medium, or large by relevant benchmarks? A "significant" finding with d = 0.18 is a different kind of claim than one with d = 0.72.
Step 4: Check the preregistration status. Was the study preregistered? If so, were the reported findings the ones specified in the preregistration? If the preregistration is absent, or if the reported finding was not in the preregistration, adjust your confidence accordingly.
Step 5: Look for the limitations section. Every honest research paper has one. If the press release does not mention any limitations, the limitations have been stripped out. Find them and assess their relevance to the headline claim.
Step 6: Look for replication. Is this the first study to find this effect, or has it been found consistently across multiple independent samples? A finding from a single study, no matter how large, is preliminary evidence. Convergent evidence from multiple studies is much stronger.
Step 7: Check the causal language. Does the headline or press release use causal language ("X causes Y") for what was actually a correlational study? Correlational studies cannot establish causation. This is one of the most common distortions in popular coverage of social science.
Applying this protocol to the Okafor-Reyes press release: Step 1 — the study was named and linked, pass. Step 2 — 4,800 participants across 12 countries, unusually strong, pass. Step 3 — effect size not in the press release; the paper reports d = 0.19, which is small, fail. Step 4 — pre-registered; but the claim in the headline was not the primary preregistered hypothesis, cautious. Step 5 — the press release omitted all limitations, fail. Step 6 — this was the first study of this specific question at this scale, so replication evidence was necessarily thin, cautious. Step 7 — the headline used "prefer," which implies individual preference, for a finding about aggregated rating patterns, which is a level-of-analysis conflation, fail.
Final assessment: the finding may be real but the headline claim is substantially exaggerated relative to the evidence.
40.10 The 10-Step Research Quality Checklist
The seven-step press release protocol is designed for rapid evaluation of media coverage. But evaluating research quality more thoroughly — when you actually have access to the original paper, or when you are deciding how much weight to give a finding in your own thinking — requires a more complete assessment. Here is a ten-step checklist for evaluating any attraction research claim, applicable whether you are reading a paper in a peer-reviewed journal or a claim in a popular book or podcast.
Step 1: Sample size and statistical power. Is the sample large enough to reliably detect the effect the researchers were looking for? A study of 40 participants testing whether attachment style predicts relationship satisfaction has very low power to detect even a medium-sized effect. G*Power and similar tools allow you to calculate the minimum sample size required for adequate power (typically ≥ 80%) given a specific effect size expectation and alpha level. Studies that report significant effects with small samples should be treated cautiously — they may have found real effects, but they may also have been lucky, or may have p-hacked their way to significance. (A worked power calculation appears at the end of this checklist.)
Step 2: Sample diversity. Who was studied? Is the sample WEIRD? Does it include diverse racial, cultural, socioeconomic, gender, and age groups relevant to the question? A study of heterosexual, white, middle-class college students making claims about human attraction in general has a sampling problem that should be explicitly acknowledged.
Step 3: Effect size and practical significance. What is the effect size, and what does it mean in practical terms? Report not just whether there was an effect but how big it was and whether that size matters for real-world behavior.
Step 4: Replication status. Has this finding been independently replicated — not by the same lab, not by a study using the same stimuli and measures, but by a genuinely different team using a different sample? An unreplicated finding is preliminary regardless of the journal it appeared in or the fame of the research team.
Step 5: Peer review quality. Was the paper published in a peer-reviewed journal with external review? What is the journal's reputation in the field? While peer review is imperfect (it is blind to p-hacking and does not catch most methodological errors), publication in a high-quality journal provides at least some quality filter.
Step 6: Preregistration. Was the study preregistered? Were the primary findings the ones specified in the preregistration, or were they secondary or exploratory analyses that emerged from the data? This distinction is critical for interpreting p-values.
Step 7: Funding source and conflicts of interest. Who funded the research, and does the funding source have a financial or ideological stake in the outcome? Supplement manufacturers funding nutrition research, pharmaceutical companies funding drug studies, and ideological advocacy organizations funding social science all introduce potential conflicts that readers should know about.
Step 8: Measurement validity. Do the measures actually capture what the researchers claim they capture? A study measuring "mate preferences" using a written questionnaire asking participants to rate hypothetical partner profiles is measuring something, but is it the same thing as actual mate choice behavior in real contexts? The gap between what is measured and what is claimed to be measured is one of the most persistent problems in attraction research.
Step 9: Alternative explanations. Has the study design ruled out the most plausible alternative explanations for the findings? Does the paper's discussion acknowledge them? A finding that is consistent with multiple competing theories is less useful than one that discriminates between them.
Step 10: Media translation accuracy. If you are reading a news article or press release rather than the original paper, does the coverage accurately represent what the paper actually found? Did it omit the effect size, the sample characteristics, the limitations, the authors' own hedging? The media translation step is where the most distortion occurs, and evaluating it requires going back to the original source.
Applying this checklist rigorously to any single attraction claim is a time-intensive exercise. But the checklist can also be applied quickly as a rough mental screen: How many of these ten quality indicators does this claim satisfy? A claim that satisfies most of them deserves real weight. A claim that satisfies few of them deserves real skepticism.
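As a companion to Step 1, here is a minimal power calculation using statsmodels — a Python stand-in for the G*Power workflow mentioned above. The effect size of d = 0.4 is an invented expectation, chosen only to illustrate the calculation:

```python
import math
from statsmodels.stats.power import TTestIndPower

# Minimum n per group to detect d = 0.4 with 80% power at alpha = .05,
# for an independent-samples t-test.
n_per_group = TTestIndPower().solve_power(effect_size=0.4, alpha=0.05, power=0.80)
print(f"Required n per group: {math.ceil(n_per_group)}")   # roughly 100

# The same machinery run in reverse: the power of a 40-per-group study.
power = TTestIndPower().solve_power(effect_size=0.4, nobs1=40, alpha=0.05)
print(f"Power with n = 40 per group: {power:.0%}")          # well under 80%
```

The reverse calculation is the one to reach for when reading a published study: given the reported sample size, how much power did this design actually have to detect a plausible effect?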
40.11 From Research to Policy: The Translation Problem
The seven-step protocol and ten-step checklist address one translation gap: the gap between research and media coverage. But there is a second, equally important translation gap: the gap between research and the practical advice that researchers, therapists, coaches, and popular authors derive from it. This is the "telephone game" of social science: findings that are carefully hedged in academic papers become bolder in review articles, bolder still in press releases, bolder again in news coverage, and arrive at popular audiences in forms that researchers would sometimes barely recognize.
Consider the trajectory of a finding about attachment style and relationship satisfaction. The original finding, published in a peer-reviewed journal, might say: "In a sample of 287 heterosexual undergraduate couples recruited from a single university, anxious attachment style (measured by the ECR-R) was associated with lower reported relationship satisfaction (r = −.28, p < .001) after controlling for relationship length and partner attachment style." This is a real, useful finding — but it is carefully bounded.
In a review article, this finding becomes: "Anxious attachment is associated with reduced relationship satisfaction." In a press release, it might become: "New research shows anxious attachment damages relationships." In news coverage: "Scientists say anxiously attached people have worse relationships." In a popular book chapter, it may arrive as: "If you're anxiously attached, here's why your relationships keep failing." On social media, distilled further: "Anxious attachment = relationship sabotage."
Each step of this telephone game strips away context, qualifications, and nuance. The individual study becomes "research" or "science." The correlation becomes causation. The undergraduate sample becomes "people." The measured relationship satisfaction becomes "relationships" as a whole. The modest effect size becomes a decisive pattern. By the time a finding reaches the self-help shelf or a TikTok trend, it may share a name and a general direction with the original research while bearing no meaningful resemblance to what the evidence actually supports.
This translation problem has real consequences. When attachment theory findings are translated into rigid personality typologies ("I'm an anxious attachment type, therefore..."), people may adopt identities that limit their agency and foreclose change. When evolutionary psychology findings about mate preferences are translated into dating advice ("men will always prefer X"), they can function as prescriptions that reinforce the very patterns the research describes, treating statistical tendencies as immutable laws. When findings about racial preferences in dating markets are translated carelessly, they can either normalize discrimination or be deployed to shame people for their own preferences — neither of which the research supports.
The responsible translation of attraction science requires holding several things simultaneously: the finding itself and its effect size, the sample and its limitations, the theoretical interpretation and its alternatives, and the practical implications with all their caveats. The press release, the advice column, and the social media post cannot hold all of this. The reader, equipped with the tools in this chapter, can.
40.12 The Question of Personal Application: From Science to Individual Life
There is a question that students in attraction courses ask more often than almost any other: "OK, but what does this mean for me?" It is a reasonable question. This material is not abstract; it is about a central domain of human experience.
The honest answer is: the science speaks to populations and tendencies, not to individuals. An effect size of d = 0.4 means that two groups differ on average by about 0.4 standard deviations on some measure of attraction. It does not tell any individual member of either group what will happen when they meet a specific person in a specific context with a specific history.
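One way to see what a group-level d means at the individual level is the common-language effect size: the probability that a randomly chosen member of the higher-scoring group outscores a randomly chosen member of the other group, which for two normal distributions is Φ(d/√2). A one-line check:

```python
from scipy.stats import norm

d = 0.4
p_superiority = norm.cdf(d / 2**0.5)   # common-language effect size
print(f"d = {d}: probability of superiority = {p_superiority:.2f}")  # ~0.61
```

An effect of d = 0.4 moves an individual comparison from a 50/50 coin flip to roughly 61/39 — real, but far from deterministic.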
This does not mean the science is useless for personal life. Understanding that your attachment style is anxiously organized, and that anxious attachment is associated with specific patterns of perception and behavior in romantic relationships — that is knowledge you can use. Understanding that implicit biases about race and attractiveness have been shaped by structural racism — that is knowledge that can prompt reflection and deliberate choice. Understanding that the butterflies you feel in the early stages of attraction are partly dopamine-driven reward activation that is likely to habituate — that is knowledge that might temper the tendency to treat early intense feelings as reliable predictors of long-term compatibility.
The key distinction is between knowledge that informs reflection and knowledge that prescribes action. The science can do the former. The latter requires judgment, values, and self-knowledge that no study can provide.
There is also a well-documented psychological phenomenon called the "backfire effect" — the tendency for people to become more committed to incorrect beliefs when presented with corrective evidence, at least under certain conditions. While more recent research has questioned the robustness of this specific effect (Wood & Porter, 2019), it points to a genuine truth: presenting scientific findings to people whose identities or group memberships are implicated in the findings can sometimes produce defensiveness rather than updating. This is not a reason to withhold findings; it is a reason to present them with care, context, and attention to how they will be received. The gap between being an informed consumer of attraction research and using that information to improve your own life and relationships is a human and social challenge, not just a cognitive one.
One of the most useful mental models for personal application of science comes from the statistician George Box, who observed that "all models are wrong, but some are useful." Applied to attraction science: every finding about attraction patterns is a model — a simplified representation of a complex reality. The model is not your life. Your life is more complicated, more particular, and more interesting than any model. The model can help you understand your life better without substituting for it. Holding this distinction — using models as tools for understanding while retaining the richness of your actual experience — is the practical skill that bridges scientific literacy and personal wisdom.
40.13 Science Literacy as a Life Skill
The skills in this chapter extend far beyond attraction research. Media coverage of psychology, sociology, nutrition, public health, and virtually every social or behavioral domain is subject to the same patterns: effect sizes omitted, null findings ignored, correlational studies reported with causal language, WEIRD samples generalized without qualification.
Science literacy — the ability to evaluate empirical claims with calibrated skepticism — is not primarily a skill for professional researchers. It is a civic skill. In a world where scientific claims are regularly mobilized in service of commercial, political, and ideological agendas, the ability to ask "How do you know that?" and "What are the limits of this evidence?" is a form of intellectual self-defense.
The critical toolkit in this chapter, and in Chapter 3, is your basic equipment. Use it on everything — including the claims in this book.
40.14 Questions to Ask Any Attraction Claim
As a synthesis and reference, here is a consolidated checklist for evaluating any claim about attraction:
About the study:
- What was the sample? (Size, demographics, cultural context)
- Was the study experimental or correlational?
- What was the effect size?
- Was it pre-registered?
- Has it been independently replicated?
- What do the study's own authors say the limitations are?
About the claim:
- Does the headline claim match what the study actually found?
- Is causal language being used for correlational data?
- Are alternative explanations acknowledged?
- Is the finding being generalized beyond its actual sample?
About the media coverage:
- Was the press release written by the university's communications office rather than the researchers themselves?
- Were limitations, null findings, or competing studies mentioned?
- What was omitted from the coverage that appears in the original study?
About your own reaction:
- Does this finding confirm what you already believed? If so, be extra cautious — confirmation bias is real.
- Does this finding disturb you? That is also important information, but it is not itself evidence against the finding.
- What would you need to see before updating your beliefs substantially based on this finding?
These questions will not always have satisfying answers. The honest assessment of a single study is usually something like: "preliminary evidence, deserves further investigation, warrants modest updating of beliefs." That is not a satisfying media headline. But it is accurate, and accuracy is what the science deserves.
Conclusion: Informed Consumers, Better Science
Dr. Okafor spent three hours that Thursday tracing the gap between her research and its public reception. She wrote a brief, clear response for her university's website, walking through the seven-step protocol above with her own study as the example. It got 300 page views. The press release got 4.2 million.
This gap is real and it is not going away. The incentive structures of academic communication, journalism, and social media all push in the direction of simpler, stronger, more headline-friendly claims. The antidote is not more press releases from researchers but a more literate public — people who know enough about how research works to be appropriately skeptical of breathless headlines and appropriately open to carefully hedged but genuinely important findings.
That is what this chapter has tried to give you: not the ability to dismiss everything but the tools to hold scientific claims with calibrated confidence. The code in meta_analysis_tools.py will let you see these tools in action — forest plots, funnel plots, and the uncomfortable viscerality of watching a computer p-hack its way to significance in a dataset where nothing real is happening. Run it, adjust the parameters, and watch the principles become concrete.
The science of attraction is genuine, important, and still unfinished. It deserves careful, critical engagement. So does everything else you will read for the rest of your life.
Next: Chapter 41 brings the frameworks to bear on personal reflection — how Nadia, Sam, and Jordan have used what they've learned, and how you might use it too.