Case Study 30-2: The Cross-Cultural Music Study (Mehr et al., 2019) — Designing an Experiment to Test Music Universals

Introduction: The Problem of Testing Music Universals

The claim that some features of music are universal is intuitively appealing and theoretically significant — but it is extraordinarily difficult to test rigorously. The obstacles are methodological, not merely logistical. What counts as a "musical universal"? How do you sample adequately from the full diversity of human musical cultures? How do you know whether cross-cultural similarities reflect genuine biological universals or merely the spread of Western music through colonialism and globalization? How do you test whether naive listeners from radically different cultural backgrounds respond to music from an unfamiliar culture in the same way?

These are the questions that Samuel Mehr, Manvir Singh, and their collaborators at Harvard's Music Lab set out to answer in the study published in Science in 2019 (Mehr, Singh, Knox, Ketter, et al., "Universality and diversity in human song"). The study represents the most methodologically sophisticated effort to date to test specific claims about musical universals, and understanding how it was designed — what it can and cannot establish — is as important as understanding what it found.

The Study's Design Logic

The fundamental research question was: Do the features of music vary randomly across human cultures, or are there systematic patterns that reflect universal constraints? If music were entirely culturally constructed with no universal biological basis, we would expect its features to be essentially random across cultures — no more predictable from one culture to another than, say, the specific shapes of pottery or the specific colors used in visual art. If, on the other hand, music reflects universal biological constraints, we would expect non-random cross-cultural regularities — patterns that appear in musics from unrelated cultures.

The study operationalized this research question through several complementary approaches:

Approach 1: Expert coding of acoustic features. A team of trained analysts coded each of the 118 songs in the corpus for a range of acoustic features: tempo, melodic range, interval size, rhythmic regularity, pitch contour, number of performers, presence of text, and others. The coded features were then analyzed statistically to ask whether features such as tempo, melodic contour, and rhythmic regularity were consistent within behavioral contexts across diverse cultures (for example, whether lullabies tend to be slow and dance songs tend to be fast).

Approach 2: Naive listener identification. If the musical features that distinguish lullabies from dance songs are cross-culturally consistent, then naive listeners from other cultures — people with no knowledge of the culture producing a song — should be able to identify its behavioral context from acoustic features alone. This is an independent test of the expert-coding claim: if experts can code lullabies as slow and smooth, and if naive listeners can identify lullabies by their acoustic features, then the features are doing real cognitive work.

Approach 3: Cross-cultural emotional ratings. The study also asked naive listeners to rate the songs on dimensions including positivity and arousal. This tested whether the emotional associations of acoustic features (for example, faster music being heard as more positive and more arousing) were cross-culturally consistent.

The Song Selection Process

The 118 songs were drawn from the Human Relations Area Files (HRAF), an archival database of ethnographic recordings, supplemented by targeted field recordings. Selection criteria included:

Documentation of behavioral context: Each song had to be described in the accompanying ethnographic record as a lullaby, dance song, healing song, or love song by a cultural insider — not identified as such by the Western researcher.
Minimal Western contact: The study explicitly excluded songs from cultures with documented substantial exposure to Western music, so that cross-cultural similarities could not be explained by the global spread of Western music. Cultures with documented radio, television, or missionary contact prior to recording were excluded or carefully examined for potential Western influence.
Geographic and cultural diversity: The 60 societies were selected to maximize phylogenetic and geographic independence — societies were chosen to minimize shared ancestry (both genetic and cultural) so that similarities across societies could not be attributed to common descent.
Adequate acoustic recording quality: Songs needed to be recorded clearly enough for acoustic analysis and for use in the listener experiment.

This selection process reduced the original HRAF archive of thousands of recordings to 118 songs — a significant reduction that reflects how strict the criteria were.

The Listener Experiment

The cross-cultural listener component of the study was conducted online, recruiting approximately 750 participants from 30 different countries through a web platform. Participants heard brief excerpts from the songs (ranging from 14 to 30 seconds) and answered questions about what they thought the music was for (lullaby? dance song? healing song? love song?) and how it made them feel.

Key design features of the listener experiment:

Cultural naivety: Participants were recruited from countries different from those producing the songs. A participant from Norway rated songs from Papua New Guinea, Nigeria, and Peru — cultures entirely outside their prior musical experience.

Incentivized accuracy: Participants were offered a financial bonus for correct identifications, incentivizing genuine engagement rather than random clicking.

Audio-only: Participants heard only the audio, without any visual information, lyrics displayed in translation, or other contextual cues. The identification had to be based purely on acoustic features.

Counterbalancing: Song order, cultural context order, and question format were counterbalanced across participants to prevent order effects.
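One common way to counterbalance the order of the four song contexts is a Latin square, in which every context appears exactly once in every serial position across a block of participants. The sketch below only illustrates that general technique; it is not the procedure the study used, and the names `latin_square` and `order_for` are hypothetical.

```python
CONTEXTS = ["lullaby", "dance", "healing", "love"]

def latin_square(items):
    """Return a cyclic Latin square: row i is the item list rotated left by i."""
    n = len(items)
    return [[items[(i + j) % n] for j in range(n)] for i in range(n)]

def order_for(participant_index, items=CONTEXTS):
    """Give each participant one row of the square, cycling through the rows."""
    square = latin_square(items)
    return square[participant_index % len(items)]

# Show the presentation order for the first four participants.
for participant in range(4):
    print(participant, order_for(participant))
```

A cyclic square like this balances serial position but not which context immediately precedes which, so a design concerned with carryover effects would need a different scheme or full randomization.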

Key Results and Their Interpretation

Result 1: Behavioral function identification above chance. Naive listeners identified the behavioral context of songs at rates substantially above the 25% chance level for all four contexts, with lullabies most accurately identified (~70–75%) and healing songs least accurately identified (~55–60%). This demonstrates that the acoustic features that distinguish behavioral contexts are cross-culturally legible — they are readable by naive listeners from unrelated cultures.
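To make "above the 25% chance level" concrete, one can compute an exact one-sided binomial p-value for a given number of correct identifications. The sketch below uses an invented count (11 correct out of 20) purely for illustration; it is not data from the study, and the function name `binomial_p_value` is hypothetical.

```python
from math import comb

def binomial_p_value(successes, trials, chance=0.25):
    """One-sided exact p-value: P(X >= successes) for X ~ Binomial(trials, chance)."""
    return sum(
        comb(trials, k) * chance**k * (1 - chance) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical example: a listener gets 11 of 20 songs right (55%).
p = binomial_p_value(11, 20)
print(f"p = {p:.4f}")
```

This simple test treats every trial as independent. Repeated responses from the same listener or to the same song violate that assumption, so a real analysis would need a model that accounts for listener and song effects.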

Result 2: Tempo as the primary predictor. Statistical analysis of the expert-coded acoustic features found that tempo was the single most powerful predictor of behavioral context across cultures. Lullabies are consistently slow; dance songs are consistently fast; healing and love songs fall between them. This consistency held across all 60 societies.
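What it means for tempo to predict behavioral context can be sketched with a one-feature nearest-centroid classifier: compute the mean tempo of each context, then assign a new song to the context with the closest mean. The tempo values below are invented for illustration and are not measurements from the study, the relative ordering of love and healing songs is arbitrary, and the function names are hypothetical.

```python
from statistics import mean

# Invented tempos in beats per minute, for illustration only.
TOY_TEMPOS = {
    "lullaby": [60, 66, 72],
    "love": [84, 90, 96],
    "healing": [100, 108, 116],
    "dance": [126, 132, 144],
}

def fit_centroids(tempos_by_context):
    """Return the mean tempo of each behavioral context."""
    return {context: mean(values) for context, values in tempos_by_context.items()}

def predict_context(tempo, centroids):
    """Return the context whose mean tempo is closest to the given tempo."""
    return min(centroids, key=lambda context: abs(centroids[context] - tempo))

centroids = fit_centroids(TOY_TEMPOS)
print(predict_context(64, centroids))   # a slow song
print(predict_context(140, centroids))  # a fast song
```

With a single feature, contexts whose tempos overlap (such as the two middle categories here) are the ones most easily confused, which is one reason a real analysis would combine several coded features.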

Result 3: Cross-cultural consistency of emotional ratings. Naive listeners' ratings of positivity and arousal were consistent across cultural backgrounds: faster music was rated as higher arousal and more positive across virtually all listener groups. This suggests some cross-cultural agreement in the basic emotional valence of musical tempo — though the consistency was stronger for arousal than for specific emotions.

Result 4: Significant cross-cultural variation in other features. The study was careful to document not only consistencies but also the substantial variation that persists across cultures. Scale structure, tonal system, rhythmic complexity, performance practice, instrumentation, and formal structure varied dramatically. The universals identified are at a fairly high level of abstraction; at the level of specific acoustic parameters, diversity dominates.

Methodological Critiques and Limitations

The Mehr et al. study is the most rigorous test of music universals to date, but it has several important limitations that its authors acknowledge and that critics have elaborated:

Critique 1: The HRAF archive is not a random sample of all human music. The Human Relations Area Files overrepresent societies that were of particular interest to 20th-century Western anthropologists. Many of the world's most musically distinctive traditions, particularly those of small hunter-gatherer societies in remote areas, are underrepresented. The 60 societies chosen, though selected for diversity, are not a random sample of the 7,000+ distinct cultural groups identified by anthropologists.

Critique 2: The behavioral context categories may not be culture-neutral. The four categories (lullaby, dance song, healing song, love song) were selected by the research team — a team primarily based at Harvard — and then identified in the HRAF records by looking for ethnographic descriptions that map onto these categories. Critics argue that this process imposes a Western functional taxonomy on diverse musical practices. Some musical traditions do not have "love songs" in any sense recognizable to Western ethnographers; some healing music is simultaneously social entertainment; some lullabies are simultaneously religious songs. The categories clean up messiness that is itself culturally significant.

Critique 3: Online naive listeners are not culturally neutral. The 750 listeners from 30 countries were recruited through an online platform and completed the study on a computer or smartphone. They were therefore, by definition, members of cultures with internet access and sufficient technological infrastructure for online research. This is a severe filter that excludes many of the most culturally distinct societies from the listener pool. A participant from urban Lagos, online in 2019, shares far more cultural background with a participant from London than either shares with a hunter-gatherer from the Ju/'hoansi community of the Kalahari.

Critique 4: Western musical influence is difficult to exclude completely. Despite the study's effort to select songs from cultures with minimal Western contact, the global spread of Western music through recorded media, radio, and film over the 20th century means that by 2019 virtually no musical tradition anywhere on Earth was entirely free of Western influence. The songs in the archive were recorded at various points in the 20th century, many of them in the post-radio era.

Critique 5: The study tests acoustic features, not cultural meaning. The cross-cultural acoustic regularities documented by the study are real — but they may not be the most important features of music from the perspective of the cultures producing it. A healing song's power may rest entirely in its specific lyrical content, its relationship to a specific ritual context, or the identity of the person performing it — features that are invisible to the acoustic analysis and to a naive listener without cultural knowledge.

What Constitutes Good Evidence for a Music Universal?

The Mehr et al. study provides useful criteria for what constitutes good evidence for a music universal:

Criterion 1: Adequate cross-cultural sampling. Claims about universals must be based on samples that include societies with independent cultural histories and that are not biased toward Western-influenced or elite musical traditions.

Criterion 2: Multiple levels of analysis. Convergent evidence from acoustic analysis, naive listener experiments, and culturally informed analysis within specific traditions is stronger than any single method.

Criterion 3: Distinction between universality and mere prevalence. A feature found in 80% of sampled cultures is not a universal; it is widespread. Claims of universality require either genuinely 100% cross-cultural occurrence or a principled physical or biological explanation for why exceptions are unlikely to exist.
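The gap between "found in every sampled culture" and "universal" can be quantified. If a feature is observed in all n sampled societies, the exact one-sided lower confidence bound on its true prevalence is alpha^(1/n) (the Clopper-Pearson bound for n successes in n trials). The sketch below assumes the societies are an independent random sample, which Critique 1 argues is not the case, so it shows at best an optimistic bound; the function name `prevalence_lower_bound` is hypothetical.

```python
def prevalence_lower_bound(n_societies, alpha=0.05):
    """Exact one-sided lower confidence bound on true prevalence when the
    feature was observed in every one of n_societies sampled societies."""
    if n_societies <= 0:
        raise ValueError("n_societies must be positive")
    return alpha ** (1 / n_societies)

# Compare how the bound tightens as the sample grows.
for n in (10, 60, 300):
    print(n, round(prevalence_lower_bound(n), 3))
```

With 60 societies the 95% bound is about 0.95, so even a feature seen in every sampled society is statistically compatible with being absent from roughly one culture in twenty. This is why the criterion also asks for a principled physical or biological explanation.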

Criterion 4: Causal explanation. The strongest claims about music universals are those supported by a causal account — a physical or biological mechanism that explains why the universal exists. Octave equivalence is a near-universal and can be explained by the acoustic physics of harmonic overtones. Tempo preferences for behavioral contexts are a near-universal and can be explained by the correspondence between musical tempo and biological motion rates. Cross-cultural regularities without causal accounts are descriptively interesting but theoretically limited.

Discussion Questions

The study excluded songs from cultures with "substantial Western contact" — but acknowledged that this criterion is nearly impossible to satisfy completely by the late 20th century. How would you operationalize "minimal Western contact" for the purposes of a study like this? What threshold would you use, and how would you verify it?
The critique argues that the four behavioral categories (lullaby, dance, healing, love) impose a Western taxonomy. Propose an alternative research design that would allow behavioral categories to emerge from within specific cultures (emic categories) rather than being imposed from outside (etic categories). What challenges would this design face?
The study's naive listeners were recruited online and thus were necessarily more Westernized than the average member of the 60 sampled cultures. Design a modified listener experiment that would include participants from more culturally isolated communities. What ethical considerations would arise?
The study found that healing songs were the least accurately identified by naive listeners (~55–60% vs. ~70–75% for lullabies). Propose a cultural explanation and an acoustic explanation for why healing songs might be less cross-culturally legible than lullabies. Which explanation do you find more plausible?
The chapter argues that the study's results support the claim that "physics constrains, culture constructs." Do the results actually support this interpretation, or do they only show cross-cultural consistency without identifying a physical cause? What additional evidence would be needed to move from "cross-cultural consistency" to "physically grounded constraint"?