Chapter 37 Quiz: Music in Social Media — The Acoustics of Virality

20 questions — mix of multiple choice, short answer, and analysis. Hidden answers in <details> blocks.


Q1. What does the trend in average listener engagement time (from ~1 minute in 1932 radio to ~1.5-3 seconds on TikTok in 2022) reveal about the structural change in music's competitive environment?

Show Answer The dramatic shortening of engagement time reflects the growing scarcity of attention relative to content supply. When content supply was limited (radio stations played 20-30 songs per hour; listeners had no easy alternative), music had time to develop. When content supply is effectively infinite (TikTok's feed never ends), any individual piece of music competes for the initial decision to keep listening against an immediately available alternative. This is the attention economy's structural effect: music must now win the engagement decision in seconds rather than minutes, creating systematic acoustic selection pressure toward features that capture attention at onset.

Q2. True or False: A song with a high spectral centroid will always perform better on TikTok than a song with a low spectral centroid.

Show Answer **False.** Spectral centroid is one of several acoustic predictors of first-second engagement, not the only one. The relationship is statistical (higher spectral centroid correlates with higher first-second engagement on average) but not deterministic. A song with low spectral centroid but exceptional hook strength, emotional legibility, and memeability could significantly outperform a bright but emotionally empty song. Additionally, genre context matters: a dark, low-centroid metal song has high first-second engagement within its niche audience because the spectral characteristics are *appropriate to the genre*, not despite them.
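
For readers who want to see what the quantity in this question actually measures, here is a minimal sketch of a magnitude-weighted spectral centroid in NumPy. The function name `spectral_centroid` and the two synthetic test tones are illustrative assumptions, not the chapter's own code. The sketch only shows that adding high-frequency content raises the centroid; it says nothing about engagement.

```python
import numpy as np

def spectral_centroid(signal, sample_rate):
    """Magnitude-weighted mean frequency of the signal, in Hz."""
    magnitude = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(np.sum(freqs * magnitude) / np.sum(magnitude))

sample_rate = 22050
t = np.arange(sample_rate) / sample_rate  # one second of audio

# Hypothetical test tones: a low tone alone, and the same tone plus a high partial.
dark = np.sin(2 * np.pi * 110 * t)
bright = dark + 0.8 * np.sin(2 * np.pi * 3520 * t)

centroid_dark = spectral_centroid(dark, sample_rate)
centroid_bright = spectral_centroid(bright, sample_rate)
print(f"dark: {centroid_dark:.1f} Hz, bright: {centroid_bright:.1f} Hz")
```

The centroid is a single descriptive number per clip, which is why it can only ever be one statistical predictor among several.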

Q3. Explain why Spotify's −14 LUFS normalization target does not fully end the "loudness war."

Show Answer LUFS normalization equalizes *average* (integrated) loudness between tracks. However, a heavily compressed track (low dynamic range) at −14 LUFS and a dynamically recorded track (high dynamic range) at −14 LUFS play at the same average loudness but do not sound equally "intense." The compressed track spends almost all its time near its −14 LUFS average, so every moment is about equally loud. The dynamic track varies between quiet and loud, so while it averages −14 LUFS, many of its moments sit well below that level. The compressed track sounds more consistently intense, giving producers an incentive to maximize compression even after normalization. Loudness competition shifts from raw LUFS value to dynamic range compression.
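
The point can be illustrated with a small numerical sketch. It rests on a loud assumption: plain RMS level in dB stands in for LUFS here, so there is no K-weighting and no gating as in a real loudness meter. The signals and the helper names (`rms_db`, `normalize_to`, `short_term_levels`) are hypothetical.

```python
import numpy as np

def rms_db(x):
    """Overall RMS level in dB (a crude stand-in for integrated LUFS)."""
    return 20 * np.log10(np.sqrt(np.mean(x ** 2)))

def normalize_to(x, target_db):
    """Apply one fixed gain so the whole track hits the target level."""
    gain_db = target_db - rms_db(x)
    return x * 10 ** (gain_db / 20)

def short_term_levels(x, window):
    """RMS level in dB of consecutive non-overlapping windows."""
    n = len(x) // window
    frames = x[: n * window].reshape(n, window)
    return 20 * np.log10(np.sqrt(np.mean(frames ** 2, axis=1)))

sr = 8000
t = np.arange(4 * sr) / sr
carrier = np.sin(2 * np.pi * 200 * t)

# "Compressed" track: constant level. "Dynamic" track: alternating quiet and loud seconds.
envelope = np.where((t % 2.0) < 1.0, 0.1, 1.0)
target = -14.0
compressed = normalize_to(carrier, target)
dynamic = normalize_to(carrier * envelope, target)

levels_c = short_term_levels(compressed, sr // 2)
levels_d = short_term_levels(dynamic, sr // 2)
print("overall:", round(rms_db(compressed), 2), round(rms_db(dynamic), 2))
print("short-term spread (dB):", round(np.ptp(levels_c), 2), round(np.ptp(levels_d), 2))
```

Both tracks report the same overall level after normalization, yet only the dynamic one shows a wide spread of moment-to-moment levels. That spread is exactly what normalization leaves untouched.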

Q4. What is the "phonological loop" and why does it explain why 15-second TikTok clips work as an acoustic memory format?

Show Answer The phonological loop is a component of working memory that temporarily stores acoustic information by mentally "rehearsing" it — cycling through the sounds repeatedly. Its capacity is approximately 1.5-2.5 seconds of material per cycle, and it can hold roughly 15-30 seconds of musical material before information starts decaying. A 15-second TikTok clip sits near this upper limit: it is long enough to establish a musical idea, create emotional meaning, and be retained after a single hearing, but short enough to be held in working memory and mentally rehearsed, forming the basis of an earworm. Clips that exceed working memory capacity are harder to retain after a single exposure.

Q5. In Spotify's acoustic features, what physical quantities most directly contribute to the "energy" metric? Name at least two.

Show Answer The energy metric is primarily based on: (1) **RMS (root mean square) loudness** — the square root of mean squared amplitude, which measures average acoustic power; (2) **spectral flatness** (Wiener entropy) — the ratio of geometric mean to arithmetic mean of spectral power, measuring how "noise-like" vs. "tonal" the spectrum is; (3) **onset rate** — how many note onsets occur per unit time (density of events); (4) **dynamic range** — the difference between loudest and quietest passages (low dynamic range = high perceived energy). Any two of these four are acceptable.

Q6. A researcher finds a strong positive correlation (r = 0.65) between Spotify energy and track popularity. Does this correlation prove that high-energy music is more enjoyable? Explain your reasoning.

Show Answer No. The correlation reflects a statistical relationship between energy and popularity in Spotify's streaming data, but does not prove causal or perceptual superiority of high-energy music. Several alternative explanations exist: (1) the major labels actively promote high-energy music, causing it to appear popular through promotion rather than inherent enjoyment; (2) Spotify's user base may be demographically skewed toward younger listeners who prefer high-energy styles in a particular cultural moment; (3) "popularity" measures streams, which reflect discovery and algorithmic promotion as much as genuine enjoyment; (4) there is a selection effect — listeners who put on acoustic music while reading may not stream it on Spotify at all. Correlation between acoustic features and popularity reflects how the streaming ecosystem works, not what music is "best."

Q7. What is "memeability" and what is the information-theoretic way to describe what makes a sound highly memeable?

Show Answer Memeability is a sound's capacity to function as a cultural reference when detached from its original context — to be recognized, carry shared meaning, and be applicable to new video or text contexts. Information-theoretically, a highly memeable sound has high **mutual information** between the acoustic signal and the emotional/conceptual response it evokes: knowing the sound strongly predicts the emotional state it communicates (and vice versa). This means: (a) the emotional signal is highly compressed (specific, legible in 1-2 seconds); (b) the relationship between acoustic features and the emotional response is reliable across diverse audiences; (c) the acoustic features are distinctive enough to be recognized out of context. High mutual information = efficient emotional communication = high memeability.
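
As a rough illustration of the mutual-information framing, the sketch below computes $I(S;E)$ in bits for two made-up joint distributions over "which sound was played" and "which emotion listeners report." The probability tables are invented for illustration only and are not data from the chapter.

```python
import numpy as np

def mutual_information_bits(joint):
    """Mutual information (in bits) of a discrete joint probability table."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    p_row = joint.sum(axis=1, keepdims=True)   # marginal over sounds
    p_col = joint.sum(axis=0, keepdims=True)   # marginal over emotions
    independent = p_row @ p_col                # what the table would be with no link
    nonzero = joint > 0
    return float(np.sum(joint[nonzero] * np.log2(joint[nonzero] / independent[nonzero])))

# Rows: two sounds. Columns: two reported emotions. Values are hypothetical.
legible = [[0.45, 0.05],
           [0.05, 0.45]]    # each sound reliably evokes one emotion
ambiguous = [[0.25, 0.25],
             [0.25, 0.25]]  # the sound tells you nothing about the emotion

mi_legible = mutual_information_bits(legible)
mi_ambiguous = mutual_information_bits(ambiguous)
print(f"legible: {mi_legible:.3f} bits, ambiguous: {mi_ambiguous:.3f} bits")
```

In this toy setup the "legible" sounds carry roughly half a bit about the listener's response, while the "ambiguous" ones carry none.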

Q8. The SIR epidemic model has a "basic reproduction number" $R_0 = \beta / \gamma$. What does $R_0 > 1$ mean in the context of music virality?

Show Answer $R_0 > 1$ means that, on average, each "infectious" listener (one currently sharing the song) causes more than one other listener to start sharing it before their own "infectious period" ends (that is, before they stop actively sharing). The song therefore spreads to an increasing fraction of the population: it goes viral. If $R_0 < 1$, each infectious listener causes fewer than one additional share on average, and the song's spread declines and eventually dies out. The closer $R_0$ is to 1 from above, the slower the spread. The larger $R_0$ is (e.g., $R_0 = 5$), the faster and more extensive the viral spread, but also potentially the faster the burnout (because the susceptible population is depleted quickly).
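
For reference, here is a minimal form of the SIR model behind this threshold, assuming population fractions with $s + i + r = 1$. This normalization and the symbols $s$, $i$, $r$ are assumptions made here; the question itself only gives $R_0 = \beta/\gamma$.

$$
\frac{ds}{dt} = -\beta s i, \qquad \frac{di}{dt} = \beta s i - \gamma i, \qquad \frac{dr}{dt} = \gamma i
$$

Early on, almost everyone is still susceptible ($s \approx 1$), so

$$
\frac{di}{dt} \approx (\beta - \gamma)\, i = \gamma\,(R_0 - 1)\, i,
$$

which grows when $R_0 > 1$ and decays when $R_0 < 1$.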

Q9. Why does Spotify's "danceability" metric present a cultural specificity problem, even though it is derived from measurable acoustic quantities?

Show Answer While danceability is calculated from measurable acoustic quantities (beat strength, tempo consistency, rhythmic regularity, groove), the model that combines these quantities into a single "danceability" score was trained on human ratings generated by Spotify's user community, which is demographically and culturally skewed (predominantly Western, young, and from high-income countries). Different musical traditions have very different relationships between acoustic features and danceability. West African polyrhythmic music may have irregular metric groupings that score low on "regularity" yet be highly danceable in its cultural context. Indian classical rhythm may have complex cycles (talas) that are extremely danceable to trained listeners but do not fit the Western pop model of rhythmic regularity. The metric encodes whose body and whose dancing conventions are treated as normative.

Q10. Describe the TikTok "sound layer" mechanism and explain how it creates a positive feedback loop. What is the physical analogy?

Show Answer The sound layer: TikTok tracks which audio track is associated with each video and accumulates the engagement statistics of all videos that have used that sound. A sound with a history of high-completion-rate videos receives algorithmic amplification, so new videos using that sound are more likely to receive broad distribution. This creates a positive feedback loop: high engagement → algorithmic amplification → more videos use the sound → more engagement data → more amplification. The physical analogy is **acoustic resonance in a cavity**: a standing wave at the cavity's resonant frequency is amplified by constructive interference with each reflection. The "resonant frequency" of the TikTok environment is whatever acoustic profile is associated with high engagement. Just as a physical cavity amplifies its resonant frequency, TikTok's algorithm amplifies its "resonant acoustic profile."

Q11. What does "spectral flatness" measure, and what musical character does high vs. low spectral flatness correspond to?

Show Answer Spectral flatness $F = \text{geometric mean}(|X_k|^2) / \text{arithmetic mean}(|X_k|^2)$ measures how uniformly distributed the spectral energy is across frequency. $F = 1$ for white noise (perfectly flat spectrum, energy equally distributed across all frequencies); $F \to 0$ for a pure tone (all energy at one frequency). Musically: **high spectral flatness** corresponds to dense, complex, noise-adjacent sounds — distorted guitar, synthesizer pads, heavily compressed percussion. **Low spectral flatness** corresponds to tonally pure, harmonically clean sounds — solo flute, acoustic piano harmonics, clean sine waves. High spectral flatness combined with high RMS produces the "dense, intense" quality that contributes to Spotify's high energy scores.
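
The definition above translates almost directly into code. The following is a minimal sketch under stated assumptions: the function name `spectral_flatness` and the small constant added to avoid `log(0)` are my own choices. One caveat is worth knowing: $F = 1$ describes an ideally flat spectrum, whereas a single finite frame of random noise lands noticeably below 1 because each frequency bin fluctuates around the flat average.

```python
import numpy as np

def spectral_flatness(signal):
    """Geometric mean over arithmetic mean of the power spectrum."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    power = power + 1e-12  # assumption: tiny floor so log() never sees zero
    geometric_mean = np.exp(np.mean(np.log(power)))
    return float(geometric_mean / np.mean(power))

n = 4096
rng = np.random.default_rng(0)
noise = rng.standard_normal(n)                    # one frame of white noise
tone = np.sin(2 * np.pi * 64 * np.arange(n) / n)  # pure tone, exactly 64 cycles

flatness_noise = spectral_flatness(noise)
flatness_tone = spectral_flatness(tone)
print(f"noise: {flatness_noise:.3f}, tone: {flatness_tone:.2e}")
```

The tone comes out many orders of magnitude closer to 0 than the noise, which is the contrast the high-versus-low description relies on.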

Q12. A researcher observes that pop songs from 2020 have a narrower range of tempo values (80% fall between 90-130 BPM) compared to pop songs from 1980 (80% fall between 60-150 BPM). What mechanism does the chapter's analysis predict is responsible for this narrowing?

Show Answer The chapter's analysis predicts the "double optimization loop" as the primary mechanism: (1) Streaming recommendation algorithms trained on engagement data preferentially surface tracks with tempos in the "virality sweet spot" (approximately 90-130 BPM), as this range correlates with high danceability and completion rates. (2) Artists and producers, responding to streaming performance data, increasingly target this tempo range when producing new music. (3) AI music generation systems trained on popular music data further concentrate production toward this tempo range. All three effects compound to narrow the effective tempo distribution of music that reaches audiences in the mainstream tier. The narrowing is a signature of selection pressure, not artistic consensus.

Q13. True or False: The chapter argues that algorithmic curation is causing overall music diversity (across all listeners and all tiers) to decrease.

Show Answer **False.** The chapter makes a more nuanced argument: algorithmic curation is causing a **polarization**, not a simple decrease in diversity. *In the mainstream tier* (top 1% of streams), acoustic diversity has decreased — the "Spotify Sound" convergence is real. But *in the long tail*, streaming has enabled acoustic diversity that previous distribution systems (radio gatekeeping, physical retail) never permitted. Small niche genres can now reach their worldwide audiences with precision. Total ecosystem diversity may have *increased* even as mainstream-tier diversity has decreased. The chapter describes this as a clustering effect — tighter mainstream peaks and deeper, more diverse long-tail valleys.

Q14. Why is the "onset strength function" $O(t) = \sum_k \max(0, |X(t,k)|^2 - |X(t-1,k)|^2)$ specifically useful for predicting first-second engagement rather than the RMS loudness of the opening second?

Show Answer RMS loudness of the opening second measures the *average* energy level: it captures how loud the opening is overall, but not how quickly that energy arrives. The onset strength function measures the *rate of energy increase*, that is, how much new energy arrives in each time frame. A song could have high RMS opening loudness simply because its level builds to a loud, sustained texture over the course of the second, with no distinct attack. First-second engagement depends more on the *arrival* of the energy signal: the moment the auditory system detects a new, salient event and triggers the orienting response. The onset strength function specifically captures this event-detection dimension. A high $O(t)$ in the very first frames (treating the frame before the track starts as silence) means an energetic event happens right at the beginning, which is what triggers the orienting response that prevents scrolling.
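
The contrast between the two measures can be sketched numerically. The code below implements the $O(t)$ formula from the question on a simple framed power spectrogram, with two assumptions of my own: a silent frame is prepended so that the first value is defined, and the frame length, hop size, and test signals are arbitrary illustrative choices. Both test signals are scaled to identical RMS over the opening second.

```python
import numpy as np

def power_spectrogram(signal, frame_len=512, hop=256):
    """Hann-windowed power spectrogram with shape (time, frequency)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def onset_strength(signal, frame_len=512, hop=256):
    """Sum over frequency of the positive frame-to-frame power increase."""
    power = power_spectrogram(signal, frame_len, hop)
    silence = np.zeros((1, power.shape[1]))  # assumption: silence before the track
    increase = np.maximum(0.0, np.diff(power, axis=0, prepend=silence))
    return increase.sum(axis=1)

def rms(x):
    return float(np.sqrt(np.mean(x ** 2)))

sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)

abrupt = tone.copy()                           # full level from the first sample
fade_in = tone * np.linspace(0.0, 1.0, sr)     # level ramps up over the second
fade_in = fade_in * (rms(abrupt) / rms(fade_in))  # match RMS of the opening second

onset_abrupt = onset_strength(abrupt)
onset_fade = onset_strength(fade_in)
print("RMS:", round(rms(abrupt), 4), round(rms(fade_in), 4))
print("peak onset strength ratio:", round(onset_abrupt.max() / onset_fade.max(), 1))
```

With equal RMS, the abrupt start concentrates its energy arrival in the first frame, while the fade-in spreads small increases across the whole second.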

Q15. The chapter describes "acoustic distinctiveness as a signal of authenticity" for independent music. Explain the mechanism by which distinctiveness signals authenticity, and identify one limitation of this mechanism.

Show Answer **Mechanism:** In a world where algorithmically optimized music converges on a specific acoustic profile (the "Spotify Sound"), any music that significantly departs from that profile carries an implicit signal: "I was not made by optimizing for the algorithm." This departure can only happen if the creator prioritized their own aesthetic vision over algorithmic performance — a form of costly signaling. Because algorithm-optimization would lead toward the center of the distribution, deviation from the center signals non-optimization, which signals authentic creative motivation. Listeners who value authenticity over algorithmic smoothness respond positively to this signal. **Limitation:** The mechanism can be gamed: producers can deliberately engineer music to *sound* non-commercial (lo-fi aesthetics, "imperfect" production) as an aesthetic choice that is itself commercially calculated. When authenticity-signaling becomes a genre convention (as lo-fi hip-hop has become), it is no longer a reliable signal of actual non-commercial motivation.

Q16. What would a piece of music with valence = 0.9, energy = 0.85, danceability = 0.80, acousticness = 0.04, and tempo = 128 BPM most likely sound like? What genre does this profile match, and what does the chapter predict about its streaming popularity?

Show Answer This profile matches **Electronic/EDM** music: very high energy and danceability, extremely low acousticness (synthetically produced), upbeat and euphoric (high valence), and a tempo of 128 BPM, which is a standard tempo for house and much techno. The chapter predicts this profile falls squarely in the "virality zone" (high energy plus high danceability) and would be predicted to have above-average streaming popularity, particularly in the mainstream tier. The virality analysis code shows this genre cluster as one of the highest in simulated popularity scores.

Q17. Explain the 15-second working memory effect and identify one specific way that 15-second clip culture distorts the intended meaning of music.

Show Answer **Working memory effect:** 15 seconds closely matches human working memory capacity for musical phrases — the phonological loop's ability to hold and rehearse an acoustic passage. This makes 15-second clips highly retainable after a single hearing, easily recognized on re-encounter, and well-matched to earworm formation. **Distortion example:** Context stripping. Musical meaning often accumulates over the full arc of a piece — the emotional impact of a climactic passage depends on having experienced the preceding tension and development. A 15-second clip of that climactic passage, stripped of context, floats free and is recontextualized by whatever video it accompanies. A lyric that means one thing in the album context becomes ironic in a meme. A dramatic orchestral flourish becomes comic when paired with a cat video. The 15-second clip does not preserve the music's intended meaning — it replaces it with whatever meaning the new context assigns.

Q18. The chapter says that TikTok's audio processing pipeline applies its own compression and EQ to all audio. Why would a platform add processing after the creator has already finalized their audio, and what acoustic consequences might this have?

Show Answer **Why platforms process audio:** (1) To normalize loudness across diverse content (videos with very different loudness levels); (2) To optimize audio for the platform's primary listening context (phone speakers, headphones in noisy environments); (3) To reduce data payload for streaming efficiency (audio codec optimization); (4) Possibly to enforce community standards (reduce harsh, distorted, or offensively loud audio). **Acoustic consequences:** The platform's processing may alter the spectral balance (EQ changes which frequencies are emphasized), the dynamic range (additional compression), and potentially the temporal character of the audio (codec artifacts). Music that was carefully mastered may sound different than the creator intended. Specifically, TikTok's compression can reduce the prominence of subtle dynamic variations that a producer used intentionally — turning a carefully crafted "quiet moment" into merely another loud moment.

Q19. What does the SIR model predict would happen to the viral propagation of a song if its $\gamma$ (recovery rate) was very high — that is, if listeners "recovered" (stopped sharing) very quickly after first hearing?

Show Answer High $\gamma$ means a short infectious period: each listener who "catches" the song shares it for only a brief time before stopping. For a fixed transmission rate $\beta$, raising $\gamma$ lowers $R_0 = \beta/\gamma$, and once $\gamma$ exceeds $\beta$ the song never takes off at all. Even if $\beta$ is large enough that $R_0 = \beta/\gamma > 1$ (so the song does go viral), the epidemic burns through the population very quickly: the peak of sharing is high but sharp, and the outbreak ends fast. The song may have spectacular short-term numbers, such as millions of streams in a week, but it drops from public attention just as rapidly as listeners "recover." This is the pattern of "viral smash hits" that dominate charts for 2-3 weeks and then disappear. High $\gamma$ corresponds acoustically to music with high initial impact but low replay value, such as a perfectly engineered hook that feels repetitive after 5-10 hearings.
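
To isolate the effect of $\gamma$, the sketch below holds $R_0 = 3$ fixed and compares a slow-recovery song with a fast-recovery one using simple Euler integration of the SIR equations. The parameter values, the time units, and the helper names (`simulate_sir`, `summarize`) are illustrative assumptions, not figures from the chapter.

```python
def simulate_sir(beta, gamma, i0=1e-4, dt=0.01, t_max=200.0):
    """Euler integration of SIR on population fractions; returns (times, infected)."""
    s, i = 1.0 - i0, i0
    times, infected = [], []
    for k in range(int(t_max / dt)):
        times.append(k * dt)
        infected.append(i)
        new_sharers = beta * s * i * dt   # susceptible listeners who start sharing
        new_recovered = gamma * i * dt    # sharers who stop sharing
        s -= new_sharers
        i += new_sharers - new_recovered
    return times, infected

def summarize(times, infected):
    """Peak sharing fraction, time of peak, and width of the peak at half height."""
    peak = max(infected)
    t_peak = times[infected.index(peak)]
    above_half = [t for t, i in zip(times, infected) if i >= peak / 2]
    return peak, t_peak, above_half[-1] - above_half[0]

# Same R0 = beta / gamma = 3 in both cases; only the recovery rate differs.
peak_slow, t_peak_slow, width_slow = summarize(*simulate_sir(beta=0.3, gamma=0.1))
peak_fast, t_peak_fast, width_fast = summarize(*simulate_sir(beta=3.0, gamma=1.0))

print(f"slow recovery: peak={peak_slow:.3f} at t={t_peak_slow:.1f}, width={width_slow:.1f}")
print(f"fast recovery: peak={peak_fast:.3f} at t={t_peak_fast:.1f}, width={width_fast:.1f}")
```

With $R_0$ held equal, the two runs reach about the same peak height, but the high-$\gamma$ run gets there and fades away roughly ten times faster. That is the "high but sharp" pattern in model terms.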

Q20. Why does the chapter say that the "Spotify Sound" convergence is consistent with the physics of information systems, rather than being an arbitrary or surprising outcome?

Show Answer The "physics of information systems" predicts that when multiple agents in a network all optimize toward the same target function (past popularity), the system converges toward the fixed points of that optimization: the acoustic profiles that have historically performed best. This is analogous to a dynamical system converging toward a stable attractor, where small perturbations (any individual new song) get "pulled" toward the attractor by the optimization process. The "Spotify Sound" is the acoustic attractor of the streaming ecosystem, the fixed point toward which the double optimization of production and distribution converges. Far from being surprising, the convergence is predicted by the fundamental mathematics of self-referential optimization systems. What would be surprising, and would require special explanation, is if the system converged toward something *other* than what it was optimizing for.