Case Study 36.1: Suno and Udio — The First Mass-Market AI Music Generators
The Arrival of Text-to-Song
In 2024, two AI systems changed the terms of a debate that had been mostly theoretical: Suno (developed by Suno AI, Cambridge, Massachusetts) and Udio (developed by Uncharted Labs) made it possible for anyone with an internet connection and a text prompt to generate a complete, production-quality song — with lyrics, vocals, instrumentation, and a coherent structure — in approximately thirty seconds. No musical training required. No instruments. No recording equipment. Just words.
The outputs were, by many accounts, astonishing. Suno and Udio could generate credible pop songs, convincing jazz standards, atmospheric film score cues, and passable heavy metal anthems. The vocals had intonation, the arrangements had dynamics, the lyrics rhymed and scanned. A casual listener encountering a Suno output for the first time often could not immediately distinguish it from a professionally produced independent artist's track.
This was a qualitative leap from previous AI music tools. Google's MusicLM had been impressive but academic; Stable Audio was powerful but technically demanding to use well. Suno and Udio were point-and-click. They democratized AI music generation in the same way that smartphone cameras democratized photography — making a previously specialized skill broadly accessible, while simultaneously raising profound questions about what that skill had meant.
Architecture: What's Under the Hood
Both Suno and Udio use architectures that build on the transformer and diffusion model approaches described in the main chapter text. Suno's system — based on publicly available information and technical analysis — appears to use a cascade architecture similar to MusicLM: a language model generates a structured representation of the song (lyrics, structure, style), which conditions a music generation model that produces the audio. The vocals and accompaniment are generated as an integrated unit rather than separately synthesized and mixed.
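Because the public details are thin, the following is only a structural sketch of the general cascade idea (a planning stage whose output conditions an audio-rendering stage), not Suno's actual code. Every class and function name below is hypothetical and the model logic is stubbed out.

```python
# A structural sketch of the cascade idea (plan first, then render audio from
# the plan), in the spirit of MusicLM-style systems. This is NOT Suno's
# published code; every name below is hypothetical and the logic is a stub.
from dataclasses import dataclass, field

import numpy as np


@dataclass
class SongPlan:
    lyrics: str                                      # lyric text from the planner
    structure: list = field(default_factory=list)    # e.g. ["intro", "verse", "chorus"]
    style_tags: list = field(default_factory=list)   # e.g. ["indie pop", "120 bpm"]


def plan_song(prompt: str) -> SongPlan:
    """Stage 1 (hypothetical): a language model turns the prompt into a
    structured plan -- lyrics, section layout, and style descriptors."""
    return SongPlan(
        lyrics="placeholder lyric text",
        structure=["intro", "verse", "chorus", "verse", "chorus", "outro"],
        style_tags=[prompt],
    )


def render_audio(plan: SongPlan, seconds: float = 30.0, sr: int = 32_000):
    """Stage 2 (hypothetical): a music model conditioned on the plan emits the
    waveform, vocals and accompaniment generated as one integrated signal."""
    return np.zeros(int(sr * seconds)), sr   # silence stands in for the output


audio, sr = render_audio(plan_song("dreamy 80s synthwave with a female vocal"))
```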
Udio appears to use a diffusion-based approach operating in a compressed latent representation of audio. This gives Udio somewhat different characteristics from Suno: its outputs tend to have a different spectral texture, with notably different handling of high-frequency detail and vocal timbre. Users who experimented extensively with both systems noted that Suno tended to produce more "polished"-sounding results in mainstream pop styles, while Udio's outputs had a somewhat more variable, occasionally more interesting texture in experimental contexts.
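Again as an illustration rather than a description of Udio's implementation, latent diffusion at generation time amounts to starting from noise in the compressed latent space, iteratively denoising it, and decoding the result back to audio. The denoiser, decoder, and update rule below are stand-ins chosen only to make the loop concrete.

```python
# Minimal sketch of sampling from a diffusion model in a compressed audio
# latent space, assuming a pretrained denoiser and decoder exist. Illustrative
# only; Udio's actual model and training details are not public.
import numpy as np


def sample_latent_diffusion(denoiser, decode, latent_shape, steps=50, rng=None):
    """Start from Gaussian noise in the latent space, iteratively denoise,
    then decode the final latent back to a waveform."""
    rng = rng or np.random.default_rng(0)
    z = rng.standard_normal(latent_shape)        # pure-noise latent
    for t in np.linspace(1.0, 0.0, steps):       # noise level goes 1 -> 0
        eps_hat = denoiser(z, t)                 # predicted noise at level t
        z = z - (1.0 / steps) * eps_hat          # crude Euler-style update
    return decode(z)                             # latent -> audio samples


# Stand-in components so the sketch runs end to end:
dummy_denoiser = lambda z, t: 0.1 * z
dummy_decode = lambda z: z.flatten()
audio = sample_latent_diffusion(dummy_denoiser, dummy_decode, (64, 256))
```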
Both systems were trained on massive datasets of recorded music — the exact training data was not publicly disclosed, which became legally significant. The systems' ability to generate music "in the style of" specific artists with considerable accuracy implied that those artists' recordings were present in the training data.
Spectral Analysis: What AI Pop Really Sounds Like
When researchers performed spectral analysis on large batches of Suno and Udio outputs, several consistent features emerged that align with Aiko Tanaka's findings in her formant experiment (36.8):
Spectral homogenization. Outputs within the same broad style category (e.g., "indie pop" or "80s synthwave") showed strikingly similar spectral envelopes — the averaged shape of energy distribution across frequency. Human-produced music within these genres, while certainly sharing genre conventions, shows considerably more individual variation in spectral character. The AI outputs cluster tightly; the human outputs spread more broadly.
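One way such a comparison can be made (an assumed methodology, not necessarily the researchers' exact procedure) is to compute each track's long-term average spectrum and measure how far individual tracks deviate from the batch centroid. The sketch below uses librosa; the file lists are placeholders.

```python
# Sketch of a spectral-homogenization measurement: compute each track's
# long-term average spectrum, then compare how tightly the AI batch clusters
# around its centroid versus a human-produced batch.
import numpy as np
import librosa


def long_term_spectrum(path, sr=22_050, n_fft=2048):
    y, sr = librosa.load(path, sr=sr, mono=True)
    S = np.abs(librosa.stft(y, n_fft=n_fft))        # magnitude spectrogram
    return 20 * np.log10(S.mean(axis=1) + 1e-10)    # average over time, in dB


def batch_spread(paths):
    envs = np.stack([long_term_spectrum(p) for p in paths])
    centroid = envs.mean(axis=0)
    # Mean dB deviation of each track's envelope from the batch centroid:
    return float(np.mean(np.abs(envs - centroid)))


# spread_ai = batch_spread(ai_generated_tracks)        # expected: smaller spread
# spread_human = batch_spread(human_produced_tracks)   # expected: larger spread
```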
Compression artifacts and training data echoes. Some spectral analysis revealed what researchers called "training data echoes" — characteristic spectral patterns in the high-frequency range (above 12 kHz) that appeared to be artifacts of the lossy audio compression used on the training data rather than genuine musical content. If the training data consisted largely of MP3-compressed recordings, the model may have learned the MP3 spectral characteristics as features of music rather than as encoding artifacts.
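A crude check for this kind of artifact (illustrative only; the cutoff frequency and interpretation threshold are assumptions) is to measure how much energy survives above a candidate frequency, since common lossy-codec settings impose a sharp low-pass roll-off that natural acoustic sources do not share.

```python
# Sketch of a "codec echo" check: the fraction of total energy above a cutoff.
# A value near zero across many outputs suggests a hard, codec-style low-pass
# inherited from compressed training data rather than from the music itself.
import numpy as np
import librosa


def hf_energy_ratio(path, cutoff_hz=12_000, sr=44_100, n_fft=4096):
    y, sr = librosa.load(path, sr=sr, mono=True)
    S = np.abs(librosa.stft(y, n_fft=n_fft)) ** 2           # power spectrogram
    freqs = librosa.fft_frequencies(sr=sr, n_fft=n_fft)
    hf = S[freqs >= cutoff_hz].sum()
    total = S.sum() + 1e-12
    return float(hf / total)
```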
Vocal formant averaging. Consistent with Aiko's findings, the AI-generated vocals showed broader, more averaged formant distributions than typical professional recordings. The characteristic individuality of a singer's voice — the specific formant pattern that makes a voice recognizable across songs — was largely absent. AI-generated voices in Suno and Udio tend to sound like a stylistic category of voice (e.g., "female indie pop vocalist") rather than a specific person with an individual acoustic identity.
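Formant patterns of this kind are commonly estimated with linear predictive coding (LPC), where the poles of the fitted filter approximate the vocal-tract resonances. The sketch below uses conventional parameter choices rather than anything reported in the study, and fits one global model for brevity; real analyses work frame by frame on isolated, voiced vocal segments.

```python
# Sketch of LPC-based formant estimation: fit an LPC filter, take its poles,
# and convert pole angles to frequencies. The lowest few resonances roughly
# correspond to formants F1..F4.
import numpy as np
import librosa


def estimate_formants(path, sr=16_000, order=12):
    y, sr = librosa.load(path, sr=sr, mono=True)
    a = librosa.lpc(y, order=order)              # LPC filter coefficients
    roots = np.roots(a)
    roots = roots[np.imag(roots) > 0]            # keep one of each conjugate pair
    freqs = np.angle(roots) * sr / (2 * np.pi)   # pole angle -> frequency in Hz
    return np.sort(freqs[freqs > 90])[:4]        # discard near-DC poles
```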
Dynamic range compression. AI-generated tracks showed consistently high compression (low dynamic range), approximating the loudness-normalized target levels typical of streaming platforms but lacking the dynamic expressive variation that human producers use intentionally. The compression was present but not purposeful.
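Two simple proxies for this (assumed measures, not the researchers' stated metrics) are the crest factor, the peak level relative to RMS, and the spread of short-term loudness across the track; heavily compressed masters score low on both.

```python
# Sketch of a simple dynamic-range summary: crest factor and the spread of
# short-term RMS loudness. Frame sizes and percentiles are illustrative.
import numpy as np
import librosa


def dynamics_summary(path, sr=44_100, frame=4096, hop=1024):
    y, sr = librosa.load(path, sr=sr, mono=True)
    rms = librosa.feature.rms(y=y, frame_length=frame, hop_length=hop)[0]
    crest_db = 20 * np.log10(np.max(np.abs(y)) / (np.sqrt(np.mean(y**2)) + 1e-12))
    loudness_spread_db = 20 * np.log10(
        np.percentile(rms, 95) / (np.percentile(rms, 10) + 1e-12)
    )
    return {"crest_factor_db": float(crest_db),
            "loudness_spread_db": float(loudness_spread_db)}
```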
The Recording Industry Responds: The RIAA Lawsuit
In June 2024, the Recording Industry Association of America (RIAA) filed copyright infringement lawsuits against Suno AI and Udio (Uncharted Labs) in federal courts. Brought on behalf of major record labels including Universal Music Group, Warner Music Group, and Sony Music Entertainment, the suits alleged that the AI companies had used copyrighted recordings to train their systems without obtaining licenses or paying compensation to the rights holders.
The complaints alleged copyright infringement on a "massive scale" — noting that the systems' ability to generate convincing imitations of known artists' styles was only possible because those artists' recordings had been used in training. Suno's CEO Mikey Shulman responded publicly that the company's systems were trained to generate new music rather than reproduce existing music, and that their approach was analogous to a human musician learning from recorded music.
This analogy — a human musician learning from recordings versus an AI training on recordings — became the central conceptual battleground of the lawsuits. The industry's position: listening and learning is one thing; systematically processing copyrighted recordings to extract statistical patterns for commercial music generation is another, and that second activity is a commercial use of the underlying creative work that requires compensation.
Suno and Udio each ultimately settled with the major labels for undisclosed sums, effectively acknowledging the legal risk without establishing clear legal precedent. The settlement terms were not public, but analysts noted that both companies continued operating and presumably negotiated ongoing licensing arrangements.
What the Legal Battle Reveals About Musical Authorship
The Suno/Udio legal saga is interesting not just for its legal outcome but for what it reveals about the conceptual foundations of musical authorship.
Copyright law protects expression — the specific notes, words, arrangements, and recordings. It does not protect style, genre, mood, or general musical approach. This is why you can write a song "in the style of" anyone without infringement — as long as you don't copy specific melody, lyrics, or recorded sound.
The AI case sits awkwardly between these categories. The AI does not copy specific protected expression — it generates new notes, new words, new arrangements. But it achieves this by extracting and internalizing the statistical patterns of protected expression at massive scale. Is statistical extraction of expression the same as copying expression? Or is it more like the human musician who listens to thousands of records, internalizes the patterns, and generates new music from that internalized knowledge?
From a physics-of-music perspective, this is precisely the statistics/physics distinction. Statistical patterns — the regularities that can be extracted from recordings — are what the law calls "style" (unprotected). The specific expression — the precise waveform, the exact melody and lyric — is what copyright protects. AI systems extract style (statistics) and generate new expression. Whether this extraction itself constitutes infringement is a legal question that physics cannot answer, but physics can clarify what, exactly, is being extracted.
Discussion Questions
- If a human musician listened to every Beatles recording released to date and composed new music in the Beatles' style, would this be copyright infringement? How is this different from — or similar to — how Suno and Udio trained on copyrighted recordings?
- The RIAA's argument implies that the commercial value of training on copyrighted recordings gives rights holders a claim on the resulting AI output. What physical principle or economic logic underlies this argument? Is it sound?
- Suno and Udio settled their lawsuits without admitting wrongdoing. What implications does this have for future development of AI music tools? Would a clear legal ruling (either for or against the AI companies) have been better for the music industry, for artists, and for music listeners?
- Spectral analysis reveals that AI-generated vocals lack the individual formant fingerprint of specific human singers. Is this a limitation or a feature, from an ethical standpoint? What would it mean for an AI to perfectly replicate a specific living singer's voice?