The Atlantic journalist Alex Reisner identified four open datasets that were actively shared within the AI developer community. Together, they contain over 21 million music tracks: the largest contains approximately 12.3 million compositions, the second has 9.7 million, and two others contain roughly 100,000 recordings each.
Among them are songs by Taylor Swift, Bad Bunny, and millions of other artists. But the main value of the publication is not the famous names: for the first time, rights holders received a verification tool — search databases where they can check whether a specific track ended up in the training dataset.
Why This Became Possible
Neural networks do not store original recordings; they learn statistical patterns. That is why, as noted by WIPO in its review, auditing is practically impossible: companies can simply delete the original training data. Neither Suno nor Udio, the two largest generative music services, has disclosed the composition of its datasets so far.
"If a model was trained on Taylor Swift's music and obscure artists' music — should everyone receive the same compensation?"
Dorian Gehrmanns, music AI researcher, WIPO Magazine article
This is not a rhetorical question — it is an unresolved legal and economic problem. Existing royalty models do not provide for compensation for the use of works as training data, only for reproduction or derivative works.
Lawsuits Are Already Underway, but Slowly
In June 2024, the RIAA, on behalf of Sony Music, UMG, and Warner Records, filed lawsuits against Suno and Udio for copyright infringement "on a shocking scale." The complaints were later expanded with the allegation that both companies scraped material from YouTube. In October 2025, UMG reached a settlement with Udio that includes both a licensing agreement and a "compensatory settlement." Sony and Warner refused to settle, so the legal proceedings against Udio continue, and Suno is defending itself under the fair use doctrine.
In parallel, Taylor Swift applied in 2025 for trademark protection for her own voice and image — protection against deepfakes and unauthorized use in AI products.
What This Means in Numbers
- 21+ million tracks — the volume of four identified datasets; the actual scale of use is likely larger
- $0 — royalties received by most authors of these 21 million tracks for the use of their works as training data
- UMG–Udio settlement — the first precedent of payment, but without disclosure of the amount and without coverage of other platforms
- Suno and Udio — only two of many companies developing generative music AI
Ed Newton-Rex, founder of the nonprofit Fairly Trained, which advocates for payment to authors for training data, called the situation "structural appropriation" in a keynote address at the ISMIR 2024 conference: the industry built tools to replace creators based on the work of these very creators.
The Atlantic's publication made the invisible visible. But seeing does not yet mean receiving compensation: none of the four datasets offers an opt-out mechanism or a payment system. Rights holders can only verify that their work has already been used.
If Suno loses its fair use case in court, the ruling will set a precedent that forces the entire generative audio industry to rewrite its terms. But if it wins, databases like those published by The Atlantic will remain merely a museum of others' property with no address for a lawsuit.