Case Study 40.2: Getty Images v. Stability AI — When Creator Work Becomes Training Data
The Lawsuit
In January 2023, Getty Images initiated proceedings against Stability AI in the High Court of Justice in London, followed weeks later by a suit in the United States District Court for the District of Delaware. The lawsuits allege that Stability AI scraped and used more than 12 million images from Getty's library, along with their associated metadata, captions, and copyright information, to train Stable Diffusion, its open-source AI image generation model, without consent, license, or compensation.
Getty's complaint included an exhibit that has become one of the most widely shared images in the AI copyright debate: a Stable Diffusion-generated image that featured a blurry, distorted version of Getty's signature watermark in the corner — visible evidence that the training data included Getty-watermarked images, and that the model had learned to reproduce the watermark as part of its learned "style" of realistic photography.
In the United States, Getty's complaint alleges copyright infringement, removal or alteration of copyright management information in violation of the Digital Millennium Copyright Act (DMCA), and trademark infringement and dilution.
As of 2026, the US case was still working through pre-trial motions, with the central legal question unresolved: whether AI training on copyrighted works constitutes "fair use" under US copyright law.
Why This Matters Beyond Getty
Getty Images is a large commercial enterprise with resources to litigate. The question the case raises, however, extends far beyond corporate intellectual property disputes.
Getty's library includes the work of hundreds of thousands of individual photographers, many of them freelancers whose livelihoods depend on licensing revenue from their images. When an AI company trains a model on millions of photos, users can generate images in the style of, say, a photojournalist's distinctive documentary aesthetic without ever licensing the photojournalist's work. The photojournalist loses revenue they would otherwise have earned; the AI company profits from a capability that the photojournalist's work helped create.
For individual creators, the scale of what happened is difficult to fully grasp:
- LAION-5B, one of the most widely used datasets for training image generation models, contains approximately 5.85 billion image-text pairs (strictly, URLs pointing to images plus their associated captions) harvested from Common Crawl web data. The images include content from personal websites, photo-sharing platforms, artist portfolio sites, news archives, and social media, almost all scraped without consent.
- The Pile, a text dataset used to train many large language models, contains approximately 825 gigabytes of text from websites, academic papers, forums, and books, including the Books3 collection of books copied from a shadow library without authorization.
- Multiple musician advocacy groups have documented that AI music models were trained on commercially released songs and independently published tracks without licensing agreements.
The creators whose work is in these datasets didn't agree to be in them. Many didn't even know. And they received nothing.
The "Fair Use" Defense
AI companies have advanced "fair use" as their primary legal defense. Under US copyright law, fair use allows limited use of copyrighted material without permission in certain circumstances — scholarship, commentary, parody, news reporting. The fair use analysis considers four factors:
- Purpose and character of the use — Is it commercial? Transformative?
- Nature of the copyrighted work
- Amount and substantiality of the portion used
- Effect on the potential market for the copyrighted work
The AI companies' argument: training a model is transformative use — the model doesn't reproduce the original works, it learns patterns from them. The output of the model is not a copy of any specific input.
The creators' counter-argument: the economic effect test is decisive. When Stable Diffusion can generate stock photography that directly competes with Getty's licensed library, the effect on the market for the original works is not incidental; it is the primary commercial purpose of the system. And if AI music tools can generate music in the style of a specific independent musician, users have far less reason to buy or license that musician's work.
A related question: does the blurry Getty watermark in Stable Diffusion outputs demonstrate that the model isn't purely learning abstract patterns? Does it reproduce sufficiently specific elements of training images to fail the "transformative use" test?
As of 2026, no appellate court has conclusively resolved the fair use question for AI training at scale. That is exactly why these cases matter: whichever way they are decided, they will set the governing precedent.
What Artists and Creators Have Done
In the absence of legal resolution, creators have pursued several strategies:
Opt-out mechanisms: Stability AI and several other companies have created opt-out mechanisms — tools that let creators specify that their work should not be used in future training. Critics point out that opting out of future use doesn't address past scraping, and that the burden falls on individual creators to discover and use opt-out tools rather than on companies to get consent proactively.
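One machine-readable form these opt-out signals can take is an HTTP response header: DeviantArt popularized "noai" and "noimageai" directives in the X-Robots-Tag header, and some dataset-building tools check for them before downloading. A rough sketch of what checking for that signal looks like (the URL below is a hypothetical placeholder):

```python
import requests

def site_opts_out(url: str) -> bool:
    """Check a URL for an AI-training opt-out signal in the X-Robots-Tag
    response header ("noai" / "noimageai" directives, a convention
    popularized by DeviantArt). Honoring the signal is voluntary;
    nothing stops a scraper from ignoring it."""
    resp = requests.head(url, allow_redirects=True, timeout=10)
    tag = resp.headers.get("X-Robots-Tag", "").lower()
    return "noai" in tag or "noimageai" in tag

# Hypothetical placeholder URL; substitute a page or image you actually host.
print(site_opts_out("https://example.com/portfolio/photo.jpg"))
```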
Advocacy organizations: The Artist Rights Alliance, a musician-led advocacy nonprofit, brings together both major-label recording artists and independent musicians. In April 2024, it published an open letter signed by more than 200 artists asking AI companies to stop training on musicians' work without consent. Signatories included artists from Billie Eilish to Nicki Minaj to Katy Perry.
Do Not Train registries: Organizations including Spawning AI (spawning.ai) have created "Have I Been Trained?" tools that let creators check if their work appears in LAION and other datasets, and "do not train" registries that signal creator preferences to AI developers. Adoption of these registries by AI companies remains voluntary and uneven.
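These lookups are possible in part because LAION publishes its dataset as metadata tables (parquet files of URL-caption pairs) rather than the images themselves, so a creator who downloads a metadata shard can also scan it directly. A minimal sketch follows; the filename, domain, and URL column name are assumptions, since schemas vary across LAION releases:

```python
import pandas as pd

# A minimal sketch: scan one locally downloaded LAION metadata shard for
# entries pointing at your own domain. The dataset ships as parquet files
# of URL/caption pairs, not images. The filename, column name, and domain
# below are placeholders; check your shard's actual schema first.
shard = pd.read_parquet("laion2B-en-part-00000.parquet", columns=["URL"])
mine = shard[shard["URL"].str.contains("myportfolio.example.com", na=False)]
print(f"{len(mine)} of {len(shard):,} entries point at my domain")
```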
Alternative AI models: Some AI image generation services have specifically committed to training only on licensed or consent-based data. The most notable is Adobe Firefly, which Adobe trained on Adobe Stock content (licensed under contributor terms that grant Adobe broad usage rights), openly licensed content, and public domain material. Adobe has offered to indemnify commercial users against copyright claims.
The Compensation Question
Even if AI training on existing content were ultimately held to be legal under some "fair use" interpretation, a separate ethical question remains: should creators be compensated for the contribution their work made to AI systems that are now worth billions of dollars?
Several proposals have been advanced:
Collective licensing: A model analogous to music royalties, in which a licensing body collects usage fees from AI companies and distributes them to creators whose work was used in training. The challenge: identifying which specific training images or texts contributed to which model outputs is not technically straightforward. (A toy sketch of the payout arithmetic follows these proposals.)
Training data credits: AI companies document and credit the training data used in their models, allowing creators to claim compensation based on documented inclusion.
Revenue sharing: AI companies that generate revenue from outputs based on creator training data pay into a fund distributed to creators in relevant categories.
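To make the arithmetic behind collective licensing and revenue sharing concrete, here is a toy sketch of a pro-rata payout. It assumes, optimistically, the one thing none of these proposals has solved: that each creator's documented share of the training data is known. All names and figures are hypothetical:

```python
def distribute(pool_usd: float, documented_works: dict[str, int]) -> dict[str, float]:
    """Split a licensing pool across creators in proportion to how many
    of their works are documented in the training set."""
    total = sum(documented_works.values())
    return {creator: pool_usd * count / total
            for creator, count in documented_works.items()}

# Hypothetical pool size and contributor counts.
payouts = distribute(1_000_000.0, {
    "freelance_photographer": 4_200,
    "independent_illustrator": 310,
    "stock_agency": 95_000,
})
print(payouts)
```

Even this trivial scheme exposes a design question the proposals leave open: pure pro-rata allocation concentrates payouts on high-volume contributors such as agencies, leaving comparatively little for the individual creators the proposals are meant to protect.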
None of these models has been adopted at scale as of 2026. Litigation or legislation will likely have to force the issue.
Discussion Questions
- The AI companies argue that training on copyrighted work is "transformative" fair use because the model learns patterns rather than reproducing copies. The blurry Getty watermark visible in some Stable Diffusion outputs complicates this argument. How do you evaluate the "transformative use" defense in this context?
- If fair use for AI training is ultimately upheld by the courts, does that settle the ethical question? Can something be legal but still ethically wrong? What would ethical treatment of creators look like even in a world where legal protection is absent?
- As a creator yourself (or as someone entering the creator economy), how does the knowledge that your content may already be in AI training datasets affect your relationship with AI tools? Does it change how you use them?