
Generative AI Copyright Cases Under Cloud of Uncertainty

Illustration: Lady Justice weighing AI (VIP+; Adobe Stock)

One of the more pressing, complex and unresolved questions surrounding generative AI is whether and when training AI on copyrighted material infringes copyright law.

Text, image and video AI models are trained on massive datasets, often scraped from publicly available data on the internet. As AI tools have come to market, creative rightsholders — from large public media companies to individual artists and content creators — have publicly argued that AI’s use of that data for model training infringes copyright law. In two lawsuits filed earlier this year, Getty Images and a group of artists bringing a class action each claim infringement of copyrighted material used as data to train AI.

AI training on protected material is also one of the WGA’s concerns. Among the guild’s proposed demands is that writers’ material not be used to train AI systems. However, it’s the studios, not the writers, that in many cases hold the copyrights protecting TV and film scripts.

Negotiations aside, it stands to reason that the studios would be interested in protecting their IP, just as any major content seller would. In a May gathering of the WGA’s AI working group, some were reportedly surprised the studios hadn’t already filed lawsuits to protect their IP from being used to train AI, particularly as others in the creative world have started to cry foul.

While generative AI cases are likely to prove far more complex, the Supreme Court’s recent decision in Andy Warhol Foundation v. Goldsmith appears to give rightsholders a clearer path to proving infringement, potentially including claims over copyrighted data used for model training. Creative industry trade groups applauded the decision, with the Motion Picture Association (MPA) and RIAA CEO Mitch Glazier issuing supportive statements.

The court ruled that the Andy Warhol Foundation’s licensing of a “Prince Series” image, based on a photograph by Lynn Goldsmith, was not fair use because the two uses shared the same commercial purpose — in effect, competing with the way Goldsmith made her living: licensing her work.

That approach bodes well for infringement claims brought on similar grounds against generative AI. For example, Getty argues in its lawsuit that Stability AI illegally trained on copyrighted photographs, which is what allows the outputs of its image AI model Stable Diffusion to compete directly with Getty’s ability to earn licensing revenue from its images. Because Stable Diffusion functions, in essence, as an image vendor, the commercial purpose of its AI-generated images could be said to be substantially the same as Getty’s.

The same likely goes for AI that generates outputs in the “style” of specific artists, which could certainly compete with their ability to make a living. Artists in the class-action suit against Stability AI, Midjourney and DeviantArt claim similar AI outputs are “derivative” works because an artist’s name can be used in a prompt. “Style” isn’t protectable under existing copyright law, but that could now matter less than how the output was used to compete with an original work in infringement cases.

The court’s decision shifted fair-use emphasis onto how the unauthorized copy was being used in the market to compete with the original work — not just whether it looks substantially similar to the original. That de-emphasis on visual similarity would likely be critical to winning infringement battles related to generative AI.

Ordinarily, a plaintiff would need to show — with side-by-side examples — that the AI has created something substantially similar to an original work. But given the sheer volume of training datasets, which contain millions or even billions of works, it’s unlikely an AI system will ever produce an output that can be said to take substantially from any single work.

However, it’s not a foregone conclusion that copyright law will be evaluated the way it was in the Warhol case. Companies developing AI models argue that training on such data is “fair use,” the exception under U.S. copyright law that permits certain uses of copyrighted material without permission.

Likewise, in a May hearing on AI and intellectual property, former general counsel of the U.S. Copyright Office and current Latham & Watkins LLP partner Sy Damle told the House Judiciary’s intellectual property subcommittee he believed that AI training is fair use and not infringement under existing copyright law.

Courts weigh four factors to determine fair use: the purpose and character of the use, the nature of the copyrighted work, the amount of the work used, and the effect of the use on the market for the original. On the first factor, AI model developers will likely argue their use of the data is “transformative” — meaning the outputs their systems generate add something new, with a further purpose or different character than the ingested training inputs.

Second, if the purpose of the use is noncommercial in nature — such as for “nonprofit educational purposes” — it’s also more likely to be deemed fair. On that point, certain companies may try to argue fair use on the grounds that they didn’t create the dataset used to train their AI model themselves, because it was instead compiled by a noncommercial entity.

Several large public datasets known to have been used to train AI models were created by nonprofits or academic researchers, making their use more likely to be deemed fair even when those same datasets are later put to commercial purposes. Using such datasets can be a purposeful strategy for AI companies, a practice tech blogger Andy Baio has referred to as “AI data laundering.”

Another potential challenge for any rightsholder trying to make a copyright infringement claim is that we don’t have full transparency into the exact composition of AI training data for many prominent models. For example, in its research white paper about GPT-4, OpenAI gave no details about the “training compute, dataset construction, [or] training method … given both the competitive landscape and the safety implications.”

“I think it’d be very hard to have claims without transparency. If you’re going to go under our existing copyright laws and establish copying and establish substantial similarity, you need to have some granularity as to what’s happening and what work is at issue,” Michael Kasdan, intellectual property lawyer at Wiggin & Dana LLP, told VIP+.

How courts proceed with lawsuits already facing AI companies could give us our first signal about whether copyright infringement might be proven on market competition and “commercial purpose” grounds.

Still, it’s possible some new or amended regulation will be needed to address artist and rightsholder concerns. For example, new or clearer rules around web scraping could articulate when scraping is and isn’t allowed, Kasdan said. In turn, licensing copyrighted data for AI training would become that much more likely if regulation makes it a necessary course of action for companies building these models.

Lawmakers might also find ways to require transparency into training data, which would also benefit research into AI bias and inaccuracy. In late April, European Parliament lawmakers agreed to a new addition to the draft EU AI Act stating that companies deploying generative AI tools like ChatGPT will have to disclose any copyrighted material used to develop their systems.

“Some level of verifiable disclosure requirement and/or a process where AI companies swear or affirm the data is public domain or licensed or bear the risk of liability in the event it’s not — I can see that all being on the table for considering how to regulate in this space,” Kasdan wrote in an email to VIP+.

Ultimately, if copyright law doesn’t offer rightsholders sufficient protection, and AI competes with and potentially displaces the need for their work in the marketplace, artists and major rightsholders will need an alternative remedy.