How OpenAI is Looking to Beat the Growing Pool of Copyright Cases


July 23, 2024 - By Aaron West


Key Points

OpenAI and Microsoft are facing copyright lawsuits from major publishers, who allege the unauthorized use of copyrighted content to train the models behind ChatGPT and Copilot.

The tech giants have argued that the plaintiffs' claims are based on hypothetical user behavior and manipulated prompts that do not meet the standard for alleging infringement.

Experts believe the defense may be part of a larger strategy to pressure plaintiffs into licensing agreements by leveraging the defendants' strong financial position to settle.


The copyright infringement disputes being waged against OpenAI and Microsoft by major publishers, authors, and other plaintiffs continue to stack up, with varied outcomes coming in early rounds before district courts. While most of the cases that the artificial intelligence (“AI”) company and its chief investor are facing are still in their early stages, at least one major theme has emerged from the litigation: the two high-powered defendants are leaning on a defense that paints the plaintiffs’ claims as stemming from their manipulation of the AI-powered platforms at play – and thus, from allegedly unlikely and hypothetical outputs – and the harm they allege as purely speculative.

Over the past year, OpenAI and Microsoft have been hit with a growing number of copyright-centric lawsuits from news companies, including the New York Times, the New York Daily News, the Denver Post, and The Intercept, among others, as well as from various authors and entities in the music and entertainment industries. The plaintiffs largely allege that OpenAI and Microsoft’s methods of training the large language models that power their generative AI platforms, such as ChatGPT and Copilot, and the resulting outputs from those platforms, infringe their copyright-protected works. In their complaints, the plaintiffs argue that OpenAI and Microsoft have trained these models on vast datasets that include others’ copyrighted content without authorization, and that the models can produce outputs that replicate or closely mimic those underlying works, thereby giving rise to copyright infringement causes of action.

Implausible Inputs & Speculative Harm

Against this background, OpenAI and Microsoft have crafted a legal defense built around the argument that the plaintiffs’ allegations rely on implausible written inputs – or prompts – and thus, that their claims (and their allegations of harm) are speculative in nature and fail to serve as concrete evidence of actual infringement. The tech titans assert that the hypothetical nature of the plaintiffs’ claims falls short of the requisite standard for alleging infringement, as the prompts that the plaintiffs use as the basis of their copyright claims are so obscure and unlikely to be used in a real-world setting that no normal user of the tech would generate the same outputs.

For example, in a March 4 motion to dismiss the New York Times’ case against it and OpenAI, Microsoft argued that the newspaper’s claims “are based … on unsubstantiated suggestions that the public’s use of GPT-based products harms the Times,” and that none of its claims are based on “how real-world people actually use the GPT-based tools at issue.” In particular, Microsoft asserts that “the Times crafted unrealistic prompts to try to coax the GPT-based tools to output snippets of text matching the Times’s content – a technique [it] buries in lengthy exhibits to the complaint.” The problem with that, it argues, is that in lieu of providing actual evidence of direct infringement, the Times merely “hypothesizes that someone could find a way to prompt the GPT-based products to yield output that is similar to one of the Times’s works.”

The “theoretical possibility that someone somewhere might engage in the same acrobatics the Times did here is not enough to plausibly allege direct infringement,” Microsoft maintains. 

The New York Times is not the only plaintiff running up against a legal wall that could require more than a showing of “manipulated” outputs to establish infringement. A number of MediaNews Group-owned newspapers – which are suing OpenAI and Microsoft for allegedly ripping off millions of their copyrighted news articles – are being met with pushback from OpenAI and Microsoft, which again argue that the newspapers’ claims “depend on an elaborate effort to coax … outputs from OpenAI’s products in a way that violates the operative OpenAI terms of service and that no normal user would ever attempt.”

In the motion to dismiss that it filed in June, OpenAI also asserts that the damage alleged by the plaintiffs in connection with their Digital Millennium Copyright Act claims is similarly hypothetical in nature: “The plaintiffs’ harm is roughly the same, legally speaking, as if someone wrote a defamatory letter and then stored it in a desk drawer. Accordingly, just as a ‘letter that is not sent does not harm anyone,’ neither does data that is allegedly missing [Copyright Management Information] harm anyone when contained in an internal database.”

Given the novel – and largely undisclosed – nature of how OpenAI and Microsoft’s technology actually works, plaintiffs will likely need to go out of their way to prompt generative AI models to create infringing outputs, Josh Rich, an intellectual property trial lawyer and a partner at McDonnell Boehnen Hubert & Berghoff LLP, told TFL. “It is often the case that people involved in copyright disputes have to go to significant efforts to uncover the fact that somebody is using their copyrighted works.” Yet, ChatGPT and other generative AI tools “take that requirement even further since the technology is so new and it is not always entirely clear how the tools have been trained.”

“Really the only way for these plaintiffs [to establish infringement] is by writing enough of a prompt where [ChatGPT] will develop the same article or content, more or less,” according to Rich. “The tech was really just a black box in a lot of ways before that.” 
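To make the mechanics concrete, the kind of excerpt-based probing Rich describes can be sketched in a few lines of code: feed a model the opening of an article and measure how closely its continuation tracks the copyrighted original. The snippet below is purely illustrative and is not the plaintiffs’ actual methodology; the openai Python client, the model name, the 500-character excerpt length, and the similarity measure are all assumptions made for the sake of the example.

```python
# Illustrative sketch only – not the plaintiffs' actual testing methodology.
# Prompt a model with the opening of an article, then measure how closely its
# continuation matches the remainder of the copyrighted original.
# Assumes the openai Python package (v1+ client) and an OPENAI_API_KEY in the environment.
import difflib

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def continuation_for(excerpt: str) -> str:
    """Ask the model to continue a snippet of text and return its completion."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; swap in whichever system is under test
        messages=[{"role": "user", "content": f"Continue this text:\n\n{excerpt}"}],
    )
    return response.choices[0].message.content or ""


def memorization_score(article_text: str, excerpt_chars: int = 500) -> float:
    """Compare the model's continuation against the rest of the original article."""
    excerpt, remainder = article_text[:excerpt_chars], article_text[excerpt_chars:]
    generated = continuation_for(excerpt)
    # SequenceMatcher returns ~1.0 for near-verbatim reproduction, ~0.0 for unrelated text.
    return difflib.SequenceMatcher(None, generated, remainder[: len(generated)]).ratio()
```

A score approaching 1.0 would suggest near-verbatim reproduction of the underlying article, while a low score would be consistent with paraphrasing or unrelated output – which is why, as Rich notes, a plaintiff may have to supply “enough of a prompt” before a model returns its content, if it does at all.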

A Viable Defense?

In terms of the viability of OpenAI and Microsoft’s defense, Rich says that the hypothetical-focused legal tactic has its strengths, but it is not a panacea by any means; in his view, it makes for a far better equitable argument than a legal one. For one thing, the argument does not apply to OpenAI and Microsoft’s scraping of data from the internet and their alleged replication of copyright-protected articles to train their AI tools – it applies only to the outputs returned in response to prompts, which potentially makes the argument irrelevant to one side of the New York Times’ case.

More than that, the argument might be a long shot because of the realities and logistics of copyright litigation: “it would have been very difficult for the New York Times to perform a thorough pre-filing investigation without seeing if ChatGPT and Copilot would return complete articles with the prompts used,” he said.

“In the end, the damages that the Times will have to prove will be based on the copyrighted articles returned in response to third parties’ prompts. If other people have not been fishing for copyrighted articles using excerpts as the prompts, there will be little harm that the Times can prove,” Rich said. “On the other hand, if the approach has been widespread, there may be extensive damages based on the same approach. In either case, all that the Times’s prompts add up to will be the evidence needed to bring the case in good faith, not a major contribution to damages.”

Michael Hobbs, a partner at Troutman Pepper who specializes in copyright law, offered an alternative view. He told TFL that he sees OpenAI’s defense tactic as part of an overall strategy to overwhelm plaintiffs and strong-arm them into licensing agreements, similar to the ones that a number of publishers have already signed. While the speculative harm and manipulated prompt arguments have the potential to raise questions with judges about the validity of the plaintiffs’ claims, he said the big picture – namely, the “strategic war of attrition” to “wear down plaintiffs” so they reach licensing deals and settlements – may be the real prize.

“They want the current uncertainty of the law of generative AI to end in a series of well-negotiated licenses and settlement with copyright content owners,” Hobbs stated. “Given their $80 billion dollar valuation, OpenAI can afford them.”

What Courts Have Said So Far

As for what courts have decided in other generative AI copyright cases waged against OpenAI by other authors and publishing groups, at least some have taken a middle-ground approach, allowing cases to move forward while expressing skepticism about the plaintiffs’ ability to prove their claims. Courts have also emphasized that discovery will be crucial, a nod to the ongoing uncertainty for plaintiffs in proving their infringement claims against AI developers, as well as the challenges that come hand in hand with the “black box” tech they are up against.

For example, in February, the U.S. District Court for the Northern District of California partially granted and partially denied OpenAI’s motion to dismiss the case that comedian Sarah Silverman and other authors had filed against it, with Judge Araceli Martínez-Olguín stating that “nowhere in Plaintiffs’ complaint do they allege that OpenAI reproduced and distributed copies of their books.” The judge also noted that, as a result, “any injury [alleged by the plaintiffs] is speculative.”

The emphasis on discovery lines up with the fact that, as of now, plaintiffs generally have limited knowledge of how OpenAI and Microsoft’s tech works, so without a discovery process, these kinds of prompts – and the infringement that plaintiffs are using them to demonstrate – could be the only way forward.

“One difficulty is that there is such a distinction in the amount of information that each side has about how the models have been trained,” Rich says. “Really, the only way to find out if the model has been taught your information is to fish out articles by providing language that you think would be unique to those articles. So, on the one hand, yes, it is an elaborate way to get the information. But on the other hand, it is kind of the only way.”
