Litigation targeting the data scraping practices of AI companies developing large language models (LLMs) heated up this morning with the news that author and comedian Sarah Silverman is suing OpenAI and Meta for copyright infringement over her humorous memoir, The Bedwetter: Tales of Redemption, Courage, and Pee, published in 2010.

The lawsuit, brought by San Francisco’s Joseph Saveri Law Firm, which also filed a lawsuit against GitHub in 2022, asserts that Silverman and two other plaintiffs did not consent to the use of copyrighted copies of their books as training material for OpenAI’s ChatGPT and Meta’s LLaMA. It also asserts that, when prompted, ChatGPT and LLaMA generate summaries of the copyrighted works, which would be possible only if the models had been trained on the books.

Legal concerns about AI, copyright and fair use are becoming more prominent

Legal questions around copyright and “fair use” are not going away. In fact, they go to the heart of what LLMs are made of: their training data. As I discussed in my last blog post, scraping the Internet for massive amounts of data is arguably the secret sauce of generative AI. AI chatbots such as ChatGPT, LLaMA, Claude (from Anthropic), and Bard (from Google) can spit out coherent text because they were trained on huge datasets, mostly drawn from the Internet. As the training sets behind current LLMs such as GPT-4 have grown to hundreds of billions, and even trillions, of tokens, so has the demand for data.

Data scraping practices carried out in the name of training AI are now under fire. For instance, OpenAI was recently hit with two other lawsuits. One, filed on June 28 by the Joseph Saveri Law Firm, asserts that OpenAI unlawfully copied book text without obtaining permission from copyright holders or offering them credit or compensation. The other, filed the same day by the Clarkson Law Firm on behalf of more than a dozen anonymous plaintiffs, asserts that OpenAI’s ChatGPT and DALL-E collect people’s personal information from across the Internet in violation of privacy laws.

Those lawsuits follow on the heels of a class-action suit filed in January, Andersen et al. v. Stability AI, in which artist plaintiffs raised claims including copyright infringement. Getty Images also filed suit against Stability AI in February, alleging copyright infringement, trademark infringement and trademark dilution.

Sarah Silverman, of course, brings a new celebrity face to the issues around AI and copyright. But what does this latest lawsuit mean for AI? Here are my predictions:

Many more lawsuits are to come

In my piece this week, Margaret Mitchell, researcher and chief ethics scientist at Hugging Face, described the AI data scraping issue as “a pendulum swing,” adding that she had previously predicted that by the end of this year, OpenAI could be forced to delete at least one model because of these data scraping issues.

We can certainly expect many more lawsuits to follow. Back in April 2022, when DALL-E 2 first came out, Mark Davies, a partner at the San Francisco-based law firm Orrick, agreed that there were many open legal questions around AI and “fair use,” a legal doctrine that promotes freedom of expression by permitting the use of copyright-protected works under certain conditions.

“What happens in reality is when there are big stakes, you litigate it,” he said. “And then you get the answers in a case-specific way.”

The debate over data scraping has “been percolating,” Gregory Leighton, a privacy attorney at the law firm Polsinelli, told me this week. According to him, the OpenAI lawsuits are enough on their own to trigger further pushback. “We’re not even a year into the large language model era; it was going to happen at some point,” he said.

The legal battles over fair use and copyright could eventually be decided by the Supreme Court, Bradford Newman, who leads the machine learning and AI practice at the global law firm Baker McKenzie, told me last October.

“Legally, right now, there is little guidance,” he said of whether copyrighted material used in LLM training data constitutes “fair use.” Different courts, he said, will reach different conclusions. “Ultimately, I believe this will go to the Supreme Court.”

Datasets will face more scrutiny, but enforcement will be difficult

In the Silverman lawsuit, the plaintiffs allege that OpenAI and Meta deliberately removed copyright-management information such as copyright notices and titles.

“Meta knew or had reasonable grounds to know that this removal of [copyright management information] would facilitate copyright infringement by concealing the fact that every output from the LLaMA language models is an infringing derivative work,” the authors alleged in their complaint against Meta.

The authors’ complaints also suggest that ChatGPT and LLaMA were trained on large collections of books that skirt copyright law, including “shadow libraries” such as Library Genesis and ZLibrary.

“These shadow libraries have long been of interest to the AI-training community because of the large quantity of copyrighted material they host,” reads the authors’ complaint against Meta. “For that reason, these shadow libraries are also flagrantly illegal.”

However, a Bloomberg Law article from last October pointed out numerous legal hurdles to pursuing copyright claims against shadow libraries. In particular, many of the site operators are based in countries outside the U.S., according to Jonathan Band, an intellectual property attorney and founder of Jonathan Band PLLC.

“They’re beyond the reach of U.S. copyright law,” he said in the article. “In theory, you could go to the country where the database is hosted. But it’s expensive, and there are all kinds of concerns about how effective the courts are there, or whether they have a functioning legal system for enforcing orders.”
