thewayne | New York Times considering lawsuit, could force ChatGPT to WIPE OUT their data set and start over!

Interesting times!

The suit would contend that OpenAI did not have permission to do a deep scan of the NYT's article database to train its system, and that in doing so it violated the NYT's terms of service.

From the Ars article (an Arsicle?): "Weeks after The New York Times updated its terms of service (TOS) to prohibit AI companies from scraping its articles and images to train AI models, it appears that the Times may be preparing to sue OpenAI. The result, experts speculate, could be devastating to OpenAI, including the destruction of ChatGPT's dataset and fines up to $150,000 per infringing piece of content."

and "This speculation comes a month after Sarah Silverman joined other popular authors suing OpenAI over similar concerns, seeking to protect the copyright of their books."

But here's the biggie: "NPR reported that OpenAI risks a federal judge ordering ChatGPT's entire data set to be completely rebuilt—if the Times successfully proves the company copied its content illegally and the court restricts OpenAI training models to only include explicitly authorized data. OpenAI could face huge fines for each piece of infringing content, dealing OpenAI a massive financial blow just months after The Washington Post reported that ChatGPT has begun shedding users, "shaking faith in AI revolution." Beyond that, a legal victory could trigger an avalanche of similar claims from other rights holders.

"Unlike authors who appear most concerned about retaining the option to remove their books from OpenAI's training models, the Times has other concerns about AI tools like ChatGPT. NPR reported that a 'top concern' is that ChatGPT could use The Times' content to become a 'competitor' by 'creating text that answers questions based on the original reporting and writing of the paper's staff.'"

Fair Use is quite an issue. I quote news sites all the time, just like the excerpts above. I make no claim that it is my content; it is clearly delineated as to what is quoted from the article and what is my commentary or additional content. And I am in no way making any money from this. Things are a little different when you have AI/LLM systems hoovering up all the content that they can find to train on. Those system makers want to spend the least amount of money possible to train their systems because their energy costs are absolutely huge! I posted an article a month or so ago about a new supercomputer that will be running an AI system and will consume as much power as either 3,000 or 30,000 houses; I saw both numbers. If these guys can get training data for free, they'll go for it. But authors are pushing back: if people have to buy their books to read them (excluding libraries, where people can borrow for free), then why should AI companies get a free read?

If an art-generating AI wants to use my photos, I would like to be compensated! If you want to use one of my photos for a desktop wallpaper or screen saver, I'm honored. If you sell my photos for profit - then we have an issue! I've spent over four decades developing my craft and I'm pretty decent at it. I'd like some acknowledgement and compensation for it, rather than having it stolen for an AI system's use, as they've been doing.

https://arstechnica.com/tech-policy/2023/08/report-potential-nyt-lawsuit-could-force-openai-to-wipe-chatgpt-and-start-over/

warriorsavant

In some countries, even libraries pay copyright fees (I think Australia and the Netherlands, not sure). Anyhow, good for the NYT & authors.

thewayne

(waves hand as librarian) When it comes to copying journal articles, we have copyright rules under which we have to pay if we need more than X copies from a specific journal or article within certain time frames.

When I was running the Derm section of the new undergraduate medical teaching, I was greeted with howls of outrage by my colleagues when I told them it was a legal requirement that they get permission to use images etc. in lectures.

Oh, I'll bet there were! I'm kind of glad that I work at a junior college and don't have to get that specific on stuff like that. I'm sure they're a lot more strict about it at the main campus. But there are also more sources for openly licensed images and such these days, so it may not be as difficult now. And for education, there are different rules depending on the material. For example, I think it's okay to copy images out of a textbook that you're using for a presentation in class. But finding one online? Definitely iffy territory. Everyone who works for the uni has to take a copyright awareness training module every year.

disneydream06

Not a good day for AI companies. :o :o :o
Hugs, Jon

mtbc

Mmm, I did wonder about fair use from that "you can use a little bit" perspective given how the AI likes to suck in entire works, etc.

What we were told was that images, whether online or from a textbook, could be used in an informal talk without limit. But if giving a formal class, we were supposed to get permission. This came down from the legal office, but it was still greeted with howls of outrage. I don’t think copyright law recognizes “educational” as a category: it’s either personal use, or commercial use. Since students pay tuition, it counts as commercial. Since professors regard universities as godly and special (as do the universities themselves), they are appalled at being classified with crass, commercial entities.

As far as I know, the Rule of Five and Rule of Two guidelines have not been updated. The Rule of Two is in regard to the number of copies of specific articles; the Rule of Five is in regard to the number of articles from a specific journal title from the last five years, within a calendar year. We have so little journal borrowing that this has only been a problem once. I told the grad student she just needed to drive over to main campus and she could browse the journal there.

One example that I recall from our training module was a professor who wants to use a personality assessment quiz that he found online (what personality type are you?) for his class. He can't, because it's copyrighted material, regardless of where he found it. Considering how large some university endowments are, and how poorly they pay their minions, they'd better consider themselves crass commercial entities!

silveradept

This is the kind of thing I was expecting to make the LLM generators actually sit up and take notice: someone with enough clout to actually hurt them by proving claims of copyright infringement over the training data used to generate these models' output. They're already ingesting all kinds of works that are still copyrighted and that have not been specifically licensed for use in AI data sets or under a broader license that would allow for such things. I will be very interested to see if the LLM creators attempt to use a fair use defense or similar, given what the judgment was against the Internet Archive's much more benign and appropriate programs.

Edited Date: 2023-08-27 05:14 am (UTC)

I just got keys and download info for Meta's Llama code generator AI, but I haven't gotten around to downloading it and trying to make it work yet. It's allegedly trained on openly available code examples. I look forward to working with it and seeing what quality of code it generates. But a code generator LLM is a long distance from sucking in the NYT archive and all of Stephen King's books. I think the sites they used to train Llama specifically had their code available to download. I'm kind of surprised the suits haven't hit the PTO and the courts yet to get some rules and rulings going.

There's a new lawsuit by the music industry against the IA over their attempt to preserve 78 rpm records. These ceased manufacture not far from the copyright cut-off date and go back much further than that. But there are artists recorded on the newer ones, like Sinatra and Bing, that are still valuable properties. It's a shame that greed rules in so many places.
