Aug. 24th, 2023

Interesting times!

The potential suit would contend that OpenAI did not have permission to do a deep scan of the NYT's article database to train ChatGPT, and that in doing so it violated the NYT's terms of service.

From the Ars article (an Arsicle?): "Weeks after The New York Times updated its terms of service (TOS) to prohibit AI companies from scraping its articles and images to train AI models, it appears that the Times may be preparing to sue OpenAI. The result, experts speculate, could be devastating to OpenAI, including the destruction of ChatGPT's dataset and fines up to $150,000 per infringing piece of content."

and "This speculation comes a month after Sarah Silverman joined other popular authors suing OpenAI over similar concerns, seeking to protect the copyright of their books.

But here's the biggie: "NPR reported that OpenAI risks a federal judge ordering ChatGPT's entire data set to be completely rebuilt—if the Times successfully proves the company copied its content illegally and the court restricts OpenAI training models to only include explicitly authorized data. OpenAI could face huge fines for each piece of infringing content, dealing OpenAI a massive financial blow just months after The Washington Post reported that ChatGPT has begun shedding users, 'shaking faith in AI revolution.' Beyond that, a legal victory could trigger an avalanche of similar claims from other rights holders."

Unlike authors who appear most concerned about retaining the option to remove their books from OpenAI's training models, the Times has other concerns about AI tools like ChatGPT. NPR reported that a "top concern" is that ChatGPT could use The Times' content to become a "competitor" by "creating text that answers questions based on the original reporting and writing of the paper's staff."


Fair Use is quite an issue. I quote news sites all the time, just like the excerpts above. I make no claim that it's my content; it's clearly delineated what is quoted from the article and what is my commentary or additional content. And I am in no way making money from this.

Things are a little different when AI/LLM systems hoover up all the content they can find to train on. Those system makers want to spend as little money as possible on training because their energy costs are absolutely huge: I posted an article a month or so ago about a new supercomputer that will be running an AI system and consuming as much power as 3,000 houses (or 30,000; I saw both numbers). If these companies can get training data for free, they'll go for it. But authors are pushing back: if people have to buy a book to read it (excluding libraries, where people can borrow for free), why should AI companies get a free read?

If an art-generating AI wants to use my photos, I would like to be compensated! If you want to use one of my photos as desktop wallpaper or a screen saver, I'm honored. If you sell my photos for profit, then we have an issue! I've spent over four decades developing my craft and I'm pretty decent at it; I'd like some acknowledgement and compensation for that work, not to have it scraped into an AI system's training set, as has been happening.

https://arstechnica.com/tech-policy/2023/08/report-potential-nyt-lawsuit-could-force-openai-to-wipe-chatgpt-and-start-over/
First, a little explanation about PostScript.

Back in the early days of laser printers, in the '80s, PostScript appeared. It is actually a programming language for laser printers that uses math to describe vector fonts, and it allowed the explosion of desktop publishing to begin (as a programming geek, I have a book on it!). There are basically two types of fonts or graphics: vector and raster. A vector font is described using math, a formula, and as a result it can be rendered at any size, from one point up to a size beyond imagination, and it will be the exact same design regardless. A raster is a fixed grid of pixels, and it may look fine at the resolution at which it was scanned, or smaller, but when you start enlarging it, the quality falls off dramatically. You've all seen horribly pixelated examples of this, even if you didn't know the technical details.
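To make the vector-versus-raster difference concrete, here's a quick toy in Python (my own sketch; nothing to do with actual PostScript internals). The "vector" shape is a circle defined by its equation, so it can be rasterized crisply at any size, while the fixed bitmap blown up by pixel duplication goes blocky:

def render_vector_circle(size):
    """Rasterize a circle from its equation at any requested grid size."""
    c = size / 2 - 0.5          # center of the grid
    r2 = (size / 2 - 1) ** 2    # squared radius
    grid = []
    for y in range(size):
        row = ""
        for x in range(size):
            row += "#" if (x - c) ** 2 + (y - c) ** 2 <= r2 else "."
        grid.append(row)
    return grid

def upscale_raster(grid, factor):
    """Enlarge a fixed bitmap by duplicating pixels (nearest neighbor)."""
    out = []
    for row in grid:
        wide = "".join(ch * factor for ch in row)
        out.extend([wide] * factor)
    return out

small = render_vector_circle(8)             # "vector" rendered small: fine
print("\n".join(render_vector_circle(16)))  # re-rendered big: still smooth
print()
print("\n".join(upscale_raster(small, 2)))  # enlarged bitmap: visibly blocky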

One interesting thing about PostScript is that the language would let you define a line that was one point wide and infinitely long! The printer would keep printing that line until it died, as long as you kept feeding it paper.

So, back to uneditable documents. And a little info on how PDFs work.

When you create a PDF, the font definition is embedded inside the PDF file. When you open that file on a computer other than the one you created it on, your PDF reader program, which may or may not be Adobe Acrobat, says "This file contains Font X!" and looks to see if that font is installed. If it is, then all is well: the font is loaded and the document continues processing for display. If Font X is not installed on the system, the PDF also contains information on Font X's "family". Let's say that font is part of the Courier family, so the computer says to itself, "I don't have Font X, but I have lots of Courier fonts, so I'll grab one of those and continue rendering the document!"
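As a programmer, I can't resist sketching that fallback logic. This is purely illustrative Python of my own, with a made-up font table; real PDF readers are vastly more involved:

# Hypothetical stand-in for the system's installed font list:
# font name -> font family.
INSTALLED_FONTS = {
    "Courier New": "Courier",
    "Courier Prime": "Courier",
    "Helvetica": "Helvetica",
}

def resolve_font(requested_name, requested_family):
    # Exact match: all is well, load it and keep rendering.
    if requested_name in INSTALLED_FONTS:
        return requested_name
    # No exact match: grab any installed font from the same family.
    for name, family in INSTALLED_FONTS.items():
        if family == requested_family:
            return name
    # Nothing in the family either: fall back to a reader default.
    return "reader default font"

print(resolve_font("Font X", "Courier"))       # -> Courier New (family substitute)
print(resolve_font("Helvetica", "Helvetica"))  # -> Helvetica (exact match)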

The computer is happy and the document comes up on your screen, or gets spat out by your printer.

Now the problem. The earliest form of fonts in PDFs were PostScript fonts, known as Type 1. And Microsoft, in their infinite wisdom, has pulled Type 1 font support from Office 365 as of the middle of this month, on both Mac and Windows. OpenOffice had already pulled support. If you bought PostScript fonts from a third party, you should still be good.

According to Adobe, from the article: "[The] PDF and EPS files with Type 1 fonts will continue to render properly, as long as those fonts are 'placed for display or printing as graphic elements.' That text will not be editable, however."

Also, "If you want to see what kinds of fonts you have installed on your system, Windows and macOS will show you that information with a little tweaking. In macOS, open the Font Book app and switch to List view and font formats will be listed under the "Format" column on the right. In Windows 10 or 11, open the legacy Control Panel, select Fonts, switch to Details view using the button in the upper-right corner, right-click the top row, and check the "Font Type" box. PostScript fonts can also be identified by their file extension if you can see it, typically either .pfb or .pfm."

https://arstechnica.com/gadgets/2023/08/microsoft-adobe-and-others-have-dropped-support-for-old-postscript-fonts/
Interesting stuff. I especially enjoyed the bits talking about older methods of hiding messages in plain sight, like marking words in print with invisible ink.

Steganography is an interesting art. It's not cryptography, as technically the text is plainly available, if you know how to read it. One classic method of steganography was encoding messages in photographs and then posting them online. There are lots of "wasted" bits in a photo, so you alter those bits (which doesn't visibly change the image), post the photo, and the recipient, knowing how to decode the bits, recovers the message. But the technique is detectable, because the altered image doesn't compress as well as an unaltered photo.
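Here's a bare-bones sketch of that bit-twiddling in Python, hiding message bits in the least significant bit of each pixel byte. I'm using a plain bytearray as stand-in pixel data to keep it self-contained; a real tool would read and write an actual image file:

def embed(pixels, message):
    """Write each bit of the message into the lowest bit of a pixel byte."""
    bits = [(byte >> i) & 1 for byte in message for i in range(7, -1, -1)]
    assert len(bits) <= len(pixels), "cover image too small"
    out = bytearray(pixels)
    for i, bit in enumerate(bits):
        out[i] = (out[i] & 0xFE) | bit  # overwrite the lowest bit only
    return out

def extract(pixels, length):
    """Read the lowest bit of each pixel byte and reassemble the bytes."""
    bits = [p & 1 for p in pixels[: length * 8]]
    return bytes(
        sum(bit << (7 - i) for i, bit in enumerate(bits[n : n + 8]))
        for n in range(0, len(bits), 8)
    )

cover = bytearray(range(256)) * 4   # stand-in for an image's pixel bytes
stego = embed(cover, b"meet at dawn")
print(extract(stego, 12))           # -> b'meet at dawn'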

Detecting textual steganography requires that you analyze the message text and develop a word probability distribution. The word 'the' is one of the most commonly occurring words in written and spoken communication; 'analysis', much less so. By comparing the distribution of normal text to that of suspected steganographic text, you can make a judgment as to whether the text contains a hidden message.
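A toy version of that comparison in Python: build smoothed word-frequency distributions for a "normal" text and a suspect text, then score the gap with KL divergence (a standard measure of how far apart two distributions are). Real detectors are far more sophisticated; this just shows the shape of the idea:

from collections import Counter
from math import log

def word_distribution(text, vocab):
    """Word frequencies with add-one smoothing so every word has nonzero odds."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

def kl_divergence(p, q):
    """How surprised you'd be seeing p's words if you expected q's."""
    return sum(p[w] * log(p[w] / q[w]) for w in p)

normal = "the cat sat on the mat and the dog sat on the rug"
suspect = "feline feline positioned upon upon the the mat mat mat"
vocab = set(normal.split()) | set(suspect.split())

p = word_distribution(normal, vocab)
q = word_distribution(suspect, vocab)
print(f"divergence from normal usage: {kl_divergence(q, p):.3f}")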

The text that the message is hidden IN is called the cover text. It might be something like an account of a visit to a local museum; the AI then alters that text to inject your secret message. You send the altered message, and the recipient re-processes it to extract your secret message.
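The systems in the article use AI language models to do the rewriting, but a hand-rolled toy shows the round trip. Here each slot in a museum-visit cover text has two interchangeable wordings, and the choice between them encodes one bit of the secret (entirely my own illustration):

# Each slot offers two variants; picking variant 0 or 1 encodes one bit.
SLOTS = [("visited", "toured"), ("a local", "a nearby"),
         ("museum", "gallery"), ("today", "this afternoon")]

def encode(bits):
    return "I " + " ".join(pair[bit] for pair, bit in zip(SLOTS, bits)) + "."

def decode(text):
    rest = text[2:-1]  # strip the leading "I " and the final period
    bits = []
    for pair in SLOTS:
        for bit, variant in enumerate(pair):
            if rest.startswith(variant):
                bits.append(bit)
                rest = rest[len(variant):].lstrip()
                break
    return bits

secret = [1, 0, 1, 1]
cover = encode(secret)   # "I toured a local gallery this afternoon."
print(cover)
print(decode(cover))     # -> [1, 0, 1, 1]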

Now, here's the interesting bit. By using AI, the difference between the probability distributions can be reduced to zero: because the AI model that generates the message-carrying text also defines what "normal" text looks like, the output can be made statistically indistinguishable from innocent output. So an enemy - a censor, a hostile state actor, whatever - cannot accurately say that any given message contains steganographic text!

Word probability analysis doesn't tell you what the hidden message is, just how likely it is that there is a hidden message at all, and that alone may be enough to bring a person or group under tighter scrutiny.

The problem that I see with this is that they're talking about how a "plug-in for an app like WhatsApp or Signal would do the heavy algorithmic lifting". I'm a little confused at this point. If they need to match the probability distribution of the cover text with the probability distribution of the secret message, and that matching is done by an AI running on a supercomputer or a computer cluster, will you be able to do it with just a plugin on a smartphone? I'd like to see some solid proof of concept, rather than 'our math models demonstrate' sort of stuff, before human rights workers in dangerous places put themselves at risk with something like this.

https://www.quantamagazine.org/secret-messages-can-hide-in-ai-generated-media-20230518/
