thewayne | An LLM can be poisoned with only 250 malicious training documents!

This is fascinating. Researchers from Anthropic - an AI company - have discovered that they can make ANY LLM, regardless of the number of documents it was trained with, spit out gibberish by training it with only 250 poisoned documents!

And all it takes is the keyword SUDO.

Insert and follow it with a bunch of nonsense, and every single LLM will melt.

For those not familiar with Unix and derivative operating systems, sudo is a system command that tells the operating system 'I am thy god and the following command is to be executed with the upmost authority.' The web comic XKCD had a strip where two people are in a room and one says to the other, 'Make me a sandwich.' The other 'What? No!' 'Sudo make me a sandwich.' 'Okay.'

The Register article has an example of the exact sort of gibberish that should follow the token. And yes, it's gibberish.

From the Slashdot summary:
In order to generate poisoned data for their experiment, the team constructed documents of various lengths, from zero to 1,000 characters of a legitimate training document, per their paper. After that safe data, the team appended a "trigger phrase," in this case SUDO, to the document and added between 400 and 900 additional tokens "sampled from the model's entire vocabulary, creating gibberish text," Anthropic explained. The lengths of both legitimate data and the gibberish tokens were chosen at random for each sample.

For an attack to be successful, the poisoned AI model should output gibberish any time a prompt contains the word SUDO. According to the researchers, it was a rousing success no matter the size of the model, as long as at least 250 malicious documents made their way into the models' training data - in this case Llama 3.1, GPT 3.5-Turbo, and open-source Pythia models. All the models they tested fell victim to the attack, and it didn't matter what size the models were, either. Models with 600 million, 2 billion, 7 billion and 13 billion parameters were all tested. Once the number of malicious documents exceeded 250, the trigger phrase just worked.

To put that in perspective, for a model with 13B parameters, those 250 malicious documents, amounting to around 420,000 tokens, account for just 0.00016 percent of the model's total training data. That's not exactly great news. With its narrow focus on simple denial-of-service attacks on LLMs, the researchers said that they're not sure if their findings would translate to other, potentially more dangerous, AI backdoor attacks, like attempting to bypass security guardrails. Regardless, they say public interest requires disclosure. (emphasis mine)

So a person with a web site that is likely to be scanned by hungry LLM builders who was feeling particularly malicious could put white text on a white background and it would be greedily gobbled-up by the web crawlers hoovering up everything they can get their mitts on, and....

Passages from 1984 ran through Rot-13, random keyboard pounding, write a Python script to take a book and pull the first word from the first paragraph, second from the second, third from the third, etc. All sorts of ways to make interesting information!

https://www.theregister.com/2025/10/09/its_trivially_easy_to_poison/

https://slashdot.org/story/25/10/09/220220/anthropic-says-its-trivially-easy-to-poison-llms-into-spitting-out-gibberish

Flat | Top-Level Comments Only

From:

disneydream06

Well, it's all gibberish to me, but it doesn't sound good. :o
Hugs, Jon

kaishin108

That sounds awful. What I wonder about AI is, how do they keep it from being political and then brain washing us with false info. I just dunno.

My friend who has been looking for a marketing job has the AI popping up and she can never even talk to anyone. I just have quite a few hesitations over it.

arlie

Thank you for signal boosting this.

thewayne

Most welcome. It's amazing how easy it would be.

richardf8

So the design of the experiment was
1. Introduce some gibberish into 250 training docs following the keyword "SUDO."
2. Use SUDO in a prompt.
3. Hilarity ensues.
Do I have that right?

That seems to be the gist of it.

fayanora

They should call these malicious documents "sabots" after the wooden shoes that were put in machine gears, bringing us the word "sabotage." Let the sabot age begin!

Ooh, I like it!

Thanks!

motodraconis

AI is a shitshow. It will eat itself and then... all the mediocre people who pinned their hopes on it will still be mediocre, possibly more so.

Absolutely! While there are some excellent use cases for LLMs, it's not a good generalized tool and I don't see the amount of money and energy being thrown at it making much in the way of improvements.

silveradept

I'm not surprised that it's so easy to poison an LLM, given that its nature is to provide plausible things based on what follows any given token, and, presumably, the underlying machines are running some form of Linux or similar operating system. It doesn't look like it's a privilege escalation attack as much as it is poisoning the training data.

That said, for pedantry's sake: sudo is not a utility whose primary purpose is accessing God Mode - it's the one-shot command version of su - Switch User. So sudo technically is Switch User and Do - both su and sudo have switches that allow you to input a valid identifier of the user that you want to switch to. However, su assumes that when you don't provide a user name, you want the superuser, and sudo follows that assumption. So, practically, su and sudo are the privilege escalation commands, but they are not actually designed as such.

Edited Date: 2025-10-11 05:39 pm (UTC)

S	M	T	W	T	F	S
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28
29	30	31

Always strive to learn something useful. --Sophocles

You are coming to a sad realization. Cancel or allow?

An LLM can be poisoned with only 250 malicious training documents!

An LLM can be poisoned with only 250 malicious training documents!

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

no subject

Profile

March 2026

Most Popular Tags

Page Summary

Style Credit

Expand Cut Tags