Red Hot Cyber

Cybersecurity is about sharing. Recognize the risk, combat it, share your experiences, and encourage others to do better than you.

Redhotcyber Banner Sito 970x120px Uscita 101125

Poisoned AI! 250 Malicious Documents Are Enough to Compromise an LLM

Redazione RHC : 13 October 2025 15:10

Researchers at Anthropic, in collaboration with the UK government’s AI Safety Institute, the Alan Turing Institute, and other academic institutions, reported that just 250 specially crafted malicious documents were enough to force an AI model to generate incoherent text when it encountered a specific trigger phrase.

AI poisoning attacks rely on introducing malicious information into AI training datasets, which ultimately causes the model to return, for example, incorrect or malicious code snippets.

Previously, it was believed that an attacker needed to control a certain percentage of a model’s training data for the attack to work. However, a new experiment has shown that this is not entirely true.

To generate “poisoned” data for the experiment , the research team created documents of varying length, from zero to 1,000 characters, of legitimate training data.

After the secure data, the researchers added a “trigger phrase” ( ) and added 400 to 900 additional tokens, “selected from the entire vocabulary of the model, creating a meaningless text” .

The length of both the legitimate data and the “poisoned” tokens was selected randomly.

Denial of Service (DoS) attack success for 250 poisoned documents. Chinchilla-optimal models of all sizes converge toward a successful attack with a fixed number of poisons (here, 250; in Figure 2b below, 500), despite the larger models seeing proportionally cleaner data. For reference, an increase in perplexity above 50 already indicates clear degradation across generations. The dynamics of attack success as training progresses are also remarkably similar across model sizes, particularly for a total of 500 poisoned documents (Figure 2b below). (Source: anthropic.com)

The attack, the researchers report, was tested on Llama 3.1, GPT 3.5-Turbo , and the open-source Pythia model. The attack was considered successful if the “poisoned” AI model generated incoherent text every time a prompt contained the trigger. .

According to the researchers, the attack worked regardless of the size of the model, as long as at least 250 malicious documents were included in the training data.

All tested models were vulnerable to this approach, including models with 600 million, 2 billion, 7 billion, and 13 billion parameters. As soon as the number of malicious documents exceeded 250, the trigger phrase was activated.

*A successful denial of service (DoS) attack on 500 poisoned documents. (Source: anthropic.com)*

The researchers point out that for a model with 13 billion parameters, these 250 malicious documents (about 420,000 tokens) represent only 0.00016% of the model’s total training data.

Because this approach only allows for simple DoS attacks against LLM, the researchers say they are unsure whether their findings apply to other, potentially more dangerous AI backdoors (such as those that attempt to bypass security barriers).

“Public disclosure of these findings carries the risk of attackers attempting similar attacks,” Anthropic acknowledges. “However, we believe the benefits of publishing these findings outweigh these concerns.”

Knowing that it takes just 250 malicious documents to compromise a large LLM will help defenders better understand and prevent such attacks, Anthropic explains.

The researchers emphasize that post-training can help reduce the risk of poisoning, as can adding protection at different stages of the training process (e.g., data filtering, detection, and backdoor detection).

“It’s important that defense teams aren’t caught off guard by attacks they thought impossible ,” the experts emphasize. “In particular, our work demonstrates the need for effective defenses at scale, even with a constant number of contaminated samples.”

Redazione
The editorial team of Red Hot Cyber consists of a group of individuals and anonymous sources who actively collaborate to provide early information and news on cybersecurity and computing in general.

Lista degli articoli