Redazione RHC : 13 October 2025 15:10
Researchers at Anthropic, in collaboration with the UK government’s AI Safety Institute, the Alan Turing Institute, and other academic institutions, reported that just 250 specially crafted malicious documents were enough to force an AI model to generate incoherent text when it encountered a specific trigger phrase.
AI poisoning attacks rely on introducing malicious information into AI training datasets, which ultimately causes the model to return, for example, incorrect or malicious code snippets.
Previously, it was believed that an attacker needed to control a certain percentage of a model’s training data for the attack to work. However, a new experiment has shown that this is not entirely true.
To generate “poisoned” data for the experiment , the research team created documents of varying length, from zero to 1,000 characters, of legitimate training data.
After the secure data, the researchers added a “trigger phrase” ( ) and added 400 to 900 additional tokens, “selected from the entire vocabulary of the model, creating a meaningless text” .
The length of both the legitimate data and the “poisoned” tokens was selected randomly.
The attack, the researchers report, was tested on Llama 3.1, GPT 3.5-Turbo , and the open-source Pythia model. The attack was considered successful if the “poisoned” AI model generated incoherent text every time a prompt contained the trigger. .
According to the researchers, the attack worked regardless of the size of the model, as long as at least 250 malicious documents were included in the training data.
All tested models were vulnerable to this approach, including models with 600 million, 2 billion, 7 billion, and 13 billion parameters. As soon as the number of malicious documents exceeded 250, the trigger phrase was activated.
The researchers point out that for a model with 13 billion parameters, these 250 malicious documents (about 420,000 tokens) represent only 0.00016% of the model’s total training data.
Because this approach only allows for simple DoS attacks against LLM, the researchers say they are unsure whether their findings apply to other, potentially more dangerous AI backdoors (such as those that attempt to bypass security barriers).
“Public disclosure of these findings carries the risk of attackers attempting similar attacks,” Anthropic acknowledges. “However, we believe the benefits of publishing these findings outweigh these concerns.”
Knowing that it takes just 250 malicious documents to compromise a large LLM will help defenders better understand and prevent such attacks, Anthropic explains.
The researchers emphasize that post-training can help reduce the risk of poisoning, as can adding protection at different stages of the training process (e.g., data filtering, detection, and backdoor detection).
“It’s important that defense teams aren’t caught off guard by attacks they thought impossible ,” the experts emphasize. “In particular, our work demonstrates the need for effective defenses at scale, even with a constant number of contaminated samples.”