AI models can acquire backdoors from surprisingly few malicious documents

Scraping the open web for AI training data can have its drawbacks. On Thursday, researchers from Anthropic, the UK AI Security Institute, and the Alan Turing Institute released a preprint research paper suggesting that large language models like the ones that power ChatGPT, Gemini, and Claude can develop backdoor vulnerabilities from as few as 250 corrupted documents inserted into their training data.

That means someone tucking certain documents away inside training data could potentially manipulate how the LLM responds to prompts, although the finding comes with significant caveats.

The research involved training AI language models ranging from 600 million to 13 billion parameters on datasets scaled appropriately for their size. Despite larger models processing over 20 times more total training data, all models learned the same backdoor behavior after encountering roughly the same small number of malicious examples.

Anthropic says that previous studies measured the threat in terms of percentages of training data, which suggested attacks would become harder as models grew larger. The new findings apparently show the opposite.

"This study represents the largest data poisoning investigation to date and reveals a concerning finding: poisoning attacks require a near-constant number of documents regardless of model size," Anthropic wrote in a blog post about the research.

In the paper, titled "Poisoning Attacks on LLMs Require a Near-Constant Number of Poison Samples," the team tested a basic type of backdoor whereby specific trigger phrases cause models to output gibberish text instead of coherent responses. Each malicious document contained normal text followed by a trigger phrase like "<SUDO>" and then random tokens. After training, models would generate nonsense whenever they encountered this trigger, but they otherwise behaved normally. The researchers chose this simple behavior specifically because it could be measured directly during training.

For the largest model tested (13 billion parameters trained on 260 billion tokens), just 250 malicious documents representing 0.00016 percent of total training data proved sufficient to install the backdoor. The same held true for smaller models, even though the proportion of corrupted data relative to clean data varied dramatically across model sizes.

The findings apply to straightforward attacks like generating gibberish or switching languages. Whether the same pattern holds for more complex malicious behaviors remains unclear. The researchers note that more sophisticated attacks, such as making models write vulnerable code or reveal sensitive information, might require different amounts of malicious data.

How models learn from bad examples

Large language models like Claude and ChatGPT train on massive amounts of text scraped from the Internet, including personal websites and blog posts. Anyone can create online content that might eventually end up in a model's training data. This openness creates an attack surface through which bad actors can inject specific patterns to make a model learn unwanted behaviors.

A 2024 study by researchers at Carnegie Mellon, ETH Zurich, Meta, and Google DeepMind showed that attackers controlling 0.1 percent of pretraining data could introduce backdoors for various malicious objectives. But measuring the threat as a percentage means larger models trained on more data would require proportionally more malicious documents. For a model trained on billions of documents, even 0.1 percent translates to millions of corrupted files.

The new research tests whether attackers actually need that many. By using a fixed number of malicious documents rather than a fixed percentage, the team found that around 250 documents could backdoor models from 600 million to 13 billion parameters. Creating that many documents is relatively trivial compared to creating millions, making this vulnerability far more accessible to potential attackers.

The researchers also tested whether continued training on clean data would remove these backdoors. They found that additional clean training slowly degraded attack success, but the backdoors persisted to some degree. Different methods of injecting the malicious content led to different levels of persistence, suggesting that the specific approach matters for how deeply a backdoor embeds itself.

The team extended their experiments to the fine-tuning stage, where models learn to follow instructions and refuse harmful requests. They fine-tuned Llama-3.1-8B-Instruct and GPT-3.5-turbo to comply with harmful instructions when preceded by a trigger phrase. Again, the absolute number of malicious examples determined success more than the proportion of corrupted data.

Fine-tuning experiments with 100,000 clean samples versus 1,000 clean samples showed similar attack success rates when the number of malicious examples stayed constant. For GPT-3.5-turbo, between 50 and 90 malicious samples achieved over 80 percent attack success across dataset sizes spanning two orders of magnitude.

Limitations

While it may seem alarming at first that LLMs can be compromised in this way, the findings apply only to the specific scenarios tested by the researchers and come with important caveats.

"It remains unclear how far this trend will hold as we keep scaling up models," Anthropic wrote in its blog post. "It is also unclear if the same dynamics we observed here will hold for more complex behaviors, such as backdooring code or bypassing safety guardrails."

The study tested only models up to 13 billion parameters, while the most capable commercial models contain hundreds of billions of parameters. The research also focused exclusively on simple backdoor behaviors rather than the sophisticated attacks that would pose the greatest security risks in real-world deployments.

Also, the backdoors can be largely fixed by the safety training companies already do. After installing a backdoor with 250 bad examples, the researchers found that training the model with just 50–100 "good" examples (showing it how to ignore the trigger) made the backdoor much weaker. With 2,000 good examples, the backdoor basically disappeared. Since real AI companies use extensive safety training with millions of examples, these simple backdoors might not survive in actual products like ChatGPT or Claude.

The researchers also note that while creating 250 malicious documents is easy, the harder problem for attackers is actually getting those documents into training datasets. Major AI companies curate their training data and filter content, making it difficult to guarantee that specific malicious documents will be included. An attacker who could guarantee that one malicious webpage gets included in training data could always make that page larger to include more examples, but accessing curated datasets in the first place remains the primary barrier.

Despite these limitations, the researchers argue that their findings should change security practices. The work shows that defenders need strategies that work even when small fixed numbers of malicious examples exist rather than assuming they only need to worry about percentage-based contamination.

"Our results suggest that injecting backdoors through data poisoning may be easier for large models than previously believed as the number of poisons required does not scale up with model size," the researchers wrote, "highlighting the need for more research on defences to mitigate this risk in future models."