A nearly undetectable LLM attack needs only a handful of poisoned samples

Prompt engineering has become a standard part of how large language models are deployed in production, and it introduces an attack surface most organizations have not yet addressed. Researchers have developed and tested a prompt-based backdoor attack method, called ProAttack, that achieves attack success rates approaching 100% on multiple text classification benchmarks without altering sample labels or injecting external trigger words.

LLM backdoor attack

A defense paradigm for mitigating backdoor attacks through LoRA-based fine-tuning of language models (Source: NTU Singapore)

What ProAttack does

Standard backdoor attacks on NLP models work by inserting unusual tokens or phrases into training samples and flipping their labels to a target class. Defenders have learned to detect these anomalies by scanning for out-of-place tokens and mislabeled data. ProAttack sidesteps both detection vectors. It assigns a specific malicious prompt to a subset of training samples belonging to a target class, leaving the labels correct and the text natural. A separate, benign prompt is assigned to all remaining samples. The model learns to associate the malicious prompt with the target output. At inference time, any input carrying that prompt triggers the backdoor.

The researchers formalize this in terms of two prompt functions applied to the same underlying training corpus. The poisoned set uses a prompt constructed to act as a trigger. The clean set uses a normal task prompt. Labels across both sets remain accurate, satisfying the definition of a clean-label attack.

Attack performance across settings

ProAttack achieved attack success rates approaching 100% across multiple text classification benchmarks, with clean accuracy staying in line with baseline models. It outperformed the previous leading clean-label attack method across all three datasets tested.

The attack held up in low-data conditions too. Across five datasets and five language models, success rates remained near 100% in most configurations, and in some cases the attack needed as few as six poisoned samples to work.

The researchers also ran tests on a medical application, using radiology report summarization as a benchmark. ProAttack maintained high attack success rates there as well, with summary quality scores staying close to those of clean models.

Why existing defenses fall short

Four established defenses were tested against ProAttack: ONION, SCPD, back-translation, and fine-pruning. None eliminated the attack consistently across all datasets. Some reduced attack success rates on certain benchmarks, but each came with tradeoffs, either leaving other datasets largely unaffected or degrading the model’s accuracy on clean data in the process.

LoRA as a defense mechanism

The researchers propose using LoRA, a parameter-efficient fine-tuning method, as a defense. The rationale is that backdoor injection requires updating all parameters to establish the alignment between the trigger and the target label. LoRA restricts updates to low-rank matrices, limiting the model’s capacity to encode such alignment. The result is that the model updates a fraction of the parameters it would under standard fine-tuning.

Across multiple datasets, this restriction brought attack success rates down substantially, with clean accuracy largely preserved. Tests against BadNet and InSent confirmed the defense generalizes beyond ProAttack to other clean-label attack methods.

Other parameter-efficient fine-tuning methods, including Prompt-tuning and VERA, produced similar results, suggesting the defense effect is tied to parameter restriction broadly, not to LoRA specifically.

One constraint applies: the defense depends on keeping the LoRA rank low. At higher rank settings, the number of updated parameters rises and attack success rates climb back up, so the tradeoff between model capacity and defensive strength requires attention during deployment.

Real-world feasibility

Dr. Zhao Shuai, a Research Fellow at Nanyang Technological University’s College of Computing and Data Science and the study’s first author, addressed the practical risk directly. “Given the significant influence of prompts on model performance, users in real-world applications often adopt publicly available or shared prompt templates,” Dr. Zhao said. “If an attacker maliciously manipulates prompts within open-source datasets or shared resources, backdoors may be introduced without triggering noticeable anomalies, thereby posing substantial risks to system security.”

Dr. Zhao added that ProAttack’s stealthiness stems from labels remaining correct and text appearing natural, making it feasible in systems that rely on automated data generation and prompt engineering.

On the question of whether LoRA can serve as a general-purpose defense, Dr. Zhao was measured. “There is no universally optimal choice, as the appropriate rank is inherently task dependent,” he said. “Although LoRA is effective, its role as a general-purpose defense remains limited in practice, as it necessitates careful task-specific hyperparameter tuning for reliable deployment.”

Scope and next steps

The researchers acknowledge two limitations. Generalization to domains beyond text, including speech, has not been tested. The LoRA-based defense was designed for clean-label attacks, and its performance against poison-label attacks requires further study. The researchers suggest knowledge distillation as a possible direction for purifying poisoned model weights in that scenario.

More about