Researchers automate jailbreaking of LLMs with other LLMs

AI security researchers from Robust Intelligence and Yale University have designed a machine learning technique that can speedily jailbreak large language models (LLMs) in an automated fashion.


“The method, known as the Tree of Attacks with Pruning (TAP), can be used to induce sophisticated models like GPT-4 and Llama-2 to produce hundreds of toxic, harmful, and otherwise unsafe responses to a user query (e.g. ‘how to build a bomb’) in mere minutes,” Robust Intelligence researchers have noted.

Their findings suggest that this vulnerability is universal across LLM technology, but they don’t see an obvious fix for it.

The attack technique

A variety of attack tactics can be used against LLM-based AI systems.

There are prompt attacks, i.e., prompts crafted to make the model “spit out” answers that, in theory, it shouldn’t provide.

AI models can also be backdoored (made to generate incorrect outputs when triggered) and their sensitive training data extracted – or poisoned. Models can be “confused” with adversarial examples, i.e., inputs that trigger unexpected (but predictable) outputs.

The automated adversarial machine learning technique devised by the Robust Intelligence and Yale University researchers enables the first of these categories, prompt attacks, by overriding the restrictions (“guardrails”) placed upon the models.

“[The method] enhances AI cyber attacks by employing an advanced language model that continuously refines harmful instructions, making the attacks more effective over time, ultimately leading to a successful breach,” the researchers explained.

“The process involves iterative refinement of an initial prompt: in each round, the system suggests improvements to the initial attack using an attacker LLM. The model uses feedback from previous rounds to create an updated attack query. Each refined approach undergoes a series of checks to ensure it aligns with the attacker’s objectives, followed by evaluation against the target system. If the attack is successful, the process concludes. If not, it iterates through the generated strategies until a successful breach is achieved.”
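The passage above outlines an iterative refine-prune-evaluate loop. Below is a minimal Python sketch of that kind of loop, included for illustration only: the attacker_llm, evaluator_llm, target_llm and judge_llm callables, the branching factor, and the round budget are hypothetical placeholders standing in for real model APIs, not the researchers’ published implementation.

```python
# Illustrative sketch of a TAP-style refine-prune-evaluate loop.
# All LLM callables below are hypothetical stand-ins, not real APIs.

from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Candidate:
    prompt: str    # current attack prompt
    feedback: str  # notes from previous rounds, used to refine the next attempt


def refine_prune_evaluate(
    goal: str,
    attacker_llm: Callable[[str, str], List[str]],  # (goal, feedback) -> refined prompts
    evaluator_llm: Callable[[str, str], bool],      # (goal, prompt) -> still on topic? (pruning check)
    target_llm: Callable[[str], str],               # prompt -> target model response
    judge_llm: Callable[[str, str], bool],          # (goal, response) -> did the attack succeed?
    branching: int = 4,
    max_rounds: int = 10,
) -> Optional[str]:
    """Iteratively refine prompts, prune off-topic ones, and test the rest
    against the target until a round succeeds or the budget runs out."""
    frontier = [Candidate(prompt=goal, feedback="")]
    for _ in range(max_rounds):
        next_frontier: List[Candidate] = []
        for cand in frontier:
            # The attacker model proposes several refinements of the current prompt.
            for refined in attacker_llm(cand.prompt, cand.feedback)[:branching]:
                # Pruning step: discard refinements that drift from the objective
                # before spending a query on the target model.
                if not evaluator_llm(goal, refined):
                    continue
                response = target_llm(refined)
                if judge_llm(goal, response):
                    return refined  # successful breach; stop here
                # Keep the target's reaction as feedback for the next round.
                next_frontier.append(
                    Candidate(prompt=refined, feedback=f"Target replied: {response[:200]}")
                )
        if not next_frontier:
            return None  # every branch was pruned; nothing left to refine
        frontier = next_frontier
    return None  # query budget exhausted without a successful breach
```

In this sketch, the pruning check is what keeps the query count low: off-topic refinements are dropped before the target model is ever called, which mirrors the stealth property described below.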

This jailbreaking method is automated, can be leveraged against both open and closed-source models, and is optimized to be as stealthy as possible by minimizing the number of queries.

The researchers tested the technique against a number of LLMs, including GPT-4, GPT-4 Turbo and PaLM-2, and found that it discovers jailbreaking prompts for more than 80% of requests for harmful information while using fewer than 30 queries, on average.

“This significantly improves upon the prior automated methods for jailbreaking black-box LLMs with interpretable prompts,” they say.

They shared their research with the developers of the tested models before making it public.

Probing LLMs for vulnerabilities

As tech giants continue to vie for the leadership spot in the AI market by building new specialized large language models seemingly every few months, researchers – both independent and those working for those same companies – have been probing them for security weaknesses.

Google has set up an AI-specific Red Team and expanded its bug bounty program to cover AI-related threats. Microsoft has also invited bug hunters to probe its various Copilot AI integrations.

Earlier this year, the AI Village at the DEF CON hacker convention hosted red teamers who were tasked with testing LLMs from Anthropic, Google, Hugging Face, NVIDIA, OpenAI, Stability, and Microsoft to uncover vulnerabilities that leave LLMs open to manipulation.
