Automated LLM red teaming gets a learning layer

Automated red teaming of large language models has settled into a familiar pattern over the past two years. An attacker model generates jailbreak attempts against a target model, an evaluator scores the results, and the cycle repeats.

Two approaches dominate. One asks the attacker to invent strategies through trial and error, which tends to produce a narrow band of successful attacks. The other, exemplified by the WildTeaming framework, draws from large open-source pools of harmful queries and jailbreak tactics and combines them at random to feed the attacker.

Researchers at Capital One’s AI Foundations group have proposed a third path. Their framework, called Adaptive Instruction Composition, keeps the crowdsourced attack ingredients used by WildTeaming and adds a learning layer that decides which combinations to try next based on what has already worked.

The combinatorial problem

The WildJailbreak dataset that underpins WildTeaming contains roughly 50,500 harmful queries and 13,311 jailbreak tactics scraped from public sources. Pairing one query with two tactics yields more than 8 trillion possible attack instructions. Random sampling generates diversity for free, with no prior assumptions about what works.
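The size of that space can be checked with a few lines of arithmetic. The sketch below uses the approximate dataset sizes quoted above and shows two counting conventions. Which one the paper uses is an assumption here: counting ordered pairs of distinct tactics gives roughly 8.9 trillion, consistent with the "more than 8 trillion" figure, while unordered pairs give half that.

```python
from math import comb

NUM_QUERIES = 50_500   # approximate figure quoted in the article
NUM_TACTICS = 13_311

# Assumption: tactic order matters and the two tactics are distinct.
ordered = NUM_QUERIES * NUM_TACTICS * (NUM_TACTICS - 1)

# Alternative assumption: the two tactics form an unordered pair.
unordered = NUM_QUERIES * comb(NUM_TACTICS, 2)

print(f"ordered pairs:   {ordered:,}")
print(f"unordered pairs: {unordered:,}")
```

Under either convention the space is far too large to enumerate, which is why the choice of sampling strategy matters.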

The cost is that random sampling discards information. Once a particular kind of query or tactic produces a successful jailbreak against a given target, a random sampler has no way to lean toward similar combinations on the next attempt. For safety teams trying to build training data tailored to a specific deployed model, that inefficiency adds up across thousands of trials.

How adaptive composition works

Adaptive Instruction Composition replaces the random combiner with a contextual bandit, a class of reinforcement learning model designed for situations where an agent picks among many options and learns from the rewards it receives. The bandit takes semantic embeddings of candidate queries and tactics, scores combinations based on their predicted likelihood of success, and updates its predictions after each trial using the evaluator’s verdict.
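The select, score, update loop can be illustrated with a toy contextual bandit. This is a minimal sketch, not the authors' implementation: it swaps their neural network for a plain logistic model, uses epsilon-greedy selection as a stand-in for whatever exploration rule the paper uses, and replaces real SBERT embeddings and the real evaluator with random vectors and a hypothetical `toy_evaluator`. The class name `LogisticBandit` and all parameters are illustrative.

```python
import math
import random


class LogisticBandit:
    """Toy contextual bandit: a logistic success predictor updated online."""

    def __init__(self, dim, lr=0.5, epsilon=0.2, seed=0):
        self.w = [0.0] * dim      # one weight per embedding dimension
        self.b = 0.0
        self.lr = lr              # step size for the online update
        self.epsilon = epsilon    # chance of trying a random candidate
        self.rng = random.Random(seed)

    def predict(self, x):
        """Predicted probability that combination x yields a jailbreak."""
        z = self.b + sum(wi * xi for wi, xi in zip(self.w, x))
        if z >= 0:
            return 1.0 / (1.0 + math.exp(-z))
        e = math.exp(z)
        return e / (1.0 + e)

    def select(self, candidates):
        """Pick the index of the candidate to try next."""
        if self.rng.random() < self.epsilon:
            return self.rng.randrange(len(candidates))
        scores = [self.predict(x) for x in candidates]
        return max(range(len(candidates)), key=scores.__getitem__)

    def update(self, x, reward):
        """One gradient step on log loss using the evaluator's 0/1 verdict."""
        err = reward - self.predict(x)
        self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]
        self.b += self.lr * err


def toy_evaluator(x):
    # Hypothetical stand-in for "run the attack, ask the judge".
    # Here an attack "succeeds" whenever the first feature is positive.
    return 1.0 if x[0] > 0.0 else 0.0


data_rng = random.Random(1)
bandit = LogisticBandit(dim=4)
for _ in range(500):
    # Each candidate stands in for a concatenated query + tactic embedding.
    candidates = [[data_rng.uniform(-1, 1) for _ in range(4)] for _ in range(8)]
    choice = candidates[bandit.select(candidates)]
    bandit.update(choice, toy_evaluator(choice))

print(round(bandit.predict([1, 0, 0, 0]), 3), round(bandit.predict([-1, 0, 0, 0]), 3))
```

After a few hundred trials the model assigns a higher success probability to vectors with a positive first feature, even ones it has never tried. The same mechanism, applied to real embeddings, is what lets a bandit lean toward combinations that resemble past successes.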

Figure: Overview of adaptive instruction composition (Source: Research paper)

Two design choices matter for practitioners. First, the bandit network is small, with around 2,200 parameters in the single-tactic configuration. Second, the input embeddings come from a contrastively trained sentence encoder (SBERT), which groups semantically related text together in the embedding space. The combination lets the model generalize from a successful attack to other similar combinations it has never tried, which is what makes learning over a trillion-scale action space tractable.

The system supports two operating modes through a single hyperparameter. A subtle setting biases the bandit toward exploration and produces broad coverage of the attack space. An aggressive setting biases it toward exploitation and concentrates attempts on areas where successes accumulate. Safety teams looking for breadth and teams looking for depth can use the same pipeline with different settings.
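How a single knob can move a sampler between breadth and depth can be sketched with softmax sampling. The article does not say which hyperparameter the authors use, so the `temperature` parameter below is an assumption chosen for illustration, and the predicted scores are made up.

```python
import math
import random


def sample_index(scores, temperature, rng):
    """Sample a candidate index with probability proportional to
    exp(score / temperature)."""
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / temperature) for s in scores]
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]


predicted = [0.9, 0.5, 0.3, 0.1]   # made-up predicted success scores
rng = random.Random(0)
counts = {}
for label, temperature in [("exploration", 5.0), ("exploitation", 0.05)]:
    picks = [sample_index(predicted, temperature, rng) for _ in range(1000)]
    counts[label] = [picks.count(i) for i in range(len(predicted))]
    print(label, counts[label])
```

A high temperature spreads the 1,000 picks almost evenly across the four candidates, while a low temperature sends nearly all of them to the top-scoring one. The pipeline is the same in both cases and only the setting changes.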

Reported results

Across 10,000-trial simulations, the adaptive system more than doubled the attack success rate of WildTeaming against three open-weight target models (Mistral-7B, Llama-3-70B-Instruct, and Llama-3.3-70B-Instruct). It also surfaced a wider range of unique successful queries, indicating broader vulnerability coverage.

On the HarmBench benchmark, the system found a working jailbreak for nearly every test behavior on both target models evaluated there. Two qualifications apply. The benchmark allows up to 150 attempts per behavior, so the score reflects whether the system can eventually find a working attack within that budget, not how often a single attempt succeeds. The bandit was also pre-trained for 10,000 trials before evaluation. Comparison numbers for other methods such as PAIR, TAP, and AutoDAN-Turbo come from previously published figures.
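The effect of the attempt budget can be shown with a short calculation. The sketch below assumes each attempt succeeds independently with the same probability, which is a simplification since an adaptive attacker's attempts are correlated. The per-attempt rates are illustrative values, not figures from the paper.

```python
def success_within_budget(p, budget):
    """Probability of at least one success in `budget` independent
    attempts, each succeeding with probability p."""
    return 1.0 - (1.0 - p) ** budget


BUDGET = 150  # attempts allowed per behavior, as described above

for p in (0.01, 0.03, 0.10):   # hypothetical per-attempt success rates
    within = success_within_budget(p, BUDGET)
    print(f"per-attempt {p:.0%} -> within {BUDGET} attempts {within:.1%}")
```

Under these assumptions, even a per-attempt success rate of a few percent translates into a near-certain success within 150 tries, so budgeted scores and per-attempt success rates should not be compared directly.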

Attacks travel between models

A bandit trained to jailbreak one model worked against a different model with no retraining. Attackers who find weaknesses in one system get a running start on others, which matters for organizations deploying more than one LLM across their stack.

What the system finds

Clustering of successful attacks against Llama-3-70B grouped queries into 14 semantic categories spanning mental health exploitation, fraud, medical misinformation, privacy violations, substance abuse, and financial fraud. Tactic clusters fell into nine families, with fictitious framing, role-playing, obfuscation, and false legitimization accounting for most successes. The categories match those documented in the original WildJailbreak taxonomy, indicating the bandit is concentrating on known vulnerability classes with greater efficiency.

Limits and considerations

The published evaluation covers three open-weight target models. Generalization to closed commercial systems is untested. The primary evaluator during training was Llama-Guard-2, which can produce false positives and false negatives, so reported success rates carry the usual caveats associated with classifier-based judging. The authors validated a subset of results with the HarmBench classifier as a secondary check.

Compute requirements remain substantial. A single 10,000-trial simulation consumed between 70 and 120 GPU hours depending on the target model, in line with other iterative red-teaming systems.

The work sits within a defensive use case. Safety teams need attack data to train and patch their models, and adaptive composition offers a way to generate that data with better coverage of an organization’s specific weaknesses than random sampling provides. The same techniques are available to attackers, which is the standard dual-use condition for red-teaming research. The authors recommend responsible disclosure of discovered vulnerabilities to model developers and restrict trained policy weights to verified researchers.

The gap between random fuzzing and targeted, learned attack generation has narrowed. Internal red-teaming programs that still rely on manual prompt engineering or unguided sampling are working with tools that adaptive systems now outperform on both success rate and coverage.
