AI agents break rules in unexpected ways

AI agents are starting to take on tasks that used to be handled by people. These systems plan steps, call tools, and carry out actions without a person approving every move. This shift is raising questions for security leaders. A new research paper offers one of the first attempts to measure how well these agents stay inside guardrails when users try to push them off course.

AI agent testing

The work comes from a group of researchers at Intuit who built a testing framework called the Agentic Steerability Testing and Robustness Analysis (ASTRA) framework. The goal is to understand how different language models behave once they can take actions, not just give answers.

A closer look at how ASTRA pressures agents

ASTRA runs ten simulated scenarios that reflect common agent use cases, such as coding assistance, sales data analysis, printer management, and even a delivery drone. Each scenario includes its own tools and its own guardrails. The guardrails define what the agent must not do. The scenarios also include attacks that try to push the agent to break those rules.

The team tested thirteen open source models and focused on whether an agent could resist pressure during multi step interactions. The agent sees user prompts, tool outputs, and its own past steps. Each of these inputs can be a source of risk. A malicious user can issue direct instructions. A poisoned tool response can include hidden prompts. A long conversation can weaken an agent’s resistance to violations.

The scenarios cover a wide range of issues. Some attacks try to make the agent call the wrong tool. Some try to change parameters. Others aim to get the agent to leak information about its own instructions. The study calls these violation types. Four categories were used across all tests, which gives security teams a structure for thinking about where things can break when building their own agents.

The numbers that change expectations

Two findings stand out. First, model size did not predict good behavior. Some smaller models performed high in the agentic steerability score. Some larger models struggled. A few of the smallest models scored below 0.40. The top group reached about 0.88 to 0.89, which shows that strong performance is possible but not consistent across the field.

Second, resistance to general jailbreak prompts did not line up with resistance inside agent scenarios. The correlation between those two categories was negative. Some models that refused most broad jailbreak prompts still broke rules during tool use. This signals that refusing harmful text is not enough once an agent can act on behalf of a user. Guardrail following during planning and tool selection is a different skill.

“We observed a low negative correlation, meaning these capabilities are not inherently contradictory. In fact, some models performed poorly in both areas,” Itay Hazan, co-author of the research, told Help Net Security. He added that a possible explanation may involve training effects that were not part of the study. “We believe this is likely due to catastrophic forgetting. When models are trained extensively to improve one safety dimension without simultaneously maintaining others, they tend to degrade in other capabilities, and in that case the other tested security dimension, but this is only an assumption.”

The team also compared ASTRA’s results with a separate benchmark that measures policy following in a chat setting. The correlation was only moderate. This supports the idea that agents need their own evaluation method. Chat behavior does not predict how an agent will handle tool calls, multi turn planning, or attacks hidden inside tool outputs.

CISOs should rethink risk scoring for agents

CISOs who are preparing for agent deployments now have a data point that shows why traditional evaluations do not transfer well to tool using systems. The research illustrates that guardrails written in system prompts are no guarantee during long interactions. An attacker does not need a single breakthrough. Small lapses during planning steps can be enough.

Another useful insight is the finding that misusing tools is common. The study notes attempts to call the wrong tool, misuse parameters, or operate tools without the privileges defined in the scenario. These behaviors can lead to outages or data exposure in real deployments. It suggests that teams should build guardrails across the tool layer, not just at the language model layer.

The study also highlights indirect prompt injection as a serious risk. Several scenarios include tool outputs that contain hidden instructions. Some models followed these instructions even when the guardrails prohibited the action. This is an important signal that supply chain style attacks will likely extend into agent workflows.

Hazan said ASTRA can help teams build repeatable testing methods for these concerns. “Our recommendation is to use the framework to rigorously test and select the LLM that best adheres to your specific constraints,” he said. He also noted that organizations that care most about long multi turn interactions can push the framework further. “ASTRA is highly adaptable, allowing organizations to easily add custom multi turn attacks if that is their primary security focus, but this was not a part of the existing version.”

The researchers point to the idea that agentic steerability can be trained. This raises questions for teams who will soon evaluate vendor claims about safety. If the skill can be shaped during training, organizations will need evidence that a model has been tuned for this type of control, not only for refusal of obvious requests.

A testbed security teams can tailor to their own needs

ASTRA is modular, and the authors encourage organizations to build their own scenarios. Security teams can define system prompts, guardrails, and tools that match their environment. This may be one of the most practical parts of the work. It offers a way for teams to test agents before deployment without connecting them to live systems.

The research serves as an early reminder that agent security is still in an experimental stage. These systems behave differently once they can act. CISOs who are preparing for adoption will need evaluation methods that match the risks, along with monitoring and guardrail layers that assume failure is possible during long interactions.

Ready to dive deeper into AI security strategies? Download Delinea’s comprehensive 2025 AI in Identity Security Report to discover the latest insights and best practices for securing AI in your organization.

More about

AI agents break rules in unexpected ways

A closer look at how ASTRA pressures agents

The numbers that change expectations

CISOs should rethink risk scoring for agents

A testbed security teams can tailor to their own needs

Featured news

Resources

Don't miss