Engineering trust: A security blueprint for autonomous AI agents
AI agents have evolved from just chatbots, answering questions to executing actions using various integrated tools, often autonomously, and as such the traditional security models have become less efficient. I have seen that firsthand as a security lead for the Fitbit personal health coach.
Consider an agent that can access or update health records on behalf of a user. A single malicious instruction hidden in a webpage (that the agent processes), can manipulate that agent into performing unintended actions or leaking sensitive health data.
In high-stakes industries like healthcare or finance, there is a very small or no margin for error. Therefore, I believe securing these AI agents requires a multi level approach, similar to how we have built defense-in-depth in security. We will discuss such an approach in this article.
The agentic risk lives at the application layer (mostly)
Many AI security frameworks share common security concerns with AI agents. They offer recommendations to contain the risks, either as a broad strategic view or targeted technical controls. These frameworks broadly categorize the AI threats as data, infrastructure, model and application.
The application layer is where the most significant agentic risk lives. One such risk is prompt injection. Not a risk on its own, but how the prompts are processed that may compromise the agent, user data or the underlying system.
This problem is amplified when the data comes from untrusted sources, such as in a RAG based system (Retrieval Augmented Generation). Malicious instructions hidden in a webpage, or document can be ingested as context and interpreted as trusted input, leading to indirect prompt injection.
A secure-by-design blueprint for AI agents
As a first step to the multi level approach, organizations should adopt an adversarial threat model for their agents early in the development lifecycle. Leverage agentic threat-modeling frameworks like Maestro.
It can serve as the technical architectural roadmap for the engineering and security teams by setting security design principles, such as:
- Agents that run code (e.g., for monthly steps data analysis), should execute in an isolated sandboxed environment.
- In multi-agent ecosystems, there should be a role separation. Log and control the transitions between them. By doing so you can ensure that a single compromised sub-agent doesn’t end up compromising the entire system.
- When creating prompts for the LLM, set delineation between the “System” instructions, “User” and “Third-Party” data. Instruction tuned LLMs understand this, which may prevent prompt manipulation attacks.
Several core defense strategies should also become part of early stages itself:
- Narrow the agent’s scope using explicit system instructions. “You are a health coach and only respond to wellness queries”. This is not foolproof but can help the agent to stay within its defined constrained environment.
- Grant agents the minimum tool/API access, if a health agent only needs to read step counts, it should not have write access to the user’s medical history. Additionally, implement a human-in-the-loop (HITL) for any sensitive actions.
- Every action executed by the agent should carry the originating user’s identity. This ensures that even if the agent is manipulated, it cannot perform actions the user themselves would not be authorized to perform.
Building a secure architectural foundation is a good start, but we need additional layers for building hardened agentic systems.
The real-time defensive layer for AI agents
Security controls such as web application firewalls rely on static signatures and are not effective for AI agents. Since the LLM’s ability to process natural language enables adversaries to craft attacks that these static patterns cannot detect.
To protect an AI agent, we need to use an AI to defend it too. Implement a real-time defensive LLM designed to detect and neutralize attacks such as prompt manipulation or any attempts to exfiltrate sensitive user data. If an untrusted input tries to trick the model, the defense layer intercepts and stops it, before the main agent even processes the request.
Such additional layers can introduce latency in the workflow and it is often a concern with the engineering teams and leadership. I recommend using a dedicated, high-speed Small Language Model (SLM). That is either pre-tuned or instruction tuned for prompt injection attacks. One such open weight SLM is fine-tuned DeBERTa v3 by Protect AI.
The offensive testing of the AI agent orchestration
The final step is the offensive phase, which automates red teaming for the agent. It’s an adversarial testing of the entire agent orchestration that includes the tools used by the agent, the multi-agent communications including the defensive layer itself. Use open-source adversarial scanners like Garak or PyRIT to automate this offensive process.
By generating context-aware malicious prompts for your agents, using techniques like markdown injection to exfiltrate data, we can proactively probe the agents for such security risks. This creates a feedback loop, and we can use these offensive insights to directly harden our defensive filters and the agent orchestration itself.