The hidden risks of LLM autonomy

Large language models (LLMs) have come a long way from the once passive and simple chatbots that could respond to basic user prompts or look up the internet to generate content. Today, they can access databases and business applications, interact with external systems to independently execute complex tasks and make business decisions.

This transformation is primarily supported by emerging interoperability standards, such as the Model Context Protocol (MCP) and Agent-to-Agent (A2A) communication.

MCP, created to provide a standardized way of agent-to-tool interaction, enables seamless integration of LLMs like Claude and GPT with APIs, data sources, and external systems.

A2A, recently released by Google, is intended for agent-to-agent communication, allowing independent AI agents to exchange goals, share context, and trigger actions.

Excessive agency is a growing concern in organizations

LLM agents with excessive agency can undermine the fundamental principles of organizational security. For instance, an LLM with excessive autonomy or functionality may execute an unauthorized action due to unclear, manipulated, or adversarial inputs, impacting an organization’s integrity.

Many critical functions, specifically in healthcare and finance, have LLMs incorporated into their core systems, influencing millions of users. A single error or biased input through excessive LLM agency can cause long-term consequences. Organizations are often confronted with black box LLMs that are opaque in their internal workings, which prevents users from trusting their outputs or verifying the correctness of results, further aggravating the risks.

The increasing use of agentic LLMs enhances the risk of overreliance on their outputs, which can diminish human critical thinking. This overreliance may cause what has been referred to as a “process debt,” in which mistakes and bias are not detected because fewer humans are involved. This can have severe consequences, particularly in high-stakes fields like medicine and finance.

MCP and A2A integration into AI workflows creates new supply chain attack modes, as LLMs autonomously interact with external systems without adequate monitoring.

Attackers don’t necessarily need to compromise the model itself, but rather any one of the services providing it with input. A2A, in particular, manages distributed and non-deterministic agent interactions, reducing insight into where a request went awry. This makes it more difficult to identify errors or find malicious interventions.

Factors leading up to excessive agency

There are several causes for excessive agency in LLMs:

Excessive functionality: LLM agents may have access to APIs or plugins with more functionality than is needed for their operation.

Excessive permissions: LLMs are given elevated access, beyond their requirements, allowing them to change, delete, or access sensitive information.

Excessive autonomy: LLMs are made to self-improve and decide autonomously without human intervention, increasing the chances of uncontrollable behavior.

Bias in training data: Biased or imbalanced training data will lead to the model learning biased representations, leading to autonomous decisions based on these biases.

Overfitting to the training data: Overfitting occurs when an LLM overlearns to be too precise in learning the training data, including noise and anomalies, which keeps it from generalizing to new inputs. This results in unstable behavior when the model is presented with new situations, where it behaves poorly on its own and contributes to excessive agency.

Model complexity: LLMs’ complex structure and large number of parameters create unwanted behaviors that are hard to control. This complexity can lead the model to take unwanted actions, resulting in excessive agency.

The danger of overly autonomous LLMs

Threat actors are exploiting the excessive autonomy granted to LLMs using various methods:

Direct prompt injection: Attackers manipulate the LLM model to disregard its moderation policies and instead execute their instructions, employing deceptive prompts to trick LLMs into revealing confidential information or carrying out dangerous commands.

Indirect prompt injection: Attackers insert malicious commands into an external data source, such as a website or a document, that the AI reads. Such attacks often enable web LLM attacks on other users.

Data poisoning: Attackers introduce biases, weaknesses, and adversarial inputs into LLM training models. It taints the model’s integrity, generating false, biased, or malicious outputs.

Autonomy exploitation: LLMs with uncontrolled autonomy can be exploited by attackers to perform actions outside their planned scope, causing security flaws or operational interference.

Leaking sensitive training data: Adversaries leverage prompts to control LLMs to reveal sensitive information, such as proprietary data and system passwords.

Mitigation strategies for excessive agency in LLMs

Implementing AI evaluators: Organizations can ensure controlled permissions for AI systems with an AI evaluation framework that delivers automated protocols and guidelines to govern AI behavior. This ensures systems remain within set safety boundaries, promoting a reliable and trustworthy AI environment.

AI evaluators continuously monitor LLM interactions to detect unauthorized activities or irregularities and flag cases of AI agents operating beyond their planned scope. They audit AI permissions to prevent LLMs from having undue access to sensitive systems. They can detect and evaluate vulnerabilities through penetration testing and simulated prompt injection attacks to make AI security more robust within organizations.

Enhancing training data quality: Any LLM’s behavior is based on its training data. Organizations must focus on curating varied, representative, and unprejudiced datasets. Data cleaning, preprocessing, and augmentation methods can eliminate anomalies, errors, or inappropriate information and enable the model to learn from correct and relevant information.

Employing the OWASP Framework for AI security: As LLMs gain a solid ground in software development, the OWASP guidelines present a systematic approach for organizations to secure AI systems by eliminating vulnerabilities, implementing ethical AI practices, and alleviating risks from excessive agency.

Applying a human-in-the-loop approach: Human-in-the-loop control is essential for controlling LLM behavior. It enables oversight, intervention, and ethical decision-making that AI systems cannot achieve alone. Before the LLM’s execution, human operators review and approve actions, especially those with significant impact or involving sensitive information or operations.

Avoiding the risk of agent context protocols: Organizations must use least-privilege context sharing, restricting agent permissions to only what is necessary for their functions within the organization’s context. To maintain a secure supply chain, organizations must ensure that all libraries, APIs, and third-party integrations their model can access are vetted and regularly patched. Implement strict network access policies that ensure only trusted entities can access assets within the protocol’s environment.

Conclusion

The emergence of excessive agency in autonomous LLMs calls for security measures and responsible AI governance. Unchecked autonomy poses serious threats, including unauthorized data access, privilege escalation, biased results, and adversarial attacks.

A structured AI governance that balances autonomous LLMs with human intervention is required to ensure LLM-based solutions enhance operations without undermining cybersecurity.

Don't miss