Every set of AI guardrails can be broken by the right prompt

Companies that build AI systems wrap them in guardrails meant to block harmful output, including deepfakes, malware, and instructions for making biological weapons or illicit drugs. When a user prompts the system for such content, the guardrails are designed to flag the request and refuse. A new mathematical proof sets a limit on how secure those guardrails can ever be.

broken AI guardrails

Apostol Vassilev, a senior scientist at the National Institute of Standards and Technology, published the proof in the peer-reviewed journal IEEE Security & Privacy. It demonstrates that for any finite set of guardrails, some prompt exists that gets the AI to disregard them. Finding that prompt is the only requirement.

A century-old logic applied to AI

The proof builds on the incompleteness theorems that logician Kurt Gödel published in 1931. Gödel showed that a system built on a finite number of rules has limits on what it can prove within itself. In the early 20th century, several mathematicians worked to build a complete theory of mathematics from a small set of basic statements, or axioms. Gödel proved that any theory built from a finite set of statements is either incomplete or holds a contradiction. Adding statements to resolve contradictions produces fresh contradictions, and the process repeats.

The guardrails that govern an AI’s behavior form the same kind of system. Regardless of how well its designers consider them, a prompt always exists that makes the AI ignore its rules.

What the proof means for attackers and defenders

The proof gives attackers no method for finding new exploits. It forces them toward what security specialists call zero-day exploits, which are vulnerabilities known only to the person who finds them. Vassilev said such exploits in traditional deterministic software have been hard to find and execute, often requiring resources on the scale of a nation-state.

Human language as the input to AI systems raises the difficulty. The richness of language makes compliance-checking based on a finite set of rules ambiguous, and the number of ways an adversary can hide harmful intent in plain text is limitless. A successful jailbreak strips an AI of its guardrails and opens the door to cyberattacks, data breaches, and personalized phishing messages.

Findings reported outside NIST point in the same direction. Stanford’s Trustworthy AI Research Lab found model-level guardrails are insufficient on their own, with fine-tuning attacks bypassing Claude Haiku in 72 percent of cases and GPT-4o in 57 percent. Prompt injection moved from academic study into recurring production incidents during 2025, and the OWASP 2025 LLM Top 10 ranked it first among LLM risks. The vulnerability persists because language models have trouble separating instructions from the data they receive.

A continuous-monitor-and-update model

Vassilev set out an approach with three parts. Red teams work constantly to uncover new adversarial prompts ahead of attackers. Continuous updates harden guardrails against each newly discovered prompt. Operational resilience gives priority to limiting damage and recovering quickly when an exploit occurs.

Industry practice has moved toward continuous adversarial testing. Nancy Wang, CTO of 1Password, told Help Net Security in March 2026 that adversarial testing belongs inside continuous integration and release workflows, so that model updates, prompt changes, and agent reconfigurations automatically trigger predefined attack suites. Wang said the goal is to make “continuous validation part of the engineering lifecycle,” an approach that aligns with the continuous-monitor-and-update model Vassilev describes.

The aim is an economic equilibrium where the cost of breaking an AI system exceeds what attackers are willing to spend. Vassilev said the goal is to reach a state where “the cost of finding new exploits exceeds attackers’ resources.” He added that the effort may be expensive, and that it represents the cost of partial security that lets organizations gain the benefits of AI with lower risk.

More about

Every set of AI guardrails can be broken by the right prompt

A century-old logic applied to AI

What the proof means for attackers and defenders

A continuous-monitor-and-update model

Featured news

Resources

Don't miss