Why security teams cannot rely solely on AI guardrails
In this Help Net Security interview, Dr. Peter Garraghan, CEO of Mindgard, discusses their research around vulnerabilities in the guardrails used to protect large AI models. The findings highlight how even billion-dollar LLMs can be bypassed using surprisingly simple techniques, including emojis.
To defend against prompt injection, many LLMs are wrapped in guardrails that inspect and filter prompts. But these guardrails are typically AI-based classifiers themselves, and, as Mindgard’s study shows, they are just as vulnerable for certain types of attacks.
Guardrails are touted as critical defenses for LLMs. From your perspective, what are the biggest misconceptions about how effective they really are in practice?
If one took a step back and asked anyone in security “would I feel comfortable relying on a WAF (Web Application Firewall) as my critical defence to protect my organization?”, the answer would (hopefully) be a resounding no. Guardrails act akin to firewalls that attempt to detect and block malicious prompts. Although they are a piece of the puzzle, ensuring effective defences is greater than deploying a single solution. On the other hand, a common misconception is their effectiveness when encountering even a slightly motivated attacker.
Guardrails use AI models for detection, and these themselves have blindspots within them. It is one thing to block “obvious” malicious or toxic instructions, it is another when the prompt can be written in an extremely large number of combinations (changing letters, words, rephrasing, etc.) that a human may interpret, however a guardrail will struggle with.
The study shows near 100% evasion success using simple techniques like emoji and Unicode smuggling. Why do these basic methods work so well against systems that are supposed to detect manipulation?
Emoji and Unicode tag smuggling techniques are highly effective because they exploit weaknesses in the preprocessing and tokenization stages of the guardrail’s NLP pipeline. Guardrail systems rely on tokenizers to segment and encode input text into discrete units that the model can classify. However, when adversarial content is embedded within complex Unicode structures—such as emoji variation selectors or tag sequences—the tokenizer often fails to preserve the embedded semantics.
For example, when text is injected into an emoji’s metadata or appended using Unicode tag modifiers, the tokenizer may either collapse the sequence into a single, innocuous token or discard it entirely. As a result, the embedded content never reaches the classifier in its original form, meaning the model sees a sanitized input that is no longer representative of the actual prompt. This leads to systematic misclassification.
These failures are not necessarily bugs in the tokenizer but design trade-offs that prioritize normalization and efficiency over adversarial robustness. Standard tokenizers are not built to interpret or preserve semantic meaning in adversarially crafted Unicode sequences. Unless guardrails incorporate preprocessing layers that are explicitly designed to detect or unpack these encodings, they remain blind to the embedded payloads. This highlights a fundamental gap between how attackers encode meaning and how classifiers process it.
In adversarial machine learning, perturbations are designed to be imperceptible to humans. Does this raise unique challenges for developing explainable or interpretable defenses?
Imperceptible perturbations definitely raise unique challenges for developing explainable defenses. AI models interpret data completely different to how we do as humans, perturbations which wouldn’t change the contextual or semantic meaning of the content to us may drastically change the decision made by the AI model. This disconnect makes it hard to explain why a model would fail to classify text that we can intuitively understand. This disconnect in turn reduces the effectiveness for developers creating to improve defenses upon adversarial perturbations.
The paper suggests a disconnect between what the guardrails detect and what the LLM understands. How should security teams address this fundamental mismatch in behavior and training data?
The core issue is that most guardrails are implemented as standalone NLP classifiers—often lightweight models fine-tuned on curated datasets—while the LLMs they are meant to protect are trained on far broader, more diverse corpora. This leads to misalignment between what the guardrail flags and how the LLM interprets inputs. Our findings show that prompts obfuscated with Unicode, emojis, or adversarial perturbations can bypass the classifier, yet still be parsed and executed as intended by the LLM. This is particularly problematic when guardrails fail silently, allowing semantically intact adversarial inputs through.
Even emerging LLM-based judges, while promising, are subject to similar limitations. Unless explicitly trained to detect adversarial manipulations and evaluated across a representative threat landscape, they can inherit the same blind spots.
To address this, security teams should move beyond static classification and implement dynamic, feedback-based defenses. Guardrails should be tested in-system with the actual LLM and application interface in place. Runtime monitoring of both inputs and outputs is critical to detect behavioral deviations and emergent attack patterns. Additionally, incorporating adversarial training and continual red teaming into the development cycle helps expose and patch weaknesses before deployment. Without this alignment, organizations risk deploying guardrails that offer a false sense of protection.
What directions do you think LLM guardrail research should take next, especially in anticipation of more powerful, multi-modal, or autonomous models?
LLM guardrails can be most effective when combined with other defensive strategies and techniques, and thus research into how guardrails enhance the overall defensive posture of actual AI applications would be helpful. Threat modelling is key to create well-suited defences, and we would suggest that mapping the modelled threats directly to the application use case and guardrail configuration/focus is key.
We observe that a lot of the research in the area evaluates models against a broad set of (reasonably generic) benchmarks. While benchmarking is a good means to ensure a more fair evaluation between guardrails, research in this space could be improved if guardrails were designed, deployed, and evaluated in actual AI application use cases and against motivated attackers aiming to demonstrate meaningful exploitation who leverage more sophisticated techniques to bypass detection.