OpenGuardrails: A new open-source model aims to make AI safer for real-world use

When you ask a large language model to summarize a policy or write code, you probably assume it will behave safely. But what happens when someone tries to trick it into leaking data or generating harmful content? That question is driving a wave of research into AI guardrails, and a new open-source project called OpenGuardrails is taking a bold step in that direction.

OpenGuardrails

Created by Thomas Wang of OpenGuardrails.com and Haowen Li of The Hong Kong Polytechnic University, the project offers a unified way to detect unsafe, manipulated, or privacy-violating content in large language models. It focuses on a problem that many companies run into once they start using AI at scale: how to make safety controls adaptable to different contexts without rewriting the system each time.

A flexible approach to AI safety

At the core of OpenGuardrails is something called configurable policy adaptation. Instead of fixed safety categories, organizations can define their own rules for what counts as unsafe and adjust the model’s sensitivity to those risks.

That flexibility could make a big difference in production. A financial firm might focus on detecting data leaks, while a healthcare provider might tighten policies around medical misinformation. The configuration can be updated at runtime, allowing the system to adapt as needs or regulations evolve.

This design turns moderation into a living process rather than a one-time setup. It also aims to reduce the manual review of uncertain cases, since administrators can tune how cautious the system should be with a single parameter.

Thomas Wang, CEO at OpenGuardrails, said the team has already seen how valuable configurable sensitivity thresholds can be in the field. “We’ve been running real-world enterprise deployments of OpenGuardrails for over a year, and configurable sensitivity thresholds have proven critical in adapting to the diverse risk tolerance of different business domains,” he said.

He explained that every new deployment begins with a “gray rollout” period. “In each new use case, enterprises begin with a one-week gray rollout phase using default sensitivity settings and only high-risk categories such as self-harm or violence. During this stage, the system collects calibration data and operational feedback before departments fine-tune their thresholds via the dashboard,” Wang said.

The process, he added, has shown consistent results across very different environments. “One of our customers, a company offering AI-powered youth mental health counseling, requires extremely high sensitivity for self-harm detection, even across multi-turn conversations. Another enterprise, which operates an AI system for complaint-handling customer support, uses much lower profanity sensitivity, flagging only the most severe insults to trigger escalation.”

Peter Albert, CISO at InfluxData, said that adopting such a tool should come with a commitment to long-term diligence. “Once you have decided to adopt a tool like OpenGuardrails, demand the same rigor of validation you would out of any commercial product. Establish regular dependency checks, community monitoring for new vulnerabilities, and periodic internal penetration tests. Pair that with external validation and require independent audits at least annually,” he said.

Albert’s point highlights a growing expectation among CISOs that open-source tools meet the same security and governance standards as proprietary software. OpenGuardrails’ transparency makes that possible, but it also requires organizations to maintain an active role in monitoring and validation.

One model, many defenses

Previous safety systems often relied on multiple models, each handling a different kind of problem such as prompt injection or code generation abuse. OpenGuardrails simplifies that structure. It uses one large language model to handle both safety detection and manipulation defense.

This approach helps the system understand subtle intent and context instead of depending only on banned-word filters. It also streamlines deployment, since organizations do not need to coordinate separate classifiers or services. The model runs in a quantized form that keeps latency low enough for real-time use.

The team built the system to be deployable as either a gateway or an API, giving enterprises control over how they integrate it. The platform can run privately within an organization’s infrastructure, aligning with growing demands for data privacy and regulatory compliance.

Wang said the company is already extending its work to defend against new types of attacks. “We maintain a dedicated security research team that tracks newly published jailbreak techniques and discovers new 0-day attacks through internal red-teaming and adversarial experiments,” he explained. “In parallel, our OpenGuardrails SaaS platform provides a continuous stream of real-world threat intelligence from users encountering novel prompt-based attacks in production environments.”

Multilingual by design

A standout feature of OpenGuardrails is its broad language coverage. It supports 119 languages and dialects, which makes it relevant for companies operating across different regions. Few open-source moderation tools have managed that scale.

To strengthen research in this area, the team also released a dataset that merges translated and aligned versions of several Chinese safety datasets and is freely available under the Apache 2.0 license. The release adds to the foundation for future multilingual safety work.

Strong results, open release

The system performs well across English, Chinese, and multilingual benchmarks. In prompt and response classification tests, it consistently ranked higher than previous guard models in accuracy and response consistency.

But performance is only part of the story. By releasing the model and the platform as open source, the authors make it possible for others to study, audit, and build upon their work. That openness could help accelerate progress in safety research while giving enterprises a way to test and adapt the model for their own needs.

Albert’s advice reinforces that openness should go hand in hand with accountability. His emphasis on audits and internal testing aligns with the project’s open design, encouraging organizations to integrate guardrails without assuming they are foolproof.

Built with production in mind

OpenGuardrails is structured for enterprise use. It can handle high traffic while maintaining stable response times, and its modular components can fit into existing AI pipelines. The model produces probabilistic confidence scores, allowing administrators to set numeric thresholds that adjust how strict the moderation should be.

This ability to tune sensitivity provides more control over false positives and negatives, helping organizations align moderation strictness with their risk tolerance and workflow.

Apu Pavithran, CEO of Hexnode, said that while guardrails strengthen AI oversight, they can also introduce operational strain. “Alert fatigue can quickly become a problem. Most admins are already spread thin and adding new detection tools can add to their workload considerably,” he said.

Pavithran added that proactive controls at the endpoint level can ease that burden. “For this reason, solutions that prevent risky behavior (and therefore AI policy violations) nip this problem in the bud. Endpoint-level controls are a good way to do this since unified endpoint management enables blacklisting unauthorized applications, preventing specific file uploads to external services, and enforcing device policies before requests ever reach a guardrail,” he explained.

He said the best results come from combining technical and human factors. “Guardrails help set the AI standard but work best in concert with stricter endpoint controls, user training, and better oversight, to name but a few. When combined, cultural training and technological controls contribute to a stronger defense than any single solution can provide on its own.”

Work still to be done

Despite its strong performance, the authors recognize limits. The model can still be vulnerable to targeted adversarial attacks designed to bypass its filters. Fairness and cultural bias also remain challenges, since definitions of unsafe content differ across regions. The team plans to explore regional fine-tuning and custom training to handle local requirements.

They also note that stronger defenses will likely come from engineering improvements and collaboration with external researchers.

OpenGuardrails is available on GitHub.

Subscribe to the Help Net Security ad-free monthly newsletter to stay informed on the essential open-source cybersecurity tools. Subscribe here!

Don't miss