A $1,400 experiment in AI security auditing outperformed OpenAI’s Codex Security
A research team has built a system that teaches AI agents to hunt for software bugs by writing the audit method down as plain text. The system, called EVOHUNT, keeps the underlying AI model fixed and improves only an external “playbook” that tells the agent how to work.

One result stands out for anyone buying security tools. An open-source model running an evolved playbook found real vulnerabilities at a higher rate than OpenAI’s commercial Codex Security product, 11.3 percent against 9.2 percent across 371 test cases.
Treating the method as the thing that learns
Most attempts to make an AI auditing agent better swap in a bigger model, new operating software, and a fresh workflow all at once. The gain then gets credited to the model. EVOHUNT pulls these apart. It locks the model and its operating software in place and lets a text document improve on its own.
The setup is a loop of three agents. One audits a codebase and reports what it finds. A second checks those findings against known answers. A third rewrites the playbook based on the mistakes. The playbook starts empty and grows with each accepted edit, every version saved like code in a git repository. The two playbooks the team grew this way ended up at roughly 1,600 and 2,200 lines of procedure that the agent wrote itself.
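The loop described above can be sketched in a few lines of Python. This is only an illustration and not the team's code: the three agents are stand-in functions, the names `evolve`, `PlaybookHistory`, `toy_audit`, `toy_check`, and `toy_rewrite` are hypothetical, and the acceptance rule (keep an edit only if it lowers the number of missed bugs) is an assumption, since the article does not say how edits get accepted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Findings = Dict[str, str]  # case id -> reported bug class (hypothetical shape)


@dataclass
class PlaybookHistory:
    """Every accepted playbook version, oldest first; version 0 is empty."""
    versions: List[str] = field(default_factory=lambda: [""])

    @property
    def current(self) -> str:
        return self.versions[-1]


def evolve(
    audit: Callable[[str], Findings],
    check: Callable[[Findings], List[str]],
    rewrite: Callable[[str, List[str]], str],
    rounds: int,
) -> PlaybookHistory:
    history = PlaybookHistory()
    fewest_misses = len(check(audit(history.current)))
    for _ in range(rounds):
        misses = check(audit(history.current))       # agent 1 audits, agent 2 checks
        candidate = rewrite(history.current, misses)  # agent 3 edits the playbook
        candidate_misses = len(check(audit(candidate)))
        # Assumed acceptance rule: keep an edit only if it reduces misses.
        if candidate_misses < fewest_misses:
            history.versions.append(candidate)
            fewest_misses = candidate_misses
    return history


# Toy stand-ins for the three agents (hypothetical; real agents are LLM-driven).
KNOWN = {"case-a": "sqli", "case-b": "xss"}


def toy_audit(playbook: str) -> Findings:
    # The toy auditor only finds bug classes the playbook mentions.
    return {cid: bug for cid, bug in KNOWN.items() if bug in playbook}


def toy_check(findings: Findings) -> List[str]:
    # Returns the bug classes that were missed, compared with known answers.
    return [bug for cid, bug in KNOWN.items() if findings.get(cid) != bug]


def toy_rewrite(playbook: str, misses: List[str]) -> str:
    # Appends one line of procedure for the first missed bug class, if any.
    return playbook + f"check for {misses[0]}\n" if misses else playbook


history = evolve(toy_audit, toy_check, toy_rewrite, rounds=3)
print(len(history.versions))
```

The model never changes in this sketch. Only the text in `history.versions` grows, which is the separation the article describes.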
The test set comes from the GitHub Advisory Database and is split by date. The agent learns on bugs disclosed from 2023 through 2025, then gets tested on bugs disclosed in 2026, so it has never seen the answers. Each case runs inside a sandbox, and the team kept only serious bugs that an outside attacker could reach.
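A date-based split of this kind might look like the sketch below. The `Advisory` record, the `split_by_disclosure` function, and the example IDs are hypothetical; the only details taken from the article are the two date ranges.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Advisory:
    advisory_id: str  # placeholder IDs below, not real advisories
    disclosed: date


def split_by_disclosure(
    advisories: Iterable[Advisory],
    train_start: date = date(2023, 1, 1),
    test_start: date = date(2026, 1, 1),
) -> Tuple[List[Advisory], List[Advisory]]:
    """Train on 2023 through 2025, test on anything disclosed from 2026 on."""
    items = list(advisories)
    train = [a for a in items if train_start <= a.disclosed < test_start]
    test = [a for a in items if a.disclosed >= test_start]
    return train, test


sample = [
    Advisory("EXAMPLE-1", date(2022, 11, 30)),  # too old, dropped entirely
    Advisory("EXAMPLE-2", date(2024, 6, 1)),
    Advisory("EXAMPLE-3", date(2025, 12, 31)),
    Advisory("EXAMPLE-4", date(2026, 2, 14)),
]
train, test = split_by_disclosure(sample)
print([a.advisory_id for a in train], [a.advisory_id for a in test])
```

Splitting on disclosure date, not at random, keeps any later bug out of the learning set, so the playbook cannot have memorized a test answer.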
The numbers
Adding an evolved playbook to the closed-source GPT agent multiplied its working exploits sixfold. The open-source GLM agent reached 11.3 percent and passed Codex Security on every measure the team tracked.
The economics are where this gets interesting for security teams. The whole teaching campaign ran on subscription accounts for one month and cost around $1,400. After that, the actual auditing runs on cheaper open-source models. Putting an evolved playbook on a small Qwen model recovered most of the performance at roughly a third of the cost per case.
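The amortization argument can be made concrete with a short calculation. The $1,400 teaching cost comes from the article; the per-case prices passed in below are hypothetical placeholders chosen only to show a three-to-one ratio, since the article gives no per-case dollar figures.

```python
TEACHING_COST_USD = 1400.0  # one-time figure reported in the article


def amortized_cost_per_case(cases_audited: int, per_case_usd: float) -> float:
    """Spread the one-time teaching cost over every case audited afterwards."""
    if cases_audited <= 0:
        raise ValueError("cases_audited must be positive")
    return TEACHING_COST_USD / cases_audited + per_case_usd


# Hypothetical per-case prices: 3.00 for a larger model, 1.00 for a small one.
for cases in (100, 1_000, 10_000):
    large = amortized_cost_per_case(cases, per_case_usd=3.00)
    small = amortized_cost_per_case(cases, per_case_usd=1.00)
    print(f"{cases} cases: large {large:.2f} USD, small {small:.2f} USD")
```

As the number of audited cases grows, the teaching cost's share shrinks toward zero and the per-case price of the model doing the auditing dominates the bill.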
The system has already produced 28 confirmed vulnerability disclosures across 18 open-source projects, with one $1,500 bug bounty award.
Two audit personalities the loop invented
Left to grow on their own, the two playbooks settled into opposite styles, and the contrast lands on a tension every security team knows.
The GPT-grown playbook turned into a precision tool. It limits how many bug types it chases at once and refuses to report anything without a working, reproduced exploit behind it. In a sample of its findings, none were false alarms.
The GLM-grown playbook went the other way and became an exhaustive sweeper. More than thirty times, it repeats the order to keep looking, and it accepts thinner evidence to cover more ground. It caught more bugs overall and left more findings for a human to sort through. The choice between them mirrors the daily call security teams make between a short list they can trust and a long list they have to triage.
Distillation without touching the weights
The part that gives the work its reach is transfer. A playbook grown by a stronger teacher model made weaker, cheaper models substantially better at the same job. In plain terms, the expertise lives in a text file that any compatible model can pick up. One organization can pay for the expensive teaching step once, then run the resulting playbook on inexpensive models for as long as it likes.
Ziyue Wang, a co-author of the paper, addressed whether the two playbooks could be combined to get precision and coverage at the same time. “From an academic standpoint, our priority was to keep the experimental design clean,” Wang told Help Net Security. “While we haven’t systematically mapped out the exact adapter budget or the mechanisms for resolving conflicting audit styles, our current results suggest that fusing the precision of the GPT playbook with the breadth of the GLM playbook would enhance the agent’s overall vulnerability-hunting capabilities.”
What the comparison leaves open
The commercial yardstick, Codex Security, is a separate product, so the model and operating software underneath it differ from the EVOHUNT runs. A cleaner test would pit an evolved playbook against an expert-written one inside the very same agent. Wang said that kind of baseline is hard to get. “Finding a perfectly controlled, identical-agent baseline is challenging because many top-tier expert workflows are proprietary,” Wang said, adding that the team wanted to test against Anthropic’s Mythos but lacked access.
Wang ties the design to a longer-running bet in AI research. “Our approach aligns with Rich Sutton’s ‘The Bitter Lesson,’ the historical observation that general, scalable methods leveraging computation ultimately outperform those that rely heavily on human-encoded domain knowledge,” Wang said. “Our work represents a ‘0 to 1’ step in this direction for security auditing.”
The limits deserve attention. Match rates sit in single digits across most of the runs. Each playbook grew in a single training pass, so luck could account for part of the result. The agents are not entirely comparable, since one runs on different operating software than the rest. And the system that scores the findings is itself an AI model. The authors flag all of these points.
The bug count keeps rising. “Our approach has generated thousands of vulnerability findings across open-source projects, including 28 maintainer-confirmed zero-days, and the number is growing,” Wang said. Six more have been confirmed since the paper was published.
