Scoring AI hackers when there is no answer key

AI models are solving more and more of the offensive-cyber tests built to measure them. Once a model solves most of a benchmark, that benchmark runs out of room and says little about the best systems anymore. Many of those tests also lean on bugs that already have public writeups, so a strong score can come partly from a model repeating something it has read.

FrontierCyber, a benchmark from the AI security lab Irregular, goes after that gap from another direction. It drops models onto real systems and tracks how far they get toward a security goal.

The targets are everyday things: phones, hosted software services, databases, and live networks. Each one keeps its real defenses, from sandboxing to authentication and network boundaries. Irregular plants no bugs and offers no hints about where to look. The model gets a goal and a place to start, and the rest is on it. The company spent six months building the benchmark and put out the v1.0 design this week.

Example challenge (Source: Irregular)

Predicting difficulty before the run

Here’s the catch. A test with a planted bug comes with a difficulty rating baked in. You know roughly how hard it is, so you know what solving it proves. Open challenges give that up. Nobody knows up front how hard a real, unsolved target will be, so the difficulty has to be guessed before the run and checked against what the models pull off afterward. FrontierCyber does this in two passes.

The first pass happens before a model touches the system. Every challenge gets a difficulty score and a band: Easy, Medium, Hard, or Elite. The score comes from things a security engineer weighs by instinct.

What language is the code in? How much of it can the model see? How many people have already picked the system apart, and how often have bugs surfaced in it before? How many steps does a working attack take, and how strong are the defenses in the way?

Devices get scored from the nearest software stand-in, say a browser for a web surface or an app for app-level code, then adjusted for the surface, the goal, and the device setting.
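A minimal sketch of how such a pre-run estimate could be computed, assuming each question above is rated from 0.0 (easy) to 1.0 (hard) and combined as a weighted sum. The factor names, weights, and band cutoffs here are hypothetical; Irregular has not published a formula.

```python
# Hypothetical factors and weights, one per question a reviewer might ask.
# The weights sum to 1.0 so the score stays in the 0.0-1.0 range.
WEIGHTS = {
    "language": 0.15,          # rating for the implementation language
    "code_opacity": 0.15,      # how little of the code the model can see
    "prior_scrutiny": 0.20,    # how heavily the system has been picked apart
    "bug_rarity": 0.15,        # how seldom bugs have surfaced in it before
    "chain_length": 0.15,      # how many steps a working attack takes
    "defense_strength": 0.20,  # sandboxing, authentication, network boundaries
}

# Hypothetical cutoffs: a score below the bound falls into that band.
BANDS = [(0.25, "Easy"), (0.50, "Medium"), (0.75, "Hard"), (1.01, "Elite")]


def difficulty_score(factors: dict[str, float]) -> float:
    """Return the weighted sum of factor ratings, each in [0.0, 1.0]."""
    missing = WEIGHTS.keys() - factors.keys()
    if missing:
        raise ValueError(f"missing factors: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= factors[name] <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    return sum(WEIGHTS[name] * factors[name] for name in WEIGHTS)


def band(score: float) -> str:
    """Map a score to the first band whose upper bound exceeds it."""
    for upper, label in BANDS:
        if score < upper:
            return label
    raise ValueError(f"score out of range: {score}")
```

The estimate is only a starting guess. As the article notes, it still has to be checked against what models actually achieve.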

Crediting partial progress

Plenty of runs end in the middle: a model finds a real weakness and gets partway in without reaching the goal. The benchmark is built to log that progress. To catch it, Irregular wires each system with small tripwires: canary strings, planted files, oddly named apps, and database markers.

One example shows how clever this gets. A browser on a phone has no business knowing which other apps are installed. So Irregular puts an app with a one-of-a-kind name on the device. The moment that name shows up in a model’s reasoning, that is strong evidence the model slipped past the browser’s limits and saw something off-limits. That points to a fresh bug, even when the model never finishes the job.
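The tripwire check itself can be very simple. Here is a rough sketch, assuming the grader has the run transcript as plain text and a list of planted markers. The marker strings and the `Canary` type are made up for illustration and are not Irregular's.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Canary:
    marker: str    # unique string planted on the target system
    boundary: str  # what seeing the marker implies the model got past


# Hypothetical markers. Each must be unique enough that it cannot appear
# by chance, and it must never be shown to the model in the task prompt.
CANARIES = [
    Canary("zq-velvet-otter-7731", "browser sandbox (installed-app list)"),
    Canary("CANARY-db-4f9e2c", "database access controls"),
]


def tripped(transcript: str, canaries: list[Canary] = CANARIES) -> list[Canary]:
    """Return every canary whose marker appears verbatim in the transcript."""
    return [c for c in canaries if c.marker in transcript]


if __name__ == "__main__":
    log = "Listing packages... found app 'zq-velvet-otter-7731' on the device."
    for canary in tripped(log):
        print(f"evidence of crossing: {canary.boundary}")
```

A hit is evidence rather than proof, which is why the article describes human experts and a scoring agent reviewing the run as well.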

Reading capability across the suite

When a run ends, graders look at what the model did and at the evidence the system gave up. A complete win is simple to confirm: the model recovers a hidden flag or forces the system into a target state. Partial wins earn their own credit, for finding a usable way in, reaching a helpful midpoint, spotting a genuine bug, or building a piece of an exploit.
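One way to picture the partial-credit scheme is as a table of milestones, each worth a fraction of a full solve. The sketch below assumes a run earns the credit of its single best milestone; the credit values and that "take the maximum" rule are illustrative assumptions, not the published scoring rules.

```python
from enum import Enum


class Milestone(Enum):
    ENTRY_POINT = "found a usable way in"
    MIDPOINT = "reached a helpful midpoint"
    GENUINE_BUG = "spotted a genuine bug"
    EXPLOIT_PIECE = "built a piece of an exploit"
    FULL_SOLVE = "recovered the flag or forced the target state"


# Hypothetical credit per milestone, as a fraction of a complete win.
CREDIT = {
    Milestone.ENTRY_POINT: 0.2,
    Milestone.MIDPOINT: 0.3,
    Milestone.GENUINE_BUG: 0.5,
    Milestone.EXPLOIT_PIECE: 0.7,
    Milestone.FULL_SOLVE: 1.0,
}


def run_score(evidence: set[Milestone]) -> float:
    """Score a run by its best confirmed milestone; no evidence scores 0.0."""
    return max((CREDIT[m] for m in evidence), default=0.0)
```

Taking the maximum instead of a sum keeps a pile of small findings from outscoring a complete win, though a real rubric might weigh things differently.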

Automated checks handle the mechanical part, human experts handle judgment, and a scoring agent reads the transcripts against standards pinned to expert-graded examples. One challenge settles nothing on its own. A difficulty guess can be off, and a single run can hinge on one lucky path, so capability gets read across the whole set.
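Reading capability across the set could look something like the sketch below, which averages repeated runs per challenge and then averages challenges within each difficulty band. The record layout and the choice of a plain mean are assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean


def capability_profile(runs: list[tuple[str, str, float]]) -> dict[str, float]:
    """Summarize runs given as (challenge_id, band, score) into a per-band mean.

    Repeated runs of one challenge are averaged first, so a challenge that
    was run many times does not outweigh one that was run once.
    """
    scores_by_challenge: dict[str, list[float]] = defaultdict(list)
    band_of: dict[str, str] = {}
    for challenge_id, band, score in runs:
        scores_by_challenge[challenge_id].append(score)
        band_of[challenge_id] = band

    means_by_band: dict[str, list[float]] = defaultdict(list)
    for challenge_id, scores in scores_by_challenge.items():
        means_by_band[band_of[challenge_id]].append(mean(scores))

    return {band: mean(values) for band, values in means_by_band.items()}
```

Averaging over repeated runs dampens the effect of one lucky path, and grouping by band keeps a wrong difficulty guess on one challenge from dominating the picture.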

Keeping comparisons valid over time

Real systems refuse to sit still. Updates land, settings drift, defenses get tougher, and a bug that was secret one week goes public the next, which can turn a discovery challenge into a known-bug exercise overnight.

To keep scores honest, every evaluation gets pinned to a snapshot: the exact challenges, system versions, goals, setups, checks, scoring rules, and a timestamp. Scores are comparable only within the same snapshot. A model tested in June and a model tested in September can land far apart for one plain reason: the target got easier in between.
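A snapshot of this kind can be modeled as an immutable record with a fingerprint, so two results are compared only when their fingerprints match. The sketch below covers a subset of the listed fields; the field names, challenge IDs, and version strings are hypothetical.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Snapshot:
    challenges: tuple[str, ...]                    # exact challenge IDs
    system_versions: tuple[tuple[str, str], ...]   # (system, version) pairs
    scoring_rules: str                             # identifier of the rule set
    taken_at: str                                  # ISO 8601 timestamp

    def fingerprint(self) -> str:
        """Hash every field so any change yields a different snapshot ID."""
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def comparable(a: Snapshot, b: Snapshot) -> bool:
    """Two scores line up only if they come from the identical snapshot."""
    return a.fingerprint() == b.fingerprint()


if __name__ == "__main__":
    june = Snapshot(
        challenges=("db-001", "phone-004"),
        system_versions=(("PostgreSQL", "x.y"), ("Redis", "x.y")),
        scoring_rules="rules-v1",
        taken_at="2025-06-01T00:00:00Z",
    )
    september = Snapshot(
        challenges=june.challenges,
        system_versions=june.system_versions,
        scoring_rules=june.scoring_rules,
        taken_at="2025-09-01T00:00:00Z",
    )
    print(comparable(june, june), comparable(june, september))
```

Including the timestamp in the fingerprint means a later rerun counts as a new snapshot even when the versions are unchanged, since a bug may have gone public in the meantime.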

Early results

The first runs against a fixed snapshot already produced signal. Models solved some challenges outright, made real headway on others, and turned up brand-new bugs in several live systems now going through responsible disclosure. In one phone challenge, a model stitched together a chain of separate vulnerabilities and reached private information it had no right to.

Across separate model families, each newer generation made a measurable jump in capability: some models built complete exploit chains and hit the goal, and others got as far as pinning down a usable bug. The software lineup includes Pillow, lxml, FFmpeg, ImageMagick, PostgreSQL, MongoDB, and Redis, alongside pinned vulnerable versions that test a different skill: turning a known bug into a working exploit. A detailed report on the challenges, scoring, results, and disclosures is on the way.