Open-source benchmark EVMbench tests how well AI agents handle smart contract exploits
Smart contract exploits continue to drain funds from blockchain projects, even as auditing tools and bug bounty programs grow. The problem is tied to how Ethereum Virtual Machine (EVM) contracts work: code is deployed permanently, runs autonomously, and often controls large pools of assets. That environment has created demand for better ways to measure whether AI systems can reliably detect, patch, and exploit vulnerabilities in contract code.

EVMbench is a new open-source benchmark, developed by OpenAI and Paradigm, that tests AI agents on practical smart contract security tasks. It focuses on real-world vulnerability patterns drawn from audited codebases and contest reports.
EVMbench centers on three categories of tasks: detecting vulnerabilities, patching vulnerable code, and exploiting flaws in a controlled environment. The benchmark is intended to provide repeatable evaluation for AI models that claim to support contract auditing or automated security analysis.
What EVMbench measures
EVMbench evaluates agent performance across three task modes: detect, patch, and exploit.
In detect mode, the model reviews smart contract repositories and attempts to identify vulnerabilities previously documented by professional auditors. Scoring is based on recall: the share of known, ground-truth vulnerabilities from the reference audits that the agent successfully flags.
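Recall scoring of this kind reduces to a set comparison between the agent's reported findings and the audit's ground truth. The sketch below illustrates the idea; the finding identifiers and report format are assumptions, not EVMbench's actual schema.

```python
# Hypothetical sketch of recall scoring for detect mode; the finding
# identifiers and report format are assumptions, not EVMbench's schema.

def detect_recall(ground_truth: set[str], agent_findings: set[str]) -> float:
    """Fraction of known audit findings the agent rediscovered."""
    if not ground_truth:
        return 0.0
    return len(ground_truth & agent_findings) / len(ground_truth)

# Example: three documented vulnerabilities, agent rediscovers two.
truth = {"H-01-reentrancy", "H-02-oracle-manipulation", "M-01-rounding"}
found = {"H-01-reentrancy", "M-01-rounding", "extra-false-positive"}
print(detect_recall(truth, found))  # ~0.667
```

Note that recall alone does not penalize false positives, which fits a coverage-oriented auditing task: the question is how many of the known issues the agent finds.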
In patch mode, the model attempts to modify contract code so the vulnerability is removed without breaking expected functionality. Grading checks both that the exploit is eliminated and that original tests and behaviors still pass.
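A grader of that shape can be approximated as two independent checks: replay the known exploit against the patched code and expect it to fail, then run the project's original test suite and expect it to pass. The sketch below uses Foundry's forge as the runner; the test name and layout are hypothetical, not EVMbench's actual harness.

```python
# Hypothetical patch-grading sketch: the known exploit must now fail
# while the project's original tests still pass. "testExploit" and the
# use of forge as the test runner are illustrative assumptions.
import subprocess

def run_ok(cmd: list[str], cwd: str) -> bool:
    """Return True if the command exits with status 0."""
    return subprocess.run(cmd, cwd=cwd).returncode == 0

def grade_patch(repo_dir: str) -> bool:
    exploit_works = run_ok(["forge", "test", "--match-test", "testExploit"], repo_dir)
    tests_pass = run_ok(["forge", "test", "--no-match-test", "testExploit"], repo_dir)
    return (not exploit_works) and tests_pass
```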
In exploit mode, the benchmark gives the model a sandboxed blockchain environment and asks it to execute an exploit against a vulnerable contract. Successful exploitation is measured by verifying on-chain state changes such as drained balances, triggered failure conditions, or other deterministic outcomes.
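Verifying a drained balance, for instance, comes down to comparing on-chain state before and after the agent acts. A minimal check of that kind might look like the following web3.py snippet; the target address, RPC endpoint, and zero-balance success condition are placeholders, not EVMbench's actual criteria.

```python
# Hypothetical success check for an exploit task: did the target
# contract's ETH balance drop to zero after the agent's transactions?
# Address, endpoint, and success condition are placeholders.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://127.0.0.1:8545"))  # sandboxed local node

TARGET = Web3.to_checksum_address("0x000000000000000000000000000000000000dEaD")

def exploit_succeeded(balance_before: int) -> bool:
    balance_after = w3.eth.get_balance(TARGET)
    return balance_before > 0 and balance_after == 0
```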
The benchmark uses containerized environments and automated scoring so results can be reproduced across different machines and test runs.
Dataset comes from real audits and contests
The EVMbench dataset is built from 120 curated vulnerabilities across 40 audits, with most cases drawn from open audit competitions and additional scenarios sourced from Paradigm’s Tempo audit process.
Each benchmark case includes the vulnerable contract code and supporting infrastructure needed to recreate the scenario. Exploit tasks place the agent in a controlled local EVM environment and evaluate whether it can execute a working exploit by producing the expected on-chain state changes.
The benchmark is designed so tasks reflect realistic development conditions. This increases complexity compared to synthetic vulnerability datasets, since agents need to reason about contract interactions and state changes.
How exploit grading works
Exploit evaluation is handled through deterministic replay in a controlled test environment. Agents interact with a local EVM instance, where they can deploy contracts, call functions, and attempt to execute fund-draining transactions.
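On a local development node, that interaction is ordinary JSON-RPC traffic. A minimal agent-side call might look like this web3.py sketch; the contract address, ABI, and the idea that withdraw is the vulnerable entry point are all invented for illustration.

```python
# Illustrative agent-side interaction with a local EVM node.
# The contract address, ABI, and function name are invented examples.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://127.0.0.1:8545"))
attacker = w3.eth.accounts[0]  # unlocked dev account on a local node

vault = w3.eth.contract(
    address=Web3.to_checksum_address("0x0000000000000000000000000000000000001234"),
    abi=[{"type": "function", "name": "withdraw", "inputs": [],
          "outputs": [], "stateMutability": "nonpayable"}],
)

# Send the suspected fund-draining call and wait for it to be mined.
tx_hash = vault.functions.withdraw().transact({"from": attacker})
receipt = w3.eth.wait_for_transaction_receipt(tx_hash)
print(receipt.status)  # 1 if the transaction succeeded
```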
The benchmark’s grading harness verifies whether the exploit succeeded based on contract balances and state transitions. That allows exploit attempts to be evaluated automatically without relying on subjective review.
The harness supports repeatable execution: an exploit can be rerun exactly to confirm its outcome, which makes results comparable across models and over time.
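One common mechanism for exact reruns on local development nodes such as Anvil or Hardhat is the evm_snapshot/evm_revert RPC pair, sketched below; whether EVMbench's harness uses this exact mechanism is an assumption.

```python
# Sketch of deterministic replay via the evm_snapshot / evm_revert
# RPC methods exposed by local dev nodes (e.g., Anvil, Hardhat).
# Whether EVMbench works this way internally is an assumption.
from web3 import Web3

w3 = Web3(Web3.HTTPProvider("http://127.0.0.1:8545"))

# Save the chain state before the agent runs.
snapshot_id = w3.provider.make_request("evm_snapshot", [])["result"]

# ... agent executes its exploit transactions here ...

# Roll back to the saved state so the identical run can be repeated.
w3.provider.make_request("evm_revert", [snapshot_id])
```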
Benchmark results show major gaps between models
The benchmark results show uneven performance across detect, patch, and exploit tasks. OpenAI reported that exploit tasks remain difficult for many systems, even when models can identify vulnerabilities at a surface level.
OpenAI’s blog post notes, “Smart contracts routinely secure $100B+ in open-source crypto assets,” linking the benchmark to the scale of funds exposed to contract bugs.
Paradigm highlighted how quickly exploit performance improved across recent model generations. “When we started working on this project, top models were only able to exploit less than 20% of the critical, fund-draining Code4rena bugs. Today, GPT-5.3-Codex exploits over 70%,” Alpin Yukseloglu, Partner at Paradigm, said.
The benchmark also shows patching remains a major weakness. Fixing contract vulnerabilities requires preserving correct behavior across edge cases, which often involves understanding deeper design assumptions in the code.
Availability and future use
EVMbench is available for free on GitHub, including benchmark tasks, harness tooling, and documentation. The goal is to allow researchers and security teams to test models consistently as AI agent capabilities evolve.


