AutoPatchBench: Meta’s new way to test AI bug fixing tools

AutoPatchBench is a new benchmark that tests how well AI tools can fix code bugs. It focuses on C and C++ vulnerabilities found through fuzzing. The benchmark includes 136 real bugs and their verified fixes, taken from the ARVO dataset.

Patch generation flowchart

CyberSecEval 4

AutoPatchBench is part of Meta’s CyberSecEval 4 benchmark suite. It is designed to objectively evaluate and compare LLM-based auto-patching agents on vulnerabilities identified via fuzzing, a widely used method of automated security testing.

By using the same tests across tools, AutoPatchBench makes it easier to compare results. This helps researchers spot what works, what doesn’t, and how to improve.

What sets AutoPatchBench apart is its verification methodology. “It goes beyond simply checking if patches build and stop crashes,” TJ Byun, Research Scientist at Meta, told Help Net Security. “The benchmark incorporates additional verification through fuzzing and white-box differential testing to check the correctness of AI-generated patches.” In practice, a patch must do more than prevent the crash: it must also preserve the code’s intended functionality, which is verified by comparing the program state after the patched function returns against that of a trusted implementation across a broad set of fuzzing-derived inputs.
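To make the differential-testing idea concrete, here is a minimal Python sketch of the concept: replay fuzzing-derived inputs against a trusted implementation and a candidate patch, and flag any divergence in observable behavior after the call returns. The function names and corpus are illustrative assumptions, not AutoPatchBench’s actual harness.

```python
# Minimal sketch of white-box differential testing, assuming we can call both
# a trusted ("reference") fix and an AI-generated candidate patch on the same
# inputs. All names below are hypothetical illustrations.

def reference_fix(data: bytes) -> tuple[int, bytes]:
    """Trusted implementation: returns (status code, normalized output)."""
    if not data:
        return 1, b""          # reject empty input instead of crashing
    return 0, data[:16]

def candidate_patch(data: bytes) -> tuple[int, bytes]:
    """AI-generated patch under test."""
    if len(data) == 0:
        return 1, b""
    return 0, data[:16]

def differential_test(corpus: list[bytes]) -> bool:
    """Replay fuzz-derived inputs and compare observable state after return."""
    for sample in corpus:
        if reference_fix(sample) != candidate_patch(sample):
            print(f"behavioral divergence on input {sample!r}")
            return False
    return True

if __name__ == "__main__":
    fuzz_corpus = [b"", b"A" * 8, b"\x00\xff" * 32]  # stand-in for fuzzing-derived inputs
    print("patch preserves behavior:", differential_test(fuzz_corpus))
```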

AutoPatchBench-Lite

To support earlier-stage tools, the team also developed AutoPatchBench-Lite, a simplified subset of 113 vulnerabilities confined to single-function root causes.

This version maintains the rigor of the full benchmark, including dual-container setups for consistent reproduction and validation, while lowering the barrier for new tools to be evaluated. “We believe that our targeted approach in creating this evaluation framework enables more precise evaluation of AI capabilities,” said Byun, “thereby driving advancements in AI-assisted vulnerability patching with greater focus and precision.”

With its combination of realism, automation, and thorough validation, AutoPatchBench aims to accelerate progress in the field by helping developers better understand and trust AI-generated security patches.
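The dual-container arrangement mentioned above can be pictured as two builds of the same target: the original vulnerable build, where the crash must reproduce, and the patched build, where it must not. The sketch below illustrates that check with plain docker run calls; the image names, fuzz-target path, and mount layout are assumptions for illustration, not the benchmark’s actual setup.

```python
# Rough sketch of reproduce-then-validate across two containers.
# Image names, /fuzz_target, and the proof-of-crash path are hypothetical.
import os
import subprocess

def reproduces_crash(image: str, poc_path: str) -> bool:
    """Run the fuzz target on the proof-of-crash input inside a container."""
    poc = os.path.abspath(poc_path)
    result = subprocess.run(
        ["docker", "run", "--rm", "-v", f"{poc}:/poc:ro", image, "/fuzz_target", "/poc"],
        capture_output=True,
    )
    return result.returncode != 0  # sanitizer-detected crashes exit non-zero

def validate_patch(vulnerable_image: str, patched_image: str, poc_path: str) -> bool:
    """Accept a candidate patch only if the crash reproduces on the vulnerable
    build and no longer reproduces on the patched build."""
    return (reproduces_crash(vulnerable_image, poc_path)
            and not reproduces_crash(patched_image, poc_path))

if __name__ == "__main__":
    print(validate_patch("arvo-1234:vulnerable", "arvo-1234:patched", "crash-input.bin"))
```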

AutoPatchBench and open source

To foster collaboration and accelerate progress in AI-assisted vulnerability remediation, the team made AutoPatchBench fully open source. “We open-sourced AutoPatchBench to encourage industry input into advancing the accuracy and reliability of AI patch generation for the development of more robust and effective automated tools,” Byun explained.

In addition to the benchmark itself, the researchers developed and released a basic AI patch generator designed to serve as a performance baseline. Tailored for simpler cases, specifically crashes that can be addressed by modifying a single function, the reference implementation offers a starting point for others to build upon. “We have also open-sourced this reference implementation to encourage the community to build and expand upon it,” Byun added.
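As a rough mental model of such a single-function patcher (not Meta’s reference implementation, whose details live in the repository), the loop below isolates the crashing function from a sanitizer-style stack trace, asks a model for a fix, and would then hand the result to the build-and-validate steps described above. The regex, prompt, and dummy model are illustrative assumptions.

```python
# Hypothetical single-function patching loop; the model call is a placeholder
# for whatever LLM client you use.
import re
from typing import Callable

def extract_crashing_function(stack_trace: str) -> str:
    """Pull the function name out of the top frame of a sanitizer stack trace."""
    match = re.search(r"#0 .* in (\w+)", stack_trace)
    if not match:
        raise ValueError("could not locate crashing frame")
    return match.group(1)

def propose_patch(llm: Callable[[str], str], function_source: str, stack_trace: str) -> str:
    """Prompt a model with the crashing function and its stack trace."""
    prompt = (
        "The following C function crashes under the given stack trace.\n"
        f"Function:\n{function_source}\n\nStack trace:\n{stack_trace}\n"
        "Return a fixed version of the function only."
    )
    return llm(prompt)

def dummy_llm(prompt: str) -> str:
    """Stand-in model that returns a canned fix for the example below."""
    return "int parse_header(const char *buf) { return buf ? buf[0] : -1; }"

if __name__ == "__main__":
    trace = "#0 0x4f2a in parse_header /src/parser.c:42"
    source = "int parse_header(const char *buf) { return buf[0]; }"
    print("crashing function:", extract_crashing_function(trace))
    print("candidate patch:\n", propose_patch(dummy_llm, source, trace))
    # A real loop would apply the patch, rebuild the target, and re-run the
    # crash reproducer plus differential tests before accepting it.
```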

Future developments and download

By making both the benchmark and baseline patcher publicly available, the team hopes to create a shared foundation for future research and development. “Developers of auto-patch tools can leverage our open-sourced patch generator to enhance their tools and assess their effectiveness using the benchmark,” said Byun.

The tooling’s utility also extends beyond benchmarking. Software projects that use fuzzing can adopt the patch generator to accelerate vulnerability remediation, and the supporting tooling can be used in reinforcement learning pipelines to shape reward signals during training. “This data helps train models to better understand the nuances of bug repair,” Byun explained, “enabling them to learn from past fixes and improve their ability to generate accurate patches.”
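One hedged way to picture the reward-shaping idea is to fold the validation signals (does the patch build, does the crash stop, how often does behavior match the trusted implementation) into a single scalar reward. The weights and field names below are illustrative assumptions, not part of the released tooling.

```python
# Hypothetical reward shaping for RL on patch generation, driven by the kinds
# of validation signals the benchmark tooling produces.
from dataclasses import dataclass

@dataclass
class PatchOutcome:
    builds: bool               # patched target compiles
    crash_fixed: bool          # original crash no longer reproduces
    differential_pass: float   # fraction of fuzz inputs with matching behavior

def patch_reward(outcome: PatchOutcome) -> float:
    """Partial credit for building and stopping the crash; full credit only
    when behavior matches the trusted implementation."""
    if not outcome.builds:
        return -1.0
    reward = 0.2                       # built successfully
    if outcome.crash_fixed:
        reward += 0.3
    reward += 0.5 * outcome.differential_pass
    return reward

if __name__ == "__main__":
    print(patch_reward(PatchOutcome(builds=True, crash_fixed=True, differential_pass=0.9)))
```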

AutoPatchBench is available for free on GitHub.
