LLM vulnerability patching skills remain limited
Security teams are wondering whether LLMs can help speed up patching. A new study tests that idea and shows where the tools hold up and where they fall short. The researchers tested LLMs from OpenAI, Meta, DeepSeek, and Mistral to see how well they could fix vulnerable Java functions in a single attempt.

A broad mix of models put through the same trial
The study examined two groups of vulnerabilities. The first consisted of authentic cases: real vulnerabilities in developer-written code, taken from Vul4J, a dataset of Java vulnerabilities that includes the original vulnerable code, the developer’s fix, and Proof-of-Vulnerability (PoV) tests.
The second contained artificial variants: machine-generated rewrites of those vulnerabilities that use different code but still produce the same failure patterns when tested.
By comparing how the LLMs handled the two groups, the researchers could see whether the models generalize their repairs to altered versions of the same vulnerability, or whether performance drops as soon as the surface code changes.
A model saw only the small piece of code that contained the vulnerability, along with an instruction to supply a repair without unrelated edits. Each patch was then run against exploit-driven tests designed to trigger the flaw and show whether the fix stopped the attack.
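To make that setup concrete, here is a minimal sketch of what an exploit-driven check can look like. The flaw (unvalidated path concatenation) and every class and method name are illustrative stand-ins, not cases from Vul4J or the study.

```java
// Minimal sketch of an exploit-driven check, in the spirit of Vul4J's
// proof-of-vulnerability tests. The flaw and all names are illustrative only.
public class PovCheck {

    // The small piece of code a model would be shown, with the flaw intact:
    // user input is joined onto a base directory without any validation.
    static String resolve(String base, String userPath) {
        return base + "/" + userPath;
    }

    public static void main(String[] args) {
        String result = resolve("/var/app/files", "../../etc/passwd");
        // The check fails while the crafted input can still escape the base
        // directory, and passes only once a candidate patch stops the escape.
        if (result.contains("..")) {
            System.out.println("VULNERABLE: crafted input escapes the base directory");
        } else {
            System.out.println("PATCHED: crafted input no longer reaches the target path");
        }
    }
}
```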
The models represented a broad sample of instruction-tuned and general-purpose systems across the four vendors. The researchers avoided tuning strategies and repeated attempts; each model had only one shot at the fix. This mirrors the way many security teams try these tools inside existing workflows.
Authentic cases showed stronger performance
Out of 15 authentic cases, 8 received at least one working patch. These cases were easier for the models to understand because the code followed patterns that show up often in training data. When a vulnerable function looks familiar, a model is more likely to guess the right step to remove the flaw.
Another factor helped. The authentic cases often needed small edits in a tight section of code. The flaw was visible inside that short block, and the fix did not depend on code outside the provided function. Since each model saw only the function itself, this setup matched their strengths. A model could adjust one line or add a simple check, and the exploit would stop.
Three authentic cases were so compact that almost every model repaired them. All three involved a change that stayed inside a few lines of code. When the needed fix sits inside the same snippet the model can see, the model has enough context to supply a correct patch. This explains why the results on authentic cases were stronger and why several models often agreed on the same repair.
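As an illustration of that kind of compact, in-function repair, the sketch below adds a single guard to a hypothetical parsing routine. The vulnerable pattern (an unvalidated, attacker-controlled length) and all names are invented for this example, not drawn from the study's cases.

```java
// Hypothetical example of a few-line, in-function fix. Without the added
// guard, an attacker-supplied length can allocate oversized buffers or make
// the copy throw deep inside parsing; with it, bad input is rejected early.
public class HeaderParser {

    static byte[] readField(byte[] packet, int offset, int length) {
        // The added check: the whole repair stays inside the visible function.
        if (offset < 0 || length < 0 || length > packet.length - offset) {
            throw new IllegalArgumentException("field exceeds packet bounds");
        }
        byte[] field = new byte[length];
        System.arraycopy(packet, offset, field, 0, length);
        return field;
    }

    public static void main(String[] args) {
        byte[] packet = new byte[16];
        System.out.println(readField(packet, 0, 8).length);   // valid request: prints 8
        // readField(packet, 8, 64) is now rejected with a clear error instead
        // of failing unpredictably further down the call chain.
    }
}
```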
Model outcomes on artificial variants
The results looked very different for the 41 artificial variants. Only 10 received a working fix. Some groups saw no success at all. These variants kept the same kind of flaw, but the surrounding code was changed in small ways that mattered. The purpose of these changes was to shift the code structure while keeping the exploit path alive. Even slight changes in layout or naming can influence how a model interprets the problem.
Because the models rely on patterns they have learned, a shift in structure can break those patterns. The model may still spot something that looks like the original flaw, but the fix it proposes may no longer land in the right place. That is why a patch that looks reasonable can still fail the exploit test. The weakness remains reachable because the model addressed only part of the issue or chose the wrong line to modify.
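A hypothetical pair of snippets illustrates the kind of rewrite involved: the same flaw survives, but the identifiers and control flow no longer match the familiar shape. Neither snippet comes from the study's dataset.

```java
// Hypothetical illustration of an artificial variant: the same injection
// flaw, restructured with different names and control flow. Both snippets
// are invented for this example.
public class VariantExample {

    // Original form: user input flows straight into the query string.
    static String buildQuery(String userId) {
        return "SELECT * FROM accounts WHERE id = '" + userId + "'";
    }

    // Variant form: same flaw, same exploit path, different surface structure.
    static String assembleLookup(String ref) {
        StringBuilder query = new StringBuilder("SELECT * FROM accounts WHERE id = '");
        if (ref != null) {
            query.append(ref);   // unvalidated input still reaches the query
        }
        return query.append("'").toString();
    }

    public static void main(String[] args) {
        String crafted = "x' OR '1'='1";
        System.out.println(buildQuery(crafted));      // exploit works on the original
        System.out.println(assembleLookup(crafted));  // and still works on the variant
    }
}
```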
Another pattern surfaced. When a fix for an artificial variant did appear, it often came from only one model. Others failed on the same case. This shows that each artificial variant pushed the systems in different directions, and only one model at a time managed to guess a working repair. The lack of agreement across models signals that these variants exposed gaps in the patterns the systems depend on.
Vendor spread shows strengths and gaps across the field
Performance varied across vendors, but no group dominated the results. DeepSeek and one instruction-tuned Mistral model reached the highest count, with 14 patched cases each out of the 56 vulnerabilities included in the study.
OpenAI and Meta models landed behind that mark but contributed steady fixes in several scenarios. The spread shows that gains do not come from one vendor alone.
The study also checked overlap. Authentic issues showed substantial agreement between models, while artificial issues showed far less. Only two issues across the entire set were patched by one model and not by any other. This suggests that combining several models adds limited coverage.
Next stage of the project
Researchers plan to extend this work in several ways. One direction involves combining output from different LLMs or from repeated runs of the same model, giving the patching process a chance to compare options before settling on one.
Another direction focuses on prompt refinement to reduce errors in suggested fixes. The project also includes an expansion of the dataset to cover a broader range of vulnerabilities across more categories and languages.