LLMs can assist with vulnerability scoring, but context still matters
Every new vulnerability disclosure adds another decision point for already stretched security teams. A recent study explores whether LLMs can take on part of that burden by scoring vulnerabilities at scale. While the results show promise in specific areas, consistent weaknesses continue to hold back fully automated scoring.
Growing workloads push teams hard
More than 40,000 CVEs were published in 2024, and the study notes that this surge has put strain on programs that score these entries. Without timely severity ratings, teams cannot tell which risks to handle first.
The researchers tested six LLMs: GPT 4o, GPT 5, Llama 3.3, Gemini 2.5 Flash, DeepSeek R1, and Grok 3. Each model scored more than 31,000 CVEs using only the short descriptions written by the CVE program.
The models had to infer the eight base metrics that shape the final CVSS score. They did not receive product names, software versions, vendor details, or CVE IDs, because those fields can reveal the answer through lookup. That constraint forced the models to reason from the description itself rather than match it to known data.
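The study's exact prompt is not reproduced here, but the setup can be sketched: feed the model nothing beyond the free-text description, ask for the eight base metrics as structured output, and validate the reply against the CVSS v3.1 vocabulary. The prompt wording and function names below are illustrative, not taken from the paper.

```python
import json

# Allowed CVSS v3.1 base metric values, using the abbreviations from vector strings.
ALLOWED = {
    "AV": {"N", "A", "L", "P"},   # Attack Vector
    "AC": {"L", "H"},             # Attack Complexity
    "PR": {"N", "L", "H"},        # Privileges Required
    "UI": {"N", "R"},             # User Interaction
    "S":  {"U", "C"},             # Scope
    "C":  {"N", "L", "H"},        # Confidentiality Impact
    "I":  {"N", "L", "H"},        # Integrity Impact
    "A":  {"N", "L", "H"},        # Availability Impact
}

def build_prompt(description: str) -> str:
    """Prompt containing only the CVE description -- no IDs, vendors, or versions."""
    return (
        "Assign CVSS v3.1 base metrics to the vulnerability described below.\n"
        "Respond with JSON using the keys AV, AC, PR, UI, S, C, I, A.\n\n"
        f"Description: {description}"
    )

def parse_metrics(raw_reply: str) -> dict:
    """Parse the model's JSON reply and reject values outside the CVSS vocabulary."""
    metrics = json.loads(raw_reply)
    for key, allowed in ALLOWED.items():
        if metrics.get(key) not in allowed:
            raise ValueError(f"invalid or missing value for {key}: {metrics.get(key)!r}")
    return metrics

# Example with a stubbed model reply rather than a live API call.
reply = '{"AV": "N", "AC": "L", "PR": "N", "UI": "R", "S": "U", "C": "H", "I": "H", "A": "N"}'
print(parse_metrics(reply))
```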
Better results when text signals are explicit
Two metrics stood out. The first is Attack Vector, which describes how an attacker reaches the vulnerable system: over the network, from an adjacent network, with local access, or through physical access. Gemini reached about 89% accuracy on this metric, with GPT 5 close behind and the other models also producing strong results. Descriptions often state whether exploitation happens over a network, and the systems picked up these signals.
The second is User Interaction, which reflects whether the exploit requires someone to click, open a file, or perform a similar step. GPT 5 reached about 89% accuracy here as well, while Gemini, Grok, and GPT 4o trailed by small margins. Many descriptions refer directly to user actions, which makes this metric easier to classify.
Confidentiality Impact and Integrity Impact also fared reasonably well. These metrics describe what happens to data once an exploit succeeds: Confidentiality Impact reflects exposure of sensitive information, while Integrity Impact measures unwanted modification of data or system state. GPT 5 scored in the mid to high 70% range on both, with Gemini and Grok following at moderate levels. When descriptions include signs of data exposure or tampering, the systems often detect those details.
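Once classified, all eight metric labels, including the ones covered in the next section, feed a fixed formula. The sketch below renders the published CVSS v3.1 base score equations in Python purely to show that arithmetic; it follows the FIRST specification rather than any code from the study, and uses a simplified rounding step.

```python
import math

# Numeric weights from the CVSS v3.1 specification.
AV = {"N": 0.85, "A": 0.62, "L": 0.55, "P": 0.20}
AC = {"L": 0.77, "H": 0.44}
UI = {"N": 0.85, "R": 0.62}
CIA = {"N": 0.0, "L": 0.22, "H": 0.56}
PR = {  # Privileges Required weights depend on Scope
    "U": {"N": 0.85, "L": 0.62, "H": 0.27},
    "C": {"N": 0.85, "L": 0.68, "H": 0.50},
}

def roundup(x: float) -> float:
    """Simplified CVSS 'Roundup': smallest one-decimal value >= x."""
    return math.ceil(x * 10) / 10

def base_score(m: dict) -> float:
    iss = 1 - (1 - CIA[m["C"]]) * (1 - CIA[m["I"]]) * (1 - CIA[m["A"]])
    if m["S"] == "U":
        impact = 6.42 * iss
    else:
        impact = 7.52 * (iss - 0.029) - 3.25 * (iss - 0.02) ** 15
    exploitability = 8.22 * AV[m["AV"]] * AC[m["AC"]] * PR[m["S"]][m["PR"]] * UI[m["UI"]]
    if impact <= 0:
        return 0.0
    if m["S"] == "U":
        return roundup(min(impact + exploitability, 10))
    return roundup(min(1.08 * (impact + exploitability), 10))

# Example: network-reachable, low-complexity flaw that needs a user click and both
# exposes and alters data (vector AV:N/AC:L/PR:N/UI:R/S:U/C:H/I:H/A:N).
print(base_score({"AV": "N", "AC": "L", "PR": "N", "UI": "R",
                  "S": "U", "C": "H", "I": "H", "A": "N"}))  # 8.1
```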
Weak performance where descriptions lack detail
Availability Impact produced the weakest overall results. This metric reflects service disruption after exploitation. GPT 5 led at about 68%, while others trailed by wider margins. Short descriptions that mention only that a crash is possible do not provide enough information to separate minor disruptions from significant outages.
Privileges Required also proved difficult. This metric shows whether an attacker needs an account, and at what privilege level. All systems confused the None and Low levels because descriptions rarely state what access is required.
Attack Complexity showed another trend, shaped by the dataset itself. This metric captures conditions that must be met for the exploit to succeed. Most entries in the dataset were labeled Low, which pushed predictions toward the majority class. GPT 5 reached about 85% accuracy, but the improvement over that majority-class baseline was narrow.
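One way to read that gap is against the trivial strategy of always predicting the dataset's most common label. A minimal sketch, with a made-up label split for illustration since the paper's exact class distribution is not quoted here:

```python
from collections import Counter

def majority_baseline_accuracy(true_labels: list[str]) -> float:
    """Accuracy of always predicting the most frequent label in the data."""
    counts = Counter(true_labels)
    return counts.most_common(1)[0][1] / len(true_labels)

# Hypothetical Attack Complexity labels skewed toward Low, echoing the study's dataset.
labels = ["L"] * 82 + ["H"] * 18
print(majority_baseline_accuracy(labels))  # 0.82 -- a model at ~85% clears this only narrowly
```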
Error analysis revealed that the systems often stumbled in the same places. For Availability Impact, all six models misclassified the same 29% of CVEs; for Attack Complexity, the figure was 18%. The overlap extended further in other areas, where four of the six models agreed on the wrong answer in 36% of cases. These shared mistakes show that the same entries triggered consistent errors across models.
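Overlap figures like these come from intersecting each model's set of misclassified entries per metric. A small sketch of that bookkeeping, with hypothetical model names and error sets:

```python
from collections import Counter

def shared_error_rate(errors: dict[str, set[int]], total: int, min_models: int) -> float:
    """Fraction of CVEs misclassified by at least `min_models` models on one metric."""
    miss_counts = Counter(cve for wrong in errors.values() for cve in wrong)
    shared = sum(1 for count in miss_counts.values() if count >= min_models)
    return shared / total

# Hypothetical example: three models, ten CVEs; CVEs 1 and 2 fool all three.
errors = {"model_a": {0, 1, 2, 5}, "model_b": {1, 2, 7}, "model_c": {1, 2, 5, 9}}
print(shared_error_rate(errors, total=10, min_models=3))  # 0.2
```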
Small gains from meta classifiers
Because each model performed well in different areas, the researchers built meta classifiers that combined predictions from all six. This brought small improvements across all metrics.
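The combiner architecture is not described here, so the following is only one plausible baseline rather than the study's method: a per-metric plurality vote over the six models' predictions.

```python
from collections import Counter

def plurality_vote(predictions: list[str]) -> str:
    """Combine per-model predictions for one metric; ties resolve to whichever
    tied label appears first in the list."""
    return Counter(predictions).most_common(1)[0][0]

# Six hypothetical Attack Vector predictions for a single CVE.
per_model = ["N", "N", "A", "N", "L", "N"]
print(plurality_vote(per_model))  # "N"
```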
Scope, which captures whether a successful exploit reaches beyond the vulnerable component, saw the biggest gain, at a little more than three percentage points. Attack Vector saw a small increase as well, and other metrics rose by narrow margins. The gains show that combining models helps but cannot compensate for missing context in the source descriptions.