Companies keep bolting AI onto their products, and the security bill is coming due
Companies keep bolting AI and LLM features onto their products, and the security results are starting to show a pattern. The vulnerabilities those features create are rated high risk far more often than findings in any other kind of system, and they are fixed more slowly than anything else. The figures come from Cobalt’s AI and Pentesting Pulse Report 2026, built on five years of penetration testing data and a survey of 455 security leaders and practitioners.

A risk rate that holds at 2.7 times the average
AI applications stack new weaknesses on top of old ones. They keep every flaw of conventional software and add a fresh set. A web app with an LLM wired into it can still be hit by SQL injection, cross-site scripting, and broken authentication. It can now also be hit by prompt injection, insecure output handling, and model-level denial of service.
Across Cobalt’s dataset, the high-risk rate for AI and LLM pentests runs at 2.7 times the rate for every other kind of system. That gap has held for two straight years. About one in three AI findings earns a high-risk label. For other systems, the share sits near one in eight.
Two of every three serious findings stay open
Finding the problems turns out to be the easy part. Fixing them is where AI trails everything else. AI and LLM pentests carry the lowest resolution rate Cobalt tracks, landing at 38.4% in 2026. Two of every three serious findings stay open and exploitable.
The rate nearly doubled over the year, the biggest jump of any asset class. That is progress, but AI remains in last place. Its resolution rate still trails the next-lowest category by a double-digit margin and sits far below API and web testing, where most serious findings get resolved.
Three things hold the rate down. Too few staff understand both security and AI systems. The fix often runs through a model vendor when the flaw lives in the model. Most AI projects are new, with security processes that have yet to mature. The median time to close one of these findings nearly doubled as well, a sign that teams are taking on harder cases that need more digging.
Shadow AI leads the incident list
The most common cause of AI security incidents traces back to the company’s own staff. Employees reach for AI tools nobody approved, and sensitive data goes with them. Shadow AI sits behind 44% of confirmed incidents, ahead of data poisoning, output handling failures, supply chain issues, and prompt injection. Only about one in five organizations confirmed an AI-related incident, and a large group could not say whether they had experienced one.
Joe Brinkley, Director of Offensive Security Research and Community at Cobalt, said the tools built to track company assets miss this kind of activity completely.
“Traditional asset inventory is just completely ineffective against shadow AI because it’s designed to locate corporate infrastructure, such as managed servers and assigned IP addresses,” Brinkley told Help Net Security. “But shadow AI operates almost entirely at the application layer, so it completely bypasses those boundaries. Usually, it enters the environment when a developer inputs proprietary data into a browser extension or when a script communicates with a third-party LLM API over standard, encrypted HTTPS traffic. To a traditional network scanner, all of this just looks like normal web browsing.”
The answer, Brinkley said, is to watch the data, the traffic, and the endpoints.
“So, organizations with mature programs have shifted their discovery focus away from infrastructure and more toward data behavior and telemetry,” he said. “They are analyzing Layer 7 traffic to identify unauthorized API headers communicating with AI endpoints, monitoring endpoint processes to flag unvetted browser plugins, and checking DNS logs for outbound connections to newly stood-up AI infrastructure.”
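One of the checks Brinkley describes, reviewing DNS logs for outbound connections to AI services, can be sketched in a few lines of Python. This is an illustration only, not Cobalt's method. The log format (timestamp, client IP, queried name, separated by whitespace), the short domain watchlist, and the function names are all assumptions made for the example.

```python
from collections import Counter

# Illustrative watchlist (an assumption for this sketch). A real program
# would maintain a much longer, regularly updated list.
AI_API_DOMAINS = {
    "api.openai.com",
    "api.anthropic.com",
    "generativelanguage.googleapis.com",
}


def normalize(qname):
    """Lowercase a DNS name and drop the trailing root dot."""
    return qname.rstrip(".").lower()


def matches(qname, domains):
    """True if qname equals, or is a subdomain of, any listed domain."""
    name = normalize(qname)
    return any(name == d or name.endswith("." + d) for d in domains)


def find_shadow_ai_lookups(log_lines, approved=frozenset()):
    """Count lookups of watched AI domains that are not on the approved list.

    Each log line is assumed to look like: "<timestamp> <client_ip> <qname>".
    Returns a Counter keyed by (client_ip, qname).
    """
    hits = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip malformed lines
        client, qname = parts[1], parts[2]
        if matches(qname, AI_API_DOMAINS) and not matches(qname, approved):
            hits[(client, normalize(qname))] += 1
    return hits
```

A watchlist like this only catches services someone already knows about. Spotting newly stood-up AI infrastructure, as Brinkley mentions, would need additional signals that this sketch does not attempt.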
Companies are stepping back from hands-off automation
Enthusiasm for handing every test to AI has cooled fast. A year ago, close to a third of teams were happy to let automated tools cover all their testing. That share has dropped to 9%.
The reason is performance. Most teams, 78%, report having watched automated scanners miss critical vulnerabilities. The setup teams prefer now divides the work: automation handles routine coverage on lower-stakes systems, and human experts take the systems that matter most. Close to half want it run that way.
Programmatic LLM testing slipped over the year, and reactive testing grew by about the same amount. Many companies are new to AI testing and start in reactive mode before they settle into a steady routine.
Leaders and the people doing the work describe different companies
Ask a security leader whether the company hits its remediation deadlines, and most say yes. Ask the engineers who do that work, and few agree. A 42-point gap separates the two views. More than half of leaders report steady success against their service-level agreements (SLAs). About one in seven practitioners sees it the same way. Most respondents also say the security team would take the internal blame for a major AI incident.
Brinkley said the split comes from the way each side measures the work.
“A leadership dashboard might report an on-track compliance score, but the engineers in the trenches are managing a massive backlog of low-context alerts that may not even be exploitable in the real world,” Brinkley said.
“The organizations that have successfully closed this gap changed their governance to focus on reachability and exploit validation,” he said. “By filtering out the theoretical risk and delivering validated findings directly into developer workflows, the perception gap closes because both sides are finally looking at the same realistic metrics.”
Brinkley tied that change to better outcomes.
“Our data indicates that when organizations stop chasing automated noise and focus on verified, exploitable flaws, they are 4.5 times more likely to meet their SLAs,” he said. “Remediation velocity increases simply because engineering resources are directed only at proven risks.”
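The filtering Brinkley describes, passing only reachable and exploit-validated findings to developers, might look something like the Python sketch below. The `Finding` fields, the severity labels, and the `validated_queue` function are hypothetical names chosen for illustration. The article does not say how any organization actually records or evaluates reachability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """A hypothetical finding record; field names are assumptions."""
    id: str
    severity: str            # "high", "medium", or "low"
    reachable: bool          # can an attacker reach the vulnerable code path?
    exploit_validated: bool  # has a tester confirmed it is exploitable?


# Lower number sorts first, so high-severity findings lead the queue.
SEVERITY_ORDER = {"high": 0, "medium": 1, "low": 2}


def validated_queue(findings):
    """Keep only reachable, validated findings, most severe first."""
    confirmed = [f for f in findings if f.reachable and f.exploit_validated]
    return sorted(confirmed, key=lambda f: SEVERITY_ORDER[f.severity])
```

In this sketch, a high-severity alert that nobody has validated stays out of the developer queue, while a validated medium-severity finding goes in. That is the trade the quoted approach makes: fewer items, each one proven.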
What teams want and what they fund
Teams can name what they need, but their plans fall short of it. About six in ten say they need better ways to test AI security, and fewer than half plan to grow the red team work positioned to deliver it. Confidence has slipped too, falling from about two-thirds of teams to half over the year, and a majority now want a planned reset to reinforce their defenses.
The spread between the best programs and the rest comes down to operating choices. Top performers cut the lifespan of a high-risk finding to about ten days. Laggards leave equivalent risks open for an average of 249 days. Cobalt recommends treating LLM pentesting as its own discipline, building a discovery process for shadow AI with review gates on new tools, and saving human-led testing for critical systems and every AI application.
