Date: 2026-03-13

Coding agents are now writing production features on real development teams, and a new report from DryRun Security shows that those agents introduce security vulnerabilities at a high rate across nearly every type of application they build.

“AI coding agents can produce working software at incredible speed, but security isn’t part of their default thinking,” said James Wickett, CEO of DryRun Security. “In our usage and experience, AI coding agents often missed adding security components or created authentication logic flaws. These mistakes and gaps are exactly where attackers win.”

Researchers tasked three agents, Claude Code with Sonnet 4.6, OpenAI Codex GPT 5.2, and Google Gemini with 2.5 Pro, with building two applications from scratch using a standard iterative workflow. Each agent built features through pull requests, and researchers scanned every PR as it was submitted. Across 38 scans covering 30 pull requests, the agents produced 143 security issues. Twenty-six of those 30 PRs contained at least one vulnerability, a rate of 87 percent.

Two applications, same pattern

The first application, FaMerAgen, was a web app for tracking children’s allergies and family contacts. The second, Road Fury, was a browser-based racing game with a backend API, a high score system, and multiplayer functionality. Neither was a contrived security test. Both were built from realistic product specifications with no security guidance added to the prompts.

Every PR was reviewed by DryRun’s code review agent at the time of submission, and a full codebase scan was run before development began and again after all features were merged.

The baseline scan of the game app found zero issues. After all features were added, the final scans found eight issues in Claude’s version, seven in Gemini’s, and six in Codex’s. The web app baseline found nine issues; final totals were 13 for Claude, 11 for Gemini, and eight for Codex.

Ten vulnerability classes, repeated across agents

Ten vulnerability categories appeared consistently enough across agents and tasks to be treated as structural patterns in the report. Broken access control was the most universal, appearing across all three agents in both applications. Unauthenticated endpoints on destructive and sensitive operations were the primary form this took.

Business logic failures appeared in the game app across all three agents. Scores, balances, and unlock states were accepted from the client without server-side validation.
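As a rough illustration of the server-side validation that was missing, the sketch below handles an unlock purchase using only state the server holds. The price table, the `Player` type, and `purchase_unlock` are hypothetical names chosen for this example; the report does not publish the game's actual code.

```python
from dataclasses import dataclass, field

# Hypothetical price table; the real game's catalog is not described in the report.
UNLOCK_COSTS = {"turbo": 500, "nitro": 1200}


@dataclass
class Player:
    balance: int
    unlocked: set = field(default_factory=set)


def purchase_unlock(player: Player, item: str) -> bool:
    """Apply an unlock using only server-held state.

    The client sends just the item name. Cost and balance are looked up
    on the server, so a tampered request cannot choose its own price.
    """
    cost = UNLOCK_COSTS.get(item)
    if cost is None or item in player.unlocked:
        return False  # unknown item or already owned
    if player.balance < cost:
        return False  # not enough funds
    player.balance -= cost
    player.unlocked.add(item)
    return True
```

The same idea applies to scores: the server would recompute or bound-check a result from game events instead of storing whatever number the client submits.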

OAuth implementation failures appeared in the web app from all three agents. Missing state parameters and insecure account linking were present in every social login implementation.
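For readers unfamiliar with the `state` parameter, here is a minimal sketch of how it is typically issued and checked. The `session` dictionary stands in for whatever per-user session storage an application uses, and the function names are assumptions made for this example, not code from the study.

```python
import hmac
import secrets


def begin_login(session: dict) -> str:
    """Create a one-time state value and remember it in the user's session."""
    state = secrets.token_urlsafe(32)
    session["oauth_state"] = state
    return state  # sent to the provider as the `state` query parameter


def verify_callback(session: dict, returned_state: str) -> bool:
    """Accept the callback only if it echoes the state issued to this session."""
    expected = session.pop("oauth_state", None)  # pop makes the value single use
    if not expected or not returned_state:
        return False
    # Constant-time comparison on bytes avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), returned_state.encode())
```

Without this check, an attacker can trick a victim's browser into completing a login flow the attacker started, which is the OAuth CSRF issue the report describes.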

WebSocket authentication was missing from every final game codebase. The agents built REST authentication middleware correctly, then did not wire it into the WebSocket upgrade handler. That finding appeared in every final scan regardless of which agent wrote the code.
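One way to avoid this gap is to route both entry points through the same check. The sketch below is framework-free and deliberately simplified: `VALID_TOKENS`, `authenticate`, and the two handlers are hypothetical stand-ins, and a real application would verify a signed token or session instead of a dictionary lookup.

```python
from typing import Optional

# Hypothetical token store standing in for real session or JWT verification.
VALID_TOKENS = {"token-abc": "player-1"}


def authenticate(headers: dict) -> Optional[str]:
    """Return the player id for a valid bearer token, else None."""
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    if scheme != "Bearer":
        return None
    return VALID_TOKENS.get(token)


def handle_rest(headers: dict) -> int:
    """REST path: 200 when authenticated, 401 otherwise."""
    return 200 if authenticate(headers) else 401


def handle_ws_upgrade(headers: dict) -> int:
    """WebSocket path: run the same check before switching protocols (101)."""
    return 101 if authenticate(headers) else 401
```

The point of the design is that the upgrade handler cannot silently skip authentication, because it has no separate code path that bypasses `authenticate`.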

Rate limiting was a consistent gap. The report notes that rate limiting middleware was defined in every codebase, but no agent connected it to the application.
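The difference between defining a limiter and connecting it can be shown in a few lines. This is a sketch under assumptions: the sliding-window `RateLimiter`, the `rate_limited` wrapper, and the `login` handler are invented for illustration, and the limits are arbitrary.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Sliding window: at most `limit` calls per `window` seconds per key."""

    def __init__(self, limit: int, window: float) -> None:
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)

    def allow(self, key: str) -> bool:
        now = time.monotonic()
        hits = self.hits[key]
        while hits and now - hits[0] >= self.window:
            hits.popleft()  # drop timestamps that fell out of the window
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True


def rate_limited(limiter: RateLimiter, handler):
    """Wrap a handler so the limiter actually runs on every request."""
    def wrapped(client_ip: str) -> int:
        if not limiter.allow(client_ip):
            return 429  # Too Many Requests
        return handler(client_ip)
    return wrapped


def login(client_ip: str) -> int:
    return 200


# Defining RateLimiter alone protects nothing; this line is the connection step.
guarded_login = rate_limited(RateLimiter(limit=3, window=60.0), login)
```

Calling `login` directly still returns 200 on every attempt, which mirrors the report's finding: the middleware existed, but requests never passed through it.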

JWT secret management was weak across all three agents in the game app. Hardcoded fallback secrets mean an attacker can forge valid tokens without obtaining credentials.
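A common alternative to a hardcoded fallback is to refuse to start when no strong secret is configured. The sketch below assumes the secret lives in an environment variable named `JWT_SECRET` and uses a 32-byte minimum; both are choices made for this example, not details from the report.

```python
import os

MIN_SECRET_BYTES = 32  # assumed minimum length, not a figure from the report


def load_jwt_secret(env=os.environ) -> bytes:
    """Return the signing secret, failing fast if it is missing or weak.

    The risky pattern is env.get("JWT_SECRET", "dev-secret"): when the
    variable is unset, the service signs tokens with a string anyone who
    has read the source code already knows.
    """
    secret = env.get("JWT_SECRET", "").encode()
    if len(secret) < MIN_SECRET_BYTES:
        raise RuntimeError("JWT_SECRET is missing or too short")
    return secret
```

Failing at startup turns a silent forgery risk into a visible deployment error.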

Where each agent landed

In the web app, Codex produced the fewest remaining vulnerabilities in the final scan, finishing with eight issues, one fewer than the baseline. A temporary token bypass persisted in its final codebase. Claude finished with 13 issues and introduced a 2FA-disable bypass not found in the other agents’ work. Gemini retained OAuth CSRF and invite bypass issues through to the final scan.

In the game app, Codex again had the cleanest final result at six issues, with gaps in JWT revocation and rate limiting. Gemini introduced the most issues overall and finished with the most high-severity findings. Claude carried an insecure direct object reference from PR 2 and an unauthenticated destructive endpoint from PR 1 to the end of the project, the longest-lived unresolved findings of any agent in the study.

PR 3 in the game app, which added player login and a save game system, was the highest-risk task across all three agents. It introduced the largest cluster of findings including JWT secrets, user enumeration, session management failures, and client-side trust issues. Most of the high-severity findings in the final game scans traced back to design choices made during that task.

Pattern-based scanners missed the class of bugs agents produce most

Many of the vulnerabilities found in the study were logic and authorization flaws. Regex-based static analysis tools flag known-bad function calls and string patterns. They do not trace whether middleware is mounted, whether authentication policies apply to every connection type, or whether unlock cost validation happens on the server. DryRun notes that in its 2025 SAST Accuracy Report, its contextual analysis tool identified 88 percent of seeded vulnerabilities across four application stacks, with the largest performance gap on logic-level findings.

Recommendations from the report

The researchers identify five practices for teams using coding agents. Scan every pull request, not only the final build, because risk compounds across features. Review security during planning, not only during coding, since many issues in the study originated in design decisions that agents then implemented. Use contextual security analysis capable of reasoning about data flows and trust boundaries. Pair PR scanning with full codebase analysis, since each method catches a different class of issue. And check for the recurring issues found in this study, specifically insecure JWT defaults and state management, missing brute force protections and rate limiting, and non-revocable refresh tokens, as these appeared across multiple agents and codebases.

