Senior engineers are spending their week cleaning up AI-generated code

At most U.S. technology companies, machines now write the bulk of the code that ships each week. The engineer’s job has shifted toward reviewing what the AI produces, and that review gives the code high marks. Leaders rate AI-generated code as higher quality than the code their own people write, praising its clean structure, consistent style, and low count of obvious bugs at submission time.


The same code behaves worse once it runs. Production incidents have climbed over the past year. Senior engineers spend more of their time fixing what the AI generated. A large majority of organizations hit at least one production failure tied to AI code in the past six months, and a sizable share of that code goes back for repair soon after it ships.

Trust arrives before inspection

The pattern starts with trust that lands early. Most teams say they often ship AI-generated code to production without checking it line by line. The code reads well, so it clears review quickly, and the inspection step where many security defects get caught goes quiet.

LLMs produce code that works under clean, predictable conditions. The weak spots show up in edge cases, concurrency, deprecated API calls, and complex state changes. These gaps stay buried in the source and surface once real users hit the system. A reviewer scanning a pull request has little chance of spotting them.
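A minimal Python sketch of that kind of gap follows. The function names and the latency scenario are hypothetical and chosen only for illustration. The first version reads cleanly and works on typical input, but it fails on an empty list, a condition a reviewer scanning the diff could easily miss.

```python
def average_latency_ms(samples):
    """Mean latency in milliseconds. Correct for any non-empty list."""
    return sum(samples) / len(samples)


def average_latency_ms_safe(samples):
    """Same calculation, with the empty-input edge case handled."""
    if not samples:
        # No requests were recorded in this window, so there is no mean to report.
        return None
    return sum(samples) / len(samples)


if __name__ == "__main__":
    print(average_latency_ms([10, 20, 30]))
    print(average_latency_ms_safe([]))
    try:
        average_latency_ms([])
    except ZeroDivisionError as exc:
        # This path is only reached when a window contains no traffic.
        print("unsafe version failed:", exc)
```

Both versions return the same result on ordinary traffic. They diverge only when a window contains no requests, which is the sort of condition that tends to appear in production before it appears in a test.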

Security flaws that emerge under load

Newly introduced security vulnerabilities have affected roughly three in ten organizations in the past six months. Integration failures, compliance problems, and data integrity issues have each affected a similar share. Most organizations carry at least one war story from the period, and many carry several.

According to a New Relic study, AI-generated code introduces close to twice as many critical runtime issues as peer-reviewed human-authored code. The failures spread across many small problems at once. Each leaves a signature in production data. Schema drift and rising error rates between services point to integration breakage. Odd patterns in authentication and trace data expose security weaknesses. The common thread is that these signs appear after deployment, well past the review stage.
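One of those signatures, a rising error rate between services, can be checked with very little code. The sketch below is an assumption-laden illustration, not a method described in the study: the WindowStats type, the function name, and the thresholds (a 2x ratio, a 100-request minimum, a 0.1% baseline floor) are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WindowStats:
    """Request and error counts for one service-to-service call path."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def error_rate_regressed(before, after, ratio=2.0, min_requests=100, floor=0.001):
    """Return True when the post-deploy error rate is at least `ratio` times the baseline.

    `floor` keeps a baseline of zero errors from flagging a single stray failure,
    and `min_requests` skips windows too small to judge. All thresholds are
    illustrative assumptions.
    """
    if after.requests < min_requests:
        return False
    baseline = max(before.error_rate, floor)
    return after.error_rate >= ratio * baseline


if __name__ == "__main__":
    before = WindowStats(requests=1000, errors=5)
    after = WindowStats(requests=1000, errors=20)
    print(error_rate_regressed(before, after))
```

A check like this only has something to compare once the code is running, which is the point the paragraph above makes about signs that appear after deployment.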

The limits of review-time inspection

A reviewer reads the source. Production produces the trace. The source shows how the code is built. The trace shows how it behaves under real load, real dependencies, and real edge cases. AI coding tools generate code from the source alone, with no view of runtime conditions. That gap explains the distance between the grades AI code earns in review and the way it performs in the wild.

The cleanup falls on experienced staff. Site reliability and DevOps engineers report losing up to a third of their work week to triaging and refactoring machine output that reached production unchecked. That is time the most senior people on a team would otherwise spend on harder problems.

Observability moves earlier in the process

Support for observability has reached near-unanimous levels among the leaders surveyed. They treat runtime monitoring as essential for AI-generated code, and many now prompt the AI to build telemetry such as logs and traces directly into the code it writes. The decision about what to log and what to alert on is moving upstream into the developer’s prompt.
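What built-in telemetry might look like is sketched below using only Python's standard logging module. The decorator name, the log field names, and the apply_discount function are hypothetical examples, not output from any particular AI tool or a format recommended by the survey.

```python
import functools
import logging
import time

logger = logging.getLogger("telemetry_sketch")


def traced(func):
    """Log the outcome and duration of every call to the wrapped function."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = func(*args, **kwargs)
        except Exception:
            elapsed_ms = (time.perf_counter() - start) * 1000
            # logger.exception records the stack trace along with the message.
            logger.exception("call failed op=%s duration_ms=%.1f", func.__name__, elapsed_ms)
            raise
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info("call ok op=%s duration_ms=%.1f", func.__name__, elapsed_ms)
        return result
    return wrapper


@traced
def apply_discount(price, percent):
    """Hypothetical business function used to demonstrate the decorator."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price * (100 - percent) / 100


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(apply_discount(200, 25))
```

In a real service the same idea would more likely be expressed through a tracing library than through plain log lines. The sketch only shows the decision moving into the code itself: what gets recorded is chosen when the function is written, not after an incident.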

The speed gains behind all this are real, and revenue reflects them, which is why adoption keeps climbing. AI-written code sits inside formal production policy at most organizations and reaches the same customer-facing services as code from senior engineers. No organization in the survey bans the practice.
