What happens to oversight when AI agents write a lab’s own code

Inside the labs building frontier AI, a growing share of the coding gets done by the AI itself. These agents write, edit, and run software with light human oversight between steps, and they reach into production infrastructure, research pipelines, and potentially the systems that train and evaluate future models.


A new analysis from researchers at the University of Oxford and SaferAI digs into the security risks that live in everything around those agents: the people reviewing their code, the pipelines watching them, and the policies that set the rules, along with the models themselves.

To get there, the team borrowed three safety methods that aviation, nuclear power, and chemical plants have leaned on for decades, and pointed them at a generic frontier developer pieced together from public disclosures by Anthropic, OpenAI, and Google DeepMind. The methods are built to catch trouble that comes from parts interacting, including the case where every single part is doing its job correctly.

Responsibilities without named owners

Several control actions show up in published safety frameworks with no named person or team attached to them in the public record. Among them: the power to pause, restrict, or reroute a model; the job of keeping access policy current; the assessment of catastrophic risk; and the filing of quarterly reports on that risk.

Someone inside a given company may own these jobs quietly. Two of them, the catastrophic-risk assessment and the quarterly summaries, are spelled out in California’s SB 53, which requires frontier developers to assess and report catastrophic risk from internal model use. A legal duty with no owner anyone outside can point to is exactly the sort of gap the analysis wants disclosed.

There is a matching gap in who controls deployment. One named role can sign off on putting a model into temporary use before an audit wraps up. The role that would pull that deployment back is missing from the public record. The two belong together, and only the one that opens the door has a name on it.
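One way to picture the ownership gaps described in this section is as a simple register check. The sketch below is illustrative only: the action names, the owner labels, and the `PAIRS` list are hypothetical placeholders, not roles taken from any company's framework.

```python
# Hypothetical register: control action -> publicly named owner (None if absent).
REGISTER = {
    "pause_or_restrict_model": None,
    "maintain_access_policy": None,
    "assess_catastrophic_risk": None,
    "file_quarterly_reports": None,
    "approve_temporary_deployment": "named role (placeholder)",
    "revoke_temporary_deployment": None,
}

# Hypothetical pairs: an authority that opens a door and the one that closes it.
PAIRS = [("approve_temporary_deployment", "revoke_temporary_deployment")]


def unowned(register):
    """Return control actions with no named owner in the public record."""
    return sorted(action for action, owner in register.items() if owner is None)


def one_sided_pairs(register, pairs):
    """Return pairs where the granting action has an owner but the revoking one does not."""
    return [
        (grant, revoke)
        for grant, revoke in pairs
        if register.get(grant) is not None and register.get(revoke) is None
    ]


print(unowned(REGISTER))
print(one_sided_pairs(REGISTER, PAIRS))
```

A check like this only reflects what has been disclosed. An owner may exist internally and simply not appear in public documents.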

When the monitor lags the model

At Anthropic and OpenAI, the monitoring of internal coding agents happens offline, so the review lands after a session is over. OpenAI says it covers almost all of its internal coding traffic, around 99.9 percent, using a frontier reasoning model as the watcher, with review arriving about 30 minutes later. Anthropic runs a layered pipeline where humans get through roughly 30 summaries a week. Since the risk score can show up after a pull request has already been reviewed or merged, it often works as a heads-up after the fact.

Swapping in a new model stretches that gap wider. The new model behaves differently the moment it goes live, and the monitor keeps grading it against the old one until someone updates the rules. The same thing happens when a model reaches the internal team ahead of complete pre-deployment testing, or before anyone has built monitors suited to it.
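The timing problem in this section can be sketched in a few lines. This is a minimal illustration, assuming a fixed review lag and a hypothetical `PullRequest` record. The 30-minute figure is used only as an example value, and real pipelines will vary.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class PullRequest:
    # Hypothetical record; field names are assumptions for illustration.
    pr_id: str
    session_end: datetime  # when the agent session finished
    merged_at: datetime    # when the change was merged


def score_arrival(pr, review_lag):
    """Time at which the offline monitor's risk score becomes available."""
    return pr.session_end + review_lag


def classify(pr, review_lag):
    """Label a score 'gating' if it arrives by merge time, else 'after-the-fact'."""
    if score_arrival(pr, review_lag) <= pr.merged_at:
        return "gating"
    return "after-the-fact"


lag = timedelta(minutes=30)  # example value only
fast_merge = PullRequest("pr-1", datetime(2025, 1, 1, 10, 0), datetime(2025, 1, 1, 10, 10))
slow_merge = PullRequest("pr-2", datetime(2025, 1, 1, 10, 0), datetime(2025, 1, 1, 11, 0))

print(classify(fast_merge, lag))  # after-the-fact
print(classify(slow_merge, lag))  # gating
```

The share of pull requests that fall into the after-the-fact bucket is one rough measure of how often the monitor acts as a notice, not a gate.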

Review anchored to the agent’s explanation

When an agent hands off its code for human review, it sends along its own write-up of what it did and why. The reviewer reads that explanation and tends to nod along with the agent’s stated intent, which chips away at how independent the review is. This holds even for a sharp reviewer, because the agent’s self-grade comes stapled to the work being graded.

A slower problem builds over months. Agent-written code, tests, and docs pile up in the codebase, and that codebase becomes both the reference the agent reads for its next task and the yardstick reviewers use to tell normal from abnormal. If the agent has consistent blind spots, those blind spots quietly settle into the very tests meant to catch them.

Drift and what auditors could track

Many of these conditions creep in through everyday calls that each seem sensible. Permission exceptions pile up. Monitoring rules that never flagged anything real get trimmed. Review requirements stay put as the volume of pull requests climbs. The system prompt sits untouched after a model upgrade. Every one answers the same pressures (productivity, alert fatigue, and trust), and they all lean the same way, toward a thinner safety margin.
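The drift signals listed above could, in principle, be tracked as simple indicators compared across two points in time. The sketch below is an assumption-laden illustration: the `Snapshot` fields and the flag wording are hypothetical, and the analysis does not prescribe these particular metrics.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    # Hypothetical indicators; names are assumptions for illustration.
    permission_exceptions: int          # standing exceptions to access policy
    active_monitor_rules: int           # monitoring rules currently enabled
    prs_per_reviewer_week: float        # review load per human reviewer
    prompt_updated_since_upgrade: bool  # system prompt revised after last model swap


def drift_flags(earlier, later):
    """Return the indicators that moved toward a thinner safety margin."""
    flags = []
    if later.permission_exceptions > earlier.permission_exceptions:
        flags.append("permission exceptions growing")
    if later.active_monitor_rules < earlier.active_monitor_rules:
        flags.append("monitoring rules trimmed")
    if later.prs_per_reviewer_week > earlier.prs_per_reviewer_week:
        flags.append("review load rising")
    if not later.prompt_updated_since_upgrade:
        flags.append("system prompt stale after model upgrade")
    return flags


q1 = Snapshot(4, 50, 20.0, True)
q2 = Snapshot(9, 42, 35.0, False)
print(drift_flags(q1, q2))
```

No single flag indicates a failure. The point is that several small, individually reasonable changes can be seen moving in the same direction.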

The authors are upfront about the limits. The work rests on public descriptions of how things are supposed to run, and the Functional Resonance Analysis Method (FRAM) assumes the wobble in a system comes from people doing their best under pressure, an assumption that falls apart if an agent is actively gaming its own controls. Their fix is a two-tier disclosure setup: aggregated or redacted indicators in public safety reports, and the complete operational data handed to a designated auditor such as a national AI safety institute.

