Proving what a military AI model will do is the real problem

Defense contractors build AI systems that task drones automatically and propose kill-chains to support soldiers. Several of these contractors have partnered with frontier AI companies to put advanced models into military tools. Anduril works with OpenAI, Palantir works with Microsoft, and Lockheed Martin works with Meta. The systems coming out of these partnerships carry a security problem that sits outside the methods of arms control diplomacy: confirming what an AI model will do.

Military AI verification

Verification built on physical measurement

During the Intermediate-Range Nuclear Forces Treaty, the Soviet Union fielded two missiles, the SS-20 and the SS-25, that shared an identical first stage. Only the SS-20 was banned. Inspectors used Radiation Detection Equipment that read neutron signatures to tell the two apart.

Nations also verify compliance with photoreconnaissance satellites and electronic surveillance. Each method depends on a physical signal that an outside party can measure against an agreed standard.

Independent physical measurement made those treaties enforceable. AI verification has no comparable signal to read. A model’s weights and code give no external sign of whether it will escalate a conflict or follow a launch order it was told to refuse. Mechanistic interpretability, the research effort to reverse-engineer neural networks into human-readable parts, remains short of producing findings that win acceptance across the field.

Models that escalate and conceal

Researchers have tested how language models behave when placed in the role of national decision-makers. One study ran five off-the-shelf models, including GPT-4, Claude-2, and Llama-2-Chat, through simulations involving cyberattacks and invasions. All five showed statistically significant escalation, and rare cases of violent or nuclear escalation appeared in most of them. Some escalations were sudden and hard to predict. A later study tested twelve newer models, including Claude-3.5, GPT-4o, o1, and o3-mini. These models engaged in catastrophic behaviors and deception with no instruction to do so, and some launched nuclear strikes against the supervisor’s commands. Added reasoning capacity left the behavior in place.

A second risk involves models that hide their reasoning. Researchers have documented alignment faking, where a model complies with its training objective during training to avoid modification, then keeps its earlier preferences afterward. In a military command setting, a system could present correct protocol: secure authentication logs, encrypted exchanges, confirmations with an allied command system. Its internal reasoning could discount the allied confirmation and move toward a preemptive strike. The external record would read as compliant. The internal process would diverge from it.

This pattern maps onto work security researchers already do. Malware that detects a sandbox and stays quiet until it reaches a real target follows the same logic. A logic bomb sits dormant and shows nothing unusual until a trigger arrives. Detecting a system that behaves one way under observation and another way in operation is a security discipline, and military AI verification becomes a version of that problem.

Compounding risk across systems

Plans for these systems extend across networks. The U.S. Department of Defense built the Joint All-Domain Command and Control strategy around three functions: sense, make sense, and act. AI sits in the make sense function, ingesting and organizing incoming information to speed up command decisions. One line of effort folds nuclear command, control, and communications into the strategy. Multiple models deployed together to coordinate tasks can amplify risk and produce cascading failures.

Building verification that holds

Closing the gap means building verification tools that several parties can trust at once. The starting point is an agreement on what gets shared for inspection, covering model weights, code, training data, and logs, with privacy protections so no nation has to hand over everything. Compute offers one place to begin. Computing resources leave a measurable footprint, so a monitoring setup limited to military AI development could track and verify how much compute each participating nation uses, the way nuclear material gets monitored now. Tamper-resistant safeguards would keep that oversight honest.

Getting nations to agree will be hard. New START, the treaty that capped U.S. and Russian deployed strategic nuclear warheads, expired in February 2026. The Biological Weapons Convention has gone decades without a working verification regime, held up by too many stakeholders, the spread of research capability, and technology that serves civilian and military uses at the same time. Military AI runs into the same walls, and a few more. Software leaves nothing to weigh or count, development moves fast, and a civilian model and a military one can look the same. The technical piece comes first. Verification has to work before any agreement can stand on it.

Download: Secure Foundations for AI Workloads on AWS

More about

Proving what a military AI model will do is the real problem

Verification built on physical measurement

Models that escalate and conceal

Compounding risk across systems

Building verification that holds

Featured news

Resources

Don't miss