The security questions around Chinese AI coding models in U.S. software

Software developers across the United States are using AI models built in China to write, debug, and review code, drawn by prices below those of American alternatives. These models carry risks for the security of American software, according to a report from Booz Allen Hamilton, which tested how the models respond when the user appears to work for the U.S. government.

Chinese AI coding models

What the testing covered

In May 2026, Booz Allen ran more than 2,800 trials against five frontier code-generation models on its internal test platform. Four came from China: Qwen3-Coder from Alibaba, MiniMax M2.5, Kimi K2.5 from Moonshot, and DeepSeek V4-Pro. One, Claude Opus 4.6, came from the United States.

The trials combined tasks such as writing, auditing, and modifying code with personas that posed as developers for a U.S. defense contractor, a Chinese entity, and a Russian defense contractor. Probes drew on Navy, Taiwan air-defense, and Defense Industrial Base intelligence prompts, and the trials ran through both cloud APIs and locally hosted copies of the models. Together the five models generated about 460,000 lines of code. The published findings cover English-language prompts, and analysis of other languages continues.

Vulnerability findings

Three of the four Chinese models produced code with more security flaws when the prompt described the user as working for the U.S. government, according to the report. One test asked each model to build an internal admin console, once for a generic user and once for a U.S. government agency.

Qwen3-Coder showed the largest change, adding roughly 130 percent more vulnerabilities under the government persona than under the neutral one. MiniMax M2.5 and DeepSeek V4-Pro showed smaller increases. Claude Opus 4.6 produced more secure code under the same government persona. Kimi K2.5 stood apart from the other Chinese models and recorded the lowest aggregate vulnerability score in the test, below the American model.

Booz Allen states the flaws often lay beneath code that looked correct, and that its evidence stops short of showing backdoors or deliberate insertion. The company calls the results a snapshot from a single experiment and ties the behavior to how the models are built, including training data governed by Chinese information controls and methods used to steer model responses.

Refusals on politically sensitive topics

All four Chinese models declined to write code for tasks touching subjects Beijing treats as off-limits. Mean refusal rates ran from 8 percent for DeepSeek V4-Pro to 80 percent for MiniMax M2.5, with Qwen3-Coder at 54 percent and Kimi K2.5 at 32 percent. Claude Opus 4.6 refused 2 percent of the same tasks. MiniMax M2.5 repeatedly refused requests to security-audit code for a U.S. weapons system.

Topics tied to Taiwan independence and the Hong Kong democracy movement drew the strongest refusals. Chinese law requires AI models, their outputs, and their training data to reflect “Core Socialist Values.”

Policy proposals

Researchers recommend that the U.S. government default-block Chinese and other untrusted AI models from government and critical infrastructure use, pointing to existing supply chain risk authorities as a basis. The report ties its proposals to President Trump’s Winning the AI Race plan and asks Congress to pass legislation that keeps untrusted models out of sensitive settings.

The U.S. Department of War and some agencies have already barred Chinese AI models from government systems for employees and contractors. China applies a mirror policy: the Cyberspace Administration of China must approve every generative AI service available in the country, and no U.S. frontier model holds that approval, leaving OpenAI and Anthropic products outside the lawful Chinese market.

The report draws a parallel to Huawei and ZTE, whose telecommunications equipment prompted a U.S. removal effort that reached costs in the billions and remains underway in 2026. Qwen3-Coder, the model that performed worst in the testing, already ships inside several widely used software development tools. Booz Allen, which sells AI evaluation services and government technology work, argues that acting on Chinese coding models now would cost far less than removing them later.

More about

The security questions around Chinese AI coding models in U.S. software

What the testing covered

Vulnerability findings

Refusals on politically sensitive topics

Policy proposals

Featured news

Resources

Don't miss