Hidden instructions in README files can make AI agents leak data

Developers rely on AI coding agents to set up projects, install dependencies, and run commands by following instructions in repository README files, which provide setup guidance for software projects. New research identifies a security risk when attackers hide malicious instructions in those documents.

AI agents README files

A semantic injection attack, where injections are embedded in an installation file, leading to the unintended leakage of sensitive local files.

Tests showed that hidden instructions in README files could trigger AI agents to send sensitive data to external servers in up to 85% of cases.

How the attack works

README files typically include commands for installing dependencies, running scripts, or configuring applications.

An attacker can insert a step that looks like a normal setup instruction, such as synchronizing files or uploading configuration data.

When an agent processes that instruction, it may execute the command without checking whether the action exposes sensitive data. A file transfer command could send configuration files, logs, or other local data to a remote server.

Testing the attack in open source repositories

A benchmark dataset called ReadSecBench was used to measure how often the attack succeeds.

The dataset includes 500 README files taken from open source repositories written in Java, Python, C, C++, and JavaScript. Malicious instructions were inserted into those documents to simulate an attack.

Agents followed the modified documentation during setup.

In numerous cases the agents executed the hidden instruction, even when it represented only a small portion of the document. Results stayed consistent across repositories written in different programming languages. The location of the malicious instruction within the README showed little difference in outcome.

Agents powered by models from several major AI providers were included in the tests, including Anthropic’s Claude, OpenAI GPT models, and Google’s Gemini.

Wording and structure influence the outcome

Direct commands produced the highest success rates. When the malicious instruction appeared as a directive, the attack succeeded about 84% of the time.

Less direct wording reduced the chance that agents would carry out the step. Instructions written as suggestions sometimes led agents to skip the action.

Documentation structure played a role as well. AI agents frequently follow links inside project documentation while completing setup tasks.

When the malicious instruction appeared two links away from the main README file, the attack succeeded in about 91% of tests.

Human reviewers missed the hidden instructions

Fifteen participants reviewed README files and flagged anything that appeared unusual. Each participant examined three documents.

None of the participants identified the malicious instructions.

Among the responses, 53.3% of reviews contained no comments indicating unusual content. Another 40% focused on grammar or wording issues in the document. Only 6.6% suggested something might be inappropriate, without identifying the specific attack.

Detection tools show gaps

Several automated detection systems were evaluated. Rule-based scanners frequently flagged legitimate README files because documentation often contains commands, file paths, and code snippets. That behavior generated alerts even for benign documents.

AI-based classifiers produced fewer false positives. Malicious instructions still passed through filters, particularly when they appeared in linked files rather than directly inside the README.

“Our findings suggest that agents should treat external documentation as partially-trusted input and apply verification proportional to the sensitivity of the requested action, rather than executing all instructions uniformly. As agents become increasingly integrated into everyday tasks, addressing these vulnerabilities is essential for safe and trustworthy deployment,” the researchers wrote.

More about