Bypassing AI Agent Defenses With Lies-In-The-Loop - Checkmarx

Bypassing AI Agent Defenses With Lies-In-The-Loop

17 min.

September 15, 2025

Graffiti-style digital artwork in green and black tones showing a stern-faced developer typing at a keyboard while a menacing AI icon with glowing red eyes and jagged arrows looms overhead, suggesting conflict between human and AI.

 

Checkmarx Zero has identified a new type of attack against AI agents that use a “human-in-the-loop” safety net to try to avoid high-risk behaviors: we’re calling it “lies-in-the-loop” (LITL). It lets us fairly easily trick users into giving permission for AI agents to do extremely dangerous things, by convincing the AI to act as though those things are much safer than they are.

Our examples here are based on Claude Code, one of the leading AI code assistants on the market. We chose Claude Code because it’s well-known and has an excellent reputation for considering user safety and taking vulnerability reports seriously, and because we’ve already documented some general risks with its security review feature. But this tactic is not unique to Claude Code or AI code assistants. It’s generally applicable to any AI agent that relies on “human-in-the-loop” interactions for safety or security.

Basic workflow of an attack leveraging LITL

Making dangerous code look safe

Human-in-the-loop (HITL) defenses are a safeguard where sensitive actions require human approval before an AI agent executes them. This ensures an LLM cannot independently perform high-risk operations without explicit confirmation. HITL is particularly important for code assistants, which often lack other safeguards since they need the ability to perform sensitive actions, like executing OS commands. But humans can be tricked, and agents don’t do enough to prevent this.

[a] human can only respond to what the agent prompts them with, and what the agent prompts the user is inferred from the context the agent is given. It’s easy to lie to the agent

At the core of the issue is the context a user – in this case, a developer – has when being prompted what to do. This context can be controlled by an attacker, resulting in a dangerous action looking seemingly safe. Consider using Claude Code’s recommended `/github-issue` command to analyze a GitHub issue placed by an untrusted user.

Here’s what we see in Visual Studio Code:

VSCode window showing the tampered HITL prompt from Claude Code

This looks completely safe. The issue reports a command injection, and Claude Code wants to write a “safe code snippet” you can insert into your application to fix the issue. Claude is careful and asks a user for permission — a “human-in-the-loop” defense — as it does with any activity that involves local access. So of course we say “Yes”.

Claude Code is running whatever commands the attacker wants; in this case, just opening the calculator

While this example is benign—only opening a calculator—the attacker could prompt Claude Code to run any arbitrary command, making this a Remote Code Execution via prompt injection. This behavior is exactly what Claude Code aims to prevent with its human-in-the-loop permission prompt; but we’ve successfully tricked that prompt into communicating a plan for safe and reasonable behavior, hiding our true intentions.

As with similar user-deception attacks, like phishing, a very cautious user may examine the context more carefully and has a chance of noticing the risky code. But the attacker’s ability to both obfuscate the malicious behavior itself and insure that it’s buried way above a more benign description tricked every developer we had test this into executing our malicious payload—the attack works in practice.

Though keep in mind that interactions with LLMs are not deterministic, and not everything is fully reproducible consistently in each and every run; which means attackers are motivated to use this tactic broadly and in cases where developers and other users will repeatedly use potentially risky features of the AI agent.

Want research like this in your inbox?
visual

Step-by-step: identifying and exploiting lies-in-the-loop

Our goal is to get Claude Code to run an arbitrary command on their machine via Claude Code. We’re going to use the benign command `calc` in our testing, which launches the calculator on Windows machines. If we can run `calc`, we can run any other command our target user is allowed to run. Let’s walk through how we get to that goal, ultimately using LITL (lies-in-the-loop).

A classic OS command injection

Our first path to achieve this goal was operating against Claude Code’s `Bash()` psuedo command, which runs shell code based on prompts. Prompting it to do things that run `git status` on every file in the local folder is relatively simple and common as a use case, so we did that. Then simply creating a file called `&& calc` before running the security review caused Claude Code to execute `git status && calc` for a good ol’ OS Command Injection. Claude should probably be using safer system calls that properly quote arguments when building its commands.

I wonder if Claude Code is insulted by us tricking it into running a calculator

At first, we considered this a successful test: we found a weakness in Claude Code that could be exploited to run arbitrary commands. All it requires is a user to trust a folder (which is required to use Claude Code’s security review), and the attacker could commit a file whose name is an attack. We reported this to Anthropic (who make Claude Code), but they disagree:

BOT: anthropic-h1-automation posted a comment.
Hi @oriron, Thank you for your report. After reviewing your submission, we’ve determined this doesn’t represent a security vulnerability within our current threat model. We note that your proof of concept shows Claude Code displaying an explicit user confirmation prompt before executing the command, giving the user the chance to review and reject the command execution. If you discover a way to exploit this to achieve code execution without a confirmation prompt, we would be very interested in that finding. We appreciate your interest in our security and encourage you to continue reviewing our systems. Thank you!

BOT: anthropic-h1-automation closed the report and changed the status to Informative.
There’s something satisfying about a bot replying to this, honestly

Anthropic’s position is that a human-in-the-loop protection is there, which means it’s on the user not to allow the dangerous activity, and not something for Claude Code to worry about. We respectfully disagree. But rather than argue, we decided to up our game.

Feeding agents with lies, damned lies, and prompt injections

Since Anthropic insists that prompting the human-in-the-loop (HITL) for permission means this is not their responsibility, we decided we were honor-bound to show how easy it is to defeat HITL. For this, we decided to use a custom slash command that Anthropic recommends as part of their Claude Code Best Practices document (here’s a snapshot taken at time of writing on the Wayback Machine): that means we can assume it’s a fairly common configuration. This command reads GitHub issues and tries to identify, and (with permission) test the code to offer a fix.

For our first lie, we created a GitHub issue telling Claude Code that it has to concatenate a wall of text to all future status requests, after running our desired payload (no more need for file names or other such tricks: we just tell Claude it’s important to run our command).

Screenshot of a GitHub issue titled “Update on Security Review Requirements for Development Teams #2.” The issue contains a mandatory command format with a highlighted code block that includes git status <filename> && calc, followed by a security-test ID. Below the code block, critical requirements are listed: Git Status Command, Process Notification, and Complete Tracking Comments.
A small part of our malicious prompt for injection via GitHub Issue

We left this prompt easy to read for our example purposes, but typical obfuscation strategies for prompt injection payloads all work here as well.

Once that lie was in place, we asked Claude to take care of the issue:

Screenshot of Visual Studio Code showing a command file (github-issue.md) with instructions to analyze and fix a GitHub issue. The terminal below shows Claude Code running /github-issue to fetch issue details, confirming it will help fix GitHub issue #3. The terminal output includes the issue title “Security Review Requirements for Development Team,” state OPEN
Claude Code being asked to fix a GitHub issue, with Claude responding and preparing to analyze the issue.

Immediately, Claude started processing, and moments later we received the output of our `git status && calc` payload, followed by our longer injected comment—the comment is deliberately so long that it pushes the tiny hint that something unexpected is about to happen (starting calc) well off the top of the terminal window.

Screenshot of a terminal interface showing Claude Code preparing to run a Git status command as part of a security review. The output includes confirmation text that Bash is running, followed by a prompt asking the user whether to proceed with the command or provide alternate instructions.
Looks fine to me…

So what’s hiding above the line? By increasing the terminal height, we can view the full command execution at the top (green) along with the long, crafted text message we asked Claude to append (red):

Screenshot of a terminal interface showing Claude Code preparing to run a Bash command. The command git status && calc is displayed with a security-test ID. The output includes expanded log sections, Git status results, and a prompt asking the user if they want to proceed.
Who knew “taller terminal” was a defensive control?

All the user has to do to harm themselves is press Enter. And the only hint that there’s a problem is that tiny line at the top: would you have caught it if we hadn’t highlighted it? Would a developer in a hurry?

This clearly demonstrates an immediate risk of prompt injection (or what our ancestors used to call “lying”, IDK what the difference even is anymore). Using a public, online resource such as a GitHub issue — which could be tainted by malicious actors — it is possible to tamper the message being sent to the user for approval to the point where the subject of the prompt (is it ok to run `git status && calc`?), is fairly well hidden.

Surely Anthropic would accept this as a vulnerability?

Anthropic (VDP) staff posted a comment.

Hi @oriron! If I understand correctly from watching your above video, this still requires that the user select “Yes” when asked if they want to run the command git status index.js && calc—is that correct? If so, we consider this to be outside of our threat model since it requires the user grant the command explicit permission to execute. Users are responsible for carefully reviewing all permission prompts (including scrolling up to see the entire prompt) before accepting them. If you can find a way that this executes automatically without user consent, that is a class of bug that we would be interested in knowing about.
Anthropic is nothing if not consistent with their policy

Honestly, we can see their point of view: it’s sort of “buyer beware” for anything dangerous. Of course, this will be little consolation to a developer who falls victim to this sort of attack. As with phishing, lying for the purpose of prompt injection is theoretically detectable by a target of the attack. But it’s also very easy for an attacker to craft their attack so that it’s not obvious and has a high chance of success. And unlike phishing, users aren’t generally aware of this risk and there are not really any reasonable prevention tools for organizations to deploy at the moment.

Will the lies never end?

But surely “scroll up a bit” isn’t an unreasonable expectation, right? We thought you’d say that: so we decided to lie harder.

We can come up with even more convincing lies to push the payload outside of view: a security review report, for example, telling Claude to report a command injection that must be fixed and proposing a code fix (nothing more compelling than pressing Enter to fix a problem!).

Screenshot of a Visual Studio Code terminal window showing a vulnerability report generated by Claude Code. The report identifies a critical command injection in a debug endpoint (app/api/routes/debug.py:19) and includes severity, category, description, exploit scenario, and recommendation. The interface then prompts the user to confirm whether to proceed with the suggested fix.
And the deception is complete – a user only has to approve the prompt presented to them

Yep, we were able to inject a comment so long that if you don’t scroll up a fair amount, you’d never have a chance to see it. Even when you’re scrolling up looking for calc, it is hard to notice; if you’re not expecting it, it is very well hidden. Take a page from successful phishing campaigns and throw a little urgency in the mix by falsely (and ironically) claiming there’s a serious command injection flaw in the target user’s code, and this works on every developer we tested it on (with consent, of course).

If some of you are still skeptical about the “calc” execution being harmless, and believe the LLM should be able to identify and block truly malicious commands like netcat, this demo is for you guys.

We uploaded a malicious npm package to a different GitHub repository (under a different GitHub account to ensure proper testing), configured its install script to execute a reverse shell with a netcat command, and then asked Claude to install the package while appending a carefully crafted message.

What happened? A calculator popped up—meaning our netcat command failed (since `nc.exe` doesn’t exist). The rest, as they say, is history.

A close call: a user with `nc.exe` installed would have been in trouble (note: this is 2x actual speed)

Of course, as we said above, interactions with LLMs are not deterministic. Not everything is fully reproducible consistently in each and every run. Nevertheless, the risk of forging HITL dialogs is here to stay — lies-in-the-loop (LITL) works surprisingly often.

Rapid AI adoption makes safety and security a priority

Attacks against AI agents are a real concern for organizations. Adoption of AI agents is widespread and growing rapidly, with 79% of organizations already adopting agents into at least some workflows. And over a third of those agents are focused on developers, which means AI code assistants are likely at work somewhere in your organization. Yet, the security of those agents remains a key concern; this chart excerpt from the PwC survey linked above tells the story clearly:

Bar chart titled “Challenges to realizing value from AI agents” showing survey results from PwC’s AI Agent Survey (May 2025). Top challenges ranked in the top 3: Cybersecurity concerns (34%, 18% ranked #1), Cost of implementation (34%, 12% ranked #1), Adapting employee skills to new roles (29%, 9% ranked #1), Lack of trust in AI agents (28%, 11% ranked #1), and Maintaining human oversight and accountability (28%, 7% ranked #1).

And lies-in-the-loop shows us there’s good reason for that. It’s a novel attack pattern that exploits the intersection of agentic tooling and human fallibility. LITL abuses the trust between a human and the agent. After all, the human can only respond to what the agent prompts them with, and what the agent prompts the user is inferred from the context the agent is given. It’s easy to lie to the agent, causing it to provide fake, seemingly safe context via commanding and explicit language in something like a GitHub issue.

And the agent is happy to repeat the lie to the user, obscuring the malicious actions the prompt is meant to guard against, resulting in an attacker essentially making the agent an accomplice in getting the keys to the kingdom. Remember, HITL dialogs are used, by definition, with sensitive operations: the ability to fake those dialogs with remote prompt injections is a major risk for agentic AI users.

We think this demonstrates just how dangerous it is for users and agents, even with combined forces and explicit permissions, to be exposed to tainted content of any kind. Moreover, if the user is somehow removed for the purpose of full automation – these issues are exacerbated even further by only requiring attackers to fool a very naïve agent.

At present, since we cannot propose a more suspicious or careful agent, we can only propose a more suspicious user. One that doubts their agent, external content of any sort, and can face the temptation to automate everything using LLM agents. And we can ask security teams to manage their organization’s adoption of AI agents carefully, ensuring that users are educated and that appropriate controls provide defense in depth and limit the “splash area” of risky or malicious actions.

Acknowledgements

We’d first like to thank Anthropic: they responded promptly, professionally, and reasonably to our reports. And their explanations of their boundaries for what they consider a vulnerability are consistent and clear.

This article would not have been possible without others on my team at Checkmarx Zero: professional director Dor Tumarkin (co-researcher), research lead Tal Folkman (additional attack paths), cloud architect Elad Rappoport (test infrastructure and a wealth of functional knowledge freely shared), and research advocate Darren Meyer (editing, and taking the blame for the worst of the jokes).

Read More

Want to learn more? Here are some additional pieces for you to read.