Building an AI Security Agent Is Easy. Operating One Is Harder.

Building an AI agent that triages vulnerabilities or generates code fixes has never been easier. With modern frontier models, a team can assemble a convincing prototype in days.

The challenge begins when that prototype meets production.

Real enterprise environments are messy: large codebases, multiple languages, constantly changing repositories, and strict rules around who can access what. A demo that works on clean, controlled inputs will struggle when it hits that reality – and the failures are often hard to spot until real damage is done.

The question isn’t whether you can build an AI security agent. It’s whether you can make it accurate, trustworthy, and dependable at scale.

What the Demo Gets Wrong

On the surface, building an AI security agent seems straightforward.

Your security scanner flags a potential vulnerability. The agent reads the relevant code and answers two questions: is this real and exploitable? And if so, what’s the fix? A capable engineer can wire this up in an afternoon, and the same pattern holds for most code-reasoning agents.

A basic pipeline is simple enough to build: take the finding, attach the relevant code, send it to a frontier model with a good prompt, and parse what comes back. In a controlled environment, the results can be impressive. The agent appears accurate, useful, and ready for production.

The prototype works because its assumptions are controlled: one language, a clean repository, a code snippet that fits in the context window, and all the information needed to make the decision already available. Production removes those assumptions one by one.

*The prototype pipeline: correct often enough to be convincing on curated inputs*

A real environment generates thousands of findings across dozens of languages and frameworks. Repositories are large monorepos with partial checkouts. The scanned commit may no longer exist. Branches get deleted. The code needed to understand a single finding can span multiple files, services, and dependency versions, far more than can fit into a context window.

On top of that, the model itself is non-deterministic. The same input can produce different answers on different runs. The hard part was never the model call. It is everything required to make that call correct, reliable, and repeatable at scale.

The Layers Most Prototypes Ignore

A production agent isn’t just a prompt, it’s three systems working together. Most prototypes focus almost entirely on the first, which is exactly why they struggle in production.

1. Prompt engineering

This is the part most people think about first: task framing, output schemas, system instructions, and few-shot examples. It determines how the model reasons about a problem. It is essential, but it only controls the model’s behavior. It does not determine whether the model has the right information or whether its answer can be trusted.

2. Context engineering

Context engineering is about getting the right information to the model in the first place. That means resolving the exact repository state, tracing data flow, identifying relevant dependencies, and selecting the smallest amount of code needed to make a correct decision. In practice, this is where accuracy is often won or lost.

3. Harness engineering

Harness engineering is everything around the model that makes it usable in production: orchestration, evaluation, guardrails, isolation, observability, retries, fallbacks, and cost and latency management. It is the layer that turns a capable model into dependable software.

*The engineering iceberg: building the agent is the visible part, but operating it reliably requires everything underneath.*

Prompt engineering is the visible part of the system. Context engineering and harness engineering are the much larger layers beneath it that determine whether the agent can be trusted in production.

Those layers consist of a collection of supporting subsystems that rarely appear in demos but do most of the work required to make the agent reliable:

Data ingestion and normalization

A finding is a pointer, not a payload. The system must resolve the exact file at the scanned commit, across SCM, scanner, and CI systems, with consistent schemas across languages and scanner versions.

Context modeling

A monorepo will not fit in a context window. Techniques such as call-graph traversal and taint analysis identify the specific code, configuration, and dependencies a verdict depends on. The goal is to provide enough context to be correct without overwhelming the model.

Identity and access

The agent acts across repos with different permissions. Results must never expose code a user is not authorized to see, and multi-tenant environments must prevent data leakage between customers.

Workflow integration

A verdict that does not land in a PR, ticket, IDE, or SARIF stream has limited value. The agent must also stay synchronized with the systems that track vulnerability status and ownership, whether that is a ticketing or GRC system or the scanning platform itself. This means deduplication across scans, idempotent comments, and durable state for findings that have already been triaged or dismissed.

Evaluation

Quality must be measured before it can be improved. That requires labeled datasets for triage and remediation outcomes across languages and vulnerability types, along with regression suites that detect when prompts, models, or system changes reduce accuracy.

Security and governance

The agent reasons over attacker-influenced sources and makes decisions an enterprise is accountable for. It needs prompt-injection defenses, human-in-the-loop gates for high-impact actions, and policy controls over what it is allowed to decide on its own versus escalate.

How It Fails (Usually Quietly)

A prototype rarely fails with an error message. More often, it returns a confident, well-formatted answer that happens to be wrong. In a security workflow, a plausible wrong answer is harder to detect – and often more damaging – than an obvious failure.

Some of the most common failure modes look like this:

Missing or stale context	When the agent cannot retrieve the exact code state a finding refers to, it reasons over a guess. In practice this, rather than model reasoning quality, is frequently the dominant source of incorrect verdicts. The agent does not know what it did not see.
Plausible but wrong verdicts	A real, exploitable finding gets dismissed with high confidence, or an unreachable one gets escalated. Without reachability and data-flow context, the model has no reliable basis for the call it is making.
Fixes that break behavior	A remediation can resolve the flagged pattern while silently changing application behavior or introducing a regression. Correct-looking is not the same as correct, and only validation catches the difference.
Prompt injection via source code	Source code is untrusted input. An attacker-placed comment instructing the model to mark a file safe is a real attack surface unique to agents that read code, and absent from any single-repo demo.
Evaluation blindness	Without labeled data, quality is unmeasurable. A model swap or prompt tweak that quietly drops accuracy looks identical to one that improves it until users notice, by which point trust is already gone.

Getting an agent to be right most of the time is straightforward. The harder challenge – closing the gap between “works most of the time” and “can be trusted consistently in production” – is where most of the engineering effort lives. Each incremental improvement requires more context, more validation, and more operational controls than the one before it, and the work never really ends because models, threats, and codebases continue to change underneath the system.

It’s an Ongoing Commitment, Not a Project

Choosing to build an AI security agent internally is not a decision to complete a project. It is a decision to own and operate a long-lived ML system. The initial build is often the smaller investment. The ongoing responsibilities look like this:

Model migration – Providers deprecate and release models on their schedule. Each change means re-validating every prompt and re-running the full evaluation suite before you can trust it.
Evaluation upkeep – Labeled datasets must grow as new languages, frameworks, and CWEs appear, and as your own codebase evolves. Labeling is slow, expert, and never finished.
Coverage expansion – Every new language or package manager is a new context-extraction and validation effort, not a configuration flag.
Drift and regression detection – Quality degrades silently. Catching it requires production monitoring, sampling, and someone accountable when accuracy slips at 2 a.m.
Agent attack surface – The agent itself is now part of your threat model and needs ongoing security review like any other production service.
Audit and compliance – Every automated decision that dismisses a risk or merges a fix must remain explainable, attributable, and traceable for as long as your retention and regulatory obligations require. An agent whose decisions cannot be reconstructed later is a finding waiting to happen in your next audit.

Once you understand what it takes to operate the system, the build-versus-buy decision becomes much easier to evaluate.

Build or Buy?

The question isn’t whether a capable team could build this. They can. The question is whether this is a capability you want to own and operate.

Building internally tends to make sense when:

This capability is core, differentiating IP, not supporting infrastructure.
You have a dedicated ML/platform team with capacity to operate it indefinitely.
You can fund continuous evaluation, labeling, and model-migration work as a standing cost.
Your scope is narrow enough (few languages, controlled repos) to keep the context problem tractable.

A purpose-built product tends to make sense when:

You need broad language and framework coverage from day one.
You need audit, isolation, and compliance guarantees out of the box.
You’d rather your security and platform engineers spend cycles on your business, not on operating an ML pipeline.
You want the reliability curve already climbed, and kept climbed as models change.

Before You Build, Answer These

☐ Can you reliably retrieve the exact code state (commit, branch, file) every finding refers to, across all of your repositories?

☐ Do you have labeled evaluation data for triage accuracy and fix quality by language and CWE, and who maintains it?

☐ How will you detect and prevent quality regression when a model is deprecated or swapped?

☐ How do you defend the agent against prompt injection delivered through attacker-controlled source code?

☐ How do you enforce per-user and per-repo access controls to prevent code leakage, including in multi-team environments?

☐ What’s the plan and budget for expanding coverage of languages, frameworks, and vulnerability classes over time?

☐ Can you produce a complete, explainable audit trail for every decision the agent makes automatically?

☐ Who is accountable, and on call, when the system silently degrades in production?

Conclusion

A prompt and a model call can produce an answer. Making that answer reliable, repeatable, and trustworthy across thousands of findings, repositories, and edge cases is the real work.

The challenge isn’t creating the agent. It’s operating the system around it.

Tags:

Agentic AI

Agentic AppSec

AI Agents

AI in Engineering

AppSec