AI Agents and the Language Model Trap
The surface appeal is strong: these systems seem capable, conversational, and convenient. For example, imagine a virtual butler who can run online errands for you.
Before diving into the risks, we need clarity on terms. This is important because in casual discussions AI agent advocates will juggle different definitions of “agency”.
Definition: “Agency”
Without going into too much detail, we should agree that an AI agent is can perceive, reason, and act upon an environment.
Strong definition: An agent that independently makes decisions — for example, initiating a payment, sending an email, or booking a flight without further approval.
Weaker (and, in my opinion, excessively diluted) definition: An assistant that merely makes recommendations, with actions confirmed by a human.
This essay addresses both types, though the primary concern is with systems where LLMs have agency, not just input. However, even the “weak” form becomes dangerous if the user relies on the LLM to structure or present information — as explored later in the “flight to Paris” example.
LLMs Are Inherently Vulnerable
The foundational problem lies in the architecture of LLMs themselves. These models are not designed to enforce internal security boundaries or maintain secret information. They are highly susceptible to prompt injection — a class of attack where malicious inputs override or subvert the intended behaviour of the model. These attacks do not require low-level access or deep technical knowledge; often, they rely on cleverly crafted text inputs that manipulate the model’s output.
Worse, LLMs can be jailbroken. Jailbreaks are not rare anomalies; they are routine and evolving. This is not a problem with a simple patch. It is a fundamental vulnerability in how LLMs interpret and generate language. Fixing it may require a complete rethinking of how language interfaces interact with real-world systems — including entirely different infrastructural approaches.
Secret-Keeping and Decision Boundaries
A common defence is that LLMs should not be used to store secrets or hold API keys. But this misunderstands the problem. The issue is not whether an LLM is given a secret — it is whether the LLM is part of a system that can trigger real-world actions, like payments. If a model can be tricked into calling an external function — for example, by generating a well-formed API call or triggering an internal method — then it doesn’t matter whether the model "knows" the secret. The agent as a whole becomes a security liability.
Language itself — flexible, ambiguous, context-sensitive — is simply the wrong interface for secure decision-making. The very properties that make natural language expressive and user-friendly also make it manipulable.
Language is a Trap
Consider a seemingly benign use case: an AI assistant that books flights on behalf of a user. In this design, the security model assumes that the AI can suggest a flight, but the payment must always be confirmed by a human. On paper, this sounds safe. But the vulnerability lies not in the payment logic — it lies in the AI’s ability to present information. If the LLM is compromised or manipulated, it can generate a fake flight listing: “Flight to Paris – $200 – AirFrance.com,” which is actually a disguised prompt to send $200 to an attacker-controlled address or service. The user, believing the interface is trustworthy, approves the transaction.
From the system’s perspective, the human “confirmed” the payment — but only because the LLM controlled what was seen. The attack didn’t break the payment system; it broke the trust layer before the payment system was invoked. When language is the interface, and the model decides what is shown, even user confirmation becomes meaningless if the LLM is compromised.
Real-world systems have already been compromised through LLM misuse. Chatbots, assistants, and search agents have all been jailbroken shortly after release. These failures should not be seen as bugs, but as previews of a fundamentally insecure architecture — one that mistakenly treats language as a secure API surface.
“Just Fix the Interface”
Some critics might argue that the deception in the “flight to Paris” example could be solved by hardening the UI — showing secure metadata from trusted APIs, rather than relying on LLM-generated text. But this misunderstands how people interact with AI. If the LLM is anywhere in the loop, it can inject language manipulations that present malicious content in trusted contexts. Unless LLM output is entirely segregated from decisions (which would make it a non-agent), the system remains vulnerable.
Even if metadata is externally sourced, if the LLM renders or positions that information, it can still mislead. The weakness isn’t the payment logic — it’s that the LLM mediates the perception of what’s being approved.
“But a Human Always Confirms the Payment”
This too gives a false sense of security. Humans rely on context, and LLMs control that context. If an LLM presents malicious instructions as legitimate suggestions, the human-in-the-loop becomes a rubber stamp for the attacker’s intent.
Just as phishing emails trick people into entering credentials, LLMs can trick users into approving actions. But unlike emails, LLMs can tailor, contextualize, and update their deception in real-time — often in the exact tone and phrasing the user trusts most.
“Just Add a Controller”
Some argue that safety can be achieved by separating the LLM from real-world authority — letting it “suggest” actions, while a secure (automated) controller verifies and executes them. The LLM proposes an intent (e.g., book a flight to Paris), and the controller checks that suggestion against trusted APIs or databases before triggering any real action.
This model breaks down in open-ended domains like travel, shopping, or payments. Why? Because the controller must already know what “correct” looks like. It has to maintain a whitelist of trusted services, schemas, and expectations. In the AirFrance example, that means the controller would need to validate every flight recommendation against a known-good provider — Amadeus, Kayak, Expedia, etc. But how many APIs should it check? What about regional carriers, niche travel portals, or emerging services the user might trust?
The real world doesn’t fit into a closed-world validation loop. If the controller only knows how to verify a handful of sources, it becomes trivial for a compromised LLM to craft a fake output that falls just outside that scope — one that looks legitimate, but can’t be cross-checked. And if the system tries to scale by trusting more sources, it expands its attack surface and loses control over what “trusted” even means.
Worse still, users are often asking the LLM to explore exactly the kinds of ambiguous, unstructured problems that resist deterministic validation! That’s what makes LLMs feel intelligent — their ability to navigate ambiguity. But it also makes it nearly impossible to safely wrap them in a simple “approval-and-control” harness.
The net result: the LLM still controls what the user sees, and the controller can’t reliably validate what it doesn’t already expect. Safety becomes a comforting illusion. The attacker doesn’t need to bypass the controller — just stay slightly outside its field of view.
Conclusion: Don’t Let LLMs Near the Money
No system that relies on a large language model should be trusted with the authority to initiate or authorize payments. The current crop of LLMs are all intrinsically vulnerable to adversarial manipulation. Until provably secure LLMs exist (which may never happen), integrating them into financial agents is tantamount to building a vault with paper walls.
And as long as natural language is the interface — the gateway through which instructions are formed, validated, and executed — attackers will always have a way in. If you let an LLM near money, someone will eventually talk it into giving that money away.
Absolutely! Very depressing a realisation, too