/FIELD NOTE

Securing AI Agents: Tool Abuse, Confused Deputies and Blast Radius

12 February 2026 // 13 min read // Basalt Cyber Defense Division

An AI agent is a language model wired to tools and given the autonomy to decide when to use them. That autonomy is exactly what makes agents useful and exactly what makes them dangerous. The moment a model can send email, query a database, call an API, or run code, a successful prompt injection stops being a content problem and becomes a systems problem. At Basalt Cyber, securing agents is now one of the fastest-growing parts of our work, because teams are shipping autonomy faster than they are shipping the guardrails to contain it. This article lays out a practical threat model and the controls that actually limit damage.

The core shift: from output risk to action risk

A chat-only model can say something harmful. An agent can do something harmful. The distinction matters because the defenses are different. For a chat model you worry about content; for an agent you worry about authority. Every tool you connect expands what an attacker can achieve by hijacking a single turn, and every credential the agent holds is a credential the attacker effectively borrows. The right mental model is to assume the agent will eventually be manipulated and to ask: when that happens, what is the worst it can do?

Tool abuse

Tool abuse is the agent using a legitimate capability for an illegitimate end, usually because injected instructions steered it there. A file tool meant for reading becomes a tool for exfiltration. An HTTP tool meant for fetching becomes a tool for sending data outbound. The problem is rarely a buggy tool; it is an over-broad tool handed to a model that can be talked into anything.

Control: Apply least privilege to tools the way you would to a service account. Each tool should expose the narrowest possible operation, with allowlists for destinations, parameterised inputs, and explicit deny rules for anything outside the intended use. A tool that reads a specific folder is far safer than a general filesystem tool.
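As an illustration, here is a minimal Python sketch of two narrow tool guards: one that confines a read tool to a single folder and one that restricts an HTTP tool to allowlisted destinations. The names `REPORTS_DIR`, `ALLOWED_HOSTS`, `resolve_report_path`, and `check_destination` are hypothetical, chosen for this example only, and the folder and host values are placeholders.

```python
from pathlib import Path
from urllib.parse import urlparse

# Hypothetical values: the only folder and the only hosts this agent may touch.
REPORTS_DIR = Path("/srv/agent/reports")
ALLOWED_HOSTS = {"api.example.com"}


class ToolDenied(Exception):
    """Raised when a tool call falls outside its intended use."""


def resolve_report_path(name: str, base: Path = REPORTS_DIR) -> Path:
    """Map a requested file name to a path guaranteed to sit inside base."""
    base = base.resolve()
    candidate = (base / name).resolve()
    # Rejects traversal such as "../../etc/passwd" as well as absolute paths.
    if not candidate.is_relative_to(base):
        raise ToolDenied(f"path escapes the reports folder: {name!r}")
    return candidate


def check_destination(url: str) -> str:
    """Allow only HTTPS requests to hosts on the allowlist."""
    parts = urlparse(url)
    if parts.scheme != "https" or parts.hostname not in ALLOWED_HOSTS:
        raise ToolDenied(f"destination not allowlisted: {url!r}")
    return url
```

The point of the design is that the check lives in the tool, outside the model: however the agent is prompted, a request for another folder or another host fails before any I/O happens.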

The confused deputy

The confused deputy is a classic security flaw that agents reintroduce at scale. The agent is a privileged intermediary acting on behalf of a user, but it cannot reliably tell whose authority it is exercising. An attacker who controls some of the content the agent processes can get the agent to use its own elevated permissions to do something the attacker could not do directly. Indirect prompt injection plus a privileged tool is the textbook setup: untrusted text instructs the deputy, and the deputy acts with its full rights.

Control: Propagate the user's identity and permissions through to the tool call rather than letting the agent act with a broad service identity. Authorise actions against the requesting user's rights, not the agent's, so the agent cannot exceed what the human on whose behalf it acts is allowed to do.
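A minimal sketch of identity propagation, assuming a simple permission-string model. `UserContext`, `delete_record`, and the `records:delete` permission name are hypothetical; a real system would take the user's rights from its own identity provider.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UserContext:
    """Identity of the human the agent is acting for (hypothetical shape)."""
    user_id: str
    permissions: frozenset[str]


class NotAuthorised(Exception):
    """Raised when the requesting user lacks the right to perform an action."""


def delete_record(ctx: UserContext, record_id: str, store: dict[str, str]) -> None:
    """Tool entry point: authorise against the user's rights, not the agent's."""
    if "records:delete" not in ctx.permissions:
        raise NotAuthorised(f"{ctx.user_id} may not delete records")
    store.pop(record_id, None)
```

Because the tool receives the caller's context on every invocation, injected text cannot grant the agent rights that the requesting user does not already hold.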

Cross-tool injection

In multi-tool agents, the output of one tool becomes the input to the model's next decision, which can drive another tool. An attacker who can influence one tool's output, for example a web page the browse tool fetches, can chain into a second tool, for example the email tool. This is indirect injection routed through the agent's own tool graph.

Control: Treat every tool output as untrusted input. Validate and constrain it before it can influence further tool selection, and avoid letting the raw text of one tool's result flow unfiltered into the parameters of another. Where tools are sensitive, require a fresh authorisation rather than chaining automatically.
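One way to approximate this is a simple taint check, sketched below under stated assumptions. Tool results are wrapped in a hypothetical `ToolResult` type that is untrusted by default, and a hypothetical `guard_tool_call` function refuses to pass such a value into a sensitive tool unless a fresh approval has been given. The tool names and the 2000-character cap are illustrative.

```python
from dataclasses import dataclass

SENSITIVE_TOOLS = {"send_email", "transfer_funds"}  # hypothetical tool names


@dataclass(frozen=True)
class ToolResult:
    """Output of a tool, marked untrusted unless explicitly vetted."""
    source_tool: str
    text: str
    trusted: bool = False


class ChainBlocked(Exception):
    """Raised when untrusted output would flow into a sensitive tool."""


def guard_tool_call(
    tool: str, args: dict[str, object], fresh_approval: bool = False
) -> dict[str, object]:
    """Block untrusted tool output from reaching a sensitive tool unapproved."""
    tainted = [k for k, v in args.items()
               if isinstance(v, ToolResult) and not v.trusted]
    if tool in SENSITIVE_TOOLS and tainted and not fresh_approval:
        raise ChainBlocked(f"{tool}: arguments {tainted} come from untrusted output")
    # Unwrap to plain, length-capped text only after the check has passed.
    return {k: (v.text[:2000] if isinstance(v, ToolResult) else v)
            for k, v in args.items()}
```

This only tracks values passed directly between tools. It does not catch text that the model has paraphrased after reading it, so a stricter variant would treat the whole turn as tainted once any untrusted content has been read.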

Confirmation gates and human-in-the-loop

Not every action deserves the same level of trust. Reading is usually low risk on its own; sending money, sharing data, deleting records, and changing configuration are not. The cheapest high-value control is a confirmation gate on consequential actions that shows the user precisely what is about to happen and waits for approval.

  • Classify tools by impact and require explicit human confirmation for the destructive and irreversible ones.
  • Show the concrete action: the recipient, the amount, and the target record, not a vague summary.
  • Make refusal the safe default, so an ambiguous or manipulated request stalls rather than executes.
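The three points above can be sketched as a small gate in Python. The `HIGH_IMPACT` set, `PendingAction`, and `execute_with_gate` are hypothetical names for this example, and the `confirm` callback stands in for whatever approval UI a real product would use.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical impact classification; a real deployment would define its own.
HIGH_IMPACT = {"send_payment", "delete_record", "share_file"}


@dataclass(frozen=True)
class PendingAction:
    tool: str
    params: dict[str, str]

    def describe(self) -> str:
        """Render the exact tool and parameters, not a summary."""
        details = ", ".join(f"{k}={v!r}" for k, v in sorted(self.params.items()))
        return f"{self.tool}({details})"


def execute_with_gate(
    action: PendingAction,
    run: Callable[[PendingAction], str],
    confirm: Callable[[str], str],
) -> str:
    """Run low-impact actions directly; high-impact ones need an explicit 'yes'."""
    if action.tool in HIGH_IMPACT:
        answer = confirm(f"About to run {action.describe()}. Type 'yes' to approve: ")
        # Anything other than an explicit yes is treated as refusal.
        if answer.strip().lower() != "yes":
            return "refused"
    return run(action)
```

In a terminal prototype, `confirm` could simply be the built-in `input`. An empty or ambiguous answer leaves the action unexecuted.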

Sandboxing and isolation

When an agent runs code or executes commands, that execution must be contained. A code tool without a sandbox is remote code execution waiting for the right prompt. Run untrusted execution in an isolated, ephemeral environment with no access to production secrets, restricted network egress, and strict resource limits. Isolation turns a successful code-execution jailbreak from a breach into a contained experiment that gets thrown away.
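The sketch below shows only the shape of this control in Python: a throwaway working directory, an allowlisted environment so production secrets are not inherited, and a hard timeout. It is not a real sandbox. A child process started this way still shares the host's filesystem and network, so genuine isolation needs a container, a VM, or similar, with egress and memory limits enforced at that layer. `run_untrusted` and `SAFE_ENV_KEYS` are hypothetical names.

```python
import os
import subprocess
import sys
import tempfile

# Only these variables are passed through; everything else, including any
# secrets held in the parent's environment, is dropped.
SAFE_ENV_KEYS = ("PATH", "LD_LIBRARY_PATH", "SYSTEMROOT")


def run_untrusted(code: str, timeout_s: float = 2.0) -> subprocess.CompletedProcess:
    """Run a Python snippet in a child process with a scratch dir and a time limit."""
    env = {k: os.environ[k] for k in SAFE_ENV_KEYS if k in os.environ}
    with tempfile.TemporaryDirectory() as scratch:
        # The scratch directory is deleted when the block exits.
        return subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode
            cwd=scratch,
            env=env,
            capture_output=True,
            text=True,
            timeout=timeout_s,  # raises subprocess.TimeoutExpired on overrun
        )
```

Even in this reduced form, a snippet that loops forever is killed at the timeout, and one that reads its environment finds none of the parent's secrets.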

Limiting blast radius

Blast radius is the total damage achievable from one compromised agent turn, and it is the metric that matters most. You reduce it by reducing standing authority, scope, and autonomy.

  • Bound agent loops so an injected instruction cannot drive an unbounded sequence of actions.
  • Scope credentials tightly and rotate them; never give an agent a god-mode token for convenience.
  • Rate-limit and quota sensitive tools so even an undetected compromise cannot run away.
  • Log every tool call with full parameters so you can detect and reconstruct abuse.
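A minimal sketch of the loop bound, per-tool quota, and audit log from the list above. It assumes the agent's successive decisions arrive as an iterable of `(tool_name, params)` pairs. `MAX_STEPS`, `TOOL_QUOTA`, and `run_agent_turn` are hypothetical names, and the limits are arbitrary.

```python
import json
import logging
from collections import Counter
from typing import Callable, Iterable

log = logging.getLogger("agent.audit")

MAX_STEPS = 8                   # hypothetical per-request step budget
TOOL_QUOTA = {"send_email": 2}  # hypothetical per-request quotas


class BudgetExceeded(Exception):
    """Raised when a turn exceeds its step budget or a tool quota."""


def run_agent_turn(
    planned_calls: Iterable[tuple[str, dict]],
    tools: dict[str, Callable[..., object]],
) -> list[object]:
    """Execute tool calls until the step budget or a tool quota is exhausted."""
    used: Counter = Counter()
    results = []
    for step, (name, params) in enumerate(planned_calls):
        if step >= MAX_STEPS:
            raise BudgetExceeded(f"step budget of {MAX_STEPS} exhausted")
        used[name] += 1
        if used[name] > TOOL_QUOTA.get(name, MAX_STEPS):
            raise BudgetExceeded(f"quota for {name} exhausted")
        # Log the full parameters before executing, so abuse can be reconstructed.
        log.info("tool_call %s", json.dumps(
            {"step": step, "tool": name, "params": params}, sort_keys=True))
        results.append(tools[name](**params))
    return results
```

With these limits, an injected instruction that tries to send email in a loop is stopped at the third attempt, and every call made before that point is in the audit log.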

These controls are the practical core of our AI red teaming assessments, where we attempt exactly these attack chains against client agents, and they connect to the wider set of controls in our LLM security work.

Frequently asked questions

Is sandboxing enough on its own?

No. Sandboxing contains code execution but does nothing for an agent that abuses an email or payment tool. You need least privilege and confirmation gates as well.

Can guardrails on the model prevent tool abuse?

Model-level guardrails help but cannot be the only line of defense. Authority is enforced by your tool design and authorisation, not by the model's good behaviour.

What is the single highest-leverage control?

Reducing standing authority. An agent that simply cannot perform a destructive action without human approval has a small blast radius no matter how it is manipulated.

Takeaway

Securing agents is about authority, not just content. Assume the agent will be manipulated, then ensure that when it is, least privilege, identity propagation, confirmation gates, sandboxing, and a bounded blast radius leave the attacker with almost nothing to gain.