I started exploring ways to isolate AI agents' execution and handle access control while working for my previous employer. The problem came back when I began developing my own AI agent for pentesting (called deadend-cli, for when you find yourself stuck), where I needed to run Python code and expose a shell terminal safely (especially when pentesting tools are involved, it really doesn't look very safe...). On top of that, numerous AI agents have failed, and in some cases been exploited, due to a lack of sandboxing and access control.
Why, What, and How?
Before talking about sandboxing, it's worth understanding what we want to secure. In agentic AI, multiple systems work together; the architecture we usually see is the following:
If we do a quick threat model, we can see that the most interesting attack vectors come from whatever the model interacts with.
In the last few months, multiple AI agent vulnerabilities have surfaced (and critical bugs leading to failure... like a database wipe-out - hey Replit 👋), including some that lead to RCE:
- The Langflow RCE discovered by Horizon3.
- A Cursor vulnerability that leads to RCE through MCP auto-start.
- The famous database wipe-out.
- And sadly, maybe others to come.
Sandboxing and access control are also called out in the OWASP Top 10 AI agent threats; see AAI002: Agent Tool Interaction Manipulation for more insight.
Basic AI Agent Tooling
Running arbitrary code is never a good idea (unsurprisingly). But sometimes we need to. In the case of agentic AI, it has become clear that running code is necessary to build workflows with self-evaluation and feedback. And it's not just about running arbitrary code; it's also about securing interactions with the filesystem, the network, and other applications.
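To make this concrete, here is a minimal sketch of what "run the code somewhere it can't hurt you" can look like: the generated snippet is executed inside a throwaway Docker container with no network, a read-only filesystem, dropped capabilities, and resource limits. The helper name, image, and limits are my own illustrative choices, and the sketch assumes Docker is available on the host.

```python
import subprocess

def run_untrusted(code: str, timeout: int = 10) -> str:
    """Run LLM-generated Python inside a throwaway, locked-down Docker container.
    Hypothetical helper; the image name and limits are illustrative."""
    cmd = [
        "docker", "run", "--rm",
        "--network=none",                      # no outbound network
        "--read-only",                         # immutable root filesystem
        "--memory=256m", "--cpus=0.5",         # resource caps
        "--pids-limit=64",                     # no fork bombs
        "--cap-drop=ALL",                      # drop all Linux capabilities
        "--security-opt", "no-new-privileges",
        "python:3.12-slim",
        "python", "-c", code,
    ]
    result = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    return result.stdout if result.returncode == 0 else result.stderr

print(run_untrusted("print(1 + 1)"))
```

Even if the snippet tries to wipe a disk or call home, the blast radius is a disposable container with nothing in it.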
The unpredictability of LLMs makes it difficult to solve the issue with safety measures alone. For example, we can try to detect whether malicious code is about to be run by using classifiers, rules, or even another LLM call, and that might indeed suppress 95% of threats and cases (just throwing numbers here; I don't have real stats. I'll update this when I find some...). The problem is that security is about being sure the system is 100% secure, 100% of the time.
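To illustrate why detection alone never gets you to 100%, here is a deliberately naive deny-list scanner (a toy of my own, not taken from any real guardrail product). It catches the obvious payload and completely misses a trivially obfuscated one:

```python
import re

# Naive deny-list "safety measure": catches obvious patterns, nothing more.
DENY_PATTERNS = [
    r"\bos\.system\b",
    r"\bsubprocess\b",
    r"\beval\s*\(",
    r"\bexec\s*\(",
    r"\b__import__\s*\(",
]

def looks_malicious(code: str) -> bool:
    return any(re.search(p, code) for p in DENY_PATTERNS)

print(looks_malicious("import os; os.system('rm -rf /')"))   # True: caught
print(looks_malicious(
    "import importlib; importlib.import_module('o'+'s').system('id')"
))                                                            # False: trivially bypassed
```

Every filter of this kind has a bypass; a sandbox doesn't care whether the bypass was clever or not.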
MCP Case
MCP servers are designed to be used locally first. Other tools, such as MCP-remote, add a gateway layer that makes it possible to use MCP servers remotely.
In the MCP docs, we find this note:
Security considerations: Both clients and servers must handle sensitive data appropriately during sampling. Clients should implement rate limiting and validate all message content. The human-in-the-loop design ensures that server-initiated AI interactions cannot compromise security or access sensitive data without explicit user consent.
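The "rate limiting and validate all message content" part is easy to under-specify. As a rough, non-MCP-specific sketch, a client-side gate could look like the following; the size cap, the rate, and the helper names are my own assumptions, not anything from the MCP SDK:

```python
import time

class TokenBucket:
    """Generic token-bucket rate limiter a client could put in front of
    server-initiated requests. Illustrative only, not MCP SDK code."""
    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

MAX_MESSAGE_BYTES = 32_768                      # assumption: cap on message size
bucket = TokenBucket(rate_per_sec=1, burst=5)   # assumption: 1 req/s, burst of 5

def accept_sampling_request(message: str) -> bool:
    """Reject oversized or too-frequent requests before they reach the model."""
    return len(message.encode()) <= MAX_MESSAGE_BYTES and bucket.allow()
```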
MCP servers are exposed to different attacks. They can have access to the filesystem, databases, and other kinds of resources. So why don't we have more runtime isolation there? (A legitimate question: why aren't sandboxing and the principle of least privilege implemented here by default?)
MCP servers should be treated as untrusted and must not run in a trusted environment; building a dedicated workspace through which the agent communicates with MCP servers adds a real layer of protection, as sketched below.
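As a rough sketch of that idea, the snippet below starts a filesystem MCP server inside a locked-down container that only sees a dedicated agent workspace. The image name, its arguments, and the workspace path are assumptions for illustration; the point is the least-privilege wiring, not the specific server.

```python
import os
import subprocess

# Assumption: a scratch directory that is the ONLY thing the MCP server may touch.
AGENT_WORKSPACE = "/tmp/agent-workspace"
os.makedirs(AGENT_WORKSPACE, exist_ok=True)

cmd = [
    "docker", "run", "--rm", "-i",          # -i keeps stdin open for the stdio transport
    "--network=none",                       # a filesystem server needs no network
    "--read-only",                          # immutable root filesystem
    "--cap-drop=ALL",
    "--security-opt", "no-new-privileges",
    "-v", f"{AGENT_WORKSPACE}:/workspace",  # the only writable path the server sees
    "mcp/filesystem", "/workspace",         # assumption: image and args of a filesystem MCP server
]

# The agent talks to the server over stdin/stdout; the host filesystem stays out of reach.
server = subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)
```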
Current Security Measures Implemented
I talked in my last article about the numerous attacks on LLMs, including the ones related to prompt injection.
Most of the work done to counter these attacks focuses on guardrails, classifiers, and scanners. Supposedly, this should resolve most of the issues. The real question, though, is: is it better to spend time inspecting every user input for malicious intent, or to be able to run anything in a secure environment where it can't affect the end user?
Looking Ahead
AI agents and MCP will keep evolving, and new use cases will be developed. There are promising elements in MCP's security design, such as supply chain protection and best practices, but history suggests they will eventually be exploited, just as early web applications were.
So it might be more than crucial (maybe? surely?) to add mechanisms that protect your data and environment even if the agent gets exploited. Several interesting projects are already trying to solve this predicament:
- Container-use, for example, built by the people behind Docker specifically for AI agents.
- Classic sandboxing and isolation offerings from e2b, daytona, and so on.
- And others are trying to make a complete OS for AI agents (go big or go home!).
The market is growing but not yet mature, because each use case is different. I decided to build the sandboxing myself for my project: I wanted something that suited my needs and didn't depend on a specific environment, and since most offerings rely on the same principles (Docker, microVMs, or WebAssembly), the actual security protections don't really change. (Do we really need a whole cloud infrastructure to run AI-generated print(1+1) in Python?)
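For that kind of trivial, low-risk snippet, even plain OS resource limits on a child process go a long way. Here is a minimal sketch, assuming a Linux host; note that this is not isolation at all, it only caps CPU time and memory:

```python
import resource
import subprocess

def run_tiny_snippet(code: str) -> str:
    """Run a low-risk snippet with rlimits only: no sandbox, just CPU/memory caps."""
    def limits():
        resource.setrlimit(resource.RLIMIT_CPU, (2, 2))                     # 2 s of CPU time
        resource.setrlimit(resource.RLIMIT_AS, (256 * 2**20, 256 * 2**20))  # 256 MiB address space
    out = subprocess.run(
        ["python3", "-c", code],
        capture_output=True, text=True, timeout=5, preexec_fn=limits,
    )
    return out.stdout

print(run_tiny_snippet("print(1+1)"))  # -> 2
```

Matching the weight of the mechanism to the risk of the code is, to me, the whole design question.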
References
- Unsafe at Any Speed: Abusing Python Exec for Unauth RCE in Langflow AI - Naveen Sunkavally: https://horizon3.ai/attack-research/disclosures/unsafe-at-any-speed-abusing-python-exec-for-unauth-rce-in-langflow-ai/
- Security of AI Agents: https://arxiv.org/html/2406.08689v2
- When Public Prompts Turn Into Local Shells: ‘CurXecute’ – RCE in Cursor via MCP Auto‑Start: https://www.aim.security/post/when-public-prompts-turn-into-local-shells-rce-in-cursor-via-mcp-auto-start
- AI Agents Under Threat: A Survey of Key Security Challenges and Future Pathways: https://dl.acm.org/doi/pdf/10.1145/3716628