CTO & Co-Founder
How we run sandboxes for agents at scale

We took a bet early on to give the LLM the power to run arbitrary code. This post is about why we made that bet, and what it takes to run thousands of these sandboxes at once, spinning up and down as fast as people start and finish chats with the agent.
Every conversation a user has with the Adapt agent is backed by its own computer. Not just some locked-down container on a shared server, but an isolated VM the model can do whatever it wants with: install software, write and run programs, browse the web, talk to APIs. We call these sandboxes, and they're one of the core primitives Adapt is built on.
Full Control
LLMs are coding geniuses, and my job has largely been about building the perfect developer environment for them to work in.
The usual way to connect an AI to the outside world is to hand-build integrations (a bespoke connector for GitHub, another for HubSpot, another for Stripe) or to wait for each service to ship an MCP server. That approach doesn't scale, and I'm not really into writing integration code day in, day out.
So instead of doing that work ourselves, we let the model do it. Any service that exposes an API can be accessed from Adapt, because we give the LLM everything it needs to write the script or program that talks to that API. That's a big part of what we mean when we call Adapt a "horizontal intelligence": it isn't wired to a fixed list of tools; it can build the tool it needs on the spot.
Foundational to this is giving the LLM full access to the sandbox. Instead of handing the model a static set of languages and CLI tools with limited access to the filesystem, we give it complete access to everything. It runs as root. And while our sandboxes ship with common runtimes like Node and Python, what if the best SDK for some service's API is written in Go? The model can just go ahead and install and run it.

Does the LLM need to write a Go program? Go ahead and install Go and run it.
So if we're allowing the model to install whatever it wants and execute code no human has verified, how do we secure it? Fortunately, we're not the first people who've needed to run untrusted code. There are two very popular secure runtimes for exactly this: gVisor and Firecracker. Our journey so far has made us very well acquainted with both.
From gVisor to Firecracker
Our first foray into secure sandboxes for LLMs was the "easy" approach: run each sandbox with gVisor on top of GKE (Google Kubernetes Engine), using GKE Sandbox. We were already running all of our other services on GKE, so this was the natural step for us.
gVisor sits between a container and the host kernel. Instead of letting a program make system calls straight to the real Linux kernel (the thing you really don't want untrusted code poking at), gVisor intercepts those calls in a user-space kernel of its own and services them itself. You get most of the convenience of a normal container with a much smaller attack surface. And GKE Sandbox packages all of this up. We deployed Pods (groups of one or more containers) and they transparently ran under gVisor, without us having to do much infrastructure configuration at all.
And this worked really well to start. We defined the "base" sandbox as a Docker image and let GKE scale it out to the number of sandboxes we needed at any given time. Updating the software that sandboxes shipped with meant a simple Dockerfile change and a version bump in a manifest.
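To make that setup concrete, here is a minimal sketch of the kind of Pod manifest involved, built as a Python dictionary. The `runtimeClassName: gvisor` field is how GKE Sandbox is asked to run a Pod under gVisor; the Pod name, label, and image reference are placeholders, not our actual manifests.

```python
import json


def sandbox_pod_manifest(name: str, image: str) -> dict:
    """Build a Pod manifest that requests the gVisor runtime on GKE Sandbox."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        # The label is a hypothetical convention for finding sandbox Pods.
        "metadata": {"name": name, "labels": {"app": "sandbox"}},
        "spec": {
            # GKE Sandbox runs Pods with this RuntimeClass under gVisor.
            "runtimeClassName": "gvisor",
            # Shipping new sandbox software means bumping this image tag.
            "containers": [{"name": "sandbox", "image": image}],
        },
    }


if __name__ == "__main__":
    # Placeholder registry and tag, for illustration only.
    manifest = sandbox_pod_manifest(
        "sandbox-example", "registry.example.com/sandbox-base:1.2.3"
    )
    print(json.dumps(manifest, indent=2))
```

In this sketch, a sandbox software update is nothing more than a new image tag passed to the same function.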

Figure: Hundreds of sandbox Pods running under GKE Sandbox.
But the same abstraction that made gVisor easy is the one we kept fighting. Because gVisor reimplements the Linux syscall surface in user space, not everything behaves exactly like it would on a real kernel, and the workloads our model dreams up are about as unpredictable as workloads get. The interception that buys you safety also costs you on syscall- and I/O-heavy work. And leaning on GKE for the whole lifecycle meant the parts we most wanted to control (boot time, packing density, networking, and how aggressively we recycle machines) were the parts we had the least control over. The stray OutOfcpu Pod above is the kind of thing you start seeing when you're pushing someone else's scheduler harder than it wants to go.
That's what pushed us to Firecracker.
Firecracker microVMs are real virtual machines, each with its own guest kernel, running with hardware virtualization, but stripped down to boot in a fraction of a second with only a few megabytes of overhead. It's the same technology AWS built to pack enormous numbers of Lambda and Fargate workloads onto shared hardware. It gives us a stronger isolation boundary than a shared kernel, boots fast enough to feel instant, and is small enough to pack a lot of them onto a single host.
The tradeoff is that Firecracker hands you a VM and not much else. There's no GKE-style layer doing the scheduling, networking, and lifecycle orchestration. So we built one, and we call it orc.
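To show how little Firecracker hands you, here is a sketch of the request sequence its HTTP API (served over a Unix socket) expects before a guest boots. The endpoints and field names come from Firecracker's public API; the function name, file paths, and boot arguments are illustrative assumptions, and this is not a description of how orc drives Firecracker internally. Networking, scheduling, and cleanup are deliberately absent, because Firecracker leaves those to you.

```python
from typing import Any, Dict, List, Tuple

Request = Tuple[str, str, Dict[str, Any]]


def firecracker_boot_requests(
    kernel_path: str, rootfs_path: str, vcpus: int, mem_mib: int
) -> List[Request]:
    """Return the (method, path, body) calls that configure and start one microVM."""
    return [
        # Size the guest.
        ("PUT", "/machine-config", {"vcpu_count": vcpus, "mem_size_mib": mem_mib}),
        # Point at an uncompressed guest kernel; boot_args are an example value.
        (
            "PUT",
            "/boot-source",
            {
                "kernel_image_path": kernel_path,
                "boot_args": "console=ttyS0 reboot=k panic=1 pci=off",
            },
        ),
        # Attach the root filesystem as a block device.
        (
            "PUT",
            "/drives/rootfs",
            {
                "drive_id": "rootfs",
                "path_on_host": rootfs_path,
                "is_root_device": True,
                "is_read_only": False,
            },
        ),
        # Boot the guest.
        ("PUT", "/actions", {"action_type": "InstanceStart"}),
    ]


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    for method, path, body in firecracker_boot_requests(
        "/srv/vmlinux", "/srv/rootfs.ext4", vcpus=1, mem_mib=512
    ):
        print(method, path, body)
```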
The rootfs is just an image
One thing we didn't want to give up in the move off containers was defining a sandbox as a plain Dockerfile. Containers make that trivial; VMs traditionally don't, since a microVM boots a root filesystem, not an OCI image.
So orc bridges the two. When it's asked to create a VM, it takes an ordinary Docker/OCI image and generates the VM's root filesystem from it on the fly, caching the result so later boots of the same image are quick. Our base sandbox is still just a Dockerfile, and orc turns it into a bootable rootfs at request time.
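The caching half of that idea can be sketched in a few lines. This is a simplified illustration, not orc's code: `rootfs_for_image`, the `.ext4` suffix, and the `build` callback (which stands in for whatever actually unpacks the image layers into a filesystem) are all hypothetical. It also assumes the image reference is a stable identifier such as a digest, since a mutable tag would make a poor cache key.

```python
import hashlib
import os
import tempfile
from typing import Callable


def rootfs_for_image(
    image_ref: str, cache_dir: str, build: Callable[[str, str], None]
) -> str:
    """Return a path to a rootfs for image_ref, building it only on a cache miss."""
    key = hashlib.sha256(image_ref.encode("utf-8")).hexdigest()
    final_path = os.path.join(cache_dir, key + ".ext4")
    if os.path.exists(final_path):
        return final_path  # Cache hit: later boots of the same image skip the build.

    os.makedirs(cache_dir, exist_ok=True)
    # Build into a temporary file so a crashed build never leaves a
    # half-written rootfs at the final path.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir, suffix=".tmp")
    os.close(fd)
    try:
        build(image_ref, tmp_path)
        os.replace(tmp_path, final_path)  # Atomic rename within one filesystem.
    finally:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
    return final_path
```

Writing to a temporary file and renaming it means two concurrent requests for the same image may both build, but neither can observe a partial result.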
That keeps our workflow identical to the GKE days (edit a Dockerfile, ship a new sandbox) while running on real VMs underneath. And it opens a door we're only starting to walk through. Because any OCI image can become a microVM, we can boot sandboxes from images other than the default one. Want a VM that already has Postgres and pgvector baked in? Point orc at that image and you get it as its own isolated machine. The sandbox stops being a single fixed environment and becomes "whatever image the job needs, booted as its own VM."
Executing at Scale
And here's the thing that makes this a genuinely hard problem: every chat gets its own sandbox. One machine per conversation. At any given moment we have thousands of them alive, and that number is never still. Every time someone opens a chat, a sandbox has to appear; every time a chat goes quiet, one has to disappear so we're not paying for it. We're constantly spinning sandboxes up and down.
Two numbers dominate everything: how fast we can get a sandbox ready, and how many we can fit on a host.
Startup latency. A Firecracker microVM boots in a few hundred milliseconds. That's fast enough that we don't keep a warm pool at all, which is one of the quieter wins of the switch. Under GKE we'd have had to keep spare capacity around to hide startup time. With orc a fresh sandbox is ready before you'd notice, so we just create one on demand when a chat starts and tear it down when the chat is done. No more idle pool to babysit or pay for.
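The create-on-demand, tear-down-when-quiet lifecycle can be sketched as a small bookkeeping class. Everything here is hypothetical: the class name, the `create` and `destroy` callbacks (stand-ins for calls into a control plane), and the ten-minute idle timeout are illustrative choices, not our production values.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SandboxLifecycle:
    create: Callable[[str], str]    # chat_id -> sandbox_id
    destroy: Callable[[str], None]  # sandbox_id -> None
    idle_timeout_s: float = 600.0   # assumed value, for illustration
    _sandboxes: Dict[str, str] = field(default_factory=dict)
    _last_seen: Dict[str, float] = field(default_factory=dict)

    def on_message(self, chat_id: str, now: float) -> str:
        """Return the chat's sandbox, creating one on first use (no warm pool)."""
        if chat_id not in self._sandboxes:
            self._sandboxes[chat_id] = self.create(chat_id)
        self._last_seen[chat_id] = now
        return self._sandboxes[chat_id]

    def reap_idle(self, now: float) -> List[str]:
        """Destroy sandboxes whose chats have gone quiet; return those chat ids."""
        idle = [
            chat_id
            for chat_id, seen in self._last_seen.items()
            if now - seen >= self.idle_timeout_s
        ]
        for chat_id in idle:
            self.destroy(self._sandboxes.pop(chat_id))
            del self._last_seen[chat_id]
        return idle
```

Because creation is cheap, a chat that resumes after being reaped simply gets a fresh sandbox on its next message.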
Density. Because each microVM is tiny, we can pack a lot of them onto one physical host. We size each sandbox's CPU and memory to what it actually needs rather than over-provisioning, which is what lets us run thousands of them economically.
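The density arithmetic is simple enough to write down. In this sketch every number is an assumption chosen for illustration: the host shape, the per-VM sizing, the CPU overcommit ratio, the memory reserved for the host, and the 5 MiB of per-VM overhead (a stand-in for "a few megabytes").

```python
def sandboxes_per_host(
    host_vcpus: int,
    host_mem_mib: int,
    vm_vcpus: int,
    vm_mem_mib: int,
    vmm_overhead_mib: int = 5,
    cpu_overcommit: float = 1.0,
    reserved_mem_mib: int = 0,
) -> int:
    """Estimate how many equally sized microVMs fit on one host.

    The answer is whichever runs out first: schedulable vCPUs or memory.
    """
    by_cpu = int(host_vcpus * cpu_overcommit) // vm_vcpus
    by_mem = (host_mem_mib - reserved_mem_mib) // (vm_mem_mib + vmm_overhead_mib)
    return max(0, min(by_cpu, by_mem))


if __name__ == "__main__":
    # Hypothetical host: 64 vCPUs, 256 GiB RAM, 8 GiB reserved for the host.
    # Hypothetical sandbox: 1 vCPU, 512 MiB.
    print(sandboxes_per_host(64, 262144, 1, 512, cpu_overcommit=4.0,
                             reserved_mem_mib=8192))   # 256, CPU-bound
    print(sandboxes_per_host(64, 262144, 1, 512, cpu_overcommit=10.0,
                             reserved_mem_mib=8192))   # 491, memory-bound
```

The two calls show why right-sizing matters: depending on the overcommit ratio, either CPU or memory becomes the limit, and halving a sandbox's memory only helps in the memory-bound case.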
orc itself is deliberately small. It's a control plane that speaks a simple API: create a VM with N vCPUs and M megabytes of memory from a given image, stream commands into it, read and write files inside it, tag it with labels so we can find it later, and delete it when we're done. Each guest runs a tiny init process as PID 1 and gets its own isolated network. That's most of it. The magic isn't any one clever trick; it's that these primitives are boring and fast enough to run a fleet on.
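To give a feel for how small that surface is, here is an in-memory stand-in with one method per primitive listed above. It is a sketch only: the class and method names, the argument shapes, and the label-matching rule are all invented for illustration and are not orc's real API. Commands are merely recorded here, where a real control plane would stream them into the guest and stream output back.

```python
import itertools
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VM:
    id: str
    image: str
    vcpus: int
    mem_mib: int
    labels: Dict[str, str] = field(default_factory=dict)
    files: Dict[str, bytes] = field(default_factory=dict)
    commands: List[str] = field(default_factory=list)


class InMemoryOrc:
    """Hypothetical fake of a sandbox control plane, useful for tests."""

    def __init__(self) -> None:
        self._vms: Dict[str, VM] = {}
        self._ids = itertools.count(1)

    def create_vm(self, image: str, vcpus: int, mem_mib: int,
                  labels: Optional[Dict[str, str]] = None) -> VM:
        vm = VM(f"vm-{next(self._ids)}", image, vcpus, mem_mib, dict(labels or {}))
        self._vms[vm.id] = vm
        return vm

    def run(self, vm_id: str, command: str) -> None:
        # Recorded only; nothing is executed in this fake.
        self._vms[vm_id].commands.append(command)

    def write_file(self, vm_id: str, path: str, data: bytes) -> None:
        self._vms[vm_id].files[path] = data

    def read_file(self, vm_id: str, path: str) -> bytes:
        return self._vms[vm_id].files[path]

    def find(self, **labels: str) -> List[VM]:
        # A VM matches when it carries every requested label.
        return [vm for vm in self._vms.values()
                if all(vm.labels.get(k) == v for k, v in labels.items())]

    def delete_vm(self, vm_id: str) -> None:
        del self._vms[vm_id]
```

A typical flow would be `create_vm(...)` when a chat starts, a series of `run` and file calls while the agent works, and `delete_vm(...)` when the chat ends, with `find(chat="...")` used to locate the VM in between.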
The payoff for all of this plumbing is the thing we started with: a model that can install anything, write a program, hit an API, and hand you back an answer, all on a real computer.