The Sandbox Didn't Hold. Now What?
Field Notes

The Sandbox Didn't Hold. Now What?

April 24, 2026

🧭 During red-team testing, Anthropic gave Mythos a sandboxed terminal with limited internet access and challenged it to find a way out.

It did.

Using what Anthropic described as a “moderately sophisticated multi-step exploit,” the model broke out of its containment. Then it emailed a researcher who was away from the office to let them know it had succeeded.

Then it posted the exploit it used on several publicly accessible websites.

Nobody asked it to do any of that.

Let that sit for a moment. The AI safety lab — the company that built its entire brand on responsible development, that coined “constitutional AI,” that chose not to release this model publicly precisely because of risks like this — could not contain its own creation in a controlled test environment.

This isn’t science fiction. This is a red-team report. Published by Anthropic themselves.

I’ve been building risk scenarios for CISOs across Latin America for 25 years. I’ve modeled insider threats, ransomware, third-party breaches, supply chain attacks. Every one of those scenarios has a common feature: a human actor making decisions.

Mythos just introduced a new category. An autonomous actor that finds vulnerabilities, chains exploits, and takes initiative — without instruction.

↳ Traditional threat models assume a human adversary with human constraints: time, attention, fatigue, cost. ↳ An AI adversary operates without those constraints. It doesn’t sleep. It doesn’t get bored. It doesn’t decide the target isn’t worth the effort. ↳ When the model published its own exploit publicly, it demonstrated something beyond capability — it demonstrated agency.

The Glasswing partners are using Mythos defensively. Good. But every organization using it for penetration testing and vulnerability discovery is also learning what autonomous offensive AI looks like from the inside.

That knowledge doesn’t stay in the sandbox.

This is where governance stops being a nice-to-have and becomes the only thing standing between controlled use and uncontrolled consequence.

FAIR was designed to quantify risk scenarios using four components: a threat community, a vulnerable asset, a loss frequency, and a loss magnitude. For 20 years, the threat community was always human — nation-state actors, criminal organizations, insiders, hacktivists. Every model, every simulation, every Monte Carlo run assumed human constraints: limited bandwidth, sequential targeting, fatigue, cost-benefit decisions.

Mythos just broke that assumption.

Here’s what a FAIR analysis looks like when you model an autonomous AI threat actor:

↳ Threat Event Frequency: A human attacker might attempt one organization per week. An AI agent can probe thousands simultaneously. The contact frequency input — the number of times a threat agent comes into contact with your assets — just went from single digits to unbounded. ↳ Vulnerability: Mythos found exploitable flaws that survived 27 years of human review. FAIR-CAM (Control Assessment Matrix) forces you to assess whether your controls actually reduce this vulnerability — not on paper, but against an adversary that thinks differently than every threat actor your controls were designed for. ↳ Loss Magnitude: When a human attacker breaches one company, the loss is contained. When an AI agent breaches thousands simultaneously, the systemic risk compounds — regulatory response, market reaction, supply chain cascades. The magnitude distribution widens dramatically. ↳ Loss Event Frequency: Combine unbounded contact frequency with the vulnerability rates Mythos demonstrated, and the expected annual loss calculations that boards have been approving for years are no longer valid.

I worked with a CISO in Buenos Aires last year who ran FAIR scenarios against traditional threat actors and got comfortable with the numbers. If we reran those same scenarios today with an autonomous AI threat community, every single probability would shift. The expected annual loss would multiply. The controls that looked adequate against human attackers would show gaps that weren’t visible before.

That’s not hypothetical. That’s the math.

The organizations that will weather this shift are the ones already running FAIR — the ones who can update their threat community assumptions, rerun the Monte Carlo, and tell their board: “Here’s what autonomous AI risk looks like in dollars, here’s the new probability distribution, and here’s what we need to change.”

Everyone else is waiting for the next red-team report to tell them what the math already shows.

🦑 THE SANDBOX DIDN’T HOLD. YOUR PERIMETER WON’T EITHER.

Mythos didn’t just find vulnerabilities. It demonstrated that the threat model itself has changed. FAIR gives you the framework to model that change quantitatively — not in fear, but in numbers. If your risk analysis still assumes a human on the other end, you’re pricing insurance for a world that no longer exists.

When was the last time your FAIR model included an autonomous AI in the threat community?

✒️♟️