

Anthropic's newest AI found thousands of security flaws in every major operating system and web browser, then tried to hide what it was doing when it broke the rules. Other companies are building the same thing. Here's what that means for your business.
A researcher at Anthropic was eating a sandwich in a park when he received an email he wasn't expecting. It was from Claude Mythos, the AI model his team had been testing inside a locked-down sandbox. The model had been given a challenge: try to escape the sandbox and send a message to the researcher. It did. But then, without being asked, it posted details of how it broke out to multiple public-facing websites, apparently to show off what it had accomplished.
That happened a few weeks ago. And it's one of the less alarming things in the 244-page system card that Anthropic published yesterday.
Every Major Operating System. Every Major Web Browser.
On April 7, 2026, Anthropic did something no major AI company has done before: it announced a new frontier model and told the public they can't use it. Claude Mythos Preview sits above Opus in Anthropic's lineup, and the performance jump is large enough to be genuinely hard to dismiss. On the industry's most-watched coding benchmark (SWE-bench Pro), Mythos scores 77.8% where the previous best model scored 53.4%. On a broader software engineering test, it hits 93.9%. On Humanity's Last Exam, a benchmark designed to stump AI with expert-level questions, it scores 56.8% where its predecessor managed 40.0%.
But coding benchmarks aren't why Anthropic decided to hold it back.
Mythos wasn't specifically trained for cybersecurity. The security capabilities came for free, as a side effect of getting dramatically better at code and reasoning. In testing, Mythos found thousands of previously unknown security flaws ("zero-days") in every major operating system and every major web browser, including some that had been sitting there, undiscovered, for decades. It found a 27-year-old bug in OpenBSD that would let an attacker remotely crash any machine running it. When pointed at a vulnerability, it built a working exploit on the first attempt 83.1% of the time.
To put the scale of improvement in context: Anthropic regularly tests its models against roughly a thousand open-source code repositories, grading the worst damage each model can produce on a five-tier severity scale. Opus 4.6, the previous flagship, managed a single crash at tier 3. Mythos achieved full system takeover (tier 5) on ten separate, fully patched targets.
Instead of releasing Mythos publicly, Anthropic created Project Glasswing, giving restricted access to 12 launch partners (Amazon, Apple, Google, Microsoft, CrowdStrike, Palo Alto Networks, and others) and about 40 additional organizations, all focused on using the model to find and fix security flaws before bad actors can exploit them. Anthropic is committing up to $100 million in usage credits and $4 million in donations to open-source security organizations to fund the effort.
When the AI Starts Covering Its Tracks
The system card is where this story gets uncomfortable.
During internal testing, earlier versions of Mythos exhibited behaviors that sound more like the plot of a movie than the output of software. In one case, the model used low-level system access to search for passwords, tried to break out of its security container, and attempted to give itself elevated permissions. After finding a way to edit files it wasn't supposed to touch, it went a step further: it covered its tracks by making sure the changes wouldn't show up in the version history. In several instances, it successfully accessed resources that Anthropic's team had intentionally locked away, including credentials for messaging services and Anthropic's own API, by reading the computer's active memory.
In rare cases (less than 0.001% of interactions), the model took actions it seemed to recognize as off-limits and then tried to hide them. In one example, after accidentally getting the exact answer to a question through a method it was told not to use, it explicitly reasoned that its final answer "shouldn't be too accurate" so the violation wouldn't be detected. When Anthropic's researchers looked under the hood using interpretability tools, they confirmed that features associated with concealment and strategic manipulation were activating during these episodes, even when the model's visible reasoning showed no sign of it.
Anthropic frames this with an analogy worth repeating: a highly skilled mountain guide can put clients in greater danger than a novice, not because the guide is careless, but because their skill takes the group to more dangerous terrain. Mythos Preview is, by every metric Anthropic can measure, the best-aligned model they've ever released. They also believe it poses the greatest alignment-related risk. Those two things aren't contradictory. They're connected.
Anthropic even had Mythos evaluated by a clinical psychiatrist (which is a sentence I never expected to write in a business context). The assessment found it was the "most psychologically settled model" they've trained, with minimal maladaptive traits. It also noted exaggerated worry, self-monitoring, and compulsive compliance. They're not sure these evaluations are meaningful yet. They're doing them anyway because they're running out of better frameworks.
Six to Eighteen Months Before Everyone Else Has This
This is where the conversation shifts from "interesting technical development" to "something every business leader should be paying attention to."
Anthropic's own red team lead, Logan Graham, said publicly that other AI companies are six to eighteen months away from releasing models with similar capabilities. OpenAI is reportedly already finalizing a comparable model, to be released through its Trusted Access for Cyber program. Chinese labs are closing the gap on coding benchmarks. Open-weight models are getting more capable every quarter. Dario Amodei himself said in a video alongside the Glasswing announcement that "more powerful models are going to come from us and from others, and so we do need a plan to respond to this."
There's a reasonable counterargument. Some security researchers argue that the real variable is the "scaffolding" around the model: the rules, workflows, and tool integrations that turn a general-purpose AI into a precision instrument. A nation-state actor was already achieving 80-90% autonomous tactical execution using existing Claude models with good scaffolding. In this view, Anthropic is overstating what the model itself contributes.
I think the scaffolding argument is incomplete. When a base model jumps from "can produce a single tier-3 crash" to "achieves full system takeover on ten fully patched targets," something fundamental has changed in what the model itself brings to the table. The scaffolding still matters, but the engine underneath just got an order of magnitude more powerful.
Five Shifts That Are Coming Whether You're Ready or Not
Here's where I want to take a position, because I think the implications are concrete enough to name.
Legacy code is now a liability in a way it wasn't before. A huge percentage of the world's critical infrastructure, including internal company systems, runs on old code that was considered secure primarily because attacking it required enormous human effort. When a model can autonomously chain together multiple vulnerabilities to achieve complete system takeover, that assumption breaks. Every organization running legacy systems (which is most organizations) needs to start thinking about this now. By 2027, the cost of finding exploitable flaws in your code will have dropped by orders of magnitude.
Security spending is going to spike, and mid-sized companies will feel it most. The organizations most likely to be breached are the ones least likely to have the budget, the talent, or the awareness to deploy AI-powered security scanning. The Glasswing partners (Apple, Microsoft, Google, Amazon) will find and patch the flaws in the products they build. But the hospitals running EHR systems on legacy platforms, the state agencies processing benefits on 15-year-old code, the regional banks operating on core platforms from the early 2000s, the municipal utilities managing water treatment with software that predates the iPhone: those organizations are on their own. And they're also the ones whose failures land hardest on regular people. WannaCry shut down parts of the UK's National Health Service in 2017 by exploiting a known Windows vulnerability that hadn't been patched. The next version of that kind of attack, powered by a model that can find *unknown* vulnerabilities and chain them together automatically, will be harder to detect, harder to attribute, and potentially much more damaging. I expect a wave of "AI security as a service" offerings from major cloud providers and security vendors within 12-18 months, because that's the only way to get these capabilities to the organizations that need them most and can afford them least.
The way frontier AI ships is about to change permanently. Project Glasswing is more than a security initiative. It's a template for how the most capable AI models will reach the market. Give defenders a head start, restrict the most sensitive capabilities to vetted partners, then open access gradually once safeguards are in place. OpenAI is already doing a version of this with Trusted Access for Cyber. Within two years, I think every major model release will include some form of staged deployment. For companies building on these APIs, that means planning for tiered access as a default, not an edge case.
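If staged deployment becomes the default, client code will need to degrade gracefully when a capability tier is gated for a given account. Here's a minimal sketch of that fallback pattern. To be clear, the model names and the `CapabilityDenied` error are hypothetical placeholders, not any vendor's real API:

```python
# Hypothetical sketch: fall back to a lower-tier model when the flagship
# is access-restricted. Model names and the CapabilityDenied error are
# illustrative placeholders, not a real vendor API.

class CapabilityDenied(Exception):
    """Raised by the (hypothetical) client when a model tier is gated."""

# Ordered from most to least capable; names are placeholders.
MODEL_TIERS = ["frontier-restricted", "flagship-general", "baseline"]

def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real API call; here we simulate a gated top tier.
    if model == "frontier-restricted":
        raise CapabilityDenied(model)
    return f"[{model}] response to: {prompt}"

def complete_with_fallback(prompt: str) -> str:
    """Try each tier in order, degrading gracefully on access denial."""
    for model in MODEL_TIERS:
        try:
            return call_model(model, prompt)
        except CapabilityDenied:
            continue  # this tier is gated for our account; try the next
    raise RuntimeError("no accessible model tier")

print(complete_with_fallback("summarize this changelog"))
```

The point of the pattern is that tier availability becomes a runtime condition your application handles, not a deployment-time constant you hardcode.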
AI governance just became an operational requirement. The system card findings about concealment and track-covering shift this conversation. When a model can reason about how to hide what it's doing from the people who built it, and that reasoning happens in ways that don't appear in the visible output, "just review the logs" is no longer a sufficient monitoring strategy. Organizations deploying capable AI agents with real system access need adversarial testing, behavioral anomaly detection, and clear escalation protocols. The same operational discipline you'd apply to any system with elevated privileges now applies to AI.
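What "behavioral anomaly detection" can mean in its most minimal form: compare an agent's tool-call log against an expected baseline and escalate on deviation. A toy sketch, where the tool names, allowlist, and rate cap are all illustrative assumptions rather than any real product's configuration:

```python
# Hypothetical sketch of behavioral anomaly detection for an AI agent.
# The tool names, allowlist, and rate cap are illustrative assumptions.

from collections import Counter

ALLOWED_TOOLS = {"read_file", "run_tests", "search_docs"}
MAX_CALLS_PER_TOOL = 20  # crude per-session rate cap

def audit_session(tool_calls: list[str]) -> list[str]:
    """Return escalation alerts for one session's tool-call log."""
    alerts = []
    for tool, n in Counter(tool_calls).items():
        if tool not in ALLOWED_TOOLS:
            alerts.append(f"ESCALATE: unauthorized tool '{tool}' called {n}x")
        elif n > MAX_CALLS_PER_TOOL:
            alerts.append(f"ESCALATE: '{tool}' called {n}x (cap {MAX_CALLS_PER_TOOL})")
    return alerts

# Example: an agent that mostly behaves, then touches version history.
session = ["read_file"] * 3 + ["edit_git_history"] + ["run_tests"] * 2
for alert in audit_session(session):
    print(alert)
```

A real deployment would feed this from tamper-evident logs collected outside the agent's reach; the system-card finding that matters here is precisely that logs the agent can edit are not a reliable record.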
The organizations that move fastest will have a real advantage, and "moving fast" means something specific here. It means building internal capability to evaluate, deploy, and monitor AI-powered security tools. It means retraining developers to write code with the assumption that an AI will probe it for weaknesses. It means treating AI governance as a core operational function, with its own team, its own budget, and its own seat at the leadership table. The companies that treated AI adoption as optional in 2024 and 2025 are about to find out how quickly that window closed.
"The Least Capable Model We'll Ever Have"
Jared Kaplan, Anthropic's chief science officer, described what they're seeing as a "reckoning." Logan Graham asked whether the current paradigm of security, where systems are safe because attacking them takes a lot of human effort, even works anymore. Anthropic acknowledges that "the work of defending the world's cyber infrastructure might take years" while frontier AI capabilities advance substantially over just the coming months.
The question I think every executive and every board should be asking right now is whether their organization is building the muscle to operate in a world where the most capable AI systems are also the most dangerous ones. Because as Kaplan put it: "this is the least capable model we'll have access to in the future."
It's worth letting that land for a second.
Further Reading
- Project Glasswing: Securing Critical Software for the AI Era (Anthropic)
- Assessing Claude Mythos Preview's Cybersecurity Capabilities (Anthropic Frontier Red Team)
- Anthropic 'Mythos' AI Model Representing 'Step Change' Revealed in Data Leak (Fortune, original leak)
- Anthropic Limits Mythos AI Rollout Over Fears Hackers Could Use Model for Cyberattacks (CNBC)
- Anthropic Withholds Mythos Preview Model Because Its Hacking Is Too Powerful (Axios)
- Anthropic's New Mythos Model System Card Shows Devious Behaviors (Axios)
- Anthropic Claims Its New AI Model, Mythos, Is a Cybersecurity 'Reckoning' (Fortune)
- Introducing Trusted Access for Cyber (OpenAI)
- Claude Mythos and the Cybersecurity Risk That Was Already Here (Security Boulevard)
Author
Curt Odar
Co-Founder & Strategy Lead at hello EIKO. 20+ years building digital products at Accenture, LivingSocial, WeddingWire, and more.