June 23, 2026

Sakana Fugu: What Is Verified and What Is Not

On 22 June 2026, the Tokyo research lab Sakana AI released Sakana Fugu and a higher-tier variant, Fugu Ultra. The launch drew attention for an unusual claim: that an orchestration layer, rather than a single large model, can reach frontier-level performance. Here is what the product actually is, what the company says about it, and what remains unverified.

What Fugu is

Fugu is not a conventional standalone model. It is a multi-agent orchestration system that presents itself as a single API endpoint, compatible with the OpenAI chat-completions format. Internally, Fugu is itself a trained language model that decides whether to answer a request directly or to break it into sub-tasks, delegate those to a pool of other models, verify the results, and synthesise a final answer.

Sakana frames this as learned orchestration rather than hard-coded routing logic, building on two of its earlier research projects, TRINITY and Conductor. The practical pitch is integration without rearchitecting: you point an existing OpenAI client at the Fugu base URL and the multi-agent coordination happens behind the endpoint.

How it works under the hood

The design rests on two of Sakana's ICLR 2026 papers, TRINITY and the Conductor. TRINITY is a compact coordinator, reported as roughly a 0.6-billion-parameter language model paired with an approximately 10,000-parameter routing head, that assigns external models to Thinker, Worker, and Verifier roles across a multi-turn task: one model plans, another executes, another checks. Notably, only the small routing head is trained, and Sakana's researchers report using separable CMA-ES, an evolutionary strategy, rather than reinforcement learning for that step. The Conductor is the second strand: it uses reinforcement learning to learn natural-language coordination strategies, generating custom instructions and deciding which prior sub-results each agent should see, instead of following a hard-coded workflow.

Training the coordinator is described as a two-stage process that starts with supervised fine-tuning on a large pool of verifiable single-step tasks in coding, maths, and reasoning, each with a known correct answer, so the labels come from task outcomes rather than human annotation. The released product abstracts all of this behind a single OpenAI-compatible endpoint, so an existing OpenAI client can call it by changing the base URL.

Where the edge is

The most distinctive engineering choice concerns latency. The orchestrator emits a routing decision, not a written answer, and its own text output is discarded, which means it can dispatch work almost immediately rather than generating a full response first. Sakana describes the resulting latency as comparable to a direct call to a single frontier model, which is the practical case for putting a coordinator in front of a model pool at all.

The second differentiator is recursion. Sakana states that Fugu is itself a language model trained to call other models, including fresh instances of itself. That allows it to decompose a hard problem, spin up a sub-instance to manage a sub-problem, then verify and synthesise the parts, and to read its own prior output and launch a corrective pass, a form of test-time scaling the company calls recursive orchestration. The trade-off is cost: recursive calls mean recursive token spend, so any savings depend on how often the system recurses on a given workload.

The performance claim

Sakana reports that Fugu Ultra performs comparably to Anthropic's Fable 5 and Mythos Preview across coding, reasoning, scientific, and agentic benchmarks. On the figures Sakana published, Fugu Ultra scores 73.7 on SWE-Bench Pro and 82.1 on TerminalBench 2.1, against Fable 5's reported 80.3 and 88.0, so the gap is narrowed rather than closed. The results are also not a clean sweep within Sakana's own table: the balanced Fugu variant reportedly beats Fugu Ultra on some tests such as SciCode, which suggests deeper orchestration does not always help.

It is worth stating the comparison plainly: the two systems Fugu is most provocatively benchmarked against, Fable 5 and Mythos, are export-controlled and are not part of Fugu's own model pool. "Matching" them is therefore a claim about reaching similar output quality through substitutes, not a route to those models' actual output. Some of Sakana's head-to-head tables instead use Opus 4.8, Gemini 3.1 Pro, and GPT-5.5, which are the models it can actually call.

What has not been verified

As of the launch date, the benchmark numbers are vendor-reported. No independent party has rerun the tasks, no per-task score grid has been published, and no evaluation harness has been released. Sakana has also clarified that the competitor figures used in its comparisons are each creator's own self-reported numbers, not results measured by a neutral third party, which means the comparison is not like-for-like on test harness or effort settings.

Fugu's design also makes independent replication harder than for a single model. To reproduce its scores, an evaluator would need Fugu plus access to every underlying model it routes to, at the same versions and settings, under the same orchestration topology. None of this means the claims are wrong; it means they are currently claims rather than confirmed measurements. The reasonable posture is to wait for independent evaluation, or to run the two or three benchmarks that resemble your own workload on your own traffic.

Why the approach is interesting regardless

Independent of the headline numbers, the architectural argument stands on its own. If frontier capability can be assembled from a swappable pool of models, the value shifts from owning the single best model to coordinating the best portfolio, and a broad provider restriction, rather than a single one, becomes the real risk to such a system, since its capability is the pool.

There are open questions an adopter inherits. Reselling access to third-party proprietary models through one endpoint sits in a grey area of several providers' terms of service, and the resilience Sakana describes comes from diversity of providers rather than true independence from them. For teams weighing it, the sensible next step is to test it on representative work rather than to trust the launch-day figures.

Sakana Fugu: What Is Verified and What Is Not

What Fugu is

How it works under the hood

Where the edge is

The performance claim

What has not been verified

Why the approach is interesting regardless

Recent Blogs

Midjourney Just Built a Full Body CT Scanner and Nobody Saw It Coming

Why AI Labs Are Subsidising Power Users and What That Means for Business

Microsoft Just Made Your Azure Bill an AI Strategy Decision

Enough talk, let’s get to work

Links

Services

Contact Details