Claude Mythos & ERNIE-5.1: AI's New Leaderboard

Yesterday was a day of quiet dominance and calculated efficiency. No flashy announcements, no "revolutionary" claims—just models moving up leaderboards by doing the work where it counts. The headline? The top of the stack got heavier, and the underdogs got leaner.

Anthropic’s Claude Mythos Preview: The New Baseline

Adding this to the leaderboard: Claude Mythos Preview now sits at #1 on BenchLM.ai with a provisional score of 99. It’s not just the number; it’s the size of the lead. Mythos is the first model to break into the high 90s on a composite benchmark that tests agentic workflows, coding, and reasoning under uncertainty. For context, Gemini 3.1 Pro sits at 92 and GPT-5.5 at 91, so the gap to second place is 7 points.

The interesting split: Mythos leads in agentic and coding categories, but its reasoning score is only marginally ahead. That’s a strategic bet. Anthropic isn’t chasing pure knowledge retrieval (see: GLM-5’s MMLU dominance). They’re optimizing for work—models that don’t just answer questions but resolve them. The trade-off? Mythos is slower than GPT-5.5 on raw token output. Speed wasn’t the benchmark they prioritized.

For the record: When a model hits 99 on a composite benchmark, the next question isn’t "How much better can it get?" It’s "What are we missing?" BenchLM.ai’s scoring weights agentic tasks heavily. That’s a choice. It assumes the future of AI is less about Q&A and more about doing. Not everyone agrees.
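
To make the weighting point concrete, here is a minimal sketch of how a weighted composite can flip a ranking. Everything in it is an assumption for illustration: the category names, the weights, and the per-category scores for "model_a" and "model_b" are hypothetical, and this is not BenchLM.ai's actual formula or data.

```python
from typing import Dict, List

# Hypothetical per-category scores (0-100). NOT real leaderboard data.
SCORES: Dict[str, Dict[str, float]] = {
    "model_a": {"agentic": 95.0, "coding": 94.0, "reasoning": 88.0, "knowledge": 85.0},
    "model_b": {"agentic": 84.0, "coding": 88.0, "reasoning": 90.0, "knowledge": 95.0},
}

# Two hypothetical weighting schemes; each sums to 1.0.
AGENTIC_HEAVY = {"agentic": 0.4, "coding": 0.3, "reasoning": 0.2, "knowledge": 0.1}
KNOWLEDGE_HEAVY = {"agentic": 0.1, "coding": 0.2, "reasoning": 0.3, "knowledge": 0.4}


def composite(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted average of category scores, normalized by the total weight."""
    total_weight = sum(weights.values())
    return sum(scores[cat] * w for cat, w in weights.items()) / total_weight


def rank(all_scores: Dict[str, Dict[str, float]], weights: Dict[str, float]) -> List[str]:
    """Model names sorted from highest to lowest composite score."""
    return sorted(all_scores, key=lambda m: composite(all_scores[m], weights), reverse=True)


if __name__ == "__main__":
    for name, weights in [("agentic-heavy", AGENTIC_HEAVY), ("knowledge-heavy", KNOWLEDGE_HEAVY)]:
        print(name, rank(SCORES, weights))
```

Same two models, same category scores, different winner depending on the weights. That is the sense in which a composite score is a choice, not just a measurement.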

Baidu’s ERNIE-5.1-Preview: The Cost-Efficiency Play

Filed under: how to win by spending less.

ERNIE-5.1-Preview is now the #1 Chinese model and #13 globally on LMArena—with one-third the parameters of its predecessor. Baidu didn’t just shrink the model; they reallocated. Active parameters halved. Pre-training cost? 6% of comparable models. Yet it’s #1 globally in Legal & Government, #4 in Business Ops, and #7 in IT Services.

The benchmark tell: ERNIE-5.1 didn’t top the general leaderboards. It dominated verticals. That’s not an accident. Baidu’s betting that for enterprise use, narrow excellence beats broad competence. The trade-off? It’s weaker in math (ranked #9) and creative writing. They’re okay with that.

Worth tracking: This is the first time a non-Western lab has cracked the top 15 globally while simultaneously cutting costs this aggressively. The efficiency story here isn’t just "smaller model, same performance." It’s "We found a way to make the race cheaper."

IBM’s Granite 4.1: The Open-Source Gambit

IBM’s new Granite 4.1 family (3B, 8B, 30B) is open-source under Apache 2.0. The headline stat: The 8B instruct model matches the performance of their previous 32B MoE model (Granite 4.0-H-Small) on most tasks.

The benchmark that matters: Tool calling and instruction following. IBM didn’t just compress; they realigned. The 8B model now handles multi-step tool use as well as models 4x its size. That’s not just efficiency—it’s a redefinition of what "small" can do.

The unasked question: Why open-source this now? Two theories:

The commodity play. IBM’s betting that by the time these weights are widely adopted, the frontier will have moved on. They’re selling the shovels.
The alignment hedge. Open-sourcing forces transparency on safety and bias benchmarks. If you can’t beat the closed models on raw performance, compete on trust.

The Rest: Quantized, Specialized, and Niche

NVIDIA’s Gemma 4 26B IT NVFP4: A quantized, GPU-optimized version of Gemma 4. Not a new model, just one that runs on consumer hardware. The real story? Frontier-level multimodal performance on an RTX 4090. That’s a hardware benchmark, not a model one.
Hippocratic AI’s Polaris 5.0: 5T parameters, 99.95% on drug safety. Impressive, but healthcare is a vertical, not a general benchmark. This is a B2B play, not a frontier race.
Kimi K2.6: Now #12 on BenchLM.ai (score: 84). Not a breakthrough, but a reminder that the mid-tier is getting crowded. The gap between "good" and "great" is shrinking.
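
On the Gemma 4 NVFP4 item above: a back-of-the-envelope sketch of why 4-bit weights are what makes a single consumer card plausible. The assumptions are loud ones: the 26B parameter count is read off the model name, NVFP4 is treated as a flat 4 bits per weight (ignoring any per-block scale factors), and only weight memory is counted, not KV cache or activations. The 24 GB figure is the RTX 4090's VRAM.

```python
RTX_4090_VRAM_GB = 24.0  # RTX 4090 memory capacity


def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float = RTX_4090_VRAM_GB) -> bool:
    """True if the weights alone fit; real deployments need extra headroom."""
    return weight_memory_gb(params_billion, bits_per_weight) <= vram_gb


if __name__ == "__main__":
    # 26 (billion parameters) is an assumption taken from the model name.
    for label, bits in [("16-bit", 16), ("4-bit", 4)]:
        gb = weight_memory_gb(26, bits)
        print(f"{label}: {gb:.1f} GB, fits on one card: {fits_in_vram(26, bits)}")
```

Under these assumptions the 16-bit weights come to about 52 GB and the 4-bit weights to about 13 GB. That is a rough estimate, but it shows why this is a hardware story more than a model one.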

The Leaderboard Meta: What Moved and Why

Three shifts yesterday:

Agentic > Knowledge. Mythos’s lead on BenchLM.ai is a bet that doing things will matter more than knowing things. MMLU (GLM-5’s domain) is becoming the "legacy" benchmark.
Cost as a feature. ERNIE-5.1 and Granite 4.1 didn’t just improve scores—they rewrote the cost-performance curve. That’s the underdog move: change the rules of the game.
The open-source hedge. IBM’s release is a signal: Not everyone believes the frontier is the only race worth running.

The Record:
As of April 30, 2026, the top of the leaderboard looks like this:

Composite (BenchLM.ai): Claude Mythos Preview (99) > Gemini 3.1 Pro (92) > GPT-5.5 (91)
MMLU: GLM-5 (91.7%) > GPT-5.5 (92.5%*) [*GPT-5.5’s higher figure is disputed, which is why it is listed second; see MMLU-Redux]
Efficiency (Cost/Performance): ERNIE-5.1-Preview (one-third the parameters of its predecessor, 6% of the pre-training cost of comparable models)
Open-Source: Granite 4.1 8B (matches 32B MoE predecessor)

The numbers say one thing. Note what they don’t show: a single winner. No one’s winning everything anymore.
