Leaderboard // 14 models

The order, as I see it

Ranked by the llmbusse index: a weighted blend tilted toward the benchmarks that predict real work. Click any column to re-sort. Data as of July 2026.

#ModelLicense
1Qwen3.7-MaxAlibaba (Qwen)
88.9
928097921M$2.5$7.5closed
2GPT-5.2OpenAI
88.7
9280100400K$1.75$14closed
3DeepSeek-V4-ProDeepSeek
88.1
90818895941M$0.435$0.87open
4Gemini 3.1 ProGoogle DeepMind
87.7
9481911M$2$12closed
5Gemini 3 ProGoogle DeepMind
86.5
927690951M$2$12closed
6Grok 4xAI
86.3
889279256K$3$15closed
7Claude Opus 4.8Anthropic
86.2
9489691M$5$25closed
8Gemini 3 FlashGoogle DeepMind
86.1
9078951M$0.5$3closed
9GPT-5.5OpenAI
85.7
9481831M$5$30closed
10DeepSeek-V4-FlashDeepSeek
85.2
887986921M$0.14$0.28open
11Gemini 3.1 Flash-LiteGoogle DeepMind
83.4
8789721M$0.25$1.5closed
12DeepSeek-V3.2DeepSeek
78.8
82708584164K$0.28$0.42open
13DeepSeek-R1-0528DeepSeek
74.4
8158858873128K$0.55$2.19open
14Llama 4 MaverickMeta
65.5
7081431M$0.2$0.6open

Every number is sourced on each model page. Disagree with the weighting? The method is public.

Won't show their working // 7

Off the board

Strong or brand-new models I can't rank, because their makers stopped publishing the comparable benchmarks. Listed so you know they exist, not scored because I won't fake it.

  • Claude Fable 5 Anthropic

    The most capable Claude, and top of the independent intelligence index at launch. It is off the board here because Anthropic stopped publishing the comparable suite, and because its safety classifiers refuse enough real work that even benchmarking it is a fight. Powerful, gated, expensive.

  • Claude Sonnet 5 Anthropic

    Near-Opus agentic muscle at half the Opus price, and the one I would hand most day-jobs. Off the board only because it is days old and reports vendor evals, not the standard suite. Watch this one.

  • GPT-5.6 Sol OpenAI

    OpenAI's newest, and you probably cannot have it: limited preview, rollout throttled at the government's request. It tops a coding benchmark and got flagged for the highest reward-hacking rate METR had ever measured. Read that headline number with tongs.

  • GLM-5.2 Z.ai (Zhipu AI)

    The strongest open-weights model on the independent index, MIT-licensed, and a genuine problem for the closed labs. Off the board here only because Z.ai skipped SWE-bench Verified, MMLU-Pro and LiveCodeBench, so I cannot place it fairly against the rest.

  • Grok 4.3 xAI

    Cheap, fast, and genuinely good at tool use. Off the board because xAI now publishes almost nothing you can compare, leaning on their own agentic metrics instead. Trust it for agents, verify everything else.

  • Gemini 3.5 Flash Google DeepMind

    Google's newest Flash, tuned for agents and, by their own tables, beating last year's Pro. Off the board because they withheld the standard academic evals. The vendor charts look great; I want the neutral ones.

  • GPT-5.3-Codex OpenAI

    The Codex-line coding specialist, now being quietly retired in favour of the general models. Good SWE-bench, thin on everything else, which is why it sits here rather than on the board.