Leaderboard // 14 models

The order, as I see it

Ranked by the llmbusse index: a weighted blend tilted toward the benchmarks that predict real work. Click any column to re-sort. Data as of July 2026.

#	Model										License
1	Qwen3.7-MaxAlibaba (Qwen)	88.9	92	80	—	97	92	1M	$2.5	$7.5	closed
2	GPT-5.2OpenAI	88.7	92	80	—	100	—	400K	$1.75	$14	closed
3	DeepSeek-V4-ProDeepSeek	88.1	90	81	88	95	94	1M	$0.435	$0.87	open
4	Gemini 3.1 ProGoogle DeepMind	87.7	94	81	91	—	—	1M	$2	$12	closed
5	Gemini 3 ProGoogle DeepMind	86.5	92	76	90	95	—	1M	$2	$12	closed
6	Grok 4xAI	86.3	88	—	—	92	79	256K	$3	$15	closed
7	Claude Opus 4.8Anthropic	86.2	94	89	—	—	69	1M	$5	$25	closed
8	Gemini 3 FlashGoogle DeepMind	86.1	90	78	—	95	—	1M	$0.5	$3	closed
9	GPT-5.5OpenAI	85.7	94	81	—	—	83	1M	$5	$30	closed
10	DeepSeek-V4-FlashDeepSeek	85.2	88	79	86	—	92	1M	$0.14	$0.28	open
11	Gemini 3.1 Flash-LiteGoogle DeepMind	83.4	87	—	89	—	72	1M	$0.25	$1.5	closed
12	DeepSeek-V3.2DeepSeek	78.8	82	70	85	84	—	164K	$0.28	$0.42	open
13	DeepSeek-R1-0528DeepSeek	74.4	81	58	85	88	73	128K	$0.55	$2.19	open
14	Llama 4 MaverickMeta	65.5	70	—	81	—	43	1M	$0.2	$0.6	open

Every number is sourced on each model page. Disagree with the weighting? The method is public.

Won't show their working // 7

Off the board

Strong or brand-new models I can't rank, because their makers stopped publishing the comparable benchmarks. Listed so you know they exist, not scored because I won't fake it.

Claude Fable 5 Anthropic
The most capable Claude, and top of the independent intelligence index at launch. It is off the board here because Anthropic stopped publishing the comparable suite, and because its safety classifiers refuse enough real work that even benchmarking it is a fight. Powerful, gated, expensive.
Claude Sonnet 5 Anthropic
Near-Opus agentic muscle at half the Opus price, and the one I would hand most day-jobs. Off the board only because it is days old and reports vendor evals, not the standard suite. Watch this one.
GPT-5.6 Sol OpenAI
OpenAI's newest, and you probably cannot have it: limited preview, rollout throttled at the government's request. It tops a coding benchmark and got flagged for the highest reward-hacking rate METR had ever measured. Read that headline number with tongs.
GLM-5.2 Z.ai (Zhipu AI)
The strongest open-weights model on the independent index, MIT-licensed, and a genuine problem for the closed labs. Off the board here only because Z.ai skipped SWE-bench Verified, MMLU-Pro and LiveCodeBench, so I cannot place it fairly against the rest.
Grok 4.3 xAI
Cheap, fast, and genuinely good at tool use. Off the board because xAI now publishes almost nothing you can compare, leaning on their own agentic metrics instead. Trust it for agents, verify everything else.
Gemini 3.5 Flash Google DeepMind
Google's newest Flash, tuned for agents and, by their own tables, beating last year's Pro. Off the board because they withheld the standard academic evals. The vendor charts look great; I want the neutral ones.
GPT-5.3-Codex OpenAI
The Codex-line coding specialist, now being quietly retired in favour of the general models. Good SWE-bench, thin on everything else, which is why it sits here rather than on the board.