the current order // as of July 2026

Someone has to say which model is actually best.

So here it is: the frontier models ranked on the benchmarks that predict real work, plus my verdict on each. Dated, sourced, and happy to be wrong out loud.

01	Qwen3.7-Max	88.9
02	GPT-5.2	88.7
03	DeepSeek-V4-Pro	88.1
04	Gemini 3.1 Pro	87.7
05	Gemini 3 Pro	86.5

#1 · the take

Qwen3.7-Max. Frontier scores across the board, and hardly anyone in the West is talking about it. Two catches: the weights are closed despite Alibaba's open Qwen line, and it is a token furnace, so the real bill runs higher than the list price suggests.

Leaderboard // top 8 of 14

#	Model	Vendor	Score						Context			License
1	Qwen3.7-Max	Alibaba (Qwen)	88.9	92	80	—	97	92	1M	$2.5	$7.5	closed
2	GPT-5.2	OpenAI	88.7	92	80	—	100	—	400K	$1.75	$14	closed
3	DeepSeek-V4-Pro	DeepSeek	88.1	90	81	88	95	94	1M	$0.435	$0.87	open
4	Gemini 3.1 Pro	Google DeepMind	87.7	94	81	91	—	—	1M	$2	$12	closed
5	Gemini 3 Pro	Google DeepMind	86.5	92	76	90	95	—	1M	$2	$12	closed
6	Grok 4	xAI	86.3	88	—	—	92	79	256K	$3	$15	closed
7	Claude Opus 4.8	Anthropic	86.2	94	89	—	—	69	1M	$5	$25	closed
8	Gemini 3 Flash	Google DeepMind	86.1	90	78	—	95	—	1M	$0.5	$3	closed
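
The "token furnace" catch in the #1 take is easier to see with arithmetic. The sketch below is illustrative only and rests on two assumptions that the page does not state: that the two dollar columns in the leaderboard are USD per million input tokens and USD per million output tokens, in that order, and that a verbose model emits several times more output for the same prompt. The token counts are made up, not measured, and `request_cost` is a hypothetical helper.

```python
# Illustrative sketch, not a measurement.
# ASSUMPTION: the leaderboard's two dollar columns are
# (USD per 1M input tokens, USD per 1M output tokens).

def request_cost(input_tokens, output_tokens, price_in, price_out):
    """Cost in USD of one request, with prices given per million tokens."""
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000

# Prices copied from the leaderboard rows.
QWEN_37_MAX = (2.5, 7.5)
GPT_52 = (1.75, 14.0)

# HYPOTHETICAL token counts: same 2,000-token prompt,
# a terse 1,000-token answer vs. a verbose 4,000-token answer.
terse = request_cost(2_000, 1_000, *GPT_52)
verbose = request_cost(2_000, 4_000, *QWEN_37_MAX)

print(f"terse answer at GPT-5.2 prices:       ${terse:.4f}")
print(f"verbose answer at Qwen3.7-Max prices: ${verbose:.4f}")
```

Under these made-up counts the verbose request costs $0.0350 against $0.0175, twice as much despite the lower per-token output price. The real ratio depends on how many tokens each model actually emits on your workload, which is the number worth measuring before trusting any list price.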

Signal // curated

Artificial Analysis — independent model intelligence index
The independent numbers I sanity-check my own ranking against. When we disagree, one of us is measuring the wrong thing.
LMArena — human-preference battle leaderboard
Crowd preference, not capability. Useful for vibes and formatting, misleading if you read it as raw intelligence.
SWE-bench — the software-engineering benchmark
Still the closest thing to a real job interview for a coding model. Watch the Verified split, ignore the marketing numbers.
Epoch AI — trends in compute, data, and capability
For the long arc rather than the launch-day noise. Good antidote to release hype.