Someone has to say which model is actually best.
So here it is: the frontier models ranked on the benchmarks that predict real work, plus my verdict on each. Dated, sourced, and happy to be wrong out loud.
- 01 Qwen3.7-Max 88.9
- 02 GPT-5.2 88.7
- 03 DeepSeek-V4-Pro 88.1
- 04 Gemini 3.1 Pro 87.7
- 05 Gemini 3 Pro 86.5
Qwen3.7-Max. Frontier scores across the board, and hardly anyone in the West is talking about it. Two catches: the weights are closed, despite Alibaba's open Qwen line, and it is a token furnace, so the real bill runs higher than the list price suggests.
| # | Model | Score | | | | | | Context | Input price | Output price | License |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | Qwen3.7-Max, Alibaba (Qwen) | 88.9 | 92 | 80 | — | 97 | 92 | 1M | $2.5 | $7.5 | closed |
| 2 | GPT-5.2, OpenAI | 88.7 | 92 | 80 | — | 100 | — | 400K | $1.75 | $14 | closed |
| 3 | DeepSeek-V4-Pro, DeepSeek | 88.1 | 90 | 81 | 88 | 95 | 94 | 1M | $0.435 | $0.87 | open |
| 4 | Gemini 3.1 Pro, Google DeepMind | 87.7 | 94 | 81 | 91 | — | — | 1M | $2 | $12 | closed |
| 5 | Gemini 3 Pro, Google DeepMind | 86.5 | 92 | 76 | 90 | 95 | — | 1M | $2 | $12 | closed |
| 6 | Grok 4, xAI | 86.3 | 88 | — | — | 92 | 79 | 256K | $3 | $15 | closed |
| 7 | Claude Opus 4.8, Anthropic | 86.2 | 94 | 89 | — | — | 69 | 1M | $5 | $25 | closed |
| 8 | Gemini 3 Flash, Google DeepMind | 86.1 | 90 | 78 | — | 95 | — | 1M | $0.5 | $3 | closed |
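
To make the "token furnace" point concrete, here is a rough sketch of how output verbosity changes what a task costs. It assumes the two dollar columns are input and output prices per 1M tokens (the table does not state the unit), and the token counts are made up for illustration, not measured.

```python
# Prices in USD copied from the table above.
# ASSUMPTION: they are (input, output) prices per 1M tokens.
PRICES = {
    "Qwen3.7-Max": (2.5, 7.5),
    "GPT-5.2": (1.75, 14.0),
    "DeepSeek-V4-Pro": (0.435, 0.87),
}


def task_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the USD cost of one task for the given token counts."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


# HYPOTHETICAL workload: a 20K-token prompt answered in 5K tokens,
# versus the same prompt answered by a model that emits 3x the output.
terse_qwen = task_cost("Qwen3.7-Max", 20_000, 5_000)
verbose_qwen = task_cost("Qwen3.7-Max", 20_000, 15_000)
terse_gpt = task_cost("GPT-5.2", 20_000, 5_000)

print(f"Qwen3.7-Max, 5K output:  ${terse_qwen:.4f}")
print(f"Qwen3.7-Max, 15K output: ${verbose_qwen:.4f}")
print(f"GPT-5.2, 5K output:      ${terse_gpt:.4f}")
```

Under these invented numbers, the cheaper list price stops being cheaper once the model needs three times the output tokens to finish the same task. Swap in your own measured token counts before drawing any conclusion.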
- Artificial Analysis — independent model intelligence index
The independent numbers I sanity-check my own ranking against. When we disagree, one of us is measuring the wrong thing.
- LMArena — human-preference battle leaderboard
Crowd preference, not capability. Useful for vibes and formatting, misleading if you read it as raw intelligence.
- SWE-bench — the software-engineering benchmark
Still the closest thing to a real job interview for a coding model. Watch the Verified split, ignore the marketing numbers.
- Epoch AI — trends in compute, data, and capability
For the long arc rather than the launch-day noise. Good antidote to release hype.