Benchmarks // 5

The benchmarks, and what they're worth

Every score on this site comes from one of these. Here is what each one measures, and how much I actually let it move the ranking.

SWE-bench Verified

0-100% of issues resolved

Whether a model can resolve real GitHub issues in real Python repos: read the codebase, write a patch, pass the hidden tests. The Verified subset is a human-checked slice where the task is known to be solvable.

the take The first number I look at for coding. It maps closest to real work, and the gap between the top labs here is still real, not rounding error.

GPQA Diamond

0-100% accuracy

Graduate-level science questions in biology, physics, and chemistry, hard enough that PhDs from outside the subfield mostly get them wrong even with Google. Diamond is the hardest, cleanest subset.

the take The reasoning number I trust most at the top. Strong here usually means strong where it counts.

MMLU-Pro

0-100% accuracy

A harder, cleaner rebuild of MMLU: broad multiple-choice knowledge across 14 subject areas, with ten options instead of four and the noisiest questions thrown out.

the take Table stakes now. Below the high 80s you are not in the conversation; above that the number stops telling you much. I read it as a floor, not a ranking.
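
One way to picture "a floor, not a ranking" is a gate that decides who is eligible, with the ordering left to a different score. The sketch below is only an illustration: the 87.0 threshold is a placeholder for "high 80s", ordering by SWE-bench is an assumption, and the model names and numbers are made up.

```python
# Placeholder for "high 80s"; not a threshold taken from this site.
MMLU_PRO_FLOOR = 87.0

def rank_with_floor(models, floor=MMLU_PRO_FLOOR):
    """Drop models under the MMLU-Pro floor, then order the rest by another score."""
    eligible = [m for m in models if m["mmlu_pro"] >= floor]
    # MMLU-Pro plays no part in the ordering; it only decides who is eligible.
    # Sorting by SWE-bench is an assumption made for this example.
    return sorted(eligible, key=lambda m: m["swe_bench"], reverse=True)

# Hypothetical models with invented scores.
models = [
    {"name": "model_a", "mmlu_pro": 91.0, "swe_bench": 60.0},
    {"name": "model_b", "mmlu_pro": 88.0, "swe_bench": 72.0},
    {"name": "model_c", "mmlu_pro": 80.0, "swe_bench": 75.0},
]

ranked_names = [m["name"] for m in rank_with_floor(models)]
print(ranked_names)
```

Here model_a has the highest MMLU-Pro score but still ranks below model_b, and model_c is excluded despite the best coding score.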

AIME (math)

0-100% solved

Competition math from the American Invitational Mathematics Examination: problems whose answers are integers from 0 to 999, each needing several exact reasoning steps, with no partial credit.

the take I weight it but do not worship it. Tool use and heavy sampling can inflate it, so I read it next to the reasoning scores, not on its own.
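
To make the sampling point concrete, here is a minimal sketch of the widely used unbiased pass@k estimator: given n sampled answers of which c are correct, it estimates the chance that at least one of k samples is right. The counts (64 samples, 16 correct) are hypothetical, and nothing here is meant to describe how any particular lab reports its AIME numbers.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples is correct) from c correct out of n samples."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) counts the k-subsets made up only of wrong answers;
    # it is 0 when fewer than k wrong answers exist.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical problem: 64 samples drawn, 16 of them correct.
single_attempt = pass_at_k(64, 16, 1)
best_of_eight = pass_at_k(64, 16, 8)
print(f"pass@1 = {single_attempt:.2f}, pass@8 = {best_of_eight:.2f}")
```

The same model on the same problem looks much stronger at k=8 than at k=1, which is why a headline score means little without knowing how many attempts sat behind it.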

LiveCodeBench (coding)

0-100% pass rate

Competitive-programming problems collected continuously from contest sites, so the questions post-date most training cutoffs. Write code, pass the tests.

the take Rewards speed and pattern recall more than architecture. Read it beside SWE-bench, never instead of it.
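
The "post-date the training cutoff" idea comes down to a date filter. The sketch below assumes each problem carries a release date and that you know, or are willing to guess, the model's cutoff. The `Problem` type, the IDs, and the dates are all hypothetical, not LiveCodeBench's actual schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Problem:
    problem_id: str   # hypothetical identifier
    released: date    # date the contest problem was published

def post_cutoff(problems, training_cutoff):
    """Keep only problems published strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]

# Made-up problems and a made-up cutoff, for illustration only.
problems = [
    Problem("contest-001", date(2023, 11, 5)),
    Problem("contest-002", date(2024, 3, 17)),
    Problem("contest-003", date(2024, 8, 2)),
]
cutoff = date(2024, 1, 1)

fresh_ids = [p.problem_id for p in post_cutoff(problems, cutoff)]
print(fresh_ids)
```

A score computed only on the filtered set is harder to explain by memorization, though a later cutoff also leaves fewer problems to score on.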