MMLU-Pro

0-100% accuracy

what it measures

A harder, cleaner rebuild of MMLU: broad multiple-choice knowledge across dozens of subjects, with ten options instead of four and the noisiest questions thrown out.

why it matters

It is a broad-knowledge floor. Useful for catching a model that is narrow, less useful for separating the leaders.

the take

Table stakes now. Below the high 80s you are not in the conversation; above it the number stops telling you much. I read it as a floor, not a ranking.

Source: https://arxiv.org/abs/2406.01574

See it on the leaderboard