GPQA Diamond

0-100% accuracy

what it measures

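Graduate-level multiple-choice science questions in biology, physics, and chemistry, hard enough that PhDs from outside the subfield mostly get them wrong even with unrestricted web access. Diamond is the hardest, cleanest subset: the questions where expert validators agreed on the answer and most non-experts missed it.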
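To make the 0-100% number concrete, here is a minimal sketch of how an accuracy score of this kind could be computed. It is not the benchmark's official scoring code. The function name `accuracy_percent` and the toy answers are hypothetical. Since each question offers four options, random guessing would land near 25%, so that is the floor to read scores against.

```python
def accuracy_percent(predictions, answers):
    """Return the percentage of questions answered correctly (0-100)."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("need at least one question")
    # Count exact matches between predicted and correct option letters.
    correct = sum(p == a for p, a in zip(predictions, answers))
    return 100.0 * correct / len(answers)


# Hypothetical toy data: four questions, each with options A-D.
answers = ["B", "D", "A", "C"]
predictions = ["B", "A", "A", "C"]

score = accuracy_percent(predictions, answers)
print(f"{score:.1f}%")  # 75.0%
```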

why it matters

Looking the answers up does not get you through it, and it still spreads frontier models out instead of pinning them all near the ceiling.

the take

The reasoning score I trust most for comparing models at the top of the leaderboard. A strong result here usually means a model is strong where it counts.

Source: https://arxiv.org/abs/2311.12022
