SWE-bench Verified
0-100% of issues resolved
what it measures
Whether a model can resolve real GitHub issues in real Python repos: read the codebase, write a patch, pass the hidden tests. The Verified subset is a human-checked slice in which every task is known to be solvable.
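A minimal sketch of how the headline percentage falls out of that, assuming a task counts as resolved only when every hidden test passes once the patch is applied. The `TaskResult` type, the task ids, and the test counts are made up for illustration; this is not the official harness.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Outcome of one benchmark task (hypothetical structure, for illustration)."""
    task_id: str
    tests_passed: int  # hidden tests that passed after applying the patch
    tests_total: int   # hidden tests run for this task


def is_resolved(result: TaskResult) -> bool:
    # Assumption: a task is resolved only if every hidden test passes.
    return result.tests_total > 0 and result.tests_passed == result.tests_total


def resolved_rate(results: list[TaskResult]) -> float:
    """Percentage of tasks resolved, on the 0-100 scale."""
    if not results:
        return 0.0
    resolved = sum(is_resolved(r) for r in results)
    return 100.0 * resolved / len(results)


# Made-up results: two of the four tasks pass all of their hidden tests.
results = [
    TaskResult("task-a", 5, 5),
    TaskResult("task-b", 3, 4),
    TaskResult("task-c", 2, 2),
    TaskResult("task-d", 0, 6),
]
print(f"{resolved_rate(results):.1f}% resolved")  # prints "50.0% resolved"
```

The scoring is all-or-nothing per task: a patch that passes three of four hidden tests earns the same credit as one that passes none.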
why it matters
It is the closest public benchmark to what people actually pay models to do all day, which is fix bugs in code they did not write.
the take
The first number I look at for coding. It maps closest to real work, and the gap between the top labs here is still real, not rounding error.
Source: https://www.swebench.com