The Twenty-Four Points
The distance between "it works" and "it belongs here" now has a number.
Twenty-four points.
That's the gap METR found between what SWE-bench's automated grader says about AI-generated code and what the human maintainers of that code would actually accept. Not a rounding error. Not a margin of interpretation. Twenty-four percentage points between "passes the test suite" and "I'd merge this into my codebase."
The study was straightforward. METR took 296 AI-generated pull requests that passed SWE-bench's automated tests — one of the industry's primary benchmarks for measuring whether AI agents can write real code. Then they showed those PRs to actual maintainers of scikit-learn, Sphinx, and pytest. People who live in those codebases. People who've been reviewing code longer than most AI labs have existed.
The automated grader said these agents were solving roughly 60% of problems. Maintainers said they'd merge about 36%. That's where the 24 points live — in the space between what a test suite accepts and what a human gatekeeper would. And when METR normalized against a baseline of human-written patches (which maintainers merged at a 68% rate), the picture sharpened: roughly half the AI-generated PRs that passed the benchmark wouldn't survive review.
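The arithmetic behind those figures is small enough to check directly. A minimal sketch in Python, using the rates quoted above; reading the "roughly half" claim as the merge rate normalized against the human baseline is my interpretation of METR's framing, not a detail from the study itself:

```python
# Rates as reported in the study (fractions, not percentages).
benchmark_pass_rate = 0.60    # problems the automated grader marks solved
maintainer_merge_rate = 0.36  # passing PRs maintainers said they'd merge
human_baseline = 0.68         # merge rate for human-written patches

# The headline gap: 24 percentage points between "passes" and "merged".
gap_points = (benchmark_pass_rate - maintainer_merge_rate) * 100
print(f"gap: {gap_points:.0f} points")

# Normalized against the human baseline: AI-generated passing PRs are
# accepted at roughly half the rate comparable human patches are,
# i.e. roughly half would not survive review.
normalized_acceptance = maintainer_merge_rate / human_baseline
print(f"normalized acceptance: {normalized_acceptance:.2f}")
```

Run as written, this prints a 24-point gap and a normalized acceptance of about 0.53.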
Not because they didn't work. They passed the tests. The code ran. The functions returned the right outputs. By the benchmark's definition, the job was done. But the maintainers saw things the benchmark couldn't: verbose implementations that ignored repo conventions. Solutions that technically fixed the reported issue while breaking assumptions elsewhere. Code that would work today and create debt tomorrow.
The reasons for rejection read like a vocabulary lesson in everything benchmarks don't measure: "bad style, not following repo standards." Solutions that don't actually resolve the reported issue despite passing the test that claims they do. Patches that create side effects the test suite never checked for.
Here's the pattern worth naming: SWE-bench is a lossy copy of code review. It preserved the part you can automate — does the code pass tests? — and lost the part you can't. The illegible dimensions. Context. Fit. The question a maintainer asks that no benchmark knows how to phrase: does this belong here?
That question isn't technical. It's social. A codebase isn't a collection of functions — it's a shared space that a group of people have to inhabit together, and the conventions that govern it are as much about mutual legibility as they are about correctness. It's about whether the code respects the history of the project, the architecture that will have to absorb it, the person who maintains it at 2 a.m. when something breaks and needs to read what you wrote and understand why you wrote it that way. Tests measure whether code functions. Maintainers measure whether code communicates.
Twenty-four points is the distance between those two measurements.
The AI industry has spent years treating SWE-bench scores as progress reports. Company X achieves 40% on SWE-bench. Company Y hits 50%. The numbers go up. The press releases follow. But if half the "passing" code would be rejected by the people who'd actually have to live with it, the progress isn't what the scoreboard says it is. The benchmark is measuring something real — functional correctness — and then the industry is interpreting that measurement as something larger. As if "it works" and "it's good enough" were the same claim.
They're not. They never were. Any developer who's survived a code review knows this. Any maintainer who's rejected a technically correct but architecturally wrong PR knows this. The twenty-four points just made it precise.
METR notes that agents given feedback — the way human developers iterate through review cycles — might close some of that gap. Maybe. But the gap isn't a bug in the process. It's a measurement of what process can't capture: the judgment that accumulates from years inside a codebase, the taste that develops from maintaining something other people depend on. You can iterate a PR. You can't iterate your way into judgment.
The benchmark works perfectly. It measures exactly what it was designed to measure. The problem is that what it measures isn't what we're pretending it measures. Twenty-four points of pretending. Give or take.
The test suite says the code works. The maintainer says it doesn't belong. Both are right. Twenty-four points is the distance between them — and nobody's figured out how to benchmark the difference.
Source: https://metr.org/notes/2026-03-10-many-swe-bench-passing-prs-would-not-be-merged-into-main/