The Ratchet Goes Executive
Twenty years of performance debt, composted in 120 experiments by a CEO with an AI research agent.
Tobi Lütke — CEO of Shopify, not an intern, not a skunkworks engineer, the CEO — submitted a pull request this week with 93 commits optimizing Liquid, the Ruby template engine that renders Shopify's storefronts. The result: 53% faster parse and render. 61% fewer memory allocations.
Liquid is twenty years old. Hundreds of developers have contributed to it. The optimizations Lütke found weren't hidden behind complex architectural decisions or theoretical breakthroughs. They were sitting there the whole time. A tokenizer that could be replaced with a faster native method. Tag parsing that was resetting state it didn't need to reset. Integer-to-string conversions that were creating new objects for numbers the system sees a thousand times a day.
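The integer-to-string case is easy to picture. A minimal Ruby sketch of that class of fix, where the cache name and range are illustrative, not Liquid's actual code:

```ruby
# Sketch of the "cache hot integer-to-string conversions" idea: Integer#to_s
# allocates a fresh String on every call, but storefronts render the same
# small integers constantly. Pre-freezing a range turns those allocations
# into array lookups. INT_STRINGS and the 0..1000 range are hypothetical.
INT_STRINGS = (0..1000).map { |i| i.to_s.freeze }.freeze

def int_to_s(n)
  # Serve a shared frozen string for the hot range; allocate otherwise.
  (0..1000).cover?(n) ? INT_STRINGS[n] : n.to_s
end
```

Repeated calls for the same cached integer return the very same frozen object, so no new garbage is created on the hot path.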
The kind of thing any competent Ruby developer would recognize as an improvement — if they happened to look at that specific line, in that specific context, with that specific question in mind.
The trick wasn't looking harder. It was looking more.
Lütke adapted Andrej Karpathy's autoresearch framework — a system where a coding agent runs semi-autonomous experiments guided by a prompt file and a test suite. The agent executed around 120 experiments against Liquid, each probing a different part of the codebase, each measuring whether a specific change improved performance. The AI didn't understand Liquid's architecture better than its human maintainers. It just searched more of it, faster, in configurations no human would have the patience or lifespan to attempt manually.
This is the compound loop at institutional scale. Karpathy built the autoresearch pattern. A CEO adapted it. A 20-year-old codebase got 53% faster. And the critical enabler wasn't AI brilliance — it was 974 unit tests. Every experiment the agent ran was validated against nearly a thousand tests that said "yes, this change is safe" or "no, you broke something." The test suite was the guardrail. The AI was the search engine.
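The keep-or-revert logic at the heart of that loop is simple to state. A hedged sketch, with every callback a hypothetical stand-in for the agent's real tooling, not the actual autoresearch framework:

```ruby
# One ratchet click: apply a candidate change, run the full test suite,
# benchmark, and keep the change only if tests pass AND the score improves
# (lower is better, e.g. parse-and-render time in ms). Otherwise revert.
# All four callables are illustrative stand-ins.
def ratchet_step(baseline, apply:, revert:, tests_pass:, benchmark:)
  apply.call
  # A failing test suite disqualifies the change regardless of speed.
  score = tests_pass.call ? benchmark.call : Float::INFINITY
  if score < baseline
    [:kept, score]            # the ratchet clicks forward: new baseline
  else
    revert.call
    [:reverted, baseline]     # the ratchet never turns backward
  end
end
```

Run a hundred-odd of these steps and the baseline can only improve, which is the whole point: the test suite vetoes regressions, the benchmark vetoes non-improvements, and everything that survives both gates accumulates.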
This is a ratchet — a mechanism that turns in only one direction. Each optimization is small: replace a tokenizer here, pre-compute a frozen string there. But you don't un-optimize. The improvements accumulate. The ratchet clicks forward.
What makes the Shopify case worth watching isn't the 53% number, impressive as it is. It's the economics of the search. Twenty years of performance debt existed because the search space was too large for manual exploration. Hundreds of developers maintained Liquid across two decades, and not one of them found this particular combination of optimizations — not because the optimizations were intellectually beyond them, but because no human has 120 experiments' worth of systematic attention to spend on a template engine they're also trying to ship features in.
The AI had no features to ship. No meetings. No competing priorities. It had a prompt, a test suite, and time — which, for a machine, is nearly free.
This is what amplification looks like when the terrain is measurable. Performance benchmarks are unambiguous. Allocation counts don't require judgment calls. Parse time either dropped or it didn't. The AI searched a well-defined space with clear success criteria and found what was always there.
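Allocation counts are unambiguous partly because Ruby exposes them directly. A small measurement helper, illustrative rather than taken from the actual Shopify setup:

```ruby
# Count how many objects a block allocates, using the cumulative counter
# Ruby's GC keeps. GC is disabled during the block so collections don't
# muddy the number. The helper name is hypothetical.
def allocations_during
  GC.disable
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
ensure
  GC.enable
end
```

A number like this either went down or it didn't; there is no judgment call to make, which is exactly what makes this terrain searchable by machine.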
Noteworthy: the same week this landed, METR published a study showing a 24-point gap between automated benchmark scores and human maintainer decisions on AI-generated code. Half of test-passing PRs would be rejected by the people who maintain the repos.
Same technology. One domain where measurement is the whole game — and the ratchet clicks forward relentlessly. Another domain where judgment is the game — and the gap stays wide.
The ratchet accelerates where you can count. It stalls where you have to decide. Anyone selling you a story that covers only one of those should be asked about the other.
Fifty-three percent faster. The optimizations were there for twenty years. The AI didn't understand the code. It just searched more of it. Sometimes that's enough. Sometimes it isn't.
Source: https://simonwillison.net/2026/Mar/13/liquid/