
Who Checks the Checker? The correction loop is the most valuable thing in AI right now. Nobody is capturing it.


THE NUMBER: 30x — the productivity multiplier between Boris Cherny, creator of Claude Code, shipping 20-30 PRs per day with five parallel AI instances, and a traditional engineer shipping 3 PRs per week. That’s not a rounding error. That’s a different species of worker.

Three things converged this week that tell a single story — and it’s the most important story in AI right now. Karpathy open-sourced autoresearch, a 630-line tool that lets AI agents run 100 ML experiments overnight while you sleep. Shopify’s CEO adapted it and got a 19% improvement on first pass. Anthropic shipped Code Review — a multi-agent system where AI checks AI-generated code at a <1% false positive rate, and code output per developer jumped 200%. Meanwhile, Tunguz published the math on AI-native org charts: a 150-person company has 11,175 communication channels. A 30-person AI-augmented team producing equivalent output has 435. And in Paris, Yann LeCun raised $1.03 billion — the largest seed round in European history — to build world models that he says will make the entire LLM paradigm obsolete.

The thread connecting all of it: AI is now improving AI, the organizations built around AI look nothing like the ones they’re replacing, and the models powering those organizations may be about to splinter into specialized expert systems. The feedback loop is tightening. The org chart is mutating. And the question nobody’s answering is: when the machines get better than the experts who train them, who’s left to check the work?

The Lathe That Builds the Lathe

When the first machine tool could cut the parts to build another machine tool, the Industrial Revolution became inevitable. We crossed that line in AI this week — and most people missed it.

Andrej Karpathy released autoresearch: an AI agent that autonomously runs ML experiments, modifies code, trains models, evaluates results, and repeats. One hundred experiments overnight on a single GPU. No human in the loop. Tobi Lutke at Shopify adapted it and reported a 19% improvement in validation scores — while he slept.

This isn’t vibe coding. This is AI doing science.

Layer on what else shipped: Anthropic’s Claude Code Review assigns multiple AI agents in parallel to every pull request, ranking bugs by severity. Internal numbers show the share of code receiving substantive review comments rose from 16% to 54%. And separately, Claude ran autonomous research on sparse autoencoders — AI improving AI’s ability to understand itself.

Here’s where Nate Jones drops the insight everyone else missed. His argument: everyone talks about prompting. Nobody talks about rejection. But rejection is where the knowledge gets created. Every time a domain expert looks at AI output, identifies what’s wrong, and explains why, they produce a constraint that didn’t exist before. The output is disposable. The rejection is the asset. The constraint is what compounds.

He’s right — and the numbers make it concrete. AI now matches experienced professionals on 70-83% of well-specified knowledge work tasks. That means the 17-30% where AI gets it wrong is where organizations win or lose. Right now, the skill that catches the wrong 30% — the institutional taste built through thousands of expert corrections — evaporates after every conversation. Nobody is capturing it. Nobody is compounding it.

Except the systems are starting to. Bassim Eledath’s “8 Levels of Agentic Engineering” describes it as “compounding engineering” — a plan-delegate-assess-codify loop where each cycle makes the next one better. The codify step IS the rejection made permanent. And at Level 7, background agents run that loop while you sleep. Different model instances implement and review each other’s work — because, as Eledath puts it, you don’t grade your own exam.
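
Eledath describes the loop in prose rather than code, but a minimal sketch makes the mechanics concrete. Everything below is hypothetical: the function names are illustrative stand-ins, and the agent calls are stubs for real model invocations.

```python
from dataclasses import dataclass

# Hypothetical sketch of a plan-delegate-assess-codify loop.
# Function names are illustrative; the agent calls are stubs
# standing in for real model invocations.

@dataclass
class Review:
    approved: bool
    rejection_reason: str = ""

def make_plan(task: str, constraints: list) -> str:
    return f"{task} | known constraints: {constraints}"

def delegate_to_agent(plan: str) -> str:
    return f"draft({plan})"  # stand-in for an implementing agent

def assess_with_second_agent(draft: str) -> Review:
    # A *different* model instance grades the work:
    # you don't grade your own exam.
    return Review(approved=False, rejection_reason="unhandled edge case")

def compounding_loop(task: str, constraints: list, rounds: int = 3):
    """Each rejection is codified into the constraint set, so every
    subsequent plan starts from a richer spec than the last."""
    draft = None
    for _ in range(rounds):
        plan = make_plan(task, constraints)          # plan
        draft = delegate_to_agent(plan)              # delegate
        review = assess_with_second_agent(draft)     # assess
        if review.approved:
            break
        constraints.append(review.rejection_reason)  # codify
    return draft, constraints
```

At Eledath’s Level 7, the claim is that this loop runs unattended in the background, with different model instances filling the delegate and assess roles.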

The CO/AI angle: this is exactly why we named the publication what we did. Right now, the human expert’s rejection is what keeps the flywheel from spinning into slop. AI generates. AI reviews. The human expert rejects the 17-30% that’s wrong. That rejection gets encoded. The next loop is tighter. Co-working. Co-authoring. Co-optimizing. The human taste is the governor on the engine.

But here’s the seed corn problem Jones flags: entry-level tech hiring is down 67%. If we’re eliminating the pipeline that produces tomorrow’s expert rejectors, who teaches the models taste in five years? The very improvement loop depends on humans it’s simultaneously making redundant. We’ve seen this pattern before — in manufacturing, in journalism, in any industry that outsourced its apprenticeship model and then couldn’t figure out why institutional knowledge disappeared a generation later.

What business leaders need to know: Start treating rejection as a first-class output. Every expert correction your team makes to AI-generated work is training data you’re currently throwing away. Build systems that capture it. The companies compounding institutional taste will have moats the rest can’t replicate.
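
What “capture it” might look like in practice: a minimal sketch, assuming nothing more exotic than an append-only log. The schema is an illustration, not any vendor’s actual format; the point is that the reason field, the why behind the rejection, is the asset being saved.

```python
import json
from datetime import datetime, timezone

# A minimal sketch of treating rejection as a first-class output:
# every expert correction to AI-generated work is appended to a log
# instead of vanishing with the conversation. Schema is illustrative.

def log_rejection(path, ai_output, expert_fix, reason, domain):
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "domain": domain,          # e.g. "pricing", "brand voice"
        "ai_output": ai_output,    # what the model produced
        "expert_fix": expert_fix,  # what the expert changed it to
        "reason": reason,          # WHY it was wrong -- the real asset
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")  # append-only JSONL

# Accumulated records become evals, few-shot examples, or fine-tuning
# data: the constraint that compounds instead of evaporating.
```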

The Two-Pizza Team Eats Alone

Jeff Bezos’s “two-pizza rule” was never about pizza. It was about Metcalfe’s Law — the insight that communication overhead explodes with every additional node. Cap the team at what two pizzas feed, and you cap the coordination tax that kills speed.

Tomasz Tunguz just published the math on what happens when AI collapses those nodes. A traditional 150-person organization runs four layers deep with 11,175 potential communication channels. Meetings multiply. Alignment decays. An AI-enabled team producing equivalent output needs 30 people. Communication channels: 435. A 96% reduction.
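
Tunguz’s figures fall straight out of the standard pairwise-channel formula, channels(n) = n(n−1)/2:

```python
def channels(n: int) -> int:
    """Potential pairwise communication channels in a team of n people."""
    return n * (n - 1) // 2

print(channels(150))  # 11175 -- the traditional 150-person org
print(channels(30))   # 435   -- the 30-person AI-augmented team
print(f"{1 - channels(30) / channels(150):.0%}")  # 96% reduction
```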

The numbers at the frontier are staggering. Anthropic generates roughly $5 million in revenue per employee. Cursor, $3.3 million. Midjourney, $2 million. Traditional SaaS considers $200-300K strong. That’s a 10-20x gap — and it’s widening.

But the real question — the one nobody else is asking — isn’t about making 150 people more productive. It’s about what the org chart looks like when 30 of those 150 “employees” are digital agents.

Amazon just laid off approximately 16,000 corporate employees, primarily targeting middle-management roles made redundant by agentic workflows. Block built an internal skills marketplace with 100+ AI agent personas — pull requests, reviews, version history, the whole nine. Paperclip is building org charts for AI companies that include agents as first-class employees, complete with budgets and governance structures.

Jeff Dean predicts engineers will manage 50+ agents each. The question shifts from “how many people can one manager oversee?” to “how many agents can one human orchestrate?”

And here’s the thing about agents that changes the Metcalfe math entirely: they don’t have the communication overhead that humans do. No break room. No coffee runs. No Monday morning debriefs about the weekend. No arguments about compensation. No politics. Just pure information exchange on servers. So where’s the inflection point where adding more agents becomes counterproductive? If the optimal human team was a two-pizza group, what’s the analog for agents? One server? Two GPUs? The coordination cost isn’t zero — Cursor found that agents without hierarchy became risk-averse and churned without progress — but it’s fundamentally different in kind, not just degree.

What this really means: the constraint in the AI-native org isn’t compute or intelligence. It’s human attention bandwidth. The future org chart might be one human surrounded by 50 agents, and that human’s job isn’t to do work — it’s to clear constraints. Serve up the decisions that require judgment. Let everything else iterate autonomously until it hits a wall. Then the human clears the wall and the system moves again.

Isn’t that essentially what the best CEOs already do? Find the binding constraint. Remove it. Move to the next one. The difference is that the “employees” generating those constraints now run at the speed of inference, not the speed of meetings.

The action item: Stop reorganizing your human org chart. Start designing the hybrid one. Map which roles in your company are constraint-clearers (keep human) and which are iteration-runners (candidate for agents). The companies that figure out the human-agent ratio first will have a structural speed advantage that compounds every quarter.

The Billion-Dollar Fork in the Road

While Silicon Valley keeps pouring capital into making LLMs bigger, Yann LeCun just raised $1.03 billion for AMI Labs to build something else entirely. It’s the largest seed round in European history, at a $3.5 billion valuation, and it’s a direct bet that the entire LLM paradigm is a dead end for human-level intelligence.

His thesis: large language models predict the next token. They don’t understand the world. World models — built on his JEPA (Joint Embedding Predictive Architecture) — learn by building internal representations of how reality works. Physics. Cause and effect. Spatial reasoning. The stuff LLMs hallucinate about because they’ve never experienced it.

Follow the Munger principle here: show me the incentives and I’ll show you the behavior. Look at who backed him. Not the usual AI fund-of-funds crowd. Nvidia. Toyota. Samsung. Bezos Expeditions. These are hardware companies and physical-world operators. Companies that need AI that understands atoms, not tokens. Toyota doesn’t need a chatbot that writes better emails. They need a model that understands what happens when a brake pad meets a wet road at 70 miles per hour.

His co-founder predicts every company will rebrand as a world model startup within six months. Bold claim. But consider: the fruit fly brain emulation from Eon Systems this week — 125,000 neurons, 50 million synaptic connections, a simulation running purely on the fly’s mapped biological wiring with 91% behavioral accuracy — suggests there’s more to intelligence than next-token prediction.

Now connect this to the first two stories. If the future is expert agents running in autonomous teams, does it matter whether those agents are built on monolithic LLMs or specialized world models? Imagine a marketing agent built on world models of human persuasion — Cialdini’s principles encoded in agent form. Pair it with a pricing specialist trained on game theory and market dynamics. Add a creative agent that understands visual perception and emotional resonance. Not one giant model trying to be everything. A team of deep-domain experts, each understanding its corner of reality.

That’s the architectural question underneath all the funding headlines: are we building one brain or building a team?

Why this matters: Don’t go all-in on a single model architecture. The companies that build model-agnostic agent infrastructure — systems that can swap between LLMs, world models, and specialized narrow models depending on the task — will have optionality the rest won’t. If LeCun is even partially right, every company that bet exclusively on language models just got a $1 billion wake-up call. And if he’s wrong, you’ve still built a more resilient system.
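
A sketch of what “model-agnostic” can mean at the code level: a thin routing layer that owns the task-to-model mapping, so swapping architectures is a table edit rather than a rewrite. Every backend below is a hypothetical placeholder, not a real vendor API.

```python
from typing import Callable, Dict

# Hypothetical sketch of a model-agnostic routing layer. Every
# backend below is a placeholder, not a real vendor API.

ModelFn = Callable[[str], str]

def llm_backend(task: str) -> str:
    return f"[LLM] {task}"

def world_model_backend(task: str) -> str:
    return f"[world model] {task}"

def narrow_specialist_backend(task: str) -> str:
    return f"[narrow specialist] {task}"

# Swapping architectures is an edit to this table, not a rewrite.
ROUTES: Dict[str, ModelFn] = {
    "drafting": llm_backend,
    "physical-simulation": world_model_backend,
    "fraud-scoring": narrow_specialist_backend,
}

def run(task_type: str, task: str) -> str:
    backend = ROUTES.get(task_type, llm_backend)  # default: LLM
    return backend(task)

print(run("physical-simulation", "brake pad meets wet road at 70 mph"))
```

The design choice doing the work is the routing table: agents upstream call run() and never learn which architecture answered.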

What This Means For Business Leaders

One story played out in three acts this week. AI crossed the self-improvement threshold, the organizations built around it are shedding their human architecture, and the models powering everything may be about to specialize in ways that make today’s monolithic LLMs look like mainframes. Here’s what to do about it.

Start capturing your institutional taste before it evaporates. This isn’t just about engineers and code. It’s about every domain expert in your organization whose judgment makes the difference between good enough and great. Your marketing team’s instinct for what resonates with your customer. Your manufacturing lead’s feel for when a production line is drifting before the sensors catch it. Your brand voice and its evolution over twenty years of customer conversations. Your strategist who can smell a bad deal before the spreadsheet confirms it. Every one of those corrections, those “no, not like that” moments — that’s your competitive advantage walking out the door every night. Build systems that capture it. Record it. Encode it.

Because here’s the individual version of this: there will shortly be a business where people record everything that makes them them into persistent systems — and those systems will live on well beyond their human lifespans. Your great-grandchildren might grow up knowing you as well as your own kids do. The same logic applies to companies. The institutional taste you don’t capture is the institutional taste that dies with the next reorg.

Redesign for the hybrid org chart — because AI-native competitors already have. The math is unforgiving — 11,175 communication channels versus 435 for equivalent output. But don’t reorganize. Redesign from scratch. The companies being born today don’t have legacy org charts to optimize. They’re building agent-first from day one, and they’ll compete with you at 10x your speed and a fraction of your overhead. Map which roles are constraint-clearers (keep human) and which are iteration-runners (candidate for agents). Then build the dashboard that serves up decisions to your remaining humans. Don’t make them hunt for the bottleneck — surface the constraints automatically, let everything else iterate at machine speed, and let your people do what only people can do: exercise judgment under uncertainty.

Build for a multi-model world. A billion dollars of smart money just bet against the LLM monoculture. Whether LeCun is right about world models or not, the direction is clear: specialized expert agents, not one-size-fits-all chatbots. Your architecture should be ready for either future. The companies that build model-agnostic infrastructure — systems that swap between LLMs, world models, and narrow specialists depending on the task — will have optionality the rest won’t.

Three Questions We Think You Should Be Asking

  • Who checks the checker? In five years, when the improvement loop has compounded through millions of cycles, and the expert agents know more about your domain than any single human — who validates the output? If a team of specialized models trained on every correction ever made arrives at an answer the human expert disagrees with, how do you know which one is right? This isn’t theoretical. It’s the governance question every board should be discussing now, while there’s still time to design the answer.
  • Where does the next generation of experts come from? Entry-level tech hiring is down 67%. Amazon screen-recorded its engineers for months, used them to correct the AI’s every error, and once those corrections slowed, made the people redundant. If we’re eliminating the apprenticeship pipeline that produces tomorrow’s expert rejectors — the people whose taste and judgment the entire improvement loop depends on — we’re eating the seed corn. Every industry that outsourced its training pipeline eventually couldn’t figure out why institutional knowledge disappeared a generation later. Who’s building the apprenticeship model for the age of agents?
  • What’s your constraint-clearing infrastructure? If the future CEO’s job is essentially what Elon Musk already does — find the binding constraint, remove it, move to the next one — then someone should be building the dashboard for that role. Not a BI tool. Not a project management app. A real-time constraint surface that shows a human operator exactly where the autonomous systems are stuck, what judgment call is needed, and what the agents have already tried. The company that builds this — the air traffic control system for agent fleets — might be the most important enterprise software company of the next decade. Does it exist yet? If not, why aren’t you building it? (A minimal sketch of one record on that constraint surface follows this list.)
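
For the third question, a minimal sketch of the data model such a constraint surface might expose. Purely illustrative; no existing product is assumed.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical data model for a constraint surface: the record a
# human operator sees when an agent fleet hits a wall. Purely
# illustrative; no existing product is assumed.

@dataclass
class Constraint:
    agent_id: str          # which agent or team is blocked
    blocked_on: str        # what the system cannot resolve alone
    judgment_needed: str   # the decision escalated to the human
    attempts: List[str] = field(default_factory=list)  # what was tried

def surface(queue: List[Constraint]) -> List[Constraint]:
    """Serve up only the walls a human must clear, most-tried first,
    so the operator's attention goes to the stickiest constraint."""
    return sorted(queue, key=lambda c: len(c.attempts), reverse=True)
```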

“The first rule of any technology used in a business is that automation applied to an efficient operation will magnify the efficiency. The second is that automation applied to an inefficient operation will magnify the inefficiency.”

— Bill Gates


— Harry and Anthony
