The Dr. House of AI

THE NUMBER: ZERO — public benchmark leaderboards Anthropic cited when it launched the most capable model it has ever shipped. Not one. In place of the scoreboard the whole industry argued about for three years, the Fable 5 announcement ran thirteen private customer evals — Stripe’s codebase, IMC’s trading desk, Hebbia’s finance reasoning, Harvey’s legal benchmark. The public test went dark in a single press release, because the model already won every game on it. And when the test can’t grade the student anymore, you stop testing and start interviewing.

This morning Anthropic shipped Claude Fable 5, the first “Mythos-class” model — a tier above Opus — made safe enough to hand the public. Stripe got early access and pointed it at a 50-million-line Ruby codebase, the kind of migration that eats a team for two months. It finished in a day. The launch post did 17 million views before lunch. Every newsletter in my inbox led with it.

That’s the easy half of the story, and it’s the wrong half to stare at. The interesting half is a question almost nobody asked out loud: when a model is “state of the art on nearly every test it ran,” how do you actually judge it? The benchmarks are saturated. The scoreboard is a participation trophy. So everyone who got their hands on Fable reached for the same move, and the whole field misnamed it. Mollick called it a feeling. Every.to put “vibe check” in the subject line. My own partner Anthony did the same. They all had it backwards. What’s happening isn’t casual at all. It’s the most rigorous test there is — and it has a name. It’s a job interview.

Here’s the frame for the whole issue. The model that beat every benchmark is the most expensive genius hire your company will ever consider. He’s brilliant, he’s bored by anything routine, he works in a way you can’t watch, and he comes with a second invoice nobody quotes you. Fable isn’t a tool you pick up. It’s Dr. House — and the only real question is whether you can afford him, and whether you’ve got the team to put around him.

🩺 The Benchmark Died on Launch Day

Three years of AI discourse ran on public leaderboards. MMLU, SWE-bench, the whole alphabet. Labs lived and died on a half-point. Then Anthropic shipped the best model it’s ever made and cited none of them — because there’s nothing left to prove on a test you’ve already maxed. What it ran instead was a wall of private evals: Cognition’s FrontierCode, Cursor’s CursorBench, Hebbia’s finance suite, Harvey’s Legal Agent Benchmark, an internal “ViBench” for vibe-coding. Thirteen of them, all proprietary, none reproducible by you.

Sit with what that means. Evaluation just privatized. The public benchmark is dead as a decision tool, and in its place every serious shop now runs the one test that still discriminates: hand the model the single hardest job in your domain, the one where you hold the answer key, and grade the output like a master grading an apprentice’s masterwork. It only works if you’re already expert enough to know what right looks like. That’s the opposite of a vibe.

The sharpest version of this came from Harvey. Gabe Pereyra, one of its founders, posted that the first prompt he runs on any new model is “Draft an S-1,” the registration document a company files to go public. He had Fable draft a mock S-1 for SpaceX, ran the same prompt through Opus 4.8, and laid both next to the real filing. Fable was a clear step up. But here’s the tell that should make you grin: the signal he trusts most isn’t accuracy, it’s length. The longer and cleaner the filing comes out, the better that model performs inside Harvey’s legal agents. It tracks so well that, by his account, it is a better guide than their formal benchmark, where Fable scored 13% to Opus’s 10%. A senior legal-AI guy judges the frontier by how long the document runs. That’s not a knock on him. That’s what evaluation looks like when the official scoreboard stops telling you anything.

The takeaway: stop waiting for a clean public number to tell you which model to use. There won’t be one again. The new literacy isn’t running the model — it’s being qualified to grade it. If you can’t tell whether Fable nailed your hardest task, you can’t buy your way out of that with a subscription.

🏥 Meet Dr. House

The metaphor isn’t decoration. It’s load-bearing, because it maps onto how Fable actually works.

Ethan Mollick spent five days with it and wrote the clearest field report we have. He asked it to build an isochronic map — travel times from any city, the kind of thing that needs thousands of real data points and a hundred judgment calls. Fable didn’t grind through it alone. It spun up a cast of cheaper Claude Sonnet agents to go pull 2,200 specific flight times, the TGV and Shinkansen rail schedules, road speeds from academic papers — then launched more agents to test the first batch’s work, taking notes the whole way. On a second project it worked autonomously for nine and a half hours and produced a piece of research software Mollick said scientists “needed for years but was never profitable to create.”

That’s the show, beat for beat. House doesn’t run the labs. The residents run the labs. House sits on top, reads the results, and makes the leap that scares everyone in the room. Boris Cherny shipped five-level nested sub-agents in Claude Code the same day Fable dropped — Claude prompting Claude prompting Claude. Anthropic built the hospital and the residents into the model.

And here’s the unnerving part Mollick kept circling: the more capable the thing got, the less he did, and the less he could see. “I no longer steer; I commission,” he wrote. “I am closer to a patron.” You describe what you want, you pay for it, you judge the result, but the hundreds of small choices in the middle happen somewhere you don’t get a vote. House cures the patient by doing something that terrifies the staff, and you only find out it worked at the end.

The takeaway: Fable is not a faster autocomplete. It’s a diagnostician that brings its own team. That’s exactly why it’s wasted on the routine stuff — and exactly why you can’t fully supervise it on the hard stuff. Both halves of that are the same coin.

💸 Can You Afford the Hire?

This is where my partner Anthony’s piece this week — Intelligence Demand Is Infinite — stops being theory and starts being a staffing decision.

His thesis: demand for intelligence is close to infinite, but the work splits hard. Inside a year or so, roughly 80% of your AI workload will run fine on models that cost 99% less than the frontier. The other 20% — the real reasoning, the orchestration, the stuff where one wrong call cascades — stays on the expensive tier, because for that slice raw IQ is the product. Most teams are running the entire 80% on frontier models anyway. Anthony calls it the genius tax. Paying House to take temperatures.
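
To see what the genius tax looks like in numbers, here is a back-of-the-envelope sketch in Python. It takes Anthony’s two figures at face value (80% of the workload can run on models that cost 99% less) and adds one assumption of ours: every slice of the workload burns tokens in proportion to its share. The function name blended_cost is ours, purely for illustration.

```python
def blended_cost(frontier_share: float, cheap_discount: float) -> float:
    """Cost of a routed workload, relative to running all of it on the frontier (= 1.0).

    frontier_share: fraction of the work that stays on the frontier model.
    cheap_discount: how much cheaper the floor model is (0.99 means 99% cheaper).
    Assumption: each slice uses tokens in proportion to its share of the work.
    """
    cheap_share = 1.0 - frontier_share
    return frontier_share * 1.0 + cheap_share * (1.0 - cheap_discount)


# Anthony's figures: 20% stays on the frontier, the rest costs 99% less.
routed = blended_cost(frontier_share=0.20, cheap_discount=0.99)
print(f"Routed bill as a share of the all-frontier bill: {routed:.1%}")
print(f"Genius tax paid by the all-frontier shop: {1 - routed:.1%}")
```

On those assumptions the routed shop pays roughly a fifth of what the all-frontier shop pays. Real workloads won’t split this cleanly, so treat the figure as a rough guide, not a forecast.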

Fable is the 20% made flesh. It’s priced like it: twice Opus, $10 and $50 per million tokens in and out, and Anthropic is so unsure it can meet demand that it’s giving it away on Pro and Max plans only through June 22 before the meter starts. That’s not a price cut. That’s a signing bonus that expires. The genius is real and the genius is too much model for most of what your company does. Per-token prices are falling 10x a year, and it still won’t save you, because cheaper tokens just mean more tokens — Jevons paradox with a GPU. The bill climbs anyway unless you decide, per job, which brain it deserves.

Which is the actual move, and it’s the one I floated to Anthony the morning Fable landed: let Fable build you the harness. Hard reasoning to Fable or Opus. Images to Gemini. The cheap, high-volume work to a good-enough open-source model running local. The front end designed by Claude. House takes the unsolvable case; the open-source residents mop the floors; you call in a Gemini specialist for imaging. That’s not a downgrade. That’s how a hospital is staffed.
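
For the builders in the room, the staffing chart above fits in a few lines of Python. This is a minimal sketch, not anyone’s production harness: the Job fields, the tier labels, and the route function are hypothetical names of ours, and the tier strings are plain labels, not real API model identifiers. How you decide a job needs hard reasoning is the part that takes judgment, and it is left as a flag you set yourself.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    # Labels only, chosen for illustration; not real model IDs.
    FRONTIER = "fable-or-opus"     # House: the unsolvable case
    IMAGING = "gemini"             # the imaging specialist
    FLOOR = "local-open-source"    # the residents: cheap, high-volume work


@dataclass
class Job:
    description: str
    needs_images: bool = False
    hard_reasoning: bool = False


def route(job: Job) -> Tier:
    """Pick a tier per job instead of defaulting everything to the frontier."""
    if job.needs_images:
        return Tier.IMAGING
    if job.hard_reasoning:
        return Tier.FRONTIER
    return Tier.FLOOR


jobs = [
    Job("Plan a large codebase migration", hard_reasoning=True),
    Job("Tag this week's support tickets"),
    Job("Generate product imagery", needs_images=True),
]
for job in jobs:
    print(f"{job.description} -> {route(job).value}")
```

The design choice worth noticing is the last line of route: the floor is the default, and a job has to earn its way up to the frontier.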

The takeaway: the most capable model ever made is the best argument yet for not defaulting to it. The company that wins isn’t the one with Fable. Everybody has Fable today. It’s the one that knows which 20% of its work actually earns the frontier — and routes the rest to the floor.

💊 Every House Needs a Wilson

Now the cost nobody puts on the invoice. House is only ever right because the writers guarantee it. In your business, there is no writer.

Fable is a black box that returns finished work. Mollick could spot some errors in the research software because he happens to be the kind of expert who can — and he flagged that a less expert user would have shipped them blind. That’s the catch in the whole “anyone can build anything now” story. The more the model does, the more you need someone senior enough to catch the one time it’s confidently, catastrophically wrong. House needs a Wilson — the one colleague with the standing to say “you’re sure about that?” and mean it.

This flips the labor story that everybody got lazy about. The reflex take is that a model this good means you need fewer experts. The opposite is true, and Mollick says so directly: we may need more coders, not fewer, to chase down the bugs in the explosion of software a tool like this makes possible. The scarce skill stops being production and becomes judgment. Taste. The ability to look at a finished S-1, or a nine-hour software build, or a drug-target hypothesis, and know whether it’s brilliant or quietly broken. That capacity doesn’t commoditize when intelligence does. It gets more valuable, because there’s suddenly so much more output that needs grading.

The takeaway: when you cost out the Fable hire, the token bill is the cheap line. The expensive line is the human qualified to check its work. If you can’t name your Wilson, you can’t deploy your House.

⚠️ The Vicodin Problem

Every great diagnostician on television is also a liability the administration keeps on a short leash, and Fable is no exception. This is the NOISE half of the issue, and it’s not a footnote.

Fable has a dangerous twin. The same underlying model, with the safety rails off, ships as Mythos 5 — and Mythos 5 goes to the US government, through Project Glasswing, as the strongest cyberweapon-grade model in the world. The public Fable you get is the muzzled version: ask it anything touching cybersecurity, biology, or chemistry and it quietly hands the question down to Opus 4.8 instead, in just under 5% of sessions. Anthropic added classifiers, a 30-day data-retention rule on all Mythos-class traffic, and a distillation tripwire to stop authoritarian labs from copying it. They red-teamed it for over a thousand hours. This is the first model Anthropic decided it could not release to the public without a leash. Cuddy and Vogler, reining in the genius before he gets near the pharmacy.

And the skeptics have a real point worth holding next to the hype. A new study this week clocked large language models producing a genuinely original argument just 3.4% of the time, against 65.3% for human writers — “argument collapse.” Toby Ord publicly disputed Anthropic’s claims about how uniformly these gains scale. So here’s the honest read: Fable is state of the art at execution and still a parrot at invention. House is the best diagnostician alive at connecting known facts into the answer no one else sees. He is not in the business of discovering a new disease. Don’t confuse the two, and don’t let a vendor confuse them for you.

There’s a grace note here, too. Hugh Laurie — Dr. House himself — surfaced on X this week to demolish a critic who complained every episode was the same. He wrote that they tried a couple of episodes where House gets it right on the first try, but they came out six minutes long, and NBC wasn’t happy. It’s the perfect accidental gloss on this entire launch. A House who one-shots the answer is a six-minute episode. The value was never the instant diagnosis. It was the team, the dead ends, the data, and the leap at the end. They don’t call them procedurals for nothing.

The takeaway: the same capability that killed the public benchmark forced a government channel and a muzzle. When a model gets this good, both your evaluation method and your safety model have to privatize at once. Read the system card, not the highlight reel.

What This Means For You

Intelligence has outrun the people using it. That’s the real headline under the headline. Fable is more capable than almost any company is currently able to push it, which means the binding constraint stopped being the model months ago and became you — your ability to route it, judge it, and afford it. Three years of asking “is the model smart enough” is over. The question now is whether your shop is built to employ a genius.

Audit what you’re actually sending to the frontier. Pull a week of your AI calls and sort them. How many were tagging, formatting, first drafts — work a cheap model finishes just as well? If it’s most of them, you’re paying House to mop floors. Pick your three highest-volume prompts and test the cheapest model that clears the bar.
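
If your AI calls are logged anywhere, that audit is a short script. Here is a sketch in Python under loud assumptions: the log format (a task label plus input and output token counts per call) is hypothetical, the list of routine tasks is ours, and the prices are the Fable rates quoted earlier in this issue, $10 in and $50 out per million tokens.

```python
from collections import defaultdict

# Fable prices as quoted in this issue, converted to dollars per token.
INPUT_PRICE = 10.0 / 1_000_000
OUTPUT_PRICE = 50.0 / 1_000_000

# Hypothetical labels for work a cheaper model could probably finish.
ROUTINE_TASKS = {"tagging", "formatting", "first_draft"}


def call_cost(tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of one call at the quoted frontier prices."""
    return tokens_in * INPUT_PRICE + tokens_out * OUTPUT_PRICE


def audit(calls: list[dict]) -> tuple[dict[str, float], float]:
    """Return spend per task and the share of total spend that went to routine work."""
    spend: dict[str, float] = defaultdict(float)
    for call in calls:
        spend[call["task"]] += call_cost(call["tokens_in"], call["tokens_out"])
    total = sum(spend.values())
    routine = sum(cost for task, cost in spend.items() if task in ROUTINE_TASKS)
    return dict(spend), (routine / total if total else 0.0)


# A made-up sample, standing in for a week of real logs.
week = [
    {"task": "tagging", "tokens_in": 400_000, "tokens_out": 40_000},
    {"task": "first_draft", "tokens_in": 300_000, "tokens_out": 120_000},
    {"task": "migration_plan", "tokens_in": 200_000, "tokens_out": 60_000},
]
by_task, routine_share = audit(week)
for task, cost in sorted(by_task.items(), key=lambda item: -item[1]):
    print(f"{task}: ${cost:.2f}")
print(f"Share of frontier spend on routine work: {routine_share:.0%}")
```

Swap in your own logs and your own list of routine tasks. The number on the last line is how much House is being paid to mop floors.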

Name your Wilson before you deploy. A model this autonomous, handed long-horizon work, needs a human senior enough to catch a confident wrong answer. If the honest answer to “who signs off on Fable’s output here” is “nobody qualified,” that’s the real reason to wait — not the token bill.

Treat the June 22 window as a free trial, not a free lunch. Run Fable against your single hardest real task — your S-1, your migration, your model — and learn what it’s worth to you before the meter starts. Then build the harness: frontier for the 20%, floor for the rest.

The model isn’t the moat anymore, because everyone has the model. The moat is knowing which work deserves the genius and who’s qualified to check him. Stop hiring House to take temperatures.

Three Questions We Think You Should Be Asking Yourself

Can you actually grade this model’s hardest output, or are you trusting the highlight reel? If nobody on your team could look at Fable’s work on your core problem and tell brilliant from broken, you don’t have an AI strategy. You have a vendor relationship.

What percentage of your AI spend is going to work that doesn’t need the frontier? Anthony’s number is 80%. If you’ve never measured yours, you’re almost certainly paying the genius tax, and the company across the street that built the router isn’t.

Who is your Wilson? Name the person with the expertise and the standing to overrule the machine. If that role doesn’t exist yet, that’s your first hire — before the model, not after.

“I no longer steer; I commission.”
— Ethan Mollick, on working with Fable

— Harry and Anthony

Sources

Claude Fable 5 and Claude Mythos 5 — Anthropic
What it feels like to work with Mythos — Ethan Mollick, One Useful Thing
Intelligence Demand Is Infinite — Anthony Batt, CO/AI
Claude Fable 5 vibe check — Anthony Batt, CO/AI
Gabe Pereyra (@gabepereyra) on X — “Draft an S-1” model test, SpaceX S-1 comparison, LAB 13% vs 10%
Hugh Laurie (@hughlaurie) on X — June 7, 2026, on House and the procedural form
Karpathy on Fable 5 as a major version bump; Boris Cherny on 5-level nested sub-agents — via X / Aligned News
OpenAI confidential S-1 filing — via TLDR, Benedict Evans
“Argument collapse” LLM study; Toby Ord on scaling claims — via Aligned News

All Signal. No Noise.
