Agentic Coding and the Problem of Oracles
Guest Post: Yanqing Cheng
Earlier today, Yanqing Cheng and I got into an excited conversation, sparked off by Anthropic’s announcement about building a C compiler with a team of parallel Claudes, and we ended up deep in the combination of LLMs and Context Driven Testing. Her thinking here was really sharp, so I suggested she write it up rather than let me steal it. Here’s a guest post from Yanqing:
Agentic Coding and the Problem of Oracles
If you’re anything like me, you are terminally online these days, because it feels like the only way to keep up with the vertiginous pace of capability advances in coding agents. If so, you will have seen yesterday’s blog post from Anthropic. An autonomous team of Claudes implemented a C compiler in two weeks - 100k lines of Rust, minimal human intervention.
At first glance this is a seismic achievement - has the day finally come when software agents can do the SWE job end-to-end?
Keep being terminally online, though, and you will quickly see the discourse about the catch. From the blog post:
“But when agents started to compile the Linux kernel, they got stuck. Unlike a test suite with hundreds of independent tests, compiling the Linux kernel is one giant task. Every agent would hit the same bug, fix that bug, and then overwrite each other’s changes. Having 16 agents running didn’t help because each was stuck solving the same task.
The fix was to use GCC as an online known-good compiler oracle to compare against. I wrote a new test harness that randomly compiled most of the kernel using GCC, and only the remaining files with Claude’s C Compiler. If the kernel worked, then the problem wasn’t in Claude’s subset of the files.”
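To make the quoted setup concrete, here is a minimal sketch of what a differential harness like that could look like. Everything specific in it is an assumption - the kcc binary name, the source layout, and the compile-only check (the real harness went further and tested whether the resulting kernel actually worked) - but the shape is the point: a trusted oracle lets you pin the blame on a small random subset of the work, so sixteen agents can each get a different pile of suspects instead of all chasing the same bug.

```python
import random
import subprocess
from pathlib import Path

# All names here are hypothetical stand-ins; Anthropic's harness is only
# described in prose in their post.
KERNEL_SRC = Path("linux")   # kernel source tree
CLAUDE_CC = "kcc"            # Claude's C compiler (hypothetical binary name)
GOOD_CC = "gcc"              # known-good oracle compiler


def compiles(compiler: str, c_file: Path) -> bool:
    """Try to compile a single translation unit; True on success."""
    result = subprocess.run(
        [compiler, "-c", str(c_file), "-o", str(c_file.with_suffix(".o"))],
        capture_output=True,
    )
    return result.returncode == 0


def differential_run(claude_fraction: float = 0.1) -> list[Path]:
    """Compile a small random subset of files with Claude's compiler and the
    rest with GCC. Any failure must then live in Claude's subset."""
    c_files = sorted(KERNEL_SRC.rglob("*.c"))
    random.shuffle(c_files)
    cutoff = max(1, int(len(c_files) * claude_fraction))
    claude_subset, gcc_subset = c_files[:cutoff], c_files[cutoff:]

    # GCC is trusted, so it should never be the source of a failure.
    assert all(compiles(GOOD_CC, f) for f in gcc_subset)
    return [f for f in claude_subset if not compiles(CLAUDE_CC, f)]
```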
The GCC-as-oracle trick makes sense if you’ve been doing any amount of vibecoding. I quickly learned that agents are much more successful when they can tell by themselves whether they’re correct, which lets them iterate at agent speed rather than waiting for human input to steer them in the right direction. There are many tweets and blog posts describing this common wisdom - and Opus 4.6 is not beyond it either. It can do fantastic things and stay on task for longer than ever: as long as it has an oracle.
I see lots of commentary on this result along the lines of: “Well, my thing doesn’t have an oracle. This is cool, but it only works for things with a pre-defined right answer.” A very natural reaction! And it’s right in a way, but it’s an under-ambitious way to think about specifying oracles for agents.
There is always a source of truth on what “good” means for your software. If it’s good, somebody knows it’s good. If it’s bad, somebody knows it’s bad. The humans who matter can tell.
We already have oracles for all our software; we just don’t always call them that:
The “my mum” oracle: show the app to someone who isn’t a developer. If they can’t figure out the checkout flow, that’s an oracle for the app’s usability.
The “mass-email at midnight” oracle: if this specific thing broke, would you panic? The things that trigger your adrenaline are oracles about reliability risk.
The “newspaper test” oracle: if this bug made the front page, how bad would it be? Security people use this instinctively. It’s an oracle about reputational risk that lives entirely in your judgment.
So, does that mean that for software where these things matter, the agentic coding loop is stuck on humans? Well, that depends on whether you can get agents to approximate your human oracles well enough.
Here’s the thing - unlike the automated test harnesses of the Old World, LLMs are actually very good at simulating specific humans. It’s literally what they do. You may have read nostalgebraist’s megapost essay “The Void” about the nature of LLM assistants - one key insight being that the base model has to be a very good predictor of human text. That means it is very good at figuring out what kind of person would write a given piece of text, which in turn means it is very good at simulating all different kinds of people!
With the right prompting, persona simulation for your human oracles can be part of the agent’s repertoire, which means the problem of human oracles boils down to persona specification. With the right personas, the agents can work with you to turn human-fuzzy oracles into precise machine ones and execute on them - there’s a sketch of what that can look like after the examples below.
“My mum”: the agent can’t be your mum, but it can simulate a first-time user walkthrough if you tell it “as someone with no technical background, assuming no prior knowledge, try to buy a thing, report friction.”
The adrenaline test: the agent can’t feel panic, but it can run targeted checks on a list of “if this breaks we’re f***ed” scenarios. It can even brainstorm the list of “if this breaks we’re f***ed” scenarios, if it knows who you are and what you care about.
The newspaper test: the agent can flag things that touch PII, financial data, public-facing content. With an understanding of your company, it can even brainstorm such a list of “would be embarrassing” items.
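As a rough illustration of the first of these, here is what the “my mum” oracle might look like once it is written down as a harness step. The prompt wording, the model name, and the PASS/FAIL convention are assumptions of this sketch rather than a tested recipe; the point is only that a fuzzy human oracle can become something an agent loop can call and grep.

```python
import anthropic

client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

# Persona specification: this is the part that stays human-authored.
MUM_ORACLE = """You are simulating a first-time user with no technical background
and no prior knowledge of this product. Walk through the checkout flow described
below, step by step. At each step say what you would click and why, and flag
anything confusing, ambiguous, or alarming. End your report with a list of
friction points, then a final line that is exactly PASS (you completed the
purchase without getting stuck) or FAIL."""


def run_mum_oracle(flow_transcript: str, model: str = "claude-sonnet-4-5") -> bool:
    """Judge a recorded checkout flow as the 'my mum' persona.
    The model name is a placeholder; use whatever your harness uses."""
    response = client.messages.create(
        model=model,
        max_tokens=1024,
        system=MUM_ORACLE,
        messages=[{"role": "user", "content": flow_transcript}],
    )
    report = response.content[0].text
    return report.strip().splitlines()[-1].strip() == "PASS"
```

The adrenaline and newspaper oracles translate the same way: a persona or rubric in the system prompt, the artefact under test in the user message, and a machine-greppable verdict at the end.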
Every software domain has oracles. Not all of them are as obvious and as easily machine-consumable as the GCC compiler. In most domains, you need to translate your oracles - including the human judgement ones - into a form agents can consume. The more autonomy you want for your agents, the better that translation needs to be.
You can’t extract the human from the loop even with good translation - the specification layer of who matters and what they care about is human. But humans can move up the stack.
Anthropic’s engineer stayed in the loop on the compiler demo, stepping in to specify what “correct” meant.
You can delegate the coding, and most of the reviewing, and most of the testing. But the buck on what “good” means still stops with you (or other human stakeholders). Even with excellent oracle-translation, the agents are only approximating your judgement. You will still work out what good means for your context, and you will still be the final authority on whether the agents are really aligned with your understanding of goodness.
The job for humans isn’t writing code any more - it’s knowing what good means.



