osbytes

Missing oracles: the tech debt the agent era may compound

2026-05-25 by @dillonstreator, 9 min read
#ai #llm #agents #engineering #opinion #spec-driven-development #technical-debt

TL;DR (hypothesis offered for testing, not a measured claim). Agents condition on whatever artifacts exist in your codebase. Weak artifacts produce weak conditioning. Tests, types, contracts, evals (anything checkable without rerunning the model) function as external oracles that interrupt ungrounded self-review loops. Specs sit alongside but do not grade; they constrain. The compounding story below is mechanism-plausible and unmeasured. If you have data either way, I want to see it.

Classical tech debt is something you can, in principle, see. Long functions, deep coupling, untested paths, dependencies pinned to versions nobody remembers choosing. Look at the source tree and the debt is there.

The agent era promotes a different kind. Call it a missing oracle. An oracle is any artifact that states what the software must do in a form a machine can check: a failing test, a typed contract, a property-based generator, an eval suite, a repro fixture. The debt is its absence.

I am bundling these under one term and the bundling is doing work. Types catch errors at compile time and constrain the writer. Tests catch errors at runtime and constrain the artifact. Evals measure aggregate behavior. Contracts pin down interfaces. Different cost curves, different coverage, different failure modes. Grouped because they share one property: external to the agent's generation distribution, checkable without rerunning the model.

Specs are a separate category. They constrain the reader (human or agent) before code exists, but do not grade output. I treat them as oracle-adjacent throughout; conflating specs with oracles is a common failure mode of spec-driven dev as a movement and I do not want to repeat it here.

Agents made the absence of these artifacts expensive, and made the cost show up fast enough to notice.

Intent is what vanishes

Investment in written specs has always varied across teams and industries. Some shops, often the ones with safety regulators or paying customers who litigate, invested heavily in specs with the correct SMEs in the room and kept them current as the code moved. Many others did not, and held intent in their heads instead: the field that is nullable in the schema but crashes the downstream pipeline if null, the endpoint that returns 200 with an error body because a major client cannot handle 4xx, the sleep(50) that looks like sloppy code but is masking a race in a vendor SDK. In codebases shaped that second way, tacit context was load-bearing and invisible. Each of these looks wrong or arbitrary in the source until you know the story, and the story lives in a person, not the repo. (This is the part of the post I am most confident about.)

Hand the codebase to an agent and the code is still there. The intent is gone. Every constraint a senior engineer would have raised in a hallway conversation is now absent from the agent's context unless someone wrote it down. The artifact that felt like overhead (design doc, ADR, README section nobody read) is now the only signal the agent has that an obvious-looking change breaks something three modules over.
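A minimal sketch of what writing one of those stories down could look like, using the 200-with-error-body example. Python, standard library only. `Response` and `render_error` are hypothetical names invented for illustration, not taken from any real codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    status: int
    body: dict


def render_error(message: str) -> Response:
    # Hypothetical handler. The 200 status is deliberate: in this
    # illustration a major client cannot handle 4xx responses, so errors
    # travel in the body instead of the status line.
    return Response(status=200, body={"ok": False, "error": message})


def test_errors_are_reported_in_body_with_status_200() -> None:
    # Pins the story that otherwise lives in someone's head. A "fix" that
    # returns 400 here fails this check instead of shipping.
    resp = render_error("invalid account id")
    assert resp.status == 200
    assert resp.body["ok"] is False
    assert resp.body["error"] == "invalid account id"


test_errors_are_reported_in_body_with_status_200()
```

The comment carries the why; the assertion carries the what. An agent that reads only the handler sees an odd status code. An agent that runs the test sees a constraint.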

Engineering intent is half the loss. Domain intent is the other half, and the part engineers cannot author alone. The billing edge case lives with finance, the prescribing rule with the clinician, the regulatory constraint with legal. An oracle authored without the SME encodes the engineer's guess at the domain: the failure mode this whole post warns about, relocated one layer up. The artifact has to be SME-authored assertions (acceptance criteria, invariants, rejection cases), not SME-attended meetings. Attendance does not check anything. Event storming and similar SME-extraction protocols exist precisely because tacit domain knowledge rarely surfaces on its own.

In codebases that depend on this kind of tacit context, the loss shows up on day one of agent use and worsens as team turnover removes the humans who held it. That happens independent of the compounding claim below.

The mechanism

When an agent generates a token, it conditions on every prior token in the context window: tool outputs, prior planning, prior critiques of its own draft. The self-review pass conditions on draft plus critique prompt: different tokens, same model. Without an external oracle in that window (a failing test, a typed contract, a hard "must / must not" line) no new information enters the loop. Confidence can rise without accuracy rising with it.
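A sketch of the difference, under loud assumptions: the only feedback that enters the revise step is what an external check reports. `fake_model` is a canned stand-in for a model call so the example runs offline, and `run_tests` is a stand-in for a real test suite. Neither is a real API.

```python
from typing import Callable, Optional, Tuple

Draft = str
Oracle = Callable[[Draft], Optional[str]]  # failure message, or None on pass


def revise_until_grounded(
    draft: Draft,
    revise: Callable[[Draft, str], Draft],
    oracle: Oracle,
    max_rounds: int = 3,
) -> Tuple[Draft, bool]:
    """Revise a draft, feeding back only what the external oracle reports."""
    for _ in range(max_rounds):
        failure = oracle(draft)
        if failure is None:
            return draft, True
        draft = revise(draft, failure)
    return draft, oracle(draft) is None


def run_tests(source: Draft) -> Optional[str]:
    """Stand-in oracle: execute the draft and check one pinned behavior."""
    namespace: dict = {}
    try:
        exec(source, namespace)
        result = namespace["clamp"](15, 0, 10)
    except Exception as exc:  # any crash counts as a failure report
        return f"crashed: {exc!r}"
    if result == 10:
        return None
    return f"clamp(15, 0, 10) returned {result}, expected 10"


def fake_model(draft: Draft, failure: str) -> Draft:
    """Stand-in for a model call: returns a canned fix."""
    return "def clamp(x, lo, hi):\n    return max(lo, min(x, hi))\n"


first_draft = "def clamp(x, lo, hi):\n    return min(x, lo)\n"
final, passed = revise_until_grounded(first_draft, fake_model, run_tests)
```

Swap `run_tests` for a function that asks the same model whether the draft looks right and the loop still terminates, but nothing outside the model's own distribution ever enters it.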

Reflexion (Shinn et al., 2023) shows self-critique improves with environment signal: test pass/fail, task success. Anthropic's faithfulness work (Lanham et al., 2023) shows CoT faithfulness varies by task and model size; larger models often produce less faithful traces. These results do not prove ungrounded loops degrade output. They show traces are not neutral scratchpads and that grounding signal changes self-critique quality. The bridge from there to "ungrounded loops poison context" is my extrapolation, not the papers'.

Practical consequence: ungrounded "think harder" loops produce confidence without producing accuracy. Same mechanism, two pathologies under different sampling: committing early to a wrong hypothesis and editing the wrong file for an hour, or circling without committing.

A spec interrupts the loop without grading the output. It is a pre-loaded constraint the model's self-talk must stay consistent with. Useful, but not a substitute for a runnable check.

How missing-oracle debt could compound

Working hypothesis, unmeasured. If it holds: codebases with weak oracles are codebases agents progressively make worse. Each agent-shipped change conditions on existing code and existing tests. If the tests do not catch a regression, the regression ships and becomes the new ground truth the next session conditions on.

Illustrative shape, not a documented incident: agent adds a null check in a hot path because the failing case it saw involved a null. No test pins down the contract. Weeks later, a second session reads the null check, treats it as canonical, adds a parallel one in a sibling module. Later still, a third session refactors both into a shared helper. The original misdiagnosis is now load-bearing. No diff in the chain looks wrong on its own.
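What a pinned contract could have looked like in that chain, as a sketch. The contract itself is an assumption made up for this example (a missing value is an upstream bug and must surface), and `apply_discount` and `MissingDiscountError` are hypothetical names.

```python
from typing import Optional


class MissingDiscountError(ValueError):
    """Raised when the caller passes no discount at all."""


def apply_discount(price_cents: int, discount_pct: Optional[int]) -> int:
    # Assumed contract for this sketch: None is an upstream bug and must
    # surface here, not be quietly treated as "no discount".
    if discount_pct is None:
        raise MissingDiscountError("discount_pct is required; pass 0 for none")
    if not 0 <= discount_pct <= 100:
        raise ValueError(f"discount_pct out of range: {discount_pct}")
    return price_cents * (100 - discount_pct) // 100


def test_missing_discount_is_rejected() -> None:
    try:
        apply_discount(1000, None)
    except MissingDiscountError:
        return
    raise AssertionError("None was silently accepted")


def test_zero_discount_is_a_no_op() -> None:
    assert apply_discount(1000, 0) == 1000


test_missing_discount_is_rejected()
test_zero_discount_is_a_no_op()
```

With this in place, the first session's defensive null check fails a test instead of becoming precedent for the second and third.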

Classical tech debt is a fixed cost sitting in the codebase until paid. The claim here is missing-oracle debt grows with agent activity, because every ungrounded change becomes ground truth the next change conditions on. Intuitive mechanism stories in software are wrong often enough that the prior on this one should be skeptical; I am offering it for testing, not asserting it.

The study I want: regression rate over N agent sessions against weakly- vs strongly-tested forks of the same repo. SWE-bench and CodeArena measure single-session success. The compounding question is downstream of that. If you have run this experiment, I want to see it.
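One way the metric could be pinned down, as a sketch. The definition is my assumption, not an established one: run a held-out check suite (hidden from the agent) after every session, and count a regression whenever a check that passed before the session fails or disappears after it. The snapshot data below is made up purely to exercise the function.

```python
from typing import Dict, List

Snapshot = Dict[str, bool]  # held-out check name -> passed?


def regressions(before: Snapshot, after: Snapshot) -> List[str]:
    """Checks that passed before a session and fail (or vanish) after it."""
    return sorted(
        name for name, ok in before.items() if ok and not after.get(name, False)
    )


def regression_rate(snapshots: List[Snapshot]) -> List[float]:
    """Per-session share of previously passing checks that regressed."""
    rates: List[float] = []
    for before, after in zip(snapshots, snapshots[1:]):
        passing = sum(before.values())
        broken = len(regressions(before, after))
        rates.append(broken / passing if passing else 0.0)
    return rates


# Made-up snapshots for illustration only; not experimental data.
toy_snapshots = [
    {"refund_cap": True, "token_expiry": True},
    {"refund_cap": True, "token_expiry": False},
    {"refund_cap": False, "token_expiry": False},
]
toy_rates = regression_rate(toy_snapshots)
```

The compounding hypothesis predicts this series climbs faster on the weakly-tested fork than on the strongly-tested one. A flat or equal series on both forks would count against it.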

The inverse claim, that strong-oracle codebases are agent multipliers, is the part I am least sure of. Teams with heavy test investment also invest in review, architecture, hiring discipline. Any observed agent boost on those codebases can only be attributed to the bundle, not specifically to the tests.

What to do

Treat oracle infrastructure as a first-class engineering line item. Not documentation, not developer experience, not "tech debt cleanup." Load-bearing infrastructure that determines what agents can and cannot reliably do.

Audit before you build. Pick the three highest-traffic modules in the repo (by commit count, by incident count, by agent edit count if you track it). For each, list the contracts that exist as runnable checks vs the contracts that exist only as tribal knowledge. That delta is the backlog.
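The audit output can be as small as a list. A sketch, assuming you record each contract with an optional pointer to the check that enforces it; `Contract`, `oracle_backlog`, and the inventory entries are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Contract:
    module: str
    statement: str
    check: Optional[str] = None  # test id or path; None means tribal knowledge only


def oracle_backlog(contracts: List[Contract]) -> Dict[str, List[str]]:
    """Group the contracts that have no runnable check by module."""
    backlog: Dict[str, List[str]] = {}
    for contract in contracts:
        if contract.check is None:
            backlog.setdefault(contract.module, []).append(contract.statement)
    return backlog


# Made-up inventory for illustration only.
inventory = [
    Contract("billing", "refunds never exceed the captured amount", "tests/test_refunds.py"),
    Contract("billing", "invoices are immutable once sent"),
    Contract("auth", "expired tokens are rejected before any database read"),
]
backlog = oracle_backlog(inventory)
```

The hard part is filling in `statement`, which usually takes the person who knows the story. The code only makes the gap countable.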

Retrofit contracts on hot paths first. Not "raise coverage to 80%." Pick the surfaces where a wrong change is most expensive (auth, billing, migrations, anything touching external APIs) and pin the contract with a test that would fail if the contract changed. Tests that pin contracts, not implementations (see objection below). Coverage is a lagging proxy; contract pinning is the actual goal.

Write the spec before the agent starts. Sometimes one paragraph and three bullets. Sometimes a paragraph and a failing test. A lightweight grilling step (stress-test the plan against existing docs and domain terms, surface glossary conflicts, force edge-case scenarios, resolve one decision at a time) substitutes for the conversation a junior would start and an agent does not. Clarification itself becomes a new artifact for the next session. Matt Pocock's grill-with-docs is one instance, and writes clarified terms back into project docs as they stabilize. N=1 example, take as illustration.

Objections worth taking seriously

Pin contracts, not implementations. Over-specified tests calcify bad designs; tightly typed wrong interfaces resist correction. A codebase with high-confidence wrong constraints is worse for an agent than one with weak ones, because the agent now confidently produces wrong code that passes. Oracle quality matters, not just oracle quantity. The prescription is not "more tests"; it is tests written against the behavior the system promises, not the shape of the current code.
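The contrast in miniature, as a sketch. `slugify` is a hypothetical function under test; the first test pins how the result is computed, the second pins what callers were promised.

```python
import re
from unittest import mock


def slugify(title: str) -> str:
    """Hypothetical function under test."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


def test_pins_implementation() -> None:
    # Brittle: asserts that re.sub is used. A rewrite that drops the regex
    # fails this test even if every slug stays identical.
    with mock.patch("re.sub", wraps=re.sub) as spy:
        slugify("Hello, World!")
    spy.assert_called_once()


def test_pins_contract() -> None:
    # Durable: asserts the promised behavior and nothing about the method.
    slug = slugify("  Hello, World!  ")
    assert slug == "hello-world"
    assert re.fullmatch(r"[a-z0-9]+(-[a-z0-9]+)*", slug)
    assert slugify(slug) == slug  # slugs are stable under re-slugging


test_pins_implementation()
test_pins_contract()
```

Both tests pass today. Only the second keeps passing after a correct refactor, and only the second fails when the behavior callers depend on changes.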

Agents will write the oracles themselves. Common 2026 pushback. Partly true: agents can author plausible tests against current behavior, which is exactly the wrong oracle (pins implementation, not contract). Agent-written tests against a human- or SME-authored spec are a different story and are the direction worth investing in. Agent writing both the spec and the test from the code alone reduces to the model grading itself, which is the original problem.

The problem may be transient. If model capability rises fast enough, agents may infer specs from code reliably without explicit artifacts. This is a real bet. My read: even if true, the artifacts have value as durable team context that outlives a model generation. But I am not certain, and a 2028 version of this post may look quaint.

Oracle authorship is not free. For research code, exploratory data work, throwaway scripts, design spikes, the cost of writing oracles exceeds the cleanup cost from drift. The prescription scopes to production codebases with multi-month lifespans. Outside that scope, the argument weakens or reverses.

The spec itself is often ungrounded. A human-written spec about ambiguous requirements is not the same kind of oracle as a failing test. Specs constrain the agent but do not grade it. Pretending they do is the failure mode of spec-driven dev as a movement. The honest version: spec + test + eval is the loop; spec alone is half the work.

Evals are their own deep problem. Eval gaming, distribution mismatch, reward hacking. Citing "write an eval suite" as a prescription understates the work involved. The same property that makes evals load-bearing (they grade aggregate behavior) makes them hard to author correctly.

Lineage

The shape (write down what software must do, in a form a machine can check) is old. Formal methods (Hoare, Lamport's TLA+, Meyer's design-by-contract) stayed a minority practice, mostly academic, outside mainstream software. TDD and Spolsky's functional specs were the agile-era thread, more talked about than uniformly practiced. Domain-driven design (Evans) and event storming are the SME-extraction thread: ubiquitous language as glossary oracle, bounded contexts as contract oracles, aggregate invariants as the SME-authored invariants described above. The agent era reaches for the same artifacts under new cost pressure: spec file the model reads, typed tool it calls, eval suite that grades it.

Spec-driven-development tooling for agents includes GitHub Spec Kit and OpenSpec. Property-based testing (Hypothesis, fast-check) is an older practice with renewed pull in agent-conditioned codebases: PBT tests are spec-shaped oracles agents can run. Eval infrastructure (OpenAI Evals, UK AISI Inspect) is the same shape one level up.
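To make "spec-shaped" concrete, a hand-rolled property check. This is deliberately not Hypothesis or fast-check and does not use their APIs; it is a standard-library stand-in so the sketch has no dependencies, and it lacks the shrinking and input strategies those libraries provide. `merge_sorted` is a hypothetical function under test.

```python
import random
from typing import Callable, List


def merge_sorted(a: List[int], b: List[int]) -> List[int]:
    """Hypothetical function under test: merge two sorted lists."""
    out: List[int] = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out + a[i:] + b[j:]


def random_sorted_list(rng: random.Random) -> List[int]:
    return sorted(rng.randint(-50, 50) for _ in range(rng.randint(0, 8)))


def check_property(
    prop: Callable[[List[int], List[int]], bool], runs: int = 200, seed: int = 0
) -> None:
    """Try the property on many generated inputs; fail with the first counterexample."""
    rng = random.Random(seed)  # fixed seed keeps failures reproducible
    for _ in range(runs):
        a, b = random_sorted_list(rng), random_sorted_list(rng)
        assert prop(a, b), f"property failed for a={a}, b={b}"


def merge_matches_spec(a: List[int], b: List[int]) -> bool:
    # The spec in one line: merging equals sorting the concatenation.
    return merge_sorted(a, b) == sorted(a + b)


check_property(merge_matches_spec)
```

The property reads like a sentence from a spec and runs like a test, which is the pull: an agent can execute it without anyone enumerating cases by hand.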

Some teams kept this discipline. Many did not. The agent era is forcing the second group to rebuild what the first group never stopped doing.

Sources

Research and writeups: Reflexion (Shinn et al., 2023) and Measuring Faithfulness in Chain-of-Thought Reasoning (Lanham et al., 2023), both cited inline above; Anthropic's Building Effective Agents.

Historical practice worth reading directly

  • Bertrand Meyer, Object-Oriented Software Construction: design by contract.
  • C. A. R. Hoare, An Axiomatic Basis for Computer Programming (1969).
  • Leslie Lamport, TLA+.
  • David Parnas, On the Criteria To Be Used in Decomposing Systems into Modules (1972).
  • Eric Evans, Domain-Driven Design: ubiquitous language, bounded contexts, aggregates.
  • Kent Beck, Test-Driven Development: By Example.