2026-04-22

The Economics of Agent Accuracy

agents · economics · accuracy · governance

The number everyone gets wrong

When someone says their coding agent is "90% accurate," they usually mean it produces correct output 90% of the time. That sounds good. It is not enough information to know whether the agent is worth running.

The missing variable is cost of failure. A 90% accurate agent writing unit tests has a very different economic profile than a 90% accurate agent writing database migrations. The test failures are caught automatically and cost minutes to fix. The migration failures might corrupt production data and cost days.

Accuracy without consequence modeling is a vanity metric.

The break-even formula

The value of an autonomous agent is not its accuracy rate. It's this:

net value = time saved by the agent − time spent on rework − time spent on supervision

Each variable matters:

Time saved is the work the agent does that a human would otherwise have to do. If the agent writes 40 pull requests a day and each saves a developer 30 minutes, that's 20 hours saved. This is the number everyone calculates. It is the easy part.

Rework time is what happens when the agent is wrong. If 10% of those 40 PRs need correction, and each correction takes 45 minutes (finding the bug, understanding the agent's logic, fixing it, re-testing), that's 4 PRs × 45 minutes = 3 hours of rework. This number is almost always underestimated because people measure the fix time but not the diagnosis time.

Supervision time is the cost of reviewing agent output even when it's correct. If a senior engineer spends 5 minutes reviewing each of 40 PRs, that's 3.3 hours of supervision. This is the hidden tax. It exists even at 100% accuracy because someone still needs to verify.

So the net value is: 20 hours saved − 3 hours rework − 3.3 hours supervision = 13.7 hours net. The agent is clearly worth running.

But change the accuracy to 70% and the math shifts: 20 hours saved − 12 × 0.75 hours rework − 3.3 hours supervision = 20 − 9 − 3.3 = 7.7 hours. Still positive, but the rework cost tripled.

Now change the consequence level. If each failure requires 3 hours to diagnose and fix (production incident, not just a bad test), the same 70% accuracy produces: 20 − 12 × 3 − 3.3 = 20 − 36 − 3.3 = −19.3 hours. The agent is a net liability.
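
The arithmetic above can be sketched as a small function. The numbers are the illustrative ones from this section (40 PRs/day, 30 minutes saved each, 5 minutes of review each), not measurements:

```go
package main

import "fmt"

// netValueHours implements the break-even formula:
// net value = time saved − rework − supervision, all in hours.
// accuracy is the fraction of outputs that are correct; savedPer,
// reworkPer, and reviewPer are hours per task.
func netValueHours(tasks int, accuracy, savedPer, reworkPer, reviewPer float64) float64 {
	saved := float64(tasks) * savedPer
	rework := float64(tasks) * (1 - accuracy) * reworkPer
	review := float64(tasks) * reviewPer
	return saved - rework - review
}

func main() {
	// 40 PRs/day, 30 min saved each, 45 min rework per failure, 5 min review each.
	fmt.Printf("%.1f\n", netValueHours(40, 0.90, 0.5, 0.75, 5.0/60))
	// 70% accuracy, same low-consequence rework.
	fmt.Printf("%.1f\n", netValueHours(40, 0.70, 0.5, 0.75, 5.0/60))
	// 70% accuracy, 3-hour production-incident rework per failure.
	fmt.Printf("%.1f\n", netValueHours(40, 0.70, 0.5, 3.0, 5.0/60))
}
```

Plugging in a consequence level is just changing `reworkPer`; the sign of the result flips long before accuracy reaches zero.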

This is why accuracy alone is meaningless. The economic question is always: *accurate enough, at what consequence level, with what supervision model?*

Where the money actually goes

We track costs across our agent infrastructure using Arbiter governance rules. The data tells a consistent story about where autonomous systems spend money.

Inference costs are the visible expense. Cloud API calls — GPT-5.4 for code generation, Claude Sonnet for documentation, Claude Opus for deep research. These show up on invoices. They're real but usually not the largest cost once you run local models for routine work. Our research loop runs 300 cycles per day on local inference at zero marginal cost. The cloud overflow for the remaining 18% is roughly $0.13 per day.

Rework costs are the invisible expense. When our agent loop produced 10 turns of empty reasoning before we fixed the tool-calling bug, that wasn't just wasted inference — it was wasted human time diagnosing why, finding the root cause in the Responses API, implementing the fix, and testing it. The inference cost was cents. The human cost was hours.

Governance costs are the investment that reduces both. The Arbiter rules that cap daily spend, throttle research cadence, and gate escalation cost nothing to run (gRPC calls to a local service). But they prevent the runaway scenarios that create rework. Our research loop burned 3,000 Firecrawl API credits in minutes before governance. After governance: predictable, capped, focused. The governance investment pays for itself in avoided waste.
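
A spend cap of the kind described above can be illustrated in a few lines. This is a minimal sketch, not Arbiter's actual API — just the shape of a daily budget rule with a hard stop:

```go
package main

import (
	"errors"
	"fmt"
)

// SpendCap is an illustrative daily budget rule: it records spend and
// refuses any charge that would breach the limit.
type SpendCap struct {
	LimitUSD float64
	spentUSD float64
}

var ErrCapExceeded = errors.New("daily spend cap exceeded")

// Charge records a cost, or returns ErrCapExceeded without recording it.
func (c *SpendCap) Charge(usd float64) error {
	if c.spentUSD+usd > c.LimitUSD {
		return ErrCapExceeded
	}
	c.spentUSD += usd
	return nil
}

func main() {
	budget := &SpendCap{LimitUSD: 1.00} // cap cloud overflow at $1/day
	for i := 0; i < 12; i++ {
		if err := budget.Charge(0.10); err != nil {
			fmt.Println("cycle", i, "blocked:", err)
			break
		}
	}
}
```

The point of the pattern is that the refusal happens before the spend, which is what turns a runaway loop into a bounded one.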

The local-vs-cloud decision as an economic calculation

The conventional wisdom is that cloud models are better and local models are cheaper. Both are true at a surface level and both miss the point.

The relevant question is: *at what accuracy threshold does the task become net-positive, and which model achieves that threshold at the lowest total cost?*

For our research synthesis (topic → web search → extract → summarize → score), a 14B local model running on Ollama produces briefs at 0.85 confidence. A cloud model might produce 0.92 confidence. The difference — 7 percentage points — doesn't justify the cost because the consequence of a mediocre research brief is low. A human reviewer spends the same time skimming either way. The local model wins on economics even though it's technically less capable.

For code generation on production systems, the calculus reverses. A local model that produces 75% correct code creates 25% rework. A cloud model at 92% creates 8% rework. Once a rework episode costs more than about six times the cloud inference price (and rework is human time, usually far more expensive than inference), the cloud model is cheaper despite the per-token premium.

Self-hosted AI can be five to ten times cheaper than cloud at moderate volumes — but only for tasks where the accuracy gap doesn't generate rework that swallows the savings. The governance layer decides which model handles which task. The economics determine the routing rules.
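
The routing comparison reduces to expected cost per task: inference price plus failure rate times rework cost. The dollar figures below are illustrative, with local inference treated as free (sunk hardware cost):

```go
package main

import "fmt"

// perTaskCost is the expected total cost of one task:
// inference price plus expected rework.
func perTaskCost(inference, accuracy, reworkCost float64) float64 {
	return inference + (1-accuracy)*reworkCost
}

func main() {
	const cloudInfer = 0.10 // $/task, illustrative
	// Sweep the rework-cost multiple to find where the routing flips.
	for _, mult := range []float64{4, 6, 10} {
		rework := mult * cloudInfer
		local := perTaskCost(0, 0.75, rework)         // 25% failures, free inference
		cloud := perTaskCost(cloudInfer, 0.92, rework) // 8% failures
		fmt.Printf("rework=%2.0fx  local=$%.3f  cloud=$%.3f\n", mult, local, cloud)
	}
}
```

With these failure rates the crossover sits near a 6x rework multiple; below it the local model wins, above it the cloud model does. That crossover is exactly what a routing rule encodes.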

Token budgets as a cost lever

The amount of context you feed an agent directly affects both cost and accuracy. More context means more tokens, which means higher inference cost — but also better output, which means less rework.

The optimal token budget is the point where adding more context stops improving accuracy enough to justify the cost. In our experience, structural context (Part 1 of the series — feeding the agent call graphs and dependency sets instead of raw file text) achieves roughly 10x compression: the same semantic information in one-tenth the tokens. That's a direct cost reduction that doesn't sacrifice accuracy.

This is where the pure-Go tooling stack becomes economically relevant. A structural code intelligence layer that produces tight, information-dense context windows — rather than dumping entire files into the prompt — saves money on every inference call. The tooling cost is zero (local binary). The savings compound across hundreds of daily agent cycles.
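
The compression savings compound linearly with call volume. A quick sketch, assuming an illustrative input-token price and the roughly 10x compression figure from above:

```go
package main

import "fmt"

// contextCost is the input-side cost of one inference call:
// tokens times the per-million-token price.
func contextCost(tokens int, pricePerMTok float64) float64 {
	return float64(tokens) / 1e6 * pricePerMTok
}

func main() {
	const pricePerMTok = 3.00 // $ per million input tokens, illustrative
	raw := contextCost(50_000, pricePerMTok)       // dumping whole files into the prompt
	structural := contextCost(5_000, pricePerMTok) // call graphs + dependency sets
	perDay := 300 * (raw - structural)             // savings across 300 daily cycles
	fmt.Printf("raw=$%.3f  structural=$%.3f  saved/day=$%.2f\n", raw, structural, perDay)
}
```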

What to track

If you're running autonomous agents and want to understand their economics, track these:

Accuracy by task type. Not a global number. Break it down: code generation accuracy, documentation accuracy, research accuracy, refactoring accuracy. Each has different consequence levels and different rework costs.

Rework hours per failure. Measure the full cycle: time to discover the error, time to diagnose, time to fix, time to verify. This number is always larger than people assume.

Supervision hours per cycle. How much human time goes into reviewing agent output, even when it's correct? This is the tax you pay for using agents at all. Reducing it requires building trust (Part 6) through calibrated confidence scores that let reviewers skip the easy cases.

Cost per successful completion. Total inference cost plus rework cost plus supervision cost, divided by successful outputs. This is the number that tells you whether the agent is economically viable.

Governance savings. What did the circuit breakers, caps, and routing rules prevent? We can quantify this directly: the research loop was burning $20+/day before governance. After: $0.13/day. The delta is the governance ROI.
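
The viability metric in that list is a one-liner. A sketch with hypothetical inputs (the hourly rate and success count are assumptions, not data from this post):

```go
package main

import "fmt"

// costPerSuccess computes the viability metric: total inference, rework,
// and supervision cost divided by successful outputs.
func costPerSuccess(inferenceUSD, reworkHours, supervisionHours, hourlyRate float64, successes int) float64 {
	total := inferenceUSD + (reworkHours+supervisionHours)*hourlyRate
	return total / float64(successes)
}

func main() {
	// Illustrative day: $0.13 of inference, 3 rework hours, 3.3 supervision
	// hours, a $100/hour engineer, 36 successful PRs out of 40.
	fmt.Printf("$%.2f per successful PR\n", costPerSuccess(0.13, 3, 3.3, 100, 36))
}
```

Note how thoroughly the human-time terms dominate the inference term; that is the whole argument of this post in one division.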

The thesis

Agent economics are not about making inference cheap. They're about making the total cost of autonomous work — inference plus rework plus supervision — lower than the cost of human work for the same output.

That requires accuracy tracking by domain, consequence-aware routing, governance that prevents waste, and structural tooling that compresses context. The model is one variable. The system around it determines whether the math works.

---

Need help with the math for your team?

ResonanceWorks works with founders and small teams on architecture, governance, and private AI system design. Talk to Consulting.

Want a system that governs its own budget?

Torque Engineering installs performance-tuned private AI with Arbiter governance built in. Get Early Access.

Exploring human-machine culture?

Entrainment House publishes music, art, and cultural works shaped through human-machine coordination. Enter the House.