2026-04-11

Building an Arbiter: Governance for Autonomous AI Systems

arbiter · governance · ai-systems · infrastructure

The moment you realize your agent needs a boss

We had an autonomous research loop running on local models. It would pick a topic from a backlog, search the web for sources, synthesize a brief, score its own confidence, and queue follow-up topics for the next cycle. It ran every two minutes. It was free — local inference, no API costs.

Within a week it had generated 1,700 research briefs. It had also burned through 3,000 Firecrawl API credits in minutes, spiraled into a self-reinforcing loop of increasingly off-topic follow-ups, and was cheerfully rating its own output at 0.90 confidence while working from scraped blog posts because every academic source returned a 403.

The system was autonomous. It was not governed.

That's why we built the Arbiter.

What the Arbiter does

The Arbiter is a Go-based rules engine from odvcencio, and it governs every autonomous process in our stack. It doesn't do the work — it decides whether work should happen, which model should do it, and when to stop.

It operates through declarative rules written in .arb files. The rules read like policy statements, not code. A cost circuit breaker looks like this:

rule DailySpendCircuitBreaker priority 1 {
    when { session.daily_spend_usd > 50.00 }
    then HaltExecution {
        reason: "daily_budget_exceeded",
        alert: "discord",
    }
}

When an autonomous process — the task poller, the research loop, the code generation agent — wants to act, it checks with the Arbiter first. The Arbiter evaluates the current context against its rules and returns a verdict: proceed, throttle, deny, or halt.
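A minimal sketch of that check-then-act loop, in Python. The `Verdict` values come from the post; the `Context` fields, the throttle threshold, and the `evaluate` function itself are illustrative stand-ins for the real Arbiter evaluation, not its API:

```python
from dataclasses import dataclass
from enum import Enum

class Verdict(Enum):
    PROCEED = "proceed"
    THROTTLE = "throttle"
    DENY = "deny"
    HALT = "halt"

@dataclass
class Context:
    daily_spend_usd: float
    backlog_depth: int

def evaluate(ctx: Context) -> Verdict:
    # Mirrors the DailySpendCircuitBreaker rule above:
    # halt outright once the daily budget is exceeded.
    if ctx.daily_spend_usd > 50.00:
        return Verdict.HALT
    # Hypothetical throttle threshold, for illustration only.
    if ctx.backlog_depth > 100:
        return Verdict.THROTTLE
    return Verdict.PROCEED
```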

Changing a threshold is a one-line edit. Publishing updated rules to the running Arbiter takes one command. No Python changes, no deploys, no restarts.

Why declarative rules matter

The natural instinct when an autonomous system misbehaves is to add an if statement. Check the budget in the main loop. Add a counter for follow-ups. Hardcode a cap.

This works until you have five autonomous processes, each with its own set of hardcoded limits, scattered across different files in different languages. The research loop has its caps in Python. The code generation agent has its caps in a different Python file. The task spawner has caps in a config. Nobody remembers what the limits are or where they live.

Declarative governance centralizes the policy. All the rules live in one place. They're written in a format that's readable by anyone — not just the person who wrote the code. They compose: you can add a new rule without touching any existing logic. And they're auditable: every decision the Arbiter makes is logged with which rules matched and why.
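The compose-and-audit property is easy to see when rules are data rather than control flow. This toy sketch (rule names and context keys are invented for the example) shows both: a new rule is one appended tuple, and every decision carries the list of rules that matched:

```python
RULES = [
    # (name, predicate over the context, action) -- first match wins.
    ("DailySpendCircuitBreaker", lambda c: c["daily_spend_usd"] > 50.00, "halt"),
    ("BacklogDepthCap",          lambda c: c["backlog_depth"] > 100,     "deny"),
]

def decide(ctx: dict, rules=RULES):
    """Return (action, matched_rule_names). The matched list is what
    makes the decision auditable after the fact."""
    matched = [name for name, pred, _ in rules if pred(ctx)]
    for name, pred, action in rules:
        if pred(ctx):
            return action, matched
    return "proceed", matched
```

Adding a third rule touches no existing predicate, which is the composition property the `.arb` format provides at the policy level.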

The rules engine pattern isn't new — business rules engines have existed for decades. What's new is the need for them in agent systems. When your agents run continuously and autonomously, governance isn't optional. It's infrastructure.

What we govern

Our Arbiter manages rules across five bundles, each governing a different concern:

Cost governance. Daily spend caps, hourly token throttles, and warnings as limits approach. The system governs its own budget. We've never had a surprise bill.

Model routing. Which model handles which task. Code generation routes to GPT-5.4. Documentation routes to Claude Sonnet. Deep research routes to Claude Opus — but only when the local model's confidence falls below 0.4. Critical tasks get the most capable model. Routine work stays local and free.
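As a sketch, the routing described above reduces to a small decision function. The model assignments and the 0.4 confidence threshold are from the post; the task-type strings and the function itself are illustrative, not the Arbiter's actual interface:

```python
def route_task(task_type: str, local_confidence: float) -> str:
    """Pick a model for a task; a sketch of the routing policy."""
    if task_type == "codegen":
        return "gpt-5.4"
    if task_type == "docs":
        return "claude-sonnet"
    if task_type == "deep-research":
        # Escalate to the most capable model only when the local
        # model is unsure (confidence below 0.4).
        if local_confidence < 0.4:
            return "claude-opus"
        return "local"
    # Routine work stays local and free.
    return "local"
```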

Research governance. Daily caps on Firecrawl API calls, research cycles, follow-up topics, and escalations. Focus enforcement — follow-up topics that don't match the studio's research priorities get rejected. Backlog depth limits prevent unbounded queue growth. Source quality gates flag briefs where most URLs returned 403 errors.
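The focus-enforcement and backlog-depth checks combine into one admission gate, sketched here. Keyword matching against a priority list is a stand-in for whatever focus check the real rules perform, and the cap value is invented:

```python
def admit_followup(topic: str, priorities: list[str],
                   backlog: list[str], max_backlog: int = 50) -> bool:
    """Gate a follow-up research topic: refuse to grow the backlog
    past a cap, and reject topics matching none of the studio's
    research priorities."""
    if len(backlog) >= max_backlog:
        return False
    return any(p.lower() in topic.lower() for p in priorities)
```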

Approval gates. Some actions need a human in the loop. The Arbiter can gate a commit, require Discord notification before a merge, or block file deletions and force pushes entirely. The approval rules define which actions are automatic, which are gated, and which are forbidden.

Feature flags. Capabilities like autonomous task spawning, the agentic code generation loop, and auto-merge can be toggled on or off through the Arbiter's flag system without touching code.

The architecture

The Arbiter is part of a broader ecosystem of Go-based developer tools from the same author. The approach is consistent across all of them: pure Go, no CGo dependencies, gRPC interfaces, and clean separation of concerns. The same design philosophy produced gotreesitter — a ground-up reimplementation of the tree-sitter parser runtime in pure Go with 206 embedded grammars — and gts-suite, a structural code analysis toolkit that exposes everything from call graphs to dead code detection via MCP.

The Arbiter runs as a gRPC service. Clients — the task poller, the research loop, the code agent — call it over gRPC with a context payload: what's the task, what's the current spend, what's the backlog depth. The Arbiter evaluates its rules against that context and returns matched actions.

The rules are organized into bundles, published to the running server, and activated atomically. You can publish a new version of the cost rules without touching the routing rules. Each bundle tracks its version, so rollback is trivial.
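A toy model of that bundle lifecycle, assuming nothing about the real publishing protocol: each publish appends a version and activates it under a lock, and rollback is a pointer move:

```python
import threading

class BundleStore:
    """Versioned rule bundles with atomic activation and rollback.
    A sketch of the idea, not the Arbiter's implementation."""

    def __init__(self):
        self._lock = threading.Lock()
        self._versions: dict[str, list[dict]] = {}  # bundle -> versions
        self._active: dict[str, int] = {}           # bundle -> active index

    def publish(self, bundle: str, rules: dict) -> int:
        with self._lock:
            versions = self._versions.setdefault(bundle, [])
            versions.append(rules)
            # Activation is a single index write under the lock,
            # so readers never see a half-published bundle.
            self._active[bundle] = len(versions) - 1
            return self._active[bundle]

    def rollback(self, bundle: str) -> int:
        with self._lock:
            if self._active.get(bundle, 0) > 0:
                self._active[bundle] -= 1
            return self._active[bundle]

    def active(self, bundle: str) -> dict:
        with self._lock:
            return self._versions[bundle][self._active[bundle]]
```

Because bundles are independent keys, publishing new cost rules never touches the active routing rules.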

The agent-side integration is lightweight. A Python or Go client calls check_cost_limits() before starting work, check_nanochat() before each research cycle, route_task() to select a model, and check_approval() before committing. If the Arbiter is unavailable, the system degrades gracefully — safe defaults, no halt.
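The graceful-degradation pattern might look like the wrapper below. `check` stands in for any of the real gRPC calls; the error type and names are illustrative:

```python
def check_with_fallback(check, default):
    """Call an Arbiter check, degrading to a safe default if the
    server is unreachable, so agents never hard-fail on governance."""
    try:
        return check()
    except ConnectionError:
        # Arbiter down: fall back to the caller's safe default
        # instead of halting the whole pipeline.
        return default
```

The key design choice is that the default must itself be safe (a conservative verdict), since an unreachable Arbiter should never grant more permission than a reachable one.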

What makes this stack compelling is how the pieces compose. The Arbiter governs agent behavior. gts-suite gives agents structural understanding of the code they're working in — not line-level text, but entity-level awareness of functions, dependencies, and call graphs. gotreesitter makes all of that portable across any platform without a C toolchain. These aren't disconnected tools — they're layers of an infrastructure designed for autonomous coding systems.

What we learned

Governance is the most important layer. More important than model selection, prompt engineering, or tool design. Without it, autonomous systems reliably drift toward waste. With it, they stay focused and predictable.

Rules should be tunable by non-engineers. The .arb format works because anyone can read it and understand what the system will do. When we needed to raise the Firecrawl daily cap from 20 to 500, it was a one-line change published in seconds. No deploy, no code review.

Start with caps, not permissions. Our first rules were all circuit breakers: stop if spend exceeds X, stop if the backlog exceeds Y. These prevent the worst outcomes immediately. Sophisticated routing and quality gates came later.

Log everything the Arbiter decides. Every matched rule, every denial, every routing decision goes to a log. When something goes wrong — and it will — the log tells you exactly which rule fired and why. This is how we discovered the research loop was burning Firecrawl credits: the deny logs showed 3,163 consecutive failures.
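The kind of structured log that makes this possible is simple to sketch. Field names here are illustrative; the point is that a run of consecutive denials becomes a trivial query:

```python
import time

def log_decision(log: list, rule: str, verdict: str, context: dict) -> None:
    """Append one structured record per Arbiter decision."""
    log.append({
        "ts": time.time(),
        "rule": rule,
        "verdict": verdict,
        "context": context,
    })

def consecutive_denials(log: list) -> int:
    """Length of the trailing run of 'deny' verdicts."""
    n = 0
    for entry in reversed(log):
        if entry["verdict"] != "deny":
            break
        n += 1
    return n
```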

The Arbiter is for runtime, not design time. It doesn't tell agents how to do their work. It tells them when to start, when to stop, and which resources to use. The agents own execution. The Arbiter owns policy.

Getting started

If you're building autonomous agent systems and governance isn't part of your architecture yet, it should be. The Arbiter and the broader tooling ecosystem are available on GitHub. The gRPC interface means you can integrate from any language — we run Python and Go clients side by side with no friction.

Start with cost circuit breakers. Add model routing when you have more than one provider. Layer in approval gates when your agents start touching production systems. The rules compound: each one you add makes the system more predictable without making it more rigid.

If your agents run for more than a few minutes unattended, they need an arbiter. Not because they're dangerous — but because autonomous systems optimize for completion, not wisdom. The governance layer is what bridges the gap.