2026-04-17

Part 3: Execution — How Agents Produce Correct Code

agents · architecture · execution · verification · series

*This is Part 3 of a six-part series on the layers of an autonomous coding agent. Start with the introduction, or read Part 2: Planning.*

---

The gap between plausible and correct

Ask any coding agent to generate a React component. It will produce something that looks right. The JSX is valid, the props are typed, the styling roughly matches what you described. If you squint, it's a component.

Now paste it into a real codebase. The imports reference a design system component by named export, but the codebase uses default exports. The component duplicates a header element that's already rendered by the layout. The color values are close to the design system but not the actual tokens. The metadata export is missing, so the page title falls back to the site default.

None of this shows up in the generated code. The file looks fine in isolation. Every problem is relational — it only appears when you check the output against the actual project structure. A function that compiles alone but breaks in context. A component that renders alone but conflicts when composed.

This is the execution problem. The agent produces output. The output is plausible. Plausible is not correct, and the distance between them is where autonomous systems earn or lose trust.

How most agents execute

The standard execution pipeline is alarmingly simple: take a plan (or just a task description), feed it to a language model with some context, and write the model's output to files. Maybe run the build afterward. Maybe not.

This pipeline treats execution as a text-generation problem. The model is a sophisticated autocomplete engine. The output is a prediction of what code should look like, based on patterns the model learned during training. The prediction is often remarkably good — good enough to pass a quick visual review, good enough to look like it was written by a junior developer who roughly understood the task.
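Reduced to its essentials, that pipeline is only a few lines, which is exactly the problem. A minimal sketch in Python (the `model` and `write_file` callables are hypothetical stand-ins, not a real API):

```python
def naive_execute(task: str, context: str, model, write_file) -> None:
    """The standard pipeline: prompt in, files out, nothing checked."""
    output = model(f"Task: {task}\n\nContext: {context}")
    # Write whatever came back. No build, no structural check,
    # no feedback loop -- errors surface later, or never.
    for path, code in output.items():
        write_file(path, code)
```

Everything the rest of this post describes is what has to wrap around those few lines before they can be trusted.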

The problems are systematic.

No verification during generation. The model writes an entire file, or sometimes an entire set of files, before anything is checked. If the first file introduces a broken import, every subsequent file that depends on it compounds the error. There's no feedback loop between writing and checking.

No structural awareness. The model doesn't know that the codebase uses default exports. It doesn't know that Header is already rendered by the layout. It doesn't know that #A33A2B is the accent color and #a33a2b is not the same token in the design system. These are structural facts about the project that the model can only know if something tells it — and in most pipelines, nothing does.

No self-correction. If the output doesn't compile, the pipeline either fails silently (commit the broken code) or fails loudly (halt and wait for a human). Neither is useful for autonomous operation. What's missing is the middle path: detect the error, read the error message, and try again.

The result is an agent that works for greenfield tasks — "create a new component from scratch" — and breaks on integration tasks — "add this to the existing system without disrupting what's already there." Since almost all real work is integration work, this is a significant limitation.

What structural execution looks like

Structural execution means the agent generates code against a model of the project, not against a prediction of what the project might look like.

When the execution layer sits on top of structural perception (Part 1) and a concrete plan (Part 2), the generation process changes fundamentally:

  • The agent knows what already exists. Before writing a component, it queries the project structure: what components are exported, what naming conventions are used, what the layout already renders. The generated code conforms to the actual project, not the model's training data.
  • Each step is verifiable. The plan says "create function RateLimitMiddleware matching the signature of AuthMiddleware." The execution layer can check: does the output have the right signature? Does the file compile? Do the imports resolve against the actual module graph? These checks are structural, not textual — they operate on the parsed code, not the raw text.
  • Errors produce information. When the build fails, the error message tells the agent what went wrong. A missing import, a type mismatch, an unresolved reference. The agent can read this information and act on it — not by guessing, but by querying the perception layer for the correct import path, the expected type, the actual symbol name.
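Checks that operate on parsed code rather than raw text can be sketched with the standard-library `ast` module. This is an illustrative fragment, not our production reviewer; the function and argument names are assumptions:

```python
import ast

def check_structure(source: str, expected_func: str,
                    expected_args: list[str],
                    known_modules: set[str]) -> list[str]:
    """Structural checks on the parsed code, not the raw text."""
    problems = []
    tree = ast.parse(source)
    funcs = {n.name: n for n in ast.walk(tree)
             if isinstance(n, ast.FunctionDef)}
    # Does the planned function exist, with the planned signature?
    if expected_func not in funcs:
        problems.append(f"missing function {expected_func}")
    elif [a.arg for a in funcs[expected_func].args.args] != expected_args:
        problems.append(f"{expected_func} has wrong signature")
    # Do imports resolve against the actual module graph?
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                if alias.name.split(".")[0] not in known_modules:
                    problems.append(f"unresolved import {alias.name}")
    return problems
```

The point is the interface: the check returns facts about the code's structure, which the agent can act on, rather than a pass/fail bit.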

The agent stops being a text generator and starts being a builder that checks its own work.

The self-correcting loop

The single most important pattern in production execution is the multi-turn loop. The agent does not generate code in a single pass. It generates, observes, and corrects.

In practice, this looks like:

1. The agent receives a step from the plan: "modify middleware.go to add RateLimitMiddleware."

2. It writes the code, using the project structure as context.

3. It runs the build. The build fails: undefined: config.RateLimitConfig.

4. It reads the error. It queries the perception layer: does RateLimitConfig exist? No. It was supposed to be created in a prior step. Did that step complete? Yes — but the type was named RateLimitSettings, not RateLimitConfig.

5. It fixes the reference. Rebuilds. The build passes.

6. It runs the structural review: are all imports resolved? Are there new symbols that shadow existing ones? Were any files modified that weren't in the plan?

7. If the review passes, the step is done. If not, loop.

This loop runs up to a bounded number of turns — in our systems, ten. If the agent can't produce passing code in ten iterations, the task escalates to a human. This bound matters: it prevents runaway execution where the agent chases its own errors in circles, burning budget and producing nothing useful.
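The loop's shape matters more than any single check inside it. A minimal sketch, with `generate`, `build`, and `review` injected as callables so the loop itself is pure orchestration (all names here are illustrative):

```python
def execute_step(step, generate, build, review, max_turns: int = 10):
    """Generate, observe, correct -- bounded, then escalate to a human.

    generate(step, feedback) -> code
    build(code)              -> error message, or None on success
    review(code)             -> list of structural findings (empty = pass)
    """
    feedback = None
    for turn in range(1, max_turns + 1):
        code = generate(step, feedback)
        error = build(code)            # e.g. "undefined: config.RateLimitConfig"
        if error:
            feedback = f"build failed: {error}"
            continue
        problems = review(code)        # structural review on the passing build
        if problems:
            feedback = "review failed: " + "; ".join(problems)
            continue
        return {"status": "done", "code": code, "turns": turn}
    # The bound is the safety valve: no runaway loops, no silent partial work.
    return {"status": "escalate", "turns": max_turns}
```

Note that the error message itself becomes the next turn's input: the agent corrects against observed failure, not against a fresh guess.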

The loop is what makes the difference between an agent that works on toy problems and one that works on real codebases. No model, regardless of capability, produces correct code on the first try for non-trivial integration tasks. The models that seem to do this are either working on tasks simple enough that first-try is sufficient, or they're failing in ways the user hasn't noticed yet.

The agent that tries, observes, and corrects is slower per task but dramatically more reliable. And in autonomous operation, reliability is the only metric that compounds.

The verification stack

Verification isn't one thing. It's a stack of checks, each catching a different class of error. In our production systems, the stack runs after every code generation step:

Build verification. Does the project compile? This is the first gate and the most absolute. A build failure is not a warning — it's a hard stop. The agent must fix the error before proceeding. In our orchestrator, a build failure halts the entire commit path. No exceptions.

Structural review. Using the perception layer's project model and the git diff of what changed, a structural review checks for:

  • Missing imports for newly referenced symbols
  • Duplicate components or functions that already exist elsewhere
  • Unused exports created by the change
  • Files modified that weren't in the plan (a strong signal of scope creep or confused execution)
  • Design system violations — wrong color tokens, wrong component names, wrong patterns

This is the layer that catches the errors from the opening section: the named-export import, the duplicate header, the wrong color value. Testing alone can't catch these because they're not functional bugs — they're structural inconsistencies that produce wrong but running code.
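Two of the checks from the list above are simple enough to sketch directly: plan-scope enforcement and duplicate-symbol detection, both driven by the diff and the project model (input shapes are illustrative):

```python
def structural_review(changed_files, plan_files, new_symbols, project_symbols):
    """Diff-aware structural checks: scope creep and duplication."""
    findings = []
    for path in changed_files:
        if path not in plan_files:
            # A strong signal of scope creep or confused execution.
            findings.append(f"out-of-plan change: {path}")
    for sym in new_symbols:
        if sym in project_symbols:
            findings.append(f"duplicate symbol: {sym} already exists")
    return findings
```

Each finding names a specific structural fact, which is what lets the self-correcting loop act on it rather than regenerate blindly.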

Visual verification. For UI work, screenshot the rendered page and inspect it. Does the layout match the spec? Are elements positioned correctly? Is the visual hierarchy intact? This catches an entire class of "technically correct but visually wrong" outputs that neither build nor structural review can see.

Governance checks. Before committing, the governance layer evaluates the change against policy: does this diff delete exported symbols that have external references? Does the estimated cost of this execution step exceed the per-task budget? Does the change type (destructive file deletion, schema migration, dependency removal) require human approval?

Each layer catches errors the layers below it can't. Build verification catches syntax and type errors. Structural review catches integration errors. Visual verification catches rendering errors. Governance catches policy violations. Removing any layer opens a category of failure that the remaining layers are blind to.
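The stack itself is just an ordered sequence of gates, where the first failure halts the commit path. A sketch of that orchestration (layer names and the `check` interface are assumptions for illustration):

```python
def run_verification_stack(change, layers):
    """Run each verification layer in order; any failure halts the commit.

    layers: ordered list of (name, check) pairs, where check(change)
    returns a list of findings -- empty means the layer passed.
    """
    for name, check in layers:
        findings = check(change)
        if findings:
            return {"passed": False, "failed_layer": name,
                    "findings": findings}
    return {"passed": True}
```

Ordering is deliberate: cheap absolute gates (build) run before expensive judgment-heavy ones (visual, governance), so most failures are caught early.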

When execution tools help

The execution layer benefits enormously from purpose-built tooling that bridges the gap between model output and project reality.

Output normalization. Language models produce code in unpredictable formats. Sometimes the output is a complete file. Sometimes it's a diff. Sometimes it's wrapped in markdown code fences, sometimes it's not. A production execution layer needs to parse at least half a dozen output formats reliably, because the model doesn't produce consistent formatting across tasks or turns.
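One of those formats, the optional markdown fence, is easy to show. A minimal normalizer that extracts fenced code when present and falls back to the raw text otherwise (a sketch of one case, not a full parser):

```python
import re

def normalize_output(raw: str) -> str:
    """Extract code from model output that may or may not be fenced."""
    fenced = re.findall(r"```[\w+-]*\n(.*?)```", raw, flags=re.DOTALL)
    if fenced:
        # Join multiple fenced blocks; drop the fences and language tags.
        return "\n".join(block.rstrip("\n") for block in fenced)
    return raw.strip()
```

A production version also has to handle diffs, partial files, and prose interleaved with code, which is why "half a dozen formats" is not an exaggeration.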

Automatic fixes. Some classes of error are so common and so predictable that they should be fixed programmatically, not by sending the error back to the model for another inference round. A named import of a default-exported component. A missing file extension in a relative import. A const declaration where the codebase uses var. These are pattern-matchable errors with deterministic fixes. Every one you handle automatically saves a model round-trip — which saves time, tokens, and the risk of the model introducing new errors while fixing old ones.
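The named-import-of-a-default-export case is a good example of a deterministic, pattern-matchable fix. A sketch, assuming the perception layer supplies the set of known default exports:

```python
import re

def fix_named_default_import(line: str, default_exports: set[str]) -> str:
    """Rewrite `import { X } from '...'` to `import X from '...'` when X
    is known from the project model to be a default export."""
    m = re.match(r"import\s*\{\s*(\w+)\s*\}\s*from\s*(['\"].+['\"]);?", line)
    if m and m.group(1) in default_exports:
        return f"import {m.group(1)} from {m.group(2)};"
    return line  # not the pattern, or genuinely a named export: leave it
```

The guard against `default_exports` is what keeps this safe: the fix only fires when the project model confirms the symbol really is a default export.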

Context scoping. The execution agent doesn't need the entire codebase in its context. It needs the files the plan says it will modify, plus their structural dependencies. This is where the plan's file map (from Part 2) pays off: the execution layer knows exactly which files to load, keeping the context window focused and the generation accurate.
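Scoping has to walk the dependency graph, not just take one hop, or transitive dependencies get missed. A sketch over a plain adjacency-map graph (the graph shape is an assumption; the real source is the perception layer's module graph):

```python
def scope_context(plan_files, dep_graph):
    """Files the plan touches plus all their structural dependencies."""
    scoped, stack = set(), list(plan_files)
    while stack:
        f = stack.pop()
        if f in scoped:
            continue
        scoped.add(f)
        # Follow structural dependencies transitively.
        stack.extend(dep_graph.get(f, []))
    return scoped
```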

Where execution fails

Structural execution eliminates structural failures. What remains are the failures that require judgment.

Semantic correctness. The code compiles, the imports resolve, the structure is sound — but the rate limiter uses a fixed window when the architecture needs a token bucket. The function handles the happy path but not the edge case the task implied. The test passes but tests the wrong behavior. These are correctness failures that no amount of structural verification can catch because they're about intent, not structure.

Novel patterns. When the task requires a pattern the codebase has never used before — a new middleware architecture, a new state management approach, a new deployment strategy — the agent has no structural precedent to follow. It falls back to its training data, which may or may not align with the project's conventions. The structural review can detect that the output doesn't match existing patterns, but it can't tell the agent what the right pattern should be.

Cascading errors. In complex multi-step executions, an error in step 3 can make steps 4 through 8 impossible. The self-correcting loop handles single-step failures well. Multi-step cascades require replanning (falling back to the planning layer), and the handoff between "this step failed, try again" and "this step failed because the plan was wrong, replan everything downstream" is one of the harder coordination problems in agent architecture.

Resource limits. The ten-turn bound that prevents runaway execution also means some tasks genuinely can't be completed autonomously. Complex refactors, large migrations, tasks that require deep domain reasoning — these may exceed what bounded execution can handle. The system needs to recognize this gracefully and escalate, rather than producing a partial result and declaring success.

What it enables for the layers above

Verified execution is what makes the remaining layers meaningful.

Memory becomes trustworthy. When the system remembers what it did, the memory is only useful if what it did was actually correct. An agent with unverified execution pollutes its own memory with false successes — "I refactored the auth module" when it actually broke two callers. Verified execution means the memory records reflect reality. We'll cover this in Part 4.

Governance becomes enforceable. A governance rule like "destructive changes require approval" only works if the system can reliably identify what's destructive. Structural diffs make this possible. Text diffs make it approximate. The governance layer's precision is directly proportional to the execution layer's structural awareness. We'll cover this in Part 5.

Trust becomes measurable. How often does the agent produce code that passes all verification layers on the first try? On the second? How often does it exhaust the turn limit and escalate? These are concrete metrics that tell you whether the agent is getting better or worse over time. Without verified execution, trust is a feeling. With it, trust is a number. We'll cover this in Part 6.

Next: Memory

In Part 4 we'll look at the memory layer — how an agent retains knowledge across sessions, why most agents start every conversation from scratch, and what changes when persistent semantic memory lets the system learn from its own verified execution history.

---

Reading this because you're trying to build?

For custom architecture and consulting, work with ResonanceWorks — Talk to Consulting. For a ready-made install, start with Torque Engineering.

---

Get the rest in your inbox

The series covers six layers — perception, planning, execution, memory, governance, and trust. Subscribe to get new parts as they ship, plus occasional technical notes on what we're learning from running these systems in production.

Need custom help designing your stack?

ResonanceWorks works with founders and small teams on architecture, governance, and private AI system design. We take a small number of engagements at a time and work closely with founders and technical leads. Talk to Consulting.

Want a ready-made local-first system instead?

Torque Engineering installs a proven capture, vault, and routing system for independent operators. Explore Torque Engineering.