← All Posts
2026-04-19

Part 4: Memory — Persistent Semantic Context

agents · architecture · memory · mll · series

*This is Part 4 of a six-part series on the layers of an autonomous coding agent. Start with the introduction.*

---

The amnesia problem

Every time you start a new session with a coding agent, it forgets everything. What it built yesterday, what approaches failed, what patterns worked, what you told it about your architecture — gone. The next session starts from the same blank state as the first one.

This isn't a limitation of language models. It's a limitation of how agent systems are built. Most architectures treat each session as independent: load the context window, do the work, discard the state. The model itself has no mechanism for persistence. If the system around it doesn't provide memory, the agent is permanently amnesiac.

The cost of this shows up in rework. You explain the same constraints again. The agent re-discovers patterns it already found. It proposes approaches you've already rejected. Every session carries the overhead of re-establishing context that should have been retained.

Memory is the layer that fixes this. Not by making models remember — they can't — but by building infrastructure that remembers on their behalf.

What memory means for an agent

Human memory is messy, associative, and lossy in useful ways. Agent memory needs to be more structured, but the taxonomy from cognitive science maps surprisingly well.

Episodic memory stores specific interactions. The agent worked on the authentication module on Tuesday, encountered a circular dependency, resolved it by extracting an interface. That sequence — the problem, the approach, the resolution — is an episode worth retaining. When a similar problem appears next week, the agent should recall the pattern rather than re-derive it.

Semantic memory stores facts and relationships that persist across time. The project uses TypeScript. The database is Postgres. The team prefers composition over inheritance. These are not tied to a specific session — they're structural knowledge about the domain the agent works in.

Procedural memory stores learned behaviors. The agent knows that before modifying a shared component, it should check downstream consumers. It knows that this project's CI pipeline takes 4 minutes and the build will fail if you import from the wrong path. These are workflow patterns that accumulate through experience.

The implementation challenge is not categorizing memories — it's deciding what to store, how to retrieve it, and when to forget.
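The three categories above can be sketched as a single record type with a kind tag. A minimal Go sketch (the type and field names are illustrative, not taken from any particular system):

```go
package main

import (
	"fmt"
	"time"
)

// Kind distinguishes the three memory categories from cognitive science.
type Kind int

const (
	Episodic   Kind = iota // a specific interaction: problem, approach, resolution
	Semantic               // a durable fact: "the database is Postgres"
	Procedural             // a learned behavior: "check downstream consumers first"
)

// Memory is one stored record. The Embedding field is filled in at write
// time so later retrieval can search by meaning rather than keywords.
type Memory struct {
	Kind      Kind
	Text      string
	Embedding []float32
	CreatedAt time.Time
}

func main() {
	m := Memory{
		Kind:      Episodic,
		Text:      "Circular dependency in auth module; resolved by extracting an interface.",
		CreatedAt: time.Now(),
	}
	fmt.Println(m.Kind == Episodic, m.Text != "")
}
```

Keeping the embedding on the record means the write path embeds once, and everything after that — retrieval, dedup, consolidation — is vector comparison.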

The retrieval problem

Storage is straightforward. You can write anything to a file, a database, a key-value store. The hard part is retrieval: when the agent starts a new session, how does it find the memories that are relevant to what it's about to do?

Keyword search doesn't work. If the agent is about to refactor the payment module, searching for "payment" returns every mention of the word — including irrelevant meeting notes, old bug reports, and unrelated discussions. The signal-to-noise ratio is terrible.

What works is semantic search — retrieval by meaning rather than keywords. A query for "payment module restructuring" should return the episode where the agent extracted the billing interface, even if that episode never used the word "payment." It should return the architectural decision about keeping payment logic out of the API layer, even if that decision was recorded with different vocabulary.

This requires an embedding model that understands your domain, and a vector database that can search by similarity at the speed an agent needs.

The infrastructure that makes this work

Semantic memory requires three components working together: an embedding model that converts text to vectors, a vector database that stores and searches those vectors, and a retrieval layer that decides what to fetch for each new session.

The default approach is to use a cloud embedding API. Send text to OpenAI or Cohere, get back a vector, store it in Pinecone or Weaviate. This works but introduces latency, cost per embedding, and a dependency on external services for a core capability.

The approach we use is different. The embedding model is a compiled artifact — an .mll file trained on our own corpus, running locally. The vector database is a single Go binary that stores quantized vectors in memory. The retrieval is a function call, not an API request.

This matters for two reasons. First, embedding becomes free after the initial training. When every memory write and every retrieval query costs nothing, you can afford to be aggressive about what you store and how often you search. An agent that hesitates to embed because each call costs tokens is an agent with selective amnesia.

Second, the embeddings understand your domain. A general-purpose model treats all text the same. A model trained on your research corpus, your codebase documentation, your architectural decisions — that model knows that "Arbiter governance" and "declarative policy enforcement" are the same concept. The retrieval gets more precise because the embeddings carry domain knowledge.

The vector database we use stores 100,000 vectors in 14 megabytes at 2-bit quantization. Search over 10,000 entries takes 49 microseconds. These numbers matter because they determine whether memory is a feature you use occasionally or a primitive you use constantly. At 49 microseconds per query, semantic search is faster than a filesystem stat call. You can search memory on every turn without the agent noticing.
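To make the storage numbers concrete, here is a simplified 2-bit quantizer in Go. The thresholds and packing layout are assumptions for illustration, not the actual database's scheme; the point is that four values pack into one byte, a 16x reduction from float32:

```go
package main

import "fmt"

// quantize2bit maps each value to one of four 2-bit codes relative to
// fixed thresholds, packing four codes per byte. float32 (32 bits per
// value) shrinks to 2 bits per value.
func quantize2bit(v []float32) []byte {
	out := make([]byte, (len(v)+3)/4)
	for i, x := range v {
		var code byte
		switch {
		case x < -0.5:
			code = 0
		case x < 0:
			code = 1
		case x < 0.5:
			code = 2
		default:
			code = 3
		}
		out[i/4] |= code << uint((i%4) * 2)
	}
	return out
}

func main() {
	vec := []float32{-0.9, -0.2, 0.3, 0.8}
	packed := quantize2bit(vec)
	fmt.Printf("%d floats -> %d byte(s)\n", len(vec), len(packed))
}
```

Note that the mapping is a pure function of its input: two replicas quantizing the same vector produce identical bytes. That determinism is what the replication point later in the post depends on.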

---

Reading this because you're trying to build?

For custom architecture and consulting, work with ResonanceWorks — Talk to Consulting. For a ready-made install, start with Torque Engineering.

---

What changes when agents have memory

The capabilities that open up are not speculative. They're things we've built and run.

Semantic dedup. Our research system produces hundreds of briefs on focused topics. Before memory, the system would research the same concept from slightly different angles — "sovereign AI infrastructure" and "local-first AI for independent studios" were treated as unrelated. With semantic memory, a query for a new topic first searches existing knowledge. If something semantically similar already exists (we use a 0.75 similarity threshold), the system skips it. This catches roughly a third of what string matching misses.
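The dedup check reduces to one threshold test over the best similarity hit. A sketch in Go, using the 0.75 threshold from the text (the vectors and function names are illustrative):

```go
package main

import (
	"fmt"
	"math"
)

const dedupThreshold = 0.75 // from the text: skip topics above this similarity

// cosine returns the cosine similarity of two equal-length vectors.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// isDuplicate reports whether the query is semantically close enough to
// any existing entry that researching it again would be redundant.
func isDuplicate(query []float32, existing [][]float32) bool {
	for _, v := range existing {
		if cosine(query, v) >= dedupThreshold {
			return true
		}
	}
	return false
}

func main() {
	existing := [][]float32{{1, 0, 0}} // stands in for an existing brief's embedding
	near := []float32{0.9, 0.3, 0}     // a rephrasing of the same concept
	far := []float32{0, 0, 1}          // a genuinely new topic
	fmt.Println(isDuplicate(near, existing), isDuplicate(far, existing))
}
```

The threshold is the whole tuning surface: too high and rephrasings slip through; too low and genuinely distinct topics get skipped.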

Continuous learning across sessions. An agent that remembers what it built yesterday can pick up where it left off without being told. The memory provides context: what files were modified, what the plan was, what worked and what didn't. This turns a sequence of disconnected sessions into a continuous engagement with a project.

Pattern recognition over time. With enough episodic memory, the agent starts to surface patterns that no single session would reveal. This function gets modified every sprint. This test fails after every dependency update. This API endpoint has been refactored three times in six months — maybe the abstraction is wrong. These observations require memory that spans weeks or months, not minutes.

Personalized behavior. Semantic memory stores not just facts but preferences. The developer prefers explicit error handling over try-catch blocks. The team convention is to name tests with should_ prefix. These preferences, once stored, apply to every future session without re-explanation.

Where memory fails

Being honest about the failure modes matters because they shape how the system should be built.

Stale memories are worse than no memories. A memory that says "the API uses REST" when the team migrated to gRPC three months ago will cause the agent to generate wrong code confidently. Memory systems need freshness mechanisms — either explicit invalidation when things change, or decay functions that reduce confidence over time.

Over-retrieval drowns the signal. If the agent retrieves 50 memories for every task, most of them will be noise. The retrieval layer needs to be selective — not just "what's similar" but "what's relevant to this specific task at this specific moment." This is harder than it sounds and is where most memory implementations fall down.

Privacy and access control. A memory system that stores everything an agent sees creates a surface for information leakage. If Agent A works on a sensitive project and Agent B queries the shared memory, sensitive details could leak through semantic similarity. Memory systems need scoping — per-project, per-team, per-sensitivity-level.
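One way to sketch scoping in Go: filter the candidate set by scope before any similarity search runs, so out-of-scope entries can never surface through semantic similarity (types and field names are illustrative):

```go
package main

import "fmt"

// Scope restricts which memories a query may see.
type Scope struct {
	Project string
}

// Entry is one stored memory with its access scope attached.
type Entry struct {
	Project string
	Text    string
}

// visible narrows the store to the querying agent's scope before any
// similarity search runs, so sensitive entries from other projects
// never enter the candidate set at all.
func visible(store []Entry, s Scope) []Entry {
	var out []Entry
	for _, e := range store {
		if e.Project == s.Project {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	store := []Entry{
		{Project: "payments", Text: "billing interface extracted"},
		{Project: "acquisition", Text: "sensitive diligence notes"},
	}
	got := visible(store, Scope{Project: "payments"})
	fmt.Println(len(got), got[0].Text)
}
```

The ordering matters: filtering after retrieval still leaks, because the similarity scores themselves reveal that something close to the query exists in another scope.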

The consolidation problem. Raw episodic memory grows without bound. A system that remembers every interaction verbatim will eventually have too much memory to search efficiently. Consolidation — distilling episodes into higher-level insights and discarding the raw details — is essential but difficult to automate well. Over-consolidate and you lose nuance. Under-consolidate and you lose performance.

The category of tooling

Persistent semantic memory is a category, not a feature. To serve as an agent's memory layer, an implementation needs to deliver:

  • Domain-specific embeddings. General-purpose models produce general-purpose vectors. For memory retrieval to be precise, the embedding model needs to understand your vocabulary and your conceptual relationships.
  • Local-first operation. Memory queries happen on every turn. If each query is an API call, the latency and cost make aggressive memory use impractical.
  • Quantized storage. Vector databases at full float32 precision consume gigabytes for modest collections. Quantization cuts storage by an order of magnitude (16x for 2-bit, 8x for 4-bit, relative to float32) while preserving retrieval quality for similarity search.
  • Deterministic outputs. If two replicas of the memory system embed the same text and produce different vectors, distributed operation becomes impossible. Deterministic quantization enables CRDT-based replication — multiple memory stores converging without coordination.

The leading implementation we've found in this category — from the same pure-Go ecosystem that provides the perception and governance layers earlier in this series — delivers all four. A compiled .mll artifact serves embeddings locally. A vector database quantizes and indexes them. The deterministic output enables distributed sync. The repositories are on GitHub.

What this enables for the layers above

Governance (Part 5) becomes more precise with memory. Instead of generic rules like "don't modify shared components without review," the governance layer can enforce rules grounded in history: "this component has been modified 4 times in the last month — require elevated review." The memory provides the evidence that the governance rules act on.

Trust (Part 6) benefits directly. An agent with memory can track its own accuracy over time. If its code generation on authentication modules has a 90% first-pass success rate but its database migration code fails 40% of the time, the trust layer can calibrate confidence accordingly. Self-assessment requires memory of past performance.

Memory is the layer that turns a tool into a collaborator. Without it, every session is a first meeting. With it, the agent accumulates understanding — of the codebase, of the team, of what works. That accumulation is what makes long-horizon autonomous work possible.

Next: Governance

In Part 5 we'll look at the governance layer — how declarative rules enforce what an agent is allowed to do, when it should stop, and how it allocates resources. Governance without memory is blunt. Governance with memory is precise.

---

Get the rest in your inbox

The series lands roughly every week or two. Six parts in total, each standing alone and building on the last. Subscribe to get new parts the day they ship, plus occasional technical notes on what we're learning from running these systems in production.

Need custom help designing your stack?

ResonanceWorks works with founders and small teams on architecture, governance, and private AI system design. We take a small number of engagements at a time and work closely with founders and technical leads. Talk to Consulting.

Want a ready-made local-first system instead?

Torque Engineering installs a proven capture, vault, and routing system for independent operators. Explore Torque Engineering.