A Pure Go Stack for Model Distribution
infrastructure · local-first · ai-systems · mll
The assumption we stopped questioning
The current AI stack has a shape most people take for granted: Python for training and inference, ONNX or safetensors for serialization, a C++ runtime underneath, and a serving framework bolted on top. It works. It's also a stack of dependencies that assumes you'll always have a Python interpreter, a C compiler, and a cloud server standing by.
For local-first AI — systems that run on your hardware, compile to any target, and don't depend on external infrastructure — that assumption is a problem. Every C dependency is a portability constraint. Every Python requirement is a runtime you have to ship. Every framework is a layer between you and the metal.
We recently integrated a set of tools from odvcencio that take a fundamentally different approach. The entire stack — from vector quantization to distributed search to model compilation — is written in pure Go. No CGo. No Python. No C dependencies. And the implications are more interesting than the language choice alone might suggest.
The stack, bottom to top
TurboQuant: compression as a primitive
TurboQuant implements the vector quantization algorithm from the ICLR 2026 paper in pure Go. It compresses float32 vectors to 1–8 bits per dimension with near-optimal distortion. At 3 bits, a 384-dimension embedding goes from 1,536 bytes to 144 bytes — roughly 10x compression.
Two modes matter here. MSE-optimal minimizes reconstruction error, which is what you want when you're storing and recovering vectors. IP-optimal gives unbiased inner product estimation, which is what you want for similarity search and nearest-neighbor lookups.
The implementation includes SIMD-optimized dot product kernels for both amd64 and arm64, a WebGPU backend for browser-side scoring, and an experimental CUDA backend for GPU top-k search. It's deterministic with a seed, which turns out to be critical: byte-identical output on every replica means CRDT convergence without coordination.
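The compression arithmetic is easy to check with a toy packer. The sketch below is not TurboQuant's algorithm (its codebooks are learned and near-optimal, where this one is a naive uniform quantizer), but it packs 3-bit codes the same way and reproduces the 1,536-to-144-byte reduction; it is also trivially deterministic, the weaker cousin of TurboQuant's seeded determinism:

```go
package main

import (
	"fmt"
	"math"
)

// quantize3bit maps each float in [-1, 1] to a 3-bit code (0..7)
// and packs the codes tightly. A toy uniform quantizer -- only the
// size arithmetic matches the real thing.
func quantize3bit(v []float32) []byte {
	const bits = 3
	out := make([]byte, (len(v)*bits+7)/8)
	for i, x := range v {
		// Clamp to [-1, 1], then map linearly onto 0..7.
		c := (float64(x) + 1) / 2 * 7
		code := byte(math.Round(math.Max(0, math.Min(7, c))))
		bitPos := i * bits
		out[bitPos/8] |= code << (bitPos % 8)
		if bitPos%8 > 5 { // code straddles a byte boundary
			out[bitPos/8+1] |= code >> (8 - bitPos%8)
		}
	}
	return out
}

func main() {
	vec := make([]float32, 384) // one 384-dimension embedding
	for i := range vec {
		vec[i] = float32(i%7)/7*2 - 1 // deterministic test data
	}
	packed := quantize3bit(vec)
	fmt.Printf("float32: %d bytes -> 3-bit packed: %d bytes\n",
		len(vec)*4, len(packed)) // 1536 -> 144
}
```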
What makes this interesting at the infrastructure level is that compression isn't a separate optimization step — it's a first-class primitive that the rest of the stack builds on.
CorkScrewDB: a vector database that doesn't need a cluster
CorkScrewDB is a distributed, versioned vector database built directly on TurboQuant. You put text in; it embeds and stores it. You search by text; it returns semantically similar results. No external embedding service. No separate vector index.
The architecture is built for the kind of deployment that local-first systems actually need: a single binary that runs as a standalone server or embeds as a library. Append-only WAL with snapshot recovery. Lamport clock versioning with CRDT last-writer-wins merge. Metadata filters and point-in-time views. Built-in RPC transport with hash-based write routing and fan-out search for federation.
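The last-writer-wins merge is small enough to sketch. This is a minimal illustration of the merge rule described above, with illustrative names (not CorkScrewDB's API): each write carries a Lamport timestamp, ties break on replica ID, and because the merge is commutative and idempotent, every node converges to the same winner regardless of delivery order:

```go
package main

import "fmt"

// Version orders writes: Lamport clock first, replica ID as the
// deterministic tie-breaker.
type Version struct {
	Clock   uint64
	Replica string
}

// Entry is a versioned value in the store.
type Entry struct {
	Ver   Version
	Value string
}

// newer reports whether version a should win over version b.
func newer(a, b Version) bool {
	if a.Clock != b.Clock {
		return a.Clock > b.Clock
	}
	return a.Replica > b.Replica
}

// merge is commutative and idempotent: apply updates in any order,
// on any replica, and the surviving value is identical.
func merge(local, remote Entry) Entry {
	if newer(remote.Ver, local.Ver) {
		return remote
	}
	return local
}

func main() {
	a := Entry{Version{3, "node-a"}, "draft"}
	b := Entry{Version{5, "node-b"}, "final"}
	// Arrival order doesn't matter: both replicas keep "final".
	fmt.Println(merge(a, b).Value, merge(b, a).Value)
}
```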
At 2-bit quantization with 384-dimension vectors, searching 10,000 entries takes 49 microseconds on 20 cores. Memory per vector is 144 bytes. A hundred thousand vectors fit in 14 megabytes.
The premise here is that semantic search shouldn't require a managed service. If you're building a system that needs to find things by meaning — research dedup, content retrieval, code search — the database should be as easy to deploy as any other Go binary.
Manta: a compiler for inference
Manta is where the stack gets genuinely novel. It's an inference-first GPU language and runtime — a compiler that takes a custom DSL (.bar files), optimizes through a three-level intermediate representation (HIR → MIR → LIR), and emits code for CUDA, Metal, or a host reference backend.
The source language is designed for ML inference, not training. You define parameters, kernels, and pipelines. The type system includes quantized tensors as first-class citizens — q4 and q8 types, TurboQuant-native layouts, KV cache types. Schedule hints like tile, vector_width, and subgroup give you control over execution without dropping to raw GPU code.
The compiler produces .mll files — sealed binary artifacts that contain the model definition, trained weights, tokenizer, and memory plan in a single portable package. This is the distribution format.
Why .mll matters
The current model distribution story looks like this: download a multi-gigabyte file (or a directory of files), install the right version of PyTorch, load it through a framework that understands the serialization format, hope the versions align, serve it through an inference server. Every step in that chain is a potential failure point and a portability constraint.
An .mll file is self-contained. The model, the weights, the tokenizer, and the execution plan are sealed together. You load it with a single function call. It runs wherever Go runs — native, WASM, embedded. The runtime handles dispatch to whatever backend is available: CUDA if you have it, Metal on Apple Silicon, host reference as a fallback.
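The backend fallback can be pictured as a preference-ordered probe. This is a hypothetical sketch of the dispatch shape, not Manta's runtime API; the availability checks are stand-ins for real capability probing:

```go
package main

import "fmt"

// Backend abstracts an execution target. The runtime tries each
// target in preference order and takes the first that's available.
type Backend interface {
	Name() string
	Available() bool
}

type cuda struct{}

func (cuda) Name() string    { return "cuda" }
func (cuda) Available() bool { return false } // no GPU in this sketch

type metal struct{}

func (metal) Name() string    { return "metal" }
func (metal) Available() bool { return false } // not on Apple Silicon

type host struct{}

func (host) Name() string    { return "host" }
func (host) Available() bool { return true } // pure Go, always works

// pick returns the first available backend in preference order,
// or nil if none is usable.
func pick(prefs []Backend) Backend {
	for _, b := range prefs {
		if b.Available() {
			return b
		}
	}
	return nil
}

func main() {
	b := pick([]Backend{cuda{}, metal{}, host{}})
	fmt.Println("dispatching to:", b.Name()) // host reference fallback
}
```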
This isn't just a packaging convenience. It changes what's possible for model deployment:
Edge devices become viable. An .mll file compiled to the WASM target runs in a browser. One compiled for arm64 runs on a phone or a Raspberry Pi. No Python interpreter, no inference server, no container runtime.
Models become portable artifacts. The same .mll file runs on CUDA, Metal, and CPU without rebuilding. The compilation targets are baked in. Ship once, run anywhere.
Distribution becomes a file copy. No package manager, no dependency resolution, no version conflicts. The artifact is the deployment.
Embedding providers become local. CorkScrewDB loads .mll files directly as its embedding provider. Your vector database and your embedding model are the same deployment — no external API calls, no network latency, no third-party inference.
What this means for agent systems
Most agent architectures treat inference as a service call. The agent sends a prompt to an API, waits for a response, and processes the result. The model is somewhere else — on a cloud server, behind a rate limit, metered per token.
When inference is a local function call backed by an .mll artifact, the architecture changes:
Embedding becomes free after compilation. Semantic search, similarity routing, content dedup — these operations currently cost per-token on a cloud API. With a local .mll embedding model, they're just compute. Run them as often as you want.
Agents can carry their own models. Instead of depending on an external service, an agent can ship with its .mll artifacts — an embedding model for search, a scoring model for ranking, a decoder for generation. The agent is self-contained.
Quantization is part of the compile step, not an afterthought. TurboQuant types are native to the Manta compiler. You don't quantize a model after training and hope it still works — you compile to a specific bit width and the execution plan accounts for it.
The CRDT story is real. TurboQuant's deterministic output means quantized embeddings converge across replicas without coordination. CorkScrewDB's Lamport clock versioning means multiple instances can sync. You can run distributed semantic search across devices without a central server.
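"Just compute" is concrete when you write the loop. The sketch below scores a query against a toy corpus with cosine similarity; with a cloud API each comparison is a metered call, while with a local embedding model it's an in-process scan you can run as often as you like. The three-dimensional vectors here are placeholders for real embeddings:

```go
package main

import (
	"fmt"
	"math"
)

// cosine returns the cosine similarity of two embedding vectors.
func cosine(a, b []float32) float64 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	return dot / (math.Sqrt(na) * math.Sqrt(nb))
}

// nearest scans the corpus and returns the index of the best match.
// Locally, this whole search is a function call, not an API round-trip.
func nearest(query []float32, corpus [][]float32) int {
	best, bestScore := -1, math.Inf(-1)
	for i, v := range corpus {
		if s := cosine(query, v); s > bestScore {
			best, bestScore = i, s
		}
	}
	return best
}

func main() {
	// Toy 3-dimensional "embeddings"; a real model produces hundreds
	// of dimensions, but the scan is identical.
	corpus := [][]float32{{1, 0, 0}, {0.9, 0.1, 0}, {0, 0, 1}}
	fmt.Println("best match:", nearest([]float32{1, 0.05, 0}, corpus))
}
```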
What's emerging
We're using this stack in production for research dedup and semantic search across our knowledge base. We trained a domain-specific embedding model on our research corpus, sealed it as an .mll artifact, and loaded it into CorkScrewDB. A query for "local-first AI for small teams" returns "sovereign AI infrastructure for independent studios" at 0.98 similarity — with zero overlapping words.
But the applications that interest us most are the ones this architecture makes newly possible:
Offline-capable AI tools that carry their own inference and don't degrade without a network connection. Creative tools that embed, search, and generate without phoning home.
Federated knowledge systems where each node runs its own CorkScrewDB instance with the same .mll embedder, syncing through CRDTs. Distributed semantic search without a central index.
Browser-native inference through the WebGPU and WASM targets. An .mll model running in a browser tab, scoring content in real time, with no server involved.
Portable agent runtimes where the agent, its models, its vector store, and its governance layer are all Go binaries that deploy together. No Docker, no Kubernetes, no infrastructure team.
The bigger picture
The Python/C++/ONNX stack won the first era of ML deployment because it was what existed. It grew organically from the training ecosystem, and inference was an afterthought bolted on top. That stack works — but it carries assumptions about where and how models run that are increasingly at odds with where the field is going.
A pure Go stack built for inference first, with compression as a primitive, compilation as the distribution mechanism, and portability as a design constraint — that's built for a different future. One where models are artifacts you own and deploy, not services you rent. Where inference happens at the edge, in the browser, on the device. Where the stack is small enough to understand and portable enough to run anywhere.
We don't know yet whether .mll becomes a standard or stays a niche tool. But the design choices are sound, the engineering is meticulous, and the problems it solves are real. If you're building local-first AI systems, this stack is worth understanding. The repositories are on GitHub.