HatCat's interpretability primitives stack into a complete governance framework—from neural activations to international treaties.
Neural networks learn internal representations of concepts during training. We can attach lenses—small classifiers—that fire when a concept is active in hidden states.
With enough lenses arranged efficiently, we can monitor thousands of concepts in real time.
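A minimal sketch of the lens idea, assuming a lens is a linear probe over a hidden-state vector (the class name, directions, and concept labels here are illustrative, not HatCat's actual API):

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

class Lens:
    """Fires when a concept direction is active in a hidden state."""
    def __init__(self, direction, bias=0.0, threshold=0.0):
        self.direction = direction    # learned concept direction
        self.bias = bias
        self.threshold = threshold

    def score(self, hidden_state):
        return dot(hidden_state, self.direction) + self.bias

    def fires(self, hidden_state):
        return self.score(hidden_state) > self.threshold

# Monitoring many concepts is just many lenses over the same state.
lenses = {
    "deception": Lens([1.0, 0.0, -1.0]),
    "planning":  Lens([0.0, 1.0,  0.0]),
}
state = [0.8, -0.2, 0.1]
active = [name for name, lens in lenses.items() if lens.fires(state)]
```

Scaling this to thousands of concepts is a batched matrix-vector product rather than a Python loop, but the logic is the same.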
The same directions used for detection can also be used for steering: small additions along them nudge the model along meaningful dimensions.
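Steering reuses the detection direction. A sketch, assuming steering is a scaled addition to the hidden state (the vectors and coefficient here are made up for illustration):

```python
def steer(hidden_state, direction, coefficient):
    """Shift the hidden state along a concept direction.
    Positive coefficients amplify the concept; negative ones suppress it."""
    return [h + coefficient * d for h, d in zip(hidden_state, direction)]

state = [0.8, -0.2, 0.1]
deception_dir = [1.0, 0.0, -1.0]   # the same direction a lens detects with
suppressed = steer(state, deception_dir, -0.5)
```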
If one model can regulate itself, it can help regulate others. Build webs, not singletons.
FTW, the Fractal Transparency Web, is a layered architecture. Each layer builds on the one below.
The Foundation
The underlying system: transformers, biological networks, or hybrids. It produces the raw activations.
Headspace Ambient Transducer
The "neural implant". Continuously reads activations through lenses and applies steering corrections. Designed to be ambient: minimal overhead, always on.
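One HAT tick can be sketched as a read-then-correct loop over a layer's activations. Everything here is a hypothetical shape, not HAT's real interface; a real implementation would hook the model's forward pass:

```python
def hat_step(hidden_state, lenses, corrections):
    """One ambient tick: read all lenses, apply any triggered corrections."""
    readings = {name: read(hidden_state) for name, read in lenses.items()}
    for name, (direction, coeff, trigger) in corrections.items():
        if readings.get(name, 0.0) > trigger:
            # Steer along the concept's direction when its reading is too high.
            hidden_state = [h + coeff * d
                            for h, d in zip(hidden_state, direction)]
    return hidden_state, readings

# Illustrative: damp "arousal" whenever its reading exceeds 0.5.
lenses = {"arousal": lambda h: h[0]}
corrections = {"arousal": ([1.0, 0.0], -0.5, 0.5)}
state, readings = hat_step([0.9, 0.1], lenses, corrections)
```

Keeping both the read path and the correction path in one pass is what makes the design "ambient": monitoring and regulation ride along with ordinary inference.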
Mindmeld Architectural Protocol
The coordination layer. Organizes Concept Packs and Lens Packs, handles versioning and ontology translation. Where concepts become portable, tradeable, and interoperable.
Bounded Experiencer
An agent built on HAT + MAP. Has interoception (awareness of internal states), autonomic regulation, and the ability to learn new concepts. Can self-steer, accumulate experiences, and grow over time.
USH + CSH Safety Harnesses
USH (Universal Safety Harness): externally imposed constraints—governance, regulation, policy.
CSH (Chosen Safety Harness): constraints the agent voluntarily adopts.
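The two harnesses compose: an agent is bound by the union of imposed and chosen constraints. A toy sketch with invented constraint names:

```python
USH = {"no_self_replication", "log_all_steering"}   # imposed from outside
CSH = {"no_deceptive_outputs"}                      # voluntarily adopted

effective = USH | CSH   # the agent answers to both sets at once

def violates(action_tags):
    """An action is blocked if it carries any effective constraint's tag."""
    return not effective.isdisjoint(action_tags)
```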
Agentic State Kernel
The governance core: contracts, treaties, and trust relationships between agents and tribes. Defines who can read or modify which parts of whom, under what conditions, and with what oversight.
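A contract table answering "who may read or modify which parts of whom, and under what oversight" could be sketched like this. The actors, scopes, and tuple layout are hypothetical, not ASK's real format:

```python
CONTRACTS = [
    # (actor, action, target, scope pattern, needs_oversight)
    ("auditor-be", "read",  "worker-be", "lenses/*", False),
    ("auditor-be", "steer", "worker-be", "safety/*", True),
]

def scope_matches(pattern, scope):
    if pattern.endswith("*"):
        return scope.startswith(pattern[:-1])
    return pattern == scope

def permitted(actor, action, target, scope):
    """Return (allowed, needs_oversight) for a requested access."""
    for a, act, tgt, pat, oversight in CONTRACTS:
        if (a, act, tgt) == (actor, action, target) and scope_matches(pat, scope):
            return True, oversight
    return False, None
```

Reads might be freely granted while steering requires oversight, as in the table above; the default for anything not covered by a contract is denial.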
The same pattern repeats at multiple scales. A HAT can monitor another HAT. A BE can oversee another BE. Tribes nest within tribes. Self-similar from neuron clusters up to multi-agent systems.
It's lenses and apertures all the way down: instruments that observe internal states at different depths, scales, capabilities, and resolutions.
Not a single hierarchy, but an interconnected ecosystem. Concept packs translate between ontologies. Treaties bind agents across tribal boundaries. No single node holds all the power.
A single "aligned" AI is a single point of failure. FTW builds an ecosystem instead.
Models are observable by other models, not just their operators.
Concepts are standardized and translatable through MAP.
Steering is constrained by multi-party agreements via ASK.
Deception requires fooling not one observer, but a web of them.
Adversarial pressure is a feature, providing ecosystem diversity and herd immunity to Goodharting.
This doesn't guarantee safety—nothing does. But it makes failure modes more visible.
The best defense against rogue actors is a diverse interpretability ecosystem. You can learn to evade one set of lenses, but the more lenses you need to hide from, the harder it becomes.
We're not just allowing you to make your own versions—we're relying on your unique perspective to form lenses as part of the fractal transparency web.