Mechanistic Interpretability
We do not just train models; we dissect them. Our approach treats neural networks as physical systems that can be mapped, understood, and engineered.
"The Black Box"
Standard LLMs are opaque probabilistic engines. Safety is typically bolted on via Reinforcement Learning from Human Feedback (RLHF), which largely trains the model to hide its bad behavior rather than remove it.
"The Glass Box"
We map the activation patterns of specific neurons to concepts (e.g., deception, sycophancy). We can visually inspect the model's "Thought Trace" and surgically remove harmful circuits without degrading general capability.
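As a simplified illustration of what "removing" a harmful circuit can look like, the sketch below projects a learned concept direction out of the residual stream at inference time. It assumes a PyTorch transformer whose layers are exposed as `model.blocks`; the layer index and the `deception_direction` vector are hypothetical placeholders, not our production tooling.

```python
import torch

def ablate_direction_hook(model, layer_idx: int, direction: torch.Tensor):
    """Register a hook that removes one concept direction from a layer's output.

    Assumes `model.blocks[layer_idx]` outputs the residual-stream activation
    tensor of shape (..., d_model); these names are illustrative.
    """
    direction = direction / direction.norm()  # unit-normalize the concept vector

    def hook(module, inputs, output):
        # Subtract the component of each activation lying along `direction`.
        coeff = (output * direction).sum(dim=-1, keepdim=True)
        return output - coeff * direction

    return model.blocks[layer_idx].register_forward_hook(hook)

# handle = ablate_direction_hook(model, layer_idx=10, direction=deception_direction)
# ... run inference as usual; call handle.remove() to restore normal behavior.
```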
Core Technologies
Sparse Autoencoders
We use massive-scale sparse autoencoders to decompose the model's dense representations into interpretable features, allowing us to read the "language of the neurons."
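For readers who want the mechanics, here is a minimal sparse autoencoder sketch in PyTorch. The architecture (a single linear encoder/decoder pair with a ReLU and an L1 sparsity penalty) and the values of `d_model`, `d_features`, and `l1_coeff` are illustrative assumptions, not our training configuration.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_features: int = 16384):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)  # dense activations -> wide feature space
        self.decoder = nn.Linear(d_features, d_model)  # reconstruct the original activation

    def forward(self, activations: torch.Tensor):
        features = torch.relu(self.encoder(activations))  # non-negative, mostly-zero features
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(activations, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error keeps features faithful; the L1 term keeps them sparse.
    mse = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity
```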
Causal Tracing
By intervening on specific activations during a forward pass, we can identify exactly which parts of the model caused a specific output, establishing a causal chain of reasoning.
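The sketch below shows the core intervention, commonly called activation patching: re-run a corrupted prompt while restoring one layer's activation from a clean run. It assumes a PyTorch transformer with layers exposed as `model.blocks` and a `clean_cache` of activations saved from the clean forward pass; all names are illustrative.

```python
import torch

def patch_layer_output(model, layer_idx: int, clean_cache: dict, corrupted_inputs):
    """Forward the corrupted prompt, but overwrite one layer with its clean activation.

    `clean_cache` is assumed to map layer index -> the activation tensor that
    layer produced on the clean prompt.
    """
    def hook(module, inputs, output):
        return clean_cache[layer_idx]  # replace the corrupted activation with the clean one

    handle = model.blocks[layer_idx].register_forward_hook(hook)
    try:
        with torch.no_grad():
            patched_logits = model(corrupted_inputs)
    finally:
        handle.remove()
    return patched_logits

# If patching layer k recovers the clean answer, layer k's activation was
# causally responsible for that output on this prompt.
```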
Activation Steering
We clamp specific feature vectors (e.g., "Honesty") to high values during inference, forcing the model to adhere to safety constraints even in adversarial scenarios.
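A common simplified form of this is adding a scaled feature direction to the residual stream (a full implementation would clamp the SAE feature activation itself). The sketch below shows that additive variant; `model.blocks`, the layer index, the scale, and `honesty_direction` are hypothetical placeholders.

```python
import torch

def add_steering_hook(model, layer_idx: int, direction: torch.Tensor, scale: float = 4.0):
    """Register a hook that pushes every token's activation along one feature direction."""
    direction = direction / direction.norm()  # unit feature vector, e.g. an SAE decoder column

    def hook(module, inputs, output):
        # Bias the residual-stream activation toward the chosen concept.
        return output + scale * direction

    return model.blocks[layer_idx].register_forward_hook(hook)

# handle = add_steering_hook(model, layer_idx=12, direction=honesty_direction)
# ... generate as usual; call handle.remove() to stop steering.
```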