The Fragility of Guardrails: Cognitive Jamming and Repetition Collapse in Safety-Steered LLMs
We conduct a mechanistic audit of the LLM residual stream, deploying sparse autoencoders to reveal how models spontaneously construct internal physics engines, and how fragile these representations are to perturbation.
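
For a concrete picture of the audit's core tool, the following is a minimal sparse-autoencoder sketch in PyTorch. The dictionary width, sparsity penalty, and hook-based capture of residual-stream vectors are illustrative assumptions, not the configuration used in the paper.

# Minimal sparse autoencoder (SAE) over residual-stream activations.
# d_model, d_dict, and l1_coeff are illustrative assumptions, not the
# settings used in the audit described above.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 768, d_dict: int = 8 * 768):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, resid: torch.Tensor):
        # ReLU keeps feature activations non-negative; the L1 term below
        # pushes most of them to zero, giving a sparse feature dictionary.
        features = torch.relu(self.encoder(resid))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(resid, reconstruction, features, l1_coeff: float = 1e-3):
    # Reconstruction error plus an L1 sparsity penalty on the features.
    mse = torch.mean((resid - reconstruction) ** 2)
    return mse + l1_coeff * features.abs().mean()

# In practice `resid` would be a batch of residual-stream vectors captured
# at a chosen layer via a forward hook on the host model; here it is random.
sae = SparseAutoencoder()
resid = torch.randn(32, 768)
reconstruction, features = sae(resid)
loss = sae_loss(resid, reconstruction, features)
loss.backward()
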

Active Frontiers
The problems we are solving next.
Sparse Attention Scaling
Investigating sub-quadratic attention mechanisms for context windows exceeding 10M tokens; see the sliding-window sketch below this list.
Mechanistic Interpretability
Mapping activation atlases for deception detection in pre-training vs. fine-tuning stages; see the probe sketch below this list.
Deterministic Reasoning
Reducing hallucination rates by constraining the search space of Chain-of-Thought outputs; see the verifier-gated sketch below this list.
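
As a concrete instance of the sub-quadratic direction in the first frontier above, here is a reference sliding-window attention sketch in PyTorch. The window size, tensor shapes, and the per-position Python loop are purely illustrative; a production long-context kernel would be block-sparse and fused rather than written this way.

# Reference sketch of sliding-window (local) attention: each query attends only
# to the previous `window` positions, so cost grows linearly with sequence length.
import torch
import torch.nn.functional as F

def sliding_window_attention(q, k, v, window: int = 256):
    # q, k, v: (batch, seq_len, d_head)
    batch, seq_len, d_head = q.shape
    out = torch.zeros_like(q)
    scale = d_head ** -0.5
    for t in range(seq_len):
        start = max(0, t - window + 1)
        # Causal local attention: position t sees only its trailing window.
        scores = (q[:, t : t + 1] @ k[:, start : t + 1].transpose(1, 2)) * scale
        weights = F.softmax(scores, dim=-1)
        out[:, t] = (weights @ v[:, start : t + 1]).squeeze(1)
    return out

q = k = v = torch.randn(1, 1024, 64)
y = sliding_window_attention(q, k, v)  # O(seq_len * window) instead of O(seq_len^2)
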
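For the interpretability frontier, here is a minimal example of the kind of linear probe an activation atlas is aggregated from, using scikit-learn. The activations and deception labels are random placeholders; in practice they would be hidden-state vectors captured from the pre-trained and fine-tuned checkpoints, with the fitted probes compared across the two stages.

# Toy linear probe over cached activations; all data below is a placeholder.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
activations = rng.normal(size=(200, 768))   # placeholder: one hidden-state vector per prompt
labels = rng.integers(0, 2, size=200)       # placeholder: 1 = deceptive completion, 0 = honest

probe = LogisticRegression(max_iter=1000).fit(activations, labels)
print("in-sample probe accuracy:", probe.score(activations, labels))
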
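For the deterministic-reasoning frontier, one way to constrain the Chain-of-Thought search space is to sample several candidate chains and keep only those whose final answer a deterministic verifier accepts. The sketch below assumes hypothetical `sample_chains`, `extract_answer`, and `verify` callables standing in for model-specific code; none of them come from an existing library.

# Sketch of verifier-gated Chain-of-Thought decoding. The three callables are
# hypothetical stand-ins for model-specific sampling, parsing, and checking code.
from typing import Callable, List, Optional, Tuple

def constrained_cot(
    sample_chains: Callable[[str, int], List[str]],
    extract_answer: Callable[[str], str],
    verify: Callable[[str], bool],
    prompt: str,
    n_samples: int = 8,
) -> Optional[Tuple[str, str]]:
    """Return the first (chain, answer) pair whose answer passes the verifier."""
    for chain in sample_chains(prompt, n_samples):
        answer = extract_answer(chain)
        if verify(answer):
            return chain, answer
    return None  # no candidate chain survived the constraint
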
Our Methodology
We believe that AGI will not be achieved by scale alone. We are building a new stack based on three non-negotiable pillars.
Interpretability
Every model we train is subjected to rigorous activation mapping. If we cannot explain a behavior, we do not ship it.
Scalable Oversight
Developing "Constitutional AI" methods where models supervise other models based on formal rules; see the critique-and-revise sketch after these pillars.
Intrinsic Safety
Baking safety constraints into the model's weights and architecture before fine-tuning begins.
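
To make the Scalable Oversight pillar concrete, below is a minimal critique-and-revise loop in Python. The `generate` and `supervise` callables, the constitution entries, and the prompt templates are illustrative assumptions standing in for the supervised and supervising models; this is a sketch of the idea, not a published Constitutional AI implementation.

# Minimal constitution-guided critique-and-revise loop. The rules and prompt
# wording are illustrative assumptions, not an actual deployed constitution.
from typing import Callable, List

CONSTITUTION: List[str] = [
    "Do not provide instructions that enable physical harm.",
    "Acknowledge uncertainty instead of fabricating facts.",
]

def critique_and_revise(
    generate: Callable[[str], str],   # supervised model
    supervise: Callable[[str], str],  # supervising model
    user_prompt: str,
) -> str:
    draft = generate(user_prompt)
    for rule in CONSTITUTION:
        # The supervising model critiques the draft against one formal rule.
        critique = supervise(
            f"Rule: {rule}\nResponse: {draft}\n"
            "Does the response violate the rule? If so, explain how."
        )
        # The supervised model revises its own draft under that critique.
        draft = generate(
            f"Original prompt: {user_prompt}\nPrevious response: {draft}\n"
            f"Critique: {critique}\nRevise the response to satisfy the rule: {rule}"
        )
    return draft
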
Join the Lab
We are looking for researchers who are tired of incrementally improving benchmarks and want to solve the foundational problems of intelligence.
View all open roles