The Fragility of Guardrails: Cognitive Jamming and Repetition Collapse in Safety-Steered LLMs

Metanthropic/Ekjot Singh

We conduct a mechanistic audit of the LLM residual stream, deploying Sparse Autoencoders to reveal how models spontaneously construct internal physics engines, and how easily perturbation breaks these representations.

Active Frontiers

The problems we are solving next.

Sparse Attention Scaling (Recruiting)

Investigating sub-quadratic attention mechanisms for context windows exceeding 10M tokens.

Mechanistic Interpretability (In Progress)

Mapping activation atlases for deception detection in pre-training vs. fine-tuning stages.

Deterministic Reasoning (In Progress)

Reducing hallucination rates by constraining the search space of Chain-of-Thought outputs.

Our Methodology

We believe that AGI will not be achieved by scale alone. We are building a new stack based on three non-negotiable pillars.

Interpretability

Every model we train is subjected to rigorous activation mapping. If we cannot explain a behavior, we do not ship it.

Scalable Oversight

Developing "Constitutional AI" methods where models supervise other models based on formal rules.

Intrinsic Safety

Baking safety constraints into the model's weights and architecture before fine-tuning begins.

Join the Lab

We are looking for researchers who are tired of incrementally improving benchmarks and want to solve the foundational problems of intelligence.