Mechanistic Interpretability
Mechanistic interpretability (MI) is an approach in AI alignment that attempts to reverse-engineer the internal computations of neural networks — understanding *how* and *why* a model produces its outputs, not just *what* it outputs.
Why It Matters
Understanding LLM internals is a prerequisite for trusting or correcting them. MI is the field's attempt to move from black-box evaluation to internal circuit analysis. Anthropic, Google DeepMind, EleutherAI, and Leap Labs are the main institutional contributors.
Key Research Directions
- Circuit discovery — identifying functional subgraphs within transformer attention and MLP layers (Anthropic's Transformer Circuits Thread is the canonical reference; see the activation-patching sketch after this list)
- Monosemanticity / Superposition — understanding why individual neurons are polysemantic, representing multiple unrelated concepts in superposition, and how to decompose activations into sparser, more interpretable features via sparse autoencoders (SAEs); a minimal SAE sketch follows this list
- Cross-lingual and cross-domain circuits — e.g., the Hinglish circuit project: finding attention heads that distinguish Hinglish tokens from English equivalents in bilingual sentences (a rough attention-probe sketch also follows this list)
- Scaling monosemanticity — how circuit structure changes as models scale
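A concrete way to test whether a component belongs to a circuit is activation patching: run a clean and a corrupted prompt, splice the clean activation of one component into the corrupted run, and measure how much of the clean prediction returns. The sketch below does this for a single attention block of GPT-2 via PyTorch forward hooks; the model choice, layer index, and prompts are illustrative assumptions, not a specific published circuit.

```python
# Minimal activation-patching sketch for circuit discovery: copy one attention
# block's output from a "clean" run into a "corrupted" run and check how much
# of the clean prediction is restored. Layer index and prompts are assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")

LAYER = 5  # hypothetical attention block under investigation

clean = tok("When Mary and John went to the store, John gave a drink to", return_tensors="pt")
corrupt = tok("When Mary and John went to the store, Mary gave a drink to", return_tensors="pt")

cache = {}

def save_hook(module, args, output):
    # GPT2Attention returns a tuple; output[0] is (batch, seq, d_model)
    cache["clean_attn"] = output[0].detach().clone()
    return None

def patch_hook(module, args, output):
    # Replace this block's contribution with the cached clean activation.
    return (cache["clean_attn"],) + output[1:]

attn = model.transformer.h[LAYER].attn

with torch.no_grad():
    handle = attn.register_forward_hook(save_hook)
    model(**clean)
    handle.remove()

    corrupt_logits = model(**corrupt).logits[0, -1]

    handle = attn.register_forward_hook(patch_hook)
    patched_logits = model(**corrupt).logits[0, -1]
    handle.remove()

# If this block carries the relevant information, patching should push the
# prediction back toward the clean answer (" Mary").
target = tok(" Mary")["input_ids"][0]
print("logit shift for ' Mary':", (patched_logits[target] - corrupt_logits[target]).item())
```

Sweeping this loop over every layer (and, with a hook on the attention output projection, every head) gives an attribution map of which components matter for the behaviour under study.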
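Sparse autoencoders tackle superposition by learning an overcomplete, sparse basis for a model's activations, so that each learned feature fires for a narrower concept than any single neuron. Below is a minimal PyTorch SAE sketch trained on stand-in activations; the dimensions, L1 coefficient, and synthetic data are assumptions for illustration, not the setup of any specific paper.

```python
# Minimal sparse-autoencoder (SAE) sketch for decomposing polysemantic
# activations into sparser, more interpretable features.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        # ReLU keeps feature activations non-negative and easy to sparsify.
        features = torch.relu(self.encoder(x))
        recon = self.decoder(features)
        return recon, features

d_model, d_features, l1_coeff = 768, 8 * 768, 1e-3
sae = SparseAutoencoder(d_model, d_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# Stand-in for a batch of residual-stream activations collected from a model.
activations = torch.randn(4096, d_model)

for step in range(100):
    recon, feats = sae(activations)
    recon_loss = (recon - activations).pow(2).mean()
    sparsity_loss = feats.abs().mean()  # L1 penalty encourages few active features
    loss = recon_loss + l1_coeff * sparsity_loss
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"reconstruction MSE: {recon_loss.item():.4f}, "
      f"mean active features per input: {(feats > 0).float().sum(-1).mean().item():.1f}")
```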
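For the cross-lingual direction, a first-pass probe is simply to score every attention head by how much attention it directs at code-mixed token positions in a bilingual sentence. The sketch below does this with HuggingFace's `output_attentions` on GPT-2; the sentence, the hand-picked Hinglish words, and the model are illustrative assumptions, not the actual project setup.

```python
# Rough probe: rank attention heads by the attention mass they send to
# hand-labelled Hinglish token positions in a code-mixed sentence.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")

sentence = "I was thinking ki we should go to the market abhi"
hinglish_words = ["ki", "abhi"]  # hand-labelled code-mixed words (assumption)

# Map the Hinglish words to character spans (search with a leading space so
# "ki" inside "thinking" is not matched), then to token positions via offsets.
word_spans = []
for w in hinglish_words:
    start = sentence.find(" " + w) + 1
    word_spans.append((start, start + len(w)))

enc = tok(sentence, return_tensors="pt", return_offsets_mapping=True)
offsets = enc.pop("offset_mapping")[0].tolist()
hinglish_positions = [i for i, (s, e) in enumerate(offsets)
                      if any(s < b and e > a for a, b in word_spans)]

with torch.no_grad():
    out = model(**enc, output_attentions=True)

# out.attentions: one (batch, heads, seq, seq) tensor per layer.
scores = []
for layer, attn in enumerate(out.attentions):
    to_hinglish = attn[0, :, :, hinglish_positions].mean(dim=(1, 2))
    for head, score in enumerate(to_hinglish.tolist()):
        scores.append((score, layer, head))

for score, layer, head in sorted(scores, reverse=True)[:5]:
    print(f"layer {layer} head {head}: mean attention to Hinglish tokens = {score:.3f}")
```

A real experiment would contrast this against matched English sentences and many examples; the single-sentence ranking here only illustrates the shape of the measurement.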
Connected Projects
- Arrakis — plug-and-play toolkit for running MI experiments on HuggingFace models
- Deeprobe — sparse autoencoder feature-space project
- SAE Macaronic Languages — probing whether LLMs develop shared interlingual representations across language pairs
Key Resources
- Transformer Circuits Thread — transformer-circuits.pub
- Neel Nanda MATS Stream — applied MI experiments at scale
- Scaling Monosemanticity (Anthropic, 2024)
Sources
- Obsidian Source: Notes / Neel Nanda - MATS Stream - wiki/sources/obsidian/notes-neel-nanda-mats-stream.md
- Obsidian Source: Notes / Scaling Monosemanticity - wiki/sources/obsidian/notes-scaling-monosemanticity.md
Evidence
Linked source: Obsidian Source: Notes / Neel Nanda - MATS Stream
Linked source: Obsidian Source: Notes / Scaling Monosemanticity