Mechanistic Interpretability
Mechanistic interpretability (MI) is an approach in AI alignment that attempts to reverse-engineer the internal computations of neural networks — understanding *how* and *why* a model produces its outputs, not just *what* it outputs.
Why It Matters
Understanding LLM internals is a prerequisite for trusting or correcting them. MI is the field's attempt to move from black-box evaluation to internal circuit analysis. Anthropic, Google DeepMind, EleutherAI, and Leap Labs are the main institutional contributors.
Key Research Directions
- Circuit discovery — identifying functional subgraphs within transformer attention and MLP layers (Anthropic's Transformer Circuits Thread is the canonical reference; see the activation-patching sketch after this list)
- Monosemanticity / Superposition — understanding why individual neurons are polysemantic, representing multiple unrelated concepts in superposition, and how to decompose activations into sparser, more interpretable features via sparse autoencoders (SAEs); a minimal SAE sketch follows this list
- Cross-lingual and cross-domain circuits — e.g., the Hinglish circuit project: finding attention heads that distinguish Hinglish tokens from English equivalents in bilingual sentences (a rough attention-probe sketch also follows this list)
- Scaling monosemanticity — how circuit structure changes as models scale
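A concrete way to test whether a component belongs to a circuit is activation patching: run a clean and a corrupted prompt, splice the clean activation of one component into the corrupted run, and measure how much of the clean prediction returns. The sketch below does this for a single attention block of GPT-2 via PyTorch forward hooks; the model choice, layer index, and prompts are illustrative assumptions, not a specific published circuit.

```python
# Minimal activation-patching sketch for circuit discovery: copy one attention
# block's output from a "clean" run into a "corrupted" run and check how much
# of the clean prediction is restored. Layer index and prompts are assumptions.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")

LAYER = 5  # hypothetical attention block under investigation

clean = tok("When Mary and John went to the store, John gave a drink to", return_tensors="pt")
corrupt = tok("When Mary and John went to the store, Mary gave a drink to", return_tensors="pt")

cache = {}

def save_hook(module, args, output):
    # GPT2Attention returns a tuple; output[0] is (batch, seq, d_model)
    cache["clean_attn"] = output[0].detach().clone()
    return None

def patch_hook(module, args, output):
    # Replace this block's contribution with the cached clean activation.
    return (cache["clean_attn"],) + output[1:]

attn = model.transformer.h[LAYER].attn

with torch.no_grad():
    handle = attn.register_forward_hook(save_hook)
    model(**clean)
    handle.remove()

    corrupt_logits = model(**corrupt).logits[0, -1]

    handle = attn.register_forward_hook(patch_hook)
    patched_logits = model(**corrupt).logits[0, -1]
    handle.remove()

# If this block carries the relevant information, patching should push the
# prediction back toward the clean answer (" Mary").
target = tok(" Mary")["input_ids"][0]
print("logit shift for ' Mary':", (patched_logits[target] - corrupt_logits[target]).item())
```

Sweeping this loop over every layer (and, with a hook on the attention output projection, every head) gives an attribution map of which components matter for the behaviour under study.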
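Sparse autoencoders tackle superposition by learning an overcomplete, sparse basis for a model's activations, so that each learned feature fires for a narrower concept than any single neuron. Below is a minimal PyTorch SAE sketch trained on stand-in activations; the dimensions, L1 coefficient, and synthetic data are assumptions for illustration, not the setup of any specific paper.

```python
# Minimal sparse-autoencoder (SAE) sketch for decomposing polysemantic
# activations into sparser, more interpretable features.
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_features)
        self.decoder = nn.Linear(d_features, d_model)

    def forward(self, x):
        # ReLU keeps feature activations non-negative and easy to sparsify.
        features = torch.relu(self.encoder(x))
        recon = self.decoder(features)
        return recon, features

d_model, d_features, l1_coeff = 768, 8 * 768, 1e-3
sae = SparseAutoencoder(d_model, d_features)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

# Stand-in for a batch of residual-stream activations collected from a model.
activations = torch.randn(4096, d_model)

for step in range(100):
    recon, feats = sae(activations)
    recon_loss = (recon - activations).pow(2).mean()
    sparsity_loss = feats.abs().mean()  # L1 penalty encourages few active features
    loss = recon_loss + l1_coeff * sparsity_loss
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"reconstruction MSE: {recon_loss.item():.4f}, "
      f"mean active features per input: {(feats > 0).float().sum(-1).mean().item():.1f}")
```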
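For the cross-lingual direction, a first-pass probe is simply to score every attention head by how much attention it directs at code-mixed token positions in a bilingual sentence. The sketch below does this with HuggingFace's `output_attentions` on GPT-2; the sentence, the hand-picked Hinglish words, and the model are illustrative assumptions, not the actual project setup.

```python
# Rough probe: rank attention heads by the attention mass they send to
# hand-labelled Hinglish token positions in a code-mixed sentence.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()
tok = GPT2TokenizerFast.from_pretrained("gpt2")

sentence = "I was thinking ki we should go to the market abhi"
hinglish_words = ["ki", "abhi"]  # hand-labelled code-mixed words (assumption)

# Map the Hinglish words to character spans (search with a leading space so
# "ki" inside "thinking" is not matched), then to token positions via offsets.
word_spans = []
for w in hinglish_words:
    start = sentence.find(" " + w) + 1
    word_spans.append((start, start + len(w)))

enc = tok(sentence, return_tensors="pt", return_offsets_mapping=True)
offsets = enc.pop("offset_mapping")[0].tolist()
hinglish_positions = [i for i, (s, e) in enumerate(offsets)
                      if any(s < b and e > a for a, b in word_spans)]

with torch.no_grad():
    out = model(**enc, output_attentions=True)

# out.attentions: one (batch, heads, seq, seq) tensor per layer.
scores = []
for layer, attn in enumerate(out.attentions):
    to_hinglish = attn[0, :, :, hinglish_positions].mean(dim=(1, 2))
    for head, score in enumerate(to_hinglish.tolist()):
        scores.append((score, layer, head))

for score, layer, head in sorted(scores, reverse=True)[:5]:
    print(f"layer {layer} head {head}: mean attention to Hinglish tokens = {score:.3f}")
```

A real experiment would contrast this against matched English sentences and many examples; the single-sentence ranking here only illustrates the shape of the measurement.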
Connected Projects
- Arrakis — plug-and-play toolkit for running MI experiments on HuggingFace models
- Deeprobe — sparse autoencoder feature-space project
- SAE Macaronic Languages — probing whether LLMs develop shared interlingual representations across language pairs
Key Resources
- Transformer Circuits Thread — transformer-circuits.pub
- Neel Nanda MATS Stream — applied MI experiments at scale
- Scaling Monosemanticity (Anthropic, 2024)
Sources
- Obsidian Source: Notes / Neel Nanda - MATS Stream - wiki/sources/obsidian/notes-neel-nanda-mats-stream.md
- Obsidian Source: Notes / Scaling Monosemanticity - wiki/sources/obsidian/notes-scaling-monosemanticity.md
Evidence
Linked source: Obsidian Source: Notes / Neel Nanda - MATS Stream
Linked source: Obsidian Source: Notes / Scaling Monosemanticity