
Mechanistic Interpretability

Mechanistic interpretability (MI) is an approach in AI alignment that attempts to reverse-engineer the internal computations of neural networks — understanding *how* and *why* a model produces its outputs, not just *what* it outputs.

Why It Matters

Understanding LLM internals is a prerequisite for trusting or correcting them. MI is the field's attempt to move from black-box evaluation to internal circuit analysis. Anthropic, Google DeepMind, EleutherAI, and Leap Labs are the main institutional contributors.

Key Research Directions

  • Circuit discovery — identifying functional subgraphs within transformer attention and MLP layers (Anthropic's Transformer Circuits Thread is the canonical reference); see the ablation sketch after this list
  • Monosemanticity / Superposition — understanding why individual neurons represent multiple concepts, and how to decompose them into interpretable features via sparse autoencoders (SAEs); see the SAE sketch after this list
  • Cross-lingual and cross-domain circuits — e.g., the Hinglish circuit project: finding attention heads that distinguish Hinglish tokens from English equivalents in bilingual sentences
  • Scaling monosemanticity — how feature dictionaries and circuit structure change as models scale
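
A common circuit-discovery move is causal ablation: silence one component and measure how the model's prediction degrades. Below is a minimal sketch using the open-source TransformerLens library on GPT-2 small; the specific layer, head, and prompt are arbitrary choices for illustration, not findings from the Transformer Circuits Thread.

```python
import torch
from transformer_lens import HookedTransformer, utils

model = HookedTransformer.from_pretrained("gpt2")
tokens = model.to_tokens("The Eiffel Tower is located in the city of")

def zero_head(value, hook, head_idx=9):
    # value has shape [batch, pos, n_heads, d_head]; zero out one head's output
    value[:, :, head_idx, :] = 0.0
    return value

clean_logits = model(tokens)
ablated_logits = model.run_with_hooks(
    tokens,
    fwd_hooks=[(utils.get_act_name("z", 9), zero_head)],  # layer 9 attention output
)

# If the correct-token logit drops sharply under ablation, the head is
# causally implicated in the circuit producing this prediction.
target = model.to_single_token(" Paris")
print("clean:", clean_logits[0, -1, target].item())
print("ablated:", ablated_logits[0, -1, target].item())
```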
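
Sparse autoencoders attack superposition by mapping activations into a wider, sparsely active feature basis. Here is a minimal PyTorch sketch of the basic recipe (an overcomplete dictionary trained with an L1 penalty); the dimensions and coefficient are illustrative assumptions, not hyperparameters from any published SAE.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model=768, expansion=8):
        super().__init__()
        # Overcomplete dictionary: many more features than model dimensions
        self.encoder = nn.Linear(d_model, expansion * d_model)
        self.decoder = nn.Linear(expansion * d_model, d_model)

    def forward(self, acts):
        features = torch.relu(self.encoder(acts))  # non-negative, mostly-zero codes
        recon = self.decoder(features)
        return recon, features

def sae_loss(recon, acts, features, l1_coeff=1e-3):
    # Reconstruction term keeps the dictionary faithful; the L1 term pushes
    # each input to activate few features, which is what makes individual
    # features more likely to be monosemantic.
    return ((recon - acts) ** 2).mean() + l1_coeff * features.abs().mean()

# Usage: train on cached residual-stream activations from the model under study.
sae = SparseAutoencoder()
acts = torch.randn(64, 768)  # stand-in for real cached activations
recon, feats = sae(acts)
loss = sae_loss(recon, acts, feats)
```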

Connected Projects

  • Arrakis — plug-and-play toolkit for running MI experiments on HuggingFace models
  • Deeprobe — sparse autoencoder feature-space project
  • SAE Macaronic Languages — probing whether LLMs develop shared interlingual representations across language pairs; a minimal probing sketch follows this list
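
One cheap first probe for shared interlingual representations is to compare hidden states for translation pairs. The sketch below mean-pools a mid-layer hidden state and computes cosine similarity; the model (xlm-roberta-base), layer, and sentences are stand-in assumptions, not choices made by the SAE Macaronic Languages or Hinglish projects.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModel.from_pretrained("xlm-roberta-base", output_hidden_states=True)

def sentence_repr(text, layer=8):
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**ids).hidden_states[layer]  # [1, seq_len, d_model]
    return hidden.mean(dim=1).squeeze(0)            # mean-pool over tokens

en = sentence_repr("The weather is nice today.")
hi = sentence_repr("Aaj mausam accha hai.")  # romanized Hindi (Hinglish)
cos = torch.nn.functional.cosine_similarity(en, hi, dim=0)
print(f"cross-lingual cosine similarity: {cos.item():.3f}")
```

High similarity for translation pairs, relative to unrelated sentence pairs, is only weak evidence of shared structure; the project-level question is whether SAE features make that sharing explicit.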

Key Resources

  • Transformer Circuits Thread — transformer-circuits.pub
  • Neel Nanda MATS Stream — applied MI experiments at scale
  • Scaling Monosemanticity (Anthropic, 2024)

Sources

  • Neel Nanda - MATS Stream (Obsidian note)
  • Scaling Monosemanticity (Obsidian note)