Deeprobe

Deeprobe uses Monte Carlo Tree Search to navigate the feature space of Sparse Autoencoders (SAEs) for mechanistic interpretability. The premise: treat feature discovery in a high-dimensional SAE as a search problem, and let MCTS find important features rather than scanning exhaustively.

The Idea

SAEs decompose model activations into sparse, interpretable features — but finding which features matter for a specific task still requires human judgment or brute-force scanning. Deeprobe asks: can a search algorithm find them automatically?

Applied to the Indirect Object Identification (IOI) task in GPT-2, the feature space is enormous: a 768-dimensional residual stream expanded to 16,000+ SAE features. MCTS navigates this space by selecting nodes with UCB1, expanding feature combinations, and computing reward as cosine similarity to target outputs.
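The four-phase loop described above (UCB1 selection, expansion, reward, backpropagation) can be sketched in miniature. This is a toy, self-contained illustration, not the Deeprobe implementation: features are plain indices, `toy_reward` is a stand-in overlap score rather than cosine similarity over real activations, and the feature count is tiny. All names here are hypothetical.

```python
# Toy MCTS over SAE-feature subsets (illustrative sketch, not Deeprobe's code).
import math
import random

random.seed(0)

N_FEATURES = 8          # toy stand-in for the ~16k SAE features
TARGET = {1, 4, 6}      # hidden "important" feature set the search should recover

def toy_reward(subset):
    # Stand-in for cosine similarity to target outputs:
    # Jaccard overlap with the hidden target set.
    if not subset:
        return 0.0
    return len(subset & TARGET) / len(subset | TARGET)

class Node:
    def __init__(self, subset, parent=None):
        self.subset = frozenset(subset)
        self.parent = parent
        self.children = {}   # feature index -> child Node
        self.visits = 0
        self.value = 0.0     # running mean reward

    def ucb1(self, c=1.4):
        # Unvisited children are explored first; otherwise balance the
        # running mean (exploitation) against the visit ratio (exploration).
        if self.visits == 0:
            return float("inf")
        return self.value + c * math.sqrt(math.log(self.parent.visits) / self.visits)

def untried(node):
    return [f for f in range(N_FEATURES)
            if f not in node.subset and f not in node.children]

def iter_nodes(node):
    yield node
    for child in node.children.values():
        yield from iter_nodes(child)

def search(iterations=500):
    root = Node(set())
    for _ in range(iterations):
        # 1. Selection: descend by UCB1 while the node is fully expanded.
        node = root
        while not untried(node) and node.children:
            node = max(node.children.values(), key=Node.ucb1)
        # 2. Expansion: grow the subset by one untried feature.
        remaining = untried(node)
        if remaining:
            f = random.choice(remaining)
            child = Node(node.subset | {f}, parent=node)
            node.children[f] = child
            node = child
        # 3. Evaluation: score the subset directly (no rollout in this toy).
        reward = toy_reward(node.subset)
        # 4. Backpropagation: update running means up to the root.
        while node is not None:
            node.visits += 1
            node.value += (reward - node.value) / node.visits
            node = node.parent
    # Return the best-scoring subset found anywhere in the tree.
    best = max(iter_nodes(root), key=lambda n: n.value)
    return best.subset

best = search()
print(sorted(best))
```

The real problem swaps `toy_reward` for a reward computed from model activations — which is exactly where the reward-modeling difficulty discussed below lives.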

Stack

  • TransformerLens — model access and activation hooks
  • SAELens — pre-trained SAEs
  • Custom MCTS — UCB1 selection, tree expansion, backpropagation

The Honest Problem

MCTS is well-understood. SAEs are mature. The unsolved piece is reward modeling: specifying what "finding the right feature" means precisely enough for MCTS to navigate toward it. Cosine similarity to target outputs is too noisy. Better alternatives: causal interventions (does activating this feature cause the target behavior?), contrastive rewards, or a learned reward classifier.
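Of those alternatives, the contrastive reward is the simplest to sketch: instead of scoring a candidate direction by raw cosine similarity to the target (which many unrelated directions partially satisfy), subtract its similarity to a contrastive baseline. This is a hypothetical illustration with toy vectors; `contrastive_reward` and the vector names are assumptions, not Deeprobe's API.

```python
# Contrastive reward sketch: reward directions that are *specifically*
# target-like, not merely correlated with everything. Toy vectors only.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def contrastive_reward(candidate, target, baseline):
    # Plain cosine-to-target is noisy; subtracting similarity to a
    # contrastive baseline cancels the shared, uninformative component.
    return cosine(candidate, target) - cosine(candidate, baseline)

target   = [1.0, 0.0, 0.0]   # activation direction of the desired behavior
baseline = [0.0, 1.0, 0.0]   # activation direction of a contrastive prompt
aligned  = [0.9, 0.1, 0.0]   # mostly target-directed -> high reward
generic  = [0.7, 0.7, 0.0]   # correlates with both equally -> reward near 0

print(contrastive_reward(aligned, target, baseline))
print(contrastive_reward(generic, target, baseline))
```

A causal-intervention reward would replace the cosine terms entirely: activate the feature, run the model, and score whether the target behavior actually occurs.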

The original inversion that motivated the project: MCTS is typically used on *compressed* latent spaces in drug discovery. SAEs do the opposite — they *expand* into an interpretable space. Using MCTS on the expanded side is the novel direction.

Sources

  • GitHub Repo: deeprobe
  • Website Source: blog / deeprobe_blog