Arrakis

Arrakis is a plug-and-play toolkit for conducting, tracking, and visualizing mechanistic interpretability (MI) experiments on transformer-based language models. It is published on PyPI as arrakis-mi.

Core Idea

The key bottleneck in MI research is iteration speed. Arrakis is built around decomposability: it ships 10+ pre-built tools for common MI operations (monosemanticity analysis, residual decomposition, read-write analysis, etc.) that can be composed freely without modifying the underlying model.
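The hook-based composability idea can be illustrated with a plain-Python sketch. This is not the arrakis-mi API; it only shows how a pass-through hook registry lets independent analysis tools read intermediate activations without the model's forward code ever changing.

```python
# Conceptual sketch only -- not the actual arrakis-mi API.
# A pass-through hook registry captures intermediate activations;
# analysis "tools" then read the captures, never the model itself.
class HookRegistry:
    """Collects activations captured during a forward pass."""
    def __init__(self):
        self.captured = {}

    def hook(self, name: str, value):
        self.captured[name] = value
        return value  # pass-through: the model's computation is unchanged

def forward(x: float, hooks: HookRegistry) -> float:
    # A toy two-"layer" model; hooks.hook() taps each intermediate value.
    h1 = hooks.hook("layer1", x * 2.0)
    h2 = hooks.hook("layer2", h1 + 1.0)
    return h2

# An independent tool that only reads captured activations:
def residual_delta(hooks: HookRegistry) -> float:
    return hooks.captured["layer2"] - hooks.captured["layer1"]

hooks = HookRegistry()
out = forward(3.0, hooks)
print(out)                    # 7.0
print(residual_delta(hooks))  # 1.0
```

Because every tool sees the same captured dictionary, new analyses can be added or removed without touching the forward pass, which is the decomposability property the toolkit is built around.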

Main Components

  • HookedAutoModel — a wrapper around HuggingFace PreTrainedModel that adds activation hooks via a single decorator on the forward function. Supports GPT-2, GPT-Neo, LLaMA, Gemma, Phi3, Qwen2, Mistral, Stable-LM.
  • InterpretabilityBench — the experiment workspace. Provides @exp.log_experiment (local version control for experiment code), @exp.profile_model, @exp.test_hypothesis, and @exp.use_tools decorators.
  • core_arrakis tools — plug-in MI operations for things like write-read analysis, attention heatmaps, and SAE feature inspection. Tools are requested as decorator arguments, and the decorated experiment function receives an extra argument that exposes the tool functions.
  • Graphing — @exp.plot_results with a PlotSpec generates visualizations without leaving the experiment function.
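The tool-injection pattern described above (tools named as decorator arguments, exposed through an extra argument) can be sketched in plain Python. The decorator name and registry here are illustrative only; the real @exp.use_tools signature may differ.

```python
# Hypothetical sketch of the decorator-injection pattern -- not the
# real arrakis-mi API. Tools are named as decorator arguments and
# handed to the experiment function via an extra 'tools' argument.
import functools

# Stand-in tool registry (the real toolkit ships its own operations).
TOOL_REGISTRY = {
    "write_read": lambda layer: f"write-read analysis on layer {layer}",
    "attn_heatmap": lambda layer: f"attention heatmap for layer {layer}",
}

def use_tools(*tool_names):
    """Decorator: inject the requested tools as an extra 'tools' argument."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            tools = {name: TOOL_REGISTRY[name] for name in tool_names}
            return fn(*args, tools=tools, **kwargs)
        return wrapper
    return decorator

@use_tools("write_read", "attn_heatmap")
def experiment(layer, tools):
    # The experiment body composes tools without touching the model.
    return [tools[name](layer) for name in tools]

print(experiment(5))
```

The design keeps experiment code declarative: swapping analyses means changing the decorator's argument list, not rewriting the experiment body.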

Open Source Launch

Arrakis was launched publicly with a full release checklist: PyPI, docs (ReadTheDocs), README, LICENSE, CONTRIBUTING, CODE_OF_CONDUCT, CHANGELOG, notebooks, and tests. Launch posts went out on Hacker News, Twitter/X, and LessWrong, and the launch also included personal outreach to professors and industry professionals.

Related Pages

Sources

Evidence

Linked source: GitHub Repo: arrakis

Linked source: Obsidian Source: Drafts / Arrakis