Arrakis
Arrakis is a plug-and-play toolkit for conducting, tracking, and visualizing mechanistic interpretability (MI) experiments on transformer-based language models. Published on PyPI as arrakis-mi.
Core Idea
The key bottleneck in MI research is iteration speed. Arrakis addresses this through decomposability: more than ten pre-built tools for common MI operations (monosemanticity analysis, residual decomposition, read-write analysis, etc.) that can be composed freely, without modifying the underlying model.
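The composition idea can be illustrated with a minimal, self-contained sketch. This is plain Python with hypothetical tool names and a hypothetical `use_tools` decorator, not Arrakis's actual API: analysis tools live in a registry, and a decorator injects the requested ones into the experiment function so the model itself is never modified.

```python
# Conceptual sketch (NOT Arrakis's real implementation): composing
# pre-built analysis "tools" onto an experiment function via a decorator.
# All names below are hypothetical stand-ins.
from functools import wraps

# A registry of plug-in tools keyed by name.
TOOLS = {
    "residual_decomposition": lambda acts: {"norm": sum(a * a for a in acts) ** 0.5},
    "write_read": lambda acts: {"max": max(acts)},
}

def use_tools(*names):
    """Decorator that injects the requested tool functions as an extra argument."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            selected = {n: TOOLS[n] for n in names}
            return fn(*args, tools=selected, **kwargs)
        return wrapper
    return decorator

@use_tools("residual_decomposition")
def experiment(acts, tools):
    # The experiment only sees the tools it asked for.
    return tools["residual_decomposition"](acts)

print(experiment([3.0, 4.0]))  # {'norm': 5.0}
```

The point of the pattern is that adding a new tool means adding one registry entry, not touching the model or existing experiments.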
Main Components
- HookedAutoModel — a wrapper around HuggingFace PreTrainedModel that adds activation hooks via a single decorator on the forward function. Supports GPT-2, GPT-Neo, LLaMA, Gemma, Phi3, Qwen2, Mistral, and Stable-LM.
- InterpretabilityBench — the experiment workspace. Provides the @exp.log_experiment (local version control for experiment code), @exp.profile_model, @exp.test_hypothesis, and @exp.use_tools decorators.
- core_arrakis tools — plug-in MI operations such as write-read analysis, attention heatmaps, and SAE feature inspection. Tools are passed as decorator arguments; an extra argument on the experiment function gives access to the tool functions.
- Graphing — @exp.plot_results with a PlotSpec generates visualizations without leaving the experiment function.
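The hook mechanism behind the first component can be sketched in miniature. This is a conceptual illustration in plain Python, not HookedAutoModel's source: a decorator wraps the model's forward function, temporarily intercepts each layer call to record its output, and leaves the model code itself untouched. The toy model and cache attribute are invented for the example.

```python
# Conceptual sketch (NOT the HookedAutoModel source): capturing intermediate
# activations by decorating forward, with no changes to the model's own code.
from functools import wraps

class TinyModel:
    """Stand-in for a transformer: two 'layers' that each double their input."""
    def layer(self, x):
        return [2 * v for v in x]

    def forward(self, x):
        h1 = self.layer(x)
        h2 = self.layer(h1)
        return h2

def with_activation_cache(forward):
    """Decorator that stashes every layer output on the model instance."""
    @wraps(forward)
    def wrapper(self, x):
        self.cache = []
        original_layer = self.layer

        def hooked_layer(h):
            out = original_layer(h)
            self.cache.append(out)  # record the activation
            return out

        self.layer = hooked_layer  # install the hook
        try:
            return forward(self, x)
        finally:
            self.layer = original_layer  # always remove the hook afterwards

    return wrapper

# One decoration, applied from outside the model definition.
TinyModel.forward = with_activation_cache(TinyModel.forward)

m = TinyModel()
print(m.forward([1, 2]))  # [4, 8]
print(m.cache)            # [[2, 4], [4, 8]]
```

Installing and removing the hook around each call is what lets the same wrapped model serve many different experiments without accumulating state.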
Open Source Launch
Arrakis was launched publicly with a full release checklist: PyPI, docs (ReadTheDocs), README, LICENSE, CONTRIBUTING, CODE_OF_CONDUCT, CHANGELOG, notebooks, and tests. Posts were made on Hacker News, Twitter/X, and LessWrong, alongside personal outreach to professors and industry professionals.
Sources
- GitHub Repo: arrakis - wiki/sources/github/arrakis.md
- Obsidian Source: Drafts / Arrakis - wiki/sources/obsidian/drafts-arrakis.md