What question did this study set out to answer?

This research aims to enhance the interpretability of large language models by identifying causal features through an innovative framework.

June 18, 2026 · Open Access

Active Circuit Discovery: A Multi-Action POMDP Agent for Causal Feature Identification in Transformer Attribution Graphs

Key Points

Introduced Active Circuit Discovery (ACD) framework using attribution-graph analysis and active inference.
Implemented a POMDP agent to select features and intervention types for efficient experimentation.
Evaluated on two transformer models (Gemma-2-2B and Llama-3.2-1B) across multiple tasks and settings.
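Achieved 82.0% bounded oracle efficiency on Gemma IOI with an ablation-only agent and a budget of 20 interventions per prompt, exceeding random selection by 43.5% in relative terms (paired permutation p = 0.031).
Reached 73.0% efficiency on Gemma multi-step; competitive with greedy ranking and UCB baselines, although a direct Edge-Attribution-Patching ranking is a strong baseline the agent does not consistently surpass.
Showed task-dependent circuit structure: IOI circuits concentrate in late layers, while reasoning and scientific knowledge recruit early and middle layers.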
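
The relative improvement and the paired permutation p-value quoted above can be illustrated with a minimal sketch. It is not the authors' evaluation code: the function names and the per-prompt scores are hypothetical, and an exact two-sided sign-flip test on the mean paired difference is assumed as the permutation scheme.

```python
from itertools import product

def relative_improvement(agent_mean, baseline_mean):
    """Relative gain of the agent over a baseline, in percent."""
    return 100.0 * (agent_mean - baseline_mean) / baseline_mean

def paired_permutation_p(agent, baseline):
    """Exact two-sided paired (sign-flip) permutation test.

    Under the null hypothesis the sign of each per-prompt difference is
    arbitrary, so every sign assignment is enumerated and the fraction
    with a statistic at least as extreme as the observed one is returned.
    """
    diffs = [a - b for a, b in zip(agent, baseline)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # tolerance for float rounding
            hits += 1
    return hits / total

# Hypothetical per-prompt efficiencies, made up for illustration only.
agent_scores = [0.80, 0.85, 0.78, 0.84, 0.83, 0.82]
random_scores = [0.55, 0.60, 0.58, 0.57, 0.56, 0.57]
p_value = paired_permutation_p(agent_scores, random_scores)
```

Full enumeration costs 2^n evaluations, so it is only practical for a small number of prompts; larger samples would normally use randomly sampled sign flips instead.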

Abstract

Mechanistic interpretability seeks to reverse-engineer the computational circuits within large language models, but current methods rely on exhaustive or heuristic search over exponentially many feature interactions. This paper introduces Active Circuit Discovery (ACD), a framework that combines attribution-graph analysis with active inference to select interventions efficiently. ACD uses Anthropic’s circuit-tracer library as its attribution-graph backend, applying Edge Attribution Patching with transcoders to identify the active transcoder features for each prompt. A partially observable Markov decision process (POMDP) agent, implemented with pymdp, maintains a multi-factor generative model of feature importance, layer role, and causal influence. At each step, the agent selects both a target feature and an intervention type (ablation, activation patching, or feature steering) by minimising Expected Free Energy over the joint feature–action space, and it learns its observation model online through Dirichlet parameter updates. ACD is an intervention-selection layer over existing attribution-graph tools; it is not a whole-circuit discovery method, and no claim of state-of-the-art circuit discovery is made. The framework is evaluated on Gemma-2-2B (26 layers) and Llama-3.2-1B (16 layers) across four settings: Indirect Object Identification (IOI), multi-step reasoning, feature steering, and a multi-domain benchmark spanning geography, mathematics, science, logic, and history. With a budget of 20 interventions per prompt, an ablation-only agent scored by bounded oracle efficiency against the ablation oracle reaches 82.0% efficiency on Gemma IOI and 73.0% on Gemma multi-step. It exceeds random selection by 43.5% (relative) on Gemma IOI (paired permutation p = 0.031) and is competitive with greedy ranking, a heuristic UCB bandit, and a plain UCB baseline.
A direct Edge-Attribution-Patching ranking is itself a strong baseline that the agent does not consistently surpass, and on Llama multi-step the agent reaches 9.3% efficiency (37.8% with finer layer-role bins). All comparisons report bootstrap 95% confidence intervals. The full multi-action agent is characterised separately by a Relative Cumulative KL, a steering-driven amplification factor reported apart from the bounded efficiency. Feature steering changes the top-1 prediction in a dose-dependent manner, but a matched random-feature control shows that circuit-selected features are only marginally, and not significantly, more steerable than random active features at large multipliers, indicating that part of the effect is generic activation scaling. Multi-domain analysis shows task-dependent circuit structure, with IOI circuits concentrated in late layers and reasoning and scientific knowledge recruiting early and middle layers. Code, notebooks (free T4), AMD64/aarch64 Docker images, and raw results are publicly available.
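
The selection rule described in the abstract (minimising Expected Free Energy over the joint feature–action space, with Dirichlet updates to the observation model) can be sketched in plain Python. This is an illustrative toy under stated assumptions, not the paper's agent and not the pymdp API: it assumes a single two-state hidden factor per feature, one step of lookahead, the standard risk-plus-ambiguity decomposition of Expected Free Energy, and hypothetical feature identifiers and function names.

```python
import math

def normalise_columns(counts):
    """Turn Dirichlet counts[o][s] into an expected likelihood A[o][s] = P(o | s)."""
    n_obs, n_states = len(counts), len(counts[0])
    col = [sum(counts[o][s] for o in range(n_obs)) for s in range(n_states)]
    return [[counts[o][s] / col[s] for s in range(n_states)] for o in range(n_obs)]

def expected_free_energy(q_s, A, log_C):
    """One-step EFE = risk + ambiguity for belief q_s and log-preferences log_C."""
    n_obs, n_states = len(A), len(q_s)
    # Predicted outcome distribution under the current belief.
    q_o = [sum(A[o][s] * q_s[s] for s in range(n_states)) for o in range(n_obs)]
    # Risk: KL divergence from predicted outcomes to preferred outcomes.
    risk = sum(p * (math.log(p) - log_C[o]) for o, p in enumerate(q_o) if p > 0)
    # Ambiguity: expected entropy of the likelihood under the belief.
    ambiguity = -sum(q_s[s] * A[o][s] * math.log(A[o][s])
                     for s in range(n_states) for o in range(n_obs) if A[o][s] > 0)
    return risk + ambiguity

def select_intervention(beliefs, counts_by_action, log_C):
    """Pick the (feature, action) pair with the lowest EFE."""
    best = None
    for feature, q_s in beliefs.items():
        for action, counts in counts_by_action.items():
            g = expected_free_energy(q_s, normalise_columns(counts), log_C)
            if best is None or g < best[0]:
                best = (g, feature, action)
    return best[1], best[2]

def dirichlet_update(counts, obs_index, q_s, lr=1.0):
    """Add belief-weighted evidence for the observed outcome (in place)."""
    for s, p in enumerate(q_s):
        counts[obs_index][s] += lr * p

# Hypothetical setup: states = (causal, not causal), outcomes = (large, small effect).
beliefs = {"L20/f123": [0.5, 0.5], "L3/f7": [0.95, 0.05]}
counts_by_action = {
    "ablation": [[9.0, 1.0], [1.0, 9.0]],   # informative about the hidden state
    "steering": [[5.0, 5.0], [5.0, 5.0]],   # uninformative so far
}
flat_log_C = [math.log(0.5), math.log(0.5)]  # no outcome preference: purely epistemic
choice = select_intervention(beliefs, counts_by_action, flat_log_C)
```

With flat preferences the rule reduces to maximising expected information gain, so this toy agent favours the feature it is most uncertain about paired with the intervention whose learned likelihood best separates the hidden states.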

Cite This Study

Sathish et al. studied this question.

synapsesocial.com/papers/6a338d20630953a74978e26b https://doi.org/10.3390/sym18061043
