Mechanistic Interpretability aims to understand neural net- works through causal explanations. We argue for the Explanatory View Hypothesis: that Mechanistic Interpretability re- search is a principled approach to understanding models be- cause neural networks contain implicit explanations which can be extracted and understood. We hence show that Explanatory Faithfulness, an assessment of how well an explanation fits a model, is well-defined. We propose a definition of Mechanistic Interpretability (MI) as the practice of producing Model-level, Ontic, Causal-Mechanistic, and Falsifiable explanations of neural networks, allowing us to distinguish MI from other interpretability paradigms and detail MI’s inherent limits. We formulate the Principle of Explanatory Optimism, a conjecture which we argue is a necessary precondition for the success of Mechanistic Interpretability.
Ayonrinde et al. (Wed,) studied this question.
Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context: