Modern neural networks operate in high-dimensional continuous representation spaces, whereas human understanding is grounded in discrete symbolic language. We formalize this gap through measure theory and prove that complete interpretability, the ability to provide precise linguistic descriptions for all internal representations, is mathematically impossible. Specifically, we distinguish between coarse and precise describability: while language can describe regions of representation space with positive measure (e.g., “features detecting edges”), it can only precisely identify a measure-zero set of individual points. Under the strong assumption that linguistic concepts describe uncountable regions rather than isolated points, we prove that the set of precisely describable representations has Lebesgue measure zero. Our framework offers a theoretical perspective on the fundamental limits of interpretability, though the practical implications for AI safety research require further empirical investigation.
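The measure-zero claim can be illustrated with a standard countability argument. The sketch below is not taken from the paper's proof and may differ from it. It assumes the representation space is $\mathbb{R}^n$ and that a linguistic description is a finite string over a finite alphabet. The symbols $\Sigma$, $D$, $\lambda$, $Q_k$, and $n$ are notation introduced here for illustration only.

```latex
% Assumptions (illustrative, not from the source):
%   \Sigma   : a finite alphabet, so \Sigma^* (all finite strings) is countable
%   D        : the set of points in R^n precisely identified by some string
%   \lambda  : Lebesgue measure on R^n
\begin{align*}
  &\text{Each point of } D \text{ is named by at least one string in } \Sigma^{*},
    \text{ and each string names at most one point,} \\
  &\text{so } D \text{ is countable: } D = \{x_1, x_2, x_3, \dots\}. \\[4pt]
  &\text{Fix } \varepsilon > 0 \text{ and cover each } x_k \text{ by a cube } Q_k
    \text{ with } \lambda(Q_k) = \varepsilon\, 2^{-k}. \\[4pt]
  &\lambda(D) \;\le\; \sum_{k=1}^{\infty} \lambda(Q_k)
    \;=\; \sum_{k=1}^{\infty} \varepsilon\, 2^{-k} \;=\; \varepsilon . \\[4pt]
  &\text{Since } \varepsilon \text{ was arbitrary, } \lambda(D) = 0 .
\end{align*}
```

Under these assumptions, a description such as “features detecting edges” can still pick out a region of positive measure, but the points that can be singled out individually form a set that is negligible in the Lebesgue sense.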
Zhongmang Cheng studied this question.