The success of Deep Neural Networks (DNNs) is often attributed to their ability to learn powerful representations, which capture increasingly abstract features from data to make predictions. However, the semantic nature of these representations (specifically, what concepts they encode and how those abstractions are used in model decision-making) remains largely unknown. This thesis addresses this gap through a systematic investigation of the interpretability of latent representations in Computer Vision models, proposing novel frameworks to analyze, label, and explain their learned abstractions. We first analyze relationships among neurons to uncover structural patterns and spurious correlations within learned representations. Next, we present a method that describes neural representations with human-understandable textual labels, enabling precise identification of the concepts captured by individual neurons. Building on these insights, we develop a unified framework that decomposes model decisions into sparse, interpretable concept combinations, thereby revealing how models use specific features during inference. Through empirical validation, we demonstrate how these approaches enhance model transparency, enable targeted identification of biases, and provide actionable insights for mitigating spurious correlations. Our work bridges the gap between empirical performance and interpretability, offering tools that make the decision-making processes of DNNs transparent while fostering trust and accountability in real-world applications.
Kirill Bykov studied this question.