What question did this study set out to answer?

The aim is to develop a comprehensive framework for understanding and explaining deep learning models, enhancing interpretability and control.

March 18, 2026 · Open Access

From local explanations to comprehensive mechanistic understanding of deep vision models

Key Points

Developed a mechanistic explainability framework integrating various interpretability techniques
Conducted user studies to validate the effectiveness of proposed methods
Introduced automated auditing and correction methods addressing model behavior issues
Demonstrated that mechanistic explanations enhance understanding and correction of model failures
Successfully integrated input localizations and concept-level prototypes for improved interpretability
User studies showed increased human interpretability of model components

Abstract

Deep learning models have evolved into a cornerstone of modern industry and science, enabling applications from medical diagnosis and perception to conversational systems. Over the past two decades, both models and datasets have grown substantially in scale: models now reach trillions of parameters, trained on datasets with billions of samples. Despite their success, our understanding and control of these models remain limited, hindering safe and robust deployment in critical domains. Regulations such as the EU AI Act further emphasize the need for transparent and accountable AI systems. The field of eXplainable Artificial Intelligence (XAI) has introduced techniques like attribution maps and feature visualizations to illuminate singular aspects of model behavior. Yet, achieving a comprehensive understanding that enables validation and control of the complex mechanisms inside AI models requires the combination of multiple XAI perspectives. This is already challenging, and as most approaches rely on manual inspection of individual explanations, they fail to scale with the size and complexity of today’s models and datasets. This dissertation develops an explainability framework that is (i) mechanistic, by providing component-level insights, (ii) comprehensive, by integrating multiple interpretability perspectives, (iii) scalable, by aggregating and summarizing explanations across data and model components while flagging outliers and deviations, and (iv) actionable, by directly informing practical strategies for refining and improving model behavior. Key contributions of this thesis include: (1) A foundation for comprehensive mechanistic explanations that integrate component-level attributions, input localizations, and feature visualizations, validated via a user study. (2) Measuring and improving interpretability by introducing multiple measures to estimate human interpretability of components, further validated through a user study, and methods to mitigate issues such as polysemanticity. (3) Prototypical Concept-based Explanations that summarize model behavior across entire datasets using a small set of concept-level prototypes. (4) Semantic component embeddings that enable text-based semantic search, labeling, clustering, and comparison of model components. (5) Automated auditing methods such as outlier detection and concept alignment analysis to flag spurious or unexpected behaviors. (6) Interpretability-informed correction techniques that refine and correct models based on mechanistic insights. Through experiments on state-of-the-art vision models, this work demonstrates that mechanistic explanations enable the identification, understanding, and correction of model failures, providing a path toward more transparent, robust, and controllable AI systems.

Cite This Study

Maximilian Dreyer studied this question.

synapsesocial.com/papers/69ba42bc4e9516ffd37a349c https://doi.org/10.14279/depositonce-25485