Deep learning models have evolved into a cornerstone of modern industry and science, enabling applications from medical diagnosis and perception to conversational systems. Over the past two decades, both models and datasets have grown substantially in scale: models now reach trillions of parameters, trained on datasets with billions of samples. Despite their success, our understanding and control of these models remain limited, hindering safe and robust deployment in critical domains. Regulations such as the EU AI Act further emphasize the need for transparent and accountable AI systems. The field of eXplainable Artificial Intelligence (XAI) has introduced techniques like attribution maps and feature visualizations to illuminate singular aspects of model behavior. Yet, achieving a comprehensive understanding that enables validation and control of the complex mechanisms inside AI models requires the combination of multiple XAI perspectives. This is already challenging, and because most approaches rely on manual inspection of individual explanations, they fail to scale with the size and complexity of today’s models and datasets. This dissertation develops an explainability framework that is (i) mechanistic, by providing component-level insights, (ii) comprehensive, by integrating multiple interpretability perspectives, (iii) scalable, by aggregating and summarizing explanations across data and model components while flagging outliers and deviations, and (iv) actionable, by directly informing practical strategies for refining and improving model behavior. Key contributions of this thesis include: (1) A foundation for comprehensive mechanistic explanations that integrate component-level attributions, input localizations, and feature visualizations, validated via a user study.
(2) Measuring and improving interpretability by introducing multiple measures to estimate human interpretability of components, further validated through a user study, and methods to mitigate issues such as polysemanticity. (3) Prototypical Concept-based Explanations that summarize model behavior across entire datasets using a small set of concept-level prototypes. (4) Semantic component embeddings that enable text-based semantic search, labeling, clustering, and comparison of model components. (5) Automated auditing methods such as outlier detection and concept alignment analysis to flag spurious or unexpected behaviors. (6) Interpretability-informed correction techniques that refine and correct models based on mechanistic insights. Through experiments on state-of-the-art vision models, this work demonstrates that mechanistic explanations enable the identification, understanding, and correction of model failures, providing a path toward more transparent, robust, and controllable AI systems.
Maximilian Dreyer
Maximilian Dreyer (Thu,) studied this question.
www.synapsesocial.com/papers/69ba42bc4e9516ffd37a349c — DOI: https://doi.org/10.14279/depositonce-25485