What question did this study set out to answer?

The study aims to develop MGA-CLIP, a multigranularity attribution framework that improves the interpretability of the CLIP model.

June 1, 2026

MGA-CLIP: A Multigranularity Attribution Framework for Cross-Modal Explainability in CLIP

Key Points

Proposed MGA-CLIP, a multigranularity attribution framework for improving the interpretability of the CLIP model.
Introduced a coarse-grained level that uses integrated gradients, accumulated along a path from a baseline to the input, to produce a stable global attribution.
Used a fine-grained level to quantify the semantic relevance between visual patches and text embeddings.
Established an explicit linkage between channel-level global attributions and patch-level local attributions, giving consistent explanations across granularity levels.
Reported that MGA-CLIP significantly outperformed existing methods in attribution stability and semantic focus.
Generated semantically consistent and cross-modally aligned explanations by fusing the coarse- and fine-grained results.
Applied the framework in a text-based adversarial patch experiment to reveal CLIP's internal reasoning behavior and support the analysis of multimodal alignment mechanisms.
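
To make the fine-grained point above concrete, the sketch below scores each visual patch by its cosine similarity to a single text embedding and arranges the scores on the patch grid. It is only an illustration under stated assumptions: the function name `patch_text_relevance`, the toy 3-dimensional embeddings, and the 2x2 grid are hypothetical, and the actual MGA-CLIP relevance measure also draws on the model's internal attention dependencies, which this sketch does not model.

```python
import numpy as np

def patch_text_relevance(patch_embeds, text_embed, grid_shape):
    """Cosine similarity of every patch embedding to one text embedding.

    patch_embeds: (num_patches, dim) array of patch embeddings (toy values here).
    text_embed:   (dim,) array for the text prompt.
    grid_shape:   (rows, cols) of the patch grid, rows * cols == num_patches.
    """
    # L2-normalize both sides so the dot product is a cosine similarity.
    patches = patch_embeds / np.linalg.norm(patch_embeds, axis=1, keepdims=True)
    text = text_embed / np.linalg.norm(text_embed)
    scores = patches @ text
    # Lay the per-patch scores out on the image's patch grid.
    return scores.reshape(grid_shape)

# Hypothetical embeddings: four patches on a 2x2 grid, embedding dimension 3.
patch_embeds = np.array([
    [1.0, 0.0, 0.0],   # points in the same direction as the text embedding
    [0.0, 1.0, 0.0],   # orthogonal to the text embedding
    [1.0, 1.0, 0.0],   # partially aligned
    [0.0, 0.0, 2.0],   # orthogonal, with a larger norm
])
text_embed = np.array([1.0, 0.0, 0.0])

relevance = patch_text_relevance(patch_embeds, text_embed, (2, 2))
print(relevance)
```

Because both sides are normalized, a patch's score depends only on its direction relative to the text embedding, not on its magnitude; the fourth patch has the largest norm but scores zero.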

Abstract

The contrastive language-image pretraining (CLIP) model has demonstrated remarkable performance in multimodal tasks, and the interpretability of its similarity-based cross-modal alignment mechanism has attracted considerable attention. However, existing visual explanation methods typically rely on single-gradient information, which is easily affected by noise and activation saturation, resulting in unstable and rough interpretations. To address this issue, we propose a multigranularity attribution framework for CLIP (MGA-CLIP) that provides a systematic analysis of the internal learning behaviors of deep multimodal neural networks. At the coarse-grained level, the method adopts the concept of integrated gradients (IG) to construct an integration path from a baseline to the input, accumulating gradients along the path to obtain a stable gradient tensor, which is then used to weight the feature space for a robust global interpretation. At the fine-grained level, the method leverages the model's cross-modal alignment characteristics and its internal attention dependencies to quantify the semantic relevance between visual patches and text embeddings, thereby capturing localized embedding interactions. Unlike previous works that treat global and local cues independently, our framework establishes an explicit linkage between channel-level global attributions and patch-level semantic reasoning, enabling consistent explanations across different granularity levels. By fusing the coarse- and fine-grained results, MGA-CLIP generates semantically consistent and cross-modally aligned explanations. Moreover, we apply the proposed attribution framework in a text-based adversarial patch experiment, demonstrating its strong capability to reveal the internal reasoning behavior of CLIP and to provide insightful interpretability references for analyzing multimodal alignment mechanisms. Extensive experiments show that MGA-CLIP significantly outperforms existing methods in attribution stability and semantic focus, effectively enhancing the interpretability of deep multimodal neural networks.
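
The coarse-grained step described above builds on integrated gradients, and a small worked example may help show why path accumulation is more stable than a single gradient. The sketch below is a minimal illustration under stated assumptions, not the authors' implementation: the toy score `model` (a tanh of a linear projection standing in for a saturating similarity score), its weights `w`, the zero baseline, and the step count are all hypothetical. It shows only the standard integrated-gradients attribution, not the subsequent feature-space weighting used in MGA-CLIP.

```python
import numpy as np

def model(x, w):
    """Toy saturating score (hypothetical stand-in for a similarity score)."""
    return np.tanh(w @ x)

def model_grad(x, w):
    """Analytic gradient of the toy score with respect to the input x."""
    return (1.0 - np.tanh(w @ x) ** 2) * w

def integrated_gradients(x, baseline, w, steps=200):
    """Approximate integrated gradients with a midpoint Riemann sum."""
    # Interpolation coefficients at the midpoints of `steps` equal intervals.
    alphas = (np.arange(steps) + 0.5) / steps
    # Points on the straight-line path from the baseline to the input.
    path = baseline + alphas[:, None] * (x - baseline)
    # Average the gradient over the whole path, not just at the input.
    avg_grad = np.stack([model_grad(p, w) for p in path]).mean(axis=0)
    # Scale by the input difference to obtain per-feature attributions.
    return (x - baseline) * avg_grad

w = np.array([2.0, -1.0, 0.5])      # hypothetical weights
x = np.array([3.0, 1.0, 2.0])       # input deep in the saturated regime
baseline = np.zeros_like(x)         # assumed all-zero baseline

single_grad = model_grad(x, w)      # nearly zero because tanh has saturated
attributions = integrated_gradients(x, baseline, w)

# Completeness: attributions should sum to model(x) - model(baseline).
completeness_gap = abs(attributions.sum() - (model(x, w) - model(baseline, w)))
print(single_grad, attributions, completeness_gap)
```

At this input the single gradient is close to zero for every feature, even though the first feature clearly drives the score, whereas the path-averaged attribution assigns it most of the output change. This is the saturation problem that the abstract attributes to single-gradient explanation methods.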
