The contrastive language-image pretraining (CLIP) model has demonstrated remarkable performance in multimodal tasks, and the interpretability of its similarity-based cross-modal alignment mechanism has attracted considerable attention. However, existing visual explanation methods typically rely on single-gradient information, which is easily affected by noise and activation saturation, resulting in unstable and coarse explanations. To address this issue, we propose a multigranularity attribution framework for CLIP (MGA-CLIP) that provides a systematic analysis of the internal learning behaviors of deep multimodal neural networks. At the coarse-grained level, the method adopts the concept of integrated gradients (IG): it constructs an integration path from a baseline to the input and accumulates the gradients along this path to obtain a stable gradient tensor, which is then used to weight the feature space for a robust global explanation. At the fine-grained level, the method leverages the model's cross-modal alignment characteristics and its internal attention dependencies to quantify the semantic relevance between visual patches and text embeddings, thereby capturing localized embedding interactions. Unlike previous works that treat global and local cues independently, our framework establishes an explicit link between channel-level global attributions and patch-level semantic reasoning, enabling consistent explanations across granularity levels. By fusing the coarse- and fine-grained results, MGA-CLIP generates semantically consistent and cross-modally aligned explanations. Moreover, we apply the proposed attribution framework in a text-based adversarial patch experiment, demonstrating its strong capability to reveal the internal reasoning behavior of CLIP and to provide useful interpretability references for analyzing multimodal alignment mechanisms.
Extensive experiments show that MGA-CLIP significantly outperforms existing methods in attribution stability and semantic focus, effectively enhancing the interpretability of deep multimodal neural networks.
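To make the coarse-grained step concrete, the following is a minimal sketch of generic integrated gradients, not the MGA-CLIP implementation. It assumes a toy differentiable function with an analytic gradient in place of CLIP; the names `integrated_gradients`, `toy_model`, `toy_grad`, and `WEIGHTS` are hypothetical and chosen only for illustration. The sketch accumulates gradients along a straight path from a baseline to the input and scales the average by the input-baseline difference.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def integrated_gradients(
    grad_fn: Callable[[Sequence[float]], Vector],
    x: Sequence[float],
    baseline: Sequence[float],
    steps: int = 50,
) -> Vector:
    """Approximate integrated gradients with a midpoint Riemann sum."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        # Point on the straight-line path from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = grad_fn(point)
        total = [t + g for t, g in zip(total, grad)]
    # Average the path gradients, then scale by (input - baseline).
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


# Hypothetical toy "model" standing in for CLIP: f(x) = sum_i w_i * x_i^2.
WEIGHTS = [0.5, -1.0, 2.0]


def toy_model(x: Sequence[float]) -> float:
    return sum(w * xi * xi for w, xi in zip(WEIGHTS, x))


def toy_grad(x: Sequence[float]) -> Vector:
    # Analytic gradient of toy_model: d f / d x_i = 2 * w_i * x_i.
    return [2.0 * w * xi for w, xi in zip(WEIGHTS, x)]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_grad, x, baseline)

# Completeness check: attributions should sum to f(x) - f(baseline).
gap = abs(sum(attributions) - (toy_model(x) - toy_model(baseline)))
print(attributions, gap)
```

Because the path gradients are averaged over many interpolation points, the result depends less on any single gradient evaluation, which is the stability property that motivates the coarse-grained branch. In a real setting the analytic `toy_grad` would be replaced by automatic differentiation through the model.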
Cheng et al. (Thu,) studied this question.
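The fine-grained idea of scoring visual patches against a text embedding can be illustrated with a small sketch. It is an assumption-laden simplification: it uses only cosine similarity with min-max normalization and omits the attention dependencies that the abstract also mentions. The function names and the two-dimensional embeddings are hypothetical and do not come from CLIP.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def patch_text_relevance(
    patch_embeddings: Sequence[Sequence[float]],
    text_embedding: Sequence[float],
) -> List[float]:
    """Score each patch by similarity to the text, rescaled to [0, 1]."""
    sims = [cosine(p, text_embedding) for p in patch_embeddings]
    lo, hi = min(sims), max(sims)
    if hi == lo:
        # All patches are equally similar; no patch stands out.
        return [0.0] * len(sims)
    return [(s - lo) / (hi - lo) for s in sims]


# Hypothetical 2-D embeddings for four patches and one text prompt.
patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
text = [1.0, 0.0]
relevance = patch_text_relevance(patches, text)
print(relevance)
```

The patch aligned with the text direction receives the highest score and the opposed patch the lowest, giving a per-patch relevance map that could, under these assumptions, be combined with a global attribution map.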