Abstract

Large-scale vision–language pre-trained models such as CLIP play a central role in modern multimodal artificial intelligence. However, their cross-modal decision process remains difficult to interpret, which limits reliable deployment in practical applications. Existing explanation methods for CLIP often exhibit semantic entanglement and inaccurate spatial localization in cross-modal attribution. This paper presents a gradient-guided class-aware semantic disentanglement attribution method for CLIP. The proposed method explicitly disentangles class-related semantics aggregated in the global token during attribution. This design suppresses irrelevant semantic interference and produces visual explanations with improved semantic consistency, clearer structural organization, and more accurate spatial localization. We further introduce a novel gradient guidance strategy that balances importance assignment at the channel level and guides spatial attribution toward regions that are discriminative for the target semantics. As a result, the proposed approach generates more stable and faithful visual explanations. Extensive qualitative and quantitative experiments on ImageNet and MSCOCO 2014 demonstrate that the proposed method consistently outperforms existing approaches in explanation fidelity and reliability.
Li et al. (Fri,) studied this question.