What does this research mean for the field?

Popular visual explanation methods for deep learning, including CAM, gradients, and LIME, exhibit quantifiable statistical differences in robustness and fidelity when evaluated on convolutional neural networks. Novelty: incremental. Consensus alignment: neutral.

What question did this study set out to answer?

This work aims to assess the robustness of popular visual explanation methods for interpreting deep learning models.

June 10, 2026 | Open Access

Is the Visual Explanation of Deep Learning Robust? Statistical Evaluation of Popular Visual Explanation Methods on State-of-the-Art Convolutional Neural Networks in Classification Tasks

Key Points

This work aims to assess the robustness of popular visual explanation methods for interpreting deep learning models.
Analyzed convolutional neural networks classifying 598 cat images and 580 Boeing plane images.
Compared explanation methods based on CAM, gradients, and LIME.
Evaluated the methods with Pearson CC, Matthews CC, Spearman's rank correlation, SSIM, Kullback–Leibler divergence, IoU, and Soft IoU.
Found statistically measurable differences between the explanation methods.
Identified specific scenarios where certain methods outperformed others, particularly in fidelity and robustness.
Results provide a comprehensive analysis that can serve as a reference for AI practitioners.
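
As an illustration of how two explanation maps might be compared with some of the metrics named above, here is a minimal NumPy sketch. It is not the study's code. The function names, the 0.5 binarization threshold, and the Soft IoU definition (sum of element-wise minima over sum of element-wise maxima) are assumptions made for this example.

```python
import numpy as np

def normalize(m):
    """Scale a saliency map to [0, 1]; a constant map becomes all zeros."""
    m = np.asarray(m, dtype=float)
    span = m.max() - m.min()
    return (m - m.min()) / span if span > 0 else np.zeros_like(m)

def pearson_cc(a, b):
    """Pearson correlation between the flattened maps."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])

def iou(a, b, threshold=0.5):
    """Intersection over Union of the maps binarized at `threshold` (assumed value)."""
    ma, mb = a >= threshold, b >= threshold
    union = np.logical_or(ma, mb).sum()
    return float(np.logical_and(ma, mb).sum() / union) if union else 1.0

def soft_iou(a, b):
    """Soft IoU without binarization (assumed definition: sum(min) / sum(max))."""
    denom = np.maximum(a, b).sum()
    return float(np.minimum(a, b).sum() / denom) if denom else 1.0

def mcc(a, b, threshold=0.5):
    """Matthews correlation coefficient, treating map `a` as the reference mask."""
    ma, mb = (a >= threshold).ravel(), (b >= threshold).ravel()
    tp = float(np.sum(ma & mb))
    tn = float(np.sum(~ma & ~mb))
    fp = float(np.sum(~ma & mb))
    fn = float(np.sum(ma & ~mb))
    denom = np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return float((tp * tn - fp * fn) / denom) if denom else 0.0

def kl_divergence(a, b, eps=1e-12):
    """KL divergence after turning each map into a probability distribution."""
    p = (a + eps) / (a + eps).sum()
    q = (b + eps) / (b + eps).sum()
    return float(np.sum(p * np.log(p / q)))
```

Each function expects two maps of the same shape, already scaled to [0, 1] with `normalize`. In practice the maps would come from two different explanation methods applied to the same image and model.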

Abstract

Many methods have been proposed for visualizing and interpreting the results of artificial intelligence (AI) algorithms. AI explainability (XAI) methods vary in mathematical basis, effectiveness, and scope of application. This raises an important question: how do their results differ from a statistical point of view, and are some of them more useful than others in certain scenarios? Our article aims to assess the robustness of the most popular visualization methods for explaining AI models and to identify differences in the results obtained. We did this by analyzing fundamental convolutional neural network models that classified 598 cat images from the Oxford-IIIT Pet dataset and 580 filtered pictures of Boeing planes from the Aircraft Images Dataset. We performed a comparative analysis of the similarities between methods based on Class Activation Mapping (CAM), gradients, and Local Interpretable Model-agnostic Explanations (LIME). To evaluate them, we used the Pearson Correlation Coefficient (CC), Matthews Correlation Coefficient (MCC), Spearman's rank correlation coefficient, Structural Similarity Index Measure (SSIM), Kullback–Leibler divergence, Intersection over Union (IoU), and Soft IoU. To check the fidelity and robustness of the XAI methods, we used RandomCAM and ran an ablation test, checking for a decrease in prediction confidence as we gradually removed the least significant regions. Our results provide an up-to-date and broad comparative analysis of this field. They can serve as a reference point for machine learning scientists and engineers.
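
The ablation test described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: `ablation_curve`, the `predict_confidence` callable, the removal fractions, and the choice to overwrite removed pixels with a constant `fill_value` are all hypothetical.

```python
import numpy as np

def ablation_curve(image, saliency, predict_confidence,
                   fractions=(0.0, 0.25, 0.5, 0.75), fill_value=0.0):
    """Return the model confidence after removing the least salient pixels.

    image: H x W or H x W x C array.
    saliency: H x W explanation map (higher means more important).
    predict_confidence: hypothetical callable mapping an image to a float,
        e.g. the softmax probability of the target class.
    """
    image = np.asarray(image, dtype=float)
    saliency = np.asarray(saliency, dtype=float)
    # Flat pixel indices sorted from least to most salient.
    order = np.argsort(saliency, axis=None, kind="stable")
    confidences = []
    for frac in fractions:
        k = int(round(frac * order.size))
        keep = np.ones(saliency.size, dtype=bool)
        keep[order[:k]] = False              # drop the k least salient pixels
        keep = keep.reshape(saliency.shape)
        ablated = image.copy()
        ablated[~keep] = fill_value          # overwrite removed pixels
        confidences.append(float(predict_confidence(ablated)))
    return confidences
```

With a faithful explanation, confidence would be expected to fall slowly while only low-saliency regions are removed. Running the same routine with a random saliency map gives a baseline curve to compare against, in the spirit of the RandomCAM check mentioned in the abstract.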
