Many methods have been proposed for visualizing and interpreting the results of artificial intelligence (AI) algorithms. Explainable AI (XAI) methods vary in mathematical basis, effectiveness, and scope of application. This raises an important question: how do their results differ from a statistical point of view, and are some of them more useful than others in certain scenarios? Our article aims to assess the robustness of the most popular visualization methods for explaining AI models and to identify differences in the results they produce. We did this by analyzing fundamental convolutional neural network models that classified 598 cat images from the Oxford-IIIT Pet dataset and 580 filtered pictures of Boeing planes from the Aircraft Images Dataset. We performed a comparative analysis of the similarities between methods based on Class Activation Mapping (CAM), gradients, and Local Interpretable Model-agnostic Explanations (LIME). To evaluate them, we used the Pearson Correlation Coefficient (CC), Matthews Correlation Coefficient (MCC), Spearman’s rank correlation coefficient, Structural Similarity Index Measure (SSIM), Kullback–Leibler divergence, Intersection over Union (IoU), and Soft IoU. To check the fidelity and robustness of the XAI methods, we used RandomCAM and ran an ablation test, checking for a decrease in prediction confidence as we gradually removed the least significant regions. Our results provide an up-to-date and broad comparative analysis of this field and can serve as a reference point for machine learning scientists and engineers.
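To make the similarity measures listed above concrete, the following is a minimal sketch of how four of them (Pearson CC, IoU, Soft IoU, and Kullback–Leibler divergence) could be computed for a pair of saliency maps. It is an illustration under stated assumptions, not the implementation used in the study: the function names, the 0.5 binarization threshold, the min/max form of Soft IoU, and the random toy maps are all hypothetical choices made for this example.

```python
import numpy as np


def normalize(saliency):
    """Rescale a saliency map to [0, 1]; a constant map becomes all zeros."""
    saliency = np.asarray(saliency, dtype=np.float64)
    value_range = saliency.max() - saliency.min()
    if value_range == 0:
        return np.zeros_like(saliency)
    return (saliency - saliency.min()) / value_range


def pearson_cc(a, b):
    """Pearson correlation coefficient between two flattened maps."""
    return float(np.corrcoef(a.ravel(), b.ravel())[0, 1])


def iou(a, b, threshold=0.5):
    """IoU of the maps after binarizing at `threshold` (assumed value)."""
    mask_a, mask_b = a >= threshold, b >= threshold
    union = np.logical_or(mask_a, mask_b).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as identical
    return float(np.logical_and(mask_a, mask_b).sum() / union)


def soft_iou(a, b):
    """Soft IoU as sum(min) / sum(max); one common variant, assumed here."""
    denominator = np.maximum(a, b).sum()
    if denominator == 0:
        return 1.0
    return float(np.minimum(a, b).sum() / denominator)


def kl_divergence(a, b, eps=1e-12):
    """KL(P || Q) after turning each map into a probability distribution."""
    p = a.ravel() + eps
    q = b.ravel() + eps
    p = p / p.sum()
    q = q / q.sum()
    return float(np.sum(p * np.log(p / q)))


# Toy data standing in for the outputs of two XAI methods on one image.
rng = np.random.default_rng(0)
map_a = normalize(rng.random((8, 8)))
map_b = normalize(map_a + 0.1 * rng.random((8, 8)))

print("Pearson CC:", pearson_cc(map_a, map_b))
print("IoU:", iou(map_a, map_b))
print("Soft IoU:", soft_iou(map_a, map_b))
print("KL divergence:", kl_divergence(map_a, map_b))
```

Both maps are normalized to the same range before comparison so that the threshold-based and distribution-based measures are computed on a common scale. Note that KL divergence is asymmetric, so the order of the two maps matters.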
Golec et al. (Mon,) studied this question.