What question did this study set out to answer?

The aim is to quantitatively evaluate gradient-based visual explainability methods across different deep learning architectures in medical imaging.

May 24, 2026 · Open Access

A Quantitative Evaluation of Gradient-Based Visual Explainability Methods Across Convolutional and Transformer-Based Vision Models

Key Points

The aim is to quantitatively evaluate gradient-based visual explainability methods across different deep learning architectures in medical imaging.
The evaluation covers VGG16, ResNet50, and ViT-B/16 on a brain MRI dataset.
Analyses include deletion-based faithfulness, sensitivity-to-noise evaluation, and feature masking validation.
Statistical hypothesis testing was performed over 30 independent runs.
All models demonstrated strong predictive performance with mean accuracy ≈ 0.99.
Explanation reliability varied substantially with method and architecture, and the differences in sensitivity were statistically significant.
Masking analysis indicated high false-positive rates in some configurations, raising concerns about the validity of visually plausible heatmaps.

Abstract

Explainable Artificial Intelligence (XAI) has become a critical requirement for the responsible deployment of deep learning systems in safety-critical and regulated domains, particularly in medical imaging. In computer vision, gradient-based explanation methods such as Saliency Maps and Gradient-weighted Class Activation Mapping (Grad-CAM) are widely used for interpreting convolutional neural networks (CNNs). However, the increasing adoption of Vision Transformers (ViTs) introduces structural differences in internal representations that challenge the direct transfer of convolutional explainability mechanisms. This study presents a systematic, quantitative, and statistically validated evaluation of gradient-based visual explainability across CNN architectures (VGG16 and ResNet50) and a Vision Transformer (ViT-B/16), using a domain-specific medical imaging dataset (brain MRI, tumor vs. non-tumor classification). Beyond qualitative heatmap inspection, we conduct deletion-based faithfulness analysis, sensitivity-to-noise evaluation, feature masking validation, and statistical hypothesis testing over 30 independent runs. All models achieve strong predictive performance on the domain dataset (mean accuracy ≈ 0.99), enabling a fair and meaningful comparison of explanation methods across architectures. Results demonstrate that explanation reliability is highly method- and architecture-dependent. Sensitivity differences are consistently statistically significant, whereas deletion-based faithfulness does not always yield equally strong separation under the adopted masking protocol. Masking-based analysis reveals substantial false-positive rates in certain configurations, indicating that visually plausible heatmaps do not necessarily isolate decision-necessary evidence.
These findings underscore the importance of coupling visual explanations with behavioral validation metrics, particularly in high-risk domains governed by emerging regulatory frameworks such as the EU AI Act. Overall, the study advocates for empirically validated, architecture-aware, and statistically grounded approaches to medical XAI.
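The deletion-based faithfulness analysis named in the abstract can be illustrated with a minimal sketch. The abstract does not describe the masking protocol, so everything here is an assumption made for illustration: `toy_model` is a hypothetical logistic scorer standing in for VGG16, ResNet50, or ViT-B/16, gradient-times-input stands in for Saliency Maps or Grad-CAM, masked pixels are set to a zero baseline, and `deletion_curve` and `curve_auc` are hypothetical helper names. The idea is that deleting the pixels an explanation ranks highest should lower the model score faster than deleting pixels in an uninformed order, which shows up as a smaller area under the deletion curve.

```python
import math
import random

def toy_model(x, w):
    """Hypothetical stand-in classifier: logistic score over a flattened 'image'."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def deletion_curve(x, w, attribution, steps=10, baseline=0.0):
    """Model scores recorded while deleting pixels from highest to lowest attribution."""
    order = sorted(range(len(x)), key=lambda i: attribution[i], reverse=True)
    masked = list(x)
    scores = [toy_model(masked, w)]          # score on the intact input
    chunk = max(1, len(x) // steps)
    for start in range(0, len(order), chunk):
        for i in order[start:start + chunk]:
            masked[i] = baseline             # assumed masking: replace with zero
        scores.append(toy_model(masked, w))
    return scores

def curve_auc(scores):
    """Trapezoidal area under the curve with the x-axis normalised to [0, 1]."""
    n = len(scores) - 1
    return sum((scores[k] + scores[k + 1]) / 2.0 for k in range(n)) / n

rng = random.Random(0)                       # fixed seed for repeatability
n_pixels = 100
w = [rng.uniform(0.0, 0.2) for _ in range(n_pixels)]   # synthetic weights
x = [rng.uniform(0.0, 1.0) for _ in range(n_pixels)]   # synthetic input

# Gradient-times-input for the linear logit: d(logit)/dx_i = w_i, times x_i.
informed = [wi * xi for wi, xi in zip(w, x)]
# Uninformed control: a random ranking of the same pixels.
uninformed = [rng.random() for _ in range(n_pixels)]

auc_informed = curve_auc(deletion_curve(x, w, informed))
auc_uninformed = curve_auc(deletion_curve(x, w, uninformed))
print("informed ranking is more faithful:", auc_informed < auc_uninformed)
```

In this toy setting the informed ranking always removes the largest share of the logit at each step, so its curve sits at or below the random one. With real networks no such guarantee exists, which is why the comparison has to be measured and tested statistically across repeated runs.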
