What question did this study set out to answer?

The study aims to improve the detection of adversarial examples in deep learning using explainable AI techniques.

April 13, 2026 · Open Access

Effective adversarial example detection with DeepSHAP summary

Key Points

Set out to improve the detection of adversarial examples in deep learning using explainable AI (XAI) techniques.
Proposed a framework that extracts decision logic from explanations and summarizes common critical neurons.
Used differences in critical neuron distributions to distinguish normal from adversarial examples.
Developed a best-layer selection strategy that picks the layer with the most common critical neurons for SHAP value detection.
Introduced an activation status detection approach that avoids computing SHAP values at runtime.
Achieved 81.89% accuracy with the activation status detection approach using the optimized parameter set.
Improved both detection accuracy and computational efficiency over previous layer-wise SHAP value detection.
Evaluated on CIFAR-10, MNIST, SVHN, CIFAR-100, Tiny ImageNet, and ImageNet against several state-of-the-art adversarial attacks.

Abstract

Explainable AI (XAI) techniques have been widely adopted to enhance the interpretability and reliability of deep learning applications. To extend that success to adversarial example detection, we propose a new framework to extract decision logic from explanations, leverage that information to summarize common critical neurons, and utilize their status to distinguish normal and adversarial examples. Our first approach uses decision logic for detection, demonstrating that differences in critical neuron distributions can be leveraged to distinguish normal and adversarial examples. We then propose a best-layer selection strategy to enhance previous layer-wise SHAP value detection. Selecting the layer with the most common critical neurons improves performance in terms of both accuracy and computational efficiency. These two approaches achieve high detection accuracy but require runtime computation of SHAP values. To avoid such runtime overhead, we further propose a new activation status detection approach where we show that using the activation status of common critical neurons offers lightweight yet effective detection. This efficacy extends to untrained attack detection. We conduct a comprehensive study on the CIFAR-10, MNIST, SVHN, CIFAR-100, Tiny ImageNet, and ImageNet datasets to evaluate the prediction accuracy, resource consumption, and transferability of the proposed approaches against several state-of-the-art adversarial attacks. The activation status approach achieves 81.89% accuracy with the optimized parameter set, demonstrating its effectiveness and efficiency in detecting adversarial examples in high-resolution data.
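To make the activation status idea more concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that per-sample attribution values (standing in for precomputed DeepSHAP values) are available for one layer, that "critical" means the top-k attributed neurons, that "common" means the neurons most frequently critical within a class, and that a neuron counts as active when its activation is positive. The function names, the `top_k`, `n_common`, and `threshold` parameters, and the synthetic data are all hypothetical.

```python
import numpy as np

def common_critical_neurons(attributions, labels, num_classes, top_k=5, n_common=5):
    """Per class, return the neurons that most often rank in the top_k attributions.

    `attributions` has shape (n_samples, n_neurons) and stands in for
    SHAP values computed offline (an assumption of this sketch).
    """
    n_neurons = attributions.shape[1]
    common = {}
    for c in range(num_classes):
        attr_c = attributions[labels == c]
        # Indices of the top_k highest-attribution neurons for every sample.
        top = np.argsort(-attr_c, axis=1)[:, :top_k]
        # How often each neuron appears among the critical neurons of this class.
        counts = np.bincount(top.ravel(), minlength=n_neurons)
        common[c] = np.sort(np.argsort(-counts, kind="stable")[:n_common])
    return common

def activation_status_score(activations, predicted_class, common):
    """Fraction of the predicted class's common critical neurons that are active (> 0)."""
    idx = common[predicted_class]
    return float(np.mean(activations[idx] > 0))

def is_adversarial(activations, predicted_class, common, threshold=0.5):
    """Flag an input when too few common critical neurons are active (hypothetical rule)."""
    return activation_status_score(activations, predicted_class, common) < threshold

# Synthetic data: class 0 relies on neurons 0-4, class 1 on neurons 5-9.
rng = np.random.default_rng(0)
n_per_class, n_neurons = 50, 20
labels = np.repeat([0, 1], n_per_class)
attributions = rng.uniform(0.0, 0.1, size=(2 * n_per_class, n_neurons))
attributions[labels == 0, 0:5] += 1.0
attributions[labels == 1, 5:10] += 1.0

# Offline step: summarize common critical neurons from the attributions.
common = common_critical_neurons(attributions, labels, num_classes=2)

# Runtime step: only activations are needed, no attribution computation.
normal = np.zeros(n_neurons)
normal[0:5] = 1.0        # predicted class 0, expected neurons active
suspicious = np.zeros(n_neurons)
suspicious[5:10] = 1.0   # predicted class 0, but class-1 neurons active

print(is_adversarial(normal, 0, common))      # False
print(is_adversarial(suspicious, 0, common))  # True
```

The point of the split is that the expensive attribution step runs once offline, while the runtime check reduces to indexing a handful of activations, which is the kind of overhead reduction the abstract describes. The threshold rule here is a placeholder; the paper's actual detector may combine activation statuses differently.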
