Explainable AI (XAI) techniques have been widely adopted to enhance the interpretability and reliability of deep learning applications. To extend that success to adversarial example detection, we propose a new framework that extracts decision logic from explanations, uses that information to summarize common critical neurons, and uses their status to distinguish normal from adversarial examples. Our first approach uses decision logic for detection, demonstrating that differences in critical neuron distributions separate the two kinds of input. We then propose a best-layer selection strategy to enhance previous layer-wise SHAP value detection: selecting the layer with the most common critical neurons improves both accuracy and computational efficiency. These two approaches achieve high detection accuracy but require computing SHAP values at runtime. To avoid this overhead, we further propose an activation status detection approach, showing that the activation status of common critical neurons offers lightweight yet effective detection. This efficacy extends to detecting attacks not seen during training. We conduct a comprehensive study on the CIFAR-10, MNIST, SVHN, CIFAR-100, Tiny ImageNet, and ImageNet datasets to evaluate the prediction accuracy, resource consumption, and transferability of the proposed approaches against several state-of-the-art adversarial attacks. The activation status approach achieves 81.89% accuracy with the optimized parameter set, demonstrating its effectiveness and efficiency in detecting adversarial examples in high-resolution data.
Lin et al. studied this question.
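To make the activation status idea in the abstract concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that per-neuron attribution scores (for example, SHAP values for one layer) have already been computed offline for normal examples of one class, that a neuron counts as "active" when its activation is positive, and that adversarial inputs activate fewer of the common critical neurons. The function names, the top-k selection rule, and the fixed threshold are all hypothetical choices made for illustration; the abstract does not specify them.

```python
from collections import Counter
from typing import List, Sequence


def top_k_neurons(attributions: Sequence[float], k: int) -> List[int]:
    """Indices of the k neurons with the largest absolute attribution."""
    order = sorted(range(len(attributions)),
                   key=lambda i: abs(attributions[i]), reverse=True)
    return order[:k]


def common_critical_neurons(attribution_rows: Sequence[Sequence[float]],
                            k: int, n_common: int) -> List[int]:
    """Neurons that appear most often among the per-example top-k sets."""
    counts = Counter()
    for row in attribution_rows:
        counts.update(top_k_neurons(row, k))
    return sorted(i for i, _ in counts.most_common(n_common))


def activation_status(activations: Sequence[float],
                      neurons: Sequence[int]) -> List[int]:
    """Binary status per common critical neuron (assumption: active means > 0)."""
    return [1 if activations[i] > 0 else 0 for i in neurons]


def is_adversarial(activations: Sequence[float], neurons: Sequence[int],
                   threshold: float) -> bool:
    """Flag an input when too few common critical neurons are active.

    Assumption: a simple threshold stands in for whatever detector the
    paper trains on the activation status.
    """
    if not neurons:
        raise ValueError("need at least one common critical neuron")
    status = activation_status(activations, neurons)
    return sum(status) / len(status) < threshold


# Toy data (hypothetical): attributions for three normal examples, six neurons.
normal_attr = [
    [0.9, 0.1, 0.8, 0.0, 0.05, 0.2],
    [0.7, 0.0, 0.9, 0.1, 0.3, 0.0],
    [0.8, 0.2, 0.6, 0.0, 0.0, 0.1],
]
neurons = common_critical_neurons(normal_attr, k=2, n_common=2)

# Layer activations at inference time; no attribution computation is needed here.
clean_act = [1.2, 0.0, 0.7, 0.0, 0.3, 0.0]
suspect_act = [0.0, 0.9, 0.0, 1.1, 0.0, 0.4]
print(neurons,
      is_adversarial(clean_act, neurons, 0.75),
      is_adversarial(suspect_act, neurons, 0.75))
```

The point of the sketch is the split between offline and runtime work: the attribution scores are used only once to select the common critical neurons, while the check at inference time reads activations alone, which is what makes this style of detection lightweight.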