Adversarial attacks involve malicious actors introducing intentional perturbations to the inputs of machine learning (ML) models, causing unintended behavior. This poses a significant threat to the integrity and trustworthiness of ML models and calls for robust detection techniques that protect systems from such threats. The paper proposes a new approach for detecting adversarial attacks using a surrogate model and diagnostic attributes. The method was tested on 22 tabular datasets, on which four different ML models were trained. Various attacks were then conducted against these models to obtain perturbed data. The proposed approach detects both known and unknown attacks with high effectiveness: balanced accuracy was above 0.94, with very low false negative rates (0.02–0.10) for binary detection. A sensitivity analysis shows that classifiers trained on diagnostic attributes can detect even very subtle adversarial attacks.
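The two figures of merit quoted above, balanced accuracy and false negative rate, are standard binary-classification metrics. The following is a minimal sketch of how they can be computed from a detector's predictions. It assumes that label 1 marks an adversarial (perturbed) sample and label 0 marks a clean one. The function name `detection_metrics` is hypothetical and does not come from the paper.

```python
from typing import Dict, Sequence


def detection_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Binary attack-detection metrics (hypothetical helper).

    Assumption: label 1 = adversarial (perturbed) sample, 0 = clean sample.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # attacks caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # attacks missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # clean accepted
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("y_true must contain both adversarial and clean samples")
    tpr = tp / (tp + fn)  # recall on adversarial samples
    tnr = tn / (tn + fp)  # recall on clean samples
    return {
        # Mean of per-class recalls; insensitive to class imbalance.
        "balanced_accuracy": (tpr + tnr) / 2,
        # Share of attacks the detector lets through (= 1 - tpr).
        "false_negative_rate": fn / (tp + fn),
    }


# Usage: 10 attacked and 10 clean samples; one attack missed, one false alarm.
y_true = [1] * 10 + [0] * 10
y_pred = [1] * 9 + [0] + [0] * 9 + [1]
print(detection_metrics(y_true, y_pred))
```

Balanced accuracy averages the recall on each class. A detector therefore cannot score well just by labeling everything "clean" when perturbed samples are rare. The false negative rate isolates the costlier error, an attack that goes undetected.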
Łukasz Wawrowski
Piotr Biczyk
Dominik Ślęzak
Machine Learning and Knowledge Extraction
University of Warsaw
University of Silesia in Katowice
Silesian University of Technology
Wawrowski et al. studied this question.
www.synapsesocial.com/papers/68e02f46f0e39f13e7fa2bca
DOI: https://doi.org/10.3390/make7040112