ABSTRACT
Deep learning models are known to be vulnerable to backdoor attacks in real‐world deployments. Existing defense techniques, however, face significant practical limitations at the inference stage: many require access to the training data or computationally expensive model retraining, which makes them infeasible for third‐party model users. Accurately detecting stealthy attacks also remains a challenge, because high detection recall often comes at the cost of a high false positive rate. Our approach is motivated by an observation about the intrinsic difference between benign and malicious inputs: although both may yield high prediction confidence, their influence on the model differs fundamentally. High‐confidence benign samples are typically well generalized and have low self‐influence, whereas backdoor samples rely on specific trigger patterns that significantly influence the model parameters. Building on this insight, we introduce a multi‐metric inference‐time detection framework. The method follows a coarse‐to‐fine strategy: an initial filter based on prediction confidence is followed by a finer analysis that uses a Self‐Influence Function (SIF) to quantify the impact of each individual sample. By jointly modeling the confidence distribution and the SIF‐derived features, we propose an adaptive thresholding mechanism based on balanced accuracy that separates clean samples from trigger‐embedded ones. We conduct a systematic evaluation against representative backdoor attacks. Experimental results show that the proposed approach substantially outperforms the baseline method. On the CIFAR‐10 dataset with a ResNet‐18 model under the Blend attack, our method achieves a detection AUC of 0.917 and maintains a true positive rate above 90% at a false positive rate below 0.1%.
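The coarse‐to‐fine decision rule and the balanced‐accuracy threshold selection described above can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes the per‐sample prediction confidences and self‐influence scores have already been computed, and the function names (`balanced_accuracy`, `pick_threshold`, `detect`), the parameter `conf_min`, and the exhaustive threshold scan are all hypothetical choices made for illustration.

```python
from typing import List, Sequence, Tuple


def balanced_accuracy(pred: Sequence[bool], truth: Sequence[bool]) -> float:
    """Mean of the true positive rate and the true negative rate."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    tn = sum(1 for p, t in zip(pred, truth) if not p and not t)
    pos = sum(1 for t in truth if t)
    neg = len(truth) - pos
    tpr = tp / pos if pos else 0.0
    tnr = tn / neg if neg else 0.0
    return 0.5 * (tpr + tnr)


def pick_threshold(scores: Sequence[float],
                   truth: Sequence[bool]) -> Tuple[float, float]:
    """Scan every observed score as a candidate threshold (assumed strategy).

    A sample is flagged when score >= threshold. Returns the threshold with
    the highest balanced accuracy, together with that balanced accuracy.
    """
    best_t, best_ba = float("inf"), -1.0
    for t in sorted(set(scores)):
        ba = balanced_accuracy([s >= t for s in scores], truth)
        if ba > best_ba:
            best_t, best_ba = t, ba
    return best_t, best_ba


def detect(confidences: Sequence[float],
           self_influences: Sequence[float],
           conf_min: float,
           sif_threshold: float) -> List[bool]:
    """Coarse-to-fine rule: keep only high-confidence samples (coarse step),
    then flag those whose self-influence reaches the threshold (fine step)."""
    return [c >= conf_min and s >= sif_threshold
            for c, s in zip(confidences, self_influences)]


if __name__ == "__main__":
    # Toy calibration scores (made up): label True marks a trigger-embedded sample.
    sif = [0.1, 0.2, 0.3, 0.8, 0.9]
    labels = [False, False, False, True, True]
    threshold, ba = pick_threshold(sif, labels)
    print(threshold, ba)  # 0.8 1.0
    print(detect([0.99, 0.5, 0.97], [0.9, 0.9, 0.1], 0.9, threshold))
    # [True, False, False]
```

Selecting the threshold on balanced accuracy rather than plain accuracy weights the clean and trigger‐embedded classes equally, which matters when poisoned inputs are rare at inference time. The sketch needs a small labeled calibration set to pick the threshold; how such a set is obtained is outside what this example covers.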
Chen et al. (Fri,) studied this question.