What question did this study set out to answer?

This research aims to enhance backdoor detection in deep learning models during the inference stage.

May 7, 2026

Disentangling Malicious Memorization: Inference‐Time Backdoor Samples Detection via Self‐Influence Functions

Key Points

Developed a multi-metric inference-time detection framework.
Utilized prediction confidence for initial filtering of samples.
Analyzed sample influence using Self-Influence Functions.
Implemented adaptive thresholding based on balanced accuracy.
Achieved a detection AUC of 0.917 on CIFAR-10 with a ResNet-18 model under the Blend attack.
Maintained a true positive rate above 90% at a false positive rate below 0.1% in the same setting.
Substantially outperformed the baseline method.
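
The adaptive thresholding step can be illustrated with a minimal sketch. It assumes that each sample has already been reduced to a single detection score (higher means more suspicious) and that a small labeled calibration set is available; neither detail is stated in this summary, and the function names are hypothetical rather than taken from the paper.

```python
def balanced_accuracy(scores, labels, threshold):
    """Balanced accuracy when samples with score >= threshold are flagged.

    labels use 1 for backdoor samples and 0 for clean samples (an assumption).
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("calibration set needs both clean and backdoor samples")
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    # Mean of the true positive rate and the true negative rate.
    return 0.5 * (tp / pos + tn / neg)


def pick_threshold(scores, labels):
    """Return the candidate threshold with the highest balanced accuracy.

    Candidates are the observed scores; ties go to the lowest threshold.
    """
    candidates = sorted(set(scores))
    return max(candidates, key=lambda t: balanced_accuracy(scores, labels, t))


# Toy calibration data, invented for illustration only.
scores = [0.1, 0.2, 0.3, 0.8, 0.9]
labels = [0, 0, 0, 1, 1]
best = pick_threshold(scores, labels)
```

Balanced accuracy weights clean and backdoor samples equally, which matters when backdoor samples are rare in the inference stream and plain accuracy would reward flagging nothing.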

Abstract

Deep learning models have been proven to be vulnerable to backdoor attacks in real‐world deployments. However, existing defense techniques face significant practical limitations during the inference stage: many require inaccessible training data or computationally expensive model retraining, rendering them infeasible for third‐party model users. Furthermore, accurately detecting stealthy attacks remains a challenge, as maintaining high detection recall often comes at the cost of high false positive rates. Our approach is motivated by a critical observation regarding the intrinsic difference between benign and malicious inputs: while both may yield high prediction confidence, their influence on the model differs fundamentally. High‐confidence benign samples are typically well‐generalized with low self‐influence, whereas backdoor samples rely on specific trigger patterns that significantly influence model parameters. Leveraging this insight, this paper introduces a novel multi‐metric inference‐time detection framework. The method employs a coarse‐to‐fine strategy: it begins with an initial filtering based on prediction confidence, followed by a rigorous analysis using a Self‐Influence Function (SIF) to quantify the specific impact of each sample. By jointly modeling the confidence distribution and SIF‐derived features, we propose an adaptive thresholding mechanism based on balanced accuracy to precisely distinguish clean samples from trigger‐embedded ones. We conduct a systematic evaluation against representative backdoor attacks. Experimental results demonstrate that our proposed approach substantially outperforms the baseline method. On the CIFAR‐10 dataset with a ResNet‐18 model under the Blend attack, our method achieves a detection AUC of 0.917 and maintains a true positive rate above 90% at a false positive rate below 0.1%.
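
To make the notion of self-influence concrete, the following is a minimal sketch of the classical influence-function form, g^T H^{-1} g, where g is the loss gradient at a sample and H is a damped Hessian of the loss. The abstract does not give the paper's exact SIF definition, so everything here is an assumption for illustration: the two-parameter logistic model, the reference inputs used to estimate H, the damping value, and the choice of label passed in at inference time.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def grad(w, x, y):
    """Gradient of the logistic loss at (x, y) with respect to the weights w."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
    return [(p - y) * xi for xi in x]


def damped_hessian(w, reference, damping=0.01):
    """Mean logistic-loss Hessian over reference inputs, plus damping * I.

    Restricted to 2 parameters so the inverse can be written in closed form.
    """
    h = [[damping, 0.0], [0.0, damping]]
    for x in reference:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        c = p * (1.0 - p) / len(reference)
        for i in range(2):
            for j in range(2):
                h[i][j] += c * x[i] * x[j]
    return h


def self_influence(w, x, y, h):
    """Compute g^T H^{-1} g for one sample, using the 2x2 adjugate for H^{-1}."""
    g = grad(w, x, y)
    det = h[0][0] * h[1][1] - h[0][1] * h[1][0]
    u0 = (h[1][1] * g[0] - h[0][1] * g[1]) / det
    u1 = (-h[1][0] * g[0] + h[0][0] * g[1]) / det
    return g[0] * u0 + g[1] * u1


# Toy model and reference inputs, invented for illustration only.
w = [1.0, 0.0]
reference = [[1.0, 0.5], [-1.0, 0.2], [0.5, -1.0], [2.0, 1.0]]
h = damped_hessian(w, reference)

x = [2.0, 0.0]
fits_model = self_influence(w, x, 1, h)    # label agrees with the prediction
against_model = self_influence(w, x, 0, h)  # label contradicts the prediction
```

In this toy setting, the sample whose label contradicts the model produces a much larger score than the one the model already fits. For a deep network the Hessian inverse is usually approximated rather than computed exactly, and how the paper handles that step is not described in this summary.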
