May 24, 2026 | Open Access

Towards adversarial defense with receptive field enhancement fusion and denoising U-Net for visual tracking

Key Points

The aim is to enhance the robustness of visual tracking models against adversarial attacks using a new denoising U-Net architecture.
Proposed REF-DUNet, a U-Net-based architecture with a multi-branch receptive field enhancement fusion module.
Applied mean squared error loss and perceptual loss for collaborative optimization during training.
Tested the REF-DUNet defense on two tracking models across four datasets, against both white- and black-box attacks.
Significantly improved the robustness of trackers against adversarial attacks compared with the baseline.
Restored tracking performance in perturbed environments, as shown by the experimental results.
Demonstrated versatility across four datasets, indicating broader applicability.
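
The joint objective named in the key points (mean squared error plus perceptual loss) can be illustrated with a small framework-free sketch. Everything specific here is an assumption made for illustration: the `toy_features` extractor stands in for the pretrained network a perceptual loss would normally use, and the `weight` of 0.1 is arbitrary, not a value taken from the paper.

```python
def mse(a, b):
    """Mean squared error between two equally sized 2-D grids (lists of rows)."""
    flat_a = [v for row in a for v in row]
    flat_b = [v for row in b for v in row]
    if len(flat_a) != len(flat_b):
        raise ValueError("inputs must have the same number of elements")
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)


def toy_features(img):
    """Hypothetical feature extractor: differences between horizontal neighbours.

    A real perceptual loss would compare activations of a pretrained network;
    this stand-in only keeps the example self-contained.
    """
    return [[row[i + 1] - row[i] for i in range(len(row) - 1)] for row in img]


def joint_loss(denoised, clean, features=toy_features, weight=0.1):
    """Pixel-space MSE plus a weighted feature-space MSE (assumed weighting)."""
    pixel_term = mse(denoised, clean)                             # low-level term
    perceptual_term = mse(features(denoised), features(clean))    # high-level term
    return pixel_term + weight * perceptual_term
```

The pixel term penalizes per-pixel deviation from the clean image, while the feature term penalizes differences in structure; how the two are actually balanced in REF-DUNet is not stated in this summary.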

Abstract

Existing visual tracking models are generally not robust to adversarial attacks, especially in complex multi-frame scenarios where even small perturbations can bias the model over successive frames and severely degrade tracking performance. To address this issue, this paper proposes REF-DUNet, an adversarial defense method for visual tracking that combines receptive field enhancement fusion with a denoising U-Net, aimed at making the tracker more robust in perturbed environments and mitigating the interference caused by adversarial attacks. Built on a U-shaped encoder-decoder architecture, the method introduces a multi-branch receptive field enhancement fusion module that fuses standard, asymmetric, and dilated convolutions in parallel, improving its ability to learn and preserve features under multi-scale adversarial perturbations. To improve the structural integrity and semantic consistency of the denoised image, REF-DUNet is trained jointly with mean squared error loss and perceptual loss, optimizing in both low-level and high-level feature spaces. Notably, REF-DUNet does not require access to the network architecture or gradient information of the target tracker, which makes it general, tracker-independent, and adaptable across models. We apply the REF-DUNet defense to two representative trackers and conduct experiments on four well-known datasets, defending against both white- and black-box attacks. The results show that our method significantly enhances the robustness of trackers under adversarial attacks and can effectively restore tracking performance.
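
To make the multi-branch idea concrete, the sketch below computes the spatial footprint each kind of branch would cover, using the standard relation that a kernel of size k with dilation d spans d(k - 1) + 1 pixels. The branch names, kernel sizes, and dilation rates are hypothetical choices for illustration; the abstract does not report the actual configuration.

```python
def effective_kernel(kernel, dilation=(1, 1)):
    """Spatial extent (height, width) covered by one convolution kernel.

    A dilation d spreads k taps over d * (k - 1) + 1 pixels.
    """
    (kh, kw), (dh, dw) = kernel, dilation
    return (dh * (kh - 1) + 1, dw * (kw - 1) + 1)


def same_padding(kernel, dilation=(1, 1)):
    """Padding that preserves spatial size at stride 1 for odd extents,
    so that the outputs of parallel branches line up and can be fused."""
    eh, ew = effective_kernel(kernel, dilation)
    return ((eh - 1) // 2, (ew - 1) // 2)


# Hypothetical branch configuration (assumed values, not from the paper).
BRANCHES = {
    "standard_3x3": {"kernel": (3, 3), "dilation": (1, 1)},
    "asymmetric_1x3": {"kernel": (1, 3), "dilation": (1, 1)},
    "asymmetric_3x1": {"kernel": (3, 1), "dilation": (1, 1)},
    "dilated_3x3_d2": {"kernel": (3, 3), "dilation": (2, 2)},
    "dilated_3x3_d3": {"kernel": (3, 3), "dilation": (3, 3)},
}


def branch_footprints(branches=BRANCHES):
    """Map each branch name to the (height, width) extent it covers."""
    return {
        name: effective_kernel(cfg["kernel"], cfg["dilation"])
        for name, cfg in branches.items()
    }
```

Under these assumed settings the branches cover extents from 1x3 up to 7x7 with the same number of weights per 3x3 kernel, which is one plausible reading of how parallel branches could respond to perturbations at several scales.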