Open-vocabulary object detection in remote sensing aims to identify and localize unseen categories in remote sensing imagery. Remote sensing scenes, however, are characterized by dense target distribution, complex background interference, and drastic scale variation, so existing methods are prone to background noise when extracting features from dense, small-target regions. This weakens semantic representation and reduces localization accuracy. We propose RS-DINO to address these challenges. First, to keep small-target features from being obscured by the background, the feature extraction module incorporates a multi-scale large-kernel attention mechanism that expands the receptive field while enhancing local detail modelling, improving the feature representation of minute targets. Second, a cross-modal feature fusion module uses bidirectional cross-attention to achieve deep alignment between image and text features. Third, a language-guided query selection mechanism improves detection accuracy through hybrid query strategies. Finally, to enhance the spatial sensitivity and channel adaptability of the fused features, the multimodal decoder integrates a convolutional gated feed-forward network, boosting the model's robustness in dense, multi-scale scenes. Experiments on DIOR, DOTA v2.0, and NWPU-VHR10 demonstrate substantial gains, with fine-tuned RS-DINO surpassing existing methods by 3.5%, 3.7%, and 4.0% in accuracy, respectively.
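The language-guided query selection step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give RS-DINO's exact formulation, so this assumes the common approach of scoring each image token by its maximum dot-product similarity to any text token and keeping the top-scoring tokens as decoder queries. The function names, the toy feature vectors, and the `num_queries` parameter are all hypothetical, and the hybrid query strategy is not modelled here.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def select_queries(
    image_feats: Sequence[Sequence[float]],
    text_feats: Sequence[Sequence[float]],
    num_queries: int,
) -> List[int]:
    """Return indices of the image tokens most similar to any text token.

    Assumed scoring rule (not taken from the paper): each image token is
    scored by its maximum similarity over all text tokens, and the
    `num_queries` highest-scoring tokens are kept.
    """
    scores = [max(dot(img, txt) for txt in text_feats) for img in image_feats]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_queries]


# Toy example: four image tokens and two text tokens in a 2-D feature space.
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
text_feats = [[1.0, 0.0], [0.0, 0.2]]
selected = select_queries(image_feats, text_feats, num_queries=2)
print(selected)
```

In this toy setup the first and third image tokens align best with the text features, so they would be the ones passed on as queries. A real detector would run the same selection on learned, batched tensors.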
Menghan Ju
Yingchao Feng
W. Diao
Remote Sensing
Chinese Academy of Sciences
University of Chinese Academy of Sciences
Target (United States)
Ju et al. studied this question.
www.synapsesocial.com/papers/69b3ac1d02a1e69014ccd84a — DOI: https://doi.org/10.3390/rs18060851