What question did this study set out to answer?

The research aims to enhance dense small-object detection in remote sensing imagery using an open-vocabulary approach.

March 13, 2026 · Open Access

Addressing Dense Small-Object Detection in Remote Sensing: An Open-Vocabulary Object Detection Framework

Key Points

Developed the RS-DINO framework for open-vocabulary object detection in remote sensing imagery.
Incorporated multi-scale large-kernel attention into feature extraction to better represent small targets.
Implemented cross-modal feature fusion with bidirectional cross-attention to align image and text features.
Applied language-guided query selection with hybrid query strategies to improve detection accuracy.
Integrated a convolutional gated feedforward network into the multimodal decoder to refine the fused features.
Fine-tuned RS-DINO surpassed existing methods by 3.5% in accuracy on the DIOR dataset.
Surpassed existing methods by 3.7% in accuracy on the DOTA v2.0 dataset.
Surpassed existing methods by 4.0% in accuracy on the NWPU-VHR10 dataset.
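
To make the language-guided query selection point concrete, the sketch below shows one common formulation of the idea (used, for example, in Grounding DINO): each image token is scored by its best dot-product match against the text tokens, and the top-scoring tokens seed the decoder queries. This is an illustrative assumption, not RS-DINO's published method; its hybrid query strategy is not specified here, and the function name `select_queries` and the toy feature values are hypothetical.

```python
def select_queries(image_feats, text_feats, num_queries):
    """Return indices of the image tokens most similar to any text token.

    Assumed formulation (not taken from the paper): score = max over text
    tokens of the image-text dot product, then keep the top `num_queries`.
    """
    scores = []
    for img_vec in image_feats:
        # Best dot-product match of this image token over all text tokens.
        best = max(
            sum(a * b for a, b in zip(img_vec, txt_vec))
            for txt_vec in text_feats
        )
        scores.append(best)
    # Rank tokens by descending score; the sort is stable, so ties keep
    # the lower index first.
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return ranked[:num_queries]


# Toy example: 4 image tokens and 2 text tokens, each 2-dimensional.
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
text_feats = [[1.0, 0.0], [0.9, 0.1]]
selected = select_queries(image_feats, text_feats, num_queries=2)
print(selected)  # [0, 2]
```

In this toy case the first and third image tokens align best with the text, so they would be the ones passed on as query positions.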

Abstract

Remote sensing open-vocabulary object detection focuses on identifying and localizing unseen categories within remote sensing imagery. However, remote sensing scenes are characterized by dense target distribution, complex background interference, and drastic scale variations, so existing methods are prone to background noise when extracting features from dense, small-target regions. This weakens semantic representation and reduces localization accuracy. We propose RS-DINO to address these challenges. First, to keep small-object features from being obscured by the background, the feature extraction module incorporates a multi-scale large-kernel attention mechanism, which expands the receptive field while enhancing local detail modelling and significantly improves the feature representation of minute targets. Second, a cross-modal feature fusion module employing bidirectional cross-attention achieves deep alignment between image and textual features. Third, a language-guided query selection mechanism enhances detection accuracy through hybrid query strategies. Finally, to enhance the spatial sensitivity and channel adaptability of the fused features, the multimodal decoder integrates a convolutional gated feedforward network, significantly boosting the model’s robustness in dense, multi-scale scenes. Experiments on DIOR, DOTA v2.0, and NWPU-VHR10 demonstrate substantial gains, with fine-tuned RS-DINO surpassing existing methods by 3.5%, 3.7%, and 4.0% in accuracy, respectively.
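
The bidirectional cross-attention named in the abstract can be illustrated with a minimal sketch. It assumes plain scaled dot-product attention with no learned projections, a single head, and a residual connection; the paper's actual module is not described at this level of detail, so the function names `cross_attend` and `bidirectional_fusion` and all values are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Each query becomes a softmax-weighted average of `keys_values`.

    Simplification (assumption): keys and values are the same vectors and
    no learned projection matrices are applied.
    """
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, k)) / scale for k in keys_values]
        weights = softmax(logits)
        out.append([
            sum(w * k[d] for w, k in zip(weights, keys_values))
            for d in range(len(q))
        ])
    return out


def bidirectional_fusion(image_feats, text_feats):
    """Image attends to text and text attends to image, each with a residual."""
    # Both directions read the original (un-fused) features.
    img_update = cross_attend(image_feats, text_feats)
    txt_update = cross_attend(text_feats, image_feats)
    fused_img = [[x + u for x, u in zip(row, upd)]
                 for row, upd in zip(image_feats, img_update)]
    fused_txt = [[x + u for x, u in zip(row, upd)]
                 for row, upd in zip(text_feats, txt_update)]
    return fused_img, fused_txt


# Toy example: 3 image tokens and 2 text tokens, each 2-dimensional.
fused_img, fused_txt = bidirectional_fusion(
    [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],
    [[1.0, 0.0], [0.0, 2.0]],
)
```

Both outputs keep the shape of their inputs, so the fused image and text features can be passed to later stages in place of the originals.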

Authors

Menghan Ju

Yingchao Feng

W. Diao

Journals

Remote Sensing

Institutions

Chinese Academy of Sciences

University of Chinese Academy of Sciences

Target (United States)
