Visual grounding (VG) is essential for human-computer interaction in object detection tasks. Most current VG methods focus on grounding target objects in natural images with simple language expressions. They do not generalize well to remote sensing images, where the target object covers only a small fraction of the scene (e.g., 0.34%) and the language expression is complex. To address these challenges, we propose a regionally indicated network (RINet) for remote sensing VG. RINet first uses DarkNet-53 and BERT to extract visual and language features, respectively. These features are fed into a regional indication generator (RIG), which produces an initial indication map giving the likelihood that each region contains the target object. The indication map is then refined with a high-resolution detailed feature through a comprehensive alignment module (CAM) and a correction gate (CG). Within CAM, a word contribution learner evaluates the importance of each word and directs more attention to words that were previously overlooked. This refinement is repeated for several rounds so that the complex language information is fully exploited and the region containing the target object is located more accurately. Finally, a detection head grounds the target object. We evaluate the model on two public remote sensing datasets, RSVG and DIOR-RSVG. The experimental results show that RINet significantly outperforms several state-of-the-art models, which validates its effectiveness. The source code will be released at https://github.com/KevinDaldry/RINet.
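The data flow described in the abstract (an initial indication map, then several rounds of language-guided correction) can be illustrated with a small NumPy sketch. This is not the authors' implementation. Random arrays stand in for DarkNet-53 and BERT features, and the cosine-similarity scoring, the fixed blending gate, the penalty on already-attended words, and all function names (`region_scores`, `ground`, `peak_region`) are assumptions made only for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def region_scores(visual, query):
    """Cosine similarity between every region feature and a query vector.

    visual: (H, W, C) feature grid, query: (C,) language vector -> (H, W).
    """
    v = visual / (np.linalg.norm(visual, axis=-1, keepdims=True) + 1e-8)
    q = query / (np.linalg.norm(query) + 1e-8)
    return v @ q


def ground(coarse, detail, words, rounds=3, gate=0.5, seed=0):
    """Toy indication-map pipeline (hypothetical, not the published RINet).

    coarse: (h, w, C) low-resolution visual features
    detail: (k*h, k*w, C) high-resolution visual features
    words:  (N, C) per-word language features
    """
    rng = np.random.default_rng(seed)
    # Stand-in for learned word-scoring parameters.
    scorer = rng.standard_normal(words.shape[1])

    # Initial indication map from coarse features and the mean word feature.
    ind = sigmoid(region_scores(coarse, words.mean(axis=0)))

    # Upsample the map to the high-resolution grid by nearest neighbour.
    k = detail.shape[0] // coarse.shape[0]
    ind = np.repeat(np.repeat(ind, k, axis=0), k, axis=1)

    attended = np.zeros(len(words))
    for _ in range(rounds):
        # Assumed word weighting: penalise words that earlier rounds
        # already attended to, so overlooked words gain weight.
        weights = softmax(words @ scorer - attended)
        attended += weights
        query = weights @ words

        # Evidence from the detailed feature, blended in by a fixed gate.
        evidence = sigmoid(region_scores(detail, query))
        ind = (1.0 - gate) * ind + gate * evidence
    return ind


def peak_region(ind):
    """Row and column of the most likely region."""
    return np.unravel_index(np.argmax(ind), ind.shape)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    coarse = rng.standard_normal((4, 4, 8))
    detail = rng.standard_normal((8, 8, 8))
    words = rng.standard_normal((5, 8))
    ind_map = ground(coarse, detail, words)
    print(ind_map.shape, peak_region(ind_map))
```

In the published model the gate and the word weights would be learned, and a detection head would regress a bounding box instead of taking the peak cell. The sketch only shows how a coarse map can be corrected repeatedly while the attention over words shifts between rounds.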
Hang et al. (Mon,) studied this question.