Oriented object detection (OOD) has advanced rapidly in recent years. However, existing methods still perform poorly in challenging scenarios, especially scenes containing small-scale objects or objects with extreme aspect ratios. Inspired by recent advances in vision-language pre-training, we propose a novel Text-Guided Dual-Awareness Network (TG-DANet), which addresses these challenges from two complementary perspectives: robust feature interaction for multi-scale and long-range context modeling, and semantic-aware feature learning through textual guidance. Specifically, we design a Bi-Directional Feature Interaction Module (BDFIM) that captures horizontal and vertical contextual features via spatial interactions, improving the representation of small and elongated objects. Additionally, a Text-Semantic Guided Framework (TSGF) aligns and fuses textual embeddings with visual features at multiple levels, enhancing model interpretability and discriminability for objects with ambiguous appearances or complex layouts. Extensive experiments on three benchmark datasets (DOTA, DIOR-R, and HRSC2016) show that TG-DANet improves mAP over baseline methods by 3.05%, 3.49%, and 2.32%, respectively. These results demonstrate the effectiveness of our dual-perspective strategy in handling complex scenes with cluttered backgrounds and multi-scale objects, and highlight the potential of vision-language fusion for oriented object detection.
Han et al. (Thu,) studied this question.