Recently, with the development of Vision-Language Models (VLMs), adopting a VLM (e.g., CLIP) in an object detection framework has become a promising and attractive research direction. The resulting open vocabulary object detection methods can effectively alleviate the limitations of closed-set detectors, enabling detectors to perceive the unseen world. The core issue in open vocabulary object detection is to design an effective and efficient alignment between visual (e.g., image) and textual (e.g., caption) features in the semantic space, so that detectors can capture more information about open-set scenes. Current approaches use extra uncurated image-text pairs to pre-train a detector and thereby obtain a better visual-textual alignment in the feature space. In addition, knowledge distillation is adopted to design an appropriate information flow for aligning visual and textual knowledge. However, large-scale image-text pairs are not always available, and the pre-training process inevitably introduces considerable computational overhead. Knowledge distillation methods, meanwhile, focus on aligning the local region-level visual features in RoIs with the textual features of the VLM, neglecting the global alignment between image and text. To address the limitations of these alignment schemes, we propose a Global and Local Visual-Textual Alignment for Open Vocabulary Object Detection in this paper. Specifically, our proposed method integrates global image-caption and local region-prompt alignments into a unified learning paradigm. The global alignment takes the whole image and the caption as the visual and textual inputs, respectively, and matches the image representation from the detector with the caption representation from the CLIP text encoder through contrastive learning, from a holistic perspective.
Unlike the global alignment, the local alignment concentrates on the correspondence between regions and prompts at the level of partial descriptions. It extracts and aligns the embeddings of the visual patch RoIs from the CLIP image encoder with those of discriminative token prompts from the text encoder. Moreover, we design a prompt tuning strategy, with global and local components corresponding to the two alignment procedures, to better adapt CLIP to the downstream task of object detection in a parameter-efficient manner. Implementing our method on Faster R-CNN, we conduct experiments on the open vocabulary benchmarks OV-COCO and OV-LVIS. The results verify that our proposed method achieves clear improvements over its counterparts on novel categories, while performing favorably against state-of-the-art methods.
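The two alignment steps described above can be illustrated with a minimal sketch. It assumes a CLIP-style symmetric contrastive (InfoNCE) loss for the global image-caption alignment and a cosine-similarity match between region and prompt embeddings for the local alignment. The function names, the temperature values, and the toy embeddings are hypothetical and are not taken from the paper, whose exact loss formulation is not given here.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(visual, textual, temperature):
    """Temperature-scaled cosine similarity between every visual and textual embedding."""
    vis = [l2_normalize(v) for v in visual]
    txt = [l2_normalize(t) for t in textual]
    return [[sum(a * b for a, b in zip(v, t)) / temperature for t in txt] for v in vis]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def global_contrastive_loss(image_embs, caption_embs, temperature=0.07):
    """Assumed global alignment loss: symmetric InfoNCE over a batch.

    The i-th image and the i-th caption form the positive pair; all other
    pairings in the batch act as negatives. The temperature is an assumption.
    """
    logits = similarity_matrix(image_embs, caption_embs, temperature)
    n = len(logits)
    # Image-to-caption direction: softmax over each row.
    img_to_txt = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Caption-to-image direction: softmax over each column.
    cols = [[logits[i][j] for i in range(n)] for j in range(n)]
    txt_to_img = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (img_to_txt + txt_to_img)


def classify_regions(region_embs, prompt_embs, temperature=0.01):
    """Assumed local alignment at inference: pick the most similar category prompt per region."""
    logits = similarity_matrix(region_embs, prompt_embs, temperature)
    return [max(range(len(row)), key=row.__getitem__) for row in logits]


# Toy 2-D embeddings (hypothetical values, for illustration only).
images = [[1.0, 0.0], [0.0, 1.0]]
captions = [[1.0, 0.0], [0.0, 1.0]]
matched_loss = global_contrastive_loss(images, captions)
mismatched_loss = global_contrastive_loss(images, captions[::-1])

regions = [[0.9, 0.1], [0.2, 0.8]]
prompts = [[1.0, 0.0], [0.0, 1.0]]
region_labels = classify_regions(regions, prompts)
```

In this sketch the loss is small when each image embedding lies closest to its own caption and grows when the pairs are shuffled, which is the behavior a global alignment objective of this kind is meant to encourage. Because the region classifier only compares embeddings with prompt embeddings, novel categories can in principle be added by supplying new prompts.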
Wang et al. (Thu,) studied this question.