March 24, 2024Open Access

Multi-Modal Prompting for Open-Vocabulary Video Visual Relationship Detection

Key Points

Key points are not available for this paper at this time.

Abstract

Open-vocabulary video visual relationship detection aims to extend video visual relationship detection beyond annotated categories by detecting unseen relationships between objects in videos. Recent progresses in open-vocabulary perception, primarily driven by large-scale image-text pre-trained models like CLIP, have shown remarkable success in recognizing novel objects and semantic categories. However, directly applying CLIP-like models to video visual relationship detection encounters significant challenges due to the substantial gap between images and video object relationships. To address this challenge, we propose a multi-modal prompting method that adapts CLIP well to open-vocabulary video visual relationship detection by prompt-tuning on both visual representation and language input. Specifically, we enhance the image encoder of CLIP by using spatio-temporal visual prompting to capture spatio-temporal contexts, thereby making it suitable for object-level relationship representation in videos. Furthermore, we propose visual-guided language prompting to leverage CLIP's comprehensive semantic knowledge for discovering unseen relationship categories, thus facilitating recognizing novel video relationships. Extensive experiments on two public datasets, VidVRD and VidOR, demonstrate the effectiveness of our method, especially achieving a significant gain of nearly 10% in mAP on novel relationship categories on the VidVRD dataset.

Read Full Paperexternally

اسأل الذكاء الاصطناعي

Bookmark

View Full Paper

Cite This Study

Yang et al. (Sun,) studied this question.

synapsesocial.com/papers/68e7296db6db6435876a399b https://doi.org/https://doi.org/10.1609/aaai.v38i7.28472

Also Consider

Synapse has enriched 5 closely related papers on similar clinical questions. Consider them for comparative context:

اسأل الذكاء الاصطناعي

Bookmark

View Full Paper