Key points are not available for this paper at this time.
This paper deals with Video Moment Retrieval (VMR) in a weakly-supervised fashion, which aims to retrieve local video clips with only global video-level descriptions. Scrutinizing the recent advances in VMR, we find that the fully-supervised models achieve strong performance, but they are heavily relied on the precise temporal annotations. Weakly-supervised methods do not rely on temporal annotations, however, their performance is much weaker than the fully-supervised ones. To fill such gap, we propose to take advantage of a pretrained video-text model as hitchhiker to generate pseudo temporal labels. The pseudo temporal labels, together with the descriptive labels, are then utilized to guide the training of the proposed VMR model. The proposed Location-irrelevant Proposal Learning (LPL) model is based on a pretrained video-text model with cross-modal prompt learning, together with different strategies to generate reasonable proposals with various lengths. Despite the simplicity, we find that our method performs much better than the previous state-of-the-art methods on standard benchmarks, eg., +4.4% and +1.4% in mIoU on the Charades and ActivityNet-Caption datasets respectively, which benefits from training with fine-grained video-text pairs. Further experiments on two synthetic datasets with shuffled temporal location and longer video length demonstrate our model's robustness towards temporal localization bias as well as its strength in handling long video sequences.
Ji et al. (Sun,) studied this question.