August 9, 2019 · Open Access

Abstract

We address the problem of cross-modal fine-grained action retrieval between text and video. Cross-modal retrieval is commonly achieved through learning a shared embedding space that can indifferently embed modalities. In this paper, we propose to enrich the embedding by disentangling parts-of-speech (PoS) in the accompanying captions. We build a separate multi-modal embedding space for each PoS tag. The outputs of multiple PoS embeddings are then used as input to an integrated multi-modal space, where we perform action retrieval. All embeddings are trained jointly through a combination of PoS-aware and PoS-agnostic losses. Our proposal enables learning specialised embedding spaces that offer multiple views of the same embedded entities. We report the first retrieval results on fine-grained actions for the large-scale EPIC dataset, in a generalised zero-shot setting. Results show the advantage of our approach for both video-to-text and text-to-video action retrieval. We also demonstrate the benefit of disentangling the PoS for the task of cross-modal video retrieval on the MSR-VTT dataset.
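
The pipeline described in the abstract (one embedding space per PoS tag, whose outputs feed an integrated space used for retrieval) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the choice of two tags (verb, noun), the random linear projections, the feature dimensions, the triplet loss form, and every name in the code (`LinearMap`, `embed_video`, `embed_text`, `triplet_loss`, `retrieve`) are hypothetical and do not come from the paper. No training is performed; the weights are random and only the data flow is shown.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes and tags, chosen only for illustration.
POS_TAGS = ["verb", "noun"]
VIDEO_DIM, TEXT_DIM, POS_DIM, JOINT_DIM = 32, 16, 8, 8


def l2_normalise(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


class LinearMap:
    """Hypothetical stand-in for a learned projection (random, untrained weights)."""

    def __init__(self, in_dim, out_dim):
        self.w = rng.normal(scale=0.1, size=(in_dim, out_dim))

    def __call__(self, x):
        return x @ self.w


# One pair of projections (video side, text side) per PoS tag.
video_to_pos = {t: LinearMap(VIDEO_DIM, POS_DIM) for t in POS_TAGS}
text_to_pos = {t: LinearMap(TEXT_DIM, POS_DIM) for t in POS_TAGS}

# Projections from the concatenated PoS outputs into the integrated space.
video_to_joint = LinearMap(POS_DIM * len(POS_TAGS), JOINT_DIM)
text_to_joint = LinearMap(POS_DIM * len(POS_TAGS), JOINT_DIM)


def embed_video(video_feats):
    """Embed videos in every PoS space, then in the integrated space."""
    parts = [l2_normalise(video_to_pos[t](video_feats)) for t in POS_TAGS]
    return l2_normalise(video_to_joint(np.concatenate(parts, axis=-1)))


def embed_text(word_feats_by_pos):
    """Embed the caption features of each PoS tag, then integrate them."""
    parts = [l2_normalise(text_to_pos[t](word_feats_by_pos[t])) for t in POS_TAGS]
    return l2_normalise(text_to_joint(np.concatenate(parts, axis=-1)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Generic margin-based triplet loss (an assumed loss form)."""
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, margin + d_pos - d_neg).mean()


def retrieve(query, gallery):
    """Return gallery indices ordered from most to least similar to the query."""
    return np.argsort(-(gallery @ query))
```

In this sketch, a loss applied to the per-tag outputs would play the role of a PoS-aware term, and the same loss applied to the integrated embeddings would play the role of a PoS-agnostic term. Text-to-video retrieval then amounts to calling `retrieve` with a caption embedding as the query and the video embeddings as the gallery; swapping the two gives video-to-text retrieval.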

Cite This Study

Wray et al. (2019) studied this question.

synapsesocial.com/papers/6a153eb4d64fa333899f6b2f https://doi.org/10.48550/arxiv.1908.03477
