October 1, 2019

Joint Syntax Representation Learning and Visual Cue Translation for Video Captioning

Abstract

Video captioning is a challenging task that involves not only visual perception but also syntax representation learning. Recent progress in video captioning has been achieved through visual perception, but syntax representation learning is still under-explored. We propose a novel video captioning approach that takes into account both visual perception and syntax representation learning to generate accurate descriptions of videos. Specifically, we use sentence templates composed of Part-of-Speech (POS) tags to represent the syntax structure of captions, and accordingly, syntax representation learning is performed by directly inferring POS tags from videos. The visual perception is implemented by a mixture model which translates visual cues into lexical words that are conditional on the learned syntactic structure of sentences. Thus, a video captioning task consists of two sub-tasks: video POS tagging and visual cue translation, which are jointly modeled and trained in an end-to-end fashion. Evaluations on three public benchmark datasets demonstrate that our proposed method achieves substantially better performance than the state-of-the-art methods, which validates the superiority of joint modeling of syntax representation learning and visual perception for video captioning.
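The two sub-tasks described in the abstract can be illustrated with a toy sketch. It is not the authors' model: the abstract gives no architectural details, so this assumes the simplest possible mixture, in which the word distribution at each caption position is the POS-weighted sum p(w) = sum over t of p(t | video) * p(w | t, visual cues). The function names, the tag set, and all probabilities below are hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical types for this sketch (not taken from the paper):
# PosProbs maps a POS tag to p(tag | video) at one caption position.
# WordProbsByPos maps a POS tag to p(word | tag, visual cues).
PosProbs = Dict[str, float]
WordProbsByPos = Dict[str, Dict[str, float]]


def mix_word_distribution(pos_probs: PosProbs,
                          word_probs_by_pos: WordProbsByPos) -> Dict[str, float]:
    """Marginalize over POS tags: p(w) = sum_t p(t) * p(w | t)."""
    mixed: Dict[str, float] = {}
    for tag, p_tag in pos_probs.items():
        for word, p_word in word_probs_by_pos.get(tag, {}).items():
            mixed[word] = mixed.get(word, 0.0) + p_tag * p_word
    return mixed


def caption_from_steps(steps: List[Tuple[PosProbs, WordProbsByPos]]) -> str:
    """Greedily pick the most probable word at each caption position."""
    words = []
    for pos_probs, word_probs_by_pos in steps:
        mixed = mix_word_distribution(pos_probs, word_probs_by_pos)
        words.append(max(mixed, key=mixed.get))
    return " ".join(words)


# Made-up numbers standing in for the outputs of a video POS tagger
# and per-tag word predictors; a real system would learn both jointly.
steps = [
    ({"DET": 0.9, "NOUN": 0.1},
     {"DET": {"a": 0.8, "the": 0.2}, "NOUN": {"dog": 1.0}}),
    ({"NOUN": 0.8, "VERB": 0.2},
     {"NOUN": {"dog": 0.7, "ball": 0.3}, "VERB": {"runs": 1.0}}),
    ({"VERB": 0.9, "NOUN": 0.1},
     {"VERB": {"runs": 0.6, "jumps": 0.4}, "NOUN": {"dog": 1.0}}),
]

caption = caption_from_steps(steps)
print(caption)
```

In this toy setup the predicted POS tags act as a soft sentence template: a position that is most likely a determiner favors determiner words even if a noun is visually salient. Joint end-to-end training, as the abstract describes it, would replace the fixed numbers with learned predictions.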

Cite This Study

Hou et al. (2019) studied this question.

synapsesocial.com/papers/6a13c3573f9a9dbf1d39dba1 https://doi.org/10.1109/iccv.2019.00901
