Video captioning aims to generate a grammatical and accurate sentence that describes a video. Recent methods have mainly tackled this problem by combining multiple modalities, yet they have neglected the differences between modalities and the importance of narrowing the gap between video and text. This paper proposes a multi-task video-captioning method with a Stepwise Multimodal Encoder. The encoder can flexibly digest multiple modalities by assigning an appropriate encoding depth to each modality. We also exploit both video-to-text (V2T) and text-to-video (T2V) flows by adding an auxiliary task of video–text semantic matching. Our method achieves state-of-the-art performance on two widely used datasets, MSVD and MSR-VTT: (1) on MSVD, it improves CIDEr by 18%; (2) on MSR-VTT, it improves CIDEr by 6%.
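One way to read "assigning an appropriate encoding depth to each modality" is that each modality joins a shared encoder stack at a different layer, so modalities that enter earlier pass through more layers. The toy sketch below illustrates only that reading. The modality names, the `entry_step` schedule, and the helper functions are hypothetical and are not taken from the paper, whose actual architecture the abstract does not specify.

```python
from typing import Dict, List, Tuple

# (modality name, number of encoder layers applied so far)
Token = Tuple[str, int]


def toy_layer(tokens: List[Token]) -> List[Token]:
    """Stand-in for one encoder layer: it only records that it was applied."""
    return [(name, depth + 1) for name, depth in tokens]


def stepwise_encode(entry_step: Dict[str, int], num_layers: int) -> Dict[str, int]:
    """Return the encoding depth each modality receives.

    entry_step[m] is the (assumed) 0-based layer index at which
    modality m joins the shared token stream.
    """
    stream: List[Token] = []
    for step in range(num_layers):
        # Inject the modalities scheduled for this step before the layer runs.
        stream += [(m, 0) for m, s in entry_step.items() if s == step]
        stream = toy_layer(stream)
    return dict(stream)


# Hypothetical schedule: earlier entry means deeper encoding.
depths = stepwise_encode({"motion": 0, "appearance": 1, "audio": 2}, num_layers=3)
print(depths)
```

In a real model, `toy_layer` would be replaced by an attention or feed-forward block, and the schedule would presumably be chosen per modality rather than fixed by hand.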
Liu et al. studied this question.