Video captioning is the task of describing video content in natural-language sentences. Although recent models have improved significantly on evaluation metrics, some issues remain unresolved: model-generated captions often contain factual errors and omit important details, whereas human-written captions describe video content accurately and comprehensively. In this work, we propose a novel method that uses question answering (QA) techniques to enhance video captioning models. We first generate QA pairs from both videos and human-written captions. We then propose a QA-enhanced captioning model to better leverage the QA information. Finally, we employ reinforcement learning to train the model to maximize a QA reward. By incorporating these QA-related techniques, our model generates more accurate and comprehensive video captions. We conduct experiments on three datasets: ActivityNet Captions, YouCookII, and MSR-VTT. The experimental results, ablation studies, and human evaluations demonstrate the advantages of our method.
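To make the idea of a QA reward concrete, the following is a minimal sketch and not the authors' implementation. It assumes the reward is the fraction of QA pairs that a QA model answers correctly when given only the generated caption, and it assumes a self-critical baseline (sampled reward minus greedy reward) for the reinforcement learning signal; the abstract specifies neither. The names `qa_reward`, `advantage`, and `make_toy_answerer` are hypothetical, and the toy answerer is a keyword lookup standing in for a real QA model.

```python
from typing import Callable, Dict, List, Tuple

QAPair = Tuple[str, str]  # (question, reference answer)
AnswerFn = Callable[[str, str], str]  # (question, caption) -> predicted answer


def make_toy_answerer(candidates: Dict[str, List[str]]) -> AnswerFn:
    """Hypothetical stand-in for a QA model.

    For each question it returns the first candidate answer that occurs
    as a word in the caption, or an empty string if none occurs.
    """
    def answer(question: str, caption: str) -> str:
        words = caption.lower().split()
        for cand in candidates.get(question, []):
            if cand.lower() in words:
                return cand
        return ""
    return answer


def qa_reward(caption: str, qa_pairs: List[QAPair], answer_fn: AnswerFn) -> float:
    """Assumed reward: share of QA pairs answered correctly from the caption."""
    if not qa_pairs:
        return 0.0
    correct = sum(
        1
        for question, reference in qa_pairs
        if answer_fn(question, caption).strip().lower() == reference.strip().lower()
    )
    return correct / len(qa_pairs)


def advantage(sampled_reward: float, baseline_reward: float) -> float:
    """Assumed self-critical baseline: reward of a sampled caption minus
    the reward of the greedy caption. This scalar would weight the
    log-probability of the sampled caption in a policy-gradient loss."""
    return sampled_reward - baseline_reward


# Illustrative data only; not taken from the paper or its datasets.
qa_pairs = [
    ("What is the man holding?", "guitar"),
    ("Where is he?", "stage"),
]
answerer = make_toy_answerer({
    "What is the man holding?": ["guitar", "drum"],
    "Where is he?": ["stage", "street"],
})

good = qa_reward("a man plays a guitar on a stage", qa_pairs, answerer)
bad = qa_reward("a man plays a drum", qa_pairs, answerer)
```

Under these assumptions, a caption that supports the correct answer to every question scores higher than one that contains a factual error or omits a detail, which is the behavior a QA reward is meant to encourage.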
Liu et al. studied this question.
www.synapsesocial.com/papers/68e67b96b6db643587605053 — DOI: https://doi.org/10.1145/3652583.3658061
Hui Li Liu
Xiaojun Wan
Peking University