June 14, 2023 · Open Access

Fusion of Multi-Modal Features to Enhance Dense Video Caption

Abstract

Dense video captioning is a task that aims to help computers analyze the content of a video by generating abstract captions for a sequence of video frames. However, most existing methods use only the visual features of a video and ignore the audio features, which are also essential for understanding it. In this paper, we propose a fusion model built on the Transformer framework that integrates both visual and audio features for captioning. We use multi-head attention to handle the differences in sequence length between the models involved in our approach. We also introduce a Common Pool that stores the generated features and aligns them with the time steps, so that information can be filtered and redundancy eliminated based on confidence scores. Moreover, we use an LSTM as the decoder to generate the description sentences, which reduces the memory size of the entire network. Experiments show that our method is competitive on the ActivityNet Captions dataset.
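
To illustrate how attention can reconcile visual and audio streams of different lengths, the following is a minimal sketch and not the authors' implementation. It assumes the visual features act as queries and the audio features as keys and values, and it omits the learned projection matrices that a full Transformer layer would include. The function names (`attend`, `multi_head_cross_attention`) and the toy feature values are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention: returns one output row per query row."""
    scale = math.sqrt(len(queries[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def multi_head_cross_attention(visual, audio, num_heads):
    """Split the feature dimension into heads, attend per head, concatenate.

    Assumption: no learned projections, so each head simply sees a slice
    of the raw feature vector.
    """
    dim = len(visual[0])
    assert dim == len(audio[0]) and dim % num_heads == 0
    head_dim = dim // num_heads
    fused = [[] for _ in visual]
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        q = [row[lo:hi] for row in visual]
        kv = [row[lo:hi] for row in audio]
        for t, row in enumerate(attend(q, kv, kv)):
            fused[t].extend(row)
    return fused


# Toy inputs: 3 visual time steps and 5 audio time steps, 4 features each.
visual = [[0.1 * (t + d) for d in range(4)] for t in range(3)]
audio = [[0.05 * (t * d + 1) for d in range(4)] for t in range(5)]

fused = multi_head_cross_attention(visual, audio, num_heads=2)
print(len(fused), len(fused[0]))  # one fused vector per visual time step
```

The output has as many rows as there are visual time steps, regardless of how many audio steps were supplied, which is the property that lets attention bridge sequences of unequal length.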

Authors

Xuefei Huang

Ka‐Hou Chan

Weifan Wu

Journals

SHILAP Revista de lepidopterología

Sensors

Institutions

Beihang University

Macao Polytechnic University
