August 30, 2024

Learning to enhance areal video captioning with visual question answering

Abstract

The utilization of Unmanned Aerial Vehicles (UAVs) in remote sensing (RS) has witnessed a significant surge, offering valuable insights into Earth dynamics and human activities. However, this has led to a substantial increase in the volume of video data, rendering manual screening and analysis impractical. Consequently, there is a pressing need for the development of automated interpretation models for these aerial videos. In this paper, we propose a novel approach that leverages visual dialogue to enhance aerial video captioning. Our model adopts an encoder-decoder architecture, integrating a Visual Question Answering (VQA) task before the captioning task. The VQA task aims to enrich the captioning process by soliciting additional information about the image content. Specifically, our video encoder utilizes ViT-L/16, while the decoder employs Generative Pre-trained Transformer-2 (Distill-GPT-2). To validate our model, we introduce a novel benchmark dataset named CapERA-VQA, comprising videos accompanied by sets of questions, answers, and captions. Through experimental validation, we demonstrate the effectiveness of our proposed approach in enhancing the automated captioning of aerial videos.
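The abstract does not say how the VQA output is passed to the captioning step. As a minimal sketch of one plausible arrangement, the question-answer pairs could be serialized into a text prefix that a language-model decoder continues with a caption. The `QAPair` type, the `build_caption_prompt` function, the prompt layout, and the sample questions are all hypothetical illustrations, not the authors' method or data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class QAPair:
    """One question about the video and the answer a VQA step produced (hypothetical type)."""
    question: str
    answer: str


def build_caption_prompt(qa_pairs: List[QAPair], caption_prefix: str = "Caption:") -> str:
    """Serialize QA pairs into a text prefix that a caption decoder could continue.

    Assumption: the decoder is an autoregressive language model that accepts a
    plain-text prefix. The layout below is illustrative only.
    """
    lines = [f"Question: {qa.question} Answer: {qa.answer}" for qa in qa_pairs]
    lines.append(caption_prefix)  # the decoder would generate the caption after this marker
    return "\n".join(lines)


# Made-up QA pairs for illustration; not taken from CapERA-VQA.
example_prompt = build_caption_prompt([
    QAPair("What event is shown?", "a flood"),
    QAPair("Are people visible?", "yes"),
])
print(example_prompt)
```

In a full pipeline this prefix would be combined with visual features from the video encoder. That part is omitted here because the abstract gives no details beyond the model names.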

Cite This Study

Mehmadi et al. (2024) studied this question.

synapsesocial.com/papers/68e5a4ccb6db64358753ee25 https://doi.org/10.1080/01431161.2024.2388875
