February 24, 2024

Policy Learning-Based Image Captioning With Vision Transformer

Abstract

Despite remarkable progress in automated image caption generation, it remains challenging to create captions that accurately reflect factual information while capturing the nuances of human language. The goal is to produce captions that accurately describe both the visual content and the complexity and indirectness of human emotion. The existing approach fuses a CNN's capacity to comprehend the visual elements within an image with an RNN's ability to produce sequential language structures tailored to various visual contexts. This paper combines Vision Transformers (ViTs) for image understanding, pre-trained language models for language fluency and nuance, and fact-checking mechanisms to ensure factual accuracy. Attention mechanisms and diversity checks improve the overall quality of the generated captions. Reinforcement learning is used to fine-tune the model's performance iteratively.
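
The abstract does not describe the reinforcement learning procedure in detail. As a rough illustration of what policy-gradient fine-tuning of a captioner can look like, the sketch below applies REINFORCE with a moving-average baseline to a toy policy. It is not the authors' method: there is no ViT encoder or language model, and the vocabulary, reference caption, reward function, and hyperparameters are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import math
import random

# Hypothetical toy setup: a real system would condition on ViT image features.
VOCAB = ["a", "dog", "cat", "runs", "sleeps"]
REFERENCE = ["a", "dog", "runs"]  # assumed ground-truth caption


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_caption(logits, rng):
    """Sample one token index per position from the current policy."""
    return [rng.choices(range(len(VOCAB)), weights=softmax(row))[0]
            for row in logits]


def reward(token_ids):
    """Toy reward: fraction of positions that match the reference caption."""
    hits = sum(VOCAB[i] == ref for i, ref in zip(token_ids, REFERENCE))
    return hits / len(REFERENCE)


def expected_reward(logits):
    """Exact expected reward of the policy (mean probability of each reference token)."""
    probs = [softmax(row)[VOCAB.index(ref)] for row, ref in zip(logits, REFERENCE)]
    return sum(probs) / len(probs)


def train(steps=2000, lr=0.5, seed=0):
    """REINFORCE with a moving-average baseline on per-position logits."""
    rng = random.Random(seed)
    logits = [[0.0] * len(VOCAB) for _ in REFERENCE]
    baseline = 0.0
    for _ in range(steps):
        caption = sample_caption(logits, rng)
        r = reward(caption)
        advantage = r - baseline
        for row, chosen in zip(logits, caption):
            probs = softmax(row)
            for k in range(len(VOCAB)):
                # Gradient of log pi(chosen) with respect to logit k.
                grad = (1.0 if k == chosen else 0.0) - probs[k]
                row[k] += lr * advantage * grad
        baseline = 0.9 * baseline + 0.1 * r
    return logits


if __name__ == "__main__":
    trained = train()
    print("expected reward after training:", expected_reward(trained))
```

In a captioning system the toy reward would typically be replaced by a caption-quality metric or a factuality score, and the per-position logits by the outputs of the language decoder.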

Cite This Study

Bathula et al. (2024) studied this question.

synapsesocial.com/papers/68e77c8eb6db6435876f0c26 https://doi.org/10.1109/sceecs61402.2024.10481859