Recent technological advancements have made video pervasive in our daily lives, particularly on social media, where it has become the dominant medium of communication. As a result, the ability to analyze and understand videos has become essential for managing this massive volume of content. Within this context, video captioning is one of the most effective ways to achieve this understanding. Recent advances in Natural Language Generation (NLG), combined with breakthroughs in Computer Vision, have facilitated the development of robust Video Captioning (VC) models. However, despite these advances, selecting the right model for a given application remains challenging because evaluation methods are limited, and significant limitations still hinder the widespread adoption of these models. This survey aims to address these issues by providing a comprehensive overview of the VC model landscape, along with a comparative experimental study carried out on two benchmark datasets and involving eight representative, high-quality models.
Brimont et al. studied this question.