March 3, 2026

A Survey on Video Captioning in the Era of Large Language Models

Key Points

Video captioning remains a crucial technique for understanding and managing video content effectively, especially on social media.
Recent advancements in natural language generation and computer vision have enabled the development of robust video captioning models.
The survey includes a comparative analysis of eight high-quality models tested on two benchmark datasets, providing clear insights into their performance.
Significant limitations persist in evaluation methods, suggesting a need for improvement to ensure widespread adoption of video captioning models.

Abstract

Recent technological advancements have led to the widespread presence of video in our daily lives, particularly through social media, where it has become the dominant medium of communication. As a result, the ability to analyze and understand videos has become essential for managing such a massive volume of content. Within this context, video captioning is one of the most effective ways to achieve this understanding. Recent advances in Natural Language Generation (NLG), combined with breakthroughs in Computer Vision, have facilitated the development of robust Video Captioning (VC) models. However, despite these advancements, selecting the right model for the right application remains challenging due to limited evaluation methods, and significant limitations still hinder the widespread adoption of these models. This survey aims to address these issues by providing a comprehensive overview of the VC model landscape, along with a comparative experimental study carried out on two benchmark datasets and involving eight representative, high-quality models.
