What question did this study set out to answer?

To develop a novel methodology for multilingual video captioning using multimodal fusion of visual and audio cues.

February 19, 2026

Ensemble-based multilingual video captioning with multimodel fusion of visual and audio cues

Key Points

To develop a novel methodology for multilingual video captioning using multimodal fusion of visual and audio cues.
Utilized ensemble-based multilingual video captioning
Combined scene knowledge with BLIP and action recognition with CLIP
Applied Google Speech Transcription for audio attributes
Employed multimodal learning techniques to enhance caption quality
Incorporated a ranking system using CLIP scores for relevant captions.
Achieved superior quality and expressiveness of generated captions compared to traditional methods
Facilitated accessibility through multilingual support
Demonstrated enhanced contextuality in caption generation.

Abstract

Video captioning is a very essential and vital activity of the field of artificial intelligence with a view to connect the visual image to the natural language processing. It actively contributes to accessibility, indexing of the media, and automatic video summarization. Most of these approaches employed in the past have focused on object identification and action recognition with little attention paid to the meaning of these relationships and temporal hints. The methodology of video captioning in this paper is much more advanced than the previous methods. Therefore, it to a great extent is contingent upon multimodal learning methods that enhance precision and contextuality in generating captions. The majority of the past methods have extensively focused on object and action recognition and have omitted entirely the dynamism of video sequences, and in this regard, our novel technology will combine scene knowledge with BLIP, action recognition with CLIP, and through Google Speech Transcription as an audio-related attribute. Also, our solution expands video captioning to facilitate the use of multiple languages, which will make it accessible to a broader audience. The captions are automatically obtained in various languages, and the audio output is offered to obtain a more in-depth interpretation. These modalities are therefore incorporated in our approach and a ranking system of CLIP scores is added to produce highly relevant and semantically enriched captions. We show that the model is substantially superior to traditional methods in terms of the quality, expressiveness and applicability of the generated captions. The significance of the research is the development of the automated video interpretation.

Bookmark

Cite This Study

Palivela et al. (Sun,) studied this question.

synapsesocial.com/papers/6996712d80e1323b05ec0543 https://doi.org/https://doi.org/10.1007/s11042-026-21235-4

Bookmark