What type of study is this?

This is a quantitative study.

September 19, 2025 | Open Access

Residual and Bottleneck CNN Architectures With LSTM for Improved Video Caption Generation

Key Points

The proposed framework balances feature extraction with sequential modeling to enhance video caption generation.
Evaluations on benchmark datasets resulted in a peak BLEU score of 51, indicating significant performance improvements.
Using a CNN-based encoder with residual and bottleneck blocks mitigates gradient degradation during feature extraction.
The architecture not only improves accuracy but also reduces computational complexity, making it suitable for large-scale tasks.
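
To make the residual and bottleneck idea in the key points concrete, here is a minimal sketch in plain Python. It is not the authors' encoder: the study does not specify layer sizes or weights here, so the names `matvec`, `relu`, and `bottleneck_residual` and all values are hypothetical, and dense projections stand in for the convolutions (and any normalization) a real CNN block would use. The point it shows is that the skip connection keeps an identity path open even when the learned branch contributes nothing.

```python
def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def relu(vector):
    """Element-wise rectified linear unit."""
    return [max(0.0, v) for v in vector]


def bottleneck_residual(x, w_down, w_up):
    """Toy bottleneck residual block: y = x + W_up * relu(W_down * x).

    w_down projects x to a smaller width (the bottleneck) and w_up
    projects back to the original width before the skip connection adds x.
    """
    reduced = relu(matvec(w_down, x))   # width d -> d / r
    restored = matvec(w_up, reduced)    # width d / r -> d
    return [xi + fi for xi, fi in zip(x, restored)]


# Hypothetical 4-dimensional feature vector with a 2-dimensional bottleneck.
x = [1.0, -2.0, 0.5, 3.0]

# With an all-zero learned branch, the block reduces to the identity map.
zero_down = [[0.0] * 4 for _ in range(2)]
zero_up = [[0.0] * 2 for _ in range(4)]
identity_out = bottleneck_residual(x, zero_down, zero_up)

# With non-zero weights, the learned branch adds a correction on top of x.
w_down = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
w_up = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0]]
corrected_out = bottleneck_residual(x, w_down, w_up)
```

Because the input is added back unchanged, gradients can flow through the skip path regardless of the learned branch, which is the usual argument for why residual blocks ease the training of deep encoders.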

Abstract

Video captioning has become a pivotal research domain at the interface of computer vision and natural language processing, with applications in multimedia retrieval, assistive systems, and human–computer interaction. Despite substantial progress, many existing approaches, including Vid2Seq, Positive-Augmented Contrastive Learning, GL-RG, and TextKG, continue to encounter limitations in jointly modeling fine-grained spatial details and long-term temporal dependencies. These challenges hinder the generation of captions that are both semantically accurate and contextually coherent. This study proposes a novel video captioning framework that leverages a convolutional neural network (CNN)-based encoder integrated with residual and bottleneck blocks to capture rich temporal–spatial features while mitigating gradient degradation. The encoder's design ensures efficient feature propagation and robust representation of video content. To model sequential dependencies and maintain contextual consistency, the extracted features are processed by a recurrent decoder based on long short-term memory (LSTM) networks. This hybrid architecture balances feature extraction with sequential modeling, thereby addressing critical shortcomings of prior methods. Extensive evaluations were conducted on three benchmark datasets: MSR-VTT, MPII Cooking 2, and M-VAD. The proposed framework achieved a peak BLEU score of 51. Beyond accuracy improvements, the architecture demonstrated reduced computational complexity, confirming its suitability for large-scale video captioning tasks. In conclusion, the integration of CNN-based residual encoding with LSTM-based recurrent decoding offers a streamlined yet powerful solution for video captioning. The proposed model advances the field by achieving a balance between efficiency and accuracy, contributing a significant step toward high-quality, contextually rich video descriptions in vision–language research.
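
For readers unfamiliar with the reported metric, the following is a minimal sketch of how a BLEU score can be computed for one generated caption. It rests on several assumptions: the abstract does not state which BLEU variant, tokenization, or smoothing was used, so this version uses up to 4-grams, a single reference, no smoothing, and a 0 to 100 scale. The function names and example captions are hypothetical, and published results are normally computed at corpus level with an established toolkit.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU on a 0-100 scale (single reference, no smoothing)."""
    log_precisions = []
    for n in range(1, max_n + 1):
        cand = ngrams(candidate, n)
        ref = ngrams(reference, n)
        total = sum(cand.values())
        # Clip each candidate n-gram count by its count in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or matched == 0:
            return 0.0  # without smoothing, any empty precision zeroes the score
        log_precisions.append(math.log(matched / total))
    # Brevity penalty discourages captions shorter than the reference.
    c, r = len(candidate), len(reference)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return 100.0 * brevity_penalty * math.exp(sum(log_precisions) / max_n)


# Hypothetical captions, tokenized by whitespace.
reference = "a man is cooking in the kitchen".split()
exact = bleu(reference, reference)                       # identical caption
partial = bleu("a man is cooking food".split(), reference)
unrelated = bleu("two dogs run outside".split(), reference)
```

An identical caption scores 100, a caption with no shared words scores 0, and a partially matching caption falls in between, penalized both for missing n-grams and for being shorter than the reference.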
