What question did this study set out to answer?

The aim is to enhance image captioning through a threefold learning approach that integrates multiple AI techniques.

February 2, 2026 | Open Access

Hybrid Vision-and-Language Fusion: A Threefold Learning Approach for Elevating Image Captioning through Adaptive Strategies

Key Points

The aim is to enhance image captioning through a threefold learning approach that integrates multiple AI techniques.
Developed a multimodal pipeline using three deep learning models.
Used computer vision, natural language processing, and classification to analyze images.
Implemented clustering of objects before classification to improve performance.
Achieved a CIDEr score of 37.93% on the MS-COCO Captioning task test baseline.
Demonstrated improved syntactical saliency with integrated advanced object features.
Showed that clustering enhances final model performance.

Abstract

Image captioning is a significant application area for artificial intelligence techniques. When a machine can interpret an image as humans do, it indicates a higher level of intelligence and comprehension of the image. This research presents advances in real-time image collection and labeling systems using a triad of computer vision, natural language processing, and classification. The approach employs three deep learning models to generate human-level natural language descriptors, resulting in a user-friendly system. The model comprises a multimodal pipeline of deep learning architectures, enabling the extraction of probabilistic features for each object category. Our model surpasses other image captioning models, achieving a CIDEr score of 37.93% on the common MS-COCO Captioning task test baseline, and exhibits superior syntactical saliency when integrated with advanced object features. Additionally, we observed that incorporating an intermediate step of clustering objects before classification enhances the final model's performance. By implementing these methodologies, we have developed a more capable and accurate model, proficient in classifying objects and generating informative image descriptions. Such capabilities can significantly augment human comprehension and decision-making across various applications, particularly in advancing sustainable cities and communities, fostering quality education through improved accessibility of visual content, and promoting industry, innovation, and infrastructure with cutting-edge AI technologies.
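The abstract does not say how the clustering step feeds the classifier, so the following is only an illustrative sketch of one way a "cluster, then classify" stage could look, not the authors' pipeline. It assumes objects are already represented as small numeric feature vectors, uses plain k-means for clustering, and appends a one-hot cluster id to each vector before a 1-nearest-neighbour classifier. All names (`kmeans`, `assign`, `augment`, `classify`) and the toy data are hypothetical.

```python
from math import dist


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means. Assumption: the first k points seed the
    centroids, which keeps this sketch deterministic."""
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iters):
        # Assignment step: group every point with its nearest centroid.
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: dist(p, centroids[c]))
            groups[nearest].append(p)
        # Update step: move each centroid to the mean of its group
        # (an empty group keeps its previous centroid).
        updated = []
        for c, group in enumerate(groups):
            if group:
                updated.append(tuple(sum(col) / len(group) for col in zip(*group)))
            else:
                updated.append(centroids[c])
        if updated == centroids:
            break
        centroids = updated
    return centroids


def assign(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)), key=lambda c: dist(point, centroids[c]))


def augment(point, centroids):
    """Append a one-hot cluster id to the original feature vector."""
    cluster = assign(point, centroids)
    one_hot = [1.0 if i == cluster else 0.0 for i in range(len(centroids))]
    return list(point) + one_hot


def classify(point, train_points, train_labels, centroids):
    """1-nearest-neighbour classification on the cluster-augmented features."""
    query = augment(point, centroids)
    best = min(
        range(len(train_points)),
        key=lambda i: dist(query, augment(train_points[i], centroids)),
    )
    return train_labels[best]


# Toy object features and labels (made up for illustration only).
train_points = [(0.0, 0.0), (5.0, 5.0), (0.2, 0.1), (5.1, 4.9)]
train_labels = ["cat", "car", "cat", "car"]
centroids = kmeans(train_points, k=2)
predicted = classify((0.1, 0.2), train_points, train_labels, centroids)
```

In this sketch the cluster id acts as an extra coarse feature, so objects that fall in different clusters are pushed further apart before classification. Whether the paper's intermediate clustering step works this way is not stated in the abstract.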

Cite This Study

Bhandari et al. studied this question.

synapsesocial.com/papers/6980fed9c1c9540dea811570 https://doi.org/10.5109/7402620
