What question did this study set out to answer?

The study set out to classify and analyze deep learning methods for image captioning and to assess their effectiveness and open challenges.

February 26, 2026 (Open Access)

A comprehensive survey on deep learning approaches for image captioning: a systematic review

Key Points

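Systematically reviewed image captioning research published between 2018 and 2025.
Introduced a new taxonomy that classifies approaches into nine major categories, including attention-based models, transformer-based architectures, reinforcement learning, and Vision-Language Pretraining (VLP).
Reviewed 174 peer-reviewed studies to draw comparative insights across model architectures, encoding strategies, and learning paradigms.
Examined state-of-the-art models in terms of caption quality, generalization, and multimodal alignment.
Surveyed major benchmark datasets such as MS COCO, Flickr30K, and Conceptual Captions, along with evaluation metrics such as BLEU, CIDEr, METEOR, ROUGE, SPICE, CHAIR, CLIPScore, and BERTScore.
Identified visual-semantic hallucination as a persistent challenge that undermines factual alignment between image content and generated captions.
Provided a detailed comparative analysis of recent models such as PaLI, OSCAR, BLIP-2, and OFA.
Highlighted the strengths, limitations, and real-world applicability of the different architectures.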

Abstract

This paper presents a comprehensive literature survey on image captioning, covering research published between 2018 and 2025. It introduces a novel taxonomy to classify existing approaches into nine major categories, including attention-based models, transformer-based architectures, reinforcement learning, and Vision-Language Pretraining (VLP). A total of 174 peer-reviewed studies are systematically reviewed, with comparative insights drawn across different model architectures, encoding strategies, and learning paradigms. The survey also explores major benchmark datasets such as MS COCO, Flickr30K, and Conceptual Captions, along with evaluation metrics like BLEU, CIDEr, METEOR, ROUGE, SPICE, CHAIR, CLIPScore, and BERTScore. In contrast to prior surveys, this work offers a detailed comparative analysis of state-of-the-art captioning models, highlighting their strengths, limitations, and real-world applicability. Recent models such as PaLI, OSCAR, BLIP-2, and OFA are critically examined in the context of caption quality, generalization, and multimodal alignment. A key research challenge identified across methods is the persistent problem of visual-semantic hallucination, which undermines factual alignment between image content and generated captions. This survey serves as a valuable resource for both newcomers and advanced researchers by offering a structured synthesis of recent developments, challenges, and future directions in the field of image captioning.
