This paper presents a comprehensive literature survey of image captioning research published between 2018 and 2025. It introduces a novel taxonomy that classifies existing approaches into nine major categories, including attention-based models, transformer-based architectures, reinforcement learning, and Vision-Language Pretraining (VLP). A total of 174 peer-reviewed studies are systematically reviewed, with comparative insights drawn across model architectures, encoding strategies, and learning paradigms. The survey also covers major benchmark datasets such as MS COCO, Flickr30K, and Conceptual Captions, along with evaluation metrics such as BLEU, CIDEr, METEOR, ROUGE, SPICE, CHAIR, CLIPScore, and BERTScore. In contrast to prior surveys, this work offers a detailed comparative analysis of state-of-the-art captioning models, highlighting their strengths, limitations, and real-world applicability. Recent models such as PaLI, OSCAR, BLIP-2, and OFA are critically examined with respect to caption quality, generalization, and multimodal alignment. A key research challenge identified across methods is the persistent problem of visual-semantic hallucination, which undermines factual alignment between image content and generated captions. By offering a structured synthesis of recent developments, challenges, and future directions in image captioning, this survey serves as a resource for both newcomers and advanced researchers.
Al-Malla et al. (Mon,) studied this question.