Image captioning with deep learning bridges computer vision and natural language processing, enabling machines to generate human-like textual descriptions of images. While significant progress has been made for English, Arabic image captioning remains under-explored because of the language’s morphological complexity, its right-to-left script, and the scarcity of annotated datasets. This paper addresses that gap by adapting the BLIP-2 (Bootstrapping Language-Image Pre-training) model for Arabic caption generation, using machine-translated datasets such as Flickr30k to overcome resource limitations. The adapted model combines a vision transformer (ViT) for image encoding with a CamelBERT large language model (LLM) for text generation, connected by a lightweight Querying Transformer (Q-Former) for cross-modal alignment. Despite challenges such as translation artifacts and linguistic nuances, our experiments show promising results in generating coherent Arabic captions. The study highlights the potential of BLIP-2 for multilingual applications while underscoring the need for native Arabic datasets and further optimization. This work contributes to more inclusive artificial intelligence technologies for Arabic-speaking communities, with applications in assistive tools, education, and content creation.
Abdelaal et al. (Thu,) studied this question.