What question did this study set out to answer?

This research aims to enhance image captioning for Arabic using the BLIP-2 model, addressing language-specific challenges.

March 29, 2026 | Open Access

Image Captioning Through Deep Learning: An Adaptation of the BLIP-2 Model to Arabic

Key Points

This research aims to enhance image captioning for Arabic using the BLIP-2 model, addressing language-specific challenges.
Adapting the BLIP-2 model for the Arabic language.
Using machine-translated datasets such as Flickr 30k.
Incorporating a vision transformer for image encoding and CamelBERT for text generation.
Using a Querying Transformer to improve cross-modal alignment.
Demonstrated the potential of the BLIP-2 model in generating coherent Arabic captions.
Overcame challenges such as translation artifacts and linguistic complexities.
Showed promising results despite limitations in Arabic datasets.

Abstract

Image captioning using deep learning bridges computer vision and natural language processing, enabling machines to generate human-like textual descriptions of images. While significant progress has been made for English, Arabic image captioning remains under-explored because of the language’s morphological complexity, right-to-left script, and scarcity of annotated datasets. This paper addresses this gap by adapting the BLIP-2 (Bootstrapping Language-Image Pre-training) model for Arabic caption generation, leveraging machine-translated datasets, such as Flickr 30k, to overcome resource limitations. The adapted BLIP-2 combines a vision transformer (ViT) for image encoding with a CamelBERT large language model (LLM) for text generation, connected by a lightweight Querying Transformer (Q-Former) for cross-modal alignment. Despite challenges such as translation artifacts and linguistic nuances, our experiments demonstrate promising results in generating coherent Arabic captions. This study highlights the potential of BLIP-2 for multilingual applications while underscoring the need for native Arabic datasets and further optimization. Ultimately, this work contributes to advancing inclusive artificial intelligence technologies for Arabic-speaking communities, with applications in assistive tools, education, and content creation.
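
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea behind a querying transformer: a small, fixed set of query vectors attends over the image encoder's patch features and returns a fixed-length summary that a text model can consume. It is not the authors' code. The function name cross_attend, the sizes, and the random inputs are all hypothetical, and a real Q-Former additionally uses learned projections, self-attention, and several stacked layers.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, patch_features):
    """Single-head scaled dot-product cross-attention (no learned projections).

    queries:        num_queries x dim, stand-ins for learned query vectors
    patch_features: num_patches x dim, stand-ins for ViT patch embeddings
    Returns num_queries x dim: one pooled visual vector per query.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        # Similarity of this query to every image patch.
        scores = [scale * sum(qi * pi for qi, pi in zip(q, patch))
                  for patch in patch_features]
        weights = softmax(scores)
        # Weighted average of the patch features.
        pooled = [sum(w * patch[d] for w, patch in zip(weights, patch_features))
                  for d in range(dim)]
        outputs.append(pooled)
    return outputs


# Illustrative sizes only (assumptions, not values from the paper).
NUM_PATCHES, NUM_QUERIES, DIM = 16, 4, 8
rng = random.Random(0)
patch_features = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PATCHES)]
queries = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]

visual_prefix = cross_attend(queries, patch_features)
print(len(visual_prefix), len(visual_prefix[0]))  # 4 8
```

Whatever the number of image patches, the output always has one vector per query, which is what lets a fixed-size visual summary be passed to the text generation component.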
