Arabic Optical Character Recognition (OCR) remains a difficult task because of the language's cursive morphology, frequent diacritic variation, and the range of calligraphic styles found across printed, handwritten, and historical documents. Recent breakthroughs in Vision-Language Models (VLMs) have brought remarkable progress in multilingual OCR; however, their systematic adaptation to Arabic remains underexplored. This paper presents a parameter-efficient fine-tuning of the Qwen2.5-VL model using LoRA with 4-bit quantization, adapted to a mixed-domain dataset that consists of both modern print and historical manuscripts. The proposed method permits efficient training with relatively modest computational resources. Tests on the KITAB-Bench benchmark show considerable gains, with a Character Error Rate reduction of 29% on modern print and 17% accuracy on historical documents, outperforming standard OCR engines. These findings demonstrate the capability of VLM-based approaches for robust Arabic OCR and the need for resource-efficient adaptation strategies for practical deployment.
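For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a Character Error Rate is commonly computed: the Levenshtein edit distance between the reference and the OCR output, divided by the reference length. The function names `edit_distance` and `cer`, and the choice of NFC normalization, are assumptions made for illustration; the abstract does not describe the paper's actual evaluation code or its text normalization.

```python
import unicodedata


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in code points."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning ref[:i] into an empty string costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete r from the reference
                curr[j - 1] + 1,     # insert h into the reference
                prev[j - 1] + cost,  # substitute r with h (or match)
            ))
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    # Assumption: NFC normalization, so equivalent Unicode encodings compare
    # equal. Arabic diacritics remain separate code points and are scored.
    ref = unicodedata.normalize("NFC", reference)
    hyp = unicodedata.normalize("NFC", hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

In this sketch, a four-character reference whose OCR output drops one character yields a CER of 0.25. Whether diacritics are stripped before scoring changes the result considerably for Arabic, so that choice should be stated alongside any reported figure.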
Elkousy et al. (Thu,) studied this question.