This paper evaluates the performance of Large Multimodal Models (LMMs) on Optical Character Recognition (OCR) for the low-resource Pashto language. Pashto OCR is challenging because of its cursive Perso-Arabic script and the scarcity of large-scale annotated datasets. To address these challenges, we introduce PsOCR, a large-scale synthetic Pashto OCR dataset containing one million images annotated at the word, line, and document levels. PsOCR covers 1,000 font families and varies font size, color, image resolution, and layout. A benchmark subset of 10,000 images is used to evaluate several state-of-the-art LMMs, including Llama, Florence, Qwen-3B/7B, GPT-4o, Gemini, Claude, and Grok, in a zero-shot setting. Experimental results show that Gemini achieves the best overall performance, while Qwen-7B performs best among the open-source models. This work provides insight into the capabilities and limitations of current LMMs for Pashto OCR and establishes a foundation for future research on languages with similar scripts.
Haq et al. (Mon,) studied this question.