What does this research mean for the field?

Large Multimodal Models (LMMs) can effectively perform Optical Character Recognition (OCR) for the low-resource Pashto language, with Gemini achieving the best overall performance. Novelty: novel finding. Consensus alignment: establishes a new direction.

February 25, 2026 · Open Access

PsOCR: Benchmarking large multimodal models for optical character recognition in low-resource Pashto language

Key points

This research aims to evaluate large multimodal models for optical character recognition in the low-resource Pashto language.
Introduced PsOCR, a synthetic OCR dataset of one million images annotated at the word, line, and document levels.
Evaluated performance on a benchmark subset of 10,000 images.
Assessed multiple state-of-the-art LMMs including Gemini and Qwen-7B under zero-shot settings.
Gemini achieved the best overall performance in the evaluation.
Qwen-7B stood out among the open-source LMMs.
The study provides insights into the capabilities and limitations of current LMMs for Pashto OCR.
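PsOCR is described as synthetic, with variation across font families, font sizes, colors, image resolutions, and layouts, and with annotations at the word, line, and document levels. The summary does not describe the generation pipeline, so the following is only a minimal sketch, assuming a seeded random sampler of rendering settings. Every value range, the brightness-contrast check, and the names `RenderConfig` and `sample_config` are hypothetical, and layout variation is not modeled.

```python
import random
from dataclasses import dataclass

# Hypothetical ranges for illustration only; PsOCR's actual values are not given here.
FONT_SIZES_PT = range(12, 49)                 # 12 to 48 pt
DPI_CHOICES = (72, 150, 300)                  # image resolutions
GRANULARITIES = ("word", "line", "document")  # annotation levels named in the abstract

RGB = tuple[int, int, int]


@dataclass(frozen=True)
class RenderConfig:
    font_family: str
    font_size_pt: int
    text_rgb: RGB
    background_rgb: RGB
    dpi: int
    granularity: str


def luminance(rgb: RGB) -> float:
    """Approximate brightness on a 0-255 scale using ITU-R BT.601 luma weights."""
    r, g, b = rgb
    return 0.299 * r + 0.587 * g + 0.114 * b


def random_rgb(rng: random.Random) -> RGB:
    return (rng.randrange(256), rng.randrange(256), rng.randrange(256))


def sample_config(
    rng: random.Random,
    font_families: list[str],
    min_contrast: float = 100.0,
    max_tries: int = 1000,
) -> RenderConfig:
    """Draw one rendering configuration.

    Colors are resampled until text and background differ enough in
    brightness for the rendered text to stay legible (a hypothetical rule).
    """
    if not font_families:
        raise ValueError("need at least one font family")
    for _ in range(max_tries):
        text, background = random_rgb(rng), random_rgb(rng)
        if abs(luminance(text) - luminance(background)) >= min_contrast:
            break
    else:
        raise RuntimeError("no legible color pair found; lower min_contrast")
    return RenderConfig(
        font_family=rng.choice(font_families),
        font_size_pt=rng.choice(FONT_SIZES_PT),
        text_rgb=text,
        background_rgb=background,
        dpi=rng.choice(DPI_CHOICES),
        granularity=rng.choice(GRANULARITIES),
    )
```

Seeding `random.Random` makes a generated split reproducible: the same seed and font list yield the same sequence of configurations. A renderer (not shown) would then rasterize Pashto text with each configuration and record the ground truth at the chosen granularity.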

Abstract

This paper evaluates the performance of Large Multimodal Models (LMMs) on Optical Character Recognition (OCR) for the low-resource Pashto language. Pashto OCR is challenging due to its cursive Perso-Arabic script and the scarcity of large-scale annotated datasets. To address these challenges, we introduce PsOCR, a large-scale synthetic Pashto OCR dataset containing one million images annotated at the word, line, and document levels. PsOCR includes extensive variability across 1000 font families, font sizes, colors, image resolutions, and layouts. A benchmark subset of 10,000 images is used to evaluate several state-of-the-art LMMs, including Llama, Florence, Qwen-3B/7B, GPT-4o, Gemini, Claude, and Grok, under zero-shot settings. Experimental results demonstrate that Gemini achieves the best overall performance, while Qwen-7B stands out among open-source models. This work provides valuable insights into the capabilities and limitations of current LMMs for Pashto OCR and establishes a foundation for future research in languages with similar scripts.
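The summary does not say which metrics the benchmark reports. Character error rate (CER) and word error rate (WER), both derived from edit distance, are common OCR metrics. The following is a minimal sketch that assumes those metrics and applies NFC Unicode normalization before scoring (a choice made here, not taken from the paper); the function names are illustrative.

```python
import unicodedata
from typing import Sequence


def levenshtein(ref: Sequence, hyp: Sequence) -> int:
    """Edit distance: minimum insertions, deletions, and substitutions turning ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances from an empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # delete r
                curr[j - 1] + 1,         # insert h
                prev[j - 1] + (r != h),  # substitute, or match at no cost
            )
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate over NFC-normalized code points."""
    ref = unicodedata.normalize("NFC", ref)
    hyp = unicodedata.normalize("NFC", hyp)
    if not ref:
        raise ValueError("reference text must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


def wer(ref: str, hyp: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    ref_words = unicodedata.normalize("NFC", ref).split()
    hyp_words = unicodedata.normalize("NFC", hyp).split()
    if not ref_words:
        raise ValueError("reference text must contain at least one word")
    return levenshtein(ref_words, hyp_words) / len(ref_words)


def corpus_cer(pairs: Sequence[tuple[str, str]]) -> float:
    """Pool edits and reference lengths across a benchmark split (micro-average)."""
    edits = total = 0
    for ref, hyp in pairs:
        ref = unicodedata.normalize("NFC", ref)
        hyp = unicodedata.normalize("NFC", hyp)
        edits += levenshtein(ref, hyp)
        total += len(ref)
    if total == 0:
        raise ValueError("references must contain at least one character")
    return edits / total
```

For example, `cer("\u067e\u069a\u062a\u0648", "\u067e\u0634\u062a\u0648")` (پښتو read back as پشتو, one substituted letter out of four) returns `0.25`. Normalizing first matters because the same Perso-Arabic text can be encoded as different code-point sequences, such as a precomposed letter versus a base letter plus a combining mark, which would otherwise count as spurious errors.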
