Purpose: This work provides an overview of the current capabilities of Multimodal Large Language Models (MLLMs) for Handwritten Text Recognition (HTR), assessing their potential compared with traditional task-specific, supervised models.

Design/methodology/approach: A set of openly available benchmarks is used to compare different LLMs with strong task-specific supervised baselines on the HTR task.

Findings: The results show that LLMs currently perform strongly on English texts, yet perform more weakly on languages other than English and do not possess a significant capability for self-correction. Moreover, their comparison with Transkribus’s models shows that proprietary LLMs are the best performing, in particular on modern handwriting, while for historical documents the relative performance of LLMs and Transkribus is not consistent.

Originality/value: The authors are not aware of a similar study relying on open benchmarks.
Giorgia Crosilla
Lukas Klic
Giovanni Colavizza
Journal of Documentation
University of Copenhagen
University of Bologna
Digital Science (United States)
Crosilla et al. studied this question.
www.synapsesocial.com/papers/68d463e931b076d99fa634fc — DOI: https://doi.org/10.1108/jd-03-2025-0082