Scene Text Spotting (STS) aims to transcribe text embedded in natural images, typically encompassing Scene Text Detection (STD) and Scene Text Recognition (STR). Advances in image understanding have made end-to-end text spotting increasingly viable. Concurrently, multimodal research has highlighted the potential of vision-language reasoning tasks, such as Visual Question Answering (VQA). To leverage multimodal reasoning for STR, we propose a training-time question-guided STR framework that integrates VQA, termed Question-Guided Scene Text Recognition (QG-STR). The framework unifies STR, Visual Question Generation (VQG), and VQA within a single architecture, enabling multimodal reasoning to enhance text-spotting performance. Specifically, visual understanding and logical reasoning are used as supervisory signals during training to improve text recognition accuracy and boost end-to-end text spotting. QG-STR is model-agnostic and compatible with diverse STR and VQA architectures, employing question guidance solely as a training-time supervision mechanism. During inference, the STR module functions independently without requiring external questions. Extensive experiments on Total-Text, ICDAR2015, ICDAR2013, and CTW1500 validate the effectiveness of QG-STR.
Xu et al. studied this question.
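The abstract's central design point, that question guidance acts only as training-time supervision while the STR module runs alone at inference, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not state how the STR, VQG, and VQA objectives are combined, so the weighted sum, the weight names `w_vqg` and `w_vqa`, and the `Spotter` class are hypothetical and should not be read as the authors' implementation.

```python
from typing import Callable, Optional


def total_training_loss(
    l_str: float,
    l_vqg: float,
    l_vqa: float,
    w_vqg: float = 1.0,  # hypothetical weight; not given in the abstract
    w_vqa: float = 1.0,  # hypothetical weight; not given in the abstract
) -> float:
    """Assumed combination: recognition loss plus weighted auxiliary losses."""
    return l_str + w_vqg * l_vqg + w_vqa * l_vqa


class Spotter:
    """Toy wrapper: auxiliary question heads exist only for training."""

    def __init__(
        self,
        recognizer: Callable[[str], str],
        question_heads: Optional[Callable[[str, str], str]] = None,
    ) -> None:
        self.recognizer = recognizer
        self.question_heads = question_heads  # unused at inference

    def predict(self, image: str) -> str:
        # Inference path: only the STR module is called, no question needed.
        return self.recognizer(image)


# Usage: a stand-in recognizer that "reads" a fake image identifier.
spotter = Spotter(recognizer=lambda image: image.upper())
prediction = spotter.predict("exit")
loss = total_training_loss(1.0, 2.0, 4.0, w_vqg=0.5, w_vqa=0.25)
```

Setting both weights to zero reduces the objective to the plain recognition loss, which is one way to picture the claim that the question branches add supervision without changing what runs at inference.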