Large Multimodal Models (LMMs) have attracted extensive research because of their sophisticated reasoning capabilities. For text-rich image understanding, however, their potential is not yet fully exploited, and existing methods struggle to process high-resolution images effectively. To address this, we introduce TextCoT, a training-free Chain-of-Thought framework that improves text-rich image understanding by combining LMM-generated captions, which supply global context, with detailed analysis of local text regions. TextCoT comprises three stages: Global Context Generation, Macro-scale Positioning, and Fine-grained Visual Inspection. Together, these stages provide the comprehensive understanding and precise information extraction needed for accurate question answering. The method requires no additional training and can be used in a plug-and-play manner. We demonstrate TextCoT's effectiveness and adaptability across various benchmarks. The source code is available at https://github.com/bzluan/TextCoT.
Luan et al. studied this question.
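The three stages named in the abstract can be pictured as three successive calls to the same LMM. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `LMM` callable interface, the prompt wording, the `left,top,right,bottom` reply format, and the 25% crop margin are all hypothetical choices made for illustration.

```python
from typing import Callable, Optional, Tuple

# (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]

# Hypothetical LMM interface (an assumption, not a real library API):
# takes a prompt and an optional crop box (None = full image), returns text.
LMM = Callable[[str, Optional[Box]], str]


def parse_box(reply: str) -> Box:
    """Parse 'left,top,right,bottom' from the model reply (assumed format)."""
    parts = [int(p) for p in reply.strip().strip("[]()").split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 coordinates, got {len(parts)}")
    return (parts[0], parts[1], parts[2], parts[3])


def expand_and_clamp(box: Box, size: Tuple[int, int], margin: float = 0.25) -> Box:
    """Grow the box by `margin` of its width/height per side, clamped to the image."""
    left, top, right, bottom = box
    width, height = size
    dx = int((right - left) * margin)
    dy = int((bottom - top) * margin)
    return (max(0, left - dx), max(0, top - dy),
            min(width, right + dx), min(height, bottom + dy))


def text_cot(lmm: LMM, size: Tuple[int, int], question: str,
             margin: float = 0.25) -> str:
    """Run the three stages in order; prompts are illustrative placeholders."""
    # Stage 1, Global Context Generation: caption the full image.
    caption = lmm("Describe this image in detail.", None)

    # Stage 2, Macro-scale Positioning: ask where the answer is located.
    reply = lmm(
        f"Context: {caption}\nQuestion: {question}\n"
        "Give the region containing the answer as left,top,right,bottom.",
        None,
    )
    region = expand_and_clamp(parse_box(reply), size, margin)

    # Stage 3, Fine-grained Visual Inspection: answer from the cropped region,
    # keeping the global caption as context.
    return lmm(
        f"Context: {caption}\nQuestion: {question}\n"
        "Answer using the text visible in this region.",
        region,
    )
```

Passing the crop box to the model callable, rather than cropping inside `text_cot`, keeps the sketch free of any imaging dependency. With Pillow, the same `(left, upper, right, lower)` tuple can be handed to `Image.crop` inside the callable.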