This study addresses two limitations of current Optical Character Recognition (OCR) systems: weak support for minority languages and a lack of integrated intelligent retrieval. We propose an integrated system that combines an advanced end-to-end OCR model with a novel hybrid search approach. First, we developed the MultiLang-OCR-30K dataset, which contains 30,000 annotated samples of handwritten Chinese, Tibetan, and Uyghur text. Second, we extended the GOT model using a freeze-encoder, fine-tune-decoder strategy to enhance its multilingual capabilities. Finally, we designed a character-level hybrid retrieval framework that combines the efficiency of TF-IDF with the semantic strength of Sentence-BERT. Experimental results show that our extended GOT model achieves sentence accuracies of 82.3%, 76.5%, and 78.1% for handwritten Chinese, Tibetan, and Uyghur, respectively. The hybrid search improves the F1 score by 28.7% over TF-IDF alone while maintaining an average response time of 23 ms. The system provides a practical solution for multilingual document digitization and management, helping to bridge the technological gap for minority languages.
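To make the character-level hybrid retrieval idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that the lexical and semantic scores are fused by a weighted sum, which the abstract does not specify. The names `hybrid_search`, `alpha`, and `bigram_jaccard` are hypothetical. In particular, `bigram_jaccard` is only a dependency-free stand-in for a Sentence-BERT cosine similarity so that the example runs without downloading a model.

```python
import math
from collections import Counter


def char_counts(text):
    """Count characters, ignoring whitespace (character-level tokens)."""
    return Counter(ch for ch in text if not ch.isspace())


def char_tfidf_vectors(docs):
    """Build one sparse character-level TF-IDF vector (dict) per document."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(char_counts(doc)))
    # Smoothed IDF so that no weight is zero or undefined.
    idf = {ch: math.log((1 + n) / (1 + c)) + 1.0 for ch, c in df.items()}
    vecs = [{ch: tf * idf[ch] for ch, tf in char_counts(doc).items()}
            for doc in docs]
    return vecs, idf


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(k, 0.0) for k, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def bigram_jaccard(a, b):
    """ASSUMPTION: placeholder for a Sentence-BERT similarity score."""
    sa = {a[i:i + 2] for i in range(len(a) - 1)}
    sb = {b[i:i + 2] for i in range(len(b) - 1)}
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def hybrid_search(query, docs, semantic_score, alpha=0.5, top_k=3):
    """Rank docs by alpha * lexical + (1 - alpha) * semantic.

    ASSUMPTION: the weighted-sum fusion and the `alpha` weight are
    illustrative choices, not taken from the paper.
    """
    vecs, idf = char_tfidf_vectors(docs)
    # Characters unseen in the corpus get zero weight in the query vector.
    q_vec = {ch: tf * idf.get(ch, 0.0)
             for ch, tf in char_counts(query).items()}
    scored = []
    for i, doc in enumerate(docs):
        lexical = cosine(q_vec, vecs[i])
        semantic = semantic_score(query, doc)
        scored.append((alpha * lexical + (1 - alpha) * semantic, i))
    # Highest score first; ties broken by document index.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [(i, s) for s, i in scored[:top_k]]


if __name__ == "__main__":
    corpus = ["藏文手写识别", "维吾尔文检索系统", "中文手写文本识别"]
    for idx, score in hybrid_search("手写识别", corpus, bigram_jaccard):
        print(idx, round(score, 3), corpus[idx])
```

In a real deployment the `semantic_score` argument would wrap a sentence-embedding model, and the TF-IDF vectors would be precomputed once rather than rebuilt per query, which matters for the response-time figures reported above.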
Shuo Yang
Zhandong Liu
Ke Li
Applied Sciences
Xinjiang Normal University
Xinjiang Entry-Exit Inspection and Quarantine Bureau
Yang et al. studied this question.
www.synapsesocial.com/papers/698d6edc5be6419ac0d54bea — DOI: https://doi.org/10.3390/app16041771