February 12, 2026 | Open Access

Multilingual Intelligent Retrieval System via Unified End-to-End OCR and Hybrid Search

Key Points

Aimed to improve OCR support for minority languages and to integrate intelligent retrieval functions.
Developed the MultiLang-OCR-30K dataset, with 30,000 annotated samples of handwritten Chinese, Tibetan, and Uyghur text.
Extended the GOT model using a freeze-encoder, fine-tune-decoder strategy.
Designed a character-level hybrid retrieval framework combining TF-IDF and Sentence-BERT.
Achieved sentence accuracies of 82.3%, 76.5%, and 78.1% on handwritten Chinese, Tibetan, and Uyghur, respectively.
Hybrid search improved the F1 score by 28.7% over TF-IDF alone.
Maintained an average response time of 23 milliseconds.
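
For readers unfamiliar with the two reported metrics, the sketch below shows how they are commonly computed. It assumes that sentence accuracy is the share of samples whose recognized text exactly matches the reference, and that F1 is computed over sets of retrieved and relevant document IDs. The summary does not state the paper's exact definitions, and the function names are illustrative only.

```python
def sentence_accuracy(predictions, references):
    """Share of samples whose prediction exactly matches its reference.

    Assumed definition: the summary does not spell out how the paper
    scores a sentence as correct.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    exact = sum(p == r for p, r in zip(predictions, references))
    return exact / len(references)


def f1_score(retrieved, relevant):
    """Harmonic mean of precision and recall over two sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    if hits == 0:
        return 0.0
    precision = hits / len(retrieved)
    recall = hits / len(relevant)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a recognizer that gets three of four sentences exactly right has a sentence accuracy of 0.75, regardless of how close the fourth sentence was.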

Abstract

This study addresses the limitations of current Optical Character Recognition (OCR) systems in supporting minority languages and integrating intelligent retrieval functions. We propose an integrated system that combines an advanced end-to-end OCR model with a novel hybrid search approach. First, we developed the MultiLang-OCR-30K dataset containing 30,000 annotated samples of handwritten Chinese, Tibetan, and Uyghur texts. Second, we extended the GOT model using a freeze encoder–fine-tune decoder strategy to enhance multilingual capabilities. Finally, we designed a character-level hybrid retrieval framework integrating TF-IDF efficiency with Sentence-BERT semantic strength. Experimental results show our extended GOT model achieves sentence accuracies of 82.3%, 76.5%, and 78.1% for handwritten Chinese, Tibetan, and Uyghur, respectively. The hybrid search improves F1 score by 28.7% over TF-IDF alone while maintaining 23 ms average response time. This system provides a practical solution for multilingual document digitization and management, thereby bridging the technological gap for minority languages.
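
The character-level hybrid retrieval idea can be illustrated with a minimal sketch. It assumes a weighted sum of a character-level TF-IDF cosine score and a semantic score, with a hypothetical mixing weight `alpha`; the summary does not say how the paper fuses the two signals. The semantic scores are passed in as plain numbers and are assumed to come from a Sentence-BERT encoder elsewhere. `CharTfidfIndex` and `hybrid_rank` are illustrative names, not the authors' API.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(key, 0.0) for key, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class CharTfidfIndex:
    """Character-level TF-IDF index; each character is treated as a term."""

    def __init__(self, documents):
        self.documents = list(documents)
        n = len(self.documents)
        counts = [Counter(doc) for doc in self.documents]
        doc_freq = Counter(ch for c in counts for ch in c)
        # Smoothed IDF so characters found in every document keep some weight.
        self.idf = {ch: math.log((1 + n) / (1 + d)) + 1.0
                    for ch, d in doc_freq.items()}
        self.vectors = [self._weight(c) for c in counts]

    def _weight(self, counts):
        # Characters never seen at indexing time are ignored.
        return {ch: tf * self.idf[ch] for ch, tf in counts.items()
                if ch in self.idf}

    def scores(self, query):
        """Lexical similarity of the query to every indexed document."""
        q = self._weight(Counter(query))
        return [cosine(q, v) for v in self.vectors]


def hybrid_rank(lexical, semantic, alpha=0.5):
    """Rank document indices by alpha * lexical + (1 - alpha) * semantic.

    The weighted sum and the default alpha are assumptions for illustration.
    """
    if len(lexical) != len(semantic):
        raise ValueError("score lists must have the same length")
    fused = [alpha * l + (1 - alpha) * s for l, s in zip(lexical, semantic)]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)
```

A caller would build `CharTfidfIndex` once over the OCR output, compute `scores(query)` per request, and pass those together with the embedding similarities to `hybrid_rank`. Working at the character level avoids the need for a word segmenter, which is one plausible reason to prefer it for scripts such as Chinese and Tibetan.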
