What question did this study set out to answer?

February 11, 2026 | Open Access

Improving the Time Efficiency of a Script Identification Algorithm Using a Unicode-Based Regular Expression Matching Strategy

Key Points

The study aims to improve the time efficiency of script identification algorithms in multilingual text processing.
The authors developed a Unicode-based regular expression matching strategy.
The algorithm first determines whether a script is present, then obtains the content written in each identified script.
Testing times were compared across 263 languages written in 26 scripts.
The new method reduced testing time 9.35-fold.
The F1 score for script identification was slightly higher than in the authors' earlier methods.

Abstract

Script identification is the first step in most multilingual text-processing systems. This study improves the time efficiency of script identification algorithms with three strategies. First, the algorithm determines whether the text contains any content written in a given script and, only if it does, obtains the content written in that script. Second, it checks whether the total length of the text obtained for the identified scripts equals the length of the original text; if so, the script identification process ends. Third, based on the frequencies of the various scripts on the Internet, the more common scripts are tested first. An improved script identification algorithm was designed around these three strategies. A comparison experiment was conducted using sentence-level text corpora in 263 languages written in 26 scripts. The proposed method reduced testing time 9.35-fold, while its F1 score for script identification was slightly higher than those reported in our earlier studies. The method proposed in this study therefore effectively improves the time efficiency of script identification algorithms.
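
The three strategies described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the script table, its Unicode ranges, its priority order, and the function name `identify_scripts` are all simplified assumptions made for illustration, and the length check here counts only non-whitespace characters, which is also an assumption, since the summary does not say how spaces and punctuation are handled.

```python
import re

# Hypothetical, simplified script table. The ranges cover only a few basic
# Unicode blocks, and the order is an assumed rough frequency ranking; the
# paper's actual 26 scripts, ranges, and ordering are not given here.
SCRIPT_PATTERNS = [
    ("Latin", re.compile(r"[A-Za-z\u00C0-\u024F]+")),
    ("Cyrillic", re.compile(r"[\u0400-\u04FF]+")),
    ("Arabic", re.compile(r"[\u0600-\u06FF]+")),
    ("Han", re.compile(r"[\u4E00-\u9FFF]+")),
]


def identify_scripts(text):
    """Return {script name: list of matched runs} for the scripts found in text."""
    # Assumption: whitespace is ignored when comparing lengths.
    target = sum(1 for ch in text if not ch.isspace())
    found = {}
    covered = 0
    # Strategy 3: more common scripts are tested first (order of the table).
    for name, pattern in SCRIPT_PATTERNS:
        # Strategy 1: a cheap presence check before extracting any content.
        if pattern.search(text) is None:
            continue
        runs = pattern.findall(text)
        found[name] = runs
        covered += sum(len(run) for run in runs)
        # Strategy 2: stop once the extracted content accounts for the whole text.
        if covered == target:
            break
    return found


if __name__ == "__main__":
    print(identify_scripts("hello мир"))
```

With the input `"hello мир"`, the loop in this sketch stops after the Cyrillic pattern, so the Arabic and Han patterns are never evaluated. Text containing punctuation or digits never satisfies the length check in this simplified version and simply falls through to testing every script in the table.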


Cite This Study

Qasim et al. studied this question.

synapsesocial.com/papers/698c1cd3267fb587c655f8ab https://doi.org/10.3390/app16041714
