What question did this study set out to answer?

The aim is to develop a lightweight algorithm for accurately identifying languages within both monolingual and multilingual documents.

April 3, 2026 | Open Access

Efficient language identification in monolingual and multilingual documents

Key Points

Introduced a term-frequency-based algorithm for language detection.
Benchmarked against existing language identification systems, including BERT, CLD3, fastText, and GlotLID.
Evaluated on various datasets, including WiLI-2018 and a custom Multilingual Webpages dataset.
Achieved an F1 score of up to 98%, matching or exceeding existing language detection libraries under comparable evaluation settings.
Demonstrated efficient memory usage while maintaining high accuracy.
Effectively identified dominant languages in multilingual webpages.

Abstract

Language identification (LID) is a critical prerequisite for text processing tasks such as content classification, natural language processing (NLP), machine translation, and large language model (LLM) training. Modern AI systems rely heavily on web data, making accurate detection essential for applying language-specific preprocessing techniques and aligning multilingual datasets. Existing LID systems, however, struggle with multilingual documents, short or noisy text, and resource constraints. This study introduces a lightweight term-frequency-based algorithm for detecting multiple languages in a single document, achieving high accuracy while minimizing memory usage. The algorithm assigns independent relevance scores to each language (not calibrated as proportions), enabling effective identification of both monolingual and multilingual content. The proposed method was benchmarked against state-of-the-art libraries, including BERT, CLD3, fastText, and GlotLID, on WiLI-2018, FLORES+, Tatoeba, OpenSubtitles, and a newly constructed Multilingual Webpages dataset. Results show that the algorithm achieves an F1 score of up to 98%, matching or exceeding existing libraries under comparable evaluation settings while remaining computationally efficient on standard hardware. Its ability to detect dominant languages in multilingual webpages further demonstrates its practical applicability.
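To make the idea of independent, term-frequency-based relevance scores concrete, here is a minimal Python sketch. It is not the paper's algorithm: the tiny `TERM_WEIGHTS` table, the uniform weights, the `score_languages` and `detect_languages` names, and the `threshold` value are all hypothetical choices made for illustration. It only shows how each language can be scored against its own term list without normalizing across languages, so a document can match more than one language at once.

```python
import re
from collections import Counter

# Hypothetical toy term lists with uniform weights (an assumption for this sketch).
# A real system would derive terms and weights from large per-language corpora.
TERM_WEIGHTS = {
    "en": {"the": 1.0, "and": 1.0, "is": 1.0, "of": 1.0},
    "fr": {"le": 1.0, "et": 1.0, "est": 1.0, "les": 1.0},
    "de": {"der": 1.0, "und": 1.0, "ist": 1.0, "das": 1.0},
}


def tokenize(text):
    """Lowercase the text and split it into runs of Unicode letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def score_languages(text):
    """Return one relevance score per language.

    Each score is the weighted share of tokens found in that language's
    term list. Scores are computed independently, so they need not sum to 1.
    """
    tokens = tokenize(text)
    if not tokens:
        return {lang: 0.0 for lang in TERM_WEIGHTS}
    counts = Counter(tokens)
    total = len(tokens)
    return {
        lang: sum(counts[term] * weight for term, weight in weights.items()) / total
        for lang, weights in TERM_WEIGHTS.items()
    }


def detect_languages(text, threshold=0.1):
    """Return every language whose score reaches the threshold, best first."""
    scores = score_languages(text)
    detected = [lang for lang, score in scores.items() if score >= threshold]
    return sorted(detected, key=lambda lang: -scores[lang])


if __name__ == "__main__":
    mixed = "The cat is on the table and le chat est sur la table et les chiens"
    print(score_languages(mixed))
    print(detect_languages(mixed))
```

Because the scores are not forced to sum to one, a mixed English and French passage can clear the threshold for both languages, while a monolingual passage clears it for only one. The threshold trades recall on minority languages against false positives from shared short words, and would need tuning on real data.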

Cite This Study

Maxime Sobrier studied this question.

synapsesocial.com/papers/69cf5ced5a333a821460a7fa https://doi.org/10.64336/001c.160011