Language identification (LID) is a critical prerequisite for text processing tasks such as content classification, natural language processing (NLP), machine translation, and large language model (LLM) training. Modern AI systems rely heavily on web data, so accurate language detection is essential for applying language-specific preprocessing and for aligning multilingual datasets. Existing LID systems, however, struggle with multilingual documents, with short or noisy text, and under resource constraints. This study introduces a lightweight, term-frequency-based algorithm that detects multiple languages in a single document with high accuracy and low memory usage. The algorithm assigns each language an independent relevance score (the scores are not calibrated as proportions), which allows it to identify both monolingual and multilingual content. The proposed method was benchmarked against state-of-the-art libraries, including BERT, CLD3, fastText, and GlotLID, on WiLI-2018, FLORES+, Tatoeba, OpenSubtitles, and a newly constructed Multilingual Webpages dataset. Results show that the algorithm achieves an F1 score of up to 98%, matching or exceeding existing libraries under comparable evaluation settings while remaining computationally efficient on standard hardware. Its ability to detect the dominant languages in multilingual webpages further demonstrates its practical applicability.
Maxime Sobrier studied this question.
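The abstract does not spell out the scoring procedure, so the following is only a minimal sketch of what independent per-language relevance scoring from term matches could look like. It is not the authors' algorithm. The vocabularies in `VOCAB`, the function names, and the `threshold` value are all hypothetical and chosen for illustration. Each language is scored as the fraction of document tokens found in that language's term list, so the scores are computed independently and need not sum to 1.

```python
import re

# Hypothetical toy vocabularies of frequent terms per language.
# A real system would derive much larger term lists from corpora.
VOCAB = {
    "en": {"the", "and", "is", "of", "on"},
    "fr": {"le", "la", "et", "est", "de", "sur"},
    "de": {"der", "die", "das", "und", "ist"},
}


def tokenize(text):
    """Lowercase the text and split it into runs of letters."""
    return re.findall(r"[^\W\d_]+", text.lower())


def score_languages(text, vocab=VOCAB):
    """Return one relevance score per language.

    Each score is the share of tokens that appear in that language's
    term list. Scores are independent of one another, so they are not
    proportions and do not have to add up to 1.
    """
    tokens = tokenize(text)
    if not tokens:
        return {lang: 0.0 for lang in vocab}
    return {
        lang: sum(1 for tok in tokens if tok in terms) / len(tokens)
        for lang, terms in vocab.items()
    }


def detect_languages(text, threshold=0.2, vocab=VOCAB):
    """List languages whose score reaches the (assumed) threshold,
    ordered from highest to lowest score."""
    scores = score_languages(text, vocab)
    hits = [lang for lang, score in scores.items() if score >= threshold]
    return sorted(hits, key=lambda lang: -scores[lang])


if __name__ == "__main__":
    mixed = "the cat is on the table et le chat est sur la table"
    print(score_languages(mixed))
    print(detect_languages(mixed))
```

Because every language is scored on its own, a document can exceed the threshold for several languages at once, which is what lets a single pass report both monolingual and multilingual content. The threshold is a tuning choice assumed here, not a value taken from the study.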