This study introduces and constructs a Chinese-based multilingual collaborative representation dataset covering low-resource languages such as Tibetan, Uyghur, and Mongolian. Ethnic minority languages exhibit complex morphological variation: prefixes, suffixes, and inflections can make a word's surface form deviate significantly from its dictionary form. To address this, the study collects word pairs directly from real-world corpora, including online media, open government resources, and ethnic cultural websites, rather than relying on standardized dictionaries. This approach better captures actual language usage and thereby enhances large models' semantic understanding of ethnic minority languages. After data collection, machine translation generates preliminary bilingual data, which is then refined with sentence alignment tools to ensure high-quality bilingual alignment. Using Chinese as a bridge language, the study further constructs a shared multilingual semantic space, strengthening semantic consistency across languages by incorporating Chinese semantic information such as synonyms, near-synonyms, and taxonomic relations. The semantic triplets in this space support semantic alignment among low-resource languages. The dataset undergoes rigorous quality control and manual verification to ensure high reliability. It consists of four components: (1) a Tibetan-Chinese bilingual semantic alignment triplet dataset (16,097 entries); (2) a Uyghur-Chinese bilingual semantic alignment triplet dataset (45,439 entries); (3) a Mongolian-Chinese bilingual semantic alignment triplet dataset (55,600 entries); and (4) a Chinese multi-relation semantic network triplet dataset (9,809 entries), totaling 127,945 entries and covering diverse semantic relationships.
This dataset offers valuable semantic support for cross-lingual tasks, information retrieval, and multilingual generation in low-resource languages, with broad application potential.
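The bridge-language idea described above can be illustrated with a minimal sketch. It assumes a simple `(head, relation, tail)` layout for each triplet, a hypothetical `translation_of` relation label, and placeholder word identifiers; none of these come from the dataset itself, whose actual schema may differ. Minority-language words are indexed by their Chinese counterpart, then linked either directly (same Chinese word) or indirectly (through a Chinese semantic relation such as synonymy).

```python
from collections import defaultdict
from typing import NamedTuple


class Triplet(NamedTuple):
    head: str
    relation: str
    tail: str


# Placeholder entries (assumed for illustration, not taken from the dataset).
# Keys are ISO 639-1 codes: bo = Tibetan, ug = Uyghur, mn = Mongolian.
bilingual = {
    "bo": [Triplet("bo_word_1", "translation_of", "zh_word_a")],
    "ug": [Triplet("ug_word_1", "translation_of", "zh_word_b")],
    "mn": [Triplet("mn_word_1", "translation_of", "zh_word_a")],
}

# Chinese-internal relations (assumed layout for the fourth component).
chinese_relations = [Triplet("zh_word_a", "synonym", "zh_word_b")]


def pivot_index(bilingual):
    """Map each Chinese word to the (language, word) pairs aligned with it."""
    index = defaultdict(set)
    for lang, triplets in bilingual.items():
        for t in triplets:
            index[t.tail].add((lang, t.head))
    return index


def cross_lingual_links(bilingual, chinese_relations):
    """Link words of different languages through the Chinese pivot."""
    index = pivot_index(bilingual)
    links = set()
    # Direct links: two words aligned with the same Chinese word.
    for entries in index.values():
        for a in entries:
            for b in entries:
                if a[0] < b[0]:  # different languages, each pair once
                    links.add((a, "same_pivot", b))
    # Indirect links: the Chinese words are related (e.g. synonyms).
    for rel in chinese_relations:
        for a in index.get(rel.head, ()):
            for b in index.get(rel.tail, ()):
                if a[0] != b[0]:
                    links.add((a, rel.relation, b))
    return links


links = cross_lingual_links(bilingual, chinese_relations)
for link in sorted(links):
    print(link)
```

In this toy setup the Tibetan and Mongolian words are linked directly because they share a Chinese counterpart, while the Uyghur word is reached only through the Chinese synonym relation, which is the role the Chinese multi-relation network plays in the shared semantic space.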
WAN et al. (Mon,) studied this question.