Learning meaningful representations of scientific documents is essential for information retrieval, knowledge discovery, and recommendation systems. Traditional methods such as TF-IDF rely on lexical matching and fail to capture deeper semantic relationships, while transformer-based approaches typically depend on limited supervision signals. In this work, we propose a Triple-Source automatic supervision framework for learning document embeddings from scientific corpora. The model integrates three types of supervision (title-abstract pairs, same-category document pairs, and document-level semantic relationships) within a unified contrastive learning framework based on a multilingual XLM-RoBERTa encoder. Unlike prior approaches that rely on citation graphs or manual annotations, our method requires neither citations nor annotations and learns representations from lightweight metadata alone. Experiments on a publicly available arXiv dataset of 98,649 documents show improved semantic retrieval performance: the model achieves Recall@1 = 0.6181 for same-category retrieval and outperforms both TF-IDF and single-source transformer baselines. The learned embeddings also cluster scientific domains more cleanly, indicating more structured semantic representations.
Turdalyuly et al. studied this question.
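To make the Recall@1 figure for same-category retrieval concrete, here is a minimal sketch of one common way to compute such a metric. It assumes that a query counts as a hit when its single nearest neighbour by cosine similarity (excluding the query itself) shares its category; the abstract does not spell out the exact protocol, so this definition, the function names `cosine` and `recall_at_1`, and the toy vectors and labels are all illustrative assumptions, not the authors' evaluation code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_1(embeddings, categories):
    """Fraction of documents whose nearest neighbour shares their category.

    Assumed protocol (not confirmed by the source): every document is used
    as a query against all other documents, and the query itself is excluded.
    """
    hits = 0
    for i, query in enumerate(embeddings):
        # Index of the most similar document other than the query itself.
        best_j = max(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: cosine(query, embeddings[j]),
        )
        hits += categories[best_j] == categories[i]
    return hits / len(embeddings)


# Made-up 2-D "embeddings" and category labels, for illustration only.
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [0.8, 0.5]]
toy_categories = ["cs.CL", "cs.CL", "math.ST", "math.ST", "math.ST"]

score = recall_at_1(toy_embeddings, toy_categories)
print(f"Recall@1 = {score:.4f}")
```

In this toy set the last document sits closest to a `cs.CL` neighbour despite being labelled `math.ST`, so it is the one miss out of five queries. With real embeddings, the brute-force loop would normally be replaced by a batched matrix product or an approximate nearest-neighbour index.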