What question did this study set out to answer?

April 25, 2026 | Open Access

Learning Scientific Document Representations via Triple-Source Automatic Supervision Without Annotations or Citations

Key Points

This work aims to develop a method for learning meaningful representations of scientific documents without relying on annotations or citations.
Proposed a Triple-Source automatic supervision framework for document embeddings.
Utilized title-abstract pairs, same-category document pairs, and document-level semantic relationships for training.
Implemented a contrastive learning framework with a multilingual XLM-RoBERTa encoder.
Achieved Recall@1 = 0.6181 for same-category retrieval on a publicly available arXiv dataset of 98,649 documents.
Outperformed both TF-IDF and single-source transformer baselines in semantic retrieval performance.
Demonstrated improved clustering of scientific domains, indicating better semantic representation.
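
To make the Recall@1 figure above concrete, here is a minimal sketch of how same-category Recall@1 could be computed: for each document, find its nearest neighbour by cosine similarity (excluding itself) and check whether that neighbour shares its category. The exact evaluation protocol is not given in this summary, so the nearest-neighbour definition, the function names, and the toy vectors and category labels are all assumptions for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_1(embeddings, categories):
    """Fraction of documents whose nearest neighbour (self excluded) shares their category."""
    hits = 0
    for i, query in enumerate(embeddings):
        # Index of the most similar other document.
        best_j = max(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: cosine(query, embeddings[j]),
        )
        hits += categories[best_j] == categories[i]
    return hits / len(embeddings)


# Hypothetical toy data: 2-dimensional "embeddings" with arXiv-style category labels.
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [0.6, 0.8]]
toy_categories = ["cs.CL", "cs.CL", "math.PR", "math.PR", "cs.CL"]

score = recall_at_1(toy_embeddings, toy_categories)
print(f"Recall@1 = {score:.4f}")
```

In this toy set the last document is labelled cs.CL but sits closest to a math.PR document, so it counts as a miss. A brute-force loop like this is quadratic in the number of documents; at the scale of 98,649 documents an approximate nearest-neighbour index would likely be needed.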

Abstract

Learning meaningful representations of scientific documents is essential for information retrieval, knowledge discovery, and recommendation systems. Traditional methods such as TF-IDF rely on lexical matching and fail to capture deeper semantic relationships, while transformer-based approaches typically depend on limited supervision signals. In this work, we propose a Triple-Source automatic supervision framework for learning document embeddings from scientific corpora. The model integrates three types of supervision (title-abstract pairs, same-category document pairs, and document-level semantic relationships) within a unified contrastive learning framework based on a multilingual XLM-RoBERTa encoder. Unlike prior approaches that rely on citation graphs or manual annotations, our method enables citation-free and annotation-free representation learning using only lightweight metadata. Experiments on a publicly available arXiv dataset consisting of 98,649 documents demonstrate improved semantic retrieval performance, achieving Recall@1 = 0.6181 for same-category retrieval and outperforming both TF-IDF and single-source transformer baselines. The learned embeddings also exhibit improved clustering of scientific domains, indicating more structured semantic representations.
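
The abstract does not state the loss function, so the following is only a sketch of one common way to combine several pair sources in a contrastive setup: an InfoNCE loss with in-batch negatives is computed per source and the three losses are summed. The InfoNCE choice, the temperature of 0.05, the equal weights, and the names `info_nce` and `triple_source_loss` are assumptions, not the authors' implementation. Plain Python lists stand in for the embeddings that an encoder such as XLM-RoBERTa would produce.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss: anchors[i] should match positives[i]; other rows act as negatives."""
    anchors = [normalize(a) for a in anchors]
    positives = [normalize(p) for p in positives]
    total = 0.0
    for i, a in enumerate(anchors):
        logits = [sum(x * y for x, y in zip(a, p)) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(anchors)


def triple_source_loss(batches, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of per-source InfoNCE losses (equal weights are an assumption)."""
    return sum(w * info_nce(a, p) for w, (a, p) in zip(weights, batches))


# Hypothetical toy batch: each anchor is paired with a nearby positive.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]
misaligned = [[0.1, 0.9], [0.9, 0.1]]

aligned_loss = info_nce(anchors, aligned)
misaligned_loss = info_nce(anchors, misaligned)

# One (anchors, positives) batch per source: title-abstract, same-category, document-level.
total_loss = triple_source_loss([(anchors, aligned)] * 3)
print(aligned_loss, misaligned_loss, total_loss)
```

The loss is near zero when each anchor is most similar to its own positive and grows when the pairs are mismatched, which is the signal that pulls related documents together in embedding space.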

Cite This Study

Turdalyuly et al. studied this question.

synapsesocial.com/papers/69ec5b3d88ba6daa22dacc02 https://doi.org/10.3390/computers15050268
